Date: Tue, 24 Oct 1995 09:59:18 -0400
From: John Weise
To: Howard Besser
Subject: MESL Re: search engines

Howard,

I've attempted to answer your class's questions. I imagine this is just a start, and that there are more questions to come. I've included Zoe so she can correct me where I am wrong.

The search engine we are using is the Full Text Lexicographer (FTL), developed by Ken Alexander at SILS.

http://polyphemus.engin.umich.edu:80/ftl/

For us, it was between FTL and Oracle (we didn't explore other options due to time constraints, and because these two were readily accessible). FTL was very easy for the MESL programmer, Zoe Gurevich, to implement. We are not using a database application. Instead, we have a bunch of FTL indices (one for each field) that point to the detailed data pages (one data page for each museum object).

*How is the text formatted?

The detailed data pages (e.g., http://mesl.itd.umich.edu/MESLDIST1/LC/details/12735.183.html) are generated ahead of time and sit on the server, ready to go. These pages are initially generated by a Perl script that rips through a bunch of raw data files (ASCII), extracting the data and formatting it. For every museum object, there is a raw data file. Here are the first 5 fields from one...

Data_Agreement_Number: 1.0
Holding_Institution: National Gallery of Art
Accession_Number: 1964.13.1
Accession_Method: Gift
Credit_Line: NGA 1964.13.1

The raw data files are also used to create the FTL indices. An FTL index is generated for every field that we want indexed.

Almost all HTML documents (throughout the system) are generated automatically by Perl scripts. Sometimes they are generated on the fly by a CGI Perl script.

*Does the text sit in a database?

No. We start with an ASCII text data file for each museum. Zoe has written Perl scripts that go through the data and, after jumping through lots of hoops, split it into the raw data files. From there, details pages and FTL indices are generated.
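
As a rough illustration of the pipeline described above, here is a minimal sketch in Python. It is not the MESL code (those scripts are Perl and are not shown in this message). The function names, the "Field_Name: value" parsing rule, the HTML layout, and the use of a plain dictionary in place of an FTL index are all assumptions made for the example. The sample record is the five fields quoted earlier.

```python
import html
import re
from collections import defaultdict

# Sample record: the five fields quoted in the message.
RAW_RECORD = """\
Data_Agreement_Number: 1.0
Holding_Institution: National Gallery of Art
Accession_Number: 1964.13.1
Accession_Method: Gift
Credit_Line: NGA 1964.13.1
"""


def parse_record(text):
    """Parse 'Field_Name: value' lines into a dict (assumed file layout)."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        record[name.strip()] = value.strip()
    return record


def render_details_page(record):
    """Render one static HTML details page for one museum object."""
    rows = "\n".join(
        f"<dt>{html.escape(name)}</dt><dd>{html.escape(value)}</dd>"
        for name, value in record.items()
    )
    title = html.escape(record["Accession_Number"])
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><dl>\n{rows}\n</dl></body></html>\n"
    )


def build_field_indices(records, fields):
    """Build one word -> accession-number index per indexed field.

    A plain dict stands in for an FTL index here; FTL's real index
    format and interface are not described in the message.
    """
    indices = {field: defaultdict(set) for field in fields}
    for record in records:
        accession = record["Accession_Number"]
        for field in fields:
            for word in re.findall(r"\w+", record.get(field, "").lower()):
                indices[field][word].add(accession)
    return indices


record = parse_record(RAW_RECORD)
page = render_details_page(record)
indices = build_field_indices([record], ["Holding_Institution", "Accession_Method"])
print(sorted(indices["Holding_Institution"]["gallery"]))  # ['1964.13.1']
```

The point of the sketch is only that both outputs (the static details page and the per-field indices) come from the same raw record, with the accession number as the key that ties them together.
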
We're talking about moving the data into a database so that it can be used for other purposes, but not for direct use by this Web site.

*How do you link an HTML form into a Query of a database or structured data?

With CGI scripts and a search engine (FTL). We have prefab indices for every data field that we want to be searchable. Accession numbers play a major role in all of this. All details pages, and associated images and documents, are tied together by accession number (accession numbers are unique within a museum's data).

*How difficult is all of this?

Well, Zoe is a terrific programmer with lots of experience. However, she hadn't done much Perl/CGI scripting before this project. FTL has been very easy to use (for an experienced programmer like Zoe). It has very good (if not exceptional) text searching capabilities.

The biggest trick is to get access to a web server where one can run CGI scripts. On this campus, CGI scripting is out of the question for most students, since the systems (system administrators) don't permit it.

While, at this time, we can't provide a place for students to experiment with the writing of CGI scripts, I do think it is possible for some experimentation to be done with the submission of queries to the MESL system. If one studies up on HTML forms, and then models a new form on the HTML source code for a search form (e.g., http://mesl.itd.umich.edu/MESLDIST1/FOWLER/htdocs/search.html), it seems that it would be possible to experiment with the creation of searching interfaces. I'm not sure if this is what you were getting at. We'll need Zoe's input to really figure out what the options are.

*Who would know about the alternatives to talk about some of this stuff?

For more info about search engines and searching structured text, etc., Ken Alexander is a wonderful resource, and he's right there at SILS. I'm happy to come and say what I know about the MESL system, and I might be able to drag Zoe in there with me, if that would be useful.
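
Returning to the question of linking an HTML form to a query: the sketch below, in Python rather than the Perl used on MESL, shows one way a submitted form could be turned into index lookups. Everything in it is an assumption for illustration: the form field names, the in-memory `INDICES` dictionary (standing in for the prefab FTL indices), the `details/<accession>.html` naming scheme, and the second accession number (1970.2.5), which is invented. It is not how the MESL CGI scripts are actually written.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical prefab indices: field -> word -> accession numbers.
# "1970.2.5" is an invented accession number used only for contrast.
INDICES = {
    "Holding_Institution": {"gallery": {"1964.13.1"}, "national": {"1964.13.1"}},
    "Accession_Method": {"gift": {"1964.13.1"}, "purchase": {"1970.2.5"}},
}


def search(query_string):
    """Return accession numbers matching every field=word pair in the query."""
    matches = None
    for field, words in parse_qs(query_string).items():
        index = INDICES.get(field, {})
        for word in words:
            hits = index.get(word.lower(), set())
            # Intersect so that all submitted fields must match.
            matches = hits if matches is None else matches & hits
    return sorted(matches or [])


def details_url(accession):
    """Assumed naming scheme: one details page per accession number."""
    return f"details/{accession}.html"


# Roughly what a browser sends when a GET form is submitted.
query = urlencode({"Holding_Institution": "Gallery", "Accession_Method": "gift"})
print(query)
print([details_url(a) for a in search(query)])
```

A CGI script would typically read the same string from its `QUERY_STRING` environment variable, run the lookup, and print an HTML results page that links to the matching details pages.
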
If people have so many questions, why have I only received 2 questions from your students? :) People are more than welcome to send questions and comments to me (jweise@umich.edu) or mesl.web@umich.edu (Zoe and me). They can call me too (763-9157). I hope to make more info like this a permanent fixture on the MESL home page, but these things take time.

Our server computer is a Sun Sparc 20 with 128 MB of RAM. We are using a 9 GB local disk for the storage and serving of images and data.

>._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
>Howard Besser
>Visiting Associate Professor
>School of Information & Library Studies
>University of Michigan
>Ann Arbor, MI 48109-1092
>(313)764-3417 (voice)
>(313)764-2475 (fax)
>howardb@umich.edu
>http://www.sils.umich.edu/~howardb/

John Weise....................................................................
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
University of Michigan              Internet e-mail: jweise@umich.edu
Information Technology Division     voice mail: (313) 763-9157
Ann Arbor, Michigan, USA            URL: http://www.umich.edu/~jweise/
..............................................................................
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^