by Vlad Wielbut
"A picture is worth a thousand words." This well-known saying is hard to dispute, especially from within a culture that is becoming increasingly visual, relying more and more on non-textual formats for learning, entertainment, and communication. Paradoxically, computer technology, in itself revolutionary and with an enormous impact on many aspects of our lives, was until recently dominated by textual materials. This was due to the inability of early systems to handle the huge amounts of data needed to encode even relatively small pieces of visual information. Several factors have contributed to the current dramatic shift toward image-enhanced or even image-centered computing: increases in server processing power, the availability of inexpensive digital storage, improvements in digitizing techniques, and faster data flow in networks. Despite all these impressive achievements in digital image technology, however, textual information still remains the primary connection between a user and the individual images within a database: searching for an image still requires typing its accession number, title, artist, subject, or other verbal labels pre-determined by the system. This is about to change.
What is wrong with using text to describe and access images? Haven't we used it for centuries in museums, galleries, and other repositories of graphical materials? Isn't there already enough criticism that our picture-driven media encourage illiteracy and inhibit imagination? Well, there is nothing inherently wrong with text in general, or with textual access to visual material in particular, and, as will become clear later in this essay, titles, keywords, and names will not become extinct in a world populated by image databases. Nonetheless, searching for images (or anything else, for that matter) via textual intermediaries has its flaws; let me name just a few:
- an extensive set of descriptors must be entered along with an image in order to make that image accessible. However, many museums and archives cannot devote enough time to in-depth cataloging of every item they acquire. In fact, very often an object is given only the briefest possible identification (number, title, author) or is simply filed among other related artifacts, without any individual information whatsoever.
- the different types of textual data used to describe an image make maintaining consistency a real headache. For instance, a single acquisition date may be entered in a variety of ways: February 3, 1995 or 02.03.95 or 2/3/95 or 1995.02.03 ...
- the user must be familiar with the data-entry standards used in a particular system, or the system must be able to handle variations. This problem is further aggravated by misspellings on both sides and by nonspecific requests on the user's side.
- certain inherently visual concepts, for instance shapes, are very difficult to replicate in language. One can hardly imagine such a system handling a user request of the type: "Find me a picture of a pear-shaped lake."
- textual descriptors are very often subject to the constraints of a controlled vocabulary, which, again, becomes less reliable in the visual domain: "orange" may mean quite different colors to different people.

In view of these limitations, the advantages of being able to search for images using visual determinants become more obvious.
Now that we know what we need, it is time to think of ways to achieve it. Let's start by enumerating the elements of image content that may be relevant in the design of more complete image-searching tools:
- Color: green, magenta, purple, jade...
- Shape: cross, oval, half-circle...
- Spatial relationships: above, left of, row, column, center...
- Line: thick, straight, dotted, jagged...
- Texture: coarse, grainy, spotted, smooth...
- Motion (for video and animation): upward, left-right, continuous...
- Volume (for 3-D objects): cube, ball, toroid...

This is quite an extensive list, and it would not be fair to omit that only some of these points have reached the level of more or less successful application, while the rest are still being researched and tested, or have not even grown out of the larval state of theoretical models. This does not mean, however, that designers of databases implementing query-by-image-content search engines will have to wait until all the elements become usable: in most cases it will be sufficient to include only some of the image-content searching techniques, as determined by the type of users most frequently accessing a particular database. For example, radiologists looking for early signs of a tumor may be more interested in searching for a particular shape or texture than for color, while an oceanographer will pay close attention to colors indicating variations in depth, temperature, oxidation, and the presence of plankton in water. In addition, different users may prefer different levels of sophistication in their interfaces: less demanding users will be satisfied with a simple color wheel for selecting approximate hues in desired images, while domain experts may demand a wide range of controls enabling them to specify exact levels of saturation, lightness, and so on. It is therefore quite likely that techniques which already exist, or will soon be developed, will find a pool of useful applications satisfying the needs of some users.
Let me now explain how some of the elements of image content are being handled by their respective practical implementations:
Color seems to be one of the easier elements to deal with, and to a certain degree automatic color analysis has already been present in the more sophisticated graphical applications. To make color searchable, all we need is a system that assigns every pixel whose color values fall within a pre-defined range to a specific color label; for example, every pixel whose hue falls within a given band will from then on be known to the computer as "yellow". At the next level of complexity, the system may compute the ratio of pixels of a particular color to the total number of pixels in the image and turn it into a percentage. These percentages may in turn be translated into a set of "natural language" labels, such as some, a lot, mostly, etc., making it possible to search for "mostly green" images or those with "some blue".(*)
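The whole pipeline (range-based color labels, percentages, "natural language" amounts) can be sketched in a few lines. The hue cut-off points and the some / a lot / mostly thresholds below are illustrative assumptions, not values taken from any actual system:

```python
import colorsys
from collections import Counter

# Hypothetical coarse color names keyed by hue upper bounds (degrees)
HUE_NAMES = [(30, "red"), (90, "yellow"), (150, "green"),
             (210, "cyan"), (270, "blue"), (330, "magenta"), (360, "red")]

def color_name(r, g, b):
    """Assign a coarse label to one RGB pixel (channel values 0-255)."""
    h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
    if s < 0.15:                      # nearly grey: label by lightness instead
        return "black" if l < 0.2 else "white" if l > 0.8 else "grey"
    degrees = h * 360
    for upper, name in HUE_NAMES:
        if degrees < upper:
            return name
    return "red"

def amount_labels(pixels):
    """Turn per-color percentages into 'natural language' amounts."""
    counts = Counter(color_name(*p) for p in pixels)
    total = len(pixels)
    labels = {}
    for name, n in counts.items():
        pct = 100 * n / total
        labels[name] = ("mostly" if pct > 50 else
                        "a lot of" if pct > 25 else "some")
    return labels

# A toy 'image': six green pixels and two blue ones
pixels = [(0, 200, 0)] * 6 + [(0, 0, 200)] * 2
print(amount_labels(pixels))   # {'green': 'mostly', 'blue': 'some'}
```

A "mostly green" query then reduces to checking the stored label, with no need to reopen the image.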
Shape identification has only recently migrated from research into practical implementation, but at least three promising approaches have already started competing for dominance:
- Local, where the structural features of a shape are computed from local regions, e.g. points of maximal curvature, boundary segments, etc. (see Fig. 1)
... Fig. 1
- Global, which tries to treat the object as a whole. In practice this may mean covering the shape with a set of triangles and/or rectangles (Fig. 2), or "flood-filling", i.e. starting with a particular pixel (as a rule of thumb usually located in the center of the image, where objects tend to appear) and adding all of its neighbors with RGB values within an accepted range (Fig. 3).
... Fig. 2
- Feature indexing interprets a shape as a set of boundary segments (Fig. 4) and transforms this set into a normalized coordinate system (Fig. 5) to account for differing representations of the same shape from image to image, e.g. scale, rotation, or neighboring shapes. This approach also makes use of the global interpretation, using it as the final step in the matching process.
... Fig. 4
... Fig. 5
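The "flood-filling" variant of the global approach is straightforward to sketch. The greyscale pixels, the 4-connectivity, and the tolerance value below are all simplifying assumptions made for the sake of illustration:

```python
from collections import deque

def flood_fill(img, tolerance=30):
    """Grow a region from the center pixel, adding 4-connected neighbors
    whose grey values stay within `tolerance` of the seed value.
    Returns the set of (row, col) coordinates covered by the region."""
    rows, cols = len(img), len(img[0])
    seed = (rows // 2, cols // 2)       # objects tend to sit near the center
    seed_val = img[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region
                    and abs(img[nr][nc] - seed_val) <= tolerance):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

# A 5x5 greyscale image with a bright blob in the middle
img = [[10,  10,  10,  10, 10],
       [10, 200, 210,  10, 10],
       [10, 205, 220, 215, 10],
       [10,  10, 210,  10, 10],
       [10,  10,  10,  10, 10]]
blob = flood_fill(img)
print(len(blob))   # 6 pixels belong to the central blob
```

The extracted region (its area, bounding box, or outline) is what a shape-matching engine would then compare against the query shape.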
Establishing spatial relationships relies heavily on the textual data that accompanies images and has been researched primarily in the area of face recognition. Using sophisticated syntactic tools and extensive dictionaries, the system extracts words and phrases that indicate whether a particular face is likely to appear in the image, and where. This serves as a preliminary step to using object-recognition techniques to actually locate the face. For example, in fulfilling a request to find the face of Jane Doe, the system is likely to start from the upper right corner of a picture whose caption reads: "Jane Doe (right) and Jennifer Doe (left) at the grave of Alice Doe", and will not attempt to find one in "Jennifer Doe (right) and Alice Doe (left) presented with the Jane Doe Award", let alone "Sunset on Lake Michigan".
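A toy version of the caption-analysis step might look as follows; the single "Name (left/right)" pattern stands in for the sophisticated syntactic tools and dictionaries mentioned above:

```python
import re

def face_hint(caption, person):
    """Guess from a caption whether `person` appears in the photo and,
    if so, on which side. Returns 'left', 'right', or None.
    A toy illustration only; real systems use full syntactic parsing."""
    m = re.search(re.escape(person) + r"\s*\((left|right)\)", caption)
    return m.group(1) if m else None

caption_a = "Jane Doe (right) and Jennifer Doe (left) at the grave of Alice Doe"
caption_b = "Jennifer Doe (right) and Alice Doe (left) presented with the Jane Doe Award"
print(face_hint(caption_a, "Jane Doe"))   # right
print(face_hint(caption_b, "Jane Doe"))   # None
```

Note how the second caption mentions "Jane Doe" yet yields no hint, because the name is not followed by a position marker; this is exactly the distinction the essay's example requires.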
Video, being a sequence of images rather than a single image, represents a more complex challenge. Here the key element of any image interpretation process is segmentation, i.e. splitting the video stream into smaller, more homogeneous chunks (shots). This can be done with an algorithm that establishes boundaries between shots by "looking at" changes in image intensity, the appearance or disappearance of an object, and global versus local (part-of-image, e.g. a moving object) fluctuations. Each shot is then represented by a single key frame, which may be analyzed by the search engine.
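A deliberately crude sketch of such a segmenter, assuming greyscale frames and using only jumps in mean intensity as the boundary signal (real segmenters also compare color histograms and motion between frames):

```python
def mean_intensity(frame):
    """Average grey level of one frame (a 2-D list of pixel values)."""
    return sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))

def shot_boundaries(frames, threshold=50):
    """Return the indices where a new shot is assumed to begin,
    i.e. where mean intensity jumps by more than `threshold`
    between consecutive frames."""
    cuts = []
    for i in range(1, len(frames)):
        if abs(mean_intensity(frames[i]) - mean_intensity(frames[i - 1])) > threshold:
            cuts.append(i)
    return cuts

# Three dark frames followed by three bright ones: one cut, at frame 3
dark = [[20] * 4] * 4
bright = [[200] * 4] * 4
print(shot_boundaries([dark, dark, dark, bright, bright, bright]))  # [3]
```

Each detected segment would then contribute one key frame (say, its middle frame) to the still-image search engine described earlier.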
Of course, nothing is perfect. The query-by-image-content techniques have to deal with a plethora of their own problems and drawbacks before they will be able to show their full potential. The first major problem that comes to mind when dealing with images is the size of image files. Everyone who has ever tried to rotate, scale, copy, or even open a high-resolution image from within a graphical application knows how painfully slow it can be; now imagine analyzing hundreds of these 3-4 MB giants in search of a particular color, texture, or shape...
One possible solution is automatic indexing of the image at input time, whereby a set of numerical data "describing" the elements of the image content is extracted and stored separately from the image. Whenever a query is submitted, another set of data is extracted from the query image and matching is performed quickly and efficiently on sets of data (small to very small files) rather than on images themselves (big to very big files). However, the decision has to be made beforehand on how many features should be indexed, a dilemma known to every cataloger: too many of them will likely slow the system, wiping out part of the advantage of having smaller files; too few will make the system less flexible, less capable of handling more specific queries, and less adaptable to future enhancements. Hopefully, further gains in processing speed of computers will skew the uneasy balance toward the former option.
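The indexing idea can be illustrated with a tiny greyscale histogram standing in for the stored numerical "description"; the four-bin signature and Euclidean matching below are deliberate simplifications:

```python
def signature(img, bins=4):
    """Reduce a greyscale image to a tiny normalized histogram:
    the numeric 'description' stored in place of the image itself."""
    hist = [0] * bins
    pixels = [p for row in img for p in row]
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    return [h / len(pixels) for h in hist]

def distance(sig_a, sig_b):
    """Euclidean distance between two signatures."""
    return sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)) ** 0.5

def best_match(query_img, index):
    """Match against the pre-computed signatures, never the full images."""
    q = signature(query_img)
    return min(index, key=lambda name: distance(q, index[name]))

# The index is built once, at input time; matching touches only small vectors
dark = [[30] * 8] * 8
light = [[220] * 8] * 8
index = {"dark.png": signature(dark), "light.png": signature(light)}
print(best_match([[40] * 8] * 8, index))  # dark.png
```

The cataloger's dilemma shows up here as the `bins` parameter: more bins mean a more discriminating but larger and slower index.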
There remains the problem of browsing or further searching images retrieved by the first query. What good does it do to quickly find 50 matches if these 50 files would have to be opened and looked at in order to determine whether any of them contains what the user was looking for? Again, the solution lies in representation: returning a set of thumbnails rather than originals will allow the user to quickly locate the desired image, which is then retrieved via a permanent link from its thumbnail.
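As a minimal sketch of the thumbnail idea, assuming a greyscale image stored as a list of rows, even nearest-neighbour downsampling is enough to shrink what has to be opened for browsing (real systems average pixel blocks for better quality):

```python
def thumbnail(img, factor=4):
    """Shrink an image by keeping every `factor`-th pixel in each
    dimension (nearest-neighbour downsampling)."""
    return [row[::factor] for row in img[::factor]]

# A 64x64 test image shrinks to 16x16, i.e. 1/16 of the original data
big = [[(r + c) % 256 for c in range(64)] for r in range(64)]
small = thumbnail(big)
print(len(small), len(small[0]))   # 16 16
```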
A flaw specific to query-by-image-content is that it is poorly suited to handling queries other than quantitative ones, such as semantic queries. In other words, we may query the system for pictures with "some black" in them, but it would be unreasonable to expect it to handle a request like "give me a photo of a small black boy, crying". Also, queries by image content alone tend to increase recall (the proportion of relevant images that are found) but inhibit precision (the proportion of retrieved images that are actually relevant). It is therefore quite likely that future attempts at implementing these techniques will put considerable effort into synchronizing the visual and the textual parts of a query (**).
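The two measures are easy to pin down precisely. A small helper, with an invented example query, shows how a content-only search can score perfect recall while precision suffers:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved images that are relevant.
    Recall: fraction of relevant images that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A content-only query finds all 4 relevant images but drags in 6 others
retrieved = set(range(10))     # image ids 0-9 were returned
relevant = {0, 1, 2, 3}        # the user actually wanted these
p, r = precision_recall(retrieved, relevant)
print(p, r)   # 0.4 1.0
```

Adding a textual constraint would shrink the retrieved set, trading a little recall for much better precision, which is exactly the synchronization the paragraph above calls for.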
There are also a number of smaller problems associated with certain types of image content; for instance, object recognition has to contend with obstacles such as non-rigid or overlapping objects, blurry images, and variations in intensity (e.g. shadows and reflections). These will have to be resolved within their respective research areas.
Well, that's all very impressive, but who really needs it beyond the people whose livelihoods depend on it, namely the researchers themselves? It is hard to imagine a visitor to an art museum, whether "virtual" or "real", looking for a "painting with green dots on a purple background" rather than for "something by Miro", "abstractionist paintings", or "19th-century portraits of women". However, art museums and "virtual galleries" are neither the only nor the main sources of digitized images; there are quite a number of areas where image databases already play an important role, and the ability to search for images by their content would significantly enhance the usefulness of these databases. Here's but a handful of examples:
- Law enforcement: fingerprint, face, DNA matching may become significantly faster and more reliable
- Medicine (already mentioned): certain shapes, textures, and even colors may provide critical information in the diagnosis of disease, tumors, abnormalities, etc.
- Military: for example in target recognition for self-guiding missile systems
- Advertisement industry: selecting images with certain dominant colors may be important in designing a coordinated ad campaign.
- Earth and space sciences: aerial and satellite photos have become vital for research in these areas; the sheer number of these images makes enhanced search capabilities almost a sine qua non of their usefulness.
This is obviously a very new and largely uncharted area of development; it is quite likely that soon we will be able to "pick and choose" between competing search engines designed to handle queries by image content. At the moment, however, there appears to be a limited number of developers advanced enough and courageous enough to put their designs to practical tests. The most prominent of them are:
- IBM with its QBIC, which allows users to search a database of images by similarity: the user chooses an image from a random selection, and all the images with a certain degree of similarity in one of three categories (color percentages, color layout, and texture) are retrieved. The search may also be customized by specifying exact percentages of one or more colors within the image being sought.
- Illustra with its DataBlade, a combination of textual and visual query elements, both in 2D and 3D environments.
- Virage Co. with its Virage Engine, also structured around the similarity measure, but with somewhat expanded capabilities: this engine allows image interpretation by color, texture, composition (color layout) and structure (general shape characteristics of an object). Furthermore, all these features may be combined within a single query.
- UC Berkeley with its Cypress, an impressive outgrowth of the Chabot Project, which combines both descriptive (date, location) and image-content attributes into one comprehensive "concept search".
Only a decade ago, the only "picture" one could get on a computer screen was created from letters of the alphabet and punctuation marks arranged in a certain way. Today's desktop computers are capable of displaying and manipulating high-quality photographic images in digitized form, and online image databases are proliferating at an incredible pace (***). In view of this encroaching "picturization" of cyberspace, developing powerful search tools for image databases becomes a real necessity. We are still in fairly new and uncharted territory, and although recent achievements in this area are worth looking at and exploring, they will not constitute the final word on the subject. However, those who are waiting for these new search engines to become fully operational and useful are in a very advantageous position, for they are still able to have their say in the development of the final product; now is the time to conduct comprehensive research into areas such as the types of queries most likely to be posed by different groups of users, the elements of an intelligent and user-friendly interface, and the integration of textual and image-content data within queries. But, given the speed of change in digital technology, the time for doing so is quickly running out.
(*) Of course, there is always the option of providing the user with a color wheel or table, with the query activated by clicking on a particular color.
(**) The Chabot Project of the UC at Berkeley with its "concept query" may serve as an example of this approach.
(***) NASA's Earth Observing System is expected to add to the existing pool about 1 terabyte of image data every day when fully operational.
Artificial Intelligence Review, Special Issue on Integration of Natural-language Processing and Vision, Vol. 8, No. 5-6, 1995
"Automatic Indexing and Content-based Retrieval of Captioned Images" by R. K. Srihari [in: IEEE Computer, September 1995, pp. 49-56]
"Chabot: Retrieval from a Relational Database of Images" by V. E. Ogle and M. Stonebraker [in: IEEE Computer, September 1995, pp. 40-48]
"Content-based Image Retrieval Systems" by V. N. Gudivada and V. V. Raghavan [in: IEEE Computer, September 1995, pp. 18-22]
"Content-based Retrieval for Multimedia"
"CORE: a Content-based Retrieval Engine For Multimedia Information Systems" by J. K. Wu et al. [in: Multimedia systems, Vol. 3 (1995), pp. 25-41]
"Fast Multiresolution Image Querying" by C. E. Jacobs et al. [in: ACM SIGGRAPH 95: Conference Proceedings, pp. 277-286]
"Human and Machine Recognition of Faces: a Survey" by R. Chellappa et al. [in: Proceedings of the IEEE Vol. 83 (1995), no. 5, pp. 705-740]
"IBM's Image Recognition Tech for Databases at Work: QBIC or not QBIC?" by C. Okon [in: Advanced Imaging, May 1995, pp. 63-65]
"Multimedia Information systems" by W. Grosky [in: IEEE Multimedia, Vol. 1 (1994), no. 1, pp. 12 - 24]
"Photobook: Tools for Content-Based Manipulation of Image Databases" by A. Pentland et al. [in: Storage and Retrieval for Image and Video Databases II, Vol. 2, (Proceedings SPIE) 1994, pp. 34 - 47]
"Query by Image and Video Content: the QBIC System" by M. Flickner et al. [in: IEEE Computer, September 1995, pp. 23-32]
"Rich Interaction in the Digital Library" by R. Rao et al. [in: Communications of the ACM, Vol. 38 (1995), no. 4, pp. 29-39]
"Similar-shape Retrieval in Shape Data Management" by R. Mehrotra and J. E. Gary [in: IEEE Computer, September 1995, pp. 57-62]
"Visual Information Management FAQs"
"Visual Information Retrieval Technology: a VIRAGE Perspective" by A. Gupta