The Future of Information Filtering

Paul Canavese
Library and Information Studies 296A
Howard Besser
April 29, 1994

© NOTICE

This is a high-input society. It seems that not a minute may be wasted in consuming commodities and communicating with as many people as possible. But in a Babel of signals, we must listen to a great deal of chatter to hear one bit of information we really want.

Without significance, variety isn't the spice of life. It can be as dull as monotony when it has nothing to say... This is because to recognize a thing as new, one must be able to distinguish it from what is old... we can be surprised repeatedly only by contrast with that which is familiar... not by chaos.

- Orrin Klapp, Overload and Boredom: Essays on the Quality of Life in the Information Society

Never before has so much information been available to the general public. And never before has the flow of newly-created information been greater. There has also never before been a time when so much information has been available in an electronic form. We live in an "Information Age," the extent of which is growing by the minute. "We have for the first time an economy based on a key resource that is not only renewable, but self-generating," notes John Naisbitt in Megatrends. "Running out is not a problem, but drowning in it is."

So, as information production continues to increase, the areas of study geared toward extracting the information we want become increasingly important. As the stream of data has become wider, more and more people are realizing that they simply cannot process it all in time. These people need a plan for processing this information, and many will look to computers to implement these plans through an information filtering system.

Automated information filtering systems are in their infancy, but current trends suggest that demand for them will be high in the near future. For now, experimental work continues, and we can only speculate on how these systems will ultimately change the way people get their information, and how the information itself will be affected.

Information Anxiety

In 1989, Richard Saul Wurman wrote a book titled Information Anxiety. He writes that this ailment "is produced by the ever-widening gap between what we understand and what we think we should understand. It is the black hole between data and knowledge, and it happens when information doesn't tell us what we want or need to know."

The number of conventional information sources is at an all-time high. The number of mainstream periodicals continues to increase, while specialized magazines and newsletters also proliferate. The advent of desktop publishing has made it possible for almost anyone to publish, and the result is a torrent of new information produced every day.

But even greater is the amount of information becoming available in an electronic form. This information comes in an even richer variety of formats. First, there are the electronic equivalents of all the conventional forms: more and more periodicals and journals are "going online" and making themselves available in an electronic form. Second are the "raw" sources, such as newswire feeds, which give us access to information in an even less-digested form. The third, and most distinctive, kind of information is the "grass roots" data. This information, such as that in Usenet newsgroups, can be created by anyone online (and this makes it a very prolific source). The problem only gets worse if we consider the databases of "reference" information, which are growing at an even greater rate.

Information Filtering Today

"Information Filtering" is a field of study concerned with creating a systematic approach to extracting the information that a particular person finds important from a larger stream of information. It shares many similarities with "Information Retrieval," which actively searches out information from an existing database.

The main goal of Wurman's Information Anxiety is to help readers filter their conventional information. He prescribes a "Low-Fat Information Diet," which limits the sources a person looks through. He suggests limiting the input to one of each kind of information source (daily newspaper, news magazine, culture magazine...). Unless a person has a personal reader who will perform filtering for him or her (as some rich executives do), this kind of approach is necessary.

When a significant amount of information is available in electronic form, it becomes possible to use a computer program to do some of the filtering tasks. The ultimate computer filter would read all incoming information and set aside all articles that a particular human reader would want to read. The problem lies both in defining that complex interest and in determining what matches it.

Unfortunately, no significant automatic information filters are in mainstream use. However, computer users can use some very simple ("manual") methods of filtering information. The organization of Usenet newsgroups allows a user to choose particular discussion groups that focus on topics that interest the user. News reader programs also allow a user to mark certain subject titles or article authors so they are "filtered out". The subject title must be an exact match, however, so this is only useful for filtering out "follow-up" articles to an original posting that is uninteresting.
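The "filtered out" marking described above can be illustrated with a minimal sketch. The Article class and the apply_kill_list function are hypothetical names invented for this example and do not correspond to any particular news reader. The sketch also assumes that a follow-up carries the original subject prefixed with "Re: ", which is a common convention but is not stated in the text above.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record for one newsgroup posting.
    subject: str
    author: str
    body: str = ""


def apply_kill_list(articles, killed_subjects, killed_authors):
    """Return the articles whose subject and author are not marked."""
    kept = []
    for article in articles:
        subject = article.subject
        # Assumption: follow-ups repeat the original subject after "Re: ".
        if subject.lower().startswith("re: "):
            subject = subject[4:]
        # Exact match only: a reworded subject slips through the filter.
        if subject in killed_subjects or article.author in killed_authors:
            continue
        kept.append(article)
    return kept


articles = [
    Article("Best text editor?", "alice"),
    Article("Re: Best text editor?", "bob"),
    Article("New filtering paper", "carol"),
    Article("Another topic", "dave"),
]
kept = apply_kill_list(articles, {"Best text editor?"}, {"dave"})
print([a.subject for a in kept])
```

Because the comparison is exact, a posting titled "best text editor?" (lowercase) would not be removed, which is the limitation the paragraph points out.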

Simple information retrieval systems exist, as well. Databases can be searched for matching keywords or combinations of keywords. Some systems even allow slightly more complicated searches ("Search for articles containing the words 'Star Wars' or 'Lucasfilm,' but not the word 'S.D.I.'"). Neither the information filtering systems nor the information retrieval methods in wide use today utilize "intelligent" look-up or extraction.
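A keyword search of this kind might look roughly like the following sketch. The function names matches and search, and the sample documents, are made up for illustration. The sketch uses plain case-insensitive substring matching, which is an assumption; real retrieval systems typically index words rather than scan text.

```python
def matches(text, any_of, none_of=()):
    """True if text contains at least one wanted term and no excluded term."""
    lowered = text.lower()
    has_wanted = any(term.lower() in lowered for term in any_of)
    has_excluded = any(term.lower() in lowered for term in none_of)
    return has_wanted and not has_excluded


def search(database, any_of, none_of=()):
    """Return every document in the database that satisfies the query."""
    return [doc for doc in database if matches(doc, any_of, none_of)]


# Illustrative documents, not real data.
docs = [
    "Lucasfilm announces a new Star Wars re-release.",
    "Congress debates Star Wars (S.D.I.) funding.",
    "Local weather report.",
]
hits = search(docs, any_of=["Star Wars", "Lucasfilm"], none_of=["S.D.I."])
print(hits)
```

The query only tests for the presence of strings; it has no notion of what an article is about, which is why such searches are not "intelligent" in the sense used here.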

Advanced Filtration Concepts

Computer scientists and information technologists have experimented with a number of different methods to best determine a reader's interest in a given piece of information. Most work has been done using a newsgroup-type model, which consists of a stream of separate articles. Experimental filtration programs would select out articles that would interest a particular reader.

The first step in creating a filtration system is determining and representing a reader's interest. What topics or other elements of an article determine a reader's interest? One straightforward approach asks the target reader for a list of keywords that he or she finds interesting. Some systems may also ask for a "rating" that indicates the level of interest associated with each word. These words are then compiled into a "user profile," which will later be compared with articles to determine matches.
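As a rough illustration, a keyword-and-rating profile can be held in a simple mapping from word to rating. The build_profile function and the numeric rating scale below are assumptions made for this sketch, not a description of any particular experimental system.

```python
def build_profile(rated_keywords):
    """Build a user profile mapping each lowercase keyword to a rating."""
    profile = {}
    for keyword, rating in rated_keywords:
        key = keyword.strip().lower()
        # Assumption: if a keyword is listed twice, keep the higher rating.
        profile[key] = max(rating, profile.get(key, rating))
    return profile


# Hypothetical answers from a reader, on an assumed scale of 1 to 5.
answers = [("Filtering", 3), ("usenet", 1), ("filtering ", 5)]
profile = build_profile(answers)
print(profile)
```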

Another method would observe the articles that a user decides to read, analyze their content, and add that information to a cumulative user profile. For more accurate results, a program could ask a reader to indicate how interesting each article is after reading it. A system could construct a user profile from this information by simply adding every word found in interesting articles, and increasing the ranking of words that appear multiple times.
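One way to picture this cumulative approach is the sketch below. The tokenize and update_profile functions are hypothetical, and multiplying each word count by the reader's interest rating is only one possible way of "increasing the ranking"; the text does not prescribe a formula. This version makes no attempt to exclude common words.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def update_profile(profile, article_text, interest=1):
    """Add every word of a read article to the profile.

    Words that appear several times, or that come from articles the
    reader rated highly, accumulate a higher ranking.
    """
    for word, count in Counter(tokenize(article_text)).items():
        profile[word] += count * interest
    return profile


profile = Counter()
update_profile(profile, "Filtering news, filtering mail", interest=2)
update_profile(profile, "news")
print(profile.most_common(3))
```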

Commonly used words can create problems when this kind of approach is taken. We wouldn't want occurrences of conjunctions to determine whether an article is filtered in or out. The simple solution is to construct a list of commonly used words, and make sure they are never put into a user profile. A more advanced method would analyze a word within the context of an article, by taking the ratio of the number of times the word occurs in the article to the total number of words in the article. This is then compared to the average occurrence of that word in any text. If the word occurs more often in our article than it does on average, the program weights it more heavily.
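Both ideas in this paragraph can be sketched together. Everything named here is an assumption for illustration: the tiny STOP_WORDS list, the background_rate table of average word frequencies (the numbers are invented), the default_rate used for words missing from that table, and the choice to express the weight as a simple ratio of the two rates.

```python
import re
from collections import Counter

# Assumed list of commonly used words that never enter a profile.
STOP_WORDS = {"a", "an", "and", "but", "of", "or", "the", "to"}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def word_weights(article_text, background_rate, default_rate=0.001):
    """Weight each word by how much more often it occurs than average."""
    tokens = tokenize(article_text)
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    weights = {}
    for word, count in counts.items():
        # Ratio of occurrences to total words in this article.
        article_rate = count / len(tokens)
        # Average rate of the word in text generally (assumed known).
        expected = background_rate.get(word, default_rate)
        ratio = article_rate / expected
        # Keep only words that are more frequent here than on average.
        if ratio > 1.0:
            weights[word] = ratio
    return weights


# Invented background frequencies, for illustration only.
background = {"filter": 0.05, "read": 0.2, "news": 0.125}
weights = word_weights("the filter and the filter read the news", background)
print(weights)
```

In this toy article "filter" appears at five times its assumed average rate and is kept, while "read" and "news" occur no more often than average and are dropped.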

Once the program has created a reader profile, it can use the profile to screen articles in or out. The program can, again, look directly at the number of keyword matches, calculate ranked matchings, and compare the matchings with the average occurrences of the word in a random text. Programs could also try to recognize synonymous or related words and factor them into the ranking.
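The screening step might be sketched as follows, using ranked keyword matches only. The score_article and filter_articles functions, the division by article length, and the threshold value are all assumptions chosen to keep the example small; synonym recognition is not attempted.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def score_article(article_text, profile):
    """Sum the profile ratings of matching words, per word of article."""
    tokens = tokenize(article_text)
    if not tokens:
        return 0.0
    total = sum(profile.get(token, 0) for token in tokens)
    # Dividing by length keeps long articles from winning on size alone.
    return total / len(tokens)


def filter_articles(articles, profile, threshold):
    """Keep articles scoring at or above the threshold, best first."""
    scored = [(score_article(text, profile), text) for text in articles]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]


# Hypothetical profile and articles.
profile = {"filtering": 4, "usenet": 2}
articles = [
    "usenet filtering tools",
    "cooking with garlic",
    "filtering filtering filtering",
]
selected = filter_articles(articles, profile, threshold=1.0)
print(selected)
```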

Some studies have found that the best results come from applying a number of different methods and combining the results. Particular articles tend to be matched well by particular methods, and using one matching model usually filters out too many close matches. The ultimate goal in the field of artificial intelligence is to emulate the understanding of ideas with a computer, and current research is aspiring to results much greater than those mentioned here.
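The idea of combining several matching methods can be illustrated with a weighted average. This is only one simple way to combine results and is not claimed to be the scheme used in the studies mentioned; the two toy scorers and the combine_scores function are invented for the example.

```python
def keyword_scorer(text):
    """Toy method 1: direct match on a profile keyword."""
    return 1.0 if "filtering" in text.lower() else 0.0


def related_word_scorer(text):
    """Toy method 2: match on words assumed to be related to the keyword."""
    related = {"screening", "selection"}
    words = set(text.lower().split())
    return 1.0 if words & related else 0.0


def combine_scores(text, scorers, weights=None):
    """Return the weighted average of the scores from several methods."""
    if not scorers:
        raise ValueError("at least one scoring method is required")
    if weights is None:
        weights = [1.0] * len(scorers)
    total_weight = sum(weights)
    weighted = sum(w * scorer(text) for w, scorer in zip(weights, scorers))
    return weighted / total_weight


methods = [keyword_scorer, related_word_scorer]
article = "Automated screening of news"
print(combine_scores(article, methods))
print(combine_scores(article, methods, weights=[1.0, 3.0]))
```

Here the keyword method alone would reject the article, while the combined score still gives it partial credit, which is the kind of close match a single model tends to lose.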

Future Impact of Filtering

With information filtration systems in the early stages of development, it is difficult to get a clear picture of how the information we will get from the media of the future will differ from the information we get today. There are certainly a number of issues that should be of strong concern to those developing these systems, however.

It is very easy to imagine problems that are a direct extension of those explained by current media critics. Noam Chomsky and others describe a mainstream media with increasing outlets but decreasing sources of information. Large media companies are buying up smaller media companies and continuing to operate them, with the same editorial viewpoint. Chomsky points out that no matter where you go in the mainstream media, there are particular kinds of news that you will never find. The advances in information technology could help or hurt this situation, depending on who decides which news sources will become part of the "information stream." Hopefully, alternative sources of information will be able to find their way into the electronic forms of media.

A more realistic problem with alternative voices, however, may be more subtle. If a complete information filtration system is built to include daily news, it will have to not only choose stories of interest, but also select the "best" account of each story. However that is determined (and whether or not it is determined on a story-by-story basis), the choice would drastically affect a person's (or a public's) perception of the news.

A more philosophical objection to filtering technology asks whether a system of automated selection of information will stifle new thought. After all, if readers are sheltered from all topics except those that they know about and want to hear more about, they may get a skewed perception of the world, or at least one that is much less rich than they might have gotten if they were forced to plow through articles themselves.

Some form of information filtering is already a necessity. While current filtering systems are very simplistic, more complex and workable systems will be developed soon, and probably enter the mainstream. How much automated forms of filtering will take over this role will depend on the strengths of developed systems and how much people will trust computers to tell them what they want to know.

Bibliography

Belkin, Nicholas J., et al., "Information Filtering and Information Retrieval: Two Sides of the Same Coin?", Communications of the ACM (Volume 35, No. 12, December 1992).

Bowen, T.F., et al., "The Datacycle Architecture", Communications of the ACM (Volume 35, No. 12, December 1992).

Foltz, Peter W., et al., "Personalized Information Delivery: An Analysis of Information Filtering Methods", Communications of the ACM (Volume 35, No. 12, December 1992).

Klapp, Orrin, Overload and Boredom: Essays on the Quality of Life in the Information Society (New York: Greenwood Press, 1986).

Naisbitt, John, Megatrends (New York: Warner Books, 1985).

Sheth, Beerud Dilip, A Learning Approach to Personalized Information Filtering (Master's Thesis, Department of Electrical Engineering and Computer Science, University of California at Berkeley, 1994).

Wurman, Richard Saul, Information Anxiety (New York: Doubleday, 1989).

