Mark Edstrom
edstrom@uclink4.berkeley.edu
Ph.D. student in the Department of Sociology
3 December 1996
FINAL PAPER


ENGLISH AND THE WORLD WIDE WEB

Note to readers:
This paper was prepared for a Fall 1996 course titled "Impact of New Information Resources: Multimedia and Networks" offered by the School of Information Management & Systems (SIMS) at the University of California, Berkeley. The course was taught by Professor Howard Besser. Many of the issues addressed in this paper grew out of a focus group we formed during the semester, which was concerned, broadly, with the social implications of new media technologies.

INTRODUCTION
This paper will begin with a short theoretical section about how technology and society interact and will then move quickly to its central issue: the predominance of English-language content on the Web. My interest in this topic is driven by a desire to understand 1) more about how things came to be the way they are on the Web and 2) what the broader significance of a Web dominated by English-language content is. In looking at how things came to be the way they are, I will examine the ASCII and Unicode standards at some length in order to illustrate some of my abstract musings about the relation of technology to society with concrete examples. The over-arching point I hope to make by the end of this paper is that a technologically-deterministic perspective is inadequate for explaining how human experiences will change in the information age.

HOW DO TECHNOLOGY AND SOCIETY INTERACT?
By a "technologically-deterministic perspective," I mean a way of looking at the world that imagines humans to be more-or-less on the "receiving-end" of technology. While examples can be cited to show that people don't always do what is good for them with regard to technology (e.g., the overuse of automobiles and the resulting environmental problems), I hope to make the point in this paper that technology does not simply impact society. In fact, the current runs both ways between society and technology, and at times the two concepts become somewhat difficult to separate. People (and their associated cultural agenda and personal concerns) design and produce technologies, and social context inevitably permeates and resides in the products they create to some degree. But the important corollary of this is that people use whatever technologies exist to pursue their basic needs and wants; social life and relations are not entirely defined by the existence or absence of certain technologies, though in some cases the texture of everyday life is undoubtedly affected by them.

To see how the interaction of technology and society had been conceptualized before the popularization of personal computers, I checked out several books from the library that dealt with the social ramifications of earlier technological developments. The most interesting book I came across was The Social Effects of Aviation (Ogburn, 1946). This book, written by a distinguished professor at the University of Chicago, was particularly interesting to look at because of the author's crass willingness to predict the social "effects" of technological development many years out into the future. Ogburn titled the final 21 chapters of the book after the social institutions he believed would be affected by aviation (e.g., "Crime," "Education," "The Family") and proceeded to describe, with no lack of exactitude, what changes aviation would cause.

My thinking on the study of technology has been deeply influenced by a book called America Calling: A Social History of the Telephone to 1940, in which Claude Fischer calls into question the "billiard ball" model (society impacted by an external force) of technological development espoused by Ogburn and many who came after him. Fischer points out that "the telephone did not radically alter American ways of life; rather, Americans used it to more vigorously pursue their characteristic ways of life" (Fischer, 1992). I hope to demonstrate by the end of this paper that although English-language content does dominate the Web, and although this dominance will have some negative effects on non-English-speaking populations (effects primarily of an economic nature that are somewhat outside the scope of this paper), people will provide more non-English-language content as soon as the need for it is articulated more strongly. The Web (or whatever mode of global communication comes after it) is unlikely to remain dominated by English for long if people desire content in other languages and have at least some means to realize the ends of their choosing.

THE PREDOMINANCE OF ENGLISH-LANGUAGE CONTENT ON THE WEB
There seems to be general agreement in the popular press that English-language content dominates the Web, though there also appear to be many different ways of measuring English's predominance, and somewhat divergent estimates of the actual percentage of content in English. In an article in the Wall Street Journal in September of 1996, Robin Frost cites estimates indicating that "90% or more of the content on the Web is in English" (Frost, September 1996). An article in Quill magazine states (more candidly than most) that "no one knows for sure, but it seems a reasonable guess that today 75 percent to 80 percent of all data on the Internet are in English" (Johnson, August 1996). In the online version of InfoWorld, an article states that "64 percent of the world's Internet servers are in the United States, followed by Germany with the second largest concentration at just 5 percent" (Kalin, October 1996). An article in Business Week says that "perhaps 80% of Internet sites today are located in English-speaking countries" (Mandel, April 1996). And, finally, an article in the Los Angeles Times declares, without qualification, that "90% of Internet traffic worldwide is in English" (Romero, February 1996).

The statistics cited in many of these articles paper over much of the complexity of the situation. Despite the fact that many authors try to sound authoritative about their estimates, it is nearly impossible to determine how much of the content of the World Wide Web is in English. There is no centralized administrative body that tracks Web content with regard to language. It would also be impossible to determine, even using automated "worms" and the like, which content is in English, because there is no mandatory, standardized metacode at the beginning of all HTML documents to indicate what language a particular webpage is in. (There are some browser plug-ins that can determine what language an HTML document is in, and change their display of the information accordingly, but the code at the beginning of HTML documents that allows them to do this is not standardized across all languages.) This situation with regard to language is similar to the problem with images as they now exist in digital format: how are we to name, categorize, and retrieve images using the Web or other distributed database systems without some sort of metacode standard?
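
To make the measurement problem concrete, the following sketch (written in Python; the lang="..." attribute it looks for is a hypothetical convention, not a requirement that existing pages follow) shows roughly all that an automated "worm" could do: scan the top of each document for a declaration that, on most pages, simply is not there.

    import re

    # Sketch of a language-detecting "worm": look near the top of an HTML
    # document for a declared language. The lang="..." attribute assumed here
    # is hypothetical; precisely because no such metacode is mandatory,
    # this function returns nothing for the vast majority of pages.
    LANG_PATTERN = re.compile(r'<html[^>]*\blang="([A-Za-z-]+)"', re.IGNORECASE)

    def declared_language(html_text):
        """Return the declared language code, or None if no declaration is found."""
        match = LANG_PATTERN.search(html_text[:1024])  # only the head of the file matters
        return match.group(1) if match else None

    print(declared_language('<html lang="fr"><head></head><body>Bonjour</body></html>'))  # fr
    print(declared_language('<html><head></head><body>Hello</body></html>'))              # None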

It is easier, though by no means simple in the current environment of explosive Internet growth, to determine where Internet servers are located geographically than to determine the language makeup of their content. Geographic location is easier to pin down because all servers must register their domain names to be accessed by the broader cyberspace community. One organization, InterNIC, has centralized pools of data about which organizations are connected to which domain names, and where these organizations are based, though I am not sure to what extent this information is available to the public. The existence of one central authority through which all domain names must be approved has proved controversial, particularly for major multinationals who, citing trademark law, claim individuals have copped domain names that are not legitimately theirs. But the fact that this arrangement produces some centralized data on the makeup of the entire Internet is a very favorable side-effect indeed. However, knowing the geographic location of a server still does not allow us to predict very accurately what the language of its content will be. Moreover, it is common for Websites to be multi-lingual, particularly when the primary language being used is not English. I question the methods used to produce these statistics because such methods are important to consider in designing any rigorous study of the Web, not because I doubt that English-language content is more common than content in any other language. In fact, it seems entirely safe to assert that English-language content dominates the Web at this point in time.

THE ALLEGED SIGNIFICANCE OF THE PREDOMINANCE OF ENGLISH-LANGUAGE CONTENT ON THE WEB
Thankfully, there seems to be very little debate about whether it is a good thing that English dominates the Web: most people seem to agree that it is not. But many popular responses to the current situation, from Americans and non-Americans alike, are paternalistic, and portray non-English speakers as totally dependent on American software makers to find them a way out of an English-only cyberspace.

In an article in the Los Angeles Times, "experts" say that "the growth of the Net worldwide will only speed the spread of English as the common language of the globe... Instead of seeing a small world of multiculturalism, many foreigners view the emergence of the Net as another tool for American cultural imperialism" (Romero, February 1996). The director of Russia's best-known Internet provider is quoted in the New York Times as saying that the Web is the "ultimate act of intellectual colonialism. The product comes from America so we either must adapt to English or stop using it... This just makes the world into new sorts of haves and have-nots" (Specter, April 1996).

Though there are plenty of reasons to be concerned about the economic prospects of much of the world's population in the information age, these particular statements are misleading for several reasons. First, they inaccurately portray the Web as an orchestrated rather than a spontaneous, chaotic effort and, in so doing, exaggerate the imperialistic intentions of millions of individual users posting Websites in whatever language they happen to speak (primarily English). Second, they downplay the amount of content available in most major languages and make it sound as if the Web will only become increasingly dominated by English.

SOME BASIC REASONS ENGLISH BECAME THE LINGUA FRANCA OF THE WEB
Place of Origin
English is the most widely used language on the Web for many reasons. The first reason is historical. The Internet was developed in the United States by the U.S. Department of Defense (though not to maintain communications in the event of nuclear attack, as the story goes in popular Net mythology). It spread quickly (particularly in the last 5 years) in the United States because of demand from universities, private corporations and individuals. Because American needs and priorities greatly influenced the development of the Internet, and determined many of its earliest standards, Americans encountered fewer obstacles (particularly with regard to language) in utilizing the technology than did others in many parts of the world. According to Nicholas Negroponte of the MIT Media Lab, who is quoted in an article about the Web in the Wall Street Journal, "the Net is U.S.-centric to the extent that the U.S. is its birthplace" (Frost, September 1996).

Lower Costs
Later in the same article, Negroponte goes on to say that low access costs are also an important reason so many more English-speakers use the Web relative to non-English-speakers. Corporations in the United States can now purchase a 64Kbps connection from an Internet Service Provider (ISP) for about $300-$400 per month, whereas costs in more regulated telecommunications markets (such as Germany, France, and Spain) range from $1,000 to $3,000 per month (Kalin, October 1996). Individuals in the United States can buy unlimited, direct dial-up connections through most major ISPs (America Online, AT&T, Compuserve, etc.) for about $20 per month and connect to these ISPs through a local call in all but the smallest markets. Finally, with the introduction of WebTV in the United States (and not yet in other countries), which boasts minimal start-up costs of $329 for the set-top device and $29.99 per month for the connection in the Bay Area (see press release), Americans will be the first to experience the Web from the comfort of their La-Z-Boys.

Complex Technologies Still Prefer Simple Languages
Another important reason that English-language content dominates the Web is that it is still more difficult, current computer technology notwithstanding, to create material in languages that are not based on Latin characters than it is to create documents in English. The primary problem in using East Asian languages has been figuring out how to use the limited interface devices we have today (keyboards and mice) to key in, in some cases, thousands of different characters. Though people have come up with many solutions to the limitations of the keyboard with regard to the Cyrillic and Arabic alphabets and Chinese characters, there is still no simple, elegant solution to keying in these writing systems on their own terms. In many cases, characters have to be constructed with multiple keystrokes on Latin-character keyboards, or keyed in phonetically (based on a Latin-letter transliteration of their pronunciation). Other software interfaces present the user with candidate characters on the screen, from which the desired character is selected with a mouse. As long as the interface between the user and the computer is limited to the mouse and keyboard, Latin-character-based languages are likely to be at least slightly easier to convert to a computer's zeros and ones. Sophisticated voice recognition technology could make the production of non-Latin-character-based content easier, but this does not seem likely to happen in the very near future. It is also possible, though not likely, that text-inputting technology (in whatever form) will not advance quickly enough, and that the paradigm of "text" will be eclipsed by video- and audio-based communications media, which do not privilege Latin-character-based languages over others.
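
As a toy illustration of the phonetic-entry-plus-selection approach described above, consider the following Python sketch. The three-syllable dictionary is invented for this example; real input methods draw on dictionaries of tens of thousands of entries and use context to rank candidates.

    # Toy sketch of phonetic input with candidate selection: the user types a
    # Latin-letter syllable, the software offers the characters that share that
    # sound, and the user picks one. The tiny dictionary below is invented
    # purely for illustration.
    CANDIDATES = {
        "ma":  ["\u5988", "\u9a6c", "\u9ebb"],  # several characters pronounced "ma"
        "shi": ["\u662f", "\u5e02", "\u77f3"],  # several characters pronounced "shi"
    }

    def candidates_for(syllable):
        """Return the characters a user would be asked to choose among."""
        return CANDIDATES.get(syllable, [])

    print(candidates_for("ma"))   # the user still has to select one of these by hand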

OTHER REASONS ENGLISH BECAME THE DOMINANT LANGUAGE OF THE WEB: TECHNICAL STANDARDS
At this point, I would like to give some concrete examples of how cultural priorities and social context affect the development of technological standards, and particularly how the American origin of personal computer technology is related to, though not exclusively responsible for, the current prevalence of English-language content on the Web. To do this, I will first provide a simple explanation of how characters are represented in binary code. Second, I will trace the history of the development of several computing standards over the past 30 years, most notably ASCII and its heir apparent, Unicode. Third, I will look at how the widespread adoption of Unicode would improve the multi-lingual functionality of the Web but would still leave several key problems unresolved.

Basics
In the most basic terms, computers store information in the form of zeros and ones (binary code). Characters (numbers, letters and symbols) are encoded in standardized ways, so that information can be shared and so that software programs can be written that will understand the information on any computer. The most common standard for encoding characters today is known as ASCII (American Standard Code for Information Interchange). The official ASCII standard uses a 7-bit encoding scheme, which basically means that it can distinguish between 128 (2 to the 7th power) different characters. (An extended character set has been added so that the scheme can now represent 256 (2 to the 8th power) different characters.) Of course the problem with this -- and this is where it becomes particularly relevant to the World Wide Web -- is that 256 characters are not quite enough to make all of the world's languages intelligible to a computer.
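
As a simple illustration of the arithmetic involved, the short Python sketch below computes how many distinct characters an n-bit code can represent, and shows that a Latin letter fits within ASCII's range while a character from a non-Latin script does not (the specific code point shown is drawn from the modern Unicode assignments, purely for illustration).

    # A minimal sketch of the arithmetic above: an n-bit code can distinguish
    # at most 2**n characters, so 7-bit ASCII tops out at 128 and the 8-bit
    # extension at 256, far short of what the world's writing systems need.
    for bits in (6, 7, 8, 16):
        print(bits, "bits ->", 2 ** bits, "possible characters")

    # A Latin letter fits comfortably within the 7-bit range...
    print("A ->", ord("A"))            # 65
    # ...but a Chinese character (shown with its modern Unicode code point,
    # purely for illustration) falls far outside the 256 slots of extended ASCII.
    print("\u4e2d ->", ord("\u4e2d"))  # 20013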

A Very Short History of Standards
The first major standard for encoding data for computers was 6-bit BCD (binary-coded decimal). This encoding standard was adequate when it was first introduced in the late 1950s (and necessary at the time because of its compact size of 6 bits), but it quickly became problematic as computer usage moved away from simple number-crunching to language-related tasks. ASCII (a 7-bit encoding scheme) was created in 1965 in response to the limitations of 6-bit BCD and was certified in 1977 by the American National Standards Institute (ANSI). It came into near-universal usage shortly thereafter. Not surprisingly, the shortcomings of the ASCII encoding scheme became apparent shortly after its introduction.

There were several responses to ASCII's limitations. One was the creation of the double-byte character set (DBCS), which encoded information in both 1-byte and 2-byte packets. This encoding scheme was problematic for several reasons, one of them being that the computer had to continually determine whether a character occupied 8 or 16 bits. Another response to the limitations of ASCII was organized in 1983 by the International Organization for Standardization (ISO), whose objective was to create an encoding scheme in which characters were represented with a 16-bit standard. Their efforts were derailed not by technological impossibility, but by political blunder (which seems to be a recurring plot structure in the history of technological development). They underestimated resistance to their standards proposal (ISO 10646), which would have combined many similar or identical Han characters in order to fit all the languages they wanted to include into the 65,536 (2 to the 16th power) slots available in a 16-bit encoding scheme. In desperation (I would guess), ISO floated a plan for a "memory pig" of a standard, a 4-byte encoding scheme that would allow for about 4 billion different characters. Faced with the reality of limited computational power, this plan died a natural death.
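
The bookkeeping burden that a double-byte character set imposes can be sketched as follows (in Python; the rule used here, that any byte of 0x80 or above begins a two-byte character, is a simplified, hypothetical convention rather than any particular vendor's scheme).

    # Sketch of the constant bookkeeping a double-byte character set (DBCS)
    # forces on software: the decoder must inspect every lead byte to decide
    # whether the character occupies one byte or two. The rule used here
    # (lead byte >= 0x80 means a two-byte character) is a simplified,
    # hypothetical convention for illustration.
    def character_widths(data):
        """Walk a byte string and report how many bytes each character occupies."""
        widths = []
        i = 0
        while i < len(data):
            if data[i] >= 0x80:    # lead byte of a two-byte character
                widths.append(2)
                i += 2
            else:                  # ordinary single-byte (ASCII-range) character
                widths.append(1)
                i += 1
        return widths

    sample = bytes([0x48, 0x69, 0x88, 0xEA, 0x21])  # "Hi", one two-byte character, "!"
    print(character_widths(sample))                 # [1, 1, 2, 1]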

In 1987, another attempt was made at establishing a 16-bit standard that would be capable of handling all the world's major written languages. Unicode, as it came to be known, succeeded where ISO 10646 had not, and managed to combine nearly-identical characters (primarily from East Asian languages) in a way that did not offend the sensibilities of the parties involved, so that all major languages could be represented in its encoding protocol. In achieving this consensus, the Unicode standard has become the most logical candidate for replacing the ASCII standard, and the most exciting because of its unique identification of more than 65,000 characters. (Most of the factual data in this section, "A Very Short History of Standards," is from McClure, 1995.)

Unicode: A Few Details
If Unicode is uniformly adopted (as seems likely in the next 5-10 years), it could have major ramifications for the multi-lingual usability of the Web. To date, Windows NT is the only major PC operating system that provides near-complete support of the Unicode standard. Reports indicate that the next iteration of Windows (probably '98?) will also support the Unicode standard, but Windows 95 definitely does not. Plug-ins are currently available for Netscape and other applications that allow users to view non-Latin-character-based content, even with an English-language operating system. (The best-integrated plug-in I found for Netscape is called Navigate with an Accent, and it can be downloaded for a free 30-day trial.) And, obviously, a variety of language-specific encoding standards exist (variations and elaborations of the ASCII standard), most of which work best on an operating system designed to support whichever non-Latin character set the language in question requires. I have also heard rumors about a software package, though I was not able to track down the specific product name, that allows the user to toggle between operating systems running in different languages. Because of the annoying connection between operating system and language support under the limited ASCII paradigm, this could prove useful.

Unicode and the World Wide Web
What is most exciting about this history of encoding standards is that we are positioned on the brink of a new age with regard to multi-lingual support on the Web. The newest version of the Unicode standard, 2.0, supports the scripts necessary to create content in Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Oriya, Phonetic, Tamil, Telugu, Thai and Tibetan (see listing under Unicode).

But what is perhaps the most amazing element of this history of encoding standards is that people have figured out ways to communicate using computers in nearly all the world's languages, without a universal 16-bit encoding standard that offers absolute, unique character identification like Unicode. The fact that people often use technologies in ways not anticipated by those who designed them is one of the principal reasons that long-term predictions about the effects of technological developments nearly always fall flat. Another reason it is difficult to look at technological development and predict social outcomes is that people are constantly operating in an environment of insufficient technology; figuring out how to make do with what is available is a process fundamental to the human condition. For these reasons, I do not believe a technologically-deterministic viewpoint provides much leverage in predicting how the Web will affect people's lives, particularly not with a technological format as versatile and malleable as the Web.

The adoption of Unicode could change computing and the Web dramatically. One of the most useful things a Unicode standard would facilitate is truly multi-lingual database functions. For example, with characters (or scripts, to be more precise) that have absolute identities, it would be possible to have one sortable database containing information in Chinese, Japanese, and English. Along similar conceptual lines, search engines could do far more sophisticated searches in any language through the use of absolute character identity. Perhaps more importantly, operating systems that offered complete support for the Unicode standard would allow Websurfers to more easily see the diversity of languages present on the Web, rather than offering them a screen full of garbage (*&%^$#@!) every time they happen upon a site not denominated by Latin characters.
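
A crude sketch of this "absolute character identity" idea follows (in Python, with a handful of invented database entries): once every character maps to a single code point, entries in different scripts can sit in one table and be sorted or searched by the same rule, though real multilingual collation is of course more subtle than a raw code-point sort.

    # Sketch of "absolute character identity": with one code point per character,
    # strings in different scripts can live in a single table and be sorted by
    # one rule. (A raw code-point sort is only the crudest form of collation,
    # but it illustrates the principle.) The entries below are invented examples.
    entries = ["Apfel", "apple", "\u82f9\u679c", "\u308a\u3093\u3054"]  # German, English, Chinese, Japanese

    for entry in sorted(entries):                       # default comparison is by code point
        print(entry, [hex(ord(ch)) for ch in entry])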

Though the prospect of the Unicode standard is tantalizing, it is by no stretch a panacea. Languages still have different directional orientations (e.g., right to left, top to bottom), different spell-checking needs, and different ways of being keyed in, depending on the system being used. Unicode 2.0 does not resolve any of these problems, nor does it provide a standardized way of determining what language a document is in if characters are common to several languages. Furthermore, merely seeing all of the world's languages on a computer terminal is no great leap -- but understanding them would be. Several of the articles I looked at for this paper "hyped" the idea of software programs (e.g., Globalink) that would automatically translate the content of foreign Websites into the user's language (Immergut, November 1996 and Frost, September 1996). This technology seems unlikely to be very interesting or sophisticated any time in the immediate future, because of limited computing power and the painfully bad translations that tend to come out of inorganic processes, though such software could be an important factor in the long term.

CONCLUSION
It is perfectly reasonable to be concerned about the homogenization of cultures and language, but the World Wide Web seems unlikely to contribute much toward this outcome. Despite the legacy of ASCII and other technological barriers to the production of non-Latin-character content for the Web, Websites outside the United States are growing at a rapid clip. The number of Internet hosts in Japan increased by 211% in the twelve months preceding July 1996, while the number in China increased by 1,003% (Frost, September 1996). Well-developed histories and languages are not likely to be abandoned as a result of technological obstacles. Although English will, undoubtedly, remain an important language for the foreseeable future on the Web (and in general), new technologies and standards (e.g., Unicode) will make the production of information in languages other than English easier and more worthwhile.



REFERENCES

Brook, James and Iain A. Boal, eds. Resisting the Virtual Life: The Culture and Politics of Information. San Francisco (1995): City Lights Books.

Fischer, Claude S. America Calling: A Social History of the Telephone to 1940. Berkeley, California (1992): University of California Press.

Frost, Robin. Web's Heavy U.S. Accent Grates On Overseas Ears; Foreign Users Criticize Lack Of Diversity In Languages, Topics. Wall Street Journal (Thu, Sept 26, 1996):B6(W), B6(E), col 4.

Gonzalez, Sean. Java & HotJava: Waking up the Web. PC Magazine v14, n18 (Oct 24, 1995):265.

Hamilton, David P. Distant vision. Wall Street Journal (Mon, Nov 14, 1994):R34(W), R34(E), col 1.

Hitheesing, Nikhil. The mother of development. Forbes v157, n2 (Jan 22, 1996):88.

Hudson, Richard L. Software makers are developing ways to make their programs more polyglot. Wall Street Journal (Mon, July 27, 1992):A5B(W), B4D(E), col 1.

Immergut, Debra Jo. Bridging the Web's language gap. Wall Street Journal (Tue, Nov 12, 1996):A20(W), A16(E), col 1.

Johnson, J.T. Language critical skill for surfing globe's databases: cultural definitions hinge on directions taken by Internet. Quill v83, n6 (July-August, 1996):16.

Kalin, Sara. Foreign Internet access costs soar above United States'. InfoWorld v18, n44 (Oct 28, 1996):TW2.

Mandel, Michael J. A World Wide Web for tout le monde. Business Week, n3469 (April 1, 1996):36.

McClure, Wanda L.; Hannah, Stan A. Communicating Globally: The Advent Of Unicode. Computers in Libraries. v15, n5 (May, 1995):19 (6 pages).

Mendelson, Edward. QuickView Plus. PC Magazine v15, n9 (May 14, 1996):165.

Mendelson, Edward. Computing without borders. PC Magazine v13, n18 (Oct 25, 1994):30.

Meyerowitz, Joshua. No Sense of Place: The Impact of Electronic Media on Social Behavior. New York (1985): Oxford University Press.

Nash, Kim S. Browsers boil over with new features. Computerworld v30, n8 (Feb 19, 1996):59.

Ogburn, William F. The Social Effects of Aviation. New York (1946): Houghton Mifflin.

Perenson, Melissa J. Translations. PC Magazine v15, n12 (June 25, 1996):76.

Petzold, Charles. Move over, ASCII! Unicode is here. PC Magazine v12, n18 (Oct 26, 1993):374.

Petzold, Charles. Typing Unicode characters from the keyboard. PC Magazine v12, n21 (Dec 7, 1993):426.

Petzold, Charles. Unicode, wide characters, and C. PC Magazine v12, n19 (Nov 9, 1993):369.

Petzold, Charles. Viewing a Unicode TrueType font under Windows NT. PC Magazine v12, n20 (Nov 23, 1993):379.

Pollack, Andrew. A cyberspace front in a multicultural world. New York Times v144 (Mon, August 7, 1995):C1(N), D1(L), col 2.

Romero, Dennis. The Net's a small (English) world after all. Los Angeles Times v115 (Fri, Feb 23, 1996):E1, col 1.

Shannon, L.R. An Internet multilingual interpreter. New York Times v145 (Tue, March 19, 1996):B10(N), C5(L), col 4.

Smith, Ben. Around the world in text displays. Byte v15, n5 (May, 1990):262.

Specter, Michael. World, Wide, Web: 3 English Words; Other Countries Resent The Domination Of The English Language On The Internet. New York Times v145, sec4 (Sun, April 14, 1996):E1(N), E1(L), col 1.

Strassi, Kimberley A. Search Engines Push Internet Navigation On A Global Course; Services In Local Languages Are Set Up To Tap Growth In Variety Of Users Abroad. Wall Street Journal (Mon, Oct 7, 1996):B7C(W), B13G(E).

Technology and Social Change, edited by F.R. Allen et al. New York (1957): Appleton-Century Crofts.

Unicode supported scripts: http://www.cm.spyglass.com/unicode/standard/supported.html

Williams, Margot. Visiting a foreign site? Web translators are available. Washington Post v119, n293 (Mon, Sept 23, 1996):WB19, col 1, 14 col in.