Wednesday, January 30, 2008

Problems Associated with Classifying the Web

One of the problems associated with cataloging the web is the rate at which technology changes and develops. For example, “PUSH technology” is a collective term for technologies that send information directly to people’s computers. Users must subscribe to the service and provide information about the types of resources they want. The problem with this is that the valuable information delivered may be limited.
According to Gralla, [6]

“Instead of you having to go out and gather information, it’s delivered right to you with no effort.”

However, even with the convenience of having information packaged and delivered, the problem of accuracy remains. In this writer’s opinion, packages should consist of information that has been cataloged and validated as a valuable resource by those in the Library and Information Science profession.

Another problem is that titles and other descriptions may change without notification. For example, URLs change or disappear. According to Karen G. Schneider, author of Cataloging Internet Resources: Concerns and Caveats, [7]

“One solution is to assign a PURL [8] to the record at creation, so the PURL is displayed in the master record in the online catalog.”

Therefore, if the URL changes, the PURL does not. This solution seems to work for changing URLs, but what about sites that disappear? That question can be answered by future research. Nevertheless, by using PURLs, catalog maintenance is reduced to a single update, as the sketch below illustrates.
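
To make this concrete, here is a minimal sketch, in Python, of the kind of indirection a PURL provides. This is not OCLC's actual PURL service; the identifier and URLs are invented for illustration. The catalog record cites only the persistent identifier, so when a site moves, one mapping update fixes every record that uses it.

```python
# A minimal sketch (not OCLC's actual PURL service) of how a persistent
# identifier can shield catalog records from changing URLs.
# The identifier and URLs below are hypothetical examples.

class PurlResolver:
    """Maps persistent identifiers to whatever URL is currently correct."""

    def __init__(self):
        self._targets = {}  # purl id -> current URL

    def register(self, purl_id, target_url):
        self._targets[purl_id] = target_url

    def update_target(self, purl_id, new_url):
        # Catalog records keep citing the same PURL; only this single
        # mapping changes when the resource moves.
        self._targets[purl_id] = new_url

    def resolve(self, purl_id):
        return self._targets[purl_id]


resolver = PurlResolver()
resolver.register("purl/example-resource", "http://old-host.example.org/guide.html")

# The site later moves; one update fixes every record that cites the PURL.
resolver.update_target("purl/example-resource", "http://new-host.example.org/guide.html")
print(resolver.resolve("purl/example-resource"))
```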

Another problem is the Internet’s rate of growth. According to Vinh-The Lam, author of Cataloging Internet Resources: Why, What, How, [9]

“During the one year period of the first OCLC Internet Project from October 1991 to September 1992, the network traffic in bytes grew from 1.88 to 3.32 trillion.”

This growth means that more and more people access the web. Other difficulties in cataloging websites include the variety of materials to select from, file formats that are subject to change, and the time consumed by backlog problems. Additionally, many tools used for retrieving information have limited capabilities. Vinh-The Lam states, [10]

“They rely on directories and filenames assigned to the Internet electronic files, which contain incomplete, inconsistent and nonstandard data.”


Another problem with classifying the web is that publication date and edition information is often not provided. To combat this problem, many Library and Information Science professionals suggest recording the source of the title proper in a note in MARC field 500, and recording electronic location and access information in MARC field 856.
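
As a rough illustration (with the MARC indicators and full subfield coding omitted for simplicity), the two fields might be represented along the following lines; the note wording, viewing date, and URL are invented examples, not taken from a real record:

```python
# A hypothetical, heavily simplified record fragment showing the two MARC
# fields mentioned above; it is not a complete or valid MARC record.

record_fields = [
    # 500 (General Note): commonly used to record the source of the title
    # and the date the resource was viewed, since websites rarely state a
    # publication date or edition.
    ("500", "Title from home page (viewed Jan. 30, 2008)."),
    # 856 (Electronic Location and Access): subfield $u carries the URL,
    # or a PURL as discussed above.
    ("856", "$u http://purl.example.org/sample-resource"),
]

for tag, content in record_fields:
    print(tag, content)
```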

The information provided above further illustrates ways in which the web can be classified despite these problems. Vinh-The Lam’s statement shows that catalogers and others in the Library and Information Science profession have a valid reason for wanting the web to have standards: so users can be provided with information that is complete.

Why Classification is Needed

1. Classification is needed because students and researchers must be able to find relevant information quickly and efficiently.

2. Classification is needed to provide predictable search engine results.

3. Classification is needed to provide access to more information.

4. Classification is needed to control vocabulary.

5. Classification is needed to eliminate duplication of search results.

6. Classification is needed to provide standards.

Without web classification, students and researchers may miss a great deal of information related to their searches. According to Stephen J. Scroggins, author of Internet Cataloging Issues, [4]

“The glut of data on the Information Superhighway means, however, that students and faculty using the Internet for research may not find what they need quickly or efficiently. Indeed, there is an increasing sense of frustration among Internet users resulting from the unpredictable and misleading results search engines yield. This rapidly expanding and ever-changing resource needs organizing, and this challenge has become a major issue facing librarians, particularly catalogers, and commercial service providers.”


The duplication of search engine results stems primarily from how search engines work. For example, each search engine has its own rules about how materials are gathered. Therefore, Internet users have to use different search engines to get different results. In the book How the Internet Works by Preston Gralla, [5] there are six steps associated with how Internet search engines work (a minimal sketch of this cycle follows the list below):

1. Crawlers or spiders follow links on home pages until the information needed is located.

2. Information found by the crawlers or spiders is sent to indexing software.

3. The software takes the information found and places it into a database.

4. Users search the engine by putting in keywords.

5. The database searches for the terms put in by the user.

6. Results appear on the user’s screen, and once a link is clicked, the user is sent to the document.
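
The toy sketch below walks through these six steps under greatly simplified assumptions; the page names, text, and link graph are invented, and a real engine is of course vastly larger and more sophisticated:

```python
from collections import deque

# A toy illustration of the six steps above, not a real search engine.
# The page names, text, and links are invented for demonstration.

PAGES = {
    "home.html": ("classification of web resources", ["catalog.html", "purl.html"]),
    "catalog.html": ("cataloging internet resources with marc", ["purl.html"]),
    "purl.html": ("persistent url records for catalogs", []),
}

def crawl(start):
    """Steps 1-2: a 'spider' follows links from a starting page and collects page text."""
    found, queue, seen = {}, deque([start]), set()
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        found[url] = text
        queue.extend(links)
    return found

def build_index(pages):
    """Step 3: indexing software records which pages contain which words."""
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, keywords):
    """Steps 4-6: look up the user's keywords and return pages containing all of them."""
    matches = [index.get(word, set()) for word in keywords.lower().split()]
    return set.intersection(*matches) if matches else set()

index = build_index(crawl("home.html"))
print(search(index, "cataloging resources"))  # -> {'catalog.html'}
```

Because each engine applies its own rules at the crawling and indexing stages, two engines given the same keywords can return different, only partially overlapping result sets.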


Gralla sums the process up by stating,


“They are essentially massive databases that cover wide swaths of the Internet”, and “you search through them as you would a database by typing keywords that describe the information you want.”


However, there are problems associated with search engine retrieval methods. For example, crawlers or spiders do not follow every link. Some bypass graphic or animated files. In addition, many only look for popular sites. Therefore, users are not provided with information that is accurate and predictable.

The above information illustrates the need for classifying the web. Users have a right to be provided quick and efficient information that is predictable. Classifying the web will achieve this goal as well as provide users with access to more information.


[4] Stephen J. Scroggins, “Internet Cataloging Issues,” Colorado Libraries 26, no. 2 (Summer 2000): 46-47, 10 Nov. 2001.
[5] Gralla, 186-87.

WWW Dot to Classify or Not?

There is a basic concept that must be remembered by Library and Information Science scholars when debating whether websites should be classified or left unclassified: classification is based on a theory that describes knowledge as a universe structured to flow from the general to the specific. For instance, a general subject such as music becomes specific when we consider types of music, the language of the music, the time period of the music, and so on, as the small sketch below illustrates. This flow of knowledge leads to information being classified regardless of how the information is manifested, i.e., books, e-journals, websites, and others.
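
As a rough sketch, using invented class names rather than any actual scheme, such a general-to-specific hierarchy can be pictured as a small tree:

```python
# A toy illustration of knowledge flowing from the general to the specific.
# The hierarchy below is invented for illustration, not an actual scheme.

HIERARCHY = {
    "Music": {
        "Music -- By type": {"Jazz": {}, "Folk": {}},
        "Music -- By language": {"English": {}, "Spanish": {}},
        "Music -- By period": {"Baroque": {}, "Twentieth century": {}},
    }
}

def print_paths(tree, path=None):
    """Walk the tree and print each general-to-specific path."""
    path = path or []
    for name, children in tree.items():
        current = path + [name]
        if children:
            print_paths(children, current)
        else:
            print(" > ".join(current))

print_paths(HIERARCHY)
```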

To begin, a few related terms are defined: [1]

Classification - a logical system for the arrangement of knowledge.

Classifier - a person who applies a classification system to a body of knowledge or a collection of documents.

Controlled Vocabulary - in subject analysis and retrieval, the use of an authorized subset of the language as indexing terms (see the sketch after this list).

Facet - a component (based on a particular characteristic) of a complex subject, e.g., geographic facet, language facet, literary form facet.

Literary Warrant - the concept that new notations are created for a classification scheme and new terms are added to a controlled vocabulary only when information packages actually exist about new concepts.

Metadata - data about data.

PURL (Persistent Uniform Resource Locator) - a URL that does not change.
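
To make the controlled vocabulary definition above concrete, here is a small, hypothetical sketch in which several free-text terms map to one authorized indexing term; the entries are illustrative rather than drawn from an actual thesaurus:

```python
# A tiny illustration of a controlled vocabulary: several free-text terms
# are mapped to a single authorized indexing term. The entries are invented
# examples, not taken from an actual thesaurus.

AUTHORIZED = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "automobiles": "Automobiles",
    "movies": "Motion pictures",
    "films": "Motion pictures",
}

def authorized_term(user_term):
    """Return the authorized form of a term, or the term itself if uncontrolled."""
    return AUTHORIZED.get(user_term.lower(), user_term)

print(authorized_term("Films"))  # -> Motion pictures
print(authorized_term("Jazz"))   # -> Jazz (no authorized mapping in this sketch)
```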

Theories of Classification

The grandfather theory of classification can be traced to Aristotle’s “classical theory of categories.” The foundation of the theory is the idea of placing common things together, and from it hierarchical classification developed. A statement made by Lois Mai Chan, author of Cataloging and Classification, illustrates the longevity of this theory: [2]

“On the whole, the progression is from the general to the specific, forming a hierarchical, or ‘tree,’ structure, each class being a species of the class on the preceding level and a genus to the one below it.”

Of course, challenges to the classical theory surfaced. For example, Ludwig Wittgenstein’s “family resemblance” idea is based on overlapping similarities among members rather than on strictly defined categories. Lotfi Zadeh’s “fuzzy set theory” suggests that membership in some categories is a matter of degree and can depend on the person observing the information. For instance, some website images are clearly defined because there is no distortion in the image, but for distorted images the cataloger must rely on his or her own perception of what the image is supposed to be. The resulting cataloged information may not be accurate, because another cataloger may see the image differently. A similar theory is the “prototype theory” developed by Eleanor Rosch. Together, these theories challenge the classical assumption that categories are defined by properties shared equally by all members; instead, some members may be more representative of a category than others, and category judgments are not entirely free of the cataloger’s biases.

Other theorists, such as J.L. Austin, who extended Wittgenstein’s work to the study of words, also challenged the classical theory. According to Arlene G. Taylor, author of The Organization of Knowledge, [3]

“He wondered why we call different things by the same name (e.g., foot of a mountain, foot of a list, person’s foot).”

While many theorists have brought forth new ways of categorizing information, the classical theory remains widely used. It is this writer’s opinion that students and researchers would be best served if Internet information were arranged in a hierarchy consisting of common elements that are standardized.


[1] Lois Mai Chan, Cataloging and Classification: An Introduction (New York: McGraw-Hill, 1994), 482-484.
[2] Chan, 260.
[3] Arlene G. Taylor, The Organization of Knowledge (Englewood: Libraries Unlimited, 1999), 174.
