The Term Vector Database: fast access to indexing terms for Web pages

Raymie Stata (1), Krishna Bharat (1,2), Farzin Maghoul (3)
(1) Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301, U.S.A.
(2) Google Inc., 2400 Bayshore Pkwy, Mountain View, CA 94043, U.S.A.
(3) AltaVista Corporation, 1825 S. Grant Street, San Mateo, CA 94402, U.S.A.

Abstract

We have built a database that provides term vector information for large numbers of pages (hundreds of millions). The basic operation of the database is to take URLs and return term vectors. Compared to computing vectors by downloading pages via HTTP, the Term Vector Database is several orders of magnitude faster, enabling a large class of applications that would be impractical without such a database. This paper describes the Term Vector Database in detail. It also reports on two applications built on top of the database. The first application is an optimization of connectivity-based topic distillation. The second application is a web page classifier used to annotate results returned by a web search engine.

Keywords: Page Classification; Term vectors; Topic Distillation; Web Connectivity; Web Search


1. Introduction

In the vector space model of information retrieval ([Salton 71]), documents are modeled as vectors in a high-dimensional space of many thousands of terms. The terms are derived from words and phrases in the document and are weighted by their importance within the document and within the corpus of documents. Each document's vector seeks to represent the document in a "vector space", allowing comparison with vectors derived from other sources, for example, queries or other documents. This model has had a long and distinguished history, having been used as the basis of successful algorithms for document ranking, document filtering, document clustering, and relevance feedback (see [Sparck Jones et al 97], [Baeza-Yates et al 99]).
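For example, under the classic tf-idf weighting scheme (one standard instance of such weighting; see [Salton et al 88]), the weight of term t in document d is the term's frequency in d scaled by the log of its inverse document frequency:

    w(t,d) = tf(t,d) x log(N / df(t))

where N is the number of documents in the corpus and df(t) is the number of documents containing t. This is not necessarily the weighting used in our database, which stores raw counts (see Section 2.1); applications are free to reweight the stored vectors.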

Despite its usefulness, term vector information for Web pages is not readily available. One could, of course, compute vectors on demand by downloading pages over HTTP, but the amount of time required by this approach rules out all but the simplest experiments. Even in companies that run web search engines, term vector information is typically stored in inverted form; that is, given a term, one can find the vectors containing it, but not the other way around. While this inverted form is useful for some applications (including, of course, serving queries), many other applications need to retrieve vectors based on a page identifier.

To support such applications, we have built the Term Vector Database. We have populated it with term vectors for all the pages in the AltaVista index. Unlike search engines, which map from terms to page ids, our database maps from page ids to terms. Compared to retrieving pages over the web, our database is many orders of magnitude faster.

Note that the selection of indexing terms for documents has been standard practice in library science since before information retrieval came into existence as a discipline. Librarians would select a small set of semantically representative, discriminatory keywords to represent each document. This is essentially what we do in our database, but on the scale of the Web, which makes it a challenging engineering problem.

The Term Vector Database is especially useful in conjunction with another tool that we created, the Connectivity Server [Bharat et al 98a]. The Connectivity Server provides fast access to the connectivity graph of the Web. Given a URL or a set of URLs, the server rapidly computes a graph of pages and links surrounding the input URLs. A task-specific agent can use this graph to walk the web or mine information from the graph structure. This can be done many orders of magnitude faster than crawling actual pages on the web. Unfortunately, the connectivity server lacks content information. The Term Vector Database provides a complementary service that addresses this problem. It provides rapid access to content information for most pages in the Connectivity Server database, thus allowing more intelligent and selective computation on the WWW's structure.

In addition to adding value to the Connectivity Server, the Term Vector Database supports other applications, including classifying pages, finding representative passages within pages, identifying the language in which a page is written, and helping to filter spam and pornography. In Section 3, we describe two applications that we have implemented using the database: topic distillation and page classification. But first, Section 2 gives an overview of the system.


2. System overview

This section describes the Term Vector Database. The first subsection gives a precise definition of the data stored in the database, that is, of exactly what a "term vector" is in our context. The second subsection describes the primary operation exported by the database, which is to retrieve vectors given URLs. Section 2.3 describes the internal organization of the data in the database. The final subsection describes the software modules that make up the database.

2.1 Definition of term vectors

A term vector is a sequence of term-weight pairs. The design of a term vector database must answer the following questions: What exactly is a term? Which terms are to be included in a page's term vector? How are the term weights stored in vectors computed?

In our system, we answer these questions as follows. Terms are word stems, produced by stemming ([Porter 80]) and drawn from a fixed lexicon. A page's vector contains the terms with the highest counts on the page, clipped to a maximum of 50 terms (see Section 3.1). The weights stored in vectors are simple counts of each term's occurrences within the page (see Section 2.2).
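The following is a minimal sketch of this vector-construction policy, assuming the page has already been tokenized, stemmed, and mapped through the lexicon to integer term ids; tv_build and its signature are illustrative, not part of the actual system:

#include <stdlib.h>

#define TV_MAX_VECTOR_LEN 50

typedef struct tv_pair_s { /* As defined in Section 2.2. */
    int tid;
    float weight;
} tv_pair_t;

static int cmp_tid(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static int cmp_count_desc(const void *a, const void *b) {
    float d = ((const tv_pair_t *)b)->weight - ((const tv_pair_t *)a)->weight;
    return (d > 0) - (d < 0);
}

/* Build a vector from a page's term ids; pairs must have room for
   one entry per distinct term. Returns the number of pairs kept. */
int tv_build(int *tids, int ntids, tv_pair_t *pairs)
{
    qsort(tids, ntids, sizeof(int), cmp_tid); /* Group equal ids. */

    int npairs = 0;
    for (int i = 0; i < ntids; ) {            /* Run-length count. */
        int j = i;
        while (j < ntids && tids[j] == tids[i]) j++;
        pairs[npairs].tid = tids[i];
        pairs[npairs].weight = (float)(j - i);
        npairs++;
        i = j;
    }

    /* Keep only the highest-count terms, clipped to 50. */
    qsort(pairs, npairs, sizeof(tv_pair_t), cmp_count_desc);
    return npairs < TV_MAX_VECTOR_LEN ? npairs : TV_MAX_VECTOR_LEN;
}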

2.2 API

The main function in the API takes a set of page identifiers and returns a set of vectors:

void tv_lookup(int n, const conn_id_t *pages, /*out*/ tv_t *vectors);

This function retrieves vectors for a set of URLs rather than a single one. Taking a set encourages a batch-oriented style of use, which potentially amortizes the cost of disk IOs over the retrieval of multiple vectors.

The type conn_id_t is a page identifier assigned to pages by our Connectivity Server [Bharat et al 98a]. As mentioned earlier, the Connectivity Server provides fast access to connectivity data for the web: given a URL, it returns the pages that point to and are pointed to by the URL. Rather than deal directly with URLs, the Connectivity Server uses a set of densely-packed integers to identify pages. Functions in the Connectivity Server convert between these integers and text URLs. In our work with the Connectivity Server, these identifiers have proven more convenient to handle in code than text URLs. By adopting them in the Term Vector Database, we not only gain this convenience, but we also gain easy interoperation with the Connectivity Server, supporting interesting applications (see, for example, Section 3.3).

The type tv_t is defined as follows:

typedef struct tv_pair_s {
    int tid;      /* Term identifier. */
    float weight; /* Term weight; a simple count in stored vectors. */
} tv_pair_t;

typedef struct tv_s {
    conn_id_t pageid;               /* Connectivity Server page id. */
    int len;                        /* Number of pairs in vector. */
    tv_pair_t e[TV_MAX_VECTOR_LEN]; /* The term-weight pairs. */
} tv_t;

Notice the use of integers to represent terms; as with page ids in the Connectivity Server, we find these more convenient to manipulate than text strings. Notice also the use of float for weights. As mentioned earlier, term vectors in the database use simple counts as weights, so stored vectors always have integral weights. However, applications often normalize vectors or combine them in ways that produce non-integral weights.
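For illustration, here is a minimal sketch of a batch lookup. The Connectivity Server provides conversion between URLs and page identifiers, but we do not name those functions in this paper; conn_url_to_id below is a hypothetical name for that conversion:

#include <stdio.h>

/* Sketch: batch retrieval of vectors for a set of URLs.
   conn_url_to_id is a hypothetical name for the Connectivity
   Server's URL-to-identifier conversion function. */
void print_vector_sizes(const char **urls, int n)
{
    conn_id_t ids[n];
    tv_t vectors[n];

    for (int i = 0; i < n; i++)
        ids[i] = conn_url_to_id(urls[i]);

    /* A single batched call amortizes disk IOs over all n lookups. */
    tv_lookup(n, ids, vectors);

    for (int i = 0; i < n; i++)
        printf("%s: %d terms\n", urls[i], vectors[i].len);
}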

2.3 Data representation

An instance of the Term Vector Database is stored as a single file. This file is split into four sections: the header section, which contains assorted database-wide information; the index section, which maps from page ids to the physical locations of vector data; the term vectors section, which stores the vector data in a compressed format; and the term text section, which maps from term identifiers to the actual term text.

The rest of this subsection describes the implementation details of the vector and index data. Readers not interested in this level of detail may want to skip to the next subsection.

The main section of a Term Vector Database file is the term vectors section, which contains the actual term vector data. This section is a sequence of fixed-length (128-byte) records, each holding one page's vector in compressed form.

This format provides good compression for vector data. However, even with compression, a vector sometimes does not fit into its 128-byte record. In this case, terms are removed from the vector, starting with the least frequent, until it fits.

Recall that page identifiers are a dense set of integers. Given this and the fact that vector records are fixed size, one might think we could simply use page identifiers as indices into the vector section. However, the Connectivity Server has more page identifiers than the Term Vector Database has vectors. This is because the Connectivity Server has page identifiers for many pages on the frontier of the crawl, that is, pages discovered by the crawler but not yet downloaded. To avoid wasting space, we pack vector records densely. To quickly find the vector for a particular identifier, we use the index section.

The index section maps page identifiers to indices of vector records. The index consists of a sequence of 64-byte entries, each providing a mapping for 480 page ids. The first four bytes of an index entry hold an integer that is the base index for the entry. The remaining 60 bytes are a vector of 480 bits indicating whether or not a term vector is present for each id. To compute the index of the vector record for page id P, proceed as follows. Divide P by 480 to find the appropriate index entry; the remainder of that division gives the position of P's bit within that entry. Check that this bit is set. If so, let L be the number of set bits between the start of the index entry and the bit for P, that is, the number of vectors mapped by the same entry that precede P. Add L to the entry's base index, and the result is the record index for P.
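The following sketch makes this computation concrete. The struct layout and the bit ordering within bytes are assumptions; the actual implementation may differ:

#include <stdint.h>

#define IDS_PER_ENTRY 480

typedef struct index_entry_s {   /* 64 bytes total. */
    uint32_t base;    /* Index of the first vector record mapped here. */
    uint8_t bits[60]; /* One presence bit per page id (480 bits). */
} index_entry_t;

/* Returns the record index for page id p, or -1 if no vector is
   stored for p. Assumes least-significant-bit-first ordering within
   each byte, which is an assumption, not a documented detail. */
long tv_record_index(const index_entry_t *index, uint32_t p)
{
    const index_entry_t *e = &index[p / IDS_PER_ENTRY];
    int bit = p % IDS_PER_ENTRY;

    /* Check the presence bit for p. */
    if (!(e->bits[bit / 8] & (1u << (bit % 8))))
        return -1;

    /* L = number of set bits before p's bit within this entry. */
    long l = 0;
    for (int i = 0; i < bit; i++)
        if (e->bits[i / 8] & (1u << (i % 8)))
            l++;

    return (long)e->base + l;
}

With the record index in hand, the byte offset of the vector data is simply that index times the 128-byte record size, plus the offset of the term vectors section within the file.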

2.4 Software components

The Term Vector Database consists of three main software components, illustrated in Fig. 1:


Fig. 1: Software components

The page filter implements the policy described in Section 2.1 to convert pages to vectors. The output of the filter is stored in a "raw format" that is about twice as big as the ultimate format. The Term Vector Database builder converts these raw vectors into the file representation described in Section 2.3; in addition to compressing the vectors, the builder sorts them according to page identifier. The final piece of software is the implementation of the API described in Section 2.2.


3. Experience

The Term Vector Database has been used in a variety of research projects. This section describes our experience so far. First, we describe the build process and give some quantitative data. Next, we describe two projects built on top of the Term Vector Database: topic distillation and page classification.

3.1 Builds of the Term Vector Database

We have built the term vector database on three different crawls of the Web performed by AltaVista. The page filter (the first box in Fig. 1) is run as part of a proprietary process. Once the raw vectors are produced, the builder is run to produce a database file. The most recent build of the Term Vector Database contains 272M vectors and requires 33GB of storage (32.5GB for the term vectors section, 0.5GB for the term text section, and a relatively small amount for the index and header sections). It took 11 hours to build this database on a 16GB AlphaServer ES-40 with four 500 MHz processors. As described in Sec. 2.3, the build process may drop terms when the data for a vector cannot be compressed into 128 bytes. This build dropped terms from 5M vectors, a little under 2%; in these 5M vectors, an average of about 4 terms were dropped per vector.

Fig. 2 gives some histograms of our vector data. The plot on the left gives a histogram of vector sizes; the big jump at "50" occurs because the page filter clips all larger vectors to 50 terms. The plot on the right gives a histogram of term counts. Clearly, small counts dominate: 45% of terms in a vector have a count of one; 99% of terms in a vector have a count of at most 31.


Fig. 2: Term vector statistics

On an AlphaServer 4100, after the index section of the database file has been preloaded into the system's file cache, it takes 17ms to access a random vector (approximately the cost of reading a random disk block).

3.2 Topic distillation

Our first application of the Term Vector Database -- indeed, the motivation for building it -- was topic distillation. Topic distillation is a technique that uses hyperlink connectivity to improve the ranking of web search results (see [Kleinberg 98], [Chakrabarti et al 98], and [Bharat et al 98b] for specific instances of topic distillation algorithms). These algorithms work on the assumption that some of the best pages on a query topic are highly connected pages in a subgraph of the web that is relevant to the topic. A simple mechanism for constructing this query-specific subgraph is to seed the subgraph with top-ranked pages from a standard search engine and then expand this set with other pages in the vicinity of the seed set ([Kleinberg 98]). However, this expanded set sometimes includes highly connected pages that are not relevant to the query topic, reducing the precision of the final result. [Bharat et al 98b] identifies this problem as topic drift.

[Bharat et al 98b] shows that topic drift can be avoided by using topic vectors to filter the expanded subgraph. A topic vector is a term vector computed from the text of pages in the seed set. A page is allowed to remain in the expanded graph only if its term vector is a good match to this topic vector. Specifically, the inner product of the two vectors is compared to a suitable relevance threshold, and pages below the threshold are expunged from the expanded graph.
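As a sketch of this filtering step, the merge below assumes each vector's pairs are sorted by term id, an ordering the API does not specify:

/* Inner product of two term vectors, assuming the pairs in each
   vector are sorted by term id (an assumption about the
   representation, not a documented guarantee). */
float tv_inner_product(const tv_t *a, const tv_t *b)
{
    float sum = 0.0f;
    int i = 0, j = 0;
    while (i < a->len && j < b->len) {
        if (a->e[i].tid < b->e[j].tid)
            i++;
        else if (a->e[i].tid > b->e[j].tid)
            j++;
        else {
            sum += a->e[i].weight * b->e[j].weight;
            i++; j++;
        }
    }
    return sum;
}

/* A page survives filtering only if its vector matches the topic
   vector above a relevance threshold. */
int tv_relevant(const tv_t *page, const tv_t *topic, float threshold)
{
    return tv_inner_product(page, topic) >= threshold;
}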

In the past, we computed term vectors and topic vectors by downloading pages from the web after computing the initial subgraph. Downloading pages to compute vectors proved to be a significant bottleneck. We eliminated this bottleneck by re-implementing the algorithm on top of the Term Vector Database. The dramatic performance improvement provided by the Term Vector Database allows us to increase the size of our seed sets and expanded subgraphs while still reducing the overall time it takes to perform topic distillation.

This example illustrates how the Connectivity Server and Term Vector Database can be combined to produce interesting applications, and why it is beneficial to use page identifiers from the Connectivity Server as keys to the Term Vector Database. In topic distillation, the expanded subgraph is computed using the Connectivity Server, which means the result is expressed in terms of Connectivity Server page identifiers. These identifiers can then be given directly to the Term Vector Database without any translation.

3.3 Classifying web pages

Perhaps one of the most interesting applications we have built so far is page classification.

Page classification is the process of categorizing pages as belonging to one of a number of a priori topics. We use a Bayesian classifier implemented by term vector matching (see [Kalt et al 96] for details). For categories, we use the 12 top-level categories in the Yahoo directory. For each of these categories, a category vector of 10,000 terms was precomputed from a training set of roughly 30,000 pages from Yahoo per category. A page is classified by retrieving its term vector from the Term Vector Database and matching it against each of the category vectors. The category with the highest resulting match score is selected as the category for the page. We allow only one category per page; ambiguous pages, whose vectors are not definitively closer to one category than to all others, are not classified. For efficiency, the category vectors are combined into a sparse matrix implemented as an inverted index of terms.
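A sketch of this matching step follows. The posting-list types, the requirement that the index cover every term id, and the ambiguity margin are all illustrative assumptions, not the actual implementation:

#define NUM_CATEGORIES 12

typedef struct posting_s {
    int category;  /* Category whose vector contains this term. */
    float weight;  /* The term's weight in that category vector. */
} posting_t;

typedef struct postings_list_s {
    int len;
    const posting_t *postings;
} postings_list_t;

/* index[tid] lists the categories whose vectors contain term tid;
   it must have an entry for every term id. Returns the best
   category, or -1 if the page is ambiguous. */
int classify(const tv_t *page, const postings_list_t *index, float margin)
{
    float score[NUM_CATEGORIES] = {0};

    /* One pass over the page's terms scores all 12 categories. */
    for (int i = 0; i < page->len; i++) {
        const postings_list_t *pl = &index[page->e[i].tid];
        for (int j = 0; j < pl->len; j++)
            score[pl->postings[j].category] +=
                page->e[i].weight * pl->postings[j].weight;
    }

    /* Pick the best category, rejecting pages whose best score does
       not beat the runner-up by a definitive margin. */
    int best = 0, second = -1;
    for (int c = 1; c < NUM_CATEGORIES; c++) {
        if (score[c] > score[best]) { second = best; best = c; }
        else if (second < 0 || score[c] > score[second]) second = c;
    }
    if (second >= 0 && score[best] < score[second] + margin)
        return -1; /* Ambiguous: not classified. */
    return best;
}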

We used this classification engine to build a CGI proxy that augments AltaVista search results with category tags, text strings that indicate the dominant topic of the result page. For example, if a page predominantly contains sports-related information, the tag Recreation_or_Sports is used to inform the user of this fact. These tags serve the same purpose as showing titles and abstracts with search results: they allow users to understand the topic of a result page without actually fetching it.

Fig. 3 is a screen shot of the results returned by our proxy given the ambiguous query "bulls take over":


Fig. 3: Example of categorized results

These results are annotated with tags such as Business_and_Economy and Recreation_or_Sports that indicate the general topic of the page.

The "More on this topic" link lets users refine queries according to topic. In Fig. 3, when the user clicks on the "More on this topic" link associated with the second result labeled Recreation_or_Sports, the user gets the refined result given in Fig. 4:


Fig. 4: Refined results

The "More on this topic" link leads to a result page devoted to the activities of the Chicago Bulls basketball team rather than activities on the stock market. (Northern Light's custom folders are a different user interface to the same refinement functionality. Our UI provides the users with information about the predominant topic of items in a result list; Northern Light's UI allows users to refine their search according to a given topic even if a page belonging to that topic is not in the current list of search results.)

For our 12-category configuration, the inverted term index is small enough to keep in memory, so we are able to tag result pages dynamically as they come through the proxy without noticeable overhead. If the overhead is unacceptable, or if a larger set of category vectors needs to be matched, an alternative would be to classify and tag pages in advance.


4. Future work

We will continue using the database to build new applications. However, based on our experience so far, we also plan to make improvements to the Term Vector Database itself.

The biggest area of change will be in the definition of term vectors. We will look for term candidates in more places, such as the alt text of image tags and the anchor text of incoming links. We will support multi-word terms. We will switch to a table-based stemmer, which will both improve performance and allow support for multiple languages. We will explore term weights that are influenced by markup (for example, giving higher weight to terms in titles). And we will explore alternative lexicons.

Another fruitful area of research is improving the compression of vectors. We would like to cut the storage requirements of the database in half, which would allow it to fit in a reasonable amount of RAM, either on a single large machine or partitioned across multiple smaller machines. Our current compression techniques work mostly by eliminating redundancy within individual vectors; we are unlikely to achieve another factor of two by continuing down that path. The next path to explore is eliminating redundancy across vectors.


5. Conclusions

Term vectors have proven themselves to be a robust and useful data structure for information retrieval. Our work has shown that a database that quickly maps from page identifiers to term vectors enables a variety of Web information retrieval applications. We believe this database will be especially interesting when combined with other Web-specific sources of information such as the Connectivity Server.


References

[Chakrabarti et al 98]
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. Proc. 7th International World Wide Web Conference, 1998.
[Baeza-Yates et al 99]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley/ACM Press, 1999.
[Bharat et al 98a]
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The Connectivity Server: fast access to linkage information on the Web. Proc. 7th International World Wide Web Conference, 1998.
[Bharat et al 98b]
K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, August 1998.
[Kalt et al 96]
T. Kalt and W. B. Croft. A new probabilistic model of text classification and retrieval. University of Massachusetts, 1996. http://ciir.cs.umass.edu/info/psfiles/irpubs/ir-78.ps.gz
[Kleinberg 98]
J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Also appears as IBM Research Report RJ 10076, May 1997.
[Porter 80]
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[Salton 71]
G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ, 1971.
[Salton et al 88]
G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[Sparck Jones et al 97]
K. Sparck Jones and P. Willett, eds. Readings in Information Retrieval. Morgan Kaufmann, 1997.

Vitae


Krishna Bharat is a member of the research staff at Google Inc. in Mountain View, California. Formerly he was at Compaq Computer Corporation's Systems Research Center, which is where the research described here was done. His research interests include Web content discovery and retrieval, user interface issues in Web search and task automation, and relevance assessments on the Web. He received his Ph.D. in Computer Science from Georgia Institute of Technology in 1996, where he worked on tool and infrastructure support for building distributed user interface applications.


Farzin Maghoul is a member of the engineering staff at AltaVista Corporation. His research interests include relevance and ranking, semantic topic clusters, and text summarization. He received his MS in Computer Science from Indiana University in 1977.


Raymie Stata received his Ph.D. from the Massachusetts Institute of Technology in 1996. Afterwards, he joined the Systems Research Center as a member of the research staff. He is interested in a broad spectrum of web-related issues, including web information retrieval, extraction, and mining; web security; web servers; and the creation of new web applications.