The Web contains an abundance of useful semi-structured information that can and should be mined. Types of structure include hyperlinks between pages, structure within hypertext pages, and structure within URLs. We have implemented a programming language, Squeal, that facilitates structure-based queries. Specifically, the Squeal user can query the Web as if it were in a standard relational database. We describe Squeal and show the ease of writing structure-based information tools in Squeal.
The success of the Web is a testament to the importance of structure. Web pages consist not only of text but also of intra-document structure (via annotations indicating headers, lists, and formatting directives) and inter-document structure (hyperlinks). Even the uniform resource locator (URL) identifying a page is structured: typically a host name followed by a location in the file hierarchy or a set of query keywords and values. All of these types of information are used automatically by human readers (through browser software) but have been awkward for programmers to make use of in their search tools. Some examples of structure-based queries are:
We designed Squeal to allow programmers to easily express these types of queries.
Rather than invent a new query language for structured information, we decided to build on the most popular query language for databases: Structured Query Language (SQL). Benefits of this decision include:
In the next section, we will present Squeal. In the following section, we will demonstrate the ease of writing structural queries in Squeal. We will conclude with a discussion of the implications and a comparison to other database-inspired languages for querying the Web.
Squeal provides users with the illusion that the Web is stored and organized in a relational database. Squeal consists of two parts: (1) a schema describing the Web and (2) an implementation that answers queries about the schema (even though the entire Web is not really in a database).
A schema describes the structure of a relational database, i.e., the tables, fields, and the relationships between them. For example, the schema for an employee database might include a table employee with fields first_name, last_name, and social_security_number, where each employee has a distinct social_security_number, but different employees may have the same first_name or last_name.
The following description of the Squeal schema is a slight oversimplification. (The real tables are more normalized and have additional fields.) For full information on the Squeal schema, see [Spertus 1998].
The page table, which describes a Web page, has the following fields:
The following are examples of legal queries (with SQL keywords in capital letters and comments preceded by slashes):
// What text is on the page "http://www9.org"?SELECT contents
FROM page
WHERE url="http://www9.org";
// What pages contain "hypertext" and have fewer than 1000 bytes?SELECT url
FROM page
WHERE contents LIKE '%hypertext%'
AND bytes < 1000;
The tag table, which describes each html tag on a page, has the following fields:
Some legal queries are:
// What pages contain the word "hypertext" and contain a picture?SELECT url
FROM page p, tag t
WHERE p.contents LIKE "%hypertext%"
AND t.url = p.url
AND t.name = "IMG";
// What tags appear on the page "http://www9.org"?SELECT name
FROM tag
WHERE url="http://www9.org";
The att table contains information about the attribute names and values associated with each tag. (For example, consider the hypertext "<IMG SRC=foo.gif ALIGN=center>". The tag value is "IMG", the first attribute The fields of the att table are:
Some legal queries are:
// What are the values of the SRC attributes associated with IMG tags on "http://www9.org"?SELECT a.value
FROM att a, tag t
WHERE t.url = "http://www9.org"
AND t.name = "IMG"
AND a.tag_id = t.tag_id
AND a.name = "SRC";
// What pages are pointed to by "http://www9.org"?SELECT a.value
FROM att a, tag t
WHERE t.url = "http://www9.org"
AND t.name = "A"
AND a.tag_id = t.tag_id
AND a.name = "HREF";
While it is possible to inquire about hyperlinks with the tag and att table, we also provide a more convenient link table with the following fields:
Some legal queries are:
// What pages are pointed to by "http://www9.org"?SELECT destination_url
FROM link
WHERE source_url = "http://www9.org";
// What pages point to "http://www9.org"?SELECT source_url
FROM link
WHERE destination_url = "http://www9.org";
// What pages are pointed to via hyperlinks with anchor text "Web conference"?SELECT destination_url
FROM link
WHERE anchor = "Web conference";
The utility of the hstruct and lstruct fields will be shown later.
The parse table describes the structure of a URL. It has the following fields:
For example, the URL "http://www.ai.mit.edu:80/people/index.html#S" would appear in the parse table as follows:
| component | value | depth |
|---|---|---|
| host | www.ai.mit.edu | - |
| port | 80 | - |
| path | index.html | 1 |
| path | people | 2 |
| ref | S | - |
We are planning to also add support for parsing parameterized arguments.
We have presented the highlights of the Squeal schema and shown some SQL queries that can be made on it. Note that all of the queries shown are standard SQL, thus the user need only learn our schema (just as one would to access any database) and not a new query language.
The Squeal interpreter provides the illusion that the Web is in a relational database, with certain limitations because it is impractical (if not impossible [Abiteboul and Vianu 1997]) to determine complete information about the Web. Squeal responds to user queries by querying search engines and fetching and parsing Web pages, which it puts into a small local off-the-shelf relational database. Once the interpreter has caused the relevant information to be placed in the local database, it passes through the user query and returns the response. Because information is put into the database only when demanded, we call this a "just-in-time database" [Spertus and Stein 1998].
Let us look at how one of our sample queries would be interpreted:
// What pages are pointed to by "http://www9.org"?SELECT destination_url
FROM link
WHERE source_url = "http://www9.org";
The Squeal interpreter would respond as follows:
SELECT destination_url FROM link WHERE source_url = "http://www9.org") to the local database server, which has a SQL interpreter. It returns a list of URLs, which the Squeal interpreter returns to the user.
Let us look at another query:
// What pages point to "http://www9.org"?SELECT source_url
FROM link
WHERE destination_url = "http://www9.org";
The Squeal interpreter would respond as follows:
SELECT source_url FROM link WHERE destination_url = "http://www9.org") to the local database server, which will return a list of URLs, which the Squeal interpreter returns to the user.
The interpreter is described in more detail elsewhere [Spertus 1998, Spertus and Stein 1998].
We call Squeal applications ParaSites because they exploit information on the Web in a manner unintended by the information's authors. Our goal is not to argue that these are the best possible applications but to show the variety of useful structural queries and the ease with which they can be implemented in Squeal. Full details about the applications, including evaluations, can be found in [Spertus 1998].
One useful class of information retrieval applications is recommender systems [Resnick and Varian 1997], where a program recommends new Web pages (or some other resource) judged likely to be of interest to a user, based on the user's initial set of seed pages P. The standard text-based technique for recommender systems, used by the Excite search service (www.excite.com), is extracting keywords that appear on the seed pages and returning pages that contain these keywords. Note that this technique is based purely on the text of a page, independent of any inter- or intra-document structure.
Another technique for making recommendations is collaborative filtering [Shardanand and Maes 1995], where pages are recommended that were liked by other people who liked P. This is based on the observation that items thought valuable/similar by one user are likely to by another user. As collaborative filtering was currently practiced, users explicitly rated pages to indicate their recommendations. This inconvenient and expensive step can be eliminated through data mining by interpreting the act of creating hyperlinks to a page as being an implicit recommendation. In other words, if a person links to pages Q and R, we can guess that people who like Q may like R, especially if the links to Q and R appear near each other on the referencing page (such as within the same list). This mines intra-document structural information.
Accordingly, if a user requests a page similar to a set of pages {P1, …, Pn}, the system can find (e.g., through AltaVista) pages R that point to a maximal subset of these pages and then return to the user what other pages are referenced by R. Note that the ParaSite does not have to understand what the pages have in common. It just needs to find a list that includes the pages and can infer that whatever trait they have in common is also exemplified by other pages they point to.
For example, the first page returned from AltaVista that pointed to both Electronic Privacy Information Center ("www.epic.org") and Computer Professionals for Social Responsibility ("www.cpsr.org/home.html") was a list of organizations fighting the Communications Decency Act; links included the Electronic Frontier Foundation ("www.eff.org") and other related organizations.
We are not the only ones to use this technique, which derives from citation indexing [Kessler 1963, Small 1973], and many people have applied this technique to the Web [Kleinberg 1998, Dean and Henzinger 1999, Rousseau 1997]. What is novel is the ease with which such heuristics can be written and tested.
We use the following algorithm to find pages similar to P1 and P2:
The corresponding Squeal code is:
SELECT link3.destination_url, COUNT(*)
FROM link link1, link2, link3
WHERE link1.destination_url = p1
AND link2.destination_url = p2
AND link1.source_url = link2.source_url
AND link2.source_url = link3.source_url
GROUP BY link3.destination_url
ORDER BY COUNT(*) DESC;
Some heuristics for improving precision are:
This last heuristic was motivated by the observation that some pages contains hundreds or thousands of links and that the most similar pairs of links are likely to be within the same list or under the same header. It makes use of the hstruct and lstruct fields described above.
The Squeal code for the recommender system, including the requirement that links occur within the same list, is:
SELECT link3.destination_url, COUNT(*)
FROM link link1, link2, link3
WHERE link1.destination_url = p1
AND link2.destination_url = p2
AND link1.source_url = link2.source_url
AND link2.source_url = link3.source_url
AND link1.lstruct = link2.lstruct
AND link2.lstruct = link3.lstruct
GROUP BY link3.destination_url
ORDER BY COUNT(*) DESC;
While the syntax may be confusing to people not familiar with SQL, note the relative size of the code and the English description. We know of no other computer language where this query of the Web can be expressed more concisely.
When we ran this program with seed pages www.now.org (National Organization for Women) and www.feminist.org (The Feminist Majority Foundation), we got the following results:
| URL | Count | Description |
|---|---|---|
| www.aauw.org | 4 | American Assoc. of University Women |
| www.aclu.org | 4 | American Civil Liberties Union |
| www.feminist.org/gateway/womenorg.html#top | 4 | Directory of Women's Organizations |
| www.cs.cmu.edu/afs/cs.cmu.edu/user/mmbt/www/women/writers.html | 3 | A Celebration of Women Writers |
| a href="http://www.cs.utk.edu/~bartley/other/ISA.html | 3 | Incest Survivors Anonymous |
| www.democrats.org | 3 | Democratic National Committee |
| www.igc.org/igc/womensnet/ | 3 | WomensNet |
| www.pfaw.org | 3 | People for the American Way |
A new type of application made necessary by the Web is a tool to find users' personal home pages, given their name and perhaps an affiliation. Like many information classification tasks, determining whether a given page is a specific person's home page is an easier problem for a person to solve than for a computer. Consequently, our ParaSite's primary strategy is not determining directly if a page "looks like" a home page but finding pages that human beings have labeled as being someone's home page. While there is no single stereotypical title for home pages, there is for the anchor text of hyperlinks to them: the author's name. This can be done in Squeal for the name "Pattie Maes" as follows:
// Create a table to store candidate pagesCREATE TABLE candidate (url VARCHAR(1024));
// Populate table with destinations of links with anchor text "Pattie Maes"INSERT INTO candidate (url)
SELECT destination_url
FROM link
WHERE anchor = "Pattie Maes";
Observe that this takes advantage of inter-document structure. In contrast, the Ahoy! home page finder [Shakes et al 1997] generates candidates by searching for the name anywhere in a document.
Once we have the candidate pages, we can make use of intra-document information to rank them. For example, while it is promising if the name appears anywhere on the page, it is most promising if the name appears in the title field or a header field:
// Create a table to store ranked resultsCREATE TABLE result (url VARCHAR(1024), score INT);
// Give a page 5 points if it contains the name anywhereINSERT INTO result (url, score)
SELECT destination_url, 5
FROM candidate c, page p
WHERE p.url = c.url
AND p.contents LIKE '%Pattie Maes%';
// Give a page 10 points if it contains the name in the titleINSERT INTO result (url, score)
SELECT destination_url, 10
FROM candidate c, tag t, att a
WHERE t.url = c.url
AND t.name = "TITLE"
AND a.tag_id = t.tag_id
AND a.name = "anchor"
AND a.value LIKE '%Pattie Maes%';
The structure of the URL can also be used. The URL of Pattie Maes's home page is: "http://pattie.www.media.mit.edu/people/pattie/". This is easily recognized as a likely home page because:
The last heuristic would be added to the code as follows:
// Give a page 10 points if the penultimate directory
// is "home[s]" or "people".INSERT INTO result (url, score)
SELECT destination_url, 10
FROM candidate c, parse p
WHERE p.url_value = c.url
AND p.component = "path"
AND p.depth = 2
AND (p.value = "people" OR p.value = "homes" OR p.value = "home");
The Squeal code to display the resulting URLs and their total number of points is simply:
SELECT url, SUM(*)
FROM result
GROUP BY url ORDER BY SUM(*) DESC;
Other heuristics, with their Squeal encodings, are provided in [Spertus 1998]. The top results for the full version are:
| url | SUM(*) |
|---|---|
| pattie.www.media.mit.edu/people/pattie/ | 33 |
| lcs.www.media.mit.edu/people/pattie/ | 23 |
| www.media.mit.edu/~pattie | 16 |
We implemented a simple variant of the program encodes and returns reasons for each decision [Spertus 1998]. (The code is not shown here because we wish to emphasize what is unique about Squeal and not the power of SQL.) Some of the reasons the system provided for its top decision were:
| score | reason |
|---|---|
| 10 | File name is the empty string |
| 10 | Penultimate directory is: "people" |
| 2 | Anchor from "http://lcs.www.media.mit.edu/groups/agents/papers.html" is the full name: "Pattie Maes" |
| 2 | Anchor from "http://nif.www.media.mit.edu/" is the full name: "Pattie Maes" |
| 1 | Anchor from "http://aries.www.media.mit.edu/people/aries/home-page.html" includes the full name: "Prof. Pattie Maes" |
Users frequently encounter "broken links" (obsolete URLs) while searching and browsing. In 1997, the Web Characterization Group found that 5-8% of file requests were for broken links [Pitkow 98]. We provide two structure-based techniques for tracking down moved pages.
Consider the following blurb, returned by HotBot (www.hotbot.com) in response to the query "Lenore Blum 1943":
Lenore Blum
Lenore Blum 1943 Written by Lisa Hayes, Class of 1998 (Agnes Scott College) Lenore Blum was a bright and artistic child who loved math, art, and music from her original introductions to them. Blum finished high school at the age of 16, after which...
http://www.scottlan.edu/lriddle/women/BLUM.HTM, 5359 bytes, 27Apr97
The goal of a moved-page finder is to find the new URL Unew given the information in an out-of-date blurb, i.e., the invalid URL Ubad and the title of the page. In this example, Ubad is "www.scottlan.edu/lriddle/women/BLUM.HTM", and the title is "Lenore Blum".
We can create URL Ubase by removing directory levels from Ubad until we obtain a valid URL. We can then crawl from Ubase in search of a page with the given title. This is based on the intuition that someone who cared enough about the page to house it in the past is likely to at least link to the page now. In this example, the page was quickly found; its new name was "http://www.scottlan.edu/lriddle/women/blum.htm".
People who pointed to a URL Ubad in the past are some of the most likely people to point to Unew now, either because they were informed of the page movement or took the trouble to find the new location themselves. Here is a heuristic based on that observation:
A question is how to recognize when we've found the target page. We do this by looking for the known title text or letting the user specify a key phrase. The code to implement both of these techniques appears in [Spertus 1998].
Because the Web contains useful structural information, it is important to be able to make structure-based queries. We built such a system, Squeal, on top of the most popular structured query language, SQL. By making use of our schema and implementation, any person familiar with SQL can use Squeal to make powerful queries on the Web.
We described three applications that make use of the Web's structural information and showed or sketched their Squeal implementations. Our point was not to argue that these are the best possible such applications (although evaluations have been encouraging [Spertus 1998]) but to give an idea of the structure-based queries one might want to make and to show how easy it is to do so in Squeal.
An extractor developed within the TSIMMIS project uses user-specified wrappers to convert web pages into database objects, which can then be queried [Hammer et al 1997]. Specifically, hypertext pages are treated as text, from which site-specific information (such as a table of weather information) is extracted in the form of a database object. This is in contrast to our system, where each page is converted into a set of database relations according to the same schema.
This work is influenced by WebSQL, a language that allows queries about hyperlink paths among Web pages, with limited access to the text and internal structure of pages and URLs [Mihaila 1996, Mendelzon 1997, Arocena 1997]. In the default configuration, hyperlinks are divided into three categories, internal links (within a page), local links (within a site), and global links. It is also possible to define new link types based on anchor text; for example, links with anchor text "next". All of these facilities can be implemented in our system, although WebSQL's syntax is more concise. While it is possible to access a region of a document based on text delimiters in WebSQL, one cannot do so on the basis of structure. Some queries we can express but not expressible in WebSQL are:
W3QL is another language for accessing the web as a database, treating web pages as the fundamental units [Konopnicki 1998]. Information one can obtain about web pages includes:
For example, it is possible to request that a specific value be entered into a form and to follow all links that are returned, giving the user the titles of the pages. It is not possible for the user to specify forms in our system (or in WebSQL), access to a few search engines being hardcoded. Access to the internal structure of a page is more restricted than with our system. In W3QL, one cannot specify all hyperlinks originating within a list, for example.
An additional way in which Squeal differs from all of the other systems is in providing a data model guaranteeing that data is saved from one query to the next and (consequently) containing information about the time at which data was retrieved or interpreted. Because the data is written to a SQL database, it can be accessed by other applications. Another way our system is unique is in providing equal access to all tags and attributes, unlike WebSQL and W3QL, which can only refer to certain attributes of links and provide no access to attributes of other tags. Furthermore, Squeal is the only system that is built on genuine SQL, with the full power of that language.
Some powerful query systems have been built for Extensible Markup Language (XML), including Ozone [Lahiri et al 1998] and XML-QL [Deutsch et al 1999], both of which are influenced by relational and object-oriented databases. Because XML tags are customized and pages more structured than in SQL, these systems are able to support queries that focus more on semantics than syntax. Despite the different domain, the XML and Ozone work suggests that Squeal would benefit from support for object-oriented queries.
We are in the process of creating:
For the latest information about our project, see http://parasite.mills.edu.
We were assisted in this research by Oren Etzioni, Keith Golden, Tom Knight, and Pattie Maes. We also benefited from interaction with Alberto Mendelzon's WebSQL group and with Alexa Internet. This paper was improved by comments from anonymous reviewers. Ellen Spertus is partially supported by a National Science Foundation Career Grant.
Ellen Spertus is an Assistant Professor in the Department of Mathematics and Computer Science at Mills College in Oakland, California. She received her S.B. (1990), S.M. (1992), and PhD (1998) in computer science from MIT, where she worked with Prof. Lynn Andrea Stein in information retrieval and Prof. William J. Dally in computer architecture. She has also worked at Microsoft Research and been a visiting scholar at the University of Washington.
Lynn Andrea Stein is an Associate Professor in the Department of Electrical Engineering and Computer Science and a member of the Artificial Intelligence Laboratory and the Laboratory for Computer Science at MIT. She received the A.B. degree in computer science from Harvard and Radcliffe Colleges in 1986 and the Sc.M. (1987) and Ph.D. (1990) from the Department of Computer Science at Brown University. Her research and teaching center on nontraditional computational architectures supporting interaction among programs and users.