With the explosive growth of the World-Wide Web, it is becoming increasingly difficult for users to collect and analyze Web pages that are relevant to a particular topic. To address this problem we are developing WTMS, a system for Web Topic Management. In this paper we explain how the WTMS crawler efficiently collects Web pages for a topic. We also introduce the user interface of the system that integrates several techniques for analyzing the collection. Moreover, we present the various views of the interface that allow navigation through the information space. We highlight several examples to show how the system enables the user to gain useful insights about the collection.
The World-Wide Web is undoubtedly the best source for getting information on any topic. Therefore, more and more people use the Web for topic management, the task of gathering, evaluating and organizing information resources on the Web. Users may investigate topics for professional or personal interests.
Generally, popular portals and search engines like Yahoo and AltaVista are used for gathering information on the WWW. However, with the explosive growth of the Web, topic management is becoming an increasingly difficult task. On the one hand, most queries retrieve a large number of documents, which are presented as pages of scrolled lists; going through these pages to find the relevant information is tedious. On the other hand, the Web has over 350 million pages and continues to grow rapidly at about a million pages per day . Such growth and flux pose basic limits of scale for today's generic search engines. Thus, much relevant information may not have been gathered, and some of the gathered information may not be up-to-date.
Because of these problems, there is growing awareness that for serious Web users, focussed portholes are more useful than generic portals . Therefore, systems that allow the user to collect and organize the information related to a particular topic, and that allow easy navigation through this information space, are becoming essential. Such a Web Topic Management System should have several features to be really useful:
We are building WTMS, a Web Topic Management System to allow the collection and analysis of information on the Web related to a particular topic. This paper discusses the various features of the system. The next section cites related work. Section 3 explains how the focussed crawler of WTMS collects information from the WWW relevant to a topic. Section 4 presents the various views of the system that allow the user to navigate through the information space. Several graph algorithm based techniques to analyze the collection are introduced in section 5. Finally, section 6 concludes the paper.
Several systems for visualizing WWW sites have been developed; examples include the Navigational View Builder , the Harmony Internet Browser  and Narcissus . Visualization techniques for World-Wide Web search results are also being developed. For example, the WebQuery system  visualizes the results of a search query along with all pages that link to or are linked to by any page in the original result set. Another example is WebBook , which allows the results of a search to be organized and manipulated in various ways in a 3D space. In this paper our emphasis is on developing views that allow the user to navigate through Web pages about a particular topic and to gain useful insights about the collection.
In recent times there has been much interest in collecting Web pages related to a particular topic. The shark-search algorithm for collecting topic-specific pages is presented in . Another focussed crawler is presented in . The crawler used in WTMS is similar to these systems. However, it uses several heuristics to improve performance.
Another interesting problem is determining the important pages in a collection. Kleinberg defines the HITS algorithm to identify authority and hub pages in a collection . Authority pages are authorities on a topic and hubs point to many pages relevant to the topic. Therefore, pages with many in-links, especially from hubs, are considered to be good authorities. On the other hand, pages with many out-links, especially to authorities, are considered to be good hubs. The algorithm has been refined in CLEVER  and Topic Distillation . We feel that this algorithm is very important for a Web Topic Management system. However, we also believe that determining the hub and authority sites for a topic is more useful than determining the hub and authority pages.
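To make the mutual-reinforcement idea concrete, here is a minimal sketch of the basic HITS iteration (our simplified rendition, not the CLEVER or Topic Distillation refinements). `links` maps each page to the set of pages it points to; the iteration count is an illustrative choice:

```python
# Illustrative sketch of Kleinberg's HITS iteration.
# `links`: dict mapping each node to the set of nodes it links to.
def hits(links, iterations=50):
    nodes = set(links) | {v for vs in links.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page pointed to by good hubs is a good authority.
        auth = {n: sum(hub[u] for u in nodes if n in links.get(u, ()))
                for n in nodes}
        # A page pointing to good authorities is a good hub.
        hub = {n: sum(auth[v] for v in links.get(n, ())) for n in nodes}
        # Normalize so scores stay in [0, 1].
        for d in (auth, hub):
            norm = max(d.values()) or 1.0
            for n in d:
                d[n] /= norm
    return hub, auth
```

Repeated reinforcement with normalization converges, so pages pointed to by strong hubs surface as authorities and vice versa.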
Mapuccino (formerly WebCutter) , ,  and TopicShop ,  are two systems that have been developed for WWW Topic Management. Both systems use a crawler for collecting Web pages related to a topic and use various types of visualization to allow the user to navigate through the resultant information space. While Mapuccino presents the information as a collection of Web pages, TopicShop presents the information as a collection of Web sites. As emphasized in the introduction, we believe that it is more effective to present the information at various levels of abstraction depending on the user's focus. Moreover, a topic management system should allow the user to use several techniques to analyze the information space.
The architecture of the Web Topic Management System is discussed in . The system uses a focussed crawler to collect Web pages related to a user-specified topic. In this section we will discuss the basic strategy for focussed crawling and how we can improve performance by using some heuristics.
For collecting WWW pages related to a particular topic the user has to first specify some seed urls relevant to the topic. Alternatively, the user can specify the keywords for the topic and the crawler issues a query with the specified keywords to a popular Web search engine and uses the results as the seed urls. The WTMS crawler downloads the seed urls and creates a representative document vector (RDV) based on the frequently occurring keywords in these urls.
The crawler then downloads the pages that are referenced from the seed urls and calculates their similarity to the RDV using the vector space model . If the similarity is above a threshold, the pages are indexed by a text search engine and the links from the pages are added to a queue. The crawler continues to follow the out-links until the queue is empty or a user-specified limit is reached.
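In the vector space model, this relevance test reduces to the cosine between term-frequency vectors. The sketch below is only an illustrative reading of that step; the actual WTMS tokenizer, term weighting (e.g. tf-idf) and threshold value are not specified here:

```python
import math
from collections import Counter

# Cosine similarity between two bags of words (plain term frequencies;
# WTMS's actual weighting scheme may differ).
def cosine_similarity(doc_words, rdv_words):
    a, b = Counter(doc_words), Counter(rdv_words)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

# The 0.2 threshold is an illustrative assumption, not the paper's value.
def is_relevant(page_text, rdv, threshold=0.2):
    return cosine_similarity(page_text.lower().split(), rdv) >= threshold
```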
The crawler also determines the pages pointing to the seed urls. (A query link:u to search engines like AltaVista and Google returns all pages pointing to url u). These pages are downloaded and if their similarity to the RDV is greater than a threshold, they are indexed and the urls pointing to these pages are added to the queue. The crawler continues to follow the in-links until the queue is empty or a user-specified limit is reached.
After crawling, the collection consists of the seed urls as well as all pages similar to these seed urls that have paths to or from the seeds. We believe that this collection is a good source of information available on the WWW for the user-specified topic. It should be noted that the crawler has a stop url list to avoid downloading some popular pages (like Yahoo's and Netscape's Home pages) as part of the collection.
This focussed crawling strategy is similar to the techniques described in  and . Most focussed crawlers start with some seed urls and follow the out-links (sometimes the in-links also). Pages that are relevant to the topic of interest are indexed and the links from these pages are followed in turn. The relevancy can be determined by various techniques like the vector space model (as in ) or a classifier (for example in ).
The main bottleneck of a crawler is the time spent downloading Web pages. Besides network congestion, a crawler needs to follow the convention of issuing at most one download request to a site per 30 seconds. Therefore, downloading many pages from a single site may take a long time. For a focussed crawler only Web pages relevant to the topic are important, so many of the downloaded pages may have to be discarded. In fact, using our basic focussed crawler for topics as diverse as World Cup Soccer, Information Visualization and Titanic, less than 50% of the downloaded pages turned out to be relevant. If we could determine that a page is irrelevant without examining its contents, we could avoid downloading it, thereby improving performance. We use two heuristics for this purpose.
If a Web site has information related to several topics, a page in the Web site for one of the topics may have links to pages on the other topics or to the main page of the site for ease of navigation. For example, http://www.discovery.com/area/science/titanic/weblinks.html, a page relevant to Titanic, has a link to http://www.discovery.com/online.html, the main page of Discovery Online, a page not related to the topic. However, since most Web sites are well organized, pages that are dissimilar do not occur near each other in the directory hierarchy.
Therefore, when we examine the pages linked to or from a Web page, we need to download a linked page only if it is near the current page. The determination of nearness is an optimization between the number of irrelevant downloads and the number of relevant pages that are not downloaded. If we use a strict criterion to determine the nearness between two pages, the number of downloads will be lower, but we may miss some relevant pages. On the other hand, a lenient criterion will retrieve all the relevant pages, but at the cost of more downloads.
Figure 1 shows how nearness is determined by the WTMS crawler. Suppose a page in the directory A/B is the current page. Then pages in the same Web site are considered to be near (and therefore downloaded) if and only if they belong to the directories shown in the figure. Thus pages in the parent directory (A) as well as in any child directories (C, D) are considered to be near. Pages in sibling directories and their subdirectories (E, F, G) are also downloaded. After crawling several Web sites, we found that this definition of nearness gives the optimal result. It should be noted that if a page has a link to or from a page in another Web site, we have to download the page (unless it is in the stop url list). Also note that if a url contains any of the topic keywords, the page is always downloaded; so all pages from http://www.titanicmovie.com are downloaded for the Titanic collection.
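Our reading of the Figure 1 rule can be sketched as a simple path test; the precise rule WTMS uses may differ in detail (this version treats everything under the parent directory as near):

```python
from pathlib import PurePosixPath

# Sketch of the directory-nearness heuristic of Figure 1 (our reading).
# For a page in directory A/B, "near" = the parent directory A itself,
# plus anything in the subtree rooted at A: the page's own directory,
# its child directories, and sibling directories with their subtrees.
def is_near(current_dir: str, candidate_dir: str) -> bool:
    cur = PurePosixPath(current_dir)
    cand = PurePosixPath(candidate_dir)
    parent = cur.parent
    return cand == parent or parent in cand.parents
```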
Because Web sites are well organized, most pages in the same directory generally have similar themes. Thus all pages in http://www.murthy.com/txlaw/ talk about tax laws. One of these pages, http://www.murthy.com/txlaw/txwomsoc.html, was retrieved by a query to Google with the keywords "world cup soccer" since it talks about visa issuance for the Women's World Cup Soccer tournament. However, none of the other pages in the directory are relevant to the collection on World Cup Soccer. The basic crawler nevertheless downloaded all these pages, only to discard them after determining their similarity to the RDV.
To avoid this problem, during crawling we keep a count of the number of pages that have been indexed and ignored for each directory. If more than 25 pages of a directory have been downloaded and 90% of those pages were ignored, we do not download any more pages from that directory.
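This pruning heuristic amounts to small per-directory bookkeeping. In the sketch below, the 25-page and 90% thresholds are the values stated above; everything else (names, structure) is illustrative:

```python
from collections import defaultdict

# Sketch of the directory-pruning heuristic: stop downloading from a
# directory once at least `min_pages` pages were fetched from it and
# `reject_ratio` of them were rejected as irrelevant.
class DirectoryStats:
    def __init__(self, min_pages=25, reject_ratio=0.9):
        self.downloaded = defaultdict(int)
        self.rejected = defaultdict(int)
        self.min_pages = min_pages
        self.reject_ratio = reject_ratio

    def record(self, directory, relevant):
        self.downloaded[directory] += 1
        if not relevant:
            self.rejected[directory] += 1

    def should_skip(self, directory):
        n = self.downloaded[directory]
        return (n >= self.min_pages
                and self.rejected[directory] / n >= self.reject_ratio)
```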
|Collection|Information Visualization|World Cup Soccer|Titanic|
|% Rejected Directories|4.56|6.14|6.9|
|% Relevant Pages Missed|4.26|4.04|2.56|
|Average Score of Missed Pages|0.261|0.279|0.254|
Table 1 shows the comparison between the basic crawler and the enhanced crawler for 3 collections: Information Visualization, World Cup Soccer and Titanic. The following statistics are shown: the percentage of directories rejected by the heuristics, the percentage of relevant pages that were missed, and the average similarity score of the missed pages.
Thus, our enhancements were able to significantly reduce the download time without missing many relevant pages of the collection.
After the information about a particular topic has been crawled and indexed, a WTMS server is initialized. Any user can access the WTMS server from a Web browser. The WTMS user interface is opened in the client browser as a Java (Swing) applet. Since most users won't require all the collected information, the applet initially loads only a minimal amount of information. Based on user interactions, the applet can request further information from the WTMS server. XML is used for information exchange between the WTMS server and clients.
The WTMS interface provides various types of views to allow navigation through the gathered information space. This section gives a brief description of the WTMS interface applet.
In WTMS the collected information is organized into an abstraction hierarchy as discussed in . The Web pages relevant to a topic downloaded by the crawler are the smallest units of information in the system. The pages can be grouped into their respective physical domains; for example all pages in www.cnn.com can be grouped together. Many large corporations use several Web servers based on functionality or geographic location. For example, www.nec.com and www.nec.co.jp are the Web sites of NEC Corporation in the US and Japan respectively. Similarly, shopping.yahoo.com is the shopping component of the Yahoo www.yahoo.com portal. Therefore, we group together related physical domains into logical Web sites by analyzing the urls. (In other cases we may need to break down the domains into smaller logical Web sites. For example, for corporations like Geocities and Tripod that provide free home pages, the Web sites www.geocities.com and members.tripod.com are actually collections of home pages with minimal relationships among them. However, this technique has not yet been incorporated into WTMS.)
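One simple way to approximate this grouping is to key each host on the label just left of its public suffix, so www.nec.com and www.nec.co.jp both map to "nec". The sketch below uses a tiny illustrative suffix list; WTMS's actual url-analysis rules are not reproduced here:

```python
from urllib.parse import urlsplit

# Illustrative subset of public suffixes; a real system would use a
# complete suffix list.
SUFFIXES = {"com", "org", "net", "edu", "gov", "co.jp", "ac.uk"}

# Sketch: map a page url to a logical-site key by taking the host label
# immediately left of the public suffix.
def logical_site(url: str) -> str:
    host = urlsplit(url).netloc.lower()
    labels = host.split(".")
    for cut in (2, 1):  # try two-label suffixes like "co.jp" first
        if ".".join(labels[-cut:]) in SUFFIXES and len(labels) > cut:
            return labels[-cut - 1]
    return host
```

With this heuristic, www.nec.com and www.nec.co.jp share the key "nec", and shopping.yahoo.com folds into "yahoo".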
Most HTML authors organize the information into a series of HTML pages for ease of navigation. Moreover, readers generally want to know what is available on a given site, not on a single page. Therefore, logical Web sites are the basic unit of information in WTMS. Thus, for calculating the hub and authority scores, we apply the algorithm introduced in  to logical sites instead of individual Web pages.
Figure 2 shows the initial view of the WTMS interface for a collection on World Cup Soccer. It shows a table containing various information about the logical Web sites that were discovered for the topic. Besides the url and the number of pages, it shows the hub and authority scores. (These scores have values between 0 and 1). The table also shows the title of the main page of the sites. The main page is determined by the connectivity and the depth of the page in the site hierarchy as discussed in . Note that generally while calculating the hub and authority scores, intra-site links are ignored . Assuming that all Web pages within a site are by the same author, this removes the author's judgment while determining the global importance of a page within the overall collection. However, while determining the local importance of a page within a site, the author's judgment is also important. So the intra-site links are not ignored while determining the main page of a site.
The table gives a good overview of the collection. By clicking on a label at the top, the user can sort the table by that particular statistic. For example, in Figure 2 the table is sorted by the authority scores, and some authorities for information on World Cup Soccer are shown at the top.
A logical site has several physical domains. The domains consist of directories, and each directory has several Web pages and sub-directories. WTMS allows users to view information about a collection at various levels of abstraction depending on their focus. We discuss some of these visualizations in this subsection. Note that WTMS also allows the user to group related sites, as discussed in section 5.
(a) The logical web site is selected
(b) A directory within the site is selected
Figure 3 shows the details of the site seawifs.gsfc.nasa.gov, the highest authority for the collection on Titanic. In Figure 3(a) the logical site itself is selected. The constituents of the site, as well as sites having links to and from the selected site, are shown as glyphs (graphical elements). The figure shows that the logical site has several physical domains like rsd.gsfc.nasa.gov and seawifs.gsfc.nasa.gov. Notice that if a physical domain has just one directory or page (for example daac.gsfc.nasa.gov), then the glyph for that directory or page is shown directly. The brightness of a glyph is proportional to the last modification time of the freshest page in the group. Thus, bright red glyphs indicate groups which contain very fresh pages (for example, www.dimensional.com) and black glyphs indicate old pages (for example, www.nationalgeographic.com).
The view also shows the links to and from the currently selected site. If another site has a link to the selected site, it has an arrow pointing out. Similarly, if a site has a link from the selected site, it has an arrow pointing in. Thus, Figure 3(a) shows that www.marinemeuseum.org has both links to and from seawifs.gsfc.nasa.gov. The arrows to and from the children of the selected site give an indication of their connectivity. For example, the figure shows that the domain rsd.gsfc.nasa.gov has only out-links while seawifs.gsfc.nasa.gov has both in-links and out-links. The thickness of an arrow is proportional to the number of links, while its color indicates whether the links are inter-site or intra-site. Green indicates intra-site links; thus all links from pages in the domain rsd.gsfc.nasa.gov are to other pages within the same logical site. Blue indicates inter-site links; thus all links from the other sites are blue. A combination of inter-site and intra-site links is indicated by cyan; thus seawifs.gsfc.nasa.gov has both inter-site and intra-site in-links and out-links.
The user can click on a glyph with the right mouse button to see more details. For example, in Figure 3(b) the user has navigated further down the hierarchy and selected the directory seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML. The relevant Web pages in the directory are shown. The directory has a page with a large number of in-links and out-links, titanic.html. Note that the glyph is a circle for a Web page and a rectangle for a group. Further, the label of the currently selected node is highlighted and the rectangles representing the currently selected path are not filled.
Clicking on a glyph with the left mouse button shows information about the corresponding page or site. Thus, Figure 4 shows information about the Georgia Tech Graphics, Visualization & Usability Center Home Page, a page in the Information Visualization collection. This dialog box allows the user to visit the page, see the links to and from the page, or see the related pages. For example, Figure 5 shows the url and title, as well as the structural and semantic similarity values, of pages related to the above page. The table is sorted by the semantic similarity of a page to the selected page, which is determined by the vector space model . The structural similarity of a page is determined by the links of the page with respect to the selected page and is calculated as follows:
The WTMS interface allows the user to filter the information space based on various criteria. For example, the user can see only pages modified before or after a particular date. Moreover, in the Related Pages table (Figure 5), the user can see only pages that are co-cited with or directly linked to the selected page. Another useful technique is to specify keywords to determine relevant Web pages or sites. For keyword queries, as well as for determining the pages similar to a selected page, the WTMS interface sends a query to the WTMS server. The results from the server are used in the visualizations.
To be most useful, a Web Topic Management system should provide analysis beyond traditional keyword or attribute based querying. At present WTMS provides various graph algorithm based techniques to analyze the topic-specific collection. The algorithms are applied to the site graph, a directed graph with the logical sites as nodes. If a page in site a has a link to a page in site b, then an edge is created from node a to node b in the site graph. It should be emphasized that our analysis techniques are not very complex and can thus be applied in real time to collections with a large number of Web sites.
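Building the site graph from page-level links can be sketched as follows; `site_of`, a function mapping a page url to its logical site, is a hypothetical helper, not part of WTMS's published interface:

```python
# Sketch: build the directed site graph from page-level links.
# `page_links`: iterable of (from_url, to_url) pairs.
# `site_of`: assumed function mapping a page url to its logical site.
def build_site_graph(page_links, site_of):
    graph = {}
    for src, dst in page_links:
        a, b = site_of(src), site_of(dst)
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b:  # intra-site links do not create edges
            graph[a].add(b)
    return graph
```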
One useful technique is to consider the site graph as an undirected graph and determine the connected components . Nodes in the same connected component are connected by paths of one or more links and can thus be considered to be related.
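A minimal sketch of this step, representing the site graph as a dict mapping each site to the set of sites it links to, symmetrizing the edges, and finding components with an iterative traversal:

```python
# Sketch: connected components of the site graph viewed as undirected.
def connected_components(graph):
    # Build an undirected adjacency from the directed site graph.
    adj = {n: set() for n in graph}
    for u, vs in graph.items():
        for v in vs:
            adj[u].add(v)
            adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal collecting one component.
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components
```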
(a) World Cup Soccer
(b) Titanic
Figure 6 shows the connected components that were discovered for the collections on World Cup Soccer and Titanic. If a connected component contains only one site, the site itself is shown (for example, www.iranian.com in Figure 6(a)). Each component is represented by a glyph whose size is proportional to the number of Web pages the group contains. The glyphs are ordered by the maximum authority score among the sites in the group. Thus the group containing seawifs.gsfc.nasa.gov is at the top for the Titanic collection. The label of a connected component is determined by the site with the highest authority score that it contains. The user can click on a glyph to see the logical sites and Web pages of the corresponding component. For example, in Figure 6(a), the user is viewing the details of a connected component containing the site www.ee.umd.edu.
Figure 6 shows that for both collections only a few connected components were formed (even though the collections had thousands of pages). Thus connected components are a useful mechanism for grouping the logical sites into meaningful clusters. The major information about the topic can be found in the top few components. The isolated sites found later in the view generally contain information different from the main theme of the collection. For example, Figure 6(a) shows that the Electrical Department of the University of Maryland is in the collection on World Cup Soccer. On examining the main page of the site (for this collection), www.ee.umd.edu/~dstewart/pinball/PAPA6/, we found that the page talks about the 1998 World Pinball Championship. The search engine retrieved the page since one of the participating teams was named World Cup Soccer! Similarly, as discussed in section 3.2.2, the site www.murthy.com, which talks about tax laws, is in the collection since one of its pages talks about visa issuance for the Women's World Cup Soccer tournament. Further, for the Titanic collection shown in Figure 6(b), we found an isolated site for a casino named Titanic. Thus, it is evident that connected components are effective in isolating pages on the fringes of the collection.
Considering the site graph as a directed graph, we can also determine the strongly connected components . Every pair of nodes in a strongly connected component is mutually reachable through directed paths. Figure 7 shows a strongly connected component for the Information Visualization collection. All the sites shown in the figure are reachable from each other. For a strongly connected component with a large number of nodes and links, the graph is too complex for the user to understand. Therefore WTMS allows various ways to filter the graph:
(a) For the Information Visualization collection, 20% of the sites reference each other
(b) For the World Cup Soccer collection, 3.5% of the sites reference each other
The dialog box shown in Figure 10 gives an indication about the strongly connected components of a collection. The figure shows that for an academic topic like Information Visualization more sites are grouped into strongly connected components (20%) than for a topic of more general interest like World Cup Soccer (only 3.5%). On analyzing the collections using the WTMS interface, we determined that this is because the Information Visualization collection mainly consists of Web sites of research communities which sometimes cross-reference each other. On the other hand, the World Cup Soccer collection consists of several commercial Web sites on the topic as well as personal pages of people interested in the topic. The personal pages may point to some commercial sites but not vice versa and of course there is hardly any cyclical referencing among competing commercial sites. Thus, the number of sites that belong to a strongly connected component gives an indication about whether the collection is competitive or collaborative. In a collaborative collection, unlike a competitive collection, there will be several Web sites referencing each other; therefore these sites will be grouped into strongly connected components.
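The paper does not say how WTMS computes strongly connected components; Kosaraju's two-pass depth-first algorithm is one standard choice, sketched here on a dict-of-sets site-graph representation (every node assumed present as a key):

```python
# Sketch: strongly connected components via Kosaraju's algorithm.
def strongly_connected_components(graph):
    order, seen = [], set()

    # Iterative DFS appending nodes to `out` in finish order.
    def dfs(node, g, out):
        stack = [(node, iter(g.get(node, ())))]
        seen.add(node)
        while stack:
            n, it = stack[-1]
            advanced = False
            for m in it:
                if m not in seen:
                    seen.add(m)
                    stack.append((m, iter(g.get(m, ()))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(n)

    # Pass 1: record finish order on the original graph.
    for n in graph:
        if n not in seen:
            dfs(n, graph, order)

    # Pass 2: peel components off the reversed graph in reverse
    # finish order.
    rev = {n: set() for n in graph}
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, set()).add(u)
    seen.clear()
    components = []
    for n in reversed(order):
        if n not in seen:
            comp = []
            dfs(n, rev, comp)
            components.append(set(comp))
    return components
```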
For most topics, the collections will consist of hundreds of logical sites. Sometimes the user may want to filter the information space based on various criteria. Obviously, the hubs and authorities are some of the most important sites for the collection. So one option is to show the top n or n% hubs and authorities as the important sites, where n is any integer. However, instead of choosing an arbitrary integer, in some situations other techniques might be more appropriate. In this section we will define two techniques for filtering the information space.
We define a hub cover for a site graph with V sites and E links as a subset V_h of V, such that for all links (u,v) in E, u (the source of the link) belongs to V_h. An optimum hub cover is the hub cover of the smallest size for the given graph. That is, the optimum hub cover of a collection is the smallest number of sites that have links to all the sites of the collection. Filtering the collection by showing only the optimum hub cover is useful, because from these sites the user can reach all the sites of the collection.
On the other hand, the authority cover for a site graph with V sites and E links is a subset V_a of V, such that for all links (u,v) in E, v (the destination of the link) belongs to V_a. The optimum authority cover is the authority cover with the smallest size in the site graph. That is, the optimum authority cover of a collection is the smallest number of sites that have links from all the sites of the collection. Obviously, filtering the collection by the optimum authority cover is also useful.
Determining the optimum hub and authority covers for a site graph is similar to the vertex cover problem . The vertex cover problem for an undirected graph G=(V,E) is to determine the smallest possible subset V' of V such that if (u,v) is in E, then at least one of u or v is in V'. Unfortunately, the vertex cover problem is NP-complete .
In WTMS we determine the approximate optimum hub and authority covers. The algorithm to determine the approximate optimum hub cover is as follows:
The algorithm examines the nodes of the graph sorted by their increasing in-degrees. For nodes having just one in-link, the source of that link has to be added to the hub cover. For nodes having more than one in-link, the algorithm adds to the hub cover the link source with the highest hub score. Whenever a node is added to the hub cover, all nodes that have links from that node can be ignored.
Notice that even though we remove the sites with no in-links before the while loop in the above algorithm, these sites can still be in a hub cover. However, sites with neither in-links nor out-links will be ignored by the algorithm. Since these sites are on the fringes of the collection (as discussed in Section 5.1), they should not be considered important to the collection.
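The greedy procedure described above might be sketched as follows; `graph` maps each site to the sites it links to, and `hub_score` supplies the precomputed hub scores. (The in-degree-1 case falls out naturally, since taking the max over a single predecessor returns that predecessor.)

```python
# Sketch of the greedy approximate hub-cover algorithm described above.
def approximate_hub_cover(graph, hub_score):
    # Predecessors of each node (the sources of its in-links).
    preds = {n: set() for n in graph}
    for u, vs in graph.items():
        for v in vs:
            preds[v].add(u)
    covered, cover = set(), set()
    # Examine nodes by increasing in-degree, skipping nodes with none.
    for node in sorted(preds, key=lambda n: len(preds[n])):
        if node in covered or not preds[node]:
            continue
        # Pick the in-link source with the highest hub score.
        best = max(preds[node], key=lambda s: hub_score.get(s, 0.0))
        cover.add(best)
        covered |= graph[best]  # everything `best` links to is now covered
    return cover
```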
The approximate optimum authority cover can be determined by a similar algorithm. We can also determine the hub and authority covers for the Web pages by applying the algorithms on the original graph of the collection.
|Collection|Information Visualization|World Cup Soccer|Titanic|
|Number of Sites|305|140|528|
|Size of Approximate Optimal Hub Cover|35|20|52|
|Size of Approximate Optimal Authority Cover|28|18|68|
Table 2 shows the sizes of the approximate optimal hub and authority covers that were discovered for the collections. In all cases we found a small set of sites, significantly smaller than the total number of sites, that have links to or from all the sites of the collection.
(a) Hub Cover
(b) Authority Cover
(c) Nodes directly reachable from the main hub www.anancyweb.com
Figure 11(a) shows the approximate optimal hub cover for the World Cup Soccer collection, sorted by hub score. Similarly, Figure 11(b) shows the approximate optimal authority cover, sorted by authority score. These views are useful because starting from these sites the user can visit all the sites of the collection. Clicking on one of the sites in the hub cover view shows the sites that can be directly reached from the selected site. Thus, Figure 11(c) shows the sites with links from www.anancyweb.com, the main hub for the World Cup Soccer collection. Similarly, clicking on a site in the authority cover view shows the sites that have links to the selected site.
In this paper, we have presented WTMS, a Web Topic Management system. WTMS uses a crawler to gather information related to a particular topic from the WWW. The enhancements to the crawler have resulted in improved performance, reducing the number of pages that need to be downloaded by more than 20% while missing only a few insignificant pages.
The WTMS interface provides various visualizations to navigate through the information space at different levels of abstraction. For example, the user can see the details of a logical site or a Web page. The system allows the user to integrate searching and browsing. Besides traditional keyword-based search, structural analysis techniques are provided that allow the user to gain several useful insights about the collection. This kind of analysis is not easy with traditional search engines or the previous systems for topic management. We have also introduced the concept of optimum hub and authority covers as a technique to filter the information space. The example scenarios presented in the paper indicate the usefulness of the system for Web Topic Management.
Future work is planned along various directions:
We believe that as the WWW grows bigger, systems like the WTMS will become essential for retrieving useful information.
Sougata Mukherjea received his Bachelors in Computer Science & Engineering from Jadavpur University, Calcutta, India in 1988 and his MS in Computer Science from Northeastern University, Boston, MA in 1991. He then obtained his PhD in Computer Science from the Georgia Institute of Technology, Atlanta, GA in 1995. Until 1999 he worked at NEC's C&C Research Laboratories in San Jose, CA. Presently he is working for Icarian, a start-up company in Sunnyvale, CA. His main research interests are Web technologies like crawling and link analysis, Information Retrieval and Information Visualization. He can be reached by e-mail at firstname.lastname@example.org or email@example.com.