The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect

Abridged Version
R. Lempel       S. Moran
Department of Computer Science
The Technion,
Haifa 32000, Israel
email: {rlempel,moran}@cs.technion.ac.il

ABSTRACT

Today, when searching for information on the WWW, one usually performs a query through a term-based search engine. These engines return, as the query's result, a list of Web sites whose contents match the query. For broad topic queries, such searches often result in a huge set of retrieved documents, many of which are irrelevant to the user. However, much information is contained in the link-structure of the WWW. Information such as which pages are linked to others can be used to augment search algorithms. In this context, Jon Kleinberg introduced the notion of two distinct types of Web-sites: hubs and authorities. Kleinberg argued that hubs and authorities exhibit a mutually reinforcing relationship: A good hub will point to many authorities, and a good authority will be pointed at by many hubs. In light of this, he devised an algorithm aimed at finding authoritative sites.

We present SALSA - a new stochastic approach for link structure analysis, which examines random walks on graphs derived from the link structure. We show that both SALSA and Kleinberg's Mutual Reinforcement approach employ the same meta-algorithm. We then prove that SALSA is equivalent to a weighted in-degree analysis of the link-structure of WWW subgraphs, making it computationally more efficient than the Mutual Reinforcement approach.

We compare the results of applying SALSA to the results derived through Kleinberg's approach.

These comparisons reveal a topological phenomenon called the TKC Effect which, in certain cases, prevents the Mutual Reinforcement approach from identifying meaningful authorities.

Keywords: Information Retrieval; Link Structure Analysis; Hubs and Authorities; Random walks; SALSA.


1. Introduction

Searching the WWW - The Challenge

The WWW is a rapidly expanding hyperlinked collection of unstructured information. The lack of structure and the enormous volume of the WWW pose tremendous challenges to the WWW Information Retrieval systems called search engines. These search engines are presented with queries, and return a list of Web-sites which are deemed (by the engine) to pertain to the query.

When considering the difficulties which WWW search engines face, we distinguish between narrow-topic queries and broad-topic queries. This distinction pertains to the presence which the query's topic has on the Web:

Narrow topic queries are queries for which very few resources exist on the Web, and which present a "needle in the haystack" challenge for search engines. An example of such a query is an attempt to locate the lyrics of a specific song by quoting a line from it ("We all live in a yellow submarine"). Search engines encounter a recall challenge when handling such queries - finding the few resources which pertain to the query.

On the other hand, broad-topic queries pertain to topics for which there is an abundance of information on the Web, sometimes as many as millions of relevant resources (with varying degrees of relevance). The vast majority of users are not interested in retrieving the entire huge set of resources - most users will be quite satisfied with a few authoritative results: Web sites which are highly relevant to the topic of the query, significantly more so than most other sites. The challenge which search engines face here is one of precision - retrieving only the resources most relevant to the query.

This work focuses on finding authoritative resources which pertain to broad-topic queries.

Term-based search engines

Term-based search engines face both classical problems in Information Retrieval and problems specific to the WWW setting when handling broad-topic queries. The classic problems include the following issues ([20],[4]):

In addition to the classical issues in Information Retrieval, there is a Web-specific obstacle which search engines must overcome, called search engine persuasion ([19]). There may be millions of sites pertaining in some manner to broad-topic queries, but most users will only browse through the first ten results returned by their favorite search facility. With the growing economic impact of the WWW, and the growth of e-commerce, it is crucial for businesses to have their sites ranked high by the major search engines. There are quite a few companies which sell this kind of expertise: they design Web sites which are tailored to rank high with specific queries on the major search engines. These companies research the ranking algorithms and heuristics of term-based engines, and know how many keywords to place (and where) in a Web page so as to improve the page's ranking (which directly impacts the page's visibility). A less sophisticated technique, used by some site creators, is called keyword spamming ([4]): the authors repeat certain terms (some of which are only remotely connected to their site's content), in order to "lure" search engines into ranking them highly for many queries.

Informative link structure - The answer?

The WWW is a hyperlinked collection. In addition to the textual content of the individual pages, the link structure of such collections contains information which can, and should, be tapped when searching for authoritative sources. Consider the significance of a link p$\rightarrow$q: with such a link, p suggests, or even recommends, that surfers visiting p follow the link and visit q. This may reflect the fact that pages p and q share a common topic of interest, and that the author of p thinks highly of q's contents. Such a link, called an informative link, is p's way of conferring authority on q ([16]). Note that informative links provide a positive critical assessment of q's contents which originates from outside the control of the author of q (as opposed to assessments based on q's textual content, which is under the complete control of q's author). This makes the information extracted from informative links less vulnerable to manipulative techniques such as spamming.

Unfortunately, not all links are informative. There are many kinds of links which confer little or no authority ([4]), such as intra-domain (inner) links (whose purpose is to provide navigational aid in a complex Web site of some organization), commercial/sponsor links, and links which result from link-exchange agreements. A crucial task which should be completed prior to analyzing the link structure of a given collection, is to filter out as many of the non-informative links as possible.

Related work on link structures

Prior to the WWW age, link structures were studied in the area of bibliometrics, which studies the citation structure of written documents ([23],[15]). Many works in this area were aimed at finding high-impact papers published in scientific journals ([10]), and at clustering related documents ([1]).

Some works have studied the Web's link structure, in addition to the textual content of the pages, as means to visualize areas thought to contain good resources ([3]). Other works used link structures for categorizing pages and clustering them ([24],[21]).

Marchiori, in [19], uses the link-structure of the Web to enhance search results of term-based search engines. This is done by considering the potential hyper-information contained in each Web-page: The information that can be found when following hyperlinks which originate in the page.

This work is motivated by the approach introduced by Jon Kleinberg ([16]). In an attempt to impose some structure on the chaotic WWW, Kleinberg distinguished between two types of Web-sites which pertain to a certain topic: the first are authoritative pages in the sense described previously; the second are hub pages. Hubs are resource lists - they do not directly contain information pertaining to the topic, but rather point to many authoritative sites. According to this model, hubs and authorities exhibit a mutually reinforcing relationship: good hubs point to many good authorities, and good authorities are pointed at by many good hubs. In light of the mutually reinforcing relationship, hubs and authorities should form communities, which can be pictured as dense bipartite portions of the Web, where the hubs link densely to the authorities. The most prominent community in a WWW subgraph is called the principal community of the collection.

Kleinberg suggested an algorithm to identify these communities, which is described in detail in section 2.

Researchers from IBM's Almaden Research Center have implemented Kleinberg's algorithm in various projects. The first was HITS, which is described in [11] and offers some enlightening practical remarks. The ARC system, described in [7], augments Kleinberg's link-structure analysis by also considering the anchor text, the text which surrounds the hyperlink in the pointing page. The reasoning behind this is that the pointing page often describes the destination page's contents around the hyperlink, and thus the authority conferred by the links can be better assessed. These projects were extended by the CLEVER project ([14]). Researchers from outside IBM, such as Henzinger and Bharat, have also studied Kleinberg's approach and have proposed improvements to it ([13]).

Anchor text has also been used by Brin and Page in [2]. Another major feature of their work on the Google search engine ([12]) is a link-structure based site ranking approach called PageRank, which can be interpreted as a stochastic analysis of some random-walk behavior through the entire WWW.

In [18], the authors use the links surrounding a small set of same-topic sites to assemble a larger collection of neighboring pages which should contain many authoritative resources on the initial topic. The textual content of the collection is then analyzed in ranking the relevancy of its individual pages.

This work

While preserving the theme that Web sites pertaining to a given topic should be split into hubs and authorities, we replace Kleinberg's Mutual Reinforcement approach of [16] by a new stochastic approach (SALSA), in which the coupling between hubs and authorities is less tight. The intuition behind our approach is the following: consider a bipartite graph G, whose two parts correspond to hubs and authorities, where an edge between hub r and authority s means that there is an informative link from r to s. Then, authorities and hubs pertaining to the dominant topic of the sites in G should be highly visible (reachable) from many sites in G. Thus, we will attempt to identify these sites by examining certain random walks in G, under the proviso that such random walks will tend to visit these highly visible sites more frequently than other, less connected sites. We show that in finding the principal communities of hubs and authorities, both Kleinberg's Mutual Reinforcement approach and our Stochastic approach employ the same meta-algorithm on different representations of the input graph. We then compare the results of applying SALSA to the results derived by Kleinberg's approach. Through these comparisons, we isolate a particular topological phenomenon which we call the Tightly Knit Community (TKC) Effect. In certain scenarios, this effect hampers the ability of the Mutual Reinforcement approach to identify meaningful authorities. We demonstrate that SALSA is less vulnerable to the TKC effect, and can find meaningful authorities in collections where the Mutual Reinforcement approach fails to do so.

After demonstrating some results achieved by means of SALSA, we prove that the ranking of sites in the Stochastic approach may be calculated by examining the weighted in/out degrees of the sites in G. This result implies that SALSA is computationally lighter than the Mutual Reinforcement approach. We also discuss the reason for our success with analyzing weighted in/out degrees of sites, which previous work has claimed to be unsatisfactory for identifying authoritative sites.

The rest of the paper is organized as follows: Section 2 recounts Kleinberg's Mutual Reinforcement approach. In section 3 we view Kleinberg's approach from a higher level, and define a meta-algorithm for link structure analysis. Section 4 presents our new approach, SALSA. In section 5 we compare the two approaches by considering their outputs on the WWW and on artificial topologies. Then, in section 6 we prove the connection between SALSA and weighted in/out degree rankings of sites. Our conclusions and ideas for future work are presented in section 7. The paper uses basic results from the theory of stochastic processes, which are given in the full version. The main contribution of the paper can be grasped without following the full mathematical analysis.


2. Kleinberg's Mutual Reinforcement Approach

The Mutual Reinforcement approach ([16]) starts by assembling a collection C of Web-sites, which should contain communities of hubs and authorities pertaining to a given topic t. It then analyzes the link structure induced by that collection, in order to find the authoritative sites on topic t.

Denote by q a term-based search query to which sites in our topic of interest t are deemed to be relevant. The collection C is assembled in the following manner:

  1. A root set of sites is obtained by submitting the query q to a term-based search engine and taking the top results it returns.
  2. The root set is then expanded into the collection C by adding sites which point to members of the root set, and sites which are pointed at by members of the root set.

The collection C and its link structure induce the following directed graph G: G's nodes are the sites in C, and for all i, j $\in$ C, the directed edge i$\rightarrow$j appears in G if and only if site i contains a hyperlink to site j. Let W denote the |C| × |C| adjacency matrix of G.

Each site s$\in$C is now assigned a pair of weights, a hub-weight h(s) and an authority weight a(s), based on the following two principles: a site's authority weight should be proportional to the sum of the hub weights of the sites which point to it, and a site's hub weight should be proportional to the sum of the authority weights of the sites it points to.

The top ranking sites, according to both kinds of weights, form the Mutually Reinforcing communities of hubs and authorities. In order to assign such weights, Kleinberg uses the following iterative algorithm:

  1. Initialize a(s)$\leftarrow$1 , h(s)$\leftarrow$1 for all sites s$\in$C.
  2. Repeat the following three operations until convergence: the I operation, which updates the authority weights by setting a(s) to the sum of h(x) over all sites x which point to s; the O operation, which updates the hub weights by setting h(s) to the sum of a(x) over all sites x pointed at by s; and a normalization of both weight vectors.

Note that applying the I operation is equivalent to assigning authority weights according to the result of multiplying the vector of all hub weights by the matrix $W^T$. The O operation is equivalent to assigning hub weights according to the result of multiplying the vector of all authority weights by the matrix W.

Kleinberg showed that this algorithm converges, and that the resulting authority weights [hub weights] are the coordinates of the normalized principal eigenvector of $W^TW$ [of $WW^T$] (the eigenvector which corresponds to the eigenvalue of highest magnitude of the matrix). $W^TW$ and $WW^T$ are well known matrices in the field of bibliometrics:

  1. $A=W^TW$ is the co-citation matrix ([23]) of the collection. $[A]_{i,j}$ is the number of sites which jointly point at (cite) pages i and j. Kleinberg's iterative algorithm converges to authority weights which correspond to the entries of the (unique, normalized) principal eigenvector of A.
  2. $H=WW^T$ is the bibliographic coupling matrix ([15]) of the collection. $[H]_{i,j}$ is the number of sites jointly referred to (pointed at) by pages i and j. Kleinberg's iterative algorithm converges to hub weights which correspond to the entries of H's (unique, normalized) principal eigenvector.
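A minimal Python/numpy sketch of this iteration, assuming the collection's link structure is given as a 0/1 adjacency matrix W with W[i,j]=1 whenever site i links to site j:

    import numpy as np

    def mutual_reinforcement(W, iterations=50):
        """Kleinberg's iterative algorithm: alternate the I and O operations
        and normalize, until the hub and authority weights stabilize."""
        n = W.shape[0]
        a = np.ones(n)              # authority weights
        h = np.ones(n)              # hub weights
        for _ in range(iterations):
            a = W.T @ h             # I operation: a(s) <- sum of h(x) over x -> s
            h = W @ a               # O operation: h(s) <- sum of a(x) over s -> x
            a /= np.linalg.norm(a)  # normalization
            h /= np.linalg.norm(h)
        return a, h

Upon convergence, a and h are proportional to the principal eigenvectors of $W^TW$ and $WW^T$ respectively, in line with the characterization above.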

3. A Meta-Algorithm for Link Structure Analysis

Examining the Mutual Reinforcement approach from a higher level, we can identify a general framework, or meta-algorithm, for finding hubs and authorities by link-structure analysis. This meta-algorithm is a version of the spectral filtering method, presented in [6]: given a topic, assemble a collection C of sites which should contain many of the topic's hubs and authorities; derive from C and its link structure two association matrices, a hub matrix and an authority matrix; and declare the sites whose entries are highest in the principal eigenvectors of these matrices to be the algebraic principal communities of hubs and authorities.

For the meta-algorithm to be useful, the algebraic principal communities of hubs and authorities should reflect the true authorities and hubs in C.

The two degrees of freedom which the meta-algorithm allows are the method for obtaining the collection, and the definition of the association matrices. Given a specific collection, the algebraic communities produced by the meta-algorithm are determined solely by the definition of the association matrices.
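To make this concrete, here is a minimal sketch (in Python/numpy, with hypothetical function names) in which the association matrix is the only varying input: for the symmetric matrices of the Mutual Reinforcement approach the left and right principal eigenvectors coincide, while for the stochastic matrices defined by SALSA in the next section the principal left eigenvector is the stationary distribution.

    import numpy as np

    def principal_left_eigenvector(M, iterations=200):
        """Power iteration for the principal left eigenvector of a
        nonnegative association matrix M."""
        v = np.ones(M.shape[0]) / M.shape[0]
        for _ in range(iterations):
            v = v @ M
            v /= v.sum()
        return v

    def principal_community(association_matrix, k=10):
        """Indices of the k sites with the largest entries in the principal
        eigenvector: the algebraic principal community."""
        v = principal_left_eigenvector(association_matrix)
        return np.argsort(-v)[:k]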


4. SALSA: Analyzing a Random Walk on the Web

In this section we introduce the Stochastic Approach for Link Structure Analysis - SALSA. The approach is based upon the theory of Markov chains, and relies on the stochastic properties of random walks performed on our collection of sites. It follows the meta-algorithm described in section 3, and differs from the Mutual Reinforcement approach in the manner in which the association matrices are defined.

The input to our scheme consists of a collection of sites C which is built around a topic t in the manner described in section 2. Intuition suggests that authoritative sites on topic t should be visible from many sites in the subgraph induced by C. Thus, a random walk on this subgraph will visit t-authorities with high probability.

We combine the theory of random walks with the notion of the two distinct types of Web sites, hubs and authorities, and actually analyze two different Markov chains: A chain of hubs and a chain of authorities. Unlike "conventional" random walks on graphs, state transitions in these chains are generated by traversing two WWW-links in a row, one link forward and one link backwards (or vice versa). Analyzing both chains allows our approach to give each Web site two distinct scores, a hub score and an authority score.

The idea of ranking Web sites using random walks is not new. The search engine Google ([2], [12]) incorporates stochastic information into its ranking of pages. The PageRank component of the search engine examines a single random walk on the entire WWW. Hence, the ranking of Web sites in Google is independent of the search query (a global ranking), and no distinction is made between hubs and authorities.

Let us build a bipartite undirected graph G'=(V$_h$, V$_a$, E) from our site collection C and its link structure:

Each non-isolated site s$\in$C is represented by two nodes of G', s$_h$ and s$_a$. Each WWW-link s$\rightarrow$r is represented by an undirected edge connecting s$_h$ and r$_a$.

On this bipartite graph we will perform two distinct random walks. Each walk will only visit nodes from one of the two sides of the graph, by traversing paths consisting of two G'-edges in each step. Since each edge crosses sides of G', each walk is confined to just one of the graph's sides, and the two walks will naturally start off from different sides of G'. Note also that every path of length 2 in G' represents a traversal of one WWW link in the proper direction (when passing from the hub-side of G' to the authority-side), and a retreat along a WWW link (when crossing in the other direction). Since the hubs and authorities of topic t should be highly visible in G' (reachable from many nodes by either a direct edge or by short paths), we may expect that the t-authorities will be amongst the nodes most frequently visited by the random walk on Va, and that the t-hubs will be amongst the nodes most frequently visited by the random walk on Vh.

We will examine the two different Markov chains which correspond to these random walks: The chain of the visits to the authority side of G' (the authority chain), and the chain of visits to the hub side of G'. Analyzing these chains separately naturally distinguishes between the two aspects of each site.
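As an informal illustration, the authority chain can also be approximated by simulating the two-link walk directly; the sketch below assumes the informative links are available as (hub, authority) pairs and estimates visit frequencies by simple Monte Carlo sampling:

    import random
    from collections import defaultdict

    def simulate_authority_walk(links, steps=100000, seed=0):
        """Estimate authority-side visit frequencies: from authority i, step
        backwards along a random incoming link to a hub k, then forwards along
        a random outgoing link of k to an authority j. The walk stays inside
        one connected component of G' (see section 6.1)."""
        rng = random.Random(seed)
        hubs_of, auths_of = defaultdict(list), defaultdict(list)
        for h, a in links:
            hubs_of[a].append(h)   # authority -> hubs pointing to it
            auths_of[h].append(a)  # hub -> authorities it points to
        current = rng.choice(list(hubs_of))  # start at an arbitrary authority
        visits = defaultdict(int)
        for _ in range(steps):
            k = rng.choice(hubs_of[current])   # one link backwards
            current = rng.choice(auths_of[k])  # one link forwards
            visits[current] += 1
        return {a: c / steps for a, c in visits.items()}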

We now define two stochastic matrices, which are the transition matrices of the two Markov chains at interest:

  1. The hub-matrix $\tilde{H}$:
    \begin{displaymath}\tilde{h}_{i,j} = \sum_{\{ k\ \vert\ (i_h,k_a),(j_h,k_a) \in G' \}}{\frac{1}{deg(i_h)} \cdot \frac{1}{deg(k_a)}} \end{displaymath}
  2. The authority-matrix Ã:
    \begin{displaymath}\tilde{a}_{i,j} = \sum_{\{ k\ \vert\ (k_h,i_a),(k_h,j_a) \in G' \}}{\frac{1}{deg(i_a)} \cdot \frac{1}{deg(k_h)}} \end{displaymath}

A positive transition probability ã$_{i,j}$ > 0 implies that a certain page h points to both pages i and j, and hence page j is reachable from page i by two steps: retracing the link h$\rightarrow$i backwards and then following the link h$\rightarrow$j.

Alternatively, the matrices $\tilde{H}$ and Ã can be defined as follows: Let W be the adjacency matrix of the directed graph defined by C and its link structure. Denote by $W_r$ the matrix which results by dividing each nonzero entry of W by the sum of the entries in its row, and by $W_c$ the matrix which results by dividing each nonzero element of W by the sum of the entries in its column (obviously, the sums of rows/columns which contain nonzero elements are greater than zero). Then $\tilde{H}$ consists of the non-zero rows and columns of $W_r W_c^T$, and Ã consists of the non-zero rows and columns of $W_c^T W_r$. We ignore the rows and columns of Ã and $\tilde{H}$ which consist entirely of zeros, since (by definition) all the nodes of G' have at least one incident edge. The matrices Ã and $\tilde{H}$ serve as the association matrices required by the meta-algorithm for identifying the authorities and hubs. Recall that the Mutual Reinforcement approach uses the association matrices $A=W^TW$ and $H=WW^T$.
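A minimal numpy sketch of this construction, assuming W is given as a 0/1 array (rows and columns which are entirely zero are simply left as zeros here, rather than removed):

    import numpy as np

    def salsa_matrices(W):
        """Build the SALSA association matrices from the adjacency matrix W:
        W_r divides each nonzero row of W by its row sum, W_c divides each
        nonzero column of W by its column sum."""
        W = np.asarray(W, dtype=float)
        row_sums = W.sum(axis=1, keepdims=True)
        col_sums = W.sum(axis=0, keepdims=True)
        W_r = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
        W_c = np.divide(W, col_sums, out=np.zeros_like(W), where=col_sums > 0)
        H_tilde = W_r @ W_c.T  # hub chain transition matrix
        A_tilde = W_c.T @ W_r  # authority chain transition matrix
        return H_tilde, A_tilde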

We shall assume that G' is connected, causing both stochastic matrices à and $\tilde{H}$ to be irreducible. This assumption does not form a limiting factor, since when G' is not connected, we may use our technique on each connected component separately. Section 6.1 further elaborates on the case when à and $\tilde{H}$ have multiple irreducible components.

Some properties of $\tilde{H}$ and Ã: both are stochastic matrices (each of their rows sums to one), and both have positive entries on their main diagonal, so the Markov chains which they define are aperiodic.

Following the framework of the meta-algorithm, the principal community of authorities (hubs) found by SALSA will be composed of the sites whose entries in the principal eigenvector of Ã ($\tilde{H}$) are the highest. By the Ergodic Theorem ([9]), the principal eigenvector of an irreducible, aperiodic stochastic matrix is actually the stationary distribution of the underlying Markov chain, and its high entries correspond to sites most frequently visited by the (infinite) random walk.


5. Results

5.1 The Tightly-Knit Community (TKC) Effect

A tightly-knit community is a small but highly interconnected set of sites. Roughly speaking, the TKC effect occurs when such a community scores high in link-analyzing algorithms, even though the sites in the TKC are not authoritative on the topic, or pertain to just one aspect of the topic. Our study indicates that the Mutual Reinforcement approach is vulnerable to this effect, and will sometimes rank the sites of a TKC in unjustifiably high positions.

As an example, consider a collection C which contains the following two communities: a community y, with a small number of hubs and authorities, in which every hub points to most of the authorities; and a much larger community z, in which each hub points to only a part of the authorities. The topic covered by z is the dominant topic of the collection, and is probably of wider interest on the WWW. Since there are many z-authoritative sites, the hubs do not link to all of them, whereas the smaller y community is densely interconnected. The TKC effect occurs when the sites of y are ranked higher than those of z.

In the full paper we provide a combinatorial construction, which demonstrates such (artificial) communities y and z, where the Mutual Reinforcement approach scores y higher than z, and the stochastic approach scores z higher. This bias of the Mutual Reinforcement approach towards tightly-knit communities will be demonstrated on WWW queries in the next section.

5.2 The WWW

We tested the different approaches on broad-topic WWW queries (both single-topic queries and multi-topic queries). We obtained a collection of sites for each query, and then derived the principal community of authorities with both approaches. Two of these queries ("java", "abortion") were used by Kleinberg in [16], and are included here for the sake of comparison. All collections were assembled during February, 1999. The root sets were compiled using AltaVista ([8]), which also provided the linkage information needed for building the base sets.

When expanding the root set to the entire collection, we filtered the links pointing to and from Web sites. Following [16], we ignored intra-domain links (since these links tend to be navigational aids inside an intranet, and do not confer authority on the link's destination). We also ignored links to cgi scripts, and tried to identify ad-links and ignore them as well. Overall, 38% of the links we examined were ignored. The collections themselves turn out to be relatively sparse graphs, with the number of edges never exceeding three times the number of nodes. We note that a recent work by Kleinberg et al. ([17]) has examined some other connectivity characteristics of such collections.

For each query, we list the top authorities which were returned by the two approaches. The results are displayed in tables containing four columns:

  1. The url.
  2. The title of the url.
  3. The category of the url: (1) for a member of the root set, (2) for a site pointing into the root set, and (3) for a site pointed at by a member of the root set.
  4. The value of the coordinate of this url in the principal eigenvector of the authority matrix.
Single-Topic Query: Java

The results for this query, with our first example of the TKC effect, are shown in table 1. All of the top ten Mutual Reinforcement authorities are part of the EARTHWEB Inc. network. They are interconnected, but since the domain names of the sites are different, the interconnecting links were not filtered out. Some of the sites are highly relevant to the query (and have many incoming links from sites outside the EarthWeb net), but most appear in the principal community only because of their EarthWeb affiliation. With SALSA, only the top three Mutual Reinforcement authorities are retained, and the other seven are replaced by other authorities, some of which are clearly more related to the query.

Size of root set = 160, Size of collection = 2810
Principal Community, Mutual Reinforcement Approach:

url title cat weight
http://www.jars.com/ EarthWeb's JARS.COM Java Review Service (3) 0.334102
http://www.gamelan.com/ Gamelan - The Official Java Directory (3) 0.303624
http://www.javascripts.com/ Javascripts.com - Welcome (3) 0.255254
http://www.datamation.com/ EarthWeb's Datamation.com (3) 0.251379
http://www.roadcoders.com/ Handheld Software Development@RoadCoders (3) 0.250816
http://www.earthweb.com/ EarthWeb (3) 0.249373
http://www.earthwebdirect.com/ Welcome to Earthweb Direct (3) 0.247467
http://www.itknowledge.com/ ITKnowledge (3) 0.246874
http://www.intranetjournal.com/ intranetjournal.com (3) 0.24518
http://www.javagoodies.com/ Java Goodies JavaScript Repository (3) 0.238793
Principal Community, SALSA:

url title cat weight
http://java.sun.com/ Java(tm) Technology Home Page (3) 0.365264
http://www.gamelan.com/ Gamelan - The Official Java Directory (3) 0.36369
http://www.jars.com/ EarthWeb's JARS.COM Java Review Service (3) 0.303862
http://www.javaworld.com/ IDG's magazine for the Java community (3) 0.217269
http://www.yahoo.com/ Yahoo! (3) 0.21412
http://www.javasoft.com/ Java(tm) Technology Home Page (3) 0.203099
http://www.sun.com/ Sun Microsystems (3) 0.187355
http://www.javascripts.com/ Javascripts.com - Welcome (3) 0.138548
http://www.htmlgoodies.com/ htmlgoodies.com - Home (3) 0.130676
http://javaboutique.internet.com/ The Ultimate Java Applet Resource (1) 0.118081
Table 1: Authorities for WWW query "Java"

Single-Topic Query: movies

This query demonstrates the TKC effect in a most striking fashion on the WWW. First, consider the Mutual Reinforcement principal community of authorities, presented in table 2:

Size of root set = 175, Size of collection = 4539

url title cat weight
http://go.msn.com/npl/msnt.asp MSN.COM (3) 0.167332
http://go.msn.com/bql/whitepages.asp White Pages - msn.com (3) 0.167202
http://go.msn.com/bsl/webevents.asp Web Events (3) 0.167202
http://go.msn.com/bql/maps.asp Microsoft Expedia Maps-Home (3) 0.167202
Table 2:Mutual Reinforcement Authorities for WWW query "movies"

The top 30 authorities returned by the Mutual Reinforcement approach were all go.msn.com sites. All but the first received the exact same weight, 0.167202. Recall that we do not allow same-domain links in our collection, hence none of the top authorities was pointed at by a go.msn.com site. To understand how these sites scored so well, we turn to the principal community of hubs, shown in table 3:


url title cat weight
http://denver.sidewalk.com/movies movies: denver.sidewalk (1) 0.169197
http://boston.sidewalk.com/movies movies:boston.sidewalk (1) 0.169061
http://twincities.sidewalk.com/movies movies: twincities.sidewalk (1) 0.1688
http://newyork.sidewalk.com/movies movies: newyork.sidewalk (1) 0.168537

Table 3: Mutual Reinforcement Hubs for WWW query "movies"

These innocent looking hubs are all part of the Microsoft Network (msn), but when building the base set we did not identify them as such. All these hubs point, almost without exception, to the entire set of authorities found by the MR approach (hence the equal weights which the authorities exhibit). However, the vast majority of the sites in the collection were not part of this "conspiracy", and almost never pointed to any of the go.msn.com sites. Therefore, the authorities returned by the Stochastic approach (table 4) contain none of those go.msn.com sites, and are much more relevant to the query:


url title cat weight
http://us.imdb.com/ The Internet Movie Database (3) 0.253333
http://www.mrshowbiz.com/ Mr Showbiz (3) 0.22335
http://www.disney.com/ Disney.com-The Web Site for Families (3) 0.22003
http://www.hollywood.com/ Hollywood Online:...all about movies (3) 0.213355
http://www.imdb.com/ The Internet Movie Database (3) 0.199987
http://www.paramount.com/ Welcome to Paramount Pictures (3) 0.196682
http://www.mca.com/ Universal Studios (3) 0.180021
Table 4: Stochastic authorities for WWW query "movies"

A similar community is obtained by the Mutual Reinforcement approach, after deleting the rows and columns which correspond to the top 30 authorities from the matrix $W^TW$. This deletion dissolves the msn.com community, and allows a community similar to the one obtained by SALSA to manifest itself.

Multi-Topic Query: abortion

This topic is highly polarized, with different cyber communities supporting pro-life and pro-choice views. In table 5, we present the top 10 authorities, as determined by the two approaches:

Size of root set = 160, Size of collection = 1693
Principal Community, Mutual Reinforcement Approach:

url title cat weight
http://www.nrlc.org/ National Right To Life (3) 0.420832
http://www.prolife.org/ultimate/ The Ultimate Pro-Life Resource List (3) 0.316564
http://www.all.org/ What's new at American Life League (3) 0.251506
http://www.hli.org/ Human Life International (3) 0.212931
http://www.prolife.org/cpcs-online/ Crisis Pregnancy Centers Online (3) 0.187707
http://www.ohiolife.org/ Ohio Right to Life (3) 0.182076
http://www.rtl.org/ Abortion, adoption and assisted-suicide Information at Right to Life... (1) 0.17943
http://www.bethany.org/ Bethany Christian Services (3) 0.161359
http://www.ldi.org/ abortion malpractice litigation (1) 0.140076
http://www.serve.com/fem4life/ Feminists for Life of America (3) 0.122106
Principal Community, SALSA:

url title cat weight
http://www.nrlc.org/ National Right To Life (3) 0.344029
http://www.prolife.org/ultimate/ The Ultimate Pro-Life Resource List (3) 0.284714
http://www.naral.org/ NARAL Choice for America (3) 0.240227
http://www.feminist.org/ Feminist Majority Foundation (3) 0.186843
http://www.now.org/ National Organization for Women (3) 0.177946
http://www.cais.com/agm/main/ The Abortion Rights Activist (1) 0.166083
http://www.gynpages.com/ Abortion Clinics Online (3) 0.163117
http://www.plannedparenthood.org/ Planned Parenthood Federation (3) 0.157186
http://www.all.org/ What's new at American Life League (3) 0.142357
http://www.hli.org/ Human Life International (3) 0.142357
Table 5: Authorities for WWW query "Abortion"

All 10 top authorities found by the Mutual Reinforcement approach are pro-life resources, while the top 10 SALSA authorities are split, with 6 pro-choice sites and 4 pro-life sites (which are the same top 4 pro-life sites found by the Mutual Reinforcement approach). Again, we see the TKC effect: The Mutual Reinforcement approach ranks highly authorities on only one aspect of the query, while SALSA blends authorities from both aspects into its principal community.

Multi-Topic Query: genetics

This query is especially ambiguous in the WWW: It can be in the context of genetic engineering, genetic algorithms, or in the context of health issues and the human genome.

As in the "abortion" query, SALSA produces a diverse principal community, with authorities on the various contexts of the query, while the Mutual Reinforcement approach is focused on one context (Genetic Algorithms, in this case). Both principal communities are shown in table 6:

Size of root set = 120, Size of collection = 2952
Principal Community, Mutual Reinforcement Approach:

url title cat weight
http://www.aic.nrl.navy.mil/galist/ The Genetic Algorithms Archive (3) 0.27848
http://alife.santafe.edu/ Artificial Life Online (3) 0.276159
http://www.yahoo.com/ Yahoo! (3) 0.273599
http://www.geneticprogramming.com/ The Genetic Programming Notebook (1) 0.25588
http://gal4.ge.uiuc.edu/illigal.home.html illiGAL Home Page (3) 0.235717
http://www.cs.gmu.edu/research/gag/ The Genetic Algorithms Group... (3) 0.201237
http://www.scs.carleton.ca/~csgs/resources/gaal.html Genetic Algorithms and Artificial Life Resources (1) 0.181315
http://lancet.mit.edu/ga/ GAlib: Matthew's Genetic Algorithms Library (3) 0.181157
Principal Community, SALSA:

url title cat weight
http://www.ncbi.nlm.nih.gov/ The National Center for Biotechnology Information (3) 0.250012
http://www.yahoo.com/ Yahoo! (3) 0.227782
http://www.aic.nrl.navy.mil/galist/ The Genetic Algorithms Archive (3) 0.223191
http://www.nih.gov/ National Institute of Health (NIH) (3) 0.194688
http://gdbwww.gdb.org/ The Genome Database (3) 0.177001
http://alife.santafe.edu/ Artificial Life Online (3) 0.172383
http://www.genengnews.com/ Genetic Engineering News (GEN) (1) 0.141617
http://gal4.ge.uiuc.edu/illigal.home.html illiGAL Home Page (3) 0.13259
Table 6: Authorities for WWW query "genetic"

6. SALSA and the In/Out Degrees of Sites

In the previous sections we have presented the Stochastic approach as an alternative method for link-structure analysis, and have shown a few encouraging results obtained by it, as compared with the Mutual Reinforcement approach. We have also presented the TKC effect, a topological phenomenon which sometimes derails the MR approach and prevents it from converging to a useful community of authoritative sites.

The sample results shown so far have all been produced on unweighted collections, in which all informative links have received unit weight. Both approaches can produce better rankings when applied on weighted collections, in which each informative link receives a weight which reflects the amount of authority that the pointing site confers to the pointed site. Possible factors which may contribute to a link's weight include the following:

6.1 Analysis of the Stochastic Ranking

We now prove a general result about the ranking produced by SALSA in weighted collections, for which some basic background in stochastic processes is assumed.

Let G=(H;A;E) be a positively weighted, directed bipartite graph with no isolated nodes, and let all edges be directed from sites in H to sites in A. We will use the following notations: $w(k \rightarrow i)$ denotes the weight of the edge k$\rightarrow$i; $d_{out}(k)$ denotes the total weight of the edges leaving a site k$\in$H; $d_{in}(i)$ denotes the total weight of the edges entering a site i$\in$A; and $\cal{W}$ denotes the total weight of all the edges of G.

Let MA be a Markov chain whose states are the set A of vertices, with the following transition probabilities between every two states i, j $\in$ A:

\begin{displaymath}P_A(i,j) = \sum_{\{ k \in H\ \vert\ k \rightarrow i,\ k \rightarrow j \}} \frac{w(k \rightarrow i)}{d_{in}(i)} \cdot \frac{w(k \rightarrow j)}{d_{out}(k)} \end{displaymath}

Similarly, let MH be a Markov chain whose states are the set H of vertices, with the following transition probabilities between every two states k, l $\in$ H:

\begin{displaymath}P_H(k,l) = \sum_{\{ i \in A\ \vert\ k \rightarrow i,\ l \rightarrow i \}} \frac{w(k \rightarrow i)}{d_{out}(k)} \cdot \frac{w(l \rightarrow i)}{d_{in}(i)} \end{displaymath}

Consider the following binary relation on the vertices of A (states of MA):

\begin{displaymath}R_A = \{ (i,j)\ \vert\ P_A(i,j)>0 \}\end{displaymath}

It is not hard to show (and is shown in the full paper) that RA is an equivalence relation on A (similar arguments can be made concerning MH). This implies that all the states of MA are recurrent (none are transient). The equivalence classes of RA are the irreducible components of MA. We first deal with the case where RA consists of one equivalence class (i.e., MA is irreducible).

Proposition 1   Whenever MA is an irreducible chain (has a single irreducible component), it has a unique stationary distribution

$\pi=(\pi_1,\ldots,\pi_{\vert A \vert})$ satisfying:
\begin{displaymath}\pi_i = \frac{d_{in}(i)}{\cal{W}}\ for\ all\ i \in A\end{displaymath}

Similarly, whenever MH is an irreducible chain, its unique stationary distribution $\pi=(\pi_1,\ldots,\pi_{\vert H \vert})$ satisfies:

\begin{displaymath}\pi_k = \frac{d_{out}(k)}{\cal{W}}\ for\ all\ k \in H\end{displaymath}

Proof: We will prove the proposition for MA. The proof for MH is similar.

By the Ergodic Theorem ([9]), any irreducible, aperiodic Markov chain has a unique stationary distribution vector. It will therefore suffice to show that the vector $\pi$ with the properties claimed in the proposition is indeed a stationary distribution vector of MA.

  1. $\pi$ is a distribution vector: Its entries are non-negative, and their sum equals one.
    \begin{displaymath}\sum_{i \in A} \pi_i = \sum_{i \in A} \frac{d_{in}(i)}{\cal{W}} =\frac{1}{\cal{W}} \sum_{i \in A} d_{in}(i) = 1\end{displaymath}
  2. $\pi$ is a stationary distribution vector of MA. Here we need to show the equality $\pi P_A=\pi$:
    \begin{eqnarray*}
    [\pi P_A]_i & = & \sum_{j \in A} \pi_j P_A(j,i) \\
    & = & \sum_{j \in A} \frac{d_{in}(j)}{\cal{W}} \sum_{\{ k \in H\ \vert\ k \rightarrow j,\ k \rightarrow i \}} \frac{w(k \rightarrow j)}{d_{in}(j)} \cdot \frac{w(k \rightarrow i)}{d_{out}(k)} \\
    & = & \frac{1}{\cal{W}} \sum_{\{ k \in H\ \vert\ k \rightarrow i \}} \frac{w(k \rightarrow i)}{d_{out}(k)} \sum_{\{ j \in A\ \vert\ k \rightarrow j \}} w(k \rightarrow j) \\
    & = & \frac{1}{\cal{W}} \sum_{\{ k \in H\ \vert\ k \rightarrow i \}} w(k \rightarrow i) \\
    & = & \frac{d_{in}(i)}{\cal{W}} \\
    & = & \pi_i
    \end{eqnarray*}
$\Box$

Thus, when the (undirected) support graph of G is connected, SALSA assigns each site an authority weight which is proportional to the sum of weights of its incoming edges. The hub weight of each site is proportional to the sum of weights of its outgoing edges. In unweighted collections (with all edges having unit weight), each site's Stochastic authority(hub) weight is simply proportional to the in(out) degree of the site.

This mathematical analysis, in addition to providing insight about the ranking that is produced by SALSA, also suggests a very simple algorithm for calculating the Stochastic ranking: Simply calculate, for all sites, the sum of weights on their incoming(outgoing) edges, and normalize these two vectors. There is no need to apply any resource-consuming iterative method to approximate the principal eigenvector of the transition matrix of the Markov chain.
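A minimal sketch of this procedure, assuming the weighted informative links are given as (hub, authority, weight) triples:

    from collections import defaultdict

    def salsa_scores(weighted_links):
        """Authority score of a site is proportional to the total weight of
        its incoming links; hub score is proportional to the total weight of
        its outgoing links (Proposition 1)."""
        d_in, d_out, total = defaultdict(float), defaultdict(float), 0.0
        for k, i, w in weighted_links:
            d_out[k] += w
            d_in[i] += w
            total += w
        authority = {i: x / total for i, x in d_in.items()}
        hub = {k: x / total for k, x in d_out.items()}
        return authority, hub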

Markov chains with multiple irreducible components

Consider the case in which the authority chain MA consists of multiple irreducible components. Denote these (pairwise disjoint) components by $A_1,A_2,\ldots,A_k$, where $A_i\subset A, 1 \leq i \leq k$. What will be the outcome of a random walk performed on the set of states A according to the transition matrix PA? To answer this question, we will need some notations: let e denote the distribution vector which assigns equal probability $1/\vert A \vert$ to every state in A; for each site $j \in A$, let c(j) denote the index of the irreducible component which contains j; and let $\pi^i$ denote the unique stationary distribution of the chain restricted to the component $A_i$ (which exists by Proposition 1).

Proposition 2   The random walk on A, governed by the transition matrix PA and started from all states with equal probability, will converge to a stationary distribution as follows:

\begin{displaymath}\lim_{n \rightarrow \infty} eP_A^n = \tilde{\pi}\ \ where\ \ \tilde{\pi}_j = \frac{\vert A_{c(j)} \vert}{\vert A \vert} \cdot \pi^{c(j)}_j \end{displaymath}

Proof: Denote by $p_i^n,\ 1 \leq i \leq k$, the probability of being in a site belonging to $A_i$ after the n-th step of the random walk. This probability is determined by the distribution vector $eP_A^n$. Clearly,

\begin{displaymath}p_i^0 = \sum_{j \in A_i} e_j = \frac{\vert A_i \vert}{\vert A \vert}\end{displaymath}

Since the transition probability between any two sites (states) which belong to different irreducible components is zero, $p_i^n = p_i^0$ for all n (probability does not shift from one component to another). Inside each irreducible component the Ergodic Theorem holds, thus the probabilities which correspond to the sites of $A_i$ in $\lim_{n \rightarrow \infty} eP_A^n$ will be proportional to $\pi^i$, and the proposition follows.

$\Box$

This proposition points out a natural way to compare the authoritativeness of sites from different irreducible components: Simply multiply each site's authority score by the normalized size of the irreducible component to which it belongs. We do not claim that this is in any way optimal, as very small irreducible components should be trimmed from the graph altogether. But the underlying principle is important: Consider the size of the community when evaluating the quality of the top sites in that community. The budget which the Mayor of New York City controls is much larger than that of the Mayor of Oshkosh, Wisconsin.

It is this combination of a site's intra-community authority score and its community's size that allows the Stochastic approach to blend authorities from different aspects of a multi-topic query, and which reduces its vulnerability to the TKC effect.
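A sketch of this combination, assuming the irreducible components of the authority chain have already been identified (for instance, as the connected components of the support graph) and are given as sets of authority sites:

    from collections import defaultdict

    def component_weighted_authorities(weighted_links, components):
        """Inside each component, a site's score is d_in(i) divided by the
        component's total incoming weight (Proposition 1); the score is then
        multiplied by the component's relative size (Proposition 2)."""
        d_in = defaultdict(float)
        for k, i, w in weighted_links:
            d_in[i] += w
        n = sum(len(c) for c in components)
        scores = {}
        for comp in components:
            comp_weight = sum(d_in[i] for i in comp)  # positive: no isolated nodes
            for i in comp:
                scores[i] = (len(comp) / n) * (d_in[i] / comp_weight)
        return scores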

6.2 In-Degree as a Measure of Authority (Revisited)

Extensive research in link-structure analysis has been conducted in recent years under the premise that considering the in-degree of sites as a sole measure of their authority does not produce satisfying results. Kleinberg, as motivation for the Mutual Reinforcement approach, showed some examples of the inadequacy of a simple in-degree ranking ([16]). Our results in section 5.2 seem to contradict this premise: the Stochastic rankings seem quite satisfactory there, and since those collections were unweighted, the Stochastic rankings are equivalent to simple in-degree counts (normalized by the size of the connected component to which each site belongs). To gain more perspective on this apparent contradiction, let us elaborate on the first stage of the meta-algorithm for link-structure analysis (from section 3), in which the graph to be analyzed is assembled:

  1. Given a query, assemble a collection of Web-sites which should contain many hubs and authorities pertaining to the query, and few hubs and authorities for any particular unrelated topic.
  2. Filter out non-informative links connecting sites in the collection.
  3. Assign weights to all non-filtered links. These weights should reflect the information conveyed by the link.

It is only after these steps that the weighted, directed graph is analyzed and the rankings of hubs and authorities are produced. The analysis of the graph, however important, is just the second stage in the meta-algorithm, and the steps involved in the first stage are crucial to the success of the entire algorithm.

Considerable research efforts have been invested in improving the quality of the assembled graphs. The current state-of-the-art techniques for these steps are such that in many cases, simple (and efficient) algorithms and heuristics produce quite satisfying results on the assembled graphs.

It is important to keep in mind the main goal of broad-topic WWW searches, which is to enhance the precision at 10 of the results, not to rank the entire collection of sites correctly. It is entirely irrelevant if the site in place 98 is really better than the site in place 216. The Stochastic ranking, which turns out to be equivalent to a weighted in-degree ranking, discovers the most authoritative sites quite effectively (and very efficiently) in many (carefully assembled) collections. No claim is made on the quality of its ranking on the rest of the sites (which constitute the vast majority of the collection).


7. Conclusions

We have developed a new approach for finding hubs and authorities, which we call SALSA - the Stochastic Approach for Link Structure Analysis. SALSA examines random walks on two different Markov chains which are derived from the link structure of the WWW: the authority chain and the hub chain. The principal community of authorities (hubs) corresponds to the sites that are most frequently visited by the random walk defined by the authority (hub) Markov chain. SALSA and Kleinberg's Mutual Reinforcement approach both fall within the framework of the same meta-algorithm.

We have shown that the ranking produced by SALSA is equivalent to a weighted in/out-degree ranking (with the sizes of irreducible components also playing a part). This makes SALSA computationally lighter than the Mutual Reinforcement approach.

Both approaches were tested on the WWW, where SALSA appears to compare well with the Mutual Reinforcement approach. These tests, as well as analytical work, have revealed a topological phenomenon on the Web called the TKC effect. This effect sometimes derails the Mutual Reinforcement approach, and prevents it from finding relevant authoritative sites (or from finding authorities on all meanings/aspects of the query).

We note that SALSA is less vulnerable to the TKC effect, and produces good results in many cases where the Mutual Reinforcement approach fails to do so.

The following issues are left for future research:
  1. In collections with many connected components, we have studied one manner in which to combine the inner-component authority score with the size of the component. There may be better ways to combine these two factors into a single score.
  2. We have found a simple property of the Stochastic ranking, which enables us to compute this ranking without the need to approximate the principal eigenvector of the stochastic matrix which defines the random walk. Is there some simple property which will allow us to calculate the Mutual Reinforcement ranking without approximating the principal eigenvector of WTW? If not, can we alter the graph G in some simple manner (for instance, change some weights on the edges) so that the Stochastic ranking on the modified graph will be approximately equal to the Mutual Reinforcement ranking on the original graph?

Acknowledgments

The second author would like to thank Udi Manber for introducing him to the search problems studied in this paper, and Udi Manber and Toni Pitassi for delightful and interesting discussions at the early stages of this research.


Bibliography

  1. J. Gary Auguston and Jack Minker. An analysis of some graph theoretical cluster techniques. JACM, 17:4 (1970), 571-588.
  2. Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Proc. 7th International WWW Conference, 1998.
  3. Jeromy Carrière and Rick Kazman. Webquery: Searching and visualizing the web through connectivity. Proc. 6th International WWW Conference, 1997.
  4. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Hypersearching the web. Scientific American, June 1999.
  5. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the web's link structure. IEEE Computer, August 1999.
  6. S. Chakrabarti, B. Dom, D. Gibson, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. ACM SIGIR workshop on Hypertext Information Retrieval on the Web, 1998.
  7. Soumen Chakrabarti, Byron Dom, David Gibson, Jon M. Kleinberg, Prabhakar Raghavan, and Sridhar Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. Proc. 7th International WWW Conference, 1998.
  8. Compaq Computer Corporation. AltaVista net guide. http://www.altavista.com/.
  9. Robert G. Gallager. Discrete Stochastic Processes. Kluwer Academic Publishers, 1996.
  10. E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178 (1972), 471-479.
  11. David Gibson, Jon M. Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.
  12. Google Inc. Google search engine. http://www.google.com/.
  13. Monika R. Henzinger and Krishna Bharat. Improved algorithms for topic distillation in a hyperlinked environment. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in IR, August 1998.
  14. IBM Corporation Almaden Research Center. Clever. http://www.almaden.ibm.com/cs/k53/clever.html.
  15. M.M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14 (1963), 10-25.
  16. Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
  17. Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew S. Tomkins. The web as a graph: Measurements, models and methods. Proceedings of the Fifth International Computing and Combinatorics Conference, 1999.
  18. Ken Law, Thomas Tong, and Alan Wong. Automatic categorization based on link structure, Stanford University, 1999.
  19. Massimo Marchiori. The quest for correct information on the web: Hyper search engines. Proc. 6th International WWW Conference, 1997.
  20. Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. Preliminary version appeared in PODS 98, pp. 159-168.
  21. Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structures from the web. Proc. ACM SIGCHI Conference on Human Factors in Computing, 1996.
  22. C.J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
  23. H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. J. American Soc. Info. Sci., 24 (1973), 265-269.
  24. R. Weiss, B. Vélez, M. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. Gifford. Hypursuit: A hierarchical network search engine that exploits content-link hypertext clustering. Proc. 7th ACM Conference on Hypertext, 1996.
