Today, when searching for information on the WWW, one usually performs a query through a term-based search engine. These engines return, as the query's result, a list of Web sites whose contents match the query. For broad topic queries, such searches often result in a huge set of retrieved documents, many of which are irrelevant to the user. However, much information is contained in the link-structure of the WWW. Information such as which pages are linked to others can be used to augment search algorithms. In this context, Jon Kleinberg introduced the notion of two distinct types of Web-sites: hubs and authorities. Kleinberg argued that hubs and authorities exhibit a mutually reinforcing relationship: A good hub will point to many authorities, and a good authority will be pointed at by many hubs. In light of this, he devised an algorithm aimed at finding authoritative sites.
We present SALSA - a new stochastic approach for link structure analysis, which examines random walks on graphs derived from the link structure. We show that both SALSA and Kleinberg's Mutual Reinforcement approach employ the same meta-algorithm. We then prove that SALSA is equivalent to a weighted in-degree analysis of the link-structure of WWW subgraphs, making it computationally more efficient than the Mutual Reinforcement approach.
We compare the results of applying SALSA to the results derived through Kleinberg's approach.
These comparisons reveal a topological phenomenon called the TKC Effect which, in certain cases, prevents the Mutual Reinforcement approach from identifying meaningful authorities.
The WWW is a rapidly expanding hyperlinked collection of unstructured information. The lack of structure and the enormous volume of the WWW pose tremendous challenges to WWW Information Retrieval systems, called search engines. These search engines are presented with queries, and return a list of Web-sites which are deemed (by the engine) to pertain to the query.
When considering the difficulties which WWW search engines face, we distinguish between narrow-topic queries and broad-topic queries. This distinction pertains to the presence which the query's topic has on the Web:
Narrow topic queries are queries for which very few resources exist on the Web, and which present a "needle in the haystack" challenge for search engines. An example for such a query is an attempt to locate the lyrics of a specific song, by quoting a line from it ("We all live in a yellow submarine"). Search engines encounter a recall challenge when handling such queries - Finding the few resources which pertain to the query.
On the other hand, broad-topic queries pertain to topics for which there is an abundance of information on the Web, sometimes as many as millions of relevant resources (with varying degrees of relevance). The vast majority of users are not interested in retrieving the entire huge set of resources - most users will be quite satisfied with a few authoritative results: Web sites which are highly relevant to the topic of the query, significantly more than most other sites. The challenge which search engines face here is one of precision - Retrieving only the most relevant resources to the query.
This work focuses on finding authoritative resources which pertain to broad-topic queries.
Term-based search engines face both classical problems in Information Retrieval, as well as problems specific to the WWW setting, when handling broad-topic queries. The classic problems include the following issues (,):
In addition to the classical issues in Information Retrieval, there is a Web-specific obstacle which search engines must overcome, called search engine persuasion (). There may be millions of sites pertaining in some manner to broad-topic queries, but most users will only browse through the first ten results returned by their favorite search facility. With the growing economic impact of the WWW, and the growth of e-commerce, it is crucial for businesses to have their sites ranked high by the major search engines. There are quite a few companies who sell this kind of expertise - They design Web sites which are tailored to rank high with specific queries on the major search engines. These companies research the ranking algorithms and heuristics of term-based engines, and know how many keywords to place (and where) in a Web-page so as to improve the page's ranking (which directly impacts the page's visibility). A less sophisticated technique, used by some site creators, is called keyword spamming (). Here, the authors repeat certain terms (some of which are only remotely connected to their site's context), in order to "lure" search engines into ranking them highly for many queries.
The WWW is a hyperlinked collection. In addition to the textual content of the individual pages, the link structure of such collections contains information which can, and should, be tapped when searching for authoritative sources. Consider the significance of a link p → q: With such a link, p suggests, or even recommends, that surfers visiting p follow the link and visit q. This may reflect the fact that pages p and q share a common topic of interest, and that the author of p thinks highly of q's contents. Such a link, called an informative link, is p's way to confer authority on q (). Note that informative links provide a positive critical assessment of q's contents which originates from outside the control of the author of q (as opposed to assessments based on q's textual content, which is under complete control of q's author). This makes the information extracted from informative links less vulnerable to manipulative techniques such as spamming.
Unfortunately, not all links are informative. There are many kinds of links which confer little or no authority (), such as intra-domain (inner) links (whose purpose is to provide navigational aid in a complex Web site of some organization), commercial/sponsor links, and links which result from link-exchange agreements. A crucial task which should be completed prior to analyzing the link structure of a given collection, is to filter out as many of the non-informative links as possible.
Prior to the WWW age, link structures were studied in the area of bibliometrics, which studies the citation structure of written documents (,). Many works in this area were aimed at finding high-impact papers published in scientific journals (), and at clustering related documents ().
Some works have studied the Web's link structure, in addition to the textual content of the pages, as means to visualize areas thought to contain good resources (). Other works used link structures for categorizing pages and clustering them (,).
Marchiori, in , uses the link-structure of the Web to enhance search results of term-based search engines. This is done by considering the potential hyper-information contained in each Web-page: The information that can be found when following hyperlinks which originate in the page.
This work is motivated by the approach introduced by Jon Kleinberg (). In an attempt to impose some structure on the chaotic WWW, Kleinberg distinguished between two types of Web-sites which pertain to a certain topic: The first are authoritative pages in the sense described previously. The second type of sites are hub pages. Hubs are resource lists - They do not directly contain information pertaining to the topic, but rather point to many authoritative sites. According to this model, hubs and authorities exhibit a mutually reinforcing relationship: Good hubs point to many good authorities, and good authorities are pointed at by many good hubs. In light of the mutually reinforcing relationship, hubs and authorities should form communities, which can be pictured as dense bipartite portions of the Web, where the hubs link densely to the authorities. The most prominent community in a WWW subgraph is called the principal community of the collection.
Kleinberg suggested an algorithm to identify these communities, which is described in detail in section 2.
Researchers from IBM's Almaden Research Center have implemented Kleinberg's algorithm in various projects. The first was HITS, which is described in , and offers some enlightening practical remarks. The ARC system, described in , augments Kleinberg's link-structure analysis by also considering the anchor text, the text which surrounds the hyperlink in the pointing page. The reasoning behind this is that many times, the pointing page describes the destination page's contents around the hyperlink, and thus the authority conferred by the links can be better assessed. These projects were extended by the CLEVER project (). Researchers from outside IBM, such as Bharat and Henzinger, have also studied Kleinberg's approach and have proposed improvements to it ().
Anchor text has also been used by Brin and Page in . Another major feature of their work on the Google search engine () is a link-structure based site ranking approach called PageRank, which can be interpreted as a stochastic analysis of some random-walk behavior through the entire WWW.
In , the authors use the links surrounding a small set of same-topic sites to assemble a larger collection of neighboring pages which should contain many authoritative resources on the initial topic. The textual content of the collection is then analyzed in ranking the relevancy of its individual pages.
While preserving the theme that Web sites pertaining to a given topic should be split to hubs and authorities, we replace Kleinberg's Mutual Reinforcement approach () by a new stochastic approach (SALSA), in which the coupling between hubs and authorities is less tight. The intuition behind our approach is the following: consider a bipartite graph G, whose two parts correspond to hubs and authorities, where an edge between hub r and authority s means that there is an informative link from r to s. Then, authorities and hubs pertaining to the dominant topic of the sites in G should be highly visible (reachable) from many sites in G. Thus, we will attempt to identify these sites by examining certain random walks in G, under the proviso that such random walks will tend to visit these highly visible sites more frequently than other, less connected sites. We show that in finding the principal communities of hubs and authorities, both Kleinberg's Mutual Reinforcement approach and our Stochastic approach employ the same meta-algorithm on different representations of the input graph. We then compare the results of applying SALSA to the results derived by Kleinberg's approach. Through these comparisons, we isolate a particular topological phenomenon which we call the Tightly Knit Community (TKC) Effect. In certain scenarios, this effect hampers the ability of the Mutual Reinforcement approach to identify meaningful authorities. We demonstrate that SALSA is less vulnerable to the TKC effect, and can find meaningful authorities in collections where the Mutual Reinforcement approach fails to do so.
After demonstrating some results achieved by means of SALSA, we prove that the ranking of sites in the Stochastic approach may be calculated by examining the weighted in/out degrees of the sites in G. This result yields that SALSA is computationally lighter than the Mutual Reinforcement approach. We also discuss the reason for our success with analyzing weighted in/out degrees of sites, which previous work has claimed to be unsatisfactory for identifying authoritative sites.
The rest of the paper is organized as follows: Section 2 recounts Kleinberg's Mutual Reinforcement Approach. In section 3 we view Kleinberg's approach from a higher level, and define a meta-algorithm for link structure analysis. Section 4 presents our new approach, SALSA. In section 5 we compare the two approaches by considering their outputs on the WWW and on artificial topologies. Then, in section 6 we prove the connection between SALSA and weighted in/out degree rankings of sites. Our conclusions and ideas for future work are brought in section 7. The paper uses basic results from the theory of stochastic processes, which are brought in the full version. The main contribution of the paper can be grasped without following the full mathematical analysis.
The Mutual Reinforcement approach () starts by assembling a collection C of Web-sites, which should contain communities of hubs and authorities pertaining to a given topic t. It then analyzes the link structure induced by that collection, in order to find the authoritative sites on topic t.
Denote by q a term-based search query to which sites in our topic of interest t are deemed to be relevant. The collection C is assembled in the following manner:
The collection C and its link structure induce the following directed graph G: G's nodes are the sites in C, and for all i, j ∈ C, the directed edge i → j appears in G if and only if site i contains a hyperlink to site j. Let W denote the |C| × |C| adjacency matrix of G.
Each site s ∈ C is now assigned a pair of weights, a hub weight h(s) and an authority weight a(s), based on the following two principles:
The top ranking sites, according to both kinds of weights, form the Mutually Reinforcing communities of hubs and authorities. In order to assign such weights, Kleinberg uses the following iterative algorithm:
Note that applying the I operation is equivalent to assigning authority weights according to the result of multiplying the vector of all hub weights by the matrix W^T. The O operation is equivalent to assigning hub weights according to the result of multiplying the vector of all authority weights by the matrix W.
Kleinberg showed that this algorithm converges, and that the resulting authority weights [hub weights] are the coordinates of the normalized principal eigenvector of W^T W [of W W^T] (the eigenvector which corresponds to the eigenvalue of highest magnitude of the matrix). W^T W and W W^T are well known matrices in the field of bibliometrics:
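As an illustration (a minimal sketch of ours, not Kleinberg's implementation), the iterative algorithm can be written as a power iteration over the adjacency matrix:

```python
import math

def hits(W, iterations=100):
    """Kleinberg's iterative algorithm, as a power iteration.

    W is the |C| x |C| adjacency matrix as a list of lists:
    W[i][j] == 1 iff site i contains a hyperlink to site j.
    Returns (authority_weights, hub_weights), each of unit length.
    """
    n = len(W)
    a, h = [1.0] * n, [1.0] * n
    for _ in range(iterations):
        # The I operation: a <- W^T h (sum the hub weights of pages pointing to s).
        a = [sum(W[i][j] * h[i] for i in range(n)) for j in range(n)]
        # The O operation: h <- W a (sum the authority weights of pages s points to).
        h = [sum(W[i][j] * a[j] for j in range(n)) for i in range(n)]
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nh = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h
```

On a toy graph in which site 0 links to sites 1 and 2, and site 1 links to site 2, the returned authority vector approaches the principal eigenvector of W^T W, with site 2 (in-degree 2) ranked first.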
Examining the Mutual Reinforcement approach from a higher level, we can identify a general framework, or meta-algorithm, for finding hubs and authorities by link-structure analysis. This meta-algorithm is a version of the spectral filtering method, presented in :
M will have a unique real positive eigenvalue µ(M) of multiplicity 1, such that for any other eigenvalue µ' of M, µ(M) > |µ'|. Denote by v_µ(M) the (unique) unit eigenvector which corresponds to µ(M) whose first non-zero coordinate is positive. v_µ(M) will actually be a positive vector, and will be referred to as the principal eigenvector of M.
For the meta-algorithm to be useful, the algebraic principal communities of hubs and authorities should reflect the true authorities and hubs in C.
The two degrees of freedom which the meta-algorithm allows, are the method for obtaining the collection, and the definition of the association matrices. Given a specific collection, the algebraic communities produced by the meta-algorithm are determined solely by the definition of the association matrices.
In this section we introduce the Stochastic Approach for Link Structure Analysis - SALSA. The approach is based upon the theory of Markov chains, and relies on the stochastic properties of random walks performed on our collection of sites. It follows the meta-algorithm described in section 3, and differs from the Mutual Reinforcement approach in the manner in which the association matrices are defined.
The input to our scheme consists of a collection of sites C which is built around a topic t in the manner described in section 2. Intuition suggests that authoritative sites on topic t should be visible from many sites in the subgraph induced by C. Thus, a random walk on this subgraph will visit t-authorities with high probability.
We combine the theory of random walks with the notion of the two distinct types of Web sites, hubs and authorities, and actually analyze two different Markov chains: A chain of hubs and a chain of authorities. Unlike "conventional" random walks on graphs, state transitions in these chains are generated by traversing two WWW-links in a row, one link forward and one link backwards (or vice versa). Analyzing both chains allows our approach to give each Web site two distinct scores, a hub score and an authority score.
The idea of ranking Web sites using random walks is not new. The search engine Google (, ) incorporates stochastic information into its ranking of pages. The PageRank component of the search engine examines a single random walk on the entire WWW. Hence, the ranking of Web sites in Google is independent of the search query (a global ranking), and no distinction is made between hubs and authorities.
Let us build a bipartite undirected graph G'=(Vh ,Va ,E) from our site collection C and its link structure:
On this bipartite graph we will perform two distinct random walks. Each walk will only visit nodes from one of the two sides of the graph, by traversing paths consisting of two G'-edges in each step. Since each edge crosses sides of G', each walk is confined to just one of the graph's sides, and the two walks will naturally start off from different sides of G'. Note also that every path of length 2 in G' represents a traversal of one WWW link in the proper direction (when passing from the hub-side of G' to the authority-side), and a retreat along a WWW link (when crossing in the other direction). Since the hubs and authorities of topic t should be highly visible in G' (reachable from many nodes by either a direct edge or by short paths), we may expect that the t-authorities will be amongst the nodes most frequently visited by the random walk on Va, and that the t-hubs will be amongst the nodes most frequently visited by the random walk on Vh.
We will examine the two different Markov chains which correspond to these random walks: The chain of the visits to the authority side of G' (the authority chain), and the chain of visits to the hub side of G'. Analyzing these chains separately naturally distinguishes between the two aspects of each site.
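For concreteness, the authority chain can be simulated directly (an illustrative sketch of ours; the node names are hypothetical). Each step retreats along a random incoming link to a hub and then advances along a random outgoing link of that hub; on a connected collection, the visit frequencies approach the chain's stationary distribution:

```python
import random

def authority_chain_visits(edges, steps=50000, seed=0):
    """Simulate the authority chain induced by a list of (hub, authority) links."""
    rng = random.Random(seed)
    in_links, out_links = {}, {}
    for h, a in edges:
        in_links.setdefault(a, []).append(h)
        out_links.setdefault(h, []).append(a)
    visits = {a: 0 for a in in_links}
    state = next(iter(in_links))            # start at an arbitrary authority
    for _ in range(steps):
        h = rng.choice(in_links[state])     # one link backwards (authority -> hub)
        state = rng.choice(out_links[h])    # one link forwards (hub -> authority)
        visits[state] += 1
    return visits
```

For example, if hubs h1..h4 all point to authority a1 while each also points to one private authority, then a1 holds half of the in-links and the walk visits it roughly half the time.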
We now define two stochastic matrices, which are the transition matrices of the two Markov chains of interest:
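The entry-wise definitions of these matrices did not survive in this version of the text; reconstructed so as to agree with the matrix forms W_r and W_c given below (deg(·) denotes a node's degree in G'), they read:

```latex
\tilde{h}_{i,j} \;=\; \sum_{\{a \,\mid\, (i_h,a),\,(j_h,a)\in E\}} \frac{1}{\deg(i_h)}\cdot\frac{1}{\deg(a)},
\qquad
\tilde{a}_{i,j} \;=\; \sum_{\{h \,\mid\, (h,i_a),\,(h,j_a)\in E\}} \frac{1}{\deg(i_a)}\cdot\frac{1}{\deg(h)}.
```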
A positive transition probability ã_{i,j} > 0 implies that a certain page h points to both pages i and j, and hence page j is reachable from page i in two steps: retracting along the link h → i and then following the link h → j.
Alternatively, the matrices H̃ and Ã can be defined as follows: Let W be the adjacency matrix of the directed graph defined by C and its link structure. Denote by W_r the matrix which results by dividing each nonzero entry of W by the sum of the entries in its row, and by W_c the matrix which results by dividing each nonzero entry of W by the sum of the entries in its column (obviously, the sums of rows/columns which contain nonzero elements are greater than zero). Then H̃ consists of the non-zero rows and columns of W_r W_c^T, and Ã consists of the non-zero rows and columns of W_c^T W_r. We ignore the rows and columns of H̃ and Ã which consist entirely of zeros, since (by definition) all the nodes of G' have at least one incident edge. The matrices Ã and H̃ serve as the association matrices required by the meta-algorithm for identifying the authorities and hubs. Recall that the Mutual Reinforcement approach uses the association matrices A = W^T W and H = W W^T.
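A toy implementation of this construction (ours, for illustration; all-zero rows and columns are kept rather than removed, which does not affect the nonzero entries):

```python
def salsa_matrices(W):
    """Build the SALSA association matrices A~ = Wc^T Wr and H~ = Wr Wc^T.

    W is the adjacency matrix (list of lists); Wr divides each nonzero entry
    by its row sum, Wc by its column sum.
    """
    n = len(W)
    row = [sum(W[i]) for i in range(n)]
    col = [sum(W[i][j] for i in range(n)) for j in range(n)]
    Wr = [[W[i][j] / row[i] if row[i] else 0.0 for j in range(n)] for i in range(n)]
    Wc = [[W[i][j] / col[j] if col[j] else 0.0 for j in range(n)] for i in range(n)]
    # A~[i][j] = sum_k Wc[k][i] * Wr[k][j]  (i.e. Wc^T Wr)
    A = [[sum(Wc[k][i] * Wr[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    # H~[i][j] = sum_k Wr[i][k] * Wc[j][k]  (i.e. Wr Wc^T)
    H = [[sum(Wr[i][k] * Wc[j][k] for k in range(n)) for j in range(n)] for i in range(n)]
    return A, H
```

The rows of Ã (H̃) that correspond to sites with at least one incoming (outgoing) link sum to 1, i.e., the matrices are stochastic once the all-zero rows and columns are dropped.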
We shall assume that G' is connected, causing both stochastic matrices Ã and H̃ to be irreducible. This assumption does not form a limiting factor, since when G' is not connected, we may use our technique on each connected component separately. Section 6.1 further elaborates on the case when Ã and H̃ have multiple irreducible components.
Some properties of H̃ and Ã:
Following the framework of the meta-algorithm, the principal community of authorities (hubs) found by SALSA will be composed of the sites whose entries in the principal eigenvector of Ã (H̃) are the highest. By the Ergodic Theorem (), the principal eigenvector of an irreducible, aperiodic stochastic matrix is actually the stationary distribution of the underlying Markov chain, and its high entries correspond to the sites most frequently visited by the (infinite) random walk.
A tightly-knit community is a small but highly interconnected set of sites. Roughly speaking, the TKC effect occurs when such a community scores high in link-analyzing algorithms, even though the sites in the TKC are not authoritative on the topic, or pertain to just one aspect of the topic. Our study indicates that the Mutual Reinforcement approach is vulnerable to this effect, and will sometimes rank the sites of a TKC in unjustified high positions.
As an example, consider a collection C which contains the following two communities: A community y, with a small number of hubs and authorities, in which every hub points to most of the authorities; and a much larger community z, in which each hub points to a smaller part of the authorities. The topic covered by z is the dominant topic of the collection, and is probably of wider interest on the WWW. Since there are many z-authoritative sites, the hubs do not link to all of them, whereas the smaller y community is densely interconnected. The TKC effect occurs when the sites of y are ranked higher than those of z.
In the full paper we provide a combinatorial construction, which demonstrates such (artificial) communities y and z, where the Mutual Reinforcement approach scores y higher than z, and the stochastic approach scores z higher. This bias of the Mutual Reinforcement approach towards tightly-knit communities will be demonstrated on WWW queries in the next section.
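The flavor of that construction can be reproduced with a small synthetic example (ours, not the full paper's exact construction): a tightly-knit community y of 3 hubs pointing to all of 3 authorities, versus a larger community z of 10 hubs and 5 authorities in which each hub points to only two authorities. The SALSA scores are computed here via the weighted in-degree characterization proved in section 6:

```python
def hits_authorities(n, edges, iterations=200):
    """Mutual Reinforcement authority scores by power iteration."""
    h = [1.0] * n
    a = [0.0] * n
    for _ in range(iterations):
        a = [0.0] * n
        for p, q in edges:          # I operation: a[q] += h[p] over links p -> q
            a[q] += h[p]
        h = [0.0] * n
        for p, q in edges:          # O operation: h[p] += a[q]
            h[p] += a[q]
        a = [x / (max(a) or 1.0) for x in a]
        h = [x / (max(h) or 1.0) for x in h]
    return a

# Community y (tightly knit): hubs 0-2 each point to all authorities 3-5.
edges = [(p, q) for p in range(3) for q in range(3, 6)]
# Community z (larger, sparser): hubs 6-15, authorities 16-20;
# hub 6+i points to only two of the five authorities.
edges += [(6 + i, 16 + (i % 5)) for i in range(10)]
edges += [(6 + i, 16 + ((i + 1) % 5)) for i in range(10)]
n = 21

mr = hits_authorities(n, edges)
top_mr = max(range(n), key=lambda s: mr[s])        # a y-authority: the TKC wins

indeg = [0] * n                                    # SALSA ~ in-degree (section 6)
for p, q in edges:
    indeg[q] += 1
top_salsa = max(range(n), key=lambda s: indeg[s])  # a z-authority
```

Here every y authority has in-degree 3 while every z authority has in-degree 4, yet the Mutual Reinforcement principal eigenvector concentrates on y, since y's principal eigenvalue of W^T W (9, from the complete 3-by-3 bipartite block) exceeds z's (8).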
We tested the different approaches on broad-topic WWW queries (both single-topic queries and multi-topic queries). We obtained a collection of sites for each query, and then derived the principal community of authorities with both approaches. Two of these queries ("java", "abortion") were used by Kleinberg in , and are brought here for the sake of comparison. All collections were assembled during February, 1999. The root sets were compiled using AltaVista (), which also provided the linkage information needed for building the base sets.
When expanding the root set to the entire collection, we filtered the links pointing to and from Web sites. Following , we ignored intra-domain links (since these links tend to be navigational aids inside an intranet, and do not confer authority on the link's destination). We also ignored links to cgi scripts, and tried to identify ad-links and ignore them as well. Overall, 38% of the links we examined were ignored. The collections themselves turn out to be relatively sparse graphs, with the number of edges never exceeding three times the number of nodes. We note that a recent work by Kleinberg et al. () has examined some other connectivity characteristics of such collections.
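A filter of this kind can be sketched as follows (illustrative only; our actual heuristics, e.g. for identifying ad-links, were more involved, and real intra-domain detection should compare registered domains rather than full hostnames):

```python
from urllib.parse import urlsplit

def is_informative(src_url, dst_url):
    """Keep a link only if it plausibly confers authority."""
    src_host = urlsplit(src_url).hostname or ""
    dst = urlsplit(dst_url)
    dst_host = dst.hostname or ""
    if src_host == dst_host:          # intra-domain link: navigational aid
        return False
    if "/cgi-bin/" in dst.path or dst.path.endswith(".cgi"):
        return False                  # link to a cgi script
    return True
```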
For each query, we list the top authorities which were returned by the two approaches. The results are displayed in tables containing four columns:
The results for this query, with our first example of the TKC effect, are shown in table 1. All of the top ten Mutual Reinforcement authorities are part of the EARTHWEB Inc. network. They are interconnected, but since the domain names of the sites are different, the interconnecting links were not filtered out. Some of the sites are highly relevant to the query (and have many incoming links from sites outside the EarthWeb net), but most appear in the principal community only because of their EarthWeb affiliation. With SALSA, only the top three Mutual Reinforcement authorities are retained, and the other seven are replaced by other authorities, some of which are clearly more related to the query.
Size of root set = 160, Size of collection = 2810
Principal Community, Mutual Reinforcement Approach:
|http://www.jars.com/||EarthWeb's JARS.COM Java Review Service||(3)||0.334102|
|http://www.gamelan.com/||Gamelan - The Official Java Directory||(3)||0.303624|
|http://www.roadcoders.com/||Handheld Software Development@RoadCoders||(3)||0.250816|
|http://www.earthwebdirect.com/||Welcome to Earthweb Direct||(3)||0.247467|
Principal Community, SALSA:
|http://java.sun.com/||Java(tm) Technology Home Page||(3)||0.365264|
|http://www.gamelan.com/||Gamelan - The Official Java Directory||(3)||0.36369|
|http://www.jars.com/||EarthWeb's JARS.COM Java Review Service||(3)||0.303862|
|http://www.javaworld.com/||IDG's magazine for the Java community||(3)||0.217269|
|http://www.javasoft.com/||Java(tm) Technology Home Page||(3)||0.203099|
|http://www.htmlgoodies.com/||htmlgoodies.com - Home||(3)||0.130676|
|http://javaboutique.internet.com/||The Ultimate Java Applet Resource||(1)||0.118081|
This query demonstrates the TKC effect in a most striking fashion on the WWW. First, consider the Mutual Reinforcement principal community of authorities, presented in table 2:
Size of root set = 175, Size of collection = 4539
|http://go.msn.com/bql/whitepages.asp||White Pages - msn.com||(3)||0.167202|
|http://go.msn.com/bql/maps.asp||Microsoft Expedia Maps-Home||(3)||0.167202|
The top 30 authorities returned by the Mutual Reinforcement approach were all go.msn.com sites. All but the first received the exact same weight, 0.167202. Recall that we do not allow same-domain links in our collection, hence none of the top authorities was pointed at by a go.msn.com site. To understand how these sites scored so well, we turn to the principal community of hubs, shown in table 3:
These innocent looking hubs are all part of the Microsoft Network (msn), but when building the basic set we did not identify them as such. All these hubs point, almost without exception, to the entire set of authorities found by the MR approach (hence the equal weights which the authorities exhibit). However, the vast majority of the sites in the collection were not part of this "conspiracy", and almost never pointed to any of the go.msn.com sites. Therefore, the authorities returned by the Stochastic approach (table 4) contain none of those go.msn.com sites, and are much more relevant to the query:
|http://us.imdb.com/||The Internet Movie Database||(3)||0.253333|
|http://www.disney.com/||Disney.com-The Web Site for Families||(3)||0.22003|
|http://www.hollywood.com/||Hollywood Online:...all about movies||(3)||0.213355|
|http://www.imdb.com/||The Internet Movie Database||(3)||0.199987|
|http://www.paramount.com/||Welcome to Paramount Pictures||(3)||0.196682|
A similar community is obtained by the Mutual Reinforcement approach, after deleting the rows and columns which correspond to the top 30 authorities from the matrix WTW. This deletion dissolves the msn.com community, and allows a community similar to the one obtained by SALSA to manifest itself.
This topic is highly polarized, with different cyber communities supporting pro-life and pro-choice views. In table 5, we bring the top 10 authorities, as determined by the two approaches:
Size of root set = 160, Size of collection = 1693
Principal Community, Mutual Reinforcement Approach:
|http://www.nrlc.org/||National Right To Life||(3)||0.420832|
|http://www.prolife.org/ultimate/||The Ultimate Pro-Life Resource List||(3)||0.316564|
|http://www.all.org/||What's new at American Life League||(3)||0.251506|
|http://www.hli.org/||Human Life International||(3)||0.212931|
|http://www.prolife.org/cpcs-online/||Crisis Pregnancy Centers Online||(3)||0.187707|
|http://www.ohiolife.org/||Ohio Right to Life||(3)||0.182076|
|http://www.rtl.org/||Abortion, adoption and assisted-suicide Information at Right to Life...|
|http://www.bethany.org/||Bethany Christian Services||(3)||0.161359|
|http://www.ldi.org/||abortion malpractice litigation||(1)||0.140076|
|http://www.serve.com/fem4life/||Feminists for Life of America||(3)||0.122106|
Principal Community, SALSA:
|http://www.nrlc.org/||National Right To Life||(3)||0.344029|
|http://www.prolife.org/ultimate/||The Ultimate Pro-Life Resource List||(3)||0.284714|
|http://www.naral.org/||NARAL Choice for America||(3)||0.240227|
|http://www.feminist.org/||Feminist Majority Foundation||(3)||0.186843|
|http://www.now.org/||National Organization for Women||(3)||0.177946|
|http://www.cais.com/agm/main/||The Abortion Rights Activist||(1)||0.166083|
|http://www.gynpages.com/||Abortion Clinics Online||(3)||0.163117|
|http://www.plannedparenthood.org/||Planned Parenthood Federation||(3)||0.157186|
|http://www.all.org/||What's new at American Life League||(3)||0.142357|
|http://www.hli.org/||Human Life International||(3)||0.142357|
All 10 top authorities found by the Mutual Reinforcement approach are pro-life resources, while the top 10 SALSA authorities are split, with 6 pro-choice sites and 4 pro-life sites (which are the same top 4 pro-life sites found by the Mutual Reinforcement approach). Again, we see the TKC effect: The Mutual Reinforcement approach ranks highly authorities on only one aspect of the query, while SALSA blends authorities from both aspects into its principal community.
This query is especially ambiguous in the WWW: It can be in the context of genetic engineering, genetic algorithms, or in the context of health issues and the human genome.
As in the "abortion" query, SALSA brings a diverse principal community, with authorities on the various contexts of the query, while the Mutual Reinforcement approach is focused on one context (Genetic Algorithms, in this case). Both principal communities are shown in table 6:
Size of root set = 120, Size of collection = 2952
Principal Community, Mutual Reinforcement Approach:
|http://www.aic.nrl.navy.mil/galist/||The Genetic Algorithms Archive||(3)||0.27848|
|http://alife.santafe.edu/||Artificial Life Online||(3)||0.276159|
|http://www.geneticprogramming.com/||The Genetic Programming Notebook||(1)||0.25588|
|http://gal4.ge.uiuc.edu/illigal.home.html||illiGAL Home Page||(3)||0.235717|
|http://www.cs.gmu.edu/research/gag/||The Genetic Algorithms Group...||(3)||0.201237|
|Genetic Algorithms and Artificial Life Resources||(1)||0.181315|
|http://lancet.mit.edu/ga/||GAlib: Matthew's Genetic Algorithms Library||(3)||0.181157|
Principal Community, SALSA:
|http://www.ncbi.nlm.nih.gov/||The National Center for Biotechnology Information||
|http://www.aic.nrl.navy.mil/galist/||The Genetic Algorithms Archive||(3)||0.223191|
|http://www.nih.gov/||National Institute of Health (NIH)||(3)||0.194688|
|http://gdbwww.gdb.org/||The Genome Database||(3)||0.177001|
|http://alife.santafe.edu/||Artificial Life Online||(3)||0.172383|
|http://www.genengnews.com/||Genetic Engineering News (GEN)||(1)||0.141617|
|http://gal4.ge.uiuc.edu/illigal.home.html||illiGAL Home Page||(3)||0.13259|
In the previous sections we have presented the Stochastic approach as an alternative method for link-structure analysis, and have shown a few encouraging results obtained by it, as compared with the Mutual Reinforcement approach. We have also presented the TKC effect, a topological phenomenon which sometimes derails the MR approach and prevents it from converging to a useful community of authoritative sites.
The sample results shown so far have all been produced on unweighted collections, in which all informative links have received unit weight. Both approaches can produce better rankings when applied on weighted collections, in which each informative link receives a weight which reflects the amount of authority that the pointing site confers to the pointed site. Possible factors which may contribute to a link's weight include the following:
We now prove a general result about the ranking produced by SALSA in weighted collections, for which some basic background in stochastic processes is assumed.
Let G=(H, A, E) be a positively weighted, directed bipartite graph with no isolated nodes, and let all edges be directed from sites in H to sites in A. We will use the following notations: w(h, a) denotes the weight of the edge (h, a) ∈ E; w_out(h) = Σ_{(h,a)∈E} w(h, a) denotes the total weight of h's outgoing edges; w_in(a) = Σ_{(h,a)∈E} w(h, a) denotes the total weight of a's incoming edges; and W = Σ_{e∈E} w(e) denotes the total weight of all edges.
Let MA be a Markov chain whose states are the set A of vertices, with the following transition probabilities between every two states i, j ∈ A:
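The formula itself is missing from this version; reconstructed from the unweighted definitions so as to be consistent with the stationary distribution of Proposition 1, it reads:

```latex
P_A(i,j) \;=\; \sum_{\{k \,\mid\, (k,i)\in E,\ (k,j)\in E\}}
\frac{w(k,i)}{\sum_{\{k' \mid (k',i)\in E\}} w(k',i)}
\cdot
\frac{w(k,j)}{\sum_{\{j' \mid (k,j')\in E\}} w(k,j')}
```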
Similarly, let MH be a Markov chain whose states are the set H of vertices, with the following transition probabilities between every two states k, l ∈ H:
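This formula is likewise missing; the analogous reconstruction for the hub chain (again consistent with Proposition 1) is:

```latex
P_H(k,l) \;=\; \sum_{\{i \,\mid\, (k,i)\in E,\ (l,i)\in E\}}
\frac{w(k,i)}{\sum_{\{i' \mid (k,i')\in E\}} w(k,i')}
\cdot
\frac{w(l,i)}{\sum_{\{k' \mid (k',i)\in E\}} w(k',i)}
```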
Consider the following binary relation $R_A$ on the vertices of $A$ (states of $M_A$): $i\,R_A\,j$ whenever $j$ is reachable from $i$ in $M_A$, that is, whenever $P_A^{(n)}(i,j) > 0$ for some $n \geq 0$.
It is not hard to show (and is shown in the full paper) that RA is an equivalence relation on A (similar arguments can be made concerning MH). This implies that all the states of MA are recurrent (none are transient). The equivalence classes of RA are the irreducible components of MA. We first deal with the case where RA consists of one equivalence class (i.e., MA is irreducible).
Proposition 1 Whenever $M_A$ is an irreducible chain (has a single irreducible component), it has a unique stationary distribution $\pi = (\pi_a)_{a \in A}$, which satisfies $\pi_a = d_{in}(a)/W$ for every $a \in A$ (each authority's stationary probability is its weighted in-degree, normalized by the total edge weight $W$).
Similarly, whenever $M_H$ is an irreducible chain, its unique stationary distribution $\sigma = (\sigma_h)_{h \in H}$ satisfies $\sigma_h = d_{out}(h)/W$ for every $h \in H$.
Proof: We will prove the proposition for MA. The proof for MH is similar.
By the Ergodic Theorem (), any irreducible, aperiodic Markov chain has a unique stationary distribution vector. $M_A$ is aperiodic, since every state $i$ has a positive self-transition probability $P_A(i,i) > 0$ (step back to any hub pointing at $i$ and return along the same edge). It will therefore suffice to show that the vector $\pi$ with $\pi_a = d_{in}(a)/W$ is indeed a stationary distribution of $M_A$, i.e. that $\pi P_A = \pi$. Indeed, for every $j \in A$:
$$(\pi P_A)_j = \sum_{i \in A} \frac{d_{in}(i)}{W} \sum_{k:\,(k,i),(k,j)\in E} \frac{w(k,i)}{d_{in}(i)} \cdot \frac{w(k,j)}{d_{out}(k)} = \frac{1}{W} \sum_{k:\,(k,j)\in E} \frac{w(k,j)}{d_{out}(k)} \sum_{i:\,(k,i)\in E} w(k,i) = \frac{1}{W} \sum_{k:\,(k,j)\in E} w(k,j) = \pi_j$$
Thus, when the (undirected) support graph of G is connected, SALSA assigns each site an authority weight which is proportional to the sum of weights of its incoming edges. The hub weight of each site is proportional to the sum of weights of its outgoing edges. In unweighted collections (with all edges having unit weight), each site's Stochastic authority (hub) weight is simply proportional to the in-degree (out-degree) of the site.
This mathematical analysis, in addition to providing insight about the ranking that is produced by SALSA, also suggests a very simple algorithm for calculating the Stochastic ranking: Simply calculate, for all sites, the sum of weights on their incoming (outgoing) edges, and normalize these two vectors. There is no need to apply any resource-consuming iterative method to approximate the principal eigenvector of the transition matrix of the Markov chain.
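As a sketch of this shortcut (the toy edge list and variable names are our own illustration, not from the paper), the full Stochastic ranking amounts to two accumulation passes over the edge list and a normalization:

```python
from collections import defaultdict

# Hypothetical weighted collection: (hub, authority, weight) triples.
edges = [("h1", "a1", 1.0), ("h1", "a2", 2.0),
         ("h2", "a2", 1.0), ("h2", "a3", 1.0)]

auth_score = defaultdict(float)  # sum of weights on incoming edges
hub_score = defaultdict(float)   # sum of weights on outgoing edges
for h, a, w in edges:
    auth_score[a] += w
    hub_score[h] += w

# Normalize so that each score vector sums to 1.
W = sum(w for _, _, w in edges)
auth_score = {a: s / W for a, s in auth_score.items()}
hub_score = {h: s / W for h, s in hub_score.items()}

# The authority ranking, best first.
ranking = sorted(auth_score, key=auth_score.get, reverse=True)
```

No eigenvector computation is needed: when the support graph is connected, a single scan of the edge list yields the same stationary distribution that power iteration on the authority chain would converge to.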
Consider the case in which the authority chain $M_A$ consists of multiple irreducible components. Denote these (pairwise disjoint) components by $A_1, \ldots, A_k$, where $A = \bigcup_{j=1}^{k} A_j$. What will be the outcome of a random walk performed on the set of states $A$ according to the transition matrix $P_A$, when started from the uniform distribution?
Proposition 2 The random walk on $A$, governed by the transition matrix $P_A$ and started from all states with equal probability, will converge to the stationary distribution $\pi$ in which, for every site $a$ belonging to the irreducible component $A_i$,
$$\pi_a = \frac{|A_i|}{|A|} \cdot \frac{d_{in}(a)}{\sum_{b \in A_i} d_{in}(b)}$$
Proof: Denote by $p_i^{(n)}$ the probability of being in a site belonging to $A_i$ after the $n$'th step of the random walk. This probability is determined by the distribution vector $\frac{1}{|A|}\,e\,P_A^n$, where $e$ is the all-ones row vector. Clearly, $p_i^{(0)} = |A_i|/|A|$.
Since the transition probability between any two sites (states) which belong to different irreducible components is zero, $p_i^{(n)} = p_i^{(0)}$ for all $n$ (probability does not shift from one component to another). Inside each irreducible component the Ergodic Theorem holds; thus, in the limit, the probabilities which correspond to the sites of $A_i$ will be proportional to their weighted in-degrees $d_{in}(a)$, and the proposition follows.
This proposition points out a natural way to compare the authoritativeness of sites from different irreducible components: simply multiply each site's authority score by the normalized size of the irreducible component to which it belongs. We do not claim that this is in any way optimal; indeed, very small irreducible components should probably be trimmed from the graph altogether. But the underlying principle is important: consider the size of the community when evaluating the quality of the top sites in that community. The budget which the Mayor of New York City controls is much larger than that of the Mayor of Oshkosh, Wisconsin.
It is this combination of a site's intra-community authority score and its community's size that allows the Stochastic approach to blend authorities from different aspects of a multi-topic query, and which reduces its vulnerability to the TKC effect.
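A minimal sketch of this component-weighted scoring follows; the toy collection (whose support graph has two disconnected communities) and all names are our own illustration:

```python
from collections import defaultdict

# Hypothetical collection whose undirected support graph has two
# connected components: {h1, h2, a1, a2} and {h3, a3}.
edges = [("h1", "a1", 1.0), ("h1", "a2", 1.0),
         ("h2", "a2", 1.0), ("h3", "a3", 1.0)]

# Build the undirected support graph and find its components by DFS.
adj = defaultdict(set)
for h, a, _ in edges:
    adj["H:" + h].add("A:" + a)
    adj["A:" + a].add("H:" + h)

comp = {}
for start in list(adj):
    if start in comp:
        continue
    stack, comp[start] = [start], start
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in comp:
                comp[nxt] = start
                stack.append(nxt)

d_in = defaultdict(float)  # weighted in-degree of each authority
for _, a, w in edges:
    d_in[a] += w

# Group authorities by component, then score each site by its
# intra-component stationary weight times the normalized component size.
by_comp = defaultdict(list)
for a in d_in:
    by_comp[comp["A:" + a]].append(a)

n = len(d_in)
score = {}
for members in by_comp.values():
    total = sum(d_in[a] for a in members)
    for a in members:
        score[a] = (len(members) / n) * d_in[a] / total
```

The resulting scores form exactly the stationary distribution of Proposition 2, so authorities from different communities become directly comparable.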
Extensive research in link-structure analysis has been conducted in recent years under the premise that considering the in-degree of sites as a sole measure of their authority does not produce satisfying results. Kleinberg, as motivation for the Mutual Reinforcement approach, showed some examples of the inadequacy of a simple in-degree ranking (). Our results in section 5.2 seem to contradict this premise: the Stochastic rankings there seem quite satisfactory, and since those collections were unweighted, the Stochastic rankings are equivalent to simple in-degree counts (normalized by the size of the connected component to which each site belongs). To gain more perspective on this apparent contradiction, let us elaborate on the first stage of the meta-algorithm for link-structure analysis (from section 3), in which the graph to be analyzed is assembled.
It is only after this assembly stage that the weighted, directed graph is analyzed and the rankings of hubs and authorities are produced. The analysis of the graph, however important, is only the second stage of the meta-algorithm; the steps involved in the first stage are crucial to the success of the entire algorithm.
Considerable research efforts have been invested in improving the quality of the assembled graphs. The state-of-the-art techniques for this assembly stage have matured to the point that, in many cases, simple (and efficient) algorithms and heuristics produce quite satisfying results on the assembled graphs.
It is important to keep in mind the main goal of broad-topic WWW searches: to enhance the precision at 10 of the results, not to rank the entire collection of sites correctly. It is entirely irrelevant whether the site in place 98 is really better than the site in place 216. The Stochastic ranking, which turns out to be equivalent to a weighted in-degree ranking, discovers the most authoritative sites quite effectively (and very efficiently) in many carefully assembled collections. No claim is made about the quality of its ranking of the rest of the sites (which constitute the vast majority of the collection).
We have developed a new approach for finding hubs and authorities, which we call SALSA - The Stochastic Approach for Link Structure Analysis. SALSA examines random walks on two different Markov chains which are derived from the link structure of the WWW: the authority chain and the hub chain. The principal community of authorities (hubs) corresponds to the sites that are most frequently visited by the random walk defined by the authority (hub) Markov chain. SALSA and Kleinberg's Mutual Reinforcement approach both fit within the framework of the same meta-algorithm.
We have shown that the ranking produced by SALSA is equivalent to a weighted in/out-degree ranking (with the sizes of irreducible components also playing a part). This makes SALSA computationally lighter than the Mutual Reinforcement approach.
Both approaches were tested on the WWW, where SALSA appears to compare well with the Mutual Reinforcement approach. These tests, as well as analytical work, have revealed a topological phenomenon on the Web called the TKC effect. This effect sometimes derails the Mutual Reinforcement approach, and prevents it from finding relevant authoritative sites (or from finding authorities on all meanings/aspects of the query).
We note that SALSA is less vulnerable to the TKC effect, and produces good results in many cases where the Mutual Reinforcement approach fails to do so.
The second author would like to thank Udi Manber for introducing him to the search problems studied in this paper, and Udi Manber and Toni Pitassi for delightful and interesting discussions at the early stages of this research.
The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect