Experienced users who query search engines exhibit complex behavior. They explore many topics in parallel, experiment with query variations, consult multiple search engines, and gather information over many sessions. In the process they need to keep track of search context, namely useful queries and promising result links, which can be hard to do. We present an extension to search engines called SearchPad that makes it possible to keep track of "search context" explicitly. We describe an efficient implementation of this idea deployed on four search engines: AltaVista, Excite, Google and Hotbot. Our design of SearchPad has several desirable properties: (i) portability across all major platforms and browsers, (ii) instant start requiring no code download or special actions on the part of the user, (iii) no server-side storage, and (iv) no added client-server communication overhead. An added benefit is that it allows search services to collect valuable relevance information about the results shown to the user. In the context of each query SearchPad can log the actions taken by the user, and in particular record the links that were considered relevant by the user in the context of the query. The service was tested in a multi-platform environment with over 150 users for 4 months and found to be usable and helpful. We discovered that the ability to maintain search context explicitly seems to affect the way people search. Repeat SearchPad users looked at more search results than is typical on the web, suggesting that availability of search context may partially compensate for non-relevant pages in the ranking.
As users gain expertise in searching on the WWW they begin to make use of the wide choice in search services available online. However, as they cast a wider net to locate the information they seek, they start to employ a more elaborate and complex search process. Experienced users searching on the web seem to have the following behavior:
The trouble with the above behavior is that the user needs to carry around a lot of contextual information over time and there is no convenient way to record it or make it explicit. Specifically, they need to remember URLs of potentially useful results as they look for more results, and remember useful queries over time. Both of these can be hard to memorize. Saving information to the browser's collection of bookmarks is one potential solution. However, there are several reasons why this is not convenient:
In this paper we describe an extension to the search result page called SearchPad, which helps users search more effectively by explicitly maintaining their "search context." By search context, we mean queries recently deployed by the user, along with hyperlinks of result pages the user visited and/or liked in the context of each query. SearchPad is very similar to a bookmarks window except that it is search specific and maintains a relationship between queries and links the user would like to keep track of (which we call leads). As with bookmarks, clicking on a saved lead causes the corresponding page to be loaded in the browser. Saved queries can be replayed on other search engines.
To make SearchPad usable by a large audience we had two design goals which made the implementation of the system challenging:
As we shall describe, our implementation provides an additional service. It allows the search service to collect query specific result relevance and usage data. Specifically, SearchPad can log for each client:
Such information is valuable to search providers. It can be used to statistically compare two ranking algorithms and find out which one is better. Similarly, it can be used to compare two search services. It can also be used to discover the most relevant pages for popular queries, which in turn can be used to improve results for those queries in the future.
Collection of usage data raises concerns about privacy, and the author strongly supports the privacy of users on the web. However, the major vulnerability from the user's point of view is having search services know about their interests. Unfortunately, this is already revealed by the query. The information we collect, namely the results they viewed and found useful, reveals more about the quality of the pages returned than about the user. Thus we argue that this is not a further breach of privacy. In any case there are other search companies on the web, notably Direct Hit [DirectHit], with a business model based on collecting data on the pages that users look at. They count "click-throughs" (i.e., the number of times users click on a particular link) received by result links for popular queries and reorder results based on perceived popularity. We believe that the information we can collect is superior to Direct Hit's data, because we discover the results people actually liked -- not just the results they clicked on. Even when a query finds no useful results, users tend to click on a few results per query to understand what happened, which can contaminate the click-through log. With our scheme the data collected is purer. Also, collecting click-throughs as done previously imposes an overhead on both the server and the user. In Direct Hit's scheme click-throughs are trapped by redirecting result accesses through a web server that logs the data and then issues a redirect to the actual result page. With our scheme all logging is done at the client without an extra HTTP access.
We present a walk-through to illustrate the user's interaction with SearchPad. In our implementation, the user accesses AltaVista, Excite, Google and Hotbot through special URLs that route communication with the engines through the SearchPad proxy.
Figure 1 shows an AltaVista page transformed by the proxy. The "SearchPad" button at the top left brings up the SearchPad agent (Figure 3), allowing the user to access any previously marked queries and leads. Each result on the AltaVista page has a blue "Mark" button associated with it. Clicking on this button causes the corresponding link to be added to SearchPad, along with the corresponding query. If the query already exists the link is merged into the existing set of links. Links added to SearchPad are called "leads." Note that marking is a cheap operation, and involves only a local transfer of data from the result page to the SearchPad agent. No network communication occurs and hence there is no delay. This is illustrated in Figures 2 and 3.
Figure 2 shows the second AltaVista result for the query "genetic engineering", which has just been visited by the user. All visits to result pages and the time spent on them are logged by SearchPad as part of its data collection process. On return from a result page, the blue Mark button for the just-visited result link turns red as in Figure 2 (hard to see in grayscale). The color change is an invitation for the user to mark the link. It also makes the result easier to spot, increasing the likelihood of the user marking the lead if they liked it.
Each query has a circular selector in front of it to support query selection. To send a query to a search engine the user would first select the query and then click on the search engine. If they selected the most recent query, 'genetic engineering', and clicked on Google, they would get the result set shown in Figure 4.
In Figure 5 the user has subsequently marked the lead labeled 'MelissaVirus.com: The very latest Melissa Virus information' for the (repeat) query "melissa virus". This moves "melissa virus" to the top of the list of SearchPad queries and adds the new lead to the end of the list of leads for the query.
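The marking behavior described above can be illustrated with a minimal sketch. The class name `SearchContext`, its fields, and the choice to ignore a lead that is already stored are assumptions made for illustration; the paper does not specify SearchPad's internal representation.

```python
class SearchContext:
    """Hypothetical in-memory model of SearchPad's list of queries and leads."""

    def __init__(self):
        # Most recently used query first; each entry is (query, leads).
        self.entries = []

    def mark(self, query, title, url):
        """Store a lead under `query`, creating the query if it is new."""
        for i, (q, leads) in enumerate(self.entries):
            if q == query:
                # Repeat query: move it to the top of the list.
                self.entries.insert(0, self.entries.pop(i))
                break
        else:
            leads = []
            self.entries.insert(0, (query, leads))
        lead = (title, url)
        if lead not in leads:    # assumption: duplicate leads are ignored
            leads.append(lead)   # new leads go to the end of the list


ctx = SearchContext()
ctx.mark("melissa virus", "First lead", "http://a.example/")
ctx.mark("genetic engineering", "Second lead", "http://b.example/")
ctx.mark("melissa virus", "Third lead", "http://c.example/")
print([q for q, _ in ctx.entries])
```

After the third call, "melissa virus" is back at the top of the list and its new lead follows the one stored earlier.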
SearchPad also has an Edit Mode (see Figure 6) to support changes to the stored data. This is because, although the browser may shut down and the machine may get rebooted, the information stored in SearchPad is permanent. Hence, the user may periodically want to delete some leads or queries to free up space. They might also want to merge the leads classified under various related queries into a single meaningful query. In Edit Mode, SearchPad is still fully functional, except that it provides extra buttons to edit its state. The cross ("X") marks are buttons to delete the query or lead they are associated with. Queries can be renamed by clicking on Rename, which brings up a dialog to enter the new query. If the new query matches another existing query the leads in the two queries are merged. The old query is discarded.
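The rename-and-merge rule can be sketched as follows. The function name `rename_query`, the representation of the stored data as a list of `(query, leads)` pairs, and the skipping of leads already present in the target query are assumptions for illustration only.

```python
def rename_query(entries, old, new):
    """Rename query `old` to `new` in a list of (query, leads) pairs.

    If `new` already names another query, the leads of `old` are merged
    into it and the old query is discarded.
    """
    old_index = None
    for i, (q, _) in enumerate(entries):
        if q == old:
            old_index = i
            break
    if old_index is None:
        raise KeyError(old)
    old_leads = entries[old_index][1]

    for q, leads in entries:
        if q == new and q != old:
            # Merge: append only leads the target does not already hold
            # (assumed behavior).
            leads.extend([lead for lead in old_leads if lead not in leads])
            del entries[old_index]
            return
    # No clash: simply relabel the query and keep its leads.
    entries[old_index] = (new, old_leads)


entries = [
    ("gm food", [("Lead 1", "http://a.example/")]),
    ("genetic engineering", [("Lead 2", "http://b.example/")]),
]
rename_query(entries, "gm food", "genetic engineering")
print(entries)
```

Here the two queries collapse into a single "genetic engineering" entry holding both leads.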
This approach faces the following problem. Embedded scripts are constrained by the browser both in terms of access (i.e., limited access to other windows) and storage (no access to the filesystem) in the normal mode of operation. In some web browsers, embedded scripts can ask the user for greater access to the web browser’s state. Nonetheless this is not useful because many users will refuse such a request, since it might represent a security risk. Thus, embedded scripts face many restrictions. We describe next how these may be overcome.
Similarly, when a result’s hyperlink is clicked to view the result page, we log the same type of information in association with the “view” event. When the user returns to the page containing search results after viewing a result page, the “return” event is logged as well, with a timestamp. When a “return” event follows a “view” event, the time difference provides an estimate of the time spent viewing the result page.
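As an illustration of how viewing time could be derived from such a log, here is a minimal sketch. The event tuple layout `(kind, url, timestamp)` and the function name `dwell_times` are assumptions; the actual log format is not given in the text.

```python
def dwell_times(events):
    """Estimate the seconds spent on each viewed result page.

    `events` is a chronological list of (kind, url, timestamp) tuples,
    with timestamps in seconds. An estimate is produced only when a
    "return" event directly follows a "view" event.
    """
    estimates = []
    previous = None
    for event in events:
        kind, _url, timestamp = event
        if kind == "return" and previous is not None and previous[0] == "view":
            viewed_url, viewed_at = previous[1], previous[2]
            estimates.append((viewed_url, timestamp - viewed_at))
        previous = event
    return estimates


log = [
    ("view", "http://a.example/", 100),
    ("return", None, 160),
    ("view", "http://b.example/", 170),
    ("return", None, 175),
]
print(dwell_times(log))
```

The estimate is an upper bound on reading time, since the interval also includes page load time and any detours the user made before returning.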
Eventually, as events accumulate and leads are added, the storage available in the cookie access log will be exhausted. At this point either the user can be prevented from marking any more leads (unless some are deleted), or SearchPad can compress the data. We support a clever form of data compression to free up more space.
To compress data in the cookie access log, SearchPad does a "hard" reload of itself. This causes a fresh copy of the SearchPad web page to be fetched from the server, ignoring the cache. The cookies comprising the cookie access log are configured so that they are transmitted to the web server every time SearchPad is reloaded over the net. Also, a fresh set of cookies is transmitted back from the server and overwrites the previous cookies. This is part of the standard RFC 2109 cookie exchange protocol. We use this to transfer activity log information to the server and also to reduce the data stored in SearchPad. Specifically:
To ensure timely data collection at the server, SearchPad is configured to periodically hard reload itself, thus logging the user’s activity periodically. Further, to avoid transmitting the cookies to the server during other communications, the cookies are configured so that they will be transmitted only when SearchPad is reloaded and not when result pages are fetched. We do this by associating SearchPad with a path that extends the path of result pages, as explained in RFC 2109. This has the effect of allowing SearchPad to read cookies set by result pages but not vice versa.
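The path rule that this relies on can be sketched in a few lines. The sketch uses the simple prefix test from RFC 2109 (a cookie's Path must match a prefix of the request URI); the two path constants are hypothetical, since the actual paths used by the SearchPad proxy are not given.

```python
def cookie_sent(cookie_path, request_path):
    """True if a cookie scoped to `cookie_path` accompanies a request
    for `request_path`, using a plain prefix match as in RFC 2109."""
    return request_path.startswith(cookie_path)


RESULTS_PATH = "/sp/"         # hypothetical path of result pages
SEARCHPAD_PATH = "/sp/pad/"   # hypothetical path that extends it

# A cookie scoped to the SearchPad path travels only with SearchPad reloads.
print(cookie_sent(SEARCHPAD_PATH, "/sp/pad/index.html"))  # True
print(cookie_sent(SEARCHPAD_PATH, "/sp/altavista"))       # False

# A cookie scoped to the result-page path is also visible to SearchPad.
print(cookie_sent(RESULTS_PATH, "/sp/pad/index.html"))    # True
```

Because the SearchPad path is the longer one, visibility is asymmetric: SearchPad sees cookies scoped to the result-page path, while result pages never see cookies scoped to the SearchPad path.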
We conducted a trial of the SearchPad service at our research laboratory, the Compaq Systems Research Center, from May 6 to September 3, 1999. The service was available on the company intranet, but most of the usage was by the research staff of the Systems Research Center (about 50 people), and to a smaller extent by Compaq Research as a whole (about 150 people). Logs were collected in partially shrouded format so that queries themselves were unrecognizable, but hostnames and other details were preserved. Our logs show that accesses outside the research community did not contribute significantly to usage.
Table 1 summarizes the usage statistics for the 4-month period. This does not include accesses by the author for testing. The aim of the study was to understand if people would find our service useful. Although users were invited to use the system through internal advertising, no incentive was given to make them use it. Also, assurances were given that we would protect their privacy. Hence, we did not attempt to keep track of the results that were bookmarked, since within a small community such information might reveal more than it would on the Internet at large. Also, we would need a large user base to collect a statistically significant sample of usage information to make any relevance judgements.
Table 1: Usage statistics for the 4-month trial.

| Statistic | Value |
|---|---|
| Total Number of Result Pages Viewed | 2281 |
| Number of Distinct Accessing Hosts | 178 |
| Number of Distinct Queries | 1133 |
| Average # of Result Pages/Query | 2.01 |
| Average # of Result Pages/Host | 12.8 |
| Percentage of Accesses with SearchPad "Docked" | 8% |
The high usage of AltaVista may be biased by the fact that AltaVista was created by Compaq Research. The usage of the other engines can be taken to represent perceived value by our user base. In most cases each host in the log corresponds to a distinct user. We were curious to see if usage patterns would change with the SearchPad model of searching. For example, we were curious whether users would look at more pages, since they now had the option of keeping track of temporary leads. The average number of result pages per query was 2.01, which is higher than previously reported (e.g., 1.39 was reported in a previous study by [Silverstein et al, 98]). Our number is somewhat diluted by the presence of casual users who used SearchPad marginally, possibly for test queries. Considering more seasoned users (users who used SearchPad to view more than 50 result pages), the number of result pages viewed per query is slightly higher, at 2.15. We noticed a large number of single page views in the logs, even for seasoned users. A single result page view is often evidence that the user found the result they were looking for immediately (i.e., the ranking was good), or that they were disappointed with the query and formulated a better query. If we consider only cases in which users looked at more than one result page, the average number of page views per query is higher still, at 3.98. This suggests that having a tool to record search context may encourage users to explore result sets more deeply, and compensate for some non-relevant pages in the ranking.
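As a quick consistency check, the two averages in Table 1 can be recomputed from the raw counts reported there. The variable names are chosen for illustration.

```python
# Raw counts from Table 1.
result_pages_viewed = 2281
distinct_queries = 1133
distinct_hosts = 178

pages_per_query = result_pages_viewed / distinct_queries
pages_per_host = result_pages_viewed / distinct_hosts

print(f"{pages_per_query:.2f}")  # 2.01
print(f"{pages_per_host:.1f}")   # 12.8
```

Both values agree with the table. The figures of 2.15 and 3.98 quoted for seasoned users and for multi-page queries cannot be recomputed this way, since the counts behind them are not reported.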
The only interface design choice we tried to evaluate was the option of attaching SearchPad to the left of the results window, as an extra frame. This was done by clicking on the "SearchPad" button at the top left of the result page (see Figure 1). We call this "docking." Each result window could potentially have its own docked version of SearchPad. Docking was hard to implement since it meant keeping several versions of SearchPad synchronized. However, the user study shows that only 8% of the users liked the docking option. This actually reduced to 5% for users with more than 50 result page views, suggesting that embedding SearchPad in a frame is not convenient.
SearchPad was tested on Netscape versions 3 and higher on Unix, MacOS, and Windows 95/NT, and on Internet Explorer versions 4 and higher on Windows 95/NT, and found to work reliably.
In this paper we describe an extension to search engines to explicitly maintain user search context as they look for information, on many topics, using many search engines, and over many sessions. By search context we mean queries that were previously deployed and considered useful, and promising result links associated with each query. SearchPad is an agent that works collaboratively with result pages, and allows users to remember queries and associated leads in a convenient helper window. Unlike bookmarks, which correspond to the user's long-term memory of information, the leads in SearchPad constitute the user's short-term memory and represent work in progress. They tend to be less valuable than bookmarks and are maintained only as long as the user's information need is current. Hence we perceive SearchPad as a complement to the browser's bookmarks facility.
The service was tested in a multi-platform environment with over 150 users for 4 months and found to be usable and helpful. It is possible that the ability to maintain search context explicitly affects the way people search. Repeat SearchPad users looked at more search results than reported previously. This suggests that explicit availability of search context might partially compensate for non-relevant pages in the ranking.
Krishna Bharat is a member of the research staff at Google Inc. in Mountain View, California. Formerly he was at Compaq Computer Corporation's Systems Research Center, which is where the research described here was done. His research interests include Web content discovery and retrieval, user interface issues in Web search and task automation, and relevance assessments on the Web. He received his Ph.D. in Computer Science from Georgia Institute of Technology in 1996, where he worked on tool and infrastructure support for building distributed user interface applications.