Hourly analysis of navigational, transactional, and informational user-intents in search engine queries
Many studies have been undertaken to analyze the behaviors of those who use search engines. One such avenue for study is transaction log analysis (TLA). With literally billions of queries entered every year there is a plethora of data to be combed through and just as much information to be gleaned from them. This essay gathers queries submitted to a single search engine over a twenty-four hour period, evaluates and tags them to determine user intent according to a predefined rubric, and observes any hourly trends of user intent. Specifically, we look at user intent and how it may or may not fluctuate over the course of one twenty-four hour day, and address the implications of the finding for search engine development and further research.
Any search, be it on the Internet or an OPAC, begins with an information need. Shneiderman, Byrd and Croft (1997) define an information need as "the perceived need for information that leads to someone using an information retrieval system." But the need may not be simply for information, at least not when accessing the Internet. The Internet has become not only an avenue for sharing information, but an avenue for commerce, entertainment and socialization. Thus the user's intent when logging on cannot be generically defined as an information need. It is more nuanced than that. In the age of Web 2.0, social networking sites, federated searching, "Cyber Monday," Netflix and global tracking, navigating to service-oriented sites for the first time without the aid of a search engine is not likely. Thus the search engine is positioned to be the welcome mat of the Internet. Indeed, on December 15, 2009, online sales soared to more than $9 billion, the largest one-day sales total to date at the time. There can be no doubt that search engines played a major role in influencing those shoppers' online destinations.
With such a vital role in the online world, search engines have garnered not only commercial attention but also academic attention. Many studies have been undertaken to analyze the behaviors of those who use search engines. One such avenue for study is transaction log analysis (TLA). With literally billions of queries entered every year there is a plethora of data to be combed through and just as much information to be gleaned. It is the intention of this proposal to take a look at some of what has been done to understand the behavior of searchers, and perhaps to discover another angle from which to study that behavior through time-based user intention query analysis. Specifically, we will look at the fluctuation of user intent over the course of one business day. If such trends present themselves distinctly, we will have gained another perspective on user behavior that can be used to more effectively provide Web searchers with the results they desire: better Web recall and precision tailored to their search need, whatever that need may be.
In 2002 Broder developed a taxonomy of Web search (Broder, 2002). Much of the work that Broder cites as having been done in the late 1990s was based upon the false presumption that the Web searcher was motivated purely by an information need, much like the traditional library searcher (see Holscher & Strube, 2000; Navarro-Prieto et al., 1999; Muramatsu & Pratt, 2001; Choo et al., 2000). In his report Broder posits that the intention behind a Web search is not strictly informational, as those studies suggested. He puts forward a "need behind the query" that can be classified into three distinct intents. Briefly, his classification is as follows:
Broder suggests that navigational intent could be likened to what in traditional information retrieval would be called a "known item" search, where there is really only one "right" result, with the possible exception that "hub" types of pages (i.e. directories), where the user is one click away from the target, might be acceptable.
A user whose intent is informational seeks static information without further interaction. Such intent could be very closely linked to that of traditional information retrieval, with the caveat that Web-based queries tend to be far wider in scope. For example, the searcher could be seeking as narrow a subject as "define: metric," or as wide as "cars" or "football."
Transactional queries seek a Web site where further interaction will take place. Such activities as shopping, game playing, downloading, accessing databases or chatting would be categorized as transactional in Broder's taxonomy.
Broder's methodology included a self-selection survey of Alta Vista users and analysis of an Alta Vista query log. His survey was conducted via a pop-up window; he himself notes several biases that resulted from this process of self-selection.
The self-selection survey employed three questions of relevance to our study (the percentages given are for respondents answering in the affirmative):
Broder does not engage in much analysis of the survey findings, but appears to allow the statistics to speak for themselves. From the user self-selection survey results cited above he identifies user intents as 24.5% navigational, 36% transactional, and presumes the remainder (39.5%) to be informational. He provides no rubric besides the three intents for evaluating the query log.
When manually evaluating the Alta Vista query log himself, he finds navigational queries to be 20% and transactional queries to be 30% of his 400 query data set, and again presumes the remainder (50%) are informational.
Aside from Broder's construction of a taxonomy of Web queries based upon user intent, his study offers little by way of formality or scientific rigor. In light of the studies we will discuss below that far exceed Broder's in methodology and design, we can base little credibility on his resulting statistical figures beyond a mere starting point for discussion. What he does offer of value is a fresh insight (at the time) into user intent in a Web-based environment. He correctly identifies the increased potential for interaction that the Internet offers beyond that offered in a static database of documents. While the Internet contains a great number of documents, it goes beyond that into the realms of social interaction and commerce, aspects that have great impact upon query log analysis and search engine development and improvement.
Rose and Levinson, in their article "Understanding User Goals in Web Search" (2004), took Broder's lead, with some modifications, and similarly attempted to classify queries. Focusing their work on the classification of Web queries, Rose and Levinson used a data set of 500 queries taken from the AltaVista search engine, randomly selected from several days and different times of day. They began by brainstorming a variety of goal possibilities based upon personal experience and internal AltaVista studies to form an initial framework. From this framework they began analysis of the data set, modifying their categories as needed along the way. From this, in turn, they developed a hierarchical structure.
Rose and Levinson retained two of Broder's classifications, navigational and informational, without significant modification. However, instead of the transactional classification they devised the category of "resource seeking." Under this rubric they included downloading, entertainment, interaction, and obtaining, all of which are distinct from information gathering in that they require further interaction.
In studying the transaction log they considered the query, the results page, the link the user clicked, and further searches or actions by the user. Averaging their three sets of results, their distribution of informational, navigational and transactional/resource was 61%, 13% and 26% respectively.
Relative to our study, in which our data set is only session-initial queries, Rose and Levinson do suggest that analyzing session-initial queries may lead to an overemphasis of navigational intent due to the brevity of query terms (shorter length queries tend to imply a navigational intent; i.e. "Microsoft" would imply the intent to visit Microsoft.com). While this could be the case, the definition of navigational intent (deciphering that the user's intent is to visit a particular Web site) is equally at risk of biasing the results due to the subjective nature of query interpretation. Overall the Rose and Levinson study is informative for our purposes, but extremely limited in value. Besides expanding on and further refining Broder's classification, they do not add much value to the literature, in part due to the minimal size of their sample. With millions of queries entering search engines daily, a mere 500 is hardly representative of the population. Further, though they do sample from differing days and times, they do not further qualify their choices nor delimit them proportionally to reflect the overall population.
To this end, Jansen, Booth, and Spink (2008) provide us with a more systematic analysis of user intent. Having discussed Broder and Rose and Levinson, Jansen et al. observe that Belkin (1993), who studied traditional library catalog searches, classified search episodes in terms of goal, method, mode of retrieval, and type of resource interacted with during the search. Jansen et al. suggest that Web searching does have continuity with this assessment, but differs in three respects that make it a unique domain of study: context, scale and variety. The first, context, is the direct availability of a nearly ubiquitous Web. Search engines provide access to textual and multimedia content in a variety of settings (home, work and mobile) that a library cannot. Second, the vast numbers of users and the scale of topics submitted are unparalleled in closed system searching. Third, the variety of content, users and systems is unique. The diversity of content and users on the Web far exceeds that of a closed system such as the library.
Jansen et al. observe that search engines must respond to this diversity. Not only do users seek information, they use search engines for browsing, transacting business, finding Web pages, checking spelling, etc. As Web content changes, so must the search engines that crawl it. "It is in this cornucopia of alternatives where search engines most differ from classic information search and pre-Web retrieval systems" (Jansen et al., 2008). While the mode of information retrieval on the Web is similar to traditional information retrieval, the goals and types of resources have changed dramatically. No longer are those resources strictly informational, and thus search engines are challenged to accommodate these changes in part by identifying the intent behind the query.
To this end Jansen et al. designed a study to consider the query intent with the following objectives:
By qualitatively analyzing samples of queries from seven search engine transaction logs, Jansen et al. identified the characteristics for various user intent categories. In doing so, they considered not only query characteristics, but other data as well, such as query length, results page, and vertical accessed (e.g. searching by choosing "web," "images," "videos," etc. alone). Using the Rose and Levinson 2004 study as a starting point, Jansen et al. further refined user intent, but maintained Broder's classification scheme of informational, transactional and navigational intent. We will return later to their classification rubric as we define ours for this study.
Where Jansen et al. diverge from Broder, and from Rose and Levinson, is in their employment of automated query classification. With the aid of automation, they were able to analyze a much larger, and thus more representative, data set. Jansen et al. further controlled their study by selecting a transaction log from one day (May 6, 2005) from one search engine (Dogpile.com). Upon refining the data set through elimination of null queries, multiple query sets over the mean determined in previous studies (Jansen et al., 2000), and collapsing of duplicate queries per user session into one record, they arrived at a data set of 1.5 million queries to analyze. To test the effectiveness of their algorithm they sampled 400 queries taken randomly from the data set and manually analyzed them. From this analysis they determined their algorithm had an error rate of 26%. In other words, 74% of the time their algorithm correctly identified user intent. From their study, Jansen et al. concluded that the intent of 81% of the queries was informational, the intent of 10% was navigational, and the intent of 9% was transactional.
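The validation step described above, in which a manually checked sample is used to estimate how often an automated classifier is right, can be sketched as follows. This is only an illustration: the count of 296 correct out of 400 is back-calculated from the reported 74% figure, and the normal-approximation confidence interval is an addition of this sketch, not something reported by Jansen et al.

```python
import math

def accuracy_with_interval(correct, total, z=1.96):
    """Return (accuracy, lower, upper) using the normal approximation."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical counts back-calculated from the reported 74% on 400 queries.
acc, low, high = accuracy_with_interval(296, 400)
print(f"accuracy={acc:.2f}, 95% interval=({low:.3f}, {high:.3f})")
```

Under these assumptions the interval runs from roughly 70% to 78%, which suggests that a 400-query check pins down the algorithm's accuracy only to within a few percentage points.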
The strengths of the Jansen et al. study, and our takeaway, are the systematic approach they took in sampling and data collection, as well as the robustness of their results due to the size of the sample taken. Comparatively, samples by Broder (400), Rose and Levinson (500), Lee, Liu, and Cho (2005) (50), and Kang and Kim (2003) (200) were minuscule. The weakness of the Jansen et al. study is its accuracy rate. Twenty-six percent of queries misclassified can have a rather substantial effect on the accuracy of the results, and thus the conclusions reached. No doubt manually classifying such a large data set is impossible, and the fact that search engines employ such algorithms to improve search result relevancy supports attempts at automated classification for study. Nevertheless, with an error rate that high, doubt is cast on the validity of their results. In particular, the fact that Jansen et al. found such a high percentage of informational intent, as compared to Broder and Rose and Levinson, warrants further investigation.
Beitzel, Jensen, Chowdhury, Grossman, and Frieder, in their paper "Hourly analysis of a very large topically categorized Web query log" (2004), base their classification upon topical categories rather than the user-intent classification schemes described above. Beitzel et al. instead focused their study on how queries change over the course of one twenty-four-hour period, through peak and non-peak hours. They took their data set from a seven-day period with queries from approximately 50 million users. Their classification of queries was more robust than the above studies, 14 categories in all. They further analyzed trend popularity as well as trends in uniqueness of queries within categories.
Of note for our study is the observation that indeed topically categorized queries do appear to exhibit trends over the course of a twenty-four-hour period, though some trends were more stable and others more variable. Though some findings are rather pedestrian, such as the number of queries being higher during peak hours as opposed to non-peak hours, other observations were more noteworthy: The average number of query repetitions per hour does not change significantly throughout the day; most queries appear no more than several times per hour; and queries received during peak hours are more similar to one another than those received at non-peak hours. These observations led the research team to believe that predictive algorithms that are able to estimate the likelihood of query repetition should be possible. This finding could aid in cache management and load-balancing algorithms employed by search engines. Further, these algorithms could aid in improving retrieval effectiveness by assisting in query disambiguation.
The strengths of this study are several. The data set employed is massive; it dwarfs all of the data sets described earlier in this paper. Granted, the methods employed and the aim of the study are more automated and oriented toward algorithm development. The granularity of the classification scheme is also admirable. However, when one considers the classifications employed, one can easily place them all within the rubrics of Broder and Jansen et al.
One weakness of the Beitzel et al. study was the researchers' failure to comment upon the accuracy of identification of their algorithm, as no indication of manual verification was given. No doubt their findings have a degree of error in them, but none is reported. Nonetheless, our takeaway is the observation of query trends over a twenty-four-hour period. None of the prior studies took this aspect into account, nor would their data set have substantiated very generalizable conclusions.
Zhang, Jansen and Spink provide the final contribution for our study. In "Time series analysis of a Web search engine transaction log" (2009), Zhang et al. apply time-series analysis to a query log to analyze searcher behavior over time and to investigate whether either basic or advanced time-series analysis is a valid method for studying searcher actions. Their study incorporates another Dogpile.com transaction log, from May 15, 2006. The methodology they employ involves a proportional sampling method into equidistant time periods, analysis of query length, frequency of submissions, application of the autoregressive integrated moving average method to do a one-period-ahead prediction on the data, and a Box-Jenkins transfer function model* to discover relationships among different fields in the transaction log data set.
The results obtained are presented in several sections: basic data analysis and extended data analysis including predictive behavior modeling, which identifies possible future user actions or choices. In the basic analysis section Zhang et al. observed user interactions, browser use, vertical accessed, searcher intent, rank of clicked link, query length and clicks on sponsored links and organic links. In the extended data analysis the team used factors such as keywords in the queries and average rank of clicked result to develop a predictive model identifying possible future user actions or choices. Using automated algorithmic analysis of user intent, they observed much the same as Beitzel et al. in terms of peak and non-peak usage, as well as a spread between informational, transactional and navigational intents similar to what was observed in Broder, Rose and Levinson, and Jansen et al.
Like the Jansen et al. study, the Zhang et al. study used algorithmic analysis, which produced an error rate as high as 25%. Though a manual approach is impossible for real-time query analysis, error rates such as these raise the question of usefulness when extrapolated from the smaller data sets to a daily log of millions or even billions of queries. Zhang et al. fare better than others, at least by their own estimation, but still wrongly identify one of four queries by automated analysis. Extrapolating this from a sampled set of four hundred queries to their data set of 400,000, let alone four million, suggests that their analysis is tentative at best, but not without merit.
Considering the studies discussed in this paper, it appears as though none directly address Beitzel's "Circadian" rhythm in terms of Broder's user-intent taxonomy. Specifically, though Zhang et al. employed automated query analysis to trend user intent as one objective of their study, none have employed a robust data set that is proportionally sampled from a single 24-hour period and applied manual evaluation of user intent along the lines of Broder's or Jansen's classification schema: informational, navigational and transactional.
Thus this study seeks to proportionally sample one day's worth of queries submitted to a single search engine server, November 3, 2009, and manually analyze the resulting 4,067 queries according to the rubric as defined by Jansen et al. It is suggested that, in the same way that Beitzel et al. found a "Circadian" rhythm of topically categorized queries, we should find a similar phenomenon when classifying queries according to the well-established classification of informational, navigational and transactional intent. With a proportional sampling of this size we hope to provide a reliable indication of any trends that may emerge that would warrant a more formal, large-scale study employing a similar methodology.
With the aforementioned studies in mind, this study endeavors to provide a pilot for manually analyzing a recent search engine transaction log to determine user-intention trends over the course of one day. The foundational methodology for our study is based upon the work of Jansen and his survey of search log analysis methodology in his article "Search log analysis: What is it, what's been done, how to do it" (Jansen, 2006). Based on this methodology we employed basic transaction log analysis to determine user-intent based solely upon the harvested queries. As such, our analysis of the data set is based on an inductive, grounded theory methodology as laid out by Jansen (2006) and is limited to one data set and the manual analysis of session-initial user submitted queries.
Though manual analysis alone is implausible for real time information retrieval via a search engine, in light of the 26% error rate reported by Jansen et al. (2008), it is believed that if a large enough sample size can be manually analyzed, many of the shortcomings of those studies that used automated analysis can be overcome, and contributions to the understanding of user intent can reliably be made.
The query transaction log was harvested from one server/cluster, for one full day, Tuesday November 3, 2009, from a major Internet search provider. Only queries entered into the generic Web vertical were pulled, based upon the rationale that users employing any of the specialized verticals (e.g. image, video) would obviously introduce bias into the data set and would be better analyzed in a separate study.
The data set was originally massive and impossible to process manually. Indeed, as Beitzel et al. observed, this amount of data is more than almost any statistical software package can handle. In order to obtain a more manageable data set for manual analysis, several stages were completed to ensure that a proportionate and representative sample was obtained:
Manual analysis was chosen because, although it is too labor-intensive to be feasible for large-scale studies, for a pilot program such as this the enhanced accuracy and validity of results afforded by manual analysis make it more reliable and thus the preferred methodology.
After the final query data set was received, the time-stamp data was removed to avoid any potential bias. The data set was then divided into three roughly equal-sized sets. The queries were manually analyzed by three judges via a Delphi technique. Each set was evaluated by two judges independent of one another. Any disagreements were adjudicated by the third judge. Through this process, reasonable assurance of correct identification of user-intent was achieved.
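The two-judge-plus-arbiter procedure described above can be sketched in a few lines. The function name, the callable used to stand in for the third judge, and the toy labels are all hypothetical and serve only to illustrate the flow; they are not taken from the study's materials.

```python
def adjudicate(judge_a, judge_b, arbiter):
    """Combine two independent label lists; the arbiter settles disagreements.

    `arbiter` is a callable taking (index, label_a, label_b) and returning
    a label. Returns the final labels and the indices sent to arbitration.
    """
    final, disputed = [], []
    for i, (a, b) in enumerate(zip(judge_a, judge_b)):
        if a == b:
            final.append(a)
        else:
            disputed.append(i)
            final.append(arbiter(i, a, b))
    return final, disputed

# Toy labels, invented for illustration only.
judge_a = ["informational", "navigational", "transactional", "informational"]
judge_b = ["informational", "navigational", "informational", "informational"]
third_judge = {2: "transactional"}  # the third judge's ruling on query 2

final, disputed = adjudicate(judge_a, judge_b, lambda i, a, b: third_judge[i])
arbitration_rate = len(disputed) / len(final)
print(final, disputed, arbitration_rate)
```

Recording the disputed indices alongside the final labels makes it straightforward to report what share of the sample required arbitration.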
To aid in consistency among judges, the benchmark for identification of user-intent was the taxonomy of Web search as offered by Andrew Broder (2002), and further investigated by Bernard Jansen et al. (2008), as illustrated in Table 1.
It should be noted that some queries are ambiguous enough to warrant cross-classification into two of the above categories. This is a shortcoming of a study that does not consider landing pages (pages navigated to from the search results list), or refining of queries. However, using the Delphi technique, almost all queries that were classified differently by the first two judges were resolved in arbitration with reasonable certainty. Of the 4,067 queries analyzed, only 11% (447) required arbitration, and only 1.5% (63) of the original set proved difficult for any of the three judges to classify into a single category with certainty. Additionally, query terms were counted to determine any trends of query length relative to user intent.
All queries and judgments were again paired with their respective time stamps in an Excel spreadsheet for tabulation, reducing them into hourly totals for time-series trend analysis. The data was then analyzed to determine if the time-of-day variable showed any influence upon the user-intent variables (informational, transactional, navigational). Additionally, query lengths were loaded into a separate spreadsheet for analysis relative to average query length per intent, as well as any trend that might be observed regarding time of day.
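The tabulation described above was done in a spreadsheet; purely as an illustration, the same reduction to hourly totals could be expressed in code along the following lines. The timestamp format, function names, and sample rows are assumptions made for this sketch and do not reflect the actual log.

```python
from collections import Counter, defaultdict
from datetime import datetime

INTENTS = ("informational", "transactional", "navigational")

def hourly_intent_counts(records):
    """Count judged queries per hour of day and intent.

    `records` is an iterable of (timestamp_string, intent) pairs; the
    timestamp format used here is an assumption, not the log's real format.
    """
    counts = defaultdict(Counter)
    for stamp, intent in records:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").hour
        counts[hour][intent] += 1
    return counts

def hourly_shares(counts):
    """Convert per-hour counts into each intent's share of that hour."""
    shares = {}
    for hour, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[hour] = {intent: counter[intent] / total for intent in INTENTS}
    return shares

# Invented sample rows, for illustration only.
sample = [
    ("2009-11-03 09:15:02", "informational"),
    ("2009-11-03 09:47:30", "navigational"),
    ("2009-11-03 13:05:11", "informational"),
    ("2009-11-03 13:20:45", "transactional"),
    ("2009-11-03 13:59:59", "informational"),
]
counts = hourly_intent_counts(sample)
shares = hourly_shares(counts)
```

Keeping raw counts and within-hour shares separate helps distinguish a rise in overall traffic from a genuine shift in the mix of intents.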
As discussed above, we employed basic transaction log analysis to determine user-intent based solely upon the harvested queries and their time stamps. As such, our analysis of the data set is based on an inductive, grounded theory methodology as laid out by Jansen (2006) and is limited to one data set. From this analysis we gathered the following preliminary statistics.
Figure 1 illustrates the overall distribution of user intent, which reflects 69% informational, 19% transactional, and 12% navigational. Comparison with Beitzel et al. and Jansen et al. shows that our overall traffic and distribution results are similar. Additionally, the longer informational query term average (3.85 terms), middle length transactional query term average (2.43 terms), and shorter navigational query term average (1.7 terms), as well as overall average query length (2.7 terms), shown in Figure 2, are fully in line with Zhang et al. and the current literature. Query length was randomly distributed throughout the day, with no trend observable. As observed by Jansen and Spink (2005), search behavior is similar across various search engines, thus both statistics above strongly suggest that our sample and analysis are consistent with current studies, and are reasonably accurate and representative of search engine queries overall.
As Figure 3 suggests, and the literature supports, the distribution of all queries peaks at mid-day and troughs in the early morning, with a fairly level plateau through the evening hours. Given the fact that all queries were obtained through a single server/cluster in order to maintain a representative hourly sampling for time-based analysis, this distribution was expected and further supports our obtaining a proportionally distributed and representative sample of queries.
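To illustrate what a proportionally distributed hourly sample means in practice, the sketch below draws from each hour in proportion to that hour's share of total traffic. It is a generic stratified-sampling example under assumed inputs (a synthetic two-hour log and a hypothetical `proportional_sample` helper), not a reconstruction of the procedure used to prepare this study's data set.

```python
import random

def proportional_sample(queries_by_hour, sample_size, seed=0):
    """Draw a sample whose per-hour counts mirror each hour's traffic share.

    `queries_by_hour` maps hour (0-23) to a list of queries. Per-hour quotas
    are rounded, so the total may differ from `sample_size` by a few items.
    """
    rng = random.Random(seed)
    total = sum(len(qs) for qs in queries_by_hour.values())
    sample = {}
    for hour, qs in queries_by_hour.items():
        quota = round(sample_size * len(qs) / total)
        sample[hour] = rng.sample(qs, min(quota, len(qs)))
    return sample

# Synthetic log: light traffic at 03:00, heavy traffic at 13:00.
log = {3: [f"q3-{i}" for i in range(100)],
       13: [f"q13-{i}" for i in range(900)]}
picked = proportional_sample(log, sample_size=50)
print({hour: len(qs) for hour, qs in picked.items()})  # {3: 5, 13: 45}
```

Because quotas follow traffic, the sampled hourly curve should keep the same peak-and-trough shape as the full log, which is the property the paragraph above relies on.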
At the heart of the study, the distribution of query intents shows an obvious peak during mid-day for informational queries, and a less pronounced yet still noticeable (and proportionate) increase in transactional queries following a similar trend. Navigational queries, though they follow an overall trend similar to transactional, appear to peak slightly in the evening hours.
Our resulting query intent distribution is similar to that of Zhang et al. (2009), who observed an overall percentage distribution similar to ours, but with less observable trends in transactional and navigational intents. Given the fact that our study utilized manual analysis, as opposed to automated algorithms, the nuances in our trends can be seen as more granular and more accurately reflective of user intent.
Based upon the results of the study, several conclusions can be formulated:
From our study it appears there is an observable, though not a marked, trend in all query intents. This warrants more extensive study. Our study, being a pilot study, was limited in scope and thus the results are not immediately generalizable.
It is noted that a query can have multiple intents, and that observation of the query alone will not eliminate incorrect identification. However, as Jansen et al. (2008) found in their research, roughly 75% of queries can be classified into a single category, leaving 25% with possible multiple intents. Considering the rate of error for automated query analysis in combination with this potential for multiple query intents, the implication is that through manual analysis ambiguity is minimized, and thus the resulting classification and distributions are more reliable.
However, solely relying on manual analysis would be difficult if not impossible for a more extensive study. That being the case, further research using a combination of algorithmic analysis and manual analysis could prove effective. If one developed algorithms that would identify user-intent with reasonable certainty, kicking the remainder into a pool for manual analysis might overcome the shortcomings inherent with either approach and provide more reliable and generalizable results.
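The combined approach suggested above, in which an algorithm labels what it can and passes the rest to human judges, can be sketched as a simple triage step. The keyword cues below are toy rules invented for this example; they are not a validated rubric and are not drawn from Jansen et al. or any other cited study.

```python
# Toy cues, hypothetical and for illustration only.
NAVIGATIONAL_HINTS = ("www.", ".com", ".org")
TRANSACTIONAL_WORDS = ("buy", "download", "order")

def auto_classify(query):
    """Return an intent label when a toy rule fires, else None (uncertain)."""
    q = query.lower()
    if any(hint in q for hint in NAVIGATIONAL_HINTS):
        return "navigational"
    if any(word in q.split() for word in TRANSACTIONAL_WORDS):
        return "transactional"
    return None

def triage(queries):
    """Split queries into auto-labelled pairs and a pool for manual judging."""
    labelled, manual_pool = [], []
    for query in queries:
        label = auto_classify(query)
        if label is None:
            manual_pool.append(query)
        else:
            labelled.append((query, label))
    return labelled, manual_pool

labelled, manual_pool = triage(
    ["www.example.com", "download media player", "cars", "football scores"]
)
print(labelled)
print(manual_pool)
```

In a real study the rule set would be replaced by a classifier whose certainty can be measured, with the threshold for deferring to manual analysis tuned against a hand-checked sample.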
The strengths of this study are several. First, it uses real queries without the influence of self-selection or surveys, thus avoiding any bias from user knowledge of the study. Second, the sample size is quite adequate for a pilot study; in excess of 4,000 queries were proportionally sampled. Third, our study is one of only a few investigating chronological user-intent trends via current transaction logs. More studies of this kind would aid in resource and cache allocation, predictive algorithms, and disambiguation of user intent in cases of vague or ambiguous queries.
The study did provide a snapshot of observable user-intent trends over the course of an average weekday. Based upon our findings, a more thorough study should be conducted with an increased data set, which would provide more generalizable, transferable, and reliable conclusions.
* For a description of these analytic tools, see the Wikipedia article "Predictive analytics," accessible at http://en.wikipedia.org/wiki/Predictive_analytics#Time_series_models.
Baeza-Yates, R., Calderon-Benavides, L. & Gonzalez, C. (2006). The intention behind Web queries. [Conference paper]. String Processing and Information Retrieval (SPIRE 2006), 4209/2006, 98-109. doi: 10.1007/11880561_9
Beitzel, S. M., Jensen, E. C., Lewis, D. D., Chowdhury, A., & Frieder, O. (2007). Automatic classification of Web queries using very large unlabeled query logs. ACM Transactions on Information Systems, 25(2), Article No. 9.
Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D. & Frieder, O. (2004). Hourly analysis of a very large topically categorized Web query log. [Proceedings paper]. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (July 2004), 321-328.
Choo, C. W., Detlor, B. & Turnbull, D. (2000). Information seeking on the Web: An integrated model of browsing and searching. First Monday, 5(2). Available at http://firstmonday.org/issues/issue5_2/choo/index.html
Holscher, C. & Strube, G. (2000). Web search behavior of Internet experts and newbies. [Proceedings paper]. Proceedings of WWW9. Available at http://www9.org/w9cdrom/81/81.html
Jansen, B. J., Booth, D. & Spink, A. (2008). Determining the informational, navigational and transactional intent of Web queries. Information Processing and Management, 44, 1251-1266. doi: 10.1016/j.ipm.2007.07.015
Jansen, B. J., Spink, A. & Pedersen, J. (2005). A temporal comparison of Alta Vista Web searching. Journal of the American Society for Information Science and Technology, 56(6), 559-570. Accessed at http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_altavista_jasist.pdf
Johnson, N. (2009). Biggest online shopping day of 2009 expected today as holiday spending uptick continues. Search Engine Watch. Accessed on 12/18/2009. Available at http://blog.searchenginewatch.com/091214-000038#
Kang, I., & Kim, G. (2003). Query type classification for Web document retrieval. [Conference paper]. 26th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, July-August 2003, 64-71.
Navarro-Prieto, R., Scaife, M. & Rogers, Y. (1999). Cognitive strategies in Web searching. [Proceedings paper]. Proceedings of the 5th conference on human factors & the Web. Available at http://zing.ncsl.nist.gov/hfweb/proceedings/navarro-prieto/index.html.
Rose, D., & Levinson, D. (2004). Understanding user goals in Web search. [Proceedings paper]. Proceedings of the 13th International Conference on World Wide Web (WWW 2004). New York, 13-19. doi: 10.1145/988672.988675
Shneiderman, B., Byrd, D., & Croft, W. B. (1997). Clarifying search: A user-interface framework for text searches. D-Lib Magazine, January 1997. Available at http://www.dlib.org/dlib/january97/retrieval/01shneiderman.html.
Tim Bridwell is currently enrolled at UWM SOIS and works as an information architect for Fusion92, an emergent marketing firm in Arlington Heights, IL.
Copyright, 2013 Library Student Journal