This book is a second edition, updated and expanded to explain the technologies that help us find information on the web. Search engines and web navigation tools have become ubiquitous in our day-to-day use of the web as an information source, a tool for commercial transactions, and a social computing tool. Moreover, through the mobile web we have access to the web's services when we are on the move. This book demystifies the tools that we use when interacting with the web, and gives the reader a detailed overview of where we are and where we are going in terms of search engine and web navigation technologies.
Table of Contents
Title Page
Copyright
Dedication
Preface
Motivation
Audience and Prerequisites
Timeliness
Acknowledgments
List of Figures
Chapter 1: Introduction
1.1 Brief Summary of Chapters
1.2 Brief History of Hypertext and the Web
1.3 Brief History of Search Engines
Chapter 2: The Web and the Problem of Search
2.1 Some Statistics
2.2 Tabular Data Versus Web Data
2.3 Structure of the Web
2.4 Information Seeking on the Web
2.5 Informational, Navigational, and Transactional Queries
2.6 Comparing Web Search to Traditional Information Retrieval
2.7 Local Site Search Versus Global Web Search
2.8 Difference Between Search and Navigation
2.9 Chapter Summary
Chapter 3: The Problem of Web Navigation
3.1 Getting Lost in Hyperspace and the Navigation Problem
3.2 How Can the Machine Assist in User Search and Navigation
3.3 Trails Should be First Class Objects
3.4 Enter Markov Chains and Two Interpretations of Its Probabilities
3.5 Conflict Between Web Site Owner and Visitor
3.6 Conflict Between Semantics of Web Site and the Business Model
3.7 Chapter Summary
Chapter 4: Searching the Web
4.1 Mechanics of a Typical Search
4.2 Search Engines as Information Gatekeepers of the Web
4.3 Search Engine Wars, Is the Dust Settling?
4.4 Statistics from Studies of Search Engine Query Logs
4.5 Architecture of a Search Engine
4.6 Crawling the Web
4.7 What Does it Take to Deliver a Global Search Service?
4.8 Chapter Summary
Chapter 5: How Does a Search Engine Work
5.1 Content Relevance
5.2 Link-Based Metrics
5.3 Popularity-Based Metrics
5.4 Evaluating Search Engines
5.5 Chapter Summary
Chapter 6: Different Types of Search Engines
6.1 Directories and Categorization of Web Content
6.2 Search Engine Advertising
6.3 Metasearch
6.4 Personalization
6.5 Question Answering (Q&A) on the Web
6.6 Image Search
6.7 Special Purpose Search Engines
Chapter 7: Navigating the Web
7.1 Frustration in Web Browsing and Navigation
7.2 Navigation Tools
7.3 Navigational Metrics
7.4 Web Data Mining
7.5 The Best Trail Algorithm
7.6 Visualization That Aids Navigation
7.7 Navigation in Virtual and Physical Spaces
Chapter 8: The Mobile Web
8.1 The Paradigm of Mobile Computing
8.2 Mobile Web Services
8.3 Mobile Device Interfaces
8.4 The Navigation Problem in Mobile Portals
8.5 Mobile Search
Chapter 9: Social Networks
9.1 What is a Social Network?
9.2 Social Network Analysis
9.3 Peer-to-Peer Networks
9.4 Collaborative Filtering
9.5 Weblogs (Blogs)
9.6 Power-Law Distributions in the Web
9.7 Searching in Social Networks
9.8 Social Tagging and Bookmarking
9.9 Opinion Mining
9.10 Web 2.0 and Collective Intelligence
Chapter 10: The Future of Web Search and Navigation
Bibliography
Index
Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Levene, M. (Mark), 1957-
An introduction to search engines and web navigation / Mark Levene.
p. cm.
ISBN 978-0-470-52684-2 (pbk.)
1. Internet searching. 2. Web search engines. I. Title.
ZA4230.L48 2010
025.0425–dc22
2010008435
To my wife Sara and three children
Tamara, Joseph and Oren
Preface
Motivation
Searching and navigating the web have become part of our daily online lives. Web browsers and the standard navigation tools embedded in them provide a showcase of successful software technology with a global user base that has changed the way in which we search for and interact with information. Search engine technology has become ubiquitous, providing a standard interface to the endless amount of information that the web contains. Since the inception of the web, search engines have delivered a continuous stream of innovations, satisfying their users with increasingly accurate results through the implementation of advanced retrieval algorithms and scalable distributed architectures. Search and navigation technologies are central to the smooth operation of the web and it is hard to imagine finding information without them. Understanding the computational basis of these technologies and the models underlying them is of paramount importance for both IT students and practitioners.
There are several technical books on web search and navigation but the ones I have seen are either very academic in nature, that is, targeted at the postgraduate student or advanced researcher, and therefore have a limited audience, or they concentrate on the user interface and web site usability issues, ignoring the technicalities of what is happening behind the scenes. These books do not explain at an introductory level how the underlying computational tools work. This book answers the need for an introductory, yet technical, text on the topic.
My research into web search and navigation technologies started in the early 1990s, just before the internet boom, when my colleagues and I began looking at hypertext as a model for unstructured (or semistructured) data connected via a network of links, much in the same way web pages are connected. Of particular interest to us was the infamous “navigation problem,” which arises when we lose our way navigating (or what has become known as “surfing”) through the myriad of information pages in the network. Tackling this problem has provided continued impetus for my research.
In a wider context, the activity of information seeking, that is, the process we go through when searching and locating information in order to augment our state of knowledge, has been of major concern to all involved in the development of technologies that facilitate web interaction.
I have been using browser navigation tools and search engines since their early days, and have been fascinated by the flow of new ideas and the improvements that each new tool has delivered. One of my aims in this text is to demystify the technology underlying the tools that we use in our day-to-day interaction with the web, and another is to inform readers about upcoming technologies, some of which are still in the research and development stage.
I hope that this book will instill in you some of my enthusiasm for the possibilities that these technologies have and are creating to extend our capabilities of finding and sharing information.
Audience and Prerequisites
The book is intended as an undergraduate introductory text on search and navigation technologies, but could also be used to teach an option on the subject. It is also intended as a reference book for IT professionals wishing to know how these technologies work and to learn about the bigger picture in this area.
The course has no formal prerequisites; all that is required is for the learner to be a user of the web and to be curious to know how these technologies work. All the concepts that are introduced are explained in words, and simple examples from my own experience are given to illustrate various points. Occasionally, to add clarity to an important concept, a formula is given and explained. Each chapter starts with a list of learning objectives and ends with a brief bullet-pointed summary. There are several exercises at the end of each chapter. Some of these aim to get the student to explore further issues, possibly with a reference which can be followed up; some get the student to discuss an aspect of the technology; and others are mini-projects (which may involve programming) to add to the student's understanding through a hands-on approach. The book ends with a set of notes containing web addresses for items mentioned in the book, and an extensive bibliography of the articles and books cited in the book.
Readers should be encouraged to follow the links in the text and to discover new and related links that will help them understand how search and navigation tools work, and to widen their knowledge with related information.
Timeliness
I believe that, due to the importance of the topic, it is about time that such a book appeared. Search and navigation technologies are moving at a very fast pace due to the continued growth of the web and its user base, and improvements in computer networking and hardware. There is also strong competition between different service providers to lock users in to their products. This is good news for web users, but as a result some of the numbers in the text may be out of date. I have qualified the statistics I have given with dates and links, which can be found in the notes, so the reader can follow these to get an up-to-date picture and follow the trends. I do not expect the core technologies I have covered to change radically in the near future, and I would go so far as to claim that in essence they are fundamental to the web's working, but innovation and new ideas will continue to flourish and mold the web's landscape.
If you find any errors or omissions please let me know so that I can list them on the book's web site. I will also be grateful to receive any constructive comments and suggestions, which can be used to improve the text.
Acknowledgments
First I would like to thank my wife and family who have been extremely supportive throughout this project, encouraging me to put in the extra hours needed to complete such a task. I would also like to thank my colleagues at the Department of Computer Science and Information Systems at Birkbeck, who have read and commented on parts of the book. Special thanks to my editors at Wiley, Lucy Hitz and George Telecki, who have patiently guided me through the publication process. Finally, I would like to thank the reviewers for their constructive comments.
The people who have built the innovative technologies that drive today's web are the real heroes of the revolution that the World Wide Web has brought upon us. Without them, this book could not have been written, not only in terms of its content but also in terms of the tools I have been using daily to augment my knowledge of how search and navigation technologies work in practice.
Mark Levene
London, June 2010
Chapter 1
Introduction
“People keep asking me what I think of it now it's done. Hence my protest: The Web is not done!”
Tim Berners-Lee, Inventor of the World Wide Web
The last two decades have seen dramatic revolutions in information technology; not only in computing power, such as processor speed, memory size, and innovative interfaces, but also in the everyday use of computers. In the late 1970s and during the 1980s, we had the revolution of the personal computer (PC), which brought the computer into the home, the classroom, and the office. The PC then evolved into the desktop, the laptop, and the netbook as we know them today.
The 1990s was the decade of the World Wide Web (the Web), built over the physical infrastructure of the Internet, radically changing the availability of information and making possible the rapid dissemination of digital information across the globe. While the Internet is a physical network, connecting millions of computers together globally, the Web is a virtual global network linking together a massive amount of information. Search engines now index many billions of web pages and that number is just a fraction of the totality of information we can access on the Web, much of it residing in searchable databases not directly accessible to search engines.
Now, in the twenty-first century we are in the midst of a third wave of novel technologies, that of mobile and wearable computing devices, where computing devices have already become small enough so that we can carry them around with us at all times, and they also have the ability to interact with other computing devices, some of which are embedded in the environment. While the Web is mainly an informational and transactional tool, mobile devices add the dimension of being a location-aware ubiquitous social communication tool.
Coping with, organizing, visualizing, and acting upon the massive amount of information with which we are confronted when connected to the Web are amongst the main problems of web interaction (1). Searching and navigating (or surfing) the Web are the methods we employ to help us find information on the web, using search engines and navigation tools that are either built-in or plugged-in to the browser or are provided by web sites.
In this book, we explore search and navigation technologies in full, present the state-of-the-art tools, and explain how they work. We also look at ways of modeling different aspects of the Web that can help us understand how the Web is evolving and how it is being used and can be used. The potential of many of the technologies we introduce has not yet been fully realized, and many new ideas to improve the ways in which we interact with the Web will inevitably appear in this dynamic and exciting space.
1.1 Brief Summary of Chapters
This book is roughly divided into three parts. The first part (Chapters 1–3) introduces the problems of web interaction dealt with in the book, the second part (Chapters 4–6) deals with web search engines, and the third part (Chapters 7–9) looks at web navigation, the mobile web, and social network technologies in the context of search and navigation. Finally, in Chapter 10, we look ahead at the future prospects of search and navigation on the Web.
Chapters 1–3 introduce the reader to the problems of search and navigation and provide background material on the Web and its users. In particular, in the remaining part of Chapter 1, we give brief histories of hypertext and the Web, and of search engines. In Chapter 2, we look at some statistics regarding the Web, investigate its structure, and discuss the problems of information seeking and web search. In Chapter 3, we introduce the navigation problem, discuss the potential of machine learning to improve search and navigation tools, and propose Markov chains as a model for user navigation.
Chapters 4–6 cover the architectural and technical aspects of search engines. In particular, in Chapter 4, we discuss the search engine wars, look at some usage statistics of search engines, and introduce the architecture of a search engine, including the details of how the Web is crawled. In Chapter 5, we dissect a search engine's ranking algorithm, including content relevance, link- and popularity-based metrics, and different ways of evaluating search engines. In Chapter 6, we look at different types of search engines, namely, web directories, search engine advertising, metasearch engines, personalization of search, question answering engines, and image search and special purpose engines.
Chapters 7–9 concentrate on web navigation, and look beyond it at the mobile web and at how viewing the Web in social network terms is having a major impact on search and navigation technologies. In particular, in Chapter 7, we discuss a range of navigation tools and metrics, introduce web data mining and the Best Trail algorithm, discuss some visualization techniques to assist navigation, and look at the issues present in real-world navigation. In Chapter 8, we introduce the mobile web in the context of mobile computing, look at the delivery of mobile web services, discuss interfaces to mobile devices, and present the problems of search and navigation in a mobile context. In Chapter 9, we introduce social networks in the context of the Web, look at social network analysis, introduce peer-to-peer networks, look at the technology of collaborative filtering, introduce weblogs as a medium for personal journalism on the Web, look at the ubiquity of power-law distributions on the Web, present effective searching strategies in social networks, introduce opinion mining as a way of obtaining knowledge about users' opinions and sentiments, and look at Web 2.0 and collective intelligence, which have generated a lot of hype and inspired many start-ups in recent years.
1.2 Brief History of Hypertext and the Web
The history of the Web dates back to 1945 when Vannevar Bush, then an advisor to President Truman, wrote his visionary article “As We May Think,” and described his imaginary desktop machine called memex, which provides personal access to all the information we may need (2). An artist's impression of memex is shown in Fig. 1.1.
Figure 1.1 Bush's memex. (Source: Life Magazine 1945;9(11):123.)
The memex is a “sort of mechanized private file and library,” which supports “associative indexing” and allows navigation whereby “any item may be caused at will to select immediately and automatically another.” Bush emphasizes that “the process of tying two items together is an important thing.” By repeating this process of creating links, we can form a trail which can be traversed by the user; in Bush's words, “when numerous items have been thus joined together to form a trail they can be reviewed in turn.” The motivation for the memex's support of trails as first-class objects was that the human mind “operates by association” and “in accordance to some intricate web of trails carried out by the cells of the brain.”
Bush also envisaged the “new profession of trailblazers” who create trails for other memex users, thus enabling sharing and exchange of knowledge. The memex was designed as a personal desktop machine, where information is stored locally on the machine. Trigg (3) emphasizes that Bush views the activities of creating a new trail and following a trail as being connected. Trails can be authored by trailblazers based on their experience and can also be created by memex, which records all user navigation sessions. In his later writings on the memex, published in Ref. 4, Bush revisited and extended the memex concept. In particular, he envisaged that memex could “learn from its own experience” and “refine its trails.” By this, Bush means that memex collects statistics on the trails that the user follows and “notices” the ones that are most frequently followed. Oren (5) calls this extended version adaptive memex, stressing that adaptation means that trails can be constructed dynamically and given semantic justification; for example, by giving these new trails meaningful names.
The term hypertext (6) was coined by Ted Nelson in 1965 (7), who considers “a literature” (such as the scientific literature) to be a system of interconnected writings. The process of referring to other connected writings, when reading an article or a document, is that of following links. Nelson's vision is that of creating a repository of all the documents that have ever been written thus achieving a universal hypertext. Nelson views his hypertext system, which he calls Xanadu, as a network of distributed documents that should be allowed to grow without any size limit, such that users, each corresponding to a node in the network, may link their documents to any other documents in the network. Xanadu can be viewed as a generalized memex system, which is both for private and public use. As with memex, Xanadu remained a vision that was not fully implemented; a mockup of Xanadu's linking mechanism is shown in Fig. 1.2. Nelson's pioneering work in hypertext is materialized to a large degree in the Web, since he also views his system as a means of publishing material by making it universally available to a wide network of interconnected users.
Figure 1.2 Nelson's Xanadu. (Source: Figure 1.3, Xanalogical structure, needed now more than ever: Parallel documents, deep links to content, deep versioning, and deep re-use, by Nelson TH. www.cs.brown.edu/memex/ACM_HypertextTestbed/papers/60.html.)
Douglas Engelbart's on-line system (NLS) (8) was the first working hypertext system, where documents could be linked to other documents and thus groups of people could work collaboratively. The video clips of Engelbart's historic demonstration of NLS from December 1968 are archived on the Web,1 and a recollection of the demo can be found in Ref. (9); a picture of Engelbart during the demo is shown in Fig. 1.3.
Figure 1.3 Engelbart's NLS. (Source: Home video of the birth of the hyperlink. www.ratchetup.com/eyes/2004/01/wired_recently_.html.)
About 30 years later in 1990, Tim Berners-Lee—then working for CERN, the world's largest particle physics laboratory—turned the vision of hypertext into reality by creating the World Wide Web as we know it today (10).2
The Web works using three conventions: (i) the URL (uniform resource locator) to identify web pages, (ii) HTTP (hypertext transfer protocol) to exchange messages between a browser and web server, and (iii) HTML (hypertext markup language) (11) to display web pages. More recently, Tim Berners-Lee has been promoting the semantic web (12) together with XML (extensible markup language) (13), and RDF (resource description framework) (14), as a means of creating machine understandable information that can better support end user web applications. Details on the first web browser implemented by Tim Berners-Lee in 1990 can be found at www.w3.org/People/Berners-Lee/WorldWideWeb.
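As a simple illustration of these three conventions working together, the following minimal Python sketch (my own, not from the book; it assumes network access and uses the reserved example.com domain) identifies a page by its URL, retrieves it over HTTP, and receives HTML in return:

```python
from urllib.request import urlopen

url = "http://example.com/"                 # (i) the URL identifies the web page
with urlopen(url) as response:              # (ii) HTTP carries the request and response
    html = response.read().decode("utf-8", errors="replace")
    print(response.status)                  # HTTP status code, e.g., 200 (OK)
print(html[:80])                            # (iii) the content returned is HTML markup
```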
The creation of the Mosaic browser by Marc Andreessen in 1993, followed by the founding of Netscape early in 1994, marked the beginning of the internet boom that lasted throughout the rest of the 1990s and led to the mass uptake in web usage that continues to increase to this day. A screenshot of an early version of Mosaic is shown in Fig. 1.4.
Figure 1.4 Mosaic browser initially released in 1993. (Source: http://gladiator.ncsa.illinois.edu/Images/press-images/mosaic.jpg.)
1.3 Brief History of Search Engines
The roots of web search engine technology are in information retrieval (IR) systems, which can be traced back to the work of Luhn at IBM during the late 1950s (15). IR has been an active field within information science since then, and has been given a big boost since the 1990s with the new requirements that the Web has brought.
Many of the methods used by current search engines can be traced back to the developments in IR during the 1970s and 1980s. Especially influential is the SMART (system for the mechanical analysis and retrieval of text) retrieval system, initially developed by Gerard Salton and his collaborators at Cornell University during the early 1970s (16). An important treatment of the traditional approaches to IR was given by Keith van Rijsbergen (17), while more modern treatments with reference to the Web can be found in Refs 18, 19, 20, and 21. More recent developments, which concentrate on web technologies, are the probabilistic perspective on modeling the Web as in Ref. 22 and the data mining perspective on managing web information, which can be found in Refs 23 and 24.
Owing to the massive amount of information on the Web, right from the early days of the Web, search engines have become an indispensable tool for web users. A history of search engines detailing some of the early search services can be found in Ref. 25.3
Here, we will be very selective and mention only a few of the early and current search engines; see http://searchenginewatch.com/links and http://en.wikipedia.org/wiki/List_of_search_engines for up-to-date listings of the major search engines. More details on many of the current search engines are spread throughout the book.
Yahoo (www.yahoo.com), which started up in February 1994, was one of the earliest search services.4 Initially, Yahoo was only providing a browsable directory, organizing web pages into categories which were classified by human editors. Yahoo continues to maintain a strong brand and has evolved into a full-fledged search engine by acquiring existing search engine technology in mid-2003. (You can get some insight on the latest innovations in Yahoo's search engine from its weblog at www.ysearchblog.com.)
InfoSeek, which started up in July 1994, was the first search engine that I was using on a regular basis, and as with many of the innovative web tools, users voted with their clicks and its reputation spread by word of mouth. In July 1998, Infoseek merged with Walt Disney's Buena Vista Internet Group to form Go.com, which was ultimately abandoned in January 2001.
Inktomi, which started up in September 1995, provided search engine infrastructure rather than delivering a service from its own web site. Until it was acquired by Yahoo in March 2003, it provided search services to some of the major search engines.
AltaVista (www.altavista.com), which started up in December 1995, was the second search engine that I was using on a regular basis. It was initially a research project at Digital Equipment Corporation, and was eventually acquired by Overture in April 2003.
AlltheWeb (www.alltheweb.com) was launched in May 1999 by Fast Search & Transfer, and in a very short time was able to build a very large and fresh index with fast and accurate search results. It was also acquired by Overture in April 2003.
Ask Jeeves (www.ask.com) started up in April 1996. It went public in July 1999, and is one of the survivors in the search engine game. Its strong brand and distinctive question answering facility have evolved into a general search service through its acquisition of Teoma in September 2001, which has enabled it to manage a proprietary search service and develop its own search technology. It was acquired by the e-commerce conglomerate IAC (InterActiveCorp) in July 2005.
Overture (www.overture.com) started up as Goto.com in September 1997, and pioneered pay-per-click search engine advertising. It was renamed Overture in September 2001 and was acquired by Yahoo in July 2003. In April 2005, Overture was rebranded as Yahoo Search Marketing (http://searchmarketing.yahoo.com).
Bing (www.bing.com) is Microsoft's search engine that went online in June 2009. It replaced Live Search, released in September 2006, which in turn replaced MSN Search, originally launched in August 1995 to coincide with the release of Windows 95. Initially, MSN Search partnered with major search engines to provide the search facility for its site. Realizing the strategic importance of search to Microsoft's core business, Microsoft announced, in 2003, that it would develop its own proprietary search technology. The beta version of the search engine was released by MSN in November 2004, and in February 2005 MSN Search was officially delivering search results from its internally developed engine. (You can get some insight on the latest innovations in Bing's search engine from its weblog at www.bing.com/community/blogs/search.)
Google (www.google.com) was started up in September 1998 by Larry Page and Sergey Brin, then PhD students at Stanford University.5 Google was the third search engine that I was using on a regular basis and am still using today, although I do consult other search services as well. It became a public company in August 2004, and, as of late 2004, has been the most popular search engine. You will find a wealth of information in this book on the innovative features that Google and other search engines provide. (You can get some insight on the latest innovations in Google's search engine from its weblog at http://googleblog.blogspot.com.)
1 Video clips from Engelbart's demo can be found at http://sloan.stanford.edu/mousesite/1968Demo.html.
2 A little history of the World Wide Web from 1945 to 1995. www.w3.org/History.html.
3 See also, A history of search engines, by W. Sonnenreich. www.wiley.com/legacy/compbooks/sonnenreich/history.html.
4 The history of Yahoo!—How it all started. http://docs.yahoo.com/info/misc/history.html.
5 Google History. www.google.com/corporate/history.html.
Chapter 2
The Web and the Problem of Search
“Basically, our goal is to organise the world's information and make it universally accessible and useful.”
Larry Page, cofounder of Google
To understand the magnitude of the search problem we present some statistics regarding the size of the Web, its structure, and usage, and describe the important user activity of information seeking. We also discuss the specific challenges web search poses and compare local site search within an individual web site to global search over the entire web.
Chapter Objectives
Give an indication of the size of the Web, and how it can be measured.
Give an indication of the relative usage of search engines.
Highlight the differences between structured data organized in tables, and traditional web data that does not have a fixed structure.
Explain the bow-tie structure of the Web.
Introduce the notion of a small-world network (or graph) in the context of the Web.
Discuss different kinds of information-seeking strategies on the Web: direct navigation, navigating within a directory and using a search engine.
Discuss the problems inherent in web information seeking.
Introduce a taxonomy of web searches.
Present the differences between web search and traditional information retrieval.
Introduce the notions of precision and recall used to evaluate the quality of an information retrieval system, and discuss these in the context of web search.
Discuss the differences between search within a local web site and global web search.
Highlight the fact that web search engines do not solve the site search problem.
Make clear the difference between search and navigation.
2.1 Some Statistics
The Web is undoubtedly the largest information repository known to man. It is also the most diverse in terms of the subject matter that it covers, the quality of information it encompasses, its dynamic nature in terms of its evolution, and the way in which the information is linked together in a spontaneous manner.
2.1.1 Web Size Statistics
As an indication of the massive volume of the Web, an estimate of its size, given by Murray of Cyveillance in July 2000 (1), was 2.1 billion pages. At that time the Web was growing at a rate of 7.3 million web pages a day, so according to this prediction there were already over 4 billion web pages by April 2001. Extrapolating forward using this growth rate, we can estimate that the Web would have over 28 billion web pages in 2010. As we will see, this estimate was very conservative as our size estimate for 2010 is about 600 billion, which implies a growth rate of 200 million web pages per day.
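As a quick check of this linear extrapolation, the short calculation below (my own, using the figures from the text and rough day counts) reproduces both the April 2001 and the 2010 projections:

```python
pages_july_2000 = 2.1e9     # Cyveillance estimate, July 2000
growth_per_day = 7.3e6      # estimated daily growth rate at that time

days_to_april_2001 = 9 * 30     # roughly nine months after July 2000
days_to_2010 = 10 * 365         # roughly a decade after July 2000

print(pages_july_2000 + growth_per_day * days_to_april_2001)   # ~4.1 billion pages
print(pages_july_2000 + growth_per_day * days_to_2010)         # ~28.7 billion pages
```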
This estimate does not include deep web data contained in databases, which are not directly accessible to search engines (2). As an example, patent databases such as those provided by the US Patent and Trademark Office1 are only accessible through a tailored search interface. Thus, without direct access to such data, search engines cannot fully index this information.2 It is estimated that the deep web (also known as the hidden or invisible web3) is approximately 550 times larger than the information that can be accessed directly through web pages. Other types of web data, which are ephemeral in nature such as train timetables (which may last months or years) and travel bargains (which normally last only weeks), or contain complex formats such as audio and video, are problematic for search engines and although not invisible, are difficult to deal with. Also, there are web pages which are literally not accessible, since they are not linked from other visible web pages, and thus are deemed to be part of the hidden web.
A deep web site is accessed through web query interfaces that access back-end web databases connected to a web server. Therefore, a deep web site may have several query interfaces connecting to one or more web databases. A study from 2004 estimated that there are approximately 0.30 million deep web sites, 0.45 million web databases, and 1.25 million query interfaces (3). Through random sampling from these databases, the study concluded that the three major search engines (Google, Yahoo, and Microsoft's Live, rebranded as Bing) cover about one-third of the deep web. It also transpires that there is a significant overlap between the search engines in what is covered. So the deep web is not so invisible to search engines, but what is hidden seems to be hidden from all of them.
For search engines, the issue of coverage, that is, the proportion of the accessible web they hold in their web page index, is crucial. However good the search engine tool may be, if its coverage is poor, it will miss relevant web pages in its results set.
In early 2004, Google reported that their index contained 4.28 billion web pages.4 After an intensive crawling and re-indexing period during 2004, Google announced later in the year that it had nearly doubled its index to a reported size of over 8 billion web pages.5 For comparison, toward the end of 2004 MSN Search (rebranded as Bing in mid-2009), which had then begun deploying its own search engine, reported an index size of over 5 billion web pages;6 in April 2005, Yahoo search reported a similar index size of over 5 billion;7 and Teoma, the search engine powering Ask Jeeves, reported an index in excess of 2 billion web pages.8
Older estimates of search engine sizes from the end of 2002 were as follows: Google had over 3 billion documents, AlltheWeb (now integrated with Yahoo Search) had 2 billion documents, AltaVista (also integrated into Yahoo Search) had over 1 billion documents, Teoma had over 1 billion documents, and MSN Search had access to over 1 billion documents.9
As we will see below, the Web has grown since 2004, and our current estimate of the accessible web as of 2010 stands at about 600 billion pages. This estimate may still be conservative, as each search engine covers only a certain fraction of the totality of accessible web pages (4), but it gives us a good idea of the scale of the enterprise. The exact number is elusive, but our current estimate of 600 billion accessible web pages, approaching 1 trillion, is probably not far from the truth; this notwithstanding the issue of the quality of a web page and how often it is visited, if at all.
To measure the size of the Web, Lawrence and Giles (5) (see also Ref. (6)) had an ingenious idea based on a widely used statistical method to estimate the size of a population, which is called the capture–recapture method (7). To illustrate the method, suppose you have a lake of fish and you want to estimate their number. Randomly select, say 100 fish, tag them, and return them to the lake. Then, select another random sample with replacement, say of 1000 fish, from the lake and observe how many tagged fish there are in this second sample. In this second sample, some of the fish may be selected more than once, noting that the chance of selecting a tagged fish will be the same for each fish in the second sample; that is, 100 divided by the total number of fish in the lake. Suppose that there were 10 tagged fish out of the 1000, that is, 1%. Then we can deduce that the 100 fish are in the same proportion relative to the whole population, that is they are 1% of the total population. So, our estimate of the number of fish in the lake in this case will be 10,000.
Using this method, Lawrence and Giles defined the following experiment with pairs of search engines to estimate the size of the Web. To start with, they tagged all the pages indexed by the first search engine, just as the fish were tagged. They then chose several typical queries (575 to be precise) and counted the number of unique hits from the first search engine in the pair; that is, they counted the number of web pages returned from the first search engine. They then fired the same queries at the second search engine and measured the proportion of tagged pages in its results set; these pages are in the intersection of the results of the two search engines. As with the fish, assuming that the set of all tagged pages is in the same proportion relative to the set of all accessible web pages as the intersection is relative to the results set of the second search engine, we can estimate the size of the accessible web. The resulting formula is the number of pages indexed by the first search engine, multiplied by the number of pages returned by the second search engine, divided by the number of pages in the intersection.
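To make the calculation concrete, the following short Python sketch (my own illustration, not code from the book) applies the capture–recapture estimate first to the fish example and then to a pair of search engines; the index size, result count, and overlap figures in the search engine case are hypothetical.

```python
def capture_recapture(tagged, sample_size, tagged_in_sample):
    # Estimated population size = tagged * sample_size / tagged_in_sample
    return tagged * sample_size / tagged_in_sample

# Fish example from the text: 100 tagged fish, and 10 of a second sample
# of 1000 fish turn out to be tagged, giving an estimate of 10,000 fish.
print(capture_recapture(100, 1000, 10))

# Search engine pair: the pages indexed by engine A play the role of the
# tagged fish, the pages returned by engine B for the test queries are the
# second sample, and their intersection are the "recaptured" pages.
indexed_by_a = 150e6      # hypothetical: pages indexed by the first engine
returned_by_b = 20_000    # hypothetical: pages returned by the second engine
in_both = 5_000           # hypothetical: pages in the intersection
estimate = capture_recapture(indexed_by_a, returned_by_b, in_both)
print(f"Estimated accessible web: {estimate:,.0f} pages")
```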
Their estimate of the size of the Web from a study carried out in 1998 was 320 million pages, and around 800 million from a later study carried out in 1999. A further estimate from 2005, using a similar technique, claims that the indexable web has more than 11.5 billion pages (8).
A more recent estimate from the beginning of 2010, which is periodically updated on www.worldwidewebsize.com, puts a lower bound on the number of indexable web pages at about 21 billion pages. The technique used by de Kunder to reach this estimate is based on the expected number of web pages containing a selected collection of words. Each day 50 word queries are sent to Google, Yahoo, Bing, and Ask, and the number of web pages found for these words is recorded. The 50 words have been chosen so that they are evenly spread on a log–log plot of word frequencies constructed from a sample of more than 1 million web pages from the Open Directory (www.dmoz.org), which can be considered to be a representative sample of web pages. (The distribution of word frequencies obeys Zipf's law; see Sections 5.1.3 and 9.6.) Once the word frequencies are known, the size of each search engine index can be extrapolated. The size of the overlap between the search engines is computed from the daily overlap of the top-10 results returned by the search engines from a sufficiently large number of random word queries drawn from the Open Directory sample. Finally, the overlap and index sizes are combined to reach an estimate of the Web's size.
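The sketch below (my own, under simplifying assumptions and with hypothetical numbers) illustrates only the extrapolation step of this method: if a word is known to appear in a fraction f of pages in a representative sample, and a search engine reports n pages containing that word, the index size is roughly n/f; averaging over words chosen across the frequency range smooths the estimate. Combining the per-engine estimates via the pairwise overlaps is omitted here.

```python
def estimate_index_size(reported_hits, sample_fraction):
    # reported_hits: word -> number of pages the engine reports for that word
    # sample_fraction: word -> fraction of pages in a representative sample
    #                  (e.g., Open Directory pages) that contain the word
    estimates = [reported_hits[w] / sample_fraction[w] for w in reported_hits]
    return sum(estimates) / len(estimates)

# Hypothetical figures for three of the query words.
hits = {"apple": 2.1e9, "theorem": 9.5e7, "zebra": 4.0e7}
fractions = {"apple": 0.10, "theorem": 0.005, "zebra": 0.002}
print(f"Estimated index size: {estimate_index_size(hits, fractions):,.0f} pages")
```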
This estimate is much lower than the 120 billion pages that the search engine Cuil (www.cuil.com) reported indexing in 2008.10 Although Google has not been disclosing the size of its index, a post from its Web Search Infrastructure Team on the official Google blog from July 200811 reported that they process over 1 trillion (10^12) unique URLs. This figure of 1 trillion contains duplicate web pages such as autogenerated copies, so on its own it does not tell us how many web pages there actually are. To get an estimate of the Web's size, we can make use of the finding that about 30% of web pages are either duplicates or near-duplicates of other pages (9). The resulting estimate of about 700 billion web pages is still a rough upper bound, as some pages are created with the intent of deceiving search engines into including them in their index and have little relevance to users, detracting from the user experience. The activity of creating such pages is known as spamdexing, and such pages, when detected by a search engine, are considered spam and therefore not indexed. Using a further estimate that about 14% of web pages are spam (10), we can conclude that the Web contains approximately 600 billion indexable web pages as of 2010.
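The arithmetic behind this estimate, spelled out with the figures quoted above:

```python
unique_urls = 1e12              # ~1 trillion unique URLs reported by Google in 2008
duplicate_fraction = 0.30       # ~30% duplicates or near-duplicates (9)
spam_fraction = 0.14            # ~14% spam (10)

non_duplicates = unique_urls * (1 - duplicate_fraction)   # ~700 billion pages
indexable = non_duplicates * (1 - spam_fraction)          # ~600 billion pages
print(f"{non_duplicates:,.0f} non-duplicate pages, {indexable:,.0f} indexable pages")
```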
Even more daunting is the thought of delivering a speedy search service that has to cope with over 500 million (half a billion) queries a day, which is about 6000 queries a second. The answer to the question, “How do they do it?” will be addressed in Chapter 4, when we dig deep inside search engine technology. Keeping up with the pace in this extremely dynamic environment is an uphill struggle. The Web is very fluid; it is constantly changing and growing. Many of its pages are dynamically generated such as news pages which are constantly updated and stock prices which are continuously monitored, and many pages are displayed differently to varying audiences; for example, depending on the browser used, or some contextual information such as the country of origin of the surfer (if this is evident from their domain name) or the time of day. These complexities often mean that the web pages are written in a scripting language rather than in HTML and thus are harder for search engines to interpret. On top of all this, there is a multitude of data formats to deal with,12 which makes the search engine's task even more difficult.
In their 1999 study, Lawrence and Giles also reported that the degree of overlap between search engines is low, a result that has been confirmed time and time again since then (11). This would imply that metasearch, where results from several search engines are aggregated, would significantly increase the coverage of a search service. Although this is true in principle, the major search engines are now blocking metasearch engines unless they pay for the service. Also, as the relative coverage of the major search engines increases, the benefits of metasearch are less clear. As gatekeepers of web information, the major search engines, predominantly Google, Yahoo, and Microsoft's Bing, are rapidly monopolizing the web search space, and thus other issues, which may lead to regulation of search engines, are currently being raised and debated13; see Section 4.2.
A higher level measure of the size of the Web is the number of accessible web sites, rather than web pages. So, to estimate the number of web sites we need only identify the home page of each site as its representative. Researchers at the Online Computer Library Center (OCLC)14 have conducted annual samples of the Web from 1997 to 2002 in order to analyze the trends in the size of the public web, which includes only sites that offer free and unrestricted access to a significant amount of their content.
Each web site can be identified by its IP (Internet Protocol) address. A random sample from the set of valid IP numbers is generated and each IP address is tested to check if it corresponds to an existing web site. The proportion of web sites within the sample is then used to extrapolate an estimate of the number of web sites from the total number of valid IP addresses. This extrapolation can be viewed as an application of the capture–recapture method.
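The sketch below (my own illustration) mimics the sampling procedure just described. Because actually probing random IP addresses over the network is not practical in an example, the test of whether an address hosts a public web site is simulated with an assumed hit rate; the address space size and the hit rate are illustrative only.

```python
import random

ADDRESS_SPACE = 2**32        # size of the IPv4 address space (stands in for the
                             # set of valid IP numbers used in the actual surveys)
ASSUMED_SITE_RATE = 7e-4     # hypothetical fraction of addresses hosting a public site

def hosts_public_website(ip):
    # In the real survey this would probe the address with an HTTP request;
    # here we simulate a positive answer with the assumed probability.
    return random.random() < ASSUMED_SITE_RATE

sample_size = 100_000
hits = sum(hosts_public_website(random.getrandbits(32)) for _ in range(sample_size))
estimate = (hits / sample_size) * ADDRESS_SPACE
print(f"Sampled {sample_size} addresses, {hits} hosted a site; "
      f"estimated {estimate:,.0f} public web sites")
```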
In 1993 there were just 130 web sites15 and the growth was exponential until 2000, when there were about 2.9 million public web sites. In 2001 there were about 3.1 million web sites in the public web, and in 2002 the number amazingly decreased to about 3 million (12). This evidence suggests that the growth of the Web may periodically slow down in terms of the number of web sites, which does not necessarily mean that the growth in terms of the number of pages will follow a similar trend. One reason for the slowdown in 2002 is that web technology had lost some of its novelty factor and there was no longer a mad rush to buy domain names and gain a web presence. On the one hand, organizations were spending more time consolidating their web sites; on the other, due to the slowdown in the economy at that time, many web sites simply disappeared.
Statistics regarding the number of registered commercial domains are also available, although many web sites own several domain names, implying that such statistics are an unreliable measure of the actual number of web sites. As of the beginning of 2010 there were about 113.90 million registered commercial domains compared to about 44.30 million in October 2004.16 (Academic and government domains are excluded from this count.) It is interesting to note that although on the whole the number of registered domains is increasing, many domains are also deleted from the count (i.e., they are not re-registered when they expire).
Netcraft (www.netcraft.com) performs a monthly survey of the number of web sites across all domains, reporting about 233.85 million sites as of December 2009 compared to about 66.80 million in June 2005.17 Netcraft identifies the number of web sites by counting the web servers hosting a domain rather than by counting valid IP addresses.
An interesting point to make is that some of the web sites and web pages that have disappeared may be accessible through the Internet Archive,18 which is a nonprofit company founded to build an “Internet Library” with the aim of offering permanent access to its digital collections. This is part of a broader agenda to archive web material, which is becoming a priority for the Web, since to a large degree the state of the Web at any given time represents a snapshot of our society and culture. Thus, there is value in preserving parts of the Web, so as to have access to its previous states. The issues relating to preservation and archiving of the Web are part of a larger concern regarding the lifespan of digital artifacts and the problem of having access to historical information.
So, how much information is out there? According to a study carried out in Berkeley in 2003,19 if we include information in the deep web, the numbers add up to about 92,000 TB of information, that is, 92 PB (a terabyte is a million million bytes, and a petabyte is 1000 TB). (The size of the surface web, i.e., the World Wide Web, was estimated at about 170 TB.) With the amount of information on the Web growing on a day-to-day basis, it will not be long before we will be talking in terms of exabytes (1 million TB) of information. Of course, much of the content is irrelevant to us and of doubtful quality, but if it is out there and can be searched, someone may be interested in it. At the end of the day, search engine companies continually have to make a choice on which content they should index and make publicly available, and this will undoubtedly lead to some controversy.
2.1.2 Web Usage Statistics
The market share of the competing search engines is measured by companies that track the search and browsing behavior from a panel of several million users while they are surfing the Web.20 We quote some statistics from late 2008 and the beginning of 2009, noting that the percentages are only approximations obtained from sampling, and that the reported measurements are variable across the different information providers. The percentages given are indications of trends and thus, are subject to fluctuations.
The most visible trend is that Google's popularity in terms of audience reach has become increasingly dominant in the western world in the last few years, but its position is far from leading in the Far East. The rise of Google, in the space of a few years, from an experimental search engine developed by two research students at Stanford in 1998 is in itself an amazing story, which is told in depth elsewhere. It is hard to predict whether these trends will persist, and when making such predictions we should also take into account the fact that search engine loyalty is generally low.
In the United States, the popularity statistics show Google with 64%, Yahoo with 21%, Bing (Microsoft's search engine, rebranded as Bing from Live in mid-2009) with 8%, and Ask (also known as Ask Jeeves) with 4%. It is interesting to note that Google's market share is much larger in many of the European countries such as France (91%), Germany (93%), Italy (90%) and the United Kingdom (90%); similar figures are seen in South America. The global picture includes Baidu (www.baidu.com), the leading Chinese search engine which was launched in 1999, with 13% globally, but Google is still the global leader with 64%, followed by Yahoo with 15%, Bing with 4%, and Ask with 2%.
In the Far East, the story is somewhat different. In China the market share of Baidu is 57%, Google is 16%, and Yahoo is 5%. Major reasons for the big success of a local brand in China are the cultural and language differences. Baidu has a controversial policy (at least in the West), in that it provides searchers with links to music files that are available for download on the Web; there is an ongoing dispute between Google and Baidu on this issue. In Korea, a local web search engine called Naver (www.naver.com) which launched in 1999, is even more dominant with a market share of 75%. Surprisingly, in Korea the second most popular search engine, Daum (www.daum.net), which started in 1995 and was Korea's first web portal, is also local with a market share of 20%. In Korea Google's share is only 1.5%, coming behind Yahoo which has a share of 4%. Here also, major reasons for the success of the local brands are the cultural and language differences. In Japan, Yahoo with a market share of 51% is the leader, followed by Google with 38%. Yahoo had an early head start in Japan, incorporating there in 1996, less than a year after its parent company was formed; on the other hand, Google opened offices in Japan only in 2001. Yahoo Japan has a very localized strategy, with 40% of its shares being owned by the local telecommunications and media company Softbank. It has built a very local identity and is considered by many Japanese as a local brand. Russia is another country where Google is second with a market share of 21% behind the local web search engine, Yandex (www.yandex.com), with a share of 55%. Yandex was launched in 1997, and its success relative to Google, Yahoo, and Microsoft's Bing can be attributed to its handling of the Russian language.
How many people are surfing the Web? There were about 800 million internet users as of late 2004 and the number doubled to 1.6 billion in mid-2009 (which is approaching a quarter of the world's population).21
According to a report from late 2008,22 there are about 400 million broadband subscribers, which covers about a quarter of the Internet users. The share of broadband subscription is highest in Western Europe (about 26%), North America (about 22.5%), and South and East Asia, which includes China and India (about 23%). Asia-Pacific has a much lower share (about 15.5%) and the rest of the world's share is even lower (about 13%). It is interesting to note that if we look at countries, then China has the largest number of broadband subscribers at about 81 million and has thus overtaken the United States, which at second place has about 79 million subscribers.
As the gap in pricing between broadband and narrowband continues to close, so will the trend of increased broadband connections continue to rise. In terms of trends as of 2010, mobile broadband is starting to take off in countries where the network infrastructure is available.
For October 2004, usage statistics indicate that users spent, on an average, 25 hours and 33 min surfing the net, viewing 1074 web pages, with an average of 35 min per session and viewing 35 web pages during the session. For comparison purposes, the statistics for February 2009 revealed that users spent, on an average, 34 hours and 17 min surfing the net, viewing 1549 web pages, with an average of 60 min per session and viewing 44 pages per session.23
This indicates that users are, on an average, spending more time surfing the Web and viewing more pages than before. It is worth noting that these statistics tend to fluctuate from month to month and that there are cognitive limits on what internet users may achieve within any surfing session.
In terms of search engine hits per day, Google has reported over 200 million during mid 2003.24 The number of searches Google receives per day as of 2010 is elusive, but it is probably of the order of 3.5 billion per day which is over 40,000 queries per second (13). If we are interested in the volume of queries for a particular phrase or keyword, we can obtain up-to-date figures by making use of the keyword tool provided by Google,25 which is used by advertisers to find appropriate keywords to improve the performance of a campaign. For example, the tool shows that the average monthly volume in April 2009 for the query “computer science” was 673,000.
We mention the Pew Internet and American Life Project (www.pewinternet.org), which is a nonprofit “fact tank” that produces reports exploring the impact of the Internet on families, communities, work and home, daily life, education, health care, and civic and political life. Its reports are based on data collection from random phone surveys, online surveys, and qualitative research, supplemented with input from research experts in the field of study. The project has produced reports on a variety of topical issues such as music downloading, online privacy, online banking, online dating, broadband users, Wikipedia users, mobile access to data and information, adults and social network web sites, cloud computing, and the future of the Internet.
We have all heard of road rage but now we have the phenomenon of web rage or search rage. A survey conducted by Roper Starch Worldwide in mid-200026 concluded that it takes on an average 12 min of web searching before the onset of search rage when users get extremely frustrated and lose their temper. A more recent survey commissioned in the United Kingdom by the Abbey National during the beginning of 200227
