Search Engine Society

Alexander Halavais
Description

Search engines have become a key part of our everyday lives. Yet there is growing concern with how algorithms, which run just beneath the surface of our interactions online, are affecting society. This timely new edition of Search Engine Society enlightens readers on the forms of bias that algorithms introduce into our knowledge and social spaces, drawing on recent changes to technology, industries, policies, and research. It provides an introduction to the social place of the search engine and addresses crucial questions such as:

How have search engines changed the way we organize our thoughts about the world, and how we work?

To what extent do politics shape search, and does search shape politics?

This book is a must-read for those who wish to understand the future of the social internet and how search shapes it.


Contents

Cover

Title Page

Copyright

Introduction

The answer

The mirror

What has changed

1 The Engines

Search engines today

Before the search engine

How a search engine works

Pre-web internet search

Searching the web

Commodifying search

Search and society

Notes

2 Searching

The perfect search

Searching skills

A diversity of search

Tracking searches and searchers

The adaptive search engine

The mind of a search engine

The adaptive society

Notes

3 Sociable Search

Social search

Search context

Collaborative filtering

Social networks and search for expertise

(Re-)Finding community

4 Attention

The web is not flat

PageRank

An attention economy

Spam, counter-spam

Who pays for search?

Curation and search

The search ecosystem

Notes

5 Knowledge and Democracy

Search inequality

Bursting bubbles

Knowledge glocalization

Search and traditional knowledge institutions

Addressing the imbalances

The search intellectual

From problems to issues

Notes

6 Control

Containing search

Algorithmic governance

Intellectual property

Promoting search diversity

Don’t be secretive

Notes

7 Privacy

Finding and forgetting the individual

Your search is showing

Search, surveillance, and transparency

The global village and new legends

Notes

8 Future Finding

The fading of the search engine

The rise of the findable ecosystem

The ascendance of the algorithm

The history of right now

Platform wars II

Who will know?

Notes

Glossary

Bibliography

Index

End User License Agreement


Digital Media and Society Series

Nancy Baym, Personal Connections in the Digital Age, 2nd edition

Mercedes Bunz and Graham Meikle, The Internet of Things

Jean Burgess and Joshua Green, YouTube

Mark Deuze, Media Work

Andrew Dubber, Radio in the Digital Age

Charles Ess, Digital Media Ethics, 2nd edition

Jordan Frith, Smartphones as Locative Media

Alexander Halavais, Search Engine Society, 2nd edition

Martin Hand, Ubiquitous Photography

Robert Hassan, The Information Society

Tim Jordan, Hacking

Graeme Kirkpatrick, Computer Games and the Social Imaginary

Leah A. Lievrouw, Alternative and Activist New Media

Rich Ling and Jonathan Donner, Mobile Communication

Donald Matheson and Stuart Allan, Digital War Reporting

Dhiraj Murthy, Twitter, 2nd edition

Zizi A. Papacharissi, A Private Sphere: Democracy in a Digital Age

Jill Walker Rettberg, Blogging, 2nd edition

Patrik Wikström, The Music Industry, 2nd edition

Search Engine Society

Second Edition

ALEXANDER HALAVAIS

polity

Copyright © Alexander Halavais 2018

The right of Alexander Halavais to be identified as Author of this Work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.

First edition published in 2008 by Polity Press

This edition first published in 2018 by Polity Press

Polity Press
65 Bridge Street
Cambridge CB2 1UR, UK

Polity Press
101 Station Landing, Suite 300
Medford, MA 02155, USA

All rights reserved. Except for the quotation of short passages for the purpose of criticism and review, no part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the publisher.

ISBN-13: 978-1-5095-1686-5

A catalogue record for this book is available from the British Library.

Names: Halavais, Alexander M. Campbell, author.
Title: Search engine society / Alexander Halavais.
Description: Second edition. | Cambridge, UK ; Medford, MA : Polity Press, [2017] | Series: Digital media and society | Includes bibliographical references and index.
Identifiers: LCCN 2017024677 (print) | LCCN 2017037380 (ebook) | ISBN 9781509516858 (Mobi) | ISBN 9781509516865 (Epub) | ISBN 9781509516827 (hardback) | ISBN 9781509516834 (pbk.)
Subjects: LCSH: Web search engines--Social aspects. | Information technology--Social aspects.
Classification: LCC HM851 (ebook) | LCC HM851 .H343 2017 (print) | DDC 303.48/33--dc23
LC record available at https://lccn.loc.gov/2017024677

The publisher has used its best endeavors to ensure that the URLs for external websites referred to in this book are correct and active at the time of going to press. However, the publisher has no responsibility for the websites and can make no guarantee that a site will remain live or that the content is or will remain appropriate.

Every effort has been made to trace all copyright holders, but if any have been inadvertently overlooked the publisher will be pleased to include any necessary credits in any subsequent reprint or edition.

For further information on Polity, visit our website: politybooks.com

Introduction

In 2016, German Chancellor Angela Merkel raised an alarm about the influence of search on the web: “Algorithms, when they are not transparent, can lead to a distortion of our perception, they can shrink our expanse of information” (Connolly 2016). While there have, over time, been many criticisms of the effect of search engines on society, this brings us to the very point of many of them: search reduces the amount of information available to us, and in doing so introduces bias.

Perhaps you come to this book wondering whether search engines are biased. They are. It would be impossible for them to be otherwise. Indeed, that is their intended function. A perfect mirror, reflecting the web and other knowledge spaces precisely, would serve little purpose. We want search engines to act as a filter, removing the less important stuff, so that we can apply our naturally limited attention to what really matters.

Matters to whom? That is perhaps the most important question we can ask about how search has changed our social lives. Over nearly a decade since the first edition of this book was published, a new focus has emerged around questions of algorithmic culture: the power of hidden processes that shape our information space, and the ways economic and political relations are encoded into platforms and networks. Naturally, these questions have existed much longer, and have often been central to the study of large sociotechnical systems for decades, but the rapid integration of the internet into the everyday lives of most of the planet’s population has given rise to new questions of how these often invisible biases might be affecting what it is to be a human of the twenty-first century.

Search engines and social platforms have quietly been at the forefront of algorithmic culture, reshaping the web, and, by extension, much of the way in which we interact with one another. No one has missed the rise of the largest search engine company, Google, as one of the new economic powerhouses of the last few decades. But the focus has largely been on the ways in which Google has profited from advertising, or its seeming growth into a company that provides everything from education to automobiles, rather than the core technology that drove its growth. The search engine seems so unassuming, working quietly in the background, changing the structure of our lives.

The answer

Take a moment and type the following search string into your favorite search engine: “Google is your friend.” Today, the number of “hits” for that phrase on Google stands at “about 882,000.” The company is successful, but who knew it was so friendly? Even the abbreviation of the phrase – GIYF – receives over 100,000 hits. If you have picked up this book, you can probably guess the context in which this phrase is used. If not, a description may be found at justfuckinggoogleit.com, which reads, in part:

Google Is Your Friend

All Smart People Use Google

It Appears That You Are Not One Of Them

The search engine has become so much a part of our culture that there is a common assumption that we have found a cure for stupid questions. Folded into that assumption, we find a host of others: that even the unintelligent have access to and can use a search engine, that a search engine will lead someone to a page that contains accurate information, and that questions are best directed first to a machine.

Unpacking the black box of the search engine is something of interest not only to technologists and marketers, but to anyone who wants to understand how we make sense of a newly networked world. Search engines and social platforms have come to play a central role in corralling and controlling the ever-growing sea of information that is available to us, and yet they are trusted more readily than they ought to be. They freely provide, it seems, a sorting of the wheat from the chaff, and answer our most profound and most trivial questions. They have become an object of faith.

We ask many things of search engines – what do they ask in return? Search engines are at once the most and the least visible part of the digital, networked revolution. The modern search engine has taken on the mantle of what the ancients of many cultures thought of as an oracle: a source of knowledge about our world and who we are. Children growing up in the twenty-first century have only ever known a world in which search engines could be queried, and almost always provide some kind of an answer, even if it may not be the best one.

The mirror

Search engines appear to be merely a functional tool, aimed at making the real work of the web easier, but they have the potential to reveal to us not only their internal structures, but the structures of the societies that build them. In Troilus and Cressida, Shakespeare (1912, p. 35) hints at why an examination of search engines is so enticing:

And in such indexes, although small pricks
To their subsequent volumes, there is seen
The baby figure of the giant mass
Of things to come at large.

Search engines represent the filters through which we view the content of the web, screens that allow us to inflict our own desires on the “giant mass” of the web, taming it and making it useful. At the same time, the view they present is likely to shape future social values.

In his book Information please, Mark Poster (2006) reminds us of the change that has occurred: we once asked people for information and now we ask machines. He is interested in how this change to machine-mediated information affects our interactions. But it is all too easy to forget that the machines we are asking are constructed in ways that already reflect the conceptions and values of their makers.

What has changed

Much of the search landscape has changed since the first edition of this book, but a remarkable amount has remained the same. This is reflected in the content of the second edition, which retains much of the structure of the original, and many of the central arguments. But writing books about internet phenomena is a cursed task; books exist outside of “internet time” (Karpf 2011). In the months between completing the manuscript of the first edition of this book and its publication, Google had reversed their approach to sociable search, and a dozen other changes had happened to the search ecosystem. No doubt, by the time this is in your hands, search will have evolved in ways not anticipated by this second edition. Updates and other materials will appear at the book’s companion site: ses-book.com.

I would like to thank those who read and responded to the first edition and particularly those who found the book useful to their own thinking and writing. There is nothing more gratifying than seeing the work of other scholars who were able to use my book as a point of reference or a foil. And I would like to thank my wife and my two sons, the youngest of whom was especially eager to know when this edition would be completed and I could focus my attention on what he considered more pressing matters.

We have come full circle in the years since the first edition. Many have thought seriously about search engines, “algorithms,” and what they mean for our culture. But we are moving into a new era of search, one in which it disappears even more quickly from view. When search becomes invisible – answering factual questions without reference to the source, or deciding which of our friends provides the best content on social platforms – it becomes even more vital to watch, to test, and to understand how search and discovery technologies are changing us. Especially as the technology employs ever more complicated and capable algorithms, we too will evolve in our relationship to knowledge and to each other. We should do so carefully and consciously.

CHAPTER ONE

The Engines

How did we come to mechanize the process of search? How is it that it became something that is increasingly done for us rather than by us? It certainly did not happen all at once. Like most large sociotechnical systems, the process occurred gradually, and was influenced by a combination of technical innovation and the effect of existing structures of economic and political power. And, not surprisingly, it has also reshaped both industrial relationships and political influence in turn. The current state of the search ecosystem is the result of many years of evolution, and layers of encoding of our relationships to each other and to our collective knowledge. Search engines have politics that have been baked in over a long period of time, and that process deserves a deep archeological exploration, peeling back layer by layer.

This chapter cannot reach the depth such a historical treatment requires. It can, however, provide an outline of that history and an indication of the kinds of processes that have been built into search technology. Moreover, it can show how that suite of technologies made its way into the larger media ecosystem, shaping it in turn. It is tempting to treat the search engine as a free-standing technology, an invention that has made it easier to find things located on another independent technology, the World Wide Web. But even a cursory investigation suggests that the search engine, like most other technologies, is not something that can be treated without reference to a larger social context, and to evolutionary social and cultural changes. The search engine, far from being an isolated modern artifact, represents a touchstone of digital culture, and a reflection of the culture in which it exists.

The permanent loss of search engines is now almost unfathomable, but, were it to occur, we would find that the way we communicate, learn about the world, and conduct our everyday lives had changed. And so, we must look beyond the familiar “search box” and understand what it reveals and what it conceals.

Search engines today

A basic definition of the search engine might refer to an information retrieval system that allows for “keyword” searches of distributed digital text. That definition often remains our frame of reference, if we have one. If you ask someone what a search engine is, however, they are less likely to provide a definition than they are to indicate one of the handful of popular web search engines that represent some of the most popular sites on the web: Google, Baidu, or Bing, for instance.

And these sites are popular. As of 2012, more than half of Americans said that they used a search engine at least once a day (Purcell, Brenner, & Rainie 2012). Google is easily the most popular search engine today, and the various Google sites, including its search engine, are among the most visited sites on the web (comScore 2016). Google’s dominance was already established a decade ago, and, despite inroads by Bing and Baidu, Google has continued to gain market share (see table 1.1). In 1999, Google was receiving 3.5 million search requests each day (Battelle 2005) and, while the growth has slowed in recent years, Google now receives at least a thousand times that number (Sullivan 2016), from more than a billion people each month. There can be little doubt that visits to search engines make up a large part of internet use, though it can be difficult to discover just how frequent that use is, and for what reasons.

One reason for this difficulty is that people often encounter the large search engines through the façade of another site – that is, without intending to. So a search on a particular website may rely on Google to do the actual searching, or it may draw on an internal search engine. Both of these are a form of search, but may be measured differently by different research firms (Hargittai 2004). Many portal sites are also search engines, so just measuring the visitors, for example, to Yahoo! properties does not provide a useful metric of actual searches. (And even if you did, those Yahoo! searches could just be repackaged Google searches: Sullivan 2015.) Facebook is not usually considered a “search engine” even though it handles a surprisingly large number of search queries on a daily basis. And the traditional search box accessed via the web is itself giving way with the shift to mobile technology as the primary form of access (Schwartz 2016); by 2015, more than half of the queries Google received were from mobile devices (Sterling 2015). As hard as measuring the use of public search engines is, it is nearly impossible to measure search more generally: people searching their company intranet or their hard drive, for example.

Table 1.1 Global search engine use as of September 2016

Source: NetMarketShare (2016). ComScore rates Google sites with a slightly lower share.

Search engine    Global share (%)
Google           73.02
Bing              9.26
Baidu             8.74
Yahoo!            7.07

Particularly over the last decade, there has been a rise in specialized search engines that seek to index not the entire web, but some constrained portion. This might be referred to as “vertical search,” as opposed to the “horizontal search” of the general-purpose search engines, though this distinction has broken apart as the large search engine companies seek to acquire any novel approaches that might help them to win a share of submarkets. There remain certain areas that are in some sense naturally vertical, often because they index a part of the web that is not easily accessed (the so-called “dark web”) or because they are otherwise defined by linguistic, cultural, or political borders, as in the case of Baidu or Yandex; but it is more accurate to say that search has grown increasingly complex in a number of ways.

Topically constrained search engines seek out only pages within a particular knowledge domain, or of a particular type of content. Some of these vertical search engines are focused on a particular industry. For example, an attorney in the United States might turn to open sources like FindLaw to provide news and information about their practice; to Lawyers.com to find an attorney within a particular practice area; to THOMAS, a search engine maintained by the federal government to track legislation; or to TESS to try to find out whether a proposed trademark is likely to infringe on an existing one – in addition to employing a number of subscription-based search engines for legal information such as those provided by Westlaw and Lexis-Nexis.

The inverse of this may be general search engines designed for particular kinds of users. The most obvious example of this is government surveillance networks. The US National Security Agency draws in massive amounts of digital information from all over the world, “nearly everything a typical user does on the internet,” an estimated 20 trillion transactions in the US alone, and analyzes it using an indexing and query system called XKeyscore (Greenwald 2013). Companies also use sophisticated real-time search and analytics systems that sift through huge amounts of data (in 2010, Raffi Krikorian indicated that Twitter alone handled eight terabytes each day), and then provide that intelligence to private clients. A company called Echosec is one of several that correlate images shared via social media with their geographic location to infer emerging events ranging from military operations to natural disasters. The CEO of the company recently noted that it reminded him of the early days of search engines – a quickly evolving field where the query is more complex than a few keywords (El Akaad 2015).

While less constrained in terms of topic, academic search attempts to seek out a particular kind of document: one that adheres to traditionally scholarly constraints. At the most basic level, these sites are efforts to move databases that have traditionally been found in libraries onto the web. ScienceDirect, for example, provides reference to scientific literature for web users, and Google Scholar offers the utility of a large article and citation index provided by scholarly journals combined with other scholarly (and less scholarly) sources from the web, extending Google’s tentacles into an important space for search. But there are dozens of others that provide access to open (e.g., BASE) or closed (e.g., DeepDyve) collections of academic articles. New ways of discovering this work – including through social networking platforms and large-scale analytics – will mean that this area will continue to evolve rapidly.

Academia is far from the only topically constrained space for doing search. While general-interest search engines like Google can be used to ferret out illicit files, a number of search engines specifically serve this particular niche. Torrentz.eu, which was shut down in 2016 after being in service for 13 years, was a meta-search engine that provided links to torrent files on various torrent trackers. And the “deep web” is partially defined by being those sites obscured from the major search engines, but they often have search engines of their own. For sites available via Tor, a routing system intended to provide a layer of anonymity, there exist more than a dozen search engines, including Ahmia.fi, Grams, and DARPA’s Memex project. And data need not be illicit to be “deep.” SunXDCC provides search for files shared via Internet Relay Chat. Search aggregators exist for things like local Craigslist listings, data to help to assess stocks, game cheats, coupons, and the meta-search engines that allow you to search across multiple engines.

In sum, while the general-purpose search engines provide access to the broadest range of resources, there remains space for search engines that are either deliberately constrained or that reach into areas where Google’s “crawlers” do not dare to tread. Even more specialized forms of search draw together data that would not usually be considered “documents.” Wolfram Alpha is not a web search engine at all, but an attempt to provide answers to questions. It can tell me what my most-liked photo on Facebook is, what the tensile strength of oak is, or what the plot of a mathematical function is. Zanran seeks to aggregate sources of numerical and statistical data from the web, so that a search for “age at death of US presidents” links to sites that provide that information in tabular form. These represent a basic parsing of information beyond keywords. Much of the excitement around the “semantic web” and microformats has dissipated, but the idea that structured information can be extracted and queried from large unstructured collections, like the World Wide Web, remains promising. Those who are trying to be noticed by search engines will frequently include metadata in their pages, including geolocation or indications of contact information. Efforts to develop protein and genetic material search systems go back at least a decade (Liebel, Kindler, & Pepperkok 2005), and a number of companies are now vying to be leaders in genomic search (Ossola 2015).

Thanks in large part to the shift to the mobile web, both searching for geographically constrained results and using geolocated data to constrain search have become vital. “Local search” has largely made local telephone directories a thing of nostalgia, allowing people to not only search for local businesses, but read reviews of those businesses written by their peers. Rather than competing with local search, many of the largest business directories (“yellow pages”) have created their own local search engines, as have local newspapers and television stations. Sometimes, being created within the same locale is enough to make two resources related. The Geo Search Tool (www.geosearchtool.com) or Google Earth can each help you find YouTube videos shot near the same location around the same time, providing a whole new way of thinking about organizing amateur video and other recording. And sometimes it need not be so explicit. Google has long localized results based on the general location of the searcher as revealed by their “IP (Internet Protocol) address.” While it is not clear how Google and other search engines use geographic signals at present, a patent in 2015 by Google (US 20150339397 A1) describes a method for predicting the location of the device used for a search as well as where the searcher is likely to go next.

Sometimes it is not the content area that determines the search, but the type of media. Although large search engines, beginning with AltaVista, generally have had some ability to search for multimedia, it continues to present some of the greatest challenges, as well as opportunities, for those creating new search technologies. Structured “metatags” can be leveraged when present, but it is extracting meaning from the content itself that is more difficult. A number of efforts have been made to extract the content of photographs and videos, often using machine learning. By seeing how humans classify the contents of a large number of photographs, a system may learn to replicate this skill when presented with an unclassified photograph. For this to work well, it generally requires a large number of human-classified examples. One of the reasons Facebook’s facial recognition system does well is that it can draw on a constant stream of human-assigned tags to help train it (Lachance 2016). As we move to devices that monitor the user’s environment, they may be engaged in ongoing image recognition, which represents a kind of continual search (Simonite 2013). Rather than categorizing multimedia, it may be enough to identify items that are in some way similar, as Google’s “search by image” does, or Shazam does when listening to a song in the environment and identifying it. Such similarity structures may prove to open doors to fundamentally different ways of searching and browsing music and video (e.g., the Songrium project: Masahiro, Goto, & Nakano 2014). As we move away from text-based documents and queries, search engines will be called upon to effectively extract information from these less structured forms of media.

As more and more of the world becomes internetworked, both the method of search and the world of searchable things extends beyond the purely digital. It would have been difficult to imagine, even a few years ago, that a search query would consist of the words “Alexa, where are my keys?” spoken while standing in your kitchen (Crist 2016). While the contours of the “Internet of Things (IoT)” are still being sketched out, it seems clear that search will reach beyond digitized documents and draw in real-time data from our devices, our social networks, and other sources, aggregating them into a usable search result. We are only just beginning to see examples of how they might play out. Early IoT-specific search engines, such as Shodan and Thingful, have focused on locating internet-connected appliances, and have found themselves at the center of discussions around privacy and security. And it may end up being the IoT devices themselves doing the searching (Carlton 2016) – what happens when your fridge needs to find the perfect ingredient for tomorrow’s dinner party?

The line between a search engine and what has traditionally been called artificial intelligence is narrowing. In Steven Spielberg’s 2001 film A.I., the search engine is represented as a projected character known as “Dr. Know” (“Ask Dr. Know! There is nothing I don’t.”). A similar role was played by the holographic librarian named “Vox” in the 2002 remake of The Time Machine. Both depictions suggest a cultural recognition of search as a function that requires intelligence, and a feeling that machines can take on a part of that process. That idea is hardly new – for decades before the term “search engine” evolved, there had been interest in the relationship of search (often through structured relationships) to machine intelligence (Thornton & Du Boulay 1992). IBM has focused much of its recent energy on the Watson “cognitive computing” platform, and has sold it in part as a solution for enterprise search, revealing “trends and patterns hidden in unstructured content.” Google is using machine learning to help to interpret user queries, employing an algorithm they call RankBrain to try to triangulate the meaning of a person’s query (Clark 2015).

And it is not just artificial intelligence that can provide answers – some of the answers you need are probably known by another human. Yahoo! Answers was launched in 2005, and while it has lost ground, it still attracts millions of users each month, according to Quantcast. Other Question-and-Answer (Q&A) sites have grown over the last few years, including Quora and Stack Overflow, the latter of which provides answers to programming questions and, as of 2016, reaches 32 million people monthly. We might consider review sites, from restaurant and travel (like Yelp and Tripadvisor) to professional services (Angie’s List and Healthgrades), as serving as a kind of curated search. Of course, these are just some of the explicit Q&A sites; implicitly, certain kinds of queries will be more effectively answered by others on Twitter or Facebook than they will be by a search engine.

At least for those of us who remember the time before search engines, they are defined by the query box on a web page. But search is both much bigger and much more complex than it once was. At one end, machine learning and analytics are drawing on almost unfathomable stores of unstructured data, finding patterns and connections within them that no human ever could. At the other end, individuals need to make sense of their “Personal Networked Spaces,” digital information that relates directly to them and to their lives, right now, right where they are (Michel, Julien, & Payton 2014). Naturally, these tasks have existed before, but the size, extent, and diversity of the content of the web make it the ultimate target for such efforts. As a result, those who would have studied other topics in artificial intelligence, information design, library science, and a host of other fields have set their sights instead on developing better search engines.

Before the search engine

Some consider the greatest modern threat to be too much information, a glut of data that obscures what is really valuable. In his book Data smog, David Shenk (1997, p. 43) argues that computers are the “most powerful engines driving the information glut” by constantly drawing more data to our attention. While it is undoubtedly the case that the internet allows for the rapid delivery of ever growing amounts of information, it is also true that new computing devices were often created in order to manage and control increasingly complex environments. What once could be handled by a human, or a collection of individuals, became too time-consuming to result in effective control. So, in 1823, when the British government recognized the need for an effective replacement for human “calculators” to come up with tide tables at their ports, they funded an effort by Charles Babbage to design the first mechanical computer (Campbell-Kelley & Aspray 1996). Likewise, when the United States government found that it would take more than ten years to tabulate the decennial national census in 1890, they turned to Herman Hollerith, who founded the company that later became IBM, to create an automatic tabulating system (Aul 1972). That pattern of turning to information technology when faced with an overwhelming amount of data has occurred over and over: in libraries, in large businesses, and, eventually, on the World Wide Web.

It is natural to think of information technology as digital computing, since so much of contemporary information processing is relegated to networked computers. Computers are only the most recent in a long line of technologies that were created to allow for better control of complex collections and flows of information. The obvious example is the library: once a collection of books and papers grows to a significant size, finding the appropriate piece of information in a timely manner becomes the subject of its own techniques, records, and machinery. Collections of documents can be traced back nearly as far as history itself has been recorded; were cave drawings the first libraries? As Kaser (1962) explains, many spiritual traditions conceive of the library as eternal, and the librarian as all-powerful. As early private collections grew larger, librarians emerged to organize and manage these collections. Because libraries were so important to many classical civilizations, the librarian was in a revered and politically powerful position which required special skills in collecting and manipulating information. In some ways, entrusting the organization of library resources to an individual – taking a large collection and making an individual or group of librarians the gateway to that knowledge – represented the first kind of search engine. And, as with later incorporations of that role, gaining control of the resource meant ceding some degree of power to the librarian.

Large libraries have always been a nexus of potential information overload, and so techniques and technologies evolved to help us filter and find information. Sorting and finding items within these collections required the creation and maintenance of information about the collection: metadata. The Babylonian library at Nippur had such records of the collection as early as the twentieth century BCE. The nature of the need was simple enough: the librarian needed to be able to discover which documents addressed a given topic, and then find where that document was physically located so that it could be retrieved for the person requesting information. Given that the subject of a work was often the issue most closely indexed to an informational need, the most popular indexes in the English-speaking world – the Dewey Decimal System and the Library of Congress System – provide a classification that is based on the subject matter of a book, so that books on similar topics are likely to be found in close proximity. Indeed, the roles of spatial organization and information structure have been closely tied through most of the history of humanity: information architecture was once simply architecture (Latimer 2011).

Unfortunately, the number of dimensions of indexes that can be represented within spatial organization is limited, and the focus soon shifted from spatial organization to other forms. The use of computing systems in libraries has formed an important basis for how search engines now work. There is a long history of ideas about how to organize knowledge in the library, but the rise of computing in a library setting brought mathematics and linguistics to bear in new ways, and some of the core techniques now used by search engines were first used by library indexes. The field of Information Retrieval (IR) now bridges the closed library index and the wider collection of documents on the web (Salton 1975), and draws from many areas of computing and information science to better understand the information available over computer networks.

Public and private libraries were not the only form of data collections. The industrial revolution led to new forms of social organization, particularly the rise of bureaucracy, which required a flood of new paper files. Records and copies of correspondence were generally kept on paper, and guides emerged for suggesting the best ways to organize these materials, including the best ways to stack papers on a desk. Paper stacking gave way to pigeonholes, and the business titans of the early twentieth century made use of a fabulously expensive piece of office furniture called the “Wooton desk,” which contained hundreds of pigeonholes and could be closed and locked, allowing for the secure storage of and access to personal work documents. The gradual development and innovation that led to vertical filing – a technology, perhaps unsurprisingly, developed by the inventor of the Dewey Decimal System – was a result of a data glut that began a century before anyone uttered the word “internet” (Yates 1982).

While subject-oriented classification made sense for the broad and relatively slowly changing materials of a library, it would have been useless when applied to the office of the last century. First, time was very much of the essence: when a document or file was created, changed, moved, or destroyed was often as important as the document’s subject matter. Likewise, such records were often closely related to the people involved. Clearly this was true of customer records, and large insurance companies – whose very survival rested on increasing the size of their customer base – often drove innovations in business filing, right through to adopting the earliest electronic computers.

The earliest computer systems drew on the ideas of librarians and filing clerks, but were also constrained by the technology itself. While these earlier approaches provided metaphors for digital storage, they failed to consider the hardware constraints posed by the new computing devices and placed limits on the new capabilities of these machines. Computer programmers made use of queues and stacks of data, created new forms of encoding data digitally, and new imaginary structures for holding that data. Not housed in drawers or on shelves, these collections could be rearranged and cross-indexed much more quickly than their physical counterparts. Over time, this evolved into its own art, and database design continues to be a rapidly advancing subfield of computer science. Ironically, as more and more books are digitized, or physical books are stored in closed stacks and their storage and retrieval are automated, the physical library is beginning to look more like a database.

As the internet began its exponential increase in size during the 1990s, driven by the emergence of the World Wide Web, it became apparent that there was more information than could easily be browsed. What began as the equivalent of a personal office, with a small private library and a couple of filing cabinets, grew to rival and exceed the size of the largest libraries in the world. The change was not immediate, and, in the early stages, individuals were able to create guides that listed collections at various institutions, generally consisting of freely available software and a handful of large documents. Especially with the advent of the web, the physical machine where the documents were stored began to matter less and less, and the number of people contributing documents grew quickly. No longer could a person browse the web as if it were a small bookshop, relatively confident that they had visited each and every shelf. Competing metaphors from librarians, organizational communicators, and computer programmers sought out ways of bringing order, but the search engine, in many ways, was a novel solution for this new information environment.

How a search engine works

Before outlining the development and commercialization of search over time, it is useful to understand how a basic search engine works. Our interaction with the classic search engine, as users, is fairly uncomplicated. A website presents a box in which we type a few words we presume are relevant, and the engine produces a list of pages that contain that combination of words. In practice, this interface with the person, while important, is only one of three parts of what makes up a search engine. The production of the database queried by the web form requires, first, that information about webpages be gathered from around the web, and, second, that this collection of data be processed in such a way that a page’s “relevance” to a particular set of keywords may be determined. By understanding the basic operation of each of these steps and the challenges they pose, an overall understanding of the basic technology may be reached. Figure 1.1 provides an overview of the process common to most search engines.

The process begins with a system that automatically calls up pages on the web and records them, usually called a crawler, but sometimes referred to as a “spider,” “web robot,” or “bot.” Imagine a person sitting at a computer browsing the web in a methodical way. She begins her process with a list of webpages she plans to visit. She types the URL for the first of these pages into the browser. Once it loads, she saves a copy of the page on her hard drive, noting the time and the date. She then looks through the page for any hyperlinks to other pages. If she finds hyperlinks that are not already on her list, she adds them to the bottom of the list. Following this pattern, she is likely to record a large part of the entire web. Once complete, she would begin again from the top of her list, as there are probably changes to these pages and newly created pages that have been published and linked to since she began.

Figure 1.1 Conceptual organization of the typical search engine

If the search engines really relied on individual humans to do this, it would take thousands of years to complete even a single crawl of the web. However, the operation described is not particularly complex, and creating a computer program that can duplicate this behavior is not difficult. Because the crawler is a relatively simple piece of technology, it has not evolved as much as other parts of the search engine. Even the smallest-scale crawlers are usually multi-threaded, making many requests at the same time rather than waiting for each page to be produced before moving on. They generally run not on a single computer, but on a large number of computers working in tandem. Most are careful to distribute their requests across the web, rather than ask for all of the pages from one server at once, since the crush of requests could easily overwhelm a single server, and most are “polite,” taking into account webpage authors’ requests for certain pages to be ignored. Nonetheless, these crawlers can sometimes make up a substantial number of the requests to a less-trafficked website. By one estimate, roughly half the traffic on the web is generated by these non-human visitors (Piejko 2016).

That does not mean that crawlers are all the same. There is an entire menagerie of crawlers out looking for new content on the web. On many pages, visits by web robots outnumber visits by real people. Some of these – going by exotic names like Slurp and Exabot – are gathering information for the largest general-purpose search engines, but others may be run just once by an individual. Small crawlers are built into a number of applications, including plug-ins for browsers and a bot used by Adobe Acrobat to create a PDF from a website. Because of small differences in how they are programmed, they behave slightly differently, following some links and not others, or coming back to re-check more or less frequently. Publishers of websites can exercise some level of control over Google’s bot, through tools provided online, and most crawlers will obey a set of rules presented in a special “robots.txt” file a publisher may place on the server. But beyond these limited restrictions, the bots attempt to capture information from as much of the web as possible, as frequently as possible.
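To make the description above concrete, the following is a minimal sketch, in Python, of a single-threaded crawler of the kind just described: it keeps a list of pages to visit, saves a copy of each page, extracts the hyperlinks it finds and adds new ones to the list, checks each site’s “robots.txt” rules, and pauses between requests out of politeness. The seed URL, user-agent name, page limit, and delay are illustrative assumptions only; production crawlers are distributed across many machines and are far more careful about errors, scheduling, and duplicate content.

```python
# A minimal, single-threaded web crawler sketch (illustrative only).
import time
import urllib.parse
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed(url, agent="example-crawler"):
    """Check the site's robots.txt before fetching (polite crawling)."""
    parts = urllib.parse.urlparse(url)
    robots = urllib.robotparser.RobotFileParser(
        f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        robots.read()
    except OSError:
        return True  # robots.txt unreachable; assume permitted for this sketch
    return robots.can_fetch(agent, url)

def crawl(seeds, max_pages=50, delay=1.0):
    """Breadth-first crawl: fetch a page, store a copy, queue its links."""
    queue, seen, archive = list(seeds), set(seeds), {}
    while queue and len(archive) < max_pages:
        url = queue.pop(0)
        if not allowed(url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        archive[url] = html                  # archival copy, as described above
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)       # add newly discovered pages to the list
        time.sleep(delay)                    # politeness delay between requests
    return archive

# Hypothetical usage: crawl(["https://example.com/"], max_pages=10)
```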

Following hyperlinks may not be enough. Large portions of the web are now generated dynamically, according to various requests from website visitors. Think, for example, of an online site that provides theatre tickets. The calendar, the pages describing available tickets, or even the seating maps may change depending on the show, the location of the person accessing the site, the current date, previous sales, and other variables. The modern webpage is probably not just generated dynamically by the server based on the content of a database, but built with HTML in combination with CSS and Javascript, and it often updates sections of the page on the fly (AJAX), creating a special challenge for the crawler (Mesbah, van Deursen, & Lenselink 2011; Google 2014).

Most crawlers make an archival copy of some or all of a webpage, and extract the links immediately to find more pages to crawl. Some crawlers, like the Heritrix spider employed by the Internet Archive, the “wget” program often distributed with Linux, and web robots built into browsers and other web clients, are pretty much done at this stage. However, most crawlers create an archive that is designed to be parsed and organized in one way or another. Some of this processing (like “scraping” links, or storing metadata) can occur within the crawler itself, but there is usually some form of processing of the text and code of a webpage afterward to try to obtain structural information about it.

The most basic form of processing, common to almost every modern search engine, is extraction of key terms to create a keyword index of the web by an “indexer.” We are all familiar with how the index of a book works: it takes information about which words and ideas appear on any given page and reverses it so that you may learn which pages contain any given word or idea. In retrospect, a full-text index of the web is one of the obvious choices for finding material online, but particularly in the early development of search engines it was not clear what parts should be indexed: the page titles, metadata, hyperlink text, or full text (Yuwono et al. 1995). If indexing the full text of a page, is it possible to determine which words are most important?

In practice, even deciding what constitutes a “word” (or a “term”) can be difficult. For most western languages, it is possible to look for words by finding letters between the spaces and punctuation. This becomes more difficult in languages like Chinese and Japanese, which have no clear markings between terms. In English, contractions and abbreviations cause problems. Some spaces mean more than others; someone looking for information about “York” probably has little use for pages that mention “New York,” for instance. A handful of words like “the” and “my” are often dismissed as “stop words” and not included in the index because they are so common. Further application of “natural language processing” (NLP) is capable of determining the parts of speech of terms, and synonyms can be identified to provide further clues for searching. At the most extreme end of indexing are efforts to allow a computer to in some way understand the genre or topic of a given page by “reading” the text to determine its meaning.1
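A toy version of the indexing step, under the simplifying assumptions discussed above, might look like the following Python sketch: terms are taken to be runs of letters and digits, lowercased, a small English stop-word list is discarded, and the result is an inverted index mapping each term to the pages on which it appears, along with a count. The stop-word list and tokenizer here are arbitrary toys; real indexers handle language, synonyms, and document structure far more carefully.

```python
# Building a toy inverted index from crawled pages (illustrative only).
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "my", "of", "and", "or", "to", "in", "is"}  # toy list

def tokenize(text):
    """Split on anything that is not a letter or digit; lowercase; drop stop words."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower())
            if t and t not in STOP_WORDS]

def build_index(archive):
    """archive maps URL -> page text; the index maps term -> {URL: term count}."""
    index = defaultdict(lambda: defaultdict(int))
    for url, text in archive.items():
        for term in tokenize(text):
            index[term][url] += 1
    return index

# Hypothetical usage, which also shows the "York" / "New York" problem noted above:
# index = build_index({"https://example.com/a": "New York pages mention York."})
# index["york"]  ->  {"https://example.com/a": 2}
```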

An index works well for a book. Even in a fairly lengthy work, it is not difficult to check each occurrence of a keyword or idea, but the same is not true of the web. Generally, an exhaustive examination of each of the pages containing a keyword is impossible, particularly when much of the material is not just unhelpful, but – as in the case of spam – intentionally misleading. This is why results must be ranked according to perceived relevance, and the process by which a particular search engine indexes its content and ranks the results is really a large part of what makes it unique. One of the ways Google leapt ahead of its competitors early on is that it developed an algorithm called “PageRank” that relied on hyperlinks to infer the authority of various pages containing a given keyword. Some of the problems of PageRank will be examined in a later chapter; here, it is enough to note that the process by which an index is established, and the attributes that are tracked, make up a large part of the “secret recipes” of the various search engines.
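Although PageRank is taken up again in a later chapter, its core intuition can be sketched in a few lines: a page’s authority is computed iteratively from the authority of the pages that link to it, with a “damping factor” (0.85 in the published description of the algorithm) modeling a reader who occasionally jumps to a random page. The sketch below is the textbook formulation, not Google’s production ranking, and the tiny link graph in the usage note is invented for illustration.

```python
# Minimal PageRank power iteration over a small link graph (illustrative only).
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Assumes every linked-to page also appears as a key in links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages          # dangling pages spread rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                if target in new_rank:           # ignore links pointing outside the graph
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical usage: page "b" gains authority because both other pages link to it.
# pagerank({"a": ["b"], "b": ["c"], "c": ["b"]})
```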

The crawling of the web and processing of that content happen behind the scenes, and result in a database of indexed material that may then be queried by an individual. The final piece of a search engine is its most visible part: the interface, or “front end,” that accepts a query, processes it, and presents the results. The presentation of an initial request can be, and often is, very simple: the search box found in the corner of a webpage, for example. The sparse home page for the Google search engine epitomizes this simplicity. However, providing people with an extensive set of tools to tailor their search, and to refine their search, can lead to interesting challenges, particularly for large search engines with an extremely diverse set of potential users.

In some ways, the ideal interface anticipates people’s behaviors, understanding what they expect and helping to reveal possibilities without overwhelming them. This can be done in a number of ways. Clearly the static design of the user interface is important, as is the process, or flow, of a search request. Westlaw, among other search engines, provides a thesaurus function to help users build more comprehensive searches. Over time, search engines have picked up certain interface elements, and kept them or left them behind based on response from those interacting with search. Type-ahead search queries, which pre-populate the search box with the top matching queries, were something experimented with by several search engines in the mid-2000s. Now they are a mainstay not just on the major search engines but on many other interfaces that draw on user input (Li et al. 2009). After declaring no interest in social signals for search (e.g., drawing on search results based on what friends produced or searched for), Google for a time provided indications of social results, including a feature they called “Search Plus Your World,” which indicated how your social network was affecting which sites appeared in the results pages. Although by all accounts Google continues to include social signals, neither relationship nor authorship is indicated in the results pages any longer. As more traffic shifts to mobile devices, it seems likely that interfaces that are easier to use without a keyboard, including those that are voice-related and that incorporate the locative context, will be the most visible to those who search online.

Once a set of results has been created, the results are usually ranked in some way so that the most significant hits – sites that contain the keywords – appear first. The most common way of displaying results is as a simple list, with some form of summary of each page. Often the keywords are presented in the context of the surrounding text. In some cases, there are options to limit or expand the search, to change the search terms, or to alter the search in some other way. On some search engines, results are clustered by topic.
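Tying the pieces together, a bare-bones front end might look like the sketch below: it splits the query into terms, looks each up in an inverted index of the kind sketched earlier, sums the term counts as a crude stand-in for the much richer relevance signals real engines use, and returns each matching page with a short “keyword in context” snippet. The function and parameter names are hypothetical, and the index and archive arguments are assumed to come from sketches like those above.

```python
# A toy query front end over an inverted index (illustrative only).
from collections import defaultdict

def search(query, index, archive, limit=10):
    """Return (url, score, snippet) tuples, best matches first."""
    terms = query.lower().split()
    scores = defaultdict(int)
    for term in terms:
        for url, count in index.get(term, {}).items():
            scores[url] += count                 # naive relevance: summed term counts
    ranked = sorted(scores.items(), key=lambda item: -item[1])[:limit]
    results = []
    for url, score in ranked:
        text = archive[url]
        position = max(text.lower().find(terms[0]), 0) if terms else 0
        snippet = text[max(0, position - 40): position + 40]   # first term in context
        results.append((url, score, snippet))
    return results

# Hypothetical usage: search("new york", index, archive)
```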

All three of these elements – the crawler, the indexer, and the front end – work together to keep a search engine’s index continuously updated. The largest search engines are constantly under development to better analyze and present searchable databases of the public web. Some of this work is aimed at making search more efficient and useful, but some is required just to keep pace with the growing amount of content available online. The technologies used on the web change frequently, and, when they do, search engines have to change with them. As people employ document formats other than HTML (PDF-formatted documents, for instance), visual formats, or complex interactive sites, search engines need to create tools to make sense of these formats. The sheer amount of material that must be indexed increases exponentially each year, requiring substantial investments in computing hardware and bandwidth. As of 2011, Google data centers used as much electrical power as would normally provide for 200,000 homes (Glanz 2011). Someone visiting a skyscraper can quickly appreciate the work that went into building it, but few are aware of the work that must be continually done to make a search engine function.

Pre-web internet search

Once one has used a search engine, it seems obvious that it should exist, but the need for a general search engine during the early days of the web was neither immediate nor apparent. It usually is not until a collection of data grows too large to map in its entirety that the need for a search interface is made clear. Consider the average home library, which may fill only a bookcase or two. The books may be placed randomly, or by size, or by which are used more often or more appreciated, or by some idiosyncratic subject arrangement. At some point, however, a library grows to the point at which looking through everything to find the book you want is impractical, and at that point some form of indexing is necessary. Likewise, networked information started out as relatively small collections in relatively few repositories, and it was not until later that the need for different forms of indexing was made clear and tools were created to meet this need.

Early technologies used for finding files or users were often built into the operating system and, once computers were networked, it was often possible to use the same functions from a distance. Since long before the web existed,2 the Unix command “finger,” for example, has provided information about a particular user, including when that user last logged on, and often some personal contact information. Its creator, Les Earnest, designed “finger” to aid in social networking at the Stanford Artificial Intelligence Lab (quoted in Shah 2000):

People generally worked long hours there, often with unpredictable schedules. When you wanted to meet with some group, it was important to know who was there and when the others would likely reappear. It also was important to be able to locate potential volleyball players when you wanted to play, Chinese food freaks when you wanted to eat, and antisocial computer users when it appeared that something strange was happening on the system.

When computers were networked via the internet, it was possible to “finger” individuals from across the country or the world, to find out more about them. Eventually, it was used for other purposes, including distributing weather reports.

The first indexes on the internet were created by hand, often by the users of the systems as a guide to others. Consider some of the protocols in use on the internet before the emergence of the World Wide Web, beginning with “File Transfer Protocol” (FTP), one of the first ways of moving files between computers. An early internet user would choose an FTP server from a list of public servers (a list they or someone else probably had downloaded from one of the servers on that list), and request a listing of files on that server. Often, there was a text document that could be downloaded that briefly summarized the content of each of the files on a given server. FTP continues to be used today as a way of transferring files, but the process of browsing through FTP servers in the hope of finding the document you were seeking was laborious and inconsistent, especially as the number of FTP servers increased. This increase also brought with it the rise of “anonymous” FTP servers, which allowed anyone to upload and download files to and from the server. While the increase in content was a boon to those who used the internet, it became increasingly difficult to locate specific files. As a result, what might be considered the first search engine on the internet arrived in 1990, before the World Wide Web had gained a foothold, and at a time when many universities had only recently become a part of the network (P. Deutsch 2000). This system, called “Archie,” periodically visited the existing FTP sites and indexed their directories. It is probably a stretch to say that it “crawled” these sites, since, unlike today’s web crawlers, it did not discover new servers linked to the existing servers. It also did not examine the full content of each of these pages, but limited itself to the titles of the files. Nonetheless, it represented a first effort to rein in a quickly growing, chaotic information resource, not by imposing order on it from above, but by mapping and indexing the disorder to make it more usable.