Tommaso Venturini and Richard Rogers offer a critical and conceptual introduction to digital methods.
In a direct and accessible way, the authors provide hands-on advice to equip readers with the knowledge they need to understand which digital methods are best suited to their research goals and how to use them. Cutting through theoretical and technical complications, they focus on the different practices associated with digital methods to skillfully provide a quick-start guide to the art of querying, prompting, API calling, scraping, mining, wrangling, visualizing, crawling, plotting networks, and scripting. While embracing the capacity of digital methods to rekindle sociological imagination, this book also delves into their limits and biases and reveals the hard labor of digital fieldwork. The book also touches upon the epistemic and political consequences of these methods, but with the purpose of providing practical advice for their usage.
Digital Methods is a must-read for students and scholars of digital social research, media studies, critical data studies, digital humanities, computational social sciences, and for those who are interested in digital methods but do not know where to start.
Page count: 396
Publication year: 2025
Cover
Table of Contents
Title Page
Copyright
Acknowledgements
1. Introduction
Why this book?
Repurposing data and methods
Following the medium
Foregrounding bias
Diversifying social research
Two important remarks about terminology
Overview of chapters
Pedagogical features
2. Sourcing Digital Inquiries
Online platforms and their records
Accounting for platform effects
The importance of a varied data diet
Discussion questions
Further reading
3. Querying and Triangulating
Understanding the purpose of digital search
Exploiting the anatomy of search and recommendation engines
Comparing and triangulating
Discussion questions
Further reading
4. Prompting AI
What exactly are large generative models?
What large generative models can do
From prompt engineering to prompt tuning
Discussion questions
Further reading
5. Calling Platform APIs
The rise of APIs
How to work with APIs
API-based tools and their risks
Discussion questions
Further reading
6. Scraping Documents and Interfaces
Scraping or the art of taking websites apart
Scraping tricks
The limits of scraping
Discussion questions
Further reading
7. Wrangling Digital Records
Data preparation is data analysis
Six wrangling operations
Tools for data wrangling
Discussion questions
Further reading
8. Mining Texts
What to count
How to count
Where to count
Discussion questions
Further reading
9. Exploring through Visualizations
Exploratory data analysis
The bases of visual exploration
The right chart for the right question
Pattern detection and interactive visualizations
Discussion questions
Further reading
10. Crawling Connections
Thinking relationally
Disentangling hierarchy and community in networks
Riding a web crawler
Discussion questions
Further reading
11. Analyzing Relational Landscapes
What’s in a network?
Paths
Centrality
Community structure
Visual network analysis
Discussion questions
Further reading
12. Tinkering with Scripts and Libraries
Foundations of coding
Programming tricks for non-programmers
Discussion questions
Further reading
13. Conclusion
The “digital” in “digital methods”
The combined study of social phenomena and platform effects
The separate study of platform effects
Glossary
References
Index
End User License Agreement
Chapter 3
Figure 3.1
Searching as fishing, choosing where to cast one’s line among different data ponds…
Chapter 4
Figure 4.1.
The images generated by DALL·E 3 for the prompt “écolier” (left) and “écolière” …
Figure 4.2.
The images generated by DALL·E 3 for the prompt “schoolboy” (left) and “schoolgi…
Chapter 5
Figure 5.1.
URL structure for retrieving webpage (above) and API information (below)
Chapter 6
Figure 6.1.
How the same content is transformed from HTML and CSS code to a webpage (throu…
Chapter 8
Figure 8.1.
The evolution of the presence of temporal adverbs (left), acronyms and names of …
Chapter 9
Figure 9.1.
Six key functions of exploratory visualization and some of the charts that can b…
Chapter 10
Figure 10.1.
An example of a hyperlink network generated by hierarchical and communitarian co…
Figure 10.2.
The functioning of a crawler starting from the seed page highlighted in black insi…
Chapter 11
Figure 11.1.
An example of a website network and the two tables describing it
Figure 11.2.
Königsberg’s bridges network (original figures from Euler, 1736)
Figure 11.3.
External connection of different types of sites (original figures from Rogers &…
Figure 11.4.
Friendship network in a 4th grade class arranged according to the layout of orig…
Figure 11.5.
Geoengineering network spatialized in four different ways (panels A–D) and a cli… (panel E)
Figure 11.6.
Geoengineering network: nodes and labels equally sized (left) and sized accordin…
Figure 11.7.
The geoengineering network: nodes colored according to their category (left); sam…
Chapter 12
Figure 12.1.
A Python script exemplifying several programming concepts
Figure 12.2.
Pseudocode version of the script introduced in figure 12.1
Chapter 13
Figure 13.1.
Alt-lite, most engaged with content on Facebook, 2017.
Chapter 3
Table 3.1.
Logical operators used to connect different terms in a query
Chapter 7
Table 7.1.
The same information encoded according to four common data formats
Table 7.2.
Different types of references that can be used in a spreadsheet
Table 7.3.
List of common spreadsheet functions and the operation that they allow
Table 7.4.
Table describing a series of comics characters and their features
Table 7.5.
Pivot table summarizing the information about the same characters
Table 7.6.
List of most common regex and their function
Chapter 8
Table 8.1.
The elements of text mining
Table 8.2.
Complexity continuum of the items that can be analyzed through text mining
Table 8.3.
One- to six-grams that can be extracted from the same phrase
Chapter 9
Table 9.1.
Types of visual variables and their uses
Tommaso Venturini and Richard Rogers
polity
Copyright © Tommaso Venturini and Richard Rogers 2025
The right of Tommaso Venturini and Richard Rogers to be identified as Authors of this Work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.
First published in 2025 by Polity Press
Polity Press, 65 Bridge Street, Cambridge CB2 1UR, UK
Polity Press, 111 River Street, Hoboken, NJ 07030, USA
All rights reserved. Except for the quotation of short passages for the purpose of criticism and review, no part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher.
ISBN-13: 978-1-5095-6260-2
A catalogue record for this book is available from the British Library.
Library of Congress Control Number: 2024951692
The publisher has used its best endeavours to ensure that the URLs for external websites referred to in this book are correct and active at the time of going to press. However, the publisher has no responsibility for the websites and can make no guarantee that a site will remain live or that the content is or will remain appropriate.
Every effort has been made to trace all copyright holders, but if any have been overlooked the publisher will be pleased to include any necessary credits in any subsequent reprint or edition.
For further information on Polity, visit our website: politybooks.com
To the authors of this book, the term “Digital Methods” describes a methodological challenge and an intellectual fixation. More importantly, however, it also describes a community of people with whom we have shared this challenge and fixation.
This book and the ideas that it discusses would simply not exist without a small but growing group of enthusiastic colleagues who, over the years, have pushed the boundaries of sociological imagination and new media methods.
In line with this collective ethos, rather than individual names, we will mention the groups that, around Europe and elsewhere, have traveled closely by our side in the exciting and sometimes perilous journey of digital methods.
They are the Digital Methods Initiative, the Public Data Lab, the Open Intelligence Lab, the médialab at SciencesPo Paris, the Medialab at the University of Geneva, the CNRS Center for Internet and Society, the Warwick Centre for Interdisciplinary Methodologies, the King’s College Digital Humanities Department, the TANTlab, Density Design, the Visual Methodologies Collective, the Media Studies program at Siegen, the Department of Social and Political Sciences at the University of Milan, the ANU VOSON Lab, and the many others who have attended summer and winter schools, workshops, and events, participated in small and large research projects with us, and who have provided invaluable feedback and support.
If this book is any good, it is thanks to the hard work carried out by all these groups.
This book introduces the skills that underpin the use of digital methods in social research. As an introduction, it serves researchers and students inside and outside academia who are interested in digital methods but do not know where to start. As a short introduction, it provides a brief guide for a quick start. Focusing on the different practices associated with digital methods, it touches upon the epistemic and political consequences of these methods, but with the purpose of providing hands-on advice for their use.
This book falls squarely between two previous digital methods books that situated this approach historically and conceptually (Rogers, 2013) and described recipes and tools for studying web archives, search engines, and social media platforms (Rogers, 2019). This book focuses on the principles that sit behind the practice of digital methods and that are often taken for granted in previous works, while avoiding discussion of specific software tools, application programming interfaces (APIs) or data dashboards, which are ever evolving and often discontinued or revamped. As we write, a crucial data source for Facebook and Instagram is being shut down (CrowdTangle) and replaced by another (Meta Content Library), as had happened five years prior – in both cases with significant consequences for researchers (Grevy Gotfredsen & Dowling, 2024).
Digital methods are introduced in the form of a digest, without the footnotes of software settings or the cataloging of digital studies to date. Rather, we present short takes on collecting and wrangling data by querying, prompting artificial intelligence (AI), text-mining, scraping, crawling, and calling APIs. Alongside those, we discuss preferred visualization approaches as well as a rationale for placing these approaches in the middle of the analysis rather than as a culmination.
As will become clear across the chapters of this book, far from being a single or simple research protocol, digital methods are a complex toolbox composed of a multitude of techniques for collecting, preparing, analyzing, and visualizing digital records. This variety is unified, however, by four core methodological principles: repurposing data and methods, following the medium, foregrounding biases, and diversifying social research.
From the outset, digital methods defined themselves in opposition to the notion of “virtual” both methodologically and topically. Methodologically, they distinguished themselves from the project of “virtual methods,” the social science effort to “digitize” and migrate existing methods to the online environment, for example, turning ethnographies into virtual ethnographies or surveys into online surveys (Rogers, 2009). Digital methods were instead interested in the so-called methods of the medium, i.e., natively digital methods that have their first or only implementation within digital media. In a sense, digital methods are shorthand for digital media methods, or for “digital media, digitally analyzed” (Rose, 2022).
Likewise, in terms of research topics, digital methods saw new media not only as a site for the study of online cultures, but also as a medium to inquire into broader cultural and societal dynamics. They examine the online world not as a virtual realm in and unto itself, or as a cyberspace of imaginaries, but rather as a panoply of records about social life.
This effort to focus on digitally native methods and to use them to study societal phenomena is best captured by the idea of repurposing. Repurposing means starting from data and methods developed for non-academic purposes (e.g., user engagement metrics developed for marketing) and using them as tools for social research. In practice, it means that when using web data for research, one should first undertake some “technical fieldwork” (Rieder et al., 2015) to identify the kinds of data available online and consider how they could be used for social research. Repurposing means that digital methods are a form of “second-hand research” that sits atop online services and mainstream and alt-tech platforms, which have not been designed for scientific data collection.
Repurposing Wikipedia, for example, can mean extracting the table of contents of the “same entry” in different languages and placing them side by side to study cultural specificity or competing interpretations of events, or snowballing the hyperlinks in the “see also” section of different pages to create a conceptual network demonstrating close and distant content relationships. In both instances, the repurposed or second-hand research outputs become material for cultural and social analysis that is broader than the platform under study.
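To give a concrete flavor of this kind of repurposing, here is a minimal sketch of our own (not a tool of the book) that uses the public MediaWiki API to retrieve the top-level section headings of the “same entry” in two language editions, so that they can be laid side by side; the entry titles and languages are merely placeholders.

# A minimal, hypothetical sketch of Wikipedia repurposing: fetching the section
# headings ("table of contents") of the same entry in two language editions so
# they can be compared side by side. Entry titles and languages are placeholders.
import requests

def get_sections(lang: str, title: str) -> list:
    """Return the top-level section headings of a Wikipedia entry."""
    url = f"https://{lang}.wikipedia.org/w/api.php"
    params = {"action": "parse", "page": title, "prop": "sections", "format": "json"}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    sections = response.json()["parse"]["sections"]
    return [s["line"] for s in sections if s["toclevel"] == 1]

if __name__ == "__main__":
    # Compare the structure of the "same" entry in English and French.
    for lang, title in [("en", "Climate change"), ("fr", "Réchauffement climatique")]:
        print(lang, get_sections(lang, title))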
This approach is in line with an early definition of new media as “putting things on top of other things” (Rogers, 2013). The repurposing of digital methods has its origins in the remix culture of new media, and especially in the art of mashing up songs shared on Napster and elsewhere during the years of the copyleft debates (Shiga, 2007). Mashup culture extended to the mixing of data streams or data overlays to produce “composite applications” (Liu & Palen, 2010; Daniel & Matera, 2014). Possibly the most well-known was the Katrina Information Map, a Google Map that was layered or annotated with pinpricks of disaster relief centers (Crowe, 2005). To make such a service or application, one examines which sources (or data) are available to mix and inventively produce new outputs of interest for research and societal relevance.
The focus on repurposing distinguishes digital methods from digital humanities, which tend to focus on digitized materials from historical or documentary archives. Whereas digital methods are oriented towards online and digitally native data, digital humanities have long been in the business of converting historical records into machine-readable material. As one scholar puts it: “Though not ‘everything’ has been digitized, we have reached a tipping point, an event horizon where enough text and literature have been encoded to both allow and indeed force us to ask an entirely new set of questions” (Jockers, 2013: 4). Drawing on a more established tradition of archiving and preservation, digital humanities are now engaged in developing the techniques of distant reading necessary to harness the “great unread” (Moretti, 2000) they have amassed.
The second principle, following the medium, means that research with “web data” should focus on the “web” as much as on the “data.” Websites, social media platforms and apps are neither neutral recorders nor mere conduits through which opinions and actions flow unfettered. They are complex bundles of technological effects and vernacular cultures, which privilege certain sources and practices while depressing others. Content is recommended and moderated, both manually and algorithmically. Online publishing and social media posting have styles and rhythms for the user to fit in and stand out (Carmi, 2020). In digital methods these media effects and media biases are never just noise to be cleaned out – not only because such cleaning is impossible, but also because there is value in keeping the web, the search engine, or the social media platform as a part of social research accounts.
There are at least two ways to retain the medium in an account of a social or cultural analysis. One is to situate the data collection in the history of the platform in terms of its algorithmic and vernacular culture. Twitter/X of 2025 is a different platform to Twitter/X of 2020. The feed has changed from posts by followed accounts in reverse chronological order to a personalized list based on a “for you” algorithm first popularized by TikTok. Metrics have been renamed, with retweets becoming reposts. The culture of the platform has also changed, with certain subcultures (academic Twitter/X, for example) disengaging from the platform and others (right-wing trolls, for example) becoming more prominent.
Another way of retaining the medium is in the story of data collection. There have been, for example, multiple styles of Twitter/X data collection, depending on where the data were retrieved: the search and streaming APIs; the academic API; scraping; and the API compliant with the European Digital Services Act. Given their research affordances, each of these methods of data collection could produce profoundly different data sets.
Following the medium means considering its research affordances: i.e., the digital objects and metrics provided by a platform or a web service and available for repurposing. For example, Facebook’s native digital objects are posts, pages, and interaction metrics. One may make a collection of posts from a series of pages concerning a social group (e.g., the Somali diaspora) or a movement (e.g., the Alt Right) and amongst them determine which posts have received many (and few) interactions, thereby demonstrating what animates the group or movement “on Facebook.” To include the platform in the account means considering, for example, the effect of content moderation. Studying an Alt Right movement would, for example, yield less material today than a decade ago. As a case in point, one scholar found that a significant percentage of Twitter/X accounts tweeting about Brexit had disappeared from the platform only a few years after the UK referendum (Bastos, 2021). Content moderation, together with the lack of reliable archives, leaves researchers with a piecemeal account of events and often forces them to rely on screenshots and other materials found outside of the original websites or platforms (Brügger, 2015).
Because of their interest in using digital data and tools for the study of society, digital methods partake in the “computational turn” in the social sciences and the hope that online records may allow a subtler and wider tracing of “the spread of arguments, rumors, or positions about political and other issues” (Lazer et al., 2009). Such an outlook shares methodological underpinnings with “natural experiments” and “unobtrusive measures,” where one redeploys “found data” unaffected by measuring instruments or researcher effects (Webb et al., 1966). Despite this family resemblance, however, digital methods do not fully share the great expectations of computational social sciences. Digital methods appreciate the research opportunities offered by the internet, but have also developed a symmetrical interest in the bias of digital media and in the way they can negatively affect social research (and social life).
Compared to data obtained with more established sociological techniques, web and social media data are not good data in the sense described by one scholar as “collected as cleanly as possible and as early as possible in its life cycle [as well as] captured regularly, and preferably over long periods of time” (Borgman, 2009).
The web is ephemeral and is unevenly archived, for its safekeeping relies on the insufficient resources of the Internet Archive’s global initiative or on the partial efforts of national libraries to preserve “national webs.” On the web, data degrade swiftly because of phenomena such as “link rot” and information removal through content moderation. Records are often missing and, even when systematically collected, their temporal coherence is affected by changes in algorithms and interfaces. This is the problem that sank the flagship big data project, Google Flu Trends, for example. After several years of monitoring the spread of influenza through geolocated queries, the service went haywire as searches became pre-filled by autocompletion. This change in Google’s interface mingled the concerns of users with the priorities of the engine and eventually led Google Flu Trends to overreport (Lazer et al., 2014).
Finally, social media platforms are notoriously difficult to archive or are even “unarchivable by design” (Ben-David, 2020), and the availability of their data for academic and other investigative research waxes and wanes with the times. At the time of writing, it appears we have reached another nadir as the social media APIs, when available, are stingy in types of data (or endpoints), amounts available, and access regimes (Rieder et al., 2022), and no longer provide the kind of vast, diverse, and fast data flows that used to be the pride of “big data” (Diebold, 2012).
In recent years, concerns about the shortcomings and biases of online platforms have overflowed from academic research to society at large, as social media face a crisis in confidence about the quality of their contents and the overperforming of posts of dubious quality. While this mistrust has a longer history, it reached a pinnacle with the fake news debacle of Facebook and other platforms around the 2016 US presidential election. It returned in 2020 when a whistleblower revealed that Facebook privileges posts receiving angry reactions and again when the new owner of Twitter/X announced a drastic reduction in moderation efforts. Each of these scandals, in its own way, has driven an upsurge in interest in platform studies, especially around content moderation and algorithmic amplification. This in turn has led to both a greater scrutiny of the platforms as well as a greater reluctance toward facilitating such scrutiny.
These scandals also created incentives for digital methods to re-center around the study of digital media themselves. While for a few years the popularity of computational approaches pushed researchers to consider digital media as a window on social phenomena, the blatant biases of online platforms and their algorithms brought a renewed focus on their distortions. With that, digital methods practitioners are increasingly turning to platform critique, using querying, scraping, API calling, dashboard data, and engagement metrics to examine content privileging after or despite moderation. A comparative or cross-platform study employing those techniques found, for example, that social media tend to “mainstream the fringe,” although in ways that differ by platform (Rogers, 2023b). On Twitter/X, it was the hyperpartisan source and on TikTok the parodying youth.
Despite the aforementioned shortcomings, digital methods also have great merits and, among them, the most important is probably the fact of being radically different from classic qualitative and quantitative techniques (Latour et al., 2012). Some scholars saw this diversification as a threat to the role of social sciences and argued that “Fifty years ago, academic social scientists might be seen as occupying the apex of the – generally limited – social science research ‘apparatus’. Now they occupy an increasingly marginal position in the huge research infrastructure [of] knowing capitalism” (Savage & Burrows, 2007). Digital methods are, in many ways, a reaction to this threat through an effort to hijack the infrastructures of knowing capitalism to renew and revitalize academic research.
Digital records are not necessarily more abundant than those generated by surveys and censuses, and their quality is generally worse than that of direct observations and in-depth interviews. Yet they make it possible to investigate a larger and more diverse range of collective phenomena (Venturini, 2025). Digital methods are appealing because, despite their limitations, they push social inquiry beyond the aggregated trends monitored by statistics and beyond the situated interactions examined by ethnography, opening up the possibility of observing dynamics that span space and time and yet need to be considered individually. Think, for example, of how the records of digital media allow one to follow how ideas ripple through platforms over months and years and across countries and continents, while retaining the capacity to pinpoint the specific accounts that contributed most to their propagation and the single moments of their evolution (Venturini, 2024).
In their capacity to open new research avenues and to rekindle our methodological imagination, digital methods occupy a place in the social sciences somewhat close to the role played by data journalism and open-source intelligence in the renewal of investigative reporting (Anderson, 2018). Despite their more practical and immediate objectives, these journalistic practices resemble digital methods not only in the tools and data that they employ, but also in their goal of pushing beyond the limits of traditional news-making through forms of investigation capable of examining large populations over long periods of time, without sacrificing the richness and subtlety of the analysis (Bounegru & Gray, 2021). Like digital methods, data journalism makes use of web and social media data, institutional and corporate databases, and other natively digital sources to tell stories about cultural and societal trends that would otherwise remain unexplored.
As you might have noticed, in this book we do not capitalize either the web or the internet, despite the web being shorthand for World Wide Web and internet being a proper noun still capitalized in many dictionaries (and the Modern Language Association). Since the mid-2010s, media scholarship has gradually shifted from “capital Internet” to “small internet” (and from Web to web), and for good reasons. Those against capitalization argue that styling digital media through a proper noun exaggerates their uniqueness, their agency, and their unity. The capital letter, they contend, unwarrantedly sets the internet and the web apart from all other small letter media (the alphabet, the press, the radio, the television, etc.), as if this medium were more special than the others. They also argue that using a proper name wrongly suggests the idea that the internet or the web might act by themselves, that they might have some inherent logic or trajectory distinct from what their users make of them. Finally, they claim, online media have grown so plural and heterogeneous that they should not be referred to as a particular, single entity.
While we agree with the line of reasoning, the opposite argument has its merit. As we said above, a key lesson of our approach is to “follow the medium” and repurpose its methods rather than imposing the methods of the social sciences onto it. This implies that the web and the internet are, at least from a methodological viewpoint, rather unique, unified, and endowed with agency. They have their own ways, which are not only different from those used to investigate previous media, but distinctive enough for researchers to borrow them to study other things than the media themselves – thus making the web and the internet not only an object of study, but also a tool for research. These methods are certainly plural (hence the many chapters of this book), but they are also unified by a common logic that facilitates their combined use (hence the many cross-references between the chapters).
Finally, as we will see over and again in this book and in the practice of digital research, the methods of the web and the internet have a distinctive tendency to resist their users. They are “stubborn.” More than other sociological methods, they do not easily comply with the goal of the researcher. The data that they yield are always messy or slightly offset and their analyses and visualizations are often surprising (and sometimes disappointing). To some extent, this is due to their newness and lack of consolidation, but this resistance also comes from the fact that these methods (being second-hand) follow a trajectory that is not that of research, but that of the technical infrastructures, public institutions, and private enterprises that have originally developed them.
This last argument brings us to a second terminological question. While in this book we stick to the slang of digital media, we often use the notion of “records” rather than the more digitally native idea of “data” (though we use that, too). This distinction derives from the “second-hand” nature of digital methods. While sociological techniques tend to produce their own information, digital methods repurpose information that is collected for reasons other than academic research. We use the term records for this trove of information in its original (not yet repurposed) format, and we argue that digital records become research data through the patient work of cleaning, filtering, and refining. The idea of records also has the advantage of avoiding the implicit naturalization conveyed by the notions of “data” (from the Latin “datum,” “given”) or “traces.” Far from being given or left for researchers to use, the information exploited by digital methods is actively recorded, archived, and (partially) redistributed by some interested party (in most cases an online platform and the corporation that owns it), often with deliberately non-academic purposes: marketing and advertising in most cases, but also surveillance and technical optimization.
The practice of digital methods, like the practice of all research methods, is messy and iterative. While it would be convenient to provide students and scholars with a linear protocol of well-defined operations to be carried out one after the other, the reality of research is infinitely more complex. Still, for the sake of legibility, we did our best to separate the skills underpinning digital methods into different buckets and to arrange these buckets in an order that facilitates their consideration.
The next chapter (chapter 2) starts with a general discussion of the nature of the records most usually exploited in digital methods projects and considers the sources of these records, including the appropriateness of focusing on mainstream platforms as well as on more alternative platforms and the fringes of the web and social media.
In chapter 3, we consider the simplest and most common tool to access online records: the querying of a search engine. We show that despite its apparent simplicity, this practice actually involves multiple crucial research decisions. We discuss, in particular, how to choose between a generic and more specialized search engine, and how to formulate elaborate queries with operators and regular expressions. We also consider search engine personalization and the possibility to embrace it by creating role-playing research personas or, instead, to avoid it with a ring-fenced research browser.
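To make the idea of elaborate queries a little more tangible, here is a small sketch of our own that composes Boolean queries from lists of synonyms and site restrictions, so that the same information need can be submitted to different engines and the results compared; the keywords and sites are placeholders.

# A hypothetical sketch: composing Boolean queries from synonym lists and site
# restrictions. Synonyms within a group are OR-ed; groups are AND-ed by
# juxtaposition. All keywords and sites are placeholders.
from itertools import product

topics = ['"climate change"', '"global warming"']
frames = ["hoax", "emergency"]
sites = ["site:reddit.com", "site:youtube.com"]

topic_clause = "(" + " OR ".join(topics) + ")"
for frame, site in product(frames, sites):
    # Each printed line is a query ready to be pasted into a search engine.
    print(f"{topic_clause} {frame} {site}")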
While search engines have for a long time been the main entry to the richness of the web and the internet, their primacy is now contested by a new generation of artificial intelligence chatbots which, instead of indexing and searching online information, digest and summarize it, with distinctive advantages and disadvantages. Chapter 4 discusses the rise of this new generation of connectionist AI over the previous rules-based symbolic AI, and uses this discussion as a basis to provide advice on when to use the new large generative models and when to be wary of their promises. Strategies for calibrating and prompting AI follow, focusing on how to “tune” a model and generate dialogue loops that avoid drifting.
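As a foretaste of chapter 4, the following sketch (ours, assuming the openai Python client with an API key set in the environment; the model name and prompts are placeholders) shows the simplest form of a prompting loop, in which a fixed instruction is held constant while the material submitted to the model varies.

# A minimal, hypothetical prompting loop (assuming the openai>=1.0 client and an
# API key in the environment). System prompt and model name are placeholders;
# the point is only to show a fixed instruction applied to varying material.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SYSTEM = ("You are a coding assistant for social research. "
          "Answer with one word: 'supportive', 'critical', or 'neutral'.")

def classify(post: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": post},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

for post in ["Geoengineering will save us.", "Geoengineering is a dangerous distraction."]:
    print(classify(post))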
While easy to access and use, search engines and AI chatbots are generally not the preferred gateway for the collection of large sets of digital records. Since the rise of online platforms, this role has instead been filled by APIs. Chapter 5 reveals their function and examines how to call them, build on top of them and rely on them – occasionally at one’s peril, given the history of sudden shutdowns of the Facebook, Instagram, Twitter/X, and Reddit APIs.
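As an appetizer for chapter 5, here is a minimal sketch of our own calling a public API – the Internet Archive’s Wayback Machine “availability” endpoint – which returns structured JSON rather than a webpage meant for human eyes; the queried URLs are placeholders.

# A minimal, hypothetical API call (using the Internet Archive's public Wayback
# "availability" endpoint): instead of a webpage, the service returns structured
# JSON meant for machines. The queried URLs are placeholders.
import requests

def closest_snapshot(url: str) -> dict:
    """Ask the Wayback Machine for the archived snapshot closest to today."""
    endpoint = "https://archive.org/wayback/available"
    response = requests.get(endpoint, params={"url": url}, timeout=30)
    response.raise_for_status()
    return response.json().get("archived_snapshots", {}).get("closest", {})

for url in ["example.com", "bbc.co.uk/news"]:
    snap = closest_snapshot(url)
    print(url, "->", snap.get("url"), snap.get("timestamp"))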
The following chapter (chapter 6) considers the increasingly common situation in which the lack of an API or of official data access forces researchers to scrape data directly from webpages. We describe the complex task of turning a bundle of semi-structured information accessible through digital interfaces into a structured table of records that can be submitted to analysis or visualization. We provide tricks on how to extract research data from websites and platforms and discuss sensitive questions related to data ethics and politics.
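By way of preview, the sketch below (a hypothetical example of ours, not code from the book) uses the requests and BeautifulSoup libraries to turn a repeated element of a webpage into rows of a structured table; the URL and the CSS selector are placeholders that would have to be adapted to the site at hand.

# A hypothetical scraping sketch: downloading a page and turning a repeated
# HTML element into rows of a structured table. The URL and the CSS selector
# are placeholders to be adapted (and robots.txt / terms of service checked).
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/news"          # placeholder
SELECTOR = "article h2 a"                 # placeholder: one match per headline

html = requests.get(URL, timeout=30, headers={"User-Agent": "research-scraper"}).text
soup = BeautifulSoup(html, "html.parser")

rows = [{"title": a.get_text(strip=True), "link": a.get("href")}
        for a in soup.select(SELECTOR)]

with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)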
The next chapter (chapter 7) is dedicated to data wrangling – the hard but necessary work of cleaning and grooming required to turn records into data and prepare them for analysis. It explores the subtle and context-dependent distinction between information and noise and then proceeds to discuss different techniques for cleaning and structuring data records to create workable datasets that can be used for further analysis and visualization.
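As a minimal illustration of such wrangling (with made-up records and column names), the following sketch uses the pandas library to parse dates, filter out noise, and pivot raw records into an analyzable table.

# A hypothetical wrangling sketch with pandas: parsing dates, dropping noise,
# and pivoting raw records into an analyzable table. The records are made up.
import pandas as pd

records = pd.DataFrame({
    "account":   ["a", "b", "a", "c", "b"],
    "posted":    ["2024-01-03", "2024-01-03", "2024-02-10", "not a date", "2024-02-11"],
    "reactions": [12, 0, 7, 3, 41],
})

records["posted"] = pd.to_datetime(records["posted"], errors="coerce")
cleaned = records.dropna(subset=["posted"]).copy()   # drop rows with unparsable dates
cleaned = cleaned[cleaned["reactions"] > 0]          # treat zero-engagement rows as noise
cleaned["month"] = cleaned["posted"].dt.to_period("M")

# Pivot: one row per account, one column per month, summed reactions
monthly = cleaned.pivot_table(index="account", columns="month",
                              values="reactions", aggfunc="sum", fill_value=0)
print(monthly)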
While the term “digital data” may call to mind tables of numbers and mathematical treatments, the records with which digital methods are most often confronted are texts. The chapter on mining texts (chapter 8) combines insights from overlapping traditions called “text analysis” (in information science), “natural language processing” (in computer science), and “distant reading” (in the digital humanities). It introduces how texts may be divided into countable units through tokenization and lemmatization, or grouped as phrases and counted or associated with other terms through working with n-grams, noun phrases, and named entities.
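The sketch below (ours, on a made-up two-sentence corpus) illustrates the most basic of these operations, tokenizing the texts and counting unigrams and bigrams with the scikit-learn library.

# A hypothetical text-mining sketch: tokenizing a tiny corpus and counting
# unigrams and bigrams with scikit-learn. The example sentences are made up.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Climate change is a hoax, they said again and again.",
    "Climate change is an emergency, scientists warned.",
]

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(corpus)

totals = counts.sum(axis=0).A1          # total frequency of each n-gram
ranked = sorted(zip(vectorizer.get_feature_names_out(), totals),
                key=lambda pair: -pair[1])
for term, freq in ranked[:10]:
    print(f"{freq:>3}  {term}")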
Having discussed different ways to collect, prepare, and mine data, in chapter 9 we move on to consider a series of techniques to visualize them, drawing on the work of three key thinkers: John W. Tukey, Jacques Bertin, and Edward Tufte. We take up Tukey’s point that visualization serves exploratory analysis rather than confirmatory display, positioning it as a means to generate questions rather than answers. Bertin’s Sémiologie Graphique [Semiology of Graphics] (1967) helps us discuss the use of the different “retinal variables” (color, size, shape, etc.) for visual representation. Finally, Tufte’s work allows us to emphasize that visualizations should first of all help bring out the patterns contained in a corpus of records, rather than pursue a wow effect. The chapter proceeds by discussing how to choose a chart type based on the objective of the research.
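As a small taste of this exploratory stance, the following sketch (ours, on made-up daily counts) plots the same records in two different ways with matplotlib, because different charts answer different questions.

# A hypothetical exploratory-visualization sketch: the same made-up daily counts
# plotted two ways, because different charts answer different questions
# (evolution over time vs. distribution of activity).
import random
import matplotlib.pyplot as plt

random.seed(1)
daily_posts = [random.randint(0, 20) + (30 if day % 7 == 0 else 0) for day in range(90)]

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 3))
left.plot(daily_posts)                      # "how does activity evolve?"
left.set(title="Posts per day", xlabel="day", ylabel="posts")
right.hist(daily_posts, bins=15)            # "how is activity distributed?"
right.set(title="Distribution of daily activity", xlabel="posts per day", ylabel="days")
plt.tight_layout()
plt.show()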
Chapter 10 introduces another type of data that is both native and crucially important to digital media – networks – and discusses one of the favorite techniques for collecting this type of data – crawling. The chapter provides hands-on advice on how to steer a web crawler and examines how networks of websites can be used to investigate the politics of association through preferential and communal attachment.
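To give a sense of what steering a crawler involves, here is a deliberately tiny sketch of our own: starting from placeholder seed pages, it follows hyperlinks breadth-first up to a fixed depth and records who links to whom; a real crawler would also respect robots.txt and rate limits.

# A hypothetical, deliberately tiny crawler: starting from seed pages, it follows
# hyperlinks breadth-first up to a fixed depth and records who links to whom.
# Seeds are placeholders; robots.txt and rate limits are left out for brevity.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

SEEDS = ["https://example.com/"]   # placeholder seed pages
MAX_DEPTH = 1

edges, seen = [], set(SEEDS)
queue = deque((url, 0) for url in SEEDS)

while queue:
    url, depth = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        target = urljoin(url, a["href"])
        if urlparse(target).scheme not in ("http", "https"):
            continue
        edges.append((url, target))                   # record the hyperlink
        if depth < MAX_DEPTH and target not in seen:
            seen.add(target)
            queue.append((target, depth + 1))

print(f"collected {len(edges)} hyperlinks from {len(seen)} pages")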
Drawing on this discussion, chapter 11 provides an overview of network analysis. It concentrates on network measures of relational importance such as authority or betweenness centrality (where those nodes that are “highly between” are significant as brokers or points of passage). The chapter also focuses on techniques for detecting communities and structural holes, and concludes by considering the technique of visual network analysis and the use of force-directed layouts.
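The sketch below (ours, on a made-up toy graph standing in for a website network) shows how the networkx library can compute betweenness centrality, detect communities, and produce a force-directed layout of the kind used in visual network analysis.

# A hypothetical network-analysis sketch with networkx: betweenness centrality
# and a simple community partition on a toy graph standing in for a website network.
import networkx as nx
from networkx.algorithms import community

# Toy hyperlink network (edges are made up)
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"),      # one cluster
              ("d", "e"), ("e", "f"), ("f", "d"),      # another cluster
              ("c", "d")])                             # a single bridge

betweenness = nx.betweenness_centrality(G)
broker = max(betweenness, key=betweenness.get)
print("most 'between' node (broker):", broker)

for i, group in enumerate(community.greedy_modularity_communities(G)):
    print(f"community {i}:", sorted(group))

positions = nx.spring_layout(G, seed=42)               # force-directed layout
print("2D coordinates for visual network analysis:", positions)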
Finally, in chapter 12, we conclude with some very basic programming skills. Without entering into the syntax of specific programming languages, we present the most important conceptual structures of computer code, providing readers with the basic information they need to learn how to hack existing scripts and prompt AI chatbots effectively to generate or debug code.
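By way of illustration – and as a sketch of our own, not the script shown in figure 12.1 – the few lines below bring together the basic building blocks discussed in chapter 12: variables, lists, loops, conditionals, and functions (the engagement numbers are made up).

# A small hypothetical script showing basic programming building blocks:
# variables, lists, loops, conditionals, and functions. Numbers are made up.
posts = [
    {"author": "alice", "reactions": 120},
    {"author": "bob",   "reactions": 3},
    {"author": "carol", "reactions": 47},
]                                          # a variable holding a list of records

def is_popular(post, threshold=50):        # a function with a default parameter
    return post["reactions"] >= threshold  # a condition returning True or False

for post in posts:                         # a loop over the records
    if is_popular(post):                   # a conditional branch
        print(post["author"], "posted something popular")
    else:
        print(post["author"], "did not")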
First, although the methods covered in this book overlap extensively and are often used together in practice, we have made an effort to divide them into distinct and relatively short chapters that you should be able to read separately.
We have also sorted the chapters into what we felt was the most intuitive order, devoting the first half of the book to harvesting techniques (ch. 2) based on search engines (ch. 3), AIs (ch. 4), APIs (ch. 5), and scraping (ch. 6); and the second half to data cleaning (ch. 7) and analysis through text mining (ch. 8), exploratory visualization (ch. 9), crawling (ch. 10), visual network analysis (ch. 11), and scripting (ch. 12). Yet even from this short list, it is obvious that things are more complicated (for example, AI can be used to analyze data and crawling and scripting are also collection techniques). Therefore, feel free to explore the chapters in any order, and keep an eye out for cross-references that highlight connections across the chapters.
To help you grasp their content at a glance, all chapters begin with a short summary. It will give you an overview of what is to come. All chapters also end with a few questions that you can use to make sure you have understood the gist of the text, but you can also review them before the chapter to sharpen your attention on the most important discussion in the text. After the questions, you will find some suggestions for further reading, which will help you extend and deepen your learning of the different methods covered in this book.
These suggestions obviously cannot cover all the resources available online on digital methods (in terms of publications, tutorials, tools, videos, etc.). This is one of the reasons why we have taken care to identify the most important notions for each technique and put them in bold in the text. These are keywords that you can use to find more resources. They are also the key conceptual tools you need to navigate the sea of digital methods. At the end of the book, you will find a glossary that defines each of them. Feel free to consult it and good luck in your exploration!
In this chapter, we begin our journey in the field of digital methods with what is arguably the first task that any research project should attend to: the collection of data. While specific collection techniques are discussed in most chapters of this book (according to the approach they rely upon), here we address the more general question of the nature and origin of digital records. Acknowledging that most digital methods projects rely on records derived from online platforms, we consider the advantages and disadvantages of this source, in terms of the coverage of information and the attunement between the research questions and the affordances, cultures of use, and data formats of platforms. We also examine how platforms’ limitations and biases can be turned into research objects and conclude by inviting scholars to look beyond the most mainstream sources and dig up more marginal yet more interesting troves of digital records.
In the introduction, we argued that digital methods are a form of second-hand research and that their interest lies in their capacity to rekindle the methodological imagination of the social sciences through the new research opportunities offered by digital records and tools. As such, digital methods call for a form of ingenuity that begins with the choice of one’s data sources. If, whatever your research topic, your reflex is to jump straight to data collection on Google, Facebook, or Twitter/X simply because others have done so in the literature, then you are missing the point – and missing it twice. Methodologically, you are missing the possibility of discovering the records and analytical techniques most fitting to your objectives. And politically, you are going along with the effort of big tech corporations to colonize online communication. In this chapter, we discuss both these risks and how to avoid them.
With some simplification, the history of digital media can be pictured as a tug of war between the opposite forces of diversification and normalization. Diversification comes from the efforts of individual users and online communities to take a technology that is open and manifold and to pull it in many different and often unexpected directions. Normalization stems from the struggle of corporations and states to tame the online wilderness and turn it into a series of well-guarded and well-walled gardens.
The tug of war between normalization and diversification is as old as digital media. Famously, the internet originated as a military technology meant to assure strategic communication in the event of nuclear attack on the US (Baran, 1964) and the web was initially imagined as a sprawling scholarly library (Berners-Lee et al., 1992). While these military and academic uses never disappeared, they were soon outgrown by the overflowing imagination of a multitude of techno-idealists (Flichy, 2007), communalists (Rheingold, 1993), trolls (Dery, 1994), freaks (Rushkoff, 1994), role-players (Turkle, 1995), and mass media fans (Jenkins, 2006). This cultural diversification, however, was not matched by an equal social diversity. Early digital publishing required creating one’s own website, which in turn demanded significant financial and technical resources. Consequently, for its first two decades, the internet and the web remained largely dominated by “young, white, English-speaking men from Europe and the U.S. with money and education” (Driscoll, 2020).
Things started to change in the 2000s, when social media platforms such as MySpace, Facebook, YouTube and Twitter (but also services such as Google Ads and devices such as the Apple iPhone) both expanded access to online communication and normalized its uses. As they attracted new users to digital media, platforms pushed their earlier countercultures to the margins and took the center of online communication. Platforms made the web truly world-wide and, at the same time, truly profitable. By setting up a vast apparatus of infrastructure to monitor, maximize, and market online engagement, they created a lucrative attention economy, based on the measuring of online engagement and its sale as a commodity to an increasingly distributed network of advertisers (O’Reilly, 2007). This development has been extended and supercharged by the growing popularity of mobile devices and the rise of a sprawling system of apps that made access to online media even easier but also even more controlled (Pybus et al., 2015).
The way in which platforms and apps organize their attention economy is in line with their general penchant for “centralized decentralization.” Where older media tended to concentrate content (in a limited number of broadcasting sources) and advertising (in a limited number of marketing agencies), social media have distributed both, allowing (almost) anyone to distribute content and publicity online. Digital platforms and apps also started to provide all their users with metrics and tools to track their communication success. By democratizing the access to marketing metrics that used to be the apanage of professional advertisers (views, comments, likes, etc.), platforms stimulate engagement and increase participation. More than any previous form of communication, they provide quantified feedback that gamifies interaction and competition for status. Rather than a “panopticon” – a system where everyone can be seen from a single central position – social media built a “holopticon” – a system where everyone can see themselves and everyone else.
The monitoring infrastructures set up by online platforms and apps have all sorts of societal fallouts (explored in the conclusion), but they are not without advantages for the study of social phenomena. Before the advent of digital methods, the collection of sociological data was generally quite expensive and entailed large-scale surveys or long-term ethnographic immersions, both impossible without the support of well-funded research institutions. Through digital media, the opinions of large populations can be monitored semiautomatically, and remote communities can be accessed from one’s computer. This is true for online communication in general and even truer for social media, as their infrastructures multiply the number of off-the-shelf tools and ready-made metrics that can be reappropriated for social investigation.
There is nothing wrong with taking advantage of this proverbial low-hanging fruit of digital monitoring and it is not a bad idea to jump-start one’s research by exploiting the powerful search and quantification infrastructure of social media platforms. A few days of snooping around YouTube, Wikipedia, Reddit and others can provide a great preliminary exploration for most research topics. In this preliminary phase, the idea is not to be picky or critical, but to provisionally embrace the way in which platforms structure social phenomena, accepting the suggestions of their search and recommendation engines (as discussed in the next chapter), and trusting their metrics of importance. After all, given that the entire business model of social media platforms revolves around measuring societal conversations, we may as well capitalize on it.
Take Twitter/X for example. For most of the 2010s and 2020s, the platform offered a convenient overview of the trends in public debate (at least in the Global North). Most political and cultural opinion leaders were active on the platform and its technical features and communication format made it easy for scholars, journalists, and other observers to take the pulse of public conversations. Public debate was suddenly packaged in short messages that could be easily quoted, shared, and embedded, and that came with a bunch of in-built relevance metrics. Cherry on the cake: both contents and metrics could be bulk downloaded. The fact that Twitter/X is only used by a minority of the population (globally and within each country), that its communicational environment encouraged short and antagonistic discussions, and that its popularity metrics privileged controversial and outrageous figures did not prevent observers from using it as a dashboard of societal debate. This was true for journalists who developed a form of addiction to the platform (McGregor & Molyneux, 2020), but also for scholars (especially computational social scientists) who could not resist the temptation to ground their research in its easily available, yet poorly representative data (Tufekci, 2014).
