To analyse social and behavioural phenomena in our digitalized world, it is necessary to understand the main research opportunities and challenges specific to online and digital data. This book presents an overview of the many techniques that are part of the fundamental toolbox of the digital social scientist. Placing online methods within the wider tradition of social research, Giuseppe Veltri discusses the principles and frameworks that underlie each technique of digital research. This practical guide covers methodological issues such as dealing with different types of digital data, construct validity, representativeness and big data sampling. It looks at different forms of unobtrusive data collection methods (such as web scraping and social media mining) as well as obtrusive methods (including qualitative methods, web surveys and experiments). Special extended attention is given to computational approaches to statistical analysis, text mining and network analysis. Digital Social Research will be a welcome resource for students and researchers across the social sciences and humanities carrying out digital research (or interested in the future of social research).
Page count: 429
Year of publication: 2019
Cover
Front Matter
Abbreviations
Introduction
1 Social Research Using Digital Data and Methods
Self-Reported and Behavioural Data
Big Data
The Construct Validity Problem
Representativeness and Access
‘Native’ or Complex Digital Methods
Digital Structured, Unstructured and Semi-structured Data
2 Unobtrusive vs Obtrusive Methods
Web Scraping and News Sources
Social Media Data
Collecting Data from APIs
Understanding Social Media Data
Tools and Instruments
Notes
3 Online Obtrusive Data Collection Methods
Online Qualitative Research Methods
Web Surveys
Experiments
Notes
4 Quantitative Data Analysis Reloaded
Quantitative Analysis and Digital Data
Conventional and Computational Approaches
Further Differences
Model-based Recursive Partitioning
Further Readings
Notes
5 Networks and Data
Networks
Key Concepts
Basic Network Metrics
Network-level Metrics
Types of Networks
Properties of Networks
Longitudinal Network Analysis
Tools
Further Readings
Notes
6 Text Mining
From Content Analysis to Text Mining
Text Mining
Text-mining Pre-processing Basic Concepts
Parts of Speech Tagging
Sentiment Analysis
Topic Models
Semantic Networks
Tools
Further Readings
Notes
7 Final Remarks
On Digital Social Research
Notes
References
Index
End User License Agreement
Chapter 1
Table 1.1: A schematic overview of the two ‘systems of thinking’ underlying human behaviour
Table 1.2: Cross-tabulation between typology of data and modes of behaviour
Table 1.3: Differences between structured and unstructured data
Chapter 2
Table 2.1: Online obtrusive and unobtrusive data collection methods
Table 2.2: Different formats of XML
Chapter 3
Table 3.1: Comparison between qualitative and quantitative research in social sciences
Table 3.2: Online interview types
Table 3.3: Cross-tabulation between types of online in-depth interviews
Table 3.4: A list of further analytical approaches to digital data
Table 3.5: Factual and counterfactual outcomes
Table 3.6: Pros and cons of online experiments
Chapter 4
Table 4.1: Comparison between the conventional approach of quantitative analysis in the soc…
Chapter 5
Table 5.1: Network elements and terms in different disciplines
Chapter 1
Figure 1.1: The three Vs of big data
Figure 1.2: The development process from concepts to variables for designed data
Figure 1.3: The repurposing process from available data to concepts for organic data
Figure 1.4: The partiality of relationship between a concept and its indicators
Chapter 2
Figure 2.1: Mapping of technologies required for scraping
Figure 2.2: Working with web texts combining methods
Figure 2.3: API to end-user workflow
Figure 2.4: Conceptual model for the social media platform Twitter
Figure 2.5: Social media entities and some of their relations
Figure 2.6: Conceptual diagram of how explicit relations between social media data entities …
Figure 2.7: The two-mode to one-mode projection
Figure 2.8: Implicit relations between content/resources, metadata, groups
Chapter 3
Figure 3.1: Combination of synchronous, asynchronous, active and passive modes of data colle…
Figure 3.2: Scrolling and paging design approaches for web surveys
Chapter 4
Figure 4.1: Types of multivariate methods
Figure 4.2: The two cultures of modelling
Figure 4.3: Example of model-based recursive partitioning
Chapter 5
Figure 5.1: Fundamental network features
Figure 5.2: Forum and newsgroups network relationships
Figure 5.3: Hyperlink analysis
Figure 5.4: Facebook friendship network relationships
Figure 5.5: The Twitter network of relationships
Figure 5.6: Linear snowball sampling
Figure 5.7: Exponential non-discriminative snowball sampling
Figure 5.8: Exponential discriminative snowball sampling
Figure 5.9: Examples of centralities and nodes
Figure 5.10: Degree, closeness and betweenness centralities
Figure 5.11: Ego network and circles
Figure 5.12: A two-mode network
Figure 5.13: A network visualization of the women by events study
Figure 5.14: A two-mode network between users and posts
Figure 5.15: Multilayer network example
Figure 5.16: Distribution of links in a non-scale-free network (left) and in a scale-free one…
Figure 5.17: Examples of a random network (left), a scale-free network (centre) and a small-w…
Chapter 6
Figure 6.1: Comparison between bag-of-words only and use of n-grams (bi-grams in this case)
Figure 6.2: Example of supervised sentiment analysis analytical flow using Twitter data
Figure 6.3: LDA process of topic creation
Figure 6.4: A graphical model of LDA
Figure 6.5: Semantic network of lexical units’ co-occurrences from tweets related to climate…
To my family and friends
Giuseppe A. Veltri
polity
Copyright © Giuseppe A. Veltri 2020
The right of Giuseppe A. Veltri to be identified as Author of this Work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.
First published in 2020 by Polity Press
Polity Press, 65 Bridge Street, Cambridge CB2 1UR, UK
Polity Press, 101 Station Landing, Suite 300, Medford, MA 02155, USA
All rights reserved. Except for the quotation of short passages for the purpose of criticism and review, no part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher.
ISBN-13: 978-1-5095-2933-9
A catalogue record for this book is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Veltri, Giuseppe A., author.
Title: Digital social research / Giuseppe A. Veltri.
Description: Medford, MA : Polity Press, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2019014898 (print) | LCCN 2019015511 (ebook) | ISBN 9781509529339 (Epub) | ISBN 9781509529308 (hardback) | ISBN 9781509529315 (pbk.)
Subjects: LCSH: Social media--Research. | Social sciences--Research--Data processing. | Social sciences--Research--Methodology.
Classification: LCC HM742 (ebook) | LCC HM742 .V45 2019 (print) | DDC 302.23/1--dc23
LC record available at https://lccn.loc.gov/2019014898
The publisher has used its best endeavours to ensure that the URLs for external websites referred to in this book are correct and active at the time of going to press. However, the publisher has no responsibility for the websites and can make no guarantee that a site will remain live or that the content is or will remain appropriate.
Every effort has been made to trace all copyright holders, but if any have been overlooked the publisher will be pleased to include any necessary credits in any subsequent reprint or edition.
For further information on Polity, visit our website: politybooks.com
ALAAM: auto-logistic actor attribute models
API: application program interface
CAQDAS: computer-assisted qualitative data analysis
CART: classification and regression trees
CFA: confirmatory factor analysis
CHAID: chi-square automatic interaction detection
CSS: computational social science
CTM: correlated topic model
DGP: data-generating process
DMI: Digital Methods Initiative
EDI: electronic data interchange
ERGM: exponential random graph models
FA: factor analysis
GDPR: General Data Protection Regulation
http: hypertext transfer protocol
ICT: information-communication infrastructure
IR: information retrieval
JSON: JavaScript Object Notation
LCA: latent class analysis
LDA: latent Dirichlet allocation
LSA: latent semantic analysis
MTMM: multitrait–multimethod matrix
NER: named entity recognition
NLP: natural language processing
NoSQL: not only SQL
OLFG: online focus group
OLS: ordinary least squares
PCA: principal component analysis
RCT: randomized controlled trial
RDBMS: relational database management system
REST: representational state transfer
RSS: rich site summary
SEM: structural equation modelling
SNA: social network analysis
SOAP: simple object access protocol
SQL: structured query language
SRS: simple random sample
SVD: singular value decomposition
URL: uniform resource locator
UTOS: units, treatments, observation operations, settings
VAU: Voson Activity Units
VGI: volunteer geographic information
WWW: World Wide Web
In July 2014, a group of researchers published a paper titled ‘Experimental evidence of massive-scale emotional contagion through social networks’ (Kramer et al., 2014), a study conducted on the social media platform Facebook involving hundreds of thousands of its users as participants. This study came under the spotlight not so much because of its scientific findings but because it exemplified how social scientific research has been changing, in terms both of opportunities and of associated risks. Hundreds of thousands of participants were manipulated, in experimental terms, without knowing that they were part of a study, one conducted on a scale unprecedented in social science research. Since then, terms like ‘big data’, digital research and web social science have inundated conferences and paper abstracts. Social scientists have reacted fundamentally in one of two ways: wild enthusiasm or stern scepticism. For the enthusiasts, the availability of digital data collected by multiple means of recording our digital traces represented the long-awaited turn in an increasingly difficult reality of data collection. The sceptics, while appreciating the potential, raised questions about the quality of such data and about data access and ownership, and started a debate about integrating online and offline data. Of course, plenty of researchers fall in between these two broad categories, and that is where this book ideally situates itself. Its approach is to provide a critical overview of the common methods used to carry out digital research while paying particular attention to methodological principles and theoretical issues that are not easily dismissed by the new availability of data about society and people.
My enthusiasm exists, perhaps, because of my personal academic history. During my PhD years, I experienced at first hand the move from analogue to digital data and the increasing presence of software-assisted data management and analysis. At that time, learning to use the latest software would often provide a sense of thrill, and new possibilities of research would open up in front of you. Soon enough, however, after this feeling of great potential, old problems and questions of research methodology would return, throwing cold water on your enthusiasm.
A similar cycle of enthusiasm and doubt is part and parcel of being a digital researcher these days. There is little doubt that digital data are changing the way social science is done. One example of this is the perceived value that data now have. During my training as a researcher, I was taught that data are those precious things that are very hard to obtain and therefore, once in our grasp, should be exploited to the fullest. Today, most social scientists collecting digital data have datasets sleeping on their hard disks or in the cloud. Because data were scarce, each data collection was often maniacally crafted in terms of the instruments used – for example, a questionnaire – with an already fairly clear idea of how the analysis would be conducted. As data have become much cheaper, the amount of planning and analytical strategy appears to be decreasing. This is also due to the increased obsolescence of data. In a context of fast, continuous and affordable data collection, data become ‘old’ very quickly, and yet the use of archives is problematic because of access issues. As the entire research process speeds up, data are collected, analysed and archived very quickly in order to move on to the next project.
At the same time, digital social research is fast-moving, for several reasons. The first is that the division between online and offline research is increasingly fading. The usual distinction between research ‘about the Internet’ and ‘through the Internet’ – where the first term refers to human behaviour and social phenomena specific to the online world, while the second refers to using the Internet as a field in which to conduct a study that could also be conducted offline – does not hold up well these days. Digital data defy formal definition, but, for practical purposes, we can describe them as the digital traces of human behaviour and opinions recorded by a wide set of digital services operating in different domains of society (e.g., financial, transportation, health, commercial, social). The nature of digital data is continuously expanding as digital services emerge and many different objects acquire the capacity to record information about their use and the environment in which they operate – the so-called ‘Internet of Things’.
The second reason to consider digital data, especially those from social media, as fast-moving is that the way people use them evolves over time. For example, Facebook has become a main social arena for many people; the naive use of the platform that probably existed when it started is long gone. People are now strategic in their use of their Facebook presence. One example of this is the positivity bias of Facebook content (Spottswood and Hancock, 2016). Almost as in a large-scale Hawthorne effect, in which individuals modify an aspect of their behaviour in response to their awareness of being observed, people know that their digital presence is observable and therefore adapt to this visibility.
The third reason is that social digital data are becoming increasingly complex. To illustrate this point, let’s take an example based on one of the fundamental research instruments in social research: the questionnaire. In the pre-digital age, a questionnaire would collect data designed by its makers in the form of answers to questions, as well as some contextual data provided by interviewers if door-to-door collection was part of the design. As surveys started to be conducted by telephone, different types of data became available to researchers – for example, the duration of the task of completing the survey (speed of response is used as a proxy of quality, as we will discuss in Chapter 3). Online surveys have further enlarged the types of data that are collected in a questionnaire. Together with the answers to questions, a plethora of metadata and paradata can be collected and analysed jointly with the ‘main’ data. Metadata (data about data) and paradata (data about process) are empirical measurements about the process of creating survey data – in other words, recordings about the fieldwork process. They include time spent per screen, keystrokes and mouse clicks, changes of answers, etc.
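The distinction above can be sketched as a single web-survey response record that carries answers, metadata and paradata side by side. This is a minimal illustrative sketch only: all field names, values and the timing threshold are hypothetical, not taken from any particular survey platform.

```python
# Hypothetical sketch of one web-survey response: substantive answers
# together with metadata (data about data) and paradata (data about the
# fieldwork process). All field names and values are illustrative.

response = {
    "answers": {"q1_age": 34, "q2_gym_visits_per_month": 8},
    "metadata": {  # data about the data
        "survey_id": "S-2019-01",
        "device": "mobile",
        "browser": "Firefox",
    },
    "paradata": {  # data about the process of answering
        "seconds_per_screen": [12, 8, 31],
        "answer_changes": 2,
        "mouse_clicks": 41,
    },
}

# Speed of response is one rough proxy of data quality (see Chapter 3):
total_seconds = sum(response["paradata"]["seconds_per_screen"])
suspiciously_fast = total_seconds < 20  # illustrative cut-off, not a standard

print(total_seconds)      # 51
print(suspiciously_fast)  # False
```

The point is simply that the ‘main’ data (the answers) now arrive embedded in a richer, multidimensional object that can be analysed jointly with them.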
This is just one example of how a relatively uncomplicated type of data familiar to social scientists is now a potentially complex, multidimensional object of analysis. Needless to say, data collected from online sources are often of this type, with a degree of complexity sometimes much higher than previously encountered. This is a somewhat new situation for social scientists – too much choice can be overwhelming and confusing. The ‘digital challenge’ will be a crucial one because the ‘thirst’ for methodological innovation in the social sciences stems from the enduring crisis that has characterized most of the widely used existing techniques. Surveys are exemplary in this case: a pillar methodology across many different disciplines, they are suffering a long-lasting crisis due to the increased difficulty of assessing response rates and sampling frames, and a limited capacity to capture variables that are not self-reported measures but important proxies. Similar considerations concern the in-depth interview, another important instrument of data collection in social science. One criticism concerns the translation of a technique developed before the advent of digital media, and the implications of interviews carried out via computer-mediated communication.
Increasingly, self-reported surveys and interviews measuring human motivations and behaviour are under scrutiny and being compared to more ‘organic’ sources of data (Curti et al., 2015). This is not to say that digital data do not raise a substantial amount of concern, particularly regarding the tendency to consider them as ‘organic’: the current debate relates to the kind of critical awareness that should accompany all methods used by social scientists. Perhaps for historical reasons, the artificial nature of traditional methods had long been forgotten until recently, when their capacity for generating quality data became increasingly problematic.
Such limitations are even clearer if we consider two further aspects: first, the vast majority of social science data from surveys and interviews are cross-sectional, without a longitudinal temporal dimension (Abbott, 2001); second, most social science datasets are coarse aggregations of variables because of the limitations in what can be asked via self-reported instruments. Digital data are forcing innovation on both accounts, moving from static snapshots to dynamic processes and from coarse aggregations to high-resolution data. The interesting by-product of these innovations is the possibility of an increased focus in the social sciences on processes rather than structures. For the first time, we can obtain longitudinal baseline norms, variance, patterns and cyclical behaviour. This requires thinking beyond the simple causality of the traditional scientific method towards extended systemic models of correlation, association and episode triggering. Network analysis is a good example here: the availability of longitudinal relational data sparked the recent methodological and theoretical innovations concerning the dynamics of networks (Barabási and Posfai, 2016).
The aim of this book is to provide an overview and understanding of the most used digital research methods of data collection and the associated analytical strategies, paying particular attention to the methodological and theoretical issues that still need further reflection and discussion. This work is the outcome of my research experience as well as of teaching done over five years at the University of Leicester on the MA course in new media and society, at the methodology summer school of the London School of Economics and at my current institution, the University of Trento, with the addition of several courses taught across Europe and Asia. The backbone of this book is material developed for two courses: ‘Research methods for the online world’ and its complementary and more applied module ‘New media, online persuasion and behaviour’.
Equally important has been the experience of several large-scale behavioural research projects conducted for several European institutions on topics such as online transparency of platforms (Lupiáñez-Villanueva et al., 2018), online advertising (Lupiáñez-Villanueva et al., 2016) and online gambling (Codagnone et al., 2014). These have all been invaluable in learning how to study digital phenomena within the context of providing evidence for the development of public policies.
There is a vast number of books dedicated to mastering specific research techniques, and the aim here is not to emulate such texts. Instead, the aim is to contextualize each technique in the study of human behaviour and societies. The use of the term social sciences is meant to reflect the reality that there are different disciplines dealing with human affairs. Sociology, political science, anthropology, psychology and economics all have their own epistemological positions and methodological preferences. Digital social research does not escape the same condition: it is conducted with different aims, theories and methods depending on the academic discipline in question. Most of the content of this book should be applicable to all social science disciplines, but it is inevitable that some discipline-related emphasis is present. It is therefore better to make explicit that most of the author’s digital research has been carried out in the context of social psychological, behavioural and sociological studies. While there is increasing collaboration across the social sciences, those familiar with different approaches and from different disciplines will recognize an implicit set of assumptions and research goals, and will not, hopefully, be put off.
The first chapter is dedicated to the nature of digital data from the perspective of a social scientist. This is because digital data are highly complex and require particular attention when used for social research. The second chapter is dedicated to one category of data collection methods for the digital world. This category is labelled ‘unobtrusive’, meaning that these methods do not require the active participation of individuals: data are collected without directly engaging with people, who are not aware, unless they are notified, that their data are being used for research. However, these methods pose new challenges to researchers, who have to take into account the design of these platforms when drawing conclusions about social phenomena. Both the so-called ‘affordances’ of technological infrastructures and their political economy need to be considered (Madsen, 2015; Fuchs, 2015).
The third chapter concerns methods that have found their digital evolution: surveys, focus groups and experiments have found a new life online, albeit with some caveats. These are obtrusive methods; they require active engagement by participants. While they have a consolidated history of practice, their extension to the digital domain also poses challenges and offers opportunities.
Chapter 4 is dedicated to what I believe is a crucial issue: the epistemological and methodological changes and challenges that digital data are bringing to social science research. The emphasis here is on the increasingly common use of analytical methods coming from computer science in the domain of digital social science. This point has been the object of debate among methodologists and it is a crucial obstacle in finding common ground with computer scientists in joint projects.
Chapter 5 presents an overview of network analysis, a longstanding tradition in social science research that has found new life thanks to the availability of digital relational data – data about relationships between actors – and that is largely, but not only, applied to datasets from social media and other digital communication (e.g., emails, telephone calls, text messages).
Chapter 6 deals with one of the most interesting developments in methodology for the social scientist: text mining. Text has always been part of the data collected by social science research, particularly in the qualitative tradition. Content analysis has been the dominant way of quantifying text characteristics and the most common analytical strategy adopted by researchers who have to manage large quantities of text. The digital age has, among other changes, brought an exponential growth of text produced by people: it is the experience of the everyday use of social media, but also of blogs and forums. Never before has so much text spontaneously produced by people been available. At the same time, ever more sophisticated methods of automatic analysis of texts have been developed, allowing researchers to analyse anything from a few hundred to millions of documents (see, for example, Sudhahar et al., 2015). While these types of automatic analysis do not aim to substitute for the in-depth understanding that human analysis can provide, they do provide a unique bird’s-eye view of a large set of documents that was simply impossible to have before. A bit like the Nazca Lines – a series of large ancient geoglyphs in the Nazca Desert in southern Peru, observable in their entirety only from the sky above – text-mining techniques allow researchers to detect common patterns or even structures across many different documents.
The last chapter is dedicated to a few general remarks about doing digital social research, among which we will discuss the ethical aspects of these kinds of studies. The recent Cambridge Analytica scandal, concerning the misuse of Facebook data for commercial purposes, grabbed the attention of millions of citizens across the globe. At the same time, the introduction in the European Union of the new GDPR (General Data Protection Regulation) has changed the rules of the game, including for social scientists (European Commission, 2018). There is no doubt that this is a complex and important issue that deserves an entire book by itself. I cannot provide a lengthy discussion of the ethical and legal implications of using digital data and will therefore limit discussion to what I believe are the most salient points. I will also mention the issue of access in terms of inequality of research opportunities across the research community. It is no mystery that many large digital platforms, including the largest social media, are owned by American companies. The consequence is that North American academic institutions have historically had stronger relationships with these private entities than European and Asian universities.
After this rather long introduction, we move next to discuss the nature of digital data for the purpose of social research. Exciting opportunities are emerging, while, at the same time, old and new methodological challenges are not easily settled. These challenges are the future of social science research: if we do not include our digital social life in our research practices, our capacity to understand human societies is greatly diminished. And yet, the critical eye that social scientists have learned to exercise needs to be sharp in a research domain in which digital data have become the most valuable asset for very large sectors of the economy and are also more and more crucial for the political evolution of democracies and non-democracies alike.
All social research methods have underlying assumptions about human nature and, in particular, about the way people make decisions, form their opinions and behave. In fact, this is one of the aspects that differentiates the various disciplines within the social sciences. Psychology, economics and sociology all have different models of how people behave. Besides the theoretical implications of such differences, the consequences for the type of methodology employed are substantial. Depending on which underlying model is selected, a particular research method is considered appropriate to study human behaviour.
For a long time, in economics but not only there, a model of how people make decisions known as ‘rational choice theory’ has been considered the baseline. According to this model, people’s preferences have a well-defined structure, and the choice between courses of action is an almost automatic mechanism in which the individual applies his or her system of preferences to a limited set of options (for example, the set of products that fall within the available budget). In the other social sciences, the most common underlying model of human behaviour leaned instead towards an ‘oversocialized view’: ‘a conception of people as so overwhelmingly sensitive to the opinions of others, and hence obedient to the dictates of consensually developed norms and values, internalized through socialization, that obedience is not burdensome but unthinking and automatic’ (Granovetter, 2017: 11; see also Di Maggio, 1997).
Both models look primarily at people’s conscious thought processes as determining what they think and believe and how they act. Deviations from either economic rationality or forms of ‘sociological rationality’ were labelled as ‘irrational’. In such a context, the way to elicit data and study people’s behaviour relies on what people themselves report about their opinions, social norms, attitudes and beliefs. These are often defined as ‘self-reported’ data, meaning that researchers rely on the participants of a study to report on something they have done or on what they think or believe. Surveys and interviews of all sorts are examples of self-reported data. In contrast to this approach, there are observational or behavioural data: data about the actual actions and behaviour carried out by someone. To better appreciate the difference, let’s take the example of asking someone how many times she or he goes to the gym every month, and compare the response to the actual tracking of their movements, for example via a GPS-enabled phone or watch. The two pieces of information can differ dramatically. Researchers in the social sciences have learned to live with the limitations of self-reported data, such as social desirability bias (Kreuter et al., 2009). The concept of social desirability rests on the notion that there are social norms governing some behaviours and attitudes, and that people may misrepresent themselves in order to appear to comply with these norms. This is the reason why participants might provide inaccurate information about their behaviour to researchers. At the same time, people have difficulty verbalizing accurately what they have done, felt and thought, and recalling events from memory is not easy either (Gaskell et al., 2000). In other words, self-reported measures have their limitations, but they have been the most common way of conducting social research related to human behaviour.
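The gym example above can be made concrete with a small sketch that compares self-reported visit counts with behaviourally tracked ones. All respondents and numbers here are invented for illustration; no real tracking data or measurement procedure is implied.

```python
# Illustrative sketch: self-reported monthly gym visits versus (hypothetical)
# counts inferred from GPS traces for the same respondents. Positive gaps
# are one simple symptom of social desirability bias in self-reports.

self_reported = {"r1": 8, "r2": 12, "r3": 4}   # "How many times per month...?"
gps_tracked   = {"r1": 3, "r2": 11, "r3": 4}   # visits inferred from movement data

# Per-respondent over-reporting (positive = reported more than observed).
over_reporting = {r: self_reported[r] - gps_tracked[r] for r in self_reported}

mean_gap = sum(over_reporting.values()) / len(over_reporting)
print(over_reporting)  # {'r1': 5, 'r2': 1, 'r3': 0}
print(mean_gap)        # 2.0
```

Even this toy comparison shows why the two kinds of data cannot simply be treated as interchangeable measurements of the same behaviour.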
However, the biggest challenge to self-reported data has come from a shift in the model of human behaviour. Since the late 1990s, psychologists have distinguished between two systems of thought with different capacities and processes (Kahneman, 2011; Kahneman and Frederick, 2002; Metcalfe and Mischel, 1999; Sloman, 1996; Smith and DeCoster, 2000; Lichtenstein and Slovic, 2006), which have been referred to as System 1 and System 2 (Evans and Stanovich, 2013). System 1 (S1) is made up of intuitive thoughts of great capacity, is based on associations acquired through experience and quickly and automatically calculates information. System 2 (S2), on the other hand, involves low-capacity reflective thinking, is based on rules acquired through culture or formal learning, and calculates information in a relatively slow and controlled manner. The processes associated with these systems have been defined as Type 1 (fast, automatic, unconscious) and Type 2 (slow, conscious, controlled) respectively (see Table 1.1). The perspective of the dual system became increasingly popular, even outside the academy, after the publication of Daniel Kahneman’s book Thinking, Fast and Slow (2011). Kahneman was awarded the Nobel Memorial Prize in 2002 for his contribution to the explanation of individual economic behaviour through the elaboration of the ‘prospect theory’ (see Kahneman and Tversky, 2008).
Table 1.1 A schematic overview of the two ‘systems of thinking’ underlying human behaviour
System 1: quick, automatic, no effort, no sense of voluntary control; continuous construal of what is going on at any instant.
Characteristics: quick (reflexive); heuristic-based; uses shortcuts.
When it plays: when speed is critical, to avoid decision paralysis; when System 2 is lazy or not activated (not worth the effort, no energy, lack of awareness).

System 2: slow, effortful, attends to the mental activities that require it; good at cost/benefit analysis, but lazy and saddled by decision paralysis (cognitive overload).
Characteristics: deliberate (reflective); conscious; rule-based.
When it plays: may take over when System 1 cannot process the data; may correct/override System 1 if effort shows that intuition or impulse is wrong.
The so-called ‘dual model’ of the mind is now the most widely supported account of human behaviour at the individual level. The model has also been applied outside psychology, for example in sociology (Moore, 2017; Lizardo et al., 2016) and in political science (Achen and Bartels, 2017), and the implications of Kahneman and Tversky’s work have led to the research programme known as behavioural economics, which has had a great impact on traditional microeconomic theory.
From the initial underlying model of human behaviour based on the ‘theory of rational decision-making’, or rational choice theory, we have moved to a current model that portrays human beings as characterized by ‘bounded rationality’ – in other words, rational within limits, where the ‘irrational’ is not some mysterious, almost metaphysical force, but the outcome of systematic errors and biases originating in how our cognition and emotions work (and interact).
A more precise model of human behaviour and decision-making has implications for social science research methodology, and in particular for the aforementioned distinction between self-reported and observational/behavioural data. The dual model of thinking restores the importance of unconscious thought processes, but also of contextual and environmental influences on human behaviour, both of which are highly problematic to capture in studies that use only self-reported measures and instruments. Traditionally, collecting behavioural data has been very difficult and expensive for social scientists: tracking people’s actual behaviour was feasible only for small groups of people and for very limited periods of time. The availability of digital data has brought a large increase in behavioural data; we now have digital traces of people’s actual behaviour that were quite simply never available before.
The combined effect of a relatively new and powerful foundational model of human behaviour and decision-making offered by the dual model together with the availability of behavioural data thanks to the digital traces recorded by a multitude of services and tools is very promising for social scientists. Before continuing this line of argument, let’s clarify one point that might be the object of criticism. Considering human behaviour as the outcome of mutual influences of conscious acting and unconscious heuristics, of biases and environmental influences, does not mean a return to a form of reductionism in which people’s opinions count for nothing. Self-reported data will remain an important source of information for social scientists, but, at the same time, the availability of behavioural data will function as complementary data to understand complex social phenomena. If we cross-tabulate that typology of data with the modality of human behaviour and decision-making – as shown in Table 1.2 – the complementarity becomes clearer.
Table 1.2 Cross-tabulation between typology of data and modes of behaviour
Typology of data – Type of human behaviour
Self-reported – System 2: rational deliberation, conscious description of attitudes
Behavioural/observational – System 1: use of heuristics, influence of context/environment
The distinction between self-reported and behavioural data is no longer mainly theoretical, because the new opportunities for collecting the latter are unprecedented. This opportunity opens up new research directions, as well as the possibility of reviewing current theories and existing models. Table 1.2 reports a distinction that is particularly useful for those interested in studying human behaviour at the micro and meso levels, that is to say at the individual and group levels of analysis, but it is less pertinent to the macro level.
According to Granovetter (2017: 13), both overrational and oversocialized models of human behaviour are atomistic in nature: ‘both share a conception of action by atomized actors. In the under-socialized account, atomization results from narrow pursuit of self-interest; in the oversocialized one, from behavioural patterns having been internalized and thus little affected by ongoing social relations.’ Behavioural digital data can have, among other features, a great deal of information about social relations and people’s embeddedness; they can help overcome such an atomized view (we will return to this issue later in the book).
However, the increased ability to collect data about people’s behaviour does not free us from biases generated by the design and aims of digital platforms. People’s behaviour is constrained by the platform they use; for example, it is not possible to write an essay on Twitter unless we decide to write it as a large number of individual tweets. There are, therefore, several potential sources of confounding factors, as we will further elaborate in the section below on construct validity.
Returning to the issue of the different levels of analysis, at all levels another distinction is relevant: the one between static and dynamic data. The large majority of data collected in the social sciences have been ‘static’ – that is, collected at a given point in time. The reason is that longitudinal data collection, in which data are gathered over a period of time, was very difficult and expensive. The only exceptions were analyses carried out on documents and data that were archived and therefore accessible – for example, newspapers, but also administrative data collected by governments or other institutions. Relying on static data has produced an involuntary emphasis on theories that focus on ‘structures’ rather than on processes (Abbott, 2001). In other words, it has been historically difficult for social scientists to observe the dynamic unfolding of social events, especially at the macro level, because collecting data for this purpose was extremely complex and demanding in terms of resources. Most surveys are cross-sectional, meaning that they are carried out only once or twice, and the same applies to interviews and other forms of data collection. Digital data introduce a much-increased capacity for recording and using longitudinal data for social scientific purposes. Digital data have, of course, not been around for many decades, but future researchers might have at their disposal longitudinal datasets that were absent in the past. The dynamic nature of digital data might be more enriching than their raw size; ‘big data’ concerns not only size but also resolution, as we will discuss later.
Behavioural digital data are the object of attention of a new generation of social scientists who believe in their potential to bring about a regeneration of the current theories and frameworks that were developed in a condition of data scarcity, with different models of human behaviour and an overreliance on self-reported data. It is too early to say what changes the context of increased data availability will bring, but this is the most exciting aspect of the use of digital data for social scientific research. The nature of the data collected from the digital world is not without problems, and it poses specific challenges to researchers. In the next section, we will discuss the nature of big data, and later we will look at some of the methodological issues concerning their application in the context of social science research. After that, we will distinguish between different types of digital data, with an emphasis on unstructured and semi-structured data, given that these are particularly interesting, as well as challenging, for social scientists.
Currently, the most discussed type of digital data in the social sciences is so-called ‘big data’. The expression hides different objects and technologies and is therefore one of those umbrella terms to which it is virtually impossible to attach a consensual definition. A common description of big data refers to the three Vs (Figure 1.1) of volume, velocity and variety. Volume refers to the quantity of data produced: this is the most salient feature of big data, the sheer amount of data created by digital services and goods. Velocity stands for the fast-moving nature of digital data, which are often produced ‘on the fly’; for example, when we search something on a search engine, the exact list of results is generated instantly from our query. The third V stands for variety, the multiplicity of formats that data can have in the digital world. The latter is a source of richness but also of trouble for social science researchers, because big data are generated by a vast range of largely invisible processes with frequently incomparable dimensions, and different degrees of dimensionality.
Figure 1.1 The three Vs of Big Data
Source: based on Claverie-Berge (2012).
This is an important point that deserves consideration. The large majority of big data, from the most common such as social media and search engine data to transactions at self-checkouts in hotels or supermarkets, are generated for different and specific purposes. They are not designed by a researcher who already has in mind a theoretical framework of reference and an analytical strategy. By contrast, surveys are designed data-harvesting instruments. Survey designers are experts in the art of eliciting the types of records that allow the processes that generated them to be inferred, and to contribute in pre-understood ways to the statistical modelling and sample selection controls that will be used to model them. Surveys are deliberately designed to tame the effects of multiply entangled correlations. Big data, by contrast, are just a large conglomerate of such correlations; very often they are not carefully designed. Twitter and big national surveys have both been used to analyse public opinion, but their data are different, and so what they can reveal about public opinion is different in each case. Sentiment analysis of Twitter data, the emotional valence of tweets computed by text mining, is now a popular way of tracking public opinion, and one for which surveys are not well suited. However, unpacking people’s attitudes about public issues is probably still better served by carefully designed surveys and associated samples. From this point of view, the debate between big data enthusiasts and sceptics should be formulated differently: there are social research questions and issues for which big data are interesting, and others for which ‘traditional’ social scientific methods are still more reliable and useful.
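To make the contrast concrete, here is a minimal, purely illustrative sketch of lexicon-based sentiment scoring of the kind often applied to tweets. The word lists, example texts and the `sentiment_score` function are invented assumptions, not taken from any real lexicon or toolkit; production sentiment analysis uses far richer lexicons and must handle negation, irony and context.

```python
# Minimal lexicon-based sentiment scoring (illustrative only).
# The word sets below are tiny hypothetical stand-ins for a real
# sentiment lexicon.

POSITIVE = {"good", "great", "love", "happy", "support"}
NEGATIVE = {"bad", "terrible", "hate", "angry", "oppose"}

def sentiment_score(text: str) -> int:
    """Count positive minus negative lexicon words in a text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

tweets = [
    "I love the new policy, great news!",
    "Terrible decision, I strongly oppose it.",
]
print([sentiment_score(t) for t in tweets])  # [2, -2]
```

Aggregating such scores over millions of tweets per day yields the kind of public-opinion time series discussed above; the validity of the result, of course, depends entirely on the lexicon and on the platform context in which the tweets were produced.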
Therefore, one of the first characteristics of big data, highly relevant for the social scientist, is their ‘organic’, as opposed to ‘designed’, nature (from the point of view of social research data). Currently, data are becoming a cheap commodity, simply because society has created systems that automatically track transactions of all sorts. For example, Internet search engines build datasets with every entry; Twitter generates tweet data continuously; traffic cameras digitally count cars; scanners record purchases; Internet sites capture and store mouse clicks. Collectively, human society is assembling massive amounts of behavioural data. If we think of these processes as an ecosystem, it is self-measuring on an increasingly broad scale. Indeed, we might label these data ‘organic’, a now-natural feature of this ecosystem. Big data are thus considered ‘organic’ because they are created by different actors in the context not of research, but of producing or delivering goods or services. This is in contrast to ‘designed’ data, those that are collected when we design experiments, questionnaires, focus groups, etc. and that do not exist until they are collected.
Social scientists are not completely new to this use of data. There is a longstanding tradition of secondary data analysis, but there are some important differences as well. Secondary data in the social sciences generally refers to the reuse of existing datasets collected either by official institutions or by other researchers (Vartanian, 2011). Although these datasets might not be collected for research purposes (although they often are), they are usually publicly accessible and their methodological features are quite transparent. Some datasets are of very high quality, for example those from academic or research institutions as well as governmental organizations; others might be less reliable. What secondary analysis shares with big data is the idea of repurposing data: data that were initially collected for other aims are repurposed for new, specific research goals set by the secondary analyst. The difference is that, for big data, especially those collected by private companies, the lack of transparency about how the data are collected or coded is a problem that digital social researchers have to face.
Repurposing data requires a good understanding of the context in which the repurposed data were generated in the first place. In other words, these data are not ‘natural’: they are the outcome of design decisions and socioeconomic processes, and were therefore created with certain goals and trade-offs in mind. Working with them is about finding a balance between identifying the weaknesses of the repurposed data and, at the same time, recognizing their strengths. A good practice for social scientists, which applies not only to big data, is to think about the ideal dataset for their research and then compare it with what is available; this comparison makes salient both the problems and the opportunities of the data at hand.
In the previous sections, we have highlighted how digital data, including in their big data format, possess some useful characteristics for social scientific research. There are essentially three positive features:
Increased size and resolution. Big data are by definition ‘big’, but what this adjective really means requires further elaboration. Size refers to the sheer number of cases or participants that we can include in our research. A much-debated ambition of big data is to move from samples to populations; in other words, to include in a study not a carefully selected and hopefully representative portion of a population in order to infer something about it, but the entire population itself. Another way of interpreting the ‘big’ in big data concerns not the number of cases but their increased ‘resolution’ (Housley et al., 2014). By resolution, we mean the number of data points available for each subject. To explain the concept, let’s consider a comparative example. If we design a questionnaire with twenty questions, we will collect twenty data points per participant. Historically in the social sciences, the number of data points available per person has been limited by constraints in the data collection process. It has been unfeasible to use questionnaires with two hundred questions because participants would not be able to answer them without great fatigue, which in turn would affect the quality of the data collected. Therefore, the amount of data collected for each individual has always been a careful trade-off determined by research goals and the planned analytical strategy. One of the most attractive features of big data is the possibility of having a much larger set of data points per individual. (However, it is a different matter to have coherent data points that can be used for one specific ‘construct’, something we will discuss in the next section about construct validity.) Let’s take the example of the data collected by Facebook for each of its users. While it is impossible to know exactly how many variables the American social media company collects, what is visible to researchers is already in the order of hundreds of variables.
Such richness represents an intriguing possibility for social scientists, but at the same time poses the challenge of how to deal with so much data for each individual. The availability of many data points is particularly interesting for the measurement of stratified variables, that is, variables that measure complex constructs made up of multiple components. Already in survey research, complex research objects are measured by multiple items (questions) that are supposed to tap different sub-aspects and that can be combined together. This is an important methodological point to which we will return in the section on construct validity, because big data also present problems in testing the validity of constructs.
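As a hedged illustration of how multiple items can be combined into one construct measure, the sketch below averages three hypothetical Likert items into a single score. The item names, responses and the `construct_score` helper are invented for the example and do not come from any real survey; real scale construction would also check reliability (e.g. internal consistency) before combining items.

```python
# Illustrative only: combining several survey items that tap the same
# construct into one score. Item names and responses are hypothetical.
from statistics import mean

# Three 1-5 Likert items assumed to tap one construct
# (e.g. 'trust in institutions').
ITEMS = ["trust_gov", "trust_media", "trust_science"]

responses = {
    "r1": {"trust_gov": 4, "trust_media": 3, "trust_science": 5},
    "r2": {"trust_gov": 2, "trust_media": 2, "trust_science": 3},
}

def construct_score(answers: dict) -> float:
    """Average the component items into a single construct score."""
    return mean(answers[item] for item in ITEMS)

scores = {rid: round(construct_score(a), 2) for rid, a in responses.items()}
print(scores)  # {'r1': 4.0, 'r2': 2.33}
```

With big data, the same logic applies, but the analyst must decide which of the hundreds of available data points coherently belong to one construct, which is exactly the construct validity problem discussed below.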
Big data are long – that is, longitudinal. The other useful feature of big data for social scientists is their lengthy time span. The longitudinal nature of big data creates a novel opportunity for the social sciences, which have historically faced high costs and technical difficulties in collecting data over long periods of time. The possibility of studying the unfolding of social events can help us understand their processes. For example, Russell Neumann et al. (2014) revisited the famous agenda-setting theory by studying more than 100 million active Twitter users, archives of approximately 160 million active blogs, and 300,000 forums and message boards for one entire calendar year. Besides the size of the datasets in terms of users and documents involved, one year of tracking such data is a step forward from previous studies based on a much more limited time span. The span of the available digital data is limited by the diffusion of the related technologies, but the digitalization of archives of documents, when complete and in a good state, provides researchers with unprecedented opportunities. In general, the continuous recording of digital data means that the ‘recorder is always on’, and we can therefore reconstruct people’s expressed opinions, behaviour and choices over longer periods of time.
Non-reactive heterogeneity, including behavioural data. Another important feature of big data is that, in most cases, they are not collected by means of the direct elicitation of people. While surveys, interviews, experiments, etc. require the active engagement of participants, most digital data are collected in the background. People are barely aware of the process of data collection while they are using services or tools such as smartphones, tablets or PCs. The advantage of this almost invisible footprint is that it makes the Hawthorne effect less likely. At the same time, this opacity of the data collection process does raise ethical concerns, as we will discuss later in the book. There is another important consequence of the invisibility of digital data collection: digital data, and their enhanced large version, big data, are better suited to capturing behavioural information than traditional social scientific instruments. We have already discussed the difference between self-reported and behavioural data. The latter type has traditionally been difficult to collect accurately, for large groups of people and over long periods. Now, let’s take the example of my smartphone’s GPS data: since it was activated, it has accurately tracked my movements, for example over the past year. One year of someone’s movements can reveal a great deal and is therefore an object of privacy protection. However, the possibility of using the aggregated data of a group of people, or of a specific community, in order to study road mobility and urban design issues is already being exploited by some city planners.
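A minimal sketch of the kind of aggregation just described, under invented assumptions: hypothetical GPS pings are snapped to a coarse spatial grid and counted per hour, so that only aggregate mobility, not any individual trajectory, remains. The `grid_cell` function, coordinates and user ids are illustrative, not from any real mobility dataset.

```python
# Illustrative aggregation of GPS pings into (grid cell, hour) counts.
# Coordinates and user ids are hypothetical; the truncation below
# assumes positive (northern/eastern hemisphere) coordinates.
from collections import Counter

def grid_cell(lat: float, lon: float) -> tuple:
    """Snap a coordinate to a ~0.01-degree grid by truncating decimals."""
    return (int(lat * 100), int(lon * 100))

# (user_id, latitude, longitude, hour-of-day) records.
pings = [
    ("u1", 46.0701, 11.1210, 8),
    ("u2", 46.0705, 11.1218, 8),
    ("u1", 46.0920, 11.1305, 9),
]

# Drop user identity and count pings per (cell, hour) bucket, so only
# aggregate mobility patterns remain.
counts = Counter((grid_cell(lat, lon), hour) for _, lat, lon, hour in pings)
print(counts.most_common(1))  # [(((4607, 1112), 8), 2)]
```

In practice, planners would work with far finer time resolution and would still need to guard against re-identification, since even coarse aggregates can leak individual patterns when cell counts are small.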
