In disciplines such as political science, sociology, psychology, and media studies, the use of computational analysis is rapidly increasing. Statistical modeling, machine learning, and other computational techniques are revolutionizing the way electoral results are predicted, social sentiment is measured, consumer interest is evaluated, and much more. Computational Analysis of Communication teaches social science students and practitioners how computational methods can be used in a broad range of applications, providing discipline-relevant examples, clear explanations, and practical guidance.
Assuming little or no background in data science or computational linguistics, this accessible textbook teaches readers how to use state-of-the-art computational methods to perform data-driven analyses of social science issues. A cross-disciplinary team of authors—with expertise in both the social sciences and computer science—explains how to gather and clean data, manage textual, audio-visual, and network data, conduct statistical and quantitative analysis, and interpret, summarize, and visualize the results. Offered in a unique hybrid format that integrates print, ebook, and open-access online viewing, this innovative resource provides clear guidance on leveraging computational techniques to answer social science questions.
Computational Analysis of Communication is an invaluable textbook and reference for students taking computational methods courses in the social sciences, and for professional social scientists looking to incorporate computational methods into their work.
Wouter van Atteveldt, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
Damian Trilling, University of Amsterdam, Amsterdam, Netherlands
Carlos Arcila Calderón, University of Salamanca, Salamanca, Spain
This edition first published 2022
© 2022 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Wouter van Atteveldt, Damian Trilling, and Carlos Arcila Calderón to be identified as the authors of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Library of Congress Cataloging-in-Publication Data
Names: van Atteveldt, Wouter, author. | Trilling, Damian, 1983- author. | Calderón, Carlos Arcila, author. Title: Computational analysis of communication : a practical introduction to the analysis of texts, networks, and images with code examples in Python and R / Wouter van Atteveldt, Vrije Universiteit Amsterdam, Damian Trilling, University of Amsterdam, Carlos Arcila Calderón, University of Salamanca. Description: Hoboken, NJ : John Wiley & Sons, [2022] | Includes bibliographical references and index. Identifiers: LCCN 2021058779 (print) | LCCN 2021058780 (ebook) | ISBN 9781119680239 (paperback) | ISBN 9781119680277 (pdf) | ISBN 9781119680284 (epub) Subjects: LCSH: Social sciences--Network analysis. | Communication--Network analysis. | Computational linguistics--Network analysis. | Communication--Data processing. Classification: LCC HM741 .A88 2021 (print) | LCC HM741 (ebook) | DDC 302.3072--dc23 | LC record available at https://lccn.loc.gov/2021058779 | LC ebook record available at https://lccn.loc.gov/2021058780
Cover image: © 4X image/Getty Images
Cover design by Wiley
Set in 9.5/12.5pt STIXTwoText by Integra Software Services Pvt. Ltd, Pondicherry, India
To our patient spouses
Cover
Title page
Copyright
Dedication
Preface
Acknowledgement
1 Introduction
1.1 The Role of Computational Analysis in the Social Sciences
1.2 Why Python and/or R?
1.3 How to Use This Book
1.4 Installing R and Python
1.4.1 Installing R and RStudio
1.4.2 Installing Python and Jupyter Notebook
1.5 Installing Third-Party Packages
2 Getting Started: Fun with Data and Visualizations
2.1 Fun With Tweets
2.2 Fun With Textual Data
2.3 Fun With Visualizing Geographic Information
2.4 Fun With Networks
3 Programming Concepts for Data Analysis
3.1 About Objects and Data Types
3.1.1 Storing Single Values: Integers, Floating-Point Numbers, Booleans
3.1.2 Storing Text
3.1.3 Combining Multiple Values: Lists, Vectors, And Friends
3.1.4 Dictionaries
3.1.5 From One to More Dimensions: Matrices and n-Dimensional Arrays
3.1.6 Making Life Easier: Data Frames
3.2 Simple Control Structures: Loops and Conditions
3.2.1 Loops
3.2.2 Conditional Statements
3.3 Functions and Methods
4 How to Write Code
4.1 Re-using Code: How Not to Re-Invent the Wheel
4.2 Understanding Errors and Getting Help
4.2.1 Error Messages
4.2.2 Debugging Strategies
4.3 Best Practice: Beautiful Code, GitHub, and Notebooks
5 From File to Data Frame and Back
5.1 Why and When Do We Use Data Frames?
5.2 Reading and Saving Data
5.2.1 The Role of Files
5.2.2 Encodings and Dialects
5.2.3 File Handling Beyond Data Frames
5.3 Data from Online Sources
6 Data Wrangling
6.1 Filtering, Selecting, and Renaming
6.2 Calculating Values
6.3 Grouping and Aggregating
6.3.1 Combining Multiple Operations
6.3.2 Adding Summary Values
6.4 Merging Data
6.4.1 Equal Units of Analysis
6.4.2 Inner and Outer Joins
6.4.3 Nested Data
6.5 Reshaping Data: Wide To Long And Long To Wide
6.6 Restructuring Messy Data
7 Exploratory Data Analysis
7.1 Simple Exploratory Data Analysis
7.2 Visualizing Data
7.2.1 Plotting Frequencies and Distributions
7.2.2 Plotting Relationships
7.2.3 Plotting Geospatial Data
7.2.4 Other Possibilities
7.3 Clustering and Dimensionality Reduction
7.3.1 k-means Clustering
7.3.2 Hierarchical Clustering
7.3.3 Principal Component Analysis and Singular Value Decomposition
8 Statistical Modeling and Supervised Machine Learning
8.1 Statistical Modeling and Prediction
8.2 Concepts and Principles
8.3 Classical Machine Learning: From Naïve Bayes to Neural Networks
8.3.1 Naïve Bayes
8.3.2 Logistic Regression
8.3.3 Support Vector Machines
8.3.4 Decision Trees and Random Forests
8.3.5 Neural Networks
8.4 Deep Learning
8.4.1 Convolutional Neural Networks
8.5 Validation and Best Practices
8.5.1 Finding a Balance Between Precision and Recall
8.5.2 Train, Validate, Test
8.5.3 Cross-validation and Grid Search
9 Processing Text
9.1 Text as a String of Characters
9.1.1 Methods for Dealing With Text
9.2 Regular Expressions
9.2.1 Regular Expression Syntax
9.2.2 Example Patterns
9.3 Using Regular Expressions in Python and R
9.3.1 Splitting and Joining Strings, and Extracting Multiple Matches
10 Text as Data
10.1 The Bag of Words and the Term-Document Matrix
10.1.1 Tokenization
10.1.2 The DTM as a Sparse Matrix
10.1.3 The DTM as a “Bag of Words”
10.1.4 The (Unavoidable) Word Cloud
10.2 Weighting and Selecting Documents and Terms
10.2.1 Removing stop words
10.2.2 Removing Punctuation and Noise
10.2.3 Trimming a DTM
10.2.4 Weighting a DTM
10.3 Advanced Representation of Text
10.3.1 n-grams
10.3.2 Collocations
10.3.3 Word Embeddings
10.3.4 Linguistic Preprocessing
10.4 Which Preprocessing to Use?
11 Automatic Analysis of Text
11.1 Deciding on the Right Method
11.2 Obtaining a Review Dataset
11.3 Dictionary Approaches to Text Analysis
11.4 Supervised Text Analysis: Automatic Classification and Sentiment Analysis
11.4.1 Putting Together a Workflow
11.4.2 Finding the Best Classifier
11.4.3 Using the Model
11.4.4 Deep Learning
11.5 Unsupervised Text Analysis: Topic Modeling
11.5.1 Latent Dirichlet Allocation (LDA)
11.5.2 Fitting an LDA Model
11.5.3 Analyzing Topic Model Results
11.5.4 Validating and Inspecting Topic Models
11.5.5 Beyond LDA
12 Scraping Online Data
12.1 Using Web APIs: From Open Resources to Twitter
12.2 Retrieving and Parsing Web Pages
12.2.1 Retrieving and Parsing an HTML Page
12.2.2 Crawling Websites
12.2.3 Dynamic Web Pages
12.3 Authentication, Cookies, and Sessions
12.3.1 Authentication and APIs
12.3.2 Authentication and Webpages
12.4 Ethical, Legal, and Practical Considerations
13 Network Data
13.1 Representing and Visualizing Networks
13.2 Social Network Analysis
13.2.1 Paths and Reachability
13.2.2 Centrality Measures
13.2.3 Clustering and Community Detection
14 Multimedia Data
14.1 Beyond Text Analysis: Images, Audio and Video
14.2 Using Existing Libraries and APIs
14.3 Storing, Representing, and Converting Images
14.4 Image Classification
14.4.1 Basic Classification with Shallow Algorithms
14.4.2 Deep Learning for Image Analysis
14.4.3 Re-using an Open Source CNN
15 Scaling Up and Distributing
15.1 Storing Data in SQL and noSQL Databases
15.1.1 When to Use a Database
15.1.2 Choosing the Right Database
15.1.3 A Brief Example Using SQLite
15.2 Using Cloud Computing
15.3 Publishing Your Source
15.4 Distributing Your Software as Container
16 Where to Go Next
16.1 How Far Have We Come?
16.2 Where To Go Next?
16.3 Open, Transparent, and Ethical Computational Science
Bibliography
Index
End User License Agreement
Chapter 1
Figure 1.1 RStudio Desktop.
Figure 1.2 Jupyter Notebook.
Chapter 4
Figure 4.1 An online discussion in Stackover...
Figure 4.2 The online repository GitHub...
Figure 4.3 Markdown (left) and Jupyter (right)...
Figure 4.4 Jupyter notebook in Google Colab...
Chapter 5
Figure 5.1 A csv file opened in a text editor...
Chapter 8
Figure 8.1 Underfitting and overfitting. Example adapted from...
Figure 8.2 Visual representation of a confusion matrix.
Figure 8.3 The sigmoid function.
Figure 8.4 A simple decision tree.
Figure 8.5 Schematic representation of a typical...
Figure 8.6 A neural network.
Figure 8.7 Simplified example of a Convolutional...
Figure 8.8 A (pretty good) ROC curve.
Chapter 11
Figure 11.1 Latent Dirichlet Allocation in...
Chapter 14
Figure 14.1 A screen shot of a real-time video analyzed by...
Figure 14.2 A photograph of refugees on a lifeboat, used as an input...
Figure 14.3 Representation of the matrix data structure of...
Figure 14.4 Semantic segmentation...
Figure 14.5 First 10 handwritten digits from the...
Figure 14.6 Examples of Fashion MNIST items.
Chapter 15
Figure 15.1 Creating a Virtual Machine on Microsoft
Figure 15.2 Running a script on a virtual machine
Chapter 2
Example 2.1 Retrieving cached tweets about COVID.
Example 2.2 Barplot of tweets over time. Note that...
Example 2.3 My First Tag Cloud.
Example 2.4 Topic Model of the COVID tags. Note...
Example 2.5 Location of COVID tweets.
Example 2.6 Corpus comparison: North...
Example 2.7 Retweet network in COVID tweets...
Example 2.8 Reply Network of Tweets.
Chapter 3
Example 3.1 Determining the type of an object.
Table 3.1 Most used basic data types in Python and R.
Example 3.2 Some simple operations.
Example 3.3 Floating point numbers, integers, and boolean values.
Example 3.4 Strings and bytes.
Example 3.5 Collections arrays...
Example 3.6 Slicing vectors and converting data types.
Example 3.7 Some more operations on one-dimensional arrays.
Example 3.8 R enforces that all elements of a...
Example 3.9 Lists can store very different objects...
Example 3.10 The (unexpected) behavior of mutable objects.
Example 3.11 Sets.
Example 3.12 Key-value pairs in Python dictionaries and R named lists
Example 3.13 Working with two- or n-dimensional arrays.
Example 3.14 Creating a simple data frame.
Example 3.15 For-loops let you repeat operations.
Example 3.16 List comprehensions are very popular in Python.
Example 3.17 A simple conditional control structure.
Example 3.18 A more complex conditional control structure.
Example 3.19 Writing functions.
Example 3.20 Functions are particularly useful when used repeatedly.
Example 3.21 Generators behave like lists in that you can iterate...
Chapter 4
Example 4.1 Error handling.
Chapter 5
Example 5.1 Creating a data frame from other data structures.
Table 5.1 Basics of data frame handling.
Example 5.2 Reading files into a data frame.
Example 5.3 Reading files without data frames.
Example 5.4 More examples of reading from and writing to files.
Example 5.5 In Python, scikit-learn has a convenience function to...
Example 5.6 A collection of US...
Chapter 7
Example 7.1 Load data from Eurobarometer survey and select some variables.
Example 7.2 Absolute and relative frequencies of...
Example 7.3 Drop missing values.
Example 7.4 Cross tabulation of support of refugees...
Example 7.5 Barplot of support for refugees.
Example 7.6 Histogram of Age.
Example 7.7 Boxplots of age by country.
Example 7.8 Line graph of average support for refugees by day.
Example 7.9 Plotting multiple lines in one graph.
Example 7.10 Creating subfigures
Example 7.11 Scatterplot of average support...
Example 7.12 Scatterplot with regression line.
Example 7.13 Pearson correlation coefficient.
Example 7.14 Create a data frame to plot the heatmap.
Example 7.15 Heatmap of country gender and support for refugees.
Example 7.16 Add ribbons to the line graph of...
Example 7.17 Simple world map.
Example 7.18 Select EU countries and join the map with Eurobarometer data.
Example 7.19 Map of Europe with the average level...
Example 7.20 Getting the optimal number of clusters.
Example 7.21 Using Kmeans to group countries based on the average...
Example 7.22 Visualization of clusters.
Example 7.23 Using hierarchical clustering to group countries...
Example 7.24 Dendrogram to visualize the hierarchical clustering.
Example 7.25 Re-run hierarchical clustering with three clusters.
Example 7.26 Re-run hierarchical clustering with three clusters.
Example 7.27 Principal component analysis...
Example 7.28 Plot PC1 and PC2.
Example 7.29 Proportion of variance explained.
Example 7.30 Plot of the proportion of variance explained.
Example 7.31 Cumulative explained variance.
Example 7.32 Plot of the cumulative explained variance.
Example 7.33 Combining PCA to reduce dimensionality...
Chapter 8
Example 8.1 Obtaining a model through estimating an OLS regression.
Example 8.2 Using the OLS model we estimated before...
Table 8.1 Some common machine learning terms explained.
Example 8.3 Preparing a dataset for supervised machine learning.
Example 8.4 A simple Naïve Bayes classifier.
Example 8.5 Calculating precision and recall.
Example 8.6 A simple logistic regression classifier.
Example 8.7 A simple Support Vector Machine classifier.
Example 8.8 A simple Random Forest classifier.
Example 8.9 Choosing a different cutoff point for predictions with logist...
Example 8.10 The ROC curve of a (not very impressive)...
Example 8.11 Crossvalidation.
Example 8.12 A simple gridsearch in Python.
Example 8.13 A simple gridsearch in Python using multiple CPUs.
Example 8.14 A gridsearch in R. Note that in R, not all parameters are...
Chapter 9
Example 9.1 Internal representation of single and multiple texts.
Example 9.2 Some basic text cleaning approaches.
Example 9.3 Using regular expressions to clean a text.
Example 9.4 Using regular expressions on a data frame.
Example 9.5 Splitting, extracting, and joining a single text.
Example 9.6 Applying split and extract_all on text columns.
Table 9.1 Useful strings operations in R and Python to clean noise.
Table 9.2 Regular expression syntax.
Table 9.3 Regular expression’s syntax.
Table 9.4 Regular expression functions and methods.
Chapter 10
Example 10.1 Example document-term matrix.
Example 10.2 Differences between tokenizers.
Example 10.3 Tokenization of Japanese verse.
Example 10.4 Example document-term matrix.
Example 10.5 A look inside the DTM.
Example 10.6 Word cloud of the US State of the Union corpus.
Example 10.7 Top words used in Trump Tweets.
Example 10.8 Simple stop word removal.
Example 10.9 Inspecting and Customizing stop word lists.
Example 10.10 Cleaning a single tweet at the text and token level.
Example 10.11 Cleaning the whole corpus and making a tag cloud.
Example 10.12 Trimming a Document-Term Matrix.
Example 10.13 Tf.Idf weighting.
Example 10.14 Generating n-grams.
Example 10.15 Words and bigrams containing...
Example 10.16 Identifying and applying....
Example 10.17 Using word embeddings for...
Example 10.18 Using UDPipe to analyze a sentence.
Example 10.19 Nouns used in the most recent State...
Example 10.20 Using Spacy to analyze a Spanish sentence.
Table 10.1 Overview of part-of-speech (POS) tags.
Chapter 11
Example 11.1 Downloading and caching IMDB review data.
Example 11.2 Different approaches to a simple dictionary-based sentiment...
Example 11.3 Training a Naïve Bayes...
Example 11.4 An example of a custom function to...
Example 11.5 Instead of fitting vectorizer and...
Example 11.6 A gridsearch to find the...
Example 11.7 For the sake of comparison...
Example 11.8 Saving and loading a vectorizer and a classifier.
Example 11.9 Using eli5 to understand text classification results.
Example 11.10 Using eli5 to explain a prediction.
Example 11.11 Dutch sentiment data (Modified...
Example 11.12 Deep Learning:...
Example 11.13 Deep Learning: Training and Testing the model.
Example 11.14 LDA Topic Model of Obama’s State...
Example 11.15 Analyzing and inspecting LDA results.
Example 11.16 Computing perplexity and coherence of topic models.
Chapter 12
Example 12.1 Retrieving JSON data from the Google Books API.
Example 12.2 Transforming the data into a data frame.
Example 12.3 Full script including pagination.
Example 12.4 Parsing websites using XPATHs or CSS selectors.
Example 12.5 Getting the text of an HTML element versus getting the...
Example 12.6 Parsing link texts and links.
Example 12.7 Specifying a user agent to pretend to be a specific browser.
Example 12.8 Generating a list of URLs that follow the same pattern.
Example 12.9 Crawling a website.
Example 12.10 Dumping the HTML source to a file.
Example 12.11 Using Selenium in Python to open a browser window...
Example 12.12 Passing a key as...
Example 12.13 Explicitly setting a cookie to circumvent a cookie wall.
Example 12.14 Shorter version of Example 12.13 for single requests.
Table 12.1 Overview of CSS Select and XPath syntax.
Chapter 13
Example 13.1 Creating a graph from scratch.
Example 13.2 Visualization of an undirected graph.
Example 13.3 Creating a directed graph.
Example 13.4 Visualization of a directed graph.
Example 13.5 Visualization of a weighted graph.
Example 13.6 Visualization of a weighted graph including vertex sizes.
Example 13.7 Induced subgraphs for Democrats and Republicans.
Example 13.8 Reading a graph from a file.
Example 13.9 Possible paths between two nodes in the imaginary Facebook...
Example 13.10 Visualization of a circuit.
Example 13.11 A network with two components.
Example 13.12 Estimating distances in the network.
Example 13.13 Incident edges and neighbors of...
Example 13.14 Incident edges and neighbors of...
Example 13.15 Computing degree centralities in...
Example 13.16 Estimations of closeness, eigenvector...
Example 13.17 Using the degree centrality to change...
Example 13.18 Finding all the maximal cliques in an undirected graph.
Example 13.19 Dendrogram to visualize clustering with Girvan–Newman.
Example 13.20 Community detection with Girvan–Newman.
Example 13.21 Community detection with Louvain...
Example 13.22 Plotting clusters with Greedy optimization...
Example 13.23 Loading and analyzing a real...
Example 13.24 Visualizing the network of Spanish politicians and...
Chapter 14
Example 14.1 Loading JPG and PNG pictures as objects.
Example 14.2 Converting images to gray-scale and creating...
Example 14.3 Converting images to RGB color model....
Example 14.4 Resize to 25% and visualize a picture.
Example 14.5 Resize to 224 x 224 and visualize a picture.
Example 14.6 Function to crop the image to create a...
Example 14.7 Rotating a picture 45 degrees.
Example 14.8 Comparing two flattened...
Example 14.9 Loading MNIST dataset and preparing training and test sets.
Example 14.10 Modeling the handwritten digits with...
Example 14.11 Loading Fashion MNIST dataset...
Example 14.12 Creating the architecture of the MLP with Keras.
Example 14.13 Compiling, fitting, and evaluating the model for the MLP.
Example 14.14 Predicting classes using the MLP.
Example 14.15 Loading and visualizing the ResNet50 architecture.
Example 14.16 Cropping an image to get a picture of a sea landscape.
Example 14.17 Predicting the class of the first image.
Example 14.18 Cropping an image to get a picture of refugees in a lifeboat.
Example 14.19 Predicting the class of the second image.
Table 14.1 Some computer vision concepts used in...
Chapter 15
Example 15.1 SQLite offers you database...
Why write another methods textbook? Aren’t there enough textbooks already? And what about all the great online resources? We have been teaching computational analysis of communication for years at various universities and other organizations. These courses used different formats, ranged from semester-long courses to short workshops, used different techniques, and were taught at different levels – but we never found the book that really fit our audience. Regularly, students and colleagues ask us for book recommendations, and educators and administrators want to know which book to put on a reading list. And regularly, our answer has been along the lines of: Well, there is this great book on [R/Python/Neural Networks/…], but ….
The “but”, in almost all cases, has to do with the audience: students of the social sciences who have at least some knowledge of and interest in empirical research and quantitative methods, but have no experience in programming. They do want to (or have to) learn programming to conduct the analyses they are interested in, but are not necessarily interested in programming for its own sake. They do not want to just push a button in some tool that limits their possibilities to what someone else has designed, but they also do not want to follow a whole Introduction to Computer Science with a comprehensive overview of programming concepts and paradigms that they might never need.
For years, we have therefore used our own materials to find a balance between teaching programming concepts where necessary and focusing on their application for answering questions that are of genuine interest to those studying various forms of communication. This book is our attempt to bring together and systematize this approach to teaching the analysis of communication.
A second driver for writing this book was to get over the “language war” that is sometimes visible in the field. In our own research and teaching, we find both R and Python to be great tools, each with their strengths and weaknesses. Too often, existing teaching materials focus on the language rather than the underlying concept. We believe that a good computational methods textbook should give practical instructions on the implementation of a concept in a given language, but put the concept rather than the language at the forefront. For that reason, we decided to use R and Python side by side, allowing students (and professors) to choose either – and to allow interested readers to view the differences and similarities between the languages.
Writing this book has also been an exercise in planning and coordination. With two of us located in Amsterdam and one in Salamanca, we had many video calls to divide tasks and discuss each other’s drafts. One can hardly call us tech-averse, but nothing is as productive (and nice) as sitting together in a room, as we experienced during a writing weekend on the island of Texel. The COVID-19 pandemic, though, cancelled all plans for further in-person writing, and with many unexpected priorities suddenly emerging, it took more time – and many more online meetings – for the final version of the book to see the light of day.
This book would not have been possible without the continuous input we received over the years – from students, colleagues, and others. They shaped our ideas both on how to analyze communication computationally and on how to teach it. It would also not have been possible without the patience of Nel, Rodrigo, and Sanne, when we again had to spend more hours than we thought on what at one point simply became known as “the book”.
Wouter van Atteveldt
Damian Trilling
Carlos Arcila Calderón
Amsterdam, Salamanca, Texel, & online
We would like to thank colleagues, friends, and students who provided feedback and input on earlier versions of parts of the manuscript: Dmitry Bogdanov, Andreu Casas, Modesto Escobar, Anne Kroon, Nicolas Mattis, Cecil Meeusen, Jesús Sánchez-Oro, Nel Ruigrok, Susan Vermeer, Mehdi Zamani, Rodrigo de la Barra, and Holli Semetko. We also want to thank the editors, copy-editors, and the other great people at Wiley as well as the initial reviewers for their help and confidence.
For an earlier version of the example for web scraping with Selenium, we would like to thank Marthe Möller.
And, of course, all others that we might have forgotten to mention here (sorry!).
Abstract
This chapter explains how the methods outlined in this book are situated within the methodological and epistemological frameworks used by social scientists. It argues why the use of Python and R is fundamental for the computational analysis of communication. Finally, it shows how this book can be used by students and scholars.
Keywords: computational social science, Python, R
Understand the role of computational analysis in the social sciences
Understand the choice between Python and/or R
Know how to read this book
The use of computers is nothing new in the social sciences. In fact, one could argue that some disciplines within the social sciences have even been early adopters of computational approaches. Take the gathering and analyzing of large-scale survey data, dating back to the use of the Hollerith Machine in the 1890 US census. Long before every scholar had a personal computer on their desk, social scientists were using punch cards and mainframe computers to deal with such data. If we think of the analysis of communication more specifically, we already see attempts to automate content analysis in the 1960s (see, e.g., Scharkow, 2017).
However, something has profoundly changed in recent decades. The amount and type of data we can collect as well as the computational power we have access to have increased dramatically. In particular, digital traces that we leave when communicating online, from access logs to comments we place, have required new approaches (e.g., Trilling, 2017). At the same time, better computational facilities now allow us to ask questions we could not answer before.
González-Bailón (2017), for instance, argued that the computational analysis of communication now allows us to test theories that were formulated a century ago, such as Tarde’s theory of social imitation. Salganik (2019) tells an impressive methodological story of continuity in showing how new digital research methods build on and relate to established methods such as surveys and experiments, while offering new possibilities by observing behavior in new ways.
A frequent misunderstanding, then, about computational approaches is that they would somehow be a-theoretical. This is probably fueled by clichés coined during the “Big Data” hype of the 2010s, such as the infamous saying that in the age of Big Data, correlation is enough (Mayer-Schönberger and Cukier, 2013); but one could not be more wrong: as the work of Kitchin (2014a, b) shows, computational approaches can be well situated within existing epistemologies. For the field to advance, computational and theoretical work should be symbiotic, with each informing the other and with neither claiming superiority (Margolin, 2019). Thus, the computational scientist’s toolbox includes both more data-driven and more theory-driven techniques; some are more bottom-up and inductive, others are more top-down and deductive. What matters here, and what is often overlooked, is in which stage of the research process they are employed. In other words, both inductive and deductive approaches as they are distinguished in more traditional social-science textbooks (e.g., Bryman, 2012) have their equivalent in the computational social sciences.
Therefore, we suggest thinking of the data collection and data analysis process as a pipeline. To test, for instance, a theoretically grounded hypothesis about personalization in the news, we could imagine a pipeline that starts with scraping online news, proceeds with some natural-language processing techniques such as Named Entity Recognition, and finally tests whether the mention of persons has an influence on the placement of the stories. We can distinguish here between parts of the pipeline that are just necessary but not inherently interesting to us, and parts of the pipeline that answer a genuinely interesting question. In this example, the inner workings of the Named Entity Recognition step are not genuinely interesting for us – we just need to do it to answer our question. We do care about how well it works and especially which biases it may have that could affect our substantive outcomes, but we are not really evaluating any theory on Named Entity Recognition here. We are, however, answering a theoretically interesting question when we look at the pipeline as a whole, that is, when we apply the tools in order to tackle a social scientific problem. Of course, what is genuinely interesting depends on one’s discipline: for a computational linguist, the inner workings of the named entity recognition may actually be the interesting part, and our research question just one possible “downstream task”.
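To make this idea more concrete, the following minimal sketch (ours, not one of the book's numbered examples) shows what such a pipeline could look like in Python. It assumes spaCy and its small English model are installed (pip3 install spacy, then python3 -m spacy download en_core_web_sm); the headlines and page numbers are invented purely for illustration.

import spacy

# load a pretrained English pipeline that includes Named Entity Recognition
nlp = spacy.load("en_core_web_sm")

# step 1: gathered data (hard-coded here; in a real pipeline this would be scraped)
articles = [
    {"headline": "Prime minister defends new climate plan", "page": 1},
    {"headline": "Angela Merkel visits flood-hit regions", "page": 3},
    {"headline": "Stock markets rally after interest rate cut", "page": 12},
]

# step 2: Named Entity Recognition -- count PERSON mentions per headline
for article in articles:
    doc = nlp(article["headline"])
    article["n_persons"] = sum(ent.label_ == "PERSON" for ent in doc.ents)

# step 3: relate person mentions to story placement; here we only print the
# values, but this is where the actual hypothesis test would happen
for article in articles:
    print(article["page"], article["n_persons"], article["headline"])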
This distinction is also sometimes referred to as “building a better mousetrap” versus “understanding”. For instance, Breiman (2001) remarked: “My attitude toward new and/or complicated methods is pragmatic. Prove that you’ve got a better mousetrap and I’ll buy it. But the proof had better be concrete and convincing.” (p. 230). In contrast, many social scientists are using statistical models to test theories and to understand social processes: they want to specifically understand how x relates to y, even if y may be better predicted by another (theoretically uninteresting) variable.
This book is to some extent about both building mousetraps and understanding. When you are building a supervised machine learning classifier to determine the topic of each text in a large collection of news articles or parliamentary speeches, you are building a (better) mousetrap. But as a social scientist, your work does not stop there. You need to use the mousetrap to answer some theoretically interesting question.
Actually, we expect that the contents of this book will provide a background that helps you to face current research challenges in both academia and industry. On the one hand, the emerging field of Computational Social Science has become one of the most promising areas of knowledge, and many universities and research institutes are looking for scholars with this profile. On the other hand, it is widely known that computational skills will nowadays increase your job opportunities in private companies, public organizations, or NGOs, given the growing interest in data-driven solutions.
When planning this book, we needed to make a couple of tough choices. We aimed to at least give an introduction to all techniques that students and scholars who want to computationally analyze communication will probably be confronted with. Of course, specific – technical – literature on techniques such as, for instance, machine learning can cover the subject in more depth, and the interested student may indeed want to dive more deeply into one or several of the techniques we cover. Our goal here is to offer enough working knowledge to apply these techniques and to know what to look for. While trying to cover the breadth of the field without sacrificing too much depth when covering each technique, we still needed to draw some boundaries. One technique that some readers may miss is agent-based modeling (ABM). Arguably, such simulation techniques are an important technique in the computational social sciences more broadly (Cioffi-Revilla, 2014), and they have recently been applied to the analysis of communication as well (Waldherr, 2014; Wettstein, 2020). Nevertheless, when reviewing the curricula of current courses teaching the computational analysis of communication, we found that simulation approaches do not seem to be at the core of such analyses (yet). Instead, when looking at the use of computational techniques in fields such as journalism studies (e.g., Boumans and Trilling, 2016), media studies (e.g., Rieder, 2017), or the text-as-data movement (Grimmer and Stewart, 2013), we see a core of techniques that are used over and over again, and that we have therefore included in our book. In particular, besides general data analysis and visualization techniques, these are techniques for gathering data such as web scraping or the use of APIs; techniques for dealing with text such as natural language processing and different ways to turn text into numbers; supervised and unsupervised machine learning techniques; and network analysis.
By far most work in the computational social sciences is done using Python and/or R. Sure, for some specific tasks there are standalone programs that are occasionally used; and there are some useful applications written in other languages such as C or Java. But we believe it is fair to say that it is very hard to delve into the computational analysis of communication without learning at least either Python or R, and preferably both of them. There are very few tasks that you cannot do with at least one of them.
Some people have strong beliefs as to which language is “better” – we do not subscribe to that view. Most techniques that are relevant to us can be done in either language, and personal preference is a big factor. R started out as a statistical programming environment, and that heritage is still visible, for instance in the strong emphasis on vectors, factors, et cetera, or the possibility of estimating complex statistical models in just one line of code. Python started out as a general-purpose programming language, which means that some of the things we do feel a bit more “low-level” – Python abstracts away less of the underlying programming concepts than R does. This sometimes gives us more flexibility – at the cost of being more wordy. In recent years, however, Python and R have been growing closer to each other: with modules like pandas and statsmodels, Python now has R-like functionality for handling data frames and estimating common statistical models on them; and with packages such as quanteda, the handling of text – traditionally a strong domain of Python – has become more accessible in R.
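To illustrate this convergence, the following short snippet (our own sketch with made-up numbers, not a numbered example from the book) uses pandas and statsmodels to build a small data frame and estimate an ordinary least squares regression with an R-style formula, much like lm(support ~ age) in R.

import pandas as pd
import statsmodels.formula.api as smf

# a tiny invented data set: age and support for some policy
df = pd.DataFrame({
    "age": [21, 35, 42, 58, 63, 30],
    "support": [7, 5, 6, 3, 2, 6],
})

# formula-based OLS regression, comparable to lm(support ~ age) in R
model = smf.ols("support ~ age", data=df).fit()
print(model.summary())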
This is the main reason why we decided to write this “bi-lingual” book. We wanted to teach techniques for the computational analysis of communication, without enforcing a specific implementation. We hope that the reader will learn from our book, say, how to transform a text into features and how to choose an appropriate machine learning model, but will find it of less importance in which language this happens.
However, sometimes there are good reasons to choose one language over the other. For instance, many machine learning models in R’s popular caret package create a dense matrix under the hood, which severely limits the number of documents and features one can use; also, some complex web scraping tasks may be easier to realize in Python. On the other hand, R’s data wrangling and visualization techniques in the tidyverse environment are known for their user-friendliness and quality. In the rare cases where we believe that R or Python is clearly superior for a given task, we indicate this; for the rest, we believe that it is up to the reader to choose.
This book differs from more technically oriented books on the one hand and more conceptual books on the other. We do cover the technical background that is necessary to understand what is going on, but we keep both computer science concepts and mathematical concepts to a minimum. For instance, if we had written a more technical book about programming in Python, we would have introduced concepts such as classes, inheritance, and instances of classes much earlier and in more detail. Instead, we decided to provide such information only as additional background where necessary and to focus, rather pragmatically, on the application of techniques for the computational analysis of communication. Vice versa, if we had written a more conceptual book on new methods in our field, we would have given more emphasis to epistemological aspects and skipped the programming examples, which are now at the core of this book.
We do not expect much prior knowledge from the readers of this book. Sure, some affinity with computers helps, but there is no strict requirement on what you need to know. Also in terms of statistics, it helps if you have heard of concepts such as correlation or regression analysis, but even if your knowledge here is rather limited, you should be able to follow along.
This also means that you may be able to skip chapters. For instance, if you already work with R and/or Python, you may not need our detailed installation instructions at the beginning. Still, the book follows a logical order in which chapters build on previous ones. For instance, when explaining supervised machine learning on textual data, we expect you to be familiar with previous chapters that deal with machine learning in general, or with the handling of textual data.
This book is designed in such a way that it can be used as a textbook for introductory courses on the computational analysis of communication. Often, such courses will be at the graduate level, but it is equally possible to use this book in an undergraduate course, perhaps skipping some parts that go too deep. All code examples are not only printed in this book, but also available online. Students as well as social scientists who want to brush up their skillset should therefore also be able to use this book for self-study, without a formal course around it. Lastly, this book can also be a reference for readers asking themselves: “How do I do this again?”. In particular, if the main language you work in is R, you can look up how to do similar things in Python and vice versa.
Code examples
Regardless of the context in which you use this book, one thing is for sure: The only way to learn computational analysis methods is by practicing and playing around. For this reason, the code examples are probably the most important part of the book. Where possible, the examples use real world data that is freely available on the Internet. To make sure that the examples still work in five years’ time, we generally provide a copy of this data on the book website, but we also provide a link to the original source.
One thing to note is that, to avoid unnecessary repetition, the examples are sometimes designed to continue on earlier snippets from that chapter. So, if you seem to be missing a data set, or if some package is not imported yet, make sure you run all the code examples from that chapter.
Note that although it is possible to copy-paste the code from the website accompanying this book, we would actually recommend typing the examples yourself. That way, you are more conscious of the commands you are using and you are adding them to your “muscle memory”.
Finally, realize that the code examples in this book are just examples. There are often multiple ways to do something, and our way is not necessarily the only good (let alone the best) way. So, after you get an example to work, spend some time playing around with it: try different options, maybe try it on your own data, or try to achieve the same result in a different way. The most important thing to remember is: you can’t break anything! So just go ahead, have fun, and if nothing works anymore you can always start over from the code example in the book.
R and Python are the most popular programming languages that data scientists and computational scholars have adopted to conduct their work. While many develop a preference for one or the other language, the chances are good that you will ultimately switch back and forth between them, depending on the specific task at hand and the project you are involved in.
Before you can start analyzing data and communication in Python or R, you need to install interpreters for these languages (i.e., programs that can read code in these languages and execute it) on your computer. Interpreters for both Python and R are open source and completely free to download and use. Although there are various web-based services on which you can run code for both languages (such as Google Colab or RStudio Cloud), it is generally better to install an interpreter on your own computer.
After installing Python or R, you can execute code in these languages, but you will also want a nice Integrated Development Environment (IDE) to develop your data analysis scripts. For R we recommend RStudio, which is free to install and is currently the most popular environment for working with R. For Python we recommend starting with JupyterLab or Jupyter Notebook, browser-based environments for writing and running Python code. All of these tools are available and well documented for Windows, MacOS, and Linux. After the instructions for installing R and Python, there is a very important section on installing packages. If you plan to only use either R or Python (for now), feel free to skip the part about the other language.
If you are writing longer Python programs (as opposed to, for instance, short data analysis scripts) you probably want to install a full-blown IDE as well. We recommend PyCharm for this, which has a free version with everything you need; the premium version is also free for students and academic or open source developers. See their website for download and installation instructions.
Anaconda
An alternative to installing R, Python, and optional libraries separately and as you need them (which we will explain later in this chapter) is to install the so-called Anaconda Distribution, one of the most used and extensive platforms to perform data science. Anaconda is free and open-source, and is conceived to run Python and R code for data analysis and machine learning. Installing the complete Anaconda Distribution on your computer provides you with everything that you need to follow the examples in this book and includes development environments such as Spyder, Jupyter, and RStudio. It also includes a large set of pre-installed packages often used in data science and its own package manager, conda, which will help you to install and update other libraries or dependencies. In short, Anaconda bundles almost all the important software to perform computational analysis of communication.
So, should you install Anaconda, or should you install all software separately as outlined in this chapter? It depends. On the pro side, by downloading Anaconda you have everything installed at once and do not have to worry about dependencies (e.g., Windows users usually do not have a C compiler installed, but some packages may need it). On the con side, it is huge, it installs many things you do not need, and you essentially get a non-standard installation in which programs and packages are stored in different locations than you (or your computer) may expect. Nowadays, as almost all computers already have some version of Python installed (even though you may not know it), you can also end up in a possibly confusing situation where it is unclear which version you are actually running, or for which version you installed a package. For this reason, our recommendation is not to use Anaconda unless it is already installed or you have a specific reason to do so (for example, if your professor requires you to use it).
First, we will install R and its most popular IDE, RStudio, and we will learn how to install additional packages and how to run a script. R is an object-based programming language oriented toward statistical computing that can be used for most stages of the computational analysis of communication. If you are completely new to R, but familiar with other popular statistical packages in the social sciences (such as SPSS or Stata), you will find that you can perform many familiar statistical operations in R. If you are not familiar with other statistical packages, do not panic: we will guide you from the very beginning. Unlike much traditional software that requires just a single initial installation, when working with R we will first install the raw programming language and then continue to install additional components during our journey. It might sound cumbersome, but in fact it will make your work more powerful and flexible, since you will be able to choose the best way to interact with R and, especially, to select the packages that are suitable for your project.
Now, let us install R. The easiest way is to go to the RStudio CRAN page at https://cran.rstudio.com/. Click on the link for installing R for your operating system, and install the latest version. If you use Linux, you may want to install R via your package manager. For Ubuntu Linux, it is best to follow the instructions at https://cran.r-project.org/bin/linux/ubuntu/.
After installing R, let us immediately install RStudio Desktop (the free version). Go to https://rstudio.com/products/rstudio/download/#download and download and run the installer for your computer. If you open RStudio you should get a screen similar to Figure 1.1. If this is the first time you open RStudio, you probably won’t see the top left pane (the scripts); you can create that pane by creating a new R script via the File menu or with the green plus icon in the top left corner.
Figure 1.1 RStudio Desktop.
Of the four panes in RStudio, you will probably spend most time in the top left pane, where you can view and edit your analysis scripts. A script is simply a list of commands that the computer should execute one after the other, for example: open your data, do some computations, and make a nice graph.
To run a line of code, you can place your cursor anywhere on that line and click the Run icon or press control+Enter. To try that, type the following into your newly opened script:
print("Hello world")
Now, place your cursor on that line and press Run (or control+Enter). What happens is that the line is copied to the Console in the bottom left corner and executed. So, the results of your commands (and any error messages) will be shown in this console view.
In contrast to most traditional programming languages, the easiest way to run R code is line by line. You can simply place your cursor on the first line, and repeatedly press control+Enter, which executes a line and then places the cursor on the next line. You can also select multiple lines (or part of a line) to execute those commands together, but in general it is easier to check that everything is going as planned if you run the code line by line.
You can also write commands directly in the console and execute them (by pressing Enter). This can be useful for trying things out or to run things that only need to be run once, but in general we would strongly recommend typing all your commands in a script and then executing them. That way, the script serves as a log of the commands you used to analyze your data, so you (or a colleague) can read and understand how you did the analyses.
RStudio Projects
A very good idea to organize your data and code is to work with RStudio Projects. In fact, we recommend that you now create a new, empty project for the examples in this book. To do this, click on the Project button in the top right and select “New Project”. Then, select New Directory and New Project and enter a name for this project, and a parent folder if you don’t want it in your Documents. Using a project means that the scripts and data files for your project are all in the same location and you don’t need to mess around with specifying the locations of files (which will probably be different for someone else or on a different computer). Moreover, RStudio remembers which files you were editing for each project, so if you are working on multiple projects it’s very easy to switch between them. We recommend creating a project now for the book (and/or for any projects you are working on), and always switching to a project when you open RStudio.
On the right side of the RStudio workspace you will find two additional windows. In the top right pane there are two or more tabs: environment and history (depending on additional packages you may have installed, there may be more). In environment you can manage your workspace (the set of elements you need for data analysis) and see a list of the objects you have loaded into it. You may also import datasets with this tool. In the history tab you have an inventory of code executions, which you can save to a file, or move directly to the console or to an R document.
Note that in the environment you can save and load your “workspace” (all data in the computer memory). However, relying on this functionality is often not a good idea: it will only save the state of your current session, whereas you will most likely want to save your R syntax file and/or your data instead. If you have your raw input data (e.g., as a csv file, see Chapter 5) and your analysis script, you can always reproduce what you have been doing. If you only have a snapshot of your workspace, you know the state in which you arrived, but cannot necessarily reproduce (or change) how you got there.
In the bottom right pane there are five additional useful tabs. In files you can explore your computer and manage all the files you may use for the project, including importing datasets. In plots, help, and viewer you can inspect, respectively, the figures you have produced, documentation, and other rendered output of your scripts. Finally, the packages tab will be of great utility, since it lets you install or update packages from CRAN, or even from a file saved on your computer, with a friendly interface.
Python is an object-oriented programming language and it is probably the favorite language of computational and data scientists in all disciplines around the world. There are different releases of Python, but the biggest difference used to be between Python 2 and Python 3. Fortunately, you will probably never need to install or use Python 2; in fact, since January 2020 it is no longer supported. Thus, you can just use any recent Python 3 version for this book. When browsing through questions on online fora such as Stackoverflow or reading other people’s code on Github (we will talk about that in Chapter 4), you may still come across legacy code in Python 2. Such code usually does not run directly in a Python 3 interpreter, but in most cases only minor adaptations are necessary to make it work.
We will install and run Python and Jupyter Notebook using a terminal or command line interface. This is a tool that is installed on all computers that allows you to enter commands to the computer directly. First, create a project folder for this book using the File Explorer (Windows) or Finder (MacOS). Then, on Windows you can shift + Right click that folder and select “Open command Window here”. On MacOS, after navigating to the folder you just created, you click on “Finder” in the menu at the top of the screen, then on “Services”, then on “New Terminal at Folder.” In both cases, this should open a new window (usually black or gray) that allows you to type commands.
Note that on most computers, Python is already installed by default. You can check this by typing the following command in your terminal:
python3 --version
On some versions of Windows, you may need to use py instead of python3:
py --version
In either case, the output of this command should be something like Python 3.8.5. If python --version also returns this version, you are free to use either command (but on older systems python can still refer to Python 2, so make sure that you are using Python 3 for this book!).
If Python is not installed on your system, go to https://www.python.org/downloads/windows/ or https://www.python.org/downloads/mac-osx/ and download and install the latest stable release (which at the time of writing is 3.9.0). After installing it, open a terminal again and run the command above to verify that it is installed correctly.
Included in any recent Python install is pip, the program that you will use for installing Python packages. You can check that pip is installed correctly by typing the following command on your terminal:
pip3 --version
This should report something like pip 20.0.2 from … (python 3.8). Again, if pip reports the same version you can also use it instead of pip3. On some systems pip3 will not work, so use pip in that case (but make sure to check that it points to Python 3).
Installing Jupyter Notebook. Next, we will install Jupyter Notebook, which you can use to run all the examples in this book and which is a great environment for developing Python data analysis scripts. Jupyter Notebooks (which are also included in the IDE JupyterLab, if you installed that) run as a web application that allows you to create documents containing code and inline text fragments. One of the nicest things about the Jupyter Notebook is that the code is inserted in fields (so-called “cells”) that you can run one by one, getting their respective output, which, together with the narrative text, makes your script cleaner and more reproducible. You can also add formatted text blocks (using a simple formatting language called Markdown) to explain to the reader what you are doing. In Section 4.3, we will address notebooks again as a good practice for a computational scientist.
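For a first impression of what this looks like, here is a hypothetical (invented by us) pair of cells. A Markdown cell might contain formatted text such as:

## Word counts
This notebook counts the words in a short example sentence.

and the following code cell, when run (for instance with Shift+Enter), prints its output directly below the cell:

text = "computational analysis of communication"
print(len(text.split()))  # prints 4, the number of words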
You can install Jupyter Notebook directly with pip by running the following command in a terminal:
pip3 install notebook
Now, you can run Jupyter by executing the following command on the terminal:
jupyter notebook
This will print some useful information, including the URL at which you can access the notebook. However, it should also directly open this URL in a browser (e.g., Chrome) so you can start working right away. In your browser you should see the Jupyter main screen, similar to the middle window in Figure 1.2. Create a new notebook by clicking on the New button in the top right and selecting Python 3. This should open a window similar to the bottom window in Figure 1.2.
Figure 1.2 Jupyter Notebook.
In Jupyter, code is entered into cells. First, type
