Mastering Social Media Mining with Python

Marco Bonzanini

Description

Acquire and analyze data from all corners of the social web with Python

About This Book

  • Make sense of highly unstructured social media data with the help of the insightful use cases provided in this guide
  • Use this easy-to-follow, step-by-step guide to apply analytics to complicated and messy social data
  • This is your one-stop solution to fetching, storing, analyzing, and visualizing social media data

Who This Book Is For

This book is for intermediate Python developers who want to engage with the use of public APIs to collect data from social media platforms and perform statistical analysis in order to produce useful insights from data. The book assumes a basic understanding of the Python Standard Library and provides practical examples to guide you toward the creation of your data analysis project based on social data.

What You Will Learn

  • Interact with social media platforms via their public APIs with Python
  • Store social data in a convenient format for data analysis
  • Slice and dice social data using Python tools for data science
  • Apply text analytics techniques to understand what people are talking about on social media
  • Apply advanced statistical and analytical techniques to produce useful insights from data
  • Build beautiful visualizations with web technologies to explore data and present data products

In Detail

Your social media is filled with a wealth of hidden data – unlock it with the power of Python. Transform your understanding of your clients and customers when you use Python to solve the problems of understanding consumer behavior and turning raw data into actionable customer insights.

This book will help you acquire and analyze data from leading social media sites. It will show you how to employ scientific Python tools to mine popular social websites such as Facebook, Twitter, Quora, and more. Explore the Python libraries used for social media mining, and get the tips, tricks, and insider insight you need to make the most of them. Discover how to develop data mining tools that use a social media API, and how to create your own data analysis projects using Python for clear insight from your social data.

Style and approach

This practical, hands-on guide will help you learn everything you need to perform data mining for social media. Throughout the book, we take an example-oriented approach to using Python for data analysis and provide useful tips and tricks that you can use in day-to-day tasks.




Table of Contents

Mastering Social Media Mining with Python
Credits
About the Author
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Social Media, Social Data, and Python
Getting started
Social media - challenges and opportunities
Opportunities
Challenges
Social media mining techniques
Python tools for data science
Python development environment setup
pip and virtualenv
Conda, Anaconda, and Miniconda
Efficient data analysis
Machine learning
Natural language processing
Social network analysis
Data visualization
Processing data in Python
Building complex data pipelines
Summary
2. #MiningTwitter – Hashtags, Topics, and Time Series
Getting started
The Twitter API
Rate limits
Search versus Stream
Collecting data from Twitter
Getting tweets from the timeline
The structure of a tweet
Using the Streaming API
Analyzing tweets - entity analysis
Analyzing tweets - text analysis
Analyzing tweets - time series analysis
Summary
3. Users, Followers, and Communities on Twitter
Users, friends, and followers
Back to the Twitter API
The structure of a user profile
Downloading your friends' and followers' profiles
Analyzing your network
Measuring influence and engagement
Mining your followers
Mining the conversation
Plotting tweets on a map
From tweets to GeoJSON
Easy maps with Folium
Summary
4. Posts, Pages, and User Interactions on Facebook
The Facebook Graph API
Registering your app
Authentication and security
Accessing the Facebook Graph API with Python
Mining your posts
The structure of a post
Time frequency analysis
Mining Facebook Pages
Getting posts from a Page
Facebook Reactions and the Graph API 2.6
Measuring engagement
Visualizing posts as a word cloud
Summary
5. Topic Analysis on Google+
Getting started with the Google+ API
Searching on Google+
Embedding the search results in a web GUI
Decorators in Python
Flask routes and templates
Notes and activities from a Google+ page
Text analysis and TF-IDF on notes
Capturing phrases with n-grams
Summary
6. Questions and Answers on Stack Exchange
Questions and answers
Getting started with the Stack Exchange API
Searching for tagged questions
Searching for a user
Working with Stack Exchange data dumps
Text classification for question tags
Supervised learning and text classification
Classification algorithms
Naive Bayes
k-Nearest Neighbor
Support Vector Machines
Evaluation
Performing text classification on Stack Exchange data
Embedding the classifier in a real-time application
Summary
7. Blogs, RSS, Wikipedia, and Natural Language Processing
Blogs and NLP
Getting data from blogs and websites
Using the WordPress.com API
Using the Blogger API
Parsing RSS and Atom feeds
Getting data from Wikipedia
A few words about web scraping
NLP Basics
Text preprocessing
Sentence boundary detection
Word tokenization
Part-of-speech tagging
Word normalization
Case normalization
Stemming
Lemmatization
Stop word removal
Synonym mapping
Information extraction
Summary
8. Mining All the Data!
Many social APIs
Mining videos on YouTube
Mining open source software on GitHub
Mining local businesses on Yelp
Building a custom Python client
HTTP made simple
Summary
9. Linked Data and the Semantic Web
A Web of Data
Semantic Web vocabulary
Microformats
Linked Data and Open Data
Resource Description Framework
JSON-LD
Schema.org
Mining relations from DBpedia
Mining geo coordinates
Extracting geodata from Wikipedia
Plotting geodata on Google Maps
Summary

Mastering Social Media Mining with Python

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016

Production reference: 1260716

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78355-201-6

www.packtpub.com

Credits

Author

Marco Bonzanini

Copy Editor

Vibha Shukla

Reviewer

Weiai Wayne Xu

Project Coordinator

Nidhi Joshi

Commissioning Editor

Pramila Balan

Proofreader

Safis Editing

Acquisition Editor

Sonali Vernekar

Indexer

Mariammal Chettiyar

Content Development Editor

Siddhesh Salvi

Graphics

Jason Monteiro

Disha Haria

Technical Editor

Pranil Pathare

Production Coordinator

Arvindkumar Gupta

About the Author

Marco Bonzanini is a data scientist based in London, United Kingdom. He holds a PhD in information retrieval from Queen Mary University of London. He specializes in text analytics and search applications, and over the years, he has enjoyed working on a variety of information management and data science problems.

He maintains a personal blog at http://marcobonzanini.com, where he discusses different technical topics, mainly around Python, text analytics, and data science.

When not working on Python projects, he likes to engage with the community at PyData conferences and meet-ups, and he also enjoys brewing homemade beer.

This book is the outcome of a long journey that goes beyond the mere content preparation. Many people have contributed in different ways to shape the final result. Firstly, I would like to thank the team at Packt Publishing, particularly Sonali Vernekar and Siddhesh Salvi, for giving me the opportunity to work on this book and for being so helpful throughout the whole process. I would also like to thank Dr. Weiai “Wayne” Xu for reviewing the content of this book and suggesting many improvements. Many colleagues and friends, through casual conversations, deep discussions, and previous projects, strengthened the quality of the material presented in this book. Special mentions go to Dr. Miguel Martinez-Alvarez, Marco Campana, and Stefano Campana. I'm also happy to be part of the PyData London community, a group of smart people who regularly meet to talk about Python and data science, offering a stimulating environment. Last but not least, a distinct special mention goes to Daniela, who has encouraged me during the whole journey, sharing her thoughts, suggesting improvements, and providing a relaxing environment to go back to after work.

About the Reviewer

Weiai Wayne Xu is an assistant professor in the department of communication at University of Massachusetts – Amherst and is affiliated with the University's Computational Social Science Institute. Previously, Xu worked as a network science scholar at the Network Science Institute of Northeastern University in Boston. His research on online communities, word-of-mouth, and social capital has appeared in various peer-reviewed journals. Xu also assisted four national grant projects in the area of strategic communication and public opinion. Aside from his professional appointment, he is a co-founder of a data lab called CuriosityBits Collective (http://www.curiositybits.org/).

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Preface

In the past few years, the popularity of social media has grown dramatically, with more and more users sharing all kinds of information through different platforms. Companies use social media platforms to promote their brands, professionals maintain a public profile online and use social media for networking, and regular users discuss any topic. More users also means more data waiting to be mined.

You, the reader of this book, are likely to be a developer, engineer, analyst, researcher, or student who wants to apply data mining techniques to social media data. For a data mining practitioner (or practitioner-to-be), there is no lack of opportunities and challenges in this field.

Mastering Social Media Mining with Python will give you the basic tools you need to take advantage of this wealth of data. This book starts with a journey through the main tools for data analysis in Python, providing the information you need to get started with applications such as NLP, machine learning, social network analysis, and data visualization. A step-by-step guide through the most popular social media platforms, including Twitter, Facebook, Google+, Stack Overflow, Blogger, YouTube, and more, will allow you to understand how to access data from these networks, and how to perform different types of analysis in order to extract useful insight from the raw data.

The book touches on three main aspects:

  • Social media APIs: Each platform provides access to its data in different ways. Understanding how to interact with them can answer the questions: how do we get the data? and what kind of data can we get? This is important because, without access to the data, there would be no data analysis to carry out. Each chapter focuses on different social media platforms and provides details on how to interact with the relevant API.
  • Data mining techniques: Just getting the data out of an API doesn't provide much value to us. The next step is answering the question: what can we do with the data? Each chapter provides the concepts you need to appreciate the kind of analysis that you can carry out with the data, and why it provides value. In terms of theory, the choice is to simply scratch the surface of what is needed, without digging too much into details that belong to academic textbooks. The purpose is to provide practical examples that can get you started easily.
  • Python tools for data science: Once we understand what we can do with the data, the last question is: how do we do it? Python has established itself as one of the main languages for data science. Its easy-to-understand syntax and semantics, together with its rich ecosystem for scientific computing, provide a gentle learning curve for beginners and, at the same time, all the sharp tools required by experts. The book introduces the main Python libraries used in the world of scientific computing, such as NumPy, pandas, NetworkX, scikit-learn, NLTK, and many more. Practical examples will take the form of short scripts that you can use (and possibly extend) to perform different and interesting types of analysis over the social media data that you have accessed.

If exploring the area where these three main topics meet is something of interest, this book is for you. 

What this book covers

Chapter 1, Social Media, Social Data, and Python, introduces the main concepts of data mining applied to social media using Python. By walking the reader through a brief overview on machine learning, NLP, social network analysis, and data visualization, this chapter discusses the main Python tools for data science and provides some help to set up the Python environment.

Chapter 2, #MiningTwitter – Hashtags, Topics, and Time Series, opens the practical discussion on data mining using the Twitter data. After setting up a Twitter app to interact with the Twitter API, the chapter explains how to get data through the streaming API and how to perform some frequentist analysis on hashtags and text. The chapter also discusses some time series analysis to understand the distribution of tweets over time.

Chapter 3, Users, Followers, and Communities on Twitter, continues the discussion on Twitter mining, focusing the attention on users and interactions between users. This chapter shows how to mine the connections and conversations between the users. Interesting applications explained in the chapter include user clustering (segmentation) and how to measure influence and user engagement.

Chapter 4, Posts, Pages, and User Interactions on Facebook, focuses on Facebook and the Facebook Graph API. After understanding how to interact with the Graph API, including aspects of security and privacy, examples of how to mine posts from a user's profile and Facebook pages are provided. The concepts of time series analysis and user engagement are applied to user interactions such as comments, Likes, and Reactions.

Chapter 5, Topic Analysis on Google+, covers the social network by Google. After understanding how to access the Google centralized platform, examples of how to search content and users on Google+ are discussed. This chapter also shows how to embed data coming from the Google API into a custom web application that is built using the Python microframework, Flask.

Chapter 6, Questions and Answers on Stack Exchange, explains the topic of question answering and uses the Stack Exchange network as a prime example. The reader has the opportunity to learn how to search for users and content on the different sites of this network, most notably Stack Overflow. By using their data dumps for offline processing, this chapter introduces supervised machine learning methods applied to text classification and shows how to embed a machine learning model into a real-time application.

Chapter 7, Blogs, RSS, Wikipedia, and Natural Language Processing, teaches text analytics. The Web is full of opportunities in terms of text mining, and this chapter shows how to interact with several data sources such as the WordPress.com API, Blogger API, RSS feeds, and Wikipedia API. Using textual data, the basic notions of NLP briefly mentioned throughout the book are formalized and expanded. The reader is then walked through the process of information extraction with custom examples on how to extract references of entities from free text.

Chapter 8, Mining All the Data!, reminds us of the many opportunities, in terms of data mining, that are available out there beyond the most common social networks. Examples of how to mine data from YouTube, GitHub, and Yelp are provided, along with a discussion on how to build your own API client, in case a particular platform doesn't provide one.

Chapter 9, Linked Data and the Semantic Web, provides an overview on the Semantic Web and related technologies. This chapter discusses the topics of Linked Data, microformats, and RDF, and offers examples on how to mine semantic information from DBpedia and Wikipedia.

What you need for this book

The code examples provided in this book assume that you are running a recent version of Python on either Linux, macOS, or Windows. The code has been tested on Python 3.4.* and Python 3.5.*. Older versions (Python 3.3.* or Python 2.*) are not explicitly supported.

Chapter 1, Social Media, Social Data, and Python, provides some instructions to set up a local development environment and introduces a brief list of tools that are going to be used throughout the book. We're going to take advantage of some of the essential Python libraries for scientific computing (for example, NumPy, pandas, and matplotlib), machine learning (for example, scikit-learn), NLP (for example, NLTK), and social network analysis (for example, NetworkX).

Who this book is for

This book is for intermediate Python developers who want to engage with the use of public APIs to collect data from social media platforms and perform statistical analysis in order to produce useful insights from the data. The book assumes a basic understanding of the Python standard library and provides practical examples to guide you toward the creation of your data analysis project based on social data.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book from.
  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/bonzanini/Book-SocialMediaMiningPython. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringSocialMediaMiningWithPython_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1.  Social Media, Social Data, and Python

This book is about applying data mining techniques to social media using Python. The three highlighted keywords in the previous sentence help us define the intended audience of this book: any developer, engineer, analyst, researcher, or student who is interested in exploring the area where the three topics meet.

In this chapter, we will cover the following topics:

  • Social media and social data
  • The overall process of data mining from social media
  • Setting up the Python development environment
  • Python tools for data science
  • Processing data in Python

Getting started

In the second quarter of 2015, Facebook reported nearly 1.5 billion monthly active users. In 2013, Twitter reported a volume of more than 500 million tweets per day. On a smaller scale, but certainly of interest for the readers of this book, in 2015, Stack Overflow announced that more than 10 million programming questions had been asked on their platform since the website opened.

These numbers are just the tip of the iceberg when describing how the popularity of social media has grown exponentially with more users sharing more and more information through different platforms. This wealth of data provides unique opportunities for data mining practitioners. The purpose of this book is to guide the reader through the use of social media APIs to collect data that can be analyzed with Python tools in order to produce interesting insights on how users interact on social media.

This chapter lays the ground for an initial discussion on challenges and opportunities in social media mining and introduces some Python tools that will be used in the following chapters.

Social media - challenges and opportunities

In traditional media, users are typically just consumers. Information flows in one direction: from the publisher to the users. Social media breaks this model, allowing every user to be a consumer and publisher at the same time. Many academic publications have been written on this topic with the purpose of defining what the term social media really means (for example, Users of the world, unite! The challenges and opportunities of Social Media, Andreas M. Kaplan and Michael Haenlein, 2010). The aspects that are most commonly shared across different social media platforms are as follows:

  • Internet-based applications
  • User-generated content
  • Networking

Social media are Internet-based applications. It is clear that the advances in Internet and mobile technologies have promoted the expansion of social media. Through your mobile, you can, in fact, immediately connect to a social media platform, publish your content, or catch up with the latest news.

Social media platforms are driven by user-generated content. As opposed to the traditional media model, every user is a potential publisher. More importantly, any user can interact with every other user by sharing content, commenting, or expressing positive appraisal via the like button (sometimes referred to as upvote, or thumbs up).

Social media is about networking. As described, social media is about users interacting with other users. Being connected is the central concept for most social media platforms, and the content you consume via your news feed or timeline is driven by your connections.

With these main features being central across several platforms, social media is used for a variety of purposes:

  • Staying in touch with friends and family (for example, via Facebook)
  • Microblogging and catching up with the latest news (for example, via Twitter)
  • Staying in touch with your professional network (for example, via LinkedIn)
  • Sharing multimedia content (for example, via Instagram, YouTube, Vimeo, and Flickr)
  • Finding answers to your questions (for example, via Stack Overflow, Stack Exchange, and Quora)
  • Finding and organizing items of interest (for example, via Pinterest)

This book aims to answer one central question: how do we extract useful knowledge from the data coming from social media? Taking one step back, we first need to define what knowledge is and what makes it useful.

Traditional definitions of knowledge come from information science. The concept of knowledge is usually pictured as part of a pyramid, sometimes referred to as knowledge hierarchy, which has data as its foundation, information as the middle layer, and knowledge at the top. This knowledge hierarchy is represented in the following diagram:

Figure 1.1: From raw data to semantic knowledge

Climbing the pyramid means refining knowledge from raw data. The journey from raw data to distilled knowledge goes through the integration of context and meaning. As we climb up the pyramid, the technology we build gains a deeper understanding of the original data, and more importantly, of the users who generate such data. In other words, it becomes more useful.

In this context, useful knowledge means actionable knowledge, that is, knowledge that enables a decision maker to implement a business strategy. As a reader of this book, you'll understand the key principles to extract value from social data. Understanding how users interact through social media platforms is one of the key aspects in this journey.

The following sections lay down some of the challenges and opportunities of mining data from social media platforms.

Opportunities

The key opportunity of developing data mining systems is to extract useful insights from data. The aim of the process is to answer interesting (and sometimes difficult) questions using data mining techniques to enrich our knowledge about a particular domain. For example, an online retail store can apply data mining to understand how their customers shop. Through this analysis, they are able to recommend products to their customers, depending on their shopping habits (for example, users who buy item A also buy item B). This, in general, will lead to a better customer experience and satisfaction, which in turn can produce better sales.

Many organizations in different domains can apply data mining techniques to improve their business. Some examples include the following:

  • Banking:
    • Identifying loyal customers to offer them exclusive promotions
    • Recognizing patterns of fraudulent transactions to reduce costs
  • Medicine:
    • Understanding patient behavior to forecast surgery visits
    • Supporting doctors in identifying successful treatments depending on the patient's history
  • Retail:
    • Understanding shopping patterns to improve customer experience
    • Improving the effectiveness of marketing campaigns with better targeting
    • Analyzing real-time traffic data to find the quickest route for food delivery

So how does this translate to the realm of social media? The core of the matter is how users share their data through social media platforms. Organizations are no longer limited to analyzing the data they collect directly; they have access to much more data.

This data collection happens through well-engineered, language-agnostic APIs. A common practice among social media platforms is, in fact, to offer a Web API to developers who want to integrate their applications with particular social media functionalities.

Note

Application Programming Interface

An Application Programming Interface (API) is a set of procedure definitions and protocols that describe the behavior of a software component, such as a library or remote service, in terms of its allowed operations, inputs, and outputs. When using a third-party API, developers don't need to worry about the internals of the component, but only about how they can use it.

With the term Web API, we refer to a web service that exposes a number of URIs to the public, possibly behind an authentication layer, to access the data. A common architectural approach for designing this kind of API is called Representational State Transfer (REST). An API that implements the REST architecture is called a RESTful API. We still prefer the generic term Web API, as many of the existing APIs do not strictly follow the REST principles. For the purpose of this book, a deep understanding of the REST architecture is not required.
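To make this more concrete, the following snippet sketches a typical interaction with a RESTful Web API using the requests library. The base URL, endpoint, query parameters, and token are placeholders for illustration only, not a real service:

import requests

# Hypothetical endpoint and token, for illustration only
BASE_URL = 'https://api.example.com/v1'
ACCESS_TOKEN = 'your-access-token'  # issued after authentication

# Many Web APIs expect the token in an Authorization header
headers = {'Authorization': 'Bearer {}'.format(ACCESS_TOKEN)}

# Retrieve a resource, passing query parameters in the URL
response = requests.get('{}/posts'.format(BASE_URL),
                        params={'q': 'python', 'count': 10},
                        headers=headers)
response.raise_for_status()  # fail loudly on HTTP errors
data = response.json()  # most social media APIs respond with JSON

Regardless of the specific platform, the API clients used throughout the book follow this request/response pattern under the hood.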

Challenges

Some of the challenges of social media mining are inherited from the broader field of data mining.

When dealing with social data, we're often dealing with big data. To understand the meaning of big data and the challenges it entails, we can go back to the traditional definition (3D Data Management: Controlling Data Volume, Velocity and Variety, Doug Laney, 2001) that is also known as the three Vs of big data: volume, variety, and velocity. Over the years, this definition has been expanded by adding more Vs, most notably value, as providing value to an organization is one of the main purposes of exploiting big data. Regarding the original three Vs, volume means dealing with data that spans more than one machine. This, of course, requires a different infrastructure from small data processing (for example, in-memory). Moreover, volume is also associated with velocity in the sense that data is growing so fast that the concept of big becomes a moving target. Finally, variety concerns how data comes in different formats and structures, often incompatible with each other and with different semantics. Data from social media exhibits all three Vs.

The rise of big data has pushed the development of new approaches to database technologies towards a family of systems called NoSQL. The term is an umbrella for multiple database paradigms that share the common trait of moving away from traditional relational data, promoting dynamic schema design. While this book is not about database technologies, from this field, we can still appreciate the need for dealing with a mixture of well-structured, unstructured, and semi-structured data. The phrase structured data refers to information that is well organized and typically presented in a tabular form. For this reason, the connection with relational databases is immediate. The following table shows an example of structured data that represents books sold by a bookshop:

Title           Genre               Price
-----           -----               -----
1984            Political fiction   12
War and Peace   War novel           10

This kind of data is structured as each represented item has a precise organization, specifically, three attributes called title, genre, and price.

The opposite of structured data is unstructured data, which is information without a predefined data model, or simply not organized according to a predefined data model. Unstructured data is typically in the form of textual data, for example, e-mails, documents, social media posts, and so on. Techniques presented throughout this book can be used to extract patterns in unstructured data to provide some structure.

Between structured and unstructured data, we can find semi-structured data. In this case, the structure is either flexible or not fully predefined. It is sometimes also referred to as a self-describing structure. A typical example of data format that is semi-structured is JSON. As the name suggests, JSON borrows its notation from the programming language JavaScript. This data format has become extremely popular due to its wide use as a way to exchange data between client and server in a web application. The following snippet shows an example of the JSON representation that extends the previous book data:

[ { "title": "1984", "price": 12, "author": "George Orwell", "genre": ["Political fiction", "Social science fiction"] }, { "title": "War and Peace", "price": 10, "genre": ["Historical", Romance", "War novel"] } ]

What we can observe from this example is that the first book has the author attribute, whereas this attribute is not present in the second book. Moreover, the genre attribute is presented here as a list, with a variable number of values. Both these aspects are usually avoided in a well-structured (relational) data format, but are perfectly fine in JSON and, more generally, when dealing with semi-structured data.
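As a quick sketch of how Python handles this kind of data, the following snippet parses the JSON document shown earlier (assumed to be saved in a file called books.json) and deals with the optional author attribute; the fallback value is an arbitrary choice for illustration:

import json

with open('books.json') as f:
    books = json.load(f)

for book in books:
    # dict.get() provides a default when an attribute is missing,
    # a common situation with semi-structured data
    author = book.get('author', 'Unknown author')
    genres = ', '.join(book['genre'])
    print('{} by {} ({})'.format(book['title'], author, genres))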

The discussion on structured and unstructured data translates into handling different data formats and approaching data integrity in different ways. The phrase data integrity is used to capture the combination of challenges coming from the presence of dirty, inconsistent, or incomplete data.

The case of inconsistent and incomplete data is very common when analyzing user-generated content, and it calls for attention, especially with data from social media. It is very rare to observe users who share their data methodically, almost in a formal fashion. On the contrary, social media often consists of informal environments, with some contradictions. For example, if a user wants to complain about a product on the company's Facebook page, the user first needs to like the page itself, which is quite the opposite of being upset with a company due to the poor quality of their product. Understanding how users interact on social media platforms is crucial to design a good analysis.

Developing data mining applications also requires us to consider issues related to data access, particularly when company policies translate into the lack of data to analyze. In other words, data is not always openly available. The previous paragraph discussed how in social media mining, this is a little less of an issue compared to other corporate environments, as most social media platforms offer well-engineered language-agnostic APIs that allow us to access the data we need. The availability of such data is, of course, still dependent on how users share their data and how they grant us access. For example, Facebook users can decide the level of detail that can be shown in their public profile and the details that can be shown only to their friends. Profile information, such as birthday, current location, and work history (as well as many more), can all be individually flagged as private or public. Similarly, when we try to access such data through the Facebook API, the users who sign up to our application have the opportunity to grant us access only to a limited subset of the data we are asking for.

One last general challenge of data mining lies in understanding the data mining process itself and being able to explain it. In other words, coming up with the right question before we start analyzing the data is not always straightforward. More often than not, research and development (R&D) processes are driven by exploratory analysis, in the sense that, in order to understand how to tackle the problem, we first need to start tampering with it. A related concept in statistics is described by the phrase correlation does not imply causation. Many statistical tests can be used to establish correlation between two variables, that is, two events occurring together, but this is not sufficient to establish a cause-effect relationship in either direction. Funny examples of bizarre correlations can be found all over the Web. A popular case was published in the New England Journal of Medicine, one of the most reputable medical journals, showing an interesting correlation between the amount of chocolate consumed per capita per country and the number of Nobel Prizes awarded (Chocolate Consumption, Cognitive Function, and Nobel Laureates, Franz H. Messerli, 2012).

When performing an exploratory analysis, it is important to keep in mind that correlation (two events occurring together) is a bidirectional relationship, while causation (event A has caused event B) is a unidirectional one. Does chocolate make you smarter, or do smart people like chocolate more than the average person? Do the two events occur together just by random chance? Is there a third, yet unseen, variable that plays some role in the correlation? Simply observing a correlation is not sufficient to describe causality, but it is often an interesting starting point to ask important questions about the data we are observing.
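As a small illustration of the first step (measuring correlation, without interpreting it), the following sketch computes the Pearson correlation coefficient on synthetic data using NumPy; the numbers are made up and carry no meaning beyond the example:

import numpy as np

np.random.seed(42)  # for reproducibility
x = np.random.normal(size=100)
# y moves together with x, plus some noise
y = 0.5 * x + np.random.normal(scale=0.5, size=100)

# A coefficient close to +1 or -1 suggests a linear relationship,
# but it says nothing about which variable (if any) causes the other
r = np.corrcoef(x, y)[0, 1]
print('Pearson correlation: {:.2f}'.format(r))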

The following section generalizes the way our application interacts with a social media API and performs the desired analysis.

Social media mining techniques

This section briefly discusses the overall process for building a social media mining application, before digging into the details in the next chapters.

The process can be summarized in the following steps:

  1. Authentication
  2. Data collection
  3. Data cleaning and pre-processing
  4. Modeling and analysis
  5. Result presentation

Figure 1.2 shows an overview of the process:

Figure 1.2: The overall process of social media mining

The authentication step is typically performed using the industry standard called Open Authorization (OAuth). The process is three-legged, meaning that it involves three actors: a user, a consumer (our application), and a resource provider (the social media platform). The steps in the process are as follows:

  1. The user agrees with the consumer to grant access to the social media platform.
  2. As the user doesn't give their social media password directly to the consumer, the consumer has an initial exchange with the resource provider to generate a token and a secret. These are used to sign each request and prevent forgery.
  3. The user is then redirected with the token to the resource provider, which will ask to confirm authorizing the consumer to access the user's data.
  4. Depending on the nature of the social media platform, it will also ask to confirm whether the consumer can perform any action on the user's behalf, for example, post an update, share a link, and so on.
  5. The resource provider issues a valid token for the consumer.
  6. The token can then go back to the user, confirming the access.

Figure 1.3 shows the OAuth process with references to each of the steps described earlier. The aspect to remember is that the exchange of credentials (username/password) only happens between the user and the resource provider, in steps 3 and 4. All other exchanges are driven by tokens:

Figure 1.3: The OAuth process

From the user's perspective, this apparently complex process happens when the user is visiting our web app and hits the Login with Facebook (or Twitter, Google+, and so on) button. Then the user has to confirm that they are granting privileges to our app, and everything for them happens behind the scenes.

From a developer's perspective, the nice part is that the Python ecosystem has already well-established libraries for most social media platforms, which come with an implementation of the authentication process. As a developer, once you have registered your application with the target service, the platform will provide the necessary authorization tokens for your app. Figure 1.4 shows a screenshot of a custom Twitter app called Intro to Text Mining. On the Keys and Access Tokens configuration page, the developer can find the API key and secret, as well as the access token and access token secret. We'll discuss the details of the authorization for each social media platform in the relevant chapters:

Figure 1.4: Configuration page for a Twitter app called Intro to Text Mining. The page contains all the authorization tokens for the developers to use in their app.
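As a preview of what this looks like in code, the following sketch uses the tweepy library to perform the token-based handshake with Twitter; the four credential strings are placeholders to be replaced with the values from your app's Keys and Access Tokens page:

import tweepy

# Placeholder credentials, not real values
consumer_key = 'your-api-key'
consumer_secret = 'your-api-secret'
access_token = 'your-access-token'
access_token_secret = 'your-access-token-secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# The client is now authorized to call the API on the user's behalf
client = tweepy.API(auth)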

The data collection, cleaning, and pre-processing steps are also dependent on the social media platform we are dealing with. In particular, the data collection step is tied to the initial authorization as we can only download data that we have been granted access to. Cleaning and pre-processing, on the other hand, are functional to the type of data modeling and analysis that we decide to employ to produce insights on the data.

Back to Figure 1.2, the modeling and analysis is performed by the component labeled ANALYTICS ENGINE. Typical data processing tasks that we'll encounter throughout this book are text mining and graph mining.

Text mining (also referred to as text analytics) is the process of deriving structured information from unstructured textual data. Text mining is applicable to most social media platforms, as the users are allowed to publish content in the form of posts or comments.

Some examples of text mining applications include the following:

  • Document classification: This is the task of assigning a document to one or more categories
  • Document clustering: This is the task of grouping documents into subsets (called clusters) that are coherent and distinct from one another (for example, by topic or sub-topic)
  • Document summarization: This is the task of creating a shortened version of the document in order to reduce the information overload to the user, while still retaining the most important aspects described in the original source
  • Entity extraction: This is the task of locating and classifying entity references from a text into some desired categories such as persons, locations, or organizations
  • Sentiment analysis: This is the task of identifying and categorizing sentiments and opinions expressed in a text in order to understand the attitude towards a particular product, topic, service, and so on

Not all these applications are tailored for social media, but the growing amount of textual data available through these platforms makes social media a natural playground for text mining.
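As a tiny taste of what text mining looks like in Python, the following sketch tokenizes a couple of made-up posts with NLTK and counts the most frequent terms; it assumes NLTK is installed and its punkt tokenizer models have been downloaded:

from collections import Counter
from nltk.tokenize import word_tokenize

posts = [
    'Python is great for mining social media!',
    'Mining social data with Python is fun.',
]

# Lowercase all tokens and keep only alphabetic ones,
# discarding punctuation and numbers
tokens = [token.lower()
          for post in posts
          for token in word_tokenize(post)
          if token.isalpha()]

print(Counter(tokens).most_common(5))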

Graph mining, in turn, focuses on the structure of the data. Graphs are a simple-to-understand, yet powerful, data structure that is generic enough to be applied to many different data representations. In graphs, there are two main components to consider: nodes, which represent entities or objects, and edges, which represent relationships or connections between nodes. In the context of social media, the obvious use of a graph is to represent the social relationships of our users. More generally, in the social sciences, the graph structure used to represent social relationships is also referred to as a social network.

In terms of using such a data structure within social media, we can naturally represent users as nodes, and their relationships (such as friends of or followers) as edges. In this way, information such as friends of friends who like Python becomes easily accessible just by traversing the graph (that is, walking from one node to another by following the edges). Graph theory and graph mining offer more options to discover deeper insights that are not as clearly visible as in the previous example.
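The following sketch shows this idea with NetworkX, using a handful of made-up users; an edge from one user to another means that the first follows the second:

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ('alice', 'bob'), ('alice', 'carol'),
    ('bob', 'dave'), ('carol', 'dave'), ('carol', 'eve'),
])

# Friends of friends: walk two hops along outgoing edges,
# excluding alice herself and the users she already follows
user = 'alice'
fof = {second
       for first in g.successors(user)
       for second in g.successors(first)
       if second != user and second not in g[user]}
print(fof)  # {'dave', 'eve'}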

After a high-level discussion on social media mining, the following section will introduce some of the useful Python tools that are commonly used in data mining projects.

Python tools for data science

Until now, we've been using the term data mining when referring to problems and techniques that we're going to apply throughout this book. The title of this section, in fact, mentions the term data science. The use of this term has exploded in recent years, especially in business environments, while many academics and journalists have also criticized its use as a buzzword. Meanwhile, academic institutions have started offering courses on data science, and many books and articles have been published on the subject. Rather than having a strong opinion on where we should draw the border between different disciplines, we limit ourselves to observing how, nowadays, there is a general interest in multiple fields, including data science, data mining, data analysis, statistics, machine learning, artificial intelligence, data visualization, and more. The topics we're discussing are interdisciplinary by their own nature, and they all borrow from each other from time to time. This is certainly an amazing time to be working in any of these fields, with a lot of interest from the public and a constant buzz from new advances in interesting projects.

The purpose of this section is to introduce Python as a tool for data science, and to describe part of the Python ecosystem that we're going to use in the next chapters.

Python is one of the most interesting languages for data analytics projects. The following are some of the reasons that make it fit for purpose:

  • Declarative and intuitive syntax
  • Rich ecosystem for data processing
  • Efficiency

Python has a shallow learning curve due to its elegant syntax. Being a dynamic and interpreted language, it facilitates rapid development and interactive exploration. The ecosystem for data processing is partially described in the following sections, which will introduce the main packages we'll use in this book.

In terms of efficiency, interpreted and high-level languages are not famous for being furiously fast. Tools such as NumPy achieve efficiency by hooking into low-level libraries under the hood, and exposing a friendly Python interface. Moreover, many projects employ Cython, a superset of Python that enriches the language by allowing, among other features, the definition of static variable types and compilation to C. Many other projects in the Python world are in the process of tackling efficiency issues, with the overall goal of making pure Python implementations faster. In this book, we won't dig into Cython or any of these promising projects, but we'll make use of NumPy (especially through other libraries that employ NumPy) for data analysis.
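To give a flavor of what this means in practice, here is a small comparison between a pure Python loop and the equivalent vectorized NumPy operation; the array size is arbitrary:

import numpy as np

# Pure Python: sum of squares, one element at a time
total = sum(v * v for v in range(1000000))

# NumPy: the same computation is delegated to optimized,
# low-level code that operates on the whole array at once
arr = np.arange(1000000, dtype=np.int64)
total_np = int((arr * arr).sum())

assert total == total_np

Timing the two versions (for example, with the timeit module) typically shows the vectorized form running considerably faster.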

Python development environment setup

When this book was started, Python 3.5 had just been released and received some attention for some of its latest features, such as improved support for asynchronous programming and the semantic definition of type hints. In terms of usage, Python 3.5 is probably not widely adopted yet, but it represents the current line of development of the language.

Note

The examples in this book are compatible with Python 3, particularly with versions 3.4+ and 3.5+.

In the never-ending discussion about choosing between Python 2 and Python 3, one of the points to keep in mind is that support for Python 2 will be discontinued in a few years (at the time of writing, the sunset date is 2020). New features are not developed for Python 2, as this branch is only for bug fixes. On the other hand, many libraries are still developed for Python 2 first, with support for Python 3 added later. For this reason, from time to time, there could be a minor hiccup in terms of compatibility for some libraries, which is usually resolved by the community quite quickly. In general, if there is no strong reason against this choice, the preference should go to Python 3, especially for new greenfield projects.

pip and virtualenv

In order to keep the development environment clean, and to ease the transition from prototype to production, the suggestion is to use virtualenv to manage a virtual environment and install dependencies. virtualenv is a tool for creating and managing isolated Python environments. By using an isolated virtual environment, developers avoid polluting the global Python environment with libraries that could be incompatible with each other. The tool allows us to maintain multiple projects that require different configurations and to easily switch from one to the other. Moreover, the virtual environment can be installed in a local folder that is accessible to users without administrative privileges.

To install virtualenv in the global Python environment in order to make it available to all users, we can use pip from a terminal (Linux/Unix) or command prompt (Windows):

$ [sudo] pip install virtualenv

The sudo command might be necessary on Linux/Unix or macOS if our current user doesn't have administrator privileges on the system.

If a package is already installed, it can be upgraded to the latest version:

$ pip install --upgrade [package name]

Note

Since Python 3.4, the pip tool is shipped with Python. Previous versions require a separate installation of pip as explained on the project page (https://github.com/pypa/pip). The tool can also be used to upgrade itself to the latest version:

$ pip install --upgrade pip

Once virtualenv is globally available, for each project, we can define a separate Python environment where dependencies are installed in isolation, without tampering with the global environment. In this way, tracking the required dependencies of a single project is extremely easy.

In order to set up a virtual environment, follow these steps:

$ mkdir my_new_project    # create new project folder
$ cd my_new_project       # enter project folder
$ virtualenv my_env       # set up custom virtual environment

This will create a my_env subfolder, which is also the name of the virtual environment we're creating, in the current directory. Inside this subfolder, we have all the necessary tools to create the isolated Python environment, including the Python binaries and the standard library. In order to activate the environment, we can type the following command:

$ source my_env/bin/activate

Once the environment is active, the following will be visible on the prompt:

(my_env)$

Python packages can be installed for this particular environment using pip:

(my_env)$ pip install [package-name]

All the new Python libraries installed with pip when the environment is active will be installed into my_env/lib/python{VERSION}/site-packages. Notice that, as this is a local folder, we won't need administrative access to perform this command.

When we want to deactivate the virtual environment, we can simply type the following command:

$ deactivate

The process described earlier should work for the official Python distributions that are shipped (or available for download) with your operating system.

Conda, Anaconda, and Miniconda

There is also one more option to consider, called conda (http://conda.pydata.org/), which is gaining some traction in the scientific community as it makes the dependency management quite easy. Conda is an open source package manager and environment manager for installing multiple versions of software packages (and related dependencies), which makes it easy to switch from one version to the other. It supports Linux, macOS, and Windows, and while it was initially created for Python, it can be used to package and distribute any software.

There are mainly two distributions that ship with conda: the batteries-included version, Anaconda, which comes with approximately 100 packages for scientific computing already installed, and the lightweight version, Miniconda, which simply comes with Python and the conda installer, without external libraries.

If you're new to Python, have some time for the bigger download and disk space to spare, and don't want to install all the packages manually, you can get started with Anaconda. For Windows and macOS, Anaconda is available with either a graphical or a command-line installer. Figure 1.5 shows a screen capture of the installation procedure on macOS. For Linux, only the command-line installer is available. In all cases, it's possible to choose between Python 2 and Python 3. If you prefer to have full control of your system, Miniconda will probably be your favorite option:

Figure 1.5: Screen capture of the Anaconda installation

Once you've installed your version of conda, in order to create a new conda environment, you can use the following command:

$ conda create --name my_env python=3.4 # or favorite version

The environment can be activated with the following command:

$ conda activate my_env

Similar to what happens with virtualenv, the environment name will be visible in the prompt:

(my_env)$

New packages can be installed for this environment with the following command:

$ conda install [package-name]

Finally, you can deactivate an environment by typing the following command:

$ conda deactivate

Another nice feature of conda is the ability to install packages from pip as well, so if a particular library is not available via conda install, or it hasn't been updated to the latest version we need, we can always fall back to the traditional Python package manager while using a conda environment.

If not specified otherwise, by default, conda will look for packages on https://anaconda.org, while pip makes use of the Python Package Index (PyPI in short, also known as