Acquire and analyze data from all corners of the social web with Python
This book is for intermediate Python developers who want to engage with the use of public APIs to collect data from social media platforms and perform statistical analysis in order to produce useful insights from data. The book assumes a basic understanding of the Python Standard Library and provides practical examples to guide you toward the creation of your data analysis project based on social data.
Your social media is filled with a wealth of hidden data – unlock it with the power of Python. Transform your understanding of your clients and customers by using Python to analyze consumer behavior and turn raw data into actionable customer insights.
This book will help you acquire and analyze data from leading social media sites. It will show you how to employ scientific Python tools to mine popular social websites such as Facebook, Twitter, Quora, and more. Explore the Python libraries used for social media mining, and get the tips, tricks, and insider insight you need to make the most of them. Discover how to develop data mining tools that use a social media API, and how to create your own data analysis projects using Python for clear insight from your social data.
This practical, hands-on guide will help you learn everything you need to perform data mining for social media. Throughout the book, we take an example-oriented approach to use Python for data analysis and provide useful tips and tricks that you can use in day-to-day tasks.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2016
Production reference: 1260716
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78355-201-6
www.packtpub.com
Author
Marco Bonzanini
Copy Editor
Vibha Shukla
Reviewer
Weiai Wayne Xu
Project Coordinator
Nidhi Joshi
Commissioning Editor
Pramila Balan
Proofreader
Safis Editing
Acquisition Editor
Sonali Vernekar
Indexer
Mariammal Chettiyar
Content Development Editor
Siddhesh Salvi
Graphics
Jason Monteiro
Disha Haria
Technical Editor
Pranil Pathare
Production Coordinator
Arvindkumar Gupta
Marco Bonzanini is a data scientist based in London, United Kingdom. He holds a PhD in information retrieval from Queen Mary University of London. He specializes in text analytics and search applications, and over the years, he has enjoyed working on a variety of information management and data science problems.
He maintains a personal blog at http://marcobonzanini.com, where he discusses different technical topics, mainly around Python, text analytics, and data science.
When not working on Python projects, he likes to engage with the community at PyData conferences and meet-ups, and he also enjoys brewing homemade beer.
This book is the outcome of a long journey that goes beyond the mere content preparation. Many people have contributed in different ways to shape the final result. Firstly, I would like to thank the team at Packt Publishing, particularly Sonali Vernekar and Siddhesh Salvi, for giving me the opportunity to work on this book and for being so helpful throughout the whole process. I would also like to thank Dr. Weiai “Wayne” Xu for reviewing the content of this book and suggesting many improvements. Many colleagues and friends, through casual conversations, deep discussions, and previous projects, strengthened the quality of the material presented in this book. Special mentions go to Dr. Miguel Martinez-Alvarez, Marco Campana, and Stefano Campana. I'm also happy to be part of the PyData London community, a group of smart people who regularly meet to talk about Python and data science, offering a stimulating environment. Last but not least, a distinct special mention goes to Daniela, who has encouraged me during the whole journey, sharing her thoughts, suggesting improvements, and providing a relaxing environment to go back to after work.
Weiai Wayne Xu is an assistant professor in the Department of Communication at the University of Massachusetts – Amherst and is affiliated with the University's Computational Social Science Institute. Previously, Xu worked as a network science scholar at the Network Science Institute of Northeastern University in Boston. His research on online communities, word-of-mouth, and social capital has appeared in various peer-reviewed journals. Xu has also assisted four national grant projects in the area of strategic communication and public opinion. Aside from his professional appointment, he is a co-founder of a data lab called CuriosityBits Collective (http://www.curiositybits.org/).
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
In the past few years, the popularity of social media has grown dramatically, with more and more users sharing all kinds of information through different platforms. Companies use social media platforms to promote their brands, professionals maintain a public profile online and use social media for networking, and regular users discuss any topic they like. More users also means more data waiting to be mined.
You, the reader of this book, are likely to be a developer, engineer, analyst, researcher, or student who wants to apply data mining techniques to social media data. For a data mining practitioner (or practitioner-to-be), there is no lack of opportunities and challenges in this area.
Mastering Social Media Mining with Python will give you the basic tools you need to take advantage of this wealth of data. This book will start a journey through the main tools for data analysis in Python, providing the information you need to get started with applications such as NLP, machine learning, social network analysis, and data visualization. A step-by-step guide through the most popular social media platforms, including Twitter, Facebook, Google+, Stack Overflow, Blogger, YouTube and more, will allow you to understand how to access data from these networks, and how to perform different types of analysis in order to extract useful insight from the raw data.
The book touches on three main aspects: social media data, data mining techniques, and the Python programming language. If exploring the area where these three topics meet is of interest to you, this book is for you.
Chapter 1, Social Media, Social Data, and Python, introduces the main concepts of data mining applied to social media using Python. By walking the reader through a brief overview on machine learning, NLP, social network analysis, and data visualization, this chapter discusses the main Python tools for data science and provides some help to set up the Python environment.
Chapter 2, #MiningTwitter – Hashtags, Topics, and Time Series, opens the practical discussion on data mining using the Twitter data. After setting up a Twitter app to interact with the Twitter API, the chapter explains how to get data through the streaming API and how to perform some frequentist analysis on hashtags and text. The chapter also discusses some time series analysis to understand the distribution of tweets over time.
Chapter 3, Users, Followers, and Communities on Twitter, continues the discussion on Twitter mining, focusing the attention on users and interactions between users. This chapter shows how to mine the connections and conversations between the users. Interesting applications explained in the chapter include user clustering (segmentation) and how to measure influence and user engagement.
Chapter 4, Posts, Pages, and User Interactions on Facebook, focuses on Facebook and the Facebook Graph API. After understanding how to interact with the Graph API, including aspects of security and privacy, examples of how to mine posts from a user's profile and Facebook pages are provided. The concepts of time series analysis and user engagement are applied to user interactions such as comments, Likes, and Reactions.
Chapter 5, Topic Analysis on Google+, covers the social network by Google. After understanding how to access Google's centralized platform, examples of how to search for content and users on Google+ are discussed. This chapter also shows how to embed data coming from the Google API into a custom web application built with the Python microframework, Flask.
Chapter 6, Questions and Answers on Stack Exchange, explains the topic of question answering and uses the Stack Exchange network as its paramount example. The reader has the opportunity to learn how to search for users and content on the different sites of this network, most notably Stack Overflow. Using their data dumps for offline processing, this chapter introduces supervised machine learning methods applied to text classification and shows how to embed a machine learning model into a real-time application.
Chapter 7, Blogs, RSS, Wikipedia, and Natural Language Processing, teaches text analytics. The Web is full of opportunities in terms of text mining, and this chapter shows how to interact with several data sources such as the WordPress.com API, Blogger API, RSS feeds, and Wikipedia API. Using textual data, the basic notions of NLP briefly mentioned throughout the book are formalized and expanded. The reader is then walked through the process of information extraction with custom examples on how to extract references of entities from free text.
Chapter 8, Mining All the Data!, reminds us of the many opportunities, in terms of data mining, that are available out there beyond the most common social networks. Examples of how to mine data from YouTube, GitHub, and Yelp are provided, along with a discussion on how to build your own API client, in case a particular platform doesn't provide one.
Chapter 9, Linked Data and the Semantic Web, provides an overview on the Semantic Web and related technologies. This chapter discusses the topics of Linked Data, microformats, and RDF, and offers examples on how to mine semantic information from DBpedia and Wikipedia.
The code examples provided in this book assume that you are running a recent version of Python on either Linux, macOS, or Windows. The code has been tested on Python 3.4.* and Python 3.5.*. Older versions (Python 3.3.* or Python 2.*) are not explicitly supported.
Chapter 1, Social Media, Social Data, and Python, provides some instructions to set up a local development environment and introduces a brief list of tools that are going to be used throughout the book. We're going to take advantage of some of the essential Python libraries for scientific computing (for example, NumPy, pandas, and matplotlib), machine learning (for example, scikit-learn), NLP (for example, NLTK), and social network analysis (for example, NetworkX).
This book is for intermediate Python developers who want to engage with the use of public APIs to collect data from social media platforms and perform statistical analysis in order to produce useful insights from the data. The book assumes a basic understanding of the Python standard library and provides practical examples to guide you toward the creation of your data analysis project based on social data.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/bonzanini/Book-SocialMediaMiningPython. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringSocialMediaMiningWithPython_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
This book is about applying data mining techniques to social media using Python. The three highlighted keywords in the previous sentence help us define the intended audience of this book: any developer, engineer, analyst, researcher, or student who is interested in exploring the area where the three topics meet.
In this chapter, we will cover the following topics:
In the second quarter of 2015, Facebook reported nearly 1.5 billion monthly active users. In 2013, Twitter reported a volume of 500+ million tweets per day. On a smaller scale, but certainly of interest for the readers of this book, in 2015, Stack Overflow announced that more than 10 million programming questions had been asked on their platform since the website opened.
These numbers are just the tip of the iceberg when describing how the popularity of social media has grown exponentially with more users sharing more and more information through different platforms. This wealth of data provides unique opportunities for data mining practitioners. The purpose of this book is to guide the reader through the use of social media APIs to collect data that can be analyzed with Python tools in order to produce interesting insights on how users interact on social media.
This chapter lays the groundwork for an initial discussion on challenges and opportunities in social media mining and introduces some Python tools that will be used in the following chapters.
In traditional media, users are typically just consumers. Information flows in one direction: from the publisher to the users. Social media breaks this model, allowing every user to be a consumer and publisher at the same time. Many academic publications have been written on this topic with the purpose of defining what the term social media really means (for example, Users of the world, unite! The challenges and opportunities of Social Media, Andreas M. Kaplan and Michael Haenlein, 2010). The aspects that are most commonly shared across different social media platforms are as follows:
Social media are Internet-based applications. It is clear that the advances in Internet and mobile technologies have promoted the expansion of social media. Through your mobile, you can, in fact, immediately connect to a social media platform, publish your content, or catch up with the latest news.
Social media platforms are driven by user-generated content. As opposed to the traditional media model, every user is a potential publisher. More importantly, any user can interact with every other user by sharing content, commenting, or expressing positive appraisal via the like button (sometimes referred to as upvote, or thumbs up).
Social media is about networking. As described, social media is about users interacting with other users. Being connected is the central concept for most social media platforms, and the content you consume via your news feed or timeline is driven by your connections.
With these main features being central across several platforms, social media is used for a variety of purposes:
This book aims to answer one central question: how do we extract useful knowledge from the data coming from social media? Taking one step back, we need to define what knowledge is and what makes it useful.
Traditional definitions of knowledge come from information science. The concept of knowledge is usually pictured as part of a pyramid, sometimes referred to as knowledge hierarchy, which has data as its foundation, information as the middle layer, and knowledge at the top. This knowledge hierarchy is represented in the following diagram:
Climbing the pyramid means refining knowledge from raw data. The journey from raw data to distilled knowledge goes through the integration of context and meaning. As we climb up the pyramid, the technology we build gains a deeper understanding of the original data, and more importantly, of the users who generate such data. In other words, it becomes more useful.
In this context, useful knowledge means actionable knowledge, that is, knowledge that enables a decision maker to implement a business strategy. As a reader of this book, you'll understand the key principles to extract value from social data. Understanding how users interact through social media platforms is one of the key aspects in this journey.
The following sections lay down some of the challenges and opportunities of mining data from social media platforms.
The key opportunity of developing data mining systems is to extract useful insights from data. The aim of the process is to answer interesting (and sometimes difficult) questions using data mining techniques to enrich our knowledge about a particular domain. For example, an online retail store can apply data mining to understand how their customers shop. Through this analysis, they are able to recommend products to their customers, depending on their shopping habits (for example, users who buy item A, also buy item B). This, in general, will lead to a better customer experience and satisfaction, which in return can produce better sales.
Many organizations in different domains can apply data mining techniques to improve their business. Some examples include the following:
So how does this translate to the realm of social media? The core of the matter is how users share their data through social media platforms. Organizations are no longer limited to analyzing the data they collect directly; they now have access to much more data.

This data collection typically happens through well-engineered, language-agnostic APIs. A common practice among social media platforms is, in fact, to offer a Web API to developers who want to integrate their applications with particular social media functionalities.
Application Programming Interface
An Application Programming Interface (API) is a set of procedure definitions and protocols that describe the behavior of a software component, such as a library or remote service, in terms of its allowed operations, inputs, and outputs. When using a third-party API, developers don't need to worry about the internals of the component, but only about how they can use it.
With the term Web API, we refer to a web service that exposes a number of URIs to the public, possibly behind an authentication layer, to access the data. A common architectural approach for designing this kind of API is called Representational State Transfer (REST). An API that implements the REST architecture is called a RESTful API. We still prefer the generic term Web API, as many existing APIs do not strictly follow the REST principles. For the purpose of this book, a deep understanding of the REST architecture is not required.
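As a minimal illustration, the following sketch queries a hypothetical RESTful endpoint with the requests library; the URL, parameters, and response fields are made up for the example, but the pattern (an HTTP GET request with query parameters, returning a JSON response) is common to most Web APIs discussed in this book:

import requests

# Hypothetical endpoint and parameters, for illustration only
url = "https://api.example.com/v1/posts"
params = {"q": "python", "count": 10}

response = requests.get(url, params=params)
response.raise_for_status()  # raise an exception on HTTP errors
data = response.json()       # most Web APIs return JSON

for post in data.get("posts", []):
    print(post.get("title"))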
Some of the challenges of social media mining are inherited from the broader field of data mining.
When dealing with social data, we're often dealing with big data. To understand the meaning of big data and the challenges it entails, we can go back to the traditional definition (3D Data Management: Controlling Data Volume, Velocity and Variety, Doug Laney, 2001) that is also known as the three Vs of big data: volume, variety, and velocity. Over the years, this definition has also been expanded by adding more Vs, most notably value, as providing value to an organization is one of the main purposes of exploiting big data. Regarding the original three Vs, volume means dealing with data that spans more than one machine. This, of course, requires a different infrastructure from small data processing (for example, in-memory). Moreover, volume is also associated with velocity in the sense that data is growing so fast that the concept of big becomes a moving target. Finally, variety concerns how data comes in different formats and structures, often incompatible with one another and with different semantics. Data from social media can tick all three Vs.
The rise of big data has pushed the development of new approaches to database technologies towards a family of systems called NoSQL. The term is an umbrella for multiple database paradigms that share the common trait of moving away from traditional relational data, promoting dynamic schema design. While this book is not about database technologies, from this field, we can still appreciate the need for dealing with a mixture of well-structured, unstructured, and semi-structured data. The phrase structured data refers to information that is well organized and typically presented in a tabular form. For this reason, the connection with relational databases is immediate. The following table shows an example of structured data that represents books sold by a bookshop:
Title            Genre                Price
1984             Political fiction    12
War and Peace    War novel            10
This kind of data is structured as each represented item has a precise organization, specifically, three attributes called title, genre, and price.
The opposite of structured data is unstructured data, which is information without a predefined data model, or simply not organized according to a predefined data model. Unstructured data is typically in the form of textual data, for example, e-mails, documents, social media posts, and so on. Techniques presented throughout this book can be used to extract patterns in unstructured data to provide some structure.
Between structured and unstructured data, we can find semi-structured data. In this case, the structure is either flexible or not fully predefined. It is sometimes also referred to as a self-describing structure. A typical example of data format that is semi-structured is JSON. As the name suggests, JSON borrows its notation from the programming language JavaScript. This data format has become extremely popular due to its wide use as a way to exchange data between client and server in a web application. The following snippet shows an example of the JSON representation that extends the previous book data:
[ { "title": "1984", "price": 12, "author": "George Orwell", "genre": ["Political fiction", "Social science fiction"] }, { "title": "War and Peace", "price": 10, "genre": ["Historical", Romance", "War novel"] } ]What we can observe from this example is that the first book has the author attribute, whereas, this attribute is not present in the second book. Moreover, the genre attribute is here presented as a list, with a variable number of values. Both these aspects are usually avoided in a well-structured (relational) data format, but are perfectly fine in JSON and more in general when dealing with semi-structured data.
The discussion on structured and unstructured data translates into handling different data formats and approaching data integrity in different ways. The phrase data integrity is used to capture the combination of challenges coming from the presence of dirty, inconsistent, or incomplete data.
The case of inconsistent and incomplete data is very common when analyzing user-generated content, and it calls for attention, especially with data from social media. It is very rare to observe users who share their data methodically, almost in a formal fashion. On the contrary, social media often consists of informal environments, with some contradictions. For example, if a user wants to complain about a product on the company's Facebook page, the user first needs to like the page itself, which is quite the opposite of being upset with a company due to the poor quality of their product. Understanding how users interact on social media platforms is crucial to design a good analysis.
Developing data mining applications also requires us to consider issues related to data access, particularly when company policies translate into the lack of data to analyze. In other words, data is not always openly available. The previous paragraph discussed how in social media mining, this is a little less of an issue compared to other corporate environments, as most social media platforms offer well-engineered language-agnostic APIs that allow us to access the data we need. The availability of such data is, of course, still dependent on how users share their data and how they grant us access. For example, Facebook users can decide the level of detail that can be shown in their public profile and the details that can be shown only to their friends. Profile information, such as birthday, current location, and work history (as well as many more), can all be individually flagged as private or public. Similarly, when we try to access such data through the Facebook API, the users who sign up to our application have the opportunity to grant us access only to a limited subset of the data we are asking for.
One last general challenge of data mining lies in understanding the data mining process itself and being able to explain it. In other words, coming up with the right question before we start analyzing the data is not always straightforward. More often than not, research and development (R&D) processes are driven by exploratory analysis, in the sense that in order to understand how to tackle the problem, we first need to start tampering with it. A related concept in statistics is described by the phrase correlation does not imply causation. Many statistical tests can be used to establish correlation between two variables, that is, two events occurring together, but this is not sufficient to establish a cause-effect relationship in either direction. Funny examples of bizarre correlations can be found all over the Web. A popular case was published in the New England Journal of Medicine, one of the most reputable medical journals, showing an interesting correlation between the amount of chocolate consumed per capita in a country and the number of Nobel Prizes awarded to that country (Chocolate Consumption, Cognitive Function, and Nobel Laureates, Franz H. Messerli, 2012).
When performing an exploratory analysis, it is important to keep in mind that correlation (two events occurring together) is a bidirectional relationship, while causation (event A has caused event B) is a unidirectional one. Does chocolate make you smarter or do smart people like chocolate more than an average person? Do the two events occur together just by a random chance? Is there a third, yet unseen, variable that plays some role in the correlation? Simply observing a correlation is not sufficient to describe causality, but it is often an interesting starting point to ask important questions about the data we are observing.
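As a toy illustration (with made-up figures, not the actual chocolate/Nobel data), NumPy can compute the Pearson correlation coefficient between two variables in one call:

import numpy as np

# Made-up figures for illustration only
chocolate_kg_per_capita = np.array([4.5, 8.8, 9.1, 5.3, 11.9, 2.2])
nobel_laureates_per_10m = np.array([5.7, 24.0, 25.5, 9.0, 31.9, 1.9])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(chocolate_kg_per_capita, nobel_laureates_per_10m)[0, 1]
print("Correlation: {:.2f}".format(r))  # strong correlation, but not causation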
The following section generalizes the way our application interacts with a social media API and performs the desired analysis.
This section briefly discusses the overall process for building a social media mining application, before digging into the details in the next chapters.
The process can be summarized in the following steps:

1. Authentication with the social media platform.
2. Data collection through the platform's API.
3. Data cleaning and pre-processing.
4. Modeling and analysis of the data.
5. Presentation of the insights produced by the analysis.
Figure 1.2 shows an overview of the process:
The authentication step is typically performed using the industry standard called Open Authorization (OAuth). The process is three legged, meaning that it involves three actors: a user, consumer (our application), and resource provider (the social media platform). The steps in the process are as follows:
Figure 1.3 shows the OAuth process with references to each of the steps described earlier. The aspect to remember is that the exchange of credentials (username/password) only happens between the user and the resource provider through the steps 3 and 4. All other exchanges are driven by tokens:
From the user's perspective, this apparently complex process happens when the user is visiting our web app and hits the Login with Facebook (or Twitter, Google+, and so on) button. Then the user has to confirm that they are granting privileges to our app, and everything for them happens behind the scenes.
From a developer's perspective, the nice part is that the Python ecosystem has already well-established libraries for most social media platforms, which come with an implementation of the authentication process. As a developer, once you have registered your application with the target service, the platform will provide the necessary authorization tokens for your app. Figure 1.4 shows a screenshot of a custom Twitter app called Intro to Text Mining. On the Keys and Access Tokens configuration page, the developer can find the API key and secret, as well as the access token and access token secret. We'll discuss the details of the authorization for each social media platform in the relevant chapters:
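As a concrete example of what these credentials look like in code, the following sketch (assuming the tweepy library, used here for illustration, and placeholder values copied from the app's configuration page) sets up an authenticated client:

import tweepy

# Placeholder credentials: copy the real values from your app's
# Keys and Access Tokens page
consumer_key = "YOUR-CONSUMER-KEY"
consumer_secret = "YOUR-CONSUMER-SECRET"
access_token = "YOUR-ACCESS-TOKEN"
access_token_secret = "YOUR-ACCESS-TOKEN-SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)  # authenticated client, ready to call the Twitter API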
The data collection, cleaning, and pre-processing steps are also dependent on the social media platform we are dealing with. In particular, the data collection step is tied to the initial authorization as we can only download data that we have been granted access to. Cleaning and pre-processing, on the other hand, are functional to the type of data modeling and analysis that we decide to employ to produce insights on the data.
Back to Figure 1.2, the modeling and analysis is performed by the component labeled ANALYTICS ENGINE. Typical data processing tasks that we'll encounter throughout this book are text mining and graph mining.
Text mining (also referred to as text analytics) is the process of deriving structured information from unstructured textual data. Text mining is applicable to most social media platforms, as the users are allowed to publish content in the form of posts or comments.
Some examples of text mining applications include the following:
Not all these applications are tailored for social media, but the growing amount of textual data available through these platforms makes social media a natural playground for text mining.
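As a small taste of what text mining looks like in code, here is a minimal sketch, using only the standard library and a made-up post, of word frequency counting, which is often the first step in many of these applications:

import re
from collections import Counter

# A made-up social media post
post = "Python is great for data mining, and data mining is fun!"

# Naive tokenization: lowercase the text and keep alphabetic tokens only
tokens = re.findall(r"[a-z']+", post.lower())
frequencies = Counter(tokens)

print(frequencies.most_common(3))
# e.g. [('is', 2), ('data', 2), ('mining', 2)]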
Graph mining is also focused on the structure of the data. Graphs are a simple-to-understand, yet powerful, data structure that is generic enough to be applied to many different data representations. In graphs, there are two main components to consider: nodes, which represent entities or objects, and edges, which represent relationships or connections between nodes. In the context of social media, the obvious use of a graph is to represent the social relationships of our users. More in general, in social sciences, the graph structure used to represent social relationship is also referred to as social network.
In terms of using such data structure within social media, we can naturally represent users as nodes, and their relationships (such as friends of or followers) as edges. In this way, information such as friends of friends who like Python becomes easily accessible just by traversing the graph (that is, walking from one node to the other by following the edges). Graph theory and graph mining offer more options to discover deeper insights that are not as clearly visible as the previous example.
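The following minimal sketch, using the NetworkX library introduced later in this chapter and a handful of made-up users, shows how such a graph can be represented and traversed with a friends-of-friends style query:

import networkx as nx

# Made-up follower relationships: an edge (a, b) means "a follows b"
g = nx.DiGraph()
g.add_edges_from([
    ("alice", "bob"),
    ("bob", "charlie"),
    ("alice", "charlie"),
    ("charlie", "dave"),
])

# Accounts followed by the accounts that alice follows (a two-hop traversal)
two_hops = set()
for followed in g.successors("alice"):
    two_hops.update(g.successors(followed))

print(two_hops)  # {'charlie', 'dave'}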
After a high-level discussion on social media mining, the following section will introduce some of the useful Python tools that are commonly used in data mining projects.
Until now, we've been using the term data mining when referring to problems and techniques that we're going to apply throughout this book. The title of this section, in fact, mentions the term data science. The use of this term has exploded in recent years, especially in business environments, while many academics and journalists have also criticized its use as a buzzword. Meanwhile, academic institutions have started offering courses on data science, and many books and articles have been published on the subject. Rather than having a strong opinion on where we should draw the border between different disciplines, we limit ourselves to observing how, nowadays, there is a general interest in multiple fields, including data science, data mining, data analysis, statistics, machine learning, artificial intelligence, data visualization, and more. The topics we're discussing are interdisciplinary by their own nature, and they all borrow from each other from time to time. This is certainly an amazing time to be working in any of these fields, with a lot of interest from the public and a constant buzz around new advances and interesting projects.
The purpose of this section is to introduce Python as a tool for data science, and to describe part of the Python ecosystem that we're going to use in the next chapters.
Python is one of the most interesting languages for data analytics projects. The following are some of the reasons that make it fit for purpose:
Python has a shallow learning curve due to its elegant syntax. Being a dynamic and interpreted language, it facilitates rapid development and interactive exploration. The ecosystem for data processing is partially described in the following sections, which will introduce the main packages we'll use in this book.
In terms of efficiency, interpreted and high-level languages are not famous for being furiously fast. Tools such as NumPy achieve efficiency by hooking to low-level libraries under the hood, and exposing a friendly Python interface. Moreover, many projects employ the use of Cython, a superset of Python that enriches the language by allowing, among other features, to define strong variable types and compile into C. Many other projects in the Python world are in the process of tackling efficiency issues with the overall goal of making pure Python implementations faster. In this book, we won't dig into Cython or any of these promising projects, but we'll make use of NumPy (especially through other libraries that employ NumPy) for data analysis.
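As a quick sketch of this idea, the same element-wise operation can be written as an explicit Python loop over a list or as a single vectorized NumPy expression, with the latter delegating the work to optimized low-level code:

import numpy as np

values = list(range(1000000))

# Pure Python: an explicit loop over a list
squares_py = [x * x for x in values]

# NumPy: the same operation, vectorized over an array
arr = np.array(values)
squares_np = arr * arr  # the element-wise loop happens in compiled code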
When this book was started, Python 3.5 had just been released and received some attention for some of its latest features, such as improved support for asynchronous programming and the semantic definition of type hints. In terms of usage, Python 3.5 is probably not widely used yet, but it represents the current line of development of the language.
The examples in this book are compatible with Python 3, particularly with versions 3.4+ and 3.5+.
In the never-ending discussion about choosing between Python 2 and Python 3, one of the points to keep in mind is that support for Python 2 will be discontinued in a few years (at the time of writing, the sunset date is 2020). New features are not developed in Python 2, as this branch is only for bug fixes. On the other hand, many libraries are still developed for Python 2 first, and support for Python 3 is added later. For this reason, from time to time, there could be a minor hiccup in terms of compatibility for some libraries, which is usually resolved by the community quite quickly. In general, if there is no strong reason against this choice, the preference should go to Python 3, especially for new green-field projects.
In order to keep the development environment clean, and ease the transition from prototype to production, the suggestion is to use virtualenv to manage a virtual environment and install dependencies. virtualenv is a tool for creating and managing isolated Python environments. By using an isolated virtual environment, developers avoid polluting the global Python environment with libraries that could be incompatible with each other. The tools allow us to maintain multiple projects that require different configurations and easily switch from one to the other. Moreover, the virtual environment can be installed in a local folder that is accessible to users without administrative privileges.
To install virtualenv in the global Python environment in order to make it available to all users, we can use pip from a terminal (Linux/Unix) or command prompt (Windows):
$ [sudo] pip install virtualenv

The sudo command might be necessary on Linux/Unix or macOS if our current user doesn't have administrator privileges on the system.
If a package is already installed, it can be upgraded to the latest version:
$ pip install --upgrade [package name]

Since Python 3.4, the pip tool is shipped with Python. Previous versions require a separate installation of pip as explained on the project page (https://github.com/pypa/pip). The tool can also be used to upgrade itself to the latest version:

$ pip install --upgrade pip

Once virtualenv is globally available, for each project, we can define a separate Python environment where dependencies are installed in isolation, without tampering with the global environment. In this way, tracking the required dependencies of a single project is extremely easy.
In order to set up a virtual environment, follow these steps:
$ mkdir my_new_project    # create new project folder
$ cd my_new_project       # enter project folder
$ virtualenv my_env       # set up custom virtual environment

This will create a my_env subfolder, which is also the name of the virtual environment we're creating, in the current directory. Inside this subfolder, we have all the necessary tools to create the isolated Python environment, including the Python binaries and the standard library. In order to activate the environment, we can type the following command:

$ source my_env/bin/activate

Once the environment is active, the following will be visible on the prompt:

(my_env)$

Python packages can be installed for this particular environment using pip:

(my_env)$ pip install [package-name]

All the new Python libraries installed with pip when the environment is active will be installed into my_env/lib/python{VERSION}/site-packages. Notice that being a local folder, we won't need administrative access to perform this command.
When we want to deactivate the virtual environment, we can simply type the following command:
$ deactivate

The process described earlier should work for the official Python distributions that are shipped (or available for download) with your operating system.
There is also one more option to consider, called conda (http://conda.pydata.org/), which is gaining some traction in the scientific community as it makes the dependency management quite easy. Conda is an open source package manager and environment manager for installing multiple versions of software packages (and related dependencies), which makes it easy to switch from one version to the other. It supports Linux, macOS, and Windows, and while it was initially created for Python, it can be used to package and distribute any software.
There are mainly two distributions that ship with conda: the batteries-included version, Anaconda, which comes with approximately 100 packages for scientific computing already installed, and the lightweight version, Miniconda, which simply comes with Python and the conda installer, without external libraries.
If you're new to Python, have some time for the bigger download and disk space to spare, and don't want to install all the packages manually, you can get started with Anaconda. For Windows and macOS, Anaconda is available with either a graphical or command-line installer. Figure 1.5 shows a screen capture of the installation procedure on a macOS. For Linux, only the command-line installer is available. In all cases, it's possible to choose between Python 2 and Python 3. If you prefer to have full control of your system, Miniconda will probably be your favorite option:
Once you've installed your version of conda, in order to create a new conda environment, you can use the following command:
$ conda create --name my_env python=3.4 # or favorite version

The environment can be activated with the following command:

$ conda activate my_env

Similar to what happens with virtualenv, the environment name will be visible in the prompt:

(my_env)$

New packages can be installed for this environment with the following command:

$ conda install [package-name]

Finally, you can deactivate an environment by typing the following command:

$ conda deactivate

Another nice feature of conda is the ability to install packages from pip as well, so if a particular library is not available via conda install, or it's not been updated to the latest version we need, we can always fall back to the traditional Python package manager while using a conda environment.
If not specified otherwise, by default, conda will look up for packages on https://anaconda.org, while pip makes use of the Python Package Index (PyPI in short, also known as
