Become an efficient data science practitioner by understanding Python's key concepts
If you are an aspiring data scientist and you have at least a working knowledge of data analysis and Python, this book will get you started in data science. Data analysts with experience of R or MATLAB will also find the book to be a comprehensive reference to enhance their data manipulation and machine learning skills.
Fully expanded and upgraded, the second edition of Python Data Science Essentials takes you through all you need to know to succeed in data science using Python. Get modern insight into the core of Python data, including the latest versions of Jupyter notebooks, NumPy, pandas, and scikit-learn. Look beyond the fundamentals with beautiful data visualizations with Seaborn and ggplot, web development with Bottle, and even the new frontiers of deep learning with Theano and TensorFlow.
Dive into building your essential Python 3.5 data science toolbox, using a single-source approach that will allow you to work with Python 2.7 as well. Get to grips fast with data munging and preprocessing, and all the techniques you need to load, analyse, and process your data. Finally, get a complete overview of principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.
The book is structured as a data science project. You will always benefit from clear code and simplified examples to help you understand the underlying mechanics and real-world datasets.
Page count: 448
Year of publication: 2016
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2015
Second edition: October 2016
Production reference: 1211016
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78646-213-8
www.packtpub.com
Authors
Alberto Boschetti
Luca Massaron
Copy Editor
Vikrant Phadke
Reviewer
Zacharias Voulgaris
Project Coordinator
Nidhi Joshi
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Namrata Patil
Indexer
Aishwarya Gangawane
Content Development Editor
Mayur Pawanikar
Graphics
Disha Haria
Technical Editor
Vivek Arora
Production Coordinator
Arvindkumar Gupta
Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a PhD in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP), behavioral analysis, and machine learning to distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.
I would like to thank my family, my friends, and my colleagues. Also, a big thanks to the open source community.
Luca Massaron is a data scientist and marketing research director specializing in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience of solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top ten Kaggler, he has always been very passionate about every aspect of data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, Luca believes that a lot can be achieved in data science just by doing the essentials.
To Yukiko and Amelia, for their loving patience. "Roads go ever ever on, under cloud and under star, yet feet that wandering have gone turn at last to home afar".
Zacharias Voulgaris is a data scientist and technical author specializing in data science books. He has an engineering and management background, with post-graduate studies in information systems and machine learning. Zacharias has worked as a research fellow at Georgia Tech, investigating and applying machine learning technologies to real-world problems, as an SEO manager in an e-marketing company in Europe, as a program manager in Microsoft, and as a data scientist at US Bank and at G2 Web Services.
Dr. Voulgaris has also authored technical books, the most notable of which is Data Scientist - the definitive guide to becoming a data scientist (Technics Publications), and his newest book, Julia for Data Science (Technics Publications), was released during the summer of 2016. He has also written a number of data-science-related articles on blogs and participates in various data science/machine learning meetup groups. Finally, he has provided technical editorial aid in the book Python Data Science Essentials (Packt), by the same authors as this book.
I would very much like to express my gratitude to the authors of the book for giving me the opportunity to contribute to this project. Also, I'd like to thank Bastiaan Sjardin for introducing me to them and to the world of technical editing. It's been a privilege working with all of you.
"A journey of a thousand miles begins with a single step."
--Laozi (604 BC - 531 BC)

Data science is a relatively new knowledge domain that requires the successful integration of linear algebra, statistical modeling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.
The Python programming language, having conquered the scientific community during the last decade, is now an indispensable tool for the data science practitioner and a must-have tool for every aspiring data scientist. Python offers you a fast, reliable, cross-platform, mature environment for data analysis, machine learning, and algorithmic problem solving. Whatever stopped you before from mastering Python for data science applications will be easily overcome by our easy, step-by-step, example-oriented approach, which will help you apply the most straightforward and effective Python tools to both demonstrative and real-world datasets.

As the second edition of Python Data Science Essentials, this book offers updated and expanded content. Based on the recent Jupyter Notebooks (incorporating interchangeable kernels, a truly polyglot data science system), it incorporates all the main recent improvements in NumPy, pandas, and Scikit-learn. Additionally, it offers new content in the form of deep learning (by presenting Keras, based on both Theano and TensorFlow), beautiful visualizations (Seaborn and ggplot), and web deployment (using Bottle).

This book starts by showing you how to set up your essential data science toolbox in Python's latest version (3.5), using a single-source approach that makes the book's code easily reusable on Python 2.7 as well. Then, it guides you through all the data munging and preprocessing phases, explaining the core data science activities related to loading data, transforming and fixing it for analysis, and exploring and processing it. Finally, the book completes its overview by presenting the principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.
Chapter 1, First Steps, introduces Jupyter notebooks and demonstrates how you can access the data used in the tutorials.
Chapter 2, Data Munging, gives an overview of the data science pipeline and explores all the key tools for handling and preparing data before you apply any learning algorithm and set up your hypothesis experimentation schedule.
Chapter 3, The Data Pipeline, discusses all the operations that can potentially improve or even boost your results.
Chapter 4, Machine Learning, delves into the principal machine learning algorithms offered by the Scikit-learn package, such as, among others, linear models, support vector machines, ensembles of trees, and unsupervised techniques for clustering.
Chapter 5, Social Network Analysis, introduces graphs, an interesting deviation from the flat predictors/target matrices. Graphs are quite a hot topic in data science now. Expect to delve into very complex and intricate networks!
Chapter 6, Visualization, Insights, and Results, the concluding chapter, introduces you to the basics of visualization with Matplotlib, how to operate EDA with pandas, how to achieve beautiful visualizations with Seaborn and Bokeh, and also how to set up a web server to provide information on demand.
Appendix, Strengthen Your Python Foundations, covers a few Python examples and tutorials focused on the key features of the language that are indispensable in order to work on data science projects.
Python and all the data science tools mentioned in the book, from IPython to Scikit-learn, are free of charge and can be freely downloaded from the Internet. To run the code that accompanies the book, you need a computer that uses Windows, Linux, or Mac OS operating systems. The book will introduce you step-by-step to the process of installing the Python interpreter and all the tools and data that you need to run the examples.
If you are an aspiring data scientist and you have at least a working knowledge of data analysis and Python, this book will get you started in data science. Data analysts with experience in R or MATLAB will also find the book to be a comprehensive reference to enhance their data manipulation and machine learning skills.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Science-Essentials-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/PythonDataScienceEssentialsSecondEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data analysis-specific language such as MATLAB or R.
This book will delve directly into Python for data science, providing you with a straight and fast route to solving various data science problems using Python and its powerful data analysis and machine learning packages. The code examples that are provided in this book don't require you to be a master of Python. However, they will assume that you at least know the basics of Python scripting, including data structures such as lists and dictionaries, and the workings of class objects. If you don't feel confident about these subjects or have minimal knowledge of the Python language, we suggest that you take an online tutorial before reading this book. There are many possible choices, but you could start with the official beginner's guide from the Python Foundation or go directly to the free Codecademy course at https://www.codecademy.com/learn/python. Using Codecademy's tutorial, or any other alternative you may find useful, in a matter of a few hours of study you should acquire all the building blocks that will ensure you enjoy this book to the fullest. We have also prepared a tutorial of our own, which can be found in the last part of this book, to complement the two aforementioned free resources.
In any case, don't be intimidated by our starting requirements; mastering Python enough for data science applications isn't as arduous as you may think. It's just that we have to assume some basic knowledge on the reader's part because our intention is to go straight to the point of doing data science without having to explain too much about the general aspects of the language that we will be using.
Are you ready, then? Let's start!
In this introductory chapter, we will work out the basics to set off in full swing and go through the following topics:
Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modeling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.
Data science is a new domain and you have to take into consideration that currently its frontiers are still somewhat blurred and dynamic. Since data science is made of various constituent sets of disciplines, please also keep in mind that there are different profiles of data scientists depending on their competencies and areas of expertise.
In such a situation, what can be the best tool of the trade that you can learn and effectively use in your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a quick start.
In addition, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, Python really completes your data scientist skill set. This multipurpose language is suitable for both development and production alike; it can handle small- to large-scale data problems and it is easy to learn and grasp no matter what your background or experience is.
Created in 1991 as a general-purpose, interpreted, and object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows for countless fast experimentations, easy theory development, and prompt deployment of scientific applications.
At present, the core Python characteristics that render it an indispensable data science tool are as follows:
First, let's proceed to introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with.
Python is an open source, object-oriented, and cross-platform programming language. Compared to some of its direct competitors (for instance, C++ or Java), Python is very concise. It allows you to build a working software prototype in a very short time. Yet it has become the most used language in the data scientist's toolbox not just because of that. It is also a general-purpose language, and it is very flexible due to a variety of available packages that solve a wide spectrum of problems and necessities.
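As a small, hedged illustration of that conciseness (the example itself is ours, not taken from any particular package), a single line of idiomatic Python can filter and transform a collection of values that would require an explicit loop in many other languages:

>>> squares = [x ** 2 for x in range(10) if x % 2 == 0]   # keep even numbers, square them
>>> squares
[0, 4, 16, 36, 64]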
There are two main branches of Python: 2.7.x and 3.x. At the time of writing this second edition of the book, the Python Foundation (https://www.python.org/) is offering downloads for Python versions 2.7.11 and 3.5.1. Although the third version is the newest, the older one is still the most used version in the scientific area, since a few packages (check the website at http://py3readiness.org/ for a compatibility overview) won't yet run on the newer version.
In addition, there is no immediate backward compatibility between Python 3 and 2. In fact, if you try to run some code developed for Python 2 with a Python 3 interpreter, it may not work. Major changes have been made to the newest version, and that has affected past compatibility. Some data scientists, having built most of their work on Python 2 and its packages, are reluctant to switch to the new version.
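For instance, two of the best-known differences are that print became a function and that the / operator performs true division in Python 3. The following short session, run on a Python 3 interpreter, hints at why Python 2 code may break:

>>> print("Hello, data science")    # print is a function in Python 3 (a statement in Python 2)
Hello, data science
>>> 3 / 2                           # true division in Python 3; the same expression returns 1 in Python 2
1.5
>>> 3 // 2                          # floor division behaves the same on both branches
1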
In this second edition of the book, we intend to address a growing audience of data scientists, data analysts, and developers, who may not have such a strong legacy with Python 2. Thus, we agreed that it would be better to work with Python 3 rather than the older version. We suggest using a version such as Python 3.4 or above. After all, Python 3 is the present and the future of Python. It is the only version that will be further developed and improved by the Python Foundation and it will be the default version of the future on many operating systems.
Anyway, if you are currently working with version 2 and you prefer to keep on working with it, you can still use this book and all its examples. In fact, for the most part, our code will simply work on Python 2 provided that it is preceded by these imports:
from __future__ import (absolute_import, division, print_function, unicode_literals)
from builtins import *
from future import standard_library
standard_library.install_aliases()

The from __future__ import commands should always occur at the beginning of your scripts or else you may experience Python reporting an error.
As described in the Python-future website (http://python-future.org/), these imports will help convert several Python 3-only constructs to a form compatible with both Python 3 and Python 2 (and in any case, most Python 3 code should just simply work on Python 2 even without the aforementioned imports).
In order to run the above imports successfully, if the future package is not already available on your system, you should install it (version >= 0.15.2) using the following command, to be executed from a shell:
$> pip install -U future

If you're interested in understanding the differences between Python 2 and Python 3 further, we recommend reading the wiki page offered by the Python Foundation itself at: https://wiki.python.org/moin/Python2orPython3.
Novice data scientists who have never used Python (who likely don't have the language readily installed on their machines) need to first download the installer from the main website of the project, www.python.org/downloads/, and then install it on their local machine.
This section provides you with full control over what can be installed on your machine. This is very useful when you have to set up single machines to deal with different tasks in data science. Anyway, please be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution, such as Anaconda, will lessen the burden of the installation procedure, and it may be well suited for getting started and learning because it saves you time and sometimes even trouble, though it will put a large number of packages (most of which we won't use) on your computer all at once. Therefore, if you want to start immediately with an easy installation procedure, just skip this part and proceed to the section, Scientific distributions.
This being a multiplatform programming language, you'll find installers for machines that either run on Windows or Unix-like operating systems.
Remember that some of the latest versions of most Linux distributions (such as CentOS, Fedora, Red Hat Enterprise, Ubuntu, and some other minor ones) have Python 2 packaged in the repository. In such a case and in the case that you already have a Python version on your computer (since our examples run on Python 3), you first have to check what version you are exactly running. To do such a check, just follow these instructions:
To clarify the operations we have just mentioned, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>> (REPL is an acronym that stands for Read-Eval-Print-Loop, a simple interactive environment which takes a user's single commands from an input line in a shell and returns the results by printing).
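For example, as a minimal sketch of such a version check, you can query the interpreter either from the shell or from the Python REPL (the version numbers shown here are illustrative; your output will reflect whatever is installed on your machine):

$> python --version
Python 3.5.1
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=5, micro=1, releaselevel='final', serial=0)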
Python won't come bundled with all you need, unless you take a specific premade distribution. Therefore, to install the packages you need, you can use either pip or easy_install. Both of these tools run in the command line and make the process of installation, upgrading, and removal of Python packages a breeze. To check which tools have been installed on your local machine, run the following command:
$> pip

To install pip, follow the instructions given at https://pip.pypa.io/en/latest/installing/.
Alternatively, you can also run this command:
$> easy_install

If both of these commands end up with an error, you need to install one of them. We recommend that you use pip because it is considered an improvement over easy_install. Moreover, easy_install is going to be dropped in the future, and pip has important advantages over it. It is preferable to install everything using pip because:
Using easy_install in spite of the advantages of pip makes sense if you are working on Windows, because pip won't always install pre-compiled binary packages. Sometimes it will try to build the package's extensions directly from C source, thus requiring a properly configured compiler (and that's not an easy task on Windows). This depends on whether the package is distributed as eggs (Python metadata files for distributing code as bundles, whose binaries pip cannot use directly, so it needs to build from their source code) or as wheels, the new standard for distributing Python code bundles (in this last case, pip can install binaries if available, as explained here: http://pythonwheels.com/). Instead, easy_install will always install available binaries from eggs and wheels. Therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day (at some price anyway, as we just mentioned in the list).
The most recent versions of Python should already have pip installed by default. Therefore, you may have it already installed on your system. If not, the safest way is to download the get-pip.py script from https://bootstrap.pypa.io/get-pip.py and then run it using the following:
$> python get-pip.py

The script will also install the setup tool from https://pypi.python.org/pypi/setuptools, which also contains easy_install.
You're now ready to install the packages you need in order to run the examples provided in this book. To install the <package-name> generic package, you just need to run this command:
$> pip install <package-name>

Alternatively, you can run the following command:
$> easy_install <package-name>

Note that on some systems, pip might be named pip3 and easy_install might be named easy_install-3, to stress the fact that both operate on packages for Python 3. If you're unsure, check the version of Python that pip is operating on with:
$> pip -V

For easy_install, the command is slightly different:
$> easy_install --version

After this, the <package-name> package and all its dependencies will be downloaded and installed. If you're not certain whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an ImportError, you can conclude that the package has not been installed.
This is what happens when the NumPy library has been installed:
>>> import numpy

This is what happens if it's not installed:
>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named numpy

In the latter case, you'll need to first install it through pip or easy_install.
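If you prefer a programmatic check, one option (a minimal sketch of ours, using NumPy purely as an example) is to wrap the import in a try/except block:

>>> try:
...     import numpy
...     print('NumPy is installed')
... except ImportError:
...     print('NumPy is missing: install it with pip or easy_install')
...
NumPy is installed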
Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn.
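To make the distinction concrete, the following pair of commands (a small illustrative sketch; the version string shown is just an example) installs the Scikit-learn package from the shell and then imports its module from Python:

$> pip install scikit-learn
>>> import sklearn
>>> sklearn.__version__    # the package name and the module name differ
'0.18'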
Finally, to search and browse the Python packages available for Python, look at https://pypi.python.org/pypi.
More often than not, you will find yourself in a situation where you have to upgrade a package because either the new version is required by a dependency or it has additional features that you would like to use. First, check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example for numpy:
>>> import numpy
>>> numpy.__version__ # 2 underscores before and after
'1.9.2'

Now, if you want to update it to a newer release, say the 1.11.0 version, you can run the following command from the command line:
$> pip install -U numpy==1.11.0

Alternatively, you can use the following command:
$> easy_install --upgrade numpy==1.11.0

Finally, if you're interested in upgrading it to the latest available version, simply run this command:
$> pip install -U numpy

You can alternatively run the following command:
$> easy_install --upgrade numpy

As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need. Sometimes, the installation procedures may not go as smoothly as you'd hoped, requiring extra steps such as installing additional executables (for example, gFortran for SciPy on Linux boxes) or libraries (such as libfreetype for Matplotlib). Usually, the backtrace of the error produced during the failed installation is clear enough to understand what went wrong and to take the correct resolving action, but at other times the error is tricky or subtle, holding up the user for hours without advancing in the process.
If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use a scientific Python distribution. Apart from Python, these distributions also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE. A few of them are very well known among data scientists, and in the sections that follow, you will find some of the key features of each of them.
We suggest that you first promptly download and install a scientific distribution such as Anaconda (which is the most complete one) and, after practicing the examples in the book, decide whether to fully uninstall the distribution and set up Python alone, accompanied by just the packages you need for your projects.
Anaconda (http://continuum.io/downloads) is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, comprising NumPy, SciPy, pandas, Jupyter, Matplotlib, Scikit-learn, and NLTK. It's a cross-platform distribution (Windows, Linux, and Mac OS X) that can be installed on machines with other existing Python distributions and versions. Its base version is free, while add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on the website, Anaconda's goal is to provide an enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing.
If you've decided to install an Anaconda distribution, you can take advantage of the conda binary installer we mentioned previously. Anyway, conda is an open source package management system, and consequently it can be installed separately from an Anaconda distribution.
You can test immediately whether conda is available on your system. Open a shell and type:
$> conda -V

If conda is available, its version will be displayed; otherwise an error will be reported. If conda is not available, you can quickly install it on your system by going to http://conda.pydata.org/miniconda.html and installing the Miniconda software suitable for your computer. Miniconda is a minimal installation that only includes conda and its dependencies.
conda can help you manage two tasks: installing packages and creating virtual environments. In this section, we will explore how conda can help you easily install most of the packages you may need in your data science projects.
Before starting, please check that you have the latest version of conda at hand:
$> conda update conda

Now you can install any package you need. To install the <package-name> generic package, you just need to run the following command:
$> conda install <package-name>

You can also install a particular version of the package just by pointing it out:
$> conda install <package-name>=1.11.0

Similarly, you can install multiple packages at once by listing all their names:
$> conda install <package-name-1> <package-name-2>

If you just need to update a package that you previously installed, you can keep on using conda:
$> conda update <package-name>

You can update all the available packages simply by using the --all argument:
$> conda update --all

Finally, conda can also uninstall packages for you:
$> conda remove <package-name>

If you would like to know more about conda, you can read its documentation at http://conda.pydata.org/docs/index.html. In summary, its main advantage is that it handles binaries even better than easy_install (by always providing a successful installation on Windows without any need to compile the packages from source), but without easy_install's problems and limitations. With conda, packages are easy to install (and installation is always successful), update, and even uninstall. On the other hand, conda cannot install directly from a git server (so it cannot access the latest version of many packages under development) and it doesn't cover all the packages available on PyPI as pip itself does.
Enthought Canopy (https://www.enthought.com/products/canopy/) is a Python distribution by Enthought Inc. It includes more than 200 preinstalled packages, such as NumPy, SciPy, Matplotlib, Jupyter, and pandas (more on these packages later). This distribution is targeted at engineers, data scientists, quantitative and data analysts, and enterprises. Its base version is free (and is named Canopy Express), but if you need advanced features, you have to buy the full version. It's a multiplatform distribution and its command-line install tool is canopy_cli.
PythonXY (http://python-xy.github.io/) is a free, open source Python distribution maintained by the community. It includes a number of packages, which include NumPy, SciPy, NetworkX, Jupyter, and Scikit-learn. It also includes Spyder, an interactive development environment inspired by the MATLAB IDE. The distribution is free. It works only on Microsoft Windows, and its command-line installation tool is pip.
WinPython (http://winpython.sourceforge.net/) is also a free, open-source Python distribution maintained by the community. It is designed for scientists, and includes many packages such as NumPy, SciPy, Matplotlib, and Jupyter. It also includes Spyder as an IDE. It is free and portable. You can put WinPython into any directory, or even into a USB flash drive, and at the same time maintain multiple copies and versions of it on your system. It works only on Microsoft Windows, and its command-line tool is the WinPython Package Manager (WPPM).
No matter whether you have chosen to install a standalone Python or a scientific distribution, you may have noticed that you are actually bound to the Python version you have installed on your system. The only exception, for Windows users, is the WinPython distribution, since it is a portable installation and you can have as many different installations as you need.
A simple solution to break free of such a limitation is to use virtualenv, which is a tool to create isolated Python environments. That means that, by using different Python environments, you can easily achieve these things:
You can find documentation about virtualenv at http://virtualenv.readthedocs.io/en/stable/, though we are going to provide you with all the directions you need to start using it immediately. In order to take advantage of virtualenv, you have first to install it on your system:
$> pip install virtualenv

After the installation completes, you can start building your virtual environments. Before proceeding, you have to take a few decisions:
After deciding on the Python version, the linking to existing global packages, and the relocatability of the virtual environment, in order to start, you just launch the command from a shell, declaring the name you would like to assign to your new environment:
$> virtualenv clone

virtualenv will just create a new directory using the name you provided, in the path from which you launched the command. To start using it, you just enter the directory and type activate:
$> cd clone
$> activate

At this point, you can start working on your separated Python environment, installing packages and working with code.
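Two practical notes, offered as a brief sketch rather than an exhaustive reference: the decisions listed above map onto standard virtualenv command-line options, and on Linux and Mac OS X the activation script lives in the environment's bin subdirectory and must be sourced:

$> virtualenv -p /usr/bin/python3.4 --system-site-packages clone   # pick the interpreter and link global packages
$> source clone/bin/activate                                       # activation on Linux/Mac OS X (no cd needed)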
If you need to install multiple packages at once, you may want to use a special pip command, pip freeze, which will list all the packages (and their versions) you have installed on your system. You can record the entire list in a text file with this command:
$> pip freeze > requirements.txt

After saving the list in a text file, just take it into your virtual environment and install all the packages in a breeze with a single command:
$> pip install -r requirements.txt

Each package will be installed according to the order in the list (packages are listed in a case-insensitive sorted order). If a package requires other packages that are later in the list, that's not a big deal because pip automatically manages such situations. So if your package requires NumPy and NumPy is not yet installed, pip will install it first.
When you're finished installing packages and using your environment for scripting and experimenting, in order to return to your system defaults, just issue this command:
$> deactivate

If you want to remove the virtual environment completely, after deactivating and getting out of the environment's directory, you just have to get rid of the environment's directory itself with a recursive deletion. For instance, on Windows you just do this:
$> rd /s /q clone

On Linux and Mac, the command will be:
$> rm -r -f clone

If you are working extensively with virtual environments, you should consider using virtualenvwrapper, which is a set of wrappers for virtualenv that helps you manage multiple virtual environments easily. It can be found at http://bitbucket.org/dhellmann/virtualenvwrapper. If you are operating on a Unix system (Linux or OS X), another solution worth mentioning is pyenv (which can be found at https://github.com/yyuu/pyenv), which lets you set your main Python version, allows installation of multiple versions, and creates virtual environments. Its peculiarity is that it does not depend on Python being installed and it works perfectly at the user level (no need for sudo commands).
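As a quick taste of virtualenvwrapper's workflow (a sketch using the command names documented by that project; its installation is not covered here):

$> mkvirtualenv clone      # create and activate a new environment in one step
$> workon clone            # switch to (or reactivate) it later
$> deactivate              # leave the environment
$> rmvirtualenv clone      # remove it completely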
If you have installed the Anaconda distribution, or you have tried conda using a Miniconda installation, you can also take advantage of the conda command to run virtual environments as an alternative to virtualenv. Let's see in practice how to use conda for that. We can check what environments we have available like this:
$> conda info -e

This command will report to you what environments you can use on your system based on conda. Most likely, your only environment will be just root, pointing to your Anaconda distribution's folder.
As an example, we can create an environment based on Python version 3.4, having all the necessary Anaconda-packaged libraries installed. That makes sense, for instance, for using the package Theano together with Python 3 on Windows (because of an issue we will explain shortly). In order to create such an environment, just do this:
$> conda create -n python34 python=3.4 anaconda

The command asks for a particular Python version (3.4) and requires the installation of all packages available in the Anaconda distribution (the argument anaconda). It names the environment python34 using the argument -n. The complete installation will take a while, given the large number of packages in the Anaconda installation. After having completed all of the installation, you can activate the environment:
$> activate python34

If you need to install additional packages to your environment, when activated, you just do the following:
$> conda install -n python34 <package-name1> <package-name2>

That is, you make the list of the required packages follow the name of your environment. Naturally, you can also use pip install, as you would do in a virtualenv environment.
You can also use a file instead of listing all the packages by name yourself. You can create a list in an environment using the list argument and piping the output to a file:
$> conda list -e > requirements.txt

Then, in your target environment, you can install the entire list using:
$> conda install --file requirements.txt

You can even create an environment, based on a requirements list:
$> conda create -n python34 python=3.4 --file requirements.txt

Finally, after having used the environment, to close the session, you simply do this:
$> deactivate

Contrary to virtualenv, there is a specialized argument in order to completely remove an environment from your system:
$> conda remove -n python34 --all

We mentioned that the two most relevant characteristics of Python are its ability to integrate with other languages and its mature package system, which is well embodied by PyPI (see the Python Package Index at https://pypi.python.org/pypi), a common repository for the majority of Python open source packages that is constantly maintained and updated.
The packages that we are now going to introduce are strongly analytical and they will constitute a complete data science toolbox. All the packages are made up of extensively tested and highly optimized functions for both memory usage and performance, ready to achieve any scripting operation with successful execution. A walkthrough on how to install them is provided in the following section.
Partially inspired by similar tools present in R and MATLAB environments, we will together explore how a few selected Python commands can allow you to efficiently handle data and then explore, transform, experiment, and learn from it without having to write too much code or reinvent the wheel.
NumPy, which is Travis Oliphant's creation, is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to perform a multiplicity of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterized by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems:
As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:
import numpy as np

We will be doing this throughout the course of this book.
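As a small taste of what vectorized array operations look like in practice (a sketch of ours; the exact spacing of the printed output may differ slightly between NumPy versions):

>>> import numpy as np
>>> a = np.array([[1., 2., 3.], [4., 5., 6.]])   # a 2x3 array (a matrix)
>>> a * 2 + 1                                    # element-wise arithmetic, no explicit loop
array([[  3.,   5.,   7.],
       [  9.,  11.,  13.]])
>>> a.mean(axis=0)                               # column means, computed in compiled code
array([ 2.5,  3.5,  4.5])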
An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more:
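For instance, a minimal sketch using SciPy's linear algebra routines, solving a small linear system, looks like this:

>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[3., 1.], [1., 2.]])
>>> b = np.array([9., 8.])
>>> linalg.solve(A, b)     # solves the linear system A x = b
array([ 2.,  3.])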
The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy's arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be able to easily and smoothly load data from a variety of sources. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize your data at will:
Conventionally, pandas is imported as pd:

import pandas as pd
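As a quick illustration of the DataFrame structure (the column names and values here are invented for the example):

>>> import pandas as pd
>>> df = pd.DataFrame({'city': ['London', 'Rome'], 'population': [8673713, 2874038]})
>>> df['population'].mean()      # label-based access and built-in aggregation
5773875.5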
