Learn to build powerful machine learning models quickly and deploy large-scale predictive applications
This book is for anyone who intends to work with large and complex data sets. Familiarity with basic Python and machine learning concepts is recommended. Working knowledge in statistics and computational mathematics would also be helpful.
Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. But finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with a high predictive accuracy.
Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework in Hadoop and Spark, using Python.
This efficient and practical title is stuffed full of the techniques, tips and tools you need to ensure your large scale Python machine learning runs swiftly and seamlessly.
Large Scale Machine Learning with Python tackles a different issue from what is currently available on the market. Those working with Hadoop clusters and in data-intensive environments can now learn effective ways of building powerful machine learning models from prototype to production.
This book is written in a style that programmers from other languages (R, Julia, Java, Matlab) can follow.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2016
Production reference: 1270716
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-721-5
www.packtpub.com
Authors
Bastiaan Sjardin
Luca Massaron
Alberto Boschetti
Reviewers
Oleg Okun
Kai Londenberg
Commissioning Editor
Akram Hussain
Acquisition Editor
Sonali Vernekar
Content Development Editor
Sumeet Sawant
Technical Editor
Manthan Raja
Copy Editor
Tasneem Fatehi
Project Coordinator
Shweta H Birwatkar
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Kirk D'Penha
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has an MSc degree in cognitive science from the University of Leiden, together with on-campus courses at the Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA on Coursera for the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.
Luca Massaron is a data scientist and marketing research director who specializes in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience in solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top-ten Kaggler, he has always been very passionate about everything regarding data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, he believes that a lot can be achieved in data science just by doing the essentials.
I would like to thank Yukiko and Amelia for their continued support, help, and loving patience.
Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a PhD in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges that span from natural language processing (NLP) and machine learning to distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.
Oleg Okun is a machine learning expert and an author/editor of four books, numerous journal articles, and conference papers. He has been working professionally for more than a quarter of a century. During this time, Oleg was employed in both academia and industry in his mother country, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/offline marketing analytics, and credit-scoring analytics. He is interested in all aspects of distributed machine learning and the Internet of Things. Oleg currently lives and works in Hamburg, Germany, and is about to start a new job as a chief architect of intelligent systems. His favorite programming languages are Python, R, and Scala.
I would like to express my deepest gratitude to my parents for everything that they have done for me.
Kai Londenberg is a data science and big data expert with many years of professional experience. Currently, he is working as a data scientist at the Volkswagen Data Lab. Before that, he had the pleasure of being the lead data scientist at Searchmetrics, where Luca Massaron was a member of his team. Kai enjoys working with cutting-edge technologies, and while he is a pragmatic machine learning practitioner and software developer at work, he always enjoys staying up-to-date with the latest technologies and research in machine learning, AI, and related fields. You can find him on LinkedIn at https://www.linkedin.com/in/kailondenberg.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
"The nice thing about having a brain is that one can learn, that ignorance can be supplanted by knowledge, and that small bits of knowledge can gradually pile up into substantial heaps."
--Douglas Hofstadter

Machine learning is often referred to as the part of artificial intelligence that actually works. Its aim is to find a function based on an existing set of data (training set) in order to predict outcomes of a previously unseen dataset (test set) with the highest possible correctness. This occurs either in the form of labels and classes (classification problems) or in the form of a continuous value (regression problems). Tangible examples of machine learning in real-life applications range from predicting future stock prices to classifying the gender of an author from a set of documents. Throughout this book, the most important machine learning concepts, together with methods suitable for larger datasets, will be made clear to the reader, thanks to practical examples in Python. We will look at supervised learning (classification and regression), as well as unsupervised learning (such as Principal Component Analysis (PCA), clustering, and topic modeling) that has been found to be applicable to larger datasets.
Large IT corporations such as Google, Facebook, and Uber have generated a lot of buzz by claiming that they successfully applied such machine learning methods at a large scale. With the onset and availability of big data, the demand for scalable machine learning solutions has grown exponentially, and many other companies and individuals have started aspiring to reap the fruits of hidden correlations in big datasets. Unfortunately, most learning algorithms don't scale well, straining CPUs and memory either on a desktop computer or on a larger computing cluster. Even now that big data has passed the peak of its hype, scalable machine learning solutions are not plentiful.
Frankly, we still need to work around a lot of bottlenecks even with datasets we would hardly categorize as big data (think of datasets up to 2GB or even smaller). The mission of this book is to provide methods (and sometimes unconventional ones) to apply the most powerful open source machine learning methods at a larger scale, without the need for expensive enterprise solutions or large computing clusters. Throughout this book, we will use Python and some other readily available solutions that integrate well in scalable machine learning pipelines. Reading the book is a journey that will redefine what you knew about machine learning, setting you on the starting blocks of real big data analysis.
Chapter 1, First Steps to Scalability, sets the problem of scalable machine learning under the right perspective and familiarizes you with the tools that we will be using in this book.
Chapter 2, Scalable Learning in Scikit-learn, discusses strategies based on stochastic gradient descent (SGD) that mitigate memory consumption, building on the theme of out-of-core learning. We will also cover data preparation techniques, such as the hashing trick, that can handle a variety of data.
Chapter 3, Fast-Learning SVMs, covers streaming algorithms that are capable of discovering non-linearity in the form of support vector machines. We will present alternatives to Scikit-learn, such as LIBLINEAR and Vowpal Wabbit, which, although operating as external shell commands, can be easily wrapped and directed by Python scripts.
Chapter 4, Neural Networks and Deep Learning, provides useful tactics for applying deep neural networks within the Theano framework together with large-scale applications with H2O. Even though it is a hot topic, it can be quite a challenge to apply it successfully, let alone provide scalable solutions. We will also resort to unsupervised pre-training with autoencoders with the theanets package.
Chapter 5, Deep Learning with TensorFlow, covers interesting deep learning techniques together with an online method for neural networks. Although TensorFlow is only in its infancy, the framework provides elegant machine learning solutions. We will also utilize the convolutional neural network capabilities of Keras within the TensorFlow environment.
Chapter 6, Classification and Regression Trees at Scale, explains scalable solutions for random forests, gradient boosting, and XGBoost. CART, an acronym for classification and regression trees, is a machine learning method usually applied in the framework of ensemble methods. We will also provide examples of a large-scale application using H2O.
Chapter 7, Unsupervised Learning at Scale, dives into unsupervised learning, as we will cover PCA, cluster analysis, and topic modeling using the right approach for scaling them up.
Chapter 8, Distributed Environments – Hadoop and Spark, teaches us how to set up Spark within a virtual machine environment, shifting from a single machine to a computational network paradigm. As Python can easily glue and power up our efforts on a cluster of machines, it becomes a piece of cake to leverage the power of a Hadoop cluster.
Chapter 9, Practical Machine Learning with Spark, gets into action with Spark, teaching all the essentials for starting immediately to manipulate data and build predictive models on large datasets.
Appendix, Introduction to GPUs and Theano, will cover the basics of Theano and GPU-computation. It will help you install and prepare your environment for using Theano on the GPU, if your system allows it.
The execution of the code examples provided in this book requires an installation of Python 2.7 or higher versions on macOS, Linux, or Microsoft Windows.
The examples throughout the book will make frequent use of Python's essential libraries, such as SciPy, NumPy, Scikit-learn, and StatsModels, and to a minor extent, matplotlib and pandas, for scientific and statistical computing. We will also make use of an out-of-core cloud computing application called H2O.
This book is highly dependent on Jupyter and its Notebooks powered by the Python kernel. We will use its most recent version, 4.1, for this book.
The first chapter will provide you with all the step-by-step instructions and some useful tips to set up your Python environment, these core libraries, and all the necessary tools.
This book is suitable for aspiring and actual data science practitioners, developers, and everyone who intends to work with large and complex datasets. We strive to make this book as accessible as possible to a wide audience. Yet, considering that the topics in this book are quite advanced, it is recommended, but not strictly compulsory, that readers are familiar with basic machine learning concepts such as classification and regression, error-minimizing functions, and cross-validation.

We also assume some experience with Python, Jupyter Notebooks, and command-line execution, together with a reasonable level of mathematical knowledge to grasp the concepts behind the various large scale solutions we propose. The text is written in a style that programmers of other languages (R, Java, and MATLAB) can follow. Ideally, it is highly suitable for (but not limited to) data scientists who are familiar with machine learning and interested in leveraging Python, over other languages such as R or MATLAB, for its computational, memory, and I/O capabilities.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged into your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Large-Scale-Machine-Learning-With-Python. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
On Github, you will also find Vowpal Wabbit executables for Windows.
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LargeScaleMachineLearningWithPython_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Welcome to this book on scalable machine learning with Python.
In this chapter, we will discuss how to learn effectively from big data with Python and how it can be possible using your single machine or a cluster of other machines, which you can get, for instance, from Amazon Web Services (AWS) or the Google Cloud Platform.
In the book, we will be using Python's implementations of machine learning algorithms that are scalable. This means that they can work with a large amount of data without crashing because of memory constraints. They also take a reasonable amount of time to run, which is manageable both for a data science prototype and for deployment in production. Chapters are organized around solutions (such as streaming data), algorithms (such as neural networks or ensembles of trees), and frameworks (such as Hadoop or Spark). We will also provide you with some basic reminders about the machine learning algorithms and explain how to make them scalable and suitable for problems with massive datasets.
Given such premises as a start, you'll need to learn the basics (so as to figure out the perspective under which this book has been written) and set up all your basic tools to start reading the chapters immediately.
In this chapter, we will introduce you to the following topics:
Let's start this journey together around scalable solutions with Python!
Even if the hype now is about big data, large datasets existed long before the term itself was coined. Large collections of texts, DNA sequences, and vast amounts of data from radio telescopes have always represented a challenge for scientists and data analysts. As most machine learning algorithms have a computational complexity of O(n²) or even O(n³), where n is the number of training instances, data scientists and analysts have previously faced the challenge of massive datasets by resorting to more efficient data algorithms. A machine learning algorithm is deemed scalable when, after an appropriate setup, it can work with large datasets. A dataset can be large because of a large number of cases or variables, or both, but a scalable algorithm can deal with it efficiently because its running time increases almost linearly with the size of the problem. Therefore, it is just a matter of exchanging 1:1 more time (or more computational power) for more data. Conversely, a machine learning algorithm doesn't scale if, when faced with large amounts of data, it simply stops working or operates with a running time that increases in a nonlinear way (for instance, exponentially), thus making learning unfeasible.
The introduction of cheap data storage, large amounts of RAM, and multiprocessor CPUs dramatically changed everything, increasing the ability of single laptops to analyze large amounts of data. Another big game changer arrived on the scene in the past few years, shifting the attention from single powerful machines to clusters of commodity computers (cheaper, easily available machines). This big change has been the introduction of MapReduce and the open source framework Apache Hadoop, with its Hadoop Distributed File System (HDFS), and, in general, of parallel computation on networks of computers.
In order to figure out how both of these changes deeply and positively affect your capability to solve large scale problems, we should first start from what actually prevented you (and, depending on how massive your problem is, still prevents you) from analyzing large datasets.
No matter what your problem is, you will eventually find out that you cannot analyze your data because of any of these limits:
Your computer has limitations that will determine whether you can learn from your data and how long it will take before you hit a wall. Computing limitations occur in many intensive calculations, I/O problems will bottleneck your prompt access to data, and, finally, memory limitations can constrain you to take in only a part of your data, thus limiting the kind of matrix computations that you may have access to, or the precision, or even exactness, of your estimations.
Each of these hardware limitations will also affect you differently in severity with regard to the data you are analyzing:
Finally, it comes down to the algorithm that you are going to use in order to learn from the data. Each algorithm has its own characteristics, mapping data with a solution that is differently affected by bias or variance. Therefore, for the problems that you have so far solved with machine learning, you have considered, based on experience or empirical tests, that certain algorithms may work better than others. With large scale problems, you have to add other, different considerations when deciding on the algorithm:
If you cross-evaluate hardware limitations with data characteristics and these kinds of algorithms, you'll get a host of possible problematic combinations that can prevent you from getting results from large scale analysis. From a practical point of view, all the problematic combinations can be solved by three approaches:
Some motivating examples may make things clearer and more memorable for you. Let's take two simple examples:
In both cases, we have quite large datasets as they are produced by users' interactions on the Internet.
Depending on the business that we have in mind (we can imagine some big players here), we are clearly talking about millions of data points per day in both our examples. In the advertising case, data is certainly tall, being a continuous stream of information where the most recent data, which is more representative of markets and consumers, replaces the older data. In the search engine case, data is wide, being enriched by the features describing the results you offer to your customers: for instance, if you are in the travel business, you will have quite a lot of features about hotels, locations, and services offered.
Clearly, scalability is an issue for both these problems:
The scalability problem can be solved in one or multiple ways:
In this book, we will point out for you what kind of practical problems can be solved by each one of the solutions or algorithms proposed. It will become automatic for you to connect a particular constraint in time and execution (CPU, memory, or I/O) to the most suitable solution among the ones that we propose.
As our treatise will depend on Python—our open source language of choice for this book—we have to stop for a brief moment and present the language before clarifying how Python can easily help you scale up and out with your massive data problem.
Created in 1991 as a general-purpose, interpreted, object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows for fast experimentation, easy theory development, and prompt deployment of scientific applications.
As a machine learning practitioner, you will find using Python interesting for various reasons:
If you are not already an expert (and we do require some basic knowledge of Python in order for you to be able to make the most out of this book), you can read everything about the language and find the basic installation files directly from the Python Foundation at https://www.python.org/.
Python is an interpreted language; it reads your script into memory and executes it at runtime, accessing the necessary resources (files, objects in memory, and so on) as it goes. Apart from being interpreted, another important aspect to take into consideration when using Python for data analysis and machine learning is that a standard Python process is effectively single-threaded: because of the Global Interpreter Lock (GIL), only one thread executes Python code at a time. This means that any plain Python program is executed sequentially from the start to the end of the script and that Python cannot, by itself, take advantage of the extra processing power offered by the multiple threads and processors likely present in your computer (most computers nowadays are multicore).
Given such a situation, scaling up using Python can be achieved by different strategies:
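One such strategy, for example, is to sidestep the single-process limitation by distributing work over multiple processes with Python's standard multiprocessing module. The following is only a minimal sketch of the idea; the square function and the pool of four workers are purely illustrative:

import multiprocessing

def square(x):
    # A toy CPU-bound task; real workloads would be much heavier
    return x * x

if __name__ == '__main__':
    # Four worker processes, each running in its own interpreter
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(square, range(10))
    pool.close()
    pool.join()
    print(results)

Because every worker is a separate process, the computation can use several cores at once, at the price of copying data between processes.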
Scaling out simply involves connecting multiple machines into a cluster. As you connect the machines (scaling out), you can also scale up each one of them using more powerful configurations (thus augmenting CPU, memory, and I/O), applying the techniques we mentioned in the previous paragraph and enhancing their performance.

By connecting multiple machines, you can leverage their computational power in a parallel fashion. Your data will be distributed across multiple storage disks/memory, limiting I/O transfers by having each machine work only on its available data (that is, its own storage disk or RAM).
In our book, this translates into using outside resources effectively by means of the following:
Each of these frameworks will be controlled by Python (for instance, Spark by its Python interface, PySpark).
Given the availability of many useful packages for machine learning and the fact that it is a programming language quite popular among data scientists, Python is our language of choice for all the code presented in this book.
In this book, when necessary, we will provide further instructions in order to install any further necessary library or tool. Here, we will instead start installing the basics, that is, the Python language and the most frequently used packages for computations and machine learning.
Before starting, it is important to know that there are two main branches of Python: versions 2 and 3. As many core functionalities have changed, scripts built for one version are sometimes incompatible with the other (they won't work without raising errors and warnings). Although the third version is the newest, the older one is still the most used version in the scientific area and the default version for many operating systems (mainly for compatibility in upgrades). When version 3 was released (in 2008), most scientific packages weren't ready, so the scientific community stuck with the previous version. Fortunately, since then, almost all packages have been updated, leaving just a few (see http://py3readiness.org for a compatibility overview) without Python 3 compatibility.
In spite of the recent growth in popularity of Python 3 (which, we shouldn't forget, is the future of Python), Python 2 is still widely used among data scientists and data analysts. Moreover, for a long time, Python 2 has been the default Python installation (for instance, on Ubuntu), so it is the version that most readers are likely to have ready at hand. For all these reasons, we will adopt Python 2 for this book. It is not merely love for old technologies; it is just a practical choice to make Large Scale Machine Learning with Python accessible to the largest audience:
In case you need to understand the differences between Python 2 and Python 3 in depth, we suggest reading this web page about writing Python 2-3 compatible code:
http://python-future.org/compatible_idioms.html
From Python-Future, you may also find reading about how to convert Python 2 code to Python 3 useful:
http://python-future.org/automatic_conversion.html
As the first step, we are going to create a working environment for data science that you can use to replicate and test the examples in the book and prototype your own large solutions.
No matter in what language you are going to develop your application, Python will gift you with an easy time getting your data, building your model from it, and extracting the right parameters you need to make your predictions in a production environment.
Python is an open source, object-oriented, cross-platform programming language that, compared with its direct competitors (for instance, C/C++ and Java), produces very concise and readable code. It allows you to build a working software prototype in a very short time and to test, maintain, and scale it in the future. It has become the most used language in the data scientist's toolbox because, in the end, it is a general-purpose language made very flexible thanks to the large variety of available packages that can easily and rapidly help you solve a wide spectrum of both common and niche problems.
If you have never used Python (but this doesn't mean that you may not already have it installed on your machine), you need to first download the installer from the main website of the project, https://www.python.org/downloads/ (remember, we're using version 2), and then install it on your local machine.
This section provides you with full control over what is installed on your machine. This is very useful when you are going to use Python as both your prototyping and production language. Furthermore, it helps you keep track of the versions of the packages you are using. Be warned, though, that a step-by-step installation really takes time and effort. Installing a ready-made scientific distribution instead will lessen the burden of the installation procedures, and it may be well suited for getting started and learning, because it saves you quite a lot of time, although it will install a large number of packages (most of which you may never use) on your computer all at once. Therefore, if you want to start immediately and don't want to bother much about controlling your installation, just skip this part and proceed to the next section, Scientific distributions.
Being a multiplatform programming language, you'll find installers for computers that either run on Windows or Linux-/Unix-like operating systems. Remember that some Linux distributions (such as Ubuntu) already have Python 2 packed in the repository, which makes the installation process even easier.
If a syntax error is raised, it probably means that you are running Python 3 instead of Python 2. If you don't experience an error and your Python version is 2.7.x, then congratulations: you are running the version of Python that we elected for this book.
To clarify, when a command is given in the terminal command line, we prefix the command with $. Otherwise, if it's for the Python REPL, it's preceded by >>>.
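For example, a quick way to check which interpreter answers on your system from the terminal is a version query (the exact output depends on your installation):

$ python -V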
Depending on your system and past installations, Python may not come bundled with all that you need unless you have installed a distribution (which, on the other hand, usually is stuffed with much more than you may need).
To install any packages that you need, you can use either the pip or easy_install commands; however, easy_install is going to be dropped in the future and pip has important advantages over it.
pip is a tool that installs Python packages directly from the Internet, picking them from the Python Package Index (https://pypi.python.org/pypi). PyPI is a repository containing third-party open source packages, which are constantly maintained by their authors and stored in the repository.
It is preferable to install everything using pip because of the following reasons:
The pip command runs in the command line and makes the process of installation, upgrade, and removal of Python packages a breeze.
As we mentioned, if you're running at least Python 2.7.9 or Python 3.4, the pip command should already be there. To check whether it has been installed on your local machine, directly test with the following command and see whether any error is raised:
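A simple version check does the job (if pip is missing, the shell will complain that the command is not found):

$ pip -V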
On some Linux and Mac installations, where Python 3 rather than Python 2 is installed, the command may be present as pip3; so, if you receive an error when looking for pip, try running the following command:
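That is, the same version check under the pip3 name:

$ pip3 -V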
If this is the case, remember that pip3 is suitable only for installing packages on Python 3. As we are working with Python 2 in this book (unless you decide to use a more recent Python 3 version), pip should always be your choice for installing packages.
Alternatively, you can also test whether the old easy_install command is available:
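Again, a version check is enough (the command is provided by setuptools):

$ easy_install --version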
Using easy_install in spite of pip and its advantages makes sense if you are working on Windows because pip will not install binary packages; therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day.
If your test ends with an error, you really need to install pip from scratch (and in doing so, also easy_install at the same time).
To install pip, simply follow the instructions given at https://pip.pypa.io/en/stable/installing/. The safest way is to download the get-pip.py script from https://bootstrap.pypa.io/get-pip.py and then run it using the following:
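That is, once get-pip.py has been downloaded to your working directory:

$ python get-pip.py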
By the way, the script will also install setuptools (https://pypi.python.org/pypi/setuptools), which contains easy_install.
As an alternative, if you are running a Debian/Ubuntu Unix-like system, then a fast shortcut would be to install everything using apt-get:
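For example, something along these lines installs pip and setuptools for the system's Python 2 (package names may vary slightly between releases):

$ sudo apt-get install python-pip python-setuptools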
After checking this basic requirement, you're now ready to install all the packages that you need in order to run the examples provided in this book. To install a generic <pk> package, you just need to run the following command:
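Here, <pk> is simply a placeholder for the name of the package you want:

$ pip install <pk>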
Alternatively, if you prefer to use easy_install, you can also run the following command:
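Again, with <pk> standing for the package name:

$ easy_install <pk>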
After this, the <pk> package and all its dependencies will be downloaded and installed.
If you're not sure whether a library has been installed, just try to import a module from it. If the Python interpreter raises an ImportError, it can be concluded that the package has not been installed.
Let's take an example. This is what happens when the NumPy library has been installed:
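The import completes silently and the interpreter simply returns the prompt:

>>> import numpy
>>>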
This is what happens if it's not installed:
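The interpreter responds with a traceback similar to this one (the exact wording depends on your Python version):

>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named numpy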
In the latter case, before importing it, you'll need to install it through pip or easy_install.
Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn.
More often than not, you will find yourself in a situation where you have to upgrade a package because the new version is either required by a dependency or has additional features that you would like to use. To do so, first check the version of the library that you have installed by glancing at the __version__ attribute, as shown in the following example using the NumPy package:
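For instance (the version string shown here is only illustrative; yours will differ):

>>> import numpy
>>> numpy.__version__
'1.9.0'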
Now, if you want to update it to a newer release, say precisely the 1.9.2 version, you can run the following command from the command line:
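Pinning the exact release with pip looks like this:

$ pip install -U numpy==1.9.2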
Alternatively (but we do not recommend it unless it proves necessary), you can also use the following command:
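The easy_install equivalent would be along these lines:

$ easy_install --upgrade numpy==1.9.2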
Finally, if you're just interested in upgrading it to the latest available version, simply run the following command:
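Simply drop the version pin:

$ pip install -U numpy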
You can also run the easy_install alternative:
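That is:

$ easy_install --upgrade numpy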
As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need. (Sometimes, the installation procedures may not go as smoothly as you'd hoped for earlier.)
If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use the scientific Python distribution. Apart from Python, they also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE setup for your usage. A few of them are very well-known among data scientists, and in the sections that follow, you will find some of the key features for two of these packages that we found most useful and practical.
To immediately focus on the contents of the book, we suggest that you first promptly download and install a scientific distribution, such as Anaconda (which is the most complete one around, in our opinion). After practicing the examples in the book, you can then decide to fully uninstall the distribution and set up a plain Python installation, accompanied by just the packages you need for your projects.
Again, if possible, download and install the version containing Python 2.
The first package that we would recommend you to try is Anaconda (https://www.continuum.io/downloads), which is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, including NumPy, SciPy, pandas, IPython, matplotlib, Scikit-learn, and StatsModels. It's a cross-platform distribution that can be installed on machines with other existing Python distributions and versions, and its base version is free. Additional add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on its website, Anaconda's goal is to provide enterprise-ready Python distribution for large-scale processing, predictive analytics and scientific computing. As for Python version 2.7, we recommend the Anaconda distribution 4.0.0. (In order to have a look at the packages installed with Anaconda, you can have a look at the list at https://docs.continuum.io/anaconda/pkg-docs.)
As a second suggestion, if you are working on Windows and you desire a portable distribution, WinPython (http://winpython.sourceforge.net/) could be a quite interesting alternative (sorry, no Linux or MacOS versions). WinPython is also a free, open source Python distribution maintained by the community. It is also designed with scientists in mind, and it includes many essential packages such as NumPy, SciPy, matplotlib, and IPython (basically the same as Anaconda's). It also includes Spyder as an IDE, which can be helpful if you have experience using the MATLAB language and interface. Its crucial advantage is that it is portable (you can put it in any directory or even in a USB flash drive), so you can have different versions present on your computer, move a version from a Windows computer to another, and you can easily replace an older version with a newer one just by replacing its directory. When you run WinPython or its shell, it will automatically set all the environment variables necessary to run Python as if it were regularly installed and registered on your system.
At the time of writing, the most recent WinPython distribution for Python 2.7 was prepared in October 2015, based on release 2.7.10; since then, WinPython has published only updates of the Python 3 version of the distribution. After installing the distribution on your system, you may need to update some of the key packages necessary for the examples presented in this book.
IPython was initiated in 2001 as a free project by Fernando Perez, addressing a lack in the Python stack for scientific investigations: a user-programming interface that could incorporate the scientific approach (mainly experimenting and interactively discovering) into the process of software development.
A scientific approach implies the fast experimentation of different hypotheses in a reproducible fashion (as does the data exploration and analysis task in data science), and when using IPython, you will be able to implement an explorative, iterative, and trial-and-error research strategy more naturally during your code writing.
Recently, a large part of the IPython project has moved to a new one called Jupyter. This new project extends the potential usability of the original IPython interface to a wide range of programming languages. (For a complete list, visit https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages.)
Thanks to the powerful idea of kernels (programs that run the user's code, communicate with the frontend interface, and return feedback on the results of the executed code to the interface itself), you can use the same interface and interactive programming style, no matter what language you are developing in.
Jupyter (of which IPython is kernel zero, the original one) can simply be described as a tool for interactive tasks operable through a console or a web-based notebook, which offers special commands that help developers better understand and build the code that is currently being written.
Contrary to an IDE, which is built around the idea of writing a script, running it afterward, and evaluating its results, Jupyter lets you write your code in chunks named cells, run each of them sequentially, and evaluate the results of each one separately, examining both textual and graphical outputs. Besides graphical integration, it provides you with further help thanks to customizable commands, a rich history (in the JSON format), and computational parallelism for enhanced performance when dealing with heavy numeric computations.
Such an approach is also particularly fruitful for tasks involving developing code based on data, as it automatically accomplishes the often-neglected duty of documenting and illustrating how data analysis has been done, its premises and assumptions, and its intermediate and final results. If a part of your job is also to present your work and win internal or external stakeholders over to the project, Jupyter can really do the magic of storytelling for you with little additional effort. There are many examples at https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks, some of which you may find as inspiring for your work as we did.
Actually, we have to confess that keeping a clean, up-to-date Jupyter Notebook has saved us uncountable times when meetings with managers/stakeholders have suddenly popped up, requiring us to hastily present the state of our work.
In short, Jupyter offers you the following features:
Jupyter is our favored choice throughout this book, and it is used to clearly and effectively illustrate storytelling operations with scripts and data and their consequent results.
Though we strongly recommend using Jupyter, if you are using a REPL or an IDE, you can use the same instructions and expect identical results (except for print formats and extensions of the returned results).
If you do not have Jupyter installed on your system, you can promptly set it up using the following command:
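Installing through pip is usually enough:

$ pip install jupyter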
You can find complete instructions about the Jupyter installation (covering different operating systems) at http://jupyter.readthedocs.io/en/latest/install.html.
If you already have Jupyter installed, it should be upgraded to at least version 4.1.
After installation, you can immediately start using Jupyter, calling it from the command line:
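The notebook server starts and opens a new tab in your default browser:

$ jupyter notebook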
Once the Jupyter instance has opened in the browser, click on the New button, and in the Notebooks section, choose Python 2 (other kernels may be present in the section, depending on what you installed):
At this point, your new empty notebook will look like the following screenshot and you can start entering the commands in the cells:
For instance, you may start typing the following in the cell:
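Any valid Python will do; a trivial, purely illustrative example is:

In: 1 + 1

Out: 2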
After writing in a cell, you just press the play button (below the Cell tab) to run it and obtain an output. Then, another cell will appear for your input. As you are writing in a cell, if you press the plus button on the menu bar above, you will get a new cell, and you can move from one cell to another using the arrows on the menu.
Most of the other functions are quite intuitive and we invite you to try them. In order to know better how Jupyter works, you may use a quick-start guide such as http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/ or you can get a book specialized in Jupyter functionalities.
For a complete treatise of the full range of Jupyter functionalities when running the IPython kernel, refer to the following two Packt Publishing books:
For our illustrative purposes, just consider that every Jupyter block of instructions has a numbered input statement and an output one, so you will find the code presented in this
