Learn to use IPython and Jupyter Notebook for your data analysis and visualization work.
Python is one of the leading open source platforms for data science and numerical computing. IPython and the associated Jupyter Notebook offer efficient interfaces to Python for data analysis and interactive visualization, and they constitute an ideal gateway to the platform.
IPython Interactive Computing and Visualization Cookbook, Second Edition contains many ready-to-use, focused recipes for high-performance scientific computing and data analysis, from the latest IPython/Jupyter features to the most advanced tricks, to help you write better and faster code. You will apply these state-of-the-art methods to various real-world examples, illustrating topics in applied mathematics, scientific modeling, and machine learning.
The first part of the book covers programming techniques: code quality and reproducibility, code optimization, high-performance computing through just-in-time compilation, parallel computing, and graphics card programming. The second part tackles data science, statistics, machine learning, signal and image processing, dynamical systems, and pure and applied mathematics.
This book is intended for anyone interested in numerical computing and data science: students, researchers, teachers, engineers, analysts, and hobbyists. A basic knowledge of Python/NumPy is recommended. Some skills in mathematics will help you understand the theory behind the computational methods.
Cyrille Rossant, PhD, is a neuroscience researcher and software engineer at University College London. He is a graduate of École Normale Supérieure, Paris, where he studied mathematics and computer science. He has also worked at Princeton University and Collège de France. While working on data science and software engineering projects, he has gained experience in numerical computing, parallel computing, and high-performance data visualization. He is the author of Learning IPython for Interactive Computing and Data Visualization, Second Edition, Packt Publishing, the prequel of this cookbook.
Number of pages: 572
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Veena Pagare
Acquisition Editor: Dominic Shakeshaft
Project Editor: Suzanne Coutinho
Technical Editors: Bhagyashree Rai, Nidhisha Shetty
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Tom Scaria
Production Coordinator: Shantanu Zagade
First published: September 2014
Second Edition: January 2018
Production reference: 1290118
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-863-2
www.packtpub.com
mapt.io
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
I'm grateful to everyone who gave their feedback on this book, including Matthias Bussonnier, Thomas Caswell, Guillaume Gay, Brian Granger, Matthew Rocklin, Steven Silvester, and Jake VanderPlas. I'd also like to thank my family for their support.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
We are becoming awash in the flood of digital data from scientific research, engineering, economics, politics, journalism, business, and many other domains. As a result, analyzing, visualizing, and harnessing data is the occupation of an increasingly large and diverse set of people. Quantitative skills such as programming, numerical computing, mathematics, statistics, and data mining, which form the core of data science, are more and more appreciated in a seemingly endless plethora of fields.
Python, a widely-known programming language, is also one of the leading open platforms for data science. IPython is a mature Python project that provides scientist-friendly interactive access to Python. It is part of the broader Project Jupyter, which aims to provide high-quality environments for interactive computing, data analysis, visualization, and the authoring of interactive scientific documents. Jupyter is estimated to have several million users today.
The prequel of this book, Learning IPython for Interactive Computing and Data Visualization, Second Edition, Packt Publishing, was published in 2015, two years after the first edition. It is a beginner-level introduction to data science and numerical computing with Python, IPython, and Jupyter.
This book, the first edition of which was published in 2014, continues that journey by presenting more than 100 recipes for interactive scientific computing and data science. These recipes not only cover programming topics such as numerical computing, high-performance computing, parallel computing, and interactive visualization, but also data analysis topics such as statistics, data mining, machine learning, signal processing, graph theory, numerical optimization, and many others.
This second edition is fully compatible with the latest versions of the platform and its libraries. It includes new recipes to better leverage the latest features of Python 3, and it introduces promising new projects such as JupyterLab, Altair, and Dask.
By design, this book privileges breadth over depth. A particularly wide range of libraries and techniques are covered in this book, but not comprehensively. We give many references that let you deepen your knowledge of individual methods. The goal of this book is not to make you an expert in the subjects covered, but to give you a glimpse of the extremely diverse set of applications that you can tackle with the platform.
Each recipe in this book, which covers a specific technique, is available online as a Jupyter notebook. This interactive document lets you read, execute, and modify the code, which makes the learning process more engaging and dynamic.
Almost all of this book's content is available online on the GitHub platform (http://ipython-books.github.io/). Updates and corrections will be regularly published there, so you should make sure you check out the latest version of the book online.
This book targets researchers, engineers, data scientists, teachers, students, analysts, journalists, economists, and hobbyists interested in data analysis and numerical computing.
Readers familiar with the scientific Python ecosystem will find many resources to sharpen their skills in high-performance interactive computing with IPython and Jupyter.
Readers who need to implement algorithms for domain-specific applications will appreciate the introductions to a wide variety of topics in data analysis and applied mathematics.
Readers who are new to numerical computing with Python should start with the prequel of this book, Learning IPython for Interactive Computing and Data Visualization, Second Edition, Packt Publishing, published in 2015.
This book is split into two parts:
Part 1 (chapters 1 to 6) covers relatively advanced methods in interactive numerical computing, high-performance computing, and data visualization.
Part 2 (chapters 7 to 15) introduces standard methods in data science and mathematical modeling. Many of these methods are applied to real-world data.
Chapter 1, A Tour of Interactive Computing with Jupyter and IPython, contains a brief introduction to data analysis and numerical computing with IPython and Jupyter. It not only covers common packages such as Python, NumPy, pandas, and Matplotlib, but also advanced IPython/Jupyter topics such as interactive widgets in the Notebook, custom magic commands, configurable IPython extensions, and custom Jupyter kernels.
Chapter 2, Best Practices in Interactive Computing, details best practices to write reproducible, high-quality code: task automation, version control with Git, workflows with IPython and Jupyter, unit testing, continuous integration, debugging, and other related topics. The importance of these subjects in computational research and data analysis cannot be overstated.
Chapter 3, Mastering the Jupyter Notebook, covers topics related to the Jupyter Notebook, notably the Notebook format, notebook conversions, and interactive widgets.
Chapter 4, Profiling and Optimization, covers methods to make your code faster and more efficient: CPU and memory profiling in Python, advanced optimization techniques with NumPy (including large array manipulations), and memory mapping of huge arrays. These techniques are essential for big data analysis.
Chapter 5, High-Performance Computing, covers techniques to make your code much faster: code acceleration with Numba and Cython, wrapping C libraries in Python with ctypes, parallel computing with IPython and Dask, OpenMP, and General-Purpose Computing on Graphics Processing Units (GPGPU) with CUDA. The chapter ends with an introduction to the Julia language, a high-performance numerical computing programming language that can be used in the Jupyter Notebook.
Chapter 6, Data Visualization, introduces several visualization or interactive visualization libraries, such as matplotlib, seaborn, bokeh, D3, Altair, and others.
Chapter 7, Statistical Data Analysis, covers methods for getting insights into data. It introduces classic frequentist and Bayesian methods for hypothesis testing, parametric and nonparametric estimation, and model inference. The chapter leverages Python libraries such as pandas, SciPy, statsmodels, and PyMC. The last recipe introduces the statistical language R, which can be easily used in the Jupyter Notebook.
Chapter 8, Machine Learning, covers methods to learn and make predictions from data. Using the scikit-learn Python package, this chapter illustrates fundamental data mining and machine learning concepts such as supervised and unsupervised learning, classification, regression, feature selection, feature extraction, overfitting, regularization, cross-validation, and grid search. Algorithms addressed in this chapter include logistic regression, Naive Bayes, K-nearest neighbors, support vector machines, random forests, and others. These methods are applied to various types of datasets: numerical data, images, and text.
Chapter 9, Numerical Optimization, covers minimizing and maximizing mathematical functions. This topic is pervasive in data science, notably in statistics, machine learning, and signal processing. This chapter illustrates a few root-finding, minimization, and curve-fitting routines with SciPy.
Chapter 10, Signal Processing, covers extracting relevant information from complex and noisy data. These steps are sometimes required prior to running statistical and data mining algorithms. This chapter introduces basic signal processing methods such as Fourier transforms and digital filters.
Chapter 11, Image and Audio Processing, covers signal processing methods for images and sounds. It introduces image filtering, segmentation, computer vision, and face detection with scikit-image and OpenCV. It also presents methods for audio processing and synthesis.
Chapter 12, Deterministic Dynamical Systems, describes the dynamical processes underlying particular types of data. It illustrates simulation techniques for discrete-time dynamical systems, as well as for ordinary differential equations and partial differential equations.
Chapter 13, Stochastic Dynamical Systems, describes the dynamical random processes underlying particular types of data. It illustrates simulation techniques for discrete-time Markov chains, point processes, and stochastic differential equations.
Chapter 14, Graphs, Geometry, and Geographic Information Systems, covers analysis and visualization methods for graphs, flight networks, road networks, maps, and geographic data.
Chapter 15, Symbolic and Numerical Mathematics, introduces SymPy, a computer algebra system that brings symbolic computing to Python. The chapter ends with an introduction to Sage, another Python-based system for computational mathematics.
This book is accessible to beginners. However, it may be easier for you if you are familiar with the contents of Learning IPython for Interactive Computing and Data Visualization, Second Edition, Packt Publishing (also called the "IPython minibook"), the prequel of this book. The minibook introduces Python programming, the IPython console, the Jupyter Notebook, numerical computing with NumPy, basic data analysis with pandas, and plotting with Matplotlib. This book tackles scientific programming topics that rely on all of these tools.
Part 2 is a bit more theoretical. It is easier to read if you know the basics of calculus, linear algebra, and probability theory (real-valued functions, integrals and derivatives, differential equations, matrices, vector spaces, probabilities, random variables, and so on). These chapters introduce different topics in data science and applied mathematics, and how to apply them with Python: statistics, machine learning, numerical optimization, signal processing, dynamical systems, graph theory, and others.
This book uses the free Anaconda distribution (https://www.anaconda.com/download/). It includes Python 3, IPython, Jupyter, and almost all of the packages that we will be using in this book. Anaconda also includes a powerful packaging system named Conda. The introduction of this book's first chapter gives you more details.
The code in this book has been written for Python 3 and is not compatible with Python 2, although minimal to no changes would be required to make it so.
This book has a website: http://ipython-books.github.io. The text, the code, and the data from the book are available on several GitHub repositories at https://github.com/ipython-books/. You can also run the code interactively in your web browser without installing anything on your computer, thanks to the Binder project.
Be sure to check out http://ipython-books.github.io and the repositories to get the latest updates and corrections. You can also propose your own corrections and suggestions on GitHub by opening issues or pull requests.
You can also follow the author online (http://cyrille.rossant.net) and on Twitter (@cyrillerossant).
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example:
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Any command-line input or output is written as follows:
Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes. Here is an example: "Select System info from the Administration panel."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
Getting ready: This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.
How to do it...: This section contains the steps required to follow the recipe.
How it works...: This section usually consists of a detailed explanation of what happened in the previous section.
There's more...: This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.
See also: This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome.
General feedback: Email <[email protected]> and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please email us at <[email protected]>.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at <[email protected]> with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
In this chapter, we will cover the following topics:
In this introduction, we will give a broad overview of Python, IPython, Jupyter, and the scientific Python ecosystem.
Python is a high-level, open-source, general-purpose programming language originally conceived by Guido van Rossum in the late 1980s (the name was inspired by the British comedy Monty Python's Flying Circus). This easy-to-use language is commonly used by system administrators as a glue language, linking various system components together. It is also a robust language for large-scale software development. In addition, Python comes with an extremely rich standard library (the batteries included philosophy), which covers string processing, internet protocols, operating system interfaces, and many other domains.
In the last twenty years, Python has been increasingly used for scientific computing and data analysis as well. Other competing platforms include commercial software such as MATLAB, Maple, Mathematica, Excel, SPSS, SAS, and others. Competing open-source platforms include Julia, R, Octave, and Scilab. These tools are dedicated to scientific computing, whereas Python is a general-purpose programming language that was not initially designed for scientific computing.
However, a wide ecosystem of tools has been developed to bring Python to the level of these other scientific computing systems. Today, the main advantage of Python, and one of the main reasons why it is so popular, is that it brings scientific computing features to a general-purpose language that is used in many research areas and industries. This makes the transition from research to production much easier.
IPython is a Python library that was originally meant to improve the default interactive console provided by Python, and to make it scientist-friendly. In 2011, ten years after the first release of IPython, the IPython Notebook was introduced. This web-based interface to IPython combines code, text, mathematical expressions, inline plots, interactive figures, widgets, graphical interfaces, and other rich media within a standalone sharable web document. This platform provides an ideal gateway to interactive scientific computing and data analysis. IPython has become essential to researchers, engineers, data scientists, and teachers and their students.
Within a few years, IPython gained incredible popularity among the scientific and engineering communities. The Notebook started to support more and more programming languages beyond Python. In 2014, the IPython developers announced the Jupyter project, an initiative created to improve the implementation of the Notebook and make it language-agnostic by design. The name of the project reflects the importance of three of the main scientific computing languages supported by the Notebook: Julia, Python, and R.
Today, Jupyter is an ecosystem in itself that comprises several alternative Notebook interfaces (JupyterLab, nteract, Hydrogen, and others), interactive visualization libraries, and authoring tools compatible with notebooks. Jupyter has its own conference named JupyterCon. The project has received funding from several companies as well as the Alfred P. Sloan Foundation and the Gordon and Betty Moore Foundation.
SciPy is the name of a Python package for scientific computing, but it refers also, more generally, to the collection of all Python tools that have been developed to bring scientific computing features to Python.
In the late 1990s, Travis Oliphant and others started to build efficient tools to deal with numerical data in Python: Numeric, Numarray, and finally, NumPy. SciPy, which implements many numerical computing algorithms, was also created on top of NumPy. In the early 2000s, John Hunter created Matplotlib to bring scientific graphics to Python. At the same time, Fernando Perez created IPython to improve interactivity and productivity in Python. In the late 2000s, Wes McKinney created pandas for the manipulation and analysis of numerical tables and time series. Since then, hundreds of engineers and researchers have worked collaboratively on this platform to make SciPy one of the leading open source platforms for scientific computing and data science.
Many of the SciPy tools are supported by NumFOCUS, a nonprofit that was created as a legal structure to promote the sustainable development of the ecosystem. NumFOCUS is supported by several large companies including Microsoft, IBM, and Intel.
SciPy has its own conferences, too: SciPy (in the US) and EuroSciPy (in Europe) (see https://conference.sci).
What are some of the main changes in the SciPy ecosystem since the first edition of this book, published in 2014? We give here a very brief selection.
Feel free to skip this section if you are new to the platform.
The latest version of IPython at the time of writing is IPython 6.0, released in April 2017. It is the first version of IPython that is no longer compatible with Python 2. This decision allowed the developers to make the internal code simpler and to make better use of the new features of the language.
IPython now has a web-based Terminal interface that can be used along with notebooks. Keyboard shortcuts can be edited directly from the Notebook interface. Multiple cells can be selected and copy/pasted between notebooks. There is a new restart-and-run-all button and a find-and-replace option in the Notebook. See http://ipython.readthedocs.io/en/stable/whatsnew/version6.html for more details.
NumPy, whose latest version (1.13) was released in June 2017, now supports the @ matrix multiplication operator between matrices (matrix multiplication was previously accessible via the np.dot() function). Operations such as a + b + c use less memory and are faster on some systems (temporary elision). The new np.block() function lets one define block matrices. The new np.stack() function joins a sequence of arrays along a new axis. See https://docs.scipy.org/doc/numpy-1.13.0/release.html for more details.
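As a brief illustration of the three NumPy features mentioned above, here is a minimal sketch. The arrays a and b are arbitrary values chosen for this example, not data from the book.

```python
import numpy as np

# Two small example matrices (arbitrary values chosen for illustration).
a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

# The @ operator performs matrix multiplication; for 2D arrays it gives
# the same result as np.dot().
product = a @ b
same_as_dot = np.array_equal(product, np.dot(a, b))

# np.block() assembles a block matrix from a nested list of arrays.
# Here we build a 4x4 block-diagonal matrix from a and b.
zeros = np.zeros((2, 2), dtype=int)
blocks = np.block([[a, zeros],
                   [zeros, b]])

# np.stack() joins arrays of identical shape along a new axis,
# so two (2, 2) arrays become one (2, 2, 2) array.
stacked = np.stack([a, b], axis=0)

print(product)
print(blocks.shape)
print(stacked.shape)
```

Note that np.stack() differs from np.concatenate(), which joins arrays along an existing axis rather than creating a new one.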
SciPy 1.0 was released in October 2017. For the developers, the 1.0 version means that the library has reached some stability and maturity after 16 years of development. See https://docs.scipy.org/doc/scipy/reference/release.html for more details.
Matplotlib, whose version 2.1 was released in October 2017, has improved styling and a much better default color palette, with the viridis colormap instead of jet. See https://github.com/matplotlib/matplotlib/releases for more details.
pandas 0.21 was released in October 2017. pandas now supports categorical data. Several features have been deprecated in recent years, notably the .ix indexing syntax and Panels (which may be replaced with the xarray library). See https://pandas.pydata.org/pandas-docs/stable/release.html for more details.
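The following sketch shows the categorical data type, along with the .loc and .iloc indexers that pandas provides for label-based and position-based selection (the two roles that .ix used to combine). The DataFrame contents are made up for this example.

```python
import pandas as pd

# A small example table (values are invented for illustration).
df = pd.DataFrame(
    {"city": ["Paris", "London", "Paris", "Berlin"],
     "temp": [21.5, 18.0, 23.1, 19.4]},
    index=["a", "b", "c", "d"],
)

# Convert a column of repeated strings to the categorical dtype.
df["city"] = df["city"].astype("category")
categories = list(df["city"].cat.categories)

# Label-based selection with .loc: row labeled "b", column "temp".
by_label = df.loc["b", "temp"]

# Position-based selection with .iloc: third row, second column.
by_position = df.iloc[2, 1]

print(categories)
print(by_label, by_position)
```

Categorical columns store each distinct label once and refer to it by an integer code, which can save memory when a column contains many repeated values.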
In this book, we use the Anaconda distribution, which is available at https://www.anaconda.com/download/. Anaconda works on Linux, macOS, and Windows. You should install the latest version of Anaconda (5.0.1 at the time of writing) with the latest 64-bit version of Python (3.6 at the time of writing). Python 2.7 is an old version that will be officially unsupported in 2020.
Anaconda comes with Python, IPython, Jupyter, NumPy, SciPy, pandas, Matplotlib, and almost all of the other scientific packages we will be using in this book. The list of all packages is available at https://docs.anaconda.com/anaconda/packages/pkg-docs.
Miniconda is a light version of Anaconda with only Python and a few other essential packages. You can install only the packages you need one by one using the conda package manager of Anaconda.
This book does not cover the various other ways of installing a scientific Python distribution.
The Anaconda website should give you all the instructions to install Anaconda on your system. To install new packages, you can use the conda package manager that comes with Anaconda. For example, to install the ipyparallel package (which is currently not installed by default in Anaconda), type conda install ipyparallel in a system shell.
A short introduction to system shells is given in the Learning the basics of the Unix shell section of Chapter 2, Best Practices in Interactive Computing.
Another way of installing packages is with conda-forge, available at https://conda-forge.org/. This is a community-driven effort to automatically build the latest versions of packages available on GitHub, and make them available with conda. If a package is not available with conda install somepackage, one may use instead conda install --channel conda-forge somepackage if the package is supported by conda-forge.
GitHub is a commercial service that provides free and paid hosting for software repositories. It is one of the most popular platforms for open source collaborative development.
pip is the Python package manager. Contrary to conda, pip works with any Python distribution, not just with Anaconda. Packages installable by pip are stored on the Python Package Index (PyPI) available at https://pypi.python.org/pypi.
Almost all Python packages available in conda are also available in pip, but the inverse is not true. In practice, if a package is not available in conda or conda-forge, it should be available with pip install somepackage. conda packages typically include binaries compiled for the most common platforms, whereas that is not necessarily the case with pip packages. pip packages may contain source code that has to be compiled locally (which requires that a compatible compiler is installed and configured), but they may also contain compiled binaries.
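Before installing a package, it can be handy to check from Python whether it is already importable. The sketch below does this with the standard library and, if the package is missing, lists the commands discussed above in the order conda, conda-forge, pip. The helper names is_installed and suggest_install are hypothetical, introduced only for this example, and the function only prints suggestions; it does not run any installer.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


def suggest_install(package_name):
    """Return the install commands to try, in order, for a missing package.

    Returns an empty list if the package is already importable.
    Note: the import name is assumed to match the package name, which is
    not always the case (for example, scikit-learn is imported as sklearn).
    """
    if is_installed(package_name):
        return []
    return [
        f"conda install {package_name}",
        f"conda install --channel conda-forge {package_name}",
        f"pip install {package_name}",
    ]


for command in suggest_install("ipyparallel"):
    print(command)
```

The commands themselves are meant to be typed in a system shell, not in Python.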
Here are a few references:
Here are a few resources on scientific Python:
