Understand the constructs of the Python programming language and use them to build data science projects
Key Features
Book Description
Python is the most widely used programming language for building data science applications. Complete with step-by-step instructions, this book contains easy-to-follow tutorials to help you learn Python and develop real-world data science projects. The “secret sauce” of the book is its curated list of topics and solutions, put together using a range of real-world projects, covering initial data collection, data analysis, and production.
This Python book starts by taking you through the basics of programming, right from variables and data types to classes and functions. You'll learn how to write idiomatic code and test and debug it, and discover how you can create packages or use the range of built-in ones. You'll also be introduced to the extensive ecosystem of Python data science packages, including NumPy, Pandas, scikit-learn, Altair, and Datashader. Furthermore, you'll be able to perform data analysis, train models, and interpret and communicate the results. Finally, you'll get to grips with structuring and scheduling scripts using Luigi and sharing your machine learning models with the world as a microservice.
By the end of the book, you'll have learned not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards.
What you will learn
Who this book is for
If you want to learn Python or data science in a fun and engaging way, this book is for you. You'll also find this book useful if you're a high school student, researcher, analyst, or anyone with little or no coding experience who has an interest in the subject and the courage to learn, fail, and learn from failure. A basic understanding of how computers work will be useful.
You can read the e-book in Legimi apps or in any app that supports the following format:
Page count: 610
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Richa Tripathi
Acquisition Editor: Chaitanya Nair
Content Development Editor: Tiksha Sarang
Senior Editor: Afshaan Khan
Technical Editor: Romy Dias
Copy Editor: Safis Editing
Project Coordinator: Prajakta Naik
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Nilesh Mohite
First published: August 2019
Production reference: 1300819
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78953-536-5
www.packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. Having earned a bachelor's degree in architectural design and having followed the (at first) rocky path of a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.
David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.
The authors would also like to thank the administration of the Kazan Federal University IT-Lyceum, and its director, Timerbulat Samerkhanov, for the opportunity to conduct a course that laid the foundation for this book. Our special thanks go to our students for their help and feedback:
Azat Davletshin
Danis Saifullin
Evdokimov Alexandr
Kasatkin Alexander
Kirill Kaidanov
Nikolai Plantonov
Sri Manikanta is an undergraduate student pursuing his bachelor's degree in computer science and engineering at SICET under JNTUH. He is a founder of the Open Stack Developer Community at his college. He started his journey as a competitive programmer, and he loves to solve problems related to the field of data science. He has worked on a couple of projects on deep learning and machine learning. He has published many articles on data science, machine learning, programming, and cybersecurity in top publications such as Hacker Noon, freeCodeCamp, Towards Data Science, and DDI. He completed his Python specialization at the University of Michigan, through Coursera.
Richard Marsden has 25 years of professional software development experience. After starting out in geophysical surveying for the oil industry, he has spent the last 15 years running Winwaed Software Technology LLC, an independent software vendor. Winwaed specializes in geospatial tools and applications, including web applications, and operates the Mapping-Tools website, which offers tools and add-ins for geospatial applications such as Caliper Maptitude, Microsoft MapPoint, Android, and Ultra Mileage.
Richard has been a technical reviewer for a number of Packt publications, including Python Geospatial Development and Python Geospatial Analysis Essentials, both by Erik Westra; and Python Geospatial Analysis Cookbook, by Michael Diener.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Learn Python by Building Data Science Applications
About Packt
Why subscribe?
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Code in Action
Conventions used
Get in touch
Reviews
Section 1: Getting Started with Python
Preparing the Workspace
Technical requirements
Installing Python
Downloading materials for running the code
Installing Python packages
Working with VS Code
The VS Code interface
Beginning with Jupyter
Notebooks
The Jupyter interface
Pre-flight check
Summary
Questions
Further reading
First Steps in Coding - Variables and Data Types
Technical requirements
Assigning variables
Naming the variable 
Understanding data types
Floats and integers
Operations with self-assignment
Order of execution
Strings
Formatting
Format method
F-strings
Legacy formatting
Formatting mini-language
Strings as sequences
Booleans
Logical operators
Converting the data types
Exercise
Summary
Questions
Further reading
Functions
Technical requirements
Understanding a function
Interface functions
The input function
The eval function
Variable properties
The help function
The type function
The isinstance function
dir
Math
abs
The round function
Iterables
The len function
The sorted function
The range function
The all and any functions
The max, min, and sum functions
Defining the function
Default values
Var-positional and var-keyword
Docstrings
Type annotations
Refactoring the temperature conversion
Understanding anonymous (lambda) functions
Understanding recursion
Summary
Questions
Further reading
Data Structures
Technical requirements
What are data structures?
Lists
Slicing
Tuples
Immutability
Dictionaries
Sets
More data structures
frozenset
defaultdict
Counter
Queue
deque
namedtuple
Enumerations
Using generators
Useful functions to use with data structures
The sum, max, and min functions
The all and any functions
The zip function
The map, filter, and reduce functions
Comprehensions
Summary
Questions
Further reading 
Loops and Other Compound Statements
Technical requirements
Understanding if, else, and elif statements
Inline if statements
Using if in a comprehension
Running code many times with loops
The for loop
itertools
cycle
chain
product
Enumeration
The while loop
Additional loop functionality – break and continue
Handling exceptions with try/except and try/finally 
Exceptions
try/except
try/except/finally
Understanding the with statements
Summary
Questions
Further reading
First Script – Geocoding with Web APIs
Technical requirements
Geocoding as a service
Learning about web APIs
Working with HTTPS
Working with the Nominatim API
The requests library
Starting to code
Caching with decorators
Reading and writing data
Geocoding the addresses
Moving code to a separate module
Collecting NYC Open Data from the Socrata service
Summary
Questions
Further reading
Scraping Data from the Web with Beautiful Soup 4
Technical requirements
When there is no API
HTML in a nutshell
Scraping with Beautiful Soup 4
CSS and XPath selectors
Developer console
Scraping WWII battles
Step 1 – Scraping the list of battles
Unordered list
Step 2 – Scraping information from the Wiki page
Key information
Additional information
Step 3 – Scraping data as a whole
Quality control
Beyond Beautiful Soup
Summary
Questions
Further reading
Simulation with Classes and Inheritance
Technical requirements
Understanding classes
Special (dunder) methods
__init__
__repr__ and __str__ 
Arithmetical and logical operations
Equality/relationship methods
__len__
__getitem__
__class__
Inheritance
Using super()
Data classes
Using classes in simulation
Writing the base classes
Writing the Island class
Herbivore haven
Harsh islands
Visualization
Summary
Questions
Further reading
Shell, Git, Conda, and More – at Your Command
Technical requirements
Shell
Pipes
Executing Python scripts
Command-line interface
Git
Concept
GitHub
Practical example
gitignore
Conda
Conda for virtual environments
Conda and Jupyter
Make
Cookiecutter
Summary
Questions
Section 2: Hands-On with Data
Python for Data Applications
Technical requirements
Introducing Python for data science
Exploring NumPy
Beginning with pandas
Trying SciPy and scikit-learn
Understanding Jupyter
Summary
Questions
Data Cleaning and Manipulation
Technical requirements
Getting started with pandas
Selection – by columns, indices, or both
Masking
Data types and data conversion
Math
Merging
Working with real data
Initial exploration
Defining the scope of work to be done
Getting to know regular expressions
Parsing locations
Geocoding
Time
Belligerents
Understanding casualties
Multilevel slicing
Quality assurance
Writing the file
Summary
Questions
Further reading
Data Exploration and Visualization
Technical requirements
Exploring the dataset
Descriptive statistics
Data visualization with matplotlib (and its pandas interface)
Aggregating the data to calculate summary statistics 
Resampling
Mapping
Declarative visualization with vega and altair
Drawing maps with Altair
Storing the Altair chart
Big data visualization with datashader
Summary
Questions
Further reading
Training a Machine Learning Model
Technical requirements
Understanding the basics of ML
Exploring unsupervised learning
Moving on to supervised learning
k-nearest neighbors
Linear regression
Decision trees
Summary
Questions
Further reading
Improving Your Model – Pipelines and Experiments
Technical requirements
Understanding cross-validation
Exploring feature engineering
Failed attempts
Optimizing the hyperparameters
Using a random forest model
Tracking your data and metrics with version control
Starting with data
Adding code to the equation
Metrics
Summary
Questions
Further reading
Section 3: Moving to Production
Packaging and Testing with Poetry and PyTest
Technical requirements
Building a package
Bringing your own package
Using a package manager – pip and conda
Creating a package scaffolding
A few ways to build your package
Trying out code with Poetry
Adding actual code
Defining dependencies
Non-code resources
Publishing the package
Development workflow
Testing the code so far
Testing with PyTest
Writing our own tests
Automating the process with CI services
Generating documentation with sphinx
Installing a package in editable mode
Summary
Questions
Further reading
Data Pipelines with Luigi
Technical requirements
Introducing the ETL pipeline
Redesigning your code as a pipeline
Building our first task in Luigi
Connecting the dots
Understanding time-based tasks
Scheduling with cron
Exploring the different output formats
Writing to an S3 bucket
Writing to SQL
Expanding Luigi with custom template classes
Summary
Questions
Further reading
Let's Build a Dashboard
Technical requirements
Building a dashboard – three types of dashboard
Static dashboards
Debugging Altair
Connecting your app to the Luigi pipeline
Understanding dynamic dashboards
First try with panel
Reading data from the database
Creating an interactive dashboard in Jupyter
Summary
Questions
Further reading
Serving Models with a RESTful API
Technical requirements
What is a RESTful API?
Python web frameworks
Building a basic API service
Exploring service with OpenAPI
Finalizing our naive first iteration
Data validation
Sending data in with POST requests
Adding features to our service
Building a web page
Speeding up with asynchronous calls
Deploying and testing your API loads with Locust
Summary
Questions
Further reading
Serverless API Using Chalice
Technical requirements
Understanding serverless
Getting started with Chalice
Setting up a simple model
Externalizing medians
Building a serverless API for an ML model
When we're still out of memory
Building a serverless function as a data pipeline
S3-triggered events
Summary
Questions
Further reading
Best Practices and Python Performance
Technical requirements
Speeding up your Python code
Rewriting the code with NumPy
Specialized data structures and algorithms
Dask
Dask-ML
Numba
Concurrency and parallelism
Different types of concurrency
Two types of problems
Before you start rewriting your code
Using best practices for coding in your project
Code formatting with black
Measuring code quality with Wily
Writing tests with hypothesis
Beyond this book – packages and technologies to look out for
Different Python flavors
Docker containers
Kubernetes
Summary
Questions
Further reading
Assessments
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Chapter 14
Chapter 15
Chapter 16
Chapter 17
Chapter 18
Chapter 19
Chapter 20
Other Books You May Enjoy
Leave a review - let other readers know what you think
Python has become one of the most popular programming languages in the world, according to multiple polls and metrics. This popularity is, to no small extent, a direct result of the language's simplicity, power, and scalability, which allow it to run even large-scale applications such as Dropbox, YouTube, and many others. It becomes even more valuable with the rising adoption of machine learning techniques and algorithms, including state-of-the-art algorithms at the cutting edge of scientific advancement.
Consequently, there are hundreds of books, courses, and online tutorials on different aspects of programming, machine learning, data processing, and more. Many sources highlight the importance of learning by doing and building your own projects. Connecting the dots and structuring all this vast knowledge into one big picture is not an easy task. Seeing the big picture, in our opinion, is critical for the completion of any project. Indeed, there are plenty of options to weigh and decisions to make at every step. It is the grand scheme of the project as a whole that helps you make those decisions, focus on what matters, and spend your time wisely.
This book is designed to be an entry point for any newcomer or novice developer, aiming to cover the whole life cycle of a data-driven application. By the end of it, you will be able to write arbitrary Python code, collect and process data, explore it, and build your own packages, dashboards, and APIs. Multiple notes and tips point to alternative solutions or decisions, allowing you to adapt the code to your specific needs.
This book will be a useful resource if any of the following apply to you:
You have just started to code.
You know the basics but struggle to build something handy.
You know your specific domain well—whether it be statistics, machine learning, or development—but lack experience in other parts of building a project.
You're an experienced developer with little exposure to Python, trying to learn about the Python package's ecosystem.
If you feel you fall into any of those categories, or want to build a project from scratch for other reasons, please join us on this journey.
This book is aimed at new Python developers with little to no prior programming skills beyond basic computer literacy. The book doesn't require any previous background in data science or statistics either. That being said, it covers a variety of topics, from data processing to visualization to delivery, including dashboards, APIs, Extract, Transform, Load (ETL) pipelines, and standalone packages. Thus, it is also suited to experienced data scientists interested in productizing their work. For a complete novice, this book aims to cover all major parts of the data application life cycle—from Python basics to scripts, data collection and processing, and the delivery of your work in different formats.
This book consists of three main sections. The first one is focused on language fundamentals, the second introduces data analysis in Python, and the final section covers different ways to deliver the results of your work. The last chapter of each section is focused on non-Python tools and topics related to the section subject.
Section 1, Getting Started with Python, introduces the Python programming language and explains how to install Python and all of the packages and tools we'll be using.
Chapter 1, Preparing the Workspace, covers all the tools we'll need throughout the book—what they are, how to install them, and how to use their interfaces. This includes the installation process for Python 3.7, all of the packages we'll require throughout the book, how to install all of them at once in a separate environment, as well as two code development tools we'll use—the Jupyter Notebook and VS Code. Finally, we'll run our first script to ensure everything works fine! By the end of this chapter, you will have everything you need to execute the book's code, ready to go.
Chapter 2, First Steps in Coding – Variables and Data Types, gives an introduction to fundamental programming concepts, such as variables and data types. You'll start writing code in Jupyter, and will even solve a simple problem using the knowledge you've just acquired.
Chapter 3, Functions, introduces yet another concept fundamental to programming—functions. This chapter covers the most important built-in functions and teaches you about writing new ones. Finally, you will revisit the problem from the previous chapter, and write an alternative solution, using functions.
Chapter 4, Data Structures, covers different types of data structures in Python—lists, sets, dictionaries, and many others. You will learn about the properties of each structure, their interfaces, how to operate them, and when to use them.
Chapter 5, Loops and Other Compound Statements, illustrates the different compound statements in Python: loops, if/else, try/except, one-liners, and others. These represent core logic in the code and allow non-linear code execution. At the end of this chapter, you'll be able to operate on large data structures using short, expressive code.
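As a small taste of what this chapter covers, here is a minimal sketch (plain Python, no extra packages, with made-up sample values) of the same filtering logic written first as a for loop with an if statement, and then as a one-line comprehension:

```python
# A made-up list of sensor readings; we want only the positive ones.
readings = [12, -3, 7, -8, 20]

# Explicit loop with a conditional.
positive = []
for r in readings:
    if r > 0:
        positive.append(r)

# The equivalent list comprehension: same logic, one expressive line.
positive_short = [r for r in readings if r > 0]

print(positive, positive_short)  # [12, 7, 20] [12, 7, 20]
```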
Chapter 6, First Script – Geocoding with Web APIs, introduces the concept of APIs, working with HTTP and geocoding service APIs in particular, from Python. At the end of this chapter, you'll have fully operational code for geocoding addresses from the dataset—code that you'll be using extensively throughout the rest of the book, but that's also highly applicable to many tasks beyond it.
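As a rough preview of the kind of code this chapter builds, here is a hedged sketch of geocoding with the Nominatim web API. The function names, the User-Agent string, and the sample response values are our own illustrative choices, not the book's actual code; the endpoint and query parameters follow the public Nominatim search API:

```python
import json

def nominatim_geocode(address):
    """Query the public Nominatim search endpoint for one address.

    Real code should also respect the Nominatim usage policy
    (rate limits, a meaningful User-Agent header).
    """
    import requests  # third-party: pip install requests
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json", "limit": 1},
        headers={"User-Agent": "learn-python-example"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def parse_first_hit(results):
    """Pull (lat, lon) from the first hit, or None if nothing was found."""
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

# An illustrative response sample (structure per the API; values made up),
# so the parsing step can be shown without a network call:
sample = json.loads(
    '[{"display_name": "Brooklyn Bridge", "lat": "40.7061", "lon": "-73.9969"}]'
)
print(parse_first_hit(sample))  # (40.7061, -73.9969)
```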
Chapter 7, Scraping Data from the Web with Beautiful Soup 4, illustrates a solution to a similar but more complex task of data extraction from HTML pages—scraping. Step by step, you will build a script that collects pages and extracts data on all the battles in World War II, as described in Wikipedia. At the end of this chapter, you'll know the limitations, challenges, and the main solutions of the scraping packages used for the task, and will be able to write your own scrapers.
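To give a flavor of the scraping work in this chapter, here is a small self-contained sketch using Beautiful Soup 4. The HTML snippet is invented for illustration; the book's actual scraper targets live Wikipedia pages:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# A toy fragment mimicking a list of links on a page.
html = """
<ul id="battles">
  <li><a href="/wiki/Battle_of_France">Battle of France</a></li>
  <li><a href="/wiki/Battle_of_Britain">Battle of Britain</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selector: every <a> inside the element with id="battles".
links = [(a.get_text(), a["href"]) for a in soup.select("#battles a")]
print(links)
```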
Chapter 8, Simulation with Classes and Inheritance, introduces one more critical concept for programming in Python—classes. Using classes, we will build a simple simulation model of an ecological system. We'll compute, collect, and visualize metrics, and use them to analyze the system's behavior.
Chapter 9, Shell, Git, Conda, and More – at Your Command, covers the basic tools essential for the development process—from Shell and Git, to Conda packaging and virtual environments, to the use of makefiles and the Cookiecutter tool. The information we share in this chapter is essential for code development in general, and Python development in particular, and will allow you to collaborate and speak the same language as other developers.
Section 2, Hands-On with Data, focuses on using Python for data processing and analysis, including cleaning, visualization, and training machine learning models.
Chapter 10, Python for Data Applications, works as an introduction to the Python data analysis ecosystem—a distinct group of packages that allow simple work with data, its processing, and analysis. As a result, you will get familiar with the main packages and their purpose, their special syntaxes, and will understand what makes them work substantially faster than normal Python for numeric calculations.
Chapter 11, Data Cleaning and Manipulation, shows how to use the pandas package to process and clean our data, and make it ready for analysis. As an example, we'll clean and prepare the dataset we obtained from Wikipedia in Chapter 7, Scraping Data from the Web with Beautiful Soup 4. Through the process, we'll learn how to use regular expressions, use the geocoding code we wrote in Chapter 6, First Script – Geocoding with Web APIs, and an array of other techniques to clean the data.
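As an illustration of the kind of cleaning this chapter performs, the following sketch uses pandas and a regular expression to pull numbers out of messy text. The column name and values are made up for the example; they are not the book's dataset:

```python
import pandas as pd

raw = pd.DataFrame({"casualties": ["1,200 killed", "350 killed", "unknown"]})

# str.extract keeps the first regex capture group; thousands separators
# are stripped before conversion, and non-matching rows become NaN.
raw["killed"] = (
    raw["casualties"]
    .str.extract(r"([\d,]+)\s+killed", expand=False)
    .str.replace(",", "", regex=False)
    .astype("float")
)
print(raw["killed"].tolist())  # [1200.0, 350.0, nan]
```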
Chapter 12, Data Exploration and Visualization, explains how to explore an arbitrary dataset, asking and answering questions about it using queries, statistics, and visualizations. You'll learn how to use two visualization libraries, Matplotlib and Altair. Both can produce quick static charts as well as more advanced, interactive ones. As our case example, we'll use the dataset we cleaned in the previous chapter.
Chapter 13, Training a Machine Learning Model, presents the core idea of machine learning and shows how to apply unsupervised learning with the k-means clustering algorithm, and supervised learning with KNN, linear regression, and decision trees, to a given dataset.
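To preview the supervised-learning workflow of this chapter, here is a minimal k-nearest neighbors sketch with scikit-learn. The toy dataset is invented purely for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# A made-up dataset: two features per sample, two classes.
X = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]

# Each new point is classified by a majority vote of its 3 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[0.05, 0.1], [0.95, 0.9]]))  # [0 1]
```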
Chapter 14, Improving Your Model – Pipelines and Experiments, highlights ways to improve your model, using feature engineering, cross-validation, and by applying a more sophisticated algorithm. In addition, you will learn how to track your experiments and keep both code and data under version control, using data version control with dvc.
Section 3, Moving to Production, is focused on delivering the results of your work with Python, in different formats.
Chapter 15, Packaging and Testing with Poetry and PyTest, explains the process of packaging. Using our Wikipedia scraper as an example, we'll create a package using the poetry library, set dependencies and a development environment, and make the package accessible for installation using pip from GitHub. To ensure the package's functionality, we will add a few unit tests using the pytest testing library.
Chapter 16, Data Pipelines with Luigi, introduces ETL pipelines and explains how to build and schedule one using the luigi framework. We will build a set of interdependent tasks for data collection and processing and set them to work on a scheduled basis, writing data to local files, S3 buckets, or a database.
Chapter 17, Let's Build a Dashboard, covers a few ways to build and share a dashboard online. We'll start by writing a static dashboard based on the charts we made with the Altair library in Chapter 12, Data Exploration and Visualization. As an alternative, we will also deploy a dynamic dashboard that pulls data from a database upon request, using the panel library.
Chapter 18, Serving Models with a RESTful API, brings us back to the API theme—but this time, we'll build an API on our own, using the FastAPI framework and the pydantic package for validation. Using a machine learning model, we'll build a fully operational API server, with OpenAPI documentation and strict request validation. As FastAPI supports asynchronous execution, we'll also discuss what that means and when to use it.
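As a hint of the request validation this chapter relies on, here is a small pydantic sketch. The Prediction model and its fields are hypothetical, not the book's actual schema; the point is how pydantic coerces well-formed input and rejects malformed input:

```python
from pydantic import BaseModel, ValidationError

class Prediction(BaseModel):
    """Hypothetical shape of one response from a model-serving endpoint."""
    battle: str
    allied_victory: bool
    confidence: float

# Well-formed input: the numeric string "0.9" is coerced to a float.
ok = Prediction(battle="Midway", allied_victory=True, confidence="0.9")
print(ok.confidence)  # 0.9

# Malformed input is rejected with a descriptive ValidationError.
rejected = False
try:
    Prediction(battle="Midway", allied_victory=True, confidence="high")
except ValidationError:
    rejected = True
print(rejected)  # True
```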
Chapter 19, Serverless API Using Chalice, goes beyond serving an API with a personal server and shows how to achieve similar results with a serverless application, using AWS Lambda and the chalice package. This includes building an API endpoint, a scheduled pipeline, and serving a machine learning model. Along the way, we discuss the pros and cons of running serverless, its limitations, and ways to mitigate them.
Chapter 20, Best Practices and Python Performance, comprises three distinct parts. The first part showcases different ways to make your code faster: using NumPy's vectorized computations or a specific data structure (in our case, a k-d tree), extending computations to multiple cores or even machines with Dask, or leveraging the performance (and, potentially, the GIL release) of just-in-time compilation with Numba. We also discuss different ways to achieve concurrency in Python, using threads, asynchronous tasks, or multiple processes.
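The vectorization idea from the first part can be sketched in a few lines. The function names and toy points here are our own; the pattern, replacing a Python-level loop with a single NumPy array expression, is the one the chapter discusses:

```python
import numpy as np

def distances_loop(points, target):
    # Pure-Python loop: one Euclidean distance at a time.
    return [
        sum((p - t) ** 2 for p, t in zip(pt, target)) ** 0.5
        for pt in points
    ]

def distances_numpy(points, target):
    # Vectorized: the whole array is processed in compiled C code at once,
    # which is what makes NumPy much faster on large inputs.
    diff = np.asarray(points) - np.asarray(target)
    return np.sqrt((diff ** 2).sum(axis=1))

pts = [[0.0, 0.0], [3.0, 4.0]]
print(distances_numpy(pts, [0.0, 0.0]))  # [0. 5.]
```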
The second part of the chapter focuses on improving the speed and quality of development. In particular, we'll cover the use of linters and formatters—the black package in particular; code maintainability measurements with wily; and advanced, data-driven code testing with the hypothesis package.
Finally, the third part of this chapter goes over a few technologies beyond Python that may still be useful to you. The list includes different Python interpreters, such as Jython, Brython, and Iodide; Docker; and Kubernetes.
This book is designed for complete beginners and people who have just started to learn to code. It does not require any specific knowledge besides basic computer literacy.
The execution of the code examples provided in this book requires an installation of Python 3.7.3 or later on macOS, Linux, or Microsoft Windows. The code presented throughout the book makes use of many Python libraries. In each chapter, a list of required libraries is given at the beginning. A full list of libraries is stored in the GitHub repository, in the environment.yaml file. The same file can be used to install Python and all of the required libraries in bulk—full instructions are given in Chapter 1, Preparing the Workspace.
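For orientation, a Conda environment.yaml typically looks like the following. This snippet is illustrative only: the environment name, channels, and package pins here are our guesses, and the authoritative file lives in the book's GitHub repository.

```yaml
# Illustrative only -- see the book's repository for the real file.
name: learnpython
channels:
  - conda-forge
dependencies:
  - python=3.7
  - jupyter
  - numpy
  - pandas
  - scikit-learn
  - altair
  - pip
```

With such a file in place, the whole environment can be created in one step with `conda env create -f environment.yaml`.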
The code for this book was developed in and extensively uses two development environments—VS Code editor with its Python bundle, and Jupyter. We recommend using both for better alignment with the book's narrative.
The code for Chapter 6, First Script – Geocoding with Web APIs, Chapter 7, Scraping Data from the Web with Beautiful Soup 4, Chapter 11, Data Cleaning and Manipulation, and Chapter 16, Data Pipelines with Luigi, requires an internet connection.
The first chapter will provide you with step-by-step instructions and some useful tips for setting up your Python environment, the core libraries, and all the necessary tools.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789535365_ColorImages.pdf.
Visit the following link to check out videos of the code being run: http://bit.ly/2MIb3Pn
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This section focuses on becoming familiar with general-purpose Python, making use of existing libraries, writing our first scripts, learning the basics of Git, and using the IDE. In this section, we will also lay the foundation for our projects, building pipelines to process (project 1), collect (project 2), and simulate (project 3) data.
This section comprises the following chapters:
Chapter 1, Preparing the Workspace
Chapter 2, First Steps in Coding – Variables and Data Types
Chapter 3, Functions
Chapter 4, Data Structures
Chapter 5, Loops and Other Compound Statements
Chapter 6, First Script – Geocoding with Web APIs
Chapter 7, Scraping Data from the Web with Beautiful Soup 4
Chapter 8, Classes and Inheritance
Chapter 9, Shell, Git, Conda, and More – at Your Command
Welcome! We're very excited to start learning and building things with you! However, we need to get ourselves ready first.
In this chapter, we'll learn how to download and install everything you'll need throughout the book, including Python itself, all the Python packages that we'll need, and two development tools we will be using extensively: Jupyter and Visual Studio Code (VS Code). After that, we'll go through a brief overview of Jupyter and VS Code interfaces. Finally, you will run your very first line of Python, so we need to ensure that everything is ready before we dive in.
In this chapter, we'll cover the following:
The minimum computer configuration required
How to install the Anaconda distribution
How to download the code for this book
Setting up and getting familiar with VS Code and Jupyter
Running your first line of code to ensure everything runs smoothly
By the end of this chapter, you will have learned about the hardware requirements for Python and this book, and what you can do if you don't have a sufficiently powerful computer. You will also learn how to install Python 3.7.2 and all required packages and tools using the open source Anaconda distribution.
Python can be very humble and does not require an advanced computer. In fact, you can run Python on a $10 Raspberry Pi or an Arduino board! The code and data we use in this book do not require any special computational power; any laptop or computer made after 2008 will do. At least 2 GB of RAM, 20 GB of disk space, and an internet connection should suffice. Your operating system (OS) shouldn't be a problem either, as Python and all the tools we will use are cross-platform and work on Windows, macOS, and Linux.
Throughout the book, we'll use two main tools to write the code: Jupyter and VS Code. Both of them are free and aren't demanding.
All the code for the book is publicly available and free to access at https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications.
There are multiple Python distributions, starting with the original, vanilla Python, which is accessible at https://www.python.org/. Data analysis, however, adds unique requirements for packaging (https://www.youtube.com/watch?v=QjXJLVINsSA&feature=youtu.be&t=3555). In this book, we use Anaconda, which is an open source and free Python distribution, designed for data science and machine learning. Anaconda's main features include a smooth installation of data science packages (many of which run C and Fortran code under the hood) and conda, which is a great package and environment manager (we will talk more about environments and conda later in Chapter 9, Shell, Git, Conda, and More – at Your Command). Conveniently, the Anaconda distribution installs all the packages (https://docs.anaconda.com/anaconda/packages/pkg-docs/) we need in this book and many more!
In order to install Anaconda, follow these steps:
First, go to the Anaconda distribution web page at https://www.anaconda.com/distribution/.
Select the Python 3.7 graphical installer for your platform and download it (at the time of writing, there is no graphical installer for Linux, so you'll have to use the one for the command line). The following screenshot shows what the interface looks like—we've marked the link we're interested in with dotted lines:
Run the installation. Keep all settings as default. When you're asked whether you want to install PyCharm, select no (unless you personally want it, of course; we won't use PyCharm in this book):
Voila! Now we have Python up and running! Next, let's download all the materials for this book.
All code in this book is also available as a separate archive of files—either Python scripts or Jupyter notebooks. You can download the full archive and follow along with the book using the relevant code from GitHub (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications). Everything is stored on GitHub, which is an online service for code storage with version control. We will discuss both Git and GitHub in Chapter 9, Shell, Git, Conda, and More – at Your Command, but in this case, you won't need version control, so it is easier to download everything as an archive. Just use the Clone or download button on the right side (1), and select Download ZIP (2):
Once the download is complete, unzip the file and move it to a convenient location. This folder will be our main workspace throughout the book.
Many of the chapters in this book teach you how to make use of specific packages. Most of them are included in the standard Anaconda distribution, so if you installed Python using the Anaconda distribution, then you will have them already. Some packages might not be installed though, so we'll have to install them separately as per our requirements for every chapter. This is totally fine, and we'll specify which packages will be used at the beginning of each chapter.
In order to install a specific package, you have two options:
Installing via Anaconda by running either of the following commands. Specifying a channel is required if a package is rare and not present on the default channels of Anaconda and conda-forge:
> conda install <mypackage>
> conda install -c <mychannel> <mypackage>
Some packages are not present in conda at all. You can search for packages through the channels at https://anaconda.org/.
Most packages can be installed using pip:
> pip install <mypackage>
Generally speaking, we recommend using conda over pip for installation.
Alternatively, there is a single specification file in the root of the repository that you can use to install everything at once. To do so, you need to open your Terminal and navigate to the repository's root (we will explain how to do that in Chapter 9, Shell, Git, Conda, and More – at Your Command, but VS Code's Terminal will open in the root of the given folder automatically). Once there, run the following command:
conda env update --name root -f environment.yml
Then, follow the instructions. Here, conda uses the environment.yml specification file as a list of packages to install.
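For orientation, an environment.yml file is a plain text specification. A minimal sketch might look like the following (the environment name and package list here are illustrative, not the book's actual file):

```yaml
# Illustrative environment.yml sketch (not the repository's actual file)
name: learn-python-ds        # hypothetical environment name
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.7
  - numpy
  - pandas
```

Conda reads the channels and dependencies sections and resolves compatible versions of everything listed.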
Now, let's install our main development tools: VS Code and Jupyter.
VS Code is invaluable for Python development and experimentation. VS Code—not to be confused with Visual Studio, which is a commercial product—is a sophisticated, completely free, and open source text editor created by Microsoft. It is language-agnostic and will work perfectly with Python, JavaScript, Java, or any other language. VS Code has hundreds of built-in features and thousands of great plugins to expand its capabilities.
In order to install VS Code, head to its main web page, https://code.visualstudio.com/, and download the package for your OS. The installation is pretty straightforward; there is no need to change any of the default settings. Assuming you installed VS Code as part of the previous steps, you now need to open the VS Code application. Next, switch to the plugin marketplace menu (as shown in the following screenshot), type Python, and install the plugin. Python binding for VS Code provides plenty of Python-specific features and will prove very useful for us throughout the book.
In the following screenshot, 1 represents the plugin marketplace. Once switched, type Python in the search form (2), select the plugin (3), and hit install (Python was already installed in this screenshot, hence it offers to uninstall it instead):
Once that's done, let's briefly review the interface of the tool.
Let's go over the VS Code interface. In the following screenshot, you can see five distinctive sections:
Section 1 of VS Code has six icons (more will appear after installing certain plugins). The last one at the bottom of the toolbar, which is a gear symbol, represents the settings. All the others represent different modes, from top to bottom:
Explorer mode, which allows us to look for the files that are open in the given workspace
Search mode, which allows us to look for a particular text element throughout the whole workspace
A built-in Git client (more on that in Chapter 9, Shell, Git, Conda, and More – at Your Command)
Debugger mode, which halts and inspects code in the middle of the execution in order to understand what's happening under the hood
VS Code's plugin marketplace
Every mode changes the content of section 2. This is an area dedicated to working with the workspace as a whole: adding new files, removing existing ones, or traversing through variables in debugging sessions.
Section 3 is the main one. Here, we actually write and read the code. You can have multiple tabs or even split this window into many: vertically, horizontally, or both. Most of the time, each tab represents one file in the workspace.
If you don't have section 4 open, then go to View | Terminal or use the Ctrl + ` shortcut. You can also drag this section out from the upper edge of section 5 using your mouse, if you prefer.
Section 4 has four subsections. In PROBLEMS, VS Code will point you to some potential issues in the code. The OUTPUT and DEBUG CONSOLE tabs' roles are self-explanatory, and we won't use them much. The most important tab here is Terminal: it duplicates the Terminal built into your OS (hence, it does not directly relate to VS Code itself). Terminals allow us to run system-wide commands, create folders, write to files, execute Python scripts, and run any software, which is essentially everything you can do via your OS graphical interface, but done just using code. We will cover the Terminal in more depth in Chapter 9, Shell, Git, Conda, and More – at Your Command. Conveniently, VS Code's Terminals open in the root directory of the workspace, which is a feature we will constantly utilize throughout the book.
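As a minimal illustration of what running commands in a Terminal looks like (the folder and file names here are just examples, not anything the book requires):

```shell
# A few things you can do from a Terminal, no GUI needed
mkdir -p demo_folder                           # create a folder
echo "A note to self" > demo_folder/note.txt   # write text to a file
ls demo_folder                                 # list the folder's contents
```

Each line is a command; the Terminal executes it when you hit Enter and prints any output on the following lines.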
Lastly, section 5 is an information bar that shows the current properties of the workspace, including the interpreter's name, Git repository and branch names (more on that in Chapter 9, Shell, Git, Conda, and More – at Your Command), and cursor position. Most of those elements are interactive!
One more feature that is hidden from the newcomers, but is an extremely powerful feature of VS Code, is its command palette, as shown in the following screenshot:
You can open the command palette using the Ctrl (command on macOS) + Shift + P shortcut. The command palette allows you to type in, select, and execute practically any feature of the application, from switching the color theme to searching for a word, to almost anything else. This feature allows programmers to avoid using a mouse or trackpad, and once mastered, it drastically increases productivity.
For example, let's create a new file (Ctrl/command + N) and type Hello Python!. Now, in order to switch that text to uppercase, all we need is to do the following:
Select all of the text by using Ctrl/command + A.
Open the command palette (Ctrl/command + Shift + P) and type Upper. Select the Transform to Uppercase command (note that the command palette also shows shortcuts).
Spend some time learning VS Code's features! One great place for that is the Interactive Playground: you can jump straight into it by typing the name into the command palette.
Another development environment we'll use is Jupyter. If you have installed Anaconda, then Jupyter is already on your machine, as it is one of the tools that come with Anaconda. To start using Jupyter, we need to run it from the Terminal (you might need to open a new Terminal to update the paths). The following command runs the newer version of the tool's frontend, and that is what we'll use:
$ jupyter lab
Alternatively, Jupyter also supports an older version of the frontend, Jupyter Notebook. The two have their differences, but we'll stick with the lab.
The app's behavior depends on the folder from which it was started; it is more convenient to run it directly from the project's root folder. That's why it is so handy that VS Code's Terminal opens in a workspace folder by itself, as we don't need to navigate there every time. But why do we need another developer tool, anyway? That's what the next section is all about.
As we mentioned earlier, Jupyter is designed with a different approach to programming than VS Code. Its central concept is so-called notebooks: files that allow the mixing of actual code, text (including markdown and LaTeX equations), as well as plots, images, videos, and interactive visualizations. In notebooks, you execute code interactively, one cell after another. This way, you can experiment easily—write some code, run it, see the outcomes, and then tweak it again.
The outcomes are shown along with the code so that you can open and read the notebook, even without executing it. Because of that, notebooks are especially useful in scientific/analytical contexts, as on the one hand, they allow us to describe what we're doing with text and illustrations, and on the other hand, they keep the actual code tied to the narrative so that anyone can inspect and confirm that your analysis is valid. One great example of that is LIGO notebooks, which represent the actual code that was used to discover gravitational waves in the universe (this research won the Nobel Prize in 2017).
Notebooks are also great for teaching (as in the case of this book), as students can interact with each and every part of the code by themselves. However, while Jupyter is good for exploration, it feels less convenient when your code base starts to grow and mature. Because of this, we will switch back and forth between Jupyter and VS Code throughout the course of this book, picking the right tool for each particular job.
Let's now look at Jupyter's interface.
Let's get familiar with Jupyter's interface. This software works differently from VS Code: Jupyter works as a web server that is accessible through a browser. To make it run, just type jupyter lab in VS Code's Terminal window and hit Enter. This will start the server. Depending on your OS, either a link will be printed in the Terminal (starting with http://localhost:...), or your default web browser will just open the page automatically. You can stop the Jupyter server by hitting Ctrl + C within the Terminal and typing yes, if prompted, or by closing the window.
Jupyter's layout, as shown in the following screenshot, is somewhat similar to that of VS Code:
Here, again, the tabs in section 1 show all the modes available for section 2, including a file browser, a list of running notebooks, a list of available commands, and tabs. The second section represents one of the modes previously described. Finally, the main section, section 3, shows all open tabs, similar to section 3 in VS Code. The default tab is Launcher. From here, we can create new notebooks, text files (such as classic code or data files), Terminals, and consoles.
Note that the launcher explicitly states Python 3 for both notebooks and consoles. This is because Jupyter is also language-agnostic. In fact, the name Jupyter comes from the Julia-Python-R triad of analytical languages, but the application supports many others, including C, Java, and Rust. In this book, we'll only use Python.
If everything went smoothly with Jupyter, then we're ready to go! But before we dive into coding, let's do one last pre-flight check.
Before we proceed to the content of this book, let's ensure our code can actually be executed by running the simplest possible code in Jupyter. To do this, let's create a test notebook and run some code to ensure everything works as intended. Click on the Python 3 square in the Notebook section. A new tab should open, called Untitled.ipynb.
First, the highlighted blue line represents the selected cell in the notebook. Each cell represents a separate snippet of code, which is executed as a single unit in one step. Let's write our very first line of code in this cell:
print('Hello world')
Now, hit Shift + Enter. This shortcut executes the selected cells in Python and outputs the result on the next line. It also automatically creates a new input cell if there are none, as shown in the following screenshot. The number on the left gives a hint as to the order in which cells are executed, so the first cell to be executed will be marked with 1. The asterisk means the cell is under execution and computation is underway:
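One small convenience worth knowing when experimenting in cells: besides print(), Jupyter automatically displays the value of the last expression in a cell, with no print() required. A quick sketch:

```python
# print() writes to the cell's output...
print('Hello world')

# ...and Jupyter also displays the value of the last expression
# in a cell automatically, even without print():
greeting = 'Hello' + ' ' + 'world'
greeting  # in a notebook, this line alone would display 'Hello world'
```

This behavior is specific to interactive frontends such as Jupyter; in a plain Python script, only the print() line produces output.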
If everything worked properly, and you see Hello world in the output, then congratulations—you are ready for the following chapters!
In this chapter, we prepared our working environment for the journey ahead. In particular, we installed the Anaconda Scientific Python Distribution with Python 3.7.2, which includes all the packages we'll need throughout the course of this book. We also installed and learned the basics of our two main tools: VS Code, a sophisticated and interactive development environment that will be our primary tool for writing arbitrary code, and Jupyter, a coding environment that is perfect for prototyping, experimentation, analysis, and educational purposes. Finally, we even ran some code in Jupyter already!
In the next chapter, we'll begin our introduction to Python, learning about variables, variable assignment, and Python's basic data types.
What version of Python do we use?
Will it work on a Windows PC?
Do I need to install any additional packages?
What is a Jupyter Notebook?
When and why should I use Jupyter Notebooks?
When should I switch to VS Code?
Can I run the code from this book on my smartphone/tablet?
Python for Beginners: Learn Python Programming (Python 3) [Video] (https://www.packtpub.com/application-development/python-beginners-learn-python-programming-python-3-video)
Data Science Projects with Python (https://www.packtpub.com/big-data-and-business-intelligence/data-science-projects-python)
The Scientific Paper Is Obsolete (https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/)
Having set up all the tools, you're now ready to dive into development. Fire up Jupyter—in this chapter, we will get our hands dirty with code! We'll start with the concept of variables, and learn how to assign and use them in Python. We will discuss best practices on naming them, covering both strict requirements and general conventions. Next, we will cover Python's basic data types and the operators they support, including integers, decimal numbers (floats), strings, and Boolean values. Each data type has a corresponding behavior, typing rules, built-in methods, and works with certain operators.
At the end of this chapter, we will put everything we learned into practice by writing our own vacation budgeting calculator.
The topics covered in this chapter are as follows:
Assigning variables
Naming the variables
Understanding data types
Converting the data types
Exercise
You can follow the code for this chapter in the Chapter02/first_code.ipynb notebook. No additional packages are required, just Python.
You can find the code via the following link, which is in the GitHub repository in the Chapter02 folder (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications).
Naming variables may seem to be a minor topic, but trust us, adopting a good habit of proper naming will save you a lot of time and nerves down the road. Do your best to name variables wisely and consistently. Ambiguous names will make code extremely hard to read, understand, and debug, for no good reason.
Now, technically there are just three requirements for variable names:
You cannot use reserved words: False, class, finally, is, return, None, continue, for, lambda, try, True, def, from, nonlocal, while, and, del, global, not, with, as, elif, if, or, yield, assert, else, import, pass, break, except, in, or raise. You also cannot use operators or special symbols (+, -, /, *, %, <, >, @, &) or brackets and parentheses as part of variable names.
Variable names can't start with digits.
Variable names can't contain whitespace. Use the underscore symbol instead.
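The three rules above can be illustrated with a short sketch (the names themselves are made up for the example):

```python
# Valid names under the three rules above
counter = 0          # a plain lowercase word
first_name = "Ada"   # an underscore instead of whitespace
route66 = "US"       # digits are fine, just not as the first character

# Each of the following would be a SyntaxError if uncommented:
# 66route = "US"     # starts with a digit
# class = "math"     # 'class' is a reserved word
# first name = "Ada" # contains whitespace
```

Python checks these rules at parse time, so an invalid name stops the code from running at all rather than failing later.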
On top of that, there are also some general naming conventions. You don't have to, but it is strongly recommended to follow them:
Name your variables meaningfully and consistently, so that readers will understand what they are meant to be. Some examples are counter, car, and today.
Apply snake_case