Learn Python by Building Data Science Applications - Philipp Kats - E-Book

Learn Python by Building Data Science Applications E-Book

Philipp Kats

Description

Understand the constructs of the Python programming language and use them to build data science projects




Key Features



  • Learn the basics of developing applications with Python and deploy your first data application


  • Take your first steps in Python programming by understanding and using data structures, variables, and loops


  • Delve into Jupyter, NumPy, Pandas, SciPy, and sklearn to explore the data science ecosystem in Python



Book Description



Python is the most widely used programming language for building data science applications. Complete with step-by-step instructions, this book contains easy-to-follow tutorials to help you learn Python and develop real-world data science projects. The “secret sauce” of the book is its curated list of topics and solutions, put together using a range of real-world projects, covering initial data collection, data analysis, and production.






This Python book starts by taking you through the basics of programming, right from variables and data types to classes and functions. You'll learn how to write idiomatic code and test and debug it, and discover how you can create packages or use the range of built-in ones. You'll also be introduced to the extensive ecosystem of Python data science packages, including NumPy, Pandas, scikit-learn, Altair, and Datashader. Furthermore, you'll be able to perform data analysis, train models, and interpret and communicate the results. Finally, you'll get to grips with structuring and scheduling scripts using Luigi and sharing your machine learning models with the world as a microservice.






By the end of the book, you'll have learned not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards.




What you will learn



  • Code in Python using Jupyter and VS Code


  • Explore the basics of coding – loops, variables, functions, and classes


  • Deploy continuous integration with Git, Bash, and DVC


  • Get to grips with Pandas, NumPy, and scikit-learn


  • Perform data visualization with Matplotlib, Altair, and Datashader


  • Create a package out of your code using poetry and test it with PyTest


  • Make your machine learning model accessible to anyone with the web API



Who this book is for



If you want to learn Python or data science in a fun and engaging way, this book is for you. You'll also find this book useful if you're a high school student, researcher, analyst, or anyone with little or no coding experience with an interest in the subject and courage to learn, fail, and learn from failing. A basic understanding of how computers work will be useful.

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 610

Year of publication: 2019




Learn Python by Building Data Science Applications

A fun, project-based guide to learning Python 3 while building real-world apps

Philipp Kats
David Katz

BIRMINGHAM - MUMBAI

Learn Python by Building Data Science Applications

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Richa Tripathi
Acquisition Editor: Chaitanya Nair
Content Development Editor: Tiksha Sarang
Senior Editor: Afshaan Khan
Technical Editor: Romy Dias
Copy Editor: Safis Editing
Project Coordinator: Prajakta Naik
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Nilesh Mohite

First published: August 2019

Production reference: 1300819

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78953-536-5

www.packt.com

 
Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the authors

Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. Having earned a bachelor's degree in architectural design and followed the (at first) rocky path of a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.

I would like to thank my wife, Anna, and son, Solomon, for their support and patience.

David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.

I would like to thank my wife, Dina, for her support and help.

The authors would also like to thank the administration of the Kazan Federal University IT-Lyceum, and its director, Timerbulat Samerkhanov, for the opportunity to conduct a course that laid the foundation for this book. Our special thanks go to our students for their help and feedback:

Azat Davletshin

Danis Saifullin

Evdokimov Alexandr

Kasatkin Alexander

Kirill Kaidanov

Nikolai Plantonov

About the reviewers

Sri Manikanta is an undergraduate student pursuing his bachelor's degree in computer science and engineering at SICET under JNTUH. He is a founder of the Open Stack Developer Community at his college. He started his journey as a competitive programmer and loves solving problems related to the field of data science. He has worked on a couple of projects on deep learning and machine learning. He has published many articles on data science, machine learning, programming, and cyber security in top publications such as Hacker Noon, freeCodeCamp, Towards Data Science, and DDI. He completed his Python specialization at the University of Michigan, through Coursera.

I would like to express my deepest gratitude to my spiritual and biological parents for everything that they have done for me. A special thanks to my friends and well-wishers for supporting me, and to Packt Publishing for giving me the opportunity to review this book.

Richard Marsden has 25 years of professional software development experience. After starting in the field of geophysical surveying for the oil industry, he has spent the last 15 years running Winwaed Software Technology LLC, an independent software vendor. Winwaed specializes in geospatial tools and applications, including web applications, and operates the Mapping-Tools website for tools and add-ins for geospatial applications such as Caliper Maptitude, Microsoft MapPoint, Android, and Ultra Mileage.

Richard has been a technical reviewer for a number of Packt publications, including Python Geospatial Development and Python Geospatial Analysis Essentials, both by Erik Westra; and Python Geospatial Analysis Cookbook, by Michael Diener.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Learn Python by Building Data Science Applications

About Packt

Why subscribe?

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Code in Action

Conventions used

Get in touch

Reviews

Section 1: Getting Started with Python

Preparing the Workspace

Technical requirements

Installing Python

Downloading materials for running the code

Installing Python packages

Working with VS Code

The VS Code interface

Beginning with Jupyter

Notebooks

The Jupyter interface

Pre-flight check

Summary

Questions

Further reading

First Steps in Coding - Variables and Data Types

Technical requirements

Assigning variables

Naming the variable 

Understanding data types

Floats and integers

Operations with self-assignment

Order of execution

Strings

Formatting

Format method

F-strings

Legacy formatting

Formatting mini-language

Strings as sequences

Booleans

Logical operators

Converting the data types

Exercise

Summary

Questions

Further reading

Functions

Technical requirements

Understanding a function

Interface functions

The input function

The eval function

Variable properties

The help function

The type function

The isinstance function

dir

Math

abs

The round function

Iterables

The len function

The sorted function

The range function

The all and any functions

The max, min, and sum functions

Defining the function

Default values

Var-positional and var-keyword

Docstrings

Type annotations

Refactoring the temperature conversion

Understanding anonymous (lambda) functions

Understanding recursion

Summary

Questions

Further reading

Data Structures

Technical requirements

What are data structures?

Lists

Slicing

Tuples

Immutability

Dictionaries

Sets

More data structures

frozenset

defaultdict

Counter

Queue

deque

namedtuple

Enumerations

Using generators

Useful functions to use with data structures

The sum, max, and min functions

The all and any functions

The zip function

The map, filter, and reduce functions

Comprehensions

Summary

Questions

Further reading 

Loops and Other Compound Statements

Technical requirements

Understanding if, else, and elif statements

Inline if statements

Using if in a comprehension

Running code many times with loops

The for loop

itertools

cycle

chain

product

Enumeration

The while loop

Additional loop functionality – break and continue

Handling exceptions with try/except and try/finally 

Exceptions

try/except

try/except/finally

Understanding the with statements

Summary

Questions

Further reading

First Script – Geocoding with Web APIs

Technical requirements

Geocoding as a service

Learning about web APIs

Working with HTTPS

Working with the Nominatim API

The requests library

Starting to code

Caching with decorators

Reading and writing data

Geocoding the addresses

Moving code to a separate module

Collecting NYC Open Data from the Socrata service

Summary

Questions

Further reading

Scraping Data from the Web with Beautiful Soup 4

Technical requirements

When there is no API

HTML in a nutshell

Scraping with Beautiful Soup 4

CSS and XPath selectors

Developer console

Scraping WWII battles

Step 1 – Scraping the list of battles

Unordered list

Step 2 – Scraping information from the Wiki page

Key information

Additional information

Step 3 – Scraping data as a whole

Quality control

Beyond Beautiful Soup

Summary

Questions

Further reading

Simulation with Classes and Inheritance

Technical requirements

Understanding classes

Special (dunder) methods

__init__

__repr__ and __str__ 

Arithmetical and logical operations

Equality/relationship methods

__len__

__getitem__

__class__

Inheritance

Using super()

Data classes

Using classes in simulation

Writing the base classes

Writing the Island class

Herbivore haven

Harsh islands

Visualization

Summary

Questions

Further reading

Shell, Git, Conda, and More – at Your Command

Technical requirements

Shell

Pipes

Executing Python scripts

Command-line interface

Git

Concept

GitHub

Practical example

gitignore

Conda

Conda for virtual environments

Conda and Jupyter

Make

Cookiecutter

Summary

Questions

Section 2: Hands-On with Data

Python for Data Applications

Technical requirements

Introducing Python for data science

Exploring NumPy

Beginning with pandas

Trying SciPy and scikit-learn

Understanding Jupyter

Summary

Questions

Data Cleaning and Manipulation

Technical requirements

Getting started with pandas

Selection – by columns, indices, or both

Masking

Data types and data conversion

Math

Merging

Working with real data

Initial exploration

Defining the scope of work to be done

Getting to know regular expressions

Parsing locations

Geocoding

Time

Belligerents

Understanding casualties

Multilevel slicing

Quality assurance

Writing the file

Summary

Questions

Further reading

Data Exploration and Visualization

Technical requirements

Exploring the dataset

Descriptive statistics

Data visualization with matplotlib (and its pandas interface)

Aggregating the data to calculate summary statistics 

Resampling

Mapping

Declarative visualization with vega and altair

Drawing maps with Altair

Storing the Altair chart

Big data visualization with datashader

Summary

Questions

Further reading

Training a Machine Learning Model

Technical requirements

Understanding the basics of ML

Exploring unsupervised learning

Moving on to supervised learning

k-nearest neighbors

Linear regression

Decision trees

Summary

Questions

Further reading

Improving Your Model – Pipelines and Experiments

Technical requirements

Understanding cross-validation

Exploring feature engineering

Failed attempts

Optimizing the hyperparameters

Using a random forest model

Tracking your data and metrics with version control

Starting with data

Adding code to the equation

Metrics

Summary

Questions

Further reading

Section 3: Moving to Production

Packaging and Testing with Poetry and PyTest

Technical requirements

Building a package

Bringing your own package

Using a package manager – pip and conda

Creating a package scaffolding

A few ways to build your package

Trying out code with Poetry

Adding actual code

Defining dependencies

Non-code resources

Publishing the package

Development workflow

Testing the code so far

Testing with PyTest

Writing our own tests

Automating the process with CI services

Generating documentation with sphinx

Installing a package in editable mode

Summary

Questions

Further reading

Data Pipelines with Luigi

Technical requirements

Introducing the ETL pipeline

Redesigning your code as a pipeline

Building our first task in Luigi

Connecting the dots

Understanding time-based tasks

Scheduling with cron

Exploring the different output formats

Writing to an S3 bucket

Writing to SQL

Expanding Luigi with custom template classes

Summary

Questions

Further reading

Let's Build a Dashboard

Technical requirements

Building a dashboard – three types of dashboard

Static dashboards

Debugging Altair

Connecting your app to the Luigi pipeline

Understanding dynamic dashboards

First try with panel

Reading data from the database

Creating an interactive dashboard in Jupyter

Summary

Questions

Further reading

Serving Models with a RESTful API

Technical requirements

What is a RESTful API?

Python web frameworks

Building a basic API service

Exploring service with OpenAPI

Finalizing our naive first iteration

Data validation

Sending data in with POST requests

Adding features to our service

Building a web page

Speeding up with asynchronous calls

Deploying and testing your API loads with Locust

Summary

Questions

Further reading

Serverless API Using Chalice

Technical requirements

Understanding serverless

Getting started with Chalice

Setting up a simple model

Externalizing medians

Building a serverless API for an ML model

When we're still out of memory

Building a serverless function as a data pipeline

S3-triggered events

Summary

Questions

Further reading

Best Practices and Python Performance

Technical requirements

Speeding up your Python code

Rewriting the code with NumPy

Specialized data structures and algorithms

Dask

Dask-ML

Numba

Concurrency and parallelism

Different types of concurrency

Two types of problems

Before you start rewriting your code

Using best practices for coding in your project

Code formatting with black

Measuring code quality with Wily

Writing tests with hypothesis

Beyond this book – packages and technologies to look out for

Different Python flavors

Docker containers

Kubernetes

Summary

Questions

Further reading

Assessments

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11

Chapter 12

Chapter 13

Chapter 14

Chapter 15

Chapter 16

Chapter 17

Chapter 18

Chapter 19

Chapter 20

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

There are no separate systems. The world is a continuum. Where to draw a boundary around a system depends on the purpose of the discussion.
– Donella H. Meadows, Thinking in Systems: A Primer

Python has become one of the most popular programming languages in the world, according to multiple polls and metrics. This popularity is, to no small extent, a direct result of the simplicity of the language, its power, and scalability, allowing it to run even large-scale applications, such as Dropbox, YouTube, and many others. It becomes even more valuable with the rise in the adoption of machine learning techniques and algorithms, including state-of-the-art algorithms on the edge of scientific advancements.

Consequently, there are hundreds of books, courses, and online tutorials on different aspects of programming, machine learning, data processing, and more. Many sources highlight the importance of learning-by-doing and building your own projects. Connecting the dots and structuring all this vast knowledge into one big picture is not an easy task. Seeing the big picture, in our opinion, is critical for the completion of any project. Indeed, there are plenty of options to weigh and decisions to make at every step. It is the grand scheme of a project as a whole that helps you make those decisions, focus on what matters, and spend your time wisely.

This book is designed to be an entry point for any newcomer or novice developer, aiming to cover the whole life cycle of a data-driven application. By the end of it, you will be able to write arbitrary Python code, collect and process data, explore it, and build your own packages, dashboards, and APIs. Multiple notes and tips point to alternative solutions or decisions, allowing you to adapt the code to your specific needs.

This book will be a useful resource if any of the following apply to you:

  • You have just started to code.

  • You know the basics but struggle to build something handy.

  • You know your specific domain well, whether it be statistics, machine learning, or development, but lack experience in other parts of building a project.

  • You're an experienced developer with little exposure to Python, trying to learn about Python's package ecosystem.

If you feel you fall into any of those categories, or want to build a project from scratch for other reasons, please join us on this journey.

Who this book is for

This book is aimed at new Python developers with little to no prior programming skills beyond basic computer literacy. The book doesn't require any previous background in data science or statistics either. That being said, it covers a variety of topics, from data processing to visualization to delivery, including dashboards, APIs, Extract, Transform, Load (ETL) pipelines, and standalone packages. Thus, it is also suited to experienced data scientists interested in productizing their work. For a complete novice, this book aims to cover all major parts of the data application life cycle, from Python basics to scripts, data collection and processing, and the delivery of your work in different formats.

What this book covers

This book consists of three main sections. The first one is focused on language fundamentals, the second introduces data analysis in Python, and the final section covers different ways to deliver the results of your work. The last chapter of each section is focused on non-Python tools and topics related to the section subject.

Section 1, Getting Started with Python, introduces the Python programming language and explains how to install Python and all of the packages and tools we'll be using.

Chapter 1, Preparing the Workspace, covers all the tools we'll need throughout the book—what they are, how to install them, and how to use their interfaces. This includes the installation process for Python 3.7, all of the packages we'll require throughout the book, how to install all of them at once in a separate environment, as well as two code development tools we'll use—the Jupyter Notebook and VS Code. Finally, we'll run our first script to ensure everything works fine! By the end of this chapter, you will have everything you need to execute the book's code, ready to go.

Chapter 2, First Steps in Coding – Variables and Data Types, gives an introduction to fundamental programming concepts, such as variables and data types. You'll start writing code in Jupyter, and will even solve a simple problem using the knowledge you've just acquired.

Chapter 3, Functions, introduces yet another concept fundamental to programming—functions. This chapter covers the most important built-in functions and teaches you about writing new ones. Finally, you will revisit the problem from the previous chapter, and write an alternative solution, using functions.

Chapter 4, Data Structures, covers different types of data structures in Python—lists, sets, dictionaries, and many others. You will learn about the properties of each structure, their interfaces, how to operate them, and when to use them.

Chapter 5, Loops and Other Compound Statements, illustrates different compound statements in Python: loops, if/else, try/except, one-liners, and others. These represent core logic in the code and allow non-linear code execution. At the end of this chapter, you'll be able to operate large data structures using short, expressive code.

Chapter 6, First Script – Geocoding with Web APIs, introduces the concept of APIs and shows how to work with HTTP, and with geocoding service APIs in particular, from Python. At the end of this chapter, you'll have fully operational code for geocoding addresses from the dataset. You'll use this code extensively throughout the rest of the book, and it is also highly applicable to many tasks beyond it.

Chapter 7, Scraping Data from the Web with Beautiful Soup 4, illustrates a solution to a similar but more complex task of data extraction from HTML pages—scraping. Step by step, you will build a script that collects pages and extracts data on all the battles in World War II, as described in Wikipedia. At the end of this chapter, you'll know the limitations, challenges, and the main solutions of the scraping packages used for the task, and will be able to write your own scrapers.

Chapter 8, Simulation with Classes and Inheritance, introduces one more critical concept for programming in Python—classes. Using classes, we will build a simple simulation model of an ecological system. We'll compute, collect, and visualize metrics, and use them to analyze the system's behavior.

Chapter 9, Shell, Git, Conda, and More – at Your Command, covers the basic tools essential for the development process, from Shell and Git, to Conda packaging and virtual environments, to the use of makefiles and the Cookiecutter tool. The information we share in this chapter is essential for code development in general, and Python development in particular, and will allow you to collaborate and talk the same language with other developers.

Section 2, Hands-On with Data, focuses on using Python for data processing and analysis, including cleaning, visualization, and training machine learning models.

Chapter 10, Python for Data Applications, works as an introduction to the Python data analysis ecosystem—a distinct group of packages that allow simple work with data, its processing, and analysis. As a result, you will get familiar with the main packages and their purpose, their special syntaxes, and will understand what makes them work substantially faster than normal Python for numeric calculations.

Chapter 11, Data Cleaning and Manipulation, shows how to use the pandas package to process and clean our data, and make it ready for analysis. As an example, we'll clean and prepare the dataset we obtained from Wikipedia in Chapter 7, Scraping Data from the Web with Beautiful Soup 4. Through the process, we'll learn how to use regular expressions, use the geocoding code we wrote in Chapter 6, First Script – Geocoding with Web APIs, and an array of other techniques to clean the data.

Chapter 12, Data Exploration and Visualization, explains how to explore an arbitrary dataset and ask and answer questions about it, using queries, statistics, and visualizations. You'll learn how to use two visualization libraries, Matplotlib and Altair. Both can be used to make static charts quickly or to build more advanced, interactive ones. As our case example, we'll use the dataset we cleaned in the previous chapter.

Chapter 13, Training a Machine Learning Model, presents the core idea of machine learning and shows how to apply unsupervised learning with the k-means clustering algorithm, and supervised learning with KNN, linear regression, and decision trees, to a given dataset.

Chapter 14, Improving Your Model – Pipelines and Experiments, highlights ways to improve your model through feature engineering, cross-validation, and the application of a more sophisticated algorithm. In addition, you will learn how to track your experiments and keep both code and data under version control, using dvc (Data Version Control).

Section 3, Moving to Production, is focused on delivering the results of your work with Python, in different formats.

Chapter 15, Packaging and Testing with Poetry and PyTest, explains the process of packaging. Using our Wikipedia scraper as an example, we'll create a package using the poetry library, set dependencies and a development environment, and make the package accessible for installation using pip from GitHub. To ensure the package's functionality, we will add a few unit tests using the pytest testing library.

Chapter 16, Data Pipelines with Luigi, introduces ETL pipelines and explains how to build and schedule one using the luigi framework. We will build a set of interdependent tasks for data collection and processing and set them to work on a scheduled basis, writing data to local files, S3 buckets, or a database.

Chapter 17, Let's Build a Dashboard, covers a few ways to build and share a dashboard online. We'll start by writing a static dashboard based on the charts we made with the Altair library in Chapter 12, Data Exploration and Visualization. As an alternative, we will also deploy a dynamic dashboard that pulls data from a database upon request, using the panel library.

Chapter 18, Serving Models with a RESTful API, brings us back to the API theme, but this time we'll build an API of our own, using the FastAPI framework and the pydantic package for validation. Using a machine learning model, we'll build a fully operational API server, with OpenAPI documentation and strict request validation. As FastAPI supports asynchronous execution, we'll also discuss what that means and when to use it.

Chapter 19, Serverless API Using Chalice, goes beyond serving an API with a personal server and shows how to achieve similar results with a serverless application, using AWS Lambda and the chalice package. This includes building an API endpoint, a scheduled pipeline, and serving a machine learning model. Along the way, we discuss the pros and cons of running serverless, its limitations, and ways to mitigate them.

Chapter 20, Best Practices and Python Performance, comprises three distinct parts. The first part showcases different ways to make your code faster: by using NumPy's vectorized computations or a specialized data structure (in our case, a k-d tree), by extending computations to multiple cores or even machines with Dask, or by leveraging the performance (and, potentially, the GIL release) of just-in-time compilation with Numba. We also discuss different ways to achieve concurrency in Python: using threads, asynchronous tasks, or multiple processes.

The second part of the chapter focuses on improving the speed and quality of development. In particular, we'll cover the use of linters and formatters—the black package in particular; code maintainability measurements with wily; and advanced, data-driven code testing with the hypothesis package.

Finally, the third part of this chapter goes over a few technologies that lie beyond Python but may still be useful to you. This list includes different Python flavors, such as Jython, Brython, and Iodide; Docker technology; and Kubernetes.

To get the most out of this book

This book is designed for complete beginners and people who have just started to learn to code. It does not require any specific knowledge besides basic computer literacy.  

The execution of the code examples provided in this book requires an installation of Python 3.7.3 or later on macOS, Linux, or Microsoft Windows. The code presented throughout the book makes use of many Python libraries. In each chapter, a list of required libraries is given at the beginning. A full list of libraries is stored in the GitHub repository, in the environment.yaml file. The same file can be used to install Python and all of the required libraries in bulk—full instructions are given in Chapter 1, Preparing the Workspace. 
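As a quick sanity check before following the installation steps, the version requirement above can be tested from Python itself. The snippet below is only a minimal sketch: the names `REQUIRED` and `meets_requirement` are our own illustrative choices and are not part of the book's code bundle.

```python
import sys

# Minimum version stated above (3.7.3); the constant name is illustrative.
REQUIRED = (3, 7, 3)


def meets_requirement(version=None, required=REQUIRED):
    """Return True if `version` (default: the running interpreter) is at least `required`."""
    if version is None:
        version = sys.version_info
    # Compare only the major, minor, and micro components.
    return tuple(version[:3]) >= tuple(required)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if meets_requirement():
        print(f"Python {current} is recent enough for the examples.")
    else:
        print(f"Python {current} is older than 3.7.3; please upgrade.")
```

Comparing tuples rather than version strings avoids the pitfall where "3.10" sorts before "3.7" alphabetically.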

The code for this book was developed in and extensively uses two development environments—VS Code editor with its Python bundle, and Jupyter. We recommend using both for better alignment with the book's narrative.

The code for Chapter 6, First Script – Geocoding with Web APIs, Chapter 7, Scraping Data from the Web with Beautiful Soup 4, Chapter 11, Data Cleaning and Manipulation, and Chapter 16, Data Pipelines with Luigi, requires an internet connection.

The first chapter will provide you with step-by-step instructions and some useful tips for setting up your Python environment, the core libraries, and all the necessary tools.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789535365_ColorImages.pdf.

Code in Action

Visit the following link to check out videos of the code being run: http://bit.ly/2MIb3Pn

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Getting Started with Python

This section focuses on becoming familiar with general-purpose Python, making use of existing libraries, writing our first scripts, learning the basics of Git, and using the IDE. In this section, we will also lay the foundation for our projects, building pipelines to process (project 1), collect (project 2), and simulate (project 3) data.

This section comprises the following chapters:

Chapter 1, Preparing the Workspace

Chapter 2, First Steps in Coding – Variables and Data Types

Chapter 3, Functions

Chapter 4, Data Structures

Chapter 5, Loops and Other Compound Statements

Chapter 6, First Script – Geocoding with Web APIs

Chapter 7, Scraping Data from the Web with Beautiful Soup 4

Chapter 8, Classes and Inheritance

Chapter 9, Shell, Git, Conda, and More – at Your Command

Preparing the Workspace

Welcome! We're very excited to start learning and building things with you! However, we need to get ourselves ready first.

In this chapter, we'll learn how to download and install everything you'll need throughout the book, including Python itself, all the Python packages that we'll need, and two development tools we will be using extensively: Jupyter and Visual Studio Code (VS Code). After that, we'll go through a brief overview of Jupyter and VS Code interfaces. Finally, you will run your very first line of Python, so we need to ensure that everything is ready before we dive in.

In this chapter, we'll cover the following:

The minimum computer configuration required

How to install the Anaconda distribution

How to download the code for this book

Setting up and getting familiar with VS Code and Jupyter

Running your first line of code to ensure everything runs smoothly

By the end of this chapter, you will have learned about the hardware requirements for Python and this book, and what you can do if you don't have a sufficiently powerful computer. You will also learn how to install Python 3.7.2 and all required packages and tools using the open source Anaconda distribution. 

Technical requirements

Python can be very humble and does not require an advanced computer. In fact, you can run Python on a $10 Raspberry Pi or an Arduino board! The code and data we use in this book do not require any special computational power: any laptop or computer made after 2008 with at least 2 GB of RAM, 20 GB of disk space, and an internet connection should suffice. Your operating system (OS) shouldn't be a problem either, as Python and all the tools we will use are cross-platform and work on Windows, macOS, and Linux.

Throughout the book, we'll use two main tools to write the code: Jupyter and VS Code. Both of them are free and aren't demanding.

All the code for the book is publicly available and free to access at https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications.

Installing Python

There are multiple Python distributions, starting with the original, vanilla Python, which is accessible at https://www.python.org/. Data analysis, however, adds unique requirements for packaging (https://www.youtube.com/watch?v=QjXJLVINsSA&feature=youtu.be&t=3555). In this book, we use Anaconda, which is an open source and free Python distribution, designed for data science and machine learning. Anaconda's main features include a smooth installation of data science packages (many of which run C and Fortran languages under the hood) and conda, which is a great package and environment manager (we will talk more about environments and conda later in Chapter 9, Shell, Git, Conda, and More – at Your Command). Conveniently, the Anaconda distribution installs all the packages (https://docs.anaconda.com/anaconda/packages/pkg-docs/) we need in this book and many more!

In order to install Anaconda, follow these steps:

First, go to the Anaconda distribution web page at https://www.anaconda.com/distribution/.

Select the Python 3.7 graphical installer for your platform and download it (at the time of writing, there is no graphical installer for Linux, so you'll have to use the one for the command line). The following screenshot shows what the interface looks like; we've marked the link we're interested in with dotted lines:

Run the installation. Keep all settings as default. When you're asked if you want to install PyCharm, select no (unless you personally want to, of course, but we won't use PyCharm in this book):

Voila! Now we have Python up and running! Next, let's download all the materials for this book.

We use Anaconda build 3-2018.12, which is the most recent version at the time of writing this book. Until a new version is released, this build will be accessible at https://repo.anaconda.com/archive/.

Downloading materials for running the code

All code in this book is also available as a separate archive of files—either Python scripts or Jupyter notebooks. You can download the full archive and follow along with the book using the relevant code from GitHub (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications). Everything is stored on GitHub, which is an online service for code storage with version control. We will discuss both Git and GitHub in Chapter 9, Shell, Git, Conda, and More – at Your Command, but in this case, you won't need version control, so it is easier to download everything as an archive. Just use the Clone or download button on the right side (1), and select Download ZIP (2):

Once the download is complete, unzip the file and move it to a convenient location. This folder will be our main workspace throughout the book.

Installing Python packages

Many of the chapters in this book teach you how to make use of specific packages. Most of them are included in the standard Anaconda distribution, so if you installed Python using the Anaconda distribution, then you will have them already. Some packages might not be installed though, so we'll have to install them separately as per our requirements for every chapter. This is totally fine, and we'll specify which packages will be used at the beginning of each chapter.

In order to install a specific package, you have two options:

Installing via Anaconda by running either of the following commands. Specifying a channel is required if a package is rare and not present on the default channels of Anaconda and conda-forge:

> conda install <mypackage>

> conda install -c <mychannel> <mypackage>

Some packages are not present in conda at all. You can search for packages through the channels at https://anaconda.org/.

Most packages can be installed using pip:

> pip install <mypackage>

Generally speaking, we recommend using conda over pip for installation.

Alternatively, there is a single specification in the root of the repository that you can use to install everything at once. To do so, you need to go in your Terminal, and then to the repository's root (we will explain how to do that in Chapter 9, Shell, Git, Conda, and More – at Your Command, but VS Code's Terminal will open in the root of the given folder automatically). Once there, run the following command:

conda env update --name root -f environment.yml

Then, follow the instructions. Here, conda uses the environment.yml specification file as a list of packages to install.
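
Once the installation finishes, a quick check such as the following can confirm that the packages are importable. This is only a sketch: the list of names is an illustrative assumption rather than the actual contents of environment.yml, and the helper name missing_packages is hypothetical.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Illustrative list only; replace it with the packages a chapter asks for.
wanted = ["numpy", "pandas", "sklearn"]
print("Missing packages:", missing_packages(wanted))
```

An empty list means every package was found. Anything that is reported as missing can then be installed with conda or pip, as described previously.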

Now, let's install our main development tools: VS Code and Jupyter.

Working with VS Code

VS Code is invaluable for Python development and experimentation. VS Code—not to be confused with Visual Studio, which is a commercial product—is a sophisticated, completely free, and open source text editor created by Microsoft. It is language-agnostic and will work perfectly with Python, JavaScript, Java, or any other language. VS Code has hundreds of built-in features and thousands of great plugins to expand its capabilities.

In order to install VS Code, head to its main web page, https://code.visualstudio.com/, and download the package for your OS. The installation is pretty straightforward; there is no need to change any of the default settings. Assuming you installed VS Code as part of the previous steps, you now need to open the VS Code application. Next, switch to the plugin marketplace menu (as shown in the following screenshot), type Python, and install the plugin. Python binding for VS Code provides plenty of Python-specific features and will prove very useful for us throughout the book.

In the following screenshot, 1 represents the plugin marketplace. Once switched, type Python in the search form (2), select the plugin (3), and hit install (Python was already installed in this screenshot, hence it offers to uninstall it instead):

Once that's done, let's briefly review the interface of the tool.

The VS Code interface

Let's go over the VS Code interface. In the following screenshot, you can see five distinctive sections:

Section 1 of VS Code has six icons (more will appear after installing certain plugins). The last one at the bottom of the toolbar, which is a gear symbol, represents the settings. All the others represent different modes, from top to bottom:

Explorer mode, which allows us to look for the files that are open in the given workspace

Search mode, which allows us to look for a particular text element throughout the whole workspace

A built-in Git client (more on that in Chapter 9, Shell, Git, Conda, and More – at Your Command)

Debugger mode, which halts and inspects code in the middle of the execution in order to understand what's happening under the hood

VS Code's plugin marketplace

Every mode changes the content of section 2. This is an area dedicated to working with the workspace as a whole, which includes adding new files, removing existing ones, or traversing through variables in debugging sessions.

Section 3 is the main one. Here, we actually write and read the code. You can have multiple tabs or even split this window into many: vertically, horizontally, or both. Most of the time, each tab represents one file in the workspace.

If you don't have section 4 open, then go to View | Terminal or use the Ctrl + ` shortcut. You can also drag this section out from the upper edge of section 5 using your mouse, if you prefer.

Section 4 has four subsections. In PROBLEMS, VS Code will point you to some potential issues in the code. The OUTPUT and DEBUG CONSOLE tabs' roles are self-explanatory, and we won't use them much. The most important tab here is Terminal: it duplicates the Terminal built into your OS (hence, it does not directly relate to VS Code itself). Terminals allow us to run system-wide commands, create folders, write to files, execute Python scripts, and run any software, which is essentially everything you can do via your OS graphical interface, but done just using code. We will cover the Terminal in more depth in Chapter 9, Shell, Git, Conda, and More – at Your Command. Conveniently, VS Code's Terminals open in the root directory of the workspace, which is a feature we will constantly utilize throughout the book.

Lastly, section 5 is an information bar that shows the current properties of the workspace, including the interpreter's name, Git repository and branch names (more on that in Chapter 9, Shell, Git, Conda, and More – at Your Command), and cursor position. Most of those elements are interactive!

One more feature of VS Code that is easy for newcomers to miss, yet extremely powerful, is its command palette, as shown in the following screenshot:

You can open the command palette using the Ctrl (command on macOS) + Shift + P shortcut. The command palette allows you to type in, select, and execute practically any feature of the application, from switching the color theme to searching for a word, to almost anything else. This feature allows programmers to avoid using a mouse or trackpad, and once mastered, it drastically increases productivity.

For example, let's create a new file (Ctrl/command + N) and type Hello Python!. Now, in order to switch that text to uppercase, all we need is to do the following:

Select all of the text by using Ctrl/command + A.

Open the command palette (Ctrl/command + Shift + P) and type Upper. Select the Transform to Uppercase command (note that the command palette also shows shortcuts).

Spend some time learning VS Code's features! One great place for that is the Interactive Playground: you can jump straight into it by typing the name into the command palette.

Another great feature of VS Code is that it can use the key bindings that you use in other editors, including Vim, Sublime, and Atom. If you're used to their bindings, then switch to them, as they will save you a lot of time and frustration.

Beginning with Jupyter

Another development environment we'll use is Jupyter. If you have installed Anaconda, then Jupyter is already on your machine, as it is one of the tools that come with Anaconda. To start using Jupyter, we need to run it from the Terminal (you might need to open a new Terminal to update the paths). The following command runs the newer version of the tool's frontend, which is the one we'll use:

$ jupyter lab

Alternatively, it also supports an older version of the frontend via Jupyter Notebook. The two have their differences, but we'll stick with the lab.

The app's behavior depends on the folder from which it was started; it is more convenient to run it directly from the project's root folder. That's why it is so handy that VS Code's Terminal opens in a workspace folder by itself, as we don't need to navigate there every time. But why do we need another developer tool, anyway? That's what the next section is all about.

Notebooks

As we mentioned earlier, Jupyter is designed with a different approach to programming than VS Code. Its central concept is so-called notebooks: files that allow the mixing of actual code, text (including markdown and LaTeX equations), as well as plots, images, videos, and interactive visualizations. In notebooks, you execute code interactively, one cell after another. This way, you can experiment easily—write some code, run it, see the outcomes, and then tweak it again.

The outcomes are shown along with the code so that you can open and read the notebook, even without executing it. Because of that, notebooks are especially useful in scientific/analytical contexts, as on the one hand, they allow us to describe what we're doing with text and illustrations, and on the other hand, they keep the actual code tied to the narrative so that anyone can inspect and confirm that your analysis is valid. One great example of that is LIGO notebooks, which represent the actual code that was used to discover gravitational waves in the universe (this research won the Nobel Prize in 2017).
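
Outputs can be read without rerunning anything because a notebook file stores them next to the code as plain JSON. The following sketch illustrates the idea. It assumes the version 4 notebook format; the notebook dictionary is written by hand for illustration rather than produced by Jupyter, and the helper name saved_outputs is hypothetical.

```python
import json

# A minimal notebook in the version 4 format: one Markdown cell and one
# code cell that already carries its saved output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('Hello world')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["Hello world\n"]}
            ],
        },
    ],
}


def saved_outputs(nb):
    """Collect the text of the stream outputs stored in a notebook dictionary."""
    texts = []
    for cell in nb["cells"]:
        # Markdown cells have no "outputs" key, so fall back to an empty list.
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                texts.append("".join(output["text"]))
    return texts


# Round-trip through JSON text, as if the notebook were saved and reopened.
reloaded = json.loads(json.dumps(notebook))
print(saved_outputs(reloaded))
```

No code cell is executed here: the text is simply read back from the stored structure, which is what a viewer does when it displays a notebook.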

Notebooks are also great for teaching (as in the case of this book), as students can interact with each and every part of the code by themselves. However, while Jupyter is good for exploration, it feels less convenient when your code base starts to grow and mature. Because of this, we will switch back and forth between Jupyter and VS Code throughout the course of this book, picking the right tool for each particular job.

Let's now look at Jupyter's interface.

The Jupyter interface

Let's get familiar with Jupyter's interface. This software works differently from VS Code: Jupyter runs as a web server that is accessible through a browser. To start it, just type jupyter lab in VS Code's Terminal window and hit Enter. This will start the server. Depending on your OS, either a link will be printed in the Terminal (starting with http://localhost...), or your default web browser will just open the page automatically. You can stop the Jupyter server by hitting Ctrl + C within the Terminal and typing yes, if prompted, or by closing the window.

Jupyter's layout, as shown in the following screenshot, is somewhat similar to that of VS Code:

Here, again, the tabs in section 1 show all the modes available for section 2, including a file browser, a list of running notebooks, a list of available commands, and tabs. The second section represents one of the modes previously described. Finally, the main section, section 3, shows all open tabs, similar to section 3 in VS Code. The default tab is Launcher. From here, we can create new notebooks, text files (such as classic code or data files), Terminals, and consoles.

Note that the launcher explicitly states Python 3 for both notebooks and consoles. This is because Jupyter is also language-agnostic. In fact, the name Jupyter comes from the Julia-Python-R triad of analytical languages, but the application supports many others, including C, Java, and Rust. In this book, we'll only use Python.

If everything went smoothly with Jupyter, then we're ready to go! But before we dive into coding, let's do one last pre-flight check.

Pre-flight check

Before we proceed to the content of this book, let's ensure our code can actually be executed by running the simplest possible code in Jupyter. To do this, let's create a test notebook and run some code to ensure everything works as intended. Click on the Python 3 square in the Notebook section. A new tab should open, called Untitled.ipynb. 

First, note the highlighted blue line, which marks the selected cell in the notebook. Each cell represents a separate snippet of code, which is executed as a whole in one step. Let's write our very first line of code in this cell:

print('Hello world')

Now, hit Shift + Enter. This shortcut executes the selected cells in Python and outputs the result on the next line. It also automatically creates a new input cell if there are none, as shown in the following screenshot. The number on the left gives a hint as to the order in which cells are executed, so the first cell to be executed will be marked with 1. The asterisk means the cell is under execution and computation is underway:

If everything worked properly, and you see Hello world in the output, then congratulations—you are ready for the following chapters!
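
As an optional extra check, a second cell can report which Python version the notebook is running. This is a minimal sketch that relies only on the standard sys module; the 3.7 threshold is taken from the version this book is written for.

```python
import sys

# sys.version_info behaves like a tuple of (major, minor, micro, ...).
major, minor = sys.version_info[:2]
print(f"Running Python {major}.{minor}")

if (major, minor) < (3, 7):
    print("This is older than the Python 3.7 used in this book.")
else:
    print("This interpreter is recent enough for the examples.")
```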

Cells can also include markdown, which is useful for including explanations, images, or equations. For that, just switch from Code to Markdown by using the dropdown at the top.

Summary

In this chapter, we prepared our working environment for the journey ahead. In particular, we installed the Anaconda Scientific Python Distribution with Python 3.7.2, which includes all the packages we'll need throughout the course of this book. We also installed and learned about the basics of VS Code, which is a sophisticated and interactive development environment that will be our primary tool for writing arbitrary code, and Jupyter, which we use for experimentation and analysis. Finally, we discussed and even ran some code already! We did this in Jupyter, which is a coding environment that is perfect for prototyping, experimentation, analysis, and educational purposes.

In the next chapter, we'll begin our introduction to Python, learning about variables, variable assignment, and Python's basic data types.

Questions

What version of Python do we use?

Will it work on a Windows PC?

Do I need to install any additional packages?

What is a Jupyter Notebook?

When and why should I use Jupyter Notebooks?

When should I switch to VS Code?

Can I run the code from this book on my smartphone/tablet?

Further reading

Python for Beginners: Learn Python Programming (Python 3) [Video] (https://www.packtpub.com/application-development/python-beginners-learn-python-programming-python-3-video)

Data Science Projects with Python (https://www.packtpub.com/big-data-and-business-intelligence/data-science-projects-python)

The Scientific Paper Is Obsolete (https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/)

First Steps in Coding - Variables and Data Types

Having set up all the tools, you're now ready to dive into development. Fire up Jupyter—in this chapter, we will get our hands dirty with code! We'll start with the concept of variables, and learn how to assign and use them in Python. We will discuss best practices on naming them, covering both strict requirements and general conventions. Next, we will cover Python's basic data types and the operators they support, including integers, decimal numbers (floats), strings, and Boolean values. Each data type has a corresponding behavior, typing rules, built-in methods, and works with certain operators.

At the end of this chapter, we will put everything we learned into practice by writing our own vacation budgeting calculator.

The topics covered in this chapter are as follows:

Assigning variables 

Naming the variables

Understanding data types 

Converting the data types

Exercise

Technical requirements

You can follow the code for this chapter in the Chapter02/first_code.ipynb notebook. No additional packages are required, just Python.

You can find the code via the following link, which is in the GitHub repository in the Chapter02 folder (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications).

Important note: In this and many other chapters, we'll include both snippets of code and interactive shells, similar to what your code will look like in Jupyter. In order to distinguish the code we ran from its output, in every code block that has an interaction, the code being run will start after a triple greater-than sign (>>>), similar to how it is shown in Python consoles. By the way, you can still copy and paste the code: Jupyter will simply ignore this symbol and run the code correctly.

Naming the variable 

Naming variables may seem to be a minor topic, but trust us, adopting a good habit of proper naming will save you a lot of time and nerves down the road. Do your best to name variables wisely and consistently. Ambiguous names will make code extremely hard to read, understand, and debug, for no good reason.

Now, technically there are just three requirements for variable names:

You cannot use reserved words: False, class, finally, is, return, None, continue, for, lambda, try, True, def, from, nonlocal, while, and, del, global, not, with, as, elif, if, or, yield, assert, else, import, pass, break, except, in, or raise. You also cannot use operators or special symbols (+, -, /, *, %, <, >, @, &) or brackets and parentheses as part of variable names.

Variable names can't start with digits.

Variable names can't contain whitespace. Use the underscore symbol instead.
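
These requirements can be checked programmatically. The following sketch relies on the standard keyword module and the str.isidentifier method; the helper name is_valid_name and the candidate names are hypothetical and only serve as an illustration.

```python
import keyword


def is_valid_name(name):
    """Return True if `name` can be used as a variable name."""
    # str.isidentifier() rejects leading digits, whitespace, and operator symbols;
    # keyword.iskeyword() rejects reserved words such as `class` or `for`.
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["counter", "class", "2fast", "my var", "my_var"]:
    print(candidate, is_valid_name(candidate))
```

Here, counter and my_var pass the check, while class (a reserved word), 2fast (a leading digit), and my var (whitespace) do not.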

On top of that, there are also some general naming conventions. You don't have to, but it is strongly recommended to follow them:

Name your variables meaningfully and consistently, so that readers will understand what they are meant to be. Some examples are counter, car, and today.

Apply

snake_case