In today's competitive job market, a project portfolio often outshines a traditional resume. Python Real-World Projects empowers you to get to grips with crucial Python concepts while building complete modules and applications. With two dozen meticulously designed projects to explore, this book will help you showcase your Python mastery and refine your skills. Tailored for beginners with a foundational understanding of class definitions, module creation, and Python's inherent data structures, this book is your gateway to programming excellence. You’ll learn how to harness the potential of the standard library and key external projects like JupyterLab, Pydantic, pytest, and requests. You’ll also gain experience with enterprise-oriented methodologies, including unit and acceptance testing, and an agile development approach. Additionally, you’ll dive into the software development lifecycle, starting with a minimum viable product and seamlessly expanding it to add innovative features. By the end of this book, you’ll be armed with a myriad of practical Python projects and all set to accelerate your career as a Python programmer.
Preface
Who this book is for
What this book covers
A note on skills required
To get the most out of this book
Complete the extras
Download the example code files
Conventions used
Get in touch
Share your thoughts
Download a free PDF copy of this book
Chapter 1: Project Zero: A Template for Other Projects
1.1 On quality
1.1.1 More Reading on Quality
1.2 Suggested project sprints
1.2.1 Inception
1.2.2 Elaboration, part 1: define done
1.2.3 Elaboration, part 2: define components and tests
1.2.4 Construction
1.2.5 Transition
1.3 List of deliverables
1.4 Development tool installation
1.5 Project 0 – Hello World with test cases
1.5.1 Description
1.5.2 Approach
1.5.3 Deliverables
1.5.4 Definition of done
1.6 Summary
1.7 Extras
1.7.1 Static analysis - mypy, flake8
1.7.2 CLI features
1.7.3 Logging
1.7.4 Cookiecutter
Chapter 2: Overview of the Projects
2.1 General data acquisition
2.2 Acquisition via Extract
2.3 Inspection
2.4 Clean, validate, standardize, and persist
2.5 Summarize and analyze
2.6 Statistical modeling
2.7 Data contracts
2.8 Summary
Chapter 3: Project 1.1: Data Acquisition Base Application
3.1 Description
3.1.1 User experience
3.1.2 About the source data
3.1.3 About the output data
3.2 Architectural approach
3.2.1 Class design
3.2.2 Design principles
3.2.3 Functional design
3.3 Deliverables
3.3.1 Acceptance tests
3.3.2 Additional acceptance scenarios
3.3.3 Unit tests
3.4 Summary
3.5 Extras
3.5.1 Logging enhancements
3.5.2 Configuration extensions
3.5.3 Data subsets
3.5.4 Another example data source
Chapter 4: Data Acquisition Features: Web APIs and Scraping
4.1 Project 1.2: Acquire data from a web service
4.1.1 Description
4.1.2 Approach
4.1.3 Deliverables
4.2 Project 1.3: Scrape data from a web page
4.2.1 Description
4.2.2 About the source data
4.2.3 Approach
4.2.4 Deliverables
4.3 Summary
4.4 Extras
4.4.1 Locate more JSON-format data
4.4.2 Other data sets to extract
4.4.3 Handling schema variations
4.4.4 CLI enhancements
4.4.5 Logging
Chapter 5: Data Acquisition Features: SQL Database
5.1 Project 1.4: A local SQL database
5.1.1 Description
5.1.2 Approach
5.1.3 Deliverables
5.2 Project 1.5: Acquire data from a SQL extract
5.2.1 Description
5.2.2 The Object-Relational Mapping (ORM) problem
5.2.3 About the source data
5.2.4 Approach
5.2.5 Deliverables
5.3 Summary
5.4 Extras
5.4.1 Consider using another database
5.4.2 Consider using a NoSQL database
5.4.3 Consider using SQLAlchemy to define an ORM layer
Chapter 6: Project 2.1: Data Inspection Notebook
6.1 Description
6.1.1 About the source data
6.2 Approach
6.2.1 Notebook test cases for the functions
6.2.2 Common code in a separate module
6.3 Deliverables
6.3.1 Notebook .ipynb file
6.3.2 Executing a notebook’s test suite
6.4 Summary
6.5 Extras
6.5.1 Use pandas to examine data
Chapter 7: Data Inspection Features
7.1 Project 2.2: Validating cardinal domains — measures, counts, and durations
7.1.1 Description
7.1.2 Approach
7.1.3 Deliverables
7.2 Project 2.3: Validating text and codes — nominal data and ordinal numbers
7.2.1 Description
7.2.2 Approach
7.2.3 Deliverables
7.3 Project 2.4: Finding reference domains
7.3.1 Description
7.3.2 Approach
7.3.3 Deliverables
7.4 Summary
7.5 Extras
7.5.1 Markdown cells with dates and data source information
7.5.2 Presentation materials
7.5.3 JupyterBook or Quarto for even more sophisticated output
Chapter 8: Project 2.5: Schema and Metadata
8.1 Description
8.2 Approach
8.2.1 Define Pydantic classes and emit the JSON Schema
8.2.2 Define expected data domains in JSON Schema notation
8.2.3 Use JSON Schema to validate intermediate files
8.3 Deliverables
8.3.1 Schema acceptance tests
8.3.2 Extended acceptance testing
8.4 Summary
8.5 Extras
8.5.1 Revise all previous chapter models to use Pydantic
8.5.2 Use the ORM layer
Chapter 9: Project 3.1: Data Cleaning Base Application
9.1 Description
9.1.1 User experience
9.1.2 Source data
9.1.3 Result data
9.1.4 Conversions and processing
9.1.5 Error reports
9.2 Approach
9.2.1 Model module refactoring
9.2.2 Pydantic V2 validation
9.2.3 Validation function design
9.2.4 Incremental design
9.2.5 CLI application
9.3 Deliverables
9.3.1 Acceptance tests
9.3.2 Unit tests for the model features
9.3.3 Application to clean data and create an NDJSON interim file
9.4 Summary
9.5 Extras
9.5.1 Create an output file with rejected samples
Chapter 10: Data Cleaning Features
10.1 Project 3.2: Validate and convert source fields
10.1.1 Description
10.1.2 Approach
10.1.3 Deliverables
10.2 Project 3.3: Validate text fields (and numeric coded fields)
10.2.1 Description
10.2.2 Approach
10.2.3 Deliverables
10.3 Project 3.4: Validate references among separate data sources
10.3.1 Description
10.3.2 Approach
10.3.3 Deliverables
10.4 Project 3.5: Standardize data to common codes and ranges
10.4.1 Description
10.4.2 Approach
10.4.3 Deliverables
10.5 Project 3.6: Integration to create an acquisition pipeline
10.5.1 Description
10.5.2 Approach
10.5.3 Deliverables
10.6 Summary
10.7 Extras
10.7.1 Hypothesis testing
10.7.2 Rejecting bad data via filtering (instead of logging)
10.7.3 Disjoint subentities
10.7.4 Create a fan-out cleaning pipeline
Chapter 11: Project 3.7: Interim Data Persistence
11.1 Description
11.2 Overall approach
11.2.1 Designing idempotent operations
11.3 Deliverables
11.3.1 Unit test
11.3.2 Acceptance test
11.3.3 Cleaned up re-runnable application design
11.4 Summary
11.5 Extras
11.5.1 Using a SQL database
11.5.2 Persistence with NoSQL databases
Chapter 12: Project 3.8: Integrated Data Acquisition Web Service
12.1 Description
12.1.1 The data series resources
12.1.2 Creating data for download
12.2 Overall approach
12.2.1 OpenAPI 3 specification
12.2.2 RESTful API to be queried from a notebook
12.2.3 A POST request starts processing
12.2.4 The GET request for processing status
12.2.5 The GET request for the results
12.2.6 Security considerations
12.3 Deliverables
12.3.1 Acceptance test cases
12.3.2 RESTful API app
12.3.3 Unit test cases
12.4 Summary
12.5 Extras
12.5.1 Add filtering criteria to the POST request
12.5.2 Split the OpenAPI specification into two parts to use $REF for the output schema
12.5.3 Use Celery instead of concurrent.futures
12.5.4 Call external processing directly instead of running a subprocess
Chapter 13: Project 4.1: Visual Analysis Techniques
13.1 Description
13.2 Overall approach
13.2.1 General notebook organization
13.2.2 Python modules for summarizing
13.2.3 PyPlot graphics
13.2.4 Iteration and evolution
13.3 Deliverables
13.3.1 Unit test
13.3.2 Acceptance test
13.4 Summary
13.5 Extras
13.5.1 Use Seaborn for plotting
13.5.2 Adjust color palettes to emphasize key points about the data
Chapter 14: Project 4.2: Creating Reports
14.1 Description
14.1.1 Slide decks and presentations
14.1.2 Reports
14.2 Overall approach
14.2.1 Preparing slides
14.2.2 Preparing a report
14.2.3 Creating technical diagrams
14.3 Deliverables
14.4 Summary
14.5 Extras
14.5.1 Written reports with UML diagrams
Chapter 15: Project 5.1: Modeling Base Application
15.1 Description
15.2 Approach
15.2.1 Designing a summary app
15.2.2 Describing the distribution
15.2.3 Use cleaned data model
15.2.4 Rethink the data inspection functions
15.2.5 Create new results model
15.3 Deliverables
15.3.1 Acceptance testing
15.3.2 Unit testing
15.3.3 Application secondary feature
15.4 Summary
15.5 Extras
15.5.1 Measures of shape
15.5.2 Creating PDF reports
15.5.3 Serving the HTML report from the data API
Chapter 16: Project 5.2: Simple Multivariate Statistics
16.1 Description
16.1.1 Correlation coefficient
16.1.2 Linear regression
16.1.3 Diagrams
16.2 Approach
16.2.1 Statistical computations
16.2.2 Analysis diagrams
16.2.3 Including diagrams in the final document
16.3 Deliverables
16.3.1 Acceptance tests
16.3.2 Unit tests
16.4 Summary
16.5 Extras
16.5.1 Use pandas to compute basic statistics
16.5.2 Use the dask version of pandas
16.5.3 Use numpy for statistics
16.5.4 Use scikit-learn for modeling
16.5.5 Compute the correlation and regression using functional programming
Chapter 17: Next Steps
17.1 Overall data wrangling
17.2 The concept of “decision support”
17.3 Concept of metadata and provenance
17.4 Next steps toward machine learning
Why subscribe?
Other Books You Might Enjoy
Packt is searching for authors like you
Share your thoughts
Download a free PDF copy of this book
Index
Craft your Python portfolio with deployable applications
Steven F. Lott
BIRMINGHAM—MUMBAI
"Python" and the Python logo are trademarks of the Python Software Foundation.
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Associate Group Product Manager: Kunal Sawant
Publishing Product Manager: Akash Sharma
Senior Editor: Kinnari Chohan
Senior Content Development Editor: Rosal Colaco
Technical Editor: Maran Fernandes
Copy Editor: Safis Editing
Associate Project Manager: Deeksha Thakkar
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Shyam Sundar Korumilli
Business Development Executive: Debadrita Chatterjee
Developer Relations Marketing Executive: Sonia Chauhan
First published: September 2023
Production reference: 1 310823
Published by Packt Publishing Ltd. Grosvenor House 11 St Paul’s Square Birmingham B3 1RB
ISBN 978-1-80324-676-5
www.packtpub.com
Steven F. Lott has been programming since computers were large, expensive, and rare. Working for decades in high tech has given him exposure to a lot of ideas and techniques; some are bad, but most are useful and helpful to others.
Steven has been working with Python since the ’90s, building a variety of tools and applications. He’s written a number of titles for Packt Publishing, including Mastering Object-Oriented Python, Modern Python Cookbook, and Functional Python Programming.
He’s a tech nomad and lives on a boat that’s usually located on the east coast of the US. He tries to live by the words, “Don’t come home until you have a story.”
Chris Griffith is a Lead Software Engineer with twelve years of experience with Python. His open-source Python projects have been downloaded over a million times, and he is the primary writer for the Code Calamity blog. Chris enjoys studio photography in his free time as well as digitizing vintage magazines and 8mm films.
Join our Python Discord workspace to discuss and learn more about the book:
https://packt.link/dHrHU
How do we improve our knowledge of Python? Perhaps a more important question is “How do we show others how well we can write software in Python?”
Both of these questions have the same answer. We build our skills and demonstrate those skills by completing projects. More specifically, we need to complete projects that meet some widely-accepted standards for professional development. To be seen as professionals, we need to step beyond apprentice-level exercises, and demonstrate our ability to work without the hand-holding of a master crafter.
I think of it as sailing a boat alone for the first time, without a more experienced skipper or teacher on board. I think of it as completing a pair of hand-knitted socks that can be worn until the socks have worn out so completely, they can no longer be repaired.
Completing a project entails meeting a number of objectives. One of the most important is posting it to a public repository like SourceForge (https://sourceforge.net) or GitHub (https://github.com) so it can be seen by potential employers, funding sources, or business partners.
We’ll distinguish between three audiences for a completed project:
A personal project, possibly suitable for a work group or a few peers.
A project suitable for use throughout an enterprise (e.g., a business, organization, or government agency).
A project that can be published on the Python Package Index, PyPI (https://pypi.org).
We’re drawing a fine line between creating a PyPI package and creating a package usable within an enterprise. For PyPI, the software package must be installable with the PIP tool; this often adds requirements for a great deal of testing to confirm the package will work in the widest variety of contexts. This can be an onerous burden.
For this book, we suggest following practices often used for “Enterprise” software. In an Enterprise context, it’s often acceptable to create packages that are not installed by PIP. Instead, users can install the package by cloning the repository. When people work for a common enterprise, cloning packages permits users to make pull requests with suggested changes or bug fixes. The number of distinct environments in which the software is used may be very small. This reduces the burden of comprehensive testing; the community of potential users for enterprise software is smaller than a package offered to the world via PyPI.
This book is for experienced programmers who want to improve their skills by completing professional-level Python projects. It’s also for developers who need to display their skills by demonstrating a portfolio of work.
This is not intended as a tutorial on Python. This book assumes some familiarity with the language and the standard library. For a foundational introduction to Python, consider Learn Python Programming, Third Edition: https://www.packtpub.com/product/learn-python-programming-third-edition/9781801815093.
The projects in this book are described in broad strokes, requiring you to fill in the design details and complete the programming. Each chapter focuses more time on the desired approach and deliverables than the code you’ll need to write. The book will detail test cases and acceptance criteria, leaving you free to complete the working example that passes the suggested tests.
We can decompose this book into five general topics:
We’ll start with Acquiring Data From Sources. The first six projects cover acquiring data for analytic processing from a variety of sources.
Once we have data, we often need to Inspect and Survey. The next five projects look at some ways to inspect data to make sure it’s usable, and diagnose odd problems, outliers, and exceptions.
The general analytics pipeline moves on to Cleaning, Converting, and Normalizing. There are eight projects that tackle these closely related problems.
The useful results begin with Presenting Summaries. There’s a lot of variability here, so we’ll only present two project ideas. In many cases, you will want to provide your own, unique solutions for presenting the data you’ve gathered.
This book winds up with two small projects covering some basics of Statistical Modeling. In some organizations, this may be the start of more sophisticated data science and machine learning applications. We encourage you to continue your study of Python applications in the data science realm.
The first part has two preliminary chapters to help define what the deliverables are and what the broad sweep of the projects will include. Chapter 1, Project Zero: A Template for Other Projects, is a baseline project. The functionality is a “Hello, World!” application. However, the additional infrastructure of unit tests, acceptance tests, and the use of a tool like tox or nox to execute the tests is the focus.
The next chapter, Chapter 2, Overview of the Projects, shows the general approach this book will follow. This will present the flow of data from acquisition through cleaning to analysis and reporting. This chapter decomposes the large problem of “data analytics” into a number of smaller problems that can be solved in isolation.
The sequence of chapters starting with Chapter 3, Project 1.1: Data Acquisition Base Application, builds a number of distinct data acquisition applications. This sequence starts with acquiring data from CSV files. The first variation, in Chapter 4, Data Acquisition Features: Web APIs and Scraping, looks at ways to get data from web pages.
The next two projects are combined into Chapter 5, Data Acquisition Features: SQL Database. This chapter builds an example SQL database, and then extracts data from it. The example database lets us explore enterprise database management concepts to more fully understand some of the complexities of working with relational data.
Once data has been acquired, the projects transition to data inspection. Chapter 6, Project 2.1: Data Inspection Notebook creates an initial inspection notebook. In Chapter 7, Data Inspection Features, a series of projects add features to the basic inspection notebook for different categories of data.
This topic finishes with the Chapter 8, Project 2.5: Schema and Metadata project to create a formal schema for a data source and for the acquired data. The JSON Schema standard is used because it seems to be easily adapted to enterprise data processing. This schema formalization will become part of later projects.
The third topic — cleaning — starts with Chapter 9, Project 3.1: Data Cleaning Base Application. This is the base application to clean the acquired data. This introduces the Pydantic package as a way to provide explicit data validation rules.
Chapter 10, Data Cleaning Features has a number of projects to add features to the core data cleaning application. Many of the example datasets in the previous chapters provide very clean data; this makes the chapter seem like needless over-engineering. It can help if you extract sample data and then manually corrupt it so that you have examples of invalid and valid data.
In Chapter 11, Project 3.7: Interim Data Persistence, we’ll look at saving the cleaned data for further use.
The acquire-and-clean pipeline is often packaged as a web service. In Chapter 12, Project 3.8: Integrated Data Acquisition Web Service, we’ll create a web server to offer the cleaned data for subsequent processing. This kind of web service wrapper around a long-running acquire-and-clean process presents a number of interesting design problems.
The next topic is the analysis of the data. In Chapter 13, Project 4.1: Visual Analysis Techniques, we’ll look at ways to produce reports, charts, and graphs using the power of JupyterLab.
In many organizations, data analysis may lead to a formal document, or report, showing the results. This may have a large audience of stakeholders and decision-makers. In Chapter 14, Project 4.2: Creating Reports we’ll look at ways to produce elegant reports from the raw data using computations in a JupyterLab notebook.
The final topic is statistical modeling. This starts with Chapter 15, Project 5.1: Modeling Base Application, to create an application that embodies lessons learned in the Inspection Notebook and Analysis Notebook projects. Sometimes we can share Python programming among these projects. In other cases, however, we can only share the lessons learned; as our understanding evolves, we often change data structures and apply other optimizations, making it difficult to simply share a function or class definition.
In Chapter 16, Project 5.2: Simple Multivariate Statistics, we expand on univariate modeling to add multivariate statistics. This modeling is kept simple to emphasize foundational design and architectural details. If you’re interested in more advanced statistics, we suggest building the basic application project, getting it to work, and then adding more sophisticated modeling to an already-working baseline project.
The final chapter, Chapter 17, Next Steps, provides some pointers for more sophisticated applications. In many cases, a project evolves from exploration to monitoring and maintenance. There will be a long tail where the model continues to be confirmed and refined. In some cases, the long tail ends when a model is replaced. Seeing this long tail can help an analyst understand the value of time invested in creating robust, reliable software at each stage of their journey.
These projects demand a wide variety of skills, including software and data architecture, design, Python programming, test design, and even documentation writing. This breadth of skills reflects the author’s experience in enterprise software development. Developers are expected to be generalists, able to follow technology changes and adapt to new technology.
In some of the earlier chapters, we’ll offer some guidance on software design and construction. The guidance will assume a working knowledge of Python. It will point you toward the documentation for various Python packages for more information.
We’ll also offer some details on how best to construct unit tests and acceptance tests. These topics can be challenging because testing is often under-emphasized. Developers fresh out of school often lament that modern computer science education doesn’t seem to cover testing and test design very thoroughly.
This book will emphasize using pytest for unit tests and behave for acceptance tests. Using behave means writing test scenarios in the Gherkin language. This is the language used by the cucumber tool and sometimes the language is also called Cucumber. This may be new, and we’ll emphasize this with more detailed examples, particularly in the first five chapters.
Some of the projects will implement statistical algorithms. We’ll use notation like x̄ to represent the mean of the variable x. For more information on basic statistics for data analytics, see Statistics for Data Science:
https://www.packtpub.com/product/statistics-for-data-science/9781788290678
This book presumes some familiarity with Python 3 and the general concept of application development. Because a project is a complete unit of work, it will go beyond the Python programming language. This book will often challenge you to learn more about specific Python tools and packages, including pytest, mypy, tox, and many others.
Most of these projects use exploratory data analysis (EDA) as a problem domain to show the value of functional programming. Some familiarity with basic probability and statistics will help with this. There are only a few examples that move into more serious data science.
Python 3.11 is expected. For data science purposes, it’s often helpful to start with the conda tool to create and manage virtual environments. It’s not required, however, and you should be able to use any available Python.
Additional packages are generally installed with pip. The command looks like this:
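```
python -m pip install pytest
```

The same form works for any of the other packages used in these projects, such as Pydantic or requests; pytest here is only an example.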
Each chapter includes a number of “extras” that help you to extend the concepts in the chapter. The extra projects often explore design alternatives and generally lead you to create additional, more complete solutions to the given problem.
In many cases, the extras section will need even more unit test cases to confirm they actually solve the problem. Expanding the core test cases of the chapter to include the extra features is an important software development skill.
The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Python-Real-World-Projects. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “Python has other statements, such as global or nonlocal, which modify the rules for variables in a particular namespace.”
Bold: Indicates a new term, an important word, or words you see on the screen, such as in menus or dialog boxes. For example: “The base case states that the sum of a zero-length sequence is 0. The recursive case states that the sum of a sequence is the first value plus the sum of the rest of the sequence.”
A block of code is set as follows:
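For instance, a short function definition might appear as:

```python
def mean(data: list[float]) -> float:
    """Compute the arithmetic mean of a sequence of samples."""
    return sum(data) / len(data)
```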
Any command-line input or output is written as follows:
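For instance:

```
% python src/hello_world.py
```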
Warnings or important notes appear like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome.
General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book we would be grateful if you would report this to us. Please visit https://subscription.packtpub.com/help, click on the Submit Errata button, search for your book, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Once you’ve read Python Real-World Projects, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book, you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/9781803246765
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly.
This is a book of projects. To make each project a good portfolio piece, we’ll treat each project as an enterprise software product. You can build something that could be posted to a company’s (or organization’s) internal repository.
For this book, we’ll define some standards that will apply to all of these projects. The standards will identify deliverables as a combination of files, modules, applications, notebooks, and documentation files. While each enterprise is unique, the standards described here are consistent with my experience as a consultant with a variety of enterprises.
We want to draw an informal boundary to avoid some of the steps required to post to the PyPI website. Our emphasis is on a product with test cases and enough documentation to explain what it does. We don’t want to go all the way to creating a project in PyPI. This allows us to avoid the complications of a build system and the associated pyproject.toml file.
These projects are not intended to produce generic, reusable modules. They’re applications specific to a problem domain and a dataset. While these are specific solutions, we don’t want to discourage anyone who feels motivated to generalize a project into something generic and reusable.
This chapter will show the general outline of each project. Then we’ll look at the set of deliverables. This chapter ends with project zero – an initial project that will serve as a template for others. We’ll cover the following topics:
An overview of the software quality principles that we’ll try to emphasize.
A suggested approach to completing the project as a sequence of project sprints.
A general overview of the list of deliverables for each project.
Some suggested tools. These aren’t required, and some readers may have other choices.
A sample project to act as a template for subsequent projects.
We’ll start with an overview of some characteristics of high-quality software. The idea is to establish some standards for the deliverables of each project.
It helps to have a clear definition of expectations. For these expectations, we’ll rely on the ISO 25010 standard to define quality goals for each project. For more details, see https://iso25000.com/index.php/en/iso-25000-standards/iso-25010.
The ISO/IEC 25010:2011 standard describes Systems and software Quality Requirements and Evaluation (SQuaRE). This standard provides eight characteristics of software. These characteristics are as follows:
Functional suitability. Does it do what we need? Is it complete, correct, and appropriate for the user’s expressed (and implied) needs? This is the focus of each project’s description.
Performance efficiency. Does it work quickly? Does it use the minimum resources? Does it have enough capacity to meet the user’s needs? We won’t address this deeply in this book. We’ll talk about writing performance tests and ways to address performance concerns.
Compatibility. Does it co-exist with other software? Does it properly interoperate with other applications? To an extent, Python can help assure an application interoperates politely with other applications. We’ll emphasize this compatibility issue in our choices of file formats and communication protocols.
Usability. There are a number of sub-characteristics that help us understand usability. Many of the projects in this book focus on the command-line interface (CLI) to assure a bare minimum of learnability, operability, error protection, and accessibility. A few projects will include a web services API, and others will make use of the GUI interface of JupyterLab to provide interactive processing.
Reliability. Is it available when the users want it? Can we detect and repair problems? We need to make sure we have all of the parts and pieces so we can use the software. We also need to make sure we have a complete set of tests to confirm that it will work.
Security. As with usability, this is a deep topic. We’ll address some aspects of security in one of the projects. The remaining projects will use a CLI permitting us to rely on the operating system’s security model.
Maintainability. Can we diagnose problems? Can we extend it? We’ll look at documentation and test cases as essential for maintainability. We’ll also leverage a few additional project files to make sure our project can be downloaded and extended by others.
Portability. Can we move to a new Python version? New hardware? This is very important. The Python ecosystem is rapidly evolving. Since all of the libraries and packages are in a constant state of change, we need to be able to define precisely what packages our project depends on, and confirm that it works with a new candidate set of packages.
Two of these characteristics (Compatibility and Portability) are features of Python. A wise choice of interfaces assures that these characteristics are met. These are sometimes described as architectural decisions since they influence how multiple applications work together.
For Security, we will rely on the operating system. Similarly, for Usability, we’ll limit ourselves to CLI applications, relying on long-standing design principles.
The idea of Performance is something we won’t emphasize here. We will point out places where large data sets will require some careful design. The choice of data structure and algorithm is a separate subject area. Our objective in this book is to expose you to projects that can provide the stimulus for a deeper study of performance issues.
Three of these quality characteristics — Functional suitability, Reliability, and Maintainability — are the real focus of these projects. These seem to be essential elements of good software design. These are the places where you can demonstrate your Python programming skills.
Another view is available from The Twelve-Factor App (https://12factor.net). This is narrowly focused on web applications. The concepts provide deeper insights and more concrete technical guidance into the quality characteristics shown above:
Codebase. “One codebase tracked in revision control, many deploys.” We’ll use Git and GitHub or perhaps one of the other version managers supported by SourceForge.
Dependencies. “Explicitly declare and isolate dependencies.” Traditionally, a Python requirements.txt file was used for this. In this book, we’ll move forward to using a pyproject.toml file; see the sketch after this list.
Config. “Store config in the environment.” We won’t emphasize this, but Python offers numerous ways to handle configuration files.
Backing services. “Treat backing services as attached resources.” We touch on this in a few places. How storage, messages, mail, or caching work isn’t something we’ll examine deeply.
Build, release, run. “Strictly separate build and run stages.” For command-line applications, this means we should deploy the application into a “production” environment to use the high-value data and produce the results that the enterprise needs. We want to avoid running things in our desktop development environment.
Processes. “Execute the app as one or more stateless processes.” CLI applications tend to be structured this way without making any additional effort.
Port binding. “Export services via port binding.” We won’t emphasize this; it’s very specific to web services.
Concurrency. “Scale out via the process model.” This is a subject for the interested reader who wants to process very large data sets. We won’t emphasize it in the main text. We will suggest some of these topics in the “Extras” section of some chapters.
Disposability. “Maximize robustness with fast startup and graceful shutdown.” CLI applications tend to be structured this way, also.
Dev/prod parity. “Keep development, staging, and production as similar as possible.” While we won’t emphasize this deeply, our intent with CLI applications is to expose the distinctions between development and production with command-line arguments, shell environment variables, and configuration files.
Logs. “Treat logs as event streams.” We will suggest applications write logs, but we won’t provide more detailed guidance in this book.
Admin processes. “Run admin/management tasks as one-off processes.” A few of the projects will require some additional administrative programming. These will be built as deliverable CLI applications, complete with an acceptance test suite.
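As a minimal sketch of such a dependency declaration, a pyproject.toml might contain the following; the project name and version constraints are illustrative, not prescriptive:

```toml
[project]
name = "data_acquisition"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2.0",
    "requests",
]
```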
Our objective is to provide project descriptions and lists of deliverables that try to conform to these quality standards. As we noted earlier, each enterprise is unique, and some organizations will fall short of these standards, while some will exceed them.
In addition to the ISO standard, the IEEE 1061 standard also covers software quality. While it has been inactive since 2020, it contains some good ideas. The standard is focused on quality metrics, which dives deeply into the idea of analyzing software for quality factors.
It can also help to read https://en.wikipedia.org/wiki/ISO/IEC_9126 for some background on the origins of the ISO standard.
When doing more reading on this topic, it can help to recognize the following three terms:
Factors are an external view of the software. They reflect the user’s understanding. Some of the underlying quality characteristics are not directly visible to users. Maintainability, for example, may appear to users as a reliability or usability problem because the software is difficult to repair or extend.
Criteria come from an internal view of the software. Quality criteria are the focus of the project’s deliverables. Our project code should reflect the eight quality characteristics listed above.
Metrics are how we can control the factors that are seen by the user. We won’t emphasize quality metrics. In some cases, tools like pylint provide tangible measurements of static code quality. This isn’t a comprehensive tool for software quality in general, but it provides an easy starting point for a few key metrics related to complexity and maintainability.
Given these standards for high-quality software, we can turn our attention to the sequence of steps for building these files. We’ll suggest a sequence of stages you can follow.
We hesitate to provide a detailed step-by-step process for building software. For more experienced developers, our sequence of steps may not match their current practices. For less experienced developers, the suggested process can help by providing a rational order in which the deliverables can be built.
There was a time when a “statement of work” with a detailed list of specific tasks was a central part of a software development effort. This was often part of a “waterfall” methodology where requirements flowed to analysts who wrote specifications that flowed down to designers who wrote high-level designs that flowed down to coders. This wasn’t a great way to build software, and has been largely supplanted by Agile methods. For more information on Agility, see https://agilemanifesto.org.
The Agile approach lets us examine a project both as a series of steps to be completed, as well as a collection of deliverables that need to be created. We’ll describe the steps first, avoiding too much emphasis on details. We’ll revisit the deliverables, and in those sections, dive a little more deeply into what the final product needs to be.
The suggested approach follows the “Agile Unified Process” (https://www.methodsandtools.com/archive/archive.php?id=21), which has four general phases. We’ll subdivide one of the phases to distinguish two important kinds of deliverables.
We suggest tackling each project in the following five phases:
Inception. Ready the tools. Organize the project directory and virtual environment.
Elaboration, part 1: Define done. This is implemented as acceptance test cases.
Elaboration, part 2: Define components and some tests. This is implemented as unit test cases for components that need to be built.
Construction. Build the software.
Transition. Final cleanup: make sure all tests pass and the documentation is readable.
These efforts don’t proceed in a simple linear fashion. It’s often necessary to iterate between elaboration and construction to create features separately.
It often works as shown in Figure 1.1.
Figure 1.1: Development Phases and Cycles
This figure provides a very coarse overview of the kinds of activities we’ll discuss below. The important concept is iterating between the elaboration and construction phases. It’s difficult to fully design a project before constructing all of the code. It’s easier to design a little, construct a little, and refactor as needed.
For a complex project, there may be a series of transitions to production. Often a “minimum viable product” will be created to demonstrate some of the concepts. This will be followed by products with more features or features better focused on the user. Ideally, it will have both kinds of enhancements: more features and a better focus on the user’s needs.
We’ll look at each of these phases in a little more detail, starting with the inception phase.
Start the inception phase by creating the parent directory for the project, then some commonly-used sub-directories (docs, notebooks, src, tests). There will be some top-level files (README.md, pyproject.toml, and tox.ini). The list of expected directories and files is described in more detail in List of deliverables, later in this chapter. We’ll look at the contents of each of these files and directories in the Deliverables section.
It helps to capture any initial ideas in the README.md file. Later, this will be refactored into more formal documentation. Initially, it’s the perfect place to keep notes and reminders.
Build a fresh, new virtual environment for the project. Each project should have its own virtual environment. Environments are essentially free: it’s best to build them to reflect any unique aspects of each project.
Here’s a conda command that can be used to build an environment.
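```
conda create -n project0 --channel=conda-forge python=3.11
```

The environment name, project0 here, is only a placeholder; pick a name that matches the project. Activate the environment with conda activate project0 before installing anything else.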
An important part of inception is to start the documentation for the project. This can be done using the Sphinx tool.
While Sphinx is available from the Conda Forge, this version lags behind the version available from the PyPI repository. Because of this lag, it’s best to install Sphinx using PIP:
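```
python -m pip install sphinx
```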
After installing Sphinx, it helps to initialize and publish the documentation for the project. Starting this permits publishing and sharing the design ideas as the work progresses. In the docs directory, do the following steps:
Run the sphinx-quickstart command to populate the documentation. See https://www.sphinx-doc.org/en/master/usage/quickstart.html#setting-up-the-documentation-sources.
Update the index.rst table of contents (TOC) with two entries: “overview” and “API”. These are sections that will be in separate files.
Write an overview.rst document with the definition of done: what will be accomplished. This should cover the core “Who-What-When-Where-Why” of the project.
Put a title in the API document, and a .. todo:: note to yourself. You’ll add to this document as you add modules to your project.
During Elaboration, you’ll update the index.rst to add sections for architecture and design decisions.
During Construction, as you create code, you’ll add to the API section.
During Transition, you’ll add to the index.rst with some “How” sections: how to test it, and how to use it.
With this as the starting point, the make html command will build a documentation set in HTML. This can be shared with stakeholders to assure there’s a clear, common understanding of the project.
With a skeleton directory and some initial places to record ideas and decisions, it makes sense to start elaborating on the initial goal and decide what will be built, and how it will work.
It helps to have a clear definition of “done.” This guides the construction effort toward a well-defined goal, and it helps to have that definition written out as a formal, automated test suite. For this, the Gherkin language is helpful. The behave tool can execute the Gherkin feature to evaluate the application software. An alternative to Gherkin is using the pytest tool with the pytest-bdd plug-in to run the acceptance tests.
The two big advantages of Gherkin are the ability to structure the feature descriptions into scenarios and write the descriptions in English (or any other natural language). Framing the expected behavior into discrete operating scenarios forces us to think clearly about how the application or module is used. Writing in English (or other natural languages) makes it easier to share definitions with other people to confirm our understanding. It also helps to keep the definition of done focused on the problem domain without devolving into technical considerations and programming.
Each scenario can have three steps: Given, When, and Then. The Given step defines a context. The When step defines an action or a request of the software. The Then step defines the expected results. These step definitions can be as complex as needed, often involving multiple clauses joined with And. Examples can be provided in tables to avoid copying and pasting a scenario with a different set of values. A separate module provides Python implementations for the English-language step text.
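As a minimal sketch, a feature with one scenario for a hello-world application might read as follows; the wording is illustrative, not taken from a specific project:

```gherkin
Feature: The application says hello.

  Scenario: The greeting is written to a file.
    Given a clean working directory
    When we run the hello world application
    Then the output file contains "Hello, World!"
```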
See https://behave.readthedocs.io/en/stable/gherkin.html#gherkin-feature-testing-language for numerous examples of scenarios written in Gherkin.
Start this part of elaboration by creating a tests/features/project.feature file based on the overview description. Don’t use a boring name like project. A complex project may have multiple features, so the feature file names should reflect the features.
To use pytest, write one (or more) acceptance test scripts in the tests directory.
The features are supported by steps. These steps are in modules in the tests/steps directory. A tests/steps/hw_cli.py module provides the necessary Python definitions for the steps in the feature file. The names of the modules don’t matter; we suggest something like hw_cli because it implements the steps for a hello-world command-line interface.
The underlying mechanism used by the Behave tool is function decorators. These decorators match text from the feature file to the function that implements that step. They can use wildcard matching to permit flexibility in wording. A decorator can also parse parameter values out of the text.
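A sketch of such a step module, matching the scenario shown earlier, might look like this; the output file name and the application path are assumptions:

```python
# tests/steps/hw_cli.py -- step definitions for the hello-world feature.
import subprocess
from pathlib import Path

from behave import given, when, then

OUTPUT = Path("output.txt")  # The output file name is an assumption.


@given("a clean working directory")
def clean_directory(context):
    # Remove any output left over from a previous run.
    OUTPUT.unlink(missing_ok=True)


@when("we run the hello world application")
def run_application(context):
    # Run the application as a subprocess; save the result on the
    # context object so later steps can examine it.
    context.result = subprocess.run(
        ["python", "src/hello_world.py"], capture_output=True, text=True
    )


@then('the output file contains "{text}"')
def check_output(context, text):
    # The default "parse" matcher extracts the {text} parameter
    # from the step's wording.
    assert text in OUTPUT.read_text()
```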
A tests/environment.py file is required, but it can be empty for simple tests. This file provides a testing context, and is where some functions used by the Behave tool to control test setup and teardown are defined.
As soon as scenarios have been written, it makes sense to run the Behave tool to see the acceptance test fail. Initially, this lets you debug the step definitions.
For this application, the steps must properly execute the application program and capture the output file. Because the application doesn’t exist yet, a test failure at this point is expected.
The feature files with the application scenarios are a working definition of done. When the test suite runs, it will show whether or not the software works. Starting with features that fail to work means the rest of the construction phase will be debugging the failures and fixing the software until the application passes the acceptance test suite.
In Project 0 – Hello World with test cases, we’ll look at an example of a Gherkin-language feature, the matching step definitions, and a tox.ini to run the test suite.
The acceptance test suite is often relatively “coarse” – the tests exercise the application as a whole, and avoid internal error conditions or subtle edge cases. The acceptance test suite rarely exercises all of the individual software components. Because of this, it can be difficult to debug problems in complex applications without detailed unit tests for each unit — each package, module, class, and function.
After writing the general acceptance test suite, it helps to do two things. First, start writing some skeleton code that’s likely to solve the problem. The class or function will contain a docstring explaining the idea. Optionally, it can have a body of the pass statement. After writing this skeleton, the second step is to expand on the docstring ideas by writing unit tests for the components.
Let’s assume we’ve written a scenario with a step that will execute an application named src/hello_world.py. We can create this file and include a skeleton class definition like this:
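A sketch of one possibility follows; the class name and its docstring are placeholders for your own design:

```python
class Greeting:
    """
    Writes a greeting message to an output file.

    The file name and the message text will come from
    command-line arguments.
    """
    pass
```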
This example shows a class with a design idea. This needs to be expanded with a clear statement of expected behaviors. Those expectations should take the form of unit tests for this class.
Once some skeletons and tests are written, the pytest tool can be used to execute those tests.
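A first unit test might look like the following sketch; the write() method it exercises describes behavior we intend to build, not something the skeleton provides yet:

```python
# tests/test_hello_world.py
from pathlib import Path

from hello_world import Greeting  # Assumes src is on the search path.


def test_greeting_writes_file(tmp_path: Path) -> None:
    # tmp_path is a pytest fixture providing a fresh temporary directory.
    target = tmp_path / "output.txt"
    Greeting().write(target, "Hello, World!")
    assert target.read_text() == "Hello, World!\n"
```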
The unit tests will likely fail because the skeleton code is incomplete or doesn’t work. In the cases where tests are complete, but classes don’t work, you’re ready to start the construction phase.
In the cases where the design isn’t complete, or the tests are fragmentary, it makes sense to remain in the elaboration phase for those classes, modules, or functions. Once the tests are understood, construction has a clear and achievable goal.
We don’t always get the test cases right the first time; we must change them as we learn. We rarely get the working code right the first time. If the test cases come first, they make sure we have a clear goal.
In some cases, the design may not be easy to articulate without first writing a “spike solution” to explore an alternative. Once the spike works, it makes sense to write tests to demonstrate the code works.
See http://www.extremeprogramming.org/rules/spike.html for more on creating spike solutions.
At this point, you have an idea of how the software will be designed. The test cases are a way to formalize the design into a goal. It’s time to begin construction.
The construction phase finishes the class and function (and module and package) definitions started in the elaboration phase. In some cases, test cases will need to be added as the definitions expand.
As we get closer to solving the problem, the number of tests passed will grow.
The number of tests may also grow. It’s common to realize the sketch of a class definition is incomplete and requires additional classes to implement the State or Strategy design pattern. As another example, we may realize subclasses are required to handle special cases. This new understanding will change the test suite.
When we look at our progress over several days, we should see the number of tests passed approaching the total number of tests.
How many tests do we need? There are strong opinions here. For the purposes of showing high-quality work, tests that exercise 100% of the code are a good starting point. For some industries, a more strict rule is to cover 100% of the logic paths through the code. This higher standard is often used for applications like robotics and health care where the consequences of a software failure may involve injury or death.
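With the pytest-cov plug-in installed, this kind of goal can be checked from the command line; the src package path is an assumption about the project layout:

```
pytest --cov=src --cov-fail-under=100 tests/
```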
For enterprise applications, there is a transition from the development team to formal operations. This usually means a deployment into a production environment with the real user community and their data.
In organizations with good Continuous Integration/Continuous Deployment (CI/CD) practices, there will be a formalized execution of the tox command to make sure everything works: all the tests pass.
In some enterprises, the make html command will also be run to create the documentation.
Often, the technical operations team will need specific topics in the documentation and the README.md file. Operations staff may have to diagnose and troubleshoot problems with hundreds of applications, and they will need very specific advice in places where they can find it immediately. We won’t emphasize this in this book, but as we complete our projects, it’s important to think that our colleagues will be using this software, and we want their work life to be pleasant and productive.
The final step is to post your project to your public repository of choice.
You have completed part of your portfolio. You’ll want potential business partners or hiring managers or investors to see this and recognize your level of skill.
We can view a project as a sequence of steps. We can also view a project as a deliverable set of files created by those steps. In the next section, we’ll look over the deliverables in a little more detail.
We’ll take another look at the project, this time from the view of what files will be created. This will parallel the outline of the activities shown in the previous section.
The following outline shows many of the files in a completed project:
The documentation in the docs directory. There will be other files in there, but you’ll be focused on the following files:
The Sphinx index.rst starter file with references to overview and API sections.
An overview.rst section with a summary of the project.
An api.rst section with .. automodule:: commands to pull in documentation from the application.
A set of test cases in the tests directory.
Acceptance tests aimed at Behave (or the pytest-bdd plug-in for Gherkin). When using Behave, there will be two sub-directories: a features directory and a steps directory. Additionally, there will be an environment.py file.
Unit test modules written with the pytest framework. These all have a name that starts with test_ to make them easy for pytest to find. Ideally, the Coverage tool is used to assure 100% of the code is exercised.
The final code in the src directory. For some of the projects, a single module will be sufficient. Other projects will involve a few modules. (Developers familiar with Java or C++ often create too many modules here. The Python concept of module is more akin to the Java concept of package. It’s not common Python practice to put each class definition into a separate module file.)
Any JupyterLab notebooks can be in the notebooks folder. Not all projects use JupyterLab notebooks, so this folder can be omitted if there are no notebooks.
A few other project files are in the top-level directory.
A tox.ini file should be used to run the pytest and behave test suites.
The pyproject.toml