Learn a modern approach to data analysis using Python to harness the power of programming and AI across your data. Detailed case studies bring this modern approach to life across visual data, social media, graph algorithms, and time series analysis.
Key Features
Book Description
Data Analysis with Python offers a modern approach to data analysis so that you can work with the latest and most powerful Python tools, AI techniques, and open source libraries. Industry expert David Taieb shows you how to bridge data science with the power of programming and algorithms in Python. You'll work with complex algorithms and cutting-edge AI in your data analysis. Learn how to analyze data with hands-on examples using Python-based tools and Jupyter Notebook. You'll find the right balance of theory and practice, with extensive code files that you can integrate right into your own data projects.
Explore the power of this approach to data analysis by then working with it across key industry case studies. Four fascinating and full projects connect you to the most critical data analysis challenges you're likely to meet today. The first of these is an image recognition application with TensorFlow – embracing the importance today of AI in your data analysis. The second industry project analyzes social media trends, exploring big data issues and AI approaches to natural language processing. The third case study is a financial portfolio analysis application that engages you with time series analysis - pivotal to many data science applications today. The fourth industry use case dives into graph algorithms and the power of programming in modern data science. You'll wrap up with a thoughtful look at the future of data science and how it will harness the power of algorithms and artificial intelligence.
What you will learn
Who this book is for
This book is for developers wanting to bridge the gap between themselves and data scientists. Introducing PixieDust from its creator, the book is a great desk companion for the accomplished data scientist. Some fluency in data interpretation and visualization is assumed. It will be helpful to have some knowledge of Python, using Python libraries, and some proficiency in web development.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 465
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Acquisition Editors: Frank Pohlmann, Suresh M Jain
Project Editors: Savvy Sequeira, Kishor Rit
Content Development Editor: Alex Sorrentino
Technical Editor: Bhagyashree Rai
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Tom Scaria
Production Coordinator: Sandip Tadge
First published: June 2018
Production reference: 1300718
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78883-996-9
www.packtpub.com
To Alexandra, Solomon, Zachary, Victoria and Charlotte:
Thank you for your support, unbounded love, and infinite patience. I would not have been able to complete this work without all of you.
To Fernand and Gisele:
Without whom I wouldn't be where I am today. Thank you for your continued guidance all these years.
mapt.io
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
David Taieb is the Distinguished Engineer for the Watson and Cloud Platform Developer Advocacy team at IBM, leading a team of avid technologists on a mission to educate developers on the art of the possible with data science, AI and cloud technologies. He's passionate about building open source tools, such as the PixieDust Python Library for Jupyter Notebooks, which help improve developer productivity and democratize data science. David enjoys sharing his experience by speaking at conferences and meetups, where he likes to meet as many people as possible.
I want to give special thanks to all of the following dear friends at IBM who contributed to the development of PixieDust and/or provided invaluable support during the writing of this book: Brad Noble, Jose Barbosa, Mark Watson, Raj Singh, Mike Broberg, Jessica Mantaro, Margriet Groenendijk, Patrick Titzler, Glynn Bird, Teri Chadbourne, Bradley Holt, Adam Cox, Jamie Jennings, Terry Antony, Stephen Badolato, Terri Gerber, Peter May, Brady Paterson, Kathleen Francis, Dan O'Connor, Muhtar (Burak) Akbulut, Navneet Rao, Panos Karagiannis, Allen Dean, and Jim Young.
Margriet Groenendijk is a data scientist and developer advocate for IBM. She has a background in climate research, where, at the University of Exeter, she explored large observational datasets and the output of global scale weather and climate models to understand the impact of land use on climate. Prior to that, she explored the effect of climate on the uptake of carbon from the atmosphere by forests during her PhD research at the Vrije Universiteit in Amsterdam.
Nowadays, she explores ways to simplify working with diverse data using open source tools, IBM Cloud, and Watson Studio. She has experience with cloud services, databases, and APIs to access, combine, clean, and store different types of data. Margriet uses time series analysis, statistical data analysis, modeling and parameter optimization, machine learning, and complex data visualization. She writes blogs and speaks about these topics at conferences and meetups.
va barbosa is a developer advocate for the Center for Open-Source Data & AI Technologies, where he helps developers discover and make use of data and machine learning technologies. This is fueled by his passion to help others, and guided by his enthusiasm for open source technology.
Always looking to embrace new challenges and fulfill his appetite for learning, va immerses himself in a wide range of technologies and activities. He has been an electronic technician, support engineer, software engineer, and developer advocate.
When not focusing on the developer experience, va enjoys dabbling in photography. If you can't find him in front of a computer, try looking behind a camera.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
"Developers are the most-important, most-valuable constituency in business today, regardless of industry."
– Stephen O'Grady, author of The New Kingmakers

First, let me thank you and congratulate you, the reader, for the decision to invest some of your valuable time to read this book. Throughout the chapters to come, I will take you on a journey of discovering, or even re-discovering, data science from the perspective of a developer, and I will develop the theme of this book: data science is a team sport and, if it is to be successful, developers will have to play a bigger role in the near future and collaborate better with data scientists. However, to make data science more inclusive to people of all backgrounds and trades, we first need to democratize it by making data simple and accessible—this is in essence what this book is about.
As I'll explain in more detail in Chapter 1, Programming and Data Science – A New Toolset, I am first and foremost a developer with over 20 years' experience of building software components of a diverse nature: frontend, backend, middleware, and so on. Reflecting on that time, I realize how much getting the algorithms right always came first in my mind; data was always somebody else's problem. I rarely had to analyze it or extract insight from it. At best, I was designing the right data structure to load it in a way that would make my algorithm run more efficiently and the code more elegant and reusable.
However, as the Artificial Intelligence and data science revolution got under way, it became obvious to me that developers like myself needed to get involved, and so 7 years ago in 2011, I jumped at the opportunity to become the lead architect for the IBM Watson core platform UI & Tooling. Of course, I don't pretend to have become an expert in machine learning or NLP, far from it. Learning through practice is not a substitute for getting a formal academic background.
However, a big part of what I want to demonstrate in this book is that, with the right tools and approach, someone equipped with the right mathematical foundations (I'm only talking about high-school level calculus concepts really) can quickly become a good practitioner in the field. A key ingredient to being successful is to simplify as much as possible the different steps of building a data pipeline; from acquiring, loading, and cleaning the data, to visualizing and exploring it, all the way to building and deploying machine learning models.
It was with an eye to furthering this idea of making data simple and accessible to a community beyond data scientists that, 3 years ago, I took on a leading role at the IBM Watson Data Platform team with the mission of expanding the community of developers working with data with a special focus on education and activism on their behalf. During that time as the lead developer advocate, I started to talk openly about the need for developers and data scientists to better collaborate in solving complex data problems.
Note: During discussions at conferences and meetups, I would sometimes get into trouble with data scientists who would get upset because they interpreted my narrative as me saying that data scientists are not good software developers. I want to set the record straight, including with you, the data scientist reader, that this is far from the case.
The majority of data scientists are excellent software developers with a comprehensive knowledge of computer science concepts. However, their main objective is to solve complex data problems which require rapid, iterative experimentations to try new things, not to write elegant, reusable components.
But I didn't want to only talk the talk; I also wanted to walk the walk and started the PixieDust open source project as my humble contribution to solving this important problem. As the PixieDust work progressed nicely, the narrative became crisper and easier to understand with concrete example applications that developers and data scientists alike could become excited about.
When I was presented with the opportunity to write a book about this story, I hesitated for a long time before embarking on this adventure for mainly two reasons:
In the end, the decision to embark on this adventure came pretty easily. Having worked on the PixieDust project for 2 years, I felt we had made terrific progress with very interesting innovations that generated lots of interest in the open-source community, and that writing a book would nicely complement our advocacy work of helping developers get involved in data science.
As a side note, for the reader who is thinking about writing a book and who has similar concerns, I can only advise on the first one with a big, "Yes, go for it." For sure, it is a big commitment that requires a substantial amount of sacrifice but provided that you have a good story to tell with solid content, it is really worth the effort.
This book will serve the budding data scientist and developer with an interest in developing their skills or anyone wishing to become a professional data scientist. With the introduction of PixieDust from its creator, the book will also be a great desk companion for the already accomplished Data Scientist.
No matter the individual's level of interest, the clear, easy-to-read text and real-life scenarios would suit those with a general interest in the area, since they get to play with Python code running in Jupyter Notebooks.
To produce a functioning PixieDust dashboard, only a modicum of HTML and CSS is required. Fluency in data interpretation and visualization is also necessary since this book addresses data professionals such as business and general data analysts. The later chapters also have much to offer.
The book contains two logical parts of roughly equal length. In the first half, I lay down the theme of the book which is the need to bridge the gap between data science and engineering, including in-depth details about the Jupyter + PixieDust solution I'm proposing. The second half is dedicated to applying what we learned in the first half, to four industry cases.
In Chapter 1, Programming and Data Science – A New Toolset, I attempt to provide a definition of data science through the prism of my own experience, building a data pipeline that performs sentiment analysis on Twitter posts. I defend the idea that it is a team sport and that, most often, silos exist between the data science and engineering teams that cause unnecessary friction, inefficiencies and, ultimately, a failure to realize its full potential. I also argue that data science is here to stay and that, eventually, it will become an integral part of what is known today as computer science (I like to think that someday new terms will emerge, such as computer data science, that better capture this duality).
In Chapter 2, Python and Jupyter Notebooks to Power your Data Analysis, I start diving into popular data science tools such as Python and its ecosystem of open-source libraries dedicated to data science, and of course Jupyter Notebooks. I explain why I think Jupyter Notebooks will become the big winner in the next few years. I also introduce the capabilities of the PixieDust open-source library, starting with the simple display() method that lets the user visually explore data in an interactive user interface by building compelling charts. With this API, the user can choose from multiple rendering engines such as Matplotlib, Bokeh, Seaborn, and Mapbox. The display() capability was the only feature in the PixieDust MVP (minimum viable product) but, over time, as I was interacting with a lot of data science practitioners, I added new features to what would quickly become the PixieDust toolbox:
The following diagram summarizes the PixieDust development journey, including recent additions such as the PixieGateway and the PixieDebugger which is the first visual Python debugger for Jupyter Notebooks:
PixieDust journey
One key message to take away from this chapter is that PixieDust is first and foremost an open-source project that lives and breathes through the contributions of the developer community. As is the case for countless open-source projects, we can expect many more breakthrough features to be added to PixieDust over time.
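To give a concrete feel for the display() API mentioned above, here is a minimal sketch of how it is typically called from a Jupyter Notebook cell; the sample DataFrame is made up purely for illustration, and this is not an excerpt from the chapter itself:

```python
# Minimal sketch, assuming PixieDust is installed (pip install pixiedust)
import pandas as pd
import pixiedust  # makes display() available in the Notebook

# A hypothetical DataFrame; any pandas or Spark DataFrame works the same way
df = pd.DataFrame({
    "city": ["Boston", "New York", "San Francisco"],
    "population": [694583, 8398748, 883305],
})

# Opens the interactive PixieDust UI, where you can switch between a table
# view and charts, and pick a renderer (Matplotlib, Bokeh, Seaborn, Mapbox, ...)
display(df)
```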
In Chapter 3, Accelerate your Data Analysis with Python Libraries, I take the reader through a deep dive into the PixieApp programming model, illustrating each concept along the way with a sample application that analyzes GitHub data. I start with a high-level description of the anatomy of a PixieApp, including its life cycle and the execution flow with the concept of routes. I then go over the details of how developers can use regular HTML and CSS snippets to build the UI of the dashboard, seamlessly interacting with the analytics and leveraging the PixieDust display() API to add sophisticated charts.
The PixieApp programming model is the cornerstone of the tooling strategy for bridging the gap between data science and engineering, as it streamlines the process of operationalizing the analytics, thereby increasing collaboration between data scientists and developers and reducing the time-to-market of the application.
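As a small taste of the programming model described above, here is a minimal "hello world" PixieApp sketch with a single default route; the class name and HTML markup are hypothetical and not taken from the chapter's GitHub sample application:

```python
from pixiedust.display.app import *  # brings in the PixieApp and route decorators

@PixieApp
class HelloDashboardApp:
    # The default route runs when the app starts and returns the HTML fragment
    # that PixieDust renders inside the Notebook (or as a web page once the
    # app is published through the PixieGateway)
    @route()
    def main_screen(self):
        return """
        <div style="text-align:center">
            <h2>Hello from a PixieApp</h2>
            <p>Plain HTML and CSS, driven by Python analytics behind the scenes</p>
        </div>
        """

# Instantiate and run the app in the current Notebook cell
HelloDashboardApp().run()
```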
In Chapter 4, Publish your Data Analysis to the Web - the PixieApp Tool, I discuss the PixieGateway microservice, which enables developers to publish PixieApps as analytical web applications. I start by showing how to quickly deploy a PixieGateway microservice instance both locally and on the cloud as a Kubernetes container. I then go over the PixieGateway admin console capabilities, including the various configuration profiles and how to live-monitor the deployed PixieApp instances and the associated backend Python kernels. I also feature the chart-sharing capability of the PixieGateway, which lets the user turn a chart created with the PixieDust display() API into a web page accessible by anyone on the team.
The PixieGateway is a ground-breaking innovation with the potential to seriously speed up the operationalization of analytics—which is sorely needed today—to fully capitalize on the promise of data science. It represents an open-source alternative to similar products that already exist on the market, such as the Shiny Server from R-Studio (https://shiny.rstudio.com/deploy) and Dash from Plotly (https://dash.plot.ly).
In Chapter 5, Python and PixieDust Best Practices and Advanced Concepts, I complete the deep dive into the PixieDust toolbox by going over advanced concepts of the PixieApp programming model:
Thanks to the open-source model, with its transparent development process and a growing community of users who provided valuable feedback, we were able to prioritize and implement a lot of these advanced features over time. The key point I'm trying to make is that following an open-source model with an appropriate license (PixieDust uses the Apache 2.0 license, available at https://www.apache.org/licenses/LICENSE-2.0) does work very well. It helped us grow the community of users, which in turn provided us with the necessary feedback to prioritize new features that we knew were high value and, in some instances, contributed code in the form of GitHub pull requests.
In Chapter 6, Analytics Study: AI and Image Recognition with TensorFlow, I dive into the first of four industry cases. I start with a high-level introduction to machine learning, followed by an introduction to deep learning—a subfield of machine learning—and the TensorFlow framework, which makes it easier to build neural network models. I then proceed to build an image recognition sample application, including the associated PixieApp, in four parts:
I decided to start the series of sample applications with deep learning image recognition with TensorFlow because it's an important use case that is growing in popularity, and demonstrating how we can build the models and deploy them in an application in the same Notebook makes a powerful statement toward the theme of bridging the gap between data science and engineering.
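The chapter builds and trains its own model; purely to give a flavor of the task, here is a hedged sketch that classifies an image with a pretrained Keras model bundled with TensorFlow (the image path is a placeholder, and this is not the chapter's actual code):

```python
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Pretrained ImageNet classifier; the chapter builds its own model instead
model = MobileNetV2(weights="imagenet")

# "sample.jpg" is a hypothetical local image file
img = image.load_img("sample.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
# Print the top-3 (class id, label, probability) tuples
print(decode_predictions(preds, top=3)[0])
```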
In Chapter 7, Analytics Study: NLP and Big Data with Twitter Sentiment Analysis, I talk about doing natural language processing at Twitter scale. In this chapter, I show how to use the IBM Watson Natural Language Understanding cloud-based service to perform a sentiment analysis of the tweets. This is very important because it reminds the reader that reusing managed hosted services, rather than building the capability in-house, can sometimes be an attractive option.
I start with an introduction to the Apache Spark parallel computing framework, and then move on to building the application in four parts:
I think the reader, especially those who are not familiar with Apache Spark, will enjoy this chapter as it is a little easier to follow than the previous one. The key takeaway is how to build analytics that scale with Jupyter Notebooks that are connected to a Spark cluster.
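To illustrate that takeaway (this is not the Watson NLU code used in the chapter), here is a minimal sketch that scores tweet text with a basic NLP library inside a Spark UDF, so the enrichment runs in parallel across the cluster; the DataFrame contents and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from textblob import TextBlob

spark = SparkSession.builder.appName("tweet-sentiment-sketch").getOrCreate()

# Hypothetical tweets; in the chapter, the data comes from the Twitter stream
tweets = spark.createDataFrame(
    [("I love this product",), ("This is terrible",)], ["text"])

# Wrap the sentiment call in a UDF so it executes on the Spark workers
polarity = udf(lambda text: TextBlob(text).sentiment.polarity, DoubleType())

tweets.withColumn("polarity", polarity("text")).show()
```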
In Chapter 8, Analytics Study: Prediction - Financial Time Series Analysis and Forecasting, I talk about time series analysis, which is a very important field of data science with lots of practical applications in the industry. I start the chapter with a deep dive into the NumPy library, which is foundational to so many other libraries, such as pandas and SciPy. I then proceed with the building of the sample application, which analyzes a time series comprising historical stock data, in two parts:
Time series analysis is an important field of data science that I consider to be underrated. I personally learned a lot while writing this chapter. I certainly hope that the reader will enjoy it as well and that reading it will spur an interest in learning more about this great topic. If that's the case, I also hope that you'll be convinced to try out Jupyter and PixieDust as you continue to explore time series analysis.
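For readers who want a tiny preview of the kind of NumPy and pandas manipulation the chapter relies on, here is a hedged sketch; the CSV file name and column names are hypothetical stand-ins for the historical stock data used in the chapter:

```python
import numpy as np
import pandas as pd

# Hypothetical CSV of daily quotes with at least "Date" and "Close" columns
prices = pd.read_csv("stock_history.csv", parse_dates=["Date"], index_col="Date")

# Daily log returns and a 20-day simple moving average of the closing price
prices["log_return"] = np.log(prices["Close"] / prices["Close"].shift(1))
prices["sma_20"] = prices["Close"].rolling(window=20).mean()

print(prices[["Close", "log_return", "sma_20"]].tail())
```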
In Chapter 9, Analytics Study: Graph Algorithms - US Domestic Flight Data Analysis, I complete this series of industry use cases with the study of graphs. I chose a sample application that analyzes flight delays because the data is readily available, and it's a good fit for using graph algorithms (well, for full disclosure, I may also have chosen it because I had already written a similar application to predict flight delays based on weather data, where I used Apache Spark MLlib: https://developer.ibm.com/clouddataservices/2016/08/04/predict-flight-delays-with-apache-spark-mllib-flightstats-and-weather-data).
I start with an introduction to graphs and associated graph algorithms including several of the most popular graph algorithms such as Breadth First Search and Depth First Search. I then proceed with an introduction to the networkx Python library that is used to build the sample application.
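As a quick illustration of the networkx API (a toy example with made-up airports and delays, not the chapter's flight dataset), the following sketch builds a small directed graph of routes and runs a couple of common graph algorithms on it:

```python
import networkx as nx

# Toy directed graph of routes, weighted by a made-up average delay in minutes
g = nx.DiGraph()
g.add_weighted_edges_from([
    ("BOS", "JFK", 12.0),
    ("JFK", "SFO", 25.0),
    ("BOS", "ORD", 8.0),
    ("ORD", "SFO", 15.0),
])

# Shortest path by cumulative delay, and a simple centrality measure
print(nx.shortest_path(g, "BOS", "SFO", weight="weight"))
print(nx.pagerank(g))
```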
The application is made of four parts:
Graph theory is another important and growing field of data science, and this chapter nicely rounds out the series, which I hope provides a diverse and representative set of industry use cases. For readers who are particularly interested in using graph algorithms with big data, I recommend looking at Apache Spark GraphX (https://spark.apache.org/graphx), which implements many of the graph algorithms using a very flexible API.
In Chapter 10, The Future of Data Analysis and Where to Develop your Skills, I end the book by giving a brief summary and explaining my take on Drew Conway's Venn diagram. Then I talk about the future of AI and data science and how companies could prepare themselves for the AI and data science revolution. I have also listed some great references for further learning.
Appendix, PixieApp Quick-Reference, is a developer quick-reference guide that provides a summary of all the PixieApp attributes. This explains the various annotations, custom HTML attributes, and methods with the help of appropriate examples.
But enough about the introduction: let's get started on our journey with the first chapter titled Programming and Data Science – A New Toolset.
Feedback from our readers is always welcome.
General feedback: Email <[email protected]>, and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please email us at <[email protected]>.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at <[email protected]> with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
"Data is a precious thing and will last longer than the systems themselves."
– Tim Berners-Lee, inventor of the World Wide Web
(https://en.wikipedia.org/wiki/Tim_Berners-Lee)
In this introductory chapter, I'll start the conversation by attempting to answer a few fundamental questions that will hopefully provide context and clarity for the rest of this book:
Using my experience as a developer and recent data science practitioner, I'll then discuss a concrete data pipeline project that I worked on and a data science strategy that derived from this work, which comprises three pillars: data, services, and tools. I'll end the chapter by introducing Jupyter Notebooks, which are at the center of the solution I'm proposing in this book.
If you search the web for a definition of data science, you will certainly find many. This reflects the reality that data science means different things to different people. There is no real consensus on what data scientists exactly do and what training they must have; it all depends on the task they're trying to accomplish, for example, data collection and cleaning, data visualization, and so on.
For now, I'll try to use a universal and, hopefully, consensual definition: data science refers to the activity of analyzing a large amount of data in order to extract knowledge and insight leading to actionable decisions. It's still pretty vague, though; one can ask what kind of knowledge, insight, and actionable decisions we are talking about.
To orient the conversation, let's reduce the scope to three fields of data science:
In essence, descriptive data science answers the question of what (does the data tell me), predictive data science answers the question of why (is the data behaving a certain way), and prescriptive data science answers the question of how (do we optimize the data toward a specific goal).
Let's get straight to the point from the start: I strongly think that the answer is yes.
However, that was not always the case. A few years back, when I first started hearing about data science as a concept, I initially thought that it was yet another marketing buzzword to describe an activity that already existed in the industry: Business Intelligence (BI). As a developer and architect working mostly on solving complex system integration problems, I found it easy to convince myself that I didn't need to get directly involved in data science projects, even though it was obvious that their numbers were on the rise; the reason being that developers traditionally deal with data pipelines as black boxes that are accessible with well-defined APIs. However, in the last decade, we've seen exponential growth in data science interest both in academia and in the industry, to the point that it became clear that this model would not be sustainable.
As data analytics plays a bigger and bigger role in companies' operational processes, the developer's role has expanded to get closer to the algorithms and to build the infrastructure that runs them in production. Another piece of evidence that data science has become the new gold rush is the extraordinary growth of data scientist jobs, which have been ranked number one for 2 years in a row on Glassdoor (https://www.prnewswire.com/news-releases/glassdoor-reveals-the-50-best-jobs-in-america-for-2017-300395188.html) and are consistently posted the most by employers on Indeed. Headhunters are also on the prowl on LinkedIn and other social media platforms, sending tons of recruiting messages to whoever has a profile showing any data science skills.
One of the main reasons behind all the investment being made into these new technologies is the hope that they will yield major improvements and greater efficiencies in the business. However, even though it is a growing field, data science in the enterprise today is still confined to experimentation instead of being a core activity, as one would expect given all the hype. This has led a lot of people to wonder if data science is a passing fad that will eventually subside, yet another technology bubble that will eventually pop, leaving a lot of people behind.
These are all good points, but I quickly realized that it was more than just a passing fad; more and more of the projects I was leading included the integration of data analytics into the core product features. Finally, it was when the IBM Watson Question Answering system won a game of Jeopardy! against two experienced champions that I became convinced that data science, along with the cloud, big data, and Artificial Intelligence (AI), was here to stay and would eventually change the way we think about computer science.
There are multiple factors involved in the meteoric rise of data science.
First, the amount of data being collected keeps growing at an exponential rate. According to recent market research from the IBM Marketing Cloud (https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=WRL12345GBEN), something like 2.5 quintillion bytes are created every day (to give you an idea of how big that is, that's 2.5 billion billion bytes), yet only a tiny fraction of this data is ever analyzed, leaving tons of missed opportunities on the table.
Second, we're in the midst of a cognitive revolution that started a few years ago; almost every industry is jumping on the AI bandwagon, which includes natural language processing (NLP) and machine learning. Even though these fields have existed for a long time, they have recently enjoyed renewed attention, to the point that they are now among the most popular courses in colleges as well as getting the lion's share of open source activities. It is clear that, if they are to survive, companies need to become more agile, move faster, and transform into digital businesses, and as the time available for decision-making shrinks to near real-time, they must become fully data-driven. If you also include the fact that AI algorithms need high-quality data (and a lot of it) to work properly, we can start to understand the critical role played by data scientists.
Third, with advances in cloud technologies and the development of Platform as a Service (PaaS), access to massive compute engines and storage has never been easier or cheaper. Running big data workloads, once the purview of large corporations, is now available to smaller organizations or any individuals with a credit card; this, in turn, is fueling the growth of innovation across the board.
For these reasons, I have no doubt that, similar to the AI revolution, data science is here to stay and that its growth will continue for a long time. But we also can't ignore the fact that data science hasn't yet realized its full potential and produced the expected results, in particular helping companies in their transformation into data-driven organizations. Most often, the challenge is achieving that next step, which is to transform data science and analytics into a core business activity that ultimately enables clear-sighted, intelligent, bet-the-business decisions.
This is a very important question that we'll spend a lot of time developing in the coming chapters. Let me start by looking back at my professional journey; I spent most of my career as a developer, dating back over 20 years, working on many aspects of computer science.
I started by building various tools that helped with software internationalization by automating the process of translating the user interface into multiple languages. I then worked on a LotusScript (scripting language for Lotus Notes) editor for Eclipse that would interface directly with the underlying compiler. This editor provided first-class development features, such as content assist, which provides suggestions, real-time syntax error reporting, and so on. I then spent a few years building middleware components based on Java EE and OSGI (https://www.osgi.org) for the Lotus Domino server. During that time, I led a team that modernized the Lotus Domino programming model by bringing it to the latest technologies available at the time. I was comfortable with all aspects of software development, frontend, middleware, backend data layer, tooling, and so on; I was what some would call a full-stack developer.
That was until I saw a demo of the IBM Watson Question Answering system that beat longtime champions Brad Rutter and Ken Jennings at a game of Jeopardy! in 2011. Wow! This was groundbreaking, a computer program capable of answering natural language questions. I was very intrigued and, after doing some research, meeting with a few researchers involved in the project, and learning about the techniques used to build this system, such as NLP, machine learning, and general data science, I realized how much potential this technology would have if applied to other parts of the business.
A few months later, I got an opportunity to join the newly formed Watson Division at IBM, leading a tooling team with the mission to build data ingestion and accuracy analysis capabilities for the Watson system. One of our most important requirements was to make sure the tools were easy for our customers to use, which is why, in retrospect, giving this responsibility to a team of developers was the right move. From my perspective, stepping into that job was both challenging and enriching. I was leaving a familiar world, where I excelled at designing architectures based on well-known patterns and implementing frontend, middleware, or backend software components, for a world focused mostly on working with a large amount of data: acquiring it, cleansing it, analyzing it, visualizing it, and building models. I spent the first six months drinking from the firehose, reading, and learning about NLP, machine learning, information retrieval, and statistical data science, at least enough to be able to work on the capabilities I was building.
It was at that time, interacting with the research team to bring these algorithms to market, that I realized how important it was for developers and data scientists to collaborate better. The traditional approach of having data scientists solve complex data problems in isolation and then throw the results "over the wall" to developers to operationalize is not sustainable and doesn't scale, considering that the amount of data to process keeps growing exponentially and the required time to market keeps shrinking.
Instead, the two roles need to shift toward working as one team, which means that data scientists must work and think like software developers and vice versa. Indeed, this looks very good on paper: on the one hand, data scientists will benefit from tried-and-true software development methodologies such as Agile—with its rapid iterations and frequent feedback approach—but also from a rigorous software development life cycle that brings compliance with enterprise needs, such as security, code reviews, source control, and so on. On the other hand, developers will start thinking about data in a new way: as analytics meant to discover insights, instead of just a persistence layer with queries and CRUD (short for create, read, update, delete) APIs.
After 4 years as the Watson Core Tooling lead architect building self-service tooling for the Watson Question Answering system, I joined the Developer Advocacy team of the Watson Data Platform organization which has the expanded mission of creating a platform that brings the portfolio of data and cognitive services to the IBM public cloud. Our mission was rather simple: win the hearts and minds of developers and help them be successful with their data and AI projects.
The work had multiple dimensions: education, evangelism, and activism. The first two are pretty straightforward, but the concept of activism is relevant to this discussion and worth explaining in more detail. As the name implies, activism is about bringing change where change is needed. For our team of 15 developer advocates, this meant walking in the shoes of developers as they try to work with data—whether they're only getting started or already operationalizing advanced algorithms—feeling their pain and identifying the gaps that should be addressed. To that end, we built and open-sourced numerous sample data pipelines with real-life use cases.
At a minimum, each of these projects needed to satisfy three requirements:
The experience and insights we gained from these exercises were invaluable:
The metrics that guided our choices were multiple: accuracy, scalability, code reusability, but most importantly, improved collaboration between data scientists and developers.
Early on, we wanted to build a data pipeline that extracted insights from Twitter by doing sentiment analysis of tweets containing specific hashtags and to deploy the results to a real-time dashboard. This application was a perfect starting point for us, because the data science analytics were not too complex, and the application covered many aspects of a real-life scenario:
To try things out, the first implementation was a simple Python application that used the tweepy library (a popular Twitter client library for Python: https://pypi.python.org/pypi/tweepy) to connect to Twitter and get a stream of tweets, and textblob (a simple Python library for basic NLP: https://pypi.python.org/pypi/textblob) for sentiment analysis enrichment.
The results were then saved into a JSON file for analysis. This prototype was a great way to get things started and experiment quickly, but after a few iterations we quickly realized that we needed to get serious and build an architecture that satisfied our enterprise requirements.
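For the curious reader, a minimal sketch of what that first prototype might have looked like follows; it assumes the classic tweepy 3.x streaming API, placeholder credentials, and a hypothetical output file, so treat it as an illustration rather than the actual code we wrote:

```python
import json
import tweepy
from textblob import TextBlob

# Placeholder credentials; real ones come from the Twitter developer portal
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class SentimentListener(tweepy.StreamListener):
    def on_status(self, status):
        enriched = {
            "text": status.text,
            # TextBlob polarity ranges from -1 (negative) to 1 (positive)
            "polarity": TextBlob(status.text).sentiment.polarity,
        }
        # Append each enriched tweet to a JSON-lines file for later analysis
        with open("tweets.json", "a") as f:
            f.write(json.dumps(enriched) + "\n")

stream = tweepy.Stream(auth=auth, listener=SentimentListener())
stream.filter(track=["#python"])  # follow tweets containing a specific hashtag
```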
At a high level, data pipelines can be described using the following generic blueprint:
Data pipeline workflow
The main objective of a data pipeline is to operationalize (that is, provide direct business value) the data science analytics outcome in a scalable, repeatable process, and with a high degree of automation. Examples of analytics could be a recommendation engine to entice consumers to buy more products, for example, the Amazon recommended list, or a dashboard showing Key Performance Indicators (KPIs) that can help a CEO make future decisions for the company.
There are multiple people involved in building a data pipeline:
In real life, it is not uncommon for the same person to play more than one of the roles described here; this may mean that one person has multiple, different needs when interacting with a data pipeline.
As the preceding diagram suggests, building a data science pipeline is iterative in nature and adheres to a well-defined process:
In the industry, the reality is that
