Predictive Analytics with TensorFlow - Md. Rezaul Karim - E-Book

Predictive Analytics with TensorFlow E-Book

Md. Rezaul Karim

Description

Harness the power of data in your business by building advanced predictive modeling applications with TensorFlow.

About This Book

  • A quick guide to gaining hands-on experience with deep learning in different domains, such as digit/image classification and text
  • Build your own smart, predictive models with TensorFlow using the easy-to-follow approach described in the book
  • Understand deep learning and predictive analytics along with its challenges and best practices

Who This Book Is For

This book is intended for anyone who wants to build predictive models with the power of TensorFlow from scratch. If you want to build your own extensive applications that work and can predict smart decisions in the future, then this book is what you need!

What You Will Learn

  • Get a solid and theoretical understanding of linear algebra, statistics, and probability for predictive modeling
  • Develop predictive models using classification, regression, and clustering algorithms
  • Develop predictive models for NLP
  • Learn how to use reinforcement learning for predictive analytics
  • Factorization Machines for advanced recommendation systems
  • Get a hands-on understanding of deep learning architectures for advanced predictive analytics
  • Learn how to use deep Neural Networks for predictive analytics
  • See how to use recurrent Neural Networks for predictive analytics
  • Convolutional Neural Networks for emotion recognition, image classification, and sentiment analysis

In Detail

Predictive analytics discovers hidden patterns from structured and unstructured data for automated decision-making in business intelligence.

This book will help you build, tune, and deploy predictive models with TensorFlow in three main sections. The first section covers linear algebra, statistics, and probability theory for predictive modeling.

The second section covers developing predictive models via supervised (classification and regression) and unsupervised (clustering) algorithms. It then explains how to develop predictive models for NLP and covers reinforcement learning algorithms. Lastly, this section covers developing a factorization machines-based recommendation system.

The third section covers deep learning architectures for advanced predictive analytics, including deep neural networks and recurrent neural networks for high-dimensional and sequence data. Finally, convolutional neural networks are used for predictive modeling for emotion recognition, image classification, and sentiment analysis.

Style and approach

TensorFlow, a popular library for machine learning, embraces the innovation and community-engagement of open source, but has the support, guidance, and stability of a large corporation.

Page count: 554


Table of Contents

Predictive Analytics with TensorFlow
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Basic Python and Linear Algebra for Predictive Analytics
A basic introduction to predictive analytics
Why predictive analytics?
Working principles of a predictive model
A bit of linear algebra
Programming linear algebra
Installing and getting started with Python
Installing on Windows
Installing Python on Linux
Installing and upgrading PIP (or PIP3)
Installing Python on Mac OS
Installing packages in Python
Getting started with Python
Python data types
Using strings in Python
Using lists in Python
Using tuples in Python
Using dictionary in Python
Using sets in Python
Functions in Python
Classes in Python
Vectors, matrices, and graphs
Vectors
Matrices
Matrix addition
Matrix subtraction
Multiplying two matrices
Finding the determinant of a matrix
Finding the transpose of a matrix
Solving simultaneous linear equations
Eigenvalues and eigenvectors
Span and linear independence
Principal component analysis
Singular value decomposition
Data compression in a predictive model using SVD
Predictive analytics tools in Python
Summary
2. Statistics, Probability, and Information Theory for Predictive Modeling
Using statistics in predictive modeling
Statistical models
Parametric versus nonparametric model
Parametric predictive models
Nonparametric predictive models
Population and sample
Random sampling
Expectation
Central limit theorem
Skewness and data distribution
Standard deviation and variance
Covariance and correlation
Interquartile, range, and quartiles
Hypothesis testing
Chi-square tests
Chi-square independence test
Basic probability for predictive modeling
Probability and the random variables
Generating random numbers and setting the seed
Probability distributions
Marginal probability
Conditional probability
The chain rule of conditional probability
Independence and conditional independence
Bayes' rule
Using information theory in predictive modeling
Self-information
Mutual information
Entropy
Shannon entropy
Joint entropy
Conditional entropy
Information gain
Using information theory
Using information theory in Python
Summary
3. From Data to Decisions – Getting Started with TensorFlow
Taking decisions based on data - Titanic example
Data value chain for making decisions
From disaster to decision – Titanic survival example
General overview of TensorFlow
Installing and configuring TensorFlow
Installing TensorFlow on Linux
Installing Python and nVidia driver
Installing NVIDIA CUDA
Installing NVIDIA cuDNN v5.1+
Installing the libcupti-dev library
Installing TensorFlow
Installing TensorFlow with native pip
Installing with virtualenv
Installing TensorFlow from source
Testing your TensorFlow installation
TensorFlow computational graph
TensorFlow programming model
Data model in TensorFlow
Tensors
Rank
Shape
Data type
Variables
Fetches
Feeds and placeholders
TensorBoard
How does TensorBoard work?
Getting started with TensorFlow – linear regression and beyond
Source code for the linear regression
Summary
4. Putting Data in Place - Supervised Learning for Predictive Analytics
Supervised learning for predictive analytics
Linear regression - revisited
Problem statement
Using linear regression for movie rating prediction
From disaster to decision - Titanic example revisited
An exploratory analysis of the Titanic dataset
Feature engineering
Logistic regression for survival prediction
Using TensorFlow contrib
Linear SVM for survival prediction
Ensemble method for survival prediction: random forest
A comparative analysis
Summary
5. Clustering Your Data - Unsupervised Learning for Predictive Analytics
Unsupervised learning and clustering
Using K-means for predictive analytics
How K-means works
Using K-means for predicting neighborhoods
Predictive models for clustering audio files
Using kNN for predictive analytics
Working principles of kNN
Implementing a kNN-based predictive model
Summary
6. Predictive Analytics Pipelines for NLP
NLP analytics pipelines
Using text analytics
Transformers and estimators
Standard transformer
Estimator transformer
StopWordsRemover
N-gram
Using BOW for predictive analytics
Bag-of-words
The problem definition
The dataset description and exploration
Spam prediction using LR and BOW with TensorFlow
TF-IDF model for predictive analytics
How to compute TF, IDF, and TFIDF?
Implementing a TF-IDF model for spam prediction
Using Word2vec for sentiment analysis
Continuous bag-of-words
Continuous skip-gram
Using CBOW for word embedding and model building
CBOW model building
Reusing the CBOW for predicting sentiment
Summary
7. Using Deep Neural Networks for Predictive Analytics
Deep learning for better predictive analytics
Artificial Neural Networks
Deep Neural Networks
DNN architectures
Multilayer perceptrons
Training an MLP
Using MLPs
DNN performance analysis
Fine-tuning DNN hyperparameters
Number of hidden layers
Number of neurons per hidden layer
Activation functions
Weight and biases initialization
Regularization
Using multilayer perceptrons for predictive analytics
Dataset description
Preprocessing
A TensorFlow implementation of MLP
Deep belief networks
Restricted Boltzmann Machines
Construction of a simple DBN
Unsupervised Pretraining
Using deep belief networks for predictive analytics
Summary
8. Using Convolutional Neural Networks for Predictive Analytics
CNNs and the drawbacks of regular DNNs
CNN architecture
Convolutional operations
Applying convolution operations in TensorFlow
Pooling layer and padding operations
Applying subsampling operations in TensorFlow
Tuning CNN hyperparameters
CNN-based predictive model for sentiment analysis
Exploring movie and product review datasets
Using CNN for predictive analytics about movie reviews
CNN model for emotion recognition
Dataset description
CNN architecture design
Testing the model on your own image
Using complex CNN for predictive analytics
Dataset description
CNN predictive model for image classification
Summary
9. Using Recurrent Neural Networks for Predictive Analytics
RNN architecture
Contextual information and the architecture of RNNs
BRNNs
LSTM networks
GRU cell
Using BRNN for image classification
Implementing an RNN for spam prediction
Developing a predictive model for time series data
Description of the dataset
Preprocessing and exploratory analysis
LSTM predictive model
Model evaluation
An LSTM predictive model for sentiment analysis
Network design
LSTM model training
Visualizing through TensorBoard
LSTM model evaluation
Summary
10. Recommendation Systems for Predictive Analytics
Recommendation systems
Collaborative filtering approaches
Content-based filtering approaches
Hybrid recommendation systems
Model-based collaborative filtering
Collaborative filtering approach for movie recommendations
The utility matrix
Dataset description
Ratings data
Movies data
User data
Exploratory analysis of the dataset
Implementing a movie recommendation engine
Training the model with available ratings
Inferencing the saved model
Generating a user-item table
Clustering similar movies
Movie rating prediction by users
Finding the top K movies
Predicting top K similar movies
Computing the user-user similarity
Evaluating the recommendation system
Factorization machines for recommendation systems
Factorization machines
The cold start problem in recommendation systems
Problem definition and formulation
Dataset description
Preprocessing
Implementing an FM model
Improved factorization machines for predictive analytics
Neural factorization machines
Dataset description
Using NFM for movie recommendations
Model training
Model evaluation
Summary
11. Using Reinforcement Learning for Predictive Analytics
Reinforcement learning
Reinforcement learning in predictive analytics
Notation, policy, and utility in RL
Policy
Utility
Developing a multiarmed bandit's predictive model
Developing a stock price predictive model
Summary
Index

Predictive Analytics with TensorFlow

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2017

Production reference: 1251017

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78839-892-3

www.packtpub.com

Credits

Author

Md. Rezaul Karim

Reviewers

Andrea Mostosi

Meng-Chieh Ling

Commissioning Editor

Sunith Shetty

Acquisition Editor

Chandan Kumar

Content Development Editor

Amrita Noronha

Technical Editor

Sayali Thanekar

Copy Editor

Safis Editing

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Pratik Shirodkar

Graphics

Tania Dutta

Production Coordinator

Aparna Bhagat

About the Author

Md. Rezaul Karim is a Research Scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Aachen, Germany. He holds a BSc and an MSc degree in Computer Science. Before joining Fraunhofer FIT, he worked as a Researcher at Insight Centre for Data Analytics, Ireland. Before this, he worked as a Lead Engineer at Samsung Electronics' distributed R&D Institutes in Korea, India, Turkey, and Bangladesh. Previously, he worked as a Research Assistant at the database lab, Kyung Hee University, Korea. He also worked as an R&D engineer with BMTech21 Worldwide, Korea. Even before this, he worked as a Software Engineer with i2SoftTechnology, Dhaka, Bangladesh.

He has more than 8 years of experience in the area of research and development, with a solid understanding of algorithms and data structures in C, C++, Java, Scala, R, and Python. He has published several books, articles, and research papers concerning big data and virtualization technologies, such as Spark, Kafka, DC/OS, Docker, Mesos, Zeppelin, Hadoop, and MapReduce. He is equally competent with deep learning technologies such as TensorFlow, DeepLearning4j, and H2O. His research interests include Machine Learning, Deep Learning, Semantic Web, Linked Data, Big Data, and Bioinformatics. He is also the author of the following book titles:

  • Large-Scale Machine Learning with Spark (Packt Publishing Ltd.)
  • Deep Learning with TensorFlow (Packt Publishing Ltd.)
  • Scala and Spark for Big Data Analytics (Packt Publishing Ltd.)

Acknowledgments

I am very grateful to my parents, who have always encouraged me to pursue knowledge. I also want to thank my wife, Saroar; son, Shadman; brother, Mamtaz; sister, Josna; and friends who have endured my long monologs about the subjects in this book and always have encouraged and listened to me. Writing this book was made easier by the amazing efforts of the open source community and the great documentation of many projects out there related to TensorFlow and Python. Further, I would like to thank the acquisition, content development, and technical editors of Packt Publishing Ltd. (and, of course, others who were involved in this book title) for their sincere cooperation and coordination. Additionally, without the work of numerous researchers and deep learning practitioners who shared their expertise in publications, lectures, and source code, this book might not have existed at all! Finally, I appreciate the efforts of the TensorFlow community and all those who have contributed to APIs, whose work ultimately brought machine learning to the masses.

About the Reviewers

Andrea Mostosi is a technology enthusiast, a husband, and a father. During the last 10 years, he led the entire life cycle of several projects across different technologies, companies, and markets. He is now working on artificial intelligence, data mining, and a lot of other scary things.

I'd like to thank my wonderful son, Ryan, for every smile, every hug, and every sleepless night he gave me since his birth. When the machines finally take over humanity, you'll be able to say that your father has contributed to making this happen, my son.

Meng-Chieh Ling is a theoretical physics PhD from Karlsruhe Institute of Technology. After finishing his PhD, he attended The Data Incubator Reply to change his career from theoretical physics to data science.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788398920.

If you'd like to join our team of regular reviewers, you can email us at <[email protected]>. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

The continued growth in data, coupled with the need to make increasingly complex decisions against that data, is creating massive hurdles that prevent organizations from deriving insights in a timely manner using traditional approaches. Machine learning is concerned with algorithms that transform raw data into information and then into actionable intelligence. This makes machine learning well suited to predictive analytics. Without machine learning, it would be nearly impossible to keep up with these massive streams of information.

Deep learning, on the other hand, is a branch of machine learning based on learning multiple levels of representation. A deep learning algorithm is nothing more than the implementation of a complex and deep neural network so that it can learn through the analysis of large amounts of data. Thus, it took just a few years to develop powerful deep learning algorithms that recognize images, process natural language, and perform a myriad of other complex tasks.

Considering these motivations and requirements, this book is dedicated to developers, data analysts, machine learning practitioners, and deep learning enthusiasts who want to build powerful, robust, and accurate predictive models with the power of TensorFlow from scratch, in combination with other open source Python libraries.

The first section of this book covers applied math, statistics, and probability theory for predictive analytics. It will then cover useful Python packages for getting started with data science in a practical manner. The second section shows how to develop large-scale predictive analytics pipelines using supervised learning algorithms, for example, classification and regression; and unsupervised learning algorithms, for example, clustering. It'll then demonstrate how to develop predictive models for NLP.

Finally, reinforcement learning and a factorization machine-based recommendation system will be used to develop predictive models. The third section covers practical mastery of deep learning architectures for advanced predictive analytics, including deep neural networks and recurrent neural networks for high-dimensional and sequence data. Finally, it'll show how to develop convolutional neural networks-based predictive models for emotion recognition, image classification, and sentiment analysis.

Happy Reading!

What this book covers

Chapter 1, Basic Python and Linear Algebra for Predictive Analytics, discusses the basic concepts in linear algebra for predictive analytics, such as vectors, matrices, tensors, linear dependence, and span. Then, we move on to a brief introduction to Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). Finally, some predictive modeling tools in Python will be discussed.

Chapter 2, Statistics, Probability, and Information Theory for Predictive Modeling, covers some statistical, probabilistic, and information theory concepts before getting started on predictive analytics: random sampling, hypothesis testing, chi-square tests, correlation, expectation, variance, covariance, Bayes' rule, and so on. It then discusses the central objects of probability theory: random variables, stochastic processes, and events. Information theory, which studies the quantification, storage, and communication of information, will be discussed at the end of the chapter.

Chapter 3, From Data to Decisions - Getting Started with TensorFlow, provides a detailed description of the main TensorFlow features in a real-life problem, followed by detailed discussions about TensorFlow installation and configuration. It then covers computation graphs, data, and programming models before getting started with TensorFlow. The last part of the chapter contains an example of implementing a linear regression model for predictive analytics.

Chapter 4, Putting Data in Place - Supervised Learning for Predictive Analytics, covers some TensorFlow-based supervised learning techniques from a theoretical and practical perspective. In particular, the linear regression model for regression analysis will be covered on a real dataset. It then shows how we could solve the Titanic survival problem using logistic regression, random forests, and SVMs for predictive analytics.

Chapter 5, Clustering Your Data - Unsupervised Learning for Predictive Analytics, digs deeper into predictive analytics and finds out how we can take advantage of it to cluster records belonging to a certain group or class for a dataset of unsupervised observations. It will then provide some practical examples of unsupervised learning. Particularly, clustering techniques using TensorFlow will be discussed with some hands-on examples.

Chapter 6, Predictive Analytics Pipelines for NLP, shows how to use TensorFlow for text analytics with a focus on text classification from an unstructured spam prediction and movie review dataset. Based on the spam filtering dataset, it shows how to develop predictive models using a linear regression algorithm with TensorFlow. Particularly, it will use the bag-of-words (BOW) and TF-IDF algorithms for spam prediction. Later on, it will also show how to develop large-scale predictive models for predicting sentiment from the movie review dataset using the continuous bag-of-words (CBOW) and continuous skip-gram algorithms.

Chapter 7, Using Deep Neural Networks for Predictive Analytics, demonstrates how to train DNNs and analyze the performance metrics that are needed to evaluate a DNN predictive model. It also shows how to tune the hyperparameters for DNNs for better and optimized performance. It will provide two examples on how to build very robust and accurate predictive models for predictive analytics as well, in particular, using Deep Belief Networks (DBN) and Multilayer Perceptron (MLP) on a bank marketing dataset.

Chapter 8, Using Convolutional Neural Networks for Predictive Analytics, discusses how to develop predictive analytics applications such as emotion recognition, image classification, and text classification using the convolutional neural network algorithm on real image/text datasets. Finally, it will provide some pointers on how to tune and debug CNN-based networks for optimized performance.

Chapter 9, Using Recurrent Neural Networks for Predictive Analytics, provides some theoretical background for RNNs. Then, it shows a few examples of implementing predictive models for image classification, sentiment analysis of movie and product reviews, and spam prediction for NLP. Finally, it shows how to develop predictive models for time-series data.

Chapter 10, Recommendation Systems for Predictive Analytics, provides several examples of how to develop recommendation systems for predictive analytics followed by some theoretical background of recommendation systems, for example, matrix factorization. Later in the chapter, an example of developing a movie recommendation engine using SVD and K-means will be shown. Finally, the chapter shows how we could use factorization machines to develop a more accurate and robust recommendation system.

Chapter 11, Using Reinforcement Learning for Predictive Analytics, talks about designing machine learning systems driven by criticism and rewards. It will show several examples of how to apply reinforcement learning algorithms for developing predictive models on real-life datasets.

What you need for this book

All the examples have been implemented in Python 2 and 3 with TensorFlow 1.2.0+. You will also need some additional software and tools. To be more specific, the following tools and libraries are required, preferably the latest version:

  • Python (2.7.x or 3.3+)
  • TensorFlow (1.0.0+)
  • Bazel (latest version)
  • pip/pip3 (latest version for Python 2 and 3 respectively)
  • matplotlib (latest version)
  • pandas (latest version)
  • NumPy (latest version)
  • SciPy (latest version)
  • sklearn (latest version)
  • yahoo_finance (latest version)
  • CUDA (latest version)
  • CuDNN (latest version)
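As a quick sanity check before working through the examples, the requirements listed above can be verified with a few lines of standard-library Python. This is only a minimal sketch: it assumes Python 3.4 or later (for `importlib.util.find_spec`), the helper names `python_version_ok` and `missing_packages` are made up for illustration, and the import names in `REQUIRED` are assumed to match the packages named in the list.

```python
import importlib.util
import sys

# Import names assumed to correspond to the packages listed above.
REQUIRED = ["tensorflow", "matplotlib", "pandas", "numpy", "scipy", "sklearn"]


def python_version_ok(version_info=sys.version_info):
    """Return True for Python 2.7.x or Python 3.3+, as stated in the requirements."""
    major, minor = version_info[0], version_info[1]
    return (major == 2 and minor >= 7) or (major == 3 and minor >= 3)


def missing_packages(names):
    """Return the top-level packages that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version OK:", python_version_ok())
print("Missing packages:", missing_packages(REQUIRED) or "none")
```

Anything reported as missing can then be installed with pip or pip3, as described in Chapter 1.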

Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). To be more specific, for Ubuntu it is recommended to have the 14.04 (LTS) 64-bit (or later) complete installation, or VMware Player 12 or VirtualBox. You can also run TensorFlow jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

A Core i5 or Core i7 processor with GPU support is recommended to get the best results. Multicore processing will provide faster data processing and better scalability for predictive analytics jobs. At least 8 GB of RAM is recommended for standalone mode, at least 32 GB for a single VM, and more for a cluster. You should also have enough storage for running heavy jobs (depending on the dataset size you will be handling), preferably at least 50 GB of free disk space.

Who this book is for

This book is dedicated to developers, data analysts, and deep learning enthusiasts who want to build powerful, robust, and accurate predictive models with the power of TensorFlow from scratch and in combination with other open source Python libraries. If you want to build your own extensive applications that work and can predict smart decisions in the future, then this book is what you need! A good command of object-oriented programming with Python is a prerequisite. Some competence in applied mathematics, statistics, linear algebra, and information theory is a plus and would help readers understand the concepts presented in this book.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book from.
  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Predictive-Analytics-with-TensorFlow. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PredictiveAnalyticswithTensorFlow_ColorImages.pdf

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyright material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Basic Python and Linear Algebra for Predictive Analytics

Predictive analytics (PA) is the use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data. The goal is to go beyond knowing what has happened to provide the best assessment of what will happen in the future. However, before we start developing predictive analytics models, knowing basic linear algebra, statistics, probability, and information theory with Python is a must. We will start with the basic concepts of linear algebra with Python.

In a nutshell, the following topics will be covered in this chapter:

  • What are predictive analytics and why do we use them?
  • What is linear algebra?
  • Installing and getting started with Python
  • Vectors, matrices, and tensors
  • Linear dependence and span
  • Principal component analysis (PCA)
  • Singular value decomposition (SVD)
  • Predictive modeling tools in Python

A basic introduction to predictive analytics

We will refer to a famous definition of machine learning by Tom Mitchell, where he explained what learning really means from a computer science perspective:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"

Based on this definition, we can conclude that a computer program or machine can:

  • Learn from data and histories
  • Improve with experience
  • Interactively enhance a model that can be used to predict an outcome

Typical machine learning tasks are concept learning, predictive modeling, clustering, and finding useful patterns. The ultimate goal is to improve the learning in such a way that it becomes automatic, so that human interaction is no longer needed or is at least reduced as much as possible.

Predictive analytics, on the other hand, is the process of extracting useful information from historical facts and stream data (consisting of live data objects) in order to determine hidden patterns and predict future outcomes and trends.

Tip

What doesn't predictive analytics do?

Predictive analytics does not tell you what will happen in the future. Rather, it is about creating predictive models that place a numerical value, or score, on the likelihood that a particular event will happen in the future, with an acceptable level of reliability. It also includes what-if scenarios and risk assessment.

Why predictive analytics?

In the area of business intelligence, with the right operations management platform, decision-makers can manage all of the business-related inputs, events, and data that provide real-time insight at the enterprise level. Predictive models can then be used to find useful patterns in historical, transactional, and recent data and so identify potential risks and opportunities. This is why predictive analytics is gaining much attention and wide acceptance. Traditional reporting and monitoring tools already let you move from reactive operations to proactive operations. PA helps you move beyond this to plan for the future and identify new areas of business for profit and productivity.

Working principles of a predictive model

At the core of predictive analytics, many machine learning methods can be formulated as a convex optimization problem: finding a minimizer of a convex function f that depends on a variable vector w (the weights), which has d entries. Formally, we can write this as the optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where the objective function is of the form:

$$f(w) = \lambda R(w) + \frac{1}{n} \sum_{i=1}^{n} L(w; x_i, y_i)$$

Here R(w) is the regularizer, L is the loss, the vectors $x_i \in \mathbb{R}^d$ are the training data points for 1≤i≤n, and $y_i \in \mathbb{R}$ are their corresponding labels that we want to predict eventually. We call the method linear if L(w;x,y) can be expressed as a function of $w^T x$ and y.

The objective function f has two components: i) a regularizer that controls the complexity of the model, and ii) the loss that measures the error of the model on the training data. The loss function L(w; ·) is typically a convex function in w. The fixed regularization parameter λ≥0 defines the trade-off between the two goals of minimizing the loss (that is, the training error) and minimizing model complexity (that is, avoiding overfitting). For more detailed discussion, interested readers should refer to Chapter 7, Using Deep Neural Networks for Predictive Analytics.
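To make the two components concrete, here is a minimal sketch of such an objective in NumPy. It assumes a squared loss, L(w; x, y) = ½(wᵀx − y)², and an L2 regularizer, R(w) = ½‖w‖². The text does not prescribe either choice; the function name objective and the toy data are purely illustrative.

```python
import numpy as np

def objective(w, X, y, lam):
    """Regularized objective: lam * R(w) + (1/n) * sum_i L(w; x_i, y_i).

    Assumptions (illustrative, not taken from the book):
    R(w) = 0.5 * ||w||^2 and L(w; x, y) = 0.5 * (w^T x - y)^2.
    """
    residuals = X.dot(w) - y                # w^T x_i - y_i for every i
    loss = 0.5 * np.mean(residuals ** 2)    # average loss over the n points
    regularizer = 0.5 * np.dot(w, w)        # model-complexity penalty
    return lam * regularizer + loss

# Toy data: n = 3 training points with d = 2 features each.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])   # fits the toy data exactly, so the loss term is 0

f_plain = objective(w, X, y, lam=0.0)   # loss only
f_reg = objective(w, X, y, lam=0.1)     # loss plus complexity penalty
print(f_plain, f_reg)
```

With λ = 0 only the training error counts. Increasing λ makes large weights more expensive, which is the trade-off described above.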

A simpler picture is given in figure 1: you have the current data or observations, and it is your job to use the black box to predict the future outcome based on the current data and historical facts. In this context, all the undecided values are called parameters, and the description (that is, the black box) is a PA model:

Figure 1: The main task in predictive analytics is predictive modeling, that is, using the black box

As an engineer or a developer, you have to write an algorithm that will observe existing parameters/data/samples/examples to train the black box and figure out how to tune parameters to achieve the best model for making predictions before the deployment. Wow, that's a mouthful! Don't worry; this concept will be clearer in upcoming chapters.

In machine learning, we observe an algorithm's performance in two stages: learning and inference. The ultimate target of the learning stage is to prepare and describe the available data as feature vectors, which are then used to train the model.

The learning stage is one of the most important stages, but it is also truly time-consuming. It involves transforming the training data into a list of vectors, usually called feature vectors, so that we can feed them to the learning algorithm. Training data also sometimes contains impure information that needs some pre-processing such as cleaning.

Once we have the feature vectors, the next step in this stage is preparing (or writing/reusing) the learning algorithm. The next important step is training the algorithm to prepare the predictive model. Typically, and depending on the data size, running an algorithm may take hours (or even days) before the features converge into a useful model, as shown in the following figure:

Figure 2: Learning and training a predictive model – it shows how to generate the feature vectors from the training data to train the learning algorithm that produces a predictive model

Tip

Common predictive analytics methods

Common predictive analytics methods include regression analysis, classification, time series forecasting, association rule mining, clustering, recommendation systems, text mining, sentiment analysis, and many more. To prepare the feature vectors, we need to know a little bit about mathematics, statistics, and so on.

The second most important stage is inference, which makes intelligent use of the model: predicting from never-before-seen data, making recommendations, deducing future rules, and so on. Typically, it takes less time than the learning stage and sometimes even runs in real time, as shown in the following figure:

Figure 3: Inferencing from an existing model towards predictive analytics (feature vectors are generated from unknown data for making predictions)

Thus, inferencing (see figure 4 for more) is all about testing the model against new (that is, unobserved) data and evaluating the performance of the model itself. However, in the whole process and for making the predictive model a successful one, data acts as the first-class citizen in all machine learning tasks.
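The two stages can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: ordinary least squares stands in for the learning algorithm, the data is synthetic (it follows y = 2x + 1 exactly), and the helper name add_bias is made up for this example.

```python
import numpy as np

# Learning stage: feature vectors and labels from (synthetic) training data.
# The data follows y = 2*x + 1 exactly; this is an assumption for illustration.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

def add_bias(X):
    """Append a constant 1 to each feature vector so an intercept can be learned."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Training: ordinary least squares stands in for the learning algorithm.
weights, _, _, _ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)

# Inference stage: build feature vectors from unseen data and predict.
X_new = np.array([[4.0], [10.0]])
predictions = add_bias(X_new).dot(weights)
print(predictions)
```

Note that the unseen data goes through the same feature transformation (add_bias) as the training data before the model is applied.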

In reality, the data that we feed to our machine learning systems must be mathematical objects, such as vectors, matrices, or graphs (in later chapters, we will refer to them as tensors to make it clearer) so that they can consume such data:

Figure 4: Feature vectors are everywhere - they are used in both learning and inferencing stages in predictive analytics

Depending on the available data and feature types, the performance of your predictive model can vary dramatically. Therefore, selecting the right features is one of the most important steps before inferencing takes place. This is called feature engineering, which can be defined as follows:

Tip

Feature engineering

In this process, domain knowledge about the data is used to create only selective or useful features that help prepare the feature vectors to be used so that a machine learning algorithm works.

For example, when buying a car, you often see features such as model name, color, horsepower, price, and number of seats. Considering all of these features, buying a car is not a trivial problem. The general machine learning rule of thumb is that the more data there is, the better the predictive model. However, having more features often degrades performance drastically, especially if the dataset is high-dimensional; this phenomenon is called the curse of dimensionality. We will see some examples in the following sections.

In addition, we also need to know how to represent and use such objects through better representation and transformation. This requires some basic (and sometimes advanced) maths, statistics, probability, and information theory.

For now, this is enough theory. Let's focus on learning some non-trivial topics of linear algebra, covering vectors, matrices, graphs, and so on. In Chapter 2, Statistics, Probability and Information Theory for Predictive Modeling, we will learn the basic statistics, probability, and information theory needed for developing PA models. These will be your helping hand as well as basic building blocks for the TensorFlow-based PA throughout subsequent chapters.
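As a rough illustration of how a car description might become a feature vector, the sketch below one-hot encodes the color and appends the numeric attributes. The records, the field names, and the list of colors are all hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical car records; field names and values are made up for illustration.
cars = [
    {"color": "red",  "horsepower": 150, "price": 20000, "seats": 5},
    {"color": "blue", "horsepower": 300, "price": 45000, "seats": 2},
]

COLORS = ["red", "blue", "black"]  # assumed closed set of categories

def to_feature_vector(car):
    """One-hot encode the color and append the numeric attributes."""
    one_hot = [1.0 if car["color"] == c else 0.0 for c in COLORS]
    numeric = [float(car["horsepower"]), float(car["price"]), float(car["seats"])]
    return np.array(one_hot + numeric)

features = np.vstack([to_feature_vector(c) for c in cars])
print(features.shape)  # (number of cars, number of features)
```

Every additional category in COLORS adds one more column, which hints at how quickly the number of dimensions can grow when many categorical features are encoded this way.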

Installing and getting started with Python

Python is one of the most popular programming languages. It is a high-level, interpreted, interactive, and object-oriented scripting language. Unfortunately, there has been a big split between Python versions: 2 versus 3, which could make things a bit confusing to newcomers. You can see the major difference between them at https://wiki.python.org/moin/Python2orPython3. But don't worry; I will lead you in the right direction for installing both major versions.

Installing on Windows

On the Python download page at https://www.python.org/downloads/, you'll find the latest release of Python 2 or Python 3 (2.7.13 and 3.6.1, respectively, at the time of writing). You can now select and download the installer (.exe) of either version. Installation is similar to installing other software on Windows.

Let's assume that you have installed both versions and now it's time to add the installation path to the environment variables.

To do so, click on Start, type advanced system settings, and then select View advanced system settings | System Properties | Advanced | the Environment Variables... button:

Figure 5: Creating a system variable for Python

Python 3 is usually listed in the User variables for your account (Jason in this example), but Python 2 is listed under the System variables as follows:

Figure 6: Showing how to add the Python installation location as system path

There are a few ways you can remedy this situation. The simplest way is to make changes that can give us access to python for Python 2 and python3 for Python 3. For this, go to the folder where you have installed Python 3. It should be something like this: C:\Users\[username]\AppData\Local\Programs\Python\Python36 by default.

Make a copy of the python.exe file, and rename that copy (not the original) to python3.exe as shown in the following screenshot:

Figure 7: Fixing Python 2 versus Python 3 issue

Open a new Command Prompt (the environment variables refresh with each new Command Prompt you open), and type python3 --version:

Figure 8: Showing Python 2 and Python 3 version

Fantastic, now you're ready for whatever Python project you want to tackle.

Installing Python on Linux

For those of you who are new to Python, Python 2.7.x and 3.x usually come preinstalled on Ubuntu. Check whether Python 2 or Python 3 is installed using the following commands:

$ python -V
>> Python 2.7.13
$ which python
>> /usr/bin/python

For Python 3.3+ use the following:

$ python3 -V
>> Python 3.6.1

If you want a very specific version:

$ sudo apt-cache show python3
$ sudo apt-get install python3=3.6.1*
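Whichever platform you use, you can also confirm from inside Python which interpreter is actually running. The following is a small sketch that uses only the standard sys module; the helper name describe_interpreter is illustrative.

```python
import sys

def describe_interpreter():
    """Return the major version and the executable path of the running interpreter."""
    return sys.version_info[0], sys.executable

major, path = describe_interpreter()
print("Python %d.%d.%d at %s" % (sys.version_info[0], sys.version_info[1],
                                  sys.version_info[2], path))
```

Running this script with python and then with python3 should report a different major version for each, if both are set up as described.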

Installing and upgrading PIP (or PIP3)

The pip or pip3 package manager usually comes with Ubuntu. Make sure to check whether pip or pip3 is installed using the following command:

$ pip -V
>> pip 9.0.1 from /usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg (python 2.7)

For Python 3.3+ use the following:

$ pip3 -V
>> pip 1.5.4 from /usr/lib/python3/dist-packages (python 3.4)

Note that pip version 8.1+ or pip3 version 1.5+ is recommended. If these versions are not installed, use the following command to either install or upgrade to the latest pip version:

$ sudo apt-get install python-pip python-dev

For Python 3.3+, use the following command:

$ sudo apt-get install python3-pip python3-dev

Installing Python on Mac OS

Before installing Python, you should install a C compiler. The fastest way of doing so is to install the Xcode command-line tools by running the following command:

$ xcode-select --install

Alternatively, you can also download the full version of Xcode from the Mac App Store.

If you already have Xcode installed on your Mac machine, do not install OSX-GCC-Installer. In combination, the two can cause issues that are really difficult to diagnose and get rid of.

Although Mac OS comes with a large number of Unix utilities, one key component is missing: a package manager. Homebrew fills this gap and can be installed using the following command:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Add the Homebrew installation path to the PATH environment variable by inserting the following line into your ~/.profile file:

export PATH=/usr/local/bin:/usr/local/sbin:$PATH

Now, you're ready to install Python 2.7.x or 3.x. For Python 2.7.x issue the following command:

$ brew install python

For Python 3 issue the following command:

$ brew install python3

Installing packages in Python

Additional packages (other than built-in packages) that will be used throughout this book can be installed via the pip installer program. We have already installed Python pip for Python 2.7.x and Python 3.x. Now to install a Python package or module, you can execute pip on the command line (Windows) or terminal (Linux/Mac OS):

$ sudo pip install PackageName # For Python3 use pip3

Already installed packages can be updated via the --upgrade flag by issuing the following command:

$ sudo pip install PackageName --upgrade # For Python3, use pip3
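When both Python 2 and Python 3 are installed, it is easy to install a package for the wrong interpreter. One common way around this is to invoke pip as a module of the interpreter you intend to use. The sketch below assumes Python 3.4 or later; the helper names is_installed and pip_command are hypothetical and only build the command without running it.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None

def pip_command(package_name, upgrade=False):
    """Build a pip command bound to the interpreter running this script."""
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        command.append("--upgrade")
    command.append(package_name)
    return command

print(is_installed("json"))   # json is a standard-library module
print(pip_command("PackageName", upgrade=True))
# To actually run it: subprocess.check_call(pip_command("PackageName"))
```

Keep in mind that the name used with pip and the name used with import can differ (for example, the scikit-learn package is imported as sklearn), so is_installed expects the import name.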