Infuse an extra layer of intelligence into your Go applications with machine learning and AI
Key Features:
Build simple, maintainable, and easy-to-deploy machine learning applications with popular Go packages
Learn the statistics, algorithms, and techniques needed to implement machine learning
Overcome the common challenges faced while deploying and scaling machine learning workflows
Book Description:
This updated edition of the popular Machine Learning With Go shows you how to overcome the common challenges of integrating analysis and machine learning code within an existing engineering organization.
Machine Learning With Go, Second Edition, will begin by helping you gain an understanding of how to gather, organize, and parse real-world data from a variety of sources. The book also provides detailed coverage of developing machine learning pipelines, including predictive models, data visualizations, and statistical techniques. Next, you will learn how to make thorough use of Go libraries, including golearn, gorgonia, gosl, hector, and mat64. You will discover various TensorFlow capabilities, along with building simple neural networks and integrating them into machine learning models. You will also gain hands-on experience implementing essential machine learning techniques, such as regression, classification, and clustering, with the relevant Go packages. Furthermore, you will dive deep into the various Go tools that help you build deep neural networks. Lastly, you will become well versed in best practices for machine learning model tuning and optimization.
By the end of the book, you will have a solid machine learning mindset and a powerful Go toolkit of techniques, packages, and example implementations.
What you will learn
Become well versed with data processing, parsing, and cleaning using Go packages
Learn to gather data from various sources and in various real-world formats
Perform regression, classification, and image processing with neural networks
Evaluate and detect anomalies in a time series model
Understand common deep learning architectures to learn how each model is built
Learn how to optimize, build, and scale machine learning workflows
Discover the best practices for machine learning model tuning for successful deployments
Who this book is for:
This book is primarily for Go programmers who want to become machine learning engineers and build a solid machine learning mindset, along with a good grasp of Go packages. It is also useful for data analysts, data engineers, and machine learning users who want to run their machine learning experiments using the Go ecosystem. A prior understanding of linear algebra is required to benefit from this book.
Page count: 342
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Devika Battike
Content Development Editor: Snehal Kolte
Technical Editor: Naveen Sharma
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Jisha Chirayil
Production Coordinator: Aparna Bhagat
First published: September 2017
Second edition: April 2019
Production reference: 1300419
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78961-989-8
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Daniel Whitenack is a trained PhD data scientist with over 10 years' experience working on data-intensive applications in industry and academia. Recently, Daniel has focused his development efforts on open source projects related to running machine learning (ML) and artificial intelligence (AI) in cloud-native infrastructure (Kubernetes, for instance), maintaining reproducibility and provenance for complex data pipelines, and implementing ML/AI methods in new languages such as Go. Daniel co-hosts the Practical AI podcast, teaches data science/engineering at Ardan Labs and Purdue University, and has spoken at conferences around the world (including ODSC, PyCon, DataEngConf, QCon, GopherCon, Spark Summit, and Applied ML Days, among others).
Janani Selvaraj works as a senior research and analytics consultant for a start-up in Trichy, Tamil Nadu. She is a mathematics graduate with a PhD in environmental management. Her current interests include data wrangling and visualization, machine learning, and geospatial modeling. She currently trains students in data science and works as a consultant on several data-driven projects in a variety of domains. She is an R programming expert and the founder of the R-Ladies Trichy group, a group that promotes gender diversity. She has served as a reviewer for the Go Machine Learning Projects book.
Saurabh Chhajed is a machine learning and big data engineer with 9 years' professional experience across the enterprise application development life cycle, using the latest frameworks, tools, and design patterns. He has experience designing and implementing some of the most widely used and scalable customer-facing recommendation systems, with extensive use of the big data ecosystem (batch and real time) and machine learning pipelines. He has also worked for some of the largest investment banks, credit card companies, and manufacturing companies around the world, implementing a range of robust and scalable product suites.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Machine Learning With Go Second Edition
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Analysis in Machine Learning Workflows
Gathering and Organizing Data
Handling data – Gopher style
Best practices for gathering and organizing data with Go
CSV files
Reading in CSV data from a file
Handling unexpected fields
Handling unexpected data types
Manipulating CSV data with data frames
Web scraping 
JSON
Parsing JSON
JSON output
SQL-like databases
Connecting to an SQL database
Querying the database
Modifying the database
Caching
Caching data in memory
Caching data locally on disk
Data versioning
Pachyderm jargon
Deploying or installing Pachyderm
Creating data repositories for data versioning
Putting data into data repositories
Getting data out of versioned data repositories
Summary
References
Matrices, Probability, and Statistics
Matrices and vectors
Vectors
Vector operations
Matrices
Matrix operations
Statistics
Distributions
Statistical measures
Measures of central tendency
Measures of spread or dispersion
Visualizing distributions
Histograms
Box plots
Bivariate analysis 
Probability
Random variables
Probability measures
Independent and conditional probability
Hypothesis testing
Test statistics
Calculating p-values
Summary
References
Evaluating and Validating
Evaluating
Continuous metrics
Categorical metrics
Individual evaluation metrics for categorical variables
Confusion matrices, AUC, and ROC
Validating
Training and test sets
Holdout set
Cross-validation
Summary
References
Section 2: Machine Learning Techniques
Regression
Understanding regression model jargon
Linear regression
Overview of linear regression
Linear regression assumptions and pitfalls
Linear regression example
Profiling the data
Choosing our independent variable
Creating our training and test sets
Training our model
Evaluating the trained model
Multiple linear regression
Nonlinear and other types of regression
Summary
References
Classification
Understanding classification model jargon
Logistic regression
Overview of logistic regression
Logistic regression assumptions and pitfalls
Logistic regression example
Cleaning and profiling data
Creating our training and test sets
Training and testing the logistic regression model
k-nearest neighbors 
Overview of kNN
kNN assumptions and pitfalls
kNN example
Decision trees and random forests
Overview of decision trees and random forests
Decision tree and random forest assumptions and pitfalls
Decision tree example
Random forest example
Naive Bayes
Overview of Naive Bayes and its big assumption
Naive Bayes example
Summary
References
Clustering
Understanding clustering model jargon
Measuring distance or similarity
Evaluating clustering techniques
Internal clustering evaluation
External clustering evaluation
k-means clustering
Overview of k-means clustering
k-means assumptions and pitfalls
k-means clustering example
Profiling the data
Generating clusters with k-means
Evaluating the generated clusters
Other clustering techniques
Summary
References
Time Series and Anomaly Detection
Representing time series data in Go
Understanding time series jargon
Statistics related to time series
Autocorrelation
Partial autocorrelation
Auto-regressive models for forecasting
Auto-regressive model overview
Auto-regressive model assumptions and pitfalls
Auto-regressive model example
Transforming into a stationary series
Analyzing the ACF and choosing an AR order
Fitting and evaluating an AR(2) model
Auto-regressive moving averages and other time series models
Anomaly detection
Summary
References
Section 3: Advanced Machine Learning, Deployment, and Scaling
Neural Networks
Understanding neural net jargon
Building a simple neural network
Nodes in the network
Network architecture
Why do we expect this architecture to work?
Training our neural network
Utilizing the simple neural network
Training the neural network on real data
Evaluating the neural network
Summary
References
Deep Learning
Deep learning techniques and jargon
Deep learning with Go
Using the TensorFlow Go bindings
Install TensorFlow for Go
Retrieving and calling a pretrained TensorFlow model
Object detection using TensorFlow from Go
Using TensorFlow models from GoCV
Installing GoCV
Streaming webcam object detection with GoCV
Summary
References
Deploying and Distributing Analyses and Models
Running models reliably on remote machines
A brief introduction to Docker and Docker jargon
Dockerizing a machine learning application
Dockerizing the model training and export
Dockerizing model predictions
Testing the Docker images locally
Running the Docker images on remote machines
Building a scalable and reproducible machine learning pipeline
Setting up a Pachyderm and a Kubernetes cluster
Building a Pachyderm machine learning pipeline
Creating and filling the input repositories
Creating and running the processing stages
Updating pipelines and examining provenance
Scaling pipeline stages
Summary
References
Algorithms/Techniques Related to Machine Learning
Gradient descent
Entropy, information gain, and related methods
Backpropagation
Other Books You May Enjoy
Leave a review - let other readers know what you think
This updated edition of the popular Machine Learning With Go shows readers how to overcome the common challenges of integrating analysis and machine learning code within an existing engineering organization.
Machine Learning With Go, Second Edition, will begin by helping you gain an understanding of how to gather, organize, and parse real-world data from a variety of sources. The book also provides detailed information on developing machine learning pipelines, including predictive models, data visualizations, and statistical techniques. Next, you will learn about the use of Go libraries including golearn, gorgonia, gosl, hector, and mat64, among others. You will discover various TensorFlow capabilities, along with building simple neural networks and integrating them into machine learning models. You will also gain hands-on experience implementing essential machine learning techniques, such as regression, classification, and clustering, with the relevant Go packages. Furthermore, you will dive deep into the various Go tools that can help you build deep neural networks. Lastly, you will become well versed in best practices for machine learning model tuning and optimization.
By the end of the book, you will have a solid machine learning mindset and a powerful toolkit of Go techniques and packages, backed up with example implementations.
This book is primarily for Go programmers who want to become machine learning engineers, build a solid machine learning mindset, and improve their command of Go packages. It is also useful for data analysts, data engineers, and machine learning users who want to run their machine learning experiments using the Go ecosystem.
Chapter 1, Gathering and Organizing Data, covers the gathering, organization, and parsing of data from local and remote sources. Once the reader is done with this chapter, they will understand how to interact with data stored in various places and in various formats, how to parse and clean that data, and how to output that cleaned and parsed data.
Chapter 2, Matrices, Probability, and Statistics, covers statistical measures and operations key to day-to-day data analysis work. Once the reader is done with this chapter, they will understand how to perform solid summary data analysis, describe and visualize distributions, quantify hypotheses, and transform datasets with, for example, dimensionality reductions.
Chapter 3, Evaluating and Validating, covers evaluation and validation, which are key to measuring the performance of machine learning applications and ensuring that they generalize. Once the reader is done with this chapter, they will understand various metrics to gauge the performance of models (that is, to evaluate the model), as well as various techniques to validate the model more generally.
Chapter 4, Regression, covers regression, a widely used technique to model continuous variables, and a basis for other models. Regression produces models that are immediately interpretable. Thus, it can provide an excellent starting point when introducing predictive capabilities in an organization.
Chapter 5, Classification, covers classification, a machine learning technique distinct from regression in that the target variable is typically categorical or labeled. For example, a classification model may classify emails into spam and not-spam categories, or classify network traffic as fraudulent or not fraudulent.
Chapter 6, Clustering, covers clustering, an unsupervised machine learning technique used to form groupings of samples. At the end of this chapter, readers will be able to automatically form groupings of data points to better understand their structure.
Chapter 7, Time Series and Anomaly Detection, introduces techniques utilized to model time series data, such as stock prices and user events. After reading the chapter, the reader will understand how to evaluate various terms in a time series, build up a model of the time series, and detect anomalies in a time series.
Chapter 8, Neural Networks, introduces techniques utilized to perform regression, classification, and image processing with neural networks. After reading the chapter, the reader will understand how and when to apply these more complicated modeling techniques.
Chapter 9, Deep Learning, introduces deep learning techniques, along with the motivation behind them. After reading the chapter, the reader will understand how and when to apply these more complicated modeling techniques, and will understand the Go tooling available for building deep neural networks.
Chapter 10, Deploying and Distributing Analyses and Models, empowers readers to deploy the models developed throughout the book to production environments, and to distribute processing over production-scale data. The chapter will illustrate how both of these things can be done easily, without significant modifications to the code utilized throughout the book.
Appendix, Algorithms/Techniques Related to Machine Learning, can be referenced throughout the book and provides information about algorithms, optimizations, and techniques that are relevant to machine learning workflows.
Prior understanding of linear algebra is required to fully benefit from this book.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-With-Go-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789619898_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, we will get a solid understanding of how to parse and organize data within a Go program, with an emphasis on handling that data in a machine learning workflow.
This section will contain the following chapters:
Chapter 1, Gathering and Organizing Data
Chapter 2, Matrices, Probability, and Statistics
Chapter 3, Evaluating and Validating
Machine learning in general involves a series of steps, out of which the process of gathering and cleaning data consumes a lot of time. Polls have shown that 90% or more of a data scientist's time is spent gathering data, organizing it, and cleaning it—not training or tuning their sophisticated machine learning models. Why is this? Isn't the machine learning part the fun part? Why do we need to care so much about the state of our data?
Not all types of data are appropriate when using certain types of models. For example, certain models do not perform well when we have high-dimensional data (for example, text data), and other models assume that variables are normally distributed, which is definitely not always the case. Thus, we must take care to gather data that fits our use case and make sure that we understand how our data and models will interact.
Another reason why gathering and organizing data consumes so much of a data scientist's time is that data is often messy and hard to aggregate. In most organizations, data might be housed in various systems and formats, and have various access control policies.
To form a training or test set, or to supply variables to a model for predictions, we will likely need to deal with various formats of data, such as CSV, JSON, and database tables, and we will likely need to transform individual values. Common transformations include handling missing values, parsing date/times, converting categorical data to numerical data, normalizing values, and applying functions across values. Web scraping has also emerged as an important data source, and many data-driven organizations rely on it to add to their data repositories.
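To make this concrete, here is a minimal sketch, with made-up values, of a few such transformations in Go: parsing a date, encoding a categorical value as a number, and min-max normalizing a slice of values:

package main

import (
	"fmt"
	"time"
)

// normalize performs min-max scaling, mapping vals onto [0, 1].
func normalize(vals []float64) []float64 {
	min, max := vals[0], vals[0]
	for _, v := range vals {
		if v < min {
			min = v
		}
		if v > max {
			max = v
		}
	}
	out := make([]float64, len(vals))
	if max == min {
		return out // All values identical; leave the zeros.
	}
	for i, v := range vals {
		out[i] = (v - min) / (max - min)
	}
	return out
}

func main() {
	// Parse a date string into a time.Time value.
	t, err := time.Parse("2006-01-02", "2019-04-30")
	if err != nil {
		fmt.Println("could not parse date:", err)
		return
	}
	fmt.Println("parsed date:", t)

	// Convert a categorical value to a numerical encoding.
	encoding := map[string]int{"spam": 0, "not-spam": 1}
	fmt.Println("encoded category:", encoding["spam"])

	// Normalize a slice of values to [0, 1].
	fmt.Println("normalized:", normalize([]float64{2.0, 4.0, 6.0, 10.0}))
}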
Even though much of this book will be focused on various modeling techniques, you should always consider data gathering, parsing, and organization as a – or maybe the – key component of a successful data science project. If this part of your project is not carefully developed with a high level of integrity, you are setting yourself up for trouble in the long run.
In this chapter, readers will learn different data handling techniques in Go, with guided code covering the following topics:
Handling varied data forms—CSV, JSON, and SQL databases
Web scraping
Caching
Data versioning
As you can see in the preceding section, Go provides us with an opportunity to maintain high levels of integrity in our data gathering, parsing, and organization. We want to ensure that we leverage Go's unique properties whenever we are preparing our data for machine learning workflows.
Generally, Go data scientists/analysts should observe the following best practices when gathering and organizing data. These best practices are meant to help you maintain integrity in your applications and enable you to reproduce any analysis:

Check for and enforce expected types: This might seem obvious, but it is too often overlooked when using dynamically typed languages. Although it is slightly verbose, explicitly parsing data into expected types and handling related errors can save you big headaches down the road (see the sketch after this list).
Standardize and simplify your data ingress/egress: There are many third-party packages for handling certain types of data or interactions with certain sources of data (some of which we will cover in this book). However, if you standardize the ways you are interacting with data sources, particularly centered around the use of stdlib, you can develop predictable patterns and maintain consistency within your team. A good example of this is choosing to utilize database/sql for database interactions rather than various third-party application program interfaces (APIs) and domain-specific languages (DSLs).
Version your data: Machine learning models produce extremely different results depending on the training data you use, your choice of parameters, and input data. Thus, it is impossible to reproduce results without versioning both your code and data. We will discuss the appropriate techniques in the Data versioning section of this chapter.
CSV files might not be a go-to format for big data, but as a data scientist or developer working in machine learning, you are sure to encounter this format. You might need a mapping of zip codes to latitude/longitude and find this as a CSV file on the internet, or you may be given sales figures from your sales team in a CSV format. In any event, we need to understand how to parse these files.
The main package that we will utilize in parsing CSV files is encoding/csv from Go's standard library. However, we will also discuss a couple of packages that allow us to quickly manipulate or transform CSV data, github.com/go-gota/gota/dataframe and go-hep.org/x/hep/csvutil.
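As a first taste, here is a minimal sketch of reading records with encoding/csv (the data.csv filename is just a placeholder):

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

func main() {
	// Open the CSV file (the filename is a placeholder).
	f, err := os.Open("data.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Create a CSV reader and read in all of the records at once.
	reader := csv.NewReader(f)
	records, err := reader.ReadAll()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("number of records:", len(records))
}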
Web scraping is a handy tool to have in a data scientist's skill set. It can be useful in a variety of situations for gathering data, such as when a website does not provide an API, or when you need to parse and extract web content programmatically, such as scraping Wikipedia tables. The following packages can be used to scrape data from the web:
The github.com/PuerkitoBio/goquery package: A jQuery-like tool
The net/http package: To scrape information from an HTML web page on the internet
The github.com/anaskhan96/soup package: A Go package similar to the BeautifulSoup Python package
The following code snippet shows an example of scraping an xkcd comic's image and its underlying text using the soup package:
fmt.Println("Enter the xkcd comic number :") var num int fmt.Scanf("%d", &num) url := fmt.Sprintf("https://xkcd.com/%d", num) resp, _ := soup.Get(url) doc := soup.HTMLParse(resp) title := doc.Find("div", "id", "ctitle").Text() fmt.Println("Title of the comic :", title) comicImg := doc.Find("div", "id", "comic").Find("img") fmt.Println("Source of the image :", comicImg.Attrs()["src"]) fmt.Println("Underlying text of the image :", comicImg.Attrs()["title"])
Scraping multiple Wikipedia tables from a single web page can be useful in the absence of authenticated information on certain topics. The following chunks of code explain the various steps for scraping lists of movies from particular years using a scraper package.
The following code snippet shows the function that takes the desired URL as an input parameter and gives a table as an output:
// scrape fetches the page at url, selects the content matching
// selector, and sends the result on the supplied channel.
func scrape(url string, selector string, ch chan []string) {
	s := scraper.NewScraper(url)
	selection := s.Find(selector)
	ch <- selection
}
The next block of code creates a slice of year strings and a channel for each year, and then uses the scrape function concurrently to obtain the desired output:
years := []string{"2009", "2010", "2011", "2012", "2013"}
channels := []chan []string{
	make(chan []string),
	make(chan []string),
	make(chan []string),
	make(chan []string),
	make(chan []string),
}

// Launch one scraping goroutine per year, each with its own channel.
for idx, year := range years {
	ch := channels[idx]
	go scrape("http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_"+year, "table.wikitable i a", ch)
}

// Receive and print the results as each goroutine finishes.
for i := 0; i < 5; i++ {
	select {
	case movies2009 := <-channels[0]:
		printMovies(movies2009)
	case movies2010 := <-channels[1]:
		printMovies(movies2010)
	case movies2011 := <-channels[2]:
		printMovies(movies2011)
	case movies2012 := <-channels[3]:
		printMovies(movies2012)
	case movies2013 := <-channels[4]:
		printMovies(movies2013)
	}
}
This section gives readers an idea of how to scrape data from the web using simple functions. From here, scraping more sophisticated data according to individual needs can be explored further.
In a world in which the majority of data is accessed via the web, and most engineering organizations implement some number of microservices, we are going to encounter data in JSON format fairly frequently. We may only need to deal with it when pulling some random data from an API, or it might actually be the primary data format that drives our analytics and machine learning workflows.
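As a quick preview, here is a minimal sketch of parsing JSON with encoding/json from the standard library (the struct and payload are made up for illustration):

package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// observation is a hypothetical record, mirroring the kind of
// payload an API might return.
type observation struct {
	Station string  `json:"station"`
	Value   float64 `json:"value"`
}

func main() {
	// Example JSON, as it might arrive in an API response body.
	data := []byte(`{"station": "A01", "value": 12.5}`)

	// Unmarshal the JSON into an explicitly typed struct.
	var obs observation
	if err := json.Unmarshal(data, &obs); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("station %s reported %.1f\n", obs.Station, obs.Value)
}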
