Automate data and model pipelines for faster machine learning applications
Key Features
Build automated modules for different machine learning components
Understand each component of a machine learning pipeline in depth
Learn to use different open source AutoML and feature engineering platforms
Book Description
AutoML is designed to automate parts of machine learning. Readily available AutoML tools are making data science practitioners' work easier and are well received in the advanced analytics community. Automated Machine Learning covers the foundations needed to create automated machine learning modules and helps you get up to speed with them in the most practical way possible.
In this book, you'll learn how to automate different tasks in the machine learning pipeline, such as data preprocessing, feature selection, model training, model optimization, and much more. The book also demonstrates how you can use the available automation libraries, such as auto-sklearn and MLBox, and how to create and extend your own custom AutoML components.
By the end of this book, you will have a clearer understanding of the different aspects of automated machine learning, and you'll be able to incorporate automation tasks using practical datasets. You can leverage what you learn here to implement machine learning in your projects and get a step closer to winning machine learning competitions.
What You Will Learn
Understand the fundamentals of automated machine learning systems
Explore auto-sklearn and MLBox for AutoML tasks
Automate your preprocessing methods along with feature transformation
Enhance feature selection and generation using the Python stack
Assemble individual components of ML into a complete AutoML framework
Demystify hyperparameter tuning to optimize your ML models
Dive into machine learning concepts such as neural networks and autoencoders
Understand the information costs and trade-offs associated with AutoML
Who This Book Is For
If you’re a budding data scientist, data analyst, or Machine Learning enthusiast and are new to the concept of automated machine learning, this book is ideal for you. You’ll also find this book useful if you’re an ML engineer or data professional interested in developing quick machine learning pipelines for your projects. Prior exposure to Python programming will help you get the best out of this book.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Tejas Limkar
Technical Editor: Sayli Nikalje
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Tania Dutta
Production Coordinator: Aparna Bhagat
First published: April 2018
Production reference: 1250418
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78862-989-8
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Sibanjan Das is a business analytics and data science consultant. He has extensive experience in implementing predictive analytics solutions in business systems and IoT. An enthusiastic professional, passionate about technology and innovation, he has loved wrangling data since the early days of his career. He holds a master's degree in IT with a major in business analytics from Singapore Management University, as well as several industry certifications, such as OCA, OCP, and CSCMS.
Umit Mert Cakmak is a data scientist at IBM, where he excels at helping clients solve complex data science problems, from inception to the delivery of deployable assets. His research spans multiple disciplines beyond his industry, and he likes sharing his insights at conferences, universities, and meetups.
Brian T. Hoffman has developed and deployed data science solutions for 20 years, in fields such as drug discovery, biotech, software, and sales. After obtaining his PhD in drug discovery from the University of North Carolina at Chapel Hill, he completed a postdoctoral fellowship developing new ML techniques at the National Institutes of Health. He has a passion for determining how data can help improve business decisions, and has managed international teams of scientists implementing data science solutions for companies ranging from startups to the Fortune 100.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Automated Machine Learning
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to AutoML
Scope of machine learning
What is AutoML?
Why use AutoML and how does it help?
When do you automate ML?
What will you learn?
Core components of AutoML systems
Automated feature preprocessing
Automated algorithm selection
Hyperparameter optimization
Building prototype subsystems for each component
Putting it all together as an end-to-end AutoML system
Overview of AutoML libraries
Featuretools
Auto-sklearn
MLBox
TPOT
Summary
Introduction to Machine Learning Using Python
Technical requirements
Machine learning
Machine learning process
Supervised learning
Unsupervised learning
Linear regression
What is linear regression?
Working of OLS regression
Assumptions of OLS
Where is linear regression used?
By which method can linear regression be implemented?
Important evaluation metrics – regression algorithms
Logistic regression
What is logistic regression?
Where is logistic regression used?
By which method can logistic regression be implemented?
Important evaluation metrics – classification algorithms
Decision trees
What are decision trees?
Where are decision trees used?
By which method can decision trees be implemented?
Support Vector Machines
What is SVM?
Where is SVM used?
By which method can SVM be implemented?
k-Nearest Neighbors
What is k-Nearest Neighbors?
Where is KNN used?
By which method can KNN be implemented?
Ensemble methods
What are ensemble models?
Bagging
Boosting
Stacking/blending
Comparing the results of classifiers
Cross-validation
Clustering
What is clustering?
Where is clustering used?
By which method can clustering be implemented?
Hierarchical clustering
Partitioning clustering (KMeans)
Summary
Data Preprocessing
Technical requirements
Data transformation
Numerical data transformation
Scaling
Missing values
Outliers
Detecting and treating univariate outliers
Inter-quartile range
Filtering values
Winsorizing
Trimming
Detecting and treating multivariate outliers
Binning
Log and power transformations
Categorical data transformation
Encoding
Missing values for categorical data transformation
Text preprocessing
Feature selection
Excluding features with low variance
Univariate feature selection
Recursive feature elimination
Feature selection using random forest
Feature selection using dimensionality reduction
Principal Component Analysis
Feature generation
Summary
Automated Algorithm Selection
Technical requirements
Computational complexity
Big O notation
Differences in training and scoring time
Simple measure of training and scoring time 
Code profiling in Python
Visualizing performance statistics
Implementing k-NN from scratch
Profiling your Python script line by line
Linearity versus non-linearity
Drawing decision boundaries
Decision boundary of logistic regression
The decision boundary of random forest
Commonly used machine learning algorithms
Necessary feature transformations
Supervised ML
Default configuration of auto-sklearn
Finding the best ML pipeline for product line prediction
Finding the best machine learning pipeline for network anomaly detection
Unsupervised AutoML
Commonly used clustering algorithms
Creating sample datasets with sklearn
K-means algorithm in action
The DBSCAN algorithm in action
Agglomerative clustering algorithm in action
Simple automation of unsupervised learning
Visualizing high-dimensional datasets
Principal Component Analysis in action
t-SNE in action
Adding simple components together to improve the pipeline
Summary
Hyperparameter Optimization
Technical requirements
Hyperparameters
Warm start
Bayesian-based hyperparameter tuning
An example system
Summary
Creating AutoML Pipelines
Technical requirements
An introduction to machine learning pipelines
A simple pipeline
FunctionTransformer
A complex pipeline
Summary
Dive into Deep Learning
Technical requirements
Overview of neural networks
Neuron
Activation functions
The step function
The sigmoid function
The ReLU function
The tanh function
A feed-forward neural network using Keras
Autoencoders
Convolutional Neural Networks
Why CNN?
What is convolution?
What are filters?
The convolution layer
The ReLU layer
The pooling layer
The fully connected layer
Summary
Critical Aspects of ML and Data Science Projects
Machine learning as a search
Trade-offs in machine learning
Engagement model for a typical data science project
The phases of an engagement model
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Dear reader, welcome to the world of automated machine learning (ML). Automated ML (AutoML) is designed to automate parts of ML. The readily available AutoML tools make the tasks of data science practitioners easier and are being well received in the advanced analytics community. This book covers the foundations you need to create AutoML modules, and shows how you can get up to speed with them in the most practical way possible.
You will learn to automate different tasks in the ML pipeline, such as data preprocessing, feature selection, model training, model optimization, and much more. The book also demonstrates how to use already available automation libraries, such as auto-sklearn and MLBox, and how to create and extend your own custom AutoML components for ML.
By the end of this book, you will have a clearer understanding of what the different aspects of AutoML are, and you will be able to incorporate automation tasks using practical datasets. The knowledge you gain from this book can be leveraged to implement ML in your projects, or to get a step closer to winning an ML competition. We hope everyone who buys this book finds it worthwhile and informative.
This book is ideal for budding data scientists, data analysts, and ML enthusiasts who are new to the concept of AutoML. Machine learning engineers and data professionals who are interested in developing quick machine learning pipelines for their projects will also find this book useful.
Chapter 1, Introduction to AutoML, creates a foundation for you to dive into AutoML. We also introduce you to various AutoML libraries.
Chapter 2, Introduction to Machine Learning Using Python, introduces some machine learning concepts so that you can follow the AutoML approaches easily.
Chapter 3, Data Preprocessing, provides an in-depth understanding of different data preprocessing methods, what can be automated, and how to automate it. Featuretools and auto-sklearn preprocessing methods will be introduced here.
Chapter 4, Automated Algorithm Selection, provides guidance on which algorithm works best on which kind of dataset. We learn about the computational complexity and scalability of different algorithms, along with methods for deciding which algorithm to use based on training and scoring time. We demonstrate auto-sklearn and how to extend it to include new algorithms.
Chapter 5, Hyperparameter Optimization, provides you with the fundamentals required for automating hyperparameter tuning for a variety of variables.
Chapter 6, Creating AutoML Pipelines, explains how to stitch together various components to create an end-to-end AutoML pipeline.
Chapter 7, Dive into Deep Learning, introduces you to various deep learning concepts and how they contribute to AutoML.
Chapter 8, Critical Aspects of ML and Data Science Projects, concludes the discussion and provides information on various trade-offs on the complexity and cost of AutoML projects.
The only thing you need before you start reading is inquisitiveness to learn more about ML. Prior exposure to Python programming and ML fundamentals will help you get the best out of this book, but it is not mandatory. You should have Python 3.5 and Jupyter Notebook installed.
If there is a specific requirement for any chapter, it is mentioned in the opening section.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Automated-Machine-Learning. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/HandsOnAutomatedMachineLearning_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "As an example, let's use StandardScaler from the sklearn.preprocessing module to standardize the values of the satisfaction_level column."
A block of code is set as follows:
{'algorithm': 'auto',
 'copy_x': True,
 'init': 'k-means++',
 'max_iter': 300,
 'n_clusters': 2,
 'n_init': 10,
 'n_jobs': 1,
 'precompute_distances': 'auto',
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}
Any command-line input or output is written as follows:
pip install nltk
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "You will get an NLTK Downloader popup. Select all from the Identifier section and wait for the installation to complete."
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
The last decade, if nothing else, has been a thrilling adventure in science and technology. The first iPhone was released in 2007, and back then all of its competitors had physical integrated keyboards. The idea of a touchscreen wasn't new; Apple had built similar prototypes before, and IBM had released the Simon Personal Communicator in 1994. Apple's idea was to have a device full of multimedia entertainment, such as music and streaming video, while offering all the useful functionality, such as web browsing and GPS navigation. Of course, all of this was only possible with access to affordable computing power at the time Apple released the first-generation iPhone. If you really think about the struggles these great companies went through over the last 20 years, you can see how quickly technology got to where it is today. To put things into perspective, ten years after the release of the first-generation iPhone, your iPhone, along with its rivals, can track faces and recognize objects such as animals, vehicles, and food. It can understand natural language and converse with you.
What about 3D printers that can print organs, self-driving cars, swarms of drones that fly together in harmony, gene editing, reusable rockets, and a robot that can do a backflip? These are no longer stories you read in science fiction books; they are happening as you read these lines. In the past, you could only imagine such things, but today science fiction is becoming a reality. People have started talking about the threat of artificial intelligence (AI). Many leading scientists, such as Stephen Hawking, have warned officials about the possible end of humankind, which could be brought about by AI-based life forms.
AI and machine learning (ML) have reached peak popularity in the last couple of years and are stealing the show. Chances are pretty good that you have already heard about the success of ML algorithms and the great advancements in the field over the last decade. The recent success of Google's AlphaGo showed how far this technology can go when it beat Ke Jie, the best human Go player on Earth. This wasn't the first time that ML algorithms had beaten humans at particular tasks, such as image recognition; when it comes to fine-grained details, such as recognizing different species of animals, these algorithms have often performed better than their human competitors.
These advancements have created huge interest in the business world. As much as it sounds like an academic field of research, these technologies have huge business implications and can directly impact your organization's financials.
Enterprises from different industries want to utilize the power of these algorithms and try to adapt to the changing technology scene. Everybody is aware that people who figure out how to integrate these technologies into their businesses will lead the space, and the rest are going to have a hard time catching up.
We will explore more examples like these throughout the book. In this chapter, we will cover the following topics:
Scope of machine learning
What AutoML is
Why use AutoML and how it helps
When to use AutoML
Overview of AutoML libraries
Machine learning and predictive analytics now help companies focus on important areas, anticipating problems before they happen, reducing costs, and increasing revenue. This was a natural evolution after working with business intelligence (BI) solutions. BI applications helped companies make better decisions by monitoring their business processes in an organized manner, usually using dashboards with various key performance indicators (KPIs) and performance metrics.
BI tools allow you to dig deeper into your organization's historical data, uncover trends, understand seasonality, spot irregular events, and so on. They can also provide real-time analytics, where you can set up warnings and alerts to manage particular events better. All of these things are quite useful, but today businesses need more than that. What does that mean? BI tools allow you to work with historical and near real-time data, but they don't provide answers about the future and can't address questions such as the following:
Which machine in your production line is likely to fail?
Which of your customers will probably switch to your competitor?
Which company's stock price is going up tomorrow?
Businesses want to answer these kinds of questions nowadays, and this pushes them to search for suitable tools and technologies, which brings them to ML and predictive analytics.
You need to be careful, though! When you work with BI tools, you can be fairly confident about the results you are going to get, but when you work with ML models, there's no such guarantee, and the ground is slippery. There is definitely a huge buzz around AI and ML nowadays, and people are making outrageous claims about the capabilities of upcoming AI products. After all, computer scientists have long sought to create intelligent machines, and have occasionally suffered along the way due to unrealistic expectations; a quick Google search for AI winter will tell you more about that period. Although the advancements are beyond imagination and the field is moving quickly, you should navigate through the noise and identify the actual use cases where ML really shines, and where it can help you create value for your research or business in measurable terms.
In order to do that, you need to start with small pilot projects where:
You have relatively simple decision-making processes
You know your assumptions well
You know your data well
The key here is to have a well-defined project scope and steps that you are going to execute. Collaboration between different teams is really helpful in this process, which is why you should break down silos inside your organization. Also, starting small doesn't mean that your vision should be small too; you should always think about future scalability and slowly gear up to harness big data sources.
There are a variety of ML algorithms that you can experiment with, each designed to solve a specific problem with its own pros and cons. There is a growing body of research in this area, and practitioners are coming up with new methods and pushing the limits of the field every day. Hence, one can easily get lost in all the information available out there, especially when developing ML applications, since there are many available tools and techniques for every stage of the model-building process. To ease building ML models, you need to decompose the whole process into small pieces. Automated ML (AutoML) pipelines have many moving parts, such as feature preprocessing, feature selection, model selection, and hyperparameter optimization. Each of these parts needs to be handled with special care to deliver successful projects.
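To make these moving parts concrete, here is a minimal sketch of such a pipeline in scikit-learn; the dataset, the chosen steps, and the parameter grid are illustrative assumptions, not a full AutoML system:

# A minimal sketch of an ML pipeline's moving parts, using scikit-learn.
# The dataset and the search space are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Each step maps to one AutoML component: feature preprocessing (scaler),
# feature selection (selector), and the model itself (clf)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=2)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Hyperparameter optimization: systematically try the combinations
param_grid = {'selector__k': [2, 3, 4], 'clf__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

Even in this tiny example there are nine pipeline candidates to evaluate; real search spaces grow combinatorially, which is exactly why a systematic, automated approach pays off.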
You will hear a lot about ML concepts throughout the book, but let's step back and understand why you need to pay special attention to AutoML.
As you gather more tools and technologies to attack your problems, having too many options usually becomes a problem in itself, and it takes a considerable amount of time to research and understand the right approach for a given problem. Dealing with ML problems is a similar story. Building high-performing ML models involves several carefully crafted small steps. Each step leads you to the next, and if you don't drop the ball along the way, your ML pipeline will function properly and generalize well when you deploy it in a production environment.
The number of steps involved in your pipeline could be large and the process could be really lengthy. At every step, there are many methods available, and, once you think about the possible number of different combinations, you will quickly realize that you need a systematic way of experimenting with all these components in your ML pipelines.
This brings us to the topic of AutoML!
AutoML aims to ease the process of building ML models by automating commonly used steps, such as feature preprocessing, model selection, and hyperparameter tuning. You will see each of these steps in detail in the coming chapters, and you will actually build an AutoML system to gain a deeper understanding of the available tools and libraries for AutoML.
Without getting into the details, it's useful to review what an ML model is and how you train one.
ML algorithms will work on your data to find certain patterns, and this learning process is called model training. As a result of model training, you will have an ML model that supposedly will give you insights/answers about the data without requiring you to write explicit rules.
When you use ML models in practice, you throw a bunch of numerical data at the algorithm as input for training. The output of the training process is an ML model that you can use to make predictions. Predictions can help you decide whether your server should be maintained in the next four hours based on its current state, or whether a customer of yours is going to switch to your competitor.
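As a toy illustration of this training-then-predicting flow, with entirely made-up server readings:

# Train a classifier on historical server readings, then predict whether
# a server in a given state needs maintenance. All numbers are invented.
from sklearn.ensemble import RandomForestClassifier

# Each row: [cpu_load, temperature, hours_since_last_maintenance]
X_train = [[0.9, 80, 200], [0.2, 45, 10], [0.8, 75, 150], [0.3, 50, 30]]
y_train = [1, 0, 1, 0]  # 1 = maintenance was needed, 0 = it was not

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predict for a server's current state (data the model hasn't seen)
print(model.predict([[0.85, 78, 180]]))  # for example, [1]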
Sometimes the problem you are solving will not be well defined, and you will not even know what kind of answers you are looking for. In such cases, ML models will help you explore your dataset, for example by identifying clusters of customers that are similar to each other in terms of behavior, or by finding the hierarchical structure of stocks based on their correlations.
What do you do when your model comes up with clusters of customers? Well, you at least know this: customers that belong to the same cluster are similar to each other in terms of their features, such as their age, profession, marital status, gender, product preferences, daily/weekly/monthly spending habits, total amount spent, and so on. Customers who belong to different clusters are dissimilar to each other. With such an insight, you can utilize this information to create different ad campaigns for each cluster.
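A small sketch of what such clustering looks like in code, with invented customer features (in practice, you would scale the features first, since the spending column would otherwise dominate the distance calculations):

import numpy as np
from sklearn.cluster import KMeans

# Columns: [gender (0 = female, 1 = male), age, total amount spent]
customers = np.array([
    [0.0, 34.0, 11500.0],
    [1.0, 42.0, 12800.0],
    [0.0, 23.0, 1300.0],
    [1.0, 25.0, 900.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # cluster assignment per customer, e.g. [1 1 0 0]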
To put things into a more technical perspective, let's describe this process in simple mathematical terms. There is a dataset, X, which contains n examples. These examples could represent customers or different species of animals. Each example is usually a set of real numbers called features; for example, a 35-year-old female customer who spent $12,000 at your store can be represented by the vector (0.0, 35.0, 12000.0). Note that gender is represented by 0.0 here; a male customer would have 1.0 for that feature. The size of the vector is the dataset's dimensionality, usually denoted by m; since this vector has size three, this is a three-dimensional dataset.
Depending on the problem type, you might need a label for each example. For example, if this is a supervised learning problem, such as binary classification, you would label your examples with 1.0 or 0.0; this new variable is called the label or target variable. The target variable is usually referred to as y.
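In code, X and y are simply arrays; the first row below mirrors the customer vector from the text, while the second row and the labels are invented for illustration:

import numpy as np

X = np.array([
    [0.0, 35.0, 12000.0],  # female, 35 years old, spent $12,000
    [1.0, 28.0, 4500.0],   # male, 28 years old, spent $4,500 (invented)
])
y = np.array([1.0, 0.0])   # binary labels, e.g. 1.0 = repeat buyer (assumed)

n, m = X.shape  # n examples, m features (the dimensionality)
print(n, m)     # 2 3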
Having x and y, an ML model is simply a function, f, with weights, w (the model parameters):
y = f(x, w)
Model parameters are learned during the training process, but there are also other parameters that you might need to set before training starts, and these parameters are called hyperparameters, which will be explained shortly.
Features in your dataset should usually be preprocessed before being used in model training. For example, some ML models implicitly assume that features are normally distributed. In many real-life scenarios this is not the case, and you can benefit from applying feature transformations, such as a log transformation, to make them approximately normally distributed.
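As a quick sketch, here is a log transformation applied to a synthetically skewed feature (the data is generated, and scipy is assumed to be installed for measuring skewness):

import numpy as np
from scipy.stats import skew

rng = np.random.RandomState(0)
spending = rng.lognormal(mean=8, sigma=1, size=1000)  # right-skewed feature

# log1p computes log(1 + x), which is safe even when values can be zero
log_spending = np.log1p(spending)

# Skewness drops sharply after the transform
print(skew(spending), skew(log_spending))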
Once feature preprocessing is done and the model hyperparameters are set, model training starts. At the end of model training, the model parameters have been learned, and we can predict the target variable for new data that the model has not seen before. The prediction made by the model is usually referred to as ŷ:
ŷ = f(x, w)
What really happens during training? Since we know the labels for the dataset we used for training, we can iteratively update our model parameters based on the comparison of what our current model predicts and what the original label was.
This comparison is based on a function called the loss function (or cost function), L(y, ŷ). The loss function represents the inaccuracy of the predictions. Some common loss functions you may have heard of are square loss, hinge loss, logistic loss, and cross-entropy loss.
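Two of these losses can be computed directly with scikit-learn on toy predictions (the numbers below are invented):

import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probabilities

print(log_loss(y_true, y_pred))            # logistic/cross-entropy loss
print(mean_squared_error(y_true, y_pred))  # square loss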
Once model training is done, you will test the performance of your ML model on test data, which is data that was not used in the training process, to see how well your model generalizes. You can use different performance metrics to assess the performance; based on the results, you should go back to previous steps and make adjustments to achieve better performance.
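Here is a minimal sketch of this evaluate-on-held-out-data step, using a dataset bundled with scikit-learn as a stand-in:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Performance on data the model has never seen during training
print(accuracy_score(y_test, model.predict(X_test)))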
At this point, you should have an overall idea of what training an ML model looks like under the hood.
What is AutoML then? When we are talking about AutoML, we mostly refer to automated data preparation (namely feature preprocessing, generation, and selection) and model training (model selection and hyperparameter optimization). The number of possible options for each step of this process can vary vastly depending on the problem type.
AutoML allows researchers and practitioners to automatically build ML pipelines out of these possible options for every step to find high-performing ML models for a given problem.
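As a brief preview of what this looks like in practice, here is a hedged sketch using TPOT, one of the libraries introduced later in this chapter; the parameter values are illustrative, not recommendations:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT searches over preprocessors, models, and hyperparameters for you
tpot = TPOTClassifier(generations=5, population_size=20, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')  # writes the winning pipeline out as code

A single fit call here stands in for the whole search over pipeline candidates that you would otherwise assemble and tune by hand.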
The following figure shows a typical ML model life cycle with a couple of examples for every step:
Data can be ingested from various sources, such as flat files, databases, and APIs. Once you have ingested the data, you should process it to make it ready for ML; typical operations include cleaning and formatting, feature transformation, and feature selection. After data processing, your final dataset should be ready for ML, and you will shortlist candidate algorithms to work with. The shortlisted algorithms should be validated and tuned through techniques such as cross-validation and hyperparameter optimization. Your final model will then be ready to be operationalized with a suitable workload type, such as online, batch, or streaming deployment. Once the model is in production, you need to monitor its performance and take the necessary actions when needed, such as retraining, re-evaluation, and redeployment.
When you are faced with building ML models, you will first research the domain you are working in and identify your objective. There are many steps involved in the process that should be planned and documented in advance, before you actually start working on the project. To learn more about the whole process of project management, you can refer to the CRISP-DM model (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining). Project management is crucially important for delivering a successful application; however, it is beyond the scope of this book.