Automate data and model pipelines for faster machine learning applications
Key Features
Build automated modules for different machine learning components
Understand each component of a machine learning pipeline in depth
Learn to use different open source AutoML and feature engineering platforms
Book Description
AutoML is designed to automate parts of machine learning. Readily available AutoML tools are making data science practitioners' work easier and are well received in the advanced analytics community. Automated Machine Learning covers the foundations needed to create automated machine learning modules and helps you get up to speed with them in the most practical way possible.
In this book, you'll learn how to automate different tasks in the machine learning pipeline, such as data preprocessing, feature selection, model training, model optimization, and much more. The book also demonstrates how you can use the available automation libraries, such as auto-sklearn and MLBox, and how to create and extend your own custom AutoML components.
By the end of this book, you will have a clearer understanding of the different aspects of automated machine learning, and you'll be able to incorporate automation tasks using practical datasets. You can leverage what you learn here to implement machine learning in your projects and get a step closer to winning machine learning competitions.
What You Will Learn
Understand the fundamentals of automated machine learning systems
Explore auto-sklearn and MLBox for AutoML tasks
Automate your preprocessing methods along with feature transformation
Enhance feature selection and generation using the Python stack
Assemble individual components of ML into a complete AutoML framework
Demystify hyperparameter tuning to optimize your ML models
Dive into machine learning concepts such as neural networks and autoencoders
Understand the information costs and trade-offs associated with AutoML
Who This Book Is For
If you’re a budding data scientist, data analyst, or Machine Learning enthusiast and are new to the concept of automated machine learning, this book is ideal for you. You’ll also find this book useful if you’re an ML engineer or data professional interested in developing quick machine learning pipelines for your projects. Prior exposure to Python programming will help you get the best out of this book.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Tejas Limkar
Technical Editor: Sayli Nikalje
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Tania Dutta
Production Coordinator: Aparna Bhagat
First published: April 2018
Production reference: 1250418
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78862-989-8
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Sibanjan Das is a business analytics and data science consultant. He has extensive experience in implementing predictive analytics solutions in business systems and IoT. An enthusiastic professional, passionate about technology and innovation, he has loved wrangling data since the early days of his career. He holds a master's degree in IT with a major in business analytics from Singapore Management University, as well as several industry certifications, such as OCA, OCP, and CSCMS.
Umit Mert Cakmak is a data scientist at IBM, where he excels at helping clients solve complex data science problems, from inception to the delivery of deployable assets. His research spans multiple disciplines beyond his industry, and he likes sharing his insights at conferences, universities, and meetups.
Brian T. Hoffman has developed and deployed data science solutions for 20 years, in fields such as drug discovery, biotech, software, and sales. After obtaining his PhD in drug discovery from the University of North Carolina at Chapel Hill, he completed a postdoctoral fellowship developing new ML techniques at the National Institutes of Health. He has a passion for determining how data can help improve business decisions, and has managed international teams of scientists implementing data science solutions for companies ranging from startups to the Fortune 100.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Automated Machine Learning
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to AutoML
Scope of machine learning
What is AutoML?
Why use AutoML and how does it help?
When do you automate ML?
What will you learn?
Core components of AutoML systems
Automated feature preprocessing
Automated algorithm selection
Hyperparameter optimization
Building prototype subsystems for each component
Putting it all together as an end-to-end AutoML system
Overview of AutoML libraries
Featuretools
Auto-sklearn
MLBox
TPOT
Summary
Introduction to Machine Learning Using Python
Technical requirements
Machine learning
Machine learning process
Supervised learning
Unsupervised learning
Linear regression
What is linear regression?
Working of OLS regression
Assumptions of OLS
Where is linear regression used?
By which method can linear regression be implemented?
Important evaluation metrics – regression algorithms
Logistic regression
What is logistic regression?
Where is logistic regression used?
By which method can logistic regression be implemented?
Important evaluation metrics – classification algorithms
Decision trees
What are decision trees?
Where are decision trees used?
By which method can decision trees be implemented?
Support Vector Machines
What is SVM?
Where is SVM used?
By which method can SVM be implemented?
k-Nearest Neighbors
What is k-Nearest Neighbors?
Where is KNN used?
By which method can KNN be implemented?
Ensemble methods
What are ensemble models?
Bagging
Boosting
Stacking/blending
Comparing the results of classifiers
Cross-validation
Clustering
What is clustering?
Where is clustering used?
By which method can clustering be implemented?
Hierarchical clustering
Partitioning clustering (KMeans)
Summary
Data Preprocessing
Technical requirements
Data transformation
Numerical data transformation
Scaling
Missing values
Outliers
Detecting and treating univariate outliers
Inter-quartile range
Filtering values
Winsorizing
Trimming
Detecting and treating multivariate outliers
Binning
Log and power transformations
Categorical data transformation
Encoding
Missing values for categorical data transformation
Text preprocessing
Feature selection
Excluding features with low variance
Univariate feature selection
Recursive feature elimination
Feature selection using random forest
Feature selection using dimensionality reduction
Principal Component Analysis
Feature generation
Summary
Automated Algorithm Selection
Technical requirements
Computational complexity
Big O notation
Differences in training and scoring time
Simple measure of training and scoring time 
Code profiling in Python
Visualizing performance statistics
Implementing k-NN from scratch
Profiling your Python script line by line
Linearity versus non-linearity
Drawing decision boundaries
Decision boundary of logistic regression
The decision boundary of random forest
Commonly used machine learning algorithms
Necessary feature transformations
Supervised ML
Default configuration of auto-sklearn
Finding the best ML pipeline for product line prediction
Finding the best machine learning pipeline for network anomaly detection
Unsupervised AutoML
Commonly used clustering algorithms
Creating sample datasets with sklearn
K-means algorithm in action
The DBSCAN algorithm in action
Agglomerative clustering algorithm in action
Simple automation of unsupervised learning
Visualizing high-dimensional datasets
Principal Component Analysis in action
t-SNE in action
Adding simple components together to improve the pipeline
Summary
Hyperparameter Optimization
Technical requirements
Hyperparameters
Warm start
Bayesian-based hyperparameter tuning
An example system
Summary
Creating AutoML Pipelines
Technical requirements
An introduction to machine learning pipelines
A simple pipeline
FunctionTransformer
A complex pipeline
Summary
Dive into Deep Learning
Technical requirements
Overview of neural networks
Neuron
Activation functions
The step function
The sigmoid function
The ReLU function
The tanh function
A feed-forward neural network using Keras
Autoencoders
Convolutional Neural Networks
Why CNN?
What is convolution?
What are filters?
The convolution layer
The ReLU layer
The pooling layer
The fully connected layer
Summary
Critical Aspects of ML and Data Science Projects
Machine learning as a search
Trade-offs in machine learning
Engagement model for a typical data science project
The phases of an engagement model
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Dear reader, welcome to the world of automated machine learning (ML). Automated ML (AutoML) is designed to automate parts of ML. The readily available AutoML tools make the tasks of data science practitioners easier and are being well received in the advanced analytics community. This book covers the foundations you need to create AutoML modules, and shows how you can get up to speed with them in the most practical way possible.
You will learn to automate different tasks in the ML pipeline, such as data preprocessing, feature selection, model training, model optimization, and much more. The book also demonstrates how to use already available automation libraries, such as auto-sklearn and MLBox, and how to create and extend your own custom AutoML components for ML.
By the end of this book, you will have a clearer understanding of what the different aspects of AutoML are, and you will be able to incorporate automation tasks using practical datasets. The knowledge you gain from this book can be leveraged to implement ML in your projects, or to get a step closer to winning an ML competition. We hope everyone who buys this book finds it worthwhile and informative.
This book is ideal for budding data scientists, data analysts, and ML enthusiasts who are new to the concept of AutoML. Machine learning engineers and data professionals who are interested in developing quick machine learning pipelines for their projects will also find this book useful.
Chapter 1, Introduction to AutoML, creates a foundation for you to dive into AutoML. We also introduce you to various AutoML libraries.
Chapter 2, Introduction to Machine Learning Using Python, introduces some machine learning concepts so that you can follow the AutoML approaches easily.
Chapter 3, Data Preprocessing, provides an in-depth understanding of different data preprocessing methods, what can be automated, and how to automate it. Featuretools and auto-sklearn preprocessing methods will be introduced here.
Chapter 4, Automated Algorithm Selection, provides guidance on which algorithm works best on which kind of dataset. We learn about the computational complexity and scalability of different algorithms, along with methods for deciding which algorithm to use based on training and scoring time. We demonstrate auto-sklearn and how to extend it to include new algorithms.
Chapter 5, Hyperparameter Optimization, provides you with the fundamentals required for automating hyperparameter tuning for a variety of variables.
Chapter 6, Creating AutoML Pipelines, explains how to stitch together various components to create an end-to-end AutoML pipeline.
Chapter 7, Dive into Deep Learning, introduces you to various deep learning concepts and how they contribute to AutoML.
Chapter 8, Critical Aspects of ML and Data Science Projects, concludes the discussion and provides information on various trade-offs on the complexity and cost of AutoML projects.
The only thing you need before you start reading is inquisitiveness to learn more about ML. Prior exposure to Python programming and ML fundamentals will help you get the best out of this book, but it is not mandatory. You should have Python 3.5 and Jupyter Notebook installed.
If there is a specific requirement for any chapter, it is mentioned in the opening section.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Automated-Machine-Learning. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/HandsOnAutomatedMachineLearning_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "As an example, let's use StandardScaler from the sklearn.preprocessing module to standardize the values of the satisfaction_level column."
A block of code is set as follows:
{'algorithm': 'auto',
 'copy_x': True,
 'init': 'k-means++',
 'max_iter': 300,
 'n_clusters': 2,
 'n_init': 10,
 'n_jobs': 1,
 'precompute_distances': 'auto',
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}
Any command-line input or output is written as follows:
pip install nltk
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "You will get an NLTK Downloader popup. Select all from the Identifier section and wait for the installation to complete."
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
The last decade, if nothing else, has been a thrilling adventure in science and technology. The first iPhone was released in 2007, and back then all of its competitors had physical integrated keyboards. The idea of a touchscreen wasn't new; Apple had built similar prototypes before, and IBM had released the Simon Personal Communicator in 1994. Apple's idea was to have a device full of multimedia entertainment, such as music and streaming video, while offering all the useful functionality, such as web browsing and GPS navigation. Of course, all of this was only possible with access to affordable computing power at the time Apple released the first-generation iPhone. If you really think about the struggles these great companies went through over the last 20 years, you can see how quickly technology got to where it is today. To put things into perspective, ten years after the release of the first-generation iPhone, your iPhone, along with its rivals, can track faces and recognize objects such as animals, vehicles, and food. It can understand natural language and converse with you.
What about 3D printers that can print organs, self-driving cars, swarms of drones that fly together in harmony, gene editing, reusable rockets, and a robot that can do a backflip? These are no longer stories you read in science fiction books; they are happening as you read these lines. In the past, you could only imagine such things, but today science fiction is becoming a reality. People have started talking about the threat of artificial intelligence (AI). Many leading scientists, such as Stephen Hawking, have warned officials about the possible end of humankind, which could be brought about by AI-based life forms.
AI and machine learning (ML) have reached peak popularity in the last couple of years and are stealing the show. Chances are pretty good that you have already heard about the success of ML algorithms and the great advancements in the field over the last decade. The recent success of Google's AlphaGo showed how far this technology can go when it beat Ke Jie, the best human Go player on Earth. This wasn't the first time that ML algorithms had beaten humans at particular tasks, such as image recognition; when it comes to fine-grained details, such as recognizing different species of animals, these algorithms have often performed better than their human competitors.
These advancements have created huge interest in the business world. As much as it sounds like an academic field of research, these technologies have huge business implications and can directly impact your organization's financials.
Enterprises from different industries want to utilize the power of these algorithms and try to adapt to the changing technology scene. Everybody is aware that people who figure out how to integrate these technologies into their businesses will lead the space, and the rest are going to have a hard time catching up.
We will explore more examples like these throughout the book. In this chapter, we will cover the following topics:
Scope of machine learning
What AutoML is
Why use AutoML and how it helps
When to use AutoML
Overview of AutoML libraries
Machine learning and predictive analytics now help companies focus on important areas, anticipating problems before they happen, reducing costs, and increasing revenue. This was a natural evolution after working with business intelligence (BI) solutions. BI applications helped companies make better decisions by monitoring their business processes in an organized manner, usually using dashboards with various key performance indicators (KPIs) and performance metrics.
BI tools allow you to dig deeper into your organization's historical data, uncover trends, understand seasonality, spot irregular events, and so on. They can also provide real-time analytics, where you can set up warnings and alerts to manage particular events better. All of these things are quite useful, but today businesses need more than that. What does that mean? BI tools allow you to work with historical and near real-time data, but they don't provide answers about the future and can't address questions such as the following:
Which machine in your production line is likely to fail?
Which of your customers will probably switch to your competitor?
Which company's stock price is going up tomorrow?
Businesses want to answer these kinds of questions nowadays, and this pushes them to search for suitable tools and technologies, which brings them to ML and predictive analytics.
You need to be careful, though! When you work with BI tools, you can be fairly confident about the results you are going to get, but when you work with ML models, there's no such guarantee, and the ground is slippery. There is definitely a huge buzz around AI and ML nowadays, and people are making outrageous claims about the capabilities of upcoming AI products. After all, computer scientists have long sought to create intelligent machines, and have occasionally suffered along the way due to unrealistic expectations; a quick Google search for AI winter will tell you more about that period. Although the advancements are beyond imagination and the field is moving quickly, you should navigate through the noise and identify the actual use cases where ML really shines, and where it can help you create value for your research or business in measurable terms.
In order to do that, you need to start with small pilot projects where:
You have relatively simple decision-making processes
You know your assumptions well
You know your data well
The key here is to have a well-defined project scope and steps that you are going to execute. Collaboration between different teams is really helpful in this process, which is why you should break down silos inside your organization. Also, starting small doesn't mean that your vision should be small too; you should always think about future scalability and slowly gear up to harness big data sources.
There are a variety of ML algorithms that you can experiment with, each designed to solve a specific problem with its own pros and cons. There is a growing body of research in this area, and practitioners are coming up with new methods and pushing the limits of the field every day. Hence, one can easily get lost in all the information available out there, especially when developing ML applications, since there are many available tools and techniques for every stage of the model-building process. To ease building ML models, you need to decompose the whole process into small pieces. Automated ML (AutoML) pipelines have many moving parts, such as feature preprocessing, feature selection, model selection, and hyperparameter optimization. Each of these parts needs to be handled with special care to deliver successful projects.
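To make these moving parts concrete, here is a minimal sketch of such a pipeline in scikit-learn; the dataset, the chosen steps, and the parameter grid are illustrative assumptions, not a full AutoML system:

# A minimal sketch of an ML pipeline's moving parts, using scikit-learn.
# The dataset and the search space are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Each step maps to one AutoML component: feature preprocessing (scaler),
# feature selection (selector), and the model itself (clf)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=2)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Hyperparameter optimization: systematically try the combinations
param_grid = {'selector__k': [2, 3, 4], 'clf__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

Even in this tiny example there are nine pipeline candidates to evaluate; real search spaces grow combinatorially, which is exactly why a systematic, automated approach pays off.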
You will hear a lot about ML concepts throughout the book, but let's step back and understand why you need to pay special attention to AutoML.
As you gather more tools and technologies to attack your problems, having too many options usually becomes a problem in itself, and it takes a considerable amount of time to research and understand the right approach for a given problem. Dealing with ML problems is a similar story. Building high-performing ML models involves several carefully crafted small steps. Each step leads you to the next, and if you don't drop the ball along the way, your ML pipeline will function properly and generalize well when you deploy it in a production environment.
The number of steps involved in your pipeline could be large and the process could be really lengthy. At every step, there are many methods available, and, once you think about the possible number of different combinations, you will quickly realize that you need a systematic way of experimenting with all these components in your ML pipelines.
This brings us to the topic of AutoML!
AutoML aims to ease the process of building ML models by automating commonly used steps, such as feature preprocessing, model selection, and hyperparameter tuning. You will see each of these steps in detail in the coming chapters, and you will actually build an AutoML system to gain a deeper understanding of the available tools and libraries for AutoML.
Without getting into the details, it's useful to review what an ML model is and how you train one.
ML algorithms will work on your data to find certain patterns, and this learning process is called model training. As a result of model training, you will have an ML model that supposedly will give you insights/answers about the data without requiring you to write explicit rules.
When you use ML models in practice, you throw a bunch of numerical data at the algorithm as input for training. The output of the training process is an ML model that you can use to make predictions. Predictions can help you decide whether your server should be maintained in the next four hours based on its current state, or whether a customer of yours is going to switch to your competitor.
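As a toy illustration of this training-then-predicting flow, with entirely made-up server readings:

# Train a classifier on historical server readings, then predict whether
# a server in a given state needs maintenance. All numbers are invented.
from sklearn.ensemble import RandomForestClassifier

# Each row: [cpu_load, temperature, hours_since_last_maintenance]
X_train = [[0.9, 80, 200], [0.2, 45, 10], [0.8, 75, 150], [0.3, 50, 30]]
y_train = [1, 0, 1, 0]  # 1 = maintenance was needed, 0 = it was not

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predict for a server's current state (data the model hasn't seen)
print(model.predict([[0.85, 78, 180]]))  # for example, [1]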
Sometimes the problem you are solving will not be well defined, and you will not even know what kind of answers you are looking for. In such cases, ML models will help you explore your dataset, for example by identifying clusters of customers that are similar to each other in terms of behavior, or by finding the hierarchical structure of stocks based on their correlations.
What do you do when your model comes up with clusters of customers? Well, you at least know this: customers that belong to the same cluster are similar to each other in terms of their features, such as their age, profession, marital status, gender, product preferences, daily/weekly/monthly spending habits, total amount spent, and so on. Customers who belong to different clusters are dissimilar to each other. With such an insight, you can utilize this information to create different ad campaigns for each cluster.
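A small sketch of what such clustering looks like in code, with invented customer features (in practice, you would scale the features first, since the spending column would otherwise dominate the distance calculations):

import numpy as np
from sklearn.cluster import KMeans

# Columns: [gender (0 = female, 1 = male), age, total amount spent]
customers = np.array([
    [0.0, 34.0, 11500.0],
    [1.0, 42.0, 12800.0],
    [0.0, 23.0, 1300.0],
    [1.0, 25.0, 900.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # cluster assignment per customer, e.g. [1 1 0 0]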
To put things into a more technical perspective, let's describe this process in simple mathematical terms. There is a dataset, X, which contains n examples. These examples could represent customers or different species of animals. Each example is usually a set of real numbers called features; for example, a 35-year-old female customer who spent $12,000 at your store can be represented by the vector (0.0, 35.0, 12000.0). Note that gender is represented by 0.0 here; a male customer would have 1.0 for that feature. The size of the vector is the dataset's dimensionality, usually denoted by m; since this vector has size three, this is a three-dimensional dataset.
Depending on the problem type, you might need a label for each example. For example, if this is a supervised learning problem, such as binary classification, you would label your examples with 1.0 or 0.0; this new variable is called the label or target variable. The target variable is usually referred to as y.
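In code, X and y are simply arrays; the first row below mirrors the customer vector from the text, while the second row and the labels are invented for illustration:

import numpy as np

X = np.array([
    [0.0, 35.0, 12000.0],  # female, 35 years old, spent $12,000
    [1.0, 28.0, 4500.0],   # male, 28 years old, spent $4,500 (invented)
])
y = np.array([1.0, 0.0])   # binary labels, e.g. 1.0 = repeat buyer (assumed)

n, m = X.shape  # n examples, m features (the dimensionality)
print(n, m)     # 2 3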
Having x and y, an ML model is simply a function, f, with weights, w (the model parameters):
y = f(x, w)
Model parameters are learned during the training process, but there are also other parameters that you might need to set before training starts, and these parameters are called hyperparameters, which will be explained shortly.
Features in your dataset should usually be preprocessed before being used in model training. For example, some ML models implicitly assume that features are normally distributed. In many real-life scenarios this is not the case, and you can benefit from applying feature transformations, such as a log transformation, to make them approximately normally distributed.
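As a quick sketch, here is a log transformation applied to a synthetically skewed feature (the data is generated, and scipy is assumed to be installed for measuring skewness):

import numpy as np
from scipy.stats import skew

rng = np.random.RandomState(0)
spending = rng.lognormal(mean=8, sigma=1, size=1000)  # right-skewed feature

# log1p computes log(1 + x), which is safe even when values can be zero
log_spending = np.log1p(spending)

# Skewness drops sharply after the transform
print(skew(spending), skew(log_spending))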
Once feature preprocessing is done and the model hyperparameters are set, model training starts. At the end of model training, the model parameters have been learned, and we can predict the target variable for new data that the model has not seen before. The prediction made by the model is usually referred to as ŷ:
ŷ = f(x, w)
What really happens during training? Since we know the labels for the dataset we used for training, we can iteratively update our model parameters based on the comparison of what our current model predicts and what the original label was.
This comparison is based on a function called the loss function (or cost function), L(y, ŷ). The loss function represents the inaccuracy of the predictions. Some common loss functions you may have heard of are square loss, hinge loss, logistic loss, and cross-entropy loss.
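Two of these losses can be computed directly with scikit-learn on toy predictions (the numbers below are invented):

import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probabilities

print(log_loss(y_true, y_pred))            # logistic/cross-entropy loss
print(mean_squared_error(y_true, y_pred))  # square loss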
Once model training is done, you will test the performance of your ML model on test data, which is data that was not used in the training process, to see how well your model generalizes. You can use different performance metrics to assess the performance; based on the results, you should go back to previous steps and make adjustments to achieve better performance.
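Here is a minimal sketch of this evaluate-on-held-out-data step, using a dataset bundled with scikit-learn as a stand-in:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Performance on data the model has never seen during training
print(accuracy_score(y_test, model.predict(X_test)))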
At this point, you should have an overall idea of what training an ML model looks like under the hood.
What is AutoML then? When we are talking about AutoML, we mostly refer to automated data preparation (namely feature preprocessing, generation, and selection) and model training (model selection and hyperparameter optimization). The number of possible options for each step of this process can vary vastly depending on the problem type.
AutoML allows researchers and practitioners to automatically build ML pipelines out of these possible options for every step to find high-performing ML models for a given problem.
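As a brief preview of what this looks like in practice, here is a hedged sketch using TPOT, one of the libraries introduced later in this chapter; the parameter values are illustrative, not recommendations:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT searches over preprocessors, models, and hyperparameters for you
tpot = TPOTClassifier(generations=5, population_size=20, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')  # writes the winning pipeline out as code

A single fit call here stands in for the whole search over pipeline candidates that you would otherwise assemble and tune by hand.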
The following figure shows a typical ML model life cycle with a couple of examples for every step:
Data can be ingested from various sources, such as flat files, databases, and APIs. Once you have ingested the data, you should process it to make it ready for ML; typical operations include cleaning and formatting, feature transformation, and feature selection. After data processing, your final dataset should be ready for ML, and you will shortlist candidate algorithms to work with. The shortlisted algorithms should be validated and tuned through techniques such as cross-validation and hyperparameter optimization. Your final model will then be ready to be operationalized with a suitable workload type, such as online, batch, or streaming deployment. Once the model is in production, you need to monitor its performance and take the necessary actions when needed, such as retraining, re-evaluation, and redeployment.
When you are faced with building ML models, you will first research the domain you are working in and identify your objective. There are many steps involved in the process that should be planned and documented in advance, before you actually start working on the project. To learn more about the whole process of project management, you can refer to the CRISP-DM model (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining). Project management is crucially important for delivering a successful application; however, it is beyond the scope of this book.