Machine Learning at Scale with H2O - Gregory Keys - E-Book

Description

H2O is an open source, fast, and scalable machine learning framework that allows you to build models using big data and then easily productionalize them in diverse enterprise environments.
Machine Learning at Scale with H2O begins with an overview of the challenges faced in building machine learning models on large enterprise systems, and then addresses how H2O helps you to overcome them. You’ll start by exploring H2O’s in-memory distributed architecture and find out how it enables you to build highly accurate and explainable models on massive datasets using your favorite ML algorithms, language, and IDE. You’ll also get to grips with the seamless integration of H2O model building and deployment with Spark using H2O Sparkling Water. You’ll then learn how to easily deploy models with H2O MOJO. Next, the book shows you how H2O Enterprise Steam handles admin configurations and user management, and then helps you to identify different stakeholder perspectives that a data scientist must understand in order to succeed in an enterprise setting. Finally, you’ll be introduced to the H2O AI Cloud platform and explore the entire machine learning life cycle using multiple advanced AI capabilities.
By the end of this book, you’ll be able to build and deploy advanced, state-of-the-art machine learning models for your business needs.

Formats: EPUB, MOBI

Page count: 418

Year of publication: 2022


Machine Learning at Scale with H2O

A practical guide to building and deploying machine learning models on enterprise systems

Gregory Keys

David Whiting

BIRMINGHAM—MUMBAI

Machine Learning at Scale with H2O

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Aditi Gour

Senior Editor: David Sugarman

Content Development Editor: Manikandan Kurup

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Alishon Mendonca

Marketing Coordinator: Abeer Riyaz Dawe

First published: July 2022

Production reference: 1290622

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-601-9

www.packt.com

My deepest love and warmth to Mary, Julia and Alexa for their support and understanding while husband and dad disappeared to the basement for significant chunks of nights and weekends as the seasons progressed.

- Gregory

To my wife Kathy, and son Ben, who endured too many late nights and weekends of dad locked away in his study working; the book has been a family effort and its culmination a family success.

- David

Acknowledgments

This book would not have been possible without the approval and support of our respective leaders at H2O.ai at the time of its writing, Dmitry Baev and Eyal Kaldes. In addition, we pay our great appreciation to the deep expertise of the many Makers at H2O.ai. Their day-to-day collaboration, education, and machine learning expertise are diffused throughout the pages of this book.

One name needs to be called out in particular: massive thanks to Eric Gudgeon for his infinite and unrelenting technical teachings, and for defining and developing a vast landscape of H2O model deployment implementations.

This book took longer to pull together than either of us expected. Working at a hyper-focused and highly energized company certainly was a contributing factor. Against this backdrop, we appreciate the world-class patience, encouragement, guidance, and professionalism of the Packt team in collaborating on this book from start to finish.

And most importantly there is family, who unfairly signed up for book writing without fully knowing it.

Contributors

About the authors

Gregory Keys is a master principal cloud architect for Data and AI at Oracle. Formerly a senior solutions architect at H2O.ai, he has over 20 years of experience designing and implementing software and data systems. He specializes in AI/ML solutions and has multiple software patents. Gregory has a PhD in evolutionary biology, which has greatly influenced him as a systems thinker.

David Whiting is a data science director and head of training at H2O.ai. He has a PhD in statistics from Texas A&M University and over 25 years of professional experience in academia, consulting, and industry. He has built and led data science teams in financial services and other regulated enterprises.

About the reviewers

Jan Gamec is a lead software engineer at H2O.ai and one of the top contributors to a state-of-the-art AutoML platform called Driverless AI. In the past decade, he has contributed to various projects, focusing on machine learning, cryptography, and web technologies, either in the public or academic sector. Jan holds a master's degree in machine learning and computer science from CTU, Czech Republic, with the main focus of interest being genetic programming, neural networks, and reinforcement learning.

Jagadeesh Rajarajan has over 10 years of experience in building scalable data science systems. He has rich domain knowledge in the following areas: search relevance (information retrieval), recommender systems, AI for customer engagement (acquisition, activation, and retention), MLOps, and interpretable machine learning systems.

Eric Gudgeon has worked on many large complex systems, built nationwide networks, and helped customers deploy highly scalable low-latency solutions. He has a passion for technology and finding creative solutions to problems.

Ondrej Bilek is a lead software engineer at H2O.ai and has rich experience designing and implementing machine learning platforms for Hadoop and Kubernetes. He led the development of Enterprise Steam and is currently working on the H2O AI Cloud.

Table of Contents

Preface

Section 1 – Introduction to the H2O Machine Learning Platform for Data at Scale

Chapter 1: Opportunities and Challenges

ML at scale

The ML life cycle and three challenge areas for ML at scale

A simplified ML life cycle

The model building challenge – state-of-the-art models at scale

The business challenge – getting your models into enterprise production systems

The navigation challenge – navigating the enterprise stakeholder landscape

H2O.ai's answer to these challenges

Summary

Chapter 2: Platform Components and Key Concepts

Technical requirements

Hello World – the H2O machine learning code

Code example

Some issues of scale

The components of H2O machine learning at scale

H2O Core – in-memory distributed model building

H2O Enterprise Steam – a managed, self-provisioning portal

The H2O MOJO – a flexible, low-latency scoring artifact

The workflow using H2O components

H2O key concepts

The data scientist's experience

The H2O cluster

Enterprise Steam as an H2O gateway

Enterprise Steam and the H2O Core high-level architecture

Sparkling Water allows users to code in H2O and Spark seamlessly

MOJOs export as DevOps-friendly artifacts

Summary

Chapter 3: Fundamental Workflow – Data to Deployable Model

Technical requirements

Use case and data overview

The fundamental workflow

Step 1 – launching the H2O cluster

Step 2 – connecting to the H2O cluster

Step 3 – building the model

Step 4 – evaluating and explaining the model

Step 5 – exporting the model's scoring artifact

Step 6 – shutting down the cluster

Variation points – alternatives and extensions to the fundamental workflow

Launching an H2O cluster using the Enterprise Steam API versus the UI (step 1)

Launching an H2O-3 versus Sparkling Water cluster (step 1)

Implementing Enterprise Steam or not (steps 1–2)

Using a personal access token to log in to Enterprise Steam (step 2)

Building the model (step 3)

Evaluating and explaining the model (step 4)

Exporting the model's scoring artifact (step 5)

Shutting down the cluster (step 6)

Summary

Section 2 – Building State-of-the-Art Models on Large Data Volumes Using H2O

Chapter 4: H2O Model Building at Scale – Capability Articulation

H2O data capabilities during model building

Ingesting data from the source to the H2O cluster

Manipulating data in the H2O cluster

Exporting data out of the H2O cluster

Additional data capabilities provided by Sparkling Water

H2O machine learning algorithms

H2O unsupervised learning algorithms

H2O supervised learning algorithms

Parameters and hyperparameters

H2O extensions of supervised learning

Miscellaneous

H2O modeling capabilities

H2O model training capabilities

H2O model evaluation capabilities

H2O model explainability capabilities

H2O trained model artifacts

Summary

Chapter 5: Advanced Model Building – Part I

Technical requirements

Splitting data for validation or cross-validation and testing

Train, validate, and test set splits

Train and test splits for k-fold cross-validation

Algorithm considerations

An introduction to decision trees

Random forests

Gradient boosting

Baseline model training

Model optimization with grid search

Step 1 – a Cartesian grid search to focus on the best tree depth

Step 2 – a random grid search to tune other parameters

H2O AutoML

The AutoML leaderboard

Feature engineering options

Target encoding

Other feature engineering options

Leveraging H2O Flow to enhance your IDE workflow

Monitoring with Flow

Interactive investigations with Flow

Putting it all together – algorithms, feature engineering, grid search, and AutoML

An enhanced AutoML procedure

Summary

Chapter 6: Advanced Model Building – Part II

Technical requirements

Modeling in Sparkling Water

Introducing Sparkling Water pipelines

Implementing a sentiment analysis pipeline

Importing the raw Amazon data

Defining Spark pipeline stages

Creating a Sparkling Water pipeline

Looking ahead – a production preview

UL methods in H2O

What is anomaly detection?

Isolation forests in H2O

Best practices for updating H2O models

Retraining models

Checkpointing models

Ensuring H2O model reproducibility

Case 1 – Reproducibility in single-node clusters

Case 2 – Reproducibility in multi-node clusters

Reproducibility for specific algorithms

Best practices for reproducibility

Summary

Chapter 7: Understanding ML Models

Selecting model performance metrics

Explaining models built in H2O

A simple introduction to Shapley values

Global explanations for single models

Local explanations for single models

Global explanations for multiple models

Automated model documentation (H2O AutoDoc)

Summary

Chapter 8: Putting It All Together

Technical requirements

Data wrangling

Importing the raw data

Defining the problem and creating the response variable

Converting implied numeric data from strings into numeric values

Cleaning up messy categorical columns

Feature engineering

Algebraic transformations

Features engineered from dates

Simplifying categorical variables by combining categories

Missing value indicator functions

Target encoding categorical columns

Model building and evaluation

Model search and optimization with AutoML

Investigating global explainability with AutoML models

Selecting a model from the AutoML candidates

Final model evaluation

Preparation for model pipeline deployment

Summary

Section 3 – Deploying Your Models to Production Environments

Chapter 9: Production Scoring and the H2O MOJO

Technical requirements

The model building and model scoring contexts

Model training to production model scoring

H2O production scoring

End-to-end production scoring pipeline with H2O

Target production systems for H2O MOJOs

H2O MOJO deep dive

What is a MOJO?

Deploying a MOJO

Wrapping MOJOs using the H2O MOJO API

Obtaining the MOJO runtime

The h2o-genmodel API

A generalized approach to wrapping your MOJO

Wrapping example – Build a batch file scorer in Java

Other things to know about MOJOs

Inspecting MOJO decision logic

MOJO and POJO

Summary

Chapter 10: H2O Model Deployment Patterns

Technical requirements

Surveying a sample of MOJO deployment patterns

H2O software

Third-party software integrations

Your software integrations

Accelerators based on H2O Driverless AI integrations

Exploring examples of MOJO scoring with H2O software

H2O MLOps

H2O eScorer

H2O batch database scorer

H2O batch file scorer

H2O Kafka scorer

H2O batch scoring on Spark

Exploring examples of MOJO scoring with third-party software

Snowflake integration

Teradata integration

BI tool integration

UiPath integration

Exploring examples of MOJO scoring with your target-system software

Your software application

On-device scoring

Exploring examples of accelerators based on H2O Driverless AI integrations

Apache NiFi

Apache Flink

AWS Lambda

AWS SageMaker

Summary

Section 4 – Enterprise Stakeholder Perspectives

Chapter 11: The Administrator and Operations Views

A model building and deployment view – the personas on the ground

View 1 – Enterprise Steam administrator

Enterprise Steam administrator concerns

Enterprise Steam configurations

H2O user governance from Enterprise Steam

Enterprise Steam configurations

Server cluster (backend) integration

H2O-3 and Sparkling Water management

Restarting Enterprise Steam

View 2 – The operations team

Enterprise Steam server Ops

H2O cluster Ops

MLOps

View 3 – The data scientist

Interactions with Enterprise Steam administrators

Interactions with H2O cluster (Hadoop or Kubernetes) Ops teams

Interactions with MLOps teams

Summary

Chapter 12: The Enterprise Architect and Security Views

Technical requirements

The enterprise and security architect view

H2O at Scale enterprise architecture

H2O at Scale implementation patterns

Component integration architecture

Communication architecture

Deployment architecture

H2O at Scale security

Data movement and privacy

User authentication and access control

Network and firewall

The data scientist's view of enterprise and security architecture

Summary

Section 5 – Broadening the View – Data to AI Applications with the H2O AI Cloud Platform

Chapter 13: Introducing H2O AI Cloud

Technical requirements

An H2O AI Cloud overview

H2O AI Cloud component breakdown

DistributedML (H2O-3 and H2O Sparkling Water)

H2O AutoML (H2O Driverless AI)

DeepLearningML (H2O Hydrogen Torch)

DocumentML (H2O Document AI)

A self-provisioning service (H2O Enterprise Steam)

Feature Store (H2O AI Feature Store)

MLOps (H2O MLOps)

Low-code SDK for AI applications (H2O Wave)

App Store (H2O AI App Store)

H2O AI Cloud architecture

Summary

Chapter 14: H2O at Scale in a Larger Platform Context

Technical requirements

A quick recap of H2O AI Cloud

Exploring a baseline reference solution for H2O at scale

Exploring new possibilities for H2O at scale

Leveraging H2O Driverless AI for prototyping and feature discovery

Integrating H2O MLOps for model monitoring, management, and governance

Leveraging H2O AI Feature Store for feature operationalization and reuse

Consuming predictions in a business context from a Wave AI app

Integrating an automated retraining pipeline in a Wave AI app

A Reference H2O Wave app as an enterprise AI integration fabric

Summary

Appendix: Alternative Methods to Launch H2O Clusters

Local H2O-3 cluster

Step 1 – Install H2O-3 in Python

Step 2 – Launch your H2O-3 cluster and write code

Local Sparkling Water cluster

Step 1 – Install Spark locally

Step 2 – Install Sparkling Water in Python

Step 3 – Install a Sparkling Water Python interactive shell

Step 4 – Launch a Jupyter notebook on top of the Sparkling Water shell

Step 5 – Launch your Sparkling Water cluster and write code

H2O-3 cluster in the 90-day free trial environment for H2O AI Cloud

Step 1 – Get your 90-day trial to H2O AI Cloud

Step 2 – Set up your Python environment

Step 3 – Launch your cluster

Step 4 – Write H2O-3 code

Other Books You May Enjoy

Section 1 – Introduction to the H2O Machine Learning Platform for Data at Scale

This section provides a general background of machine learning (ML) at scale with H2O. We will define ML at scale, focus on its challenges, and then see how H2O overcomes these challenges. We will then overview each H2O component to better understand its purpose and how it works from a technical standpoint. We will then put the components to work by implementing a minimal workflow. After this section, we will be ready to dive into advanced topics and techniques.

This section comprises the following chapters:

Chapter 1, Opportunities and Challenges

Chapter 2, Platform Components and Key Concepts

Chapter 3, Fundamental Workflow – Data to Deployable Model

Chapter 1: Opportunities and Challenges

Machine Learning (ML) and data science are winning a popularity contest of sorts, as witnessed by their headline coverage in the popular and professional press and by expanding job openings across the technology landscape. Students typically learn ML techniques using their own computers on relatively small datasets. Those who enter the field often find themselves in the much different setting of a large company buzzing with workers performing specialized job roles, while collaborating with others scattered across the nation or world. Both data science students and data science workers have a few key things in common – they are in an exciting and growing field that businesses deem ever more critical to their future, and the data they thrive on is becoming exponentially more abundant and diverse.

There are huge opportunities for ML in enterprises because the transformational impacts of ML on businesses, customers, patients, and so on are diverse, widespread, lucrative, and life-changing. A backdrop of urgency exists as well, because competitors are all attempting the same thing. Enterprises are thus incentivized to invest in significant ML transformations and to supply the necessary data, tooling, production systems, and people to journey toward ML success. But challenges loom large as well, and these challenges commonly revolve around scale. The challenges of scale take on many forms inherent to ML at an enterprise level.

In this chapter, we will define and explore the challenge of ML at scale by covering the following main topics:

ML at scale

The ML life cycle and three challenge areas for ML at scale

H2O.ai's answer to these challenges

ML at scale

This book is about implementing ML at scale and how to use H2O.ai technology to succeed in doing so. What specifically do we mean by ML at scale? We can see three contexts and challenges of scale during the ML life cycle – building models from large datasets, deploying these models in enterprise production environments, and executing the full range of ML activities within the complexities of enterprise processes and stakeholders. This is summarized in the following figure:

Figure 1.1 – The challenges of ML at scale

Let's drill down further on these challenges. Before doing so, we will review a generic conception of the ML life cycle, which will be useful as a reference throughout the book.

The ML life cycle and three challenge areas for ML at scale

The ML life cycle is a process that data scientists and enterprise stakeholders follow to build ML models and put them into production environments, where they make predictions and achieve value. In this section, we will define a simplified ML life cycle and elaborate on three broad areas that present special challenges for ML at scale.

A simplified ML life cycle

We will use the following ML life cycle representation. The goal is to achieve a simplified depiction that we can all recognize as central to ML while avoiding attempts at a canonical definition. Let's use it as our working framework for discussion:

Figure 1.2 – A simplified ML life cycle

The following is a brief articulation.

Model building

Model building is a highly iterative process with frequent and unpredictable feedback loops along the way toward building a predictive model that is worthy of deploying in a business context. The steps can be summarized as follows:

Data ingestion: Data is pulled from sources or a storage layer in the model building environment. There is often significant work onward from here in finding and accessing potentially useful data sources and transforming the data into a usable form. Typically, this is done as part of a larger data pipeline and architecture.

Data exploration: Data is explored to understand its qualities (for example, data profiling, correlation analysis, outlier detection, and data visualization).

Data manipulation: Data is cleaned (for example, the imputation of missing data, the reduction of categorical features, and normalization) and new features are engineered.

Model training: An ML algorithm, scoring metric, and validation method are selected, and the model is tuned across a range of hyperparameters and tested against a test dataset.

Model evaluation and explainability: The fit of the model is diagnosed for performance metrics, overfitting, and other diagnostics; model explainability is used to validate against domain knowledge, to explain the model decisions at individual and global levels, and to guard against institutional risks such as unfair bias against demographic groups.

Model deployment: The model is deployed as a scoring artifact to a software system and live scoring is performed.

Model monitoring: The model is monitored to detect whether the data fed into it changes over time compared to the distribution of data it was trained on. This is called data drift and usually leads to decreased predictive power of the model, which in turn triggers the need to retrain the model with a more current dataset and then redeploy the updated model. The model may also be monitored for other patterns, such as whether it is biasing decisions against a particular demographic group and whether malicious attacks are being made to try to cause the model to malfunction.
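The data drift check described under model monitoring can be sketched in a few lines. The example below is a minimal illustration and not an H2O feature: it computes the Population Stability Index (PSI), one common drift statistic, between a training sample and a scoring sample of a single numeric feature. The function name, the bin count, the synthetic data, and the 0.2 alert threshold are all assumptions chosen for illustration.

```python
import math
import random


def population_stability_index(expected, actual, bins=10):
    """Return the PSI between two samples of one numeric feature.

    `expected` is the training-time sample; `actual` is the sample seen
    at scoring time. Bin edges are equal-width over the training range.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the training range into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(42)
train = [rng.gauss(0.0, 1.0) for _ in range(10_000)]    # training data
same = [rng.gauss(0.0, 1.0) for _ in range(10_000)]     # no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(10_000)]  # mean has moved

psi_same = population_stability_index(train, same)
psi_shifted = population_stability_index(train, shifted)

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb alert level
print(f"no drift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
print("retrain?", psi_shifted > DRIFT_THRESHOLD)
```

In a monitoring job, a check like this would typically run per feature on each new batch of scored data, with a value above the chosen threshold prompting a closer look and possibly a retrain.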

As mentioned, a key property in the workflow is the unknown number and sequence of iteration pathways taken between these steps before a model is deployed or before the project is deemed unsuccessful in reaching that stage.

The model building challenge – state-of-the-art models at scale

Let's, for now, define a large dataset as any dataset that exceeds your ability to build ML models on your laptop or local workstation. It may be too large because your libraries simply crash or because they take an unreasonable amount of time to complete. This may occur during model training or during data ingestion, exploration, and manipulation.

We can see four separate challenges of building ML models from large data volumes, with each contributing to a larger problem in general that we call the friction of iteration. This is represented in the following diagram:

Figure 1.3 – The challenge of model building with large data volumes

Let's elaborate on this.

Challenge one – data size and location

Enterprises collect and store vast amounts of diverse data and that is a boon to the data scientist looking to build accurate models. These datasets are either stored across many systems or centralized in a common storage layer (data lake) such as the Hadoop Distributed File System (HDFS) or AWS S3. Architecting and making data available to internal consumers is a major effort and challenge for an enterprise. However, the data scientist starting the ML life cycle with large datasets typically cannot move that data, once it becomes accessible, to a local environment due to either security reasons or the high volume of data. The consequence is that the data scientist must do one of the following:

Move operations on the data (in other words, move the compute) to the data itself.

Move data to a high-compute environment that they are authorized to use.

Challenge two – data size and data manipulation

Manipulating data can be compute-intensive, and attempting to do so against insufficient resources will either cause the compute to fail (for example, the script, library, or tool will crash) or take an unreasonably long amount of time. Who wants to wait 10 hours to join and filter table data when it can be done in 10 minutes? What you might consider an unreasonable amount of time is obviously relative to the dataset size; terabytes of data will always take longer to process than a few megabytes. Regardless, the speed of your data processing is critical to reducing the total time of your iterations.

Challenge three – data size and data exploration

Challenges of data size during data exploration are identical to those during data manipulation. The data may be so large that your processing crashes or takes an unreasonable amount of time to complete while you explore it.

Challenge four – data size and model training

ML algorithms are extremely compute-intensive because they step through each record of a dataset and perform complex calculations each time, and then iterate these calculations against the dataset repeatedly to optimize toward a training metric and thus learn a predictive mathematical pattern among the noise. Our compute environment is particularly pressured during model training.
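To make the cost of repeated passes concrete, here is a minimal sketch of batch gradient descent for a one-feature linear model in plain Python. It is not how H2O trains models; the function name, learning rate, epoch count, and synthetic data are assumptions for illustration. The point is the counter: the number of record-level calculations grows as the number of records times the number of passes.

```python
import random


def fit_line(xs, ys, epochs=200, learning_rate=0.1):
    """Fit y = w * x + b by batch gradient descent on mean squared error.

    Returns (w, b, record_visits), where record_visits counts how many
    times a single record was touched during training.
    """
    w, b, record_visits = 0.0, 0.0, 0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):  # one full pass over the dataset
            error = (w * x + b) - y
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
            record_visits += 1
        # Step against the gradient once per pass.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, record_visits


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1_000)]
ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]  # true w=3, b=0.5

w, b, visits = fit_line(xs, ys)
print(f"w={w:.2f}, b={b:.2f}, record visits={visits:,}")
```

Even this toy problem touches every record on every pass; replacing 1,000 rows with billions, and one feature with hundreds, shows why training is where a compute environment is strained most.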

Up until now, we have been discussing dataset size in relative terms; that is, large data volumes are those that cause operations on them to either fail or take a long time to complete in a given compute environment. 

In absolute terms, data scientists often explore the largest dataset possible to understand it and then sample it for model training. Others always try to use the largest dataset for model training. However, accurate models can be built from 10 GB or less of sampled or unsampled data.

The key to proper use of sampling is that you have followed appropriate statistical and theoretical practices, and not that you are forced to do so because your ML processing will crash or take a long time to complete due to large data volumes. The latter is a bad practice that produces inferior models and H2O.ai overcomes this by allowing model building with massive data volumes.

There are also cases when data sampling may not lead to an acceptable model. In other words, the data scientist may need hundreds of gigabytes or a terabyte or more of data to build a valuable model. These are cases when the following applies:

The data scientist does not trust the sampling to produce the best model and feels that each small gain in lift warrants the use of the full dataset.

The data scientist does not want to segment the data into separate datasets and thus separate model building exercises, or the larger stakeholder group wants a single model in production that predicts against all segments versus many that each predicts against a single segment.

The data is highly dimensional, sparse, or both. In this case, a large number of records are needed to reduce variance and overfitting to a training dataset. This type of dataset is typical for anomaly detection, recommendation engines, predictive maintenance, security threat detection, personalized medicine, and so on. It is worth noting that the future will bring us more and more data, and thus highly dimensional and sparse datasets will become more common.

The data is extremely imbalanced. The target variable is very rare in the dataset and a massive dataset is needed to avoid underfitting, overfitting, or weighting the target variable from these infrequent records.

The data is highly volatile. Each subset of data that is collected is unrepresentative of the others and thus sampling or cross-validation folds may not be representative. Time series forecasting may be particularly sensitive to this problem, especially when forecast categories are highly granular (for example, yearly, monthly, daily, and hourly) against a single validation dataset.
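The imbalanced-data case can be checked with simple arithmetic before deciding to sample. The sketch below uses hypothetical numbers (one billion rows and a positive rate of 1 in 10,000) to show how few positive records are expected to survive at different sampling fractions. The function name and all figures are assumptions for illustration.

```python
def expected_positives(total_rows, positive_rate, sample_fraction):
    """Expected number of positive-class rows in a simple random sample."""
    return total_rows * positive_rate * sample_fraction


TOTAL_ROWS = 1_000_000_000   # hypothetical dataset size
POSITIVE_RATE = 0.0001       # hypothetical: 1 positive per 10,000 rows

for fraction in (1.0, 0.01, 0.0001):
    rows = round(TOTAL_ROWS * fraction)
    positives = expected_positives(TOTAL_ROWS, POSITIVE_RATE, fraction)
    print(f"sample {fraction:>7.2%}: {rows:>13,} rows, "
          f"~{positives:,.0f} positives")
```

At the smallest fraction the sample still holds 100,000 rows but only about ten positives, which leaves a model very little signal for the rare class.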

The friction of iteration

Model building is a highly iterative process, and anything that slows it down we call the friction of iteration. This friction can be due to the challenges of working with large data volumes, as previously discussed. It can also arise from simple workflow patterns such as switching among systems between each iteration or launching new environments to work on an iteration.

Any slowness during a single iteration may seem acceptable but when multiplied across the seemingly endless iterations from the project beginning to failure or success, the cost in time from this friction becomes significant, and reducing friction can be valuable. As we will see in the next section, slow model building delays the main goal of ML in an enterprise – achieving business value.
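The cumulative cost of friction is easy to underestimate, so a back-of-the-envelope calculation helps. The sketch below reuses the 10-hour versus 10-minute contrast from the data manipulation discussion; the iteration count of 50 is a hypothetical figure, not one taken from a real project.

```python
def total_hours(minutes_per_iteration, iterations):
    """Total elapsed hours spent waiting across all iterations."""
    return minutes_per_iteration * iterations / 60


ITERATIONS = 50  # hypothetical number of iterations in one project

slow = total_hours(minutes_per_iteration=600, iterations=ITERATIONS)
fast = total_hours(minutes_per_iteration=10, iterations=ITERATIONS)

print(f"slow environment: {slow:.0f} hours")
print(f"fast environment: {fast:.1f} hours")
print(f"time recovered:   {slow - fast:.1f} hours")
```

Under these assumptions the slow environment costs hundreds of hours of waiting over the project, while the fast one costs less than a working day.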

The business challenge – getting your models into enterprise production systems

The bare truth about ML initiatives is that they do not really achieve value until they are deployed to a live scoring environment. Models must meet evaluation criteria and be put into production to be deemed successful. Until that happens, from a business standpoint, little is achieved. This may seem a bit harsh, but it is typically how success is defined in data science initiatives. The following diagram maps this thinking onto the ML life cycle:

Figure 1.4 – The ML life cycle value chain

The friction of iteration from this view is thus a cost. Time taken to iterate through model building is time taken from getting business results. In other words, lower friction translates to less time to build and deploy a model to achieve business value, and more time to work on other problems and thus more models per quarter or year.

From the same point of view, time to deploy a model is viewed as a cost for similar reasons. The model deployment step may seem like a simple one-step sequence of transitioning the model to DevOps, but typically it is not. Anything that makes a model easier and more repeatable to deploy, document, and govern helps businesses achieve value sooner.

Let's now continue expanding on a larger landscape of enterprise stakeholders that data scientists must work with to build models that ultimately achieve business value.

The navigation challenge – navigating the enterprise stakeholder landscape

The data scientist in any enterprise does not work in isolation. There are multiple stakeholders who become involved directly in the ML life cycle or, more broadly, in the business cycle of initiating and consuming ML projects. Who might some of these stakeholders be? At a bare minimum, they include the business stakeholder who funded the ML project, the administrator providing the data scientist with permissions and capabilities, the DevOps or engineering team members who are responsible for model deployment and the infrastructure supporting it, perhaps marketing or sales associates whose functions are impacted directly by the model, and any other representatives of the internal or external consumers of the model. In more heavily regulated industries such as banking, insurance, or pharmaceuticals, these might include representatives or offices of various audit and risk functions – data risk, code risk, model risk, legal risk, reputational risk, compliance, external regulators, and so on. The following figure shows a general view:

Figure 1.5 – Data scientists working with enterprise stakeholders and processes

Stakeholder interaction is thus complex. What leads to this complexity? Obviously, the specialization and siloing of job functions make things complex, and this is further amplified by the scale of the enterprise. A larger dynamic of creating repeatable processes and minimizing risk contributes as well. Explaining this complexity is the task of a different book, but its reality in the enterprise is inescapable. To a data scientist, the ability to recognize, influence, negotiate with, deliver to, and ultimately build trust with these various stakeholders is imperative to successful ML solutions at scale. 

Now that we have understood the ML life cycle and the challenges inherent in its successful execution at scale, it is time for a brief introduction to how H2O.ai solves these challenges.

H2O.ai's answer to these challenges

H2O.ai provides software to build ML models at scale and overcome the challenges of doing so – model building at scale, model deployment at scale, and dealing with enterprise stakeholders' concerns and inherent friction along the way. These components are described in brief in the following diagram:

Figure 1.6 – H2O ML at scale

Subsequent chapters of this book elaborate on how these components are used to build and deploy state-of-the-art models within the complexities of the enterprise environment.

Let's take a first look at these components:

H2O Core: This is open source software that distributes state-of-the-art ML algorithms and data manipulations over a specified number of servers in Kubernetes, Hadoop, or Spark environments. Data is partitioned in memory across the designated servers, and ML algorithm computation runs in parallel on those partitions.

This architecture creates horizontal scalability of model building to hundreds of gigabytes or terabytes of data and generally fast processing times at lower data volumes. Data scientists work with familiar IDEs, languages, and algorithms and are abstracted away from the underlying architecture. Thus, for example, a data scientist can run an XGBoost model in Python from a Jupyter notebook against 500 GB of data in Hadoop, similar to doing so with data loaded into their laptop.

H2O Core is often referred to as H2O Open Source and comes in two forms, H2O-3 and Sparkling Water, which we will elaborate on in subsequent chapters. H2O Core can be run as a scaled-down sandbox on a single server or laptop.

H2O Enterprise Steam: This is a web UI or API for data scientists to self-provision and manage their individual H2O Core environments. Self-provisioning includes auto-calculation of horizontal scaling based on user inputs that describe the data. Enterprise Steam is also used by administrators to manage users, including defining boundaries for their resource consumption, and to configure H2O Core integration against Hadoop, Spark, or Kubernetes.

H2O MOJO: This is an easy-to-deploy scoring artifact exportable from models built from H2O Core. MOJOs are low-latency (typically under 100 ms) Java binaries that can run on any Java Virtual Machine (JVM) and thus serve predictions on diverse software systems, such as REST servers, database clients, Amazon SageMaker, Kafka queues, Spark pipelines, Hive user-defined functions (UDFs), and Internet of Things (IoT) devices.

APIs: Each component has a rich set of APIs so that you can automate workflows, including continuous integration and continuous delivery (CI/CD) and retraining pipelines.
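To make the H2O Core and MOJO descriptions above more concrete, here is a minimal sketch of the workflow in Python, using the open source h2o package. It is an illustration under stated assumptions rather than the book's own code: the function names, the data path, the response column, the hyperparameter values, and the output directory are all hypothetical, and it assumes the h2o package is installed and that XGBoost is available on the cluster. Connecting to a cluster that was provisioned through Enterprise Steam is not shown; the sketch either attaches to an already running cluster by URL or starts a local single-node sandbox.

```python
from typing import List, Optional


def predictor_columns(all_columns: List[str], response: str) -> List[str]:
    """Return every column except the response, preserving order."""
    if response not in all_columns:
        raise ValueError(f"response column {response!r} not found")
    return [name for name in all_columns if name != response]


def train_and_export(data_path: str, response: str, mojo_dir: str,
                     cluster_url: Optional[str] = None) -> str:
    """Train an XGBoost classifier on H2O Core and export it as a MOJO.

    Hypothetical helper for illustration. Returns the path of the MOJO file.
    """
    # Imported here so that predictor_columns can be used without h2o installed.
    import h2o
    from h2o.estimators import H2OXGBoostEstimator

    # Attach to an already running cluster, or start a local sandbox.
    if cluster_url:
        h2o.connect(url=cluster_url)
    else:
        h2o.init()

    # import_file asks the cluster to read the data, so the path must be
    # visible to the cluster; the resulting frame lives in cluster memory.
    frame = h2o.import_file(data_path)
    frame[response] = frame[response].asfactor()  # treat as classification
    train, valid = frame.split_frame(ratios=[0.8], seed=42)

    # Illustrative settings only; these are not tuned values.
    model = H2OXGBoostEstimator(ntrees=50, seed=42)
    model.train(x=predictor_columns(frame.columns, response), y=response,
                training_frame=train, validation_frame=valid)

    # Export the scoring artifact for deployment outside the cluster.
    return model.download_mojo(path=mojo_dir)
```

A call such as train_and_export("data/customers.csv", "churn", "mojos") would run the whole sequence on a local sandbox. The same code is intended to work unchanged against a multi-node cluster by passing cluster_url, which is the abstraction from the underlying architecture described earlier.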

The focus of this book is on building and deploying state-of-the-art models at scale using H2O Core with help from Enterprise Steam and deploying those models as MOJOs within the complexities of enterprise environments.

H2O at Scale and H2O AI Cloud

In this book, H2O at scale refers to the combination of H2O Enterprise Steam, H2O Core, and H2O MOJO. We use this name because together these components address the ML at scale challenges described earlier in this chapter, especially through the distributed ML scalability that H2O Core provides for model building.

Note that H2O.ai offers a larger end-to-end ML platform called the H2O AI Cloud. The H2O AI Cloud integrates a hyper-advanced AutoML tool (called H2O Driverless AI) and other model building engines, an MLOps scoring, monitoring, and governance environment (called H2O MLOps), and a low-code software development kit, or SDK (called H2O Wave) with H2O API hooks to build AI applications that publish to the App Store. It also integrates H2O at scale as defined in this book.

H2O at scale can be deployed as standalone or as part of the H2O AI Cloud. In a standalone implementation, Enterprise Steam is not in fact required, but for reasons elaborated on later in this book, it is deemed essential for enterprise implementations.

The majority of this book is focused on H2O at scale. The last part of the book will extend our understanding to the H2O AI Cloud and how H2O at scale components can leverage this larger integrated platform and vice versa.

Summary

In this chapter, we have set the stage for understanding and implementing ML at scale using H2O.ai technology. We have defined multiple forms of scale in an enterprise setting and articulated the challenges to ML from model building, model deployment, and enterprise stakeholder perspectives. We have anchored these challenges ultimately to the end goal of ML – providing business value. Finally, we briefly introduced H2O at scale components used by enterprises to overcome these challenges and achieve business value.

In the next chapter, we'll start to understand these components in greater technical detail so that we can start writing code and doing data science.