Machine Learning Infrastructure and Best Practices for Software Engineers - Miroslaw Staron - E-Book

Description

Although creating a machine learning pipeline, or developing a working prototype of a software system from that pipeline, is straightforward nowadays, the journey toward a professional software system is still extensive. This book covers best practices and recipes that help software engineers transform prototype pipelines into complete software products.
The book begins by introducing the main concepts of professional software systems that leverage machine learning at their core. As you progress, you’ll explore the differences between traditional, non-ML software and machine learning software. The initial best practices will guide you in determining the type of software you need for your product. Subsequently, you will delve into algorithms, covering their selection, development, and testing. You will then explore the infrastructure for machine learning systems, defining best practices for identifying the right data source and ensuring its quality.
Towards the end, you’ll address the most challenging aspect of large-scale machine learning systems – ethics. By exploring and defining best practices for assessing ethical risks and strategies for mitigation, you will conclude the book where it all began – large-scale machine learning software.

You can read this e-book in Legimi apps or in any app that supports the following format:

EPUB

Number of pages: 484

Year of publication: 2024

Machine Learning Infrastructure and Best Practices for Software Engineers

Take your machine learning software from a prototype to a fully fledged software system

Miroslaw Staron

Machine Learning Infrastructure and Best Practices for Software Engineers

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Yasir Ali Khan

Book Project Manager: Hemangi Lotlikar

Senior Editor: Sushma Reddy

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Gokul Raj S.T

DevRel Marketing Coordinator: Vinishka Kalra

First published: January 2024

Production reference: 1170124

Published by

Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83763-406-4

www.packtpub.com

Writing a book with a lot of practical examples requires a lot of extra time, which is often taken from family and friends. I dedicate this book to my family – Alexander, Cornelia, Viktoria, and Sylwia – who always supported and encouraged me, and to my parents and parents-in-law, who shaped me to be who I am.

– Miroslaw Staron

Contributors

About the author

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner’s Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.

I would like to thank my family for their support in writing this book. I would also like to thank my colleagues from the Software Center program who provided me with the ability to develop my ideas and knowledge in this area – in particular, Wilhelm Meding, Jan Bosch, Ola Söder, Gert Frost, Martin Kitchen, Niels Jørgen Strøm, and several other colleagues. One person who really ignited my interest in this area is of course Mirosław “Mirek” Ochodek, to whom I am extremely grateful. I would also like to thank the funders of my research, who supported my studies throughout the years. I would like to thank my Ph.D. students, who challenged me and encouraged me to always dig deeper into the topics. I’m also very grateful to the reviewers of this book – Hongyi Zhang and Sushant K. Pandey, who provided invaluable comments and feedback for the book. Finally, I would like to extend my gratitude to my publishing team – Hemangi Lotlikar, Sushma Reddy, and Anant Jaint – this book would not have materialized without you!

About the reviewers

Hongyi Zhang is a researcher at Chalmers University of Technology with over five years of experience in the fields of machine learning and software engineering. Specializing in machine learning, edge/cloud computing, and software engineering, his research merges machine learning theory and software applications, driving tangible improvements in industrial machine learning ecosystems.

Sushant Kumar Pandey is a dedicated post-doctoral researcher at the Department of CSE, Chalmers at the University of Gothenburg, Sweden, who seamlessly integrates academia with industry, collaborating with Volvo Cars in Gothenburg. Armed with a Ph.D. in CSE from the esteemed Indian Institute of Technology (BHU), India, Sushant specializes in the application of AI in software engineering. His research advances technology’s transformative potential. As a respected reviewer for prestigious venues such as IST, KBS, EASE, and ESWA, Sushant actively contributes to shaping the discourse in his field. Beyond research, he leverages his expertise to mentor students, fostering innovation and excellence in the next generation of professionals.

Table of Contents

Preface

Part 1: Machine Learning Landscape in Software Engineering

1

Machine Learning Compared to Traditional Software

Machine learning is not traditional software

Supervised, unsupervised, and reinforcement learning – it is just the beginning

An example of traditional and machine learning software

Probability and software – how well they go together

Testing and evaluation – the same but different

Summary

References

2

Elements of a Machine Learning System

Elements of a production machine learning system

Data and algorithms

Data collection

Feature extraction

Data validation

Configuration and monitoring

Configuration

Monitoring

Infrastructure and resource management

Data serving infrastructure

Computational infrastructure

How this all comes together – machine learning pipelines

References

3

Data in Software Systems – Text, Images, Code, and Their Annotations

Raw data and features – what are the differences?

Images

Text

Visualization of output from more advanced text processing

Structured text – source code of programs

Every data has its purpose – annotations and tasks

Annotating text for intent recognition

Where different types of data can be used together – an outlook on multi-modal data models

References

4

Data Acquisition, Data Quality, and Noise

Sources of data and what we can do with them

Extracting data from software engineering tools – Gerrit and Jira

Extracting data from product databases – GitHub and Git

Data quality

Noise

Summary

References

5

Quantifying and Improving Data Properties

Feature engineering – the basics

Clean data

Noise in data management

Attribute noise

Splitting data

How ML models handle noise

References

Part 2: Data Acquisition and Management

6

Processing Data in Machine Learning Systems

Numerical data

Summarizing the data

Diving deeper into correlations

Summarizing individual measures

Reducing the number of measures – PCA

Other types of data – images

Text data

Toward feature engineering

References

7

Feature Engineering for Numerical and Image Data

Feature engineering

Feature engineering for numerical data

PCA

t-SNE

ICA

Locally linear embedding

Linear discriminant analysis

Autoencoders

Feature engineering for image data

Summary

References

8

Feature Engineering for Natural Language Data

Natural language data in software engineering and the rise of GitHub Copilot

What a tokenizer is and what it does

Bag-of-words and simple tokenizers

WordPiece tokenizer

BPE

The SentencePiece tokenizer

Word embeddings

FastText

From feature extraction to models

References

Part 3: Design and Development of ML Systems

9

Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning)

Why do we need different types of models?

Classical machine learning models

Convolutional neural networks and image processing

BERT and GPT models

Using language models in software systems

Summary

References

10

Training and Evaluating Classical Machine Learning Systems and Neural Networks

Training and testing processes

Training classical machine learning models

Understanding the training process

Random forest and opaque models

Training deep learning models

Misleading results – data leaking

Summary

References

11

Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders

From classical ML to GenAI

The theory behind advanced models – AEs and transformers

AEs

Transformers

Training and evaluation of a RoBERTa model

Training and evaluation of an AE

Developing safety cages to prevent models from breaking the entire system

Summary

References

12

Designing Machine Learning Pipelines (MLOps) and Their Testing

What ML pipelines are

ML pipelines

Elements of MLOps

ML pipelines – how to use ML in the system in practice

Deploying models to HuggingFace

Downloading models from HuggingFace

Raw data-based pipelines

Pipelines for NLP-related tasks

Pipelines for images

Feature-based pipelines

Testing of ML pipelines

Monitoring ML systems at runtime

Summary

References

13

Designing and Implementing Large-Scale, Robust ML Software

ML is not alone

The UI of an ML model

Data storage

Deploying an ML model for numerical data

Deploying a generative ML model for images

Deploying a code completion model as an extension

Summary

References

Part 4: Ethical Aspects of Data Management and ML System Development

14

Ethics in Data Acquisition and Management

Ethics in computer science and software engineering

Data is all around us, but can we really use it?

Ethics behind data from open source systems

Ethics behind data collected from humans

Contracts and legal obligations

References

15

Ethics in Machine Learning Systems

Bias and ML – is it possible to have an objective AI?

Measuring and monitoring for bias

Other metrics of bias

Developing mechanisms to prevent ML bias from spreading throughout the system

Summary

References

16

Integrating ML Systems in Ecosystems

Ecosystems

Creating web services over ML models using Flask

Creating a web service using Flask

Creating a web service that contains a pre-trained ML model

Deploying ML models using Docker

Combining web services into ecosystems

Summary

References

17

Summary and Where to Go Next

To know where we’re going, we need to know where we’ve been

Best practices

Current developments

My view on the future

Final remarks

References

Index

Other Books You May Enjoy

Part 1: Machine Learning Landscape in Software Engineering

Traditionally, machine learning (ML) was considered a niche domain in software engineering, and few large software systems used statistical learning in production. This changed in the 2010s, when recommendation systems started to utilize large quantities of data, for example, to recommend movies, books, or music. The rise of transformer technologies then brought ML further into the mainstream: widely known products such as ChatGPT popularized these techniques and showed that they are no longer niche, but have entered mainstream software products and services. Software engineering needs to keep up, and we need to know how to create software based on these modern machine learning models. In this first part of the book, we look at how machine learning changes software development and how we need to adapt to these changes.

This part has the following chapters:

Chapter 1, Machine Learning Compared to Traditional Software

Chapter 2, Elements of a Machine Learning System

Chapter 3, Data in Software Systems – Text, Images, Code, and Their Annotations

Chapter 4, Data Acquisition, Data Quality, and Noise

Chapter 5, Quantifying and Improving Data Properties