A perfect guide to speed up the predicting power of machine learning algorithms
Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective.
You will start with understanding your data; often, the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features. You will learn to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning on your own machines to automatically learn amazing features for your data.
By the end of the book, you will become proficient in Feature Selection, Feature Learning, and Feature Optimization.
If you are a data science professional or a machine learning engineer looking to strengthen your predictive analytics model, then this book is a perfect guide for you. Some basic understanding of the machine learning concepts and Python scripting would be enough to get started with this book.
Sinan Ozdemir is a data scientist, startup founder, and educator living in the San Francisco Bay Area with his dog, Charlie; cat, Euclid; and bearded dragon, Fiero. He spent his academic career studying pure mathematics at Johns Hopkins University before transitioning to education. He spent several years conducting lectures on data science at Johns Hopkins University and at the General Assembly before founding his own startup, Legion Analytics, which uses artificial intelligence and data science to power enterprise sales teams. After completing a Fellowship at the Y Combinator accelerator, Sinan spent most of his time working on his fast-growing company, while creating educational material for data science.

Divya Susarla is an experienced leader in data methods, implementing and applying tactics across a range of industries and fields including investment management, social enterprise consulting, and wine marketing. She trained in data by way of specializing in Economics and Political Science at the University of California, Irvine, cultivating a passion for teaching by developing an analytically based, international affairs curriculum for students through the Global Connect program. Divya is currently focused on natural language processing and generation techniques at Kylie.ai, a startup helping clients automate their customer support conversations. When she is not busy working on building Kylie.ai and writing educational content, she spends her time traveling across the globe and experimenting with new recipes at her home in Berkeley, CA.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Veena Pagare
Acquisition Editor: Varsha Shetty
Content Development Editor: Tejas Limkar
Technical Editor: Sayli Nikalje
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Datta
Production Coordinator: Shantanu Zagade
First published: January 2018
Production reference: 1190118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78728-760-0
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
- Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
- Improve your learning with Skill Plans built especially for you
- Get a free eBook or video every month
- Mapt is fully searchable
- Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Sinan Ozdemir is a data scientist, start-up founder, and educator living in the San Francisco Bay Area. He studied pure mathematics at Johns Hopkins University. He then spent several years conducting lectures on data science at Johns Hopkins University before founding his own start-up, Kylie.ai, which uses artificial intelligence to clone brand personalities and automate customer service communications.
Sinan is also the author of Principles of Data Science, available through Packt.
Divya Susarla is an experienced leader in data methods, implementing and applying tactics across a range of industries and fields, such as investment management, social enterprise consulting, and wine marketing. She studied business economics and political science at the University of California, Irvine, USA.
Divya is currently focused on natural language processing and generation techniques at Kylie.ai, a start-up helping clients automate their customer support conversations.
Michael Smith uses big data and machine learning to learn about how people behave. His experience includes IBM Watson and consulting for the US government. Michael actively publishes at and attends several prominent conferences as he engineers systems using text data and AI. He enjoys discussing technology and learning new ways to tackle problems.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Feature Engineering
Motivating example – AI-powered communications
Why feature engineering matters
What is feature engineering?
Understanding the basics of data and machine learning
Supervised learning
Unsupervised learning
Unsupervised learning example – marketing segments
Evaluation of machine learning algorithms and feature engineering procedures
Example of feature engineering procedures – can anyone really predict the weather?
Steps to evaluate a feature engineering procedure
Evaluating supervised learning algorithms
Evaluating unsupervised learning algorithms
Feature understanding – what’s in my dataset?
Feature improvement – cleaning datasets
Feature selection – say no to bad attributes
Feature construction – can we build it?
Feature transformation – enter math-man
Feature learning – using AI to better our AI
Summary
Feature Understanding – What's in My Dataset?
The structure, or lack thereof, of data
An example of unstructured data – server logs
Quantitative versus qualitative data
Salary ranges by job classification
The four levels of data
The nominal level
Mathematical operations allowed
The ordinal level
Mathematical operations allowed
The interval level
Mathematical operations allowed
Plotting two columns at the interval level
The ratio level
Mathematical operations allowed
Recap of the levels of data
Summary
Feature Improvement - Cleaning Datasets
Identifying missing values in data
The Pima Indian Diabetes Prediction dataset
The exploratory data analysis (EDA)
Dealing with missing values in a dataset
Removing harmful rows of data
Imputing the missing values in data
Imputing values in a machine learning pipeline
Pipelines in machine learning
Standardization and normalization
Z-score standardization
The min-max scaling method
The row normalization method
Putting it all together
Summary
Feature Construction
Examining our dataset
Imputing categorical features
Custom imputers
Custom category imputer
Custom quantitative imputer
Encoding categorical variables
Encoding at the nominal level
Encoding at the ordinal level
Bucketing continuous features into categories
Creating our pipeline
Extending numerical features
Activity recognition from the Single Chest-Mounted Accelerometer dataset
Polynomial features
Parameters
Exploratory data analysis
Text-specific feature construction
Bag of words representation
CountVectorizer
CountVectorizer parameters
The Tf-idf vectorizer
Using text in machine learning pipelines
Summary
Feature Selection
Achieving better performance in feature engineering
A case study – a credit card defaulting dataset
Creating a baseline machine learning pipeline
The types of feature selection
Statistical-based feature selection
Using Pearson correlation to select features
Feature selection using hypothesis testing
Interpreting the p-value
Ranking the p-value
Model-based feature selection
A brief refresher on natural language processing
Using machine learning to select features
Tree-based model feature selection metrics
Linear models and regularization
A brief introduction to regularization
Linear model coefficients as another feature importance metric
Choosing the right feature selection method
Summary
Feature Transformations
Dimension reduction – feature transformations versus feature selection versus feature construction
Principal Component Analysis
How PCA works
PCA with the Iris dataset – manual example
Creating the covariance matrix of the dataset
Calculating the eigenvalues of the covariance matrix
Keeping the top k eigenvalues (sorted by the descending eigenvalues)
Using the kept eigenvectors to transform new data-points
Scikit-learn's PCA
How centering and scaling data affects PCA
A deeper look into the principal components
Linear Discriminant Analysis
How LDA works
Calculating the mean vectors of each class
Calculating within-class and between-class scatter matrices
Calculating eigenvalues and eigenvectors for SW-1SB
Keeping the top k eigenvectors by ordering them by descending eigenvalues
Using the top eigenvectors to project onto the new space
How to use LDA in scikit-learn
LDA versus PCA – iris dataset
Summary
Feature Learning
Parametric assumptions of data
Non-parametric fallacy
The algorithms of this chapter
Restricted Boltzmann Machines
Not necessarily dimension reduction
The graph of a Restricted Boltzmann Machine
The restriction of a Boltzmann Machine
Reconstructing the data
MNIST dataset
The BernoulliRBM
Extracting PCA components from MNIST
Extracting RBM components from MNIST
Using RBMs in a machine learning pipeline
Using a linear model on raw pixel values
Using a linear model on extracted PCA components
Using a linear model on extracted RBM components
Learning text features – word vectorizations
Word embeddings
Two approaches to word embeddings - Word2vec and GloVe
Word2Vec - another shallow neural network
The gensim package for creating Word2vec embeddings
Application of word embeddings - information retrieval
Summary
Case Studies
Case study 1 - facial recognition
Applications of facial recognition
The data
Some data exploration
Applied facial recognition
Case study 2 - predicting topics of hotel reviews data
Applications of text clustering
Hotel review data
Exploration of the data
The clustering model
SVD versus PCA components
Latent semantic analysis
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
This book will cover the topic of feature engineering. A huge part of the data science and machine learning pipeline, feature engineering includes the ability to identify, clean, construct, and discover new characteristics of data for the purpose of interpretation and predictive analysis.
In this book, we will be covering the entire process of feature engineering, from inspection to visualization, transformation, and beyond. We will be using both basic and advanced mathematical measures to transform our data into a form that's much more digestible by machines and machine learning pipelines.
By discovering and transforming, we, as data scientists, will be able to gain a whole new perspective on our data, enhancing not only our algorithms but also our insights.
This book is for people who are looking to understand and utilize the practices of feature engineering for machine learning and data exploration.
The reader should be fairly well acquainted with machine learning and coding in Python to feel comfortable diving into new topics with a step-by-step explanation of the basics.
Chapter 1, Introduction to Feature Engineering, is an introduction to the basic terminology of feature engineering and a quick look at the types of problems we will be solving throughout this book.
Chapter 2, Feature Understanding – What's in My Dataset?, looks at the types of data we will encounter in the wild and how to deal with each one separately or together.
Chapter 3, Feature Improvement - Cleaning Datasets, explains various ways to fill in missing data and how different techniques lead to different structural changes in data that may lead to poorer machine learning performance.
Chapter 4, Feature Construction, is a look at how we can create new features based on what was already given to us in an effort to inflate the structure of data.
Chapter 5, Feature Selection, shows quantitative measures to decide which features are worthy of being kept in our data pipeline.
Chapter 6, Feature Transformations, uses advanced linear algebra and mathematical techniques to impose a rigid structure on data for the purpose of enhancing performance of our pipelines.
Chapter 7, Feature Learning, covers the use of state-of-the-art machine learning and artificial intelligence learning algorithms to discover latent features of our data that few humans could fathom.
Chapter 8, Case Studies, is an array of case studies shown in order to solidify the ideas of feature engineering.
Here is what you need to get the most out of this book:
This book uses Python to complete all of its code examples. A machine (Linux/Mac/Windows is OK) with access to a Unix-style terminal and Python 2.7 installed is required.
Installing the Anaconda distribution is also recommended as it comes with most of the packages used in the examples.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

- WinRAR/7-Zip for Windows
- Zipeg/iZip/UnRarX for Mac
- 7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Feature-Engineering-Made-Easy. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/FeatureEngineeringMadeEasy_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
In recent years, engineers and executives have been attempting to implement machine learning (ML) and artificial intelligence (AI) to solve problems that, for the most part, have been solved using fairly manual methodologies. A great example would have to be advancements in natural language processing (NLP) and more specifically in natural language generation and understanding. Even more specifically, we point to AI systems that are able to read in raw text from a user (perhaps a disgruntled user of the latest smartphone) and can articulately and accurately respond with the prose of a human and the speed of a machine. In this chapter, we will be introducing topics of feature engineering, such as:
- Motivating examples of why feature engineering matters
- A basic understanding of machine learning, including performance and evaluation
- A detailed list of the chapters included in this book
Meet Arty, our AI chat system that is able to handle and respond to incoming customer support requests, just as any of our human employees would. Arty is endowed with the knowledge of our company and is ready to go at a moment’s notice.
Here is how a sample dialogue between a human and an AI customer support system would transpire:
| Human | AI |
| --- | --- |
| Hello, my phone is broken. | Sorry to hear that, how is it broken? |
| It's frozen and I can't reset it. | What kind of phone is it? |
| The new iDroid 28 | Ahh, I see. Hold the power and volume down button for 20 seconds and it should reset. |
| It worked, thanks! | No problem, have a great day. |
The reason that these types of systems are exciting and are disrupting major markets is the simplicity of such a complicated system. Let us break it down. On the surface, you might think, what an easy problem! The person has a simple problem with a simple solution. A request comes in and a response comes out. Hello, my phone froze, what should I do? Easy, just reset it. And sure, on the surface, that is what is happening here:
```
from Arty import AI

AI.respond_to("my phone froze, what should I do?")
>> "reset it."
```
The tough part comes in when you look at it from the AI's perspective. It hasn't had the entire human experience that we have had. It hasn't had the privilege to read The Iliad or even Clifford the Big Red Dog and learn to internalize their messages. The point is, the AI hasn't had a lot of experience in reading things. This AI has probably been given a few hundred thousand (maybe even millions of) previous chat dialogues of people in the past and was told to figure it out.
The following is a sample of data given to our AI system based on previous chat logs:
| Request | Response |
| --- | --- |
| Helllo | Hi, what seems to be the problem? |
| My phone doesn't work!!!! | Oh noo!!!! What's going on with it? |
| Hold on, I have to go walk my dog. BRB. | OK. I will wait. |
| Hey. | Hello. My name is Mark, how can I help? |
The data is organized into two columns, where the Request column represents what the end user types into a chat support dialogue. The next column, Response, represents the customer support agent's response to the incoming message.
While reading over the thousands of typos, angry messages, and disconnected chats, the AI starts to think that it has this customer support thing down. Once this happens, the humans set the AI loose on new chats coming in. The humans, not realizing their mistake, start to notice that the AI hasn’t fully gotten the hang of this yet. The AI can’t seem to recognize even simple messages and keeps returning nonsensical responses. It’s easy to think that the AI just needs more time or more data, but these solutions are just band-aids to the bigger problem, and often do not even solve the issue in the first place.
The underlying problem is likely that the data given to the AI in the form of raw text wasn't good enough and the AI wasn't able to pick up on the nuances of the English language. For example, some of the problems would likely include:

- Typos artificially expand the AI's vocabulary without cause. Helllo and hello are two different words that are not related to each other (see the sketch after this list).
- Synonyms mean nothing to the AI. Words such as hello and hey have no similarity and therefore make the problem artificially harder.
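To make the typo problem concrete, here is a minimal sketch (not from the book's code bundle) of how a standard bag-of-words tool such as scikit-learn's CountVectorizer treats Helllo and hello as entirely separate vocabulary entries; the chat snippets are invented for illustration:

```python
# A minimal sketch: typos inflate the vocabulary of a bag-of-words model.
from sklearn.feature_extraction.text import CountVectorizer

chats = ["Helllo", "hello, my phone doesn't work", "hey"]  # invented snippets

vectorizer = CountVectorizer()
vectorizer.fit(chats)

print(vectorizer.get_feature_names_out())
# ['doesn', 'hello', 'helllo', 'hey', 'my', 'phone', 'work']
# "helllo" and "hello" occupy separate columns, even though a human reader
# would treat them as the same greeting.
```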
Data scientists and machine learning engineers frequently gather data in order to solve a problem. Because the problems they attempt to solve occur naturally in this messy world, the data that is meant to represent those problems can also end up being quite messy, unfiltered, and often incomplete.
This is why, in the past several years, positions with titles such as Data Engineer have been popping up. These engineers have the unique job of engineering pipelines and architectures designed to handle and transform raw data into something usable by the rest of the company, particularly the data scientists and machine learning engineers. Although this job is just as important as the machine learning experts' job of creating machine learning pipelines, it is often overlooked and undervalued.
A survey of data scientists in the field revealed that over 80% of their time was spent capturing, cleaning, and organizing data. Less than 20% of their time was spent creating the machine learning pipelines that end up dominating the conversation. So these data scientists spend most of their time preparing the data; moreover, more than 75% of them also reported that preparing data was the least enjoyable part of their process.
Here are the findings of the survey mentioned earlier:
First, here is the breakdown of what data scientists spend the most time doing:

- Building training sets: 3%
- Cleaning and organizing data: 60%
- Collecting data for sets: 19%
- Mining data for patterns: 9%
- Refining algorithms: 5%
A similar poll for the least enjoyable part of data science revealed:

- Building training sets: 10%
- Cleaning and organizing data: 57%
- Collecting data sets: 21%
- Mining for data patterns: 3%
- Refining algorithms: 4%
- Others: 5%
The first breakdown represents the percentage of time that data scientists spend on different parts of the process; over 80% of a data scientist's time is spent preparing data for further use. The second represents the percentage of those surveyed reporting their least enjoyable part of the process of data science; over 75% of them report that preparing data is their least enjoyable part.
A stellar data scientist knows not only that preparing data is important enough to take up most of their time, but also that it is an arduous and often unenjoyable process. Far too often, we take for granted the clean data given to us by machine learning competitions and academic sources. More than 90% of data, including the most interesting and useful data, exists in raw format, as in the AI chat system described earlier.
Preparing data can be a vague phrase. Preparing takes into account capturing data, storing data, cleaning data, and so on. As seen in the charts shown earlier, a smaller, but still majority chunk of a data scientist's time is spent on cleaning and organizing data. It is in this process that our Data Engineers are the most useful to us. Cleaning refers to the process of transforming data into a format that can be easily interpreted by our cloud systems and databases. Organizing generally refers to a more radical transformation. Organizing tends to involve changing the entire format of the dataset into a much neater format, such as transforming raw chat logs into a tabular row/column structure.
Here is an illustration of Cleaning and Organizing:
The top transformation represents cleaning up a sample of server logs that include both the date and a text explanation of what is occurring on the servers. Notice that while cleaning, the &amp; entity, an HTML-escaped character, was transformed into a more readable ampersand (&). The cleaning phase left the document in much the same format as before. The bottom organizing transformation was a much more radical one. It turned the raw document into a row/column structure, in which each row represents a single action taken by the server and the columns represent attributes of the server action. In this case, the two attributes are Date and Text.
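As a rough illustration of both steps, here is a minimal sketch in pandas; the log lines and their "<date> <text>" format are invented, since the original figure is not reproduced here:

```python
# A minimal sketch of cleaning and organizing, assuming an invented
# "<date> <text>" log format.
import pandas as pd

raw_logs = [
    "2018-01-19 Server 3 restarted &amp; came back online",
    "2018-01-19 Disk usage at 91% on server 7",
]

# Cleaning: decode the HTML-escaped ampersand; the format stays the same
cleaned = [line.replace("&amp;", "&") for line in raw_logs]

# Organizing: split each line into a Date attribute and a Text attribute
rows = [line.split(" ", 1) for line in cleaned]
logs = pd.DataFrame(rows, columns=["Date", "Text"])
print(logs)
```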
Both cleaning and organizing fall under a larger category of data science, which just so happens to be the topic of this book, feature engineering.
Finally, we arrive at the title of the book. Yes, folks, feature engineering will be the topic of this book: the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance. We will be focusing on the process of cleaning and organizing data for the purposes of machine learning pipelines. We will also go beyond these concepts and look at more complex transformations of data in the form of mathematical formulas and neural understanding, but we are getting ahead of ourselves. Let's start at a high level.
To break this definition down a bit further, let's look at precisely what feature engineering entails:
- **Process of transforming data**: Note that we are not specifying raw data, unfiltered data, and so on. Feature engineering can be applied to data at any stage. Oftentimes, we will be applying feature engineering techniques to data that is already processed in the eyes of the data distributor. It is also important to mention that the data we will be working with will usually be in a tabular format, organized into rows (observations) and columns (attributes). There will be times when we will start with data in its most raw form, such as in the examples of the server logs mentioned previously, but for the most part, we will deal with data that is already somewhat cleaned and organized.
- **Features**: The word features will obviously be used a lot in this book. At its most basic level, a feature is an attribute of data that is meaningful to the machine learning process. Many times we will be diagnosing tabular data and identifying which columns are features and which are merely attributes.
- **Better represent the underlying problem**: The data that we will be working with will always serve to represent a specific problem in a specific domain. It is important to ensure that while we are performing these techniques, we do not lose sight of the bigger picture. We want to transform data so that it better represents the bigger problem at hand.
- **Resulting in improved machine learning performance**: Feature engineering exists as a single part of the process of data science. As we saw, it is an important and oftentimes undervalued part. The eventual goal of feature engineering is to obtain data from which our learning algorithms can extract patterns and obtain better results. We will talk in depth about machine learning metrics and results later in this book, but for now, know that we perform feature engineering not only to obtain cleaner data, but to eventually use that data in our machine learning pipelines.
We know what you’re thinking, why should I spend my time reading about a process that people say they do not enjoy doing? We believe that many people do not enjoy the process of feature engineering because they often do not have the benefits of understanding the results of the work that they do.
Most companies employ both data engineers and machine learning engineers. The data engineers are primarily concerned with the preparation and transformation of the data, while the machine learning engineers usually have a working knowledge of learning algorithms and how to mine patterns from already cleaned data.
Their jobs are often separate but intertwined and iterative. The data engineers will present a dataset to the machine learning engineers, who will claim that they cannot get good results from it and will ask the data engineers to try transforming the data further, and so on, and so forth. This process can not only be monotonous and repetitive, it can also hurt the bigger picture.
Without knowledge of both feature engineering and machine learning, the entire process might not be as effective as it could be. That's where this book comes in. We will be talking about feature engineering and how it relates directly to machine learning. It will be a results-driven approach, where we will deem techniques helpful if, and only if, they can lead to a boost in performance. It is now worth diving into the basics of data, the structure of data, and machine learning, to ensure standardization of terminology.
When we talk about data, we are generally dealing with tabular data, that is, data that is organized into rows and columns, the kind that can be opened in a spreadsheet technology such as Microsoft Excel. Each row of data, otherwise known as an observation, represents a single instance/example of a problem. If our data belongs to the domain of day-trading in the stock market, an observation might represent an hour's worth of changes in the overall market and price.
For example, when dealing with the domain of network security, an observation could represent a possible attack or a packet of data sent over a wireless system.
The following shows sample tabular data in the domain of cyber security and more specifically, network intrusion:
| DateTime | Protocol | Urgent | Malicious |
| --- | --- | --- | --- |
| June 2nd, 2018 | TCP | FALSE | TRUE |
| June 2nd, 2018 | HTTP | TRUE | TRUE |
| June 2nd, 2018 | HTTP | TRUE | FALSE |
| June 3rd, 2018 | HTTP | FALSE | TRUE |
We see that each row or observation consists of a network connection and that we have four attributes of each observation: DateTime, Protocol, Urgent, and Malicious. While we will not dive into these specific attributes, we will simply notice the structure of the data given to us in a tabular format.
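For instance, here is a minimal sketch (not from the book's code bundle) of the same table held as a pandas DataFrame, the tabular structure we will work with throughout:

```python
# Each row is an observation (one connection); each column is an attribute.
import pandas as pd

connections = pd.DataFrame({
    "DateTime":  ["June 2nd, 2018", "June 2nd, 2018", "June 2nd, 2018", "June 3rd, 2018"],
    "Protocol":  ["TCP", "HTTP", "HTTP", "HTTP"],
    "Urgent":    [False, True, True, False],
    "Malicious": [True, True, False, True],
})

print(connections.shape)  # (4, 4): four observations, four attributes
```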
Because we will, for the most part, consider our data to be tabular, we can also look at specific instances where the matrix of data has only one column/attribute. For example, suppose we are building a piece of software that takes in a single image of a room and outputs whether or not there is a human in that room. The data for the input might be represented as a matrix with a single column, where that column is simply a URL to a photo of a room and nothing else.

For example, consider the following table, which has only a single column, titled Photo URL. The values of the table are URLs (these are fake, do not lead anywhere, and are purely for example) of photos that are relevant to the data scientist:
| Photo URL |
| --- |
| http://photo-storage.io/room/1 |
| http://photo-storage.io/room/2 |
| http://photo-storage.io/room/3 |
| http://photo-storage.io/room/4 |
The data that is input into the system might be only a single column, such as in this case. To create a system that can analyze images, the input might simply be a URL to the image in question. It would be up to us, as data scientists, to engineer features from the URL.
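One way this could look in practice is sketched below. Since the URLs above are fake, this cannot actually be run against them; it assumes each URL returns an image file, and the two derived features are invented for illustration:

```python
# A hedged sketch of engineering numeric features from a single URL column.
import io

import numpy as np
import requests
from PIL import Image

def pixel_features(url):
    """Fetch the photo behind a URL and derive simple numeric features."""
    response = requests.get(url)
    image = Image.open(io.BytesIO(response.content)).convert("L")  # grayscale
    pixels = np.asarray(image, dtype=float)
    # Two toy features engineered from the raw URL: brightness and contrast
    return {"mean_brightness": pixels.mean(), "brightness_std": pixels.std()}

# pixel_features("http://photo-storage.io/room/1")  # fake URL, illustration only
```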
As data scientists, we must be ready to ingest and handle data that might be large, small, wide, narrow (in terms of attributes), or sparse in completion (there might be missing values), and be ready to utilize this data for the purposes of machine learning. Now's a good time to talk more about that. Machine learning algorithms belong to a class of algorithms that are defined by their ability to extract and exploit patterns in data to accomplish a task based on historical training data. Vague, right? Machine learning can handle many types of tasks, and therefore we will leave the definition of machine learning as is and dive a bit deeper.
We generally separate machine learning into two main types, supervised and unsupervised learning. Each type of machine learning algorithm can benefit from feature engineering, and therefore it is important that we understand each type.
Supervised learning is all about making predictions: we utilize the features of the data to make informative predictions about the response of the data. Unsupervised learning, in contrast, is not about making predictions; it is about extracting structure from our data. We generally do so by applying mathematical transformations to numerical matrix representations of the data, or iterative procedures, to obtain new sets of features.
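To make the supervised case concrete before we turn to unsupervised learning, here is a minimal sketch (invented for illustration, not the book's code) that uses the Protocol and Urgent attributes from the earlier network intrusion table as features to predict the Malicious response:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "Protocol":  ["TCP", "HTTP", "HTTP", "HTTP"],
    "Urgent":    [False, True, True, False],
    "Malicious": [True, True, False, True],
})

# Encode the categorical Protocol column numerically so the model can use it
X = pd.get_dummies(data[["Protocol", "Urgent"]])
y = data["Malicious"]  # the response we want to predict

model = LogisticRegression()
model.fit(X, y)
print(model.predict(X))  # predicted Malicious label for each connection
```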
Unsupervised learning can be a bit more difficult to grasp than supervised learning, so we will present a motivating example to help elucidate how this all works.
Suppose we are given a large (one million rows) dataset where each row/observation is a single person with basic demographic information (age, gender, and so on) as well as the number of items purchased, which represents how many items this person has bought from a particular store:
| Age | Gender | Number of items purchased |
| --- | --- | --- |
| 25 | F | 1 |
| 28 | F | 23 |
| 61 | F | 3 |
| 54 | M | 17 |
| 51 | M | 8 |
| 47 | F | 3 |
| 27 | M | 22 |
| 31 | F | 14 |
This is a sample of our marketing dataset where each row represents a single customer with three basic attributes about each person. Our goal will be to segment this dataset into types or clusters of people so that the company performing the analysis can understand the customer profiles much better.
Now, of course, we've only shown 8 out of one million rows, which can be daunting. We can perform basic descriptive statistics on this dataset and get averages, standard deviations, and so on of our numerical columns; however, what if we wished to segment these one million people into different types so that the marketing department can have a much better sense of the types of people who shop and create more appropriate advertisements for each segment?
Each type of customer would exhibit particular qualities that make that segment unique. For example, they may find that 20% of their customers fall into a category they like to call young and wealthy that are generally younger and purchase several items.
This type of analysis and the creation of these types can fall under a specific type of unsupervised learning called clustering. We will discuss this machine learning algorithm in further detail later on in this book, but for now, clustering will create a new feature that separates out the people into distinct types or clusters:
| Age | Gender | Number of items purchased | Cluster |
| --- | --- | --- | --- |
| 25 | F | 1 | 6 |
| 28 | F | 23 | 1 |
| 61 | F | 3 | 3 |
| 54 | M | 17 | 2 |
| 51 | M | 8 | 3 |
| 47 | F | 3 | 8 |
| 27 | M | 22 | 5 |
| 31 | F | 14 | 1 |
This shows our customer dataset after a clustering algorithm has been applied. Note the new column at the end, called Cluster, that represents the types of people that the algorithm has identified. The idea is that the people who belong to similar clusters behave similarly with regard to the data (have similar ages, genders, and purchase behaviors). Perhaps cluster six might be renamed young buyers.
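As a taste of what is to come, here is a minimal sketch of how such a Cluster column could be produced with scikit-learn's KMeans; the choice of three clusters is arbitrary, and the labels it assigns will not match the table above:

```python
import pandas as pd
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "Age": [25, 28, 61, 54, 51, 47, 27, 31],
    "Gender": ["F", "F", "F", "M", "M", "F", "M", "F"],
    "Number of items purchased": [1, 23, 3, 17, 8, 3, 22, 14],
})

# Encode Gender numerically so the distance-based algorithm can use it
features = pd.get_dummies(customers, columns=["Gender"])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
customers["Cluster"] = kmeans.fit_predict(features)
print(customers)
```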
This example of clustering shows us why sometimes we aren’t concerned with predicting anything, but instead wish to understand our data on a deeper level by adding new and interesting features, or even removing irrelevant features.
It’s all starting to make sense now, isn’t it? These features that we talk about repeatedly are what this book is primarily concerned with. Feature engineering involves the understanding and transforming of features in relation to both unsupervised and supervised learning.
It is important to note that in the literature there is often a stark contrast between the terms features and attributes. The term attribute is generally given to columns in tabular data, while the term feature is generally given only to attributes that contribute to the success of machine learning algorithms. That is to say, some attributes can be unhelpful or even hurtful to our machine learning systems. For example, when predicting how long a used car will last before requiring servicing, the color of the car will probably not be very indicative of this value.
In this book, we will generally refer to all columns as features until they are proven to be unhelpful or hurtful. When this happens, we will usually cast those attributes aside in the code. It is extremely important, then, to consider the basis for this decision. How does one evaluate a machine learning system and then use this evaluation to perform feature engineering?