A perfect guide to speed up the predicting power of machine learning algorithms
Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective.
You will start with understanding your data; often, the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features. You will learn to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning on your own machines to automatically learn amazing features for your data.
By the end of the book, you will become proficient in Feature Selection, Feature Learning, and Feature Optimization.
If you are a data science professional or a machine learning engineer looking to strengthen your predictive analytics model, then this book is a perfect guide for you. Some basic understanding of the machine learning concepts and Python scripting would be enough to get started with this book.
Sinan Ozdemir is a data scientist, startup founder, and educator living in the San Francisco Bay Area with his dog, Charlie; cat, Euclid; and bearded dragon, Fiero. He spent his academic career studying pure mathematics at Johns Hopkins University before transitioning to education. He spent several years conducting lectures on data science at Johns Hopkins University and at the General Assembly before founding his own startup, Legion Analytics, which uses artificial intelligence and data science to power enterprise sales teams. After completing a Fellowship at the Y Combinator accelerator, Sinan spent most of his time working on his fast-growing company, while creating educational material for data science.

Divya Susarla is an experienced leader in data methods, implementing and applying tactics across a range of industries and fields including investment management, social enterprise consulting, and wine marketing. She trained in data by way of specializing in Economics and Political Science at the University of California, Irvine, cultivating a passion for teaching by developing an analytically based, international affairs curriculum for students through the Global Connect program. Divya is currently focused on natural language processing and generation techniques at Kylie.ai, a startup helping clients automate their customer support conversations. When she is not busy working on building Kylie.ai and writing educational content, she spends her time traveling across the globe and experimenting with new recipes at her home in Berkeley, CA.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Veena Pagare
Acquisition Editor: Varsha Shetty
Content Development Editor: Tejas Limkar
Technical Editor: Sayli Nikalje
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Datta
Production Coordinator: Shantanu Zagade
First published: January 2018
Production reference: 1190118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78728-760-0
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
- Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
- Improve your learning with Skill Plans built especially for you
- Get a free eBook or video every month
- Mapt is fully searchable
- Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Sinan Ozdemir is a data scientist, start-up founder, and educator living in the San Francisco Bay Area. He studied pure mathematics at Johns Hopkins University. He then spent several years conducting lectures on data science at Johns Hopkins University before founding his own start-up, Kylie.ai, which uses artificial intelligence to clone brand personalities and automate customer service communications.
Sinan is also the author of Principles of Data Science, available through Packt.
Divya Susarla is an experienced leader in data methods, implementing and applying tactics across a range of industries and fields, such as investment management, social enterprise consulting, and wine marketing. She studied business economics and political science at the University of California, Irvine, USA.
Divya is currently focused on natural language processing and generation techniques at Kylie.ai, a start-up helping clients automate their customer support conversations.
Michael Smith uses big data and machine learning to learn about how people behave. His experience includes IBM Watson and consulting for the US government. Michael actively publishes at and attends several prominent conferences as he engineers systems using text data and AI. He enjoys discussing technology and learning new ways to tackle problems.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Feature Engineering
Motivating example – AI-powered communications
Why feature engineering matters
What is feature engineering?
Understanding the basics of data and machine learning
Supervised learning
Unsupervised learning
Unsupervised learning example – marketing segments
Evaluation of machine learning algorithms and feature engineering procedures
Example of feature engineering procedures – can anyone really predict the weather?
Steps to evaluate a feature engineering procedure
Evaluating supervised learning algorithms
Evaluating unsupervised learning algorithms
Feature understanding – what’s in my dataset?
Feature improvement – cleaning datasets
Feature selection – say no to bad attributes
Feature construction – can we build it?
Feature transformation – enter math-man
Feature learning – using AI to better our AI
Summary
Feature Understanding – What's in My Dataset?
The structure, or lack thereof, of data
An example of unstructured data – server logs
Quantitative versus qualitative data
Salary ranges by job classification
The four levels of data
The nominal level
Mathematical operations allowed
The ordinal level
Mathematical operations allowed
The interval level
Mathematical operations allowed
Plotting two columns at the interval level
The ratio level
Mathematical operations allowed
Recap of the levels of data
Summary
Feature Improvement - Cleaning Datasets
Identifying missing values in data
The Pima Indian Diabetes Prediction dataset
The exploratory data analysis (EDA)
Dealing with missing values in a dataset
Removing harmful rows of data
Imputing the missing values in data
Imputing values in a machine learning pipeline
Pipelines in machine learning
Standardization and normalization
Z-score standardization
The min-max scaling method
The row normalization method
Putting it all together
Summary
Feature Construction
Examining our dataset
Imputing categorical features
Custom imputers
Custom category imputer
Custom quantitative imputer
Encoding categorical variables
Encoding at the nominal level
Encoding at the ordinal level
Bucketing continuous features into categories
Creating our pipeline
Extending numerical features
Activity recognition from the Single Chest-Mounted Accelerometer dataset
Polynomial features
Parameters
Exploratory data analysis
Text-specific feature construction
Bag of words representation
CountVectorizer
CountVectorizer parameters
The Tf-idf vectorizer
Using text in machine learning pipelines
Summary
Feature Selection
Achieving better performance in feature engineering
A case study – a credit card defaulting dataset
Creating a baseline machine learning pipeline
The types of feature selection
Statistical-based feature selection
Using Pearson correlation to select features
Feature selection using hypothesis testing
Interpreting the p-value
Ranking the p-value
Model-based feature selection
A brief refresher on natural language processing
Using machine learning to select features
Tree-based model feature selection metrics
Linear models and regularization
A brief introduction to regularization
Linear model coefficients as another feature importance metric
Choosing the right feature selection method
Summary
Feature Transformations
Dimension reduction – feature transformations versus feature selection versus feature construction
Principal Component Analysis
How PCA works
PCA with the Iris dataset – manual example
Creating the covariance matrix of the dataset
Calculating the eigenvalues of the covariance matrix
Keeping the top k eigenvalues (sorted by the descending eigenvalues)
Using the kept eigenvectors to transform new data-points
Scikit-learn's PCA
How centering and scaling data affects PCA
A deeper look into the principal components
Linear Discriminant Analysis
How LDA works
Calculating the mean vectors of each class
Calculating within-class and between-class scatter matrices
Calculating eigenvalues and eigenvectors for SW-1SB
Keeping the top k eigenvectors by ordering them by descending eigenvalues
Using the top eigenvectors to project onto the new space
How to use LDA in scikit-learn
LDA versus PCA – iris dataset
Summary
Feature Learning
Parametric assumptions of data
Non-parametric fallacy
The algorithms of this chapter
Restricted Boltzmann Machines
Not necessarily dimension reduction
The graph of a Restricted Boltzmann Machine
The restriction of a Boltzmann Machine
Reconstructing the data
MNIST dataset
The BernoulliRBM
Extracting PCA components from MNIST
Extracting RBM components from MNIST
Using RBMs in a machine learning pipeline
Using a linear model on raw pixel values
Using a linear model on extracted PCA components
Using a linear model on extracted RBM components
Learning text features – word vectorizations
Word embeddings
Two approaches to word embeddings - Word2vec and GloVe
Word2Vec - another shallow neural network
The gensim package for creating Word2vec embeddings
Application of word embeddings - information retrieval
Summary
Case Studies
Case study 1 - facial recognition
Applications of facial recognition
The data
Some data exploration
Applied facial recognition
Case study 2 - predicting topics of hotel reviews data
Applications of text clustering
Hotel review data
Exploration of the data
The clustering model
SVD versus PCA components
Latent semantic analysis
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
This book will cover the topic of feature engineering. A huge part of the data science and machine learning pipeline, feature engineering includes the ability to identify, clean, construct, and discover new characteristics of data for the purpose of interpretation and predictive analysis.
In this book, we will be covering the entire process of feature engineering, from inspection to visualization, transformation, and beyond. We will be using both basic and advanced mathematical measures to transform our data into a form that's much more digestible by machines and machine learning pipelines.
By discovering and transforming, we, as data scientists, will be able to gain a whole new perspective on our data, enhancing not only our algorithms but also our insights.
This book is for people who are looking to understand and utilize the practices of feature engineering for machine learning and data exploration.
The reader should be fairly well acquainted with machine learning and coding in Python to feel comfortable diving into new topics with a step-by-step explanation of the basics.
Chapter 1, Introduction to Feature Engineering, is an introduction to the basic terminology of feature engineering and a quick look at the types of problems we will be solving throughout this book.
Chapter 2, Feature Understanding – What's in My Dataset?, looks at the types of data we will encounter in the wild and how to deal with each one separately or together.
Chapter 3, Feature Improvement - Cleaning Datasets, explains various ways to fill in missing data and how different techniques lead to different structural changes in data that may lead to poorer machine learning performance.
Chapter 4, Feature Construction, is a look at how we can create new features based on what was already given to us in an effort to inflate the structure of data.
Chapter 5, Feature Selection, shows quantitative measures to decide which features are worthy of being kept in our data pipeline.
Chapter 6, Feature Transformations, uses advanced linear algebra and mathematical techniques to impose a rigid structure on data for the purpose of enhancing performance of our pipelines.
Chapter 7, Feature Learning, covers the use of state-of-the-art machine learning and artificial intelligence learning algorithms to discover latent features of our data that few humans could fathom.
Chapter 8, Case Studies, is an array of case studies shown in order to solidify the ideas of feature engineering.
Here is what you need to get the most out of this book:
This book uses Python to complete all of its code examples. A machine (Linux/Mac/Windows is OK) with access to a Unix-style terminal and Python 2.7 installed is required.
Installing the Anaconda distribution is also recommended as it comes with most of the packages used in the examples.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

- WinRAR/7-Zip for Windows
- Zipeg/iZip/UnRarX for Mac
- 7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Feature-Engineering-Made-Easy. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/FeatureEngineeringMadeEasy_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
In recent years, engineers and executives have been attempting to implement machine learning (ML) and artificial intelligence (AI) to solve problems that, for the most part, have been solved using fairly manual methodologies. A great example would have to be advancements in natural language processing (NLP) and more specifically in natural language generation and understanding. Even more specifically, we point to AI systems that are able to read in raw text from a user (perhaps a disgruntled user of the latest smartphone) and can articulately and accurately respond with the prose of a human and the speed of a machine. In this chapter, we will be introducing topics of feature engineering, such as:
- Motivating examples of why feature engineering matters
- A basic understanding of machine learning, including performance and evaluation
- A detailed list of the chapters included in this book
Meet Arty, our AI chat system that is able to handle and respond to incoming customer support requests, just as any of our human employees would. Arty is endowed with the knowledge of our company and is ready to go at a moment’s notice.
Here is how a sample dialogue between a human and an AI customer support system would transpire:
| Human | AI |
| --- | --- |
| Hello, my phone is broken. | Sorry to hear that, how is it broken? |
| It's frozen and I can't reset it. | What kind of phone is it? |
| The new iDroid 28 | Ahh, I see. Hold the power and volume down button for 20 seconds and it should reset. |
| It worked, thanks! | No problem, have a great day. |
The reason that these types of systems are exciting and are disrupting major markets is the simplicity of such a complicated system. Let us break it down. On the surface, you might think, what an easy problem! The person has a simple problem with a simple solution. A request comes in and a response comes out. Hello, my phone froze, what should I do? Easy, just reset it. And sure, on the surface, that is what is happening here:
```
from Arty import AI

AI.respond_to("my phone froze, what should I do?")
>> "reset it."
```
The tough part comes in when you look at it from the AI's perspective. It hasn't had the entire human experience that we have had. It hasn't had the privilege to read The Iliad or even Clifford the Big Red Dog and learn to internalize their messages. The point is, the AI hasn't had a lot of experience in reading things. This AI has probably been given a few hundred thousand (maybe even millions of) previous chat dialogues of people in the past and was told to figure it out.
The following is a sample of data given to our AI system based on previous chat logs:
| Request | Response |
| --- | --- |
| Helllo | Hi, what seems to be the problem? |
| My phone doesn't work!!!! | Oh noo!!!! What's going on with it? |
| Hold on, I have to go walk my dog. BRB. | OK. I will wait. |
| Hey. | Hello. My name is Mark, how can I help? |
The data is organized into two columns, where the Request column represents what the end user types into a chat support dialogue. The next column, Response, represents the customer support agent's response to the incoming message.
While reading over the thousands of typos, angry messages, and disconnected chats, the AI starts to think that it has this customer support thing down. Once this happens, the humans set the AI loose on new chats coming in. The humans, not realizing their mistake, start to notice that the AI hasn’t fully gotten the hang of this yet. The AI can’t seem to recognize even simple messages and keeps returning nonsensical responses. It’s easy to think that the AI just needs more time or more data, but these solutions are just band-aids to the bigger problem, and often do not even solve the issue in the first place.
The underlying problem is likely that the data given to the AI in the form of raw text wasn't good enough and the AI wasn't able to pick up on the nuances of the English language. For example, some of the problems would likely include:

- Typos artificially expand the AI's vocabulary without cause. Helllo and hello are two different words that are not related to each other (see the sketch after this list).
- Synonyms mean nothing to the AI. Words such as hello and hey have no similarity and therefore make the problem artificially harder.
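To make the typo problem concrete, here is a minimal sketch (not from the book's code bundle) of how a standard bag-of-words tool such as scikit-learn's CountVectorizer treats Helllo and hello as entirely separate vocabulary entries; the chat snippets are invented for illustration:

```python
# A minimal sketch: typos inflate the vocabulary of a bag-of-words model.
from sklearn.feature_extraction.text import CountVectorizer

chats = ["Helllo", "hello, my phone doesn't work", "hey"]  # invented snippets

vectorizer = CountVectorizer()
vectorizer.fit(chats)

print(vectorizer.get_feature_names_out())
# ['doesn', 'hello', 'helllo', 'hey', 'my', 'phone', 'work']
# "helllo" and "hello" occupy separate columns, even though a human reader
# would treat them as the same greeting.
```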
Data scientists and machine learning engineers frequently gather data in order to solve a problem. Because the problems they attempt to solve occur naturally in this messy world, the data that is meant to represent those problems can also end up being quite messy, unfiltered, and often incomplete.
This is why, in the past several years, positions with titles such as Data Engineer have been popping up. These engineers have the unique job of engineering pipelines and architectures designed to handle and transform raw data into something usable by the rest of the company, particularly the data scientists and machine learning engineers. Although this job is just as important as the machine learning experts' job of creating machine learning pipelines, it is often overlooked and undervalued.
A survey of data scientists in the field revealed that over 80% of their time was spent capturing, cleaning, and organizing data. Less than 20% of their time was spent creating the machine learning pipelines that end up dominating the conversation. So these data scientists spend most of their time preparing the data; moreover, more than 75% of them also reported that preparing data was the least enjoyable part of their process.
Here are the findings of the survey mentioned earlier:
First, here is the breakdown of what data scientists spend the most time doing:

- Building training sets: 3%
- Cleaning and organizing data: 60%
- Collecting data for sets: 19%
- Mining data for patterns: 9%
- Refining algorithms: 5%
A similar poll for the least enjoyable part of data science revealed:

- Building training sets: 10%
- Cleaning and organizing data: 57%
- Collecting data sets: 21%
- Mining for data patterns: 3%
- Refining algorithms: 4%
- Others: 5%
The first breakdown represents the percentage of time that data scientists spend on different parts of the process; over 80% of a data scientist's time is spent preparing data for further use. The second represents the percentage of those surveyed reporting their least enjoyable part of the process of data science; over 75% of them report that preparing data is their least enjoyable part.
A stellar data scientist knows not only that preparing data is important enough to take up most of their time, but also that it is an arduous and often unenjoyable process. Far too often, we take for granted the clean data given to us by machine learning competitions and academic sources. More than 90% of data, including the most interesting and useful data, exists in raw format, as in the AI chat system described earlier.
Preparing data can be a vague phrase. Preparing takes into account capturing data, storing data, cleaning data, and so on. As seen in the charts shown earlier, a smaller, but still majority chunk of a data scientist's time is spent on cleaning and organizing data. It is in this process that our Data Engineers are the most useful to us. Cleaning refers to the process of transforming data into a format that can be easily interpreted by our cloud systems and databases. Organizing generally refers to a more radical transformation. Organizing tends to involve changing the entire format of the dataset into a much neater format, such as transforming raw chat logs into a tabular row/column structure.
Here is an illustration of Cleaning and Organizing:
The top transformation represents cleaning up a sample of server logs that include both the date and a text explanation of what is occurring on the servers. Notice that while cleaning, the &amp; entity, an HTML-escaped character, was transformed into a more readable ampersand (&). The cleaning phase left the document in much the same format as before. The bottom organizing transformation was a much more radical one. It turned the raw document into a row/column structure, in which each row represents a single action taken by the server and the columns represent attributes of the server action. In this case, the two attributes are Date and Text.
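As a rough illustration of both steps, here is a minimal sketch in pandas; the log lines and their "<date> <text>" format are invented, since the original figure is not reproduced here:

```python
# A minimal sketch of cleaning and organizing, assuming an invented
# "<date> <text>" log format.
import pandas as pd

raw_logs = [
    "2018-01-19 Server 3 restarted &amp; came back online",
    "2018-01-19 Disk usage at 91% on server 7",
]

# Cleaning: decode the HTML-escaped ampersand; the format stays the same
cleaned = [line.replace("&amp;", "&") for line in raw_logs]

# Organizing: split each line into a Date attribute and a Text attribute
rows = [line.split(" ", 1) for line in cleaned]
logs = pd.DataFrame(rows, columns=["Date", "Text"])
print(logs)
```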
Both cleaning and organizing fall under a larger category of data science, which just so happens to be the topic of this book, feature engineering.
Finally, we arrive at the title of the book. Yes, folks, feature engineering will be the topic of this book: the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance. We will be focusing on the process of cleaning and organizing data for the purposes of machine learning pipelines. We will also go beyond these concepts and look at more complex transformations of data in the form of mathematical formulas and neural understanding, but we are getting ahead of ourselves. Let's start at a high level.
To break this definition down a bit further, let's look at precisely what feature engineering entails:
- **Process of transforming data**: Note that we are not specifying raw data, unfiltered data, and so on. Feature engineering can be applied to data at any stage. Oftentimes, we will be applying feature engineering techniques to data that is already processed in the eyes of the data distributor. It is also important to mention that the data we will be working with will usually be in a tabular format, organized into rows (observations) and columns (attributes). There will be times when we will start with data in its most raw form, such as in the examples of the server logs mentioned previously, but for the most part, we will deal with data that is already somewhat cleaned and organized.
- **Features**: The word features will obviously be used a lot in this book. At its most basic level, a feature is an attribute of data that is meaningful to the machine learning process. Many times we will be diagnosing tabular data and identifying which columns are features and which are merely attributes.
- **Better represent the underlying problem**: The data that we will be working with will always serve to represent a specific problem in a specific domain. It is important to ensure that while we are performing these techniques, we do not lose sight of the bigger picture. We want to transform data so that it better represents the bigger problem at hand.
- **Resulting in improved machine learning performance**: Feature engineering exists as a single part of the process of data science. As we saw, it is an important and oftentimes undervalued part. The eventual goal of feature engineering is to obtain data from which our learning algorithms can extract patterns and obtain better results. We will talk in depth about machine learning metrics and results later in this book, but for now, know that we perform feature engineering not only to obtain cleaner data, but to eventually use that data in our machine learning pipelines.
We know what you’re thinking, why should I spend my time reading about a process that people say they do not enjoy doing? We believe that many people do not enjoy the process of feature engineering because they often do not have the benefits of understanding the results of the work that they do.
Most companies employ both data engineers and machine learning engineers. The data engineers are primarily concerned with the preparation and transformation of the data, while the machine learning engineers usually have a working knowledge of learning algorithms and how to mine patterns from already cleaned data.
Their jobs are often separate but intertwined and iterative. The data engineers will present a dataset to the machine learning engineers, who will claim that they cannot get good results from it and will ask the data engineers to try transforming the data further, and so on, and so forth. This process can not only be monotonous and repetitive, it can also hurt the bigger picture.
Without knowledge of both feature engineering and machine learning, the entire process might not be as effective as it could be. That's where this book comes in. We will be talking about feature engineering and how it relates directly to machine learning. It will be a results-driven approach, where we will deem techniques helpful if, and only if, they can lead to a boost in performance. It is now worth diving into the basics of data, the structure of data, and machine learning, to ensure standardization of terminology.
When we talk about data, we are generally dealing with tabular data, that is, data that is organized into rows and columns, the kind that can be opened in a spreadsheet technology such as Microsoft Excel. Each row of data, otherwise known as an observation, represents a single instance/example of a problem. If our data belongs to the domain of day-trading in the stock market, an observation might represent an hour's worth of changes in the overall market and price.
For example, when dealing with the domain of network security, an observation could represent a possible attack or a packet of data sent over a wireless system.
The following shows sample tabular data in the domain of cyber security and more specifically, network intrusion:
| DateTime | Protocol | Urgent | Malicious |
| --- | --- | --- | --- |
| June 2nd, 2018 | TCP | FALSE | TRUE |
| June 2nd, 2018 | HTTP | TRUE | TRUE |
| June 2nd, 2018 | HTTP | TRUE | FALSE |
| June 3rd, 2018 | HTTP | FALSE | TRUE |
We see that each row or observation consists of a network connection and that we have four attributes of each observation: DateTime, Protocol, Urgent, and Malicious. While we will not dive into these specific attributes, we will simply notice the structure of the data given to us in a tabular format.
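For instance, here is a minimal sketch (not from the book's code bundle) of the same table held as a pandas DataFrame, the tabular structure we will work with throughout:

```python
# Each row is an observation (one connection); each column is an attribute.
import pandas as pd

connections = pd.DataFrame({
    "DateTime":  ["June 2nd, 2018", "June 2nd, 2018", "June 2nd, 2018", "June 3rd, 2018"],
    "Protocol":  ["TCP", "HTTP", "HTTP", "HTTP"],
    "Urgent":    [False, True, True, False],
    "Malicious": [True, True, False, True],
})

print(connections.shape)  # (4, 4): four observations, four attributes
```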
Because we will, for the most part, consider our data to be tabular, we can also look at specific instances where the matrix of data has only one column/attribute. For example, suppose we are building a piece of software that takes in a single image of a room and outputs whether or not there is a human in that room. The data for the input might be represented as a matrix with a single column, where that column is simply a URL to a photo of a room and nothing else.

For example, consider the following table, which has only a single column, titled Photo URL. The values of the table are URLs (these are fake, do not lead anywhere, and are purely for example) of photos that are relevant to the data scientist:
| Photo URL |
| --- |
| http://photo-storage.io/room/1 |
| http://photo-storage.io/room/2 |
| http://photo-storage.io/room/3 |
| http://photo-storage.io/room/4 |
The data that is input into the system might be only a single column, such as in this case. To create a system that can analyze images, the input might simply be a URL to the image in question. It would be up to us, as data scientists, to engineer features from the URL.
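One way this could look in practice is sketched below. Since the URLs above are fake, this cannot actually be run against them; it assumes each URL returns an image file, and the two derived features are invented for illustration:

```python
# A hedged sketch of engineering numeric features from a single URL column.
import io

import numpy as np
import requests
from PIL import Image

def pixel_features(url):
    """Fetch the photo behind a URL and derive simple numeric features."""
    response = requests.get(url)
    image = Image.open(io.BytesIO(response.content)).convert("L")  # grayscale
    pixels = np.asarray(image, dtype=float)
    # Two toy features engineered from the raw URL: brightness and contrast
    return {"mean_brightness": pixels.mean(), "brightness_std": pixels.std()}

# pixel_features("http://photo-storage.io/room/1")  # fake URL, illustration only
```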
As data scientists, we must be ready to ingest and handle data that might be large, small, wide, narrow (in terms of attributes), or sparse in completion (there might be missing values), and be ready to utilize this data for the purposes of machine learning. Now's a good time to talk more about that. Machine learning algorithms belong to a class of algorithms that are defined by their ability to extract and exploit patterns in data to accomplish a task based on historical training data. Vague, right? Machine learning can handle many types of tasks, and therefore we will leave the definition of machine learning as is and dive a bit deeper.
We generally separate machine learning into two main types, supervised and unsupervised learning. Each type of machine learning algorithm can benefit from feature engineering, and therefore it is important that we understand each type.
Supervised learning is all about making predictions: we utilize the features of the data to make informative predictions about the response of the data. Unsupervised learning, in contrast, is not about making predictions; it is about extracting structure from our data. We generally do so by applying mathematical transformations to numerical matrix representations of the data, or iterative procedures, to obtain new sets of features.
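To make the supervised case concrete before we turn to unsupervised learning, here is a minimal sketch (invented for illustration, not the book's code) that uses the Protocol and Urgent attributes from the earlier network intrusion table as features to predict the Malicious response:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "Protocol":  ["TCP", "HTTP", "HTTP", "HTTP"],
    "Urgent":    [False, True, True, False],
    "Malicious": [True, True, False, True],
})

# Encode the categorical Protocol column numerically so the model can use it
X = pd.get_dummies(data[["Protocol", "Urgent"]])
y = data["Malicious"]  # the response we want to predict

model = LogisticRegression()
model.fit(X, y)
print(model.predict(X))  # predicted Malicious label for each connection
```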
Unsupervised learning can be a bit more difficult to grasp than supervised learning, so we will present a motivating example to help elucidate how this all works.
Suppose we are given a large (one million rows) dataset where each row/observation is a single person with basic demographic information (age, gender, and so on) as well as the number of items purchased, which represents how many items this person has bought from a particular store:
| Age | Gender | Number of items purchased |
| --- | --- | --- |
| 25 | F | 1 |
| 28 | F | 23 |
| 61 | F | 3 |
| 54 | M | 17 |
| 51 | M | 8 |
| 47 | F | 3 |
| 27 | M | 22 |
| 31 | F | 14 |
This is a sample of our marketing dataset where each row represents a single customer with three basic attributes about each person. Our goal will be to segment this dataset into types or clusters of people so that the company performing the analysis can understand the customer profiles much better.
Now, of course, we've only shown 8 out of one million rows, which can be daunting. We can perform basic descriptive statistics on this dataset and get averages, standard deviations, and so on of our numerical columns; however, what if we wished to segment these one million people into different types so that the marketing department can have a much better sense of the types of people who shop and create more appropriate advertisements for each segment?
Each type of customer would exhibit particular qualities that make that segment unique. For example, they may find that 20% of their customers fall into a category they like to call young and wealthy that are generally younger and purchase several items.
This type of analysis and the creation of these types can fall under a specific type of unsupervised learning called clustering. We will discuss this machine learning algorithm in further detail later on in this book, but for now, clustering will create a new feature that separates out the people into distinct types or clusters:
| Age | Gender | Number of items purchased | Cluster |
| --- | --- | --- | --- |
| 25 | F | 1 | 6 |
| 28 | F | 23 | 1 |
| 61 | F | 3 | 3 |
| 54 | M | 17 | 2 |
| 51 | M | 8 | 3 |
| 47 | F | 3 | 8 |
| 27 | M | 22 | 5 |
| 31 | F | 14 | 1 |
This shows our customer dataset after a clustering algorithm has been applied. Note the new column at the end, called Cluster, that represents the types of people that the algorithm has identified. The idea is that the people who belong to similar clusters behave similarly with regard to the data (have similar ages, genders, and purchase behaviors). Perhaps cluster six might be renamed young buyers.
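As a taste of what is to come, here is a minimal sketch of how such a Cluster column could be produced with scikit-learn's KMeans; the choice of three clusters is arbitrary, and the labels it assigns will not match the table above:

```python
import pandas as pd
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "Age": [25, 28, 61, 54, 51, 47, 27, 31],
    "Gender": ["F", "F", "F", "M", "M", "F", "M", "F"],
    "Number of items purchased": [1, 23, 3, 17, 8, 3, 22, 14],
})

# Encode Gender numerically so the distance-based algorithm can use it
features = pd.get_dummies(customers, columns=["Gender"])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
customers["Cluster"] = kmeans.fit_predict(features)
print(customers)
```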
This example of clustering shows us why sometimes we aren’t concerned with predicting anything, but instead wish to understand our data on a deeper level by adding new and interesting features, or even removing irrelevant features.
It’s all starting to make sense now, isn’t it? These features that we talk about repeatedly are what this book is primarily concerned with. Feature engineering involves the understanding and transforming of features in relation to both unsupervised and supervised learning.
It is important to note that in the literature there is often a stark contrast between the terms features and attributes. The term attribute is generally given to columns in tabular data, while the term feature is generally given only to attributes that contribute to the success of machine learning algorithms. That is to say, some attributes can be unhelpful or even hurtful to our machine learning systems. For example, when predicting how long a used car will last before requiring servicing, the color of the car will probably not be very indicative of this value.
In this book, we will generally refer to all columns as features until they are proven to be unhelpful or hurtful. When this happens, we will usually cast those attributes aside in the code. It is extremely important, then, to consider the basis for this decision. How does one evaluate a machine learning system and then use this evaluation to perform feature engineering?