You might already know that there's a wealth of data science and machine learning resources available on the market, but what you might not know is how much most of these AI resources leave out. This book not only covers everything you need to know about algorithm families but also ensures that you become an expert in everything from the critical aspects of avoiding bias in data to model interpretability, skills that have now become must-haves.
In this book, you'll learn how using Anaconda as the easy button can give you a complete view of the capabilities of tools such as conda, including how to specify new channels to pull in any package you want, as well as how to discover the new open source tools at your disposal. You'll also get a clear picture of how to evaluate which model to train and how to identify when a model has become unusable due to drift. Finally, you'll learn about the powerful yet simple techniques that you can use to explain how your model works.
By the end of this book, you’ll feel confident using conda and Anaconda Navigator to manage dependencies and gain a thorough understanding of the end-to-end data science workflow.
A comprehensive starter guide to building robust and complete models
Dan Meador
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Gebin George
Senior Editor: Tazeen Shaikh
Content Development Editor: Sean Lobo
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Jyoti Chauhan
Marketing Coordinator: Abeer Riyaz Dawe
First published: May 2022
Production reference: 1280422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-878-5
www.packt.com
Data science has transformed not only the software industry but also the physical sciences, sociology, engineering, government, and society at large. While artificial intelligence has its roots in the earliest days of computer science, the rate of evolution and change in data science over the last decade is astounding. Yet even professionals in the field can struggle to keep up, and those who wish to join the profession can feel daunted by the amount they need to learn.
One of the keys to the pace of innovation in data science is the amount of energy expended on open source by researchers, universities, and companies alike. Making the most advanced tools of data science available to everyone lowers many of the barriers to entry for practitioners and encourages new and better tools.
We have been proud to be part of this open source data science revolution at Anaconda for the last 10 years. Over 25 million data scientists and other numerical computing professionals and students use Anaconda Distribution. Our relationship with this community gives us tremendous insights into the cutting-edge practices and the day-to-day work of people in the field.
Dan Meador gives you a valuable grounding in using open source data science tools to solve real-world problems in this book. Once you have completed your reading, you will understand many of the mechanisms, practices, and challenges underpinning modern data science. Dan's experience in the field, and as the manager of Anaconda's conda command-line tool and its Navigator desktop app, gives him unique insight into the tools that millions use daily for their work.
Especially relevant to today's data practitioners is the second section, covering the topics of bias in AI/ML and choosing the best algorithms. When data science is (rightfully) challenged for the harm it can cause, it is not because of the malicious intent of data scientists but rather the lack of understanding of these critical issues. Therefore, being aware of bias in our algorithms and models is crucial for any data professional.
Whether you are a newcomer interested in learning the tools of the trade for data science or an experienced professional looking to expand your knowledge, you will find this book to be a valuable resource and a foundation for exploring these topics and others at another level of depth.
All the best on your journey!
Foreword by Kevin Goldsmith, CTO at Anaconda
Dan Meador is an engineering manager at Anaconda leading the conda team and championing open source. He also holds a patent for his work on AI systems and has grown his experience in AI/ML by creating AutoML solutions. He has seen how the power of data can work in everything from startups to Fortune 10 companies.
Andre Ye is a deep learning researcher at the University of Washington, focusing on improving the robustness and performance of deep computer vision systems for domain-specific applications. Documenting the field of data science is one of his strongest passions. He has written over 300 data science articles in various online publications and has published Modern Deep Learning Design and Application Development, a book exploring modern developments in designing effective deep learning systems. In his spare time, Andre enjoys keeping up with current data science research and jamming to hard metal.
Keith Moore is the chief product officer (CPO) for AutoScheduler.AI. He works with consumer goods, beverage, and distribution companies to drive efficiency in distribution centers. As the CPO, Moore's focus is on creating the future with the prescriptive warehouse. Moore was voted by Hart Energy Magazine as an Energy Innovator of the Year in 2020, was selected as a Pi Kappa Phi 30 under 30 member, and holds multiple patents in the fields of neural architecture search and supply chain planning. Moore attended the University of Tennessee, where he received a bachelor of science in mechanical engineering.
In this section, you'll learn about the most common open source software used for data science, the changes that are happening, and how to make use of them. You'll get hands-on with managing packages in a world with new options every day. You'll get a firm understanding of Anaconda and its tools, and how they enable you to work more efficiently.
This section includes the following chapters:
Chapter 1, Understanding the AI/ML Landscape
Chapter 2, Analyzing Open Source Software
Chapter 3, Using Anaconda Distribution to Manage Packages
Chapter 4, Working with Jupyter Notebooks and NumPy

In this opening chapter, we'll give you some appreciation of, and context for, the why behind AI and machine learning (ML). The only data we have comes from the past, and using it will help us predict the future. We'll take a look at the massive amount of data that is coming into the world today and try to get a sense of the scale of what we have to work with.
The main goal of any type of software or algorithm is to solve business and real-world problems, so we'll also take a look at how the applications take shape. If we use a food analogy, data would be the ingredients, the algorithm would be the chef, and the meal created would be the model. You'll learn about the most commonly used types of models within the broader landscape and how to know what to use.
There is a huge number of tools that you could use as a data scientist, so we will also touch on how you can use solutions such as those provided by Anaconda to do the actual work you want to do and to take action as your models grow stale (which they will). By the end of this chapter, you'll have an understanding of the value and landscape of AI and be able to jumpstart any project that you want to build.
AI is the most exciting technology of our age and, throughout this first chapter, these topics will give you the solid foundation that we'll build upon through the rest of the book. These are all key concepts that will be commonplace in your day-to-day journey, and which you'll find to be invaluable in accomplishing what you need to.
In this chapter, we're going to cover the following main topics:
Understanding the current state of AI and ML
Understanding the massive generation of new data
How to create business value with AI
Understanding the main types of ML models
Dealing with out-of-date models
Installing packages with Anaconda

AI is moving fast. It has become so commonplace that we now simply expect systems to be intelligent. For example, not too long ago, the technology to compete against a human mind in chess was a groundbreaking piece of AI to be marveled at. Now we don't even give it a second thought. Millions of tactical and strategic calculations a second is now just a simple game that can be found on any computer or played on hundreds of websites.
That seemingly was intelligence… that was artificial. Simple, right? With spam blockers, recommendation engines, and optimized delivery routes, the goalposts keep shifting so much that what was once thought of as AI is now simply regarded as a set of everyday tools.
What was once considered AI is now thought of as simply software. It seems that AI just means problems that are still unsolved. As those become normal, day-to-day operations, they fade away from what we generally think of as AI. This is known as Tesler's Theorem, named after Larry Tesler, which states that "Artificial intelligence is whatever hasn't been done yet."
For example, if you asked someone what AI is, they would probably talk about autonomous driving, drone delivery, and robots that can perform very complex actions. All of these examples are very much in the realm of unsolved problems, and as (or if) they become solved, they may no longer be thought of as AI as the newer, harder problems take their place.
Before we dive any further, let's make sure we are aligned on a few terms that will be a focal point for the rest of the book.
It's important to call out the fact that there is no universal label as to what AI is, but for the purpose of this book, we will use the following definition:
"Artificial Intelligence (AI) is the development of computer systems to allow them to perform tasks that mimic the intelligence of humans. This can use vision, text, reading comprehension, complex problem solving, labeling, or other forms of input."
Along with the definition of AI, defining what a data scientist is can also lead you to many different descriptions. Know that as with AI, the field of data science can be a very broad category. Josh Wills tweeted that a data scientist is the following:
"A person who is better at statistics than any software engineer and better at software engineering than any statistician."
While there may be some truth to that, we'll use the following definition instead:
"A data scientist is someone who gains insight and knowledge from data by analyzing, applying statistics, and implementing an AI approach in order to be able to answer questions and solve problems."
If you are reading this, then you probably fall into that category. There are many tools that a data scientist needs to be able to utilize to work toward the end goal, and we'll learn about many of those in this book.
Now that we've set a base level of understanding of what AI is, let's take a look at where the state of the world is regarding AI, and also learn about where ML fits into the picture.
The past is the only place where we can gather data to make predictions about the future. This is one of the core value propositions of AI and ML, and it is true for the field itself. I'll spare you too much of the history lesson, but know that the techniques and approaches used today aren't new. In fact, neural networks have been around for over 60 years! Knowing this, keep in mind on your data science journey that a white paper or approach that you deem old or out of date might simply be waiting for the technology or data to catch up to it.
These systems allow for much greater scalability, distribution, and speed than if we had humans perform those same tasks. We will dive more into specific problem types later in the chapter.
Currently, one of the best-known approaches to creating AI is neural networks, an approach for which data scientists drew inspiration from how the human brain works. Neural networks only became a genuinely viable path when two things happened:
We made the connection in 2012 that, just like our brain, we could get vastly better results if we created multiple layers.
GPUs became fast enough to be able to train models in a reasonable timeframe.

This huge leap in AI techniques would not have been possible if we had not come back to the ideas of the past with fresh eyes and newer hardware.
Before more advanced GPUs were used, it simply took too long to train a model, and so this wasn't practical. Think about an assembly line making a car. If it moved along at one meter a day, the end result would still be a working car, but it would take an extremely long time to produce one (Henry Ford's 1914 assembly line moved at two meters a minute). Similar to 4K (and 8K) TVs being largely useless until streaming or Blu-ray formats gave us content that could actually be shown in 4K, sometimes other supporting technology needs to improve before an idea's applications can be fully realized.
The massive increase in computational power in the last decade has unlocked the ability for the tensor computations to really shine and has taken us a long way from the Cornell report on The Perceptron (https://bit.ly/perceptron-cornell), the first paper to mention the ideas that would become the neural networks we use today. GPU power has increased at a rate such that the massive number of training runs can be done in hours, not years.
Tensors themselves are a common occurrence in physics and other forms of engineering and are an example of how data science has a heavy influence from other fields and has an especially strong relationship with mathematics. Now they are a staple tool in training deep learning models using neural networks.
Tensors
A tensor is a mathematical term, but in neural networks it simply refers to a data structure. It can refer to matrices, vectors, and any n-dimensional arrays, and it is mostly used to describe the latter when it comes to neural networks. It is where TensorFlow, the popular Google library, gets its name.
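To make the idea concrete, here is a minimal sketch using NumPy (which ships with Anaconda Distribution); the arrays and shapes are invented purely for illustration:

import numpy as np

vector = np.array([1.0, 2.0, 3.0])            # a rank-1 tensor (vector)
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # a rank-2 tensor (matrix)
rank3 = np.zeros((2, 3, 4))                   # a rank-3 tensor (an n-dimensional array)

print(vector.ndim, matrix.ndim, rank3.ndim)     # 1 2 3
print(vector.shape, matrix.shape, rank3.shape)  # (3,) (2, 2) (2, 3, 4)

Higher-rank arrays such as rank3 are what deep learning libraries pass around during training, which is why the term comes up so often.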
Deep learning is a technique in the field of AI and, more specifically, ML, but aren't they the same thing? The answer is no. Understanding the difference will help you focus on particular subsets and ensure that you have a clear picture of what is out there. Let's take a more in-depth look next.
Machine Learning (ML) is simply a machine being able to infer things based on input data without having to be specifically told what things are. It learns and deduces patterns and tries its best to fit new data into that pattern. ML is, in fact, a subset of the larger AI field, and since both terms are so widely used, it's valuable to get some brief examples of different types of AI and how the subsets fit into the broader term.
Let's look at a simple Venn diagram that shows the relationship between AI, ML, and deep learning. You'll see that AI is the broader concept, with ML and deep learning being specific subsets:
Figure 1.1 – Hierarchy of AI, ML, and deep learning
An example of AI that isn't ML is an expert system. This is a rule-based system that is designed for a very specific case, and in some ways can come down to if-else statements. It is following a hand-coded system behind the scenes, but that can be very powerful. A traffic light that switches to green if there is more than x number of cars in the North/South lane, but fewer than y cars in the East/West lane, would be an example.
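As a rough, hypothetical sketch of what such a hand-coded system looks like (the car-count thresholds here are made up, and nothing is learned from data):

# A hand-coded expert-system rule: fixed thresholds, no learning involved.
def traffic_light_direction(north_south_cars, east_west_cars,
                            ns_threshold=10, ew_threshold=3):
    if north_south_cars > ns_threshold and east_west_cars < ew_threshold:
        return "green for North/South"
    return "green for East/West"

print(traffic_light_direction(north_south_cars=15, east_west_cars=2))

Powerful as rules like this can be, every threshold and branch has to be written and maintained by hand, which is exactly where ML takes a different approach.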
These expert systems have been around for a long time, and the chess game was an example of that. The famous Deep Thought from Carnegie Mellon searched about 500 million possible outcomes per move to hunt down the best one. It was enough to put even the best chess players on the ropes. It later gave way to Deep Blue, which started to look like something closer to ML as it used a Bayesian structure to achieve its world conquest.
That's not AI! You might say. In an odd twist… IBM agrees with you, at least in the late 90s, as they actually claimed that it wasn't AI. This was likely due to the term having negative connotations associated with it. However, this mentality has changed in modern times. Many of the early promises of AI have come to fruition, solving many issues we knew we wanted to solve, and creating whole new sectors such as chatbots.
AI can be complex image detection, such as for self-driving, and voice recognition systems, such as Amazon's Alexa, but it can also be a system made up of relatively simple instructions. Think about how many simple tasks you carry out based on incredibly simple patterns. I'm hungry, I should eat. Those clothes are red, those are white, so they belong in different bins. Pretty simple right? The fact is that AI is a massive term that can include much more than what it's given credit for.
Much of what AI has become in the last 10 years is due to the high amount of data that it has access to. In the next section, we'll take a look in a little more detail at what that looks like.
Put simply, data is the fuel that powers all things in AI. The amount of data is staggering. In 2018, it was calculated that 90% of the world's data had been created in the previous 2 years, and there is no reason to think that stat won't still hold no matter when you read this. But who cares? On its own, data means nothing without being able to use it.
Fracking, a relatively new technique for extracting oil, has opened up access to previously unreachable pockets. Without it, those energy reserves would have sat there doing nothing. This is exactly what AI does with data. It lets us tap into previously useless holdings of information in order to unlock their value.
Data is just a recording of a specific state of the world or an event at a specific time. The ability and cost to do this have followed the famous Moore's law, making it cheaper and quicker to store and retrieve huge amounts. Just look at the price of hard drives throughout the years, going from $3.5 million per GB in 1964 to about $0.02 per GB in 2021.
Moore's Law
From the famed CEO of Intel, Moore's law states that the number of transistors that can fit on a chip will double every 2 years. This is many times misquoted as 18 months, but that is actually a separate prediction from a fellow Intel employee, David House, based on power consumption. This could also have been a self-fulfilling prophecy, as that was the goal, and not just happenstance.
It turned out that this law applies to many things outside of just compute speed. The cost of many goods (especially tech) follows this. In automotive, TV cost/resolution, and many other fields, you will find a similar curve.
If you heard that your used Coke cans would be worth $10 once a new recycling factory was built, would you throw them in the garbage? That is similar to what all companies are hearing about their data. Even though they might not be making use of it today, they are still collecting and storing everything they can in the hope that someday it can be used. Storage is also cheap, and getting cheaper. Because of both of these factors, it is seen as a much better move to save this data, as it could be worth far more than the cost of storing it.
What data do you have that could be valuable? Consider HR hiring reports, the exact time of each customer purchase, or keywords in searches – each piece of data on its own might not give you much insight, but combined with the power ML gives you to find patterns, that data could have incredible value. Even if you don't see what could be done now, the message to companies is Just hang on to it, maybe there is a use for it. Because of this, companies have become data pack rats.
One movement that has led to a huge increase in data is the massive increase in the number of IoT devices. IoT stands for Internet of Things, and it is the concept that every day, normal devices are connected to the same internet that you get your email, YouTube, and Facebook from. Light switches, vacuums, and even fridges can be connected and send data which is collected by the manufacturer in order to improve their functionality (hopefully).
These seemingly tiny pinpricks of data combined to create 13.6 zettabytes in 2019, and it's not slowing down. By 2025, there will be 79.4 zettabytes! You will be hard-pressed to find new devices that aren't IoT-ready, as companies are always looking to add that new feature to the latest offering. From a physical perspective, if each gigabyte in a zettabyte was a brick, you would have enough bricks to build the Great Wall of China (estimated at 3,873,000,000 bricks) 258 times over. That's a lot of data to take care of and process!
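If you want to sanity-check that comparison, the back-of-the-envelope arithmetic is simple (the Great Wall brick count is the estimate quoted above):

gb_per_zettabyte = 10 ** 12            # 1 zettabyte is a trillion gigabytes
bricks_in_great_wall = 3_873_000_000   # rough estimate of bricks in the Great Wall

print(round(gb_per_zettabyte / bricks_in_great_wall))  # roughly 258 walls per zettabyte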
New technologies and even software architecture patterns have been developed to handle all this data. Event-based architecture is a way to handle data that turns the traditional database model inside out. Instead of storing everything in a database, it has a stream of events, and anything that needs that data can reach into the stream and grab what they need. There is so much data that they don't even bother putting it in one place!
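As a toy illustration of the idea (not any particular product's API), picture a stream of events with independent consumers that each reach in and take only what they need:

# A simplified, in-memory stand-in for an event stream.
events = [
    {"type": "purchase", "item": "smart bulb", "amount": 24.99},
    {"type": "page_view", "page": "/pricing"},
    {"type": "purchase", "item": "smart hub", "amount": 99.00},
]

# Each consumer filters the stream for the events it cares about,
# rather than querying one central database.
revenue = sum(e["amount"] for e in events if e["type"] == "purchase")
views = sum(1 for e in events if e["type"] == "page_view")
print(revenue, views)

Real systems use dedicated streaming platforms rather than a Python list, but the shape of the idea is the same.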
But more data isn't always the answer. There are many times that more data is the enemy. Datasets that have a large amount of incorrectly labeled data can make for a much more poorly trained model. This, of course, makes intuitive sense. Trying to teach a person or a computer something while giving them examples that aren't valid isn't going to get the output that you are looking for.
Let's look at a simple example of explaining to your child what a tiger is. You point out two large orange cats with black stripes and tell them Look, that's what a tiger looks like. You then point to an all-black cat and say And look! There is another one, incorrectly telling your child that this other animal is also a tiger. You have now created a dataset that contains a false positive, and that could make it challenging for your child to learn what an actual tiger is. False positives might be an issue with this three-example dataset, whereas false negatives might be an issue if your child had only ever seen a single example.
Important Note
This is for example purposes only. Any model trained on just one data point is almost guaranteed to not provide very accurate end results. There is a field of study known as one-shot learning that attempts to work with just one data point, but this is generally found in vision problems.
You might also have an issue where the data being fed in doesn't resemble the live production data. Training data being as close as possible to test data is critical, so if, in our training example from before, you pointed out a swimming tiger from 300 ft away, your child might find it very challenging to identify one when they see one walking from 10 ft away in the zoo. More doesn't always equal better.
Having data is critical to the success of AI, but the true driving force behind its adoption is what it can do for the world of business, such as Netflix recommending shows you will like, Google letting advertisers get their business in front of the right people, and Amazon showing you other products to fill your cart. This all allows businesses to scale like no other technique or approach out there and helps them continue to dominate in their space.
What do Facebook (now Meta), Apple, Amazon, Netflix, and Google have in common? Well, other than being companies that make up the popular FAANG acronym, they have a heavy focus on ML. Each one relies on this technology to not just make small percentage gains in areas, but many times, it is this tech that is at the heart of what they do. And as a key point, the only reason they keep AI and ML at the heart of what they do is because of the value it creates. Always focus on delivering business value and solving problems when you look to apply AI.
Google owes much of its growth to its hugely successful ad algorithms. Responsive search ads (RSAs) have a simple goal: optimizing ads to achieve the best outcome. RSA does this by pulling the levers it has, such as the headlines and body copy, to maximize that outcome, which is clicks. Its bidding algorithm applies a dynamic pricing model for certain keywords based on a massive number of factors such as location, gender, age group, search profiles, and many others. Alphabet's (Google's parent company) revenue in 2020 was $182.5 billion, with a b. There is no other way to generate such a massive amount of cash with so few people than software and ML.
Some of these minor adjustments include changing the price by a penny, shifting a targeted user age group by a year, and changing the actual ad that is shown based on the context of the web page. Google's algorithms then measure whether the change was successful. Can you imagine if a developer had to code up each individual change to a pricing model after making such small changes? Even if they did, there would be a much lower chance that the adjustments being made would be the ones actually impacting the desired end value.
For another example of how a system can determine what shows we might like, let's look at Netflix. Netflix is able to suggest what you might want to watch next by using a recommendation system that uses your past viewing history to make predictions about future viewing habits. A recommendation system is simply something that predicts, with varying degrees of accuracy, how likely you are to like a piece of content. We make use of this technique every time we pull up our Amazon home page, get an email from Netflix that there is a new show that we might like, or doomscroll through Twitter.
There is a massive business value to each of these, as getting more eyeballs on screens helps sell more ads and products, improves click-through rates (the number of people who actually click on an ad), and increases other metrics that, at the end of the day, make that platform more valuable than if you simply got a list of the top 10 most sold items. You have your own personal robot behind the scenes trying to help you out. Feels nice, doesn't it?
Netflix does this by creating latent features for each movie and show without having to use the old style of asking users questions such as: How much do you like comedies? Action movies? Sports movies? Think about how many people never filled these out, and how Netflix therefore couldn't retain them at the same rate as the people who did.
This prediction system was so valuable that Netflix offered a million-dollar prize to anyone who could improve it by 10%. This tells us two things:
That there is a huge business value in improving the AI system
That there is a dire shortage of people that Netflix could find to work on this problem

A latent feature is a synthetic tag generated from a weighted combination of other attributes. Did you ever describe a movie to a friend as epic? That was your mind putting together attributes of a movie, the sum total of which created the epic label. This allows for more freedom and essentially infinite ways to combine what already exists to create a system that can determine (with incredible accuracy) what someone will like and buy. This also allows a reduced number of features to be considered.
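As a rough, made-up illustration of the idea, a latent feature such as epic can be thought of as nothing more than a weighted combination of simpler attributes (both the attributes and the weights below are invented for the example):

import numpy as np

# Observed attributes of one movie: runtime, battle scenes, comedy, sweeping soundtrack.
attributes = np.array([0.9, 0.8, 0.2, 0.7])
# Weights that define the synthetic "epic" latent feature.
weights = np.array([0.3, 0.4, -0.2, 0.5])

epic_score = attributes @ weights
print(round(epic_score, 2))  # a single "epic" score distilled from many attributes

In a real recommendation system, weights like these are learned from millions of viewing histories rather than hand-picked.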
Amazon has made real-world moves based on this, shipping items to local distribution centers before people buy things in order to increase delivery speed. This allows yet another competitive advantage that increases their ability to capture and retain customers. Maybe tomorrow you'll have a drone waiting outside your house for a few minutes waiting on you to click buy now on that new phone you've been eyeing for a week.
Every example here should show you not only how AI can solve problems at scale, but also that AI and ML are not just technical fields. If you really want to make an impact, you need to make sure you keep the business problems you are trying to solve at the forefront of your mind.
These are just a few of the numerous business problems that you may want to solve, but what techniques would you even start with when looking at them for the first time? In the next section, we will dive into just that.
Here we are going to take a look at some of the vast number of techniques and approaches that can be used to solve your problems. Similar to how a hacker may know a handful of techniques in their field (contrary to what Hollywood has you believe), a data scientist might know only one branch or area of the following really well. So don't be discouraged. The key is being able to know what tool to use based on the problem you have.
To put this in context, let's take a Star Wars example. Say you are put in charge of defense on the moon of Endor. You have data on the prior attacks of those pesky Ewoks. The Emperor is getting a little restless, so you decide to put ML to use to try and figure out what's going on and put a stop to it.
ML is very broad, so let's start with the four main categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. This distinction comes down to simply how much help the model gets when it's being trained and the desired outcome of the model. The dataset that the model is training on is, appropriately, called the training set. Let's take a look at each of these ML categories in a little more detail.
Supervised learning is used when you have labeled training data that you feed in. A famous and early example is spam detection. Another would be predicting the price of a car. These are both examples where you know the right answer. The key is that the data is labeled with the main feature you care about and can use that to train your model.
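As a minimal sketch of the car-price case using scikit-learn (which is included in Anaconda Distribution), with entirely invented numbers:

import numpy as np
from sklearn.linear_model import LinearRegression

# Labeled training data: features are (age in years, mileage in thousands),
# and the label is the known selling price.
X = np.array([[1, 10], [3, 40], [5, 70], [8, 120]])
y = np.array([28000, 21000, 15000, 9000])

model = LinearRegression().fit(X, y)
print(model.predict([[4, 55]]))  # predicted price for a car the model has never seen

The important part is the y array: every training example comes with the answer, which is what makes this supervised.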
Back to the Endor moon, you have the following data. It shows reports with the weather, time of day, whether shipments are coming in, the number of guards, and other data that may be useful, along with a simple Boolean label of attack: True/False:
Figure 1.2 – Attack data from the moon of Endor
This is a great use case for supervised learning (in this case, should you be worried?). We'll look at this scenario in more detail in Chapter 7, Choosing the Best AI Algorithm.
Algorithms that fall under the supervised learning category are as follows:
Logistic Regression
Linear Regression
Support Vector Machines
K-Nearest Neighbor (a density-based approach)
Neural Networks
Random Forest
Gradient Boosting

Next, we will cover unsupervised learning.
Unsupervised learning is used when you do not have labeled training data that you feed in. A common scenario is when you have a group of entities and need to cluster them into groups. Some examples where this is used are advertising campaigns based on specific sub-sets of customers, or movies that might share common characteristics.
In the following diagram, you can see different customers from a hypothetical company, and there seem to be three separate groups they naturally fall into:
Figure 1.3 – Example of classification problem with three customer groups
This diagram might represent some customers of a new movie recommendation engine you are trying to build, and each group should get a separate genre sent to them for their viewing pleasure.
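A minimal sketch of how you might find those groups with k-means (one of the unsupervised algorithms listed later in this section) could look like the following; the customer numbers are invented:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled customers: (hours watched per week, average show length in minutes).
customers = np.array([[2, 25], [3, 30], [10, 60], [11, 55], [20, 110], [22, 120]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # which of the three groups each customer was assigned to

Notice that there is no y array here at all; the algorithm has to discover the groups on its own.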
There is another related approach that takes the same idea of grouping, but instead of trying to see what people or entities fit into a group, you are trying to find the entities that don't fall into one of the main groups. These outliers don't fit into the pattern of the others, and searching for them is known as anomaly detection.
Anomaly detection is also a form of unsupervised learning. You can't have a labeled list of things that are normal and not normal, because how would you know? There isn't a sure way to go through and label all the different ways that something could look inconsistent, as that would be very time-consuming and borderline impossible. This type of problem is also known as outlier detection due to the goal being the detection of those entities that are different from the others.
This can be vital when looking at identity fraud to understand whether an action or response falls out of place of a normal baseline. If you have ever gotten a text or email from your credit card company asking whether it was you that made a purchase, that is anomaly detection at work! There is no way for them to code up every possible scenario that could happen outside the normal, and this again shows the power of ML.
Looking back at our earlier scenario on the moon of Endor, you know that there is some suspicious key card access that has happened. You look at all the data but can't make much of it. You know there isn't a way to figure out which logins in the past were valid, so you can't label the data and thus you determine that this falls into the unsupervised bucket of algorithms.
The following diagram shows what a dataset of these key card events might look like; it is a good candidate for an unsupervised approach, specifically anomaly detection. As you can see, there are no labels on any of the data points.
Figure 1.4 – Example of an anomaly problem with one anomaly
One of the data points (the top right) clearly has some characteristics that make it stand out from the rest of the group. The keystrokes take much longer, and the heat sensor reading at the time is much higher. This is a simplistic representation, but you can see how something seems out of place to the eye. That's the essence of what an anomaly is.
With the preceding example, do you think that the event on the bottom right should be investigated, or does it seem normal? It might be worthwhile looking into who accessed the system at that time.
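To give a feel for how this can be automated, here is a small hypothetical sketch using an isolation forest (covered in the list that follows); the key card readings are invented:

import numpy as np
from sklearn.ensemble import IsolationForest

# Unlabeled key card events: (keystroke duration in seconds, heat sensor reading).
events = np.array([[0.9, 21.0], [1.0, 20.5], [1.1, 21.5], [0.8, 20.0], [4.5, 38.0]])

detector = IsolationForest(contamination=0.2, random_state=0).fit(events)
print(detector.predict(events))  # -1 flags the outlier, 1 marks the normal events

Here the last event stands out on both measurements, so the model flags it as the one worth investigating.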
Algorithms that fall under the unsupervised learning category are as follows:
K-means (a clustering-based approach)
Isolation forest
Principal Component Analysis (PCA)
Neural networks

We've just covered supervised and unsupervised learning, but there is another type that is somewhat of a mix and subset of the two. Let's take a quick look at what semi-supervised models look like.
Semi-supervised learning is the process by which you attempt to create a model from data that is both labeled and unlabeled to try and have the best of both the supervised and unsupervised techniques. It attempts to learn from both types of data and is very useful when you don't have the luxury of everything being labeled. Many times, you use the unsupervised approach to find higher-level patterns, and the supervised step to fine-tune exactly what those patterns represent.
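As a small hypothetical sketch of that mix, scikit-learn's label spreading takes a dataset where most labels are missing (marked with -1) and propagates the few known labels to the unlabeled points:

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Six data points, but only two carry labels (0 and 1); -1 means unlabeled.
X = np.array([[1.0], [1.2], [1.1], [5.0], [5.2], [5.1]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelSpreading().fit(X, y)
print(model.transduction_)  # inferred labels for every point, including the unlabeled ones

The handful of labels does the fine-tuning while the structure of the unlabeled data does the heavy lifting.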
In the following example, you'll see where you might have taken part in this process yourself without realizing it.