This book teaches you to design and develop data mining applications using a variety of datasets, starting with basic classification and affinity analysis. It covers a large number of libraries available in Python, including the Jupyter Notebook, pandas, scikit-learn, and NLTK.
You will gain hands-on experience with complex data types, including text, images, and graphs. You will also discover object detection using deep neural networks, currently one of the most active and challenging areas of machine learning.
With restructured examples and code samples updated for the latest version of Python, each chapter of this book introduces you to new algorithms and techniques. By the end of the book, you will have gained a solid understanding of using Python for data mining, covering both the algorithms and their implementations.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2015
Second edition: April 2017
Production reference: 1250417
ISBN 978-1-78712-678-7
www.packtpub.com
Author
Robert Layton
Copy Editor
Vikrant Phadkay
Reviewer
Asad Ahamad
Project Coordinator
Nidhi Joshi
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Divya Poojari
Indexer
Mariammal Chettiyar
Content Development Editor
Tejas Limkar
Graphics
Tania Dutta
Technical Editor
Danish Shaikh
Production Coordinator
Aparna Bhagat
Robert Layton is a data scientist investigating data-driven applications for businesses across a number of sectors. He received a PhD from the Internet Commerce Security Laboratory at Federation University Australia, where he investigated cybercrime analytics, before moving into industry and starting his own data analytics company, dataPipeline (www.datapipeline.com.au). Next, he created Eureaktive (www.eureaktive.com.au), which works with tech-based startups on developing their proof-of-concepts and early-stage prototypes. Robert also runs www.learningtensorflow.com, which is one of the world's premier tutorial websites for Google's TensorFlow library.
Robert is an active member of the Python community, having used Python for more than 8 years. He has presented at PyConAU for the last four years and works with Python Charmers to provide Python-based training for businesses and professionals from a wide range of organisations.
Robert can best be reached on Twitter at @robertlayton.
Asad Ahamad is a data enthusiast and loves to work on data to solve challenging problems.
He earned his master's degree in Industrial Mathematics with Computer Application from Jamia Millia Islamia, New Delhi. He has a deep admiration for mathematics and always looks for ways to apply it to deliver value for business.
He has solid experience in data mining, machine learning, and data science, and has worked for various multinationals in India. He mainly uses R and Python to perform data wrangling and modeling, and he is fond of using open source tools for data analysis.
He is an active social media user. Feel free to connect with him on Twitter at @asadtaj88.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787126781.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Getting Started with Data Mining
Introducing data mining
Using Python and the Jupyter Notebook
Installing Python
Installing Jupyter Notebook
Installing scikit-learn
A simple affinity analysis example
What is affinity analysis?
Product recommendations
Loading the dataset with NumPy
Downloading the example code
Implementing a simple ranking of rules
Ranking to find the best rules
A simple classification example
What is classification?
Loading and preparing the dataset
Implementing the OneR algorithm
Testing the algorithm
Summary
Classifying with scikit-learn Estimators
scikit-learn estimators
Nearest neighbors
Distance metrics
Loading the dataset
Moving towards a standard workflow
Running the algorithm
Setting parameters
Preprocessing
Standard pre-processing
Putting it all together
Pipelines
Summary
Predicting Sports Winners with Decision Trees
Loading the dataset
Collecting the data
Using pandas to load the dataset
Cleaning up the dataset
Extracting new features
Decision trees
Parameters in decision trees
Using decision trees
Sports outcome prediction
Putting it all together
Random forests
How do ensembles work?
Setting parameters in Random Forests
Applying random forests
Engineering new features
Summary
Recommending Movies Using Affinity Analysis
Affinity analysis
Algorithms for affinity analysis
Overall methodology
Dealing with the movie recommendation problem
Obtaining the dataset
Loading with pandas
Sparse data formats
Understanding the Apriori algorithm and its implementation
Looking into the basics of the Apriori algorithm
Implementing the Apriori algorithm
Extracting association rules
Evaluating the association rules
Summary
Features and scikit-learn Transformers
Feature extraction
Representing reality in models
Common feature patterns
Creating good features
Feature selection
Selecting the best individual features
Feature creation
Principal Component Analysis
Creating your own transformer
The transformer API
Implementing a Transformer
Unit testing
Putting it all together
Summary
Social Media Insight using Naive Bayes
Disambiguation
Downloading data from a social network
Loading and classifying the dataset
Creating a replicable dataset from Twitter
Text transformers
Bag-of-words models
n-gram features
Other text features
Naive Bayes
Understanding Bayes' theorem
Naive Bayes algorithm
How it works
Applying of Naive Bayes
Extracting word counts
Converting dictionaries to a matrix
Putting it all together
Evaluation using the F1-score
Getting useful features from models
Summary
Follow Recommendations Using Graph Mining
Loading the dataset
Classifying with an existing model
Getting follower information from Twitter
Building the network
Creating a graph
Creating a similarity graph
Finding subgraphs
Connected components
Optimizing criteria
Summary
Beating CAPTCHAs with Neural Networks
Artificial neural networks
An introduction to neural networks
Creating the dataset
Drawing basic CAPTCHAs
Splitting the image into individual letters
Creating a training dataset
Training and classifying
Back-propagation
Predicting words
Improving accuracy using a dictionary
Ranking mechanisms for word similarity
Putting it all together
Summary
Authorship Attribution
Attributing documents to authors
Applications and use cases
Authorship attribution
Getting the data
Using function words
Counting function words
Classifying with function words
Support Vector Machines
Classifying with SVMs
Kernels
Character n-grams
Extracting character n-grams
The Enron dataset
Accessing the Enron dataset
Creating a dataset loader
Putting it all together
Evaluation
Summary
Clustering News Articles
Trending topic discovery
Using a web API to get data
Reddit as a data source
Getting the data
Extracting text from arbitrary websites
Finding the stories in arbitrary websites
Extracting the content
Grouping news articles
The k-means algorithm
Evaluating the results
Extracting topic information from clusters
Using clustering algorithms as transformers
Clustering ensembles
Evidence accumulation
How it works
Implementation
Online learning
Implementation
Summary
Object Detection in Images using Deep Neural Networks
Object classification
Use cases
Application scenario
Deep neural networks
Intuition
Implementing deep neural networks
An Introduction to TensorFlow
Using Keras
Convolutional Neural Networks
GPU optimization
When to use GPUs for computation
Running our code on a GPU
Setting up the environment
Application
Getting the data
Creating the neural network
Putting it all together
Summary
Working with Big Data
Big data
Applications of big data
MapReduce
The intuition behind MapReduce
A word count example
Hadoop MapReduce
Applying MapReduce
Getting the data
Naive Bayes prediction
The mrjob package
Extracting the blog posts
Training Naive Bayes
Putting it all together
Training on Amazon's EMR infrastructure
Summary
Next Steps...
Getting Started with Data Mining
Scikit-learn tutorials
Extending the Jupyter Notebook
More datasets
Other Evaluation Metrics
More application ideas
Classifying with scikit-learn Estimators
Scalability with the nearest neighbor
More complex pipelines
Comparing classifiers
Automated Learning
Predicting Sports Winners with Decision Trees
More complex features
Dask
Research
Recommending Movies Using Affinity Analysis
New datasets
The Eclat algorithm
Collaborative Filtering
Extracting Features with Transformers
Adding noise
Vowpal Wabbit
word2vec
Social Media Insight Using Naive Bayes
Spam detection
Natural language processing and part-of-speech tagging
Discovering Accounts to Follow Using Graph Mining
More complex algorithms
NetworkX
Beating CAPTCHAs with Neural Networks
Better (worse?) CAPTCHAs
Deeper networks
Reinforcement learning
Authorship Attribution
Increasing the sample size
Blogs dataset
Local n-grams
Clustering News Articles
Clustering Evaluation
Temporal analysis
Real-time clusterings
Classifying Objects in Images Using Deep Learning
Mahotas
Magenta
Working with Big Data
Courses on Hadoop
Pydoop
Recommendation engine
W.I.L.L
More resources
Kaggle competitions
Coursera
The second edition of Learning Data Mining with Python was written with the programmer in mind. It aims to introduce data mining to a wide range of programmers, as I feel that this is critically important to all those in the computer science field. Data mining is quickly becoming the building block of the next generation of Artificial Intelligence systems. Even if you don't find yourself building these systems, you will be using them, interfacing with them, and being guided by them. Understanding the process behind them is important and helps you get the best out of them.

The second edition builds upon the first. Many of the chapters and exercises are similar, although new concepts are introduced and exercises are expanded in scope. Those who have read the first edition should be able to move quickly through the book, pick up new knowledge along the way, and engage with the extra activities proposed. Those new to the book are encouraged to take their time, do the exercises, and experiment. Feel free to break the code to understand it, and reach out if you have any questions.

As this is a book aimed at programmers, we assume that you have some knowledge of programming and of Python itself. For this reason, there is little explanation of what the Python code itself is doing, except in cases where it is ambiguous.
Chapter 1, Getting Started with Data Mining, introduces the technologies we will be using, and implements two basic algorithms to get started.
Chapter 2, Classifying with scikit-learn Estimators, covers classification, a key form of data mining. You'll also learn about some structures for making your data mining experimentation easier to perform.
Chapter 3, Predicting Sports Winners with Decision Trees, introduces two new algorithms, Decision Trees and Random Forests, and uses them to predict sports winners by creating useful features.
Chapter 4, Recommending Movies using Affinity Analysis, looks at the problem of recommending products based on past experience, and introduces the Apriori algorithm.
Chapter 5, Features and scikit-learn Transformers, introduces more types of features you can create, and how to work with different datasets.
Chapter 6, Social Media Insight using Naive Bayes, uses the Naive Bayes algorithm to automatically parse text-based information from the social media website Twitter.
Chapter 7, Follow Recommendations Using Graph Mining, applies cluster analysis and network analysis to find good people to follow on social media.
Chapter 8, Beating CAPTCHAs with Neural Networks, looks at extracting information from images, and then training neural networks to find words and letters in those images.
Chapter 9, Authorship Attribution, looks at determining who wrote a given document, by extracting text-based features and using Support Vector Machines.
Chapter 10, Clustering News Articles, uses the k-means clustering algorithm to group together news articles based on their content.
Chapter 11, Object Detection in Images using Deep Neural Networks, determines what type of object is being shown in an image, by applying deep neural networks.
Chapter 12, Working with Big Data, looks at workflows for applying algorithms to big data and how to get insight from it.
Appendix, Next Steps, goes through each chapter, giving hints on where to go next for a deeper understanding of the concepts introduced.
It should come as no surprise that you’ll need a computer, or access to one, to complete the book. The computer should be reasonably modern, but it doesn’t need to be overpowered. Any modern processor (from about 2010 onwards) and 4 gigabytes of RAM will suffice, and you can probably run almost all of the code on a slower system too.
The exception here is with the final two chapters. In these chapters, I step through using Amazon Web Services (AWS) for running the code. This will probably cost you some money, but the advantage is less system setup than running the code locally. If you don't want to pay for those services, the tools used can all be set up on a local computer, but you will definitely need a modern system to run them: a processor built in 2012 or later and more than 4 GB of RAM are necessary.
I recommend the Ubuntu operating system, but the code should work well on Windows, Macs, or any other Linux variant. You may need to consult the documentation for your system to get some things installed though.
In this book, I use pip, a command-line tool for installing Python libraries, to install packages. Another option is to use Anaconda, which can be found online here: http://continuum.io/downloads
I have also tested all the code using Python 3. Most of the code examples work on Python 2 with no changes. If you run into any problems and can't get around them, send an email and we can offer a solution.
This book is for programmers who want to get started in data mining in an application-focused manner.
If you haven't programmed before, I strongly recommend that you learn at least the basics before you get started. This book doesn't teach programming, nor does it spend much time explaining how the instructions are actually typed out in code. That said, once you have the basics down, you should be able to come back to this book fairly quickly – there is no need to be an expert programmer first!
I highly recommend that you have some Python programming experience. If you don't, feel free to jump in, but you might want to look at some Python code first, possibly focusing on tutorials that use the Jupyter Notebook. Writing programs in the Jupyter Notebook works a little differently from other methods, such as writing a Java program in a fully-fledged IDE.
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Data-Mining-with-Python-Second-Edition. The benefit of the GitHub repository is that any issues with the code, including problems relating to software version changes, will be tracked there, and the code will include changes from readers around the world. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out! To avoid indentation issues, please use the code bundle to run the code in your IDE instead of copying it directly from the PDF.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
We are collecting information about our world on a scale that has never been seen before in the history of humanity. Along with this trend, we now place more importance on the use of this information in everyday life. We expect our computers to translate web pages into other languages, predict the weather with high accuracy, suggest books we would like, and diagnose our health issues. These expectations will grow in the future, both in application breadth and efficacy. Data mining is a methodology that we can employ to train computers to make decisions with data, and it forms the backbone of many of today's high-tech systems.
The Python programming language is growing fast in popularity, for good reason. It gives the programmer flexibility, it has many modules to perform different tasks, and Python code is usually more readable and concise than code in other languages. There is a large and active community of researchers, practitioners, and beginners using Python for data mining.
In this chapter, we will introduce data mining with Python. We will cover the following topics:
What is data mining and where can we use it?
Setting up a Python-based environment to perform data mining
An example of affinity analysis, recommending products based on purchasing habits
An example of a classic classification problem, predicting a plant's species based on its measurements
Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.
Most data mining applications work with the same high-level view, where a model learns from some data and is applied to other data, although the details often change quite considerably.
Data mining applications involve creating datasets and tuning the algorithm, as explained in the following steps:
We start our data mining process by creating a dataset, describing an aspect of the real world. Datasets comprise the following two aspects:

Samples: the objects in the real world, such as the people in the example that follows
Features: descriptions or measurements of those samples, such as their heights
The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.
As a simple example, we may wish the computer to be able to categorize people as short or tall. We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall:
Person | Height | Short or tall?
1 | 155cm | Short
2 | 165cm | Short
3 | 175cm | Tall
4 | 185cm | Tall
As explained above, the next step involves tuning the parameters of our algorithm. As a simple algorithm: if the height is more than x, the person is tall; otherwise, they are short. Our training algorithm will then look at the data and decide on a good value for x. For the preceding data, a reasonable value for this threshold would be 170 cm. A person taller than 170 cm is considered tall by the algorithm; anyone else is considered short by this measure. This then lets our algorithm classify new data, such as a person with a height of 167 cm, even though we may never have seen a person with those measurements before.
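To make this tuning step concrete, the following is a minimal sketch of such a threshold classifier; the data mirrors the preceding table, and the variable names are illustrative rather than taken from the book's code bundle:

# A minimal sketch of tuning the threshold x from training data.
# The (height, label) pairs mirror the preceding table.
dataset = [(155, "short"), (165, "short"), (175, "tall"), (185, "tall")]

# Choose x as the midpoint between the tallest "short" person and the
# shortest "tall" person.
tallest_short = max(h for h, label in dataset if label == "short")
shortest_tall = min(h for h, label in dataset if label == "tall")
x = (tallest_short + shortest_tall) / 2  # 170.0 for this data

def classify(height):
    # Apply the learned rule: taller than x means tall.
    return "tall" if height > x else "short"

print(classify(167))  # prints: short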
In the preceding data, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This feature engineering is a critical problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge or at least some trial and error.
In this book, we will introduce data mining through Python. In some cases, we choose clarity of code and workflows, rather than the most optimized way to perform every task. This clarity sometimes involves skipping some details that can improve the algorithm's speed or effectiveness.
In this section, we will cover installing Python and the environment that we will use for most of the book, the Jupyter Notebook. Furthermore, we will install the NumPy module, which we will use for the first set of examples.
The Python programming language is a fantastic, versatile, and easy-to-use language.
For this book, we will be using Python 3.5, which is available for your system from the Python Organization's website https://www.python.org/downloads/. However, I recommend that you use Anaconda to install Python, which you can download from the official website at https://www.continuum.io/downloads.
In this book, I assume that you have some knowledge of programming and Python itself. You do not need to be an expert with Python to complete this book, although a good level of knowledge will help. I will not be explaining general code structures and syntax in this book, except where it differs from what is considered normal Python coding practice.
If you do not have any experience with programming, I recommend that you pick up the Learning Python book from Packt Publishing, or the book Dive Into Python, available online at www.diveintopython3.net.
The Python organization also maintains two online guides for those new to Python:
For non-programmers who want to learn to program through the Python language:
https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
For programmers who already know how to program, but need to learn Python specifically:
https://wiki.python.org/moin/BeginnersGuide/Programmers

Windows users will need to set an environment variable to use Python from the command line, whereas on other systems it will usually be immediately executable. We set it in the following steps:
First, find where you installed Python 3 on your computer; the default location is C:\Python35.
Next, enter this command into the command line (the cmd program):

set PYTHONPATH=%PYTHONPATH%;C:\Python35
Once you have Python running on your system, you should be able to open a command prompt and run the following code to be sure it is installed correctly.
$ python
Python 3.5.1 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on Linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello, world!")
Hello, world!
>>> exit()
Note that we will be using the dollar sign ($) to denote a command that you type into the terminal (also called a shell, or cmd on Windows). You do not need to type this character (or retype anything that already appears on your screen). Just type in the rest of the line and press Enter.
After you have the above "Hello, world!" example running, exit the program and move on to installing a more advanced environment to run Python code, the Jupyter Notebook.
Jupyter is a platform for Python development that contains some tools and environments for running Python and has more features than the standard interpreter. It contains the powerful Jupyter Notebook, which allows you to write programs in a web browser. It also formats your code, shows output, and allows you to annotate your scripts. It is a great tool for exploring datasets and we will be using it as our main environment for the code in this book.
To install the Jupyter Notebook on your computer, you can type the following into a command line prompt (not into Python):
$ conda install jupyter notebook
You will not need administrator privileges to install this, as Anaconda keeps packages in the user's directory.
With the Jupyter Notebook installed, you can launch it with the following:
$ jupyter notebook
Running this command will do two things. First, it will create a Jupyter Notebook instance - the backend - that will run in the command prompt you just used. Second, it will launch your web browser and connect to this instance, allowing you to create a new notebook. It will look something like the following screenshot (where you need to replace /home/bob with your current working directory):
To stop the Jupyter Notebook from running, open the command prompt that has the instance running (the one you used earlier to run the jupyter notebook command). Then, press Ctrl + C and you will be prompted Shutdown this notebook server (y/[n])?. Type y and press Enter and the Jupyter Notebook will shut down.
The scikit-learn package is a machine learning library, written in Python (but also containing code in other languages). It contains numerous algorithms, datasets, utilities, and frameworks for performing machine learning. Scikit-learn is built upon the scientific Python stack, including libraries such as NumPy and SciPy, for speed. Scikit-learn is fast and scalable in many instances, and useful for everyone from beginners to advanced research users. We will cover more details of scikit-learn in Chapter 2, Classifying with scikit-learn Estimators.
To install scikit-learn, you can use the conda utility that comes with Anaconda, which will also install the NumPy and SciPy libraries if you do not already have them. Open a terminal and enter the following command:
$ conda install scikit-learn
Users of major Linux distributions such as Ubuntu or Red Hat may wish to install the official package from their package manager.
Those wishing to install the latest version by compiling the source, or view more detailed installation instructions, can go to http://scikit-learn.org/stable/install.html and refer the official documentation on installing scikit-learn.
In this section, we jump into our first example. A common use case for data mining is to improve sales by asking a customer who is buying a product whether they would like another, similar product as well. You can perform this analysis through affinity analysis, which is the study of when things occur together; that is, of how they correlate with each other.
To repeat the now-infamous phrase taught in statistics classes, correlation is not causation. This phrase means that the results from affinity analysis cannot establish a cause. In our next example, we perform affinity analysis on product purchases. The results indicate that the products are purchased together, but not that buying one product causes the purchase of the other. The distinction is important, critically so when determining how to use the results to affect a business process, for instance.
Affinity analysis is a type of data mining that gives similarity between samples (objects). This could be the similarity between the following:
Users on a website, to provide varied services or targeted advertising
Items to sell to those users, to provide recommended movies or products
Human genes, to find people that share the same ancestors
We can measure affinity in several ways. For instance, we can record how frequently two products are purchased together. We can also record how often the statement "when a person buys object 1, they also buy object 2" is true. Other ways to measure affinity include computing the similarity between samples, which we will cover in later chapters.
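As a rough sketch of the first two measures (the transaction data below is made up for illustration), both can be computed directly from purchase records:

# Sketch: measuring affinity between two products from transaction
# records. Each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cheese"},
    {"bread"},
    {"milk", "cheese"},
    {"bread", "milk"},
]

# How frequently are bread and milk purchased together?
both = sum(1 for t in transactions if {"bread", "milk"} <= t)
print("Bought together in", both, "of", len(transactions), "transactions")

# How often is the statement "when a person buys bread, they also
# buy milk" true?
bread_buyers = sum(1 for t in transactions if "bread" in t)
print("The statement holds for", both, "of", bread_buyers, "bread buyers")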
One of the issues with moving a traditional business online, such as commerce, is that tasks that used to be done by humans need to be automated for the online business to scale and compete with existing automated businesses. One example of this is up-selling, or selling an extra item to a customer who is already buying. Automated product recommendations through data mining are one of the driving forces behind the e-commerce revolution, which turns over billions of dollars per year.
In this example, we are going to focus on a basic product recommendation service. We design this based on the following idea: when two items are historically purchased together, they are more likely to be purchased together in the future. This sort of thinking is behind many product recommendation services, in both online and offline businesses.
A very simple algorithm for this type of product recommendation is to find any historical case where a user has bought an item, and to recommend the other items that the historical user bought. In practice, simple algorithms such as this can do well, at least better than choosing random items to recommend. However, they can be improved upon significantly, which is where data mining comes in.
To simplify the coding, we will consider only two items at a time. As an example, people may buy bread and milk at the same time at the supermarket. In this early example, we wish to find simple rules of the form:
If a person buys product X, then they are likely to purchase product Y
More complex rules involving multiple items will not be covered, such as people buying sausages and burgers being more likely to buy tomato sauce.
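As a preview of how such rules can be found and ranked, here is a sketch over a randomly generated purchase matrix; the product names and data are hypothetical, but the counting logic resembles what we will implement with NumPy shortly:

import numpy as np
from itertools import permutations

# Hypothetical 0/1 purchase matrix: each row is one transaction, each
# column records whether that product was bought.
products = ["bread", "milk", "cheese", "apples", "bananas"]
np.random.seed(14)
X = np.random.randint(2, size=(100, len(products)))

# For each ordered pair of products, compute how often the rule
# "if a person buys the premise, they also buy the conclusion" held
# among the transactions containing the premise.
rules = {}
for premise, conclusion in permutations(range(len(products)), 2):
    buys_premise = X[:, premise] == 1
    rule_valid = np.sum(buys_premise & (X[:, conclusion] == 1))
    rules[(premise, conclusion)] = rule_valid / np.sum(buys_premise)

# Report the strongest rule found.
premise, conclusion = max(rules, key=rules.get)
print("If a person buys {0}, they are likely to buy {1} ({2:.2f})".format(
    products[premise], products[conclusion], rules[(premise, conclusion)]))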
In the affinity analysis example, we looked for correlations between different variables in our dataset. In classification, we have a single variable that we are interested in and that we call the class (also called the target). In the earlier example, if we were interested in how to make people buy more apples, we would explore the rules related to apples and use those to inform our decisions.
Classification is one of the largest uses of data mining, both in practical use and in research. As before, we have a set of samples that represents objects or things we are interested in classifying. We also have a new array, the class values. These class values give us a categorization of the samples. Some examples are as follows:
Determining the species of a plant by looking at its measurements. The class value here would be: Which species is this?
Determining if an image contains a dog. The class would be: Is there a dog in this image?
Determining if a patient has cancer, based on the results of a specific test. The class would be: Does this patient have cancer?
While many of the preceding examples are binary (yes/no) questions, they do not have to be, as in the case of the plant species classification in this section.
OneR is a simple algorithm that predicts the class of a sample by finding the most frequent class among the training samples with the same feature value. OneR is shorthand for One Rule, indicating that we use only a single rule for this classification, choosing the feature with the best performance. While some of the later algorithms are significantly more complex, this simple algorithm has been shown to perform well on some real-world datasets.
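As a preview, here is a minimal sketch of the OneR idea for a single discrete feature; the weather data is made up, and the full implementation later in this chapter works over several features, picking the one with the lowest error:

from collections import Counter

# One discrete feature and the matching class values (made-up data).
feature_values = ["sunny", "sunny", "rainy", "rainy", "sunny"]
classes = ["no", "no", "yes", "yes", "yes"]

# For each feature value, predict the most frequent class seen with it.
rule = {}
for value in set(feature_values):
    matching = [c for v, c in zip(feature_values, classes) if v == value]
    rule[value] = Counter(matching).most_common(1)[0][0]

# The error of this rule is how often its prediction is wrong.
predictions = [rule[v] for v in feature_values]
error = sum(p != c for p, c in zip(predictions, classes)) / len(classes)
print(rule, "error: {0:.2f}".format(error))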