IBM SPSS Modeler Essentials - Jesus Salcedo - E-Book

IBM SPSS Modeler Essentials E-Book

Jesus Salcedo

Description

Get to grips with the fundamentals of data mining and predictive analytics with IBM SPSS Modeler

About This Book

  • Get up and running with IBM SPSS Modeler without going into too much depth
  • Identify interesting relationships within your data and build effective data mining and predictive analytics solutions
  • A quick, easy-to-follow guide to give you a fundamental understanding of SPSS Modeler, written by the best in the business

Who This Book Is For

This book is ideal for those who are new to SPSS Modeler and want to start using it as quickly as possible, without going into too much detail. An understanding of basic data mining concepts will be helpful to get the best out of the book.

What You Will Learn

  • Understand the basics of data mining and familiarize yourself with Modeler's visual programming interface
  • Import data into Modeler and learn how to properly declare metadata
  • Obtain summary statistics and audit the quality of your data
  • Prepare data for modeling by selecting and sorting cases, identifying and removing duplicates, combining data files, and modifying and creating fields
  • Assess simple relationships using various statistical and graphing techniques
  • Get an overview of the different types of models available in Modeler
  • Build a decision tree model and assess its results
  • Score new data and export predictions

In Detail

IBM SPSS Modeler allows you to quickly and efficiently use predictive analytics and gain insights from your data. With almost 25 years of history, Modeler is the most established and comprehensive data mining workbench available. Since it is popular in corporate settings, widely available in university settings, and highly compatible with all the latest technologies, it is the perfect way to start your data science and machine learning journey.

This book takes a detailed, step-by-step approach to introducing data mining using the de facto standard process, CRISP-DM, and Modeler's easy-to-learn “visual programming” style. You will learn how to read data into Modeler, assess data quality, prepare your data for modeling, find interesting patterns and relationships within your data, and export your predictions. Using a single case study throughout, this intentionally short and focused book sticks to the essentials. The authors have drawn upon their decades of teaching thousands of new users to choose those aspects of Modeler that you should learn first, so that you get off to a good start using proven best practices.

This book provides an overview of various popular data modeling techniques and presents a detailed case study of how to use CHAID, a decision tree model. Assessing a model's performance is as important as building it; this book will also show you how to do that. Finally, you will see how you can score new data and export your predictions. By the end of this book, you will have a firm understanding of the basics of data mining and how to effectively use Modeler to build predictive models.

Style and approach

This book empowers users to build practical and accurate predictive models quickly and intuitively. With the support of advanced analytics, users can discover hidden patterns and trends. This helps them understand the factors that influence those patterns and trends, enabling them to take advantage of business opportunities and mitigate risks.

Page count: 227

Year of publication: 2017




IBM SPSS Modeler Essentials

Effective techniques for building powerful data mining and predictive analytics solutions

Jesus Salcedo
Keith McCormick

BIRMINGHAM - MUMBAI

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2017

Production reference: 1211217

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78829-111-8

www.packtpub.com

Credits

Authors: Jesus Salcedo, Keith McCormick
Reviewer: Bowen Wei
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Chandan Kumar
Content Development Editor: Trusha Shriyan
Technical Editor: Jovita Alva
Copy Editor: Safis Editing
Project Coordinator: Kinjal Bari
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Aparna Bhagat

About the Authors

Jesus Salcedo has a PhD in psychometrics from Fordham University. He is an independent statistical consultant and has been using SPSS products for over 20 years. He is a former SPSS curriculum team lead and senior education specialist who has written numerous SPSS training courses and has trained thousands of users.

Keith McCormick is an independent data miner, trainer, conference speaker, and author. He has been using statistical software tools since the early 90s, and has been conducting training since 1997. He has been data mining and using IBM SPSS Modeler since its arrival in North America in the late 90s. He is also an expert in other packages of IBM's SPSS software suite, including IBM SPSS Statistics, AMOS, and text mining. He blogs and reviews related books as well.

About the Reviewer

Bowen Wei is a senior software engineer and data scientist at IBM IoT Analytics.

He focuses on predictive asset maintenance, visual inspection, and deep learning. He leads the data science team and the IoT analytic solution team. He is currently doing research in anomaly detection and deep learning image classification. He joined the IBM SPSS Modeler development team in 2010 and transferred to the analytic solution team in 2013.


To my beautiful Blue, I could not have done this without you. Thank you.
-Jesus

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Introduction to Data Mining and Predictive Analytics

Introduction to data mining

CRISP-DM overview

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

Learning more about CRISP-DM

The data mining process (as a case study)

Summary

The Basics of Using IBM SPSS Modeler

Introducing the Modeler graphic user interface

Stream canvas

Palettes

Modeler menus

Toolbar

Manager tabs

Project window

Building streams

Mouse buttons

Adding nodes

Editing nodes

Deleting nodes

Building a stream

Connecting nodes

Deleting connections

Modeler stream rules

Help options

Help menu

Dialog help

Summary

Importing Data into Modeler

Data structure

Var. File source node

Var. File source node File tab

Var. File source node Data tab

Var. File source node Filter tab

Var. File source node Types tab

Var. File source node Annotations tab

Viewing data

Excel source node

Database source node

Levels of measurement and roles

Summary

Data Quality and Exploration

Data Audit node options

Data Audit node results

The Quality tab

Missing data

Ways to address missing data

Defining missing values in the Type node

Imputing missing values with the Data Audit node

Summary

Cleaning and Selecting Data

Selecting cases

Expression Builder

Sorting cases

Identifying and removing duplicate cases

Reclassifying categorical values

Summary

Combining Data Files

Combining data files with the Append node

Removing fields with the Filter node

Combining data files with the Merge node

The Filter tab

The Optimization tab

Summary

Deriving New Fields

Derive – Formula

Derive – Flag

Derive – Nominal

Derive – Conditional

Summary

Looking for Relationships Between Fields

Relationships between categorical fields

Distribution node

Matrix node

Relationships between categorical and continuous fields

Histogram node

Means node

Relationships between continuous fields

Plot node

Statistics node

Summary

Introduction to Modeling Options in IBM SPSS Modeler

Classification

Categorical targets

Numeric targets

The Auto nodes

Data reduction modeling nodes

Association

Segmentation

Choosing between models

Summary

Decision Tree Models

Decision tree theory

CHAID theory

How CHAID processes different types of input variables

Stopping rules

Building a CHAID Model

Partition node

Overfitting

CHAID dialog options

CHAID results

Summary

Model Assessment and Scoring

Contrasting model assessment with the Evaluation phase

Model assessment using the Analysis node

Modifying CHAID settings

Model comparison using the Analysis node

Model assessment and comparison using the Evaluation node

Scoring new data

Exporting predictions

Summary

Preface

We are proud to present this intentionally short book on the essentials of using IBM SPSS Modeler. Data science and predictive analytics are hot topics right now, and this book might be perceived as being inspired by these new and exciting trends. While we certainly hope to attract a variety of readers, including young practitioners who are new to the field, the contents of this book have in fact been shaped by a variety of forces that have been unfolding over a period of approximately 25 years.

In 1992, Colin Shearer and his colleagues, then at ISL, were finding, as Colin himself described it, that data mining projects involved a lot of hard work, and that most of that work was boring. Specifically, to get to the rewarding tasks of finding patterns using the modeling algorithms you had to do a lot of repetitive preparatory work. It was this observation—that virtually all data mining projects share some of the same routine operations—that gave birth to the idea of the first data mining workbench, Clementine (now called IBM SPSS Modeler). The software was designed to make the repetitive tasks as quick and easy as possible. It is that same observation that is at the heart of this book. We have carefully chosen those tasks that apply to nearly all Modeler projects. For that reason, this book is decidedly not encyclopedic, and we sincerely hope that you can outgrow this book in short order and can then move on to more advanced features of Modeler and explore its powerful collection of features.

Another inspiration for this book is the history of Clementine documentation and training from the early 1990s to the present. Given the motivation behind the software, early documentation often focused on short, simple examples that could be carefully followed and then imitated in real-world examples, even though the real-world applications were always much more complex. Some of the earliest examples of this were the original Clementine Application Templates (ISL CATs) from the 1990s, which have evolved so much as to be unrecognizable.

The two of us first encountered Modeler as members of the SPSS community in the period between SPSS's acquisition of ISL (1998) and IBM's acquisition of SPSS (2009). We were both extensively involved in Modeler training for SPSS. Jesus was the training curriculum lead for IBM SPSS at one point after the acquisition. It soon became clear that training in Modeler was going to evolve after the acquisition and more and more entities were going to be involved in training. Some years later, we found ourselves working together at an IBM partner and built a complete SPSS Statistics and SPSS Modeler curriculum for that company. We have spent hundreds of hours discussing Modeler training and thousands of hours conducting Modeler training. We are passionate about how to create the ideal experience for new users of Modeler. We anticipate that the readers of this book will be brand new users engaged in self-study, students in classes that use Modeler, or participants in short courses and seminars such as the ones that we have taught for years.

In 2010, also in response to the changing marketplace after the IBM acquisition, Tom Khabaza (data mining pioneer and one of the earliest members of the ISL/Clementine team) and Keith started a dialog about a possible rookie book about SPSS Modeler. We knew that Modeler might be reaching new audiences. We had spirited discussions and produced a detailed outline, but the project never quite got off the ground. In 2011, without any knowledge of our beginner's guide concept, Packt reached out to Keith and wanted him to recruit others to write a more advanced Modeler book in a cookbook format. At first, Tom and Keith resisted because we thought that a beginner's guide was badly needed and we had an existing plan. However, it all worked out in the end. We combined forces with almost a dozen Modeler experts, including Colin Shearer, who kindly wrote the foreword. Jesus and other experts we knew joined as either co-authors or technical reviewers. The success of the IBM SPSS Modeler Cookbook (2013) demonstrated that more advanced content was also needed.

This book would have been completely different if it had been written before the cookbook. Knowing that the cookbook exists has allowed us to stick to our goal of writing a quick and easy read with only the absolute essentials. It has been designed to dovetail nicely with the cookbook and serve as a kind of prequel. In designing this book, we were well aware that many people who read it might use our IBM SPSS Modeler Essentials Packt video course as a companion. Since we tried to prioritize the absolute essentials in both, they necessarily cover similar ground. However, we chose different case study datasets for each, precisely to support the kind of learning that would come from working through both. We truly believe that they complement each other.

In that spirit, we have chosen a single case study to use throughout the book. It is just complex enough to suit our purposes, but clearly falls short of the complexity of a real-world example. This is a conscious decision. Work through this book. It is designed to be an experience, and not just a read, so follow it step by step from cover to cover. While we hope this book may also be useful to refer to later, we are trying to craft a positive (and easy) first-time experience with Modeler. Also, although we offer a sufficiently complex dataset to show the essentials, we do not attempt to fashion an elaborate scenario to place the dataset into a business context. This is also a conscious decision. We felt that a book on the essentials of Modeler should be much more of a point-and-click book than a theory book. So if you want a book that emphasizes theory over practice, this may not be the best choice to begin your journey. We do rehearse the basic steps behind how modeling works in Modeler, but given the book's length, there is simply no room to discuss all the algorithms and the theory behind them. We spend virtually all of the book's pages on Data Understanding, Data Preparation, Modeling, and Model Assessment, and virtually none on Business Understanding, Business Evaluation, and Deployment. Having said that, we care deeply about helping the reader understand why they are performing each step, and will always place the point-and-click steps in a proper context. That is why we are so carefully selective about how many steps, and which steps, we include in this short book.

IBM SPSS Modeler enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions purely on the insights obtained from your data. It is our hope that you enjoy mining your data with Modeler and that this book serves as your guide to get you started on this journey. We sincerely hope that you enjoy learning from this book as much as we have enjoyed teaching its content.

What this book covers

Chapter 1, Introduction to Data Mining and Predictive Analytics, introduces the notion of data mining and the CRISP-DM process model. You will learn what data mining is, why you would want to use it, and some of the types of questions you could answer with data mining.

Chapter 2, The Basics of Using IBM SPSS Modeler, introduces the Modeler graphic user interface. You will learn where different components of the program are located, how to work with nodes and create streams, and how to use various help options.

Chapter 3, Importing Data into Modeler, introduces the general data structure that is used in Modeler. You will learn how to read and display data, and you will be introduced to the concepts of measurement level and field roles.

Chapter 4, Data Quality and Exploration, focuses on the Data Understanding phase of data mining. We will spend some time exploring our data and assessing its quality. This chapter introduces the Data Audit node, which is used to explore and assess data. You will see this node's options and learn how to look over its results. You will also be introduced to the concept of missing data and will be shown ways to address it.

Chapter 5, Cleaning and Selecting Data, introduces the Data Preparation phase, so we can fix some of the problems that were previously identified during the Data Understanding phase. You will be shown how to select the appropriate cases for analysis, how to sort cases to get a better feel for the data, how to identify and remove duplicate cases, and how to reclassify categorical values to address various types of issues.

Chapter 6, Combining Data Files, continues with the Data Preparation phase of data mining by filtering fields and combining different types of data files.

Chapter 7, Deriving New Fields, introduces the Derive node. The Derive node can perform different types of calculations so that users can extract more information from the data. These additional fields can then provide insights that may not have been apparent. In this chapter, you will learn that the Derive node can create fields as formulas, flags, nominals, or conditionals.

Chapter 8, Looking for Relationships between Fields, focuses on discovering simple relationships between an outcome variable and a predictor variable. You will learn how to use several statistical and graphing nodes to determine which fields are related to each other. Specifically, you will learn to use the Distribution and Matrix nodes to assess the relationship between two categorical variables. You will also learn how to use the Histogram and Means nodes to identify the relationship between categorical and continuous fields. Finally, you will be introduced to the Plot and Statistics nodes to investigate relationships between continuous fields.

Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, introduces the different types of models available in Modeler and then provides an overview of the predictive models. Readers will also be introduced to the Partition node so that they can create Training and Testing datasets.

Chapter 10, Decision Tree Models, introduces readers to the decision tree theory. It then provides an overview of the CHAID model so that readers become familiar with the theory, dialogs, and results of this model.

Chapter 11, Model Assessment and Scoring, speaks about assessing the results once a model has been built. This chapter discusses different ways of assessing the results of a model. Readers will also learn how to score new data and how to export these predictions.

What you need for this book

This book introduces students to the steps of data analysis. Students do not need to be experienced in analyzing data; however, an introductory statistics or data mining course would be helpful, since this book emphasizes the point-and-click operations in Modeler rather than statistical or data mining theory. We will carefully cover theory as needed to help you understand why we are performing each of the steps in the case study, so you can safely start with this book as your very first book. However, a single case study will not provide a complete theoretical context for data mining, and we will only use a single modeling algorithm in any detail.

Software demonstrations will be performed on IBM SPSS Modeler; thus, having access to Modeler is critical to enable you to follow along with step-by-step instructions. While we recognize that you might make a first pass at this content away from your computer, you should try each and every step in the book in Modeler. We have carefully narrowed the material down to the essentials. You should find that every step will serve you well when you apply what you've learned to your own data and your own situation. Since we've kept it to the basics, you should have no problem completing the entire book during the time period of a trial license of Modeler, if you do not have permanent access to Modeler. If you encounter this material in a university setting, you may be eligible for a student version.

If you don't have Modeler yet, you might want to consider watching the Packt IBM SPSS Modeler Essentials Video first, then installing the trial version, and then working through the book step by step. Since the two case studies are different, this will provide excellent reinforcement of the material. You will see virtually every concept twice, but with different datasets.

Who this book is for

This book is ideal for those who are new to SPSS Modeler and want to start using it as quickly as possible, without going into too much detail. An understanding of basic data mining concepts will be helpful to get the best out of the book.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive".

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us, as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.

2. Hover the mouse pointer on the SUPPORT tab at the top.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box.

5. Select the book for which you're looking to download the code files.

6. Choose from the drop-down menu where you purchased this book from.

7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/IBM-SPSS-Modeler-Essentials. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/IBMSPSSModelerEssentials_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Introduction to Data Mining and Predictive Analytics

IBM SPSS Modeler is an interactive data mining workbench composed of multiple tools and technologies to support the entire data mining process. In this first chapter, readers will be introduced to the concepts of data mining; to CRISP-DM, which is a recipe for doing data mining the right way; and to a case study outlining the data mining process. The chapter topics are as follows:

Introduction to data mining

CRISP-DM overview

The data mining process (as a case study)

Introduction to data mining

In this chapter, we will place IBM SPSS Modeler and its use in a broader context. Modeler was developed as a tool to perform data mining. Although the phrase predictive analytics is more common now, when Modeler was first developed in the 1990s, this type of analytics was almost universally called data mining. The use of the phrase data mining has evolved a bit since then to emphasize the exploratory aspect, especially in the context of big data and sometimes with a particular emphasis on the mining of private data that has been collected. This will not be our use of the term. Data mining can be defined in the following way:

Data mining is the search of data, accumulated during the normal course of doing business, in order to find and confirm the existence of previously unknown relationships that can produce positive and verifiable outcomes through the deployment of predictive models when applied to new data.
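To make the definition concrete, here is a minimal Python sketch of its three moves: search historical data for a relationship, verify it on held-out records, and apply it to new data. This is only an illustration under stated assumptions and not the book's method, which uses Modeler's point-and-click nodes. The records, the field names (`contract_type`, `churned`), and the helper functions `fit_rule` and `accuracy` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical historical records accumulated during normal business:
# (contract_type, churned). Values are invented for illustration only.
history = [
    ("monthly", True), ("monthly", True), ("monthly", False), ("monthly", True),
    ("annual", False), ("annual", False), ("annual", True), ("annual", False),
    ("monthly", True), ("annual", False), ("monthly", False), ("annual", False),
]

def fit_rule(records):
    """Learn the majority outcome for each category (a one-field rule)."""
    counts = defaultdict(Counter)
    for category, outcome in records:
        counts[category][outcome] += 1
    return {category: c.most_common(1)[0][0] for category, c in counts.items()}

def accuracy(rule, records):
    """Share of records whose outcome the rule predicts correctly."""
    hits = sum(rule.get(category) == outcome for category, outcome in records)
    return hits / len(records)

# Find the pattern on one part of the data and verify it on the rest.
training, testing = history[:8], history[8:]
rule = fit_rule(training)
test_accuracy = accuracy(rule, testing)

# Deployment: apply the verified rule to new cases with unknown outcomes.
new_cases = ["monthly", "annual"]
predictions = [rule[case] for case in new_cases]

print(rule, test_accuracy, predictions)
```

The held-out `testing` records stand in for the "verifiable" part of the definition: a pattern that only holds on the data it was found in would not count as a useful result.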

Several points are worth emphasizing:

The data is not new

The data that can solve the problem was not collected solely to perform data mining

The data miner is not testing known relationships (neither hypotheses nor hunches) against the data

The patterns must be verifiable

The resulting models must be capable of something useful