
Data Mining and Business Analytics with R E-Book

Johannes Ledolter


Publisher: John Wiley & Sons
Category: Science and New Technologies
Language: English

Description

Collecting, analyzing, and extracting valuable information from a large amount of data requires easily accessible, robust, computational and analytical tools. Data Mining and Business Analytics with R utilizes the open source software R for the analysis, exploration, and simplification of large high-dimensional data sets. As a result, readers are provided with the needed guidance to model and interpret complicated data and become adept at building powerful models for prediction and classification.

Highlighting both underlying concepts and practical computational skills, Data Mining and Business Analytics with R begins with coverage of standard linear regression and the importance of parsimony in statistical modeling. The book includes important topics such as penalty-based variable selection (LASSO); logistic regression; regression and classification trees; clustering; principal components and partial least squares; and the analysis of text and network data. In addition, the book presents:

A thorough discussion and extensive demonstration of the theory behind the most useful data mining tools
Illustrations of how to use the outlined concepts in real-world situations
Readily available additional data sets and related R code allowing readers to apply their own analyses to the discussed materials
Numerous exercises to help readers with computing skills and deepen their understanding of the material

Data Mining and Business Analytics with R is an excellent graduate-level textbook for courses on data mining and business analytics. The book is also a valuable reference for practitioners who collect and analyze data in the fields of finance, operations management, marketing, and the information sciences.

Details


Pages: 467

Year of publication: 2013


Sample

Table of Contents

Title Page

Preface

Acknowledgments

Chapter 1: Introduction

Reference

Chapter 2: Processing the Information and Getting to Know Your Data

2.1 Example 1: 2006 Birth Data

2.2 Example 2: Alumni Donations

2.3 Example 3: Orange Juice

References

Chapter 3: Standard Linear Regression

3.1 Estimation in R

3.2 Example 1: Fuel Efficiency of Automobiles

3.3 Example 2: Toyota Used-Car Prices

Appendix 3.A The Effects of Model Overfitting on the Average Mean Square Error of the Regression Prediction

References

Chapter 4: Local Polynomial Regression: a Nonparametric Regression Approach

4.1 Model Selection

4.2 Application to Density Estimation and the Smoothing of Histograms

4.3 Extension to the Multiple Regression Model

4.4 Examples and Software

References

Chapter 5: Importance of Parsimony in Statistical Modeling

5.1 How Do We Guard Against False Discovery

References

Chapter 6: Penalty-Based Variable Selection in Regression Models with Many Parameters (LASSO)

6.1 Example 1: Prostate Cancer

6.2 Example 2: Orange Juice

References

Chapter 7: Logistic Regression

7.1 Building a Linear Model for Binary Response Data

7.2 Interpretation of the Regression Coefficients in a Logistic Regression Model

7.3 Statistical Inference

7.4 Classification of New Cases

7.5 Estimation in R

7.6 Example 1: Death Penalty Data

7.7 Example 2: Delayed Airplanes

7.8 Example 3: Loan Acceptance

7.9 Example 4: German Credit Data

References

Chapter 8: Binary Classification, Probabilities, and Evaluating Classification Performance

8.1 Binary Classification

8.2 Using Probabilities to Make Decisions

8.3 Sensitivity and Specificity

8.4 Example: German Credit Data

Chapter 9: Classification Using a Nearest Neighbor Analysis

9.1 The k-Nearest Neighbor Algorithm

9.2 Example 1: Forensic Glass

9.3 Example 2: German Credit Data

Reference

Chapter 10: The Naïve Bayesian Analysis: a Model for Predicting a Categorical Response from Mostly Categorical Predictor Variables

10.1 Example: Delayed Airplanes

Reference

Chapter 11: Multinomial Logistic Regression

11.1 Computer Software

11.2 Example 1: Forensic Glass

11.3 Example 2: Forensic Glass Revisited

Appendix 11.A Specification of a Simple Triplet Matrix

References

Chapter 12: More on Classification and a Discussion on Discriminant Analysis

12.1 Fisher's Linear Discriminant Function

12.2 Example 1: German Credit Data

12.3 Example 2: Fisher Iris Data

12.4 Example 3: Forensic Glass Data

12.5 Example 4: MBA Admission Data

Reference

Chapter 13: Decision Trees

13.1 Example 1: Prostate Cancer

13.2 Example 2: Motorcycle Acceleration

13.3 Example 3: Fisher Iris Data Revisited

Chapter 14: Further Discussion on Regression and Classification Trees, Computer Software, and Other Useful Classification Methods

14.1 R Packages for Tree Construction

14.2 Chi-Square Automatic Interaction Detection (CHAID)

14.3 Ensemble Methods: Bagging, Boosting, and Random Forests

14.4 Support Vector Machines (SVM)

14.5 Neural Networks

14.6 The R Package Rattle: A Useful Graphical User Interface for Data Mining

References

Chapter 15: Clustering

15.1 k-Means Clustering

15.2 Another Way to Look at Clustering: Applying the Expectation-Maximization (EM) Algorithm to Mixtures of Normal Distributions

15.3 Hierarchical Clustering Procedures

References

Chapter 16: Market Basket Analysis: Association Rules and Lift

16.1 Example 1: Online Radio

16.2 Example 2: Predicting Income

References

Chapter 17: Dimension Reduction: Factor Models and Principal Components

17.1 Example 1: European Protein Consumption

17.2 Example 2: Monthly US Unemployment Rates

Chapter 18: Reducing the Dimension in Regressions with Multicollinear Inputs: Principal Components Regression and Partial Least Squares

18.1 Three Examples

References

Chapter 19: Text as Data: Text Mining and Sentiment Analysis

19.1 Inverse Multinomial Logistic Regression

19.2 Example 1: Restaurant Reviews

19.3 Example 2: Political Sentiment

References

Chapter 20: Network Data

20.1 Example 1: Marriage and Power in Fifteenth Century Florence

20.2 Example 2: Connections in a Friendship Network

References

Appendix A: Exercises

Appendix B: References

Index

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada


Library of Congress Cataloging-in-Publication Data:

Ledolter, Johannes.

Data mining and business analytics with R / Johannes Ledolter, University of Iowa.

pages cm

Includes bibliographical references and index.

ISBN 978-1-118-44714-7 (cloth)

1. Data mining. 2. R (Computer program language) 3. Commercial statistics. I. Title.

QA76.9.D343L44 2013

006.3′12–dc23

2013000330

Preface

This book is about useful methods for data mining and business analytics. It is written for readers who want to apply these methods so that they can learn about their processes and solve their problems. My objective is to provide a thorough discussion of the most useful data-mining tools that goes beyond the typical “black box” description, and to show why these tools work.

Powerful, accurate, and flexible computing software is needed for data mining, and Excel is of little use. Although excellent data-mining software is offered by various commercial vendors, proprietary products are usually expensive. In this text, I use the R Statistical Software, which is powerful and free. But the use of R comes with start-up costs. R requires the user to write out instructions, and the writing of program instructions will be unfamiliar to most spreadsheet users. This is why I provide R sample programs in the text and on the webpage that is associated with this book. These sample programs should smooth the transition to this very general and powerful computer environment and help keep the start-up costs to using R small.

The text combines explanations of the statistical foundation of data mining with useful software so that the tools can be readily applied and put to use. There are certainly better books that give a deeper description of the methods, and there are also numerous texts that give a more complete guide to computing with R. This book tries to strike a compromise that does justice to both theory and practice, at a level that can be understood by the MBA student interested in quantitative methods. This book can be used in courses on data mining in quantitative MBA programs and in upper-level undergraduate and graduate programs that deal with the analysis and interpretation of large data sets. Students in business, the social and natural sciences, medicine, and engineering should benefit from this book. The majority of the topics can be covered in a one-semester course. But not every covered topic will be useful for all audiences, and for some audiences, the coverage of certain topics will be either too advanced or too basic. By omitting some topics and by expanding on others, one can make this book work for many different audiences.

Certain data-mining applications require an enormous amount of effort to just collect the relevant information, and in such cases, the data preparation takes a lot more time than the eventual modeling. In other applications, the data collection effort is minimal, but often one has to worry about the efficient storage and retrieval of high volume information (i.e., the “data warehousing”). Although it is very important to know how to acquire, store, merge, and best arrange the information, this text does not cover these aspects very deeply. This book concentrates on the modeling aspects of data mining.

The data sets and the R-code for all examples can be found on the webpage that accompanies this book (http://www.biz.uiowa.edu/faculty/jledolter/DataMining). Supplementary material for this book can also be found by entering ISBN 9781118447147 at http://booksupport.wiley.com. You can copy and paste the code into your own R session and rerun all analyses. You can experiment with the software by making changes and additions, and you can adapt the R templates to the analysis of your own data sets. Exercises and several large practice data sets are given at the end of this book. The exercises will help instructors when assigning homework problems, and they will give the reader the opportunity to practice the techniques that are discussed in this book. Instructions on how to best use these data sets are given in Appendix A.

This is a first edition. Although I have tried to be very careful in my writing and in the analyses of the illustrative data sets, I am certain that much can be improved. I would very much appreciate any feedback you may have, and I encourage you to write to me at [email protected]. Corrections and comments will be posted on the book's webpage.

Acknowledgments

I got interested in developing materials for an MBA-level text on Data Mining when I visited the University of Chicago Booth School of Business in 2011. The outstanding University of Chicago lecture materials for the course on Data Mining (BUS41201) taught by Professor Matt Taddy provided the spark to put this text together, and several examples and R-templates from Professor Taddy's notes have influenced my presentation. Chapter 19 on the analysis of text data draws heavily on his recent research. Professor Taddy's contributions are most gratefully acknowledged.

Writing a text is a time-consuming task. I could not have done this without the support and constant encouragement of my wife, Lea Vandervelde. Lea, a law professor at the University of Iowa, conducts historical research on the freedom suits of Missouri slaves. She knows first-hand how important and difficult it is to construct data sets for the mining of text data.

Chapter 1: Introduction

Today's statistics applications involve enormous data sets: many cases (rows of a data spreadsheet, with a row representing the information on a studied case) and many variables (columns of the spreadsheet, with a column representing the outcomes on a certain characteristic across the studied cases). A case may be a certain item such as a purchase transaction, or a subject such as a customer or a country, or an object such as a car or a manufactured product. The information that we collect varies across the cases, and the explanation of this variability is central to the tools that we study in this book. Many variables are typically collected on each case, but usually only a few of them turn out to be useful. The majority of the collected variables may be irrelevant and represent just noise. It is important to find those variables that matter and those that do not.

Here are a few types of data sets that one encounters in data mining. In marketing applications, we observe the purchase decisions, made over many time periods, of thousands of individuals who select among several products under a variety of price and advertising conditions. Social network data contains information on the presence of links among thousands or millions of subjects; in addition, such data includes demographic characteristics of the subjects (such as gender, age, income, race, and education) that may have an effect on whether subjects are “linked” or not. Google has extensive information on 100 million users, and Facebook has data on even more. The recommender systems developed by firms such as Netflix and Amazon use available demographic information and the detailed purchase/rental histories from millions of customers. Medical data sets contain the outcomes of thousands of performed procedures, and include information on their characteristics such as the type of procedure and its outcome, and the location where and the time when the procedure has been performed.

While traditional statistics applications focus on relatively small data sets, data mining involves very large and sometimes enormous quantities of information. One talks about megabytes and terabytes of information. A megabyte represents a million bytes, with a byte being the number of bits needed to encode a single character of text. A typical English book in plain text format (500 pages with 2000 characters per page) amounts to about 1 MB. A terabyte is a million megabytes, and an exabyte is a million terabytes.
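The unit arithmetic in this paragraph can be checked with a few lines of R. This is only an illustrative sketch; it assumes the decimal convention used in the text and one byte per character (as in a plain ASCII-like encoding), and the variable names are made up for this illustration.

```r
# Storage units in bytes (decimal convention, as in the text)
megabyte <- 1e6
terabyte <- 1e6 * megabyte   # a million megabytes = 1e12 bytes
exabyte  <- 1e6 * terabyte   # a million terabytes = 1e18 bytes

# A plain-text book: 500 pages with 2000 characters per page,
# assuming one byte per character
book_bytes <- 500 * 2000 * 1
book_mb    <- book_bytes / megabyte   # size of the book in MB

# Number of such books that would fit into one terabyte
books_per_terabyte <- terabyte / book_bytes

book_mb
books_per_terabyte
```

Under these assumptions, the book occupies 1 MB, and one terabyte holds about a million such books.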

Data mining attempts to extract useful information from such large data sets. Data mining explores and analyzes large quantities of data in order to discover meaningful patterns. The scale of a typical data mining application, with its large number of cases and many variables, exceeds that of a standard statistical investigation. The analysis of millions of cases and thousands of variables also puts pressure on the speed that is needed to accomplish the search and modeling steps of the typical data mining application. This is why researchers refer to data mining as statistics at scale and speed. The large scale (lots of available data) and the requirements on speed (solutions are needed quickly) create a large demand for automation. Data mining uses a combination of pattern-recognition rules, statistical rules, as well as rules drawn from machine learning (an area of computer science).

Data mining has wide applicability, with applications in intelligence and security analysis, genetics, the social and natural sciences, and business. Studying which buyers are more likely to buy, respond to an advertisement, declare bankruptcy, commit fraud, or abandon subscription services is of vital importance to business.

Many data mining problems deal with categorical outcome data (e.g., no/yes outcomes), and this is what makes machine learning methods, which have their origins in the analysis of categorical data, so useful. Statistics, on the other hand, has its origins in the analysis of continuous data. This makes statistics especially useful for correlation-type analyses where one sifts through a large number of correlations to find the largest ones.

The analysis of large data sets requires an efficient way of storing the data so that it can be accessed easily for calculations. Issues of data warehousing and how to best organize the data are certainly very important, but they are not emphasized in this book. The book focuses on the analysis tools and targets their statistical foundation.

Because of the often enormous quantities of data (number of cases/replicates), the role of traditional statistical concepts such as confidence intervals and statistical significance tests is greatly reduced. With large data sets, almost any small difference becomes significant. It is the problem of overfitting models (i.e., using more explanatory variables than are actually needed to predict a certain phenomenon) that becomes of central importance. Parsimonious representations are important as simpler models tend to give more insight into a problem. Large models overfitted on training data sets usually turn out to be extremely poor predictors in new situations as unneeded predictor variables increase the prediction error variance. Furthermore, overparameterized models are of little use if it is difficult to collect data on predictor variables in the future. Methods that help avoid such overfitting are needed, and they are covered in this book. The partitioning of the data into training and evaluation (test) data sets is central to most data mining methods. One must always check whether the relationships found in the training data set will hold up in the future.
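The effect of overfitting can be illustrated with a small simulation. The following is a minimal sketch on made-up data, not an example from this book: the response depends on a single predictor x, and 40 additional predictors (named z1 to z40 here) are pure noise. The sample size, the number of noise predictors, and the seed are arbitrary assumptions.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 200                          # number of cases (assumed)
p_noise <- 40                     # number of irrelevant predictors (assumed)

x <- rnorm(n)
noise <- matrix(rnorm(n * p_noise), n, p_noise)   # predictors unrelated to y
colnames(noise) <- paste0("z", 1:p_noise)
y <- 2 + 3 * x + rnorm(n)         # only x matters
dat <- data.frame(y = y, x = x, noise)

# Partition the data into training and evaluation (test) sets
train_id <- sample(n, n / 2)
train <- dat[train_id, ]
test  <- dat[-train_id, ]

fit_small <- lm(y ~ x, data = train)   # parsimonious model
fit_large <- lm(y ~ ., data = train)   # overparameterized model

# Mean square prediction error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

results <- c(train_small = mse(fit_small, train),
             train_large = mse(fit_large, train),
             test_small  = mse(fit_small, test),
             test_large  = mse(fit_large, test))
round(results, 3)
```

On the training data the larger model always fits at least as well as the smaller one, because the models are nested. On the test data the larger model will typically predict worse, which is the pattern described above.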

Many data mining tools deal with problems for which there is no designated response that one wants to predict. It is common to refer to such analysis as unsupervised learning. Cluster analysis is one example where one uses feature (variable) data on numerous objects to group the objects (i.e., the cases) into a smaller number of groups (also called clusters). Dimension reduction applications are other examples for such type of problems; here one tries to reduce the many features on an object to a manageable few. Association rules also fall into this category of problems; here one studies whether the occurrence of one feature is related to the occurrence of others. Who would not want to know whether the sales of chips are being “lifted” to a higher level by the concurrent sales of beer?
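The notion of "lift" in the chips-and-beer question can be made concrete with a tiny calculation. The sketch below uses ten made-up transactions (the data frame baskets is hypothetical) and only base R; it computes the lift of the rule "beer implies chips" as the ratio of the conditional probability of buying chips given beer to the unconditional probability of buying chips.

```r
# Hypothetical basket data: each row is one transaction (TRUE = item bought)
baskets <- data.frame(
  chips = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  beer  = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

p_chips <- mean(baskets$chips)                 # support of chips: 0.5
p_beer  <- mean(baskets$beer)                  # support of beer: 0.5
p_both  <- mean(baskets$chips & baskets$beer)  # support of both: 0.4

confidence <- p_both / p_beer      # P(chips | beer) = 0.8
lift <- confidence / p_chips       # 0.8 / 0.5 = 1.6

lift
```

A lift above 1 indicates that chips are bought more often in transactions that contain beer than in transactions overall; a lift of 1 would indicate no association.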

Other data mining tools deal with problems for which there is a designated response, such as the volume of sales (a quantitative response) or whether someone buys a product (a categorical response). One refers to such analysis as supervised learning. The predictor variables that help explain (predict) the response can be quantitative (such as the income of the buyer or the price of a product) or categorical (such as the gender and profession of the buyer or the qualitative characteristics of the product such as new or old). Regression methods, regression trees, and nearest neighbor methods are well suited for problems that involve a continuous response. Logistic regression, classification trees, nearest neighbor methods, discriminant analysis (for continuous predictor variables) and naïve Bayes methods (mostly for categorical predictor variables) are well suited for problems that involve a categorical response.
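The distinction between a quantitative and a categorical response can be sketched with R's built-in mtcars data set. This is an illustration chosen here, not one of the book's examples: fuel efficiency (mpg) serves as the quantitative response, and the transmission type (am, coded 0/1) serves as the categorical response. The 0.5 cutoff for classification is an assumption made for the illustration.

```r
data(mtcars)   # built-in data set on 32 automobiles

# Quantitative response: linear regression of mpg on weight and horsepower
fit_reg <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit_reg)

# Categorical (0/1) response: logistic regression of transmission type on weight
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Estimated probabilities of am = 1, and a classification with a 0.5 cutoff
p_hat <- predict(fit_logit, type = "response")
pred_class <- ifelse(p_hat > 0.5, 1, 0)

# Cross-tabulate the observed and the predicted classes
table(observed = mtcars$am, predicted = pred_class)
```

Both fits are evaluated here on the same data that were used for estimation, which is adequate for showing the syntax but not for assessing predictive performance.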

Data mining should be viewed as a process. As with all good statistical analyses, one needs to be clear about the purpose of the analysis. Just to “mine data” without a clear purpose, without an appreciation of the subject area, and without a modeling strategy will usually not be successful. The data mining process involves several interrelated steps:

1. Efficient data storage and data preprocessing steps are very critical to the success of the analysis.

2. One needs to select appropriate response variables and decide on the number of variables that should be investigated.

3. The data needs to be screened for outliers, and missing values need to be addressed (with missing values either omitted or appropriately imputed through one of several available methods).

4. Data sets need to be partitioned into training and evaluation data sets. In very large data sets, which cannot be analyzed easily as a whole, data must be sampled for analysis.

5. Before applying sophisticated models and methods, the data need to be visualized and summarized. It is often said that a picture is worth a thousand words. Basic graphs such as line graphs for time series, bar charts for categorical variables, scatter plots and matrix plots for continuous variables, box plots and histograms (often after stratification on useful covariates), maps for displaying correlation matrices, multidimensional graphs using color, trellis graphs, overlay plots, tree maps for visualizing network data, and geo maps for spatial data are just a few examples of the more useful graphical displays. In constructing good graphs, one needs to be careful about the right scaling, the correct labeling, and issues of stratification and aggregation.

6. Summary of the data involves the typical summary statistics such as mean, percentiles and median, standard deviation, and correlation, as well as more advanced summaries such as principal components.

7. Appropriate methods from the data mining tool bag need to be applied. Depending on the problem, this may involve regression, logistic regression, regression/classification trees, nearest neighbor methods, k-means clustering, and so on.

8. The findings from these models need to be confirmed, typically on an evaluation (test or holdout) data set.

9. Finally, the insights one gains from the analysis need to be implemented. One must act on the findings. This is what W.E. Deming had in mind when he talked about process improvement and his Deming (Shewhart) wheel of “plan, do, check, and act” (Ledolter and Burrill, 1999).
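Steps 3 through 8 of this list can be sketched in a few lines of R. The example below is only an illustration under stated assumptions: it uses R's built-in airquality data set (which contains missing values) rather than one of the book's data sets, omits incomplete cases instead of imputing them, and uses an arbitrary 70/30 split and seed.

```r
data(airquality)    # built-in data set; Ozone and Solar.R have missing values

# Step 3: screen for missing values and omit incomplete cases
n_missing <- colSums(is.na(airquality))
aq <- na.omit(airquality)

# Step 4: partition into training and evaluation (test) data sets
set.seed(123)       # arbitrary seed
train_id <- sample(nrow(aq), round(0.7 * nrow(aq)))
train <- aq[train_id, ]
test  <- aq[-train_id, ]

# Step 5: visualize (the plot is drawn only in an interactive session)
if (interactive()) {
  plot(train$Temp, train$Ozone, xlab = "Temperature", ylab = "Ozone")
}

# Step 6: summarize
summary(train$Ozone)
cor(train$Ozone, train$Temp)

# Step 7: apply a method from the tool bag (here, a linear regression)
fit <- lm(Ozone ~ Temp + Wind + Solar.R, data = train)

# Step 8: confirm the findings on the evaluation data set
rmse_test <- sqrt(mean((test$Ozone - predict(fit, newdata = test))^2))
rmse_test
```

The root mean square error on the holdout cases, rather than the fit on the training cases, is the quantity to report when judging how well the model is likely to predict new data.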

Some data mining applications require an enormous amount of effort to just collect the relevant information. For example, an investigation of pre-Civil War court cases of Missouri slaves seeking their freedom involves tedious study of handwritten court proceedings and Census records, electronic scanning of the records, and the use of character-recognition software to extract the relevant characteristics of the cases and the people involved. The process involves double and triple checking unclear information (such as different spellings, illegible entries, and missing information), selecting the appropriate number of variables, categorizing text information, and deciding on the most appropriate coding of the information. At the end, one will have created a fairly good master list of all available cases and their relevant characteristics. Despite all the diligent work, there will be plenty of missing information, information that is in error, and far more variables and categories than are ultimately needed to tell the story behind the judicial process of gaining freedom.

Data preparation often takes a lot more time than the eventual modeling. The subsequent modeling is usually only a small component of the overall effort; quite often, relatively simple methods and a few well-constructed graphs can tell the whole story. It is the creation of the master list that is the most challenging task. The steps that are involved in the construction of the master list in such problems depend heavily on the subject area, and one can only give rough guidelines on how to proceed. It is also difficult to make this process automatic. Furthermore, even if some of the “data cleaning” steps can be made automatic, the investigator must constantly check and question any adjustments that are being made. Great care, lots of double and triple checking, and much common sense are needed to create a reliable master list. But without a reliable master list, the findings will be suspect, as we know that wrong data usually lead to wrong conclusions. The old saying “garbage in–garbage out” also applies to data mining.

Fortunately, many large business data sets can be created almost automatically. Much of today's business data is collected for transactional purposes, that is, for payment and for shipping. Examples of such data sets are transactions that originate from scanner sales in supermarkets, telephone records that are collected by mobile telephone providers, and sales and rental histories that are collected by companies such as Amazon and Netflix. In all these cases, the data collection effort is minimal, even though companies have to worry about the efficient storage and retrieval of the information (i.e., the “data warehousing”).

Credit card companies collect information on purchases; telecom companies collect information on phone calls such as their timing, length, origin, and destination; retail stores have developed automated ways of collecting information on their sales such as the volume purchased and the price at which products are bought. Supermarkets are now the source of much excellent data on the purchasing behavior of individuals. Electronic scanners keep track of purchases, prices, and the presence of promotions. Loyalty programs of retail chains and frequent-flyer programs make it possible to link the purchases to the individual shopper and his/her demographic characteristics and preferences. Innovative marketing firms combine the customer's purchase decisions with the customer's exposure to different marketing messages. As early as the 1980s, Chicago's IRI (Information Resources Incorporated, now Symphony IRI) contracted with television cable companies to vary the advertisements that were sent to members of their household panels. They knew exactly who was getting which ad and they could track the panel members' purchases at the store. This allowed for a direct way of assessing the effectiveness of marketing interventions; certainly much more direct than the diary-type information that had been collected previously. At present, companies such as Google and Facebook run experiments all the time. They present their members with different ads and they keep track of who is clicking on the advertised products and whether the products are actually being bought.

Internet companies have vast information on customer preferences and they use this for targeted advertising; they use recommender systems to direct their ads to areas that are most profitable. Advertising related products that have a good chance of being bought and “cross-selling” of products become more and more important. Data from loyalty programs, from eBay auction histories, and from digital footprints of users clicking on Internet webpages are now readily available. Google's “Flu tracker” makes use of the webpage clicks to develop a tool for the early detection of influenza outbreaks; Amazon and Netflix, without ever meeting their customers in person, use the customers' previous order histories to develop automatic recommender systems. Credit risk calculations, business sentiment analysis, and brand image analysis are becoming more and more important.

Sports teams use data mining techniques to assemble winning teams; see the success stories of the Boston Red Sox and the Oakland Athletics. Moneyball, a 2011 biographical sports drama film based on Michael Lewis's 2003 book of the same name, is an account of the Oakland Athletics baseball team's 2002 season and their general manager Billy Beane's attempts to assemble a competitive team through data mining and business analytics.

It is not only business applications of data mining that are important; data mining is also important for applications in the sciences. We have enormous databases on drugs and their side effects, and on medical procedures and their complication rates. This information can be mined to learn which drugs work and under which conditions they work best; and which medical procedures lead to complications and for which patients.

Business analytics and data mining deal with collecting and analyzing data for better decision making in business. Managers and business students can gain a competitive advantage through business analytics and data mining. Most tools and methods for data mining discussed in this book have been around for a very long time. But several developments have come together over the past few years, making the present period a perfect time to use these methods for solving business problems.

1. More and more data relevant for data mining applications are now being collected.

2. Data is being warehoused and is now readily available for analysis. Much data from numerous sources has already been integrated, and the data is stored in a format that makes the analysis convenient.

3. Computer storage and computer power are getting cheaper every day, and good software is available to carry out the analysis.

4. Companies are interested in “listening” to their customers and they now believe strongly in customer relationship management. They are interested in holding on to good customers and getting rid of bad ones. They embrace tools and methods that give them this information.

This book discusses the modeling tools and the methods of data mining. We assume that one has constructed the relevant master list of cases and that the data is readily available. Our discussion covers the last 10–20% of effort that is needed to extract and model meaningful information from the raw data. A model is a simplified description of the process that may have generated the data. A model may be a mathematical formula, or a computer program. One must remember, however, that no model is perfect, and that all models are merely approximations. But some of these approximations will turn out to be useful and lead to insights. One needs to become a critical user of models. If a model looks too good to be true, then it generally is. Models need to be checked, and we emphasized earlier that models should not be evaluated on the data that had been used to build them. Models are “fine-tuned” to the data of the training set, and it is not obvious whether this good performance carries over to other data sets.

In this book, we use the R Statistical Software (Version 2.15 as of June 2012). It is powerful and free. One may search for the software on the web and download the system. R is similar to Matlab and requires the user to write out simple instructions. The writing of (program) instructions will be unfamiliar to a spreadsheet user, and there will be startup costs to using R. However, the R sample programs in this book and their listing on the book's webpage should help with the transition to this very general and powerful computer environment.

Reference

Ledolter, J. and Burrill, C.: Statistical Quality Control: Strategies and Tools for Continual Improvement. New York: John Wiley & Sons, Inc., 1999.

Chapter 2: Processing the Information and Getting to Know Your Data

In this chapter we analyze three data sets and illustrate the steps that are needed for preprocessing the data. We consider (i) the 2006 birth data that is used in the book R in a Nutshell: A Desktop Quick Reference (Adler, 2009), (ii) data on the contributions to a Midwestern private college (Ledolter and Swersey, 2007), and (iii) the orange juice data set taken from P. Rossi's bayesm package for R that was used earlier in Montgomery (1987). The three data sets are of suitable size (427,323 records and 13 variables in the 2006 birth data set; 1230 records and 11 variables in the contribution data set; and 28,947 records and 17 variables in the orange juice data set). The data sets include both continuous and categorical variables, have missing observations, and require preprocessing steps before they can be subjected to the appropriate statistical analysis and modeling. We use these data sets to illustrate how to summarize the available information and how to obtain useful graphical displays. The initial arrangement of the data is often not very convenient for the analysis, and the information has to be rearranged and preprocessed. We show how to do this within R.

All data sets and the R programs for all examples in this book are listed on the webpage that accompanies this book (http://www.biz.uiowa.edu/faculty/jledolter/DataMining). I encourage readers to copy and paste the R programs into their own R sessions and check the results. Having such templates available for the analysis helps speed up the learning curve for R. It is much easier to learn from a sample program than to piece together the R code from first principles. It is the author's experience that even novices catch on quite fast. It may happen that at some time in the future certain R functions and packages become obsolete and are no longer available. Readers should then look for adequate replacements. The R function “help” can be used to get information on new functions and packages.

2.1 Example 1: 2006 Birth Data

We consider the 2006 birth data set that is used in the book R In a Nutshell: A Desktop Quick Reference (Adler, 2009). The data set births2006.smpl consists of 427,323 records and 13 variables, including the day of birth according to the month and the day of week (DOB_MM, DOB_WK), the birth weight of the baby (DBWT) and the weight gain of the mother during pregnancy (WTGAIN), the sex of the baby and its APGAR score at birth (SEX and APGAR5), whether it was a single or multiple birth (DPLURAL), and the estimated gestation age in weeks (ESTGEST). We list below the information for the first five births.

## Install packages from CRAN; use any USA mirror

library(lattice)

library(nutshell)

data(births2006.smpl)

births2006.smpl[1:5,]

        DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5                 DMEDUC
591430       9      1    25       2     NA   F     NA                   NULL
1827276      2      6    28       2     26   M      9     2 years of college
1705673      2      2    18       2     25   F      9                   NULL
3368269     10      5    21       2      6   M      9                   NULL
2990253      7      7    25       1     36   M     10 2 years of high school
        UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
591430       10      99   Vaginal 1 Single 3800
1827276      10      37   Vaginal 1 Single 3625
1705673      14      38   Vaginal 1 Single 3650
3368269      22      38   Vaginal 1 Single 3045
2990253      15      40   Vaginal 1 Single 3827

dim(births2006.smpl)

[1] 427323 13

The following bar chart of the frequencies of births according to the day of week of the birth shows that fewer births take place during the weekend (days 1 = Sunday, 2 = Monday, …, 7 = Saturday of DOB_WK). This may have to do with the fact that many babies are delivered by cesarean section, and that those deliveries are typically scheduled during the week and not on weekends. To follow up on this hypothesis, we obtain the frequencies in the two-way classification of births according to the day of week and the method of delivery. Excluding births of unknown delivery method, we separate the bar charts of the frequencies for the day of week of delivery according to the method of delivery. While it is also true that vaginal births are less frequent on weekends than on weekdays (doctors prefer to work on weekdays), the reduction in the frequencies of scheduled C-section deliveries from weekdays to weekends (about 50%) exceeds the weekday–weekend reduction of vaginal deliveries (about 25–30%).

births.dow=table(births2006.smpl$DOB_WK)

births.dow

    1     2     3     4     5     6     7
40274 62757 69775 70290 70164 68380 45683

barchart(births.dow,ylab="Day of Week",col="black")

dob.dm.tbl=table(WK=births2006.smpl$DOB_WK,
  MM=births2006.smpl$DMETH_REC)

dob.dm.tbl

   MM
WK  C-section Unknown Vaginal
  1      8836      90   31348
  2     20454     272   42031
  3     22921     247   46607
  4     23103     252   46935
  5     22825     258   47081
  6     23233     289   44858
  7     10696     109   34878

dob.dm.tbl=dob.dm.tbl[,-2]

dob.dm.tbl

   MM
WK  C-section Vaginal
  1      8836   31348
  2     20454   42031
  3     22921   46607
  4     23103   46935
  5     22825   47081
  6     23233   44858
  7     10696   34878

trellis.device()

barchart(dob.dm.tbl,ylab="Day of Week")

barchart(dob.dm.tbl,horizontal=FALSE,groups=FALSE,
  xlab="Day of Week",col="black")
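The weekday-to-weekend reductions quoted in the text can be checked directly from the two-way table. The sketch below is one possible way to do this; it retypes the counts from the table shown above, and the object names (tab, weekend, reduction) are our own choices, not part of the original program.

```r
## counts copied from the two-way table shown above (rows = DOB_WK 1..7)
csection <- c(8836, 20454, 22921, 23103, 22825, 23233, 10696)
vaginal  <- c(31348, 42031, 46607, 46935, 47081, 44858, 34878)
tab <- cbind("C-section" = csection, "Vaginal" = vaginal)
rownames(tab) <- 1:7

weekend <- c(1, 7)                       # 1 = Sunday, 7 = Saturday
avg.weekend <- colMeans(tab[weekend, ])  # average daily count on weekends
avg.weekday <- colMeans(tab[-weekend, ]) # average daily count on weekdays

## relative drop in the average daily count from weekdays to weekends
reduction <- 1 - avg.weekend / avg.weekday
round(100 * reduction, 1)
```

With these counts the drop is roughly 57% for C-section deliveries and roughly 27% for vaginal deliveries, in line with the approximate figures given in the text.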

We use lattice (trellis) graphics (and the R package lattice) to condition density histograms on the values of a third variable. The variable for multiple births (single births to births with five offspring (quintuplets) or more) and the method of delivery are our conditioning variables, and we separate histograms of birth weight according to these variables. As expected, birth weight decreases with multiple births, whereas the birth weight is largely unaffected by the method of delivery. Smoothed versions of the histograms, using the lattice command densityplot, are also shown. Because of the very small sample sizes for quintuplet and higher-order births, the density of birth weight for this small group is quite noisy. The dot plot, also part of the lattice package, shows quite clearly that there are only a few observations in that last group, while most other groups have many observations (which makes the dots on the dot plot “run into each other”); for groups with many observations a histogram would be the preferred graphical method.

histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),
  col="black")

histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),
  col="black")

densityplot(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),
  plot.points=FALSE,col="black")

densityplot(~DBWT,groups=DPLURAL,data=births2006.smpl,
  plot.points=FALSE)

dotplot(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),
  plot.points=FALSE,col="black")

Scatter plots (xyplots in the package lattice) are shown for birth weight against weight gain, and the scatter plots are stratified further by multiple births. The last smoothed scatter plot indicates that there is little association between birth weight and weight gain during the course of the pregnancy.

xyplot(DBWT~DOB_WK,data=births2006.smpl,col="black")

xyplot(DBWT~DOB_WK|DPLURAL,data=births2006.smpl,layout=c(1,5),
  col="black")

xyplot(DBWT~WTGAIN,data=births2006.smpl,col="black")

xyplot(DBWT~WTGAIN|DPLURAL,data=births2006.smpl,layout=c(1,5),
  col="black")

smoothScatter(births2006.smpl$WTGAIN,births2006.smpl$DBWT)

We also illustrate box plots of birth weight against the APGAR score and box plots of birth weight against the day of week of delivery. We would not expect much relationship between the birth weight and the day of week of delivery; there is no reason why babies born on weekends should be heavier or lighter than those born during the week. The APGAR score is an indication of the health status of a newborn, with low scores indicating that the newborn experiences difficulties. The box plot of birth weight against the APGAR score shows a strong relationship. Babies of low birth weight often have low APGAR scores as their health is compromised by the low birth weight and its associated complications.

## boxplot is the command for a box plot in the standard graphics package

boxplot(DBWT~APGAR5,data=births2006.smpl,ylab="DBWT",
  xlab="APGAR5")

boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab="DBWT",
  xlab="Day of Week")

## bwplot is the command for a box plot in the lattice graphics package.

## There you need to declare the conditioning variables as factors

bwplot(DBWT~factor(APGAR5)|factor(SEX),data=births2006.smpl,
  xlab="APGAR5")

bwplot(DBWT~factor(DOB_WK),data=births2006.smpl,
  xlab="Day of Week")

We also calculate the average birth weight as a function of multiple births, and we do this for males and females separately. For that we use the tapply function. Note that there are missing observations in the data set, and the option na.rm=TRUE is needed to omit them from the calculation of the mean. The bar plot illustrates graphically how the average birth weight decreases with multiple deliveries. It also illustrates that the average birth weight for males is slightly higher than that for females.

fac=factor(births2006.smpl$DPLURAL)

res=births2006.smpl$DBWT

t4=tapply(res,fac,mean,na.rm=TRUE)

              1 Single                 2 Twin              3 Triplet
              3298.263               2327.478               1677.017
          4 Quadruplet 5 Quintuplet or higher
              1196.105               1142.800

t5=tapply(births2006.smpl$DBWT,INDEX=list(births2006.smpl$DPLURAL,
  births2006.smpl$SEX),FUN=mean,na.rm=TRUE)

                              F        M
1 Single               3242.302 3351.637
2 Twin                 2279.508 2373.819
3 Triplet              1697.822 1655.348
4 Quadruplet           1319.556 1085.000
5 Quintuplet or higher 1007.667 1345.500

barplot(t4,ylab="DBWT")

barplot(t5,beside=TRUE,ylab="DBWT")
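The role of na.rm=TRUE in tapply is easiest to see on a small example. The following sketch uses invented toy values (wt, group, and sex are hypothetical and are not taken from the birth data) to show what happens with and without the option, and how a list of two factors produces a two-way table of means.

```r
## Hypothetical toy data (not the birth data): weights with one missing value
wt    <- c(3300, 3400, NA, 2300, 2400, 1700)
group <- factor(c("Single", "Single", "Single", "Twin", "Twin", "Triplet"),
                levels = c("Single", "Twin", "Triplet"))
sex   <- factor(c("F", "M", "F", "F", "M", "M"))

## without na.rm the mean of any group containing a missing value is NA
m0 <- tapply(wt, group, mean)
m0

## with na.rm=TRUE missing values are dropped before averaging
m1 <- tapply(wt, group, mean, na.rm = TRUE)
m1

## two grouping factors give a two-way table of means;
## combinations without any observation are reported as NA
m2 <- tapply(wt, INDEX = list(group, sex), FUN = mean, na.rm = TRUE)
m2
```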

Finally, we illustrate the levelplot and the contourplot of the R package lattice. For these plots we first create a cross-classification of weight gain and estimated gestation period by dividing the two continuous variables into 11 nonoverlapping groups. For each of the resulting groups, we compute the average birth weight. An earlier frequency distribution table of estimated gestation period indicates that “99” is used as the code for “unknown”. For the subsequent calculations, we omit all records with unknown gestation period (i.e., value 99). The graphs show that the birth weight increases with the estimated gestation period, but that birth weight is little affected by the weight gain. Note that the contour lines are essentially horizontal and that their associated values increase with the estimated gestation period.
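The construction described in this paragraph (drop the records coded 99, cut both variables into intervals, average the birth weight within each cell, and plot the cell means) can be sketched as follows. This is only an illustration under stated assumptions: the data frame sim is simulated and merely borrows the variable names of births2006.smpl, the choice of 11 intervals follows the text, and the object names new, t6, lp, and cp are our own.

```r
library(lattice)

## Hypothetical stand-in for births2006.smpl: simulated values only
set.seed(2)
n <- 5000
sim <- data.frame(WTGAIN  = round(runif(n, 0, 60)),
                  ESTGEST = sample(c(25:42, 99), n, replace = TRUE))
sim$DBWT <- 150 * pmin(sim$ESTGEST, 42) - 2700 + rnorm(n, sd = 300)

## frequency table of gestation period; 99 is the code for "unknown"
table(sim$ESTGEST)

## omit records with unknown gestation period
new <- sim[sim$ESTGEST != 99, ]

## cut both variables into 11 intervals and average birth weight per cell
t6 <- tapply(new$DBWT,
             INDEX = list(cut(new$WTGAIN, breaks = 11),
                          cut(new$ESTGEST, breaks = 11)),
             FUN = mean, na.rm = TRUE)
dim(t6)

## level plot and contour plot of the cell means (rows on the horizontal axis)
lp <- levelplot(t6, scales = list(x = list(rot = 90)))
cp <- contourplot(t6, scales = list(x = list(rot = 90)))

## printing a lattice object draws it; done here only in an interactive session
if (interactive()) {
  print(lp)
  print(cp)
}
```

In this simulation the birth weight depends on the gestation period only, so the cell means change along the gestation axis and stay nearly constant along the weight gain axis.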

2.1.1 Modeling Issues Investigated in Subsequent Chapters

This discussion, with its many summaries and graphs, has given us a pretty good idea about the data. But what questions would we want to have answered with these data? One may wish to predict the birth weight from characteristics such as the estimated gestation period and the weight gain of the mother; for that, one could use regression and regression trees. Or, one may want to identify births that lead to very low APGAR scores, for which purpose, one could use classification methods.

2.2 Example 2: Alumni Donations

The file contribution.csv (available on our data Web site) summarizes the contributions received by a selective private liberal arts college in the Midwest. The college has a large endowment and, as all private colleges do, keeps detailed records on alumni donations. Here we analyze the contributions of five graduating classes (the cohorts who have graduated in 1957, 1967, 1977, 1987, and 1997). The data set consists of living alumni and contains their contributions for the years 2000–2004. In addition, the data set includes several other variables such as gender, marital status, college major, subsequent graduate work, and attendance at fund-raising events, all variables that may play an important role in assessing the success of future capital campaigns. This is a carefully constructed and well-maintained data set; it contains only alumni who graduated from the institution, and not former students who spent time at the institution without graduating. The data set contains no missing observations. The first five records of the file are shown below. Alumni not contributing have the entry “0” in the related column. The 1957 cohort is the smallest group. This is because of smaller class sizes in the past and deaths of older alumni.

## Install packages from CRAN; use any USA mirror

library(lattice)

don <- read.csv("C:/DataMining/Data/contribution.csv")

don[1:5,]

  Gender Class.Year Marital.Status   Major Next.Degree FY04Giving FY03Giving
1      M       1957              M History         LLB       2500       2500
2      M       1957              M Physics          MS       5000       5000
3      F       1957              M   Music        NONE       5000       5000
4      M       1957              M History        NONE          0       5100
5      M       1957              M Biology          MD       1000       1000
  FY02Giving FY01Giving FY00Giving AttendenceEvent
1       1400      12060      12000               1
2       5000       5000      10000               1
3       5000       5000      10000               1
4        200        200          0               1
5       1000       1005       1000               1

table(don$Class.Year)

1957 1967 1977 1987 1997
 127  222  243  277  361

barchart(table(don$Class.Year),horizontal=FALSE,
  xlab="Class Year",col="black")

Total contributions for 2000–2004 are calculated for each graduate. Summary statistics (mean, standard deviation, and percentiles) are shown below. More than 30% of the alumni gave nothing; 90% gave $1050 or less; and only 3% gave more than $5000. The largest contribution was $172,000.

The first histogram of total contributions shown below is not very informative as it is influenced by both a sizable number of the alumni who have not contributed at all and a few alumni who have given very large contributions. Omitting contributions that are zero or larger than $1000 provides a more detailed view of contributions in the $1–$1000 range; this histogram is shown to the right of the first one. Box plots of total contributions are also shown. The second box plot omits the information from outliers and shows the three quartiles of the distribution of total contributions (0, 75, and 400).

don$TGiving=don$FY00Giving+don$FY01Giving+don$FY02Giving+
  don$FY03Giving+don$FY04Giving

mean(don$TGiving)

[1] 980.0436

sd(don$TGiving)

[1] 6670.773

quantile(don$TGiving,probs=seq(0,1,0.05))

      0%       5%      10%      15%      20%      25%      30%      35%
     0.0      0.0      0.0      0.0      0.0      0.0      0.0     10.0
     40%      45%      50%      55%      60%      65%      70%      75%
    25.0     50.0     75.0    100.0    150.8    200.0    275.0    400.0
     80%      85%      90%      95%     100%
   554.2    781.0   1050.0   2277.5 171870.1

quantile(don$TGiving,probs=seq(0.95,1,0.01))

      95%       96%       97%       98%       99%      100%
  2277.50   3133.56   5000.00   7000.00  16442.14 171870.06

hist(don$TGiving)

hist(don$TGiving[don$TGiving!=0][don$TGiving[don$TGiving!=0]<=1000])

boxplot(don$TGiving,horizontal=TRUE,xlab="Total Contribution")

boxplot(don$TGiving,outline=FALSE,horizontal=TRUE,xlab="Total Contribution")
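Statements such as "more than 30% of the alumni gave nothing" can also be obtained directly as the mean of a logical condition, and the nested subsetting used for the second histogram can be written as a single condition. The sketch below uses an invented toy vector (giving is hypothetical, not the contribution data) to show both ideas.

```r
## Hypothetical giving totals (toy values, not the contribution data)
giving <- c(0, 0, 0, 25, 75, 150, 400, 1050, 5200, 30000)

## the mean of a logical vector is the proportion of TRUE values
p.zero  <- mean(giving == 0)      # share who gave nothing
p.small <- mean(giving <= 1050)   # share who gave $1050 or less
p.large <- mean(giving > 5000)    # share who gave more than $5000
c(p.zero, p.small, p.large)

## the nested subsetting used for the second histogram ...
nested <- giving[giving != 0][giving[giving != 0] <= 1000]

## ... selects the same values as one combined condition
## (equivalent here because contributions cannot be negative)
direct <- giving[giving > 0 & giving <= 1000]
identical(nested, direct)
```

The combined condition is easier to read and avoids repeating the inner subset.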

We identify below the donors who gave at least $30,000 during 2000–2004. We also list their major and their next degree. The top donor has a mathematics–physics double major with no advanced degree. Four of the top donors have law degrees.

ddd=don[don$TGiving>=30000,]

ddd

ddd1=ddd[,c(1:5,12)]

ddd1

ddd1[order(ddd1$TGiving,decreasing=TRUE),]

    Gender Class.Year Marital.Status                Major Next.Degree   TGiving
99       M       1957              M  Mathematics-Physics        NONE 171870.06
123      M       1957              W   Economics-Business         MBA  90825.88
132      M       1967              M Speech (Drama, etc.)          JD  72045.31
105      M       1957              M              History         PHD  51505.84
135      M       1967              M              History          JD  42500.00
486      M       1977              M            Economics         MBA  36360.90
471      F       1977              D            Economics          JD  31500.00
1        M       1957              M              History         LLB  30460.00
2        M       1957              M              Physics          MS  30000.00
3        F       1957              M                Music        NONE  30000.00

For a university foundation, it is important to know who is contributing, as such information allows the foundation to target its fund-raising resources to those alumni who are most likely to donate. We show below box plots of total 5-year donation for the categories of class year, gender, marital status, and attendance at a foundation event. We have omitted in these graphs the outlying observations (those donors who contribute generously). Targeting one's effort to high contributors involves many personal characteristics that are not included in this database (such as special information about personal income and allegiance to the college). It may be a safer bet to look at the median amount of donation that can be achieved from the various groups. Class year certainly matters greatly; older alumni have access to higher life earnings, while more recent graduates may not have the resources to contribute generously. Attendance at a foundation-sponsored event certainly helps; this shows that it is important to get alumni to attend such events. This finding reminds the author of findings in his consulting work with credit card companies: if one wants someone to sign up for a credit card, one must first get that person to open up the envelope and read the advertising message. Single and divorced alumni give less; perhaps they worry about the sky-rocketing expenses of sending their own kids to college. We also provide bar charts of average total giving by the alumni's major and second degree. In these, we only consider those categories with frequencies exceeding a certain threshold (10); otherwise, we would have to look at the information from too many groups with low frequencies of occurrence. Alumni with an economics/business major contribute most. Among alumni with a second degree, MBAs and lawyers give the most.

boxplot(TGiving~Class.Year,data=don,outline=FALSE)

boxplot(TGiving~Gender,data=don,outline=FALSE)

boxplot(TGiving~Marital.Status,data=don,outline=FALSE)

boxplot(TGiving~AttendenceEvent,data=don,outline=FALSE)

t4=tapply(don$TGiving,don$Major,mean,na.rm=TRUE)

t5=table(don$Major)

t6=cbind(t4,t5)

t7=t6[t6[,2]>10,]

t7[order(t7[,1],decreasing=TRUE),]

barchart(t7[,1],col="black")

t4=tapply(don$TGiving,don$Next.Degree,mean,na.rm=TRUE)

t5=table(don$Next.Degree)

t6=cbind(t4,t5)

t7=t6[t6[,2]>10,]

t7[order(t7[,1],decreasing=TRUE),]

barchart(t7[,1],col="black")
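The sequence used above (average per category, count per category, keep categories with more than 10 alumni, sort by the average) can also be written with named columns, which avoids referring to columns by position. This is a sketch on invented toy data; major, giving, and the object names avg, cnt, both, keep, and res are hypothetical.

```r
## Hypothetical toy data: 'major' and 'giving' are invented for illustration
major  <- c(rep("History", 12), rep("Economics", 15), rep("Music", 3))
giving <- c(rep(100, 12), rep(300, 15), rep(5000, 3))

avg <- tapply(giving, major, mean, na.rm = TRUE)   # mean giving per major
cnt <- table(major)                                # number of alumni per major

## both results are ordered by the sorted category names, so they line up
both <- data.frame(avg = as.vector(avg), cnt = as.vector(cnt),
                   row.names = names(avg))

## keep majors with more than 10 alumni, then sort by average giving
keep <- both[both$cnt > 10, ]
res  <- keep[order(keep$avg, decreasing = TRUE), ]
res
```

In this toy example the small "Music" group has by far the highest average but is dropped because it contains only three alumni, which is the purpose of the frequency threshold.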

A plot of histogram densities, stratified according to year of graduation, shows the distributions of 5-year giving among alumni who gave $1–$1000. It gives a more detailed description of the distribution than the earlier histogram of all contributions.

densityplot(~TGiving|factor(Class.Year),
  data=don[don$TGiving<=1000,][don[don$TGiving<=1000,]$TGiving>0,],
  plot.points=FALSE,col="black")

We now calculate the total of the 5-year donations for the five graduation cohorts. We do this by using the tapply function (applying the summation function to the total contributions of each of the graduation classes). The result shows that the 1957 cohort has contributed $560,000, compared to $35,000 of the 1997 cohort.

t11=tapply(don$TGiving,don$Class.Year,FUN=sum,na.rm=TRUE)

t11

     1957      1967      1977      1987      1997
560506.76 293750.74 210768.81 105288.37  35138.92

barplot(t11,ylab="Total Donation")

Below we calculate the annual contributions (2000–2004) of the five graduation classes. The 5 bar charts are drawn on the same scale to facilitate ready comparisons. The year 2001 was the best because of some very large contributions from the 1957 cohort.

barchart(tapply(don$FY04Giving,don$Class.Year,FUN=sum,
  na.rm=TRUE),horizontal=FALSE,ylim=c(0,225000),col="black")

barchart(tapply(don$FY03Giving,don$Class.Year,FUN=sum,
  na.rm=TRUE),horizontal=FALSE,ylim=c(0,225000),col="black")

barchart(tapply(don$FY02Giving,don$Class.Year,FUN=sum,
  na.rm=TRUE),horizontal=FALSE,ylim=c(0,225000),col="black")

barchart(tapply(don$FY01Giving,don$Class.Year,FUN=sum,
  na.rm=TRUE),horizontal=FALSE,ylim=c(0,225000),col="black")

barchart(tapply(don$FY00Giving,don$Class.Year,FUN=sum,
  na.rm=TRUE),horizontal=FALSE,ylim=c(0,225000),col="black")
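Instead of repeating the same command for each fiscal year, the yearly totals per graduating class can be collected in one matrix. The sketch below shows one way to do this with sapply; the data frame toy is invented and only borrows the column names of the contribution file, and the names years, totals, and ymax are our own.

```r
## Hypothetical toy data frame with the same column names as 'don'
toy <- data.frame(
  Class.Year = c(1957, 1957, 1997, 1997),
  FY00Giving = c(100, 0, 10, 0),
  FY01Giving = c(500, 50, 0, 0),
  FY02Giving = c(100, 0, 10, 5),
  FY03Giving = c(100, 0, 0, 5),
  FY04Giving = c(100, 0, 10, 5)
)

years <- c("FY00Giving", "FY01Giving", "FY02Giving", "FY03Giving", "FY04Giving")

## one column per fiscal year, one row per graduating class
totals <- sapply(years, function(v)
  tapply(toy[[v]], toy$Class.Year, FUN = sum, na.rm = TRUE))
totals

## total giving per fiscal year and the year with the largest total
colSums(totals)
best <- names(which.max(colSums(totals)))
best

## a common upper limit for the vertical axis of all bar charts
ymax <- max(totals)
ymax
```

The matrix makes it easy to choose a common axis limit (the role played by ylim=c(0,225000) above) and to see which year and cohort account for the largest totals.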

Finally, we compute the numbers and proportions of individuals who contributed. We do this by first creating an indicator variable for total giving, and displaying the numbers of the alumni who did and did not contribute. About 66% of all alumni contribute. The mosaic plot shows that the 1957 cohort has the largest proportion of contributors; the 1997 cohort has the smallest proportion of contributors, but includes the largest number of individuals (the area of the bar in a mosaic plot expresses the size of the group). The proportions of contributors shown below indicate that about 76% of the 1957 cohort contributes, while only 61% of the 1997 graduating class does so. We can do the same analysis for each of the 5 years (2000–2004). The results for the most recent year 2004 are also shown.

don$TGivingIND=cut(don$TGiving,c(-1,0.5,10000000),
  labels=FALSE)-1

mean(don$TGivingIND)

[1] 0.6569106

t5=table(don$TGivingIND,don$Class.Year)

    1957 1967 1977 1987 1997
  0   31   71   75  105  140
  1   96  151  168  172  221

barplot(t5,beside=TRUE)

mosaicplot(factor(don$Class.Year)~factor(don$TGivingIND))

t50=tapply(don$TGivingIND,don$Class.Year,FUN=mean,na.rm=TRUE)

t50

     1957      1967      1977      1987      1997
0.7559055 0.6801802 0.6913580 0.6209386 0.6121884

barchart(t50,horizontal=FALSE)

don$FY04GivingIND=cut(don$FY04Giving,c(-1,0.5,10000000),
  labels=FALSE)-1

t51=tapply(don$FY04GivingIND,don$Class.Year,FUN=mean,
  na.rm=TRUE)

t51

     1957      1967      1977      1987      1997
0.5196850 0.5000000 0.4238683 0.3610108 0.3518006

barchart(t51,horizontal=FALSE)
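The indicator variable is built with cut, which returns the interval codes 1 and 2; subtracting 1 turns them into 0 and 1. For non-negative amounts a logical comparison gives the same result more directly. The sketch below checks this on invented toy values (giving and cohort are hypothetical).

```r
## Hypothetical toy values (not the contribution data)
giving <- c(0, 0, 25, 1000, 0, 50000)
cohort <- c(1957, 1997, 1957, 1997, 1997, 1957)

## indicator as constructed in the text: cut() codes (1, 2) shifted to (0, 1)
ind.cut <- cut(giving, c(-1, 0.5, 10000000), labels = FALSE) - 1

## shorter alternative for non-negative amounts: logical test converted to 0/1
ind <- as.numeric(giving > 0)
all(ind.cut == ind)

## the mean of a 0/1 indicator is the proportion of contributors
p.all    <- mean(ind)
p.cohort <- tapply(ind, cohort, mean)
p.all
p.cohort
```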

Below we explore the relationship between the alumni contributions among the 5 years. For example, if we know the amount an alumnus gives in one year (say, in year 2000), does this give us information about how much that person will give in 2001? Pairwise correlations and scatter plots show that donations in most pairs of years are positively and often strongly related (correlations of up to 0.88), although the correlations involving the 2001 contributions are weak (0.10 to 0.25). We use the command plotcorr in the package ellipse to express the strength of the correlation through ellipse-like confidence regions.

Data=data.frame(don$FY04Giving,don$FY03Giving,don$FY02Giving,
  don$FY01Giving,don$FY00Giving)

correlation=cor(Data)

correlation

               don.FY04Giving don.FY03Giving don.FY02Giving don.FY01Giving
don.FY04Giving      1.0000000      0.5742938      0.8163331      0.1034995
don.FY03Giving      0.5742938      1.0000000      0.5867497      0.1385288
don.FY02Giving      0.8163331      0.5867497      1.0000000      0.2105597
don.FY01Giving      0.1034995      0.1385288      0.2105597      1.0000000
don.FY00Giving      0.6831861      0.3783280      0.8753492      0.2528295
               don.FY00Giving
don.FY04Giving      0.6831861
don.FY03Giving      0.3783280
don.FY02Giving      0.8753492
don.FY01Giving      0.2528295
don.FY00Giving      1.0000000

plot(Data)

library(ellipse)

plotcorr(correlation)

We conclude our analysis of the contribution data set with several mosaic plots that illustrate the relationships among categorical variables. The proportion of alumni making a contribution is the same for men and women. Married alumni are most likely to contribute, and the area of the bars in the mosaic plot indicates that married alumni constitute the largest group. Alumni who have attended an informational meeting are more likely to contribute, and more than half of all alumni have attended such a meeting. Separating the alumni into groups who have and have not attended an informational meeting, we create mosaic plots for giving and marital status. The likelihood of giving increases with attendance, but the relative proportions of giving across the marital status groups are fairly similar. This tells us that there is a main effect of attendance, but that there is not much of an interaction effect.

mosaicplot(factor(don$Gender)~factor(don$TGivingIND))

mosaicplot(factor(don$Marital.Status)~factor(don$TGivingIND))

t2=table(factor(don$Marital.Status),factor(don$TGivingIND))

mosaicplot(t2)

mosaicplot(factor(don$AttendenceEvent)~factor(don$TGivingIND))
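The separate analysis for alumni who have and have not attended a meeting can be based on a three-way table. The sketch below is one possible way to set this up; the data frame toy is invented and only reuses the variable names of the contribution data, and the names tab3, p.attend, and p.noattend are our own.

```r
## Hypothetical toy data with the same variable names as in 'don'
toy <- data.frame(
  Marital.Status  = c("M", "M", "M", "M", "S", "S", "S", "S"),
  AttendenceEvent = c(1, 1, 0, 0, 1, 1, 0, 0),
  TGivingIND      = c(1, 1, 1, 0, 1, 0, 0, 0)
)

## three-way table: marital status x giving x attendance
tab3 <- table(Marital = toy$Marital.Status,
              Giving  = toy$TGivingIND,
              Attend  = toy$AttendenceEvent)

## proportion of givers within each marital status, separately by attendance
p.attend   <- prop.table(tab3[, , "1"], margin = 1)
p.noattend <- prop.table(tab3[, , "0"], margin = 1)
p.attend
p.noattend

## one mosaic plot per attendance group (drawn only in an interactive session)
if (interactive()) {
  mosaicplot(tab3[, , "1"], main = "Attended an event")
  mosaicplot(tab3[, , "0"], main = "Did not attend")
}
```

Comparing the rows of the two proportion tables shows whether the effect of attendance is similar across marital status groups (a main effect) or differs between them (an interaction).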

2.2.1 Modeling Issues to be Investigated in Subsequent Chapters

This discussion, with the many summaries and graphs, has told us much about the information in the data. What questions would we want to have answered with this data? It may be of interest to predict the likelihood of 2004 giving on the basis of the previous giving history (2000–2003), donor characteristics, and whether a graduate had attended an informational meeting. Logistic regression models or classification trees will be the prime models. Unfortunately, the variable “attendance at an informational meeting” does not indicate the year or years the meeting was attended, so its influence on the 2004 donation may already be incorporated in the donations of earlier years.

2.3 Example 3: Orange Juice

This section analyzes the weekly sales data of refrigerated 64-ounce orange juice containers from 83 stores in the Chicago area. There are many stores throughout the city, many time periods, and also three different brands (Dominicks, MinuteMaid, and Tropicana). The data are arranged in rows, with each row giving the recorded store sales (in logarithms; logmove), as well as brand, price, presence/absence of feature advertisement, and the demographic characteristics of the store. There are 28,947 rows in this data set. The data is taken from P. Rossi's bayesm package for R, and it has been used earlier in Montgomery (1987).

Time sequence plots of weekly sales, averaged over all 83 stores, are shown for the three brands. We create these plots by first obtaining the average sales for a given week and brand (averaged over the 83 stores). For this, we use the very versatile R function tapply. Time sequence plots of the averages are then graphed for each brand, and the plots are arranged on the same scale for easy comparison. An equivalent display, as three panels on the same plotting page, is produced through the xyplot function of the lattice package. Box plots, histograms, and smoothed density plots for sales, stratified for the three brands, are also shown. These displays average the information across the 83 stores and the 121 weeks.

## Install packages from CRAN; use any USA mirror

library(lattice)

oj <- read.csv("C:/DataMining/Data/oj.csv")

oj$store <- factor(oj$store)

oj[1:2,]

  store     brand week  logmove feat price     AGE60      EDUC    ETHNIC
1     2 tropicana   40 9.018695    0  3.87 0.2328647 0.2489349 0.1142799
2     2 tropicana   46 8.723231    0  3.87 0.2328647 0.2489349 0.1142799
    INCOME   HHLARGE   WORKWOM   HVAL150 SSTRDIST  SSTRVOL CPDIST5   CPWVOL5
1 10.55321 0.1039534 0.3035853 0.4638871 2.110122 1.142857 1.92728 0.3769266
2 10.55321 0.1039534 0.3035853 0.4638871 2.110122 1.142857 1.92728 0.3769266

t1=tapply(oj$logmove,oj$brand,FUN=mean,na.rm=TRUE)

  dominicks minute.maid   tropicana
   9.174831    9.217278    9.111483

t2=tapply(oj$logmove,INDEX=list(oj$brand,oj$week),FUN=mean,
  na.rm=TRUE)

                  40        41        42        43        44        45        46
dominicks   8.707053  7.721438  7.684779  8.220681  7.529664  7.485447  8.374706
minute.maid 8.316846 10.599174  8.350451  8.464384 10.272432  8.302100  8.975714
tropicana   8.772400  8.506540  8.859382  8.603009  8.422304  8.633549  8.579669
                  47        48        49        50        51        52        53
dominicks   8.737358  8.031447  7.790064  7.515055 10.308041  9.305908  9.136502
minute.maid 9.907359  8.238033 10.641114  8.195133  8.460606  8.340930 10.131160
tropicana   8.571572  8.739818  8.465478  8.633266  8.577919  8.827387  8.760043
. . .

plot(t2[1,],type= "l",xlab="week",ylab="dominicks",ylim=c(7,12))

plot(t2[2,],type= "l",xlab="week",ylab="minute.maid",ylim=c(7,12))

plot(t2[3,],type= "l",xlab="week",ylab="tropicana",ylim=c(7,12))

logmove=c(t2[1,],t2[2,],t2[3,])

week1=c(40:160)

week=c(week1,week1,week1)

brand1=rep(1,121)

brand2=rep(2,121)

brand3=rep(3,121)

brand=c(brand1,brand2,brand3)

xyplot(logmove~week|factor(brand),type= "l",layout=c(1,3),
  col="black")

boxplot(logmove~brand,data=oj)

histogram(~logmove|brand,data=oj,layout=c(1,3))

densityplot(~logmove|brand,data=oj,layout=c(1,3),plot.points=FALSE)

densityplot(~logmove,groups=brand,data=oj,
  plot.points=FALSE)
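The vectors logmove, week, and brand for the three-panel time sequence plot were assembled by hand from the rows of t2. An alternative is to let R convert the brand-by-week matrix into one row per cell. The sketch below illustrates this on a small invented matrix; t2.toy and long are hypothetical names, and the numbers are not the actual averages.

```r
## Hypothetical stand-in for the brand-by-week matrix t2 (values invented)
t2.toy <- matrix(c(8.7, 8.3, 8.8,
                   7.7, 10.6, 8.5,
                   7.7, 8.4, 8.9),
                 nrow = 3,
                 dimnames = list(brand = c("dominicks", "minute.maid", "tropicana"),
                                 week  = c("40", "41", "42")))

## converting the table to a data frame gives one row per (brand, week) cell
long <- as.data.frame(as.table(t2.toy), responseName = "logmove",
                      stringsAsFactors = FALSE)
long$week <- as.numeric(long$week)       # week labels back to numbers
long
```

A data frame of this form can be passed to lattice directly, for example with xyplot(logmove~week|brand,data=long,type="l",layout=c(1,3)), and the panels are then labeled with the brand names rather than the codes 1, 2, and 3.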

The previous displays ignore price and the presence of feature advertisement. Below we graph sales against price, and we do this for each brand separately but aggregating over weeks and stores. The graph shows that sales decrease with increasing price. A density plot of sales for weeks with and without feature advertisement, and a scatter plot of sales against price with the presence of feature advertisement indicated by the color of the plotting symbol both indicate the very positive effect of feature advertisement.

xyplot(logmove~week,data=oj,col="black")

xyplot(logmove~week|brand,data=oj,layout=c(1,3),col="black")

xyplot(logmove~price,data=oj,col="black")

xyplot(logmove~price|brand,data=oj,layout=c(1,3),col="black")

smoothScatter(oj$price,oj$logmove)

densityplot(~logmove,groups=feat, data=oj, plot.points=FALSE)

xyplot(logmove~price,groups=feat, data=oj)

Next we consider one particular store. Time sequence plots of the sales of store 5 are shown for the three brands. Scatter plots of sales against price, separately for the three brands, are also shown; sales decrease with increasing price. Density histograms of sales and scatter plots of sales against price, with weeks with and without feature advertisement coded in color, are shown for each of the three brands. Again, these graphs show very clearly that feature advertisement increases the sales.

## select the records of store 5

oj1=oj[oj$store==5,]

densityplot(~logmove|brand,groups=feat,data=oj1,
  plot.points=FALSE)

xyplot(logmove~price|brand,groups=feat,data=oj1)

The volume of the sales of a given store certainly depends on the price that is being charged and on the feature advertisement that is being run. In addition, sales of a store may depend on the characteristics of the store such as the income, age, and educational composition of its neighborhood. We may be interested in assessing whether the sensitivity (elasticity) of the sales to changes in the price depends on the income of the customers who live in the store's neighborhood. We may expect that the price elasticity is largest in poorer neighborhoods as poorer customers have to watch their spending budgets more closely. To follow up on this hypothesis, we look for the stores in the wealthiest and the poorest neighborhoods. We find that store 62 is in the wealthiest area, while store 75 is in the poorest one. Lattice scatter plots of sales versus price, on separate panels for these two stores, with and without the presence of feature advertisements, are shown below. In order to get a better idea about the effect of price on sales, we repeat the first scatter plot and add the best fitting (least squares) line to the graph; more discussion on how to determine that best fitting line is given in Chapter 3. The slope of the fitted line is more negative for the poorest store, indicating that its customers are more sensitive to changes in the price.
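The steps described in this paragraph (find the stores with the highest and lowest neighborhood income, then fit a least squares line of sales on price for each) can be sketched as follows. This is an illustration under stated assumptions only: the data frame sim is simulated and merely borrows the variable names of the orange juice data, the price effects are built into the simulation, and the object names inc, rich, poor, fit.rich, and fit.poor are our own.

```r
## Hypothetical stand-in for the orange juice data: all values are simulated
set.seed(3)
store   <- rep(1:4, each = 100)
INCOME  <- c(10.2, 10.9, 9.9, 10.6)[store]      # one income value per store
slope   <- c(-1.2, -0.6, -1.8, -0.9)[store]     # assumed true price effects
price   <- runif(400, 1.5, 4)
logmove <- 12 + slope * price + rnorm(400, sd = 0.3)
sim <- data.frame(store = factor(store), INCOME, price, logmove)

## stores in the wealthiest and the poorest neighborhoods
inc  <- tapply(sim$INCOME, sim$store, FUN = mean, na.rm = TRUE)
rich <- names(which.max(inc))
poor <- names(which.min(inc))
c(wealthiest = rich, poorest = poor)

## least squares line of log sales on price for each of the two stores
fit.rich <- lm(logmove ~ price, data = sim[sim$store == rich, ])
fit.poor <- lm(logmove ~ price, data = sim[sim$store == poor, ])
c(rich = coef(fit.rich)[["price"]], poor = coef(fit.poor)[["price"]])
```

Because the slopes are part of the simulation, the comparison here only demonstrates the mechanics; with the actual data the same commands would be applied to oj, and the fitted slopes would be the quantities of interest.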

2.3.1 Modeling Issues to be Investigated in Subsequent Chapters

We can use this data set to investigate clustering. We may want to learn whether it is possible to reduce the 83 stores to a smaller number of homogeneous clusters. Furthermore, we may want to explain sales as a function of explanatory variables such as price, feature advertisements, and the characteristics of the store neighborhood. In particular, we may want to study whether the effects of price changes and feature advertisements depend on demographic characteristics of the store neighborhood. We will revisit this data set when we discuss regression (Chapter 3) and LASSO estimation (Chapter 6).

References

Adler, J.: R In a Nutshell: A Desktop Quick Reference. Sebastopol, CA: O'Reilly Media, Inc., 2009.

Ledolter, J. and Swersey, A.: Testing 1-2-3: Experimental Design with Applications in Marketing and Service Operations. Stanford, CA: Stanford University Press, 2007.

Montgomery, A.L.: Creating micro-marketing pricing strategies using supermarket scanner data. Marketing Science, Vol. 16 (1987), 315–337.

Chapter 3: Standard Linear Regression

In the standard linear regression model, the response
