R is one of the most widely used programming languages for statistics, and when applied to data science, this powerful combination helps tame the complexities of unstructured, real-world datasets. This book covers the entire data science ecosystem for aspiring data scientists, taking you from zero to a level where you are confident enough to get hands-on with real-world data science problems.
The book starts with an introduction to data science and presents popular R libraries for executing routine data science tasks. It covers the important processes in data science, such as gathering data, cleaning it, and uncovering patterns in it. You will explore machine learning algorithms, predictive analytical models, and deep learning. You will also learn to use the most powerful visualization packages available in R, so that you can easily derive insights from your data.
Towards the end, you will also learn how to integrate R with Spark and Hadoop and perform large-scale data analytics without much complexity.
Page count: 475
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Joshua Nadar
Content Development Editor: Karan Thakkar
Technical Editor: Suwarna Patil
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta
First published: November 2018
Production reference: 1301118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78913-940-2
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Vitor Bianchi Lanzetta (@vitorlanzetta) has a master's degree in Applied Economics (University of São Paulo—USP) and works as a data scientist in a tech start-up named RedFox Digital Solutions. He has also authored a book called R Data Visualization Recipes. The things he enjoys the most are statistics, economics, and sports of all kinds (electronic ones included). His blog, made in partnership with Ricardo Anjoleto Farias (@R_A_Farias), can be found at ArcadeData dot org; they kindly call it R-Cade Data.
Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.
I'd like to thank my wife, Sara, for her caring support and understanding as I worked on the book at weekends and evenings, and to my parents, parents-in-law, sister, and grandmother for all their support, guidance, tutelage, and encouragement over the years. I'd also like to thank Packt, especially the editors, Tushar Gupta, and Karan Thakkar, and everyone else in the team, whose persistence and attention to detail has been exemplary.
Ricardo Anjoleto Farias is an economist who graduated from the Universidade Estadual de Maringá in 2014. In addition to being a sports enthusiast (electronic or otherwise) and enjoying a good barbecue, he also likes math, statistics, and correlated studies. His first contact with R was when he embarked on his master's degree, and since then, he has tried to improve his skills with this powerful tool.
Doug Ortiz is the founder of Illustris, LLC, and is an experienced enterprise cloud, big data, data analytics, and solutions architect who has architected, designed, developed, reengineered, and integrated enterprise solutions. His other areas of expertise include Amazon Web Services, Azure, Google Cloud, Business Intelligence, Hadoop, Spark, NoSQL databases, and SharePoint, to name but a few.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Data Science with R
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Getting Started with Data Science and R
Introduction to data science
Key components of data science
Computer science
Predictive analytics (machine learning)
Domain knowledge
Active domains of data science
Finance
Healthcare
Pharmaceuticals
Government
Manufacturing and retail
Web industry
Other industries
Solving problems with data science
Using R for data science
Key features of R
Our first R program
UN development index
Summary
Quiz
Descriptive and Inferential Statistics
Measures of central tendency and dispersion
Measures of central tendency
Calculating mean, median, and mode with base R
Measures of dispersion
Useful functions to draw automated summaries
Statistical hypothesis testing
Running t-tests with R
Decision rule – a brief overview of the p-value approach
Be careful
Running z-tests with R
Elaborating a little longer
A/B testing – a brief introduction and a practical example with R
Summary
Quiz
Data Wrangling with R
Introduction to data wrangling with R
Data types, formats, and sources
Data extraction, transformation, and load
Basic tools of data wrangling
Using base R for data manipulation and analysis
Applying families of functions
Aggregation functions
Merging DataFrames
Using tibble and dplyr for data manipulation
Basic dplyr usage
Using select
Filtering with filter
Using arrange for sorting
Summarise
Sampling data
The tidyr package
Converting wide tables into long tables
Converting long tables into wide tables
Joining tables
dbplyr – databases and dplyr
Using data.table for data manipulation
Grouping operations
Adding a column
Ordering columns
What is the advantage of searching using key by?
Creating new columns in data.table
Deleting a column
Pivots on data.table
The melt functionality
Reading and writing files with data.table
A special note on dates and/or time
Miscellaneous topics
Checking data quality
Reading other file formats – Excel, SAS, and other data sources
On-disk formats
Working with web data
Web APIs
Tutorial – looking at airline flight times data
Summary
Quiz
KDD, Data Mining, and Text Mining
Good practices of KDD and data mining
Stages of KDD
Scraping a dwarf name
Retrieving text from the web
Legality of web scraping
Web scraping made easy with rvest
Retrieving tweets from R community
Creating your Twitter application
Fetching the number of tweets
Cleaning and transforming data
Looking for patterns – peeking, visualizing, and clustering data
Peeking data
Visualizing data
Cluster analysis
Summary
Quiz
Data Analysis with R
Preparing data for analysis
Data categories
Data types in R
Reading data
Managing data issues
Mixed data types
Missing data
Handling strings and dates
Handling dates using POSIXct or POSIXlt
Handling strings in R
Reading data
Combining strings
Simple pattern matching and replacement with R
Printing results
Data visualisation
Types of charts – basic primer
Histograms
Line plots
Scatter plots
Boxplots
Bar charts
Heatmaps
Summarizing data
Saving analysis for future work
Packrat
Checkpoint
Rocker
Summary
Quiz
Machine Learning with R
What is machine learning?
Machine learning everywhere
Machine learning vocabulary
Generic problems solved by machine learning
Linear regression with R
Tricks for lm
Tree models
Strengths and weaknesses
The Chilean plebiscite data
Starting with decision trees
Growing trees with tree and rpart
Random forests – a collection of trees
Support vector machines
What about regressions?
Hierarchical and k-means clustering
Neural networks
Introduction to feedforward neural networks with R
Summary
Quiz
Forecasting and ML App with R
The UI and server
Forecasting machine learning application
Application details
Summary
Quiz
Neural Networks and Deep Learning
Daily neural nets
Overview – NNs and deep learning
Neuroscience inspiration
ANN nodes
Activation functions
Layers
Training algorithms
NNs with Keras
Getting things ready for Keras
Getting practical with Keras
Further tips
Summary
Quiz
Markovian in R
Markovian-type models
Markovian models – real-world applications
The Markov chain
Programming an HMM with R
Summary
Quiz
Visualizing Data
Retrieving and cleaning data
Crafting visualizations
Summary
Quiz
Going to Production with R
What is R Shiny?
How to build a Shiny app
Building an application inside R
The reactive and isolate functions
The observeEvent and eventReactive functions
Approach for creating a data product from statistical modeling and web UI
Some advice about Shiny
Summary
Quiz
Large Scale Data Analytics with Hadoop
Installing the package and Spark
Manipulating Spark data using both dplyr and SQL
Filtering and aggregating Spark datasets
Using Spark machine learning or H2O Sparkling Water
Providing interfaces to Spark packages
Spark DataFrames within the RStudio IDE
Summary
Quiz
R on Cloud
Cloud computing
Cloud types
Things to look for
Why Azure?
Azure registration
Azure Machine Learning Studio
How modules work
Building an experiment that uses R
Summary
Quiz
The Road Ahead
Growing your skills
Gathering data
Content to stay tuned to
Meeting Stack Overflow
Other Books You May Enjoy
Leave a review - let other readers know what you think
Hands-On Data Science with R deals with the practical aspects of R development, more so than the theoretical. In other words, the emphasis is on how to use R for different data science-related activities, such as machine learning and data mining, as well as topics in visualization, cloud computing, and others. Note that much of the book assumes some prior familiarity with R; it is intended for intermediate R users. While a number of introductory explanations, such as instructions for installing RStudio, have been provided, the reader may find some topics more advanced, and these necessitate prior experience with R programming.
If you are a budding data scientist keen to learn data science with R, or a developer looking to step into the world of data analysis, this book is the ideal resource to get you started. Some programming experience in R will be helpful in terms of getting the most out of this book.
Chapter 1, Getting Started with Data Science and R, provides an introduction to the field of data science, its applicability in different industry domains, an overview of the machine learning process, and how to install RStudio in order to get started with R development. It also introduces the reader to programming in R, starting off at an intermediate level to facilitate an analysis of the Human Development Index (HDI), published by the UN Development Programme. The HDI signifies a country's level of economic development, including general public health, education, and various other societal factors.
Chapter 2, Descriptive and Inferential Statistics, introduces fundamental statistical analysis using R, including techniques to perform random sampling, hypothesis testing, and non-parametric tests. This chapter contains extensive examples of commands in R for performing common analyses, such as t-tests and z-tests, and includes the use of some well-known statistical packages, such as Hmisc.
Chapter 3, Data Wrangling with R, provides an introduction to packages available in R to slice and manipulate data. Packages that are available as part of the tidyverse set of packages, such as dplyr, and, more generally, the apply family of functions in R, are introduced. The chapter is example-heavy, in that several examples are provided to guide the reader on how to apply the functions in the respective packages.
Chapter 4, KDD, Data Mining, and Text Mining, includes extensive discussions on the art of extracting information from unstructured data sources, such as websites and Twitter. KDD (Knowledge Discovery in Databases) is a popular term in the data science community, and this chapter does full justice to the topic by providing step-by-step examples so as to give a holistic overview of the subject matter. Sections on web scraping, data transformation, and data visualization have been included. Examples of how to leverage packages such as rvest and httr to perform such operations are also discussed at length.
Chapter 5, Data Analysis with R, covers a general introduction to data types and data categories in R as they apply to machine learning, manipulating strings and dates, and charting with R. This chapter is essentially a consolidation of topics that are found elsewhere in the book, but in a more concise format. This chapter can hence be used as a standalone section of the book that does not depend on any other chapter and can be used to gain familiarity with the topics discussed.
Chapter 6, Machine Learning with R, provides a detailed overview of using R for predictive analytics, more generally known as machine learning. It starts out with linear regression, and gradually progresses to more in-depth topics in ML such as decision trees, random forest, and SVMs. Extensively worked-out, hands-on examples, along with visualizations, complement the theoretical discussions in this chapter. The chapter concludes with a discussion on neural networks, one of the most popular fields today in machine learning.
Chapter 7, Forecasting and ML App with R, includes an advanced R Shiny application, complete with custom CSS style sheets, Google fonts, modified data table formats, and the like, for forecasting the revenue and sales of pharmaceutical medications in the UK using the NHS dataset. Such datasets are also known as real-world datasets in the sense that they contain actual data pertaining to physicians' prescribing activities. The application is fully reactive; that is, changing the controls on the frontend will immediately run the respective forecasting algorithm and update the forecast tables. We have also used Prophet, Facebook's machine learning-based forecasting package, whose models can be fitted using Markov Chain Monte Carlo sampling.
Chapter 8, Neural Networks and Deep Learning, initiates a comprehensive discussion, along with hands-on examples, of using R for machine learning with two of the most popular algorithms—neural networks and their more advanced variation, deep learning. Indeed, some of the most successful machine learning projects in the world today, such as self-driving cars and automated assistants such as Siri, are powered by deep learning. This chapter gives readers a unique and robust opportunity to delve into these areas and learn how they, too, can apply some of the same algorithms driving sensational successes in the field of machine learning today.
Chapter 9, Markovian in R, applies to more advanced users who are interested in learning more about Markov processes, which involve finding latent (or hidden) data from information in datasets. This is essentially part of a field known as Bayesian analysis, which allows machine learning practitioners to model states that are not directly visible. Markov models are used in fields such as natural language processing and object recognition.
Chapter 10, Visualizing Data, provides a comprehensive introduction to various plotting libraries in R. In particular, libraries such as ggplot2, rCharts, and mapping libraries have been discussed at length. R is well known for its presentation-grade libraries that are capable of creating stunning, professional-grade visualizations. The chapter walks the reader through many of the plotting libraries that have made R a mainstay of the data visualization field.
Chapter 11, Going to Production with R, provides an introduction to the Shiny R package, a tool for the development of interactive applications. This chapter delves into how it works, how reactivity works, the basics of its template, how to build a basic application, and how to build one using a real dataset. If you want a package to present your data to people who are unfamiliar with the R language, maybe you should start by learning the Shiny App.
Chapter 12, Large Scale Data Analytics with Hadoop, covers Apache Spark, an engine for large-scale data processing that is similar, but not identical, to Apache Hadoop. Since its focus is on processing, you can use it entirely from your RStudio console. This chapter teaches you how to install Spark and take your first steps with it using sparklyr, an R package that provides a Spark backend for the dplyr package. In this way, you can use dplyr functions to manipulate big datasets inside a Spark cluster.
Chapter 13, R on Cloud, takes an in-depth look at using AzureML on the Microsoft Azure (cloud) platform. Cloud computing has allowed companies across the world to transition from a traditional data center-oriented architecture to a cloud-based decentralized environment. Unsurprisingly, machine learning has become a major part of the success of the cloud due to the ease of deploying multi-node clusters for large-scale machine learning. AzureML is an easy-to-use web-based platform from Microsoft that allows even new data scientists to get a jump start on machine learning via a GUI-based interface.
Appendix A, The Road Ahead, introduces the reader to various resources on the web, such as blogs and forums to utilize and learn more about the field of R. The world of R is rapidly evolving, and in this chapter, we share some insights on the specific resources that will help seasoned data scientists stay abreast of all the developments in R today.
Readers should have a basic knowledge of R and the Shiny package.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Data-Science-with-R. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789139402_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The ggplot2 package is the most commonly used visualization package in R."
A block of code is set as follows:
life <- fread("ch1_life_exp.csv", header=T)
# View contents of life
head(life)
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "After logging in, search for the topic Cost Management + Billing in the left-hand menu, as shown in the following screenshot."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Data, like science, has been ubiquitous the world over since early history. The term data science is not generally taken to literally mean science with data, since without data there would be no science. Rather, it is a specialized field in which data scientists and other practitioners apply advanced computing techniques, usually along with algorithms or predictive analytics, to uncover insights that may be challenging to obtain with traditional methods.
Data science as a distinct subject has been proposed since the early 1960s by pioneers and thought leaders such as Peter Naur, Prof. Jeff Wu, and William Cleveland. Today, we have largely realized the vision that Prof. Wu and others had in mind when the concept first arose: data science as an amalgamation of computing, data mining, and predictive analytics, all leading up to deriving key insights that drive business and growth across the world today.
The driving force behind this has been the rapid but proportional growth of computing capabilities and algorithms. Computing languages have also played a key role in supporting the emergence of data science, primary among them being the statistical language R.
In this introductory chapter, we will cover the following topics:
Introduction to data science and R
Active domains of data science
Solving problems with data science
Using R for data science
Setting up R and RStudio
Our first R program
The practice of data science requires the application of three distinct disciplines to uncover insights from data. These disciplines are as follows:
Computer science
Predictive analytics
Domain knowledge
The following diagram shows the core components of data science:
During the course of performing data science, if large datasets are involved, the practitioner may spend a fair amount of time cleansing and curating the dataset. In fact, it is not uncommon for data scientists to spend the majority of their time preparing data for analysis. The generally accepted distribution of time for a data science project involves 80% spent in data management and the remaining 20% spent in the actual analysis of the data.
While this may seem or sound overly general, the growth of big data, that is, large-scale datasets, usually in the range of terabytes, has meant that it takes sufficient time and effort to extract data before the actual analysis takes place. Real-world data is seldom perfect. Issues with real-world data range from missing variables to incorrect entries and other deficiencies. The size of datasets also poses a formidable challenge.
Technologies such as Hadoop, Spark, and NoSQL databases have addressed the needs of the data science community for managing and curating terabytes, if not petabytes, of information. These tools are usually the first step in the overall data science process that precedes the application of algorithms to the datasets using languages such as R, Python, and others.
Hence, as a first step, the data scientist generally should be capable of working with datasets using contemporary tools for large-scale data mining. For instance, if the data resides in a Hadoop cluster, the practitioner must be able and willing to perform the work necessary to retrieve and curate the data from the source systems.
Second, once the data has been retrieved and curated, the data scientist should be aware of the computational requirements of the algorithm and determine whether the system has the necessary resources to execute it efficiently. For instance, if an algorithm can take advantage of multi-core computing facilities, the practitioner must use the appropriate packages and functions to leverage them. This may mean the difference between getting results in an hour and requiring an entire day.
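As a minimal sketch of this idea, the following uses base R's parallel package to compare a sequential apply with a multi-core one; the bootstrap-style task here is invented purely for illustration:

library(parallel)

slow_task <- function(i) {
  x <- rnorm(1e6)                  # simulate a large sample
  mean(sample(x, replace = TRUE))  # a bootstrap-style resample mean
}

n_cores <- detectCores() - 1       # leave one core free for the OS

# Sequential version
system.time(res_seq <- lapply(1:20, slow_task))

# Multi-core version; note that mclapply relies on forking, so on
# Windows mc.cores > 1 is not supported -- use makeCluster/parLapply there
system.time(res_par <- mclapply(1:20, slow_task, mc.cores = n_cores))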
Last, but not least, the creation of machine learning models will require programming in one or more languages. This in itself demands a level of knowledge and skill in applying algorithms and using appropriate data structures and other computer science concepts.
In popular media and literature, predictive analytics is known by various names. The terms are used interchangeably and often depend on personal preferences and interpretations. The terms predictive analytics, machine learning, and statistical learning are technically synonymous, and refer to the field of applying algorithms in machine learning to the data.
The algorithm could be as simple as a line of best fit, which you may have already used in Excel, also known as linear regression. Or it could be a complex deep learning model that implements multiple hidden layers and inputs. In both cases, the mere fact that a statistical model, that is, an algorithm, was applied to generate a prediction qualifies the usage as a practice of machine learning.
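For instance, fitting a line of best fit is a one-line operation in R. The following is a minimal sketch using the built-in cars dataset; the choice of dataset and variables is ours, purely for illustration:

# Fit a line of best fit (simple linear regression): stopping
# distance as a function of speed, using the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)
summary(fit)                          # coefficients, R-squared, p-values
predict(fit, data.frame(speed = 21))  # predicted distance at speed = 21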
In general, creating a machine learning model involves a sequence of steps such as the following (a minimal worked sketch follows the list):
Cleanse and curate the dataset to extract the cohort on which the model will be built.
Analyze the data using descriptive statistics, for example, distributions and visualizations.
Perform feature engineering, preprocessing, and other steps necessary to add or remove variables/predictors.
Split the data into a train and test set (for example, set aside 80% of the data for training and the remaining 20% for testing your model).
Select appropriate machine learning models and create the model using cross validation.
Select the final model after assessing the performance across models on a given (one or more) cost metric. Note that the model could be an ensemble, that is, a combination of more than one model.
Perform predictions on the test dataset.
Deliver the final model.
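The following minimal sketch illustrates steps 4 to 7 with an 80/20 split and a simple linear model on the built-in mtcars dataset. The variables and the RMSE cost metric are our choices for illustration; cross-validation and model selection are covered in depth later in the book:

set.seed(42)                                   # make the split reproducible
n     <- nrow(mtcars)
train <- sample(seq_len(n), size = 0.8 * n)    # 80% of rows for training

model <- lm(mpg ~ wt + hp, data = mtcars[train, ])    # fit on the train set
preds <- predict(model, newdata = mtcars[-train, ])   # predict on the test set

# Assess performance on the held-out test set using RMSE
sqrt(mean((mtcars$mpg[-train] - preds)^2))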
The most commonly used languages for machine learning today are R and Python. In Python, the most popular package for machine learning is scikit-learn (http://scikit-learn.org), while in R there are multiple packages, such as randomForest, gbm (Gradient Boosting Machine), and kernlab for Support Vector Machines (SVMs), among others.
Although Python's scikit-learn is extremely versatile and elaborate, and Python is in fact often the preferred language in production settings, the ease of use and diversity of packages in R gives it an advantage in terms of early adoption and use for machine learning exercises.
Popular machine learning tools such as TensorFlow from Google (https://www.tensorflow.org), XGBoost (http://xgboost.readthedocs.io/en/latest/), and H2O (https://www.h2o.ai) have also released packages that act as a wrapper to the underlying machine learning algorithms implemented in the respective tools.
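As a brief example of such a wrapper, the following sketch uses the xgboost R package on simulated data; the data and parameter choices are invented for illustration:

library(xgboost)   # R wrapper around the XGBoost C++ library
set.seed(1)

# Simulated predictors and a binary label, purely for illustration
x <- matrix(rnorm(1000 * 5), ncol = 5)
y <- as.numeric(x[, 1] + rnorm(1000) > 0)

bst <- xgboost(data = x, label = y,
               nrounds = 20,                  # number of boosting rounds
               objective = "binary:logistic", # binary classification
               verbose = 0)

head(predict(bst, x))   # predicted probabilities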
It is a common misconception that machine learning is just about creating models. While that is indeed the end goal, there is a subtle yet fundamental difference between a model and a good model. With the functions available today, it is relatively easy for anyone to create a model by simply running a couple of lines of code. A good model has business value, while a model built without the rigor of formal machine learning principles is practically unusable for all intents and purposes. A key requirement of a good machine learning model is the judicious use of domain expertise to evaluate results, identify errors, analyze them, and further refine using the insights that subject matter experts can provide. This is where domain knowledge plays a crucial and indispensable role.
More often than data scientists would like to admit, machine learning models produce results that are obvious and intuitive. For instance, we once conducted an elaborate analysis of physicians' prescribing behavior to find the strongest predictor of how many prescriptions a physician would write in the next quarter. We used a broad set of input variables, such as the physicians' locations, their specialties, hospital affiliations, prescribing history, and other data. In the end, the best-performing model produced a result that we all knew very well: the strongest predictor of how many prescriptions a physician would write in the next quarter was the number of prescriptions the physician had written in the previous quarter! To filter out the truly meaningful variables and build a more robust model, we eventually had to engage someone who had extensive experience of working in the pharma industry. Machine learning models work best when produced in a hybrid approach, one that combines domain expertise with the sophistication of the models developed.
Data science plays a role in virtually all aspects of our day-to-day lives and is used across nearly all industries. The adoption of data science was largely spurred by the successes of start-ups such as Uber, Airbnb, and Facebook that rose rapidly and earned valuations of billions of dollars in a very short span of time.
Data generated by social media networks such as Facebook and Twitter, search engines such as Google and Yahoo!, and various other networks, such as Pinterest and Instagram, led to a deluge of information about the personal tastes, preferences, and habits of individuals. Companies leveraged this information using various machine learning techniques to gain insights.
For example, Natural Language Processing (NLP) is a machine learning technique used to analyse textual data, such as comments posted on public forums, to extract users' interests. The users are then shown ads relevant to their interests, generating sales from which companies earn ad revenue. Image recognition algorithms are utilized to automatically identify objects in an image and serve the relevant images when users search for those objects on search engines.
The use of data science as a means to not only increase user engagement but also increase revenue has become a widespread phenomenon. Some of the domains in which data science is prevalent are given as follows. The list is not all-inclusive, but it highlights some of the key industries in which data science plays an important role today:
A few of these domains have been discussed in the following sections.
Data science has been used in finance, especially in trading, for many decades. Investment banks, especially trading desks, have employed complex models to analyse and make trading decisions. Some examples of data science as used in finance include the following (a short correlation example follows the list):
Credit risk management: Analysing the creditworthiness of a user by analyzing the historical financial records, assets, and transactions of the user
Loan fraud: Identifying applications for credit or loans that may be fraudulent by analyzing the loan and the applicant's characteristics
Market basket analysis: Understanding the correlation among stocks and other securities and formulating trading and hedging strategies
High-frequency trading: Analyzing trades and quotes to discover pricing inefficiencies and arbitrage opportunities
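As a small taste of the correlation analysis mentioned under market basket analysis, the following sketch computes a correlation matrix for simulated daily returns of three hypothetical tickers; all numbers are invented:

set.seed(7)
# Simulated daily returns for three hypothetical tickers
returns <- data.frame(AAA = rnorm(250, 0.0005, 0.01),
                      BBB = rnorm(250, 0.0003, 0.02),
                      CCC = rnorm(250, 0.0004, 0.015))
round(cor(returns), 2)   # pairwise correlations between the tickers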
Healthcare and related fields, such as pharmaceuticals and life sciences, have also seen a gradual rise in the adoption and use of machine learning. A leading example has been IBM Watson. Developed in the late 2000s, IBM Watson rose to popularity after it won Jeopardy!, a popular quiz contest in the US, in 2011. Today, IBM Watson is being used for clinical research, and several institutions have published preliminary results of success. (Source: http://www.ascopost.com/issues/june-25-2017/how-watson-for-oncology-is-advancing-personalized-patient-care/). The primary impediment to wider adoption has been the extremely high cost of using the system, usually with an uncertain return on investment. Generally, only companies that are well capitalized can invest in the technology.
More common uses of data science in healthcare include:
Epidemiology: Preventing the spread of diseases and other epidemiology-related use cases are being solved with various machine learning techniques. A recent example of the use of clustering to detect the Ebola outbreak received attention, being one of the first times that machine learning was used in a medical use case very effectively. (Source: https://spectrum.ieee.org/tech-talk/biomedical/diagnostics/healthmap-algorithm-ebola-outbreak).
Health insurance fraud detection: The health insurance industry loses billions each year in the US due to fraudulent claims for insurance. Machine learning, and more generally data science, is being used to detect cases of fraud and reduce the loss incurred by leading health insurance firms. (Source: https://www.sciencedirect.com/science/article/pii/S1877042812036099).
Recommender engines: Algorithms that match patients with physicians are used to provide recommendations based on the patients' symptoms and doctor specialties.
Image recognition: Arguably the most common use of data science in healthcare, image recognition algorithms are used for a variety of cases, ranging from segmentation of malignant and non-malignant tumours to cell segmentation. (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159221/).
Although closely linked to the data science use cases in healthcare, data science use cases in pharma are geared toward the development of drugs, physician marketing, and treatment-related analysis. Examples of data science in pharma include the following:
Patient journey and treatment pathways: Understanding the progression of diseases in patients and treatment or therapy outcomes is one of the prime examples of data science in pharma. Several companies have engaged in deep studies related to the development of such tools to understand not only the efficiency of drugs, but also how best to position and market their products. (Source: https://kx.com/blog/use-case-rxdatascience-patient-journey-app/).
Sales field messaging: Using NLP, pharma companies analyse discussions between sales representatives and physicians during sales visits to improve their messaging content and better inform physicians on the potential risks and benefits of medications as needed. (Source: https://www.aktana.com/blog/field-sales/power-personalization-using-advanced-machine-learning-drive-rep-engagement/).
Biomarker analysis: Machine learning for identifying biomarkers and their importance and/or relevance to diseases is used in clinical research, such as cancer-related studies. (Source: https://www.futuremedicine.com/doi/abs/10.2217/pme.15.5?journalCode=pme).
Research and development: The use of machine learning for identifying small and large molecules that treat diseases is another common application of data science in pharma. It is a challenging task, and several large pharma companies have engaged teams to solve such use cases. (Source: https://www.kaggle.com/c/MerckActivity).
Data science is used by state and national governments for a wide range of uses. These include topics in cyber security, voter benefits, climate change, social causes, and other similar use cases that are geared toward public policy and public benefits.
Some examples include the following:
Climate change: One of the most popular topics among climate change proponents; there is extensive machine learning-related work being conducted around the globe to detect and understand the causes of climate change. (Source: https://toolkit.climate.gov).
Cyber security: The use of extremely advanced machine learning techniques for national cyber security is evident and well known all over the world, ever since such practices were disclosed by consultants at security firms a few years back. Security-related organizations employ some of the most advanced hardware and software stacks for detecting cyber threats and preventing hacking attempts. (Source: https://www.csoonline.com/article/2942083/big-data-security/cybersecurity-is-the-killer-app-for-big-data-analytics.html).
Social causes: The use of data science for a wide range of use cases geared toward social good is well known, thanks to the several conferences and papers that have been organized and released, respectively, on the topic. Examples include topics in urban analytics, power grids utilizing smart meters, and criminal justice. (Source: https://dssg.uchicago.edu/data-science-for-social-good-conference-2017/agenda/).
The manufacturing and retail industries have used data science to design better products, optimize pricing, and devise strategic marketing techniques. Some examples include the following:
Price optimization: Generally related to the realm of linear programming, the challenge of price optimization, that is, pricing products, is now also being addressed with the help of machine learning. Market conditions, user preferences, and other factors are used as inputs for dynamic pricing to assess the optimal pricing of products. (Source: https://www.datasciencecentral.com/profiles/blogs/price-optimisation-using-decision-tree-regression-tree).
Retail sales: Retailers use algorithms to determine future sales forecasts, price discounts, and promotion sequences. (Source: http://www.oliverwyman.com/our-expertise/insights/2017/feb/machine-learning-for-retail.html).
Production capacity and maintenance: In manufacturing, data science is being used to determine device maintenance requirements and equipment effectiveness, optimize production lines, and much more. Overall supply chain management is an area that has benefited, and continues to profit, from the smart use of machine learning. (Source: https://www.forbes.com/sites/louiscolumbus/2016/06/26/10-ways-machine-learning-is-revolutionizing-manufacturing/#51d4927228c2).
One of the earliest beneficiaries of data science was the web industry. Empowered by the collection of user-specific data from social networks, firms around the world employ algorithms to understand user behavior and generate targeted ads. Google, one of the earliest proponents of targeted ad marketing, earns most of its revenue from ads, more than $95 billion in 2017. (Source: https://www.statista.com/statistics/266249/advertising-revenue-of-google/). The use of data science for web-related businesses is ubiquitous today, and companies such as Uber, Airbnb, Netflix, and Amazon have successfully navigated and made full use of this complex ecosystem, not only generating huge profits but also adding millions of new jobs, directly or indirectly, as a result. Some examples include the following:
Targeted ads: Click-through ads have been one of the prime areas of machine learning. By reading cookies saved on users' computers from various sites, other sites can assess users' interests and accordingly decide which ads to serve when they visit new sites. As per online sources, the value of internet advertising is over $1 trillion, and it generated over 10 million jobs in 2017 alone. (Source: https://www.iab.com/insights/economic-value-advertising-supported-internet-ecosystem/).
Recommender engines: Netflix, Pandora, and other movie and audio streaming services utilize recommender engines to understand which movies or music the viewer or listener would be interested in and make recommendations. The recommendations are often based on what other users with similar tastes might have already seen, and they leverage recommender algorithms such as collaborative, content-based, and hybrid filtering.
Web design: Using A/B testing, mouse tracking, and other sophisticated techniques, web developers leverage data science to design better web pages, such as landing pages, and websites in general. A/B testing, for instance, allows developers to decide between different versions of the same web page and deploy accordingly; a short worked example follows this list.
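The following minimal sketch shows the statistical core of such an A/B test in base R, using a two-sample proportion test on invented conversion counts for two versions of a page:

# Hypothetical results: version A converts 120 of 2,000 visitors,
# version B converts 155 of 2,000 visitors
conversions <- c(120, 155)
visitors    <- c(2000, 2000)

# Two-sample test for equality of proportions; a small p-value
# suggests the difference is unlikely to be due to chance alone
prop.test(conversions, visitors)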
There are various other industries today that benefit from data science, and it has become so common that it would be impractical to list them all. At a high level, some of the others include the following:
Oil and natural gas for oil production
Meteorology for understanding weather patterns
Space research for detecting and/or analyzing stars and galaxies
Utilities for energy production and energy savings
Biotechnology for research and finding new cures for diseases
In general, since data science and machine learning algorithms are not specific to any particular industry, it is entirely possible to apply them to creative use cases and derive business benefits.
Data science is being used today to solve problems ranging from poverty alleviation to scientific research. It has emerged as the leading discipline that aims to disrupt the industry's status quo and provide a new alternative to pressing business issues.
However, while the promise of data science and machine learning is immense, it is important to bear in mind that it takes time and effort to realize the benefits. The return-on-investment on a machine learning project typically takes a fairly long time. It is thus essential to not overestimate the value it can bring in the short run.
A typical data science project in a corporate setting requires the collaborative efforts of various groups, on both the technical and the business side. Generally, this means that the project should have a business sponsor and a technical or analytics lead, in addition to the data science team or data scientist. It is important to set expectations at the outset, both in terms of the time it will take to complete the project and the outcome, which may be uncertain until the task has been completed. Unlike other projects that may have a definite goal, it is not possible to predetermine the outcome of machine learning projects.
Some common questions to ask include the following:
What business value does the data science project bring to the organization?
Does it have a critical base of users, that is, would multiple users benefit from the expected outcome of the project?
How long would it take to complete the project and are all the business stakeholders aware of the timeline?
Have the project stakeholders taken all variables that may affect the timeline into account? Projects can often get delayed due to dependencies on external vendors.
Have we considered all other potential business use cases and made an assessment of what approach would have an optimal chance of success?
A few salient points for successful data science projects are given as follows:
Find projects or use cases related to business operations that are:
Challenging
Not necessarily complex, that is, they can be simple tasks but which add business value
Intuitive, easily understood (you can explain it to friends and family)
Time-consuming to accomplish today, or requiring a lot of manual effort
Used frequently by a range of users and the benefits of the outcome would have executive visibility
Identify low difficulty–high value (shorter) versus high difficulty–high value (longer) projects
Educate business sponsors, share ideas, show enthusiasm (it's like a long job interview)
Score early wins on low difficulty–high value projects; create minimum viable solutions and get management buy-in before enhancing them (takes time)
Early wins act as a catalyst to foster executive confidence, and also make it easier to justify budgets, making it easier to move on to high difficulty–high value tasks
Being arguably the oldest and consequently the most mature language for statistical operations, R has been used by statisticians all over the world for over 20 years. The precursor to R was the S programming language, written by John Chambers in 1976 in Bell Labs. R, named after the initials of its developers, Ross Ihaka and Robert Gentleman, was implemented as an open source equivalent to S while they were at the University of Auckland.
The language has gained immensely in popularity since the early 2000s, averaging between 20% and 30% growth on a year-on-year basis.
In 2018, there were more than 12,000 R packages, up from about 7,500 just 3 years before, in 2015.
A few key features of R make it not only very easy to learn, but also very versatile, thanks to the number of available packages.
The key features of R are as follows:
Data mining: The R package data.table, developed by Dowle and Srinivasan, is arguably one of the most sophisticated packages for data mining in any language, and it provides R users with the ability to query millions, if not billions, of rows of data (a short example follows this list). In addition, there is tibble, an alternative to data.frame developed by Hadley Wickham. Other packages from Wickham include plyr, dplyr, and ggplot2 for visualization.
Visualizations: The ggplot2 package is the most commonly used visualization package in R. Packages such as rCharts and htmlwidgets have also become extremely popular in recent years. Most of these packages allow R users to leverage elegant graphics features commonly found in JavaScript packages such as D3. Many of them act as wrappers for popular JavaScript visualization libraries to facilitate the creation of graphics elements in R.
Data science: R has had various statistical libraries used for research for many years. With the growth of data science as a popular subject in the public domain, R users have released and further developed both new and existing packages that allow users to deploy complex machine learning algorithms. Examples include randomForest and gbm.
General availability of packages: The 12,000+ packages in R provide coverage for a wide range of projects. These include packages for machine learning, data science, and even general-purpose needs such as web scraping, cartography, and even fisheries sciences. Due to this rich ecosystem that can cater to the needs of a wide variety of use cases, R has grown exponentially in popularity. Whether you are working with JSON files or trying to solve an obscure machine learning problem, it is very likely that someone in the R community has already developed a package that contains (or can indirectly fulfill) the functionality you need.
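As a quick taste of the data.table syntax mentioned in the data mining item above, the following sketch builds a keyed table of simulated sales and aggregates it by group; the data is invented for illustration:

library(data.table)
set.seed(3)

# A simulated sales table with one million rows
dt <- data.table(region = sample(c("North", "South", "East"), 1e6, replace = TRUE),
                 sales  = runif(1e6, 0, 100))

setkey(dt, region)   # keying enables fast binary-search subsetting

dt["North"]                                                  # keyed lookup
dt[, .(total = sum(sales), avg = mean(sales)), by = region]  # aggregation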
Setting up R and RStudio

This book will focus on using R for data science-related tasks. The language R, as mentioned, is available as an open source product from http://r-project.org. In addition, we will be installing RStudio, an IDE (a graphical user interface) for writing and running our R code, as well as R Shiny, a platform that allows users to develop elegant dashboards.
Downloading and installing R involves the following steps:

1. Go to http://r-project.org and click on the CRAN link (http://cran.r-project.org/mirrors.html).
2. Select any one of the links on the corresponding page. These are links to CRAN Mirrors, that is, sites that host R packages and R installation files.
3. Once you select and click on a link, you'll be taken to a page with links to download R for different operating systems, such as Windows, macOS, and Linux. Select the distribution that you need to start the download process.
4. On the R for Windows download page, click on install R for the first time if it is a new installation, then download and install the .exe file for R.
5. The R for macOS installation process will require you to download the .dmg file. Select the default options for installation if you do not intend to make any changes, such as installing in a different directory.
You will also need to download and install RStudio and R Shiny. RStudio is used as the frontend, which you'll use to develop your R code. It is not strictly necessary to use RStudio to write code in R, as you can launch the R console from the desktop (Windows), but RStudio has a nicer, more user-friendly interface that makes it easier to code in R.
Download RStudio and R Shiny from https://www.rstudio.com as follows:

1. Click on Products in the top menu and select RStudio to download and install the software.
2. Download the open source version of RStudio. Note that there are other versions, which are paid, commercial versions of the software. For our exercise, we'll be using the open source version only. Download it from https://www.rstudio.com/products/rstudio/download/.
Once you have installed RStudio, launch the application. This will bring up the following screen. There are four panels in RStudio; the first three are shown when you first launch RStudio:
Click on File | New File | R Script. This will open a new panel. This is the section where you'll be writing your R code.
RStudio is a very mature interface for developing R code and has been in use for several years. You should familiarize yourself with the different features in RStudio as you'll be using the tool throughout the book.
In this section, we will create our first R program for data analysis. We'll use the human development data available from the United Nations development program. The initiative produces a Human Development Index (HDI) corresponding to each country, which signifies the level of economic development, including general public health, education, and various other societal factors.
The following diagram from the UN Development Programme's website summarizes the concept at a high level:
In this exercise, we will be looking at the life expectancy and expected years of schooling on a per country per year basis starting from 1990 onward. Not all data is available for all countries, due to various geopolitical and other reasons that have made it difficult to obtain data for respective years.
The datasets for the HDI program have been obtained from http://hdr.undp.org/en/data.
In the exercises, the data has been cleaned and formatted to make it easier for the reader to analyse the information, especially given that this is the first chapter of the book. Download the data from the Packt code repository for this book. The following are the steps to complete the exercise:
1. Launch RStudio and click on File | New File | R Script.
2. Save the file as Chapter1.R.
3. Copy the commands shown in the following script and save the file.
4. Install the required packages for this exercise by running the following command. First, copy the command into the code window in RStudio:

install.packages(c("data.table","plotly","ggplot2","psych"))

Then, place your cursor on the line and click on Run.
This will install the respective packages in your system. In case you encounter any errors, search on Google for the cause of the error. There are various online forums, such as Stack Overflow, where you can search for common errors and learn how to fix them. Since errors can depend on the specific configuration of your machine, we cannot identify all of them, but it is very likely that someone else might have experienced the same error conditions.
We have already created the requisite CSV files, and the following code illustrates the entire process of reading in the CSV files and analyzing the data:
# We'll install the following packages:
## data.table: a package for managing & manipulating datasets in R
## plotly: a graphics library that has gained popularity in recent years
## ggplot2: another graphics library that is extremely popular in R
## psych: a tool for psychometry that also includes some very helpful statistical functions
install.packages(c("data.table","plotly","ggplot2","psych"))

# Load the libraries
# This is necessary if you will be using functionalities that are available outside
# the functions already available as part of standard R
library(data.table)
library(plotly)
library(ggplot2)
library(psych)
library(RColorBrewer)

# In R, packages contain multiple functions, and once the package has been loaded,
# the functions become available in your workspace
# To find more information about a function, at the R console, type in ?function_name
# Note that you should replace function_name with the name of the actual function
# This will bring up the relevant help notes for the function
# Note that the "R Console" is the interactive screen generally found in RStudio

# Read in the Human Development Index file
hdi <- fread("ch1_hdi.csv", header=T) # The command fread can be used to read in a CSV file

# View contents of hdi
head(hdi) # View the top few rows of the data table hdi
The output of the preceding code is as follows:
Read the life expectancy file by using the following code:
life <- fread("ch1_life_exp.csv", header=T)
# View contents of life
head(life)
The output of the code file is as follows:
Read the years of schooling file by using the following code:
# Read Years of Schooling File
school <- fread("ch1_schoolyrs.csv", header=T)
# View contents of school
head(school)
The output of the preceding code is as follows:
Now we will read the country information:
iso <- fread("ch1_iso.csv")
# View contents of iso
head(iso)
The following is the output of the previous code:
Here we will see the processing of the hdi table by using the following code:
# Use melt.data.table to change hdi into a long table format
hdi <- melt.data.table(hdi, 1, 2:ncol(hdi))
# Set the names of the columns of hdi
setnames(hdi, c("Country","Year","HDI"))

# Process the life table
# Use melt.data.table to change life into a long table format
life <- melt.data.table(life, 1, 2:ncol(life))
# Set the names of the columns of life
setnames(life, c("Country","Year","LifeExp"))

# Process the school table
# Use melt.data.table to change school into a long table format
school <- melt.data.table(school, 1, 2:ncol(school))