R is one of the most widely used programming languages for statistics, and when applied to data science, this powerful combination helps tame the complexities of unstructured, real-world datasets. This book covers the entire data science ecosystem for aspiring data scientists, taking you from zero to a level where you are confident enough to get hands-on with real-world data science problems.
The book starts with an introduction to data science and presents popular R libraries for executing routine data science tasks. It covers the important processes in data science, such as gathering data, cleaning it, and uncovering patterns in it. You will explore machine learning algorithms, predictive analytical models, and deep learning. You will also learn to use the most powerful visualization packages available in R, so that you can easily derive insights from your data.
Towards the end, you will also learn how to integrate R with Spark and Hadoop and perform large-scale data analytics without much complexity.
Page count: 475
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Joshua Nadar
Content Development Editor: Karan Thakkar
Technical Editor: Suwarna Patil
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta
First published: November 2018
Production reference: 1301118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78913-940-2
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Vitor Bianchi Lanzetta (@vitorlanzetta) has a master's degree in Applied Economics (University of São Paulo—USP) and works as a data scientist in a tech start-up named RedFox Digital Solutions. He has also authored a book called R Data Visualization Recipes. The things he enjoys the most are statistics, economics, and sports of all kinds (electronic ones included). His blog, made in partnership with Ricardo Anjoleto Farias (@R_A_Farias), can be found at ArcadeData dot org; they kindly call it R-Cade Data.
Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.
I'd like to thank my wife, Sara, for her caring support and understanding as I worked on the book at weekends and evenings, and to my parents, parents-in-law, sister, and grandmother for all their support, guidance, tutelage, and encouragement over the years. I'd also like to thank Packt, especially the editors, Tushar Gupta, and Karan Thakkar, and everyone else in the team, whose persistence and attention to detail has been exemplary.
Ricardo Anjoleto Farias is an economist who graduated from the Universidade Estadual de Maringá in 2014. In addition to being a sports enthusiast (electronic or otherwise) and enjoying a good barbecue, he also likes math, statistics, and correlated studies. His first contact with R was when he embarked on his master's degree, and since then, he has tried to improve his skills with this powerful tool.
Doug Ortiz is the founder of Illustris, LLC, and is an experienced enterprise cloud, big data, data analytics, and solutions architect who has architected, designed, developed, reengineered, and integrated enterprise solutions. His other areas of expertise include Amazon Web Services, Azure, Google Cloud, Business Intelligence, Hadoop, Spark, NoSQL databases, and SharePoint, to name but a few.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Data Science with R
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Getting Started with Data Science and R
Introduction to data science
Key components of data science
Computer science
Predictive analytics (machine learning)
Domain knowledge
Active domains of data science
Finance
Healthcare
Pharmaceuticals
Government
Manufacturing and retail
Web industry
Other industries
Solving problems with data science
Using R for data science
Key features of R
Our first R program
UN development index
Summary
Quiz
Descriptive and Inferential Statistics
Measures of central tendency and dispersion
Measures of central tendency
Calculating mean, median, and mode with base R
Measures of dispersion
Useful functions to draw automated summaries
Statistical hypothesis testing
Running t-tests with R
Decision rule – a brief overview of the p-value approach
Be careful
Running z-tests with R
Elaborating a little longer
A/B testing – a brief introduction and a practical example with R
Summary
Quiz
Data Wrangling with R
Introduction to data wrangling with R
Data types, formats, and sources
Data extraction, transformation, and load
Basic tools of data wrangling
Using base R for data manipulation and analysis
Applying families of functions
Aggregation functions
Merging DataFrames
Using tibble and dplyr for data manipulation
Basic dplyr usage
Using select
Filtering with filter
Using arrange for sorting
Summarise
Sampling data
The tidyr package
Converting wide tables into long tables
Converting long tables into wide tables
Joining tables
dbplyr – databases and dplyr
Using data.table for data manipulation
Grouping operations
Adding a column
Ordering columns
What is the advantage of searching using key by?
Creating new columns in data.table
Deleting a column
Pivots on data.table
The melt functionality
Reading and writing files with data.table
A special note on dates and/or time
Miscellaneous topics
Checking data quality
Reading other file formats – Excel, SAS, and other data sources
On-disk formats
Working with web data
Web APIs
Tutorial – looking at airline flight times data
Summary
Quiz
KDD, Data Mining, and Text Mining
Good practices of KDD and data mining
Stages of KDD
Scraping a dwarf name
Retrieving text from the web
Legality of web scraping
Web scraping made easy with rvest
Retrieving tweets from R community
Creating your Twitter application
Fetching the number of tweets
Cleaning and transforming data
Looking for patterns – peeking, visualizing, and clustering data
Peeking data
Visualizing data
Cluster analysis
Summary
Quiz
Data Analysis with R
Preparing data for analysis
Data categories
Data types in R
Reading data
Managing data issues
Mixed data types
Missing data
Handling strings and dates
Handling dates using POSIXct or POSIXlt
Handling strings in R
Reading data
Combining strings
Simple pattern matching and replacement with R
Printing results
Data visualisation
Types of charts – basic primer
Histograms
Line plots
Scatter plots
Boxplots
Bar charts
Heatmaps
Summarizing data
Saving analysis for future work
Packrat
Checkpoint
Rocker
Summary
Quiz
Machine Learning with R
What is machine learning?
Machine learning everywhere
Machine learning vocabulary
Generic problems solved by machine learning
Linear regression with R
Tricks for lm
Tree models
Strengths and weaknesses
The Chilean plebiscite data
Starting with decision trees
Growing trees with tree and rpart
Random forests – a collection of trees
Support vector machines
What about regressions?
Hierarchical and k-means clustering
Neural networks
Introduction to feedforward neural networks with R
Summary
Quiz
Forecasting and ML App with R
The UI and server
Forecasting machine learning application
Application details
Summary
Quiz
Neural Networks and Deep Learning
Daily neural nets
Overview – NNs and deep learning
Neuroscience inspiration
ANN nodes
Activation functions
Layers
Training algorithms
NNs with Keras
Getting things ready for Keras
Getting practical with Keras
Further tips
Summary
Quiz
Markovian in R
Markovian-type models
Markovian models – real-world applications
The Markov chain
Programming an HMM with R
Summary
Quiz
Visualizing Data
Retrieving and cleaning data
Crafting visualizations
Summary
Quiz
Going to Production with R
What is R Shiny?
How to build a Shiny app
Building an application inside R
The reactive and isolate functions
The observeEvent and eventReactive functions
Approach for creating a data product from statistical modeling and web UI
Some advice about Shiny
Summary
Quiz
Large Scale Data Analytics with Hadoop
Installing the package and Spark
Manipulating Spark data using both dplyr and SQL
Filtering and aggregating Spark datasets
Using Spark machine learning or H2O Sparkling Water
Providing interfaces to Spark packages
Spark DataFrames within the RStudio IDE
Summary
Quiz
R on Cloud
Cloud computing
Cloud types
Things to look for
Why Azure?
Azure registration
Azure Machine Learning Studio
How modules work
Building an experiment that uses R
Summary
Quiz
The Road Ahead
Growing your skills
Gathering data
Content to stay tuned to
Meeting Stack Overflow
Other Books You May Enjoy
Leave a review - let other readers know what you think
Hands-On Data Science with R deals with the practical aspects of R development, more so than the theoretical. In other words, the emphasis is on how to use R for different data science-related activities, such as machine learning and data mining, as well as topics in visualization, cloud computing, and others. Note that much of the book assumes some prior familiarity with R; it is intended for intermediate R users. While a number of introductory explanations, such as instructions for installing RStudio, have been provided, the reader may find some topics more advanced, and these necessitate prior experience with R programming.
If you are a budding data scientist keen to learn data science with R, or a developer looking to step into the world of data analysis, this book is the ideal resource to get you started. Some programming experience in R will be helpful in terms of getting the most out of this book.
Chapter 1, Getting Started with Data Science and R, provides an introduction to the field of data science, its applicability in different industry domains, an overview of the machine learning process, and how to install RStudio in order to get started with R development. It also introduces the reader to programming in R, starting off at an intermediate level to facilitate an analysis of the Human Development Index (HDI), published by the UN Development Programme. The HDI signifies a country's level of economic development, including general public health, education, and various other societal factors.
Chapter 2, Descriptive and Inferential Statistics, introduces fundamental statistical analysis using R, including techniques to perform random sampling, hypothesis testing, and non-parametric tests. This chapter contains extensive examples of commands in R for performing common analyses, such as t-tests and z-tests, and includes the use of some well-known statistical packages, such as Hmisc.
Chapter 3, Data Wrangling with R, provides an introduction to packages available in R to slice and manipulate data. Packages that are available as part of the tidyverse set of packages, such as dplyr, and, more generally, the apply family of functions in R, are introduced. The chapter is example-heavy, in that several examples are provided to guide the reader on how to apply the functions in the respective packages.
Chapter 4, KDD, Data Mining, and Text Mining, includes extensive discussions on the art of extracting information from unstructured data sources, such as websites and Twitter. KDD (Knowledge Discovery in Databases) is a popular term in the data science community, and this chapter does full justice to the topic by providing step-by-step examples so as to give a holistic overview of the subject matter. Sections on web scraping, data transformation, and data visualization have been included. Examples of how to leverage packages such as rvest and httr to perform such operations are also discussed at length.
Chapter 5, Data Analysis with R, covers a general introduction to data types and data categories in R as they apply to machine learning, manipulating strings and dates, and charting with R. This chapter is essentially a consolidation of topics that are found elsewhere in the book, but in a more concise format. This chapter can hence be used as a standalone section of the book that does not depend on any other chapter and can be used to gain familiarity with the topics discussed.
Chapter 6, Machine Learning with R, provides a detailed overview of using R for predictive analytics, more generally known as machine learning. It starts out with linear regression, and gradually progresses to more in-depth topics in ML such as decision trees, random forest, and SVMs. Extensively worked-out, hands-on examples, along with visualizations, complement the theoretical discussions in this chapter. The chapter concludes with a discussion on neural networks, one of the most popular fields today in machine learning.
Chapter 7, Forecasting and ML App with R, includes an advanced R Shiny application, complete with custom CSS style sheets, Google fonts, modified data table formats, and the like, for forecasting the revenue and sales of pharmaceutical medications in the UK using the NHS dataset. Such datasets are also known as real-world datasets in the sense that they contain actual data pertaining to physicians' prescribing activities. The application is fully reactive; that is, changing the controls on the frontend will immediately run the respective forecasting algorithm and update the forecast tables. We have also used Prophet, Facebook's machine learning-based forecasting package, whose models can be fitted using Markov Chain Monte Carlo sampling.
Chapter 8, Neural Networks and Deep Learning, initiates a comprehensive discussion, along with hands-on examples, of using R for machine learning with two of the most popular algorithms—neural networks and their more advanced variation, deep learning. Indeed, some of the most successful machine learning projects in the world today, such as self-driving cars and automated assistants such as Siri, are powered by deep learning. This chapter gives readers a unique and robust opportunity to delve into these areas and learn how they, too, can apply some of the same algorithms driving sensational successes in the field of machine learning today.
Chapter 9, Markovian in R, applies to more advanced users who are interested in learning more about Markov processes, which involve finding latent (or hidden) data from information in datasets. This is essentially part of a field known as Bayesian analysis, which allows machine learning practitioners to model states that are not directly visible. Markov models are used in fields such as natural language processing and object recognition.
Chapter 10, Visualizing Data, provides a comprehensive introduction to various plotting libraries in R. In particular, libraries such as ggplot2, rCharts, and mapping libraries have been discussed at length. R is well known for its presentation-grade libraries that are capable of creating stunning, professional-grade visualizations. The chapter walks the reader through many of the plotting libraries that have made R a mainstay of the data visualization field.
Chapter 11, Going to Production with R, provides an introduction to the Shiny R package, a tool for the development of interactive applications. This chapter delves into how it works, how reactivity works, the basics of its template, how to build a basic application, and how to build one using a real dataset. If you want a package to present your data to people who are unfamiliar with the R language, maybe you should start by learning the Shiny App.
Chapter 12, Large Scale Data Analytics with Hadoop, covers Apache Spark, an engine for large-scale data processing that is similar, but not identical, to Apache Hadoop. Since its focus is on processing, you can use it entirely from your RStudio console. This chapter teaches you how to install Spark and take your first steps with it using sparklyr, an R package that provides a Spark backend for the dplyr package. In this way, you can use dplyr functions to manipulate big datasets inside a Spark cluster.
Chapter 13, R on Cloud, takes an in-depth look at using AzureML on the Microsoft Azure (cloud) platform. Cloud computing has allowed companies across the world to transition from a traditional data center-oriented architecture to a cloud-based decentralized environment. Unsurprisingly, machine learning has become a major part of the success of the cloud due to the ease of deploying multi-node clusters for large-scale machine learning. AzureML is an easy-to-use web-based platform from Microsoft that allows even new data scientists to get a jump start on machine learning via a GUI-based interface.
Appendix A, The Road Ahead, introduces the reader to various resources on the web, such as blogs and forums to utilize and learn more about the field of R. The world of R is rapidly evolving, and in this chapter, we share some insights on the specific resources that will help seasoned data scientists stay abreast of all the developments in R today.
Readers should have a basic knowledge of R and the Shiny package.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Data-Science-with-R. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789139402_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The ggplot2 package is the most commonly used visualization package in R."
A block of code is set as follows:
life <- fread("ch1_life_exp.csv", header=T)
# View contents of life
head(life)
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "After logging in, search for the topic Cost Management + Billing in the left-hand menu, as shown in the following screenshot."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Data, like science, has been ubiquitous the world over since early history. The term data science is not generally taken to literally mean science with data, since without data there would be no science. Rather, it is a specialized field in which data scientists and other practitioners apply advanced computing techniques, usually along with algorithms or predictive analytics, to uncover insights that may be challenging to obtain with traditional methods.
Data science as a distinct subject has been proposed since the early 1960s by pioneers and thought leaders such as Peter Naur, Prof. Jeff Wu, and William Cleveland. Today, we have largely realized the vision that Prof. Wu and others had in mind when the concept first arose: data science as an amalgamation of computing, data mining, and predictive analytics, all leading up to deriving key insights that drive business and growth across the world today.
The driving force behind this has been the rapid but proportional growth of computing capabilities and algorithms. Computing languages have also played a key role in supporting the emergence of data science, primary among them being the statistical language R.
In this introductory chapter, we will cover the following topics:
Introduction to data science and R
Active domains of data science
Solving problems with data science
Using R for data science
Setting up R and RStudio
Our first R program
The practice of data science requires the application of three distinct disciplines to uncover insights from data. These disciplines are as follows:
Computer science
Predictive analytics
Domain knowledge
The following diagram shows the core components of data science:
During the course of performing data science, if large datasets are involved, the practitioner may spend a fair amount of time cleansing and curating the dataset. In fact, it is not uncommon for data scientists to spend the majority of their time preparing data for analysis. The generally accepted distribution of time for a data science project involves 80% spent in data management and the remaining 20% spent in the actual analysis of the data.
While this may seem or sound overly general, the growth of big data, that is, large-scale datasets, usually in the range of terabytes, has meant that it takes sufficient time and effort to extract data before the actual analysis takes place. Real-world data is seldom perfect. Issues with real-world data range from missing variables to incorrect entries and other deficiencies. The size of datasets also poses a formidable challenge.
Technologies such as Hadoop, Spark, and NoSQL databases have addressed the needs of the data science community for managing and curating terabytes, if not petabytes, of information. These tools are usually the first step in the overall data science process that precedes the application of algorithms to the datasets using languages such as R, Python, and others.
Hence, as a first step, the data scientist generally should be capable of working with datasets using contemporary tools for large-scale data mining. For instance, if the data resides in a Hadoop cluster, the practitioner must be able and willing to perform the work necessary to retrieve and curate the data from the source systems.
Second, once the data has been retrieved and curated, the data scientist should be aware of the computational requirements of the algorithm and determine whether the system has the necessary resources to execute it efficiently. For instance, if an algorithm can take advantage of multi-core computing facilities, the practitioner must use the appropriate packages and functions to leverage them. This may mean the difference between getting results in an hour and requiring an entire day.
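As a minimal sketch of this idea, the following uses base R's parallel package to compare a sequential apply with a multi-core one; the bootstrap-style task here is invented purely for illustration:

library(parallel)

slow_task <- function(i) {
  x <- rnorm(1e6)                  # simulate a large sample
  mean(sample(x, replace = TRUE))  # a bootstrap-style resample mean
}

n_cores <- detectCores() - 1       # leave one core free for the OS

# Sequential version
system.time(res_seq <- lapply(1:20, slow_task))

# Multi-core version; note that mclapply relies on forking, so on
# Windows mc.cores > 1 is not supported -- use makeCluster/parLapply there
system.time(res_par <- mclapply(1:20, slow_task, mc.cores = n_cores))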
Last, but not least, the creation of machine learning models will require programming in one or more languages. This in itself demands a level of knowledge and skill in applying algorithms and using appropriate data structures and other computer science concepts.
In popular media and literature, predictive analytics is known by various names. The terms are used interchangeably and often depend on personal preferences and interpretations. The terms predictive analytics, machine learning, and statistical learning are technically synonymous, and refer to the field of applying algorithms in machine learning to the data.
The algorithm could be as simple as a line of best fit, which you may have already used in Excel, also known as linear regression. Or it could be a complex deep learning model that implements multiple hidden layers and inputs. In both cases, the mere fact that a statistical model, that is, an algorithm, was applied to generate a prediction qualifies the usage as a practice of machine learning.
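For instance, fitting a line of best fit is a one-line operation in R. The following is a minimal sketch using the built-in cars dataset; the choice of dataset and variables is ours, purely for illustration:

# Fit a line of best fit (simple linear regression): stopping
# distance as a function of speed, using the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)
summary(fit)                          # coefficients, R-squared, p-values
predict(fit, data.frame(speed = 21))  # predicted distance at speed = 21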
In general, creating a machine learning model involves a sequence of steps such as the following (a minimal worked sketch follows the list):
Cleanse and curate the dataset to extract the cohort on which the model will be built.
Analyze the data using descriptive statistics, for example, distributions and visualizations.
Perform feature engineering, preprocessing, and other steps necessary to add or remove variables/predictors.
Split the data into a train and test set (for example, set aside 80% of the data for training and the remaining 20% for testing your model).
Select appropriate machine learning models and create the model using cross validation.
Select the final model after assessing the performance across models on a given (one or more) cost metric. Note that the model could be an ensemble, that is, a combination of more than one model.
Perform predictions on the test dataset.
Deliver the final model.
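The following minimal sketch illustrates steps 4 to 7 with an 80/20 split and a simple linear model on the built-in mtcars dataset. The variables and the RMSE cost metric are our choices for illustration; cross-validation and model selection are covered in depth later in the book:

set.seed(42)                                   # make the split reproducible
n     <- nrow(mtcars)
train <- sample(seq_len(n), size = 0.8 * n)    # 80% of rows for training

model <- lm(mpg ~ wt + hp, data = mtcars[train, ])    # fit on the train set
preds <- predict(model, newdata = mtcars[-train, ])   # predict on the test set

# Assess performance on the held-out test set using RMSE
sqrt(mean((mtcars$mpg[-train] - preds)^2))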
The most commonly used languages for machine learning today are R and Python. In Python, the most popular package for machine learning is scikit-learn (http://scikit-learn.org), while in R there are multiple packages, such as randomForest, gbm (Gradient Boosting Machine), and kernlab for Support Vector Machines (SVMs), among others.
Although Python's scikit-learn is extremely versatile and elaborate, and Python is in fact often the preferred language in production settings, the ease of use and diversity of packages in R gives it an advantage in terms of early adoption and use for machine learning exercises.
Popular machine learning tools such as TensorFlow from Google (https://www.tensorflow.org), XGBoost (http://xgboost.readthedocs.io/en/latest/), and H2O (https://www.h2o.ai) have also released packages that act as a wrapper to the underlying machine learning algorithms implemented in the respective tools.
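As a brief example of such a wrapper, the following sketch uses the xgboost R package on simulated data; the data and parameter choices are invented for illustration:

library(xgboost)   # R wrapper around the XGBoost C++ library
set.seed(1)

# Simulated predictors and a binary label, purely for illustration
x <- matrix(rnorm(1000 * 5), ncol = 5)
y <- as.numeric(x[, 1] + rnorm(1000) > 0)

bst <- xgboost(data = x, label = y,
               nrounds = 20,                  # number of boosting rounds
               objective = "binary:logistic", # binary classification
               verbose = 0)

head(predict(bst, x))   # predicted probabilities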
It is a common misconception that machine learning is just about creating models. While that is indeed the end goal, there is a subtle yet fundamental difference between a model and a good model. With the functions available today, it is relatively easy for anyone to create a model by simply running a couple of lines of code. A good model has business value, while a model built without the rigor of formal machine learning principles is practically unusable for all intents and purposes. A key requirement of a good machine learning model is the judicious use of domain expertise to evaluate results, identify errors, analyze them, and further refine using the insights that subject matter experts can provide. This is where domain knowledge plays a crucial and indispensable role.
More often than data scientists would like to admit, machine learning models produce results that are obvious and intuitive. For instance, we once conducted an elaborate analysis of physicians' prescribing behavior to find the strongest predictor of how many prescriptions a physician would write in the next quarter. We used a broad set of input variables, such as the physicians' locations, their specialties, hospital affiliations, prescribing history, and other data. In the end, the best-performing model produced a result that we all knew very well: the strongest predictor of how many prescriptions a physician would write in the next quarter was the number of prescriptions the physician had written in the previous quarter! To filter out the truly meaningful variables and build a more robust model, we eventually had to engage someone who had extensive experience of working in the pharma industry. Machine learning models work best when produced in a hybrid approach, one that combines domain expertise with the sophistication of the models developed.
Data science plays a role in virtually all aspects of our day-to-day lives and is used across nearly all industries. The adoption of data science was largely spurred by the successes of start-ups such as Uber, Airbnb, and Facebook that rose rapidly and earned valuations of billions of dollars in a very short span of time.
Data generated by social media networks such as Facebook and Twitter, search engines such as Google and Yahoo!, and various other networks, such as Pinterest and Instagram, led to a deluge of information about the personal tastes, preferences, and habits of individuals. Companies leveraged this information using various machine learning techniques to gain insights.
For example, Natural Language Processing (NLP) is a machine learning technique used to analyse textual data, such as comments posted on public forums, to extract users' interests. The users are then shown ads relevant to their interests, generating sales from which companies earn ad revenue. Image recognition algorithms are utilized to automatically identify objects in an image and serve the relevant images when users search for those objects on search engines.
The use of data science as a means to not only increase user engagement but also increase revenue has become a widespread phenomenon. Some of the domains in which data science is prevalent are given as follows. The list is not all-inclusive, but it highlights some of the key industries in which data science plays an important role today:
A few of these domains have been discussed in the following sections.
Data science has been used in finance, especially in trading, for many decades. Investment banks, especially trading desks, have employed complex models to analyse and make trading decisions. Some examples of data science as used in finance include the following (a short correlation example follows the list):
Credit risk management: Analysing the creditworthiness of a user by analyzing the historical financial records, assets, and transactions of the user
Loan fraud: Identifying applications for credit or loans that may be fraudulent by analyzing the loan and the applicant's characteristics
Market basket analysis: Understanding the correlation among stocks and other securities and formulating trading and hedging strategies
High-frequency trading: Analyzing trades and quotes to discover pricing inefficiencies and arbitrage opportunities
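As a small taste of the correlation analysis mentioned under market basket analysis, the following sketch computes a correlation matrix for simulated daily returns of three hypothetical tickers; all numbers are invented:

set.seed(7)
# Simulated daily returns for three hypothetical tickers
returns <- data.frame(AAA = rnorm(250, 0.0005, 0.01),
                      BBB = rnorm(250, 0.0003, 0.02),
                      CCC = rnorm(250, 0.0004, 0.015))
round(cor(returns), 2)   # pairwise correlations between the tickers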
Healthcare and related fields, such as pharmaceuticals and life sciences, have also seen a gradual rise in the adoption and use of machine learning. A leading example has been IBM Watson. Developed in the late 2000s, IBM Watson rose to popularity after it won Jeopardy!, a popular quiz contest in the US, in 2011. Today, IBM Watson is being used for clinical research, and several institutions have published preliminary results of success. (Source: http://www.ascopost.com/issues/june-25-2017/how-watson-for-oncology-is-advancing-personalized-patient-care/). The primary impediment to wider adoption has been the extremely high cost of using the system, usually with an uncertain return on investment. Generally, only companies that are well capitalized can invest in the technology.
More common uses of data science in healthcare include:
Epidemiology: Preventing the spread of diseases and other epidemiology-related use cases are being solved with various machine learning techniques. A recent example of the use of clustering to detect the Ebola outbreak received attention, being one of the first times that machine learning was used in a medical use case very effectively. (Source: https://spectrum.ieee.org/tech-talk/biomedical/diagnostics/healthmap-algorithm-ebola-outbreak).
Health insurance fraud detection: The health insurance industry loses billions each year in the US due to fraudulent claims for insurance. Machine learning, and more generally data science, is being used to detect cases of fraud and reduce the loss incurred by leading health insurance firms. (Source: https://www.sciencedirect.com/science/article/pii/S1877042812036099).
Recommender engines: Algorithms that match patients with physicians are used to provide recommendations based on the patients' symptoms and doctor specialties.
Image recognition: Arguably the most common use of data science in healthcare, image recognition algorithms are used for a variety of cases, ranging from segmentation of malignant and non-malignant tumours to cell segmentation. (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159221/).
Although closely linked to the data science use cases in healthcare, data science use cases in pharma are geared toward the development of drugs, physician marketing, and treatment-related analysis. Examples of data science in pharma include the following:
Patient journey and treatment pathways: Understanding the progression of diseases in patients and treatment or therapy outcomes is one of the prime examples of data science in pharma. Several companies have engaged in deep studies related to the development of such tools to understand not only the efficiency of drugs, but also how best to position and market their products. (Source: https://kx.com/blog/use-case-rxdatascience-patient-journey-app/).
Sales field messaging: Using NLP, pharma companies analyse discussions between sales representatives and physicians during sales visits to improve their messaging content and better inform physicians on the potential risks and benefits of medications as needed. (Source: https://www.aktana.com/blog/field-sales/power-personalization-using-advanced-machine-learning-drive-rep-engagement/).
Biomarker analysis: Machine learning for identifying biomarkers and their importance and/or relevance to diseases is used in clinical research, such as cancer-related studies. (Source: https://www.futuremedicine.com/doi/abs/10.2217/pme.15.5?journalCode=pme).
Research and development: The use of machine learning for identifying small and large molecules that treat diseases is another common application of data science in pharma. It is a challenging task, and several large pharma companies have engaged teams to solve such use cases. (Source: https://www.kaggle.com/c/MerckActivity).
Data science is used by state and national governments for a wide range of uses. These include topics in cyber security, voter benefits, climate change, social causes, and other similar use cases that are geared toward public policy and public benefits.
Some examples include the following:
Climate change: One of the most popular topics among climate change proponents; there is extensive machine learning-related work being conducted around the globe to detect and understand the causes of climate change. (Source: https://toolkit.climate.gov).
Cyber security: The use of extremely advanced machine learning techniques for national cyber security is evident and well known all over the world, ever since such practices were disclosed by consultants at security firms a few years back. Security-related organizations employ some of the most advanced hardware and software stacks for detecting cyber threats and preventing hacking attempts. (Source: https://www.csoonline.com/article/2942083/big-data-security/cybersecurity-is-the-killer-app-for-big-data-analytics.html).
Social causes: The use of data science for a wide range of use cases geared toward social good is well known, thanks to the several conferences and papers that have been organized and released, respectively, on the topic. Examples include topics in urban analytics, power grids utilizing smart meters, and criminal justice. (Source: https://dssg.uchicago.edu/data-science-for-social-good-conference-2017/agenda/).
The manufacturing and retail industries have used data science to design better products, optimize pricing, and devise strategic marketing techniques. Some examples include the following:
Price optimization: Generally related to the realm of linear programming, the challenge of price optimization, that is, pricing products, is now also being addressed with the help of machine learning. Market conditions, user preferences, and other factors are used as inputs for dynamic pricing to assess the optimal pricing of products. (Source: https://www.datasciencecentral.com/profiles/blogs/price-optimisation-using-decision-tree-regression-tree).
Retail sales: Retailers use algorithms to determine future sales forecasts, price discounts, and promotion sequences. (Source: http://www.oliverwyman.com/our-expertise/insights/2017/feb/machine-learning-for-retail.html).
Production capacity and maintenance: In manufacturing, data science is being used to determine device maintenance requirements and equipment effectiveness, optimize production lines, and much more. Overall supply chain management is an area that has benefited, and continues to profit, from the smart use of machine learning. (Source: https://www.forbes.com/sites/louiscolumbus/2016/06/26/10-ways-machine-learning-is-revolutionizing-manufacturing/#51d4927228c2).
One of the earliest beneficiaries of data science was the web industry. Empowered by the collection of user-specific data from social networks, firms around the world employ algorithms to understand user behavior and generate targeted ads. Google, one of the earliest proponents of targeted ad marketing, earns most of its revenue from ads, more than $95 billion in 2017. (Source: https://www.statista.com/statistics/266249/advertising-revenue-of-google/). The use of data science for web-related businesses is ubiquitous today, and companies such as Uber, Airbnb, Netflix, and Amazon have successfully navigated and made full use of this complex ecosystem, not only generating huge profits but also adding millions of new jobs, directly or indirectly, as a result. Some examples include the following:
Targeted ads: Click-through ads have been one of the prime areas of machine learning. By reading cookies saved on users' computers from various sites, other sites can assess users' interests and accordingly decide which ads to serve when they visit new sites. As per online sources, the value of internet advertising is over $1 trillion, and it generated over 10 million jobs in 2017 alone. (Source: https://www.iab.com/insights/economic-value-advertising-supported-internet-ecosystem/).
Recommender engines: Netflix, Pandora, and other movie and audio streaming services utilize recommender engines to understand which movies or music the viewer or listener would be interested in and make recommendations. The recommendations are often based on what other users with similar tastes might have already seen, and they leverage recommender algorithms such as collaborative, content-based, and hybrid filtering.
Web design: Using A/B testing, mouse tracking, and other sophisticated techniques, web developers leverage data science to design better web pages, such as landing pages, and websites in general. A/B testing, for instance, allows developers to decide between different versions of the same web page and deploy accordingly; a short worked example follows this list.
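The following minimal sketch shows the statistical core of such an A/B test in base R, using a two-sample proportion test on invented conversion counts for two versions of a page:

# Hypothetical results: version A converts 120 of 2,000 visitors,
# version B converts 155 of 2,000 visitors
conversions <- c(120, 155)
visitors    <- c(2000, 2000)

# Two-sample test for equality of proportions; a small p-value
# suggests the difference is unlikely to be due to chance alone
prop.test(conversions, visitors)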
There are various other industries today that benefit from data science, and it has become so common that it would be impractical to list them all. At a high level, some of the others include the following:
Oil and natural gas for oil production
Meteorology for understanding weather patterns
Space research for detecting and/or analyzing stars and galaxies
Utilities for energy production and energy savings
Biotechnology for research and finding new cures for diseases
In general, since data science and machine learning algorithms are not specific to any particular industry, it is entirely possible to apply them to creative use cases and derive business benefits.
Data science is being used today to solve problems ranging from poverty alleviation to scientific research. It has emerged as the leading discipline that aims to disrupt the industry's status quo and provide a new alternative to pressing business issues.
However, while the promise of data science and machine learning is immense, it is important to bear in mind that it takes time and effort to realize the benefits. The return-on-investment on a machine learning project typically takes a fairly long time. It is thus essential to not overestimate the value it can bring in the short run.
A typical data science project in a corporate setting requires the collaborative efforts of various groups, on both the technical and the business side. Generally, this means that the project should have a business sponsor and a technical or analytics lead, in addition to the data science team or data scientist. It is important to set expectations at the outset, both in terms of the time it will take to complete the project and the outcome, which may be uncertain until the task has been completed. Unlike other projects that may have a definite goal, it is not possible to predetermine the outcome of machine learning projects.
Some common questions to ask include the following:
What business value does the data science project bring to the organization?
Does it have a critical base of users, that is, would multiple users benefit from the expected outcome of the project?
How long would it take to complete the project and are all the business stakeholders aware of the timeline?
Have the project stakeholders taken all variables that may affect the timeline into account? Projects can often get delayed due to dependencies on external vendors.
Have we considered all other potential business use cases and made an assessment of what approach would have an optimal chance of success?
A few salient points for successful data science projects are given as follows:
Find projects or use cases related to business operations that are:
Challenging
Not necessarily complex, that is, they can be simple tasks but which add business value
Intuitive, easily understood (you can explain it to friends and family)
Time-consuming to accomplish today, or requiring a lot of manual effort
Used frequently by a range of users and the benefits of the outcome would have executive visibility
Identify low difficulty–high value (shorter) versus high difficulty–high value (longer) projects
Educate business sponsors, share ideas, show enthusiasm (it's like a long job interview)
Score early wins on low difficulty–high value projects; create minimum viable solutions and get management buy-in before enhancing them (takes time)
Early wins act as a catalyst to foster executive confidence, and also make it easier to justify budgets, making it easier to move on to high difficulty–high value tasks
Being arguably the oldest and consequently the most mature language for statistical operations, R has been used by statisticians all over the world for over 20 years. The precursor to R was the S programming language, written by John Chambers in 1976 in Bell Labs. R, named after the initials of its developers, Ross Ihaka and Robert Gentleman, was implemented as an open source equivalent to S while they were at the University of Auckland.
The language has gained immensely in popularity since the early 2000s, averaging between 20% and 30% growth on a year-on-year basis.
In 2018, there were more than 12,000 R packages, up from about 7,500 just 3 years before, in 2015.
A few key features of R make it not only very easy to learn, but also very versatile, thanks to the number of available packages.
The key features of R are as follows:
Data mining: The R package data.table, developed by Dowle and Srinivasan, is arguably one of the most sophisticated packages for data mining in any language, and it provides R users with the ability to query millions, if not billions, of rows of data (a short example follows this list). In addition, there is tibble, an alternative to data.frame developed by Hadley Wickham. Other packages from Wickham include plyr, dplyr, and ggplot2 for visualization.
Visualizations: The ggplot2 package is the most commonly used visualization package in R. Packages such as rCharts and htmlwidgets have also become extremely popular in recent years. Most of these packages allow R users to leverage elegant graphics features commonly found in JavaScript packages such as D3. Many of them act as wrappers for popular JavaScript visualization libraries to facilitate the creation of graphics elements in R.
Data science: R has had various statistical libraries used for research for many years. With the growth of data science as a popular subject in the public domain, R users have released and further developed both new and existing packages that allow users to deploy complex machine learning algorithms. Examples include randomForest and gbm.
General availability of packages: The 12,000+ packages in R provide coverage for a wide range of projects. These include packages for machine learning, data science, and even general-purpose needs such as web scraping, cartography, and even fisheries sciences. Due to this rich ecosystem that can cater to the needs of a wide variety of use cases, R has grown exponentially in popularity. Whether you are working with JSON files or trying to solve an obscure machine learning problem, it is very likely that someone in the R community has already developed a package that contains (or can indirectly fulfill) the functionality you need.
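As a quick taste of the data.table syntax mentioned in the data mining item above, the following sketch builds a keyed table of simulated sales and aggregates it by group; the data is invented for illustration:

library(data.table)
set.seed(3)

# A simulated sales table with one million rows
dt <- data.table(region = sample(c("North", "South", "East"), 1e6, replace = TRUE),
                 sales  = runif(1e6, 0, 100))

setkey(dt, region)   # keying enables fast binary-search subsetting

dt["North"]                                                  # keyed lookup
dt[, .(total = sum(sales), avg = mean(sales)), by = region]  # aggregation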
Setting up R and RStudio

This book will focus on using R for data science-related tasks. The language R, as mentioned, is available as an open source product from http://r-project.org. In addition, we will be installing RStudio, an IDE (a graphical user interface) for writing and running our R code, as well as R Shiny, a platform that allows users to develop elegant dashboards.
Downloading and installing R involves the following steps:

1. Go to http://r-project.org and click on the CRAN link (http://cran.r-project.org/mirrors.html).
2. Select any one of the links on the corresponding page. These are links to CRAN Mirrors, that is, sites that host R packages and R installation files.
3. Once you select and click on a link, you'll be taken to a page with links to download R for different operating systems, such as Windows, macOS, and Linux. Select the distribution that you need to start the download process.
4. On the R for Windows download page, click on install R for the first time if it is a new installation, then download and install the .exe file for R.
5. The R for macOS installation process will require you to download the .dmg file. Select the default options for installation if you do not intend to make any changes, such as installing in a different directory.
You will also need to download and install RStudio and R Shiny. RStudio is used as the frontend, which you'll use to develop your R code. It is not strictly necessary to use RStudio to write code in R, as you can launch the R console from the desktop (Windows), but RStudio has a nicer, more user-friendly interface that makes it easier to code in R.
Download RStudio and R Shiny from https://www.rstudio.com as follows:

1. Click on Products in the top menu and select RStudio to download and install the software.
2. Download the open source version of RStudio. Note that there are other versions, which are paid, commercial versions of the software. For our exercise, we'll be using the open source version only. Download it from https://www.rstudio.com/products/rstudio/download/.
Once you have installed RStudio, launch the application. This will bring up the following screen. There are four panels in RStudio; the first three are shown when you first launch RStudio:
Click on File | New File | R Script. This will open a new panel. This is the section where you'll be writing your R code.
RStudio is a very mature interface for developing R code and has been in use for several years. You should familiarize yourself with the different features in RStudio as you'll be using the tool throughout the book.
In this section, we will create our first R program for data analysis. We'll use the human development data available from the United Nations development program. The initiative produces a Human Development Index (HDI) corresponding to each country, which signifies the level of economic development, including general public health, education, and various other societal factors.
The following diagram from the UN Development Programme's website summarizes the concept at a high level:
In this exercise, we will be looking at the life expectancy and expected years of schooling on a per country per year basis starting from 1990 onward. Not all data is available for all countries, due to various geopolitical and other reasons that have made it difficult to obtain data for respective years.
The datasets for the HDI program have been obtained from http://hdr.undp.org/en/data.
In the exercises, the data has been cleaned and formatted to make it easier for the reader to analyse the information, especially given that this is the first chapter of the book. Download the data from the Packt code repository for this book. The following are the steps to complete the exercise:
1. Launch RStudio and click on File | New File | R Script.
2. Save the file as Chapter1.R.
3. Copy the commands shown in the following script and save the file.
4. Install the required packages for this exercise by running the following command. First, copy the command into the code window in RStudio:

install.packages(c("data.table","plotly","ggplot2","psych"))

Then, place your cursor on the line and click on Run.
This will install the respective packages in your system. In case you encounter any errors, search on Google for the cause of the error. There are various online forums, such as Stack Overflow, where you can search for common errors and learn how to fix them. Since errors can depend on the specific configuration of your machine, we cannot identify all of them, but it is very likely that someone else might have experienced the same error conditions.
We have already created the requisite CSV files, and the following code illustrates the entire process of reading in the CSV files and analyzing the data:
# We'll install the following packages:
## data.table: a package for managing & manipulating datasets in R
## plotly: a graphics library that has gained popularity in recent years
## ggplot2: another graphics library that is extremely popular in R
## psych: a tool for psychometry that also includes some very helpful statistical functions
install.packages(c("data.table","plotly","ggplot2","psych"))

# Load the libraries
# This is necessary if you will be using functionalities that are available outside
# the functions already available as part of standard R
library(data.table)
library(plotly)
library(ggplot2)
library(psych)
library(RColorBrewer)

# In R, packages contain multiple functions, and once the package has been loaded,
# the functions become available in your workspace
# To find more information about a function, at the R console, type in ?function_name
# Note that you should replace function_name with the name of the actual function
# This will bring up the relevant help notes for the function
# Note that the "R Console" is the interactive screen generally found in RStudio

# Read in the Human Development Index file
hdi <- fread("ch1_hdi.csv", header=T) # The command fread can be used to read in a CSV file

# View contents of hdi
head(hdi) # View the top few rows of the data table hdi
The output of the preceding code is as follows:
Read the life expectancy file by using the following code:
life <- fread("ch1_life_exp.csv", header=T)
# View contents of life
head(life)
The output of the code file is as follows:
Read the years of schooling file by using the following code:
# Read Years of Schooling File
school <- fread("ch1_schoolyrs.csv", header=T)
# View contents of school
head(school)
The output of the preceding code is as follows:
Now we will read the country information:
iso <- fread("ch1_iso.csv")
# View contents of iso
head(iso)
The following is the output of the previous code:
Here we will see the processing of the hdi table by using the following code:
# Use melt.data.table to change hdi into a long table format
hdi <- melt.data.table(hdi, 1, 2:ncol(hdi))
# Set the names of the columns of hdi
setnames(hdi, c("Country","Year","HDI"))

# Process the life table
# Use melt.data.table to change life into a long table format
life <- melt.data.table(life, 1, 2:ncol(life))
# Set the names of the columns of life
setnames(life, c("Country","Year","LifeExp"))

# Process the school table
# Use melt.data.table to change school into a long table format
school <- melt.data.table(school, 1, 2:ncol(school))