R Machine Learning By Example - Raghav Bali - E-Book

R Machine Learning By Example E-Book

Raghav Bali

Description

Understand the fundamentals of machine learning with R and build your own dynamic algorithms to tackle complicated real-world problems successfully

About This Book

  • Get to grips with the concepts of machine learning through exciting real-world examples
  • Visualize and solve complex problems by using power-packed R constructs and its robust packages for machine learning
  • Learn to build your own machine learning system with this example-based practical guide

Who This Book Is For

If you are interested in mining useful information from data using state-of-the-art techniques to make data-driven decisions, this is a go-to guide for you. No prior experience with data science is required, although basic knowledge of R is highly desirable. Prior knowledge in machine learning would be helpful but is not necessary.

What You Will Learn

  • Utilize the power of R to handle data extraction, manipulation, and exploration techniques
  • Use R to visualize data spread across multiple dimensions and extract useful features
  • Explore the underlying mathematical and logical concepts that drive machine learning algorithms
  • Dive deep into the world of analytics to predict situations correctly
  • Implement R machine learning algorithms from scratch and be amazed to see the algorithms in action
  • Write reusable code and build complete machine learning systems from the ground up
  • Solve interesting real-world problems using machine learning and R as the journey unfolds
  • Harness the power of robust and optimized R packages to work on projects that solve real-world problems in machine learning and data science

In Detail

Data science and machine learning are some of the top buzzwords in the technical world today. From retail stores to Fortune 500 companies, everyone is working hard to make machine learning give them data-driven insights to grow their business. With powerful data manipulation features, machine learning packages, and an active developer community, R empowers users to build sophisticated machine learning systems to solve real-world data problems.

This book takes you on a data-driven journey that starts with the very basics of R and machine learning and gradually builds upon the concepts to work on projects that tackle real-world problems.

You'll begin by getting an understanding of the core concepts and definitions required to appreciate machine learning algorithms. Building upon the basics, you will then work on three different projects that apply the concepts of machine learning, follow current trends, and cover major algorithms as well as popular R packages in detail. These projects are divided into six chapters covering the worlds of e-commerce, finance, and social media, which are at the very core of this data-driven revolution. Each of the projects will help you to understand, explore, visualize, and derive insights depending upon the domain and algorithms.

Through this book, you will learn to apply the concepts of machine learning to deal with data-related problems and solve them using the powerful yet simple language, R.

Style and approach

The book is an enticing journey that starts from the very basics and gradually picks up pace as the story unfolds. Each concept is first defined succinctly in its larger context, followed by a detailed explanation of its application. Each topic is explained with the help of a project that solves a real-world problem through hands-on work, thus giving you a deep insight into the world of machine learning.

Page count: 370

Year of publication: 2016


Table of Contents

R Machine Learning By Example
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book

Errata
Piracy
Questions
1. Getting Started with R and Machine Learning
Delving into the basics of R
Using R as a scientific calculator
Operating on vectors
Special values
Data structures in R
Vectors
Creating vectors
Indexing and naming vectors
Arrays and matrices
Creating arrays and matrices
Names and dimensions
Matrix operations
Lists
Creating and indexing lists
Combining and converting lists
Data frames
Creating data frames
Operating on data frames
Working with functions
Built-in functions
User-defined functions
Passing functions as arguments
Controlling code flow
Working with if, if-else, and ifelse
Working with switch
Loops
Advanced constructs
lapply and sapply
apply
tapply
mapply
Next steps with R
Getting help
Handling packages
Machine learning basics
Machine learning – what does it really mean?
Machine learning – how is it used in the world?
Types of machine learning algorithms
Supervised machine learning algorithms
Unsupervised machine learning algorithms
Popular machine learning packages in R
Summary
2. Let's Help Machines Learn
Understanding machine learning
Algorithms in machine learning
Perceptron
Families of algorithms
Supervised learning algorithms
Linear regression
K-Nearest Neighbors (KNN)
Collecting and exploring data
Normalizing data
Creating training and test data sets
Learning from data/training the model
Evaluating the model
Unsupervised learning algorithms
Apriori algorithm
K-Means
Summary
3. Predicting Customer Shopping Trends with Market Basket Analysis
Detecting and predicting trends
Market basket analysis
What does market basket analysis actually mean?
Core concepts and definitions
Techniques used for analysis
Making data driven decisions
Evaluating a product contingency matrix
Getting the data
Analyzing and visualizing the data
Global recommendations
Advanced contingency matrices
Frequent itemset generation
Getting started
Data retrieval and transformation
Building an itemset association matrix
Creating a frequent itemsets generation workflow
Detecting shopping trends
Association rule mining
Loading dependencies and data
Exploratory analysis
Detecting and predicting shopping trends
Visualizing association rules
Summary
4. Building a Product Recommendation System
Understanding recommendation systems
Issues with recommendation systems
Collaborative filters
Core concepts and definitions
The collaborative filtering algorithm
Predictions
Recommendations
Similarity
Building a recommender engine
Matrix factorization
Implementation
Result interpretation
Production ready recommender engines
Extract, transform, and analyze
Model preparation and prediction
Model evaluation
Summary
5. Credit Risk Detection and Prediction – Descriptive Analytics
Types of analytics
Our next challenge
What is credit risk?
Getting the data
Data preprocessing
Dealing with missing values
Datatype conversions
Data analysis and transformation
Building analysis utilities
Analyzing the dataset
Saving the transformed dataset
Next steps
Feature sets
Machine learning algorithms
Summary
6. Credit Risk Detection and Prediction – Predictive Analytics
Predictive analytics
How to predict credit risk
Important concepts in predictive modeling
Preparing the data
Building predictive models
Evaluating predictive models
Getting the data
Data preprocessing
Feature selection
Modeling using logistic regression
Modeling using support vector machines
Modeling using decision trees
Modeling using random forests
Modeling using neural networks
Model comparison and selection
Summary
7. Social Media Analysis – Analyzing Twitter Data
Social networks (Twitter)
Data mining @social networks
Mining social network data
Data and visualization
Word clouds
Treemaps
Pixel-oriented maps
Other visualizations
Getting started with Twitter APIs
Overview
Registering the application
Connect/authenticate
Extracting sample tweets
Twitter data mining
Frequent words and associations
Popular devices
Hierarchical clustering
Topic modeling
Challenges with social network data mining
References
Summary
8. Sentiment Analysis of Twitter Data
Understanding Sentiment Analysis
Key concepts of sentiment analysis
Subjectivity
Sentiment polarity
Opinion summarization
Feature extraction
Approaches
Applications
Challenges
Sentiment analysis upon Tweets
Polarity analysis
Classification-based algorithms
Labeled dataset
Support Vector Machines
Ensemble methods
Boosting
Cross-validation
Summary
Index

R Machine Learning By Example

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: March 2016

Production reference: 1220316

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-084-6

www.packtpub.com

Credits

Authors

Raghav Bali

Dipanjan Sarkar

Reviewer

Alexey Grigorev

Commissioning Editor

Akram Hussain

Acquisition Editors

Kevin Colaco

Tushar Gupta

Content Development Editor

Kajal Thapar

Technical Editor

Utkarsha S. Kadam

Copy Editors

Vikrant Phadke

Alpha Singh

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Monica Ajmera Mehta

Graphics

Disha Haria

Kirk D'Penha

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

About the Authors

Raghav Bali has a master's degree (gold medalist) in IT from the International Institute of Information Technology, Bangalore. He is an IT engineer at Intel, the world's largest silicon company, where he works on analytics, business intelligence, and application development. He has worked as an analyst and developer in domains such as ERP, finance, and BI with some of the top companies in the world. Raghav is a shutterbug, capturing moments when he isn't busy solving problems.

I would like to thank Packt Publishing for this opportunity, Kajal Thapar and Utkarsha S. Kadam for their fantastic support and editing, and everyone from the R community for making life simpler and data science interesting.

Finally, I would like to thank my family, especially my parents and brother, for their faith in me and for whom this book will be a surprise. I would also like to thank my mentors, teachers, and friends, who have always been an inspiration. Last but not least, special thanks to my partner in crime, Dipanjan Sarkar, without whom this wouldn't have been possible.

Dipanjan Sarkar is an IT engineer at Intel, the world's largest silicon company, where he works on analytics, business intelligence, and application development. He received his master's degree in information technology from the International Institute of Information Technology, Bangalore. His areas of specialization include software engineering, data science, machine learning, and text analytics.

Dipanjan's interests include learning about new technology, disruptive start-ups, and data science. In his spare time, he loves reading, playing games, and watching popular sitcoms. He has also reviewed Data Analysis with R, Learning R for Geospatial Analysis, and R Data Analysis Cookbook, all by Packt Publishing.

I would like to thank my good friend and colleague, Raghav Bali, for co-authoring this book with me. Without his support, it would have been impossible to make this book a reality. I would also like to thank Kajal Thapar and Utkarsha S. Kadam for giving me timely feedback on the book's content and making the whole writing process really interactive and enjoyable. Much gratitude goes without saying to Packt Publishing for giving me this wonderful opportunity to share my knowledge with the machine learning and R enthusiasts out there who are doing truly amazing things every day.

Last but never the least, I am indebted to my family, friends, teachers, and colleagues for always standing by my side and supporting me in all my endeavors. Your support keeps me going day in, day out to take on new challenges!

About the Reviewer

Alexey Grigorev is a skilled data scientist and software engineer with more than 5 years of professional experience. He currently works as a data scientist at Searchmetrics. In his day-to-day job, he actively uses R and Python for data cleaning, data analysis, and modeling. He has been a reviewer on other Packt Publishing books on data analysis, such as Test-Driven Machine Learning and Mastering Data Analysis with R.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Preface

Data science and machine learning are some of the top buzzwords in the technical world today. From retail stores to Fortune 500 companies, everyone is working hard to make machine learning give them data-driven insights to grow their businesses. With powerful data manipulation features, machine learning packages, and an active developer community, R empowers users to build sophisticated machine learning systems to solve real-world data problems.

This book takes you on a data-driven journey that starts with the very basics of R and machine learning and gradually builds upon the concepts to work on projects that tackle real-world problems.

What this book covers

Chapter 1, Getting Started with R and Machine Learning, acquaints you with the book and helps you reacquaint yourself with R and its basics. This chapter also provides you with a short introduction to machine learning.

Chapter 2, Let's Help Machines Learn, dives into machine learning by explaining the concepts that form its base. You are also presented with various types of learning algorithms, along with some real-world examples.

Chapter 3, Predicting Customer Shopping Trends with Market Basket Analysis, starts off with our first project, e-commerce product recommendations, predictions, and pattern analysis, using various machine learning techniques. This chapter specifically deals with market basket analysis and association rule mining to detect customer shopping patterns and trends and make product predictions and suggestions using these techniques. These techniques are used widely by retail companies and e-commerce stores such as Target, Macy's, Flipkart, and Amazon for product recommendations.

Chapter 4, Building a Product Recommendation System, covers the second part of our first project on e-commerce product recommendations, predictions, and pattern analysis. This chapter specifically deals with analyzing e-commerce product reviews and ratings by different users, using algorithms and techniques such as user-collaborative filtering to design a recommender system that is production ready.

Chapter 5, Credit Risk Detection and Prediction – Descriptive Analytics, starts off with our second project, applying machine learning to a complex financial scenario where we deal with credit risk detection and prediction. This chapter specifically deals with introducing the main objective, looking at a financial credit dataset for 1,000 people who have applied for loans from a bank. We will use machine learning techniques to detect people who are potential credit risks and may not be able to repay a loan if they take it from the bank, and also predict the same for the future. The chapter will also talk in detail about our dataset, the main challenges when dealing with data, the main features of the dataset, and exploratory and descriptive analytics on the data. It will conclude with the best machine learning techniques suitable for tackling this problem.

Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, picks up where the previous chapter on descriptive analytics left off and turns to predictive analytics. Here, we use several machine learning algorithms to detect and predict which customers are potential credit risks and might not repay a loan if the bank grants it. This would ultimately help the bank make data-driven decisions as to whether to approve the loan or not. We will cover several supervised learning algorithms and compare their performance. Different metrics for evaluating the efficiency and accuracy of various machine learning algorithms will also be covered here.

Chapter 7, Social Media Analysis – Analyzing Twitter Data, introduces the world of social media analytics. We begin with an introduction to the world of social media and the process of collecting data through Twitter's APIs. The chapter will walk you through the process of mining useful information from tweets, including visualizing Twitter data with real-world examples, clustering and topic modeling of tweets, the present challenges and complexities, and strategies to address these issues. We show by example how some powerful measures can be computed using Twitter data.

Chapter 8, Sentiment Analysis of Twitter Data, builds upon the knowledge of Twitter APIs to work on a project for analyzing sentiments in tweets. This project presents multiple machine learning algorithms for the classification of tweets based on the sentiments inferred. This chapter will also present these results in a comparative manner and help you understand the workings and difference in results of these algorithms.

What you need for this book

This software applies to all the chapters of the book:

  • Windows / Mac OS X / Linux
  • R 3.2.0 (or higher)
  • RStudio Desktop 0.99 (or higher)

For hardware, there are no specific requirements, since R can run on any PC running Mac OS X, Linux, or Windows. However, at least 4 GB of physical memory is preferred to run some of the iterative algorithms smoothly.

Who this book is for

If you are interested in mining useful information from data using state-of-the-art techniques to make data-driven decisions, this is a go-to guide for you. No prior experience with data science is required, although basic knowledge of R is highly desirable. Prior knowledge of machine learning will be helpful but is not necessary.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

Any command-line input or output is written as follows:

# comparing cluster labels with actual iris species labels.
table(iris$Species, clusters$cluster)

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "From recommendations related to Who to follow on Twitter to Other movies you might enjoy on Netflix to Jobs you may be interested in on LinkedIn, recommender engines are everywhere and not just on e-commerce platforms."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book from.
  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

Downloading the color images of this book


We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/RMachineLearningByExample_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Getting Started with R and Machine Learning

This introductory chapter will get you started with the basics of R which include various constructs, useful data structures, loops and vectorization. If you are already an R wizard, you can skim through these sections and dive right into the next part which talks about what machine learning actually represents as a domain and the main areas it encompasses. We will also talk about different machine learning techniques and algorithms used in each area. Finally, we will conclude by looking at some of the most popular machine learning packages in R, some of which we will be using in the subsequent chapters.

If you are a data or machine learning enthusiast, surely you would have heard by now that being a data scientist is referred to as the sexiest job of the 21st century by Harvard Business Review.

Note

Reference: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

There is a huge demand in the current market for data scientists, primarily because their main job is to gather crucial insights and information from both unstructured and structured data to help their business and organization grow strategically.

Some of you might be wondering how machine learning or R relates to all this! Well, to be a successful data scientist, one of the major tools you need in your toolbox is a powerful language capable of performing complex statistical calculations, working with various types of data, and building models which help you get previously unknown insights. R is the perfect language for that! Machine learning forms the foundation of the skills you need to build to become a data analyst or data scientist. This includes using various techniques to build models to get insights from data.

This book will provide you with some of the essential tools you need to be well versed with both R and machine learning by not only looking at concepts but also applying those concepts in real-world examples. Enough talk; now let's get started on our journey into the world of machine learning with R!

In this chapter, we will cover the following aspects:

  • Delving into the basics of R
  • Understanding the data structures in R
  • Working with functions
  • Controlling code flow
  • Taking further steps with R
  • Understanding machine learning basics
  • Familiarizing yourself with popular machine learning packages in R

Delving into the basics of R

It is assumed here that you are at least familiar with the basics of R or have worked with R before. Hence, we won't be talking much about downloading and installation. There are plenty of resources on the web which provide a lot of information on this. I recommend that you use RStudio, an Integrated Development Environment (IDE) that is much more convenient than the base R Graphical User Interface (GUI). You can visit https://www.rstudio.com/ to get more information about it.

Note

For details about the R project, you can visit https://www.r-project.org/ to get an overview of the language. Besides this, R has a vast arsenal of wonderful packages at its disposal and you can view everything related to R and its packages at https://cran.r-project.org/ which contains all the archives.

You must already be familiar with the R interactive interpreter, often called a Read-Evaluate-Print Loop (REPL). This interpreter acts like any command line interface which asks for input and starts with a > character, which indicates that R is waiting for your input. If your input spans multiple lines, like when you are writing a function, you will see a + prompt in each subsequent line, which means that you didn't finish typing the complete expression and R is asking you to provide the rest of the expression.

R can also read and execute complete files of commands and functions, which are saved with an .R extension. Usually, any big application consists of several .R files. Each file has its own role in the application and is often called a module. We will be exploring some of the main features and capabilities of R in the following sections.
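As a minimal sketch of running a script file, the snippet below writes a tiny script to a temporary file and executes it with the base R function source(). The file contents and the helper name add_two are made up purely for illustration; in practice the .R file would already exist on disk.

```r
# Write a small script to a temporary .R file (contents are illustrative only)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "add_two <- function(x) {",
  "  x + 2",
  "}"
), script_path)

# source() reads the file and evaluates its expressions,
# so add_two becomes available afterwards
source(script_path)

result <- add_two(5)
print(result)
```

Splitting an application into several such files and sourcing them from one main script is one simple way to get the module-like structure described above.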

Using R as a scientific calculator

The most basic constructs in R include variables and arithmetic operators which can be used to perform simple mathematical operations like a calculator or even complex statistical calculations.

> 5 + 6
[1] 11
> 3 * 2
[1] 6
> 1 / 0
[1] Inf

Remember that in R even a single number is a vector, and so are the results in the previous code snippet. The leading [1] is the index of the first element printed on that line, which here belongs to a vector of size 1.
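A quick way to convince yourself of this is to inspect a single result with the base functions is.vector() and length(). The variable name single is only an illustrative choice.

```r
# A single number is a numeric vector of length 1
single <- 5 + 6
print(is.vector(single))
print(length(single))

# Indexing a length-1 vector works like indexing any other vector
print(single[1])
```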

You can also assign values to variables and operate on them just like any other programming language.

> num <- 6
> num ^ 2
[1] 36
> num # a variable changes value only on re-assignment
[1] 6
> num <- num ^ 2 * 5 + 10 / 3
> num
[1] 183.3333

Operating on vectors

The most basic data structure in R is a vector. Basically, anything in R is a vector, even if it is a single number just like we saw in the earlier example! A vector is basically a sequence or a set of values. We can create vectors using the : operator or the c function which concatenates the values to create a vector.

> x <- 1:5
> x
[1] 1 2 3 4 5
> y <- c(6, 7, 8, 9, 10)
> y
[1] 6 7 8 9 10
> z <- x + y
> z
[1] 7 9 11 13 15

You can clearly see in the previous code snippet that we added two vectors together without using any loop, using just the + operator. This is known as vectorization and we will be discussing more about this later on. Some more operations on vectors are shown next:

> c(1,3,5,7,9) * 2
[1] 2 6 10 14 18
> c(1,3,5,7,9) * c(2, 4) # here the second vector gets recycled
[1] 2 12 10 28 18

Output:

> factorial(1:5)
[1] 1 2 6 24 120
> exp(2:10) # exponential function
[1] 7.389056 20.085537 54.598150 148.413159 403.428793 1096.633158
[7] 2980.957987 8103.083928 22026.465795
> cos(c(0, pi/4)) # cosine function
[1] 1.0000000 0.7071068
> sqrt(c(1, 4, 9, 16))
[1] 1 2 3 4
> sum(1:10)
[1] 55

You might be confused by the second multiplication, where we multiplied a longer vector by a shorter one and still got a result. R also issues a warning in this case (not shown in the snippet), because the length of the longer vector is not a multiple of the length of the shorter one. Since the two vectors were not equal in size, the smaller vector, c(2, 4), got recycled, or repeated, to become c(2, 4, 2, 4, 2), and was then multiplied element by element with the first vector, c(1, 3, 5, 7, 9), to give the final result vector, c(2, 12, 10, 28, 18). The other functions shown here are standard functions available in base R, along with many others.
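The recycling step can be made explicit with the base function rep_len(), which repeats a vector up to a requested length. This is only a sketch to show what R does implicitly; the variable names are illustrative.

```r
longer  <- c(1, 3, 5, 7, 9)
shorter <- c(2, 4)

# Make the recycling explicit: repeat the shorter vector up to the longer length
recycled <- rep_len(shorter, length(longer))
print(recycled)

explicit <- longer * recycled
print(explicit)

# The implicit version gives the same numbers but also raises a warning,
# which is suppressed here only to keep the comparison quiet
implicit <- suppressWarnings(longer * shorter)
print(identical(explicit, implicit))
```

Writing the repetition out by hand like this avoids the warning and makes the intent visible to whoever reads the code later.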

Special values

Since you will be dealing with a lot of messy and dirty data in data analysis and machine learning, it is important to remember some of the special values in R so that you don't get too surprised later on if one of them pops up.

> 1 / 0
[1] Inf
> 0 / 0
[1] NaN
> Inf / NaN
[1] NaN
> Inf / Inf
[1] NaN
> log(Inf)
[1] Inf
> Inf + NA
[1] NA

The main values which should concern you here are Inf which stands for Infinity, NaN which is Not a Number, and NA which indicates a value that is missing or Not Available. The following code snippet shows some logical tests on these special values and their results. Do remember that TRUE and FALSE are logical data type values, similar to other programming languages.

> vec <- c(0, Inf, NaN, NA)
> is.finite(vec)
[1] TRUE FALSE FALSE FALSE
> is.nan(vec)
[1] FALSE FALSE TRUE FALSE
> is.na(vec)
[1] FALSE FALSE TRUE TRUE
> is.infinite(vec)
[1] FALSE TRUE FALSE FALSE

The functions are largely self-explanatory from their names: is.finite identifies finite values, is.nan identifies NaN values, is.na identifies missing values (it returns TRUE for both NA and NaN), and is.infinite identifies infinite values. These functions are very useful when cleaning dirty data.
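As a small sketch of how these tests help with cleaning, the snippet below filters a vector down to its usable numbers before summarizing it. The readings vector is made-up data used only for illustration.

```r
# Illustrative measurements containing special values (made-up data)
readings <- c(12.5, NA, 7.0, Inf, NaN, 3.5)

# is.finite() is FALSE for NA, NaN, Inf and -Inf, so it keeps only usable numbers
clean <- readings[is.finite(readings)]
print(clean)

# Without cleaning, mean() returns a missing value for this vector
print(is.na(mean(readings)))

# na.rm = TRUE drops NA and NaN but not Inf, so the result is still not useful
print(mean(readings, na.rm = TRUE))

# The mean of the cleaned vector is an ordinary number
print(mean(clean))
```

Filtering with is.finite() is stricter than na.rm = TRUE, which is why it is often the more convenient choice when infinite values can appear in the data.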

Working with functions

Next up, we will look at functions, which let you structure and modularize your code: lines of code that perform a specific task are grouped together so that you can execute them whenever you need to, without writing them again and again. In R, functions are treated as just another data type: you can assign them to variables, manipulate them as needed, and pass them as arguments to other functions. We will explore all of this in the following sections.
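To make the idea of functions as ordinary values concrete, here is a minimal sketch using only base R functions. The names root, summaries, and values are illustrative choices, not part of any package.

```r
# A function is a value: it can be bound to a new name ...
root <- sqrt
print(root(16))

# ... stored in a list alongside other functions ...
summaries <- list(smallest = min, largest = max, average = mean)

# ... and passed to another function, here sapply(), which calls each one
values <- c(4, 8, 15, 16, 23, 42)
results <- sapply(summaries, function(f) f(values))
print(results)
```

Keeping the summary functions in a named list means a new statistic can be added by extending the list, without touching the code that applies them.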

Built-in functions

R ships with many functions in its base package and, as you install more packages, you get more functionality, which is also made available in the form of functions. We will look at a few built-in functions in the following examples:

> sqrt(5)
[1] 2.236068
> sqrt(c(1,2,3,4,5,6,7,8,9,10))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
> # aggregating functions
> mean(c(1,2,3,4,5,6,7,8,9,10))
[1] 5.5
> median(c(1,2,3,4,5,6,7,8,9,10))
[1] 5.5

You can see from the preceding examples that functions such as mean, median, and sqrt are built-in and can be used anytime when you start R, without loading any other packages or defining the functions explicitly.

User-defined functions

The real power lies in the ability to define your own functions based on different operations and computations you want to perform on the data and making R execute those functions just in the way you intend them to work. Some illustrations are shown as follows:

square <- function(data){
  return (data^2)
}
> square(5)
[1] 25
> square(c(1,2,3,4,5))
[1] 1 4 9 16 25
point <- function(xval, yval){
  return (c(x=xval,y=yval))
}
> p1 <- point(5,6)
> p2 <- point(2,3)
>
> p1
x y
5 6
> p2
x y
2 3

As we saw in the previous code snippet, we can define functions such as square, which computes the square of a single number or of every element in a vector using the same code. Functions such as point are useful for representing specific entities, in this case points in the two-dimensional coordinate space. Now we will be looking at how to use the preceding functions together.

Passing functions as arguments

When you define any function, you can also pass other functions to it as arguments if you intend to use them inside your function to perform some complex computations. This reduces the complexity and redundancy of the code. The following example computes the Euclidean distance between two points using the square function defined earlier, which is passed as an argument: