Data Science Tools - C. Greco - E-Book

Data Science Tools E-Book

C. Greco

Description

This book introduces popular data science tools and guides readers on how to use them effectively. It covers data analysis using Microsoft Excel, KNIME, R, and OpenOffice, applying statistical concepts such as confidence intervals, normal distribution, T-Tests, linear regression, histograms, and geographic analysis with real data from Federal Government sources.
The course begins with the basics, including importing data and conducting various statistical tests. It progresses to specific methods for each tool, ensuring a comprehensive understanding of data analysis. Capstone exercises provide hands-on experience, reinforcing the concepts learned throughout the book.
Understanding these tools and concepts is crucial for effective data analysis. This book takes readers from the basics to advanced statistical methods, combining theoretical insights with practical applications. Companion files with source code and data sets enhance the learning experience, making this book an essential resource for mastering data analysis with popular software applications.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Pages: 189

Year of publication: 2024




DATA SCIENCE TOOLS

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

DATA SCIENCE TOOLS

R, Excel, KNIME, & OpenOffice

CHRISTOPHER GRECO

MERCURY LEARNING AND INFORMATION

Dulles, Virginia

Boston, Massachusetts

New Delhi

Copyright ©2020 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA [email protected]
(800) 232-0223

C. Greco. Data Science Tools: R, Excel, KNIME, & OpenOffice.
ISBN: 978-1-68392-583-5

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2020937123

20 21 22 3 2 1  Printed on acid-free paper in the United States of America

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at (800) 232-0223 (toll free). Digital versions of our titles are available at: www.academiccourseware.com and other electronic vendors.

The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book and/or disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

CONTENTS

Preface

Acknowledgments

Notes on Permissions

Chapter 1:  First Steps

1.1 Introduction to Data Tools

1.1.1 The Software Is Easy to Use

1.1.2 The Software Is Available from Anywhere

1.1.3 The Software Is Updated Regularly

1.1.4 Summary

1.2 Why Data Analysis (Data Science) at All?

1.3 Where to Get Data

Chapter 2:  Importing Data

2.1 Excel

2.1.1 Excel Analysis ToolPak

2.2 OpenOffice

2.3 Import into R and Rattle

2.4 Import into RStudio

2.5 Rattle Import

2.6 Import into KNIME

2.6.1 Stoplight Approach

Chapter 3:  Statistical Tests

3.1 Descriptive Statistics

3.1.1 Excel

3.1.2 OpenOffice

3.1.3 RStudio/Rattle

3.1.4 KNIME

3.2 Cumulative Probability Charts

3.2.1 Excel

3.2.2 OpenOffice

3.2.3 R/RStudio/Rattle

3.2.4 KNIME

3.3 T-Test (Parametric)

3.3.1 Excel

3.3.2 OpenOffice

3.3.3 R/RStudio/Rattle

3.3.4 KNIME

Chapter 4:  More Statistical Tests

4.1 Correlation

4.1.1 Excel

4.1.2 OpenOffice

4.1.3 R/RStudio/Rattle

4.1.4 KNIME

4.2 Regression

4.2.1 Excel

4.2.2 OpenOffice

4.2.3 R/RStudio/Rattle

4.2.4 KNIME

4.3 Confidence Interval

4.3.1 Excel

4.3.2 OpenOffice

4.3.3 R/RStudio/Rattle

4.3.4 KNIME

4.4 Random Sampling

4.4.1 Excel

4.4.2 OpenOffice

4.4.3 R/RStudio/Rattle

4.4.4 KNIME

Chapter 5:  Statistical Methods for Specific Tools

5.1 Power

5.1.1 R/RStudio/Rattle

5.2 F-Test

5.2.1 Excel

5.2.2 R/RStudio/Rattle

5.2.3 KNIME

5.3 Multiple Regression/Correlation

5.3.1 Excel

5.3.2 OpenOffice

5.3.3 R/RStudio/Rattle

5.3.4 KNIME

5.4 Benford’s Law

5.4.1 Rattle

5.5 Lift

5.5.1 KNIME

5.6 Wordcloud

5.6.1 R/RStudio

5.6.2 KNIME

5.7 Filtering

5.7.1 Excel

5.7.2 OpenOffice

5.7.3 R/RStudio/Rattle

5.7.4 KNIME

Chapter 6:  Summary

6.1 Packages

6.2 Analysis ToolPak

Chapter 7:  Supplemental Information

7.1 Exercise One – Tornado and the States

7.1.1 Answer to Exercise 7.1

7.1.2 Pairing Exercise

References

Index

PREFACE

Data Science is all the rage. There is a great probability that every book you read, every Web site that you visit, every advertisement that you receive, is a result of data science and, with it, data analytics. What used to be “statistics” is now referenced as data analytics or data science. The concepts behind data science are myriad and complex, but the underlying concept is that very basic statistical concepts are vital to understanding data. This book really has a two-fold purpose. The first is to review briefly some of the concepts that the reader may have encountered while taking a course (or courses) in statistics, while the second is to demonstrate how to use tools to visualize those statistical concepts.

There are several caveats that must accompany this book. The first one is that the tools are of a certain version, which will be described below. This means that there will undoubtedly be future versions of these tools that might perform differently on your computer. I want to be very clear that this performance does not mean that these tools will perform better. Three of these are free and open source tools, and, as such, perform as well as the group of developers dictate they will in their most current versions. In most instances, the tool will be enhanced in the newer version, but there might be a different “buttonology” that will be associated with newer functions. You will see the word “buttonology” throughout this book in the form of the mechanics of the tool itself. I am not here to teach the reader statistics or the different concepts that compose the topics of this book. I am here to show you how the free and open source tools are applied to these concepts.

Now it is time to get to the very heart of the text, the tools of data science. There will be four tools that will encompass the content of this book. Three are open source tools (FOSS or Free and Open Source), with one being COS (Common Off the Shelf) software, but all four will require some instruction in their use. These are not always intuitive or self-explanatory, so there will be many screen pages for each mechanical function. I feel that visual familiarization trumps narrative, so you will not see a lot of writing, mostly descriptions and step-by-step mechanics. A few of you may be wondering how to practice these skills, and for those readers there is a final chapter that has several scenarios that allow the reader to apply what they have learned from these tools.

This book is organized by statistical concept, not by tool, which means that each chapter explains a statistical concept and then shows how to apply each tool to that concept. With this presentation method, readers can go to the concept they need and use the tool they are most comfortable with. Each section is labeled accordingly, so it appears in both the table of contents and the index. This makes it simpler for readers to see their choice of tools and the concepts to which those tools apply.

C. Greco
April 2020

ACKNOWLEDGMENTS

When I have done these in the past, I always mentioned my wife, children, and grandchildren, which to me was not just necessary but mandatory, because they are the ones that impact me every day. Thanks to my brothers and sisters, who always set the bar high enough for excellence, but not so high that I would injure myself getting over it. You all always provided me with the motivation to do better. Now, I have to add a few people that have helped me get this book into print and in the electronic media. The first and foremost is Jim Walsh of Mercury Learning, who took a risk having me write a book on free and open source applications. I truly believe in this book, and he trusted me to put my best foot forward, but in addition he made suggestions along the way that helped me to be a better writer and contributor to the bigger publishing picture. I truly appreciate all your help, Jim.

The other editors and writers at Mercury Learning are like looking at a Science, Technology, Engineering, and Math (STEM) Hall of Fame. I am truly honored and privileged to even have a book title with this noble group. Thanks for all the guidance.

Finally, my father, who told me in no uncertain words that I should never try to study “hard sciences” but stick with the “soft sciences,” since I really stunk at math. Thanks, Dad, for giving me that incentive to pursue statistics and data analysis. I owe it all to you.

NOTES ON PERMISSIONS

• Microsoft Corporation screenshots fall under the guidelines seen here: https://www.microsoft.com/en-us/legal/intellectualproperty/permissions/default.aspx.

• OpenOffice screenshots fall under the guidelines seen here: https://www.openoffice.org/license.html.

• R / RStudio screenshots are permitted through the RStudio license and permission: https://rstudio.com/about/software-license-descriptions/.

• R Foundation: http://www.r-project.org.

• Rattle screenshots are used with permission and also cited in:

Graham Williams. (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R! New York, NY: Springer.

• KNIME screenshots are permitted through KNIME licensing and permission: https://www.knime.com/downloads/full-license.

CHAPTER 1

FIRST STEPS

1.1 INTRODUCTION TO DATA TOOLS

People have different motivations for pursuing what interests them. Ask someone about a car and they might say that they hate sedans, or love SUVs, or would never get anything other than an electric car, or maybe not get a car at all! People have different preferences and this does not change with data science (statistical) tools. Some people love Excel, to the point where they will use nothing other than that software for anything from keeping a budget to analyzing data. There are many reasons for maintaining dedication, but the main reason from my experience is familiarization with the object. A person who has only driven a stick shift loves the clutch, while those that have never driven a stick will not be as prone to prefer one with a manual gear shifter.

What reasons are there for preferring one software application to another? From my experience, there are three main points:

1. The software is easy to use

2. The software is available from anywhere

3. The software is updated regularly

Normally, low cost would also be on this list, but in the age of subscriptions, software licenses are no longer perpetual, so a monthly payment is all that is necessary to keep access to the software for as long as the subscription is current. Let’s explore each point and elaborate.

1.1.1 The Software Is Easy to Use

If an analyst can click a few buttons and, voilà, the result appears, that is much easier than the “p” word. What is the “p” word? Programming! If an analyst has to program, getting the result becomes harder. Of course, analysts do not always realize that once something is programmed, it is easier to apply that programming again, but that is for another book at another time. The main point here is that graphical user interface (GUI) software seems to be preferred to programming software. The COS software is well known and also known to be easy to use. Some of the FOSS software will require more preparation.

1.1.2 The Software Is Available from Anywhere

In this age of cloud computing, being able to access software seems trivial. Colleagues I have spoken with like the fact that they can perform and save their work online so they will not lose it. They also like the fact that updates are transparent and performed while they are using the tool. Finally, they like the fact that they do not have to worry about installing the software and using their memory or disk space.

1.1.3 The Software Is Updated Regularly

The previous section covers this, so we will not elaborate. However, it is important to note that the tools covered in this book are updated regularly. Unfortunately, the analyst will have to be the one to opt in to the updates.

1.1.4 Summary

Now that we have covered why analysts prefer certain tools, a description of the ones covered in this book will be given in table form to simplify the presentation and (as stated previously) minimize the written word.

1.2 WHY DATA ANALYSIS (DATA SCIENCE) AT ALL?

The world today is a compendium of data. Data exist in everything we do, whether it is buying groceries or researching to buy a house. So many free applets and applications are available to us that we have a hard time saying no to any of them. As one reference put it, and this author has generalized, if what you are downloading is free, then you are the product (Poundstone, 2019). This is pertinent, because free and open source software (FOSS) is commonly accessible and available to all of us. However, why do we need data science to analyze all of this information? To my knowledge, there are a number of reasons why data science exists. First, it exists to corral the trillions of bytes of information gathered by companies and government agencies to determine everything from the cost of milk to the amount of carbon emissions in the air. Forty years ago, most data were collected, retrieved, and filed using paper. Personal computers were a dream, and data science was called archiving or something similar. With the move toward electronic media, databases turned mounds of paper into kilo-, mega-, giga-, and even petabytes. With that amount of data, analysis moved from pencil and paper to personal computers, or any computer. Analysts started to realize that dynamic software was the means of getting data analysis into a more usable form.

Data science grew out of this data analytic effort and couples conventional statistical methods with the power of computing in order to make data analysis readily available to all private and public entities. With the power to analyze marketing, technical, and personnel data, companies can now calculate the probability of their product succeeding, or of their revenue growing the next year. With the growth of data science have come the many tools that make data analytics possible.

1.3 WHERE TO GET DATA

Now that we have an introduction to the “why” of data science, the next subject is “where.” Where do you get data to use with data science tools? The answer to that question, especially now, is that data is available on many web sites for analysis (Williams, 2011). Some of these web sites include:

1. www.data.gov, which contains pages of data from different government agencies. If you want to know about climate data, or census, or disease control, this is the place to go.

2. www.kaggle.com, which not only contains data, but has contests with existing data that anyone can join. One dataset contains the various data collected from the Titanic, including how many died or survived and all the demographics for analysis and correlation.

3. Just about any federal government agency. If you do not want to go to a general web site, then go to www.cdc.gov, www.census.gov, www.noaa.gov, or any separate government web site for data pertaining to things like Social Security (www.ssa.gov) or even intelligence (www.nsa.gov) for some historical data.

Now that you have the “whys” and “wheres” associated with data science and tools, you now move on to the next step—actually using the tools with real data. Besides, you have no doubt had enough of this stage setting.

The data for this book was retrieved from https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/, which has the tornado tracking data for the United States from 1950 until 2018. NOAA stands for the National Oceanic and Atmospheric Administration. The recommendation is to download these files (as many as you like) and use them separately for the examples in the book. This book will focus on the 1951 tornado tracking data to keep things relatively straightforward. Once you download the data, the next step is to import the data into your favorite statistical tool.

CHAPTER 2

IMPORTING DATA

The first step to analyzing data is to import the data into the appropriate tool. This first section will show how to import data using each of the tools—Excel, R, KNIME, and OpenOffice. Since most analysts are familiar with Excel, Excel will be the first one addressed and then OpenOffice, since it is very close to Excel in functionality, for a good introduction to importing data.

2.1 EXCEL

The version for this text will be Microsoft Excel 2016, because that is the version that appears in many federal government agencies. As of the writing of this book, Excel 2019 is available but not used in public service at this point.

Importing data into Excel could not be easier. The file that has been downloaded is a Comma Separated Value (CSV) file, so to import the file into Excel, go to the file location and double-click on the file. The file will appear in Excel if the computer defaults to all spreadsheets going into Excel. If not, open Excel and choose “File” and “Open” to go to the file location and open the file. The following screens illustrate the operation.

One caveat applies at this point with Excel. When opening a file, the default extension for Excel is the worksheet extension, or “xlsx.” If the worksheet is a CSV, then that default has to be changed, as demonstrated in the preceding process. Once the extension is changed, click “OPEN” and the spreadsheet will appear in Excel. If the file needs to stay a CSV, then save it as such when you complete the work on the spreadsheet. Otherwise, save it as an “XLSX” file so that all the functionality of Excel remains with the spreadsheet as the analysis continues.

This is probably the easiest import for any of the applications presented because of the intuitive nature of Excel.
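For readers curious about what a CSV import amounts to outside a spreadsheet, here is a minimal sketch in Python, which is not one of the four tools covered in this book. It reads comma-separated text, treats the first row as column headers, and counts rows per state. The sample rows and the column names (STATE, YEAR, EVENT_TYPE, TOR_F_SCALE) are illustrative assumptions, not records taken from the NOAA files, and the helper names load_rows and count_by_column are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded storm events CSV file.
# The column names and rows are illustrative assumptions, not real NOAA records.
SAMPLE_CSV = """STATE,YEAR,EVENT_TYPE,TOR_F_SCALE
OKLAHOMA,1951,Tornado,F2
TEXAS,1951,Tornado,F1
OKLAHOMA,1951,Tornado,F3
KANSAS,1951,Tornado,F2
"""


def load_rows(text_stream):
    """Read CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(text_stream))


def count_by_column(rows, column):
    """Count how many rows share each value in the given column."""
    return Counter(row[column] for row in rows)


# io.StringIO lets the sample text behave like an opened file.
rows = load_rows(io.StringIO(SAMPLE_CSV))
state_counts = count_by_column(rows, "STATE")

print(len(rows), "rows imported")
for state, count in state_counts.most_common():
    print(state, count)
```

With a downloaded file, the same functions could be given the result of open(path, newline="") in place of the io.StringIO object, where path is a hypothetical variable holding the file location. Every value arrives as text, which is why a spreadsheet or statistical tool still has to decide which columns are numbers or dates.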

2.1.1 Excel Analysis ToolPak

From this point forward, for any statistical analysis with Excel, we will be using the Analysis ToolPak, which will need to be installed as an add-in through Excel. If the Analysis ToolPak is already installed, it will show in the “Data” tab of Excel as shown here.

If the Analysis ToolPak is not showing in the Data toolbar, the analyst can add it simply by going to the “File” tab and choosing “Options” at the bottom of the left column. A screen will appear showing all the possibilities in the left column. The analyst chooses “Add-Ins” and the screen below will appear, showing all the add-ins that are available or not available. Take a second and look at the add-ins that are available as part of the Excel installation. There are a number of them, and they are very useful in data analytics. Take time to explore these add-ins to see how they can enhance your analysis, but in the meantime, finish installing the Analysis ToolPak add-in to complete this analysis.

When selecting Options, the next screen will reveal a number of choices in the left-hand side column. Choose “Add-Ins” and there will be a list of possible add-ins for Excel. Choose “Analysis ToolPak,” which will at this point be in “Inactive Application Add-Ins,” and go down to the bottom of the screen where it says “Manage:” to ensure that “Excel Add-Ins” is in the text box. Click on the “Go…” button and the following screen will appear.

Click in the checkbox next to “Analysis ToolPak” in order to activate the add-in, and it will appear in the Excel toolbar. If it does not, try to close out of Excel and try the process again. It should work at that point. If it does not work after repeated attempts and the computer is a government computer, there may be a firewall in place that will prevent the use of this add-in. If the system administrator cannot provide the computer with access, there is a description at the end of this book that will demonstrate the buttonology to substitute for the Analysis ToolPak.

2.2 OPENOFFICE

The first step to using OpenOffice is to download the software from the OpenOffice website (www.openoffice.org), which is relatively straightforward. The current version of the software is 4.1.7, which will be the version that we will be using in this book. When you install OpenOffice you do not have to install all the different functionalities, and in this instance you just need the spreadsheet program, so when you open the splash screen you will see the following:

At this point, select Spreadsheet and this screen will appear, which will look very much like Excel. In fact, if you used Excel between 1998 and 2000, it will look very much like those versions. What this means is that the functionality is not exactly the same as in current Excel, but it will be everything you need for the statistics concepts in this book.

The first task will be to import data retrieved from the Internet. In this case it will be the data from a site that tracks tornados occurring in the United States from 1950–2018. This data will be imported by using the same technique as in Excel—through the “open” command in the File Menu as depicted here: