Put your Haskell skills to work and generate publication-ready visualizations in no time at all
Key Features
Book Description
Every business and organization that collects data is capable of tapping into its own data to gain insights into how to improve. Haskell is a purely functional and lazy programming language, well suited to handling large data analysis problems. This book will take you through the more difficult problems of data analysis in a hands-on manner.
This book will help you get up to speed with the basics of data analysis and approaches in the Haskell language. You'll learn about statistical computing, file formats (CSV and SQLite3), descriptive statistics, and charts, and progress to more advanced concepts, such as understanding the importance of the normal distribution. While mathematics is a big part of data analysis, we've tried to keep this book simple and approachable so that you can apply what you learn to the real world.
By the end of this book, you will have a thorough understanding of data analysis, and the different ways of analyzing data. You will have a mastery of all the tools and techniques in Haskell for effective data analysis.
What you will learn
Who this book is for
This book is intended for people who wish to expand their knowledge of statistics and data analysis via real-world examples. A basic understanding of the Haskell language is expected. If you are feeling brave, you can jump right into the functional programming style.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Trusha Shriyan
Content Development Editor: Arun Nadar
Technical Editor: Diksha Wakode
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Alishon Mendonsa
Production Coordinator: Deepika Naik
First published: October 2018
Production reference: 1301018
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78980-286-3
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
James Church lives in Clarksville, Tennessee, United States, where he enjoys teaching, programming, and playing board games with his wife, Michelle. He is an assistant professor of computer science at Austin Peay State University. He has consulted for various companies and a chemical laboratory for the purpose of performing data analysis work. James is the author of Learning Haskell Data Analysis.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Getting Started with Haskell Data Analysis
Packt Upsell
Why subscribe?
Packt.com
Contributors
About the author
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Descriptive Statistics
The CSV library – working with CSV files
Data range
Data mean and standard deviation
Data median
Data mode
Summary
SQLite3
SQLite3 command line
Working with SQLite3 and Haskell
Slices of data
Working with SQLite3 and descriptive statistics
Summary
Regular Expressions
Dots and pipes
Atom and Atom modifiers
Character classes
Regular expressions in CSV files
SQLite3 and regular expressions
Summary
Visualizations
Line plots of a single variable
Plotting a moving average
Creating publication-ready plots
Feature scaling
Scatter plots
Summary
Kernel Density Estimation
The central limit theorem
Normal distribution
Introducing kernel density estimation
Application of the KDE
Summary
Course Review
Converting CSV variation files into SQLite3
Using SQLite3 SELECT and the DescriptiveStats module for descriptive statistics
Creating compelling visualizations using EasyPlot
Reintroducing kernel density estimation
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Data analysis is part computer science and part statistics. An important part of data analysis is validating your assumptions with real-world data to see whether there is a pattern, or a particular user behavior that you can validate.
In this book, we are going to learn about data analysis from the perspective of the Haskell programming language. The goal of this book is to take you from being a beginner in math and statistics, to the point that you feel comfortable working with large-scale datasets. While mathematics is a big part of data analysis, we've tried to keep this book simple and approachable so that you can apply what you learn to the real world.
This book is intended for people who wish to expand their knowledge of statistics and data analysis via real-world examples. A basic understanding of the Haskell language is expected. If you are feeling brave, you can jump right into a functional programming style.
Chapter 1, Descriptive Statistics, teaches you about the Text.CSV library. It also covers some of the descriptive statistics functions, such as mean, median, and mode.
Chapter 2, SQLite3, focuses on how to get the data from CSV into SQLite3. You will understand the data types of SQLite3 and how to fetch data using SQL statements. It also covers how to create your own custom module of descriptive statistics.
Chapter 3, Regular Expressions, introduces you to regular expression syntax, such as dots and pipe. It also covers character classes at length. Finally, it teaches you how to use regular expressions within a CSV file and an SQLite3 database.
Chapter 4, Visualizations, starts with the installation of gnuplot and the EasyPlot Haskell library. It covers how to use a moving average function to analyze stock data. Finally, it teaches you how to make publication-ready plots by adding legends and saving those plots to files.
Chapter 5, Kernel Density Estimation, introduces you to the central limit theorem and the normal distribution and helps you to understand the relationship between them. Later, it talks about the kernel density estimator and how to apply it to a dataset.
Chapter 6, Course Review, works on the MovieLens data, applying what you have learned in the first five chapters. In addition to what was covered in the earlier chapters, you will also explore a few more interesting techniques for analyzing the data.
You will need to set up the IHaskell notebook environment to test the examples in these chapters. You will also need some knowledge of the Haskell programming language, math, and statistics.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Getting-Started-with-Haskell-Data-Analysis. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789802863_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We can see that we have the ability to download a CSV file called table.csv."
Any command-line input or output is written as follows:
sudo apt-get install sqlite3 libsqlite3-dev
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "We need to hit Apply, and then we have to hit Apply again."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this book, we are going to learn about data analysis from the perspective of the Haskell programming language. The goal of this book is to take you from being a beginner in math and statistics, to the point that you feel comfortable working with large-scale datasets. Now, the prerequisites for this book are that you know a little bit of the Haskell programming language, and also a little bit of math and statistics. From there, we can start you on your journey of becoming a data analyst.
In this chapter, we are going to cover descriptive statistics. Descriptive statistics are used to summarize a collection of values into one or two values. We begin by learning about the Haskell Text.CSV library. In later sections, we will cover, in increasing order of difficulty, the range, mean, standard deviation, median, and mode; you've probably heard of some of these descriptive statistics before, as they're quite common. We will be using the IHaskell environment on the Jupyter Notebook system.
The topics that we are going to cover are as follows:
The CSV library—working with CSV files
Data ranges
Data mean and standard deviation
Data median
Data mode
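As a hedged preview of the statistics listed above, here is one way they could be written in plain Haskell. The chapter develops its own versions step by step against the baseball data, so treat these standalone definitions as a sketch rather than the book's exact code:

```haskell
import Data.List (group, sort, sortBy)
import Data.Ord (Down (..), comparing)

-- Range: the spread between the largest and smallest values
range :: [Double] -> Double
range xs = maximum xs - minimum xs

-- Arithmetic mean
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Population standard deviation: square root of the mean squared deviation
stdDev :: [Double] -> Double
stdDev xs = sqrt (mean [(x - m) ^ 2 | x <- xs])
  where m = mean xs

-- Median: the middle of the sorted list (average of the two middles
-- when the list has an even number of elements)
median :: [Double] -> Double
median xs
  | odd n     = s !! mid
  | otherwise = (s !! (mid - 1) + s !! mid) / 2
  where
    s   = sort xs
    n   = length xs
    mid = n `div` 2

-- Mode: the most frequently occurring value
mode :: Ord a => [a] -> a
mode = head . head . sortBy (comparing (Down . length)) . group . sort
```

For example, mean [1, 2, 3, 4] gives 2.5, and mode [1, 2, 2, 3] gives 2.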
In this section, we're going to cover the basics of the CSV library and how to work with CSV files. To do this, we will be taking a closer look at the structure of a CSV file; how to install the Text.CSV Haskell library; and how to retrieve data from a CSV file from within Haskell.
Now to begin, we need a CSV file. So, I'm going to tab over to my Haskell environment, which is just a Debian Linux virtual machine running on my computer, and I'm going to go to the website at retrosheet.org. This is a website for baseball statistics, and we are going to use them to demonstrate the CSV library. Find the link for Data Downloads and click Game Logs, as follows:
Now, scroll down just a little bit and you should see game logs for every single season, going all the way back to 1871. For now, I would like to stick with the most recent complete season, which is 2015:
So, go ahead and click the 2015 link. We will have the option to download a ZIP file, so go ahead and click OK. Now, I'm going to tab over to my Terminal:
Let's go into the Downloads folder, and if we hit ls, we see that there's our ZIP file. Let's unzip that file and see what we have. Let's open up GL2015.TXT. This is a CSV file, and it will display something like the following:
A CSV file is a file of comma-separated values. So, you'll see that we have a file divided up, where each line in this file is a record, and each record represents a single game of baseball in the 2015 season; and inside every single record is a listing of values, separated by a comma. So, the very first game in this dataset is a game between the St. Louis Cardinals—that's SLN—and the Chicago Cubs—that's CHN—and this game took place on April 5th, 2015. The final score of this first game was 3-0, and every line in this file is a different game.
Now, CSV isn't a standard, but there are a few properties of a CSV file that I consider to be safe. Consider the following as my suggestions. A CSV file should keep one record per line. The first line should be a description of each column. In a future section, I'm going to tell you that we need to remove the header line; and you'll see that this particular file doesn't have a header line. I still like to see a description line for each column of values. If a field in a record includes a comma, then that field should be surrounded by double quote marks. Now, we don't see an example of this (at least, not on this first line), but we do see examples of many values having quote marks surrounding the value, such as the very first value in the file, the date:
In a CSV file, quote marks around a field are optional, unless the field contains a comma. While we're here, I would like to make a note of the tenth column in this file, which contains the number 3 on this particular row. This column holds the away-team score in every single record of this file. Make a note that the first value in the tenth column is a 3—we're going to come back to that later on.
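The quoting rule just described can be sketched in a few lines of Haskell. This simplified splitter is for illustration only, and the name splitRecord is ours, not part of any library; the Text.CSV parser that we install next handles the full format, including escaped quotes and newlines embedded in quoted fields:

```haskell
-- Split one CSV record into its fields, honoring quoted fields.
-- Simplified sketch: no support for escaped quotes inside a field.
splitRecord :: String -> [String]
splitRecord = go ""
  where
    go acc []         = [reverse acc]
    go acc (',':rest) = reverse acc : go "" rest
    go acc ('"':rest) =
      -- A quoted field runs until the closing quote; commas inside it
      -- are kept as part of the field rather than treated as separators.
      let (field, rest') = span (/= '"') rest
      in go (reverse field ++ acc) (drop 1 rest')
    go acc (c:rest)   = go (c : acc) rest
```

For example, splitRecord "a,\"b,c\",d" yields ["a","b,c","d"]: the quoted middle field keeps its comma.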
Our next task is installing the Text.CSV library; we do this using the Cabal tool, which connects with the Hackage repository and downloads the Text.CSV library:
The command that we use to start the install, shown in the first line of the preceding screenshot, is cabal install csv. It takes a moment to download the file, but it should download and install the Text.CSV library in our home folder. Now, let me describe what I currently have in my home folder:
I like to create a directory for my code called Code; and inside here, I have a directory called HaskellDataAnalysis. And inside HaskellDataAnalysis, I have two directories, called analysis and data. In the analysis folder, I would like to store my notebooks. In the data folder, I would like to store my datasets.
That way, I can keep a clear distinction between analysis files and data files. That means I need to move the data file we just downloaded into my data folder. So, copy GL2015.TXT from the Downloads folder into the data folder. If I do an ls on my data folder, I'll see that I've got my file. Now, I'm going to go into my analysis folder, which currently contains nothing, and I'm going to start the Jupyter Notebook as follows:
Type in jupyter notebook, which will start a web server on your computer. You can use your web browser in order to interact with Haskell:
The address for the Jupyter Notebook is the localhost, on port 8888. Now I'm going to create a new Haskell notebook. To do this, I click on the New drop-down button on the right side of the screen, and I find Haskell:
Let's begin by renaming our notebook Baseball, because we're going to be looking at baseball statistics:
I need to import the Text.CSV library that we just installed. Now, if your cursor is sitting in a text field and you hit Enter, you'll just make that text field larger, as shown in the following screenshot. Instead, in order to submit expressions to the Jupyter environment, you have to hit Shift + Enter on the keyboard:
So, now that we've imported Text.CSV, let's create our Baseball dataset and parse the dataset. The command for this is parseCSVFromFile, after which we pass in the location of our text file:
Great. If you didn't get a File Not Found error at this point, then that means you have successfully parsed the data from the CSV file. Now, let's explore the type of the baseball data. To do this, we enter :type followed by baseball, which is what we just created, and we see that the result is either a parsing error or a CSV file:
I've already done this, so I know that there aren't any parsing errors in our CSV file, but if there were, they would be represented by the ParseError type.
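The Either value that comes back can be handled with ordinary pattern matching. A minimal sketch follows; note that it uses Either String [[String]] as a stand-in, because the real ParseError type lives in the parser library underlying Text.CSV, and the helper countRecords is our own name, not a library function. The pattern-matching structure is the same either way:

```haskell
-- Inspect a parse result: report the error on the Left branch,
-- or summarize the parsed rows on the Right branch.
countRecords :: Either String [[String]] -> String
countRecords (Left err)   = "Parse failed: " ++ err
countRecords (Right rows) = "Parsed " ++ show (length rows) ++ " records"
```

For example, countRecords (Right [["SLN"], ["CHN"]]) gives "Parsed 2 records", while countRecords (Left "unexpected end of input") reports the failure.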
