Put your Haskell skills to work and generate publication-ready visualizations in no time at all
Key Features
Book Description
Every business and organization that collects data is capable of tapping into its own data to gain insights into how to improve. Haskell is a purely functional and lazy programming language, well suited to handling large data analysis problems. This book will take you through the more difficult problems of data analysis in a hands-on manner.
This book will help you get up to speed with the basics of data analysis and approaches in the Haskell language. You'll learn about statistical computing, file formats (CSV and SQLite3), descriptive statistics, and charts, and progress to more advanced concepts, such as understanding the importance of the normal distribution. While mathematics is a big part of data analysis, we've tried to keep this book simple and approachable so that you can apply what you learn to the real world.
By the end of this book, you will have a thorough understanding of data analysis, and the different ways of analyzing data. You will have a mastery of all the tools and techniques in Haskell for effective data analysis.
What you will learn
Who this book is for
This book is intended for people who wish to expand their knowledge of statistics and data analysis via real-world examples. A basic understanding of the Haskell language is expected. If you are feeling brave, you can jump right into the functional programming style.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Trusha Shriyan
Content Development Editor: Arun Nadar
Technical Editor: Diksha Wakode
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Alishon Mendonsa
Production Coordinator: Deepika Naik
First published: October 2018
Production reference: 1301018
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78980-286-3
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
James Church lives in Clarksville, Tennessee, United States, where he enjoys teaching, programming, and playing board games with his wife, Michelle. He is an assistant professor of computer science at Austin Peay State University. He has consulted for various companies and a chemical laboratory for the purpose of performing data analysis work. James is the author of Learning Haskell Data Analysis.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Getting Started with Haskell Data Analysis
Packt Upsell
Why subscribe?
Packt.com
Contributors
About the author
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Descriptive Statistics
The CSV library – working with CSV files
Data range
Data mean and standard deviation
Data median
Data mode
Summary
SQLite3
SQLite3 command line
Working with SQLite3 and Haskell
Slices of data
Working with SQLite3 and descriptive statistics
Summary
Regular Expressions
Dots and pipes
Atom and Atom modifiers
Character classes
Regular expressions in CSV files
SQLite3 and regular expressions
Summary
Visualizations
Line plots of a single variable
Plotting a moving average
Creating publication-ready plots
Feature scaling
Scatter plots
Summary
Kernel Density Estimation
The central limit theorem
Normal distribution
Introducing kernel density estimation
Application of the KDE
Summary
Course Review
Converting CSV variation files into SQLite3
Using SQLite3 SELECT and the DescriptiveStats module for descriptive statistics
Creating compelling visualizations using EasyPlot
Reintroducing kernel density estimation
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Data analysis is part computer science and part statistics. An important part of data analysis is validating your assumptions with real-world data to see whether there is a pattern, or a particular user behavior that you can validate.
In this book, we are going to learn about data analysis from the perspective of the Haskell programming language. The goal of this book is to take you from being a beginner in math and statistics, to the point that you feel comfortable working with large-scale datasets. While mathematics is a big part of data analysis, we've tried to keep this book simple and approachable so that you can apply what you learn to the real world.
This book is intended for people who wish to expand their knowledge of statistics and data analysis via real-world examples. A basic understanding of the Haskell language is expected. If you are feeling brave, you can jump right into a functional programming style.
Chapter 1, Descriptive Statistics, teaches you about the Text.CSV library. It also covers some of the descriptive statistics functions, such as mean, median, and mode.
Chapter 2, SQLite3, focuses on how to get the data from CSV into SQLite3. You will understand the data types of SQLite3 and how to fetch data using SQL statements. It also covers how to create your own custom module of descriptive statistics.
Chapter 3, Regular Expressions, introduces you to regular expression syntax, such as dots and pipe. It also covers character classes at length. Finally, it teaches you how to use regular expressions within a CSV file and an SQLite3 database.
Chapter 4, Visualizations, starts with the installation of gnuplot and the EasyPlot Haskell library. It covers how to use a moving average function to analyze stock data. Finally, it teaches you how to make publication-ready plots by adding legends and saving those plots to files.
Chapter 5, Kernel Density Estimation, introduces you to the central limit theorem and the normal distribution and helps you to understand the relationship between them. Later, it talks about the kernel density estimator and how to apply it to a dataset.
Chapter 6, Course Review, works on the MovieLens data, applying what you have learned in the first five chapters. In addition to what was covered in the earlier chapters, you will also explore a few more interesting techniques for analyzing the data.
You will need to set up the IHaskell notebook environment to test the examples in these chapters. You will also need some knowledge of the Haskell programming language, math, and statistics.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Getting-Started-with-Haskell-Data-Analysis. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789802863_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We can see that we have the ability to download a CSV file called table.csv."
Any command-line input or output is written as follows:
sudo apt-get install sqlite3 libsqlite3-dev
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "We need to hit Apply, and then we have to hit Apply again."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this book, we are going to learn about data analysis from the perspective of the Haskell programming language. The goal of this book is to take you from being a beginner in math and statistics, to the point that you feel comfortable working with large-scale datasets. Now, the prerequisites for this book are that you know a little bit of the Haskell programming language, and also a little bit of math and statistics. From there, we can start you on your journey of becoming a data analyst.
In this chapter, we are going to cover descriptive statistics. Descriptive statistics are used to summarize a collection of values into one or two values. We begin by learning about the Haskell Text.CSV library. In later sections, we will cover, in increasing order of difficulty, the range, mean, standard deviation, median, and mode; you've probably heard of some of these descriptive statistics before, as they're quite common. We will be using the IHaskell environment on the Jupyter Notebook system.
The topics that we are going to cover are as follows:
The CSV library—working with CSV files
Data ranges
Data mean and standard deviation
Data median
Data mode
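As a hedged preview of the statistics listed above, here is one way they could be written in plain Haskell. The chapter develops its own versions step by step against the baseball data, so treat these standalone definitions as a sketch rather than the book's exact code:

```haskell
import Data.List (group, sort, sortBy)
import Data.Ord (Down (..), comparing)

-- Range: the spread between the largest and smallest values
range :: [Double] -> Double
range xs = maximum xs - minimum xs

-- Arithmetic mean
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Population standard deviation: square root of the mean squared deviation
stdDev :: [Double] -> Double
stdDev xs = sqrt (mean [(x - m) ^ 2 | x <- xs])
  where m = mean xs

-- Median: the middle of the sorted list (average of the two middles
-- when the list has an even number of elements)
median :: [Double] -> Double
median xs
  | odd n     = s !! mid
  | otherwise = (s !! (mid - 1) + s !! mid) / 2
  where
    s   = sort xs
    n   = length xs
    mid = n `div` 2

-- Mode: the most frequently occurring value
mode :: Ord a => [a] -> a
mode = head . head . sortBy (comparing (Down . length)) . group . sort
```

For example, mean [1, 2, 3, 4] gives 2.5, and mode [1, 2, 2, 3] gives 2.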
In this section, we're going to cover the basics of the CSV library and how to work with CSV files. To do this, we will be taking a closer look at the structure of a CSV file; how to install the Text.CSV Haskell library; and how to retrieve data from a CSV file from within Haskell.
Now to begin, we need a CSV file. So, I'm going to tab over to my Haskell environment, which is just a Debian Linux virtual machine running on my computer, and I'm going to go to the website at retrosheet.org. This is a website for baseball statistics, and we are going to use them to demonstrate the CSV library. Find the link for Data Downloads and click Game Logs, as follows:
Now, scroll down just a little bit and you should see game logs for every single season, going all the way back to 1871. For now, I would like to stick with the most recent complete season, which is 2015:
So, go ahead and click the 2015 link. We will have the option to download a ZIP file, so go ahead and click OK. Now, I'm going to tab over to my Terminal:
Let's go into the Downloads folder, and if we hit ls, we see that there's our ZIP file. Let's unzip that file and see what we have. Let's open up GL2015.TXT. This is a CSV file, and it will display something like the following:
A CSV file is a file of comma-separated values. So, you'll see that we have a file divided up, where each line in this file is a record, and each record represents a single game of baseball in the 2015 season; and inside every single record is a listing of values, separated by a comma. So, the very first game in this dataset is a game between the St. Louis Cardinals—that's SLN—and the Chicago Cubs—that's CHN—and this game took place on April 5th, 2015. The final score of this first game was 3-0, and every line in this file is a different game.
Now, CSV isn't a standard, but there are a few properties of a CSV file that I consider to be safe. Consider the following as my suggestions. A CSV file should keep one record per line. The first line should be a description of each column. In a future section, I'm going to tell you that we need to remove the header line; and you'll see that this particular file doesn't have a header line. I still like to see a description line for each column of values. If a field in a record includes a comma, then that field should be surrounded by double quote marks. Now, we don't see an example of this (at least, not on this first line), but we do see examples of many values having quote marks surrounding the value, such as the very first value in the file, the date:
In a CSV file, quote marks around a field are optional, unless the field contains a comma. While we're here, I would like to make a note of the tenth column in this file, which contains the number 3 on this particular row. This column holds the away-team score in every single record of this file. Make a note that the first value in the tenth column is a 3—we're going to come back to that later on.
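The quoting rule just described can be sketched in a few lines of Haskell. This simplified splitter is for illustration only, and the name splitRecord is ours, not part of any library; the Text.CSV parser that we install next handles the full format, including escaped quotes and newlines embedded in quoted fields:

```haskell
-- Split one CSV record into its fields, honoring quoted fields.
-- Simplified sketch: no support for escaped quotes inside a field.
splitRecord :: String -> [String]
splitRecord = go ""
  where
    go acc []         = [reverse acc]
    go acc (',':rest) = reverse acc : go "" rest
    go acc ('"':rest) =
      -- A quoted field runs until the closing quote; commas inside it
      -- are kept as part of the field rather than treated as separators.
      let (field, rest') = span (/= '"') rest
      in go (reverse field ++ acc) (drop 1 rest')
    go acc (c:rest)   = go (c : acc) rest
```

For example, splitRecord "a,\"b,c\",d" yields ["a","b,c","d"]: the quoted middle field keeps its comma.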
Our next task is installing the Text.CSV library; we do this using the Cabal tool, which connects with the Hackage repository and downloads the Text.CSV library:
The command that we use to start the install, shown in the first line of the preceding screenshot, is cabal install csv. It takes a moment to download the file, but it should download and install the Text.CSV library in our home folder. Now, let me describe what I currently have in my home folder:
I like to create a directory for my code called Code; and inside here, I have a directory called HaskellDataAnalysis. And inside HaskellDataAnalysis, I have two directories, called analysis and data. In the analysis folder, I would like to store my notebooks. In the data folder, I would like to store my datasets.
That way, I can keep a clear distinction between analysis files and data files. That means I need to move the data file we just downloaded into my data folder. So, copy GL2015.TXT from the Downloads folder into the data folder. If I do an ls on my data folder, I'll see that I've got my file. Now, I'm going to go into my analysis folder, which currently contains nothing, and I'm going to start the Jupyter Notebook as follows:
Type in jupyter notebook, which will start a web server on your computer. You can use your web browser in order to interact with Haskell:
The address for the Jupyter Notebook is the localhost, on port 8888. Now I'm going to create a new Haskell notebook. To do this, I click on the New drop-down button on the right side of the screen, and I find Haskell:
Let's begin by renaming our notebook Baseball, because we're going to be looking at baseball statistics:
I need to import the Text.CSV library that we just installed. Now, if your cursor is sitting in a text field and you hit Enter, you'll just make that text field larger, as shown in the following screenshot. Instead, in order to submit expressions to the Jupyter environment, you have to hit Shift + Enter on the keyboard:
So, now that we've imported Text.CSV, let's create our Baseball dataset and parse the dataset. The command for this is parseCSVFromFile, after which we pass in the location of our text file:
Great. If you didn't get a File Not Found error at this point, then that means you have successfully parsed the data from the CSV file. Now, let's explore the type of the baseball data. To do this, we enter :type followed by baseball, which is what we just created, and we see that the result is either a parsing error or a CSV file:
I've already done this, so I know that there aren't any parsing errors in our CSV file, but if there were, they would be represented by the ParseError type.
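The Either value that comes back can be handled with ordinary pattern matching. A minimal sketch follows; note that it uses Either String [[String]] as a stand-in, because the real ParseError type lives in the parser library underlying Text.CSV, and the helper countRecords is our own name, not a library function. The pattern-matching structure is the same either way:

```haskell
-- Inspect a parse result: report the error on the Left branch,
-- or summarize the parsed rows on the Right branch.
countRecords :: Either String [[String]] -> String
countRecords (Left err)   = "Parse failed: " ++ err
countRecords (Right rows) = "Parsed " ++ show (length rows) ++ " records"
```

For example, countRecords (Right [["SLN"], ["CHN"]]) gives "Parsed 2 records", while countRecords (Left "unexpected end of input") reports the failure.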
