Clojure Data Analysis Cookbook - Second Edition

Eric Rochester
    Book Description

    This book is for those with a basic knowledge of Clojure, who are looking to push the language to excel with data analysis.

    What you will learn

    • Read data from a variety of data formats
    • Transform data to make it more useful and easier to analyze
    • Process data concurrently and in parallel for faster performance
    • Harness multiple computers to analyze big data
    • Use powerful data analysis libraries such as Incanter, Hadoop, and Weka to get things done quickly
    • Apply powerful clustering and data mining techniques to better understand your data

    Who this book is for

    This book is for those with a basic knowledge of Clojure, who are looking to push the language to excel with data analysis.




    Table of Contents

    Clojure Data Analysis Cookbook Second Edition
    Credits
    About the Author
    About the Reviewers
    www.PacktPub.com
    Support files, eBooks, discount offers, and more
    Why subscribe?
    Free access for Packt account holders
    Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
    Downloading the example code
    Downloading the color images of this book
    Errata
    Piracy
    Questions
    1. Importing Data for Analysis
    Introduction
    Creating a new project
    Getting ready
    How to do it...
    How it works...
    Reading CSV data into Incanter datasets
    Getting ready
    How to do it…
    How it works…
    There's more…
    Reading JSON data into Incanter datasets
    Getting ready
    How to do it…
    How it works…
    Reading data from Excel with Incanter
    Getting ready
    How to do it…
    How it works…
    Reading data from JDBC databases
    Getting ready
    How to do it…
    How it works…
    See also
    Reading XML data into Incanter datasets
    Getting ready
    How to do it…
    How it works…
    There's more…
    Navigating structures with zippers
    Processing in a pipeline
    Comparing XML and JSON
    Scraping data from tables in web pages
    Getting ready
    How to do it…
    How it works…
    See also
    Scraping textual data from web pages
    Getting ready
    How to do it…
    How it works…
    Reading RDF data
    Getting ready
    How to do it…
    How it works…
    See also
    Querying RDF data with SPARQL
    Getting ready
    How to do it…
    How it works…
    There's more…
    Aggregating data from different formats
    Getting ready
    How to do it…
    Creating the triple store
    Scraping exchange rates
    Loading currency data and tying it all together
    How it works…
    See also
    2. Cleaning and Validating Data
    Introduction
    Cleaning data with regular expressions
    Getting ready
    How to do it…
    How it works…
    There's more...
    See also
    Maintaining consistency with synonym maps
    Getting ready
    How to do it…
    How it works…
    See also
    Identifying and removing duplicate data
    Getting ready
    How to do it…
    How it works…
    There's more…
    Regularizing numbers
    Getting ready
    How to do it…
    How it works…
    Calculating relative values
    Getting ready
    How to do it…
    How it works…
    Parsing dates and times
    Getting ready
    How to do it…
    There's more…
    Lazily processing very large data sets
    Getting ready
    How to do it…
    How it works…
    Sampling from very large data sets
    Getting ready
    How to do it…
    Sampling by percentage
    Sampling exactly
    How it works…
    Fixing spelling errors
    Getting ready
    How to do it…
    How it works…
    There's more…
    Parsing custom data formats
    Getting ready
    How to do it…
    How it works…
    Validating data with Valip
    Getting ready
    How to do it…
    How it works…
    3. Managing Complexity with Concurrent Programming
    Introduction
    Managing program complexity with STM
    Getting ready
    How to do it…
    How it works…
    See also
    Managing program complexity with agents
    Getting ready
    How to do it…
    How it works…
    See also
    Getting better performance with commute
    Getting ready
    How to do it…
    How it works…
    Combining agents and STM
    Getting ready
    How to do it…
    How it works…
    Maintaining consistency with ensure
    Getting ready
    How to do it…
    How it works…
    Introducing safe side effects into the STM
    Getting ready
    How to do it…
    Maintaining data consistency with validators
    Getting ready
    How to do it…
    How it works…
    See also
    Monitoring processing with watchers
    Getting ready
    How to do it…
    How it works…
    Debugging concurrent programs with watchers
    Getting ready
    How to do it…
    There's more...
    Recovering from errors in agents
    How to do it…
    Failing on errors
    Continuing on errors
    Using a custom error handler
    There's more...
    Managing large inputs with sized queues
    How to do it…
    How it works...
    4. Improving Performance with Parallel Programming
    Introduction
    Parallelizing processing with pmap
    How to do it…
    How it works…
    There's more…
    See also
    Parallelizing processing with Incanter
    Getting ready
    How to do it…
    How it works…
    Partitioning Monte Carlo simulations for better pmap performance
    Getting ready
    How to do it…
    How it works…
    Estimating with Monte Carlo simulations
    Chunking data for pmap
    Finding the optimal partition size with simulated annealing
    Getting ready
    How to do it…
    How it works…
    There's more…
    Combining function calls with reducers
    Getting ready
    How to do it…
    What happened here?
    There's more...
    See also
    Parallelizing with reducers
    Getting ready
    How to do it…
    How it works…
    See also
    Generating online summary statistics for data streams with reducers
    Getting ready
    How to do it…
    Using type hints
    Getting ready
    How to do it…
    How it works…
    See also
    Benchmarking with Criterium
    Getting ready
    How to do it…
    How it works…
    See also
    5. Distributed Data Processing with Cascalog
    Introduction
    Initializing Cascalog and Hadoop for distributed processing
    Getting ready
    How to do it…
    How it works…
    See also
    Querying data with Cascalog
    Getting ready
    How to do it…
    How it works…
    There's more
    Distributing data with Apache HDFS
    Getting ready
    How to do it…
    How it works…
    Parsing CSV files with Cascalog
    Getting ready
    How to do it…
    How it works…
    There's more
    Executing complex queries with Cascalog
    Getting ready
    How to do it…
    Aggregating data with Cascalog
    Getting ready
    How to do it…
    There's more
    Defining new Cascalog operators
    Getting ready
    How to do it…
    Creating map operators
    Creating map concatenation operators
    Creating filter operators
    Creating buffer operators
    Creating aggregate operators
    Creating parallel aggregate operators
    Composing Cascalog queries
    Getting ready
    How to do it…
    How it works…
    Transforming data with Cascalog
    Getting ready
    How to do it…
    How it works…
    6. Working with Incanter Datasets
    Introduction
    Loading Incanter's sample datasets
    Getting ready
    How to do it…
    How it works…
    There's more...
    Loading Clojure data structures into datasets
    Getting ready
    How to do it…
    How it works…
    See also…
    Viewing datasets interactively with view
    Getting ready
    How to do it…
    How it works…
    See also…
    Converting datasets to matrices
    Getting ready
    How to do it…
    How it works…
    There's more…
    See also…
    Using infix formulas in Incanter
    Getting ready
    How to do it…
    How it works…
    Selecting columns with $
    Getting ready
    How to do it…
    How it works…
    There's more…
    See also…
    Selecting rows with $
    Getting ready
    How to do it…
    How it works…
    Filtering datasets with $where
    Getting ready
    How to do it…
    How it works…
    There's more…
    Grouping data with $group-by
    Getting ready
    How to do it…
    How it works…
    Saving datasets to CSV and JSON
    Getting ready
    How to do it…
    Saving data as CSV
    Saving data as JSON
    How it works…
    See also…
    Projecting from multiple datasets with $join
    Getting ready
    How to do it…
    How it works…
    7. Statistical Data Analysis with Incanter
    Introduction
    Generating summary statistics with $rollup
    Getting ready
    How to do it…
    How it works…
    Working with changes in values
    Getting ready
    How to do it…
    How it works…
    Scaling variables to simplify variable relationships
    Getting ready
    How to do it…
    How it works…
    Working with time series data with Incanter Zoo
    Getting ready
    How to do it…
    There's more...
    Smoothing variables to decrease variation
    Getting ready
    How to do it…
    How it works…
    Validating sample statistics with bootstrapping
    Getting ready
    How to do it…
    How it works…
    There's more…
    Modeling linear relationships
    Getting ready
    How to do it…
    How it works…
    Modeling non-linear relationships
    Getting ready
    How to do it…
    How it works...
    Modeling multinomial Bayesian distributions
    Getting ready
    How to do it…
    How it works…
    There's more...
    Finding data errors with Benford's law
    Getting ready
    How to do it…
    How it works…
    There's more…
    8. Working with Mathematica and R
    Introduction
    Setting up Mathematica to talk to Clojuratica for Mac OS X and Linux
    Getting ready
    How to do it…
    How it works…
    There's more…
    Setting up Mathematica to talk to Clojuratica for Windows
    Getting ready
    How to do it...
    How it works...
    Calling Mathematica functions from Clojuratica
    Getting ready
    How to do it…
    How it works…
    Sending matrixes to Mathematica from Clojuratica
    Getting ready
    How to do it…
    How it works…
    Evaluating Mathematica scripts from Clojuratica
    Getting ready
    How to do it…
    How it works…
    Creating functions from Mathematica
    Getting ready
    How to do it…
    How it works…
    Setting up R to talk to Clojure
    Getting ready
    How to do it…
    Setting up R
    Setting up Clojure
    How it works…
    Calling R functions from Clojure
    Getting ready
    How to do it…
    How it works…
    There's more…
    Passing vectors into R
    Getting ready
    How to do it…
    How it works…
    Evaluating R files from Clojure
    Getting ready
    How to do it…
    How it works…
    There's more…
    Plotting in R from Clojure
    Getting ready
    How to do it…
    How it works…
    There's more…
    9. Clustering, Classifying, and Working with Weka
    Introduction
    Loading CSV and ARFF files into Weka
    Getting ready
    How to do it…
    How it works…
    There's more…
    See also…
    Filtering, renaming, and deleting columns in Weka datasets
    Getting ready
    How to do it…
    Renaming columns
    Removing columns
    Hiding columns
    How it works…
    Discovering groups of data using K-Means clustering
    Getting ready
    How to do it…
    How it works…
    Clustering with K-Means
    Analyzing the results
    Building macros
    See also…
    Finding hierarchical clusters in Weka
    Getting ready
    How to do it…
    How it works…
    There's more…
    Clustering with SOMs in Incanter
    Getting ready
    How to do it…
    How it works…
    There's more…
    Classifying data with decision trees
    Getting ready
    How to do it…
    How it works…
    There's more…
    Classifying data with the Naive Bayesian classifier
    Getting ready
    How to do it…
    How it works…
    There's more…
    Classifying data with support vector machines
    Getting ready
    How to do it…
    There's more…
    Finding associations in data with the Apriori algorithm
    Getting ready
    How to do it…
    How it works…
    There's more…
    10. Working with Unstructured and Textual Data
    Introduction
    Tokenizing text
    Getting ready
    How to do it…
    How it works…
    Finding sentences
    Getting ready
    How to do it…
    How it works…
    Focusing on content words with stoplists
    Getting ready
    How to do it…
    Getting document frequencies
    Getting ready
    How to do it…
    Scaling document frequencies by document size
    Getting ready
    How to do it…
    How it works…
    Scaling document frequencies with TF-IDF
    Getting ready
    How to do it…
    How it works…
    Finding people, places, and things with Named Entity Recognition
    Getting ready
    How to do it…
    How it works…
    Mapping documents to a sparse vector space representation
    Getting ready…
    How to do it…
    Performing topic modeling with MALLET
    Getting ready
    How to do it…
    How it works…
    See also…
    Performing naïve Bayesian classification with MALLET
    Getting ready
    How to do it…
    How it works…
    There's more…
    See also…
    11. Graphing in Incanter
    Introduction
    Creating scatter plots with Incanter
    Getting ready
    How to do it...
    How it works...
    There's more...
    See also
    Graphing non-numeric data in bar charts
    Getting ready
    How to do it...
    How it works...
    Creating histograms with Incanter
    Getting ready
    How to do it...
    How it works...
    Creating function plots with Incanter
    Getting ready
    How to do it...
    How it works...
    See also
    Adding equations to Incanter charts
    Getting ready
    How to do it...
    There's more...
    Adding lines to scatter charts
    Getting ready
    How to do it...
    How it works...
    See also
    Customizing charts with JFreeChart
    Getting ready
    How to do it...
    How it works...
    See also
    Customizing chart colors and styles
    Getting ready
    How to do it...
    Saving Incanter graphs to PNG
    Getting ready
    How to do it...
    How it works...
    Using PCA to graph multi-dimensional data
    Getting ready
    How to do it...
    How it works...
    There's more...
    Creating dynamic charts with Incanter
    Getting ready
    How to do it...
    How it works...
    12. Creating Charts for the Web
    Introduction
    Serving data with Ring and Compojure
    Getting ready
    How to do it…
    Configuring and setting up the web application
    Serving data
    Defining routes and handlers
    Running the server
    How it works…
    There's more…
    Creating HTML with Hiccup
    Getting ready
    How to do it…
    How it works…
    There's more…
    Setting up to use ClojureScript
    Getting ready
    How to do it…
    How it works…
    There's more…
    Creating scatter plots with NVD3
    Getting ready
    How to do it…
    How it works…
    There's more…
    Creating bar charts with NVD3
    Getting ready
    How to do it…
    How it works…
    Creating histograms with NVD3
    Getting ready
    How to do it…
    How it works…
    Creating time series charts with D3
    Getting ready
    How to do it…
    How it works…
    There's more…
    Visualizing graphs with force-directed layouts
    Getting ready
    How to do it…
    How it works…
    There's more…
    Creating interactive visualizations with D3
    Getting ready
    How to do it…
    How it works…
    There's more…
    Index

    Clojure Data Analysis Cookbook Second Edition

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: March 2013

    Second edition: January 2015

    Production reference: 1220115

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78439-029-7

    www.packtpub.com

    Credits

    Author

    Eric Rochester

    Reviewers

    Vitomir Kovanovic

    Muktabh Mayank Srivastava

    Federico Tomassetti

    Commissioning Editor

    Ashwin Nair

    Acquisition Editor

    Sam Wood

    Content Development Editor

    Parita Khedekar

    Technical Editor

    Ryan Kochery

    Copy Editors

    Dipti Kapadia

    Puja Lalwani

    Vikrant Phadke

    Project Coordinator

    Neha Thakur

    Proofreaders

    Ameesha Green

    Joel T. Johnson

    Samantha Lyon

    Indexer

    Priya Sane

    Graphics

    Sheetal Aute

    Disha Haria

    Production Coordinator

    Nitesh Thakur

    Cover Work

    Nitesh Thakur

    About the Author

    Eric Rochester enjoys reading, writing, and spending time with his wife and kids. When he’s not doing these things, he programs in a variety of languages and platforms, including websites and systems in Python, and libraries for linguistics and statistics in C#. Currently, he is exploring functional programming languages, including Clojure and Haskell. He works at Scholars’ Lab in the library at the University of Virginia, helping humanities professors and graduate students realize their digitally informed research agendas. He is also the author of Mastering Clojure Data Analysis, Packt Publishing.

    I’d like to thank everyone. My technical reviewers proved invaluable. Also, thank you to the editorial staff at Packt Publishing. This book is much stronger because of all of their feedback, and any remaining deficiencies are mine alone.

    A special thanks to Jackie, Melina, and Micah. They’ve been patient and supportive while I worked on this project. It is, in every way, for them.

    About the Reviewers

    Vitomir Kovanovic is a PhD student at the School of Informatics, University of Edinburgh, Edinburgh, UK. He received an MSc degree in computer science and software engineering in 2011, and BSc in information systems and business administration in 2009 from the University of Belgrade, Serbia. His research interests include learning analytics, educational data mining, and online education. He is a member of the Society for Learning Analytics Research and a member of program committees of several conferences and journals in technology-enhanced learning. In his PhD research, he focuses on the use of trace data for understanding the effects of technology use on the quality of the social learning process and learning outcomes. For more information, visit http://vitomir.kovanovic.info/

    Muktabh Mayank Srivastava is a data scientist and the cofounder of ParallelDots.com. Previously, he helped in solving many complex data analysis and machine learning problems for clients from different domains such as healthcare, retail, procurement, automation, Bitcoin, social recommendation engines, geolocation fact-finding, customer profiling, and so on.

    His new venture is ParallelDots. It is a tool that allows any content archive to be presented in a story using advanced techniques of NLP and machine learning. For publishers and bloggers, it automatically creates a timeline of any event using their archive and presents it in an interactive, intuitive, and easy-to-navigate interface on their webpage. You can find him on LinkedIn at http://in.linkedin.com/in/muktabh/ and on Twitter at @muktabh / @ParallelDots.

    Federico Tomassetti has been programming since he was a child and has a PhD in software engineering. He works as a consultant on model-driven development and domain-specific languages, writes technical articles, teaches programming, and works as a full-stack software engineer.

    He has experience working in Italy, Germany, and Ireland, and he is currently working at Groupon International.

    You can read about his projects on http://federico-tomassetti.it/ or https://github.com/ftomassetti/.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. Here, you can search, access, and read Packt’s entire library of books.

    Why subscribe?

    • Fully searchable across every book published by Packt
    • Copy and paste, print, and bookmark content
    • On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    Welcome to the second edition of Clojure Data Analysis Cookbook! It seems that books become obsolete almost as quickly as software does, so here we have the opportunity to keep things up-to-date and useful.

    Moreover, the state of the art of data analysis is also still evolving and changing. The techniques and technologies are being refined and improved. Hopefully, this book will capture some of that. I've also added a new chapter on how to work with unstructured textual data.

    In spite of these changes, some things have stayed the same. Clojure has further proven itself to be an excellent environment to work with data. As a member of the Lisp family of languages, it inherits a flexibility and power that are hard to match. The concurrency and parallelization features have further proven themselves as great tools for developing software and analyzing data.

    Clojure's usefulness for data analysis is further improved by a number of strong libraries. Incanter provides a practical environment to work with data and perform statistical analysis. Cascalog is an easy-to-use wrapper over Hadoop and Cascading. Finally, when you're ready to publish your results, ClojureScript, an implementation of Clojure that generates JavaScript, can help you to visualize your data in an effective and persuasive way.

    Moreover, Clojure runs on the Java Virtual Machine (JVM), so any libraries written for Java are available too. This gives Clojure an incredible amount of breadth and power.

    I hope that this book will give you the tools and techniques you need to get answers from your data.

    What this book covers

    Chapter 1, Importing Data for Analysis, covers how to read data from a variety of sources, including CSV files, web pages, and linked semantic web data.

    Chapter 2, Cleaning and Validating Data, presents strategies and implementations to normalize dates, fix spelling, and work with large datasets. Getting data into a usable shape is an important, but often overlooked, stage of data analysis.

    Chapter 3, Managing Complexity with Concurrent Programming, covers Clojure's concurrency features and how you can use them to simplify your programs.

    Chapter 4, Improving Performance with Parallel Programming, covers how to use Clojure's parallel processing capabilities to speed up the processing of data.

    Chapter 5, Distributed Data Processing with Cascalog, covers how to use Cascalog as a wrapper over Hadoop and the Cascading library to process large amounts of data distributed over multiple computers.

    Chapter 6, Working with Incanter Datasets, covers the basics of working with Incanter datasets. Datasets are the core data structures used by Incanter, and understanding them is necessary in order to use Incanter effectively.

    Chapter 7, Statistical Data Analysis with Incanter, covers a variety of statistical processes and tests used in data analysis. Some of these are quite simple, such as generating summary statistics. Others are more complex, such as performing linear regressions and auditing data with Benford's Law.

    Chapter 8, Working with Mathematica and R, talks about how to set up Clojure in order to talk to Mathematica or R. These are powerful data analysis systems, and we might want to use them sometimes. This chapter will show you how to get these systems to work together, as well as some tasks that you can perform once they are communicating.

    Chapter 9, Clustering, Classifying, and Working with Weka, covers more advanced machine learning techniques. In this chapter, we'll primarily use the Weka machine learning library. Some recipes will discuss how to use it and the data structures it's built on, while other recipes will demonstrate machine learning algorithms.

    Chapter 10, Working with Unstructured and Textual Data, looks at tools and techniques used to extract information from the reams of unstructured, textual data.

    Chapter 11, Graphing in Incanter, shows you how to generate graphs and other visualizations in Incanter. These can be important for exploring and learning about your data and also for publishing and presenting your results.

    Chapter 12, Creating Charts for the Web, shows you how to set up a simple web application in order to present findings from data analysis. It will include a number of recipes that leverage the powerful D3 visualization library.

    What you need for this book

    One piece of software required for this book is the Java Development Kit (JDK), which you can obtain from http://www.oracle.com/technetwork/java/javase/downloads/index.html. JDK is necessary to run and develop on the Java platform.

    The other major piece of software that you'll need is Leiningen 2, which you can download and install from http://leiningen.org/. Leiningen 2 is a tool used to manage Clojure projects and their dependencies. It has become the de facto standard project tool in the Clojure community.

    Throughout this book, we'll use a number of other Clojure and Java libraries, including Clojure itself. Leiningen will take care of downloading these for us as we need them.

    You'll also need a text editor or Integrated Development Environment (IDE). If you already have a text editor of your choice, you can probably use it. See http://clojure.org/getting_started for tips and plugins for using your particular favorite environment. If you don't have a preference, I'd suggest that you take a look at using Eclipse with Counterclockwise. There are instructions for this setup at https://code.google.com/p/counterclockwise/.

    That is all that's required. However, at various places throughout the book, some recipes will access other software. The recipes in Chapter 8, Working with Mathematica and R, that are related to Mathematica will require Mathematica, obviously, and those related to R will require R. However, these programs aren't used in the rest of the book, so whether you're interested in those recipes might depend on whether you already have this software.

    Who this book is for

    This book is for programmers or data scientists who are familiar with Clojure and want to use it in their data analysis processes. This isn't a tutorial on Clojure—there are already a number of excellent introductory books out there—so you'll need to be familiar with the language, but you don't need to be an expert.

    Likewise, you don't have to be an expert on data analysis, although you should probably be familiar with its tasks, processes, and techniques. While you might be able to glean enough from these recipes to get started with, for it to be truly effective, you'll want to get a more thorough introduction to this field.

    Conventions

    In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Now, there will be a new subdirectory named getting-data."

    A block of code is set as follows:

    (defproject getting-data "0.1.0-SNAPSHOT"
      :description "FIXME: write description"
      :url "http://example.com/FIXME"
      :license {:name "Eclipse Public License"
                :url "http://www.eclipse.org/legal/epl-v10.html"}
      :dependencies [[org.clojure/clojure "1.6.0"]])

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    (defn watch-debugging
      [input-files]
      (let [reader (agent (seque (mapcat lazy-read-csv input-files)))
            caster (agent nil)
            sink (agent [])
            counter (ref 0)
            done (ref false)]
        (add-watch caster :counter (partial watch-caster counter))
        (add-watch caster :debug debug-watch)
        (send reader read-row caster sink done)
        (wait-for-it 250 done)
        {:results @sink
         :count-watcher @counter}))

    Any command-line input or output is written as follows:

    $ lein new getting-data
    Generating a project called getting-data based on the default template.
    To see other templates (app, lein plugin, etc), try lein help new.

    New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Take a look at the Hadoop website for the Getting Started documentation of your version. Get a single node setup working".

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

    To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Downloading the color images of this book

    We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/B03480_coloredimages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <[email protected]> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors, and our ability to bring you valuable content.

    Questions

    You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

    Chapter 1. Importing Data for Analysis

    In this chapter, we will cover the following recipes:

    • Creating a new project
    • Reading CSV data into Incanter datasets
    • Reading JSON data into Incanter datasets
    • Reading data from Excel with Incanter
    • Reading data from JDBC databases
    • Reading XML data into Incanter datasets
    • Scraping data from tables in web pages
    • Scraping textual data from web pages
    • Reading RDF data
    • Querying RDF data with SPARQL
    • Aggregating data from different formats

    Introduction

    There's not much data analysis that can be done without data, so the first step in any project is to evaluate the data we have and the data that we need. Once we have some idea of what we'll need, we have to figure out how to get it.

    Many of the recipes in this chapter and in this book use Incanter (http://incanter.org/) to import the data and target Incanter datasets. Incanter is a library that is used for statistical analysis and graphics in Clojure, similar to R (http://www.r-project.org/), an open source language for statistical computing. Incanter might not be suitable for every task (for example, we'll use the Weka library for machine learning later), but it is still an important part of our toolkit for doing data analysis in Clojure. This chapter has a collection of recipes that can be used to gather data and make it accessible to Clojure.

    For the very first recipe, we'll take a look at how to start a new project. We'll start with very simple formats such as comma-separated values (CSV) and move into reading data from relational databases using JDBC. Then we'll examine more complicated data sources, such as web scraping and linked data (RDF).

    Creating a new project

    Over the course of this book, we're going to use a number of third-party libraries and external dependencies. We will need a tool to download them and track them. We also need a tool to set up the environment and start a REPL (read-eval-print-loop or interactive interpreter) that can access our code or to execute our program. REPLs allow you to program interactively. It's a great environment for exploratory programming, irrespective of whether that means exploring library APIs or exploring data.

    We'll use Leiningen for this (http://leiningen.org/). This has become a standard package automation and management system.

    Getting ready

    Visit the Leiningen site and download the lein script. This will download the Leiningen JAR file when it's needed. The instructions are clear, and it's a simple process.

    How to do it...

    To generate a new project, use the lein new command, passing the name of the project to it:

    $ lein new getting-data
    Generating a project called getting-data based on the default template.
    To see other templates (app, lein plugin, etc), try lein help new.

    There will be a new subdirectory named getting-data. It will contain files with stubs for the getting-data.core namespace and for tests.

    How it works...

    The new project directory also contains a file named project.clj. This file contains metadata about the project, such as its name, version, license, and more. It also contains a list of the dependencies that our code will use, as shown in the following snippet. The specifications that this file uses allow it to search Maven repositories and directories of Clojure libraries (Clojars, https://clojars.org/) in order to download the project's dependencies. Thus, it integrates well with Java's own packaging system as developed with Maven (http://maven.apache.org/).

    (defproject getting-data "0.1.0-SNAPSHOT"
      :description "FIXME: write description"
      :url "http://example.com/FIXME"
      :license {:name "Eclipse Public License"
                :url "http://www.eclipse.org/legal/epl-v10.html"}
      :dependencies [[org.clojure/clojure "1.6.0"]])

    In the Getting ready section of each recipe, we'll see the libraries that we need to list in the :dependencies section of this file. Then, when you run any lein command, it will download the dependencies first.

    Reading CSV data into Incanter datasets

    One of the simplest data formats is comma-separated values (CSV), and you'll find that it's everywhere. Excel reads and writes CSV directly, as do most databases. Also, because it's really just plain text, it's easy to generate CSV files or to access them from any programming language.

    Getting ready

    First, let's make sure that we have the correct libraries loaded. Here's how the project Leiningen (https://github.com/technomancy/leiningen) project.clj file should look (although you might be able to use more up-to-date versions of the dependencies):

    (defproject getting-data "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.6.0"]
                     [incanter "1.5.5"]])

    Tip

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Also, in your REPL or your file, include these lines:

    (use 'incanter.core 'incanter.io)

    Finally, download a list of rest area locations from POI Factory at http://www.poi-factory.com/node/6643. The data is in a file named data/RestAreasCombined(Ver.BN).csv. The version designation might be different, though, as the file is updated. You'll also need to register on the site in order to download the data. The file contains the locations and descriptions of the rest stops along the highway:

    -67.834062,46.141129,"REST AREA-FOLLOW SIGNS SB I-95 MM305","RR, PT, Pets, HF"
    -67.845906,46.138084,"REST AREA-FOLLOW SIGNS NB I-95 MM305","RR, PT, Pets, HF"
    -68.498471,45.659781,"TURNOUT NB I-95 MM249","Scenic Vista-NO FACILITIES"
    -68.534061,45.598464,"REST AREA SB I-95 MM240","RR, PT, Pets, HF"

    In the project directory, we have to create a subdirectory named data and place the file in this subdirectory.

    I also created a copy of this file with a row listing the names of the columns and named it RestAreasCombined(Ver.BN)-headers.csv.
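    Judging from the column keywords in the output below, that added header row presumably reads something like this:

    longitude,latitude,name,codes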

    How to do it…

    Now, use the incanter.io/read-dataset function in your REPL:
    user=> (read-dataset "data/RestAreasCombined(Ver.BN).csv")

    |      :col0 |     :col1 |                                :col2 |                      :col3 |
    |------------+-----------+--------------------------------------+----------------------------|
    | -67.834062 | 46.141129 | REST AREA-FOLLOW SIGNS SB I-95 MM305 |           RR, PT, Pets, HF |
    | -67.845906 | 46.138084 | REST AREA-FOLLOW SIGNS NB I-95 MM305 |           RR, PT, Pets, HF |
    | -68.498471 | 45.659781 |                TURNOUT NB I-95 MM249 | Scenic Vista-NO FACILITIES |
    | -68.534061 | 45.598464 |              REST AREA SB I-95 MM240 |           RR, PT, Pets, HF |
    | -68.539034 | 45.594001 |              REST AREA NB I-95 MM240 |           RR, PT, Pets, HF |
    …
    If we have a header row in the CSV file, then we include :header true in the call to read-dataset:
    user=> (read-dataset "data/RestAreasCombined(Ver.BN)-headers.csv" :header true)

    | :longitude | :latitude |                                :name |                     :codes |
    |------------+-----------+--------------------------------------+----------------------------|
    | -67.834062 | 46.141129 | REST AREA-FOLLOW SIGNS SB I-95 MM305 |           RR, PT, Pets, HF |
    | -67.845906 | 46.138084 | REST AREA-FOLLOW SIGNS NB I-95 MM305 |           RR, PT, Pets, HF |
    | -68.498471 | 45.659781 |                TURNOUT NB I-95 MM249 | Scenic Vista-NO FACILITIES |
    | -68.534061 | 45.598464 |              REST AREA SB I-95 MM240 |           RR, PT, Pets, HF |
    | -68.539034 | 45.594001 |              REST AREA NB I-95 MM240 |           RR, PT, Pets, HF |
    …

    How it works…

    Together, Clojure and Incanter make a lot of common tasks easy, which is shown in the How to do it section of this recipe.

    We've taken some external data, in this case from a CSV file, and loaded it into an Incanter dataset. In Incanter, a dataset is a table, similar to a sheet in a spreadsheet or a database table. Each column has one field of data, and each row has an observation of data. Some columns will contain string data, some will contain dates, and some will contain numeric data. Incanter tries to automatically detect when a column contains numeric data and converts it to a Java int or double. Incanter takes away a lot of the effort involved with importing data.
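    If you want to check this type conversion yourself, here's a quick REPL sketch (sel and col-names come from incanter.core, which we've already loaded; the file name is the one used in this recipe):

    (def rest-areas
      (read-dataset "data/RestAreasCombined(Ver.BN).csv"))

    ;; The first column was detected as numeric, so this returns a
    ;; double, not a string.
    (sel rest-areas :rows 0 :cols :col0)
    ;; => -67.834062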

    There's more…

    For more information about Incanter datasets, see Chapter 6, Working with Incanter Datasets.

    Reading JSON data into Incanter datasets

    Another data format that's becoming increasingly popular is JavaScript Object Notation (JSON, http://json.org/). Like CSV, this is a plain text format, so it's easy for programs to work with. It provides more information about the data than CSV does, but at the cost of being more verbose. It also allows the data to be structured in more complicated ways, such as hierarchies or sequences of hierarchies.

    Because JSON is a much richer data model than CSV, we might need to transform the data. In that case, we can just pull out the information we're interested in and flatten the nested maps before we pass it to Incanter. In this recipe, however, we'll just work with fairly simple data structures.
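    As a minimal sketch of that kind of flattening, assuming feed entries shaped like the sample record shown under Getting ready (flatten-entry is a hypothetical helper, not part of this recipe):

    (defn flatten-entry
      "Pulls a few fields of interest out of one nested feed entry."
      [entry]
      {:link   (get entry "link")
       :author (get entry "author")
       :title  (get-in entry ["title_detail" "value"])})

    ;; (i/to-dataset (map flatten-entry entries)) would then yield a
    ;; flat dataset with :link, :author, and :title columns.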

    Getting ready

    First, here are the contents of the Leiningen project.clj file:

    (defproject getting-data "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.6.0"]
                     [incanter "1.5.5"]
                     [org.clojure/data.json "0.2.5"]])

    Use these libraries in your REPL or program (inside an ns form):

    (require '[incanter.core :as i]
             '[clojure.data.json :as json]
             '[clojure.java.io :as io])
    (import '[java.io EOFException])

    Moreover, you need some data. For this, I have a file named delicious-rss-214k.json and placed it in the folder named data. It contains a number of top-level JSON objects. For example, the first one starts like this:

    { "guidislink": false, "link": "http://designreviver.com/tips/a-collection-of-wordpress-tutorials-tips-and-themes/", "title_detail": { "base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100", "value": "A Collection of Wordpress Tutorials, Tips and Themes | Design Reviver", "language": null, "type": "text/plain" }, "author": "mccarrd4", …

    You can download this data file, which originally came from Infochimps, from http://www.ericrochester.com/clj-data-analysis/data/delicious-rss-214k.json.xz. You'll need to decompress it into the data directory.

    How to do it…

    Once everything's in place, we'll need a couple of functions to make it easier to handle the multiple JSON objects at the top level of the file:

    We'll need a function that attempts to call a function on an instance of java.io.Reader and returns nil if there's an EOFException, in case there's a problem reading the file:
    (defn test-eof [reader f]
      (try
        (f reader)
        (catch EOFException e
          nil)))
    Now, we'll build on this to repeatedly parse JSON documents from an instance of java.io.Reader. We do this by calling test-eof repeatedly until it returns nil at the end of the file, accumulating the returned values as we go:
    (defn read-all-json [reader]
      (loop [accum []]
        (if-let [record (test-eof reader json/read)]
          (recur (conj accum record))
          accum)))
    Finally, we'll perform the previously mentioned two steps to read the data from the file:
    (def d
      (i/to-dataset
        (with-open [r (io/reader "data/delicious-rss-214k.json")]
          (read-all-json r))))

    This binds d to a new dataset that contains the information read in from the JSON documents.

    How it works…

    Like all Lisps (Lisp is short for list processing), Clojure is usually read from the inside out and from right to left. Let's break it down. clojure.java.io/reader opens the file for reading. read-all-json parses all of the JSON documents in the file into a sequence; in this case, it returns a vector of maps. incanter.core/to-dataset takes a sequence of maps and returns an Incanter dataset. This dataset will use the keys in the maps as column names, and it will convert the data values into a matrix. Actually, to-dataset can accept many different data structures. Try (doc to-dataset) in the REPL (doc shows the documentation string attached to a function), or see the Incanter documentation at http://data-sorcery.org/contents/ for more information.
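    For example, here's to-dataset applied to a literal sequence of maps (a tiny illustration, not the recipe's data):

    user=> (i/to-dataset [{:a 1, :b 2} {:a 3, :b 4}])

    | :a | :b |
    |----+----|
    |  1 |  2 |
    |  3 |  4 |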

    Reading data from Excel with Incanter

    We've seen how Incanter makes a lot of common data-processing tasks very simple, and reading an Excel spreadsheet is another example of this.

    Getting ready

    First, make sure that your Leiningen project.clj file contains the right dependencies:

    (defproject getting-data "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.6.0"]
                     [incanter "1.5.5"]])

    Also, make sure that you've loaded those packages into the REPL or script:

    (use 'incanter.core 'incanter.excel)

    Find the Excel spreadsheet you want to work on. The file name of my spreadsheet is data/small-sample-header.xls. You can download this from http://www.ericrochester.com/clj-data-analysis/data/small-sample-header.xls.

    How to do it…

    Now, all you need to do is call incanter.excel/read-xls:

    user=> (read-xls "data/small-sample-header.xls")

    | given-name | surname | relation |
    |------------+---------+----------|
    |      Gomez |  Addams |   father |
    |   Morticia |  Addams |   mother |
    |    Pugsley |  Addams |  brother |

    How it works…

    This can read both standard Excel files (.xls) and the newer XML-based file format introduced in Excel 2007 (.xlsx).
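    read-xls also takes a couple of options; in Incanter 1.5.x, :sheet selects a tab by index or name, and :header-keywords converts the header row into keyword column names. A sketch, assuming the spreadsheet's first sheet holds the data:

    (read-xls "data/small-sample-header.xls"
              :sheet 0
              :header-keywords true)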

    Reading data from JDBC databases

    Reading data from a relational database is only slightly more complicated than reading from Excel, and much of the extra complication involves connecting to the database.

    Fortunately, there's a Clojure-contributed package that sits on top of JDBC (the Java database connector API, http://www.oracle.com/technetwork/java/javase/jdbc/index.html) and makes working with databases much easier. In this example, we'll load a table from an SQLite database (http://www.sqlite.org/), which stores the database in a single file.

    Getting ready

    First, list the dependencies in your Leiningen project.clj file. We will also need to include the database driver library. For this example, it is org.xerial/sqlite-jdbc:

    (defproject getting-data "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.6.0"]
                     [incanter "1.5.5"]
                     [org.clojure/java.jdbc "0.3.3"]
                     [org.xerial/sqlite-jdbc "3.7.15-M1"]])

    Then, load the modules into your REPL or script file:

    (require '[incanter.core :as i] '[clojure.java.jdbc :as j])

    Finally, get the database connection information. I have my data in an SQLite database file named data/small-sample.sqlite. You can download this from http://www.ericrochester.com/clj-data-analysis/data/small-sample.sqlite.

    How to do it…

    Loading the data is not complicated, but we'll make it easier with a wrapper function:

    We'll create a function that takes a database connection map and a table name and returns a dataset created from this table:
    (defn load-table-data
      "This loads the data from a database table."
      [db table-name]
      (i/to-dataset
        (j/query db (str "SELECT * FROM " table-name ";"))))
    Next, we define a database map with the connection parameters suitable for our database:
    (def db
      {:subprotocol "sqlite"
       :subname "data/small-sample.sqlite"
       :classname "org.sqlite.JDBC"})
    Finally, call load-table-data with db and a table name as a symbol or string:
    user=> (load-table-data db 'people)

    | :relation | :surname | :given_name |
    |-----------+----------+-------------|
    |    father |   Addams |       Gomez |
    |    mother |   Addams |    Morticia |
    |   brother |   Addams |     Pugsley |
    …

    How it works…

    The load-table-data function passes the database connection information directly through to clojure.java.jdbc/query. It creates an SQL query that returns all of the fields in the table that is passed in. Each row of the result is a sequence of hashes mapping column names to data values. This sequence is wrapped in a dataset by incanter.core/to-dataset.
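    Note that building the SQL with str is only reasonable here because the table name comes from our own code. For values that originate outside your program, clojure.java.jdbc's parameterized vector form is safer. A small sketch against the same database:

    (j/query db ["SELECT * FROM people WHERE surname = ?" "Addams"])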

    See also

    Connecting to different database systems using JDBC isn't necessarily a difficult task, but it's dependent on which database you wish to connect to. Oracle has a tutorial for how to work with JDBC at http://docs.oracle.com/javase/tutorial/jdbc/basics, and the documentation for the clojure.java.jdbc library has some good information too (http://clojure.github.com/java.jdbc/). If you're trying to find out what the connection string looks like for a database system, there are lists available online. The list at http://www.java2s.com/Tutorial/Java/0340__Database/AListofJDBCDriversconnectionstringdrivername.htm includes the major drivers.

    Reading XML data into Incanter datasets

    One of the most popular formats for data is XML. Some people love it, while some hate it. However, almost everyone has to deal with it at some point. While Clojure can use Java's XML libraries, it also has its own package which provides a more natural way to work with XML in Clojure.

    Getting ready

    First, include these dependencies in your Leiningen project.clj file:

    (defproject getting-data "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.6.0"]
                     [incanter "1.5.5"]])

    Use these libraries in your REPL or program:

    (require '[incanter.core :as i] '[clojure.xml :as xml] '[clojure.zip :as zip])

    Then, find a data file. I visited the website for the Open Data Catalog for Washington, D.C. (http://data.octo.dc.gov/), and downloaded the data for the 2013 crime incidents. I moved this file to data/crime_incidents_2013_plain.xml. This is how the contents of the file look:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <dcst:ReportedCrimes xmlns:dcst="http://dc.gov/dcstat/types/1.0/">
      <dcst:ReportedCrime xmlns:dcst="http://dc.gov/dcstat/types/1.0/">
        <dcst:ccn><![CDATA[04104147]]></dcst:ccn>
        <dcst:reportdatetime>
          2013-04-16T00:00:00-04:00
        </dcst:reportdatetime>
    …

    How to do it…

    Now, let's see how to load this file into an Incanter dataset:

    The solution for this recipe is a little more complicated, so we'll wrap it into a function:
    (defn load-xml-data [xml-file first-data next-data]
      (let [data-map (fn [node]
                       [(:tag node) (first (:content node))])]
        (->> (xml/parse xml-file)
             zip/xml-zip
             first-data
             (iterate next-data)
             (take-while #(not (nil? %)))
             (map zip/children)
             (map #(mapcat data-map %))
             (map #(apply array-map %))
             i/to-dataset)))
    We can call the function like this. Because there are so many columns, we'll just verify the data that is loaded by looking at the column names and the row count:
    user=> (def d
             (load-xml-data "data/crime_incidents_2013_plain.xml"
                            zip/down zip/right))
    user=> (i/col-names d)
    [:dcst:ccn :dcst:reportdatetime :dcst:shift :dcst:offense :dcst:method :dcst:lastmodifieddate :dcst:blocksiteaddress :dcst:blockxcoord :dcst:blockycoord :dcst:ward :dcst:anc :dcst:district :dcst:psa :dcst:neighborhoodcluster :dcst:businessimprovementdistrict :dcst:block_group :dcst:census_tract :dcst:voting_precinct :dcst:start_date :dcst:end_date]
    user=> (i/nrow d)
    35826

    This looks good. The row count gives you the number of crimes reported in the dataset.

    How it works…

    This recipe follows a typical pipeline for working with XML:

    1. Parsing an XML data file
    2. Extracting the data nodes
    3. Converting the data nodes into a sequence of maps representing the data
    4. Converting the data into an Incanter dataset

    load-xml-data implements this process. This takes three parameters:

    • The input filename
    • A function that takes the root node of the parsed XML and returns the first data node
    • A function that takes a data node and returns the next data node, or nil if there are no more nodes

    First, the function parses the XML file and wraps it in a zipper (we'll talk more about zippers in the next section). Then, it uses the two functions that are passed in to extract all of the data nodes as a sequence. For each data node, the function retrieves that node's child nodes and converts them into a series of tag name / content pairs. The pairs for each data node are converted into a map, and the sequence of maps is converted into an Incanter dataset.
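    To make the tag name / content pairing concrete, here's what the internal data-map function returns for one child node of the sample XML shown earlier (the node literal follows clojure.xml's {:tag :attrs :content} shape):

    (def data-map
      (fn [node] [(:tag node) (first (:content node))]))

    (data-map {:tag :dcst:ccn, :attrs nil, :content ["04104147"]})
    ;; => [:dcst:ccn "04104147"]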

    There's more…

    We used a couple of interesting data structures or constructs in this recipe. Both are common in functional programming or Lisp, but neither has made its way into more mainstream programming. We should spend a minute with them.

    Navigating structures with zippers

    The first thing that happens to the parsed XML is that it gets passed to clojure.zip/xml-zip. Zippers are standard data structures that encapsulate the data at a position in a tree structure, as well as the information necessary to navigate back out. This takes Clojure's native XML data structure and turns it into something that can be navigated quickly using commands such as clojure.zip/down and clojure.zip/right. Being a functional programming language, Clojure encourages you to use immutable data structures, and zippers provide an efficient, natural way to navigate and modify a tree-like structure, such as an XML document.

    Zippers are very useful and interesting, and understanding them can help you understand and work better with immutable data structures. For more information on zippers, the Clojure-doc page is helpful (http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html). However, if you would rather dive into the deep end, see Gerard Huet's paper, The Zipper (http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf).
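    Here's a small example of that navigation, assuming the crime data file from this recipe is in place:

    (def z (zip/xml-zip (xml/parse "data/crime_incidents_2013_plain.xml")))

    ;; Move from the root to its first child element and inspect its tag.
    (:tag (zip/node (zip/down z)))
    ;; => :dcst:ReportedCrime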

    Processing in a pipeline

    We used the ->> macro to express our process as a pipeline. For deeply nested function calls, this macro lets you read the process from left to right, which makes its data flow and series of transformations much clearer.

    We can do this in Clojure because of its macro system. ->> simply rewrites the calls into Clojure's native, nested format as the form is read. The first parameter of the macro is inserted into the next expression as the last parameter. This structure is inserted into the third expression as the last parameter, and so on, until the end of the form. Let's trace this through a few steps. Say we start off with the expression (->> x first (map length) (apply +)). As Clojure builds the final expression, here's each intermediate step:

    (->> x first (map length) (apply +))
    (->> (first x) (map length) (apply +))
    (->> (map length (first x)) (apply +))
    (apply + (map length (first x)))
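    The related -> macro threads the value into the first argument position instead of the last; the to-keyword function in the next recipe uses it. A small illustration, not from the book's code:

    (require '[clojure.string :as string])

    (-> "  Rest Area  "
        string/trim
        string/lower-case)
    ;; => "rest area"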

    Comparing XML and JSON

    XML and JSON (from the Reading JSON data into Incanter datasets recipe) are very similar. Arguably, much of the popularity of JSON is driven by disillusionment with XML's verboseness.

    When we're dealing with these formats in Clojure, the biggest difference is that JSON is converted directly into native Clojure data structures that mirror the data, such as maps and vectors. Meanwhile, XML is read into record types that reflect the structure of XML, not the structure of the data.

    In other words, the keys of the maps for JSON will come from the domain (first_name or age, for instance). However, the keys of the maps for XML will come from the data format (such as tag, attribute, or children), and the tag and attribute names will come from the domain. This extra level of abstraction makes XML more unwieldy.
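    A side-by-side sketch of the two shapes for the same hypothetical record makes the difference plain:

    ;; JSON parses into maps keyed by the domain's own names:
    {"first_name" "Gomez", "age" 38}

    ;; clojure.xml parses into maps keyed by the format's structure:
    {:tag :person,
     :attrs nil,
     :content [{:tag :first_name, :attrs nil, :content ["Gomez"]}
               {:tag :age, :attrs nil, :content ["38"]}]}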

    Scraping data from tables in web pages

    There's data everywhere on the Internet. Unfortunately, a lot of it is difficult to reach. It's buried in tables, articles, or deeply nested div tags. Web scraping (writing a program that walks over a web page and extracts data from it) is brittle and laborious, but it's often the only way to free this data so it can be used in our analyses. This recipe describes how to load a web page and dig down into its contents so that you can pull the data out.

    To do this, we're going to use the Enlive library (https://github.com/cgrand/enlive/wiki). This uses a domain-specific language (DSL, a set of commands that make a small set of tasks very easy and natural) based on CSS selectors to locate elements within a web page. This library can also be used for templating. In this case, we'll just use it to get data back out of a web page.

    Getting ready

    First, you have to add Enlive to the dependencies in the project.clj file:

    (defproject getting-data "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.6.0"]
                     [incanter "1.5.5"]
                     [enlive "1.1.5"]])

    Next, use these packages in your REPL or script:

    (require '[clojure.string :as string]
             '[net.cgrand.enlive-html :as html]
             '[incanter.core :as i])
    (import [java.net URL])

    Finally, identify the file to scrape the data from. I've put up a file at http://www.ericrochester.com/clj-data-analysis/data/small-sample-table.html. It's intentionally stripped down, and it makes use of tables for layout (hence the comment about 1999).

    How to do it…

    Since this task is a little complicated, let's pull out the steps into several functions:
    (defn to-keyword
      "This takes a string and returns a normalized keyword."
      [input]
      (-> input
          string/lower-case
          (string/replace \space \-)
          keyword))

    (defn load-data
      "This loads the data from a table at a URL."
      [url]
      (let [page (html/html-resource (URL. url))
            table (html/select page [:table#data])
            headers (->> (html/select table [:tr :th])
                         (map html/text)
                         (map to-keyword)
                         vec)
            rows (->> (html/select table [:tr])
                      (map #(html/select % [:td]))
                      (map #(map html/text %))
                      (filter seq))]
        (i/dataset headers rows)))
    Now, call load-data with the URL you want to load data from:
    user=> (load-data (str "http://www.ericrochester.com/"
                           "clj-data-analysis/data/small-sample-table.html"))

    | :given-name | :surname | :relation |
    |-------------+----------+-----------|
    |       Gomez |   Addams |    father |
    |    Morticia |   Addams |    mother |
    |     Pugsley |   Addams |   brother |
    |   Wednesday |   Addams |    sister |
    …

    How it works…

    The let bindings in load-data tell the story here. Let's talk about them one by one.

    The first binding has Enlive download the resource and parse it into Enlive's internal representation:

    (let [page (html/html-resource (URL. url))

    The next binding selects the table with the data ID:

    table (html/select page [:table#data])

    Now, select all of the header cells from the table, extract the text from them, convert each to a keyword, and then convert the entire sequence into a vector. This gives us the headers for the dataset:

    headers (->> (html/select table [:tr :th])
                 (map html/text)
                 (map to-keyword)
                 vec)

    First, select each row individually. The next two steps are wrapped in map, so they operate on each row in turn: the td cells are selected from each row, and then the text is extracted from each cell. Finally, filtering with seq removes any rows with no cell content, such as the header row:

    rows (->> (html/select table [:tr])
              (map #(html/select % [:td]))
              (map #(map html/text %))
              (filter seq))