Scala for Data Science

Pascal Bugnion
Description

Leverage the power of Scala with different tools to build scalable, robust data science applications

About This Book

  • A complete guide for scalable data science solutions, from data ingestion to data visualization
  • Deploy horizontally scalable data processing pipelines and take advantage of web frameworks to build engaging visualizations
  • Build functional, type-safe routines to interact with relational and NoSQL databases with the help of tutorials and examples provided

Who This Book Is For

If you are a Scala developer or data scientist, or if you want to enter the field of data science, then this book will give you all the tools you need to implement data science solutions.

What You Will Learn

  • Transform and filter tabular data to extract features for machine learning
  • Implement your own algorithms or take advantage of MLlib's extensive suite of models to build distributed machine learning pipelines
  • Read, transform, and write data to both SQL and NoSQL databases in a functional manner
  • Write robust routines to query web APIs
  • Read data from web APIs such as the GitHub or Twitter API
  • Use Scala to interact with MongoDB, which offers high performance and helps to store large data sets with uncertain query requirements
  • Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
  • Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive

In Detail

Scala is a multi-paradigm programming language that supports both object-oriented and functional programming and runs on the JVM. While languages such as R and Python currently dominate data science, Scala is particularly good at analyzing large sets of data without significant loss of performance, which is why it is being adopted by a growing number of developers and data scientists. Data scientists know that building truly scalable applications is hard. Scala, with its powerful functional libraries for interacting with databases and building distributed frameworks, gives you the tools to construct robust data pipelines.

This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala.

Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. Alongside Scala's emphasis on functional structures and immutability, you will learn how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks.

This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.

Style and approach

A tutorial with complete examples, this book will give you the tools to start building useful data engineering and data science solutions straight away.




Table of Contents

Scala for Data Science
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Installing the JDK
Installing and using SBT
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
eBooks, discount offers, and more
Questions
1. Scala and Data Science
Data science
Programming in data science
Why Scala?
Static typing and type inference
Scala encourages immutability
Scala and functional programs
Null pointer uncertainty
Easier parallelism
Interoperability with Java
When not to use Scala
Summary
References
2. Manipulating Data with Breeze
Code examples
Installing Breeze
Getting help on Breeze
Basic Breeze data types
Vectors
Dense and sparse vectors and the vector trait
Matrices
Building vectors and matrices
Advanced indexing and slicing
Mutating vectors and matrices
Matrix multiplication, transposition, and the orientation of vectors
Data preprocessing and feature engineering
Breeze – function optimization
Numerical derivatives
Regularization
An example – logistic regression
Towards re-usable code
Alternatives to Breeze
Summary
References
3. Plotting with breeze-viz
Diving into Breeze
Customizing plots
Customizing the line type
More advanced scatter plots
Multi-plot example – scatterplot matrix plots
Managing without documentation
Breeze-viz reference
Data visualization beyond breeze-viz
Summary
4. Parallel Collections and Futures
Parallel collections
Limitations of parallel collections
Error handling
Setting the parallelism level
An example – cross-validation with parallel collections
Futures
Future composition – using a future's result
Blocking until completion
Controlling parallel execution with execution contexts
Futures example – stock price fetcher
Summary
References
5. Scala and SQL through JDBC
Interacting with JDBC
First steps with JDBC
Connecting to a database server
Creating tables
Inserting data
Reading data
JDBC summary
Functional wrappers for JDBC
Safer JDBC connections with the loan pattern
Enriching JDBC statements with the "pimp my library" pattern
Wrapping result sets in a stream
Looser coupling with type classes
Type classes
Coding against type classes
When to use type classes
Benefits of type classes
Creating a data access layer
Summary
References
6. Slick – A Functional Interface for SQL
FEC data
Importing Slick
Defining the schema
Connecting to the database
Creating tables
Inserting data
Querying data
Invokers
Operations on columns
Aggregations with "Group by"
Accessing database metadata
Slick versus JDBC
Summary
References
7. Web APIs
A whirlwind tour of JSON
Querying web APIs
JSON in Scala – an exercise in pattern matching
JSON4S types
Extracting fields using XPath
Extraction using case classes
Concurrency and exception handling with futures
Authentication – adding HTTP headers
HTTP – a whirlwind overview
Adding headers to HTTP requests in Scala
Summary
References
8. Scala and MongoDB
MongoDB
Connecting to MongoDB with Casbah
Connecting with authentication
Inserting documents
Extracting objects from the database
Complex queries
Casbah query DSL
Custom type serialization
Beyond Casbah
Summary
References
9. Concurrency with Akka
GitHub follower graph
Actors as people
Hello world with Akka
Case classes as messages
Actor construction
Anatomy of an actor
Follower network crawler
Fetcher actors
Routing
Message passing between actors
Queue control and the pull pattern
Accessing the sender of a message
Stateful actors
Follower network crawler
Fault tolerance
Custom supervisor strategies
Life-cycle hooks
What we have not talked about
Summary
References
10. Distributed Batch Processing with Spark
Installing Spark
Acquiring the example data
Resilient distributed datasets
RDDs are immutable
RDDs are lazy
RDDs know their lineage
RDDs are resilient
RDDs are distributed
Transformations and actions on RDDs
Persisting RDDs
Key-value RDDs
Double RDDs
Building and running standalone programs
Running Spark applications locally
Reducing logging output and Spark configuration
Running Spark applications on EC2
Spam filtering
Lifting the hood
Data shuffling and partitions
Summary
Reference
11. Spark SQL and DataFrames
DataFrames – a whirlwind introduction
Aggregation operations
Joining DataFrames together
Custom functions on DataFrames
DataFrame immutability and persistence
SQL statements on DataFrames
Complex data types – arrays, maps, and structs
Structs
Arrays
Maps
Interacting with data sources
JSON files
Parquet files
Standalone programs
Summary
References
12. Distributed Machine Learning with MLlib
Introducing MLlib – Spam classification
Pipeline components
Transformers
Estimators
Evaluation
Regularization in logistic regression
Cross-validation and model selection
Beyond logistic regression
Summary
References
13. Web APIs with Play
Client-server applications
Introduction to web frameworks
Model-View-Controller architecture
Single page applications
Building an application
The Play framework
Dynamic routing
Actions
Composing the response
Understanding and parsing the request
Interacting with JSON
Querying external APIs and consuming JSON
Calling external web services
Parsing JSON
Asynchronous actions
Creating APIs with Play: a summary
REST APIs: best practice
Summary
References
14. Visualization with D3 and the Play Framework
GitHub user data
Do I need a backend?
JavaScript dependencies through web-jars
Towards a web application: HTML templates
Modular JavaScript through RequireJS
Bootstrapping the applications
Client-side program architecture
Designing the model
The event bus
AJAX calls through JQuery
Response views
Drawing plots with NVD3
Summary
References
A. Pattern Matching and Extractors
Pattern matching in for comprehensions
Pattern matching internals
Extracting sequences
Summary
Reference
Index

Scala for Data Science

Scala for Data Science

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: January 2016

Production reference: 1220116

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-137-2

www.packtpub.com

Credits

Author

Pascal Bugnion

Reviewers

Umanga Bista

Radek Ostrowski

Yuanhang Wang

Commissioning Editor

Veena Pagare

Acquisition Editor

Sonali Vernekar

Content Development Editor

Shali Deeraj

Technical Editor

Suwarna Patil

Copy Editor

Tasneem Fatehi

Project Coordinator

Sanchita Mandal

Proofreader

Safis Editing

Indexer

Monica Ajmera Mehta

Graphics

Disha Haria

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

About the Author

Pascal Bugnion is a data engineer at the ASI, a consultancy offering bespoke data science services. Previously, he was the head of data engineering at SCL Elections. He holds a PhD in computational physics from Cambridge University.

Besides Scala, Pascal is a keen Python developer. He has contributed to NumPy, matplotlib and IPython. He also maintains scikit-monaco, an open source library for Monte Carlo integration. He currently lives in London, UK.

I owe a huge debt of gratitude to my parents and my partner for supporting me in this, as well as my employer for encouraging me to pursue this project. I also thank the reviewers, Umanga Bista, Yuanhang Wang, and Radek Ostrowski for their tireless efforts, as well as the entire team at Packt for their support, advice, and hard work carrying this book to completion.

About the Reviewers

Umanga Bista is a machine learning and real-time analytics enthusiast from Kathmandu. He completed his bachelor's degree in computer engineering in September 2013. Since then, he has been working at LogPoint, a SIEM product company. He primarily works on building statistical plugins and real-time, scalable, and fault-tolerant architectures to process multi-terabyte-scale log data streams for security analytics, intelligence, and compliance.

Radek Ostrowski is a freelance big data engineer with an educational background in high-performance computing. He specializes in building scalable real-time data collection and predictive analytics platforms. He worked for many years on data-related projects at EPCC, University of Edinburgh. Additionally, he has contributed to the success of the games startup deltaDNA, co-built a super-scalable backend for PlayStation 4 at Sony, helped to improve data processes at Expedia, and started a Docker revolution at Tesco Bank. He is currently working with Spark and Scala for Max2 Inc., an NYC-based startup building a community-powered venue discovery platform that offers personalized recommendations and curated, real-time information.

Yuanhang Wang is a data scientist whose primary focus is DSL design. He has dabbled in several functional programming languages and is particularly interested in machine learning and programming language theory. He is currently a data scientist at China Mobile Research Center, working on a typed data processing engine and optimizer built on top of several big data platforms.

Yuanhang Wang describes himself as an enthusiast of purely functional programming and neural networks. He obtained master's degrees from both Harbin Institute of Technology, China, and the University of Pavia, Italy.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

To my parents.

To Jessica and to my friends.

Preface

Data science is fashionable. Data science startups are sprouting across the globe and established companies are scrambling to assemble data science teams. The ability to analyze large datasets is also becoming increasingly important in the academic and research world.

Why this explosion in demand for data scientists? Our view is that the emergence of data science can be viewed as the serendipitous collusion of several interlinked factors. The first is data availability. Over the last fifteen years, the amount of data collected by companies has exploded. In the world of research, cheap gene sequencing techniques have drastically increased the amount of genomic data available. Social and professional networking sites have built huge graphs interlinking a significant fraction of the people living on the planet. At the same time, the development of the World Wide Web makes accessing this wealth of data possible from almost anywhere in the world.

The increased availability of data has resulted in an increase in data awareness. It is no longer acceptable for decision makers to trust their experience and "gut feeling" alone. Increasingly, one expects business decisions to be driven by data.

Finally, the tools for efficiently making sense of and extracting insights from huge data sets are starting to mature: one doesn't need to be an expert in distributed computing to analyze a large data set any more. Apache Spark, for instance, greatly eases writing distributed data analysis applications. The explosion of cloud infrastructure facilitates scaling computing needs to cope with variable data amounts.

Scala is a popular language for data science. By emphasizing immutability and functional constructs, Scala lends itself well to the construction of robust libraries for concurrency and big data analysis. A rich ecosystem of tools for data science has therefore developed around Scala, including libraries for accessing SQL and NoSQL databases, frameworks for building distributed applications such as Apache Spark, and libraries for linear algebra and numerical algorithms. We will explore this rich and growing ecosystem in the fourteen chapters of this book.

What this book covers

We aim to give you a flavor for what is possible with Scala, and to get you started using libraries that are useful for building data science applications. We do not aim to provide an entirely comprehensive overview of any of these topics. This is best left to online documentation or to reference books. What we will teach you is how to combine these tools to build efficient, scalable programs, and have fun along the way.

Chapter 1, Scala and Data Science, is a brief description of data science, and of Scala's place in the data scientist's tool-belt. We describe why Scala is becoming increasingly popular in data science, and how it compares to alternative languages such as Python.

Chapter 2, Manipulating Data with Breeze, introduces Breeze, a library providing support for numerical algorithms in Scala. We learn how to perform linear algebra and optimization, and solve a simple machine learning problem using logistic regression.

Chapter 3, Plotting with breeze-viz, introduces the breeze-viz library for plotting two-dimensional graphs and histograms.

Chapter 4, Parallel Collections and Futures, describes basic concurrency constructs. We will learn to parallelize simple problems by distributing them over several threads using parallel collections, and apply what we have learned to build a parallel cross-validation pipeline. We then describe how to wrap computation in a future to execute it asynchronously. We apply this pattern to query a web API, sending several requests in parallel.

Chapter 5, Scala and SQL through JDBC, looks at interacting with SQL databases in a functional manner. We learn how to use common Scala patterns to wrap the Java interface exposed by JDBC. Besides learning about JDBC, this chapter introduces type classes, the loan pattern, implicit conversions, and other patterns that are frequently leveraged in libraries and existing Scala code.

Chapter 6, Slick - A Functional Interface for SQL, describes the Slick library for mapping data in SQL tables to Scala objects.

Chapter 7, Web APIs, describes how to query web APIs in a concurrent, fault-tolerant manner using futures. We learn to parse JSON responses and formulate complex HTTP requests with authentication. We walk through querying the GitHub API to obtain information about GitHub users programmatically.

Chapter 8, Scala and MongoDB, walks the reader through interacting with MongoDB, a leading NoSQL database. We build a pipeline that fetches user data from the GitHub API and stores it in a MongoDB database.

Chapter 9, Concurrency with Akka, introduces the Akka framework for building concurrent applications with actors. We use Akka to build a scalable crawler that explores the GitHub follower graph.

Chapter 10, Distributed Batch Processing with Spark, explores the Apache Spark framework for building distributed applications. We learn how to construct and manipulate distributed datasets in memory. We touch briefly on the internals of Spark, learning how the architecture allows for distributed, fault-tolerant computation.

Chapter 11, Spark SQL and DataFrames, describes DataFrames, one of the more powerful features of Spark for the manipulation of structured data. We learn how to load JSON and Parquet files into DataFrames.

Chapter 12, Distributed Machine Learning with MLlib, explores how to build distributed machine learning pipelines with MLlib, a library built on top of Apache Spark. We use the library to train a spam filter.

Chapter 13, Web APIs with Play, describes how to use the Play framework to build web APIs. We describe the architecture of modern web applications, and how these fit into the data science pipeline. We build a simple web API that returns JSON.

Chapter 14, Visualization with D3 and the Play Framework, builds on the previous chapter to program a fully fledged web application with Play and D3. We describe how to integrate JavaScript into a Play framework application.

Appendix, Pattern Matching and Extractors, describes how pattern matching provides the programmer with a powerful construct for control flow.

Who this book is for

This book introduces the data science ecosystem for people who already know some Scala. If you are a data scientist or a data engineer, or if you want to enter data science, this book will give you all the tools you need to implement data science solutions in Scala.

For the avoidance of doubt, let me also clarify what this book is not:

  • This is not an introduction to Scala. We assume that you already have a working knowledge of the language. If you do not, we recommend Programming in Scala by Martin Odersky, Lex Spoon, and Bill Venners.
  • This is not a book about machine learning in Scala. We will use machine learning to illustrate the examples, but the aim is not to teach you how to write your own gradient-boosted tree class. Machine learning is just one (important) part of data science, and this book aims to cover the full pipeline, from data acquisition to data visualization. If you are interested more specifically in how to implement machine learning solutions in Scala, I recommend Scala for Machine Learning by Patrick R. Nicolas.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

The code examples are also available on GitHub at www.github.com/pbugnion/s4ds.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Scala and Data Science

The second half of the 20th century was the age of silicon. In fifty years, computing power went from extremely scarce to entirely mundane. The first half of the 21st century is the age of the Internet. The last 20 years have seen the rise of giants such as Google, Twitter, and Facebook—giants that have forever changed the way we view knowledge.

The Internet is a vast nexus of information. Ninety percent of the data generated by humanity has been generated in the last 18 months. The programmers, statisticians, and scientists who can harness this glut of data to derive real understanding will have an ever greater influence on how businesses, governments, and charities make decisions.

This book strives to introduce some of the tools that you will need to synthesize the avalanche of data to produce true insight.

Data science

Data science is the process of extracting useful information from data. As a discipline, it remains somewhat ill-defined, with nearly as many definitions as there are experts. Rather than add yet another definition, I will follow Drew Conway's description (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). He describes data science as the culmination of three orthogonal sets of skills:

  • Data scientists must have hacking skills. Data is stored and transmitted through computers. Computers, programming languages, and libraries are the hammers and chisels of data scientists; they must wield them with confidence and accuracy to sculpt the data as they please. This is where Scala comes in: it's a powerful tool to have in your programming toolkit.
  • Data scientists must have a sound understanding of statistics and numerical algorithms. Good data scientists will understand how machine learning algorithms function and how to interpret results. They will not be fooled by misleading metrics, deceptive statistics, or misinterpreted causal links.
  • A good data scientist must have a sound understanding of the problem domain. The data science process involves building and discovering knowledge about the problem domain in a scientifically rigorous manner. The data scientist must, therefore, ask the right questions, be aware of previous results, and understand how the data science effort fits in the wider business or research context.

Drew Conway summarizes this elegantly with a Venn diagram showing data science at the intersection of hacking skills, maths and statistics knowledge, and substantive expertise.

It is, of course, rare for people to be experts in more than one of these areas. Data scientists often work in cross-functional teams, with different members providing the expertise for different areas. To function effectively, every member of the team must nevertheless have a general working knowledge of all three areas.

To give a more concrete overview of the workflow in a data science project, let's imagine that we are trying to write an application that analyzes the public perception of a political campaign. This is what the data science pipeline might look like:

  1. Obtaining data: This might involve extracting information from text files, polling a sensor network or querying a web API. We could, for instance, query the Twitter API to obtain lists of tweets with the relevant hashtags.
  2. Data ingestion: Data often comes from many different sources and might be unstructured or semi-structured. Data ingestion involves moving data from the data source, processing it to extract structured information, and storing this information in a database. For tweets, for instance, we might extract the username, the names of other users mentioned in the tweet, the hashtags, text of the tweet, and whether the tweet contains certain keywords.
  3. Exploring data: We often have a clear idea of what information we want to extract from the data but very little idea how. For instance, let's imagine that we have ingested thousands of tweets containing hashtags relevant to our political campaign. There is no clear path to go from our database of tweets to the end goal: insight into the overall public perception of our campaign. Data exploration involves mapping out how we are going to get there. This step will often uncover new questions or sources of data, which requires going back to the first step of the pipeline. For our tweet database, we might, for instance, decide that we need to have a human manually label a thousand or more tweets as expressing "positive" or "negative" sentiments toward the political campaign. We could then use these tweets as a training set to construct a model.
  4. Feature building: A machine learning algorithm is only as good as the features that enter it. A significant fraction of a data scientist's time involves transforming and combining existing features to create new features more closely related to the problem that we are trying to solve. For instance, we might construct a new feature corresponding to the number of "positive" sounding words or pairs of words in a tweet.
  5. Model construction and training: Having built the features that enter the model, the data scientist can now train machine learning algorithms on their datasets. This will often involve trying different algorithms and optimizing model hyperparameters. We might, for instance, settle on using a random forest algorithm to decide whether a tweet is "positive" or "negative" about the campaign. Constructing the model involves choosing the right number of trees and how to calculate impurity measures. A sound understanding of statistics and the problem domain will help inform these decisions.
  6. Model extrapolation and prediction: The data scientists can now use their new model to try and infer information about previously unseen data points. They might pass a new tweet through their model to ascertain whether it speaks positively or negatively of the political campaign.
  7. Distillation of intelligence and insight from the model: The data scientists combine the outcome of the data analysis process with knowledge of the business domain to inform business decisions. They might discover that specific messages resonate better with the target audience, or with specific segments of the target audience, leading to more accurate targeting. A key part of informing stakeholders involves data visualization and presentation: data scientists create graphs, visualizations, and reports to help make the insights derived clear and compelling.

This is far from a linear pipeline. Often, insights gained at one stage will require the data scientists to backtrack to a previous stage of the pipeline. Indeed, the generation of business insights from raw data is normally an iterative process: the data scientists might do a rapid first pass to verify the premise of the problem and then gradually refine the approach by adding new data sources or new features or trying new machine learning algorithms.

In this book, you will learn how to deal with each step of the pipeline in Scala, leveraging existing libraries to build robust applications.

Programming in data science

This book is not a book about data science. It is a book about how to use Scala, a programming language, for data science. So, where does programming come in when processing data?

Computers are involved at every step of the data science pipeline, but not necessarily in the same manner. The style of programs that we build will be drastically different if we are just writing throwaway scripts to explore data or trying to build a scalable application that pushes data through a well-understood pipeline to continuously deliver business intelligence.

Let's imagine that we work for a company making games for mobile phones in which you can purchase in-game benefits. The majority of users never buy anything, but a small fraction is likely to spend a lot of money. We want to build a model that recognizes big spenders based on their play patterns.

The first step is to explore data, find the right features, and build a model based on a subset of the data. In this exploration phase, we have a clear goal in mind but little idea of how to get there. We want a light, flexible language with strong libraries to get us a working model as soon as possible.

Once we have a working model, we need to deploy it on our gaming platform to analyze the usage patterns of all the current users. This is a very different problem: we have a relatively clear understanding of the goals of the program and of how to get there. The challenge comes in designing software that will scale out to handle all the users and be robust to future changes in usage patterns.

In practice, the type of software that we write typically lies on a spectrum ranging from a single throwaway script to production-level code that must be proof against future expansion and load increases. Before writing any code, the data scientist must understand where their software lies on this spectrum. Let's call this the permanence spectrum.

When not to use Scala

In the previous sections, we described how Scala's strong type system, preference for immutability, functional capabilities, and parallelism abstractions make it easy to write reliable programs and minimize the risk of unexpected behavior.

What reasons might you have to avoid Scala in your next project? One important reason is familiarity. Scala introduces many concepts such as implicits, type classes, and composition using traits that might not be familiar to programmers coming from the object-oriented world. Scala's type system is very expressive, but getting to know it well enough to use its full power takes time and requires adjusting to a new programming paradigm. Finally, dealing with immutable data structures can feel alien to programmers coming from Java or Python.
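
To make this concrete, here is a minimal sketch of an implicit class, one of the constructs that most often surprises newcomers. This example is not taken from the book's code; the object ImplicitsExample and the method shout are purely illustrative:

object ImplicitsExample extends App {
  // An implicit class bolts new methods onto an existing type without
  // modifying it: the "pimp my library" pattern covered in Chapter 5.
  implicit class RichGreeting(s: String) {
    def shout: String = s.toUpperCase + "!"
  }

  // "hello" has no 'shout' method, so the compiler silently wraps it
  // in a RichGreeting. Nothing at the call site hints at this.
  println("hello".shout) // prints HELLO!
}

Code like "hello".shout compiles only because RichGreeting happens to be in scope, and it is precisely this invisible machinery that can disorient programmers used to more explicit languages.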

Nevertheless, these are all drawbacks that can be overcome with time. Scala does fall short of the other data science languages in library availability. The IPython Notebook, coupled with matplotlib, is an unparalleled resource for data exploration. There are ongoing efforts to provide similar functionality in Scala (Spark Notebooks or Apache Zeppelin, for instance), but there are no projects with the same level of maturity. The type system can also be a minor hindrance when one is exploring data or trying out different models.

Thus, in this author's biased opinion, Scala excels for more permanent programs. If you are writing a throwaway script or exploring data, you might be better served with Python. If you are writing something that will need to be reused and requires a certain level of provable correctness, you will find Scala extremely powerful.

Summary

Now that the obligatory introduction is over, it is time to write some Scala code. In the next chapter, you will learn about leveraging Breeze for numerical computations with Scala. For our first foray into data science, we will use logistic regression to predict the gender of a person given their height and weight.

References

By far, the best book on Scala is Programming in Scala by Martin Odersky, Lex Spoon, and Bill Venners. Besides being authoritative (Martin Odersky is the driving force behind Scala), this book is also approachable and readable.

Scala Puzzlers by Andrew Phillips and Nermin Šerifović provides a fun way to learn more advanced Scala.

Scala for Machine Learning by Patrick R. Nicolas provides examples of how to write machine learning algorithms with Scala.

Chapter 2. Manipulating Data with Breeze

Data science is, by and large, concerned with the manipulation of structured data. A large fraction of structured datasets can be viewed as tabular data: each row represents a particular instance, and columns represent different attributes of that instance. The ubiquity of tabular representations explains the success of spreadsheet programs like Microsoft Excel, or of tools like SQL databases.

To be useful to data scientists, a language must support the manipulation of columns or tables of data. Python does this through NumPy and pandas, for instance. Unfortunately, there is no single, coherent ecosystem for numerical computing in Scala that quite measures up to the SciPy ecosystem in Python.

In this chapter, we will introduce Breeze, a library providing fast linear algebra and manipulation of data arrays, as well as many other features necessary for scientific computing and data science.

Code examples

The easiest way to access the code examples in this book is to clone the GitHub repository:

$ git clone 'https://github.com/pbugnion/s4ds'

The code samples for each chapter are in a single, standalone folder. You may also browse the code online on GitHub.

Installing Breeze

If you have downloaded the code examples for this book, the easiest way of using Breeze is to go into the chap02 directory and type sbt console at the command line. This will open a Scala console in which you can import Breeze.
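
Concretely, assuming you cloned the examples repository into a folder called s4ds as described in the previous section:

$ cd s4ds/chap02
$ sbt console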

If you want to build a standalone project, the most common way of installing Breeze (and, indeed, any Scala module) is through SBT. To fetch the dependencies required for this chapter, copy the following lines to a file called build.sbt, taking care to leave an empty line after scalaVersion:

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)

Open a Scala console in the same directory as your build.sbt file by typing sbt console in a terminal. You can check that Breeze is working correctly by importing Breeze from the Scala prompt:

scala> import breeze.linalg._
import breeze.linalg._
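
As a further sanity check, you can create a small vector and operate on it. This is a minimal sketch, not one of the book's examples, and the exact res numbering printed by your console may differ:

scala> val v = DenseVector(1.0, 2.0, 3.0)
v: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)

scala> v * 2.0
res0: breeze.linalg.DenseVector[Double] = DenseVector(2.0, 4.0, 6.0)

If these lines evaluate without errors, Breeze and its native bindings are resolving correctly.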

Getting help on Breeze

This chapter gives a reasonably detailed introduction to Breeze, but it does not aim to give a complete API reference.

To get a full list of Breeze's functionality, consult the Breeze Wiki page on GitHub at https://github.com/scalanlp/breeze/wiki. This is very complete for some modules and less complete for others. The source code (https://github.com/scalanlp/breeze/) is detailed and gives a lot of information. To understand how a particular function is meant to be used, look at the unit tests for that function.