E-Book
51,59 €

Scala and Spark for Big Data Analytics E-Book

Md. Rezaul Karim


Publisher: Packt Publishing
Category: Technical literature
Language: English

Description

Scala has seen wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. Thus, if you want to leverage the power of Scala and Spark to make sense of big data, this book is for you.
The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark to cover the basic abstractions using RDD and DataFrame. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using SparkSQL, GraphX, and Spark structured streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment.
You will also learn how to develop Spark applications using SparkR and PySpark APIs, interactive data analytics using Zeppelin, and in-memory data processing with Alluxio.
By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics with the feeling that no amount of data is too big.

Details

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 791

Year of publication: 2017

Excerpt

Scala and Spark for Big Data Analytics

Explore the concepts of functional programming, data streaming, and machine learning

Md. Rezaul Karim

Sridhar Alla

BIRMINGHAM - MUMBAI

Scala and Spark for Big Data Analytics

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2017

Production reference: 2241017

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78528-084-9

www.packtpub.com

Credits

Authors

Md. Rezaul Karim

Sridhar Alla

Copy Editor

Safis Editing

Reviewers

Andrea Bessi

Sumit Pal

Project Coordinator

Ulhas Kambali

Commissioning Editor

Aaron Lazar

Proofreader

Safis Editing

Acquisition Editor

Nitin Dasan

Indexer

Rekha Nair

Content Development Editor

Vikas Tiwari

Cover Work

Melwyn D'sa

Technical Editor

Subhalaxmi Nadar

Production Coordinator

Melwyn D'sa

About the Authors

Md. Rezaul Karim is a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Aachen, Germany. He holds a BSc and an MSc in computer science. Before joining Fraunhofer FIT, he had been working as a researcher at the Insight Centre for data analytics, Ireland. Previously, he worked as a lead engineer with Samsung Electronics' distributed R&D centers in Korea, India, Vietnam, Turkey, and Bangladesh. Earlier, he worked as a research assistant in the Database Lab at Kyung Hee University, Korea, and as an R&D engineer with BMTech21 Worldwide, Korea. Even before that, he worked as a software engineer with i2SoftTechnology, Dhaka, Bangladesh.

He has more than 8 years of experience in the area of research and development, with a solid knowledge of algorithms and data structures in C/C++, Java, Scala, R, and Python-focused big data technologies: Spark, Kafka, DC/OS, Docker, Mesos, Zeppelin, Hadoop, and MapReduce, and deep learning technologies: TensorFlow, DeepLearning4j, and H2O-Sparking Water. His research interests include machine learning, deep learning, semantic web, linked data, big data, and bioinformatics. He is the author of the following book titles with Packt:

Large-Scale Machine Learning with Spark

Deep Learning with TensorFlow

I am very grateful to my parents, who have always encouraged me to pursue knowledge. I also want to thank my wife Saroar, son Shadman, elder brother Mamtaz, elder sister Josna, and friends, who have endured my long monologues about the subjects in this book, and have always been encouraging and listening to me. Writing this book was made easier by the amazing efforts of the open source community and the great documentation of many projects out there related to Apache Spark and Scala. Furthermore, I would like to thank the acquisition, content development, and technical editors of Packt (and others who were involved in this book title) for their sincere cooperation and coordination. Additionally, without the work of numerous researchers and data analytics practitioners who shared their expertise in publications, lectures, and source code, this book might not exist at all!

Sridhar Alla is a big data expert helping small and big companies solve complex problems, such as data warehousing, governance, security, real-time processing, high-frequency trading, and establishing large-scale data science practices. He is an agile practitioner as well as a certified agile DevOps practitioner and implementer. He started his career as a storage software engineer at Network Appliance, Sunnyvale, and then worked as the chief technology officer at a cyber security firm, eIQNetworks, Boston. His job profile includes the role of the director of data science and engineering at Comcast, Philadelphia. He is an avid presenter at numerous Strata, Hadoop World, Spark Summit, and other conferences. He also provides onsite/online training on several technologies. He has several patents filed in the US PTO on large-scale computing and distributed systems. He holds a bachelors degree in computer science from JNTU, Hyderabad, India, and lives with his wife in New Jersey.

Sridhar has over 18 years of experience writing code in Scala, Java, C, C++, Python, R and Go. He also has extensive hands-on knowledge of Spark, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing and high performance computing.

I would like to thank my wonderful wife, Rosie Sarkaria, for all the love and patience during the many months I spent writing this book, as well as for reviewing the countless edits I made. I would also like to thank my parents, Ravi and Lakshmi Alla, for all the support and encouragement they continue to bestow upon me. I am very grateful to my many friends, especially Abrar Hashmi and Christian Ludwig, who helped me bounce ideas around and get clarity on the various topics. Writing this book would not have been possible without the fantastic larger Apache community and the Databricks folks who are making Spark so powerful and elegant. Further, I would like to thank the acquisition, content development, and technical editors of Packt Publishing (and others who were involved in this book title) for their sincere cooperation and coordination.

About the Reviewers

Andre Baianov is an economist-turned-software developer, with a keen interest in data science. After a bachelor's thesis on data mining and a master's thesis on business intelligence, he started working with Scala and Apache Spark in 2015. He is currently working as a consultant for national and international clients, helping them build reactive architectures, machine learning frameworks, and functional programming backends.

To my wife: beneath our superficial differences, we share the same soul.

Sumit Pal is a published author with Apress for SQL on Big Data - Technology, Architecture and Innovations. He has more than 22 years of experience in the software industry in various roles, spanning companies from start-ups to enterprises.

Sumit is an independent consultant working with big data, data visualization, and data science, and a software architect building end-to-end, data-driven analytic systems. He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team), and Verizon (big data analytics team) in a career spanning 22 years. Currently, he works for multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java, and Python. Sumit has spoken at the following big data conferences: Data Summit NY, May 2017; Big Data Symposium, Boston, May 2017; Apache Linux Foundation, May 2016, in Vancouver, Canada; and Data Center World, March 2016, in Las Vegas.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785280848.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Introduction to Scala

History and purposes of Scala

Platforms and editors

Installing and setting up Scala

Installing Java

Windows

Mac OS

Using Homebrew installer

Installing manually

Linux

Scala: the scalable language

Scala is object-oriented

Scala is functional

Scala is statically typed

Scala runs on the JVM

Scala can execute Java code

Scala can do concurrent and synchronized processing

Scala for Java programmers

All types are objects

Type inference

Scala REPL

Nested functions

Import statements

Operators as methods

Methods and parameter lists

Methods inside methods

Constructor in Scala

Objects instead of static methods

Traits

Scala for the beginners

Your first line of code

I'm the hello world program, explain me well!

Run Scala interactively!

Compile it!

Execute it with Scala command

Summary

Object-Oriented Scala

Variables in Scala

Reference versus value immutability

Data types in Scala

Variable initialization

Type annotations

Type ascription

Lazy val

Methods, classes, and objects in Scala

Methods in Scala

The return in Scala

Classes in Scala

Objects in Scala

Singleton and companion objects

Companion objects

Comparing and contrasting: val and final

Access and visibility

Constructors

Traits in Scala

A trait syntax

Extending traits

Abstract classes

Abstract classes and the override keyword

Case classes in Scala

Packages and package objects

Java interoperability

Pattern matching

Implicit in Scala

Generic in Scala

Defining a generic class

SBT and other build systems

Build with SBT

Maven with Eclipse

Gradle with Eclipse

Summary

Functional Programming Concepts

Introduction to functional programming

Advantages of functional programming

Functional Scala for the data scientists

Why FP and Scala for learning Spark?

Why Spark?

Scala and the Spark programming model

Scala and the Spark ecosystem

Pure functions and higher-order functions

Pure functions

Anonymous functions

Higher-order functions

Function as a return value

Using higher-order functions

Error handling in functional Scala

Failure and exceptions in Scala

Throwing exceptions

Catching exception using try and catch

Finally

Creating an Either

Future

Run one task, but block

Functional programming and data mutability

Summary

Collection APIs

Scala collection APIs

Types and hierarchies

Traversable

Iterable

Seq, LinearSeq, and IndexedSeq

Mutable and immutable

Arrays

Lists

Sets

Tuples

Maps

Option

Exists

Forall

Filter

Map

Take

GroupBy

Init

Drop

TakeWhile

DropWhile

FlatMap

Performance characteristics

Performance characteristics of collection objects

Memory usage by collection objects

Java interoperability

Using Scala implicits

Implicit conversions in Scala

Summary

Tackle Big Data – Spark Comes to the Party

Introduction to data analytics

Inside the data analytics process

Introduction to big data

4 Vs of big data

Variety of Data

Velocity of Data

Volume of Data

Veracity of Data

Distributed computing using Apache Hadoop

Hadoop Distributed File System (HDFS)

HDFS High Availability

HDFS Federation

HDFS Snapshot

HDFS Read

HDFS Write

MapReduce framework

Here comes Apache Spark

Spark core

Spark SQL

Spark streaming

Spark GraphX

Spark ML

PySpark

SparkR

Summary

Start Working with Spark – REPL and RDDs

Dig deeper into Apache Spark

Apache Spark installation

Spark standalone

Spark on YARN

YARN client mode

YARN cluster mode

Spark on Mesos

Introduction to RDDs

RDD Creation

Parallelizing a collection

Reading data from an external source

Transformation of an existing RDD

Streaming API

Using the Spark shell

Actions and Transformations

Transformations

General transformations

Math/Statistical transformations

Set theory/relational transformations

Data structure-based transformations

map function

flatMap function

filter function

coalesce

repartition

Actions

reduce

count

collect

Caching

Loading and saving data

Loading data

textFile

wholeTextFiles

Load from a JDBC Datasource

Saving RDD

Summary

Special RDD Operations

Types of RDDs

Pair RDD

DoubleRDD

SequenceFileRDD

CoGroupedRDD

ShuffledRDD

UnionRDD

HadoopRDD

NewHadoopRDD

Aggregations

groupByKey

reduceByKey

aggregateByKey

combineByKey

Comparison of groupByKey, reduceByKey, combineByKey, and aggregateByKey

Partitioning and shuffling

Partitioners

HashPartitioner

RangePartitioner

Shuffling

Narrow Dependencies

Wide Dependencies

Broadcast variables

Creating broadcast variables

Cleaning broadcast variables

Destroying broadcast variables

Accumulators

Summary

Introduce a Little Structure - Spark SQL

Spark SQL and DataFrames

DataFrame API and SQL API

Pivots

Filters

User-Defined Functions (UDFs)

Schema structure of data

Implicit schema

Explicit schema

Encoders

Loading and saving datasets

Loading datasets

Saving datasets

Aggregations

Aggregate functions

Count

First

Last

approx_count_distinct

Min

Max

Average

Sum

Kurtosis

Skewness

Variance

Standard deviation

Covariance

groupBy

Rollup

Cube

Window functions

ntiles

Joins

Inner workings of join

Shuffle join

Broadcast join

Join types

Inner join

Left outer join

Right outer join

Outer join

Left anti join

Left semi join

Cross join

Performance implications of join

Summary

Stream Me Up, Scotty - Spark Streaming

A Brief introduction to streaming

At least once processing

At most once processing

Exactly once processing

Spark Streaming

StreamingContext

Creating StreamingContext

Starting StreamingContext

Stopping StreamingContext

Input streams

receiverStream

socketTextStream

rawSocketStream

fileStream

textFileStream

binaryRecordsStream

queueStream

textFileStream example

twitterStream example

Discretized streams

Transformations

Window operations

Stateful/stateless transformations

Stateless transformations

Stateful transformations

Checkpointing

Metadata checkpointing

Data checkpointing

Driver failure recovery

Interoperability with streaming platforms (Apache Kafka)

Receiver-based approach

Direct stream

Structured streaming

Handling Event-time and late data

Fault tolerance semantics

Summary

Everything is Connected - GraphX

A brief introduction to graph theory

GraphX

VertexRDD and EdgeRDD

VertexRDD

EdgeRDD

Graph operators

Filter

MapValues

aggregateMessages

TriangleCounting

Pregel API

ConnectedComponents

Traveling salesman problem

ShortestPaths

PageRank

Summary

Learning Machine Learning - Spark MLlib and Spark ML

Introduction to machine learning

Typical machine learning workflow

Machine learning tasks

Supervised learning

Unsupervised learning

Reinforcement learning

Recommender system

Semisupervised learning

Spark machine learning APIs

Spark machine learning libraries

Spark MLlib

Spark ML

Spark MLlib or Spark ML?

Feature extraction and transformation

CountVectorizer

Tokenizer

StopWordsRemover

StringIndexer

OneHotEncoder

Spark ML pipelines

Dataset abstraction

Creating a simple pipeline

Unsupervised machine learning

Dimensionality reduction

PCA

Using PCA

Regression Analysis - a practical use of PCA

Dataset collection and exploration

What is regression analysis?

Binary and multiclass classification

Performance metrics

Binary classification using logistic regression

Breast cancer prediction using logistic regression of Spark ML

Dataset collection

Developing the pipeline using Spark ML

Multiclass classification using logistic regression

Improving classification accuracy using random forests

Classifying MNIST dataset using random forest

Summary

My Name is Bayes, Naive Bayes

Multinomial classification

Transformation to binary

Classification using One-Vs-The-Rest approach

Exploration and preparation of the OCR dataset

Hierarchical classification

Extension from binary

Bayesian inference

An overview of Bayesian inference

What is inference?

How does it work?

Naive Bayes

An overview of Bayes' theorem

My name is Bayes, Naive Bayes

Building a scalable classifier with NB

Tune me up!

The decision trees

Advantages and disadvantages of using DTs

Decision tree versus Naive Bayes

Building a scalable classifier with DT algorithm

Summary

Time to Put Some Order - Cluster Your Data with Spark MLlib

Unsupervised learning

Unsupervised learning example

Clustering techniques

Unsupervised learning and the clustering

Hierarchical clustering

Centroid-based clustering

Distribution-based clustering

Centroid-based clustering (CC)

Challenges in CC algorithm

How does K-means algorithm work?

An example of clustering using K-means of Spark MLlib

Hierarchical clustering (HC)

An overview of HC algorithm and challenges

Bisecting K-means with Spark MLlib

Bisecting K-means clustering of the neighborhood using Spark MLlib

Distribution-based clustering (DC)

Challenges in DC algorithm

How does a Gaussian mixture model work?

An example of clustering using GMM with Spark MLlib

Determining number of clusters

A comparative analysis between clustering algorithms

Submitting Spark job for cluster analysis

Summary

Text Analytics Using Spark ML

Understanding text analytics

Text analytics

Sentiment analysis

Topic modeling

TF-IDF (term frequency - inverse document frequency)

Named entity recognition (NER)

Event extraction

Transformers and Estimators

Standard Transformer

Estimator Transformer

Tokenization

StopWordsRemover

NGrams

TF-IDF

HashingTF

Inverse Document Frequency (IDF)

Word2Vec

CountVectorizer

Topic modeling using LDA

Implementing text classification

Summary

Spark Tuning

Monitoring Spark jobs

Spark web interface

Jobs

Stages

Storage

Environment

Executors

SQL

Visualizing Spark application using web UI

Observing the running and completed Spark jobs

Debugging Spark applications using logs

Logging with log4j with Spark

Spark configuration

Spark properties

Environmental variables

Logging

Common mistakes in Spark app development

Application failure

Slow jobs or unresponsiveness

Optimization techniques

Data serialization

Memory tuning

Memory usage and management

Tuning the data structures

Serialized RDD storage

Garbage collection tuning

Level of parallelism

Broadcasting

Data locality

Summary

Time to Go to ClusterLand - Deploying Spark on a Cluster

Spark architecture in a cluster

Spark ecosystem in brief

Cluster design

Cluster management

Pseudocluster mode (aka Spark local)

Standalone

Apache YARN

Apache Mesos

Cloud-based deployments

Deploying the Spark application on a cluster

Submitting Spark jobs

Running Spark jobs locally and in standalone

Hadoop YARN

Configuring a single-node YARN cluster

Step 1: Downloading Apache Hadoop

Step 2: Setting the JAVA_HOME

Step 3: Creating users and groups

Step 4: Creating data and log directories

Step 5: Configuring core-site.xml

Step 6: Configuring hdfs-site.xml

Step 7: Configuring mapred-site.xml

Step 8: Configuring yarn-site.xml

Step 9: Setting Java heap space

Step 10: Formatting HDFS

Step 11: Starting the HDFS

Step 12: Starting YARN

Step 13: Verifying on the web UI

Submitting Spark jobs on YARN cluster

Advance job submissions in a YARN cluster

Apache Mesos

Client mode

Cluster mode

Deploying on AWS

Step 1: Key pair and access key configuration

Step 2: Configuring Spark cluster on EC2

Step 3: Running Spark jobs on the AWS cluster

Step 4: Pausing, restarting, and terminating the Spark cluster

Summary

Testing and Debugging Spark

Testing in a distributed environment

Distributed environment

Issues in a distributed system

Challenges of software testing in a distributed environment

Testing Spark applications

Testing Scala methods

Unit testing

Testing Spark applications

Method 1: Using Scala JUnit test

Method 2: Testing Scala code using FunSuite

Method 3: Making life easier with Spark testing base

Configuring Hadoop runtime on Windows

Debugging Spark applications

Logging with log4j with Spark recap

Debugging the Spark application

Debugging Spark application on Eclipse as Scala debug

Debugging Spark jobs running as local and standalone mode

Debugging Spark applications on YARN or Mesos cluster

Debugging Spark application using SBT

Summary

PySpark and SparkR

Introduction to PySpark

Installation and configuration

By setting SPARK_HOME

Using Python shell

By setting PySpark on Python IDEs

Getting started with PySpark

Working with DataFrames and RDDs

Reading a dataset in Libsvm format

Reading a CSV file

Reading and manipulating raw text files

Writing UDF on PySpark

Let's do some analytics with k-means clustering

Introduction to SparkR

Why SparkR?

Installing and getting started

Getting started

Using external data source APIs

Data manipulation

Querying SparkR DataFrame

Visualizing your data on RStudio

Summary

Preface

The continued growth in data coupled with the need to make increasingly complex decisions against that data is creating massive hurdles that prevent organizations from deriving insights in a timely manner using traditional analytical approaches. The field of big data has become so related to these frameworks that its scope is defined by what these frameworks can handle. Whether you're scrutinizing the clickstream from millions of visitors to optimize online ad placements, or sifting through billions of transactions to identify signs of fraud, the need for advanced analytics, such as machine learning and graph processing, to automatically glean insights from enormous volumes of data is more evident than ever.

Apache Spark, the de facto standard for big data processing, analytics, and data science across academia and industry, provides both machine learning and graph processing libraries, allowing companies to tackle complex problems with the power of highly scalable, clustered computers. Spark's promise is to take this a little further and make writing distributed programs in Scala feel like writing regular programs. Spark excels at giving ETL pipelines huge boosts in performance and at easing some of the pain that feeds the MapReduce programmer's daily chant of despair to the Hadoop gods.

In this book, we use Spark and Scala to bring state-of-the-art advanced data analytics with machine learning, graph processing, streaming, and SQL to Spark, drawing on MLlib, ML, SQL, GraphX, and other libraries.

We start with Scala, then move on to Spark, and finally cover some advanced topics for big data analytics with Spark and Scala. In the appendix, we will see how to extend your Scala knowledge for SparkR, PySpark, Apache Zeppelin, and in-memory Alluxio. This book isn't meant to be read from cover to cover. Skip to a chapter that looks like something you're trying to accomplish or that simply ignites your interest.

Happy reading!

What this book covers

Chapter 1, Introduction to Scala, prepares you for big data analytics using the Scala-based APIs of Spark. Spark itself is written in Scala, so, as a natural starting point, we give a brief introduction to Scala, covering the basic aspects of its history and purposes, and how to install Scala on Windows, Linux, and Mac OS. After that, the Scala web framework will be discussed in brief. Then, we will provide a comparative analysis of Java and Scala. Finally, we will dive into Scala programming to get started with Scala.

Chapter 2, Object-Oriented Scala, says that the object-oriented programming (OOP) paradigm provides a whole new layer of abstraction. In short, this chapter discusses some of the greatest strengths of OOP languages: discoverability, modularity, and extensibility. In particular, we will see how to deal with variables in Scala; methods, classes, and objects in Scala; packages and package objects; traits and trait linearization; and Java interoperability.
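As a small illustration of the trait linearization mentioned above, here is a minimal sketch in plain Scala. The names Greeter, Loud, Polite, LoudThenPolite, and PoliteThenLoud are hypothetical and chosen only for this example; they are not taken from the book. The point is that the order in which traits are mixed in decides which implementation a call to super reaches.

```scala
// Hypothetical example types, not from the book.
trait Greeter {
  def greet: String = "hello"
}

// Each trait decorates whatever comes before it in the linearization.
trait Loud extends Greeter {
  override def greet: String = super.greet.toUpperCase
}

trait Polite extends Greeter {
  override def greet: String = super.greet + ", please"
}

// Linearization: LoudThenPolite -> Polite -> Loud -> Greeter
class LoudThenPolite extends Loud with Polite

// Linearization: PoliteThenLoud -> Loud -> Polite -> Greeter
class PoliteThenLoud extends Polite with Loud

object TraitLinearizationDemo {
  def main(args: Array[String]): Unit = {
    println(new LoudThenPolite().greet) // HELLO, please
    println(new PoliteThenLoud().greet) // HELLO, PLEASE
  }
}
```

The rightmost trait is applied last, so swapping the mixin order changes the result even though both classes combine the same two traits.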

Chapter 3, Functional Programming Concepts, showcases the functional programming concepts in Scala. More specifically, we will learn several topics, such as why Scala is an arsenal for the data scientist, why it is important to learn the Spark paradigm, pure functions, and higher-order functions (HOFs). A real-life use case using HOFs will be shown too. Then, we will see how to handle exceptions in higher-order functions outside of collections using the standard library of Scala. Finally, we will look at how functional Scala affects an object's mutability.
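To make the terms in this summary concrete, the following is a minimal sketch, in plain Scala without Spark, of a pure function, a higher-order function, a function returned as a value, and error handling with Either from the standard library. The object and method names (HigherOrderDemo, applyTwice, multiplier, safeDivide) are assumptions made for illustration only.

```scala
object HigherOrderDemo {
  // Pure function: the result depends only on the argument; no side effects.
  def square(x: Int): Int = x * x

  // Higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Function as a return value: builds a multiplier for the given factor.
  def multiplier(factor: Int): Int => Int = (x: Int) => x * factor

  // Error handling without throwing: Left carries the error, Right the result.
  def safeDivide(a: Int, b: Int): Either[String, Int] =
    if (b == 0) Left("division by zero") else Right(a / b)

  def main(args: Array[String]): Unit = {
    println(applyTwice(square, 3))             // 81
    println(List(1, 2, 3).map(multiplier(10))) // List(10, 20, 30)
    println(safeDivide(10, 2))                 // Right(5)
    println(safeDivide(1, 0))                  // Left(division by zero)
  }
}
```

Returning an Either keeps the failure in the type signature, so callers have to handle it explicitly instead of relying on a try/catch block.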

Chapter 4, Collection APIs, introduces one of the features that attract most Scala users: the Collections API. It is very powerful and flexible, and comes with a large set of operations. We will also demonstrate the capabilities of the Scala Collection API and how it can be used to accommodate different types of data and solve a wide range of problems. In this chapter, we will cover Scala collection APIs, types and hierarchy, some performance characteristics, Java interoperability, and Scala implicits.
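A few of the collection operations listed in the table of contents (filter, map, groupBy, flatMap, takeWhile, and Option) can be sketched as follows. This uses only the Scala standard library; the sample data and the object name CollectionsDemo are made up for illustration.

```scala
object CollectionsDemo {
  // Made-up sample data for illustration.
  val words: List[String] = List("spark", "scala", "hadoop", "sql", "kafka")

  // filter + map: lengths of the words that start with "s".
  val sLengths: List[Int] = words.filter(_.startsWith("s")).map(_.length)

  // groupBy: bucket the words by their first character.
  val byInitial: Map[Char, List[String]] = words.groupBy(_.head)

  // Option: a lookup that may find nothing returns None instead of null.
  val missing: Option[List[String]] = byInitial.get('z')

  // flatMap: map each string to its characters and flatten the result.
  val letters: List[Char] = List("ab", "cd").flatMap(_.toList)

  // takeWhile: keep the leading elements that satisfy the predicate.
  val leadingS: List[String] = words.takeWhile(_.startsWith("s"))

  def main(args: Array[String]): Unit = {
    println(sLengths)       // List(5, 5, 3)
    println(byInitial('s')) // List(spark, scala, sql)
    println(missing)        // None
    println(letters)        // List(a, b, c, d)
    println(leadingS)       // List(spark, scala)
  }
}
```

All of these operations return new immutable collections and leave the original list unchanged.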

Chapter 5, Tackle Big Data - Spark Comes to the Party, outlines data analysis and big data; we see the challenges that big data poses, how they are dealt with by distributed computing, and the approaches suggested by functional programming. We introduce Google's MapReduce, Apache Hadoop, and finally, Apache Spark, and see how they embraced this approach and these techniques. We will look into the evolution of Apache Spark: why Apache Spark was created in the first place and the value it can bring to the challenges of big data analytics and processing.

Chapter 6, Start Working with Spark - REPL and RDDs, covers how Spark works; then, we introduce RDDs, the basic abstractions behind Apache Spark, and see that they are simply distributed collections exposing Scala-like APIs. We will look at the deployment options for Apache Spark and run it locally as a Spark shell. We will learn the internals of Apache Spark, what RDDs are, DAGs and lineages of RDDs, Transformations, and Actions.

Chapter 7, Special RDD Operations, focuses on how RDDs can be tailored to meet different needs, and how these RDDs provide new functionalities (and dangers!). Moreover, we investigate other useful objects that Spark provides, such as broadcast variables and accumulators. We will also learn aggregation techniques and shuffling.

Chapter 8, Introduce a Little Structure - SparkSQL, teaches how to use Spark for the analysis of structured data as a higher-level abstraction of RDDs and how Spark SQL's APIs make querying structured data simple yet robust. Moreover, we introduce datasets and look at the differences between datasets, DataFrames, and RDDs. We will also learn to use join operations and window functions to do complex data analysis using DataFrame APIs.

Chapter 9, Stream Me Up, Scotty - Spark Streaming, takes you through Spark Streaming and how we can take advantage of it to process streams of data using the Spark API. Moreover, in this chapter, the reader will learn various ways of processing real-time streams of data using a practical example to consume and process tweets from Twitter. We will look at integration with Apache Kafka to do real-time processing. We will also look at structured streaming, which can provide real-time queries to your applications.

Chapter 10, Everything is Connected - GraphX, shows how many real-world problems can be modeled (and resolved) using graphs. We will look at graph theory using Facebook as an example, Apache Spark's graph processing library GraphX, VertexRDD and EdgeRDDs, graph operators, aggregateMessages, TriangleCounting, the Pregel API, and use cases such as the PageRank algorithm.

Chapter 11, Learning Machine Learning - Spark MLlib and ML, provides a conceptual introduction to statistical machine learning. We will focus on Spark's machine learning APIs, called Spark MLlib and ML. We will then discuss how to solve classification tasks using decision trees and random forest algorithms, and regression problems using the linear regression algorithm. We will also show how we could benefit from using one-hot encoding and dimensionality reduction algorithms in feature extraction before training a classification model. In later sections, we will show a step-by-step example of developing a collaborative filtering-based movie recommendation system.

Chapter 12, My Name is Bayes, Naive Bayes, states that machine learning in big data is a radical combination that has created great impact in the field of research, in both academia and industry. Big data imposes great challenges on ML, data analytics tools, and algorithms to find the real value, and making future predictions based on these huge datasets has never been easy. Considering this challenge, in this chapter, we will dive deeper into ML and find out how to use a simple yet powerful method to build a scalable classification model. We will cover concepts such as multinomial classification, Bayesian inference, Naive Bayes, and decision trees, and provide a comparative analysis of Naive Bayes versus decision trees.

Chapter 13, Time to Put Some Order - Cluster Your Data with Spark MLlib, gets you started on how Spark works in cluster mode with its underlying architecture. In previous chapters, we saw how to develop practical applications using different Spark APIs. Finally, we will see how to deploy a full Spark application on a cluster, be it with a pre-existing Hadoop installation or without.

Chapter 14, Text Analytics Using Spark ML, outlines the wonderful field of text analytics using Spark ML. Text analytics is a wide area in machine learning and is useful in many use cases, such as sentiment analysis, chat bots, email spam detection, natural language processing, and many more. We will learn how to use Spark for text analysis with a focus on use cases of text classification using a 10,000-sample set of Twitter data. We will also look at LDA, a popular technique to generate topics from documents without knowing much about the actual text, and will implement text classification on Twitter data to see how it all comes together.

Chapter 15, Spark Tuning, digs deeper into Apache Spark internals and points out that, while Spark is great at making us feel as if we are using just another Scala collection, we shouldn't forget that Spark actually runs in a distributed system. Therefore, throughout this chapter, we will cover how to monitor Spark jobs, Spark configuration, common mistakes in Spark app development, and some optimization techniques.

Chapter 16, Time to Go to ClusterLand - Deploying Spark on a Cluster, explores how Spark works in cluster mode with its underlying architecture. We will see Spark architecture in a cluster, the Spark ecosystem and cluster management, and how to deploy Spark on standalone, Mesos, Yarn, and AWS clusters. We will also see how to deploy your app on a cloud-based AWS cluster.

Chapter 17, Testing and Debugging Spark, explains how difficult it can be to test an application if it is distributed; then, we see some ways to tackle this. We will cover how to do testing in a distributed environment, and testing and debugging Spark applications.

Chapter 18, PySpark & SparkR, covers the other two popular APIs for writing Spark code using Python and R, that is, PySpark and SparkR. In particular, we will cover how to get started with PySpark and how to interact with DataFrame APIs and UDFs with PySpark, and then we will do some data analytics using PySpark. The second part of this chapter covers how to get started with SparkR. We will also see how to do data processing and manipulation, how to work with RDDs and DataFrames using SparkR, and finally, some data visualization using SparkR.

Chapter 19, Advanced Machine Learning Best Practices, provides theoretical and practical aspects of some advanced topics of machine learning with Spark. We will see how to tune machine learning models for optimized performance using grid search, cross-validation, and hyperparameter tuning. In a later section, we will cover how to develop a scalable recommendation system using ALS, which is an example of a model-based recommendation algorithm. Finally, a topic modelling application will be demonstrated as a text clustering technique.

Appendix A, Accelerating Spark with Alluxio, shows how to use Alluxio with Spark to increase the speed of processing. Alluxio is an open source distributed memory storage system useful for increasing the speed of many applications across platforms, including Apache Spark. We will explore the possibilities of using Alluxio and how Alluxio integration will provide greater performance without the need to cache the data in memory every time we run a Spark job.

Appendix B, Interactive Data Analytics with Apache Zeppelin, explains that, from a data science perspective, interactive visualization of your data analysis is also important. Apache Zeppelin is a web-based notebook for interactive and large-scale data analytics with multiple backends and interpreters. In this appendix, we will discuss how to use Apache Zeppelin for large-scale data analytics using Spark as the interpreter in the backend.

Chapter 19 and Appendices are not present in the book but are available for download at the following link: https://www.packtpub.com/sites/default/files/downloads/ScalaandSparkforBigDataAnalytics_OnlineChapter_Appendices.pdf.

What you need for this book

All the examples have been implemented using Python versions 2.7 and 3.5 on 64-bit Ubuntu Linux, including the TensorFlow library version 1.0.1. However, the book shows only Python 2.7-compatible source code; source code that is compatible with Python 3.5+ can be downloaded from the Packt repository. You will also need the following software (preferably the latest versions):

Spark 2.0.0 (or higher)

Hadoop 2.7 (or higher)

Java (JDK and JRE) 1.7+/1.8+

Scala 2.11.x (or higher)

Python 2.7+/3.4+

R 3.1+ and RStudio 1.0.143 (or higher)

Eclipse Mars, Oxygen, or Luna (latest)

Maven Eclipse plugin (2.9 or higher)

Maven compiler plugin for Eclipse (2.3.2 or higher)

Maven assembly plugin for Eclipse (2.4.1 or higher)

Operating system: Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). To be more specific, for Ubuntu it is recommended to have a complete 14.04 (LTS) 64-bit (or later) installation, VMware Player 12, or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Hardware configuration: Processor Core i3, Core i5 (recommended), or Core i7 (to get the best results). However, multicore processing will provide faster data processing and scalability. You will need at least 8-16 GB RAM (recommended) for standalone mode and at least 32 GB RAM for a single VM, and more for a cluster. You will also need enough storage for running heavy jobs (depending on the dataset size you will be handling), and preferably at least 50 GB of free disk storage (for standalone mode and for an SQL warehouse).

Who this book is for

Anyone who wishes to learn how to perform data analysis by harnessing the power of Spark will find this book extremely useful. No knowledge of Spark or Scala is assumed, although prior programming experience (especially with other JVM languages) will be useful in order to pick up the concepts quicker. Scala has been observing a steady rise in adoption over the past few years, especially in the fields of data science and analytics. Going hand in hand with Scala is Apache Spark, which is programmed in Scala and is widely used in the field of analytics. This book will help you leverage the power of both these tools to make sense of big data.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, including what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-and-Spark-for-Big-Data-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/ScalaandSparkforBigDataAnalytics_ColorImages.pdf

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Introduction to Scala

"I'm Scala. I'm a scalable, functional and object-oriented programming language. I can grow with you and you can play with me by typing one-line expressions and observing the results instantly"

- Scala Quote

In the last few years, Scala has seen a steady rise and wide adoption by developers and practitioners, especially in the fields of data science and analytics. Apache Spark, which is written in Scala, is a fast and general engine for large-scale data processing. Spark's success is due to many factors: an easy-to-use API, a clean programming model, performance, and so on. Naturally, therefore, Spark has more support for Scala: more APIs are available for Scala than for Python or Java, and new Scala APIs become available before those for Java, Python, and R.

Before we start writing data analytics programs using Spark and Scala (part II), we will first get familiar with Scala's functional programming concepts, object-oriented features, and the Scala collection APIs in detail (part I). As a starting point, we will provide a brief introduction to Scala in this chapter. We will cover some basic aspects of Scala, including its history and purposes. Then we will see how to install Scala on different platforms, including Windows, Linux, and Mac OS, so that your data analytics programs can be written in your favourite editors and IDEs. Later in this chapter, we will provide a comparative analysis between Java and Scala. Finally, we will dive into Scala programming with some examples.

In a nutshell, the following topics will be covered:

History and purposes of Scala

Platforms and editors

Installing and setting up Scala

Scala: the scalable language

Scala for Java programmers

Scala for the beginners

Summary

History and purposes of Scala

Scala is a general-purpose programming language that comes with support for functional programming and a strong static type system. Scala source code is intended to be compiled into Java bytecode, so that the resulting executable code can run on a Java virtual machine (JVM).

Martin Odersky started the design of Scala back in 2001 at the École Polytechnique Fédérale de Lausanne (EPFL). It was an extension of his work on Funnel, which is a programming language that uses functional programming and Petri nets. The first public release appeared in 2004, with support only for the Java platform. A version for the .NET framework followed in June 2004.

Scala has become very popular and has experienced wide adoption because it not only supports the object-oriented programming paradigm, but also embraces functional programming concepts. In addition, although Scala's symbolic operators are not always easy to read, most Scala code is concise and easy to read compared to Java, which tends to be verbose.

Like any other programming language, Scala was proposed and developed for specific purposes. Now, the question is, why was Scala created and what problems does it solve? To answer these questions, Odersky said in his blog:

"The work on Scala stems from a research effort to develop better language support for component software. There are two hypotheses that we would like to validate with the Scala experiment. First, we postulate that a programming language for component software needs to be scalable in the sense that the same concepts can describe small as well as large parts. Therefore, we concentrate on mechanisms for abstraction, composition, and decomposition, rather than adding a large set of primitives, which might be useful for components at some level of scale but not at other levels. Second, we postulate that scalable support for components can be provided by a programming language which unifies and generalizes object-oriented and functional programming. For statically typed languages, of which Scala is an instance, these two paradigms were up to now largely separate."

Nevertheless, pattern matching, higher-order functions, and so on are also provided in Scala, not to fill the gap between FP and OOP, but because they are typical features of functional programming. To this end, Scala has some incredibly powerful pattern-matching features, alongside an actor-based concurrency framework. Moreover, it supports first-class and higher-order functions. In summary, the name "Scala" is a portmanteau of scalable language, signifying that it is designed to grow with the demands of its users.
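As a rough illustration of the two features just mentioned, the following minimal sketch shows pattern matching on case classes and a higher-order function. The names PatternMatchingSketch, Shape, Circle, Rectangle, area, and applyTwice are invented for this example and do not come from the book's code bundle.

```scala
object PatternMatchingSketch {
  // A small data model, invented purely for illustration.
  sealed trait Shape
  case class Circle(radius: Double) extends Shape
  case class Rectangle(width: Double, height: Double) extends Shape

  // Pattern matching deconstructs each case class to reach its fields.
  def area(shape: Shape): Double = shape match {
    case Circle(r)       => math.Pi * r * r
    case Rectangle(w, h) => w * h
  }

  // A higher-order function: it takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  def main(args: Array[String]): Unit = {
    println(area(Rectangle(2.0, 3.0))) // prints 6.0
    println(applyTwice(_ + 3, 10))     // prints 16
  }
}
```

Because Shape is sealed, the compiler can warn when a match does not cover every case, which is one reason this style is popular for modeling data.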

Platforms and editors

Scala runs on the Java Virtual Machine (JVM), which makes Scala a good choice for Java programmers who would like to have a functional programming flavor in their code. There are lots of options when it comes to editors. It is worth spending some time comparing the available editors, because being comfortable with an IDE is one of the key factors for a successful programming experience. The following are some options to choose from:

Scala IDE

Scala plugin for Eclipse

IntelliJ IDEA

Emacs

VIM

Scala programming on Eclipse is supported through numerous beta plugins and has several advantages. Eclipse provides some exciting features such as local, remote, and high-level debugging facilities with semantic highlighting and code completion for Scala. You can use Eclipse for Java as well as Scala application development with equal ease. However, I would also suggest Scala IDE (http://scala-ide.org/), a full-fledged Scala editor based on Eclipse and customized with a set of interesting features (for example, Scala worksheets, ScalaTest support, Scala refactoring, and so on).

The second best option, in my view, is IntelliJ IDEA. Its first release came in 2001 as one of the first available Java IDEs with advanced code navigation and refactoring capabilities integrated. According to the InfoWorld report (see http://www.infoworld.com/article/2683534/development-environments/infoworld-review--top-java-programming-tools.html), out of the four top Java programming IDEs (that is, Eclipse, IntelliJ IDEA, NetBeans, and JDeveloper), IntelliJ received the highest test center score of 8.5 out of 10.

The corresponding scoring is shown in the following figure:

Figure 1: Best IDEs for Scala/Java developers

From the preceding figure, you may be interested in using other IDEs such as NetBeans and JDeveloper too. Ultimately, the choice is an everlasting debate among the developers, which means the final choice is yours.

Installing and setting up Scala

As we have already mentioned, Scala runs on the JVM, so make sure you have Java installed on your machine. If not, refer to the next subsection, which shows how to install Java 8 on Ubuntu. Then, we will see how to install Scala on Windows, Mac OS, and Linux.

Installing Java

For simplicity, we will show how to install Java 8 on an Ubuntu 14.04 LTS 64-bit machine. For Windows and Mac OS, it is better to search online for the installation steps. As a starting point for Windows users, refer to https://java.com/en/download/help/windows_manual_download.xml for details.

Now, let's see how to install Java 8 on Ubuntu with step-by-step commands and instructions. At first, check whether Java is already installed:

$ java -version

If it returns The program java cannot be found in the following packages, Java hasn't been installed yet. In that case, execute the following command to install it:

$ sudo apt-get install default-jre

This will install the Java Runtime Environment (JRE). However, you may instead need the Java Development Kit (JDK), which is usually required to compile Java applications with Apache Ant, Apache Maven, Eclipse, and IntelliJ IDEA.

The Oracle JDK is the official JDK; however, it is no longer provided by Oracle as a default installation for Ubuntu. You can still install it using apt-get. To install any version, first execute the following commands:

$ sudo apt-get install python-software-properties

$ sudo apt-get update

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update

Then, to install Java 8, execute the following command:

$ sudo apt-get install oracle-java8-installer

After installing, don't forget to set the JAVA_HOME environment variable. Just apply the following commands (for simplicity, we assume that Java is installed at /usr/lib/jvm/java-8-oracle):

$ echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle" >> ~/.bashrc

$ echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc

$ source ~/.bashrc

Now, let's check the value of JAVA_HOME as follows:

$ echo $JAVA_HOME

You should observe the following result on Terminal:

/usr/lib/jvm/java-8-oracle

Now, let's check that Java has been installed successfully by issuing the following command (you might see a newer version):

$ java -version

You will get the following output:

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Excellent! Now you have Java installed on your machine, so you're ready to run Scala code once Scala is installed. Let's do this in the next few subsections.

Windows

This part focuses on installing Scala on a PC with Windows 7, but in the end, it won't matter which version of Windows you are running:

The first step is to download a zipped file of Scala from the official site. You will find it at https://www.Scala-lang.org/download/all.html. Under the other resources section of this page, you will find a list of the archive files from which you can install Scala. We will choose to download the zipped file for Scala 2.11.8, as shown in the following figure:

Figure 2: Scala installer for Windows

After the download has finished, unzip the file and place it in your favorite folder. You can also rename the extracted folder to Scala for easier navigation. Finally, a PATH variable needs to be created for Scala to be globally visible on your OS. For this, navigate to Computer | Properties, as shown in the following figure:

Figure 3: Environmental variable tab on windows

Select Environment Variables from there and get the location of the bin folder of Scala; then, append it to the PATH environment variable. Apply the changes and confirm them, as shown in the following screenshot:

Figure 4: Adding environmental variables for Scala

Now, you are ready to go with the Windows installation. Open the CMD and just type scala. If you were successful in the installation process, then you should see an output similar to the following screenshot:

Figure 5: Accessing Scala from "Scala shell"

Mac OS

It's time now to install Scala on your Mac. There are lots of ways in which you can install Scala on your Mac, and here, we are going to mention two of them:

Using Homebrew installer

At first, check your system to see whether it has Xcode installed or not because it's required in this step. You can install it from the Apple App Store free of charge.

Next, you need to install Homebrew by running the following command in your terminal:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Note: The preceding command is changed by the Homebrew maintainers from time to time. If the command doesn't seem to be working, check the Homebrew website for the latest version: http://brew.sh/.

Now, you are ready to install Scala by typing the command brew install scala in the terminal.

Finally, you are ready to go: simply type scala in your terminal (the second line in the figure) and you will observe the following:

Figure 6: Scala shell on macOS

Installing manually

Before installing Scala manually, choose your preferred version of Scala and download the corresponding .tgz file of that version, scala-version.tgz, from http://www.Scala-lang.org/download/. After downloading your preferred version of Scala, extract it as follows:

$ tar xvf scala-2.11.8.tgz

Then, move it to /usr/local/share as follows:

$ sudo mv scala-2.11.8 /usr/local/share

Now, to make the installation permanent, execute the following commands:

$ echo "export SCALA_HOME=/usr/local/share/scala-2.11.8" >> ~/.bash_profile

$ echo 'export PATH=$PATH:$SCALA_HOME/bin' >> ~/.bash_profile

That's it. Now, let's see how it can be done on Linux distributions like Ubuntu in the next subsection.

Linux

In this subsection, we will show you the installation procedure of Scala on the Ubuntu distribution of Linux. Before starting, let's check whether Scala is already installed. Checking this is straightforward using the following command:

$ scala -version

If Scala is already installed on your system, you should get a message on your terminal reporting the installed version.

Note that, at the time of writing, we used the latest version of Scala, that is, 2.11.8. If you do not have Scala installed on your system, install it before proceeding to the next step. You can download the latest version of Scala from the Scala website at http://www.scala-lang.org/download/ (for a clearer view, refer to Figure 2). For ease, let's download Scala 2.11.8, as follows:

$ cd Downloads/

$ wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz

After the download has finished, you should find the Scala tar file in the Downloads folder.

If you are not already there, first go into the Downloads directory with the following command: $ cd Downloads/. Note that the name of the downloads folder may differ depending on the system's selected language.

To extract the Scala tar file from the Terminal, type the following command:

$ tar -xvzf scala-2.11.8.tgz

Now, move the Scala distribution to a shared location (for example, /usr/local/share) by typing the following command or doing it manually:

$ sudo mv scala-2.11.8 /usr/local/share/

Move to your home directory using the following command:

$ cd ~

Then, set the Scala home using the following commands:

$ echo "export SCALA_HOME=/usr/local/share/scala-2.11.8" >> ~/.bashrc

$ echo 'export PATH=$PATH:$SCALA_HOME/bin' >> ~/.bashrc

Then, apply the change to the current session by using the following command:

$ source ~/.bashrc

After the installation has been completed, you should verify it using the following command:

$ scala -version

If Scala has been successfully configured on your system, you should get a message on your terminal reporting the installed version.

Well done! Now, let's enter the Scala shell by typing the scala command on the terminal, as shown in the following figure:

Figure 7: Scala shell on Linux (Ubuntu distribution)

Finally, you can also install Scala using the apt-get command, as follows:

$ sudo apt-get install scala

This command will download the latest version of Scala (that is, 2.12.x). However, Spark does not have support for Scala 2.12 yet (at least when we wrote this chapter). Therefore, we would recommend the manual installation described earlier.

Scala: the scalable language

The name Scala comes from scalable language, because Scala's concepts scale well to large programs. Some programs in other languages take tens of lines to code, but Scala gives you the power to express the general patterns and concepts of programming in a concise and effective manner. In this section, we will describe some exciting features of Scala that Odersky has created for us:

Scala is object-oriented

Scala is a very good example of an object-oriented language. To define a type or behavior for your objects, you use classes and traits, which will be explained in the next chapter. Scala doesn't support direct multiple inheritance, but you can achieve a similar structure using Scala's extension of subclassing and mixin-based composition. This will be discussed in later chapters.
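To make mixin-based composition a little more concrete ahead of the later chapters, here is a minimal sketch. The names MixinSketch, Animal, Swimmer, Flyer, and Duck are hypothetical and chosen only for illustration.

```scala
object MixinSketch {
  // Traits declare behavior that classes can mix in.
  trait Swimmer { def swim: String = "swimming" }
  trait Flyer   { def fly: String = "flying" }

  class Animal(val name: String)

  // One superclass (Animal) plus two traits: mixin-based composition
  // stands in for direct multiple inheritance.
  class Duck extends Animal("duck") with Swimmer with Flyer

  def main(args: Array[String]): Unit = {
    val duck = new Duck
    println(s"${duck.name} is ${duck.swim} and ${duck.fly}")
    // prints: duck is swimming and flying
  }
}
```

A class can extend only one other class, but it can mix in any number of traits, which is how behavior from several sources is combined here.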

Scala is functional

Functional programming treats functions as first-class citizens. In Scala, this is achieved with syntactic sugar and objects that extend traits (like Function2). Scala also provides a simple and easy way to define anonymous functions (functions without names). It supports higher-order functions and allows nested functions. The syntax of these concepts will be explained in more detail in the coming chapters.

Scala also helps you to code in an immutable way, which makes it easier to apply parallelism and concurrency safely.
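The following minimal sketch puts these ideas side by side: an anonymous function held in an immutable value, a nested function, and a higher-order call. The names FunctionalSketch, add, sumOfSquares, and square are made up for this example.

```scala
object FunctionalSketch {
  // An anonymous function stored in an immutable value (val).
  // The function literal is sugar for an object of type Function2[Int, Int, Int].
  val add: (Int, Int) => Int = (a, b) => a + b

  def sumOfSquares(numbers: List[Int]): Int = {
    // A nested function, visible only inside sumOfSquares.
    def square(n: Int): Int = n * n
    // map and foldLeft are higher-order: they take functions as arguments.
    numbers.map(square).foldLeft(0)(add)
  }

  def main(args: Array[String]): Unit = {
    println(add(2, 3))                   // prints 5
    println(sumOfSquares(List(1, 2, 3))) // prints 14
  }
}
```

Nothing in this sketch is reassigned after it is defined, which is the immutable style that the text above recommends.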

Scala is statically typed

Unlike some other statically typed languages, such as Pascal and Rust, Scala does not expect you to provide redundant type information. You don't have to specify the type in most cases and, most importantly, you don't need to repeat it.

A programming language is called statically typed if the type of a variable is known at compile time. In languages without type inference, this also means that, as a programmer, you must specify the type of each variable. Examples of statically typed languages include Scala, Java, C, OCaml, Haskell, and C++. On the other hand, Perl, Ruby, Python, and so on are dynamically typed languages, where the type is not associated with the variables or fields, but with the runtime values.

The statically typed nature of Scala ensures that type checking is done by the compiler. This extremely powerful feature of Scala helps you catch most trivial bugs and errors at a very early stage, before the code is executed.
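As a small sketch of how static typing and type inference fit together, consider the following definitions. The object name TypeInferenceSketch and its values are illustrative only.

```scala
object TypeInferenceSketch {
  val count = 42                              // inferred as Int
  val title: String = "Scala"                 // explicit annotation (optional here)
  val lengths = List("a", "bb").map(_.length) // inferred as List[Int]

  // Uncommenting the next line makes compilation fail with a type mismatch,
  // so the error is caught before the program ever runs:
  // val wrong: Int = "not a number"

  def main(args: Array[String]): Unit = {
    println(count + lengths.sum) // prints 45
    println(title.toUpperCase)   // prints SCALA
  }
}
```

The types are fixed at compile time even where they are not written out, so the compiler can still reject the commented-out assignment.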

Scala runs on the JVM

Just like Java, Scala is compiled into bytecode, which can easily be executed by the JVM. This means that the runtime platforms of Scala and Java are the same because both generate bytecode as the compilation output. So, you can easily switch from Java to Scala, easily integrate both, or even use Scala in your Android application to add a functional flavor.

Note that, while using Java code in a Scala program is quite easy, the opposite is very difficult, mostly because of Scala's syntactic sugar.

Also, just like the javac command, which compiles Java code into bytecode, Scala has the scalac command, which compiles Scala code into bytecode.

Scala can execute Java code

As mentioned earlier, Scala can also be used to execute your Java code. Beyond simply running your Java code, it enables you to use all the available classes from the Java SDK, and even your own predefined classes, projects, and packages, right in the Scala environment.
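For instance, standard Java collection classes can be used directly from Scala code. The sketch below assumes nothing beyond the JDK; the object name JavaInteropSketch and the method sortedWords are invented for illustration.

```scala
import java.util.{ArrayList, Collections}

object JavaInteropSketch {
  // Builds and sorts a plain java.util.ArrayList from Scala code.
  def sortedWords(): ArrayList[String] = {
    val words = new ArrayList[String]()
    words.add("spark")
    words.add("scala")
    words.add("java")
    Collections.sort(words) // a static Java method, called like any Scala method
    words
  }

  def main(args: Array[String]): Unit = {
    println(sortedWords()) // prints [java, scala, spark]
  }
}
```

The same approach applies to your own compiled Java classes, as long as they are on the classpath when the Scala code is compiled and run.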