Apache Mahout Essentials - Jayani Withanawasam - E-Book

Description

If you are a Java developer or data scientist, haven't worked with Apache Mahout before, and want to get up to speed on implementing machine learning on big data, then this is the perfect guide for you.

Page count: 138

Year of publication: 2015


Table of Contents

Apache Mahout Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Introducing Apache Mahout
Machine learning in a nutshell
Features
Supervised learning versus unsupervised learning
Machine learning applications
Information retrieval
Business
Market segmentation (clustering)
Stock market predictions (regression)
Health care
Using a mammogram for cancer tissue detection
Machine learning libraries
Open source or commercial
Scalability
Languages used
Algorithm support
Batch processing versus stream processing
The story so far
Apache Mahout
Setting up Apache Mahout
How Apache Mahout works?
The high-level design
The distribution
From Hadoop MapReduce to Spark
Problems with Hadoop MapReduce
In-memory data processing with Spark and H2O
Why is Mahout shifting from Hadoop MapReduce to Spark?
When is it appropriate to use Apache Mahout?
Summary
2. Clustering
Unsupervised learning and clustering
Applications of clustering
Computer vision and image processing
Types of clustering
Hard clustering versus soft clustering
Flat clustering versus hierarchical clustering
Model-based clustering
K-Means clustering
Getting your hands dirty!
Running K-Means using Java programming
Data preparation
Understanding important parameters
Cluster visualization
Distance measure
Writing a custom distance measure
K-Means clustering with MapReduce
MapReduce in Apache Mahout
The map function
The reduce function
Additional clustering algorithms
Canopy clustering
Fuzzy K-Means
Streaming K-Means
The streaming step
The ball K-Means step
Spectral clustering
Dirichlet clustering
Text clustering
The vector space model and TF-IDF
N-grams and collocations
Preprocessing text with Lucene
Text clustering with the K-Means algorithm
Topic modeling
Optimizing clustering performance
Selecting the right features
Selecting the right algorithms
Selecting the right distance measure
Evaluating clusters
The initialization of centroids and the number of clusters
Tuning up parameters
The decision on infrastructure
Summary
3. Regression and Classification
Supervised learning
Target variables and predictor variables
Predictive analytics' techniques
Regression-based prediction
Model-based prediction
Tree-based prediction
Classification versus regression
Linear regression with Apache Spark
How does linear regression work?
A real-world example
The impact of smoking on mortality and different diseases
Linear regression with one variable and multiple variables
The integration of Apache Spark
Setting up Apache Spark with Apache Mahout
An example script
Distributed row matrix
An explanation of the code
Mahout references
The bias-variance trade-off
How to avoid over-fitting and under-fitting
Logistic regression with SGD
Logistic functions
Minimizing the cost function
Multinomial logistic regression versus binary logistic regression
A real-world example
An example script
Testing and evaluation
The confusion matrix
The area under the curve
The Naïve Bayes algorithm
The Bayes theorem
Text classification
Naïve assumption and its pros and cons in text classification
Improvements that Apache Mahout has made to the Naïve Bayes classification
A text classification coding example using the 20 newsgroups' example
Understand the 20 newsgroups' dataset
Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
Text classification using Naïve Bayes – the Spark implementation
The Markov chain
Hidden Markov Model
A real-world example – developing a POS tagger using HMM supervised learning
POS tagging
HMM for POS tagging
HMM implementation in Apache Mahout
HMM supervised learning
The important parameters
Returns
The Baum Welch algorithm
A code example
The important parameters
The Viterbi evaluator
The Apache Mahout references
Summary
4. Recommendations
Collaborative versus content-based filtering
Content-based filtering
Collaborative filtering
Hybrid filtering
User-based recommenders
A real-world example – movie recommendations
Data models
The similarity measure
The neighborhood
Recommenders
Evaluation techniques
The IR-based method (precision/recall)
Addressing the issues with inaccurate recommendation results
Item-based recommenders
Item-based recommenders with Spark
Matrix factorization-based recommenders
Alternating least squares
Singular value decomposition
Algorithm usage tips and tricks
Summary
5. Apache Mahout in Production
Introduction
Apache Mahout with Hadoop
YARN with MapReduce 2.0
The resource manager
The application manager
A node manager
The application master
Containers
Managing storage with HDFS
The life cycle of a Hadoop application
Setting up Hadoop
Setting up Mahout in local mode
Prerequisites
Java installation
Setting up Mahout in Hadoop distributed mode
Prerequisites
Creating a Hadoop user
Passwordless SSH configuration
The pseudo-distributed mode
Configuration changes
Formatting the DFS filesystem
Starting the servers
The fully-distributed mode
Prerequisites
Host file configuration
Hadoop configuration changes
Formatting the DFS filesystem
Starting servers
Monitoring Hadoop
Commands/scripts
Data nodes
Node managers
Web UIs
Setting up Mahout with Hadoop's fully-distributed mode
Troubleshooting Hadoop
Optimization tips
Summary
6. Visualization
The significance of visualization in machine learning
D3.js
A visualization example for K-Means clustering
Summary
Index

Apache Mahout Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2015

Production reference: 1120615

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78355-499-7

www.packtpub.com

Credits

Author

Jayani Withanawasam

Reviewers

Guillaume Agis

Saleem A. Ansari

Sahil Kharb

Pavan Kumar Narayanan

Commissioning Editor

Akram Hussain

Acquisition Editor

Shaon Basu

Content Development Editor

Nikhil Potdukhe

Technical Editor

Tanmayee Patil

Copy Editor

Dipti Kapadia

Project Coordinator

Vijay Kushlani

Proofreader

Safis Editing

Indexer

Tejal Soni

Graphics

Sheetal Aute

Jason Monteiro

Production Coordinator

Melwyn D'sa

Cover Work

Melwyn D'sa

About the Author

Jayani Withanawasam is an R&D engineer and a senior software engineer at Zaizi Asia, where she focuses on applying machine learning techniques to provide smart content management solutions.

She is currently pursuing an MSc degree in artificial intelligence at the University of Moratuwa, Sri Lanka, and holds a BE in software engineering (with first class honors) from the University of Westminster, UK.

She has more than 6 years of industry experience, and she has worked in areas such as machine learning, natural language processing, and semantic web technologies during her tenure.

She is passionate about working with semantic technologies and big data.

First of all, I would like to thank the Apache Mahout contributors for the invaluable effort that they have put into the project, making it a popular, scalable machine learning library in the industry.

Also, I would like to thank Rafa Haro for leading me toward the exciting world of machine learning and natural language processing.

I am sincerely grateful to Shaon Basu, an acquisition editor at Packt Publishing, and Nikhil Potdukhe, a content development editor at Packt Publishing, for their remarkable guidance and encouragement as I wrote this book amid my other commitments.

Furthermore, my heartfelt gratitude goes to Abinia Sachithanantham and Dedunu Dhananjaya for motivating me throughout the journey of writing the book.

Last but not least, I am eternally thankful to my parents for staying by my side throughout all my pursuits and being pillars of strength.

About the Reviewers

Guillaume Agis is a French 25-year-old with a master's degree in computer science from Epitech, where he studied for 4 years in France and 1 year in Finland.

Open-minded and interested in a lot of domains, such as healthcare, innovation, high-tech, and science, he is always open to new adventures and experiments. Currently, he works as a software engineer in London at a company called Touch Surgery, where he is developing an application. The application is a surgery simulator that allows you to practice and rehearse operations even before setting foot in the operating room.

His previous jobs were, for the most part, in R&D, where he worked with very innovative technologies, such as Mahout, to implement collaborative filtering into artificial intelligence.

He always does his best to bring his team to the top and tries to make a difference.

He also helps while42, a worldwide alumni network of French engineers, to grow and helps manage its London chapter.

I would like to thank all the people who have brought me to the top and helped me become what I am now.

Saleem A. Ansari is a full stack Java/Scala/Ruby developer with over 7 years of industry experience and a special interest in machine learning and information retrieval. Having implemented data ingestion and processing pipelines in Core Java and Ruby separately, he knows the challenges that huge datasets pose in such systems. He has worked for companies such as Red Hat, Impetus Technologies, Belzabar Software Design, and Exzeo Software Pvt Ltd. He is also a passionate member of the Free and Open Source Software (FOSS) Community. He started his journey with FOSS in 2004. In 2005, he formed JMILUG - Linux User's Group at Jamia Millia Islamia University, New Delhi. Since then, he has been contributing to FOSS by organizing community activities and by contributing code to various projects (http://github.com/tuxdna). He also mentors students on FOSS and its benefits. He is currently enrolled in the MSCS program at the Georgia Institute of Technology, USA. He can be reached at <[email protected]>.

Apart from reviewing this book, he maintains a blog at http://tuxdna.in/.

First of all, I would like to thank the vibrant, talented, and generous Apache Mahout community that created such a wonderful machine learning library. I would like to thank Packt Publishing and its staff for giving me this wonderful opportunity. I would like to thank the author for her hard work in simplifying and elaborating on the latest information in Apache Mahout.

Sahil Kharb recently graduated from the Indian Institute of Technology, Jodhpur (India), and is working at Rockon Technologies. He has worked on Mahout and Hadoop for the last two years. His area of interest is large-scale data mining. Nowadays, he works on Apache Spark and Apache Storm, doing real-time data analytics and batch processing with the help of Apache Mahout.

He has also reviewed Learning Apache Mahout, Packt Publishing.

I would like to thank my family, for their unconditional love and support, and God Almighty, for giving me strength and endurance. Also, I am thankful to my friend Chandni, who helped me in testing the code.

Pavan Kumar Narayanan is an applied mathematician with over 3 years of experience in mathematical programming, data science, and analytics. Currently based in New York, he has worked to build a marketing analytics product for a startup using Apache Mahout and has published and presented papers on algorithmic research at the Transportation Research Board, Washington DC, and the SUNY Research Conference, Albany, New York. He also runs a blog, DataScience Hacks (https://datasciencehacks.wordpress.com/). His interests are exploring new problem-solving techniques and software, from industrial mathematics to machine learning, and writing book reviews.

Pavan can be contacted at <[email protected]>.

I would like to thank my family, for their unconditional love and support, and God Almighty, for giving me strength and endurance.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

Apache Mahout is a scalable machine learning library that provides algorithms for classification, clustering, and recommendations.

This book helps you to use Apache Mahout to implement widely used machine learning algorithms in order to gain better insights about large and complex datasets in a scalable manner.

Starting from fundamental concepts in machine learning and Apache Mahout, this book presents real-world applications, a diverse range of popular algorithms and their implementations, code examples, evaluation strategies, and best practices for each machine learning technique. It also contains a complete step-by-step guide to setting up Apache Mahout in a production environment, using Apache Hadoop to run Mahout at scale in a distributed environment. Finally, it guides you through data visualization techniques for Apache Mahout, which make your data come alive!

What this book covers

Chapter 1, Introducing Apache Mahout, provides an introduction to machine learning and Apache Mahout.

Chapter 2, Clustering, provides an introduction to unsupervised learning and clustering techniques (K-Means clustering and other algorithms) in Apache Mahout along with performance optimization tips for clustering.

Chapter 3, Regression and Classification, provides an introduction to supervised learning and to regression and classification techniques (linear regression, logistic regression, Naïve Bayes, and HMMs) in Apache Mahout.

Chapter 4, Recommendations, provides a comparison between collaborative- and content-based filtering and recommenders in Apache Mahout (user-based, item-based, and matrix-factorization-based).

Chapter 5, Apache Mahout in Production, provides a guide to scaling Apache Mahout in the production environment with Apache Hadoop.

Chapter 6, Visualization, provides a guide to visualizing data using D3.js.

What you need for this book

The following software libraries are needed at various phases of this book:

Java 1.7 or above
Apache Mahout
Apache Hadoop
Apache Spark
D3.js

Who this book is for

If you are a Java developer or a data scientist who has not worked with Apache Mahout previously and wants to get up to speed on implementing machine learning on big data, then this is a concise and fast-paced guide for you.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support