Understand your data and user preferences to make intelligent, accurate, and profitable decisions
This book caters to beginners and experienced data scientists looking to understand and build complex predictive decision-making systems and recommendation engines using R, Python, Spark, Neo4j, and Hadoop.
A recommendation engine (sometimes referred to as a recommender system) is a tool that lets algorithm developers predict what a user may or may not like among a list of given items. Recommender systems have become extremely common in recent years, and are applied in a variety of applications. The most popular ones are movies, music, news, books, research articles, search queries, social tags, and products in general.
The book starts with an introduction to recommendation systems and their applications. You will then start building recommendation engines straight away from the very basics. As you move along, you will learn to build recommender systems with popular tools such as R, Python, Spark, Neo4j, and Hadoop. You will gain insight into the pros and cons of each recommendation engine and learn when to use each one, so that you pick the approach that best suits your needs.
During the course of the book, you will create a simple recommendation engine, a real-time recommendation engine, a scalable recommendation engine, and more. You will familiarize yourself with various recommender system techniques, such as collaborative, content-based, and cross-recommendations, before getting to know the best practices of building a recommender system towards the end of the book.
This book follows a step-by-step practical approach in which readers learn to build recommendation engines of increasing complexity in every chapter.
Number of pages: 297
Year of publication: 2016
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2016
Production reference: 1231216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-485-6
www.packtpub.com
Author
Suresh Kumar Gorakala
Copy Editor
Manisha Sinha
Reviewers
Vikram Dhillon
Vimal Romeo
Project Coordinator
Nidhi Joshi
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Mariammal Chettiyar
Content Development Editor
Manthan Raja
Graphics
Disha Haria
Technical Editor
Dinesh Chaudhary
Production Coordinator
Arvindkumar Gupta
Suresh Kumar Gorakala is a data scientist focused on artificial intelligence. He has close to 10 years of professional experience, having worked with various global clients across multiple domains and helped them solve their business problems using advanced big data analytics. He has worked extensively on recommendation engines, natural language processing, advanced machine learning, and graph databases. He previously co-authored Building a Recommendation System with R for Packt Publishing. He is a passionate traveler and a photographer by hobby.
I would like to thank my wife for putting up with my late-night writing sessions and all my family members for supporting me over the months. I also give deep thanks and gratitude to Barathi Ganesh, Raj Deepthi, Harsh, and my colleagues, without whose support this book quite possibly would not have happened. I would also like to thank all the mentors that I’ve had over the years. Without learning from these teachers, there is not a chance I could be doing what I do today, and it is because of them and others that I may not have listed here that I feel compelled to pass my knowledge on to those willing to learn. I would also like to thank all the reviewers and project managers of the book for making it a reality.
Vikram Dhillon is a software developer, a bioinformatics researcher, and a software coach at the Blackstone LaunchPad in the University of Central Florida. He has been working on his own startup involving healthcare data security of late. He lives in Orlando and regularly attends development meetups and hackathons. He enjoys spending his spare time reading about new technologies, such as the Blockchain and developing tutorials for machine learning in game design. He has been involved in open-source projects for over five years and writes about technology and startups at opsbug.com
Vimal Romeo is a data scientist at Ernst and Young, Rome. He holds a master’s degree in Big Data Analytics from Luiss Business School, Rome. He also holds an MBA degree from XIME, India, and a bachelor’s degree in computer science and engineering from CUSAT, India. He is an author at MilanoR, a blog about the R language.
I would like to thank my mom, Mrs Bernadit, and my brother, Vibin, for their continuous support. I would also like to thank my friends Matteo Amadei, Antonella Di Luca, Asish Mathew, and Eleonora Polidoro, who supported me during this process. A special thanks to Nidhi Joshi from Packt Publishing for keeping me motivated during the process.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Love You Mom
Building Recommendation Engines is a comprehensive guide to implementing recommendation engines, such as collaborative filtering, content-based, and context-aware recommendation engines, using R, Python, Spark, Mahout, and Neo4j. The book covers various recommendation engines widely used across industries, along with their implementations. It also includes a chapter on popular data mining techniques commonly used in building recommendations, and at the end briefly discusses the future of recommendation engines.
Chapter 1, Introduction to Recommendation Engines, is a refresher for data scientists and an introduction for beginners. This chapter introduces popular recommendation engines that people use in their day-to-day lives, and covers the popular recommendation engine approaches along with their pros and cons.
Chapter 2, Build Your First Recommendation Engine, is a short chapter on how to build a movie recommendation engine, giving us a head start before we take off into the world of recommendation engines.
Chapter 3, Recommendation Engines Explained, is about different recommendation engine techniques popularly employed, such as user-based collaborative filtering recommendation engines, item-based collaborative filtering, content-based recommendation engines, context-aware recommenders, hybrid recommenders, model-based recommender systems using Machine Learning models and mathematical models.
Chapter 4, Data Mining Techniques Used in Recommendation Engines, is about various Machine Learning techniques used in building recommendation engines such as similarity measures, classification, regression, and dimension reduction techniques. This chapter also covers evaluation metrics to test the recommendation engine’s predictive power.
Chapter 5, Building Collaborative Filtering Recommendation Engines, is about how to build user-based collaborative filtering and item-based collaborative filtering in R and Python. We'll also learn about different libraries available in R and Python that are extensively used in building recommendation engines.
Chapter 6, Building Personalized Recommendation Engines, is about how to build personalized recommendation engines using R and Python and the various libraries used for building content-based recommender systems and context-aware recommendation engines.
Chapter 7, Building Real-Time Recommendation Engines with Spark, is about the basics of Spark and MLlib required for building real-time recommender systems.
Chapter 8, Building Real-Time Recommendation Engines with Neo4j, is about the basics of graphDB and Neo4j concepts and how to build real-time recommender systems using Neo4j.
Chapter 9, Building Scalable Recommendation Engines with Mahout, is about the basic building blocks of Hadoop and Mahout required for building scalable recommender systems. It also covers the architecture we use to build scalable systems and a step-by-step implementation using Mahout and SVD.
Chapter 10, What Next?, is the final chapter. It summarizes what we have learned so far, the best practices employed in building decision-making systems, and where the future of recommender systems is headed.
To get started with the different implementations of recommendation engines in R, Python, Spark, Neo4j, and Mahout, we need the following software:
Chapter number | Software required (with version) | Download links to the software | OS required
2, 4, 5 | RStudio version 0.99.489 | https://www.rstudio.com/products/rstudio/download/ | Windows 7+/CentOS 6
2, 4, 5 | R version 3.2.2 | https://cran.r-project.org/bin/windows/base/ | Windows 7+/CentOS 6
5, 6, 7 | Anaconda 4.2 for Python 3.5 | https://www.continuum.io/downloads | Windows 7+/CentOS 6
8 | Neo4j 3.0.6 | https://neo4j.com/download/ | Windows 7+/CentOS 6
7 | Spark 2.0 | https://spark.apache.org/downloads.html | Windows 7+/CentOS 6
9 | Hadoop 2.5, Mahout 0.12 | http://hadoop.apache.org/releases.html, http://mahout.apache.org/general/downloads.html | Windows 7+/CentOS 6
7, 8, 9 | Java 7/Java 8 | http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html | Windows 7+/CentOS 6
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us, as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/building-recommendation-engines. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/BuildingRecommendationEngines_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Given the complexity and challenges in building recommendation engines, a considerable amount of thought, skill, investment, and technology goes into building recommender systems. Are they worth such an investment? Let us look at some facts:
Recommender systems now affect our lives in many ways. One obvious example is how our online shopping experience has been redefined. As we browse e-commerce sites and purchase products, the underlying recommendation engines respond immediately, in real time, with relevant suggestions. From the perspective of both businesses and consumers, recommendation engines have been immensely beneficial. Without a doubt, big data is the driving force behind recommender systems. A good recommendation engine should be reliable, scalable, and highly available, and should be able to provide personalized recommendations, in real time, to the large user base it serves.
A typical recommendation system cannot do its job efficiently without sufficient data. The introduction of big data technology enabled companies to capture plenty of user data, such as past purchases, browsing history, and feedback information, and feed it to the recommendation engines to generate relevant and effective recommendations in real time. In short, even the most advanced recommender system cannot be effective without the supply of big data. The role of big data and improvements in technology, both on the software and hardware front, goes beyond just supplying massive data. It also provides meaningful, actionable data fast, and provides the necessary setup to quickly process the data in real time.
Source: http://www.kdnuggets.com/2015/10/big-data-recommendation-systems-change-lives.html.
Now that we have defined recommender systems, their objective, usefulness, and the driving force behind recommender systems, in this section, we introduce different types of popular recommender systems in use.
Collaborative filtering recommender systems are the basic form of recommendation engine. In this type of recommendation engine, items are filtered from a large set of alternatives collaboratively, using the preferences of many users.
The basic assumption in a collaborative filtering recommender system is that if two users shared the same interests as each other in the past, they will also have similar tastes in the future. If, for example, user A and user B have similar movie preferences, and user A recently watched Titanic, which user B has not yet seen, then the idea is to recommend this unseen new movie to user B. The movie recommendations on Netflix are one good example of this type of recommender system.
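The assumption above can be illustrated with a minimal user-based sketch in Python. It is not the book's implementation: the ratings, the helper names `cosine_similarity` and `recommend`, and the choice of cosine similarity over co-rated items are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical ratings (user -> {movie: rating on a 1-5 scale}); illustrative data only.
ratings = {
    "A": {"Titanic": 5, "Avatar": 4, "Inception": 5},
    "B": {"Avatar": 4, "Inception": 5},
    "C": {"Titanic": 1, "Avatar": 5, "Inception": 2},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def recommend(target, ratings, top_n=1):
    """Rank items the target has not rated by similarity-weighted neighbour ratings."""
    scores, weights = {}, {}
    for other, their_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], their_ratings)
        if sim <= 0:
            continue
        for item, rating in their_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# User B has not seen Titanic; users with similar tastes contribute more to its predicted rating.
print(recommend("B", ratings))
```

Because user A agrees with user B on every movie they have both rated, A's rating of Titanic carries more weight than C's. With only a few co-rated items this similarity is crude, so real systems usually mean-center ratings and require a minimum overlap.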
There are two types of collaborative filtering recommender systems: user-based collaborative filtering and item-based collaborative filtering.
We will learn in depth about these two forms of recommendations in Chapter 3, Recommendation Engines Explained.
While building collaborative filtering recommender systems, we will learn about the following aspects:
The advantage of collaborative filtering systems is that they are simple to implement and very accurate. However, they have their own limitations, such as the cold start problem: collaborative filtering systems fail to make recommendations for first-time users whose information is not yet available in the system.
In collaborative filtering, we consider only user-item preferences when building the recommender system. Though this approach is accurate, it makes more sense to also consider user properties and item properties while building recommendation engines. Unlike collaborative filtering, content-based recommendation engines use item properties and the user's preferences for those properties.
As the name indicates, a content-based recommender system uses the content information of the items for building the recommendation model. A content-based recommender system typically contains a user-profile-generation step, an item-profile-generation step, and a model-building step to generate recommendations for an active user. The content-based recommender system recommends items to users by taking into account the content or features of items and user profiles. As an example, if you have searched for videos of Lionel Messi on YouTube, then the content-based recommender system will learn your preference and recommend other videos related to Lionel Messi and other videos related to football.
In simpler terms, the system recommends items similar to those that the user has liked in the past. The similarity of items is calculated based on the features associated with the other compared items and is matched with the user's historical preferences.
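As a rough sketch of this idea, the following Python example builds a user profile from the features of liked items and ranks the remaining items by cosine similarity to that profile. The video titles, feature names, and function names are hypothetical, and binary features with cosine similarity are only one possible choice.

```python
from math import sqrt

# Hypothetical item profiles: each video is described by binary content features.
items = {
    "Messi top goals": {"messi": 1, "football": 1, "highlights": 1},
    "Messi interview": {"messi": 1, "football": 1, "interview": 1},
    "Champions League final": {"football": 1, "highlights": 1},
    "Pasta recipe": {"cooking": 1, "recipe": 1},
}


def build_user_profile(liked, items):
    """User-profile step: sum the feature vectors of the items the user liked."""
    profile = {}
    for name in liked:
        for feature, value in items[name].items():
            profile[feature] = profile.get(feature, 0) + value
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(value * b.get(feature, 0) for feature, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(liked, items, top_n=2):
    """Score every item the user has not liked yet against the user profile."""
    profile = build_user_profile(liked, items)
    candidates = [
        (name, cosine(profile, features))
        for name, features in items.items()
        if name not in liked
    ]
    return sorted(candidates, key=lambda kv: kv[1], reverse=True)[:top_n]


print(recommend(["Messi top goals"], items))
```

Items that share no features with the profile, such as the cooking video here, score zero. No other user's preferences are needed to produce the ranking.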
While building a content-based recommendation system, we take into consideration the following questions:
The preceding considerations will be explained in Chapter 3, Recommendation Engines Explained. This technique doesn't take into consideration the user's neighborhood preferences. Hence, it doesn't require a large user group's preference for items for better recommendation accuracy. It only considers the user's past preferences and the properties/features of the items. In Chapter 3, Recommendation Engines Explained, we will learn about this system in detail, including its pros and cons.
Hybrid recommendation engines are built by combining various recommender systems. By combining them, we can offset the disadvantages of one system with the advantages of another and thus build a more robust system. For example, by combining collaborative filtering methods, where the model fails when new items don't have ratings, with content-based systems, where feature information about the items is available, new items can be recommended more accurately and efficiently.
For example, if you are a frequent reader of news on Google News, the underlying recommendation engine recommends news articles to you by combining popular news articles read by people similar to you and using your personal preferences, calculated using your previous click information. With this type of recommendation system, collaborative filtering recommendations are combined with content-based recommendations before pushing recommendations.
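One simple way to combine the two approaches is a weighted blend that falls back to the content-based score when an item has too few ratings for collaborative filtering to be reliable. The Python sketch below assumes both scores are already normalized to the range 0 to 1. The function name `hybrid_scores`, the weight, and the rating threshold are illustrative assumptions, not values from the book.

```python
def hybrid_scores(cf_scores, content_scores, rating_counts, min_ratings=5, cf_weight=0.7):
    """Blend collaborative and content-based scores for each candidate item.

    Items with fewer than `min_ratings` ratings (or with no collaborative score
    at all) use the content-based score only, which covers newly added items.
    """
    blended = {}
    for item in set(cf_scores) | set(content_scores):
        content = content_scores.get(item, 0.0)
        too_few_ratings = rating_counts.get(item, 0) < min_ratings
        if too_few_ratings or item not in cf_scores:
            blended[item] = content
        else:
            blended[item] = cf_weight * cf_scores[item] + (1 - cf_weight) * content
    return blended


# Hypothetical scores for three news articles; article_3 is new and has no ratings yet.
cf_scores = {"article_1": 0.9, "article_2": 0.4}
content_scores = {"article_1": 0.5, "article_2": 0.8, "article_3": 0.7}
rating_counts = {"article_1": 120, "article_2": 40}

scores = hybrid_scores(cf_scores, content_scores, rating_counts)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The new article still receives a score and can be ranked alongside established ones, which a pure collaborative filtering model could not do.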
Before building a hybrid model, we should consider the following questions:
The advantage of hybrid recommendation engines is that this approach increases the efficiency of recommendations compared to the individual recommendation techniques. This approach also suggests a good mix of recommendations to the users, both at the personalized level and at the neighborhood level. In Chapter 3, Recommendation Engines Explained, we will learn more about hybrid recommendations.
Personalized recommender systems, such as content-based recommender systems, have a weakness: they fail to take context into account when making recommendations. For example, assume a lady is very fond of ice cream, and that she travels to a cold place. There is a high chance that a personalized recommender system will suggest a popular ice cream brand. Now let us ask ourselves a question: is it right to suggest ice cream to a person in a cold place? It makes more sense to suggest a coffee. A recommender system that is both personalized and context-aware in this way is called a context-aware recommender system. In the preceding example, place is the context.
User preferences may differ with the context, such as time of day, season, mood, place, location, options offered by the system, and so on. A person at a different location at a different time with different people may need different things. A context-aware recommender system takes the context into account before computing or serving recommendations. This recommender system caters for the different needs of people differently in different contexts.
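A very simple way to take context into account is contextual pre-filtering: keep only the feedback recorded in the current context, then rank items as usual. The Python sketch below follows the ice cream example. The interaction log, the context labels, and the function name `recommend_in_context` are hypothetical.

```python
# Hypothetical interaction log: (user, item, context, rating on a 1-5 scale).
interactions = [
    ("anna", "ice cream", "hot", 5),
    ("anna", "ice cream", "cold", 2),
    ("anna", "coffee", "cold", 5),
    ("anna", "coffee", "hot", 3),
    ("anna", "lemonade", "hot", 4),
]


def recommend_in_context(user, context, interactions, top_n=1):
    """Contextual pre-filtering: use only ratings observed in the given context,
    then rank items by the user's mean rating in that context."""
    totals, counts = {}, {}
    for u, item, ctx, rating in interactions:
        if u == user and ctx == context:
            totals[item] = totals.get(item, 0) + rating
            counts[item] = counts.get(item, 0) + 1
    means = {item: totals[item] / counts[item] for item in totals}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


print(recommend_in_context("anna", "cold", interactions))
print(recommend_in_context("anna", "hot", interactions))
```

The same user gets a different top recommendation depending on the context passed in. A fuller system would combine several context dimensions, such as time, place, and company, and would handle contexts with little or no data.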
Before building a context-aware model, we should consider the following questions:
The preceding image shows how different people, at different times and places, and with different company, need different dress recommendations.
With the advancements in technology, research, and infrastructure, recommender systems have been evolving rapidly. Recommender systems are moving away from simple similarity-measure-based approaches, to machine-learning approaches, to very advanced approaches such as deep learning. From a business angle, both customers and organizations want more personalized recommendations, delivered immediately. To build personalized recommenders that cater to a large base of users and products, we need sophisticated systems that can scale easily and respond fast. The following are the types of recommendations that can help solve this challenge.
As stated earlier, big data primarily drives recommender systems. Big-data platforms have enabled researchers to access large datasets and analyze data at the individual level, paving the way for personalized recommender systems. With the increase in Internet usage and a constant supply of data, efficient recommenders not only require huge amounts of data, but also need infrastructure that can scale and has minimal downtime. To realize this, big-data technology such as the Apache Hadoop ecosystem provides the infrastructure and platform to supply large amounts of data. To build recommendation systems on this huge supply of data, Mahout, a machine-learning library built on the Hadoop platform, enables us to build scalable recommender systems. Mahout provides infrastructure to build, evaluate, and tune different types of recommendation-engine algorithms. Since Hadoop is designed for offline batch processing, we can build offline recommender systems that are scalable. In Chapter 9, Building Scalable Recommendation Engines with Mahout, we will see how to build scalable recommendation engines using Mahout.
The following figure displays how a scalable recommender system can be designed using Mahout:
Many e-commerce sites offer a You may also like feature. This deceptively simple phrase encapsulates a new era in customer relationship management delivered in real time. Business organizations have started investing in systems that can generate recommendations personalized to each customer and deliver them in real time. Building such a system not only gives good returns on investment; an efficient system also earns the confidence of its users. A scalable real-time recommender system not only captures users' purchase history, product information, and user preferences, extracts patterns, and recommends products, but also responds instantly based on users' online interactions and multi-criteria search preferences.
The ability to make such compelling suggestions requires a new generation of technology. This technology has to consider large databases of users' previous purchasing history, their preferences, and online interaction information such as in-page navigation data and multi-criteria searches, analyze all this information in real time, and respond accurately according to the current and long-term needs of the users. In this book, we consider in-memory and graph-based systems, which are capable of handling large-scale, real-time recommender systems.
Collaborative filtering, the most popular recommendation engine approach, requires considering the entirety of user and product information while generating recommendations. Assume a scenario where we have 1 million user ratings on 10,000 products. To build a system that handles such heavy computations and responds online, we require a system that is big-data compatible and processes data in memory. The key technology in enabling scalable, real-time recommendations is Apache Spark Streaming, which leverages the scalability of big data, processes data in memory, and generates recommendations in real time.
Graph databases have revolutionized the way people discover new products, information, and so on. In the human mind, we remember people, things, places, and so on, as graphs, relations, and networks. When we try to fetch information from these networks, we directly go to a required connection or graph and fetch information accurately. In a similar fashion, graph databases allow us to store user and product information in graphs as nodes and edges (relations). Searching in a graph database is fast. In recent times, recommender systems powered by graph databases have allowed organizations to build suggestions which are personalized and accurate in real time.
One of the key technologies enabling real-time recommendations using graph databases is Neo4j, a kind of NoSQL graph database that can easily outperform any other relational and NoSQL system in providing customer insights and product trends.
A NoSQL database, popularly known as not only SQL
