Machine Learning: Make Your Own Recommender System - Oliver Theobald - E-Book

Description

With an introductory overview, the course prepares you for a deep dive into the practical application of Scikit-Learn and the datasets that bring theories to life. From the basics of machine learning to the intricate details of setting up a sandbox environment, this course covers the essential groundwork for any aspiring data scientist.
The course focuses on developing your skills in working with data, implementing data reduction techniques, and understanding the intricacies of item-based and user-based collaborative filtering, along with content-based filtering. These core methodologies are crucial for creating accurate and efficient recommender systems that cater to the unique preferences of users. Practical examples and evaluations further solidify your learning, making complex concepts accessible and manageable.
The course wraps up by addressing the critical topics of privacy, ethics in machine learning, and the exciting future of recommender systems. This holistic approach ensures that you not only gain technical proficiency but also consider the broader implications of your work in this field. With a final look at further resources, your journey into machine learning and recommender systems is just beginning, armed with the knowledge and tools to explore new horizons.


Publication year: 2024


First Edition

Copyright © 2018 by Oliver Theobald

Published by Scatterplot Press

All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other non-commercial uses permitted by copyright law.

Please contact the author at [email protected] for feedback, media contact, omissions or errors regarding this book.

TABLE OF CONTENTS

 

FOREWORD

DATASETS USED IN THIS BOOK

INTRODUCING SCIKIT-LEARN

INTRODUCTION

THE ANATOMY

SETTING UP A SANDBOX ENVIRONMENT

WORKING WITH DATA

DATA REDUCTION

ITEM-BASED COLLABORATIVE FILTERING

USER-BASED COLLABORATIVE FILTERING

CONTENT-BASED FILTERING

EVALUATION

PRIVACY & ETHICS

THE FUTURE OF RECOMMENDER SYSTEMS

FURTHER RESOURCES

 

FOREWORD

Recommender systems dictate the stream of content displayed to us each day and their impact on online behavior is second to none. From relevant friend suggestions on Facebook to product recommendations on Amazon, there’s no missing their presence and online sway. Whether you agree or disagree with this method of marketing, there’s no arguing its effectiveness. If mass adoption doesn’t convince you, take a look at what you’ve recently viewed and bought online. There’s a strong chance that at least some of your online activities, including finding this book, originated from algorithm-backed recommendations.

These data-driven systems are eroding the dominance of traditional search while aiding the discoverability of items that might not otherwise have been found. With recommender systems emerging as a breakaway branch of machine learning, it’s more important than ever to understand how these models work and how to code your own basic recommender system.

This book is designed for beginners with partial background knowledge of data science and machine learning, including statistics and computer programming using Python. If this is your first foray into data science, you may want to spend a few hours reading my first book, Machine Learning for Absolute Beginners, before you get started here.

DATASETS USED IN THIS BOOK

Goodbooks-10k Datasets (Chapter 6)

These two datasets contain information about books and user ratings collected from www.goodreads.com. The first dataset contains book ratings from individual users, while the second dataset contains information about individual books such as their average rating, number of five-star ratings, ISBN number, author, etc.

https://www.kaggle.com/sriharshavogeti/collaborative-recommender-system-on-goodreads/data

Advertising Dataset (Chapter 7)

This dataset contains fabricated information about the features of users responding to online advertisements, including their gender, age, location, daily time spent online, and whether they clicked on the advertisement. The dataset was created by Udemy course instructor Jose Portilla of Pierian Data and is used in his course Python for Data Science and Machine Learning Bootcamp.

https://scatterplotpress.com/p/datasets

Melbourne Housing Market (Chapter 8)

This third dataset contains data on house, unit, and townhouse prices in Melbourne, Australia. This dataset comprises data scraped from publicly available listings posted weekly on www.domain.com.au. The full dataset contains 14,242 property listings and 21 variables including address, suburb, land size, number of rooms, price, longitude, latitude, postcode, etc.

https://www.kaggle.com/anthonypino/melbourne-housing-market/

For any issues accessing and downloading these three datasets, please contact the author at [email protected]

INTRODUCING SCIKIT-LEARN

 

Scikit-learn is the core library for general machine learning. It offers an extensive repository of shallow algorithms1 including logistic regression, decision trees, linear regression, gradient boosting, etc., a broad range of evaluation metrics such as mean absolute error, as well as data partition methods including split validation and cross validation.

Scikit-learn is also used to perform a number of important machine learning tasks including training the model and using the trained model to predict the test data.
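As a rough illustration of the tasks described above, the following sketch partitions a dataset using split validation, trains a linear regression model, uses the trained model to predict the test data, and scores the predictions with mean absolute error. It then repeats the evaluation using cross validation. The synthetic dataset and the parameter values (test_size, random_state, cv) are assumptions chosen for this example and are not settings taken from later chapters.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Split validation: hold out 30% of the rows as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train the model on the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Use the trained model to predict the test data, then evaluate the result.
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean absolute error: {mae:.2f}")

# Cross validation: repeat the train/test cycle over five folds.
cv_scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_absolute_error"
)
print(f"Cross-validated MAE: {-cv_scores.mean():.2f}")
```

Scikit-learn reports error-based cross-validation scores as negative numbers so that a higher score is always better, which is why the sign is flipped before printing.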

The following table is a brief overview of common terms and functions used in machine learning from Scikit-learn.

 

Table 1: Overview of key Scikit-learn terms and functions

 

 


INTRODUCTION

 

It wasn’t long ago that surfing the Internet was a standalone task that fell into our daily schedule like reading the newspaper or putting out the trash. For an hour or two, we disconnected the phone line and listened to the screech of the modem link to the world wide web.

Load speed was slow, and there was a drawn-out thought process that preceded each click. Waiting twenty seconds or longer for a page to render placed a heavy time penalty on selecting the wrong link. But as wireless broadband Internet infiltrated more homes, schools, and offices, online behavior changed and our browsing habits started to become more brazen.

Oops! Clicked on the wrong link? No problem. Jab the “Back” button and you’re right back where you started. A few seconds might be lost but as Steve Krug explains in the book Don’t Make Me Think: A Common Sense Approach to Web Usability, “there’s not much of a penalty for guessing wrong.”2 Krug clarifies that users don’t choose the best option but rather the “first reasonable option,” a strategy he calls “satisficing.”

“As soon as we find a link that seems like it might lead to what we’re looking for, there’s a very good chance that we’ll click on it,” explains Krug.3

Design trends also helped to streamline user habits. As Internet users became more familiar with site navigation, web designers caught on that it was better to incorporate existing design norms than to reinvent the wheel. Intuitively, web users knew to home in on the top-right corner for the “Log In” button, the website footer for contact details and more menu options, and whatever button appeared the biggest and brightest as a clue for what to click next.

But with this newfound confidence, we lost some of our behavioral programming from the offline world, such as the ability to browse auxiliary content and digest information. Impatient and impervious to distraction, our attention spans plummeted and “satisficing” took hold. As Mike McGuire, a vice president at the technology research firm Gartner, explains, “If there’s not something else there surfacing that meets your interest beyond what you initially dialed in for, then you’re out.”4

Realizing this problem, Internet companies saw they needed a new way to hook attention and curb our smash-and-grab mentality. They knew it was impossible to design web content catering to every user’s individual needs, and content tailored to a general audience merely made it easy for users to skim past on the way to what they came for. Flashing banners, intrusive pop-up windows, and hierarchical lists of popular or recent articles were tried, but nothing could quite compare with a deliberately more mathematical approach. While it took almost two decades to perfect, this new approach would radically change online browsing habits and return the advantage to the platforms that could master this emerging and powerful technique.

The answer was the recommender system: a system of algorithms that could predict what an individual user liked and mirror related items back to the user in highly visible sections of the website. Author Robert Greene explains the psychological power of mirrors in his book The 48 Laws of Power.

“You look deep into the souls of other people, fathom their innermost desires, their values, their tastes, their spirit and reflect it back to them. Making yourself into a kind of mirror image. Your ability to reflect their psyche back to them gives you great power over them.”5

While the theory was sound, it took time for the algorithms to work. Rudimentary systems emerged in the early 1990s and were refined in the mid-1990s as the web matured into a medium for online commerce. Early exponents of these systems included groups like GroupLens, a research lab at the University of Minnesota that built models to predict a reader’s interest in online news articles.6

Amazon was another front-runner in the trend. Understanding the potency of user data to drive operational decisions, the Seattle-based company used machine-generated recommendations as a tool to push relevant products to customers. Their early recommendations were crude and clumsy, relying on tags to serve items based on related categories and keywords. Then, in a series of tactical moves to improve the way they recommended products to users, the company made a deal with AOL in the early 2000s. The deal allowed Amazon to operate the technology behind AOL’s e-commerce platform and gave it access to an important source of data. While AOL viewed its users’ data in terms of its primary value (recorded sales data), Amazon identified a secondary value that would improve its ability to push personalized product recommendations to users on the Amazon marketplace.

Armed with this new source of data, Amazon’s product recommendations became progressively more sophisticated as different algorithms and filtering techniques attached to the site like a molecule chain. The use of recommender systems contributed to Amazon’s expanding market share and played a critical role in helping niche authors on the platform find new readers.

In 1988, Joe Simpson published a mountain climbing book titled Touching the Void that documented his near-death experience scaling the Andes in Peru. According to Chris Anderson, the author of The Long Tail, Simpson’s book received positive reviews but struggled to maintain attention after its release. A decade later, another mountaineering book, Into Thin Air, written by Jon Krakauer, was released and enjoyed initial success on the Amazon platform. Recognizing a statistically significant combination of customers who purchased both books, Amazon began promoting Touching the Void to customers who bought Into Thin Air and vice versa. This sparked a sales revival of the former that would eventually eclipse the popularity of its more recent contemporary.7

This case study is one of many examples exhibiting the power of algorithms to aid discoverability and support content creators who would otherwise fall from view without a big marketing budget.
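The co-purchase signal behind this case study can be sketched in a few lines of Python. The purchase histories below are invented for illustration, and the lift score used here is a simple measure of association, not a formal test of statistical significance. The source doesn't describe Amazon’s actual method, so treat the function names and the scoring choice as assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (invented for this sketch).
orders = [
    {"Into Thin Air", "Touching the Void"},
    {"Into Thin Air", "Touching the Void", "The Long Tail"},
    {"Into Thin Air", "Touching the Void"},
    {"The Long Tail", "Big Data"},
    {"Big Data"},
    {"Into Thin Air"},
]

# Count how often each title, and each pair of titles, appears in an order.
item_counts = Counter(item for order in orders for item in order)
pair_counts = Counter(
    pair for order in orders for pair in combinations(sorted(order), 2)
)


def lift(a, b):
    """How much more often a and b are bought together than chance predicts."""
    pair = tuple(sorted((a, b)))
    p_both = pair_counts[pair] / len(orders)
    p_a = item_counts[a] / len(orders)
    p_b = item_counts[b] / len(orders)
    return p_both / (p_a * p_b)


def recommend(item):
    """Return the title most strongly associated with `item`, if any."""
    others = [other for other in item_counts if other != item]
    best = max(others, key=lambda other: lift(item, other))
    return best if lift(item, best) > 1 else None


print(recommend("Into Thin Air"))      # Touching the Void
print(recommend("Touching the Void"))  # Into Thin Air
```

A lift above 1 means the two titles appear together more often than their individual popularity would suggest, which is the kind of pattern that lets a lesser-known title ride on the success of a popular one.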

Owing to their effectiveness, Amazon’s recommender algorithms gained greater control over the e-commerce platform. This, though, came at a cost because, like others in the book retail industry, Amazon had relied on human editors to recommend books to customers. Amazon’s editors drew on their expert knowledge of literature sold on the platform and the Amazon customer base to propose recommendations. For a time, it seemed that both the in-house reviewers and the faceless algorithms could work together, not in unison but at least side by side. The fate of Amazon’s in-house editors was later settled after the company ran tests comparing sales data.

“Eventually the editors were presented with the precise percentage of sales Amazon had to forgo when it featured their reviews online,” explain the authors of Big Data: A Revolution That Will Transform How We Live, Work and Think, Viktor Mayer-Schönberger and Kenneth Cukier.8 Today a third of all of Amazon’s sales are thought to emanate from its recommendation engine9 and the original team of in-house book reviewers has long since disbanded. Amazon now dominates the online book business and has forced many traditional giants to the side or expelled them from the publishing industry.

The effectiveness of algorithm-based recommender systems appears to be having a similar effect on online organizations without the same data-driven mindset. In April 2018, the founders of Inbound.org (the “Hacker News” of the content marketing world) sent an email to subscribers explaining their uneasy decision to shut down the site. Inbound.org co-founder Dharmesh Shah cited social recommendation engines as one of the obstacles to the site’s growth.

“…it’s time to say farewell to inbound.org, as we know it. Why? Primarily because though the concept of a community is compelling—the core use case of user-curated marketing content is not. My suspicion is that it’s because of the way people find and share content has changed a great deal since inbound.org’s inception. With the growth of messaging platforms and the sharpening of social recommendation engines, content curation via community submission and voting is useful—but not indispensable.”10

In 2011, the co-author of Mosaic, co-founder of Netscape, and partner of Silicon Valley VC firm Andreessen Horowitz, Marc Andreessen, declared, “software is eating the world.” In 2018, it seems that recommender systems are having a similar impact on the web.

In the next chapter, we’ll move past the macro impact of recommender systems and begin to break down their unique features and ability to predict user preferences.

 


THE ANATOMY

 

Before we dive into exploring specific algorithms, we first need to examine how recommender systems fit into the broader landscape of data science.

Data science, itself, is an interdisciplinary field of methodologies and algorithms to extract knowledge or insight from data. Within the vast space of data science lies the popular field of artificial intelligence (AI), which is the ability of machines to simulate intellectual tasks. A prominent sub-field of artificial intelligence is machine learning, among other sub-fields such as perception, and search and planning. Recommender systems fall under the banner of machine learning and to some extent data mining.

Figure 1: Visual representation of data-related fields and sub-fields

 

Machine learning applies statistical methods to improve performance based on previous experience. While the programmer is responsible for feature selection and setting the model’s hyperparameters (algorithm learning settings), the machine assumes the majority of the work and the important decision-making process. Decisions are formed using advanced pattern recognition, and, typically, through managing far more variables than humans can mentally visualize. This process of combing data for patterns and forming predictions is known as self-learning and represents a major distinction from traditional computer programming where computers are designed to perform set tasks in response to pre-programmed commands. Using machine learning principles, computers don’t strictly need to receive an “input command” to perform a task but rather “input data.”
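The contrast between an “input command” and “input data” can be made concrete with a toy example. In the sketch below, the first function follows a rule typed in by the programmer, while the second approach estimates the same rule from examples using ordinary least squares. The temperature readings and function names are hypothetical and were chosen purely for illustration.

```python
# Traditional programming: the rule is written out by the programmer.
def fahrenheit_rule(celsius):
    return celsius * 9 / 5 + 32


# Machine learning: the rule is estimated from input data.
def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: Celsius readings and their Fahrenheit values.
celsius = [-10, 0, 8, 15, 22, 38]
fahrenheit = [14.0, 32.0, 46.4, 59.0, 71.6, 100.4]

slope, intercept = fit_line(celsius, fahrenheit)


def fahrenheit_learned(value):
    """Apply the rule recovered from the data."""
    return slope * value + intercept


print(round(slope, 2), round(intercept, 2))
print(fahrenheit_rule(30), round(fahrenheit_learned(30), 1))
```

Neither the slope nor the intercept is supplied by the programmer in the second approach. Both are recovered from the observations, and feeding in different data would produce a different rule without any change to the code.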

 

Figure 2: Basic model representation of machine learning

 

Data mining is the process of discovering and unearthing patterns contained in complex datasets. Popular self-learning algorithms such as k-means clustering, decision trees, and regression analysis are applied in both data mining and machine learning. But whereas machine learning focuses on incremental and ongoing problem-solving using models that evolve with experience, data mining concentrates on cleaning up large datasets to create valuable insight at a set point in time.
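To illustrate pattern discovery at a set point in time, the sketch below runs k-means clustering once over a small, made-up snapshot of user activity. The feature names, the values, and the choice of two clusters are all assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical snapshot: [sessions per week, average minutes per session].
activity = np.array([
    [1, 5], [2, 7], [1, 6], [2, 4],         # casual visitors
    [9, 40], [10, 45], [8, 38], [11, 50],   # heavy users
])

# Fit once on the full snapshot; n_clusters=2 is an assumption for this sketch.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(activity)

for row, label in zip(activity, labels):
    print(row, "-> cluster", label)
```

The algorithm receives no labels such as “casual” or “heavy.” It groups the rows purely by their proximity to one another, and the analyst interprets the resulting clusters afterwards.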