Machine Learning in Java - AshishSingh Bhatia - E-Book

Description

Leverage the power of Java and its associated machine learning libraries to build powerful predictive models




Key Features



  • Solve predictive modeling problems using the most popular machine learning Java libraries


  • Explore data processing, machine learning, and NLP concepts using JavaML, WEKA, MALLET libraries


  • Practical examples, tips, and tricks to help you understand applied machine learning in Java



Book Description



As the amount of data in the world continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, spam detection, document search, and trading strategies, to speech recognition. This makes machine learning well-suited to the present-day era of big data and data science. The main challenge is how to transform data into actionable knowledge.







Machine Learning in Java will provide you with the techniques and tools you need. You will start by learning how to apply machine learning methods to a variety of common tasks, including classification, prediction, forecasting, market basket analysis, and clustering. The code in this book works with JDK 8 and above, and has been tested on JDK 11.







Moving on, you will discover how to detect anomalies and fraud, and ways to perform activity recognition, image recognition, and text analysis. By the end of the book, you will have explored related web resources and technologies that will help you take your learning to the next level.







By applying the most effective machine learning methods to real-world problems, you will gain hands-on experience that will transform the way you think about data.




What you will learn



  • Discover key Java machine learning libraries


  • Implement concepts such as classification, regression, and clustering


  • Develop a customer retention strategy by predicting likely churn candidates


  • Build a scalable recommendation engine with Apache Mahout


  • Apply machine learning to fraud, anomaly, and outlier detection


  • Experiment with deep learning concepts and algorithms


  • Write your own activity recognition model for eHealth applications



Who this book is for



If you want to learn how to use Java's machine learning libraries to gain insight from your data, this book is for you. It will get you up and running quickly and provide you with the skills you need to successfully create, customize, and deploy machine learning applications with ease. You should be familiar with Java programming and some basic data mining concepts to make the most of this book, but no prior experience with machine learning is required.

You can read this e-book in Legimi apps or in any app that supports the following format:

EPUB

Page count: 296

Year of publication: 2018




Machine Learning in Java, Second Edition

 

 

 

 

 

 

Helpful techniques to design, build, and deploy powerful machine learning applications in Java

 

 

 

 

 

 

AshishSingh Bhatia
Bostjan Kaluza

 

 

 

 

 

 

 

 

BIRMINGHAM - MUMBAI

Machine Learning in Java Second Edition

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Divya Poojari
Content Development Editor: Athikho Sapuni Rishana
Technical Editor: Joseph Sunil
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Tom Scaria

First published: April 2016
Second edition: November 2018

Production reference: 1231118

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78847-439-9

www.packtpub.com

Contributors

About the authors

AshishSingh Bhatia is a reader and learner at his core. He has more than 11 years of rich experience in different IT sectors, encompassing training, development, and management. He has worked in many domains, such as software development, ERP, banking, and training. He is passionate about Python and Java, and has recently been exploring R. He is mostly involved in web and mobile development in various capacities. He likes to explore new technologies and share his views and thoughts through various online media and magazines. He believes in sharing his experience with the new generation and also takes part in training and teaching.

 

 

 

 

Bostjan Kaluza is a researcher in artificial intelligence and machine learning with extensive experience in Java and Python. Bostjan is the chief data scientist at Evolven, a leading IT operations analytics company. He works with machine learning, predictive analytics, pattern mining, and anomaly detection to turn data into relevant information. Prior to Evolven, Bostjan served as a senior researcher in the department of intelligent systems at the Jozef Stefan Institute and led research projects involving pattern and anomaly detection, ubiquitous computing, and multi-agent systems. In 2013, Bostjan published his first book, Instant Weka How-To, published by Packt Publishing, exploring how to leverage machine learning using Weka.

About the reviewer

Yogendra Sharma is a developer with experience in architecture, design, and the development of scalable and distributed applications, with a core interest in microservices and Spring. He is currently working as an IoT and cloud architect at Intelizign Engineering Services, Pune. He also has hands-on experience with technologies such as AWS Cloud, IoT, Python, J2SE, J2EE, Node.js, Angular, MongoDB, and Docker. He is constantly exploring technical novelties, and is open-minded and eager to learn more about new technologies and frameworks.

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Table of Contents

Title Page

Copyright and Credits

Machine Learning in Java Second Edition

Contributors

About the authors

About the reviewer

Packt is searching for authors like you

About Packt

Why subscribe?

Packt.com

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Applied Machine Learning Quick Start

Machine learning and data science

Solving problems with machine learning

Applied machine learning workflow

Data and problem definition

Measurement scales

Data collection

Finding or observing data

Generating data

Sampling traps

Data preprocessing

Data cleaning

Filling missing values

Remove outliers

Data transformation

Data reduction

Unsupervised learning

Finding similar items

Euclidean distances

Non-Euclidean distances

The curse of dimensionality

Clustering

Supervised learning

Classification

Decision tree learning

Probabilistic classifiers

Kernel methods

Artificial neural networks

Ensemble learning

Evaluating classification

Precision and recall

ROC curves

Regression

Linear regression

Logistic regression

Evaluating regression

Mean squared error

Mean absolute error

Correlation coefficient

Generalization and evaluation

Underfitting and overfitting

Train and test sets

Cross-validation

Leave-one-out validation

Stratification

Summary

Java Libraries and Platforms for Machine Learning

The need for Java

Machine learning libraries

Weka

Java machine learning

Apache Mahout

Apache Spark

Deeplearning4j

MALLET

The Encog Machine Learning Framework

ELKI

MOA

Comparing libraries

Building a machine learning application

Traditional machine learning architecture

Dealing with big data

Big data application architecture

Summary

Basic Algorithms – Classification, Regression, and Clustering

Before you start

Classification

Data

Loading data

Feature selection

Learning algorithms

Classifying new data

Evaluation and prediction error metrics

The confusion matrix

Choosing a classification algorithm

Classification using Encog

Classification using massive online analysis

Evaluation

Baseline classifiers

Decision tree

Lazy learning

Active learning

Regression

Loading the data

Analyzing attributes

Building and evaluating the regression model

Linear regression

Linear regression using Encog

Regression using MOA

Regression trees

Tips to avoid common regression problems

Clustering

Clustering algorithms

Evaluation

Clustering using Encog

Clustering using ELKI

Summary

Customer Relationship Prediction with Ensembles

The customer relationship database

Challenge

Dataset

Evaluation

Basic Naive Bayes classifier baseline

Getting the data

Loading the data

Basic modeling

Evaluating models

Implementing the Naive Bayes baseline

Advanced modeling with ensembles

Before we start

Data preprocessing

Attribute selection

Model selection

Performance evaluation

Ensemble methods – MOA

Summary

Affinity Analysis

Market basket analysis

Affinity analysis

Association rule learning

Basic concepts

Database of transactions

Itemset and rule

Support

Lift

Confidence

Apriori algorithm

FP-Growth algorithm

The supermarket dataset

Discover patterns

Apriori

FP-Growth

Other applications in various areas

Medical diagnosis

Protein sequences

Census data

Customer relationship management

IT operations analytics

Summary

Recommendation Engines with Apache Mahout

Basic concepts

Key concepts

User-based and item-based analysis

Calculating similarity

Collaborative filtering

Content-based filtering

Hybrid approach

Exploitation versus exploration

Getting Apache Mahout

Configuring Mahout in Eclipse with the Maven plugin

Building a recommendation engine

Book ratings dataset

Loading the data

Loading data from a file

Loading data from a database

In-memory databases

Collaborative filtering

User-based filtering

Item-based filtering

Adding custom rules to recommendations

Evaluation

Online learning engine

Content-based filtering

Summary

Fraud and Anomaly Detection

Suspicious and anomalous behavior detection

Unknown unknowns

Suspicious pattern detection

Anomalous pattern detection

Analysis types

Pattern analysis

Transaction analysis

Plan recognition

Outlier detection using ELKI

An example using ELKI

Fraud detection in insurance claims

Dataset

Modeling suspicious patterns

The vanilla approach

Dataset rebalancing

Anomaly detection in website traffic

Dataset

Anomaly detection in time series data

Using Encog for time series

Histogram-based anomaly detection

Loading the data

Creating histograms

Density-based k-nearest neighbors

Summary

Image Recognition with Deeplearning4j

Introducing image recognition

Neural networks

Perceptron

Feedforward neural networks

Autoencoder

Restricted Boltzmann machine

Deep convolutional networks

Image classification

Deeplearning4j

Getting DL4J

MNIST dataset

Loading the data

Building models

Building a single-layer regression model

Building a deep belief network

Building a multilayer convolutional network

Summary

Activity Recognition with Mobile Phone Sensors

Introducing activity recognition

Mobile phone sensors

Activity recognition pipeline

The plan

Collecting data from a mobile phone

Installing Android Studio

Loading the data collector

Feature extraction

Collecting training data

Building a classifier

Reducing spurious transitions

Plugging the classifier into a mobile app

Summary

Text Mining with Mallet – Topic Modeling and Spam Detection

Introducing text mining

Topic modeling

Text classification

Installing Mallet

Working with text data

Importing data

Importing from directory

Importing from file

Pre-processing text data

Topic modeling for BBC News

BBC dataset

Modeling

Evaluating a model

Reusing a model

Saving a model

Restoring a model

Detecting email spam 

Email spam dataset

Feature generation

Training and testing

Model performance

Summary

What Is Next?

Machine learning in real life

Noisy data

Class unbalance

Feature selection

Model chaining

The importance of evaluation

Getting models into production

Model maintenance

Standards and markup languages

CRISP-DM

SEMMA methodology

Predictive model markup language

Machine learning in the cloud

Machine learning as a service

Web resources and competitions

Datasets

Online courses

Competitions

Websites and blogs

Venues and conferences

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Machine Learning in Java, Second Edition, will provide you with the techniques and tools you need to quickly gain insights from complex data. You will start by learning how to apply machine learning methods to a variety of common tasks, including classification, prediction, forecasting, market basket analysis, and clustering.

This is a practical tutorial that uses hands-on examples to step through some real-world applications of machine learning. Without shying away from the technical details, you will explore machine learning with Java libraries using clear and practical examples. You will explore how to prepare data for analysis, choose a machine learning method, and measure the success of the process.

Who this book is for

If you want to learn how to use Java's machine learning libraries to gain insights from your data, this book is for you. It will get you up and running quickly and provide you with the skills you need to successfully create, customize, and deploy machine learning applications with ease. You should be familiar with Java programming and some basic data mining concepts in order to make the most of this book, but no prior experience with machine learning is required.

What this book covers

Chapter 1, Applied Machine Learning Quick Start, introduces the basics of machine learning. It lays out the applied machine learning workflow, from data and problem definition, through data collection and preprocessing, to supervised and unsupervised learning, generalization, and evaluation.

Chapter 2, Java Libraries and Platforms for Machine Learning, reviews the major Java machine learning libraries and platforms, including Weka, Java-ML, Apache Mahout, Apache Spark, Deeplearning4j, MALLET, Encog, ELKI, and MOA, compares them, and discusses architectures for building machine learning applications, including big data applications.

Chapter 3, Basic Algorithms – Classification, Regression, and Clustering, walks through the fundamental algorithms. We will load data, select features, and build and evaluate classification, regression, and clustering models using libraries such as Weka, Encog, MOA, and ELKI.

Chapter 4, Customer Relationship Prediction with Ensembles, tackles a real-world customer relationship dataset. Starting from a Naive Bayes baseline, we will apply data preprocessing, attribute selection, and ensemble methods to predict customer behavior, such as likely churn.

Chapter 5, Affinity Analysis, covers market basket analysis and association rule learning. Concepts such as support, confidence, and lift are introduced, and patterns are discovered with the Apriori and FP-Growth algorithms.

Chapter 6, Recommendation Engines with Apache Mahout, shows how to configure Apache Mahout and build a recommendation engine on a book ratings dataset using user-based and item-based collaborative filtering, as well as content-based filtering.

Chapter 7, Fraud and Anomaly Detection, covers the detection of suspicious and anomalous patterns. We will perform outlier detection with ELKI, detect fraud in insurance claims, and find anomalies in website traffic and time series data.

Chapter 8, Image Recognition with Deeplearning4j, introduces neural networks, from the perceptron to deep convolutional networks, and uses Deeplearning4j to build image classification models on the MNIST dataset.

Chapter 9, Activity Recognition with Mobile Phone Sensors, demonstrates how to collect sensor data from a mobile phone, extract features, build an activity recognition classifier, and plug it into a mobile app.

Chapter 10, Text Mining with Mallet – Topic Modeling and Spam Detection, covers text mining with MALLET: importing and preprocessing text data, topic modeling on BBC News articles, and building an email spam detector.

Chapter 11, What Is Next?, brings together the topics from the previous chapters and discusses machine learning in real life: getting models into production, model maintenance, standards such as CRISP-DM and PMML, machine learning in the cloud, and useful web resources and competitions.

To get the most out of this book

This book assumes that the user has a working knowledge of the Java language and a basic idea about machine learning. This book heavily uses external libraries that are available in JAR format. It is assumed that the user is aware of using JAR files in Terminal or Command Prompt, although the book does also explain how to do this. The user may easily use this book with any generic Windows or Linux system.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads and Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-in-Java-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781788474399_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Applied Machine Learning Quick Start

This chapter introduces the basics of machine learning, laying down common themes and concepts and making it easy to follow the logic and familiarize yourself with the topic. The goal is to quickly learn the step-by-step process of applied machine learning and grasp the main machine learning principles. In this chapter, we will cover the following topics:

Machine learning and data science

Data and problem definition

Data collection

Data preprocessing

Unsupervised learning

Supervised learning

Generalization and evaluation

If you are already familiar with machine learning and are eager to start coding, then quickly jump to the chapters that follow this one. However, if you need to refresh your memory or clarify some concepts, then we strongly recommend revisiting the topics presented in this chapter.

Machine learning and data science

Nowadays, everyone talks about machine learning and data science. So, what exactly is machine learning, anyway? How does it relate to data science? These two terms are commonly confused, as they often employ the same methods and overlap significantly. Therefore, let's first clarify what they are. Josh Wills tweeted this:

"A data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician."
– Josh Wills

More specifically, data science encompasses the entire process of obtaining knowledge by integrating methods from statistics, computer science, and other fields to gain insight from data. In practice, data science encompasses an iterative process of data harvesting, cleaning, analysis, visualization, and deployment.

Machine learning, on the other hand, is mainly concerned with the generic algorithms and techniques that are used in the analysis and modelling phases of the data science process.

Solving problems with machine learning

Among the different machine learning approaches, there are three main ways of learning, as shown in the following list:

Supervised learning

Unsupervised learning

Reinforcement learning

Given a set of example inputs, X, and their outcomes, Y, supervised learning aims to learn a general mapping function, f, that transforms inputs into outputs, as f: X → Y.

An example of supervised learning is credit card fraud detection, where the learning algorithm is presented with credit card transactions (matrix X) marked as normal or suspicious (vector Y). The learning algorithm produces a decision model that marks unseen transactions as normal or suspicious (this is the f function).
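As a toy sketch of this mapping, the following learns a single amount threshold from labeled transactions and then uses it as the function f to label unseen transactions. The amounts and the single-feature rule are invented for illustration; the book's chapters use full-featured libraries such as Weka for real models.

```java
import java.util.Arrays;

// Toy supervised learning: learn an amount threshold separating
// "normal" from "suspicious" transactions (invented data, one feature).
public class ThresholdClassifier {

    // Learn the midpoint between the largest normal amount and the
    // smallest suspicious amount (assumes the classes are separable).
    public static double fit(double[] normal, double[] suspicious) {
        double maxNormal = Arrays.stream(normal).max().getAsDouble();
        double minSuspicious = Arrays.stream(suspicious).min().getAsDouble();
        return (maxNormal + minSuspicious) / 2.0;
    }

    // The learned mapping f: X -> Y, where Y is {false=normal, true=suspicious}.
    public static boolean predict(double threshold, double amount) {
        return amount > threshold;
    }

    public static void main(String[] args) {
        double[] normal = {12.50, 30.00, 45.99, 80.00};   // labeled Y = normal
        double[] suspicious = {950.00, 1200.00, 5000.00}; // labeled Y = suspicious
        double t = fit(normal, suspicious);               // 515.0
        System.out.println(predict(t, 25.00));   // false: normal
        System.out.println(predict(t, 2000.00)); // true: suspicious
    }
}
```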

In contrast, unsupervised learning algorithms do not assume given outcome labels, as they focus on learning the structure of the data, such as grouping similar inputs into clusters. Unsupervised learning can, therefore, discover hidden patterns in the data. An example of unsupervised learning is an item-based recommendation system, where the learning algorithm discovers similar items bought together; for example, people who bought book A also bought book B.
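The "people who bought book A also bought book B" pattern can be sketched by counting how often pairs of items appear together across transactions; the baskets below are invented for illustration:

```java
import java.util.*;

// Toy item-based co-occurrence: count pairs of items bought together.
public class CoOccurrence {

    public static Map<String, Integer> pairCounts(List<Set<String>> baskets) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> basket : baskets) {
            List<String> items = new ArrayList<>(basket);
            Collections.sort(items); // canonical order so "A+B" == "B+A"
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = Arrays.asList(
            new HashSet<>(Arrays.asList("bookA", "bookB")),
            new HashSet<>(Arrays.asList("bookA", "bookB", "bookC")),
            new HashSet<>(Arrays.asList("bookA", "bookC")));
        // bookA and bookB were bought together in two baskets
        System.out.println(pairCounts(baskets));
    }
}
```

A real recommender (covered with Apache Mahout in Chapter 6) replaces raw counts with similarity measures, but the underlying idea of exploiting co-purchases is the same.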

Reinforcement learning addresses the learning process from a completely different angle. It assumes that an agent, which can be a robot, bot, or computer program, interacts with a dynamic environment to achieve a specific goal. The environment is described with a set of states and the agent can take different actions to move from one state to another. Some states are marked as goal states, and if the agent achieves this state, it receives a large reward. In other states, the reward is smaller, non-existent, or even negative. The goal of reinforcement learning is to find an optimal policy or a mapping function that specifies the action to take in each of the states, without a teacher explicitly telling whether this leads to the goal state or not. An example of reinforcement learning would be a program for driving a vehicle, where the states correspond to the driving conditions, for example, current speed, road segment information, surrounding traffic, speed limits, and obstacles on the road; and the actions could be driving maneuvers, such as turn left or right, stop, accelerate, and continue. The learning algorithm produces a policy that specifies the action that is to be taken in specific configurations of driving conditions.

In this book, we will focus on supervised and unsupervised learning only, as they share many concepts. If reinforcement learning sparked your interest, a good book to start with is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew Barto, MIT Press (2018).

Applied machine learning workflow

This book's emphasis is on applied machine learning. We want to provide you with the practical skills needed to get learning algorithms to work in different settings. Instead of math and theory in machine learning, we will spend more time on the practical, hands-on skills (and dirty tricks) to get this stuff to work well on an application. We will focus on supervised and unsupervised machine learning and learn the essential steps in data science to build the applied machine learning workflow.

A typical workflow in applied machine learning applications consists of answering a series of questions that can be summarized in the following steps:

Data and problem definition: The first step is to ask interesting questions, such as:

What is the problem you are trying to solve?

Why is it important?

Which format of result answers your question?

Is this a simple yes/no answer?

Do you need to pick one of the available questions?

Data collection: Once you have a problem to tackle, you will need the data. Ask yourself what kind of data will help you answer the question:

Can you get the data from the available sources?

Will you have to combine multiple sources?

Do you have to generate the data?

Are there any sampling biases?

How much data will be required?

Data preprocessing: The first data preprocessing task is data cleaning. Examples include filling in missing values, smoothing noisy data, removing outliers, and resolving inconsistencies. This is usually followed by integration of multiple data sources and data transformation to a specific range (normalization), to value bins (discretized intervals), and to a reduced number of dimensions.
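Two of these cleaning steps can be sketched in a few lines: fill missing values with the column mean, then rescale to the [0, 1] range (min-max normalization). Encoding missing values as Double.NaN is an assumption made for this illustration:

```java
import java.util.Arrays;

// Toy preprocessing: mean imputation followed by min-max normalization.
public class Preprocess {

    // Replace each NaN ("missing") entry with the mean of the present values.
    public static double[] fillMissingWithMean(double[] xs) {
        double sum = 0; int n = 0;
        for (double x : xs) if (!Double.isNaN(x)) { sum += x; n++; }
        double mean = sum / n;
        double[] out = xs.clone();
        for (int i = 0; i < out.length; i++) if (Double.isNaN(out[i])) out[i] = mean;
        return out;
    }

    // Rescale values linearly so the minimum maps to 0 and the maximum to 1.
    public static double[] minMaxNormalize(double[] xs) {
        double min = Arrays.stream(xs).min().getAsDouble();
        double max = Arrays.stream(xs).max().getAsDouble();
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) out[i] = (xs[i] - min) / (max - min);
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {10, Double.NaN, 30, 20};
        double[] filled = fillMissingWithMean(raw);  // NaN -> 20.0
        double[] scaled = minMaxNormalize(filled);   // [0.0, 0.5, 1.0, 0.5]
        System.out.println(Arrays.toString(scaled));
    }
}
```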

Data analysis and modelling: Data analysis and modelling includes unsupervised and supervised machine learning, statistical inference, and prediction. A wide variety of machine learning algorithms are available, including k-nearest neighbors, the Naive Bayes classifier, decision trees, Support Vector Machines (SVMs), logistic regression, k-means, and so on. The method to be deployed depends on the problem definition, as discussed in the first step, and the type of collected data. The final product of this step is a model inferred from the data.
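As a minimal illustration of one of the algorithms listed above, here is a one-nearest-neighbor classifier over two-dimensional points; the points and labels are invented, and library implementations (for example, in Weka) add distance weighting, efficient search, and more neighbors:

```java
// Toy 1-nearest-neighbor classifier: a query point gets the label of
// its closest training point under Euclidean distance.
public class NearestNeighbor {

    public static int predict(double[][] train, int[] labels, double[] query) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < train.length; i++) {
            double dx = train[i][0] - query[0];
            double dy = train[i][1] - query[1];
            double dist = dx * dx + dy * dy; // squared distance; same ordering
            if (dist < bestDist) { bestDist = dist; best = labels[i]; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        int[] labels = {0, 0, 1, 1};
        System.out.println(predict(train, labels, new double[]{1, 0})); // 0
        System.out.println(predict(train, labels, new double[]{5, 6})); // 1
    }
}
```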

Evaluation: The last step is devoted to model assessment. The main issue that models built with machine learning face is how well they model the underlying data; for example, if a model is too specific, or overfits the data used for training, it is quite possible that it will not perform well on new data. A model can also be too generic, meaning that it underfits the training data. For example, when asked how the weather is in California, it always answers sunny, which is indeed correct most of the time; however, such a model is not really useful for making valid predictions. The goal of this step is to correctly evaluate the model and to make sure that it will work on new data as well. Evaluation methods include separate test and train sets, cross-validation, and leave-one-out cross-validation.
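The held-out test set idea can be sketched as follows: apply a model (here, a fixed, invented threshold rule standing in for whatever was learned on the training set) to unseen examples and compute its accuracy against the true labels:

```java
// Toy evaluation on a held-out test set: accuracy is the fraction of
// test examples on which the model's prediction matches the true label.
public class Evaluate {

    public static double accuracy(boolean[] predicted, boolean[] actual) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++)
            if (predicted[i] == actual[i]) correct++;
        return (double) correct / predicted.length;
    }

    public static void main(String[] args) {
        double threshold = 100.0; // "model" learned on a separate training set
        double[] testAmounts = {20, 250, 90, 400};
        boolean[] actual = {false, true, false, false}; // last example will be misclassified
        boolean[] predicted = new boolean[testAmounts.length];
        for (int i = 0; i < testAmounts.length; i++)
            predicted[i] = testAmounts[i] > threshold;
        System.out.println("test accuracy = " + accuracy(predicted, actual)); // 0.75
    }
}
```

Cross-validation repeats this procedure over several train/test splits and averages the resulting accuracies, giving a more stable estimate than a single split.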

We will take a closer look at each of the steps in the following sections. We will try to understand the type of questions we must answer during the applied machine learning workflow, and look at the accompanying concepts of data analysis and evaluation.

Data and problem definition

When presented with a problem definition, we need to ask questions that will help in understanding the objective and target information from the data. We could ask very common questions, such as: what is the expected finding once the data is explored? What kind of information can be extracted after data exploration? Or, what kind of format is required so the question can be answered? Asking the right question will give a clearer understanding of how to proceed further. Data is simply a collection of measurements in the form of numbers, words, observations, descriptions of things, images, and more.

Data collection

Once questions are asked in the right direction, the target of data exploration is clear. The next step is to see where the data comes from. The data collected can be highly unorganized and in diverse formats, which may involve reading from a database, the internet, a filesystem, or other documents. Most machine learning tools require data to be presented in a specific format in order to generate the proper result. We have two choices: observe the data from existing sources, or generate the data via surveys, simulations, and experiments. Let's take a closer look at both approaches.

Finding or observing data

Data can be found or observed in many places. An obvious data source is the internet. With an increase in social media usage, and with mobile phones penetrating deeper as mobile data plans become cheaper or even offer unlimited data, there has been an exponential rise in data consumed by users.

Now, online streaming platforms have emerged; the following diagram shows that the hours spent consuming video data are also growing rapidly:

To get data from the internet, there are multiple options, as shown in the following list:

Bulk downloads from websites such as Wikipedia, IMDb, and the Million Song Dataset (which can be found here: https://labrosa.ee.columbia.edu/millionsong/).

Accessing the data through APIs (such as Google, Twitter, Facebook, and YouTube).

It is okay to scrape public, non-sensitive, and anonymized data. Be sure to check the terms and conditions and to fully reference the information.

The main drawbacks of collected data are that it takes time and space to accumulate, and that it covers only what happened; for instance, intentions and internal and external motivations are not captured. Finally, such data might be noisy, incomplete, inconsistent, and may even change over time.

Another option is to collect measurements from sensors such as inertial and location sensors in mobile devices, environmental sensors, and software agents monitoring key performance indicators.

Generating data

An alternative approach is to generate the data yourself, for example, with a survey. In survey design, we have to pay attention to data sampling; that is, who the respondents answering the survey are. We only get data from the respondents who are accessible and willing to respond. Also, respondents can provide answers that are in line with their self-image and the researcher's expectations.

Alternatively, the data can be collected with simulations, where a domain expert specifies the behavior model of users at a micro level. For instance, crowd simulation requires specifying how different types of users will behave in a crowd. Some of the examples could be following the crowd, looking for an escape, and so on. The simulation can then be run under different conditions to see what happens (Tsai et al., 2011). Simulations are appropriate for studying macro phenomena and emergent behavior; however, they are typically hard to validate empirically.

Furthermore, you can design experiments to thoroughly cover all of the possible outcomes, where you keep all of the variables constant and only manipulate one variable at a time. This is the most costly approach, but usually provides the best quality.

Sampling traps

Data collection may involve many traps. To demonstrate one, let me share a story. There is supposed to be a global, unwritten rule for sending regular mail between students for free: if you write student to student in the place where the stamp should be, the mail is delivered to the recipient for free. Now, suppose Jacob sends a set of postcards to Emma, and given that Emma indeed receives some of the postcards, she concludes that all of the postcards were delivered and that the rule indeed holds true. Emma reasons that, as she received the postcards, all of the postcards were delivered. However, she does not know about the postcards that Jacob sent but that were never delivered, so she is unable to account for them in her inference. What Emma experienced is survivorship bias; that is, she drew a conclusion based only on the data that survived. For your information, postcards sent with a student to student stamp get a circled black letter T stamped on them, which means that postage is due and the receiver should pay it, including a small fine. However, applying such fees often costs mail services more than the fees themselves, hence they often do not collect them (Magalhães, 2010).