Master the art of machine learning with .NET and gain insight into real-world applications
This book is targeted at .NET developers who want to build complex machine learning systems. Some basic understanding of data science is required.
.NET is one of the most widely used platforms for developing applications. With the meteoric rise of machine learning, developers are now keen to find out how they can make their .NET applications smarter. Many .NET developers are also interested in moving into the world of devices and applying machine learning techniques to, well, machines.
This book is packed with real-world examples that make it easy to use machine learning techniques in your business applications. You will begin with an introduction to F# and prepare yourself for machine learning using the .NET Framework. You will write a simple linear regression model using an example that predicts the sales of a product. With the regression model as a base, you will start using the machine learning libraries available for the .NET Framework, such as Math.NET, Numl.NET, and Accord.NET, with the help of a sample application. You will then move on to writing multiple linear regressions and logistic regressions.
You will learn what open data is and the awesomeness of type providers. Next, you will address some of the issues that we have been glossing over so far and take a deep dive into obtaining, cleaning, and organizing our data. You will compare the utility of building a k-NN model against a Naive Bayes model to achieve the best possible results.
The implementation of k-means and PCA using the Accord.NET and Numl.NET libraries is covered with the help of an example application. We will then look at many of the issues that confront real-world machine learning models, such as overfitting, and how to combat them using confusion matrices, scaling, normalization, and feature selection. You will then enter the world of neural networks and move your line of business application to a hybrid scientific application. After you have covered all of these machine learning models, you will see how to deal with very large datasets using MBrace and how to deploy machine learning models to Internet of Things (IoT) devices so that the machine can learn and adapt on the fly.
This book will guide you through everything you need to tackle the flood of data encountered these days in your .NET applications, with the help of popular machine learning libraries available on the .NET platform.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: March 2016
Production reference: 1210316
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-840-3
www.packtpub.com
Author
Jamie Dixon
Reviewers
Reed Copsey, Jr.
César Roberto de Souza
Commissioning Editor
Vedika Naik
Acquisition Editor
Meeta Rajani
Technical Editor
Pankaj Kadam
Copy Editor
Laxmi Subramanian
Proofreader
Safis Editing
Indexer
Rekha Nair
Graphics
Jason Monteiro
Production Coordinator
Aparna Bhagat
Cover Work
Aparna Bhagat
Jamie Dixon has been writing code for as long as he can remember and has been getting paid to do it since 1995. He was using C# and JavaScript almost exclusively until discovering F#, and now combines all three languages for the problem at hand. He has a passion for discovering overlooked gems in datasets and for bringing software engineering techniques to scientific computing. When he codes for fun, he spends his time using Phidgets, Netduinos, and Raspberry Pis, or competing in Kaggle competitions using F# or R.
Jamie holds a Bachelor of Science in computer science and has been an F# MVP since 2014. He is the former chair of his town's Information Services Advisory Board and is an outspoken advocate of open data. He is also involved with his local .NET User Group (TRINUG), with an emphasis on data analytics, machine learning, and the Internet of Things (IoT).
Jamie lives in Cary, North Carolina with his wonderful wife Jill and their three awesome children: Sonoma, Sawyer, and Sloan. He blogs weekly at jamessdixon.wordpress.com and can be found on Twitter at @jamie_dixon.
I had never considered writing a book until Meeta from Packt Publishing sent me an e-mail, asking me if I was interested in writing the book that you are holding. My first reaction was excitement immediately followed by fear. I have heard that writing a book is an arduous and painful undertaking with scant reward—was I really ready to dive into that? Fortunately, writing this book was nothing of the sort—all due to the many wonderful people that helped me along the way.
First and foremost are the technical reviewers, Reed Copsey, Jr. and César Roberto de Souza. Their attention to detail, their spot-on suggestions, and their occasional words of encouragement made all of the difference. Next, the team at Packt (Meeta Rajani, Pankaj Kadam, and Laxmi Subramanian) took my words, code samples, and screenshots and turned them into something, well, beautiful. Mathias Brandewinder, Evelina Gabasova, Melinda Thielbar, James McCaffrey, Phil Trelford, Seth Juarez, and Chris Kalle all helped me at different points with questions about what and how to present the machine learning models and ideas. Dmitry Morozov and Ross McKinlay were indispensable for explaining the finer points of type providers. Isaac Abraham helped me with the section on MBrace, and Tomas Petricek helped me with the section on Deedle. Chris Matthews and Mark Hutchinson reviewed the initial outline and gave me great feedback. Ian Hoppes saved me hours (days?) by sharing his expertise on the finer points of Razor and JavaScript. Finally, Rob Seder, Mike Esposito, and Kevin Allen encouraged and supported me throughout the entire process.
To everyone I mentioned and the people I may have missed, please accept my sincerest thanks.
Finally, my deepest love for the initial proofreader, soul mate, and best wife any person could have: Jill Dixon. I am truly the luckiest man in the world to be with you.
Reed Copsey, Jr. is the executive director of the F# Software Foundation and the CTO and co-owner of C Tech Development Corporation, a software company focused on applications and tooling for the Earth Sciences. After attending the University of Chicago, he went on to consult and work in many industries, including medical imaging, geographical information systems, analysis of retail market data, and more. He has been involved with technical and business support for numerous nonprofit organizations, and most recently enjoys spending his free time involved with the software community.
He is the organizer of the Bellingham Software Developers Network, has been a Microsoft MVP in .NET since 2010, is an avid StackOverflow contributor, and regularly speaks on F# and .NET at various user groups and conferences.
César Roberto de Souza is the author of the Accord.NET Framework and an experienced software developer. During his early university years in Brazil, he decided to create the Accord.NET Framework, a framework for machine learning, image processing, and scientific computing for .NET. Targeted at both professionals and hobbyists, the project has been used by small companies, big corporations, start-ups, and universities, and in an extensive number of scientific publications. After he finished his MSc at the Federal University of São Carlos, the success of the project eventually granted him an opportunity to work and live in Europe, from where he continues its development and interacts with the growing community of users that now helps advance the project even further.
He is a technology enthusiast, with keen interest in machine learning, computer vision, and image processing, and regularly writes articles on those topics for the CodeProject, where he has won its article writing competition multiple times.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
To Sonoma, Sawyer, and Sloan Dixon
The .NET Framework is one of the most successful application frameworks in history. Literally billions of lines of code have been written on the .NET Framework, with billions more to come. For all of its success, it can be argued that the .NET Framework is still underrepresented for data science endeavors. This book attempts to help address this issue by showing how machine learning can be rapidly injected into the common .NET line of business applications. It also shows how typical data science scenarios can be addressed using the .NET Framework. This book quickly builds upon an introduction to machine learning models and techniques in order to build real-world applications using machine learning. While by no means a comprehensive study of predictive analytics, it does address some of the more common issues that data scientists encounter when building their models.
Many books about machine learning are written with every chapter centering on a dataset and how to implement a model on that dataset. While this is a good way to build a mental blueprint (as well as some code boilerplate), this book takes a slightly different approach. It centers on one application for the line of business developer and one common open dataset for the scientific programmer. We will then introduce different machine learning techniques, depending on the business scenario. This means you will be putting on a different hat for each chapter. If you are a line of business software engineer, Chapters 2, 3, 6, and 9 will seem like old hat. If you are a research analyst, Chapters 4, 7, and 10 will be very familiar to you. I encourage you to try all the chapters, regardless of your background, as you will perhaps gain a new perspective that will make you more effective as a data scientist. As a final note, one word you will not find in this book is "simply". It drives me nuts when I read a tutorial-based book and the author says "it is simply this" or "simply do that". If it were simple, I wouldn't need the book. I hope you find each of the chapters accessible and the code samples interesting, and that these two factors help you immediately in your career.
Chapter 1, Welcome to Machine Learning Using the .NET Framework, contextualizes machine learning in the .NET stack, introduces some of the libraries that we will use throughout the book, and provides a brief primer to F#.
Chapter 2, AdventureWorks Regression, introduces the business that we will use in this book: the AdventureWorks Bicycle company. We will then look at a business problem where customers are dropping orders based on reviews of the product. The chapter looks at creating a linear regression by hand, and then with Math.NET and Accord.NET, to solve this business problem. It then adds this regression to the line of business application.
Chapter 3, More AdventureWorks Regression, looks at creating a multiple linear regression and a logistic regression to solve different business problems at AdventureWorks. It will look at different factors that affect bike sales and then categorize potential customers into potential sales or potential lost leads. It will then implement the models to help our website convert potential lost leads into potential sales.
Chapter 4, Traffic Stops – Barking Up the Wrong Tree?, takes a break from AdventureWorks. You will put on your data scientist hat, use an open dataset of traffic stops, and see if we can understand why some people get a verbal warning and why others get a ticket at a traffic stop. We will use basic summary statistics and decision trees to help in understanding the results.
Chapter 5, Time Out – Obtaining Data, takes a break from introducing datasets and machine learning models and concentrates on one of the hardest parts of machine learning: obtaining and cleaning the data. We will look at F# type providers, a very powerful language feature that can vastly speed up this process of "data munging".
Chapter 6, AdventureWorks Redux – k-NN and Naïve Bayes Classifiers, goes back to AdventureWorks and looks at the business problem of how to improve cross-sales. We will implement two popular machine learning classification models, k-NN and Naïve Bayes, and see which is better at solving this problem.
Chapter 7, Traffic Stops and Crash Locations – When Two Datasets Are Better Than One, returns to the traffic stop data and adds two other open datasets that can be used to improve the predictions and gain new insights. The chapter introduces two common unsupervised machine learning techniques: k-means and PCA.
Chapter 8, Feature Selection and Optimization, takes another break from introducing new machine learning models and looks at another key part of building them: selecting the right data for the model, preparing that data, and applying some common techniques to deal with outliers and other data abnormalities.
Chapter 9, AdventureWorks Production – Neural Networks, goes back to AdventureWorks and looks at how to improve bike production by using a popular machine learning technique called neural networks.
Chapter 10, Big Data and IoT, wraps up by looking at a more recent problem—how to build machine learning models on top of data that is characterized by massive volume, variability, and velocity. We will then look at how IoT devices can generate this big data and how to deploy machine learning models onto these devices so that they become self-learning.
You will need Visual Studio 2013 (any edition) or later installed on your computer. You can also use VS Code or MonoDevelop. The examples in this book use Visual Studio 2015 Update 1.
The lines between business computing and scientific computing are becoming increasingly blurred. Indeed, an argument can be made that the distinction was never really as clear as it has been made out to be in the past. With that, machine learning principles and models are making their way into mainstream computing applications. Consider the Uber app that shows how far Uber drivers are from you, and product recommendations built into online retail sites such as Jet.
Also, the nature of the .NET software developer's job is changing. Earlier, when the cliché "ours is a changing industry" was being thrown around, it was about languages (the need to know JavaScript, C#, and T-SQL) and frameworks (Angular, MVC, WPF, and EF). Now, the cliché means that software developers need to know how to make sure their code is correct (test-driven development), how to get their code off of their machine and onto the customer's machine (DevOps), and how to make their applications smarter (machine learning).
Also, the same forces that are pushing the business developer to retool are pushing the research analyst into unfamiliar territory. Earlier, analysts focused on data collection, exploration, and visualization in the context of an application (Excel, PowerBI, and SAS) for point-in-time analysis. The analyst would start with a question, grab some data, build some models, and then present the findings. Any kind of continuous analysis was done via report writing or by just re-running the models. Today, analysts are being asked to sift through massive amounts of data (IoT telemetry, user exhaust, and NoSQL data lakes), where the questions may not be known beforehand. Also, once models are created, they are pushed into production applications, where they are continually retrained in real time. Analysis is no longer just a decision aid for humans; it is now done by computers to impact users immediately.
The newly minted data scientist title sits at the confluence of these forces. Typically, no one person can be an expert on both sides of the divide, so the data scientist is a bit of a "jack of all trades, master of none" who knows machine learning a little better than all of the other software engineers on the team and knows software engineering a little better than any researcher on the team. The goal of this book is to help you move from either software engineer or business analyst to data scientist.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
This is a book on creating and then using machine learning (ML) programs with the .NET Framework. Machine learning, a hot topic these days, is part of a broader analytics trend in the software industry that attempts to make machines smarter. Analytics, though not really a new trend, perhaps has higher visibility than in the past. This chapter will focus on some of the larger questions you might have about machine learning using the .NET Framework, namely: What is machine learning? Why should we consider it in the .NET Framework? How can I get started with coding?
If you check Wikipedia, you will find a fairly abstract definition of machine learning:
"Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions."
I like to think of machine learning as computer programs that produce different results as they are exposed to more information, without their source code changing (and consequently without needing to be redeployed). For example, consider a game that I play with the computer.
I show the computer this picture and tell it "Blue Circle". I then show it this picture and tell it "Red Circle". Next I show it this picture and say "Green Triangle."
Finally, I show it this picture and ask it "What is this?". Ideally the computer would respond, "Green Circle."
This is one example of machine learning. Although I did not change my code or recompile and redeploy, the computer program can respond accurately to data it has never seen before. Also, we do not have to explicitly code for each possible data permutation. Instead, we create models that the computer applies to new data. Sometimes the computer is right, sometimes it is wrong. We then feed the new data back to the computer to retrain the model, so the computer gets more and more accurate over time (or, at least, that is the goal).
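To make the idea concrete, here is a minimal sketch of the picture game in F#. This is my own illustration, not code from the book: instead of memorizing whole labels, the program learns two tiny nearest-neighbor models, one mapping a color to a word and one mapping a side count to a shape word, and composes them on a combination it has never seen. The feature encoding (colors as RGB triples, circles as zero sides) is an assumption made purely for illustration.

```fsharp
// The three labeled pictures, encoded as an RGB triple, a side
// count (zero for a circle), and the two label words.
let observations =
    [ (0.0, 0.0, 1.0), 0, ("Blue", "Circle")
      (1.0, 0.0, 0.0), 0, ("Red", "Circle")
      (0.0, 1.0, 0.0), 3, ("Green", "Triangle") ]

// A tiny nearest-neighbor "model" for color: return the color
// word of the observation whose RGB values are closest.
let colorWord (r, g, b) =
    observations
    |> List.minBy (fun ((r', g', b'), _, _) ->
        (r - r') ** 2.0 + (g - g') ** 2.0 + (b - b') ** 2.0)
    |> fun (_, _, (color, _)) -> color

// The same idea for shape, keyed on the side count.
let shapeWord sides =
    observations
    |> List.minBy (fun (_, s, _) -> abs (s - sides))
    |> fun (_, _, (_, shape)) -> shape

// A green circle: a combination the program was never shown.
let answer = sprintf "%s %s" (colorWord (0.0, 1.0, 0.0)) (shapeWord 0)
printfn "%s" answer   // prints "Green Circle"
```

Showing the program more labeled pictures would refine its answers without any change to the classification code, which is the essence of the definition above.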
Once you decide to implement some machine learning into your code base, another decision has to be made fairly early in the process. How often do you want the computer to learn? For example, if you create a model by hand, how often do you update it? With every new data row? Every month? Every year? Depending on what you are trying to accomplish, you might create a real-time ML model, a near-time model, or a periodic model. We will discuss the implications and implementations of each of these in several chapters in the book as different models lend themselves to different retraining strategies.
If you are a Windows developer, using .NET is something you do without thinking. Indeed, a vast majority of Windows business applications written in the last 15 years use managed code, most of it written in C#. Although it is difficult to categorize millions of software developers, it is fair to say that .NET developers often come from nontraditional backgrounds. Perhaps a developer came to .NET with a computer science degree, but it is equally likely that s/he started by writing VBA scripts in Excel, moved up to Access applications, and then on to VB.NET/C# applications. Therefore, most .NET developers are likely to be familiar with C#/VB.NET and write in an imperative, and perhaps OO, style.
The problem with this rather narrow exposure is that most machine learning classes, books, and code examples are in R or Python and very much use a functional style of writing code. Therefore, the .NET developer is at a disadvantage when acquiring machine learning skills because of the need to learn a new development environment, a new language, and a new style of coding before learning how to write the first line of machine learning code.
If, however, that same developer can use their familiar IDE (Visual Studio) and the same base libraries (the .NET Framework), they can concentrate on learning machine learning much sooner. Also, machine learning models created in .NET have immediate impact, because you can slide the code right into an existing C#/VB.NET solution.
On the other hand, .NET is under-represented in the data science community. There are a couple of different reasons floating around for this. The first is that, historically, Microsoft was a proprietary closed system and the academic community embraced open source systems such as Linux and Java. The second is that much academic research uses domain-specific languages such as R, whereas Microsoft concentrated .NET on general-purpose programming languages. Researchers who moved to industry took their languages with them. However, as the researcher's role shifts from offline data science to building real-time programs that customers touch, researchers are getting more and more exposure to Windows and Windows development. Whether you like it or not, every company that creates customer-facing software must have a Windows strategy, an iOS strategy, and an Android strategy.
One real advantage of writing and then deploying your machine learning code in .NET is that you get everything with one-stop shopping. I know of several large companies that write their models in R and then have another team rewrite them in Python or C++ to deploy them. Similarly, they might write their model in Python and then rewrite it in C# to deploy on Windows devices. Clearly, if you can write and deploy in one language stack, there is a tremendous opportunity for efficiency and speed to market.
The .NET Framework has been in general release since 2002. The base of the framework is the Common Language Runtime (CLR). The CLR is a virtual machine that abstracts much of the OS-specific functionality, such as memory management and exception handling. The CLR is loosely based on the Java Virtual Machine (JVM). Sitting on top of the CLR is the Framework Class Library (FCL), which allows different languages to interoperate with the CLR and with each other: the FCL is what allows VB.NET, C#, F#, and IronPython code to work side by side.
Since its first release, the .NET Framework has included more and more features. The first release saw support for the major platform libraries such as WinForms, ASP.NET, and ADO.NET. Subsequent releases brought in Windows Communication Foundation (WCF), Language Integrated Query (LINQ), and the Task Parallel Library (TPL). At the time of writing, the latest version of the .NET Framework is 4.6.2.
In addition to the full .NET Framework, over the years Microsoft has released slimmed-down versions intended to run on machines with limited hardware and OS support. The most famous of these releases was the Portable Class Library (PCL), which targeted Windows RT applications running Windows 8. The most recent incarnation is Universal Windows Applications (UWA), targeting Windows 10.
At Connect(); in November 2015, Microsoft announced the general availability of the latest edition of the .NET Framework. This release introduced .NET Core 5; in January, it was renamed .NET Core 1.0. .NET Core 1.0 is intended to be a slimmed-down version of the full .NET Framework that runs on multiple operating systems (specifically targeting OS X and Linux). The next release of ASP.NET (ASP.NET Core 1.0) sits on top of .NET Core 1.0. ASP.NET Core 1.0 applications that run on Windows can still use the full .NET Framework.
(https://blogs.msdn.microsoft.com/webdev/2016/01/19/asp-net-5-is-dead-introducing-asp-net-core-1-0-and-net-core-1-0/)
In this book, we will use a mixture of ASP.NET 4.0, ASP.NET 5.0, and Universal Windows Applications. As you can guess, machine learning models (and the theory behind them) change much less frequently than framework releases, so most of the code you write on .NET 4.6 will work equally well with PCL and .NET Core 1.0. That said, the external libraries that we will use need some time to catch up, so they might work with PCL but not with .NET Core 1.0 yet. To keep things realistic, the demonstration projects will use .NET 4.6 on ASP.NET 4.x for existing (brownfield) applications. New (greenfield) applications will be a mixture of UWA applications using PCL and ASP.NET 5.0 applications.
It seems like all of the major software companies are pitching machine learning services, such as Google Analytics, Amazon Machine Learning Services, IBM Watson, and Microsoft Cortana Analytics, to name a few. In addition, major software companies often try to sell products that have a machine learning component, such as Microsoft SQL Server Analysis Services, Oracle Database Add-In, IBM SPSS, or SAS JMP. I have not included some common analytical software packages, such as PowerBI or Tableau, because they are more data-aggregation and report-writing applications. Although they do analytics, they do not have a machine learning component (not yet, at least).
With all these options, why would you want to learn how to implement machine learning inside your applications, or in effect, write some code that you can purchase elsewhere? It is the classic build versus buy decision that every department or company has to make. You might want to build because:
Once you decide to go native, you have a choice of rolling your own code or using some of the open source assemblies out there. This book will introduce both techniques, highlight some of the pros and cons of each, and let you decide how you want to implement them. For example, you can easily write your own basic classifier that is very effective in production, but certain models, such as a neural network, will take a considerable amount of time and energy and probably will not give you the results that the open source libraries do. As a final note, since the libraries that we will look at are open source, you are free to customize pieces of them; the owners might even accept your changes. However, we will not be customizing these libraries in this book.
Many books on machine learning use datasets that come with the language install (such as R or Hadoop) or point to public repositories that have considerable visibility in the data science community. The most common ones are Kaggle (especially the Titanic competition) and the UC Irvine Machine Learning Repository. While these are great datasets and provide a common denominator, this book will expose you to datasets that come from government entities. The notion of getting data from government and hacking for social good is typically called open data. I believe that open data will transform how the government interacts with its citizens and will make government entities more efficient and transparent. Therefore, we will use open datasets in this book, and hopefully you will consider helping out with the open data movement.
As we will be on the .NET Framework, we could use C#, VB.NET, or F#. All three languages have strong support from Microsoft, and all three will be around for many years. F# is the best choice for this book because it is uniquely suited, within the .NET Framework, to thinking in the scientific method and to machine learning model creation. Data scientists will feel right at home with its syntax and IDE (languages such as R are also functional-first languages). It is also the best choice for .NET business developers because it is built right into Visual Studio and plays well with your existing C#/VB.NET code. The obvious alternative is C#. Can I do all of this in C#? Yes, kind of. In fact, many of the .NET libraries we will use are written in C#.
However, using C# in our code base would make it larger and increase the chance of introducing bugs into the code. At certain points, I will show some examples in C#, but the majority of the book is in F#.
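As a small taste of why the functional style suits this kind of work, here is a hypothetical F# fragment (my own illustration, not an example from the book) that filters and averages a list of made-up product review scores. Each step of the analysis reads as one stage in a pipeline:

```fsharp
// A list of product review scores (made-up sample data).
let reviews = [ 4.5; 3.0; 5.0; 2.5; 4.0 ]

// Pipe the data through the analysis steps:
// keep the favorable reviews, then average them.
let averageOfGoodReviews =
    reviews
    |> List.filter (fun score -> score >= 3.0)  // drops the 2.5
    |> List.average                             // (4.5 + 3.0 + 5.0 + 4.0) / 4.0

printfn "%g" averageOfGoodReviews   // prints 4.125
```

The equivalent C# would typically be a LINQ chain or a loop with accumulator variables; the pipeline shape also makes it natural to swap individual steps in and out while exploring a dataset, which is exactly the workflow of model building.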
Another alternative is to forgo .NET altogether and develop the machine learning models in R or Python. You could spin up a web service (such as AzureML), which might be good in some scenarios, but in disconnected or slow network environments you will get stuck. Also, assuming comparable machines, executing locally will perform better than going over the wire. When we implement our models to do real-time analytics, anything we can do to minimize the performance hit is worth considering.
A third alternative that .NET developers will consider is writing the models in T-SQL. Indeed, many of our initial models were implemented in T-SQL and are part of SQL Server Analysis Services. The advantage of doing it on the data server is that the computation is as close as you can get to the data, so you will not suffer the latency of moving large amounts of data over the wire. The downsides of using T-SQL are that you can't implement unit tests easily, your domain logic moves away from the application and onto the data server (which is considered bad form in most modern application architectures), and you become reliant on a specific implementation of the database. F#, by contrast, is open source and runs on a variety of operating systems, so you can port your code much more easily.
