Introduces professionals and scientists to statistics and machine learning using the programming language R. Written by and for practitioners, this book provides an overall introduction to R, focusing on tools and methods commonly used in data science, and placing emphasis on practice and business use. It covers a wide range of topics in a single volume, including big data, databases, statistical machine learning, data wrangling, data visualization, and the reporting of results. The topics covered are all important for someone with a science/math background who is looking to quickly learn several practical technologies to enter or transition to the growing field of data science. The Big R-Book for Professionals: From Data Science to Learning Machines and Reporting with R includes nine parts, starting with an introduction to the subject and followed by an overview of R and elements of statistics. The third part revolves around data, while the fourth focuses on data wrangling. Part 5 teaches readers about exploring data. In Part 6 we learn to build models, Part 7 introduces the reader to the reality in companies, Part 8 covers reports and interactive applications, and finally Part 9 introduces the reader to big data and performance computing. It also includes some helpful appendices.
* Provides a practical guide for non-experts with a focus on business users
* Contains a unique combination of topics including an introduction to R, machine learning, mathematical models, data wrangling, and reporting
* Uses a practical tone and integrates multiple topics in a coherent framework
* Demystifies the hype around machine learning and AI by enabling readers to understand the provided models and program them in R
* Shows readers how to visualize results in static and interactive reports
* Supplementary material includes PDF slides based on the book's content, as well as all the extracted R-code, and is available to everyone on a Wiley Book Companion Site
The Big R-Book is an excellent guide for science, technology, engineering, or mathematics students who wish to make a successful transition from the academic world to the professional. It will also appeal to all young data scientists, quantitative analysts, and analytics professionals, as well as those who make mathematical models.
Page count: 1367
Year of publication: 2020
Philippe J.S. De Brouwer
This edition first published 2021
© 2021 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Philippe J.S. De Brouwer to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: De Brouwer, Philippe J. S., author.
Title: The big R-book : from data science to learning machines and big data / Philippe J.S. De Brouwer.
Description: Hoboken, NJ, USA : Wiley, 2020. | Includes bibliographical references and index.
Identifiers: LCCN 2019057557 (print) | LCCN 2019057558 (ebook) | ISBN 9781119632726 (hardback) | ISBN 9781119632764 (adobe pdf) | ISBN 9781119632771 (epub)
Subjects: LCSH: R (Computer program language)
Classification: LCC QA76.73.R3 .D43 2020 (print) | LCC QA76.73.R3 (ebook) | DDC 005.13/3–dc23
LC record available at https://lccn.loc.gov/2019057557
LC ebook record available at https://lccn.loc.gov/2019057558
Cover Design: Wiley
Cover Images: Information Tide series and Particle Geometry series
© agsandrew/Shutterstock, Abstract geometric landscape © gremlin/Getty Images, 3D illustration Rendering © MR.Cole_Photographer/Getty Images
To Joanna, Amelia and Maximilian
This book brings together skills and knowledge that can help to boost your career. It is an excellent tool for people working as a database manager, data scientist, quant, modeller, statistician, analyst and more, who are knowledgeable about certain topics, but want to widen their horizon and understand what the others in this list do. A wider understanding means that we can do our job better and eventually open doors to new or enhanced careers.
The student who graduated from a science, technology, engineering or mathematics or similar program will find that this book helps to make a successful step from the academic world into any private or governmental company.
This book uses the popular (and free) software R as leitmotif to build up essential programming proficiency, understand databases, collect data, wrangle data, build models and select models from a suite of possibilities such as linear regression, logistic regression, neural networks, decision trees, multi criteria decision models, etc. and ultimately evaluate a model and report on it.
We will go the extra mile by explaining some essentials of accounting in order to build up to pricing of assets such as bonds, equities and options. This helps to deepen the understanding of how a company functions, is useful to be more result oriented in a private company, helps for one's own investments, and provides a good example of the theories mentioned before. We also spend time on the presentation of results and we use R to generate slides, text documents and even interactive websites! Finally we explore big data and provide handy tips on speeding up code.
I hope that this book helps you to learn faster than I did, and build a great and interesting career.
Enjoy reading!
Philippe De Brouwer
2020
This book is accompanied by a companion website:
www.wiley.com/go/De Brouwer/The Big R-Book
The website includes materials for students and instructors:
The Student companion site will contain the R-code, and the Instructor companion site will contain PDF slides based on the book's content.
The author has written this book based on his experience that spans roughly three decades in insurance, banking, and asset management. During his career, the author worked in IT, structured and managed highly technical investment portfolios (at some point oversaw €24 billion in a thousand investment funds), fulfilled many C-level roles (e.g. was CEO of KBCTFI SA [an asset manager in Poland], was CIO and COO for Eperon SA [a fund manager in Ireland], sat on boards of investment funds, and was involved in big-data projects in London), and did quantitative analysis in risk departments of banks. This gave the author a unique and in-depth view of many areas ranging from analytics, big data, and databases to business requirements, financial modelling, etc.
In this book, the author presents a structured overview of his knowledge and experience for anyone who works with data and invites the reader to understand the bigger picture, and discover new aspects. This book also demystifies hype around machine learning and AI, by helping the reader to understand the models and program them in R without spending too much time on the theory.
This book aims to be a starting point for quants, data scientists, modellers, etc. It aims to be the book that bridges different disciplines so that a specialist in one domain can grab this book, understand how his/her discipline fits in the bigger picture, and get enough material to understand the person who is specialized in a related discipline. Therefore, it could be the ideal book that helps you to make a career move to another discipline so that in a few years you are that person who understands the whole data-chain. In short, the author wants to give you a short-cut to the knowledge that he spent 30 years accumulating.
Another important point is that this book is written by and for practitioners: people that work with data, programming and mathematics for a living in a corporate environment. So, this book would be most interesting for anyone interested in data science, machine learning, statistical learning and mathematical modelling, and for whoever wants to convey technical matters in a clear and concise way to non-specialists.
This also means that this book is not necessarily the best book in any of the disciplines that it spans. In every specialisation there are already good contenders.
More formal introductions to statistics are for example in: Cyganowski, Kloeden, and Ombach (2001) and Andersen et al. (1987). There are also many books about specific stochastic processes and their applications in financial markets: see e.g. Wolfgang and Baschnagel (1999), Malliaris and Brock (1982), and Mikosch (1998). While knowledge of stochastic processes and their importance in asset pricing are important, this covers only a very narrow spot of applications and theory. This book is more general, more gentle on theoretical foundations, and focusses more on the use of data to answer real-life problems in an everyday business environment.
A comprehensive introduction to statistics or econometrics can be found in Peracchi (2001) or Greene (1997). A general and comprehensive introduction in statistics is also in Neter, Wasserman, and Whitmore (1988).
This is not simply a book about programming and/or any related techniques. If you just want to learn programming in R, then Grolemund (2014) will get you started faster. Our Part II will also get you started in programming, though it assumes a certain familiarity with programming and mainly zooms in on aspects that will be important in the rest of the book.
This book is not a comprehensive book about financial modelling. Other books do a better job in listing all types of possible models. No book does a better job here than Bernard Marr's publication: Marr (2016): “Key Business Analytics, the 60+ business analysis tools every manager needs to know.” That book lists all the words that some managers might use and what they mean, without any of the mathematics or any of the programming behind them. I warmly recommend keeping that book next to ours. Whenever someone comes up with a term like “customer churn analytics” for example, you can use Bernard's book to find out what it actually means and then turn to ours to “get your hands dirty” and actually do it.
If you are only interested in statistical learning and modelling, you will find the following books more focused: Hastie, Tibshirani, and Friedman (2009) or also James, Witten, Hastie, and Tibshirani (2013), who also use R.
A more in-depth introduction to AI can be found in Russell and Norvig (2016).
Data science is more elaborately treated in Baesens (2014) and the recent book by Wickham and Grolemund (2016) that provides an excellent introduction to R and data science in general. This last book is a great add-on to this book as it focusses more on the data-aspects (but less on the statistical learning part). We also focus more on the practical aspects and real data problems in a corporate environment.
A book that comes close to ours in purpose is the book that my friend Professor Bart Baesens has compiled, “Analytics in a Big Data World, the Essential guide to data science and its applications”: Baesens (2014). If the mathematics, programming, and R itself scare you in this book, then Bart's book is for you. Bart's book covers different methods, but above all, for the reader, it is sufficient to be able to use a spreadsheet to do some basic calculations. Therefore, it will not help you to tackle big data or program a neural network yourself, but you will understand very well what it means and how things work.
Another book that might work well if the maths in this one are prohibitive to you is Provost and Fawcett (2013). It will give you some insight into what statistical learning is and how it works, but will not prepare you to use it on real data.
Summarizing, I suggest you buy next to this book also Marr (2016) and Baesens (2014). This will provide you a complete chain from business and buzzwords (Bernard's book) over understanding what modelling is and what practical issues one will encounter (Bart's book) to implementing this in a corporate setting and solving the practical problems of a data scientist and modeller on sizeable data (this book).
In a nutshell, this book does it all, is gentle on theoretical foundations and aims to be a one-stop shop to show the big picture, learn all those things and actually apply them. It aims to serve as a basis when later picking up more advanced books in certain narrow areas. This book will take you on a journey of working with data in a real company, and hence, it will also discuss practical problems such as people filling in forms or extracting data from a SQL database.
It should be readable for any person that finished (or is finishing) university level education in a quantitative field such as physics, civil engineering, mathematics, econometrics, etc. It should also be readable by the senior manager with a technical background, who tries to understand what his army of quants, data scientists, and developers are up to, while having fun learning R. After reading this book you will be able to talk to all of them, challenge their work, and do most of the analysis yourself or be part of a bigger entity and specialize in one of the steps of modelling or data-manipulation.
In some way, this book can also be seen as a celebration of FOSS (Free and Open Source Software). We proudly mention that for this book no commercial software was used at all. The operating system is Linux, the window manager Fluxbox (sometimes LXDE or KDE), Kile and vi helped the editing process, Okular displayed the PDF-file, even the database servers and Hadoop/Spark are FOSS … and of course R and LaTeX provided the icing on the cake. FOSS makes this world a more inclusive place as it makes technology more attainable in poorer places in this world.
Hence, we extend a warm thanks to all people that spend so much time contributing to free software.
Writing a book that is so eclectic and holds so much information would not have been possible without tremendous support from so many people: mentors, family, colleagues, and ex-colleagues at work or at universities. This book is in the first place a condensation of a few decades of interesting work in asset management and banking and mixes things that I have learned in C-level jobs and more technical assignments.
I thank the colleagues of the faculties of applied mathematics at the AGH University of Science and Technology, the faculty of mathematics of the Jagiellonian University of Krakow, and the colleagues of HSBC for the many stimulating discussions and shared insights in mathematical modelling and machine learning.
To the MBA program of the Cracovian Business School, the University of Warsaw, and to the many leaders that marked my journey, I am indebted for the business insight, stakeholder management and commercial wit that make this book complete.
A special thanks goes to Piotr Kowalczyk, FRM and Dr. Grzegorz Goryl, PRM, for reading large chunks of this book and providing detailed suggestions. I am also grateful for the general remarks and suggestions from Dr. Jerzy Dzieża, faculty of applied mathematics at the AGH University of Science and Technology of Krakow and the fruitful discussions with Dr. Tadeusz Czernik, from the University of Economics of Katowice and also Senior Manager at HSBC, Independent Model Review, Krakow.
This book would not be what it is now without the many years of experience, the stimulating discussions with so many friends, and in particular my wife, Joanna De Brouwer, who encouraged me to move from London in order to work for HSBC in Krakow, Poland. Somehow, I feel that I should thank the city council and all the people for the wonderful and dynamic environment that attracts so many new service centres and that makes the ones that had already opted for Krakow grow their successful investments. This dynamic environment has certainly been an important stimulating factor in writing this book.
However, nothing would have been possible without the devotion and support of my family: my wife Joanna and both children, Amelia and Maximilian, were wonderful and are a constant source of inspiration and support.
Finally, I would like to thank the thousands of people who contribute to free and open source software, people that spend thousands of hours to create and improve software that others can use for free. I profoundly believe that these selfless acts make this world a better and more inclusive place, because they make computers, software, and studying more accessible for the less fortunate.
A special honourable mention should go to the people that have built Linux, LaTeX, R, and the ecosystems around each of them as well as the companies that contribute to those projects, such as Microsoft that has embraced R and RStudio that enhances R and never fails to share the fruits of their efforts with the larger community.
You have certainly heard the words: “data is the new oil,” and you probably wondered “are we indeed on the verge of a new era of innovation and wealth creation or … is this just hype and will it blow over soon enough?”
Since our ancestors left the trees about 6 million years ago, we roamed the African steppes and we evolved a more upright position and limbs better suited for walking than climbing. However, for about 4 million years physiological changes did not include a larger brain. It is only in the last million years that we gradually evolved a more potent frontal lobe capable of abstract and logical thinking.
The first good evidence of abstract thinking is the Makapansgat pebble, a jasperite cobble – roughly 260 g and 5 by 8 cm – that by geological wear and tear shows a few holes and lines that vaguely resemble (to us) a human face. About 2.5 million years ago one of our australopithecine ancestors not only realized this resemblance but also deemed it interesting enough to pick up the pebble, keep it, and finally leave it in a cave miles from the river where it was found.
This development of abstract thinking that goes beyond vague resemblance was a major milestone. As history unfolded, it became clear that this was only the first of many steps that would lead us to the era of data and knowledge that we live in today. Many more steps towards more complex and abstract thinking, gene mutations and innovation would be needed.
Soon we developed language. With language we were able to transform learning from an individual level to a collective level. Now, experiences could be passed on to the next generation or peers much more efficiently. It became possible to prepare someone for something that he or she had not yet encountered and to accumulate more knowledge with every generation.
More than ever before this abstract thinking and accumulation of collective experiences led to a “knowledge advantage” and smartness became an attractive trait in a mate. This allowed our brain to develop further, and great innovations such as the wheel, writing, bronze, agriculture, iron, and specialisation of labour soon started to transform not only our societal coherence but also the world around us.
Without those innovations, we would not be where we are now. While it is debatable whether these inventions can be classified as the fruit of scientific work, it is equally hard to deny that some kind of scientific approach was necessary. For example, by recognizing the patterns in the movements of the sun, we could predict seasons and weather changes to come, and this allowed us to put the grain in the ground at the right moment. This was based on observations and experience.
Science and progress flourished, but the fall of the Western Roman Empire made Europe sink into the dark medieval period where thinking was dominated by religious fear and superstition. Hence scientific progress came to a grinding halt, and in its wake so did improvements in medical care, food production and technology.
The Arab world continued the legacy of Aristotle (384–322 BCE, Greece) and Alhazen (Ibn al-Haytham, 965–1039 Iraq), who is considered by many to be the father of the modern scientific method. It was this modern scientific method that became a catalyst for scientific and technological development.
A class of people that accumulated wealth through smart choices emerged. This was made possible by private enterprise and an efficient way of sharing risks and investments. In 1602, the Dutch East India Company became the first joint stock company, and in the same year the Amsterdam Stock Exchange created a platform where innovative, exploratory and trade ideas could find the necessary capital to flourish.
In 1775, James Watt's improvement of the steam engine made it possible to leverage the progress made around the joint stock company and the stock exchange. This combination powered the rise of a new societal organization, capitalism, and fuelled the first industrial wave based on automation (mainly in the textile industry).
While this first industrial wave brought much misery and social injustice, as a species we were preparing for the next stage. It created wealth on a scale never seen before. From England, industrialization spread fast over Europe and the young state in North America. It all ended in “the Panic of 1873,” that brought the “Long Depression” to Europe and the United States of America. This depression was so deep that it would indirectly give rise to the invention of a new economic order: communism.
