Advances in Financial Machine Learning - Marcos Lopez de Prado - E-Book


Marcos Lopez de Prado

Description

Learn to understand and implement the latest machine learning innovations to improve your investment performance.

Machine learning (ML) is changing virtually every aspect of our lives. Today, ML algorithms accomplish tasks that, until recently, only expert humans could perform. And finance is ripe for disruptive innovations that will transform how the following generations understand money and invest. In this book, readers will learn how to:

* Structure big data in a way that is amenable to ML algorithms
* Conduct research with ML algorithms on big data
* Use supercomputing methods and backtest their discoveries while avoiding false positives

Advances in Financial Machine Learning addresses real-life problems faced by practitioners every day, and explains scientifically sound solutions using math, supported by code and examples. Readers become active users who can test the proposed solutions in their individual settings. Written by a recognized expert and portfolio manager, this book will equip investment professionals with the groundbreaking tools needed to succeed in modern finance.


Page count: 530

Year of publication: 2018




Praise for Advances in Financial Machine Learning

“In his new book Advances in Financial Machine Learning, noted financial scholar Marcos López de Prado strikes a well-aimed karate chop at the naive and often statistically overfit techniques that are so prevalent in the financial world today. He points out that not only are business-as-usual approaches largely impotent in today’s high-tech finance, but in many cases they are actually prone to lose money. But López de Prado does more than just expose the mathematical and statistical sins of the finance world. Instead, he offers a technically sound roadmap for finance professionals to join the wave of machine learning. What is particularly refreshing is the author’s empirical approach—his focus is on real-world data analysis, not on purely theoretical methods that may look pretty on paper but which, in many cases, are largely ineffective in practice. The book is geared to finance professionals who are already familiar with statistical data analysis techniques, but it is well worth the effort for those who want to do real state-of-the-art work in the field.”

Dr. David H. Bailey, former Complex Systems Lead,

Lawrence Berkeley National Laboratory. Co-discoverer of the

BBP spigot algorithm

“Finance has evolved from a compendium of heuristics based on historical financial statements to a highly sophisticated scientific discipline relying on computer farms to analyze massive data streams in real time. The recent highly impressive advances in machine learning (ML) are fraught with both promise and peril when applied to modern finance. While finance offers up the nonlinearities and large data sets upon which ML thrives, it also offers up noisy data and the human element which presently lie beyond the scope of standard ML techniques. To err is human, but if you really want to f**k things up, use a computer. Against this background, Dr. López de Prado has written the first comprehensive book describing the application of modern ML to financial modeling. The book blends the latest technological developments in ML with critical life lessons learned from the author’s decades of financial experience in leading academic and industrial institutions. I highly recommend this exciting book to both prospective students of financial ML and the professors and supervisors who teach and guide them.”

Prof. Peter Carr, Chair of the Finance and Risk Engineering

Department, NYU Tandon School of Engineering

“Marcos is a visionary who works tirelessly to advance the finance field. His writing is comprehensive and masterfully connects the theory to the application. It is not often you find a book that can cross that divide. This book is an essential read for both practitioners and technologists working on solutions for the investment community.”

Landon Downs, President and Cofounder, 1QBit

“Academics who want to understand modern investment management need to read this book. In it, Marcos López de Prado explains how portfolio managers use machine learning to derive, test, and employ trading strategies. He does this from a very unusual combination of an academic perspective and extensive experience in industry, allowing him to both explain in detail what happens in industry and to explain how it works. I suspect that some readers will find parts of the book that they do not understand or that they disagree with, but everyone interested in understanding the application of machine learning to finance will benefit from reading this book.”

Prof. David Easley, Cornell University. Chair of the

NASDAQ-OMX Economic Advisory Board

“For many decades, finance has relied on overly simplistic statistical techniques to identify patterns in data. Machine learning promises to change that by allowing researchers to use modern nonlinear and highly dimensional techniques, similar to those used in scientific fields like DNA analysis and astrophysics. At the same time, applying those machine learning algorithms to model financial problems would be dangerous. Financial problems require very distinct machine learning solutions. Dr. López de Prado’s book is the first one to characterize what makes standard machine learning tools fail when applied to the field of finance, and the first one to provide practical solutions to unique challenges faced by asset managers. Everyone who wants to understand the future of finance should read this book.”

Prof. Frank Fabozzi, EDHEC Business School. Editor of

The Journal of Portfolio Management

“This is a welcome departure from the knowledge hoarding that plagues quantitative finance. López de Prado defines for all readers the next era of finance: industrial scale scientific research powered by machines.”

John Fawcett, Founder and CEO, Quantopian

“Marcos has assembled in one place an invaluable set of lessons and techniques for practitioners seeking to deploy machine learning techniques in finance. If machine learning is a new and potentially powerful weapon in the arsenal of quantitative finance, Marcos’s insightful book is laden with useful advice to help keep a curious practitioner from going down any number of blind alleys, or shooting oneself in the foot.”

Ross Garon, Head of Cubist Systematic Strategies. Managing

Director, Point72 Asset Management

“The first wave of quantitative innovation in finance was led by Markowitz optimization. Machine Learning is the second wave, and it will touch every aspect of finance. López de Prado’s Advances in Financial Machine Learning is essential for readers who want to be ahead of the technology rather than being replaced by it.”

Prof. Campbell Harvey, Duke University. Former President of

the American Finance Association

“The complexity inherent to financial systems justifies the application of sophisticated mathematical techniques. Advances in Financial Machine Learning is an exciting book that unravels a complex subject in clear terms. I wholeheartedly recommend this book to anyone interested in the future of quantitative investments.”

Prof. John C. Hull, University of Toronto. Author of Options, Futures, and Other Derivatives

“Prado’s book clearly illustrates how fast this world is moving, and how deep you need to dive if you are to excel and deliver top of the range solutions and above the curve performing algorithms... Prado’s book is clearly at the bleeding edge of the machine learning world.”

Irish Tech News

“Financial data is special for a key reason: The markets have only one past. There is no ‘control group’, and you have to wait for true out-of-sample data. Consequently, it is easy to fool yourself, and with the march of Moore’s Law and the new machine learning, it’s easier than ever. López de Prado explains how to avoid falling for these common mistakes. This is an excellent book for anyone working, or hoping to work, in computerized investment and trading.”

Dr. David J. Leinweber, Former Managing Director, First Quadrant. Author of Nerds on Wall Street: Math, Machines and Wired Markets

“In his new book, Dr. López de Prado demonstrates that financial machine learning is more than standard machine learning applied to financial datasets. It is an important field of research in its own right. It requires the development of new mathematical tools and approaches, needed to address the nuances of financial datasets. I strongly recommend this book to anyone who wishes to move beyond the standard Econometric toolkit.”

Dr. Richard R. Lindsey, Managing Partner, Windham Capital Management. Former Chief Economist, U.S. Securities and Exchange Commission

“Dr. López de Prado, a well-known scholar and an accomplished portfolio manager who has made several important contributions to the literature on machine learning (ML) in finance, has produced a comprehensive and innovative book on the subject. He has illuminated numerous pitfalls awaiting anyone who wishes to use ML in earnest, and he has provided much needed blueprints for doing it successfully. This timely book, offering a good balance of theoretical and applied findings, is a must for academics and practitioners alike.”

Prof. Alexander Lipton, Connection Science Fellow, Massachusetts Institute of Technology. Risk’s Quant of the Year (2000)

“How does one make sense of today’s financial markets in which complex algorithms route orders, financial data is voluminous, and trading speeds are measured in nanoseconds? In this important book, Marcos López de Prado sets out a new paradigm for investment management built on machine learning. Far from being a “black box” technique, this book clearly explains the tools and process of financial machine learning. For academics and practitioners alike, this book fills an important gap in our understanding of investment management in the machine age.”

Prof. Maureen O’Hara, Cornell University. Former President of

the American Finance Association

“Marcos López de Prado has produced an extremely timely and important book on machine learning. The author’s academic and professional first-rate credentials shine through the pages of this book—indeed, I could think of few, if any, authors better suited to explaining both the theoretical and the practical aspects of this new and (for most) unfamiliar subject. Both novices and experienced professionals will find insightful ideas, and will understand how the subject can be applied in novel and useful ways. The Python code will give the novice readers a running start and will allow them to gain quickly a hands-on appreciation of the subject. Destined to become a classic in this rapidly burgeoning field.”

Prof. Riccardo Rebonato, EDHEC Business School. Former

Global Head of Rates and FX Analytics at PIMCO

“A tour de force on practical aspects of machine learning in finance, brimming with ideas on how to employ cutting-edge techniques, such as fractional differentiation and quantum computers, to gain insight and competitive advantage. A useful volume for finance and machine learning practitioners alike.”

Dr. Collin P. Williams, Head of Research, D-Wave Systems

Advances in Financial Machine Learning

MARCOS LÓPEZ DE PRADO

Cover image: © Erikona/Getty Images
Cover design: Wiley

Copyright © 2018 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. The views expressed in this book are the author’s and do not necessarily reflect those of the organizations he is affiliated with.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

ISBN 978-1-119-48208-6 (Hardcover)
ISBN 978-1-119-48211-6 (ePDF)
ISBN 978-1-119-48210-9 (ePub)

Dedicated to the memory of my coauthor and friend,

Professor Jonathan M. Borwein, FRSC, FAAAS,

FBAS, FAustMS, FAA, FAMS, FRSNSW

(1951–2016)

There are very few things which we know, which are not capable of being reduced to a mathematical reasoning. And when they cannot, it’s a sign our knowledge of them is very small and confused. Where a mathematical reasoning can be had, it’s as great a folly to make use of any other, as to grope for a thing in the dark, when you have a candle standing by you.

—Of the Laws of Chance, Preface (1692)

John Arbuthnot (1667–1735)

CONTENTS

About the Author

PREAMBLE

Chapter 1 Financial Machine Learning as a Distinct Subject

1.1 Motivation

1.2 The Main Reason Financial Machine Learning Projects Usually Fail

1.3 Book Structure

1.4 Target Audience

1.5 Requisites

1.6 FAQs

1.7 Acknowledgments

Exercises

References

Bibliography

Notes

PART 1 DATA ANALYSIS

Chapter 2 Financial Data Structures

2.1 Motivation

2.2 Essential Types of Financial Data

2.3 Bars

2.4 Dealing with Multi-Product Series

2.5 Sampling Features

Exercises

References

Chapter 3 Labeling

3.1 Motivation

3.2 The Fixed-Time Horizon Method

3.3 Computing Dynamic Thresholds

3.4 The Triple-Barrier Method

3.5 Learning Side and Size

3.6 Meta-Labeling

3.7 How to Use Meta-Labeling

3.8 The Quantamental Way

3.9 Dropping Unnecessary Labels

Exercises

Bibliography

Note

Chapter 4 Sample Weights

4.1 Motivation

4.2 Overlapping Outcomes

4.3 Number of Concurrent Labels

4.4 Average Uniqueness of a Label

4.5 Bagging Classifiers and Uniqueness

4.6 Return Attribution

4.7 Time Decay

4.8 Class Weights

Exercises

References

Bibliography

Chapter 5 Fractionally Differentiated Features

5.1 Motivation

5.2 The Stationarity vs. Memory Dilemma

5.3 Literature Review

5.4 The Method

5.5 Implementation

5.6 Stationarity with Maximum Memory Preservation

5.7 Conclusion

Exercises

References

Bibliography

PART 2 MODELLING

Chapter 6 Ensemble Methods

6.1 Motivation

6.2 The Three Sources of Errors

6.3 Bootstrap Aggregation

6.4 Random Forest

6.5 Boosting

6.6 Bagging vs. Boosting in Finance

6.7 Bagging for Scalability

Exercises

References

Bibliography

Notes

Chapter 7 Cross-Validation in Finance

7.1 Motivation

7.2 The Goal of Cross-Validation

7.3 Why K-Fold CV Fails in Finance

7.4 A Solution: Purged K-Fold CV

7.5 Bugs in Sklearn’s Cross-Validation

Exercises

Bibliography

Chapter 8 Feature Importance

8.1 Motivation

8.2 The Importance of Feature Importance

8.3 Feature Importance with Substitution Effects

8.4 Feature Importance without Substitution Effects

8.5 Parallelized vs. Stacked Feature Importance

8.6 Experiments with Synthetic Data

Exercises

References

Note

Chapter 9 Hyper-Parameter Tuning with Cross-Validation

9.1 Motivation

9.2 Grid Search Cross-Validation

9.3 Randomized Search Cross-Validation

9.4 Scoring and Hyper-parameter Tuning

Exercises

References

Bibliography

Notes

PART 3 BACKTESTING

Chapter 10 Bet Sizing

10.1 Motivation

10.2 Strategy-Independent Bet Sizing Approaches

10.3 Bet Sizing from Predicted Probabilities

10.4 Averaging Active Bets

10.5 Size Discretization

10.6 Dynamic Bet Sizes and Limit Prices

Exercises

References

Bibliography

Notes

Chapter 11 The Dangers of Backtesting

11.1 Motivation

11.2 Mission Impossible: The Flawless Backtest

11.3 Even If Your Backtest Is Flawless, It Is Probably Wrong

11.4 Backtesting Is Not a Research Tool

11.5 A Few General Recommendations

11.6 Strategy Selection

Exercises

References

Bibliography

Note

Chapter 12 Backtesting through Cross-Validation

12.1 Motivation

12.2 The Walk-Forward Method

12.3 The Cross-Validation Method

12.4 The Combinatorial Purged Cross-Validation Method

12.5 How Combinatorial Purged Cross-Validation Addresses Backtest Overfitting

Exercises

References

Chapter 13 Backtesting on Synthetic Data

13.1 Motivation

13.2 Trading Rules

13.3 The Problem

13.4 Our Framework

13.5 Numerical Determination of Optimal Trading Rules

13.6 Experimental Results

13.7 Conclusion

Exercises

References

Notes

Chapter 14 Backtest Statistics

14.1 Motivation

14.2 Types of Backtest Statistics

14.3 General Characteristics

14.4 Performance

14.5 Runs

14.6 Implementation Shortfall

14.7 Efficiency

14.8 Classification Scores

14.9 Attribution

Exercises

References

Bibliography

Notes

Chapter 15 Understanding Strategy Risk

15.1 Motivation

15.2 Symmetric Payouts

15.3 Asymmetric Payouts

15.4 The Probability of Strategy Failure

Exercises

References

Chapter 16 Machine Learning Asset Allocation

16.1 Motivation

16.2 The Problem with Convex Portfolio Optimization

16.3 Markowitz’s Curse

16.4 From Geometric to Hierarchical Relationships

16.5 A Numerical Example

16.6 Out-of-Sample Monte Carlo Simulations

16.7 Further Research

16.8 Conclusion

APPENDICES

16.A.1 Correlation-based Metric

16.A.2 Inverse Variance Allocation

16.A.3 Reproducing the Numerical Example

16.A.4 Reproducing the Monte Carlo Experiment

Exercises

References

Notes

PART 4 USEFUL FINANCIAL FEATURES

Chapter 17 Structural Breaks

17.1 Motivation

17.2 Types of Structural Break Tests

17.3 CUSUM Tests

17.4 Explosiveness Tests

Exercises

References

Chapter 18 Entropy Features

18.1 Motivation

18.2 Shannon’s Entropy

18.3 The Plug-in (or Maximum Likelihood) Estimator

18.4 Lempel-Ziv Estimators

18.5 Encoding Schemes

18.6 Entropy of a Gaussian Process

18.7 Entropy and the Generalized Mean

18.8 A Few Financial Applications of Entropy

Exercises

References

Bibliography

Note

Chapter 19 Microstructural Features

19.1 Motivation

19.2 Review of the Literature

19.3 First Generation: Price Sequences

19.4 Second Generation: Strategic Trade Models

19.5 Third Generation: Sequential Trade Models

19.6 Additional Features from Microstructural Datasets

19.7 What Is Microstructural Information?

Exercises

References

PART 5 HIGH-PERFORMANCE COMPUTING RECIPES

Chapter 20 Multiprocessing and Vectorization

20.1 Motivation

20.2 Vectorization Example

20.3 Single-Thread vs. Multithreading vs. Multiprocessing

20.4 Atoms and Molecules

20.5 Multiprocessing Engines

20.6 Multiprocessing Example

Exercises

Reference

Bibliography

Notes

Chapter 21 Brute Force and Quantum Computers

21.1 Motivation

21.2 Combinatorial Optimization

21.3 The Objective Function

21.4 The Problem

21.5 An Integer Optimization Approach

21.6 A Numerical Example

Exercises

References

Chapter 22 High-Performance Computational Intelligence and Forecasting Technologies

22.1 Motivation

22.2 Regulatory Response to the Flash Crash of 2010

22.3 Background

22.4 HPC Hardware

22.5 HPC Software

22.6 Use Cases

22.7 Summary and Call for Participation

22.8 Acknowledgments

References

Notes

Index

EULA

List of Tables

Chapter 1

Table 1.1

Table 1.2

Chapter 2

Table 2.1

Chapter 5

Table 5.1

Chapter 13

Table 13.1

Chapter 14

Table 14.1

Chapter 16

Table 16.1

Chapter 17

Table 17.1

List of Illustrations

Chapter 2

FIGURE 2.1

FIGURE 2.2

FIGURE 2.3

Chapter 3

FIGURE 3.1

FIGURE 3.2

Chapter 4

FIGURE 4.1

FIGURE 4.2

FIGURE 4.3

Chapter 5

FIGURE 5.1

FIGURE 5.2

FIGURE 5.3

FIGURE 5.4

FIGURE 5.5

Chapter 6

FIGURE 6.1

FIGURE 6.2

FIGURE 6.3

Chapter 7

FIGURE 7.1

FIGURE 7.2

FIGURE 7.3

Chapter 8

FIGURE 8.1

FIGURE 8.2

FIGURE 8.3

FIGURE 8.4

Chapter 9

FIGURE 9.1

FIGURE 9.2

Chapter 10

FIGURE 10.1

FIGURE 10.2

FIGURE 10.3

Chapter 11

FIGURE 11.1

FIGURE 11.2

Chapter 12

FIGURE 12.1

FIGURE 12.2

Chapter 13

FIGURE 13.1

FIGURE 13.2

FIGURE 13.3

FIGURE 13.4

FIGURE 13.5

FIGURE 13.6

FIGURE 13.7

FIGURE 13.8

FIGURE 13.9

FIGURE 13.10

FIGURE 13.11

FIGURE 13.12

FIGURE 13.13

FIGURE 13.14

FIGURE 13.15

FIGURE 13.16

FIGURE 13.17

FIGURE 13.18

FIGURE 13.19

FIGURE 13.20

FIGURE 13.21

FIGURE 13.22

FIGURE 13.23

FIGURE 13.24

FIGURE 13.25

Chapter 14

FIGURE 14.1

FIGURE 14.2

FIGURE 14.3

Chapter 15

FIGURE 15.1

FIGURE 15.2

FIGURE 15.3

Chapter 16

FIGURE 16.1

FIGURE 16.2

FIGURE 16.3

FIGURE 16.4

FIGURE 16.5

FIGURE 16.6

FIGURE 16.7

FIGURE 16.8

Chapter 17

FIGURE 17.1

FIGURE 17.2

FIGURE 17.3

Chapter 18

FIGURE 18.1

FIGURE 18.2

Chapter 19

FIGURE 19.1

FIGURE 19.2

FIGURE 19.3

Chapter 20

FIGURE 20.1

FIGURE 20.2

Chapter 21

FIGURE 21.1

Chapter 22

FIGURE 22.1

FIGURE 22.2

FIGURE 22.3

FIGURE 22.4

FIGURE 22.5

FIGURE 22.6

FIGURE 22.7

FIGURE 22.8

FIGURE 22.9

FIGURE 22.10


About the Author

Prof. Marcos López de Prado is the CIO of True Positive Technologies (TPT), and Professor of Practice at Cornell University’s School of Engineering. He has over 20 years of experience developing investment strategies with the help of machine learning algorithms and supercomputers. Marcos launched TPT after he sold some of his patents to AQR Capital Management, where he was a principal and AQR’s first head of machine learning. He also founded and led Guggenheim Partners’ Quantitative Investment Strategies business, where he managed up to $13 billion in assets, and delivered an audited risk-adjusted return (information ratio) of 2.3.

Concurrently with the management of investments, between 2011 and 2018 Marcos was a research fellow at Lawrence Berkeley National Laboratory (U.S. Department of Energy, Office of Science). He has published dozens of scientific articles on machine learning and supercomputing in the leading academic journals, is a founding co-editor of The Journal of Financial Data Science, and SSRN ranks him as the most-read author in economics.

Marcos earned two PhDs from Universidad Complutense de Madrid, in financial economics (2003) and in mathematical finance (2011), and is a recipient of Spain’s National Award for Academic Excellence (1999). He completed his post-doctoral research at Harvard University and Cornell University, where he is a faculty member. In 2019, Marcos received the ‘Quant of the Year Award’ from The Journal of Portfolio Management.

For more information, visit www.QuantResearch.org

PREAMBLE

CHAPTER 1 Financial Machine Learning as a Distinct Subject

1.1 MOTIVATION

Machine learning (ML) is changing virtually every aspect of our lives. Today ML algorithms accomplish tasks that until recently only expert humans could perform. As it relates to finance, this is the most exciting time to adopt a disruptive technology that will transform how everyone invests for generations. This book explains scientifically sound ML tools that have worked for me over the course of two decades, and have helped me to manage large pools of funds for some of the most demanding institutional investors.

Books about investments largely fall into one of two categories. On one hand, we find books written by authors who have not practiced what they teach. They contain extremely elegant mathematics that describes a world that does not exist. Just because a theorem is true in a logical sense does not mean it is true in a physical sense. On the other hand, we find books written by authors who offer explanations devoid of any rigorous academic theory. They misuse mathematical tools to describe actual observations. Their models are overfit and fail when implemented. Academic investigation and publication are divorced from practical application to financial markets, and many applications in the trading/investment world are not grounded in proper science.

A first motivation for writing this book is to cross the proverbial divide that separates academia and the industry. I have been on both sides of the rift, and I understand how difficult it is to cross it and how easy it is to get entrenched on one side. Virtue is in the balance. This book will not advocate a theory merely because of its mathematical beauty, and will not propose a solution just because it appears to work. My goal is to transmit the kind of knowledge that only comes from experience, formalized in a rigorous manner.

A second motivation is inspired by the desire that finance serves a purpose. Over the years some of my articles, published in academic journals and newspapers, have expressed my displeasure with the current role that finance plays in our society. Investors are lured to gamble their wealth on wild hunches originated by charlatans and encouraged by mass media. One day in the near future, ML will dominate finance, science will curtail guessing, and investing will not mean gambling. I would like the reader to play a part in that revolution.

A third motivation is that many investors fail to grasp the complexity of ML applications to investments. This seems to be particularly true for discretionary firms moving into the “quantamental” space. I am afraid their high expectations will not be met, not because ML failed, but because they used ML incorrectly. Over the coming years, many firms will invest with off-the-shelf ML algorithms, directly imported from academia or Silicon Valley, and my forecast is that they will lose money (to better ML solutions). Beating the wisdom of the crowds is harder than recognizing faces or driving cars. With this book my hope is that you will learn how to solve some of the challenges that make finance a particularly difficult playground for ML, like backtest overfitting. Financial ML is a subject in its own right, related to but separate from standard ML, and this book unravels it for you.

1.2 THE MAIN REASON FINANCIAL MACHINE LEARNING PROJECTS USUALLY FAIL

The rate of failure in quantitative finance is high, particularly so in financial ML. The few who succeed amass a large amount of assets and deliver consistently exceptional performance to their investors. However, that is a rare outcome, for reasons explained in this book. Over the past two decades, I have seen many faces come and go, firms started and shut down. In my experience, there is one critical mistake that underlies all those failures.

1.2.1 The Sisyphus Paradigm

Discretionary portfolio managers (PMs) make investment decisions that do not follow a particular theory or rationale (if there were one, they would be systematic PMs). They consume raw news and analyses, but mostly rely on their judgment or intuition. They may rationalize those decisions based on some story, but there is always a story for every decision. Because nobody fully understands the logic behind their bets, investment firms ask them to work independently from one another, in silos, to ensure diversification. If you have ever attended a meeting of discretionary PMs, you probably noticed how long and aimless they can be. Each attendee seems obsessed with one particular piece of anecdotal information, and giant argumentative leaps are made without fact-based, empirical evidence. This does not mean that discretionary PMs cannot be successful. On the contrary, a few of them are. The point is, they cannot naturally work as a team. Bring 50 discretionary PMs together, and they will influence one another until eventually you are paying 50 salaries for the work of one. Thus it makes sense for them to work in silos so they interact as little as possible.

Wherever I have seen that formula applied to quantitative or ML projects, it has led to disaster. The boardroom’s mentality is, let us do with quants what has worked with discretionary PMs. Let us hire 50 PhDs and demand that each of them produce an investment strategy within six months. This approach always backfires, because each PhD will frantically search for investment opportunities and eventually settle for (1) a false positive that looks great in an overfit backtest or (2) standard factor investing, which is an overcrowded strategy with a low Sharpe ratio, but at least has academic support. Both outcomes will disappoint the investment board, and the project will be cancelled. Even if 5 of those PhDs identified a true discovery, the profits would not suffice to cover the expenses of 50, so those 5 will relocate somewhere else, searching for a proper reward.
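The multiple-testing trap behind outcome (1) can be illustrated with a short simulation (a sketch with hypothetical numbers, not from the book): give 50 researchers random, skill-free return streams, and the best backtest among them will still look attractive purely by chance.

```python
import numpy as np

# Hypothetical illustration of the "50 PhDs" scenario: 50 strategies with
# zero true skill, each backtested over one year of daily returns.
rng = np.random.default_rng(0)
n_strategies, n_days = 50, 252
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Annualized Sharpe ratio of each backtest. With zero skill and one year of
# daily data, each estimate is roughly N(0, 1), so the maximum across 50
# independent trials is about 2 in expectation.
sharpe = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(f"best backtested Sharpe: {sharpe.max():.2f}")
print(f"mean Sharpe across all: {sharpe.mean():.2f}")
```

Reporting only the single best of many independent backtests thus inflates the Sharpe ratio: that is precisely the false positive that "looks great in an overfit backtest."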

1.2.2 The Meta-Strategy Paradigm

If you have been asked to develop ML strategies on your own, the odds are stacked against you. It takes almost as much effort to produce one true investment strategy as to produce a hundred, and the complexities are overwhelming: data curation and processing, HPC infrastructure, software development, feature analysis, execution simulators, backtesting, etc. Even if the firm provides you with shared services in those areas, you are like a worker at a BMW factory who has been asked to build an entire car by using all the workshops around you. One week you need to be a master welder, another week an electrician, another week a mechanical engineer, another week a painter . . . You will try, fail, and circle back to welding. How does that make sense?

Every successful quantitative firm I am aware of applies the meta-strategy paradigm (López de Prado [2014]). Accordingly, this book was written as a research manual for teams, not for individuals. Through its chapters you will learn how to set up a research factory, as well as the various stations of the assembly line. The role of each quant is to specialize in a particular task, to become the best there is at it, while keeping a holistic view of the entire process. This book outlines the factory plan, where teamwork yields discoveries at a predictable rate, with no reliance on lucky breaks. This is how Berkeley Lab and other U.S. National Laboratories routinely make scientific discoveries, such as adding 16 elements to the periodic table, or laying the groundwork for MRIs and PET scans.1 No particular individual is responsible for these discoveries, as they are the outcome of team efforts where everyone contributes. Of course, setting up these financial laboratories takes time, and requires people who know what they are doing and have done it before. But what do you think has a higher chance of success: this proven paradigm of organized collaboration, or the Sisyphean alternative of having every single quant rolling their immense boulder up the mountain?

1.3 BOOK STRUCTURE

This book disentangles a web of interconnected topics and presents them in an ordered fashion. Each chapter assumes that you have read the previous ones. Part 1 will help you structure your financial data in a way that is amenable to ML algorithms. Part 2 discusses how to do research with ML algorithms on that data. Here the emphasis is on doing research and making an actual discovery through a scientific process, as opposed to searching aimlessly until some serendipitous (likely false) result pops up. Part 3 explains how to backtest your discovery and evaluate the probability that it is false.

These three parts give an overview of the entire process, from data analysis to model research to discovery evaluation. With that knowledge, Part 4 goes back to the data and explains innovative ways to extract informative features. Finally, much of this work requires a lot of computational power, so Part 5 wraps up the book with some useful HPC recipes.

1.3.1 Structure by Production Chain

Mining gold or silver was a relatively straightforward endeavor during the 16th and 17th centuries. In less than a hundred years, the Spanish treasure fleet quadrupled the amount of precious metals in circulation throughout Europe. Those times are long gone, and today prospectors must deploy complex industrial methods to extract microscopic bullion particles out of tons of earth. That does not mean that gold production is at historical lows. On the contrary, nowadays miners extract 2,500 metric tons of microscopic gold every year, compared to the average annual 1.54 metric tons taken by the Spanish conquistadors throughout the entire 16th century!2 Visible gold is an infinitesimal portion of the overall amount of gold on Earth. El Dorado was always there . . . if only Pizarro could have exchanged the sword for a microscope.

The discovery of investment strategies has undergone a similar evolution. If a decade ago it was relatively common for an individual to discover macroscopic alpha (i.e., using simple mathematical tools like econometrics), currently the chances of that happening are quickly converging to zero. Individuals searching nowadays for macroscopic alpha, regardless of their experience or knowledge, are fighting overwhelming odds. The only true alpha left is microscopic, and finding it requires capital-intensive industrial methods. Just like with gold, microscopic alpha does not mean smaller overall profits. Microscopic alpha today is much more abundant than macroscopic alpha has ever been in history. There is a lot of money to be made, but you will need to use heavy ML tools.

Let us review some of the stations involved in the chain of production within a modern asset manager.

1.3.1.1 Data Curators

This is the station responsible for collecting, cleaning, indexing, storing, adjusting, and delivering all data to the production chain. The values could be tabulated or hierarchical, aligned or misaligned, historical or real-time feeds, etc. Team members are experts in market microstructure and data protocols such as FIX. They must develop the data handlers needed to understand the context in which that data arises. For example, was a quote cancelled and replaced at a different level, or cancelled without replacement? Each asset class has its own nuances. For instance, bonds are routinely exchanged or recalled; stocks are subjected to splits, reverse-splits, voting rights, etc.; futures and options must be rolled; currencies are not traded in a centralized order book. The degree of specialization involved in this station is beyond the scope of this book, and Chapter 1 will discuss only a few aspects of data curation.

1.3.1.2 Feature Analysts

This is the station responsible for transforming raw data into informative signals. These informative signals have some predictive power over financial variables. Team members are experts in information theory, signal extraction and processing, visualization, labeling, weighting, classifiers, and feature importance techniques. For example, feature analysts may discover that the probability of a sell-off is particularly high when: (1) quoted offers are cancelled-replaced with market sell orders, and (2) quoted buy orders are cancelled-replaced with limit buy orders deeper in the book. Such a finding is not an investment strategy on its own, and can be used in alternative ways: execution, monitoring of liquidity risk, market making, position taking, etc. A common error is to believe that feature analysts develop strategies. Instead, feature analysts collect and catalogue libraries of findings that can be useful to a multiplicity of stations. Chapters 2–9 and 17–19 are dedicated to this all-important station.

1.3.1.3 Strategists

In this station, informative features are transformed into actual investment algorithms. A strategist will parse through the libraries of features looking for ideas to develop an investment strategy. These features were discovered by different analysts studying a wide range of instruments and asset classes. The goal of the strategist is to make sense of all these observations and to formulate a general theory that explains them. Therefore, the strategy is merely the experiment designed to test the validity of this theory. Team members are data scientists with a deep knowledge of financial markets and the economy. Remember, the theory needs to explain a large collection of important features. In particular, a theory must identify the economic mechanism that causes an agent to lose money to us. Is it a behavioral bias? Asymmetric information? Regulatory constraints? Features may be discovered by a black box, but the strategy is developed in a white box. Gluing together a number of catalogued features does not constitute a theory. Once a strategy is finalized, the strategists will prepare code that utilizes the full algorithm and submit that prototype to the backtesting team described below. Chapters 10 and 16 are dedicated to this station, with the understanding that it would be unreasonable for a book to reveal specific investment strategies.

1.3.1.4 Backtesters

This station assesses the profitability of an investment strategy under various scenarios. One of the scenarios of interest is how the strategy would perform if history repeated itself. However, the historical path is merely one of the possible outcomes of a stochastic process, and not necessarily the most likely going forward. Alternative scenarios must be evaluated, consistent with the knowledge of the weaknesses and strengths of a proposed strategy. Team members are data scientists with a deep understanding of empirical and experimental techniques. A good backtester incorporates in his analysis meta-information regarding how the strategy came about. In particular, his analysis must evaluate the probability of backtest overfitting by taking into account the number of trials it took to distill the strategy. The results of this evaluation will not be reused by other stations, for reasons that will become apparent in Chapter 11. Instead, backtest results are communicated to management and not shared with anyone else. Chapters 11–16 discuss the analyses carried out by this station.

1.3.1.5 Deployment Team

The deployment team is tasked with integrating the strategy code into the production line. Some components may be reused by multiple strategies, especially when they share common features. Team members are algorithm specialists and hardcore mathematical programmers. Part of their job is to ensure that the deployed solution is logically identical to the prototype they received. It is also the deployment team's responsibility to optimize the implementation so that production latency is minimized. As production calculations are often time sensitive, this team will rely heavily on process schedulers, automation servers (Jenkins), vectorization, multithreading, multiprocessing, graphics processing units (GPUs), distributed computing (Hadoop), high-performance computing (Slurm), and parallel computing techniques in general. Chapters 20–22 touch on various aspects interesting to this station, as they relate to financial ML.

1.3.1.6 Portfolio Oversight

Once a strategy is deployed, it follows a cursus honorum, a lifecycle that entails the following stages:

Embargo:

Initially, the strategy is run on data observed after the end date of the backtest. Such a period may have been reserved by the backtesters, or it may be the result of implementation delays. If embargoed performance is consistent with backtest results, the strategy is promoted to the next stage.

Paper trading:

At this point, the strategy is run on a live, real-time feed. In this way, performance will account for data parsing latencies, calculation latencies, execution delays, and other time lapses between observation and positioning. Paper trading will take place for as long as it is needed to gather enough evidence that the strategy performs as expected.

Graduation:

At this stage, the strategy manages a real position, whether in isolation or as part of an ensemble. Performance is evaluated precisely, including attributed risk, returns, and costs.

Re-allocation:

Based on the production performance, the allocation to graduated strategies is re-assessed frequently and automatically in the context of a diversified portfolio. In general, a strategy’s allocation follows a concave function. The initial allocation (at graduation) is small. As time passes, and the strategy performs as expected, the allocation is increased. Over time, performance decays, and allocations become gradually smaller.

Decommission:

Eventually, all strategies are discontinued. This happens when they perform below expectations for a sufficiently extended period of time to conclude that the supporting theory is no longer backed by empirical evidence.

In general, it is preferable to release new variations of a strategy and run them in parallel with old versions. Each version will go through the above lifecycle, and old strategies will receive smaller allocations as a matter of diversification, while taking into account the degree of confidence derived from their longer track record.

1.3.2 Structure by Strategy Component

Many investment managers believe that the secret to riches is to implement an extremely complex ML algorithm. They are setting themselves up for disappointment. If it were as easy as coding a state-of-the-art classifier, most people in Silicon Valley would be billionaires. A successful investment strategy is the result of multiple factors. Table 1.1 summarizes which chapters will help you address each of the challenges involved in developing a successful investment strategy.

TABLE 1.1 Overview of the Challenges Addressed by Every Chapter

Part  Chapter  Fin. data  Software  Hardware  Math  Meta-Strat  Overfitting
1     2        X          X
1     3        X          X
1     4        X          X
1     5        X          X         X
2     6                   X
2     7        X          X         X
2     8        X          X
2     9        X          X
3     10                  X                         X
3     11       X          X         X
3     12       X          X         X
3     13       X          X         X
3     14       X          X         X
3     15       X          X         X
3     16                  X                   X     X           X
4     17       X          X                   X
4     18       X          X         X
4     19       X          X
5     20       X          X         X
5     21                  X         X         X
5     22                  X         X         X

Throughout the book, you will find many references to journal articles I have published over the years. Rather than repeating myself, I will often refer you to one of them, where you will find a detailed analysis of the subject at hand. All of my cited papers can be downloaded for free, in pre-print format, from my website: www.QuantResearch.org.

1.3.2.1 Data

Problem: Garbage in, garbage out.

Solution: Work with unique, hard-to-manipulate data. If you are the only user of this data, whatever its value, it is all for you.

How:

Chapter 2: Structure your data correctly.

Chapter 3: Produce informative labels.

Chapters 4 and 5: Model non-IID series properly.

Chapters 17–19: Find predictive features.

1.3.2.2 Software

Problem: A specialized task requires customized tools.

Solution: Develop your own classes. Using popular libraries means more competitors tapping the same well.

How:

Chapters 2–22: Throughout the book, for each chapter, we develop our own functions. For your particular problems, you will have to do the same, following the examples in the book.

1.3.2.3 Hardware

Problem: ML involves some of the most computationally intensive tasks in all of mathematics.

Solution: Become an HPC expert. If possible, partner with a National Laboratory to build a supercomputer.

How:

Chapters 20 and 22: Learn how to think in terms of multiprocessing architectures. Whenever you code a library, structure it in such a way that functions can be called in parallel. You will find plenty of examples in the book.

Chapter 21: Develop algorithms for quantum computers.

1.3.2.4 Math

Problem: Mathematical proofs can take years, decades, and centuries. No investor will wait that long.

Solution: Use experimental math. Solve hard, intractable problems, not by proof but by experiment. For example, Bailey, Borwein, and Plouffe [1997] found a spigot algorithm for π without proof, against the prior perception that such a mathematical finding would not be possible.
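To make the experimental-math point concrete, here is a quick sketch (mine, not the book's) that sums the Bailey–Borwein–Plouffe series; its remarkable digit-extraction property aside, even naive summation converges to π extremely fast:

```python
import math

# BBP series: pi = sum_{k>=0} 16^{-k} * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6))
def bbp_pi(n_terms: int = 12) -> float:
    """Sum the first n_terms of the BBP series for pi."""
    total = 0.0
    for k in range(n_terms):
        total += 16.0**-k * (
            4 / (8*k + 1) - 2 / (8*k + 4) - 1 / (8*k + 5) - 1 / (8*k + 6)
        )
    return total

print(abs(bbp_pi() - math.pi) < 1e-12)  # → True: 12 terms already match pi to ~14 digits
```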

How:

Chapter 5: Familiarize yourself with memory-preserving data transformations.

Chapters 11–15: There are experimental methods to assess the value of your strategy, with greater reliability than a historical simulation.

Chapter 16: An algorithm that is optimal in-sample can perform poorly out-of-sample. There is no mathematical proof for investment success. Rely on experimental methods to lead your research.

Chapters 17 and 18: Apply methods to detect structural breaks, and quantify the amount of information carried by financial series.

Chapter 20: Learn queuing methods for distributed computing so that you can break apart complex tasks and speed up calculations.
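A single-machine analogue of those queuing ideas (a hedged sketch using only the standard library, not the book's implementation): push independent tasks onto a thread-safe queue and let a pool of workers drain it:

```python
import queue
import threading

def worker(tasks, results, lock):
    """Drain the task queue; the squaring is a stand-in for an expensive calculation."""
    while True:
        try:
            x = tasks.get_nowait()
        except queue.Empty:
            return  # queue drained; worker exits
        y = x * x
        with lock:
            results.append(y)

tasks = queue.Queue()
for i in range(10):
    tasks.put(i)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(tasks, results, lock)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because the tasks are independent, workers finish in arbitrary order; the queue is what breaks the complex job apart and balances the load.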

Chapter 21: Become familiar with discrete methods, used among others by quantum computers, to solve intractable problems.

1.3.2.5 Meta-Strategies

Problem: Amateurs develop individual strategies, believing that there is such a thing as a magical formula for riches. In contrast, professionals develop methods to mass-produce strategies. The money is not in making a car, it is in making a car factory.

Solution: Think like a business. Your goal is to run a research lab like a factory, where true discoveries are not born out of inspiration, but out of methodic hard work. That was the philosophy of physicist Ernest Lawrence, the founder of the first U.S. National Laboratory.

How:

Chapters 7–9: Build a research process that identifies features relevant across asset classes, while dealing with multi-collinearity of financial features.

Chapter 10: Combine multiple predictions into a single bet.

Chapter 16: Allocate funds to strategies using a robust method that performs well out-of-sample.

1.3.2.6 Overfitting

Problem: Standard cross-validation methods fail in finance. Most discoveries in finance are false, due to multiple testing and selection bias.

Solution:

Whatever you do, always ask yourself in what way you may be overfitting. Be skeptical about your own work, and constantly challenge yourself to prove that you are adding value.

Overfitting is unethical. It leads to promising outcomes that cannot be delivered. When done knowingly, overfitting is outright scientific fraud. The fact that many academics do it does not make it right: They are not risking anyone's wealth, not even their own.

It is also a waste of your time, resources, and opportunities. Besides, the industry only pays for out-of-sample returns. You will only succeed after you have created substantial wealth for your investors.

How:

Chapters 11–15: There are three backtesting paradigms, of which historical simulation is only one. Each backtest is always overfit to some extent, and it is critical to learn to quantify by how much.

Chapter 16: Learn robust techniques for asset allocation that do not overfit in-sample signals at the expense of out-of-sample performance.

1.3.3 Structure by Common Pitfall

Despite its many advantages, ML is no panacea. The flexibility and power of ML techniques have a dark side. When misused, ML algorithms will confuse statistical flukes with patterns. This fact, combined with the low signal-to-noise ratio that characterizes finance, all but ensures that careless users will produce false discoveries at an ever-greater speed. This book exposes some of the most pervasive errors made by ML experts when they apply their techniques on financial datasets. Some of these pitfalls are listed in Table 1.2, with solutions that are explained in the indicated chapters.

TABLE 1.2 Common Pitfalls in Financial ML

#   Category         Pitfall                                Solution                                                  Chapter
1   Epistemological  The Sisyphus paradigm                  The meta-strategy paradigm                                1
2   Epistemological  Research through backtesting           Feature importance analysis                               8
3   Data processing  Chronological sampling                 The volume clock                                          2
4   Data processing  Integer differentiation                Fractional differentiation                                5
5   Classification   Fixed-time horizon labeling            The triple-barrier method                                 3
6   Classification   Learning side and size simultaneously  Meta-labeling                                             3
7   Classification   Weighting of non-IID samples           Uniqueness weighting; sequential bootstrapping            4
8   Evaluation       Cross-validation leakage               Purging and embargoing                                    7, 9
9   Evaluation       Walk-forward (historical) backtesting  Combinatorial purged cross-validation                     11, 12
10  Evaluation       Backtest overfitting                   Backtesting on synthetic data; the deflated Sharpe ratio  10–16

1.4 TARGET AUDIENCE

This book presents advanced ML methods specifically designed to address the challenges posed by financial datasets. By “advanced” I do not mean extremely difficult to grasp, or explaining the latest reincarnation of deep, recurrent, or convolutional neural networks. Instead, the book answers questions that senior researchers, who have experience applying ML algorithms to financial problems, will recognize as critical. If you are new to ML, and you do not have experience working with complex algorithms, this book may not be for you (yet). Unless you have confronted in practice the problems discussed in these chapters, you may have difficulty understanding the utility of solving them. Before reading this book, you may want to study several excellent introductory ML books published in recent years. I have listed a few of them in the references section.

The core audience of this book is investment professionals with a strong ML background. My goals are that you monetize what you learn in this book, help us modernize finance, and deliver actual value for investors.

This book also targets data scientists who have successfully implemented ML algorithms in a variety of fields outside finance. If you have worked at Google and have applied deep neural networks to face recognition, but things do not seem to work so well when you run your algorithms on financial data, this book will help you. Sometimes you may not understand the financial rationale behind some structures (e.g., meta-labeling, the triple-barrier method, fracdiff), but bear with me: Once you have managed an investment portfolio long enough, the rules of the game will become clearer to you, along with the meaning of these chapters.

1.5 REQUISITES

Investment management is one of the most multi-disciplinary areas of research, and this book reflects that fact. Understanding the various sections requires a practical knowledge of ML, market microstructure, portfolio management, mathematical finance, statistics, econometrics, linear algebra, convex optimization, discrete math, signal processing, information theory, object-oriented programming, parallel processing, and supercomputing.

Python has become the de facto standard language for ML, and I have to assume that you are an experienced developer. You must be familiar with scikit-learn (sklearn), pandas, numpy, scipy, multiprocessing, matplotlib, and a few other libraries. Code snippets invoke functions from these libraries using their conventional prefixes: pd for pandas, np for numpy, mpl for matplotlib, etc. There are numerous books on each of these libraries, and you cannot know enough about the specifics of each one. Throughout the book we will discuss some issues with their implementation, including unresolved bugs to keep in mind.
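For illustration, the conventional aliases look as follows (the toy returns computation is mine, purely to show the prefixes in action):

```python
import numpy as np   # np: numerical arrays
import pandas as pd  # pd: tabular data and time series
# import matplotlib.pyplot as mpl  # mpl: plotting (uncomment where matplotlib is installed)

prices = pd.Series([100.0, 101.0, 99.5])
returns = prices.pct_change().dropna()  # simple returns from a price series
print(np.round(returns.values, 4))
```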

1.6 FAQs

How can ML algorithms be useful in finance?

Many financial operations require making decisions based on pre-defined rules, like option pricing, algorithmic execution, or risk monitoring. This is where the bulk of automation has taken place so far, transforming the financial markets into ultra-fast, hyper-connected networks for exchanging information. In performing these tasks, machines were asked to follow the rules as fast as possible. High-frequency trading is a prime example. See Easley, López de Prado, and O’Hara [2013] for a detailed treatment of the subject.

The algorithmization of finance is unstoppable. Between June 12, 1968, and December 31, 1968, the NYSE was closed every Wednesday, so that the back office could catch up with paperwork. Can you imagine that? We live in a different world today, and in 10 years things will be even better, because the next wave of automation does not involve following rules, but making judgment calls. As emotional beings, subject to fears, hopes, and agendas, humans are not particularly good at making fact-based decisions, particularly when those decisions involve conflicts of interest. In those situations, investors are better served when a machine makes the calls, based on facts learned from hard data. This applies not only to investment strategy development, but to virtually every area of financial advice: granting a loan, rating a bond, classifying a company, recruiting talent, predicting earnings, forecasting inflation, etc. Furthermore, machines will comply with the law, always, when programmed to do so. If a dubious decision is made, investors can go back to the logs and understand exactly what happened. It is much easier to improve an algorithmic investment process than one that relies entirely on humans.

How can ML algorithms beat humans at investing?

Do you remember when people were certain that computers would never beat humans at chess? Or Jeopardy!? Poker? Go? Millions of years of evolution (a genetic algorithm) have fine-tuned our ape brains to survive in a hostile 3-dimensional world where the laws of nature are static. Now, when it comes to identifying subtle patterns in a high-dimensional world, where the rules of the game change every day, all that fine-tuning turns out to be detrimental. An ML algorithm can spot patterns in a 100-dimensional world as easily as in our familiar 3-dimensional one. And while we all laugh when we see an algorithm make a silly mistake, keep in mind that algorithms have been around for only a tiny fraction of our millions of years. Every day they get better at this; we do not. Humans are slow learners, which puts us at a disadvantage in a fast-changing world like finance.

Does that mean that there is no space left for human investors?

Not at all. No human is better at chess than a computer. And no computer is better at chess than a human supported by a computer. Discretionary PMs are at a disadvantage when betting against an ML algorithm, but it is possible that the best results are achieved by combining discretionary PMs with ML algorithms. This is what has come to be known as the “quantamental” way. Throughout the book you will find techniques that can be used by quantamental teams, that is, methods that allow you to combine human guesses (inspired by fundamental variables) with mathematical forecasts. In particular, Chapter 3 introduces a new technique called meta-labeling, which allows you to add an ML layer on top of a discretionary one.

How does financial ML differ from econometrics?

Econometrics is the application of classical statistical methods to economic and financial series. The essential tool of econometrics is multivariate linear regression, an 18th-century technology that was already mastered by Gauss before 1794 (Stigler [1981]). Standard econometric models do not learn. It is hard to believe that something as complex as 21st-century finance could be grasped by something as simple as inverting a covariance matrix.
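The "technology" in question fits in a few lines of numpy: ordinary least squares via the normal equations, β = (XᵀX)⁻¹Xᵀy (the simulated data below are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # 500 observations, 3 regressors
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=500)   # linear model plus noise

# Invert the (scaled) covariance matrix X'X -- the heart of classical econometrics
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.allclose(beta_hat, true_beta, atol=0.05))  # → True
```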

Every empirical science must build theories based on observation. If the statistical toolbox used to model these observations is linear regression, the researcher will fail to recognize the complexity of the data, and the theories will be awfully simplistic and useless. I have no doubt in my mind: econometrics is a primary reason economics and finance have not experienced meaningful progress over the past decades (Calkin and López de Prado [2014a, 2014b]).

For centuries, medieval astronomers made observations and developed theories about celestial mechanics. These theories never considered non-circular orbits, because they were deemed unholy and beneath God's plan. The prediction errors were so gross that ever more complex theories had to be devised to account for them. It was not until Kepler had the temerity to consider non-circular (elliptical) orbits that, all of a sudden, a much simpler general model was able to predict the position of the planets with astonishing accuracy. What if astronomers had never considered non-circular orbits? Well . . . what if economists finally started to consider non-linear functions? Where is our Kepler? Finance does not have a Principia because no Kepler means no Newton.

Financial ML methods do not replace theory. They guide it. An ML algorithm learns complex patterns in a high-dimensional space without being specifically directed. Once we understand what features are predictive of a phenomenon, we can build a theoretical explanation, which can be tested on an independent dataset. Students of economics and finance would do well enrolling in ML courses, rather than econometrics. Econometrics may be good enough to succeed in financial academia (for now), but succeeding in business requires ML.

What do you say to people who dismiss ML algorithms as black boxes?

If you are reading this book, chances are ML algorithms are white boxes to you. They are transparent, well-defined, crystal-clear, pattern-recognition functions. Most people do not have your knowledge, and to them ML is like a magician’s box: “Where did that rabbit come from? How are you tricking us, witch?” People mistrust what they do not understand. Their prejudices are rooted in ignorance, for which the Socratic remedy is simple: education. Besides, some of us enjoy using our brains, even though neuroscientists still have not figured out exactly how they work (a black box in itself).

From time to time you will encounter Luddites, who are beyond redemption. Ned Ludd was a weaver from Leicester, England, who in 1779 smashed two knitting frames in an outrage. With the advent of the industrial revolution, mobs infuriated by mechanization sabotaged and destroyed all machinery they could find. Textile workers ruined so much industrial equipment that Parliament had to pass laws making “machine breaking” a capital crime. Between 1811 and 1816, large parts of England were in open rebellion, to the point that there were more British troops fighting Luddites than there were fighting Napoleon on the Iberian Peninsula. The Luddite rebellion ended with brutal suppression through military force. Let us hope that the black box movement does not come to that.

Why don’t you discuss specific ML algorithms?

The book is agnostic with regards to the particular ML algorithm you choose. Whether you use convolutional neural networks, AdaBoost, RFs, SVMs, and so on, there are many shared generic problems you will face: data structuring, labeling, weighting, stationary transformations, cross-validation, feature selection, feature importance, overfitting, backtesting, etc. In the context of financial modeling, answering these questions is non-trivial, and framework-specific approaches need to be developed. That is the focus of this book.

What other books do you recommend on this subject?

To my knowledge, this is the first book to provide a complete and systematic treatment of ML methods specific for finance: starting with a chapter dedicated to financial data structures, another chapter for labeling of financial series, another for sample weighting, time series differentiation, . . .