The Book of Alternative Data - Alexander Denev - E-Book

Alexander Denev

Description

The first and only book to systematically address the methodologies and processes of leveraging non-traditional information sources in the context of investing and risk management.

Harnessing non-traditional data sources to generate alpha, analyze markets, and forecast risk is a subject of intense interest for financial professionals. A growing number of regularly held conferences on alternative data are being established, complemented by an upsurge in new papers on the subject. Alternative data is steadily being incorporated by conventional institutional investors and risk managers throughout the financial world. However, methodologies to analyze and extract value from alternative data, along with guidance on how to source data and integrate data flows within existing systems, have not previously been treated in the literature. Filling this significant gap in knowledge, The Book of Alternative Data is the first and only book to offer a coherent, systematic treatment of the subject. This groundbreaking volume provides readers with a roadmap for navigating the complexities of an array of alternative data sources and delivers the appropriate techniques to analyze them. The authors, leading experts in financial modeling, machine learning, and quantitative research and analytics, employ a step-by-step approach to guide readers through the dense jungle of generated data.

A first-of-its-kind treatment of alternative data types, sources, and methodologies, this innovative book:

* Provides an integrated modeling approach to extract value from multiple types of datasets
* Treats the processes needed to make alternative data signals operational
* Helps investors and risk managers rethink how they engage with alternative datasets
* Features practical use case studies in many different financial markets and real-world techniques
* Describes how to avoid potential pitfalls and missteps in starting the alternative data journey
* Explains how to integrate information from different datasets to maximize informational value

The Book of Alternative Data is an indispensable resource for anyone wishing to analyze or monetize non-traditional datasets, including Chief Investment Officers, Chief Risk Officers, risk professionals, investment professionals, traders, economists, and machine learning developers and users.


Page count: 760

Publication year: 2020




Table of Contents

Cover

Preface

Acknowledgments

PART 1: Introduction and Theory

CHAPTER 1: Alternative Data: The Lay of the Land

1.1 INTRODUCTION

1.2 WHAT IS “ALTERNATIVE DATA”?

1.3 SEGMENTATION OF ALTERNATIVE DATA

1.4 THE MANY VS OF BIG DATA

1.5 WHY ALTERNATIVE DATA?

1.6 WHO IS USING ALTERNATIVE DATA?

1.7 CAPACITY OF A STRATEGY AND ALTERNATIVE DATA

1.8 ALTERNATIVE DATA DIMENSIONS

1.9 WHO ARE THE ALTERNATIVE DATA VENDORS?

1.10 USAGE OF ALTERNATIVE DATASETS ON THE BUY SIDE

1.11 CONCLUSION

NOTES

CHAPTER 2: The Value of Alternative Data

2.1 INTRODUCTION

2.2 THE DECAY OF INVESTMENT VALUE

2.3 DATA MARKETS

2.4 THE MONETARY VALUE OF DATA (PART I)

2.5 EVALUATING (ALTERNATIVE) DATA STRATEGIES WITH AND WITHOUT BACKTESTING

2.6 THE MONETARY VALUE OF DATA (PART II)

2.7 THE ADVANTAGES OF MATURING ALTERNATIVE DATASETS

2.8 SUMMARY

NOTES

CHAPTER 3: Alternative Data Risks and Challenges

3.1 LEGAL ASPECTS OF DATA

3.2 RISKS OF USING ALTERNATIVE DATA

3.3 CHALLENGES OF USING ALTERNATIVE DATA

3.4 AGGREGATING THE DATA

3.5 SUMMARY

NOTES

CHAPTER 4: Machine Learning Techniques

4.1. INTRODUCTION

4.2. MACHINE LEARNING: DEFINITIONS AND TECHNIQUES

4.3. WHICH TECHNIQUE TO CHOOSE?

4.4. ASSUMPTIONS AND LIMITATIONS OF THE MACHINE LEARNING TECHNIQUES

4.5. STRUCTURING IMAGES

4.6. NATURAL LANGUAGE PROCESSING (NLP)

4.7. SUMMARY

NOTES

CHAPTER 5: The Processes behind the Use of Alternative Data

5.1. INTRODUCTION

5.2. STEPS IN THE ALTERNATIVE DATA JOURNEY

5.3. STRUCTURING TEAMS TO USE ALTERNATIVE DATA

5.4. DATA VENDORS

5.5. SUMMARY

NOTES

CHAPTER 6: Factor Investing

6.1. INTRODUCTION

6.2. FACTOR MODELS

6.3. THE DIFFERENCE BETWEEN CROSS-SECTIONAL AND TIME SERIES TRADING APPROACHES

6.4. WHY FACTOR INVESTING?

6.5. SMART BETA INDICES USING ALTERNATIVE DATA INPUTS

6.6. ESG FACTORS

6.7. DIRECT AND INDIRECT PREDICTION

6.8. SUMMARY

NOTES

PART 2: Practical Applications

CHAPTER 7: Missing Data: Background

7.1. INTRODUCTION

7.2. MISSING DATA CLASSIFICATION

7.3. LITERATURE OVERVIEW OF MISSING DATA TREATMENTS

7.4. SUMMARY

NOTES

CHAPTER 8: Missing Data: Case Studies

8.1. INTRODUCTION

8.2. CASE STUDY: IMPUTING MISSING VALUES IN MULTIVARIATE CREDIT DEFAULT SWAP TIME SERIES

8.3. CASE STUDY: SATELLITE IMAGES

8.4. SUMMARY

8.5. APPENDIX: GENERAL DESCRIPTION OF THE MICE PROCEDURE

8.6. APPENDIX: SOFTWARE LIBRARIES USED IN THIS CHAPTER

NOTES

CHAPTER 9: Outliers (Anomalies)

9.1. INTRODUCTION

9.2. OUTLIERS DEFINITION, CLASSIFICATION, AND APPROACHES TO DETECTION

9.3. TEMPORAL STRUCTURE

9.4. GLOBAL VERSUS LOCAL OUTLIERS, POINT ANOMALIES, AND MICRO-CLUSTERS

9.5. OUTLIER DETECTION PROBLEM SETUP

9.6. COMPARATIVE EVALUATION OF OUTLIER DETECTION ALGORITHMS

9.7. APPROACHES TO OUTLIER EXPLANATION

9.8. CASE STUDY: OUTLIER DETECTION ON FED COMMUNICATIONS INDEX

9.9. SUMMARY

9.10. APPENDIX

NOTES

CHAPTER 10: Automotive Fundamental Data

10.1. INTRODUCTION

10.2. DATA

10.3. APPROACH 1: INDIRECT APPROACH

10.4. APPROACH 2: DIRECT APPROACH

10.5. GAUSSIAN PROCESSES EXAMPLE

10.6. SUMMARY

10.7. APPENDIX

NOTES

CHAPTER 11: Surveys and Crowdsourced Data

11.1. INTRODUCTION

11.2. SURVEY DATA AS ALTERNATIVE DATA

11.3. THE DATA

11.4. THE PRODUCT

11.5. CASE STUDIES

11.6. SOME TECHNICAL CONSIDERATIONS ON SURVEYS

11.7. CROWDSOURCING ANALYST ESTIMATES SURVEY

11.8. ALPHA CAPTURE DATA

11.9. SUMMARY

11.10. APPENDIX

NOTES

CHAPTER 12: Purchasing Managers' Index

12.1. INTRODUCTION

12.2. PMI PERFORMANCE

12.3. NOWCASTING GDP GROWTH

12.4. IMPACTS ON FINANCIAL MARKETS

12.5. SUMMARY

NOTES

CHAPTER 13: Satellite Imagery and Aerial Photography

13.1. INTRODUCTION

13.2. FORECASTING US EXPORT GROWTH

13.3. CAR COUNTS AND EARNINGS PER SHARE FOR RETAILERS

13.4. MEASURING CHINESE PMI MANUFACTURING WITH SATELLITE DATA

13.5. SUMMARY

CHAPTER 14: Location Data

14.1. INTRODUCTION

14.2. SHIPPING DATA TO TRACK CRUDE OIL SUPPLIES

14.3. MOBILE PHONE LOCATION DATA TO UNDERSTAND RETAIL ACTIVITY

14.4. TAXI RIDE DATA AND NEW YORK FED MEETINGS

14.5. CORPORATE JET LOCATION DATA AND M&A

14.6. SUMMARY

NOTE

CHAPTER 15: Text, Web, Social Media, and News

15.1. INTRODUCTION

15.2. COLLECTING WEB DATA

15.3. SOCIAL MEDIA

15.4. NEWS

15.5. OTHER WEB SOURCES

15.6. SUMMARY

NOTES

CHAPTER 16: Investor Attention

16.1. INTRODUCTION

16.2. READERSHIP OF PAYROLLS TO MEASURE INVESTOR ATTENTION

16.3. GOOGLE TRENDS DATA TO MEASURE MARKET THEMES

16.4. INVESTOPEDIA SEARCH DATA TO MEASURE INVESTOR ANXIETY

16.5. USING WIKIPEDIA TO UNDERSTAND PRICE ACTION IN CRYPTOCURRENCIES

16.6. ONLINE ATTENTION FOR COUNTRIES TO INFORM EMFX TRADING

16.7. SUMMARY

CHAPTER 17: Consumer Transactions

17.1. INTRODUCTION

17.2. CREDIT AND DEBIT CARD TRANSACTION DATA

17.3. CONSUMER RECEIPTS

17.4. SUMMARY

NOTE

CHAPTER 18: Government, Industrial, and Corporate Data

18.1. INTRODUCTION

18.2. USING INNOVATION MEASURES TO TRADE EQUITIES

18.3. QUANTIFYING CURRENCY CRISIS RISK

18.4. MODELING CENTRAL BANK INTERVENTION IN CURRENCY MARKETS

18.5. SUMMARY

CHAPTER 19: Market Data

19.1. INTRODUCTION

19.2. RELATIONSHIP BETWEEN INSTITUTIONAL FX FLOW DATA AND FX SPOT

19.3. UNDERSTANDING LIQUIDITY USING HIGH-FREQUENCY FX DATA

19.4. SUMMARY

NOTE

CHAPTER 20: Alternative Data in Private Markets

20.1. INTRODUCTION

20.2. DEFINING PRIVATE EQUITY AND VENTURE CAPITAL FIRMS

20.3. PRIVATE EQUITY DATASETS

20.4. UNDERSTANDING THE PERFORMANCE OF PRIVATE FIRMS

20.5. SUMMARY

Conclusions

SOME LAST WORDS

References

About the Authors

Index

End User License Agreement

List of Tables

Chapter 1

TABLE 1.1 Segmentation of alternative data.

Chapter 4

TABLE 4.1 Financial (and non-) problems and suggested modeling techniques.

Chapter 8

TABLE 8.1 Summary statistics for MRD metrics for cluster 1 in ...

TABLE 8.2 Summary statistics for MRD metrics for cluster 2.

TABLE 8.3 Summary statistics for MRD metrics for cluster 2 whe...

TABLE 8.4 Summary statistics for MRD metrics for cluster 3 in comparison.

TABLE 8.5 Summary statistics for MRD metrics for cluster 3, wh...

Chapter 9

TABLE 9.1 Datasets used in comparative analysis of outlier detection algorithms....

Chapter 10

TABLE 10.1 Chevrolet Cruze: Top 10 countries unit sales/registrations in 2017....

TABLE 10.2 Long/short-portfolio sizes by number of tradeable companies.

TABLE 10.3 Top 10 strategies when ranked by CAGR. L – lon...

TABLE 10.4 Equal weighted benchmarks.

TABLE 10.5 Supporting statistics for top-ranked strategies by CAGR.

TABLE 10.6 Automotive factors created from the alternative data set.

TABLE 10.7 Freshest automotive factors summary statistics.

TABLE 10.8 Top 10 alt data strategies according to CAGR.

TABLE 10.9 Long top 33% strategy excess returns vs equal weight...

TABLE 10.10 Time averaged Spearman rank correlations.

TABLE 10.11 Factors CAGRs.

TABLE 10.12 Lags applied in automotive factor calculations.

Chapter 12

TABLE 12.1 GDP Growth correlations with % changes of select indicators.

TABLE 12.2 Model performance (2010Q1–2018Q1).

Chapter 13

TABLE 13.1 Annual correlation between exports, lights, and GDP.

TABLE 13.2 Comparing model forecasts through the average percentage derivatio...

List of Illustrations

Chapter 1

FIGURE 1.1 The four stages of data transformation: from raw data to a strate...

FIGURE 1.2 US GDP growth rate versus PMI; correlation 68%; time period: Q1 2...

FIGURE 1.3 China GDP growth rate versus PMI; correlation 69%; time period: Q...

FIGURE 1.4 Examples of alternative data usage by different market players....

FIGURE 1.5 Alternative data adoption curve: investment management constituen...

FIGURE 1.6 Impact of transaction costs on the information ratio of Cuemacro'...

FIGURE 1.7 Alternative datasets released commercially per year.

FIGURE 1.8 Brands most associated with alternative data.

FIGURE 1.9 Total spend on alternative data by buy side.

FIGURE 1.10 “Alternative datasets” derived from web scraping: most popular a...

Chapter 2

FIGURE 2.1 Different discriminatory pricing mechanisms.

FIGURE 2.2 US change in nonfarm payrolls versus ADP private payroll change....

Chapter 3

FIGURE 3.1 Comparison of data protection laws around the world.

Chapter 4

FIGURE 4.1 Balance between high bias and high variance.

FIGURE 4.2 Visualizing linear regression.

FIGURE 4.3 Visualizing logistic regression.

FIGURE 4.4 SVM example: The black line is the decision boundary.

FIGURE 4.5 Kernel trick example.

FIGURE 4.6 Visualizing linear regression as a neural network.

FIGURE 4.7 Visualizing logistic regression as a neural network.

FIGURE 4.8 Visualizing softmax regression as a neural network.

FIGURE 4.9 Multi-layer perceptron with 1 hidden layer.

FIGURE 4.10 Convolutional neutral network with 3 convolutional layers and 2 ...

FIGURE 4.11 Various edge, corner, and blob-based feature detectors.

FIGURE 4.12 Dominant feature detection algorithms and their properties.

FIGURE 4.13 Frequency of the words “burger” and “king.”

Chapter 5

FIGURE 5.1 Cost of setting up a data science team.

Chapter 6

FIGURE 6.1 Probabilistic Graphical Model (PGM) showing a potential modeling ...

FIGURE 6.2 Another potential modeling sequence (Model B).

FIGURE 6.3 A third potential modeling sequence (Model C).

Chapter 7

FIGURE 7.1 Average rank for all the classifiers. Column “Avg.” is the averag...

FIGURE 7.2 Average rank for the rule induction learning methods.

FIGURE 7.3 Average rank for the approximate methods.

FIGURE 7.4 Average rank for the lazy learning methods.

FIGURE 7.5 Best imputation methods for each group. The three best rankings p...

FIGURE 7.6 Methods for pattern classification with missing data. This scheme...

FIGURE 7.7 Misclassification error rate (mean ± standard d...

FIGURE 7.8 Misclassification error rate (mean ± standard d...

FIGURE 7.9 Misclassification error rate (mean ± standard d...

FIGURE 7.10 Error rates of input datasets by using LERS new classification....

FIGURE 7.11 Error rates of input datasets by using LERS naï...

FIGURE 7.12 Mean, standard deviation, and MSE values for the AUC (area under...

Chapter 8

FIGURE 8.1 Clustering for CDS time series data: (1) relatively small fractio...

FIGURE 8.2 Example of DINEOF imputation for synthetic 2D data.

FIGURE 8.3 Top; Example of complete time series data (ticker 1, cluster 2). ...

FIGURE 8.4 Amelia (top) and MICE (bottom) imputed time series for data in Fi...

FIGURE 8.5 RF imputation (dots) for data in Figure 8.2-3, compared with the ...

FIGURE 8.6 DINEOF (top) and MSSA (bottom) imputation (dots) for data in Figu...

FIGURE 8.7 Example of complete time series data (ticker 40, cluster 3). The ...

FIGURE 8.8 Amelia imputed time series for data in Figure 8.7 (dots), compare...

FIGURE 8.9 MSSA imputed time series for data in Figure 8.7 (dots), compared ...

FIGURE 8.10 Example of DINEOF imputation for car park data.

FIGURE 8.11 Car park image.

FIGURE 8.12 Car park image with 50% removed.

FIGURE 8.13 Car park image with missing pixels mean filled, pre-DINEOF.

FIGURE 8.14 Car park image with missing pixels mean filled, post-DINEOF.

FIGURE 8.15 Car park image with missing pixels local mean filled, pre-DINEOF...

FIGURE 8.16 Car park image with missing pixels local mean filled, post-DINEO...

Chapter 9

FIGURE 9.1 An example of LOF score visualization in 2 dimensions: radius of ...

FIGURE 9.2 An illustration of potential difficulties in choosing a normal ne...

FIGURE 9.3 An illustration of a case where rank statistic does not provide t...

FIGURE 9.4 Outliers explanation in problematic situations: measuring skills ...

FIGURE 9.5 Histogram plot of log(text length).

FIGURE 9.6 Event types of Fed communication.

FIGURE 9.7 Histogram plot of CScores.

FIGURE 9.8 Most talkative Fed speakers.

FIGURE 9.9 Event types of Fed communications flagged as outliers by unsuperv...

Chapter 10

FIGURE 10.1 Mean percent of sales volume known x-months after the end of the...

FIGURE 10.2 Mean percent of production volume known x-months after the end o...

FIGURE 10.3 The process followed.

FIGURE 10.4 Q_pct_delta_ffo quintile CAGRs at 3-months clairvoyance.

FIGURE 10.5 Q_pct_delta_ffo returns plot vs quarterly benchmark.

FIGURE 10.6 Heatmap of stocks held over time for Q_pct_delta_ffo at 3-months...

FIGURE 10.7 revenues_sales_prev_3m_sum_prev_1m_pct_change returns plot vs qu...

FIGURE 10.8 revenues_sales_prev_3m_sum_prev_1m_pct_change quintile CAGR.

FIGURE 10.9 ww_market_share_prev_1m_pct_change returns plot vs quarterly ben...

FIGURE 10.10 ww_market_share_prev_1m_pct_change quintile CAGR.

FIGURE 10.11 usa_sales_volume_prev_12m_sum_prev_3m_pct_change returns plot v...

FIGURE 10.12 usa_sales_volume_prev_12m_sum_prev_3m_pct_change quintile CAGR....

Chapter 11

FIGURE 11.1 Hierarchy of contributors.

FIGURE 11.2 Typical timeline of a survey.

FIGURE 11.3 The process followed in a survey.

FIGURE 11.4 Are you currently playing JX Mobile III (test version)?

FIGURE 11.5 Are you willing to pay for JX Mobile III at launch?

FIGURE 11.6 How much do/did you spend per month for items in JX PC III?

FIGURE 11.7 Performance of the share price of Kingsoft (top) and the Hang Se...

FIGURE 11.8 Crude oil production by OPEC as estimated by several data provid...

FIGURE 11.9 Monthly changes in oil prices versus changes in OPEC oil supply ...

Chapter 12

FIGURE 12.1 Nowcasting Eurozone (EZ) GDP Growth in Q2 2018.

FIGURE 12.2 Eurozone GDP and Composite PMI.

FIGURE 12.3 GBP/USD intraday volatility around UK PMI Services over past 5 y...

Chapter 13

FIGURE 13.1 First picture from Explorer VI satellite.

FIGURE 13.2 Car count for Marks & Spencer versus earnings (actual and estima...

FIGURE 13.3 Regressing consensus and car count data with earnings per share ...

FIGURE 13.4 Regressing news sentiment and car count data with earnings per s...

FIGURE 13.5 China SpaceKnow's satellite manufacturing index versus official ...

FIGURE 13.6 Surprises in China PMI manufacturing versus consensus, SMI and h...

Chapter 14

FIGURE 14.1 Comparing AIS versus official crude oil exports.

FIGURE 14.2 Thasos Foot Traffic Index YoY versus US Retail Sales YoY.

FIGURE 14.3 Trading XRT based on Thasos Mall Foot Traffic index.

FIGURE 14.4 Comparing visits to particular malls.

FIGURE 14.5 Comparing Walmart's actual earnings per share against consensus ...

FIGURE 14.6 Regressing consensus estimates and footfall against reported ear...

FIGURE 14.7 Regressing footfall, news, and Twitter data against reported ear...

FIGURE 14.8 Corporate aircraft visits at takeover targets.

Chapter 15

FIGURE 15.1 Happiest and saddest words in Hedonometer's corpus.

FIGURE 15.2 Hedonometer index for latter part of 2018 till early 2019.

FIGURE 15.3 Average Hedonometer score by day of the week.

FIGURE 15.4 Happiness Sentiment Index against S&P 500.

FIGURE 15.5 Surprise in nonfarm payrolls vs. USD/JPY 1-minute move after rel...

FIGURE 15.6 Twitter-based forecast for US change in nonfarm payrolls versus ...

FIGURE 15.7 Trading EUR/USD and USD/JPY on an intraday basis around NFP.

FIGURE 15.8 S&P 500 versus article count on it on Bloomberg News.

FIGURE 15.9 Average daily count of articles per ticker.

FIGURE 15.10 USD/JPY news sentiment score versus weekly returns.

FIGURE 15.11 News versus trend information ratio.

FIGURE 15.12 News versus trend correlation.

FIGURE 15.13 News versus trend model returns.

FIGURE 15.14 News versus trend model YoY returns.

FIGURE 15.15 USD/JPY news volume versus 1M implied volatility.

FIGURE 15.16 Regressing news volume versus 1M implied volatility.

FIGURE 15.17 EUR/USD ON volatility add-on, implied volatility, realized vola...

FIGURE 15.18 EUR/USD ON implied volatility on FOMC days against FOMC news vo...

FIGURE 15.19 EUR/USD overnight volatility on FOMC days.

FIGURE 15.20 EUR/USD overnight volatility on ECB days.

FIGURE 15.21 FOMC sentiment index and UST 10Y yield changes over the past mo...

Chapter 16

FIGURE 16.1 “Payrolls” clicks on the days of US employment report.

FIGURE 16.2 Search volume for “world cup” in the United States.

FIGURE 16.3 Regressing Google Domestic Trend Indices.

FIGURE 16.4 S&P 500 versus Google Shock Sentiment.

FIGURE 16.5 S&P 500 vs Google Shock Sentiment scatter.

FIGURE 16.6 IAI vs VIX.

FIGURE 16.7 IAI vs VIX as a scatter plot.

FIGURE 16.8 Trading S&P 500 with IAI and VIX.

FIGURE 16.9 Turkey PVIX indicator vs USD/TRY 1M implied volatility.

FIGURE 16.10 Comparing English attention with local content for Brazil.

FIGURE 16.11 Trading a basket of EM currencies using macroeconomy “attention...

Chapter 17

FIGURE 17.1 Brazil YoY retail sales versus SpendingPulse Brazil retail sales...

FIGURE 17.2 Alternative data forecasts for Amazon revenue versus actual reve...

FIGURE 17.3 Comparing Shure versus Sennheiser (MoM) spend at Amazon.

Chapter 18

FIGURE 18.1 Long-only portfolios derived from visa and patent data.

FIGURE 18.2 Long-only portfolios derived from visa and patent data (in and o...

FIGURE 18.3 Average FX crisis rates, 2000–2017.

FIGURE 18.4 COFER data: Currency Composition of Official Foreign Exchange Re...

FIGURE 18.5 Comparing model estimates of CNY intervention versus official da...

Chapter 19

FIGURE 19.1 EUR/USD daily volume.

FIGURE 19.2 EUR/USD daily abs net flow.

FIGURE 19.3 Multiple regressions between spot returns and net flow.

FIGURE 19.4 EUR/USD index versus EUR/USD fund flow score.

FIGURE 19.5 Risk-adjusted returns for trend and daily flow-based strategies....

FIGURE 19.6 Daily flow and trend returns.

FIGURE 19.7 EUR/USD bid/ask spread over time.

FIGURE 19.8 EUR/USD and USD/JPY bid/ask spread by time of day.

Chapter 20

FIGURE 20.1 AUM of largest GPs (general partners) in billions USD.



“Alternative data is one of the hottest topics in the investment management industry today. Whether it is used to forecast global economic growth in real time, to parse the entrails of a company with more granularity than that offered by a quarterly report, or to better understand stock market behaviour, alternative data is something that everyone in asset management needs to get to grips with. Alexander Denev and Saeed Amen are able guides to a convoluted subject with many pitfalls, both technical and theoretical, even for those who still think Python is a snake best avoided.”

—Robin Wigglesworth, Global finance correspondent, Financial Times.

“Congratulations to the authors for producing such a timely, comprehensive, and accessible discussion of alternative data. As we move further into the twenty-first century, this book will rapidly become the go-to work on the subject.”

—Professor David Hand, Imperial College London

“Over the last decade, alternative data has become central to the quest for temporary monopoly of information. Yet, despite its frequent use, little has been written about the end-to-end pipeline necessary to extract value. This book fills the omission, providing not just practical overviews of machine learning methods and data sources, but placing as much importance on data ingestion, preparation, and pre-processing as on the models that map to outcomes. The authors do not consider methodology alone, but also provide insightful case studies and practical examples, and highlight the importance of cost-benefit analysis throughout. For value extraction from alternative data, they provide informed insights and deep conceptual understanding – crucial if we are to successfully embed such technology at the heart of trading.”

—Stephen Roberts, Royal Academy of Engineering/Man Group Professor of Machine Learning, University of Oxford, UK, and Director of the Oxford-Man Institute of Quantitative Finance

“True investment outperformance comes from the triad of data plus machine learning plus supercomputing. Alexander Denev and Saeed Amen have written the first comprehensive exposition of alternative data, revealing sources of alpha that are not tapped by structured datasets. Asset managers unfamiliar with the contents of this book are not earning the fees they charge to investors.”

—Dr. Marcos López de Prado, Professor of Practice at Cornell University, and CIO at True Positive Technologies LP

“Alexander and Saeed have written an important book about an important topic. I am involved with alternative data every day, but I still enjoyed the perspectives in the book, and learned a lot. I highly recommend it to everybody looking to harness the power of alt data (and avoid the pitfalls!).”

—Jens Nordvig, Founder and CEO of Exante Data

The Book of Alternative Data

A Guide for Investors, Traders, and Risk Managers

 

 

ALEXANDER DENEV

SAEED AMEN

 

 

 

 

 

 

 

© 2020 by Alexander Denev and Saeed Amen. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750–8400, fax (978) 646–8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748–6011, fax (201) 748–6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762–2974, outside the United States at (317) 572–3993 or fax (317) 572–4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Names: Denev, Alexander, author. | Amen, Saeed, 1982- author.

Title: The book of alternative data : a guide for investors, traders and risk managers / Alexander Denev, Saeed Amen.

Description: Hoboken, New Jersey : Wiley, [2020] | Includes bibliographical references and index.

Identifiers: LCCN 2020008783 (print) | LCCN 2020008784 (ebook) | ISBN 9781119601791 (hardback) | ISBN 9781119601814 (adobe pdf) | ISBN 9781119601807 (epub)

Subjects: LCSH: Investments | Financial risk management. | Big data.

Classification: LCC HG4529 .D47 2020 (print) | LCC HG4529 (ebook) | DDC 332.63/204—dc23

LC record available at https://lccn.loc.gov/2020008783

LC ebook record available at https://lccn.loc.gov/2020008784

Cover Design: Wiley

Cover Image: © akindo/Getty Images

To Natalie, with all my love. –Alexander

For Gido and Baba, in life, in time, in spirit, your path is forever my guide. –Saeed

Preface

Data permeates our world in ever-increasing amounts. This fact alone is not sufficient for data to be useful. Indeed, data has no utility if it is devoid of information that could aid our understanding. Data needs to be insightful to be of use, and it also needs to be processed in the appropriate way. In the pre-Big Data days, statistics such as averages, standard deviations, and correlations were calculated on structured datasets to illuminate our understanding of the world. Models were calibrated on (a small number of) input variables that were often well “understood” to obtain an output via well-trodden methods such as linear regression.
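As a toy illustration of this traditional workflow, the summary statistics and one-variable regression just mentioned can be computed in a few lines of Python. The dataset below is hypothetical, invented for the sketch, and not taken from this book:

```python
# A minimal sketch of the pre-Big Data workflow: summary statistics and a
# one-variable linear regression on a small, structured dataset.
# The numbers are hypothetical, not taken from the book.
from statistics import mean, stdev

pmi = [52.1, 53.4, 49.8, 51.0, 54.2, 50.5]  # hypothetical PMI readings
gdp = [2.1, 2.4, 1.6, 1.9, 2.6, 1.8]        # hypothetical GDP growth (%)

def correlation(x, y):
    """Sample Pearson correlation, computed from first principles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def ols(x, y):
    """Ordinary least squares slope and intercept for one input variable."""
    mx, my = mean(x), mean(y)
    beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    return beta, my - beta * mx

r = correlation(pmi, gdp)       # how tightly the two series co-move
beta, alpha = ols(pmi, gdp)     # calibrated slope and intercept
```

With a handful of well-understood inputs, such calculations are transparent and hard to get wrong; the difficulty arises, as discussed next, when the inputs are numerous, unstructured, and constantly changing.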

However, interpreting Big Data, and hence alternative data, comes with many challenges. Big Data is characterized by properties such as volume, velocity, and variety, among other Vs that we will discuss in this book. Statistics cannot be calculated unless datasets are well structured and relevant features are extracted. When it comes to prediction, the input variables derived from Big Data are numerous, and traditional statistical methods can be prone to overfitting. Moreover, calculating statistics or building models on this data must nowadays sometimes be done frequently and dynamically, to account for the ever-changing nature of the data in our high-frequency world.
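The overfitting risk can be demonstrated with a toy example of our own (not from the book): a maximally flexible model, here the unique degree-5 polynomial forced through six noisy observations of a linear trend, fits the sample perfectly yet extrapolates wildly.

```python
# An illustrative sketch of overfitting: Lagrange interpolation fits six
# noisy points on the line y = 2x exactly, but its out-of-sample
# prediction diverges badly from the underlying trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0, 10.2]  # y = 2x plus small noise

def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# In sample, the flexible model reproduces every observation exactly...
in_sample_error = max(abs(lagrange_predict(xs, ys, x) - y)
                      for x, y in zip(xs, ys))

# ...but out of sample it strays far from the true value 2 * 7 = 14.
extrapolation_error = abs(lagrange_predict(xs, ys, 7.0) - 14.0)
```

A simple linear fit would miss the noise in sample but stay close to the trend out of sample; with many Big Data-derived inputs the same trade-off appears on a much larger scale.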

Thanks to technological and methodological advances, understanding Big Data, and by extension alternative data, has become a tractable problem. Extracting features from enormous, messy volumes of data is now possible thanks to recent developments in artificial intelligence and machine learning. Cloud infrastructure allows elastic and powerful computation to manage such data flows and to train models both quickly and efficiently. Most of the programming languages in use today are open source, and many, such as Python, have a large number of libraries in machine learning and data science more broadly, making it easier to develop tech stacks to number-crunch large datasets.

When we decided to write this book, we felt that there was a gap in the book market in this area. This gap seemed at odds with the ever-growing importance of data, and in particular, alternative data. We live in a world that is rich with data, where many datasets are accessible and available at a relatively low cost. Hence, we thought it was worth writing a lengthy book to address the challenges of using data profitably. We do admit, though, that the world of alternative data and its use cases is and will be subject to change in the near future. As a result, the path we have paved with this book is also subject to change. Not least, the label “alternative data” might become obsolete, as it could soon turn mainstream. Alternative data may simply become “data”. What might seem great technological and methodological feats today to make alternative data usable may soon become trivial exercises. New datasets from sources we could not even imagine could begin to appear, and quantum computing could revolutionise the way we look at data.

We decided to target this book at the investment community. Applications, of course, can be found elsewhere, and indeed everywhere. By staying within the financial domain, we could also have discussed areas such as credit decisions or insurance pricing. We will not discuss these particular applications in this book, as we decided to focus on questions that an investor might face. Of course, we might consider adding these applications in future editions of the book.

At the time of writing, we are living in a world afflicted by COVID-19. It is a world in which it is very important for decision makers to make the right judgements, and furthermore, these decisions must be made in a timely manner. Delays or poor decision making can have fatal consequences in the current environment. Having access to data streams that track the foot traffic of people can be crucial to curb the spread of the disease. Using satellite or aerial images could help to identify mass gatherings and to disperse them for reasons of public safety. From an asset manager's point of view, creating nowcasts before official macroeconomic figures and company financial statements are released results in better investment decisions. It is no longer sufficient to wait several months to find out about the state of the economy. Investors want to be able to estimate such figures on a very high frequency basis. The recent advances in technology and artificial intelligence make all this possible.

So, let us commence on our journey through alternative data. We hope you will enjoy this book!

Acknowledgments

We would like to thank our friends and colleagues who have helped us by providing suggestions and correcting our errors.

First and foremost, we would like to express our gratitude to Dr. Marcos Lopez de Prado, who gave us the idea of writing this book. We would like to thank Kate Lavrinenko without whom the chapter on outliers would not have been possible; Dave Peterson, who proofread the entire book and provided useful and thorough feedback; Henry Sorsky for his work with us on the automotive fundamental data and missing data chapters, as well as proofreading many of the chapters and pointing out mistakes; Doug Dannemiller for his work around the risks of alternative data which we leveraged; Mike Taylor for his contribution to the data vendors section; Jorge Prado for his ideas around the auctions of data.

We would also like to extend our thanks to Paul Bilokon and Matthew Dixon for their support during the writing process. We are very grateful to Wiley, and Bill Falloon in particular, for the enthusiasm with which they have accepted our proposal, and for the rigor and constructive nature of the reviewing process by Amy Handy. Last but not least, we are thankful to our families. Without their continuous support this work would have been impossible.

PART 1 Introduction and Theory

Chapter 1: Alternative Data: The Lay of the Land

Chapter 2: The Value of Alternative Data

Chapter 3: Alternative Data Risks and Challenges

Chapter 4: Machine Learning Techniques

Chapter 5: The Processes behind the Use of Alternative Data

Chapter 6: Factor Investing

CHAPTER 1 Alternative Data: The Lay of the Land

1.1 INTRODUCTION

There is a considerable amount of buzz around the topic of alternative data in finance. In this book, we seek to discuss the topic in detail, showing how alternative data can be used to enhance understanding of financial markets, improve returns, and manage risk better.

This book is aimed at investors who are in search of superior returns through nontraditional approaches. These methods are different from fundamental analysis or quantitative methods that rely solely on data widely available in financial markets. It is also aimed at risk managers who want to identify early signals of events that could have a negative impact, using information that is not present yet in any standard and broadly used datasets.1

At the moment of writing, there are mixed opinions in the industry about whether alternative data can add any value to the investment process on top of the more standardized data sources. There are reports in the press about hedge funds and banks that have tried, but failed, to extract value from it (see, e.g., Risk, 2019). We must stress, however, that the absence of predictive signals in alternative data is only one component of a potential failure. In fact, we will try to convince the reader, through the practical examples that we will examine, that useful signals can be gleaned from alternative data in many cases. At the same time, we will also explain why any strategy that aims to extract and make successful use of signals is a combination of algorithms, processes, technology, and careful cost-benefit analysis. Failure to tackle any of these aspects in the right way will lead to a failure to extract usable insights from alternative data. Hence, proof of the existence of a signal in a dataset is not sufficient to guarantee a superior investment strategy, given that there are many other subtle issues at play, most of which are dynamic in nature, as we will explain later.

In this book, we will also discuss in detail the techniques that can be used to make alternative data usable for the purposes we have already noted. These will be techniques belonging to what are labeled today as the fields of Machine Learning (ML) and Artificial Intelligence (AI). However, we do not want to give the upfront impression of being unnecessarily complex, with these “sophisticated” catchall terms. Hence, we will also include simpler and more traditional techniques, such as linear and logistic regression,2 with which the financial community is already familiar. Indeed, in many instances simpler techniques can be very useful when seeking to extract signals from alternative datasets in finance. Nevertheless, this is not a machine learning textbook, and hence we will not delve into the details of each technique we use, but will only provide a succinct introduction. We will refer the reader to the appropriate texts where necessary.

This is also not a book about the technology and the infrastructure that underlie any real-world implementation of alternative data. These topics, encompassing data engineering, are, of course, very important. Indeed, they are necessary for any signal found in the data to be of use in real life. However, given their variety and the deep expertise needed to treat them in detail, we believe that these topics deserve a book of their own. Nevertheless, we must stress that the methodologies we use in practice to extract a signal are often constrained by technological limitations. Do we need an algorithm to work fast and deliver results in almost real time, or can we live with some latency? The type of algorithm we choose will be very much determined by technological constraints like these. We will hint at these important aspects throughout, although this book will not be, strictly speaking, a technological one.

In this book, we will go through practical case studies showing how different alternative data sources can be profitably employed for different purposes within finance. These case studies will cover a variety of data sources, and for each of them we will explore in detail how to solve a specific problem, for example, predicting equity returns from fundamental industrial data or forecasting economic variables from survey indices. The case studies will be self-contained and representative of a wide array of situations that could appear in real-world applications, across a number of different asset classes.

Finally, this book is not a catalogue of all the alternative data sources existing at the moment of writing. We deem this to be futile because, in our dynamic world, the number and variety of such datasets increase every day. What is more important, in our view, are the processes and techniques for making the available data useful. In doing so, we will be quite practical, also examining the mundane problems that appear when sifting through datasets, and the missteps and mistakes that any practical application entails.

This book is structured as follows. Part I will be a general introduction to alternative data, the processes and the techniques to make it usable in an investment strategy. In Chapter 1, we will define alternative data and create a taxonomy. In Chapter 2 we will discuss the subtle problem of how to price datasets. This subject is currently being actively debated in the industry. Chapter 3 will talk about the risks associated with alternative data, in particular the legal risks, and we will also delve more into the details of the technical problems that one faces when implementing alternative data strategies. Chapter 4 introduces many of the machine learning and structuring techniques that can be relevant for understanding alternative data. Again, we will refer the reader to the appropriate literature for a more in-depth understanding of those techniques.

Chapter 5 will examine the processes behind the testing and the implementation of alternative data signals-based strategies. We will recommend a fail-fast approach to the problem. In a world where datasets are many and further proliferating, we believe that this is the best way to proceed.

Part II will focus on some real-world use cases, beginning with an explanation of factor investing in Chapter 6, and a discussion of how alternative data can be incorporated in this framework. One of the use cases will not be directly related to an investment strategy but is a problem at the entry point of any project and must be treated before anything else is attempted – missing data, in Chapters 7 and 8. We also address another ubiquitous problem, that of outliers in data (see Chapter 9). We will then examine use cases for investment strategies and economic forecasting based on a broad array of different types of alternative datasets, in many different asset classes, including public markets such as equities and FX. We also look at the applicability of alternative data to understanding private markets (see Chapter 20), which are typically more opaque given the lack of publicly available information. The alternative datasets we shall discuss include automotive supply chain data (see Chapter 10), satellite imagery (see Chapter 13), and machine readable news (see Chapter 15). In many instances, we shall also illustrate the use case with trading strategies on various asset classes.

So, to start this journey, let's explain a little bit more about what the financial community means by “alternative data” and why it is considered to be such a hot topic.

1.2 WHAT IS “ALTERNATIVE DATA”?

It is widely known that information can provide an edge. Hence, financial practitioners have historically tried to gather as much data as is feasible. The nature of this information, however, has changed over time, especially since the beginning of the Big Data revolution.3 From “standard” sources like market prices and balance sheet information, it evolved to include others, in particular those that are not strictly speaking financial. These include, for example, satellite imagery, social media, ship movements, and the Internet-of-Things (IoT). The data from these “nonstandard” sources is labeled alternative data.

In practice, alternative data has several characteristics, which we list below. It is data that has at least one of the following features:

Less commonly used by market participants

Tends to be more costly to collect, and hence more expensive to purchase

Usually outside of financial markets

Has shorter history

More challenging to use

We must note from this list that what constitutes alternative data can vary significantly over time according to how widely available it is, as well as how embedded in a process it is. Obviously, today most financial market data is far more commoditized and more widely available than it was decades ago. Hence, it is not generally labeled as alternative. For example, a daily time series for equity closing prices is easily accessible from many sources and it is considered nonalternative. In contrast, very high frequency FX data, although financial, is far more expensive, specialized, and niche. The same is also true of comprehensive FX volume and flow data, which is less readily available. Hence, these market derived datasets may then be considered alternative. The cost and availability of a dataset are very much dependent on several factors, such as asset class and frequency. Hence, these factors determine whether the label “alternative” should be attached to it or not. Of course, clear-cut definitions are not possible and the line between “alternative” and “nonalternative” is somewhat blurred. It is also possible that, in the near future, what we consider “alternative” will become more standardized and mainstream. Hence, it could lose the label “alternative” and simply be referred to as data.

In recent years, the alternative data landscape has significantly expanded. One major reason is that there has been a proliferation of devices and processes that generate data. Furthermore, much of this data can be recorded automatically, as opposed to requiring manual processes to do so. The cost of data storage is also coming down, making it more feasible to record this data to disk for longer periods of time. The world is also awash with “exhaust data,” which is data generated by processes whose primary purpose is not to collect or generate and sell the data. In this sense, data is a “side effect.” The most obvious example of exhaust data in financial markets is market data. Traders trade with one another on an exchange and on an over-the-counter basis. Every time they post quotes or agree to trade at a price with a counterparty, they create a data point. This data exists as an exhaust of the trading activity. The concept of distributing market data is hardly new; it has been part of markets for ages and remains an important source of revenue for exchanges and trading venues.

However, there are other types of exhaust data that have been less commonly utilized. Take, for example, a large newswire organization. Journalists continually write news articles to inform their readers as part of their everyday business. This generates large amounts of text daily, which can be stored on disk and structured. If we think about firms such as Google, Facebook, and Twitter, their users essentially generate vast amounts of data, in terms of their searches, their posts, and likes. This exhaust data, which is a by-product of user activity, is monetized by serving advertisements targeted at users. Additionally, each of us creates exhaust data every time we use our mobile phones, creating a record of our location and leaving a digital footprint on the web.

Corporations that produce and record this exhaust data are increasingly beginning to think about ways of monetizing it outside of their organization. Most of the exhaust data, however, remains underutilized and not monetized. Laney (2017) labels this “dark data.” It is internal, usually archived, not generally accessible, and not structured sufficiently for analysis. It could be archived emails, project communications, and so on. Once structured, such data becomes more useful for generating internal insights, as well as for external monetization.

1.3 SEGMENTATION OF ALTERNATIVE DATA

As already mentioned, we will not describe all the sources of alternative data but will try to provide a concise segmentation, which should be enough to cover most of the cases encountered in practice. First, we can divide the alternative data sources into the following high-level categories of generators:4 individuals, institutions5 and sensors, and derivations or combinations of these. The latter is important because it can lead to the practically infinite proliferation of datasets. For example, a series of trading signals extracted from data can be considered as another transformed dataset.

The collectors of data can be either institutions or individuals. They can store information created by other data generators. For example, credit card institutions can collect transactions from individual consumers. Concert venues could use sensors to track the number of individuals entering a particular concert hall. The data collection can be either manual or automatic (e.g. handwriting versus sensors). The latter is prevalent in the modern age, although until a couple of decades ago the opposite was true.6 The data recorded can either be in a digital or analog form. This segmentation is summarized in Table 1.1.

We can further subdivide the high-level categories into finer-grained categories according to the type of data that is generated. A list can never be exhaustive. For example, individuals generate internet traffic and activity, physical movement and location (e.g. via mobile phone), and consumer behavior (e.g. spending, selling); institutions generate reports (e.g. corporate reports, government reports) and institutional behavior (e.g. market activity); and physical processes generate information about physical variables (e.g. temperature or luminosity, which can be detected via sensors).

TABLE 1.1 Segmentation of alternative data.

Who Generates the Data?   Who Collects the Data?   How Is It Collected?   How Is It Recorded?
Physical processes        Individuals              Manually               Via digital methods
Individuals               Institutions             Automatically          Via analog methods
Institutions

As individuals, we generate data via our actions: we spend, we walk, we talk, we browse the web, and so on. Each of these activities leaves a digital footprint that can be stored and later analyzed. We have limited action capital, which means that the number of actions we can perform each day is limited. Hence, the amount of data we can generate individually is also limited by this. Institutions also have limited action capital: mergers and acquisitions, corporate reports, and the like. Sensors also have limited data generation capacity given by the frequency, bandwidth, and other physical limitations underpinning their structure. However, data can also be artificially generated by computers that aggregate, interpolate, and extrapolate data from the previous data sources. They can transform and derive the data as already mentioned above. Therefore, for practical purposes we can say that the amount of data is unlimited. One such example of data generated by a computer is that of an electronic market maker, which continually trades with the market and publishes quotes, creating a digital footprint of its trading activity.

How to navigate this infinite universe of data and how to select which datasets we believe might contain something valuable for us is almost an art. Practically speaking, we are limited by time and budget constraints. Hence, venturing into inspecting many data sources, without some process of prescreening, can be risky and is also not cost effective. After all, even “free” datasets have a cost associated with them, namely the time and effort spent to analyze them. We will discuss later how to approach this problem of finding datasets, and how new professions are emerging to tackle this task – the data scout and the data strategist.

Data can be collected by firms and then resold to other parties in a raw format. This means that no or minimal data preprocessing is performed. The data can then be processed by cleansing it, running it through quality control checks, and perhaps enriching it through other sources. Processed data can then be transformed into signals to be consumed by investment professionals.7 When data vendors do this processing, they can do it for multiple clients, hence reducing the overall cost.

These signals could be, for example, a factor that is predictive of the return of an asset class or a company, or an early warning indicator for an extreme event. A subsequent transformation could then be performed to convert a signal, or a series of signals, into a strategy encompassing several time steps based, for instance, on determining portfolio weights at each time step over an investment horizon. These four stages are illustrated in Figure 1.1.

FIGURE 1.1 The four stages of data transformation: from raw data to a strategy.
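The four stages can be illustrated with a minimal sketch. All function names and the toy signal rule below are our own illustrative stand-ins, not a prescribed implementation:

```python
# A toy sketch of the four stages in Figure 1.1: raw data -> processed data
# -> signal -> strategy. All names and rules here are illustrative only.

def process(raw):
    # Stage 2 (processing): cleanse by dropping missing and invalid records
    return [x for x in raw if x is not None and x >= 0]

def to_signal(processed):
    # Stage 3 (signal): deviation of the latest observation from the mean
    mean = sum(processed) / len(processed)
    return processed[-1] - mean

def to_strategy(signal, threshold=0.0):
    # Stage 4 (strategy): map the signal to a portfolio weight for this step
    return 1.0 if signal > threshold else -1.0

raw = [10.0, None, 12.0, -1.0, 14.0]  # Stage 1: raw, uncleaned observations
weight = to_strategy(to_signal(process(raw)))
```

In practice, each stage would be far richer, but the separation of concerns shown here mirrors the division of labor between data vendors (stages 1–2) and investment professionals (stages 3–4).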

1.4 THE MANY VS OF BIG DATA

The alternative data universe is part of the bigger discourse on Big Data.8 Big Data, and hence alternative data, in general, has been characterized by 3 Vs, which have emerged as a common framework to describe it, namely:

Volume (increasing) refers to the amount of generated data. For example, the actions of individuals on the web (browsing, blogging, uploading pictures, etc.) or via financial transactions are tracked more frequently. These actions are aggregated into many billions of records globally.9 This was not the case before the rise of the web. Furthermore, computer algorithms are used to further process, aggregate, and, hence, multiply the amount of data generated. Traditional databases can no longer cope with storing and analyzing these datasets. Instead, distributed systems are now preferred for these purposes.

Variety (increasing) refers to both the diversity of data sources and the forms of data coming from those sources. The latter can be structured in different ways (e.g. CSV, XML, JSON, database tables, etc.), semi-structured, and also unstructured. The increasing variety is due to the fact that the set of activities and physical variables that can be tracked is increasing, alongside the greater penetration of devices and sensors that can collect data. Trying to understand different forms of data can come with analytical challenges. These challenges can relate to structuring these datasets and also how to extract features from them.

Velocity (increasing) refers to the speed with which data are being generated, transmitted, and refreshed. In fact, the time to get hold of a piece of data has decreased as computing power and connectivity have increased.

In substance, the 3 Vs signal that the technological and analytical challenges to ingest, cleanse, transform, and incorporate data in processes are increasing. For example, a common analytical challenge is tracking information about one specific company in many datasets. If we want to leverage information from all the datasets at hand, we must join them by the identifier of that company. A hurdle to this can be the fact that the company appears with different names or tickers in the different datasets. This is because a certain company can have hundreds of subsidiaries in different jurisdictions, different spellings with suffixes like “ltd.” omitted, and so on. The complexity of this problem explodes exponentially as we add more and more datasets. We will discuss the challenges behind this later in a section specifically dedicated to record linkage and entity mapping (see Chapter 3).
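To make the entity-mapping problem concrete, the following minimal sketch normalizes company names (dropping legal suffixes such as “ltd.”) and fuzzy-matches the remainder. The suffix list, cutoff, and use of difflib are illustrative choices of ours; production systems typically rely on richer identifiers such as tickers, ISINs, or LEIs, as discussed in Chapter 3:

```python
import difflib

# Illustrative suffix list; a real system would use a far longer one
SUFFIXES = {"ltd", "ltd.", "inc", "inc.", "plc", "corp", "corp.", "co", "co."}

def normalize(name):
    # Lowercase, strip commas, and drop legal-form suffixes
    tokens = name.lower().replace(",", " ").split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def best_match(name, candidates, cutoff=0.85):
    # Map normalized candidate names back to their original spellings,
    # then fuzzy-match on the normalized forms
    norm_map = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(name), list(norm_map),
                                     n=1, cutoff=cutoff)
    return norm_map[hits[0]] if hits else None

print(best_match("Acme Holdings Ltd.", ["Acme Holdings", "Apex Ltd"]))
# -> Acme Holdings
```

Even this toy version shows why the problem explodes with scale: every new dataset brings its own naming conventions, and unmatched or falsely matched entities silently corrupt any downstream join.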

These 3 Vs relate more to technical issues than to business-specific issues. Recently, 4 further Vs have been defined, namely Variability, Veracity, Validity, and Value, which focus more on the usage of Big Data.

Variability (increasing) refers both to the regularity and the quality inconsistency (e.g. anomalies) of the data streams. As we explained above, the diversity of the data sources and the speed at which data originates from them have increased. In this sense, the regularity aspect of Variability is a consequence of both Variety and Velocity.

Veracity (decreasing) refers to the confidence or trust in the data source. In fact, with the multiplication of data sources, it has become increasingly difficult to assess the reliability of the data originating from them. While one can be pretty confident of the data from, say, a national bureau of statistics such as the Bureau of Labor Statistics in the United States, a greater leap of faith is needed for smaller and unknown data providers. This refers both to whether the data is truthful and to the quality of the transformations the provider has performed on the data, such as cleansing, filling missing values, and so on.

Validity (decreasing) refers to how accurate and correct the data is for its intended use. For example, data might be invalid because of purely physical limitations. These limitations might reduce accuracy and also result in missing observations; for example, a GPS signal can deteriorate on narrow streets in between buildings (in this case overlaying the readings onto a roadmap can be a good solution to rectify incorrect positioning information).

Value (increasing) refers to the business impact of data. This is the ultimate motivation for venturing into data analysis. In general, the belief is that overall Value is increasing, but this does not mean that all data has value for a business. This must be proven case by case, which is the purpose of this book.

We have encountered other Vs, such as Vulnerability, Volatility, and Visualization. We will not debate them here because we believe they are a marginal addition to the 7 Vs we have just discussed.

In closing, we note that parts of the alternative data universe are not characterized by all these Vs if looked upon in isolation. For instance, they might come in smaller sample sizes or be generated at a lower frequency, in other words “small data.” For example, expert surveys can be quite irregular and be based on a small sample of respondents, typically around 1000. The 7 Vs should, therefore, be interpreted as a general characterization of data nowadays. Hence, they paint a broad picture of the data universe, although some alternative datasets can still exhibit properties that are more typical of the pre–Big Data age.

1.5 WHY ALTERNATIVE DATA?

Now that we have defined what alternative data is, it is time to ask the question of why investment professionals and risk managers should be concerned with it. According to a recent report from Deloitte (see Mok, 2017):

“Those firms that do not update their investment processes within that time frame [over the next five years] could face strategic risks and might very well be outmanoeuvred by competitors that effectively incorporate alternative data into their securities valuation and trading signal processes.”

There is a general belief today in the financial industry, as witnessed by the quote above, that gaining access to and mining alternative datasets in a timely manner can provide investors with insights that can be quickly monetized (on a time frame in the order of months, rather than years) or that can be used to flag potential risks. The insights can be of two types: either anticipatory or complementary to already available information. Hence, information advantage is the primary reason for using alternative data.

With regard to the first type, for example, alternative data can be used to generate insights that substitute for other types of more “mainstream” macroeconomic data. These “mainstream” insights may not be available on a prompt basis and at a sufficiently high frequency. However, they are nevertheless deemed to be important factors in portfolio performance. Investors want to anticipate these macro data points and rebalance their portfolios in the light of early insights. For example, GDP figures, which are the main indicator for economic activity, are released quarterly. This is because compiling the numbers that compose them is a labor-intensive and meticulous process, which takes some time. Furthermore, revisions of these numbers can be frequent. Nevertheless, knowing in advance what the next GDP figure will be can provide an edge, especially if done before other market participants. Central banks, for example, closely watch inflation and economic activity (i.e. GDP) as an input to the decision on the next rate move. FX and bond traders, in their turn, try to anticipate the moves of the central banks and make profitable trades. Furthermore, on an intraday basis, traders with good forecasts for official data can trade the short-term reaction of the market to any data surprise.

What could serve as a proxy for GDP that is released at a higher frequency than quarterly? Purchasing Managers Indexes (PMI), which are released monthly, could be one possibility.10 They are based on surveys covering sectors such as manufacturing and services.11 The surveys are based on questionnaire responses from panels of senior purchasing executives (or similar) working in a sample of companies deemed to be representative of the wider universe. Questions could be, for instance, “Is your company's output higher, the same, or lower than one month ago?” or “What is the business outlook over a 6-month horizon?”

The information from the various components mentioned earlier is aggregated into the PMI indicator, which is interpreted relative to the value 50. Any value above the 50 level is considered to show expanding conditions, while a value below the 50 mark signals contracting conditions and, if persistent, potentially a recession.
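The aggregation of survey answers into an index like this can be sketched with the standard diffusion formula: the percentage of respondents reporting an improvement plus half the percentage reporting no change. This is a simplification of ours; actual PMIs combine several weighted sub-indices (new orders, production, employment, and so on):

```python
# A simplified PMI-style diffusion index for a single survey question.
# Responses are "higher", "same", or "lower".

def diffusion_index(responses):
    n = len(responses)
    higher = sum(r == "higher" for r in responses)
    same = sum(r == "same" for r in responses)
    # % reporting "higher" plus half the % reporting "same"
    return 100.0 * (higher + 0.5 * same) / n

# Hypothetical panel of 100 respondents
survey = ["higher"] * 40 + ["same"] * 35 + ["lower"] * 25
print(diffusion_index(survey))  # -> 57.5, above 50: expanding conditions
```

Note how the 50 threshold emerges naturally: if every respondent answered “same”, the index would sit exactly at 50.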

The correlation between the real GDP growth rate and the PMI is shown in Figure 1.2 for the US and Figure 1.3 for China. We can see that an index like this, albeit not 100% correlated to GDP, is a good approximation to it. One explanation for the imperfect correlation lies in the differences in what the two measures represent. GDP measures economic output that has already happened. Hence, it is defined as hard data. By contrast, PMIs tend to be more forward-looking, given the nature of the survey questions asked. We define such forward-looking, survey-based releases as soft data. We should note that soft data is not always perfectly confirmed by subsequent hard data, even if the two are generally correlated.

FIGURE 1.2 US GDP growth rate versus PMI; correlation 68%; time period: Q1 2005–Q1 2016.

Note. The dots indicate quarterly values.

Source: Based on data from PMI: ISM and Haver Analytics. GDP: Bureau of Economic Analysis and Haver Analytics.

FIGURE 1.3 China GDP growth rate versus PMI; correlation 69%; time period: Q1 2005–Q3 2019.

Source: PMI: China Federation of Logistics and Purchases and Haver Analytics. GDP: National Bureau of Statistics of China and Haver Analytics.
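The kind of alignment behind these figures can be sketched with pandas. The series below are synthetic stand-ins (random numbers with a fixed seed), since the actual ISM/BEA data are not reproduced here, so the resulting correlation is illustrative only and will not match the 68–69% reported in the figures:

```python
import numpy as np
import pandas as pd

# Synthetic quarterly GDP growth and monthly PMI series over the same
# window as Figure 1.2 (Q1 2005 - Q1 2016)
rng = np.random.default_rng(0)

quarters = pd.period_range("2005Q1", "2016Q1", freq="Q")
gdp_growth = pd.Series(2.0 + rng.normal(0.0, 1.5, len(quarters)),
                       index=quarters)

months = pd.period_range("2005-01", "2016-03", freq="M")
pmi = pd.Series(50.0 + rng.normal(0.0, 4.0, len(months)), index=months)

# Downsample the monthly PMI to quarterly averages so both series share
# the same PeriodIndex, then correlate
pmi_quarterly = pmi.groupby(pmi.index.asfreq("Q")).mean()
corr = gdp_growth.corr(pmi_quarterly)
```

The key step is the frequency alignment: hard data (quarterly GDP) and soft data (monthly PMI) live on different calendars, so one series must be resampled before any comparison is meaningful.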

The PMI indicators