Vectorization - Edward DongBo Cui - E-Book

Vectorization E-Book

Edward DongBo Cui

Description

Enables readers to develop foundational and advanced vectorization skills for scalable data science and machine learning and address real-world problems

Offering insights across various domains such as computer vision and natural language processing, Vectorization covers the fundamental topics of vectorization including array and tensor operations, data wrangling, and batch processing. This book illustrates how the principles discussed lead to successful outcomes in machine learning projects, serving as concrete examples for the theories explained, with each chapter including practical case studies and code implementations using NumPy, TensorFlow, and PyTorch.

Each chapter contains one or both of two types of content: an introduction to and comparison of the relevant operations in the numerical libraries (presented as tables), and/or case study examples that apply the concepts introduced to solve a practical problem (presented as code blocks and figures). Readers can approach the material by reading the text descriptions, running the code blocks, or examining the figures.

Written by the developer of the first recommendation system on the Peacock streaming platform, Vectorization explores sample topics including:

  • Basic tensor operations and the art of tensor indexing, elucidating how to access individual or subsets of tensor elements
  • Vectorization in tensor multiplications and common linear algebraic routines, which form the backbone of many machine learning algorithms
  • Masking and padding, concepts which come into play when handling data of non-uniform sizes, and string processing techniques for natural language processing (NLP)
  • Sparse matrices and their data structures and integral operations, and ragged or jagged tensors and the nuances of processing them

From the essentials of vectorization to the subtleties of advanced data structures, Vectorization is an ideal one-stop resource for both beginners and experienced practitioners, including researchers, data scientists, statisticians, and other professionals in industry, who seek academic success and career advancement.


Page count: 425

Publication year: 2024




Table of Contents

Cover

Table of Contents

Title Page

Copyright

Dedication

About the Author

Preface

Acknowledgment

1 Introduction to Vectorization

1.1 What Is Vectorization

1.2 Case Study: Dense Layer of a Neural Network

1.3 Vectorization vs. Other Parallel Computing Paradigms

Bibliography

2 Basic Tensor Operations

2.1 Tensor Initializers

2.2 Data Type and Casting

2.3 Mathematical Operations

2.4 Reduction Operations

2.5 Value Comparison Operations

2.6 Logical Operations

2.7 Ordered Array-Adjacent Element Operations

2.8 Array Reversing

2.9 Concatenation, Stacking, and Splitting

2.10 Reshaping

2.11 Broadcasting

2.12 Case Studies

Bibliography

3 Tensor Indexing

3.1 Get Values at Index

3.2 Slicing

3.3 Case Study: Get Consecutive Index

3.4 Take and Gather

3.5 Assign Values at Index

3.6 Put and Scatter

3.7 Case Study: Batchwise Scatter Values

Bibliography

4 Linear Algebra

4.1 Tensor Multiplications

4.2 The matmul Operation

4.3 The tensordot Operation

4.4 Einsum

4.5 Case Study: Pair-wise Pearson’s Cross-Correlation

4.6 Case Study: Hausdorff Distance

4.7 Common Linear Algebraic Routines

4.8 Case Study: Fitting Single Exponential Curves

Bibliography

5 Masking and Padding

5.1 Masking

5.2 Padding

5.3 Advanced Case Studies

Bibliography

6 String Processing

6.1 String Data Types

6.2 String Operations

6.3 Case Study: Parsing DateTime from String Representations

6.4 Mapping Strings to Indices

6.5 Case Study: Factorization Machine

6.6 Regular Expressions (Regex)

6.7 Data Serialization and Deserialization

Bibliography

7 Sparse Matrix

7.1 Scipy’s Sparse Matrix Classes

7.2 Sparse Matrix Broadcasting

7.3 Tensorflow’s Sparse Tensors

7.4 PyTorch’s Sparse Matrix

7.5 Sparse Matrix in Other Python Libraries

7.6 When (Not) to Use Sparse Matrix

7.7 Case Study: Sparse Matrix Factorization with ALS

Bibliography

8 Jagged Tensors

8.1 Left Align a Sparse Tensor to Represent Ragged Tensor

8.2 Index to Binary Indicator

8.3 Case Study: Jaccard Similarities Using Sparse Matrix

8.4 Case Study: Batchwise Set Operations

8.5 Case Study: Autoencoders with Sparse Inputs

Bibliography

9 Groupby, Apply, and Aggregate

9.1 Pandas Groupwise Operations

9.2 Reshaping and Windowing of Dense Tensors

9.3 Case Study: Vision Transformer (ViT)

9.4 Bucketizing Values

9.5 Segment-wise Aggregation

9.6 Case Study: EmbeddingBag

9.7 Case Study: Vocal Duration Constrained by Note Duration

9.8 Case Study: Filling of Missing Values in a Sequence

Bibliography

10 Sorting and Reordering

10.1 Sorting Operations

10.2 Case Study: Top-k Using argsort and argpartition

10.3 Case Study: Sort the Rows of a Matrix

10.4 Case Study: Reverse Padded Sequence

10.5 Case Study: Gumbel-Max Sampling with Weights

10.6 Case Study: Sorting Articles Around Anchored Advertisements

Bibliography

11 Building a Language Model from Scratch

11.1 Language Modeling with Transformer

11.2 Pre-LN vs. Post-LN Transformer

11.3 Layer Normalization

11.4 Positional Encoding and Embedding

11.5 Activation Functions in Feedforward Layer

11.6 Case Study: Training a Tiny LLaMA Model for Next Token Prediction

11.7 A Word on AI Safety and Alignment

11.8 Concluding Remarks

Bibliography

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Min–Max normalized scores

Chapter 2

Table 2.1 List of tensor initializers in Python, NumPy, Tensorflow, and PyTo...

Table 2.2 List of data types in Python, NumPy [Harris et al., 2023a], Tensor...

Table 2.3 List of common mathematical operations in Python, NumPy, Tensorflo...

Table 2.4 List of tensor reduction operations in Python, NumPy, Tensorflow, ...

Table 2.5 List of value comparison and thresholding operations in Python, Nu...

Table 2.6 List of logical or boolean tensor operations in Python, NumPy, Ten...

Table 2.7 List of adjacent element cumulative sum and discrete differences t...

Table 2.8 List of array/tensor reversing operations in Python, NumPy, Tensor...

Table 2.9 List of adjacent element cumulative sum and discrete differences t...

Table 2.10 List of tensor reshaping operations in Python, NumPy, Tensorflow,...

Chapter 3

Table 3.1 List of slicing and striding operations with example usages in 1D ...

Table 3.2 List of examples of building slicing and striding configurations u...

Table 3.3 List of take and gather operations in NumPy, Tensorflow, and PyTor...

Table 3.4 List of accumulation operators that can be used with NumPy and PyT...

Table 3.5 List of put and scatter operations in NumPy, Tensorflow, and PyTor...

Chapter 4

Table 4.1 Examples of tensor multiplication operations in NumPy, Tensorflow,...

Table 4.2 Examples of einsum equations and equivalent NumPy operations.

Table 4.3 List of linear algebraic routines in NumPy, Tensorflow, and PyTorc...

Chapter 5

Table 5.1 List of masking and mask indices construction operations.

Chapter 6

Table 6.1 List of string transformation operations in Python, NumPy, Pandas,...

Chapter 7

Table 7.1 Construction, update, and inter-conversion of seven sparse matrix ...

Table 7.2 Performance comparisons of all seven sparse matrix data structures...

Table 7.3 PyTorch Sparse tensor comparisons.

Chapter 9

Table 9.1 Example user-movie viewing hours dataset.

Table 9.2 List of groupwise aggregation operations in NumPy, Tensorflow, and...

Chapter 10

Table 10.1 List of sorting operations in NumPy, Tensorflow, and PyTorch.

List of Illustrations

Chapter 1

Figure 1.1 CPU vector computations.

Figure 1.2 Multilayer perceptron.

Figure 1.3 Feed-forward layer.

Figure 1.4 Reading remote database (a) with single thread and (b) with multi...

Figure 1.5 Driver and workers in distributed computing. A driver (or main) n...

Figure 1.6 Comparison of parallel computing paradigms. Vectorization stores ...

Chapter 2

Figure 2.1 Concatenation and stacking in 1D vectors.

Figure 2.2 Concatenation and stacking in 2D matrices.

Figure 2.3 Splitting operations.

Figure 2.4 Reshaping a vector to a matrix before taking averages.

Figure 2.5 Concept of order in reshaping.

Figure 2.6 Scalar to vector broadcasting.

Figure 2.7 Vector to matrix broadcasting.

Figure 2.8 Broadcasting between tensor with singleton dimension and non-sing...

Figure 2.9 2x2 magic squares cannot be constructed.

Figure 2.10 Luo Shu, the earliest known third-order magic square.

Figure 2.11 Example of constructing a third-order magic square using the Sia...

Figure 2.12 Deconstructing odd-order Siamese magic squares into a sum of two...

Figure 2.13 The “X” method for constructing doubly even order magic squares....

Figure 2.14 Steps of implementing the X-method for constructing doubly even ...

Figure 2.15 LUX: order of filling out a grid.

Figure 2.16 Examples of constructing order 6 and 10 magic squares using the ...

Chapter 3

Figure 3.1 Illustration of integer indexing using positive (regular) and equ...

Figure 3.2 Indexing a matrix. (a) Row-column index. Here, “row” refers to th...

Figure 3.3 Flat and multi(dimensional) indices. There are two conventions to...

Figure 3.4 Illustration of extracting indices that satisfy boolean condition...

Figure 3.5 Examples of 2D matrix slicing and striding in NumPy, Tensorflow, ...

Figure 3.6 Examples of 3D tensor slicing and striding in NumPy, Tensorflow, ...

Figure 3.7 Example usage of take operation along a 1D vector.

Figure 3.8 Example usage of flat take operation on a 2D matrix.

Figure 3.9 Example usage of multi-dimensional take operation on a vector.

Figure 3.10 Illustrated examples of torch.gather operation on a 2D matrix. (...

Figure 3.11 Illustration of multi-element multi-indexing using -dimensional...

Figure 3.12 Illustration of tf.gather_nd with nonzero batch_dims argument. T...

Figure 3.13 Illustration of np.put operation on a 1D vector target.

Figure 3.14 Illustrated example of np.put_along_axis.

Figure 3.15 Illustration of scattering values over a 2D matrix using multi-i...

Figure 3.16 Illustration of tf.scatter_nd over a 1D target tensor.

Figure 3.17 Illustration of scattering 2D slices over a 3D tensor.

Figure 3.18 Illustration of scattering slices into multiple batch dimensions...

Figure 3.19 Illustration of torch.scatter_reduce on 2D matrices. (a) Scatter...

Figure 3.20 A step-by-step illustration of the batchwise scattering process ...

Chapter 4

Figure 4.1 Illustration of an example tensordot operation between tensors of...

Figure 4.2 Illustration of tensordot operations when axes=0. (a) For a p...

Figure 4.3 Illustration of tensordot operations when axes=1. (a) For a p...

Figure 4.4 Illustration of tensordot operations when axes=2. (a) For a p...

Figure 4.5 Fiducial markers used to track head positions. (a) The original t...

Figure 4.6 Fit single exponential to noisy data points.

Chapter 5

Figure 5.1 Example uses of tril and triu operations.

Figure 5.2 Example usage of tf.linalg.band_part. It is impossible to create ...

Figure 5.3 Before and after thresholding to remove background.

Figure 5.4 Boolean tensor masking and compressing.

Figure 5.5 Masking top K values. (a) Given an example batch of arrays, we wo...

Figure 5.6 Example illustration of constant padding.

Figure 5.7 Illustration of constant padding order.

Figure 5.8 Example illustration of reflect padding. Dashed lines indicate ax...

Figure 5.9 Example illustration of symmetric padding. Dashed lines indicate ...

Figure 5.10 Example illustration of edge or replicate padding.

Figure 5.11 Example illustration of wrap or circular padding mode.

Figure 5.12 Example of same vs. valid padding in convolutional operations.

Figure 5.13 Bidirectional vs. causal self-attentions.

Figure 5.14 Attention mask. (a) Bidirectional attention mask, which allows t...

Figure 5.15 Building a ragged range vector. (a) Create the boolean mask by b...

Figure 5.16 Example effect of Length Regulator module on the four-word sente...

Chapter 6

Figure 6.1 Training results of Deep Factorization Machine.

Chapter 7

Figure 7.1 Example user-item rating matrix in a typical recommendation syste...

Figure 7.2 coo_matrix example.

Figure 7.3 csc_matrix example.

Figure 7.4 csr_matrix example.

Figure 7.5 bsr_matrix example.

Figure 7.6 dok_matrix example.

Figure 7.7 lil_matrix example.

Figure 7.8 dia_matrix example.

Figure 7.9 Row-wise broadcasting example.

Figure 7.10 Column-wise broadcasting example.

Figure 7.11 Generalized compressed sparse tensor.

Figure 7.12 Illustration of matrix multiplication on sparse indices.

Figure 7.13 ALS training loss.

Chapter 8

Figure 8.1 Comparison of three ways of representing a jagged tensor. (a) A d...

Figure 8.2 Before (left) and after (right) of left aligning the sparse matri...

Figure 8.3 Example illustration of converting a set of variable-length lists...

Figure 8.4 Illustration of the logic to compute pair-wise Jaccard Similarity...

Figure 8.5 Creating a csr_matrix binary indicator representation using indic...

Figure 8.6 Illustration of an alternative way of computing flattened ragged ...

Figure 8.7 Autoencoder with categorical inputs represented by a sparse vecto...

Chapter 9

Figure 9.1 Downsampled time series from the code example.

Figure 9.2 Schematic illustration of reshape and reduce operations used to i...

Figure 9.3 Mean-removed time series from the code example.

Figure 9.4 Schematic illustration of broadcasting used to implement groupby-...

Figure 9.5 Sliding window of size 3 and stride 2.

Figure 9.6 Illustration of implementation logic of sliding_window_tf using T...

Figure 9.7 Moving average of the time series from the code example.

Figure 9.8 Sinusoidal positional encoding of the transformer architecture.

Figure 9.9 ViT architecture overview.

Figure 9.10 Transformer pre-LN architecture. After blocks of encoder layer...

Figure 9.11 Example illustration of left-inclusive and left-exclusive maskin...

Figure 9.12 Illustration of segment-wise summation operation, with each colo...

Figure 9.13 “Eyes On Me” sheet music from my previous SVS project.

Figure 9.14 Example note duration constraints by group.

Figure 9.15 Implementation logic for ffill. (a) Illustration of the input da...

Chapter 10

Figure 10.1 Schematic illustration of the effect of partitioning the array. ...

Figure 10.2 Performance of two top-k functions implemented in NumPy, Tensorf...

Figure 10.3 Illustration of using np.unique to sort rows of a matrix. In thi...

Figure 10.4 Illustration of the implementation logic to reverse a padded seq...

Figure 10.5 Evolution of samples tensor when implementing the Gumbel-Max tri...

Figure 10.6 Simulated Gumbel-Max negative sampling frequency when using unif...

Figure 10.7 Simulated Gumbel-Max negative sampling frequency when sampling f...

Figure 10.8 Illustration of arranging news articles around anchored ads.

Figure 10.9 Illustration of converting ads_positions into ads anchored_mask ...

Figure 10.10 Illustration of creating the output_index_ads matrix. (a) Creat...

Figure 10.11 Combine ads and news output index into a single matrix. Notice ...

Chapter 11

Figure 11.1 Learning rate schedule from the original transformer paper.

Figure 11.2 Comparison of Pre- and Post-layer normalization architecture. Po...

Figure 11.3 Comparison of normalization scheme. (a) Batch normalization norm...

Figure 11.4 Sinusoidal positional encoding from the original transformer pap...

Figure 11.5 Tensor2Tensor variant of the sinusoidal positional encoding.

Figure 11.6 Relative positional embedding in T5 model.

Figure 11.7 Rotation matrix of the first nine sequence positions, given embe...

Figure 11.8 Corresponding values in and to be multiplied. Suppose we hav...

Figure 11.9 Comparison of nonlinear activation functions.

Figure 11.10 Training our tiny LLaMA model using WikiText2.



IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Sarah Spurgeon, Editor-in-Chief

Moeness Amin

Jón Atli Benediktsson

Adam Drobot

James Duncan

Ekram Hossain

Brian Johnson

Hai Li

James Lyke

Joydeep Mitra

Desineni Subbaram Naidu

Tony Q. S. Quek

Behzad Razavi

Thomas Robertazzi

Diomidis Spinellis

Vectorization

A Practical Guide to Efficient Implementations of Machine Learning Algorithms

Edward DongBo Cui

Director of Data Science, DotDash Meredith

Copyright © 2025 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data applied for:

Hardback ISBN: 9781394272945

Cover Design: Wiley
Cover Images: © da-kuk/Getty Images

To my students

About the Author

Edward Dongbo Cui is a data science and machine learning engineering leader who holds a PhD in neuroscience from Case Western Reserve University, USA. Edward served as Director of Data Science at DotDash Meredith, building large-scale search and recommender systems for online digital publications. At NBC Universal, he served as Director of Data Science, building the first recommendation system on the new Peacock streaming platform. Previously, he was a lead data scientist at Nielsen Global Media. He is an expert in ML engineering, research, and MLOps, applying these skills to drive data-centric decision-making and enhance product innovation.

Preface

Over the years, I have accrued valuable experiences in implementing machine learning algorithms, and this book is my endeavor to share those insights with you. Machine learning is a vast field, encompassing areas including but not limited to computer vision, speech processing, natural language processing, and recommendation systems. While the diversity of these topics is immense, there are underlying mathematical and computational principles that weave them together.

Structure of the Book: The book is structured into 11 chapters, each dedicated to a pivotal topic in the realm of scientific computation and machine learning:

Chapter 1 delves into the power of vectorization, contextualized within parallel computing.

Chapter 2 lays the foundation by discussing the basic tensor operations.

Chapter 3 details the art of tensor indexing, elucidating how to access individual or subsets of tensor elements.

Chapter 4 probes deeper into vectorization in tensor multiplications and common linear algebraic routines, which form the backbone of many machine learning algorithms.

Chapter 5 sheds light on the pivotal concepts of masking and padding, which come into play when handling data of nonuniform sizes.

Chapter 6 touches upon string processing techniques, invaluable for natural language processing (NLP) tasks.

Chapter 7 introduces the realm of sparse matrices, discussing their data structures and integral operations.

Chapter 8 takes a closer look at a unique kind of sparse tensor – ragged or jagged tensors – and the nuances of processing them.

Chapter 9 breaks down the data wrangling pattern of groupby-apply-aggregate, which is crucial for data summarization.

Chapter 10 explores the topics of sorting, reordering, sampling, and batch processing through a series of case studies.

Chapter 11 concludes the book with a review of the Transformer++ architecture (a.k.a. LLaMA, a foundational large language model architecture from Meta), weaving together the principles discussed in the preceding chapters.

Audience: While I have tailored this book for those embarking on their journey in machine learning algorithm implementation, it is not solely for beginners. Having a basic understanding of machine learning and deep learning will certainly be an advantage, but every concept introduced is supplemented with sufficient background to make it comprehensible to all readers.

Prerequisites:

Familiarity with the Python programming language: Readers can learn Python from online resources or from classic books such as “Think Python.”

Undergraduate linear algebra: For this, I highly recommend Gilbert Strang’s lectures on Linear Algebra at MIT, or his book “Introduction to Linear Algebra.”

Foundational knowledge of data science and machine learning: The goal of this book is not to introduce machine learning, but rather to guide the reader in solving real-world problems encountered in machine learning (oftentimes deep learning). Basic knowledge of data science and machine learning can be gained from prior foundational courses or from hands-on machine learning projects. Courses such as Andrew Ng’s CS229 Machine Learning can serve as a foundation.

Features:

Every chapter is punctuated with numerous case studies to ensure the concepts are grounded in practical scenarios. Accompanying these are detailed code implementations with line-by-line explanations. Most of the implementations will use only the knowledge acquired within the chapter. Certainly, as the reader learns along the way, one may come up with better implementations by leveraging more advanced concepts.

The book covers vectorization techniques using NumPy, Tensorflow, and PyTorch (and for certain operations, Pandas as well). When introducing the operations, I will include tables that enumerate equivalent operations in these three libraries, in the hope that readers who are familiar with one of the libraries can also learn to translate algorithms and models implemented in another library. Most case study code examples will be implemented in one of the libraries (often NumPy), but readers are encouraged to implement the same functionality using alternative libraries by referencing equivalent operations.

All code examples are available at github.com/sapphire008/vectorization-book-supplement. Note that some code examples are simulations using random inputs, so rerunning the notebooks will not necessarily get the same results as the ones provided in the book.

To further clarify many of the vectorization operations, illustrative figures have been incorporated to visually represent the logic behind them, often accompanied by detailed explanations either within the text body or in the caption.

The principles discussed in this book should be agnostic to specific versions of Python, NumPy, TensorFlow, or PyTorch. By default, my code examples were tested using python==3.10.6, numpy==1.23.4, pandas==2.1.1, scikit-learn==1.3.0, scipy==1.11.1, tensorflow==2.13.0, keras==2.13.1, and torch==2.1.0. However, I will specify the library versions alongside code examples when the versions used are not the default, or whenever relevant or necessary to emphasize the versions.

Limitations:

Note that the carriage return symbol “↪” within the code block (with shaded background) is used to indicate a line break to accommodate formatting of long lines.

Code examples and case studies implemented to solve certain problems may not always be the most optimal solutions, though they honestly represent my best effort at the time of writing this book. Sometimes I also deliberately chose a less efficient implementation in order to illustrate the practical application of the concepts introduced in the chapter. Because the materials of each chapter build upon one another, I have made an explicit choice not to use functionality that is only introduced in a later chapter. Still, this demonstrates the beauty of vectorization and of coding in general: in practice, there is never just a single solution to a specific problem. Nonetheless, I encourage readers to find more efficient ways to solve the case study problems. If readers find a bug in one of the implementations, please feel free to open an issue at github.com/sapphire008/vectorization-book-supplement.

When comparing the vectorization libraries, it is possible that I am ignorant of certain functionalities specific to a library, even though the information contained in this book is accurate to the best of my current knowledge. I may sometimes claim that one library does not implement functionality that another library implements at the time of writing; if readers find that the library does in fact implement that functionality, I stand corrected and will make my best effort to update this information in future editions of the book.

My hope is that by the end of this book, the readers will be well-versed in the foundational and advanced vectorization techniques of implementing scalable machine learning algorithms and will be ready to tackle real-world challenges.

November 29, 2023, New York

Edward DongBo Cui

Director of Data Science, DotDash Meredith, New York

Acknowledgment

I express my deepest gratitude to Ryan Beauchemin and Kendall Meyer for their invaluable reviews and contributions, which have greatly enhanced the quality of this book. I also thank Ilmir Moussikaev for believing in my potential and providing me with the opportunity to create something remarkable on the Peacock streaming platform. This specific experience was the cornerstone of my career, leading me to acquire the knowledge necessary to write this book. I also want to extend my heartfelt appreciation to my students. Teaching and mentoring them on various techniques in machine learning and data science provided me with the insights and material needed to write this book. A special thanks to Chris Haroun, whose inspiration and guidance motivated me to embark on this writing endeavor. Lastly, I am immensely thankful to my wife for her unwavering support throughout this journey.

1 Introduction to Vectorization

1.1 What Is Vectorization

Vectorization is a type of parallel computing paradigm that performs arithmetic operations on an array of numbers [Intel, 2022] within a single data processing unit (either CPU or GPU). Modern CPUs can perform between 4 and 16 single precision (float32) parallel computations, depending on the type of instructions [Intel, 2022] (see Figure 1.1).

Here, SSE, or Streaming SIMD (single-instruction-multiple-data) Extensions, provides 128-bit registers; each register holds 128 bits of data for computations. This is equivalent to performing four single-precision (float32) operations or two double-precision (float64) operations at once. Intel’s AVX-512, or Advanced Vector Extensions 512, on the other hand, provides 512-bit registers and can therefore process 512 bits of data simultaneously. This is equivalent to performing 16 single-precision or 8 double-precision operations.

The term “Vectorization” also refers to the process of transforming an algorithm computable on one data point at a time to an operation that calculates a collection of data simultaneously [Khartchenko, 2018]. A classic example is the dot product operation,

$$\sum_{i=1}^{n} a_i b_i = \mathbf{a} \cdot \mathbf{b},$$

where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$, $\mathbf{b} = [b_1, b_2, \ldots, b_n]$, and $n$ is the length of the vectors.

The dot product operation on the right-hand side of the equation is the vectorized form of the summation process on the left. Instead of operating on one element of the array at a time, it operates on the collections or arrays, $\mathbf{a}$ and $\mathbf{b}$.

Figure 1.1 CPU vector computations.

Note

Another definition of the term “vectorization” is related to the more recent advances in large language models: using a vector, or an array of values, to represent a word. This is more commonly known as embedding in the literature, a usage that originated from the word2vec model in natural language processing. In Chapters 6 and 7, we will introduce the idea of representing string or categorical features as numerical values, i.e. integer indices, sparse one-hot encoding vectors, and dense embedding vectors.

1.1.1 A Simple Example of Vectorization in Action

Suppose we want to take the dot product of 1 million random numbers drawn from the uniform distribution. Using Python’s for loop, we would typically implement this algorithm as follows:
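A minimal sketch of such a loop-based implementation might look like the following (the function name dot_product_loop and the input setup are illustrative, not the book’s exact listing):

import random

# Two vectors of 1 million random numbers drawn from the uniform distribution.
N = 1_000_000
a = [random.uniform(0, 1) for _ in range(N)]
b = [random.uniform(0, 1) for _ in range(N)]

def dot_product_loop(a, b):
    # Accumulate the products one element at a time.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total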

On my M1 MacBook Pro under Python 3.11.4, this is the runtime of the above function:

If we instead implement the algorithm using numpy with vectorization:
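A corresponding sketch (again with illustrative names, reusing the inputs from the loop-based version above):

import numpy as np

# Convert the Python lists into NumPy arrays.
a_np = np.array(a)
b_np = np.array(b)

def dot_product_vectorized(a, b):
    # A single vectorized call replaces the explicit Python loop.
    return np.dot(a, b)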

Under the same setup, we have the following runtime of the vectorized implementation.

This is a ∼37× speed-up by vectorization!

1.1.2 Python Can Still Be Faster!

That said, the example above should not obscure the fact that certain native Python operations are still faster than NumPy’s implementations, even though NumPy is often used to vectorize away explicit Python for-loops.

For example, suppose we would like to split a list of strings delimited by periods (“.”). Using Python’s list comprehension, we have:
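A sketch with an illustrative input list, laid out so that the line reference below points at the list comprehension:

strings = ["lorem.ipsum.dolor", "sit.amet.consectetur"] * 50_000  # illustrative input
split_python = [s.split(".") for s in strings]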

Line 2 runs for

However, in comparison, if we use NumPy’s “vectorized” split operation:
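A corresponding sketch using np.char.split on the same illustrative inputs:

import numpy as np

strings_np = np.array(strings)
split_numpy = np.char.split(strings_np, sep=".")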

Line 3 runs for

which is about ∼4.0× slower than native Python.

1.1.3 Memory Allocation of Vectorized Operations

When using vectorized operations, we need to consider the trade-off between memory and speed. Sometimes, vectorized operations would need to allocate more memory for results from intermediate steps. Therefore, this has implications on the overall performance of the program, in terms of space vs. time complexity. Consider a function in which we would like to compute the sum of the squares of the elements in an array. This can be implemented as a for-loop, where we iterate through each element, take the square, and accumulate the square values as the total. Alternatively, the function can also be implemented as a vectorized operation where we take the square of each element first, store the results in an array, and then sum the squared vector.
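A sketch of the two implementations might look like the following (the function names and the size of X are illustrative assumptions):

import numpy as np

X = np.random.uniform(size=1_000_000)

def sum_of_squares_loop(X):
    # Iterate over each element, square it, and accumulate the total.
    total = 0.0
    for x in X:
        total += x * x
    return total

def sum_of_squares_vectorized(X):
    # Square every element first (allocating an intermediate array), then sum.
    squared = X ** 2
    return squared.sum()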

Let us use the memory_profiler package to examine the memory usage of each line of the two functions. We can first install the package by pip install memory_profiler==0.61.0 and load it as a plugin in a Jupyter Notebook.

We also save the above script into a file (which we call memory_allocation_example.py), then we measure the memory trace of each function in the notebook as follows:
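Assuming the saved script contains the array X and the two functions from the sketch above, the notebook commands might look like this:

%load_ext memory_profiler

from memory_allocation_example import X, sum_of_squares_loop, sum_of_squares_vectorized

%mprun -f sum_of_squares_vectorized sum_of_squares_vectorized(X)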

This gives us the following line-by-line memory trace of the function call:

We can also use the %timeit magic function to measure the speed of the function:

which gives

Similarly, let us also look at the for-loop implementation

And we also test the run time as

which gives

We can see that although the for-loop function is slower than the vectorized function, the for-loop shows almost no memory increase. In contrast, the vectorized solution shows a ∼1M increase in memory usage while reducing the run time by about 30%. If X is several orders of magnitude larger than in our current example, then even more memory will be consumed in exchange for speed. Hence, we need to consider carefully the trade-off between speed and memory (or time vs. space) when using vectorized operations.

1.2 Case Study: Dense Layer of a Neural Network

Vectorization is a key skill for implementing various machine learning models, especially for deep learning algorithms. A simple neural network consists of several feed-forward layers, as illustrated in Figure 1.2.

A feed-forward layer (also called a dense layer, linear layer, perceptron layer, or hidden layer, as illustrated in Figure 1.3) has the following expression:

$$y = \sigma(XW + b)$$

Notice that $XW + b$ is a linear model. Here,

  • $X$ is the input, typically of shape (batch_size, num_inputs)

  • $W$ is the weight (a.k.a. kernel) parameter matrix of shape (num_inputs, hidden_size)

  • $b$ is the bias parameter vector of shape (1, hidden_size)

  • $\sigma$ is an activation function that is applied to every element of the resulting linear model operation. It can be, e.g., relu, sigmoid, or tanh.

  • $y$ is the output of shape (batch_size, hidden_size)

Figure 1.2 Multilayer perceptron.

Figure 1.3 Feed-forward layer.

Stacking multiple feed-forward layers creates the multilayer perceptron model, as illustrated by our example neural network above.

If we were to implement a feed-forward layer naively using only Python, we would need to run thousands to millions of iterations of a for-loop, depending on the batch_size, num_inputs, and hidden_size. When it comes to large language models like the generative pretrained transformer (GPT), which has hundreds of billions of parameters to optimize, using for loops for such computations becomes so slow that training or making an inference with the model is infeasible. But if we take advantage of vectorization on either CPU or GPU, we see significant gains in the performance of our model. Furthermore, certain operations like matrix multiplication are inefficient if implemented naively by simply parallelizing a for-loop, so in practice one must rely on special algorithms such as Strassen’s Algorithm [Strassen, 1969] (or, more recently, a new algorithm discovered by AlphaTensor [Fawzi et al., 2022]). Implementing these algorithms as instructions on either CPU or GPU is a nontrivial task in itself. Practitioners of machine learning and data science should therefore take advantage of pre-existing libraries that implement these algorithms. But to use these libraries efficiently, one needs to be adept at thinking in terms of vectorization.

In the following, let’s take a look at an example of how to implement the above perceptron layer using numpy.
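A minimal sketch of such an implementation follows; the function names sigmoid and dense_layer are illustrative, and the listing is laid out so that the line references in the walkthrough below roughly apply:

import numpy as np

def sigmoid(x):
    """Element-wise sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def dense_layer(X, W, b):
    """Feed-forward (dense) layer: sigmoid(X @ W + b)."""
    # Linear model; the bias b of shape (1, hidden_size) is broadcast over the batch.
    z = X @ W + b
    # Apply the activation element-wise to produce the layer output.
    return sigmoid(z)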

Lines 3–5 define the sigmoid activation function, $\sigma(x) = \frac{1}{1 + e^{-x}}$.

Line 10 implements the linear model $z = XW + b$.

– Here, the @ operator is the matrix multiplication operator that is recognized by typical vector libraries like numpy, tensorflow, and torch. We will discuss more on matrix operations in Chapter 4.

– The result of multiplying $X$ and $W$ should be a matrix of shape (batch_size, hidden_size), but in this implementation, we have $b$ of shape (1, hidden_size). How can a vector be added to a matrix? In Chapter 2, we will explore the concept called broadcasting.

Line 12 applies the sigmoid activation we defined previously to compute the final output of this dense layer. This is an element-wise operation, where we apply the sigmoid function to each element of the matrix. We will explore the idea of element-wise operations in Chapter 2.

As you can see, vectorization can greatly simplify the representation of neural networks operating on high-dimensional arrays with a large number of elements or data points.

1.3 Vectorization vs. Other Parallel Computing Paradigms

Knowing that vectorization is a type of parallel computing paradigm, it is natural to ask the question: how does vectorization compare to other parallel computing paradigms such as multithreading, multiprocessing, or multiworker distributed computing?

1.3.1 Multithreading

Like vectorization, multithreading operates on a single processing unit, with shared memory. This can be a single core of a CPU or GPU. Vectorization processes multiple data inputs within the register in parallel, given a single instruction. In contrast, multithreading can only process one stream of data at a time, instead of processing data in parallel. It can switch between different threads of the same program (or sometimes different threads of different programs), with each thread representing a different part of the process. This is referred to as “concurrency.” Of course, multithreading can be used in combination with vectorization. In fact, modern GPUs operate by combining vectorization, multithreading, and multiprocessing to increase their processing power [NVIDIA, 2012].

The multithreading paradigm is quite advantageous in tasks like reading data from remote servers. For example, suppose a database has 10 parts and each part takes one minute to download. If we have only a single thread, then we must wait for each part to finish downloading before requesting the next part, so it will take 10 minutes in total to download the entire database. However, if we have 10 threads, each can make a request to read one part of the data. Since the time to make a request is negligible compared to the time to download the data, each thread takes one minute to download its part, and because the downloads proceed in parallel, it will take only about one minute in total to download the entire database. See Figure 1.4 for an illustration.

1.3.2 Multiprocessing

Multiprocessing, on the other hand, operates on multiple processing units, e.g. multi-core CPUs or GPUs. Each core of the processing unit can access data through a shared memory (the RAM). Unlike vectorization and multithreading, multiprocessing does not share the same data between the processes. Therefore, to process the same data in parallel using different instructions, multiprocessing needs to duplicate the data across the cores. Unlike multithreading, multiprocessing will not be efficient if the process is I/O bound (i.e. limited by data reading and writing). Conversely, multiprocessing is a great choice if the process is bound by processing power (a.k.a. CPU bound), and it can greatly speed up the process by distributing the computational load to multiple cores of the processing unit.

Figure 1.4 Reading remote database (a) with single thread and (b) with multithreading.

A typical use of multiprocessing is to perform in-memory batch data processing. Suppose we want to compute the min–max normalized score for a bunch of users:
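The exact data behind Table 1.1 cannot be recovered from the normalized results, so the following frame is illustrative; it has the same grouping structure (4, 5, and 9 rows for users A, B, and C):

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "id": ["A"] * 4 + ["B"] * 5 + ["C"] * 9,
    "scores": rng.integers(0, 100, size=18),
})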

We can define the following user-defined function (UDF) to be used to compute each user’s normalized score:
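A sketch of such a UDF:

def apply_func(group):
    # Min–max normalize the scores within a single user's group.
    scores = group["scores"]
    group["scores"] = (scores - scores.min()) / (scores.max() - scores.min())
    return group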

Each grouped batch based on the user’s id will be transformed using the apply_func:
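For example:

# Apply the UDF to each user's group; group_keys=False keeps the original index.
result = df.groupby("id", group_keys=False).apply(apply_func)
%timeit df.groupby("id", group_keys=False).apply(apply_func)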

This takes about 1.5 ms on my MacBook Pro M1 machine with 8 cores, yielding the following results in Table 1.1.

We can perform the same transformation using multiprocessing, where each process will operate on each of the grouped batches of data. We first define the helper function apply_parallel which will use Python’s joblib to create the multiprocessing environment.
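A sketch of such a helper, assuming joblib’s Parallel and delayed interface:

import pandas as pd
from joblib import Parallel, delayed

def apply_parallel(grouped, func, n_jobs=8):
    # Run func on each group in a separate process, then concatenate the results.
    results = Parallel(n_jobs=n_jobs)(
        delayed(func)(group) for _, group in grouped
    )
    return pd.concat(results)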

Table 1.1 Min–Max normalized scores

Id    Scores
A     0.9
A     0.85
A     1
A     0
B     0.809524
B     0.904762
B     1
B     0.952381
B     0
C     0
C     0.677419
C     0.612903
C     1
C     0.935484
C     0.774194
C     0.354839
C     0.516129
C     0.0645161

We then use this helper function to perform our batchwise transformations
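For example:

result_parallel = apply_parallel(df.groupby("id"), apply_func, n_jobs=8)
%timeit apply_parallel(df.groupby("id"), apply_func, n_jobs=8)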

This takes 677 ms, which is much slower than using the native pandas.DataFrame.apply function. This is because setting up multiprocessing has an overhead, so most of the elapsed time is spent setting up the multiprocessing environment instead of performing the actual computation. However, when the size of the data and the number of batches are large, we can expect the additional overhead to be worthwhile for the overall speed gain from multiprocessing. Let’s now instead randomly generate 100,000 rows of data for our transformation.
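A sketch of generating the larger frame (the number of distinct users, 1,000, is an illustrative assumption):

n_rows = 100_000
df_large = pd.DataFrame({
    "id": rng.integers(0, 1_000, size=n_rows).astype(str),
    "scores": rng.uniform(0, 100, size=n_rows),
})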

Figure 1.5 Driver and workers in distributed computing. A driver (or main) node issues a series of commands to create workers and assigns each worker a task to process a portion of the data. Once each worker completes their own task, the worker will send the results back to the driver for aggregation. The driver then waits for all the workers to finish their tasks before returning the results to the user.

If we use pandas.DataFrame.apply, the transformation will take about 19 seconds. If we use apply_parallel, it now takes 16 seconds to perform the transformation, a ∼16% gain in processing speed.

1.3.3 Multiworker Distributed Computing

In modern big data analytics, practitioners often need to process data ranging anywhere between several gigabytes and hundreds of terabytes in size (and this is often only the compressed size on disk; the size of the data occupying the memory can be much larger). No single consumer-grade laptop or desktop computer can hold hundreds of gigabytes to terabytes of data in memory, nor is any single CPU or GPU fast enough to process this much data without the user waiting hours to days (well, at least no such powerful processing unit exists yet as I am writing this book). Data analysts, data scientists, and machine learning engineers instead rely on cloud technologies to host such a large amount of data on distributed file storage systems and process the data using distributed computing. The term “distributed” here simply means that multiple machines are storing, accessing, and processing the data, with the machines communicating with each other via networking. Figure 1.5 illustrates a multiworker distributed computing system.

When dealing with the task of processing a large amount of data, a “divide and conquer” approach is key. The idea is that if it takes 100 hours for a single computer to process our data, then 100 computers can finish the same task in about one hour. If we can somehow divide such a large amount of data into smaller pieces, let each computing node process one small piece of data, and finally aggregate the results from all the parallel workers at the end, we can greatly reduce the physical (or wall-clock) time we need to wait for our data processing task to finish. In fact, this is the core idea behind MapReduce [Dean and Ghemawat, 2008].

A “Hello World” example often used to illustrate MapReduce is the task of counting the number of occurrences of each word in a large body of documents. In this example, we will use Apache Beam [Apache Foundation, 2023a] to illustrate how to write a MapReduce algorithm for word count. We can install apache-beam with pip install apache-beam (I am using apache-beam==2.49.0; note that Apache Beam does not support pandas>=2.0.0 [Apache Foundation, 2023b] at the time of writing, so I am using pandas==1.5.3 in this example).
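The walkthrough below refers to the book’s original listing line by line; the following minimal sketch uses an illustrative mock corpus and step labels and is laid out so that those line references still roughly apply:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Set up the pipeline running environment.
options = PipelineOptions([
    # DirectRunner executes the pipeline locally; Beam may still use
    # multiprocessing under the hood when it sees the opportunity.
    "--runner=DirectRunner",
])

# Build and run the word-count pipeline within its context manager.
with beam.Pipeline(options=options) as pipeline:
    word_counts = (
        pipeline
        # Four mock documents; each document is a space-separated string of words.
        | "CreateDocuments" >> beam.Create([
            "the quick brown fox",
            "the lazy dog",
            "the fox jumps over the lazy dog",
            "the dog sleeps",
        ])
        # Split each document into individual words and flatten the lists
        # into a single list of words rather than a list of lists.
        | "SplitWords" >> beam.FlatMap(lambda doc: doc.split(" "))
        # Map step: emit a <key, value> pair (word, 1) for each word.
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        # Reduce step: group by word (the key) and sum the values.
        | "CountPerWord" >> beam.CombinePerKey(sum)
        # Print each (word, count) result.
        | "PrintResults" >> beam.Map(print)
    )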

Lines 1–2 import the apache-beam library and the PipelineOptions class to configure the running environment.

Lines 5–9 set up the pipeline running environment. Beam can automatically figure out when to use multithreading, multiprocessing, or distributed computing depending on the specified runner type. In this small example, since we are running using DirectRunner, Beam assumes this is running on a single machine, so it will find the opportunity to use multiprocessing whenever appropriate. But, if we run the pipeline on a distributed environment such as Apache Flink or Apache Spark, Beam will use multiworker processing, while each worker will then also use multiprocessing if multiple cores are available on each worker. This small technical detail, however, should not prevent us from understanding how MapReduce works in a distributed, parallel environment.

Line 12 starts the context of the Beam pipeline, with the options we specified.

Lines 14–22 initialize the pipeline by creating some mock documents. In this small example, we assume we have four documents. Each document contains words separated by spaces.

Line 24 splits the documents into individual words. Here, FlatMap means applying the word-splitting function to each document and then flattening the results. So instead of returning a list of lists, it returns a flattened list.

Line 26 performs the classic Map step of MapReduce, where it creates the tuple (word, 1) for each word in the flattened list. This is often described as the <key, value> map format, where the word is the key and the static value 1 is the value.

Line 28 performs the classic Reduce step of MapReduce, where it aggregates the tuples from the previous step by each key, summing together the values. This is equivalent to .groupby(…).sum() operations in pandas.DataFrame.

Line 30 prints the results of the pipeline.

After executing the above pipeline, we have the following printout that counts the number of occurrences of each word:

If the number and size of documents are both very large, we could imagine that each document will be read, split, mapped, aggregated, and returned as the above summary by each worker node. Then, the driver node can do further merging of the aggregated results from each worker to output the final results.

Figure 1.6 Comparison of parallel computing paradigms. Vectorization stores data within a single registry of the processing unit and processes multiple data representations on the registry using a single instruction simultaneously. Multithreading needs to switch between different threads of the processing unit and instructions are executed concurrently. Multiprocessing initiates multiple processes across different computing cores or different processing units on the machine, where data are communicated through shared memory. Multiworker distributed computing uses multiple machines to process a small piece of data, where the data are hosted on distributed file systems and communicated through networking.

Figure 1.6 summarizes and compares different paradigms of parallel computing we have discussed so far.

I hope this section provides enough information on the different paradigms of parallel computing. It can be deduced that the multithreading, multiprocessing, and multiworker paradigms are agnostic to the exact instructions of the program, whereas vectorization manipulates specific data using dedicated algorithms. Therefore, it is reasonable to expect that by combining vectorization with other types of parallel computing techniques, such as multiprocessing and multiworker distributed computing, we can maximize our computing efficiency given limited resources.

Bibliography

Apache Foundation. Apache Beam. https://beam.apache.org/, 2023a.

Apache Foundation. Support Pandas==2.x in Apache Beam. #27221. https://github.com/apache/beam/issues/27221, June 2023b.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008. ISSN 0001-0782, 1557-7317. doi: 10.1145/1327452.1327492.

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, October 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-05172-4.

Intel. Get Outstanding Computational Performance without a Specialized Accelerator. Intel Technology Brief, July 2022.

Evgueny Khartchenko. Vectorization: A Key Tool To Improve Performance On Modern CPUs. https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-a-key-tool-to-improve-performance-on-modern-cpus.html, January 2018.

NVIDIA. NVIDIA’s Next Generation CUDA™ Compute Architecture. Technical report, 2012.

Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.

2 Basic Tensor Operations

Numerical libraries like NumPy [Harris et al., 2020], Tensorflow [Abadi et al., 2015], and PyTorch [Paszke et al., 2019] implement various types of vectorized operations to facilitate scientific computing. Throughout this book, we define the following set of terminologies: if a variable is called a

  • scalar, then the variable has zero dimensions and represents a single numerical value.

  • array, or vector, then the variable has one dimension and represents a list of values.

  • matrix