Enables readers to develop foundational and advanced vectorization skills for scalable data science and machine learning, and to address real-world problems
Offering insights across various domains such as computer vision and natural language processing, Vectorization covers the fundamental topics of vectorization including array and tensor operations, data wrangling, and batch processing. This book illustrates how the principles discussed lead to successful outcomes in machine learning projects, serving as concrete examples for the theories explained, with each chapter including practical case studies and code implementations using NumPy, TensorFlow, and PyTorch.
Each chapter has one or two types of content: an introduction to and comparison of the relevant operations in the numerical libraries (presented as tables), and/or case study examples that apply the concepts introduced to solve a practical problem (as code blocks and figures). Readers can approach the material by reading the text description, running the code blocks, or examining the figures.
Written by the developer of the first recommendation system on the Peacock streaming platform, Vectorization explores sample topics including:
From the essentials of vectorization to the subtleties of advanced data structures, Vectorization is an ideal one-stop resource for both beginners and experienced practitioners, including researchers, data scientists, statisticians, and other professionals in industry, who seek academic success and career advancement.
Page count: 425
Year of publication: 2024
Cover
Table of Contents
Title Page
Copyright
Dedication
About the Author
Preface
Acknowledgment
1 Introduction to Vectorization
1.1 What Is Vectorization
1.2 Case Study: Dense Layer of a Neural Network
1.3 Vectorization vs. Other Parallel Computing Paradigms
Bibliography
2 Basic Tensor Operations
2.1 Tensor Initializers
2.2 Data Type and Casting
2.3 Mathematical Operations
2.4 Reduction Operations
2.5 Value Comparison Operations
2.6 Logical Operations
2.7 Ordered Array-Adjacent Element Operations
2.8 Array Reversing
2.9 Concatenation, Stacking, and Splitting
2.10 Reshaping
2.11 Broadcasting
2.12 Case Studies
Bibliography
3 Tensor Indexing
3.1 Get Values at Index
3.2 Slicing
3.3 Case Study: Get Consecutive Index
3.4 Take and Gather
3.5 Assign Values at Index
3.6 Put and Scatter
3.7 Case Study: Batchwise Scatter Values
Bibliography
4 Linear Algebra
4.1 Tensor Multiplications
4.2 The matmul Operation
4.3 The tensordot Operation
4.4 Einsum
4.5 Case Study: Pair-wise Pearson’s Cross-Correlation
4.6 Case Study: Hausdorff Distance
4.7 Common Linear Algebraic Routines
4.8 Case Study: Fitting Single Exponential Curves
Bibliography
5 Masking and Padding
5.1 Masking
5.2 Padding
5.3 Advanced Case Studies
Bibliography
6 String Processing
6.1 String Data Types
6.2 String Operations
6.3 Case Study: Parsing DateTime from String Representations
6.4 Mapping Strings to Indices
6.5 Case Study: Factorization Machine
6.6 Regular Expressions (Regex)
6.7 Data Serialization and Deserialization
Bibliography
7 Sparse Matrix
7.1 Scipy's Sparse Matrix Classes
7.2 Sparse Matrix Broadcasting
7.3 Tensorflow’s Sparse Tensors
7.4 PyTorch’s Sparse Matrix
7.5 Sparse Matrix in Other Python Libraries
7.6 When (Not) to Use Sparse Matrix
7.7 Case Study: Sparse Matrix Factorization with ALS
Bibliography
8 Jagged Tensors
8.1 Left Align a Sparse Tensor to Represent Ragged Tensor
8.2 Index to Binary Indicator
8.3 Case Study: Jaccard Similarities Using Sparse Matrix
8.4 Case Study: Batchwise Set Operations
8.5 Case Study: Autoencoders with Sparse Inputs
Bibliography
9 Groupby, Apply, and Aggregate
9.1 Pandas Groupwise Operations
9.2 Reshaping and Windowing of Dense Tensors
9.3 Case Study: Vision Transformer (ViT)
9.4 Bucketizing Values
9.5 Segment-wise Aggregation
9.6 Case Study: EmbeddingBag
9.7 Case Study: Vocal Duration Constrained by Note Duration
9.8 Case Study: Filling of Missing Values in a Sequence
Bibliography
10 Sorting and Reordering
10.1 Sorting Operations
10.2 Case Study: Top-k Using argsort and argpartition
10.3 Case Study: Sort the Rows of a Matrix
10.4 Case Study: Reverse Padded Sequence
10.5 Case Study: Gumbel-Max Sampling with Weights
10.6 Case Study: Sorting Articles Around Anchored Advertisements
Bibliography
11 Building a Language Model from Scratch
11.1 Language Modeling with Transformer
11.2 Pre-LN vs. Post-LN Transformer
11.3 Layer Normalization
11.4 Positional Encoding and Embedding
11.5 Activation Functions in Feedforward Layer
11.6 Case Study: Training a Tiny LLaMA Model for Next Token Prediction
11.7 A Word on AI Safety and Alignment
11.8 Concluding Remarks
Bibliography
Index
End User License Agreement
Chapter 1
Table 1.1 Min–Max normalized scores
Chapter 2
Table 2.1 List of tensor initializers in Python, NumPy, Tensorflow, and PyTo...
Table 2.2 List of data types in Python, NumPy [Harris et al., 2023a], Tensor...
Table 2.3 List of common mathematical operations in Python, NumPy, Tensorflo...
Table 2.4 List of tensor reduction operations in Python, NumPy, Tensorflow, ...
Table 2.5 List of value comparison and thresholding operations in Python, Nu...
Table 2.6 List of logical or boolean tensor operations in Python, NumPy, Ten...
Table 2.7 List of adjacent element cumulative sum and discrete differences t...
Table 2.8 List of array/tensor reversing operations in Python, NumPy, Tensor...
Table 2.9 List of adjacent element cumulative sum and discrete differences t...
Table 2.10 List of tensor reshaping operations in Python, NumPy, Tensorflow,...
Chapter 3
Table 3.1 List of slicing and striding operations with example usages in 1D ...
Table 3.2 List of examples of building slicing and striding configurations u...
Table 3.3 List of take and gather operations in NumPy, Tensorflow, and PyTor...
Table 3.4 List of accumulation operators that can be used with NumPy and PyT...
Table 3.5 List of put and scatter operations in NumPy, Tensorflow, and PyTor...
Chapter 4
Table 4.1 Examples of tensor multiplication operations in NumPy, Tensorflow,...
Table 4.2 Examples of einsum equations and equivalent NumPy operations.
Table 4.3 List of linear algebraic routines in NumPy, Tensorflow, and PyTorc...
Chapter 5
Table 5.1 List of masking and mask indices construction operations.
Chapter 6
Table 6.1 List of string transformation operations in Python, NumPy, Pandas,...
Chapter 7
Table 7.1 Construction, update, and inter-conversion of seven sparse matrix ...
Table 7.2 Performance comparisons of all seven sparse matrix data structures...
Table 7.3 PyTorch Sparse tensor comparisons.
Chapter 9
Table 9.1 Example user-movie viewing hours dataset.
Table 9.2 List of groupwise aggregation operations in NumPy, Tensorflow, and...
Chapter 10
Table 10.1 List of sorting operations in NumPy, Tensorflow, and PyTorch.
Chapter 1
Figure 1.1 CPU vector computations.
Figure 1.2 Multilayer perceptron.
Figure 1.3 Feed-forward layer.
Figure 1.4 Reading remote database (a) with single thread and (b) with multi...
Figure 1.5 Driver and workers in distributed computing. A driver (or main) n...
Figure 1.6 Comparison of parallel computing paradigms. Vectorization stores ...
Chapter 2
Figure 2.1 Concatenation and stacking in 1D vectors.
Figure 2.2 Concatenation and stacking in 2D matrices.
Figure 2.3 Splitting operations.
Figure 2.4 Reshaping a vector to a matrix before taking averages.
Figure 2.5 Concept of order in reshaping.
Figure 2.6 Scalar to vector broadcasting.
Figure 2.7 Vector to matrix broadcasting.
Figure 2.8 Broadcasting between tensor with singleton dimension and non-sing...
Figure 2.9 2x2 magic squares cannot be constructed.
Figure 2.10 Luo Shu, the earliest known third-order magic square.
Figure 2.11 Example of constructing a third-order magic square using the Sia...
Figure 2.12 Deconstructing odd-order Siamese magic squares into a sum of two...
Figure 2.13 The “X” method for constructing doubly even order magic squares....
Figure 2.14 Steps of implementing the X-method for constructing doubly even ...
Figure 2.15 LUX: order of filling out a grid.
Figure 2.16 Examples of constructing order 6 and 10 magic squares using the ...
Chapter 3
Figure 3.1 Illustration of integer indexing using positive (regular) and equ...
Figure 3.2 Indexing a matrix. (a) Row-column index. Here, “row” refers to th...
Figure 3.3 Flat and multi(dimensional) indices. There are two conventions to...
Figure 3.4 Illustration of extracting indices that satisfy boolean condition...
Figure 3.5 Examples of 2D matrix slicing and striding in NumPy, Tensorflow, ...
Figure 3.6 Examples of 3D tensor slicing and striding in NumPy, Tensorflow, ...
Figure 3.7 Example usage of take operation along a 1D vector.
Figure 3.8 Example usage of flat take operation on a 2D matrix.
Figure 3.9 Example usage of multi-dimensional take operation on a vector.
Figure 3.10 Illustrated examples of torch.gather operation on a 2D matrix. (...
Figure 3.11 Illustration of multi-element multi-indexing using -dimensional...
Figure 3.12 Illustration of tf.gather_nd with nonzero batch_dims argument. T...
Figure 3.13 Illustration of np.put operation on a 1D vector target.
Figure 3.14 Illustrated example of np.put_along_axis.
Figure 3.15 Illustration of scattering values over a 2D matrix using multi-i...
Figure 3.16 Illustration of tf.scatter_nd over a 1D target tensor.
Figure 3.17 Illustration of scattering 2D slices over a 3D tensor.
Figure 3.18 Illustration of scattering slices into multiple batch dimensions...
Figure 3.19 Illustration of torch.scatter_reduce on 2D matrices. (a) Scatter...
Figure 3.20 A step-by-step illustration of the batchwise scattering process ...
Chapter 4
Figure 4.1 Illustration of an example tensordot operation between tensors of...
Figure 4.2 Illustration of tensordot operations when axes=0. (a) For a p...
Figure 4.3 Illustration of tensordot operations when axes=1. (a) For a p...
Figure 4.4 Illustration of tensordot operations when axes=2. (a) For a p...
Figure 4.5 Fiducial markers used to track head positions. (a) The original t...
Figure 4.6 Fit single exponential to noisy data points.
Chapter 5
Figure 5.1 Example uses of tril and triu operations.
Figure 5.2 Example usage of tf.linalg.band_part. It is impossible to create ...
Figure 5.3 Before and after thresholding to remove background.
Figure 5.4 Boolean tensor masking and compressing.
Figure 5.5 Masking top K values. (a) Given an example batch of arrays, we wo...
Figure 5.6 Example illustration of constant padding.
Figure 5.7 Illustration of constant padding order.
Figure 5.8 Example illustration of reflect padding. Dashed lines indicate ax...
Figure 5.9 Example illustration of symmetric padding. Dashed lines indicate ...
Figure 5.10 Example illustration of edge or replicate padding.
Figure 5.11 Example illustration of wrap or circular padding mode.
Figure 5.12 Example of same vs. valid padding in convolutional operations.
Figure 5.13 Bidirectional vs. causal self-attentions.
Figure 5.14 Attention mask. (a) Bidirectional attention mask, which allows t...
Figure 5.15 Building a ragged range vector. (a) Create the boolean mask by b...
Figure 5.16 Example effect of Length Regulator module on the four-word sente...
Chapter 6
Figure 6.1 Training results of Deep Factorization Machine.
Chapter 7
Figure 7.1 Example user-item rating matrix in a typical recommendation syste...
Figure 7.2 coo_matrix example.
Figure 7.3 csc_matrix example.
Figure 7.4 csr_matrix example.
Figure 7.5 bsr_matrix example.
Figure 7.6 dok_matrix example.
Figure 7.7 lil_matrix example.
Figure 7.8 dia_matrix example.
Figure 7.9 Row-wise broadcasting example.
Figure 7.10 Column-wise broadcasting example.
Figure 7.11 Generalized compressed sparse tensor.
Figure 7.12 Illustration of matrix multiplication on sparse indices.
Figure 7.13 ALS training loss.
Chapter 8
Figure 8.1 Comparison of three ways of representing a jagged tensor. (a) A d...
Figure 8.2 Before (left) and after (right) of left aligning the sparse matri...
Figure 8.3 Example illustration of converting a set of variable-length lists...
Figure 8.4 Illustration of the logic to compute pair-wise Jaccard Similarity...
Figure 8.5 Creating a csr_matrix binary indicator representation using indic...
Figure 8.6 Illustration of an alternative way of computing flattened ragged ...
Figure 8.7 Autoencoder with categorical inputs represented by a sparse vecto...
Chapter 9
Figure 9.1 Downsampled time series from the code example.
Figure 9.2 Schematic illustration of reshape and reduce operations used to i...
Figure 9.3 Mean-removed time series from the code example.
Figure 9.4 Schematic illustration of broadcasting used to implement groupby-...
Figure 9.5 Sliding window of size 3 and stride 2.
Figure 9.6 Illustration of implementation logic of sliding_window_tf using T...
Figure 9.7 Moving average of the time series from the code example.
Figure 9.8 Sinusoidal positional encoding of the transformer architecture.
Figure 9.9 ViT architecture overview.
Figure 9.10 Transformer pre-LN architecture. After blocks of encoder layer...
Figure 9.11 Example illustration of left-inclusive and left-exclusive maskin...
Figure 9.12 Illustration of segment-wise summation operation, with each colo...
Figure 9.13 “Eyes On Me” sheet music from my previous SVS project.
Figure 9.14 Example note duration constraints by group.
Figure 9.15 Implementation logic for ffill. (a) Illustration of the input da...
Chapter 10
Figure 10.1 Schematic illustration of the effect of partitioning the array. ...
Figure 10.2 Performance of two top-k functions implemented in NumPy, Tensorf...
Figure 10.3 Illustration of using np.unique to sort rows of a matrix. In thi...
Figure 10.4 Illustration of the implementation logic to reverse a padded seq...
Figure 10.5 Evolution of samples tensor when implementing the Gumbel-Max tri...
Figure 10.6 Simulated Gumbel-Max negative sampling frequency when using unif...
Figure 10.7 Simulated Gumbel-Max negative sampling frequency when sampling f...
Figure 10.8 Illustration of arranging news articles around anchored ads.
Figure 10.9 Illustration of converting ads_positions into ads anchored_mask ...
Figure 10.10 Illustration of creating the output_index_ads matrix. (a) Creat...
Figure 10.11 Combine ads and news output index into a single matrix. Notice ...
Chapter 11
Figure 11.1 Learning rate schedule from the original transformer paper.
Figure 11.2 Comparison of Pre- and Post-layer normalization architecture. Po...
Figure 11.3 Comparison of normalization scheme. (a) Batch normalization norm...
Figure 11.4 Sinusoidal positional encoding from the original transformer pap...
Figure 11.5 Tensor2Tensor variant of the sinusoidal positional encoding.
Figure 11.6 Relative positional embedding in T5 model.
Figure 11.7 Rotation matrix of the first nine sequence positions, given embe...
Figure 11.8 Corresponding values in and to be multiplied. Suppose we hav...
Figure 11.9 Comparison of nonlinear activation functions.
Figure 11.10 Training our tiny LLaMA model using WikiText2.
IEEE Press, 445 Hoes Lane, Piscataway, NJ 08854
IEEE Press Editorial Board
Sarah Spurgeon, Editor-in-Chief
Moeness Amin
Jón Atli Benediktsson
Adam Drobot
James Duncan
Ekram Hossain
Brian Johnson
Hai Li
James Lyke
Joydeep Mitra
Desineni Subbaram Naidu
Tony Q. S. Quek
Behzad Razavi
Thomas Robertazzi
Diomidis Spinellis
Edward DongBo Cui
Director of Data Science, DotDash Meredith
Copyright © 2025 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data applied for:
Hardback ISBN: 9781394272945
Cover Design: Wiley
Cover Images: © da-kuk/Getty Images
To my students
Edward Dongbo Cui is a data science and machine learning engineering leader who holds a PhD in neuroscience from Case Western Reserve University, USA. Edward served as Director of Data Science at DotDash Meredith, building large-scale search and recommender systems for online digital publications. At NBC Universal, he served as Director of Data Science, building the first recommendation system on the new Peacock streaming platform. Previously, he was lead data scientist at Nielsen Global Media. He is an expert in ML engineering, research, and MLOps, applying these skills to drive data-centric decision-making and enhance product innovation.
Over the years, I have accrued valuable experiences in implementing machine learning algorithms, and this book is my endeavor to share those insights with you. Machine learning is a vast field, encompassing areas including but not limited to computer vision, speech processing, natural language processing, and recommendation systems. While the diversity of these topics is immense, there are underlying mathematical and computational principles that weave them together.
Structure of the Book: The book is structured into 11 chapters, each dedicated to a pivotal topic in the realm of scientific computation and machine learning:
Chapter 1 delves into the power of vectorization, contextualized within parallel computing.
Chapter 2 lays the foundation by discussing the basic tensor operations.
Chapter 3 details the art of tensor indexing, elucidating how to access individual or subsets of tensor elements.
Chapter 4 probes deeper into vectorization in tensor multiplications and common linear algebraic routines, which form the backbone of many machine learning algorithms.
Chapter 5 sheds light on the pivotal concepts of masking and padding, which come into play when handling data of nonuniform sizes.
Chapter 6 touches upon string processing techniques, invaluable for natural language processing (NLP) tasks.
Chapter 7 introduces the realm of sparse matrices, discussing their data structures and integral operations.
Chapter 8 takes a closer look at a unique kind of sparse tensor – ragged or jagged tensors – and the nuances of processing them.
Chapter 9 breaks down the data wrangling pattern of groupby-apply-aggregate, which is crucial for data summarization.
Chapter 10 explores the topics of sorting, reordering, sampling, and batch processing through a series of case studies.
Chapter 11 concludes the book with a review of the Transformer++ architecture (a.k.a. LLaMA, a foundational large language model architecture from Meta), weaving together the principles discussed in the preceding chapters.
Audience: While I have tailored this book for those embarking on their journey in machine learning algorithm implementation, it is not solely for beginners. Having a basic understanding of machine learning and deep learning will certainly be an advantage, but every concept introduced is supplemented with sufficient background to make it comprehensible to all readers.
Prerequisites:
Familiarity with the Python programming language:
Readers can learn Python through online resources or classic books such as “Think Python.”
Undergraduate linear algebra:
For this, I highly recommend Gilbert Strang’s lectures on Linear Algebra at MIT, or his book “Introduction to Linear Algebra.”
Foundational knowledge of data science and machine learning:
The goal of this book is not to introduce machine learning, but rather to guide the reader in solving real-world problems encountered in machine learning (oftentimes deep learning). Basic knowledge of data science and machine learning can be gained from either previous more foundational courses or practical knowledge gained by doing machine learning projects hands-on. Courses such as Andrew Ng’s CS229 Machine Learning can serve as a foundation.
Features:
Every chapter is punctuated with numerous case studies to ensure the concepts are grounded in practical scenarios. Accompanying these are detailed code implementations with line-by-line explanations. Most of the implementations will use only the knowledge acquired within the chapter. Certainly, as the reader learns along the way, one may come up with better implementations by leveraging more advanced concepts.
The book covers vectorization techniques using NumPy, Tensorflow, and PyTorch (and for certain operations, Pandas as well). When introducing the operations, I will include tables that enumerate equivalent operations in these three libraries, in the hope that readers who are familiar with one of the libraries can also learn to translate algorithms and models implemented in another library. Most case study code examples will be implemented in one of the libraries (often NumPy), but readers are encouraged to implement the same functionality using alternative libraries by referencing equivalent operations.
All code examples are available at github.com/sapphire008/vectorization-book-supplement. Note that some code examples are simulations using random inputs, so rerunning the notebooks will not necessarily get the same results as the ones provided in the book.
To further clarify many of the vectorization operations, illustrative figures have been incorporated to visually represent the logic behind them, often accompanied by detailed explanations either within the text body or in the caption.
The principles discussed in this book should be agnostic to specific versions of Python, NumPy, TensorFlow, or PyTorch. By default, my code examples were tested using python==3.10.6, numpy==1.23.4, pandas==2.1.1, scikit-learn==1.3.0, scipy==1.11.1, tensorflow==2.13.0, keras==2.13.1, and torch==2.1.0. However, I will specify the library versions alongside code examples when the versions used are not the default, or whenever relevant or necessary to emphasize the versions.
Limitations:
Note that the carriage return symbol “↪” within the code block (with shaded background) is used to indicate a line break to accommodate formatting of long lines.
Code examples and case studies implemented to solve certain problems may not always be the most optimal solution, though they honestly represent my best effort at the time of writing this book. Sometimes, I also deliberately chose to implement certain things less efficiently in order to illustrate the practical application of the concepts introduced in the chapter. Because the materials of each chapter in the book build upon each other, I have made an explicit choice not to use functionality that would be introduced in a later chapter. Still, this demonstrates the beauty of vectorization and coding in general: there is not a single solution to a specific problem in practice. Nonetheless, I encourage readers to find more efficient ways to solve the case study problems. On the other hand, if readers find a bug in one of the implementations, please feel free to open an issue at github.com/sapphire008/vectorization-book-supplement.
When comparing the vectorization libraries, it is possible that I am ignorant of certain functionalities specific to a library, even though the information contained in this book is to the best of my current knowledge. I may sometimes claim that one library does not implement functionality that another library implements at the time of writing; if readers find that the library does in fact implement that functionality, I stand corrected and will make my best effort to update this information in future editions of the book.
My hope is that by the end of this book, the readers will be well-versed in the foundational and advanced vectorization techniques of implementing scalable machine learning algorithms and will be ready to tackle real-world challenges.
November 29, 2023
New York
Edward DongBo Cui
Director of Data Science
DotDash Meredith, New York
I express my deepest gratitude to Ryan Beauchemin and Kendall Meyer for their invaluable reviews and contributions, which have greatly enhanced the quality of this book. I also thank Ilmir Moussikaev for believing in my potential and providing me with the opportunity to create something remarkable on the Peacock streaming platform. This specific experience was the cornerstone of my career, leading me to acquire the knowledge necessary to write this book. I also want to extend my heartfelt appreciation to my students. Teaching and mentoring them on various techniques in machine learning and data science provided me with the insights and material needed to write this book. A special thanks to Chris Haroun, whose inspiration and guidance motivated me to embark on this writing endeavor. Lastly, I am immensely thankful to my wife for her unwavering support throughout this journey.
Vectorization is a type of parallel computing paradigm that performs arithmetic operations on an array of numbers [Intel, 2022] within a single data processing unit (either CPU or GPU). Modern CPUs can perform between 4 and 16 single precision (float32) parallel computations, depending on the type of instructions [Intel, 2022] (see Figure 1.1).
Here, SSE, or streaming SIMD (single-instruction-multiple-data) extensions, has 128-bit registers; each register stores 128 bits of data for computations. This is equivalent to performing four single-precision (float32) operations or two double-precision (float64) operations at once. Intel’s AVX512, or Advanced Vector Extensions 512, on the other hand, has 512-bit registers and can therefore process 512 bits of data simultaneously. This is equivalent to performing 16 single-precision or 8 double-precision operations.
The term “Vectorization” also refers to the process of transforming an algorithm computable on one data point at a time to an operation that calculates a collection of data simultaneously [Khartchenko, 2018]. A classic example is the dot product operation,

$$c = \sum_{i=1}^{n} a_i b_i = \mathbf{a} \cdot \mathbf{b},$$

where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$, and $c \in \mathbb{R}$.
The dot product operation on the right-hand side of the equation is the vectorized form of the summation process on the left. Instead of operating on one element of the array at a time, it operates on the collections or arrays, $\mathbf{a}$ and $\mathbf{b}$.
Figure 1.1 CPU vector computations.
Another definition of the term “vectorization” is related to the more recent advances in large language models: using a vector or an array of values to represent a word. This is more commonly known as embedding in the literature, a usage that originated from the word2vec model in natural language processing. In Chapters 6 and 7, we will introduce the idea of representing string or categorical features as numerical values, i.e. integer indices, sparse one-hot encoding vectors, and dense embedding vectors.
Suppose we want to take the dot product of 1 million random numbers drawn from the uniform distribution. Using Python’s for loop, we would typically implement this algorithm as follows:
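A minimal pure-Python sketch of this idea (the function and variable names here are illustrative assumptions) might look like:

```python
import random

# One million random numbers drawn from the uniform distribution.
a = [random.random() for _ in range(1_000_000)]
b = [random.random() for _ in range(1_000_000)]

def dot_product_loop(a, b):
    """Accumulate the element-wise products one pair at a time."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

# In a Jupyter Notebook, the runtime can be measured with:
# %timeit dot_product_loop(a, b)
```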
On my M1 MacBook Pro under Python 3.11.4, this is the runtime of the above function:
If we implement the algorithm using numpy with vectorization
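A corresponding NumPy sketch (again with assumed names) could be:

```python
import numpy as np

# The same kind of data, now as NumPy arrays.
a_np = np.random.rand(1_000_000)
b_np = np.random.rand(1_000_000)

def dot_product_numpy(a, b):
    """Vectorized dot product: a single call operates on the whole arrays."""
    return np.dot(a, b)

# %timeit dot_product_numpy(a_np, b_np)
```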
Under the same setup, we have the following runtime of the vectorized implementation.
This is a ∼37× speed-up by vectorization!
However, the above example should not obscure the fact that certain native Python operations are still faster than NumPy’s implementations, even though NumPy is often used to vectorize away for-loops in Python.
For example, we would like to split a list of strings delimited by periods “.”. Using Python’s list comprehension, we have
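A minimal sketch using made-up example data might be:

```python
# Hypothetical example data; the book's actual inputs are assumptions here.
strings = ["lorem.ipsum.dolor", "sit.amet", "consectetur.adipiscing.elit"] * 100_000

# Native Python: split each string on "." with a list comprehension.
split_python = [s.split(".") for s in strings]
# %timeit [s.split(".") for s in strings]
```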
Line 2 runs for
However, in comparison, if we use NumPy’s “vectorized” split operation
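A corresponding sketch with NumPy's np.char.split could be:

```python
import numpy as np

# The same hypothetical strings, as a NumPy array.
strings_np = np.array(["lorem.ipsum.dolor", "sit.amet", "consectetur.adipiscing.elit"] * 100_000)

# NumPy's "vectorized" string split.
split_numpy = np.char.split(strings_np, sep=".")
# %timeit np.char.split(strings_np, sep=".")
```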
Line 3 runs for
which is about ∼4.0× slower than native Python.
When using vectorized operations, we need to consider the trade-off between memory and speed. Sometimes, vectorized operations would need to allocate more memory for results from intermediate steps. Therefore, this has implications on the overall performance of the program, in terms of space vs. time complexity. Consider a function in which we would like to compute the sum of the squares of the elements in an array. This can be implemented as a for-loop, where we iterate through each element, take the square, and accumulate the square values as the total. Alternatively, the function can also be implemented as a vectorized operation where we take the square of each element first, store the results in an array, and then sum the squared vector.
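A minimal sketch of the two implementations (the file and function names are assumptions) could be:

```python
# memory_allocation_example.py -- hypothetical file name referenced below.
import numpy as np

X = np.random.rand(1_000_000)

def sum_of_squares_loop(X):
    """Iterate over the elements, squaring and accumulating one value at a time."""
    total = 0.0
    for x in X:
        total += x * x
    return total

def sum_of_squares_vectorized(X):
    """Square every element first (allocating an intermediate array), then sum."""
    squared = X ** 2
    return squared.sum()
```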
Let us use the memory_profiler package to examine the memory usage of each line of the two functions. We can first install the package by pip install memory_profiler==0.61.0 and load it as a plugin in a Jupyter Notebook.
We also save the above script into a file (which we call memory_allocation_example.py), then we measure the memory trace of each function in the notebook as follows:
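A sketch of that notebook workflow, assuming the hypothetical file and function names above, might be:

```python
# Load the memory_profiler IPython extension in the notebook.
%load_ext memory_profiler

from memory_allocation_example import X, sum_of_squares_loop, sum_of_squares_vectorized

# Line-by-line memory trace of each function.
%mprun -f sum_of_squares_vectorized sum_of_squares_vectorized(X)
%mprun -f sum_of_squares_loop sum_of_squares_loop(X)
```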
This gives us the following line-by-line memory trace of the function call:
We can also use %timeit magic function to measure the speed of the function:
which gives
Similarly, let us also look at the for-loop implementation
And we also test the run time as
which gives
We can see that although the for-loop function is slower than the vectorized function, the for-loop had almost no memory increase. In contrast, the vectorized solution shows a ∼1 MiB increase in memory usage while reducing the run time by about 30%. If X were several orders of magnitude larger than our current example, even more memory would be consumed in exchange for speed. Hence, we need to consider carefully the trade-off between speed and memory (or time vs. space) when using vectorized operations.
Vectorization is a key skill for implementing various machine learning models, especially for deep learning algorithms. A simple neural network consists of several feed-forward layers, as illustrated in Figure 1.2.
A feed-forward layer (also called dense layer, linear layer, perceptron layer, or hidden layer, as illustrated in Figure 1.3) has the following expression:

$$y = \sigma(xW + b)$$

Notice that $xW + b$ is a linear model. Here,

$x$ is the input, typically of shape (batch_size, num_inputs)
$W$ is the weight (a.k.a. kernel) parameter matrix of shape (num_inputs, hidden_size)
$b$ is the bias parameter vector of shape (1, hidden_size)
Figure 1.2 Multilayer perceptron.
Figure 1.3 Feed-forward layer.
$\sigma$ is an activation function that is applied to every element of the resulting linear model operation. It can be, e.g., relu, sigmoid, or tanh.
$y$ is the output of shape (batch_size, hidden_size)
Stacking multiple feed-forward layers creates the multilayer perceptron model, as illustrated by our example neural network above.
If we were to implement a feed-forward layer naively using only Python, we would need to run thousands to millions of iterations of a for-loop, depending on the batch_size, num_inputs, and hidden_size. When it comes to large language models like the generative pretrained transformer (GPT), which have hundreds of billions of parameters to optimize, using for loops for such computations becomes so slow that training or making an inference with the model is infeasible. But if we take advantage of vectorization on either CPU or GPU, we see significant gains in the performance of our model. Furthermore, certain operations like matrix multiplication are simply inefficient to perform if implemented naively by parallelizing a for-loop, so in practice they must rely on special algorithms such as Strassen’s Algorithm [Strassen, 1969] (or, more recently, a new algorithm discovered by AlphaTensor [Fawzi et al., 2022]). Implementing these algorithms as instructions on either CPU or GPU is a non-trivial task in itself. Practitioners of machine learning and data science should therefore take advantage of pre-existing libraries that implement these algorithms. But to use these libraries efficiently, one needs to be adept at thinking in terms of vectorization.
In the following, let’s take a look at an example of how to implement the above perceptron layer using numpy.
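A minimal NumPy sketch of such a layer (names are assumptions) might look like the following, arranged so that the line references in the notes below roughly correspond:

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(X, W, b):
    """Feed-forward (dense) layer: sigmoid(X @ W + b)."""
    # Linear model; b of shape (1, hidden_size) is broadcast over the batch.
    z = X @ W + b
    # Apply the sigmoid activation element-wise to produce the output.
    return sigmoid(z)
```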
Lines 3–5 define the sigmoid activation function, $\sigma(z) = 1 / (1 + e^{-z})$.
Line 10 implements the linear model $z = xW + b$.
– Here, the @ operator is the matrix multiplication operator that is recognized by typical vector libraries like numpy, tensorflow, and torch. We will discuss more on matrix operations in Chapter 4.
– The result of multiplying $x$ and $W$ should be a matrix of shape (batch_size, hidden_size), but in this implementation, we have $b$ of shape (1, hidden_size). How can a vector be added to a matrix? In Chapter 2, we will explore the concept called broadcasting.
Line 12 applies the sigmoid activation we defined previously to compute the final output of this dense layer. This is an element-wise operation, where we apply the sigmoid function to each element of the matrix. We will explore the idea of element-wise operations in Chapter 2.
As you can see, vectorization can greatly simplify the representation of neural networks operating on high-dimensional arrays with a large number of elements or data points.
Knowing that vectorization is a type of parallel computing paradigm, it is natural to ask the question: how does vectorization compare to other parallel computing paradigms such as multithreading, multiprocessing, or multiworker distributed computing?
Like vectorization, multithreading operates on a single processing unit, with shared memory. This can be a single core of a CPU or GPU. Vectorization processes multiple data inputs within the register in parallel, given a single instruction. In contrast, multithreading can only process one stream of data at a time, instead of processing data in parallel. It can switch between different threads of the same program (or sometimes different threads of different programs), with each thread representing a different part of the process. This is referred to as “concurrency.” Of course, multithreading can be used in combination with vectorization. In fact, modern GPUs operate by combining vectorization, multithreading, and multiprocessing to increase their processing power [NVIDIA, 2012].
The multithreading paradigm is quite advantageous in tasks like reading data from remote servers. For example, suppose a database has 10 parts. If we have only a single thread, then for each part of the data to be read, we need to wait one minute until that part is downloaded before reading the next part. It will therefore take 10 minutes in total to download the entire database. However, if we have 10 threads, each can make a request to read one part of the data. Since the time to make a request is negligible compared to the time to download the data, each thread takes one minute to download its part. And because the downloads proceed in parallel with each other, it takes only one minute in total to download the entire database. See Figure 1.4 for an illustration.
Multiprocessing, on the other hand, operates on multiple processing units, e.g. multi-core CPUs or GPUs. Each core of the processing unit can access data through shared memory (the RAM). Unlike vectorization and multithreading, multiprocessing does not share the same data between the processes. Therefore, to process the same data in parallel using different instructions, multiprocessing needs to duplicate the data across the cores. Unlike multithreading, multiprocessing will not be efficient if the process is bound by I/O (i.e. data reading and writing). Conversely, multiprocessing is a great choice if the process is bound by processing power (a.k.a. CPU bound) and can greatly speed up the process by distributing the computational load to multiple cores of the processing unit.
Figure 1.4 Reading remote database (a) with single thread and (b) with multithreading.
A typical use of multiprocessing is to perform in-memory batch data processing. Suppose we want to compute the min–max normalized score for a bunch of users:
We can define the following user-defined function (UDF) to be used to compute each user’s normalized score:
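A minimal sketch of such a UDF, assuming a DataFrame with columns "Id" and "Scores" as in Table 1.1, could be:

```python
import pandas as pd

def apply_func(group: pd.DataFrame) -> pd.DataFrame:
    """Min-max normalize the Scores within one user's group."""
    group = group.copy()
    scores = group["Scores"]
    group["Scores"] = (scores - scores.min()) / (scores.max() - scores.min())
    return group
```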
Each grouped batch based on the user’s id will be transformed using the apply_func:
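With the assumed DataFrame df and column names from above, the groupwise transformation might look like:

```python
# Group the rows by user id and min-max normalize each group's scores.
normalized = df.groupby("Id", group_keys=False).apply(apply_func)
```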
This takes about 1.5 ms on my MacBook Pro M1 machine with 8 cores, yielding the following results in Table 1.1.
We can perform the same transformation using multiprocessing, where each process will operate on each of the grouped batches of data. We first define the helper function apply_parallel which will use Python’s joblib to create the multiprocessing environment.
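A minimal sketch of such a helper using joblib (the signature is an assumption) could be:

```python
import pandas as pd
from joblib import Parallel, delayed

def apply_parallel(grouped_df, func, n_jobs=8):
    """Apply `func` to each group in a separate worker and concatenate the results."""
    results = Parallel(n_jobs=n_jobs)(
        delayed(func)(group) for _, group in grouped_df
    )
    return pd.concat(results)
```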
Table 1.1 Min–Max normalized scores

Id   Scores
A    0.9
A    0.85
A    1
A    0
B    0.809524
B    0.904762
B    1
B    0.952381
B    0
C    0
C    0.677419
C    0.612903
C    1
C    0.935484
C    0.774194
C    0.354839
C    0.516129
C    0.0645161
We then use this helper function to perform our batchwise transformations
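For example, with the assumed names from above:

```python
# Batchwise transformation, one worker per user group.
normalized_parallel = apply_parallel(df.groupby("Id"), apply_func)
```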
This takes 677 ms, which is much slower than using the native pandas.DataFrame.apply function. This is because setting up multiprocessing has an overhead, so most of the time is spent setting up the multiprocessing environment instead of performing the actual computation. However, when the size of the data and the number of batches are large, we can expect that the additional overhead can be worthwhile for the overall speed gain from multiprocessing. Let’s now instead randomly generate 100,000 rows of data for our transformation.
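A sketch of generating such a larger random dataset (column names and the number of distinct ids are assumptions) might be:

```python
import numpy as np
import pandas as pd

# 100,000 rows spread across many hypothetical user ids.
n_rows = 100_000
df_large = pd.DataFrame({
    "Id": np.random.choice([f"user_{i}" for i in range(1_000)], size=n_rows),
    "Scores": np.random.rand(n_rows),
})
```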
Figure 1.5 Driver and workers in distributed computing. A driver (or main) node issues a series of commands to create workers and assigns each worker a task to process a portion of the data. Once each worker completes their own task, the worker will send the results back to the driver for aggregation. The driver then waits for all the workers to finish their tasks before returning the results to the user.
If we use pandas.DataFrame.apply, the transformation will take about 19 seconds. If we use apply_parallel, it now takes 16 seconds to perform the transformation, a ∼16% gain in processing speed.
In modern big data analytics, practitioners often need to process data ranging anywhere between several gigabytes and hundreds of terabytes in size (and this is often only the compressed size on disk; the size of the data occupying the memory can be much larger). No single consumer-grade laptop or desktop computer can hold hundreds of gigabytes to terabytes of data in memory, nor is any single CPU or GPU fast enough to process this much data without the user waiting hours to days (well, at least no such powerful processing unit exists yet as I am writing this book). Data analysts, data scientists, and machine learning engineers instead rely on cloud technologies to host such a large amount of data on distributed file storage systems and process the data using distributed computing. The term “distributed” here simply means multiple machines are storing, accessing, and processing the data, with the machines communicating with each other via networking. Figure 1.5 illustrates a multiworker distributed computing system.
When dealing with the task of processing a large amount of data, a “divide and conquer” approach is key. The idea is that if it takes 100 hours for a single computer to process our data, then 100 computers can finish the same task in about one hour. If we can somehow divide such a large amount of data into smaller pieces, let each computing node process one small piece of data, and finally aggregate the results from all the parallel workers at the end, we can greatly reduce the physical (or wall) time we need to wait for our data processing task to finish. In fact, this is the core idea behind MapReduce [Dean and Ghemawat, 2008].
A “Hello World” example often used to illustrate MapReduce is the task of counting the number of occurrences of each word in a large body of documents. In this example, we will use Apache Beam [Apache Foundation, 2023a] to illustrate how to write a MapReduce algorithm for word count. We can install apache-beam with pip install apache-beam (I am using apache-beam==2.49.0; note that Apache Beam does not support pandas>=2.0.0 [Apache Foundation, 2023b] at the time of writing, so I am using pandas==1.5.3 in this example).
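A condensed Apache Beam sketch of such a word-count pipeline (the mock documents and step labels are assumptions; the numbered notes below walk through the same steps in order) could be:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Run locally with the DirectRunner.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Create some mock documents, each containing words separated by spaces.
        | "CreateDocuments" >> beam.Create([
            "the quick brown fox",
            "jumps over the lazy dog",
            "the dog barks",
            "the fox runs",
        ])
        # Split each document into words and flatten the results.
        | "SplitWords" >> beam.FlatMap(lambda doc: doc.split(" "))
        # Map step: emit a <key, value> pair (word, 1) for each word.
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        # Reduce step: sum the values for each key (word).
        | "CountPerWord" >> beam.CombinePerKey(sum)
        # Print the resulting (word, count) tuples.
        | "PrintResults" >> beam.Map(print)
    )
```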
Lines 1–2 import the apache-beam library and the PipelineOptions class to configure the running environment.
Lines 5–9 set up the pipeline running environment. Beam can automatically figure out when to use multithreading, multiprocessing, or distributed computing depending on the specified runner type. In this small example, since we are running using DirectRunner, Beam assumes this is running on a single machine, so it will find the opportunity to use multiprocessing whenever appropriate. But, if we run the pipeline on a distributed environment such as Apache Flink or Apache Spark, Beam will use multiworker processing, while each worker will then also use multiprocessing if multiple cores are also available on each worker. This small technical detail, however, should not prevent us from understanding how MapReduce works in a distributed, parallel environment.
Line 12 starts the context of the Beam pipeline, with the options we specified.
Lines 14–22 initialize the pipeline by creating some mock documents. In this small example, we assume we have four documents. Each document contains words separated by spaces.
Line 24 splits the documents into individual words. Here, FlatMap means applying the word-splitting function to each document and then flattening the results. So instead of returning a list of lists, it returns a flattened list.
Line 26 performs the classic Map step of MapReduce, where it creates the tuple (word, 1) for each word in the flattened list. This is often described as the <key, value> map format, where the word is the key and the static value 1 is the value.
Line 28 performs the classic Reduce step of MapReduce, where it aggregates the tuples from the previous step by each key, summing together the values. This is equivalent to .groupby(…).sum() operations in pandas.DataFrame.
Line 30 prints the results of the pipeline.
After executing the above pipeline, we have the following printout that counts the number of occurrences of each word:
If the number and size of documents are both very large, we could imagine that each document will be read, split, mapped, aggregated, and returned as the above summary by each worker node. Then, the driver node can do further merging of the aggregated results from each worker to output the final results.
Figure 1.6 Comparison of parallel computing paradigms. Vectorization stores data within a single registry of the processing unit and processes multiple data representations on the registry using a single instruction simultaneously. Multithreading needs to switch between different threads of the processing unit and instructions are executed concurrently. Multiprocessing initiates multiple processes across different computing cores or different processing units on the machine, where data are communicated through shared memory. Multiworker distributed computing uses multiple machines to process a small piece of data, where the data are hosted on distributed file systems and communicated through networking.
Figure 1.6 summarizes and compares different paradigms of parallel computing we have discussed so far.
I hope this section provides enough information on the different paradigms of parallel computing. It can be deduced that the multithreading, multiprocessing, and multiworker paradigms are agnostic to the exact instructions of the program, whereas vectorization manipulates specific data using dedicated algorithms. Therefore, it is reasonable to expect that by combining vectorization with other types of parallel computing techniques such as multiprocessing and multiworker distributed computing, we can maximize our computing efficiency given limited resources.
Apache Foundation. Apache Beam. https://beam.apache.org/, 2023a.
Apache Foundation. Support Pandas==2.x in Apache Beam. #27221. https://github.com/apache/beam/issues/27221, June 2023b.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008. ISSN 0001-0782, 1557-7317. doi: 10.1145/1327452.1327492.
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, October 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-05172-4.
Intel. Get Outstanding Computational Performance without a Specialized Accelerator. Intel Technology Brief, July 2022.
Evgueny Khartchenko. Vectorization: A Key Tool To Improve Performance On Modern CPUs. https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-a-key-tool-to-improve-performance-on-modern-cpus.html, January 2018.
NVIDIA. NVIDIA’s Next Generation CUDA™ Compute Architecture. Technical report, 2012.
Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.
Numerical libraries like NumPy [Harris et al., 2020], Tensorflow [Abadi et al., 2015], and PyTorch [Paszke et al., 2019] implement various types of vectorized operations to facilitate scientific computing. Throughout this book, we define the following set of terminologies: if a variable is called a
scalar, then the variable has zero dimensions and represents a single numerical value.
array, or vector, then the variable has one dimension and represents a list of values.
matrix