Want to jump into data science but don't know where to start? Let's be real: data science is often presented as something mystical, unattainable without the most powerful software, hardware, and data expertise. But real data science isn't about technology; it's about how you approach the problem. In this updated edition of Data Smart: Using Data Science to Transform Information into Insight, award-winning data scientist and bestselling author Jordan Goldmeier shows you how to solve data science problems using Excel while exposing how things work behind the scenes. Data Smart is your field guide to building statistics, machine learning, and powerful artificial intelligence concepts right inside your spreadsheet. Inside you'll find:

* Four-color data visualizations that highlight and illustrate the concepts discussed in the book
* Tutorials explaining complicated data science using just Microsoft Excel
* How to take what you've learned and apply it to everyday problems at work and in life
* Advice for using formulas, Power Query, and some of Excel's latest features to solve tough data problems
* Smart data science solutions for common business challenges
* Explanations of what algorithms do, how they work, and what you can tweak to take your Excel skills to the next level

Data Smart is a must-read for students, analysts, and managers ready to become data-science savvy and share their findings with the world.
Page count: 578
Cover
Table of Contents
Title Page
Introduction
What Am I Doing Here?
What Is Data Science?
Do Data Scientists Actually Use Excel?
Conventions
Let's Get Going
Notes
1 Everything You Ever Needed to Know About Spreadsheets but Were Too Afraid to Ask
Some Sample Data
Accessing Quick Descriptive Statistics
Excel Tables
Lookup Formulas
PivotTables
Using Array Formulas
Solving Stuff with Solver
Notes
2 Set It and Forget It: An Introduction to Power Query
What Is Power Query?
Sample Data
Starting Power Query
Filtering Rows
Removing Columns
Find & Replace
Close & Load to…Table
Note
3 Naïve Bayes and the Incredible Lightness of Being an Idiot
The World's Fastest Intro to Probability Theory
Separating the Signal and the Noise
Using the Bayes Rule to Create an AI Model
Let's Get This Excel Party Started
Notes
4 Cluster Analysis Part 1: Using K-Means to Segment Your Customer Base
Dances at Summer Camp
Getting Real: K-Means Clustering Subscribers in Email Marketing
K-Medians Clustering and Asymmetric Distance Measurements
5 Cluster Analysis Part II: Network Graphs and Community Detection
What Is a Network Graph?
Visualizing a Simple Graph
Building a Graph from the Wholesale Wine Data
Introduction to Gephi
How Much Is an Edge Worth? Points and Penalties in Graph Modularity
Let's Get Clustering!
There and Back Again: A Gephi Tale
6 Regression: The Granddaddy of Supervised Artificial Intelligence
Predicting Pregnant Customers at RetailMart Using Linear Regression
Predicting Pregnant Customers at RetailMart Using Logistic Regression
Note
7 Ensemble Models: A Whole Lot of Bad Pizza
Getting Started Using the Data from Chapter 6
Bagging: Randomize, Train, Repeat
Boosting: If You Get It Wrong, Just Boost and Try Again
Note
8 Forecasting: Breathe Easy: You Can't Win
The Sword Trade Is Hopping
Getting Acquainted with Time-Series Data
Starting Slow with Simple Exponential Smoothing
You Might Have a Trend
Holt's Trend-Corrected Exponential Smoothing
Multiplicative Holt-Winters Exponential Smoothing
Forecast Sheets in Excel
9 Optimization Modeling: Because That “Fresh-Squeezed” Orange Juice Ain't Gonna Blend Itself
Wait…Is This Data Science?
Starting with a Simple Trade-Off
Fresh from the Grove to Your Glass…with a Pit Stop Through a Blending Model
Modeling Risk
Notes
10 Outlier Detection: Just Because They’re Odd Doesn’t Mean They’re Unimportant
Outliers Are (Bad?) People, Too
The Fascinating Case of Hadlum v. Hadlum
Terrible at Nothing, Bad at Everything
Note
11 Moving on From Spreadsheets
Getting Up and Running with R
Doing Some Actual Data Science
12 Conclusion
Where Am I? What Just Happened?
Before You Go-Go
Get Creative and Keep in Touch!
Index
Copyright
Dedication
About the Author
About the Technical Editors
Acknowledgments
End User License Agreement
Chapter 3
Table 3.1: Data cleaning procedures required to tokenize each tweet
Chapter 5
Table 5.1: Simple network graph representation of Friends
Chapter 6
Table 6.1: Features to consider
Chapter 1
Figure 1.1: Concession stand sales
Figure 1.2: When you right-click the status bar, you have the option to have...
Figure 1.3: To insert an Excel table, place your cursor anywhere in the tabl...
Figure 1.4: Concession stand data with an Excel table applied
Figure 1.5: Tables will replace the column headers with the column names. Th...
Figure 1.6: Tables already have filters built-in. To filter a specific colum...
Figure 1.7: The default table formatting is overly colorful and distracting....
Figure 1.8: The label “Average Profit” has been added to H2. The cell I2 is ...
Figure 1.9: The Excel table was named Sales. As you type Sales into the form...
Figure 1.10: Once you've typed in the table name and added a left square bra...
Figure 1.11: When referring to a table's column field in Excel, you will see...
Figure 1.12: To automatically add information to an Excel table, place your ...
Figure 1.13: When you add new data to the bottom of a table, it will automat...
Figure 1.14: Structured references allow you to create column calculations t...
Figure 1.15: A table layout where the unique identifier the “key” is the lef...
Figure 1.16: A VLOOKUP has been implemented in cell B19. In this case, the f...
Figure 1.17: In some workflows, you know a value stored in another field, an...
Figure 1.18: INDEX/MATCH has been implemented across helper cells B22:B24. It's also imp...
Figure 1.19: The elements required to implement an XLOOKUP
Figure 1.20: XLOOKUP is implemented in cell B28. Notice the XLOOKUP formula ...
Figure 1.21: The PivotTable Builder and a count of sales by item
Figure 1.22: Revenue by item and category
Figure 1.23: Solver appears on the far right of the Data tab.
Figure 1.24: Getting calorie and item counts set up
Figure 1.25: The Solver Parameters dialog box
Figure 1.26: Adding a new constraint for Solver to use
Figure 1.27: Adding an integer constraint
Figure 1.28: The Solver Results dialog box
Figure 1.29: An optimized item selection
Figure 1.30: Setting the options for OpenSolver
Figure 1.31: Deselect the Perform a quick linearity check on the model in th...
Chapter 2
Figure 2.1: The Contacts table
Figure 2.2: Power Query can be found in the leftmost buttons of the Data rib...
Figure 2.3: The Power Query Editor
Figure 2.4: In the Name box in the Query Settings pane you can set the name ...
Figure 2.5: The drop-down filter for the Country column in Power Query
Figure 2.6: To remove a column, right-click its header and select Remove fro...
Figure 2.7: The Find & Replace pop-up
Figure 2.8: Using Find & Replace, we successfully replaced the X values with...
Figure 2.9: The Merging Columns pop-up
Figure 2.10: Our query results. After you're finished creating your query, P...
Chapter 3
Figure 3.1: Once you've inserted a new table, change the name to AboutMandri...
Figure 3.2: Click Table/Range to bring this data into Power Query. On laptop...
Figure 3.3: To set the case for each tweet, select Lowercase from the Format...
Figure 3.4: You can define what you want to replace in the Replace Values op...
Figure 3.5: To split words into multiple columns (or, as you'll see, into mu...
Figure 3.6: The Split Column By Delimiter pop-up. Selecting Rows instead of ...
Figure 3.7: Each token is now given its own row.
Figure 3.8: The Custom Column Formula options box. Here you can type M code ...
Figure 3.9: Use the drop-down next to the field header to filter out data yo...
Figure 3.10: The Group By feature is similar to a PivotTable. To find it, cl...
Figure 3.11: The Group By option box.
Figure 3.12: App tokens with their respective lengths
Figure 3.13: We add 1 to everything to account for rare words.
Figure 3.14: You can have a table automatically apply descriptive analytics ...
Figure 3.15: We can use structured references to easily find P(Token | App)....
Figure 3.16: Looking up LN(P(Mandrill | App)) for mandrill-specific tweets
Figure 3.17: This table displays the log probabilities for each token given ...
Figure 3.18: Place both Class and Number into the Rows field.
Figure 3.19: To create a new calculated field, click the PivotTable Analyze ...
Figure 3.20: The Insert Calculated Field dialog box. To save time, double-cl...
Figure 3.21: Place the field Model Prediction into the Values field to use t...
Figure 3.22: The results of the naïve Bayes classifier. The classifi...
Chapter 4
Figure 4.1: Campers and counselors tearing up the dance floor
Figure 4.2: Initial cluster centers placed
Figure 4.3: Lines denote the borders of the clusters.
Figure 4.4: Cluster assignments given by shaded regions in the Voronoi diagr...
Figure 4.5: Moving the centers just a tad
Figure 4.6: Optimal three-means clustering
Figure 4.7: The details of the last 32 offers on the OfferInformation worksh...
Figure 4.8: A list of offers taken by customer on the Transactions worksheet...
Figure 4.9: PivotTable field list
Figure 4.10: PivotTable of deals versus customers
Figure 4.11: Deal description and purchase data merged into a single matrix...
Figure 4.12: Blank cluster centers placed on the 4MC tab
Figure 4.13: A dancer at (8,2) and a cluster center at (4,4)
Figure 4.14: Euclidean distance is the square root of the sum of squared dis...
Figure 4.15: Distance calculations from each customer to each cluster
Figure 4.16: The distance between Adams and Cluster 1
Figure 4.17: The Solver setup for 4-means clustering
Figure 4.18: The Evolutionary tab in Solver
Figure 4.19: The four optimal cluster centers
Figure 4.20: Setting up a tab to count popular deals by cluster
Figure 4.21: Totals of each deal taken broken out by cluster
Figure 4.22: Sorting on Cluster 1—Pinot, Pinot, Pinot!
Figure 4.23: Sorting on Cluster 2—small-timers
Figure 4.24: Sorting on Cluster 3 is a bit of a mess.
Figure 4.25: Sorting on Cluster 4—these folks just like champagne in August?...
Figure 4.26: The distances considered for a chaperone's contribution to the ...
Figure 4.27: The bare-bones Distances tab
Figure 4.28: Using the LET function to calculate the Euclidean distance betw...
Figure 4.29: The completed distance matrix
Figure 4.30: The beginning stages of our silhouette
Figure 4.31: Average distance between each customer and the customers in eve...
Figure 4.32: Average distances to the folks in my own cluster and to the clo...
Figure 4.33: The final silhouette for 4-means clustering
Figure 4.34: The 5-means clustering tab
Figure 4.35: The optimal 5-means clusters
Figure 4.36: Sorting on Cluster 1—pinot noir out the ears
Figure 4.37: Sorting on Cluster 2—small quantities only, please
Figure 4.38: Sorting on Cluster 3—is espumante that important?
Figure 4.39: Sorting on Cluster 4—all sorts of interests
Figure 4.40: Sorting on Cluster 5—high volume
Figure 4.41: The silhouette for 5-means clustering
Figure 4.42: An illustration of cosine similarity on two binary purchase vec...
Figure 4.43: The 5MedC tab not yet optimized
Figure 4.44: The five-cluster medians
Figure 4.45: Sorting on Cluster 1—low-volume customers
Figure 4.46: Sorting on Cluster 2—not all who sparkle are vampires
Figure 4.47: Sorting on Cluster 3—Francophiles
Figure 4.48: Sorting on Cluster 4—high volume for 19 deals in a row
Figure 4.49: Sorting on Cluster 5—mainlining pinot noir
Chapter 5
Figure 5.1: A diagram of relationships on Friends
Figure 5.2: Get Add-ins on the Insert tab
Figure 5.3: The Office Add-In store. Search for the GiGraph add-in.
Figure 5.4: The GiGraph add-in is installed.
Figure 5.5: A GiGraph directed network graph
Figure 5.6: An adjacency matrix of Friends data in Excel
Figure 5.7: Matrix degree measures calculated just in Excel
Figure 5.8: The Matrix tab showing who bought what
Figure 5.9: The empty grid for the cosine similarity matrix
Figure 5.10: The completed customer cosine similarity matrix
Figure 5.11: Calculating the 80th percentile of edge weights
Figure 5.12: The 0.5-neighborhood adjacency matrix
Figure 5.13: Gephi Import Wizard
Figure 5.14: Gephi Import report
Figure 5.15: The WineNetwork-radj adjacency matrix visually represented as a...
Figure 5.16: The same network graph with labels added
Figure 5.17: After running ForceAtlas and a few formatting updates
Figure 5.18: Calculating the average degree of a graph
Figure 5.19: Resizing the graph according to node degree
Figure 5.20: A cluster of folks with their nodes visually represented as the...
Figure 5.21: The Data Laboratory overview
Figure 5.22: The WineNetwork adjacency matrix visualized
Figure 5.23: Stubby Friends graph
Figure 5.24: A rewiring of the Friends graph
Figure 5.25: Node selection probabilities on the Friends graph
Figure 5.26: The expected number of edges between Ross and Rachel
Figure 5.27: Counting edge stubs on the r-neighborhood graph
Figure 5.28: The Scores tab
Figure 5.29: Adding two upper bounds to each customer's score variable
Figure 5.30: Filled out Split1 tab, ready for optimization
Figure 5.31: The LP formulation for the first split
Figure 5.32: Optimal solution for the first split
Figure 5.33: The Split2 tab with previous run values
Figure 5.34: The optimal solution for Split2
Figure 5.35: No modularity improvement in Split3
Figure 5.36: Final community labels for modularity maximization
Figure 5.37: The initial TopDealsByCluster tab
Figure 5.38: Make sure to fill right. When you drag right with a table colum...
Figure 5.39: TopDealsByCluster with completed purchase counts
Figure 5.40: Top deals for community 0
Figure 5.41: Poppin' bottles in community 1
Figure 5.42: People who liked the March Espumante deal
Figure 5.43: Pinot peeps
Figure 5.44: Gephi modularity settings
Figure 5.45: Customer graph recolored to show modularity clusters
Figure 5.46: Exporting modularity classes back to Excel
Figure 5.47: Gephi modularity classes back in Excel
Figure 5.48: Reproducing the modularity score for the communities detected b...
Figure 5.49: Top purchases per cluster from Gephi
Chapter 6
Figure 6.1: Raw training data
Figure 6.2: Training Data w Dummy Vars tab with new columns for the dummy va...
Figure 6.3: Training data with dummy variables populated
Figure 6.4: Cat ownership versus me sneezing
Figure 6.5: Scatter plot of cats versus sneezing
Figure 6.6: Linear model displayed on the graph
Figure 6.7: Renaming the Excel table on the Linear Model tab to LM_TrainingD...
Figure 6.8: Assigning cells C3:V3 the named range ModelCoefficients
Figure 6.9: Linear modeling setup
Figure 6.10: The prediction column for a linear model
Figure 6.11: Predictions and sum of squared error
Figure 6.12: Solver setup for linear model
Figure 6.13: Optimized linear model
Figure 6.14: R-squared of 0.46 for the linear regression
Figure 6.15: The result of the F-test
Figure 6.16: Designing the SSCP matrix
Figure 6.17: The SSCP matrix completed
Figure 6.18: The inverse of the SSCP matrix
Figure 6.19: The standard error of each model coefficient
Figure 6.20: The coefficient standard error on the linear model
Figure 6.21: Female, Home, and Apt are insignificant predictors according to...
Figure 6.22: Test set data
Figure 6.23: Predictions on the test set
Figure 6.24: Cutoff values for the pregnancy classification
Figure 6.25: Precision calculations on the test set
Figure 6.26: Specificity calculations on the test set
Figure 6.27: The false positive rate and the true positive rate
Figure 6.28: The ROC curve for the linear regression
Figure 6.29: The link function
Figure 6.30: The initial logistic model sheet
Figure 6.31: Values through the logistic function
Figure 6.32: Identical Solver setup for logistic model
Figure 6.33: Fitted logistic model
Figure 6.34: The Logistic Regression sheet
Figure 6.35: Logistic regression predictions on the test set
Figure 6.36: The Performance Logistic tab
Figure 6.37: The ROC curve series values pulling from the Performance Logist...
Figure 6.38: The ROC series results
Chapter 7
Figure 7.1: The TD tab houses the data from Chapter 6
Figure 7.2: The folic acid decision stump
Figure 7.3: Node impurity for the folic acid stump
Figure 7.4: Use headers as first row
Figure 7.5: Generating a field of random numbers using Number.Random()
Figure 7.6: Adding Table.Buffer() to signal to Power Query we want to see up...
Figure 7.7: Adding a conditional column box
Figure 7.8: Updated rows sorted randomly
Figure 7.9: Four random columns and a random two-thirds of the rows
Figure 7.10: Four possibilities for the training data
Figure 7.11: Feature/response pairings for each of the features in the rando...
Figure 7.12: Calculating which feature value is associated with pregnancy
Figure 7.13: Combined impurity values for four decision stumps
Figure 7.14: The winner's circle for the four decision stumps
Figure 7.15: Reshuffling the data yields a new stump.
Figure 7.16: Getting ready to record a macro
Figure 7.17: The 200 decision stumps
Figure 7.18: Stumps added to the TestBag tab
Figure 7.19: Stumps evaluated on the TestBag set
Figure 7.20: Predictions for each row
Figure 7.21: Performance metrics for bagging
Figure 7.22: The ROC curve for bagged stumps
Figure 7.23: The initial portions of the BoostStumps tab
Figure 7.24: Counting up how each feature splits the training data
Figure 7.25: Weights for each training data row
Figure 7.26: The weighted error calculation for each stump
Figure 7.27: The first winning boosted stump
Figure 7.28: Alpha value for the first boosting iteration
Figure 7.29: The new weight calculation
Figure 7.30: The second stump
Figure 7.31: The 200th stump
Figure 7.32: Decision stumps pasted to TestBoost
Figure 7.33: Predictions on each row of test data from each stump
Figure 7.34: Final predictions from the boosted model
Figure 7.35: The performance metrics for boosted stumps
Figure 7.36: The ROC curves for the boosted and bagged models
Chapter 8
Figure 8.1: Time-series data
Figure 8.2: Scatter plot of time-series data—the y-axis has been adjusted to...
Figure 8.3: Initial worksheet design for simple exponential smoothing
Figure 8.4: Testing the LAMBDA by sending in parameters
Figure 8.5: Assigning a LAMBDA to a named range
Figure 8.6: Generating the one-step forecast, error, and level calculation f...
Figure 8.7: Simple exponential smoothing forecast with an alpha of 0.5
Figure 8.8: The sum of squared error for simple exponential smoothing
Figure 8.9: The standard error calculation
Figure 8.10: Solver formulation for optimizing alpha
Figure 8.11: The Actuals and Forecast columns are added to the table.
Figure 8.12: Graphing the final simple exponential smoothing forecast
Figure 8.13: Your trend is legit.
Figure 8.14: Starting with smoothing parameters set to 0.5
Figure 8.15: The initial level and trend values
Figure 8.16: The level, trend, forecast, and error calculations
Figure 8.17: Forecasting future months with Holt's trend-corrected exponenti...
Figure 8.18: Graph of the forecast with default alpha and gamma values
Figure 8.19: Calculating the SSE and standard error
Figure 8.20: Optimization setup for Holt's trend-corrected exponential smoot...
Figure 8.21: Graph of optimal Holt's forecast
Figure 8.22: Months and associated one-step forecast errors
Figure 8.23: Sum of squared mean deviations of Holt's forecast errors
Figure 8.24: One-month lagged error deviations
Figure 8.25: A beautiful cascading matrix of lagged error deviations fit for...
Figure 8.26: SUMPRODUCT of lagged deviations with originals
Figure 8.27: This is my correlogram; there are many like it, but this is min...
Figure 8.28: Critical points for the autocorrelations
Figure 8.29: Click the pasted in series, and select Format Data Series to ma...
Figure 8.30: The Format Data Series pane
Figure 8.31: The Select Data Source dialog box. Note that Autocorrelation ha...
Figure 8.32: Changing the critical values into highlighted regions
Figure 8.33: Initializing our Holt-Winters algorithm
Figure 8.34: The 2x12MA smoother function accounts for the fact the moving a...
Figure 8.35: The smoothed demand data
Figure 8.36: The Initial Seasonal Factors table
Figure 8.37: A bar chart of estimated seasonal variations
Figure 8.38: Repeating the initial seasons factors so that we can deseasonal...
Figure 8.39: Initial level and trend estimates via a trendline on the deseas...
Figure 8.40: All of the initial Holt-Winters values in one place
Figure 8.41: Assigning the initial Level and Trend to the seasonal forecast...
Figure 8.42: Worksheet with smoothing parameters and first one-step forecast...
Figure 8.43: Taking the update equations through month 36
Figure 8.44: Graphing the Holt-Winters forecast
Figure 8.45: The Solver setup for Holt-Winters
Figure 8.46: The optimized Holt-Winters forecast
Figure 8.47: Correlogram for the Holt-Winters model
Figure 8.48: Simulated one-step errors
Figure 8.49: Simulated future demand
Figure 8.50: I have 1,000 demand scenarios
Figure 8.51: The forecast interval for Holt-Winters
Figure 8.52: The forecast sandwiched by the prediction interval
Figure 8.53: The Change Chart Type dialog box
Figure 8.54: The fan chart is a thing of beauty.
Figure 8.55: The forecast sheet is found in the Forecast group on the Data r...
Figure 8.56: The Create Forecast Worksheet dialog box
Figure 8.57: A forecast sheet is born.
Chapter 9
Figure 9.1: The budget constraint makes the feasible region a triangle.
Figure 9.2: The cellar constraint cuts a chunk out of the feasible region.
Figure 9.3: The level set and objective function for the revenue optimizatio...
Figure 9.4: Testing the all-butter corner. Mmmm, so creamy.
Figure 9.5: Located the optimal corner
Figure 9.6: Tools and butter data placed, lovingly, in Excel
Figure 9.7: Revenue and constraint calculations within the tools and butter ...
Figure 9.8: Completed tools and butter formulation in Solver
Figure 9.9: Optimized tools and butter workbook
Figure 9.10: Making the tools and butter decisions integers
Figure 9.11: A graph of Pierre's $500 bonus
Figure 9.12: Formulation for the evolutionary solver
Figure 9.13: The specs sheet for raw orange juice procurement
Figure 9.14: Setting up the blending spreadsheet
Figure 9.15: Cost calculations added to the juice blending worksheet
Figure 9.16: Demand and Valencia calculations added (with formula annotation...
Figure 9.17: Adding taste and color constraints to the worksheet
Figure 9.18: The populated Solver dialog for the blending problem
Figure 9.19: Solution to the orange juice-blending problem
Figure 9.20: Relaxed quality model
Figure 9.21: Solver implementation of the relaxed quality model
Figure 9.22: Solution to the relaxed quality model
Figure 9.23: Graphing the trade-off between cost and quality
Figure 9.24: Solver setup for minimax quality reduction
Figure 9.25: Adding indicator variables to the spreadsheet
Figure 9.26: Setting up our “Big M” constraint values
Figure 9.27: Initializing Solver for the minimax
Figure 9.28: Optimal solution limited to four suppliers per period (rows 19-...
Figure 9.29: Indicator and amount variables added for the de-acidification d...
Figure 9.30: Calculation added for upper bound on how much juice can be de-a...
Figure 9.31: Adding in a lower bound on de-acidification
Figure 9.32: Adding a “Not Reduced” calculation
Figure 9.33: Solver formulation for de-acidification problem
Figure 9.34: Solved de-acidification model
Figure 9.35: Combining independent random variables to illustrate how they g...
Figure 9.36: The cumulative distribution function for the cell phone contact...
Figure 9.37: Specifications with standard deviation added
Figure 9.38: 100 generated juice spec scenarios
Figure 9.39: Spec calculations for each scenario
Figure 9.40: Solver setup for robust optimization
Figure 9.41: Solution to the robust optimization model
Chapter 10
Figure 10.1: Tukey’s fences for some pregnancy durations
Figure 10.2: Adding conditional formatting for outliers
Figure 10.3: Birth duration with conditional formatting color codes
Figure 10.4: Gimli, son of Gloin, Dwarven outlier
Figure 10.5: Multidimensional employee performance data
Figure 10.6: Mean and standard deviation for each column
Figure 10.7: The standardized set of employee performance data
Figure 10.8: Empty employee distance matrix
Figure 10.9: The employee distance matrix
Figure 10.10: Employee 142619 ranked by distance in relation to 144624
Figure 10.11: Each employee on the column ranked in relation to each row
Figure 10.12: The indegree counts for three different nearest neighbor graph...
Figure 10.13: The performance data for employee 137155
Figure 10.14: The performance data for employee 143406
Figure 10.15: Employee 143406 has a high 5-distance
Figure 10.16: k-distance fails on local outliers
Figure 10.17: The triangle is not nearly as reachable by its neighbors as th...
Figure 10.18: The skeleton of the reach distance tab
Figure 10.19: All reach distances
Figure 10.20: Average reachability for each employee with respect to their n...
Figure 10.21: LOFs for the employees. Somebody is knocking on the door of 2....
Chapter 11
Figure 11.1: RStudio Online via posit.cloud
Figure 11.2: Your new variables appear in the Environment pane.
Figure 11.3: The Help window for the square root function
Figure 11.4: Search results for the word log
Figure 11.5: The Upload Files box
Figure 11.6: Viewing the winedata dataframe in R
Figure 11.7: winedata.clusters expanded
Figure 11.8: Pinot Noir appears as the top cluster when applying spherical k...
Figure 11.9: A variable importance plot in R
Figure 11.10: Recall and precision graphed in R
Figure 11.11: Graph of sword demand
Figure 11.12: Fan chart of the demand forecast
Figure 11.13: A boxplot of the pregnancy duration data
Figure 11.14: A boxplot with Tukey fences using three times the IQR
Second Edition
Jordan Goldmeier
1. "Why do 87% of data science projects never make it into production?" https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes
2. "8 famous analytics and AI disasters." www.cio.com/article/190888/5-famous-analytics-and-ai-disasters.html
3. "Forecasting for COVID-19 has failed." www.ncbi.nlm.nih.gov/pmc/articles/PMC7447267
4. "The Real Story Of 2016." https://fivethirtyeight.com/features/the-real-story-of-2016
5. To see the impact DataKind has had, take a look at their case studies: www.datakind.org/what-we-do