Data Smart E-Book

Jordan Goldmeier

32,99 €
Description

Want to jump into data science but don't know where to start? Let's be real: data science is often presented as something mystical, unattainable without the most powerful software, hardware, and data expertise. Real data science isn't about technology. It's about how you approach the problem. In this updated edition of Data Smart: Using Data Science to Transform Information into Insight, award-winning data scientist and bestselling author Jordan Goldmeier shows you how to solve data science problems using Excel while exposing how things work behind the scenes. Data Smart is your field guide to building statistics, machine learning, and powerful artificial intelligence concepts right inside your spreadsheet. Inside you'll find:

* Four-color data visualizations that highlight and illustrate the concepts discussed in the book
* Tutorials explaining complicated data science using just Microsoft Excel
* How to take what you've learned and apply it to everyday problems at work and in life
* Advice for using formulas, Power Query, and some of Excel's latest features to solve tough data problems
* Smart data science solutions for common business challenges
* Explanations of what algorithms do, how they work, and what you can tweak to take your Excel skills to the next level

Data Smart is a must-read for students, analysts, and managers ready to become data science savvy and share their findings with the world.

You can read this e-book in the Legimi apps on:

Android
iOS
Legimi-certified e-readers

Page count: 578

Ratings
4.8 (16 ratings)



Table of Contents

Cover

Table of Contents

Title Page

Introduction

What Am I Doing Here?

What Is Data Science?

Do Data Scientists Actually Use Excel?

Conventions

Let's Get Going

Notes

1 Everything You Ever Needed to Know About Spreadsheets but Were Too Afraid to Ask

Some Sample Data

Accessing Quick Descriptive Statistics

Excel Tables

Lookup Formulas

PivotTables

Using Array Formulas

Solving Stuff with Solver

Notes

2 Set It and Forget It: An Introduction to Power Query

What Is Power Query?

Sample Data

Starting Power Query

Filtering Rows

Removing Columns

Find & Replace

Close & Load to…Table

Note

3 Naïve Bayes and the Incredible Lightness of Being an Idiot

The World's Fastest Intro to Probability Theory

Separating the Signal and the Noise

Using the Bayes Rule to Create an AI Model

Let's Get This Excel Party Started

Notes

4 Cluster Analysis Part 1: Using K-Means to Segment Your Customer Base

Dances at Summer Camp

Getting Real: K-Means Clustering Subscribers in Email Marketing

K-Medians Clustering and Asymmetric Distance Measurements

5 Cluster Analysis Part II: Network Graphs and Community Detection

What Is a Network Graph?

Visualizing a Simple Graph

Building a Graph from the Wholesale Wine Data

Introduction to Gephi

How Much Is an Edge Worth? Points and Penalties in Graph Modularity

Let's Get Clustering!

There and Back Again: A Gephi Tale

6 Regression: The Granddaddy of Supervised Artificial Intelligence

Predicting Pregnant Customers at RetailMart Using Linear Regression

Predicting Pregnant Customers at RetailMart Using Logistic Regression

Note

7 Ensemble Models: A Whole Lot of Bad Pizza

Getting Started Using the Data from Chapter 6

Bagging: Randomize, Train, Repeat

Boosting: If You Get It Wrong, Just Boost and Try Again

Note

8 Forecasting: Breathe Easy: You Can't Win

The Sword Trade Is Hopping

Getting Acquainted with Time-Series Data

Starting Slow with Simple Exponential Smoothing

You Might Have a Trend

Holt's Trend-Corrected Exponential Smoothing

Multiplicative Holt-Winters Exponential Smoothing

Forecast Sheets in Excel

9 Optimization Modeling: Because That “Fresh-Squeezed” Orange Juice Ain't Gonna Blend Itself

Wait…Is This Data Science?

Starting with a Simple Trade-Off

Fresh from the Grove to Your Glass…with a Pit Stop Through a Blending Model

Modeling Risk

Notes

10 Outlier Detection: Just Because They’re Odd Doesn’t Mean They’re Unimportant

Outliers Are (Bad?) People, Too

The Fascinating Case of Hadlum v. Hadlum

Terrible at Nothing, Bad at Everything

Note

11 Moving on From Spreadsheets

Getting Up and Running with R

Doing Some Actual Data Science

12 Conclusion

Where Am I? What Just Happened?

Before You Go-Go

Get Creative and Keep in Touch!

Index

Copyright

Dedication

About the Author

About the Technical Editors

Acknowledgments

End User License Agreement

List of Tables

Chapter 3

Table 3.1: Data cleaning procedures required to tokenize each tweet

Chapter 5

Table 5.1: Simple network graph representation of Friends

Chapter 6

Table 6.1: Features to consider

List of Illustrations

Chapter 1

Figure 1.1: Concession stand sales

Figure 1.2: When you right-click the status bar, you have the option to have...

Figure 1.3: To insert an Excel table, place your cursor anywhere in the tabl...

Figure 1.4: Concession stand data with an Excel table applied

Figure 1.5: Tables will replace the column headers with the column names. Th...

Figure 1.6: Tables already have filters built-in. To filter a specific colum...

Figure 1.7: The default table formatting is overly colorful and distracting....

Figure 1.8: The label “Average Profit” has been added to H2. The cell I2 is ...

Figure 1.9: The Excel table was named Sales. As you type Sales into the form...

Figure 1.10: Once you've typed in the table name and added a left square bra...

Figure 1.11: When referring to a table's column field in Excel, you will see...

Figure 1.12: To automatically add information to an Excel table, place your ...

Figure 1.13: When you add new data to the bottom of a table, it will automat...

Figure 1.14: Structured references allow you to create column calculations t...

Figure 1.15: A table layout where the unique identifier, the “key,” is the lef...

Figure 1.16: A VLOOKUP has been implemented in cell B19. In this case, the f...

Figure 1.17: In some workflows, you know a value stored in another field, an...

Figure 1.18: INDEX/MATCH has been implemented across helper cells B22:B24. It's also imp...

Figure 1.19: The elements required to implement an XLOOKUP

Figure 1.20: XLOOKUP is implemented in cell B28. Notice the XLOOKUP formula ...

Figure 1.21: The PivotTable Builder and a count of sales by item

Figure 1.22: Revenue by item and category

Figure 1.23: Solver appears on the far right of the Data tab.

Figure 1.24: Getting calorie and item counts set up

Figure 1.25: The Solver Parameters dialog box

Figure 1.26: Adding a new constraint for Solver to use

Figure 1.27: Adding an integer constraint

Figure 1.28: The Solver Results dialog box

Figure 1.29: An optimized item selection

Figure 1.30: Setting the options for OpenSolver

Figure 1.31: Deselect the Perform a quick linearity check on the model in th...

Chapter 2

Figure 2.1: The Contacts table

Figure 2.2: Power Query can be found in the leftmost buttons of the Data rib...

Figure 2.3: The Power Query Editor

Figure 2.4: In the Name box in the Query Settings pane you can set the name ...

Figure 2.5: The drop-down filter for the Country column in Power Query

Figure 2.6: To remove a column, right-click its header and select Remove fro...

Figure 2.7: The Find & Replace pop-up

Figure 2.8: Using Find & Replace, we successfully replaced the X values with...

Figure 2.9: The Merging Columns pop-up

Figure 2.10: Our query results. After you're finished creating your query, P...

Chapter 3

Figure 3.1: Once you've inserted a new table, change the name to AboutMandri...

Figure 3.2: Click Table/Range to bring this data into Power Query. On laptop...

Figure 3.3: To set the case for each tweet, select Lowercase from the Format...

Figure 3.4: You can define what you want to replace in the Replace Values op...

Figure 3.5: To split words into multiple columns (or, as you'll see, into mu...

Figure 3.6: The Split Column By Delimiter pop-up. Selecting Rows instead of ...

Figure 3.7: Each token is now given its own row.

Figure 3.8: The Custom Column Formula options box. Here you can type M code ...

Figure 3.9: Use the drop-down next to the field header to filter out data yo...

Figure 3.10: The Group By feature is similar to a PivotTable. To find it, cl...

Figure 3.11: The Group By option box.

Figure 3.12: App tokens with their respective lengths

Figure 3.13: We add 1 to everything to account for rare words.

Figure 3.14: You can have a table automatically apply descriptive analytics ...

Figure 3.15: We can use structured references to easily find P(Token | App)....

Figure 3.16: Looking up LN(P(Mandrill | App)) for mandrill-specific tweets

Figure 3.17: This table displays the log probabilities for each token given ...

Figure 3.18: Place both Class and Number into the Rows field.

Figure 3.19: To create a new calculated field, click the PivotTable Analyze ...

Figure 3.20: The Insert Calculated Field dialog box. To save time, double-cl...

Figure 3.21: Place the field Model Prediction into the Values field to use t...

Figure 3.22: The results of the naïve Bayes classifier. The classifi...

Chapter 4

Figure 4.1: Campers and counselors tearing up the dance floor

Figure 4.2: Initial cluster centers placed

Figure 4.3: Lines denote the borders of the clusters.

Figure 4.4: Cluster assignments given by shaded regions in the Voronoi diagr...

Figure 4.5: Moving the centers just a tad

Figure 4.6: Optimal three-means clustering

Figure 4.7: The details of the last 32 offers on the OfferInformation worksh...

Figure 4.8: A list of offers taken by customer on the Transactions worksheet...

Figure 4.9: PivotTable field list

Figure 4.10: PivotTable of deals versus customers

Figure 4.11: Deal description and purchase data merged into a single matrix...

Figure 4.12: Blank cluster centers placed on the 4MC tab

Figure 4.13: A dancer at (8,2) and a cluster center at (4,4)

Figure 4.14: Euclidean distance is the square root of the sum of squared dis...

Figure 4.15: Distance calculations from each customer to each cluster

Figure 4.16: The distance between Adams and Cluster 1

Figure 4.17: The Solver setup for 4-means clustering

Figure 4.18: The Evolutionary tab in Solver

Figure 4.19: The four optimal cluster centers

Figure 4.20: Setting up a tab to count popular deals by cluster

Figure 4.21: Totals of each deal taken broken out by cluster

Figure 4.22: Sorting on Cluster 1—Pinot, Pinot, Pinot!

Figure 4.23: Sorting on Cluster 2—small-timers

Figure 4.24: Sorting on Cluster 3 is a bit of a mess.

Figure 4.25: Sorting on Cluster 4—these folks just like champagne in August?...

Figure 4.26: The distances considered for a chaperone's contribution to the ...

Figure 4.27: The bare-bones Distances tab

Figure 4.28: Using the LET function to calculate the Euclidean distance betw...

Figure 4.29: The completed distance matrix

Figure 4.30: The beginning stages of our silhouette

Figure 4.31: Average distance between each customer and the customers in eve...

Figure 4.32: Average distances to the folks in my own cluster and to the clo...

Figure 4.33: The final silhouette for 4-means clustering

Figure 4.34: The 5-means clustering tab

Figure 4.35: The optimal 5-means clusters

Figure 4.36: Sorting on Cluster 1—pinot noir out the ears

Figure 4.37: Sorting on Cluster 2—small quantities only, please

Figure 4.38: Sorting on Cluster 3—is espumante that important?

Figure 4.39: Sorting on Cluster 4—all sorts of interests

Figure 4.40: Sorting on Cluster 5—high volume

Figure 4.41: The silhouette for 5-means clustering

Figure 4.42: An illustration of cosine similarity on two binary purchase vec...

Figure 4.43: The 5MedC tab not yet optimized

Figure 4.44: The five-cluster medians

Figure 4.45: Sorting on Cluster 1—low-volume customers

Figure 4.46: Sorting on Cluster 2—not all who sparkle are vampires

Figure 4.47: Sorting on Cluster 3—Francophiles

Figure 4.48: Sorting on Cluster 4—high volume for 19 deals in a row

Figure 4.49: Sorting on Cluster 5—mainlining pinot noir

Chapter 5

Figure 5.1: A diagram of relationships on Friends

Figure 5.2: Get Add-ins on the Insert tab

Figure 5.3: The Office Add-In store. Search for the GiGraph add-in.

Figure 5.4: The GiGraph add-in is installed.

Figure 5.5: A GiGraph directed network graph

Figure 5.6: An adjacency matrix of Friends data in Excel

Figure 5.7: Matrix degree measures calculated just in Excel

Figure 5.8: The Matrix tab showing who bought what

Figure 5.9: The empty grid for the cosine similarity matrix

Figure 5.10: The completed customer cosine similarity matrix

Figure 5.11: Calculating the 80th percentile of edge weights

Figure 5.12: The 0.5-neighborhood adjacency matrix

Figure 5.13: Gephi Import Wizard

Figure 5.14: Gephi Import report

Figure 5.15: The WineNetwork-radj adjacency matrix visually represented as a...

Figure 5.16: The same network graph with labels added

Figure 5.17: After running ForceAtlas and a few formatting updates

Figure 5.18: Calculating the average degree of a graph

Figure 5.19: Resizing the graph according to node degree

Figure 5.20: A cluster of folks with their nodes visually represented as the...

Figure 5.21: The Data Laboratory overview

Figure 5.22: The WineNetwork adjacency matrix visualized

Figure 5.23: Stubby Friends graph

Figure 5.24: A rewiring of the Friends graph

Figure 5.25: Node selection probabilities on the Friends graph

Figure 5.26: The expected number of edges between Ross and Rachel

Figure 5.27: Counting edge stubs on the r-neighborhood graph

Figure 5.28: The Scores tab

Figure 5.29: Adding two upper bounds to each customer's score variable

Figure 5.30: Filled out Split1 tab, ready for optimization

Figure 5.31: The LP formulation for the first split

Figure 5.32: Optimal solution for the first split

Figure 5.33: The Split2 tab with previous run values

Figure 5.34: The optimal solution for Split2

Figure 5.35: No modularity improvement in Split3

Figure 5.36: Final community labels for modularity maximization

Figure 5.37: The initial TopDealsByCluster tab

Figure 5.38: Make sure to fill right. When you drag right with a table colum...

Figure 5.39: TopDealsByCluster with completed purchase counts

Figure 5.40: Top deals for community 0

Figure 5.41: Poppin' bottles in community 1

Figure 5.42: People who liked the March Espumante deal

Figure 5.43: Pinot peeps

Figure 5.44: Gephi modularity settings

Figure 5.45: Customer graph recolored to show modularity clusters

Figure 5.46: Exporting modularity classes back to Excel

Figure 5.47: Gephi modularity classes back in Excel

Figure 5.48: Reproducing the modularity score for the communities detected b...

Figure 5.49: Top purchases per cluster from Gephi

Chapter 6

Figure 6.1: Raw training data

Figure 6.2: Training Data w Dummy Vars tab with new columns for the dummy va...

Figure 6.3: Training data with dummy variables populated

Figure 6.4: Cat ownership versus me sneezing

Figure 6.5: Scatter plot of cats versus sneezing

Figure 6.6: Linear model displayed on the graph

Figure 6.7: Renaming the Excel table on the Linear Model tab to LM_TrainingD...

Figure 6.8: Assigning cells C3:V3 the named range ModelCoefficients

Figure 6.9: Linear modeling setup

Figure 6.10: The prediction column for a linear model

Figure 6.11: Predictions and sum of squared error

Figure 6.12: Solver setup for linear model

Figure 6.13: Optimized linear model

Figure 6.14: R-squared of 0.46 for the linear regression

Figure 6.15: The result of the F-test

Figure 6.16: Designing the SSCP matrix

Figure 6.17: The SSCP matrix completed

Figure 6.18: The inverse of the SSCP matrix

Figure 6.19: The standard error of each model coefficient

Figure 6.20: The coefficient standard error on the linear model

Figure 6.21: Female, Home, and Apt are insignificant predictors according to...

Figure 6.22: Test set data

Figure 6.23: Predictions on the test set

Figure 6.24: Cutoff values for the pregnancy classification

Figure 6.25: Precision calculations on the test set

Figure 6.26: Specificity calculations on the test set

Figure 6.27: The false positive rate and the true positive rate

Figure 6.28: The ROC curve for the linear regression

Figure 6.29: The link function

Figure 6.30: The initial logistic model sheet

Figure 6.31: Values through the logistic function

Figure 6.32: Identical Solver setup for logistic model

Figure 6.33: Fitted logistic model

Figure 6.34: The Logistic Regression sheet

Figure 6.35: Logistic regression predictions on the test set

Figure 6.36: The Performance Logistic tab

Figure 6.37: The ROC curve series values pulling from the Performance Logist...

Figure 6.38: The ROC series results

Chapter 7

Figure 7.1: The TD tab houses the data from Chapter 6

Figure 7.2: The folic acid decision stump

Figure 7.3: Node impurity for the folic acid stump

Figure 7.4: Use headers as first row

Figure 7.5: Generating a field of random numbers using Number.Random()

Figure 7.6: Adding Table.Buffer() to signal to Power Query we want to see up...

Figure 7.7: Adding a conditional column box

Figure 7.8: Updated rows sorted randomly

Figure 7.9: Four random columns and a random two-thirds of the rows

Figure 7.10: Four possibilities for the training data

Figure 7.11: Feature/response pairings for each of the features in the rando...

Figure 7.12: Calculating which feature value is associated with pregnancy

Figure 7.13: Combined impurity values for four decision stumps

Figure 7.14: The winner's circle for the four decision stumps

Figure 7.15: Reshuffling the data yields a new stump.

Figure 7.16: Getting ready to record a macro

Figure 7.17: The 200 decision stumps

Figure 7.18: Stumps added to the TestBag tab

Figure 7.19: Stumps evaluated on the TestBag set

Figure 7.20: Predictions for each row

Figure 7.21: Performance metrics for bagging

Figure 7.22: The ROC curve for bagged stumps

Figure 7.23: The initial portions of the BoostStumps tab

Figure 7.24: Counting up how each feature splits the training data

Figure 7.25: Weights for each training data row

Figure 7.26: The weighted error calculation for each stump

Figure 7.27: The first winning boosted stump

Figure 7.28: Alpha value for the first boosting iteration

Figure 7.29: The new weight calculation

Figure 7.30: The second stump

Figure 7.31: The 200th stump

Figure 7.32: Decision stumps pasted to TestBoost

Figure 7.33: Predictions on each row of test data from each stump

Figure 7.34: Final predictions from the boosted model

Figure 7.35: The performance metrics for boosted stumps

Figure 7.36: The ROC curves for the boosted and bagged models

Chapter 8

Figure 8.1: Time-series data

Figure 8.2: Scatter plot of time-series data—the y-axis has been adjusted to...

Figure 8.3: Initial worksheet design for simple exponential smoothing

Figure 8.4: Testing the LAMBDA by sending in parameters

Figure 8.5: Assigning a LAMBDA to a named range

Figure 8.6: Generating the one-step forecast, error, and level calculation f...

Figure 8.7: Simple exponential smoothing forecast with an alpha of 0.5

Figure 8.8: The sum of squared error for simple exponential smoothing

Figure 8.9: The standard error calculation

Figure 8.10: Solver formulation for optimizing alpha

Figure 8.11: The Actuals and Forecast columns are added to the table.

Figure 8.12: Graphing the final simple exponential smoothing forecast

Figure 8.13: Your trend is legit.

Figure 8.14: Starting with smoothing parameters set to 0.5

Figure 8.15: The initial level and trend values

Figure 8.16: The level, trend, forecast, and error calculations

Figure 8.17: Forecasting future months with Holt's trend-corrected exponenti...

Figure 8.18: Graph of the forecast with default alpha and gamma values

Figure 8.19: Calculating the SSE and standard error

Figure 8.20: Optimization setup for Holt's trend-corrected exponential smoot...

Figure 8.21: Graph of optimal Holt's forecast

Figure 8.22: Months and associated one-step forecast errors

Figure 8.23: Sum of squared mean deviations of Holt's forecast errors

Figure 8.24: One-month lagged error deviations

Figure 8.25: A beautiful cascading matrix of lagged error deviations fit for...

Figure 8.26: SUMPRODUCT of lagged deviations with originals

Figure 8.27: This is my correlogram; there are many like it, but this is min...

Figure 8.28: Critical points for the autocorrelations

Figure 8.29: Click the pasted in series, and select Format Data Series to ma...

Figure 8.30: The Format Data Series pane

Figure 8.31: The Select Data Source dialog box. Note that Autocorrelation ha...

Figure 8.32: Changing the critical values into highlighted regions

Figure 8.33: Initializing our Holt-Winters algorithm

Figure 8.34: The 2x12MA smoother function accounts for the fact the moving a...

Figure 8.35: The smoothed demand data

Figure 8.36: The Initial Seasonal Factors table

Figure 8.37: A bar chart of estimated seasonal variations

Figure 8.38: Repeating the initial seasonal factors so that we can deseasonal...

Figure 8.39: Initial level and trend estimates via a trendline on the deseas...

Figure 8.40: All of the initial Holt-Winters values in one place

Figure 8.41: Assigning the initial Level and Trend to the seasonal forecast...

Figure 8.42: Worksheet with smoothing parameters and first one-step forecast...

Figure 8.43: Taking the update equations through month 36

Figure 8.44: Graphing the Holt-Winters forecast

Figure 8.45: The Solver setup for Holt-Winters

Figure 8.46: The optimized Holt-Winters forecast

Figure 8.47: Correlogram for the Holt-Winters model

Figure 8.48: Simulated one-step errors

Figure 8.49: Simulated future demand

Figure 8.50: I have 1,000 demand scenarios

Figure 8.51: The forecast interval for Holt-Winters

Figure 8.52: The forecast sandwiched by the prediction interval

Figure 8.53: The Change Chart Type dialog box

Figure 8.54: The fan chart is a thing of beauty.

Figure 8.55: The forecast sheet is found in the Forecast group on the Data r...

Figure 8.56: The Create Forecast Worksheet dialog box

Figure 8.57: A forecast sheet is born.

Chapter 9

Figure 9.1: The budget constraint makes the feasible region a triangle.

Figure 9.2: The cellar constraint cuts a chunk out of the feasible region.

Figure 9.3: The level set and objective function for the revenue optimizatio...

Figure 9.4: Testing the all-butter corner. Mmmm, so creamy.

Figure 9.5: Located the optimal corner

Figure 9.6: Tools and butter data placed, lovingly, in Excel

Figure 9.7: Revenue and constraint calculations within the tools and butter ...

Figure 9.8: Completed tools and butter formulation in Solver

Figure 9.9: Optimized tools and butter workbook

Figure 9.10: Making the tools and butter decisions integers

Figure 9.11: A graph of Pierre's $500 bonus

Figure 9.12: Formulation for the evolutionary solver

Figure 9.13: The specs sheet for raw orange juice procurement

Figure 9.14: Setting up the blending spreadsheet

Figure 9.15: Cost calculations added to the juice blending worksheet

Figure 9.16: Demand and Valencia calculations added (with formula annotation...

Figure 9.17: Adding taste and color constraints to the worksheet

Figure 9.18: The populated Solver dialog for the blending problem

Figure 9.19: Solution to the orange juice-blending problem

Figure 9.20: Relaxed quality model

Figure 9.21: Solver implementation of the relaxed quality model

Figure 9.22: Solution to the relaxed quality model

Figure 9.23: Graphing the trade-off between cost and quality

Figure 9.24: Solver setup for minimax quality reduction

Figure 9.25: Adding indicator variables to the spreadsheet

Figure 9.26: Setting up our “Big M” constraint values

Figure 9.27: Initializing Solver for the minimax

Figure 9.28: Optimal solution limited to four suppliers per period (rows 19-...

Figure 9.29: Indicator and amount variables added for the de-acidification d...

Figure 9.30: Calculation added for upper bound on how much juice can be de-a...

Figure 9.31: Adding in a lower bound on de-acidification

Figure 9.32: Adding a “Not Reduced” calculation

Figure 9.33: Solver formulation for de-acidification problem

Figure 9.34: Solved de-acidification model

Figure 9.35: Combining independent random variables to illustrate how they g...

Figure 9.36: The cumulative distribution function for the cell phone contact...

Figure 9.37: Specifications with standard deviation added

Figure 9.38: 100 generated juice spec scenarios

Figure 9.39: Spec calculations for each scenario

Figure 9.40: Solver setup for robust optimization

Figure 9.41: Solution to the robust optimization model

Chapter 10

Figure 10.1: Tukey’s fences for some pregnancy durations

Figure 10.2: Adding conditional formatting for outliers

Figure 10.3: Birth duration with conditional formatting color codes

Figure 10.4: Gimli, son of Gloin, Dwarven outlier

Figure 10.5: Multidimensional employee performance data

Figure 10.6: Mean and standard deviation for each column

Figure 10.7: The standardized set of employee performance data

Figure 10.8: Empty employee distance matrix

Figure 10.9: The employee distance matrix

Figure 10.10: Employee 142619 ranked by distance in relation to 144624

Figure 10.11: Each employee on the column ranked in relation to each row

Figure 10.12: The indegree counts for three different nearest neighbor graph...

Figure 10.13: The performance data for employee 137155

Figure 10.14: The performance data for employee 143406

Figure 10.15: Employee 143406 has a high 5-distance

Figure 10.16: k-distance fails on local outliers

Figure 10.17: The triangle is not nearly as reachable by its neighbors as th...

Figure 10.18: The skeleton of the reach distance tab

Figure 10.19: All reach distances

Figure 10.20: Average reachability for each employee with respect to their n...

Figure 10.21: LOFs for the employees. Somebody is knocking on the door of 2....

Chapter 11

Figure 11.1: RStudio Online via posit.cloud

Figure 11.2: Your new variables appear in the Environment pane.

Figure 11.3: The Help window for the square root function

Figure 11.4: Search results for the word log

Figure 11.5: The Upload Files box

Figure 11.6: Viewing the winedata dataframe in R

Figure 11.7: winedata.clusters expanded

Figure 11.8: Pinot Noir appears as the top cluster when applying spherical k...

Figure 11.9: A variable importance plot in R

Figure 11.10: Recall and precision graphed in R

Figure 11.11: Graph of sword demand

Figure 11.12: Fan chart of the demand forecast

Figure 11.13: A boxplot of the pregnancy duration data

Figure 11.14: A boxplot with Tukey fences using three times the IQR

Guide

Cover

Table of Contents

Title Page

Copyright

Dedication

About the Author

About the Technical Editors

Acknowledgments

Introduction

Begin Reading

Index

End User License Agreement

Data Smart

Using Data Science to Transform Information into Insight

Second Edition

Jordan Goldmeier

Notes

1

   "Why do 87% of data science projects never make it into production?"

https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes

2

   "8 famous analytics and AI disasters."

www.cio.com/article/190888/5-famous-analytics-and-ai-disasters.html

3

   "Forecasting for COVID-19 has failed."

www.ncbi.nlm.nih.gov/pmc/articles/PMC7447267

4

   "The Real Story Of 2016."

https://fivethirtyeight.com/features/the-real-story-of-2016

5

   To see the impact DataKind has had, take a look at their case studies -

www.datakind.org/what-we-do