Data Science Programming All-in-One For Dummies - John Paul Mueller - E-Book


John Paul Mueller

Description

Your logical, linear guide to the fundamentals of data science programming.

Data science is exploding--in a good way--with a forecast of 1.7 megabytes of new information created every second for each human being on the planet by 2020 and 11.5 million job openings by 2026. It clearly pays to be in the know. This friendly guide charts a path through the fundamentals of data science and then delves into the actual work: linear regression, logistic regression, machine learning, neural networks, recommender engines, and cross-validation of models. Data Science Programming All-in-One For Dummies is a compilation of the key data science, machine learning, and deep learning programming languages: Python and R. It helps you decide which programming language is best for specific data science needs, and it gives you guidelines for building your own projects to solve problems in real time.

* Get grounded: the ideal start for new data professionals
* What lies ahead: learn about specific areas that data is transforming
* Be meaningful: find out how to tell your data story
* See clearly: pick up the art of visualization

Whether you're a beginning student or already mid-career, get your copy now and add even more meaning to your life--and everyone else's!


Page count: 1180

Year of publication: 2019




Data Science Programming All-in-One For Dummies®

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com

Copyright © 2020 by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit https://hub.wiley.com/community/support/dummies.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2019954497

ISBN 978-1-119-62611-4

ISBN 978-1-119-62613-8 (ebk); ISBN 978-1-119-62614-5 (ebk)

Data Science Programming All-In-One For Dummies®

To view this book's Cheat Sheet, simply go to www.dummies.com and search for “Data Science Programming All-In-One For Dummies Cheat Sheet” in the Search box.

Table of Contents

Cover

Introduction

About This Book

Foolish Assumptions

Icons Used in This Book

Beyond the Book

Where to Go from Here

Book 1: Defining Data Science

Chapter 1: Considering the History and Uses of Data Science

Considering the Elements of Data Science

Defining the Role of Data in the World

Creating the Data Science Pipeline

Comparing Different Languages Used for Data Science

Learning to Perform Data Science Tasks Fast

Chapter 2: Placing Data Science within the Realm of AI

Seeing the Data to Data Science Relationship

Defining the Levels of AI

Creating a Pipeline from Data to AI

Chapter 3: Creating a Data Science Lab of Your Own

Considering the Analysis Platform Options

Choosing a Development Language

Obtaining and Using Python

Obtaining and Using R

Presenting Frameworks

Accessing the Downloadable Code

Chapter 4: Considering Additional Packages and Libraries You Might Want

Considering the Uses for Third-Party Code

Obtaining Useful Python Packages

Locating Useful R Libraries

Chapter 5: Leveraging a Deep Learning Framework

Understanding Deep Learning Framework Usage

Working with Low-End Frameworks

Understanding TensorFlow

Book 2: Interacting with Data Storage

Chapter 1: Manipulating Raw Data

Defining the Data Sources

Considering the Data Forms

Understanding the Need for Data Reliability

Chapter 2: Using Functional Programming Techniques

Defining Functional Programming

Understanding Pure and Impure Languages

Comparing the Functional Paradigm

Using Python for Functional Programming Needs

Understanding How Functional Data Works

Working with Lists and Strings

Employing Pattern Matching

Working with Recursion

Performing Functional Data Manipulation

Chapter 3: Working with Scalars, Vectors, and Matrices

Considering the Data Forms

Defining Data Type through Scalars

Creating Organized Data with Vectors

Creating and Using Matrices

Extending Analysis to Tensors

Using Vectorization Effectively

Selecting and Shaping Data

Working with Trees

Representing Relations in a Graph

Chapter 4: Accessing Data in Files

Understanding Flat File Data Sources

Working with Positional Data Files

Accessing Data in CSV Files

Moving On to XML Files

Considering Other Flat-File Data Sources

Working with Nontext Data

Downloading Online Datasets

Chapter 5: Working with a Relational DBMS

Considering RDBMS Issues

Accessing the RDBMS Data

Creating a Dataset

Mixing RDBMS Products

Chapter 6: Working with a NoSQL DBMS

Considering the Ramifications of Hierarchical Data

Accessing the Data

Interacting with Data from NoSQL Databases

Working with Dictionaries

Developing Datasets from Hierarchical Data

Processing Hierarchical Data into Other Forms

Book 3: Manipulating Data Using Basic Algorithms

Chapter 1: Working with Linear Regression

Considering the History of Linear Regression

Combining Variables

Manipulating Categorical Variables

Using Linear Regression to Guess Numbers

Learning One Example at a Time

Chapter 2: Moving Forward with Logistic Regression

Considering the History of Logistic Regression

Differentiating between Linear and Logistic Regression

Using Logistic Regression to Guess Classes

Switching to Probabilities

Working through Multiclass Regression

Chapter 3: Predicting Outcomes Using Bayes

Understanding Bayes' Theorem

Using Naïve Bayes for Predictions

Working with Networked Bayes

Considering the Use of Bayesian Linear Regression

Considering the Use of Bayesian Logistic Regression

Chapter 4: Learning with K-Nearest Neighbors

Considering the History of K-Nearest Neighbors

Learning Lazily with K-Nearest Neighbors

Leveraging the Correct k Parameter

Implementing KNN Regression

Implementing KNN Classification

Book 4: Performing Advanced Data Manipulation

Chapter 1: Leveraging Ensembles of Learners

Leveraging Decision Trees

Working with Almost Random Guesses

Meeting Again with Gradient Descent

Averaging Different Predictors

Chapter 2: Building Deep Learning Models

Discovering the Incredible Perceptron

Hitting Complexity with Neural Networks

Understanding More about Neural Networks

Looking Under the Hood of Neural Networks

Explaining Deep Learning Differences with Other Forms of AI

Chapter 3: Recognizing Images with CNNs

Beginning with Simple Image Recognition

Understanding CNN Image Basics

Moving to CNNs with Character Recognition

Explaining How Convolutions Work

Detecting Edges and Shapes from Images

Chapter 4: Processing Text and Other Sequences

Introducing Natural Language Processing

Understanding How Machines Read

Understanding Semantics Using Word Embeddings

Using Scoring and Classification

Book 5: Performing Data-Related Tasks

Chapter 1: Making Recommendations

Realizing the Recommendation Revolution

Downloading Rating Data

Leveraging SVD

Chapter 2: Performing Complex Classifications

Using Image Classification Challenges

Distinguishing Traffic Signs

Chapter 3: Identifying Objects

Distinguishing Classification Tasks

Perceiving Objects in Their Surroundings

Overcoming Adversarial Attacks on Deep Learning Applications

Chapter 4: Analyzing Music and Video

Learning to Imitate Art and Life

Mimicking an Artist

Moving toward GANs

Chapter 5: Considering Other Task Types

Processing Language in Texts

Processing Time Series

Chapter 6: Developing Impressive Charts and Plots

Starting a Graph, Chart, or Plot

Setting the Axis, Ticks, and Grids

Defining the Line Appearance

Using Labels, Annotations, and Legends

Creating Scatterplots

Plotting Time Series

Plotting Geographical Data

Visualizing Graphs

Book 6: Diagnosing and Fixing Errors

Chapter 1: Locating Errors in Your Data

Considering the Types of Data Errors

Obtaining the Required Data

Validating Your Data

Manicuring the Data

Dealing with Dates in Your Data

Chapter 2: Considering Outrageous Outcomes

Deciding What Outrageous Means

Considering the Five Mistruths in Data

Considering Detection of Outliers

Examining a Simple Univariate Method

Developing a Multivariate Approach

Chapter 3: Dealing with Model Overfitting and Underfitting

Understanding the Causes

Determining the Sources of Overfitting and Underfitting

Guessing the Right Features

Chapter 4: Obtaining the Correct Output Presentation

Considering the Meaning of Correct

Determining a Presentation Type

Choosing the Right Graph

Working with External Data

Chapter 5: Developing Consistent Strategies

Standardizing Data Collection Techniques

Using Reliable Sources

Verifying Dynamic Data Sources

Looking for New Data Collection Trends

Weeding Old Data

Considering the Need for Randomness

Index

About the Authors

Connect with Dummies

End User License Agreement

List of Tables

Chapter 2

TABLE 2-1: Comparing Machine Learning to Statistics

List of Illustrations

Book 1 Chapter 1

FIGURE 1-1: Loading data into variables so that you can manipulate it.

FIGURE 1-2: Using the variable content to train a linear regression model.

FIGURE 1-3: Outputting a result as a response to the model.

Book 1 Chapter 2

FIGURE 2-1: Deep learning is a subset of machine learning which is a subset of ...

Book 1 Chapter 3

FIGURE 3-1: The setup process begins by telling you whether you have the 64-bit...

FIGURE 3-2: Tell the wizard how to install Anaconda on your system.

FIGURE 3-3: Specify an installation location.

FIGURE 3-4: Configure the advanced installation options.

FIGURE 3-5: Create a folder to use to hold the book’s code.

FIGURE 3-6: Provide a new name for your notebook.

FIGURE 3-7: A notebook contains cells that you use to hold code.

FIGURE 3-8: Your saved notebooks appear in a list in the project folder.

FIGURE 3-9: Notebook warns you before removing any files from the repository.

FIGURE 3-10: The files you want to add to the repository appear as part of an u...

FIGURE 3-11: Colab makes using your Python projects on a tablet easy.

FIGURE 3-12: Azure Notebooks provides another means of running Python code.

FIGURE 3-13: Open an Anaconda Prompt to install R.

FIGURE 3-14: The conda utility tells you which packages it will install.

FIGURE 3-15: Anaconda Navigator provides access to a number of useful tools.

FIGURE 3-16: Changing your environment will often change the available tool lis...

FIGURE 3-17: You can save R code in .r files, but the .r files lack Notebook co...

Book 1 Chapter 5

FIGURE 5-1: Be sure to use the Anaconda prompt for the installation and check t...

FIGURE 5-2: Choose the Visual C++ Build Tools workload to support your Python s...

FIGURE 5-3: Select an environment to use in Anaconda Navigator.

Book 2 Chapter 3

FIGURE 3-1: A tree in Python looks much like the physical alternative.

FIGURE 3-2: Graph nodes can connect to each other in myriad ways.

Book 2 Chapter 4

FIGURE 4-1: A text file contains only text and a little formatting with control...

FIGURE 4-2: Each field in this file consumes precisely the same space.

FIGURE 4-3: This file includes carriage returns for row indicators.

FIGURE 4-4: The raw format of a CSV file is still text and quite readable.

FIGURE 4-5: Use an application such as Excel to create a formatted CSV presenta...

FIGURE 4-6: CSV headers can contain data type information, among other clues.

FIGURE 4-7: XML is a hierarchical format that can become quite complex.

FIGURE 4-8: An Excel file is highly formatted and might contain information of ...

FIGURE 4-9: The image appears onscreen after you render and show it.

FIGURE 4-10: Cropping the image makes it smaller.

Book 2 Chapter 6

FIGURE 6-1: A hierarchical construction relies on links to each item.

FIGURE 6-2: The arrangement of keys when using a BST.

FIGURE 6-3: The arrangement of keys when using a binary heap.

FIGURE 6-4: An example graph that you can use for certain types of data storage...

Book 3 Chapter 1

FIGURE 1-1: Drawing a linear regression line through the data points.

FIGURE 1-2: Developing a multiple regression model.

FIGURE 1-3: Changing the simple linear regression question.

FIGURE 1-4: Seeing the effect of i on y.

FIGURE 1-5: Using a residual plot to see errant data.

FIGURE 1-6: Nonlinear relationship between variable LSTAT and target prices.

FIGURE 1-7: Combined variables LSTAT and RM help to separate high from low pric...

FIGURE 1-8: Adding polynomial features increases the predictive power.

FIGURE 1-9: A slow descent optimizing squared error.

Book 3 Chapter 2

FIGURE 2-1: Contrasting linear to logistic regression.

FIGURE 2-2: Considering the approach to fitting the data.

FIGURE 2-3: Contrasting linear to logistic regression.

FIGURE 2-4: Probabilities do not work as well with a straight line as they do w...

FIGURE 2-5: The plot shows the result of a multiclass regression among three cl...

Book 3 Chapter 3

FIGURE 3-1: Seeing the probabilities for each of the colors.

FIGURE 3-2: Determining how many cars to paint specific colors.

FIGURE 3-3: The interactive version of the Asia Bayesian network is helpful in ...

FIGURE 3-4: A Naïve Bayes model can retrace evidence to the right outcome.

FIGURE 3-5: A visualization of the decision tree built from the play-tennis dat...

Book 3 Chapter 4

FIGURE 4-1: The bull’s-eye dataset, a nonlinear cloud of points that is difficu...

FIGURE 4-2: The KNN approach models the data differently than multiple linear r...

Book 4 Chapter 1

FIGURE 1-1: Comparing a single decision tree output to an ensemble of decision ...

FIGURE 1-2: Seeing the accuracy of ensembles of different sizes.

FIGURE 1-3: Installing the rfpimp package in Python.

Book 4 Chapter 2

FIGURE 2-1: The separating line of a perceptron across two classes.

FIGURE 2-2: Learning logical XOR using a single separating line isn’t possible.

FIGURE 2-3: Plots of different activation functions.

FIGURE 2-4: An example of the architecture of a neural network.

FIGURE 2-5: A detail of the feed-forward process in a neural network.

FIGURE 2-6: Two interleaving moon-shaped clouds of data points.

FIGURE 2-7: How the ReLU activation function works in receiving and releasing s...

FIGURE 2-8: Dropout temporarily rules out 40 percent of neurons from the traini...

Book 4 Chapter 3

FIGURE 3-1: The image appears onscreen after you render and show it.

FIGURE 3-2: Different filters for different noise cleaning.

FIGURE 3-3: Cropping the image makes it smaller.

FIGURE 3-4: The example application would like to find similar photos.

FIGURE 3-5: The output shows the results that resemble the test image.

FIGURE 3-6: Examples from the training and test sets differ in pose and express...

FIGURE 3-7: Each pixel is read by the computer as a number in a matrix.

FIGURE 3-8: Only by translation invariance can an algorithm spot the dog and it...

FIGURE 3-9: Displaying some of the handwritten characters from MNIST.

FIGURE 3-10: A convolution processes a chunk of an image by matrix multiplicati...

FIGURE 3-11: The borders of an image are detected after applying a 3-x-3 pixel ...

FIGURE 3-12: A max pooling layer operating on chunks of a reduced image.

FIGURE 3-13: The architecture of LeNet5, a neural network for handwritten digit...

FIGURE 3-14: A plot of the LeNet5 network training process.

FIGURE 3-15: Processing a dog image using convolutions.

FIGURE 3-16: The content of an image is transformed by style transfer.

Book 5 Chapter 2

FIGURE 2-1: Some common image augmentations.

FIGURE 2-2: Some examples from the German Traffic Sign Recognition Benchmark.

FIGURE 2-3: Distribution of classes.

FIGURE 2-4: Training and validation errors compared.

Book 5 Chapter 3

FIGURE 3-1: Detection, localization, and segmentation example from the Coco dat...

FIGURE 3-2: Object detection resulting from Keras-RetinaNet.

Book 5 Chapter 4

FIGURE 4-1: A human might see a fanciful drawing.

FIGURE 4-2: The computer sees a series of numbers.

FIGURE 4-3: How a GAN operates.

Book 5 Chapter 5

FIGURE 5-1: Working with cyclic data that varies over time.

Book 5 Chapter 6

FIGURE 6-1: The output of a plain line graph.

FIGURE 6-2: The output of multiple datasets in a single line graph.

FIGURE 6-3: The output of multiple presentations in a single figure.

FIGURE 6-4: Allowing multiple revisions to a single output graphic.

FIGURE 6-5: The original figure changes as needed.

FIGURE 6-6: Modifying the plot ticks.

FIGURE 6-7: Adding grid lines to make data easier to read.

FIGURE 6-8: Making changes to a line as part of the plot or separately.

FIGURE 6-9: Adding markers to emphasize the data points.

FIGURE 6-10: Labels identify specific graphic elements.

FIGURE 6-11: Annotation provides the means of pointing something out.

FIGURE 6-12: Legends identify the individual grouped data elements.

FIGURE 6-13: Some plots really don’t say anything at all.

FIGURE 6-14: Differentiation makes the plots easier to interpret.

FIGURE 6-15: A scatterplot showing a high degree of negative correlation.

FIGURE 6-16: A scatterplot showing a high degree of positive correlation.

FIGURE 6-17: Using a general plot to display date-oriented data.

FIGURE 6-18: Using plot_date() to display date-oriented data.

FIGURE 6-19: The results of calculating a trend line for the airline passenger ...

FIGURE 6-20: An orthographic projection of the world.

FIGURE 6-21: Your maps can look quite realistic.

FIGURE 6-22: Some projections allow for a close look.

FIGURE 6-23: Adding locations or other information to the map.

FIGURE 6-24: Plotting the original graph.

FIGURE 6-25: Plotting the graph addition.

Book 6 Chapter 2

FIGURE 2-1: Descriptive statistics for a DataFrame.

FIGURE 2-2: Boxplots.

FIGURE 2-3: Reporting possibly outlying examples.

FIGURE 2-4: The first two and last two components from the PCA.

FIGURE 2-5: The possible outlying cases spotted by PCA.

Book 6 Chapter 3

FIGURE 3-1: Underfitting is the result of using a model that isn't complex enou...

FIGURE 3-2: Overfitting causes the model to follow the data too closely.

FIGURE 3-3: Applying the model to slightly different data shows the problem wit...

FIGURE 3-4: Using the correct degrees of polynomial fitting makes a big differe...

FIGURE 3-5: Nonlinear relationship between variable LSTAT and target prices.

FIGURE 3-6: Combined variables LSTAT and RM help to separate high from low pric...

FIGURE 3-7: Adding polynomial features increases the predictive power.

Book 6 Chapter 4

FIGURE 4-1: Pie charts show a percentage of the whole.

FIGURE 4-2: Bar charts make performing comparisons easier.

FIGURE 4-3: Histograms let you see distributions of numbers.

FIGURE 4-4: Use boxplots to present groups of numbers.

FIGURE 4-5: Use line graphs to show trends.

FIGURE 4-6: Use scatterplots to show groups of data points and their associated...

FIGURE 4-7: Load external code as needed to provide specific information for yo...

FIGURE 4-8: Embedding images can dress up your notebook presentation.



Introduction

Data science is a term that the media has chosen to minimize, obfuscate, and sometimes misuse. It involves a lot more than just data and the science of working with data. Today, the world uses data science in all sorts of ways that you might not know about, which is why you need Data Science Programming All-in-One For Dummies.

In the book, you start with both the data and the science of manipulating it, but then you go much further. In addition to seeing how to perform a wide range of analysis, you also delve into making recommendations, classifying real-world objects, analyzing audio, and even creating art.

However, you don’t just learn about amazing new technologies and how to perform common tasks. This book also dispels myths created by people who wish data science were something different than it really is or who don’t understand it at all. A great deal of misinformation swirls around the world today as the media seeks to sensationalize, anthropomorphize, and emotionalize technologies that are, in fact, quite mundane. It’s hard to know what to believe. You find reports that robots are on the cusp of becoming sentient and that the giant tech companies can discover your innermost thoughts simply by reviewing your record of purchases. With this book, you can replace disinformation with solid facts, and you can use those facts to create a strategy for performing data science development tasks.

About This Book

You might find that this book starts off a little slowly because most people don’t have a good grasp on getting a system prepared for data science use. Book 1 helps you configure your system. The book uses Jupyter Notebook as an Integrated Development Environment (IDE) for both Python and R. That way, if you choose to view the examples in both languages, you use the same IDE to do it. Jupyter Notebook also relies on the literate programming strategy first proposed by Donald Knuth (see http://www.literateprogramming.com/) to make your coding efforts significantly easier and more focused on the data. In addition, unlike in other environments, you don’t write an entire application before you see any results; you write a block of code and focus on the output of just that block as part of the whole application.

After you have a development environment installed and ready to use, you can start working with data in all its myriad forms in Book 2. This book covers a great many of these forms — everything from in-memory datasets to those found on large websites. In addition, you see a number of data formats ranging from flat files to Relational Database Management Systems (RDBMSs) and Not Only SQL (NoSQL) databases.

Of course, manipulating data is worthwhile only if you can do something useful with it. Book 3 discusses common sorts of analysis, such as linear and logistic regression, Bayes’ Theorem, and K-Nearest Neighbors (KNN).

Most data science books stop at this point. In this book, however, you discover AI, machine learning, and deep learning techniques to get more out of your data than you might have thought possible. This exciting part of the book, Book 4, represents the cutting edge of analysis. You use huge datasets to discover important information about large groups of people that will help you improve their health or sell them products.

Performing analysis may be interesting, but analysis is only a step along the path. Book 5 shows you how to put your analysis to use: build recommender systems, classify objects, work with nontextual data such as music and video, and display the results of an analysis in a form that everyone can appreciate.

The final minibook, Book 6, offers something you won’t find in many places, not even online. You discover how to detect and fix problems with your data, the logic used to interpret the data, and the code used to perform tasks such as analysis. By the time you complete Book 6, you’ll know much more about how to ensure that the results you get are actually the results you need and want.

To make absorbing the concepts easy, this book uses the following conventions:

Text that you’re meant to type just as it appears in the book is in bold. The exception is when you’re working through a step list: because each step is bold, the text to type is not bold.

When you see words in italics as part of a typing sequence, you need to replace that value with something that works for you. For example, if you see “Type Your Name and press Enter,” you need to replace Your Name with your actual name.

Web addresses and programming code appear in monofont. If you’re reading a digital version of this book on a device connected to the Internet, you can click or tap the web address to visit that website, like this: https://www.dummies.com.

When you need to type command sequences, you see them separated by a special arrow, like this: File ⇒ New File. In this example, you go to the File menu first and then select the New File entry on that menu.

Foolish Assumptions

You might find it difficult to believe that we’ve assumed anything about you — after all, we haven’t even met you yet! Although most assumptions are indeed foolish, we made these assumptions to provide a starting point for the book.

You need to be familiar with the platform you want to use because the book doesn’t offer any guidance in this regard. (Book 1, Chapter 3 does, however, provide Anaconda installation instructions for both Python and R, and Book 1, Chapter 5 helps you install the TensorFlow and Keras frameworks used for this book.) To give you the maximum information about Python as it applies to deep learning, this book doesn’t discuss any platform-specific issues. You see the R versions of the Python coding examples in the downloadable source, along with R-specific notes on usage and development. You really do need to know how to install applications, use applications, and generally work with your chosen platform before you begin working with this book.

You must know how to work with Python or R. You can find a wealth of Python tutorials online (see https://www.w3schools.com/python/ and https://www.tutorialspoint.com/python/ as examples). R, likewise, provides a wealth of online tutorials (see https://www.tutorialspoint.com/r/index.htm, https://docs.anaconda.com/anaconda/navigator/tutorials/r-lang/, and https://www.statmethods.net/r-tutorial/index.html as examples).

This book isn’t a math primer. Yes, you see many examples of complex math, but the emphasis is on helping you use Python or R to perform data science development tasks rather than teaching math theory. We include some examples that also discuss the use of technologies such as data management (see Book 2), statistical analysis (see Book 3), AI, machine learning, deep learning (see Book 4), practical data science application (see Book 5), and troubleshooting both data and code (see Book 6). Book 1, Chapters 1 and 2 give you a better understanding of precisely what you need to know to use this book successfully. You also use a considerable number of libraries in writing code for this book. Book 1, Chapter 4 discusses library use and suggests other libraries that you might want to try.

This book also assumes that you can access items on the Internet. Sprinkled throughout are numerous references to online material that will enhance your learning experience. However, these added sources are useful only if you actually find and use them.

Icons Used in This Book

As you read this book, you see icons in the margins that indicate material of interest (or not, as the case may be). This section briefly describes each icon in this book.

Tips are nice because they help you save time or perform some task without a lot of extra work. The tips in this book are time-saving techniques or pointers to resources that you should try so that you can get the maximum benefit from Python or R, or from performing deep learning–related tasks. (Note that R developers will also find copious notes in the source code files for issues that differ significantly from Python.)

We don’t want to sound like angry parents or some kind of maniacs, but you should avoid doing anything that’s marked with a Warning icon. Otherwise, you might find that your application fails to work as expected, you get incorrect answers from seemingly bulletproof algorithms, or (in the worst-case scenario) you lose data.

Whenever you see this icon, think advanced tip or technique. You might find these tidbits of useful information just too boring for words, or they could contain the solution you need to get a program running. Skip these bits of information whenever you like.

If you don’t get anything else out of a particular chapter or section, remember the material marked by this icon. This text usually contains an essential process or a bit of information that you must know to work with Python or R, or to perform deep learning–related tasks successfully. (Note that the R source code files contain a great deal of text that gives essential details for working with R when R differs considerably from Python.)

Beyond the Book

This book isn’t the end of your Python or R data science development experience — it’s really just the beginning. We provide online content to make this book more flexible and better able to meet your needs. That way, as we receive email from you, we can address questions and tell you how updates to Python, R, or their associated add-ons affect book content. In fact, you gain access to all these cool additions:

Cheat sheet: You remember using crib notes in school to make a better mark on a test, don’t you? You do? Well, a cheat sheet is sort of like that. It provides you with some special notes about tasks that you can do with Python and R with regard to data science development that not every other person knows. You can find the cheat sheet by going to www.dummies.com, searching this book's title, and scrolling down the page that appears. The cheat sheet contains really neat information, such as the most common data errors that cause people problems with working in the data science field.

Updates: Sometimes changes happen. For example, we might not have seen an upcoming change when we looked into our crystal ball during the writing of this book. In the past, this possibility simply meant that the book became outdated and less useful, but you can now find updates to the book, if we have any, by searching this book's title at www.dummies.com.

In addition to these updates, check out the blog posts with answers to reader questions and demonstrations of useful, book-related techniques at http://blog.johnmuellerbooks.com/.

Companion files: Hey! Who really wants to type all the code in the book and reconstruct all those neural networks manually? Most readers prefer to spend their time actually working with data and seeing the interesting things they can do, rather than typing. Fortunately for you, the examples used in the book are available for download, so all you need to do is read the book to learn Python or R data science programming techniques. You can find these files at www.dummies.com. Search this book's title, and on the page that appears, scroll down to the image of the book cover and click it. Then click the More about This Book button and on the page that opens, go to the Downloads tab.

Where to Go from Here

It’s time to start your Python or R for data science programming adventure! If you’re completely new to Python or R and their use for data science tasks, you should start with Book 1, Chapter 1. Progressing through the book at a pace that allows you to absorb as much of the material as possible makes it feasible for you to gain insights that you might not otherwise gain if you read the chapters in a random order. However, the book is designed to allow you to read the material in any order desired.

If you’re a novice who’s in an absolute rush to get going with Python or R for data science programming as quickly as possible, you can skip to Book 1, Chapter 3 with the understanding that you may find some topics a bit confusing later. Skipping to Book 1, Chapter 5 is okay if you already have Anaconda (the programming product used in the book) installed with the appropriate language (Python or R as you desire), but be sure to at least skim Chapter 3 so that you know what assumptions we made when writing this book.

This book relies on a combination of TensorFlow and Keras to perform deep learning tasks. Even if you’re an advanced reader who wants to perform deep learning tasks, you need to go to Book 1, Chapter 5 to discover how to configure the environment used for this book. You must configure the environment according to instructions or you’re likely to experience failures when you try to run the code. However, this issue applies only to deep learning. This book has a great deal to offer in other areas, such as data manipulation and statistical analysis.

Book 1

Defining Data Science

Contents at a Glance

Chapter 1: Considering the History and Uses of Data Science

Considering the Elements of Data Science

Defining the Role of Data in the World

Creating the Data Science Pipeline

Comparing Different Languages Used for Data Science

Learning to Perform Data Science Tasks Fast

Chapter 2: Placing Data Science within the Realm of AI

Seeing the Data to Data Science Relationship

Defining the Levels of AI

Creating a Pipeline from Data to AI

Chapter 3: Creating a Data Science Lab of Your Own

Considering the Analysis Platform Options

Choosing a Development Language

Obtaining and Using Python

Obtaining and Using R

Presenting Frameworks

Accessing the Downloadable Code

Chapter 4: Considering Additional Packages and Libraries You Might Want

Considering the Uses for Third-Party Code

Obtaining Useful Python Packages

Locating Useful R Libraries

Chapter 5: Leveraging a Deep Learning Framework

Understanding Deep Learning Framework Usage

Working with Low-End Frameworks

Understanding TensorFlow

Chapter 1

Considering the History and Uses of Data Science

IN THIS CHAPTER

Understanding data science history and uses

Considering the flow of data in data science

Working with various languages in data science

Performing data science tasks quickly

The burgeoning uses for data in the world today, along with the explosion of data sources, create a demand for people who have special skills to obtain, manage, and analyze information for the benefit of everyone. The data scientist develops and hones these special skills to perform such tasks on multiple levels, as described in the first two sections of this chapter.

Data needs to be funneled into acceptable forms that allow data scientists to perform their tasks. Even though the precise data flow varies, you can generalize it to a degree. The third section of the chapter gives you an overview of how data flow occurs.

As with anyone engaged in computer work today, a data scientist employs various programming languages to express the manipulation of data in a repeatable manner. The languages that a data scientist uses, however, focus on outputs expected from given inputs, rather than on low-level control or a precise procedure, as a computer scientist would use. Because a data scientist may lack a formal programming education, the languages tend to focus on declarative strategies, with the data scientist expressing a desired outcome rather than devising a specific procedure. The fourth section of the chapter discusses various languages used by data scientists, with an emphasis on Python and R.

The final section of the chapter provides a brief overview of getting tasks done quickly. Optimization without loss of precision is an incredibly difficult task, and you see it covered a number of times in this book, but this introduction is enough to get you started. The overall goal of this first chapter is to describe data science and explain how a data scientist uses algorithms, statistics, data extraction, data manipulation, and a slew of other technologies as part of an analysis.

You don’t have to type the source code for this chapter manually (or at all, really, given that you use it only to gain an understanding of the data flow process). In fact, using the downloadable source is a lot easier. The source code for this chapter appears in the DSPD_0101_Quick_Overview.ipynb source code file for Python. See the Introduction for details on how to find these source files.

Considering the Elements of Data Science

At one point, the world viewed anyone working with statistics as a sort of accountant or perhaps a mad scientist. Many people consider statistics and the analysis of data boring. However, data science is one of those occupations in which the more you learn, the more you want to learn. Answering one question often spawns more questions that are even more interesting than the one you just answered. However, what makes data science so sexy is that you see it everywhere, used in an almost infinite number of ways. The following sections give you more details on why data science is such an amazing field of study.

Considering the emergence of data science

Data science is a relatively new term. William S. Cleveland coined the term in 2001 as part of a paper entitled “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” It wasn't until a year later that the International Council for Science actually recognized data science and created a committee for it. Columbia University got into the act in 2003 by beginning publication of the Journal of Data Science.

However, the mathematical basis behind data science is centuries old because data science is essentially a method of viewing and analyzing statistics and probability. The term statistics first saw significant use in 1749, but statistics are certainly much older than that. People have used statistics to recognize patterns for thousands of years. For example, the historian Thucydides (in his History of the Peloponnesian War) describes how the Athenians calculated the height of the wall of Plataea in the fifth century BC by counting bricks in an unplastered section of the wall. Because the count needed to be accurate, the Athenians took the average of the counts made by several soldiers.

The process of quantifying and understanding statistics is relatively new, but the science itself is quite old. An early attempt to begin documenting the importance of statistics appears in the ninth century, when Al-Kindi wrote Manuscript on Deciphering Cryptographic Messages. In this paper, Al-Kindi describes how to use a combination of statistics and frequency analysis to decipher encrypted messages. Even in the beginning, statistics saw use in the practical application of science for tasks that seemed virtually impossible to complete. Data science continues this process, and to some people it might actually seem like magic.

Outlining the core competencies of a data scientist

As is true of anyone performing most complex trades today, the data scientist requires knowledge of a broad range of skills to perform the required tasks. In fact, so many different skills are required that data scientists often work in teams. Someone who is good at gathering data might team up with an analyst and someone gifted in presenting information. Finding a single person who possesses all the required skills would be hard. With this in mind, the following list describes areas in which a data scientist can excel (with more competencies being better):

Data capture: It doesn’t matter what sort of math skills you have if you can’t obtain data to analyze in the first place. The act of capturing data begins by managing a data source using database-management skills. However, raw data isn’t particularly useful in many situations; you must also understand the data domain so that you can look at the data and begin formulating the sorts of questions to ask. Finally, you must have data-modeling skills so that you understand how the data is connected and whether the data is structured.

Analysis: After you have data to work with and understand the complexities of that data, you can begin to perform an analysis on it. You perform some analysis using basic statistical tool skills, much like those that just about everyone learns in college. However, the use of specialized math tricks and algorithms can make patterns in the data more obvious or help you draw conclusions that you can’t draw by reviewing the data alone.

Presentation: Most people don’t understand numbers well. They can’t see the patterns that the data scientist sees. Providing a graphical presentation of these patterns is important to help others visualize what the numbers mean and how to apply them in a meaningful way. More important, the presentation must tell a specific story so that the impact of the data isn’t lost.
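The three competencies above can be seen in miniature using nothing more than Python's standard library. The sales figures below are invented for illustration, but the flow (capture some numbers, analyze them statistically, then present the pattern) mirrors what a data science team does at a much larger scale:

```python
import statistics

# Capture: hypothetical daily sales figures (made-up data for illustration)
sales = [120, 135, 128, 310, 122, 131, 129]

# Analysis: basic statistics reveal a pattern that the raw list hides
mean = statistics.mean(sales)
stdev = statistics.stdev(sales)
outliers = [x for x in sales if abs(x - mean) > 2 * stdev]

# Presentation: even a crude text chart tells the story at a glance
for value in sales:
    print(f"{value:4} {'#' * (value // 10)}")

print(f"Mean: {mean:.1f}, StDev: {stdev:.1f}, Outliers: {outliers}")
```

Notice the 310 entry: scanning the raw list, you could easily miss it, but both the statistics and the bar chart make the outlier obvious, which is exactly what the analysis and presentation steps are for.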

Linking data science, big data, and AI

Interestingly enough, the act of moving data around so that someone can perform analysis on it is a specialty called Extract, Transform, and Load (ETL). The ETL specialist uses programming languages such as Python to extract the data from a number of sources. Corporations tend not to keep data in one easily accessed location, so finding the data required to perform analysis takes time. After the ETL specialist finds the data, a programming language or other tool transforms it into a common format for analysis purposes. The loading process takes many forms, but this book relies on Python to perform the task. In a large, real-world operation, you might find yourself using tools such as Informatica, MS SSIS, or Teradata to perform the task.
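A toy version of the ETL flow helps make the three steps concrete. This sketch uses only Python's standard library, and both data sources are made up; a real ETL job would pull from databases, files, or APIs rather than in-memory strings:

```python
import csv
import io
import json

# Extract: pull raw records from two hypothetical sources that use
# different formats, as corporate data sources usually do
csv_source = io.StringIO("name,amount\nAnn,10\nBob,20\n")
json_source = '[{"name": "Cho", "amount": 15}]'

records = list(csv.DictReader(csv_source))
records += json.loads(json_source)

# Transform: coerce every record into one common shape for analysis
transformed = [{"name": r["name"], "amount": float(r["amount"])}
               for r in records]

# Load: here, simply collect the results in memory; a production job
# would write them to a database or data warehouse instead
warehouse = transformed
print(warehouse)
```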

Data science isn’t necessarily a means to an end; it may instead be a step along the way. As a data scientist works through various datasets and finds interesting facts, these facts may act as input for other sorts of analysis and AI applications. For example, consider that your shopping habits often suggest what books you might like or where you might like to go for a vacation. Shopping or other habits can also help others understand other, sometimes less benign, activities as well. Machine Learning For Dummies and Artificial Intelligence For Dummies, both by John Paul Mueller and Luca Massaron (Wiley), help you understand these other uses of data science. For now, consider the fact that what you learn in this book can have a definite effect on a career path that will go many other places.

Understanding the role of programming

A data scientist may need to know several programming languages in order to achieve specific goals. For example, you may need SQL knowledge to extract data from relational databases. Python can help you perform data loading, transformation, and analysis tasks. However, you might choose a product such as MATLAB (which has its own programming language) or PowerPoint (which relies on VBA) to present the information to others. (If you’re interested to see how MATLAB compares to the use of Python, you can get the book, MATLAB For Dummies, by John Paul Mueller [Wiley].) The immense datasets that data scientists rely on often require multiple levels of redundant processing to transform into useful processed data. Manually performing these tasks is time consuming and error prone, so programming presents the best method for achieving the goal of a coherent, usable data source.
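As an example of mixing languages, the following sketch uses SQL to extract and summarize data while Python receives the results. The table and values are hypothetical, and the in-memory SQLite database stands in for a real corporate data store:

```python
import sqlite3

# Build a throwaway in-memory database standing in for a corporate source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Ann", 19.5), ("Bob", 42.0), ("Ann", 8.25)])

# SQL does the set-oriented extraction and summarization work;
# Python takes over for the later transformation and analysis steps
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer "
    "ORDER BY customer"
).fetchall()
conn.close()
print(rows)  # [('Ann', 27.75), ('Bob', 42.0)]
```

The pattern scales: each language handles the part of the job it does best, which is why sticking to a single language is often impractical.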

Given the number of products that most data scientists use, sticking to just one programming language may not be possible. Yes, Python can load data, transform it, analyze it, and even present it to the end user, but the process works only when the language provides the required functionality. You may have to choose other languages to fill out your toolkit. The languages you choose depend on a number of criteria. Here are some criteria you should consider:

How you intend to use data science in your code (you have a number of tasks to consider, such as data analysis, classification, and regression)

Your familiarity with the language

The need to interact with other languages

The availability of tools to enhance the development environment

The availability of APIs and libraries to make performing tasks easier

Defining the Role of Data in the World

This section of the chapter is too short. It can’t even begin to describe the ways in which data will affect you in the future. Consider the following subsections as offering tantalizing tidbits — appetizers that can whet your appetite for exploring the world of data and data science further. The applications listed in these sections are already common in some settings. You probably used at least one of them today, and quite likely more than just one. After reading the following sections, you might want to take the time to consider all the ways in which data currently affects your life. The use of data to perform amazing feats is really just the beginning. Humanity is at the cusp of an event that will rival the Industrial Revolution (see https://www.history.com/topics/industrial-revolution/industrial-revolution), and the use of data (and its associated technologies, such as AI, machine learning, and deep learning) is actually quite immature at this point.

Enticing people to buy products

Demographics, those vital or social statistics that group people by certain characteristics, have always been part art and part science. You can find any number of articles about getting your computer to generate demographics for clients (or potential clients). The use of demographics is wide ranging, but you see them used for things like predicting which product a particular group will buy (versus that of the competition). Demographics are an important means of categorizing people and then predicting some action on their part based on their group associations. Here are the methods that you often see cited for AIs when gathering demographics:

Historical: Based on previous actions, an AI generalizes which actions you might perform in the future.

Current activity: Based on the action you perform now and perhaps other characteristics, such as gender, a computer predicts your next action.

Characteristics: Based on the properties that define you, such as gender, age, and area where you live, a computer predicts the choices you are likely to make.
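The characteristics approach can be illustrated with a toy rule-based predictor. The rules and categories below are pure invention; a real system would learn such associations from historical data rather than hard-coding them:

```python
# Toy characteristics-based prediction: guess a likely product category
# from a couple of personal properties. The rules are invented solely
# for illustration and carry no real-world validity.
def predict_category(age, region):
    if age < 25:
        return "electronics"
    if region == "coastal":
        return "outdoor gear"
    return "home goods"

profiles = [(19, "urban"), (40, "coastal"), (55, "rural")]
print([predict_category(a, r) for a, r in profiles])
# ['electronics', 'outdoor gear', 'home goods']
```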

You can find articles about AI’s predictive capabilities that seem almost too good to be true. For example, the article at https://medium.com/@demografy/artificial-intelligence-can-now-predict-demographic-characteristics-knowing-only-your-name-6749436a6bd3 says that AI can now predict your demographics based solely on your name. The company in that article, Demografy (https://demografy.com/), claims to provide gender, age, and cultural affinity based solely on name. Even though the site claims that it’s 90 to 95 percent accurate (see the Is Demografy Accurate answer at https://demografy.com/faq for details), this statistic is unlikely because some names are gender ambiguous, such as Renee, and others are assigned to one gender in some countries and another gender in others. In fact, the answer on the Demografy site seems to acknowledge this issue by saying the outcome “heavily depends on your particular list and may show considerably different results than these averages”. Yes, demographic prediction can work, but exercise care before believing everything that these sites tell you.

If you want to experiment with demographic prediction, you can find a number of APIs online. For example, the DeepAI API at https://deepai.org/machine-learning-model/demographic-recognition promises to help you predict age, gender, and cultural background based on a person’s appearance in a video. Each of the online APIs specializes, though, so you need to choose the API with an eye toward the kind of input data you can provide.

Keeping people safer

You already have a good idea of how data might affect you in ways that keep you safer. For example, statistics help car designers create new designs that provide greater safety for the occupant and sometimes other parties as well. Data also figures into calculations for things like

Medications

Medical procedures

Safety equipment

Safety procedures

How long to keep the crosswalk signs lit

Safety goes much further, though. For example, people have been trying to predict natural disasters for as long as there have been people and natural disasters. No one wants to be part of an earthquake, tornado, volcanic eruption, or any other natural disaster. Being able to get away quickly is the prime consideration in such cases, given that humans can’t control their environment well enough yet to prevent any natural disaster.

Data managed by deep learning provides the means to look for extremely subtle patterns that boggle the minds of humans. These patterns can help predict a natural catastrophe, according to the article on Google’s solution at http://www.digitaljournal.com/tech-and-science/technology/google-to-use-ai-to-predict-natural-disasters/article/533026. The fact that the software can predict any disaster at all is simply amazing. However, the article at http://theconversation.com/ai-could-help-us-manage-natural-disasters-but-only-to-an-extent-90777 warns that relying on such software exclusively would be a mistake. Overreliance on technology is a constant theme throughout this book, so don’t be surprised that deep learning is less than perfect in predicting natural catastrophes as well.

Creating new technologies

New technologies can cover a very wide range of applications. For example, you find new technologies for making factories safer and more efficient all the time. Space travel requires an inordinate number of new technologies. Just consider how the data collected in the past affects things like smart phone use and the manner in which you drive your car.

However, a new technology can take an interesting twist, and you should look for these applications as well. You probably have black-and-white videos or pictures of family members or special events that you’d love to see in color. Color consists of three elements: hue (the actual color); value (the darkness or lightness of the color); and saturation (the intensity of the color). You can read more about these elements at http://learn.leighcotnoir.com/artspeak/elements-color/hue-value-saturation/. Oddly enough, many artists are color-blind and make strong use of color value in their creations (read https://www.nytimes.com/2017/12/23/books/a-colorblind-artist-illustrator-childrens-books.html as one of many examples). So having hue missing (the element that black-and-white art lacks) isn’t the end of the world. Quite the contrary: Some artists view it as an advantage (see https://www.artsy.net/article/artsy-editorial-the-advantages-of-being-a-colorblind-artist for details).

When viewing something in black and white, you see value and saturation but not hue. Colorization is the process of adding the hue back in. Artists generally perform this process using a painstaking selection of individual colors, as described at https://fstoppers.com/video/how-amazing-colorization-black-and-white-photos-are-done-5384 and https://www.diyphotography.net/know-colors-add-colorizing-black-white-photos/. However, AI has automated this process using Convolutional Neural Networks (CNNs), as described at https://emerj.com/ai-future-outlook/ai-is-colorizing-and-beautifying-the-world/.
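You can poke at the hue, saturation, and value relationship directly with Python's standard colorsys module. The orange color below is arbitrary; the point is that rebuilding the color with the hue and saturation stripped out leaves only a gray defined by its value, which is the information a colorization algorithm must guess back:

```python
import colorsys

# A saturated orange, expressed as RGB in the 0-1 range colorsys expects
r, g, b = 1.0, 0.5, 0.0
h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(f"hue={h:.3f}, saturation={s:.3f}, value={v:.3f}")

# Rebuild the color with hue and saturation zeroed out: only the value
# survives, producing the gray that black-and-white media records
gray = colorsys.hsv_to_rgb(0.0, 0.0, v)
print(gray)  # (1.0, 1.0, 1.0)
```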

The easiest way to use CNN for colorization is to find a library to help you. The Algorithmia site at https://demos.algorithmia.com/colorize-photos/