Data Mining Algorithms is a practical, technically oriented guide that covers the most important algorithms for building classification, regression, and clustering models, as well as techniques used for attribute selection and transformation, model quality evaluation, and creating model ensembles. The author presents many of the important topics and methodologies widely used in data mining, whilst demonstrating the internal operation and usage of data mining algorithms using examples in R.
Page count: 1177
Year of publication: 2014
Title Page
Copyright
Dedication
Acknowledgements
Preface
Data mining
Motivation
Organization
Notation
R code examples
Website
Further readings
References
Part I: Preliminaries
Chapter 1: Tasks
1.1 Introduction
1.2 Inductive learning tasks
1.3 Classification
1.4 Regression
1.5 Clustering
1.6 Practical issues
1.7 Conclusion
1.8 Further readings
References
Chapter 2: Basic statistics
2.1 Introduction
2.2 Notational conventions
2.3 Basic statistics as modeling
2.4 Distribution description
2.5 Relationship detection
2.6 Visualization
2.7 Conclusion
2.8 Further readings
References
Part II: Classification
Chapter 3: Decision trees
3.1 Introduction
3.2 Decision tree model
3.3 Growing
3.4 Pruning
3.5 Prediction
3.6 Weighted instances
3.7 Missing value handling
3.8 Conclusion
3.9 Further readings
References
Chapter 4: Naïve Bayes classifier
4.1 Introduction
4.2 Bayes rule
4.3 Classification by Bayesian inference
4.4 Practical issues
4.5 Conclusion
4.6 Further readings
References
Chapter 5: Linear classification
5.1 Introduction
5.2 Linear representation
5.3 Parameter estimation
5.4 Discrete attributes
5.5 Conclusion
5.6 Further readings
References
Chapter 6: Misclassification costs
6.1 Introduction
6.2 Cost representation
6.3 Incorporating misclassification costs
6.4 Effects of cost incorporation
6.5 Experimental procedure
6.6 Conclusion
6.7 Further readings
References
Chapter 7: Classification model evaluation
7.1 Introduction
7.2 Performance measures
7.3 Evaluation procedures
7.4 Conclusion
7.5 Further readings
References
Part III: Regression
Chapter 8: Linear regression
8.1 Introduction
8.2 Linear representation
8.3 Parameter estimation
8.4 Discrete attributes
8.5 Advantages of linear models
8.6 Beyond linearity
8.7 Conclusion
8.8 Further readings
References
Chapter 9: Regression trees
9.1 Introduction
9.2 Regression tree model
9.3 Growing
9.4 Pruning
9.5 Prediction
9.6 Weighted instances
9.7 Missing value handling
9.8 Piecewise linear regression
9.9 Conclusion
9.10 Further readings
References
Chapter 10: Regression model evaluation
10.1 Introduction
10.2 Performance measures
10.3 Evaluation procedures
10.4 Conclusion
10.5 Further readings
References
Part IV: Clustering
Chapter 11: (Dis)similarity measures
11.1 Introduction
11.2 Measuring dissimilarity and similarity
11.3 Difference-based dissimilarity
11.4 Correlation-based similarity
11.5 Missing attribute values
11.6 Conclusion
11.7 Further readings
References
Chapter 12: k-Centers clustering
12.1 Introduction
12.2 Algorithm scheme
12.3 k-Means
12.4 Beyond means
12.5 Beyond (fixed) k
12.6 Explicit cluster modeling
12.7 Conclusion
12.8 Further readings
References
Chapter 13: Hierarchical clustering
13.1 Introduction
13.2 Cluster hierarchies
13.3 Agglomerative clustering
13.4 Divisive clustering
13.5 Hierarchical clustering visualization
13.6 Hierarchical clustering prediction
13.7 Conclusion
13.8 Further readings
References
Chapter 14: Clustering model evaluation
14.1 Introduction
14.2 Per-cluster quality measures
14.3 Overall quality measures
14.4 External quality measures
14.5 Using quality measures
14.6 Conclusion
14.7 Further readings
References
Part V: Getting Better Models
Chapter 15: Model ensembles
15.1 Introduction
15.2 Model committees
15.3 Base models
15.4 Model aggregation
15.5 Specific ensemble modeling algorithms
15.6 Quality of ensemble predictions
15.7 Conclusion
15.8 Further readings
References
Chapter 16: Kernel methods
16.1 Introduction
16.2 Support vector machines
16.3 Support vector regression
16.4 Kernel trick
16.5 Kernel functions
16.6 Kernel prediction
16.7 Kernel-based algorithms
16.8 Conclusion
16.9 Further readings
References
Chapter 17: Attribute transformation
17.1 Introduction
17.2 Attribute transformation task
17.3 Simple transformations
17.4 Multiclass encoding
17.5 Conclusion
17.6 Further readings
References
Chapter 18: Discretization
18.1 Introduction
18.2 Discretization task
18.3 Unsupervised discretization
18.4 Supervised discretization
18.5 Effects of discretization
18.6 Conclusion
18.7 Further readings
References
Chapter 19: Attribute selection
19.1 Introduction
19.2 Attribute selection task
19.3 Attribute subset search
19.4 Attribute selection filters
19.5 Attribute selection wrappers
19.6 Effects of attribute selection
19.7 Conclusion
19.8 Further readings
References
Chapter 20: Case studies
20.1 Introduction
20.2 Census income
20.3 Communities and crime
20.4 Cover type
20.5 Conclusion
20.6 Further readings
References
Closing
Retrospecting
Final words
A: Notation
A.1 Attribute values
A.2 Data subsets
A.3 Probabilities
B: R packages
B.1 CRAN packages
B.2 DMR packages
B.3 Installing packages
References
C: Datasets
Index
End User License Agreement
Figure 2.1
Figure 2.2
Figure 2.3
Figure 2.4
Figure 3.1
Figure 3.2
Figure 5.1
Figure 5.2
Figure 5.3
Figure 5.4
Figure 6.1
Figure 6.2
Figure 6.3
Figure 7.1
Figure 7.2
Figure 7.3
Figure 7.4
Figure 7.5
Figure 7.6
Figure 7.7
Figure 7.8
Figure 7.9
Figure 7.10
Figure 9.1
Figure 10.1
Figure 10.2
Figure 12.1
Figure 12.2
Figure 12.3
Figure 12.4
Figure 13.1
Figure 13.2
Figure 14.1
Figure 14.2
Figure 15.1
Figure 15.2
Figure 15.3
Figure 15.4
Figure 15.5
Figure 15.6
Figure 15.7
Figure 15.8
Figure 16.1
Figure 16.2
Figure 16.3
Figure 16.4
Figure 16.5
Figure 16.6
Figure 17.1
Figure 18.1
Figure 18.2
Figure 19.1
Figure 19.2
Figure 19.3
Figure 19.4
Figure 19.5
Figure 19.6
Figure 20.1
Figure 20.2
Figure 20.3
Figure 20.4
Figure 20.5
Figure 20.6
Figure 20.7
Figure 20.8
Figure 20.9
Figure 20.10
Figure 20.11
Figure 20.12
Figure 20.13
Figure 20.14
Figure 20.15
Figure 20.16
Figure 20.17
Figure 20.18
Figure 20.19
Figure 20.20
Figure 20.21
Figure 20.22
Figure 20.23
Table 6.1
Table 7.1
Table 7.2
Table 7.3
Table 7.4
Table 7.5
Table 17.1
Paweł Cichosz
Department of Electronics and Information Technology
Warsaw University of Technology
Poland
This edition first published 2015
© 2015 by John Wiley & Sons, Ltd
Registered office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Cichosz, Pawel, author.
Data mining algorithms : explained using R / Pawel Cichosz.
pages cm
Summary: “This book narrows down the scope of data mining by adopting a heavily modeling-oriented perspective” –Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-33258-0 (hardback)
1. Data mining. 2. Computer algorithms. 3. R (Computer program language) I. Title.
QA76.9.D343C472 2015
006.3′12–dc23
2014036992
A catalogue record for this book is available from the British Library.
ISBN: 9781118332580
To my wife, Joanna, and my sons, Grzegorz and Łukasz
With the rise and rapidly growing popularity of online idea sharing methods, such as blogs and wikis, traditional books are no longer the only way of making large portions of text available to a wide audience. The former are particularly suitable for collaborative or social writing and reading undertakings, often with mixed reader–writer roles of particular participants. For individual writing and reading efforts the traditional book form (although not necessarily tied to the paper medium) still remains the best approach. On one hand, it clearly assigns full and exclusive responsibility for the contents to the author, with no easy excuses for errors and other deficiencies. On the other hand, there are several other people engaged in the publishing process who help to give the book its final shape and protect the audience against a totally flawed work.
As the author of this book, I do indeed feel totally responsible for all its imperfections. I am aware of only some of them, and I have no doubt that there are many more. That being said, several people from the editorial and production team worked hard to make the imperfect outcome of my work worth publishing. My thanks go, in particular, to Richard Davies, Prachi Sinha Sahay, Debbie Jupe, and Kay Heather from Wiley for their encouragement, support, understanding, and reassuring professionalism at all stages of writing and production. Radhika Sivalingam, Lincy Priya, and Yogesh Kukshal did their best to transform my manuscript into a real book that meets publication standards. I believe there are others who contributed to this book's production of whom I am not even aware, and I am grateful to them all as well.
I was thoughtless enough to share my intention to write this book with my colleagues from the Artificial Intelligence Applications Research Group at the Warsaw University of Technology. While their warm reception of this idea and constant words of encouragement were extremely helpful, many times I wished I had not done that. It would have been so much easier to give up if I had kept this secret. Perhaps the ultimate reason why I continued to work despite hesitations is that I knew they would keep asking and I would be unable to find a good excuse. Several thoughts expressed in this book were shaped by discussions during our group's seminar meetings. Interacting with my colleagues from the analytics teams at Netezza Poland, IBM Poland, and iQor Poland, with whom I had an opportunity to work on some data mining projects at different stages of writing the book, was also extremely stimulating, although the contents of the book bear no relation to the projects I was involved in.
I owe special thanks to my wife and two sons, who did not directly contribute to the contents of this book, but made it possible by allowing me to spend on this work much of the time that should normally be devoted to them, and by providing constant encouragement. If you guys can read these thanks in a published copy of the book, then it means it is all over at last and we will hopefully get back to normal life.
Data mining has been a rapidly growing field of research and practical applications during the last two decades. From a somewhat niche academic area at the intersection of machine learning and statistics it has developed into an established scientific discipline and a highly valued branch of the computing industry. This is reflected by data mining becoming an essential part of computer science education as well as the increasing overall awareness of the term “data mining” among the general (not just computing-related) academic and business audience.
Various definitions of data mining may be found in the literature. Some of them are broad enough to include all types of data analysis, regardless of the representation and applicability of their results. This book narrows down the scope of data mining by adopting a heavily modeling-oriented perspective. According to this perspective the ultimate goal of data mining is delivering predictive models. The latter can be thought of as computationally represented chunks of knowledge about some domain of interest, described by the analyzed data, that are capable of providing answers to queries transcending the data, i.e., queries that cannot be answered by just extracting and aggregating values from the data. Such knowledge is discovered from data by capturing and generalizing useful relationship patterns that occur therein.
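The distinction between aggregating stored values and answering a query that transcends the data can be illustrated with a minimal sketch. It is not taken from the book, which presents its examples in R; the toy dataset and all function names below are hypothetical and chosen only for illustration. The first function answers a question by aggregating existing values, while the fitted model answers a question about an input that never occurs in the data.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration: (floor area in m^2, monthly rent).
data = [(30, 600), (45, 900), (60, 1200), (80, 1600)]


def aggregate_mean_rent(rows):
    """Answer a query by extracting and aggregating stored values."""
    return mean(rent for _, rent in rows)


def fit_line(rows):
    """Fit rent = slope * area + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, area):
    """Answer a query about an input that need not appear in the data."""
    slope, intercept = model
    return slope * area + intercept


model = fit_line(data)
print("mean rent over stored rows:", aggregate_mean_rent(data))
print("predicted rent for an unseen 70 m^2 flat:", predict(model, 70))
```

No stored row has an area of 70, so the second answer cannot be obtained by selecting or aggregating rows; it comes from the relationship the model has generalized from the data.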
Activities needed for creating predictive models based on data and making sure that they meet the application's requirements fall within the scope of data mining as understood in this book. Analytical activities that do not contribute to model creation, although they may still deliver extremely useful results, therefore remain beyond the scope of our interest. This still leaves a lot of potential content to be covered, including not only modeling algorithms, but also techniques for evaluating the quality of predictive models, transforming data to make modeling algorithms easier to apply or more likely to succeed, selecting attributes most useful for model creation, and combining multiple models for better predictions.
Continue reading in the full edition!
