Let Hadoop For Dummies help you harness the power of your data and rein in the information overload.
Big data has become big business, and companies and organizations of all sizes are struggling to find ways to retrieve valuable information from their massive data sets without becoming overwhelmed. Enter Hadoop and this easy-to-understand For Dummies guide. Hadoop For Dummies helps readers understand the value of big data, make a business case for using Hadoop, navigate the Hadoop ecosystem, and build and manage Hadoop applications and clusters.
From programmers challenged with building and maintaining affordable, scalable data systems to administrators who must deal with huge volumes of information effectively and efficiently, this how-to guide has something to help you with Hadoop.
Page count: 600
Year of publication: 2014
Hadoop® For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2014 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. Hadoop is a registered trademark of the Apache Software Foundation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2013954209
ISBN: 978-1-118-60755-8 (pbk); ISBN 978-1-118-65220-6 (ebk); ISBN 978-1-118-70503-2 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
Table of Contents
Introduction
About This Book
Foolish Assumptions
How This Book Is Organized
Part I: Getting Started with Hadoop
Part II: How Hadoop Works
Part III: Hadoop and Structured Data
Part IV: Administering and Configuring Hadoop
Part V: The Part of Tens: Getting More Out of Your Hadoop Cluster
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part I: Getting Started with Hadoop
Chapter 1: Introducing Hadoop and Seeing What It’s Good For
Big Data and the Need for Hadoop
Exploding data volumes
Varying data structures
A playground for data scientists
The Origin and Design of Hadoop
Distributed processing with MapReduce
Apache Hadoop ecosystem
Examining the Various Hadoop Offerings
Comparing distributions
Working with in-database MapReduce
Looking at the Hadoop toolbox
Chapter 2: Common Use Cases for Big Data in Hadoop
The Keys to Successfully Adopting Hadoop (Or, “Please, Can We Keep Him?”)
Log Data Analysis
Data Warehouse Modernization
Fraud Detection
Risk Modeling
Social Sentiment Analysis
Image Classification
Graph Analysis
To Infinity and Beyond
Chapter 3: Setting Up Your Hadoop Environment
Choosing a Hadoop Distribution
Choosing a Hadoop Cluster Architecture
Pseudo-distributed mode (single node)
Fully distributed mode (a cluster of nodes)
The Hadoop For Dummies Environment
The Hadoop For Dummies distribution: Apache Bigtop
Setting up the Hadoop For Dummies environment
The Hadoop For Dummies Sample Data Set: Airline on-time performance
Your First Hadoop Program: Hello Hadoop!
Part II: How Hadoop Works
Chapter 4: Storing Data in Hadoop: The Hadoop Distributed File System
Data Storage in HDFS
Taking a closer look at data blocks
Replicating data blocks
Slave node and disk failures
Sketching Out the HDFS Architecture
Looking at slave nodes
Keeping track of data blocks with NameNode
Checkpointing updates
HDFS Federation
HDFS High Availability
Chapter 5: Reading and Writing Data
Compressing Data
Managing Files with the Hadoop File System Commands
Ingesting Log Data with Flume
Chapter 6: MapReduce Programming
Thinking in Parallel
Seeing the Importance of MapReduce
Doing Things in Parallel: Breaking Big Problems into Many Bite-Size Pieces
Looking at MapReduce application flow
Understanding input splits
Seeing how key/value pairs fit into the MapReduce application flow
Writing MapReduce Applications
Getting Your Feet Wet: Writing a Simple MapReduce Application
The FlightsByCarrier driver application
The FlightsByCarrier mapper
The FlightsByCarrier reducer
Running the FlightsByCarrier application
Chapter 7: Frameworks for Processing Data in Hadoop: YARN and MapReduce
Running Applications Before Hadoop 2
Tracking JobTracker
Tracking TaskTracker
Launching a MapReduce application
Seeing a World beyond MapReduce
Scouting out the YARN architecture
Launching a YARN-based application
Real-Time and Streaming Applications
Chapter 8: Pig: Hadoop Programming Made Easier
Admiring the Pig Architecture
Going with the Pig Latin Application Flow
Working through the ABCs of Pig Latin
Uncovering Pig Latin structures
Looking at Pig data types and syntax
Evaluating Local and Distributed Modes of Running Pig scripts
Checking Out the Pig Script Interfaces
Scripting with Pig Latin
Chapter 9: Statistical Analysis in Hadoop
Pumping Up Your Statistical Analysis
The limitations of sampling
Factors that increase the scale of statistical analysis
Running statistical models in MapReduce
Machine Learning with Mahout
Collaborative filtering
Clustering
Classifications
R on Hadoop
The R language
Hadoop Integration with R
Chapter 10: Developing and Scheduling Application Workflows with Oozie
Getting Oozie in Place
Developing and Running an Oozie Workflow
Writing Oozie workflow definitions
Configuring Oozie workflows
Running Oozie workflows
Scheduling and Coordinating Oozie Workflows
Time-based scheduling for Oozie coordinator jobs
Time and data availability-based scheduling for Oozie coordinator jobs
Running Oozie coordinator jobs
Part III: Hadoop and Structured Data
Chapter 11: Hadoop and the Data Warehouse: Friends or Foes?
Comparing and Contrasting Hadoop with Relational Databases
NoSQL data stores
ACID versus BASE data stores
Structured data storage and processing in Hadoop
Modernizing the Warehouse with Hadoop
The landing zone
A queryable archive of cold warehouse data
Hadoop as a data preprocessing engine
Data discovery and sandboxes
Chapter 12: Extremely Big Tables: Storing Data in HBase
Say Hello to HBase
Sparse
It’s distributed and persistent
It has a multidimensional sorted map
Understanding the HBase Data Model
Understanding the HBase Architecture
RegionServers
MasterServer
Zookeeper and HBase reliability
Taking HBase for a Test Run
Creating a table
Working with Zookeeper
Getting Things Done with HBase
Working with an HBase Java API client example
HBase and the RDBMS world
Knowing when HBase makes sense for you
ACID Properties in HBase
Transitioning from an RDBMS model to HBase
Deploying and Tuning HBase
Hardware requirements
Deployment considerations
Tuning prerequisites
Understanding your data access patterns
Pre-splitting your regions
The importance of row key design
Tuning major compactions
Chapter 13: Applying Structure to Hadoop Data with Hive
Saying Hello to Hive
Seeing How the Hive is Put Together
Getting Started with Apache Hive
Examining the Hive Clients
The Hive CLI client
The web browser as Hive client
SQuirreL as Hive client with the JDBC Driver
Working with Hive Data Types
Creating and Managing Databases and Tables
Managing Hive databases
Creating and managing tables with Hive
Seeing How the Hive Data Manipulation Language Works
LOAD DATA examples
INSERT examples
Create Table As Select (CTAS) examples
Querying and Analyzing Data
Joining tables with Hive
Improving your Hive queries with indexes
Windowing in HiveQL
Other key HiveQL features
Chapter 14: Integrating Hadoop with Relational Databases Using Sqoop
The Principles of Sqoop Design
Scooping Up Data with Sqoop
Connectors and Drivers
Importing Data with Sqoop
Importing data into HDFS
Importing data into Hive
Importing data into HBase
Importing incrementally
Benefiting from additional Sqoop import features
Sending Data Elsewhere with Sqoop
Exporting data from HDFS
Sqoop exports using the Insert approach
Sqoop exports using the Update and Update Insert approach
Sqoop exports using call stored procedures
Sqoop exports and transactions
Looking at Your Sqoop Input and Output Formatting Options
Getting down to brass tacks: An example of output line-formatting and input-parsing
Sqoop 2.0 Preview
Chapter 15: The Holy Grail: Native SQL Access to Hadoop Data
SQL’s Importance for Hadoop
Looking at What SQL Access Actually Means
SQL Access and Apache Hive
Solutions Inspired by Google Dremel
Apache Drill
Cloudera Impala
IBM Big SQL
Pivotal HAWQ
Hadapt
The SQL Access Big Picture
Part IV: Administering and Configuring Hadoop
Chapter 16: Deploying Hadoop
Working with Hadoop Cluster Components
Rack considerations
Master nodes
Slave nodes
Edge nodes
Networking
Hadoop Cluster Configurations
Small
Medium
Large
Alternate Deployment Form Factors
Virtualized servers
Cloud deployments
Sizing Your Hadoop Cluster
Chapter 17: Administering Your Hadoop Cluster
Achieving Balance: A Big Factor in Cluster Health
Mastering the Hadoop Administration Commands
Understanding Factors for Performance
Hardware
MapReduce
Benchmarking
Tolerating Faults and Data Reliability
Putting Apache Hadoop’s Capacity Scheduler to Good Use
Setting Security: The Kerberos Protocol
Expanding Your Toolset Options
Hue
Ambari
The Hadoop shell
Basic Hadoop Configuration Details
Part V: The Part of Tens
Chapter 18: Ten Hadoop Resources Worthy of a Bookmark
Central Nervous System: Apache.org
Tweet This
Hortonworks University
Cloudera University
BigDataUniversity.com
Planet Big Data Blog Aggregator
Quora’s Apache Hadoop Forum
The IBM Big Data Hub
Conferences Not to Be Missed
The Google Papers That Started It All
The Bonus Resource: What Did We Ever Do B.G.?
Chapter 19: Ten Reasons to Adopt Hadoop
Hadoop Is Relatively Inexpensive
Hadoop Has an Active Open Source Community
Hadoop Is Being Widely Adopted in Every Industry
Hadoop Can Easily Scale Out As Your Data Grows
Traditional Tools Are Integrating with Hadoop
Hadoop Can Store Data in Any Format
Hadoop Is Designed to Run Complex Analytics
Hadoop Can Process a Full Data Set (As Opposed to Sampling)
Hardware Is Being Optimized for Hadoop
Hadoop Can Increasingly Handle Flexible Workloads (No Longer Just Batch)
About the Authors
Cheat Sheet
More Dummies Products
Table of Contents
Begin Reading
Welcome to Hadoop For Dummies! Hadoop is an exciting technology, and this book will help you cut through the hype and wrap your head around what it’s good for and how it works. We’ve included examples and plenty of practical advice so you can get started with your own Hadoop cluster.
In our own Hadoop learning activities, we’re constantly struck by how little beginner-level content is available. For almost any topic, we see two things: high-level marketing blurbs with pretty pictures, and dense, low-level, narrowly focused descriptions. What’s missing are solid entry-level explanations that add substance to the marketing fluff and help someone with little or no background knowledge bridge the gap to the more advanced material. Every chapter in this book was written with this goal in mind: to clearly explain the chapter’s concept, explain why it’s significant in the Hadoop universe, and show how you can get started with it.
No matter how much (or how little) you know about Hadoop, getting started with the technology is not exactly easy, for a number of reasons. In addition to the lack of entry-level content, the rapid pace of change in the Hadoop ecosystem makes it difficult to keep on top of standards. We find that most discussions on Hadoop either cover the older interfaces and are never updated, or cover the newer interfaces with little insight into how to bridge the gap from the old technology. In this book, we’ve taken care to describe the current interfaces, but we also discuss previous standards, which are still commonly used in environments where some of the older interfaces are entrenched.
About This Book
Here are a few things to keep in mind as you read this book:
Bold text means that you’re meant to type the text just as it appears in the book. The exception is when you’re working through a steps list: Because each step is bold, the text to type is not bold.
Web addresses and programming code appear in monofont. If you’re reading a digital version of this book on a device connected to the Internet, note that you can click the web address to visit that website, like this: www.dummies.com
Foolish Assumptions
We’ve written this book so that anyone with a basic understanding of computers and IT can learn about Hadoop. But that said, some experience with databases, programming, and working with Linux would be helpful.
There are some parts of this book that require deeper skills, like the Java coverage in Chapter 6 on MapReduce; but if you haven’t programmed in Java before, don’t worry. The explanations of how MapReduce works don’t require you to be a Java programmer. The Java code is there for people who want to try writing their own MapReduce applications. In Part III, a database background would certainly help you understand the significance of the various Hadoop components you can use to integrate with existing databases and work with relational data. But again, we’ve included a lot of background to help provide context for the Hadoop concepts we’re describing.
How This Book Is Organized
This book is composed of five parts, with each part telling a major chunk of the Hadoop story. Every part and every chapter was written to be a self-contained unit, so you can pick and choose whatever you want to concentrate on. Because many Hadoop concepts are intertwined, we’ve taken care to refer to whatever background concepts you might need so you can catch up from other chapters, if needed. To give you an idea of the book’s layout, here are the parts of the book and what they’re about:
Part I: Getting Started with Hadoop
As the beginning of the book, this part gives a rundown of Hadoop and its ecosystem and the most common ways Hadoop’s being used. We also show you how you can set up your own Hadoop environment and run the example code we’ve included in this book.
Part II: How Hadoop Works
This is the meat of the book, with lots of coverage designed to help you understand the nuts and bolts of Hadoop. We explain the storage and processing architecture, and also how you can write your own applications.
Part III: Hadoop and Structured Data
How Hadoop deals with structured data is arguably the most important debate happening in the Hadoop community today. There are many competing SQL-on-Hadoop technologies, which we survey, but we also take a deep look at the more established Hadoop community projects dedicated to structured data: HBase, Hive, and Sqoop.
Part IV: Administering and Configuring Hadoop
When you’re ready to get down to brass tacks and deploy a cluster, this part is a great starting point. Hadoop clusters sink or swim depending on how they’re configured and deployed, and we’ve got loads of experience-based advice here.
Part V: The Part of Tens: Getting More Out of Your Hadoop Cluster
To cap off the book, we’ve given you a list of additional places where you can bone up on your Hadoop skills. We’ve also provided an additional set of reasons to adopt Hadoop, just in case you weren’t convinced already.
Icons Used in This Book
The Tip icon marks tips (duh!) and shortcuts that you can use to make working with Hadoop easier.
Remember icons mark the information that’s especially important to know. To siphon off the most important information in each chapter, just skim through these icons.
The Technical Stuff icon marks information of a highly technical nature that you can normally skip over.
The Warning icon tells you to watch out! It marks important information that may save you headaches.
Beyond the Book
We have written a lot of extra content that you won’t find in this book. Go online to find the following:
The Cheat Sheet for this book is at www.dummies.com/cheatsheet/hadoop
Here you’ll find quick references for useful Hadoop information that we’ve brought together and keep up to date: a handy list of the most common Hadoop commands and their syntax, a map of the various Hadoop ecosystem components and what they’re good for, and listings of the various Hadoop distributions available on the market and their unique offerings. Because the Hadoop ecosystem is continually evolving, we’ve also got instructions on how to set up the Hadoop For Dummies environment with the newest production-ready versions of Hadoop and its components.
Updates to this book, if we have any, are at www.dummies.com/extras/hadoop
Code samples used in this book are also at www.dummies.com/extras/hadoop
All the code samples in this book are posted to the website in Zip format; just download and unzip them and they’re ready to use with the Hadoop For Dummies environment described in Chapter 3. The Zip files, which are named according to chapter, contain one or more files. Some files have application code (Java, Pig, and Hive) and others have a series of commands or scripts. (Refer to the downloadable Read Me file for a detailed description of the files.) Note that not all chapters have associated code sample files.
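If you’d rather script the download than click through, here’s a minimal sketch (in Python, using only the standard library). Note that the zip file name in it is hypothetical; substitute a real chapter file name from www.dummies.com/extras/hadoop.

# Minimal sketch: fetch one chapter's sample-code archive and unpack it.
# The file name below is hypothetical -- use a real one listed at
# www.dummies.com/extras/hadoop.
import urllib.request
import zipfile

url = "http://www.dummies.com/extras/hadoop/chapter06-samples.zip"  # hypothetical
archive_path = "chapter06-samples.zip"

urllib.request.urlretrieve(url, archive_path)    # download the archive
with zipfile.ZipFile(archive_path) as archive:
    archive.extractall("hadoop-fd-samples")      # unpack into a working folder
    print("Extracted:", archive.namelist())      # list the unpacked files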
Where to Go from Here
If you’re starting from scratch with Hadoop, we recommend you start at the beginning and truck your way on through the whole book. But Hadoop does a lot of different things, so if you come to a chapter or section that covers an area you won’t be using, feel free to skip it. Or if you’re not a total newbie, you can bypass the parts you’re familiar with. We wrote this book so that you can dive in anywhere.
If you’re a selective reader and you just want to try out the examples in the book, we strongly recommend looking at Chapter 3. It’s here that we describe how to set up your own Hadoop environment in a Virtual Machine (VM) that you can run on your own computer. All the examples and code samples were tested using this environment, and we’ve laid out all the steps you need to download, install, and configure Hadoop.
Part I
Getting Started with Hadoop
Visit www.dummies.com for great Dummies content online.
In this part …
See what makes Hadoop-sense — and what doesn’t.
Look at what Hadoop is doing to raise productivity in the real world.
See what’s involved in setting up a Hadoop environment.
Chapter 1
Introducing Hadoop and Seeing What It’s Good For
In This Chapter
Seeing how Hadoop fills a need
Digging (a bit) into Hadoop’s history
Getting Hadoop for yourself
Looking at Hadoop application offerings
Organizations are flooded with data. Not only that, but in an era of incredibly cheap storage where everyone and everything are interconnected, the nature of the data we’re collecting is also changing. For many businesses, their critical data used to be limited to their transactional databases and data warehouses. In these kinds of systems, data was organized into orderly rows and columns, where every byte of information was well understood in terms of its nature and its business value. These databases and warehouses are still extremely important, but businesses are now differentiating themselves by how they’re finding value in the large volumes of data that aren’t stored in a tidy database.
Read on in the full edition!
