Let Hadoop For Dummies help you harness the power of your data and rein in the information overload.
Big data has become big business, and companies and organizations of all sizes are struggling to find ways to retrieve valuable information from their massive data sets without becoming overwhelmed. Enter Hadoop and this easy-to-understand For Dummies guide. Hadoop For Dummies helps readers understand the value of big data, make a business case for using Hadoop, navigate the Hadoop ecosystem, and build and manage Hadoop applications and clusters.
From programmers challenged with building and maintaining affordable, scalable data systems to administrators who must deal with huge volumes of information effectively and efficiently, this how-to guide has something to help you with Hadoop.
Page count: 600
Year of publication: 2014
Hadoop® For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2014 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. Hadoop is a registered trademark of the Apache Software Foundation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2013954209
ISBN: 978-1-118-60755-8 (pbk); ISBN 978-1-118-65220-6 (ebk); ISBN 978-1-118-70503-2 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
Table of Contents
Introduction
About This Book
Foolish Assumptions
How This Book Is Organized
Part I: Getting Started with Hadoop
Part II: How Hadoop Works
Part III: Hadoop and Structured Data
Part IV: Administering and Configuring Hadoop
Part V: The Part of Tens: Getting More Out of Your Hadoop Cluster
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part I: Getting Started with Hadoop
Chapter 1: Introducing Hadoop and Seeing What It’s Good For
Big Data and the Need for Hadoop
Exploding data volumes
Varying data structures
A playground for data scientists
The Origin and Design of Hadoop
Distributed processing with MapReduce
Apache Hadoop ecosystem
Examining the Various Hadoop Offerings
Comparing distributions
Working with in-database MapReduce
Looking at the Hadoop toolbox
Chapter 2: Common Use Cases for Big Data in Hadoop
The Keys to Successfully Adopting Hadoop (Or, “Please, Can We Keep Him?”)
Log Data Analysis
Data Warehouse Modernization
Fraud Detection
Risk Modeling
Social Sentiment Analysis
Image Classification
Graph Analysis
To Infinity and Beyond
Chapter 3: Setting Up Your Hadoop Environment
Choosing a Hadoop Distribution
Choosing a Hadoop Cluster Architecture
Pseudo-distributed mode (single node)
Fully distributed mode (a cluster of nodes)
The Hadoop For Dummies Environment
The Hadoop For Dummies distribution: Apache Bigtop
Setting up the Hadoop For Dummies environment
The Hadoop For Dummies Sample Data Set: Airline on-time performance
Your First Hadoop Program: Hello Hadoop!
Part II: How Hadoop Works
Chapter 4: Storing Data in Hadoop: The Hadoop Distributed File System
Data Storage in HDFS
Taking a closer look at data blocks
Replicating data blocks
Slave node and disk failures
Sketching Out the HDFS Architecture
Looking at slave nodes
Keeping track of data blocks with NameNode
Checkpointing updates
HDFS Federation
HDFS High Availability
Chapter 5: Reading and Writing Data
Compressing Data
Managing Files with the Hadoop File System Commands
Ingesting Log Data with Flume
Chapter 6: MapReduce Programming
Thinking in Parallel
Seeing the Importance of MapReduce
Doing Things in Parallel: Breaking Big Problems into Many Bite-Size Pieces
Looking at MapReduce application flow
Understanding input splits
Seeing how key/value pairs fit into the MapReduce application flow
Writing MapReduce Applications
Getting Your Feet Wet: Writing a Simple MapReduce Application
The FlightsByCarrier driver application
The FlightsByCarrier mapper
The FlightsByCarrier reducer
Running the FlightsByCarrier application
Chapter 7: Frameworks for Processing Data in Hadoop: YARN and MapReduce
Running Applications Before Hadoop 2
Tracking JobTracker
Tracking TaskTracker
Launching a MapReduce application
Seeing a World beyond MapReduce
Scouting out the YARN architecture
Launching a YARN-based application
Real-Time and Streaming Applications
Chapter 8: Pig: Hadoop Programming Made Easier
Admiring the Pig Architecture
Going with the Pig Latin Application Flow
Working through the ABCs of Pig Latin
Uncovering Pig Latin structures
Looking at Pig data types and syntax
Evaluating Local and Distributed Modes of Running Pig scripts
Checking Out the Pig Script Interfaces
Scripting with Pig Latin
Chapter 9: Statistical Analysis in Hadoop
Pumping Up Your Statistical Analysis
The limitations of sampling
Factors that increase the scale of statistical analysis
Running statistical models in MapReduce
Machine Learning with Mahout
Collaborative filtering
Clustering
Classifications
R on Hadoop
The R language
Hadoop Integration with R
Chapter 10: Developing and Scheduling Application Workflows with Oozie
Getting Oozie in Place
Developing and Running an Oozie Workflow
Writing Oozie workflow definitions
Configuring Oozie workflows
Running Oozie workflows
Scheduling and Coordinating Oozie Workflows
Time-based scheduling for Oozie coordinator jobs
Time and data availability-based scheduling for Oozie coordinator jobs
Running Oozie coordinator jobs
Part III: Hadoop and Structured Data
Chapter 11: Hadoop and the Data Warehouse: Friends or Foes?
Comparing and Contrasting Hadoop with Relational Databases
NoSQL data stores
ACID versus BASE data stores
Structured data storage and processing in Hadoop
Modernizing the Warehouse with Hadoop
The landing zone
A queryable archive of cold warehouse data
Hadoop as a data preprocessing engine
Data discovery and sandboxes
Chapter 12: Extremely Big Tables: Storing Data in HBase
Say Hello to HBase
Sparse
It’s distributed and persistent
It has a multidimensional sorted map
Understanding the HBase Data Model
Understanding the HBase Architecture
RegionServers
MasterServer
Zookeeper and HBase reliability
Taking HBase for a Test Run
Creating a table
Working with Zookeeper
Getting Things Done with HBase
Working with an HBase Java API client example
HBase and the RDBMS world
Knowing when HBase makes sense for you
ACID Properties in HBase
Transitioning from an RDBMS model to HBase
Deploying and Tuning HBase
Hardware requirements
Deployment considerations
Tuning prerequisites
Understanding your data access patterns
Pre-splitting your regions
The importance of row key design
Tuning major compactions
Chapter 13: Applying Structure to Hadoop Data with Hive
Saying Hello to Hive
Seeing How the Hive is Put Together
Getting Started with Apache Hive
Examining the Hive Clients
The Hive CLI client
The web browser as Hive client
SQuirreL as Hive client with the JDBC Driver
Working with Hive Data Types
Creating and Managing Databases and Tables
Managing Hive databases
Creating and managing tables with Hive
Seeing How the Hive Data Manipulation Language Works
LOAD DATA examples
INSERT examples
Create Table As Select (CTAS) examples
Querying and Analyzing Data
Joining tables with Hive
Improving your Hive queries with indexes
Windowing in HiveQL
Other key HiveQL features
Chapter 14: Integrating Hadoop with Relational Databases Using Sqoop
The Principles of Sqoop Design
Scooping Up Data with Sqoop
Connectors and Drivers
Importing Data with Sqoop
Importing data into HDFS
Importing data into Hive
Importing data into HBase
Importing incrementally
Benefiting from additional Sqoop import features
Sending Data Elsewhere with Sqoop
Exporting data from HDFS
Sqoop exports using the Insert approach
Sqoop exports using the Update and Update Insert approach
Sqoop exports using call stored procedures
Sqoop exports and transactions
Looking at Your Sqoop Input and Output Formatting Options
Getting down to brass tacks: An example of output line-formatting and input-parsing
Sqoop 2.0 Preview
Chapter 15: The Holy Grail: Native SQL Access to Hadoop Data
SQL’s Importance for Hadoop
Looking at What SQL Access Actually Means
SQL Access and Apache Hive
Solutions Inspired by Google Dremel
Apache Drill
Cloudera Impala
IBM Big SQL
Pivotal HAWQ
Hadapt
The SQL Access Big Picture
Part IV: Administering and Configuring Hadoop
Chapter 16: Deploying Hadoop
Working with Hadoop Cluster Components
Rack considerations
Master nodes
Slave nodes
Edge nodes
Networking
Hadoop Cluster Configurations
Small
Medium
Large
Alternate Deployment Form Factors
Virtualized servers
Cloud deployments
Sizing Your Hadoop Cluster
Chapter 17: Administering Your Hadoop Cluster
Achieving Balance: A Big Factor in Cluster Health
Mastering the Hadoop Administration Commands
Understanding Factors for Performance
Hardware
MapReduce
Benchmarking
Tolerating Faults and Data Reliability
Putting Apache Hadoop’s Capacity Scheduler to Good Use
Setting Security: The Kerberos Protocol
Expanding Your Toolset Options
Hue
Ambari
The Hadoop shell
Basic Hadoop Configuration Details
Part V: The Part of Tens
Chapter 18: Ten Hadoop Resources Worthy of a Bookmark
Central Nervous System: Apache.org
Tweet This
Hortonworks University
Cloudera University
BigDataUniversity.com
Planet Big Data Blog Aggregator
Quora’s Apache Hadoop Forum
The IBM Big Data Hub
Conferences Not to Be Missed
The Google Papers That Started It All
The Bonus Resource: What Did We Ever Do B.G.?
Chapter 19: Ten Reasons to Adopt Hadoop
Hadoop Is Relatively Inexpensive
Hadoop Has an Active Open Source Community
Hadoop Is Being Widely Adopted in Every Industry
Hadoop Can Easily Scale Out As Your Data Grows
Traditional Tools Are Integrating with Hadoop
Hadoop Can Store Data in Any Format
Hadoop Is Designed to Run Complex Analytics
Hadoop Can Process a Full Data Set (As Opposed to Sampling)
Hardware Is Being Optimized for Hadoop
Hadoop Can Increasingly Handle Flexible Workloads (No Longer Just Batch)
About the Authors
Cheat Sheet
More Dummies Products
Table of Contents
Begin Reading
Welcome to Hadoop For Dummies! Hadoop is an exciting technology, and this book will help you cut through the hype and wrap your head around what it’s good for and how it works. We’ve included examples and plenty of practical advice so you can get started with your own Hadoop cluster.
In our own Hadoop learning activities, we’re constantly struck by how little beginner-level content is available. For almost any topic, we see two things: high-level marketing blurbs with pretty pictures, and dense, low-level, narrowly focused descriptions. What’s missing are solid entry-level explanations that add substance to the marketing fluff and help someone with little or no background knowledge bridge the gap to the more advanced material. Every chapter in this book was written with this goal in mind: to clearly explain the chapter’s concept, explain why it’s significant in the Hadoop universe, and show how you can get started with it.
No matter how much (or how little) you know about Hadoop, getting started with the technology is not exactly easy, for a number of reasons. In addition to the lack of entry-level content, the rapid pace of change in the Hadoop ecosystem makes it difficult to keep on top of standards. We find that most discussions on Hadoop either cover the older interfaces and are never updated, or cover the newer interfaces with little insight into how to bridge the gap from the old technology. In this book, we’ve taken care to describe the current interfaces, but we also discuss previous standards, which are still commonly used in environments where some of the older interfaces are entrenched.
About This Book
Here are a few things to keep in mind as you read this book:
Bold text means that you’re meant to type the text just as it appears in the book. The exception is when you’re working through a steps list: Because each step is bold, the text to type is not bold.
Web addresses and programming code appear in monofont. If you’re reading a digital version of this book on a device connected to the Internet, note that you can click the web address to visit that website, like this: www.dummies.com
Foolish Assumptions
We’ve written this book so that anyone with a basic understanding of computers and IT can learn about Hadoop. But that said, some experience with databases, programming, and working with Linux would be helpful.
There are some parts of this book that require deeper skills, like the Java coverage in Chapter 6 on MapReduce; but if you haven’t programmed in Java before, don’t worry. The explanations of how MapReduce works don’t require you to be a Java programmer. The Java code is there for people who want to try writing their own MapReduce applications. In Part III, a database background would certainly help you understand the significance of the various Hadoop components you can use to integrate with existing databases and work with relational data. But again, we’ve included a lot of background to help provide context for the Hadoop concepts we’re describing.
How This Book Is Organized
This book is composed of five parts, with each part telling a major chunk of the Hadoop story. Every part and every chapter was written to be a self-contained unit, so you can pick and choose whatever you want to concentrate on. Because many Hadoop concepts are intertwined, we’ve taken care to refer to whatever background concepts you might need so you can catch up from other chapters, if needed. To give you an idea of the book’s layout, here are the parts of the book and what they’re about:
Part I: Getting Started with Hadoop
As the beginning of the book, this part gives a rundown of Hadoop and its ecosystem and the most common ways Hadoop’s being used. We also show you how you can set up your own Hadoop environment and run the example code we’ve included in this book.
Part II: How Hadoop Works
This is the meat of the book, with lots of coverage designed to help you understand the nuts and bolts of Hadoop. We explain the storage and processing architecture, and also how you can write your own applications.
Part III: Hadoop and Structured Data
How Hadoop deals with structured data is arguably the most important debate happening in the Hadoop community today. There are many competing SQL-on-Hadoop technologies, which we survey, but we also take a deep look at the more established Hadoop community projects dedicated to structured data: HBase, Hive, and Sqoop.
Part IV: Administering and Configuring Hadoop
When you’re ready to get down to brass tacks and deploy a cluster, this part is a great starting point. Hadoop clusters sink or swim depending on how they’re configured and deployed, and we’ve got loads of experience-based advice here.
Part V: The Part of Tens: Getting More Out of Your Hadoop Cluster
To cap off the book, we’ve given you a list of additional places where you can bone up on your Hadoop skills. We’ve also provided an additional set of reasons to adopt Hadoop, just in case you weren’t convinced already.
Icons Used in This Book
The Tip icon marks tips (duh!) and shortcuts that you can use to make working with Hadoop easier.
Remember icons mark the information that’s especially important to know. To siphon off the most important information in each chapter, just skim through these icons.
The Technical Stuff icon marks information of a highly technical nature that you can normally skip over.
The Warning icon tells you to watch out! It marks important information that may save you headaches.
Beyond the Book
We have written a lot of extra content that you won’t find in this book. Go online to find the following:
The Cheat Sheet for this book is at www.dummies.com/cheatsheet/hadoop
Here you’ll find quick references for useful Hadoop information that we’ve brought together and keep up to date: a handy list of the most common Hadoop commands and their syntax, a map of the various Hadoop ecosystem components and what they’re good for, and listings of the various Hadoop distributions available on the market and their unique offerings. Because the Hadoop ecosystem is continually evolving, we’ve also got instructions on how to set up the Hadoop For Dummies environment with the newest production-ready versions of Hadoop and its components.
Updates to this book, if we have any, are at www.dummies.com/extras/hadoop
Code samples used in this book are also at www.dummies.com/extras/hadoop
All the code samples in this book are posted to the website in Zip format; just download and unzip them and they’re ready to use with the Hadoop For Dummies environment described in Chapter 3. The Zip files, which are named according to chapter, contain one or more files. Some files have application code (Java, Pig, and Hive) and others have a series of commands or scripts. (Refer to the downloadable Read Me file for a detailed description of the files.) Note that not all chapters have associated code sample files.
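If you’d rather script the download than click through, here’s a minimal sketch (in Python, using only the standard library). Note that the zip file name in it is hypothetical; substitute a real chapter file name from www.dummies.com/extras/hadoop.

# Minimal sketch: fetch one chapter's sample-code archive and unpack it.
# The file name below is hypothetical -- use a real one listed at
# www.dummies.com/extras/hadoop.
import urllib.request
import zipfile

url = "http://www.dummies.com/extras/hadoop/chapter06-samples.zip"  # hypothetical
archive_path = "chapter06-samples.zip"

urllib.request.urlretrieve(url, archive_path)    # download the archive
with zipfile.ZipFile(archive_path) as archive:
    archive.extractall("hadoop-fd-samples")      # unpack into a working folder
    print("Extracted:", archive.namelist())      # list the unpacked files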
Where to Go from Here
If you’re starting from scratch with Hadoop, we recommend you start at the beginning and truck your way on through the whole book. But Hadoop does a lot of different things, so if you come to a chapter or section that covers an area you won’t be using, feel free to skip it. Or if you’re not a total newbie, you can bypass the parts you’re familiar with. We wrote this book so that you can dive in anywhere.
If you’re a selective reader and you just want to try out the examples in the book, we strongly recommend looking at Chapter 3. It’s here that we describe how to set up your own Hadoop environment in a Virtual Machine (VM) that you can run on your own computer. All the examples and code samples were tested using this environment, and we’ve laid out all the steps you need to download, install, and configure Hadoop.
Part I
Getting Started with Hadoop
Visit www.dummies.com for great Dummies content online.
In this part …
See what makes Hadoop-sense — and what doesn’t.
Look at what Hadoop is doing to raise productivity in the real world.
See what’s involved in setting up a Hadoop environment.
Chapter 1
Introducing Hadoop and Seeing What It’s Good For
In This Chapter
Seeing how Hadoop fills a need
Digging (a bit) into Hadoop’s history
Getting Hadoop for yourself
Looking at Hadoop application offerings
Organizations are flooded with data. Not only that, but in an era of incredibly cheap storage where everyone and everything are interconnected, the nature of the data we’re collecting is also changing. For many businesses, their critical data used to be limited to their transactional databases and data warehouses. In these kinds of systems, data was organized into orderly rows and columns, where every byte of information was well understood in terms of its nature and its business value. These databases and warehouses are still extremely important, but businesses are now differentiating themselves by how they’re finding value in the large volumes of data that aren’t stored in a tidy database.
Read on in the full edition!
