Tap the power of Big Data with Microsoft technologies
Big Data is here, and Microsoft's new Big Data platform is a valuable tool to help your company get the very most out of it. This timely book shows you how to use HDInsight along with Hortonworks Data Platform for Windows to store, manage, analyze, and share Big Data throughout the enterprise. Focusing primarily on Microsoft and Hortonworks technologies but also covering open source tools, Microsoft Big Data Solutions explains best practices, covers on-premises and cloud-based solutions, and features valuable case studies. Best of all, it helps you integrate these new solutions with technologies you already know, such as SQL Server and Hadoop.
* Walks you through integrating Big Data solutions in your company using Microsoft's HDInsight Server, Hortonworks Data Platform for Windows, and open source tools
* Explores both on-premises and cloud-based solutions
* Shows how to store, manage, analyze, and share Big Data throughout the enterprise
* Covers topics such as Microsoft's approach to Big Data, installing and configuring Hortonworks Data Platform for Windows, integrating Big Data with SQL Server, visualizing data with Microsoft and Hortonworks BI tools, and more
* Helps you build and execute a Big Data plan
* Includes contributions from the Microsoft and Hortonworks Big Data product teams
If you need a detailed roadmap for designing and implementing a fully deployed Big Data solution, you'll want Microsoft Big Data Solutions.
Page count: 563
Publication year: 2014
Cover
Part I: What Is Big Data?
Chapter 1: Industry Needs and Solutions
What's So Big About Big Data?
A Brief History of Hadoop
What Is Hadoop?
Summary
Chapter 2: Microsoft's Approach to Big Data
A Story of “Better Together”
Competition in the Ecosystem
Deploying Hadoop
Summary
Part II: Setting Up for Big Data with Microsoft
Chapter 3: Configuring Your First Big Data Environment
Getting Started
Getting the Install
Running the Installation
Validating Your New Cluster
Common Post-setup Tasks
Summary
Part III: Storing and Managing Big Data
Chapter 4: HDFS, Hive, HBase, and HCatalog
Exploring the Hadoop Distributed File System
Exploring Hive: The Hadoop Data Warehouse Platform
Exploring HCatalog: HDFS Table and Metadata Management
Exploring HBase: An HDFS Column-oriented Database
Summary
Chapter 5: Storing and Managing Data in HDFS
Understanding the Fundamentals of HDFS
Using Common Commands to Interact with HDFS
Moving and Organizing Data in HDFS
Summary
Chapter 6: Adding Structure with Hive
Understanding Hive's Purpose and Role
Creating and Querying Basic Tables
Using Advanced Data Structures with Hive
Summary
Chapter 7: Expanding Your Capability with HBase and HCatalog
Using HBase
Managing Data with HCatalog
Creating Partitions
Integrating HCatalog with Pig and Hive
Using HBase or Hive as a Data Warehouse
Summary
Part IV: Working with Your Big Data
Chapter 8: Effective Big Data ETL with SSIS, Pig, and Sqoop
Combining Big Data and SQL Server Tools for Better Solutions
Working with SSIS and Hive
Configuring Your Packages
Transferring Data with Sqoop
Using Pig for Data Movement
Choosing the Right Tool
Summary
Chapter 9: Data Research and Advanced Data Cleansing with Pig and Hive
Getting to Know Pig
Using Hive
Summary
Part V: Big Data and SQL Server Together
Chapter 10: Data Warehouses and Hadoop Integration
State of the Union
Challenges Faced by Traditional Data Warehouse Architectures
Hadoop's Impact on the Data Warehouse Market
Introducing Parallel Data Warehouse (PDW)
Project Polybase
Summary
Chapter 11: Visualizing Big Data with Microsoft BI
An Ecosystem of Tools
Self-service Big Data with PowerPivot
Rapid Big Data Exploration with Power View
Spatial Exploration with Power Map
Summary
Chapter 12: Big Data Analytics
Data Science, Data Mining, and Predictive Analytics
Introduction to Mahout
Building a Recommendation Engine
Summary
Chapter 13: Big Data and the Cloud
Defining the Cloud
Exploring Big Data Cloud Providers
Setting Up a Big Data Sandbox in the Cloud
Storing Your Data in the Cloud
Summary
Chapter 14: Big Data in the Real World
Common Industry Analytics
Operational Analytics
Summary
Part VI: Moving Your Big Data Forward
Chapter 15: Building and Executing Your Big Data Plan
Gaining Sponsor and Stakeholder Buy-in
Identifying Technical Challenges
Identifying Operational Challenges
Going Forward
Summary
Chapter 16: Operational Big Data Management
Ongoing Data Integration with Cloud and On-premise Solutions
Integration Thoughts for Big Data
Backups and High Availability in Your Big Data Environment
Big Data Solution Governance
Creating Operational Analytics
Summary
Introduction
Our Team
All Kidding Aside
Who Is This Book For?
What You Need to Use This Book
Chapter Overview
Features Used in This Book
End User License Agreement
In This Part
Chapter 1: Industry Needs and Solutions
Chapter 2: Microsoft's Approach to Big Data
What You Will Learn in This Chapter
Finding Out What Constitutes “Big Data”
Appreciating the History and Origins of Hadoop
Defining Hadoop
Understanding the Core Components of Hadoop
Looking to the Future with Hadoop 2.0
This first chapter introduces you to the open source world of Apache and to Hadoop, one of the most exciting and innovative platforms ever created for the data professional. In this chapter we're going to go on a bit of a journey. You're going to find out what inspired Hadoop, where it came from, and its future direction. You'll see how from humble beginnings two gentlemen have inspired a generation of data professionals to think completely differently about data processing and data architecture.
Before we look into the world of Hadoop, though, we must first ask ourselves an important question. Why does big data exist? Is this name just a fad, or is there substance to all the hype? Is big data here to stay? If you want to know the answers to these questions and a little more, read on. You have quite a journey in front of you…
The world has witnessed explosive, exponential growth in data volumes in recent times. So, did we suddenly have a need for big data? Not exactly. Businesses have been tackling the capacity challenge for many years (much to the delight of storage hardware vendors). Therefore, the big in big data isn't purely a statement on size.
Likewise, on the processing front, scale-out solutions such as high-performance computing and distributed database technology have been in place since the last millennium. There is nothing intrinsically new there either.
People also often talk about unstructured data, but, really, this just refers to the format of the data. Could this be a reason we “suddenly” need big data? We know that web data, especially web log data, is born in an unstructured format and can be generated in significant quantities and volume. However, is this really enough to be considered big data?
In my mind, the answer is no. No one property on its own is sufficient for a project or a solution to be considered a big data solution. It's only when you have a cunning blend of these ingredients that you get to bake a big data cake.
This is in line with the Gartner definition of big data, which they updated in Doug Laney's publication, The Importance of Big Data: A Definition (Gartner, 2012): “High volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
What we do know is that every CIO on the planet seems to want to start a big data project right now. In a world of shrinking budgets, there is this sudden desire to jump in with both feet into this world of big data and start prospecting for golden nuggets. It's the gold rush all over again, and clearly companies feel like they might miss out if they hesitate.
However, this is a picture that has been sharpening its focus for several years. In the buildup to this ubiquitous acceptance of big data, we've been blessed with plenty of industry terms and trends: web scale, new programming paradigms of "code first," and of course, to the total disgust of data modelers everywhere, NoSQL. Technologies such as Cassandra and MongoDB are certainly part of the broader ecosystem, but none have resonated as strongly with the market as Hadoop and big data. Why? In essence, unless you were Facebook, Google, Yahoo!, or Bing, issues like web scale really didn't apply.
It seems as though everyone is now building analytics platforms, and that, to be the king of geek chic, requires a degree in advanced statistics. The reason? Big data projects aren't defined by having big data sets. They are shaped by big ideas, by big questions, and by big opportunities. Big data is not about one technology or even one platform. It's so much more than that: It's a mindset and a movement.
Big data, therefore, is a term that underpins a raft of technologies (including the various Hadoop projects, NoSQL offerings, and even MPP database systems, for example) that have been created in the drive to better analyze and derive meaning from data at a dramatically lower cost, while delivering new insights and products for organizations all over the world. In times of recession, businesses look to derive greater value from the assets they have rather than invest in new assets. Big data, and in particular Hadoop, is the perfect vehicle for doing exactly that.
Necessity is the mother of invention, and Hadoop is no exception. Hadoop was created to meet the need of web companies to index and process the data tsunami courtesy of the newfangled Internetz. Hadoop's origins owe everything to both Google and the Apache Nutch project. Without one influencing the other, Hadoop might have ended up a very different animal (joke intended). In this next section, we are going to see how their work contributed to making Hadoop what it is today.
As with many pioneering efforts, Google provided significant inspiration for the development that became known as Hadoop. Google published two landmark papers. The first paper, published in October 2003, was titled “The Google File System,” and the second paper, “MapReduce: Simplified Data Processing on Large Clusters,” published just over a year later in December 2004, provided the inspiration to Doug Cutting and his team of part-time developers for their project, Nutch.
MapReduce was first designed to enable Google developers to focus on the large-scale computations that they were trying to perform while abstracting away all the scaffolding code required to make the computation possible. Given the size of the data set they were working on and the duration of tasks, the developers knew that they had to have a model that was highly parallelized, was fault tolerant, and was able to balance the workload across a distributed set of machines. Of course, the Google implementation of MapReduce worked over Google File System (GFS); Hadoop Distributed File System (HDFS) was still waiting to be invented.
Google has since continued to release thought-provoking, illuminating, and inspirational publications. One publication worthy of note is “BigTable: A Distributed Storage System for Structured Data.” Of course, they aren't the only ones. LinkedIn, Facebook, and of course Yahoo! have all contributed to the big data mind share.
There are similarities here to the SIGMOD papers published by various parties in the relational database world, but ultimately it isn't the same. Let's look at an example. Twitter has open-sourced Storm—their complex event processing engine—which has recently been accepted into the Apache incubator program. For relational database vendors, this level of open sharing is really quite unheard of. For more details about Storm, head over to Apache: http://incubator.apache.org/projects/storm.html.
Nutch was an open source crawler-based search engine built by a handful of part-time developers, including Doug Cutting. As previously mentioned, Cutting was inspired by the Google publications and changed Nutch to take advantage of the enhanced scalability of the architecture promoted by Google. However, it wasn't too long after this that Cutting joined Yahoo! and Hadoop was born.
Nutch joined the Apache foundation in January 2005, and its first release (0.7) was in August 2005. However, it was not until 0.8 was released in July 2006 that Nutch began the transition to Hadoop-based architecture.
Nutch is still very much alive and is an actively contributed-to project. However, Nutch has now been split into two codebases. Version 1 is the legacy and provides the origins of Hadoop. Version 2 represents something of a re-architecture of the original implementation while still holding true to the original goals of the project.
Apache Hadoop is a top-level open source project and is governed by the Apache Software Foundation (ASF). Hadoop is not any one entity or thing. It is best thought of as a platform or an ecosystem that describes a method of distributed data processing at scale using commodity hardware configured to run as a cluster of computing power. This architecture enables Hadoop to address and analyze vast quantities of data at significantly lower cost than traditional methods commonly found in data warehousing, for example, with relational database systems.
At its core, Hadoop has two primary functions:
Processing data (MapReduce)
Storing data (HDFS)
With the advent of Hadoop 2.0, the next major release of Hadoop, we will see the decoupling of resource management from data processing. This adds a third primary function to the list. However, at the time of this writing, YARN, the Apache project responsible for resource management, is in alpha technology preview mode.
That said, a number of additional subprojects have been developed and added to the ecosystem that have been built on top of these two primary functions. When bundled together, these subprojects plus the core projects of MapReduce and HDFS become known as a distribution.
To fully understand a distribution, you must first understand the role, naming, and branding of Apache Hadoop. The basic rule here is that only official releases by the Apache Hadoop project may be called Apache Hadoop or Hadoop. So, what about companies that build products/solutions on top of Hadoop? This is where the term derivative works comes in.
Any product that uses Apache Hadoop code, known as artifacts, as part of its construction is said to be a derivative work. A derivative work is not an Apache Hadoop release. It may be true that a derivative work can be described as “powered by Apache Hadoop.” However, there is strict guidance on product naming to avoid confusion in the marketplace. Consequently, companies that provide distributions of Hadoop should also be considered to be derivative works.
I liken the relationship between Hadoop and derivative works to the world of Xbox games development. Many Xbox games use graphics engines provided by a third party. The Unreal Engine is just such an example.
Now that you know what a derivative work is, we can look at distributions. A distribution is the packaging of Apache Hadoop projects and subprojects plus any other additional proprietary components into a single managed package. For example, Hortonworks provides a distribution of Hadoop called “Hortonworks Data Platform,” or HDP for short. This is the distribution used by Microsoft for its product, HDInsight.
You may be asking yourself what is so special about that? You could certainly do this yourself. However, this would be a significant undertaking. First, you'd need to download the projects you want, resolve any dependencies, and then compile all the source code. However, when you decide to go down this route, all the testing and integration of the various components is on you to manage and maintain. Bear in mind that the creators of distributions also employ the committers of the actual source and therefore can also offer support.
As you might expect, distributions may lag slightly behind the Apache projects in terms of releases. This is one of the deciding factors you might want to consider when picking a distribution. Frequency of updates is a key factor, given how quickly the Hadoop ecosystem evolves.
If you look at the Hortonworks distribution, known as Hortonworks Data Platform (HDP), you can see that there are a number of projects at different stages of development. The distribution brings these projects together and tests them for interoperability and stability. Once satisfied that the projects all hang together, the distributor (in this case, Hortonworks) creates the versioned release of the integrated software (the distribution as an installable package).
HDP 1.3 made a number of choices as to which project versions to support. Today, though, just a few months later, the top-line Hadoop project has a 1.2.0.5 release available, which is not part of HDP 1.3. This and other ecosystem changes will be consumed in the next release of the HDP distribution.
To see a nice graphic of the Hortonworks distribution history, I will refer you to http://hortonworks.com/products/hdp-2/. Hadoop is a rapidly changing and evolving ecosystem that doesn't rest on its laurels, so including a version history here would be largely futile.
Note that there are several Hadoop distributions on the market for you to choose from. Some include proprietary components; others do not. The following sections briefly cover some of the main Hadoop distributions.
Hortonworks provides a distribution of Apache Hadoop known as Hortonworks Data Platform (HDP). HDP is a 100% open source distribution. Therefore, it does not contain any proprietary code or licensing. The developers employed by Hortonworks contribute directly to the Apache projects. Hortonworks is also building a good track record for regular releases of their distribution, educational content, and community engagement. In addition, Hortonworks has established a number of strategic partnerships, which will stand them in good stead. HDP is available in three forms. The first is for Hadoop 1.x, and the second is for Hadoop 2.0, which is currently in development. Hortonworks also offers HDP for Windows, which is a third distribution. HDP for Windows is the only version that runs on the Windows platform.
MapR is an interesting distribution for Hadoop. They have taken some radical steps to alter the core architecture of Hadoop to mitigate some of its single points of failure, such as the removal of the single master name node for an alternative architecture that provides them with a multimaster system. As a result, MapR has also implemented its own JobTracker to improve availability.
MapR also takes a different approach to storage. Instead of using direct attached storage in the data nodes, MapR uses mounted network file storage, which they call Direct Access NFS. The storage provided uses MapR's file system, which is fully POSIX compliant.
MapR is available both within Amazon's Elastic MapReduce Service and within Google's Cloud Platform. MapR also offers a free distribution called M3. However, it is not available in Azure or on Windows and is missing some of the high-availability (HA) features. For those goodies, you have to pay to get either the M5 or M7 versions.
Cloudera, whose chief architect is Doug Cutting, offers an open source distribution called Cloudera Distribution Including Apache Hadoop (CDH). Like MapR, Cloudera has invested heavily in some proprietary extensions to Hadoop for their Enterprise distribution. Cloudera, however, also has an additional release, Cloudera Standard, which combines CDH with their own cluster management tool: Cloudera Manager. Cloudera Manager is proprietary, but it is a free download. As far as competition goes, this puts Cloudera Standard firmly up against Hortonworks's HDP distribution, which includes Ambari for its cluster management.
Cloudera's big-ticket item is Impala. Impala is a real-time, massively parallel processing (MPP) query engine that runs natively on Hadoop. This enables users to issue SQL queries against data stored in HDFS and Apache HBase without having to first move the data into another platform.
So is HDInsight yet another distribution? In a word, no. HDInsight is a product that has been built on top of the Hortonworks HDP distribution (specifically the HDP distribution for Windows). At the time of this writing, HDP 1.3 is the currently available version.
Some projects in the world of Hadoop are simply more important than others. Projects like HDFS, the Hadoop Distributed File System, are fundamental to the operation of Hadoop. Similarly, MapReduce currently provides both the scheduling and the execution and programming engines to the whole of Hadoop. Without these two projects there simply is no Hadoop.
In this next section, we are going to delve a little deeper into these core Hadoop projects to build up our knowledge of the main building blocks. Once we've done that, we'll be well placed to move forward with the next section, which will touch on some of the other projects in the Hadoop ecosystem.
HDFS, one of the core components of Apache Hadoop, stands for Hadoop Distributed File System. There's no exotic branding to be found here. HDFS is a Java-based, distributed, fault-tolerant file storage system designed for distribution across a number of commodity servers. These servers have been configured to operate together as an HDFS cluster. By leveraging a scale-out model, HDFS ensures that it can support truly massive data volumes at a low and linear cost point.
Before diving into the details of HDFS, it is worth taking a moment to discuss the files themselves. Files created in HDFS are made up of a number of HDFS data blocks or simply HDFS blocks. These blocks are not small. They are 64MB or more in size, which allows for larger I/O sizes and in turn greater throughput. Each block is replicated and then distributed across the machines of the HDFS cluster.
HDFS is built on three core subcomponents:
NameNode
DataNode
Secondary NameNode
Simply put, the NameNode is the “brain.” It is responsible for managing the file system, and therefore is responsible for allocating directories and files. The NameNode also manages the blocks, which are present on the DataNode. There is only one NameNode per HDFS cluster.
The DataNodes are the workers, sometimes known as slaves. The DataNodes perform the bidding of the NameNode. DataNodes exist on every machine in the cluster, and they are responsible for offering up the machine's storage to HDFS. In summary, the job of the DataNode is to manage all the I/O (that is, read and write requests).
HDFS is also the point of integration for a new Microsoft technology called Polybase, which you will learn more about in Chapter 10, “Data Warehouses and Hadoop Integration.”
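To make the interaction with the NameNode and DataNodes a little more concrete, the following Java sketch uses Hadoop's standard FileSystem client API to copy a local file into HDFS and list the result. It is only an illustration: the paths are hypothetical, and the client is assumed to pick up the NameNode address (fs.defaultFS) from a core-site.xml on its classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The NameNode allocates the blocks; the DataNodes store and replicate them
        fs.copyFromLocalFile(new Path("C:\\data\\weblogs.txt"),
                             new Path("/data/raw/weblogs.txt"));

        // List the target directory to confirm the write
        for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}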
MapReduce is both an engine and a programming model. Users develop MapReduce programs and submit them to the MapReduce engine for processing. The programs created by the developers are known as jobs. Each job is a combination of Java ARchive (JAR) files and classes required to execute the MapReduce program. These files are themselves collated into a single JAR file known as a job file.
Each MapReduce job can be broken down into a few key components. The first phase of the job is the map. The map breaks the input up into many tiny pieces so that it can then process each piece independently and in parallel. Once complete, the results from this initial process can be collected, aggregated, and processed. This is the reduce part of the job.
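To illustrate the two phases, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. The mapper emits a count of 1 for every word it sees, and the reducer sums those counts per word; the job driver and the input/output paths are omitted, so treat this as a sketch rather than a complete program.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: break each input line into words and emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: collect the counts for each word and sum them
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}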
The MapReduce engine is used to distribute the workload across the HDFS cluster and is responsible for the execution of MapReduce jobs. The MapReduce engine accepts jobs via the JobTracker. There is one JobTracker per Hadoop cluster (the impact of which we discuss shortly). The JobTracker provides the scheduling and orchestration of the MapReduce engine; it does not actually process data itself.
To execute a job, the JobTracker communicates with the HDFS NameNode to determine the location of the data to be analyzed. Once the location is known, the JobTracker then speaks to another component of the MapReduce engine called the TaskTracker. There are actually many TaskTracker nodes in the Hadoop cluster. Each node of the cluster has its own TaskTracker. Clearly then, the MapReduce engine is another master/slave architecture.
TaskTrackers provide the execution engine for the MapReduce engine by spawning a separate process for every task request. Therefore, the JobTracker must identify the appropriate TaskTrackers to use by assessing which are available to accept task requests and, ideally, which trackers are closest to the data. After the decision has been made, the JobTracker can submit the workload to the targeted TaskTrackers.
TaskTrackers are monitored by the JobTracker. This is a bottom-up monitoring process. Each TaskTracker must “report in” via a heartbeat signal. If it fails to do so for any reason, the JobTracker assumes it has failed and reassigns the tasks accordingly. Similarly, if an error occurs during the processing of an assigned task, the TaskTracker is responsible for calling that in to the JobTracker. The decision on what to do next then lies with the JobTracker.
The JobTracker keeps a record of the tasks as they complete. It maintains the status of the job, and a client application can poll it to get the latest state of the job.
The JobTracker is a single point of failure for the MapReduce engine. If it goes down, all running jobs are halted, and new jobs cannot be scheduled.
Now that we have a conceptual grasp of the core projects for Hadoop (the brain and heart if you will), we can start to flesh out our understanding of the broader ecosystem. There are a number of projects that fall under the Hadoop umbrella. Some will succeed, while others will wither and die. That is the very nature of open source software. The good ideas get developed, evolve, and become great—at least, that's the theory.
Some of the projects we are about to discuss are driving lots of innovation—especially for Hadoop 2.0. Hive is the most notable project in this regard. Almost all the work around the Hortonworks Stinger initiative is to empower SQL in Hadoop. Many of these changes will be driven through the Hive project. Therefore, it is important to know what Hive is and why it is getting so much attention.
Apache Hive is another key subproject of Hadoop. It provides data warehouse software that enables a SQL-like querying experience for the end user. The Hive query language is called Hive Query Language (HQL). (Clearly, the creators of Hive had no time for any kind of creative branding.) HQL is similar to ANSI SQL, making the crossover from one to the other relatively simple. HQL provides an abstraction over MapReduce; HQL queries are translated by Hive into MapReduce jobs. Hive is therefore quite a popular starting point for end users because there is no need to learn how to program a MapReduce job to access and process data held in Hadoop.
It is important to understand that Hive does not turn Hadoop into a relational database management system (RDBMS). Hive is still a batch-processing system that generates MapReduce jobs. It does not offer transactional support, a full type system, security, high concurrency, or predictable response times. Queries tend to be measured in minutes rather than in milliseconds or seconds. This is because there is a high spin-up cost for each query and, at the end of the day, no cost-based optimizer underpins the query plan like traditional SQL developers are used to. Therefore, it is important not to overstate Hive's capabilities.
Hive does offer certain features that an RDBMS might not, though. For example, Hive supports the following complex types: structs, maps (key/value pairs), and arrays. Likewise, Hive offers native operator support for regular expressions, which is an interesting addition. HQL also offers additional extensibility by allowing MapReduce developers to plug in their own custom mappers and reducers, allowing for more advanced analysis.
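To give a flavor of what querying Hive looks like from client code, here is a hedged Java sketch that submits an HQL statement through the HiveServer2 JDBC driver. The host name, credentials, and the weblogs table (with a map-typed request_headers column) are assumptions made purely for illustration; behind the scenes, Hive compiles the query into MapReduce jobs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and credentials are illustrative
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://headnode:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks like ANSI SQL; note the access into a map-typed column
            ResultSet rs = stmt.executeQuery(
                "SELECT ip, page, request_headers['User-Agent'] AS agent " +
                "FROM weblogs " +
                "WHERE log_date = '2014-01-15' " +
                "LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("ip") + "\t"
                        + rs.getString("page") + "\t" + rs.getString("agent"));
            }
        }
    }
}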
The most recent and exciting developments for Hive have been the new Stinger initiatives. Stinger has the goal of delivering 100X performance improvement to Hive plus SQL compatibility. These two features will have a profound impact on Hadoop adoption; keep them on your radar. We'll talk more about Stinger in Chapter 2, “Microsoft's Approach to Big Data.”
Apache Pig is an openly extensible programmable platform for loading, manipulating, and transforming data in Hadoop using a scripting language called Pig Latin. Pig is another abstraction on top of the Hadoop core. It converts the Pig Latin script into MapReduce jobs, which can then be executed against Hadoop.
Pig Latin scripts define the flow of data through transformations and, although simple to write, can result in complex and sophisticated manipulation of data. So, even though Pig Latin is SQL-like syntactically, it is more like a SQL Server Integration Services (SSIS) Data Flow task in spirit. Pig Latin scripts can have multiple inputs, transformations, and outputs. Pig has a large number of its own built-in functions, but you can always either create your own or just “raid the piggybank” (https://cwiki.apache.org/confluence/display/PIG/PiggyBank) for community-provided functions.
As previously mentioned, Pig provides its scalability by operating in a distributed mode on a Hadoop cluster. However, Pig Latin programs can also be run in a local mode. This does not use a Hadoop cluster; instead, the processing takes place in a single local Java Virtual Machine (JVM). This is certainly advantageous for iterative development and initial prototyping.
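As a hedged illustration of a Pig Latin data flow, the sketch below uses Pig's embedded Java API (PigServer) in local mode, which suits exactly the kind of prototyping just described. The input file and field names are invented for the example; switching ExecType.LOCAL to ExecType.MAPREDUCE would run the same flow against a cluster.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Local mode runs the flow in a single JVM; no Hadoop cluster required
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A simple flow: load, filter, group, and aggregate
        pig.registerQuery("logs = LOAD 'weblogs.tsv' AS (ip:chararray, page:chararray, bytes:long);");
        pig.registerQuery("big = FILTER logs BY bytes > 1024L;");
        pig.registerQuery("by_page = GROUP big BY page;");
        pig.registerQuery("totals = FOREACH by_page GENERATE group AS page, SUM(big.bytes) AS total_bytes;");

        // Pull the results back into the client
        Iterator<Tuple> it = pig.openIterator("totals");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}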
SQOOP is a top-level Apache project. However, I like to think of Apache SQOOP as a glue project. It provides the vehicle to transfer data from the relational, tabular world of structured data stores to Apache Hadoop (and vice versa).
SQOOP is extensible to allow developers to create new connectors using the SQOOP application programming interface (API). This is a core part of SQOOP's architecture, enabling a plug-and-play framework for new connectors.
SQOOP is currently going through something of a re-imagining process. As a result, there are now two versions of SQOOP. SQOOP 1 is a client application architecture that interacts directly with the Hadoop configurations and databases. SQOOP 1 also experienced a number of challenges in its development. SQOOP 2 aims to address the original design issues and starts from a server-based architecture. These are discussed in more detail later in this book.
Historically, SQL Server had SQOOP connectors that were separate downloads available from Microsoft. These have now been rolled into SQOOP 1.4 and are also included into the HDInsight Service. SQL Server Parallel Data Warehouse (PDW) has an alternative technology, Polybase, which we discuss in more detail in Chapter 10, “Data Warehouses and Hadoop Integration.”
So, what is HCatalog? Simply put, HCatalog provides a tabular abstraction of the HDFS files stored in Hadoop. A number of tools then leverage this abstraction when working with the data. Pig, Hive, and MapReduce all use this abstraction to reduce the complexity and overhead of reading and writing data to Hadoop.
HDFS files can, in theory, be in any format, and the data blocks can be placed anywhere on the cluster. HCatalog provides the mechanism for mapping both the file formats and data locations to the tabular view of the data. Again, HCatalog is open and extensible to allow for the fact that some file formats may be proprietary. Additional coding would be required, but the fact that a file format in HDFS was previously unknown would not be a blocker to using HCatalog.
Apache HCatalog is technically no longer a Hadoop project. It is still an important feature, but its codebase was merged with the Hive project early in 2013. HCatalog is built on top of Hive and leverages its command-line interface for issuing commands against the HCatalog.
One way to think about HCatalog is as the master database for Hive. In that sense, HCatalog provides the catalog views and interfaces for your Hadoop “database.”
HBase is an interesting project because it provides NoSQL database functionality on top of HDFS. It is also a column store, providing fast access to large quantities of data, which is often sparsely populated. HBase also offers transactional support to Hadoop, enabling a level of Data Modification Language (DML) (that is, inserts, updates, and deletes). However, HBase does not offer a SQL interface; remember, it is part of the NoSQL family. It also does not offer a number of other RDBMS features, such as typed columns, security, enhanced data programmability features, and querying languages.
HBase is designed to work with large tables, but you are unlikely to ever see a table like this in an RDBMS (not even in a SharePoint database). HBase tables can have billions of rows, which is not uncommon these days; but in conjunction with that, those rows can have an almost limitless number of columns. In that sense, there could be millions of columns. In contrast, SQL Server is limited to 1,024 columns.
Architecturally, HBase belongs to the master/slave collection of distributed Hadoop implementations. It is also heavily reliant on Zookeeper (an Apache project we discuss shortly).
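The short Java sketch below shows the flavor of the 2013-era HBase client API: a cell is addressed by row key, column family, and column qualifier, and values are plain byte arrays. The table name, column family, and row key are hypothetical, and hbase-site.xml (which points at the ZooKeeper quorum) is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum and other settings from hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "weblog_summary");

        // Write a cell: row key, column family, qualifier, value
        Put put = new Put(Bytes.toBytes("2014-01-15|homepage"));
        put.add(Bytes.toBytes("metrics"), Bytes.toBytes("hits"), Bytes.toBytes("42137"));
        table.put(put);

        // Read the same row back
        Result result = table.get(new Get(Bytes.toBytes("2014-01-15|homepage")));
        byte[] hits = result.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("hits"));
        System.out.println("hits = " + Bytes.toString(hits));

        table.close();
    }
}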
Flume is the StreamInsight of the Hadoop ecosystem. As you would expect, it is a distributed system that collects, aggregates, and shifts large volumes of event streaming data into HDFS. Flume is also fault tolerant and can be tuned for failover and recovery. However, in general terms, faster recovery tends to mean trading some performance; so, as with most things, a balance needs to be found.
The Flume architecture consists of the following components:
Client
Source
Channel
Sink
Destination
Events flow from the client to the source. The source is the first Flume component. The source inspects the event and then farms it out to one or more channels for processing. Each channel is consumed by a sink. In Hadoop parlance, the event is “drained” by the sink. The channel provides the separation between source and sink and is also responsible for managing recovery by persisting events to the file system if required.
Once an event is drained, it is the sink's responsibility to then deliver the event to the destination. There are a number of different sinks available, including an HDFS sink. For the Integration Services users out there familiar with the term backpressure, you can think of the channel as the component that handles backpressure. If the source is receiving events faster than they can be drained, it is the channel's responsibility to grow and manage that accumulation of events.
A single pass through a source, channel, and sink is known as a hop. The components for a hop exist in a single JVM called an agent. However, Flume does not restrict the developer to a single hop. Complex multihop flows are perfectly possible with Flume. This includes creating fan-out and fan-in flows; failover routes for failed hops; and conditional, contextual routing of events. Consequently, events can be passed from agent to agent before reaching their ultimate destination.
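To show how events enter a flow, here is a hedged sketch that uses Flume's client SDK to hand an event to an agent's Avro source. The host, port, and payload are assumptions; the source, channel, and sink that make up the hop are defined in the agent's configuration file rather than in code.

import java.nio.charset.Charset;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Connects to a Flume agent whose Avro source listens on flumehost:41414
        RpcClient client = RpcClientFactory.getDefaultInstance("flumehost", 41414);
        try {
            // The source passes the event to a channel, and a sink drains it (for example, into HDFS)
            Event event = EventBuilder.withBody(
                    "2014-01-15 10:01:02 GET /index.html 200",
                    Charset.forName("UTF-8"));
            client.append(event);
        } finally {
            client.close();
        }
    }
}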
Mahout is all about machine learning. The goal of the project is to build scalable machine-learning libraries. The core of Apache Mahout is implemented on top of Hadoop using MapReduce. However, the project does not limit itself to that paradigm. At present, Mahout is focused on four use cases:
Recommendation mining: Recommendation mining is the driving force behind several recommendation engines. How many of you have seen something like this appear in your inbox: "Because you bought this New England Patriots shirt, you might also like this NFL football." (A small sketch of a recommender follows this list.)
Clustering: Clustering is the grouping of text documents to create topically related groupings or categories.
Classification: Classification algorithms sit on top of classified documents and subsequently learn how to classify new documents. You could imagine how recruitment agents would love clustering and classification for their buzzword bingo analysis. If Apache Mahout is able to reduce the number of calls received for the wrong job, that's a win for everyone in my book.
Frequent item set mining: Frequent item set mining is a way to understand which items are often bucketed together (for example, in shopping basket analysis).
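Here is a minimal, hedged sketch of recommendation mining using Mahout's non-distributed Taste API (the distributed equivalents run as MapReduce jobs on the cluster). The ratings.csv file and the user ID are invented; each line of the file is assumed to hold userID,itemID,preference.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Each line of ratings.csv: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Find the ten users most similar to the target user and recommend what they liked
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top three recommendations for user 42
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " scored " + item.getValue());
        }
    }
}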
Ambari is the system center of the Hadoop ecosystem. It provides all the provisioning, operational insight, and management for Hadoop clusters. Remember that Hadoop clusters can contain many hundreds or thousands of machines. Keeping them configured correctly is a significant undertaking, and so having some tooling in this space is absolutely essential.
Ambari provides a web interface for ease of management where you can check on all the Hadoop services and core components. The same web interface can also be used to monitor the cluster, configuring notification alerts for health and performance conditions. Job diagnostic information is also surfaced in the web UI, helping users better understand job interdependencies, historic performance, and system trends.
Finally, Ambari can integrate with other third-party monitoring applications via its RESTful API. So when I say it is the system center of Hadoop, it literally is!
Oozie is a Java web scheduling application for Hadoop. Often, a single job on its own does not define a business process. More often than not, there is a chain of events, processing, or processes that must be initiated and completed for the result to have meaning. It is Oozie's lot in life to provide this functionality. Simply put, Oozie can be used to compose a single container/unit of work from a collection of jobs, scripts, and programs. For those familiar with enterprise schedulers, this will be familiar territory. Oozie takes these units of work and can schedule them accordingly.
It is important to understand that Oozie is a trigger mechanism. It submits jobs and such, but MapReduce is the executor. Consequently, Oozie must also solicit status information for actions that it has requested. Therefore, Oozie has callback and polling mechanisms built in to provide it with job status/completion information.
Distributed applications use Zookeeper to help manage and store configuration information. Zookeeper is interesting because it steps away from the master/slave model seen in other areas of Hadoop and is itself a highly distributed architecture and consequently highly available. What is interesting is that it achieves this while providing a “single view of the truth” for the configuration information data that it holds. Zookeeper is responsible for managing and mediating potentially conflicting updates to this information to ensure synchronized consistency across the cluster. For those of you who are familiar with managing complex merge replication topologies, you know that this is no trivial task!
You don't have to look too far into the future to discern the future direction of Hadoop. Alpha code and community previews are already available for Hadoop 2.0, which is fantastic to see. Aside from this, the projects we've talked about in the previous section continue to add new features, and so we should also expect to see new V1 distributions from the likes of Hortonworks for the foreseeable future.
Of course, one of the most exciting things to happen to Hadoop is the support for Hadoop on Windows and Azure. The opportunity this presents for the market cannot be overstated. Hadoop is now an option for all data professionals on all major platforms, and that is very exciting indeed.
So, what can we expect in Hadoop 2.0? Two projects are worth highlighting here (at least in summary): YARN and Tez.
In this first chapter, you learned all about what big data is, about the core components of the Hadoop ecosystem, and a little bit about its history and inspiration. The stage is set now for you to immerse yourself in this new and exciting world of big data using Hadoop.
What You Will Learn in This Chapter
Recognizing Microsoft's Strategic Moves to Adopt Big Data
Competing in the Hadoop Ecosystem
Deciding How to Deploy Hadoop
In Chapter 1 we learned a bit about the various projects that comprise the Hadoop ecosystem. In this chapter we will focus on Microsoft's approach to big data and delve a bit deeper into the more competitive elements of the Hadoop ecosystem. Finally, we'll look at some of the considerations when deploying Hadoop and evaluate our deployment options. We'll consider how these deployment factors might manifest themselves in our chosen topology and what, if anything, we can do to mitigate them.
Back in 2011, at the PASS Summit keynote, then Senior Vice President Ted Kummert formally announced the partnership with Hortonworks as a central tenet of Microsoft's strategy for entering the world of "big data." It was quite a surprise.
Those of us who had been following Microsoft's efforts in this space were all waiting for Microsoft to release a proprietary product for distributed scale-out compute (for example, the Microsoft Research project known as Dryad). However, it was not to be. Microsoft elected to invest in this partnership and work with the open source community to enable Hadoop to run on Windows and work with Microsoft's tooling. It was more than a bold move. It was unprecedented.
Later that week, Dave DeWitt commented in his keynote Q&A that the “market had already spoken” and had chosen Hadoop. This was a great insight into Microsoft's rationale; they were too late to launch their own product. However, this is just the beginning of the story. Competition is rife, and although Hadoop's core is open source, a number of proprietary products have emerged that are built on top of Hadoop. Will Microsoft ever build any proprietary components? No one knows. Importantly, though, the precedent has been set. As product companies look to monetize their investment, it seems inevitable that there will ultimately be more proprietary products built on top of Hadoop.
Microsoft's foray into the world of big data and open source solutions (OSS) has also overlapped with the even broader, even more strategic shift in focus to the cloud with Windows Azure. This has led to some very interesting consequences for the big data strategy that would have otherwise never materialized. Have you ever considered Linux to be part of the Microsoft data platform? Neither had I!
With these thoughts in your mind, I now urge you to read on and learn more about this fascinating ecosystem. Understand Microsoft's relationship with the open source world and get insight on your deployment choices for your Apache Hadoop cluster.
If you want to know more about project Dryad, this site provides a great starting point: http://research.microsoft.com/en-us/projects/dryad/. You will notice some uncanny similarities.
Just because Hadoop is an open source series of projects doesn't mean for one moment that it is uncompetitive. Quite the opposite. In many ways, it is a bit like playing cards but with everyone holding an open hand; everyone can see each other's cards. That is, until they can't. Many systems use open source technology as part of a mix of components that blend in proprietary extensions. These proprietary elements are what closes the hand and fuels the competition. We will see an example of this later in this chapter when we look at Cloudera's Impala technology.
Hadoop is no exception. To differentiate themselves in the market, distributors of Hadoop have opted to move in different directions rather than collaborate on a single project or initiative. To highlight how this is all playing out, let's focus on one area: SQL on Hadoop. No area is more hotly contested or more important to the future of adoption of a distribution than the next generation of SQL on Hadoop.
To recap what you learned in Chapter 1, “Industry Needs and Solutions”: SQL on Hadoop came into being via the Hive project. Hive abstracts away the complexity of MapReduce by providing a SQL-like language known as Hive Query Language (HQL). Notice that it does not suddenly mean that Hadoop observes all the ACID (atomicity, consistency, isolation, durability) rules of a transaction. It is more that Hadoop offers through Hive a querying syntax that is familiar to end users. However, you want to note that Hive works only on data that resides in Hadoop.
The challenge for Hive has always been that dependency on MapReduce. Owing to the tight coupling between the execution engine of MapReduce and the scheduling, there was no choice but to build on top of MR. However, Hadoop 2.0 and project YARN changed all that. By separating scheduling into its own project and decoupling it from execution, new possibilities have surfaced for the evolution of Hive.
Hortonworks has focused all its energy on Stinger. Stinger is not a Hadoop project as such; instead, it is an initiative to dramatically improve the performance and completeness of Hive. The goal is to speed up Hive by 100x. No mean feat. What is interesting about Stinger is that all the coding effort goes directly into the Hadoop projects. That way everyone benefits from the changes made. This completely aligns with Hortonworks's commitment and charter to Hadoop.
So what is Stinger? It consists of three phases. The first two phases have already been delivered.
Phase 1 was primarily aimed at optimizing Hive within its current architecture. Hence, it was delivered in Hive 0.11 in May 2013, forming part of the Hortonworks Data Platform (HDP) 1.3 release. Phase 1 delivered three changes of notable significance:
Optimized RC file (ORCFile): Optimizations to the ORC file format have contributed enormously to Hive's data access patterns. By adding metadata at the file and block level, queries can now be run faster. In addition, much like SQL Server's column store technology, only the bytes from the required columns are read from HDFS, reducing I/O and adding a further performance boost.
ORCFile stands for Optimized Row Columnar File. This file format allows the data to be partitioned horizontally (rows) and vertically (columns). In essence, it's a column store for Hadoop.
SQL compatibility: Decimal as a data type was introduced. Truncate was also added. Windowing functions also made the list, so Hive picked up support for RANK, LAG & LEAD, FIRST & LAST, and ROW_NUMBER in addition to the OVER clause. Some improvements were also made in the core syntax, so GROUP BY allowed aliases and ALTER VIEW was also included. (The windowing support is illustrated in the sketch after this list.)
Query and join optimizations: As with most releases of database software, query optimizations are often featured, and Hive 0.11 was no exception. Hive had two major changes in this area. The first was to remove redundant operators from the plan. It had been observed that these operators could be consuming up to 10% of the CPU in simple queries. The second improvement was to JOIN operators, with the de-emphasis of the MAPJOIN hint. This was in part enabled by another change, which switched the default configuration of hive.auto.convert.join to true (that is, on).
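As a hedged illustration of the new windowing support, the following Java sketch submits a RANK() ... OVER query through the HiveServer2 JDBC driver. The daily_page_stats table and the connection details are invented for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWindowingExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://headnode:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // RANK() with an OVER clause is one of the windowing functions added in Hive 0.11
            ResultSet rs = stmt.executeQuery(
                "SELECT site, page, hits, " +
                "       RANK() OVER (PARTITION BY site ORDER BY hits DESC) AS hit_rank " +
                "FROM daily_page_stats");
            while (rs.next()) {
                System.out.println(rs.getString("site") + "\t"
                        + rs.getString("page") + "\t" + rs.getInt("hit_rank"));
            }
        }
    }
}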
Phase 2 was implemented as part of Hive 0.12, which was released in October 2013. Note that this release followed only 5 months after phase 1. The community behind Stinger is moving at a fast pace.
To continue with Stinger's three-pronged focus on speed, scale, and SQL, phase 2 also needed to cut over to Hadoop 2.0. This enabled the engineers working on Hive to leverage YARN and lay the groundwork for Tez.
Refer back to Chapter 1 for definitions of Hadoop projects YARN and Tez.
Phase 2 included the following enhancements:
Performance: Queries got faster with Stinger phase 2 thanks to a number of changes. A new logical optimizer was introduced called the Correlation Optimizer. Its job is to merge multiple correlated MapReduce jobs into a single job to reduce the movement of data. ORDER BY was made a parallel operation. Furthermore, predicate pushdown was implemented to allow ORCFile to skip over rows, much like segment skipping in SQL Server. Optimizations were also added for COUNT (DISTINCT), with the hive.map.groupby.sorted configuration property.
SQL compatibility: Two significant data types were introduced: VARCHAR and DATE. GROUP BY support was enhanced to enable support for struct and union types. Lateral views were also extended to support an "outer" join behavior, and truncate was extended to support truncation of columns. New user-defined functions (UDFs) were added to work over the Binary data type. Finally, partition switching entered the product courtesy of ALTER TABLE..EXCHANGE PARTITION.
SQL Server does not support lateral views. That's because SQL Server doesn't support an array data type or functions to interact with one. To learn about lateral views, head over to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView.
End of HCatalog project: With Hive 0.12, HCatalog ceased to exist as its own project and was merged into Hive.
HCatalog is defined in Chapter 1.
Stinger phase 3 is underway and will see Hadoop introduce Apache Tez, thus moving away from batch toward a more interactive query/response engine. Vectorized queries (batch mode, to SQL Server query processor aficionados) and an in-memory cache are all in the pipeline. However, it is still early days for this phase of the Stinger initiative.
Cloudera chose a different direction when defining their SQL in Hadoop strategy. Clearly, they saw the limitations of MapReduce and chose to implement their own engine: Impala.
Cloudera took a different approach to Hortonworks when they built Impala. In effect, they chose to sidestep the whole issue of Hadoop's legacy with MapReduce and started over. Cloudera created three new daemons that drive Impala:
Impala Daemon
Impala Statestore
Impala Catalog Service
The Impala daemon is the core component, and it runs on every node of the Hadoop cluster. The process is called impalad, and it operates in a decentralized, multimaster pattern; that is, any node can be the controlling “brain” for a given query. As the coordinating node is decided for each query, a common single point of failure and bottleneck for a number of massively parallel-processing (MPP) systems is elegantly removed from the architecture. Note, however, that the Impala daemon you connect to when submitting your query will be the one that will take on the responsibility of acting as the coordinator. This could be load balanced by the calling application. However, it is not automatically load balanced.
Once one node has been defined as the coordinator, the other nodes act as workhorses performing delegated tasks on data subsets as defined by the coordinator. Each workhorse operates over data and provides interim results back to the coordinator, who will be responsible for the final result set.
The Impala daemons are in constant contact with the Statestore daemon to see which nodes in the cluster are healthy and are accepting tasks.
The Statestore is another daemon known as statestored. Its job is to monitor all the Impala daemons, confirming their availability to perform tasks and informing them of the health of other Impala daemons in the cluster. It therefore helps to make sure that tasks are not assigned to a node that is currently unreachable. This is important because Impala sacrifices runtime resilience for speed. Unlike MapReduce, queries that experience a node failure are canceled; so, the sooner the cluster knows about an issue, the better.
Note that only one Statestore daemon is deployed on the cluster. However, this is not an availability issue per se. This process is not critical to the operation of Impala. The cluster does become more susceptible to runtime instability for query operations, but it does not go offline.
The Catalog Service is the third daemon and is named catalogd. Its job is to distribute metadata changes to all nodes in the cluster. Again, only one Catalog Service daemon is in operation on the cluster, and it is commonly deployed on the same node as the Statestore owing to the fact that it uses the Statestore as the vehicle for transmitting its messages to the Impala daemons.
The catalog service removes the need to issue REFRESH and INVALIDATE METADATA statements, which would otherwise be required when using Data Definition Language (DDL) or Data Modification Language (DML) in Impala. By distributing metadata changes it ensures that any Impala daemon can act as a coordinator without any additional actions on the part of the calling application.
As with the Statestore, the Catalog Service is not mission critical. If the Catalog Service is down for any reason, users would need to execute REFRESH table after performing an insert or INVALIDATE METADATA after DDL operations on any Impala daemon they were connecting to.
Is Impala open source? The simple answer is yes. However, there is a catch. It's an extension to CDH (Cloudera Distribution Including Apache Hadoop). This last point is important. You cannot use Impala on any old Hadoop distribution; it is unique to Cloudera. So although it is open source, it is in many ways proprietary. Just because something is open source doesn't mean that there is no vendor lock-in.
Like Hortonworks, Cloudera monetizes their investment in Hadoop through support and training. Impala is no exception. Real Time Query (RTQ) is the technical support package for Impala and is an extension of Cloudera Enterprise (their base enterprise technical support offering). To get RTQ, you have to purchase both Cloudera Enterprise and RTQ.