The Kimball Group Reader

Ralph Kimball
Description

The final edition of the incomparable data warehousing and business intelligence reference, updated and expanded

The Kimball Group Reader, Remastered Collection is the essential reference for data warehouse and business intelligence design, packed with best practices, design tips, and valuable insight from industry pioneer Ralph Kimball and the Kimball Group. This Remastered Collection represents decades of expert advice and mentoring in data warehousing and business intelligence, and is the final work to be published by the Kimball Group. Organized for quick navigation and easy reference, this book contains nearly 20 years of experience on more than 300 topics, all fully up-to-date and expanded with 65 new articles. The discussion covers the complete data warehouse/business intelligence lifecycle, including project planning, requirements gathering, system architecture, dimensional modeling, ETL, and business intelligence analytics, with each group of articles prefaced by original commentaries explaining their role in the overall Kimball Group methodology.

The data warehousing/business intelligence industry's multi-billion dollar value is due in no small part to the contributions of Ralph Kimball and the Kimball Group. Their publications are the standards on which the industry is built, and nearly all data warehouse hardware and software vendors have adopted their methods in one form or another. This book is a compendium of Kimball Group expertise, and an essential reference for anyone in the field.

  • Learn data warehousing and business intelligence from the field's pioneers
  • Get up to date on best practices and essential design tips
  • Gain valuable knowledge on every stage of the project lifecycle
  • Dig into the Kimball Group methodology with hands-on guidance

Ralph Kimball and the Kimball Group have continued to refine their methods and techniques based on thousands of hours of consulting and training. This Remastered Collection of The Kimball Group Reader represents their final body of knowledge, and is nothing less than a vital reference for anyone involved in the field.




Table of Contents

Introduction

Intended Audience and Goals

Preview of Contents

Navigation Aids

Terminology Notes

Chapter 1: The Reader at a Glance

Setting Up for Success

Tackling DW/BI Design and Development

Chapter 2: Before You Dive In

Before Data Warehousing

Historical Perspective

Dealing with Demanding Realities

Chapter 3: Project/Program Planning

Professional Responsibilities

Justification and Sponsorship

Kimball Methodology

Chapter 4: Requirements Definition

Gathering Requirements

Organizing around Business Processes

Wrapping Up the Requirements

Chapter 5: Data Architecture

Making the Case for Dimensional Modeling

Enterprise Data Warehouse Bus Architecture

Agile Project Considerations

Integration Instead of Centralization

Contrast with the Corporate Information Factory

Chapter 6: Dimensional Modeling Fundamentals

Basics of Dimensional Modeling

Dos and Don'ts

Myths about Dimensional Modeling

Chapter 7: Dimensional Modeling Tasks and Responsibilities

Design Activities

Design Review Activities

Chapter 8: Fact Table Core Concepts

Granularity

Types of Fact Tables

Parent-Child Fact Tables

Fact Table Keys and Degenerate Dimensions

Miscellaneous Fact Table Design Patterns

Chapter 9: Dimension Table Core Concepts

Dimension Table Keys

Date and Time Dimension Considerations

Miscellaneous Dimension Patterns

Slowly Changing Dimensions

Chapter 10: More Dimension Patterns and Considerations

Snowflakes, Outriggers, and Bridges

Dealing with Hierarchies

Customer Issues

Addresses and International Issues

Industry Scenarios and Idiosyncrasies

Chapter 11: Back Room ETL and Data Quality

Planning the ETL System

Data Quality Considerations

Populating Fact and Dimension Tables

Supporting Real Time

Chapter 12: Technical Architecture Considerations

Overall Technical/System Architecture

Presentation Server Architecture

Front Room Architecture

Metadata

Infrastructure and Security Considerations

Chapter 13: Front Room Business Intelligence Applications

Delivering Value with Business Intelligence

Implementing the Business Intelligence Layer

Mining Data to Uncover Relationships

Dealing with SQL

Chapter 14: Maintenance and Growth Considerations

Deploying Successfully

Sustaining for Ongoing Impact

Chapter 15: Final Thoughts

Key Insights and Reminders

A Look to the Future

Article Index

Index

End User License Agreement



List of Illustrations

Chapter 1: The Reader at a Glance

Figure 1.1 Drill-across report combining data from three subject area fact tables.

Figure 1.2 Bus matrix for a manufacturer's EDW.

Figure 1.3 Drill-across report with a dimension version mismatch.

Figure 1.4 Employee dimension designed for type 2 SCD.

Figure 1.5 Adding dimension attributes and numeric facts through simple user interface gestures.

Figure 1.6 Sample drill-across report.

Figure 1.7 The three different types of fact tables.

Chapter 2: Before You Dive In

Figure 2.1 Dimensional schema example.

Figure 2.2 Conformed dimensions used by multiple fact tables.

Figure 2.3 The enterprise data warehouse, showing the back room ETL staging area and the front room presentation area.

Figure 2.4 The data warehouse bus architecture, showing a series of data marts connecting to the conformed dimensions of the enterprise.

Chapter 3: Project/Program Planning

Figure 3.1 The Kimball Lifecycle diagram.

Figure 3.2 Impact versus feasibility prioritization.

Chapter 4: Requirements Definition

Figure 4.1 The intersection of business requirements and realities.

Figure 4.2 Matrix of business processes mapped to strategic business initiatives.

Figure 4.3 Simplified enterprise data warehouse bus matrix.

Figure 4.4 Opportunity matrix.

Figure 4.5 Prioritization grid.

Chapter 5: Data Architecture

Figure 5.1 A dimensional model for grocery store sales.

Figure 5.2 Two representations of a product dimension—the top snowflaked version is normalized to 3NF, and the bottom is denormalized into a flat dimension.

Figure 5.3 A high level 3NF model.

Figure 5.4 A detailed dimensional model for retail point of sale. Numbers 1 through 4 show places where the design may be gracefully extended.

Figure 5.5 The suppliers and parts database.

Figure 5.6 Enterprise data warehouse bus matrix for a telecommunications company.

Figure 5.7 A detailed data warehouse bus matrix.

Figure 5.8 Independent data marts/warehouses.

Figure 5.9 Dimensional data warehouse.

Figure 5.10 Normalized data warehouse with departmental data marts.

Figure 5.11 Hybrid of normalized data warehouse and dimensional data warehouse.

Chapter 6: Dimensional Modeling Fundamentals

Figure 6.1 A dimensional model for grocery store sales.

Figure 6.2 Dragging dimension attributes and facts into a report.

Figure 6.3 A product dimension table with both marketing and finance attributes.

Figure 6.4 Separate fact tables for each business process share common dimensions.

Figure 6.5 Two fact tables with conformed dimensions.

Figure 6.6 Potential changes to an existing dimensional model.

Chapter 7: Dimensional Modeling Tasks and Responsibilities

Figure 7.1 Customer invoice line item fact table.

Figure 7.2 Two subject areas with conformed dimensions.

Figure 7.3 An example of a “bad” fact table.

Figure 7.4 An example of a “good” fact table addressing the issues in Figure 7.3.

Figure 7.5 A customer billing fact table.

Figure 7.6 Example high level bubble chart diagram.

Chapter 8: Fact Table Core Concepts

Figure 8.1 Example procurement accumulating snapshot.

Figure 8.2 A transaction grained insurance fact table.

Figure 8.3 Three fact table grains in insurance.

Figure 8.4 Accumulating snapshot for the student admissions pipeline.

Figure 8.5 Daily student attendance is recorded in a factless fact table.

Figure 8.6 Factless coverage fact table for promotion events.

Figure 8.7 Bad idea #1—The transaction header is a dimension table.

Figure 8.8 Bad idea #2—None of the header dimensionality is directly joined to the line items.

Figure 8.9 Recommended structure for modeling header/line item transactions.

Figure 8.10 Fact table to capture customer information changes.

Figure 8.11 Balance transactions that reference the account surrogate key.

Chapter 9: Dimension Table Core Concepts

Figure 9.1 A sample data warehouse schema.

Figure 9.2 A durable key in an account dimension.

Figure 9.3 Duplicate product entries.

Figure 9.4 A basic dimensional design for order tracking.

Figure 9.5 The specific meaning of the transaction date depends on the transaction type dimension.

Figure 9.6 A dimension table with a date foreign key, as well as a date stamp attribute.

Figure 9.7 Multiple dates on an accumulating snapshot of orders data.

Figure 9.8 Multiple airports on a flight segment fact table.

Figure 9.9 Multiple business entities associated with a phone call.

Figure 9.10 Sample schema with extensive role playing.

Figure 9.11 The logical progression of transforming complex source data into its corresponding dimensional model.

Figure 9.12 A typical dimensional schema for a set of retail sales transactions with a “causal” promotion dimension.

Figure 9.13 Customer dimension with rapidly changing demographic attributes.

Figure 9.14 Separating frequently changing customer attributes in a demographics mini-dimension.

Figure 9.15 A mini-dimension for capturing profile changes with type 1 overwrite in the primary dimension table.

Figure 9.16 Tracking both “as was” and current profile attributes in a single dimension table.

Figure 9.17 One foreign key in the fact table joins to a SCD2 dimension table, while the natural/durable key joins to only the most current version of the dimension.

Figure 9.18 A dimension table with a series of type 3 attributes to track predictable changes.

Figure 9.19 Add a mini-dimension to make SCD type 4.

Figure 9.20 Add a mini-dimension and type 1 outrigger to make SCD type 5.

Figure 9.21 Add type 1 attributes to type 2 dimension to make SCD type 6.

Figure 9.22 Add dual type 1 and type 2 dimensions to make SCD type 7.

Figure 9.23 Slowly changing dimension techniques summary.

Figure 9.24 Customer dimension table with type 2 administration control columns.

Chapter 10: More Dimension Patterns and Considerations

Figure 10.1 Denormalized and snowflaked versions of the same dimension table.

Figure 10.2 A date dimension outrigger joined to the employee dimension.

Figure 10.3 A bridge table resolves a many-valued dimension attribute.

Figure 10.4 A bridge table models a hierarchical variable-depth relationship.

Figure 10.5 Shopper dimension with both visitor and customer attributes.

Figure 10.6 A shopper dimension with a snowflake for customer attributes.

Figure 10.7 A financial product dimension with a subdimension for each product type.

Figure 10.8 A calendar dimension with a higher cardinality subdimension.

Figure 10.9 A typical dimensional model for ATM transactions.

Figure 10.10 A healthcare billing fact table.

Figure 10.11 A bridge table to handle multiple diagnoses.

Figure 10.12 Resolving the relationship of multiple customers associated with an account using a bridge table.

Figure 10.13 The fact table for a letter archive data warehouse showing the keyword bridge dimension table.

Figure 10.14 The same fact table showing the keyword dimension containing a variable-length text string keyword list.

Figure 10.15 A banking example showing an account-to-customer bridge table.

Figure 10.16 A sales fact table capturing multiple sales reasons.

Figure 10.17 Multiple sales reasons modeled as a pivoted sales reason dimension.

Figure 10.18 The account-to-customer bridge table without the mini-dimension.

Figure 10.19 The account-to-customer bridge table with the demographics mini-dimension.

Figure 10.20 Graphical representation of multiple date hierarchies.

Figure 10.21 Sample source data for a product dimension.

Figure 10.22 Normalized hierarchies in the source or ETL staging area.

Figure 10.23 A fact table whose grain is the individual invoice line item.

Figure 10.24 A schematic diagram of customer organizations that Big Nine Consultants sells consulting services to.

Figure 10.25 A bridge table between the fact and dimension table to navigate the organizational hierarchy.

Figure 10.26 Basic structure of the employee dimension and reports-to hierarchy.

Figure 10.27 Classic relational structure for a reports-to hierarchy.

Figure 10.28 Separate employee and reports-to (or job) dimensions.

Figure 10.29 Tracking history in the reports-to relationship with a natural key bridge table.

Figure 10.30 Forced fixed-depth reports-to hierarchy.

Figure 10.31 Sample org chart with pathstring values.

Figure 10.32 Using a bridge table to support multiple alternate hierarchies.

Figure 10.33 Treating satisfaction metrics as facts.

Figure 10.34 Using a bridge table between the transaction fact table and a satisfaction dimension.

Figure 10.35 Design for real-time customer status tracking with historical and current perspectives.

Figure 10.36 A typical fact table recording sales made in different time zones, with GMT as the absolute time stamp.

Figure 10.37 Primary and supplementary calendar dimensions for organizations that must track multiple national calendars.

Figure 10.38 A fact table simultaneously recording sales in many different currencies. The currency conversion table is only used when converting to a third currency.

Figure 10.39 A claims transaction dimensional model.

Figure 10.40 A claims monthly snapshot dimensional model.

Figure 10.41 Inappropriate way to handle the facts unique to each line of business.

Figure 10.42 Monthly snapshot for automobile coverages.

Figure 10.43 The flight segment dimensional model.

Figure 10.44 A human resources monthly snapshot with a type 2 slowly changing employee dimension table.

Figure 10.45 This value chain represents a product flow where each process is a fact table.

Figure 10.46 The budget chain showing the budget, commitments, and payments fact tables.

Figure 10.47 Compliance-enabled fact and dimension tables.

Figure 10.48 The dimensional model for clickstream events.

Figure 10.49 A market basket fact table based on atomic retail sales data.

Chapter 11: Back Room ETL and Data Quality

Figure 11.1 Simplified conceptual ETL plan.

Figure 11.2 Target customer dimension table.

Figure 11.3 Sample source rows.

Figure 11.4 Typical architecture to support FTP-based integration.

Figure 11.5 The error event schema.

Figure 11.6 A sample audit dimension attached to a shipments invoice fact table.

Figure 11.7 Normal and instrumented reports using the audit dimension.

Figure 11.8 An insurance fact table augmented with an exposure confidence fact and uncertainty dimension.

Figure 11.9 A simple audit dimension.

Figure 11.10 Normal and instrumented reports using the audit dimension.

Figure 11.11 Sample source data.

Figure 11.12 Standardized company names for the sample source data.

Figure 11.13 The original loading of a dimension where surrogate keys are assigned sequentially to every input record and the natural key becomes an ordinary attribute.

Figure 11.14 The lookup table for a typical dimension with as many rows as there are unique natural keys.

Figure 11.15 Dimension processing logic for all refreshes of a dimension table after the original load.

Figure 11.16 The pipelined fact table processing logic for replacing all natural keys (designated as IDs) with surrogate keys.

Figure 11.17 Sample source data.

Figure 11.18 Sample junk dimension rows.

Figure 11.19 Source transaction data and associated diagnosis group table.

Figure 11.20 Diagnosis Bridge table and associated ICD10_Diagnosis dimension.

Figure 11.21 The web-intensive data warehouse architecture, showing the hot response cache and the application server that creates its contents.

Chapter 12: Technical Architecture Considerations

Figure 12.1 Approach 1: Managing master data in the data warehouse ETL.

Figure 12.2 Approach 2: Implementing MDM via an integration hub.

Figure 12.3 Approach 3: The enterprise MDM system.

Figure 12.4 The extended RDBMS–based architecture for an enterprise data warehouse.

Figure 12.5 Comparison of the storage, metadata, and query layers between conventional RDBMSs and the Hadoop data warehouse stack.

Figure 12.6 HDFS viewed as a data warehouse environment.

Figure 12.7 Capability comparison of conventional RDBMS and Hadoop systems.

Figure 12.8 The aggregate navigator sits between the user application and DBMS.

Figure 12.9 Aggregate records identified by the level attributes, which is not recommended.

Figure 12.10 Aggregate records stored in separate fact and shrunken dimension tables, which is recommended.

Figure 12.11 A base-level dimensional model.

Figure 12.12 An aggregated dimensional schema derived from Figure 12.11 by aggregating to the category level.

Figure 12.13 The three-way aggregate table formed by aggregating time to month, product to category, and store to region.

Figure 12.14 A typical storage area network configuration.

Chapter 13: Front Room Business Intelligence Applications

Figure 13.1 The analytic application lifecycle.

Figure 13.2 A sample report mockup.

Figure 13.3 Sample SQL to hand off data from the warehouse to a data mining application.

Figure 13.4 The overall data mining process.

Figure 13.5 The primary data sets used by data mining tools.

Figure 13.6 Sample data from the two subqueries and the final drill-across results.

Figure 13.7 A sample report comparing this year's sales to last year.

Figure 13.8 A more interesting and challenging set of comparison metrics versus Figure 13.7.

Figure 13.9 Sample report illustrating several sequential operations.

Figure 13.10 Non-additive calculations, like average price, must be performed where all the underlying calculation components are available.

Figure 13.11 Sample report with rankings.

Figure 13.12 Sample report with product break rows.

Figure 13.13 A special behavior dimension is added to a normal dimensional model to restrict the schema to the context of a behaviorally defined group of products.

Figure 13.14 A sample schema with stock prices.

Chapter 15: Final Thoughts

Figure 15.1 Traditional and Supernova Customer Dimensions.

Introduction

The Kimball Group's article and Design Tip archive has been the most popular destination on our website (www.kimballgroup.com). Stretching back twenty years to Ralph's original 1995 DBMS magazine articles, the archive explores more than 250 topics, sometimes in more depth than provided by our books or courses.

With The Kimball Group Reader, Second Edition, we have organized all of the articles in a coherent way. But The Reader is more than merely a collection of our past magazine articles and Design Tips verbatim. We have trimmed the redundancy, made sure all the articles are written with the same consistent vocabulary, and updated many of the figures. This is a new and improved remastered compilation of our writings.

After considerable discussion, we decided to update many time references and edit content throughout the book to provide the perspective of 2015 rather than leaving old dates or outdated concepts in the articles. Thus an article written in 2007 may use 2015 in an example! When articles refer to the number of years that have passed, we have updated these references relative to 2015. For example, if a 2005 article originally said “during the past five years,” the article now reads “during the past fifteen years.” Mentions regarding our years of experience, number of books sold, articles written, or students taught have also been updated to 2015 figures. Finally, we occasionally changed references from outmoded technologies such as “modems” to more modern technologies, especially “internet.” We trust these changes will not mislead or cause confusion, but rather make your reading experience more natural.

Intended Audience and Goals

The primary reader of this book should be the analyst, designer, modeler, or manager who is delivering a data warehouse in support of business intelligence. The articles in this book trace the entire lifecycle of DW/BI system development, from original business requirements gathering all the way to final deployment. We believe that this collection of articles serves as a superb reference-in-depth for literally hundreds of issues and situations that arise in the development of a DW/BI system.

The articles range from a managerial focus to a highly technical focus, although in all cases, the tone of the articles strives to be educational. These articles have been accessed thousands of times per day on the Kimball Group website over a span of 20 years, so we're confident they're useful. This book adds significant value by organizing the archive, and systematically editing the articles to ensure their consistency and relevance.

Preview of Contents

Following two introductory chapters, the book's organization will look somewhat familiar to readers of The Data Warehouse Lifecycle Toolkit, Second Edition (Wiley, 2008) because we've organized the articles topically to correspond with the major milestones of a data warehouse/business intelligence (DW/BI) implementation. Not surprisingly, given that the word “Kimball” is practically synonymous with dimensional modeling, much of The Reader focuses on that topic in particular.

Chapter 1: The Reader at a Glance. We begin the book with a series of articles written by Ralph several years ago for DM Review magazine. This series succinctly encapsulates the Kimball approach in a cohesive manner, so it serves as a perfect overview, akin to CliffsNotes, for the book.

Chapter 2: Before You Dive In. Long-time readers of Ralph's articles will find that this chapter is a walk down memory lane, as many of the articles are historically significant. Somewhat amazingly, the content is still very relevant even though most of these articles were written in the 1990s.

Chapter 3: Project/Program Planning. With an overview and history lesson under your belt, Chapter 3 moves on to getting the DW/BI program and project launched. We consider both the project team's and sponsoring stakeholders' responsibilities, and then delve into the Kimball Lifecycle approach.

Chapter 4: Requirements Definition. It is difficult to achieve DW/BI success in the absence of business requirements. This chapter delivers specific recommendations for effectively eliciting the business's needs. It stresses the importance of organizing the requirements findings around business processes, and suggests tactics for reaching organizational consensus on appropriate next steps.

Chapter 5: Data Architecture. With a solid understanding of the business requirements, we turn our attention to the data (where we will remain through Chapter 11). This chapter begins with the justification for dimensional modeling. It then describes the enterprise data warehouse bus architecture, discusses the agile development approach to support data warehousing, provides the rationale for the requisite integration and stewardship, and then contrasts the Kimball architecture with the Corporate Information Factory's hub-and-spoke.

Chapter 6: Dimensional Modeling Fundamentals. This chapter introduces the basics of dimensional modeling, starting with distinguishing a fact from a dimension, and the core activities of drilling down, drilling across, and handling time in a data warehouse. We also explore familiar fables about dimensional models.

Chapter 7: Dimensional Modeling Tasks and Responsibilities. While Chapter 6 covers the fundamental “what and why” surrounding dimensional modeling, this chapter focuses on the “how, who, and when.” Chapter 7 describes the dimensional modeling process and tasks, with the aim of organizing an effective team, whether starting with a blank slate or revisiting an existing model.

Chapter 8: Fact Table Core Concepts. The theme for Chapter 8 could be stated as “just the facts, and nothing but the facts.” We begin by discussing granularity and the three fundamental types of fact tables, and then turn our attention to fact table keys and degenerate dimensions. The chapter closes with a potpourri of common fact table patterns, including null, textual, and sparsely populated metrics, as well as facts that closely resemble dimension attributes.

Chapter 9: Dimension Table Core Concepts. We shift our focus to dimension tables in Chapter 9, starting with a discussion of surrogate keys and the ever-present time (or date) dimensions. We then explore role playing, junk, and causal dimension patterns, before launching into a thorough handling of slowly changing dimensions, including four new advanced dimension types. Hang onto your hats.

Chapter 10: More Dimension Patterns and Considerations. Chapter 10 complements the previous chapter with more meaty coverage of dimension tables. We describe snowflakes and outriggers, as well as a significantly updated section on bridges for handling both multi-valued dimension attributes and ragged variable hierarchies. We discuss nuances often encountered in customer dimensions, along with internationalization issues. The chapter closes with a series of case studies covering insurance, voyages and networks, human resources, finance, electronic commerce, text searching, and retail; we encourage everyone to peruse these vignettes as the patterns and recommendations transcend industry or application boundaries.

Chapter 11: Back Room ETL and Data Quality. We switch gears from designing the target dimensional model to populating it in Chapter 11. Be forewarned: This is a hefty chapter, as you'd expect given the subject matter. This updated edition of the Reader has a wealth of new material in this chapter. We start by describing the 34 subsystems required to extract, transform, and load (ETL) the data, along with the pros and cons of using a commercial ETL tool. From there, we delve into data quality considerations, provide specific guidance for building fact and dimension tables, and discuss the implications of real-time ETL.

Chapter 12: Technical Architecture Considerations. It's taken us until Chapter 12, but we're finally discussing issues surrounding the technical architecture, starting with service-oriented architecture (SOA), master data management (MDM), and packaged analytics. A new section on big data features two in-depth Kimball Group white papers written by Ralph. Final sections in this chapter focus on the presentation server, including the role of aggregate navigation and online analytical processing (OLAP), user interface design, metadata, infrastructure, and security.

Chapter 13: Front Room Business Intelligence Applications. In Chapter 13, we step into the front room of the DW/BI system where business users are interacting with the data. We describe the lifecycle of a typical business analysis, starting with a review of historical performance but not stopping there. We then turn our attention to standardized BI reports before digging into data mining and predictive analytics. The chapter closes by exploring the limitations of SQL for business analysis.

Chapter 14: Maintenance and Growth Considerations. In this penultimate chapter, we provide recommendations for successfully deploying the DW/BI system, as well as keeping it healthy for sustained success.

Chapter 15: Final Thoughts. The Reader concludes with final perspectives on data warehousing and business intelligence from each Kimball Group principal. The insights range from the most important hard-won lessons we have learned to some glimpses of what the future of data warehousing may hold.

Navigation Aids

Given the breadth and depth of the articles in The Kimball Group Reader, we have very deliberately identified over two dozen articles as “Kimball Classics” because they captured a concept so effectively that we, and many others in the industry, have referred to these articles repeatedly over the past twenty years. The classic articles are designated with a special icon.

We expect most people will read the articles in somewhat random order, rather than digesting the book from front to back. Therefore, we have put special emphasis on The Reader's index as we anticipate many of you will delve in by searching the index for a particular technique or modeling situation.

Terminology Notes

We are very proud that the vocabulary established by Ralph has been so durable and broadly adopted. Kimball “marker words,” including dimensions, facts, slowly changing dimensions, surrogate keys, fact table grains, factless fact tables, and degenerate dimensions, have been used consistently across the industry for more than twenty years. But in spite of our best intentions, a few terms have morphed since their introduction; we have retroactively replaced the old terms with the accepted current ones.

Artificial keys are now called surrogate keys.

Data mart has been replaced with business process dimensional model, business process subject area, or just subject area, depending on the context.

Data staging is now known as extract, transform, and load.

End user applications have been replaced by business intelligence applications.

Helper tables are now bridge tables.

Since most people won't read this book from cover to cover, we need to introduce some common abbreviations up front:

DW/BI is shorthand for the end-to-end data warehouse/business intelligence system. This abbreviation is useful for brevity, but it also explicitly links data warehousing and business intelligence as codependent. Finally, it reflects the shift of emphasis from the data warehouse being an end in itself to business intelligence (BI) really driving everything we do. After all, the data warehouse is the platform for all forms of BI.

Many figures in The Reader include the DD, FK, and PK abbreviations, which stand for degenerate dimension, foreign key, and primary key, respectively.

ETL means extract, transform, and load, the standard paradigm for acquiring data and making it ready for exposure to BI tools.

ER refers to entity-relationship. We frequently use ER when we discuss third normal form (3NF) or normalized data models, as opposed to dimensional data models.

OLAP stands for online analytical processing, typically used to differentiate dimensional models captured in a multidimensional database or cube from dimensional models in a relational DBMS called star schemas. These relational star schemas are sometimes referred to as ROLAP.

SCD is the abbreviation for slowly changing dimension, referring to the techniques we've established for handling dimension attribute changes.

Chapter 1: The Reader at a Glance

Beginning in late 2007, Ralph wrote a series of articles for DM Review magazine (now called Information Management). Published over a 16-month time frame, this sequence systematically describes the Kimball approach and classic best practices in a cohesive manner. Rather than scattering these articles topically throughout the book, we opted to present the series nearly in its entirety because it provides an overview of the content that follows in subsequent chapters. You can think of Chapter 1 as CliffsNotes for The Kimball Group Reader. Please note that specific date and contextual references throughout this chapter have been freshened to the perspective of 2015, as explained in the Introduction.

The chapter begins with several articles encouraging you to practice restraint and establish appropriate boundaries with other stakeholders when embarking on a data warehouse/business intelligence (DW/BI) project. From there, the series turns its attention to bringing operational data into the data warehouse and then leveraging core dimensional modeling principles to deliver robust analytic capabilities to the business users.

In addition to the articles in this chapter, Ralph also wrote a very detailed article on data quality for DM Review. Due to its in-depth coverage, this article is presented in Chapter 11 with other back room extract, transform, and load (ETL) topics.

Setting Up for Success

Before diving into implementing the DW/BI system, make sure you assess the complete set of related requirements, while avoiding the risks of overpromising.

1.1 Resist the Urge to Start Coding

Ralph Kimball, DM Review, Nov 2007

The most important first step in designing a DW/BI system, paradoxically, is to stop. Step back for a week, and be absolutely sure you have a sufficiently broad perspective on all the requirements that surround your project. The DW/BI design task is a daunting intellectual challenge, and it is not easy to step far enough back from the problem to protect yourself from embarrassing or career-threatening problems discovered after the project is underway.

Before cutting any code, designing any tables, or making a major hardware or software purchase, take a week to write down thoughtful, high quality answers to the following 10 questions, each of which is a reality that will come to control your project at some point. These define the classic set of simultaneous constraints faced by every DW/BI effort.

Business requirements.

Are you in touch with the key performance indicators (KPIs) your users actually need to make the decisions currently important to their enterprise? Although all 10 questions are important, understanding the business requirements is the most fundamental and far reaching. If you have a positive answer to this question, you can identify the data assets needed to support decision making, and you will be able to decide which measurement process to tackle first.

Strategic data profiling.

Have you verified that your available data assets are capable of supporting the answers to question number one? The goal of strategic data profiling is to make “go/no go” decisions very early in the DW/BI project as to whether to proceed with a subject area.

Tactical data profiling.

Is there a clear executive mandate to support the necessary business process re-engineering required for an effective data quality culture, perhaps even driving for Six Sigma data quality? The only real way to improve data quality is to go back to the source and figure out why better data isn't being entered. Data entry clerks are not the cause of poor data quality! Rather, the fixes require an end-to-end awareness of the need for better quality data and a commitment from the highest levels to change how business processes work.

Integration.

Is there a clear executive mandate in your organization to define common descriptors and measures across all your customer-facing processes? All of the organizations within your enterprise that participate in data integration must come to agreement on key descriptors and measures. Have your executives made it clear that this must happen?

Latency.

Do you have a realistic set of requirements from business users for how quickly data must be published by the data warehouse, including as-of-yesterday, many times per day, and truly instantaneous?

Compliance.

Have you received clear guidance from senior management as to which data is compliance-sensitive, and where you must guarantee that you have protected the chain of custody?

Security.

Do you know how you are going to protect confidential as well as proprietary data in the ETL back room, at the users' desktops, over the web, and on all permanent media?

Archiving.

Do you have a realistic plan for very long term archiving of important data, and do you know what data should be archived?

Supporting business users.

Have you profiled all your user communities to determine their abilities to use spreadsheets, construct database requests in ad hoc query tools, or just view reports on their screens? Are the users expecting to attach high-end predictive analysis or data mining tools to the underlying data?

IT licenses and skill sets.

Are you prepared to rely on the major technology site licenses your organization has already committed to, and do you have enough staff with advanced skills to exploit the technical choices you make? Do you know which parts of your organization expect to access big data or the Internet of Things, and do you have the skills to support these activities that often arise from outside IT?

Time spent answering these classic DW questions is enormously valuable. Every one of the answers will affect the architecture, choice of approaches, and even the feasibility of your DW/BI project. You dare not start coding before all the team members understand what these answers mean!

The big news is that business users have seized control of the DW. They may not be building the technical infrastructure, but they are quite sure that they own the data warehouse and the BI tools and those tools must meet their needs. This transfer of initiative from IT to the users has been very obvious over the past 15 years. Witness the soul-searching articles and industry speeches exhorting CIOs to show more business leadership and the high CIO turnover as reported in CIO Magazine (see the April 1, 2004 issue at www.cio.com).

Many of the 10 questions in this article are brought into much clearer focus by increased user ownership of the DW/BI system. Let's focus on the top five new urgent topics, in some cases coalescing our questions:

Business requirements.

The DW/BI system needs a permanent “KPI team” continuously in touch with business users' analytic needs and the consequent demand for new data sources to support new KPIs. Also, the system should increasingly support the full gamut of analytic applications, which include not only data delivery, but alerting the users to problems and opportunities, exploring causal factors with additional data sources, testing what-if scenarios to evaluate possible decisions, and tracking the decisions made. The DW/BI system is not just about displaying reports, but rather must be a platform for decision making in the broadest sense. The oldest label for data warehousing, decision support, remains surprisingly apt.

Strategic data profiling.

The earlier you tell the users bad news about the viability of a proposed data source, the more they will appreciate you. Develop the ability to assess a data source within a day or two. Elevate the data profiling tool to a strategic, must-have status.

Tactical data profiling.

The increased awareness of data quality is one of the most remarkable new DW perspectives, certainly driven by business users. But all is for naught if the business is not willing to support a quality culture and the end-to-end business process re-engineering required.

Integration and latency.

The user demand for the 360-degree integrated view of the business has been more like an approaching express train than a shock wave. We have been talking about it for more than a decade. But now the demands of integration, coupled with real-time access to information, have combined these two issues into a significant new architectural challenge.

Compliance and security.

DW/BI folks in IT often don't have the right instincts for protecting data because the system is supposed to be about exposing data. But this new emphasis on compliance and security must be built systematically into the data flows and the BI tools across the entire DW/BI solution.

The purpose of this first article has been to expose the fundamental design issues every DW/BI design team faces and to bring to the surface the urgent new requirements. In this ongoing series of articles, I probe each of these areas in some depth, reminding us of the remarkably unchanging aspects of data warehousing, while at the same time trying to catch the winds of change.

1.2 Set Your Boundaries

Ralph Kimball, DM Review, Dec 2007

In article 1.1, Resist the Urge to Start Coding, I encouraged you to pause briefly before charging forward on your ambitious DW/BI project. You were supposed to use this pause to answer a checklist of major environmental questions regarding business requirements, quality data, and whether your organization is ready to attack the hard issues of integration, compliance, and security.

While answering the questions, I hope you talked to all your business user clients and sponsors who may have a stake or a responsibility in the DW/BI system. Before the memory of these conversations fades away, I suggest you make a thorough list of all the promises you made as you were selling the concept of the DW/BI system. It wouldn't surprise me if you said, “Yes, we will…”

Tie the rolling operational results to the general ledger (GL).

Implement effective compliance.

Identify and implement all the key performance indicators (KPIs) needed by marketing, sales, and finance and make them available in the executive dashboard.

Encourage the business community to add new cost drivers to our system requirements so that they can calculate activity-based costing and accurate profit across the enterprise. And while we are adding these cost drivers, we'll work out all the necessary allocation factors to assign these costs against various categories of revenue.

Identify and implement all the customer satisfaction indicators needed by marketing.

Seamlessly integrate all the customer-facing operational processes into a single coherent system.

Promise to use exclusively the front end, middleware, and back end tools provided by the enterprise resource planning (ERP) vendor whose worldwide license was just signed by our CEO.

Be the first showcase application for the new service-oriented architecture (SOA) initiative, and we'll implement, manage, and validate the new infrastructure.

Implement and manage server virtualization for the DW/BI system. And this new system will be “green.”

Implement and manage the storage area network (SAN) for the DW/BI system.

Implement and manage security and privacy for all data in the DW/BI system, including responsibility for the lightweight directory access protocol (LDAP) directory server and its associated authentication and authorization functions. We'll also make sure that all data accesses by the sales force on their mobile devices in the field are secure.

Define the requirements for long term archiving and recovery of data looking forward 20 years.

Looking at this list of promises all at once, you might wonder who in their right mind would agree to them. Actually, I am much more sympathetic than it may seem. You must address these topics because they are all key facets of the DW/BI challenge. But if you gave the answers as literally stated, you have lost control of your boundaries. You have taken on far too much, you have made promises you can't deliver, and your business clients and enterprise bosses have abrogated or avoided key responsibilities that they must own. More seriously, even if you think you can deliver all these promises, you are not in a powerful enough position in your enterprise to make all these results happen.

You don't have to be a curmudgeon to be a good DW/BI system manager. This isn't about saying no to every possible responsibility. You will be doing your enterprise a favor by alerting and educating your business users and management to the appropriate boundaries of responsibilities. You can still be an enthusiastic advocate, as long as your boundaries are clear. Let's describe the key boundaries.

Boundaries with the business users. Your job is to find the business users, interview them, and interpret what they tell you into specific DW/BI deliverables. You must assemble a findings document that describes the results of the interviews and how you interpreted what the business users told you. Their responsibility is to be available for the interviews and to put energy into describing how they make decisions. Later in the process, the business users have a responsibility to provide feedback on your findings. You cannot attempt to define business requirements unless the business user community is an equal partner with IT.

Your job is not over after the first round of interviews. You must encourage ongoing business user feedback and suggestions, and also educate the business users as to the realities of system development. View this as a mutual learning process. In the latter stages of DW/BI system development, you simply cannot add new KPIs and especially new data sources to the project without slipping the delivery date. You cannot suddenly change a batch-oriented system into a real-time pipeline. Your business users must be understanding and trusting partners of the DW/BI system development, and they have to understand the costs of sudden new requirements. Bottom line—business users must become sophisticated observers of the DW/BI development process and know when it is inappropriate to change the scope by adding new KPIs, new data sources, or new real-time requirements.

Boundaries with finance. Of the promises you made, several should be the responsibility of finance. You should never agree to implement cost allocations, even if the “profit system” is your main responsibility. Not only are cost allocations very complex, but the assignment of costs to various revenue-producing departments is bad news politically. In this case, finance should work out the logical and political implications of the cost allocations, and you can quietly implement them.

You also should never agree to tie rolling operational results to the GL. In dimensional modeling parlance, you can't make this happen because the GL dimensions, such as organization and account, can't be conformed to the operational dimensions, such as customer and product. Also, special GL transactions, such as journal adjustments done at the end of the month, often cannot be put into an operational context. Again, you need to hand this issue back to finance and wait for a solution from them.

Boundaries across organizations. These days it is hard to find anyone who argues against integration of all your data assets under the DW/BI umbrella. But this challenge is 70 percent political and only 30 percent technical. Your executives must establish a corporate culture that sends a very clear message to all the separate departments that they must come together to agree on common dimensional attributes, key performance metrics, and calendars. Your executives must lead the way before you can do your job.

Boundaries with legal. In the early '90s, we often lamented that the data warehouse wasn't seeing widespread use. Well, now we have the opposite problem. A big piece, shall I say headache, of being taken very seriously is providing adequate security, privacy, archiving, and compliance across the DW/BI system. But you can't do anything until you understand your enterprise's policies. You must not define these policies yourself. You can lose your job and go to jail if you get these wrong. Go to your legal department with a list of areas where you need firm guidance.

Boundaries with IT. Strangely, one of the most important boundaries you must maintain is with IT. You should be able to rely on other groups within IT for storage (either SAN or network-attached storage), server virtualization, LDAP server maintenance, authentication technologies, providing new infrastructure such as SOA, and support for big data infrastructure such as Hadoop.

Most of us in the DW/BI business are natural salespeople. We are evangelists for the use of our systems because we really believe they will benefit the business. But we need to be conscious of trying to please the client too much. Ultimately, the DW/BI system will be much more successful if all the other parties described in this article are equal, responsible partners.

Tackling DW/BI Design and Development

This group of articles focuses on the big issues that are part of every DW/BI system design.

1.3 Data Wrangling

Ralph Kimball, DM Review, Jan 2008

In this article, we are ready to design the first stage of the data pipeline leading from the operational sources to the final BI user interfaces. I call this stage “data wrangling” because we must lasso the data and get it under our control. Successful data wrangling includes change data capture, extraction, data staging, archiving, and the first step of data warehouse compliance. Let's examine these narrow data wrangling responsibilities.

The amount of data processed in each data extract should be kept to the bare minimum. You should strive not to download complete copies of source tables, but sometimes you must. Limiting the data extract to the bare minimum is a fascinating challenge and can be harder than it appears. The first architectural choice is whether to perform change data capture on the production source computer or after extraction to a machine owned by the data warehouse. From the data warehousing point of view, the more attractive alternative is doing change data capture at the production source. For this you need cooperation from the production source database administrators (DBAs), adequate processing resources on the production machine, and a very high quality scheme for identifying 100 percent of the changes that have occurred since the previous load.

To design the change data capture system on the production source, you need to have a very candid conversation with the production system DBAs. You need to identify every situation in which a change to source data could happen. These include normal applications posting transactions, special administrative overrides, and emergency scenarios, such as restoration of a data set.

One popular way to look for source data changes is to query a change_date_time field in the source table. This is a pretty strong approach if this field is populated by database triggers that are not circumvented by any process. But many production applications prohibit the use of triggers for performance reasons. Also, how does such an approach handle record deletes? If the record simply vanishes, you won't find it by querying the change_date_time field. But maybe you can collect the deletes in a separate feed.
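As a minimal sketch of this timestamp-based extract in Python (the table name, column names, and DB-API-style connection are illustrative assumptions, not from the article), the query pulls only rows stamped after the prior load's high-water mark:

from datetime import datetime

# Hypothetical source table with a trigger-maintained change_date_time
# column, queried over a DB-API-style connection (pyformat parameters).
CDC_QUERY = """
    SELECT *
    FROM source_customer
    WHERE change_date_time > %(last_load)s
"""

def extract_changes(conn, last_load: datetime):
    # Deleted rows simply vanish from the table, so this query can never
    # see them -- the weakness noted above; a separate feed of deletes
    # is needed to close that gap.
    with conn.cursor() as cur:
        cur.execute(CDC_QUERY, {"last_load": last_load})
        return cur.fetchall()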

Another approach is a special production system daemon that captures every input command by reading the production system transaction log or by intercepting message queue traffic. The daemon approach solves the delete problem but is still vulnerable to special administrative override commands performed manually by the DBAs. Some of you may think such overrides are crazy, but I have seen some very well-run shops resort to doing these overrides occasionally because of weird business rules that are simply too complicated to program into the normal transaction processing applications.

If you have figured out an acceptable scheme for isolating all data changes at the source, you still need to ask for one more favor, if you have any political capital left with the source system DBAs. You need to get a reason code for all changes to the major dimensional entities, such as customer or product. In dimensional modeling parlance, these reason codes will tell you whether the change to an individual dimensional attribute should be treated as a slowly changing dimension (SCD) type 1, 2, or 3. These distinctions are a big deal. The ETL pipelines required to process these three SCD choices are completely different.
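To make the distinction concrete, here is a minimal sketch (the reason codes and the defaulting policy are illustrative assumptions) of how an ETL pipeline might translate a source-supplied reason code into the SCD technique to apply:

# Hypothetical reason codes mapped to slowly changing dimension types.
SCD_TYPE_BY_REASON = {
    "error_correction": 1,   # type 1: overwrite in place, keep no history
    "real_world_change": 2,  # type 2: add a new row with a new surrogate key
    "restatement": 3,        # type 3: move the old value to a "prior" column
}

def scd_type_for(reason_code: str) -> int:
    # Defaulting to type 2 errs on the side of preserving history rather
    # than silently overwriting it; this default is an assumption, not a
    # rule from the article.
    return SCD_TYPE_BY_REASON.get(reason_code, 2)

print(scd_type_for("error_correction"))  # -> 1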

If your production system presents too many objections, consider doing change data capture after extraction. Now you must download much larger data sets, perhaps complete dimensions or even complete fact tables. But you are guaranteed to find every change in the source system, as long as you keep a complete prior copy of the source system tables against which to compare.

If you download a complete source table today, you can find all the changes by performing a record-by-record and field-by-field comparison against a copy of yesterday's source table. You will indeed find every change, including the deletes. But in this case, you are probably missing reason codes for dimensional attribute changes. If so, you may need to impose unilaterally a reason code policy crafted for each attribute. In other words, if the package type of an existing product suddenly is changed, you could always assume that manufacturing is correcting a data error, and hence the change is always type 1.

If the table you are comparing is very large, the brute force approach of comparing each field can take too long. You can often improve this comparison step by a factor of 10 using a special hash code, called a cyclic redundancy checksum (CRC). For a discussion of this advanced technique, see the discussion of cyclic redundancy checksums on Wikipedia.
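A minimal sketch of this snapshot comparison, assuming each extract is keyed by the natural key and that the prior load persisted only its per-row checksums:

import zlib

def row_crc(row: dict) -> int:
    # Serialize fields in a fixed column order so the checksum is stable.
    return zlib.crc32("|".join(str(row[k]) for k in sorted(row)).encode())

def diff_snapshots(yesterday_crcs: dict, today_rows: dict):
    """yesterday_crcs maps natural key -> CRC saved by the prior load;
    today_rows maps natural key -> full row from today's extract."""
    inserts = [k for k in today_rows if k not in yesterday_crcs]
    deletes = [k for k in yesterday_crcs if k not in today_rows]
    # A CRC mismatch flags a changed row without a field-by-field compare.
    # In rare cases a collision can mask a change -- the known trade-off
    # for the roughly tenfold speedup mentioned above.
    updates = [k for k, row in today_rows.items()
               if k in yesterday_crcs and row_crc(row) != yesterday_crcs[k]]
    return inserts, updates, deletes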

Finally, even if you are sure you have accounted for 100 percent of the source system changes, you should periodically check the DW totals against totals computed directly on the source. This is like balancing your checkbook: when the two data sets disagree, you must manually investigate the discrepancy.
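A hedged sketch of such a reconciliation check, again with hypothetical table and column names and DB-API-style connections:

def totals(conn, table: str):
    # Counting rows and summing a measure catches both missing records
    # and corrupted values; "amount" is an illustrative column name.
    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}")
        return cur.fetchone()

def reconcile(source_conn, dw_conn):
    src = totals(source_conn, "orders")
    dw = totals(dw_conn, "fact_orders")
    if src != dw:
        raise ValueError(f"source {src} vs. warehouse {dw}: investigate")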

Extraction, whether it occurs before or after change data capture, is the transfer of data from the source system into the DW/BI environment. Besides actually moving the data, you have two main responsibilities in this step. First, you need to rid yourself of all narrowly proprietary data formats during the transfer itself. Change EBCDIC character formatting to ASCII. Unravel all IBM mainframe data formats (e.g., packed decimals and OCCURS statements) into standard relational database management system table and column formats. I also suggest unraveling XML hierarchical structures at this stage, even though XML structures can now be fully supported at a semantic level by relational databases.
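For the EBCDIC conversion specifically, Python's standard codecs cover the common mainframe code pages; this tiny illustration assumes the feed uses cp037, the US/Canada EBCDIC variant:

# "Hello" encoded in EBCDIC code page cp037.
ebcdic_bytes = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])
print(ebcdic_bytes.decode("cp037"))  # -> Hello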

Your second responsibility is to direct the flow of incoming data either into simple flat files or relational tables. Both choices are equally valid. You can process flat files very efficiently with sort utilities and sequential processing commands like grep and tr. Of course, you will eventually load everything into relational tables for joining and random access operations.

I recommend immediately staging