Using OpenRefine

Ruben Verborgh
Description

Data today is like gold - but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed - data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once it's clean, that's when you can start finding value.
Using OpenRefine takes you on a practical and actionable tour through this popular data transformation tool. Packed with cookbook-style recipes that will help you properly get to grips with data, this book is an accessible tutorial for anyone who wants to maximize the value of their data.
This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how to solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show how to extract links from keyword and full-text fields using reconciliation and named-entity extraction.
Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.


Table of Contents

Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example files
Errata
Piracy
Questions
1. Diving Into OpenRefine
Introducing OpenRefine
Recipe 1 – installing OpenRefine
Windows
Mac
Linux
Recipe 2 – creating a new project
File formats supported by OpenRefine
Recipe 3 – exploring your data
Recipe 4 – manipulating columns
Collapsing and expanding columns
Moving columns around
Renaming and removing columns
Recipe 5 – using the project history
Recipe 6 – exporting a project
Recipe 7 – going for more memory
Windows
Mac
Linux
Summary
2. Analyzing and Fixing Data
Recipe 1 – sorting data
Reordering rows
Recipe 2 – faceting data
Text facets
Numeric facets
Customized facets
Faceting by star or flag
Recipe 3 – detecting duplicates
Recipe 4 – applying a text filter
Recipe 5 – using simple cell transformations
Recipe 6 – removing matching rows
Summary
3. Advanced Data Operations
Recipe 1 – handling multi-valued cells
Recipe 2 – alternating between rows and records mode
Recipe 3 – clustering similar cells
Recipe 4 – transforming cell values
Recipe 5 – adding derived columns
Recipe 6 – splitting data across columns
Recipe 7 – transposing rows and columns
Summary
4. Linking Datasets
Recipe 1 – reconciling values with Freebase
Recipe 2 – installing extensions
Recipe 3 – adding a reconciliation service
Recipe 4 – reconciling with Linked Data
Recipe 5 – extracting named entities
Summary
A. Regular Expressions and GREL
Regular expressions for text patterns
Character classes
Quantifiers
Anchors
Choices
Groups
Overview
General Refine Expression Language (GREL)
Transforming data
Creating custom facets
Solving problems with GREL
Index

Using OpenRefine

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2013

Production Reference: 1040913

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78328-908-0

www.packtpub.com

Cover Image by Aniket Sawant (<[email protected]>)

Credits

Authors

Ruben Verborgh

Max De Wilde

Reviewers

Martin Magdinier

Dr. Mateja Verlic

Acquisition Editor

Sam Birch

Commissioning Editor

Subho Gupta

Technical Editors

Anita Nayak

Harshad Vairat

Project Coordinator

Sherin Padayatty

Proofreader

Paul Hindle

Indexer

Hemangini Bari

Production Coordinator

Nilesh R. Mohite

Cover Work

Nilesh R. Mohite

Foreword

At the time I joined Metaweb Technologies, Inc. in 2008, we were building up Freebase in earnest; entity by entity, fact by fact. Now you may know Freebase through its newest incarnation, Google's Knowledge Graph, which powers the "Knowledge panels" on www.google.com.

Building up "the world's database of everything" is a tall order that machines and algorithms alone cannot do, even if raw public domain data exists in abundance. Raw data from multiple sources must be cleaned up, homogenized, and then reconciled with data already in Freebase. Even that first step of cleaning up the data cannot be automated entirely; it takes the common sense of a human reader to know that if both 0.1 and 10,000,000 occur in a column named cost, they are very likely in different units (perhaps millions of dollars and dollars respectively). It also takes a human reader to decide that UCBerkley means the same as University of California in Berkeley, CA, but not the same as Berkeley DB.

If these errors occurred all the time, we might as well have given up, or simply hired enough people to perform manual data entry. They occur just often enough to be a problem, yet not often enough to convince anyone who has not dealt with such data that simple automation will not do. But, dear reader, you have dealt with data, and you know how unpredictably messy it can be.

Every dataset that we wanted to load into Freebase became an iterative exercise in programming mixed with manual inspection, which led to hard-coded transformation rules: from turning two-digit years into four-digit ones, to swapping given name and surname whenever there was a comma between them. Even for most of us programmers, this exercise got old quickly, and it was painful to start over every time.
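To give a flavor of what such rules look like in GREL, the expression language of OpenRefine that this book's Appendix covers, here are two illustrative transformations. They are only a sketch: the century cutoff of 30 and the "Surname, Given name" layout are assumptions made for the example, not rules taken from actual Freebase loading scripts.

Expanding a two-digit year into a four-digit one:

if(value.length() == 2, if(toNumber(value) > 30, "19" + value, "20" + value), value)

Swapping surname and given name when a comma sits between them:

if(value.contains(","), value.split(",")[1].trim() + " " + value.split(",")[0].trim(), value)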

So, we created Freebase Gridworks, a tool for cleaning up data and making it ready for loading into Freebase. We designed it to be a database-spreadsheet hybrid; it is interactive like spreadsheet software and programmable like databases. It was this combination that made Gridworks the first of its kind.

In the process of creating and then using Gridworks ourselves, we realized that cleaning, transforming, and just playing with data is crucial and generally useful, even if the goal is not to load data into Freebase. So, we redesigned the tool to be more generic, and released its Version 2 under the name "Google Refine" after Google acquired Metaweb.

Since then, Refine has been well received in many different communities: data journalists, open data enthusiasts, librarians, archivists, hacktivists, and even programmers and developers by trade. Its adoption in the early days spread through word of mouth, in hackathons and informal tutorials held by its own users.

Having proven itself through early adopters, Refine now needs better organized efforts to spread and become a mature product with a sustainable community around it. Expert users, open source contributors, and data enthusiast groups are actively teaching how to use Refine on tours and in the classroom. Ruben and Max from the Free Your Metadata team have taken the next logical step in consolidating those tutorials and organizing those recipes into this handy missing manual for Refine.

Stepping back to take in the bigger picture, we may realize that messy data is not any one person's own problem; rather, dealing with it is akin to keeping one's neighborhood safe and clean. It is not a big problem in itself, but it has implications for big issues such as transparency in government. Messy data discourages analysis and hides real-world problems, and we all have to roll up our sleeves to do the cleaning.

David Huynh

Original creator of OpenRefine

About the Authors

Ruben Verborgh is a PhD researcher in Semantic Hypermedia. He is fascinated by the Web's immense possibilities and tries to contribute ideas that will maybe someday slightly influence the way the Web changes all of us. His degree in Computer Science Engineering convinced him more than ever that communication is the most crucial thing for IT-based solutions. This is why he really enjoys explaining things to those eager to learn. In 2011, he launched the Free Your Metadata project together with Seth van Hooland and Max De Wilde, which aims to evangelize the importance of bringing your data to the Web. This book is one of the assets in this continuing quest.

He currently works at Multimedia Lab, a research group of iMinds, Ghent University, Belgium, in the domains of Semantic Web, Web APIs, and Adaptive Hypermedia. Together with Seth van Hooland, he's writing Linked Data for Libraries, Archives, and Museums (Facet Publishing), a practical guide for metadata practitioners.

Max De Wilde is a PhD researcher in Natural Language Processing and a teaching assistant at the Université libre de Bruxelles (ULB), department of Information and Communication Sciences. He holds a Master's degree in Linguistics from the ULB and an Advanced Master's in Computational Linguistics from the University of Antwerp. Currently, he is preparing a doctoral thesis on the impact of language-independent information extraction on document retrieval. At the same time, he works as a full-time assistant and supervises practical classes for Master's level students in a number of topics, including database quality, document management, and architecture of information systems.

About the Reviewers

Martin Magdinier has, during the last six years, been heavily engaged with startup and open data communities in France, Vietnam, and Canada. Through his recent projects (TTCPass and Objectif Neige) and consulting positions, he became intimately familiar with data massaging techniques. Coming from a business perspective, his focus is on data management and transformation tools that empower the business user. In 2011, he started to blog tips and tutorials on OpenRefine to help other business users make the most out of this tool. In 2012, when Google released the software to the community, he helped structure the new organization. Today, he continues to actively support the OpenRefine user base and advocates its usage in various communities.

Dr. Mateja Verlic is Head of Research at Zemanta and is an enthusiastic developer of the LOD-friendly distribution of OpenRefine. After finishing her PhD in Computer Science, she worked for two years as Assistant Professor at the University of Maribor, focusing mostly on machine learning, intelligent systems, text mining, and sentiment analysis. In 2011, when she joined Zemanta as an urban ninja and researcher, she began exploring the semantic web and has been really passionate about web technologies, lean startup, community projects, and open source software ever since.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com to download the datasets and projects to follow along with the recipes in this book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

To Linda, for her everlasting and loving support

Ruben Verborgh

To Hélène, and baby Jeanne

Max De Wilde

Preface

Data is often dubbed the new gold, as it is of tremendous value to today's data-driven economy. However, we prefer to think of data as diamonds: at first they're raw, but with great skill, they can be polished to become the shiny assets that are so valuable to us. This is precisely what this book covers: how your dataset can be transformed in OpenRefine so you can optimize its quality for real-world (re)use.

As the vast amount of functionality of OpenRefine can be overwhelming to new users, we are convinced that a decent manual can make the difference. This book will guide you from your very first steps to really advanced operations that you probably didn't know were possible. We will spend time on all different aspects of OpenRefine, so in the end, you will have obtained the necessary skills to revive your own datasets. This book starts out with cleaning the data to fix small errors, and ends by linking your dataset to others so it can become part of a larger data ecosystem.

We realize that every dataset is different, yet learning is easiest by example. This is why we have chosen the Powerhouse Museum dataset to demonstrate the techniques in this book. However, since not all steps apply to your dataset, we have structured the different tasks as recipes. Just like in a regular cookbook, you can pick just the recipes you need for what you want to achieve. Some recipes depend on each other, but this is indicated at the start of each chapter.

In addition, the example dataset in this book illustrates a healthy data culture; the people at Powerhouse decided to bring it online even though they were aware that there were still some quality issues. Interestingly, that didn't stop them from doing it, and in fact, it shouldn't stop you; the important thing is to get the data out. Since then, the data quality has significantly improved, but we're providing you with the old version so you can perform the cleaning and linking yourself.

We are confident this book will explain all the tools necessary to help you get your data into the best possible shape. As soon as you master the skill of polishing, the raw diamonds of data you have right now will start to shine.

Have fun learning OpenRefine!

Ruben and Max.

What this book covers

Chapter 1, Diving Into OpenRefine, teaches you the basic steps of OpenRefine, showing you how to import a dataset and how to get around in the main interface.

Chapter 2, Analyzing and Fixing Data, explains how you can get to know your dataset and how to spot errors in it. In addition, you'll also learn several techniques to repair mistakes.

Chapter 3, Advanced Data Operations, dives deeper into dataset repair, demonstrating some of the more sophisticated data operations OpenRefine has to offer.

Chapter 4, Linking Datasets, connects your dataset to others through reconciliation of single terms and with named-entity recognition on full-text fields.

Appendix, Regular Expressions and GREL, introduces you to advanced pattern matching and the General Refine Expression Language.

What you need for this book

This book does not assume any prior knowledge; we'll even guide you through the installation of OpenRefine in Chapter 1, Diving Into OpenRefine.

Who this book is for

This book is for anybody who is working with data, particularly large datasets. If you've been wondering how you can gain insight into the issues within your data, increase its quality, or link it to other datasets, then this book is for you.

No prior knowledge of OpenRefine is assumed, but if you've worked with OpenRefine before, you'll still be able to learn new things in this book. We cover several advanced techniques in the later chapters, with Chapter 4, Linking Datasets, entirely devoted to linking your dataset.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Program code inside text is shown as follows: "The expression that transforms the reconciled cell to its URL is cell.recon.match.id".
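For instance, once a column has been reconciled (a topic of Chapter 4, Linking Datasets), expressions of this kind can be entered via Edit column | Add column based on this column…. The following lines are only an illustrative sketch; the fallback text in the last expression is our own choice:

cell.recon.match.id returns the identifier of the entity the cell was matched to.

cell.recon.match.name returns the human-readable name of that entity.

if(cell.recon.judgment == "matched", cell.recon.match.id, "no match") returns the identifier only for cells that have actually been matched.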

New terms are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "After clicking on OK, you will see a new column with the corresponding URLs".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example files

You can download the raw data and OpenRefine projects to follow along with the recipes in the book. Each chapter has its own example file which can be downloaded from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Diving Into OpenRefine

In this opening chapter, we will discover what OpenRefine is made for, why you should use it, and how. After a short introduction, we will go through seven fundamental recipes that will give you a foretaste of the power of OpenRefine:

Recipe 1 – installing OpenRefine
Recipe 2 – creating a new project
Recipe 3 – exploring your data
Recipe 4 – manipulating columns
Recipe 5 – using the project history
Recipe 6 – exporting a project
Recipe 7 – going for more memory