28,99 €
Data today is like gold - but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed - data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once it's clean, that's when you can start finding value.
Using OpenRefine takes you on a practical and actionable tour through this popular data transformation tool. Packed with cookbook-style recipes that will help you properly get to grips with data, this book is an accessible tutorial for anyone who wants to maximize the value of their data.
This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll show you how to solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show how to extract links from keyword and full-text fields using reconciliation and named-entity extraction.
Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 157
Year of publication: 2013
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2013
Production Reference: 1040913
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-908-0
www.packtpub.com
Cover Image by Aniket Sawant (<[email protected]>)
Authors
Ruben Verborgh
Max De Wilde
Reviewers
Martin Magdinier
Dr. Mateja Verlic
Acquisition Editor
Sam Birch
Commissioning Editor
Subho Gupta
Technical Editors
Anita Nayak
Harshad Vairat
Project Coordinator
Sherin Padayatty
Proofreader
Paul Hindle
Indexer
Hemangini Bari
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
At the time I joined Metaweb Technologies, Inc. in 2008, we were building up Freebase in earnest; entity by entity, fact by fact. Now you may know Freebase through its newest incarnation, Google's Knowledge Graph, which powers the "Knowledge panels" on www.google.com.
Building up "the world's database of everything" is a tall order that machines and algorithms alone cannot do, even if raw public domain data exists in abundance. Raw data from multiple sources must be cleaned up, homogenized, and then reconciled with data already in Freebase. Even that first step of cleaning up the data cannot be automated entirely; it takes the common sense of a human reader to know that if both 0.1 and 10,000,000 occur in a column named cost, they are very likely in different units (perhaps millions of dollars and dollars respectively). It also takes a human reader to decide that UCBerkley means the same as University of California in Berkeley, CA, but not the same as Berkeley DB.
If these errors occurred often enough, we might as well have given up, or just hired enough people to perform manual data entry. But these errors occur just often enough to be a problem, and yet rarely enough that anyone who has not dealt with such data thinks simple automation is sufficient. But, dear reader, you have dealt with data, and you know how unpredictably messy it can be.
Every dataset that we wanted to load into Freebase became an iterative exercise in programming mixed with manual inspection that led to hard-coded transformation rules, from turning two-digit years into four-digit ones, to swapping given name and surname if there was a comma between them. Even for most of us programmers, this exercise got old quickly, and it was painful to start over every time.
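In Python, such hard-coded rules might look like the following sketch (the function names and the two-digit-year cut-off are illustrative assumptions, not code from Freebase or this book):

```python
import re

def expand_year(value):
    """Turn a two-digit year into four digits.
    Assumed heuristic: values below 30 belong to the 2000s, the rest to the 1900s."""
    if re.fullmatch(r"\d{2}", value):
        return ("20" if int(value) < 30 else "19") + value
    return value

def swap_name(value):
    """If a comma separates surname and given name, swap them back."""
    if "," in value:
        surname, given = (part.strip() for part in value.split(",", 1))
        return given + " " + surname
    return value

print(expand_year("86"))           # 1986
print(swap_name("De Wilde, Max"))  # Max De Wilde
```

Rules like these are trivial individually, but they accumulate with every new dataset, which is exactly the pain Gridworks was built to relieve.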
So, we created Freebase Gridworks, a tool for cleaning up data and making it ready for loading into Freebase. We designed it to be a database-spreadsheet hybrid; it is interactive like spreadsheet software and programmable like databases. It was this combination that made Gridworks the first of its kind.
In the process of creating and then using Gridworks ourselves, we realized that cleaning, transforming, and just playing with data is crucial and generally useful, even if the goal is not to load data into Freebase. So, we redesigned the tool to be more generic, and released its Version 2 under the name "Google Refine" after Google acquired Metaweb.
Since then, Refine has been well received in many different communities; data journalists, open data enthusiasts, librarians, archivists, hacktivists, and even programmers and developers by trade. Its adoption in the early days spread through word of mouth, in hackathons and informal tutorials held by its own users.
Having proven itself through early adopters, Refine now needs better organized efforts to spread and become a mature product with a sustainable community around it. Expert users, open source contributors, and data enthusiast groups are actively teaching how to use Refine on tours and in the classroom. Ruben and Max from the Free Your Metadata team have taken the next logical step in consolidating those tutorials and organizing those recipes into this handy missing manual for Refine.
Stepping back to take in the bigger picture, we may realize that messy data is not any one person's problem; dealing with it is more akin to ensuring that one's neighborhood is safe and clean. It may not seem like a big problem, but it has implications for big issues such as transparency in government. Messy data discourages analysis and hides real-world problems, and we all have to roll up our sleeves to do the cleaning.
David Huynh
Original creator of OpenRefine
Ruben Verborgh is a PhD researcher in Semantic Hypermedia. He is fascinated by the Web's immense possibilities and tries to contribute ideas that will maybe someday slightly influence the way the Web changes all of us. His degree in Computer Science Engineering convinced him more than ever that communication is the most crucial thing for IT-based solutions. This is why he really enjoys explaining things to those eager to learn. In 2011, he launched the Free Your Metadata project together with Seth van Hooland and Max De Wilde, which aims to evangelize the importance of bringing your data on the Web. This book is one of the assets in this continuing quest.
He currently works at Multimedia Lab, a research group of iMinds, Ghent University, Belgium, in the domains of Semantic Web, Web APIs, and Adaptive Hypermedia. Together with Seth van Hooland, he's writing Linked Data for Libraries, Archives, and Museums, Facet Publishing, a practical guide for metadata practitioners.
Max De Wilde is a PhD researcher in Natural Language Processing and a teaching assistant at the Université libre de Bruxelles (ULB), department of Information and Communication Sciences. He holds a Master's degree in Linguistics from the ULB and an Advanced Master's in Computational Linguistics from the University of Antwerp. Currently, he is preparing a doctoral thesis on the impact of language-independent information extraction on document retrieval. At the same time, he works as a full-time assistant and supervises practical classes for Master's level students in a number of topics, including database quality, document management, and architecture of information systems.
Martin Magdinier, during the last six years, has been heavily engaged with startup and open data communities in France, Vietnam, and Canada. Through his recent projects (TTCPass and Objectif Neige) and consulting positions, he became intimate with data massage techniques. Coming from a business approach, his focus is on data management and transformation tools that empower the business user. In 2011, he started to blog tips and tutorials on OpenRefine to help other business users to make the most out of this tool. In 2012, when Google released the software to the community, he helped to structure the new organization. Today, he continues to actively support the OpenRefine user base and advocates its usage in various communities.
Dr. Mateja Verlic is Head of Research at Zemanta and is an enthusiastic developer of the LOD-friendly distribution of OpenRefine. After finishing her PhD in Computer Science, she worked for two years as Assistant Professor at the University of Maribor, focusing mostly on machine learning, intelligent systems, text mining, and sentiment analysis. In 2011, when she joined Zemanta as an urban ninja and researcher, she began exploring the semantic web and has been really passionate about web technologies, lean startup, community projects, and open source software ever since.
You might want to visit www.PacktPub.com to download the datasets and projects to follow along with the recipes in this book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
To Linda, for her everlasting and loving support
Ruben Verborgh
To Hélène, and baby Jeanne
Max De Wilde
Data is often dubbed the new gold, as it is of tremendous value for today's data-driven economy. However, we prefer to think of data as diamonds. At first they're raw, but through great skills, they can be polished to become the shiny assets that are so worthy to us. This is precisely what this book covers; how your dataset can be transformed in OpenRefine so you can optimize its quality for real-world (re)use.
As the vast amount of functionality of OpenRefine can be overwhelming to new users, we are convinced that a decent manual can make the difference. This book will guide you from your very first steps to really advanced operations that you probably didn't know were possible. We will spend time on all different aspects of OpenRefine, so in the end, you will have obtained the necessary skills to revive your own datasets. This book starts out with cleaning the data to fix small errors, and ends by linking your dataset to others so it can become part of a larger data ecosystem.
We realize that every dataset is different, yet learning is easiest by example. This is why we have chosen the Powerhouse Museum dataset to demonstrate the techniques in this book. However, since not all steps apply to your dataset, we have structured the different tasks as recipes. Just like in a regular cookbook, you can pick only the recipes you need for what you want to achieve. Some recipes depend on each other, but this is indicated at the start of each chapter.
In addition, the example dataset in this book illustrates a healthy data culture; the people at Powerhouse decided to bring it online even though they were aware that there were still some quality issues. Interestingly, that didn't stop them from doing it, and in fact, it shouldn't stop you; the important thing is to get the data out. Since then, the data quality has significantly improved, but we're providing you with the old version so you can perform the cleaning and linking yourself.
We are confident this book will explain all the tools necessary to help you get your data in the best possible shape. As soon as you master the skill of polishing, the raw data diamonds you have right now will become shiny diamonds.
Have fun learning OpenRefine!
Ruben and Max.
Chapter 1, Diving Into OpenRefine, teaches you the basic steps of OpenRefine, showing you how to import a dataset and how to get around in the main interface.
Chapter 2, Analyzing and Fixing Data, explains how you can get to know your dataset and how to spot errors in it. In addition, you'll also learn several techniques to repair mistakes.
Chapter 3, Advanced Data Operations, dives deeper into dataset repair, demonstrating some of the more sophisticated data operations OpenRefine has to offer.
Chapter 4, Linking Datasets, connects your dataset to others through reconciliation of single terms and with named-entity recognition on full-text fields.
Appendix, Regular Expressions and GREL, introduces you to advanced pattern matching and the General Refine Expression Language.
This book does not assume any prior knowledge; we'll even guide you through the installation of OpenRefine in Chapter 1, Diving Into OpenRefine.
This book is for anybody who is working with data, particularly large datasets. If you've been wondering how you can gain an insight into the issues within your data, increase its quality, or link it to other datasets, then this book is for you.
No prior knowledge of OpenRefine is assumed, but if you've worked with OpenRefine before, you'll still be able to learn new things in this book. We cover several advanced techniques in the later chapters, with Chapter 4, Linking Datasets, entirely devoted to linking your dataset.
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Program code inside text is shown as follows: "The expression that transforms the reconciled cell to its URL is cell.recon.match.id".
New terms are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "After clicking on OK, you will see a new column with the corresponding URLs".
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the raw data and OpenRefine projects to follow along with the recipes in the book. Each chapter has its own example file which can be downloaded from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
In this opening chapter, we will discover what OpenRefine is made for, why you should use it, and how. After a short introduction, we will go through seven fundamental recipes that will give you a foretaste of the power of OpenRefine: