KNIME Essentials - Gábor Bakos - E-Book

KNIME Essentials E-Book

Gábor Bakos

Description

KNIME is an open source data analytics, reporting, and integration platform, which allows you to analyze a small or large amount of data without having to reach out to programming languages like R.

"KNIME Essentials" teaches you all you need to know to start processing your first data sets using KNIME. It covers topics like installation, data processing, and data visualization including the KNIME reporting features. Data processing forms a fundamental part of KNIME, and KNIME Essentials ensures that you are fully comfortable with this aspect of KNIME before showing you how to visualize this data and generate reports.

"KNIME Essentials" guides you from the installation of KNIME through to the generation of reports based on data. The main parts between these two phases are data processing and visualization. The KNIME variants of data analysis concepts are introduced, and the description of installation and configuration is followed by data processing, which offers many options to convert or extend your data. Visualization makes it easier to get an overview of parts of the data, while reporting offers a way to summarize them clearly.

Available formats: EPUB, MOBI

Page count: 212

Year of publication: 2013




Table of Contents

KNIME Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Installing and Using KNIME
Few words about KNIME
Installing KNIME
Installation using the archive
KNIME for Windows
KNIME for Linux
KNIME for Mac OS X
Troubleshooting
KNIME terminologies
Organizing your work
Nodes
Node lifecycle
Meta nodes
Ports
Data tables
Port view
Flow variables
Node views
HiLite
Eclipse concepts
Preferences
Logging
User interface
Getting started
Setting preferences
KNIME
Other preferences
Installing extensions
Workbench
Workflow handling
Node controls
HiLite
Variable flows
Meta nodes
Workflow lifecycle
Other views
Summary
2. Data Preprocessing
Importing data
Importing data from a database
Starting Java DB
Importing data from tabular files
Importing data from web services
REST services
Importing XML files
Importing models
Other formats
Public data sources
Regular expressions
Basic syntax
Partial versus whole match
Usage from Java
References and tools
Alternative pattern description
Transforming the shape
Filtering rows
Sampling
Appending tables
Less columns
Dimension reduction
More columns
GroupBy
Pivoting and Unpivoting
One2Many and Many2One
Cosmetic transformations
Renames
Changing the column order
Reordering the rows
The row ID
Transpose
Transforming values
Generic transformations
Java snippets
The Math Formula node
Conversion between types
Binning
Normalization
Text normalization
Regular expressions
Multiple columns
XML transformation
Time transformation
Smoothing
Data generation
Generating the grid
Constraints
Loops
Workflow customization
Case study – finding min-max in the next n rows
Case study – ranks within groups
Summary
3. Data Exploration
Computing statistics
Overview of visualizations
Visual guide for the views
Distance matrix
Using visual properties
Color
Size
Shape
KNIME views
HiLite
Use cases for HiLite
Row IDs
Extreme values
Basic KNIME views
The Box plots
Hierarchical clustering
Histograms
Interactive Table
The Lift chart
Lines
Pie charts
The Scatter plots
Spark Line Appender
Radar Plot Appender
The Scorer views
JFreeChart
The Bar charts
The Bubble chart
Heatmap
The Histogram chart
The Interval chart
The Line chart
The Pie chart
The Scatter plot
Open Street Map
3D Scatterplot
Other visualization nodes
The R plot, Python plot, and Matlab plot
The official R plots
The RapidMiner view
The HiTS visualization
Tips for HiLiting
Using Interactive HiLite Collector
Finding connections
Visualizing models
Further ideas
Summary
4. Reporting
Installation of the reporting extensions
Reporting concepts
Importing data
Sending data and images to a report
Importing from other sources
Joining data sets
Preferences
Using the designer
In visible views
Report properties
Report items
Label
Text
Binding
Dynamic text
Data
Image
Grid
List
Groups
Sorting
Filters
Table
Chart
Cross Tab
Setting up
Changing
Using data cubes
Quick Tools
Aggregation
Relative time period
Generating reports
Using colors
Using HiLite
Using workflow variables
Suggested readings
Summary
Index

KNIME Essentials

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013

Production Reference: 1101013

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-84969-921-1

www.packtpub.com

Cover Image by Abhishek Pandey (<[email protected]>)

Credits

Author

Gábor Bakos

Reviewers

Thorsten Meinl

Takeshi Nakano

Acquisition Editors

Saleem Ahmed

Edward Gordon

Commissioning Editor

Amit Ghodake

Technical Editors

Iram Malik

Aman Preet Singh

Copy Editors

Gladson Monteiro

Kirti Pai

Mradula Hegde

Sayanee Mukherjee

Project Coordinator

Esha Thakker

Proofreader

Clyde Jenkins

Indexers

Tejal Daruwale

Priya Subramani

Graphics

Ronak Dhruv

Yuvraj Mannari

Production Coordinator

Prachali Bhiwandkar

Cover Work

Prachali Bhiwandkar

About the Author

Gábor Bakos is a programmer and a mathematician, having a few years of experience with KNIME and KNIME node development (HiTS nodes and RapidMiner integration for KNIME).

At Trinity College, Dublin, the author helped a research group with his data analysis skills (and had the opportunity to improve them) and with the development of new KNIME nodes. When he worked for evopro Kft. and Scriptum Informatika Zrt., he also worked on various data analysis software products. He currently works for his own company, Mind Eratosthenes Kft. (www.mind-era.com), where he develops the RapidMiner integration for KNIME (tech.knime.org/community/rapidminer-integration), among other things.

The author would like to thank the reviewers and Packt Publishing for their help in creating this book.

About the Reviewers

Thorsten Meinl is currently a Senior Software Developer at KNIME.com in Zurich. He holds a PhD in Computer Science from the University of Konstanz. He has been working on KNIME for over seven years. His main responsibilities are quality assurance, testing, and the continuous integration infrastructure, as well as managing the KNIME Community Contributions. Besides this, he is also interested in parallel computing and cheminformatics.

Takeshi Nakano is a Senior Research Engineer working for Recruit Technologies Co., Ltd. and leads the Advanced Technology Lab in Japan. He holds a Master's degree from the Nara Institute of Science and Technology (NAIST) in Computer Science. He is the lead author of Hadoop Hacks, a book from O'Reilly Japan, and also the author of Getting Started with Apache Solr, a book from Gijutsu-Hyohron in Japan. He loves to find inspiration for his hobbies (reading, scuba diving, and others).

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. 

Why Subscribe?

Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

Dear reader, welcome to an intuitive way of data analysis. Using a visual programming language based on dataflows, you can create an easy-to-understand analysis process, while KNIME internally checks for and signals some of the common problems. Any environment that does not help with proper documentation would be destined to fail, but KNIME's success is based not just on its high-quality, cross-platform code, but also on the good descriptions of what it does and how you can use the building blocks.

This book covers the most common tasks that are required during the data preparation and visualization phases of data analysis using KNIME. Because of the size constraints, and to bring the best value for the price to those who are already familiar with or not interested in modeling, we have not covered the modeling and machine learning algorithms available for KNIME. If you are already familiar with these algorithms, you will easily get familiar with the options in KNIME, which are quite obvious to use, so you lose almost nothing. If you have not yet found time to get acquainted with these concepts, we encourage you to first learn what these procedures are good for and when you should use them. There are some good books, courses, and training available, and these are the ideal options for learning, but the Wikipedia articles can also give you a basic introduction specific to the algorithm you want to use.

What this book covers

Chapter 1, Installing and Using KNIME, introduces the user interface, the concepts used in the first three chapters, and how you can install and configure KNIME and its extensions.

Chapter 2, Data Preprocessing, covers the most common tasks you need before you can analyze your data, such as loading, transforming, and generating data; it also introduces the powerful regular expressions and some case studies.

Chapter 3, Data Exploration, describes how you can use KNIME to get an overview of your data, how you can visualize it in different forms, or even create publication-quality figures.

Chapter 4, Reporting, introduces the KNIME reporting extension with the specific concepts, the user interface, and the basic blocks of reports.

What you need for this book

You only need a KNIME-compatible operating system: a modern Linux, Mac OS X (10.6 or above), or Windows XP or above. The Java runtime is bundled with KNIME, and the first chapter describes how you can download and install KNIME. For the download, you will need an Internet connection too.

Who this book is for

This book is designed to give a good start to data scientists who are not yet familiar with KNIME. Others who are not familiar with programming but need to load and transform their data in an intuitive way might also find this book useful.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic in which you have expertise, and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Installing and Using KNIME

In this chapter, we will go through the installation of KNIME, add some useful extensions, customize the settings, and find out how to use it for basic tasks. You will also be familiarized with the terminology of KNIME, so there's no misunderstanding in the later chapters.

As always, it is a good idea to read the manual of the software you get. You will find a short introduction on KNIME in the file, quickstart.pdf, present in the installation folder. The topics we will cover in the chapter are as follows:

Installation of KNIME on different platforms
Terms used in KNIME
Introduction to the KNIME user interface

Few words about KNIME

KNIME is an open source (GNU GPL available at http://www.gnu.org/licenses/gpl.html) data analytics platform with a large set of building blocks and third-party tools. You can use it for everything from loading your data to producing a final report, or to predict new values using a previously found model.

KNIME is available in four flavors: Desktop/Professional, Team Space, Server, and Cluster Execution. Only the Desktop version is open source; with a Professional subscription, you will get support for it, and also support the future development of KNIME. We will cover only the open source version. There is also an SDK version for free, but it is intended for use by node developers. Most probably, you will not need it yet.

At the time of writing this book, KNIME Desktop 2.8.0 was the latest version available; all the information presented in this book is based on that version.

Installing KNIME

KNIME is supported on various operating systems on 32-bit and 64-bit x86 Intel-architecture-based platforms. These operating systems are Windows (from XP to Windows 8 at the time of writing this book), Linux (most modern Linux operating systems work well with KNIME), and Mac OS X (10.6 and above); you can check the list of supported platforms for details at: http://www.eclipse.org/eclipse/development/readme_eclipse_3.7.1.html. KNIME also supports Java 7 on Windows and Linux, so extensions requiring Java 7 can be used too. Unfortunately, there were some problems with Java 7 under Mac OS X, so the recommended version there is Java 6.

There are two ways to install KNIME: an easier way is to unpack the archive you can download from their site, and a bit more complicated way is to install KNIME to an existing Eclipse installation as a plugin. Both have use cases, but the general recommendation is to install it from an archive.

Installation using the archive

We assume you are using the open source version of KNIME, which can be downloaded from the following address (always download the latest version):

http://www.knime.org/knime-desktop-sdk-download

It is not necessary to subscribe to the newsletters, but if you have not done so yet, it might be worth doing. Some of the newsletters also contain tips for KNIME usage. They are quite infrequent, usually one per month.

The supported operating system versions are 32-bit and 64-bit for Linux and Windows, and 64-bit for Mac OS X.

KNIME for Windows

KNIME for Windows is available as an executable file (in a 7-zip compressed format). You can run it as a regular user (unless your network administrator blacklists running executable files that are downloaded from the Internet); just double-click on it and, in the window that appears, select the destination folder.

Note

On older versions of Windows (7 and older), the path length is limited: a path cannot be longer than 260 characters. KNIME and some extensions can get close to this limit, so it is recommended to install KNIME to a short path. Installing it to Program Files is not recommended.

You do not have to specify the folder name (such as knime), as a folder named knime_ followed by the KNIME version (in our case knime_2.8.0) will be created at the destination, and it will contain the whole installation. You can have multiple versions installed.
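The path-length concern from the note can be checked before you pick a destination folder. The following is a minimal Python sketch, not something KNIME provides: the function names, the example destination, and the 100-character reserve for nested plugin files are assumptions made for illustration. Only the 260-character limit and the knime_2.8.0 folder naming come from the text.

```python
from pathlib import PureWindowsPath

MAX_PATH = 260  # path length limit mentioned in the note


def install_folder(destination: str, version: str) -> PureWindowsPath:
    """Folder created inside the chosen destination, for example knime_2.8.0."""
    return PureWindowsPath(destination) / f"knime_{version}"


def remaining_characters(destination: str, version: str) -> int:
    """Characters left for files below the installation folder."""
    return MAX_PATH - len(str(install_folder(destination, version)))


def is_short_enough(destination: str, version: str, reserve: int = 100) -> bool:
    """True if at least `reserve` characters remain (the reserve is an assumed margin)."""
    return remaining_characters(destination, version) >= reserve


# Hypothetical destination folder, chosen only for the example.
print(install_folder("C:\\tools", "2.8.0"))
print(remaining_characters("C:\\tools", "2.8.0"))
```

How much headroom KNIME and its extensions really need depends on the installed plugins, so treat the reserve as a rough guess rather than a documented value.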

You can start the KNIME GUI with the knime.exe executable file from that folder. You can create a shortcut to it on your desktop using the right-click menu by navigating to Send to | Desktop (create shortcut). On its first start, KNIME might ask for permission to connect to the Internet. This may require administrator rights, but it is usually a good idea to change the firewall settings to let KNIME through.

KNIME for Linux

The Linux download is a simple tar.gz archive. You can extract it using a command similar to the following:

$ tar -xvzf knime_2.8.0.linux.gtk.x86_64.tar.gz -C /path/to/extract

Alternatively, you can use your favorite archive-handling tool to achieve similar results. The executable you need is named knime. Your window manager's manual might help you create application launchers for this executable if you prefer to have one.

KNIME for Mac OS X

You should drag the dmg file to the Applications folder, and if you have Java installed, it should just work. The application to start is called knime.app; from the command line, run knime.app/Contents/MacOS/knime.

Troubleshooting

If you have problems installing KNIME, maybe others also had similar problems; please check the FAQ page of KNIME at http://tech.knime.org/faq first. If it does not solve your problem, you should search the forum at http://tech.knime.org/forum; if even that fails to help, ask the experts there.

KNIME terminologies

It is important to share your thoughts and problems using the same terms as others do. This makes it easier to reach your goal, and others will appreciate it if you are easy to understand. This section introduces the main concepts of KNIME.

Organizing your work

In KNIME, you store your files in a workspace. When KNIME starts, you can specify which workspace you want to use. The workspaces are not just for files; they also contain settings and logs. It might be a good idea to set up an empty, preconfigured workspace once; then, instead of customizing a new one each time you start a new project, you just copy (extract) it to the place you want to use and open it with KNIME (or switch to it).
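Copying a prepared workspace can be scripted. The sketch below is one possible way to do it in Python under stated assumptions: the function name and the folder paths are hypothetical, and the workspace is treated as an ordinary directory tree, which is all the text above relies on.

```python
import shutil
from pathlib import Path


def new_workspace_from_template(template: str, target: str) -> Path:
    """Copy a prepared, empty workspace folder to a new location.

    Refuses to overwrite an existing folder, so a workspace that is
    already in use is never replaced by accident.
    """
    source, destination = Path(template), Path(target)
    if not source.is_dir():
        raise FileNotFoundError(f"template workspace not found: {source}")
    if destination.exists():
        raise FileExistsError(f"target already exists: {destination}")
    shutil.copytree(source, destination)  # copies settings and files alike
    return destination
```

After the copy, you would start KNIME and select the new folder as the workspace. Make the copy while KNIME is not running on the template, so that no half-written settings or logs are duplicated.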

The workspace can contain workflow groups (sometimes referred to as workflow set) or workflows. The groups are like folders in a filesystem that can help organize your workflows. Workflows might be your programs and processes that describe the steps which should be applied to load, analyze, visualize, or transform the data you have, something like an execution plan. Workflows contain the executable parts, which can be edited using the workflow editor, which in turn is similar to a canvas. Both the groups and the workflows might have metadata associated with them, such as the creation date, author, or comments (even the workspace can contain such information).

Workflows might contain nodes, meta nodes, connections, workflow variables (or just flow variables), workflow credentials, and annotations besides the previously introduced metadata.
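The containment described above (a workspace holding groups, groups holding further groups or workflows, each with optional metadata) can be pictured as a small tree. This Python sketch is only an illustration of that structure: the class names, fields, and example workflow names are assumptions and do not mirror KNIME's internal classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Workflow:
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. author, comments
    nodes: List[str] = field(default_factory=list)          # node names only, for brevity


@dataclass
class WorkflowGroup:
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List[Union["WorkflowGroup", Workflow]] = field(default_factory=list)


def workflow_paths(group: WorkflowGroup, prefix: str = "") -> List[str]:
    """List every workflow below a group as a slash-separated path."""
    here = f"{prefix}/{group.name}" if prefix else group.name
    paths: List[str] = []
    for child in group.children:
        if isinstance(child, WorkflowGroup):
            paths.extend(workflow_paths(child, here))  # groups nest like folders
        else:
            paths.append(f"{here}/{child.name}")
    return paths


# Made-up example content; the workspace is modeled as the top-level group.
workspace = WorkflowGroup("workspace", children=[
    WorkflowGroup("sales", children=[Workflow("load"), Workflow("report")]),
    Workflow("scratch"),
])
print(workflow_paths(workspace))
```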

Workflow credentials are where you can store your login names and passwords for different connections. They are kept safe, but you can access them easily.

Tip

It is safe to share a workflow if you use only the workflow credentials for sensitive information (although the user name will be saved).

Nodes

Each node has a type, which identifies the algorithm associated with the node. You can think of the type as a template; it specifies how to execute for different inputs and parameters, and what should be the result. The nodes are similar to functions (or operators) in programs.

The node types are organized according to the following general types, which specify the color and the shape of the node for easier understanding of workflows. The general types are shown in the following image:

Example representation of different general types of nodes

The nodes are organized in categories; this way, it is easier to find them.

Each node has a node documentation that describes what can be achieved using that type of node, possibly use cases or tips. It also contains information about parameters and possible input ports and output ports. (Sometimes the last two are called inports and outports, or even in-ports and out-ports.)

Parameters are usually single values (for example, filename, column name, text, number, date, and so on) associated with an identifier; although, having an array of texts is also possible. These are the settings that influence the execution of a node. There are other things that can modify the results, such as workflow variables or any other state observable from KNIME.

Node lifecycle

Nodes can have any of the following states:

Misconfigured (also called IDLE)
Configured
Queued for execution
Running
Executed

There are possible warnings in most of the states, which might be important; you can read them by moving the mouse pointer over the triangle sign.
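The node states listed above can be modeled as a small state machine. In this Python sketch the five states come from the text, but the allowed transitions are an assumption made for illustration (the text names the states, not the moves between them), and the names NodeState, ALLOWED, and advance are hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    MISCONFIGURED = "misconfigured (idle)"
    CONFIGURED = "configured"
    QUEUED = "queued for execution"
    RUNNING = "running"
    EXECUTED = "executed"


# Assumed transitions: the text lists the states but not the allowed moves.
ALLOWED = {
    NodeState.MISCONFIGURED: {NodeState.CONFIGURED},
    NodeState.CONFIGURED: {NodeState.QUEUED, NodeState.MISCONFIGURED},
    NodeState.QUEUED: {NodeState.RUNNING, NodeState.CONFIGURED},
    NodeState.RUNNING: {NodeState.EXECUTED, NodeState.CONFIGURED},
    NodeState.EXECUTED: {NodeState.CONFIGURED},  # e.g. after a reset
}


def advance(current: NodeState, target: NodeState) -> NodeState:
    """Return the target state, or raise if the assumed model forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


# Walk one node through a normal run.
state = NodeState.MISCONFIGURED
for step in (NodeState.CONFIGURED, NodeState.QUEUED,
             NodeState.RUNNING, NodeState.EXECUTED):
    state = advance(state, step)
print(state.value)
```

The point of the sketch is that a node cannot jump straight from misconfigured to executed: it has to be configured first, and only then can it be queued and run.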

Meta nodes

Meta nodes look like normal nodes at first sight, although they contain other nodes (or meta nodes) inside them. The associated context of the node might give options for special execution. Usually they help to keep your workflow organized and less scary at first sight.

A user-defined meta node

Ports

Ports are where data in some form flows from one node to another. The most common port type is the data table, which is represented by a white triangle. The input ports (where data is expected to get in) are on the left-hand side of the nodes, and the output ports (where the created data comes out) are on the right-hand side. You cannot mix and match the different kinds of ports. It is also not allowed to connect a node's output to its own input or to create cycles in the graph of nodes; you have to create a loop if you want to achieve something similar.
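The two connection rules in this paragraph (matching port kinds, and no connection that leads back to the same node) can be sketched as checks on a directed graph. The Python below is a toy model under stated assumptions: the class WorkflowGraph, its methods, the single input and output port per node, and the port type names "table", "model", and "none" are all hypothetical simplifications, not KNIME's API.

```python
from collections import defaultdict
from typing import Dict, Set


class WorkflowGraph:
    """Toy connection graph: each node has one input and one output port type."""

    def __init__(self) -> None:
        self.in_type: Dict[str, str] = {}
        self.out_type: Dict[str, str] = {}
        self.successors: Dict[str, Set[str]] = defaultdict(set)

    def add_node(self, name: str, in_type: str, out_type: str) -> None:
        self.in_type[name] = in_type
        self.out_type[name] = out_type

    def _reaches(self, start: str, goal: str) -> bool:
        """Depth-first search: is `goal` reachable from `start`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.successors[node])
        return False

    def connect(self, source: str, target: str) -> None:
        """Connect source's output to target's input, enforcing both rules."""
        if self.out_type[source] != self.in_type[target]:
            raise TypeError("port types do not match")
        if self._reaches(target, source):  # also covers source == target
            raise ValueError("connection would create a cycle")
        self.successors[source].add(target)


graph = WorkflowGraph()
graph.add_node("reader", "none", "table")   # a source node without a usable input
graph.add_node("filter", "table", "table")
graph.add_node("sorter", "table", "table")
graph.connect("reader", "filter")
graph.connect("filter", "sorter")
```

With this graph, connecting sorter back to filter raises ValueError, and connecting filter to reader raises TypeError. Because connections can never form a cycle, the nodes can always be executed in a dependency order.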

Note

Currently, all ports in the standard KNIME distribution present their results only when they are ready, although the infrastructure already allows other strategies, such as streaming, where you can view partial results too.

The ports might contain information about the data even if their nodes are not yet executed.

Data tables

Data tables are the most common port type. A data table is similar to an Excel sheet or a table in a database. Sometimes these are named example sets or data frames.

Each data table has a name, a structure (or schema, a table specification), and possibly properties. The structure describes the data present in the table by storing some properties about the columns. In other contexts, columns may be called attributes, variables, or features.

A column can only contain data of a single type (but the types form a hierarchy, and a column whose type is at the top of the hierarchy can hold values of any type). Each column has a type, a name, and a position within the table. Besides these, columns might also contain further information, for example, statistics about the contained values or color/shape information for visual representation. There is always something in the data tables that looks like a column, even if it is not really a column. This is where the identifiers for the rows are held, that is, the row keys.
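A table specification with typed, ordered columns and separate row keys can be sketched as follows. This is a simplified Python illustration, not KNIME's data model: the names ColumnSpec and DataTable, the use of Python types in place of KNIME column types, the uniqueness check on row keys, and the sample values are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    value_type: type  # Python types stand in for KNIME's own column types


class DataTable:
    """Toy table: a fixed column specification plus rows addressed by row key."""

    def __init__(self, name: str, columns: List[ColumnSpec]) -> None:
        self.name = name
        self.columns = columns  # list order gives each column its position
        self.rows: Dict[str, Tuple[Any, ...]] = {}

    def add_row(self, row_key: str, values: Tuple[Any, ...]) -> None:
        if row_key in self.rows:  # assumed: row keys identify rows uniquely
            raise KeyError(f"duplicate row key: {row_key}")
        if len(values) != len(self.columns):
            raise ValueError("wrong number of values")
        for spec, value in zip(self.columns, values):
            if not isinstance(value, spec.value_type):
                raise TypeError(f"column {spec.name!r} expects {spec.value_type.__name__}")
        self.rows[row_key] = values

    def column(self, name: str) -> List[Any]:
        """Values of one column, in row insertion order."""
        position = [spec.name for spec in self.columns].index(name)
        return [values[position] for values in self.rows.values()]


# Made-up sample data for illustration only.
table = DataTable("sample", [ColumnSpec("length", float), ColumnSpec("label", str)])
table.add_row("Row0", (5.1, "a"))
table.add_row("Row1", (7.0, "b"))
print(table.column("label"))
```

The row keys live outside the column list, which matches the idea of something that looks like a column without being one. Declaring a column with `object` as its type accepts any value, a rough analogue of the type at the top of a type hierarchy.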

There can be