Harness the power of Elasticsearch to build and manage scalable search and analytics solutions with this fast-paced guide
Anyone who wants to build efficient search and analytics applications can choose this book. This book is also beneficial for skilled developers, especially ones experienced with Lucene or Solr, who now want to learn Elasticsearch quickly.
With constantly evolving and growing datasets, organizations need to find actionable insights for their business. Elasticsearch, which is the world's most advanced search and analytics engine, makes massive amounts of data usable in a matter of milliseconds. It not only gives you the power to build blazingly fast search solutions over a massive amount of data, but can also serve as a NoSQL data store.
This guide will quickly make you a competent developer with a solid knowledge and understanding of the Elasticsearch core concepts. Starting from the beginning, this book will cover these core concepts, setting up Elasticsearch and various plugins, working with analyzers, and creating mappings. This book provides complete coverage of working with Elasticsearch using Python and performing CRUD operations and aggregation-based analytics, handling document relationships in the NoSQL world, working with geospatial data, and taking data backups. Finally, we'll show you how to set up and scale Elasticsearch clusters in production environments and share some best practices.
This is an easy-to-follow guide with practical examples and clear explanations of the concepts. This fast-paced book offers rich content that focuses mainly on practical implementation. It provides step-by-step practical examples and points out common errors and their solutions, with ample screenshots and code to ensure your success.
Number of pages: 247
Year of publication: 2016
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: January 2016
Production reference: 1250116
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-101-0
www.packtpub.com
Author
Bharvi Dixit
Reviewer
Alberto Paro
Commissioning Editor
Pramila Balan
Acquisition Editor
Sonali Vernekar
Content Development Editor
Kirti Patil
Technical Editor
Ryan Kochery
Copy Editor
Kausambhi Majumdar
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Tejal Daruwale Soni
Graphics
Abhinash Sahu
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph
Bharvi Dixit is an IT professional with extensive experience of working on search servers (especially Elasticsearch) and NoSQL databases. He is currently working as a technology and search expert with GrownOut, a SaaS-based referral hiring solution provider. He is the organizer of, and a speaker at, Delhi's Elasticsearch Meetup Group, which is one of the fastest growing Elasticsearch communities in India.
He also works as a freelance Elasticsearch consultant and has helped many small to medium-sized organizations adopt Elasticsearch for different use cases, such as creating search solutions for big data-automated intelligence platforms in the areas of counter-terrorism and risk management, as well as in other domains such as recruitment, e-commerce, finance, and log monitoring.
He holds a master's degree in computer science from LBSIM, Delhi, India, and has a keen interest in creating scalable backend platforms. His other areas of interest are data analytics, distributed computing, automation, and DevOps. Java and Python are the primary languages in which he loves to write code, and he has already built proprietary software for consultancy firms.
In his spare time, he loves writing blogs and reading the latest technology books. You can connect with him on LinkedIn at https://in.linkedin.com/in/bharvidixit.
I would like to thank my family for their continuous support, especially my brother, Patanjali Dixit, who has guided me at each step throughout my career. I would also like to give a big thanks to Lavleen for the support, patience, and encouragement she gave during all those days when I was busy writing this book.
I would like to extend my thanks to all of the Packt team working on this book and our technical reviewer, Alberto Paro. Without them, the book wouldn't have been as great as it is now. It was one of the best teams I have worked with.
Finally, special thanks to Shay Banon for creating Elasticsearch and to all the people who contributed to the libraries and modules published around this project.
Once again, thank you.
Alberto Paro is an engineer, project manager, and software developer. He currently works as a CTO at Big Data Technologies and as a freelance international consultant on software engineering for big data and NoSQL solutions. He loves to study emerging solutions and applications mainly related to Big Data processing, NoSQL, natural language processing, and neural networks. He began programming in BASIC on a Sinclair Spectrum when he was eight years old, and he has a lot of experience of using different operating systems, applications, and programming languages.
In 2000, he graduated in computer science engineering from Politecnico di Milano with a thesis on designing multiuser and multidevice web applications. He assisted the professors at the university for about a year. Then, he came in contact with The Net Planet Company and loved their innovative ideas; he started working on knowledge management solutions and advanced data mining products. In the summer of 2014, his company was acquired by Big Data Technologies, where he currently works, mainly using Scala and Python on state-of-the-art Big Data software (Spark, Akka, Cassandra, and YARN). In 2013, he started freelancing as a consultant for Big Data technologies, machine learning, and Elasticsearch.
In his spare time, when he is not playing with his children, he likes to work on open source projects. When he was in high school, he started contributing to projects related to the GNOME environment (gtkmm). One of his preferred programming languages is Python, and he wrote one of the first NoSQL backends on Django for MongoDB (Django-MongoDB-engine). He is also a fan of the Scala language and enjoys spreading his love of technology: he was a presenter of Big Data concepts at Scala Day Italy 2015 on Scala.JS and Big Data Tech Italian Conference in Florence.
In 2010, he began using Elasticsearch to provide search capabilities to some Django e-commerce sites and developed PyES (a Pythonic client for Elasticsearch), as well as the initial part of the Elasticsearch MongoDB driver. He is the author of ElasticSearch Cookbook and ElasticSearch Cookbook Second Edition as well as a technical reviewer of Elasticsearch Server, Second Edition, and the video course, Building a Search Server with ElasticSearch, all of which have been published by Packt Publishing.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
With constantly evolving and growing datasets, organizations have the need to find actionable insights for their business. Elasticsearch, which is the world's most advanced search and analytics engine, brings the ability to make massive amounts of data usable in a matter of milliseconds. It not only gives you the power to build blazingly fast search solutions over a massive amount of data, but can also serve as a NoSQL data store.
Elasticsearch Essentials will guide you to become a competent developer quickly with a solid knowledge and understanding of the Elasticsearch core concepts. In the beginning, this book will cover the fundamental concepts required to start working with Elasticsearch and then it will take you through more advanced concepts of search techniques and data analytics.
This book provides complete coverage of working with Elasticsearch using Python and Java APIs to perform CRUD operations, aggregation-based analytics, handling document relationships, working with geospatial data, and controlling search relevancy.
In the end, you will not only learn about scaling Elasticsearch clusters in production, but also how to secure Elasticsearch clusters and take data backups using best practices.
Chapter 1, Getting Started with Elasticsearch, provides an introduction to Elasticsearch and how it works. After going through the basic concepts and terminologies, you will learn how to install and configure Elasticsearch and perform basic operations with Elasticsearch.
Chapter 2, Understanding Document Analysis and Creating Mappings, covers the details of the built-in analyzers, tokenizers, and filters provided by Lucene. It also covers how to create custom analyzers and mapping with different data types.
Chapter 3, Putting Elasticsearch into Action, introduces Elasticsearch Query-DSL, various queries, and the data sorting techniques. You will also learn how to perform CRUD operations with Elasticsearch using Elasticsearch Python and Java clients.
Chapter 4, Aggregations for Analytics, is all about the Elasticsearch aggregation framework for building analytics on data. It provides many fundamental as well as complex examples of data analytics that can be built using a combination of full-text search, term-based search, and multilevel aggregations. You will master the aggregation module of Elasticsearch by working through a complete set of practical code examples that are covered using Python and Java clients.
Chapter 5, Data Looks Better on Maps: Master Geo-Spatiality, discusses geo-data concepts and covers the rich geo-search functionalities offered by Elasticsearch including how to create mappings for geo-points and geo-shapes data, indexing documents, geo-aggregations, and sorting data based on geo-distance. It includes code examples for the most widely used geo-queries in both Python and Java.
Chapter 6, Document Relationships in NoSQL World, focuses on the techniques offered by Elasticsearch to handle relational data using nested and parent-child relationships and creating a schema for the same using real-world examples. The reader will also learn how to create mappings based on relational data and write code for indexing and querying data using Python and Java APIs.
Chapter 7, Different Methods of Search and Bulk Operations, covers the different types of search and bulk APIs that every programmer needs to know while developing applications and working with large data sets. You will learn examples of bulk processing, multi-searches, and faster data reindexing using both Python and Java, which will help you throughout your journey with Elasticsearch.
Chapter 8, Controlling Relevancy, discusses the most important aspect of search engines—relevancy. It covers the powerful scoring capabilities available in Elasticsearch and practical examples that show how you can control the scoring process according to your needs.
Chapter 9, Cluster Scaling in Production Deployments, shows how to create Elasticsearch clusters and configure different types of nodes with the right resource allocations. It also focuses on cluster scalability using the best practices for production environments.
Chapter 10, Backups and Security, focuses on the different mechanisms for creating data backups of an Elasticsearch cluster and restoring them into the same or another cluster. A step-by-step guide to setting up NFS (Network File System) is also provided. Finally, you will learn about setting up Nginx to secure Elasticsearch and load balance requests.
This book was written using Elasticsearch version 2.0.0, and all the examples and functions should work with it. Oracle Java 1.7 u55 or later is recommended for creating Elasticsearch clusters. You'll also need a tool that can send HTTP requests, such as curl, which is available for most operating systems. All the examples in this book are written in Python and Java.
For the Java examples, you will need the Java Development Kit (JDK) installed and an editor for developing your code (such as Eclipse). Apache Maven has been used to build the Java code.
To run the Python examples, you will need Python 2.7 or later and Elasticsearch-Py, the official Python client for Elasticsearch.
Some chapters may require additional software, such as Elasticsearch plugins; this is explicitly mentioned wherever it is needed.
Anyone who wants to build efficient search and analytics applications can choose this book. It is also beneficial for skilled developers, especially ones experienced with Lucene or Solr, who now want to learn Elasticsearch quickly. A basic knowledge of Python or Java and Linux is expected.
Readers who want to improve their query relevancy, or learn how to use the Elasticsearch Java and Python APIs, may also find this book interesting and useful.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "REST endpoints also enable users to make changes in clusters and indices settings dynamically rather than manually pushing configuration updates to all the nodes in a cluster by editing the elasticsearch.yml file and restarting the node."
A block of code is set as follows:
Any command-line input or output is written as follows:
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/B03461_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Nowadays, search is one of the primary functionalities needed in every application; Elasticsearch can fulfill this need and also offers many other features. Elasticsearch, which is built on top of Apache Lucene, is an open source, distributed, and highly scalable search engine. It provides extremely fast searches and makes data discovery easy.
In this chapter, we will cover the following topics:
Elasticsearch is a distributed, full-text search and analytics engine that is built on top of Lucene, a search engine library written in Java that is also the base for Solr. Since its first release in 2010, Elasticsearch has been widely adopted by large as well as small organizations, including NASA, Wikipedia, and GitHub, for different use cases. The latest releases of Elasticsearch focus more on resiliency, which gives users the confidence to use Elasticsearch as a data storage tool, apart from using it as a full-text search engine. Elasticsearch ships with sensible default configurations and settings, and also hides all the complexities from beginners, which lets everyone become productive very quickly by just learning the basics.
Lucene is a blazing fast search library but it is tough to use directly and has very limited features to scale beyond a single machine. Elasticsearch comes to the rescue to overcome all the limitations of Lucene. Apart from providing a simple HTTP/JSON API, which enables language interoperability in comparison to Lucene's bare Java API, it has the following main features:
There are many more features available in Elasticsearch, such as multitenancy and percolation, which will be discussed in detail in the next chapters.
Elasticsearch is based on a REST design pattern and all the operations, for example, document insertion, deletion, updating, searching, and various monitoring and management tasks, can be performed using the REST endpoints provided by Elasticsearch.
In a REST-based web API, data and services are exposed as resources with URLs. All the requests are routed to a resource that is represented by a path. Each resource has a resource identifier, called a URI. All the potential actions on this resource can be performed using the simple request types provided by the HTTP protocol. The following are examples that describe how CRUD operations are done with a REST API:
Many Elasticsearch users confuse the POST and PUT request types. The difference is simple: POST is used to create a new resource, while PUT is used to update an existing resource. PUT can also be used to create a resource, but only when the complete URI, including the resource identifier, is supplied in the request.
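As a rough sketch of how CRUD operations map onto HTTP request types, the following Python snippet builds (but does not send) one request per operation using only the Python 3 standard library. The index name books, the type name book, the document ID 1, and the node address http://localhost:9200 are assumptions chosen for illustration; the URL layout follows the index/type/ID pattern used by Elasticsearch 2.x.

```python
import json
from urllib.request import Request

# Assumed address of a local node; 9200 is the default Elasticsearch HTTP port.
BASE_URL = "http://localhost:9200"


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for an Elasticsearch endpoint."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Hypothetical document, index ("books"), type ("book"), and ID (1).
doc = {"title": "Elasticsearch Essentials", "year": 2016}

crud = {
    # POST without an ID: the server generates the document ID.
    "create (ID chosen by Elasticsearch)": build_request("POST", "/books/book/", doc),
    # PUT with the complete URI: creates the document or replaces an existing one.
    "create or replace (ID in the URI)": build_request("PUT", "/books/book/1", doc),
    "read": build_request("GET", "/books/book/1"),
    "delete": build_request("DELETE", "/books/book/1"),
}

for action, req in crud.items():
    print("{:38} {:6} {}".format(action, req.get_method(), req.full_url))
```

Passing any of these objects to urllib.request.urlopen would send it to a running node; the sketch stops short of that so it can be run without a cluster.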
All the real-world data comes in object form. Every entity (object) has some properties. These properties can be in the form of simple key value pairs or they can be in the form of complex data structures. One property can have properties nested into it, and so on.
Elasticsearch is a document-oriented data store where objects, which are called documents, are stored and retrieved in the form of JSON. These objects are not only stored; their content is also indexed to make them searchable.
JavaScript Object Notation (JSON) is a lightweight data interchange format and, in the NoSQL world, it has become a standard data serialization format. The primary reasons for using it as a standard format are its language independence and its support for complex nested data structures. JSON supports the following data types:
Array, Boolean, Null, Number, Object, and String
The following is an example of a JSON object, which shows how these data types are stored in key-value pairs:
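As a minimal sketch of how the six JSON data types appear in a document, the Python snippet below serializes a dictionary with the standard json module and reads it back. The field names and values are hypothetical and chosen only to show one value of each type.

```python
import json

# Hypothetical document; field names and values are illustrative only.
document = {
    "name": "Elasticsearch Essentials",  # String
    "pages": 247,                        # Number (integer)
    "price": 34.79,                      # Number (floating point)
    "published": True,                   # Boolean
    "subtitle": None,                    # Null
    "tags": ["search", "analytics"],     # Array
    "publisher": {                       # Object (nested key-value pairs)
        "name": "Packt Publishing",
        "city": "Birmingham",
    },
}

# Python's True and None are written as JSON's true and null.
serialized = json.dumps(document, indent=2)
print(serialized)

# Parsing the JSON text gives back an equal Python structure.
restored = json.loads(serialized)
print(restored == document)
```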
The following are the most common terms that are very important to know when starting with Elasticsearch:
A shard can be either primary or secondary (Elasticsearch calls a secondary shard a replica). A primary shard is the one to which all the operations that change the index are directed. A secondary shard contains a duplicate of the primary shard's data; it speeds up searches and provides high availability: if the machine that holds the primary shard goes down, the secondary shard automatically becomes the primary.
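To make the primary/secondary split concrete, here is a small Python sketch that builds the settings body one might send when creating an index and counts the shard copies it implies. number_of_shards and number_of_replicas are Elasticsearch index setting names (the settings call secondary shards replicas); the helper functions and the values 5 and 1 are assumptions for illustration.

```python
import json


def index_settings(primary_shards, replicas_per_primary):
    """Return an index-creation body with the given shard layout."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shard_copies(body):
    """Each primary shard exists once, plus one copy per configured replica."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


body = index_settings(primary_shards=5, replicas_per_primary=1)
print(json.dumps(body))
print(total_shard_copies(body))  # 5 primaries + 5 replicas = 10
```

With one replica per primary, every shard has a duplicate that can take over as primary if the original is lost.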
Elasticsearch is a search
