ElasticSearch is an open source search server built on Apache Lucene. It was built to provide a scalable search solution with built-in support for near real-time search and multi-tenancy.
Jumping into the world of ElasticSearch by setting up your own custom cluster, this book will show you how to create a fast, scalable, and flexible search solution. By learning the ins-and-outs of data indexing and analysis, "ElasticSearch Server" will start you on your journey to mastering the powerful capabilities of ElasticSearch. With practical chapters covering how to search data, extend your search, and go deep into cluster administration and search analysis, this book is perfect for those new and experienced with search servers.
In "ElasticSearch Server" you will learn how to revolutionize your website or application with faster, more accurate, and flexible search functionality. Starting with chapters on setting up your own ElasticSearch cluster and searching and extending your search parameters you will quickly be able to create a fast, scalable, and completely custom search solution.
Building on your knowledge further you will learn about ElasticSearch's query API and become confident using powerful filtering and faceting capabilities. You will develop practical knowledge on how to make use of ElasticSearch's near real-time capabilities and support for multi-tenancy.
Your journey then concludes with chapters that help you monitor and tune your ElasticSearch cluster as well as advanced topics such as shard allocation, gateway configuration, and the discovery module.
Number of pages: 374
Year of publication: 2013
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2013
Production Reference: 1110213
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-844-4
www.packtpub.com
Cover Image by Neha Rajappan (<[email protected]>)
Authors: Rafał Kuć, Marek Rogoziński
Reviewers: Ravindra Bharathi, Matthew Lee Hinman, Marcelo Ochoa, Karel Minařík
Acquisition Editor: Andrew Duckworth
Lead Technical Editor: Neeshma Ramakrishnan
Technical Editors: Prasad Dalvi, Jalasha D'costa, Charmaine Pereira, Varun Pius Rodrigues
Copy Editors: Brandt D'Mello, Alfida Paiva, Laxmi Subramanian, Ruta Waghmare
Project Coordinator: Anurag Banerjee
Proofreader: Chris Smith
Indexer: Rekha Nair
Production Coordinator: Conidon Miranda
Cover Work: Conidon Miranda
Rafał Kuć is a born team leader and software developer. He currently works as a consultant and a software engineer at Sematext Group, Inc., where he concentrates on open source technologies such as Apache Lucene and Solr, ElasticSearch, and the Hadoop stack. He has more than 11 years of experience in various software branches, from banking software to e-commerce products. He focuses mainly on Java but is open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with their problems with Solr and Lucene. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, and ApacheCon.
Rafał began his journey with Lucene in 2002, and it wasn't exactly love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came along and that was it. He started working with ElasticSearch in the middle of 2010. Currently, Lucene, Solr, ElasticSearch, and information retrieval are his main points of interest.
Rafał is also the author of Apache Solr 3.1 Cookbook and the update to it—Apache Solr 4 Cookbook—published by Packt Publishing.
The book you are holding was a new experience for me although it is not the first book I've written. When we started working on it, we thought that we would be able to write about all the functionalities we wanted, but we couldn't have imagined how big ElasticSearch is and how much time it would take to write about it. Finally, we had to choose the topics and hopefully we've chosen wisely and you'll find this book helpful in your work. When I described a single functionality, I tried to write about it like I would like to read about it myself, so I hope that you'll find those descriptions helpful and interesting.
Although I would go the same way if I went back in time, the time of writing this book was not easy for my family, especially because this was not the only book I was working on at the time. Apache Solr 4 Cookbook was also being updated at the same time. The ones that suffered from this the most were my wife, Agnes, and our two lovely kids—our son, Philip, and daughter, Susanna. Without their patience and understanding, writing this book wouldn't have been possible. I would also like to thank my parents and Agnes' parents for their support and help.
I would like to thank all the people involved in creating, developing, and maintaining the ElasticSearch and Lucene projects for their work and passion. Without them this book couldn't have been written.
Finally, a big thanks to all the reviewers on this book. Their in-depth comments and insights have made this book better, at least from my point of view.
Once again, thank you all!
Marek Rogoziński is a software architect and consultant with more than 10 years of experience. His specialization concerns solutions based on open source projects such as Solr and ElasticSearch.
He is also the co-founder of the solr.pl site, which publishes information and tutorials about the Solr and Lucene libraries.
He currently holds the position of Chief Technology Officer at Smartupz, the vendor of the Discourse™ social collaboration software.
Writing this book was hard work but also a great opportunity to try something new. Looking at more and more pages being created with time, I realized how rich ElasticSearch is and how difficult it is to fit the description of its features within the page limit. I hope that topics that finally made it to the book are the most important and interesting ones.
The biggest thank-you goes to all the people involved in the development of Lucene and ElasticSearch. Great work!
I would also like to thank the team working on this book. I am impressed by how smoothly and quickly we passed through all the organizational stuff. Special thanks to the reviewers for a long list of comments and suggestions.
Last but not least, thanks to all my friends, both those who persuaded me to write a book and those to whom it will be a complete surprise.
Ravindra Bharathi has worked in the software industry for over a decade in various domains such as education, digital media marketing/advertising, enterprise search, and energy management systems. He has a keen interest in search-based applications that involve data visualization, mashups, and dashboards. He blogs at http://ravindrabharathi.blogspot.com.
Matthew Lee Hinman currently develops distributed archiving software for high availability and cloud-based systems written in both Clojure and Java. He enjoys contributing to open source software and spending time hiking outdoors.
Marcelo Ochoa works at the System Laboratory of Facultad de Ciencias Exactas of the Universidad Nacional del Centro de la Provincia de Buenos Aires, and is the CTO at Scotas.com, a company specialized in near real-time search solutions using Apache Solr and Oracle. He divides his time between University jobs and external projects related to Oracle and big data technologies. He has worked in several Oracle-related projects such as translation of Oracle manuals and multimedia CBTs. His background is in database, network, web, and Java technologies. In the XML world, he is known as the developer of the DB Generator for the Apache Cocoon project, the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration using Oracle JVM Directory implementation, and in the Restlet.org project, the Oracle XDB Restlet Adapter (an alternative to writing native REST web services inside the database-resident JVM).
Since 2006, he has been part of the Oracle ACE program. Oracle ACEs are known for their strong credentials as Oracle community enthusiasts and advocates, with candidates nominated by ACEs in the Oracle Technology and Applications communities.
He is the author of Chapter 17 of the book Oracle Database Programming using Java and Web Services, Kuassi Mensah, Digital Press and Chapter 21 of the book Professional XML Databases, Kevin Williams, Wrox Press.
Welcome to the ElasticSearch Server book. While reading this book, you will be taken on a journey to the wonderful world of full-text search provided by the ElasticSearch enterprise search server. We will start with a general introduction to ElasticSearch, which covers how to start and run ElasticSearch and how to configure it using both configuration files and the REST API. You will also learn how to create your index structure and tell ElasticSearch about it, how to configure different analyses for fields, and how to use the built-in data types.
This book will also discuss the query language, the so-called Query DSL, that allows you to create complicated queries and filter returned results. In addition to all that, you'll see how you can use faceting to calculate aggregated data based on the results returned by your queries. We will implement the autocomplete functionality together and will learn how to use ElasticSearch's spatial capabilities and how to use prospective search.
Finally, this book will show you some capabilities of the ElasticSearch administration API, with features such as shard placement control, cluster handling, and more. In addition to all that, you'll learn how to overcome some common problems that can come up on your journey with ElasticSearch server.
Chapter 1, Getting Started with ElasticSearch Cluster, covers ElasticSearch installation and configuration, REST API usage, mapping configuration, routing, and index aliasing.
Chapter 2, Searching Your Data, discusses Query DSL—basic and compound queries, filtering, result sorting, and using scripts.
Chapter 3, Extending Your Structure and Search, explains how to index data that is not flat, how to handle highlighting and autocomplete, and how to extend your index with things such as time to live, source, and so on.
Chapter 4, Make Your Search Better, covers how to influence your scoring, how to use synonyms, and how to handle multilingual data. In addition to that, it describes how to use position-aware queries and check why your document was matched.
Chapter 5, Combining Indexing, Analysis, and Search, shows you how to index tree-like structures, use nested objects, handle parent-child relationships, modify your live index structure, fetch data from external systems, and speed up your indexing by using batch processing.
Chapter 6, Beyond Searching, is dedicated to faceting, "more like this", and the prospective search functionality.
Chapter 7, Administrating Your Cluster, concentrates on the cluster administration API and cluster monitoring. In this chapter you'll also find information about external plugin installation.
Chapter 8, Dealing with Problems, will guide you through fetching large results sets efficiently, controlling cluster rebalancing, validating your queries, and using warm-up queries.
This book was written using ElasticSearch server 0.20.0, and all the examples and functions should work with it. In addition to that, you'll need a command that allows sending HTTP requests such as curl, which is available for most operating systems. Please note that all examples in this book use the mentioned curl tool. If you want to use another tool, please remember to format the request in an appropriate way that is understood by the tool of your choice.
In addition to that, some chapters may require additional software, such as ElasticSearch plugins or MongoDB NoSQL database, but when needed this is explicitly mentioned.
If you are a beginner to the world of full-text search and ElasticSearch server, this book is especially for you. You will be guided through the basics of ElasticSearch, and you will learn how to use some of the advanced functionalities.
If you know ElasticSearch and have worked with it, you may find this book interesting as it provides a good overview of all the functionalities with examples and descriptions. However, you may encounter sections that you already know about.
If you know the Apache Solr search engine, this book can also be used to compare some functionalities of Apache Solr and ElasticSearch. This may help you judge which tool is more appropriate for your use case.
If you know all the details about ElasticSearch and know how each of the configuration parameters works, this is definitely not the book you are looking for!
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title through the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
The first thing we need to do with ElasticSearch is install it. As with many applications, you start with the installation and configuration, usually forgetting about the importance of those steps until something bad happens. In this chapter we will focus quite extensively on this part of ElasticSearch. Be advised that this chapter is not a definitive guide to every configuration option and parameter. Of course, we would like to describe them all, but if we did that we would have to write a book that is twice (or even more) the size of the one you are holding in your hands! In addition to that, ElasticSearch is like all the other software applications available today—it evolves every day and keeps changing. We will cover only what we feel is commonly required, as well as specific functionalities that are sometimes hard to understand or those that are so wide that having them described in a single place would save you some time. By the end of this chapter, you will have learned the following:
ElasticSearch is an open source search server project started by Shay Banon and published in February 2010. The project grew into a major player in the field of search solutions. Additionally, due to its distributed nature and real-time abilities, many people use it as a document database. Let's go through the basic concepts of ElasticSearch.
An index is the place where ElasticSearch stores data. If you come from the relational database world, you can think of an index like a table. But in contrast to a relational database, the table values stored in an index are prepared for fast and efficient full-text searching and in particular, do not have to store the original values. If you know MongoDB, you can think of the ElasticSearch index as being like a collection in MongoDB; and if you are familiar with CouchDB you can think about an index as you would about the CouchDB database.
The main entity stored in ElasticSearch is a document. In an analogy to relational databases, a document is a row of data in a database table. Comparing an ElasticSearch document to a MongoDB one, both can have different structures, but the one in ElasticSearch needs to have the same types for common fields.
Documents consist of fields (row columns), but each field may occur several times; such a field is called multivalued. Each field has a type (text, number, date, and so on). Field types can also be complex: a field can contain other subdocuments or arrays. The field type is important for ElasticSearch because it gives the search engine information about how various operations such as comparison or sorting should be performed. Fortunately, this can be determined automatically. Unlike relational databases, documents don't need to have a fixed structure; every document may have a different set of fields and, in addition to that, the fields don't have to be known during application development. Of course, one can force a document structure with the use of a schema.
In ElasticSearch, one index can store many objects with different purposes. For example, a blog application can store articles and comments. Document type lets us easily differentiate these objects. It is worth noting that practically every document can have a different structure; but in real operations, dividing it into types significantly helps in data manipulation. Of course, one needs to keep the limitations in mind. One such limitation is that the different document types can't set different types for the same property.
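The limitation on property types can be illustrated with a toy check, written here in Python. This is only a sketch under stated assumptions: the layout of the `mappings` dictionary and the `find_type_conflicts` helper are hypothetical and are not part of ElasticSearch.

```python
from collections import defaultdict


def find_type_conflicts(mappings):
    """Return {field: {doc_type: field_type}} for every field that is declared
    with different field types in different document types of one index.

    `mappings` uses a hypothetical layout: {doc_type: {field: field_type}}.
    """
    seen = defaultdict(dict)
    for doc_type, fields in mappings.items():
        for field, field_type in fields.items():
            seen[field][doc_type] = field_type
    return {
        field: by_type
        for field, by_type in seen.items()
        if len(set(by_type.values())) > 1
    }


# A blog index whose two document types disagree on the type of "date".
blog_mappings = {
    "article": {"title": "string", "date": "date"},
    "comment": {"author": "string", "date": "string"},
}
conflicts = find_type_conflicts(blog_mappings)
```

Here `conflicts` reports only the `date` property, because it is the one declared differently by the `article` and `comment` types.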
ElasticSearch can work as a standalone, single-search server. Nevertheless, to be able to process large sets of data and to achieve fault tolerance, ElasticSearch can be run on many cooperating servers. Collectively, these servers are called a cluster, and each of them is called a node. Large amounts of data can be split across many nodes via index sharding (splitting it into smaller individual parts). Better availability and performance are achieved through replicas (copies of index parts).
When we have a large number of documents, we can come to a point where a single node is not enough because of RAM limitations, hard disk capacity, and so on. Another problem arises when the desired functionality is so complicated that the computing power of a single server is not sufficient. In such cases, the data can be divided into smaller parts called shards, where each shard is a separate Apache Lucene index. Each shard can be placed on a different server, and thus your data can be spread across the cluster. When you query an index that is built from multiple shards, ElasticSearch sends the query to each relevant shard and merges the results in a transparent way, so that your application doesn't need to know about shards.
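The idea of splitting data into shards and merging per-shard results can be sketched with a small in-memory model. Everything in it is an assumption made for illustration: the CRC32-based `shard_for` function, the dictionaries standing in for Lucene indices, and the whitespace tokenization are not how ElasticSearch is implemented.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard for a document using a stable hash of its identifier."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def index_document(shards, doc_id, text):
    """Store the document on exactly one shard."""
    shards[shard_for(doc_id, len(shards))][doc_id] = text


def search(shards, term):
    """Send the query to every shard and merge the partial results."""
    hits = []
    for shard in shards:  # scatter: each shard answers for its own documents
        hits.extend(
            doc_id for doc_id, text in shard.items()
            if term in text.lower().split()
        )
    return sorted(hits)  # gather: the caller sees a single merged list


shards = [{} for _ in range(3)]
index_document(shards, "1", "ElasticSearch stores data in an index")
index_document(shards, "2", "Each shard is a separate Lucene index")
index_document(shards, "3", "Replicas are copies of shards")
```

Calling `search(shards, "index")` returns the identifiers of the first two documents regardless of which shard holds them, which is the transparency the text describes.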
In order to increase query throughput or achieve high availability, shard replicas can be used. The primary shard is used as the place where operations that change the index are directed. A replica is just an exact copy of the primary shard and each shard can have zero or more replicas. When the primary shard is lost (for example, the server holding the shard data is unavailable), a cluster can promote a replica to be the new primary shard.
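Replica promotion can be sketched in the same toy style. The `promote_replica` function and the list-of-dictionaries layout are hypothetical; the real cluster logic is far more involved.

```python
def promote_replica(shard_copies, failed_node):
    """Drop the copies held by `failed_node`; if the primary was among them,
    promote the first surviving replica to primary.

    `shard_copies` describes every copy of ONE shard as a list of
    {"node": str, "primary": bool} dictionaries (a hypothetical layout).
    """
    survivors = [dict(copy) for copy in shard_copies if copy["node"] != failed_node]
    if survivors and not any(copy["primary"] for copy in survivors):
        survivors[0]["primary"] = True
    return survivors


copies = [
    {"node": "node-1", "primary": True},
    {"node": "node-2", "primary": False},
]
after_failure = promote_replica(copies, "node-1")
```

After the node holding the primary disappears, the copy on `node-2` is marked as the new primary; with zero replicas the function returns an empty list, meaning the shard data is unavailable.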
The first step is to make sure that a Java SE environment is installed properly. ElasticSearch requires Version 6 or later, which can be downloaded from the following location: http://www.oracle.com/technetwork/java/javase/downloads/index.html. You can also use OpenJDK if you wish.
To install ElasticSearch, just download it from http://www.elasticsearch.org/download/ and unpack it. Choose the latest stable version. That's it! The installation is complete.
During the writing of this book we used Version 0.20.0.
The main interface for communicating with ElasticSearch is based on the HTTP protocol and REST. This means that you can even use a web browser for some basic queries and requests; but for anything more sophisticated, you'll need to use additional software, such as the cURL command. If you use Linux or OS X, the curl command should already be available. In case you're using Windows, you can download it from http://curl.haxx.se/download.html.
Let's now go to the newly created directory. We can see the following directory structure:
Directory | Description
bin | The scripts needed for running ElasticSearch instances and for plugin management
config | The directory where the configuration files are located
lib | The libraries used by ElasticSearch
After ElasticSearch starts, it will create the following directories (if they don't exist):
Directory | Description
data | Where all the data used by ElasticSearch is stored
logs | Files with information about events and errors that occur during the running of an instance
plugins | The location for storing the installed plugins
work | Temporary files
One of the reasons (but of course, not the only one) that ElasticSearch is gaining more and more attention is that getting started with it is quite easy. Because of the reasonable default values and automatic configuration for simple environments, we can skip the configuration and go straight to the next chapter without changing a single line in our configuration files. However, in order to truly understand ElasticSearch, it is worth understanding some of the available settings.
The whole configuration is located in the config directory. We can see two files there: elasticsearch.yml (or elasticsearch.json, which will be used if present) and logging.yml. The first file is responsible for setting the default configuration values for the server. This is important because some of these values can be changed at runtime and be kept as a part of the cluster state, so the values in this file may not be accurate. We will show you how to check the accurate configuration in Chapter 8, Dealing with Problems. The two values that we cannot change at runtime are cluster.name and node.name.
The cluster.name property is responsible for holding the name of our cluster. The cluster name separates different clusters from each other. Nodes configured with the same name will try to form a cluster.
The second value is the instance name. We can leave this parameter undefined, in which case ElasticSearch automatically chooses a unique name for itself. Note that this name is chosen randomly during every startup, so the same node can have a different name after each restart. Defining the name can help when referring to concrete instances through the API or when using monitoring tools to see what is happening to a node over long periods of time and between restarts. Think about giving descriptive names to your nodes. Other parameters are well commented in the file, so we advise you to look through it; do not worry if you do not understand the explanations. We hope that everything will become clear after reading the next few chapters.
The second file (logging.yml) defines how much information is written to the system logs, which log files are used, and how often new files are created. Changes in this file are necessary only when you need to adapt to monitoring or backup solutions, or during system debugging.
Let's leave the configuration files for now. An important part of configuration is tuning your operating system. During indexing, especially when you have many shards and replicas, ElasticSearch will create many files, so the system should not limit the number of open file descriptors to fewer than 32,000. For Linux servers, this can usually be changed in /etc/security/limits.conf, and the current value can be displayed using the ulimit command.
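The open file limit can also be inspected programmatically. The following sketch uses Python's standard `resource` module (available on Unix-like systems only) to read the limit of the current process; the `open_files_limit_ok` helper is a hypothetical name, and the 32,000 threshold is simply the figure quoted above.

```python
import resource  # Unix-only standard library module

MIN_OPEN_FILES = 32000  # threshold quoted in the text


def open_files_limit_ok(soft_limit, minimum=MIN_OPEN_FILES):
    """Return True when the soft limit allows at least `minimum` open files."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= minimum


# Limits that apply to the current process (and to processes it starts).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard,
      "sufficient:", open_files_limit_ok(soft))
```

Keep in mind that the value reported here belongs to the user and shell running the script, so it is only meaningful if ElasticSearch is started from the same environment.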
The next setting is connected to the memory limit for a single instance. The default value (1024 MB) may not be sufficient. If you spot entries with OutOfMemoryError in a log file, set the environment variable ES_HEAP_SIZE to a value greater than 1024. Note that this value shouldn't be set to more than 50 percent of the total physical memory available; the rest can be used as disk cache, which greatly increases search performance.
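The 50 percent rule of thumb is easy to express as a small check. The `check_heap_size` function below is a hypothetical helper working in megabytes; it only encodes the rule stated above and does not read any ElasticSearch setting.

```python
def check_heap_size(heap_mb, total_ram_mb):
    """Apply the rule of thumb from the text: give the heap no more than half
    of the physical memory, leaving the rest to the operating system disk cache.

    Returns (is_within_limit, limit_in_mb).
    """
    limit_mb = total_ram_mb // 2
    return heap_mb <= limit_mb, limit_mb


# A 4 GB heap on a machine with 16 GB of RAM stays within the limit of 8 GB.
ok, limit = check_heap_size(heap_mb=4096, total_ram_mb=16384)
```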
Let's run our first instance. Go to the bin directory and run the following command from the command line:
The -f option tells ElasticSearch that the program should not be detached from the console and should be run in the foreground. This allows us to see the diagnostic messages generated by the program and stop it by pressing Ctrl + C. The other option is -p, which tells ElasticSearch that the identifier of the process should be written to the file pointed to by this parameter. This file can then be used by additional monitoring software or admin scripts.
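The original command listing is not reproduced in this text, so the following is only a sketch of how the two options described above combine. The script name `./elasticsearch` is an assumption, as is the `elasticsearch_command` helper; only the `-f` and `-p` options come from the description above.

```python
def elasticsearch_command(foreground=True, pid_file=None,
                          script="./elasticsearch"):
    """Build the argument list for starting a node.

    -f keeps the process attached to the console; -p names the file that
    receives the process identifier. The script name is an assumption.
    """
    args = [script]
    if foreground:
        args.append("-f")
    if pid_file is not None:
        args.extend(["-p", pid_file])
    return args


foreground_args = elasticsearch_command()
daemon_args = elasticsearch_command(foreground=False, pid_file="/tmp/es.pid")
```

A list built this way could be handed to `subprocess.Popen` from the bin directory, for example by an admin script that later reads the PID file.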
Congratulations, we now have our ElasticSearch instance up and running! During its work, a server usually uses two port numbers: one for communication with the REST API by using the HTTP protocol and the second one for the transport module used for communication in a cluster. The default port for the HTTP API is 9200, so we can check the search readiness by pointing a web browser at http://127.0.0.1:9200/. The browser should show a code snippet similar to the following:
The output is structured as a JSON (JavaScript Object Notation) object. We will use this notation in more complex requests too. If you are not familiar with JSON, please take a minute and read the article available at http://en.wikipedia.org/wiki/JSON.
Note that ElasticSearch is smart. If the default port is not available, the engine binds to the next free port. You can find information about this on the console, during booting:
Note the fragment with [http]. ElasticSearch uses a few ports for various tasks. The interface that we are using is handled by the HTTP module.
Now we will use the cURL program. For example, our query can be executed as follows:
The -X parameter is a request method. The default value is GET (so, in this example, we can omit this parameter). Do not worry about the GET value for now; we will describe it in more detail later in this chapter.
Note the ?pretty parameter. By default, the API returns information as a JSON object from which the newline characters are omitted. This parameter forces ElasticSearch to add newline characters to the response, making it more human-friendly. You can try running the preceding query with and without the ?pretty parameter to see the difference.
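For readers who prefer a script to cURL, the same kind of request can be prepared with Python's standard library. This is a sketch, not the book's listing: `build_request` is a hypothetical helper, the address is the default one mentioned earlier, and the `sample` dictionary is an invented stand-in for a response body used only to show what pretty-printing changes.

```python
import json
from urllib.request import Request


def build_request(path="/", method="GET", pretty=True,
                  base_url="http://127.0.0.1:9200"):
    """Prepare (but do not send) an HTTP request for the REST API."""
    url = base_url + path + ("?pretty" if pretty else "")
    return Request(url, method=method)


request = build_request()

# What ?pretty changes: the same JSON object, compact versus multi-line.
sample = {"ok": True, "status": 200}  # hypothetical response body
compact = json.dumps(sample)
pretty = json.dumps(sample, indent=2)
```

Passing `request` to `urllib.request.urlopen` would actually send it, which requires a running instance; the two JSON strings carry identical data and differ only in layout.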
ElasticSearch is useful in small and medium-sized applications, but it is built with large installations in mind. So now we will set up our big, two-node cluster. Unpack the ElasticSearch archive in a different directory and run the second instance. If we look into the log, we see something similar to the following:
This means that our second instance (named Orbit) found the previously running instance (named Bova). ElasticSearch automatically formed a new, two-node cluster.
Even though we expect our cluster (or node) to run flawlessly for a lifetime, we may end up needing to restart it or shut it down properly (for example, for maintenance). There are three ways in which we can shut down ElasticSearch:
We will focus on the last method now. It allows us to shut down the whole cluster by executing the following command:
To shut down just a single node, execute the following command:
In the previous command line, BlrmMvBdSKiCeYGsiHijdg is the identifier of a given node. The identifier may be read from the ElasticSearch logs or obtained from another API call:
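The command listings for this section are missing from this text. As a rough sketch, the three calls could be prepared as follows; the URL paths are assumptions recalled from the REST API of the 0.20 era and should be verified against the documentation of the version you run, since later releases changed this part of the API. The helper names are hypothetical.

```python
from urllib.request import Request

BASE_URL = "http://localhost:9200"


def shutdown_request(node_id=None):
    """Prepare a POST asking the whole cluster, or a single node, to shut down.

    The paths are assumptions based on the 0.20-era REST API.
    """
    if node_id is None:
        path = "/_cluster/nodes/_shutdown"
    else:
        path = "/_cluster/nodes/" + node_id + "/_shutdown"
    return Request(BASE_URL + path, method="POST")


def nodes_info_request():
    """Prepare a GET that lists the nodes, including their identifiers."""
    return Request(BASE_URL + "/_cluster/nodes", method="GET")


whole_cluster = shutdown_request()
single_node = shutdown_request("BlrmMvBdSKiCeYGsiHijdg")
```

None of these requests is sent until it is passed to `urllib.request.urlopen`, so the sketch can be explored safely without a running cluster.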
Running an instance in the foreground using the -f option is comfortable for testing or development. In the real world, an instance should be managed by the operating system tools; it should start automatically during system boot and close correctly when the system is shut down. This is simple when using a system like Debian Linux. ElasticSearch has the deb archive available with all the necessary scripts. If you don't use the deb archive, you can always use the ElasticSearch service wrapper (https://github.com/elasticsearch/elasticsearch-servicewrapper), which provides all the needed startup scripts.
ElasticSearch REST API can be used for various tasks. Thanks to it, we can manage indexes, change instance parameters, check nodes and cluster status, index data, and search it. But for now, we will concentrate on using the CRUD (create-retrieve-update-delete) part of the API, which allows us to use ElasticSearch in a similar way to how you would use a NoSQL database.
Before moving on to a description of various operations, a few words about REST itself. In a REST-like architecture, every request is directed to a concrete object indicated by the path part of the address. For example, if /books/ is a reference to a list of books in our library, /books/1 is a reference to the book with the identifier 1. Note that these objects can be nested. /books/1/chapter/6 is the sixth chapter in the first book in the library, and so on. We have the subject of our API call. What about an operation that we would like to execute, such as GET or POST? To indicate that, request types are used. The HTTP protocol gives us quite a long list of request types to use as verbs in the API calls. Logical choices are GET in order to obtain the current state of the requested object, POST for changing the object state, PUT for object creation, and DELETE for destroying an object. There is also a HEAD request that is only used for fetching the base information about an object.
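The mapping between request types and operations can be made concrete with a tiny in-memory model. The `ToyRestStore` class is hypothetical and has nothing to do with the ElasticSearch implementation; it only mirrors the verb semantics described above, with objects represented as dictionaries keyed by path.

```python
class ToyRestStore:
    """An in-memory stand-in for a REST service, keyed by request path."""

    def __init__(self):
        self.objects = {}

    def handle(self, method, path, body=None):
        """Return (status_code, payload) for a single request."""
        if method == "PUT":                      # create (or replace) the object
            status = 200 if path in self.objects else 201
            self.objects[path] = dict(body)
            return status, self.objects[path]
        if path not in self.objects:
            return 404, None
        if method == "GET":                      # obtain the current state
            return 200, self.objects[path]
        if method == "HEAD":                     # base information only, no body
            return 200, None
        if method == "POST":                     # change the object state
            self.objects[path].update(body)
            return 200, self.objects[path]
        if method == "DELETE":                   # destroy the object
            del self.objects[path]
            return 200, None
        return 405, None                         # verb not supported here


store = ToyRestStore()
created = store.handle("PUT", "/books/1", {"title": "ElasticSearch Server"})
```

Subsequent calls such as `store.handle("GET", "/books/1")` or `store.handle("DELETE", "/books/1")` then act on the same object, which is the point of addressing everything by path.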
If we look at the examples of the operations discussed in the Shutting down ElasticSearch section, everything should make more sense:
Now we will check how these operations can be used to store, fetch, alter, and delete data from ElasticSearch.
In ElasticSearch, every piece of data has a defined index and type. You can think about an index as a collection of documents or a table in a database. In contrast to database records, documents added to an index have no defined structure and field types. More precisely, a single field has its type defined, but ElasticSearch can do some magic and guess the corresponding type.
Now we will try to index some documents. For our example, let's imagine that we are building some kind of CMS for our blog. One of the entities in this blog is (surprise!) articles. Using the JSON notation, a document can be presented as shown in the following example:
As we can see, the JSON document contains a set of fields, where each field can have a different form. In our example, we have a number (priority), text (title), and an array of strings (tags). In the next examples, we will show you the other types. As mentioned earlier in this chapter, ElasticSearch can guess these types (because JSON is semi-typed; that is, the numbers are not in quotation marks) and automatically customize the way of storing this data in its internal structures.
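Why a semi-typed format makes guessing possible can be seen in a few lines of Python. The `guess_field_type` function is a toy with simplified rules and hypothetical type names; the real detection in ElasticSearch covers more cases (dates, for instance). The document values are invented for the example.

```python
import json


def guess_field_type(value):
    """Guess a field type from a decoded JSON value (toy rules only)."""
    if isinstance(value, bool):   # test bool first: in Python, bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):   # multivalued field: look at the first element
        return guess_field_type(value[0]) if value else "unknown"
    if isinstance(value, dict):
        return "object"
    return "unknown"


raw = '{"priority": 10, "title": "Hello", "tags": ["a", "b"]}'
guessed = {name: guess_field_type(value)
           for name, value in json.loads(raw).items()}
```

Because `10` is written without quotation marks it is decoded as a number, whereas `"10"` would be treated as text; this is all the information a guess of this kind has to work with.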
Now we want to store this record in the index and make it available for searching. Choosing the index name as blog and type as article, we can do this by executing the following command:
You can notice a new option to cURL, -d. The parameter value of this option is the text that should be used as a request payload—a request body. This way we can send additional information such as a document definition.
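A request of this shape can also be prepared in Python. The sketch below is not the book's listing: `index_request` is a hypothetical helper, the host and port are the defaults used elsewhere in the chapter, and the field values are invented. Only the index name (blog), the type name (article), and the field names come from the text.

```python
import json
from urllib.request import Request


def index_request(index, doc_type, doc_id, document,
                  base_url="http://localhost:9200"):
    """Prepare a PUT that stores `document` under /index/type/id."""
    url = "{}/{}/{}/{}".format(base_url, index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")  # the payload curl sends with -d
    return Request(url, data=body, method="PUT")


article = {  # hypothetical values
    "priority": 10,
    "title": "ElasticSearch is released",
    "tags": ["announce", "release"],
}
request = index_request("blog", "article", 1, article)
```

The document travels in the request body, while the index, the type, and the identifier are all part of the URL.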
Note that the unique identifier is placed in the URL, not in the body. If you omit this identifier, ElasticSearch returns an error similar to the following:
If everything is correct, the server will answer with a JSON response similar to this:
In the preceding reply, ElasticSearch includes information about the status of the operation and shows where the new document was placed. There is information about the document's unique identifier and current version, which will be incremented automatically by ElasticSearch every time the document changes.
In the above example, we've specified the document identifier ourselves. But ElasticSearch can generate this automatically. This seems very handy, but only when an index is the only source of data. If we use a database for storing data and ElasticSearch for full text searching, synchronization of this data will be hindered unless the generated identifier is stored in the database as well. Generation of a unique key can be achieved by using the following command:
Notice the use of POST instead of PUT.
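The difference between the two request types can be sketched as follows. As before, this is an illustration under assumptions rather than the original listing: `auto_id_index_request` is a hypothetical helper, and the host, port, and document content are invented.

```python
import json
from urllib.request import Request


def auto_id_index_request(index, doc_type, document,
                          base_url="http://localhost:9200"):
    """Prepare a POST to /index/type/ so that the server picks the identifier."""
    url = "{}/{}/{}/".format(base_url, index, doc_type)
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="POST")


request = auto_id_index_request("blog", "article", {"title": "Untitled"})
```

The URL ends at the type name, so the generated identifier has to be read from the server's response if it needs to be stored elsewhere, for example in a database.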
