Mastering Apache Solr 7.x - Sandeep Nair - E-Book

Description

Accelerate your enterprise search engine and bring relevancy in your search analytics

Key Features

  • A practical guide to building expertise in indexing, faceting, clustering, and pagination
  • Master the management and administration of enterprise search applications and services seamlessly
  • Handle multiple data input formats, such as JSON, XML, PDF, DOC, XLS, PPT, CSV, and much more

Book Description

Apache Solr is the only standalone enterprise search server with a REST-like application interface, providing highly scalable, distributed search and index replication for many of the world's largest internet sites.

To begin with, you will be introduced to performing full-text search, multiple-filter search, dynamic clustering, and so on, helping you brush up on the basics of Apache Solr. You will also explore the new features and advanced options released in Apache Solr 7.x, which improve numerous performance aspects and make data investigation simpler, easier, and more powerful. You will learn to build complex queries and extensive filters, and understand how they are compiled in your system to bring relevance to your search tools. You will learn to carry out Solr scoring, understand the elements affecting the document score, and optimize or tune the score for the application at hand. You will learn to extract features of documents and write complex queries for re-ranking documents. You will also learn advanced options that help you control what content is indexed and how the extracted content is indexed. Throughout the book, you will work through complex problems with solutions, along with varied approaches to tackle your business needs.

By the end of this book, you will gain advanced proficiency to build out-of-the-box smart search solutions for your enterprise demands.

What you will learn

  • Design schemas using the Schema API to access data in the database
  • Apply advanced querying and fine-tuning techniques for better performance
  • Get to grips with indexing using Client API
  • Set up a fault tolerant and highly available server with newer distributed capabilities, SolrCloud
  • Explore Apache Tika to upload data with Solr Cell
  • Understand different data operations that can be done while indexing
  • Master advanced querying through Velocity Search UI, faceting and Query Re-ranking, pagination and spatial search
  • Learn to use JavaScript, Python, SolrJ and Ruby for interacting with Solr

Who this book is for

The book will rightly appeal to developers, software engineers, data engineers, and database architects who are building, or seeking to build, effective enterprise-wide search engines for business intelligence. Prior experience of Apache Solr or Java programming is a must to get the best out of this book.



Page count: 275

Publication year: 2018




Mastering Apache Solr 7.x

An expert guide to advancing, optimizing, and scaling your enterprise search

Sandeep Nair
Chintan Mehta
Dharmesh Vasoya

BIRMINGHAM - MUMBAI

Mastering Apache Solr 7.x

 

Copyright © 2018 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Aman Singh
Content Development Editor: Aishwarya Pandere
Technical Editor: Dinesh Pawar
Copy Editor: Vikrant Phadkay
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

First published: February 2018

Production reference: 1160218

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78883-738-5

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

Sandeep Nair has more than 11 years of experience of Java and Java EE technologies. His keen interest is in developing enterprise solutions using the Liferay platform, and he has been doing so for the past 9 years. He has executed projects using Liferay across various verticals, providing solutions for collaboration, enterprise content management, and web content management systems. He is also experienced with Java and Java EE.

He has authored Liferay Beginner’s Guide and Instant Liferay Portal 6 Starter.

Travel, food, and books are his passions, besides coding.

I would like to thank  my dad, Uday, for believing in me; my mom, Savithri, for loving me the most in the world; my brother, Sanju, who cheers me up at every moment of my life; my wife, Iesha, all I will say is I Love You; and finally, that little life that may come to this world hopefully before this book gets published. Thanks to KNOWARTH, my coauthors, and the entire Packt team, especially Aishwarya, for making this happen.


Chintan Mehta is a cofounder of KNOWARTH Technologies and heads the cloud/RIMS/DevOps team. He has rich, progressive experience in server administration of Linux, AWS Cloud, DevOps, and RIMS, and server administration on open source technologies. He is also an AWS Certified Solutions Architect.

Chintan has authored MySQL 8 for BigData, MySQL 8 Administrator's Guide, and Hadoop Backup and Recovery Solutions, and has reviewed Liferay Portal Performance Best Practices and Building Serverless Web Applications.

I would like to thank my coauthors. I would like to especially thank my wonderful wife, Mittal, and my sweet son, Devam, for putting up with the long days, nights, and weekends when I was camped out in front of my laptop. Last but not least, I want to thank my Mom and Dad, friends, family, and colleagues for supporting me throughout.

Dharmesh Vasoya is a Liferay 6.2 certified developer. He has 5.5 years of experience in application development with technologies such as Java, Liferay, Spring, Hibernate, Portlet, and JSF. He has successfully delivered projects in various domains, such as healthcare, collaboration, communication, and enterprise CMS, using Liferay.

Dharmesh has a good command of the configuration and setup of servers such as Solr, Tomcat, JBoss, and Apache Web Server. He has good experience in clustering, load balancing, and performance tuning. He completed his MCA at Ahmedabad University.

I had a wonderful experience as an author of this book. It was an awesome journey and I would like to thank each and every contributor. I would like to thank all my coauthors, Packt team members, and all the reviewers for their great support and effort. Big thanks to my family members for their great support during the entire course of writing.

About the reviewers

Marcelo Ochoa works at the system laboratory of Facultad de Ciencias Exactas of Universidad Nacional del Centro de la Provincia de Buenos Aires and is the CTO at Scotas. He has worked on several Oracle-related projects, such as translating Oracle manuals and multimedia CBTs. Since 2006, he has been part of an Oracle ACE program and was recently inducted into a Docker mentor program.

He has coauthored Oracle Database Programming using Java and Web Services by Digital Press and Professional XML Databases by Wrox Press. He has been a technical reviewer on several Packt Publishing books.

Krunal Patel has been working on the Liferay portal for 5+ years and has 9+ years of experience in enterprise application development using Java and Java EE. He has also executed enterprise CMS projects using Solr, Apache web server, and Apache Lucene. He has good experience in setup and configuration of servers (Solr, Tomcat, JBOSS, and Jenkins (CI)), performance tuning, LDAP integration, and so on. He has an ITIL Foundation certification in IT service management, Liferay 6.1 Developer certification, Brainbench Java 6 certification, and MongoDB for Java Developers certification.


Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Mastering Apache Solr 7.x

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Introduction to Solr 7

Introduction to Solr

History of Solr

Lucene – the backbone of Solr

Why choose Solr?

Benefits of keyword search

Benefits of ranked results

Solr use cases

Social media

Science and research

Search engine

E-commerce

Media and entertainment

Government

Education

What's new in Solr 7?

Replication for SolrCloud

TLOG replicas

PULL replicas

Schemaless improvements

Autoscaling

Default numeric types

Spatial fields

SolrJ

JMX and MBeans

Other changes

Summary

Getting Started

Solr installation

Understanding various files and the folder structure

bin

Solr script

Post script

contrib

DataImportHandler

ContentExtractionLibrary

LanguageIdentifier

Clustering

VelocityIntegration

dist and docs

example

core.properties

zoo.cfg

solr.xml

server

Running Solr

Running basic Solr commands

Production Solr setup

Loading sample data

Loading data from MySQL

Understanding the browse interface

Using the Solr admin interface

Dashboard

Logging

Cloud screens

Tree view

Graph view

Collections or core admin

Java properties

Thread dump

Collection-specific tools

Overview

Analysis

DataImport

Documents

Files

Query

Stream

Schema

Core-specific tools

Summary

Designing Schemas

How Solr works

Getting started with Solr's basics

The schema file of Solr

Understanding field types

Definitions and properties of field types

Field type properties

Field types available in Solr

Understanding date fields

Understanding currencies and exchange rates 

Understanding enum fields

Field management

Field properties

Copying fields

Dynamic fields

Mastering Schema API

Schema API in detail

Schema operations

Listing fields, field types, DynamicFields, and CopyField rules

Deciphering schemaless mode

Creating a schemaless example

Schemaless mode configuration

Managed schema

Field guessing

Summary

Mastering Text Analysis Methodologies

Understanding text analysis

What is text analysis?

How text analysis works

Understanding analyzer

What is an analyzer?

How an analyzer works

Understanding tokenizers

What is a tokenizer?

Available tokenizers in Solr

Standard tokenizer

White space tokenizer

Classic tokenizer

Keyword tokenizer

Lower case tokenizer

Letter tokenizer

N-gram tokenizer

Edge n-gram tokenizer

Understanding filters

What is a filter?

Available filters in Solr

Stop filter

Classic filter

Synonym filter

Synonym graph filter

ASCII folding filter

Keep word filter

KStem filter

KeywordMarkerFilterFactory

Word delimiter graph filter 

Understanding CharFilter

Understanding PatternReplaceCharFilterFactory

Understanding multilingual analysis

Language identification

Configuring Solr for multiple language search

Creating separate fields per language

Creating separate indexes per language

Understanding phonetic matching

Understanding Beider-Morse phonetic matching

Summary

Data Indexing and Operations

Basics of Solr indexing

Installing Postman

Exploring the post tool

Understanding index handlers

Working with an index handler with the XML format

Index handler with JSON

Apache Tika and indexing

Solr Cell basics

Indexing a binary using Tika

Language detection 

Language detection configuration

Client APIs 

Summary

Advanced Queries – Part I

Search relevance

Velocity search UI

Query parsing and syntax

Common query parameters

Standard query parser

Advantage

Disadvantage

Searching terms for standard query parser

Term modifiers

Wildcard searches

Fuzzy searches

Proximity searching 

Range searches

Boolean operators

Escaping special characters

Grouping terms

Dates and times in query strings

Adding comments to the query string

The DisMax Query Parser

Advantages

DisMax query parser parameters

eDisMax Query Parser

Response writer

JSON

Standard XML

CSV

Velocity

Faceting

Common parameters

Field-value faceting parameters

Range faceting

Pivot faceting

Interval faceting

Highlighting

Highlighting parameters

Highlighter

Unified highlighter (hl.method=unified)

Original highlighter (hl.method=original) 

FastVector highlighter (hl.method=fastVector)

Boundary scanners

The breakIterator boundary scanner

The simple boundary scanner

Summary

Advanced Queries – Part II

Spellchecking

Spellcheck parameters

Implementation approaches

IndexBasedSpellChecker

DirectSolrSpellChecker

FileBasedSpellChecker

WordBreakSolrSpellChecker

Distributed spellcheck

Suggester

Suggester parameters

Running suggestions

Pagination

How to implement pagination

Cursor pagination

Result grouping

Result grouping parameters

Running result grouping

Result clustering

Result clustering parameters

Result clustering implementation

Install the clustering contrib

Declare the cluster search component

Declare the request handler and include the cluster search component

Spatial search

Spatial search implementation

Field types

Query parser

Spatial search query parser parameters

Function queries

Summary

Managing and Fine-Tuning Solr

JVM configuration

Managing the memory heap 

Managing solrconfig.xml

User-defined properties

Implicit Solr core properties

Managing backups

Backup in SolrCloud

Standalone mode backups

Backup API

Backup status

API to restore

Restore status API

Snapshot API

JMX with Solr

JMX configuration

Logging configuration

Log settings using the admin web interface

Log level at startup

Setting the environment variable

Passing parameters in the startup script

Configuring Log4J for logging

SolrCloud overview

SolrCloud in interactive mode

SolrCloud – core concepts

Routing documents

Splitting shards

Setting up ignore commits from client applications

Enabling SSL – Solr security

Prerequisites

Generating a key and self-signed certificate

Starting Solr with SSL system properties

Performance statistics

Statistics for request handlers

Summary

Client APIs – An Overview

Client API overview

JavaScript Client API

SolrJ Client API

Ruby Client API

Python Client API

Summary

Preface

In today's digital enterprise world, every business has complex search requirements. With big data coming into the picture, the volume of data on which search filters have to be applied has massively increased. It becomes absolutely crucial to have an enterprise search platform that caters to your enterprise application.

Solr is a leading open source Java-based enterprise search platform that has been adopted by many organizations. It offers a plethora of features, such as handling rich documents, faceted search, and full-text searching, to name a few.

With the recent release of Solr 7, the arsenal of features that Solr provides has widened. We hope that this book will provide you with everything you need to not only learn but also master the various features and functionalities that Solr provides. We believe you will enjoy reading this as much as we did writing it. Happy learning!

Who this book is for

This book is for anyone who wants to not only learn Solr 7.0 but also understand various advanced concepts of Solr. By the time you finish this book, you'll understand why you should search on an enterprise search platform such as Solr.

What this book covers

Chapter 1, Introduction to Solr 7, gets you acquainted with what Solr is all about and explains why you should use Solr.

Chapter 2, Getting Started, shows you how to set up Solr and how everything is laid out under the Solr umbrella.

Chapter 3, Designing Schemas, takes us through schema design using the schema API and gives an understanding of schemaless mode.

Chapter 4, Mastering Text Analysis Methodologies, shows us features related to text analysis, tokenizers, filters, and analyzers.

Chapter 5, Data Indexing and Operations, teaches us how to use the client API to do indexing. We also learn about index handlers.

Chapter 6, Advanced Queries – Part I, looks at querying Solr, velocity search UI, relevance, query parsing, faceting, and highlighting.

Chapter 7, Advanced Queries – Part II, continues where the last chapter ended. We go through suggester, pagination, result grouping, clustering, and spatial search.

Chapter 8, Managing and Fine-Tuning Solr, shows how to make Solr ready for production.

Chapter 9, Client APIs – An Overview, gives an overview of the various APIs that are available for JavaScript, Ruby, Python, and Java to interact with Solr. 

To get the most out of this book

It would be great if you know a bit of Java, but it is not mandatory, as this book will teach you from the ground up.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Solr-7x. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/MasteringApacheSolr7x_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Also, the PATH variable should point to JRE 1.8."

A block of code is set as follows:

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

<field column="category_id" name="category_id" />
<field column="category_name" name="category_name" />
<field column="remarks" name="remarks" />

Any command-line input or output is written as follows:

brew install solr

solr start

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Go to the Query screen; at the bottom, click on facet."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Introduction to Solr 7

Today we are in the age of digitization. People are generating data in different ways: they take pictures, upload images, write blogs, comment on someone's blog or picture, change their status on social networking sites, tweet on Twitter, update details on LinkedIn, do financial transactions, write emails, store data on the cloud, and so on. Data size has grown not only in the personal space but also in professional services, where people have to deal with a humongous amount of data. Think of the data managed by players such as Google, Facebook, the New York Stock Exchange, Amazon, and many others. For this data tsunami, we need the appropriate tools to fetch data, in an organized way, that can be used in various fields, such as scientific research, real-time traffic, fighting crime, fraud detection, digital personalization, and so on. All of this data needs to be captured, stored, searched, shared, transferred, analyzed, and visualized.

Analyzing structured, unstructured, or semi-structured ubiquitous data helps us discover hidden patterns, market trends, correlations, and personal preferences. With the help of the right tools to process and analyze data, organizations can expect much better marketing plans, additional revenue opportunities, improved customer services, healthier operational efficiency, competitive benefits, and much more. It is important not only to store data but also to process it in order to generate the information that is needed. Every company collects data and uses it; however, to flourish more effectively, a company needs to be able to search for relevant data. Every company must derive value directly from search over its data, which can improve its business either directly or indirectly.

Okay, now you have Solr, which is generally referred to as a search server, and you are doing searches. Is that all you need? Hold on! Solr allows a lot more than simple search. So get ready and hold your breath to take a deep dive into Solr: a scalable, flexible, enterprise NoSQL search platform!

We will go through the following topics in this chapter:

Introduction to Solr

Why Solr?

Solr use cases

What's new in Solr 7

Introduction to Solr

Solr is one of the most popular enterprise search servers and is widely used across the world. It is written in Java and uses the Lucene Java search library. Solr is an open source project from the Apache Software Foundation (ASF) and is amazingly fast, scalable, and ideal for searching relevant data. Some of the major Solr users are Netflix, SourceForge, Instagram, CNET, and Flipkart. You can check out more such use cases at https://wiki.apache.org/solr/PublicServers.

Some of the features included are as follows:

Full-text search

Faceted search

Dynamic clustering

GEO search

Hit highlighting

Near-real-time indexing

Rich document handling

Geospatial search

Structured Query Language (SQL) support

Textual search

REST API

JSON, XML, PHP, Ruby, Python, XSLT, velocity, and custom Java binary output formats over HTTP

GUI admin interface

Replication

Distributed search

Caching of queries, documents, and filters

Auto-suggest

Streaming

Many more features

Solr has enabled many such Internet sites, government sites, and Intranet sites too, providing solutions for e-commerce, blogs, science, research, and so on. Solr can index billions of documents/rows via XML, JSON, CSV, or HTTP APIs. It can secure your data with the help of authentication and can be drilled down to role-based authentication. Solr is now an integral part of many big data solutions too.

History of Solr

Doug Cutting created Lucene in 2000, which is the core technology behind Solr.

Solr was created in 2004 by Yonik Seeley at CNET Networks as a homegrown project to provide search capability for the CNET Networks website.

Later in 2006, CNET Networks published the Solr source code to ASF. By early 2007, Solr had found its place in some of the top projects. It was then that Solr kept on adding new features to attract customers and contributors.

Solr 1.3 was released in September 2008. It included major performance enhancements and features such as distributed search.

In January 2009, Yonik Seeley, Grant Ingersoll, and Erik Hatcher joined Lucidworks; they are the prime faces of Solr and enterprise search. Lucidworks started providing commercial support and training for Solr.

Solr 1.4 was released in November 2009. Solr had never stopped providing enhancements; 1.4 was no exception, with indexing, searching, faceting, rich document processing, database integration, plugins, and more.

Sometime in 2010, the Lucene and Solr projects were merged; Solr then became an integral subproject of Lucene. Solr downloads were still available separately; however, it was developed by the same set of contributors. In 2011, Solr versioning was revised to match the versions of Lucene, and Solr was marked as 3.1.

Solr 4.0 was released in October 2012, which introduced the SolrCloud feature. There were a number of follow-ups released over a couple of years in the 4.x line. Solr kept on adding new features, becoming more scalable and further focusing on reliability.

Solr 5.0 was released in February 2015. It was with this release that official support for the WAR bundle package ended. It was packaged as a standalone application. And later, in version 5.3, it also included an authentication and authorization framework.

Solr 6.0 was released in April 2016. It included support for executing parallel SQL queries across SolrCloud. It also included stream expression support and JDBC driver for the SQL interface.

Finally, Solr 7.0 was released in September 2017, followed by 7.1.0 in October 2017, as shown in the following diagram. We will discuss the new features as we move ahead in this chapter, in the What's new in Solr 7? section.

We have depicted the history of Solr in the preceding image for a much better view and understanding.

So by now, we have a brief understanding of Solr, along with its history. We must also have a good understanding of why we should be using Solr. Let's get the answer to this question too.

Lucene – the backbone of Solr

Lucene is an open source project that provides text search engine libraries. It is widely adopted for many search engine technologies. It has strong community contributions, which makes it much stronger as a technology backend. Lucene is a simple code library that you can use to write your own code by using the API available for searching, indexing, and much more.

For Lucene, a document consists of a collection of fields; these are name-value pairs whose values are text or numbers. Lucene can be configured with a text analyzer that tokenizes a field's text into a series of words, and it can do further processing, such as substituting synonyms or other similar transformations. Lucene stores its index on the disk of the server; the index is an inverted index that maps each term to the documents containing it, along with the position of the word within the text of each document. Once you have the index in place, you can search for documents by supplying a query string, which Lucene parses accordingly. Lucene computes a score for each matching document, and the highest-scoring documents are displayed.
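The analyze-then-index flow just described can be sketched in a few lines of Python. This is a toy model for intuition only, not Lucene's actual API; the synonym map and sample documents are invented for illustration:

```python
# Toy model of Lucene-style analysis: tokenize, lowercase, expand synonyms,
# then build an inverted index mapping each term to (doc_id, position) pairs.
import re
from collections import defaultdict

SYNONYMS = {"quick": ["fast"]}  # hypothetical synonym map

def analyze(text):
    """Tokenize into lowercase words, injecting synonyms at the same position."""
    tokens = []
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        tokens.append((word, pos))
        for syn in SYNONYMS.get(word, []):
            tokens.append((syn, pos))
    return tokens

def build_index(docs):
    index = defaultdict(list)  # term -> list of (doc_id, position)
    for doc_id, text in docs.items():
        for term, pos in analyze(text):
            index[term].append((doc_id, pos))
    return index

def search(index, query):
    """Score each document by how many query terms it contains, highest first."""
    scores = defaultdict(int)
    for term, _ in analyze(query):
        for doc_id, _ in index[term]:
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "The quick brown fox", 2: "A lazy brown dog"}
index = build_index(docs)
print(search(index, "fast fox"))  # matches doc 1 via the "quick" -> "fast" synonym
```

A real Lucene index adds term dictionaries, postings compression, and far more sophisticated scoring, but the shape is the same: analyze text into a token stream, map each term to the documents and positions where it occurs, then score and rank the matches.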

Why choose Solr?

If we already have a relational database, then why should we use Solr? It's simple: if your use case requires search, you need a search engine platform like Solr. We will discuss various use cases later in this chapter.

Databases and Solr each have their own pros and cons. In a database, SQL supports only limited wildcard-based text search with some basic normalization, such as matching uppercase to lowercase, and such a query can be costly because it performs a full table scan. In Solr, by contrast, the searchable words are stored in an inverted index, which makes searches much faster than traditional database scans.
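The performance contrast can be illustrated with a toy sketch: a `LIKE '%term%'`-style scan must examine every row on every query, while an inverted-index lookup pays the indexing cost once and then answers each query with a single dictionary access. The rows here are hypothetical:

```python
# Hypothetical table of text rows.
rows = {
    1: "red leather wallet",
    2: "blue canvas backpack",
    3: "red cotton backpack",
}

def full_scan(term):
    # Analogous to SQL's LIKE '%term%': every row is examined per query.
    return [row_id for row_id, text in rows.items() if term in text]

# Build the inverted index once, up front.
index = {}
for row_id, text in rows.items():
    for term in text.split():
        index.setdefault(term, []).append(row_id)

def index_lookup(term):
    # Each query is now a single dictionary lookup.
    return index.get(term, [])

print(full_scan("backpack"))     # [2, 3]
print(index_lookup("backpack"))  # [2, 3]
```

Both approaches return the same rows, but the scan's cost grows with the size of the table, whereas the lookup's cost depends only on the number of matching documents.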

Let's look at the following diagram to understand this better:

Having an enterprise search engine solution is a must for an organization nowadays; it plays a prominent role in retrieving information quickly through search. Not having such a platform can result in missing information, lost productivity, and duplicated effort, simply because the right information isn't available quickly. Working without search is something we can't even imagine anymore. Most such use cases comprise the following key requirements:

Collected data should be parsed and indexed; parsing and indexing are core requirements of any enterprise search engine platform.

A search should return the required results in near real time on the required datasets; performance and relevance are two more key requirements.

The search engine platform should be able to crawl or collect all of the data that it would require to perform the search.

Integration of the search engine along with administration, monitoring, log management, and customization is something that we would be expecting.

Solr has been designed to have a powerful and flexible search that can be used by applications; whenever you want to serve data based on search patterns, Solr is the right fit.

Here is a high-level diagram that shows how Solr is integrated with an application:

The majority of popular websites, including many intranet sites, have integrated search solutions to help users find relevant information quickly. User experience is a key element of any solution we develop, and search is one of the major features that cannot be ignored when we talk about user experience.

Benefits of keyword search

One of the basic needs a search engine should support is keyword search, as that's the primary goal behind a search engine platform. In fact, it is the first thing a user will try. Keyword search is the most common technique on our websites, both for the search engine and for end users. It is a common expectation nowadays to punch in a few keywords and quickly retrieve the relevant results; what happens in the backend is something we need to take care of so that the user experience doesn't deteriorate. Let's look at a few areas we must consider in order to provide better outcomes from a search engine platform using Solr:

Relevant search with quick turnaround

Auto-correct spelling

Auto-suggestions

Synonyms

Multilingual support

Phrase handling—an option to search for a specific keyword or all keywords in a phrase provided

Expanded results if the user wants to view something beyond the top-ranked results

These features can be easily managed by Solr; so our next challenge is to provide relevant results with improved user experience.
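One of the features listed above, synonym handling, is usually applied at query time: the user's terms are expanded with equivalent words before the index is consulted. The sketch below uses a hypothetical synonym map; in Solr itself this is typically configured through a synonym filter in the field type's analysis chain rather than in application code:

```python
# Hypothetical synonym map; in Solr this would live in a synonyms file
# referenced by the field type's analyzer configuration.
synonyms = {
    "tv": {"television"},
    "television": {"tv"},
}

def expand_query(terms):
    """Return the query terms plus any configured synonyms."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= synonyms.get(term, set())
    return expanded

print(sorted(expand_query(["cheap", "tv"])))  # ['cheap', 'television', 'tv']
```

With the expanded term set, a document mentioning only "television" still matches a user who searched for "tv", which is exactly the behavior users expect from the synonym support described above.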

Benefits of ranked results

Solr is not limited to finding relevant results for a user's search; presenting the end user with the most relevant results, properly sorted, is just as important. In SQL, we would find rows matching a pattern and sort them by one or more columns in ascending or descending order. Similarly, Solr sorts the retrieved result set, by default using a score that reflects how strongly each document matches the search pattern.

Ranked results are very important, primarily because the volume of data a search engine platform has to dig through is huge. Without ranking, the result set would lack relevance and contain far too much data to feasibly display. The other important aspect is user experience: we have all come to expect a search engine to return relevant results from just a few keywords. We are getting restless, aren't we? Yet we expect the search platform to oblige and return relevant, ranked results from those few keywords, and we are not even talking about Google search here! For users like us, Solr can address such situations by boosting rankings based on various criteria: fields, terms, document name, and a few more. The ranking of a result set can vary based on many factors, but a higher rank is generally based on relevance to the search pattern. We can also boost on additional criteria, such as a gender field, so that certain documents rank at the top.
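The idea of scoring and sorting can be sketched with a deliberately naive relevance function: count how many query terms each document contains and order by that count, the way SQL orders by a column with ORDER BY ... DESC. Real Solr scoring (BM25 by default in recent versions) is far more sophisticated, and the documents here are hypothetical:

```python
# Hypothetical documents to rank.
docs = {
    "a": "solr search engine",
    "b": "search engine platform with faceted search",
    "c": "database administration guide",
}

def rank(query):
    """Score each document by query-term occurrences; highest score first."""
    terms = query.split()
    scored = []
    for doc_id, text in docs.items():
        words = text.split()
        score = sum(words.count(term) for term in terms)
        if score:  # only matching documents appear in the results
            scored.append((doc_id, score))
    # Analogous to ORDER BY score DESC in SQL.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank("search engine"))  # [('b', 3), ('a', 2)]
```

Document "b" outranks "a" because it mentions "search" twice; document "c" is excluded entirely, which is the "no control on ranked results" problem solved: only relevant documents appear, best matches first.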

Solr use cases

Solr is widely accepted and used by big companies such as Netflix, Disney, Instagram, The Guardian, and many more. Let us look at a few use cases to see the real-life impact Solr has made in well-known scenarios.

For an extended but incomplete list of use cases and sites that leverage Solr, you can refer to the official web page of Solr at https://wiki.apache.org/solr/PublicServers:

This diagram helps us understand Solr as a solution serving various industries. Though this is not an exhaustive list of industries where Solr plays a prominent role in business decisions, let's discuss a few of them.