Natural Language Processing with Java Cookbook - Richard M. Reese - E-Book


Richard M. Reese

Description

A problem-solution guide to tackling various NLP tasks using Java open source libraries and cloud-based solutions




Key Features





  • Perform simple-to-complex NLP text-processing tasks using modern Java libraries

  • Extract relationships between different text complexities using a problem-solution approach


  • Utilize cloud-based APIs to perform machine translation operations





Book Description



Natural Language Processing (NLP) has become one of the prime technologies for processing very large amounts of unstructured data from disparate information sources. This book includes a wide set of recipes and quick methods that solve challenges in text syntax, semantics, and speech tasks.






At the beginning of the book, you'll learn important NLP techniques, such as identifying parts of speech, tagging words, and analyzing word semantics. You will learn how to perform lexical analysis and use machine learning techniques to speed up NLP operations. With independent recipes, you will explore techniques for customizing your existing NLP engines/models using Java libraries such as OpenNLP and the Stanford NLP library. You will also learn how to use NLP processing features from cloud-based sources, including Google and Amazon's AWS. You will master core tasks, such as stemming, lemmatization, part-of-speech tagging, and named entity recognition. You will also learn about sentiment analysis, semantic text similarity, language identification, machine translation, and text summarization.






By the end of this book, you will be ready to become a professional NLP expert, using a problem-solution approach to analyze any sort of text, from sentences to word semantics.




What you will learn





  • Explore how to use tokenizers in NLP processing


  • Implement NLP techniques in machine learning and deep learning applications


  • Identify sentences within the text and learn how to train specialized NER models


  • Learn how to classify documents and perform sentiment analysis


  • Find semantic similarities between text elements and extract text from a variety of sources


  • Preprocess text from a variety of data sources


  • Learn how to identify and translate languages



Who this book is for



This book is for data scientists, NLP engineers, and machine learning developers who want to build linguistic applications faster using popular libraries on the JVM. This book will help you build real-world NLP applications using a recipe-based approach. Prior knowledge of Natural Language Processing basics and Java programming is expected.

You can read this e-book in Legimi apps or in any app that supports the following format:

EPUB

Page count: 323

Publication year: 2019




Natural Language Processing with Java Cookbook

Over 70 recipes to create linguistic and language translation applications using Java libraries

Richard M. Reese

BIRMINGHAM - MUMBAI

Natural Language Processing with Java Cookbook

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Nelson Morris
Content Development Editor: Ronnel Mathew
Technical Editor: Dinesh Pawar
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Tom Scaria
Production Coordinator: Jayalaxmi Raja

First published: April 2019

Production reference: 1220419

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78980-115-6

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the author

Richard M. Reese has worked in both industry and academia. For 17 years, he worked in the telephone and aerospace industries, serving in several capacities, including research and development, software development, supervision, and training. He currently teaches at Tarleton State University, where he has the opportunity to apply his years of industry experience to enhance his teaching. Richard has written several Java books and a C pointer book. He uses a concise and easy-to-follow approach to the topics at hand. His Java books have addressed EJB 3.1, updates to Java 7 and 8, certification, jMonkeyEngine, natural language processing, functional programming, networks, and data science.

About the reviewer

Jennifer L. Reese studied computer science at Tarleton State University. She also earned her M.Ed. from Tarleton in December 2016. She currently teaches computer science to high-school students. Her interests include the integration of computer science concepts with other academic disciplines, increasing diversity in computer science courses, and the application of data science to the field of education. She has co-authored two books: Java for Data Science, and Java 7 New Features Cookbook. She previously worked as a software engineer. In her free time, she enjoys reading, cooking, and traveling—especially to any destination with a beach. She is a musician and appreciates a variety of musical genres.

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Natural Language Processing with Java Cookbook

About Packt

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Get in touch

Reviews

Preparing Text for Analysis and Tokenization

Technical requirements

Tokenization using the Java SDK

Getting ready

How to do it...

How it works...

Tokenization using OpenNLP

Getting ready

How to do it...

How it works...

See also

Tokenization using maximum entropy

Getting ready

How to do it...

How it works...

See also

Training a neural network tokenizer for specialized text

Getting ready

How to do it...

How it works...

There's more...

See also

Identifying the stem of a word

Getting ready

How to do it...

How it works...

See also

Training an OpenNLP lemmatization model

Getting ready

How to do it...

How it works...

There's more...

See also

Determining the lexical meaning of a word using OpenNLP

Getting ready

How to do it...

How it works...

See also

Removing stop words using LingPipe

Getting ready

How to do it...

How it works...

See also

Isolating Sentences within a Document

Technical requirements

Finding sentences using the Java core API

Getting ready

How to do it...

How it works...

See also

Performing SBD using the BreakIterator class

Getting ready

How to do it...

How it works...

There's more...

See also

Using OpenNLP to perform SBD

Getting ready

How to do it...

How it works...

There's more...

See also

Using the Stanford NLP API to perform SBD

Getting ready

How to do it...

How it works...

There's more...

See also

Using LingPipe and chunking to perform SBD

Getting ready

How to do it...

How it works...

There's more...

See also

Performing SBD on specialized text

Getting ready

How to do it...

How it works...

See also

Training a neural network to perform SBD with specialized text

Getting ready 

How to do it...

How it works...

See also

Performing Named Entity Recognition

Technical requirements

Using regular expressions to find entities

Getting ready

How to do it...

How it works...

There's more...

See also

Using chunks with regular expressions to identify entities

Getting ready

How to do it...

How it works...

There's more...

See also

Using OpenNLP to find entities in text

Getting ready

How to do it...

How it works...

There's more...

See also

Isolating multiple entity types

Getting ready

How to do it...

How it works...

See also

Using a CRF model to find entities in a document

Getting ready

How to do it...

How it works...

There's more...

See also

Using a chunker to find entities

Getting ready

How to do it...

How it works...

See also

Training a specialized NER model

Getting ready

How to do it...

How it works...

See also

Detecting POS Using Neural Networks

Technical requirements

Finding POS using tagging

Getting ready

How to do it...

How it works...

There's more...

See also

Using a chunker to find POS

Getting ready

How to do it...

How it works...

There's more...

See also

Using a tag dictionary

Getting ready

How to do it...

How it works...

There's more...

See also

Finding POS using the Penn Treebank

Getting ready

How to do it...

How it works...

There's more...

See also

Finding POS from textese

Getting ready

How to do it...

How it works...

There's more...

See also

Using a pipeline to perform tagging

Getting ready

How to do it...

How it works...

See also

Using a hidden Markov model to perform POS

Getting ready

How to do it...

How it works...

There's more...

See also

Training a specialized POS model

Getting ready

How to do it...

How it works...

See also

Performing Text Classification

Technical requirements

Training a maximum entropy model for text classification

Getting ready

How to do it...

How it works...

See also

Classifying documents using a maximum entropy model

Getting ready

How to do it...

How it works...

There's more...

See also

Classifying documents using the Stanford API

Getting ready

How to do it...

How it works...

There's more...

See also

Training a model to classify text using LingPipe

Getting ready

How to do it...

How it works...

See also

Using LingPipe to classify text

Getting ready

How to do it...

How it works...

There's more...

See also

Detecting spam

Getting ready

How to do it...

How it works...

There's more...

See also

Performing sentiment analysis on reviews

Getting ready

How to do it...

How it works...

There's more...

See also

Finding Relationships within Text

Technical requirements

Displaying parse trees graphically

Getting ready

How to do it...

How it works...

There's more...

See also

Using probabilistic context-free grammar to parse text

Getting ready

How to do it...

How it works...

There's more...

See also

Using OpenNLP to generate a parse tree

Getting ready

How to do it...

How it works...

There's more...

See also

Using the Google NLP API to parse text

Getting ready

How to do it...

How it works...

There's more...

See also

Identifying parent-child relationships in text

Getting ready

How to do it...

How it works...

There's more...

See also

Finding co-references in a sentence

Getting ready

How to do it...

How it works...

There's more...

See also

Language Identification and Translation

Technical requirements

Detecting the natural language in use using LingPipe

Getting ready

How to do it… 

How it works…

There's more…

See also

Discovering supported languages using the Google API

Getting ready

How to do it…

How it works…

See also

Detecting the natural language in use using the Google API

Getting ready

How to do it…

How it works…

There's more…

See also

Language translation using Google

Getting ready

How to do it…

How it works…

There's more…

See also

Language detection and translation using Amazon AWS

Getting ready

How to do it…

How it works…

There's more…

See also

Converting text to speech using the Google Cloud Text-to-Speech API

Getting ready

How to do it…

How it works…

See also

Converting speech to text using the Google Cloud Speech-to-Text API

Getting ready

How to do it…

How it works…

There's more…

See also

Identifying Semantic Similarities within Text

Technical requirements

Finding the cosine similarity of the text

Getting ready

How to do it...

How it works...

There's more...

See also

Finding the distance between text

Getting ready

How to do it...

How it works...

See also

Finding differences between plaintext instances

Getting ready

How to do it...

How it works...

There's more...

See also

Finding hyponyms and antonyms

Getting ready

How to do it...

How it works...

There's more...

See also

Common Text Processing and Generation Tasks

Technical requirements

Generating random numbers

Getting ready

How to do it…

How it works…

There's more…

See also

Spell-checking using the LanguageTool API

Getting ready

How to do it…

How it works…

See also

Checking grammar using the LanguageTool API

Getting ready

How to do it…

How it works…

See also

Summarizing text in a document

Getting ready

How to do it…

How it works…

There's more...

See also

Creating, inverting, and using dictionaries

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting Data for Use in NLP Analysis

Technical requirements

Connecting to an HTML page

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting text and metadata from an HTML page

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting text from a PDF document

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting metadata from a PDF document

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting text from a Word document

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting metadata from a Word document

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting text from a spreadsheet

Getting ready

How to do it…

How it works…

There's more…

See also

Extracting metadata from a spreadsheet

Getting ready

How to do it…

How it works…

See also

Creating a Chatbot

Technical requirements

Creating a simple chatbot using AWS

Getting ready

How to do it…

How it works…

See also

Creating a bot using AWS Toolkit for Eclipse

Getting ready

How to do it…

How it works…

See also

Creating a Lambda function

Getting ready

How to do it…

How it works…

See also

Uploading the Lambda function

Getting ready

How to do it…

How it works…

See also

Executing a Lambda function from Eclipse

Getting ready

How to do it…

How it works…

See also

Installation and Configuration

Technical requirements

Getting ready to use the Google Cloud Platform

Getting ready

How to do it…

How it works…

See also

Configuring Eclipse to use the Google Cloud Platform

Getting ready

How to do it…

How it works…

See also

Getting ready to use Amazon Web Services

Getting ready

How to do it…

How it works…

See also

Configuring Eclipse to use Amazon Web Services

Getting ready

How to do it…

How it works…

See also

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

NLP is a rapidly changing field with numerous supporting APIs. Keeping abreast of changes to these technologies and APIs can be challenging. In addition, there are numerous languages that provide access to NLP functionality.

This cookbook covers many of these techniques with a series of recipes using Java. Each recipe addresses one or more NLP techniques and provides a template for applying them. Step-by-step instructions guide the reader through each recipe, making it easier to learn these technologies and understand how to use them.

The book is organized around common NLP tasks. The chapters do not need to be followed sequentially and can be read in any order. 

Who this book is for

This book is designed for Java developers who want to learn how to incorporate NLP techniques into their applications using Java. The reader is assumed to have a good working knowledge of Java 8. Ideally, they will have some experience with Eclipse and Maven as well, though this is not required.

What this book covers

Chapter 1, Preparing Text for Analysis and Tokenization, demonstrates numerous approaches for performing tokenization. This is the process of extracting the individual words and elements of a document, and forms the basis for most NLP tasks. This process can be difficult to perform correctly. There are many specialized tokenizers available to address a number of different specialized texts.

Chapter 2, Isolating Sentences within a Document, covers sentence isolation, which is also a key NLP task. The process involves more than finding a period, exclamation mark, or question mark and using them as sentence delimiters. It often requires trained neural network models to work correctly.

Chapter 3, Performing Named Entity Recognition, explains how to isolate the key elements of a text in terms of entities such as names, dates, and places. It is not feasible to create an exhaustive list of entities, so neural networks are frequently used to perform this task.

Chapter 4, Detecting POS Using Neural Networks, covers the topic of POS, which refers to parts of speech and corresponds to sentence elements such as nouns, verbs, and adjectives. Performing POS is critical to extract meaning from a text. This chapter will illustrate various POS techniques and show how these elements can be depicted.

Chapter 5, Performing Text Classification, outlines a common NLP activity: classifying text into one or more categories. This chapter will demonstrate how this is accomplished, including the process of performing sentiment analysis. This is often used to assess a customer's opinion of a product or service.

Chapter 6, Finding Relationships within Text, explains how identifying the relationships between text elements can be used to extract meaning from a document. While this is not a simple task, it is becoming increasingly important to many applications. We will examine various approaches to accomplish this goal.

Chapter 7, Language Identification and Translation, covers how language translation is critical to many problem domains, and takes on increased importance as the world becomes more and more interconnected. In this chapter, we will demonstrate several cloud-based approaches to performing natural language translation.

Chapter 8, Identifying Semantic Similarities within Text, explains how texts can be similar to each other at various levels. Similar words may be used, or there may be similarities in text structure. This capability is useful for a variety of tasks ranging from spell checking to assisting in determining the meaning of a text. We will demonstrate various approaches in this chapter.

Chapter 9, Common Text Processing and Generation Tasks, outlines how the NLP techniques illustrated in this book are all based on a set of common text-processing activities. These include using data structures such as inverted dictionaries and generating random numbers for training sets. In this chapter, we will demonstrate many of these tasks.

Chapter 10, Extracting Data for Use in NLP Analysis, emphasizes how important it is to be able to obtain data from a variety of sources. As more and more data is created, we need mechanisms for extracting and then processing the data. We will illustrate some of these techniques, including extracting data from Word/PDF documents, websites, and spreadsheets.

Chapter 11, Creating a Chatbot, discusses an increasingly common and important NLP application: chatbots. In this chapter, we will demonstrate how to create a chatbot, and how a Java application interface can be used to enhance the functionality of the chatbot.

Appendix, Installation and Configuration, covers the different installations and configurations for Google Cloud Platform (GCP) and Amazon Web Services (AWS).

To get the most out of this book

The reader should be proficient with Java in order to understand and use many of the APIs covered in this book. The recipes used here are presented as Eclipse projects. Familiarity with Eclipse is not an absolute requirement, but will speed up the learning process. While it is possible to use another IDE, the recipes are written using Eclipse.

Most, but not all, of the recipes use Maven to import the necessary API libraries for the recipes. A basic understanding of how to use a POM file is useful. In some recipes, we will directly import JAR files into a project when they are not available in a Maven repository. In these situations, instructions for Eclipse will be provided.

In the last chapter, Chapter 11, Creating a Chatbot, we will be using the AWS Toolkit for Eclipse. This can easily be installed in most IDEs. For a few chapters, we will be using GCP and various Amazon AWS libraries. The reader will need to establish accounts on these platforms, which are free as long as certain usage quotas are not exceeded.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at

www.packt.com

.

Select the

SUPPORT

tab.

Click on

Code Downloads & Errata

.

Enter the name of the book in the

Search

box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-with-Java-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

 

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Preparing Text for Analysis and Tokenization

One of the first steps required for Natural Language Processing (NLP) is the extraction of tokens in text. The process of tokenization splits text into tokens—that is, words. Normally, tokens are split based upon delimiters, such as white space. White space includes blanks, tabs, and carriage-return line feeds. However, specialized tokenizers can split tokens according to other delimiters. In this chapter, we will illustrate several tokenizers that you will find useful in your analysis.
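Before looking at dedicated tokenizers, it is worth seeing how far plain white space splitting gets us. The following minimal sketch uses only the core String class; the sample sentence is invented for illustration:

```java
public class WhitespaceSplit {
    public static void main(String[] args) {
        String text = "The quick brown\tfox jumps over the lazy dog.";
        // \s+ matches runs of white space: blanks, tabs, and line breaks.
        String[] tokens = text.split("\\s+");
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```

Note that the final token is dog. with the period still attached. Separating punctuation like this is one of the reasons specialized tokenizers exist.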

Another important NLP task involves determining the stem and lexical meaning of a word. This is useful for deriving more meaning about the words being processed, as illustrated in the fifth and sixth recipes. The stem of a word refers to the root of a word. For example, the stem of the word antiquated is antiqu. While this may not seem to be the correct stem, the stem of a word is the ultimate base of the word.

The lexical meaning of a word is not concerned with the context in which it is being used. We will be examining the process of performing lemmatization of a word. This is also concerned with finding the root of a word, but uses a more detailed dictionary to find it. The stem of a word may vary depending on the form the word takes. With lemmatization, however, the root will always be the same. Stemming is often used when a less precise determination of the root of a word is acceptable. A more thorough discussion of stemming versus lemmatization can be found at https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/.
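To make the stem/lemma distinction concrete, the following toy sketch strips a few common suffixes. This is not a real stemming algorithm such as Porter's; the suffix list is invented purely for illustration:

```java
public class NaiveStemmer {
    // Toy suffix list for illustration only; real stemmers apply ordered rule sets.
    private static final String[] SUFFIXES = {"ated", "ation", "ing", "ed", "es", "s"};

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Strip the first matching suffix, leaving at least three characters.
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("antiquated")); // antiqu
        System.out.println(stem("running"));    // runn
    }
}
```

Note that running is reduced to runn rather than to the lemma run, which illustrates why lemmatization is preferred when a precise root is needed.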

The last task in this chapter deals with the process of text normalization. Here, we are concerned with converting each extracted token to a form that can be more easily processed during later analysis. Typical normalization activities include converting case, expanding abbreviations, removing stop words, stemming, and lemmatization. Stop words are words that can often be ignored in certain types of analysis. For example, in some contexts, the word the does not always need to be included.

In this chapter, we will cover the following recipes:

Tokenization using the Java SDK

Tokenization using OpenNLP

Tokenization using maximum entropy

Training a neural network tokenizer for specialized text

Identifying the stem of a word

Training an OpenNLP lemmatization model

Determining the lexical meaning of a word using OpenNLP

Removing stop words using LingPipe

Technical requirements

In this chapter, you will need to install the following software, if they have not already been installed:

Eclipse Photon 4.8.0

Java JDK 8 or later

We will be using the following APIs, which you will be instructed to add for each recipe as appropriate:

OpenNLP 1.9.0

LingPipe 4.1.0

The code files for this chapter can be found at https://github.com/PacktPublishing/Natural-Language-Processing-with-Java-Cookbook/tree/master/Chapter01.

Tokenization using the Java SDK

Tokenization can be achieved using a number of Java classes, including the String, StringTokenizer, and StreamTokenizer classes. In this recipe, we will demonstrate the use of the Scanner class. While frequently used for console input, it can also be used to tokenize a string.

Getting ready

To prepare, we need to create a new Java project.

How it works...

The Scanner class's constructor took a string as an argument. This allowed us to apply the Scanner class's methods against the text. We used the next method, which returns a single token at a time, delimited by white space. While it was not necessary to store the tokens in a list, doing so permits us to use them later for different purposes.
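The code this recipe describes looks roughly like the following sketch (the sample text is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokenizer {
    public static void main(String[] args) {
        String sampleText = "Let's pause, and then reflect.";
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(sampleText)) {
            // next() returns one whitespace-delimited token at a time.
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        System.out.println(tokens); // [Let's, pause,, and, then, reflect.]
    }
}
```

Notice that punctuation remains attached to the adjacent word; the Scanner class splits only on white space unless a different delimiter is configured.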

Tokenization using OpenNLP

In this recipe, we will create an instance of the OpenNLP SimpleTokenizer class to illustrate tokenization. We will use its tokenize method against a sample text.

Getting ready

To prepare, we need to do the following:

Create a new Java project

Add the following POM dependency to your project:

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>
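Assuming the preceding dependency is on the classpath, the recipe's core amounts to something like the following sketch (the sample text is illustrative). The SimpleTokenizer class is rule-based and requires no trained model:

```java
import opennlp.tools.tokenize.SimpleTokenizer;

public class OpenNLPTokenizerExample {
    public static void main(String[] args) {
        String sampleText = "In addition, the rook pinned the knight.";
        // SimpleTokenizer exposes a shared, ready-to-use instance.
        SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(sampleText);
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```

Unlike the Scanner-based approach, SimpleTokenizer splits punctuation such as commas and periods into their own tokens.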

See also

The OpenNLP API documentation can be found at 

https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html

Tokenization using maximum entropy

Maximum entropy is a statistical classification technique. It takes various characteristics of a subject, such as the use of specialized words or the presence of whiskers in a picture, and assigns a weight to each characteristic. These weights are eventually added up and normalized to a value between 0 and 1, indicating the probability that the subject is of a particular kind. With a high enough level of confidence, we can conclude that the text is all about high-energy physics or that we have a picture of a cat.

If you're interested, you can find a more complete explanation of this technique at https://nadesnotes.wordpress.com/2016/09/05/natural-language-processing-nlp-fundamentals-maximum-entropy-maxent/. In this recipe, we will demonstrate the use of maximum entropy with the OpenNLP TokenizerME class.
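The "added up and normalized" step can be pictured with a small sketch. The scores below are invented; a real maximum entropy model learns its feature weights from training data:

```java
public class MaxentSketch {
    public static void main(String[] args) {
        // Invented scores for two candidate outcomes,
        // for example "token boundary here" versus "no boundary".
        double[] scores = {2.0, 0.5};
        double[] probabilities = new double[scores.length];
        double sum = 0.0;
        for (double score : scores) {
            sum += Math.exp(score);
        }
        // Exponentiate and normalize so the values lie between 0 and 1
        // and sum to 1, giving a probability for each outcome.
        for (int i = 0; i < scores.length; i++) {
            probabilities[i] = Math.exp(scores[i]) / sum;
        }
        System.out.println(probabilities[0] + " " + probabilities[1]);
    }
}
```

Here the two scores become probabilities of roughly 0.82 and 0.18, so the model would favor the first outcome with fairly high confidence.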

Getting ready

To prepare, we need to do the following:

Create a new Maven project.

Download the 

en-token.bin

file from

http://opennlp.sourceforge.net/models-1.5/

. Save it at the root directory of the project.

Add the following POM dependency to your project:

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

How it works...

The sampleText variable holds the test string. A try-with-resources block is used to automatically close the InputStream. The FileInputStream constructor throws a FileNotFoundException, while the new TokenizerModel(modelInputStream) statement throws an IOException, both of which need to be handled.

The code examples in this book that deal with exception handling include a comment suggesting that exceptions should be handled. The user is encouraged to add the appropriate code to deal with exceptions. This will often include print statements or possibly logging operations.

An instance of the TokenizerModel class is created using the en-token.bin model. This model has been trained to recognize English text. An instance of the TokenizerME class represents the tokenizer where the tokenize method is executed against it using the sample text. This method returns an array of strings that are then displayed. Note that the comma and period are treated as separate tokens.
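Putting the pieces described above together, the recipe's code looks roughly like the following sketch (the sample sentence is illustrative; en-token.bin is the model file downloaded in the Getting ready section):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class MaxentTokenizerExample {
    public static void main(String[] args) {
        String sampleText = "In addition, the rook pinned the knight.";
        // try-with-resources closes the model's InputStream automatically.
        try (InputStream modelInputStream = new FileInputStream(
                new File("en-token.bin"))) {
            TokenizerModel model = new TokenizerModel(modelInputStream);
            TokenizerME tokenizer = new TokenizerME(model);
            String[] tokens = tokenizer.tokenize(sampleText);
            for (String token : tokens) {
                System.out.println(token);
            }
        } catch (Exception e) {
            // Handle FileNotFoundException / IOException appropriately,
            // for example by logging the failure.
            e.printStackTrace();
        }
    }
}
```

Running this against the sample sentence produces the comma and period as separate tokens, as noted above.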

See also

The OpenNLP API documentation can be found at 

https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html

Training a neural network tokenizer for specialized text

Sometimes, we need to work with specialized text, such as an uncommon language or text that is unique to a problem domain. In such cases, the standard tokenizers are not always sufficient. This necessitates the creation of a unique model that will work better with the specialized text. In this recipe, we will demonstrate how to train a model using OpenNLP.

Getting ready

To prepare, we need to do the following:

Create a new Maven project

Add the following dependency to the POM file:

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>