Natural Language Processing with Java, Second Edition

Techniques for building machine learning and neural network models for NLP

Richard M. Reese
AshishSingh Bhatia

BIRMINGHAM - MUMBAI

Natural Language Processing with Java Second Edition

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Divya Poojari
Content Development Editor: Eisha Dsouza
Technical Editor: Jovita Alva
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jisha Chirayil
Production Coordinator: Shraddha Falebhai

First published: March 2015
Second edition: July 2018

Production reference: 1300718

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78899-349-4

www.packtpub.com

To my parents, Smt. Ravindrakaur Bhatia and S. Tej Singh Bhatia, and to my brother, S. Ajit Singh Bhatia, for guiding, motivating, and supporting me when it was required most. To my friends, who are always there, and especially to Mr. Mitesh Soni, for the support and inspiration to write.
 – AshishSingh Bhatia
 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

About the authors

Richard M. Reese has worked in both industry and academia. For 17 years, he worked in the telephone and aerospace industries, serving in several capacities, including research and development, software development, supervision, and training. He currently teaches at Tarleton State University. Richard has written several Java books and a C Pointer book. He uses a concise and easy-to-follow approach to teaching about topics. His Java books have addressed EJB 3.1, updates to Java 7 and 8, certification, functional programming, jMonkeyEngine, and natural language processing.

 

AshishSingh Bhatia is a learner, reader, seeker, and developer at core. He has over 10 years of IT experience in different domains, including banking, ERP, and education. He is persistently passionate about Python, Java, R, and web and mobile development. He is always ready to explore new technologies.

I would like to first and foremost thank my loving parents and friends for their continued support, patience, and encouragement.

About the reviewers

Doug Ortiz is an experienced enterprise cloud, big data, data analytics, and solutions architect who has designed, developed, re-engineered, and integrated enterprise solutions. His other expertise is in Amazon Web Services, Azure, Google Cloud, business intelligence, Hadoop, Spark, NoSQL databases, and SharePoint, to mention just a few.

He is the founder of Illustris, LLC, and is reachable at [email protected].

Huge thanks to my wonderful wife, Milla, as well as Maria, Nikolay, and our children for all their support.

 

Paraskevas V. Lekeas received his PhD and MS in CS from the NTUA, Greece, where he conducted his postdoc on algorithmic engineering, and he also holds degrees in math and physics from the University of Athens. He was a professor at the TEI of Athens and the University of Crete before taking an internship at the University of Chicago. He has extensive experience in knowledge discovery and engineering, having addressed many challenges for startups and for corporations using a diverse arsenal of tools and technologies. He leads the data group at H5, helping H5 advance innovative knowledge discovery.

 

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Natural Language Processing with Java Second Edition

Dedication

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Introduction to NLP

What is NLP?

Why use NLP?

Why is NLP so hard?

Survey of NLP tools

Apache OpenNLP

Stanford NLP

LingPipe

GATE

UIMA

Apache Lucene Core

Deep learning for Java

Overview of text-processing tasks

Finding parts of text

Finding sentences

Feature-engineering

Finding people and things

Detecting parts of speech

Classifying text and documents

Extracting relationships

Using combined approaches

Understanding NLP models

Identifying the task

Selecting a model

Building and training the model

Verifying the model

Using the model

Preparing data

Summary

Finding Parts of Text

Understanding the parts of text

What is tokenization?

Uses of tokenizers

Simple Java tokenizers

Using the Scanner class

Specifying the delimiter

Using the split method

Using the BreakIterator class

Using the StreamTokenizer class

Using the StringTokenizer class

Performance considerations with Java core tokenization

NLP tokenizer APIs

Using the OpenNLPTokenizer class

Using the SimpleTokenizer class

Using the WhitespaceTokenizer class

Using the TokenizerME class

Using the Stanford tokenizer

Using the PTBTokenizer class

Using the DocumentPreprocessor class

Using a pipeline

Using LingPipe tokenizers

Training a tokenizer to find parts of text

Comparing tokenizers

Understanding normalization

Converting to lowercase

Removing stopwords

Creating a StopWords class

Using LingPipe to remove stopwords

Using stemming

Using the Porter Stemmer

Stemming with LingPipe

Using lemmatization

Using the StanfordLemmatizer class

Using lemmatization in OpenNLP

Normalizing using a pipeline

Summary

Finding Sentences

The SBD process

What makes SBD difficult?

Understanding the SBD rules of LingPipe's HeuristicSentenceModel class

Simple Java SBDs

Using regular expressions

Using the BreakIterator class

Using NLP APIs

Using OpenNLP

Using the SentenceDetectorME class

Using the sentPosDetect method

Using the Stanford API

Using the PTBTokenizer class

Using the DocumentPreprocessor class

Using the StanfordCoreNLP class

Using LingPipe

Using the IndoEuropeanSentenceModel class

Using the SentenceChunker class

Using the MedlineSentenceModel class

Training a sentence-detector model

Using the Trained model

Evaluating the model using the SentenceDetectorEvaluator class

Summary

Finding People and Things

Why is NER difficult?

Techniques for name recognition

Lists and regular expressions

Statistical classifiers

Using regular expressions for NER

Using Java's regular expressions to find entities

Using the RegExChunker class of LingPipe

Using NLP APIs

Using OpenNLP for NER

Determining the accuracy of the entity

Using other entity types

Processing multiple entity types

Using the Stanford API for NER

Using LingPipe for NER

Using LingPipe's named entity models

Using the ExactDictionaryChunker class

Building a new dataset with the NER annotation tool

Training a model

Evaluating a model

Summary

Detecting Part of Speech

The tagging process

The importance of POS taggers

What makes POS difficult?

Using the NLP APIs

Using OpenNLP POS taggers

Using the OpenNLP POSTaggerME class for POS taggers

Using OpenNLP chunking

Using the POSDictionary class

Obtaining the tag dictionary for a tagger

Determining a word's tags

Changing a word's tags

Adding a new tag dictionary

Creating a dictionary from a file

Using Stanford POS taggers

Using Stanford MaxentTagger

Using the MaxentTagger class to tag textese

Using the Stanford pipeline to perform tagging

Using LingPipe POS taggers

Using the HmmDecoder class with Best_First tags

Using the HmmDecoder class with NBest tags

Determining tag confidence with the HmmDecoder class

Training the OpenNLP POSModel

Summary

Representing Text with Features

N-grams

Word embedding

GloVe

Word2vec

Dimensionality reduction

Principal component analysis

Distributed stochastic neighbor embedding

Summary

Information Retrieval

Boolean retrieval

Dictionaries and tolerant retrieval

Wildcard queries

Spelling correction

Soundex

Vector space model

Scoring and term weighting

Inverse document frequency

TF-IDF weighting

Evaluation of information retrieval systems

Summary

Classifying Texts and Documents

How classification is used

Understanding sentiment analysis

Text-classifying techniques

Using APIs to classify text

Using OpenNLP

Training an OpenNLP classification model

Using DocumentCategorizerME to classify text

Using the Stanford API

Using the ColumnDataClassifier class for classification

Using the Stanford pipeline to perform sentiment analysis

Using LingPipe to classify text

Training text using the Classified class

Using other training categories

Classifying text using LingPipe

Sentiment analysis using LingPipe

Language identification using LingPipe

Summary

Topic Modeling

What is topic modeling?

The basics of LDA

Topic modeling with MALLET

Training

Evaluation

Summary

Using Parsers to Extract Relationships

Relationship types

Understanding parse trees

Using extracted relationships

Extracting relationships

Using NLP APIs

Using OpenNLP

Using the Stanford API

Using the LexicalizedParser class

Using the TreePrint class

Finding word dependencies using the GrammaticalStructure class

Finding coreference resolution entities

Extracting relationships for a question-answer system

Finding the word dependencies

Determining the question type

Searching for the answer

Summary

Combined Pipeline

Preparing data

Using boilerpipe to extract text from HTML

Using POI to extract text from Word documents

Using PDFBox to extract text from PDF documents

Using Apache Tika for content analysis and extraction

Pipelines

Using the Stanford pipeline

Using multiple cores with the Stanford pipeline

Creating a pipeline to search text

Summary

Creating a Chatbot

Chatbot architecture

Artificial Linguistic Internet Computer Entity

Understanding AIML

Developing a chatbot using ALICE and AIML

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes.

You'll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you'll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, Mallet, and more. You'll then start performing NLP on different inputs and tasks, such as tokenization, model training, parts of speech, parsing trees, and more. You'll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and other things. By the end of this book, you'll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

Who this book is for

Natural Language Processing with Java is for you if you are a data analyst, data scientist, or machine learning engineer who wants to extract information from a language using Java. Knowledge of Java programming is needed, while a basic understanding of statistics will be useful, but is not mandatory.

What this book covers

Chapter 1, Introduction to NLP, explains the importance and uses of NLP. The NLP techniques used in this chapter are explained with simple examples illustrating their use.

Chapter 2, Finding Parts of Text, focuses primarily on tokenization. This is the first step in more advanced NLP tasks. Both core Java and Java NLP tokenization APIs are illustrated.

Chapter 3, Finding Sentences, proves that sentence boundary disambiguation is an important NLP task. This step is a precursor for many other downstream NLP tasks in which text elements should not be split across sentence boundaries. This includes ensuring that all phrases are in one sentence and supporting Parts-of-Speech analysis.

Chapter 4, Finding People and Things, covers what is commonly referred to as Named Entity Recognition (NER). This task is concerned with identifying people, places, and similar entities in text. This technique is a preliminary step for processing queries and searches.

Chapter 5, Detecting Parts of Speech, shows you how to detect Parts-of-Speech, which are grammatical elements of text, such as nouns and verbs. Identifying these elements is a significant step in determining the meaning of text and detecting relationships within text.

Chapter 6, Representing Text with Features, explains how text is represented using N-grams and outlines the role they play in revealing the context.

Chapter 7, Information Retrieval, deals with processing the huge amount of data uncovered in information retrieval and finding the relevant information using various approaches, such as Boolean retrieval, dictionaries, and tolerant retrieval.

Chapter 8, Classifying Texts and Documents, proves that classifying text is useful for tasks such as spam detection and sentiment analysis. The NLP techniques that support this process are investigated and illustrated.

Chapter 9, Topic Modeling, discusses the basics of topic modeling using a document that contains some text.

Chapter 10, Using Parsers to Extract Relationships, demonstrates parse trees. A parse tree is used for many purposes, including information extraction. It holds information regarding the relationships between these elements. An example implementing a simple query is presented to illustrate this process.

Chapter 11, Combined Pipeline, addresses several issues surrounding the use of combinations of techniques that solve NLP problems.

Chapter 12, Creating a ChatBot, looks at different types of chatbot, and we will be developing a simple appointment-booking chatbot too.

To get the most out of this book

Java SDK 8 is used to illustrate the NLP techniques. Various NLP APIs are needed and can be readily downloaded. An IDE is not required but is desirable.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-with-Java-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/NaturalLanguageProcessingwithJavaSecondEdition_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "To process the text, we will use the theSentence variable as input to Annotator."

A block of code is set as follows:

System.out.println(tagger.tagString("AFAIK she H8 cth!"));
System.out.println(tagger.tagString("BTW had a GR8 tym at the party BBIAM."));

Any command-line input or output is written as follows:

mallet-2.0.6$ bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Introduction to NLP

Natural Language Processing (NLP) is a broad topic focused on the use of computers to analyze natural languages. It addresses areas such as speech processing, relationship extraction, document categorization, and summarization of text. However, these types of analyses are based on a set of fundamental techniques, such as tokenization, sentence detection, classification, and extracting relationships. These basic techniques are the focus of this book. We will start with a detailed discussion of NLP, investigate why it is important, and identify application areas.

There are many tools available that support NLP tasks. We will focus on the Java language and how various Java Application Programming Interfaces (APIs) support NLP. In this chapter, we will briefly identify the major APIs, including Apache's OpenNLP, the Stanford NLP libraries, LingPipe, and GATE.

This is followed by a discussion of the basic NLP techniques illustrated in this book. The nature and use of these techniques is presented and illustrated using one of the NLP APIs. Many of these techniques will use models. Models are similar to a set of rules that are used to perform a task such as tokenizing text. They are typically represented by a class that is instantiated from a file. We'll round off the chapter with a brief discussion on how data can be prepared to support NLP tasks.

NLP is not easy. While some problems can be solved relatively easily, there are many others that require the use of sophisticated techniques. We will strive to provide a foundation for NLP-processing so that you will be able to better understand which techniques are available for and applicable to a given problem.

NLP is a large and complex field. In this book, we will only be able to address a small part of it. We will focus on core NLP tasks that can be implemented using Java. Throughout this book, we will demonstrate a number of NLP techniques using both the Java SE SDK and other libraries, such as OpenNLP and Stanford NLP. To use these libraries, there are specific API JAR files that need to be associated with the project in which they are being used. A discussion of these libraries is found in the Survey of NLP tools section and contains download links to the libraries. The examples in this book were developed using NetBeans 8.0.2. These projects require the API JAR files to be added to the Libraries category of the Projects Properties dialog box.

In this chapter, we will learn about the following topics:

What is NLP?

Why use NLP?

Why is NLP so hard?

Survey of NLP tools

Deep learning for Java

Overview of text-processing tasks

Understanding NLP models

Preparing data

What is NLP?

A formal definition of NLP frequently includes wording to the effect that it is a field of study using computer science, Artificial Intelligence (AI), and formal linguistics concepts to analyze natural language. A less formal definition suggests that it is a set of tools used to derive meaningful and useful information from natural language sources, such as web pages and text documents.

Meaningful and useful implies that it has some commercial value, though it is frequently used for academic problems. This can readily be seen in its support of search engines. A user query is processed using NLP techniques in order to generate a result page that a user can use. Modern search engines have been very successful in this regard. NLP techniques have also found use in automated help systems and in support of complex query systems, as typified by IBM's Watson project.

When we work with a language, the terms syntax and semantics are frequently encountered. The syntax of a language refers to the rules that control valid sentence structure. For example, a common sentence structure in English starts with a subject followed by a verb and then an object, such as "Tim hit the ball." We are not used to unusual sentence orders, such as "Hit ball Tim." Although the rules of syntax for English are not as rigorous as those for computer languages, we still expect a sentence to follow basic syntax rules.

The semantics of a sentence is its meaning. As English speakers, we understand the meaning of the sentence, "Tim hit the ball." However, English, and other natural languages, can be ambiguous at times and a sentence's meaning may only be determined from its context. As we will see, various machine learning techniques can be used to attempt to derive the meaning of a text.

As we progress with our discussions, we will introduce many linguistic terms that will help us better understand natural languages and provide us with a common vocabulary to explain the various NLP techniques. We will see how the text can be split into individual elements and how these elements can be classified.

In general, these approaches are used to enhance applications, thus making them more valuable to their users. The uses of NLP can range from relatively simple uses to those that are pushing what is possible today. In this book, we will show examples that illustrate simple approaches, which may be all that is required for some problems, to the more advanced libraries and classes available to address sophisticated needs.

Why use NLP?

NLP is used in a wide variety of disciplines to solve many different types of problems. Text analysis is performed on text that ranges from a few words of user input for an internet query to multiple documents that need to be summarized. We have seen a large growth in the amount and availability of unstructured data in recent years. This has taken forms such as blogs, tweets, and various other social media. NLP is ideal for analyzing this type of information.

Machine learning and text analysis are used frequently to enhance an application's utility. A brief list of application areas follows:

Searching: This identifies specific elements of text. It can be as simple as finding the occurrence of a name in a document or might involve the use of synonyms and alternate spellings/misspellings to find entries that are close to the original search string.

Machine translation: This typically involves the translation of one natural language into another.

Summarization: Paragraphs, articles, documents, or collections of documents may need to be summarized. NLP has been used successfully for this purpose.

Named-Entity Recognition (NER): This involves extracting the names of locations, people, and things from text. Typically, this is used in conjunction with other NLP tasks, such as processing queries.

Information grouping: This is an important activity that takes textual data and creates a set of categories that reflect the content of the document. You have probably encountered numerous websites that organize data based on your needs and have categories listed on the left-hand side of the website.

Parts-of-Speech tagging (POS): In this task, text is split up into different grammatical elements, such as nouns and verbs. This is useful for analyzing the text further.

Sentiment analysis: People's feelings and attitudes regarding movies, books, and other products can be determined using this technique. This is useful in providing automated feedback on how well a product is perceived.

Answering queries: This type of processing was illustrated when IBM's Watson successfully won a Jeopardy! competition. However, its use is not restricted to winning game shows and has been used in a number of other fields, including medicine.

Speech recognition: Human speech is difficult to analyze. Many of the advances that have been made in this field are the result of NLP efforts.

Natural-Language Generation (NLG): This is the process of generating text from a data or knowledge source, such as a database. It can automate the reporting of information, such as weather reports, or summarize medical reports.

NLP tasks frequently use different machine learning techniques. A common approach starts with training a model to perform a task, verifying that the model is correct, and then applying the model to a problem. We will examine this process further in the Understanding NLP models section.

Why is NLP so hard?

NLP is not easy. There are several factors that make this process hard. For example, there are hundreds of natural languages, each of which has different syntax rules. Words can be ambiguous where their meaning is dependent on their context. Here, we will examine a few of the more significant problem areas.

At the character level, there are several factors that need to be considered. For example, the encoding scheme used for a document needs to be considered. Text can be encoded using schemes such as ASCII, UTF-8, UTF-16, or Latin-1. Other factors, such as whether the text should be treated as case-sensitive or not, may need to be considered. Punctuation and numbers may require special processing. We sometimes need to consider the use of emoticons (character combinations and special character images), hyperlinks, repeated punctuation (... or ---), file extensions, and usernames with embedded periods. Many of these are handled by preprocessing text, as we will discuss in the Preparing data section.

When we tokenize text, it usually means we are breaking up the text into a sequence of words. These words are called tokens. The process is referred to as tokenization. When a language uses whitespace characters to delineate words, this process is not too difficult. With a language such as Chinese, it can be quite difficult since it uses unique symbols for words.
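
For whitespace-delimited languages, the simplest form of tokenization can be done with core Java alone. The following is a minimal sketch (the class and method names are mine, not from any NLP library); note that it leaves punctuation attached to words, which is why the dedicated tokenizers covered later are usually preferred:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WhitespaceTokenizerDemo {
    // Splits text on runs of whitespace. Punctuation stays attached to
    // adjacent words, so "ball." is a single token here.
    public static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Tim hit the ball."));
    }
}
```

Real tokenizers, such as those in OpenNLP or Stanford NLP, additionally separate punctuation, handle contractions, and can be trained on domain-specific text.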

Words and morphemes may need to be assigned a Part-of-Speech (POS) label identifying what type of unit they are. A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes. Often, we need to consider synonyms, abbreviations, acronyms, and spellings when we work with words.

Stemming is another task that may need to be applied. Stemming is the process of finding the word stem of a word. For example, words such as walking, walked, or walks have the word stem walk. Search engines often use stemming to assist in asking a query.
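
The idea behind suffix-stripping stemmers can be sketched in a few lines of Java. This toy class (my own illustration, not the Porter algorithm, which applies ordered rule phases with additional checks) handles the walking/walked/walks example from the text:

```java
public class SuffixStemmer {
    // Toy rule-based stemmer: strips a few common English suffixes.
    // Length guards avoid mangling short words such as "sing" or "red".
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 4) {
            return w.substring(0, w.length() - 3);
        }
        if (w.endsWith("ed") && w.length() > 3) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("s") && w.length() > 2) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"walking", "walked", "walks"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

A production stemmer, such as the Porter Stemmer covered in Chapter 2, uses many more rules and conditions than this sketch.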

Closely related to stemming is the process of lemmatization. This process determines the base form of a word, called its lemma. For example, for the word operating, its stem is oper but its lemma is operate. Lemmatization is a more refined process than stemming, and uses vocabulary and morphological techniques to find a lemma. This can result in more precise analysis in some situations.

Words are combined into phrases and sentences. Sentence detection can be problematic and is not as simple as looking for a period at the end of a sentence. Periods are found in many other places, including abbreviations such as Ms. and numbers such as 12.834.
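
The difficulty can be demonstrated with a short sketch using only standard Java classes and a sample sentence of our own. Naively splitting on every period cuts the number 12.834 apart, whereas BreakIterator, which applies Unicode sentence-boundary rules, keeps it intact:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceDemo {
    // Split text into sentences using locale-aware boundary rules
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The total was 12.834 units. It was delivered quickly.";

        // Naive approach: splitting on every period cuts inside the number
        System.out.println(text.split("\\.").length); // 3 pieces, the first cut at "12"

        // BreakIterator finds the two real sentences and leaves 12.834 intact
        System.out.println(sentences(text).size());
    }
}
```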

We often need to understand which words in a sentence are nouns and which are verbs. We are often concerned with the relationship between words. For example, coreference resolution determines the relationship between certain words in one or more sentences. Consider the following sentence:

"The city is large but beautiful. It fills the entire valley."

The word it is a coreference to city. When a word has multiple meanings, we might need to perform word-sense disambiguation (WSD) to determine the intended meaning. This can be difficult to do at times. For example, in "John went back home.", does home refer to a house, a city, or some other unit? Its meaning can sometimes be inferred from the context in which it is used. For example, "John went back home. It was situated at the end of a cul-de-sac."

Despite these difficulties, NLP is able to perform these tasks reasonably well in most situations and provides added value to many problem domains. For example, sentiment analysis can be performed on customer tweets, possibly resulting in free product offers for dissatisfied customers. Medical documents can be readily summarized to highlight the relevant topics and improve productivity. Summarization is the process of producing a short description of different units. These units can include multiple sentences, paragraphs, a document, or multiple documents. The intent may be to identify those sentences that convey the meaning of the unit, determine the prerequisites for understanding a unit, or to find items within these units. Frequently, the context of the text is important in accomplishing this task.
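
To make extractive summarization concrete, the following is a toy frequency-based summarizer of our own: it selects the sentence whose words occur most often across the whole text. Real systems are far more sophisticated, but the same basic intuition, that frequent terms signal central sentences, underlies many classical approaches:

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveSummarizer {
    // Return the sentence whose words are most frequent in the full text
    public static String summarize(String[] sentences) {
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String w : s.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
            }
        }
        String best = sentences[0];
        int bestScore = -1;
        for (String s : sentences) {
            int score = 0;
            for (String w : s.toLowerCase().split("\\W+")) {
                score += freq.getOrDefault(w, 0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] doc = {
            "NLP models process text.",
            "Text processing with NLP models supports many tasks.",
            "The weather was pleasant."
        };
        // The second sentence scores highest, as it shares the most frequent terms
        System.out.println(summarize(doc));
    }
}
```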

Survey of NLP tools

There are many tools available that support NLP. Some of these are available with the Java SE SDK but are limited in their utility for all but the simplest types of problems. Other libraries, such as Apache OpenNLP and LingPipe, provide extensive and sophisticated support for NLP problems.

Low-level Java support includes string classes, such as String, StringBuilder, and StringBuffer. These classes possess methods that perform searching, matching, and text replacement. Regular expressions use special encodings to match substrings, and Java provides a rich set of classes for working with them.
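
As a brief illustration, the following sketch uses Java's Pattern and Matcher classes to find substrings and perform pattern-based replacement. The email pattern and sample text are our own, and the pattern is deliberately simplified; production-grade email matching is considerably more involved:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String text = "Contact us at support@example.com or sales@example.org.";

        // A simplified email pattern: word characters and dots, an @, then a domain
        Pattern email = Pattern.compile("\\b[\\w.]+@[\\w.]+\\.[a-z]{2,}\\b");
        Matcher matcher = email.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group()); // prints each address found
        }

        // replaceAll performs pattern-based text replacement
        System.out.println(text.replaceAll(email.pattern(), "<email>"));
        // Contact us at <email> or <email>.
    }
}
```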

As discussed earlier, tokenizers are used to split text into individual elements. Java provides support for tokenizers with:

The String class' split method
The StreamTokenizer class
The StringTokenizer class
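
The following sketch contrasts these three approaches on a sample sentence of our own:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) throws IOException {
        String text = "Today we process 21 files";

        // String.split: whitespace-delimited tokens
        for (String token : text.split("\\s+")) {
            System.out.println(token);
        }

        // StringTokenizer: a legacy class with similar whitespace-based behavior
        StringTokenizer st = new StringTokenizer(text);
        System.out.println(st.countTokens()); // 5

        // StreamTokenizer reads a character stream and distinguishes words from numbers
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                System.out.println("word: " + tokenizer.sval);
            } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                System.out.println("number: " + tokenizer.nval); // 21 parsed as 21.0
            }
        }
    }
}
```

Note that none of these classes performs linguistically aware tokenization; punctuation stays attached to words, which is why NLP libraries supply their own tokenizers.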

There also exist a number of NLP libraries/APIs for Java. A partial list of Java-based NLP APIs follows. Most of these are open source. In addition, there are a number of commercial APIs available. We will focus on the open source APIs:

Apertium: http://www.apertium.org/
General Architecture for Text Engineering: http://gate.ac.uk/
Learning Based Java: https://github.com/CogComp/lbjava
LingPipe: http://alias-i.com/lingpipe/
MALLET: http://mallet.cs.umass.edu/
MontyLingua: http://web.media.mit.edu/~hugo/montylingua/
Apache OpenNLP: http://opennlp.apache.org/
UIMA: http://uima.apache.org/
Stanford Parser: http://nlp.stanford.edu/software
Apache Lucene Core: https://lucene.apache.org/core/
Snowball: http://snowballstem.org/

Many of these NLP tasks are combined to form a pipeline. A pipeline consists of various NLP tasks, which are integrated into a series of steps to achieve a processing goal. Examples of frameworks that support pipelines are General Architecture for Text Engineering (GATE) and Apache UIMA.

In the next section, we will cover several NLP APIs in more depth. A brief overview of their capabilities will be presented along with a list of useful links for each API.

GATE

GATE is a set of tools, written in Java, that was developed at the University of Sheffield in England. It supports many NLP tasks and languages, and can also be used as a pipeline for NLP processing. Alongside its API, it offers GATE Developer, a document viewer that displays text together with annotations; this is useful for examining a document using highlighted annotations. GATE Mimir, a tool for indexing and searching text generated by various sources, is also available. Using GATE for many NLP tasks involves a bit of code; GATE Embedded is used to embed GATE functionality directly in the code. Useful GATE links are listed as follows:

Home: https://gate.ac.uk/
Documentation: https://gate.ac.uk/documentation.html
JavaDocs: http://jenkins.gate.ac.uk/job/GATE-Nightly/javadoc/
Download: https://gate.ac.uk/download/
Wiki: http://gatewiki.sf.net/

TwitIE is an open source GATE pipeline for information extraction over tweets. It contains the following:

Language identification for social media data
A Twitter tokenizer that handles smileys, usernames, URLs, and so on
A POS tagger
Text normalization

It is available as part of the GATE Twitter plugin. The relevant links are as follows:

Home: https://gate.ac.uk/wiki/twitie.html
Documentation: https://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf?m=1

UIMA

The Organization for the Advancement of Structured Information Standards (OASIS) is a consortium focused on information-oriented business technologies. It developed the Unstructured Information Management Architecture (UIMA) standard as a framework for NLP pipelines. Apache UIMA provides an implementation of this standard.

In addition to supporting pipeline creation, UIMA describes a series of design patterns, data representations, and user roles for the analysis of text. UIMA links are listed as follows:

Home: https://uima.apache.org/
Documentation: https://uima.apache.org/documentation.html
JavaDocs: https://uima.apache.org/d/uimaj-2.6.0/apidocs/index.html
Download: https://uima.apache.org/downloads.cgi
Wiki: https://cwiki.apache.org/confluence/display/UIMA/Index

Apache Lucene Core

Apache Lucene Core is an open source library, written in Java, for building full-featured text search engines. It uses tokenization to break text into small chunks for indexing, and it provides pre- and post-tokenization options for analysis purposes. It supports stemming, filtering, text normalization, and synonym expansion after tokenization. When used, it creates a directory and index files whose contents can then be searched. It is not a complete NLP toolkit, but it provides powerful tools for working with text and for advanced string manipulation with tokenization; in short, it provides a free search engine. The following are the important links for Apache Lucene:

Home: http://lucene.apache.org/
Documentation: http://lucene.apache.org/core/documentation.html
JavaDocs: http://lucene.apache.org/core/7_3_0/core/index.html
Download: http://lucene.apache.org/core/mirrors-core-latest-redir.html?

Deep learning for Java