WORKING WITH grep, sed, AND awk Pocket Primer - Mercury Learning and Information - E-Book

WORKING WITH grep, sed, AND awk Pocket Primer E-Book

Mercury Learning and Information

Description

This book introduces readers to three powerful command-line utilities—grep, sed, and awk—that can create simple yet powerful shell scripts. Using the bash shell, it focuses on small text files to help readers understand these tools. Grep searches for patterns in data, sed modifies data, and awk performs tasks on pattern matches. Aimed at those new to the bash environment, the book is also valuable for those with some experience.
The journey starts with grep, teaching how to search for specific words or patterns in data. It then moves to sed, showing how to change or modify data efficiently. Finally, it delves into awk, a versatile programming language for searching and processing data files. The book also includes a chapter on using regular expressions with these tools, enhancing your scripting capabilities.
Mastering these utilities is crucial for efficient data handling and automation in a bash environment. This book transitions readers from basic to advanced command-line skills, blending theory with practical examples. It is an essential resource for anyone looking to harness the full power of bash scripting.


Number of pages: 314

Publication year: 2024




Working with grep, sed, and awk

Pocket Primer

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and its companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information, files, or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

Mercury Learning and Information (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files also available for downloading from the publisher by writing to [email protected]


Copyright ©2023 by Mercury Learning and Information LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai

Mercury Learning and Information

22841 Quicksilver Drive

Dulles, VA 20166

[email protected]

www.merclearning.com

1-800-232-0223

O. Campesato. Working with grep, sed, and awk Pocket Primer.

ISBN: 978-150152-151-5

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2023934889 232425321     Printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are also available in digital format at numerous digital vendors. Companion files are available for download by writing to the publisher [email protected]. The sole obligation of Mercury Learning and Information to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents: may this bring joy and happiness into their lives.

Contents

Preface

Chapter 1  Working with grep

What is the grep Command?

Meta Characters and the grep Command

Escaping Meta Characters with the grep Command

Useful Options for the grep Command

Character Classes and the grep Command

Working with the –c Option in grep

Matching a Range of Lines

Using Backreferences in the grep Command

Finding Empty Lines in Datasets

Using Keys to Search Datasets

The Backslash Character and the grep Command

Multiple Matches in the grep Command

The grep Command and the xargs Command

Searching Zip Files for a String

Checking for a Unique Key Value

Redirecting Error Messages

The egrep Command and fgrep Command

Displaying “Pure” Words in a Dataset with egrep

The fgrep Command

Delete Rows with Missing Values

A Simple Use Case

Summary

Chapter 2  Working with sed

What is the sed Command?

The sed Execution Cycle

Matching String Patterns Using sed

Substituting String Patterns Using sed

Replacing Vowels from a String or a File

Deleting Multiple Digits and Letters from a String

Search and Replace with sed

Regular Expressions with sed

Datasets with Multiple Delimiters

Useful Switches in sed

Working with Datasets

Printing Lines

Character Classes and sed

Removing Control Characters

Counting Words in a Dataset

Back References in sed

One-Line sed Commands

Populate Missing Values with the sed Command

A Dataset with 1,000,000 Rows

Numeric Comparisons

Counting Adjacent Digits

Average Support Rate

Summary

Chapter 3  Working with awk

The awk Command

Launching awk Commands and File Extensions

Built-In Variables that Control awk

How Does the awk Command Work?

Three Important Code Blocks in awk Commands

Different Ways to Print Text

Working with the -F Switch in awk

Splitting Strings into Characters

The PROCINFO Array

Ignore Case in awk

Working with OFS, ORS, and Linefeed versus “\n”

Linefeed versus “\n”

Basic awk Examples with Text Files

Display the Contents of a File

Omit the Header Line of a File

Display a Range of Lines in a Text File With NR

Display a Range of Lines Based on Strings

Insert Blank Lines

Remove Blank Lines

Specifying the Number of Fields

Changing the Field Separator FS

Exclude Fields

Switch Adjacent Fields

Display Fields in Reverse Order

Count Non-Empty and Empty Rows

Detecting Transfer Protocols

Detecting Number Formats

Working with Record Lengths

Aligning Text with the printf() Statement

Working with Loops in awk

A for Loop in awk

Exponents in a for Loop

A for Loop with a break Statement

Working with while Loops in awk

A do-while Loop in awk

Conditional Logic and Control Statements

The break, continue, and next Keywords in awk

The exit Keyword

Conditionally Displaying Fields

Logical Operators in awk

Logical Operators and Partial Matches

Checking for Leap Years

Formatting Output

Floating Point Output

Inserting a Thousands Separator

Scientific Notation

Rounding and Truncating Numbers

Rounding Numbers

Truncating Numbers

Numeric Functions in awk

Convert Base 10 to Binary

Built-In String Functions in awk

Convert Strings to Lowercase in awk

Convert Strings to Uppercase in awk

Convert Strings to Mixed Case in awk

Counting Lines that Match a Character

Working with the match() Function

Characters and Hexadecimal Numbers in awk

Non-Printable Characters

Hexadecimal Numbers

Summary

Chapter 4  Working with awk, Part 2

Working with Arrays in awk

Initializing and Printing the Contents of an Array

Initializing and Deleting the Contents of an Array

Adding an Array of Numbers

Find the Maximum and Minimum of an Array of Numbers

Concatenating an Array of Strings

Generating Arrays from Input Strings

The split() Function with Linefeed Strings

Using the split() Function with the case Statement

The patsplit() Function

Multi-Dimensional Arrays

Higher Dimensionality Arrays

Calculating Invoice Totals (1)

Calculating Invoice Totals (2)

Calculating Invoice Averages

Counting Fields in Text Files

Displaying the Number of Fields in Text Files

Deleting Alternate Lines in Datasets

Print a Range of Strings in Datasets

Print a Range of Dates in Datasets

Merging Lines in Datasets

Printing File Contents as a Single Line

Joining Groups of Lines in a Text File

Joining Alternate Lines in a Text File

Reversing the Lines in a File

Working with Duplicate Lines in a File

Display Duplicate Rows

Remove Duplicate Rows

Uniformity of Data Values

Count Duplicate Fields in a File

Calculating Invoice Totals

Printing Lines Using Conditional Logic

Duplicate and Unique Rows in Text Files

Splitting Filenames with awk

One-Line awk Commands

Useful Short awk Scripts

Printing the Words in a Text String in awk

Count Occurrences of a String in Specific Rows

Well-Known Sorting Algorithms

The Bubble Sort

Find Anagrams in a List of Words

Sort an Array in O(n) Complexity

Find the Median of an Array of Numbers

Linear Search

Binary Search (Iterative)

Binary Search Walkthrough

Code for a Binary Search (Iterative)

Binary Search (Recursive)

Summary

Chapter 5  Working with awk, Part 3

Bit Operations in awk

Logical AND

Logical OR

Logical XOR

Logical NOT

Logical Left Shift

Logical Right Shift

Reverse a String

Find Palindromes

Check for Unique Characters

Check for Balanced Parentheses

Printing a String in a Fixed Number of Columns

Printing a Dataset in a Fixed Number of Columns

Aligning Columns in Datasets

Aligning Columns and Multiple Rows in Datasets

Displaying a Subset of Columns in a Text File

Subsets of Columns Aligned Rows in Datasets

Longest/Shortest Words in Datasets

Counting Word Frequency in Datasets

Displaying Only “Pure” Words in a Dataset

Delete Rows with Missing Values

Working with Multiline Records in awk

CSV Files with Multirow Records

Processing Multiple Files with awk

Inner Join of Two Files in awk

Logical Operators in awk

A Simple Use Case

Another Use Case

Date Conversion

A Dataset with 1,000,000 Rows

Numeric Comparisons

Counting Adjacent Digits

Average Support Rate

What is Recursion?

Arithmetic Series

Calculating Arithmetic Series (Iterative)

Calculating Arithmetic Series (Recursive)

Calculating Partial Arithmetic Series

Geometric Series

Calculating a Geometric Series (Iterative)

Calculating Arithmetic Series (Recursive)

Factorial Values

Fibonacci Numbers

Euclid’s Algorithm (GCD)

Lowest Common Multiple of Two Positive Integers

Composite versus Prime Numbers

Factors of Positive Integers

Goldbach’s Conjecture

Environment Variables in awk

Summary

Chapter 6  Regular Expressions

What are Regular Expressions?

Metacharacters

Character Sets

Working with “^” and “\”

Character Classes

POSIX Character Classes

Generic Character Classes

Regular Expressions in awk

Matching Strings with the Dot “.” Metacharacter

Matching Strings with the “^” Metacharacter

Matching Strings with the “$” Metacharacter

Matching Strings with the “|” Metacharacter

Matching with ?, +, or * Metacharacters

Matching Subgroups with ?, +, or * Metacharacters

Matching with Character Classes

Working with Postfix Arithmetic Operators

Working with the sub() Function in awk

Examples Using the sub() Function

Working with the gsub() Function

Removing Digits with the gsub() Function

Removing Characters with the gsub() Function

Removing Consecutive Characters with the gsub() Function

Removing Complements of Strings with the gsub() Function

Removing Metacharacters with the gsub() Function

Removing Spaces in Text Fields

Splitting Strings with Regular Expressions

Dynamic Regular Expressions

Regular Expressions Involving Text Files

Counting Comments with Regular Expressions

Combining grep with awk

Combining sed with awk

Removing HTML Tags

The gensub() Function and Captured Groups

Summary

Index

PREFACE

WHAT IS THE GOAL?

The goal of this book is to introduce readers to three powerful command line utilities that can be combined to create simple yet powerful shell scripts for performing a multitude of tasks. The code samples and scripts use the bash shell, and typically involve small text files, so you can focus on understanding the features of grep, sed, and awk. Aimed at a reader new to working in a bash environment, the book is comprehensive enough to be a good reference and teaches new tricks to those who already have some experience with these command line utilities.

This book takes introductory concepts and demonstrates their use in simple yet powerful shell scripts. Keep in mind that this book does not cover “pure” system administration functionality.

IS THIS BOOK FOR ME AND WHAT WILL I LEARN?

This book is intended for general users as well as anyone who wants to perform a variety of tasks from the command line.

You will acquire an understanding of how to use grep, sed, and awk whose functionality is discussed in the first five chapters. Specifically, Chapter 1 introduces the grep command, Chapter 2 introduces the sed command, and Chapters 3 through 5 discuss the awk command. The sixth and final chapter introduces you to regular expressions.

This book saves you the time required to search for relevant code samples and adapt them to your specific needs, which is a potentially time-consuming process.

HOW WERE THE CODE SAMPLES CREATED?

The code samples in this book were created and tested using bash on a MacBook Pro with OS X 10.15.7 (macOS Catalina). Regarding their content: the code samples are derived primarily from scripts prepared by the author, and in some cases, there are code samples that incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the “Four Cs”: they must be Clear, Concise, Complete, and Correct to the extent that it is possible to do so, given the size of this book.

WHAT YOU NEED TO KNOW FOR THIS BOOK

You need some familiarity with working from the command line in a Unix-like environment. However, there are subjective prerequisites, such as a desire to learn shell programming, along with the motivation and discipline to read and understand the code samples. In any case, if you’re not sure whether or not you can absorb the material in this book, glance through the code samples to get a feel for the level of complexity.

HOW DO I SET UP A COMMAND SHELL?

If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Terminal application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:

open /Applications/Utilities/Terminal.app

A third method for Mac users is to open a new command shell from a command shell that is already visible simply by pressing command+n in that command shell, and your Mac will launch another command shell.

If you are a PC user, you can install Cygwin (open source https://cygwin.com/) that simulates bash commands or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process.

If you use RStudio, you need to launch a command shell inside of RStudio by navigating to Tools > Command Line, and then you can launch bash commands. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).

WHAT ARE THE “NEXT STEPS” AFTER FINISHING THIS BOOK?

The answer to this question varies widely, mainly because the answer depends heavily on your objectives. The best answer is to try a new tool or technique from the book out on a problem or task you care about, professionally, or personally. Precisely what that might be depends on who you are, as the needs of a data scientist, manager, student, or developer are all different. In addition, keep what you learned in mind as you tackle new data cleaning or manipulation challenges. Sometimes knowing a technique is possible will make finding a solution easier, even if you have to re-read the section to remember exactly how the syntax works.

If you have reached the limits of what you have learned here and want to get further technical depth on these commands, there is a wide variety of literature published and online resources describing the bash shell, Unix programming, and the grep, sed, and awk commands.

CHAPTER 1

Working with GREP

This chapter introduces you to the versatile grep command that can process an input text stream to generate a desired output text stream. This command also works well with other Unix commands. This chapter contains many short code samples that illustrate various options of the grep command.

The first part of this chapter introduces the grep command used in isolation, in conjunction with meta characters (such as ^, $, and so forth), and with code snippets that illustrate how to use some of the options of the grep command. Next, you will learn how to match ranges of lines, how to use the back references in grep, and how to “escape” meta characters in grep.

The second part of this chapter shows you how to use the grep command to find empty lines and common lines in datasets, as well as the use of keys to match rows in datasets. Next, you will learn how to use character classes with the grep command, as well as the backslash (\) character, and how to specify multiple matching patterns. You will learn how to combine the grep command with the find command and the xargs command, which is useful for matching a pattern in files that reside in different directories. This section contains some examples of common mistakes that people make with the grep command.

The third section briefly discusses the egrep command and the fgrep command, which are related commands that provide additional functionality that is unavailable in the standard grep utility. The fourth section contains a use case that illustrates how to use the grep command to find matching lines that are then merged to create a new dataset.

What is the grep Command?

The grep (“Global Regular Expression Print”) command is useful for finding strings in one or more files. Several examples are here:

grep abc *sh displays all the lines containing abc in files with suffix sh.

grep -i abc *sh is the same as the preceding query, but case-insensitive.

grep -l abc *sh displays all the filenames with suffix sh that contain abc.

grep -n abc *sh displays the lines that contain the string abc, prefixed with their line numbers, in files with suffix sh.

You can perform logical AND and logical OR operations with this syntax:

grep abc *sh | grep def matches lines containing abc AND def.

grep "abc\|def" *sh matches lines containing abc OR def.
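The AND and OR forms can be tried on a throwaway file. The following is a minimal sketch rather than code from the book: the file demo.sh and its four lines are made up for illustration, and the OR form uses grep -E (extended syntax), where the | operator needs no backslash.

```shell
#!/usr/bin/env bash
# Hypothetical sample file used only for this illustration.
workdir=$(mktemp -d)
printf '%s\n' 'abc only' 'def only' 'abc and def' 'neither' > "$workdir/demo.sh"

# AND: the second grep filters the lines that survived the first one.
and_lines=$(grep abc "$workdir/demo.sh" | grep def)

# OR: with -E the alternation operator is written as a plain |.
or_lines=$(grep -E 'abc|def' "$workdir/demo.sh")

echo "AND:"
echo "$and_lines"
echo "OR:"
echo "$or_lines"

rm -rf "$workdir"
```

Because only one file is searched, grep does not prefix the output lines with a filename.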

You can combine switches as well: the following command displays the names of the files that contain the string abc (case insensitive):

grep -il abc *sh

In other words, the preceding command lists the files whose contents include abc, Abc, ABc, ABC, abC, and so forth.

Another (less efficient) way to display the lines containing abc (case insensitive) is here:

cat file1 | grep -i abc

The preceding command involves two processes, whereas passing the filename directly to grep (grep -i abc file1) involves a single process. The execution time is roughly the same for small text files, but the difference can become more significant if you are working with multiple large text files.

You can combine the sort command, the pipe symbol, and the grep command. For example, the following command displays the files with a “Sep” date in order of increasing size (the size is the fifth column of the ls -l output):

ls -l | grep " Sep " | sort -k5,5n

A sample output from the preceding command is here:

-rw-r--r--  1 oswaldcampesato2  staff       3 Sep 27  2022 abc.txt
-rw-r--r--  1 oswaldcampesato2  staff       6 Sep 21  2022 control1.txt
-rw-r--r--  1 oswaldcampesato2  staff      27 Sep 28  2022 fiblist.txt
-rw-r--r--  1 oswaldcampesato2  staff      28 Sep 14  2022 dest
-rw-r--r--  1 oswaldcampesato2  staff      36 Sep 14  2022 source
-rw-r--r--  1 oswaldcampesato2  staff     195 Sep 28  2022 Divisors.py
-rw-r--r--  1 oswaldcampesato2  staff     267 Sep 28  2022 Divisors2.py

Meta Characters and the grep Command

The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any meta-character with special meaning may be quoted by preceding it with a backslash.

A regular expression may be followed by one of several repetition operators. The dot and the repetition operators are shown here:

"." matches any single character.

"?" indicates that the preceding item is optional and will be matched at most once: Z? matches Z or the empty string.

"*" indicates that the preceding item will be matched zero or more times: Z* matches the empty string, Z, ZZ, ZZZ, and so forth.

"+" indicates that the preceding item will be matched one or more times: Z+ matches Z, ZZ, ZZZ, and so forth.

"{n}" indicates that the preceding item is matched exactly n times: Z{3} matches ZZZ.

"{n,}" indicates that the preceding item is matched n or more times: Z{3,} matches ZZZ, ZZZZ, and so forth.

"{,m}" indicates that the preceding item is matched at most m times: Z{,3} matches the empty string, Z, ZZ, and ZZZ.

"{n,m}" indicates that the preceding item is matched at least n times, but not more than m times: Z{2,4} matches ZZ, ZZZ, and ZZZZ.

The empty regular expression matches the empty string. Because the empty string occurs in every line, grep '' prints every line of its input; use ^$ to match only the lines that contain no data. Two regular expressions may be joined by the infix operator (|). When used in this manner, the infix operator behaves exactly like a logical “OR” statement, which directs the grep command to return any line that matches either regular expression.
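The repetition operators can be checked by counting how many test strings each expression matches in full. The sketch below is illustrative only: the helper count_matches and the five test lines (an empty line, Z, ZZ, ZZZ, ZZZZ) are assumptions made for this example. It uses grep -E because, in grep's default basic syntax, characters such as ?, + and { are ordinary unless escaped, and -x so that the whole line must match. The portable form {0,3} stands in for {,3}.

```shell
#!/usr/bin/env bash
# Count how many of the five test lines ('', Z, ZZ, ZZZ, ZZZZ) are matched
# in full (-x) by an extended (-E) regular expression passed as $1.
count_matches() {
  printf '%s\n' '' Z ZZ ZZZ ZZZZ | grep -c -E -x "$1"
}

for regex in 'Z?' 'Z*' 'Z+' 'Z{3}' 'Z{3,}' 'Z{0,3}' 'Z{2,4}'; do
  echo "$regex matches $(count_matches "$regex") of 5 lines"
done
```

For instance, Z? matches two of the lines (the empty line and Z), while Z+ matches four (everything except the empty line).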

Escaping Meta Characters with the grep Command

Listing 1.1 displays the content of lines.txt that contains lines with words and metacharacters.

Listing 1.1: lines.txt

abcd
ab
abc
cd
defg
.*.
..

The following grep command lists the lines of length 2 (using the ^ to begin and $ to end, with operators to restrict the length) in lines.txt:

grep '^..$' lines.txt

The following command lists the lines of length two in lines.txt that contain two dots (the backslash tells grep to interpret the dots as actual dots, not as metacharacters):

grep '^\.\.$' lines.txt

The result is shown here:

..

The following command displays the lines that consist only of dots (one or more of them), which in lines.txt is again the line of length 2 that begins and ends with a dot. Note that the * is used as a metacharacter because it is not preceded by a backslash: it matches zero or more occurrences of the preceding item, which here is an escaped dot:

grep '^\.*\.$' lines.txt

The following command lists the lines that consist of a period, followed by an asterisk, and then another period (the * is now a character that must be matched because it is preceded by a backslash):

grep '^\.\*\.$' lines.txt
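The four commands in this section can be run against a copy of Listing 1.1. The following sketch recreates lines.txt in a temporary directory (the directory and the variable names are assumptions made for this example) and stores each result so that the outputs can be compared.

```shell
#!/usr/bin/env bash
# Recreate lines.txt from Listing 1.1 in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'abcd' 'ab' 'abc' 'cd' 'defg' '.*.' '..' > "$workdir/lines.txt"

# Any two characters: unescaped dots are metacharacters.
two_chars=$(grep '^..$' "$workdir/lines.txt")

# Exactly two literal dots: both dots are escaped.
two_dots=$(grep '^\.\.$' "$workdir/lines.txt")

# Zero or more literal dots followed by one literal dot.
only_dots=$(grep '^\.*\.$' "$workdir/lines.txt")

# Literal dot, literal asterisk, literal dot.
dot_star_dot=$(grep '^\.\*\.$' "$workdir/lines.txt")

printf 'two characters:\n%s\n' "$two_chars"
printf 'two dots: %s\n' "$two_dots"
printf 'only dots: %s\n' "$only_dots"
printf 'dot, asterisk, dot: %s\n' "$dot_star_dot"

rm -rf "$workdir"
```

The first command returns ab, cd, and the two-dot line; the second and third return only the two-dot line; the fourth returns only the line .*.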

Useful Options for the grep Command

There are many types of pattern matching possibilities with the grep command, and this section contains an eclectic mix of such commands that handle common scenarios.

In the following examples, we have four text files (two .sh and two .txt) and two Word documents in a directory. The string abc is found on one line in abc1.txt and three lines in abc3.sh. The string ABC is found on two lines in ABC2.txt and four lines in ABC4.sh. Notice that abc is not found in ABC files, and ABC is not found in abc files.

ls *
ABC.doc   ABC4.sh   abc1.txt   ABC2.txt   abc.doc   abc3.sh

The following code snippet searches for occurrences of the string abc in all the files in the current directory that have sh as a suffix:

grep abc *sh
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle

The “-c” option counts the number of lines that contain a string: even though ABC4.sh has no matches, it is still listed, with a count of zero:

grep -c abc *sh

The output of the preceding command is here:

ABC4.sh:0
abc3.sh:3

The “-e” option lets you match patterns that would otherwise cause syntax problems (a leading “-” character is normally interpreted as the start of an option for grep):

grep -e "-abc" *sh
abc3.sh:ends with -abc

The “-e” option also lets you match multiple patterns:

grep -e "-abc" -e "comment" *sh
ABC4.sh:# ABC in a comment
abc3.sh:ends with -abc

The “-i” option performs a case-insensitive match:

grep -i abc *sh
ABC4.sh:ABC at start
ABC4.sh:ends with ABC
ABC4.sh:the ABC is in the middle
ABC4.sh:# ABC in a comment
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle

The “-v” option “inverts” the match, which means that the output consists of the lines that do not contain the specified string (the lines containing ABC are printed because -i is not used, and so is the entirely empty line in ABC4.sh):

grep -v abc *sh

Use the “-iv” options to display the lines that do not contain a specified string using a case insensitive match:

grep -iv abc *sh
ABC4.sh:
abc3.sh:this line won't match

The “-l” option lists only the names of the files that contain a successful match (note that this matches the contents of files, not the filenames). The Word document matches because the actual text is still visible to grep; it is just surrounded by proprietary formatting gibberish. You can do similar things with other formats that contain text, such as XML, HTML, CSV, and so forth:

grep -l abc *
 
abc1.txt
abc3.sh
abc.doc

The same option can be restricted to the files that have sh as a suffix:

grep -l abc *sh

Use the “-il” options to display the filenames that contain a specified string using a case insensitive match:

grep -il abc *doc

The preceding command is very useful when you want to check for the occurrence of a string in Word documents.

The “-n” option displays the line number of each matching line:

grep -n abc *sh
abc3.sh:1:abc at start
abc3.sh:2:ends with -abc
abc3.sh:3:the abc is in the middle

The “-h” option suppresses the display of the filename for a successful match:

grep -h abc *sh
abc at start
ends with -abc
the abc is in the middle
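To experiment with these options without the original files, the two .sh files can be approximated from the outputs shown in this section. This is only a sketch: the exact contents and line order of abc3.sh and ABC4.sh are assumptions reconstructed from those outputs, and the variable names are made up for this example.

```shell
#!/usr/bin/env bash
# Approximate abc3.sh and ABC4.sh (contents are an assumption based on
# the outputs shown in the text) inside a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'abc at start' 'ends with -abc' 'the abc is in the middle' \
  "this line won't match" > abc3.sh
printf '%s\n' 'ABC at start' 'ends with ABC' 'the ABC is in the middle' \
  '# ABC in a comment' '' > ABC4.sh

# -c counts matching lines; grep exits with status 1 when nothing matches,
# hence the "|| true" for the file without matches.
count_abc3=$(grep -c abc abc3.sh)
count_ABC4=$(grep -c abc ABC4.sh || true)

# -e protects a pattern that starts with a hyphen; -h omits filenames.
dash_match=$(grep -h -e "-abc" *sh)

# -n prefixes each matching line with its line number.
numbered=$(grep -n abc abc3.sh)

# -v prints the lines that do NOT match.
inverted=$(grep -v abc abc3.sh)

# -il lists the files that match case-insensitively.
files_ci=$(grep -il abc *sh | wc -l | tr -d ' ')

echo "abc3.sh: $count_abc3 matching lines, ABC4.sh: $count_ABC4"
echo "$dash_match"
echo "$numbered"
echo "$inverted"
echo "$files_ci files match case-insensitively"

cd - > /dev/null && rm -rf "$workdir"
```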

For the next series of examples, we will use columns4.txt, as shown in Listing 1.2.

Listing 1.2: columns4.txt

123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
four five

The "-o" option shows only the matched string (this is how you avoid returning the entire line that matches):

grep -o one columns4.txt

The "-o" option combined with the "-b" option shows the position of the matched string. The position is a zero-based byte offset within the file, not a line number: the "o" in “one” is at offset 59, which makes it the 60th character of the file:

grep -o -b one columns4.txt
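The offset can be checked by recreating columns4.txt from Listing 1.2. This is a minimal sketch that assumes GNU grep, where -b combined with -o reports the zero-based byte offset of the match itself; the scratch directory and variable names are made up for the example.

```shell
#!/usr/bin/env bash
# Recreate columns4.txt from Listing 1.2 in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '123 ONE TWO' '456 three four' 'ONE TWO THREE FOUR' \
  'five 123 six' 'one two three' 'four five' > "$workdir/columns4.txt"

# -o prints only the matched text instead of the whole line.
matched=$(grep -o one "$workdir/columns4.txt")

# -b adds the zero-based byte offset of each match.
offset=$(grep -o -b one "$workdir/columns4.txt")

echo "$matched"
echo "$offset"

rm -rf "$workdir"
```

The four lines before "one two three" occupy 12 + 15 + 19 + 13 = 59 bytes including their newlines, which is why the match starts at offset 59.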

You can specify a recursive search, as shown here. The command searches every file in the directory /etc as well as every file in every subdirectory of /etc; the output is not shown because it will be different on every client or account:

grep -r abc /etc

The preceding commands match lines where the specified string is a substring of a longer string in the file. For instance, the preceding commands will match occurrences of abc as well as abcd, dabc, abcde, and so forth.

grep ABC *txt
 
ABC2.txt:ABC at start or ABC in middle or end in ABC
ABC2.txt:ABCD DABC

If you want to match ABC only as a whole word, so that longer strings such as ABCD and DABC are excluded, you can use the -w option, as shown here:

grep -w ABC *txt
ABC2.txt:ABC at start or ABC in middle or end in ABC
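The difference between substring and whole-word matching is easiest to see by counting individual matches with -o. The sketch below uses a hypothetical file sample.txt that contains the two lines shown in the ABC2.txt output; everything else about it is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical file holding the two ABC2.txt lines shown in the text.
workdir=$(mktemp -d)
printf '%s\n' 'ABC at start or ABC in middle or end in ABC' 'ABCD DABC' \
  > "$workdir/sample.txt"

# Without -w, ABC also matches inside ABCD and DABC.
substring_hits=$(grep -o ABC "$workdir/sample.txt" | wc -l | tr -d ' ')

# With -w, only occurrences bounded by non-word characters count.
word_hits=$(grep -o -w ABC "$workdir/sample.txt" | wc -l | tr -d ' ')

# Lines that contain ABC as a whole word.
word_lines=$(grep -w ABC "$workdir/sample.txt")

echo "substring matches: $substring_hits, whole-word matches: $word_hits"
echo "$word_lines"

rm -rf "$workdir"
```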

The --color switch displays the matching string in color:

grep --color abc *sh
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle

You can use the pair of metacharacters (.*) to find the occurrences of two words that are separated by an arbitrary number of intermediate characters.

The following command finds all lines that contain the strings one and three with any number of intermediate characters:

grep "one.*three" columns4.txt
one two three

You can “invert” the preceding result by using the -v switch, as shown here:

grep -v "one.*three" columns4.txt
123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
four five

The following command finds all lines that contain the strings one and three with any number of intermediate characters, where the match involves a case-insensitive comparison:

grep -i "one.*three" columns4.txt
ONE TWO THREE FOUR
one two three

You can “invert” the preceding result by using the -v switch, as shown here:

grep -iv "one.*three" columns4.txt
123 ONE TWO
456 three four
five 123 six
four five
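The four variants of the one.*three search can be compared side by side by counting matching lines with -c. This is a sketch that recreates columns4.txt from Listing 1.2 in a scratch directory; the variable names and the extra "three then one" test string are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# Recreate columns4.txt from Listing 1.2 in a scratch directory.
workdir=$(mktemp -d)
file="$workdir/columns4.txt"
printf '%s\n' '123 ONE TWO' '456 three four' 'ONE TWO THREE FOUR' \
  'five 123 six' 'one two three' 'four five' > "$file"

plain=$(grep -c "one.*three" "$file")          # case-sensitive match
ignore_case=$(grep -c -i "one.*three" "$file") # case-insensitive match
inverted=$(grep -c -v "one.*three" "$file")    # lines that do not match
inverted_ci=$(grep -c -iv "one.*three" "$file")

# The pattern is ordered: "three" must come after "one" on the line.
# grep exits with status 1 when nothing matches, hence "|| true".
reversed=$(printf '%s\n' 'three then one' | grep -c "one.*three" || true)

echo "plain=$plain ignore_case=$ignore_case"
echo "inverted=$inverted inverted_ci=$inverted_ci reversed=$reversed"

rm -rf "$workdir"
```

For each pair, the matching and inverted counts add up to the six lines of the file.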

Sometimes you need to search a file for the presence of either of two strings. For example, the following command finds the files that contain start or end:

grep -l 'start\|end' *
ABC2.txt
ABC4.sh
abc3.sh

Later in the chapter, you will see how to find files that contain a pair of strings via the grep and xargs commands.

Character Classes and the grep Command

This section contains some simple one-line commands that combine the grep command with character classes. Note that a POSIX character class such as [:alpha:] is only valid inside a bracket expression, so it is written as [[:alpha:]].

echo "abc" | grep '[[:alpha:]]'
abc
echo "123" | grep '[[:alpha:]]'
(returns nothing, no match)
echo "abc123" | grep '[[:alpha:]]'
abc123
echo "abc" | grep '[[:alnum:]]'
abc
echo "123" | grep '[[:alnum:]]'
123
echo "abc123" | grep '[[:alnum:]]'
abc123
echo "abc" | grep '[0-9]'
(returns nothing, no match)
echo "123" | grep '[0-9]'
123
echo "abc123" | grep '[0-9]'
abc123
echo "abc123" | grep -w '[0-9]'
(returns nothing, no match)
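POSIX character classes are written inside a bracket expression, for example [[:alpha:]]; a pattern such as [:alpha:] on its own is read as an ordinary set of the characters ":", "a", "l", "p", and "h" (and GNU grep rejects it with an error). The sketch below wraps the test in a small helper; the name has_match is hypothetical and exists only for this illustration.

```shell
#!/usr/bin/env bash
# has_match TEXT REGEX prints "yes" if grep finds REGEX in TEXT, else "no".
has_match() {
  if printf '%s\n' "$1" | grep -q "$2"; then
    echo yes
  else
    echo no
  fi
}

alpha_abc=$(has_match abc '[[:alpha:]]')   # letters present
alpha_123=$(has_match 123 '[[:alpha:]]')   # no letters
alnum_123=$(has_match 123 '[[:alnum:]]')   # digits are alphanumeric
digit_abc=$(has_match abc '[0-9]')         # no digits

# -w requires the single digit to be a whole word of its own.
lone_digit=$(printf '%s\n' 'abc 1 23' | grep -o -w '[0-9]')

echo "alpha/abc=$alpha_abc alpha/123=$alpha_123"
echo "alnum/123=$alnum_123 digit/abc=$digit_abc lone digit=$lone_digit"
```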

Working with the –c Option in grep

Consider a scenario in which a directory (such as a log directory) has files created by an outside program. Your task is to write a shell script that determines which (if any) of the files contain two occurrences of a string, after which additional processing is performed on the matching files (e.g., use email to send log files containing two or more error messages to a system administrator for investigation).

One solution involves the -c option for grep (which counts the matching lines in each file), followed by additional invocations of the grep command.

The command snippets in this section assume the following data files whose contents are shown below.

The file hello1.txt contains the following:

hello world1

The file hello2.txt contains the following:

hello world2
hello world2 second time

The file hello3.txt contains the following:

hello world3
hello world3 two
hello world3 three

Now launch the following commands, where warnings and errors are discarded by the 2>/dev/null redirection, so you will not see them:

grep -c hello hello*txt 2>/dev/null
hello1.txt:1
hello2.txt:2
hello3.txt:3
grep -l hello hello*txt 2>/dev/null
hello1.txt
hello2.txt
hello3.txt
grep -c hello hello*txt 2>/dev/null |grep ":2$"
hello2.txt:2

Note how we use the “ends with” $ metacharacter to grab just the files that have exactly two matches. We also use the colon :2$ rather than just 2$ to prevent grabbing files that have 12, 32, or 142 matches (which would end in :12, :32 and :142).

What if we wanted to show “two or more” (as in the “2 or more errors in a log”)? In this case, you would use the invert (-v) option to exclude counts of exactly 0 or exactly 1.

grep -c hello hello*txt 2>/dev/null |grep -v ':[0-1]$'
hello2.txt:2
hello3.txt:3

In a real-world application, you would want to strip off everything after the colon to return only the filenames. There are many ways to do so, but we will use the cut command, which involves defining : as a delimiter with -d":" and using -f1 to return the first column (i.e., the part before the colon in the return text):

grep -c hello hello*txt 2>/dev/null | grep -v ':[0-1]$'| cut -d":" -f1
hello2.txt
hello3.txt
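As an alternative sketch, a shell loop can compare each count numerically instead of filtering the text of the grep -c output, which also sidesteps filenames that happen to contain a colon. The temporary directory and the flagged variable are assumptions made for this example; the three files mirror the hello*.txt files shown above.

```shell
# Sample files mirroring hello1.txt, hello2.txt, and hello3.txt.
workdir=$(mktemp -d)
printf 'hello world1\n' > "$workdir/hello1.txt"
printf 'hello world2\nhello world2 second time\n' > "$workdir/hello2.txt"
printf 'hello world3\nhello world3 two\nhello world3 three\n' > "$workdir/hello3.txt"

# Collect the names of files with at least two matching lines.
flagged=""
for f in "$workdir"/hello*txt; do
  # With a single file argument, grep -c prints only the count.
  count=$(grep -c hello "$f" 2>/dev/null)
  if [ "${count:-0}" -ge 2 ]; then
    flagged="$flagged${f##*/} "
  fi
done

echo "$flagged"
rm -r "$workdir"
```

The threshold lives in one place (the -ge 2 test), so changing “two or more” to another limit does not require a new regular expression.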

Matching a Range of Lines

The head and tail commands display a range of lines in a text file. Now suppose that you want to search a range of lines for a string. For instance, the following command displays lines 7 through 15 of longfile.txt (head -15 keeps the first 15 lines, and tail -9 keeps the last 9 of those):

cat -n longfile.txt |head -15|tail -9

The output is here:

     7  and each line
     8  contains
     9  one or
    10  more words
    11  and if you
    12  use the cat
    13  command the
    14  file contents
    15  scroll

This command displays the subset of lines 7 through 15 of longfile.txt that contain the string and:

cat -n longfile.txt |head -15|tail -9 | grep and

The output is here:

     7  and each line
    11  and if you
    13  command the

Adding a space after the word and might appear to exclude the line with the word command, but it does not, because “command” is itself followed by a space in “command the”:

cat -n longfile.txt |head -15|tail -9 | grep "and "

The output is here:

     7  and each line
    11  and if you
    13  command the

Note that the preceding command also misses lines that end in “and” because no space follows “and” at the end of a line. You could cover that situation with an “OR” operator that includes both cases:

cat -n longfile.txt |head -15|tail -9 | grep " and\|and "
     7  and each line
    11  and if you
    13  command the

However, the preceding pattern still matches “command.” Hence, if you really want to match a specific word, it is best to use the -w option, which matches whole words only and handles these variations:

cat -n longfile.txt |head -15|tail -9 | grep -w "and"
     7  and each line
    11  and if you
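The difference between substring matching and whole-word matching can be checked with a few made-up lines. This is a minimal sketch; the sample text is an assumption and is not taken from longfile.txt.

```shell
# Made-up input: the first and last lines contain "and" as a separate word,
# while the middle two contain it only inside a longer word.
sample=$(printf '%s\n' 'and each line' 'command the' 'sandbox' 'this and')

# Plain grep selects every line that contains the substring.
substring_hits=$(printf '%s\n' "$sample" | grep -c 'and')

# -w selects only lines where the match is bounded by non-word characters
# (or by the start or end of the line).
word_hits=$(printf '%s\n' "$sample" | grep -cw 'and')

echo "substring=$substring_hits word=$word_hits"
```

Note that -w also handles a match at the very end of a line, which a pattern with a trailing space would miss.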

The use of whitespace is safer if you are looking for something at the beginning or end of a line. This is a common approach when reading contents of log files or other structured text where the first word is often important (a tag like ERROR or Warning, a numeric code, or a date). This command displays the lines that start with the word and:

cat longfile.txt |head -15|tail -9 | grep "^and "

The output is here (without the line number because we are not using cat -n):

and each line
and if you

Recall that passing the file name(s) directly to a command, instead of using cat to display the file first, is more efficient:

head -15 longfile.txt |tail -9 | grep "^and "
and each line
and if you

However, the head command does not display the line numbers of a text file, so the “cat first” style (cat -n adds line numbers) was used in the earlier examples where you wanted to see the line numbers, even though this style is less efficient. Hence, add an extra command to a pipe only if it adds value; otherwise, start with a direct call to the files you want to process with the first command in the pipe (assuming the command syntax is capable of reading in filenames).
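As a sketch of two alternatives, sed can print a line range directly, and grep -n can supply line numbers without cat -n. The file below is a hypothetical stand-in for longfile.txt: lines 7 through 15 follow the output shown earlier, while lines 1 through 6 are placeholders.

```shell
workdir=$(mktemp -d)
{
  # Lines 1-6 are placeholders (assumption); lines 7-15 follow the earlier output.
  printf 'placeholder %s\n' 1 2 3 4 5 6
  printf '%s\n' 'and each line' 'contains' 'one or' 'more words' \
                'and if you' 'use the cat' 'command the' 'file contents' 'scroll'
} > "$workdir/longfile.txt"

# sed -n '7,15p' prints only lines 7 through 15.
range_hits=$(sed -n '7,15p' "$workdir/longfile.txt" | grep -w 'and')

# grep -n prefixes each match with its line number in the original file;
# awk then keeps the matches whose line number falls inside the range.
numbered_hits=$(grep -nw 'and' "$workdir/longfile.txt" | awk -F: '$1 >= 7 && $1 <= 15')

echo "$range_hits"
echo "$numbered_hits"
rm -r "$workdir"
```

The second form keeps the original line numbers, which is the main reason the earlier examples used cat -n.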

Using Backreferences in the grep Command

The grep command allows you to reference a set of characters that match a regular expression placed inside a pair of parentheses. For grep to parse the parentheses correctly in the basic regular expressions that it uses by default, each has to be preceded with the escape character “\”.

For example, grep 'a\(.\)' uses the “.” metacharacter to match ab or a3 but not 3a or ba.

The backreference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized sub-expression of the regular expression. For example, grep '\(a\)\1' matches aa and grep '\(a\)\1\1' matches aaa. (A reference such as \2 is an error when the pattern contains only one group.)

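When used with alternation, if the group does not participate in the match, then the backreference makes the whole match fail. For example, grep 'a\(.\)\|b\1' does not match ba or bb, because the second alternative refers to a group that did not participate in the match. (The pattern does match ab, but only through the first alternative.)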

If you have more than one parenthesized group, the groups are referenced (from left to right) by \1, \2, . . ., \9. A backreference repeats the exact text that its group matched, not the pattern inside the group:

grep -e '\([a-z]\)\([0-9]\)\1' is similar to this command, except that the final letter must be the same as the first letter:
grep -e '\([a-z]\)\([0-9]\)\([a-z]\)'
grep -e '\([a-z]\)\([0-9]\)\2' is similar to this command, except that the two digits must be identical:
grep -e '\([a-z]\)\([0-9]\)\([0-9]\)'

The easiest way to think of it is that the number (for example, \2) is a variable that holds whatever text the corresponding group matched. For instance, a1a satisfies the first pattern above, but a1b does not, even though a1b satisfies the longer pattern with three groups. This makes it possible to detect repeated text.

You can match consecutive identical digits using the pattern \([0-9]\)\1. For example, the following command is a successful match because the string 1223 contains a pair of consecutive identical digits:

echo "1223" | grep  -e '\([0-9]\)\1'

Similarly, the following command is a successful match because the string 12223 contains three consecutive occurrences of the digit 2:

echo "12223" | grep  -e '\([0-9]\)\1\1'

You can check for the occurrence of two identical digits separated by any character with this expression:

echo "12z23" | grep  -e '\([0-9]\).\1'

In an analogous manner, you can test for the occurrence of duplicate letters, as shown here:

echo "abbc" | grep  -e '\([a-z]\)\1'
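The following sketch runs the digit pattern against several made-up strings and records which ones match. It illustrates that the backreference requires the same digit twice in a row, not merely two digits in a row; the sample strings and the result variable are assumptions for this example.

```shell
# Report which sample strings contain two identical digits in a row.
result=""
for s in 1223 1234 9001 abbc; do
  if printf '%s\n' "$s" | grep -q -e '\([0-9]\)\1'; then
    result="$result$s:match "
  else
    result="$result$s:no "
  fi
done

echo "$result"
```

The string 1234 has adjacent digits but no repeated digit, so it is rejected, and abbc is rejected because its repeated character is a letter.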

The following example matches an IP address and does not use backreferences, just the bracket expression [0-9] to match digits and the sequence \. to match literal periods. (The \d shorthand for digits is not part of the basic or extended regular expressions that grep uses by default.)

echo "192.168.125.103" | grep -e '\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)'

If you want to allow for fewer than three digits, you can use the interval expression \{1,3\}, which matches 1, 2, or 3 digits in the third block of the next example. In a situation where any of the four blocks might have fewer than three digits, you must use the same type of syntax in all four blocks:

echo "192.168.5.103" | grep -e '\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)\.\([0-9]\)\{1,3\}\.\([0-9][0-9][0-9]\)'
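With the -E option (extended regular expressions), the braces and parentheses need no backslashes, and a repeated group keeps the pattern short. The sketch below is one possible formulation rather than the only one; the variable ip_pattern and the helper check_ip are hypothetical names.

```shell
# Each block is 1 to 3 digits; ^ and $ force the whole line to be an address.
# This checks the shape only, so a value such as 999.1.1.1 is also accepted.
ip_pattern='^([0-9]{1,3}\.){3}[0-9]{1,3}$'

check_ip() {  # hypothetical helper: prints "ok" or "bad"
  if printf '%s\n' "$1" | grep -Eq "$ip_pattern"; then echo ok; else echo bad; fi
}

a=$(check_ip "192.168.125.103")
b=$(check_ip "192.168.5.103")
c=$(check_ip "192.168.5")       # only three blocks
d=$(check_ip "1234.1.1.1")      # first block has four digits

echo "$a $b $c $d"
```

Validating that each block lies between 0 and 255 would need a longer pattern or a numeric check in a tool such as awk.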

You can perform more complex matches using backreferences. Listing 1.3 displays the content of columns5.txt, which contains several lines that are palindromes (they read the same from left to right as from right to left). Note that the third line is an empty line.

Listing 1.3: columns5.txt

one eno
ONE ENO
 
ONE TWO OWT ENO
four five

The following command finds all lines that are palindromes:

grep -w -e '\(.\)\(.\).*\2\1' columns5.txt

The output of the preceding command is here:

one eno
ONE ENO
ONE TWO OWT ENO

The idea is as follows: the first \(.\) matches a single character, followed by a second \(.\) that matches another single character, followed by any number of intermediate characters. The sequence \2\1 then requires those two characters to appear again in reverse order. Because only two characters on each side are compared, this pattern is a quick heuristic rather than a complete palindrome test.
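For comparison, a stricter check can reverse each line and compare it with the original. The sketch below uses awk for the reversal; the input is a hypothetical file that holds the lines of columns5.txt plus one made-up line, abcxba, which starts and ends like a palindrome but is not one.

```shell
workdir=$(mktemp -d)
# Hypothetical input: the lines of columns5.txt plus the made-up line "abcxba".
printf '%s\n' 'one eno' 'ONE ENO' '' 'ONE TWO OWT ENO' 'four five' 'abcxba' \
  > "$workdir/sample.txt"

# The backreference pattern compares only two characters on each side.
grep_hits=$(grep -w -e '\(.\)\(.\).*\2\1' "$workdir/sample.txt")

# awk reverses each non-empty line and prints it only if it reads the same both ways.
strict_hits=$(awk 'length($0) > 0 {
  rev = ""
  for (i = length($0); i >= 1; i--) rev = rev substr($0, i, 1)
  if (rev == $0) print
}' "$workdir/sample.txt")

echo "$grep_hits"
echo "---"
echo "$strict_hits"
rm -r "$workdir"
```

The grep pattern accepts abcxba, whereas the reversal test rejects it, which shows where the heuristic and the full check diverge.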

Finding Empty Lines in Datasets

Recall that the metacharacter ^ refers to the beginning of a line and the metacharacter $ refers to the end of a line. Thus, an empty line matches the sequence ^$. You can find the single empty line in columns5.txt with this command:

grep -n "^$" columns5.txt

The output of the preceding grep command is here (use the -n switch to display line numbers, as blank lines will not otherwise show in the output):

3:
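Lines that look blank sometimes contain spaces or tabs, and those lines do not match ^$. As a minimal sketch, the [[:space:]] class can be used to catch them as well; the file blanks.txt and its contents are assumptions made for this example.

```shell
workdir=$(mktemp -d)
# Hypothetical file: line 2 is empty, line 4 holds only a space and a tab.
printf 'alpha\n\nbeta\n \t\ngamma\n' > "$workdir/blanks.txt"

# ^$ matches only lines that contain no characters at all.
empty_only=$(grep -n '^$' "$workdir/blanks.txt")

# ^[[:space:]]*$ also matches lines made up entirely of spaces or tabs.
blank_count=$(grep -c '^[[:space:]]*$' "$workdir/blanks.txt")

echo "empty lines: $empty_only"
echo "blank-looking lines: $blank_count"
rm -r "$workdir"
```

Which of the two patterns is appropriate depends on whether whitespace-only lines should count as empty in your data.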

More commonly, the goal is to strip the empty lines from a file. We can do that just by inverting the "^$" query (and not showing the line numbers):

grep -v "^$" columns5.txt
one eno
ONE ENO
ONE TWO OWT ENO
four five

As you can see, the preceding output displays four non-empty lines, and as we saw in the previous grep