This book introduces readers to three powerful command-line utilities—grep, sed, and awk—that can be combined to create simple yet powerful shell scripts. Using the bash shell, it focuses on small text files to help readers understand these tools. Grep searches for patterns in data, sed modifies data, and awk performs tasks on pattern matches. Aimed at those new to the bash environment, the book is also valuable for those with some experience.
The journey starts with grep, teaching how to search for specific words or patterns in data. It then moves to sed, showing how to change or modify data efficiently. Finally, it delves into awk, a versatile programming language for searching and processing data files. The book also includes a chapter on using regular expressions with these tools, enhancing your scripting capabilities.
Mastering these utilities is crucial for efficient data handling and automation in a bash environment. This book transitions readers from basic to advanced command-line skills, blending theory with practical examples. It is an essential resource for anyone looking to harness the full power of bash scripting.
Pocket Primer
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book and its companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information, files, or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
Mercury Learning and Information (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.
Companion files also available for downloading from the publisher by writing to [email protected]
Pocket Primer
Copyright ©2023 by Mercury Learning and Information LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
Mercury Learning and Information
22841 Quicksilver Drive
Dulles, VA 20166
www.merclearning.com
1-800-232-0223
O. Campesato. Working with grep, sed, and awk Pocket Primer.
ISBN: 978-150152-151-5
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2023934889
23 24 25 3 2 1
Printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).
All of our titles are also available in digital format at numerous digital vendors. Companion files are available for download by writing to the publisher at [email protected]. The sole obligation of Mercury Learning and Information to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives.
Preface
Chapter 1 Working with grep
What is the grep Command?
Meta Characters and the grep Command
Escaping Meta Characters with the grep Command
Useful Options for the grep Command
Character Classes and the grep Command
Working with the -c Option in grep
Matching a Range of Lines
Using Backreferences in the grep Command
Finding Empty Lines in Datasets
Using Keys to Search Datasets
The Backslash Character and the grep Command
Multiple Matches in the grep Command
The grep Command and the xargs Command
Searching Zip Files for a String
Checking for a Unique Key Value
Redirecting Error Messages
The egrep Command and fgrep Command
Displaying “Pure” Words in a Dataset with egrep
The fgrep Command
Delete Rows with Missing Values
A Simple Use Case
Summary
Chapter 2 Working with sed
What is the sed Command?
The sed Execution Cycle
Matching String Patterns Using sed
Substituting String Patterns Using sed
Replacing Vowels from a String or a File
Deleting Multiple Digits and Letters from a String
Search and Replace with sed
Regular Expressions with sed
Datasets with Multiple Delimiters
Useful Switches in sed
Working with Datasets
Printing Lines
Character Classes and sed
Removing Control Characters
Counting Words in a Dataset
Back References in sed
One-Line sed Commands
Populate Missing Values with the sed Command
A Dataset with 1,000,000 Rows
Numeric Comparisons
Counting Adjacent Digits
Average Support Rate
Summary
Chapter 3 Working with awk
The awk Command
Launching awk Commands and File Extensions
Built-In Variables that Control awk
How Does the awk Command Work?
Three Important Code Blocks in awk Commands
Different Ways to Print Text
Working with the -F Switch in awk
Splitting Strings into Characters
The PROCINFO Array
Ignore Case in awk
Working with OFS, ORS, and Linefeed versus “\n”
Linefeed versus “\n”
Basic awk Examples with Text Files
Display the Contents of a File
Omit the Header Line of a File
Display a Range of Lines in a Text File With NR
Display a Range of Lines Based on Strings
Insert Blank Lines
Remove Blank Lines
Specifying the Number of Fields
Changing the Field Separator FS
Exclude Fields
Switch Adjacent Fields
Display Fields in Reverse Order
Count Non-Empty and Empty Rows
Detecting Transfer Protocols
Detecting Number Formats
Working with Record Lengths
Aligning Text with the printf() Statement
Working with Loops in awk
A for Loop in awk
Exponents in a for Loop
A for Loop with a break Statement
Working with while Loops in awk
A do-while Loop in awk
Conditional Logic and Control Statements
The break, continue, and next Keywords in awk
The exit Keyword
Conditionally Displaying Fields
Logical Operators in awk
Logical Operators and Partial Matches
Checking for Leap Years
Formatting Output
Floating Point Output
Inserting a Thousands Separator
Scientific Notation
Rounding and Truncating Numbers
Rounding Numbers
Truncating Numbers
Numeric Functions in awk
Convert Base 10 to Binary
Built-In String Functions in awk
Convert Strings to Lowercase in awk
Convert Strings to Uppercase in awk
Convert Strings to Mixed Case in awk
Counting Lines that Match a Character
Working with the match() Function
Characters and Hexadecimal Numbers in awk
Non-Printable Characters
Hexadecimal Numbers
Summary
Chapter 4 Working with awk, Part 2
Working with Arrays in awk
Initializing and Printing the Contents of an Array
Initializing and Deleting the Contents of an Array
Adding an Array of Numbers
Find the Maximum and Minimum of an Array of Numbers
Concatenating an Array of Strings
Generating Arrays from Input Strings
The split() Function with Linefeed Strings
Using the split() Function with the case Statement
The patsplit() Function
Multi-Dimensional Arrays
Higher Dimensionality Arrays
Calculating Invoice Totals (1)
Calculating Invoice Totals (2)
Calculating Invoice Averages
Counting Fields in Text Files
Displaying the Number of Fields in Text Files
Deleting Alternate Lines in Datasets
Print a Range of Strings in Datasets
Print a Range of Dates in Datasets
Merging Lines in Datasets
Printing File Contents as a Single Line
Joining Groups of Lines in a Text File
Joining Alternate Lines in a Text File
Reversing the Lines in a File
Working with Duplicate Lines in a File
Display Duplicate Rows
Remove Duplicate Rows
Uniformity of Data Values
Count Duplicate Fields in a File
Calculating Invoice Totals
Printing Lines Using Conditional Logic
Duplicate and Unique Rows in Text Files
Splitting Filenames with awk
One-Line awk Commands
Useful Short awk Scripts
Printing the Words in a Text String in awk
Count Occurrences of a String in Specific Rows
Well-Known Sorting Algorithms
The Bubble Sort
Find Anagrams in a List of Words
Sort an Array in O(n) Complexity
Find the Median of an Array of Numbers
Linear Search
Binary Search (Iterative)
Binary Search Walkthrough
Code for a Binary Search (Iterative)
Binary Search (Recursive)
Summary
Chapter 5 Working with awk, Part 3
Bit Operations in awk
Logical AND
Logical OR
Logical XOR
Logical NOT
Logical Left Shift
Logical Right Shift
Reverse a String
Find Palindromes
Check for Unique Characters
Check for Balanced Parentheses
Printing a String in a Fixed Number of Columns
Printing a Dataset in a Fixed Number of Columns
Aligning Columns in Datasets
Aligning Columns and Multiple Rows in Datasets
Displaying a Subset of Columns in a Text File
Subsets of Columns Aligned Rows in Datasets
Longest/Shortest Words in Datasets
Counting Word Frequency in Datasets
Displaying Only “Pure” Words in a Dataset
Delete Rows with Missing Values
Working with Multiline Records in awk
CSV Files with Multirow Records
Processing Multiple Files with awk
Inner Join of Two Files in awk
Logical Operators in awk
A Simple Use Case
Another Use Case
Date Conversion
A Dataset with 1,000,000 Rows
Numeric Comparisons
Counting Adjacent Digits
Average Support Rate
What is Recursion?
Arithmetic Series
Calculating Arithmetic Series (Iterative)
Calculating Arithmetic Series (Recursive)
Calculating Partial Arithmetic Series
Geometric Series
Calculating a Geometric Series (Iterative)
Calculating a Geometric Series (Recursive)
Factorial Values
Fibonacci Numbers
Euclid’s Algorithm (GCD)
Lowest Common Multiple of Two Positive Integers
Composite versus Prime Numbers
Factors of Positive Integers
Goldbach’s Conjecture
Environment Variables in awk
Summary
Chapter 6 Regular Expressions
What are Regular Expressions?
Metacharacters
Character Sets
Working with “^” and “\”
Character Classes
POSIX Character Classes
Generic Character Classes
Regular Expressions in awk
Matching Strings with the Dot “.” Metacharacter
Matching Strings with the “^” Metacharacter
Matching Strings with the “$” Metacharacter
Matching Strings with the “|” Metacharacter
Matching with ?, +, or * Metacharacters
Matching Subgroups with ?, +, or * Metacharacters
Matching with Character Classes
Working with Postfix Arithmetic Operators
Working with the sub() Function in awk
Examples Using the sub() Function
Working with the gsub() Function
Removing Digits with the gsub() Function
Removing Characters with the gsub() Function
Removing Consecutive Characters with the gsub() Function
Removing Complements of Strings with the gsub() Function
Removing Metacharacters with the gsub() Function
Removing Spaces in Text Fields
Splitting Strings with Regular Expressions
Dynamic Regular Expressions
Regular Expressions Involving Text Files
Counting Comments with Regular Expressions
Combining grep with awk
Combining sed with awk
Removing HTML Tags
The gensub() Function and Captured Groups
Summary
Index
The goal of this book is to introduce readers to three powerful command line utilities that can be combined to create simple yet powerful shell scripts for performing a multitude of tasks. The code samples and scripts use the bash shell, and typically involve small text files, so you can focus on understanding the features of grep, sed, and awk. Aimed at a reader new to working in a bash environment, the book is comprehensive enough to be a good reference and teaches new tricks to those who already have some experience with these command line utilities.
This book takes introductory concepts and demonstrates their use in simple yet powerful shell scripts. Keep in mind that this book does not cover “pure” system administration functionality.
This book is intended for general users as well as anyone who wants to perform a variety of tasks from the command line.
You will acquire an understanding of how to use grep, sed, and awk, whose functionality is discussed in the first five chapters. Specifically, Chapter 1 introduces the grep command, Chapter 2 introduces the sed command, and Chapters 3 through 5 discuss the awk command. The sixth and final chapter introduces you to regular expressions.
This book saves you the time required to search for relevant code samples and adapt them to your specific needs, which is a potentially time-consuming process.
The code samples in this book were created and tested using bash on a MacBook Pro with OS X 10.15.7 (macOS Catalina). Regarding their content: the code samples are derived primarily from scripts prepared by the author, and in some cases, there are code samples that incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the “Four Cs”: they must be Clear, Concise, Complete, and Correct to the extent that it is possible to do so, given the size of this book.
You need some familiarity with working from the command line in a Unix-like environment. However, there are subjective prerequisites, such as a desire to learn shell programming, along with the motivation and discipline to read and understand the code samples. In any case, if you’re not sure whether or not you can absorb the material in this book, glance through the code samples to get a feel for the level of complexity.
If you are a Mac user, there are three ways to launch a command shell. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Terminal application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:
open /Applications/Utilities/Terminal.app
A second method for Mac users is to open a new command shell on a MacBook from a command shell that is already visible, simply by pressing Command+N in that command shell; your Mac will then launch another command shell.
If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which provides a bash-like environment, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process.
If you use RStudio, you need to launch a command shell inside of RStudio by navigating to Tools > Command Line, and then you can launch bash commands. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).
The answer to this question varies widely, mainly because the answer depends heavily on your objectives. The best answer is to try a new tool or technique from the book on a problem or task that you care about, professionally or personally. Precisely what that might be depends on who you are, as the needs of a data scientist, manager, student, or developer are all different. In addition, keep what you learned in mind as you tackle new data cleaning or manipulation challenges. Sometimes knowing a technique is possible will make finding a solution easier, even if you have to re-read the section to remember exactly how the syntax works.
If you have reached the limits of what you have learned here and want to get further technical depth on these commands, there is a wide variety of literature published and online resources describing the bash shell, Unix programming, and the grep, sed, and awk commands.
This chapter introduces you to the versatile grep command that can process an input text stream to generate a desired output text stream. This command also works well with other Unix commands. This chapter contains many short code samples that illustrate various options of the grep command.
The first part of this chapter introduces the grep command used in isolation, in conjunction with meta characters (such as ^, $, and so forth), and with code snippets that illustrate how to use some of the options of the grep command. Next, you will learn how to match ranges of lines, how to use backreferences in grep, and how to “escape” meta characters in grep.
The second part of this chapter shows you how to use the grep command to find empty lines and common lines in datasets, as well as the use of keys to match rows in datasets. Next, you will learn how to use character classes with the grep command, as well as the backslash (\) character, and how to specify multiple matching patterns. You will learn how to combine the grep command with the find command and the xargs command, which is useful for matching a pattern in files that reside in different directories. This section contains some examples of common mistakes that people make with the grep command.
The third section briefly discusses the egrep command and the fgrep command, which are related commands that provide additional functionality that is unavailable in the standard grep utility. The fourth section contains a use case that illustrates how to use the grep command to find matching lines that are then merged to create a new dataset.
The grep (“Global Regular Expression Print”) command is useful for finding strings in one or more files. Several examples are here:
grep abc *sh displays all the lines containing abc in files with the suffix sh.
grep -i abc *sh is the same as the preceding query, but case-insensitive.
grep -l abc *sh displays all the filenames with the suffix sh that contain abc.
grep -n abc *sh displays all the line numbers of the occurrences of the string abc in files with the suffix sh.
You can perform logical AND and logical OR operations with this syntax:
grep abc *sh | grep def matches lines containing abc AND def.
grep "abc\|def" *sh matches lines containing abc OR def.
You can combine switches as well: the following command displays the names of the files that contain the string abc (case insensitive):
grep -il abc *sh
In other words, the preceding command matches filenames that contain abc, Abc, ABc, ABC, abC, and so forth.
Another (less efficient) way to display the lines containing abc (case insensitive) is here:
cat file1 | grep -i abc
The preceding command involves two processes, whereas invoking grep directly on the file (for example, grep -i abc file1) involves a single process. The execution time is roughly the same for small text files, but the difference can become more significant if you are working with multiple large text files.
You can combine the sort command, the pipe symbol, and the grep command. For example, the following command displays the files with a “Jan” date in increasing size:
ls -l |grep " Jan " | sort -n
A sample output from the preceding command is here:
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any meta-character with special meaning may be quoted by preceding it with a backslash.
A regular expression may be followed by one of several repetition operators, as shown here:
"." matches any single character.
"?" indicates that the preceding item is optional and will be matched at most once: Z? matches Z or ZZ.
"*" indicates that the preceding item will be matched zero or more times: Z* matches Z, ZZ, ZZZ, and so forth.
"+" indicates that the preceding item will be matched one or more times: Z+ matches ZZ, ZZZ, and so forth.
"{n}" indicates that the preceding item is matched exactly n times: Z{3} matches ZZZ.
"{n,}" indicates that the preceding item is matched n or more times: Z{3} matches ZZZ, ZZZZ, and so forth.
"{,m}" indicates that the preceding item is matched at most m times: Z{,3} matches Z, ZZ, and ZZZ.
"{n,m}" indicates that the preceding item is matched at least n times, but not more than m times: Z{2,4} matches ZZ, ZZZ, and ZZZZ.
The empty regular expression matches the empty string (i.e., a line in the input stream with no data). Two regular expressions may be joined by the infix operator (|). When used in this manner, the infix operator behaves exactly like a logical “OR” statement, which directs the grep command to return any line that matches either regular expression.
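As a quick sketch of these operators in action (using the backslash-escaped brace syntax that basic grep requires, and echo to supply sample input):
echo "ZZZ" | grep 'Z\{3\}'                  # matches: three consecutive Zs
echo "ZZ" | grep 'Z\{3,\}'                  # no match: fewer than three Zs
echo "ZZZZ" | grep 'Z\{2,4\}'               # matches: between two and four Zs
echo "hello there" | grep 'hello\|goodbye'  # alternation: matches the line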
Listing 1.1 displays the content of lines.txt that contains lines with words and metacharacters.
Listing 1.1: lines.txt
The following grep command lists the lines of length 2 (using the ^ to begin and $ to end, with operators to restrict the length) in lines.txt:
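grep '^..$' lines.txt    # one plausible form: exactly two characters between ^ and $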
The following command lists the lines of length two in lines.txt that contain two dots (the backslash tells grep to interpret the dots as actual dots, not as metacharacters):
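grep '^\.\.$' lines.txt    # one plausible form: a line consisting of exactly two dots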
The result is shown here:
The following command also displays lines of length 2 that begin and end with a dot. Note that the * matches any text of any length, including no text at all, and is used as a metacharacter because it is not preceded with a backslash:
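grep '^\..*\.$' lines.txt    # one plausible form: a dot, any text (possibly none), then a dot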
The following command lists the lines that contain a period, followed by an asterisk, and then another period (the * is now a character that must be matched because it is preceded by a backslash):
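grep '\.\*\.' lines.txt    # one plausible form: a dot, a literal asterisk, then a dot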
There are many types of pattern matching possibilities with the grep command, and this section contains an eclectic mix of such commands that handle common scenarios.
In the following examples, we have four text files (two .sh and two .txt) and two Word documents in a directory. The string abc is found on one line in abc1.txt and three lines in abc3.sh. The string ABC is found on two lines in ABC2.txt and four lines in ABC4.sh. Notice that abc is not found in the ABC files, and ABC is not found in the abc files.
The following code snippet searches for occurrences of the string abc in all the files in the current directory that have sh as a suffix:
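grep abc *sh    # one plausible form of this snippet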
The “-c” option counts the number of matching lines: even though ABC4.sh has no matches, it is still listed, with a count of zero:
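grep -c abc *sh    # one plausible form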
The output of the preceding command is here:
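abc3.sh:3
ABC4.sh:0
(The ordering of the two filenames depends on how your shell expands *sh.)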
The “-e” option lets you match patterns that would otherwise cause syntax problems (a pattern that begins with the “-” character would normally be interpreted as an option for grep):
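grep -e "-abc" *sh    # one plausible form: the pattern -abc would otherwise look like an option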
The “-e” option also lets you match multiple patterns:
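grep -e abc -e def *sh    # one plausible form: matches lines containing abc or def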
The “-i” option is to perform a case insensitive match:
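grep -i abc *sh    # one plausible form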
The “-v” option “inverts” the matching string, which means that the output consists of the lines that do not contain the specified string (ABC does not match because -i is not used, and ABC4.sh has an entirely empty line):
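grep -v abc *sh    # one plausible form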
Use the “-iv” options to display the lines that do not contain a specified string using a case insensitive match:
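grep -iv abc *sh    # one plausible form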
The “-l” option lists only the filenames that contain a successful match (note that this matches the contents of files, not the filenames). The Word document matches because the actual text is still visible to grep; it is just surrounded by proprietary formatting. You can do similar things with other formats that contain text, such as XML, HTML, CSV, and so forth:
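grep -l abc *    # one plausible form: searches every file, including the Word documents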
The “-l” option is to list only the filenames that contain a successful match:
Use the “-il” options to display the filenames that contain a specified string using a case insensitive match:
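grep -il abc *    # one plausible form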
The preceding command is very useful when you want to check for the occurrence of a string in Word documents.
The “-n” option displays the line number of each matching line:
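grep -n abc *sh    # one plausible form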
The “-h” option suppresses the display of the filename for a successful match:
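grep -h abc *sh    # one plausible form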
For the next series of examples, we will use columns4.txt, as shown in Listing 1.2.
Listing 1.2: columns4.txt
The "-o" option shows only the matched string (this is how you avoid returning the entire line that matches):
The "-o" option followed by the "-b" option shows the position of the matched string (returns character position, not line number. The "o" in “one” is the 59th character of the file):
You can specify a recursive search, as shown here (output not shown because it will be different on every client or account. This searches not only every file in directory /etc, but every file in every subdirectory of etc):
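grep -r abc /etc    # one plausible form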
The preceding commands match lines where the specified string is a substring of a longer string in the file. For instance, the preceding commands will match occurrences of abc as well as abcd, dabc, abcde, and so forth.
If you want to exclude everything except for an exact match, you can use the -w option, as shown here:
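grep -w abc *sh    # one plausible form: matches abc but not abcd or dabc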
The --color switch displays the matching string in color:
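grep --color abc *sh    # one plausible form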
You can use the pair of metacharacters (.*) to find the occurrences of two words that are separated by an arbitrary number of intermediate characters.
The following command finds all lines that contain the strings one and three with any number of intermediate characters:
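grep "one.*three" columns4.txt    # one plausible form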
You can “invert” the preceding result by using the -v switch, as shown here:
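grep -v "one.*three" columns4.txt    # one plausible form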
The following command finds all lines that contain the strings one and three with any number of intermediate characters, where the match involves a case-insensitive comparison:
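grep -i "one.*three" columns4.txt    # one plausible form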
You can “invert” the preceding result by using the -v switch, as shown here:
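grep -iv "one.*three" columns4.txt    # one plausible form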
Sometimes you need to search a file for the presence of either of two strings. For example, the following command finds the files that contain start or end:
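grep -l 'start\|end' *txt    # one plausible form: lists files containing either string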
Later in the chapter, you will see how to find files that contain a pair of strings via the grep and xargs commands.
This section contains some simple one-line commands that combine the grep command with character classes.
Consider a scenario in which a directory (such as a log directory) has files created by an outside program. Your task is to write a shell script that determines which (if any) of the files contain two occurrences of a string, after which additional processing is performed on the matching files (e.g., use email to send log files containing two or more error messages to a system administrator for investigation).
One solution involves the -c option for grep, followed by additional invocations of the grep command.
The command snippets in this section assume the following data files whose contents are shown below.
The file hello1.txt contains the following:
The file hello2.txt contains the following:
The file hello3.txt contains the following:
Now launch the following commands, where warnings and errors are suppressed by the redirection 2>/dev/null, so you will not see them:
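grep -c hello hello*.txt 2>/dev/null              # plausible form, assuming the search string is hello
grep -c hello hello*.txt 2>/dev/null | grep ":2$" # keep only the files with exactly two matches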
Note how we use the “ends with” $ metacharacter to grab just the files that have exactly two matches. We also use the colon :2$ rather than just 2$ to prevent grabbing files that have 12, 32, or 142 matches (which would end in :12, :32 and :142).
What if we wanted to show “two or more” (as in the “2 or more errors in a log”)? In this case, you would use the invert (-v) command to exclude counts of exactly 0 or exactly 1.
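One plausible form of that command is:
grep -c hello hello*.txt 2>/dev/null | grep -v ":0$\|:1$"    # two or more matches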
In a real-world application, you would want to strip off everything after the colon to return only the filenames. There are many ways to do so, but we will use the cut command, which involves defining ":" as a delimiter with -d":" and using -f1 to return the first column (i.e., the part before the colon in the returned text):
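grep -c hello hello*.txt 2>/dev/null | grep ":2$" | cut -d":" -f1    # plausible form: filenames only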
The head and tail commands display a range of lines in a text file. Now suppose that you want to search a range of lines for a string. For instance, the following command displays lines 9 through 15 of longfile.txt:
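cat -n longfile.txt | head -15 | tail -7    # one plausible form: lines 9 through 15, with line numbers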
The output is here:
This command displays the subset of lines 9 through 15 of longfile.txt that contain the string and:
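cat -n longfile.txt | head -15 | tail -7 | grep "and"    # one plausible form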
The output is here:
This command includes a whitespace after the word and, thereby excluding the line with the word command:
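cat -n longfile.txt | head -15 | tail -7 | grep "and "    # one plausible form: "and" followed by a space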
The output is here:
Note that the preceding command excludes lines that end in “and” because they do not have the whitespace after “and” at the end of the line. You could remedy this situation with an “OR” operator including both cases:
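cat -n longfile.txt | head -15 | tail -7 | grep "and \|and$"    # one plausible form: "and " or "and" at line end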
However, the preceding allows “command” back into the mix. Hence, if you really want to match a specific word, it is best to use the -w option, which is smart enough to handle these variations:
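cat -n longfile.txt | head -15 | tail -7 | grep -w "and"    # one plausible form: the word "and" only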
The use of whitespace is safer if you are looking for something at the beginning or end of a line. This is a common approach when reading contents of log files or other structured text where the first word is often important (a tag like ERROR or Warning, a numeric code, or a date). This command displays the lines that start with the word and:
cat longfile.txt | head -15 | tail -9 | grep "^and "
The output is here (without the line number because we are not using cat -n):
Recall that the “use the file name(s) in the command, instead of using cat to display the file first” style is more efficient:
However, the head command does not display the line numbers of a text file, so the “cat first” style (cat -n adds line numbers) was used in the earlier examples when you wanted to see the line numbers, even though this style is less efficient. Hence, add an extra command to a pipe if it adds value; otherwise, start the pipe with a direct call to the files that you want to process (assuming the first command is capable of reading filenames).
The grep command allows you to reference a set of characters that match a regular expression placed inside a pair of parentheses. For grep to parse the parentheses correctly, each one has to be preceded with the escape character "\".
For example, grep 'a\(.\)' uses the “.” meta character to match ab or a3 but not 3a or ba.
The backreference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized sub-expression of the regular expression. For example, grep '\(a\)\1' matches aa, and grep '\(a\)\1\1' matches aaa.
When used with alternation, if the group does not participate in the match, then the backreference makes the whole match fail. For example, grep 'a\(.\)|b\1' does not match ba or ab or bb (or anything else really).
If you have more than one regular expression inside a pair of parentheses, they are referenced (from left to right) by \1, \2, ..., \9:
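echo "abab" | grep '\(a\)\(b\)\1\2'    # a plausible example: \1 refers to a, \2 refers to b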
The easiest way to think of it is that the number (for example, \2) is a placeholder or variable that saves you from typing the longer regular expression it references. As regular expressions can become extremely complex, this often helps code clarity.
You can match consecutive digits or characters using the pattern \([0-9]\)\1. For example, the following command is a successful match because the string 1223 contains a pair of consecutive identical digits:
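echo "1223" | grep '\([0-9]\)\1'    # one plausible form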
Similarly, the following command is a successful match because the string 12223 contains three consecutive occurrences of the digit 2:
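echo "12223" | grep '\([0-9]\)\1\1'    # one plausible form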
You can check for the occurrence of two identical digits separated by any character with this expression:
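echo "12z2w" | grep '\([0-9]\).\1'    # one plausible form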
In an analogous manner, you can test for the occurrence of duplicate letters, as shown here:
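echo "abcbd" | grep '\([a-z]\).\1'    # one plausible form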
The following example matches an IP address, and does not use backreferences: just \d (which matches a single digit) and \. (which matches a literal period):
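echo "192.168.5.103" | grep -e '\(\d\d\d\)\.\(\d\d\d\)\.\(\d\d\d\)\.\(\d\d\d\)'    # one plausible form; some grep versions require [0-9] instead of \d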
If you want to allow for fewer than three digits, you can use the expression {1,3}, which matches 1, 2, or 3 digits, as shown here for the third block:
echo "192.168.5.103" | grep -e '\(\d\d\d\)\.\(\d\d\d\)\.\ (\d\)\{1,3\}\.\(\d\d\d\)'
You can perform more complex matches using backreferences. Listing 1.3 displays the content of columns5.txt that contains several lines that are palindromes (the same spelling from left-to-right as right-to-left). Note that the third line is an empty line.
Listing 1.3: columns5.txt
The following command finds all lines that are palindromes:
grep -w -e '\(.\)\(.\).*\2\1' columns5.txt
The output of the preceding command is here:
The idea is as follows: the first \(.\) matches a single character, followed by a second \(.\) that matches another character, followed by any number of intermediate characters. The sequence \2\1 then matches those same two characters in reverse order, as specified by the two consecutive occurrences of \(.\).
Recall that the metacharacter ^ refers to the beginning of a line and the metacharacter $ refers to the end of a line. Thus, an empty line consists of the sequence ^$. You can find the single empty line in columns5.txt with this command:
grep -n "^$" columns5.txt
The output of the preceding grep command is here (use the -n switch to display line numbers, as blank lines will not otherwise show in the output):
3:
More commonly, the goal is to strip the empty lines from a file. We can do that just by inverting the prior query (and not showing the line numbers):
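grep -v "^$" columns5.txt    # one plausible form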
As you can see, the preceding output displays four non-empty lines, and as we saw in the previous grep