AWK is one of the oldest and most powerful utilities, available in all Unix and Unix-like distributions. It is used as a command-line utility for basic text-processing operations, and as a programming language for complex text-processing and mining tasks. With this book, you will gain the expertise required to practice advanced AWK programming on real-life examples.
The book starts off with an introduction to AWK essentials. You will then be introduced to regular expressions, AWK variables and constants, arrays, AWK functions, and more. The book then delves deeper into more complex tasks, such as printing formatted output in AWK and control flow statements. It also covers the advanced features of GNU's implementation of AWK (GAWK), such as network communication, debugging, and inter-process communication, which are not easily possible with the original AWK.
By the end of this book, the reader will have worked on the practical implementation of text processing and pattern matching using AWK to perform routine tasks.
Page count: 449
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Viraj Madhav
Content Development Editor: Cheryl Dsa
Technical Editor: Suwarna Patil
Copy Editors: Vikrant Phadkay, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade
First published: March 2018
Production reference: 1210318
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78839-103-0
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Shiwang Kalkhanda (RHCA, RHCSS, MCSE) is a Linux geek and consultant with expertise in the automation of infrastructure deployment and management. He has more than 10 years of experience in security, system, and network administration and training on open source tech. For most of his automation work, he uses shell scripting, Python & Go. He holds a master's and bachelor's degree in computer application. He enjoys traveling and spending time with his kids.
John C Kennedy has worked with UNIX and Linux since 1998. He has worked with Nagios as a monitoring tool for much of the past 5 years.
He has been reviewing and tech-editing books in his spare time since 2001 and has about 20 open source books to his credit. He believes the best part of reviewing is that he learns something from every book he works on.
He was born in the USA and grew up in Northern Virginia, USA. He spent some time in the US Air Force and has lived in Germany and the UK. He has been married to Michele since 1994 and has two children, Denise and Kieran. He lives in Virginia.
Marco Ippolito is an Italian software engineer working as director of software development for Imagining IT. Marco completed his postgraduate studies in software engineering in Oxford and has worked for large corporations such as Google, Oracle, Intel, HP, and Dell, as well as for start-ups such as Platform.sh. He can be reached at [email protected]. He has experience working in teams speaking Italian, English, Spanish, Brazilian Portuguese, German, and French, remotely or on-site.
Doug Ortiz is an experienced enterprise cloud, big data, data analytics, and solutions architect who has designed, developed, re-engineered, and integrated enterprise solutions. His other expertise is in Amazon Web Services, Azure, Google Cloud, business intelligence, Hadoop, Spark, NoSQL databases, and SharePoint, to mention a few.
He is the founder of Illustris, LLC and is reachable at [email protected].
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Learning AWK Programming
Dedication
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
Getting Started with AWK Programming
AWK programming language overview
What is AWK?
Types of AWK
When and where to use AWK
Getting started with AWK
Installation on Linux
Using the package manager
Compiling from the source code
Workflow of AWK
Action and pattern structure of AWK
Example data file
Pattern-only statements
Action-only statements
Printing each input line/record
Using the BEGIN and END blocks construct
The BEGIN block
The body block
The END block
Patterns
Actions
Running AWK programs
AWK as a Unix command line
AWK as a filter (reading input from the Terminal)
Running AWK programs from the source file
AWK programs as executable script files
Extending the AWK command line on multiple lines
Comments in AWK
Shell quotes with AWK
Data files used as examples in this book
Some simple examples with default usage
Multiple rules with AWK
Using standard input with names in AWK
AWK standard options
Standard command-line options
The -F option – field separator
The -f option (read source file)
The -v option (assigning variables)
GAWK-only options
The --dump-variables option (AWK global variables)
The --profile option (profiling)
The --sandbox option
The -i option (including other files in your program)
Include other files in the GAWK program (using @include)
The -V option
Summary
Working with Regular Expressions
Introduction to regular expressions
What is a regular expression?
Why use regular expressions?
Using regular expressions with AWK
Regular expressions as string-matching patterns with AWK
Basic regular expression construct
Understanding regular expression metacharacters
Quoted metacharacter
Anchors
Matching at the beginning of a string
Matching at the end of a string
Dot
Brackets expressions
Character classes
Named character classes (POSIX standard)
Complemented bracket expressions
Complemented character classes
Complemented named character classes
Alternation operator
Unary operator for repetition
Closure
Positive closure
Zero or one
Repetition ranges with interval expressions
A single number in brackets
A single number followed by a comma in brackets
Two numbers in brackets
Grouping using parentheses
Concatenation using alternation operator within parentheses
Backreferencing in regular expressions – sed and grep
Precedence in regular expressions
GAWK-specific regular expression operators
Matching whitespaces
Matching not whitespaces
Matching words (\w)
Matching non-words
Matching word boundaries
Matching at the beginning of a word
Matching at the end of a word
Matching not as a sub-string using
Matching a string as sub-string only using
Case-sensitive matching
Escape sequences
Summary
AWK Variables and Constants
Built-in variables in AWK
Field separator
Using a single character or simple string as a value of the FS
Using regular expressions as values of the FS
Using each character as a separate field
Using the command line to set the FS as -F
Output field separator
Record separator
Outputting the record separator
NR and NF
FILENAME
Environment variables in AWK
ARGC and ARGV
CONVFMT and OFMT
RLENGTH and RSTART
FNR
ENVIRON and SUBSEP
FIELD (POSITIONAL) VARIABLE ($0 and $n)
Environment variables in GAWK
ARGIND
ERRNO
FIELDWIDTHS
IGNORECASE
PROCINFO
String constants
Numeric constants
Conversion between strings and numbers
Summary
Working with Arrays in AWK
One-dimensional arrays
Assignment in arrays
Accessing elements in arrays
Referring to members in arrays
Processing arrays using loops
Using the split() function to create arrays
Delete operation in arrays
Multidimensional arrays
Summary
Printing Output in AWK
The print statement
Role of output separator in print statement
Pretty printing with the printf statement
Escape sequences for special character printing
Different format control characters in the format specifier
Format specification modifiers
Printing with fixed column width
Using the minus modifier (-) for left justification
Printing with fixed width – right justified
Using hash modifier (#)
Using plus modifier (+) for prefixing with sign/symbol
Printing with prefix sign/symbol
Dot precision as modifier
Positional modifier using integer constant followed by $ (N$):
Redirecting output to file
Redirecting output to a file (>)
Appending output to a file (>>)
Sending output on other commands using pipe (|)
Special file for redirecting output (/dev/null, stderr)
Closing files and pipes
Summary
AWK Expressions
AWK variables and constants
Arithmetic expressions using binary operators
Assignment expressions
Increment and decrement expressions
Relational expressions
Logical or Boolean expressions
Ternary expressions
Unary expressions
Exponential expressions
String concatenation
Regular expression operators
Operators' Precedence
Summary
AWK Control Flow Statements
Conditional statements
The if statement
if
If...else
The if...else...if statement
The switch statement (a GAWK-specific feature)
Looping statement
The while loop
do...while loop statement
The for loop statement
For each loop statement
Statements affecting flow control
Break usage
Usage of continue
Exit usage
Next usage
Summary
AWK Functions
Built-in functions
Arithmetic functions
The sin (expr) function
The cos (expr) function
The atan2 (x, y) function
The int (expr) function
The exp (expr) function
The log (expr) function
The sqrt (expr) function
The rand() function
The srand ([expr]) function
Summary table of built-in arithmetic functions
String functions
The index (str, sub) function
The length ( string ) function
The split (str, arr, regex) function
The substr (str, start, [ length ]) function
The sub (regex, replacement, string) function
The gsub (regex, replacement, string) function
The gensub (regex, replacement, occurrence, [ string ]) function
The match (string, regex) function
The tolower (string) function
The toupper (string) function
The sprintf (format, expression) function
The strtonum (string) function
Summary table of built-in string functions
Input/output functions
The close (filename [to/from]) function
The fflush ([ filename ]) function
The system (command) function
The getline command
Simple getline
Getline into a variable
Getline from a file
Using getline to get a variable from a file
Using getline to output into a pipe
Using getline to change the output into a variable from a pipe
Using getline to change the output into a variable from a coprocess
The nextfile() function
The time function
The systime() function
The mktime (datespec) function
The strftime (format, timestamp) function
Bit-manipulating functions
The and (num1, num2) function
The or (num1, num2) function
The xor (num1, num2) function
The lshift (val, count) function
The rshift (val, count) function
The compl (num) function
User-defined functions
Function definition and syntax
Calling user-defined functions
Controlling variable scope
Return statement
Making indirect function calls
Summary
GNU's Implementation of AWK – GAWK (GNU AWK)
Things you don't know about GAWK
Reading non-decimal input
GAWK's built-in command line debugger
What is debugging?
Debugger concepts
Using GAWK as a debugger
Starting the debugger
Set breakpoint
Removing the breakpoint
Running the program
Looking inside the program
Displaying some variables and data
Setting watch and unwatch
Controlling the execution
Viewing environment information
Saving the commands in file
Exiting the debugger
Array sorting
Sort array by values using asort( )
Sort array indexes using asorti()
Two-way inter-process communication
Using GAWK for network programming
TCP client and server (/inet/tcp)
UDP client and server ( /inet/udp )
Reading a web page using HttpService
Profiling
Summary
Practical Implementation of AWK
Working with one-liners for text processing and pattern matching with AWK
Selective printing of lines with AWK
Modifying line spacing in a file with AWK
Numbering and calculations with AWK
Selective deletion of certain lines in a file with AWK
String operation on selected lines with AWK
Array creation with AWK one-liner
Text conversion and substitution in files with AWK
One-liners for system administrators
Use case examples of pattern matching using AWK
Parsing web server (Apache/Nginx) log files
Understanding the Apache combined log format
Using AWK for processing different log fields
Identifying problems with the running website
Printing the top 10 request IP addresses with their GeoIP information
Counting and printing unique visits to a website
Real-time IP address lookup for requests
Converting text to HTML table
Converting decimal to binary
Renaming files in a directory with AWK
Printing a generated sequence of numbers in a specified columnate format
Transposing a matrix
Processing multiple files using AWK
Summary
Further reading
This book is for anyone who is inclined to learn text processing and data extraction in a Unix-like environment. Readers will gain sufficient practical knowledge to write AWK one-liners for extracting data and write clean and small AWK programs to solve complex problems. You will be able to automate the process of cleaning any raw data, remove any extra unnecessary stuff, and create a desired reportable output. Examples given in the book are easily reproducible and will help you better understand AWK.
Text processing is used in data mining, data cleaning of CSV, and other similar-format database files. System administrators use it in their shell scripts to automate tasks and filter out command output. It is used extensively with grep, egrep, fgrep, and regular expressions for parsing text files. Its use cases vary from industry to industry, such as telecom enterprises and business process organizations that deal with large CSV files for storing logs and other user information. They use AWK for cleaning and transforming the structure of data from one form to another.
AWK is one of the oldest and most powerful utilities, and it exists in all Unix and Linux distributions. It is used as a command-line utility for performing basic text processing operations and as a programming language when dealing with complex text processing. The best thing about AWK is that it is a data-driven language: you describe the data you wish to work with, and the set of actions you want to perform in the case of a pattern match. This book will provide you with a rundown, explaining the concepts to help you get started with AWK. We will cover every element of functions, variables, and more.
This book will enable the user to perform text filtering, text cleaning, and parsing of input in user-defined formats to create elegant reports. Our main focus throughout the book is on learning AWK with examples and small scripts to quickly solve user problems. The mission of this book is to make the reader comfortable and friendly with AWK.
The book is written from the beginner's point of view. It covers the basic to intermediate skills that are essential for text processing in a simple and effective manner. At the same time, there is a good amount of material that a seasoned AWK user will find interesting. It addresses a wide audience and will be useful to the following people:
Data scientists who need to extract and clean data for analysis
Developers who perform parsing of flat text files, HTML files, XML files, or CSV files
System administrators who parse log files for analysis
Any GNU/Linux hobbyist or enthusiast who likes to play with GNU/Linux filters for data manipulation
It is written in such a manner that any user with a basic familiarity with the GNU/Linux command line can start using it. The only requisite for it is to have a GNU/Linux box for practicing the stuff covered. This book begins with the essentials of text processing, that is, regular expressions (followed by the structure of the AWK program), variables, constants, functions, arrays, printing options, control flow of the program, and use of different operators to carry out various text processing and mining tasks.
For advanced users, Chapter 9, GNU's Implementation of AWK – GAWK (GNU AWK), covers GAWK implementation in networking, inter-process communication, and debugging. It is followed by practical examples for text processing and pattern matching. For system administrators, we have covered quick one-liners that they will find useful in their daily operations. This book has something for every learner who is working on GNU/Linux.
Chapter 1, Getting Started with AWK Programming, introduces AWK's essentials. In this chapter, you will learn how to set up an AWK environment on a Linux machine. You will run AWK programs in different ways with basic examples. This chapter lays the foundation for other chapters.
Chapter 2, Working with Regular Expressions, introduces regular expressions and explains how they are handled with AWK. You will get to know regular expressions with suitable examples.
Chapter 3, AWK Variables and Constants, focuses on the usage of AWK variables. You will understand how to use built-in and user-defined variables while writing AWK programs and command lines. You will also learn how string and numeric constants can be used to process different fields in data files.
Chapter 4, Working with Arrays in AWK, focuses on associative arrays in AWK. You will understand various features of associative arrays, such as these: indexes do not need to be in order, one can use either a string or a number as an array index, and array size can expand/shrink at runtime and is not statically defined.
Chapter 5, Printing Output in AWK, focuses on the print and printf statements and how they can be used efficiently to produce formatted reports. You will also learn how to use redirections in an AWK program.
Chapter 6, AWK Expressions, describes the expressions that build the core logic of a program in any programming language. The reader will learn how to create and use different types of expressions in the AWK language.
Chapter 7, AWK Control Flow Statements, covers the usage of different conditional statements to control the flow of AWK programs, with examples.
Chapter 8, AWK Functions, covers the different types of built-in functions available in AWK. In addition, you will learn the usage of user-defined functions to perform repetitive tasks in AWK.
Chapter 9, GNU's Implementation of AWK – GAWK (GNU AWK), covers the advanced features of GNU AWK, such as network communication, debugging, and inter-process communication in GAWK. These features are not present in the original AWK.
Chapter 10, Practical Implementation of AWK, illustrates various use cases of text processing. You will learn how a system administrator can use the AWK command line and scripts to automate repetitive tasks. Programmers and data scientists dealing with raw data in text files will learn how to clean raw data and produce formatted reports.
We have put our best efforts into making this book's code and content relevant for the larger audience working on GNU/Linux. All the examples covered in the book are based on the openSUSE Leap 42.3 Linux distribution with GAWK version 4.1.3 installed. You can use any Linux distribution with GAWK version 4.1.3 or above installed to practice the examples. GAWK is GNU's open source implementation of AT&T's original AWK.
The only requirement for this book is the GAWK utility. Users can install the Windows variant of GAWK on their system if they want to practice the examples, but we strongly recommend using any Linux distribution for this purpose. Windows and OS X users who would like to practice in a virtual environment can use VMware or VirtualBox to set up their favorite Linux distribution and then use AWK to execute the examples. For beginners, we have covered the installation of AWK on popular GNU/Linux distributions using package management and source code.
A basic understanding of GNU/Linux operating systems and familiarity with any text editor such as emacs, vi or nano are required. An understanding of shell scripting will be an added advantage to the reader.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packtpub.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-AWK-Programming. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles. Here is an example: "We can check that using the which command, which will return the absolute path of AWK on our system."
Any command-line input or output is written as follows:
[ shiwang@linux ~ ] $ sudo apt-get update -y
[ shiwang@linux ~ ] $ sudo apt-get install gawk -y
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "It was updated and replaced in the mid-1980s with an enhanced version called New AWK (NAWK)."
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Welcome to your journey with AWK programming. We all interact with data in our daily lives in one way or another. Retrieving the desired, useful information from this data can seem like a difficult task; however, with the correct tools and proper knowledge of how to handle them, it is really not that difficult. This book will teach you how to efficiently handle one of the best of these tools, known as AWK. It is a standard feature of Unix-like operating systems for retrieving information from raw data. It will help you to understand basic data-processing concepts in a user-friendly way.
This chapter is designed to give you a kickstart for writing your own simple AWK programs. Throughout the book, we will explain AWK, and work with useful and interesting examples to develop your problem-solving skills in AWK.
In this chapter, we will cover the following:
An overview of the AWK programming language
Different installation methods of AWK on the Linux environment
Understanding the AWK workflow
Learning how to create and run basic AWK programs in multiple ways
Working with sample data files with a simple usage of AWK
Understanding different AWK options
In this section, we will explore the AWK philosophy and different types of AWK that exist today, starting from its original implementation in 1977 at AT&T's Laboratories, Inc. We will also look at the various implementation areas of AWK in data science today.
AWK is an interpreted programming language designed for text processing and report generation. It is typically used for data manipulation, such as searching for items within data, performing arithmetic operations, and restructuring raw data for generating reports in most Unix-like operating systems. Using AWK programs, one can handle repetitive text-editing problems with very simple and short programs. AWK is a pattern-action language; it searches for patterns in a given input and, when a match is found, it performs the corresponding action. The pattern can be made of strings, regular expressions, comparison operations on numbers, fields, variables, and so on. AWK reads the input files and splits each input line of the file into fields automatically.
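The pattern-action idea described above can be sketched with a short example. The sample records (name, department, salary) are made up for illustration and are not taken from the book's data files; the sketch only assumes that an awk command is available on the PATH.

```shell
# Hypothetical sample data: name, department, salary (whitespace-separated).
data='Jack Sales 3000
Amy IT 4200
Ravi Sales 3500'

# Pattern: a comparison on the third field. Action: print fields 1 and 3.
# AWK splits each input line into $1, $2, $3 automatically.
high_paid=$(printf '%s\n' "$data" | awk '$3 > 3200 { print $1, $3 }')

# Pattern: a regular expression. Action: print the first field.
sales_staff=$(printf '%s\n' "$data" | awk '/Sales/ { print $1 }')

printf 'Salary above 3200:\n%s\n' "$high_paid"
printf 'Sales department:\n%s\n' "$sales_staff"
```

Each rule has the form pattern { action }: the action runs only for the input lines on which the pattern matches.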
AWK has most of the well-designed features that every programming language should contain. Its syntax particularly resembles that of the C programming language. It is named after its original three authors:
Alfred V. Aho
Peter J. Weinberger
Brian W. Kernighan
AWK is a very powerful, elegant, and simple tool that every person dealing with text processing should be familiar with.
The AWK language was originally implemented as the AWK utility on Unix. Today, most Linux distributions provide the GNU implementation of AWK (GAWK), with awk created as a symlink to the GAWK binary. The AWK utility can be categorized into the following three types, depending upon the type of interpreter it uses for executing AWK programs:
AWK: This is the original AWK interpreter available from AT&T Laboratories. However, it is not used much nowadays and hence it might not be well-maintained. Its limitation is that it splits a line into a maximum of 99 fields. It was updated and replaced in the mid-1980s with an enhanced version called New AWK (NAWK).
NAWK: This is AT&T's latest development on the AWK interpreter. It is well-maintained by one of the original authors of AWK, Dr. Brian W. Kernighan.
GAWK: This is the GNU project's implementation of the AWK programming language. Most GNU/Linux distributions ship GAWK by default, and hence it is the most popular version of AWK. The GAWK interpreter is fully compatible with AWK and NAWK.
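Because the awk command is often a symlink to a particular implementation, one way to see which interpreter is actually in use is to resolve that link. The following is a minimal sketch, assuming a Linux system where awk is installed and where readlink -f (from GNU coreutils) is available; the variable names are illustrative only.

```shell
# Locate the awk command on the PATH (empty if awk is not installed).
awk_path=$(command -v awk)

# Follow any chain of symlinks to the real binary.
# readlink -f is assumed to be available (GNU coreutils).
awk_real=$(readlink -f "$awk_path")

printf 'awk command: %s\n' "$awk_path"
printf 'real binary: %s\n' "$awk_real"
```

On a system where GAWK provides awk, the second line would be expected to end in gawk, but the exact path depends on the distribution.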
Beyond these, we also have other, less popular, AWK interpreters and translators, mentioned as follows. These variants are useful in operations when you want to translate your AWK program to C, C++, or Perl:
MAWK: Michael Brennan's interpreter for AWK.
TAWK: Thompson Automation's interpreter/compiler/Microsoft Windows DLL for AWK.
MKSAWK: Mortice Kern Systems' interpreter/compiler for AWK.
AWKCC: An AWK translator to C (might not be well-maintained).
AWKC++: Brian Kernighan's AWK translator to C++ (experimental). It can be downloaded from: https://9p.io/cm/cs/who/bwk/awkc++.ps.
AWK2C: An AWK translator to C. It uses GNU AWK libraries extensively.
A2P: An AWK translator to Perl. It comes with Perl.
AWKA: Yet another AWK translator to C (comes with the library), based on MAWK. It can be downloaded from: http://awka.sourceforge.net/download.html.
AWK is simpler than most other text-processing utilities and is available by default on Unix-like operating systems. Some people might say that Perl is a superior choice for text processing, as AWK is functionally a subset of Perl; however, the learning curve for Perl is steeper than that of AWK. AWK programs tend to be smaller and hence quicker to execute. Anybody who knows the Linux command line can start writing AWK programs in no time. Here are a few use cases of AWK:
Text processing
Producing formatted text reports/labels
Performing arithmetic operations on fields of a file
Performing string operations on different fields of a file
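Several of these use cases can be combined in one short program. The sketch below uses made-up order data (item, quantity, unit price) rather than the book's sample files; it applies a string operation to the first field, does arithmetic on the other two, and prints a formatted report with a running total.

```shell
# Hypothetical order data: item, quantity, unit price.
sales='pens 12 0.50
notebooks 4 2.25
staplers 1 7.00'

report=$(printf '%s\n' "$sales" | awk '
    {
        cost = $2 * $3          # arithmetic on fields
        total += cost           # running total across all records
        # string operation (toupper) plus fixed-width formatted output
        printf "%-10s %6.2f\n", toupper($1), cost
    }
    END { printf "%-10s %6.2f\n", "TOTAL", total }')

printf '%s\n' "$report"
```

The END block runs once after the last input line, which makes it a natural place for summary lines such as totals.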
Programs written in AWK are smaller than they would be in other higher-level languages for similar text processing operations. AWK programs are interpreted directly on a GNU/Linux Terminal and thus avoid the compile and debug cycle of software development in other languages.
This section describes how to set up the AWK environment on your GNU/Linux system, and we'll also discuss the workflow of AWK. Then, we'll look at different methods for executing AWK programs.
All the examples in this book were run on a Linux distribution (openSUSE Leap 42.3). To practice the examples discussed in this book, you need GNU AWK version 4.1.3 or above installed on your system. Although there won't be drastic changes if you use earlier versions, we still recommend that you use the same version to follow along.
Generally, AWK is installed by default on most GNU/Linux distributions. Using the which command, you can check whether it is installed on your system or not. If AWK is not installed on your system, you can install it in one of two ways:
Using the package manager of the corresponding GNU/Linux system
Compiling from the source code
Let's take a look at each method in detail in the following sections.
Different flavors of GNU/Linux distribution have different package-management utilities. If you are using a Debian-based GNU/Linux distribution, such as Ubuntu, Mint, or Debian, then you can install it using the Advanced Package Tool (APT) package manager, as follows:
[ shiwang@linux ~ ] $ sudo apt-get update -y
[ shiwang@linux ~ ] $ sudo apt-get install gawk -y
Similarly, to install AWK on an RPM-based GNU/Linux distribution, such as Fedora, CentOS, or RHEL, you can use the Yellowdog Updater, Modified (YUM) package manager, as follows:
[ root@linux ~ ] # yum update -y
[ root@linux ~ ] # yum install gawk -y
To install AWK on openSUSE, you can use the zypper command-line package-management utility, as follows:
[ root@linux ~ ] # zypper update -y
[ root@linux ~ ] # zypper install gawk -y
Once installation is finished, make sure AWK is accessible through the command line. We can check that using the which command, which will return the absolute path of AWK on our system:
[ root@linux ~ ] # which awk
/usr/bin/awk
You can also use awk --version to find the AWK version on your system:
[ root@linux ~ ] # awk --version
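The same check can be scripted. The following is a minimal sketch, assuming a POSIX shell; it uses command -v, a portable alternative to which, and the variable name awk_path is only illustrative:

```shell
# Look up awk on the PATH; command -v prints the resolved location
# and returns a non-zero status when the command cannot be found.
if awk_path=$(command -v awk); then
    echo "awk found at: $awk_path"
else
    echo "awk is not installed; install it with your package manager" >&2
fi
```

The reported path depends on your distribution and on how AWK was installed, so it may differ from the /usr/bin/awk shown earlier.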
Like every other open source utility, the GNU AWK source code is freely available for download as part of the GNU project. Previously, you saw how to install AWK using the package manager; now, you will see how to install AWK by compiling from its source code on the GNU/Linux distribution. The following steps are applicable to most of the GNU/Linux software for installation:
Download the source code from a GNU project FTP site. Here, we will use the wget command-line utility to download it; however, you are free to choose any other program you feel comfortable with, such as curl:
[ shiwang@linux ~ ] $ wget http://ftp.gnu.org/gnu/gawk/gawk-4.1.3.tar.xz
Extract the downloaded source code:
[ shiwang@linux ~ ] $ tar xvf gawk-4.1.3.tar.xz
Change your working directory to the extracted source tree and execute the configure script to configure GAWK for the working environment of your system:
[ shiwang@linux ~ ] $ cd gawk-4.1.3 && ./configure
Once the configure command completes its execution successfully, it will generate the Makefile. Now, compile the source code by executing the make command:
[ shiwang@linux ~ ] $ make
Type make install to install the programs and any data files and documentation. When installing into a prefix owned by root, it is recommended that the package be configured and built as a regular user, and that only the make install phase be executed with root privileges:
[ shiwang@linux ~ ] $ sudo make install
Upon successful execution of these five steps, you have compiled and installed AWK on your GNU/Linux distribution. You can verify this by executing the which awk or awk --version command in the Terminal:
[ root@linux ~ ] # which awk
/usr/bin/awk
Now you have a working AWK/GAWK installation and we are ready to begin AWK programming, but before that, our next section describes the workflow of the AWK interpreter.
Having a basic knowledge of the AWK interpreter workflow will help you to better understand AWK and will result in more efficient AWK program development. Hence, before getting your hands dirty with AWK programming, you need to understand its internals. The AWK workflow can be summarized as shown in the following figure:
Let's take a look at each operation:
READ OPERATION: AWK reads a line from the input stream (a file, a pipe, or stdin) and stores it in memory. It then splits this text input into records and fields:
Records: An AWK record is a single, continuous data input that AWK works on. Records are bounded by a record separator, whose value is stored in the RS variable. The default value of RS is a newline character, so the lines of input are considered records by the AWK interpreter. Records are read continuously until the end of the input is reached.
Figure 1.2 shows how input data is broken into records and then how each record is split into fields:
Fields: Each record can further be broken down into individual chunks called fields. Like records, fields are bounded by a separator, whose value is stored in the FS variable. The default field separator is any amount of whitespace, including tab and space characters. So by default, lines of input are further broken down into individual words separated by whitespace. You can refer to the fields of a record by a field number, beginning with 1. The last field in each record can be accessed by its number or as $NF, where the NF special variable contains the number of fields in the current record, as shown in Figure 1.3:
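To make the splitting into records and fields concrete, here is a small sketch. The two input lines are made up for illustration, and NR (the built-in record counter) is used in addition to the NF variable described above:

```shell
# Two records (lines); fields are split on whitespace by default.
# NR = current record number, NF = number of fields in the record,
# $1 = first field, $NF = last field.
fields_out=$(printf 'alpha beta gamma\none two\n' |
    awk '{ print NR ": " NF " fields, first=" $1 ", last=" $NF }')
printf '%s\n' "$fields_out"
# 1: 3 fields, first=alpha, last=gamma
# 2: 2 fields, first=one, last=two
```

Note that $NF gives the last field even though the two records have a different number of fields.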
EXECUTE OPERATION: All AWK commands are applied sequentially on the input (records and fields). By default, AWK executes commands on each record/line. This behavior of AWK can be restricted by the use of patterns.
REPEAT OPERATION: The process of read and execute is repeated until the end of the file is reached.
The following flowchart depicts the workflow of the AWK interpreter:
AWK programs are sequences of one or more patterns followed by action statements. These pattern-action statements are separated by newlines. Both actions (AWK commands) and patterns (search patterns) are optional, hence we use { } to distinguish them:
/ search pattern / { action / awk-commands }
/ search pattern / { action / awk-commands }
AWK reads each input line one after the other, and searches for matches of the given pattern. If the current input line matches the given pattern, a corresponding action is taken. Then, the next input line is read and the matching of patterns starts again. This process continues until all input is read.
In AWK syntax, we can omit either patterns or actions (but not both) in a single pattern-action statement. Where a search pattern has no action statement (AWK command), each line for which a pattern matches is printed to the output.
Multiple AWK commands on a single line are separated by a semicolon (;).
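As a brief illustration of semicolon-separated commands, the following sketch puts two statements in one action. The input line and the variable name sum are made up for this example:

```shell
# Two statements in a single action, separated by a semicolon:
# the first computes a value, the second prints it.
semi_out=$(printf '3 4\n' | awk '{ sum = $1 + $2; print "sum=" sum }')
printf '%s\n' "$semi_out"
# sum=7
```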
A semicolon can be put at the end of any statement.
Before proceeding, let's create an empinfo.txt file for practice. Each line in the file contains the name of the employee, their phone number, email address, job profile, salary in USD, and working days in a week:
Jack 9857532312 [email protected] hr 2000 5
Jane 9837432312 [email protected] hr 1800 5
Eva 8827232115 [email protected] lgs 2100 6
amit 9911887766 [email protected] lgs 2350 6
Julie 8826234556 [email protected] hr 2500 5
An awk command can be given a pattern only, with no action. In the following example, all lines of the empinfo.txt file are processed, and those that contain the Jane pattern are printed:
$ awk '/Jane/' empinfo.txt
Jane 9837432312 [email protected] hr 1800 5
An awk command can also be given an action only. If you omit the pattern and give the action statement (AWK commands), then the given action is performed on all input lines, for example:
$ awk '{ print $1 }' empinfo.txt
Jack
Jane
Eva
amit
Julie
In the given example, all employee names are printed on the screen, because $1 represents the first field of each input line.
We can print each input record of the input file in multiple ways, as shown in the following example. All the given examples will produce the same result by printing all the input records of the input file.
In our first example, we specify the empty pattern without any action statement to print each input record of the input file, as follows:
$ awk '//' empinfo.txt
In this example, we specify the print action statement only, without giving any pattern for matching, and print each input record of the input file, as follows:
$ awk '{ print }' empinfo.txt
In this example, we specify the $0 default variable, along with the print action statement, to print each input record of the input file, as follows:
$ awk '{ print $0 }' empinfo.txt
In this example, we specify the empty pattern along with the print action statement to print each input record of the input file, as follows:
$ awk '//{ print }' empinfo.txt
All of the given examples perform the basic printing operation using AWK. On execution of any of the preceding examples, we will get the following output:
Jack 9857532312 [email protected] hr 2000 5
Jane 9837432312 [email protected] hr 1800 5
Eva 8827232115 [email protected] lgs 2100 6
amit 9911887766 [email protected] lgs 2350 6
Julie 8826234556 [email protected] hr 2500 5
AWK contains two special keywords, BEGIN and END, for patterns where an action is required. Both of them are optional and are used without slashes. They don't match any input lines.
The BEGIN block is executed in the beginning, before the first input line of the first input file is read. This block is executed once only when the AWK program is started. It is frequently used to initialize the variables or to change the value of the AWK built-in variables, such as FS. The syntax of the BEGIN block is as follows:
BEGIN { action / awk-commands }
The body block is the same pattern-action block that we discussed at the beginning of the chapter. The syntax of the body block is as follows:
/ search pattern / { action / awk-commands }
In the body block, AWK commands are applied by default on each input line; however, we can restrict this behavior with the help of patterns. There are no keywords for the body block.
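The following sketch shows a BEGIN block changing FS before the body block runs, with a pattern restricting the body to some records. The colon-separated input is hypothetical sample data, not taken from a real file:

```shell
# BEGIN runs once before any input is read, so FS is already ":"
# when the first record is split into fields.
# The body rule then runs only for records whose third field exceeds 100.
begin_out=$(printf 'root:x:0\nalice:x:1000\nbob:x:1001\n' |
    awk 'BEGIN { FS = ":" } $3 > 100 { print $1 }')
printf '%s\n' "$begin_out"
# alice
# bob
```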
The END block is executed after the last input line of the last input file is read. This block is executed at the end and is generally used for rendering totals, averages, and for processing data and figures that were read in the various input records. The syntax of the END block is as follows:
END { action / awk-commands }
Here is an example of the usage of the BEGIN and END blocks:
$ awk 'BEGIN { print "==Employee Info==" } # begin block
{ print } # body block
END { print "==ends here==" }' empinfo.txt # end block
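On executing the code, we get the following result:
==Employee Info==
Jack 9857532312 [email protected] hr 2000 5
Jane 9837432312 [email protected] hr 1800 5
Eva 8827232115 [email protected] lgs 2100 6
amit 9911887766 [email protected] lgs 2350 6
Julie 8826234556 [email protected] hr 2500 5
==ends here==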
In pattern-action statements, the pattern is something that determines when an action is to be executed. We can summarize the usage of patterns as follows:
BEGIN { statements }: The statements are executed once before any input has been read.
END { statements }: The statements are executed once after all input has been read.
expression { statements }: The statements are executed at each input line where the expression is true, that is, non-zero or non-null.
/ regular expression / { statements }: The statements are executed at each input line that contains a string matched by the regular expression.
compound pattern { statements }: A compound pattern combines expressions with && (AND), || (OR), ! (NOT), and parentheses; the statements are executed at each input line where the compound pattern is true.
pattern 1, pattern 2 { statements }: A range pattern matches each input line from a line matched by pattern 1 to the next line matched by pattern 2, inclusive; the statements are executed at each matching line. Here, the patterns in a range can be regular expressions or other expressions, such as a comparison on the record number (NR == 2).
BEGIN and END do not combine with other patterns. A range pattern cannot be part of any other pattern. BEGIN and END are the only patterns that require an action.
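The pattern types summarized above can be tried together in one program. This is only a sketch: the three-column data (name, department, salary) is a trimmed-down, hypothetical version of the employee file, and the counter j and the output labels are made up for illustration:

```shell
# Hypothetical three-column sample: name, department, salary.
data='Jack hr 2000
Jane hr 1800
Eva lgs 2100
amit lgs 2350
Julie hr 2500'

pattern_out=$(printf '%s\n' "$data" | awk '
    BEGIN                   { print "== report ==" }               # once, before input
    $3 > 2200               { print "high:", $1 }                  # expression
    /^J/                    { j++ }                                # regular expression
    $2 == "hr" && $3 < 2000 { print "hr under 2000:", $1 }         # compound pattern
    NR == 2, NR == 3        { print "range:", $1 }                 # range pattern
    END                     { print "names starting with J:", j }  # once, after input
')
printf '%s\n' "$pattern_out"
```

Every rule is tested against each record in the order in which the rules are written, so the output lines are grouped by record rather than by rule.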
There are different ways to run an AWK program. For a short program, we can directly execute AWK commands on the Terminal, and for long AWK programs, we generally create an AWK program script or source file. In this section, we will discuss different methods of executing AWK programs.
This is the most-used method of running AWK programs. In this method, the AWK commands are given in single quotes as the first argument on the awk command line, as follows:
$ awk 'program' inputfile1 inputfile2 ... inputfileN
Here, program refers to the sequence of pattern-action statements discussed earlier. In this format, the AWK interpreter is invoked from the shell or Terminal to process the input lines of the given files. The single quotes around program instruct the shell not to interpret characters that are special to it, such as $, and to pass the entire program to AWK as a single argument. They also allow the program to continue over more than one line.
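Because the single quotes hide $ from the shell, a shell variable cannot simply be written inside the program. One common approach, sketched here, is to hand the value over with the standard -v option of awk; the shell variable column and the AWK variable n are hypothetical names chosen for this example:

```shell
# Inside single quotes the shell leaves $n alone, so AWK sees it as
# "the field whose number is stored in n". The shell value is passed
# in explicitly with -v instead of breaking out of the quotes.
column=2    # hypothetical shell variable
quote_out=$(printf 'x y z\n' | awk -v n="$column" '{ print $n }')
printf '%s\n' "$quote_out"
# y
```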
The format used to call the AWK program from inside of a shell script is the same one we used on the Unix/Linux command line. For example:
$ awk '{ print }' empinfo.txt /etc/passwd
The preceding command will print every line of the empinfo.txt file, followed by the lines of the /etc/passwd file on your system, as follows:
Jack 9857532312 [email protected] hr 2000 5
Jane 9837432312 [email protected] hr 1800 5
Eva 8827232115 [email protected] lgs 2100 6
amit 9911887766 [email protected] lgs 2350 6
Julie 8826234556 [email protected] hr 2500 5
at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
avahi:x:481:480:User for Avahi:/run/avahi-daemon:/bin/false
avahi-autoipd:x:493:493:User for Avahi IPv4LL:/var/lib/avahi-autoipd:/bin/false
bin:x:1:1:bin:/bin:/bin/bash
daemon:x:2:2:Daemon:/sbin:/bin/bash
games:x:12:100:Games account:/var/games:/bin/bash
man:x:13:62:Manual pages viewer:/var/cache/man:/bin/bash
messagebus:x:499:499:User for D-Bus:/run/dbus:/bin/false
......................
...................... till last line in /etc/passwd
Filter commands can take their input from stdin instead of reading it from the file. We can omit giving input filenames at the command line while executing the awk program, and simply call it from the Terminal as:
$ awk 'program'
In this form, AWK applies the program to whatever you type on the standard input, that is, the Terminal, until you signal end-of-file by pressing Ctrl + D, for example:
$ awk '$2==50{ print }'
apple 50
apple 50
banana 60
litchi 50
litchi 50
mango 55
grapes 40
pineapple 60
........
Each line that contains 50 in the second field is printed, hence it appears twice on the Terminal: once as typed and once as AWK's output. This functionality can be used to experiment with AWK; all you need to do is type your AWK commands first, then type the data, and see what happens next. The only thing you have to take care of here is to enclose your AWK commands in single quotes on the command line. This prevents the shell expansion of special characters, such as $, and also allows your program to be longer than one line.
Here is one more example in which we take input from the pipe and process it with the AWK command:
$ echo -e "jack \nsam \ntarly \njerry" | awk '/sam/{ print }'
On executing this code, you get the following result:
sam
We will be using examples of executing the AWK command line on the Terminal throughout the book for explaining various topics. This type of operation is performed when the program (AWK commands) is short (up to a few lines).
When AWK programs are long, it is more convenient to put them in a separate file. Putting AWK programs in a file reduces errors and retyping. Its syntax is as follows:
$ awk -f source_file inputfile1 inputfile2 ............inputfileN
The -f option tells the AWK utility to read the AWK commands from source_file. Any filename can be used in place of source_file. For example, we can create a cmd.awk text file containing the AWK commands, as follows:
$ vi cmd.awk
BEGIN { print "***Emp Info***" }
{ print }
Now, we instruct AWK to read the commands from the cmd.awk file and perform the given actions:
$ awk -f cmd.awk empinfo.txt
On executing the preceding command, we get the following result:
***Emp Info***
Jack 9857532312 [email protected] hr 2000 5
Jane 9837432312 [email protected] hr 1800 5
Eva 8827232115 [email protected] lgs 2100 6
amit 9911887766 [email protected] lgs 2350 6
Julie 8826234556 [email protected] hr 2500 5
It does the same thing as this:
$ awk 'BEGIN { print "***Emp Info***" } { print }' empinfo.txt
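As a further sketch of the -f method, the following writes a small program to a file and feeds it input from a pipe instead of a named file. The filename count.awk, the scratch directory, and the counter n are illustrative assumptions:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Write the AWK program to its own file (the name count.awk is illustrative).
cat > "$workdir/count.awk" <<'EOF'
# Count the input records and report the total at the end.
{ n++ }
END { print "records:", n }
EOF

# No input file is named, so AWK reads the records from standard input.
file_out=$(printf 'a\nb\nc\n' | awk -f "$workdir/count.awk")
printf '%s\n' "$file_out"
# records: 3

rm -r "$workdir"
```

Keeping the program in a file also means it can contain comments and quotes without any concern for shell quoting.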
We can write self-contained AWK scripts to execute AWK commands, just as we write shell scripts to execute shell commands. We create the AWK script by starting it with #!, followed by the absolute path of the AWK interpreter and the -f option. The line beginning with #! tells the operating system to run the interpreter that immediately follows it, with the given argument and the full argument list of the executed program. For example, we can copy the cmd.awk file to emp.awk and update it as follows:
$ vi emp.awk
#!/usr/bin/awk -f
BEGIN { print "***Emp Info***" }
{ print }
Give this file executable permissions (with the chmod utility), then simply run ./emp.awk empinfo.txt at the shell, and the system will run AWK as if you had typed awk -f emp.awk empinfo.txt:
$ chmod +x emp.awk
$ ./emp.awk empinfo.txt
***Emp Info***
Jack 9857532312 [email protected] hr 2000 5
Jane 9837432312 [email protected] hr 1800 5
Eva 8827232115 [email protected] lgs 2100 6
amit 9911887766 [email protected] lgs 2350 6
Julie 8826234556 [email protected] hr 2500 5
Self-contained executable AWK scripts are useful when you want to write AWK programs that users can invoke without having to know they were written in AWK.
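The #! line needs the absolute path of the interpreter, and that path is not /usr/bin/awk on every system. The following sketch builds the #! line from the location reported by command -v; the script name first-field.awk and the scratch directory are hypothetical, and it assumes the scratch directory allows files to be executed:

```shell
workdir=$(mktemp -d)
script="$workdir/first-field.awk"   # hypothetical script name

# Build the #! line from wherever awk actually lives on this system,
# then append a one-rule program that prints the first field.
{
    printf '#!%s -f\n' "$(command -v awk)"
    printf '{ print $1 }\n'
} > "$script"
chmod +x "$script"

# The script is now invoked like any other command.
script_out=$(printf 'hello world\n' | "$script")
printf '%s\n' "$script_out"
# hello

rm -r "$workdir"
```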
