Learning AWK Programming - Shiwang Kalkhanda - E-Book

Shiwang Kalkhanda

Description

AWK is one of the oldest and most powerful utilities that exist in all Unix and Unix-like distributions. It is used as a command-line utility when performing basic text-processing operations, and as a programming language when dealing with complex text-processing and mining tasks. With this book, you will gain the expertise required to practice advanced AWK programming in real-life examples.
The book starts off with an introduction to AWK essentials. You will then be introduced to regular expressions, AWK variables and constants, arrays, AWK functions, and more. The book then delves into more complex tasks, such as printing formatted output and control flow statements, and covers the advanced features of GNU's implementation of AWK (GAWK), such as network communication, debugging, and inter-process communication, which are not easily possible with the original AWK.
By the end of this book, the reader will have worked on the practical implementation of text processing and pattern matching using AWK to perform routine tasks.

Page count: 449

Year of publication: 2018


Learning AWK Programming
A fast, and simple cutting-edge utility for text-processing on the Unix-like environment
Shiwang Kalkhanda
BIRMINGHAM - MUMBAI

Learning AWK Programming

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Viraj Madhav
Content Development Editor: Cheryl Dsa
Technical Editor: Suwarna Patil
Copy Editors: Vikrant Phadkay, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade

First published: March 2018

Production reference: 1210318

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78839-103-0

www.packtpub.com

To my father, the late Ranvir Singh
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Shiwang Kalkhanda (RHCA, RHCSS, MCSE) is a Linux geek and consultant with expertise in the automation of infrastructure deployment and management. He has more than 10 years of experience in security, system, and network administration and training on open source tech. For most of his automation work, he uses shell scripting, Python & Go. He holds a master's and bachelor's degree in computer application. He enjoys traveling and spending time with his kids.

I thank my parents, the late Ranvir Singh and Vijay Lata, for their unconditional love and support throughout my life. To my beautiful wife, Reetu, for making me complete and understanding me better than myself. To my son, Ranvijay, whose precious time as a father I stole to complete this book. To my brother, Pranjal, and my friends Sanjay Bandyopadhyay, Tej Pratap Singh, Rajneesh Pandey for their constant support.

About the reviewers

John C Kennedy has worked with UNIX and Linux since 1998. He has worked with Nagios as a monitoring tool for much of the past 5 years.

He has been reviewing and tech-editing books in his spare time since 2001 and has about 20 open source books to his credit. He believes the best part of reviewing is that he learns something from every book he works on.

He was born in the USA and grew up in Northern Virginia, USA. He spent some time in the US Air Force and has lived in Germany and the UK. He has been married to Michele since 1994 and has two children, Denise and Kieran. He lives in Virginia.

I would like to thank my family, including my nephews, Aiden and Mason, and my niece, Harriet, for supporting all the silly things I do and for giving me the time to work on this.

Marco Ippolito is an Italian software engineer working as director of software development for Imagining IT. Marco completed his postgraduate in software engineering in Oxford and has worked for large corporations such as Google, Oracle, Intel, HP, and Dell, as well as for start-ups such as Platform.sh. He can be reached at [email protected]. He has experience working in teams speaking Italian, English, Spanish, Brazilian Portuguese, German, and French, remotely or on-site.

Doug Ortiz is an experienced enterprise cloud, big data, data analytics, and solutions architect who has designed, developed, re-engineered, and integrated enterprise solutions. His other expertise is in Amazon Web Services, Azure, Google Cloud, business intelligence, Hadoop, Spark, NoSQL databases, and SharePoint, to mention a few.

He is the founder of Illustris, LLC and is reachable at [email protected].

Huge thanks to my wonderful wife, Milla, as well as Maria, Nikolay, and our children for all their support.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Learning AWK Programming

Dedication

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

Getting Started with AWK Programming

AWK programming language overview

What is AWK?

Types of AWK

When and where to use AWK

Getting started with AWK

Installation on Linux

Using the package manager

Compiling from the source code

Workflow of AWK

Action and pattern structure of AWK

Example data file

Pattern-only statements

Action-only statements

Printing each input line/record

Using the BEGIN and END blocks construct

The BEGIN block

The body block

The END block

Patterns

Actions

Running AWK programs

AWK as a Unix command line

AWK as a filter (reading input from the Terminal)

Running AWK programs from the source file

AWK programs as executable script files

Extending the AWK command line on multiple lines

Comments in AWK

Shell quotes with AWK

Data files used as examples in this book

Some simple examples with default usage

Multiple rules with AWK

Using standard input with names in AWK

AWK standard options

Standard command-line options

The -F option – field separator

The -f option (read source file)

The -v option (assigning variables)

GAWK-only options

The --dump-variables option (AWK global variables)

The --profile option (profiling)

The --sandbox option

The -i option (including other files in your program)

Include other files in the GAWK program (using @include)

The -V option

Summary

Working with Regular Expressions

Introduction to regular expressions

What is a regular expression?

Why use regular expressions?

Using regular expressions with AWK

Regular expressions as string-matching patterns with AWK

Basic regular expression construct

Understanding regular expression metacharacters

Quoted metacharacter

Anchors

Matching at the beginning of a string

Matching at the end of a string

Dot

Brackets expressions

Character classes

Named character classes (POSIX standard)

Complemented bracket expressions

Complemented character classes

Complemented named character classes

Alternation operator

Unary operator for repetition

Closure

Positive closure

Zero or one

Repetition ranges with interval expressions

A single number in brackets

A single number followed by a comma in brackets

Two numbers in brackets

Grouping using parentheses

Concatenation using alternation operator within parentheses

Backreferencing in regular expressions – sed and grep

Precedence in regular expressions

GAWK-specific regular expression operators

Matching whitespaces

Matching not whitespaces

Matching words (\w)

Matching non-words

Matching word boundaries

Matching at the beginning of a word

Matching at the end of a word

Matching not as a sub-string using

Matching a string as sub-string only using

Case-sensitive matching

Escape sequences

Summary

AWK Variables and Constants

Built-in variables in AWK

Field separator

Using a single character or simple string as a value of the FS

Using regular expressions as values of the FS

Using each character as a separate field

Using the command line to set the FS as -F

Output field separator

Record separator

Outputting the record separator

NR and NF

FILENAME

Environment variables in AWK

ARGC and ARGV

CONVFMT and OFMT

RLENGTH and RSTART

FNR

ENVIRON and SUBSEP

FIELD (POSITIONAL) VARIABLE ($0 and $n)

Environment variables in GAWK

ARGIND

ERRNO

FIELDWIDTHS

IGNORECASE

PROCINFO

String constants

Numeric constants

Conversion between strings and numbers

Summary

Working with Arrays in AWK

One-dimensional arrays

Assignment in arrays

Accessing elements in arrays

Referring to members in arrays

Processing arrays using loops

Using the split() function to create arrays

Delete operation in arrays

Multidimensional arrays

Summary

Printing Output in AWK

The print statement

Role of output separator in print statement

Pretty printing with the printf statement

Escape sequences for special character printing

Different format control characters in the format specifier

Format specification modifiers

Printing with fixed column width

Using the minus modifier (-) for left justification

Printing with fixed width – right justified

Using hash modifier (#)

Using plus modifier (+) for prefixing with sign/symbol

Printing with prefix sign/symbol

Dot precision as modifier

Positional modifier using integer constant followed by $ (N$):

Redirecting output to file

Redirecting output to a file (>)

Appending output to a file (>>)

Sending output on other commands using pipe (|)

Special file for redirecting output (/dev/null, stderr)

Closing files and pipes

Summary

AWK Expressions

AWK variables and constants

Arithmetic expressions using binary operators

Assignment expressions

Increment and decrement expressions

Relational expressions

Logical or Boolean expressions

Ternary expressions

Unary expressions

Exponential expressions

String concatenation

Regular expression operators

Operators' Precedence

Summary

AWK Control Flow Statements

Conditional statements

The if statement

if

If...else

The if...else...if statement

The switch statement (a GAWK-specific feature)

Looping statement

The while loop

do...while loop statement

The for loop statement

For each loop statement

Statements affecting flow control

Break usage

Usage of continue

Exit usage

Next usage

Summary

AWK Functions

Built-in functions

Arithmetic functions

The sin (expr) function

The cos (expr) function

The atan2 (y, x) function

The int (expr) function

The exp (expr) function

The log (expr) function

The sqrt (expr) function

The rand() function

The srand ([expr]) function

Summary table of built-in arithmetic functions

String functions

The index (str, sub) function

The length ( string ) function

The split (str, arr, regex) function

The substr (str, start, [ length ]) function

The sub (regex, replacement, string) function

The gsub (regex, replacement, string) function

The gensub (regex, replacement, occurrence, [ string ]) function

The match (string, regex) function

The tolower (string) function

The toupper (string) function

The sprintf (format, expression) function

The strtonum (string) function

Summary table of built-in string functions

Input/output functions

The close (filename [to/from]) function

The fflush ([ filename ]) function

The system (command) function

The getline command

Simple getline

Getline into a variable

Getline from a file

Using getline to get a variable from a file

Using getline to output into a pipe

Using getline to change the output into a variable from a pipe

Using getline to change the output into a variable from a coprocess

The nextfile() function

The time function

The systime() function

The mktime (datespec) function

The strftime (format, timestamp) function

Bit-manipulating functions

The and (num1, num2) function

The or (num1, num2) function

The xor (num1, num2) function

The lshift (val, count) function

The rshift (val, count) function

The compl (num) function

User-defined functions

Function definition and syntax

Calling user-defined functions

Controlling variable scope

Return statement

Making indirect function calls

Summary

GNU's Implementation of AWK – GAWK (GNU AWK)

Things you don't know about GAWK

Reading non-decimal input

GAWK's built-in command line debugger

What is debugging?

Debugger concepts

Using GAWK as a debugger

Starting the debugger

Set breakpoint

Removing the breakpoint

Running the program

Looking inside the program

Displaying some variables and data

Setting watch and unwatch

Controlling the execution

Viewing environment information

Saving the commands in file

Exiting the debugger

Array sorting

Sort array by values using asort( )

Sort array indexes using asorti()

Two-way inter-process communication

Using GAWK for network programming

TCP client and server (/inet/tcp)

UDP client and server ( /inet/udp )

Reading a web page using HttpService

Profiling

Summary

Practical Implementation of AWK

Working with one-liners for text processing and pattern matching with AWK

Selective printing of lines with AWK

Modifying line spacing in a file with AWK

Numbering and calculations with AWK

Selective deletion of certain lines in a file with AWK

String operation on selected lines with AWK

Array creation with AWK one-liner

Text conversion and substitution in files with AWK

One-liners for system administrators

Use case examples of pattern matching using AWK

Parsing web server (Apache/Nginx) log files

Understanding the Apache combined log format

Using AWK for processing different log fields

Identifying problems with the running website

Printing the top 10 request IP addresses with their GeoIP information

Counting and printing unique visits to a website

Real-time IP address lookup for requests

Converting text to HTML table

Converting decimal to binary

Renaming files in a directory with AWK

Printing a generated sequence of numbers in a specified columnate format

Transposing a matrix

Processing multiple files using AWK

Summary

Further reading

Preface

This book is for anyone who is inclined to learn text processing and data extraction in a Unix-like environment. Readers will gain sufficient practical knowledge to write AWK one-liners for extracting data and write clean and small AWK programs to solve complex problems. You will be able to automate the process of cleaning any raw data, remove any extra unnecessary stuff, and create a desired reportable output. Examples given in the book are easily reproducible and will help you better understand AWK.

Text processing is used in data mining, data cleaning of CSV, and other similar-format database files. System administrators use it in their shell scripts to automate tasks and filter out command output. It is used extensively with grep, egrep, fgrep, and regular expressions for parsing text files. Its use cases vary from industry to industry, such as telecom enterprises and business process organizations that deal with large CSV files for storing logs and other user information. They use AWK for cleaning and transforming the structure of data from one form to another.

AWK is one of the oldest and most powerful utilities, and it exists in all Unix and Linux distributions. It is used as a command-line utility for performing basic text processing operations and as a programming language when dealing with complex text processing. The best thing about AWK is that it is a data-driven language: you describe the data you wish to work with, and the set of actions you want to perform in the case of a pattern match. This book will provide you with a rundown, explaining the concepts to help you get started with AWK. We will cover every element of functions, variables, and more.

This book will enable the user to perform text filtering, text cleaning, and parsing of input in user-defined formats to create elegant reports. Our main focus throughout the book is on learning AWK with examples and small scripts to quickly solve user problems. The mission of this book is to make the reader comfortable and friendly with AWK.

Who this book is for

The book is written from a beginner's point of view. It covers the basic to intermediate skills that are essential for text processing in a simple and effective manner. At the same time, there is a good amount of material that a seasoned AWK user will find interesting. It addresses a wide audience and will be useful to the following people:

Data scientists who need to extract and clean data for analysis

Developers who perform parsing of flat text files, HTML files, XML files, or CSV files

System administrators who parse log files for analysis

Any GNU/Linux hobbyist or enthusiast who likes to play with GNU/Linux filters for data manipulation

It is written in such a manner that any user with a basic familiarity with the GNU/Linux command line can start using it. The only requisite for it is to have a GNU/Linux box for practicing the stuff covered. This book begins with the essentials of text processing, that is, regular expressions (followed by the structure of the AWK program), variables, constants, functions, arrays, printing options, control flow of the program, and use of different operators to carry out various text processing and mining tasks.

For advanced users, Chapter 9, GNU's Implementation of AWK – GAWK (GNU AWK), covers GAWK implementation in networking, inter-process communication, and debugging. It is followed by practical examples for text processing and pattern matching. For system administrators, we have covered quick one-liners that they will find useful in their daily operations. This book has something for every learner who is working on GNU/Linux.

What this book covers

Chapter 1, Getting Started with AWK Programming, introduces AWK's essentials. In this chapter, you will learn how to set up an AWK environment on a Linux machine. You will run AWK programs in different ways with basic examples. This chapter lays the foundation for other chapters.

Chapter 2, Working with Regular Expressions, introduces regular expressions and explains how they are handled with AWK. You will get to know regular expressions with suitable examples.

Chapter 3, AWK Variables and Constants, focuses on the usage of AWK variables. You will understand how to use built-in and user-defined variables while writing AWK programs and command lines. You will also learn how string and numeric constants can be used to process different fields in data files.

Chapter 4, Working with Arrays in AWK, focuses on associative arrays in AWK. You will understand various features of associative arrays, such as these: indexes do not need to be in order, one can use either a string or a number as an array index, and array size can expand/shrink at runtime and is not statically defined.

Chapter 5, Printing Output in AWK, focuses on the print and printf statements and how they can be used efficiently to produce formatted reports. You will also learn how to use redirections in an AWK program.

Chapter 6, AWK Expressions, describes the expressions that build the core logic of a program in any programming language. The reader will learn how to create and use different types of expression in the AWK language.

Chapter 7, AWK Control Flow Statements, covers the usage of different conditional statements to control the flow of AWK programs, with examples.

Chapter 8, AWK Functions, covers the different types of built-in functions available in AWK. In addition, you will learn the usage of user-defined functions to perform repetitive tasks in AWK.

Chapter 9, GNU's Implementation of AWK – GAWK (GNU AWK), covers the advanced features of GNU AWK, such as network communication, debugging, and inter-process communication in GAWK. These are not present in the original AWK.

Chapter 10, Practical Implementation of AWK, illustrates various use cases of text processing. You will learn how a system administrator can use the AWK command line and scripts to automate repetitive tasks. Programmers and data scientists dealing with raw data in text files will learn how to clean raw data and produce formatted reports.

To get the most out of this book

We have put our best efforts into making this book's code and content relevant for the larger audience working on GNU/Linux. All the examples covered in the book are based on the openSUSE Leap 42.3 Linux distribution with GAWK version 4.1.3 installed. You can use any Linux distribution that has GAWK version 4.1.3 or above installed to practice the examples. GAWK is GNU's open source implementation of AT&T's original AWK.

The only requirement for this book is the GAWK utility. Users can install the Windows variant of GAWK on their system if they want to practice the examples, but we strongly recommend using a Linux distribution for this purpose. Windows and OS X users who would like to practice in a virtual environment can use VMware or VirtualBox to set up their favorite Linux distribution and then use AWK to execute the examples. For beginners, we have covered the installation of AWK on popular GNU/Linux distributions using package management and source code.

A basic understanding of GNU/Linux operating systems and familiarity with any text editor such as emacs, vi or nano are required. An understanding of shell scripting will be an added advantage to the reader.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packtpub.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-AWK-Programming. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles. Here is an example: "We can check that using the which command, which will return the absolute path of AWK on our system."

Any command-line input or output is written as follows:

[ shiwang@linux ~ ] $ sudo apt-get update -y

[ shiwang@linux ~ ] $ sudo apt-get install gawk -y

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "It was updated and replaced in the mid-1980s with an enhanced version called New AWK (NAWK)."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Getting Started with AWK Programming

Welcome to your journey with AWK programming. We all interact with data in our daily lives in one way or another. Retrieving the desired, useful information from this data can seem like a difficult task. However, if we have the correct tools and proper knowledge of how to handle them, it's really not that difficult. This book will teach you how to efficiently handle one of the best of these tools, known as AWK. It is a standard utility on Unix-like operating systems for retrieving information from raw data. It will help you to understand basic data-processing concepts in a user-friendly way.

This chapter is designed to give you a kickstart for writing your own simple AWK programs. Throughout the book, we will explain AWK, and work with useful and interesting examples to develop your problem-solving skills in AWK.

In this chapter, we will cover the following:

An overview of the AWK programming language

Different installation methods of AWK on the Linux environment

Understanding the AWK workflow

Learning how to create and run basic AWK programs in multiple ways

Working with sample data files with a simple usage of AWK

Understanding different AWK options

AWK programming language overview

In this section, we will explore the AWK philosophy and the different types of AWK that exist today, starting from its original implementation in 1977 at AT&T Bell Laboratories. We will also look at the various implementation areas of AWK in data science today.

What is AWK?

AWK is an interpreted programming language designed for text processing and report generation. It is typically used for data manipulation, such as searching for items within data, performing arithmetic operations, and restructuring raw data for generating reports in most Unix-like operating systems. Using AWK programs, one can handle repetitive text-editing problems with very simple and short programs. AWK is a pattern-action language; it searches for patterns in a given input and, when a match is found, it performs the corresponding action. The pattern can be made of strings, regular expressions, comparison operations on numbers, fields, variables, and so on. AWK reads the input files and splits each input line of the file into fields automatically.
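As a minimal sketch of the pattern-action model described above, the following commands feed three made-up lines to awk. The sample names and numbers are illustrative assumptions, not data from this book.

```shell
# Hypothetical sample input: name, department, salary (whitespace-separated).
data='alice sales 4200
bob eng 5100
carol eng 4800'

# The pattern /eng/ selects matching lines; the action prints fields 1 and 3.
# awk splits each line into $1, $2, ... on whitespace automatically.
eng_rows=$(printf '%s\n' "$data" | awk '/eng/ { print $1, $3 }')
echo "$eng_rows"

# A numeric comparison on a field can serve as the pattern as well.
high_paid=$(printf '%s\n' "$data" | awk '$3 > 5000 { print $1 }')
echo "$high_paid"
```

The first command prints the name and salary of every line containing eng, and the second prints only the names whose third field exceeds 5000. Both use only standard AWK features, so they should behave the same under GAWK, NAWK, and MAWK.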

AWK has most of the well-designed features that every programming language should contain. Its syntax particularly resembles that of the C programming language. It is named after its original three authors:

Alfred V. Aho

Peter J. Weinberger

Brian W. Kernighan

AWK is a very powerful, elegant, and simple tool that every person dealing with text processing should be familiar with.

Types of AWK

The AWK language was originally implemented as the awk utility on Unix. Today, most Linux distributions provide the GNU implementation of AWK (GAWK), with awk set up as a symlink to the GAWK binary. The AWK utility can be categorized into the following three types, depending upon the type of interpreter it uses for executing AWK programs:

AWK: This is the original AWK interpreter from AT&T Laboratories. It is not used much nowadays and might not be well-maintained. One limitation is that it splits a line into a maximum of 99 fields. It was updated and replaced in the mid-1980s with an enhanced version called New AWK (NAWK).

NAWK: This is AT&T's latest development of the AWK interpreter. It is well-maintained by one of the original authors of AWK, Dr. Brian W. Kernighan.

GAWK: This is the GNU project's implementation of the AWK programming language. Most GNU/Linux distributions ship GAWK by default, which makes it the most popular version of AWK. The GAWK interpreter is fully compatible with AWK and NAWK.

Beyond these, we also have other, less popular, AWK interpreters and translators, mentioned as follows. These variants are useful in operations when you want to translate your AWK program to C, C++, or Perl:

MAWK: Michael Brennan's interpreter for AWK.

TAWK: Thompson Automation's interpreter/compiler/Microsoft Windows DLL for AWK.

MKSAWK: Mortice Kern Systems' interpreter/compiler for AWK.

AWKCC: An AWK translator to C (might not be well-maintained).

AWKC++: Brian Kernighan's AWK translator to C++ (experimental). It can be downloaded from: https://9p.io/cm/cs/who/bwk/awkc++.ps.

AWK2C: An AWK translator to C. It uses GNU AWK libraries extensively.

A2P: An AWK translator to Perl. It comes with Perl.

AWKA: Yet another AWK translator to C (comes with the library), based on MAWK. It can be downloaded from: http://awka.sourceforge.net/download.html.

When and where to use AWK

AWK is one of the simplest utilities for text processing and is available by default on Unix-like operating systems. Some people might say Perl is a superior choice for text processing, as AWK is functionally a subset of Perl, but the learning curve for Perl is steeper than that of AWK. AWK programs tend to be short and hence quick to write. Anybody who knows the Linux command line can start writing AWK programs in no time. Here are a few use cases of AWK:

Text processing

Producing formatted text reports/labels

Performing arithmetic operations on fields of a file

Performing string operations on different fields of a file

Programs written in AWK are smaller than they would be in other higher-level languages for similar text processing operations. Because AWK programs are interpreted, they can be run directly from a GNU/Linux Terminal without the separate compilation step that development in compiled languages requires.

Getting started with AWK

This section describes how to set up the AWK environment on your GNU/Linux system, and we'll also discuss the workflow of AWK. Then, we'll look at different methods for executing AWK programs.

Installation on Linux

All the examples in this book were run on a Linux distribution (openSUSE Leap 42.3). In order to practice the examples discussed in this book, you need GNU AWK version 4.1.3 or above installed on your system. Although there won't be drastic changes if you use an earlier version, we still recommend that you use the same version to follow along.

Generally, AWK is installed by default on most GNU/Linux distributions. Using the which command, you can check whether it is installed on your system or not. In case AWK is not installed on your system, you can install it in one of two ways:

Using the package manager of the corresponding GNU/Linux system

Compiling from the source code

Let's take a look at each method in detail in the following sections.

Using the package manager

Different flavors of GNU/Linux distribution have different package-management utilities. If you are using a Debian-based GNU/Linux distribution, such as Ubuntu, Mint, or Debian, then you can install GAWK using the Advanced Package Tool (APT) package manager, as follows:

[ shiwang@linux ~ ] $ sudo apt-get update -y

[ shiwang@linux ~ ] $ sudo apt-get install gawk -y

Similarly, to install AWK on an RPM-based GNU/Linux distribution, such as Fedora, CentOS, or RHEL, you can use the Yellowdog Updater, Modified (YUM) package manager, as follows:

[ root@linux ~ ] # yum update -y

[ root@linux ~ ] # yum install gawk -y

To install AWK on openSUSE, you can use the zypper command-line package-management utility, as follows:

[ root@linux ~ ] # zypper update -y

[ root@linux ~ ] # zypper install gawk -y

Once installation is finished, make sure AWK is accessible through the command line. We can check that using the which command, which will return the absolute path of AWK on our system:

[ root@linux ~ ] # which awk

/usr/bin/awk

You can also use awk --version to find the AWK version on your system:

[ root@linux ~ ] # awk --version
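Because awk may be a symlink to GAWK or a different implementation altogether, it can help to check which interpreter answers to the name. The following is a minimal sketch, not a definitive test: it relies only on the fact that GAWK's version banner contains the text GNU Awk, and reports anything else as other rather than trying to identify it. The awk_flavour function name is our own.

```shell
# Report whether the awk found on PATH is GNU AWK.
# Assumption: only GAWK prints a banner containing "GNU Awk" for --version;
# other implementations may print something else or reject the option.
awk_flavour() {
    if awk --version </dev/null 2>/dev/null | grep -q 'GNU Awk'; then
        echo "gawk"
    else
        echo "other"
    fi
}

echo "awk resolves to: $(command -v awk)"
echo "implementation:  $(awk_flavour)"
```

If the second line reports other, the basic examples in this chapter should still behave the same way, but GAWK-specific features covered later in the book may not be available.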

Compiling from the source code

Like every other open source utility, the GNU AWK source code is freely available for download as part of the GNU project. Previously, you saw how to install AWK using the package manager; now, you will see how to install AWK by compiling from its source code on the GNU/Linux distribution. The following steps are applicable to most of the GNU/Linux software for installation:

Download the source code from a GNU project FTP site. Here, we will use the wget command-line utility to download it; however, you are free to choose any other program you feel comfortable with, such as curl:

[ shiwang@linux ~ ] $ wget http://ftp.gnu.org/gnu/gawk/gawk-4.1.3.tar.xz

Extract the downloaded source code:

[ shiwang@linux ~ ] $ tar xvf gawk-4.1.3.tar.xz

Change your working directory and execute the configure file to configure GAWK as per the working environment of your system:

[ shiwang@linux ~ ] $ cd gawk-4.1.3 && ./configure

Once the configure command completes its execution successfully, it will generate the Makefile. Now, compile the source code by executing the make command:

[ shiwang@linux ~ ] $ make

Type make install to install the programs and any data files and documentation. When installing into a prefix owned by root, it is recommended that the package be configured and built as a regular user, and that only the make install phase be executed with root privileges:

[ shiwang@linux ~ ] $ sudo make install

Upon successful execution of these five steps, you have compiled and installed AWK on your GNU/Linux distribution. You can verify this by executing the which awk command in the Terminal or awk --version:

[ root@linux ~ ] # which awk

/usr/bin/awk

Now you have a working AWK/GAWK installation and we are ready to begin AWK programming, but before that, our next section describes the workflow of the AWK interpreter.

If you are running macOS, AWK, and not GAWK, is installed by default. For GAWK installation on macOS, please refer to MacPorts for GAWK.

Workflow of AWK

Having a basic knowledge of the AWK interpreter workflow will help you to better understand AWK and will result in more efficient AWK program development. Hence, before getting your hands dirty with AWK programming, you need to understand its internals. The AWK workflow can be summarized as shown in the following figure:

Figure 1.1: AWK workflow

Let's take a look at each operation:

READ OPERATION: AWK reads a line from the input stream (file, pipe, or stdin) and stores it in memory. The text input is further split into records and fields:

Records: An AWK record is a single, continuous data input that AWK works on. Records are bounded by a record separator, whose value is stored in the RS variable. The default value of RS is set to a newline character. So, the lines of input are considered records for the AWK interpreter. Records are read continuously until the end of the input is reached.

Figure 1.2 shows how input data is broken into records and then goes further into how it is split into fields:

Figure 1.2: AWK input data is split into records with the record separator

Fields: Each record can further be broken down into individual chunks called fields. Like records, fields are bounded by a separator, the field separator, whose value is stored in the FS variable. The default field separator is any amount of whitespace, including tab and space characters. So by default, lines of input are further broken down into individual words separated by whitespace. You can refer to the fields of a record by a field number, beginning with 1. The last field in each record can be accessed by its number or as $NF, because the NF special variable contains the number of fields in the current record, as shown in Figure 1.3:

Figure 1.3: Records are split into fields by a field separator

EXECUTE OPERATION: All AWK commands are applied sequentially on the input (records and fields). By default, AWK executes commands on each record/line. This behavior of AWK can be restricted by the use of patterns.

REPEAT OPERATION: The process of read and execute is repeated until the end of the file is reached.

The following flowchart depicts the workflow of the AWK interpreter:

Figure 1.4: Workflow of the AWK interpreter
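The read, execute, and repeat cycle described above can be observed with a short sketch. It feeds two made-up lines to AWK and, for each record, prints the record number, the number of fields, and the first and last fields. NR (the number of the current record) is a standard AWK variable; the show_fields function name and the sample input are our own.

```shell
# Two input records (lines); fields are separated by runs of spaces or tabs.
show_fields() {
    printf 'alpha beta   gamma\none\ttwo\n' |
        awk '{ print "record " NR ": " NF " fields, first=" $1 ", last=" $NF }'
}

show_fields
# record 1: 3 fields, first=alpha, last=gamma
# record 2: 2 fields, first=one, last=two
```

Note that the three spaces between beta and gamma, and the tab between one and two, each count as a single field separator.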

Action and pattern structure of AWK

AWK programs are sequences of one or more pattern-action statements, each consisting of a pattern followed by an action. These pattern-action statements are separated by newlines. Both the action (AWK commands) and the pattern (search pattern) are optional, so the action is enclosed in { } to distinguish it from the pattern:

/ search pattern / { action / awk-commands }

/ search pattern / { action / awk-commands }

AWK reads each input line one after the other, and searches for matches of the given pattern. If the current input line matches the given pattern, a corresponding action is taken. Then, the next input line is read and the matching of patterns starts again. This process continues until all input is read.

Throughout this book, we will be using the terms pattern and search pattern interchangeably, and likewise the terms action and AWK command.

In AWK syntax, we can omit either patterns or actions (but not both) in a single pattern-action statement. Where a search pattern has no action statement (AWK command), each line for which a pattern matches is printed to the output.

Each action statement can be single or multiple AWK commands.

Multiple AWK commands on a single line are separated by a semicolon (;).

A semicolon can be put at the end of any statement.
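As a rough illustration of these rules, the following sketch runs a two-statement program over three made-up lines. The first statement has a pattern and an action made of two commands separated by a semicolon; the second has a pattern only, so matching lines are printed as they are. The rules_demo function name and the sample data are our own.

```shell
rules_demo() {
    printf 'apple 50\nbanana 60\nlitchi 50\n' |
        awk '/an/ { print "fruit:", $1; print "price:", $2 }   # pattern with two commands
             /litchi/                                          # pattern only: print the line'
}

rules_demo
# fruit: banana
# price: 60
# litchi 50
```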

Example data file

Before proceeding, let's create an empinfo.txt file for practice. Each line in the file contains the name of the employee, their phone number, email address, job profile, salary in USD, and working days in a week:

Jack 9857532312 [email protected] hr 2000 5

Jane 9837432312 [email protected] hr 1800 5

Eva 8827232115 [email protected] lgs 2100 6

amit 9911887766 [email protected] lgs 2350 6

Julie 8826234556 [email protected] hr 2500 5
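One way to create such a file without opening an editor is a shell here-document. This is a sketch under one stated assumption: the email addresses under example.com are placeholder values chosen for illustration, so substitute whatever addresses you like. Each line has six whitespace-separated fields, matching the description above.

```shell
# Create the practice file in the current directory.
# The e-mail addresses are placeholders (example.com), assumed for this sketch.
cat > empinfo.txt <<'EOF'
Jack 9857532312 jack@example.com hr 2000 5
Jane 9837432312 jane@example.com hr 1800 5
Eva 8827232115 eva@example.com lgs 2100 6
amit 9911887766 amit@example.com lgs 2350 6
Julie 8826234556 julie@example.com hr 2500 5
EOF

# Sanity check: print the number of fields in each record (6 for every line).
awk '{ print NF }' empinfo.txt
```

With this layout, the salary is field 5 and the number of working days is field 6.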

Pattern-only statements

The syntax of the awk command with a pattern only is as follows:

awk '/ pattern /' inputfilename

In the following example, all lines of the empinfo.txt file are processed, and those that contain the Jane pattern are printed:

$ awk '/Jane/' empinfo.txt

Jane 9837432312 [email protected] hr 1800 5

Action-only statements

The syntax of the awk command with an action only is as follows:

awk '{ action statements / awk-commands }' inputfilenames

If you omit the pattern and give the action statement (AWK commands), then the given action is performed on all input lines, for example:

$ awk '{ print $1 }' empinfo.txt

Jack

Jane

Eva

amit

Julie

In the given example, all employee names are printed on the screen, because $1 represents the first field of each input line.

An empty pattern, that is //, matches the empty string, which is present in every record, so it is equivalent to giving no pattern at all. If we specify an empty pattern without an action, each input record of the input file is printed. An empty action, that is { }, specifies doing nothing, so no input record of the input file is printed.
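The difference between the two can be checked by counting output lines. In this small sketch, the function names and the three-line sample input are our own; wc -l counts the lines AWK prints and tr strips any padding that wc adds.

```shell
# Empty pattern, no action: every record is printed.
empty_pattern() { printf 'one\ntwo\nthree\n' | awk '//' | wc -l | tr -d ' '; }

# Empty action, no pattern: every record is read, but nothing is printed.
empty_action() { printf 'one\ntwo\nthree\n' | awk '{ }' | wc -l | tr -d ' '; }

empty_pattern   # 3
empty_action    # 0
```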

Printing each input line/record

We can print each input record of the input file in multiple ways, as shown in the following example. All the given examples will produce the same result by printing all the input records of the input file.

In our first example, we specify the empty pattern without any action statement to print each input record of the input file, as follows:

$ awk '//' empinfo.txt

In the second example, we specify the print action statement only, without giving any pattern for matching, and print each input record of the input file, as follows:

$ awk '{ print }' empinfo.txt

In the third example, we specify the $0 default variable, along with the print action statement, to print each input record of the input file, as follows:

$ awk '{ print $0 }' empinfo.txt

In the fourth example, we specify the empty pattern along with the print action statement to print each input record of the input file, as follows:

$ awk '//{ print }' empinfo.txt

All of the given examples perform the basic printing operation using AWK. On execution of any of the preceding examples, we will get the following output:

Jack 9857532312 [email protected] hr 2000 5

Jane 9837432312 [email protected] hr 1800 5

Eva 8827232115 [email protected] lgs 2100 6

amit 9911887766 [email protected] lgs 2350 6

Julie 8826234556 [email protected] hr 2500 5

Using the BEGIN and END blocks construct

AWK contains two special keywords, BEGIN and END, that can be used as patterns; each of them requires an action. Both of them are optional and are written without slashes. They don't match any input lines.

The BEGIN block

The BEGIN block is executed in the beginning, before the first input line of the first input file is read. This block is executed once only when the AWK program is started. It is frequently used to initialize the variables or to change the value of the AWK built-in variables, such as FS. The syntax of the BEGIN block is as follows:

BEGIN { action / awk-commands }

The body block

The body block is the same pattern-action block that we discussed at the beginning of the chapter. The syntax of the body block is as follows:

/ search pattern / { action / awk-commands }

In the body block, AWK commands are applied by default on each input line; however, we can restrict this behavior with the help of patterns. There are no keywords for the body block.

The END block

The END block is executed after the last input line of the last input file is read. This block is executed at the end and is generally used for rendering totals, averages, and for processing data and figures that were read in the various input records. The syntax of the END block is as follows:

END { action / awk-commands }

We can have multiple BEGIN or END blocks in a program. The actions in those blocks are executed in the order in which the blocks appear in the program. It is not mandatory to have BEGIN first and END last. The BEGIN and END blocks do not contain patterns; they contain action statements only.

Here is an example of the usage of the BEGIN and END blocks:

$ awk 'BEGIN { print "==Employee Info==" } # begin block

{ print } # body block

END { print "==ends here==" }' empinfo.txt # end block

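On executing the code, we get the following result:

==Employee Info==

Jack 9857532312 [email protected] hr 2000 5

Jane 9837432312 [email protected] hr 1800 5

Eva 8827232115 [email protected] lgs 2100 6

amit 9911887766 [email protected] lgs 2350 6

Julie 8826234556 [email protected] hr 2500 5

==ends here==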
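Since the END block is typically where totals and averages are produced, here is a hedged sketch of that use. It assumes the six-field layout described for empinfo.txt, with the salary in field 5; the data is supplied inline, and the email addresses under example.com are placeholders of our own. The salary_report function name is also ours.

```shell
salary_report() {
    awk '
        BEGIN { print "==Salary Report==" }   # runs once, before any input is read
        { total += $5 }                       # body block: add up the salary field
        END {                                 # runs once, after the last record
            printf "employees=%d total=%d average=%.1f\n", NR, total, total / NR
        }
    ' <<'EOF'
Jack 9857532312 jack@example.com hr 2000 5
Jane 9837432312 jane@example.com hr 1800 5
Eva 8827232115 eva@example.com lgs 2100 6
amit 9911887766 amit@example.com lgs 2350 6
Julie 8826234556 julie@example.com hr 2500 5
EOF
}

salary_report
# ==Salary Report==
# employees=5 total=10750 average=2150.0
```

In the END block, NR still holds the number of the last record read, which is why it can serve as the employee count here.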

Patterns

In pattern-action statements, the pattern is something that determines when an action is to be executed. We can summarize the usage of patterns as follows:

BEGIN { statements }: The statements are executed once before any input has been read.

END { statements }: The statements are executed once after all input has been read.

expression { statements }: The statements are executed at each input line where the expression is true, that is, non-zero or non-null.

/ regular expression / { statements }: The statements are executed at each input line that contains a string matched by the regular expression.

compound pattern { statements }: A compound pattern combines expressions with && (AND), || (OR), ! (NOT), and parentheses; the statements are executed at each input line where the compound pattern is true.

pattern 1, pattern 2 { statements }: A range pattern matches each input line from a line matched by pattern 1 to the next line matched by pattern 2, inclusive; the statements are executed at each matching line. Here, each of the two patterns can be a regular expression or any other expression, such as a comparison on the record number.

BEGIN and END do not combine with other patterns. A range pattern cannot be part of any other pattern. BEGIN and END are the only patterns that require an action.
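The pattern forms summarized above can be tried out on a small data set. The sketch below uses made-up fruit names and prices supplied by a helper function of our own, fruit, and runs one short program per pattern type; NR is the standard AWK variable holding the current record number.

```shell
# Sample data: name and price, one item per line (values are made up).
fruit() { printf 'apple 50\nbanana 60\nlitchi 50\nmango 55\ngrapes 40\n'; }

fruit | awk '$2 > 50'                        # expression: banana 60, mango 55
fruit | awk '/gr/'                           # regular expression: grapes 40
fruit | awk '$2 == 50 && !/apple/'           # compound pattern: litchi 50
fruit | awk '/banana/, /mango/'              # range: banana 60, litchi 50, mango 55
fruit | awk 'NR == 2, NR == 3 { print $1 }'  # range on record numbers: banana, litchi
```

The first four programs have no action, so each matching line is printed in full; the last one adds an action that prints only the first field.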

Running AWK programs

There are different ways to run an AWK program. For a short program, we can directly execute AWK commands on the Terminal, and for long AWK programs, we generally create an AWK program script or source file. In this section, we will discuss different methods of executing AWK programs.

AWK as a Unix command line

This is the most-used method of running AWK programs. In this method, the program (the AWK commands) is given in single quotes as the first argument of the AWK command line, as follows:

$ awk 'program' file1 file2 file3 ... fileN

Here, program refers to the sequence of pattern-action statements discussed earlier. In this format, the AWK interpreter is invoked from the shell or Terminal to process the input lines of the given files. The single quotes around program instruct the shell not to interpret any of the characters in the AWK program as special shell characters, and to pass the entire program to AWK as a single argument. They also enable the program to continue on more than one line.
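The effect of the quoting can be seen by running the same program once in single quotes and once in double quotes. This is a small sketch with function names and sample input of our own; each function is called without arguments, so inside it the shell's own $2 is empty.

```shell
single_quoted() {
    # Single quotes: AWK receives { print $2 } untouched and prints the second field.
    printf 'alpha beta\n' | awk '{ print $2 }'
}

double_quoted() {
    # Double quotes: the shell substitutes its own (empty) $2 first,
    # so AWK only sees { print  } and prints the whole line.
    printf 'alpha beta\n' | awk "{ print $2 }"
}

multi_line() {
    # A single-quoted program may span several lines.
    printf 'alpha beta\n' | awk '
        { print $2 }
        { print $1 }
    '
}

single_quoted   # beta
double_quoted   # alpha beta
multi_line      # beta, then alpha
```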

The format used to call the AWK program from inside of a shell script is the same one we used on the Unix/Linux command line. For example:

$ awk '{ print }' empinfo.txt /etc/passwd

The preceding command will print every line of the empinfo.txt file, followed by the lines of the /etc/passwd file on your system, as follows:

Jack 9857532312 [email protected] hr 2000 5

Jane 9837432312 [email protected] hr 1800 5

Eva 8827232115 [email protected] lgs 2100 6

amit 9911887766 [email protected] lgs 2350 6

Julie 8826234556 [email protected] hr 2500 5

at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash

avahi:x:481:480:User for Avahi:/run/avahi-daemon:/bin/false

avahi-autoipd:x:493:493:User for Avahi IPv4LL:/var/lib/avahi-autoipd:/bin/false

bin:x:1:1:bin:/bin:/bin/bash

daemon:x:2:2:Daemon:/sbin:/bin/bash

games:x:12:100:Games account:/var/games:/bin/bash

man:x:13:62:Manual pages viewer:/var/cache/man:/bin/bash

messagebus:x:499:499:User for D-Bus:/run/dbus:/bin/false

......................

...................... till last line in /etc/passwd

AWK as a filter (reading input from the Terminal)

Filter commands can take their input from stdin instead of reading it from the file. We can omit giving input filenames at the command line while executing the awk program, and simply call it from the Terminal as:

$ awk 'program'

In this form, AWK applies the program to whatever you type on the standard input, that is, the Terminal, until you signal end-of-file by pressing Ctrl + D, for example:

$ awk '$2==50{ print }'

apple 50

apple 50

banana 60

litchi 50

litchi 50

mango 55

grapes 40

pineapple 60

........

Each line that contains 50 in the second field is printed as soon as it is typed, hence it appears twice on the Terminal. This functionality can be used to experiment with AWK; all you need to do is type your AWK commands first, then type data, and see what happens. The only thing you have to take care of is to enclose your AWK commands in single quotes on the command line. This prevents the shell expansion of special characters, such as $, and also allows your program to be longer than one line.

Here is one more example in which we take input from the pipe and process it with the AWK command:

$ echo -e "jack \nsam \ntarly \njerry" | awk '/sam/{ print }'

On executing this code, you get the following result:

sam

We will be using examples of executing the AWK command line on the Terminal throughout the book for explaining various topics. This type of operation is performed when the program (AWK commands) is short (up to a few lines).

Running AWK programs from the source file

When AWK programs are long, it is more convenient to put them in a separate file. Putting AWK programs in a file reduces errors and retyping. Its syntax is as follows:

$ awk -f source_file inputfile1 inputfile2 ... inputfileN

The -f option tells the AWK utility to read the AWK commands from source_file. Any filename can be used in place of source_file. For example, we can create a cmd.awk text file containing the AWK commands, as follows:

$ vi cmd.awk

BEGIN { print "***Emp Info***" }

{ print }

Now, we instruct AWK to read the commands from the cmd.awk file and perform the given actions:

$ awk -f cmd.awk empinfo.txt

On executing the preceding command, we get the following result:

***Emp Info***

Jack 9857532312 [email protected] hr 2000 5

Jane 9837432312 [email protected] hr 1800 5

Eva 8827232115 [email protected] lgs 2100 6

amit 9911887766 [email protected] lgs 2350 6

Julie 8826234556 [email protected] hr 2500 5

It does the same thing as this:

$ awk 'BEGIN { print "***Emp Info***" } { print }' empinfo.txt

We don't usually need to put the filename specified with -f in single quotes, because filenames generally don't contain any shell special characters. In the cmd.awk source file, we didn't put the AWK commands in single quotes. The quotes are only needed when we execute the AWK command from the command line. We added the .awk extension in the filename to clearly identify the AWK program file; it doesn't affect the execution of the AWK program and hence is not mandatory.
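If you prefer not to open an editor, a program file can also be written from the shell with a here-document. The following sketch uses hypothetical file names (fields.awk and staff.txt) and made-up input, and works in a scratch directory created with mktemp so that nothing in the current directory is overwritten.

```shell
run_from_file() {
    dir=$(mktemp -d) || return 1   # scratch directory for the two files

    # The quoted delimiter ('EOF') keeps the shell from expanding $1 and $NF
    # in the program text; inside the file itself, no quotes are needed.
    cat > "$dir/fields.awk" <<'EOF'
BEGIN { print "***Name and salary***" }
{ print $1, $NF }
EOF

    printf 'Jack hr 2000\nJane hr 1800\n' > "$dir/staff.txt"   # made-up input
    awk -f "$dir/fields.awk" "$dir/staff.txt"
}

run_from_file
# ***Name and salary***
# Jack 2000
# Jane 1800
```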

AWK programs as executable script files

We can write self-contained AWK scripts to execute AWK commands, just as we write shell scripts to execute shell commands. We create the AWK script by using #!, followed by the absolute path of the AWK interpreter and the -f option. The line beginning with #! tells the operating system to run the interpreter named on that line with the given argument, followed by the name of the script and the arguments the script was invoked with. For example, we can update the cmd.awk file to emp.awk, as follows:

$ vi emp.awk

#!/usr/bin/awk -f

BEGIN { print "***Emp Info***" }

{ print }

Give this file executable permissions (with the chmod utility), then simply run ./emp.awk empinfo.txt at the shell and the system will run AWK as if you had typed awk -f emp.awk empinfo.txt:

$ chmod +x emp.awk

$ ./emp.awk empinfo.txt

***Emp Info***

Jack 9857532312 [email protected] hr 2000 5

Jane 9837432312 [email protected] hr 1800 5

Eva 8827232115 [email protected] lgs 2100 6

amit 9911887766 [email protected] lgs 2350 6

Julie 8826234556 [email protected] hr 2500 5

Self-contained executable AWK scripts are useful when you want to write AWK programs that users can invoke without having to know they were written in AWK.
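The interpreter path on the #! line must match the location of AWK on the system where the script runs, and that location is not always /usr/bin/awk. As a hedged sketch, the script below looks the path up with command -v and writes it into the first line of a generated script. The make_script function name and the sample input are our own, and the script is created in a scratch directory, which is assumed to allow execution.

```shell
make_script() {
    dir=$(mktemp -d) || return 1
    awk_path=$(command -v awk) || return 1   # for example /usr/bin/awk; differs between systems

    # Write the script: interpreter line first, then the AWK program.
    {
        printf '#!%s -f\n' "$awk_path"
        printf 'BEGIN { print "***Emp Info***" }\n'
        printf '{ print }\n'
    } > "$dir/emp.awk"

    chmod +x "$dir/emp.awk"

    # Run the script directly; it reads its input from the pipe.
    printf 'Jack hr 2000\nJane hr 1800\n' | "$dir/emp.awk"
}

make_script
# ***Emp Info***
# Jack hr 2000
# Jane hr 1800
```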