Bash for Data Scientists - Oswald Campesato - E-Book

Bash for Data Scientists E-Book

Oswald Campesato

Description

This book introduces powerful command line utilities for creating efficient shell scripts to process datasets. Using the bash shell, the examples and scripts focus on small datasets to help readers understand the features of grep, sed, and awk. Companion files with code are available for download from the publisher.
The course starts with an introduction to the basics, covering files and directories, and useful commands. It then progresses to conditional logic and loops, providing a solid foundation for processing datasets. Detailed chapters on using grep, sed, and awk illustrate their capabilities in handling and cleaning various types of datasets effectively.
Advanced topics include processing datasets with Pandas, exploring NoSQL, SQLite, and Python. The book equips data scientists, analysts, and anyone seeking shell-based solutions with practical skills. By the end, users will be adept at creating robust scripts for dataset processing, combining command line utilities for optimal results.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Number of pages: 356

Year of publication: 2024




BASH FOR DATA SCIENTISTS

Oswald Campesato

MERCURY LEARNING AND INFORMATION

Dulles, Virginia

Boston, Massachusetts

New Delhi

Copyright ©2023 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai

MERCURY LEARNING AND INFORMATION

22841 Quicksilver Drive

Dulles, VA 20166

[email protected]

www.merclearning.com

1-800-232-0223

O. Campesato. Bash for Data Scientists.

ISBN: 978-1-68392-973-4

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2022948076

22 23 24 3 2 1    Printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files for this title are available by writing to the publisher at [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents: may this bring joy and happiness into their lives.

CONTENTS

Preface

Chapter 1: Introduction

What is Unix?

Available Shell Types

What is Bash?

Getting Help for Bash Commands

Navigating Around Directories

The history Command

Listing Filenames with the ls Command

Displaying Contents of Files

The cat Command

The head and tail Commands

The Pipe Symbol

The fold Command

File Ownership: Owner, Group, and World

Hidden Files

Handling Problematic Filenames

Working with Environment Variables

The env Command

Useful Environment Variables

Setting the PATH Environment Variable

Specifying Aliases and Environment Variables

Finding Executable Files

The printf Command and the echo Command

The cut Command

The echo Command and Whitespaces

Command Substitution (“Back Tick”)

The Pipe Symbol and Multiple Commands

Using a Semicolon to Separate Commands

The paste Command

Inserting Blank Lines with the paste Command

A Simple Use Case with the paste Command

A Simple Use Case with cut and paste Commands

Working with Meta Characters

Working with Character Classes

What about zsh?

Switching between bash and zsh

Configuring zsh

Summary

Chapter 2: Files and Directories

Create, Copy, Remove, and Move Files

Creating Files

Copying Files

Copy Files with Command Substitution

Deleting Files

Moving Files

The basename, dirname, and file Commands

The wc Command

The more Command and the less Command

The head Command

The tail Command

File Comparison Commands

The Parts of a Filename

Working with File Permissions

The chmod Command

The chown Command

The chgrp Command

The umask and ulimit Commands

Working with Directories

Absolute and Relative Directories

Absolute and Relative Path Names

Creating Directories

Removing Directories

Changing Directories

Renaming Directories

Using Quote Characters

Streams and Redirection Commands

Metacharacters and Character Classes

Digits and Characters

Working with “^” and “\” and “!”

Filenames and Metacharacters

Summary

Chapter 3: Useful Commands

The join Command

The fold Command

The split Command

The sort Command

The uniq Command

How to Compare Files

The od Command

The tr Command

A Simple Use Case

The find Command

The tee Command

File Compression Commands

The tar Command

The cpio Command

The gzip and gunzip Commands

The bunzip2 Command

The zip Command

Commands for zip Files and bz Files

Internal Field Separator (IFS)

Data from a Range of Columns in a Dataset

Working with Uneven Rows in Datasets

The alias Command

Summary

Chapter 4: Conditional Logic and Loops

Arithmetic Operations and Operators

Working with Arrays

Arrays and Text Files

Working with Variables

Assigning Values to Variables

Working with Operators for Strings and Numbers

The read Command for User Input

The test Command for Variables, Files, and Directories

Relational Operators

Boolean Operators

String Operators

File Test Operators

Conditional Logic with if/else Statements

The case/esac Statement

Arithmetic Operators and Comparisons

Working with Strings in Shell Scripts

Working with Strings

Working with Loops

Using a for loop

Working with Nested Loops

Using a while Loop

The while, case, and if/elif/fi Statements

Using an until Loop

User-defined Functions

Creating a Simple Menu from Shell Commands

Summary

Chapter 5: Processing Datasets with grep and sed

What is the grep Command?

Metacharacters and the grep Command

Escaping Metacharacters with the grep Command

Useful Options for the grep Command

Character Classes and the grep Command

Working with the –c Option in grep

Matching a Range of Lines

Using Back References in the grep Command

Finding Empty Lines in Datasets

Using Keys to Search Datasets

The Backslash Character and the grep Command

Multiple Matches in the grep Command

The grep Command and the xargs Command

Searching zip Files for a String

Checking for a Unique Key Value

Redirecting Error Messages

The egrep Command and fgrep Command

Displaying “Pure” Words in a Dataset with egrep

The fgrep Command

Delete Rows with Missing Values

A Simple Use Case

What is the sed Command?

The sed Execution Cycle

Matching String Patterns Using sed

Substituting String Patterns Using sed

Replacing Vowels from a String or a File

Deleting Multiple Digits and Letters from a String

Search and Replace with sed

Datasets with Multiple Delimiters

Useful Switches in sed

Working with Datasets

Printing Lines

Character Classes and sed

Removing Control Characters

Counting Words in a Dataset

Back References in sed

One-line sed Commands

Populate Missing Values with the sed Command

A Dataset With 1,000,000 Rows

Numeric Comparisons

Counting Adjacent Digits

Average Support Rate

Summary

Chapter 6: Processing Datasets with awk

The awk Command

Built-in Variables that Control awk

How Does the awk Command Work?

Aligning Text with the printf Command

Conditional Logic and Control Statements

The while Statement

A for loop in awk

A for loop with a break Statement

The next and continue Statements

Deleting Alternate Lines in Datasets

Merging Lines in Datasets

Printing File Contents as a Single Line

Joining Groups of Lines in a Text File

Joining Alternate Lines in a Text File

Matching with Metacharacters and Character Sets

Printing Lines Using Conditional Logic

Splitting Filenames with awk

Working with Postfix Arithmetic Operators

Numeric Functions in awk

One-line awk Commands

Useful Short awk Scripts

Printing the Words in a Text String in awk

Count Occurrences of a String in Specific Rows

Printing a String in a Fixed Number of Columns

Printing a Dataset in a Fixed Number of Columns

Aligning Columns in Datasets

Aligning Columns and Multiple Rows in Datasets

Displaying a Subset of Columns in a Text File

Subsets of Column-aligned Rows in Datasets

Counting Word Frequency in Datasets

Displaying Only “Pure” Words in a Dataset

Delete Rows with Missing Values

Working with Multi-line Records in awk

A Simple Use Case

Another Use Case

A Dataset with 1,000,000 Rows

Numeric Comparisons

Counting Adjacent Digits

Average Support Rate

Summary

Chapter 7: Processing Datasets (Pandas)

Prerequisites for this Chapter

Analyzing Missing Data

Causes of Missing Data

Pandas, CSV Files, and Missing Data

Single Column CSV Files

Two Column CSV Files

Missing Data and Imputation

Counting Missing Data Values

Drop Redundant Columns

Remove Duplicate Rows

Display Duplicate Rows

Uniformity of Data Values

Too Many Missing Data Values

Categorical Data

Data Inconsistency

Mean Value Imputation

Random Value Imputation

Multiple Imputation

Matching and Hot Deck Imputation

Is a Zero Value Valid or Invalid?

Skewed Datasets

CSV Files with Multi-row Records

Column Subset and Row Subrange of the Titanic CSV File

Data Normalization

Assigning Classes to Data

Other Data Cleaning Tasks

DeepChecks and Data Validation

Handling Categorical Data

Processing Inconsistent Categorical Data

Mapping Categorical Data to Numeric Values

Mapping Categorical Data to One Hot Encoded Values

Working with Currency

Working with Dates

Find Missing Dates

Find Unique Dates

Switch Date Formats

Working with Imbalanced Datasets

Data Sampling Techniques

Removing Noisy Data

Cost-sensitive Learning

Detecting Imbalanced Data

Rebalancing Datasets

Specify stratify in Data Splits

What is SMOTE?

Data Wrangling

Data Transformation: What Does This Mean?

A Dataset with 1,000,000 Rows

Dataset Details

Numeric Comparisons

Counting Adjacent Digits

Saving CSV Data to XML, JSON, and HTML Files

Summary

Chapter 8: NoSQL, SQLite, and Python

Non-relational Database Systems

Advantages of Non-relational Databases

What is NoSQL?

What is NewSQL?

RDBMS Versus NoSQL: Which One to Use?

Good Data Types for NoSQL

Some Guidelines for Selecting a Database

NoSQL Databases

What is MongoDB?

Features of MongoDB

Installing MongoDB

Launching MongoDB

Useful Mongo APIs

Metacharacters in Mongo Queries

MongoDB Collections and Documents

Document Format in MongoDB

Create a MongoDB Collection

Working with MongoDB Collections

Find All Android Phones

Find All Android Phones in 2018

Insert a New Item (Document)

Update an Existing Item (Document)

Calculate the Average Price for Each Brand

Calculate the Average Price for Each Brand in 2019

Import Data with mongoimport

What is Fugue?

What is Compass?

What is PyMongo?

MySQL, SQLAlchemy, and Pandas

What is SQLAlchemy?

Read MySQL Data via SQLAlchemy

Export SQL Data from Pandas to Excel

MySQL and Connector/Python

Establishing a Database Connection

Creating a Database Table

Reading Data from a Database Table

What is SQLite?

SQLite Features

SQLite Installation

SQLiteStudio Installation

DB Browser for SQLite Installation

SQLiteDict (Optional)

What is Timescaledb?

Install Timescaledb (Macbook)

Setting Up the TimescaleDB Extension

The rides Table

The Parallel Copy Command

Data Analysis

Large Scale Data Imputation

Summary

Index

PREFACE

WHAT IS THE GOAL?

The goal of this book is to introduce readers to an assortment of powerful command line utilities that can be combined to create simple yet powerful shell scripts for processing datasets. The code samples and scripts use the bash shell, and typically involve small datasets so that you can focus on understanding the features of grep, sed, and awk. Aimed at a reader relatively new to working in a bash environment, the book is comprehensive enough to be a good reference and to teach a few new tricks to those who already have some experience with creating shell scripts.

This short book contains a variety of code fragments and shell scripts for data scientists, data analysts, and other people who want shell-based solutions to “clean” various types of datasets. In addition, the concepts and code samples in this book are useful for people who want to simplify routine tasks.

This book takes introductory concepts and commands in bash, and then demonstrates their use in simple yet powerful shell scripts. This book does not cover “pure” system administration functionality for Unix or Linux.

IS THIS BOOK FOR ME AND WHAT WILL I LEARN?

This book is intended for general users, data scientists, data analysts, and other people who perform a variety of tasks from the command line, and who also have a modest knowledge of shell programming.

You will acquire an understanding of how to use various bash commands, often as part of short shell scripts in later chapters. The chapters also contain simple use cases that illustrate how to perform various tasks involving datasets, such as switching the order of a two-column dataset (Chapter 1), removing control characters in a text file (Chapter 2), finding specific lines and merging them (Chapter 3), reformatting a date field in a dataset (Chapter 5), and removing nested quotes (Chapter 6).

This book saves you the time required to search for relevant code samples and adapt them to your specific needs, which is a potentially time-consuming process.

HOW WERE THE CODE SAMPLES CREATED?

The code samples in this book were created and tested using bash on a MacBook Pro with OS X 10.15.7 (macOS Catalina). Regarding their content: the code samples are derived primarily from scripts prepared by the author, and in some cases there are code samples that incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the “Four Cs”: they must be Clear, Concise, Complete, and Correct to the extent that it’s possible to do so, given the size of this book.

WHAT YOU NEED TO KNOW FOR THIS BOOK

You need some familiarity with working from the command line in a Unix-like environment. There are also subjective prerequisites, such as a desire to learn shell programming, along with the motivation and discipline to read and understand the code samples. In any case, if you’re not sure whether or not you can absorb the material in this book, glance through the code samples to get a feel for the level of complexity.

WHICH BASH COMMANDS ARE EXCLUDED?

The commands that do not meet any of the criteria listed in the previous section are not included in this book. Consequently, there is no coverage of commands for system administration (e.g., shutting down a machine, scheduling backups, and so forth). The purpose of the material in the chapters is to illustrate how to use bash commands for handling common data cleaning tasks with datasets before you process them in tools such as Pandas, after which you can do further reading to deepen your knowledge.

HOW DO I SET UP A COMMAND SHELL?

If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double-click the Terminal application. The second method applies if you already have a command shell available: you can launch a new command shell by typing the following command:

open /Applications/Utilities/Terminal.app

The third method is to open a new command shell from a command shell that is already visible by pressing command+n in that command shell, and your Mac will launch another command shell.

If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which provides a Unix-like environment that includes bash, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process.

If you use RStudio, you launch a command shell inside of RStudio by navigating to Tools > Command Line, and then you can launch bash commands. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).

WHAT ARE THE “NEXT STEPS” AFTER FINISHING THIS BOOK?

The answer to this question varies widely, mainly because the answer depends heavily on your objectives. The best answer is to try a new tool or technique from the book out on a problem or task you care about, professionally or personally. Precisely what that might be depends on who you are, as the needs of a data scientist, manager, student, or developer are all different. In addition, keep what you learned in mind as you tackle new data cleaning or manipulation challenges. Sometimes knowing a technique is possible will make finding a solution easier, even if you have to re-read the section to remember exactly how the syntax works.

If you have reached the limits of what you have learned here and want to get further technical depth on these commands, there is a wide variety of literature published and online resources describing the bash shell, Unix programming, and the grep, sed, and awk commands.

CHAPTER 1

INTRODUCTION

This chapter introduces you to basic commands in Bash, such as navigating around the file system, listing files, and displaying the contents of files. This chapter is dense and contains a very eclectic mix of topics to quickly prepare you for later chapters. If you already have some knowledge of shell programming, you can probably skim quickly through this introductory chapter and proceed to Chapter 2.

The first part of this chapter starts with a brief introduction to some Unix shells, and then discusses files, file permissions, and directories. You will also learn how to create files and directories and how to change their access permissions.

The second part of this chapter introduces simple shell scripts, along with instructions for making them executable. As you will see, shell scripts contain Bash commands (and can optionally contain user-defined functions), so it’s a good idea to learn about Bash commands before you create shell scripts (which include Bash scripts).

The third portion of this chapter discusses two useful Bash commands: the cut command (for cutting or extracting columns and/or fields from a dataset) and the paste command (for “pasting” text or datasets together vertically).

In addition, the final part of this chapter contains a use case involving the cut command and paste command that illustrates how to switch the order of two columns in a dataset. You can also perform this task using the awk command (discussed in Chapter 6).

There are a few points to keep in mind before delving into the details of shell scripts. First, shell scripts can be executed from the command line after adding “execute” permissions to the text file containing the shell commands. Second, you can use the crontab utility to schedule the execution of your shell scripts. The crontab utility allows you to specify the execution of a shell script on an hourly, daily, weekly, or monthly basis. Tasks that are commonly scheduled via crontab include performing backups and removing unwanted files. If you are completely new to Unix, just keep in mind that there is a way to run scripts both from the command line and in a “scheduled” manner. Setting file permissions to run the script from the command line will be discussed later.
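As a minimal sketch of the first two points, the following commands create a one-line script, add “execute” permission, and run it from the command line. The file name hello.sh, the temporary directory, and the schedule in the commented crontab entry are hypothetical choices made for illustration.

```shell
# Create a throwaway directory and a tiny script inside it (hypothetical names).
workdir=$(mktemp -d)
cat > "$workdir/hello.sh" <<'EOF'
#!/bin/bash
echo "hello from a shell script"
EOF

chmod u+x "$workdir/hello.sh"     # add "execute" permission for the file's owner
result=$("$workdir/hello.sh")     # run the script by specifying its path
echo "$result"

# A crontab entry (edited via "crontab -e") that would run a script
# every day at 2:30 AM could look like this (the path is a placeholder):
# 30 2 * * * /path/to/hello.sh

rm -rf "$workdir"                 # clean up
```

Without the chmod step, invoking the script by its path would fail with a “Permission denied” error, although it could still be run as an argument to bash.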

Third, the contents of any shell script can be as simple as a single command, or can comprise hundreds of lines of Bash commands. In general, the more interesting shell scripts involve a combination of several Bash commands. A learning tip: since there are usually several ways to produce the desired result, it’s helpful to read other people’s shell scripts to learn how to combine commands in useful ways.

WHAT IS UNIX?

Unix is an operating system created by Ken Thompson, Dennis Ritchie, and colleagues at Bell Labs around 1970, and today there are several variants available, such as HP-UX for HP machines and AIX for IBM machines. Linus Torvalds began developing the Linux operating system in 1991, and many Linux commands are the same as their Unix counterparts (but differences exist, often in the commands for system administrators). macOS (formerly Mac OS X) is built on Darwin, which derives largely from BSD Unix.

Unix has a rich and storied history, and if you are really interested in learning about its past, you can read online articles and also Wikipedia. This book forgoes those details and focuses on helping you quickly learn how to become productive with various commands.

Available Shell Types

The classic Unix shell is the Bourne shell, which was written in the mid-1970s by Stephen R. Bourne and replaced the earlier Thompson shell. The Bourne shell became the standard shell on Unix systems, and you will sometimes hear “the shell” as a reference to the Bourne shell. It is the basis of the POSIX standard shell, usually installed as /bin/sh on most versions of Unix, and its default prompt is the $ character. Consequently, Bourne shell scripts will execute on almost every version of Unix. In essence, the AT&T branches of Unix support the Bourne shell (sh), bash, Korn shell (ksh), tcsh, and zsh.

However, there is also the BSD branch of Unix that uses the “C” shell (csh), whose default prompt is the % character. In general, shell scripts written for csh will not execute on AT&T branches of Unix, unless the csh shell is also installed on those machines (and vice versa).

The Bourne shell is the most “unadorned” in the sense that it lacks some features that are available in the other shells, such as history and noclobber. The Bourne-type shells are as follows:

•Bourne shell (sh)

•Korn shell (ksh)

•Bourne Again shell (bash)

•POSIX shell (sh)

•Z shell (zsh)

The different C-type shells are as follows:

•C shell (csh)

•TENEX/TOPS C shell (tcsh)

While the commands and the shell scripts in this book are based on the Bash shell, many of the commands also work in other shells (and if not, those other shells have a similar command to accomplish the same goal). Performing an Internet search for “how do I do <Bash command> in <shell name>” will often get you an answer. Sometimes the command is essentially the same, but with slightly different syntax, and typing “man <command>” in a command shell can provide useful information.

WHAT IS BASH?

Bash is an acronym for “Bourne Again Shell,” which has its roots in the Bourne shell created by Stephen R. Bourne. Shell scripts based on the Bourne shell will generally execute in Bash, but the converse is not true. The Bash shell provides additional features that are unavailable in the Bourne shell, such as support for arrays (discussed in Chapter 4).

On Mac OS X, the /bin directory contains the following executable shells:

-rwxr-xr-x 1 root wheel   31440 Sep 21 2020 sh
-rwxr-xr-x 1 root wheel  110848 Sep 21 2020 dash
-r-xr-xr-x 1 root wheel  623472 Sep 21 2020 bash
-rwxr-xr-x 1 root wheel  529424 Sep 21 2020 tcsh
-r-xr-xr-x 1 root wheel 1300256 Sep 21 2020 ksh
-rwxr-xr-x 1 root wheel  529424 Sep 21 2020 csh
-rwxr-xr-x 1 root wheel  637840 Sep 21 2020 zsh

A nice comparison matrix of the support for various features among the preceding shells is available online:

https://stackoverflow.com/questions/5725296/difference-between-sh-and-bash

In some environments, the Bourne shell sh is the Bash shell, which you can check by typing the following command:

sh --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.

If you are new to the command line (be it Mac, Linux, or PCs), please read the preface of this book, which provides some useful guidelines for accessing command shells.

Getting Help for Bash Commands

If you want to see the options for a specific Bash command, invoke the man command to see a description of that Bash command and its options, as shown here:

man cat

The man command produces terse explanations, and if those explanations are not clear enough, you can search for online code samples that provide more details.

Navigating Around Directories

In a command shell, you will often perform basic operations, such as displaying (or changing) the current directory, listing the contents of a directory, and displaying the contents of a file. The following set of commands shows you how to perform these operations, and you can execute a subset of these commands in the sequence that is relevant to you. Options for some of the commands in this section (such as the ls command) are described in greater detail later in this chapter.

A frequently used Bash command is pwd (“print working directory”) that displays the current directory, as shown here:

pwd

The output of the preceding command might look something like this:

/Users/jsmith

Use the cd (“change directory”) command to go to a specific directory. For example, type the command cd /Users/jsmith/Mail to navigate to this directory. If you are currently in the /Users/jsmith directory, type cd Mail.

You can navigate to your home directory with either of these commands:

$ cd $HOME
$ cd

One convenient way to return to the previous directory is the command cd -. Note that on Windows, typing cd with no arguments merely displays the current directory, whereas the Unix cd command with no arguments changes to your home directory.
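The navigation commands in this section can be tried safely in a throwaway directory tree. The following is a small sketch; the directory names project and Mail are hypothetical and exist only inside a temporary directory.

```shell
# Build a small directory tree under a temporary directory.
base=$(mktemp -d)
mkdir -p "$base/project/Mail"

cd "$base/project"            # change directory using an absolute path
cd Mail                       # change directory using a relative path
in_mail=$(basename "$PWD")    # last component of the current directory

cd ..                         # move up one level (back to project)
cd - > /dev/null              # return to the previous directory (Mail);
                              # cd - also prints that directory, hidden here
back=$(basename "$PWD")

echo "first: $in_mail, after cd -: $back"

cd / && rm -rf "$base"        # clean up
```

The shell keeps the previous directory in the OLDPWD variable, which is what cd - relies on.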

The history Command

The history command displays a list (i.e., the history) of commands that you executed in the current command shell:

history

A short sample output of the preceding command is shown here (by default, Bash keeps at most 500 commands in its history, a limit that is controlled by the HISTSIZE variable):

1202 cat longfile.txt > longfile2.txt
1203 vi longfile2.txt
1204 cat longfile2.txt |fold -40
1205 cat longfile2.txt |fold -30
1206 cat longfile2.txt |fold -50
1207 cat longfile2.txt |fold -45
1208 vi longfile2.txt
1209 history
1210 cd /Library/Developer/CommandLineTools/usr/include/c++/
1211 cd /tmp
1212 cd $HOME/Desktop
1213 history

If you want to navigate to the directory that is shown in line 1210, you can do so simply by typing the following command:

!1210

The preceding snippet executes the same command that appears as line 1210 in the previous sample’s output. In contrast, !cd searches backward through the history to find the most recent command that starts with cd: in this case, line 1212 is the first match. If there aren’t any intervening cd commands between the current command and the command in line 1210, then !1210 and !cd will have the same effect.

NOTE

Be careful with the “!” option with Bash commands because the command that matches the “!” might not be the one you intended, especially something of the form !rm, which will remove one or more files. Hence, it’s safer to use the history command and then explicitly specify the correct number (in that history) when you invoke the “!” operator.

LISTING FILENAMES WITH THE LS COMMAND

The ls command is for listing filenames, and there are many switches available that you can use, as shown in this section. For example, the ls *txt command displays the following filenames (the actual display depends on the font size and the width of the command shell) on my Mac:

ReservedWords.txt  data2.txt      input-info.txt  longfile2.txt
abc.txt            good-info.txt  longfile.txt    names.txt

The command ls -lt (the letters “l” and “t”) displays a time-based long listing of filenames:

-rw-r--r-- 1 oswaldcampesato staff 2101 Jun 16 13:07 input-info.txt
-rw-r--r-- 1 oswaldcampesato staff   29 Jun 16 13:07 data2.txt
-rw-r--r-- 1 oswaldcampesato staff  407 Jun 16 13:07 longfile2.txt
-rw-r--r-- 1 oswaldcampesato staff  408 Jun 16 13:07 longfile.txt
-rw-r--r-- 1 oswaldcampesato staff 2267 Jun 16 13:07 ReservedWords.txt
-rw-r--r-- 1 oswaldcampesato staff 1638 Jun 16 13:07 good-info.txt
-rw-r--r-- 1 oswaldcampesato staff   45 Jun 16 13:07 abc.txt
-rw-r--r-- 1 oswaldcampesato staff   35 Jun 16 13:07 names.txt

The command ls -ltr (the letters “l”, “t”, and “r”) displays a reversed time-based long listing of filenames:

-rw-r--r-- 1 oswaldcampesato staff   35 Jun 16 13:07 names.txt
-rw-r--r-- 1 oswaldcampesato staff   45 Jun 16 13:07 abc.txt
-rw-r--r-- 1 oswaldcampesato staff 1638 Jun 16 13:07 good-info.txt
-rw-r--r-- 1 oswaldcampesato staff 2267 Jun 16 13:07 ReservedWords.txt
-rw-r--r-- 1 oswaldcampesato staff  408 Jun 16 13:07 longfile.txt
-rw-r--r-- 1 oswaldcampesato staff  407 Jun 16 13:07 longfile2.txt
-rw-r--r-- 1 oswaldcampesato staff   29 Jun 16 13:07 data2.txt
-rw-r--r-- 1 oswaldcampesato staff 2101 Jun 16 13:07 input-info.txt

Here are the descriptions of all the listed columns in the preceding output:

•Column #1: the file type and the permissions on the file (see below)

•Column #2: the number of hard links to the file or directory

•Column #3: the owner (user) of the file

•Column #4: the group associated with the file

•Column #5: the file size in bytes

•Column #6: the date and time when the file was last modified

•Column #7: the file or directory name

In the preceding ls -l listings, every line begins with the - character because every entry is a regular file. This first character indicates the type of file that’s listed; the possible values are described as follows:

•− Regular file (ASCII text file, binary executable, or hard link)

•b Block special file (such as a physical hard drive)

•c Character special file (such as a terminal or a serial port)

•d Directory file that contains a listing of other files and directories

•l Symbolic link file

•p Named pipe (a mechanism for interprocess communications)

•s Socket (for interprocess communication)

Consult online documentation for more details regarding the ls command.
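To see the file-type character in practice, the following sketch creates a regular file, a directory, and a symbolic link in a temporary directory and extracts the first character of each long listing. The file names are hypothetical, and the use of cut -c1 is simply one convenient way to isolate that character.

```shell
# Create one entry of each type in a throwaway directory (hypothetical names).
demo=$(mktemp -d)
cd "$demo" || exit 1
touch regular.txt              # a regular file
mkdir subdir                   # a directory
ln -s regular.txt link.txt     # a symbolic link to the regular file

ls -l                          # long listing of all three entries

# ls -ld lists the entry itself rather than a directory's contents;
# cut -c1 keeps only the first character of the output line.
type_file=$(ls -ld regular.txt | cut -c1)
type_dir=$(ls -ld subdir | cut -c1)
type_link=$(ls -ld link.txt | cut -c1)
echo "file: $type_file  directory: $type_dir  link: $type_link"

cd / && rm -rf "$demo"         # clean up
```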

DISPLAYING CONTENTS OF FILES

This section introduces you to several commands for displaying different lines of text in a text file. Before doing so, let’s invoke the wc (word count) command to display the number of lines, words, and characters in a text file, as shown here:

wc longfile.txt
37 80 408 longfile.txt

The preceding output shows that the file longfile.txt contains 37 lines, 80 words, and 408 characters (wc counts bytes by default), which means that the file size is actually quite small (despite its name).
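The three counts can also be requested individually. The sketch below uses a small generated file (its contents are made up for illustration) and the standard -l, -w, and -c options; the tr command strips the leading spaces that some versions of wc add to their output.

```shell
# Create a three-line sample file with known contents.
sample=$(mktemp)
printf 'the contents\nof this\nlong file\n' > "$sample"

wc "$sample"                                # lines, words, bytes, and file name

# Reading from standard input (<) suppresses the file name in the output.
lines=$(wc -l < "$sample" | tr -d ' ')      # number of lines
words=$(wc -w < "$sample" | tr -d ' ')      # number of words
bytes=$(wc -c < "$sample" | tr -d ' ')      # number of bytes
echo "$lines lines, $words words, $bytes bytes"

rm -f "$sample"                             # clean up
```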

The cat Command

Invoke the cat command to display the entire contents of longfile.txt:

cat longfile.txt

The preceding command displays the following text:

the contents of this long file are too long to see in a single screen and each line contains one or more words and if you use the cat command the (other lines are omitted)

The preceding command displays the entire contents of a file; however, there are several commands that display only a portion of a file, such as less, more, page, head, and tail (all of which are discussed later).

As another example, suppose that the file temp1 has the following contents:

this is line1 of temp1
this is line2 of temp1
this is line3 of temp1

Let’s also suppose that the file temp2 has these contents:

this is line1 of temp2
this is line2 of temp2

Type the following command that contains the ? meta character (discussed in detail later in this chapter):

cat temp?

The output from the preceding command is shown here:

this is line1 of temp1
this is line2 of temp1
this is line3 of temp1
this is line1 of temp2
this is line2 of temp2
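The following sketch illustrates why cat temp? prints both files: the shell expands the ? meta character to exactly one character before cat runs. The extra file temp10 is a hypothetical addition that shows a name the pattern does not match.

```shell
# Scratch directory with two matching files and one non-matching file.
demo=$(mktemp -d)
printf 'this is line1 of temp1\n' > "$demo/temp1"
printf 'this is line1 of temp2\n' > "$demo/temp2"
printf 'not matched\n' > "$demo/temp10"

# echo shows what the shell passes to the command after expansion.
matched=$(cd "$demo" && echo temp?)
combined=$(cd "$demo" && cat temp?)

echo "$matched"     # temp1 temp2
echo "$combined"

rm -rf "$demo"
```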

The head and tail Commands

The head command displays the first ten lines of a text file (by default), an example of which is here:

head longfile.txt

The preceding command displays the following text:

the contents of this long file are too long to see in a single screen and each line contains one or more words

The head command also provides an option to specify a different number of lines to display, as shown here:

head -4 longfile.txt

The preceding command displays the following text:

the contents of this long file are too long

The tail command displays the last 10 lines (by default) of a text file:

tail longfile.txt

The preceding command displays the following text:

is available in every shell including the bash shell csh zsh ksh and Bourne shell

NOTE

The last two lines in the preceding output are blank lines (not a typographical error on this page).

Similarly, the tail command allows you to specify a different number of lines to display: tail -4 longfile.txt displays the last 4 lines of longfile.txt.
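Both commands also accept the line count through the -n option, which is the form specified by POSIX. The sketch below uses a scratch file of the numbers 1 through 20 (generated with seq purely for illustration) so that the selected lines are easy to recognize.

```shell
# A 20-line sample file: one number per line.
numbers=$(mktemp)
seq 1 20 > "$numbers"

first_three=$(head -n 3 "$numbers")       # lines 1-3
last_two=$(tail -n 2 "$numbers")          # lines 19-20
from_line_18=$(tail -n +18 "$numbers")    # +N starts at line N instead of counting from the end

printf '%s\n' "$first_three" "$last_two" "$from_line_18"

rm -f "$numbers"
```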

Use the more command to display a screenful of data, as shown here:

more longfile.txt

Press the <spacebar> to view the next screenful of data, and press the <return> key to see the next line of text in a file. Incidentally, some people prefer the less command, which generates essentially the same output as the more command. (A geeky joke: “What’s less? It’s more.”)

The Pipe Symbol

A very useful feature of Bash is its support for the pipe symbol (“|”), which enables you to “pipe” or redirect the output of one command to become the input of another command. The pipe symbol is useful when you want to perform a sequence of operations involving various commands.

For example, the following code snippet combines the head command with the cat command and the pipe (“|”) symbol:

cat longfile.txt | head -2

A technical point: the preceding command creates two processes (more about processes later), one for cat and one for head, whereas the command head -2 longfile.txt creates only a single process.

You can use the head and tail commands in more interesting ways. For example, the following command sequence displays lines 11 through 15 of longfile.txt:

head -15 longfile.txt | tail -5

The preceding command displays the following text:

and if you
use the cat
command the
file contents
scroll

To display the line numbers for the preceding output, use the following:

cat -n longfile.txt | head -15 | tail -5

The preceding command displays the following text:

    11  and if you
    12  use the cat
    13  command the
    14  file contents
    15  scroll

You won’t see the “tab” character in the output, but it’s visible if you redirect the previous command sequence to a file and then use the -t option with the cat command:

cat -n longfile.txt | head -15 | tail -5 > 1
cat -t 1
    11^Iand if you
    12^Iuse the cat
    13^Icommand the
    14^Ifile contents
    15^Iscroll
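The head-then-tail combination generalizes to any range of lines: take the first LAST lines, then keep the final LAST - FIRST + 1 of them. The helper print_range below is a hypothetical name introduced for this sketch, and it assumes that the file contains at least LAST lines.

```shell
# print_range FILE FIRST LAST: print lines FIRST through LAST of FILE.
# (Hypothetical helper; assumes FILE has at least LAST lines.)
print_range() {
  head -n "$3" "$1" | tail -n "$(( $3 - $2 + 1 ))"
}

# A 37-line sample file of numbers, so the line number equals the content.
numbers=$(mktemp)
seq 1 37 > "$numbers"

range=$(print_range "$numbers" 11 15)
printf '%s\n' "$range"

rm -f "$numbers"
```

With the sample file, the call prints the numbers 11 through 15, one per line.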

The fold Command

The fold command enables you to “fold” the lines in a text file, which is useful for text files that contain long lines of text that you want to split into shorter lines. For example, here are the contents of longfile2.txt:

the contents of this long file are too long to see in a single screen and each line contains one or more words and if you use the cat command the file contents scroll off the screen so you can use other commands such as the head or tail or more commands in conjunction with the pipe command that is very useful in Unix and is available in every shell including the bash shell csh zsh ksh and Bourne shell

You can “fold” the contents of longfile2.txt into lines whose length is 45 (just as an example) with this command:

cat longfile2.txt |fold -45

The output of the preceding command is here:

the contents of this long file are too long t
o see in a single screen and each line contai
ns one or more words and if you use the cat c
ommand the file contents scroll off the scree
n so you can use other commands such as the h
ead or tail or more commands in conjunction w
ith the pipe command that is very useful in U
nix and is available in every shell including
 the bash shell csh zsh ksh and Bourne shell

Notice that some words in the preceding output are split based on the line width, and not “newspaper style.”
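If you prefer lines that break between words, the fold command provides the -s option, which breaks each line after the last blank that fits within the width. The following sketch compares the two behaviors on a shortened sample sentence (the variable names are made up for this example); the -w option specifies the width explicitly.

```shell
text='the contents of this long file are too long to see in a single screen and each line contains one or more words'

# Hard wrap: break at exactly 45 columns, even in the middle of a word.
hard_wrapped=$(printf '%s\n' "$text" | fold -w 45)

# Word wrap: -s breaks after the last blank that fits in 45 columns.
word_wrapped=$(printf '%s\n' "$text" | fold -s -w 45)

printf '%s\n' "$hard_wrapped"
echo "-----"
printf '%s\n' "$word_wrapped"
```

In the second listing, no word is split across lines, at the cost of lines that are shorter than the maximum width.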

In Chapter 5, you will learn how to display the lines in a text file that match a string or a pattern, as well as how to replace a string with another string in a text file.

FILE OWNERSHIP: OWNER, GROUP, AND WORLD

Use the chmod command to set permissions for files. For example, if you need to set the permission rwx rw- r-- for a file, use the following:

chmod u=rwx,g=rw,o=r filename

In the preceding command, the options u, g, and o represent user permissions, group permissions, and others’ permissions, respectively.

To modify the existing permissions of a file, use + to add a permission for the user, group, or others, and use - to remove a permission. For example, given a file with the permissions rwx rw- r--, add the executable permission for others as follows:

chmod o+x filename

This command adds the x permission for others. Add the executable permission to all permission categories (that is, for user, group, and others) as follows:

chmod a+x filename

In the preceding command, the letter a means “all” (user, group, and others). Conversely, specify a - to remove a permission from all three categories, as shown here:

chmod a-x filename
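The chmod command also accepts numeric (octal) modes, in which r=4, w=2, and x=1 are summed for each category. As a minimal sketch, the following snippet sets rwx rw- r-- on a scratch file in both notations and reads the result back from ls -l; the variable names are chosen only for this example.

```shell
target=$(mktemp)

# Symbolic notation: the clauses are separated by commas, not spaces.
chmod u=rwx,g=rw,o=r "$target"
symbolic=$(ls -l "$target" | cut -c1-10)

# Octal notation: 7 = rwx, 6 = rw-, 4 = r--. Reset first so the change is visible.
chmod 600 "$target"
chmod 764 "$target"
octal=$(ls -l "$target" | cut -c1-10)

echo "symbolic: $symbolic"   # -rwxrw-r--
echo "octal:    $octal"      # -rwxrw-r--

rm -f "$target"
```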

HIDDEN FILES

An “invisible” (hidden) file is one whose name begins with the dot or period character (.). Programs (including the shell) use many of these files to store configuration information. Some common examples of hidden files include the following:

•.profile: the Bourne shell (sh) initialization script

•.bash_profile: the Bash shell (bash) initialization script

•.kshrc: the Korn shell (ksh) initialization script

•.cshrc: the C shell (csh) initialization script

•.rhosts: the remote shell configuration file

To list invisible files, specify the -a option to ls:

ls -a
.        .profile   docs    lib      test_results
..       .rhosts    hosts   pub      users
.emacs   bin        hw1     res.01   work
.exrc    ch07       hw2     res.02
.kshrc   ch07.bak   hw3     res.03

Single dot .: This represents the current directory.

Double dot ..: This represents the parent directory.
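If you want to see hidden files without the . and .. entries, the ls command also provides the -A option. The following sketch compares -a and -A in a scratch directory; the filenames .hidden and visible are hypothetical, and LC_ALL=C is set only to make the sort order predictable.

```shell
demo=$(mktemp -d)
touch "$demo/.hidden" "$demo/visible"

# -a lists every entry, including . and ..
all=$(LC_ALL=C ls -a "$demo")

# -A lists hidden files but omits . and ..
almost_all=$(LC_ALL=C ls -A "$demo")

printf '%s\n' "$all"
echo "-----"
printf '%s\n' "$almost_all"

rm -rf "$demo"
```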

HANDLING PROBLEMATIC FILENAMES

Problematic filenames contain one or more whitespaces, hidden (non-printing) characters, or start with a dash (-) character.

You can use double quotes to list filenames that contain whitespaces, or you can precede each whitespace by a backslash (\) character.

For example, if you have a file named One Space.txt, you can use the ls command as follows:

ls -l "One Space.txt"
ls -l One\ Space.txt

Filenames that start with a dash (-) character are difficult to handle because the dash is the prefix that introduces command options. Consequently, if you have a file whose name is -abz, then the command ls -abz will not work correctly: ls interprets -a, -b, and -z as switches rather than as a filename, and since there is no “z” option, you will see the following type of output:

ls: illegal option -- z
usage: ls [-@ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1%] [file ...]

In most cases, the best solution to this type of file is to rename the file. You can do so with your operating system’s file manager if you are not working in a Unix shell, or you can use the following syntax for the mv (“move”) command. The two dashes (--) mark the end of the command’s options, so mv treats the name that follows as a filename rather than as a set of switches. An example is here:

mv -- -abc.txt renamed-abc.txt
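The following sketch collects the techniques above in one place. It works in a scratch directory, and the filenames -abz and renamed-abz are hypothetical; besides the -- marker, it shows that prefixing the name with ./ also keeps the dash from being parsed as an option.

```shell
demo=$(mktemp -d)
result=$(
  cd "$demo" || exit 1
  touch "One Space.txt"     # quotes keep the space inside a single argument
  touch -- -abz             # -- ends option parsing, so -abz is a filename
  ls -- -abz                # prints -abz
  ls ./-abz                 # a leading ./ also hides the dash from option parsing
  mv -- -abz renamed-abz
  LC_ALL=C ls
)
printf '%s\n' "$result"

rm -rf "$demo"
```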

WORKING WITH ENVIRONMENT VARIABLES

There are many built-in environment variables available, and the following subsections discuss the env command and then some of the more common variables.

The env Command

The env (“environment”) command displays the variables that are in your Bash environment. An example of the output of the env command is here:

SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/73/39lngcln4dj_scmgvsv53g_w0000gn/T/
OLDPWD=/tmp
TERM_SESSION_ID=63101060-9DF0-405E-84E1-EC56282F4803
USER=ocampesato
COMMAND_MODE=bash2003
PATH=/opt/local/bin:/Users/ocampesato/android-sdk-mac_86/platform-tools:/Users/ocampesato/android-sdk-mac_86/tools:/usr/local/bin:
PWD=/Users/ocampesato
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
LANG=en_US.UTF-8
NODE_PATH=/usr/local/lib/node_modules
HOME=/Users/ocampesato
LOGNAME=ocampesato
DISPLAY=/tmp/launch-xnTgkE/org.macosforge.xquartz:0
SECURITYSESSIONID=186a4
_=/usr/bin/env

Some interesting examples of setting an environment variable and also executing a command are described online:

https://stackoverflow.com/questions/13998075/setting-environment-variable-for-one-program-call-in-bash-using-env
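As a minimal sketch of that idea, the following snippet sets a variable for a single command in two ways: through the env command and through an assignment placed in front of the command. The variable GREETING is a hypothetical name used only for illustration.

```shell
unset GREETING

# Form 1: env sets the variable in the environment of the command it launches.
with_env=$(env GREETING=hello sh -c 'echo "child sees: $GREETING"')

# Form 2: an assignment before a command applies to that command only.
with_prefix=$(GREETING=hi sh -c 'echo "child sees: $GREETING"')

# Neither form changes the current shell.
afterwards=${GREETING:-unset}

echo "$with_env"                # child sees: hello
echo "$with_prefix"             # child sees: hi
echo "afterwards: $afterwards"  # afterwards: unset
```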

Useful Environment Variables

This section discusses some important environment variables, most of which you probably will not need to modify, but it’s useful to be aware of the existence of these variables and their purpose.

•The HOME variable contains the absolute path of the user’s home directory.

•The HOSTNAME variable specifies the Internet name of the host.

•The LOGNAME variable specifies the user’s login name.

•The PATH variable specifies the search path (see the next subsection).

•The SHELL variable specifies the absolute path of the current shell.

•The USER variable specifies the user’s current username. This value might differ from the login name if a superuser executes the su command to emulate another user’s permissions.

Setting the PATH Environment Variable

Programs and other executable files can live in many directories, so operating systems provide a search path that lists the directories that the operating system searches for executable files. Add a directory to your path so that you can invoke an executable file by specifying just the filename: you don’t need to specify the full path to the executable file.

The search path is stored in an environment variable, which is a named string maintained by the operating system. These variables contain information available to the command shell and other programs.

The path variable is named PATH in Bash or Path in Windows (Bash is case-sensitive; Windows is not).

Setting the path in Bash/Linux:

export PATH=$HOME/anaconda:$PATH

To add the Python directory to the path for a particular session in Bash, use the following:

export PATH="$PATH:/usr/local/bin/python"

In Bourne shell or ksh shell, enter this command:

PATH="$PATH:/usr/local/bin/python"

NOTE

/usr/local/bin/python is the path of the Python directory.
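If a startup script appends the same directory every time it runs, PATH accumulates duplicate entries. The following sketch shows one way to guard against that; the function name add_to_path is hypothetical, and the directory is the one used in the preceding examples.

```shell
# add_to_path DIR: append DIR to PATH unless it is already listed.
# (Hypothetical helper written for this example.)
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present: do nothing
    *) PATH="$PATH:$1" ;;
  esac
}

add_to_path /usr/local/bin/python
add_to_path /usr/local/bin/python   # the second call has no effect

# Display the search path with one directory per line.
path_entries=$(printf '%s\n' "$PATH" | tr ':' '\n')
printf '%s\n' "$path_entries"
```

Wrapping PATH in colons before the comparison lets a single pattern match the directory at the beginning, in the middle, or at the end of the list.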

Specifying Aliases and Environment Variables

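The following command defines a variable called h1 (strictly speaking, this is a shell variable; prefix the assignment with export if child processes also need to see it):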

h1=$HOME/test

Now if you enter the following command,

echo $h1

you will see the following output on OS X:

/Users/jsmith/test
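The difference between a plain shell variable and an exported environment variable becomes visible when you start a child process. The following sketch reuses the h1 variable and launches sh as the child; the fallback text "not set" is chosen only for this illustration.

```shell
unset h1
h1=$HOME/test

# A child process does not inherit a plain shell variable.
before=$(sh -c 'echo "${h1:-not set}"')

# After export, the variable is part of the environment of every child.
export h1
after=$(sh -c 'echo "${h1:-not set}"')

echo "before export: $before"   # before export: not set
echo "after export:  $after"
```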

The next code snippet shows you how to set the alias ll so that it displays a long listing of a directory:

alias ll="ls -l"

The following three alias definitions involve the ls command and various switches:

alias ll="ls -l"
alias lt="ls -lt"
alias ltr="ls -ltr"

As an example, you can replace the command ls -ltr (the letters “l”, “t”, and “r”) that you saw earlier in the chapter with the ltr alias and you will see the same reversed time-based long listing of filenames (reproduced here):

total 56
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rwx------ 1 ocampesato staff 478 Jan 06 19:21 iphonemeetup.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt

You can also define an alias that contains the Bash pipe (|) symbol:

alias ltrm="ls -ltr | more"

In a similar manner, you can define aliases for directory related commands:

alias ltd="ls -lt | grep '^d'"
alias ltdm="ls -lt | grep '^d' | more"
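Keep in mind that Bash expands aliases only in interactive shells by default, so an alias defined at the prompt is not available inside a shell script. A shell function is one alternative that works in both settings and also accepts arguments. The following sketch rewrites ltd as a function; the directory reports and the file notes.txt are hypothetical names created for the demonstration.

```shell
# ltd [DIR...]: long listing, newest first, showing only directories.
ltd() {
  ls -lt "$@" | grep '^d'
}

demo=$(mktemp -d)
mkdir "$demo/reports"
touch "$demo/notes.txt"

# Unlike the alias, the function can be pointed at any directory.
dirs_only=$(ltd "$demo")
printf '%s\n' "$dirs_only"

rm -rf "$demo"
```

The output contains the line for reports but not the line for notes.txt, because only directory entries begin with the letter d.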

FINDING EXECUTABLE FILES

Several commands help you locate executable files (binary files or shell scripts) and learn about them: which, whence, whereis, and whatis.

The which command searches the directories in the PATH environment variable and prints the full path to whatever executable you specify (whence is the counterpart of which in ksh and zsh). If the executable is not in any directory that is specified in the PATH environment variable, which prints nothing (some versions print a message instead). This is useful for finding out whether a particular command or utility is installed on the system.

which rm

The output of the preceding command is here:

/bin/rm

The whereis command provides information similar to what you get from the which command:

$ whereis rm
/bin/rm

The whatis command looks up the specified command in the whatis database, which is useful for identifying system commands and important configuration files. For example, whatis rm displays output of the following form:

git-rm(1) - Remove files from the working tree and from the index
grm(1), rm(1) - remove files or directories
rm(1), unlink(1) - remove directory entries

Consider whatis a simplified version of the man command: whatis displays a one-line summary of a command, whereas man displays its full manual page (e.g., type man ls and you will see several pages of explanation regarding the ls command).
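Inside shell scripts, a portable way to test whether a command is available is the built-in command -v, which is specified by POSIX and returns a nonzero exit status when the command is not found. The following sketch checks one real command and one deliberately bogus name (no_such_tool_xyz is made up for this example).

```shell
report=$(
  for tool in rm no_such_tool_xyz; do
    # command -v prints the location and succeeds only if the command exists.
    if location=$(command -v "$tool"); then
      echo "$tool found at $location"
    else
      echo "$tool not found"
    fi
  done
)
printf '%s\n' "$report"
```

Because the result is an exit status rather than printed text, this form is convenient in if statements that decide whether a script can proceed.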

THE printf COMMAND AND THE echo COMMAND

In brief, use the printf command instead of the echo command if you need to control the output format. One key difference is that the echo command appends a newline character automatically, whereas the printf command prints a newline only when the format string contains \n. Keep this point in mind when you see the printf statement in the awk code samples in Chapter 6.

As a simple example, place the following code snippet in a shell script:

printf "%-5s %-10s %-4s\n" ABC DEF GHI
printf "%-5s %-10s %-4.2f\n" ABC DEF 12.3456

Make the shell script executable and then launch the shell script, after which you will see the following output:

ABC   DEF        GHI
ABC   DEF        12.35

However, if you type the following pair of commands:

echo "ABC DEF GHI"
echo "ABC DEF 12.3456"

you will see the following output:

ABC DEF GHI
ABC DEF 12.3456

A detailed (and very lengthy) discussion regarding the printf statement and the echo command is here:

https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo
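Two behaviors of printf are worth seeing in isolation. The following sketch shows that output from consecutive printf calls runs together unless the format string contains \n, and that printf reuses its format string when it receives more arguments than the format consumes. The sample strings are arbitrary.

```shell
# No newline is added between the two calls, so the pieces are joined.
no_newline=$(printf 'ABC'; printf 'DEF\n')

# Four arguments with a two-field format: the format is applied twice.
table=$(printf '%-5s|%5s|\n' ab cd ef gh)

echo "$no_newline"   # ABCDEF
echo "$table"
```

The second command prints two rows, each with a left-justified field and a right-justified field that are both five characters wide.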

THE cut COMMAND

The cut command enables you to extract fields separated by a specified delimiter, as well as a range of columns (character positions), from an input stream. The delimiter plays the same role as IFS, except that you pass it as a command option (-d) instead of setting an external variable. Some examples are here:

x="abc def ghi"
echo $x | cut -d" " -f2
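As a short sketch of how cut selects its output, the following snippet applies several field and character lists to the same string. One detail deserves attention: cut always emits the selected fields in their input order, no matter how the -f list is written, so the awk command at the end is shown as one possible way to reorder fields.

```shell
x="abc def ghi"

second=$(echo "$x" | cut -d" " -f2)          # def
from_two=$(echo "$x" | cut -d" " -f2-)       # def ghi (field 2 through the last field)
listed=$(echo "$x" | cut -d" " -f3,2)        # def ghi (input order, not list order)
chars=$(echo "$x" | cut -c1-3)               # abc (character positions 1 through 3)

# awk prints fields in whatever order you name them.
swapped=$(echo "$x" | awk '{ print $3, $1 }')   # ghi abc

printf '%s\n' "$second" "$from_two" "$listed" "$chars" "$swapped"
```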

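The following code snippet requests fields 3 and 2. Note that cut always emits the selected fields in their original input order, regardless of the order in the -f list, so this command displays field 2 followed by field 3: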

x="abc def ghi"
echo $x | cut -d" " -f3,2

The output (using space " " as IFS, and -f2 to indicate the second column) of the preceding code snippet is here: