This book introduces powerful command line utilities for creating efficient shell scripts to process datasets. Using the bash shell, the examples and scripts focus on small datasets to help readers understand the features of grep, sed, and awk. Companion files with code are available for download from the publisher.
The book starts with an introduction to the basics, covering files and directories, and useful commands. It then progresses to conditional logic and loops, providing a solid foundation for processing datasets. Detailed chapters on using grep, sed, and awk illustrate their capabilities in handling and cleaning various types of datasets effectively.
Advanced topics include processing datasets with Pandas, exploring NoSQL, SQLite, and Python. The book equips data scientists, analysts, and anyone seeking shell-based solutions with practical skills. By the end, users will be adept at creating robust scripts for dataset processing, combining command line utilities for optimal results.
Page count: 356
Publication year: 2024
BASH FOR DATA SCIENTISTS
Oswald Campesato
MERCURY LEARNING AND INFORMATION
Dulles, Virginia; Boston, Massachusetts; New Delhi
Copyright ©2023 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA 20166
www.merclearning.com
1-800-232-0223
O. Campesato. Bash for Data Scientists.
ISBN: 978-1-68392-973-4
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2022948076
22 23 24 3 2 1 Printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).
All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files for this title are available by writing to the publisher at [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
I’d like to dedicate this book to my parents; may this bring joy and happiness into their lives.
CONTENTS
Preface
Chapter 1: Introduction
What is Unix?
Available Shell Types
What is Bash?
Getting Help for Bash Commands
Navigating Around Directories
The history Command
Listing Filenames with the ls Command
Displaying Contents of Files
The cat Command
The head and tail Commands
The Pipe Symbol
The fold Command
File Ownership: Owner, Group, and World
Hidden Files
Handling Problematic Filenames
Working with Environment Variables
The env Command
Useful Environment Variables
Setting the PATH Environment Variable
Specifying Aliases and Environment Variables
Finding Executable Files
The printf Command and the echo Command
The cut Command
The echo Command and Whitespaces
Command Substitution (“Back Tick”)
The Pipe Symbol and Multiple Commands
Using a Semicolon to Separate Commands
The paste Command
Inserting Blank Lines with the paste Command
A Simple Use Case with the paste Command
A Simple Use Case with cut and paste Commands
Working with Meta Characters
Working with Character Classes
What about zsh?
Switching between bash and zsh
Configuring zsh
Summary
Chapter 2: Files and Directories
Create, Copy, Remove, and Move Files
Creating Files
Copying Files
Copy Files with Command Substitution
Deleting Files
Moving Files
The basename, dirname, and file Commands
The wc Command
The more Command and the less Command
The head Command
The tail Command
File Comparison Commands
The Parts of a Filename
Working with File Permissions
The chmod Command
The chown Command
The chgrp Command
The umask and ulimit Commands
Working with Directories
Absolute and Relative Directories
Absolute and Relative Path Names
Creating Directories
Removing Directories
Changing Directories
Renaming Directories
Using Quote Characters
Streams and Redirection Commands
Metacharacters and Character Classes
Digits and Characters
Working with “^” and “\” and “!”
Filenames and Metacharacters
Summary
Chapter 3: Useful Commands
The join Command
The fold Command
The split Command
The sort Command
The uniq Command
How to Compare Files
The od Command
The tr Command
A Simple Use Case
The find Command
The tee Command
File Compression Commands
The tar Command
The cpio Command
The gzip and gunzip Commands
The bunzip2 Command
The zip Command
Commands for zip Files and bz Files
Internal Field Separator (IFS)
Data from a Range of Columns in a Dataset
Working with Uneven Rows in Datasets
The alias Command
Summary
Chapter 4: Conditional Logic and Loops
Arithmetic Operations and Operators
Working with Arrays
Arrays and Text Files
Working with Variables
Assigning Values to Variables
Working with Operators for Strings and Numbers
The read Command for User Input
The test Command for Variables, Files, and Directories
Relational Operators
Boolean Operators
String Operators
File Test Operators
Conditional Logic with if/else Statements
The case/esac Statement
Arithmetic Operators and Comparisons
Working with Strings in Shell Scripts
Working with Strings
Working with Loops
Using a for loop
Working with Nested Loops
Using a while Loop
The while, case, and if/elif/fi Statements
Using an until Loop
User-defined Functions
Creating a Simple Menu from Shell Commands
Summary
Chapter 5: Processing Datasets with grep and sed
What is the grep Command?
Metacharacters and the grep Command
Escaping Metacharacters with the grep Command
Useful Options for the grep Command
Character Classes and the grep Command
Working with the -c Option in grep
Matching a Range of Lines
Using Back References in the grep Command
Finding Empty Lines in Datasets
Using Keys to Search Datasets
The Backslash Character and the grep Command
Multiple Matches in the grep Command
The grep Command and the xargs Command
Searching zip Files for a String
Checking for a Unique Key Value
Redirecting Error Messages
The egrep Command and fgrep Command
Displaying “Pure” Words in a Dataset with egrep
The fgrep Command
Delete Rows with Missing Values
A Simple Use Case
What is the sed Command?
The sed Execution Cycle
Matching String Patterns Using sed
Substituting String Patterns Using sed
Replacing Vowels from a String or a File
Deleting Multiple Digits and Letters from a String
Search and Replace with sed
Datasets with Multiple Delimiters
Useful Switches in sed
Working with Datasets
Printing Lines
Character Classes and sed
Removing Control Characters
Counting Words in a Dataset
Back References in sed
One-line sed Commands
Populate Missing Values with the sed Command
A Dataset With 1,000,000 Rows
Numeric Comparisons
Counting Adjacent Digits
Average Support Rate
Summary
Chapter 6: Processing Datasets with awk
The awk Command
Built-in Variables that Control awk
How Does the awk Command Work?
Aligning Text with the printf Command
Conditional Logic and Control Statements
The while Statement
A for loop in awk
A for loop with a break Statement
The next and continue Statements
Deleting Alternate Lines in Datasets
Merging Lines in Datasets
Printing File Contents as a Single Line
Joining Groups of Lines in a Text File
Joining Alternate Lines in a Text File
Matching with Metacharacters and Character Sets
Printing Lines Using Conditional Logic
Splitting Filenames with awk
Working with Postfix Arithmetic Operators
Numeric Functions in awk
One-line awk Commands
Useful Short awk Scripts
Printing the Words in a Text String in awk
Count Occurrences of a String in Specific Rows
Printing a String in a Fixed Number of Columns
Printing a Dataset in a Fixed Number of Columns
Aligning Columns in Datasets
Aligning Columns and Multiple Rows in Datasets
Displaying a Subset of Columns in a Text File
Subsets of Column-aligned Rows in Datasets
Counting Word Frequency in Datasets
Displaying Only “Pure” Words in a Dataset
Delete Rows with Missing Values
Working with Multi-line Records in awk
A Simple Use Case
Another Use Case
A Dataset with 1,000,000 Rows
Numeric Comparisons
Counting Adjacent Digits
Average Support Rate
Summary
Chapter 7: Processing Datasets (Pandas)
Prerequisites for this Chapter
Analyzing Missing Data
Causes of Missing Data
Pandas, CSV Files, and Missing Data
Single Column CSV Files
Two Column CSV Files
Missing Data and Imputation
Counting Missing Data Values
Drop Redundant Columns
Remove Duplicate Rows
Display Duplicate Rows
Uniformity of Data Values
Too Many Missing Data Values
Categorical Data
Data Inconsistency
Mean Value Imputation
Random Value Imputation
Multiple Imputation
Matching and Hot Deck Imputation
Is a Zero Value Valid or Invalid?
Skewed Datasets
CSV Files with Multi-row Records
Column Subset and Row Subrange of the Titanic CSV File
Data Normalization
Assigning Classes to Data
Other Data Cleaning Tasks
DeepChecks and Data Validation
Handling Categorical Data
Processing Inconsistent Categorical Data
Mapping Categorical Data to Numeric Values
Mapping Categorical Data to One Hot Encoded Values
Working with Currency
Working with Dates
Find Missing Dates
Find Unique Dates
Switch Date Formats
Working with Imbalanced Datasets
Data Sampling Techniques
Removing Noisy Data
Cost-sensitive Learning
Detecting Imbalanced Data
Rebalancing Datasets
Specify stratify in Data Splits
What is SMOTE?
Data Wrangling
Data Transformation: What Does This Mean?
A Dataset with 1,000,000 Rows
Dataset Details
Numeric Comparisons
Counting Adjacent Digits
Saving CSV Data to XML, JSON, and HTML Files
Summary
Chapter 8: NoSQL, SQLite, and Python
Non-relational Database Systems
Advantages of Non-relational Databases
What is NoSQL?
What is NewSQL?
RDBMS Versus NoSQL: Which One to Use?
Good Data Types for NoSQL
Some Guidelines for Selecting a Database
NoSQL Databases
What is MongoDB?
Features of MongoDB
Installing MongoDB
Launching MongoDB
Useful Mongo APIs
Metacharacters in Mongo Queries
MongoDB Collections and Documents
Document Format in MongoDB
Create a MongoDB Collection
Working with MongoDB Collections
Find All Android Phones
Find All Android Phones in 2018
Insert a New Item (Document)
Update an Existing Item (Document)
Calculate the Average Price for Each Brand
Calculate the Average Price for Each Brand in 2019
Import Data with mongoimport
What is Fugue?
What is Compass?
What is PyMongo?
MySQL, SQLAlchemy, and Pandas
What is SQLAlchemy?
Read MySQL Data via SQLAlchemy
Export SQL Data from Pandas to Excel
MySQL and Connector/Python
Establishing a Database Connection
Creating a Database Table
Reading Data from a Database Table
What is SQLite?
SQLite Features
SQLite Installation
SQLiteStudio Installation
DB Browser for SQLite Installation
SQLiteDict (Optional)
What is Timescaledb?
Install Timescaledb (Macbook)
Setting Up the TimescaleDB Extension
The rides Table
The Parallel Copy Command
Data Analysis
Large Scale Data Imputation
Summary
Index
PREFACE
WHAT IS THE GOAL?
The goal of this book is to introduce readers to an assortment of powerful command line utilities that can be combined to create simple yet powerful shell scripts for processing datasets. The code samples and scripts use the bash shell, and typically involve small datasets so that you can focus on understanding the features of grep, sed, and awk. Aimed at a reader relatively new to working in a bash environment, the book is comprehensive enough to be a good reference and teach a few new tricks to those who already have some experience with creating shell scripts.
This short book contains a variety of code fragments and shell scripts for data scientists, data analysts, and other people who want shell-based solutions to “clean” various types of datasets. In addition, the concepts and code samples in this book are useful for people who want to simplify routine tasks.
This book takes introductory concepts and commands in bash, and then demonstrates their use in simple yet powerful shell scripts. This book does not cover “pure” system administration functionality for Unix or Linux.
IS THIS BOOK FOR ME AND WHAT WILL I LEARN?
This book is intended for general users, data scientists, data analysts, and other people who perform a variety of tasks from the command line, and who also have a modest knowledge of shell programming.
You will acquire an understanding of how to use various bash commands, often as part of short shell scripts in later chapters. The chapters also contain simple use cases that illustrate how to perform various tasks involving datasets, such as switching the order of a two-column dataset (Chapter 1), removing control characters in a text file (Chapter 2), finding specific lines and merging them (Chapter 3), reformatting a date field in a dataset (Chapter 5), and removing nested quotes (Chapter 6).
This book saves you the time required to search for relevant code samples and adapt them to your specific needs, which is a potentially time-consuming process.
HOW WERE THE CODE SAMPLES CREATED?
The code samples in this book were created and tested using bash on a MacBook Pro with OS X 10.15.7 (macOS Catalina). Regarding their content: the code samples are derived primarily from scripts prepared by the author, and in some cases there are code samples that incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the “Four Cs”: they must be Clear, Concise, Complete, and Correct to the extent that it’s possible to do so, given the size of this book.
WHAT YOU NEED TO KNOW FOR THIS BOOK
You need some familiarity with working from the command line in a Unix-like environment. There are also subjective prerequisites, such as a desire to learn shell programming, along with the motivation and discipline to read and understand the code samples. In any case, if you’re not sure whether or not you can absorb the material in this book, glance through the code samples to get a feel for the level of complexity.
WHICH BASH COMMANDS ARE EXCLUDED?
The commands that do not meet any of the criteria listed in the previous section are not included in this book. Consequently, there is no coverage of commands for system administration (e.g., shutting down a machine, scheduling backups, and so forth). The purpose of the material in the chapters is to illustrate how to use bash commands for handling common data cleaning tasks with datasets before you process them in tools such as Pandas, after which you can do further reading to deepen your knowledge.
HOW DO I SET UP A COMMAND SHELL?
If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double-click the Terminal application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:
open /Applications/Utilities/Terminal.app
A third method for Mac users is to open a new command shell on a MacBook from a command shell that is already visible simply by pressing command+n in that command shell, and your Mac will launch another command shell.
If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process.
If you use RStudio, you launch a command shell inside of RStudio by navigating to Tools > Command Line, and then you can launch bash commands. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).
WHAT ARE THE “NEXT STEPS” AFTER FINISHING THIS BOOK?
The answer to this question varies widely, mainly because the answer depends heavily on your objectives. The best answer is to try a new tool or technique from the book out on a problem or task you care about, professionally or personally. Precisely what that might be depends on who you are, as the needs of a data scientist, manager, student, or developer are all different. In addition, keep what you learned in mind as you tackle new data cleaning or manipulation challenges. Sometimes knowing a technique is possible will make finding a solution easier, even if you have to re-read the section to remember exactly how the syntax works.
If you have reached the limits of what you have learned here and want to get further technical depth on these commands, there is a wide variety of literature published and online resources describing the bash shell, Unix programming, and the grep, sed, and awk commands.
CHAPTER 1: INTRODUCTION
This chapter introduces you to basic commands in Bash, such as navigating around the file system, listing files, and displaying the contents of files. This chapter is dense and contains a very eclectic mix of topics to quickly prepare you for later chapters. If you already have some knowledge of shell programming, you can probably skim quickly through this introductory chapter and proceed to Chapter 2.
The first part of this chapter starts with a brief introduction to some Unix shells, and then discusses files, file permissions, and directories. You will also learn how to create files and directories and how to change their access permissions.
The second part of this chapter introduces simple shell scripts, along with instructions for making them executable. As you will see, shell scripts contain Bash commands (and can optionally contain user-defined functions), so it’s a good idea to learn about Bash commands before you create shell scripts (which include Bash scripts).
The third portion of this chapter discusses two useful Bash commands: the cut command (for cutting or extracting columns and/or fields from a dataset) and the paste command (for “pasting” text or datasets together vertically).
In addition, the final part of this chapter contains a use case involving the cut command and paste command that illustrates how to switch the order of two columns in a dataset. You can also perform this task using the awk command (discussed in Chapter 6).
There are a few points to keep in mind before delving into the details of shell scripts. First, shell scripts can be executed from the command line after adding “execute” permissions to the text file containing the shell commands. Second, you can use the crontab utility to schedule the execution of your shell scripts. The crontab utility allows you to specify the execution of a shell script on an hourly, daily, weekly, or monthly basis. Tasks that are commonly scheduled via crontab include performing backups and removing unwanted files. If you are completely new to Unix, just keep in mind that there is a way to run scripts both from the command line and in a “scheduled” manner. Setting file permissions to run the script from the command line will be discussed later.
Third, the contents of any shell script can be as simple as a single command, or can comprise hundreds of lines of Bash commands. In general, the more interesting shell scripts involve a combination of several Bash commands. A learning tip: since there are usually several ways to produce the desired result, it’s helpful to read other people’s shell scripts to learn how to combine commands in useful ways.
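As a minimal sketch of the first two points above, the following commands create a one-line script, add "execute" permission with chmod, and run the script from the command line. The file name hello.sh and the temporary directory are hypothetical, and the crontab entry appears only as a comment because installing it would change the schedule on your machine.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)

# Create a tiny shell script (hello.sh is a hypothetical name).
cat > "$workdir/hello.sh" <<'EOF'
#!/usr/bin/env bash
echo "hello from a shell script"
EOF

# Add "execute" permission for the owner, then run the script by its path.
chmod u+x "$workdir/hello.sh"
script_output=$("$workdir/hello.sh")
echo "$script_output"

# A crontab entry (added with "crontab -e") that would run a script
# every day at 02:30; the path is a placeholder for illustration:
#   30 2 * * * /path/to/hello.sh
```

The five crontab fields are minute, hour, day of month, month, and day of week, followed by the command to run.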
Unix is an operating system created by Ken Thompson and Dennis Ritchie at AT&T Bell Labs around 1970, and today there are several variants available, such as HP-UX for HP machines and AIX for IBM machines. Linus Torvalds developed the Linux operating system during the 1990s, and many Linux commands are the same as their Unix counterparts (but differences exist, often in the commands for system administrators). The Mac OS X operating system is also Unix-based; its core derives largely from BSD Unix.
Unix has a rich and storied history, and if you are really interested in learning about its past, you can read online articles, including those on Wikipedia. This book forgoes those details and focuses on helping you quickly learn how to become productive with various commands.
The Bourne shell, written by Stephen R. Bourne at Bell Labs in the late 1970s, is the ancestor of the shells covered in this book. It replaced the earlier Thompson shell and became the standard shell on Unix systems, and you will sometimes hear “the shell” as a reference to the Bourne shell. A Bourne-compatible shell that conforms to the POSIX standard is usually installed as /bin/sh on most versions of Unix, and its default prompt is the $ character. Consequently, Bourne shell scripts will execute on almost every version of Unix. In essence, the AT&T branches of Unix support the Bourne shell (sh), bash, Korn shell (ksh), tsh, and zsh.
However, there is also the BSD branch of Unix that uses the “C” shell (csh), whose default prompt is the % character. In general, shell scripts written for csh will not execute on AT&T branches of Unix, unless the csh shell is also installed on those machines (and vice versa).
The Bourne shell is the most “unadorned” in the sense that it lacks some features that are available in the other shells, such as history and noclobber. The Bourne-type shells are as follows:
• Bourne shell (sh)
• Korn shell (ksh)
• Bourne Again shell (bash)
• POSIX shell (sh)
• zsh (“Zee” shell)
The different C-type shells are as follows:
• C shell (csh)
• TENEX/TOPS C shell (tcsh)
While the commands and the shell scripts in this book are based on the Bash shell, many of the commands also work in other shells (and if not, those other shells have a similar command to accomplish the same goal). Performing an Internet search for “how do I do <Bash command> in <shell name>” will often get you an answer. Sometimes the command is essentially the same, but with slightly different syntax, and typing “man <command>” in a command shell can provide useful information.
Bash is an acronym for “Bourne Again Shell,” which has its roots in the Bourne shell created by Stephen R. Bourne. Shell scripts based on the Bourne shell will execute in Bash, but the converse is not true. The Bash shell provides additional features that are unavailable in the Bourne shell, such as support for arrays (discussed in Chapter 4).
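To make the difference concrete, here is a small sketch of an indexed array, one of the Bash features that a strict Bourne shell does not provide. The variable names are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Indexed arrays are a Bash feature; a strict Bourne shell (sh) lacks them.
shells=(sh ksh bash zsh)

# ${#shells[@]} is the number of elements; indexing starts at 0.
shell_count=${#shells[@]}
third_shell=${shells[2]}

echo "number of shells: $shell_count"
echo "third shell: $third_shell"

# Loop over all elements; the quotes keep each element as one word.
for s in "${shells[@]}"; do
  echo "shell: $s"
done
```

A script that uses this syntax should start with a bash "shebang" line rather than #!/bin/sh, because /bin/sh is not guaranteed to be Bash.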
On Mac OS X, the /bin directory contains the following executable shells:
-rwxr-xr-x 1 root wheel 31440 Sep 21 2020 sh
-rwxr-xr-x 1 root wheel 110848 Sep 21 2020 dash
-r-xr-xr-x 1 root wheel 623472 Sep 21 2020 bash
-rwxr-xr-x 1 root wheel 529424 Sep 21 2020 tcsh
-r-xr-xr-x 1 root wheel 1300256 Sep 21 2020 ksh
-rwxr-xr-x 1 root wheel 529424 Sep 21 2020 csh
-rwxr-xr-x 1 root wheel 637840 Sep 21 2020 zsh
A nice comparison matrix of the support for various features among the preceding shells is available online:
https://stackoverflow.com/questions/5725296/difference-between-sh-and-bash
In some environments, the Bourne shell sh is the Bash shell, which you can check by typing the following command:
sh --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.
If you are new to the command line (be it Mac, Linux, or PCs), please read the preface of this book, which provides some useful guidelines for accessing command shells.
If you want to see the options for a specific Bash command, invoke the man command to see a description of that Bash command and its options, as shown here:
man cat
The man command produces terse explanations, and if those explanations are not clear enough, you can search for online code samples that provide more details.
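The man command documents standalone programs such as cat, whereas commands that are built into Bash (such as cd) are documented by Bash's own help builtin. The following sketch, which assumes it is run in Bash, uses type to check which kind of command you are dealing with before deciding where to look.

```shell
#!/usr/bin/env bash
# "type -t" reports whether a name is a builtin, an alias, a function, or a file.
cd_kind=$(type -t cd)      # "builtin" for cd in Bash
cat_kind=$(type -t cat)    # "file" when cat is an external program
echo "cd is a $cd_kind; cat is a $cat_kind"

# Builtins are documented by Bash itself; show the first line of the help text.
help cd | head -n 1

# Search the one-line manual page descriptions by keyword (same as "apropos"):
#   man -k compress
```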
In a command shell, you will often perform basic operations, such as displaying (or changing) the current directory, listing the contents of a directory, and displaying the contents of a file. The following set of commands shows you how to perform these operations, and you can execute a subset of these commands in the sequence that is relevant to you. Options for some of the commands in this section (such as the ls command) are described in greater detail later in this chapter.
A frequently used Bash command is pwd (“print working directory”), which displays the current directory, as shown here:
pwd
The output of the preceding command might look something like this:
/Users/jsmith
Use the cd (“change directory”) command to go to a specific directory. For example, type the command cd /Users/jsmith/Mail to navigate to this directory. If you are currently in the /Users/jsmith directory, type cd Mail.
You can navigate to your home directory with either of these commands:
$ cd $HOME
$ cd
One convenient way to return to the previous directory is the command cd -. Note that on Windows, the cd command without arguments merely displays the current directory, whereas the Unix cd command without arguments changes to your home directory.
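The navigation commands above can be tried safely in a scratch directory. This sketch assumes nothing about your own directory layout: it creates a temporary directory with a Mail subdirectory (the name mirrors the earlier example) and moves between them.

```shell
#!/usr/bin/env bash
# Create two throwaway directories to move between.
base=$(mktemp -d)
mkdir -p "$base/Mail"

cd "$base"            # absolute path
cd Mail               # relative path: now in $base/Mail
in_mail=$(basename "$(pwd)")

cd - > /dev/null      # back to the previous directory; cd - also prints its name
back_in=$(basename "$(pwd)")

echo "visited: $in_mail, then returned to: $back_in"
```

The redirect to /dev/null only hides the directory name that cd - prints; the change of directory still happens.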
The history command displays a list (i.e., the history) of commands that you executed in the current command shell:
history
A short sample output of the preceding command is shown here (by default, Bash keeps up to 500 commands in its history; the HISTSIZE variable sets the limit):
1202 cat longfile.txt > longfile2.txt
1203 vi longfile2.txt
1204 cat longfile2.txt |fold -40
1205 cat longfile2.txt |fold -30
1206 cat longfile2.txt |fold -50
1207 cat longfile2.txt |fold -45
1208 vi longfile2.txt
1209 history
1210 cd /Library/Developer/CommandLineTools/usr/include/c++/
1211 cd /tmp
1212 cd $HOME/Desktop
1213 history
If you want to navigate to the directory that is shown in line 1210, you can do so simply by typing the following command:
!1210
The preceding snippet executes the same command that appears as line 1210 in the previous sample’s output. However, !cd searches backward through the history of commands to find the most recent command that starts with cd: in this case, line 1212 is the first match. If there aren’t any intervening cd commands after the command in line 1210, then !1210 and !cd will have the same effect.
NOTE
Be careful with the “!” option with Bash commands because the command that matches the “!” might not be the one you intended, especially something of the form !rm, which will remove one or more files. Hence, it’s safer to use the history command and then explicitly specify the correct number (in that history) when you invoke the “!” operator.
The ls command is for listing filenames, and there are many switches available that you can use, as shown in this section. For example, the ls *txt command displays the following filenames (the actual display depends on the font size and the width of the command shell) on my Mac:
ReservedWords.txt data2.txt input-info.txt longfile2.txt
abc.txt good-info.txt longfile.txt names.txt
The command ls -lt (the letters “l” and “t”) displays a time-based long listing of filenames:
-rw-r--r-- 1 oswaldcampesato staff 2101 Jun 16 13:07 input-info.txt
-rw-r--r-- 1 oswaldcampesato staff 29 Jun 16 13:07 data2.txt
-rw-r--r-- 1 oswaldcampesato staff 407 Jun 16 13:07 longfile2.txt
-rw-r--r-- 1 oswaldcampesato staff 408 Jun 16 13:07 longfile.txt
-rw-r--r-- 1 oswaldcampesato staff 2267 Jun 16 13:07 ReservedWords.txt
-rw-r--r-- 1 oswaldcampesato staff 1638 Jun 16 13:07 good-info.txt
-rw-r--r-- 1 oswaldcampesato staff 45 Jun 16 13:07 abc.txt
-rw-r--r-- 1 oswaldcampesato staff 35 Jun 16 13:07 names.txt
The command ls -ltr (the letters “l”, “t”, and “r”) displays a reversed time-based long listing of filenames:
-rw-r--r-- 1 oswaldcampesato staff 35 Jun 16 13:07 names.txt
-rw-r--r-- 1 oswaldcampesato staff 45 Jun 16 13:07 abc.txt
-rw-r--r-- 1 oswaldcampesato staff 1638 Jun 16 13:07 good-info.txt
-rw-r--r-- 1 oswaldcampesato staff 2267 Jun 16 13:07 ReservedWords.txt
-rw-r--r-- 1 oswaldcampesato staff 408 Jun 16 13:07 longfile.txt
-rw-r--r-- 1 oswaldcampesato staff 407 Jun 16 13:07 longfile2.txt
-rw-r--r-- 1 oswaldcampesato staff 29 Jun 16 13:07 data2.txt
-rw-r--r-- 1 oswaldcampesato staff 2101 Jun 16 13:07 input-info.txt
Here are the descriptions of all the listed columns in the preceding output:
• Column #1: the file type and the permissions on the file (see below)
• Column #2: the number of hard links to the file or directory
• Column #3: the owner (user) of the file
• Column #4: the group that owns the file
• Column #5: the file size in bytes
• Column #6: the date and time when the file was last modified
• Column #7: the file or directory name
In an ls -l listing, every line begins with a character such as d, -, or l that indicates the type of file that’s listed (in the preceding example every entry is a regular file, so every line begins with -). These (and other) initial values are described as follows:
• - Regular file (ASCII text file, binary executable, or hard link)
• b Block special file (such as a physical hard drive)
• c Character special file (such as a terminal)
• d Directory file that contains a listing of other files and directories
• l Symbolic link file
• p Named pipe (a mechanism for interprocess communications)
• s Socket (for interprocess communication)
Consult online documentation for more details regarding the ls command.
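To see the file-type characters for yourself, the following sketch creates a regular file, a directory, and a symbolic link in a temporary directory and extracts the first character of each ls -l line. The file names are hypothetical; the cut command used here is covered later in this chapter.

```shell
#!/usr/bin/env bash
# Build one regular file, one directory, and one symbolic link.
demo=$(mktemp -d)
touch "$demo/data.txt"
mkdir "$demo/subdir"
ln -s "$demo/data.txt" "$demo/link-to-data"

# ls -ld lists the entry itself (not a directory's contents);
# cut -c1 keeps only the first character of the line.
file_type=$(ls -ld "$demo/data.txt" | cut -c1)
dir_type=$(ls -ld "$demo/subdir" | cut -c1)
link_type=$(ls -ld "$demo/link-to-data" | cut -c1)

echo "regular file: $file_type"
echo "directory:    $dir_type"
echo "symlink:      $link_type"
```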
This section introduces you to several commands for displaying different lines of text in a text file. Before doing so, let’s invoke the wc (word count) command to display the number of lines, words, and characters in a text file, as shown here:
wc longfile.txt
37 80 408 longfile.txt
The preceding output shows that the file longfile.txt contains 37 lines, 80 words, and 408 characters, which means that the file size is actually quite small (despite its name).
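When you need only one of the three counts, wc accepts the options -l (lines), -w (words), and -c (bytes). The sketch below uses a small generated file rather than longfile.txt, so the sample text is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# A three-line sample file created in a temporary location.
sample=$(mktemp)
printf 'one two\nthree\nfour five six\n' > "$sample"

# Reading from standard input (<) suppresses the filename in the output.
# The arithmetic expansion strips the leading spaces that some versions
# of wc (for example on macOS) add.
line_count=$(( $(wc -l < "$sample") ))
word_count=$(( $(wc -w < "$sample") ))
byte_count=$(( $(wc -c < "$sample") ))

echo "$line_count lines, $word_count words, $byte_count bytes"
```

Capturing the counts in variables this way is convenient in scripts that need to compare the size of a dataset before and after cleaning.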
Invoke the cat command to display the entire contents of longfile.txt:
cat longfile.txt
The preceding command displays the following text:
the contents of this long file are too long to see in a single screen and each line contains one or more words and if you use the cat command the (other lines are omitted)
The preceding command displays the entire contents of a file; however, there are several commands that display only a portion of a file, such as less, more, page, head, and tail (all of which are discussed later).
As another example, suppose that the file temp1 has the following contents:
this is line1 of temp1
this is line2 of temp1
this is line3 of temp1
Let’s also suppose that the file temp2 has these contents:
this is line1 of temp2
this is line2 of temp2
Type the following command that contains the ? meta character (discussed in detail later in this chapter):
cat temp?
The output from the preceding command is shown here:
this is line1 of temp1
this is line2 of temp1
this is line3 of temp1
this is line1 of temp2
this is line2 of temp2
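The temp? example can be reproduced with the sketch below, which creates temp1 and temp2 in a temporary directory. The extra file temp10 is an addition for illustration: it shows that ? matches exactly one character, so temp10 is not included.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
cd "$demo"

printf 'this is line1 of temp1\nthis is line2 of temp1\nthis is line3 of temp1\n' > temp1
printf 'this is line1 of temp2\nthis is line2 of temp2\n' > temp2
printf 'not matched\n' > temp10     # two characters after "temp", so temp? skips it

# The shell expands temp? to "temp1 temp2" before cat runs.
matched_names=$(echo temp?)
combined=$(cat temp?)

echo "$matched_names"
echo "$combined"
```

Running echo with a pattern before using that pattern in another command is a simple way to preview which filenames the shell will pass along.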
The head command displays the first ten lines of a text file (by default), an example of which is here:
head longfile.txt
The preceding command displays the following text:
the contents of this long file are too long to see in a single screen and each line contains one or more words
The head command also provides an option to specify a different number of lines to display, as shown here:
head -4 longfile.txt
The preceding command displays the following text:
the contents of this long file are too long
The tail command displays the last 10 lines (by default) of a text file:
tail longfile.txt
The preceding command displays the following text:
is available in every shell including the bash shell csh zsh ksh and Bourne shell
NOTE
The last two lines in the preceding output are blank lines (not a typographical error on this page).
Similarly, the tail command allows you to specify a different number of lines to display: tail -4 longfile.txt displays the last 4 lines of longfile.txt.
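The head and tail commands also accept the line count through the -n option, which is the spelling that POSIX specifies. The following sketch uses a generated 20-line file as a stand-in for longfile.txt (the generated file and the variable names are assumptions for illustration):

```shell
# Build a 20-line sample file (a stand-in for longfile.txt).
sample=$(mktemp)
seq 1 20 | sed 's/^/line /' > "$sample"

first4=$(head -n 4 "$sample")   # first four lines
last4=$(tail -n 4 "$sample")    # last four lines

echo "$first4"
echo "$last4"
rm -f "$sample"
```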
Use the more command to display a screenful of data, as shown here:
more longfile.txt
Press the <spacebar> to view the next screen full of data, and press the <return> key to see the next line of text in a file. Incidentally, some people prefer the less command, which generates essentially the same output as the more command. (A geeky joke: “What’s less? It’s more.”)
A very useful feature of Bash is its support for the pipe symbol (“|”), which enables you to “pipe” the output of one command so that it becomes the input of another command. Pipes are useful when you want to perform a sequence of operations involving various Bash commands.
For example, the following code snippet combines the head command with the cat command and the pipe (“|”) symbol:
cat longfile.txt | head -2
A technical point: the preceding command starts two processes, one for cat and one for head (more about processes later), whereas the command head -2 longfile.txt starts only a single process.
You can use the head and tail commands in more interesting ways. For example, the following command sequence displays lines 11 through 15 of longfile.txt:
head -15 longfile.txt | tail -5
The preceding command displays the following text:
and if you
use the cat
command the
file contents
scroll
To display the line numbers for the preceding output, use the following:
cat -n longfile.txt | head -15 | tail -5
The preceding command displays the following text:
    11  and if you
    12  use the cat
    13  command the
    14  file contents
    15  scroll
You won’t see the “tab” character in the output, but it’s visible if you redirect the previous command sequence to a file and then use the -t option with the cat command:
cat -n longfile.txt | head -15 | tail -5 > 1
cat -t 1
    11^Iand if you
    12^Iuse the cat
    13^Icommand the
    14^Ifile contents
    15^Iscroll
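As an alternative to chaining head and tail, a line range can be selected with a single sed command (sed is covered in more detail later in the book). The sketch below compares the two approaches on a generated 20-line file, which is an assumption standing in for longfile.txt:

```shell
# Build a 20-line sample file (a stand-in for longfile.txt).
sample=$(mktemp)
seq 1 20 | sed 's/^/line /' > "$sample"

# Lines 11 through 15, first with head and tail, then with sed.
via_pipe=$(head -n 15 "$sample" | tail -n 5)
via_sed=$(sed -n '11,15p' "$sample")

echo "$via_sed"
rm -f "$sample"
```

The sed form states the range directly, whereas the head and tail form requires you to compute the count (15 - 11 + 1 = 5) yourself.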
The fold command enables you to “fold” the lines in a text file, which is useful for text files that contain long lines of text that you want to split into shorter lines. For example, here are the contents of longfile2.txt:
the contents of this long file are too long to see in a single screen and each line contains one or more words and if you use the cat command the file contents scroll off the screen so you can use other commands such as the head or tail or more commands in conjunction with the pipe command that is very useful in Bash and is available in every shell including the bash shell csh zsh ksh and Bourne shell
You can “fold” the contents of longfile2.txt into lines whose length is 45 (just as an example) with this command:
cat longfile2.txt |fold -45
The output of the preceding command is here:
the contents of this long file are too long t
o see in a single screen and each line contai
ns one or more words and if you use the cat c
ommand the file contents scroll off the scree
n so you can use other commands such as the h
ead or tail or more commands in conjunction w
ith the pipe command that is very useful in B
ash and is available in every shell including
 the bash shell csh zsh ksh and Bourne shell
Notice that some words in the preceding output are split based on the line width, and not “newspaper style.”
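If you prefer lines that break at spaces instead of in the middle of a word, the fold command provides the -s option, and the -w option is the standard way to specify the width. The following is a small sketch that uses a shortened version of the text (the variable names are assumptions for illustration):

```shell
# A shortened, single-line version of the sample text.
text='the contents of this long file are too long to see in a single screen and each line contains one or more words'

# -w 45 sets the maximum width; -s breaks after the last blank that fits.
folded=$(printf '%s\n' "$text" | fold -s -w 45)

echo "$folded"
```

Each output line is at most 45 characters wide, and the break occurs after a space, so no word is split.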
In Chapter 5, you will learn how to display the lines in a text file that match a string or a pattern, as well as how to replace a string with another string in a text file.
Use the chmod command to set permissions for files. For example, if you need to set the permission rwx rw- r-- for a file, use the following:
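chmod u=rwx,g=rw,o=r filename
In the preceding command, the options u, g, and o represent user permissions, group permissions, and others’ permissions, respectively. The three clauses must be separated by commas (with no spaces), because chmod treats everything after the first argument as a filename.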
To change the permissions of an existing file, use + to add a permission for the user, group, or others, and use - to remove a permission. For example, given a file with the permissions rwx rw- r--, add the executable permission for others as follows:
chmod o+x filename
This command adds the x permission for others. Add the executable permission to all permission categories (that is, for user, group, and others) as follows:
chmod a+x filename
In the preceding command, the letter a means “all categories” (user, group, and others). Conversely, specify a - to remove a permission from all categories, as shown here:
chmod a-x filename
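The effect of each chmod command can be checked by looking at the first ten characters of the ls -l output. The following sketch applies the commands from this section to a temporary file (the temporary file and the variable names are assumptions for illustration):

```shell
# Create a scratch file to experiment on.
f=$(mktemp)

chmod u=rwx,g=rw,o=r "$f"
perms_initial=$(ls -l "$f" | cut -c1-10)    # rwx rw- r--

chmod o+x "$f"
perms_other_x=$(ls -l "$f" | cut -c1-10)    # others gain x

chmod a-x "$f"
perms_no_x=$(ls -l "$f" | cut -c1-10)       # x removed everywhere

echo "$perms_initial $perms_other_x $perms_no_x"
rm -f "$f"
```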
An “invisible” (or hidden) file is one whose first character is the dot or period character (.). Programs (including the shell) use many of these files to store configuration information. Some common examples of hidden files include the following files:
• .profile: the Bourne shell (sh) initialization script
• .bash_profile: the Bash shell (bash) initialization script
• .kshrc: the Korn shell (ksh) initialization script
• .cshrc: the C shell (csh) initialization script
• .rhosts: the remote shell configuration file
To list invisible files, specify the -a option to ls:
ls -a
.         .profile    docs     lib       test_results
..        .rhosts     hosts    pub       users
.emacs    bin         hw1      res.01    work
.exrc     ch07        hw2      res.02
.kshrc    ch07.bak    hw3      res.03
Single dot .: This represents the current directory.
Double dot ..: This represents the parent directory.
Problematic filenames are those that contain one or more whitespaces or hidden (non-printing) characters, or that start with a dash (-) character.
You can use double quotes to list filenames that contain whitespaces, or you can precede each whitespace by a backslash (\) character.
For example, if you have a file named One Space.txt, you can use the ls command as follows:
ls -1 "One Space.txt"
ls -l One\ Space.txt
Filenames that start with a dash (-) character are difficult to handle because the dash character is the prefix that specifies options for Bash commands. Consequently, if you have a file whose name is -abz, then the command ls -abz will not work correctly, because the name is interpreted as the switches -a, -b, and -z for the ls command, and since there is no “z” option, you will see the following type of output:
ls: illegal option -- z
usage: ls [-@ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1%] [file ...]
In most cases, the best solution for this type of file is to rename the file. This can be done in your operating system if your client isn’t a Unix shell, or you can use the following special syntax for the mv (“move”) command to rename the file. The two dashes (--) mark the end of the options, so mv treats every argument after them as a filename, even one that starts with a dash. An example is here:
mv -- -abc.txt renamed-abc.txt
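The -- convention is honored by most standard utilities, and prefixing the name with ./ is another common way to keep a leading dash from being read as an option. The following sketch shows both techniques in a scratch directory (the directory and the variable names are assumptions for illustration):

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

touch -- -abc.txt               # -- marks the end of the options
listed=$(ls -- -abc.txt)        # list the file without an "illegal option" error

mv ./-abc.txt renamed-abc.txt   # a ./ prefix also hides the leading dash
remaining=$(ls)

echo "$listed -> $remaining"
```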
There are many built-in environment variables available, and the following subsections discuss the env command and then some of the more common variables.
The env (“environment”) command displays the variables that are in your Bash environment. An example of the output of the env command is here:
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/73/39lngcln4dj_scmgvsv53g_w0000gn/T/
OLDPWD=/tmp
TERM_SESSION_ID=63101060-9DF0-405E-84E1-EC56282F4803
USER=ocampesato
COMMAND_MODE=unix2003
PATH=/opt/local/bin:/Users/ocampesato/android-sdk-mac_86/platform-tools:/Users/ocampesato/android-sdk-mac_86/tools:/usr/local/bin:
PWD=/Users/ocampesato
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
LANG=en_US.UTF-8
NODE_PATH=/usr/local/lib/node_modules
HOME=/Users/ocampesato
LOGNAME=ocampesato
DISPLAY=/tmp/launch-xnTgkE/org.macosforge.xquartz:0
SECURITYSESSIONID=186a4
_=/usr/bin/env
Some interesting examples of setting an environment variable and also executing a command are described online:
https://stackoverflow.com/questions/13998075/setting-environment-variable-for-one-program-call-in-bash-using-env
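As a brief sketch of the idea behind that link, the env command can set a variable for a single command without changing the current shell. The variable name GREETING is hypothetical and is used only for illustration:

```shell
# GREETING is a hypothetical variable; make sure it starts out unset.
unset GREETING

# The variable exists only in the environment of the sh child process.
one_shot=$(env GREETING=hello sh -c 'echo "$GREETING"')

# Back in the current shell, the variable is still unset.
afterwards=${GREETING:-unset}

echo "$one_shot / $afterwards"
```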
This section discusses some important environment variables, most of which you probably will not need to modify, but it’s useful to be aware of the existence of these variables and their purpose.
• The HOME variable contains the absolute path of the user’s home directory.
• The HOSTNAME variable specifies the Internet name of the host.
• The LOGNAME variable specifies the user’s login name.
• The PATH variable specifies the search path (see the next subsection).
• The SHELL variable specifies the absolute path of the current shell.
• The USER variable specifies the user’s current username. This value might be different from the login name if a superuser executes the su command to emulate another user’s permissions.
Programs and other executable files can live in many directories, so operating systems provide a search path that lists the directories that the operating system searches for executable files. Add a directory to your path so that you can invoke an executable file by specifying just the filename: you don’t need to specify the full path to the executable file.
The search path is stored in an environment variable, which is a named string maintained by the operating system. These variables contain information available to the command shell and other programs.
The path variable is named PATH in Bash or Path in Windows (Bash is case-sensitive; Windows is not).
Setting the path in Bash/Linux:
export PATH=$HOME/anaconda:$PATH
To add the Python directory to the path for a particular session in Bash, use the following:
export PATH="$PATH:/usr/local/bin/python"
In Bourne shell or ksh shell, enter this command:
PATH="$PATH:/usr/local/bin/python"
NOTE
/usr/local/bin/python is the path of the Python directory.
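Because the search path is a colon-separated string, it can be inspected and extended with ordinary string operations. The following sketch works on a copy named demo_path instead of the real PATH, and the helper add_to_path is hypothetical; it appends a directory only if that directory is not already listed:

```shell
# A hypothetical search path; the real PATH is left untouched.
demo_path="/usr/local/bin:/usr/bin:/bin"

# Hypothetical helper: append a directory unless it is already present.
add_to_path() {
  case ":$demo_path:" in
    *":$1:"*) ;;                        # already listed, do nothing
    *) demo_path="$demo_path:$1" ;;     # append to the end
  esac
}

add_to_path /usr/local/bin/python
add_to_path /usr/local/bin/python       # second call has no effect

# Display one directory per line.
entries=$(printf '%s\n' "$demo_path" | tr ':' '\n')
echo "$entries"
```

Wrapping the value in colons before matching prevents a partial match, such as /usr/bin matching inside /usr/bin/extra.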
The following command defines a shell variable called h1 (it becomes an environment variable, visible to child processes, only after you export it):
h1=$HOME/test
Now if you enter the following command,
echo $h1
you will see the following output on OS X:
/Users/jsmith/test
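The difference between a plain shell variable and an exported (environment) variable shows up when a child process is started. The following sketch, in which the variable names other than h1 are assumptions for illustration, checks what a child sh process can see before and after the export:

```shell
# Start from a clean slate in case h1 is already defined.
unset h1
h1=$HOME/test

# Not exported yet: the child process cannot see h1.
child_before=$(sh -c 'echo "${h1:-not visible}"')

# After export, h1 is part of the environment passed to children.
export h1
child_after=$(sh -c 'echo "$h1"')

echo "$child_before / $child_after"
```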
The next code snippet shows you how to set the alias ll so that it displays a long listing of a directory:
alias ll="ls -l"
The following three alias definitions involve the ls command and various switches:
alias ll="ls -l"
alias lt="ls -lt"
alias ltr="ls -ltr"
As an example, you can replace the command ls -ltr (the letters “l”, “t”, and “r”) that you saw earlier in the chapter with the ltr alias and you will see the same reversed time-based long listing of filenames (reproduced here):
total 56
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rwx------ 1 ocampesato staff 478 Jan 06 19:21 iphonemeetup.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt
You can also define an alias that contains the Bash pipe (|) symbol:
alias ltrm="ls -ltr | more"
In a similar manner, you can define aliases for directory related commands:
alias ltd="ls -lt | grep '^d'"
alias ltdm="ls -lt | grep '^d' | more"
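Aliases are intended for interactive use: Bash does not expand them in non-interactive shell scripts unless the expand_aliases shell option is set. Inside a script, a shell function is a common substitute. The following sketch rewrites the ltd alias as a function that also accepts a directory argument; the scratch directory and its contents are assumptions for illustration:

```shell
# Function counterpart of the ltd alias; "$@" passes along any arguments.
ltd() {
  ls -lt "$@" | grep '^d'
}

# Scratch directory with two subdirectories and one regular file.
workdir=$(mktemp -d)
mkdir "$workdir/docs" "$workdir/lib"
touch "$workdir/notes.txt"

dirs_only=$(ltd "$workdir")
echo "$dirs_only"
```

Only the lines that start with d (the directories) are kept, so notes.txt does not appear in the output.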
Several commands are available for finding executable files (binary files or shell scripts) by searching the directories in the PATH environment variable: which, whence, whereis, and whatis.
These commands produce results that are similar to those of the which command. The which command gives the full path to whatever executable you specify, or a blank line if the executable is not in any directory that is specified in the PATH environment variable. This is useful for finding out whether a particular command or utility is installed on the system.
which rm
The output of the preceding command is here:
/bin/rm
The whereis command provides information similar to what you get from the which command:
$ whereis rm
/bin/rm
The whatis command looks up the specified command in the whatis database, which is useful for identifying system commands and important configuration files. For example, the command whatis rm displays output like the following:
git-rm(1) - Remove files from the working tree and from the index
grm(1), rm(1) - remove files or directories
rm(1), unlink(1) - remove directory entries
Consider whatis a simplified version of the man command, which displays detailed documentation about Bash commands (e.g., type man ls and you will see several pages of explanation regarding the ls command).
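Returning to the question of whether a command is installed: inside a shell script, the built-in command -v is a portable alternative to which, because its exit status indicates whether the name was found. The following is a minimal sketch; the name no_such_command_xyz is a made-up name that is assumed not to exist on the system:

```shell
# Full path of rm, as located through PATH.
rm_path=$(command -v rm)

# The exit status is nonzero when the command cannot be found.
if command -v no_such_command_xyz >/dev/null 2>&1; then
  found=yes
else
  found=no
fi

echo "$rm_path / $found"
```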
In brief, use the printf command instead of the echo command if you need to control the output format. One key difference is that the echo command automatically appends a newline character, whereas the printf statement prints a newline character only if the format string contains \n. Keep this point in mind when you see the printf statement in the awk code samples in Chapter 6.
As a simple example, place the following code snippet in a shell script:
printf "%-5s %-10s %-4s\n" ABC DEF GHI
printf "%-5s %-10s %-4.2f\n" ABC DEF 12.3456
Make the shell script executable and then launch the shell script, after which you will see the following output:
ABC   DEF        GHI
ABC   DEF        12.35
However, if you type the following pair of commands:
echo "ABC DEF GHI"
echo "ABC DEF 12.3456"
you will see the following output:
ABC DEF GHI
ABC DEF 12.3456
A detailed (and very lengthy) discussion regarding the printf statement and the echo command is here:
https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo
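As one more sketch of format control, the width specifiers in printf can line up a small table: a minus sign left-justifies a field, and a plain width right-justifies it. The item names and prices below are made-up sample data:

```shell
# Left-justified 10-character name column, right-justified 8-character price column.
report=$(
  printf '%-10s %8s\n'   "ITEM"  "PRICE"
  printf '%-10s %8.2f\n' "apple" 1.5
  printf '%-10s %8.2f\n' "melon" 12.3456
)

echo "$report"
```

Every row has the same total width, so the prices align on the right, and the .2 precision rounds each price to two decimal places.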
The cut command enables you to extract fields that are separated by a specified delimiter, as well as a range of columns, from an input stream. (The delimiter plays a role similar to that of the shell’s IFS variable, except that it is passed to cut with the -d option instead of being set as a variable.) Some examples are here:
x="abc def ghi"
echo $x | cut -d" " -f2
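The same options work with other delimiters, and the -c option selects character columns instead of fields. The following sketch uses a made-up comma-separated record (the record and the variable names are assumptions for illustration):

```shell
# A hypothetical comma-separated record.
record='alice,30,paris'

name=$(printf '%s\n' "$record" | cut -d',' -f1)         # first field
rest=$(printf '%s\n' "$record" | cut -d',' -f2-3)       # fields 2 through 3
first_chars=$(printf '%s\n' "$record" | cut -c1-5)      # character columns 1 to 5

echo "$name | $rest | $first_chars"
```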
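The following code snippet requests field 3 and field 2; note that cut always emits the selected fields in their input order, regardless of the order in which they are listed after -f:
x="abc def ghi"
echo $x | cut -d" " -f3,2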
The output (using a space " " as the delimiter, and -f2 to indicate the second column) of the preceding code snippet is here: