Red Hat Enterprise Linux Troubleshooting Guide E-Book

Benjamin Cane

Publisher: Packt Publishing
Category: Specialist literature
Language: English

Description

Red Hat Enterprise Linux is an operating system that allows you to modernize your infrastructure, boost efficiency through virtualization, and finally prepare your data center for an open, hybrid cloud IT architecture. It provides the stability to take on today's challenges and the flexibility to adapt to tomorrow's demands.
In this book, you begin with simple troubleshooting best practices and get an overview of the Linux commands used for troubleshooting. The book will cover the troubleshooting methods for web applications and services such as Apache and MySQL. Then, you will learn to identify system performance bottlenecks and troubleshoot network issues; all while learning about vital troubleshooting steps such as understanding the problem statement, establishing a hypothesis, and understanding trial, error, and documentation. Next, the book will show you how to capture and analyze network traffic, use advanced system troubleshooting tools such as strace, tcpdump & dmesg, and discover common issues with system defaults.
Finally, the book will take you through a detailed root cause analysis of an unexpected reboot where you will learn to recover a downed system.

Details

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB

MOBI

Page count: 665

Year of publication: 2015

Red Hat Enterprise Linux Troubleshooting Guide

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Troubleshooting Best Practices

Styles of troubleshooting

The Data Collector

The Educated Guesser

The Adaptor

Choosing the appropriate style

Troubleshooting steps

Understanding the problem statement

Asking questions

Tickets

Humans

Attempting to duplicate the issue

Running investigatory commands

Establishing a hypothesis

Putting together patterns

Is this something that I've encountered before?

Trial and error

Start by creating a backup

Getting help

Books

Team Wikis or Runbooks

Google

Man pages

Reading a man page

Name

Synopsis

Description

Examples

Additional sections

Info documentation

Referencing more than commands

Installing man pages

Red Hat kernel docs

People

Following up

Documentation

Root cause analysis

The anatomy of a good RCA

The problem as it was reported

The actual root cause of the problem

A timeline of events and actions taken

Any key data points to validate the root cause

A plan of action to prevent the incident from reoccurring

Establishing a root cause

Sometimes you must sacrifice a root cause analysis

Understanding your environment

Summary

2. Troubleshooting Commands and Sources of Useful Information

Finding useful information

Log files

The default location

Common log files

Finding logs that are not in the default location

Checking syslog configuration

Checking the application's configuration

Other examples

Using the find command

Configuration files

Default system configuration directory

Finding configuration files

Using the rpm command

Using the find command

The proc filesystem

Troubleshooting commands

Command-line basics

Command flags

The piping command output

Gathering general information

w – show who is logged on and what they are doing

rpm – RPM package manager

Listing all packages installed

Listing all files deployed by a package

Using package verification

df – report file system space usage

Showing available inodes

free – display memory utilization

What is free, is not always free

The /proc/meminfo file

ps – report a snapshot of current running processes

Printing every process in long format

Printing a specific user's processes

Printing a process by process ID

Printing processes with performance information

Networking

ip – show and manipulate network settings

Show IP address configuration for a specific device

Show routing configuration

Show network statistics for a specified device

netstat – network statistics

Printing network connections

Printing all ports listening for tcp connections

Delay

Performance

iotop – a simple top-like I/O monitor

iostat – report I/O and CPU statistics

Manipulating the output

vmstat – report virtual memory statistics

sar – collect, report, or save system activity information

Using the sar command

Summary

3. Troubleshooting a Web Application

A small back story

The reported issue

Data gathering

Asking questions

Duplicating the issue

Understanding the environment

Where is this blog hosted?

Lookup IPs with nslookup

What about ping, dig, or other tools?

Ok, it's within our environment; now what?

What services are installed and running?

Validate the web server

Validating the database service

Validating PHP

A summary of installed and running services

Looking for error messages

Apache logs

Finding the location of Apache's logs

Reviewing the logs

Using curl to call our web application

Requesting a non-PHP page

Reviewing generated log entries

What we learned from httpd logs

Verifying the database

Verifying the WordPress database

Finding the installation path for WordPress

Checking the default configuration

Finding the database credentials

Connecting as the WordPress user

Validating the database structure

What we learned from the database validation

Establishing a hypothesis

Resolving the issue

Understanding database data files

Finding the MariaDB data folder

Resolving data file issues

Validating

Final validation

Summary

4. Troubleshooting Performance Issues

Performance issues

It's slow

Performance

Application

CPU

Top – a single command to look at everything

What does this output tell us about our issue?

Individual processes from top

Determining the number of CPUs available

Threads and Cores

lscpu – Another way to look at CPU info

ps – Drill down deeper on individual processes with ps

Using ps to determine process CPU utilization

Putting it all together

A quick look with top

Digging deeper with ps

Memory

free – Looking at free and used memory

Linux memory buffers and caches

Swapped memory

What free tells us about our system

Checking for oomkill

ps - Checking individual processes memory utilization

vmstat – Monitoring memory allocation and swapping

Putting it all together

Taking a look at the system's memory utilization with free

Watch what is happening with vmstat

Finding the processes that utilize the most memory with ps

Disk

iostat – CPU and device input/output statistics

CPU details

Reviewing I/O statistics

Identifying devices

Who is writing to these devices?

ps – Using ps to identify processes utilizing I/O

iotop – A top top-like command for disk i/o

Putting it all together

Using iostat to determine whether there is a I/O bandwidth problem

Using iotop to determine which processes are consuming disk bandwidth

Using ps to understand more about processes

Network

ifstat – Review interface statistics

Quick review of what we have identified

Comparing historical metrics

sar – System activity report

CPU

Memory

Disk

Network

Review what we learned by comparing historical statistics

Summary

5. Network Troubleshooting

Database connectivity issues

Data collection

Duplicating the issue

Finding the database server

Testing connectivity

Telnet from blog.example.com

Telnet from our laptop

Ping

Troubleshooting DNS

Checking DNS with dig

Looking up DNS with nslookup

What did dig and nslookup tell us?

A bit about /etc/hosts

DNS summary

Pinging from another location

Testing port connectivity with cURL

Showing current network connections with netstat

Using netstat to watch for new connections

Breakdown of netstat states

Capturing network traffic with tcpdump

Taking a look at the server's network interfaces

What is a network interface?

Viewing device configuration

Specifying the interface with tcpdump

Reading the captured data

A quick primer on TCP

Types of TCP packet

Reviewing collected data

Taking a look on the other side

Identifying the network configuration

Testing connectivity from db.example.com

Looking for connections with netstat

Tracing network connections with tcpdump

Routing

Viewing the routing table

The default route

Utilizing IP to show the routing table

Looking for routing misconfigurations

More specific routes win

Hypothesis

Trial and error

Removing the invalid route

Configuration files

Summary

6. Diagnosing and Correcting Firewall Issues

Diagnosing firewalls

Déjà vu

Troubleshooting from historic issues

Basic troubleshooting

Validating the MariaDB service

Troubleshooting with tcpdump

Understanding ICMP

Understanding connection rejections

A quick summary of what you have learned so far

Managing the Linux firewall with iptables

Verify that iptables is running

Show iptables rules being enforced

Understanding iptables rules

Ordering matters

Default policies

Breaking down the iptables rules

Putting the rules together

Viewing iptables counters

Correcting the iptables rule ordering

How iptables rules are applied

Modifying iptables rules

Testing our changes

Summary

7. Filesystem Errors and Recovery

Diagnosing filesystem errors

Read-only filesystems

Using the mount command to list mounted filesystems

A mounted filesystem

Using fdisk to list available partitions

Back to troubleshooting

NFS – Network Filesystem

NFS and network connectivity

Using the showmount command

NFS server configuration

Exploring /etc/exports

Identifying the current exports

Testing NFS from another client

Making mounts permanent

Unmounting the /mnt filesystem

Troubleshooting the NFS server, again

Finding the NFS log messages

Reading /var/log/messages

Read-only filesystems

Identifying disk issues

Recovering the filesystem

Unmounting the filesystem

Filesystem checks with fsck

The fsck and xfs filesystems

How do these tools repair a filesystem?

Mounting the filesystem

Repairing the other filesystems

Recovering the / (root) filesystem

Validation

Summary

8. Hardware Troubleshooting

Starting with a log entry

What is a RAID?

RAID 0 – striping

RAID 1 – mirroring

RAID 5 – striping with distributed parity

RAID 6 – striping with double distributed parity

RAID 10 – mirrored and striped

Back to troubleshooting our RAID

How RAID recovery works

Checking the current RAID status

Summarizing the key information

Looking at md status with /proc/mdstat

Using both /proc/mdstat and mdadm

Identifying a bigger issue

Understanding /dev

More than just disk drives

Device messages with dmesg

Summarizing what dmesg has provided

Using mdadm to examine the superblock

Checking /dev/sdb2

What we have learned so far

Re-adding the drives to the arrays

Adding a new disk device

When disks are not added cleanly

Another way to watch the rebuild status

Summary

9. Using System Tools to Troubleshoot Applications

Open source versus home-grown applications

When the application won't start

Exit codes

Is the script failing, or the application?

A wealth of information in the configuration file

Watching log files during startup

Checking whether the application is already running

Checking open files

Understanding file descriptors

Getting back to the lsof output

Using lsof to check whether we have a previously running process

Finding out more about the application

Tracing an application with strace

What is a system call?

Using strace to identify why the application will not start

Resolving the conflict

Summary

10. Understanding Linux User and Kernel Limits

A reported issue

Why is the job failing?

Background questions

Is the cron job even running?

User crontabs

Understanding user limits

The file size limit

The max user processes limit

The open files limit

Changing user limits

The limits.conf file

Future proofing the scheduled job

Running the job again

Kernel tunables

Finding the kernel parameter for open files

Changing kernel tunables

Permanently changing a tunable

Temporarily changing a tunable

Running the job one last time

A look back

Too many open files

A bit of clean up

Summary

11. Recovering from Common Failures

The reported problem

Is Apache really down?

Why is it down?

What else was happening at that time?

Searching the messages log

Breaking down this useful one-liner

The cut command

The sort command

The uniq command

Tying it all together

What happens when a Linux system runs out of memory?

Minimum free memory

A quick recap

How oom-kill works

Adjusting the oom score

Determining whether our process was killed by oom-kill

Why did the system run out of memory?

Resolving the issue in the long-term and short-term

Long-term resolution

Short-term resolution

Summary

12. Root Cause Analysis of an Unexpected Reboot

A late night alert

Identifying the issue

Did someone reboot this server?

What do the logs tell us?

Learning about new processes and services

What caused the high load average?

What are the run queue and load average?

Load average

Investigating the filesystem being full

The du command

Why wasn't the queue directory processed?

A checkpoint on what you learned

Sometimes you cannot prove everything

Preventing reoccurrence

Immediate action

Long-term actions

A sample Root Cause Analysis

Problem summary

Problem details

Root cause

Action plan

Further actions to be taken

Summary

Index

Red Hat Enterprise Linux Troubleshooting Guide

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2015

Production reference: 1131015

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-355-0

www.packtpub.com

Credits

Author

Benjamin Cane

Reviewers

Brian C Galura

Deepak G Kulkarni

Warren Myers

Siddhesh Poyarekar

Commissioning Editor

Nadeem Baghban

Acquisition Editor

Reshma Raman

Content Development Editor

Mamata Walkar

Technical Editor

Ryan Kochery

Copy Editors

Tani Kothari

Merilyn Pereira

Project Coordinator

Shipra Chawhan

Proofreader

Safis Editing

Indexer

Tejal Daruwale Soni

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

About the Author

Benjamin Cane has nearly 10 years of experience in Linux systems administration. His first systems administration role was in 2006. At that time, he worked for a web hosting company supporting thousands of FreeBSD and Linux systems.

Afterwards, he joined a managed services company that specialized in managing mission-critical systems. There, he worked his way to the position of a lead systems engineer, providing 24x7 support for highly critical enterprise systems that ran Red Hat Enterprise Linux.

Now, Benjamin is a systems architect. He focuses on building High and Continuous Availability environments within the financial services industry. He is also currently a Red Hat Certified Engineer and Certified Ethical Hacker.

With his experience in mission-critical environments, he has learned to identify and troubleshoot very complex issues quickly, because often these environments have a low tolerance for downtime. Being able to identify the root causes of very complex systems issues quickly is a skill that requires extensive knowledge of Linux and troubleshooting best practices.

In addition to this book, Benjamin writes about Linux systems administration and DevOps topics on his blog at http://bencane.com. He is also the project founder for Runbook (https://github.com/Runbook/runbook), an open source application designed to monitor and automatically resolve infrastructure and application issues.

About the Reviewers

Brian C Galura spent his childhood tinkering with subjects such as Java programming and Linux; his professional experience started with VoIP testing at 3Com in suburban Chicago. He then spent two years studying computer engineering at Purdue University before leaving to pursue freelance consulting in Los Angeles.

Following several years of freelancing, he developed his expertise in enterprise infrastructure and cloud computing by working for a variety of start-ups and large corporations. Later, he completed a bachelor's in IT while working at Citrix. Brian is currently working on Citrix's cloud engineering and systems architecture team in Santa Barbara, California.

Deepak G Kulkarni has around 10 years of experience in software engineering including nine years of strong experience in product development using C/C++/Java.

He has over four years of experience in JUnit, Core JAVA, Web Services, XML, Spring, Spring MVC, and Hibernate. He also has extensive software development experience using C/C++ on UNIX (HP-UX) and Sun-Solaris, shell scripting, and STL (Standard Template Library) and cross-platform development. Deepak also has over two years of experience in the Mac OS using Objective C/C++, Cocoa framework, Xcode, and MetroWorks IDE.

Deepak has a working knowledge of Perl/SNMP/TCP/IP, OS Internals, web services, XML, XSD, database (Oracle)/IPC (inter process communication)/system calls/Pro*C. He has been exposed to UML / use case diagrams / class diagrams / sequence diagrams / activity diagrams / state diagrams, IBM Rational Rose, and Design Patterns (GoF Patterns) / Design Principles. He has hands-on experience with Ant, Maven, JUnit, JProfiler, JProbe, JSON, Servlets, Python, and Ruby on Rails, as well as experience in remote service calls and client server programming.

Warren Myers has been a professional data center and cloud automation architect for over eight years, with exposure to dozens of platforms, scores of customers, and hundreds of interesting problems.

He was a reviewer on Raspberry Pi Server Essentials and wrote the freely-available e-book, Debugging and Supporting Software Systems (http://cnx.org/contents/[email protected]:1/Debugging_and_Supporting_Softw).

Siddhesh Poyarekar has been a free and open source software programmer for over seven years and has worked on a variety of projects beginning with writing his own dialer program for an ISP that did not have a UI-based dialer for Linux. He spent a number of years troubleshooting problems in various domains from the Linux desktop, shell, and the kernel to the core system runtime on Linux systems, that is, the GNU C Library for which he is now a maintainer.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

This book is dedicated to my wife Ruby and our two sons Ethan and Jacob. Ruby, without your love and support I would have never been able to write this book. Thank you for putting up with the long hours and sleepless nights while trying to complete this.

Preface

Red Hat Enterprise Linux is a widely popular Linux distribution that is used in everything from Cloud to enterprise mainframe computers. If you include downstream distributions such as CentOS, the adoption of the Red Hat Enterprise Linux distribution is even greater.

As with most things, there is always someone responsible for resolving issues with all of these various systems running Red Hat Enterprise Linux. Red Hat Enterprise Linux Troubleshooting Guide is written to provide basic to advanced troubleshooting practices and commands for Linux systems, with these troubleshooting techniques specifically focused on systems running Red Hat Enterprise Linux.

This book is designed to provide you with steps and the knowledge required to remedy a wide variety of scenarios. The examples in this book use real-world issues with real-world resolutions.

While the examples in this book are situational, this book can also be used as a reference for Linux-related topics and commands. The examples give the reader the ability to reference both troubleshooting steps and specific commands to resolve complex issues.

What this book covers

Chapter 1, Troubleshooting Best Practices, covers the troubleshooting process at a high level. By equating the troubleshooting process with the scientific method, this book will explain how to break down a problem to identify the root cause, no matter how complicated the problem.

Chapter 2, Troubleshooting Commands and Sources of Useful Information, provides the reader with a simple introduction to common locations of useful information. It will also provide a reference for fundamental Linux commands that can be used for troubleshooting many types of issues.

Chapter 3, Troubleshooting a Web Application, takes the process learned in Chapter 1 and the commands learned in Chapter 2 to work through a complicated problem. The problem outlined in this chapter is presented "by example", meaning that the flow of this chapter is designed to walk you through the entire troubleshooting process, from end to end.

Chapter 4, Troubleshooting Performance Issues, deals with performance issues, which are some of the most complicated problems to troubleshoot. Often, the complications are compounded by the perception of users versus expected levels of performance. In this chapter, the tools and information discussed in Chapter 2 will, once again, be used to resolve a real-world performance problem.

Chapter 5, Network Troubleshooting, talks about networking being a critical component of any modern day system. This chapter will cover the core commands necessary for the configuration and diagnostics of Linux networking.

Chapter 6, Diagnosing and Correcting Firewall Issues, covers the complex nature of Linux firewalls, in a continuation of Chapter 5. This chapter will introduce and highlight commands and techniques necessary to troubleshoot Linux software firewalls.

Chapter 7, Filesystem Errors and Recovery, teaches you that being able to recover a filesystem could mean the difference between losing and retaining data. This chapter will introduce some core Linux filesystem concepts and will demonstrate how to recover a read-only filesystem.

Chapter 8, Hardware Troubleshooting, starts to touch on the process of troubleshooting hardware issues. This chapter will walk you through the restoration of a failed hard drive.

Chapter 9, Using System Tools to Troubleshoot Applications, explores how often a system administrator's role is not only to troubleshoot OS issues but also application issues. This chapter will show you how to utilize common system tools to identify the root cause of an application issue.

Chapter 10, Understanding Linux User and Kernel Limits, demonstrates that Red Hat Enterprise Linux has many components in place to prevent users from overloading the system. This chapter will explore these components and explain how to modify them to allow legitimate resource utilization.

Chapter 11, Recovering from Common Failures, walks you through troubleshooting out-of-memory conditions. This scenario is very common in heavily utilized environments and can be difficult to troubleshoot. This chapter will cover not only how to troubleshoot this issue, but also why the issue occurred.

Chapter 12, Root Cause Analysis of an Unexpected Reboot, puts the troubleshooting process and commands learned in the previous chapters to the test. This chapter walks you through performing a Root Cause Analysis on a server that has unexpectedly rebooted.

What you need for this book

Although this book can be read on its own, readers will benefit greatly from having access to a system running Red Hat Enterprise Linux release 7. You will learn the commands and resources discussed in this book more effectively when you have the ability to execute them on a test system.

While it is possible to use many of the commands, processes, and resources covered in this book with other Linux distributions, it is highly recommended to utilize a Red Hat downstream distribution such as CentOS 7 if Red Hat Enterprise Linux 7 is not available to the reader.

Who this book is for

If you are a competent RHEL administrator or consultant with a desire to increase your troubleshooting skills and your knowledge of Red Hat Enterprise Linux, then this book is perfect for you. A good knowledge level and understanding of basic Linux commands are expected.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Within reason, it is not required to include every cd or ls command executed."

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

192.168.33.12 > 192.168.33.11: ICMP host 192.168.33.12 unreachable - admin prohibited, length 68

Any command-line input or output is written as follows:

# yum install man-pages

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "we will see a message on our screen that says still here?."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Troubleshooting Best Practices

This chapter, which happens to be the first chapter, is probably the most important and least technical. Most chapters in this book cover specific issues and the commands necessary to troubleshoot those issues. This chapter, however, will cover some troubleshooting best practices that can be applied to any issue.

You can think of this chapter as the principles behind the practices being applied.

Styles of troubleshooting

Before covering the best practices of troubleshooting, it is important to understand the different styles of troubleshooting. In my experience, I have found that people tend to use one of three styles of troubleshooting, which are as follows:

The Data Collector
The Educated Guesser
The Adaptor

Each of these styles has its own strengths and weaknesses. Let's have a look at the characteristics of these styles.

The Data Collector

I like to call the first style of troubleshooting the Data Collector. The Data Collector is someone who generally utilizes a systematic approach to solve issues. The systematic troubleshooting approach is generally characterized as follows:

Asking specific questions to parties reporting issues, expecting detailed answers
Running commands to identify system performance for most issues
Running through a predefined set of troubleshooting steps before stepping into action

The strength of this style is that it is effective, no matter what level of engineer or administrator is using it. By going through issues systematically, collecting each data point, and understanding the results before executing any resolution, the Data Collector is able to resolve issues that they might not necessarily be familiar with.

The weakness of this style is that the data collection is not usually the fastest method to resolve issues. Depending on the issue, collecting data can take a long time and some of that data might not be necessary to find the resolution.

The Educated Guesser

I like to call the second style of troubleshooting the Educated Guesser. The Educated Guesser is someone who generally utilizes an intuitive approach to solve issues. The intuitive approach is generally characterized by the following:

Identifying the cause of issues with minimal information
Running a few commands before resolving the issue
Utilizing previous experience to identify root cause

The strength of this style of troubleshooting is that it allows you to come up with resolutions sooner. When confronted with an issue, this type of troubleshooter tends to pull from experience and requires minimal information to find a resolution.

The weakness of this style is that it relies heavily on experience, and thus requires time before being effective. When focusing on resolution, this troubleshooter might also attempt multiple actions to resolve the issue, which can make it seem like the Educated Guesser does not fully understand the issue at hand.

The Adaptor

There is a third and often-overlooked style of troubleshooting; this style utilizes both the systematic and intuitive styles. I like to call this style the Adaptor. The Adaptor has a personality that enables them to switch between systematic and intuitive troubleshooting styles. This combined style is often faster than the Data Collector style and more detail oriented than the Educated Guesser style, because Adaptors are able to apply the troubleshooting style appropriate for the task at hand.

Choosing the appropriate style

While it is easy to say that one method is better than the other, the fact of the matter is that picking the appropriate troubleshooting style depends greatly on the person. It is important to understand which troubleshooting style best fits your own personality. By understanding which style fits you better, you can learn and use techniques that fit that style. You can also learn and adopt techniques from other styles to apply troubleshooting steps that you would normally overlook.

This book will show both the Data Collector and Educated Guesser styles of troubleshooting, and will periodically highlight which personality style the steps best fit.

Troubleshooting steps

Troubleshooting is a process that is both rigid and flexible. The rigidity of the troubleshooting process is based on the fact that there are basic steps to be followed. In this way, I like to equate the troubleshooting process to the scientific method, where the scientific method has a specific list of steps that must be followed.

The flexibility of the troubleshooting process is that these steps can be followed in any order that makes sense. Unlike the scientific method, the troubleshooting process often has the goal of resolving the issue quickly. Sometimes, in order to resolve an issue quickly, you might need to skip a step or execute them out of order. For example, with the troubleshooting process, you might need to resolve the immediate issue, and then identify the root cause of that issue.

The following list has five steps that make up the troubleshooting process. Each of these steps could also include several sub-tasks, which may or may not be relevant to the issue. It is important to follow these steps with a grain of salt, as not every issue can be placed into the same bucket. The following steps are meant to be used as a best practice but, as with all things, they should be adapted to the issue at hand:

Understanding the problem statement.
Establishing a hypothesis.
Trial and error.
Getting help.
Documentation.

Understanding the problem statement

With the scientific method, the first step is to establish a problem statement, which is another way of saying: to identify and understand the goal of the experiment. With the troubleshooting process, the first step is to understand the problem being reported. The better we understand an issue, the easier it is to resolve the issue.

There are a number of tasks we can perform that will help us understand issues better. This first step is where a Data Collector's personality stands out. Data Collectors, by nature, will gather as much data as they can before moving on to the next step, whereas, the Educated Guessers generally tend to run through this step quickly and then move on to the next step, which can sometimes cause critical pieces of information to be missed.

Adaptors tend to understand which data collecting steps are necessary and which ones are not. This allows them to collect data as a Data Collector would, but without spending time gathering data that does not add value to the issue at hand.

The sub-task in this troubleshooting step is asking the right questions.

Asking questions

Whether via human or automated processes such as a ticket system, the reporter of the issue is often a great source of information.

Tickets

When they receive a ticket, the Educated Guesser personality will often read the heading of the ticket, make an assumption of the issue and move to the next stage of understanding the issue. The Data Collector personality will generally open the ticket and read the full details of the ticket.

While it depends on the ticketing and monitoring system, in general, there can be useful information within a ticket. Unless the issue is a common issue and you are able to understand all that you need to know from the header, it is generally a good idea to read the ticket description. Even small amounts of information might help with particularly tricky issues.

Humans

Gathering additional information from humans, however, can be inconsistent. This varies greatly depending on the environment being supported. In some environments, the person reporting an issue can provide all of the details required to resolve the issue. In other environments, they might not understand the issue and simply explain the symptoms.

No matter what troubleshooting style fits your personality best, being able to get important information from the person reporting the issue is an important skill. Intuitive problem solvers such as the Educated Guesser or Adaptor tend to find this process easier as compared to Data Collector personalities, not because these personalities are necessarily better at obtaining details from people but rather because they are able to identify patterns with less information. Data Collectors, however, can get the information they need from those reporting the issue if they are prepared to ask troubleshooting questions.

Note

Don't be afraid to ask obvious questions

My first technical job was in a webhosting technical support call center. There I often received calls from users who did not want to perform the basic troubleshooting steps and simply wanted the issue escalated. These users simply felt that they had performed all of the troubleshooting steps themselves and had found an issue beyond first level support.

While sometimes this was true, more often, the issue was something basic that they had overlooked. In that role, I quickly learned that even if the user is reluctant to answer basic or obvious questions, at the end of the day, they simply want their issue resolved. If that meant going through repetitive steps, that was OK, as long as the issue was resolved.

Even today, as I am now the escalation point for senior engineers, I find that many times engineers (even with years of troubleshooting experience under their belt) overlook simple basic steps.

Asking simple questions that might seem basic is sometimes a great time saver, so don't be afraid to ask them.

Attempting to duplicate the issue

One of the best ways to gather information and understand an issue is to experience it. When an issue is reported, it is best to duplicate the issue.

While users can be a source of a lot of information, they are not always the most reliable; oftentimes a user might experience an error and overlook it or simply forget to relay the error when reporting the issue.

Often, one of the first questions I will ask a user is how to recreate the issue. If the user is able to provide this information, I will be able to see any errors and often identify the resolution of the issue faster.

Note

Sometimes duplicating the issue is not possible

While it is always best to duplicate the issue, it is not always possible. Every day, I work with many teams; sometimes, those teams are within the company but many times they are external vendors. Every so often during a critical issue, I will see someone make a blanket statement such as "If we can't duplicate it, we cannot troubleshoot it."

While it is true that duplicating an issue is sometimes the only way to find the root cause, I often hear this statement abused. Duplicating an issue should be viewed like a tool; it is simply one of many tools in your troubleshooting tool belt. If it is not available, then you simply have to make do with another tool.

There is a significant difference between not being able to find a resolution and not attempting to find a resolution due to the inability to duplicate an issue. The latter is not only unhelpful, but also unprofessional.

Running investigatory commands

Most likely, you are reading this book to learn techniques and commands to troubleshoot Red Hat Enterprise Linux systems. The third sub-task in understanding the problem statement is just that—running investigative commands to identify the cause of the issue. Before executing investigatory commands, however, it is important to know that the previous steps are in a logical order.

It is a best practice to first ask the user reporting an issue some basic details of the issue, then after obtaining enough information, duplicate the issue. Once the issue has been duplicated, the next logical step is to run the necessary commands to troubleshoot and investigate the cause of the issue.

It is very common to find yourself returning to previous steps during the troubleshooting process. After you have identified some key errors, you might find that you must ask the original reporter for additional information. When troubleshooting, do not be afraid to take a few steps backwards in order to gain clarity of the issue at hand.
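As a minimal sketch of this sub-task, the following shell function runs a handful of read-only commands and saves their output into a timestamped directory, so that the results can be reviewed later or attached to a ticket. The function name collect_snapshot, the default location under /tmp, and the choice of commands are assumptions made for illustration; the right commands depend on the issue being investigated.

```shell
# Hypothetical helper: save the output of a few read-only commands
# into a new directory so the system state can be reviewed later.
collect_snapshot() {
    # $1 is the parent directory for the snapshot (defaults to /tmp)
    local dest
    dest="${1:-/tmp}/snapshot-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1

    # Each command only reads state; errors are captured in the file
    uptime  > "$dest/uptime.txt" 2>&1
    df -h   > "$dest/df.txt"     2>&1
    free -m > "$dest/free.txt"   2>&1
    ps aux  > "$dest/ps.txt"     2>&1

    # Print the location so it can be noted in the ticket
    echo "$dest"
}
```

Running collect_snapshot /var/tmp, for example, would print the directory that now holds the captured output. Keeping the raw output also makes it easier to return to an earlier step without having to rerun everything.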

Establishing a hypothesis

With the scientific method, once a problem statement has been formulated it is then time to establish a hypothesis. With the troubleshooting process, after you have identified the issue, gathered the information about the issue such as errors, system current state, and so on, it is also time to establish what you believe caused or is causing the issue.

Some issues, however, might not require much of a hypothesis. It is common that errors in log files or the system's current state might answer why the issue occurred. In such scenarios, you can simply resolve the issue and move on to the Documentation step.

For issues that are not cut and dried, you will need to put together a hypothesis of the root cause. This is necessary as the next step after forming a hypothesis is attempting to resolve the issue. It is difficult to resolve an issue if you do not have at least a theory of the root cause.

Here are a few techniques that can be used to help form a hypothesis.

Putting together patterns

While performing data collection during the previous steps, you might start to see patterns. Patterns can be something as simple as similar log entries across multiple services, the type of failure that occurred (such as, multiple services going offline), or even a reoccurring spike in system resource utilization.

These patterns can be used to formulate a theory of the issue. To drive the point home, let's go through a real-world scenario.

You are managing a server that both runs a web application and receives e-mails. You have a monitoring system that detected an error with the web service and created a ticket. While investigating the ticket, you also receive a call from an e-mail user stating they are getting e-mail bounce backs.

When you ask the user to read the error to you they mention No space left on device.

Let's break down this scenario:

A ticket from our monitoring solution has told us Apache is down
We have also received reports from e-mail users with errors indicative of a file system being full

Could all of this mean that Apache is down because the file system is full? Possibly. Should we investigate it? Absolutely!
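One quick way to test this theory is to look at file system usage. The following is a minimal sketch, not a definitive check: it reads the output of df -P and prints every mount point at or above a given percentage. The function name full_filesystems and the 90 percent default are assumptions for illustration, and the parsing assumes that mount points do not contain spaces.

```shell
# Hypothetical helper: read "df -P" output on standard input and print
# every mount point whose usage is at or above the given percentage.
full_filesystems() {
    # $1 is the threshold percentage (defaults to 90)
    awk -v limit="${1:-90}" '
        NR > 1 {                  # skip the header line
            use = $5              # fifth column is the capacity, e.g. 97%
            sub(/%/, "", use)     # drop the percent sign for comparison
            if (use + 0 >= limit + 0) print $6, use "%"
        }'
}

# Typical use on a live system:
#   df -P | full_filesystems 90
```

The same No space left on device error can also appear when a file system runs out of inodes rather than blocks, so it is worth looking at the output of df -i as well before ruling the theory out.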

Is this something that I've encountered before?

The above breakdown leads into the next technique for forming a hypothesis. It might sound simple but is often forgotten. "Have I seen something like this before?"

With the previous scenario, the error reported from the e-mail bounce back was one that generally indicated that a file system was full. How do we know this? Well, simple, we have seen it before. Maybe we have seen that same error with e-mail bounce backs or maybe we have seen the error from other services. The point is, the error is familiar and the error generally means one thing.

Remembering common errors can be extremely useful for the intuitive types such as the Educated Guesser and Adaptor; this is something they tend to naturally perform. For the Data Collector, a handy trick would be to keep a reference table of common errors handy.

Tip

From my experience, most Data Collectors tend to keep a set of notes that contain things such as common commands or steps for procedures. Adding common errors and the meaning behind those errors is a great way for systematic thinkers such as Data Collectors to establish a hypothesis faster.
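Such a reference table can be as simple as a text file that is searched with grep. The sketch below assumes a hypothetical notes file with one entry per line in the form error text | usual meaning; the function name lookup_error, the file location, and the file format are all assumptions made for illustration.

```shell
# Hypothetical helper: search a personal notes file of known errors.
# Assumed file format, one entry per line:  error text | usual meaning
lookup_error() {
    # $1 is the text to look for, $2 is the notes file
    local notes="${2:-$HOME/notes/common-errors.txt}"
    # -i ignores case, -F treats the search text as a literal string
    grep -iF -- "$1" "$notes"
}

# Typical use:
#   lookup_error "no space left"
```

Using -F matters here because error messages often contain characters such as brackets and dots that grep would otherwise treat as part of a regular expression.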

Overall, establishing a hypothesis is important for all types of troubleshooters. This is the area where the intuitive thinkers such as Educated Guessers and Adaptors excel. Generally, those types of troubleshooters will form a hypothesis sooner, even if sometimes those hypotheses are not always correct.

Trial and error

In the scientific method, once a hypothesis is formed, the next stage is experimentation. With troubleshooting, this equates to attempting to resolve the issue.

Some issues are simple and can be resolved using a standard procedure or steps from experience. Other issues, however, are not as simple. Sometimes, the hypothesis turns out to be wrong or the issue ends up being more complicated than initially thought.

In such cases, it might take multiple attempts to resolve the issue. I personally like to think of this as similar to trial and error. In general, you might have an idea of what is wrong (the hypothesis) and an idea on how to resolve it. You attempt to resolve it (trial), and if that doesn't work (error), you move on to the next possible solution.

Start by creating a backup

To those taking up a new role as a Linux Systems Administrator, if there were only one piece of advice I could give, it would be one that most have learned the hard way: back everything up before making changes.

Many times as systems administrators we find ourselves needing to change a configuration file or delete a few unneeded files in order to solve an issue. Unfortunately, we might think we know what needs to be removed or changed but are not always correct.

If a backup was taken, then the change can simply be restored to its previous state; without a backup, however, reverting changes is not as easy.

A backup can consist of many things: it can be a full system backup using something like rdiff-backup, a VM snapshot, or something as simple as creating a copy of a file.
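For the simplest case, copying a single file, a small helper can make the habit easier to keep. This is only a sketch: the function name backup_file and the timestamped .bak naming scheme are assumptions for illustration, not a standard.

```shell
# Hypothetical helper: copy a file next to the original with a timestamp
# suffix before editing it, and print the name of the copy.
backup_file() {
    local src="$1"
    local copy
    copy="$src.$(date +%Y%m%d-%H%M%S).bak"
    # -p preserves mode, ownership, and timestamps where possible
    cp -p -- "$src" "$copy" && echo "$copy"
}

# Typical use before changing a configuration file:
#   backup_file /etc/httpd/conf/httpd.conf
```

Including the time in the name keeps earlier copies from being overwritten when the same file is changed more than once during an incident.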

Tip

For those interested in seeing the extent of this tip in practice, simply run the following command on any server that has more than four systems administrators and has been around for several years:

$ find /etc -name "*.bak"

Getting help

In many cases at this point the issue is resolved, but much like each step in the troubleshooting process, it depends on the issue at hand. While getting help is not exactly a troubleshooting step, it is often the next logical step if you cannot solve the issue on your own.

When looking for help, there are generally six resources available:

Books
Team Wikis or Runbooks
Google
Man pages
Red Hat kernel docs
People

Books

Books (such as this one) are good for referencing commands or troubleshooting steps for particular types of issues. Other books such as the ones that specialize on a specific technology are good for referencing how that technology works. In previous years, it was not uncommon to see a senior admin with a bookshelf full of technical books at his or her disposal.

In today's world, as books are more frequently seen in a digital format, they are even easier to use as references. The digital format makes them searchable and allows readers to find specific sections faster than a traditional printed version.

Team Wikis or Runbooks

Before Team Wikis became common, many operations groups had physical books called Runbooks. These books were a list of processes and procedures used daily by the operations team to keep the production environments operating normally. Sometimes, these Runbooks would contain information for provisioning new servers and sometimes they would be dedicated to troubleshooting.

In today's world, these Runbooks have mostly been replaced by Team Wikis. These Wikis will often have the same content but are online. They also tend to be searchable and easier to keep up to date, which means they are frequently more relevant than a traditional printed Runbook.

The benefit of Team Wikis and Runbooks is that they not only cover issues that are specific to your environment, but also document how to resolve those issues. There are many ways to configure services such as Apache, and there are even more ways that external systems create dependencies on these services.

In some environments, you might be able to simply restart Apache whenever there is an issue, but in others, you might actually have to go through several prerequisite steps. If there is a specific process that needs to be followed before restarting a service, it is a best practice to document the process in either a Team Wiki or Runbook.

Google

Google is such a common tool for systems administrators that at one point there were specific search portals available at google.com/linux, google.com/microsoft, google.com/mac, and google.com/bsd.

Google has deprecated these search portals but that doesn't mean that the number of times systems administrators use Google or any other search engine for troubleshooting has decreased.

In fact, in today's world, it is not uncommon to hear the words "I would Google it" in technical interviews.

A few tips for those new to using Google for systems administration tasks are:

If you copy and paste a full error message (removing the server specific text) you will likely find more relevant results:

For example, searching for kdumpctl: No memory reserved for crash kernel returns 600 results, whereas searching for memory reserved for crash kernel returns 449,000 results.

You can find an online version of any man page by searching for man then a command such as man netstat.
You can wrap an error in double quotes to refine search results to those that contain the same error.
Asking what you're looking for in the form of a question usually results in tutorials. For example, How do you restart Apache on RHEL 7?
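The first and third tips can be combined in a small sketch that removes the server-specific prefix from a log line and wraps the rest in double quotes, ready to paste into a search engine. The function name search_phrase is hypothetical, and the pattern assumes the traditional syslog prefix of month, day, time, and host name; other log formats would need a different pattern.

```shell
# Hypothetical helper: strip the syslog timestamp and host name from a log
# line and wrap what is left in double quotes for an exact-phrase search.
search_phrase() {
    # Assumes the traditional syslog prefix: "Mon DD HH:MM:SS hostname "
    sed -E 's/^[A-Z][a-z]{2} +[0-9]{1,2} [0-9:]{8} [^ ]+ //; s/.*/"&"/'
}

# Typical use with a made-up log line (web01 is a placeholder host name):
#   echo 'Jan 12 10:01:02 web01 kdumpctl: No memory reserved for crash kernel' | search_phrase
```

Other server-specific details, such as process IDs, IP addresses, or file paths inside the message itself, would still need to be removed by hand.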

While Google can be a great resource, the results should always be taken with a grain of salt. Often while searching for an error on Google, you might find a suggested command that offers little explanation but simply says "run this and it will fix it". Be very cautious when running these commands; any command you execute on a system should be a command you are familiar with. You should always know what a command does before executing it.

Man pages

When Google is not available, or even sometimes when it is, the best source of information on commands, or on Linux in general, is the man pages. The man pages are core Linux manual documents that are accessible via the man command.

To look up documentation for the netstat command, for example, simply run the following:

$ man netstat
NETSTAT(8)    Linux System Administrator's Manual    NETSTAT(8)

NAME
       netstat - Print network connections, routing tables, interface statistics, masquerade connections, and multicast memberships

As you can see, this command outputs not only a description of what the netstat command is, but also a quick synopsis of usage information.

Also, it gives detailed descriptions of each flag and what it does:

--route , -r
       Display the kernel routing tables. See the description in route(8) for details. netstat -r and route -e produce the same output.

--groups , -g
       Display multicast group membership information for IPv4 and IPv6.

--interfaces=iface , -I=iface , -i
       Display a table of all network interfaces, or the specified iface.

In general, the base manual pages for the core system and libraries are distributed with the man-pages package. The man pages for specific commands such as top, netstat, or ps are distributed as part of that command's installation package. This is because the documentation of individual commands and components is left to the package maintainers.

This can mean that some commands are not documented to the level of others. In general, however, the man pages are extremely useful sources of information and can answer most day-to-day questions.

Reading a man page

In the previous example, we can see that the man page for netstat includes a few sections of information. In general, man pages have a consistent layout with some common sections that can be found within most man pages. The following is a simple list of some of these common sections:

Name
Synopsis
Description
Examples

Name

The Name section generally contains the name of the command and a very brief description of the command. The following is the name section from the ps command's man page:

NAME
       ps - report a snapshot of the current processes.

Synopsis

The Synopsis section of a command's man page will generally list the command followed by the possible command flags or options. A very good example of this section can be seen in the netstat command's man page.

This section can be very useful as a quick reference for command syntax.

Description

The Description section will often contain a longer description of the command as well as a list and explanation of the various command options. The following snippet is from the cat command's man page:

DESCRIPTION
       Concatenate FILE(s), or standard input, to standard output.

       -A, --show-all
              equivalent to -vET

       -b, --number-nonblank
              number nonempty output lines, overrides -n

The description section is very useful, since it goes beyond simply looking up options. This section is often where you will find documentation about the nuances of commands.

Examples

Often man pages will also include examples of using the command:

EXAMPLES
       cat f - g
              Output f's contents, then standard input, then g's contents.

The preceding is a snippet from the cat command's man page. We can see, in this example, how to use cat to read from files and standard input in one command.

This section is often where I find new ways of using commands that I've used many times before.

Additional sections

In addition to the previous sections, you might also see sections such as See Also, Files, Author, and History. These sections can also contain useful information; however, not every man page will have them.
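Because man pages share this layout, it is possible to pull out just one section rather than paging through the whole document. The following is a rough sketch that treats any unindented line starting with a capital letter as a section heading; the function name man_section is hypothetical, and pages that format their headings differently would not be handled.

```shell
# Hypothetical helper: print one section of a man page read from
# standard input, for example only SYNOPSIS or only EXAMPLES.
man_section() {
    # $1 is the section heading in upper case, as shown in the page
    awk -v want="$1" '
        /^[A-Z]/ { show = ($0 == want); next }   # unindented line is a heading
        show                                     # print lines of the wanted section
    '
}

# Typical use (col -b strips backspace-based bold and underline markup):
#   man ps | col -b | man_section NAME
```

This can be handy when all that is needed is a quick reminder of the syntax in the Synopsis section or a look at the Examples section.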

Info documentation

Along with man pages, Linux systems generally also contain info documentation, which is designed to hold additional documentation that goes beyond what is found within man pages. Much like man pages, the info documentation is included with a command's package, and the quality and quantity of the documentation can vary by package.

The method to invoke the info documentation is similar to man pages; simply execute the info command followed by the subject you wish to view:

$ info gzip
GNU Gzip: General file (de)compression
**************************************

This manual is for GNU Gzip (version 1.5, 10 June 2014), and documents commands for compressing and decompressing data.

Copyright (C) 1998-1999, 2001-2002, 2006-2007, 2009-2012 Free Software Foundation, Inc.

Referencing more than commands

In addition to looking up commands, man pages and info documentation can also be used to view documentation on other items such as system calls or configuration files.

As an example, if you were to use man to search for the term signal, you would see the following:

$ man signal
SIGNAL(2)    Linux Programmer's Manual    SIGNAL(2)

NAME
       signal - ANSI C signal handling

SYNOPSIS
       #include <signal.h>

       typedef void (*sighandler_t)(int);

       sighandler_t signal(int signum, sighandler_t handler);

DESCRIPTION
       The behavior of signal() varies across UNIX versions, and has also varied historically across different versions of Linux. Avoid its use: use sigaction(2) instead. See Portability below.

       signal() sets the disposition of the signal signum to handler, which is either SIG_IGN, SIG_DFL, or the address of a programmer-defined function (a "signal handler").

Signal is a very important system call and a core concept of Linux. Knowing that it is possible to use the man and info commands to look up core Linux concepts and behaviors can be very useful during troubleshooting.

Installing man pages

Red Hat Enterprise Linux based distributions generally include the man-pages package; if your system does not have the man-pages package installed, you can install it with the yum command:

# yum install man-pages

Red Hat kernel docs

In addition to man pages, the Red Hat distribution also has a package called kernel-doc. This package contains quite a bit of information on how the internals of the system work.

The kernel documentation is a set of text files that are placed into /usr/share/doc/kernel-doc-<kernel-version>/ and are categorized by the topic they cover. This resource is quite useful for deeper troubleshooting such as adjusting kernel tunables or understanding how ext4 filesystems utilize the journal.

By default, the kernel-doc package is not installed; however, it can be easily installed using the yum command:

# yum install kernel-doc
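Since the kernel documentation is a large set of plain text files, grep is usually the quickest way into it. The sketch below simply lists the files that mention a term; the function name search_docs is hypothetical, and the glob in the usage comment stands in for the kernel version in the directory name.

```shell
# Hypothetical helper: list the files under a documentation directory
# that mention a term, ignoring case.
search_docs() {
    # $1 is the term, $2 is the directory to search
    # -r recurses into subdirectories, -i ignores case, -l lists file names only
    grep -ril -- "$1" "$2"
}

# Typical use once kernel-doc is installed:
#   search_docs swappiness /usr/share/doc/kernel-doc-*/
```

The resulting list of files gives a starting point for reading up on a tunable before changing it.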

People

Whether it is a friend or a team leader, there is certain etiquette when asking others for help. The following is a list of things that people tend to expect when asked to help solve an issue. When I am asked for help, I would expect you to:

Try to resolve it yourself: When escalating an issue, it is always best to at least try to follow the Understanding the problem statement and Establishing a hypothesis steps of the troubleshooting process.
Document what you've tried: Documentation is key to escalating issues or getting help. The better you document the steps tried and errors found, the faster it will be for others to identify and resolve the issue.
Explain what you think the issue is and what was reported: When you escalate the issue, one of the first things to point out is your hypothesis. Often this can help expedite resolution by leading the next person to a possible solution without having to perform data collection activities.
Mention whether there is anything else that happened to this system recently: Often issues come in pairs, so it is important to highlight all factors of what is happening on the system or systems affected.

The preceding list, while not exhaustive, is important as each of these key pieces of information can help the next person troubleshoot the issue effectively.

Following up

When escalating issues, it is always best to follow up with that other person to find out what they did and how they did it. This is important as it will show the person you asked that you are willing to learn more, which many times will lead to them taking time to explain how they resolved and identified the issue.

Interactions like these will give you more knowledge and help build your systems administration skills and experience.

Documentation

Documentation is a critical step in the troubleshooting process. At every step during the process, it is key to take note and document the actions being performed. Why is it important to document? Three reasons mainly:

When escalating the issue, the more information you have written down the more you can pass on to another
If the issue is a reoccurring issue, the documentation can be used to update a Team Wiki or Runbook
If, in your environment, you perform Root Cause Analysis (RCA), all of this information will be required for an RCA

Depending on environments, the documentation can be anything from simple notes saved in a text file on a local system to required notes for a ticket system. Each work environment is different but a general rule is there is no such thing as too much documentation.

For Data Collectors, this step is fairly natural, as most Data Collector personalities will generally keep quite a few notes for their own personal use. For Educated Guessers, this step might seem unnecessary. However, for any issue that is reoccurring or needs to be escalated, documentation is critical.

What kind of information should be documented? The following list is a good starting point but as with most things in troubleshooting, it depends on the environment and the issue:

The problem statement, as you understand it
The hypothesis of what is causing the issue
Data collected during the information gathering steps:
    Specific errors found
    Relevant system metrics (for example, CPU, Memory, and Disk utilization)
Commands executed during the information gathering steps (within reason, it is not required to include every cd or ls command executed)
Steps taken during attempts to resolve the issue, including specific commands executed

With the preceding items well documented, if the issue reoccurs, it is relatively simple to take the documentation and move it to a Team Wiki. The benefit to this is that a Wiki article can be used by other team members who need to resolve the same issue during reoccurrences.
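Capturing these items is easier when taking a note costs almost nothing. As a minimal sketch, the function below appends a timestamped line to a running notes file; the function name note, the NOTES_FILE variable, and the default file location are all assumptions made for illustration.

```shell
# Hypothetical helper: append a timestamped entry to a running notes file
# so that commands, errors, and observations are captured as you go.
note() {
    # NOTES_FILE is an assumed variable naming the file to append to
    local file="${NOTES_FILE:-$HOME/troubleshooting-notes.txt}"
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$file"
}

# Typical use during an incident (web01 is a placeholder host name):
#   note "Problem statement: Apache down on web01, reported by monitoring"
#   note "Ran: df -h and found /var completely full"
```

The timestamps also come in handy later when a timeline of events is needed. For a complete record of a terminal session, the script command can capture everything typed and displayed, although the result usually needs trimming before it goes into a ticket or Wiki.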

One of the three reasons listed previously for documentation is to use the documentation during Root Cause Analysis, which leads to our next topic—Establishing a Root Cause Analysis.

Root cause analysis

Root cause analysis is a process that is performed after incidents occur. The goal of the RCA process is to identify the root cause of an incident and identify any possible corrective actions to prevent the same incident from occurring again. These corrective actions might range from something as simple as establishing user training to reconfiguring Apache across all web servers.

The RCA process is not unique to technology and is a widely practiced process in fields such as aviation and occupational safety. In these fields, an incident is often more than simply a few computers being offline. They are incidents where a person's life might have been at risk.

The anatomy of a good RCA

Different work environments might implement RCA processes differently but at the end of the day there are a few key elements in every good RCA:

The problem as it was reported
The actual root cause of the problem
A timeline of events and actions taken
Any key data points
A plan of action to prevent the incident from reoccurring

The problem as it was reported

One of the first
