Mastering Linux Kernel Development - Raghu Bharadwaj - E-Book

Description

Explore the implementation of core kernel subsystems

About This Book

  • Master the design, components, and structures of core kernel subsystems
  • Explore kernel programming interfaces and related algorithms under the hood
  • Completely updated material for the 4.12.10 kernel

Who This Book Is For

If you are a kernel programmer with knowledge of kernel APIs who is looking to build a comprehensive understanding of kernel subsystems and is eager to explore their implementation, this book is for you. It sets out to unravel the underlying details of kernel APIs and data structures, piercing through the complex kernel layers to give you the edge you need to take your skills to the next level.

What You Will Learn

  • Comprehend processes and files, the core abstraction mechanisms of the Linux kernel that promote effective simplification and dynamism
  • Decipher process scheduling and understand effective capacity utilization under general and real-time dispositions
  • Simplify and learn more about process communication techniques through signals and IPC mechanisms
  • Capture the rudiments of memory by grasping the key concepts and principles of physical and virtual memory management
  • Take a sharp and precise look at all the key aspects of interrupt management and the clock subsystem
  • Understand concurrent execution on SMP platforms through kernel synchronization and locking techniques

In Detail

Mastering Linux Kernel Development looks at the Linux kernel, its internal arrangement and design, and various core subsystems, helping you to gain significant understanding of this open source marvel. You will look at how the Linux kernel, which possesses a kind of collective intelligence thanks to its scores of contributors, remains so elegant owing to its great design.

This book also looks at all the key kernel code, core data structures, functions, and macros, giving you a comprehensive foundation in the implementation details of the kernel's core services and mechanisms. You will also look at the Linux kernel as an example of well-designed software, offering insights into software designs in general that are easily scalable yet fundamentally strong and safe.

By the end of this book, you will have considerable understanding of and appreciation for the Linux kernel.

Style and approach

Each chapter begins with the basic conceptual know-how for a subsystem and extends into the details of its implementation. We use appropriate code excerpts of critical routines and data structures for subsystems.


Page count: 419

Publication year: 2017




Mastering Linux Kernel Development
A kernel developer's reference manual
Raghu Bharadwaj
BIRMINGHAM - MUMBAI

Mastering Linux Kernel Development

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: October 2017

 

Production reference: 1091017

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78588-305-7

 

www.packtpub.com

Credits

Author

Raghu Bharadwaj

Copy Editor

Madhusudan Uchil

Reviewer

Rami Rosen

Project Coordinator

Virginia Dias

Commissioning Editor

Kartikey Pandey

Proofreader

Safis Editing

Acquisition Editor

Rahul Nair

Indexer

Francy Puthiry

Content Development Editor

Sharon Raj

Graphics

Kirk D'Penha

Technical Editor

Mohit Hassija

Production Coordinator

Arvindkumar Gupta

About the Author

Raghu Bharadwaj is a leading consultant, contributor, and corporate trainer on the Linux kernel with experience spanning close to two decades. He is an ardent kernel enthusiast and expert, and has been closely following the Linux kernel since the late 90s. He is the founder of TECH VEDA, which specializes in engineering and skilling services on the Linux kernel, through technical support, kernel contributions, and advanced training. His precise understanding and articulation of the kernel has been a hallmark, and his penchant for software designs and OS architectures has garnered him special mention from his clients. Raghu is also an expert in delivering solution-oriented, customized training programs for engineering teams working on the Linux kernel, Linux drivers, and Embedded Linux. Some of his clients include major technology companies such as Xilinx, GE, Canon, Fujitsu, UTC, TCS, Broadcom, Sasken, Qualcomm, Cognizant, STMicroelectronics, Stryker, and Lattice Semiconductors.

 

 

I would first like to thank Packt for giving me the opportunity to come up with this book. I extend my sincere regards to all the editors (Sharon and the team) at Packt for rallying behind me and ensuring that I stayed on time and in line in delivering precise, crisp, and up-to-date information through this book. I would also like to thank my family, who supported me throughout my busy schedules. Lastly, but most importantly, I would like to thank my team at TECH VEDA, who not only supported me but also contributed in their own ways through valuable suggestions and feedback.

About the Reviewer

Rami Rosen is the author of Linux Kernel Networking - Implementation and Theory, a book published by Apress in 2013. Rami has worked for more than 20 years in high-tech companies, starting his way in three startups. Most of his work (past and present) is around kernel and userspace networking and virtualization projects, ranging from device drivers, the kernel network stack, and DPDK to NFV and OpenStack. Occasionally, he gives talks at international conferences and writes articles for LWN.net, the Linux Journal, and more.

 

I thank my wife, Yoonhwa, who allowed me to spend weekends reviewing this book.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

 

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785883054.

If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Errata

Piracy

Questions

Comprehending Processes, Address Space, and Threads

Processes

The illusion called address space

Kernel and user space

Process context

Process descriptors

Process attributes - key elements

state

pid

tgid

thread info

flags

exit_code and exit_signal

comm

ptrace

Process relations - key elements

real_parent and parent

children

sibling

group_leader

Scheduling attributes - key elements

prio and static_prio

se, rt, and dl

policy

cpus_allowed

rt_priority

Process limits - key elements

File descriptor table - key elements

fs

files

Signal descriptor - key elements

signal

sighand

sigset_t blocked, real_blocked

pending

sas_ss_sp

sas_ss_size

Kernel stack

The issue of stack overflow

Process creation

fork()

Copy-on-write (COW)

exec

vfork()

Linux support for threads

clone()

Kernel threads

do_fork() and copy_process()

Process status and termination

wait

exit

Namespaces and cgroups

Mount namespaces

UTS namespaces

IPC namespaces

PID namespaces

Network namespaces

User namespaces

Cgroup namespaces

Control groups (cgroups)

Summary

Deciphering the Process Scheduler

Process schedulers

Linux process scheduler design

Runqueue

The scheduler's entry point

Process priorities

Scheduler classes

Completely Fair Scheduling class (CFS)

Computing priorities and time slices under CFS

CFS's runqueue

Group scheduling

Scheduling entities under many-core systems

Scheduling policies

Real-time scheduling class

FIFO

RR

Real-time group scheduling

Deadline scheduling class (sporadic task model deadline scheduling)

Scheduler related system calls

Processor affinity calls

Process preemption

Summary

Signal Management

Signals

Signal-management APIs

Raising signals from a program

Waiting for queued signals

Signal data structures

Signal descriptors

Blocked and pending queues

Signal handler descriptor

Signal generation and delivery

Signal-generation calls

Signal delivery

Executing user-mode handlers

Setting up user-mode handler frames

Restarting interrupted system calls

Summary

Memory Management and Allocators

Initialization operations

Page descriptor

Flags

Mapping

Zones and nodes

Memory zones

Memory nodes

Node descriptor structure

Zone descriptor structure

Memory allocators

Page frame allocator

Buddy system

GFP mask

Zone modifiers

Page mobility and placement

Watermark modifiers

Page reclaim modifiers

Action modifiers

Type flags

Slab allocator

Kmalloc caches

Object caches

Cache management

Cache layout - generic

Slub data structures

Vmalloc

Contiguous Memory Allocator (CMA)

Summary

Filesystems and File I/O

Filesystem - high-level view

Metadata

Inode (index node)

Data block map

Directories

Superblock

Operations

Mount and unmount operations

File creation and deletion operations

File open and close operations

File read and write operations

Additional features

Extended file attributes

Filesystem consistency and crash recovery

Access control lists (ACLs)

Filesystems in the Linux kernel

Ext family filesystems

Ext2

Ext3

Ext4

Common filesystem interface

VFS structures and operations

struct superblock

struct inode

Struct dentry

struct file

Special filesystems

Procfs

Sysfs

Debugfs

Summary

Interprocess Communication

Pipes and FIFOs

pipefs

Message queues

System V message queues

Data structures

POSIX message queues

Shared memory

System V shared memory

Operation interfaces

Allocating shared memory

Attaching a shared memory

Detaching shared memory

Data structures

POSIX shared memory

Semaphores

System V semaphores

Data structures

POSIX semaphores

Summary

Virtual Memory Management

Process address space

Process memory descriptor

Managing virtual memory areas

Locating a VMA

Merging VMA regions

struct address_space

Page tables

Summary

Kernel Synchronization and Locking

Atomic operations

Atomic integer operations

Atomic bitwise operations

Introducing exclusion locks

Spinlocks

Alternate spinlock APIs

Reader-writer spinlocks

Mutex locks

Debug checks and validations

Wait/wound mutexes

Operation interfaces:

Semaphores

Reader-writer semaphores

Sequence locks

API

Completion locks

Initialization

Waiting for completion

Signalling completion

Summary

Interrupts and Deferred Work

Interrupt signals and vectors

Programmable interrupt controller

Interrupt controller operations

IRQ descriptor table

High-level interrupt-management interfaces

Registering an interrupt handler

Deregistering an interrupt handler

Threaded interrupt handlers

Control interfaces

IRQ stacks

Deferred work

Softirqs

Tasklets

Workqueues

Interface API

Creating dedicated workqueues

Summary

Clock and Time Management

Time representation

Timing hardware

Real-time clock (RTC)

Timestamp counter (TSC)

Programmable interrupt timer (PIT)

CPU local timer

High-precision event timer (HPET)

ACPI power management timer (ACPI PMT)

Hardware abstraction

Calculating elapsed time

Linux timekeeping data structures, macros, and helper routines

Jiffies

Timeval and timespec

Tracking and maintaining time

Tick and interrupt handling

Tick devices

Software timers and delay functions

Dynamic timers

Race conditions with dynamic timers

Dynamic timer handling

Delay functions

POSIX clocks

Summary

Module Management

Kernel modules

Elements of an LKM

Binary layout of a LKM

Load and unload operations

Module data structures

Memory layout

Summary

Preface

Mastering Linux Kernel Development looks at the Linux kernel, its internal arrangement and design, and various core subsystems, helping you to gain significant understanding of this open source marvel. You will look at how the Linux kernel, which possesses a kind of collective intelligence thanks to its scores of contributors, remains so elegant owing to its great design.

This book also looks at all the key kernel code, core data structures, functions, and macros, giving you a comprehensive foundation of the implementation details of the kernel's core services and mechanisms. You will also look at the Linux kernel as well-designed software, which gives us insights into software design in general that are easily scalable yet fundamentally strong and safe.

What this book covers

Chapter 1, Comprehending Processes, Address Space, and Threads, looks closely at one of the principal abstractions of Linux, called the process, and the whole ecosystem that facilitates this abstraction. We will also spend time understanding address space, process creation, and threads.

Chapter 2, Deciphering the Process Scheduler, explains process scheduling, which is a vital aspect of any operating system. Here we will build our understanding of the different scheduling policies engaged by Linux to deliver effective process execution.

Chapter 3, Signal Management, helps in understanding all core aspects of signal usage, their representation, data structures, and kernel routines for signal generation and delivery.

Chapter 4, Memory Management and Allocators, takes us through one of the most crucial aspects of the Linux kernel, comprehending the various nuances of memory representation and allocation. We will also gauge the efficiency of the kernel in maximizing resource usage at minimal cost.

Chapter 5, Filesystems and File I/O, imparts a generic understanding of a typical filesystem, its fabric, design, and what makes it an elemental part of an operating system. We will also look at abstraction, using the common, layered architecture design, which the kernel comprehensively imbibes through the VFS.

Chapter 6, Interprocess Communication, touches upon the various IPC mechanisms offered by the kernel. We will explore the layout and relationship between various data structures for each IPC mechanism, and look at both the SysV and POSIX IPC mechanisms.

Chapter 7, Virtual Memory Management, explains memory management with details of virtual memory management and page tables. We will look into the various aspects of the virtual memory subsystem such as process virtual address space and its segments, memory descriptor structure, memory mapping and VMA objects, page cache and address translation with page tables.

Chapter 8, Kernel Synchronization and Locking, enables us to understand the various protection and synchronization mechanisms provided by the kernel, and comprehend the merits and shortcomings of these mechanisms. We will try and appreciate the tenacity with which the kernel addresses these varying synchronization complexities.

Chapter 9, Interrupts and Deferred Work, talks about interrupts, which are a key facet of any operating system for getting necessary and priority tasks done. We will look at how interrupts are generated, handled, and managed in Linux. We will also look at various bottom-half mechanisms.

Chapter 10, Clock and Time Management, reveals how the kernel measures and manages time. We will look at all the key time-related structures, routines, and macros to help us gauge time management effectively.

Chapter 11, Module Management, takes a quick look at modules and the kernel's infrastructure for managing them, along with all the core data structures involved. This helps us understand how the kernel provides for dynamic extensibility.

What you need for this book

Apart from a deep desire to understand the nuances of the Linux kernel and its design, you need a prior understanding of the Linux operating system in general, and of the idea of open-source software, to start spending time with this book. However, this is not binding, and anyone with a keen eye for detailed information about the Linux system and its workings can use this book.

Who this book is for

This book is for system programming enthusiasts and professionals who would like to deepen their understanding of the Linux kernel and its various integral components.

This is a handy book for developers working on various kernel-related projects.

Students of software engineering can use this as a reference guide for comprehending various aspects of the Linux kernel and its design principles.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The process descriptor is an instance of type struct task_struct, defined in <linux/sched.h>."

A block of code is set as follows:

/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
        unsigned long flags;    /* low level flags */
};

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Go to Sketch | Include Library | Manage Libraries and you will get a dialog."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Comprehending Processes, Address Space, and Threads

Since kernel services are invoked in the context of the current process, the layout of a process throws open the right path for exploring the kernel in more detail. Our effort in this chapter centers on comprehending processes and the underlying ecosystem the kernel provides for them. We will explore the following concepts in this chapter:

Program to process

Process layout

Virtual address spaces

Kernel and user space

Process APIs

Process descriptors

Kernel stack management

Threads

Linux thread API

Data structures

Namespace and cgroups

Processes

Quintessentially, computing systems are designed, developed, and often tweaked for running user applications efficiently. Every element that goes into a computing platform is intended to enable effective and efficient ways for running applications. In other words, computing systems exist to run diverse application programs. Applications can run either as firmware in dedicated devices or as a "process" in systems driven by system software (operating systems).

At its core, a process is a running instance of a program in memory. The transformation from a program to a process happens when the program (on disk) is fetched into memory for execution.

A program’s binary image carries code (with all its binary instructions) and data (with all global data), which are mapped to distinct regions of memory with appropriate access permissions (read, write, and execute). Apart from code and data, a process is assigned additional memory regions called stack (for allocation of function call frames with auto variables and function arguments) and heap for dynamic allocations at runtime.
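As a rough illustration of these regions, the following user-space sketch (not from the book; the function and variable names are my own) prints one representative address from each region. The exact addresses vary from run to run, especially with address-space layout randomization enabled.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

int global_counter = 42;   /* data region: an initialized global */

/* Print one representative address from each region of the process image. */
void show_regions(void)
{
    int local = 0;                            /* stack: auto variable in this call frame */
    int *dynamic = malloc(sizeof(*dynamic));  /* heap: runtime allocation */
    assert(dynamic != NULL);

    printf("code  : %p\n", (void *)&show_regions);   /* text (code) region */
    printf("data  : %p\n", (void *)&global_counter);
    printf("heap  : %p\n", (void *)dynamic);
    printf("stack : %p\n", (void *)&local);

    free(dynamic);
}
```

Comparing the printed values against /proc/&lt;pid&gt;/maps shows which mapped region each address falls into.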

Multiple instances of the same program can exist with their respective memory allocations. For instance, for a web browser with multiple open tabs (running simultaneous browsing sessions), each tab is considered a process instance by the kernel, with unique memory allocations.

The following figure represents the layout of processes in memory:

The illusion called address space

Modern-day computing platforms are expected to handle a plethora of processes efficiently. Operating systems thus must deal with allocating unique memory to all contending processes within the physical memory (often finite) and also ensure their reliable execution. With multiple processes contending and executing simultaneously (multi-tasking), the operating system must ensure that the memory allocation of every process is protected from accidental access by another process.

To address this issue, the kernel provides a level of abstraction between the process and the physical memory called the virtual address space. The virtual address space is the process's view of memory; it is how the running program views the memory.

Virtual address space creates an illusion that every process exclusively owns the whole memory while executing. This abstracted view of memory is called virtual memory and is achieved by the kernel's memory manager in coordination with the CPU's MMU. Each process is given a contiguous 32- or 64-bit address space, bound by the architecture and unique to that process. With each process caged into its virtual address space by the MMU, any attempt by a process to access an address region outside its boundaries will trigger a hardware fault, making it possible for the memory manager to detect and terminate violating processes, thus ensuring protection.

The following figure depicts the illusion of address space created for every contending process:
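This illusion can be observed from user space: after fork(), parent and child see the same virtual address for a global variable, yet a write in the child does not disturb the parent's copy, since the MMU backs the same virtual address with distinct physical memory. A minimal Linux sketch (the helper name is mine):

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

int shared_value = 1;  /* same virtual address in parent and child after fork() */

/* Fork, let the child overwrite the global, and return the parent's view. */
int parent_value_after_child_write(void)
{
    pid_t pid = fork();
    assert(pid >= 0);

    if (pid == 0) {
        shared_value = 99;  /* lands in the child's private copy of the page */
        _exit(0);
    }

    int status;
    waitpid(pid, &status, 0);

    /* Identical virtual address, distinct physical page once written to. */
    return shared_value;
}
```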

Kernel and user space

Modern operating systems not only prevent one process from accessing another but also prevent processes from accidentally accessing or manipulating kernel data and services (as the kernel is shared by all the processes).

Operating systems achieve this protection by segmenting the whole memory into two logical halves, the user and kernel space. This bifurcation ensures that all processes that are assigned address spaces are mapped to the user space section of memory and kernel data and services run in kernel space. The kernel achieves this protection in coordination with the hardware. While an application process is executing instructions from its code segment, the CPU is operating in user mode. When a process intends to invoke a kernel service, it needs to switch the CPU into privileged mode (kernel mode), which is achieved through special functions called APIs (application programming interfaces). These APIs enable user processes to switch into the kernel space using special CPU instructions and then execute the required services through system calls. On completion of the requested service, the kernel executes another mode switch, this time back from kernel mode to user mode, using another set of CPU instructions.

System calls are the kernel's interfaces for exposing its services to application processes; they are also called kernel entry points. As system calls are implemented in kernel space, the respective handlers are provided through APIs in the user space. API abstraction also makes it easier and more convenient to invoke related system calls.
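The wrapper/raw distinction can be seen with a Linux-specific sketch (function names are mine): getpid() is the user-space API wrapper, while syscall(2) issues the same request directly through the raw system-call entry point, and both return the same value.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* getpid() is the glibc API wrapper; syscall(2) performs the same
 * kernel entry without the wrapper's convenience layer. */
long pid_via_wrapper(void)  { return (long)getpid(); }
long pid_via_raw_call(void) { return syscall(SYS_getpid); }
```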

The following figure depicts a virtualized memory view:

Process context

When a process requests a kernel service through a system call, the kernel will execute on behalf of the caller process. The kernel is now said to be executing in process context. Similarly, the kernel also responds to interrupts raised by other hardware entities; here, the kernel executes in interrupt context. When in interrupt context, the kernel is not running on behalf of any process.

Process descriptors

Right from the time a process is born until it exits, it’s the kernel's process management subsystem that carries out various operations, ranging from process creation, allocating CPU time, and event notifications to destruction of the process upon termination.

Apart from the address space, a process in memory is also assigned a data structure called the process descriptor, which the kernel uses to identify, manage, and schedule the process. The following figure depicts process address spaces with their respective process descriptors in the kernel:

In Linux, a process descriptor is an instance of type struct task_struct, defined in <linux/sched.h>. It is one of the central data structures and contains all the attributes, identification details, and resource allocation entries that a process holds. Looking at struct task_struct is like peeking through a window into what the kernel sees and works with to manage and schedule a process.
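To make the discussion concrete, here is a hypothetical, heavily simplified mirror of the members covered in this chapter. This is a sketch for orientation only; the real struct task_struct in <linux/sched.h> has hundreds of members spanning every kernel subsystem.

```c
#include <string.h>

/* Hypothetical, simplified sketch of the task_struct members discussed
 * in this chapter; the real definition lives in <linux/sched.h>. */
struct task_sketch {
    long         state;       /* -1 unrunnable, 0 runnable, >0 stopped */
    int          pid;         /* unique process identifier */
    int          tgid;        /* thread group identifier */
    unsigned int flags;       /* per-process PF_* flags */
    int          exit_code;   /* value collected by the waiting parent */
    char         comm[16];    /* executable name */
};

/* A new single-threaded task starts with PID == TGID. */
struct task_sketch make_task(int pid, const char *name)
{
    struct task_sketch t = { .state = 0, .pid = pid, .tgid = pid };
    strncpy(t.comm, name, sizeof(t.comm) - 1);
    t.comm[sizeof(t.comm) - 1] = '\0';
    return t;
}
```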

Since the task structure contains a wide set of data elements, which are related to the functionality of various kernel subsystems, it would be out of context to discuss the purpose and scope of all the elements in this chapter. We shall consider a few important elements that are related to process management.

Process attributes - key elements

Process attributes define all the key and fundamental characteristics of a process. These elements contain the process's state and identifications along with other key values of importance.

state

From the time it is spawned until it exits, a process may exist in various states, referred to as process states, which define the process's current state:

TASK_RUNNING (0): The task is either executing or contending for CPU in the scheduler run-queue.

TASK_INTERRUPTIBLE (1): The task is in an interruptible wait state; it remains waiting until an awaited condition becomes true, such as the availability of mutual exclusion locks, a device becoming ready for I/O, the lapse of sleep time, or an exclusive wake-up call. While in this wait state, any signals generated for the process are delivered, causing it to wake up before the wait condition is met.

TASK_KILLABLE: This is similar to TASK_INTERRUPTIBLE, with the exception that interruptions can only occur on fatal signals, which makes it a better alternative to TASK_INTERRUPTIBLE.

TASK_UNINTERRUPTIBLE (2): The task is in an uninterruptible wait state similar to TASK_INTERRUPTIBLE, except that signals generated for the sleeping process do not cause wake-up. When the event occurs for which it is waiting, the process transitions to TASK_RUNNING. This process state is rarely used.

TASK_STOPPED (4): The task has received a STOP signal. It will be back to running on receiving a continue signal (SIGCONT).

TASK_TRACED (8): A process is said to be in the traced state when it is being probed, most likely by a debugger.

EXIT_ZOMBIE (32): The process has terminated, but its resources are not yet reclaimed.

EXIT_DEAD (16): The child has terminated and all the resources held by it are freed, after the parent collects the exit status of the child using wait.
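These states can be observed from user space through the one-letter state field the kernel exports in /proc/&lt;pid&gt;/stat (R, S, D, T, Z, and so on). A Linux-specific sketch (the function name is mine) reads the caller's own state, which is necessarily running:

```c
#include <stdio.h>

/* Parse the third field of /proc/self/stat: "pid (comm) state ..." */
char read_own_state(void)
{
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f)
        return '?';

    int pid;
    char comm[64], state = '?';
    if (fscanf(f, "%d %63s %c", &pid, comm, &state) != 3)
        state = '?';
    fclose(f);
    return state;
}
```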

The following figure depicts process states:

pid

This field contains a unique process identifier, referred to as the PID. PIDs in Linux are of the type pid_t (integer). Though a PID is an integer, the default maximum number of PIDs is 32,768, specified through the /proc/sys/kernel/pid_max interface. The value in this file can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million).

To manage PIDs, the kernel uses a bitmap. This bitmap allows the kernel to keep track of PIDs in use and assign a unique PID for new processes. Each PID is identified by a bit in the PID bitmap; the value of a PID is determined from the position of its corresponding bit. Bits with value 1 in the bitmap indicate that the corresponding PIDs are in use, and those with value 0 indicate free PIDs. Whenever the kernel needs to assign a unique PID, it looks for the first unset bit and sets it to 1, and conversely to free a PID, it toggles the corresponding bit from 1 to 0.
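A toy version of this scheme can be sketched in user space as follows (the size and names are mine; the kernel's bitmap is larger and far more optimized): allocation finds the first clear bit and sets it, and freeing toggles the bit back to zero.

```c
#include <limits.h>

#define MAX_PIDS 256
static unsigned char pid_map[MAX_PIDS / CHAR_BIT];  /* 1 bit per PID */

/* Find the first clear bit, set it, and return its position as the PID. */
int alloc_pid(void)
{
    for (int pid = 0; pid < MAX_PIDS; pid++) {
        unsigned char mask = 1u << (pid % CHAR_BIT);
        if (!(pid_map[pid / CHAR_BIT] & mask)) {
            pid_map[pid / CHAR_BIT] |= mask;  /* mark in use */
            return pid;
        }
    }
    return -1;  /* bitmap exhausted: all PIDs in use */
}

/* Toggle the corresponding bit back to 0 to mark the PID free. */
void free_pid(int pid)
{
    pid_map[pid / CHAR_BIT] &= (unsigned char)~(1u << (pid % CHAR_BIT));
}
```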

tgid

This field contains the thread group ID. For easy understanding, let's say that when a new process is created, its PID and TGID are the same, as the process happens to be the only thread. When the process spawns a new thread, the new child gets a unique PID but inherits the TGID from the parent, as it belongs to the same thread group. The TGID is primarily used to support multi-threaded processes. We will delve into further details in the threads section of this chapter.
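The PID/TGID split is visible from user space on Linux: gettid (invoked below through the raw syscall(2) interface) returns the per-thread kernel PID, while getpid() returns the shared TGID. A sketch assuming a POSIX threads environment (the function names are mine):

```c
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* Each thread has its own kernel PID, retrievable via gettid. */
    *(long *)arg = syscall(SYS_gettid);
    return NULL;
}

/* Return 1 if the spawned thread got a distinct PID while getpid()
 * (the TGID) stayed the same for both threads. */
int tgid_shared_pid_unique(void)
{
    pthread_t th;
    long worker_tid = 0;

    if (pthread_create(&th, NULL, worker, &worker_tid) != 0)
        return -1;
    pthread_join(th, NULL);

    long main_tid = syscall(SYS_gettid);

    return main_tid == (long)getpid() && worker_tid != main_tid;
}
```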

thread info

This field holds processor-specific state information, and is a critical element of the task structure. Later sections of this chapter contain details about the importance of thread_info.

flags

The flags field records various attributes corresponding to a process. Each bit in the field corresponds to various stages in the lifetime of a process. Per-process flags are defined in <linux/sched.h>:

#define PF_EXITING           /* getting shut down */
#define PF_EXITPIDONE        /* pi exit done on shut down */
#define PF_VCPU              /* I'm a virtual CPU */
#define PF_WQ_WORKER         /* I'm a workqueue worker */
#define PF_FORKNOEXEC        /* forked but didn't exec */
#define PF_MCE_PROCESS       /* process policy on mce errors */
#define PF_SUPERPRIV         /* used super-user privileges */
#define PF_DUMPCORE          /* dumped core */
#define PF_SIGNALED          /* killed by a signal */
#define PF_MEMALLOC          /* Allocating memory */
#define PF_NPROC_EXCEEDED    /* set_user noticed that RLIMIT_NPROC was exceeded */
#define PF_USED_MATH         /* if unset the fpu must be initialized before use */
#define PF_USED_ASYNC        /* used async_schedule*(), used by module init */
#define PF_NOFREEZE          /* this thread should not be frozen */
#define PF_FROZEN            /* frozen for system suspend */
#define PF_FSTRANS           /* inside a filesystem transaction */
#define PF_KSWAPD            /* I am kswapd */
#define PF_MEMALLOC_NOIO     /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE     /* Throttle me less: I clean memory */
#define PF_KTHREAD           /* I am a kernel thread */
#define PF_RANDOMIZE         /* randomize virtual address space */
#define PF_SWAPWRITE         /* Allowed to write to swap */
#define PF_NO_SETAFFINITY    /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY         /* Early kill for mce process policy */
#define PF_MUTEX_TESTER      /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP      /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK      /* this thread called freeze_processes and should not be frozen */

exit_code and exit_signal

These fields contain the exit value of the task and details of the signal that caused the termination. These fields are accessed by the parent process through wait() on termination of the child.

comm

This field holds the name of the binary executable used to start the process.

ptrace

This field is enabled and set when the process is put into trace mode using the ptrace() system call.

Process relations - key elements

Every process can be related to a parent process, establishing a parent-child relationship. Similarly, multiple processes spawned by the same process are called siblings. These fields establish how the current process relates to another process.

real_parent and parent

These are pointers to the parent's task structure. For a normal process, both these pointers refer to the same task_struct; they differ only for multi-threaded processes implemented using POSIX threads. In such cases, real_parent refers to the parent thread's task structure and parent refers to the task structure of the process to which SIGCHLD is delivered.

children

This is a pointer to a list of child task structures.

sibling

This is a pointer to a list of sibling task structures.

group_leader

This is a pointer to the task structure of the process group leader.

Scheduling attributes - key elements

All contending processes must be given fair CPU time, and this calls for scheduling based on time slices and process priorities. These attributes contain necessary information that the scheduler uses when deciding on which process gets priority when contending.

prio and static_prio

prio helps determine the priority of the process for scheduling. This field holds the static priority of the process within the range 1 to 99 (as specified by sched_setscheduler()) if the process is assigned a real-time scheduling policy. For normal processes, this field holds a dynamic priority derived from the nice value.

se, rt, and dl

Every task belongs to a scheduling entity (group of tasks), as scheduling is done at a per-entity level. se is for all normal processes, rt is for real-time processes, and dl is for deadline processes. We will discuss more on these attributes in the next chapter on scheduling.

policy

This field contains information about the scheduling policy of the process, which helps in determining its priority.

cpus_allowed

This field specifies the CPU mask for the process, that is, on which CPU(s) the process is eligible to be scheduled in a multi-processor system.

rt_priority

This field specifies the priority to be applied by real-time scheduling policies. For non-real-time processes, this field is unused.

Process limits - key elements

The kernel imposes resource limits to ensure fair allocation of system resources among contending processes. These limits guarantee that a random process does not monopolize ownership of resources. There are 16 different types of resource limits, and the task structure points to an array of type struct rlimit, in which each offset holds the current and maximum values for a specific resource.

/* include/uapi/linux/resource.h */
struct rlimit {
        __kernel_ulong_t rlim_cur;
        __kernel_ulong_t rlim_max;
};

These limits are specified in

include/uapi/asm-generic/resource.h

#define RLIMIT_CPU        0  /* CPU time in sec */
#define RLIMIT_FSIZE      1  /* Maximum filesize */
#define RLIMIT_DATA       2  /* max data size */
#define RLIMIT_STACK      3  /* max stack size */
#define RLIMIT_CORE       4  /* max core file size */
#ifndef RLIMIT_RSS
# define RLIMIT_RSS       5  /* max resident set size */
#endif
#ifndef RLIMIT_NPROC
# define RLIMIT_NPROC     6  /* max number of processes */
#endif
#ifndef RLIMIT_NOFILE
# define RLIMIT_NOFILE    7  /* max number of open files */
#endif
#ifndef RLIMIT_MEMLOCK
# define RLIMIT_MEMLOCK   8  /* max locked-in-memory address space */
#endif
#ifndef RLIMIT_AS
# define RLIMIT_AS        9  /* address space limit */
#endif
#define RLIMIT_LOCKS      10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE   12 /* maximum bytes in POSIX mqueues */
#define RLIMIT_NICE       13 /* max nice prio allowed to raise to 0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO     14 /* maximum realtime priority */
#define RLIMIT_RTTIME     15 /* timeout for RT tasks in us */
#define RLIM_NLIMITS      16

File descriptor table - key elements

During the lifetime of a process, it may access various resource files to get its task done. This results in the process opening, closing, reading, and writing to these files. The system must keep track of these activities; file descriptor elements help the system know which files the process holds.

fs

Filesystem information is stored in this field.

files

The file descriptor table contains pointers to all the files that a process opens to perform various operations. The files field contains a pointer, which points to this file descriptor table.

Signal descriptor - key elements

For processes to handle signals, the task structure has various elements that determine how the signals must be handled.

signal

This is of type struct signal_struct, which contains information on all the signals associated with the process.

sighand

This is of type struct sighand_struct, which contains all signal handlers associated with the process.

sigset_t blocked, real_blocked

These elements identify signals that are currently masked or blocked by the process.

pending

This is of type struct sigpending, which identifies signals which are generated but not yet delivered.

sas_ss_sp

This field contains a pointer to an alternate stack, which facilitates signal handling.

sas_ss_size

This field holds the size of the alternate stack used for signal handling.

Kernel stack

With current-generation computing platforms powered by multi-core hardware capable of running simultaneous applications, multiple processes can concurrently initiate a kernel mode switch while requesting the same kernel service. To handle such situations, kernel services are designed to be re-entrant, allowing multiple processes to step in and engage the required services. This mandates that each requesting process maintain its own private kernel stack to keep track of its kernel function call sequence, store local data of the kernel functions, and so on.

The kernel stack is directly mapped to physical memory, mandating that it be arranged in a physically contiguous region. The kernel stack is by default 8 KB for x86-32 and most other 32-bit systems (with an option of a 4 KB kernel stack configurable during kernel build), and 16 KB on x86-64 systems.

When kernel services are invoked in the current process context, they need to validate the process's privileges before committing to any relevant operations. To perform such validations, kernel services must gain access to the task structure of the current process and look through the relevant fields. Similarly, kernel routines might need access to the current task structure to modify various resource structures, such as the signal handler table, the file descriptor table, and the memory descriptor, or to look for pending signals, among other things. To enable access to the task structure at runtime, the address of the current task structure is loaded into a processor register (the register chosen is architecture specific) and made available through a kernel global macro called current (defined in the architecture-specific kernel header asm/current.h):

/* arch/ia64/include/asm/current.h */
#ifndef _ASM_IA64_CURRENT_H
#define _ASM_IA64_CURRENT_H
/*
 * Modified 1998-2000
 *   David Mosberger-Tang <[email protected]>, Hewlett-Packard Co
 */
#include <asm/intrinsics.h>
/*
 * In kernel mode, thread pointer (r13) is used to point to the current task
 * structure.
 */
#define current ((struct task_struct *) ia64_getreg(_IA64_REG_TP))
#endif /* _ASM_IA64_CURRENT_H */

/* arch/powerpc/include/asm/current.h */
#ifndef _ASM_POWERPC_CURRENT_H
#define _ASM_POWERPC_CURRENT_H
#ifdef __KERNEL__
/*
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */
struct task_struct;

#ifdef __powerpc64__
#include <linux/stddef.h>
#include <asm/paca.h>

static inline struct task_struct *get_current(void)
{
        struct task_struct *task;

        __asm__ __volatile__("ld %0,%1(13)"
                             : "=r" (task)
                             : "i" (offsetof(struct paca_struct, __current)));
        return task;
}
#define current get_current()

#else

/*
 * We keep `current' in r2 for speed.
 */
register struct task_struct *current asm ("r2");

#endif
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_CURRENT_H */

However, in register-constricted architectures, where there are few registers to spare, reserving a register to hold the address of the current task structure is not viable. On such platforms, the task structure of the current process is directly made available at the top of the kernel stack that it owns. This approach renders a significant advantage with respect to locating the task structure, by just masking the least significant bits of the stack pointer.

With the evolution of the kernel, the task structure grew and became too large to be contained in the kernel stack, which is already restricted in physical memory (8 KB). As a result, the task structure was moved out of the kernel stack, barring a few key fields that define the process's CPU state and other low-level processor-specific information. These fields were then wrapped in a newly created structure called struct thread_info. This structure resides on top of the kernel stack and holds a pointer to the current task structure, which can be used by kernel services.

The following code snippet shows struct thread_info for x86 architecture (kernel 3.10):

/* linux-3.10/arch/x86/include/asm/thread_info.h */

struct thread_info {
        struct task_struct *task;           /* main task structure */
        struct exec_domain *exec_domain;    /* execution domain */
        __u32 flags;                        /* low level flags */
        __u32 status;                       /* thread synchronous flags */
        __u32 cpu;                          /* current CPU */
        int preempt_count;                  /* 0 => preemptable, <0 => BUG */
        mm_segment_t addr_limit;
        struct restart_block restart_block;
        void __user *sysenter_return;
#ifdef CONFIG_X86_32
        unsigned long previous_esp;         /* ESP of the previous stack in
                                               case of nested (IRQ) stacks */
        __u8 supervisor_stack[0];
#endif
        unsigned int sig_on_uaccess_error:1;
        unsigned int uaccess_err:1;         /* uaccess failed */
};

With thread_info containing process-related information, apart from task structure, the kernel has multiple viewpoints to the current process structure: struct task_struct, an architecture-independent information block, and thread_info, an architecture-specific one. The following figure depicts thread_info and task_struct:

For architectures that engage thread_info, the current macro's implementation is modified to look into the top of kernel stack to obtain a reference to the current thread_info and through it the current task structure. The following code snippet shows the implementation of current for an x86-64 platform:

#ifndef __ASM_GENERIC_CURRENT_H
#define __ASM_GENERIC_CURRENT_H
#include <linux/thread_info.h>

#define get_current() (current_thread_info()->task)
#define current get_current()

#endif /* __ASM_GENERIC_CURRENT_H */

/*
 * how to get the current stack pointer in C
 */
register unsigned long current_stack_pointer asm ("sp");

/*
 * how to get the thread information struct from C
 */
static inline struct thread_info *current_thread_info(void) __attribute_const__;
static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
}

As the use of PER_CPU variables has increased in recent times, the process scheduler is tuned to cache crucial current process-related information in the PER_CPU area. This change enables quicker access to current process data than looking up the kernel stack. The following code snippet shows the implementation of the current macro to fetch the current task data through a PER_CPU variable:

#ifndef _ASM_X86_CURRENT_H
#define _ASM_X86_CURRENT_H
#include <linux/compiler.h>
#include <asm/percpu.h>

#ifndef __ASSEMBLY__
struct task_struct;

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
        return this_cpu_read_stable(current_task);
}

#define current get_current()

#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_CURRENT_H */

The use of PER_CPU data led to a gradual reduction of information in thread_info. With thread_info shrinking in size, kernel developers are considering getting rid of thread_info altogether by moving it into the task structure. As this involves changes to low-level architecture code, it has only been implemented for the x86-64 architecture, with other architectures planned to follow. The following code snippet shows the current state of the thread_info structure with just one element:

/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
        unsigned long flags;    /* low level flags */
};

The issue of stack overflow

Unlike user mode, the kernel mode stack lives in directly mapped memory. When a process invokes a kernel service, which may internally be deeply nested, chances are that it may overrun into the adjacent memory range. The worst part is that the kernel is oblivious to such occurrences. Kernel programmers usually engage various debug options to track stack usage and detect overruns, but these methods are not handy for preventing stack breaches on production systems. Conventional protection through the use of guard pages is also ruled out here (as it wastes an actual memory page).

Kernel programmers tend to follow coding standards (minimizing the use of local data, avoiding recursion, and avoiding deep nesting, among others) to cut down the probability of a stack breach. However, implementation of feature-rich and deeply layered kernel subsystems poses various design challenges and complications, especially with the storage subsystem, where filesystems, storage drivers, and networking code can be stacked up in several layers, resulting in deeply nested function calls.

The Linux kernel community pondered over preventing such breaches for quite some time, and toward that end, the decision was made to expand the kernel stack to 16 KB (on x86-64, since kernel 3.15). Expanding the kernel stack might prevent some breaches, but at the cost of engaging much of the directly mapped kernel memory for the per-process kernel stack. However, for reliable functioning of the system, the kernel is expected to elegantly handle stack breaches when they show up on production systems.

With the 4.9 release, the kernel came with a new system to set up virtually mapped kernel stacks. Since virtual addresses are already in use to map even directly mapped pages, the kernel stack does not, in principle, require physically contiguous pages. The kernel reserves a separate range of addresses for virtually mapped memory, and addresses from this range are allocated when a call to vmalloc() is made. This range of memory is referred to as the vmalloc range. Primarily, this range is used when programs require huge chunks of memory which are virtually contiguous but physically scattered. Using this, the kernel stack can now be allotted as individual pages mapped into the vmalloc range. Virtual mapping also enables protection from overruns, as a no-access guard page can be created with a page table entry (without wasting an actual page). Guard pages prompt the kernel to pop an oops message on memory overrun and initiate a kill of the overrunning process.

Virtually mapped kernel stacks with guard pages are currently available only for the x86-64 architecture (support for other architectures seemingly to follow). This is available on architectures that select the HAVE_ARCH_VMAP_STACK option and is enabled by choosing the CONFIG_VMAP_STACK option at build time.

Process creation

During kernel boot, a kernel thread called init is spawned, which in turn is configured to initialize the first user-mode process (with the same name). The init (pid 1) process is then configured to carry out various initialization operations specified through configuration files, creating multiple processes. Every child process further created (which may in turn create its own child processes) is a descendant of the init process. Processes thus created end up in a tree-like structure or a single-hierarchy model. The shell, which is one such process, becomes the interface for users to create user processes when programs are called for execution.

fork(), vfork(), exec(), clone(), wait(), and exit() are the core kernel interfaces for the creation and control of new processes. These operations are invoked through corresponding user-mode APIs.

fork()

fork() is one of the core Unix APIs, available across *nix systems since the inception of legacy Unix releases. Aptly named, it forks a new process from a running process. When fork() succeeds, the new process (referred to as the child) is created by duplicating the caller's address space and task structure. On return from fork(), both the caller (parent) and the new process (child) resume executing instructions from the same code segment, which was duplicated under copy-on-write. fork() is perhaps the only API that enters kernel mode in the context of the caller process and, on success, returns to user mode in the context of both the caller and the child (the new process).

Most resource entries of the parent's task structure such as memory descriptor, file descriptor table, signal descriptors, and scheduling attributes are inherited by the child, except for a few attributes such as memory locks, pending signals, active timers, and file record locks (for the full list of exceptions, refer to the fork(2) man page). A child process is assigned a unique pid and will refer to its parent's pid through the ppid field of its task structure; the child’s resource utilization and processor usage entries are reset to zero.

The parent process updates itself about the child's state using the wait() system call and normally waits for the termination of the child process. If the parent fails to call wait(), the terminated child lingers in a zombie state.

Copy-on-write (COW)

Duplicating the parent process to create a child requires cloning the parent's user-mode address space (stack, data, code, and heap segments) and task structure for the child; this would result in execution overhead that leads to non-deterministic process-creation time. To make matters worse, this cloning would be rendered useless if neither parent nor child initiated any state-change operations on the cloned resources.

As per COW, when a child is created, it is allocated a unique task structure with all resource entries (including page tables) referring to the parent's task structure, with read-only access for both parent and child. Resources are truly duplicated when either of the processes initiates a state change operation, hence the name copy-on-write (write in COW implies a state change). COW does bring effectiveness and optimization to the fore, by deferring the need for duplicating process data until write, and in cases where only read happens, it avoids it altogether. This on-demand copying also reduces the number of swap pages needed, cuts down the time spent on swapping, and might help reduce demand paging.

exec

At times, creating a child process might not be useful unless it runs a new program altogether: the exec family of calls serves precisely this purpose. exec replaces the existing program in a process with a new executable binary:

#include <unistd.h>

int execve(const char *filename, char *const argv[], char *const envp[]);

execve() is the system call that executes the program binary file passed as its first argument. The second and third arguments are null-terminated arrays of argument and environment strings, passed to the new program as command-line arguments. This system call can also be invoked through various glibc (library) wrappers, which are found to be more convenient and flexible:

#include <unistd.h>

extern char **environ;

int execl(const char *path, const char *arg, ...);
int execlp(const char *file, const char *arg, ...);
int execle(const char *path, const char *arg, ..., char *const envp[]);
int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[], char *const envp[]);

Command-line user-interface programs such as shell use the exec interface to launch user-requested program binaries.

vfork()

Unlike fork(), vfork() creates a child process and blocks the parent, which means that the child runs as a single thread and does not allow concurrency; in other words, the parent process is temporarily suspended until the child exits or calls exec(). The child shares the data of the parent.

Linux support for threads

The flow of execution in a process is referred to as a thread