Explore the implementation of core kernel subsystems
If you are a kernel programmer with a knowledge of kernel APIs and are looking to build a comprehensive understanding of kernel subsystems and explore their implementation, this book is for you. It sets out to unravel the underlying details of kernel APIs and data structures, pierces through the complex kernel layers, and gives you the edge you need to take your skills to the next level.
Mastering Linux Kernel Development looks at the Linux kernel, its internal arrangement and design, and various core subsystems, helping you to gain significant understanding of this open source marvel. You will look at how the Linux kernel, which possesses a kind of collective intelligence thanks to its scores of contributors, remains so elegant owing to its great design.
This book also looks at all the key kernel code, core data structures, functions, and macros, giving you a comprehensive foundation of the implementation details of the kernel's core services and mechanisms. You will also look at the Linux kernel as well-designed software, which gives us insights into software design in general that are easily scalable yet fundamentally strong and safe.
By the end of this book, you will have considerable understanding of and appreciation for the Linux kernel.
Each chapter begins with the basic conceptual know-how for a subsystem and extends into the details of its implementation. We use appropriate code excerpts of critical routines and data structures for subsystems.
Page count: 419
Year of publication: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2017
Production reference: 1091017
ISBN 978-1-78588-305-7
www.packtpub.com
Author
Raghu Bharadwaj
Copy Editor
Madhusudan Uchil
Reviewer
Rami Rosen
Project Coordinator
Virginia Dias
Commissioning Editor
Kartikey Pandey
Proofreader
Safis Editing
Acquisition Editor
Rahul Nair
Indexer
Francy Puthiry
Content Development Editor
Sharon Raj
Graphics
Kirk D'Penha
Technical Editor
Mohit Hassija
Production Coordinator
Arvindkumar Gupta
Raghu Bharadwaj is a leading consultant, contributor, and corporate trainer on the Linux kernel with experience spanning close to two decades. He is an ardent kernel enthusiast and expert, and has been closely following the Linux kernel since the late 90s. He is the founder of TECH VEDA, which specializes in engineering and skilling services on the Linux kernel, through technical support, kernel contributions, and advanced training. His precise understanding and articulation of the kernel has been a hallmark, and his penchant for software designs and OS architectures has garnered him special mention from his clients. Raghu is also an expert in delivering solution-oriented, customized training programs for engineering teams working on the Linux kernel, Linux drivers, and Embedded Linux. Some of his clients include major technology companies such as Xilinx, GE, Canon, Fujitsu, UTC, TCS, Broadcom, Sasken, Qualcomm, Cognizant, STMicroelectronics, Stryker, and Lattice Semiconductors.
Rami Rosen is the author of Linux Kernel Networking – Implementation and Theory, a book published by Apress in 2013. Rami has worked for more than 20 years in high-tech companies, starting his way in three startups. Most of his work (past and present) is around kernel and userspace networking and virtualization projects, ranging from device drivers and kernel network stack and DPDK to NFV and OpenStack. Occasionally, he gives talks in international conferences and writes articles for LWN.net, the Linux Journal, and more.
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
Comprehending Processes, Address Space, and Threads
Processes
The illusion called address space
Kernel and user space
Process context
Process descriptors
Process attributes - key elements
state
pid
tgid
thread info
flags
exit_code and exit_signal
comm
ptrace
Process relations - key elements
real_parent and parent
children
sibling
group_leader
Scheduling attributes - key elements
prio and static_prio
se, rt, and dl
policy
cpus_allowed
rt_priority
Process limits - key elements
File descriptor table - key elements
fs
files
Signal descriptor - key elements
signal
sighand
sigset_t blocked, real_blocked
pending
sas_ss_sp
sas_ss_size
Kernel stack
The issue of stack overflow
Process creation
fork()
Copy-on-write (COW)
exec
vfork()
Linux support for threads
clone()
Kernel threads
do_fork() and copy_process()
Process status and termination
wait
exit
Namespaces and cgroups
Mount namespaces
UTS namespaces
IPC namespaces
PID namespaces
Network namespaces
User namespaces
Cgroup namespaces
Control groups (cgroups)
Summary
Deciphering the Process Scheduler
Process schedulers
Linux process scheduler design
Runqueue
The scheduler's entry point
Process priorities
Scheduler classes
Completely Fair Scheduling class (CFS)
Computing priorities and time slices under CFS
CFS's runqueue
Group scheduling
Scheduling entities under many-core systems
Scheduling policies
Real-time scheduling class
FIFO
RR
Real-time group scheduling
Deadline scheduling class (sporadic task model deadline scheduling)
Scheduler related system calls
Processor affinity calls
Process preemption
Summary
Signal Management
Signals
Signal-management APIs
Raising signals from a program
Waiting for queued signals
Signal data structures
Signal descriptors
Blocked and pending queues
Signal handler descriptor
Signal generation and delivery
Signal-generation calls
Signal delivery
Executing user-mode handlers
Setting up user-mode handler frames
Restarting interrupted system calls
Summary
Memory Management and Allocators
Initialization operations
Page descriptor
Flags
Mapping
Zones and nodes
Memory zones
Memory nodes
Node descriptor structure
Zone descriptor structure
Memory allocators
Page frame allocator
Buddy system
GFP mask
Zone modifiers
Page mobility and placement
Watermark modifiers
Page reclaim modifiers
Action modifiers
Type flags
Slab allocator
Kmalloc caches
Object caches
Cache management
Cache layout - generic
Slub data structures
Vmalloc
Contiguous Memory Allocator (CMA)
Summary
Filesystems and File I/O
Filesystem - high-level view
Metadata
Inode (index node)
Data block map
Directories
Superblock
Operations
Mount and unmount operations
File creation and deletion operations
File open and close operations
File read and write operations
Additional features
Extended file attributes
Filesystem consistency and crash recovery
Access control lists (ACLs)
Filesystems in the Linux kernel
Ext family filesystems
Ext2
Ext3
Ext4
Common filesystem interface
VFS structures and operations
struct superblock
struct inode
Struct dentry
struct file
Special filesystems
Procfs
Sysfs
Debugfs
Summary
Interprocess Communication
Pipes and FIFOs
pipefs
Message queues
System V message queues
Data structures
POSIX message queues
Shared memory
System V shared memory
Operation interfaces
Allocating shared memory
Attaching a shared memory
Detaching shared memory
Data structures
POSIX shared memory
Semaphores
System V semaphores
Data structures
POSIX semaphores
Summary
Virtual Memory Management
Process address space
Process memory descriptor
Managing virtual memory areas
Locating a VMA
Merging VMA regions
struct address_space
Page tables
Summary
Kernel Synchronization and Locking
Atomic operations
Atomic integer operations
Atomic bitwise operations
Introducing exclusion locks
Spinlocks
Alternate spinlock APIs
Reader-writer spinlocks
Mutex locks
Debug checks and validations
Wait/wound mutexes
Operation interfaces:
Semaphores
Reader-writer semaphores
Sequence locks
API
Completion locks
Initialization
Waiting for completion
Signalling completion
Summary
Interrupts and Deferred Work
Interrupt signals and vectors
Programmable interrupt controller
Interrupt controller operations
IRQ descriptor table
High-level interrupt-management interfaces
Registering an interrupt handler
Deregistering an interrupt handler
Threaded interrupt handlers
Control interfaces
IRQ stacks
Deferred work
Softirqs
Tasklets
Workqueues
Interface API
Creating dedicated workqueues
Summary
Clock and Time Management
Time representation
Timing hardware
Real-time clock (RTC)
Timestamp counter (TSC)
Programmable interrupt timer (PIT)
CPU local timer
High-precision event timer (HPET)
ACPI power management timer (ACPI PMT)
Hardware abstraction
Calculating elapsed time
Linux timekeeping data structures, macros, and helper routines
Jiffies
Timeval and timespec
Tracking and maintaining time
Tick and interrupt handling
Tick devices
Software timers and delay functions
Dynamic timers
Race conditions with dynamic timers
Dynamic timer handling
Delay functions
POSIX clocks
Summary
Module Management
Kernel modules
Elements of an LKM
Binary layout of a LKM
Load and unload operations
Module data structures
Memory layout
Summary
Mastering Linux Kernel Development looks at the Linux kernel, its internal arrangement and design, and various core subsystems, helping you to gain significant understanding of this open source marvel. You will look at how the Linux kernel, which possesses a kind of collective intelligence thanks to its scores of contributors, remains so elegant owing to its great design.
This book also looks at all the key kernel code, core data structures, functions, and macros, giving you a comprehensive foundation of the implementation details of the kernel's core services and mechanisms. You will also look at the Linux kernel as well-designed software, which gives us insights into software design in general that are easily scalable yet fundamentally strong and safe.
Chapter 1, Comprehending Processes, Address Space, and Threads, looks closely at one of the principal abstractions of Linux called the process and the whole ecosystem, which facilitate this abstraction. We will also spend time in understanding address space, process creation, and threads.
Chapter 2, Deciphering the Process Scheduler, explains process scheduling, which is a vital aspect of any operating system. Here we will build our understanding of the different scheduling policies engaged by Linux to deliver effective process execution.
Chapter 3, Signal Management, helps in understanding all core aspects of signal usage, their representation, data structures, and kernel routines for signal generation and delivery.
Chapter 4, Memory Management and Allocators, takes us through one of the most crucial aspects of the Linux kernel, comprehending various nuances of memory representations and allocations. We will also gauge the efficiency of the kernel in maximizing resource usage at minimal costs.
Chapter 5, Filesystems and File I/O, imparts a generic understanding of a typical filesystem, its fabric, design, and what makes it an elemental part of an operating system. We will also look at abstraction, using the common, layered architecture design, which the kernel adopts comprehensively through the VFS.
Chapter 6, Interprocess Communication, touches upon the various IPC mechanisms offered by the kernel. We will explore the layout and relationship between various data structures for each IPC mechanism, and look at both the SysV and POSIX IPC mechanisms.
Chapter 7, Virtual Memory Management, explains memory management with details of virtual memory management and page tables. We will look into the various aspects of the virtual memory subsystem such as process virtual address space and its segments, memory descriptor structure, memory mapping and VMA objects, page cache and address translation with page tables.
Chapter 8, Kernel Synchronization and Locking, enables us to understand the various protection and synchronization mechanisms provided by the kernel, and comprehend the merits and shortcomings of these mechanisms. We will try and appreciate the tenacity with which the kernel addresses these varying synchronization complexities.
Chapter 9, Interrupts and Deferred Work, talks about interrupts, which are a key facet of any operating system to get necessary and priority tasks done. We will look at how interrupts are generated, handled, and managed in Linux. We will also look at various bottom-half mechanisms.
Chapter 10, Clock and Time Management, reveals how the kernel measures and manages time. We will look at all key time-related structures, routines, and macros to help us gauge time management effectively.
Chapter 11, Module Management, quickly looks at modules and the kernel's infrastructure for managing modules, along with all the core data structures involved. This helps us understand how the kernel supports dynamic extensibility.
Apart from a deep desire to understand the nuances of the Linux kernel and its design, you need a prior understanding of the Linux operating system in general, and of the idea of open-source software, to get started with this book. However, this is not binding, and anyone keen to pick up detailed information about the Linux system and its workings can take up this book.
This book is for system programming enthusiasts and professionals who would like to deepen their understanding of the Linux kernel and its various integral components.
This is a handy book for developers working on various kernel-related projects.
Students of software engineering can use this as a reference guide for comprehending various aspects of the Linux kernel and its design principles.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "In the loop() function, we read the value of the distance from the sensor and then display it on the serial port."
A block of code is set as follows:
/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
    unsigned long flags; /* low level flags */
};
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Go to Sketch | Include Library | Manage Libraries and you will get a dialog."
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Kernel services are invoked in the context of the current process, so understanding how a process is laid out opens the right path for exploring the kernel in more detail. Our effort in this chapter is centered around comprehending processes and the underlying ecosystem the kernel provides for them. We will explore the following concepts in this chapter:
Program to process
Process layout
Virtual address spaces
Kernel and user space
Process APIs
Process descriptors
Kernel stack management
Threads
Linux thread API
Data structures
Namespace and cgroups
Quintessentially, computing systems are designed, developed, and often tweaked for running user applications efficiently. Every element that goes into a computing platform is intended to enable effective and efficient ways for running applications. In other words, computing systems exist to run diverse application programs. Applications can run either as firmware in dedicated devices or as a "process" in systems driven by system software (operating systems).
At its core, a process is a running instance of a program in memory. The transformation from a program to a process happens when the program (on disk) is fetched into memory for execution.
A program’s binary image carries code (with all its binary instructions) and data (with all global data), which are mapped to distinct regions of memory with appropriate access permissions (read, write, and execute). Apart from code and data, a process is assigned additional memory regions called stack (for allocation of function call frames with auto variables and function arguments) and heap for dynamic allocations at runtime.
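The regions described above can be observed from a running program. The following is a minimal sketch, assuming a Linux system with GCC or Clang; the names demo_global and print_region_addresses() are made up for this illustration, and the exact addresses printed will vary from run to run.

```c
#include <stdio.h>
#include <stdlib.h>

int demo_global = 42;           /* initialized global: data region */

/* Prints one address from each region; returns 0 on success, -1 if the
 * heap allocation fails. */
int print_region_addresses(void)
{
    int on_stack = 0;                        /* auto variable: stack */
    int *on_heap = malloc(sizeof(*on_heap)); /* dynamic allocation: heap */

    if (on_heap == NULL)
        return -1;

    /* Converting a function pointer to void * is a common extension
     * supported by GCC and Clang on Linux, not strict ISO C. */
    printf("code : %p\n", (void *)print_region_addresses);
    printf("data : %p\n", (void *)&demo_global);
    printf("heap : %p\n", (void *)on_heap);
    printf("stack: %p\n", (void *)&on_stack);

    free(on_heap);
    return 0;
}
```

Calling print_region_addresses() from main() typically shows the four addresses falling into clearly separate ranges of the process address space.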
Multiple instances of the same program can exist with their respective memory allocations. For instance, for a web browser with multiple open tabs (running simultaneous browsing sessions), each tab is considered a process instance by the kernel, with unique memory allocations.
The following figure represents the layout of processes in memory:
Modern-day computing platforms are expected to handle a plethora of processes efficiently. Operating systems thus must deal with allocating unique memory to all contending processes within the physical memory (often finite) and also ensure their reliable execution. With multiple processes contending and executing simultaneously (multi-tasking), the operating system must ensure that the memory allocation of every process is protected from accidental access by another process.
To address this issue, the kernel provides a level of abstraction between the process and the physical memory called virtual address space. Virtual address space is the process' view of memory; it is how the running program views the memory.
Virtual address space creates an illusion that every process exclusively owns the whole memory while executing. This abstracted view of memory is called virtual memory and is achieved by the kernel's memory manager in coordination with the CPU's MMU. Each process is given a contiguous 32-bit or 64-bit address space, bound by the architecture and unique to that process. With each process caged into its virtual address space by the MMU, any attempt by a process to access an address region outside its boundaries will trigger a hardware fault, making it possible for the memory manager to detect and terminate violating processes, thus ensuring protection.
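The isolation between address spaces can be seen with a small experiment. The sketch below is only an illustration, assuming a POSIX system; it relies on fork(), and the variable and function names are hypothetical. The child writes to a variable that sits at the same virtual address in both processes, yet the parent's copy is left untouched.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int demo_value = 1;  /* same virtual address in parent and child */

/* Returns the parent's view of demo_value after the child has modified
 * its own copy, or -1 on error. */
int value_seen_by_parent_after_child_write(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        demo_value = 99;    /* changes only the child's address space */
        _exit(0);
    }

    if (waitpid(pid, NULL, 0) < 0)
        return -1;

    return demo_value;      /* unchanged in the parent */
}
```

The function returns 1 in the parent even though the child stored 99, because each process resolves that address through its own mappings.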
The following figure depicts the illusion of address space created for every contending process:
Modern operating systems not only prevent one process from accessing another but also prevent processes from accidentally accessing or manipulating kernel data and services (as the kernel is shared by all the processes).
Operating systems achieve this protection by segmenting the whole memory into two logical halves, the user and kernel space. This bifurcation ensures that all processes that are assigned address spaces are mapped to the user space section of memory and kernel data and services run in kernel space. The kernel achieves this protection in coordination with the hardware. While an application process is executing instructions from its code segment, the CPU is operating in user mode. When a process intends to invoke a kernel service, it needs to switch the CPU into privileged mode (kernel mode), which is achieved through special functions called APIs (application programming interfaces). These APIs enable user processes to switch into the kernel space using special CPU instructions and then execute the required services through system calls. On completion of the requested service, the kernel executes another mode switch, this time back from kernel mode to user mode, using another set of CPU instructions.
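As a rough illustration of the path from an API into the kernel, the following sketch requests the same service twice: once through the C library wrapper getpid(), and once through the generic syscall() entry point with the system call number SYS_getpid. It assumes Linux with glibc; the two helper function names are made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* exposes the syscall() declaration in glibc */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Enters the kernel through the generic syscall() entry point. */
long pid_via_raw_syscall(void)
{
    return syscall(SYS_getpid);
}

/* Enters the kernel through the library wrapper (the API). */
long pid_via_wrapper(void)
{
    return (long)getpid();
}
```

Both calls end up in the same kernel service, so they report the same value for the calling process.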
The following figure depicts a virtualized memory view:
When a process requests a kernel service through a system call, the kernel will execute on behalf of the caller process. The kernel is now said to be executing in process context. Similarly, the kernel also responds to interrupts raised by other hardware entities; here, the kernel executes in interrupt context. When in interrupt context, the kernel is not running on behalf of any process.
Right from the time a process is born until it exits, it’s the kernel's process management subsystem that carries out various operations, ranging from process creation, allocating CPU time, and event notifications to destruction of the process upon termination.
Apart from the address space, a process in memory is also assigned a data structure called the process descriptor, which the kernel uses to identify, manage, and schedule the process. The following figure depicts process address spaces with their respective process descriptors in the kernel:
In Linux, a process descriptor is an instance of type struct task_struct, defined in <linux/sched.h>. It is one of the central data structures and contains all the attributes, identification details, and resource allocation entries that a process holds. Looking at struct task_struct is like a peek into the window of what the kernel sees or works with to manage and schedule a process.
Since the task structure contains a wide set of data elements, which are related to the functionality of various kernel subsystems, it would be out of context to discuss the purpose and scope of all the elements in this chapter. We shall consider a few important elements that are related to process management.
Process attributes define all the key and fundamental characteristics of a process. These elements contain the process's state and identifications along with other key values of importance.
From the time it is spawned until it exits, a process may exist in various states, referred to as process states. The state field records the current one:
TASK_RUNNING (0): The task is either executing or contending for CPU in the scheduler run-queue.
TASK_INTERRUPTIBLE (1): The task is in an interruptible wait state; it remains in wait until an awaited condition becomes true, such as the availability of mutual exclusion locks, device ready for I/O, lapse of sleep time, or an exclusive wake-up call. While in this wait state, any signals generated for the process are delivered, causing it to wake up before the wait condition is met.
TASK_KILLABLE: This is similar to TASK_UNINTERRUPTIBLE, with the exception that fatal signals do wake the task, which makes it a safer alternative to TASK_UNINTERRUPTIBLE.
TASK_UNINTERRUPTIBLE (2): The task is in an uninterruptible wait state similar to TASK_INTERRUPTIBLE, except that signals generated for the sleeping process do not cause wake-up. When the event for which it is waiting occurs, the process transitions to TASK_RUNNING. This process state is rarely used.
TASK_STOPPED (4): The task has received a STOP signal. It will be back to running on receiving the continue signal (SIGCONT).
TASK_TRACED (8): A process is said to be in traced state when it is being combed, probably by a debugger.
EXIT_ZOMBIE (32): The process is terminated, but its resources are not yet reclaimed.
EXIT_DEAD (16): The child is terminated and all the resources held by it freed, after the parent collects the exit status of the child using wait.
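To make the numeric values listed above concrete, here is a small user-space sketch that maps a state value to its name. It is only an illustration built from the values quoted in this section, not code taken from the kernel; the function name task_state_name() is hypothetical, and TASK_KILLABLE is left out because the text gives no single numeric value for it.

```c
/* Maps the state values quoted in the text to their names.
 * User-space illustration only; not kernel code. */
const char *task_state_name(unsigned int state)
{
    switch (state) {
    case 0:  return "TASK_RUNNING";
    case 1:  return "TASK_INTERRUPTIBLE";
    case 2:  return "TASK_UNINTERRUPTIBLE";
    case 4:  return "TASK_STOPPED";
    case 8:  return "TASK_TRACED";
    case 16: return "EXIT_DEAD";
    case 32: return "EXIT_ZOMBIE";
    default: return "unknown";
    }
}
```

Note that every nonzero value is a distinct power of two, which lets several states be combined or tested with bitwise operations.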
The following figure depicts process states:
This field contains a unique process identifier referred to as PID. PIDs in Linux are of the type pid_t (integer). Although pid_t is an integer type, the default maximum number of PIDs is 32,768, as specified through the /proc/sys/kernel/pid_max interface. The value in this file can be raised to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million).
To manage PIDs, the kernel uses a bitmap. This bitmap allows the kernel to keep track of PIDs in use and assign a unique PID for new processes. Each PID is identified by a bit in the PID bitmap; the value of a PID is determined from the position of its corresponding bit. Bits with value 1 in the bitmap indicate that the corresponding PIDs are in use, and those with value 0 indicate free PIDs. Whenever the kernel needs to assign a unique PID, it looks for the first unset bit and sets it to 1, and conversely to free a PID, it toggles the corresponding bit from 1 to 0.
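The bitmap scheme just described can be sketched in a few lines of user-space C. This is a simplified model under stated assumptions, not the kernel's allocator: all names carry a demo_ prefix and are hypothetical, PID 0 is treated as reserved, and the search always starts from the lowest bit, whereas the in-kernel allocator differs in detail (for example, it resumes searching after the most recently allocated PID).

```c
#include <limits.h>

#define DEMO_PID_MAX   32768    /* default pid_max quoted in the text */
#define DEMO_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* One bit per PID: 1 means in use, 0 means free. */
static unsigned long demo_pidmap[DEMO_PID_MAX / DEMO_WORD_BITS];

/* Finds the first unset bit, sets it, and returns its position as the
 * new PID. Returns -1 when every PID is in use. */
int demo_alloc_pid(void)
{
    for (int pid = 1; pid < DEMO_PID_MAX; pid++) {  /* 0 is reserved */
        unsigned long *word = &demo_pidmap[pid / DEMO_WORD_BITS];
        unsigned long mask = 1UL << (pid % DEMO_WORD_BITS);

        if (!(*word & mask)) {
            *word |= mask;
            return pid;
        }
    }
    return -1;
}

/* Clears the bit for pid so that it can be handed out again. */
void demo_free_pid(int pid)
{
    if (pid <= 0 || pid >= DEMO_PID_MAX)
        return;
    demo_pidmap[pid / DEMO_WORD_BITS] &= ~(1UL << (pid % DEMO_WORD_BITS));
}
```

With this model, two allocations return 1 and 2; after demo_free_pid(1), the next allocation returns 1 again because its bit is the first one found clear.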
This field contains the thread group ID (TGID). When a new process is created, its PID and TGID are the same, as the process happens to be the only thread. When the process spawns a new thread, the new child gets a unique PID but inherits the TGID from the parent, as it belongs to the same thread group. The TGID is primarily used to support multi-threaded processes. We will delve into further details in the threads section of this chapter.
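From user space, the two identifiers can be compared directly. In the sketch below, which assumes Linux with glibc, getpid() reports the TGID and the gettid system call reports the kernel's per-thread PID. The helper name is hypothetical, and the call is made through syscall() since not every C library version provides a gettid() wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* exposes the syscall() declaration in glibc */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 when the calling thread is the first thread of its process,
 * that is, when its per-thread PID equals the TGID. */
int caller_is_thread_group_leader(void)
{
    pid_t tgid = getpid();                   /* TGID as seen by user space */
    pid_t tid = (pid_t)syscall(SYS_gettid);  /* per-thread PID in the kernel */

    return tid == tgid;
}
```

Called from the initial thread of a program, this returns 1; called from any thread created later, it returns 0, since that thread has its own PID but shares the TGID.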
This field holds processor-specific state information, and is a critical element of the task structure. Later sections of this chapter contain details about the importance of thread_info.
The flags field records various attributes corresponding to a process. Each bit in the field corresponds to various stages in the lifetime of a process. Per-process flags are defined in <linux/sched.h>:
#define PF_EXITING /* getting shut down */
#define PF_EXITPIDONE /* pi exit done on shut down */
#define PF_VCPU /* I'm a virtual CPU */
#define PF_WQ_WORKER /* I'm a workqueue worker */
#define PF_FORKNOEXEC /* forked but didn't exec */
#define PF_MCE_PROCESS /* process policy on mce errors */
#define PF_SUPERPRIV /* used super-user privileges */
#define PF_DUMPCORE /* dumped core */
#define PF_SIGNALED /* killed by a signal */
#define PF_MEMALLOC /* Allocating memory */
#define PF_NPROC_EXCEEDED /* set_user noticed that RLIMIT_NPROC was exceeded */
#define PF_USED_MATH /* if unset the fpu must be initialized before use */
#define PF_USED_ASYNC /* used async_schedule*(), used by module init */
#define PF_NOFREEZE /* this thread should not be frozen */
#define PF_FROZEN /* frozen for system suspend */
#define PF_FSTRANS /* inside a filesystem transaction */
#define PF_KSWAPD /* I am kswapd */
#define PF_MEMALLOC_NOIO /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE /* Throttle me less: I clean memory */
#define PF_KTHREAD /* I am a kernel thread */
#define PF_RANDOMIZE /* randomize virtual address space */
#define PF_SWAPWRITE /* Allowed to write to swap */
#define PF_NO_SETAFFINITY /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY /* Early kill for mce process policy */
#define PF_MUTEX_TESTER /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK /* this thread called freeze_processes and should not be frozen */
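Since each flag occupies its own bit, flags are set, cleared, and tested with bitwise operations. The following sketch shows the pattern on a stand-in structure. The DEMO_PF_* names, their bit positions, and struct demo_task are assumptions made for this illustration and are not copied from <linux/sched.h>.

```c
#include <stdbool.h>

/* Illustrative bit positions only; the real values live in the header. */
#define DEMO_PF_EXITING  (1u << 0)
#define DEMO_PF_KTHREAD  (1u << 1)
#define DEMO_PF_NOFREEZE (1u << 2)

/* Stand-in for the flags member of the task structure. */
struct demo_task {
    unsigned int flags;
};

static inline void demo_set_flag(struct demo_task *t, unsigned int flag)
{
    t->flags |= flag;               /* turn the bit on */
}

static inline void demo_clear_flag(struct demo_task *t, unsigned int flag)
{
    t->flags &= ~flag;              /* turn the bit off */
}

static inline bool demo_test_flag(const struct demo_task *t, unsigned int flag)
{
    return (t->flags & flag) != 0;  /* is the bit on? */
}
```

Kernel code follows the same idiom directly on the task structure, for instance by testing the flags field against PF_KTHREAD to tell whether a task is a kernel thread.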
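exit_code holds the exit value of the task, including the number of the signal that terminated it if it was killed by a signal. exit_signal holds the signal that is sent to the parent when the task terminates (typically SIGCHLD). The parent process collects the exit value through wait() on termination of the child.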
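The parent's side of this exchange can be sketched from user space. The example below assumes a POSIX system and uses hypothetical helper names: one child exits normally with a chosen code, another terminates itself with a signal, and the parent decodes the status returned by waitpid() with the standard macros.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits with `code`; returns the exit value the parent
 * collects, or -1 on error or abnormal termination. */
int child_exit_status(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Forks a child that terminates itself with `sig`; returns the signal
 * number the parent observes, or -1 if the child was not killed by one. */
int child_termination_signal(int sig)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        raise(sig);
        _exit(0);   /* reached only if the signal did not terminate us */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

For example, child_exit_status(7) yields 7, and child_termination_signal(SIGKILL) yields SIGKILL.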
This field holds the name of the binary executable used to start the process.
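On Linux, a thread can read and change its own name through prctl(), which gives a convenient way to observe this field from user space. The sketch below is a minimal illustration; the helper names and DEMO_COMM_LEN are made up here, with the buffer sized to 16 bytes, the length prctl() works with for thread names (including the terminating null byte).

```c
#include <sys/prctl.h>

#define DEMO_COMM_LEN 16    /* name length used by prctl(), including '\0' */

/* Copies the calling thread's name into buf; returns 0 on success. */
int demo_get_comm(char buf[DEMO_COMM_LEN])
{
    return prctl(PR_GET_NAME, (unsigned long)buf, 0, 0, 0);
}

/* Sets the calling thread's name; longer names are silently truncated. */
int demo_set_comm(const char *name)
{
    return prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0);
}
```

The same name is visible in /proc/<pid>/comm and is what tools such as ps and top display in their command column by default.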
This field is enabled and set when the process is put into trace mode using the ptrace() system call.
Every process can be related to a parent process, establishing a parent-child relationship. Similarly, multiple processes spawned by the same process are called siblings. These fields establish how the current process relates to another process.
These are pointers to the parent's task structure. For a normal process, both these pointers refer to the same task_struct; they differ only while the process is being traced through ptrace(). In that case, real_parent continues to refer to the task structure of the process that created it, and parent refers to the task structure of the tracer, to which SIGCHLD is delivered.
This is a pointer to a list of child task structures.
This is a pointer to a list of sibling task structures.
This is a pointer to the task structure of the thread group leader.
All contending processes must be given fair CPU time, and this calls for scheduling based on time slices and process priorities. These attributes contain necessary information that the scheduler uses when deciding on which process gets priority when contending.
prio helps determine the priority of the process for scheduling. This field holds static priority of the process within the range 1 to 99 (as specified by sched_setscheduler()) if the process is assigned a real-time scheduling policy. For normal processes, this field holds a dynamic priority derived from the nice value.
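For normal processes, the kernel converts between nice values and static priorities with simple offset arithmetic (the NICE_TO_PRIO() and PRIO_TO_NICE() macros). The sketch below reproduces that arithmetic in user space under demo_ names, which are hypothetical; it assumes 100 real-time priority levels followed by 40 nice levels, so nice 0 corresponds to priority 120.

```c
#define DEMO_MAX_RT_PRIO  100   /* priorities 0..99 belong to real-time tasks */
#define DEMO_NICE_WIDTH   40    /* nice values span -20..19 */
#define DEMO_DEFAULT_PRIO (DEMO_MAX_RT_PRIO + DEMO_NICE_WIDTH / 2)

/* nice -20..19 maps to static priority 100..139. */
int demo_nice_to_prio(int nice)
{
    return nice + DEMO_DEFAULT_PRIO;
}

/* Inverse mapping: static priority 100..139 back to nice -20..19. */
int demo_prio_to_nice(int prio)
{
    return prio - DEMO_DEFAULT_PRIO;
}
```

A lower number means a higher priority, so a task at nice -20 (priority 100) is favored over one at nice 19 (priority 139).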
Every task belongs to a scheduling entity (group of tasks), as scheduling is done at a per-entity level. se is for all normal processes, rt is for real-time processes, and dl is for deadline processes. We will discuss more on these attributes in the next chapter on scheduling.
This field contains information about the scheduling policy of the process, which helps in determining its priority.
This field specifies the CPU mask for the process, that is, on which CPU(s) the process is eligible to be scheduled in a multi-processor system.
This field specifies the priority to be applied by real-time scheduling policies. For non-real-time processes, this field is unused.
The kernel imposes resource limits to ensure fair allocation of system resources among contending processes. These limits guarantee that a random process does not monopolize ownership of resources. There are 16 different types of resource limits, and the task structure points to an array of type struct rlimit, in which each offset holds the current and maximum values for a specific resource.
/*include/uapi/linux/resource.h*/
struct rlimit {
    __kernel_ulong_t rlim_cur;
    __kernel_ulong_t rlim_max;
};
These limits are specified in include/uapi/asm-generic/resource.h:
#define RLIMIT_CPU 0 /* CPU time in sec */
#define RLIMIT_FSIZE 1 /* Maximum filesize */
#define RLIMIT_DATA 2 /* max data size */
#define RLIMIT_STACK 3 /* max stack size */
#define RLIMIT_CORE 4 /* max core file size */
#ifndef RLIMIT_RSS
# define RLIMIT_RSS 5 /* max resident set size */
#endif
#ifndef RLIMIT_NPROC
# define RLIMIT_NPROC 6 /* max number of processes */
#endif
#ifndef RLIMIT_NOFILE
# define RLIMIT_NOFILE 7 /* max number of open files */
#endif
#ifndef RLIMIT_MEMLOCK
# define RLIMIT_MEMLOCK 8 /* max locked-in-memory address space */
#endif
#ifndef RLIMIT_AS
# define RLIMIT_AS 9 /* address space limit */
#endif
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
#define RLIMIT_NICE 13 /* max nice prio allowed to raise to 0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
#define RLIM_NLIMITS 16
During the lifetime of a process, it may access various resource files to get its task done. This results in the process opening, closing, reading, and writing to these files. The system must keep track of these activities; file descriptor elements help the system know which files the process holds.
Filesystem information is stored in this field.
The file descriptor table contains pointers to all the files that a process opens to perform various operations. The files field contains a pointer, which points to this file descriptor table.
For processes to handle signals, the task structure has various elements that determine how the signals must be handled.
This is of type struct signal_struct, which contains information on all the signals associated with the process.
This is of type struct sighand_struct, which contains all signal handlers associated with the process.
These elements identify signals that are currently masked or blocked by the process.
This is of type struct sigpending, which identifies signals which are generated but not yet delivered.
This field contains a pointer to an alternate stack, which facilitates signal handling.
This field shows the size of the alternate stack, used for signal handling.
With current-generation computing platforms powered by multi-core hardware capable of running applications simultaneously, it is entirely possible for multiple processes to switch into kernel mode concurrently while requesting the same kernel service. To handle such situations, kernel services are designed to be re-entrant, allowing multiple processes to step in and engage the required services. This requires each requesting process to maintain its own private kernel stack to keep track of the kernel function call sequence, store local data of the kernel functions, and so on.
The kernel stack is directly mapped to physical memory, which requires it to occupy a physically contiguous region. By default, the kernel stack is 8 KB on x86-32 and most other 32-bit systems (with an option to configure a 4 KB kernel stack at build time), and 16 KB on x86-64 systems.
When kernel services are invoked in the current process context, they need to validate the process's prerogative before they commit to any relevant operations. To perform such validations, the kernel services must gain access to the task structure of the current process and look through the relevant fields. Similarly, kernel routines might need access to the current task structure to modify various resource structures, such as signal handler tables, pending signals, the file descriptor table, and the memory descriptor, among others. To enable access to the task structure at runtime, the address of the current task structure is loaded into a processor register (the register chosen is architecture specific) and made available through a kernel global macro called current (defined in the architecture-specific kernel header asm/current.h):
/* arch/ia64/include/asm/current.h */
#ifndef _ASM_IA64_CURRENT_H
#define _ASM_IA64_CURRENT_H

/*
 * Modified 1998-2000
 * David Mosberger-Tang <[email protected]>, Hewlett-Packard Co
 */
#include <asm/intrinsics.h>

/*
 * In kernel mode, thread pointer (r13) is used to point to the current task
 * structure.
 */
#define current ((struct task_struct *) ia64_getreg(_IA64_REG_TP))

#endif /* _ASM_IA64_CURRENT_H */

/* arch/powerpc/include/asm/current.h */
#ifndef _ASM_POWERPC_CURRENT_H
#define _ASM_POWERPC_CURRENT_H
#ifdef __KERNEL__

/*
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */

struct task_struct;

#ifdef __powerpc64__
#include <linux/stddef.h>
#include <asm/paca.h>

static inline struct task_struct *get_current(void)
{
    struct task_struct *task;

    __asm__ __volatile__("ld %0,%1(13)"
        : "=r" (task)
        : "i" (offsetof(struct paca_struct, __current)));

    return task;
}
#define current get_current()

#else

/*
 * We keep `current' in r2 for speed.
 */
register struct task_struct *current asm ("r2");

#endif

#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_CURRENT_H */
However, on register-constrained architectures, where there are few registers to spare, reserving a register to hold the address of the current task structure is not viable. On such platforms, the task structure of the current process is made available directly at the top of the kernel stack that it owns. This approach makes locating the task structure cheap: it only requires masking the least significant bits of the stack pointer.
With the evolution of the kernel, the task structure grew and became too large to be contained in the kernel stack, which is already restricted in physical memory (8 KB). As a result, the task structure was moved out of the kernel stack, barring a few key fields that define the process's CPU state and other low-level processor-specific information. These fields were then wrapped in a newly created structure called struct thread_info. This structure resides at the top of the kernel stack and provides a pointer to the current task structure, which can be used by kernel services.
The following code snippet shows struct thread_info for x86 architecture (kernel 3.10):
/* linux-3.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
    struct task_struct *task; /* main task structure */
    struct exec_domain *exec_domain; /* execution domain */
    __u32 flags; /* low level flags */
    __u32 status; /* thread synchronous flags */
    __u32 cpu; /* current CPU */
    int preempt_count; /* 0 => preemptable, <0 => BUG */
    mm_segment_t addr_limit;
    struct restart_block restart_block;
    void __user *sysenter_return;
#ifdef CONFIG_X86_32
    unsigned long previous_esp; /* ESP of the previous stack in case of nested (IRQ) stacks */
    __u8 supervisor_stack[0];
#endif
    unsigned int sig_on_uaccess_error:1;
    unsigned int uaccess_err:1; /* uaccess failed */
};
With thread_info containing process-related information, apart from task structure, the kernel has multiple viewpoints to the current process structure: struct task_struct, an architecture-independent information block, and thread_info, an architecture-specific one. The following figure depicts thread_info and task_struct:
For architectures that engage thread_info, the current macro's implementation is modified to look into the top of the kernel stack to obtain a reference to the current thread_info and, through it, the current task structure. The following code snippet shows the generic implementation of current, together with a current_thread_info() that derives the thread_info address from the stack pointer:
#ifndef __ASM_GENERIC_CURRENT_H
#define __ASM_GENERIC_CURRENT_H

#include <linux/thread_info.h>

#define get_current() (current_thread_info()->task)
#define current get_current()

#endif /* __ASM_GENERIC_CURRENT_H */

/*
 * how to get the current stack pointer in C
 */
register unsigned long current_stack_pointer asm ("sp");

/*
 * how to get the thread information struct from C
 */
static inline struct thread_info *current_thread_info(void) __attribute_const__;

static inline struct thread_info *current_thread_info(void)
{
    return (struct thread_info *)
        (current_stack_pointer & ~(THREAD_SIZE - 1));
}
As the use of PER_CPU variables has increased in recent times, the process scheduler is tuned to cache crucial current process-related information in the PER_CPU area. This change enables quick access to current process data without having to look it up through the kernel stack. The following code snippet shows the implementation of the current macro to fetch the current task data through the PER_CPU variable:
#ifndef _ASM_X86_CURRENT_H
#define _ASM_X86_CURRENT_H

#include <linux/compiler.h>
#include <asm/percpu.h>

#ifndef __ASSEMBLY__
struct task_struct;

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

#define current get_current()

#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_CURRENT_H */
The use of PER_CPU data led to a gradual reduction of information in thread_info. With thread_info shrinking in size, kernel developers are considering getting rid of thread_info altogether by moving it into the task structure. As this involves changes to low-level architecture code, it has only been implemented for the x86-64 architecture, with other architectures planned to follow. The following code snippet shows the current state of the thread_info structure with just one element:
/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
    unsigned long flags; /* low level flags */
};
Unlike user mode, the kernel mode stack lives in directly mapped memory. When a process invokes a kernel service, which may internally be deeply nested, chances are that the stack may overrun into the adjacent memory range. Worse, the kernel will be oblivious to such occurrences. Kernel programmers usually engage various debug options to track stack usage and detect overruns, but these methods are of little use in preventing stack breaches on production systems. Conventional protection through the use of guard pages is also ruled out here (as it wastes an actual memory page).
Kernel programmers tend to follow coding standards--minimizing the use of local data, avoiding recursion, and avoiding deep nesting among others--to cut down the probability of a stack breach. However, implementation of feature-rich and deeply layered kernel subsystems may pose various design challenges and complications, especially with the storage subsystem where filesystems, storage drivers, and networking code can be stacked up in several layers, resulting in deeply nested function calls.
The Linux kernel community has been pondering how to prevent such breaches for quite a long time, and toward that end, the decision was made to expand the kernel stack to 16 KB (x86-64, since kernel 3.15). Expanding the kernel stack might prevent some breaches, but at the cost of committing much of the directly mapped kernel memory to per-process kernel stacks. For reliable functioning of the system, however, the kernel is expected to handle stack breaches gracefully when they show up on production systems.
With the 4.9 release, the kernel gained a new mechanism for setting up virtually mapped kernel stacks. Since virtual addresses are already used to access even directly mapped pages, the kernel stack does not, in principle, require physically contiguous pages. The kernel reserves a separate range of addresses for virtually mapped memory, and addresses from this range are allocated when a call to vmalloc() is made. This range of memory is referred to as the vmalloc range. It is primarily used when programs require huge chunks of memory that are virtually contiguous but physically scattered. Using this, the kernel stack can now be allotted as individual pages, mapped to the vmalloc range. Virtual mapping also enables protection from overruns, as a no-access guard page can be allocated with a page table entry (without wasting an actual page). Guard pages prompt the kernel to emit an oops message on a memory overrun and to kill the overrunning process.
Virtually mapped kernel stacks with guard pages are currently available only for the x86-64 architecture (support for other architectures seemingly to follow). The feature is enabled with the CONFIG_VMAP_STACK build-time option, which is offered on architectures that select HAVE_ARCH_VMAP_STACK.
During kernel boot, a kernel thread called init is spawned, which in turn is configured to initialize the first user-mode process (with the same name). The init (pid 1) process is then configured to carry out various initialization operations specified through configuration files, creating multiple processes. Every child process created thereafter (which may in turn create its own child processes) is a descendant of the init process. Processes thus created end up in a tree-like structure or a single hierarchy model. The shell, which is one such process, becomes the interface for users to create user processes when programs are called for execution.
fork, vfork, exec, clone, wait, and exit are the core kernel interfaces for the creation and control of new processes. These operations are invoked through corresponding user-mode APIs.
fork() is one of the core Unix process APIs, available across *nix systems since the earliest Unix releases. Aptly named, it forks a new process from a running process. When fork() succeeds, a new process (referred to as the child) is created by duplicating the caller's address space and task structure. On return from fork(), both the caller (parent) and the new process (child) resume executing instructions from the same code segment, which was duplicated under copy-on-write. fork() is perhaps the only API that enters kernel mode in the context of the caller process and, on success, returns to user mode in the context of both the caller and the child (new process).
Most resource entries of the parent's task structure such as memory descriptor, file descriptor table, signal descriptors, and scheduling attributes are inherited by the child, except for a few attributes such as memory locks, pending signals, active timers, and file record locks (for the full list of exceptions, refer to the fork(2) man page). A child process is assigned a unique pid and will refer to its parent's pid through the ppid field of its task structure; the child’s resource utilization and processor usage entries are reset to zero.
The parent process updates itself about the child's state using the wait() system call and normally waits for the termination of the child process. If the parent fails to call wait(), a child that terminates remains in a zombie state.
Duplicating the parent process to create a child requires cloning the parent's user-mode address space (stack, data, code, and heap segments) and task structure for the child; this results in execution overhead that leads to non-deterministic process-creation time. To make matters worse, this cloning would be wasted effort if neither parent nor child initiated any state-change operations on the cloned resources.
Under copy-on-write (COW), when a child is created, it is allocated a unique task structure with all resource entries (including page tables) referring to the parent's task structure, with read-only access for both parent and child. Resources are truly duplicated only when either of the processes initiates a state-change operation, hence the name copy-on-write (write in COW implies a state change). COW improves efficiency by deferring the duplication of process data until a write occurs, and in cases where only reads happen, it avoids duplication altogether. This on-demand copying also reduces the number of swap pages needed, cuts down the time spent on swapping, and might help reduce demand paging.
At times creating a child process might not be useful, unless it runs a new program altogether: the exec family of calls serves precisely this purpose. exec replaces the existing program in a process with a new executable binary:
#include <unistd.h>
int execve(const char *filename, char *const argv[], char *const envp[]);
execve() is the system call that executes the program binary file passed as its first argument. The second and third arguments are null-terminated arrays of argument and environment strings, passed to the new program as its command-line arguments and environment, respectively. This system call can also be invoked through various glibc (library) wrappers, which are often more convenient and flexible:
#include <unistd.h>
extern char **environ;
int execl(const char *path, const char *arg, ...);
int execlp(const char *file, const char *arg, ...);
int execle(const char *path, const char *arg, ..., char *const envp[]);
int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[], char *const envp[]);
Command-line user-interface programs such as shell use the exec interface to launch user-requested program binaries.
Unlike fork(), vfork() creates a child process and blocks the parent, which means that the child runs as a single thread and does not allow concurrency; in other words, the parent process is temporarily suspended until the child exits or calls exec(). The child shares the data of the parent.
The flow of execution in a process is referred to as a thread
