"

Author: tjytimi [email protected]
Date: 2022/05/22
Revisor: Falcon [email protected]
Project: Anatomy of the RISC-V Linux Kernel

"

Code Analysis of the Process Creation Flow

Foreword

This article walks through the process creation code and, along the way, examines how the Linux kernel implements processes on RISC-V.

The analysis follows the order in which the code executes and describes only the key functions. The kernel version is Linux 5.17.

kernel_clone

In Linux, a new process is created through the fork() system call. When a user process enters kernel mode via fork(), execution reaches the system call definition shown below:


// kernel/fork.c : 2620

SYSCALL_DEFINE0(fork)
{
 struct kernel_clone_args args = {
  .exit_signal = SIGCHLD,
 };
 return kernel_clone(&args);
}

This function simply fills in args and calls kernel_clone().

Process creation is in fact served by a family of system calls, including clone() and vfork(). All of them end up calling kernel_clone(); they differ only in how the arguments are set.

Threads created through the Pthread library, which user programs commonly use, are also ultimately created by kernel_clone().
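To make this concrete, the sketch below models how fork(), vfork() and clone() could fill in their arguments before reaching the common worker. The `sk_` names and the trimmed-down structure are hypothetical; the constants are copied from include/uapi/linux/sched.h and the flag combinations follow kernel/fork.c, so this is an illustration rather than kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Constants mirrored from include/uapi/linux/sched.h (SIGCHLD is 17 on RISC-V). */
#define SK_CSIGNAL      0x000000ffULL  /* low byte: signal sent to the parent on exit */
#define SK_CLONE_VM     0x00000100ULL  /* share the address space */
#define SK_CLONE_VFORK  0x00004000ULL  /* parent waits until the child releases the mm */
#define SK_SIGCHLD      17

/* Hypothetical, trimmed-down stand-in for struct kernel_clone_args. */
struct sk_clone_args {
	uint64_t flags;
	int exit_signal;
};

/* fork(): no sharing flags, only the exit signal. */
struct sk_clone_args sk_fork_args(void)
{
	struct sk_clone_args args = { .flags = 0, .exit_signal = SK_SIGCHLD };
	return args;
}

/* vfork(): share the mm and block the parent. */
struct sk_clone_args sk_vfork_args(void)
{
	struct sk_clone_args args = {
		.flags = SK_CLONE_VFORK | SK_CLONE_VM,
		.exit_signal = SK_SIGCHLD,
	};
	return args;
}

/* clone(): the caller's flag word carries both the flags and the exit signal. */
struct sk_clone_args sk_clone_args_from(uint64_t clone_flags)
{
	struct sk_clone_args args = {
		.flags = clone_flags & ~SK_CSIGNAL,
		.exit_signal = (int)(clone_flags & SK_CSIGNAL),
	};
	return args;
}
```

In each case only the argument structure differs; the call that follows is the same.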

Readers familiar with earlier kernel versions will recognize kernel_clone() as the former _do_fork(). The simplified code is as follows:


// kernel/fork.c  : 2524
pid_t kernel_clone(struct kernel_clone_args *args)
{
 u64 clone_flags = args->flags;
 struct pid *pid;
 struct task_struct *p;
 int trace = 0;
 pid_t nr;

 p = copy_process(NULL, trace, NUMA_NO_NODE, args);

 pid = get_task_pid(p, PIDTYPE_PID);
 nr = pid_vnr(pid);

 wake_up_new_task(p);

 return nr;
}

kernel_clone() performs the following steps in order:

Copy the process descriptor

It calls copy_process() to create the child process and its process descriptor (task_struct).

Depending on the clone flags, copy_process() either copies the parent's data or shares it with the child. This is the foundation that lets the new process exist and run.

The function is described in detail in the copy_process() section.

Get the PID of the new process

In the parent, fork() must return the child's PID as seen in the current process's PID namespace. User space uses this value for management work such as the parent reaping the child. Therefore, once copy_process() has completed, kernel_clone() obtains the process ID in the current namespace from the child's struct pid.

Wake up the child process

It calls wake_up_new_task(), which inserts the scheduling entity of the process descriptor into the run queue. Note that waking the child does not mean that it starts executing: it actually runs only when the scheduler selects it according to the scheduling policy.

This is described in detail in the wake_up_new_task section.
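The return values discussed above can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the helper name `sk_fork_roundtrip` is hypothetical. It checks that the parent receives the child's PID while the child receives 0.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 when the parent got a positive PID from fork(), the child
 * took the "fork() returned 0" path, and the PID reaped by waitpid()
 * matches the one fork() returned. Returns 0 otherwise.
 */
int sk_fork_roundtrip(void)
{
	pid_t pid = fork();
	int status = 0;

	if (pid < 0)
		return 0;        /* fork failed */
	if (pid == 0)
		_exit(42);       /* child: fork() returned 0 */

	/* parent: pid is the child's PID in the parent's PID namespace */
	if (waitpid(pid, &status, 0) != pid)
		return 0;
	return WIFEXITED(status) && WEXITSTATUS(status) == 42;
}
```

The exit status 42 is an arbitrary marker that lets the parent confirm the child really ran the branch taken when fork() returns 0.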

copy_process()

As the name suggests, copy_process() is responsible for duplicating the parent process's data structures. It is defined in kernel/fork.c. Its main flow is shown below, and each step is analyzed in turn.

copy_process()
 - dup_task_struct()
 - sched_fork()
 - copy_xxx(copy_mm,copy_fs,copy_files,copy_thread)
 - initialize the process PID entities and process relationships

dup_task_struct()

dup_task_struct() makes a preliminary copy of the parent's process descriptor to produce the child's descriptor. It proceeds as follows:

First, it allocates a process descriptor (struct task_struct *tsk) and a kernel stack (unsigned long *stack) from the slab allocator.
It copies the parent's process descriptor into the newly created child descriptor through arch_dup_task_struct(). This is an architecture-specific function; the code is shown below. fstate_save() saves the floating-point unit registers of the RISC-V processor into the fstate member of the thread field (struct thread_struct) of the parent's process descriptor. The layout of thread is architecture-specific; it holds the processor context needed for process switching. See the copy_thread section for the definition of struct thread_struct. The statement *dst = *src then copies the parent's descriptor wholesale into the child, and later steps modify the members that must differ.

// arch/riscv/kernel/process.c : 115

int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
 fstate_save(src, task_pt_regs(src));
 *dst = *src;
 return 0;
}

It points the kernel stack pointer in the child's descriptor at the kernel stack allocated in step 1.
It calls setup_thread_stack(), which initializes the thread_info data. On RISC-V, thread_info does not share space with the kernel stack; it is embedded in task_struct, so step 2 has already copied it and setup_thread_stack() returns without doing anything. On architectures that keep thread_info in the kernel stack area, such as MIPS and older x86 kernels, this function copies thread_info to the child.

The kernel accesses the current process's thread_info frequently, so it wants fast access through a register. RISC-V has a dedicated tp (thread pointer) register that holds the address of thread_info. When CONFIG_THREAD_INFO_IN_TASK is enabled, thread_info is the first member of task_struct, so tp effectively serves as a pointer to task_struct as well.


   struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
 /*
  * For reasons of header soup (see current_thread_info()), this
  * must be the first element of task_struct.
  */
 struct thread_info  thread_info;
#endif

 ...

Similarly, MIPS uses the gp register to hold the address of thread_info. Older x86 kernels did not dedicate a register to it, but because thread_info lived in the kernel stack, it could still be reached quickly through the stack pointer. This shows that studying the kernel through RISC-V involves noticeably fewer processor-specific subtleties than studying it through x86.
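The "first member" requirement can be illustrated in plain C. The sketch below uses hypothetical `sk_` types, not the kernel's definitions; it shows why a single pointer, playing the role of tp, can serve as both the task pointer and the thread_info pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature versions of thread_info and task_struct. */
struct sk_thread_info {
	unsigned long flags;
};

struct sk_task {
	struct sk_thread_info thread_info;  /* must stay the first member */
	int pid;
};

/* Compile-time check of the layout assumption. */
_Static_assert(offsetof(struct sk_task, thread_info) == 0,
	       "thread_info must be at offset 0");

/* Model of the tp register: it holds the address of the running task. */
struct sk_task *sk_task_from_tp(void *tp)
{
	return (struct sk_task *)tp;
}

/* The same address is also the address of thread_info (offset 0). */
struct sk_thread_info *sk_thread_info_from_tp(void *tp)
{
	return (struct sk_thread_info *)tp;
}
```

If thread_info were moved away from offset 0, the second helper would need an offset adjustment, which is the kind of extra work the layout rule avoids.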

Finally, it initializes some members of the child's structure. For example, clear_tsk_need_resched() clears the need-resched flag in thread_info: a new process does not need rescheduling, but the flag may have been carried over when the descriptor was copied.
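The "copy everything, then fix up what must differ" pattern used by dup_task_struct() can be sketched as follows. All names are hypothetical, and the bit position chosen for the need-resched flag is an assumption made for illustration only.

```c
#include <assert.h>

#define SK_TIF_NEED_RESCHED (1UL << 3)   /* hypothetical bit position */

/* Hypothetical miniature process descriptor. */
struct sk_task {
	unsigned long ti_flags;  /* stands in for thread_info.flags */
	void *stack;             /* kernel stack of this task */
	int pid;
};

void sk_dup_task(struct sk_task *dst, const struct sk_task *src, void *new_stack)
{
	*dst = *src;                            /* wholesale copy of the parent */
	dst->stack = new_stack;                 /* child gets its own kernel stack */
	dst->ti_flags &= ~SK_TIF_NEED_RESCHED;  /* do not inherit a pending reschedule */
}
```

The parent's descriptor is left untouched; only the child's copy is adjusted.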

sched_fork

sched_fork() initializes the scheduling-related fields: the priority of the process, its virtual runtime, and its scheduling class. This code is straightforward and is not analyzed in detail here.

copy_xxx

This step consists of a series of copy_xxx functions. As their names indicate, each one copies or shares the data of a particular structure in the process descriptor. Whether the data is copied or shared depends on the clone flags passed to copy_process().

The analysis here focuses on copy_mm() and copy_thread().

copy_mm

copy_mm() mainly copies the memory descriptor of the process. Its main flow is shown below. For memory management background, refer to the related analysis articles in this project.

 copy_mm()
  - dup_mm
   - mm_init
    - mm_alloc_pgd
     - pgd_alloc
   - init_new_context
  - dup_mmap

copy_mm() first calls dup_mm() to copy the memory descriptor. Most of that work is done by mm_init() and init_new_context().

mm_init() initializes the process memory descriptor and finally calls the architecture-specific pgd_alloc() to allocate the page global directory (pgd). pgd_alloc() clears the user portion and copies all kernel page table entries from init_mm, the memory descriptor of process 0, into the new directory.


// arch/riscv/include/asm/pgalloc.h : 80
static inline pgd_t *pgd_alloc(struct mm_struct *mm)
{
 pgd_t *pgd;

 pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
 if (likely(pgd != NULL)) {
  memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
  /* Copy kernel mappings */
  memcpy(pgd + USER_PTRS_PER_PGD,
   init_mm.pgd + USER_PTRS_PER_PGD,
   (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
 }
 return pgd;
}
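The split between the cleared user half and the copied kernel half is easy to verify with a tiny user-space model. The sketch below mirrors the logic of pgd_alloc() with a hypothetical eight-entry table; the sizes and `sk_` names are assumptions chosen only to keep the example small.

```c
#include <assert.h>
#include <string.h>

#define SK_PTRS_PER_PGD       8   /* hypothetical, tiny table for illustration */
#define SK_USER_PTRS_PER_PGD  4   /* lower entries: user space */

typedef unsigned long sk_pgd_t;

void sk_pgd_init(sk_pgd_t *pgd, const sk_pgd_t *init_pgd)
{
	/* The user part starts out empty. */
	memset(pgd, 0, SK_USER_PTRS_PER_PGD * sizeof(sk_pgd_t));
	/* The kernel part is copied from the reference table. */
	memcpy(pgd + SK_USER_PTRS_PER_PGD,
	       init_pgd + SK_USER_PTRS_PER_PGD,
	       (SK_PTRS_PER_PGD - SK_USER_PTRS_PER_PGD) * sizeof(sk_pgd_t));
}
```

Because every process receives the same kernel entries, kernel addresses resolve identically no matter which process is running.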

init_new_context() sets the context ID in the memory descriptor to 0. This guarantees that the ASID is treated as invalid the first time the process is switched in, so a fresh ASID version and hardware ASID are allocated for the memory descriptor at that point.


// arch/riscv/include/asm/mmu_context.h : 26

#define init_new_context init_new_context
static inline int init_new_context(struct task_struct *tsk,
   struct mm_struct *mm)
{
#ifdef CONFIG_MMU
 atomic_long_set(&mm->context.id, 0);
#endif
 return 0;
}
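Why an ID of 0 counts as invalid can be sketched with a simplified version check. This is loosely modeled on the idea of a context ID made of a version part and a hardware ASID part; the field width, the starting version and all `sk_` names are assumptions, not the kernel's actual values.

```c
#include <assert.h>

#define SK_ASID_BITS      4UL                         /* hypothetical ASID width */
#define SK_ASID_MASK      ((1UL << SK_ASID_BITS) - 1)
#define SK_FIRST_VERSION  (1UL << SK_ASID_BITS)       /* versions live above the ASID field */

/* A context ID is usable only if its version part equals the current version. */
int sk_asid_is_current(unsigned long id, unsigned long current_version)
{
	return (id & ~SK_ASID_MASK) == current_version;
}

/* Build a fresh context ID from the current version and a hardware ASID. */
unsigned long sk_new_context(unsigned long current_version, unsigned long asid)
{
	return current_version | (asid & SK_ASID_MASK);
}
```

Since the version never equals 0 in this model, an ID of 0 always fails the check, which forces an allocation on the first switch.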

During scheduling, switch_mm() switches to the page global directory (pgd) of the newly created process. It uses the csr_write macro to write the directory address and the ASID into the CSR_SATP register. See the switch_mm() section of the context_switch() analysis in this project.

copy_mm() then calls dup_mmap() to copy the parent's address space to the child. The page tables are copied level by level; when the last-level page table entries are copied, pte_wrprotect() is called to write-protect the pages. This is the basis of copy-on-write (COW).
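The write-protection step can be sketched at the bit level. The low PTE bits below (V, R, W) follow the RISC-V privileged specification; the helper names and the simplified copy routine are hypothetical and ignore everything except the write bit.

```c
#include <assert.h>

/* Low PTE bits as defined by the RISC-V privileged specification. */
#define SK_PAGE_PRESENT (1UL << 0)   /* V */
#define SK_PAGE_READ    (1UL << 1)   /* R */
#define SK_PAGE_WRITE   (1UL << 2)   /* W */

typedef unsigned long sk_pte_t;

sk_pte_t sk_pte_wrprotect(sk_pte_t pte)
{
	return pte & ~SK_PAGE_WRITE;
}

int sk_pte_write(sk_pte_t pte)
{
	return (pte & SK_PAGE_WRITE) != 0;
}

/*
 * Fork-time handling of one writable private page: write-protect the
 * parent's entry and hand the child the same read-only mapping.
 */
sk_pte_t sk_copy_pte_cow(sk_pte_t *parent_pte)
{
	if (sk_pte_write(*parent_pte))
		*parent_pte = sk_pte_wrprotect(*parent_pte);
	return *parent_pte;
}
```

After the copy, the first write from either process faults, and the fault handler can then give the writer a private copy of the page.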

copy_thread

copy_thread() is the key architecture-specific step in process creation. It creates and initializes the thread context descriptor thread, which stores CPU-specific state and whose structure is defined per architecture. thread_struct is defined as follows:


// arch/riscv/include/asm : 31

struct thread_struct {
 /* Callee-saved registers */
 unsigned long ra;
 unsigned long sp; /* Kernel mode stack */
 unsigned long s[12]; /* s[0]: frame pointer */
 struct __riscv_d_ext_state fstate;
 unsigned long bad_cause;
};

This structure stores the values of the callee-saved registers. According to the register descriptions in the RISC-V manual:

ra is the return address register; it holds the address at which execution resumes after a ret instruction.
sp is the kernel stack pointer.
s0 to s11 are the saved registers.
fstate holds the floating-point registers (see the description of fstate in the dup_task_struct() section).

copy_thread() is implemented as follows:


// arch/riscv/kernel/process.c : 122
int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
  struct task_struct *p, unsigned long tls)
{
 struct pt_regs *childregs = task_pt_regs(p);

 /* p->thread holds context to be restored by __switch_to() */
 if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
  /* Kernel thread */
  memset(childregs, 0, sizeof(struct pt_regs));
  childregs->gp = gp_in_global;
  /* Supervisor/Machine, irqs on: */
  childregs->status = SR_PP | SR_PIE;

  p->thread.ra = (unsigned long)ret_from_kernel_thread;
  p->thread.s[0] = usp; /* fn */
  p->thread.s[1] = arg;
 } else {
  *childregs = *(current_pt_regs());
  if (usp) /* User fork */
   childregs->sp = usp;
  if (clone_flags & CLONE_SETTLS)
   childregs->tp = tls;
  childregs->a0 = 0; /* Return value of fork() */
  p->thread.ra = (unsigned long)ret_from_fork;
 }
 p->thread.sp = (unsigned long)childregs; /* kernel sp */
 return 0;
}

copy_thread() first locates the register context in the child's kernel stack: task_pt_regs(p) casts the top of the kernel stack to a struct pt_regs, which stores register values on exceptions and system calls. On RISC-V this structure is defined as follows:


struct pt_regs {

 unsigned long epc;
 unsigned long ra;
 unsigned long sp;
 unsigned long gp;
 unsigned long tp;


 ...
 /* Supervisor/Machine CSRs */
 unsigned long status;
 unsigned long badaddr;
 unsigned long cause;
 /* a0 value before the syscall */
 unsigned long orig_a0;
};
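How a register frame is carved out of the top of the kernel stack can be sketched with pointer arithmetic. The stack size, the alignment and the shortened register structure below are assumptions for illustration; the helper only approximates what task_pt_regs() computes.

```c
#include <assert.h>
#include <stdint.h>

#define SK_THREAD_SIZE  (16 * 1024)   /* hypothetical kernel stack size */
#define SK_STACK_ALIGN  16            /* hypothetical stack alignment */
#define SK_ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uintptr_t)(a) - 1))

/* Shortened stand-in for struct pt_regs. */
struct sk_pt_regs {
	unsigned long epc, ra, sp, gp, tp, a0, status;
};

/* The frame occupies the highest aligned slot of the stack area. */
struct sk_pt_regs *sk_task_pt_regs(void *stack_base)
{
	uintptr_t top = (uintptr_t)stack_base + SK_THREAD_SIZE;

	return (struct sk_pt_regs *)
		(top - SK_ALIGN_UP(sizeof(struct sk_pt_regs), SK_STACK_ALIGN));
}
```

The stack then grows downward from just below this frame, so the frame is never overwritten by ordinary kernel stack usage.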

copy_thread() then takes one of two branches depending on whether the new task is a kernel thread or a user process. Both are described below.

Kernel thread

If the new task is a kernel thread, ra in the thread context is set to the address of ret_from_kernel_thread, s[0] is set to fn (the function passed in when the kernel thread was created), and s[1] is set to arg (the argument to be passed to fn).

As a result, when the child first runs it starts in ret_from_kernel_thread. This RISC-V assembly routine calls the function held in register s0, passing register s1 as its argument.
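The trampoline idea can be modeled in portable C with function pointers. In the sketch below the `fn` and `arg` members play the roles of s[0] and s[1], and `ra` is the resume address; every name is hypothetical and the real mechanism is implemented in assembly.

```c
#include <assert.h>

typedef int (*sk_thread_fn)(void *);

struct sk_thread {
	void (*ra)(struct sk_thread *);  /* where the task resumes */
	sk_thread_fn fn;                 /* plays the role of s[0] */
	void *arg;                       /* plays the role of s[1] */
	int result;
};

/* Stand-in for ret_from_kernel_thread: call fn(arg). */
void sk_ret_from_kernel_thread(struct sk_thread *t)
{
	t->result = t->fn(t->arg);
}

/* Kernel-thread branch of the setup: record the trampoline and its inputs. */
void sk_setup_kernel_thread(struct sk_thread *t, sk_thread_fn fn, void *arg)
{
	t->ra = sk_ret_from_kernel_thread;
	t->fn = fn;
	t->arg = arg;
	t->result = 0;
}

/* First switch to the task: continue at the saved return address. */
void sk_first_switch_to(struct sk_thread *t)
{
	t->ra(t);
}

/* Example thread function used to exercise the sketch. */
int sk_double(void *arg)
{
	return 2 * *(int *)arg;
}
```

For example, setting up a thread with sk_double and a pointer to the value 21, then calling sk_first_switch_to(), leaves 42 in the result member.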

User process

It calls current_pt_regs() to get the register context in the kernel stack of the current (parent) process and copies it into childregs, the register context in the child's kernel stack.
If usp is non-zero, it assigns usp to sp in the child's register context; this sp becomes the user stack pointer after the return to user mode.
If CLONE_SETTLS is set in clone_flags, it assigns tls to tp. See the documentation of the clone system call for CLONE_SETTLS.
On RISC-V, a0 carries the return value when a system call returns to user mode, so childregs->a0 is set to 0. When the child returns to user mode from fork(), a0 is restored from the kernel stack, and the child therefore sees a return value of 0.
It sets thread.ra in the child's thread context to the address of the ret_from_fork assembly routine and sets thread.sp to the address of childregs. After these steps, the process descriptor of the user process looks as shown in the following figure.

When the kernel switches to the child, it restores the callee-saved registers from the thread context: the stack pointer becomes thread.sp as set above, and ra becomes thread.ra (the address of ret_from_fork), so the first thing the child executes after being scheduled is ret_from_fork.

When ret_from_fork finishes, the register context stored in the kernel stack is restored into the RISC-V registers. This includes writing usp to the stack pointer, which switches from the kernel stack to the user stack. Finally, mret or sret returns to user mode. The section "Execution of child process after process creation" describes this in detail.

The discussion above involves the kernel stack, the user stack, the register context in the kernel stack, and the thread context in the process descriptor. These concepts are easy to confuse and should be distinguished carefully.
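The user-process branch of copy_thread() can be condensed into a small model that operates on a shortened register frame. The structure and function names are hypothetical; the CLONE_SETTLS value is taken from include/uapi/linux/sched.h.

```c
#include <assert.h>

#define SK_CLONE_SETTLS 0x00080000UL  /* value from include/uapi/linux/sched.h */

/* Shortened stand-in for struct pt_regs. */
struct sk_pt_regs {
	unsigned long epc, sp, tp, a0;
};

void sk_copy_user_regs(struct sk_pt_regs *child, const struct sk_pt_regs *parent,
		       unsigned long clone_flags, unsigned long usp,
		       unsigned long tls)
{
	*child = *parent;          /* same epc: both resume after the system call */
	if (usp)
		child->sp = usp;   /* caller supplied a separate user stack */
	if (clone_flags & SK_CLONE_SETTLS)
		child->tp = tls;   /* caller supplied a thread pointer */
	child->a0 = 0;             /* the child sees a return value of 0 */
}
```

With usp and the flags set to 0, which corresponds to a plain fork(), the child frame differs from the parent's only in a0.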

wake_up_new_task

Once the child's descriptor has been set up, kernel_clone() calls wake_up_new_task() to wake the child: it is added to the run queue and waits to be picked by the scheduler. Most of the work is done by activate_task(), whose main flow is as follows:

activate_task
 - enqueue_task
   -  p->sched_class->enqueue_task(rq, p, flags)
 - p->on_rq = TASK_ON_RQ_QUEUED

activate_task() first calls enqueue_task(), which invokes the enqueue function of the task's scheduling class (sched_class); the class was set in sched_fork(). After insertion, the task's on_rq field is set to TASK_ON_RQ_QUEUED, indicating that the task is on the run queue.
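The dispatch through the scheduling class is an ordinary function-pointer table. The sketch below models it with hypothetical `sk_` types and a run queue reduced to a counter; it is not the kernel's data layout.

```c
#include <assert.h>

#define SK_TASK_ON_RQ_QUEUED 1

struct sk_rq {
	int nr_running;  /* run queue reduced to a counter */
};

struct sk_task;

/* Per-class operations, reduced to the enqueue hook. */
struct sk_sched_class {
	void (*enqueue_task)(struct sk_rq *rq, struct sk_task *p);
};

struct sk_task {
	const struct sk_sched_class *sched_class;
	int on_rq;
};

/* Enqueue hook of a hypothetical scheduling class. */
void sk_enqueue_fair(struct sk_rq *rq, struct sk_task *p)
{
	(void)p;
	rq->nr_running++;
}

const struct sk_sched_class sk_fair_class = {
	.enqueue_task = sk_enqueue_fair,
};

void sk_activate_task(struct sk_rq *rq, struct sk_task *p)
{
	p->sched_class->enqueue_task(rq, p);  /* class-specific insertion */
	p->on_rq = SK_TASK_ON_RQ_QUEUED;      /* mark the task as queued */
}
```

Because the hook is looked up through the task's class pointer, the same activation path works for any scheduling class.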

Execution of child process after process creation

After the child has been created, it is only placed on the run queue; it does not run yet. It starts running only when the scheduler selects it. For details on the scheduler, refer to other material.

When the child is scheduled, the __switch_to assembly routine, called from the context switch function context_switch(), restores the selected next task to the state it had when it was switched out. See the article "RISC-V Linux Context Switch Analysis" for details.

// arch/riscv/kernel/entry.S : 512

ENTRY(__switch_to)
 /* Save context into prev->thread */
 li    a4,  TASK_THREAD_RA
 add   a3, a0, a4
 add   a4, a1, a4
 REG_S ra,  TASK_THREAD_RA_RA(a3)
 REG_S sp,  TASK_THREAD_SP_RA(a3)
 REG_S s0,  TASK_THREAD_S0_RA(a3)

 ....

 REG_S s11, TASK_THREAD_S11_RA(a3)
 /* Restore context from next->thread */
 REG_L ra,  TASK_THREAD_RA_RA(a4)
 REG_L sp,  TASK_THREAD_SP_RA(a4)
 REG_L s0,  TASK_THREAD_S0_RA(a4)

 ...

 REG_L s11, TASK_THREAD_S11_RA(a4)
 /* The offset of thread_info in task_struct is zero. */
 move tp, a1
 ret
ENDPROC(__switch_to)

According to the __switch_to code, if the next task has run before, the value of the ra register was saved in thread.ra when it was last switched out. When the task is switched back in, ra is restored to that saved return address, so the final ret instruction returns into context_switch(), which then runs its last function, finish_task_switch(), to complete the bookkeeping of the switch.

The newly created child, however, has never been switched out. For it, copy_thread() set thread.ra to ret_from_fork, which makes the scheduler treat the child as if it had been switched out at ret_from_fork.

Thus, the first time the new process is scheduled, it does not return into finish_task_switch() but starts at ret_from_fork.
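The role of the saved return address can be modeled with function pointers. In this sketch each task stores where it will resume, and the switch routine simply continues at the next task's saved address; all names are hypothetical and the strings only label the two possible resume points.

```c
#include <assert.h>
#include <string.h>

typedef const char *(*sk_resume_fn)(void);

struct sk_thread {
	sk_resume_fn ra;  /* saved return address */
};

const char *sk_finish_task_switch(void)
{
	return "finish_task_switch";
}

const char *sk_ret_from_fork(void)
{
	return "ret_from_fork";
}

/*
 * Model of the switch: prev is recorded as suspended inside the context
 * switch path, then execution continues at next's saved return address.
 */
const char *sk_switch_to(struct sk_thread *prev, struct sk_thread *next)
{
	prev->ra = sk_finish_task_switch;
	return next->ra();
}
```

A task whose ra was preset to sk_ret_from_fork resumes there on its first switch; after it has been switched out once, it resumes in sk_finish_task_switch like every other task.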

The main flow of ret_from_fork is shown below. The assembly first calls schedule_tail() to complete the bookkeeping of the switch; schedule_tail() in effect performs the same finishing work as finish_task_switch().


ret_from_fork
 - schedule_tail
 - ret_from_exception
  - restore_all

ret_from_fork then jumps to ret_from_exception, the common path for returning from an exception or system call.

The main part of ret_from_exception is restore_all, shown below. It restores the state from before the system call by loading the register context that was saved in the kernel stack at system call entry into the RISC-V registers.

// arch/riscv/kernel/entry.S : 268

restore_all:

 REG_L a0, PT_STATUS(sp)
 REG_L  a2, PT_EPC(sp)
 REG_SC x0, a2, PT_EPC(sp)

 csrw CSR_STATUS, a0
 csrw CSR_EPC, a2
 REG_L x1, PT_RA(sp)
 REG_L x3, PT_GP(sp)
 REG_L x4, PT_TP(sp)

 ...
 REG_L x30, PT_T5(sp)
 REG_L x31, PT_T6(sp)

 REG_L x2, PT_SP(sp)

#ifdef CONFIG_RISCV_M_MODE
mret
#else
sret
#endif

First, the STATUS register is restored.
Next, the EPC register is restored. It holds the address of the instruction following the system call. Because copy_thread() copied the parent's register context with *childregs = *(current_pt_regs()), the EPC of both child and parent points to the return address of the system call.
All registers except the stack pointer are then restored in turn.
The stack pointer x2 is restored last, which completes the switch from the kernel stack to the user stack.
Finally, mret or sret returns to user mode, and execution continues with the user-mode code at the address in EPC.

This completes the analysis of the process creation code.

Summary

Following the code execution flow, this article analyzed process creation and its RISC-V-specific handling, showing how the characteristics of the RISC-V processor shape the implementation of processes.