Pitfalls of the shell=True parameter in Python's subprocess module, and how to get out of them

0x01 The problem

The program uses subprocess to create child processes that run other programs, and decides what to do next once those programs have finished running.

With shell=True passed to subprocess, the code that checks whether the user program has exited is as follows:

while child.poll() is None:  # the receiver name was lost in the source; "child" stands in for the Popen object
    do_something

The loop is meant to detect when the child process has finished. However, after the child process had finished running, the program stayed stuck in this loop and never continued with the code below it.
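For reference, a minimal sketch of this polling pattern in Python 3, with the command given as a list (so no shell is involved). The one-line Python child and the names child and polls are stand-ins chosen for illustration, not taken from the original program.

```python
import subprocess
import sys
import time

# Hypothetical child: a short Python one-liner standing in for "the other program".
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

polls = 0
while child.poll() is None:  # poll() returns None while the child is still running
    polls += 1               # placeholder for do_something
    time.sleep(0.1)

# Once poll() returns a value, the exit code is also available as returncode.
print("child exited with code", child.returncode)
```

Because poll() here watches the program itself rather than a shell wrapped around it, the loop ends as soon as that program exits.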

0x02 Cause analysis

A Baidu search turns up the following explanation of the shell parameter:

With shell=True, the command is passed as a single string and handed to the shell, which executes it. With shell=False, the command is normally passed as a list: the first element is the program to run and all remaining elements are its arguments.
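The difference can be seen in a small sketch, assuming Python 3.7+ on a POSIX system where echo and tr are available. The variable names via_shell and direct are illustrative only.

```python
import subprocess

# shell=True: one string, interpreted by the shell, so the pipe is honored.
via_shell = subprocess.run(
    "echo hello | tr a-z A-Z", shell=True, capture_output=True, text=True
)

# shell=False (the default): a list; the first element is the program and the
# rest are passed to it verbatim, so "|" is just ordinary text here.
direct = subprocess.run(
    ["echo", "hello | tr a-z A-Z"], capture_output=True, text=True
)

print(repr(via_shell.stdout))  # 'HELLO\n'
print(repr(direct.stdout))     # 'hello | tr a-z A-Z\n'
```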

Looking at the processes on the server shows that one is still present:

[Screenshot of the process list; image missing from the source]

This is the program that was started through the shell. It follows that with shell=True, the shell does not exit when the child process finishes running; it stays stuck in a shell command that is visible in the process list.


Addendum: Python pitfall journey, part one: the shell child process that cannot be killed

1.1 The pitfall case

The program in question is a resident (long-running) Agent-style management process. Its tasks include, but are not limited to, the following:

a. Multi-threaded processing of network communication packets

    Interacting with the ControlMaster node

    Listening on a fixed port

b. Regular operational tasks, carried out by executing shell commands

c. etc.

It was fun discovering the pit :)

a. After restarting the Agent, its port turned out to be occupied

=> My first thought was that the process had not been killed, or that the stop script had a problem.

=> Checking ruled that out: the Agent process really had died.

=> Running netstat -tanop | grep port_number showed that the port was indeed occupied.

=> Since this was a debugging environment, I simply killed the process occupying the port, and so missed the first chance to find the problem.

b. The problem resurfaced after some time: the port was again occupied after a restart

=> This time the culprit was traced to a script that was holding a port used by the Agent.

=> Strange: why would a script use such an unusual port? (The port number is above 60000. Interested readers can think about why the Agent defaults to a port above 60000.)

=> Reviewing the script showed that it contains no code that listens on a port.

c. Then it hit me: the process was sharing the resources of its parent process

=> I traced the script back to one of the scripts in a task started by the Agent.

=> That basically pins the problem down: the script is one that the Agent invokes.

=> The script inherited the Agent's open file descriptors (FDs), including the one for the listening port.

=> Although the timeout triggered the terminate mechanism for the script, terminate did not kill the child process.

=> The parent process ID (ppid) of this script process had been reset to 1.

d. So the problem lies in the timeout-kill logic for the script process
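The inheritance described in step c can be reproduced with a small sketch. It assumes Python 3 on a POSIX system with a usable loopback interface; the function name port_still_held and the sleeping child are made up for illustration. Python 3 closes inherited descriptors by default, so the sketch has to opt in explicitly (set_inheritable plus close_fds=False) to mimic the older behavior that bit the Agent.

```python
import socket
import subprocess
import sys

def port_still_held(leak_fd):
    """Return True if the port stays occupied after the 'Agent' closes its socket."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))   # let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    listener.set_inheritable(True)    # assumption: mimic inheritable FDs

    # Hypothetical long-running child, standing in for the script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        close_fds=not leak_fd,        # close_fds=False lets the child keep the FD
    )
    listener.close()                  # the "Agent" goes away

    probe = socket.socket()
    try:
        probe.bind(("127.0.0.1", port))
        held = False
    except OSError:                   # address already in use
        held = True
    finally:
        probe.close()
        child.kill()
        child.wait()
    return held

if __name__ == "__main__":
    print("child inherited FD  ->", port_still_held(leak_fd=True))
    print("child without FD    ->", port_still_held(leak_fd=False))
```

The child never touches the socket, yet as long as it holds a copy of the descriptor the port cannot be bound again.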

1.2 Pit-filling solutions

Through code review, we found the following library code that executes shell commands.

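import subprocess  # at module level

self._subpro = subprocess.Popen(
    cmd, shell=True, stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    preexec_fn=_signal_handle
)
# The key point is shell=True!
# (The callee and the stdout/stderr values were lost in the source;
# subprocess.Popen and subprocess.PIPE are assumed here.)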

Change the above code to:

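self._subpro = subprocess.Popen(
    cmd.split(), stdout=subprocess.PIPE,
    stderr=subprocess.PIPE, preexec_fn=_signal_handle
)
# The key point is removing shell=True
# (cmd.split(), subprocess.Popen and subprocess.PIPE are assumed here;
# those names were lost in the source.)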
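Without shell=True the command has to be supplied as a list, so a command string must be split first. A brief sketch of two ways to do that follows; the example command is hypothetical, and using shlex.split is a suggestion of this sketch rather than something the original code is known to do.

```python
import shlex

# Hypothetical command string containing a quoted argument.
cmd = "grep -r 'hello world' /tmp"

naive = cmd.split()        # splits on whitespace only, breaking the quoted part
parsed = shlex.split(cmd)  # follows shell quoting rules

print(naive)   # ['grep', '-r', "'hello", "world'", '/tmp']
print(parsed)  # ['grep', '-r', 'hello world', '/tmp']
```

For commands without quotes or escaped spaces the two give the same result; shlex.split is the safer choice when arguments may contain spaces.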

1.3 Locating the pit: analysis

The Agent runs this code in a newly created thread (via threading). If the thread times out (after xx seconds), the Agent calls self._subpro.terminate() to terminate the script.

On the surface, everything looks normal:

A new thread is started to execute the script.

If something goes wrong, the timeout kicks in and terminate is called to kill the process, so that other tasks are not held up.
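The run-then-terminate-on-timeout flow can be sketched as follows. This is not the Agent's code: it assumes Python 3.3+ on POSIX, uses Popen.wait(timeout=...) instead of a separate timer thread, and the name run_with_timeout is invented for the example.

```python
import signal
import subprocess
import sys

def run_with_timeout(argv, timeout):
    """Run argv (a list, no shell) and terminate it if it exceeds timeout seconds."""
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.terminate()    # SIGTERM goes straight to the program itself
        return proc.wait()  # reap it; a negative value means "killed by signal"

if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout=0.5) == -signal.SIGTERM)
```

With a list command, proc is the program itself, so terminate reaches the process that is actually doing the work.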

The deeper problem:

In Python 2, when shell=True is set, the pid associated with the Popen object is that of the shell (sh/bash/etc.) itself, which is the parent of the process that executes the command, not the pid of the process that actually runs the cmd task. (Python 3 behaves the same way.)

A child process copies its parent's table of open FDs, so even after the parent is killed, the child still holds the FD of the listening port.

As a result, terminate kills only the shell process (which may not even be fully gone; it can linger in the defunct state), while the process actually doing the work stays alive. That is exactly how the pitfall in 1.1 came about.
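The behavior can be observed with a small experiment. This is a sketch assuming Python 3 on a POSIX system with sh and sleep; the function name and the shell command are made up for the demonstration. The command is deliberately compound, because some shells may replace themselves with a single simple command, in which case the pid would be that of the command.

```python
import os
import signal
import subprocess

def terminate_hits_only_the_shell():
    """Return (pid_is_shell, worker_survived) after terminating a shell=True Popen."""
    # The shell starts sleep in the background, prints its pid, then waits for it.
    proc = subprocess.Popen(
        "sleep 30 & echo $!; wait", shell=True, stdout=subprocess.PIPE
    )
    worker_pid = int(proc.stdout.readline())
    pid_is_shell = proc.pid != worker_pid

    proc.terminate()   # SIGTERM is delivered to the shell only
    proc.wait()
    proc.stdout.close()

    try:
        os.kill(worker_pid, 0)   # signal 0 only checks that the process exists
        worker_survived = True
    except ProcessLookupError:
        worker_survived = False

    if worker_survived:
        os.kill(worker_pid, signal.SIGKILL)   # clean up the orphan
    return pid_is_shell, worker_survived

if __name__ == "__main__":
    print(terminate_hits_only_the_shell())
```

The orphaned sleep keeps every descriptor it inherited, which in the Agent's case included the listening port.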

1.4 Post-pit extensions

1.4.1 Extended knowledge

This section of Extended Knowledge consists of two parts.

In Linux, what information does a child process inherit from its parent process?

The significance of a resident process like Agent choosing port >60000

We'll save the extensions for the end of the next post, but if you're interested, you can search for them yourself.

1.4.2 Technical keywords

Linux system process

Linux random port selection

multithreaded execution of programs

Shell Execution

1.5 Summary of Pit Filling

1. A child process inherits resources from its parent process.

2. If you kill only the parent of a process, the child processes that inherited the parent's resources keep holding those resources without releasing them, including but not limited to:

    listening ports

    open FDs

    etc.

3. When using Popen, the boolean value of shell determines how the process has to be killed; choose it according to your scenario.
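When a scenario really does need shell=True (pipes, redirection and so on), one common alternative is to put the shell in its own session and signal the whole process group on timeout. The sketch below is an alternative to the fix used in this post, not part of it; it assumes Python 3 on POSIX, and the name run_shell_with_timeout is invented for the example.

```python
import os
import signal
import subprocess

def run_shell_with_timeout(cmd, timeout):
    """Run cmd through the shell; on timeout, signal the shell and its children."""
    # start_new_session=True makes the shell a session and process-group leader,
    # so its process group id equals proc.pid.
    proc = subprocess.Popen(cmd, shell=True, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGTERM)   # the whole group, not just the shell
        except ProcessLookupError:
            pass                                  # the group is already gone
        return proc.wait()

if __name__ == "__main__":
    print(run_shell_with_timeout("sleep 30; true", timeout=0.5))
```

Children that move themselves into a different process group would escape this, so it is a mitigation rather than a guarantee.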

The above is my personal experience; I hope it serves as a useful reference. If there are any mistakes, or anything I have not fully considered, please do not hesitate to point them out.