I have a Python script that checks a queue and performs an action on each item:

# checkqueue.py
while True:
  check_queue()
  do_something()

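How do I write a bash script that checks whether it is running and, if not, starts it? Roughly the following pseudocode (or should it do something like ps | grep?):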

# keepalivescript.sh
if processidfile exists:
  if processid is running:
     exit, all ok

run checkqueue.py
write processid to processidfile

I will call it from crontab:

# crontab
*/5 * * * * /path/to/keepalivescript.sh

Current answer

Inline:

while true; do <your-bash-snippet> && break; done

This keeps restarting <your-bash-snippet> for as long as it fails. The && break stops the loop once <your-bash-snippet> exits gracefully (return code 0).

To restart <your-bash-snippet> in all cases:

while true; do <your-bash-snippet>; done

Example #1

while true; do openconnect x.x.x.x:xxxx && break; done

Example #2

while true; do docker logs -f container-name; sleep 2; done
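
To see the first variant end to end, here is a minimal sketch. flaky_task is a hypothetical stand-in for <your-bash-snippet> (it is not part of the answer above), and the short sleep is an added assumption to keep a fast-failing command from spinning the CPU:

```shell
#!/usr/bin/env bash
# flaky_task is a hypothetical stand-in for <your-bash-snippet>:
# it fails on its first two runs and succeeds on the third.
attempts=0
flaky_task() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

# Restart until the command exits with status 0, then leave the loop.
while true; do
    flaky_task && break
    sleep 0.1   # brief pause between restarts (an addition for this sketch)
done

echo "succeeded after $attempts attempts"
```

Dropping the && break gives the second variant, which restarts the command no matter how it exits.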

Other answers

Avoid PID files, cron, or anything else that tries to evaluate processes that aren't its own children.

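There is a very good reason why, in UNIX, you can ONLY wait on your own children. Any method that tries to work around that (ps parsing, pgrep, storing a PID, and so on) is flawed. Just say no.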

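Instead, the process that monitors your process needs to be that process's parent. What does this mean? It means that only the process that starts your process can reliably wait for it to end. In bash, this is absolutely trivial.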

until myserver; do
    echo "Server 'myserver' crashed with exit code $?.  Respawning.." >&2
    sleep 1
done

The above piece of bash code runs myserver in an until loop. The first line starts myserver and waits for it to end. When it ends, until checks its exit status. If the exit status is 0, it means it ended gracefully (which means you asked it to shut down somehow, and it did so successfully). In that case we don't want to restart it (we just asked it to shut down!). If the exit status is not 0, until will run the loop body, which emits an error message on STDERR and restarts the loop (back to line 1) after 1 second.

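Why do we wait a second? Because if something is wrong with the startup sequence of myserver and it crashes immediately, you'll have a very intensive loop of constant restarting and crashing on your hands. The sleep 1 takes the strain off.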
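
If a fixed one-second pause still restarts a broken server too often, one possible refinement is to lengthen the pause after each crash. The following is only a sketch of that idea, not part of the original answer: the function name supervise, the doubling policy, and the 60-second cap are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# supervise (hypothetical helper): rerun a command until it exits 0,
# doubling the pause after each crash up to max_delay seconds.
supervise() {
    local delay=1 max_delay=60 status
    until "$@"; do
        status=$?   # exit status of the command that just ended
        echo "'$1' crashed with exit code $status. Respawning in ${delay}s.." >&2
        sleep "$delay"
        delay=$((delay * 2))
        if [ "$delay" -gt "$max_delay" ]; then
            delay=$max_delay
        fi
    done
}
```

It would be called as supervise myserver, in place of the bare until loop.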

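Now all you need to do is start this bash script (asynchronously, probably), and it will monitor myserver and restart it as necessary. If you want to start the monitor on boot (making the server "survive" reboots), you can schedule it in your user's cron(1) with an @reboot rule. Open your cron rules with crontab: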

crontab -e

Then add a rule to start your monitor script:

@reboot /usr/local/bin/myservermonitor

Alternatively, look at inittab(5) and /etc/inittab. You can add a line there to have myserver start at a certain init level and be respawned automatically.


Edit.

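Let me add some information on why not to use PID files. While they are very popular, they are also very flawed, and there's no reason why you wouldn't just do it the correct way.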

Consider this:

1. PID recycling (killing the wrong process):
   - /etc/init.d/foo start: start foo, write foo's PID to /var/run/foo.pid
   - A while later: foo dies somehow.
   - A while later: any random process that starts (call it bar) takes a random PID; imagine it taking foo's old PID.
   - You notice foo's gone: /etc/init.d/foo restart reads /var/run/foo.pid, checks to see if it's still alive, finds bar, thinks it's foo, kills it, starts a new foo.
2. PID files go stale. You need over-complicated (or should I say, non-trivial) logic to check whether the PID file is stale, and any such logic is again vulnerable to 1.
3. What if you don't even have write access or are in a read-only environment?
4. It's pointless overcomplication; see how simple my example above is. No need to complicate that, at all.

See also: Are PID-files still flawed when doing it "right"?

By the way, even worse than PID files is parsing ps! Don't ever do this.

1. ps is very unportable. While you find it on almost every UNIX system, its arguments vary greatly if you want non-standard output. And standard output is ONLY for human consumption, not for scripted parsing!
2. Parsing ps leads to a LOT of false positives. Take the ps aux | grep PID example, and now imagine someone starting a process with a number somewhere as argument that happens to be the same as the PID you started your daemon with! Imagine two people starting an X session and you grepping for X to kill yours. It's just all kinds of bad.
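
For contrast, here is a small sketch of what the parent can do without ps at all: it gets the child's PID from $! and its exit status from wait. The worker function and its exit status 7 are made up for this illustration.

```shell
#!/usr/bin/env bash
# worker is a hypothetical child that runs briefly and exits with status 7.
worker() { sleep 0.2; return 7; }

worker &            # start the child in the background
child_pid=$!        # PID handed over by the shell, no ps parsing needed

# wait blocks until this exact child has ended and returns its exit status.
child_status=0
wait "$child_pid" || child_status=$?

echo "child $child_pid exited with status $child_status"
```

Only the shell that started worker can do this; any other process would have to guess from the process table.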

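If you don't want to manage the process yourself, there are some perfectly good systems out there that will act as a monitor for your processes. Look into runit, for example.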

The simplest way is to use flock on a file. In the Python script:

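import fcntl, os, sys

# Append mode, so a second instance does not truncate the owner's PID on open.
lf = open('/tmp/script.lock', 'a+')
try:
    fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)  # raises OSError if already locked
except OSError:
    sys.exit('other instance already running')
lf.truncate(0)  # we hold the lock now; discard any stale PID
lf.write('%d\n' % os.getpid())
lf.flush()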

In the shell, you can test whether it is running:

if [ "$(flock -xn /tmp/script.lock -c 'echo 1')" ]; then
   echo "it's not running"
   # restart it here
else
   echo -n "it's already running with PID "
   cat /tmp/script.lock
fi

But of course you don't need to test, because if it is already running and you start it again, the new instance exits with "other instance already running".

When the process dies, all its file descriptors are closed and all its locks are released automatically.
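
The release-on-exit behaviour can be checked with the flock(1) utility alone. This is a rough sketch that assumes flock from util-linux is installed; the temporary lock file and the sleep durations are arbitrary choices for the demo.

```shell
#!/usr/bin/env bash
# Assumes the flock(1) utility from util-linux is available.
lock=$(mktemp)   # throwaway lock file for the demo

# First "instance": holds the lock for as long as it runs.
flock -xn "$lock" -c 'sleep 2' &
holder=$!
sleep 0.5        # give the holder time to take the lock

# Second instance: -n makes flock fail at once instead of waiting.
if flock -xn "$lock" -c 'true'; then second="started"; else second="refused"; fi

wait "$holder"   # the holder exits and its lock goes away with it
if flock -xn "$lock" -c 'true'; then third="started"; else third="refused"; fi

echo "while held: $second, after exit: $third"
rm -f "$lock"
```

The same utility can wrap the command directly in a cron entry, for example `*/5 * * * * flock -n /tmp/script.lock /path/to/checkqueue.py` (the script path here is a placeholder), so that cron only starts a new copy when no other copy holds the lock.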

watch "yourcommand"

This restarts the process if and when it stops (after a 2s delay).

watch -n 0.1 "yourcommand"

This restarts it after 0.1s instead of the default 2 seconds.

watch -e "yourcommand"

This stops restarting if the program exits with an error.

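Advantages:

built-in command
one line
easy to use and remember

Drawbacks:

Only shows the command's output on screen once the command has finished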

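I'm not sure how portable it is across operating systems, but you might check whether your system has the 'run-one' command, i.e. "man run-one". Specifically, this set of commands includes 'run-one-constantly', which seems to be exactly what is needed.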

From the man page:

run-one-constantly COMMAND [ARGS]

Note: obviously this could be called from within a script, but it also removes the need for having a script at all.
