fix: update status label should call after task is started #4886

Open
ningmingxiao wants to merge 1 commit into containerd:main from ningmingxiao:fix_restart

@ningmingxiao
Contributor

@ningmingxiao ningmingxiao commented May 6, 2026

Fixes containerd/containerd#13350
How it happened, starting from this command:
nerdctl run -d --restart=always busybox sleep 10000
Step 1: nerdctl creates the task.

	task, err := taskutil.NewTask(ctx, client, c, taskutil.TaskOptions{
		AttachStreamOpt: createOpt.Attach,
		IsInteractive:   createOpt.Interactive,
		IsTerminal:      createOpt.TTY,
		IsDetach:        createOpt.Detach,
		Con:             con,
		LogURI:          logURI,
		DetachKeys:      createOpt.DetachKeys,
		Namespace:       createOpt.GOptions.Namespace,
		DetachC:         detachC,
		CheckpointDir:   "",
	})

After taskutil.NewTask returns, the task has been created but not yet started.

Step 2: containerd finds that desiredStatus is running, but the task status is created rather than running.

	desiredStatus := containerd.ProcessStatus(labels[restart.StatusLabel])
	if task, err = c.Task(ctx, nil); err == nil {
		if status, err = task.Status(ctx); err == nil {
			// Skip containers whose task is already in the desired state.
			if desiredStatus == status.Status {
				continue
			}
		}
	}

Step 3: nerdctl starts the task successfully.
Step 4: containerd kills and deletes the task, because it saw the task as not running even though the task is in fact running by now.
Step 5: nerdctl exec fails.
Step 6: containerd recreates the task.
@AkihiroSuda @ChengyuZhu6 could you take a look?
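
The ordering problem in the steps above can be sketched with a toy model. This is only an illustration, not containerd or nerdctl code: the `Status` type, the `container` struct, and the `reconcile` function are hypothetical names, and the sketch assumes that containers without a status label are ignored by the monitor.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the task states named in this thread.
type Status string

const (
	Created Status = "created"
	Running Status = "running"
)

// container is a toy model: the desired status label plus the actual task state.
type container struct {
	desired Status // value of the status label; "" means the label is not set yet
	actual  Status
}

// reconcile models the desiredStatus check quoted above. It returns true when
// the monitor would act on the container (kill, delete, and recreate the task).
func reconcile(c container) bool {
	if c.desired == "" {
		return false // assumption: unlabeled containers are not managed
	}
	return c.desired != c.actual
}

func main() {
	// Ordering A: the label is written before the task is started, and a
	// monitor tick lands while the task is still in the created state.
	a := container{desired: Running, actual: Created}
	fmt.Println("label before start, monitor acts:", reconcile(a))

	// Ordering B: the label is written only after the task has started.
	b := container{actual: Created}
	fmt.Println("label after start, tick before start, monitor acts:", reconcile(b))
	b.actual = Running
	b.desired = Running
	fmt.Println("label after start, tick after start, monitor acts:", reconcile(b))
}
```

In this model only ordering A gives the monitor a window in which the label and the task state disagree, which is the window this PR aims to close.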

@ningmingxiao
Contributor Author

@AkihiroSuda

@ningmingxiao ningmingxiao marked this pull request as draft May 7, 2026 02:35
@ningmingxiao ningmingxiao force-pushed the fix_restart branch 2 times, most recently from 863667c to 62a5e09 Compare May 7, 2026 03:34
@ningmingxiao ningmingxiao marked this pull request as ready for review May 7, 2026 03:34
@ningmingxiao ningmingxiao changed the title fix: update restart label should call after task is started fix: update status label should call after task is started May 7, 2026
@ningmingxiao ningmingxiao force-pushed the fix_restart branch 2 times, most recently from 6c4c518 to 7acc864 Compare May 7, 2026 03:51
Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>
@AkihiroSuda
Member

How to test?

@ningmingxiao
Contributor Author

I can reproduce it tomorrow by adding a sleep or by using dlv. The container uses restart labels; I forgot to mention that.

@ningmingxiao
Contributor Author

ningmingxiao commented May 11, 2026

I created local branches to reproduce it:
https://github.com/containerd/containerd/pull/13366/changes
https://github.com/ningmingxiao/nerdctl/pull/15/changes

nerdctl run --restart=always -d busybox sleep 1000 (nerdctl exited successfully)
❯ echo $?
0

containerd log

5月 11 10:26:06 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:06.093119614+08:00" level=info msg="connecting to shim 20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c" address="unix:///run/containerd/s/f7ca33ac16fda036d073a8e927140718c3982a3968af6f8c6672760a55d2d4cb" namespace=default protocol=ttrpc version=3
5月 11 10:26:07 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:07.435773350+08:00" level=info msg="{\"container_id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"bundle\":\"/run/containerd/io.containerd.runtime.v2.task/default/20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"rootfs\":[{\"type\":\"overlay\",\"source\":\"overlay\",\"options\":[\"workdir=/var/lib/containerd2/io.containerd.snapshotter.v1.overlayfs/snapshots/9/work\",\"upperdir=/var/lib/containerd2/io.containerd.snapshotter.v1.overlayfs/snapshots/9/fs\",\"lowerdir=/var/lib/containerd2/io.containerd.snapshotter.v1.overlayfs/snapshots/3/fs\",\"index=off\"]}],\"io\":{\"stdout\":\"binary:///usr/local/bin/nerdctl?_NERDCTL_INTERNAL_LOGGING=%2Fvar%2Flib%2Fnerdctl%2F1935db59\",\"stderr\":\"binary:///usr/local/bin/nerdctl?_NERDCTL_INTERNAL_LOGGING=%2Fvar%2Flib%2Fnerdctl%2F1935db59\"},\"pid\":1329281}" ns=default topic=/tasks/create type=containerd.events.TaskCreate
5月 11 10:26:10 LIN-5A04F407932 containerd[1325813]: /tmp/changes
5月 11 10:26:11 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:11.450742430+08:00" level=info msg="{\"container_id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"pid\":1329281}" ns=default topic=/tasks/start type=containerd.events.TaskStart
5月 11 10:26:11 LIN-5A04F407932 containerd[1325813]: /tmp/start
5月 11 10:26:11 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:11.787132412+08:00" level=error msg="ttrpc: received message on inactive stream" stream=19
5月 11 10:26:11 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:11.788373721+08:00" level=info msg="{\"container_id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"pid\":1329281,\"exit_status\":137,\"exited_at\":{\"seconds\":1778466371,\"nanos\":786743651}}" ns=default topic=/tasks/exit type=containerd.events.TaskExit
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:12.458734104+08:00" level=warning msg="failed to send SIGTERM signal, killing logging shim" error="os: process already finished"
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:12.481356385+08:00" level=info msg="{\"container_id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"pid\":1329281,\"exit_status\":137,\"exited_at\":{\"seconds\":1778466371,\"nanos\":786743651}}" ns=default topic=/tasks/delete type=containerd.events.TaskDelete
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:12.481778779+08:00" level=info msg="shim disconnected" id=20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c namespace=default
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:12.481813847+08:00" level=info msg="cleaning up after shim disconnected" id=20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c namespace=default
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:12.481825464+08:00" level=info msg="cleaning up dead shim" id=20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c namespace=default
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: kill container
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:12.509688692+08:00" level=info msg="connecting to shim 20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c" address="unix:///run/containerd/s/f7ca33ac16fda036d073a8e927140718c3982a3968af6f8c6672760a55d2d4cb" namespace=default protocol=ttrpc version=3
5月 11 10:26:13 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:13.893975671+08:00" level=info msg="{\"container_id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"bundle\":\"/run/containerd/io.containerd.runtime.v2.task/default/20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"rootfs\":[{\"type\":\"overlay\",\"source\":\"overlay\",\"options\":[\"workdir=/var/lib/containerd2/io.containerd.snapshotter.v1.overlayfs/snapshots/9/work\",\"upperdir=/var/lib/containerd2/io.containerd.snapshotter.v1.overlayfs/snapshots/9/fs\",\"lowerdir=/var/lib/containerd2/io.containerd.snapshotter.v1.overlayfs/snapshots/3/fs\",\"index=off\"]}],\"io\":{\"stdout\":\"binary:///usr/local/bin/nerdctl?_NERDCTL_INTERNAL_LOGGING=%2Fvar%2Flib%2Fnerdctl%2F1935db59\",\"stderr\":\"binary:///usr/local/bin/nerdctl?_NERDCTL_INTERNAL_LOGGING=%2Fvar%2Flib%2Fnerdctl%2F1935db59\"},\"pid\":1329986}" ns=default topic=/tasks/create type=containerd.events.TaskCreate
5月 11 10:26:13 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:13.916257758+08:00" level=info msg="{\"container_id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"pid\":1329986}" ns=default topic=/tasks/start type=containerd.events.TaskStart

The log shows that the container exited and was recreated:

5月 11 10:26:11 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:11.788373721+08:00" level=info msg="{\"container_id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"id\":\"20a8600f97db18f59c48f3f9ddfb64cf2200d83480ec65b15367f21fdeb0fd4c\",\"pid\":1329281,\"exit_status\":137,\"exited_at\":{\"seconds\":1778466371,\"nanos\":786743651}}" ns=default topic=/tasks/exit type=containerd.events.TaskExit
5月 11 10:26:12 LIN-5A04F407932 containerd[1325813]: time="2026-05-11T10:26:12.458734104+08:00" level=warning msg="failed to send SIGTERM signal, killing logging shim" error="os: process already finished"

I also created a PR to record events: containerd/containerd#13324
@AkihiroSuda
However, it is difficult to add this reproduction to CI.
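
For checking a captured log by hand, the event sequence can be pulled out of the `topic=` field. The following is a rough sketch, not part of this PR: `topics` and `recreated` are hypothetical helpers, and the sample lines are the trailing fields of the journal lines above with the `msg` payload trimmed.

```go
package main

import (
	"fmt"
	"regexp"
)

// topicRe matches the "topic=/tasks/<name>" field found in the log lines above.
var topicRe = regexp.MustCompile(`topic=(/tasks/[a-z]+)`)

// topics returns the task event topics in the order they appear in lines.
func topics(lines []string) []string {
	var out []string
	for _, l := range lines {
		if m := topicRe.FindStringSubmatch(l); m != nil {
			out = append(out, m[1])
		}
	}
	return out
}

// recreated reports whether a /tasks/create event follows a /tasks/delete
// event, which is the pattern visible in the log for this container.
func recreated(ts []string) bool {
	deleted := false
	for _, t := range ts {
		switch t {
		case "/tasks/delete":
			deleted = true
		case "/tasks/create":
			if deleted {
				return true
			}
		}
	}
	return false
}

func main() {
	// Trailing fields of the journal lines above, in log order.
	lines := []string{
		`ns=default topic=/tasks/create type=containerd.events.TaskCreate`,
		`ns=default topic=/tasks/start type=containerd.events.TaskStart`,
		`ns=default topic=/tasks/exit type=containerd.events.TaskExit`,
		`ns=default topic=/tasks/delete type=containerd.events.TaskDelete`,
		`ns=default topic=/tasks/create type=containerd.events.TaskCreate`,
		`ns=default topic=/tasks/start type=containerd.events.TaskStart`,
	}
	ts := topics(lines)
	fmt.Println(ts)
	fmt.Println("recreated:", recreated(ts))
}
```

This only inspects a log after the fact and assumes the log belongs to a single container; it does not make the race deterministic, so it does not by itself solve the CI question.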

@AkihiroSuda AkihiroSuda added this to the v2.3.1 milestone May 11, 2026
@AkihiroSuda AkihiroSuda requested a review from a team May 11, 2026 05:35
@haytok
Member

haytok commented May 16, 2026

@ningmingxiao

I’m not a committer, but since this is a strange issue, I took a look at the PR.

For example, by adding the following to /etc/containerd/config.toml to shorten containerd’s reconcile interval (ref), I confirmed that the issue described in this PR can be reproduced without inserting sleep on the nerdctl side.

  [plugins."io.containerd.internal.v1.restart"]
    interval = "100ms"

So the race itself does seem to exist. However, with the default interval of 10s, this issue is by its nature fairly unlikely to occur in practice, so I'd like to understand its real-world impact before reviewing the implementation in detail.

How did you discover this issue in the first place, and in what situation would not being able to resolve it cause problems? Is there some critical scenario behind it?

@ningmingxiao
Contributor Author

ningmingxiao commented May 16, 2026

Our users create a container with nerdctl and then run nerdctl exec against it to check something, but the exec failed. This happened several times, so I added debug logging to the containerd kill API. I also enabled tracing of signal 9 with sysctl kernel.monitor_signals=0x100 and found that the container's main process was killed by the shim, not by the OOM killer. I also printed all container events; see my PR for containerd. @haytok
