Skip to content

vmm: prevent serial manager TCP shutdown deadlock#172

Merged
amphi merged 1 commit into
cyberus-technology:gardenlinuxfrom
amphi:serial_manager_dont_hang
Jun 18, 2026
Merged

vmm: prevent serial manager TCP shutdown deadlock#172
amphi merged 1 commit into
cyberus-technology:gardenlinuxfrom
amphi:serial_manager_dont_hang

Conversation

@amphi

@amphi amphi commented Jun 18, 2026

Copy link
Copy Markdown

Keep track of the active TCP console client so the serial manager can shut it down before waiting for the worker thread to exit. This prevents a shutdown deadlock where a blocked TCP read keeps VM teardown stuck after migration or console disconnects.

By storing a clone of the active TCP stream and closing it during drop, the read side is forced to wake up and observe the shutdown path instead of waiting forever in the socket read. The worker then exits its epoll loop, the join returns, and VM teardown can finish.

@amphi amphi requested review from hertrste, phip1611 and tpressure June 18, 2026 09:33
Comment thread vmm/src/serial_manager.rs
pty_write_out: Option<Arc<AtomicBool>>,
socket_path: Option<PathBuf>,
/// Tracks the active TCP client so `Drop` can shut it down before joining.
active_tcp_stream: Arc<Mutex<Option<TcpStream>>>,

@phip1611 phip1611 Jun 18, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be Option<Arc<Mutex<TcpStream>>>. I'm unsure if you need the Arc<Mutex> at all since you seem to have &mut self available on every call site. Would Option<TcpStream> also work?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable is shared between the main VMM thread and serial-manager thread, so I think we need at least the Arc, and I'd like to synchronize the access to it, thus the Mutex

@phip1611 phip1611 Jun 18, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I'd like to synchronize the access to it, thus the Mutex

Never use a Mutex because you'd like to synchronize access - the compiler either forces you or doesn't complain which means there are no unsynchronized accesses :)

But I assume you wanted to say that Rust complains without Arc<Mutex<>>, hence the Mutex is necessary?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried to remove the Mutex<> part, but yes the compiler complains without it.

@tpressure

tpressure commented Jun 18, 2026

Copy link
Copy Markdown

AI comment:

vmm/src/serial_manager.rs:501-524 returns 0 for TCP close, WouldBlock , and Interrupted, but the shared path still calls queue_input_bytes(&input[..count]) at vmm/src/serial_manager.rs:542-547. Zero-length queueing is not harmless: x86 Serial::queue_input_bytes() still calls recv_data() and sets data-ready/interrupt state, and aarch64 Pl011::queue_input_bytes() clears RX-empty state with no byte queued. This can expose spurious readable serial input to the guest. The Ok(0) case existed before, but the PR rewrites this block and adds new zero-count paths. I’d skip queueing when count == 0.

@phip1611 phip1611 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise LGTM but what about Thomas' comment?

Keep track of the active TCP console client so the serial manager
can shut it down before waiting for the worker thread to exit. This
prevents a shutdown deadlock where a blocked TCP read keeps VM teardown
stuck after migration or console disconnects.

By storing a clone of the active TCP stream and closing it during drop,
the read side is forced to wake up and observe the shutdown path instead
of waiting forever in the socket read. The worker then exits its epoll
loop, the join returns, and VM teardown can finish.

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
@amphi amphi force-pushed the serial_manager_dont_hang branch from fe7f1e3 to 2d7ec1c Compare June 18, 2026 13:41
@amphi

amphi commented Jun 18, 2026

Copy link
Copy Markdown
Author

AI comment:

vmm/src/serial_manager.rs:501-524 returns 0 for TCP close, WouldBlock , and Interrupted, but the shared path still calls queue_input_bytes(&input[..count]) at vmm/src/serial_manager.rs:542-547. Zero-length queueing is not harmless: x86 Serial::queue_input_bytes() still calls recv_data() and sets data-ready/interrupt state, and aarch64 Pl011::queue_input_bytes() clears RX-empty state with no byte queued. This can expose spurious readable serial input to the guest. The Ok(0) case existed before, but the PR rewrites this block and adds new zero-count paths. I’d skip queueing when count == 0.

Oh wtf that was a rebasing mistake I think. I tried making the socket nonblocking first, but then decided that the current solution seems easier and more reliable. I reverted this change.

@amphi amphi requested a review from phip1611 June 18, 2026 13:42

@phip1611 phip1611 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After offline discussion. LGTM.

@amphi amphi merged commit 079ed4d into cyberus-technology:gardenlinux Jun 18, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants