Implement epoll APIs in the JS filesystem #27207

Open
guybedford wants to merge 2 commits into emscripten-core:main from guybedford:epoll

@guybedford guybedford commented Jun 26, 2026

Collaborator

Updated version of #27201, based on #27206. Also includes an integrated callback model for #27181, to fully verify the unified wait/callback approach against epoll semantics.

Resolves #5033, #10556.

Adds epoll_create1, epoll_ctl, epoll_wait, epoll_pwait and a non-blocking JS-callback variant, emscripten_epoll_set_callback, on a single fd readiness model shared with poll().

Readiness in the JS filesystem is already event-driven in SOCKFS and PIPEFS. The integration point is a per-inode wait-queue: each FS node carries a listeners set, and producers call notifyNodeListeners(node, flags) on ready transitions. There is no separate or parallel readiness machinery; it integrates directly with the existing model. pollOne(fd, events) is reused, so poll() and epoll share the same readiness definition.
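As a rough illustration of the wait-queue idea (not the PR's code), the sketch below models a node with a lazily created listener set and a notify step that tolerates streams without a node. The names `addListener` and `notify`, and the shape of the node object, are assumptions made for this example only.

```javascript
// Simplified model of a per-node wait-queue.
// All names here are illustrative and are not taken from the PR's source.
const POLLIN = 0x001;
const POLLOUT = 0x004;

function addListener(node, listener) {
  // Create the listener set lazily so unwatched nodes pay nothing.
  if (!node.listeners) node.listeners = new Set();
  node.listeners.add(listener);
  // Return a disposer so the caller can detach later.
  return () => node.listeners.delete(listener);
}

function notify(node, flags) {
  // A missing node, or a node nobody watches, is a no-op.
  if (!node?.listeners) return 0;
  let fired = 0;
  // Iterate over a copy so a listener may detach itself while running.
  for (const listener of [...node.listeners]) {
    listener(flags);
    fired++;
  }
  return fired;
}

// Usage: a producer posts POLLIN when data arrives.
const node = {};
const seen = [];
const detach = addListener(node, (flags) => seen.push(flags));
notify(node, POLLIN);      // the listener runs
detach();
notify(node, POLLOUT);     // nobody is listening any more
notify(undefined, POLLIN); // tolerated: no node, nothing to wake
```

Keeping the set on the node means every fd that resolves to the same node shares one queue, which is the property the description relies on.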

Per standard epoll semantics, epoll_ctl ADD installs a new listener on the watched node. On each ready transition, that listener appends the registration to the epoll's ready list and wakes any waiter. If the fd is already ready when it is added, the registration goes onto the ready list immediately. epoll_wait then consumes the ready list, re-checking each item against its current mask via pollOne.
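The flow above can be sketched with a toy model. This is a minimal sketch under stated assumptions, not the PR's implementation: `ToyEpoll`, `makeSource` and `post` are hypothetical names, the ready list is a plain array, and every registration is level-triggered.

```javascript
// Toy model of ADD + wait. Hypothetical names; array-backed ready list.
const EPOLLIN = 0x001;

class ToyEpoll {
  constructor() {
    this.ready = []; // registrations that may be ready
  }
  // `source` has `level` (current readiness bits) and a `listeners` Set.
  add(source, events) {
    const reg = { source, events, listed: false };
    // The listener only appends the registration to the ready list.
    source.listeners.add(() => this.enqueue(reg));
    // Sample the current level once so an already-ready source is reported.
    if (source.level & events) this.enqueue(reg);
    return reg;
  }
  enqueue(reg) {
    if (reg.listed) return; // never list a registration twice
    reg.listed = true;
    this.ready.push(reg);
  }
  wait() {
    const out = [];
    const pending = this.ready;
    this.ready = [];
    for (const reg of pending) {
      reg.listed = false;
      const revents = reg.source.level & reg.events; // re-check current level
      if (!revents) continue; // no longer ready: drop the stale entry
      out.push({ source: reg.source, revents });
      this.enqueue(reg); // level-triggered: stays listed while ready
    }
    return out;
  }
}

function makeSource(level) {
  return { level, listeners: new Set() };
}
function post(source, flags) {
  source.level |= flags;
  for (const listener of source.listeners) listener(flags);
}

const ep = new ToyEpoll();
const a = makeSource(EPOLLIN); // already readable when added
const b = makeSource(0);
ep.add(a, EPOLLIN);
ep.add(b, EPOLLIN);
const first = ep.wait();  // reports a only
post(b, EPOLLIN);
const second = ep.wait(); // reports a and b
a.level = 0;
b.level = 0;
const third = ep.wait();  // reports nothing; stale entries are dropped
```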

  • EPOLLONESHOT clears the listener once it has fired, to avoid unnecessary callback firing. EPOLL_CTL_MOD re-arms it.
  • EPOLLET does not re-fire items that merely remain ready; they are reported again only on the next edge.
  • EPOLLEXCLUSIVE is passed through to listeners so that, when several epolls watch the same fd, only one is woken, avoiding the "thundering herd".
  • When more than maxevents items are ready, draining follows Linux-like round-robin semantics. To achieve this without losing performance, the ready list is a doubly-linked list of registrations. A simpler set or array with copying could be used instead if we don't want this approach.
  • Registrations key on the open file description (the dup-shared stream state), matching Linux: closing a watched fd and reusing its number for a different open does not resurrect the registration onto the new fd.
  • Active listeners on an epoll maintain keepalive, as long as the epoll's set of descriptors is non-terminal.
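To make the interaction of level, edge, oneshot and maxevents concrete, here is a small sketch of a drain step. It uses the simpler array alternative mentioned above rather than a linked list, and `drain` plus the registration fields (`name`, `level`, `armed`) are hypothetical. The flag values are the standard Linux ones.

```javascript
const EPOLLIN = 0x001;
const EPOLLONESHOT = 1 << 30;
const EPOLLET = 1 << 31;

// `ready` holds registrations in ready-list order (head first).
// Each registration is { name, events, level, armed } (hypothetical shape).
function drain(ready, maxevents) {
  const out = [];
  // Visit only entries that were listed on entry, so re-listed
  // level-triggered items are not reported twice in one call.
  let n = ready.length;
  while (n-- > 0 && out.length < maxevents) {
    const reg = ready.shift();
    const revents = reg.level & reg.events & ~(EPOLLET | EPOLLONESHOT);
    if (!reg.armed || !revents) continue; // disarmed or stale: drop
    out.push({ name: reg.name, revents });
    if (reg.events & EPOLLONESHOT) {
      reg.armed = false; // stays quiet until EPOLL_CTL_MOD re-arms it
    } else if (!(reg.events & EPOLLET)) {
      ready.push(reg); // level-triggered: back to the tail
    }
    // Edge-triggered: leaves the list until the next edge.
  }
  return out;
}

const ready = [
  { name: "lt", events: EPOLLIN, level: EPOLLIN, armed: true },
  { name: "et", events: EPOLLIN | EPOLLET, level: EPOLLIN, armed: true },
  { name: "once", events: EPOLLIN | EPOLLONESHOT, level: EPOLLIN, armed: true },
];
const r1 = drain(ready, 2).map((e) => e.name); // ["lt", "et"]
const r2 = drain(ready, 2).map((e) => e.name); // ["once", "lt"]
const r3 = drain(ready, 2).map((e) => e.name); // ["lt"]
```

Entries beyond maxevents stay at the head, so the next call serves them before anything that was re-listed, which is what gives the round-robin behaviour in this model.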

To support JS callbacks without JSPI/threads, a new emscripten_epoll_set_callback is implemented. It is included here to verify that it integrates with all of the implemented epoll semantics, but could be split out into a separate PR if necessary. It registers a persistent consumer on the same ready list as the epoll: the runtime delivers the ready set to the callback on each progress, as if it were responding to an epoll_wait, but on the next tick after the stack has exited, with no blocking and no ASYNCIFY/JSPI. It is armed once for the entire epoll, then re-fires on the next tick while the set stays ready (so level-triggered and overflow items drain as a blocking epoll_wait loop would). There is at most one callback per epoll (a second call replaces it; a NULL callback unregisters). Integration with ready-list semantics works out naturally, as the callback is just another consumer of the ready list. EPOLLET / EPOLLONESHOT / EPOLLEXCLUSIVE / maxevents all apply to this callback design, so a single callback fully integrates with normal epoll semantics.
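The delivery model can be sketched in plain JavaScript. This is only a model under stated assumptions: `ToyLoop` stands in for the real event loop with a hand-cranked `tick()`, `CallbackEpoll` and its methods are hypothetical names, all sources are level-triggered, and the real C signature of emscripten_epoll_set_callback is not shown because it is not given here.

```javascript
// Hand-cranked event loop so the example is deterministic.
class ToyLoop {
  constructor() { this.tasks = []; }
  defer(fn) { this.tasks.push(fn); }
  // Run the tasks queued so far; tasks queued meanwhile wait for the next tick.
  tick() {
    const batch = this.tasks;
    this.tasks = [];
    for (const fn of batch) fn();
  }
}

class CallbackEpoll {
  constructor(loop, maxevents) {
    this.loop = loop;
    this.maxevents = maxevents;
    this.ready = [];      // level-triggered sources: { name, isReady }
    this.callback = null;
    this.scheduled = false;
  }
  setCallback(cb) {       // a second call replaces; null unregisters
    this.callback = cb;
    this.schedule();
  }
  markReady(source) {
    if (!this.ready.includes(source)) this.ready.push(source);
    this.schedule();
  }
  schedule() {
    if (this.scheduled || !this.callback || this.ready.length === 0) return;
    this.scheduled = true;
    this.loop.defer(() => this.deliver());
  }
  deliver() {
    this.scheduled = false;
    if (!this.callback) return;
    const batch = [];
    let n = this.ready.length;
    while (n-- > 0 && batch.length < this.maxevents) {
      const src = this.ready.shift();
      if (!src.isReady) continue; // stale entry: drop
      batch.push(src.name);
      this.ready.push(src);       // level-triggered: stay listed
    }
    if (batch.length) this.callback(batch);
    this.schedule();              // re-fire next tick while anything is listed
  }
}

const loop = new ToyLoop();
const ep = new CallbackEpoll(loop, 1);
const calls = [];
ep.setCallback((names) => calls.push(names));
const a = { name: "a", isReady: true };
const b = { name: "b", isReady: true };
ep.markReady(a);
ep.markReady(b);
const before = calls.length; // 0: nothing is delivered synchronously
loop.tick();                 // delivers ["a"]
loop.tick();                 // delivers ["b"] (overflow drains on the next tick)
a.isReady = false;
b.isReady = false;
loop.tick();                 // both entries are stale and dropped
loop.tick();                 // nothing left to schedule
```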

Most of the diff is tests covering these semantics in depth, including error handling, level versus edge reporting, nesting and ELOOP, fd-close auto-removal, JSPI and pthreads, real sockets, and deregistration. For emscripten_epoll_set_callback, comprehensive tests cover running it in parallel with a blocking JSPI epoll_wait, verifying that both deterministically drain the same ready list: a wait and a callback on one epoll take disjoint slices rather than each seeing private or overlapping copies.

Minor semantic divergences to note:

  • epoll_pwait ignores sigmask
  • epoll_create1 ignores EPOLL_CLOEXEC
  • nesting is capped at 5 levels
  • epoll_event under Wasm in Musl is laid out as an aligned 16 bytes rather than x86-64's packed 12 bytes.
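For JS code that reads or writes epoll_event records in linear memory, the layout difference matters. The sketch below assumes the field offsets that follow from natural alignment of a 32-bit events field and a 64-bit data field (data at offset 8) for the 16-byte form, and data at offset 4 for the packed 12-byte form. The offsets and helper names are assumptions for illustration and are not taken from the PR.

```javascript
// Assumed layouts: u32 `events` followed by u64 `data`.
const LAYOUT_ALIGNED = { size: 16, events: 0, data: 8 };
const LAYOUT_PACKED = { size: 12, events: 0, data: 4 };

function writeEvent(view, index, layout, events, data) {
  const base = index * layout.size;
  view.setUint32(base + layout.events, events, true); // little-endian
  view.setBigUint64(base + layout.data, data, true);
}

function readEvent(view, index, layout) {
  const base = index * layout.size;
  return {
    events: view.getUint32(base + layout.events, true),
    data: view.getBigUint64(base + layout.data, true),
  };
}

const buf = new DataView(new ArrayBuffer(2 * LAYOUT_ALIGNED.size));
writeEvent(buf, 1, LAYOUT_ALIGNED, 0x001, 42n);
const ev = readEvent(buf, 1, LAYOUT_ALIGNED);
// Reading the same bytes with the packed layout lands on the wrong offsets.
const wrong = readEvent(buf, 1, LAYOUT_PACKED);
```

In this example `ev` round-trips to events 1 and data 42n, while `wrong` decodes to events 0 and data 1n, which is the kind of silent corruption a hard-coded 12-byte stride would cause.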

PR made with AI assistance, under my review

@sbc100 sbc100 left a comment

Collaborator

I think I like this direction.

I've not had time to look at all the details yet, but it seems like a great idea to unify the node events like this.

Comment thread src/lib/libsockfs.js Outdated
'error': {{{ cDefs.POLLERR }}},
}[event];
// 'listen' has no readiness mapping; skip it.
if (flags) notifyNodeListeners(FS.getStream(fd)?.node, flags);

Collaborator

What happens if FS.getStream(fd) is undefined? i.e. what happens when notifyNodeListeners gets an undefined node?

Collaborator Author

This happens e.g. for stdio streams which don't have a Node (currently!).

Comment thread src/lib/libsyscall.js Outdated
if (!node?.listeners) return;
// Fire every non-exclusive listener. Among EPOLLEXCLUSIVE registrations (one
// fd watched by several epolls) wake only one, rotating round-robin per node,
// to avoid a thundering herd. (Only epoll registrations are ever exclusive;

Collaborator

I guess the idea here is that you could have N threads all waiting on M sockets in EPOLLEXCLUSIVE mode, and the kernel would then pick just one thread to wake? Does the Linux kernel also do this round-robin thing?

Collaborator Author

This is specific to our implementation, actually. Linux has looser guarantees, I believe: only that one or more exclusive waiters are woken, and apparently LIFO registration order is common. This is definitely an area we could tune further.
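A minimal sketch of the rotation being discussed, assuming a per-node cursor: every non-exclusive listener fires, and exactly one exclusive listener is picked per wakeup. `makeNode`, `wake` and the `nextExclusive` field are hypothetical names, not the PR's.

```javascript
// Hypothetical model of one node's wakeup with EPOLLEXCLUSIVE listeners.
function makeNode() {
  return { listeners: [], nextExclusive: 0 };
}

function wake(node, flags) {
  const exclusive = [];
  for (const l of node.listeners) {
    if (l.exclusive) exclusive.push(l);
    else l.fn(flags); // every non-exclusive listener fires
  }
  if (exclusive.length === 0) return;
  // Wake a single exclusive listener and advance the per-node cursor.
  const pick = exclusive[node.nextExclusive % exclusive.length];
  node.nextExclusive = (node.nextExclusive + 1) % exclusive.length;
  pick.fn(flags);
}

const node = makeNode();
const log = [];
node.listeners.push({ exclusive: false, fn: () => log.push("plain") });
node.listeners.push({ exclusive: true, fn: () => log.push("exA") });
node.listeners.push({ exclusive: true, fn: () => log.push("exB") });
wake(node, 0x001); // plain, exA
wake(node, 0x001); // plain, exB
wake(node, 0x001); // plain, exA
```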

Comment thread src/lib/libsockfs_node.js Outdated
sock.pending.push(newsock);
SOCKFS.emit('connection', newsock.stream.fd);
// A queued client makes the listening socket readable (POLLIN).
notifyNodeListeners(sock.stream.node, {{{ cDefs.POLLRDNORM }}} | {{{ cDefs.POLLIN }}});

Collaborator

Why is POLLRDNORM included here? (I have to admit I've not seen this before. The man page says "Equivalent to POLLIN".)

Collaborator Author

Semantically equivalent but POLLIN=0x001, POLLRDNORM=0x040. A caller may mask for POLLRDNORM, so having both allows both to be matched.
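A short illustration of the masking point, using the two values quoted above; `match` is a hypothetical helper that returns the posted bits a caller asked for.

```javascript
const POLLIN = 0x001;
const POLLRDNORM = 0x040;

// Returns the bits of `posted` that the caller asked for in `wanted`.
const match = (posted, wanted) => posted & wanted;

// Posting POLLIN alone: a caller masking for POLLRDNORM sees nothing.
const onlyIn = match(POLLIN, POLLRDNORM);             // 0
// Posting both: either mask matches.
const both = match(POLLIN | POLLRDNORM, POLLRDNORM);  // 0x040
const classic = match(POLLIN | POLLRDNORM, POLLIN);   // 0x001
```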

Comment thread src/lib/libpipefs.js
rNode.pipe = pipe;
wNode.pipe = pipe;
// The read end's node carries the poll wait-queue; writes wake it.
pipe.readNode = rNode;

Collaborator

Do we never need to notify the write node? I guess we just always accept new writes and buffer them?

Collaborator Author

Currently not, because the current implementation has no capacity limit on the write side, so the write end is always writable. This could be added in future.

Comment thread src/lib/libsyscall.js Outdated
Adds epoll_create1/epoll_ctl/epoll_wait/epoll_pwait and a non-blocking
JS-callback variant, emscripten_epoll_set_callback, on a single fd
readiness model shared with poll().

Readiness is source-based: producers (sockets, pipes) post edges to a
wait-queue on the FS node, which dup'd fds share. An epoll instance is a
real FS fd whose stream holds an interest map (fd -> registration) and a
ready list. epoll_ctl ADD arms a persistent listener on the watched
node - the registration's edge in the interest graph; on an edge the
listener appends the registration to the epoll's ready list (Linux's
rdllist) and wakes any waiter. Because a source-based model only learns
readiness from edges, epoll_ctl ADD/MOD also samples the current level
once, so an fd already ready when watched is reported with no further
event needed.

A wait consumes the ready list (Linux's ep_send_events): each listed
registration is re-derived against its current mask; level-triggered
ones still ready are re-listed at the tail, edge-triggered ones leave
until the next edge, and a no-longer-ready (spurious) edge is dropped. A
fired EPOLLONESHOT drops its watched-node listener until EPOLL_CTL_MOD
re-arms it, so a dead edge carries no traffic. The ready list is an
intrusive doubly-linked list, so draining is O(ready) rather than
O(registered), and the remainder past maxevents is rotated to the front
for round-robin fairness.
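A sketch of what "intrusive" buys here, assuming a circular list with a sentinel head: the registration objects carry their own prev/next links, so append and removal are O(1) with no scan or copy. `ReadyList` and its methods are hypothetical names for illustration.

```javascript
// Circular doubly-linked list with a sentinel head.
// Registrations carry their own `prev`/`next` links (intrusive).
class ReadyList {
  constructor() {
    this.head = { prev: null, next: null };
    this.head.prev = this.head.next = this.head;
    this.size = 0;
  }
  isListed(reg) {
    return reg.next != null;
  }
  pushTail(reg) {
    if (this.isListed(reg)) return; // already on the list
    reg.prev = this.head.prev;
    reg.next = this.head;
    this.head.prev.next = reg;
    this.head.prev = reg;
    this.size++;
  }
  remove(reg) { // O(1): no search needed
    if (!this.isListed(reg)) return;
    reg.prev.next = reg.next;
    reg.next.prev = reg.prev;
    reg.prev = reg.next = null;
    this.size--;
  }
  popHead() {
    const reg = this.head.next;
    if (reg === this.head) return null; // empty
    this.remove(reg);
    return reg;
  }
}

const list = new ReadyList();
const a = { name: "a" };
const b = { name: "b" };
const c = { name: "c" };
list.pushTail(a);
list.pushTail(b);
list.pushTail(c);
list.pushTail(a);             // duplicate is ignored
list.remove(b);               // e.g. a deregistration from the middle
const first = list.popHead(); // a
list.pushTail(first);         // re-list at the tail
const order = [];
for (let r = list.popHead(); r; r = list.popHead()) order.push(r.name);
// order is ["c", "a"]
```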

emscripten_epoll_set_callback registers a persistent consumer on that
same ready list: the runtime delivers the ready set to the callback on
each progress, with no blocking and no ASYNCIFY/JSPI. It is armed once
(not per spin), re-fires on the next tick while the set stays ready (so
level and overflow drain as a blocking epoll_wait loop would), and there
is at most one callback per epoll (a second call replaces it; a NULL
callback unregisters). Per-fd EPOLLET/EPOLLONESHOT apply unchanged, so a
single callback can mix level/edge/oneshot fds. A blocking epoll_wait
(under PROXY_TO_PTHREAD, ASYNCIFY, or JSPI) consumes the same ready list,
so a wait and a callback on one epoll take disjoint slices rather than
each seeing a private copy. The callback is delivered on the main thread's
event loop (under PROXY_TO_PTHREAD use a blocking epoll_wait instead), and
keeps the runtime alive only while the set can still fire: once every
watched fd is closed the set is terminal and the keepalive is dropped, so
no explicit disposal is required (closing the epoll or passing a NULL
callback also disposes of it).

Registrations key on the open file description (the dup-shared stream
state), matching Linux: closing a watched fd and reusing its number for a
different open does not resurrect the registration onto the new fd. A
close (socket, pipe, or a nested epoll) notifies its node, so the watching
epoll promptly re-derives and drops the registration - the analog of
Linux's eventpoll_release_file walking the watched file's epitem list.

Only sockets and pipes derive real readiness; every other stream type
(regular files across MEMFS/NODEFS/NODERAWFS, devices, ttys) has no poll
handler and is treated as always readable+writable, so epoll_ctl rejects
it with EPERM. This also fixes poll() crashing on a NODERAWFS regular
file, whose stream carries no stream_ops at all.
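The rule can be sketched as a guard on the stream's poll handler. This is a hypothetical model: `hasRealReadiness`, `ctlAdd` and `pollStream` are illustrative names, error codes are shown as strings rather than errno values, and the stream shapes are invented for the example.

```javascript
const POLLIN = 0x001;
const POLLOUT = 0x004;

function hasRealReadiness(stream) {
  // Optional chaining covers streams that carry no stream_ops at all.
  return typeof stream?.stream_ops?.poll === "function";
}

// Hypothetical ADD path: only pollable streams may be registered.
function ctlAdd(interest, fd, stream) {
  if (!stream) return "EBADF";
  if (!hasRealReadiness(stream)) return "EPERM";
  if (interest.has(fd)) return "EEXIST";
  interest.set(fd, { stream });
  return 0;
}

// Hypothetical poll path: no handler means "always readable and writable".
function pollStream(stream, events) {
  if (!hasRealReadiness(stream)) return events & (POLLIN | POLLOUT);
  return stream.stream_ops.poll(stream) & events;
}

const interest = new Map();
const socketLike = { stream_ops: { poll: () => POLLIN } };
const regularFile = { stream_ops: {} };
const rawFile = {}; // no stream_ops at all
const r1 = ctlAdd(interest, 3, socketLike);     // 0
const r2 = ctlAdd(interest, 3, socketLike);     // "EEXIST"
const r3 = ctlAdd(interest, 4, regularFile);    // "EPERM"
const r4 = ctlAdd(interest, 5, rawFile);        // "EPERM"
const p1 = pollStream(rawFile, POLLIN | POLLOUT);    // POLLIN | POLLOUT, no crash
const p2 = pollStream(socketLike, POLLIN | POLLOUT); // POLLIN
```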

EPOLLEXCLUSIVE distributes its single wakeup across multiple epolls
watching one fd (round-robin), which suppresses the thundering herd for
that case; suppressing it across multiple waiters on a single epoll is out
of scope (one instance, and they already share the ready list).

Known limitations: WASMFS epoll is out of scope (link error); ttys are
not pollable (no poll handler), unlike Linux; and eviction of a closed
watched fd is keyed on the fd number, so (unlike Linux) a dup that keeps
the underlying description alive does not preserve the registration.
@sbc100 sbc100 changed the title epoll implementation for the JS filesystem Implement epoll APIs in the JS filesystem Jun 27, 2026