Implement epoll APIs in the JS filesystem#27207
sbc100
left a comment
I think I like this direction.
I've not had time to look at all the details yet, but it seems like a great idea to unify the node events like this.
| 'error': {{{ cDefs.POLLERR }}}, | ||
| }[event]; | ||
| // 'listen' has no readiness mapping; skip it. | ||
| if (flags) notifyNodeListeners(FS.getStream(fd)?.node, flags); |
What happens if FS.getStream(fd) is undefined? i.e. what happens when notifyNodeListeners gets an undefined node?
This happens e.g. for stdio streams which don't have a Node (currently!).
| if (!node?.listeners) return; | ||
| // Fire every non-exclusive listener. Among EPOLLEXCLUSIVE registrations (one | ||
| // fd watched by several epolls) wake only one, rotating round-robin per node, | ||
| // to avoid a thundering herd. (Only epoll registrations are ever exclusive; |
I guess the idea here is that you could have N threads all waiting on M sockets in EPOLLEXCLUSIVE mode, and the kernel would then pick just one thread to wake? Does the Linux kernel also do this round-robin thing?
This is actually specific to our implementation. I believe Linux has looser guarantees: one or more exclusive waiters are woken, and apparently LIFO registration order is common. This is definitely an area we could tune further.
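For illustration, the round-robin wake discussed in this thread could be modeled roughly as below. This is a minimal sketch, not the PR's code: `makeNode`, `addListener`, `notifySketch`, and the `exclusive`/`rrNext` fields are hypothetical names, and the listener collection is a plain array for simplicity.

```javascript
// Hypothetical model of a node's wait-queue; all names here are illustrative.
function makeNode() {
  return { listeners: [], rrNext: 0 };
}

function addListener(node, callback, exclusive = false) {
  const listener = { callback, exclusive };
  node.listeners.push(listener);
  return listener;
}

function notifySketch(node, flags) {
  if (!node?.listeners) return; // e.g. a stream that has no node
  const exclusive = [];
  // Iterate over a copy so callbacks may add or remove listeners safely.
  for (const l of [...node.listeners]) {
    if (l.exclusive) exclusive.push(l);
    else l.callback(flags); // every non-exclusive listener fires
  }
  if (exclusive.length > 0) {
    // Wake exactly one exclusive listener, rotating per node.
    const chosen = exclusive[node.rrNext % exclusive.length];
    node.rrNext = (node.rrNext + 1) % exclusive.length;
    chosen.callback(flags);
  }
}

// Usage: one plain listener plus two exclusive ones on the same node.
const node = makeNode();
const woken = [];
addListener(node, () => woken.push('poll'));
addListener(node, () => woken.push('epollA'), true);
addListener(node, () => woken.push('epollB'), true);
notifySketch(node, 0x001);
notifySketch(node, 0x001);
// woken is now ['poll', 'epollA', 'poll', 'epollB']
```

Each notification reaches the plain listener, while the two exclusive listeners alternate. A LIFO policy, as mentioned for Linux, would only change how `chosen` is picked.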
| sock.pending.push(newsock); | ||
| SOCKFS.emit('connection', newsock.stream.fd); | ||
| // A queued client makes the listening socket readable (POLLIN). | ||
| notifyNodeListeners(sock.stream.node, {{{ cDefs.POLLRDNORM }}} | {{{ cDefs.POLLIN }}}); |
Why is POLLRDNORM included here? (I have to admit I've not seen this before. The man page says "Equivalent to POLLIN".)
Semantically equivalent but POLLIN=0x001, POLLRDNORM=0x040. A caller may mask for POLLRDNORM, so having both allows both to be matched.
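A small sketch of why both bits are reported, using the two values quoted above. The helper `callerSeesReadable` is a hypothetical stand-in for whatever mask check a caller performs.

```javascript
// Bit values as quoted in the reply above.
const POLLIN = 0x001;
const POLLRDNORM = 0x040;

// What a producer might post for a readable transition.
const reportBoth = POLLIN | POLLRDNORM;
const reportPollinOnly = POLLIN;

// Hypothetical caller-side check: does revents intersect the requested mask?
function callerSeesReadable(revents, mask) {
  return (revents & mask) !== 0;
}

// A caller masking for POLLRDNORM only matches when both bits are posted.
const matchedWithBoth = callerSeesReadable(reportBoth, POLLRDNORM); // true
const matchedWithPollinOnly = callerSeesReadable(reportPollinOnly, POLLRDNORM); // false
```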
| rNode.pipe = pipe; | ||
| wNode.pipe = pipe; | ||
| // The read end's node carries the poll wait-queue; writes wake it. | ||
| pipe.readNode = rNode; |
Do we never need to notify the write node? I guess we just always accept new writes and buffer them?
Currently not: the current implementation has no capacity limit, so the write end is always writable. A limit could be added in future.
Adds epoll_create1/epoll_ctl/epoll_wait/epoll_pwait and a non-blocking JS-callback variant, emscripten_epoll_set_callback, on a single fd readiness model shared with poll().

Readiness is source-based: producers (sockets, pipes) post edges to a wait-queue on the FS node, which dup'd fds share. An epoll instance is a real FS fd whose stream holds an interest map (fd -> registration) and a ready list. epoll_ctl ADD arms a persistent listener on the watched node - the registration's edge in the interest graph; on an edge the listener appends the registration to the epoll's ready list (Linux's rdllist) and wakes any waiter. Because a source-based model only learns readiness from edges, epoll_ctl ADD/MOD also samples the current level once, so an fd already ready when watched is reported with no further event needed.

A wait consumes the ready list (Linux's ep_send_events): each listed registration is re-derived against its current mask; level-triggered ones still ready are re-listed at the tail, edge-triggered ones leave until the next edge, and a no-longer-ready (spurious) edge is dropped. A fired EPOLLONESHOT drops its watched-node listener until EPOLL_CTL_MOD re-arms it, so a dead edge carries no traffic. The ready list is an intrusive doubly-linked list, so draining is O(ready) rather than O(registered), and the remainder past maxevents is rotated to the front for round-robin fairness.

emscripten_epoll_set_callback registers a persistent consumer on that same ready list: the runtime delivers the ready set to the callback on each progress, with no blocking and no ASYNCIFY/JSPI. It is armed once (not per spin), re-fires on the next tick while the set stays ready (so level and overflow drain as a blocking epoll_wait loop would), and there is at most one callback per epoll (a second call replaces it; a NULL callback unregisters). Per-fd EPOLLET/EPOLLONESHOT apply unchanged, so a single callback can mix level/edge/oneshot fds.
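The ready-list drain described in this commit message can be modeled roughly as follows. This is a sketch under stated assumptions: it uses a plain array instead of the intrusive doubly-linked list, and `makeEpoll`, `markReady`, `drain`, and the registration fields (`edgeTriggered`, `isReady`, `queued`) are hypothetical names rather than the PR's identifiers.

```javascript
// Hypothetical epoll model: only the ready list is represented.
function makeEpoll() {
  return { ready: [] };
}

// Called by a watched node's listener on an edge; a registration is listed once.
function markReady(ep, reg) {
  if (!reg.queued) {
    reg.queued = true;
    ep.ready.push(reg);
  }
}

// Consume up to maxevents entries from the ready list and return their fds.
function drain(ep, maxevents) {
  const pending = ep.ready;
  const out = [];
  const relist = [];
  let i = 0;
  for (; i < pending.length && out.length < maxevents; i++) {
    const reg = pending[i];
    reg.queued = false;
    if (!reg.isReady()) continue; // spurious edge: dropped
    out.push(reg.fd);
    // Level-triggered entries that are still ready go back on the list;
    // edge-triggered entries leave until the next edge.
    if (!reg.edgeTriggered) relist.push(reg);
  }
  const remainder = pending.slice(i); // entries past maxevents stay queued
  for (const reg of relist) reg.queued = true;
  // Remainder moves to the front, re-listed entries go to the tail.
  ep.ready = [...remainder, ...relist];
  return out;
}

// Usage: fd 3 and fd 6 are level-triggered and ready, fd 4 is edge-triggered
// and ready, fd 5 was listed but is no longer ready.
const ep = makeEpoll();
const regs = [
  { fd: 3, edgeTriggered: false, isReady: () => true },
  { fd: 4, edgeTriggered: true, isReady: () => true },
  { fd: 5, edgeTriggered: false, isReady: () => false },
  { fd: 6, edgeTriggered: false, isReady: () => true },
];
for (const reg of regs) markReady(ep, reg);
const first = drain(ep, 2); // [3, 4]
const second = drain(ep, 2); // [6, 3]
```

With `maxevents` of 2, the second drain starts from the entries the first one did not reach, so fd 6 is reported before fd 3 comes around again. The array version copies on each drain, which is the cost the linked list avoids.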
A blocking epoll_wait (under PROXY_TO_PTHREAD, ASYNCIFY, or JSPI) consumes the same ready list, so a wait and a callback on one epoll take disjoint slices rather than each seeing a private copy. The callback is delivered on the main thread's event loop (under PROXY_TO_PTHREAD use a blocking epoll_wait instead), and keeps the runtime alive only while the set can still fire: once every watched fd is closed the set is terminal and the keepalive is dropped, so no explicit disposal is required (closing the epoll or passing a NULL callback also disposes of it).

Registrations key on the open file description (the dup-shared stream state), matching Linux: closing a watched fd and reusing its number for a different open does not resurrect the registration onto the new fd. A close (socket, pipe, or a nested epoll) notifies its node, so the watching epoll promptly re-derives and drops the registration - the analog of Linux's eventpoll_release_file walking the watched file's epitem list.

Only sockets and pipes derive real readiness; every other stream type (regular files across MEMFS/NODEFS/NODERAWFS, devices, ttys) has no poll handler and is treated as always readable+writable, so epoll_ctl rejects it with EPERM. This also fixes poll() crashing on a NODERAWFS regular file, whose stream carries no stream_ops at all.

EPOLLEXCLUSIVE distributes its single wakeup across multiple epolls watching one fd (round-robin), which suppresses the thundering herd for that case; suppressing it across multiple waiters on a single epoll is out of scope (one instance, and they already share the ready list).

Known limitations: WASMFS epoll is out of scope (link error); ttys are not pollable (no poll handler), unlike Linux; and eviction of a closed watched fd is keyed on the fd number, so (unlike Linux) a dup that keeps the underlying description alive does not preserve the registration.
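The pollability rule in this commit message (streams without a poll handler are always ready for poll() but rejected by epoll_ctl) might look roughly like this. This is a sketch only: `hasPollHandler`, `pollOneSketch`, and `epollCtlAddSketch` are hypothetical helpers, the error is returned as the string 'EPERM' rather than a real errno value, and error bits such as POLLERR are left out for brevity.

```javascript
const POLLIN = 0x001;
const POLLOUT = 0x004;

// Optional chaining covers streams with no stream_ops at all.
function hasPollHandler(stream) {
  return typeof stream?.stream_ops?.poll === 'function';
}

// poll() side: no handler means "always readable and writable".
function pollOneSketch(stream, events) {
  if (!hasPollHandler(stream)) return events & (POLLIN | POLLOUT);
  return stream.stream_ops.poll(stream) & events;
}

// epoll side: a stream that can never post an edge is rejected.
function epollCtlAddSketch(interest, fd, stream) {
  if (!hasPollHandler(stream)) return 'EPERM';
  interest.set(fd, { stream });
  return 0;
}

// Usage: a pipe-like stream with a poll handler, and a raw file without stream_ops.
const pipeLike = { stream_ops: { poll: () => POLLIN } };
const rawFile = {};
const interest = new Map();
const addPipe = epollCtlAddSketch(interest, 3, pipeLike); // 0
const addFile = epollCtlAddSketch(interest, 4, rawFile); // 'EPERM'
const fileReadiness = pollOneSketch(rawFile, POLLIN | POLLOUT); // both bits set, no crash
```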
Updated version of #27201, based on #27206. Also includes an integrated callback model for #27181, to fully verify the unified wait/callback approach on epoll semantics.
Resolves #5033, #10556.
Adds epoll_create1, epoll_ctl, epoll_wait, epoll_pwait and a non-blocking JS-callback variant, emscripten_epoll_set_callback, on a single fd readiness model shared with poll().

Readiness in the JS FS system is already event-driven in SOCKFS and PIPEFS. The integration point is the per-inode wait-queue, with each FS node carrying a listeners set. Producers then call notifyNodeListeners(node, flags) on ready transitions. There is no separate or parallel readiness machinery - it integrates directly with the existing model. pollOne(fd, events) is reused on the same readiness definition.

Per standard epoll semantics -

- epoll_ctl ADD installs a new listener on the watched node. If items are already ready they are added to the ready list. That listener then appends the registration to the epoll's ready list for waking.
- epoll_wait consumes the ready list, re-checking each item against its current mask via pollOne.
- EPOLLONESHOT clears listeners to avoid unnecessary callback firing. EPOLL_CTL_MOD can then re-arm them.
- EPOLLET is implemented correctly to avoid refiring items that remain ready.
- EPOLLEXCLUSIVE is passed through to listeners, allowing only one wake among multiple epoll listeners to avoid the "thundering herd".
- maxevents: draining follows Linux-like semantics in supporting round-robin reporting of ready items. To achieve this without losing performance, a doubly-linked list is used for the registrations. A simpler set / array with copying could be used alternatively if we don't want to use this approach.

To support JS callbacks without JSPI/threads, a new emscripten_epoll_set_callback is implemented. This was implemented here to verify its comprehensive integration with all of the implemented epoll semantics, but could also be split out into a separate PR if necessary. It allows registering a persistent consumer on the same ready list as the epoll - the runtime delivers the ready set to the callback on each progress as if it were responding to an epoll_wait, but on the next tick after exiting the stack, with no blocking and no ASYNCIFY/JSPI. It is armed once for the entire epoll, then consistently re-fires on the next tick while the set stays ready (so level and overflow drain as a blocking epoll_wait loop would). There is at most one callback per epoll (a second call replaces it; a NULL callback unregisters). Full integration with ready-list semantics works out naturally, as it is just another consumer of the ready list. EPOLLET/EPOLLONESHOT/EPOLLEXCLUSIVE/maxevents all apply to this callback design, so a single callback can fully integrate with normal epoll semantics.

Most of the diff is tests, covering these semantics in depth including error handling, level versus edge reporting, nesting and ELOOP, fd-close auto-removal, JSPI and pthreads, real sockets, and deregistration. For emscripten_epoll_set_callback, comprehensive tests are added for integration with a JSPI blocking epoll_wait in parallel, verifying that both deterministically drain the same ready list: a wait and a callback on one epoll take disjoint slices rather than each seeing private or overlapping copies.

Minor semantic divergences to note:

- epoll_pwait ignores sigmask
- epoll_create1 ignores EPOLL_CLOEXEC
- epoll_event under Wasm in Musl is laid out as aligned 16 rather than x86-64's packed 12 bytes.

PR made with AI assistance, under my review