feat(telemetry)!: make telemetry worker wasm-compatible for the TraceExporter#2172
feat(telemetry)!: make telemetry worker wasm-compatible for the TraceExporter#2172Aaalibaba42 wants to merge 4 commits into
Conversation
1624706 to
9042d85
Compare
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: 2f63fe1 | Docs | Datadog PR Page | Give us feedback! |
📚 Documentation Check Results📦
|
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
🔒 Cargo Deny Results📦
|
9042d85 to
6743d39
Compare
| let resp = select! { | ||
| biased; | ||
| result = client.request(req) => result?, | ||
| _ = sleeper.sleep(timeout) => { | ||
| return Err(anyhow::anyhow!("Telemetry request timed out")); | ||
| } | ||
| }; |
There was a problem hiding this comment.
Not sure how common this pattern will be but we could have a helper function client.request_with_timeout(req, timeout)
There was a problem hiding this comment.
I'm a bit reluctant to change the shape of the client Trait directly, but I wouldn't mind a helper function in http.rs, but then it wouldn't be "pinned" on a capability but would just be generic over it, so caller would have to specify the capability again. I'm not sure what's the best here
There was a problem hiding this comment.
I'm thinking either a method with a default implementation on the trait (so that if there is a simpler/more efficient way to do it in some implementations it can be overridden), or maybe an extension trait to keep it the main trait simple.
0d19b82 to
ca34406
Compare
ca34406 to
800823d
Compare
VianneyRuhlmann
left a comment
There was a problem hiding this comment.
The change in shutdown_worker really needs to be discussed further as it can negatively impact shutdown time. Also a lot of comment are very verbose and at wasm specific explanation on public methods doc that shouldn't be cluttered with implementation details.
| /// [`TelemetryWorkerHandle::wait_for_shutdown_async`]) — `Notify` is pure | ||
| /// atomics and works on wasm too. `shutdown_finished` triggers both so either | ||
| /// path observes the same event. |
There was a problem hiding this comment.
I wonder wether this is necessary, wait_for_shutdow is only used when the worker is spawned on it's own thread not as a SharedRuntime worker. Since spawn is not available on wasm we can platform-gate shutdown
| // Runtime-agnostic timeout: race the mailbox against a capability-driven sleep | ||
| // instead of `tokio::time::timeout_at`, which requires a tokio reactor (not | ||
| // available on wasm where we run on the JS event loop). |
There was a problem hiding this comment.
You don't need to re-explain the capabilities every time the old comment is fine
| let sleeper = <C as SleepCapability>::new(); | ||
| tokio::select! { | ||
| biased; | ||
| mailbox_action = self.mailbox.recv() => mailbox_action, | ||
| _ = sleeper.sleep(remaining) => Some(TelemetryActions::Lifecycle(deadline_action)), |
There was a problem hiding this comment.
Not specifically for this PR but should we add a helper for timeout to the sleeper capability ?
| /// Capability-driven sleep so the same code path works on native (tokio | ||
| /// reactor) and wasm (JS `setTimeout` via [`SleepCapability`]). The spawn | ||
| /// mechanism is platform-specific but the public API and observable | ||
| /// behavior are identical. |
There was a problem hiding this comment.
No need to re-explicit the wasm specific behavior in all methods (especially public ones) this is an implementation detail
| /// Native-only convenience entry point. Wasm callers should construct via | ||
| /// [`Self::build_worker`] and hand the worker to a | ||
| /// [`SharedRuntime`](libdd_shared_runtime::SharedRuntime) like | ||
| /// `LocalRuntime`. |
There was a problem hiding this comment.
That seems a bit heavy, especially since the method is not available on wasm. It should be enough to document the build_worker method
|
|
||
| async fn shutdown_workers(self) { | ||
| // Sequential rather than the previous `JoinSet`-driven concurrent stop: | ||
| // there are at most ~3 workers and each stop is fast, so the parallelism |
There was a problem hiding this comment.
That's not true some stop require flushing to the agent which can be long especially if we have to retry. This is critical as this is delaying the tracer shutdown an thus the time to restart the customer app
| # Cross-platform `Instant`/`SystemTime`. On native it re-exports `std::time`; on | ||
| # wasm32-unknown-unknown it backs by `Performance.now()` and `Date.now()` via | ||
| # `js-sys`, which avoids the `time not implemented on this platform` panic that | ||
| # `std::time::Instant::now()` raises on that target. |
There was a problem hiding this comment.
No need to clutter the cargo.toml. Documenting web-time usage in rust code is enough imo, or maximum a single line comment
| // Runtime-agnostic timeout: race the request against a capability-driven sleep, | ||
| // mirroring `agent_info::fetcher::fetch_and_hash_response`. Same pattern, same | ||
| // reason: `tokio::time::sleep` needs a tokio reactor that wasm doesn't provide. |
There was a problem hiding this comment.
It's fine to not re-explain the purpose of capability at every call-site
| /// On wasm, cancellation deadlines are driven by `wasm_bindgen_futures::spawn_local` and | ||
| /// no runtime handle is required. |
There was a problem hiding this comment.
Wasm always uses the SharedRuntime right ? There should be any need to run on spawn_local
| // web_time::Instant is std::time::Instant on native and a Performance.now() | ||
| // backed shim on wasm32-unknown-unknown (std::time::Instant panics there). |
There was a problem hiding this comment.
nit: I don't think this needs to be called out at every import
We can always have the previous native implementation as is in a |
What?
Port the telemetry stack (
libdd-telemetry,libdd-data-pipeline) to compile and run onwasm32-unknown-unknown.TelemetryWorker,TelemetryWorkerHandle, andTelemetryClientbecome generic over a capability bundleC: HttpClientCapability + SleepCapability; native call sites pin toNativeCapabilities.Why?
Telemetry was previously hard-gated behind
#[cfg(not(target_arch = "wasm32"))]and was a no-op on that target for theTraceExporter.How?
tokio::timecall (sleep, timeout) is replaced by atokio::select!race against<C as SleepCapability>::new().sleep(…), which resolves tosetTimeouton wasm.std::timereplaced byweb-time(Performance.now()/Date.now()on wasm, re-exportsstd::timeon native).tokio::sync::Notify-based async path alongside the existing nativeCondvarsync path.MockClient/file-endpoint logic inhttp_client.rsis deleted and ported toNativeCapabilitiesinlibdd-capabilities-impl.Additional Notes
TelemetryClientBuilder::buildnow returnsResult(was panicking on missing fields) — Rust API break.TelemetryClient::startis now sync viatry_send_msg— Rust API break.typealias pinned toNativeCapabilities.