feat(library-config): otel process context reader#2176
Conversation
Add `read()`, `threadlocal_attribute_key_map()`, and `read_threadlocal_attribute_key_map()` to `otel_process_ctx::linux`, along with the `find_otel_mapping()` / `is_named_otel_mapping()` / `parse_mapping_start()`.
2887d0b to
cfe279d
Compare
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
📚 Documentation Check Results📦
|
🔒 Cargo Deny Results📦
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2887d0b569
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let payload = unsafe { std::slice::from_raw_parts(payload_ptr, payload_size as usize) }; | ||
| let context = ProcessContext::decode(payload)?; |
There was a problem hiding this comment.
Synchronize before decoding the publisher's payload
When read() runs concurrently with another thread calling publish()/update(), the timestamp check does not keep the old payload allocation alive: after the first published_at load, ProcessContextHandle::update can assign self.payload = payload, dropping the Vec that payload_ptr still references, and this code then builds a slice and lets prost read from a dangling pointer. The final timestamp comparison happens only after the unsafe read, so it cannot prevent a crash/UB; the same-process reader needs to coordinate with the publisher or otherwise copy from memory that cannot be freed during decode.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
yeah... This file seems to want to update some sort of seqlock, but there are several problems with it, one of these is this reclamation problem; a seqlock can protects only static region. Also a seqlock depends on a counter being odd to detect in-progress changes.
There was a problem hiding this comment.
Likely the solution involves in-process readers using the write lock.
There was a problem hiding this comment.
lso a seqlock depends on a counter being odd to detect in-progress changes.
I think the odd thing is because of multiple writers. If you have one writer only, and multiple reader, a monotonic value should be enough to detect torn reads (which I believe is the model for the process context).
one of these is this reclamation problem; a seqlock can protects only static region
I think I don't understand what you mean by that?
There was a problem hiding this comment.
I think the odd thing is because of multiple writers.
It's not, it's to detect that a modification is progress, so the read may be give torn/inconsistent data. Assume there is no reordering:
- update data
- ++seq
and the reader does
- read seq
- read data
- re-read seq to confirm it's the same
then the reader may read partially updated data and see the same seq.
But anyway, I was wrong about this, because the writer uses 0 to signal the update in progress.
one of these is this reclamation problem; a seqlock can protects only static region
What I mean is that the reader may fetch a pointer to the payload, and by the time it reads it, it's been freed already. This is a problem if the memory region has been unmapped (segfault).
There was a problem hiding this comment.
So I think the wrong publishing protocol was chosen for this. What the seqlock is protecting is a single pointer, so you could just update the pointer atomically with release-acquire (more or less, it also has a size, but that's not an obstacle because amd64/arm64 support 16-byte atomics, or the size could be moved to the beginning of the pointed to region). This doesn't eliminate the reclamation problem -- you'd still need something else to prevent in progress readers from segfaulting, but it does show that the seqlock buys nothing here.
There was a problem hiding this comment.
Ah, I see, thanks for the clarification for the odd counter part. I didn't remember that correctly.
I guess in the process context case the information is just split across two different fields.
What I mean is that the reader may fetch a pointer to the payload, and by the time it reads it, it's been freed already. This is a problem if the memory region has been unmapped (segfault).
@ivoanjo I don't remember all the details from the spec, but I think the model here is to map a page at init time and then never unmap it (beyond in-place update)? Of course you can't enforce that from the reader side, you have to rely on the writer to be compliant.
But it seems that process_vm_readv() returns an error, and doesn't cause SEGFAULT, for invalid/unmapped addresses (not clear from the doc, but so says my clanker). So I won't be worried about unmapping the context page.
However, if the payload is being modified, I think you're right. I don't see how you can avoid:
- reader see a consistent context and read it locally
- writer starts and completes an update by re-allocating the payload, causing the previous one to be dropped if heap-allocated, or overwritten if stored in the mapping.
- reader starts to dereference the payload ptr
This case might be even worse because you won't get EINVAL, since the memory will most likely still be mapped (even in the heap), just potential gibberish 🤔
There was a problem hiding this comment.
I guess one fix would be to restrict the reader to always use the same mapping for the payload, and include in the bracket as part of the "writing". The pointer would effectively be mostly useless, and the payload would be a static part (beside the length) of the whole context.
There was a problem hiding this comment.
I think the model here is to map a page at init time and then never unmap it (beyond in-place update)?
This is the case for the header. The problem is the payload.
But it seems that process_vm_readv() returns an error, and doesn't cause SEGFAULT, for invalid/unmapped addresses
Yes, out-of-process reads don't have this problem. A more difficult question, with process_vm_readv doing plain memcpy page-by-page and having the payload and header in different pages is whether they participate correctly in the protocol, which something that needs to be seen architecture-by-architecture, but probably it's not a big obstacle.
However, if the payload is being modified, I think you're right. I don't see how you can avoid:
The way this is usually avoided for in-process readers is to use specific reclamation strategies like RCU, hazard pointers, refcounting, etc... or use read-write locks.
I think that reading gibberish, even in-process, is not that bad, because it can be detected by the seqlock protocol (assuming you follow the pointer and read all your data before re-checking the sequence number). It's just the if the memory has been unmapped that it becomes a problem.
| name.starts_with("/memfd:OTEL_CTX") | ||
| || name.starts_with("[anon_shmem:OTEL_CTX]") | ||
| || name.starts_with("[anon:OTEL_CTX]") |
There was a problem hiding this comment.
Match only the exact OTEL_CTX mapping name
Because these checks use starts_with, a process that also has a mapping such as /memfd:OTEL_CTX_BACKUP or [anon:OTEL_CTX_old] can be selected before the real /memfd:OTEL_CTX entry, since /proc/self/maps is ordered by address rather than by mapping name. In that case read() will return an invalid-signature error (or read the wrong context) even though a valid OTEL context mapping exists later in the maps file; the name match should be exact, with only the kernel's separate (deleted) token handled as metadata.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Why would there be mappings with names like "/memfd:OTEL_CTX_BACKUP" or "[anon:OTEL_CTX_old]"? Yes, sure, it's technically true but... what are we supposed to do about this?
There was a problem hiding this comment.
Wasn't there a suggestion on adding several of these mappings for PHP to account for it processing several service/env? Ultimately we didn't do it, but I think that's indeed the idea of the spec, that you could have several.
This implements literally what the spec says: https://github.com/open-telemetry/opentelemetry-specification/blob/14137d0022f98bb59392f82255a70f3d9c2a7d4c/oteps/profiles/4719-process-ctx.md?plain=1#L164
Locate mapping: Parse
/proc/<pid>/mapsand search for entries with name starting with[anon_shmem:OTEL_CTX],[anon:OTEL_CTX]or/memfd:OTEL_CTX
though I think it should say [anon_shmem:OTEL_CTX and [anon:OTEL_CTX. Or maybe it just wanted to allow the (deleted) afterwards
There was a problem hiding this comment.
Yes, it's for the (deleted) stuff. It's not for multiple mappings to find.
There was a problem hiding this comment.
What @bwoebi said -- the process context is expected to be a singleton for a process, and anything that's not a singleton for a process should not go in there.
E.g. if a process is 1:1 to a service, then service name goes in there (for instance, this is the plan for dd-trace-rb); if a process is somehow multi-tenant and supports multiple services (e.g. dd-trace-php), then it shouldn't go in there.
It's not supported by the spec publishing several OTEL_CTX memory locations -- readers are only looking for one so they'll either error or ignore the others.
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: 9589dd3 | Docs | Datadog PR Page | Give us feedback! |
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
…:Result -> io::Result
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| fence(Ordering::SeqCst); | ||
| AtomicU64::from_ptr(addr_of_mut!((*header).monotonic_published_at_ns)) | ||
| (*header) | ||
| .monotonic_published_at_ns | ||
| .store(published_at_ns, Ordering::Relaxed); |
There was a problem hiding this comment.
I think the fence is unnecessary if I change the store to use Ordering::SeqCst, yeah? In the previous state, it did a SeqCst fence with a Relaxed store, I didn't change it, I just changed to use the AtomicU64 natively instead of through a ptr.
There was a problem hiding this comment.
I've read a bunch of material which helped me understand some other cases better but... not precisely this one. If anyone can educate me here, I'd appreciate understanding how they'd be different and which one is desired!
There was a problem hiding this comment.
Yes, and seqcst is even too strong, it only needs release. If you're interested, you can read this paper on synchronization for the seqlock algorithm.
But there is a problem that makes reasoning purely about the C++ memory model here insufficient, and that is that we need to support out-of-process readers with process_vm_readv. This requires understanding both a) what happens architecturally on the writer and b) how process_vm_readv (which I think does just memcpy page by page) and access_remote_vm work exactly.
In any case, we only need to support amd64 and arm64, so just a Release store should be enough.
There was a problem hiding this comment.
My suggestion is -- let's go with the strongest barrier we have. Why? Because the process context is not performance-sensitive, and the cost of getting this detail wrong is high.
So even though we need barriers here for correctness, I don't think we should be trying to think about them in terms of performance. If we had a lock that would use a much stronger atomic and nobody would be mad if we grab a log once an hour so that's the mental model to go for here ;)
-- my 0.02c :D
There was a problem hiding this comment.
I think the fence is unnecessary if I change the store to use Ordering::SeqCst, yeah? In the previous state, it did a SeqCst fence with a Relaxed store, I didn't change it, I just changed to use the AtomicU64 natively instead of through a ptr.
Yes, I believe SeqCst store is more constraining than a SeqCst fence + a relaxed store. The difference is rather technical, but it's that in the latter case, the store isn't part of the total sequential consistent order. For all other orderings, a fence with ordering O followed (resp. preceded by) a relaxed store (resp. a relaxed load) is strictly equivalent to a store (resp. a load) with ordering O. The main usage for fences is that you can only use one for several atomic operations at once, or that you can use them conditionally. Here there's no strong motivation other than following closely the spec.
Yes, and seqcst is even too strong, it only needs release. If you're interested, you can read this paper on synchronization for the seqlock algorithm.
Yup. We used SeqCst mostly because:
- the process context publication isn't performance-critical (happens once or not very often)
- since the whole seqlock-like thing is fishy in Rust, we upgraded to the strongest ordering.
But there is a problem that makes reasoning purely about the C++ memory model here insufficient, and that is that we need to support out-of-process readers with process_vm_readv. This requires understanding both a) what happens architecturally on the writer and b) how process_vm_readv (which I think does just memcpy page by page) and access_remote_vm work exactly.
Yes, I personally have no idea what is the memory semantics of process_vm_readv, and in any case it can't be considered by the C++ memory model, which doesn't model OS-level operations like these. So we're a bit in best-effort places.
There was a problem hiding this comment.
Though, process_vm_readv can't have stronger requirements than direct memory accesses, I believe. So if we're direct-memory-access-safe, I don't see how we wouldn't be syscall-read-memory-safe.
19c2470 to
d363e03
Compare
In this case, we could just make the pointer and the size atomic and access them with relaxed memory order. This would remove the technicality of the race being UB. But even without it, I don't think it's much of a problem. I'm pretty sure the exact same code would be generated. |
|
My original question still stands (as it's not written in the PR description - to be clear I haven't look at the PR code yet in details): what is the motivation for having a reader in libdatadog? |
I could be implemented somewhere else. I guess the same reason the writer is implemented in libdatadog -- to do it uniformly across tracers. And with the writer already implemented in it, and the reader being sort of implemented already for the tests, everything stays more cohesive. That said, I have no great objection to this being moved to dd-trace-php |
|
More precisely: You need a reader in (PHP) profiling to support using the profiler with another otel compatible tracer (other than our tracer) running in the same process and being able to access that tracers context data. libdatadog is the right place for this - it's not necessarily a pure PHP concern. |
|
In practice we're allocating a whole OS page to put this header in there, and the header is like 32 bytes. That leaves roughly 4,064 bytes (because we're talking about Linux here, and nearly everything is running 4096 pages or more on Linux) to put a serialized payload into. Why use a separate allocation for the payload? Nobody should use more than ~4k bytes for this purpose, methinks? And this utilizes otherwise wasted space. Instead of storing a ptr, you store a length for the usable portion of the page, which can vary e.g. 16 KiB on macOS if we ever port there. I think someone may have hinted at doing this in the discussion already but does anyone (@ivoanjo?) have a problem mandating that? Because otherwise you can't really do a safe read, if the ptr has a separate lifetime. Torn reads can be handled, but a separate lifetime complicates making that safe for both threads and out-of-process readers. |
Add
read(),threadlocal_attribute_key_map(), andread_threadlocal_attribute_key_map()tootel_process_ctx::linux, along with thefind_otel_mapping()/is_named_otel_mapping()/parse_mapping_start().What does this PR do?
A brief description of the change being made with this pull request.
Motivation
What inspired you to submit this pull request?
Additional Notes
Anything else we should know when reviewing?
How to test the change?
Describe here in detail how the change can be validated.