Description
Agent Environment
Agent 7.50.3 on AWS ECS Fargate, using the latest container image.
Stack trace
```
panic: runtime error: index out of range [0] with length 0
goroutine 378 [running]:
github.com/DataDog/datadog-agent/pkg/process/util.(*ChunkAllocator[...]).Accept(0x5dcf9a0, {0xc001bb0080?, 0xc, 0x10}, 0x681)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/util/chunking.go:96 +0x2f9
github.com/DataDog/datadog-agent/pkg/process/util.ChunkPayloadsBySizeAndWeight[...](0xc00118b720, 0xc001bd25f0, 0x48, 0xf4240?)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/util/chunking.go:166 +0x2c5
github.com/DataDog/datadog-agent/pkg/process/checks.chunkProcessesBySizeAndWeight({0xc001bb0080?, 0xc, 0x10}, 0xc001a8b680, 0x4044b33333333333?, 0x0?, 0xc001bd25f0)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/chunking.go:42 +0x326
github.com/DataDog/datadog-agent/pkg/process/checks.chunkProcessesAndContainers(0x97d03d8?, {0xc000152720, 0x3, 0xc001c0d860?}, 0xc001bad560?, 0xc001afcdb0?)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:414 +0x118
github.com/DataDog/datadog-agent/pkg/process/checks.createProcCtrMessages(0xc000074360, 0x40401c28f5c28f5c?, {0xc000152720?, 0x0?, 0x403a8a3d70a3d70a?}, 0x0?, 0x3ff9eb851eb851ec?, 0x57eac212, {0x0, 0x0}, ...)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:371 +0x5d
github.com/DataDog/datadog-agent/pkg/process/checks.(*ProcessCheck).run(0xc000b5e480, 0x57eac212, 0x1)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:277 +0x6fe
github.com/DataDog/datadog-agent/pkg/process/checks.(*ProcessCheck).Run(0x3?, 0xc00144ce00, 0xc001ac53b0)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:351 +0xe5
github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).runCheckWithRealTime(0xc0002392c0, {0x6ed99b0, 0xc000b5e480}, 0xc001ac53b0)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:182 +0xc6
github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).runnerForCheck.func2({0x1, 0x1})
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:350 +0x65
github.com/DataDog/datadog-agent/pkg/process/checks.(*runnerWithRealTime).run(0xc000b4df40)
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/runner.go:73 +0x35a
github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).Run.func1()
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:287 +0x5c
created by github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).Run
	/omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:285 +0x3c9
process-agent exited with code 2, signal 0, restarting in 2 seconds
```
Describe what happened:
Got this crash while setting up DataDog process monitoring on a Fargate task. I eventually realized I'd made a typo and set `pidMode = task` on one of the container definitions instead of on the root task configuration; fixing that resolved the immediate issue.
Describe what you expected:
Any diagnostic message that would help me discover the configuration issue.
Steps to reproduce the issue:
Set up an ECS task with multiple containers and don't set `pidMode`; sort of, anyway. There are more conditions, but I'm not sure exactly what they are yet (in local testing I discovered it's apparently sensitive to whether the `datadog-agent` container's name sorts alphabetically before or after the other containers in the task; more on that below).
Additional environment details (Operating System, Cloud provider, etc):
Here's how far I've gotten in debugging the crash:
datadog-agent/pkg/process/util/chunking.go, lines 85 to 98 in abce0cb
The specific line is `c.props[c.idx].size += len(ps)`. We know `c.idx` is zero from the panic message, but we also know that `c.idx >= len(c.chunks)` is not true, because otherwise the `if` branch would have been taken and `c.props` would have a 0th element. Substituting, `!(c.idx >= len(c.chunks))` => `c.idx < len(c.chunks)` => `0 < len(c.chunks)`; that is, `c.chunks` does have (at least) a 0th element.
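To make that deduction concrete, here is a minimal sketch of the bookkeeping as I understand it; the type and field names are mine, not the real `ChunkAllocator`. Once `chunks` is longer than `props`, the guard is skipped and indexing `props[0]` panics with the same message as the trace above.

```go
package main

// Minimal sketch of the allocator's bookkeeping; the type and field names are
// hypothetical, not the real ChunkAllocator. chunks and props are meant to
// grow in lockstep, and Accept only grows them when starting a new chunk.
type chunkProps struct {
	size   int
	weight int
}

type allocator struct {
	chunks [][]string   // payload chunks
	props  []chunkProps // per-chunk metadata; Accept assumes len(props) >= len(chunks)
	idx    int          // index of the chunk currently being filled
}

func (c *allocator) Accept(ps []string, weight int) {
	if c.idx >= len(c.chunks) {
		// The only place the slices are extended, and they are extended together.
		c.chunks = append(c.chunks, nil)
		c.props = append(c.props, chunkProps{})
	}
	c.chunks[c.idx] = append(c.chunks[c.idx], ps...)
	c.props[c.idx].size += len(ps) // panics if chunks has outgrown props
	c.props[c.idx].weight += weight
}

func main() {
	c := &allocator{}
	// Put the allocator into the state deduced above: chunks has a 0th
	// element, props does not, so the guard in Accept is skipped.
	c.chunks = append(c.chunks, []string{"placeholder"})
	c.Accept([]string{"proc-1"}, 1) // index out of range [0] with length 0
}
```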
`pkg/process/util/chunking.go` itself correctly maintains the relationship between `c.chunks` and `c.props`, but `c.chunks` can also escape the package by reference via `GetChunks`:
datadog-agent/pkg/process/util/chunking.go, lines 100 to 102 in abce0cb
If another caller were to acquire a reference to `c.chunks` via `GetChunks` and then `append` to it, that would violate `Accept`'s assumption that `len(c.props) >= len(c.chunks)`, as sketched below.
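A tiny demonstration of that escape, with hypothetical names; I'm not asserting the real `GetChunks` signature, only that the chunks slice ends up exposed by reference:

```go
package main

import "fmt"

// Hypothetical stand-in for the allocator; the names are mine. The point is
// only that handing out a reference to the chunks slice lets callers grow it
// without the allocator extending props to match.
type allocator struct {
	chunks [][]string
	props  []int // one entry per chunk, normally maintained alongside chunks
}

// getChunks mimics a GetChunks-style accessor that exposes the slice by
// reference (I'm not asserting the real signature, only the by-reference
// behaviour the calling code relies on).
func (c *allocator) getChunks() *[][]string {
	return &c.chunks
}

func main() {
	c := &allocator{}
	chunks := c.getChunks()
	// An external append, e.g. for a container with no mappable processes.
	*chunks = append(*chunks, []string{"container-only chunk"})

	// The allocator never saw this append, so props was not extended and
	// Accept's assumption that len(props) >= len(chunks) no longer holds.
	fmt.Printf("len(chunks)=%d, len(props)=%d\n", len(c.chunks), len(c.props))
}
```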
And that is exactly what happens here:
datadog-agent/pkg/process/checks/chunking.go, lines 15 to 22 in abce0cb
datadog-agent/pkg/process/checks/chunking.go, lines 45 to 51 in abce0cb
If either of the "two scenarios" referenced in `chunkProcessesBySizeAndWeight`'s comment occurs, and the container with unmappable processes is the first one inspected (hence why order matters!), then `appendContainerWithoutProcesses` sees an empty `collectorProcs`, and `c.chunks` is extended while `c.props` is not. When `chunkProcessesBySizeAndWeight` later calls `util.ChunkPayloadsBySizeAndWeight`, which calls `Accept`, this crash occurs.
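Here is a self-contained sketch of that sequence under my assumptions (all names are hypothetical; `chunkAll` stands in for the chunking flow, and the unmappable container is recorded by appending to `chunks` directly, the way `appendContainerWithoutProcesses` does through `GetChunks`): when the process-less container is visited first, the external append happens before `Accept` has ever created `props[0]`, so the next `Accept` call panics; visit it last and the same append goes unnoticed.

```go
package main

import "fmt"

// Hypothetical reconstruction of the crash path; none of these names are the
// agent's. Containers whose processes cannot be mapped to them are recorded by
// appending to chunks directly (standing in for what
// appendContainerWithoutProcesses does through GetChunks), bypassing Accept.
type container struct {
	name  string
	procs []string // empty: process <=> container mapping could not be established
}

type allocator struct {
	chunks [][]string
	props  []int // per-chunk size; Accept assumes len(props) >= len(chunks)
	idx    int
}

func (c *allocator) Accept(ps []string) {
	if c.idx >= len(c.chunks) {
		c.chunks = append(c.chunks, nil)
		c.props = append(c.props, 0)
	}
	c.chunks[c.idx] = append(c.chunks[c.idx], ps...)
	c.props[c.idx] += len(ps) // panics when chunks was grown externally first
}

func chunkAll(containers []container) {
	c := &allocator{}
	for _, ctr := range containers {
		if len(ctr.procs) == 0 {
			// Direct append, standing in for the append made through the
			// escaped GetChunks reference: chunks grows, props does not.
			c.chunks = append(c.chunks, []string{"container-only: " + ctr.name})
			continue
		}
		c.Accept(ctr.procs)
	}
	fmt.Printf("ok: %d chunks, %d props\n", len(c.chunks), len(c.props))
}

func main() {
	agentFirst := []container{
		{name: "datadog-agent", procs: nil},     // unmappable, visited first
		{name: "web", procs: []string{"nginx"}}, // Accept then panics: props is still empty
	}
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("agent-first ordering panics:", r)
			// Reverse the ordering: Accept runs first, so props[0] exists and
			// the later external append goes unnoticed.
			chunkAll([]container{agentFirst[1], agentFirst[0]})
		}
	}()
	chunkAll(agentFirst)
}
```

In the reversed ordering the two slices still end up out of sync; it just isn't caught by an out-of-range index, which would explain why the crash seems to depend on where the `datadog-agent` container sorts.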
My first instinct would be to say that `GetChunks` shouldn't exist (pass the `chunker` down to `appendContainerWithoutProcesses` and have it call `Accept` with an empty process list, perhaps? I don't know), but I'm seeing this code for the first time, so take that with a grain of salt.
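For what it's worth, here is a rough sketch of what I mean, under the assumption that accepting an empty payload is (or could be made) a valid way to start a container-only chunk. The names are hypothetical and this is not a proposed patch; it only shows that the invariant holds when both slices are grown in a single place.

```go
package main

import "fmt"

// Sketch of the alternative, assuming that accepting an empty payload is (or
// could be made) a valid way to start a container-only chunk. The names are
// hypothetical; the point is only that the invariant len(props) >= len(chunks)
// holds when both slices are grown in a single place.
type allocator struct {
	chunks [][]string
	props  []int
	idx    int
}

func (c *allocator) Accept(ps []string) {
	if c.idx >= len(c.chunks) {
		c.chunks = append(c.chunks, nil)
		c.props = append(c.props, 0)
	}
	c.chunks[c.idx] = append(c.chunks[c.idx], ps...)
	c.props[c.idx] += len(ps)
}

// appendContainerOnly stands in for appendContainerWithoutProcesses, but takes
// the allocator itself rather than a reference to its chunks slice.
func appendContainerOnly(c *allocator) {
	// Record the container by accepting an empty process list: chunks and
	// props still grow together inside Accept. (The real change would also
	// need to attach the container to the new chunk; that part is omitted.)
	c.Accept(nil)
}

func main() {
	c := &allocator{}
	appendContainerOnly(c)      // process-less container handled first
	c.Accept([]string{"nginx"}) // no panic: props[0] exists
	fmt.Printf("len(chunks)=%d, len(props)=%d\n", len(c.chunks), len(c.props))
}
```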
Presumably in this case the process <=> container mapping cannot be established because of the `pidMode` configuration issue, but I'm not certain. I don't know whether there's a good way to detect that setting from within the container, but this crash definitely shouldn't be happening.