Description
Currently the spec defines the default value of defaultErrnoRet to be EPERM. This is troublesome and causes issues when running newer containers on hosts with an older kernel/userspace, where the libseccomp version might not be aware of some of the syscalls used by the container. There have been many reports of unavailable syscalls returning EPERM instead of ENOSYS and breaking the user space inside the container. Below are examples of such reports:
- [garden/seccomp] Unable to run 32-bit binaries in concourse containers (concourse/concourse#7471)
- seccomp filter should return ENOSYS for unknown syscalls (runc#2151)
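To illustrate why the return code matters, here is a minimal sketch of the fallback pattern that user-space code commonly follows: try the newer syscall, and fall back to an older one only when the kernel reports ENOSYS. The functions below are hypothetical stand-ins, not real syscall wrappers, and the errno is injected by hand to simulate what the seccomp filter would return.

```python
import errno


def restricted_new_syscall(blocked_errno):
    """Hypothetical stand-in for a syscall the host's seccomp filter does not know."""
    raise OSError(blocked_errno, "blocked by seccomp filter")


def legacy_syscall():
    """Hypothetical stand-in for the older syscall that the filter allows."""
    return "ok (legacy path)"


def call_with_fallback(blocked_errno):
    """Try the new syscall first; fall back only when it reports ENOSYS."""
    try:
        return restricted_new_syscall(blocked_errno)
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            # "Not implemented" is the signal to use the older interface.
            return legacy_syscall()
        # Anything else, including EPERM, is treated as a real failure.
        raise
```

With ENOSYS the caller quietly takes the legacy path. With EPERM the same code surfaces a permission error, which is the kind of breakage described in the reports above.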
runc does have some hacky code that tries to figure out whether a syscall is supported, but the method is not always reliable, and at Canonical we have seen reports of Ubuntu Noble containers breaking under Ubuntu Jammy hosts on ARM as well as PPC.
This issue currently affects all fixed-release distros, which also happen to be the most popular distros for running containers. For these distros, updating libseccomp for every new syscall introduces an unnecessary risk of regression and defeats the whole purpose of a fixed-release distro.
I understand that changing the current defaultErrnoRet to ENOSYS may also cause regressions. However, it also needs to be acknowledged that the EPERM default value was an oversight and that the OCI spec is fundamentally not compatible with fixed-release distros, which are the most popular distros for running containers. It also violates one of the most fundamental rules/expectations of containers, which is to be able to run any version of a user space, whether it is older or newer than the version on the host.
Having said that, I believe there is a way to satisfy both camps. A list of the currently available syscalls (up to kernel 6.10 as of the writing of this post) can be compiled and explicitly set to EPERM for those who were relying on the default EPERM return value, while leaving the other entries in the seccomp profiles unchanged. This means that when defaultErrnoRet is changed to ENOSYS, all previously available syscalls will still return EPERM, while newly added ones, or even older ones that are defined in the seccomp profile but not known to the libseccomp package on the host, will return the correct ENOSYS.
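The profile rewrite described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the profile is a plain dict mirroring the runtime spec's seccomp JSON fields (defaultAction, defaultErrnoRet, syscalls, names, action, errnoRet), its defaultAction is SCMP_ACT_ERRNO, and KNOWN_SYSCALLS_6_10 is a hypothetical, heavily truncated stand-in for the full syscall list. The errno values come from the machine running the script.

```python
import copy
import errno

# Hypothetical, heavily truncated stand-in for "all syscalls up to kernel 6.10".
KNOWN_SYSCALLS_6_10 = ["read", "write", "openat", "clone3", "mseal"]


def pin_legacy_eperm(profile, known=KNOWN_SYSCALLS_6_10):
    """Return a copy of `profile` whose default errno is ENOSYS, with EPERM
    pinned for every known syscall that the profile has no rule for."""
    new = copy.deepcopy(profile)
    already_listed = {
        name for rule in new.get("syscalls", []) for name in rule.get("names", [])
    }
    # Syscalls that previously fell through to the default EPERM.
    pinned = [name for name in known if name not in already_listed]
    new["defaultErrnoRet"] = errno.ENOSYS
    if pinned:
        new.setdefault("syscalls", []).append(
            {"names": pinned, "action": "SCMP_ACT_ERRNO", "errnoRet": errno.EPERM}
        )
    return new


profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
}
updated = pin_legacy_eperm(profile)
```

Existing rules are left untouched; only syscalls that used to hit the default get an explicit EPERM rule, so anything outside the known list falls through to ENOSYS.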
As an example, this can be expressed as the following in the runtime spec:
defaultErrnoRet (uint, OPTIONAL) - the errno return code to use.
Some actions like SCMP_ACT_ERRNO and SCMP_ACT_TRACE allow to specify the errno code to return.
When the action doesn't support an errno, the runtime MUST print an error and fail.
If not specified then its default value is EPERM for syscalls prior to kernel 6.11 and ENOSYS for future ones.
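One way to read the proposed wording is as the lookup below. It is a sketch only: SYSCALLS_UP_TO_6_10 is a hypothetical, truncated stand-in for the frozen syscall list, and effective_errno is an illustrative helper, not part of any runtime.

```python
import errno

# Hypothetical, truncated stand-in for the syscall list frozen at kernel 6.10.
SYSCALLS_UP_TO_6_10 = frozenset({"read", "write", "openat", "clone3", "mseal"})


def effective_errno(syscall, default_errno_ret=None):
    """Errno for a syscall that falls through to a default SCMP_ACT_ERRNO action."""
    if default_errno_ret is not None:
        # An explicit defaultErrnoRet in the profile always wins.
        return default_errno_ret
    if syscall in SYSCALLS_UP_TO_6_10:
        # Existed before kernel 6.11: keep the historical default.
        return errno.EPERM
    # Added later, or simply not in the list: report "not implemented".
    return errno.ENOSYS
```

For example, effective_errno("openat") keeps the historical EPERM, while a syscall name that is not in the frozen list resolves to ENOSYS unless the profile sets defaultErrnoRet explicitly.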
This means that anyone currently using the spec will see no change to their containers, since they are all using syscalls from Linux 6.10 and below. It also means that newer containers using post-6.10 syscalls will get the expected ENOSYS error, which limits the issue.
The spec should also define the behavior to follow if the syscall name is not known to the host. I believe the spec should explicitly define ENOSYS for such syscalls, and I am planning to work on a kernel driver that would expose the list of syscalls supported by the kernel to user space, making it easier to determine the return value of each syscall.