Default “defaultErrnoRet” breaks the ability for hosts to run more recent container images. #1266

Open
@ghadi-rahme

Currently the spec defines the default defaultErrnoRet to be EPERM. This is troublesome and causes issues when running newer containers on hosts with an older kernel/userspace, where the libseccomp version might not be aware of some of the syscalls used by the container. There have been many reports of unavailable syscalls returning EPERM instead of ENOSYS and breaking the user-space inside the container. Below is an example of such reports:

runc does have some hacky code to try to figure out whether a syscall is supported, but the method is not always reliable, and at Canonical we have seen reports of Ubuntu Noble containers breaking under Ubuntu Jammy hosts on ARM as well as PPC.
This issue currently affects all fixed-release distros, which also happen to be the most popular distros for running containers. For these distros, updating libseccomp for every new syscall poses an unnecessary risk of regression and defeats the whole purpose of a fixed-release distro.

I understand that changing the current defaultErrnoRet to ENOSYS may also cause regressions. However, it also needs to be acknowledged that the EPERM default value was an oversight and that the OCI spec is fundamentally not compatible with fixed-release distros, which are the most popular distros for running containers. It also violates one of the most fundamental rules/expectations of containers, which is to be able to run any version of a user-space, whether it is older or newer than the version on the host.

Having said that, I believe there is a way to satisfy both camps. A list of the currently available syscalls (up to kernel 6.10 as of this writing) can be compiled, and those syscalls can be explicitly set to EPERM in the seccomp profiles for those who were relying on the default EPERM return value, while leaving the others unchanged. This means that when defaultErrnoRet is changed to ENOSYS, all previously available syscalls will still return EPERM, while newly added ones, or even older ones that are defined in the seccomp profile but not known to the libseccomp package on the host, will return the correct ENOSYS.
As an example, this can be expressed as the following in the runtime spec:

  • defaultErrnoRet (uint, OPTIONAL) - the errno return code to use.
    Some actions like SCMP_ACT_ERRNO and SCMP_ACT_TRACE allow specifying the errno code to return.
    When the action doesn't support an errno, the runtime MUST print an error and fail.
    If not specified then its default value is EPERM for syscalls prior to kernel 6.11 and ENOSYS for future ones.

This means that anyone currently using the spec will see no change to their containers, since they are all using syscalls from Linux 6.10 and below. But it also means that newer containers using post-6.10 syscalls will get the expected ENOSYS error, limiting the issue.
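
To make the idea concrete, here is a minimal sketch of how a profile generator could pin EPERM on already-known syscalls while switching the default to ENOSYS. The field names (`defaultAction`, `defaultErrnoRet`, `syscalls`, `names`, `action`, `errnoRet`) are the existing runtime-spec seccomp fields; the `KNOWN_UP_TO_6_10` list and the `build_seccomp_profile` helper are hypothetical and only for illustration. A real list would need to be compiled per architecture.

```python
import errno
import json

# Hypothetical, heavily truncated stand-in for the full list of syscalls
# that exist up to Linux 6.10. Not a real or complete list.
KNOWN_UP_TO_6_10 = ["read", "write", "openat", "close", "mount", "ptrace", "keyctl"]


def build_seccomp_profile(allowed, known=KNOWN_UP_TO_6_10):
    """Build a seccomp section that allows `allowed`, returns EPERM for the
    other syscalls in `known`, and ENOSYS for anything not listed."""
    allowed_set = set(allowed)
    # Known syscalls that are not allowed keep the legacy EPERM behavior.
    denied_known = [name for name in known if name not in allowed_set]
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        # errno numbers can differ between architectures; this uses the
        # value from the machine generating the profile.
        "defaultErrnoRet": errno.ENOSYS,
        "syscalls": [
            {"names": sorted(allowed_set), "action": "SCMP_ACT_ALLOW"},
            {
                "names": denied_known,
                "action": "SCMP_ACT_ERRNO",
                "errnoRet": errno.EPERM,
            },
        ],
    }


if __name__ == "__main__":
    profile = build_seccomp_profile(["read", "write", "openat", "close"])
    print(json.dumps(profile, indent=2))
```

With a profile like this, a denied syscall that predates the cutoff still gets EPERM, and anything the profile does not mention falls through to the ENOSYS default.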

Also, the spec should define the behavior to follow if the syscall name is not known to the host. I believe the spec should explicitly define ENOSYS for such syscalls, and I am planning on working on a kernel driver that would expose the list of syscalls supported by the kernel to user-space, making it easier to determine the return value of each syscall.
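
The decision rule I have in mind for a denied syscall could be sketched roughly as follows. This is only an illustration of the proposal, not how any runtime behaves today; `errno_for_denied`, `host_known` and `legacy_eperm` are hypothetical names, with `host_known` standing for whatever the host can resolve and `legacy_eperm` for the pre-6.11 list.

```python
import errno


def errno_for_denied(name, host_known, legacy_eperm):
    """Pick the errno for a denied syscall under the proposed rule."""
    # Names the host cannot resolve are treated as "not implemented".
    if name not in host_known:
        return errno.ENOSYS
    # Syscalls from the pre-6.11 list keep the historical EPERM.
    if name in legacy_eperm:
        return errno.EPERM
    # Anything newer falls back to ENOSYS.
    return errno.ENOSYS


if __name__ == "__main__":
    host_known = {"read", "mount"}                # hypothetical host view
    legacy_eperm = {"read", "mount", "ptrace"}    # hypothetical pre-6.11 list
    for syscall in ("mount", "ptrace", "some_future_syscall"):
        code = errno_for_denied(syscall, host_known, legacy_eperm)
        print(syscall, errno.errorcode[code])
```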
