Add worker ramping option to prevent system overload when starting many workers #1219

Open
@twiggy

Description

Problem Description

When running tests with many workers (e.g., pytest -n 50), all worker processes start simultaneously. This can cause several issues:

  • System overload: Sudden spike in CPU, memory, and I/O usage can make the system unresponsive
  • Resource contention: All workers competing for resources at once can actually slow down test execution
  • CI/CD failures: Resource-constrained environments may fail when too many processes start at once
  • Database/service overload: If tests connect to external services, simultaneous connections can overwhelm them

Proposed Solution

Add a --ramp option that gradually starts workers over a specified time period, similar to JMeter's ramp-up period for load testing.
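
A minimal sketch of how the option could be registered, using pytest's standard pytest_addoption hook. The group name mirrors pytest-xdist's existing option group; the --ramp flag itself is the proposal, not something that exists today:

def pytest_addoption(parser):
    # Sketch only: --ramp is proposed here, not an existing pytest-xdist flag.
    group = parser.getgroup("xdist", "distributed and subprocess testing")
    group.addoption(
        "--ramp",
        action="store",
        default=None,
        metavar="DURATION",
        help="gradually start workers over DURATION (e.g. 10s, 5m, 1h)",
    )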

Example Usage

# Start 10 workers over 10 seconds (one worker per second)
pytest -n 10 --ramp 10s

# Start 50 workers over 5 minutes
pytest -n 50 --ramp 5m

# Start 100 workers over 1 hour  
pytest -n 100 --ramp 1h

How it would work

With --ramp 10s and -n 10 (see the delay sketch after this list):

  • Worker 0: starts immediately
  • Worker 1: starts after 1 second
  • Worker 2: starts after 2 seconds
  • ... and so on
  • Worker 9: starts after 9 seconds
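
Concretely, the delay for worker i would be i * (ramp_seconds / num_workers). A minimal sketch of that math; the function names are illustrative, not existing pytest-xdist APIs:

import time

def ramp_delay(worker_index, num_workers, ramp_seconds):
    # Seconds this worker should wait before it begins running tests.
    return worker_index * (ramp_seconds / num_workers)

def wait_for_ramp(worker_index, num_workers, ramp_seconds):
    # Each worker sleeps for its own slot in the ramp-up window.
    time.sleep(ramp_delay(worker_index, num_workers, ramp_seconds))

With --ramp 10s and -n 10, ramp_delay gives 0, 1, 2, ..., 9 seconds for workers 0 through 9, matching the schedule above.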

Benefits

  1. Prevents system overload by distributing the resource usage spike over time
  2. Better for shared environments where sudden resource spikes affect other users/processes
  3. Improves test reliability in resource-constrained CI/CD environments
  4. Useful for load testing scenarios where gradual ramp-up is desired
  5. Backward compatible - no impact when --ramp is not specified

Implementation Considerations

  • Support common time formats: seconds (s), minutes (m), hours (h); see the parsing sketch after this list
  • Workers should delay test execution, not just process initialization
  • Provide progress feedback during ramp-up period
  • Should work with all existing distribution modes (--dist)
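
A minimal sketch of the duration parsing mentioned in the first consideration; parse_ramp_duration is a hypothetical helper, not an existing API:

import re

_UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_ramp_duration(value):
    # Accepts durations like "10s", "5m", or "1h" and returns seconds.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smh])", value)
    if match is None:
        raise ValueError(f"invalid --ramp duration: {value!r}")
    amount, unit = match.groups()
    return float(amount) * _UNITS[unit]

For example, parse_ramp_duration("5m") returns 300.0, and an unrecognized suffix raises ValueError so a bad option fails fast at startup.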

Use Cases

  1. Large test suites: Organizations with thousands of tests that need many workers
  2. Shared CI/CD environments: Where resource usage must be controlled
  3. Database-heavy tests: Preventing connection pool exhaustion
  4. Performance testing: Gradually increasing load on the system under test
  5. Development machines: Preventing system freeze when running tests locally

Related Issues/Context

This feature is similar to:

  • JMeter's ramp-up period for thread groups
  • Kubernetes' rolling deployments
  • Database connection pool gradual scaling

I've already put together an implementation (with AI assistance), hopefully with enough tests and documentation, and have been running it against my own suite. I'm creating this issue as a placeholder and hope to open a PR in the near future.
