Optimize uWSGI PSGI Configuration for K8s Deployment #146

Open
@ranguard

Description

Performance: Optimize uWSGI PSGI Configuration for K8s Deployment

NOTE: This ticket was generated by Claude.ai, which was given our current config and asked to create this ticket.

Summary

We need to analyze and optimize the current uWSGI configuration to reduce slow responses and improve overall performance in our containerized Kubernetes environment.

Current Configuration

[uwsgi]
master = true
workers = 20
die-on-term = true
need-app = true
vacuum = true
disable-logging = true
listen = 1024
post-buffering = 4096
buffer-size = 65535
early-psgi = true
perl-no-die-catch = true
max-worker-lifetime = 3600
max-requests = 1000
reload-on-rss = 300
harakiri = 60

Performance Issues

  • Slow response times observed
  • Need data-driven optimization approach
  • Current settings may not be optimal for our workload

Action Items (Priority Order)

1. Enable Performance Monitoring (CRITICAL - Do First)

Why: Need baseline metrics before making changes

Implementation:

  • Add stats socket to uWSGI config:
    stats = 127.0.0.1:9191
    stats-http = true
  • Deploy and verify stats endpoint accessibility
  • Install uwsgitop for monitoring: pip install uwsgitop

Success Criteria: Can access uWSGI stats from inside the pod via curl http://127.0.0.1:9191
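A minimal sketch of how these additions sit in the existing [uwsgi] block; memory-report is an optional extra (not in the original config) that populates per-worker rss/vsz in the stats output and is assumed by the memory check in item 5:

[uwsgi]
# ... existing options unchanged ...
stats = 127.0.0.1:9191
stats-http = true
# optional: report per-worker memory usage in the stats output
memory-report = true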

2. Collect Baseline Performance Data (HIGH)

Why: Establish current performance patterns before optimization

Tasks:

  • Monitor for 24-48 hours and collect:
    • Worker CPU/memory usage: kubectl top pods
    • uWSGI worker stats: uwsgitop 127.0.0.1:9191
    • Container resource limits vs usage
    • Response time patterns
    • Worker restart frequency

Success Criteria: Have documented baseline metrics
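A hedged sketch for collecting snapshots during the baseline window; it assumes curl is available in the container image and <pod-name> is a placeholder:

# snapshot uWSGI stats and pod resource usage every 5 minutes
while true; do
  kubectl exec <pod-name> -- curl -s http://127.0.0.1:9191 > "stats-$(date +%Y%m%d-%H%M).json"
  kubectl top pods <pod-name> --no-headers >> pod-usage.log
  sleep 300
done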

3. Enable Targeted Logging (HIGH)

Why: Identify specific bottlenecks without overwhelming logs

Implementation:

  • Keep disable-logging = true so routine requests stay unlogged, and temporarily add:
    log-slow = 1000
    log-4xx = true  
    log-5xx = true
    logto = /tmp/uwsgi.log

Success Criteria: Can identify slow requests and error patterns from the log
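A quick way to inspect the new log from outside the container (pod name is a placeholder):

# follow slow-request and error lines
kubectl exec <pod-name> -- tail -f /tmp/uwsgi.log

# count harakiri kills recorded so far (uWSGI logs them with the string HARAKIRI)
kubectl exec <pod-name> -- grep -c HARAKIRI /tmp/uwsgi.log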

4. Right-size Worker Count (MEDIUM-HIGH)

Why: Worker count is the single most impactful setting; too many workers can hurt performance

Analysis needed:

  • Check container CPU allocation
  • Monitor CPU usage per worker
  • Test worker count = (2 × CPU cores) + 1 as starting point

Implementation:

  • If 4 CPU cores allocated, try workers = 9
  • Monitor performance impact
  • Adjust based on CPU utilization patterns

Success Criteria: Workers fully utilize available CPU without thrashing
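A rough sketch for deriving the starting point from the pod spec; it assumes the CPU limit is expressed in whole cores (e.g. 4, not 4000m) and that the uwsgi container is the first container in the pod:

# read the CPU limit and apply (2 x cores) + 1
CORES=$(kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.cpu}')
echo "suggested workers: $((2 * CORES + 1))"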

5. Optimize Memory Settings (MEDIUM)

Why: Frequent worker restarts hurt performance

Current issues:

  • reload-on-rss = 300 may be too aggressive
  • max-worker-lifetime = 3600 combined with max-requests = 1000 causes frequent restarts

Tasks:

  • Monitor actual worker RSS usage
  • Test increased settings:
    reload-on-rss = 512
    max-requests = 5000
    max-worker-lifetime = 7200

Success Criteria: Reduced worker restart frequency
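A hedged sketch for comparing actual worker RSS against the reload-on-rss threshold; it assumes memory-report is enabled (as sketched in item 1) so the stats JSON carries rss in bytes, and that jq is installed locally:

# per-worker RSS in MB plus request count
kubectl exec <pod-name> -- curl -s http://127.0.0.1:9191 \
  | jq '.workers[] | {id: .id, rss_mb: (.rss / 1048576 | floor), requests: .requests}'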

6. Tune Request Handling (MEDIUM)

Why: Better request processing and timeout handling

Implementation:

  • Analyze typical request duration
  • Adjust harakiri = 120 if requests legitimately take >60s
  • Consider increasing listen = 2048 if seeing connection drops
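If the analysis supports these adjustments, a sketch of the tuned settings; note that listen cannot usefully exceed the kernel's net.core.somaxconn, so that sysctl may need raising on the node or pod as well:

harakiri = 120
# log which request triggered each harakiri, to confirm the timeouts are legitimate
harakiri-verbose = true
listen = 2048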

7. Optimize Buffer Settings (LOW-MEDIUM)

Why: Current buffer-size = 65535 may be oversized

Tasks:

  • Monitor buffer utilization in stats
  • Test with buffer-size = 32768
  • Adjust based on actual usage patterns
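A sketch of the trial value; this is only safe if request headers (including cookies) stay well under the new size, since uWSGI rejects requests whose header block exceeds buffer-size:

buffer-size = 32768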

8. Load Testing & Validation (LOW)

Why: Validate improvements under controlled conditions

Implementation:

  • Set up load testing environment
  • Compare before/after metrics
  • Document optimal configuration
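A minimal load-test sketch, assuming the hey load generator is available and using placeholder URL and path against a staging endpoint that exercises a representative PSGI route:

# 60-second run at 50 concurrent connections; compare p95 latency before/after each change
hey -z 60s -c 50 https://staging.example.com/representative/path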

Monitoring Commands

# uWSGI stats (the stats socket is bound to loopback inside the pod)
kubectl exec <pod-name> -- curl -s http://127.0.0.1:9191

# Resource usage
kubectl top pods <pod-name>
kubectl describe pod <pod-name>

# Worker memory
kubectl exec <pod-name> -- ps aux | grep uwsgi

# Connection monitoring  
kubectl exec <pod-name> -- netstat -an | grep :80 | sort | uniq -c

Red Flags to Watch For

  • High worker churn (frequent restarts)
  • Listen queue overflows in stats
  • Consistent high CPU with low throughput
  • Memory usage steadily climbing
  • Regular harakiri timeouts

Success Metrics

  • Reduced 95th percentile response time
  • Decreased worker restart frequency
  • Improved resource utilization efficiency
  • Fewer timeout errors

Notes

  • Make changes incrementally
  • Monitor each change for 24+ hours before next adjustment
  • Keep rollback plan ready
  • Document all changes and their impact

Priority: High
Estimate: 1-2 sprints
Dependencies: Monitoring tools, load testing capability
