Incident Case Study: PHP-FPM OOM Kill

This case study describes a real failure mode and the rules that prevent it.

Summary

  • The Linux kernel invoked the OOM killer
  • A php-fpm process was killed
  • Nginx logged:
    • recv() failed (104: Connection reset by peer) while reading response header from upstream

What Happened

  • Some requests held PHP-FPM workers for long durations
  • Memory use grew until the system ran out of RAM
  • The kernel killed a php-fpm process to reclaim memory, which reset the in-flight upstream connection that Nginx was reading from

Evidence Pattern

  • Kernel logs show:
    • php invoked oom-killer
    • Out of memory: Killed process (php-fpm...)
  • Nginx shows:
    • Connection reset by peer
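
The evidence pattern above can be checked mechanically. The following is a minimal shell sketch, assuming the kernel and Nginx log text is piped in on stdin. The function names `count_oom_kills` and `count_upstream_resets` are illustrative, and the match strings are taken from the messages quoted in this case study.

```shell
# Count kernel log lines on stdin that mention the OOM killer.
# Patterns mirror the messages quoted in this case study.
count_oom_kills() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -ciE 'invoked oom-killer|out of memory: killed process' || true
}

# Count Nginx error-log lines on stdin that report an upstream reset.
count_upstream_resets() {
    grep -cF '104: Connection reset by peer' || true
}
```

For example, `dmesg -T | count_oom_kills` counts kernel-side hits, and `count_upstream_resets < /var/log/nginx/error.log` counts the matching Nginx errors (the log path is an assumption and varies by installation). Non-zero counts from both sources in the same time window are consistent with this incident pattern.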

Root Cause Categories

Common reasons for PHP-FPM OOM:

  1. Runtime overrides in code that remove or effectively remove limits
    • ini_set('memory_limit', '-1')
    • ini_set('max_execution_time', 100000)
  2. Unbounded DB fetches
    • huge ->get() or Model::all()
  3. Exports/reports in web request
  4. Large uploads processed inline
  5. Too many concurrent FPM workers for available RAM
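
The first category is the easiest to find automatically. The sketch below greps a source tree for the two `ini_set` overrides listed above. It assumes a grep that supports `--include` (GNU grep does), and the function name `find_runtime_overrides` is hypothetical. It does not catch other ways of raising limits, such as `set_time_limit()`.

```shell
# List PHP source lines that override memory_limit or max_execution_time
# at runtime. The first argument is the directory to scan (default: .).
find_runtime_overrides() {
    grep -rnE --include='*.php' \
        "ini_set\([[:space:]]*['\"](memory_limit|max_execution_time)['\"]" \
        "${1:-.}" || true
}
```

Each output line has the form `file:line:matched source`, so the result can be pasted directly into a code review.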

Lessons Learned

  • Web requests must be bounded:
    • ≤ 120 seconds
    • ≤ 256MB memory
  • Heavy work must be moved to queues
  • FPM must have a pm.max_children cap sized to the RAM actually available
  • Enable the FPM slowlog (slowlog and request_slowlog_timeout) to detect slow routes early
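
One conservative way to size the cap is to divide the RAM set aside for PHP-FPM by the per-worker memory ceiling. The sketch below assumes the 256 MB bound above as the default ceiling. The function name `estimate_max_children` and the budget figure in the example are hypothetical, and real workers often use less than the ceiling, so this is a worst-case estimate rather than a tuned value.

```shell
# Estimate a pm.max_children cap: RAM reserved for PHP-FPM divided by the
# per-worker memory ceiling, rounded down. Both arguments are in MB; the
# second defaults to 256 to match the request memory bound.
estimate_max_children() {
    echo $(( $1 / ${2:-256} ))
}
```

With a hypothetical 4096 MB budget, `estimate_max_children 4096` prints `16`. The budget should exclude memory needed by the OS, Nginx, and any database or cache on the same host.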

Preventive Controls

  • Server-enforced timeouts, which application code cannot override with ini_set():
    • request_terminate_timeout = 120s
  • Code review checklist enforcement
  • Queue workers with:
    • --timeout=120 (seconds)
    • --memory=256 (MB)
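
A quick way to confirm the server-side controls are in place is to check the pool file for the directives this case study relies on. The sketch below is illustrative: the function name `check_pool_limits` is hypothetical, the pool file location varies by distribution, and it only checks that an uncommented setting exists, not that its value is sensible.

```shell
# Report which protective directives are missing from a PHP-FPM pool file.
# Prints one "missing: <directive>" line per absent setting and prints
# nothing when all three are present.
check_pool_limits() {
    pool_file=$1
    for directive in pm.max_children request_terminate_timeout slowlog; do
        # Match uncommented "directive = value" lines only; lines that
        # start with ";" are comments in FPM pool files.
        if ! grep -qE "^[[:space:]]*${directive}[[:space:]]*=" "$pool_file"; then
            echo "missing: ${directive}"
        fi
    done
}
```

Running it against each pool file after a deploy gives an early warning if a limit was dropped or commented out.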

What To Do If You See This Again

  1. Check kernel logs for OOM:
    • dmesg -T | grep -Ei 'oom|killed process'
  2. Identify heavy endpoints (slowlog / access logs)
  3. Move heavy logic to queue
  4. Reduce concurrency pressure (FPM caps + caching)
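
For step 2, the access log can be ranked by request duration. This sketch rests on a strong assumption: the Nginx `log_format` has been customized to append `$request_time` as the last field, and the request path is the seventh whitespace-separated field (as in the combined format). The function name `slowest_requests` is hypothetical.

```shell
# Print the N slowest requests from access-log lines on stdin as
# "<seconds> <path>", slowest first (N defaults to 10).
# ASSUMPTION: $request_time is the last field and the path is field 7.
slowest_requests() {
    awk '{ print $NF, $7 }' | LC_ALL=C sort -rn | head -n "${1:-10}"
}
```

For example, `slowest_requests 20 < /var/log/nginx/access.log` lists the twenty slowest requests (the log path is an assumption). Routes that recur near the top are the first candidates to move to a queue.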