Linux Troubleshooting & Performance — From Production
If you've ever stared at a server with 100% CPU, a 502 waterfall, or a hung process that won't die — this blog is for you. Real incidents, real commands, real fixes. No padding.
About Damon
Senior Technical Support Engineer · L3
OPSWAT · Ho Chi Minh City
7+ years working on Linux systems in production — from running server fleets at an Anti-DDoS company to L3 support for enterprise deployments at OPSWAT, where the bugs are always someone else's OS-level problem.
Most of what I write comes from incidents that took too long to diagnose the first time. The goal is to make it faster for you.
Full background →
What This Blog Covers
Written for engineers who are mid-incident, not mid-tutorial. Every post has the exact commands and the reasoning behind each one — not just the fix, but why it works.
- Linux performance troubleshooting: CPU, memory, load average, I/O wait
- Process debugging: ps, top, strace, lsof, process states (D, Z, R)
- NGINX production issues: 502 errors, upstream keepalive, SSL hardening
- Security hardening: CIS benchmarks for Ubuntu, RHEL, Windows Server
- Incident response: log analysis, root cause, postmortem workflows
Common searches that land here: linux high cpu usage, debug linux server, load average explained, nginx 502 under load, linux process monitoring.
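As a small taste of the process-state item in the list above, here is a minimal sketch (not taken from any of the posts) that tallies processes by state letter, such as D, Z, and R, by reading `/proc/<pid>/stat`. The function names `parse_state` and `count_states` are hypothetical, and the sketch assumes a Linux host with `/proc` mounted.

```python
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count processes per state letter (R, S, D, Z, ...) under proc_root."""
    counts = Counter()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            counts[parse_state((entry / "stat").read_text())] += 1
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was unreadable
    return counts


if __name__ == "__main__":
    for state, n in sorted(count_states().items()):
        print(state, n)
```

A growing D count usually points at blocked I/O, and a growing Z count at a parent that is not reaping its children; `ps -eo stat` gives the same picture from the shell.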
Start Here
The three guides most engineers need first
Linux Performance Troubleshooting: Complete Engineer's Guide
CPU, memory, I/O, process states — the full diagnostic workflow with real commands and decision trees. Start here when a server is slow and you don't know why.
Read article →
NGINX 502 Bad Gateway Under Load: Causes, Debugging, and Fixes
The most common NGINX failure pattern in production. Covers port exhaustion, missing keepalive, and proxy timeouts — with the exact config that fixes each one.
Read article →
Linux Security Hardening: CIS Benchmarks for Production
CIS Level 1 hardening for Ubuntu, RHEL, and Windows Server. What each control does, what breaks in production, and how to apply it safely.
Read article →
Browse by Topic
Grouped by what you are actually trying to do
Linux Commands
ps, top, htop, strace, ss — the tools you reach for first
Troubleshooting
High CPU, memory leaks, zombie processes, 502 errors
Monitoring & Debugging
strace, lsof, auditd, log analysis
Security & Infrastructure
CIS hardening, firewall, Docker, NGINX config
All Topics
linux troubleshooting · nginx debugging · security hardening · infrastructure
Latest Articles
Most recent Linux and DevOps troubleshooting guides
NGINX 502 Bad Gateway Under Load: Root Causes and Fixes
NGINX 502 errors under load are almost never a simple app crash. This guide covers the real root causes — connection backlog overflow, keepalive misconfiguration, ephemeral port exhaustion — with diagnostic commands and config fixes from production incidents.
Log Analysis for Security Investigations: Windows Event Logs and Web Server Access Logs
A practical guide to log analysis for security investigations — Windows Event Viewer, critical Event IDs, Apache access log parsing, and the Linux command-line tools that make manual log analysis fast and effective.
Diamond Model of Intrusion Analysis: 4 Core Components Explained (2026)
A technical breakdown of the Diamond Model of Intrusion Analysis — adversary, victim, capability, and infrastructure — with real attack examples, meta-features, and how it compares to the Cyber Kill Chain and MITRE ATT&CK.
Cyber Kill Chain: All 7 Phases Explained with Real Attack Examples (2026)
A technical deep-dive into the Cyber Kill Chain — all 7 phases mapped with real attacker techniques, detection indicators, and defensive controls. Includes a full real-world attack walkthrough and Kill Chain vs MITRE ATT&CK comparison.
How to Trace Route in Linux: traceroute Examples
Use traceroute in Linux to diagnose network path issues — read hop output, interpret timeouts, use TCP mode to bypass firewalls, and identify where packets are being dropped.
Tools & Resources
Beyond the blog: scripts, CLI tools, and guides built from the same production experience. Anything I end up doing manually three times becomes a tool.
Troubleshooting Guides
Deep-dive walkthroughs for the incidents that take the longest to debug
Start with Linux →
Security Hardening
CIS benchmark implementation guides for Ubuntu, RHEL, and Windows Server
Read the guide →