CPU Steal Time Explained (and When It Justifies Dedicated Hardware)
Knowledge blog

If your VPS “randomly” slows down, you need a signal that separates application problems from infrastructure scheduling problems. CPU steal time is one of the clearest signals you can read from inside a Linux VM. It is not a court verdict. It is a symptom. Used correctly, it helps you stop guessing.
TL;DR
- CPU steal time is time your VM was ready to run, but the hypervisor did not schedule your vCPU.
- You can see it as st in top and vmstat, and as %steal in mpstat.
- Sustained steal time that correlates with latency spikes is strong evidence of shared-host CPU contention.
- Dedicated hardware can be the simplest fix because it removes shared-host CPU scheduling as a variable.
- Dedicated does not fix bad queries, lock contention, memory leaks, or storage bottlenecks. It fixes the “my VM is not getting CPU when it needs it” class of problems.
Table of contents
- What CPU steal time is
- What steal time is not
- How to read steal time on Linux
- Patterns that suggest shared host CPU contention
- When steal time is not the culprit
- Decision checklist: tune, move, or go dedicated
- What success looks like after moving
- Removing contention without surprises
- FAQ
What CPU steal time is
CPU steal time is time your virtual CPU wanted to run, but the hypervisor did not give it CPU time.
Linux tooling is unusually direct about it. The top man page describes the st field as:
“st : time stolen from this vm by the hypervisor”
In mpstat, the definition is even more explicit:
“percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.”
Steal time is a VM-visible symptom of host scheduling decisions. You can have user-facing CPU pressure without seeing your own process CPU usage rise the way you expect.
What steal time is not
Steal time is not a generic “my server is slow” counter. It is not disk latency. It is not network latency. It is also not a verdict on your code.
One common mistake is confusing steal time with I/O wait.
In mpstat, %iowait is:
- CPU idle time while the system had an outstanding disk I/O request.
In mpstat, %steal is:
- vCPU involuntarily waiting because the hypervisor is busy elsewhere.
Mix those up and you will chase the wrong fix.
How to read steal time on Linux
Option 1: top (fastest)
Run:
top
Look at the CPU summary line.
If your build, request handling, or queueing feels slow and you see st rise at the same time, that is a signal.
One practical nuance: top documents that the st field may not be shown depending on kernel version. If it is missing, use mpstat or vmstat instead.
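If you want to log the summary line over time instead of watching it live, top’s batch mode is enough. A minimal sketch; the exact summary-line text can differ between top versions, so check what your version prints first:
# 60 one-second batch samples; keep only the CPU summary line, which includes st
top -b -d 1 -n 60 | grep '%Cpu'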
Option 2: mpstat (best for sampling)
If sysstat is installed:
mpstat -P ALL 1 60
This gives you per-CPU samples each second, including %steal and %iowait.
If you want a shorter capture:
mpstat -P ALL 1 10
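mpstat timestamps every sample, so redirecting a capture to a file gives you something you can line up with other data later. A minimal sketch:
# One minute of per-CPU samples saved under a timestamped filename for later correlation
mpstat -P ALL 1 60 > mpstat-$(date +%Y%m%dT%H%M%S).log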
Option 3: vmstat (nice quick overview)
Run:
vmstat 1 10
In vmstat, CPU fields include:
- st: time stolen from a virtual machine.
Option 4: confirm at the source in /proc/stat
If you want to remove “tool interpretation” from the picture, check /proc/stat.
The cpu line contains multiple counters including:
- steal: stolen time while running in a virtualized environment.
This is the raw data many tools are reading under the hood.
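If you want a rough steal percentage computed straight from /proc/stat, here is a minimal bash sketch. It assumes the usual field order of the aggregate cpu line, where the ninth field is steal:
# Sum of fields 2-9 approximates total jiffies; field 9 is steal
read_cpu() { awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8+$9, $9}' /proc/stat; }
read t1 s1 < <(read_cpu)
sleep 1
read t2 s2 < <(read_cpu)
awk -v dt="$((t2 - t1))" -v ds="$((s2 - s1))" 'BEGIN { if (dt > 0) printf "steal: %.1f%%\n", 100 * ds / dt }'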
Patterns that suggest shared host CPU contention
Steal time becomes useful when you correlate it with symptoms.
Pattern 1: latency spikes coincide with steal spikes
What you see:
- p95 or p99 latency jumps for seconds to minutes
- queue time rises
- requests mostly succeed but become late
What you do:
- capture mpstat -P ALL 1 60 during the slow window
- compare the timestamps with your latency chart
If latency and %steal rise together, your slowdown is likely upstream of your code.
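One way to make that comparison trivial is to record your own probe latency on the same host clock while mpstat runs. A hedged sketch; the health endpoint below is a placeholder, swap in one of yours:
# Once per second: wall-clock timestamp plus total request time for a hypothetical endpoint
while true; do
  printf '%s %s\n' "$(date +%T)" "$(curl -s -o /dev/null -w '%{time_total}' https://example.internal/health)"
  sleep 1
done > latency-$(date +%Y%m%dT%H%M%S).log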
Pattern 2: load average rises, but your app cannot “get CPU”
This is where teams get stuck:
- load average rising
- latency rising
- user CPU not rising as expected
- steal time rising
That can happen because tasks are runnable, but your VM is waiting for real CPU time.
To corroborate scheduler pressure, check the run queue:
sar -q 1 10
sar -q reports runq-sz as the run queue length, meaning tasks running or waiting for run time. That is a useful “runnable pressure” signal when you suspect scheduling issues.
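A quick way to put runq-sz in context is to compare it against the number of vCPUs the guest sees:
# A run queue persistently well above the vCPU count suggests runnable pressure
nproc
sar -q 1 10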
Pattern 3: sustained steal under normal load
Peaks happen. Short host events happen.
What hurts product performance is repeatable steal time during normal workload, especially when it lines up with tail latency.
Red Hat’s guidance is blunt: large amounts of steal time indicate CPU contention and can reduce guest performance.
When steal time is not the culprit
This section is here for a reason. If you skip it, your incident response will become “dedicated until proven otherwise.” That is not engineering.
Use this elimination checklist.
1) If %steal is near zero during the slowdown, check CPU throttling models
A classic example is burstable CPU credit models. When credits are exhausted, the instance drops back toward baseline CPU behavior. That can look like “random slowdowns” with low or zero steal time.
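How you check credit state depends entirely on the provider. As one example only, if the instance happens to be an AWS burstable type, the credit balance is a CloudWatch metric; the instance ID and time window below are placeholders:
# CPUCreditBalance trending toward zero around the slow windows points at credit exhaustion
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-05-01T00:00:00Z \
  --end-time 2024-05-01T06:00:00Z \
  --period 300 \
  --statistics Average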
2) If you run containers, check cgroup CPU throttling first
In Kubernetes, CPU limits are enforced by throttling the container’s cgroup (CFS bandwidth control). So a container can be slow because it is being throttled by its CPU limit, even when the VM itself shows low steal time.
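A quick check, from inside the container or against its cgroup on the node; exact paths depend on whether the node runs cgroup v2 or v1 and how the hierarchy is mounted:
# cgroup v2: nr_throttled and throttled_usec climbing between reads means the limit is biting
cat /sys/fs/cgroup/cpu.stat
# cgroup v1 layout, if that is what the node uses
cat /sys/fs/cgroup/cpu/cpu.stat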
3) If %iowait is high, you are likely storage bound
High %iowait with low %steal points at:
- disk
- filesystem
- database I/O
- storage contention
Steal time is not a storage metric. Treat it as one and you will waste days.
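To confirm you are storage bound, look at the devices themselves; iostat (also part of sysstat) is enough for a first pass:
# Extended per-device statistics; sustained high %util and rising await times point at storage
iostat -x 1 10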
4) If you suspect CPU placement issues, look at NUMA and pinning
NUMA placement can create real performance hits. If you are doing vCPU pinning, NUMA tuning needs to be considered as well; otherwise you can create avoidable cross-node memory access and unpredictable performance. This applies more when you control the host, but it still matters in some managed environments.
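If you do control the host, the usual starting points are the NUMA topology and the per-node allocation counters; both tools are typically shipped in the numactl package:
# Node layout, CPU-to-node mapping, and per-node memory
numactl --hardware
# numa_miss and numa_foreign growing over time suggest allocations landing on the wrong node
numastat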
5) If slowdowns align with maintenance, consider live migration
Live migration exists. It can introduce short-lived jitter. In a live migration, the guest continues running while memory pages are transferred to another host. Do not assume. Correlate timestamps.
6) If CPU frequency is changing, performance can change
CPU frequency scaling is real. Higher frequency generally means more instructions retired per unit time, with higher power draw. Governors and scaling algorithms can change behavior based on load. This is more common on your own hardware than on a managed VPS, but it is measurable and worth keeping in your mental model.
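On hardware you control, the governor and current frequency are visible through sysfs; these files may not be exposed inside a VM:
# Scaling governor and current frequency for CPU 0; repeat or glob for other CPUs
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq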
Decision checklist: tune, move, or go dedicated
This is the actionable part.
Step 1: prove the correlation
During a slowdown window, capture:
- mpstat to record %steal and %iowait
- sar -q to record run queue pressure
- your app p95 and p99 latency from APM or logs
- timestamps of deploys and background jobs
Commands:
mpstat -P ALL 1 60
sar -q 1 60
vmstat 1 60
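To capture all three over the same window, a minimal sketch that writes everything into one timestamped directory:
# One minute of mpstat, sar -q, and vmstat in parallel; adjust the duration to your slow window
out=steal-capture-$(date +%Y%m%dT%H%M%S)
mkdir -p "$out"
mpstat -P ALL 1 60 > "$out/mpstat.log" &
sar -q 1 60 > "$out/sar-q.log" &
vmstat 1 60 > "$out/vmstat.log" &
wait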
If latency spikes and steal spikes match, you have evidence of CPU scheduling contention upstream. If %iowait spikes instead, stop talking about dedicated CPU. You have an I/O problem.
Step 2: decide if this is a “move within VPS land” fix
Before you migrate away, ask your provider for the clean mitigations:
- Can you move me to a different host node?
- Do you offer dedicated vCPU or lower oversubscription plans?
- Can you confirm whether CPU overcommit is used on this host class?
If the provider cannot answer, that is already an answer.
Step 3: use a sourced rule of thumb to know when to escalate
There is no single global threshold that applies to every workload. But you do not need perfection to take action.
A widely cited rule of thumb is:
- If steal time is greater than 10% for 20 minutes, the VM is likely running slower than it should.
Treat that as an escalation trigger, not a law of physics. If your product is latency-sensitive, you will often care long before that. Not because the CPU is “maxed,” but because tail latency is.
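If you want that trigger to be mechanical rather than eyeballed, here is a hedged sketch using sar’s Average line. Column positions can shift between sysstat versions, so check the header on your system before relying on field 7:
# 1200 one-second samples (20 minutes); in default sar -u output, %steal is the 7th field
sar -u 1 1200 | awk '/^Average:/ && $2 == "all" { if ($7 + 0 > 10) print "average steal " $7 "% over the window - escalate" }'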
Step 4: know what changes when you go dedicated
Dedicated hardware removes one major upstream variable:
- shared-host CPU scheduling between unrelated tenants.
If steal time was your problem, you should see it drop close to zero under the same workload patterns. Be precise though. Steal time is a virtual CPU concept. If you install your own hypervisor on dedicated hardware and overcommit it, you can recreate steal time inside your own guest VMs. Dedicated removes the noisy neighbor. It does not remove bad capacity planning.
What success looks like after moving
Do not measure success by vibes. Measure it like an engineering team.
If steal time was the root cause, success looks like:
- %steal stays near zero under the same load patterns
- p95 and p99 latency stop drifting under normal workload
- build times become predictable, with less variance
- incident timelines stop including “could not reproduce” and “went away on its own”
If steal stays low but latency is still unstable, you now have evidence that the bottleneck is elsewhere. Often I/O, throttling, or application-level contention.
That is still a win. It narrows the problem.
Removing contention without surprises
If you have correlated sustained steal time with real latency impact, you are not dealing with a mystery. You are dealing with shared-host CPU contention.
At that point, moving to dedicated hardware is often the fastest way to remove the variable and get a clean baseline again.
Worldstream offers:
- Instant Delivery dedicated servers, stated as live within 2 hours
- Custom dedicated servers, stated as live within 24 hours
- 24/7/365 support with an average response time stated as 7 minutes
- Infrastructure hosted in self-owned Dutch data centers
- Fixed monthly pricing stated on dedicated server pages
If you want fewer surprises, start by removing the upstream variable you cannot control.
FAQ
Why can a VM feel slow while its own process CPU usage looks normal?
Because the guest’s view of CPU does not include host scheduling decisions. Steal time exists because your VM can be ready to run while the hypervisor schedules other work.
How do I confirm that steal time is the cause?
Corroborate with:
- %steal via mpstat
- runnable pressure via sar -q
- p95 and p99 latency from your application
