listing for S
sched_stat - Displays CPU usage and process-scheduling statistics for SMP
and NUMA platforms
/usr/sbin/sched_stat [-l] [-s] [-f] [-u] [-R] [command [cmd_arg]...]
-f Prints the count of calls that are not multiprocessor safe and
therefore funneled to the master CPU. For example:
unix master calls 11174 resulting blocks 2876
The impact of funneled calls on the master CPU needs to be taken into
account when evaluating statistics for the master CPU.
-l Prints scheduler load-balancing statistics. For example:
Scheduler Load Balancing
| 5-second averages
steal idle desired | current interrupt RT
cpu trys steals steals load | load % %
0 | 288 3 20609 0.000 0.000 0.454 0.156
1 | 615 6 21359 0.000 0.000 0.002 0.203
2 | 996 4 20135 0.000 0.001 0.000 0.237
3 | 1302 4 16195 0.000 0.001 0.000 0.330
6 | 5 0 3029 0.000 0.000 0.000 0.034
In the displayed table, each row contains per-CPU information as follows:
cpu The number identifier of the CPU.
steal trys The number of attempts made to steal processes/threads from
other CPUs when the CPU was not idle.
steals The number of processes/threads actually stolen from other CPUs
when the CPU was not idle.
idle steals The number of processes/threads stolen from other CPUs when the
CPU was idle.
desired load The number of time slices that should be used on this CPU for
running timeshare threads. This information is calculated by
comparing the current load, interrupt %, and RT % statistics
obtained for this CPU with those obtained for other CPUs in the
same PAG. When current load is less than desired load, the
scheduler attempts to migrate timeshare threads to this CPU in
order to better balance the timeshare workload among CPUs in the
same PAG. See DESCRIPTION for information about PAGs.
current load Over the last five seconds, the average number of time slices
used to run timeshare threads on this CPU.
interrupt % Over the last five seconds, the average percentage of time
slices that this CPU spent in interrupt context.
RT % Over the last five seconds, the average percentage of time
slices that this CPU used to run threads according to FIFO or
round-robin scheduling policy.
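As a rough illustration of how these columns relate to one another, the following Python sketch hard-codes three rows from the example above (it does not parse sched_stat output, whose format may change between releases). It derives the fraction of steal attempts that succeeded and applies the rule that a CPU becomes a migration target when its current load is less than its desired load. The names LoadRow, steal_success_ratio, and is_migration_target are illustrative only and are not part of sched_stat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadRow:
    """One row of the -l table; field names are illustrative."""
    cpu: int
    steal_trys: int
    steals: int
    idle_steals: int
    desired_load: float
    current_load: float
    interrupt_pct: float
    rt_pct: float


# Values copied by hand from the example table above (CPUs 0, 2, and 6).
ROWS = [
    LoadRow(0, 288, 3, 20609, 0.000, 0.000, 0.454, 0.156),
    LoadRow(2, 996, 4, 20135, 0.000, 0.001, 0.000, 0.237),
    LoadRow(6, 5, 0, 3029, 0.000, 0.000, 0.000, 0.034),
]


def steal_success_ratio(row: LoadRow) -> float:
    """Fraction of non-idle steal attempts that actually stole a thread."""
    return row.steals / row.steal_trys if row.steal_trys else 0.0


def is_migration_target(row: LoadRow) -> bool:
    """True when current load is below desired load for this CPU."""
    return row.current_load < row.desired_load


if __name__ == "__main__":
    for row in ROWS:
        print(row.cpu, round(steal_success_ratio(row), 4), is_migration_target(row))
```

In the example data, every CPU has a current load at or above its desired load, so none of the sampled CPUs would be flagged as a migration target.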
-R Prints information about CPU locality in two tables:
Radtab Shows the order-of-preference (in terms of memory affinity)
that exists between a CPU and different RADs. Order-of-
preference indicates, for a given home RAD, the ranking of
other RADs in terms of increasing physical distance from that
home RAD. If a process or thread needs more memory or needs to
be scheduled on a RAD other than its home RAD, the kernel
automatically searches RADs for additional memory or CPU cycles
in the order of preference shown in this table.
Hoptab Shows the distance (number of hops) between different RADs and,
by association, between CPUs. The information in this table is
coarser-grained than in the preceding Radtab table and more
relevant to NUMA programming choices. For example, the
expression RAD_DIST_LOCAL + 2 indicates RADs that are no more
than two hops from a thread's home RAD.
For example (a small, switchless mesh NUMA system):
Radtab (rads in order of preference)
Preference 0 1 2 3
0 0 1 2 3
1 1 0 3 2
2 2 3 0 1
3 3 2 1 0
Hoptab (hops indexed by rad)
To rad # 0 1 2 3
0 0 1 1 2
1 1 0 2 1
2 1 2 0 1
3 2 1 1 0
In these tables, the CPU identifiers are listed across the top from
left to right and the RAD identifiers are listed on the left from top
to bottom. For example, if a process running on CPU 2 needs additional
memory, Radtab indicates that the kernel will search for that memory
first in RAD 2, then in RAD 3, then in RAD 0, and last in RAD 1.
Hoptab shows the basis of this preference in that RAD 2 is CPU 2's
local RAD, RADs 0 and 3 are one hop away, and RAD 1 is two hops away.
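The way the two tables are read can be sketched in Python. The sketch below copies the example tables by hand, reads a column top to bottom to obtain the search order, and looks up the hop count for each RAD in that order. The function names search_order, hops_from, and rads_within are illustrative assumptions; rads_within only mimics the idea of a hop limit such as RAD_DIST_LOCAL + 2 and is not a NUMA API call.

```python
# Tables copied from the example above, indexed as TABLE[row][column].
RADTAB = [
    [0, 1, 2, 3],
    [1, 0, 3, 2],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
]
HOPTAB = [
    [0, 1, 1, 2],
    [1, 0, 2, 1],
    [1, 2, 0, 1],
    [2, 1, 1, 0],
]


def search_order(column):
    """Read one Radtab column top to bottom: RADs in order of preference."""
    return [row[column] for row in RADTAB]


def hops_from(column, rad):
    """Hop count to the given RAD, read from the same column of Hoptab."""
    return HOPTAB[rad][column]


def rads_within(column, max_hops):
    """RADs no more than max_hops away, kept in Radtab preference order."""
    return [rad for rad in search_order(column) if hops_from(column, rad) <= max_hops]


if __name__ == "__main__":
    order = search_order(2)
    print(order)                               # search order for column 2
    print([hops_from(2, rad) for rad in order])  # hop count for each RAD in that order
    print(rads_within(2, 1))                   # RADs at most one hop away
```

For column 2, the search order is RAD 2, 3, 0, 1 and the corresponding hop counts are 0, 1, 1, 2, which matches the description above: the preference order never moves to a nearer RAD after a farther one.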
The -R option is useful only on NUMA platforms, such as GS1280 and ES80
AlphaServer systems, in which memory latency varies from one RAD
to another. The information in these tables is less useful for GS80,
GS160, and GS320 AlphaServer systems because both coarse and
finer-grained memory affinity is the same from any CPU in one RAD to
any CPU in another RAD; however, the displays can tell you which CPUs
are in which RADs.
Make sure that you both maximize the size of your terminal emulator
window and minimize the font size before using the -R option;
otherwise, line-wrapping will render the tables very difficult to read
on systems that have many CPUs.
-s Prints scheduling-dispatch (processor-usage) statistics for each CPU.
For example:
Scheduler Dispatch Statistics
cpu 0 local global idle remote | total percent
hot 60827 12868 19158991 0 | 19232686 91.6
warm 78 21 1542019 0 | 1542118 7.3
cold 315 27289 184784 7855 | 220243 1.0
total 61220 40178 20885794 7855 | 20995047
percent 0.3 0.2 99.5 0.0
cpu 1 local global idle remote | total percent
hot 33760 11788 16412544 0 | 16458092 89.5
warm 66 24 1707014 0 | 1707104 9.3
cold 201 26191 203513 0 | 229905 1.2
These statistics show the count and percentage of thread context
switches (times that the kernel switches to a new thread) for the
following scheduling categories:
local Threads scheduled from the CPU's Local Run Queue
global Threads scheduled from the Global Run Queue of the PAG to which
the CPU belongs
idle Threads scheduled from the Idle CPU Queue of the PAG to which
the CPU belongs
remote Threads stolen from Global or Local Run Queues in another PAG
Note that these statistics do not count CPU time slices that were used
to re-run the same thread.
Each SMP unit (or RAD on a NUMA system) has a Processor Affinity Group
(PAG). Each PAG contains the following queues:
· A Global Run Queue from which processes or threads are scheduled
on the first available CPU
· One or more Local Run Queues from which processes or threads are
scheduled on a specific CPU
· A queue that contains idle CPUs
A thread that is handed to an idle CPU goes directly to that CPU
without first being placed on the other queues.
If there is insufficient work queued locally to keep the PAG's CPUs
busy, threads are stolen first from the Global and then the Local Run
Queues in a remote PAG.
For each of these categories, statistics are grouped into hot, warm,
and cold subcategories. The hot statistics show context switches to
threads that last ran on the CPU only a very short time before. The
warm statistics show context switches to threads that last ran on the
CPU a somewhat longer time before. The cold statistics indicate context
switches to threads that never ran on the CPU before. These statistics
are a measure of how well cache affinity is being maintained; that is,
how likely the data used by threads when they last ran is still in the
cache when the threads are rescheduled. You cannot evaluate this
information without knowledge of the type of work being done on the
system; maintenance of cache affinity can be very important on systems
(or processor sets) that are dedicated to running certain applications
(such as those doing high performance technical computing) but is less
critical for systems serving a variety of applications and users.
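The total and percent figures in the dispatch table can be reproduced from the raw counts, which is a useful check when reading the hot, warm, and cold breakdown. The following Python sketch uses the cpu 0 counts from the example above, entered by hand; the names CPU0 and summarize are illustrative. Rounding to one decimal place is an assumption that happens to match the displayed values.

```python
COLUMNS = ("local", "global", "idle", "remote")

# Context-switch counts for cpu 0, copied from the example table above.
CPU0 = {
    "hot": (60827, 12868, 19158991, 0),
    "warm": (78, 21, 1542019, 0),
    "cold": (315, 27289, 184784, 7855),
}


def summarize(counts):
    """Return row totals, column totals, grand total, and percentages."""
    row_totals = {name: sum(values) for name, values in counts.items()}
    col_totals = tuple(sum(col) for col in zip(*counts.values()))
    grand = sum(row_totals.values())
    if grand == 0:
        zero_rows = {name: 0.0 for name in counts}
        return row_totals, col_totals, grand, zero_rows, tuple(0.0 for _ in col_totals)
    row_pct = {name: round(100.0 * t / grand, 1) for name, t in row_totals.items()}
    col_pct = tuple(round(100.0 * t / grand, 1) for t in col_totals)
    return row_totals, col_totals, grand, row_pct, col_pct


if __name__ == "__main__":
    row_totals, col_totals, grand, row_pct, col_pct = summarize(CPU0)
    print(row_totals, grand)
    print(row_pct)
    print(dict(zip(COLUMNS, col_pct)))
```

For cpu 0, about 91.6 percent of context switches were hot and 99.5 percent were dispatched from the idle queue, which is what a mostly idle system with good cache affinity would be expected to show.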
-u Prints processor-usage statistics for each CPU. For example:
cpu user nice system idle widle | scalls intr csw tbsyc
0 | 0.0 0.0 0.7 99.2 0.1 | 3327337 50861486 41885424 317108
1 | 0.0 0.0 0.4 99.5 0.1 | 3514438 0 36710149 268667
2 | 0.0 0.0 0.4 99.5 0.1 | 3182064 0 37384120 257749
3 | 0.0 0.0 0.4 99.5 0.1 | 3528519 0 36468319 249492
6 | 0.0 0.0 0.1 99.9 0.0 | 668892 11664 11793053 352294
7 | 0.0 0.0 0.1 99.9 0.0 | 772821 0 9341527 352319
8 | 0.0 0.0 0.0 100.0 0.0 | 529050 11724 5717059 347267
9 | 0.0 0.0 0.0 100.0 0.0 | 492386 0 6603681 351509
In this table:
cpu The number identifier of the CPU.
user The percentage of time slices spent running threads in user
context.
nice The percentage of time slices in which lower-priority threads
were scheduled. These are user-context threads whose priority
was explicitly lowered by using an interface such as the nice
command or the class-scheduling software.
system The percentage of time slices spent running threads in system
context. This work includes servicing of interrupts and system
calls that are made on behalf of user processes. An unusually
high percentage in the system category might indicate a system
bottleneck. Running kprofile and lockinfo provides more
specific information about where system time is being spent.
See uprofile(1) and lockinfo(8), respectively, for information
about these utilities.
idle The percentage of time slices in which no threads were
scheduled.
widle The percentage of time slices in which available threads were
blocked by pending I/O and the CPU was idle. If this count is
unusually high, it suggests that a bottleneck in an I/O channel
might be causing suboptimal performance.
scalls The count of system calls that were serviced.
intr The count of interrupts that were serviced.
csw The count of thread context switches (thread scheduling
changes) that completed.
tbsyc The number of times that the translation buffer was
synchronized.
command The command to be executed by sched_stat.
cmd_arg Any arguments to the preceding command.
The command and cmd_arg operands are used to limit the length of time in
which sched_stat gathers statistics. Typically, sleep is specified for
command and some number of seconds is specified for cmd_arg.
If you do not specify a command to set a time interval for statistics
gathering, the statistics reflect what has occurred since the system
was last booted.
The sched_stat utility helps you determine how well the system load is
distributed among CPUs, what kinds of jobs are getting (or not getting)
sufficient cycles on each CPU, and how well cache affinity is being
maintained for these jobs.
Answers to the following questions influence how a process and its threads
are scheduled:
· Is the request to be serviced multiprocessor-safe?
If not, the kernel funnels the request to the master CPU. The master
CPU must reside in the default processor set (which contains all
system CPUs if none were assigned to user-defined processor sets) and
is typically CPU 0; however, some platforms permit CPUs other than CPU
0 to be the master CPU. Few requests generated by software
distributed with the operating system need to be funneled to the
master CPU and most of these are associated with certain device
drivers. However, if the system runs many third-party drivers, the
number of requests that must be funneled to the master CPU might be
higher.
· What is the job priority?
Job priority influences how frequently a thread is scheduled. Realtime
requests and interrupts have higher priority than time-share jobs,
which include the majority of user-mode threads. So, if a significant
number of CPU cycles are spent servicing realtime requests and
interrupts, there are fewer cycles available for time-share jobs.
Default priority for time-share jobs can also be changed by using the
nice command, the runclass command, or through class-scheduling
software. On a busy system, cache affinity is less likely to be
maintained for a thread from a time-share job whose priority was
lowered because more time is likely to elapse between rescheduling
operations for each thread. Conversely, cache affinity is more likely
to be maintained for threads of a higher-priority time-share job
because less time elapses between rescheduling operations. Note that
the scheduler always prioritizes the need for low response latency (as
demanded by interrupts and real-time requests) higher than maintenance
of cache affinity, regardless of the priority assigned to a time-share
job.
· Are there user-defined restrictions that limit where a process may
run?
If so, the kernel must schedule all threads of that process on CPUs in
the restricted set. In some cases, user-defined restrictions are
explicit RAD or CPU bindings specified either in an application or by
a command (such as runon) that was used to launch the program or
reassign one of its threads.
The set of CPUs where the kernel can schedule a thread is also
influenced by the presence of user-defined processor sets. If the
process was not explicitly started in or reassigned to a user-defined
processor set, the kernel must run it and all of its threads only on
CPUs in the default processor set.
· Are any CPUs idle?
The scheduler is very aggressive in its attempts to steal jobs from
other CPUs to run on an idle CPU. This means that the scheduler will
migrate processes or threads across RAD boundaries to give an idle CPU
work to do unless one of the preceding restrictions is in place to
prevent that. For example, the scheduler does not cross processor set
boundaries when stealing work from another CPU, even when a CPU is
idle. In general, keeping CPUs busy with work has higher priority than
maintaining memory or cache affinity during load-balancing operations.
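The interaction between processor sets, explicit bindings, and idle-CPU stealing described in the list above can be pictured with a toy model. The Python sketch below is not the kernel's algorithm; it only encodes two statements from the text: a thread may run only on CPUs in its processor set, narrowed further by any explicit binding, and an idle CPU never steals work across a processor set boundary. The function names and the example CPU numbers are hypothetical.

```python
def eligible_cpus(pset_cpus, bound_cpus=None):
    """CPUs where a thread may run: its processor set, narrowed by any binding."""
    pset = set(pset_cpus)
    if bound_cpus is None:
        return pset
    return pset & set(bound_cpus)


def can_steal_for(idle_cpu, thread_pset_cpus, thread_bound_cpus=None):
    """Toy rule: an idle CPU may take a thread only if the thread is allowed to run there."""
    return idle_cpu in eligible_cpus(thread_pset_cpus, thread_bound_cpus)


if __name__ == "__main__":
    default_pset = {0, 1, 2, 3}  # hypothetical default processor set
    # A user-defined processor set on this hypothetical system holds CPUs 6 and 7.

    # Unbound thread in the default set: any idle CPU in that set may take it.
    print(can_steal_for(3, default_pset))
    # An idle CPU in the user-defined set may not steal across the set boundary.
    print(can_steal_for(6, default_pset))
    # A thread bound to CPU 1 (for example, with runon) stays on CPU 1.
    print(can_steal_for(3, default_pset, {1}))
```

The model deliberately ignores priority and cache affinity, which, as noted above, rank below keeping CPUs busy during load balancing.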
Explicit memory-allocation advice provided in application code influences
scheduling only to the extent that the preceding factors do not override
that advice. However, explicit memory-allocation advice does make a
difference (and thereby can improve performance) when CPUs in the processor
set where the program is running are kept busy but are not overloaded.
To gather statistics with sched_stat, you typically follow these steps:
1. Start up a system workload and wait for it to get to a steady state.
2. Start sched_stat with sleep as the specified command and some number
of seconds as the specified cmd_arg. This causes sched_stat to gather
statistics for the length of time it takes the sleep command to
complete.
For example, the following command causes sched_stat to collect statistics
for 60 seconds and then print a report:
# /usr/sbin/sched_stat sleep 60
If you include options on the command line, only statistics for the
specified options are reported.
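When the same measurement is repeated interactively with different options, it can help to assemble the command line in one place. The following Python sketch builds an argument list of the form shown in the SYNOPSIS, using only the options listed there and sleep as the timing command. The helper name build_command is an assumption for illustration; the sketch does not run sched_stat and does not interpret its output.

```python
SCHED_STAT = "/usr/sbin/sched_stat"
KNOWN_OPTIONS = ("-l", "-s", "-f", "-u", "-R")  # options listed in the SYNOPSIS


def build_command(seconds, options=()):
    """Return an argv list: sched_stat [options] sleep <seconds>."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    for option in options:
        if option not in KNOWN_OPTIONS:
            raise ValueError("unknown option: " + option)
    return [SCHED_STAT, *options, "sleep", str(seconds)]


if __name__ == "__main__":
    # Equivalent to the 60-second example above, with no options selected.
    print(" ".join(build_command(60)))
    # Report only load-balancing and processor-usage statistics.
    print(" ".join(build_command(30, ["-l", "-u"])))
```

On a system where sched_stat is installed, the returned list could be passed to subprocess.run for an interactive session; the output is best read by a person rather than consumed by a program.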
If you specify the command without any options, all options except for -R
are assumed. (See the descriptions of the -f, -l, -s, and -u options in the
OPTIONS section.)
Running the sched_stat command has minimal impact on system performance.
The sched_stat utility is subject to change, without advance notice, from
one release to another. The utility is intended mainly for use by other
software applications included in the operating system product, kernel
developers, and software support representatives. Therefore, sched_stat
should be used only interactively; any customer scripts or programs written
to depend on its output data or display format might be broken by changes
in future versions of the utility or by patches that might be applied to
the system.
>0 An error occurred.
The pseudo driver that is opened by the sched_stat utility for RAD-
related statistics gathering.
Commands: iostat(1), netstat(1), nice(1), renice(1), runclass(1), runon(1),
uprofile(1), vmstat(1), advfsstat(8), collect(8), lockinfo(8), nfsstat(8),
Others: numa_intro(3), class_scheduling(4), processor_sets(4)