11 DTrace Providers
WARNING:
Oracle Linux 7 is now in Extended Support. See Oracle Linux Extended Support and Oracle Open Source Support Policies for more information.
Migrate applications and data to Oracle Linux 8 or Oracle Linux 9 as soon as possible.
For more information about DTrace, see Oracle Linux: DTrace Release Notes and Oracle Linux: Using DTrace for System Tracing.
This chapter describes some of the existing DTrace providers. Note
that the list of providers discussed in this chapter is not
exhaustive. To display the providers that are available on your
system, use the dtrace -l command. Detailed
information about translators for important data structures can be
found in
/usr/lib64/dtrace/version/*.d
files.
dtrace Provider
The dtrace
provider includes several probes
that are related to DTrace itself. You can use these probes to
initialize state before tracing begins, to process state after
tracing has completed, and to handle unexpected execution errors
in other probes.
BEGIN Probe
The BEGIN
probe fires before any other probe.
No other probe fires until all BEGIN
clauses
have completed. This probe can be used to initialize any state
that is needed in other probes. The following example shows how
to use the BEGIN
probe to initialize an
associative array to map between mmap()
protection bits and a textual representation:
BEGIN
{
        prot[0] = "---";
        prot[1] = "r--";
        prot[2] = "-w-";
        prot[3] = "rw-";
        prot[4] = "--x";
        prot[5] = "r-x";
        prot[6] = "-wx";
        prot[7] = "rwx";
}

syscall::mmap:entry
{
        printf("mmap with prot = %s", prot[arg2 & 0x7]);
}
The BEGIN
probe fires in an unspecified
context, which means the output of stack
or
ustack
, and the value of context-specific
variables such as execname
, are all
arbitrary. These values should not be relied upon or interpreted
to infer any meaningful information. No arguments are defined
for the BEGIN
probe.
END Probe
The END
probe fires after all other probes.
This probe will not fire until all other probe clauses have
completed. This probe can be used to process state that has been
gathered or to format the output. The printa
action is therefore often used in the END
probe. The BEGIN
and END
probes can be used together to measure the total time that is
spent tracing, for example:
BEGIN
{
        start = timestamp;
}

/*
 * ... other tracing actions...
 */

END
{
        printf("total time: %d secs", (timestamp - start) / 1000000000);
}
See Data Normalization and
printa Action for other common uses of the
END
probe.
As with the BEGIN
probe, no arguments are
defined for the END
probe. The context in
which the END
probe fires is arbitrary and
should not be depended upon.
When tracing with the bufpolicy
option set to
fill
, adequate space is reserved to
accommodate any records that are traced in the
END
probe. See
fill Policy and END Probes for details.
Note:
The exit
action causes tracing to stop and
the END
probe to fire. However, there is
some delay between the invocation of the
exit
action and when the
END
probe fires. During this delay, no
probes will fire. After a probe invokes the
exit
action, the END
probe is not fired until the DTrace consumer determines that
exit
has been called and stops tracing. The
rate at which the exit status is checked can be set by using the
statusrate option. For more information,
see Options and Tunables.
ERROR Probe
The ERROR
probe fires when a runtime error
occurs during the execution of a clause for a DTrace probe. As
shown in the following example, if a clause attempts to
dereference a NULL pointer, the
ERROR
probe fires. Save the following D source code in a file named
error.d
:
BEGIN
{
        *(char *)NULL;
}

ERROR
{
        printf("Hit an error!");
}
When you run this program, output similar to the following is displayed:
# dtrace -s error.d
dtrace: script 'error.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1      3                           :ERROR Hit an error!
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid address
(0x0) in action #1 at DIF offset 16
^C
The previous output indicates that the ERROR
probe fired and that dtrace reported the
error. dtrace has its own enabling of the
ERROR
probe so that it can report errors.
Using the ERROR
probe, you can create your
own custom error handling.
The arguments to the ERROR
probe are
described in the following table.
Argument | Description |
---|---|
arg1 |
The enabled probe identifier (EPID) of the probe that caused the error. |
arg2 |
The index of the action that caused the fault. |
arg3 |
The DIF offset into the action or -1 if not applicable. |
arg4 |
The fault type. |
arg5 |
Value that is particular to the fault type. |
The following table describes the various fault types that can
be specified in arg4
and the values that
arg5
can take for each fault type.
arg4 Value | Description | arg5 Meaning |
---|---|---|
DTRACEFLT_UNKNOWN |
Unknown fault type |
None |
DTRACEFLT_BADADDR |
Access to unmapped or invalid address |
Address accessed |
DTRACEFLT_BADALIGN |
Unaligned memory access |
Address accessed |
DTRACEFLT_ILLOP |
Illegal or invalid operation |
None |
DTRACEFLT_DIVZERO |
Integer divide by zero |
None |
DTRACEFLT_NOSCRATCH |
Insufficient scratch memory to satisfy scratch allocation |
None |
DTRACEFLT_KPRIV |
Attempt to access a kernel address or property without sufficient privileges |
Address accessed or 0 if not applicable |
DTRACEFLT_UPRIV |
Attempt to access a user address or property without sufficient privileges |
Address accessed or 0 if not applicable |
DTRACEFLT_TUPOFLOW |
DTrace internal parameter stack overflow |
None |
DTRACEFLT_BADSTACK |
Invalid user process stack |
Address of invalid stack pointer |
If the actions that are taken in the ERROR
probe cause an error, that error is silently dropped. The
ERROR
probe is not recursively invoked.
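For example, a custom error handler can use these arguments to report faults more precisely. The following sketch assumes that the DTRACEFLT_BADADDR constant is visible to the D compiler; if it is not, compare arg4 against the corresponding numeric fault code instead:

```
ERROR
/arg4 == DTRACEFLT_BADADDR/
{
        /* arg1 is the EPID, arg2 the action index, arg5 the faulting address */
        printf("bad address fault: EPID %d, action #%d, address 0x%x\n",
            arg1, arg2, arg5);
}
```

A handler like this can be combined with clauses for the other fault types to log each class of runtime error differently.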
dtrace Stability
The dtrace
provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider |
Stable |
Stable |
Common |
Module |
Private |
Private |
Unknown |
Function |
Private |
Private |
Unknown |
Name |
Stable |
Stable |
Common |
Arguments |
Stable |
Stable |
Common |
For more information about the stability mechanism, see DTrace Stability Features.
profile Provider
The profile
provider includes probes that are
associated with an interrupt that fires at some regular, specified
time interval. Such probes are not associated with any particular
point of execution, but rather with the asynchronous interrupt
event. You can use these probes to sample some aspect of the
system state and then use the samples to infer system behavior. If
the sampling rate is high or the sampling time is long, an
accurate inference is possible. Using DTrace actions, you can use
the profile
provider to sample practically any
aspect of the system. For example, you could sample the state of
the current thread, the state of the CPU, or the current machine
instruction.
profile-n Probes
The profile-n
probes fire at a fixed interval, at a high interrupt level on
all active CPUs. The units of n
default to a frequency that is expressed as a rate of firing per
second, but the value can also have an optional suffix, as
shown in Table 11-1, which specifies
either a time interval or a frequency. The following table
describes the valid time suffixes for a
profile-
n probe.
Table 11-1 Valid Time Suffixes
Suffix | Time Units |
---|---|
nsec or ns |
nanoseconds |
usec or us |
microseconds |
msec or ms |
milliseconds |
sec or s |
seconds |
min or m |
minutes |
hour or h |
hours |
day or d |
days |
hz |
hertz (frequency expressed as rate per second) |
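As an illustrative sketch, the following clauses sample the kernel stack 997 times per second on every active CPU and print the most frequently observed stacks when tracing ends. A prime rate such as 997 Hz is chosen here rather than 1000 Hz to reduce the chance of sampling in lockstep with other periodic system activity:

```
profile-997
{
        /* aggregate on the kernel stack observed at each sample */
        @stacks[stack()] = count();
}

tick-30sec
{
        /* stop after 30 seconds; the aggregation is printed automatically */
        exit(0);
}
```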
tick-n Probes
The tick-n
probes
fire at a fixed interval, at a high interrupt level. Unlike
profile-n
probes,
which fire on every active CPU,
tick-n
probes fire
on only one CPU per interval, and the CPU on which they fire can
change over time. The units of n
default to a frequency expressed as a rate of firing per second,
but the value can also have an optional time suffix as shown in
Table 11-1, which specifies either a
time interval or a frequency.
The tick-n
probes
have several uses, such as providing some periodic output or
taking a periodic action.
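For example, the following sketch uses a tick-n probe to produce periodic output, printing and resetting a count of system calls every 5 seconds:

```
syscall:::entry
{
        @calls = count();
}

tick-5sec
{
        printa("system calls during the last 5 seconds: %@d\n", @calls);
        clear(@calls);
}
```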
Note:
By default, the highest supported tick frequency is 5000 Hz
(tick-5000
).
profile Probe Arguments
The following table describes the arguments for the
profile
probes.
Table 11-2 profile
Probe Arguments
Probe | arg0 | arg1 | arg2 |
---|---|---|---|
profile-n |
pc |
upc |
nsecs |
tick-n |
pc |
upc |
— |
The arguments are as follows:
- pc: kernel program counter
- upc: user-space program counter
- nsecs: elapsed number of nanoseconds
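Because exactly one of arg0 and arg1 is non-zero for a given sample, you can use them to separate kernel-mode samples from user-mode samples. The following sketch tallies sampled program counters for each mode, symbolizing the kernel addresses with the %a format:

```
profile-199
/arg0/
{
        /* sample interrupted kernel-mode execution */
        @kernel[arg0] = count();
}

profile-199
/arg1/
{
        /* sample interrupted user-mode execution */
        @user[arg1] = count();
}

END
{
        printa("kernel pc %a: %@d\n", @kernel);
        printa("user pc 0x%x: %@d\n", @user);
}
```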
profile Probe Creation
Unlike other providers, the profile
provider
creates probes dynamically on an as-needed basis. Thus, the
desired probe might not appear in a listing of all probes, for
example, when using the dtrace -l -P profile
command, but the probe is created when it is explicitly enabled.
A time interval that is too short would cause the machine to
continuously field time-based interrupts, denying service on
the machine. For this reason, the profile
provider silently
refuses to create a probe that would result in an interval of
less than two hundred microseconds.
profile Stability
The profile
provider uses DTrace's stability
mechanism to describe its stabilities. These stability values
are listed in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider |
Evolving |
Evolving |
Common |
Module |
Unstable |
Unstable |
Unknown |
Function |
Private |
Private |
Unknown |
Name |
Evolving |
Evolving |
Common |
Arguments |
Evolving |
Evolving |
Common |
For more information, see DTrace Stability Features.
fbt Provider
The fbt
(Function Boundary Tracing) provider
includes probes that are associated with the entry to and return
from most functions in the Oracle Linux kernel. Therefore, there could
well be tens of thousands of fbt
probes.
To confirm that the fbt
provider is available
on your processor's architecture, you should be able to load the
module that provides fbt
instrumentation and
successfully list several probes. Note that this process could
take several seconds due to the large number of such probes. For
example, consider the following command, which is executed as
root
:
# dtrace -l -P fbt | wc -l
dtrace: failed to match fbt:::: No probe matches description
1
# modprobe fbt
# dtrace -l -P fbt | wc -l
88958
In the previous example, the first dtrace
command automatically loads modules that are listed in
/etc/dtrace-modules
, but also confirms that
fbt
was not among them. After
fbt
is loaded manually, many
fbt
probes appear. For more information, see
Module Loading and fbt.
Like other DTrace providers, FBT has no probe effect when it is not explicitly enabled, and when enabled it induces a probe effect only in probed functions. The FBT implementation is highly specific to the instruction set architecture; FBT has been implemented on both x86 and 64-bit Arm platforms. For each instruction set, there are a small number of functions, such as leaf functions that do not call other functions and functions that are highly optimized by the compiler, that cannot be instrumented by FBT. Probes for these functions are not present in DTrace.
An effective use of FBT probes requires knowledge of the operating
system implementation. It is therefore recommended that you use
FBT only when developing kernel software or when other providers
are not sufficient. You can use other DTrace providers such as
syscall
, sched
,
proc
, and io
to answer most
system analysis questions without requiring operating system
implementation knowledge.
fbt Probes
FBT provides a probe at the entry and return of most functions
in the kernel, named entry
and
return
, respectively. All FBT probes have a
function name and module name.
fbt Probe Arguments
The arguments to entry
probes are the same as
the arguments to the corresponding operating system kernel
function. These arguments can be accessed as
int64_t
values by using the
arg0
, arg1
,
arg2
, ... variables.
If the function has a return value, the return value is stored
in arg1
of the return
probe. If a function does not have a return value,
arg1
is not defined.
While a given function only has a single point of entry, it
might have many different points where it returns to its caller.
FBT collects a function's multiple return sites into a single
return
probe. If you want to know the exact
return path, you can examine the return
probe
arg0
value, which indicates the offset in
bytes of the returning instruction in the function text.
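For example, the following sketch counts how often each return site in a function is taken. Here vfs_read() is used purely as a hypothetical target; substitute any instrumentable kernel function:

```
fbt::vfs_read:return
{
        /* arg0 is the offset of the returning instruction in the function text */
        @sites[arg0] = count();
}
```

Offsets with non-zero counts identify which of the function's return paths are exercised under the current workload.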
fbt Examples
You can easily use the fbt
provider to
explore the kernel's implementation. The following example
script records the first gettimeofday
call
from any clock
process and then follows the
subsequent code path through the kernel. Type the following D
source code and save it in a file named
xgettimeofday.d
:
/*
 * To make the output more readable, indent every function entry
 * and unindent every function return. This is done by setting the
 * "flowindent" option.
 */
#pragma D option flowindent

syscall::gettimeofday:entry
/execname == "clock" && guard++ == 0/
{
        self->traceme = 1;
        printf("start");
}

fbt:::
/self->traceme/
{}

syscall::gettimeofday:return
/self->traceme/
{
        self->traceme = 0;
        exit(0);
}
Running this script results in output that is similar to the following:
# dtrace -s ./xgettimeofday.d
dtrace: script './xgettimeofday.d' matched 92115 probes
CPU FUNCTION
  0  => gettimeofday                    start
  0    -> SyS_gettimeofday
  0      -> getnstimeofday64
  0        -> __getnstimeofday64
  0        <- __getnstimeofday64
  0      <- getnstimeofday64
  0      -> _copy_to_user
  0      <- _copy_to_user
  0    <- SyS_gettimeofday
  0  <= gettimeofday
The previous output shows the internal kernel functions that are
called when the gettimeofday
system call is
made.
Module Loading and fbt
While the Oracle Linux kernel can dynamically load and unload kernel
modules, for fbt
probes, the
fbt
kernel module must be loaded to support
the instrumentation. For more information about loading kernel
modules, see the note in Getting Started With DTrace. If
fbt
is not listed in
/etc/dtrace-modules
, or if the
dtrace -l command lists no
fbt
probes, use the following command:
# modprobe fbt
Conversely, you can unload the fbt
instrumentation with the following command:
# modprobe -r fbt
When the fbt
module is loaded, FBT
automatically provides probes to instrument all other loaded
modules, including any new modules that are dynamically loaded.
If a loaded module has no enabled FBT probes, the module might
be unloaded and the corresponding probes are destroyed as the
module is unloaded. If a loaded module has enabled FBT probes,
the module is considered busy and cannot be unloaded.
fbt Stability
The fbt
provider uses DTrace's stability
mechanism to describe its stabilities. These stability values
are listed in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider |
Evolving |
Evolving |
Common |
Module |
Private |
Private |
Unknown |
Function |
Private |
Private |
ISA |
Name |
Evolving |
Evolving |
Common |
Arguments |
Private |
Private |
ISA |
For more information, see DTrace Stability Features.
syscall Provider
The syscall
provider makes available a probe at
the entry to and return from every system call in the system.
Because system calls are the primary interface between user-level
applications and the operating system kernel, the
syscall
provider can offer tremendous insight
into application behavior with respect to the system.
syscall Probes
syscall
provides a pair of probes for each
system call: an entry
probe that fires before
the system call is entered, and a return
probe that fires after the system call has completed, but before
control has been transferred back to user-level. For all
syscall
probes, the function name is set as
the name of the instrumented system call.
Often, the system call names that are provided by
syscall
correspond to names in the Section 2
manual pages. However, some syscall
provider
probes do not directly correspond to any documented system call.
Some common reasons for this discrepancy are described in the
following sections.
System Call Anachronisms
In some cases, the name of the system call, as provided by the
syscall
provider, might be a reflection of
an ancient implementation detail.
Subcoded System Calls
Some system calls might be implemented as sub-operations of
another system call. For example,
socketcall()
is the common kernel entry
point for the socket system calls.
New System Calls
Oracle Linux implements at
-suffixed system
interfaces as individual system calls, for example:
- faccessat()
- fchmodat()
- fchownat()
- fstatat64()
- futimensat()
- linkat()
- mkdirat()
- mknodat()
- name_to_handle_at()
- newfstatat()
- open_by_handle_at()
- openat()
- readlinkat()
- renameat()
- symlinkat()
- unlinkat()
- utimensat()
These system calls implement a superset of the functionality
of their old, non-at
-suffixed counterparts.
They take an additional first argument that is either an open
directory file descriptor, in which case an operation on a
relative pathname is performed relative to the specified
directory, or the reserved value
AT_FDCWD
, in which case the operation takes
place relative to the current working directory.
Replaced System Calls
In Oracle Linux, the following older system calls have been replaced
and are not called by the newer glibc
interfaces. These legacy interfaces remain, but are
reimplemented, not as system calls in their own right, but as
calls to the newer system calls. The following table lists the
legacy call and its new call equivalent.
Legacy System Call | New System Call |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Large File System Calls
A 32-bit program that supports large
files that exceed two gigabytes in size must be
able to process 64-bit file offsets. Because large files
require the use of large offsets, large files are manipulated
through a parallel set of system interfaces. The following
table lists some of the syscall
probes for
the large file system call interfaces.
Table 11-3 syscall Large File Probes
Large File syscall Probe | System Call |
---|---|
|
|
|
|
|
|
syscall Probe Arguments
For entry
probes, the arguments,
arg0
...
argn
, are
arguments to the system call. For return probes, both
arg0
and arg1
contain the
return value. A non-zero value in the D variable
errno
indicates a system call failure.
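For example, the following sketch uses errno in a return probe predicate to count failed open() system calls by process name and error number. On systems where open() is serviced through openat(), adjust the probe name accordingly:

```
syscall::open:return
/errno != 0/
{
        @fails[execname, errno] = count();
}

END
{
        printa("%-16s errno=%d: %@d\n", @fails);
}
```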
syscall Stability
The syscall
provider uses DTrace's stability
mechanism to describe its stabilities. These stability values
are listed in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider |
Evolving |
Evolving |
Common |
Module |
Private |
Private |
Unknown |
Function |
Private |
Private |
Instruction set architecture (ISA) |
Name |
Evolving |
Evolving |
Common |
Arguments |
Private |
Private |
ISA |
For more information about the stability mechanism, see DTrace Stability Features.
sdt Provider
The Statically Defined Tracing (SDT) provider
(sdt
) creates probes at sites that a software
programmer has formally designated. The SDT mechanism enables
programmers to consciously choose locations of interest to users
of DTrace and to convey some semantic knowledge about each
location through the probe name.
Importantly, SDT can act as a metaprovider by registering probes
so that they appear to come from other providers, such as
io
, proc
, and
sched
, which do not have dedicated modules of
their own. Thus, the SDT provider is chiefly of interest only to
developers of new providers. Most users will access SDT only
indirectly by using other providers.
Note:
Because the sdt
probes that are defined for
the Oracle Linux kernel are likely to change over time, they are not
listed here. Both the name stability and the data stability of
the probes are Private, which reflects the kernel's
implementation and should not be interpreted as a commitment to
preserve these interfaces. For more information, see
DTrace Stability Features.
Creating sdt Probes
If you are a device driver developer, you might be interested in
creating your own sdt
probes for your Oracle Linux
driver. The disabled probe effect of SDT is essentially the cost
of several no-operation machine instructions. You are therefore
encouraged to add sdt
probes to your device
drivers as needed. Unless these probes negatively affect
performance, you can leave them in your shipping code. See
Statically Defined Tracing of Kernel Modules.
DTrace also provides a mechanism for application developers to define user-space static probes. See Statically Defined Tracing of User Applications.
Declaring Probes
The sdt
probes are declared by using the
DTRACE_PROBE
macro from
<linux/sdt.h>
.
The module name and function name of an SDT-based probe correspond to the kernel module and function of the probe, respectively. DTrace includes the kernel module name and function name as part of the tuple identifying a probe, so you do not need to include this information in the probe name to prevent name space collisions. Use the dtrace -l -m module command to list the probes that your driver module has installed and the full names that are seen by DTrace users.
The name of the probe depends on the name that is provided in
the DTRACE_PROBE
macro. If the name does
not contain two consecutive underscores
(__
), the name of the probe is as written
in the macro. If the name contains two consecutive
underscores, they are converted to a single dash
(-
) in the probe name. For
example, if a DTRACE_PROBE
macro specifies
transaction__start
, the SDT probe is named
transaction-start
. This substitution
enables C code to provide macro names that are not valid C
identifiers without specifying a string.
SDT can also act as a metaprovider by registering probes so
that they appear to come from other providers, such as
io
, proc
, and
sched
, which do not have dedicated modules
of their own. For example, kernel/exit.c
contains calls to the DTRACE_PROC
macro,
which are defined as follows in
<linux/sdt.h>
:
# define DTRACE_PROC(name) \
        DTRACE_PROBE(__proc_##name);
Probes that use such macros appear to come from a provider
other than sdt
. The leading double
underscore, provider name, and trailing underscore in the
name
argument are used to match the
provider and are not included in the probe name. Note that the
functionality for creating probes for providers other than
those that are hard-coded into DTrace is not currently
available.
sdt Probe Arguments
The arguments for each sdt
probe are the
arguments that are specified in the kernel source code in the
corresponding DTRACE_PROBE
macro reference.
When declaring your sdt
probes, you can
minimize their disabled probe effect by not dereferencing
pointers and by not loading from global variables in the probe
arguments. Both pointer dereferencing and global variable
loading may be done safely in D actions that enable probes, so
DTrace users can request these actions only when they are
needed.
sdt Stability
The sdt
provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider |
Evolving |
Evolving |
ISA |
Module |
Private |
Private |
Unknown |
Function |
Private |
Private |
Unknown |
Name |
Private |
Private |
ISA |
Arguments |
Private |
Private |
ISA |
For more information about the stability mechanism, refer to DTrace Stability Features.
pid Provider
The pid
provider enables tracing of any user
process, as specified by its pid
.
The pid
provider enables tracing function entry
and return in user programs just like the fbt
provider provides that capability for the kernel. Note that most
of the examples in this guide that use the fbt
provider to trace kernel function calls can be modified slightly
to apply to user processes.
The pid
provider also enables tracing of any
instruction, as specified by an absolute address or function
offset.
The pid
provider has no probe effect when
probes are not enabled. When probes are enabled, the probes only
induce probe effect on those processes that are traced.
Note:
When the compiler inlines a function, the pid
provider's probe does not fire. Use one of the following methods
to compile a particular C function so that it will not be
inlined.
- Sun Studio: #pragma no_inline (funcname[, funcname])
- gcc: funcname __attribute__ ((noinline))
Consult your compiler documentation for updates.
Naming pid Probes
The pid
provider actually defines a class of
providers. Each process can potentially have its own associated
pid
provider. For example, a process with ID
123 would be traced by using the pid123
provider.
The module portion of the probe description refers to an object
loaded in the corresponding process's address space. To see
which objects will be loaded for my_exec
or
are loaded for process ID 123, use the following commands:
# ldd my_exec
...
# pldd 123
123: /tmp/my_exec
linux-vdso.so.1
/lib64/libc.so.6
/lib64/ld-linux-x86-64.so.2
In the probe description, you name the object by the name of the
file, not by its full path name. You can also omit the
.6
or .so.6
suffix. All of
the following examples name the same probe:
pid123:libc.so.6:strcpy:entry
pid123:libc.so:strcpy:entry
pid123:libc:strcpy:entry
The first example is the actual name of the probe. The other examples are convenient aliases that are replaced with the full load object name internally.
For the load object of the executable, you can use the
a.out
alias. The following two probe
descriptions name the same probe:
pid123:my_exec:main:return
pid123:a.out:main:return
The function field of the probe description names a function in
the module. A user application binary might have several names
for the same function. For example,
__gnu_get_libc_version
might be an alternate
name for the function gnu_get_libc_version
in
libc.so.6
. DTrace chooses one canonical name
for such a function and uses that name internally.
The following example illustrates how DTrace internally remaps module and function names to a canonical form:
# dtrace -q -n 'pid123:libc:__gnu_get_libc_version: { printf("%s\n%s\n", probemod, probefunc)}'
libc.so.6
gnu_get_libc_version
For examples of how to use the pid
provider
effectively, see User Process Tracing.
pid Probe Arguments
An entry
probe fires when the traced function
is invoked. The arguments to entry probes are the values of the
arguments to the traced function.
A return
probe fires when the traced function
returns or makes a tail call to another function. The
arg0
probe argument contains the offset of the
returning instruction in the function, and the
arg1
probe argument holds the function
return value.
An offset probe fires whenever execution
reaches the instruction at the specified offset in the function.
For example, to trace the instruction at the address 4 bytes
into function main
, you can use
pid123:a.out:main:4
. The arguments for offset
probes are undefined. The uregs[]
array will
help you when examining the process state at these probe sites.
See uregs[] Array.
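For example, the following sketch (with pid123, a.out, and main:4 as hypothetical values for the traced process, executable, and offset) prints the stack pointer each time the instruction 4 bytes into main is reached:

```
pid123:a.out:main:4
{
        /* offset probes have undefined args; inspect state through uregs[] */
        printf("sp = 0x%x\n", uregs[R_SP]);
}
```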
pid Stability
The pid
provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider |
Evolving |
Evolving |
ISA |
Module |
Private |
Private |
Unknown |
Function |
Private |
Private |
Unknown |
Name |
Evolving |
Evolving |
ISA |
Arguments |
Private |
Private |
Unknown |
For more information about the stability mechanism, see DTrace Stability Features.
proc Provider
The proc
provider makes available the probes
that pertain to the following activities: process creation and
termination, LWP creation and termination, execution of new
program images, and signal sending and handling.
proc Probes
The probes for the proc
provider are listed
in the following table.
Table 11-4 proc Probes
Probe | Description |
---|---|
create |
Fires when a process (or process thread) is created
using fork() or vfork(), both of which invoke clone(). The psinfo_t corresponding to the new child process is pointed to by args[0]. |
exec |
Fires whenever a process loads a new process image
using a variant of the exec() system call. The path of the new process image is pointed to by args[0]. |
exec-failure |
Fires when an exec() variant has failed. The errno value is contained in args[0]. |
exec-success |
Fires when an exec() variant has succeeded. |
exit |
Fires when the current process is exiting. The reason
for exit, which is expressed as one of the
SIGCHLD codes from <siginfo.h>, is contained in args[0]. |
lwp-create |
Fires when a process thread is created, the latter
typically as a result of
pthread_create(). The lwpsinfo_t corresponding to the new thread is pointed to by args[0], and the psinfo_t of the process that created it is pointed to by args[1]. |
lwp-exit |
Fires when a process or process thread is exiting, due
either to a signal or to an explicit call to
exit() or pthread_exit(). |
lwp-start |
Fires within the context of a newly created process or
process thread. The lwp-start probe fires before any user-level instructions are executed. |
signal-clear |
Fires when a pending signal is cleared
because the target thread was waiting for the signal
in sigwait(). args[0] contains the signal number. |
signal-discard |
Fires when a signal is sent to a single-threaded
process and the signal is both unblocked and ignored
by the process. Under these conditions, the signal is
discarded on generation. The
lwpsinfo_t and psinfo_t of the target thread and process are contained in args[0] and args[1], respectively. The signal number is contained in args[2]. |
signal-handle |
Fires immediately before a thread handles a signal.
The signal number is contained in args[0], a pointer to the corresponding siginfo_t is contained in args[1], and the address of the signal handler is contained in args[2]. |
signal-send |
Fires when a signal is sent to a process or to a
thread created by a process. The
lwpsinfo_t and psinfo_t of the receiving thread and process are contained in args[0] and args[1], respectively. The signal number is contained in args[2]. |
start |
Fires in the context of a newly created process. The
start probe fires before any user-level instructions are executed in the process. |
Note:
In Linux, there is no fundamental difference between a process and a thread that a process creates. The threads of a process are set up so that they can share resources, but each thread has its own entry in the process table with its own process ID.
proc Probe Arguments
The following table lists the argument types for the
proc
probes. See Table 11-4
for a description of the arguments.
Table 11-5 proc Probe Arguments
Probe | args[0] | args[1] | args[2] |
---|---|---|---|
create |
psinfo_t * |
— |
— |
exec |
char * |
— |
— |
exec-failure |
int |
— |
— |
exec-success |
— |
— |
— |
exit |
int |
— |
— |
lwp-create |
lwpsinfo_t * |
psinfo_t * |
— |
lwp-exit |
— |
— |
— |
lwp-start |
— |
— |
— |
signal-clear |
int |
— |
— |
signal-discard |
lwpsinfo_t * |
psinfo_t * |
int |
signal-handle |
int |
siginfo_t * |
void (*)(void) |
signal-send |
lwpsinfo_t * |
psinfo_t * |
int |
start |
— |
— |
— |
lwpsinfo_t
Several proc
probes have arguments of type
lwpsinfo_t
. Detailed information about this
data structure can be found in
/usr/lib64/dtrace/version/procfs.d
.
The definition of the lwpsinfo_t
structure,
as available to DTrace consumers, is as follows:
typedef struct lwpsinfo {
        int pr_flag;                    /* flags */
        id_t pr_lwpid;                  /* thread id */
        uintptr_t pr_addr;              /* internal address of thread */
        uintptr_t pr_wchan;             /* wait addr for sleeping lwp (NULL on Linux) */
        char pr_stype;                  /* sync event type (0 on Linux) */
        char pr_state;                  /* numeric thread state */
        char pr_sname;                  /* printable character for pr_state */
        int pr_pri;                     /* priority, high value = high priority */
        char pr_name[PRCLSZ];           /* scheduling class name */
        processorid_t pr_onpro;         /* processor which last ran this thread */
} lwpsinfo_t;
Note:
Lightweight processes do not exist in Linux. Rather, in
Oracle Linux, processes and threads are represented by process
descriptors of type struct task_struct
in
the task list. DTrace translates the members of
lwpsinfo_t
from the
task_struct
for the Oracle Linux process.
The pr_flag
is set to 1
if the thread is stopped. Otherwise, it is set to
0
.
In Oracle Linux, the pr_stype
field is unsupported,
and hence is always 0
.
The following table describes the values that
pr_state
can take, as well as the
corresponding character values for
pr_sname
.
Table 11-6 pr_state Values
pr_state Value | pr_sname Value | Description |
---|---|---|
SRUN |
R |
The thread is runnable or is currently running on a
CPU.
The equivalent Oracle Linux task state is
TASK_RUNNING. |
SSLEEP |
S |
The thread is sleeping.
The equivalent Oracle Linux task state is
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. |
SSTOP |
T |
The thread is stopped, either due to an explicit
stop request or because it is being traced.
The equivalent Oracle Linux task state is
TASK_STOPPED or TASK_TRACED. |
SWAIT |
W |
The thread is waiting on a wait queue. |
SZOMB |
Z |
The thread is a zombie.
The equivalent Oracle Linux task state is
EXIT_ZOMBIE. |
psinfo_t
Several proc
probes have an argument of
type psinfo_t
. Detailed information about
this data structure can be found in
/usr/lib64/dtrace/version/procfs.d
.
The definition of the psinfo_t
structure,
as available to DTrace consumers, is as follows:
typedef struct psinfo {
        int pr_nlwp;                    /* not supported */
        pid_t pr_pid;                   /* unique process id */
        pid_t pr_ppid;                  /* process id of parent */
        pid_t pr_pgid;                  /* pid of process group leader */
        pid_t pr_sid;                   /* session id */
        uid_t pr_uid;                   /* real user id */
        uid_t pr_euid;                  /* effective user id */
        uid_t pr_gid;                   /* real group id */
        uid_t pr_egid;                  /* effective group id */
        uintptr_t pr_addr;              /* address of process */
        size_t pr_size;                 /* not supported */
        size_t pr_rssize;               /* not supported */
        struct tty_struct *pr_ttydev;   /* controlling tty (or -1) */
        ushort_t pr_pctcpu;             /* not supported */
        ushort_t pr_pctmem;             /* not supported */
        timestruc_t pr_start;           /* not supported */
        timestruc_t pr_time;            /* not supported */
        timestruc_t pr_ctime;           /* not supported */
        char pr_fname[16];              /* name of exec'ed file */
        char pr_psargs[80];             /* initial chars of arg list */
        int pr_wstat;                   /* not supported */
        int pr_argc;                    /* initial argument count */
        uintptr_t pr_argv;              /* address of initial arg vector */
        uintptr_t pr_envp;              /* address of initial env vector */
        char pr_dmodel;                 /* data model */
        taskid_t pr_taskid;             /* not supported */
        projid_t pr_projid;             /* not supported */
        int pr_nzomb;                   /* not supported */
        poolid_t pr_poolid;             /* not supported */
        zoneid_t pr_zoneid;             /* not supported */
        id_t pr_contract;               /* not supported */
        lwpsinfo_t pr_lwp;              /* not supported */
} psinfo_t;
Note:
Lightweight processes do not exist in Linux. In Oracle Linux,
processes and threads are represented by process descriptors
of type struct task_struct
in the task
list. DTrace translates the members of
psinfo_t
from the
task_struct
for the Oracle Linux process.
pr_dmodel is set to either PR_MODEL_ILP32, denoting a 32-bit
process, or PR_MODEL_LP64, denoting a 64-bit process.
proc Examples
The following examples illustrate the use of the probes that are
published by the proc
provider.
exec
The following example shows how you can use the
exec
probe to easily determine which
programs are being executed, and by whom. Type the following D
source code and save it in a file named
whoexec.d
:
```
#pragma D option quiet

proc:::exec
{
  self->parent = execname;
}

proc:::exec-success
/self->parent != NULL/
{
  @[self->parent, execname] = count();
  self->parent = NULL;
}

proc:::exec-failure
/self->parent != NULL/
{
  self->parent = NULL;
}

END
{
  printf("%-20s %-20s %s\n", "WHO", "WHAT", "COUNT");
  printa("%-20s %-20s %@d\n", @);
}
```
Running the example script for a short period of time results in output similar to the following:
# dtrace -s ./whoexec.d ^C WHO WHAT COUNT abrtd abrt-handle-eve 1 firefox basename 1 firefox mkdir 1 firefox mozilla-plugin- 1 firefox mozilla-xremote 1 firefox run-mozilla.sh 1 firefox uname 1 gnome-panel firefox 1 kworker/u:1 modprobe 1 modprobe modprobe.ksplic 1 mozilla-plugin- plugin-config 1 mozilla-plugin- uname 1 nice sosreport 1 run-mozilla.sh basename 1 run-mozilla.sh dirname 1 run-mozilla.sh firefox 1 run-mozilla.sh uname 1 sh abrt-action-sav 1 sh blkid 1 sh brctl 1 sh cut 1 ...
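In whoexec.d, the clause @[self->parent, execname] = count() builds a table of counts keyed by the (parent, child) pair. The same bookkeeping can be sketched in Python with a Counter over hypothetical exec events (the event list here is made up for illustration):

```python
from collections import Counter

# hypothetical (parent, child) exec events, standing in for what
# proc:::exec-success would report on a real system
events = [("sh", "blkid"), ("sh", "cut"), ("sh", "blkid"),
          ("nice", "sosreport")]

agg = Counter(events)   # mirrors @[parent, child] = count()

for (who, what), n in sorted(agg.items()):
    print(f"{who:<20} {what:<20} {n}")
```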
start and exit Probes
If you want to know how long programs are running, from
creation to termination, you can enable the
start
and exit
probes,
as shown in the following example. Save it in a file named
progtime.d
:
```
proc:::start
{
  self->start = timestamp;
}

proc:::exit
/self->start/
{
  @[execname] = quantize(timestamp - self->start);
  self->start = 0;
}
```
Running the example script on a build server for several seconds results in output similar to the following:
# dtrace -s ./progtime.d dtrace: script ’./progtime.d’ matched 2 probes ^C ... cc value ------------- Distribution ------------- count 33554432 | 0 67108864 |@@@ 3 134217728 |@ 1 268435456 | 0 536870912 |@@@@ 4 1073741824 |@@@@@@@@@@@@@@ 13 2147483648 |@@@@@@@@@@@@ 11 4294967296 |@@@ 3 8589934592 | 0 sh value ------------- Distribution ------------- count 262144 | 0 524288 |@ 5 1048576 |@@@@@@@ 29 2097152 | 0 4194304 | 0 8388608 |@@@ 12 16777216 |@@ 9 33554432 |@@ 9 67108864 |@@ 8 134217728 |@ 7 268435456 |@@@@@ 20 536870912 |@@@@@@ 26 1073741824 |@@@ 14 2147483648 |@@ 11 4294967296 | 3 8589934592 | 1 17179869184 | 0 ...
signal-send
The following example shows how you can use the
signal-send probe to determine the sending
process, the receiving process, and the signal number for any
signal. Type the following D source code and save it in a file
named sig.d:
```
#pragma D option quiet

proc:::signal-send
{
  @[execname, stringof(args[1]->pr_fname), args[2]] = count();
}

END
{
  printf("%20s %20s %12s %s\n", "SENDER", "RECIPIENT", "SIG", "COUNT");
  printa("%20s %20s %12d %@d\n", @);
}
```
Running this script results in output similar to the following:
# dtrace -s sig.d ^C SENDER RECIPIENT SIG COUNT gnome-panel Xorg 29 1 kworker/0:2 dtrace 2 1 Xorg Xorg 29 3 java Xorg 29 6 firefox Xorg 29 14 kworker/0:0 Xorg 29 1135
proc Stability
The proc
provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider | Evolving | Evolving | ISA |
Module | Private | Private | Unknown |
Function | Private | Private | Unknown |
Name | Evolving | Evolving | ISA |
Arguments | Evolving | Evolving | ISA |
For more information about the stability mechanism, see DTrace Stability Features.
sched Provider
The sched
provider makes available probes that
are related to CPU scheduling. Because CPUs are the one resource
that all threads must consume, the sched
provider is very useful for understanding systemic behavior. For
example, using the sched
provider, you can
understand when and why threads sleep, run, change priority, or
wake other threads.
sched Probes
The following table describes the probes for the
sched
provider.
Table 11-7 sched Probes
Probe | Description |
---|---|
change-pri | Fires whenever a thread's priority is about to be changed. |
dequeue | Fires immediately before a runnable thread is dequeued from a run queue. |
enqueue | Fires immediately before a runnable thread is enqueued to a run queue. |
off-cpu | Fires when the current CPU is about to end execution of a thread. |
on-cpu | Fires when a CPU has just begun execution of a thread. |
preempt | Fires immediately before the current thread is preempted. After this probe fires, the current thread will select a thread to run and the off-cpu probe will fire for the current thread. |
remain-cpu | Fires when a scheduling decision has been made, but the dispatcher has elected to continue to run the current thread. |
sleep | Fires immediately before the current thread sleeps on a synchronization object. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t pointed to by curlwpsinfo. |
surrender | Fires when a CPU has been instructed by another CPU to make a scheduling decision, often because a higher-priority thread has become runnable. |
tick | Fires as a part of clock tick-based accounting. In clock tick-based accounting, CPU accounting is performed by examining which threads and processes are running when a fixed-interval interrupt fires. |
wakeup | Fires immediately before the current thread wakes a thread sleeping on a synchronization object. |
sched Probe Arguments
The following table describes the argument types for the
sched
probes. See
Table 11-7 for descriptions of the arguments.
Table 11-8 sched Probe Arguments
Probe | args[0] | args[1] | args[2] | args[3] |
---|---|---|---|---|
change-pri | lwpsinfo_t * | psinfo_t * | pri_t | — |
dequeue | lwpsinfo_t * | psinfo_t * | cpuinfo_t * | — |
enqueue | lwpsinfo_t * | psinfo_t * | cpuinfo_t * | int |
off-cpu | lwpsinfo_t * | psinfo_t * | — | — |
on-cpu | — | — | — | — |
preempt | — | — | — | — |
remain-cpu | — | — | — | — |
sleep | — | — | — | — |
surrender | lwpsinfo_t * | psinfo_t * | — | — |
tick | lwpsinfo_t * | psinfo_t * | — | — |
wakeup | lwpsinfo_t * | psinfo_t * | — | — |
cpuinfo_t
The cpuinfo_t
structure defines a CPU. Per
the information in Table 11-8, arguments to
both the enqueue
and
dequeue
probes include a pointer to a
cpuinfo_t
. Additionally, the
cpuinfo_t
that corresponds to the current
CPU is pointed to by the curcpu
variable.
The definition of the cpuinfo_t
structure
is as follows:
```
typedef struct cpuinfo {
        processorid_t cpu_id;       /* CPU identifier */
        psetid_t cpu_pset;          /* not supported */
        chipid_t cpu_chip;          /* chip identifier */
        lgrp_id_t cpu_lgrp;         /* not supported */
        cpuinfo_arch_t *cpu_info;   /* CPU information */
} cpuinfo_t;
```
- cpu_id: The processor identifier.
- cpu_chip: The identifier of the physical chip. Physical chips can contain several CPU cores.
- cpu_info: A pointer to the cpuinfo_arch_t structure that is associated with the CPU.
sched Examples
The following examples show the use of sched
probes.
on-cpu and off-cpu Probes
One common question that you might want answered is which CPUs
are running threads and for how long? The following example
shows how you can use the on-cpu
and
off-cpu
probes to easily answer this
question on a system-wide basis. Type the following D source
code and save it in a file named where.d
:
```
sched:::on-cpu
{
  self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
  @[cpu] = quantize(timestamp - self->ts);
  self->ts = 0;
}
```
Running the previous script results in output that is similar to the following:
# dtrace -s ./where.d dtrace: script ’./where.d’ matched 2 probes ^C 0 value ------------- Distribution ------------- count 2048 | 0 4096 |@@ 37 8192 |@@@@@@@@@@@@@ 212 16384 |@ 30 32768 | 10 65536 |@ 17 131072 | 12 262144 | 9 524288 | 6 1048576 | 5 2097152 | 1 4194304 | 3 8388608 |@@@@ 75 16777216 |@@@@@@@@@@@@ 201 33554432 | 6 67108864 | 0 1 value ------------- Distribution ------------- count 2048 | 0 4096 |@ 6 8192 |@@@@ 23 16384 |@@@ 18 32768 |@@@@ 22 65536 |@@@@ 22 131072 |@ 7 262144 | 5 524288 | 2 1048576 | 3 2097152 |@ 9 4194304 | 4 8388608 |@@@ 18 16777216 |@@@ 19 33554432 |@@@ 16 67108864 |@@@@ 21 134217728 |@@ 14 268435456 | 0
The previous output shows that on CPU 1 threads tend to run either for less than 131072 nanoseconds (on the order of 100 microseconds) at a stretch, or for 8388608 to 134217728 nanoseconds (approximately 10 to 100 milliseconds). A noticeable gap between the two clusters of data is shown in the histogram. You also might be interested in knowing which CPUs are running a particular process.
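The quantize() action used above bins values into power-of-two buckets: the row labeled v counts values x where v <= x < 2v. A small Python sketch of that bucketing rule, for nonnegative values (an illustration of how to read the rows, not DTrace's implementation):

```python
def quantize_bucket(x):
    """Return the power-of-two row label that quantize() would use
    for a nonnegative value x."""
    if x == 0:
        return 0
    b = 1
    while b * 2 <= x:
        b *= 2
    return b

# a 100000 ns run lands in the 65536 row;
# a 20000000 ns run lands in the 16777216 row
print(quantize_bucket(100000), quantize_bucket(20000000))
```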
You can also use the on-cpu
and
off-cpu
probes for answering this question.
The following script displays which CPUs run a specified
application over a period of ten seconds. Save it in a file
named whererun.d:
```
#pragma D option quiet

dtrace:::BEGIN
{
  start = timestamp;
}

sched:::on-cpu
/execname == $$1/
{
  self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
  @[cpu] = sum(timestamp - self->ts);
  self->ts = 0;
}

profile:::tick-1sec
/++x >= 10/
{
  exit(0);
}

dtrace:::END
{
  printf("CPU distribution over %d seconds:\n\n",
      (timestamp - start) / 1000000000);
  printf("CPU microseconds\n--- ------------\n");
  normalize(@, 1000);
  printa("%3d %@d\n", @);
}
```
Running the previous script on a large mail server and specifying the IMAP daemon results in output that is similar to the following:
# dtrace -s ./whererun.d imapd CPU distribution of imapd over 10 seconds: CPU microseconds --- ------------ 15 10102 12 16377 21 25317 19 25504 17 35653 13 41539 14 46669 20 57753 22 70088 16 115860 23 127775 18 160517
Oracle Linux takes into account the amount of time that a thread has
been sleeping when selecting a CPU on which to run the thread,
as a thread that has been sleeping for less time tends not to
migrate. Use the off-cpu
and
on-cpu
probes to observe this behavior.
Type the following source code and save it in a file named
howlong.d
:
```
sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
  self->cpu = cpu;
  self->ts = timestamp;
}

sched:::on-cpu
/self->ts/
{
  @[self->cpu == cpu ?
      "sleep time, no CPU migration" : "sleep time, CPU migration"] =
      lquantize((timestamp - self->ts) / 1000000, 0, 500, 25);
  self->ts = 0;
  self->cpu = 0;
}
```
Running the previous script for approximately 30 seconds results in output that is similar to the following:
# dtrace -s ./howlong.d dtrace: script ’./howlong.d’ matched 2 probes ^C sleep time, CPU migration value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@ 6838 25 |@@@@@ 4714 50 |@@@ 3108 75 |@ 1304 100 |@ 1557 125 |@ 1425 150 | 894 175 |@ 1526 200 |@@ 2010 225 |@@ 1933 250 |@@ 1982 275 |@@ 2051 300 |@@ 2021 325 |@ 1708 350 |@ 1113 375 | 502 400 | 220 425 | 106 450 | 54 475 | 40 >= 500 |@ 1716 sleep time, no CPU migration value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@ 58413 25 |@@@ 14793 50 |@@ 10050 75 | 3858 100 |@ 6242 125 |@ 6555 150 | 3980 175 |@ 5987 200 |@ 9024 225 |@ 9070 250 |@@ 10745 275 |@@ 11898 300 |@@ 11704 325 |@@ 10846 350 |@ 6962 375 | 3292 400 | 1713 425 | 585 450 | 201 475 | 96 >= 500 | 3946
The previous output reveals that there are many more occurrences of non-migration than migration. Also, when sleep times are longer, migrations are more likely. The distributions are noticeably different in the sub-100 millisecond range, but look very similar as the sleep times get longer. This result would seem to indicate that sleep time is not factored into the scheduling decision when a certain threshold is exceeded.
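The lquantize(ms, 0, 500, 25) action in howlong.d bins values linearly: one catch-all bucket below the lower bound, 25-unit steps between 0 and 500, and a catch-all at or above 500. A rough Python sketch of the binning rule (illustrative only; in-range labels are returned as integers and the two catch-alls as strings):

```python
def lquantize_bucket(val, low, high, step):
    """Return the row label that a D lquantize(val, low, high, step)
    aggregation would place val into."""
    if val < low:
        return f"< {low}"
    if val >= high:
        return f">= {high}"
    return low + ((val - low) // step) * step

# a 137 ms sleep lands in the 125 row; 730 ms lands in ">= 500"
print(lquantize_bucket(137, 0, 500, 25), lquantize_bucket(730, 0, 500, 25))
```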
enqueue and dequeue Probes
You might want to know on which CPUs processes and threads are
waiting to run. You can use the enqueue
probe along with the dequeue
probe to
answer this question. Type the following source code and save
it in a file named qtime.d
:
```
sched:::enqueue
{
  a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] = timestamp;
}

sched:::dequeue
/a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]/
{
  @[args[2]->cpu_id] = quantize(timestamp -
      a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]);
  a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] = 0;
}
```
Running the previous script for several seconds results in output that is similar to the following:
# dtrace -s qtime.d dtrace: script 'qtime.d' matched 16 probes ^C 1 value ------------- Distribution ------------- count 8192 | 0 16384 | 1 32768 |@ 47 65536 |@@@@@@@ 365 131072 |@@@@@@@@@@@@ 572 262144 |@@@@@@@@@@@@ 570 524288 |@@@@@@@ 354 1048576 |@ 57 2097152 | 7 4194304 | 1 8388608 | 1 16777216 | 0 0 value ------------- Distribution ------------- count 8192 | 0 16384 | 6 32768 |@ 49 65536 |@@@@@ 261 131072 |@@@@@@@@@@@@@ 753 262144 |@@@@@@@@@@@@ 704 524288 |@@@@@@@@ 455 1048576 |@ 74 2097152 | 9 4194304 | 2 8388608 | 0
Rather than looking at wait times, you might want to examine
the length of the run queue over time. Using the
enqueue
and dequeue
probes, you can set up an associative array to track the queue
length. Type the following source code and save it in a file
named qlen.d
:
```
sched:::enqueue
{
  this->len = qlen[args[2]->cpu_id]++;
  @[args[2]->cpu_id] = lquantize(this->len, 0, 100);
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
  qlen[args[2]->cpu_id]--;
}
```
Running the previous script on a largely idle dual-core processor system for approximately 30 seconds results in output that is similar to the following:
# dtrace -s qlen.d dtrace: script 'qlen.d' matched 16 probes ^C 1 value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8124 1 |@@@@@@ 1558 2 |@ 160 3 | 51 4 | 24 5 | 13 6 | 11 7 | 9 8 | 6 9 | 0 0 value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8569 1 |@@@@@@@@@ 2429 2 |@ 292 3 | 25 4 | 8 5 | 5 6 | 4 7 | 4 8 | 1 9 | 0
The output is roughly what you would expect for an idle system: the majority of the time that a runnable thread is enqueued, the run queues were very short (three or fewer threads in length). However, given that the system was largely idle, the exceptional data points at the bottom of each table might be unexpected. For example, why were the run queues as long as 8 runnable threads? To explore this question further, you could write a D script that displays the contents of the run queue when the length of the run queue is long. This problem is complicated because D enablings cannot iterate over data structures, and therefore cannot simply iterate over the entire run queue. Even if D enablings could do so, you should avoid dependencies on the kernel's internal data structures.
For this type of script, you would enable the
enqueue
and dequeue
probes and then use both speculations and associative arrays.
Whenever a thread is enqueued, the script increments the
length of the queue and records the timestamp in an
associative array keyed by the thread. You cannot use a
thread-local variable in this case because a thread might be
enqueued by another thread. The script then checks to see if
the queue length exceeds the maximum, and if so, the script
starts a new speculation, and records the timestamp and the
new maximum. Then, when a thread is dequeued, the script
compares the enqueue
timestamp to the
timestamp of the longest length: if the thread was enqueued
before the timestamp of the longest length, the thread was in
the queue when the longest length was recorded. In this case,
the script speculatively traces the thread's information. When
the kernel dequeues the last thread that was enqueued at the
timestamp of the longest length, the script commits the
speculation data. Type the following source code and save it
in a file named whoqueue.d
:
```
#pragma D option quiet
#pragma D option nspec=4
#pragma D option specsize=100k

int maxlen;
int spec[int];

sched:::enqueue
{
  this->len = ++qlen[this->cpu = args[2]->cpu_id];
  in[args[0]->pr_addr] = timestamp;
}

sched:::enqueue
/this->len > maxlen && spec[this->cpu]/
{
  /*
   * There is already a speculation for this CPU. We just set a new
   * record, so we'll discard the old one.
   */
  discard(spec[this->cpu]);
}

sched:::enqueue
/this->len > maxlen/
{
  /*
   * We have a winner. Set the new maximum length and set the timestamp
   * of the longest length.
   */
  maxlen = this->len;
  longtime[this->cpu] = timestamp;

  /*
   * Now start a new speculation, and speculatively trace the length.
   */
  this->spec = spec[this->cpu] = speculation();
  speculate(this->spec);
  printf("Run queue of length %d:\n", this->len);
}

sched:::dequeue
/(this->in = in[args[0]->pr_addr]) &&
 this->in <= longtime[this->cpu = args[2]->cpu_id]/
{
  speculate(spec[this->cpu]);
  printf("  %d/%d (%s)\n", args[1]->pr_pid, args[0]->pr_lwpid,
      stringof(args[1]->pr_fname));
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
  in[args[0]->pr_addr] = 0;
  this->len = --qlen[args[2]->cpu_id];
}

sched:::dequeue
/this->len == 0 && spec[this->cpu]/
{
  /*
   * We just processed the last thread that was enqueued at the time
   * of longest length; commit the speculation, which by now contains
   * each thread that was enqueued when the queue was longest.
   */
  commit(spec[this->cpu]);
  spec[this->cpu] = 0;
}
```
Running the previous script on the same system results in output that is similar to the following:
# dtrace -s whoqueue.d Run queue of length 1: 2850/2850 (java) Run queue of length 2: 4034/4034 (kworker/0:1) 16/16 (sync_supers) Run queue of length 3: 10/10 (ksoftirqd/1) 1710/1710 (hald-addon-inpu) 25350/25350 (dtrace) Run queue of length 4: 2852/2852 (java) 2850/2850 (java) 1710/1710 (hald-addon-inpu) 2099/2099 (Xorg) Run queue of length 5: 3149/3149 (notification-da) 2417/2417 (gnome-settings-) 2437/2437 (gnome-panel) 2461/2461 (wnck-applet) 2432/2432 (metacity) Run queue of length 9: 3685/3685 (firefox) 3149/3149 (notification-da) 2417/2417 (gnome-settings-) 2437/2437 (gnome-panel) 2852/2852 (java) 2452/2452 (nautilus) 2461/2461 (wnck-applet) 2432/2432 (metacity) 2749/2749 (gnome-terminal) ^C
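whoqueue.d leans on DTrace speculations: speculation() allocates a buffer, speculate() directs a clause's output into it, and commit() or discard() later emits or drops the buffered records. The discipline can be modeled with a toy Python class (a sketch of the semantics, not of DTrace internals):

```python
class SpeculationBuffers:
    """Toy model: records go to a numbered buffer until they are
    committed (emitted) or discarded (dropped)."""

    def __init__(self):
        self.bufs = {}
        self.next_id = 1

    def speculation(self):              # like D's speculation()
        sid = self.next_id
        self.next_id += 1
        self.bufs[sid] = []
        return sid

    def speculate(self, sid, record):   # like speculate(sid); printf(...)
        self.bufs[sid].append(record)

    def discard(self, sid):             # drop everything buffered so far
        self.bufs.pop(sid, None)

    def commit(self, sid):              # emit everything buffered so far
        return self.bufs.pop(sid, [])

s = SpeculationBuffers()
sid = s.speculation()
s.speculate(sid, "2850/2850 (java)")
s.speculate(sid, "2099/2099 (Xorg)")
print(s.commit(sid))
```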
sleep and wakeup Probes
The following example shows how you might use the
wakeup
probe to determine what is waking a
particular process, and when, over a given period. Type the
following source code and save it in a file named
gterm.d
:
```
#pragma D option quiet

dtrace:::BEGIN
{
  start = timestamp;
}

sched:::wakeup
/stringof(args[1]->pr_fname) == "gnome-terminal"/
{
  @[execname] = lquantize((timestamp - start) / 1000000000, 0, 10);
}

profile:::tick-1sec
/++x == 10/
{
  exit(0);
}
```
The output from running this script is as follows:
# dtrace -s gterm.d Xorg value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@@@@ 69 1 |@@@@@@@@ 35 2 |@@@@@@@@@ 42 3 | 2 4 | 0 5 | 0 6 | 0 7 |@@@@ 16 8 | 0 9 |@@@ 15 >= 10 | 0
This output shows that the X server is waking the
gnome-terminal
process as you interact with
the system.
Additionally, you could use the sleep
probe
with the wakeup
probe to understand which
applications are blocking on other applications, and for how
long. Type the following source code and save it in a file
named whofor.d
:
```
#pragma D option quiet

sched:::sleep
{
  bedtime[curlwpsinfo->pr_addr] = timestamp;
}

sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
  @[stringof(args[1]->pr_fname), execname] =
      quantize(timestamp - bedtime[args[0]->pr_addr]);
  bedtime[args[0]->pr_addr] = 0;
}

END
{
  printa("%s sleeping on %s:\n%@d\n", @);
}
```
The tail of the output from running the previous example script on a desktop system for several seconds is similar to the following:
# dtrace -s whofor.d ^C ... Xorg sleeping on metacity: value ------------- Distribution ------------- count 65536 | 0 131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2 262144 | 0 gnome-power-man sleeping on Xorg: value ------------- Distribution ------------- count 131072 | 0 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 524288 | 0 ...
preempt and remain-cpu Probes
Because Oracle Linux is a preemptive system, higher priority threads preempt lower priority threads. Preemption can induce a significant latency bubble in the lower priority thread. Therefore, you might want to know which threads are being preempted by other threads.
The following example shows how you would use the
preempt
and remain-cpu
probes to display this information. Type the following source
code and save it in a file named
whopreempt.d
:
```
#pragma D option quiet

sched:::preempt
{
  self->preempt = 1;
}

sched:::remain-cpu
/self->preempt/
{
  self->preempt = 0;
}

sched:::off-cpu
/self->preempt/
{
  /*
   * If we were told to preempt ourselves, see who we ended up giving
   * the CPU to.
   */
  @[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
      curlwpsinfo->pr_pri] = count();
  self->preempt = 0;
}

END
{
  printf("%30s %3s %30s %3s %5s\n", "PREEMPTOR", "PRI",
      "PREEMPTED", "PRI", "#");
  printa("%30s %3d %30s %3d %5@d\n", @);
}
```
Running the previous script on a desktop system for several seconds results in output that is similar to the following:
# dtrace -s whopreempt.d ^C PREEMPTOR PRI PREEMPTED PRI # firefox 120 kworker/0:0 120 1 gnome-panel 120 swapper 120 1 gnome-panel 120 wnck-applet 120 1 jbd2/dm-0-8 120 swapper 120 1 khugepaged 139 kworker/0:0 120 1 ksoftirqd/1 120 kworker/0:0 120 1 kworker/0:0 120 gnome-terminal 120 1 kworker/0:2 120 Xorg 120 1 kworker/0:2 120 java 120 1 kworker/1:0 120 Xorg 120 1 nautilus 120 Xorg 120 1 rtkit-daemon 0 rtkit-daemon 120 1 rtkit-daemon 120 swapper 120 1 watchdog/0 0 swapper 120 1 watchdog/1 0 kworker/0:0 120 1 wnck-applet 120 Xorg 120 1 wnck-applet 120 swapper 120 1 automount 120 kworker/0:0 120 2 gnome-power-man 120 kworker/0:0 120 2 kworker/0:0 120 swapper 120 2 kworker/1:0 120 dtrace 120 2 metacity 120 kworker/0:0 120 2 notification-da 120 swapper 120 2 udisks-daemon 120 kworker/0:0 120 2 automount 120 swapper 120 3 gnome-panel 120 Xorg 120 3 gnome-settings- 120 Xorg 120 3 gnome-settings- 120 swapper 120 3 gnome-terminal 120 swapper 120 3 java 120 kworker/0:0 120 3 ksoftirqd/0 120 swapper 120 3 kworker/0:2 120 swapper 120 3 metacity 120 Xorg 120 3 nautilus 120 kworker/0:0 120 3 qpidd 120 swapper 120 3 metacity 120 swapper 120 4 gvfs-afc-volume 120 swapper 120 5 java 120 Xorg 120 5 notification-da 120 Xorg 120 5 notification-da 120 kworker/0:0 120 5 Xorg 120 kworker/0:0 120 6 wnck-applet 120 kworker/0:0 120 10 VBoxService 120 swapper 120 13 dtrace 120 swapper 120 14 kworker/1:0 120 kworker/0:0 120 16 dtrace 120 kworker/0:0 120 20 Xorg 120 swapper 120 90 hald-addon-inpu 120 swapper 120 100 java 120 swapper 120 108 gnome-terminal 120 kworker/0:0 120 110
tick
If NOHZ is set to off,
Oracle Linux uses tick-based CPU accounting,
where a system clock interrupt fires at a fixed interval and
attributes CPU utilization to the processes that are running
at the time of the tick. The following example shows how you
would use the tick
probe to observe this
attribution.
# dtrace -n sched:::tick'{ @[stringof(args[1]->pr_fname)] = count() }' dtrace: description 'sched:::tick' matched 1 probe ^C VBoxService 1 gpk-update-icon 1 hald-addon-inpu 1 jbd2/dm-0-8 1 automount 2 gnome-session 2 hald 2 gnome-power-man 3 ksoftirqd/0 3 kworker/0:2 3 notification-da 4 devkit-power-da 6 nautilus 9 dbus-daemon 11 gnome-panel 11 gnome-settings- 11 dtrace 19 khugepaged 22 metacity 27 kworker/0:0 41 swapper 56 firefox 58 wnck-applet 61 gnome-terminal 67 java 84 Xorg 227
One deficiency of tick-based accounting is that the system clock that performs accounting is often also responsible for dispatching any time-related scheduling activity. As a result, if a thread is to perform some amount of work every clock tick (that is, every 10 milliseconds), the system either over-accounts or under-accounts for the thread, depending on whether the accounting is done before or after time-related dispatching scheduling activity. If accounting is performed before time-related dispatching, the system under-accounts for threads running at a regular interval. If such threads run for less than the clock tick interval, they can effectively hide behind the clock tick.
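The hiding effect can be demonstrated with a small simulation: a thread that runs for 2 milliseconds immediately after every 10-millisecond tick uses 20% of a CPU, yet sampling exactly at each tick never observes it running. A Python sketch, assuming idealized, drift-free ticks:

```python
TICK_MS = 10

def running(t_ms):
    # the thread runs during the 2 ms immediately after each tick
    return 0 < (t_ms % TICK_MS) <= 2

# tick-based accounting: sample exactly at each tick for one second
samples = sum(running(t) for t in range(0, 1000, TICK_MS))

# true utilization, measured at 0.1 ms resolution over the same second
busy = sum(running(t / 10) for t in range(0, 10000)) / 10000

print(samples, busy)   # the sampler sees nothing; the thread uses ~20%
```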
The following example examines whether a system has any such
threads. Type the following source code and save it in a file
named tick.d
:
```
sched:::tick,
sched:::enqueue
{
  @[probename] = lquantize((timestamp / 1000000) % 10, 0, 10);
}
```
The output of the example script is two distributions of the
millisecond offset within a ten millisecond interval, one for
the tick
probe and another for
enqueue
:
# dtrace -s tick.d dtrace: script 'tick.d' matched 9 probes ^C tick value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@ 29 1 |@@@@@@@@@@@@@@@@@@@ 106 2 |@@@@@ 27 3 |@ 7 4 |@@ 10 5 |@@ 12 6 |@ 4 7 |@ 8 8 |@@ 9 9 |@@@ 17 >= 10 | 0 enqueue value ------------- Distribution ------------- count < 0 | 0 0 |@@@@ 82 1 |@@@@ 86 2 |@@@@ 76 3 |@@@ 65 4 |@@@@@ 101 5 |@@@@ 79 6 |@@@@ 75 7 |@@@@ 76 8 |@@@@ 89 9 |@@@@ 75 >= 10 | 0
The output histogram named tick
shows that
the clock tick is firing at a 1 millisecond offset. In this
example, the output for enqueue
is evenly
spread across the ten millisecond interval and no spike is
visible at 1 millisecond, so it appears that the threads are
not being scheduled on a time basis.
sched Stability
The sched
provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
Element | Name Stability | Data Stability | Dependency Class |
---|---|---|---|
Provider | Evolving | Evolving | ISA |
Module | Private | Private | Unknown |
Function | Private | Private | Unknown |
Name | Evolving | Evolving | ISA |
Arguments | Evolving | Evolving | ISA |
For more information about the stability mechanism, see DTrace Stability Features.
io Provider
The io
provider makes available probes that
relate to data input and output. The io
provider enables quick exploration of behavior that is observed
through I/O monitoring tools such as iostat.
For example, you can use the io provider to
understand I/O by device, I/O type, I/O size, process, or
application name.
io Probes
The following table describes the probes for the
io
provider.
Table 11-9 io Probes
Probe | Description |
---|---|
start | Fires when an I/O request is about to be made either to a peripheral device or to an NFS server. |
done | Fires after an I/O request has been fulfilled. The done probe fires after the I/O completes, but before completion processing has been performed on the buffer. |
wait-start | Fires immediately before a thread begins to wait pending completion of a given I/O request. Some time after the wait-start probe fires, the wait-done probe fires in the same thread. |
wait-done | Fires when a thread finishes waiting for the completion of a given I/O request. The wait-done probe fires only after the wait-start probe has fired in the same thread. |
The io probes fire for all I/O requests to
peripheral devices, and for all file read and file write
requests to an NFS server. Requests for metadata from an NFS
server, for example due to a readdir()
request, do not fire io probes.
io Probe Arguments
The following table describes the arguments for the
io
probes.
Table 11-10 io Probe Arguments
Argument | Type | Description |
---|---|---|
args[0] | bufinfo_t * | The bufinfo_t structure that describes the I/O request. |
args[1] | devinfo_t * | The devinfo_t structure that describes the device to which the I/O request is directed. |
args[2] | fileinfo_t * | The fileinfo_t structure that describes the file to which the I/O request corresponds. |
Note:
DTrace does not currently support the use of
fileinfo_t
with io
probes. In Oracle Linux, no information is readily accessible at the
level where the io
probes fire about the
file where an I/O request originated.
bufinfo_t
The bufinfo_t
structure is the abstraction
that describes an I/O request. The buffer that corresponds to
an I/O request is pointed to by args[0]
in
the start
, done
,
wait-start
, and
wait-done
probes. Detailed information
about this data structure can be found in
/usr/lib64/dtrace/version/io.d
.
The definition of bufinfo_t
is as follows:
```
typedef struct bufinfo {
        int b_flags;          /* flags */
        size_t b_bcount;      /* number of bytes */
        caddr_t b_addr;       /* buffer address */
        uint64_t b_blkno;     /* expanded block # on device */
        uint64_t b_lblkno;    /* logical block # on device */
        size_t b_resid;       /* not supported */
        size_t b_bufsize;     /* size of allocated buffer */
        caddr_t b_iodone;     /* I/O completion routine */
        int b_error;          /* not supported */
        dev_t b_edev;         /* extended device */
} bufinfo_t;
```
Note:
DTrace translates the members of
bufinfo_t
from the
buffer_head
or bio
for
the Oracle Linux I/O request structure, depending on the kernel
version.
b_flags
indicates the state of the I/O
buffer, and consists of a bitwise-or of different state
values. The following table describes the values for the
supported states.
Table 11-11 b_flags Values
b_flags | Description |
---|---|
B_ASYNC | Indicates that the I/O request is asynchronous and is not waited upon. Note: some I/Os directed to be asynchronous might not set B_ASYNC. |
B_PAGEIO | Indicates that the buffer is being used in a paged I/O request. |
B_PHYS | Indicates that the buffer is being used for physical (direct) I/O to a user data area. |
B_READ | Indicates that data is to be read from the peripheral device into main memory. |
B_WRITE | Indicates that the data is to be transferred from main memory to the peripheral device. |
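Because b_flags is a bitwise-or of state bits, a consumer decodes it by masking each flag in turn. A Python sketch of the decoding idea, using hypothetical bit values (the real values come from the io.d translator on your system, not from this example):

```python
# hypothetical bit positions, chosen for illustration only
FLAGS = {
    "B_READ":   0x1,
    "B_WRITE":  0x2,
    "B_ASYNC":  0x4,
    "B_PAGEIO": 0x8,
    "B_PHYS":   0x10,
}

def decode_b_flags(flags):
    """Return the names of all states set in a b_flags-style bitmask."""
    return [name for name, bit in FLAGS.items() if flags & bit]

print(decode_b_flags(0x1 | 0x4))   # an asynchronous read
```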
- b_bcount: The number of bytes to be transferred as part of the I/O request.
- b_addr: The virtual address of the I/O request, when known.
- b_blkno: Identifies which block on the device is to be accessed.
- b_lblkno: Identifies which logical block on the device is to be accessed. The mapping from a logical block to a physical block (such as the cylinder, track, and so on) is defined by the device.
- b_bufsize: The size of the allocated buffer.
- b_iodone: Identifies a specific routine in the kernel that is called when the I/O is complete.
- b_edev: The major and minor device numbers of the device accessed. You can use the D subroutines getmajor and getminor to extract the major and minor device numbers from the b_edev field.
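Inside the Linux kernel, a dev_t value packs the major number above a 20-bit minor field, which is what getmajor and getminor unpack. A Python sketch of that unpacking (this assumes the kernel-internal 12/20-bit encoding; user-space dev_t encodings differ):

```python
MINOR_BITS = 20
MINOR_MASK = (1 << MINOR_BITS) - 1

def getmajor(dev):
    """Major device number: the bits above the 20-bit minor field."""
    return dev >> MINOR_BITS

def getminor(dev):
    """Minor device number: the low 20 bits."""
    return dev & MINOR_MASK

# e.g. a device with major 8, minor 1 (an sda1-style disk partition)
dev = (8 << MINOR_BITS) | 1
print(getmajor(dev), getminor(dev))
```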
devinfo_t
The devinfo_t
structure provides
information about a device. The devinfo_t
structure that corresponds to the destination device of an I/O
is pointed to by args[1]
in the
start
, done
,
wait-start
, and
wait-done
probes. Detailed information
about this data structure can be found in
/usr/lib64/dtrace/version/io.d
.
The definition of devinfo_t
is as follows:
```
typedef struct devinfo {
        int dev_major;         /* major number */
        int dev_minor;         /* minor number */
        int dev_instance;      /* not supported */
        string dev_name;       /* name of device */
        string dev_statname;   /* name of device + instance/minor */
        string dev_pathname;   /* pathname of device */
} devinfo_t;
```
Note:
DTrace translates the members of
devinfo_t
from the
buffer_head
for the Oracle Linux I/O request
structure.
- dev_major: The major number of the device.
- dev_minor: The minor number of the device.
- dev_name: The name of the device driver that manages the device.
- dev_statname: The name of the device as reported by iostat. This field is provided so that aberrant iostat output can be quickly correlated to actual I/O activity.
- dev_pathname: The full path of the device. The path that is specified by dev_pathname includes components expressing the device node, the instance number, and the minor node. However, not all three of these elements are necessarily expressed in the statistics name. For some devices, the statistics name consists of the device name and the instance number. For other devices, the name consists of the device name and the number of the minor node. As a result, two devices that have the same dev_statname might differ in their dev_pathname.
fileinfo_t
Note:
DTrace does not currently support the use of
fileinfo_t
with the
args[2]
argument of the
io
probes. You can use the
fileinfo_t
structure to obtain
information about a process's open files by using the
fds[]
array. See
Built-In Variables.
The fileinfo_t
structure provides
information about a file. args[2]
in the
start
, done
,
wait-start
, and
wait-done
probes points to the file to
which an I/O request corresponds. The presence of file
information is contingent upon the file system providing this
information when dispatching I/O requests. Some file systems,
especially third-party file systems, might not provide this
information. Also, I/O requests might emanate from a file
system for which no file information exists. For example, any
I/O from or to file system metadata is not associated with any
one file. Finally, some highly optimized file systems might
aggregate I/O from disjoint files into a single I/O request.
In this case, the file system might provide the file
information either for the file that represents the majority
of the I/O or for the file that represents some of the I/O.
Alternatively, the file system might provide no file
information at all in this case.
Detailed information about this data structure can be found in
/usr/lib64/dtrace/version/io.d
.
The definition of fileinfo_t
is as follows:
typedef struct fileinfo {
        string fi_name;             /* name (basename of fi_pathname) */
        string fi_dirname;          /* directory (dirname of fi_pathname) */
        string fi_pathname;         /* full pathname */
        loff_t fi_offset;           /* offset within file */
        string fi_fs;               /* file system */
        string fi_mount;            /* not supported */
        int fi_oflags;              /* open() flags for file descriptor */
} fileinfo_t;
The fi_name
field contains the name of the
file but does not include any directory components. If no file
information is associated with an I/O, the
fi_name
field is set to the string
<none>
. In some rare cases, the
pathname that is associated with a file might be unknown. In
this case, the fi_name
field is set to the
string <unknown>
.
The fi_dirname
field contains only the
directory component of the file name. As with
fi_name
, this string can be set to
<none>
, if no file information is
present, or <unknown>
if the pathname
that is associated with the file is not known.
The fi_pathname
field contains the full
pathname to the file. As with fi_name
, this
string can be set to <none>
, if no
file information is present, or
<unknown>
if the pathname that is
associated with the file is not known.
The fi_offset
field contains the offset
within the file, or -1
if either file
information is not present or the offset is otherwise
unspecified by the file system.
The fi_fs
field contains the name of the
file system type, or <none>
, if no
information is present.
The fi_oflags
field contains the flags that
were specified when opening the file.
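Taken together, these fields can be inspected through the fds[] array. The following sketch reports the file system type, file offset, and open flags (as a hexadecimal value) for each write() call; the field names are as defined in fileinfo_t above, and the output format is illustrative:
#pragma D option quiet

/* Report file system, offset, and open flags for each write(). */
syscall::write:entry
{
        printf("%s: fs=%s offset=%d oflags=0x%x\n",
            fds[arg0].fi_name, fds[arg0].fi_fs,
            fds[arg0].fi_offset, fds[arg0].fi_oflags);
}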
io Examples
The following example script displays information for every I/O
as it is issued. Type the following source code and save it in a
file named iosnoop.d
.
#pragma D option quiet

BEGIN
{
        printf("%10s %2s\n", "DEVICE", "RW");
}

io:::start
{
        printf("%10s %2s\n", args[1]->dev_statname,
            args[0]->b_flags & B_READ ? "R" : "W");
}
The output from this script is similar to the following:
# dtrace -s ./iosnoop.d
    DEVICE RW
     dm-00  R
     dm-00  R
     dm-00  R
     dm-00  R
     dm-00  R
     dm-00  R
...
You can make the example script slightly more sophisticated by using an associative array to track the time (in milliseconds) spent on each I/O, as shown in the following example:
#pragma D option quiet

BEGIN
{
        printf("%10s %2s %7s\n", "DEVICE", "RW", "MS");
}

io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
        printf("%10s %2s %3d.%03d\n", args[1]->dev_statname,
            args[0]->b_flags & B_READ ? "R" : "W",
            this->elapsed / 1000000, (this->elapsed / 1000) % 1000);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}
The modified script adds an MS (milliseconds) column to the output.
You can aggregate on device, application, process ID, and bytes
transferred, and save the script in a file named
whoio.d
, as shown in the following example:
#pragma D option quiet

io:::start
{
        @[args[1]->dev_statname, execname, pid] = sum(args[0]->b_bcount);
}

END
{
        printf("%10s %20s %10s %15s\n", "DEVICE", "APP", "PID", "BYTES");
        printa("%10s %20s %10d %15@d\n", @);
}
Running this script for a few seconds results in output that is similar to the following:
# dtrace -s whoio.d
^C
    DEVICE                  APP        PID           BYTES
     dm-00               evince      14759           16384
     dm-00          flush-252:0       1367           45056
     dm-00                 bash      14758          131072
     dm-00       gvfsd-metadata       2787          135168
     dm-00               evince      14758          139264
     dm-00               evince      14338          151552
     dm-00          jbd2/dm-0-8        390          356352
If you are copying data from one device to another, you might
want to know whether one of the devices acts as a limiter on the
copy. To answer this question, you need to know the effective
throughput of each device, rather than the number of bytes per
second that each device is transferring. For example, you can
determine throughput by using the following script and saving it
in a file named copy.d
:
#pragma D option quiet

io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        /*
         * We want to get an idea of our throughput to this device in KB/sec.
         * What we have, however, is nanoseconds and bytes. That is, we want
         * to calculate:
         *
         *             bytes / 1024
         *      ------------------------
         *      nanoseconds / 1000000000
         *
         * But we cannot calculate this using integer arithmetic without losing
         * precision (the denominator, for one, is between 0 and 1 for nearly
         * all I/Os). So we restate the fraction, and cancel:
         *
         *        bytes      1000000000        bytes       976562
         *      --------- * ------------  =  ----------- * ------
         *        1024      nanoseconds      nanoseconds      1
         *
         * This is easy to calculate using integer arithmetic.
         */
        this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
        @[args[1]->dev_statname, args[1]->dev_pathname] =
            quantize((args[0]->b_bcount * 976562) / this->elapsed);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

END
{
        printa("  %s (%s)\n%@d\n", @);
}
Running the previous script for several seconds while copying data from a hard disk to a USB drive yields the following output:
# dtrace -s copy.d
^C

  sdc1 (/dev/sdc1)

           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         3
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2257
             512 |                                         1
            1024 |                                         0

  dm-00 (/dev/dm-00)

           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         1
             512 |                                         0
            1024 |                                         2
            2048 |                                         0
            4096 |                                         2
            8192 |@@@@@@@@@@@@@@@@@@                       172
           16384 |@@@@@                                    52
           32768 |@@@@@@@@@@@                              108
           65536 |@@@                                      34
          131072 |                                         0
The previous output shows that the USB drive
(sdc1
) is clearly the limiting device. The
throughput of sdc1
is between 256 KB/sec and
512 KB/sec, while dm-00
delivers I/O at
anywhere from 8 MB/sec to over 64 MB/sec.
io Stability
The io
provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
Element | Name Stability | Data Stability | Dependency Class
---|---|---|---
Provider | Evolving | Evolving | ISA
Module | Private | Private | Unknown
Function | Private | Private | Unknown
Name | Evolving | Evolving | ISA
Arguments | Evolving | Evolving | ISA
For more information about the stability mechanism, see DTrace Stability Features.
fasttrap Provider
The fasttrap
provider performs dynamic
instrumentation of arbitrary instructions in user-space threads.
Unlike most other DTrace providers, the
fasttrap
provider is not designed for tracing
system activity. Rather, this provider is intended as a way for
DTrace consumers to inject information into the DTrace framework
by activating the fasttrap
probe.
For more information about enabling statically defined probes in user-space programs, see Statically Defined Tracing of User Applications.
fasttrap Probes
The fasttrap
provider makes available a
single probe that fires whenever a user-level process makes a
certain DTrace call into the kernel. The DTrace call to activate
the probe is not publicly available.
fasttrap Stability
The fasttrap
provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
Element | Name Stability | Data Stability | Dependency Class
---|---|---|---
Provider | Evolving | Evolving | ISA
Module | Private | Private | Unknown
Function | Private | Private | Unknown
Name | Evolving | Evolving | ISA
Arguments | Private | Private | Unknown
For more information about the stability mechanism, see DTrace Stability Features.