This page collects NFS knowledge that anyone who supports NFS systems
should have. NFS is a key component of many Unix systems, and is often
the prime consumer of network services. When NFS doesn't work, people
tend to say they have "NFS problems", not "Unix problems" or "network
problems". NFS uses much of a Unix kernel (and of other kernels too),
so it can be tricky to triage problems. We're offering insight, hints
and, hopefully, help!
Engineering tips
When the question is "Why isn't it working?",
the first thing to do is see what the client
can coax from the server. You start with the
basics and work to more complex cases. Do the
following from the client.
- ping server
Ping sends ICMP echo request messages to a server. The ICMP responder
is implemented within the kernel and relies on next to no user level
daemons. If ping doesn't print replies, you have a network
configuration problem, not an NFS problem. If you haven't checked for
unplugged power and data cables, now is the time to do so. Then check
the ifconfig settings and make sure the entries in /etc/hosts are
consistent.
- rpcinfo -p server
Rpcinfo is a simple ONC RPC test tool. The -p option queries the
portmap daemon on the server for a list of all RPC services currently
registered on the server. If it times out, portmap probably isn't
running or is sick.
- rpcinfo -u server nfs
This calls the NULL procedure (procedure 0), which every RPC program
implements. In addition to -u (UDP), you can use -t (TCP).
- rpcinfo -u server mountd
NFS V2 uses MOUNT V1. NFS V3 uses MOUNT V3. Don't worry about
rpcinfo's complaints if MOUNT V1 and V3 are available. You should see:
program 100005 version 1 ready and waiting
rpcinfo: RPC: Program/version mismatch; low version = 1, high version = 1
program 100005 version 2 is not available
program 100005 version 3 ready and waiting
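The four checks above run from the lowest layer upward, so the first one that fails tells you where to look. The following is only a sketch of that ordering in Python. The function names, the `-c 3` ping count option, the explicit MOUNT version 3 argument, and the idea of trusting each command's exit status are all assumptions made for illustration; ping options and rpcinfo exit codes vary between systems.

```python
import subprocess

# Checks in the order given above: lowest layer first.
# Assumptions: "-c 3" limits ping to three packets on this system, and
# MOUNT version 3 is the one of interest (use "1" for NFS V2 clients).
CHECKS = [
    ("network", ["ping", "-c", "3", "{server}"]),
    ("portmap", ["rpcinfo", "-p", "{server}"]),
    ("nfs", ["rpcinfo", "-u", "{server}", "nfs"]),
    ("mountd", ["rpcinfo", "-u", "{server}", "mountd", "3"]),
]


def run_command(argv):
    """Run one command quietly and return its exit status (0 = success)."""
    result = subprocess.run(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode


def first_failure(server, run=run_command):
    """Return the name of the first failing layer, or None if all pass."""
    for layer, template in CHECKS:
        argv = [arg.format(server=server) for arg in template]
        if run(argv) != 0:
            return layer
    return None
```

Calling `first_failure("server")` returns `"network"`, `"portmap"`, `"nfs"`, `"mountd"`, or `None`. Treat the answer as a hint about where to start reading output by hand, not as a diagnosis.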
Mount problems
If all the above works and the mount command doesn't, the problem is
often due to restrictions mountd is applying. Mountd often returns
little information and often doesn't make a syslog entry.
Mountd's options offer a spectrum of privacy (i.e. opportunities for
mount failure). The options are better described in mountd(8), but the
general range from most permissive to least is:
- mountd -n
This accepts requests from anyone at any site. It is only appropriate
outside of a firewall for file systems exported read only that contain
no private data. If this is not used, requests from non-root users
will be rejected with a complaint about weak credentials. Some
non-Unix clients may attempt to use AUTH_NONE; this will always result
in a weak credential complaint.
- mountd -i
When ONC RPC programs receive requests, they generally get the IP
address of the sender from the OS and the host name from the
authentication header of the request. If mountd can't match the host
name and IP address, it will reject the request. Often the /etc/hosts
files on client and server are mismatched, confusing both systems and
the system administrators who made the change just before the end of
the workday.
- mountd -d
This disallows requests from clients in other DNS domains, rejecting
the request with rejected credentials, or whatever the error is.
- mountd -s
This is claimed to enable subdomain checking, but it may do the same
thing as -d.
Further narrowing down the problem
Try to reproduce a problem by using common Unix
commands. If you can, write a minimal C program
to use for final study. The problem with relying
on Unix commands is that they are rife with surprises. "cp" does
not simply read one file and write another -
it does various stat()s and other calls before
it reads any data at all. Consider this report:
A couple other odd items. If we "dd if=foo.txt of=/dev/ttypn" on the
client, the mtime does change. If we "cat foo.txt >> /dev/ttypn" on
the client, the mtime does not change. [Later] Using dd of=/dev/ttyp2
conv=notrunc does NOT update mtime.
First, dd is not a simple program. Second, there are at least two ways
to truncate a file. One is to include O_TRUNC in the open(2), another
is to use ftruncate() or truncate(). Instead of looking at dd to see
which it uses, it may be easier to write a pair of test programs to
see if either or both of these change mtime. That information combined
with a tcpdump trace will put you two thirds of the way to a diagnosis.
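The text suggests a pair of small C test programs. The same comparison can be sketched in Python, whose `os` module wraps the same system calls. Everything named here (the function names, the marker timestamp, the temporary file) is made up for illustration. On an NFS mount, client attribute caching may affect what `stat` reports, so run it next to a tcpdump trace.

```python
import os
import tempfile

OLD_TIME = 1_000_000_000  # arbitrary marker timestamp (September 2001)


def mtime_changed(path, truncate):
    """Fill the file, back-date its mtime, truncate it, report if mtime moved."""
    with open(path, "w") as f:
        f.write("some data\n")
    os.utime(path, (OLD_TIME, OLD_TIME))
    truncate(path)
    return int(os.stat(path).st_mtime) != OLD_TIME


def via_open_trunc(path):
    """Truncate by passing O_TRUNC to open(2)."""
    os.close(os.open(path, os.O_WRONLY | os.O_TRUNC))


def via_ftruncate(path):
    """Truncate an already open file with ftruncate(2)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.ftruncate(fd, 0)
    finally:
        os.close(fd)


def compare(directory):
    """Run both tests on a scratch file created in the given directory."""
    results = {}
    for name, func in (("O_TRUNC", via_open_trunc),
                       ("ftruncate", via_ftruncate)):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            results[name] = mtime_changed(path, func)
        finally:
            os.unlink(path)
    return results
```

Point `compare()` at a directory on the NFS mount in question and at a local directory, then compare the two dictionaries. A difference between them narrows down which truncation path misbehaves.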
This is a more complicated problem. Whereas
before you couldn't even get to NFS, now you
can, some of the time or even nearly all of the
time. Now the whole array of software NFS touches
comes into question - network, file system, VM/UBC.
Where do you start?
As always, you start with the simple things. This time you look for
places where messages were lost or delayed. If you can find a packet
loss rate of more than 1%, fixing that will usually solve the problem.
netstat -is
This prints MAC level statistics, information
about the performance of your network interfaces.
For example:
tu0 Ethernet counters at Thu Oct 9 21:46:37 1997
65535 seconds since last zeroed
4294967291 bytes received
4294967290 bytes sent
53202599 data blocks received
21993784 data blocks sent
3256128485 multicast bytes received
26289560 multicast blocks received
5819084 multicast bytes sent
59623 multicast blocks sent
1474661 blocks sent, initially deferred
2296844 blocks sent, single collision
2982245 blocks sent, multiple collisions
34985 send failures, reasons include:
    Excessive collisions
0 collision detect check failure
65 receive failures, reasons include:
    Block check error
    Framing Error
    Frame too long
0 unrecognized frame destination
0 data overruns
0 system buffer unavailable
0 user buffer unavailable
Three counters have reached their upper bounds in the week the system
has been up. Look at the number of data blocks sent and received and
compare that to the send and receive failures. On a clean Ethernet
with no one plugging and unplugging cables, the error counts can be
zero, except on a grossly overloaded network. In that case, expect
several "Excessive collisions" errors. Here there are about 0.16%
excessive collisions (34985 send failures against 21993784 data blocks
sent), high enough to have a measurable effect on NFS. The only
solution is a faster network or breaking this one into multiple
subnets.
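The comparison of failures against blocks transferred is simple arithmetic. This short Python sketch applies it to the tu0 counters; the function name is made up for illustration, and the counters are copied from the listing above.

```python
def failure_rate_percent(failures, blocks):
    """Return failures as a percentage of the blocks transferred."""
    if blocks == 0:
        return 0.0
    return 100.0 * failures / blocks


# Counters copied from the tu0 listing.
tu0_send = failure_rate_percent(34985, 21993784)
tu0_recv = failure_rate_percent(65, 53202599)

print(f"send failures:    {tu0_send:.2f}%")
print(f"receive failures: {tu0_recv:.5f}%")
```

The send figure rounds to the 0.16% quoted above. The receive figure is several orders of magnitude smaller.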
Any other error is bad. Period. Communications theory claims that no
communication channel is perfect, but properly configured Ethernet is
astoundingly good. Almost all output errors are especially worrisome.
If you see "Remote failure to defer", it means that the Ethernet is
too long or that a transceiver is not noticing when it has won the
wire. Other errors suggest hardware failures of some sort. A high
receive or transmit error rate will cause bad performance.
Receive errors are less exciting than transmit errors, but those 65
errors are higher than you will see on a clean net. The error types
are typical, maybe even the full set. On the same machine, tu1 is
apparently talking to a cleaner subnet:
tu1 Ethernet counters at Thu Oct 9 21:46:37 1997
65535 seconds since last zeroed
3940414467 bytes received
5796189 bytes sent
21453036 data blocks received
62309 data blocks sent
2404736887 multicast bytes received
19653590 multicast blocks received
4402854 multicast bytes sent
44561 multicast blocks sent
2882 blocks sent, initially deferred
396 blocks sent, single collision
590 blocks sent, multiple collisions
0 send failures
0 collision detect check failure
0 receive failures
0 unrecognized frame destination
0 data overruns
0 system buffer unavailable
0 user buffer unavailable
Half as many packets received, no errors. That's the way it should be.
Don't let yourself be lulled into thinking that the CRC check is doing
its job. Keep in mind that a 16 bit CRC will, on average, still let
one corrupted message out of 65536 through, and there are situations
where it may not be that effective. It's well worthwhile to keep your
network as error free as possible so that any hardware or
configuration problems will stand out when they do occur.
Many people would worry over the high collision rate on tu0. While
there are some issues, a high collision rate ties up very little
bandwidth. It's fascinating to watch 10Base2 Ethernet on an
oscilloscope. You can see collisions (the signal is twice as strong)
and they take something like 51 usecs. Large Ethernet packets take
about 1.2 msec, more than 20 times that, so even a 50% collision rate
may not harm performance. You can often pick out NFS reads and writes,
as those show up as multiple packets with the minimum 9.6 usec
separation time.
On the other hand, a high collision rate means a highly loaded
Ethernet. Given the tools we ship, that may be the best clue that
someone is flooding the net with junk. Tcpdump is the best tool to
find out who and what.
netstat -s
Netstat also reports statistics for IP and its users. It produces a
lot of output, so the initial reaction is "Too Much!" The key things
to look at are marked with asterisks below. Their importance is
explained afterwards. Sections without interesting data for NFS are
not included.
ip:
* 112277645 total packets received
* 166 bad header checksums
  0 with size smaller than minimum
  0 with data size < data length
  0 with header length < data size
  0 with data length < header length
  13407455 fragments received
  1 fragment dropped (dup or out of space)
* 3692 fragments dropped after timeout
  12749963 packets forwarded
  159 packets not forwardable
  0 packets denied access
  512 redirects sent
  0 packets with unknown or unsupported protocol
  90412125 packets consumed here
  103655445 total packets generated here
  18 lost packets due to resource problems
  4291912 total packets reassembled ok
  2737452 output packets fragmented ok
  12951497 output fragments created
  3 packets with special flags set
icmp:
  410713 calls to icmp_error
  0 errors not generated 'cuz old ip message was too short
  0 errors not generated 'cuz old message was icmp
  Output histogram:
    echo reply: 857067
    destination unreachable: 410701
    routing redirect: 512
    time exceeded: 12
    address mask reply: 3
  0 messages with bad code fields
  0 messages < minimum length
  0 bad checksums
  0 messages with bad length
  Input histogram:
    echo reply: 215897
    destination unreachable: 425916
    source quench: 1650
    echo: 857067
    time exceeded: 400
    address mask request: 3
  857070 message responses generated
igmp:
  10239 messages received
  0 messages received with too few bytes
  0 messages received with bad checksum
  10239 membership queries received
  0 membership queries received with invalid field(s)
  0 membership reports received
  0 membership reports received with invalid field(s)
  0 membership reports received for groups to which we belong
  0 membership reports sent
tcp:
* 40524497 packets sent
    35825129 data packets (1599924796 bytes)
*   57176 data packets (56122576 bytes) retransmitted
    4213355 ack-only packets (3847882 delayed)
    191 URG only packets
    11406 window probe packets
    302538 window update packets
    114946 control packets
* 25716962 packets received
    18207501 acks (for 1596631807 bytes)
    304232 duplicate acks
    0 acks for unsent data
    15185777 packets (1815409364 bytes) received in-sequence
*   18979 completely duplicate packets (997570 bytes)
    3604 packets with some dup. data (87720 bytes duped)
*   264573 out-of-order packets (30522274 bytes)
    475 packets (16 bytes) of data after window
    16 window probes
    155467 window update packets
    1379 packets received after close
*   17 discarded for bad checksums
    0 discarded for bad header offset fields
    0 discarded because packet too short
  51766 connection requests
  18648 connection accepts
  66686 connections established (including accepts)
  79278 connections closed (including 1364 drops)
  6683 embryonic connections dropped
  14267537 segments updated rtt (of 14308046 attempts)
* 14195 retransmit timeouts
    8 connections dropped by rexmit timeout
  1973 persist timeouts
  20841 keepalive timeouts
    9156 keepalive probes sent
    199 connections dropped by keepalive
udp:
* 61642016 packets sent
* 63183812 packets received
  0 incomplete headers
  0 bad data length fields
* 3 bad checksums
  146557 full sockets
  824421 for no port (413723 broadcasts, 0 multicasts)
  0 input packets missed pcb cache
ip: 103655445 total packets generated here
ip: 112277645 total packets received
tcp: 40524497 packets sent
tcp: 25716962 packets received
udp: 61642016 packets sent
udp: 63183812 packets received
These just set baselines for the amount of
activity to compare against the various error
counters below.
ip: 166 bad header checksums
tcp: 17 discarded for bad checksums
udp: 3 bad checksums
If you look at Ethernet and FDDI specs and
take the time to read up on the subject, you
will gain great respect for the error checking
that a good CRC offers. If you do the same
with the IP checksum, you'll note that it’s
an elegant checksum, but still just a checksum.
Its goal was largely to provide a warning of
software corruption of IP messages in end nodes
and routers. If you look at netstat output
long enough, especially if you work on weird
problems, you'll be amazed at how much manages
to evade the CRC check.
Every so often someone innocently posts a suggestion to
comp.protocols.nfs that perhaps the IP checksum has outlived its
usefulness. The last time someone did that, within a day engineers
from most of the major Unix vendors posted very different answers for
why maintaining the IP checksum is critical, and other people posted
accounts of NFS data corruption when the checksum on UDP messages was
disabled. An interesting software engineering thesis would be to study
these messages and try to uncover what happened to them.
ip: 3692 fragments dropped after timeout
tcp: 57176 data packets (56122576 bytes) retransmitted
tcp: 18979 completely duplicate packets (997570 bytes)
tcp: 14195 retransmit timeouts
These are hallmarks of messages lost in the
net. While the NFS timeout
rate (see below) is NFS specific, this is helpful because it shows that more than NFS is having trouble.
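To turn these counters into rates you can compare against the baselines, a few regular expressions are enough. The sketch below is a rough illustration in Python. The pattern set, the dictionary keys, and the function names are assumptions, and the patterns follow the wording of the listing above, which differs between systems.

```python
import re

# A few lines in the wording used by the listing above.
SAMPLE = """\
ip:
  13407455 fragments received
  3692 fragments dropped after timeout
tcp:
  35825129 data packets (1599924796 bytes)
  57176 data packets (56122576 bytes) retransmitted
"""

# Hypothetical names mapped to patterns; adjust to your netstat's wording.
PATTERNS = {
    "ip_fragments_received": r"(\d+) fragments received",
    "ip_fragments_timed_out": r"(\d+) fragments dropped after timeout",
    "tcp_data_packets": r"(\d+) data packets \(\d+ bytes\)\s*$",
    "tcp_retransmitted": r"(\d+) data packets \(\d+ bytes\) retransmitted",
}


def parse_counters(text):
    """Return a dict of the counters found in netstat -s style text."""
    counters = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            counters[name] = int(match.group(1))
    return counters


def percent(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


c = parse_counters(SAMPLE)
retrans = percent(c["tcp_retransmitted"], c["tcp_data_packets"])
frag_loss = percent(c["ip_fragments_timed_out"], c["ip_fragments_received"])
print(f"TCP data packets retransmitted: {retrans:.2f}%")
print(f"IP fragments dropped after timeout: {frag_loss:.3f}%")
```

Running it on two snapshots taken some minutes apart and subtracting the counters gives the current rate, as opposed to the average since boot.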
ping server
If NFS traffic between nodes on the same subnet is fine, but going
across a router is not, then there is a very high probability that the
router is dropping packets. Router statistics are generally available
only to the high priest in charge of guarding the router, and he's
probably too busy to help out. Ping is a useful check here. If you
start up a ping that crosses a router, you should see the echo for
each packet you send. If you don't, what does it tell you? To draw a
conclusion, you should first have run ping between pairs of nodes on
each subnet to verify that no packets are lost in that case. You can
also use tcpdump on both client and server to see if messages that
reach the router show up on the other side. If they don't, then that's
a very strong sign that the router is swamped or congested.
- nfsstat
The server statistics aren't too useful, but the client RPC portion
is. (nfsstat -cr will print just that.) For example:
alingo 26% nfsstat -cr
Client rpc:
tcp: calls      badxids    badverfs   timeouts   newcreds
     9          0          0          0          0
     creates    connects   badconns   inputs     avails     interrupts
     2          2          0          20         9          0
udp: calls      badxids    badverfs   timeouts   newcreds   retrans
     125955     3          0          342        0          0
     badcalls   timers     waits
     343        229        0
The key things are UDP calls and timeouts. A timeout rate of more than
1% generally results in awful performance. (Here it is 342 timeouts in
125955 calls, about 0.27%.) If badxids is incrementing, that usually
means the server's file system is overloaded and duplicate replies are
coming back because the client has retransmitted the original request.
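The timeout rate can be computed straight from the nfsstat text. The Python sketch below pairs each header line with the line of numbers under it. The function names are made up, and the parser assumes the header-then-values layout shown above, which may not hold for other nfsstat versions.

```python
# The udp portion of the nfsstat -cr output shown above.
SAMPLE = """\
udp: calls badxids badverfs timeouts newcreds retrans
125955 3 0 342 0 0
badcalls timers waits
343 229 0
"""


def parse_pairs(text):
    """Pair each header line with the line of numbers that follows it."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    for header, values in zip(rows[0::2], rows[1::2]):
        # Drop a leading label such as "udp:" before pairing.
        names = [word for word in header if not word.endswith(":")]
        stats.update(zip(names, (int(v) for v in values)))
    return stats


def timeout_percent(stats):
    """Return timeouts as a percentage of calls."""
    if stats["calls"] == 0:
        return 0.0
    return 100.0 * stats["timeouts"] / stats["calls"]


udp = parse_pairs(SAMPLE)
print(f"UDP timeout rate: {timeout_percent(udp):.2f}%")
```

Anything approaching the 1% mark is worth chasing. A steadily climbing badxids value alongside it points at the server, not the network.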
nfsd and nfsiod
Nfsd must be running for NFS to work at all. The number of server
threads you should have it run is a function of the load and the speed
of the exported file system. However, even at the recommended value of
8, you should see good performance. Nfsiod is not needed for NFS to
work, but it provides a big boost to client performance by helping to
shepherd multiple read and write requests at a time. Again, the
recommended number of threads, 7, should provide decent performance
for casual use.
Gigabit Ethernet came out at what should have been the perfect time:
Tru64 could easily saturate 100 Mb/s media with uniprocessor systems,
and a 10X increase would give us new headroom. Besides, there is a lot
to be said for not saturating media. One big challenge was keeping up
with the packet load: a 1500 byte packet takes only 12 usec of wire
time. Some vendors created "Jumbo frames," a 9000 byte alternative
that still takes only 72 usec. One reason that size was chosen was
that it could hold an 8KB NFS I/O message.
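The wire times quoted above follow directly from the frame size and the line rate. This small Python sketch reproduces them; the function name is made up, and preamble, headers and inter-frame gap are ignored.

```python
def wire_time_usec(frame_bytes, bits_per_second):
    """Microseconds needed to clock frame_bytes onto the wire."""
    return frame_bytes * 8 * 1_000_000 / bits_per_second


GIGABIT = 1_000_000_000  # bits per second

for size in (1500, 9000):
    print(f"{size} byte frame: {wire_time_usec(size, GIGABIT):.0f} usec")
```

At one packet every 12 usec, a receiver has very little time per packet, which is the packet load problem the jumbo frame was meant to ease.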
When Gigabit hardware became available, Tru64 NFS only supported
double buffered reads, i.e. when an application reads a file
sequentially, NFS would send a read request before the application
asked for the data. This performed poorly on SMP systems, as it didn't
allow all the CPUs to be busy handling reads and had no chance to keep
up with Gigabit loads. NFS was changed to issue two readaheads when
the application made each read request, with an eight read ceiling.
The first tests on Gigabit showed this let 4 CPU ES40s saturate
Gigabit when reading cached files on the server. Unlike 10Base2 and
FDDI, it wouldn't take years of hardware and software work to swamp
the new medium.
Unfortunately, things aren't quite that simple. While those
experiments were in 2001, it appears that few Gigabit switches can
keep up. Customers and benchmarking people keep running into problems,
but it's natural to look to the computers for the source of the
problem instead of the infrastructure. Such "NFS problems" may simply
be network congestion issues. Eventually switch vendors will increase
their buffering, but in the meantime there is a lot of hardware that
can't handle some rather simple configurations.
A Tru64 NFS client limits the network traffic it causes to one request
per thread plus the number of nfsiod threads. The latter assist by
doing read-aheads and write-behinds to allow the application
requesting the I/O to return to user level. A similar thing is done
with disk I/O, but NFS I/O is complicated by doing retransmits and
whatnot, so clients generally have full-fledged threads handling the
work.
Tru64’s default number of nfsiod threads
is 7, so a program reading or writing a file
can have up to 8 requests outstanding at once.
The standard I/O size of 48 KB means that there
may be 384 KB out. 384 KB appears to be enough
to swamp various infrastructures. 384 KB is 3,072
Kb - merely 3 msec of wire time. Consider two
simple cases.
You can do a simple test to see how switches handle two fast streams
flowing into a single stream. Consider one client reading files from
two servers, with all three connected to a Gigabit switch. The switch
will see 2 Gb/sec of data arriving and has to squeeze it out a 1
Gb/sec wire. Using separate programs (e.g. cp) to read the files,
there will be no more than nine outstanding reads. (The client doesn't
try to evenly allocate nfsiod threads to individual programs. While
that might be a problem, it's a separate problem and shouldn't affect
this at all.) If the result is less than what you got with just one
reader, then you probably have a congestion problem. The client's
"fragments dropped after timeout" counter will have incremented, which
says not all fragments of the read replies reached the client. Nfsstat
on the client will report timeouts and retransmissions. The servers
won't be at fault, as their load is less than in the single server
case. If you can change switches to another vendor's and get a
different fragment loss, then that will be more evidence that the
switches can't handle the load. Another experiment that can be very
useful is to reduce the number of nfsiod threads or remount the file
systems with a mount option like -o rsize=16384 and see if performance
improves.
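To see why fewer threads or a smaller rsize can help, work out how much data the client can have in flight in each case. This Python sketch does the arithmetic using the request counts and sizes mentioned above. The function names are made up, and the wire time assumes a bare 1 Gb/sec link with no protocol overhead.

```python
KB = 1024


def outstanding_bytes(requests, io_size_bytes):
    """Data in flight when every outstanding request is a full-size I/O."""
    return requests * io_size_bytes


def wire_time_msec(size_bytes, bits_per_second=1_000_000_000):
    """Milliseconds needed to move size_bytes over the link."""
    return size_bytes * 8 * 1000 / bits_per_second


cases = [
    ("one reader, 48 KB I/O", 8, 48 * KB),
    ("two readers, 48 KB I/O", 9, 48 * KB),
    ("two readers, rsize=16384", 9, 16384),
]
for label, requests, io_size in cases:
    size = outstanding_bytes(requests, io_size)
    print(f"{label}: {size // KB} KB, {wire_time_msec(size):.1f} msec")
```

Dropping rsize from 48 KB to 16 KB cuts the worst-case burst to a third, which is why it makes a quick test of whether switch buffering is the limit.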
One frustrating aspect of all this is that while network switches
track a huge amount of statistics, expect to have trouble finding
information about frames discarded due to congestion. This makes it
very hard to understand what is going on without a lot more effort or
discussions with the switch vendors.
In summary, if NFS seems to work okay as long as you don't read or
write files, be sure to consider the infrastructure. Various things to
do include:
- On the client, verify that retransmits are happening (nfsstat -cr).
Look under retrans for that.
- On the client (if reading) or server (if writing), see if there are
any IP "fragments dropped after timeout". This is a very strong
indication that the infrastructure is losing fragments.
- Experiment with various numbers of nfsiod threads or the rsize mount
option. You can kill and restart nfsiod on the fly. Change the
argument to change the number of helper threads that are started. If
performance jumps up when you decrease the number of threads below a
certain point, that's a very clear sign you've crossed a congestion
threshold.
- If you can enable "flow control" on your switches, that may greatly
improve matters. That will throttle fast senders, so if a system is
sending to both fast and slow receivers, the fast receiver may see
throughput go down.
When NFS hangs it can be a challenge to figure
out exactly why. Before diving into the kernel
and hunting for hung threads, the very first
thing to do is to go back to the beginning and
check to see if anything works.
Assuming that everything else still works, the important thing to do
is to find the threads that are involved and get their stack traces.
You already know how to find the user processes on the client, but
there's more to look at, and most people, even senior OS engineers,
don't commonly know where to find it. Sometimes important culprits can
be found as daemons or other user level code. You have to consider any
process that might be reading or writing any file! If the system is
low on memory, a system call may call code that flushes out NFS pages
to make space for its own I/O.
The easiest way to find a non-kernel thread stuck in NFS I/O is to
look for processes in the uninterruptible state. Of course, processes
in U state happen all the time. Consider this from a mail hub:
% ps ax | grep ' U '
372 ?? U 59:31.20 /usr/sbin/ypserv
10748 ??> U 0:01.13 imapd:
17604 ??> U 0:05.49 imapd:
21332 ??> U 0:02.91 imapd:
26344 ??> U 0:00.19 -AA26344 quarry.zk3.dec.com: DATA (sendmail)
27291 ??> U 0:00.38 imapd:
27829 ??> U 0:00.21 -AA27829 quarry.zk3.dec.com: DATA (sendmail)
% ps ax | grep ' U '
26043 ??> U 0:00.83 imapd:
26850 ??> U 0:05.71 perl /var/adm/ues/bin/cklocks
29186 ??> U 0:00.17 procmail -f ...
There was no repeat, therefore none of these processes are hung. The
most likely user process to hang is the update daemon which does a sync(2)
every thirty seconds. If your system is running multithreaded applications,
you may want to use ps axm | more and look for U flags.
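Comparing two snapshots by eye gets tedious on a busy system. The idea can be sketched in Python as follows; the function names are made up, and the regular expression assumes the column order of the ps output shown above (PID, terminal, state).

```python
import re


def uninterruptible_pids(ps_output):
    """Return the PIDs of processes shown in the U state by ps ax."""
    pids = set()
    for line in ps_output.splitlines():
        # Expect: PID, terminal, then a state column starting with U.
        match = re.match(r"\s*(\d+)\s+\S+\s+U\b", line)
        if match:
            pids.add(int(match.group(1)))
    return pids


def possibly_hung(first_snapshot, second_snapshot):
    """Return the PIDs that were in the U state in both snapshots."""
    return (uninterruptible_pids(first_snapshot)
            & uninterruptible_pids(second_snapshot))
```

Feed it the text of two `ps ax` runs taken some seconds apart. A PID that shows up in both is only a candidate: it may simply have been caught in a different short wait each time, so follow up with a stack trace.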
Nfsiod and nfsd spawn several kernel threads, threads that belong to
Pid 0 or its equivalent on a member. The rest of the time nfsiod and
nfsd perform bookkeeping duties and are rarely involved in "NFS
problems". While their kernel threads are the standard places to find
hangs, be sure to check all the other kernel threads too, especially
those involved with managing swapping, VM, and the UBC. You can get a
glimpse of the kernel threads via ps mlp 0 or this alternative:
# ps -p 0 -m -o wchan,state,time
WCHAN S TIME
* R < 1:23.44
- R N 0:00.00
malloc_ U < 0:00.04
4f027c U < 0:04.89
4f0464 U < 0:00.06
isp_rq S < 0:08.00
isp_abo I < 0:00.00
isp_fm I < 0:00.02
ss_tmo S < 0:00.00
isp_rq I < 0:00.00
isp_abo I < 0:00.00
isp_fm I < 0:00.00
ss_tmo S < 0:00.00
624298 U < 0:00.00
623df8 U < 0:00.00
netisr S < 1:07.23
87e13a28 S < 0:00.34
4c90b0 U < 0:00.00
4ca900 U < 0:00.00
4caac0 U < 0:00.00
624510 U < 0:00.00
ubc_dir U 0:00.00
648718 U < 0:00.00
648728 U < 0:00.05
623ae8 U < 0:01.12
648748 U < 0:00.00
648758 U < 0:00.00
61a8f8 U < 0:00.00
6486e8 U < 0:00.00
250670 U < 0:00.00
nfsiod_ I 0:00.00
nfsiod_ I 0:00.28
nfsiod_ I 0:00.01
nfsiod_ I 0:00.24
nfsiod_ I 0:00.00
nfsiod_ I 0:00.55
nfsiod_ I 0:00.54
5abf570 U < 0:00.00
5abf330 U < 0:00.00
5abea30 U < 0:00.00
nfs_tcp I 0:00.00
nfs_tcp I 0:00.00
nfs_tcp I 0:00.00
nfs_tcp I 0:00.00
nfs_tcp I 0:00.00
nfs_tcp I 0:00.00
nfs_tcp I 0:00.04
nfs_tcp I 0:00.00
nfs_udp I 0:00.00
nfs_udp I 0:00.00
nfs_udp I 0:00.00
nfs_udp I 0:00.00
nfs_udp I 0:00.02
nfs_udp I 0:00.00
nfs_udp I 0:00.00
nfs_udp I 0:00.01
NFS wchan names were selected to make it easier to pick out the NFS
threads. When they are busy, they will not have those wchan names, but
they will appear on the same lines of the ps output. That output is
useful mainly to see if some threads are hung. It's not easy to go
from a ps line to the thread address, but it's easy to get a list of
all kernel threads from the running system or a crash dump:
# dbx -k /vmunix
(dbx) set $pid=0
(dbx) tstack
On the client, the nfsiod program starts several kernel threads (part
of Pid 0) that take over reads and writes to NFS files and make sure
they happen. This includes retransmitting requests when replies are
not received in a timely fashion. Many times you can find that the
application Pid is waiting for I/O to complete and that one or more
nfsiod threads are doing the I/O but are waiting for a reply or for
transmit done processing to complete. Unfortunately, both of these
show similar stack traces:
(dbx) set $pid=4362
(dbx) tstack
Thread 0xfffffc0014ebd8c0:
> 0 thread_block()_
  1 mpsleep(0x0, 0xfffffc000ee31400, 0xfffffc0014f32680, 0x1004, 0x20000000001)
  2 clntkudp_callit_addr(h = 0xfffffc000ee31408, procnum = 7_
On the server, things are more variable. Usually one server thread is
stuck deep inside file system code waiting for something to happen and
all the other threads have identical stacks. These latter ones are not
interesting; all they show is that the client has given up waiting for
a reply and has retransmitted the request. The server has picked up
the request and has called code that is waiting for the original
request to finish and release whatever SMP or other locks its thread
holds.
Again, look at the stacks for the kernel threads. The NFS server ones
will stand out by their names and sizes. From the stack trace you can
generally decide if the problem lies with the file system, VM, or NFS.
It's very easy to leap over the hundreds of KB of kernel code to the
conclusion that NFS is broken. This is due in part to several warning
messages NFS prints on the console and users' terminals. It is also
due in part to the importance of NFS in many environments. While HTTP
may win out sometimes, many web servers return WWW pages that are
fetched from NFS servers.
ASE V4.0x systems in particular are extremely heavy NFS consumers, and
even access local file systems via NFS. Generally, when an ASE system
is having problems with NFS, people don't realize that the system is
doing little but NFS. Unfortunately, the conclusion is that NFS is at
fault when there are actually many possibilities.
Learn to analyze the many clues available to diagnose a problem. Once
you learn them, you will find that
- You can often diagnose a problem in a few minutes.
- You will have enough evidence to convince your support group that
you need their help.
- You will develop a reputation for bringing the right problem to the
right personnel. When you do ask for help, people will immediately
accept you may have a serious problem. Having good data can greatly
decrease the time to solving the problem. Just a few minutes
collecting data can save you days getting the problem resolved.
A final reminder: collect tcpdump traces, netstat output, tcpdump
traces, appropriate stack traces, and good tcpdump traces.