Discussion:
99% IOWAIT with Oracle RAC 10g (10.1.0.4) on Linux
m***@hotmail.com
2005-06-10 14:25:49 UTC
Permalink
Oracle 10.1.0.4 EE running on a 2-node RHEL 3 cluster (Oracle firewire kernel)
Shared storage: Maxtor OneTouch II

Periodically, I/O to the shared device seems to 'hang up' (i.e. 99% I/O wait in 'top') for exactly one minute when both instances are started.

At first I suspected that this was just a 'top' reporting anomaly, so I traced a SQL statement which runs for approx 30 seconds with only one instance started.

I then traced the session with both instances running and the execution time jumped to 90 seconds, which corresponds to the normal 30 seconds plus this strange 60-second timeout. When I tkprof'd the trace file, I could see that of the 90 seconds of response time, one individual 'db file scattered read' took 59.8 seconds. That is highly unusual for a single multi-block read:

Elapsed times include waiting on following events:
Event waited on                      Times    Max. Wait   Total Waited
                                    Waited
----------------------------------  -------  ----------  ------------
SQL*Net message to client                 2        0.00          0.00
db file scattered read                 6954       59.80         82.42
SQL*Net message from client               2      276.12        276.12

This issue is easily repeatable.

The thing that makes me think this is an I/O problem to the shared disk is that we had to increase the CSS misscount to 120 seconds because of repeated "Voting Disk timeout" errors which used to crash CRS on one of the nodes.

Anyone have any idea how to diagnose the source of this I/O hang?

When I run iostat during this period of 99% IOWAIT, there is no
activity to the shared disk at all. 0 bytes read, 0 bytes written.
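(For what it's worth, I'm watching it with something like the following - the interval is arbitrary and -x needs the sysstat package:

iostat -x 2        # extended per-device stats every 2 seconds

and the read/write columns for the firewire disk stay at zero for the whole minute.)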

Matt
Bart the bear
2005-06-10 18:58:01 UTC
Permalink
Matt, try netstat -s and see whether your packet counts are increasing.
On Linux there is a "watch" command which can repeat a given command at
regular intervals. See the packet counts with watch -n 5 "netstat -i".
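For example (the 5-second interval is arbitrary):

watch -n 5 "netstat -s | head -30"     # protocol counters: packets, errors, retransmits

If those counters keep climbing while iostat shows nothing, start looking at the interconnect.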
Common wisdom tells you that if it isn't disk-related, it must be network-related. Personally, I always blame the network first; that keeps the network guys on their toes. Your database may be synchronizing and reconfiguring the PCM lock database, which is a joyous process that freezes both instances. There is a definite logic there: if neither instance works, then you cannot break either of them. Hence Oracle is unbreakable, just as in the marketing papers.
m***@hotmail.com
2005-06-13 07:08:56 UTC
Permalink
If the system is waiting on the network, will this show up as I/O wait too?

I assumed it was only for disk I/O.

Matt
Bart the bear
2005-06-14 15:40:39 UTC
Permalink
Yes, it will. Network waits will also show up as I/O wait, as network I/O is usually done using read/write primitives.
Noons
2005-06-11 13:29:18 UTC
Permalink
Post by m***@hotmail.com
Anyone have any idea how to diagnose the source of this I/O hang?
When I run iostat during this period of 99% IOWAIT, there is no
activity to the shared disk at all. 0 bytes read, 0 bytes written.
Can you check if a process called "kupdated" is going flat out on CPU
when this happens?
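Something like this, run during the hang, will show it:

top -b -n 1 | grep kupdated

or just watch it interactively in top.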
--
Cheers
Nuno Souto
in sunny Sydney, Australia
***@yahoo.com.au.nospam
m***@hotmail.com
2005-06-13 07:04:07 UTC
Permalink
No process is using any CPU at the time of the I/O wait.
DA Morgan
2005-06-13 08:06:55 UTC
Permalink
Post by m***@hotmail.com
No process is using any CPU at the time of the I/O wait.
If you send a StatsPack by email I'll take a look at it.

Also please include (a quick way to gather these is sketched below):
netstat
iostat
vmstat
sar
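Something like this will capture them over a five-minute window (the intervals and counts are only suggestions):

vmstat 5 60    > vmstat.out &
iostat -x 5 60 > iostat.out &
sar -u 5 60    > sar.out &
netstat -i     > netstat.out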
--
Daniel A. Morgan
http://www.psoug.org
***@x.washington.edu
(replace x with u to respond)
m***@hotmail.com
2005-06-13 08:39:09 UTC
Permalink
Daniel,

I am in the middle of taking OCFS out of the equation and using raw partitions for my voting and registry (OCR) files...

I'm doing this by backing up these files, unmounting ocfs and restoring
them to the raw slices.
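(Roughly along these lines - the device names and paths below are just examples, not the exact ones I'm using:

# bind raw devices to the partitions
raw /dev/raw/raw1 /dev/sda5          # voting disk
raw /dev/raw/raw2 /dev/sda6          # OCR
# copy the files off OCFS onto the raw slices
dd if=/ocfs/crs/voting.dbf of=/dev/raw/raw1 bs=1M
dd if=/ocfs/crs/ocr.dbf    of=/dev/raw/raw2 bs=1M

then point CRS at the new locations.)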

If the problem persists after this I will send you the files and output
you requested (i.e. statspack etc)

Matt
Jeremy
2005-06-13 14:22:30 UTC
Permalink
hey, if you want to copy me in on those files i'll have a look at them
too
Noons
2005-06-13 08:10:21 UTC
Permalink
Post by m***@hotmail.com
No process is using any CPU at the time of the I/O wait.
Sorry, I thought you said it was 99% iowait in "top":
that is CPU flat out in a wait loop...
Jeremy
2005-06-11 17:10:14 UTC
Permalink
if you had to double the default disk read timeout for CRS then your disk read delay must be happening at a lower level of the stack than the cluster services... possibly in the filesystem or kernel?

- how frequently does it happen?
- are you using OCFS or raw partitions? (there was a bug with an older
version of OCFS where an instance would totally hang in a 'D' disk wait
state under certain circumstances.)
- what brand and model of firewire drive are you using? (some firewire
chipsets don't support multiple-login and so only one machine can
access the drive at a time.)

just a few ideas off the top of my head...
m***@hotmail.com
2005-06-13 07:07:39 UTC
Permalink
- how frequently does it happen?

Very frequently - I can cause the problem simply by generating any database I/O, but it also occurs when both instances are effectively idle.

- are you using OCFS or raw partitions? (there was a bug with an older version of OCFS where an instance would totally hang in a 'D' disk wait state under certain circumstances.)

OCFS for the cluster files, ASM for the database. I will search
metalink for OCFS bugs.

- what brand and model of firewire drive are you using? (some firewire chipsets don't support multiple-login and so only one machine can access the drive at a time.)

Maxtor Onetouch II with the Oxford 911. This was chosen because it
handled multiple logons.
m***@hotmail.com
2005-06-13 11:00:54 UTC
Permalink
Post by Jeremy
if you had to double the default disk read timeout for CRS then your disk read delay must be happening at a lower level of the stack than the cluster services... possibly in the filesystem or kernel?
I have removed the OCFS component and now store the Cluster shared
files (voting and registry) on raw partitions.

The problem still exists. If anything, it seems to be worse now. The
100% IOWAIT lasts for much longer.
Bart the bear
2005-06-14 15:42:17 UTC
Permalink
Let's go back to network.
m***@hotmail.com
2005-06-13 13:50:54 UTC
Permalink
More info:

When the IOWAIT goes to 100%, I run the following command to find the
processes that are waiting on I/O:

ps aux | grep D     (the 'D' is the uninterruptible sleep state, which is usually a process blocked on an I/O request)

This gives me 2 processes:

evmd and
ocssd.bin

The ocssd.bin process is the first to enter the 'D' state. I have used 'strace' on both of these processes: when the IOWAIT is high (99%), I see delayed pwrite() calls; when the IOWAIT drops back down, the same pwrite() calls execute very quickly.
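(In case it's useful, this is roughly what I'm running - the awk filter is just a tidier version of the grep above, and strace's -tt/-T flags print a timestamp and the time spent in each syscall:

ps -eo pid,stat,comm | awk '$2 ~ /D/'       # only processes in uninterruptible sleep
strace -tt -T -p $(pgrep -x ocssd.bin)      # per-syscall elapsed time; assumes a single ocssd.bin process

The delayed pwrite() calls then show their elapsed time in the <...> column.)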

Any ideas?

Matt
Jeremy
2005-06-13 14:39:45 UTC
Permalink
Post by m***@hotmail.com
The ocssd.bin process is the first to enter the 'D' state. I have used 'strace' on both of these processes: when the IOWAIT is high (99%), I see delayed pwrite() calls; when the IOWAIT drops back down, the same pwrite() calls execute very quickly.
... and if you simply "shutdown immediate" the second instance (but the
server is still running, the ASM instance is still running, etc) --
then this does NOT happen and your query executes in 30 seconds again?
(or whatever your baseline time is...)

also, if you happen to email those files my way, then could you include
the alert log and /var/log/messages? thx...

/j
m***@hotmail.com
2005-06-14 15:29:28 UTC
Permalink
After a few days of fault finding I've managed to get to the root of the problem (thanks to the help of Jeremy and Daniel):

The /var/log/messages file had the following entries in it every time
the IOWAIT went through the roof:

Jun 13 12:20:30 linux1 kernel: ieee1394: sbp2: aborting sbp2 command
Jun 13 12:20:30 linux1 kernel: Read (10) 00 05 e7 d2 80 00 00 05 00
Jun 13 12:20:30 linux1 kernel: ieee1394: sbp2: aborting sbp2 command
Jun 13 12:20:30 linux1 kernel: Read (10) 00 00 15 9a 60 00 00 05 00

So the problem appeared to be either in the sbp2 driver or the hard drive itself. The hard drive has the Oxford 911 chipset, so my investigation centered on the sbp2 driver.

A good dig around Google for the abort messages above led me to an optional parameter for loading the sbp2 module:

sbp2_serialize_io

By adding the following line to /etc/modules.conf and rebooting each node, I have solved the problem:

options sbp2 sbp2_serialize_io=1

This option is generally used to work around bugs in the sbp2 driver, or for debugging purposes, so I suspect it may be slower than the default setting. But for my purposes, stability is the main priority.

Thanks to everyone who contributed to the thread....

BTW - to confirm that the option is in effect, check for the following string in the 'dmesg' output:

ieee1394: sbp2: Driver forced to serialize I/O (serialize_io = 1)
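i.e. something like this on each node should return that line:

dmesg | grep -i serialize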

Cheers

Matt
Noons
2005-06-15 03:49:28 UTC
Permalink
Post by m***@hotmail.com
After a few days of fault finding I've managed to get to the root of
And thank you for taking the trouble and time to get back
here with the solution! If only more did the same, this place
would be even better as a source of info. Once again:
much appreciated!
Bart the bear
2005-06-15 17:26:36 UTC
Permalink
Matt, IEEE1394 is amberwire (actually, it's firewire, but it's not exactly burning). Your waits are related to your disk adapter. Maybe UltraSCSI-640 would serve you better than @#$%! firewire, which is slow as a snail. Look at www.t10.org.
Bart the bear
2005-06-15 18:16:53 UTC
Permalink
Sorry guys, I am an idiot. I haven't read the entire thread. Please forgive my mom for unleashing me onto the world.
DA Morgan
2005-06-16 04:40:53 UTC
Permalink
Post by Bart the bear
Sorry guys, I am an idiot. I haven't read the entire thread. Please forgive my mom for unleashing me onto the world.
Even your father, though we've never met.

Fear not ... anyone who has been here for a while gets their chance to eat crow. Me far more times than I'd like to recommend to others.
--
Daniel A. Morgan
http://www.psoug.org
***@x.washington.edu
(replace x with u to respond)
DA Morgan
2005-06-16 04:39:55 UTC
Permalink
Post by Bart the bear
Matt, IEEE1394 is amberwire (actually, it's firewire, but it's not exactly burning). Your waits are related to your disk adapter. Maybe UltraSCSI-640 would serve you better than @#$%! firewire, which is slow as a snail. Look at www.t10.org.
And how do you propose to dual mount it and make it available to a RAC
cluster?
--
Daniel A. Morgan
http://www.psoug.org
***@x.washington.edu
(replace x with u to respond)
Mladen Gogala
2005-06-17 00:36:09 UTC
Permalink
Post by DA Morgan
And how do you propose to dual mount it and make it available to a RAC
cluster?
Buy the proper adapter. It must act as a terminator if the host
goes down. Most of the early clusters (OPS) were SCSI clusters.
--
I either want less corruption, or more chance to participate in it.
DA Morgan
2005-06-17 02:45:23 UTC
Permalink
Post by Mladen Gogala
Post by DA Morgan
And how do you propose to dual mount it and make it available to a RAC
cluster?
Buy the proper adapter. It must act as a terminator if the host
goes down. Most of the early clusters (OPS) were SCSI clusters.
Need RAW for VOTE and OCR even if one were to use OCFS.
--
Daniel A. Morgan
http://www.psoug.org
***@x.washington.edu
(replace x with u to respond)
JSchneider
2005-06-17 13:53:56 UTC
Permalink
Post by DA Morgan
Need RAW for VOTE and OCR even if one were to use OCFS.
No you don't. The first 5 clusters I set up on OCFS used files on the OCFS filesystem for voting and OCR. I don't think it's good practice, but you can certainly put those on OCFS.
DA Morgan
2005-06-17 14:32:52 UTC
Permalink
Post by JSchneider
Post by DA Morgan
Need RAW for VOTE and OCR even if one were to use OCFS.
No you don't. The first 5 clusters I set up on OCFS used files on the OCFS filesystem for voting and OCR. I don't think it's good practice, but you can certainly put those on OCFS.
You are correct. I had my head in a Mac cluster and was thinking in that
environment where the only solution is RAW.
--
Daniel A. Morgan
http://www.psoug.org
***@x.washington.edu
(replace x with u to respond)