m***@hotmail.com
2005-06-10 14:25:49 UTC
Oracle 10.1.0.4 EE running on 2 node RHEL 3 cluster (Oracle Firewire
Kernel)
Shared Storage : Maxtor One Touch II
Periodically, I/O to the shared device seems to 'hang
up' (i.e. 99% I/O wait in 'top') for exactly 1 minute when both
instances are booted.
At first I suspected that this was just a 'top' reporting anomaly, so I
traced a SQL statement which runs for approx 30 seconds with only one
instance started.
I then traced the session with both instances running and the execution
time jumped to 90 seconds, which corresponds to the normal 30 secs plus
this strange 60 second timeout. When I tkprof'd the trace file, I can
see that of the 90 seconds response time, 1 individual 'db file
scattered read' took 59.8 seconds. This is highly unusual for one
multi-block read:
Elapsed times include waiting on following events:

  Event waited on                   Times   Max. Wait  Total Waited
                                   Waited
  -------------------------------  ------  ----------  ------------
  SQL*Net message to client             2        0.00          0.00
  db file scattered read             6954       59.80         82.42
  SQL*Net message from client           2      276.12        276.12
This issue is easily repeatable.
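To confirm it's always a single ~60 second wait (and not many smaller ones), you can scan the raw 10046 trace file itself rather than relying on tkprof's summary. A minimal sketch, assuming the standard 10g WAIT line format where 'ela=' is in microseconds (the threshold and file name here are illustrative):

```python
import re

# Matches lines like: WAIT #1: nam='db file scattered read' ela= 59800000 p1=...
WAIT_RE = re.compile(r"WAIT #\d+: nam='([^']+)' ela= *(\d+)")

def long_waits(lines, threshold_us=1_000_000):
    """Yield (event_name, elapsed_seconds) for waits longer than threshold."""
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            ela_us = int(m.group(2))
            if ela_us >= threshold_us:
                yield (m.group(1), ela_us / 1e6)

# Example usage against a trace file (path is hypothetical):
# with open("orcl_ora_12345.trc") as f:
#     for event, secs in long_waits(f):
#         print(f"{secs:8.2f}s  {event}")
```

If exactly one 'db file scattered read' near 59.8s shows up per run, that points at a single stalled I/O rather than generally slow multi-block reads.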
The thing that makes me think that this is an I/O problem to the shared
disk is that we had to increase the CSS misscount to 120 seconds
because of repeated "Voting Disk timeout" errors which used to crash
CRS on one of the nodes.
Anyone have any idea how to diagnose the source of this I/O hang?
When I run iostat during this period of 99% IOWAIT, there is no
activity to the shared disk at all. 0 bytes read, 0 bytes written.
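To timestamp the stall window independently of iostat, a small sampler over /proc/diskstats can flag every interval in which the shared disk moved zero sectors. A rough sketch, assuming the standard Linux /proc/diskstats field layout (the device name 'sda' is a placeholder for whatever the FireWire disk shows up as):

```python
import time

def delta_io(prev, curr):
    """Sectors read/written between two (read, written) samples."""
    return (curr[0] - prev[0], curr[1] - prev[1])

def read_diskstats(device, path="/proc/diskstats"):
    """Return (sectors_read, sectors_written) for `device`."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            # field 2 = device name, field 5 = sectors read, field 9 = sectors written
            if fields[2] == device:
                return (int(fields[5]), int(fields[9]))
    raise ValueError("device %r not found" % device)

def monitor(device="sda", interval=5):
    """Print a line per interval; mark intervals with zero disk activity."""
    prev = read_diskstats(device)
    while True:
        time.sleep(interval)
        curr = read_diskstats(device)
        rd, wr = delta_io(prev, curr)
        stamp = time.strftime("%H:%M:%S")
        flag = "  <-- STALLED" if rd == 0 and wr == 0 else ""
        print("%s read=%d written=%d%s" % (stamp, rd, wr, flag))
        prev = curr
```

Running this on both nodes while reproducing the hang would show whether the 60 seconds of 99% iowait really coincides with zero completed I/O on the shared device, and whether both nodes stall at the same moment.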
Matt