Database IO Performance Troubleshooting
When we run into database performance issues, the first step is usually to open the AWR report and check whether any of the top wait events are slowing down the system.
In this case we see something like the report below. The top wait event is db file sequential read, which indicates that most of the waits are single-block reads, typically caused by index access.
Top 5 Timed Foreground Events

| Event | Waits | Time (s) | Avg wait (ms) | % DB time | Wait Class |
|---|---|---|---|---|---|
| db file sequential read | 21,407,957 | 240,686 | 11 | 81.24 | User I/O |
| DB CPU | | 30,996 | | 10.46 | |
| control file sequential read | 1,643,451 | 2,186 | 1 | 0.74 | System I/O |
| SQL*Net more data to client | 16,900,143 | 2,162 | 0 | 0.73 | Network |
| direct path read | 70,251 | 1,719 | 24 | 0.58 | User I/O |
Now, what are the various index access paths the optimizer can choose from?
- Index Unique Scan
- Index Range Scan
- Index Full Scan
- Index Fast Full Scan
So, when there is an unusual increase in db file sequential read waits in the AWR report, our immediate suspect is the Index Full Scan. We need to check whether any query plans have changed and are now causing this high volume of index reads. An index full scan is performed with single-block I/O, so a plan that switches to it can noticeably hurt query performance.
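One quick way to look for such plans is to scan the AWR plan history for index full scans. This is only a minimal sketch against DBA_HIST_SQL_PLAN; in a real system you would filter further by snapshot range or schema.

SELECT DISTINCT sql_id, plan_hash_value, object_name
FROM dba_hist_sql_plan
WHERE operation = 'INDEX'
AND options = 'FULL SCAN';

Any SQL_ID returned here is a candidate whose plan history is worth comparing against its current cursor plan.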
Next, we take the SQL_IDs of the top I/O queries and check whether their execution plans have changed recently. The query below can be used to verify this.
SELECT sql_id, plan_hash_value, timestamp
FROM dba_hist_sql_plan
WHERE sql_id IN ('<sql_id1>', '<sql_id2>')
ORDER BY timestamp;
If a new plan hash value appears on a recent date, the execution plan has changed and the troubleshooting becomes straightforward: compare the older plan with the current (cursor) plan and correct it, either by gathering fresh statistics or by using SQL plan baselines. The quickest fix is usually a baseline, but evaluate that option carefully as it has its own advantages and disadvantages.
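If a good historical plan is still present in the cursor cache, it can be captured as a baseline with DBMS_SPM. The snippet below is only an illustration; '<sql_id1>' and <good_plan_hash_value> are placeholders, and the usual caveats about fixed baselines apply.

DECLARE
  l_plans_loaded PLS_INTEGER;
BEGIN
  -- placeholders: replace with the real SQL_ID and the known good plan hash value
  l_plans_loaded := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
                      sql_id          => '<sql_id1>',
                      plan_hash_value => <good_plan_hash_value>,
                      fixed           => 'YES');
  DBMS_OUTPUT.PUT_LINE('Plans loaded: ' || l_plans_loaded);
END;
/

The baseline can later be altered or dropped with DBMS_SPM.ALTER_SQL_PLAN_BASELINE or DBMS_SPM.DROP_SQL_PLAN_BASELINE once the underlying issue (for example, stale statistics) is addressed.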
In this case, we did not find any change in the execution plans, so what next? We started comparing the elapsed time of the queries with past AWR reports and found that the elapsed time had increased by about 3x for some of them. This was an OLTP system with as many as ~5M executions of the top query in a 60-minute window. Its elapsed time was about 0.01 seconds per execution when the system was performing optimally; it had now increased to about 0.03 seconds per execution and was slowing down the transactions.
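The per-execution elapsed time trend for a suspect statement can also be pulled directly from AWR instead of from individual reports. A minimal sketch against DBA_HIST_SQLSTAT joined to DBA_HIST_SNAPSHOT ('<sql_id1>' is again a placeholder):

SELECT sn.begin_interval_time,
       st.plan_hash_value,
       st.executions_delta,
       ROUND(st.elapsed_time_delta / NULLIF(st.executions_delta, 0) / 1000000, 4) AS elapsed_sec_per_exec
FROM dba_hist_sqlstat st
JOIN dba_hist_snapshot sn
  ON sn.snap_id = st.snap_id
 AND sn.dbid = st.dbid
 AND sn.instance_number = st.instance_number
WHERE st.sql_id = '<sql_id1>'
ORDER BY sn.begin_interval_time;

A jump in elapsed time per execution with an unchanged plan_hash_value, as seen here, points away from the optimizer and towards the I/O stack.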
It was now evident that there was an issue with the I/O subsystem, so drilling down to disk I/O was required. Below is the "iostat -D 5 50" output; some of the disks consistently show average service times far above 10ms, with write service times approaching or exceeding 50ms. For OLTP systems, the average service time should be under 10ms.
hdisk177    xfer:  %tm_act      bps      tps      bread      bwrtn
                      84.0     3.5M    142.3       3.0M     507.9K
            read:      rps  avgserv  minserv  maxserv   timeouts      fails
                     117.0     23.5      0.3      1.6S          0          0
            write:     wps  avgserv  minserv  maxserv   timeouts      fails
                      24.6     52.7      0.5      1.3S          0          0
            queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                       2.2      0.0    133.3      0.0        5.0       10.3

hdisk173    xfer:  %tm_act      bps      tps      bread      bwrtn
                      66.0     2.9M    125.0       2.5M      316.0
            read:      rps  avgserv  minserv  maxserv   timeouts      fails
                     107.0     15.8      0.3    650.3          0          0
            write:     wps  avgserv  minserv  maxserv   timeouts      fails
                      18.0     39.6      0.6    231.7          0          0
            queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                       1.5      0.0    156.4      0.0        2.0        2.3

hdisk178    xfer:  %tm_act      bps      tps      bread      bwrtn
                      69.8     3.6M    130.9       3.1M     414.3K
            read:      rps  avgserv  minserv  maxserv   timeouts      fails
                     110.7     13.7      0.3    483.5          0          0
            write:     wps  avgserv  minserv  maxserv   timeouts      fails
                      20.1     42.2      1.1    150.9          0          0
            queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                       0.7      0.0    111.3      0.0        2.0        1.4

hdisk179    xfer:  %tm_act      bps      tps      bread      bwrtn
                      68.1     3.2M    131.3       2.8M     423.6K
            read:      rps  avgserv  minserv  maxserv   timeouts      fails
                     109.3     15.4      0.2    429.6          0          0
            write:     wps  avgserv  minserv  maxserv   timeouts      fails
                      22.0     46.5      0.6    147.4          0          0
            queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                       1.2      0.0    124.9      0.0        3.0        2.4
The report above shows very high service times for some of the disks, so it was evident that the I/O problem was coming from the storage layer. Further investigation on the storage side revealed multipath errors. After the storage software was upgraded, the I/O response time improved and the query elapsed times came back down with it.
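The same latency picture can be cross-checked from inside the database, which is useful both before engaging the storage team and for confirming the fix afterwards. A small sketch using V$EVENT_HISTOGRAM, which buckets waits for an event by latency:

SELECT event, wait_time_milli, wait_count
FROM v$event_histogram
WHERE event = 'db file sequential read'
ORDER BY wait_time_milli;

On a healthy OLTP system the bulk of the wait counts should sit in the low single-digit millisecond buckets; a shift towards the 16/32/64 ms buckets matches the high service times reported by iostat.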
There are several other factors that can cause an I/O bottleneck, such as the ones below (a quick check for disk sorts is sketched after the list):
- High number of disk sorts
- Conventional Data loads on indexed table
- Increase in height of index
- Export/RMAN Backup
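As an example of the first item in the list, disk sort activity is easy to check from the instance statistics (or to trend across AWR snapshots via DBA_HIST_SYSSTAT); a minimal sketch:

SELECT name, value
FROM v$sysstat
WHERE name IN ('sorts (memory)', 'sorts (disk)');

A rising 'sorts (disk)' count usually points to an undersized PGA or to queries sorting far more data than expected.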
So, it is important to look at the database from all angles before zeroing in on the root cause.