Diagnose DISK Failure
Collect the following information when a platform reports SCSI disk errors:
The following files:
/var/log/messages*
The output from the following commands (a simple collection script is sketched after this list):
/sbin/fdisk -l
/bin/cat /proc/scsi/scsi
/bin/dmesg
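If the same data must be gathered from several systems, the collection can be scripted. The following is a minimal sketch, not a standard tool; the /tmp/disk-diag directory and archive name are arbitrary choices:

# mkdir /tmp/disk-diag
# cp /var/log/messages* /tmp/disk-diag/
# /sbin/fdisk -l > /tmp/disk-diag/fdisk-l.out 2>&1
# /bin/cat /proc/scsi/scsi > /tmp/disk-diag/proc-scsi-scsi.out
# /bin/dmesg > /tmp/disk-diag/dmesg.out
# tar czf /tmp/disk-diag.tar.gz -C /tmp disk-diag

The resulting tar file contains everything requested above and can be attached to a support case.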
In the /var/log/messages excerpt below, we can see that the channel is 0, the id is 1, and the lun is 0. Each error line refers to the same id, and the disk is getting errors on different sectors. In this case, id 1 is target 1, which indicates that target 1 is getting the errors and is the suspect FRU.
/var/log/messages
Dec 20 10:33:23 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:23 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:23 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:23 localhost kernel: I/O error: dev 08:03, sector 13631520
Dec 20 10:33:23 localhost kernel: EXT3-fs error (device sd(8,3)): ext3_get_inode_loc: unable to read inode block - inode=852064, block=1703940
Dec 20 10:33:23 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:23 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:23 localhost kernel: EXT3-fs error (device sd(8,3)) in ext3_reserve_inode_write: IO failure
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 13631520
Dec 20 10:33:24 localhost kernel: EXT3-fs error (device sd(8,3)): ext3_get_inode_loc: unable to read inode block - inode=852064, block=1703940
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:24 localhost kernel: EXT3-fs error (device sd(8,3)) in ext3_reserve_inode_write: IO failure
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 0
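To confirm that the errors are confined to a single target, it can help to tally the error lines by host, channel, id, and lun. The following is a minimal sketch using standard text tools; it keys off the literal host/channel/id/lun words in the 2.4-era message format shown above:

# grep "SCSI disk error" /var/log/messages* | awk '{for (i=1; i<=NF; i++) if ($i == "host") print "host", $(i+1), "channel", $(i+3), "id", $(i+5), "lun", $(i+7)}' | sort | uniq -c

A count that is concentrated on one id (here, id 1) supports the single-disk FRU diagnosis; errors spread across several ids point instead at the bus, cabling, or controller.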
In the following example, notice that the errors again report the same channel 0, id 1, and lun 0 each time. We also see SCSI bus timeouts along with SCSI check-condition messages for the same device. In this case, target 1 is the suspect FRU.
Another example of /var/log/messages
Sep 21 23:35:41 localhost kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 21 23:35:41 localhost kernel: Inspecting /boot/System.map-2.4.18-17.7.x.4smp
Sep 21 23:35:41 localhost kernel: Loaded 17857 symbols from /boot/System.map-2.4.18-17.7.x.4smp.
Sep 21 23:35:41 localhost kernel: Symbols match kernel version 2.4.18.
Sep 21 23:35:41 localhost kernel: Loaded 256 symbols from 11 modules.
Sep 21 23:35:41 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0
return code = 27010000
Sep 21 23:35:41 localhost kernel: I/O error: dev 08:17, sector 66453508
Sep 21 23:35:41 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0
return code = 27010000
:
:
Sep 21 23:35:49 localhost kernel: scsi : aborting command due to timeout : pid
43891492, scsi0, channel 0, id 1, lun 0 Write (10) 00 00 4b ae 5b 00 00 02 00
Sep 21 23:35:49 localhost kernel: mptscsih: OldAbort scheduling ABORT SCSI IO
(sc=c2db7200)
Sep 21 23:35:49 localhost kernel: IOs outstanding = 5
Sep 21 23:35:49 localhost kernel: scsi : aborting command due to timeout : pid
43891493, scsi0, channel 0, id 1, lun 0 Write (10) 00 00 43 2e 5d 00 00 02 00
:
:
Sep 21 23:35:49 localhost kernel: mptscsih: ioc0: Issue of TaskMgmt Successful!
Sep 21 23:35:49 localhost kernel: SCSI host 0 abort (pid 43891492) timed out - resetting
Sep 21 23:35:49 localhost kernel: SCSI bus is being reset for host 0 channel 0.
Sep 21 23:35:50 localhost kernel: mptscsih: OldReset scheduling BUS_RESET (sc=c2db7200)
Sep 21 23:35:50 localhost kernel: IOs outstanding = 6
Sep 21 23:35:50 localhost kernel: SCSI host 0 abort (pid 43891493) timed out - resetting
:
:
Sep 21 23:35:51 localhost kernel: SCSI host 0 reset (pid 43891492) timed out again -
Sep 21 23:35:51 localhost kernel: probably an unrecoverable SCSI bus or device hang.
Sep 21 23:35:51 localhost kernel: SCSI host 0 reset (pid 43891493) timed out again -
Sep 21 23:35:51 localhost kernel: SCSI Error Report =-=-= (0:0:0)
Sep 21 23:35:51 localhost kernel: SCSI_Status=02h (CHECK CONDITION)
Sep 21 23:35:51 localhost kernel: Original_CDB[]: 28 00 02 B1 4E 62 00 00 04 00
Sep 21 23:35:51 localhost kernel: SenseData[12h]: 70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00
Sep 21 23:35:51 localhost kernel: SenseKey=6h (UNIT ATTENTION); FRU=02h
Sep 21 23:35:51 localhost kernel: ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
Sep 21 23:35:51 localhost kernel: SCSI Error Report =-=-= (0:1:0)
Sep 21 23:35:51 localhost kernel: SCSI_Status=02h (CHECK CONDITION)
Sep 21 23:35:51 localhost kernel: Original_CDB[]: 2A 00 00 45 EE 5F 00 00 02 00
Sep 21 23:35:51 localhost kernel: SenseData[12h]: 70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00
Sep 21 23:35:51 localhost kernel: SenseKey=6h (UNIT ATTENTION); FRU=02h
Sep 21 23:35:51 localhost kernel: ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
Sep 21 23:35:51 localhost kernel: md3: no spare disk to reconstruct array! -- continuing in degraded mode
Sep 21 23:35:51 localhost kernel: md: updating md2 RAID superblock on device
Sep 21 23:35:52 localhost kernel: md: (skipping faulty sdb5 )
Sep 21 23:35:52 localhost kernel: md: sda5 [events: 00000012]<6>(write) sda5's sb offset: 4192832
Sep 21 23:35:52 localhost kernel: raid1: sda7: redirecting sector 30736424 to another mirror
In the following example, we see write errors on channel 0, id 0, lun 0, along with a medium error indicating a problem with the device itself. In this example, target 0 is the suspect FRU.
/var/log/messages
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 06 1d 3a 0d 00 00 08 00
Info fld=0x61d3a0d, Deferred sd08:02: sense key Medium Error
Additional sense indicates Write error
I/O error: dev 08:02, sector 102498376
SCSI Error: (0:0:0) Status=02h (CHECK CONDITION)
Key=3h (MEDIUM ERROR); FRU=0Ch
ASC/ASCQ=0Ch/02h “”
CDB: 2A 00 07 F9 3A 2D 00 00 08 00
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 07 f9 3a 2d 00 00 08 00
Info fld=0x8153a0d, Deferred sd08:02: sense key Medium Error
Additional sense indicates Write error - auto reallocation failed
I/O error: dev 08:02, sector 133693544
Here is an example of the output you would see from cat /proc/scsi/scsi. This gives you the model number of each drive, which can help in determining a part number and disk ID. The output varies depending on platform model and configuration:
cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST373307LC Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: SEAGATE Model: ST373307LC Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
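The /proc/scsi/scsi listing identifies drives by host, channel, id, and lun, while the error messages may also reference the sd device name or a major:minor pair such as dev 08:03. On 2.4-era kernels, dmesg usually records the mapping when the disks are first attached, so a grep along the following lines ties id 1 back to its sd name; the exact wording can vary by kernel version, and the two output lines below are illustrative, matching the two-disk configuration above:

# dmesg | grep -i "attached scsi disk"
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0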
Here is an example of what you might see with dmesg. In the output shown, there is an issue with channel 0, id 1, and lun 0. This example needs more data to determine the failure; however, one may suspect that the disk needs to be replaced, as it shows an I/O error twice on the same disk.
Sector 3228343
scsi0 (1:0): rejecting I/O to offline device
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
disk 1, wo:1, o:0, dev:sdb1
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
scsi0 (1:0): rejecting I/O to offline device
md: write_disk_sb failed for device sdb2
md: errors occurred during superblock update, repeating
scsi0 (1:0): rejecting I/O to offline device
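Because the md driver messages above show writes to sdb failing while the mirror continues on one disk, the software RAID state can be checked directly to confirm which member has been dropped. A failed RAID1 member is flagged with (F) or is simply missing in /proc/mdstat; the device names and sizes below are illustrative:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda1[0] sdb1[1](F)
      4192832 blocks [2/1] [U_]

A [2/1] count and a [U_] status indicate that the array is running degraded on a single member.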
Disk Performance
Using the dd Command to Determine Sequential I/O Speed
The dd command provides a simple way to measure sequential I/O performance. The following shows a sequential read of 1GB (1024MB). There are 1024 1MB (1024KB) reads:
# time -p dd if=/dev/raw/raw8 of=/dev/null bs=1024k count=1024
1024+0 records in
1024+0 records out
real 6.77
user 0.00
sys 0.04
The megabytes per second can be calculated as follows:
1GB/6.77 sec = 151.25MBps
We saw 1024 I/Os, where each I/O has a boundary set to a block size of 1MB by the bs option of the dd command. Calculating MBps is simply the quotient of 1024MB / 6.77 seconds, which gives roughly 151MBps. In our testing, the cache on the array is clear, and 151MBps is not bad for a single LUN/LDEV on a single path. However, determining whether the bus was saturated, and whether the service time for each I/O was within specification, are valid concerns. Each question requires more scrutiny.
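The same arithmetic is easy to script so that each run reports its throughput consistently. A minimal sketch, assuming bc is installed and using the transfer size and elapsed time from the run above:

# MB=1024
# ELAPSED=6.77
# echo "scale=2; $MB / $ELAPSED" | bc
151.25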
Using sar and iostat to Measure Disk Performance
Continuing with the dd command, we repeat the test but focus only on the data reported by the sar command, which depicts service time and other characteristics, as shown in the following output.
# sar -d 1 100
Linux 2.6.5-7.97-default (atlorca2) 05/09/05
14:19:23 DEV tps rd_sec/s wr_sec/s
14:19:48 dev8-144 0.00 0.00 0.00
14:19:49 dev8-144 0.00 0.00 0.00
14:19:50 dev8-144 178.00 182272.00 0.00
14:19:51 dev8-144 303.00 311296.00 0.00
14:19:52 dev8-144 300.00 307200.00 0.00
14:19:53 dev8-144 303.00 309248.00 0.00
14:19:54 dev8-144 301.00 309248.00 0.00
14:19:55 dev8-144 303.00 311296.00 0.00
14:19:56 dev8-144 302.00 309248.00 0.00
This sar output shows that the total number of transfers per second (tps) holds around 300. rd_sec/s measures the number of sectors read per second, and each sector is 512 bytes. Divide rd_sec/s by tps, and you have the number of sectors in each transfer. In this case, the average is 1024 sectors at 512 bytes each, which puts the average SCSI block size at 512KB. This is a very important discovery: although the dd command requests a block size of 1MB, the SCSI driver breaks the request into 512KB blocks, so two physical I/Os complete for every logical I/O requested. Different operating systems hard-code this value in the SCSI driver at different block sizes, so be aware of this issue when troubleshooting.
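The sectors-per-transfer arithmetic can also be applied directly to the sar output with awk. A minimal sketch, assuming the column layout shown above (timestamp, DEV, tps, rd_sec/s, wr_sec/s) and the dev8-144 device used in this test:

# sar -d 1 100 | awk '$2 == "dev8-144" && $3 > 0 {printf "%s avg read size: %.0f KB\n", $1, $4 / $3 * 512 / 1024}'

For the samples above, the result holds near 512KB, confirming the blocking size imposed by the sd driver.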
As always, more than one way exists to capture I/O stats. In this case, iostat may suit your needs. This example uses iostat rather than sar to evaluate the dd run.
#iostat
Linux 2.6.5-7.97-default (atlorca2) 05/09/05
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.01 0.02 0.08 99.89
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdj 0.06 57.76 0.00 15222123 72
sdj 0.00 0.00 0.00 0 0
sdj 0.00 0.00 0.00 0 0
sdj 98.00 102400.00 0.00 102400 0
sdj 298.00 305152.00 0.00 305152 0
sdj 303.00 309248.00 0.00 309248 0
sdj 303.00 311296.00 0.00 311296 0
sdj 301.00 307200.00 0.00 307200 0
sdj 302.00 309248.00 0.00 309248 0
sdj 302.00 309248.00 0.00 309248 0
sdj 141.00 143360.00 0.00 143360 0
MBps can also be derived from iostat. Dividing the blocks read per second (Blk_read/s) by the transactions per second (tps) gives the average request size: 311296 Blk_read/s / 303 tps = 1027.3 blocks x 512 bytes/block = 526018 bytes / 1024 bytes/KB = 513KB average per request. Multiplying that average request size back by the transaction rate (303 tps x 513KB) yields roughly 152MBps, consistent with the dd result.
Before we explain the importance of the blocking size based on a given driver, let us demonstrate the same test results with a different block size. Again, we move 1GB of data through a raw character device using a much smaller block size. It is very important to understand that the exact same 1GB of data is being read by dd and written to /dev/null.
Understanding the Importance of I/O Block Size When Testing Performance
The I/O block size can impact performance. By reducing the dd read block size from 1024k to 2k, giving an FCP payload of 2k, the SCSI disk (sd) driver delivers about 1/16 of the performance. Additionally, the I/O rate increases dramatically as the block size of each request drops to that of the FCP limit. In the first example, the sd driver was blocking on 512k, which put the I/O rate around 300 per second. In the world of speed, 300 I/Os per second is rather dismal; however, we must keep that number in perspective because we were moving large data files at over 100MBps. Though the I/O rate was low, the MBps was enormous.
Most applications use an 8K block size. In the following demonstration, we use a 2K block size to illustrate the impact of I/O payload (I/O size).
atlorca2:~ # time -p dd if=/dev/raw/raw8 of=/dev/null bs=2k count=524288
524288+0 records in
524288+0 records out
real 95.98
user 0.29
sys 8.78
You can easily see that simply changing the block size of a data stream from 1024k to 2k drastically changes the time it takes to move a large amount of data. The time to transfer 1GB has increased more than thirteenfold, from less than 7 seconds to almost 96 seconds, which should highlight the importance of block size to any bean counter.
We can use sar to determine the average I/O size (payload).
atlorca2:~ # sar -d 1 100 | grep dev8-144
14:46:50 dev8-144 5458.00 21832.00 0.00
14:46:51 dev8-144 5478.00 21912.00 0.00
14:46:52 dev8-144 5446.00 21784.00 0.00
14:46:53 dev8-144 5445.00 21780.00 0.00
14:46:54 dev8-144 5464.00 21856.00 0.00
14:46:55 dev8-144 5475.00 21900.00 0.00
14:46:56 dev8-144 5481.00 21924.00 0.00
14:46:57 dev8-144 5467.00 21868.00 0.00
From the sar output, we see 21868 rd_sec/s with a tps of 5467. The quotient 21868/5467 gives four sectors per transaction, which equates to 2048 bytes, or 2K. This calculation shows that we are moving much smaller chunks of data, but at an extremely high rate of roughly 5500 I/Os per second. Circumstances do exist where the I/O rate is the sole concern, as with the access rates of a company's Web site. However, shifting perspective from something as simple as a Web transaction to backing up the entire corporate database puts sharp focus on the fact that block size matters. Remember, backup utilities use large block I/O, usually 64k.
With the understanding that small block I/O impedes large data movements, note that filesystem fragmentation and sparse file fragmentation can cause an application's request to be broken into very small I/Os. In other words, even though dd if=/file_system/file_name of=/tmp/out_file bs=128k requests reads with a 128k block size, sparse file or filesystem fragmentation can force each read to be broken into much smaller blocks. So, as we continue to dive into performance troubleshooting throughout this chapter, always stay focused on the type of measurement needed: I/O rate, payload, or block size. In addition to I/O rate, payload, and block size, time is an important factor.
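When fragmentation of a particular file is suspected of shrinking the effective I/O size, its on-disk layout can be inspected directly. A minimal sketch, assuming the filefrag utility from e2fsprogs is available on the system and using an illustrative path:

# filefrag -v /file_system/file_name

A file reported as one extent (or a few large extents) can be streamed with large sequential I/Os; a file scattered across hundreds of small extents forces each request to be broken into correspondingly small physical I/Os.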
Importance of Time
Continuing with our example, we must focus on I/O round-trip time and bus saturation using the same performance test as earlier. In the next few examples, we use iostat to illustrate average wait time, service time, and percent of utilization of our test device.
The following iostat display is from the previous sequential read test but with the block size set to 4096k, or 4MB, illustrating time usage and device saturation.
# dd if=/dev/raw/raw8 of=/dev/null bs=4096k &
# iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 308.00 0.00 319488.00 0.00 159744.00 0.00 1037.30 4.48 14.49 3.24 99.80
sdj 0.00 0.00 311.00 0.00 319488.00 0.00 159744.00 0.00 1027.29 4.53 14.66 3.21 99.70
In this iostat output, we see that device utilization is pegged at 100%. When device utilization reaches 100%, the device is saturated. This value indicates not only saturation but also the percentage of CPU time during which I/O requests were issued to the device. In addition to eating up CPU cycles with pending I/O waits, notice that the round-trip time (service time) required for each I/O request has increased. Service time, the time required for a request to be completed on any given device, holds around 3.2ms. Before we go into detail about all the items that must be completed within that 3.2ms, which are discussed later in this chapter, we need to recap the initial test parameters. Recall that the previous iostat data was collected while running the dd command with a block size of 4096k. Running the same test with the block size set to 1024k yields identical block counts in iostat, as you can see in this example:

# dd if=/dev/raw/raw8 of=/dev/null bs=1024k &
# iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 303.00 0.00 309248.00 0.00 154624.00 0.00 1020.62 1.51 4.96 3.29 99.80
sdj 0.00 0.00 303.00 0.00 311296.00 0.00 155648.00 0.00 1027.38 1.51 4.98 3.29 99.70
sdj 0.00 0.00 304.00 0.00 311296.00 0.00 155648.00 0.00 1024.00 1.50 4.95 3.26 99.00
sdj 0.00 0.00 303.00 0.00 309248.00 0.00 154624.00 0.00 1020.62 1.50 4.93 3.28 99.40
sdj 0.00 0.00 304.00 0.00 311296.00 0.00 155648.00 0.00 1024.00 1.50 4.93 3.28 99.60
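When watching a long run, it can be convenient to reduce the iostat -x output to just the timing columns of interest. A minimal sketch, assuming the column layout shown above, in which await, svctm, and %util are the last three fields for device sdj:

# iostat -t -d -x 1 100 | awk '$1 == "sdj" {printf "await=%s svctm=%s util=%s\n", $(NF-2), $(NF-1), $NF}'

This makes it easy to spot the moment at which utilization pegs at 100% while service time climbs.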
Determining Block Size
As we illustrated earlier in this chapter, block size greatly impacts an application's overall performance. However, there are limits that must be understood concerning who has control over the I/O boundary. Every application can set its own I/O block request size, but the key is to understand the limits and where they are imposed. In Linux, excluding applications and filesystems, the sd driver blocks all I/O at the largest size allowed for the medium (such as SCSI LVD or FCP). An I/O operation on the SCSI bus with a typical SCSI RAID controller (not passing through any other port driver, such as a QLogic or Emulex FCP HBA) blocks at around 128KB. In our case, through FCP, the largest block size is 512KB, as shown in the previous example of a raw sequential read through dd. It goes without saying that other factors also have influence, as shown later in this chapter when additional middle-layer drivers are installed for I/O manipulation.

To determine the maximum blocking factor, or maximum I/O size, of a request at the sd/FCP layer through a raw sequential read, we focus on the following items captured in the previous dd request's iostat output. The following example explains how to calculate the block size.

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdj 303.00 311296.00 0.00 311296 0

As the output shows, the number of blocks read per second is 311296.00, and the number of transactions per second is 303.00.

(sectors read/sec) / (reads/sec) =~ 1024 sectors

Recall that a sector has 512 bytes.

(1024 sectors) x (512 bytes/sector) = 524288 bytes

Now convert the value to KB.

(524288 bytes) / (1024 bytes/KB) = 512KB

Another way to calculate the block size of an I/O request is to simply look at the avgrq-sz data from iostat. This field depicts the average number of sectors requested in a given I/O request, which only needs to be multiplied by 512 bytes to yield the block I/O request size in bytes.
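That shortcut can also be computed inline. A minimal sketch, again assuming device sdj and the iostat -x column order shown earlier, in which avgrq-sz is the fifth field from the end:

# iostat -d -x 1 100 | awk '$1 == "sdj" {printf "avg request size: %.0f KB\n", $(NF-4) * 512 / 1024}'

For the 1024k dd run, this reports roughly 512KB, matching the calculation above.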
Now that we have demonstrated how to calculate the in-route block size of any given I/O request, we return to our previous discussion of round-trip time and follow up with queue length.

Importance of a Queue

Service time includes only the amount of time required for a device to complete the request given to it. It is important to keep an eye on svctm so that any latency with respect to the end device can be noted quickly and separated from the average wait time. The average wait time (await) is not only the time required to service the I/O at the device but also the wait time spent in the dispatch queue plus the round-trip time. It is important to keep track of both times, because the difference between the two can help identify problems with the local host.

To wrap things up with I/O time and queues, we need to touch on queue length. If you are familiar with C programming, you may find it useful to look at how these values are calculated. The following excerpt from the iostat source code depicts the calculation of average queue length, wait time, and service time:

nr_ios = sdev.rd_ios + sdev.wr_ios;
tput = ((double) nr_ios) * HZ / itv;
util = ((double) sdev.tot_ticks) / itv * HZ;
svctm = tput ? util / tput : 0.0;
/*
 * kernel gives ticks already in milliseconds for all platforms
 * => no need for further scaling.
 */
await = nr_ios ? (sdev.rd_ticks + sdev.wr_ticks) / nr_ios : 0.0;
arqsz = nr_ios ? (sdev.rd_sectors + sdev.wr_sectors) / nr_ios : 0.0;

printf("%-10s", st_hdr_iodev_i->name);
if (strlen(st_hdr_iodev_i->name) > 10)
    printf("\n ");
/* rrq/s wrq/s r/s w/s rsec wsec rkB wkB rqsz qusz await svctm %util */
printf(" %6.2f %6.2f %5.2f %5.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n",
       ((double) sdev.rd_merges) / itv * HZ,
       ((double) sdev.wr_merges) / itv * HZ,
       ((double) sdev.rd_ios) / itv * HZ,
       ((double) sdev.wr_ios) / itv * HZ,
       ((double) sdev.rd_sectors) / itv * HZ,
       ((double) sdev.wr_sectors) / itv * HZ,
       ((double) sdev.rd_sectors) / itv * HZ / 2,
       ((double) sdev.wr_sectors) / itv * HZ / 2,
       arqsz,
       ((double) sdev.rq_ticks) / itv * HZ / 1000.0,
       await,
       /* The ticks output is biased to output 1000 ticks per second */
       svctm,
       /* Again: ticks in milliseconds */
       util / 10.0);

Though it is nice to understand the calculations behind every value provided by the performance tools, the most important thing to recall is that a large number of outstanding I/O requests on any given bus is not desirable when faced with performance concerns.

In the following iostat example, we use an I/O request size of 2K, which results in low service time and queue length but high disk utilization.

# dd if=/dev/raw/raw8 of=/dev/null bs=2k &
# iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 5492.00 0.00 21968.00 0.00 10984.00 0.00 4.00 0.97 0.18 0.18 96.70
sdj 0.00 0.00 5467.00 0.00 21868.00 0.00 10934.00 0.00 4.00 0.95 0.17 0.17 94.80
sdj 0.00 0.00 5413.00 0.00 21652.00 0.00 10826.00 0.00 4.00 0.96 0.18 0.18 96.40
sdj 0.00 0.00 5453.00 0.00 21812.00 0.00 10906.00 0.00 4.00 0.98 0.18 0.18 97.80
sdj 0.00 0.00 5440.00 0.00 21760.00 0.00 10880.00 0.00 4.00 0.97 0.18 0.18 96.60

Notice how %util remains high while the request size falls to 4 sectors per I/O, which equals our 2048-byte block size. In addition, the average queue size remains small, and wait time is negligible along with service time. Recall that wait time includes the round-trip time, as discussed previously. Now that we have low values for avgrq-sz, avgqu-sz, await, and svctm, we must decide whether we have a performance problem. In this example, the answer is both yes and no.
Yes, the device is at its peak performance for a single-threaded data query, and no, the values in the fields typically examined for performance concerns are not high.

Multiple Threads (Processes) of I/O to a Disk
Now that we have covered the basics, let us address multiple read requests to a device. In the following example, we proceed with the same 2K block size as before; however, we spawn a total of six read threads against the device to illustrate how service time, queue length, and wait time differ. Let's run six dd commands at the same time:

atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
Note that the previous commands can be performed in a simple for loop:

for I in 1 2 3 4 5 6
do
    dd if=/dev/raw/raw8 of=/dev/null bs=2k &
done

Let's use iostat again to look at the dd performance.

# iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 5070.00 0.00 20280.00 0.00 10140.00 0.00 4.00 4.96 0.98 0.20 100.00
sdj 0.00 0.00 5097.00 0.00 20388.00 0.00 10194.00 0.00 4.00 4.97 0.98 0.20 100.00
sdj 0.00 0.00 5103.00 0.00 20412.00 0.00 10206.00 0.00 4.00 4.97 0.97 0.20 100.00