Alignment
Problem
Many new drives come with a sector size of 4096 bytes instead of the traditional 512, a scheme known as Advanced Format. This can become an issue if the partitions are misaligned. Because of this you need to align the partitions (or, in theory, have the RAID use the devices directly, though that might introduce other issues under some unusual circumstances).
Solution
The easy way to fix this is to use a reasonably recent version of fdisk and simply create one partition per disk; the default starting sector is 2048, which is perfect for these drives.
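As a rough sketch (the device name /dev/sdX is just a placeholder), the sector sizes can be checked and a single aligned partition created like this; parted is used here since it can be scripted, but a recent fdisk gives the same 2048-sector default interactively:
# cat /sys/block/sdX/queue/logical_block_size
# cat /sys/block/sdX/queue/physical_block_size
# parted -s /dev/sdX mklabel gpt
# parted -s /dev/sdX mkpart primary 2048s 100%
# parted /dev/sdX align-check optimal 1
A physical block size of 4096 with a logical size of 512 indicates an Advanced Format drive, and the align-check command reports whether partition 1 is optimally aligned.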
Other issues
It is good to take alignment into consideration regardless of sector size; for example, testing different block size parameters with dd can yield very different results:
# dd bs=1MB count=10000 if=/dev/zero of=/dev/md1
10000+0 records in
10000+0 records out
10000000000 bytes (10 GB) copied, 48.0181 s, 208 MB/s
# dd bs=1M count=10000 if=/dev/zero of=/dev/md1
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 16.5033 s, 635 MB/s
The above test was run on a 12-core machine with 96 GB of RAM and 10 drives configured in RAID0; similar results are seen if the test is run on the same machine with RAID6 instead.
Running the same test on a quad-core machine with 16 GB of RAM, the first configuration is 5 drives in RAID5:
$ dd if=/dev/zero of=bigfile bs=1MB count=10000
10000+0 records in
10000+0 records out
10000000000 bytes (10 GB) copied, 41.6308 s, 240 MB/s
$ dd if=/dev/zero of=bigfile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 39.5522 s, 265 MB/s
This configuration is with 2 drives in RAID1 on the same machine:
$ dd if=/dev/zero of=bigfile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 110.346 s, 95.0 MB/s
$ dd if=/dev/zero of=bigfile bs=1MB count=10000
10000+0 records in
10000+0 records out
10000000000 bytes (10 GB) copied, 110.814 s, 90.2 MB/s
There are no great differences in performance and the results are inconclusive.
We speculate that the difference in performance is related to caches and buffers, but we have not been able to determine anything conclusive. Note also that the first test, where the large difference appears, writes directly to the RAID volume, without a filesystem.
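One way to reduce the influence of caches and buffers in this kind of test (a suggestion, not something done in the runs above) is to ask dd for direct I/O, which bypasses the page cache:
# dd bs=1M count=10000 if=/dev/zero of=/dev/md1 oflag=direct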
Device sizes
Drives seldom have exactly the same storage capacity. Because of this, if one disk in a RAID breaks, you don’t want to have used the entire drives for the RAID, as the replacement drive might be a bit smaller than the rest of the drives in your array. The solution is simply to spare a couple of sectors at the end of each drive (or rather, leave a couple of sectors unused on the smallest drive, and then partition the others the exact same way). This way you have a bit of a margin in case the replacement drive is smaller than the others.
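As a sketch (device name and margin are arbitrary examples), parted can create a partition that stops a bit short of the end of the disk, negative positions being counted from the end:
# parted -s /dev/sdX mklabel gpt
# parted -s /dev/sdX mkpart primary 2048s -100MiB
Alternatively, mdadm's --size option can be used to limit how much of each device the array uses.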
Performance issues
Problem
When first building the RAID5 with mdadm I ran into performance problems: the write speeds varied between 400 kB/s and 10 MB/s with an average around 5 MB/s, which was horrible.
My write tests were simply:
$ dd if=/dev/zero of=/dev/mapper/raid bs=1MB count=100000
Note
Even here alignment affects performance: using bs=1M instead of bs=1MB will use a block size of 1048576 bytes (1 MiB) instead of 1000000 bytes (1 MB).
As the system I’m running has 16 GB of RAM, I had to use large files (in this case, 100 GB) to keep the file cache and buffers from having a large impact. I also ran the test without a filesystem in order to rule out ext3/ext4 as the reason for my bad performance figures.
I ran all tests with encryption.
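For reference, the /dev/mapper/raid device in the test above is presumably a dm-crypt mapping on top of the array; a minimal sketch of such a setup using LUKS (the mapping name, md device, and cipher defaults are assumptions, not necessarily what was used here) could look like:
# cryptsetup luksFormat /dev/md127
# cryptsetup luksOpen /dev/md127 raid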
We can see that the file cache has quite a large impact on small files, even when there is a filesystem on the device:
$ dd if=/dev/zero of=/mnt/testfile bs=1MB count=1000
1000+0 records in
1000+0 records out
1000000000 bytes (1.0 GB) copied, 0.472932 s, 2.1 GB/s
Notice: The above test was run on my fully working RAID, without performance issues, but the reason for using large files when testing still stands.
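An alternative (not used above) to relying on very large test files is to flush the page cache between runs:
# sync
# echo 3 > /proc/sys/vm/drop_caches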
Solution
Simple!
Increase the stripe cache size; I got the tip from raid.wiki.kernel.org.
# echo 32768 > /sys/block/md127/md/stripe_cache_size
The above command solved the problem and increased my write speeds to ~400 MB/s, which I deem acceptable.
The change will only last until the next reboot, or more likely only until the next time the array is assembled.
To make it permanent, add the above command to /etc/rc.local.
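As a sketch (the md device name and value are taken from the command above; adjust them to your own setup), the resulting /etc/rc.local could look something like this:
#!/bin/sh -e
echo 32768 > /sys/block/md127/md/stripe_cache_size
exit 0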
Western Digital Green
Apparently some (all?) of the Western Digital Caviar Green drives have firmware that forces the disk into low-power mode after 8 seconds of inactivity. This shows up as the SMART attribute Load_Cycle_Count (193) increasing rapidly. It doesn’t seem to affect linear read and write performance (I still get speeds a bit over 400 MB/s), but it might have a bad impact on small-file access and access times. The biggest issue in my view is that it significantly decreases the lifetime of the drives.
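One way to check whether a drive is affected (the device name is just an example) is to watch that attribute with smartctl from smartmontools:
# smartctl -A /dev/sdX | grep Load_Cycle_Count
If the raw value keeps climbing noticeably between checks, the drive is parking its heads aggressively.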
See also: WD Green