Getting Good IO from Amazon's EBS

Wed Jul 29 00:23:52 -0700 2009

The performance characteristics of Amazon's Elastic Block Store are moody, technically opaque and, at times, downright confounding. At Heroku we've spent a lot of time managing EBS disks, and I recently spent a very long night trying to figure out how to get the best performance out of them; little did I know I was testing on a night when they were particularly ornery. On a good day an EBS volume can give you 7,000 seeks per second; on a not-so-good day it will give you only 200. On the good days you'll be singing its praises and on the bad days you'll be cursing its name. What I think I stumbled on that day was a list of techniques that seem to even out the bumps and get decent performance out of EBS disks even when they are performing badly.

Under perfect circumstances a totally untweaked EBS drive running an ext3 filesystem will get you about 80 MB/s of read or write throughput and 7,000 seeks per second. Two disks in a RAID 0 configuration will get you about 140 MB/s read or write and about 10,000 seeks per second. Those are the best numbers I've been able to get out of an EBS disk setup, and they appear to saturate the IO channel on the EC2 instance (140 MB/s is roughly 1.1 Gbit/s, about what you'd expect from gigabit ethernet). However, when the EBS drives are NOT running their best, which is often, you need a lot more tweaking to get good performance out of them.

The tool I used to benchmark was bonnie++, specifically:

bonnie++ -u nobody -fd /disk/bonnie

Saturating reads and writes was not very hard, but seeks per second – which is CRITICAL for databases – was much more sensitive and is what I was optimizing for in my tests.
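
For clarity, here is the same invocation with the flags unbundled; this assumes the usual bonnie++ option meanings (-u user, -f fast mode, -d target directory):

# -u: run the benchmark as this user
# -f: fast mode, skipping the slow per-character IO tests
# -d: directory on the disk (or array) being tested
bonnie++ -u nobody -f -d /disk/bonnie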

In my tests I built RAID arrays. I've been using mdadm RAID 0:

mdadm --create /dev/md0 --metadata=1.1 --level=0 ...

Each EBS disk is claimed to be redundant to begin with, so I felt safe just striping for speed.
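
Spelled out a bit more, the array creation looks roughly like this; the device names and the number of volumes are illustrative, not the exact ones from my tests:

# stripe four attached EBS volumes into a single md device
mdadm --create /dev/md0 --metadata=1.1 --level=0 \
      --raid-devices=4 /dev/sdh /dev/sdi /dev/sdj /dev/sdk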

Now, I just need to take a moment to point something out. Performance testing on EBS is very hard. The disks speed up and slow down on their own. A lot. Telling when your tweak is helping vs it just being luck is not easy. It feels a bit like trying to clock the speed of passing cars with a radar gun from the back of a rampaging bull. I fully expect to find that some of my discoveries here are just a mare’s nest, but hopefully others will prove enduring.

After testing, what I found surprised me:

  • More disks are better than fewer. I've had people tell me that performance maxed out for them at 20 to 30 disks, but I could not measure any improvement above 8 disks. Most importantly, lots of disks seem to smooth out the flaky performance of a single EBS disk that might be busy chewing on someone else's data.
  • Your IO scheduler matters (but not as much as I thought it would). Do not use noop. Use cfq or deadline. I found deadline to be a little better but YMMV.
  • Larger chunk sizes on the raid made a (shockingly) HUGE difference in performance. The sweet spot seemed to be 256 KB.
  • A larger read-ahead buffer on the raid also made a HUGE difference. I bumped it from the default of 256 up to 65536 with blockdev --setra; the value is counted in 512-byte sectors, so that's going from 128 KB to 32 MB of read-ahead. (The sketch after this list shows how these settings fit together.)
  • Use XFS or JFS. The biggest surprise to me was how much better XFS and JFS performed on these moody disks. I am used to seeing only minimal gains in disk performance when using them, but something about the way XFS and JFS group reads and writes plays very nicely with EBS drives.
  • Mounting noatime helps but only by about 5%.
  • Different EC2 instance sizes, much to my surprise, did not make a noticeable difference in disk IO.
  • I was not able to reproduce Ilya’s results where a disk performed poorly when newly created but faster after being zeroed out with dd (due to lazy allocation of sectors).

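To put the pieces above together, here is roughly what a tuned setup looks like end to end. Treat it as a sketch rather than a recipe: the device names, volume count and mount point are illustrative, and the sysfs path assumes a reasonably recent Linux kernel.

# use the deadline scheduler on each underlying EBS device (cfq also did well)
for d in sdh sdi sdj sdk; do echo deadline > /sys/block/$d/queue/scheduler; done

# stripe the volumes with a 256 KB chunk size
mdadm --create /dev/md0 --metadata=1.1 --level=0 --chunk=256 \
      --raid-devices=4 /dev/sdh /dev/sdi /dev/sdj /dev/sdk

# raise the read-ahead on the array (the value is in 512-byte sectors; 65536 = 32 MB)
blockdev --setra 65536 /dev/md0

# use XFS and mount it noatime
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /disk
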
I’ve included my notes from that day below. I was not running tests three times in a row and taking the standard deviation into account (although I wish I had), and these aren’t easy to reproduce because it’s been a while since the EBS drives were having such a bad day.

Scheduler | FS | Disks | Settings | Seq Block W | Seq Block RW | Seq Block R | Random Seeks/s
deadline | ext3 | 1 | (defaults) | 60K | 25K | 50K | 216
deadline | ext3 | 24 | (defaults) | 125k | 17k | 20k | 1296
deadline | ext3 | 24 | stride=16 | 125k | 18k | 20k | 866
deadline | ext3 | 24 | noatime,stride=16 | 124k | 18k | 19k | 1639
cfq | ext3 | 24 | noatime,stride=16,blockdev --setra 65536 /dev/md0 | 124k | 31k | 38k | 3939
deadline | ext3 | 24 | noatime,chunksize=256,stride=64,blockdev --setra 65536 /dev/md0 | 126k | 37k | 44k | 1720
cfq | ext3 | 24 | noatime,chunksize=256,stride=64,blockdev --setra 65536 /dev/md0 | 124k | 34k | 44k | 4560
cfq | ext3 | 24 | noatime,chunksize=256,stride=64,blockdev --setra 393216 /dev/md0 | 126k | 35k | 43k | 1860
cfq | ext3 | 24 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 129k | 34k | 43k | 1285
cfq | ext3 | 24 | noatime,chunksize=256,blockdev --setra 65536 /dev/sd* | 125k | 35k | 44k | 2557
cfq | ext3 | 16 | noatime,chunksize=256,stride=64,blockdev --setra 65536 /dev/md0 | 125k | 40k | 48k | 2770
noop | ext3 | 16 | noatime,chunksize=256,stride=64,blockdev --setra 65536 /dev/md0 | 124k | 38k | 47k | 2504
deadline | ext3 | 16 | noatime,chunksize=256,stride=64,blockdev --setra 65536 /dev/md0 | 125k | 41k | 46k | 1886
cfq | xfs | 16 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 126k | 62k | 93k | 7428
deadline | xfs | 16 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 118k | 63k | 92k | 10723
cfq | xfs | 16 | noatime,chunksize=512,blockdev --setra 65536 /dev/md0 | 122k | 63k | 92k | 10099
cfq | xfs | 16 | noatime,chunksize=512,blockdev --setra 131072 /dev/md0 | 116k | 64k | 99k | 9664
deadline | xfs | 16 | noatime,chunksize=512,blockdev --setra 131072 /dev/md0 | 118k | 66k | 99k | 6396
cfq | xfs | 24 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 117k | 62k | 89k | 7657
cfq | xfs | 8 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 117k | 62k | 91k | 3059
deadline | xfs | 8 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 139k | 63k | 85k | 10403
deadline | xfs | 8 | noatime,chunksize=256,blockdev --setra 32768 /dev/md0 | 124k | 60k | 82k | 9308
deadline | xfs | 4 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 88k | 48k | 77k | 1133
deadline | xfs | 12 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 119k | 67k | 90k | 8590
cfq | xfs | 12 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 85k | 64k | 91k | 10340
cfq | ext2 | 8 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 141k | 51k | 51k | 3242
deadline | xfs | 8 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 112k,112k,115k | 67k,61k,61k | 85k,83k,86k | 9568,8541,8339
deadline | jfs | 8 | noatime,chunksize=256,blockdev --setra 65536 /dev/md0 | 135k,138k,85k | 66k,64k,33k | 92k,87k,79k | 9785,10109,8615

Showing 19 comments

  • El Yobo 8 months ago I was thinking that I could configure a RAID0 with the ephemeral disks, then on top of this add a RAID1 between the ephemeral RAID0 and an EBS volume. The RAID1 would be configured using mdadm's --write-mostly to ensure that reads will be avoided on the EBS volume where possible. It seems like this would provide the performance at almost the same level as the ephemeral disks, while providing the security of the EBS backing, and incidentally reducing IO on the EBS volume (as there will be fewer reads going there). Any thoughts?
  • Ilya Grigorik 2 years ago Henry, I'm a bit confused by your seek times. 10,000 seeks a second would imply ~0.1ms for a seek, whereas the numbers I'm more familiar with are ~5~8ms range (60s / 7200 RPM disk = ~8ms). Are we talking about the same thing here?

    On bonnie, I don't have a lot of hands on experience with it, but I have come across articles saying that it's not to be trusted on NAS installations. I don't have anything concrete to back this up, but just a heads up. Having said that...

    - IO Scheduler: CFQ is the default under 2.6.x, correct?

    - Larger chunk sizes = great performance: to be expected, but this depends very much on the data you're storing. If you have large contiguous chunks then tweaking the FS and your raid config to larger values will definitely lead to better perf. Having said that, one thing I haven't tried is giving InnoDB a raw disk, instead of a filesystem. I wonder if it performs better with that layer removed?

    - XFS: once again, depends on your files. XFS does dynamic allocation for inodes, which means really poor performance if you're storing A LOT of very small files (something that I was trying to do).

    - Different EC2 instance sizes: great to hear. Even with small vs medium? They advertise IO differences on the site. I've never setup a definitive test for this.

    - Performance increase after DD: This is based on hearsay, and I haven't been able to reproduce it reliably either. The theory is that EBS allocates blocks on demand, so you can either DD the drive and force that cost up front, or swallow it as the drive grows. Having said that, I've seen more variation in performance based on simply unmounting/remounting the EBS drive. Unfortunately, we have no visibility into how the system is setup, and for all we know, they're changing the hardware configuration underneath without us knowing a thing.. It's a little bit frustrating, but such is life in the clouds. ;-)
  • mealtkt 1 year ago in reply to Ilya Grigorik I would like to say a word *against* dd'ing an ebs volume to increase performance. There's quite a possibility that performance will actually degrade, and here's why. EBS volumes are constantly copied to S3 in the background under some kind of algorithm. This is done to speed up EBS snapshots. It will also increase your ebs snapshot sizes.. all in all.. probably not worth it.
  • batmonkey 2 years ago Yes, but will it still slice a tomato?
  • Jo Liss 2 months ago Looking at the numbers, it seems to me that it might not be XFS/JFS that is fast, but rather the ext3 journaling that is slow. The ext2 FS seems to perform quite well in your test, getting approximately the same speed as the xfs with the same configuration.

    Maybe it's even just a matter of more conservative default options in ext3 (e.g. try setting data=writeback).

    I don't know how bonnie++ works, but if it's just accessing a pre-allocated file, then I would expect that there is not much performance to be gained over a dumb file system like ext2.
  • Andrew 6 months ago Isn't the argument to "blockdev --setra" specified in sectors, not bytes? So wouldn't "blockdev --setra 65536 /dev/md0" equal 32MB, not 64KB?

    Earlier in the post you reference "64k" bytes as being an optimal read ahead, yet this doesn't match up with the blockdev statements in the table.

    Thanks for posting this data, very useful.
  • Sasha Aickin 8 months ago Hey there. Awesome post, but could it be that you wrote (in the second paragraph) kb when you meant MB? 140MB is just about a gigabit, but 140kb (kiloBITs) is only 0.01% of a gigabit.

    http://www.google.com/search?q...
    http://www.google.com/search?q...
  • Daniele Calabrese 1 year ago Soundtrckr going after a performance increase and stability this weekend
  • Al 1 year ago Very useful. Thanks for taking the time to write this
  • @mrflip 1 year ago Have you found any benefit from the "nobarrier" option?
    I can't tell what the actual takeaway from this thread is:
    http://developer.amazonwebserv...
    But I think it says that nobarrier is either a) likely to be very helpful or b) harmless

    Also, here are some recent benchmarks for XFS on local storage with a whole different group of magic words:
    http://recoverymonkey.net/word...
    and another set of benchmarks from Vadim at the MySQL Performance Blog:
    http://www.mysqlperformanceblo...

    In that thread, the 'first write is slower' is confirmed by a link to the AWS docs:
    http://docs.amazonwebservices....
    "Due to how Amazon EC2 virtualizes disks, the first write to any location on an instance's drives performs slower than subsequent writes. For most applications, amortizing this cost over the lifetime of the instance is acceptable. However, if you require high disk performance, we recommend initializing drives by writing once to every drive location before production use. " Flag
  • Shlomo Swidler 1 year ago in reply to @mrflip Those comments in the AWS docs relate to the ephemeral drives attached to the instance, not to EBS volumes.
  • jbellis 2 years ago Ilya: exactly, the seek numbers here are mostly worthless because the high ones must be coming from ram cache on the EBS provider. It would take a _lot_ more testing to average things out to where "sometimes we hit cache" stops dominating the results.
  • Erik Giberti 2 years ago This is great information! I was wondering recently if after 6-8 drives, there wouldn't start to be a degradation in the performance gains because of the processor cycles required to keep the software raid running. I just hadn't gotten around to scripting a test. Based on your findings, that doesn't appear to be the case.

    I also blogged about some speed improvements I noticed with different filesystems and configs as well. http://af-design.com/blog/2009... According to a commenter there, Amazon states that under certain circumstances, EBS stores will fail at an "annual failure rate (AFR) of between 0.1% – 0.5%." So be careful about assuming the drives are more robust than "real" drives.
  • roobaron 2 years ago Nice Post.
    I will see if I can replicate when I have time.
  • John 2 years ago in reply to roobaron That's really cool! Thanks for sharing your notes!
  • Mark 2 years ago Thanks for a great post. One newbie question-- how do you add disks? Each EBS is exclusively connected to an instance. Are you talking about replicating across multiple volumes? Thanks!
  • OrionHeroku 2 years ago in reply to Mark If you were using Amazon's default tools, you would do:

    ec2-attach-volume vol-111 -i i-1111 -d /dev/sdh1
    ec2-attach-volume vol-222 -i i-1111 -d /dev/sdh2
    ec2-attach-volume vol-333 -i i-1111 -d /dev/sdh3
    ec2-attach-volume vol-444 -i i-1111 -d /dev/sdh4

    Now the instance (i-1111) has 4 disks, sdh1, sdh2, sdh3 and sdh4 which you can make into a software raid with the mdadm tool just like if you had physical disks connected to a physical computer.
  • Zeno Davatz 11 months ago in reply to OrionHeroku Can you please tell me why I would ever make a software raid in the Amazon cloud? The Amazon cloud is supposed to never fail! Thank you for your feedback.
  • Adam 11 months ago in reply to Zeno Davatz Real world example: Our raid10 started underperforming due to backend performance degradation on Amazon's end. The drive didn't fail, but the latency for reads was over 100ms/op and was killing our overall performance. With the software raid, I simply dropped the volume and replaced it.