
SSD Caching under Linux

I recently found myself with a spare 128 GB SSD and decided to try my hand at setting up SSD caching under Linux. Until now, my personal desktop system has stored all its data on traditional spinning disks. However, a little while ago I got a new computer at work that comes exclusively with SSD storage, and since then I've become increasingly annoyed with the disk performance of my personal system. SSD caching seemed like an appealing way to increase performance without having to replace any disks.

The two operations where disk performance is most noticeable are booting up the system and using Unison to synchronize my home directory between different systems. Booting typically takes about 1 minute and 30 seconds from loading the initrd to X11 coming up, and about 2 minutes until my desktop is fully loaded (that's i3, Dropbox, Emacs, Firefox, Network Manager, XFCE Terminal, and Pulse Audio). Scanning my home directory with Unison typically takes about 1 minute and 19 seconds (that's just detecting any changes that need to be synchronized, not actually transferring the changed data).

To better estimate how much improvement I could possibly get from the SSD, I first transferred my entire root file system from the spinning disks to the SSD. This reduced my boot time to X11 from 1:30 to 22 seconds (I unfortunately didn't write down the time when the desktop was fully loaded).

For SSD caching under Linux, there are currently three options: bcache, lvmcache, and EnhanceIO (a nice overview of the differences between bcache and lvmcache can be found on the Life Reflections Blog). EnhanceIO I ruled out immediately because it isn't included in the mainline kernel. bcache has the drawback of requiring you to reformat your partitions, and there are various rumours about data corruption in more complex storage stacks. Therefore, I tried lvmcache first.

lvmcache

Initial setup of lvmcache on the block devices was straightforward, but getting the system to set up the stack correctly on boot required some manual work on my Debian Jessie system. It turns out that there is a missing dependency on the thin-provisioning-tools package, which contains the cache_check binary. Furthermore, in order to be able to cache the root file system, you need to manually configure initramfs-tools to include this binary (and the C++ library that it requires) in the initrd. With that out of the way, things worked smoothly.
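
One way to do this is a small initramfs-tools hook script along these lines (a minimal sketch; I'm assuming cache_check is installed as /usr/sbin/cache_check, the hook's file name is arbitrary, and the script needs to be executable). copy_exec also pulls in the shared libraries the binary links against, so the C++ library is taken care of automatically:

  #!/bin/sh
  # /etc/initramfs-tools/hooks/cache_check
  # Copy cache_check into the initrd so lvmcache volumes can be activated at boot.
  PREREQ=""
  prereqs() { echo "$PREREQ"; }
  case "$1" in
      prereqs) prereqs; exit 0 ;;
  esac

  . /usr/share/initramfs-tools/hook-functions
  # copy_exec also copies the shared libraries the binary depends on
  copy_exec /usr/sbin/cache_check /usr/sbin

Afterwards the initrd has to be regenerated with update-initramfs -u.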

Unfortunately, even after several boots I was unable to measure any performance improvement. I tried to encourage promotion of blocks to the cache by setting the sequential_threshold, read_promote_adjustment and write_promote_adjustment variables to zero (using dmsetup message <device> 0 <variable> 0), but to no avail. Maybe using it for a longer time would have eventually improved performance, but I got the impression that lvmcache was not the right tool for my use case. As I understand it, lvmcache is not actually a cache but more of a tiered storage system: it tries to determine which blocks are accessed most frequently and "promotes" them to the cache device. A traditional cache, in contrast, would put almost every block that is read or written into the cache, evicting the least frequently accessed blocks to make room if necessary.
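
Concretely, the commands were of the following form (vg0-root is just a placeholder here; substitute the device-mapper name of your cached logical volume, which dmsetup ls will show):

  # Encourage the cache policy to promote blocks on (almost) every access
  dmsetup message vg0-root 0 sequential_threshold 0
  dmsetup message vg0-root 0 read_promote_adjustment 0
  dmsetup message vg0-root 0 write_promote_adjustment 0

  # Hit/miss counters and the current policy settings can be inspected with
  dmsetup status vg0-root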

bcache

bcache works more like a traditional cache. The only exception is that it tries to detect sequential access (e.g. watching a movie, or creating an ISO image) and bypasses the cache for such requests (because spinning disks are typically quite performant for them). I was a bit worried about the interaction between bcache, LVM, dm-crypt and btrfs, but was unable to find any concrete reports of problems other than a bug in the bcache+btrfs interaction that was fixed in kernel 3.19. Also, several people on the bcache and btrfs mailing lists reported successfully using bcache even in complex stacks.
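
The size threshold for what counts as sequential is adjustable per device through sysfs, so this behaviour can be tuned; roughly like this (a sketch for bcache0, the values are just examples):

  # Show the current sequential cutoff
  cat /sys/block/bcache0/bcache/sequential_cutoff

  # Cache all requests regardless of size (0 disables the sequential bypass)
  echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

  # Bypass the cache for sequential streams larger than 4 MB
  echo 4M > /sys/block/bcache0/bcache/sequential_cutoff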

Therefore, I decided to bite the bullet and give bcache a try. After moving around a lot of data so that I could format the bcache devices, and after enabling the most recent kernel from the jessie-backports repository (4.3 at the time of this post) to avoid the above-mentioned bcache+btrfs bug, I ended up with the following configuration (a rough sketch of the corresponding commands follows the list):

  • /dev/sda3 (a 512 GB partition on a spinning disk) is the backing device for /dev/bcache0
  • /dev/sdb3 (a 256 GB partition on a spinning disk) is the backing device for /dev/bcache1
  • /dev/sdc2 (a 58 GB partition on the SSD) is initialized as a cache device, and connected to both bcache0 and bcache1.
  • /dev/bcache0 and /dev/bcache1 are used as LVM physical volumes (PVs), forming volume group vg0.
  • /dev/mapper/vg0-root is the btrfs-formatted root file system (linearly mapped onto the PVs).
  • /dev/mapper/vg0-home is a LUKS (dm-crypt) encrypted device (linearly mapped onto the PVs).
  • /dev/mapper/vg0-home_luks is the btrfs-formatted home file system (backed by vg0-home).
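
In case it is useful to someone, the commands to arrive at this layout looked roughly as follows. This is a sketch rather than a transcript: the LV sizes are placeholders, the bcacheN numbering is assumed to come out as in the list above, and the cache set UUID is read back with bcache-super-show.

  # Format backing devices and the cache device (bcache-tools package)
  make-bcache -B /dev/sda3        # becomes /dev/bcache0
  make-bcache -B /dev/sdb3        # becomes /dev/bcache1
  make-bcache -C /dev/sdc2        # the cache device on the SSD

  # Attach the cache set to both backing devices
  CSET=$(bcache-super-show /dev/sdc2 | awk '/cset.uuid/ {print $2}')
  echo "$CSET" > /sys/block/bcache0/bcache/attach
  echo "$CSET" > /sys/block/bcache1/bcache/attach

  # LVM on top of the bcache devices
  pvcreate /dev/bcache0 /dev/bcache1
  vgcreate vg0 /dev/bcache0 /dev/bcache1
  lvcreate -L 100G -n root vg0    # size is a placeholder
  lvcreate -L 300G -n home vg0    # size is a placeholder

  # Root file system directly on the LV, home on top of dm-crypt
  mkfs.btrfs /dev/mapper/vg0-root
  cryptsetup luksFormat /dev/mapper/vg0-home
  cryptsetup luksOpen /dev/mapper/vg0-home vg0-home_luks
  mkfs.btrfs /dev/mapper/vg0-home_luks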

This time there was no need for any manual fiddling with the initrd; things worked perfectly out of the box. I am also extremely pleased with the performance. While the first reboot proceeded at regular speed, subsequent boots reduced the time to X11 from 1:30 minutes to about 9 seconds, and the time until the fully loaded desktop from 2:00 minutes to about 15 seconds (times are not exact because I used a stopwatch). Note that this is even faster than in my SSD-only experiment, presumably because now both the root file system and the home directory benefit from the SSD. The time required to scan my home directory for synchronization went from 1:19 (with spinning disks) to 4 seconds (with the SSD cache). Wow!
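
For anyone who wants to verify that the cache is actually being used, bcache exports per-device statistics through sysfs; for example (again for bcache0):

  # Hit/miss statistics since the cache was attached
  cat /sys/block/bcache0/bcache/stats_total/cache_hits
  cat /sys/block/bcache0/bcache/stats_total/cache_misses
  cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio

  # Amount of I/O that bypassed the cache (e.g. sequential reads)
  cat /sys/block/bcache0/bcache/stats_total/bypassed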

I am a little bit worried about encountering bugs in this rather complex stack, but I hope that they will at least not go unnoticed because btrfs checksums all its data. If I encounter any problems, I will update this blog post.

Update: I've now replaced btrfs with ext4, but otherwise kept the stack unchanged. See BTRFS Reliability - a datapoint for details.
