I finally got a chance to learn about XIV. I was dragged into an IBM product presentation recently, so I figured I would summarize the one thing not covered on the NDA here :)
What is XIV?
Essentially, it’s a disk storage device that uses only SATA drives but gets a high number of IO/s out of them by spreading the reads and writes across all disks. Every LUN you create will be stretched across every disk in the array. Instead of using standard RAID to do this, XIV has a non-standard algorithm that accomplishes the same thing on a larger scale.
They build every system exactly the same way- each system contains a bunch of nodes of 12 drives each with their own processors and memory. It’s all off the shelf hardware in a node- pentium processors, regular ram, and sata drives. Not enterprise class on its own, but because of the distribution system they’ve worked out, you get all the performance of all the drives for all your reads/writes.
Scalability is done by hooking new systems to old ones through the 10GB switch interlink ports. They say that as newer communication tech becomes available, this will follow along (so eventually they will support infiniband). Also, when you add a system to the cluster, the balancing of data is automatic.
How is this different?
The big change here is in the way they put data on their disks. They’ve re-invented the wheel a bit, but for a reason. The performance you can get out of low cost low end drives in parallel is very good. Normally, I would never tell people that SATA is appropriate for databases or email, but XIV claims to be fast enough. I imagine we’ll see some benchmarks soon.
The first thing I asked about was parity space. XIV puts parity info over the whole array, so with 120 1TB drives, you get 80TB addressable space. Also, because rebuilding a 1TB drive from parity is normally a really intensive operation that generates many reads across the RAID, I asked about how they handle rebuilds. They claim that they can rebuild a 1TB drive from parity in about half an hour because all the parity data is being read from all the other heads simultaneously.
This sounds good, but I wonder if a failure and rebuild will slow down your entire production environment instead of only the raid where the drive failed. Also, in the event of an entire node failing with 12 drives, would that mean a 6 hour rebuild that affects the whole production array? If they have some way of prioritizing production IO, then I am satisfied. I don’t know if they do though.
Normal “copy on write” snapshots create extra writing traffic- every snapshot is another write that must be committed to disk before the acknowledgment is sent to the host. XIV uses a snapshot algorithm called “redirect on write” to avoid this problem and allow larger numbers of readable/writable snapshots.
They create a snapshot LUN that initially points to the real data, and when a change is made to the source, they write the new data to unused space and point the production LUN there while leaving the snapshot pointed at the old data. Netapp used a different algorithm to solve the same problems inherent to “copy on write” traditional snap shots that launched them into success in the enterprise storage market years ago.
Other advanced features
The box is delivered with all functionality enabled, which is an interesting move considering every other vendor I’ve dealt with makes most of their money from software. They include mirroring, thin provisioning, and a weird one time only type of virtualization that sits between the hosts and the old storage and reads all the data off the storage while continuing to pass the IO through transparently.
If someone from XIV (or more likely IBM) is reading this, I want to know more details about your mirroring and your workload prioritization:
- Do you support synchronous, asynchronous, and asynchronous with consistency group mirroring? What about one to one, one to many, and many to one configurations?
- Do you have a way to prevent disk rebuilds from taking disk resources that are needed by production apps?