22 08 2008

I finally got a chance to learn about XIV. I was dragged into an IBM product presentation recently, so I figured I would summarize here the one thing not covered by the NDA :)

What is XIV?

Essentially, it’s a disk storage device that uses only SATA drives but gets a high number of IOPS out of them by spreading reads and writes across all disks. Every LUN you create is stretched across every disk in the array. Instead of using standard RAID to do this, XIV has a non-standard algorithm that accomplishes the same thing on a much larger scale: across every drive in the system.
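As I understand the pitch, the distribution boils down to pseudo-randomly scattering small chunks of every LUN across every drive so that no single spindle becomes a hot spot. Here’s a toy sketch of that idea; the 1MB chunk size and the hash-based placement are my own guesses for illustration, not XIV’s actual algorithm:

```python
# Toy sketch of spreading every LUN's chunks across all drives in the box.
# The chunk size and hash placement are illustrative guesses, not XIV's design.
import hashlib

DRIVES = 120      # e.g. 120 drives, matching the example later in the post
CHUNK_MB = 1      # assumed chunk size for the sketch

def drive_for_chunk(lun_id, chunk_index):
    # Deterministically map a (LUN, chunk) pair onto one of the drives.
    key = f"{lun_id}:{chunk_index}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % DRIVES

# A 10GB LUN ends up with its chunks scattered over (nearly) all 120 drives,
# so every read and write is spread across the whole box.
chunks = 10 * 1024 // CHUNK_MB
drives_used = {drive_for_chunk("lun42", i) for i in range(chunks)}
print(f"{chunks} chunks spread over {len(drives_used)} of {DRIVES} drives")
```

The real system also has to keep redundancy for each chunk, which is what the rebuild discussion below is about.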

They build every system exactly the same way: each system contains a bunch of nodes of 12 drives each, with their own processors and memory. It’s all off-the-shelf hardware in a node: Pentium processors, regular RAM, and SATA drives. None of it is enterprise class on its own, but because of the distribution system they’ve worked out, you get the performance of all the drives for all your reads and writes.

Scalability is handled by hooking new systems to old ones through the 10Gb interlink ports on the switch. They say that as newer interconnect technology becomes available, this will follow along (so eventually they will support InfiniBand). Also, when you add a system to the cluster, the data is rebalanced automatically.

How is this different?

The big change here is in the way they put data on their disks. They’ve re-invented the wheel a bit, but for a reason: the performance you can get out of low-cost, low-end drives in parallel is very good. Normally I would never tell people that SATA is appropriate for databases or email, but XIV claims to be fast enough. I imagine we’ll see some benchmarks soon.

The first thing I asked about was parity space. XIV spreads parity info over the whole array, so with 120 1TB drives you get 80TB of addressable space. Also, because rebuilding a 1TB drive from parity is normally a really intensive operation that generates many reads across the RAID group, I asked how they handle rebuilds. They claim they can rebuild a 1TB drive from parity in about half an hour, because the data needed for the rebuild is read from all the other spindles simultaneously.
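For what it’s worth, here’s a quick back-of-envelope check on that half-hour claim (my own arithmetic, using only the numbers above, not anything IBM provided):

```python
# Back-of-envelope check on the "rebuild a 1TB drive in ~30 minutes" claim,
# using only the figures quoted above. My own arithmetic, not IBM's numbers.

drive_mb = 1_000_000        # ~1TB drive, in MB
rebuild_seconds = 30 * 60   # the claimed half-hour rebuild
surviving_drives = 119      # 120 drives minus the failed one

aggregate_mb_s = drive_mb / rebuild_seconds
per_drive_mb_s = aggregate_mb_s / surviving_drives

print(f"aggregate rebuild rate: ~{aggregate_mb_s:.0f} MB/s")   # ~556 MB/s
print(f"per surviving drive:    ~{per_drive_mb_s:.1f} MB/s")   # ~4.7 MB/s
```

A few MB/s of extra reads per drive sounds easy enough to absorb, which is presumably how they justify the claim, but it still doesn’t answer the prioritization question below.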

This sounds good, but I wonder whether a failure and rebuild will slow down your entire production environment instead of only the RAID group where the drive failed. Also, in the event of an entire node with 12 drives failing, would that mean a 6 hour rebuild that affects the whole production array? If they have some way of prioritizing production IO, then I’m satisfied. I don’t know whether they do, though.


Normal “copy on write” snapshots create extra write traffic: every host write to a block that a snapshot still references forces the old data to be copied elsewhere first, and that extra write must be committed to disk before the acknowledgment is sent to the host. XIV uses a snapshot algorithm called “redirect on write” to avoid this problem and allow larger numbers of readable/writable snapshots.

They create a snapshot LUN that initially points to the real data, and when a change is made to the source, they write the new data to unused space and point the production LUN there, while leaving the snapshot pointed at the old data. NetApp solved the same problems inherent in traditional “copy on write” snapshots with a different algorithm years ago, and that is part of what launched them into success in the enterprise storage market.
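To make the difference concrete, here’s a toy sketch of the redirect-on-write idea as I understand it; the class and method names are mine, not anything from XIV’s implementation:

```python
# Toy illustration of "redirect on write" snapshots. My own simplification,
# not XIV's actual data structures.

class RedirectOnWriteVolume:
    def __init__(self):
        self.store = {}        # physical location -> data block
        self.next_free = 0     # naive "unused space" allocator
        self.prod_map = {}     # logical block -> physical location (production view)
        self.snapshots = []    # each snapshot is just a frozen copy of the map

    def write(self, block, data):
        # New data always lands in unused space; only the production map moves,
        # so nothing has to be copied before the write is acknowledged.
        loc = self.next_free
        self.next_free += 1
        self.store[loc] = data
        self.prod_map[block] = loc

    def read(self, block, snapshot=None):
        mapping = self.snapshots[snapshot] if snapshot is not None else self.prod_map
        return self.store.get(mapping.get(block))

    def take_snapshot(self):
        # A snapshot is only a copy of the pointers; no data is rewritten.
        self.snapshots.append(dict(self.prod_map))
        return len(self.snapshots) - 1

vol = RedirectOnWriteVolume()
vol.write(0, "original data")
snap = vol.take_snapshot()
vol.write(0, "changed data")                         # redirected to fresh space
print(vol.read(0), "|", vol.read(0, snapshot=snap))  # changed data | original data
```

The point is that taking a snapshot costs almost nothing, and new writes never have to copy old data first.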

Other advanced features

The box is delivered with all functionality enabled, which is an interesting move considering every other vendor I’ve dealt with makes most of their money from software. They include mirroring, thin provisioning, and an odd one-time-only type of virtualization that sits between the hosts and the old storage and reads all the data off the old array while continuing to pass the IO through transparently.


If someone from XIV (or more likely IBM) is reading this, I want to know more details about your mirroring and your workload prioritization:

  • Do you support synchronous, asynchronous, and asynchronous-with-consistency-groups mirroring? What about one-to-one, one-to-many, and many-to-one configurations?
  • Do you have a way to prevent disk rebuilds from taking disk resources that are needed by production apps?

Storage and fabric virtualization

7 08 2007

Aloha Open Systems Storage Guy,

What’s your take on virtualization? VSAN from Cisco, SVC from IBM? What other virtualization products are available from other vendors?


Cisco VSANs and IBM’s SVC are different things for certain :)

A VSAN lets you create multiple logical fabrics within the same switch: you tell it which ports belong to which VSAN, and you can manage each fabric individually. It’s especially useful if you’re bridging two locations’ fabrics together for replication or something similar, because it allows you to do “inter-VSAN routing” if you have the right enterprise software feature. That lets you have two separate fabrics whose devices can see each other, but if the link between the sites fails (which is more likely than a switch failure), you won’t have the management nightmare of re-merging the two separated fabrics into the original one when the link comes back. VSANs are also commonly used to isolate groups of devices, keeping them logically separated from parts of the network they’ll never need to interact with.

IBM’s SVC is a different technology, meant to consolidate multiple islands of FC storage. It’s essentially a Linux server cluster that you place between your application servers and the storage. It lets you take all the storage behind it and create what they call “virtual disks”: essentially a LUN that’s presented to a server but is built out of multiple RAID arrays (possibly from multiple controllers). This gives you the option of striping your data across more spindles than you normally could, and allows you to do dynamic thin provisioning as your datasets grow.
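Conceptually, a virtual disk is just a map from the extents of the LUN the host sees onto pieces of the RAID arrays sitting behind the SVC. Here’s a rough toy sketch of that idea; the extent size and the round-robin policy are my own simplifications, not SVC’s actual internals:

```python
# Toy sketch of a "virtual disk" striped over several back-end RAID arrays.
# Extent size and placement policy are my own simplifications, not SVC's.

EXTENT_MB = 16     # arbitrary extent size for the sketch

class VirtualDisk:
    def __init__(self, size_mb, backend_luns):
        # Round-robin each extent of the virtual disk onto a different
        # back-end RAID array (possibly from different controllers).
        self.extent_map = {
            i: backend_luns[i % len(backend_luns)]
            for i in range(size_mb // EXTENT_MB)
        }

    def locate(self, offset_mb):
        # Which back-end array (and offset within the extent) holds this data?
        extent = offset_mb // EXTENT_MB
        return self.extent_map[extent], offset_mb % EXTENT_MB

# One "virtual disk" striped over RAID arrays from two different controllers.
vdisk = VirtualDisk(size_mb=256, backend_luns=["DS4300_array1", "CX3_array7"])
print(vdisk.locate(100))   # -> ('DS4300_array1', 4): extent 6 lands on the first array
```

Because each extent can live on a different back-end array, a single host LUN gets the combined spindle count of everything behind the cluster.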

The only downside of the Cisco VSAN technology I can think of is its cost: it’s bloody expensive compared to a cheap low-end solution, and for anything smaller than a 50-device FC fabric I would question whether it’s worth it. There is an alternative from Brocade/McData called LSAN, but I’m not as familiar with it. I’ve been told it’s slightly less complicated but harder to manage, and doesn’t have the full feature set of Cisco’s offering.

The downside to the IBM SVC is that you add latency to all your disk IO: every read or write a server performs has to go through the Linux cluster first. The SVC has a much larger cache than most controllers, so there’s a better chance that the data you’re looking for is already there, but if it’s not, your read performance might suffer a little because of the extra few milliseconds. The advantage is that you can now use incredibly cheap controllers with tiny amounts of cache, and you can migrate data from any manufacturer’s device to any other manufacturer’s device without interrupting your servers. In a virtualized environment like this, an older DS4300 like yours will perform pretty much on the same level as a more expensive DS4800 or EMC CX3-80 (assuming the same number of drives), because you aren’t really using the cache of the underlying system.

Another advantage of the SVC is licensing: most FC storage controllers charge you, either up front or over time, for the number of servers you plan to connect to them. IBM charges a “partition license” fee for LUN masking, and EMC charges a “multipath maintenance” tax. The multipath drivers for SVC are free, and the SVC only needs one partition from the controller, so you might be able to save money that way.

Do you have any specific questions about these topics that you’d like more detail on?

Also, one of the new bloggers in the storage world, Barry Whyte, focuses on IBM SVC. He just started, but his blog will hopefully become a real resource for people with IBM storage virtualization on their minds.