IBM’s XIV

22 08 2008

I finally got a chance to learn about XIV. I was dragged into an IBM product presentation recently, so I figured I would summarize here the one thing not covered by the NDA :)

What is XIV?

Essentially, it’s a disk storage device that uses only SATA drives but gets a high number of IO/s out of them by spreading the reads and writes across all disks. Every LUN you create will be stretched across every disk in the array. Instead of using standard RAID to do this, XIV has a non-standard algorithm that accomplishes the same thing on a larger scale.

They build every system exactly the same way- each system contains a bunch of nodes of 12 drives each, with their own processors and memory. It's all off-the-shelf hardware in a node- Pentium processors, regular RAM, and SATA drives. None of it is enterprise class on its own, but because of the distribution scheme they've worked out, you get the performance of all the drives for all your reads and writes.

Scaling is done by hooking new systems to existing ones through the 10Gb interlink switch ports. They say that as newer interconnect technology becomes available, this will follow along (so eventually they will support InfiniBand). Also, when you add a system to the cluster, the rebalancing of data is automatic.
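
IBM didn't share the actual placement algorithm, but the two behaviors described above- every LUN spread over every disk, and only a fair share of data moving when you add capacity- are what you get from something like rendezvous hashing. Here's a minimal Python sketch, purely my own illustration and not XIV's code:

    import hashlib
    from collections import Counter

    def score(disk, lun_id, chunk):
        # stable pseudo-random score for a (disk, chunk) pairing
        key = f"{disk}:{lun_id}:{chunk}".encode()
        return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

    def place(lun_id, chunk, num_disks):
        # rendezvous hashing: the chunk lives on whichever disk scores highest
        return max(range(num_disks), key=lambda d: score(d, lun_id, chunk))

    chunks = range(10_000)                              # one LUN, 10k chunks
    before = {c: place(0, c, 120) for c in chunks}      # 120-drive array
    after  = {c: place(0, c, 132) for c in chunks}      # add a 12-drive node

    spread = Counter(before.values())
    moved  = sum(1 for c in chunks if before[c] != after[c])
    print("chunks per disk, min/max:", min(spread.values()), max(spread.values()))
    print(f"data moved after expansion: {moved / len(chunks):.0%}")   # roughly 12/132

Every chunk of the LUN lands on a pseudo-randomly chosen disk, so reads and writes fan out across the whole array, and growing from 120 to 132 drives only relocates about 9% of the data.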

How is this different?

The big change here is in the way they put data on their disks. They've re-invented the wheel a bit, but for a reason: the performance you can get out of low-cost, low-end drives in parallel is very good. Normally I would never tell people that SATA is appropriate for databases or email, but XIV claims to be fast enough. I imagine we'll see some benchmarks soon.

The first thing I asked about was parity space. XIV spreads parity info over the whole array, so with 120 1TB drives you get 80TB of addressable space. Also, because rebuilding a 1TB drive from parity is normally a really intensive operation that generates many reads across the RAID group, I asked how they handle rebuilds. They claim they can rebuild a 1TB drive from parity in about half an hour, because all the parity data is being read from all the other heads simultaneously.
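
To sanity-check those numbers: the protection overhead works out to about a third of the raw capacity, and if the lost terabyte really is re-read from all surviving drives at once, each drive only has to contribute a few MB/s to hit the half-hour claim. Quick back-of-the-envelope math (my arithmetic, not IBM's):

    raw_tb    = 120            # 120 x 1TB drives
    usable_tb = 80             # addressable space quoted above
    print(f"protection + spare overhead: {1 - usable_tb / raw_tb:.0%}")     # ~33%

    rebuild_mb     = 1_000_000     # the lost 1TB, in decimal MB
    rebuild_s      = 30 * 60       # the claimed half hour
    surviving      = 119
    per_drive_mb_s = rebuild_mb / rebuild_s / surviving
    print(f"rebuild load per surviving drive: {per_drive_mb_s:.1f} MB/s")   # ~4.7 MB/s

That last figure is a small slice of what even a SATA drive can stream, which makes the half-hour claim plausible and suggests there should be headroom left over for production IO- exactly the thing I wonder about next.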

This sounds good, but I wonder if a failure and rebuild will slow down your entire production environment instead of only the RAID group where the drive failed. Also, in the event of an entire 12-drive node failing, would that mean a six-hour rebuild that affects the whole production array? If they have some way of prioritizing production IO, then I'm satisfied. I don't know if they do, though.

Snapshots

Normal “copy on write” snapshots create extra write traffic- every host write to a snapshotted block first forces a copy of the old data, and that extra write must be committed to disk before the acknowledgment is sent to the host. XIV uses a snapshot algorithm called “redirect on write” to avoid this problem and allow larger numbers of readable/writable snapshots.

They create a snapshot LUN that initially points to the real data; when a change is made to the source, they write the new data to unused space and point the production LUN there, while leaving the snapshot pointed at the old data. NetApp used a different algorithm to solve the same problems inherent in traditional “copy on write” snapshots, and it helped launch them into success in the enterprise storage market years ago.
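
Here's a toy model of the pointer bookkeeping as I understood it from the pitch- not XIV's actual metadata layout, just an illustration of why redirect on write costs only one write per host write while the snapshot keeps seeing the old data:

    class Volume:
        def __init__(self, nblocks):
            self.blocks = {i: f"data{i}" for i in range(nblocks)}   # physical store
            self.free = nblocks                                     # next unused block
            self.prod = {i: i for i in range(nblocks)}              # LUN -> block map
            self.snaps = []

        def snapshot(self):
            # a snapshot is just a copy of the pointer table, not of the data
            self.snaps.append(dict(self.prod))

        def write(self, lba, data):
            # redirect on write: one write to a fresh block, one pointer update
            self.blocks[self.free] = data
            self.prod[lba] = self.free
            self.free += 1

    v = Volume(4)
    v.snapshot()
    v.write(2, "new data")
    print(v.blocks[v.prod[2]])        # production sees "new data"
    print(v.blocks[v.snaps[0][2]])    # the snapshot still sees "data2"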

Other advanced features

The box is delivered with all functionality enabled, which is an interesting move considering every other vendor I've dealt with makes most of their money from software. They include mirroring, thin provisioning, and an unusual one-time-only kind of virtualization for migrations- it sits between the hosts and your old storage, passing the IO through transparently while it reads all the data off the old array.
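
As I understood the migration feature, it behaves roughly like the sketch below- the new box sits in the data path, answers IO immediately, and drains the old array in the background. The class and method names are mine, not IBM's:

    class MigratingVolume:
        def __init__(self, old_array):
            self.old = old_array      # lba -> data still on the legacy array
            self.local = {}           # blocks already living on the new array

        def read(self, lba):
            # pass-through: prefer the migrated copy, fall back to the old array
            return self.local.get(lba, self.old.get(lba))

        def write(self, lba, data):
            # new writes land on the new array; that block no longer needs copying
            self.local[lba] = data

        def migrate_some(self, n):
            # background task: pull blocks that haven't been copied or overwritten
            pending = [lba for lba in self.old if lba not in self.local]
            for lba in pending[:n]:
                self.local[lba] = self.old[lba]

    legacy = {i: f"old{i}" for i in range(4)}
    vol = MigratingVolume(legacy)
    vol.write(1, "fresh")
    vol.migrate_some(10)
    print([vol.read(i) for i in range(4)])    # ['old0', 'fresh', 'old2', 'old3']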

Questions

If someone from XIV (or more likely IBM) is reading this, I want to know more details about your mirroring and your workload prioritization:

  • Do you support synchronous, asynchronous, and asynchronous with consistency group mirroring? What about one to one, one to many, and many to one configurations?
  • Do you have a way to prevent disk rebuilds from taking disk resources that are needed by production apps?




Follow up: Gizmodo reporter banned for prank.

14 01 2008

It sounds like the guy I wrote about last week, who was pranking salespeople at CES, has been banned for life.

Good.





Off topic: Gizmodo acts stupidly at CES

11 01 2008

OK, I know you are young. I know you are irreverent. I know you think you're cooler than a bunch of old, stodgy salespeople at CES. That's no excuse for being an ass-hat and pulling pranks on people who are just trying to do their jobs:

Click here to see Gizmodo’s account of how they “pwned” CES.

Sometimes I'm glad our industry is not as old and unexciting as, say, commodities brokerage. I like the fact that some of the most influential people in the technology industry are irreverent bloggers with a sense of humor. People like Arrington or Malik, however, have a basic sense of boundaries and decency, and would never stoop to annoying (admittedly uncool) salespeople to get a laugh from their audience. For shame!

There’s a reason I removed you insufferable tools from my RSS reader a few months ago. This is just another extension of it.





Barry’s question

7 01 2008

Via email:

“When you are thinking about Disaster Recovery, CDP, do you assume that Tier 3 is adequate, mainly because this is backup only, or maybe DR so hopefully not needed? How does your thinking proceed? Do you think about your primary data at the same time?

I ask this as a loaded question, knowing that anything that has to copy to, snap to, or mirror with, secondary, backup or DR or CDP storage now has a definite tie with the primary.

Barry Whyte

SVC Performance Architect
IBM Systems & Technology Group”

This is a loaded question! To start with, I’ll note some assumptions and concept clarifications to ensure we’re talking about the same thing- if I’m off on anything, let me know ;)

  • CDP: continuous data protection, here IBM's backup software approach- small changes are sent continuously to a central server as they happen
  • Tier 3: low-price, random-access storage media- not tape, usually cheap SATA drives
    • Note: there’s been discussion about these tier definitions before, and I hold that tier 3 means different things to different companies.

To your question- I would have to decide based on the company's current architecture. If they have a storage solution with synchronous mirroring between two sites, then using low-performance drives on either side will slow production. If they're doing asynchronous replication (or a server-based rather than storage-based DR solution), I would probably be fine with SATA/tier 3.
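
The difference is easiest to see as a latency budget: with synchronous mirroring the host acknowledgment waits for the remote write, so a slow SATA target on either side shows up directly in production response time, while with async it doesn't. The numbers below are made-up illustrations, not measurements:

    # assumed, illustrative service times
    local_write_ms = 1.0     # write absorbed by the primary array's cache
    remote_sata_ms = 12.0    # random write landing on a busy SATA target
    wan_rtt_ms     = 5.0     # round trip between the two sites

    sync_ack_ms  = local_write_ms + wan_rtt_ms + remote_sata_ms   # host waits for both copies
    async_ack_ms = local_write_ms                                 # the remote copy trails behind
    print(f"sync mirror ack:   {sync_ack_ms:.0f} ms")
    print(f"async replica ack: {async_ack_ms:.0f} ms")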

To explain my reasoning, I must first say that I cannot decide without a specific case and an IT person to question; my advice would be based on risk tolerance versus capital expenditure tolerance. Secondly, SATA has an undeserved bad rap- the drives are about as reliable as other enterprise drives (according to Google). SATA drives are certainly not fast for random access loads, but for sequential and low-urgency loads like backups, they will do the job.

Low-performance media will always be part of a healthy storage balance- the most bang for most companies' bucks is in prioritizing their applications (or even their data) and using the media that makes the most sense for each. Need an Oracle server to stop freezing up your warehouse management app? Put that baby on 15,000 RPM FC hard drives- lots of them. Need to keep a backup copy of a file server on site in case of a server outage? SATA will do the job. Need to keep nightly point-in-time backups of your entire storage infrastructure for years? You probably can't afford to put that on drives at all- use tape.
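
In practice the exercise looks something like the little triage function below- classify each workload, then buy the cheapest media that satisfies it. The categories and examples are mine, just to show the shape of the decision:

    def pick_tier(random_io, latency_sensitive, restore_urgency):
        # crude decision tree: hot random workloads get fast spindles,
        # warm recovery copies get SATA, everything colder goes to tape
        if random_io and latency_sensitive:
            return "15k RPM FC"
        if restore_urgency == "hours":
            return "SATA"
        return "tape"

    workloads = {
        "warehouse mgmt Oracle DB":        (True,  True,  "minutes"),
        "on-site file server backup copy": (False, False, "hours"),
        "multi-year nightly PIT backups":  (False, False, "weeks"),
    }
    for name, args in workloads.items():
        print(f"{name}: {pick_tier(*args)}")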

That said, most companies that haven't reached a boiling point in their yearly storage spending won't bother to do much of this. Face it, tiering your applications for storage takes operator time, and gear just seems to feel cheaper to management than IT man-hours. That, and the explosive growth of media density in the last five years, has kept tiered storage adoption limited to the ridiculously large data producers who have no other choice (like large banks) and to the more forward-thinking smaller shops.





Gene’s question

21 11 2007

Gene writes:

question about SAN interoperability

…2 windows 2003 server sp2 servers, running on HP proliant dl380 g4 each with one single port fiber hba’s. Servers will be clustered to run sql 2005. HBA’s are hp branded- emulex fc2143’s (Emulex id- is lp1150)…SUN SAN has both 6130’s disk array and 3510 array…we want to use disk from both arrays…(better disks in 6130, slower stuff in 3510)

Do we actually need multipath drivers? (SUN has come out with DSM’s for both these arrays)…any issue using multiple DSM’s if they are required.

Any known issues with the type of device drivers for the HBAs? Storport versus scsiport…

any help is appreciated

In general, if you only have one FC port per server, you don’t need a multipath driver. I am not sure if this holds true with multiple subsystems that aren’t under some sort of virtualization umbrella though… you might need a device driver that understands how to work with multiple subsystems. This would not be a multipath driver though- those are for multiple paths to the same LUN.

Regarding SCSIport versus Storport, I found an excellent whitepaper detailing the differences here. The way I read it, these layers of the storage stack replace the proprietary device and multipath drivers provided by Sun- if Sun supports it, then you should take Storport, the more recent model. Unfortunately, I can't give you very specific caveats with this technology, because every system I've worked on either used the vendor's device and multipath drivers or a virtualization head to combine multiple physical subsystems into a logical one.

Both of those disk subsystems have been withdrawn from marketing by Sun- have you asked your Sun contact whether they'll support the setup you're considering?





Defining tiers for storage

17 08 2007

There's a good series going on over at the Storage Anarchist's page about defining storage tiers- if you're trying to get some insight into better organizing your own data, it's worth following. Here's the link to the first of four entries.







