22 08 2008

I finally got a chance to learn about XIV. I was dragged into an IBM product presentation recently, so I figured I would summarize here the one thing not covered by the NDA :)

What is XIV?

Essentially, it’s a disk storage device that uses only SATA drives but gets a high number of IO/s out of them by spreading the reads and writes across all disks. Every LUN you create will be stretched across every disk in the array. Instead of using standard RAID to do this, XIV has a non-standard algorithm that accomplishes the same thing on a larger scale.

They build every system exactly the same way- each system contains a bunch of nodes of 12 drives each, with their own processors and memory. It’s all off-the-shelf hardware in a node- Pentium processors, regular RAM, and SATA drives. None of it is enterprise class on its own, but because of the distribution scheme they’ve worked out, you get the combined performance of all the drives for all your reads and writes.
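To make the distribution idea concrete, here’s a rough sketch of how deterministic pseudo-random chunk placement can stripe every LUN across every disk with two copies of each chunk. The disk count, chunk granularity, and hashing scheme are all my own assumptions for illustration- XIV’s real algorithm is not public:

```python
import hashlib

N_DISKS = 120  # assumed array size for this sketch

def place_chunk(lun_id: int, chunk_no: int) -> tuple[int, int]:
    """Pick two distinct disks for a chunk, pseudo-randomly but deterministically."""
    h = hashlib.md5(f"{lun_id}:{chunk_no}".encode()).digest()
    primary = int.from_bytes(h[:4], "big") % N_DISKS
    # an offset in 1..N_DISKS-1 guarantees the mirror lands on a different disk
    offset = 1 + int.from_bytes(h[4:8], "big") % (N_DISKS - 1)
    mirror = (primary + offset) % N_DISKS
    return primary, mirror

# spread one 10,000-chunk LUN and count chunks landing on each disk
load = [0] * N_DISKS
for chunk in range(10_000):
    p, m = place_chunk(lun_id=1, chunk_no=chunk)
    load[p] += 1
    load[m] += 1

print(min(load), max(load))  # every disk carries a roughly equal share
```

The nice property is that placement is computed rather than looked up in a giant table, and every disk ends up sharing chunks with every other disk- which is also why rebuilds can read from everywhere at once.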

Scalability is done by hooking new systems to old ones through the 10Gb interlink switch ports. They say that as newer interconnect tech becomes available, this will follow along (so eventually they will support InfiniBand). Also, when you add a system to the cluster, the rebalancing of data is automatic.

How is this different?

The big change here is in the way they put data on their disks. They’ve re-invented the wheel a bit, but for a reason: the performance you can get out of low-cost, low-end drives in parallel is very good. Normally, I would never tell people that SATA is appropriate for databases or email, but XIV claims to be fast enough. I imagine we’ll see some benchmarks soon.

The first thing I asked about was parity space. XIV puts parity info over the whole array, so with 120 1TB drives, you get 80TB addressable space. Also, because rebuilding a 1TB drive from parity is normally a really intensive operation that generates many reads across the RAID, I asked about how they handle rebuilds. They claim that they can rebuild a 1TB drive from parity in about half an hour because all the parity data is being read from all the other heads simultaneously.

This sounds good, but I wonder if a failure and rebuild will slow down your entire production environment instead of only the RAID group where the drive failed. Also, in the event of an entire node failing with 12 drives, would that mean a 6 hour rebuild that affects the whole production array? If they have some way of prioritizing production IO over rebuild IO, then I am satisfied. I don’t know if they do though.
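For what it’s worth, the half-hour claim passes a back-of-the-envelope test. With the rebuild data scattered across every other drive, each survivor only has to contribute a trickle of bandwidth (the 5 MB/s per-drive figure below is my assumption, not an XIV number):

```python
DRIVE_GB = 1000          # one failed 1TB drive to reconstruct
N_DISKS = 120
PER_DRIVE_MB_S = 5       # assumed background read rate per surviving drive

surviving = N_DISKS - 1
aggregate_mb_s = surviving * PER_DRIVE_MB_S          # 595 MB/s across the array
rebuild_minutes = DRIVE_GB * 1000 / aggregate_mb_s / 60
print(round(rebuild_minutes))  # → 28
```

A 12-drive node failure would multiply that by 12, which is exactly the 6 hour worry- unless production IO gets priority over those background reads.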


Normal “copy on write” snapshots create extra write traffic- every write to a block covered by a snapshot triggers an extra copy that must be committed to disk before the acknowledgment is sent to the host. XIV uses a snapshot algorithm called “redirect on write” to avoid this problem and allow larger numbers of readable/writable snapshots.

They create a snapshot LUN that initially points to the real data. When a change is made to the source, they write the new data to unused space and point the production LUN there, while leaving the snapshot pointed at the old data. NetApp solved the same problems inherent in traditional “copy on write” snapshots with a different algorithm years ago, and it launched them into success in the enterprise storage market.
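Here’s a minimal sketch of redirect on write in Python- a snapshot is just a frozen copy of the pointer table, and writes always land in fresh space, so there is no extra copy on the write path. This is my illustration of the general technique, not XIV’s implementation:

```python
class Store:
    """A flat pool of physical blocks with a simple bump allocator."""
    def __init__(self):
        self.phys = {}       # physical address -> data
        self.next_free = 0

    def alloc(self, data):
        addr = self.next_free
        self.next_free += 1
        self.phys[addr] = data
        return addr


class Volume:
    """A LUN: a table mapping logical block numbers to physical addresses."""
    def __init__(self, store):
        self.store = store
        self.table = {}      # logical block -> physical address

    def write(self, block, data):
        # redirect on write: new data goes to unused space and only the
        # production pointer moves- nothing is read back or copied first
        self.table[block] = self.store.alloc(data)

    def read(self, block):
        return self.store.phys[self.table[block]]

    def snapshot(self):
        # a snapshot is just a copy of the pointer table; it keeps
        # pointing at the old physical blocks as production moves on
        snap = Volume(self.store)
        snap.table = dict(self.table)
        return snap


store = Store()
prod = Volume(store)
prod.write(0, "old data")
snap = prod.snapshot()
prod.write(0, "new data")
print(prod.read(0), "/", snap.read(0))  # → new data / old data
```

Contrast this with copy on write, where that second write would first have to copy the old block aside before overwriting it in place- an extra IO on every write to a snapped block.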

Other advanced features

The box is delivered with all functionality enabled, which is an interesting move considering every other vendor I’ve dealt with makes most of their money from software licensing. They include mirroring, thin provisioning, and an odd one-time-only type of virtualization for migrations: it sits between the hosts and your old storage, passing the IO through transparently while it copies all the data off the old storage in the background.


If someone from XIV (or more likely IBM) is reading this, I want to know more details about your mirroring and your workload prioritization:

  • Do you support synchronous, asynchronous, and asynchronous with consistency group mirroring? What about one to one, one to many, and many to one configurations?
  • Do you have a way to prevent disk rebuilds from taking disk resources that are needed by production apps?

Oracle RAC ASM on a JBOD- Mike’s question

21 07 2008

I had Oracle RAC with ASM running in a raw device configuration on dual 32 bit servers running RedHat 4. I upgraded to dual qla2200 HBAs on each server, and they connect through a Brocade 2250 switch to two JBOD disk arrays, a NexStor 18f and a NexStor 8f. I have set up multipath in the multibus configuration and can see the drives as multipath devices in an active-active configuration. I am using OCFS2 for my CRS voting and config disks and that runs fine. When I try to start ASM I get proper connectivity on the first server that starts up; the second server hangs until it eventually errors with an access issue.

I have verified that all required permissions are set and can see the disks on both sides using the oracleasm utility. It appears DM (device-mapper) is only allowing a single host to access the ASM disks at one time, so when node A starts up and acquires the ASM disks when the ASM instance starts, node B is left hung, and vice versa if node B starts first.

I was told it may be a SCSI reservation issue, but can’t seem to find any information on this. I know people are using this type of configuration with RAID controllers, but is the JBOD causing issues? How do I get both instances seeing the ASM disks?


Hi Mike! To preface my answer, I’ll start by saying that my Oracle knowledge was purely acquired through osmosis, and that I’m primarily a platform guy. I’ve never sat in front of an Oracle server and done anything, but I do understand a fair bit about how they interact with storage :)

First, when you say “I upgraded to dual qla2200 HBAs on each server”, do you mean that you had a working system using ASM and RAC before you changed the HBA hardware? If so, without even going into the rest of the story, I would start by checking Oracle’s and Nexstor’s support for that card and seeing if they have any known issues with the firmware level you’re using.

Second, it really sounds like your main issue is a multipath one. A “SCSI reservation issue” is another way of saying that a server is locking the devices to itself, which is exactly what multipathing software is supposed to fix. There are several places that it could break down: the application, the OS, the hardware, or the firmware. The only way to see which level your problem comes from is to try to eliminate them by swapping them out. I’d start with Oracle- ASM is supposed to really get down and directly control the disks as raw devices, so they might have a compatibility matrix that contains the whole stack. Maybe it’s as simple as Oracle not supporting your firmware…

If not, you’ll have to do some troubleshooting and vendor support calls until you find out where in your config the error is. I am fairly certain that Oracle can fix this for you though.

Linux sharing a JBOD- paul’s question

23 05 2008

“Question about multiple servers accessing same disks through a SAN switch

I’m trying to set up a Linux system for server failover where two servers (with SAS HBA) are accessing the same set of disks (jbod) through a SAN switch. First – will this work? If so, what software do I need to run on the servers to keep the two servers from stepping on each other? Do I need multipath support?”

It depends. A JBOD does not do any RAID management- it leaves that to the servers. If you have two servers trying to operate on the same disks, they have to be running cluster-aware software (a clustered file system or cluster manager) to avoid overwriting each other’s data. Linux can probably do that, and I know Windows has a cluster edition, but it’s certainly not the simplest way to get servers sharing data.

This brings me to my first question: what are you trying to do? Do you want them to run the same application so if one fails, the other will pick up the slack without losing anything? Or are you trying to process the same data twice as fast by using two server “heads”?

Secondly, when you say “SAN switch”, do you mean fibre channel switches? If you have SAS HBAs, those cannot plug into an FC fabric.

Thirdly, multipath support usually means allowing a server to see a single LUN through multiple paths or fabrics. If you have more than one path between every server and drive, multipathing software would indeed be needed.

Follow up: Gizmodo reporter banned for prank.

14 01 2008

It sounds like the guy I wrote about last week, who was pranking salespeople at CES, has been banned for life.


Off topic: Gizmodo acts stupidly at CES

11 01 2008

OK, I know you are young. I know you are irreverent. I know you think you’re cooler than a bunch of old stodgy sales people at CES. That’s no excuse for being an ass-hat and pulling pranks on people who are just trying to do their job:

Click here to see Gizmodo’s account of how they “pwned” CES.

Sometimes I’m glad our industry is not as old and unexciting as, say, commodities brokerage. I like the fact that some of the most influential people in the technology industry are irreverent bloggers with a sense of humor. People like Arrington or Malik, however, have a basic sense of boundaries and decency, and would never stoop to annoying (admittedly uncool) sales people to get a laugh from their audience. For shame!

There’s a reason I removed you insufferable tools from my RSS reader a few months ago. This is just another extension of it.

Barry’s question

7 01 2008

Via email:

“When you are thinking about Disaster Recovery, CDP, do you assume that Tier3 is adequate, mainly because this is backup only, or maybe DR so hopefully not needed? How does your thinking proceed? Do you think about your primary data at the same time?

I ask this as a loaded question, knowing that anything that has to copy to, snap to, or mirror with, secondary, backup or DR or CDP storage now has a definite tie with the primary.

Barry Whyte

SVC Performance Architect
IBM Systems & Technology Group”

This is a loaded question! To start with, I’ll note some assumptions and concept clarifications to ensure we’re talking about the same thing- if I’m off on anything, let me know ;)

  • CDP: continuous data protection- a backup approach (IBM sells software built around it) where small changes are sent to a central server continuously
  • Tier 3: low price random access storage media- not tape, usually cheap SATA drives
    • Note: there’s been discussion about these tier definitions before, and I hold that tier 3 means different things to different companies.

To your question- I would have to decide based on the company’s current architecture. If they have a storage solution that does synchronous mirroring between two sites, then using low performance drives on either side will slow production. If they’re doing asynchronous replication (or a server-based rather than storage-based DR solution), I would probably be fine with SATA/tier3.

To explain my reasoning, I must first say that I can not decide without having a specific case and an IT person to question. My advice would be based on risk tolerance versus capital expenditure tolerance. Secondly, SATA has an undeserved bad rap- the drives are about as reliable as other enterprise ones (according to Google). SATA drives are certainly not fast for random access loads, but for sequential and low urgency loads like backups, they will do the job.

Low performance media will always be part of a healthy storage balance- the most bang for most companies’ bucks will be in prioritizing their applications (or even their data), and using the media that makes the most sense. Need an Oracle server to stop freezing up your warehouse management app? Put that baby on 15,000 RPM FC hard drives- lots of them. Need to keep a backup copy of a file server on site in case of a server outage? SATA will do the job. Need to keep nightly point in time backups of your entire storage infrastructure for years? You probably can’t afford to put that on drives at all- use tape.

That said, most companies that haven’t reached a boiling point in their yearly storage expenditures won’t bother to do much of this stuff. Face it, tiering your applications for storage takes operator time, and gear just seems to feel cheaper to management than IT man hours. That and the explosive growth of media density in the last 5 years have kept tiered storage plan adoption limited to either the ridiculously large data producers who have no other choice (like large banks) or the more forward thinking smaller shops.

Gene’s question

21 11 2007

Gene writes:

question about SAN interoperability

…2 windows 2003 server sp2 servers, running on HP proliant dl380 g4 each with one single port fiber hba’s. Servers will be clustered to run sql 2005. HBA’s are hp branded- emulex fc2143’s (Emulex id- is lp1150)…SUN SAN has both 6130’s disk array and 3510 array…we want to use disk from both arrays…(better disks in 6130, slower stuff in 3510)

Do we actually need multipath drivers? (SUN has come out with DSM’s for both these arrays)…any issue using multiple DSM’s if they are required.

Any known issues with the type of device drivers for the HBAs? Storport versus scsiport…

any help is appreciated

In general, if you only have one FC port per server, you don’t need a multipath driver. I am not sure if this holds true with multiple subsystems that aren’t under some sort of virtualization umbrella though… you might need a device driver that understands how to work with multiple subsystems. This would not be a multipath driver though- those are for multiple paths to the same LUN.

Regarding scsiport versus storport, I found an excellent whitepaper detailing the differences here. The way I read this is that these layers of the storage stack replace the proprietary device and multipath drivers provided by Sun- if they support it, then you should take storport, the more recent version. Unfortunately, I can’t give you very specific caveats with this technology because every system I’ve worked on used the vendor’s device and multi-path drivers, or a virtualization head to combine multiple physical subsystems into a logical one.

The disk subsystems are withdrawn from Sun’s marketing- have you asked your Sun contact whether they’ll support the setup you’re considering?

