Why you can’t always trust the Hardware Compatibility List (HCL)

Late last year, I started on a project to replace my workplace’s main AFP/SMB/NFS file storage.  Having already had good experiences with Synology gear, knowing DSM well, and knowing other macadmins happy with their Synos, I ordered storage and built a system that should have performed quite well:

  • Synology 3614RPXS unit
  • 10 HGST 7k4000 4TB drives
  • 2 250gig SSD drives
  • 10 Gig SFP+ card, fiber networking back to the switches.
  • +16 Gigs ECC RAM

All taken from the Synology HCL, so I was ready to rock.  I built it as a RAID6, started copying files, and everything was looking good.  Once I’d moved a few TB onto the NAS, I started checking sequential read speeds from the NAS, and found an unexpected behavior:  the reads would often saturate a GigE connection as you’d expect, until they didn’t.  Large transfers would look like this: [Image]

When it was great, it was great.  When it wasn’t, it wasn’t.  REALLY wasn’t.  It would run for about 90 seconds at full tilt, then 20 seconds of nearly nothing.

During these file copies, the overall Volume Utilization levels would ebb and flow in an inverse relationship with speed.  Whenever volume utilization approached and hit 100%, the network speed plummeted.
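If you want to catch this kind of pattern without staring at a graph, something like the following would do it. This is only a rough sketch, not the tooling I or Synology used: it reads one big file front to back, records throughput per interval, and flags the intervals that fall below a threshold. The function names, the 10 MB/s threshold, and the example path are all made up for illustration. Use a file much larger than the client's RAM so the local cache doesn't hide the stalls.

```python
import time


def sample_read_throughput(path, chunk_size=1024 * 1024, interval=1.0):
    """Read `path` sequentially; return a list of (elapsed_seconds, MB_per_second)."""
    samples = []
    start = time.monotonic()
    window_start = start
    window_bytes = 0
    # buffering=0 avoids Python's own read buffer; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window_bytes += len(chunk)
            now = time.monotonic()
            elapsed = now - window_start
            if elapsed >= interval and elapsed > 0:
                samples.append((now - start, window_bytes / elapsed / 1e6))
                window_start = now
                window_bytes = 0
    # Record whatever was read in the final, partial window.
    if window_bytes:
        now = time.monotonic()
        elapsed = max(now - window_start, 1e-9)
        samples.append((now - start, window_bytes / elapsed / 1e6))
    return samples


def find_stalls(samples, threshold_mb_s=10.0):
    """Return the samples whose throughput fell below the (arbitrary) threshold."""
    return [(t, rate) for t, rate in samples if rate < threshold_mb_s]


if __name__ == "__main__":
    # Hypothetical path to a large file on the mounted NAS share.
    for t, rate in find_stalls(sample_read_throughput("/Volumes/share/bigfile.bin")):
        print(f"stall at {t:7.1f}s: {rate:6.1f} MB/s")
```

A read that blocks for 20 seconds shows up as one long window with a very low rate, so the stall still gets flagged even though no samples are taken while the read is stuck.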

So the question became “why does the volume utilization go so high?” I started a ticket with Synology on Feb 2.  I ran their tests and made the configuration changes they requested: direct GigE connections to take the LAN out of the equation, testing over SMB/AFP/NFS, disabling every service that makes the NAS a compelling product.  This stumped the U.S.-based support, so it became an issue for the .tw engineers.

If your ticket goes off to the Taiwanese engineers, the communications cycles start to rival taking pen to paper and paying the government to deliver the paper. To Taiwan.  It all runs through the US support staff, and it gets slow. Eventually, I coordinated a screen sharing session with an engineer, where I replicated the issue.  They tested more… htop, iostat.  “can you make a new volume?”  “if you send me disks I can!”

Meanwhile, I’m asking the storage guys I know on Twitter (and their friends), and scouring the Synology forums for anybody who has an answer.  Eventually, I find not an answer, but someone else who has the same experience.  We start collaborating.  Then a few days later, I find another forum post from a user who has the same issues.  We start exploring ideas… amount of RAM?  Version of DSM that built the storage?  RAID level?  Then we find the overlap: we all use Hitachi HUS724040AL[AE]640 drives, and at least 10 of them in a volume.  One user was fine with 8 of them in a NAS, but when he expanded to 13, performance changed, which led to his post looking for help.

I then brought this information to Synology, and on March 27, Synology informed me they were trying to replicate the issue with that gear.  On April 16, they’d finally received drives to test.  On April 21, they agreed with my conclusion:

“The software developers have came to a conclusion on this issue. That is, the Hitachi with a HUS724040… suffix indeed has slow performance in a RAID consisted of more than 6 disks.”

Despite being on the list, and despite configuring everything properly, I still ended up with gear that did not perform as expected, as they’d not tested this number of drives in a volume.  Hitachi now tells me that they’re working on the issue with Synology, but in the meantime, I’m abandoning the Hitachi drives for WD Red Pros.