Relevant mostly to OS X admins
Earlier this week, my production Xserve suddenly started behaving badly: massive latency, timeouts authenticating users, dismal disk performance, and lots of SBBOD (spinning beach ball of death) on the console. Checking the logs, I found many errors such as
client: 0x825200 : USER DROPPED EVENTS! callback_client: ERROR: d2f_callback_rpc() => (ipc/send) timed out (268435460) for pid 17336
along with fseventd errors. I also noted that CrashPlan PROe had simply halted partway through a scan. I started with a reboot, which fixed the issues for that day, but by morning they’d returned, with similar log errors. I poked around, starting with Disk Utility to run checks. It reported that the first two volumes I asked it to check were healthy, but it got stuck for over half an hour on another. After finally persuading it to cancel that check, I brought over a go-to disk maintenance tool I’ve used for years: Alsoft’s DiskWarrior. It has never harmed data for me on a directory rebuild, but I have to admit that the idea of running it on two production AFP storage RAIDs (RAID 5 and RAID 6) and a RAID 1 boot volume gave me pause. Still, I double-checked last night’s backups and had at it.
DiskWarrior found Volume Information errors on all three RAIDs and fixed them, and in the 48 hours since, the server has been humming along as expected.
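For anyone chasing the same symptoms, it can help to see how often those dropped-event errors turn up and which process IDs they name. Here is a rough sketch in Python, assuming the log lines look like the one quoted above. The function name and the second sample line are made up for illustration, and the pattern is based on that single line, so other variants of the message may not match.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this post (an assumption:
# other system versions may word the message differently).
DROPPED = re.compile(r"USER DROPPED EVENTS!.*timed out \((\d+)\) for pid (\d+)")


def tally_dropped_events(lines):
    """Count dropped-event errors per pid in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DROPPED.search(line)
        if match:
            counts[match.group(2)] += 1  # group 2 is the pid
    return counts


if __name__ == "__main__":
    # First line is the error from this post; the second is a hypothetical
    # unrelated line to show that non-matching entries are skipped.
    sample = [
        "client: 0x825200 : USER DROPPED EVENTS! callback_client: ERROR: "
        "d2f_callback_rpc() => (ipc/send) timed out (268435460) for pid 17336",
        "some unrelated log line",
    ]
    for pid, n in tally_dropped_events(sample).most_common():
        print(f"pid {pid}: {n} dropped-event error(s)")
```

To run it against a real log, pass an open file instead of the sample list, e.g. `tally_dropped_events(open(path))`, where `path` is wherever your system keeps the relevant log.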
I remember using DW back on an AppleShare IP server in the pre-OS X days. Unfortunately, HFS+ has its flaws, but DW has a good shot at repairing the directory damage that results.
Now… where’s my native ZFS?