Raidz2 Large Write Slows Down After a While

I don't agree.

And if one of your disks failed, and age was a factor… you're going to be sweating bullets wondering if another will fail before your resilver completes.

So every single time you lose any drive in a mirror setup you risk all the data on all drives in the entire pool. I sure do hope you aren't on vacation and/or have to order a drive online.

It all depends on your use case. For me, raidz3 wins easily. Performance during resilvering is not something most users would suffer that much from anyway. And that 8-drive recommendation comes from many factors and doesn't apply to most home users anyway. Not that you should go overboard with it, but it will easily pay for fast cache or whatever else you might want.

If performance were the goal you would not be using spinning rust anyway, and if you are still limited by a single gigabit link then don't even think about it (but don't over-utilize your pool).

But there are still lots of potential ways for your data to die, and you still need to back up your pool. Period. PERIOD!

Of course! Yet for home users there is not a single reasonable way to do it if you have a decently sized pool. There are bound to be sacrifices in what you choose to back up.

Except for maybe another pool. Which is going to hurt, since ZFS with everything bought up front is VASTLY more expensive than RAID where you can grow the array as needed - under the assumption that storage needs grow slowly, which they typically do for home users.

As a ZFS user and fanboy, a dual RAID6 setup is tempting. But I can't compromise on the filesystem, so in the end I've compromised on backups instead. Likely not the smartest move considering how rare bitrot is, I am very well aware.

For many drives I'd go for raidz3. And for a 4-drive NAS I'd go with raidz2 rather than a mirror setup, for peace of mind.

For fast SSD pools I'd go with mirrors. Much easier to backup the entire pool as well.

RAID is not a backup solution, it's an uptime solution.

I know. Doesn't change the fact that there exists no sensible way to back up a decently sized pool in a home-user scenario.

Now we might disagree on what is sensible. And some things might not need a backup.

Some people mention the ZFS send feature, which avoids re-sending unchanged blocks. You could use rsnapshot to a different machine (which only needs to be a little bit larger than the original one), and personally I would recommend Borg backup, which can deduplicate, compress and encrypt and isn't hard to set up. Depending on the data, this can result in the backups being a fraction of the original. You can use AWS S3 / DO Spaces (which is a lot cheaper) or other compatible interfaces, via rclone serve sftp + sshfs with Borg, or natively in the case of restic or Duplicacy. You can also look up Tarsnap, rsync.net and others for remote backups.
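For example, a minimal Borg-over-SSH sketch (the host name and paths here are made up; the repo could just as well live on a big external drive):

    # one-time: create an encrypted, deduplicating repo on the remote box
    borg init --encryption=repokey-blake2 backuphost:/backups/nas

    # each run: archive the datasets you care about, compressed and deduplicated
    borg create --stats --compression zstd,3 \
        backuphost:/backups/nas::'{hostname}-{now}' \
        /tank/documents /tank/photos

    # thin out old archives on a schedule
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 backuphost:/backups/nas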

In home scenarios, you usually care about documents, pictures, a few specific videos and some music above everything else. This tends to be a fraction of the total capacity. The rest doesn't change very often and is not such a loss, but tends to be quite huge in comparison. So to back up all documents and pictures, one or two external drives and/or a cloud backup is sufficient.

Personally, for some important documents I use orgpad.com which I help build. I am personally responsible for the infrastructure - so the backups actually have at least three tiers, independent providers etc. That is just a side effect of having a SaaS platform with a reasonable and contactable team behind it.

zfs send. Keep different filesystems for system backup, personal data, media storage, gross hoarding. Consider not backing up the media storage and gross hoarding filesystems.

Targets:

1. That NAS off in the corner built from an old desktop.

2. rsync.net and similar.

3. Your encrypted cloud backup service of choice, through an adapter.

If the value of your data is less than the cost to protect it, you know it is disposable, you just wish it weren't.


Backing up TBs of data isn't cheap no matter what type of cloud backup solution you pick, unfortunately. I don't think most here want to spend $1000/year on backing up their entire NAS.

> I don't think most here want to spend $1000/year on backing up their entire NAS.

These 2 solutions cost less than 10 USD/month: Jottacloud [1] and zfs.rent [2]. The former is even hosted in privacy-friendly Norway, and it's "unlimited" with a 5 TB soft cap: go over that and you get upload-capped. So I pay 90 EUR/year for Jottacloud, which is approx 10% of your projected 1000 USD.

[1] https://www.jottacloud.com/en/

[2] https://zfs.rent/


What about the initial replication? My NAS is currently sitting at about 8TB used, which even at the maximum 10Mb/s upload of my current ISP (best in the area of course, because 'murika) would take 75 days. Your first link doesn't seem to say anything about alternatives for getting the initial data set to them, and your second link, well "We have queued up 200 users and have decided to close new requests for the time being. Please check back soon!" basically means it is imaginary for the time being.


Buy a second-hand "enterprise" server and some new HDDs. Set the server up at work/home, then physically put it in a data center nearby. If you already live in the USA, hosting elsewhere doesn't get you much. The problem with hosting in the USA is sensitive data for people outside of the USA (the EU for example).

Jottacloud could be a useful offering, but someone considering the best way to balance many drives on a NAS and worried about spending $1000/year to back it up is probably not going to fit into a single account.

zfs.rent is an interesting model, though quite beta and currently not accepting users. And those prices add up pretty fast. Let's say you want to back up about 4 drives of data. That's $960 a year for the first two years, then $480 a year after. Or if you want any redundancy at all, $1200/$600. You'd probably want to find your own cheaper box to stuff hard drives into, and at that point you're far gone from "cloud".

1. "Unlimited" plans marketed for personal users are unsustainable, which is why they invariably push you into some proprietary app. Sure they might let you dump 24TB on it while they're growing, but they'll inevitably tighten that up. And a key feature of backups is long term - do you really want to be reuploading 24TB in a year or two? Look at their business pricing for the actual no-nonsense cost.

2. You can build an 8TB x (4+2) raidz2 with EasyStores for $720. Backing that up to a raidz1 zfs.rent is $1200 for a single year, or $2500 for 3 years by prepurchasing the drives.

An estimate of $1000/year for full cloud backups isn't wrong.

It isn't "unlimited", it is "unlimited with a caveat". It is a 5 TB soft cap after which upload limits apply, and these get progressively harsher the further you are over that 5 TB soft limit. Just like I wrote in my previous post, and as they write on their website. For me, 5 TB in a privacy-friendly country is a bargain deal. Duplicati has support for Jottacloud, and they also provide a CLI for their API.

If I had 24 TB to back up I'd buy my own server and hard disks, and I'd physically set that server up at work or at home. Then I'd travel to a data center (I live in the Amsterdam area, so there's enough choice) and put the server there, again physically. No need to "upload 24 TB".

RAIDZ et al. is about availability. Which makes sense for businesses, but for personal data an offsite backup makes more sense. I'd just go with a mirror of 2 different disks, with different serial numbers (i.e. from different orders).

I was speaking generally. I'm glad the service fits for you, and it appears that they have set up reasonably structured incentives.

But using that pricing, 24TB with Jottacloud would be 450 EUR/year, which is still on the order of $1000. The larger point is that disk sizes are always growing, and while 24TB may seem unnecessary to you, it's very easy to build an array that is cumbersome to back up. And IME this has remained true for the past two decades.

You're right that the cost effective way of performing large backups seems to be just buying another array and hosting it somewhere offsite. But that still doubles your array cost, and so is not something to be done lightly.

FWIW personal users want availability too, unless rebuilding arrays from scratch is your hobby. A simple mirror is a form of RAID, which you can only get away with if your storage requirements can be handled by a single spindle. If that's where you are, fantastic. But that doesn't describe most home ZFS users.

Unless you have tons of data churn or really love burst uploads, you should treat Jottacloud as roughly 15TB of capacity. Here's the speed chart: https://docs.jottacloud.com/en/articles/3271114-reduced-uplo...

So for 24TB you would use two accounts, assuming they're fine with that, and your uploads would run at 5Mbps per thread, with 12 threads in parallel.

That will let you upload 100GB every 4 hours, which should be enough. And the total cost would be 180EUR/year, which is pretty far from $1000.


Thanks for taking the time to explain the details of their offering. I agree that looks well thought out and a much better value than anything else I've seen, especially for the market of personal NAS stuff we're talking about. And a quick search suggests it is supported by Free (i.e. secure) backup solutions like Duplicity.


16TB is basically the smallest array where someone would face the question of mirrors vs raidz. 16TB * $0.005 / GB-mo = $960/year.


2) and 3) require a decent broadband upload speed. Home broadband often doesn't provide this.

Depends where you're from I guess.

It's also just the first backup which takes a (long) while. If you've got only 10 Mbit upload, then you're able to upload 600 Mbit per minute, or 36,000 Mbit per hour. Which is about 4,500 MB/hour (or ~4.4 GB/hour). Which results in 108,000 MB/day (or ~105 GB/day). That's about 3/4 of a TB a week.

I'm glad I've got 30+ Mbit upload with my VDSL2. My first backups took ages (I have 5 TB with Jottacloud). First I thought it was Duplicati being slow with regards to cryptography or being .NET, but nope, that ain't it.

Incremental backups during the night are not going to hamper the connection, though.

> That's about 3/4 of a TB a week.

That's best case. Anyone who's done large transfers over the internet knows reality will be much worse.


Bearing in mind that it's likely not practical to use 100% of available upload bandwidth for extremely long periods of time. Generally speaking anyway. :)

I already have selective backups to Backblaze with rclone.

The point is that home users can have good reasons for relying on redundancy.


That said, it reduces the risk of loss when an actual full backup solution is impractical... I have about 10TB of media on my current 12TB (4x4TB RAID5) NAS box. My next plan is a 6-drive RAIDZ2 with 8-12TB drives early next year. I can't justify doubling the cost or more to ensure it's all backed up. The important bits (to me) are in various redundant places... that doesn't mean I want to lose the rest, and the parity options allow for some safety without too much loss of storage space.

If you only need backups in case of whole-drive failure, AWS S3 Glacier is cheaper at $0.004/GB/month.

But it's still $4/TB/month or $48/TB-year. Assume two 10TB drives are $300 each (you can actually get them for less than $200) and last 5 years, or until the warranty expires ($300 * 2 drives / 5 years / 10 TB = $12/TB-year). So AWS Glacier is still ~4x the cost of buying physical drives in RAID 1 and keeping them at a friend's house.

Honestly, if there were any cloud storage that trades resiliency for price, that would be awesome. As long as the cloud storage can tell me if any of the data has been corrupted/lost, I'd accept a 10% yearly failure rate. If my home RAID survives with 95% probability and the cloud with 90%, then my effective yearly chance of data loss is only 0.5%.


Deep Archive is my favorite backup solution for just storing a lot of data cheaply, as there isn't really anything better. The AWS egress charge is the only thing they keep at a high price, and that's surely for lock-in: $0.09/GB is still too high to restore TBs of data if you need a full NAS rebuild and have to re-download everything.

How can there be any sensible way other than having a complete copy of the data separate from the original copy?

It can be expensive, so perhaps just prioritise what you actually need instead of backing up LATEST.MOVIE.2020.Remux.r250

Yeah, this.

Keep personal media metadata and download it again, maybe, but back it up?

Better still, only store what you'll watch and delete it.

I only back up scans of truly useful legal documents; photos go to the cloud.

Everything contemporary, disposable pop culture is "backed up" online from my perspective.


Isn't the backup solution to always get two of whatever you build? A primary raid backed up to a second pool?


Typically three. Two local, one remote. Consider not RAID protecting the local backup.


Not everyone has that kind of cash lying around, especially if the pool is large and the data is not critical.


You need to do more than just keep a backup of current state, if you want to deal with problems other than complete failure/destruction of the primary. What about accidental deletion and incorrect editing (done too long ago for "undo" to work)?

>What about accidental deletion and incorrect editing (done too long ago for "undo" to work)?

Eh? We're talking ZFS here. Built-in atomic, low-cost snapshots are part of the core feature set. If we're talking about one pool backing up another pool, it's a reasonable baseline assumption that one would be using either their own or one of the many mature scripts to keep regular automatic snapshots and a retention schedule: e.g., once every 10 minutes for a day, once an hour for a week or two, once a week for a year, and monthly thereafter (or whatever fits your use case), along with preservation of any manual custom snapshots. I mean, it's right there, for free.
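For the "it's right there" part, a bare-bones sketch of what those scripts do under the hood (tools like sanoid or zfs-auto-snapshot automate the schedule and retention; pool and dataset names here are made up):

    # take a recursive, atomic snapshot (run this every 10 minutes from your scheduler of choice)
    zfs snapshot -r tank@auto-$(date +%Y-%m-%d_%H%M)

    # review existing snapshots, oldest first
    zfs list -H -t snapshot -o name,creation -s creation -r tank

    # retention: destroy snapshots that have aged out
    zfs destroy tank@auto-2020-12-01_0000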


Fair point, but the comment I replied to did not mention configuring anything like that, and while it is all there for the taking it isn't automatically configured.

This is true in a vacuum, but I feel it misses the mark having run my own personal array for 20+ years. For the most part, individuals have very little mission critical data (unless they're videographers or professional photographers). Ergo, most data on personal arrays is going to be non-critical - eg VM images, old drive images, build trees, downloaded datasets, torrented media.

Does it make sense to double your cost (/halve your capacity) by backing up the whole thing, especially when the backup is likely to be more spinning rust? I've recently chosen to do so (while also backing up the critical data several different ways), but haven't always. I can understand people who still choose otherwise. Either way my main concern is a single raidz3 and I'm going to have a bad week if I lose that, independent of what can be rebuilt.

In general, personal users of ZFS are very different than institutional users of ZFS. Hopefully not too divergent, so those raidz code paths still receive lots of testing and support. But you need to always keep this in mind when reading any advice about ZFS.

False. There are too many different reasons for a backup to make that statement, and for most people RAID and NAS are combined into one.

There is backup because disks are unreliable and fail randomly. RAID is designed to solve this.

There is backup because you accidentally deleted a file. There is backup because ransomware encrypted your files. RAID with snapshots solves these (arguably better than most other backup systems).

There is backup because your main computers (phone) might break. NAS systems are a good place to backup these.

There is backup because your house burned down [or otherwise is destroyed]. This is the only one that RAID doesn't solve. (even then you can do offsite NAS)

What risks do you need to mitigate? Once you have that list you can decide which mitigations to apply. RAID is a very useful piece of this, so saying "RAID is not backup" is not helpful and may keep people from making useful backups because they let the search for perfection delay them.

RAID is not a backup.

It really is as simple as that, and it is helpful to remind people so they do not suffer the harms likely to result from mistakenly believing that RAID is a backup.


If you got disks from the same brand and the same batch, chances are they will fail around the same time - possibly before a resilver completes.

> Except for maybe another pool. Which is going to hurt, since ZFS with everything bought up front is VASTLY more expensive than RAID where you can grow the array as needed - under the assumption that storage needs grow slowly, which they typically do for home users.

Your backup pool doesn't have to match your working pool. It certainly doesn't have to be the same size. It just has to be larger than the data you've got.

So you can build a big 64TB pool, and back it up to a set of external USB drives which expand as your actual data expands. Similarly, you can turn on a bit more compression for the backup pool than you might want on the main one.

> Which is going to hurt, since ZFS with everything bought up front is VASTLY more expensive than RAID where you can grow the array as needed - under the assumption that storage needs grow slowly, which they typically do for home users.

Not with mirrored vdevs, which you can expand pair by pair.
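For concreteness, a minimal sketch of that kind of expansion (pool and device names are made up):

    # grow the pool by one mirrored pair at a time
    zpool add tank mirror /dev/ada4 /dev/ada5
    zpool status tank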

I'm not arguing for the article, just pointing out your error. The article is very flawed and gives bad advice - not because the conclusion is bad per se, but because the reasoning fails at basic mathematical analysis. There certainly are valid reasons to use mirrored vdevs; one is that you can more easily expand your pool incrementally.

Anyway, from personal experience, my thought is that home users should give up on the idea of "incremental expansion". Double up each time instead. I'm surprised that you want to say home users aren't going to use 8-drive setups, while at the same time decrying mirrors in favor of parity. If you have a low drive count, the storage efficiency of parity is close to that of mirrors anyway. A raidz2 with 4 drives, as you suggest, makes almost no sense vs mirrors.

Good point. Though the "internet wisdom" nowadays is to not add vdevs to a pool that has already seen use (in this scenario there will be a lot of use), since the pool will be very unbalanced and there is no way to rebalance it. This also negates many of the benefits of mirrors over raidz.

But that advice might not be relevant for the use-cases discussed here, something to keep in mind.

I meant that home-users don't have to be limited by 8-drive setups.

Your mirrors can be more than 2 wide, you can have automatic hot standbys, you can make each mirror its own zpool so loss of some drives doesn't lose the entire pool.
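A rough sketch of what that looks like, assuming made-up device names (a 3-way mirror vdev plus a hot spare, with automatic replacement turned on):

    zpool create tank mirror /dev/ada0 /dev/ada1 /dev/ada2 spare /dev/ada3
    zpool set autoreplace=on tank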

Actually, losing a mirror vdev hasn't meant losing the whole pool for several years now. I've recovered data off a zpool where I lost one of three mirrored vdevs. It's not pretty, but your data is still there. Any files on the missing drives just come back as 0 bytes.

> Of course! Yet for home users there is not a single reasonable way to do it if you have a decently sized pool. There are bound to be sacrifices in what you choose to back up.

It's $130 for an 8TB USB drive at Best Buy. If you don't mirror your backup, that's $16/TB cost, which is quite reasonable.


You can add mirrors to increase the pool size. More RAM will speed up the pool, but cache drives also works really well. And you can have a hot spare.

> So every single time you lose any drive in a mirror setup you risk all the data on all drives in the entire pool.

Not true. If you lose all the drives in a mirror vdev, you lose the files on that vdev, not the entire pool.

More importantly, mirrors are not limited to two drives. You can have any number.

I use mirror vdevs with 4 drives.


TBH, for my 10-drive home setup I use a Z3 setup. I've had dead drives, but never more than 1 at a time. I do check them weekly, though, and if they show signs of issues I replace them on the spot. I rsync to a second unit in a different location but have (so far) never had to do a restore.


Wow, yeah, I just did the math again and they are! Tarsnap is 10x S3 prices? I know the service does a lot but I thought it was a small margin on top of plain object storage.


It's hard to price a backup service per byte when all the value comes from software. At the small end there's barely any profit, and at the big end the costs get ridiculous.

Yeah I don't begin to understand the economics. I was just surprised at the difference.

I naively thought storage is cheap so backups are too.

We (rsync.net) have several PB of raidz3 deployed all over the world.

We use conservatively sized (12-15 drive) vdevs and typically join 3 or 4 of those together to make a pool.

I can see getting nervous about raidz2 (sort of analogous to "raid6") after a drive failure ... but losing 4 drives out of 12 in a single raidz3 failure cascade is extremely improbable.

We all sleep quite well with this arrangement and have since we first migrated from UFS2 to ZFS in 2012.

Some background ...

We have a fairly robust drive-proofing procedure prior to deployment - I think we still use 'badblocks' and really beat them up for 5-7 days. It's not enough to merely not fail - we have to see perfect returns on the SMART diagnostics and zero complaints from FreeBSD.
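Not their actual procedure, but the general shape of such a burn-in might look like this (destructive - it wipes the drive; the device name is made up):

    # destructive full-surface write/read test, four patterns
    badblocks -wsv -b 4096 /dev/da2

    # extended SMART self-test, then review the attributes and error log
    smartctl -t long /dev/da2
    smartctl -a /dev/da2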

So we're starting with a known-good population of drives.

Then we monitor them very closely - again with both SMART and FreeBSD/ZFS system data and we are fairly aggressive about failing them out early if we see them start to misbehave.

So we don't see things like straight up drive failures - the bad ones have already been rejected and drives that misbehave are culled out.

The last time I got pulled in to make a judgement call (and got a little nervous) was (IIRC) when we had a 15-drive vdev with a drive failure plus a candidate for early removal that had started to misbehave, and the choice was made to yank them both - because why do two big long resilvers ... and then during the subsequent resilver a third drive, while not failing, started to spit out errors. The resilver completed and all was well, but we had a potential third failure that could have died during the resilver, and then we would have been running with no protection.

But that's why I love raidz3 - even if that had happened, we would have needed to lose yet another drive to lose the vdev (and the entire pool).

I think the actionable recommendations here are to burn in your drives when you get them because we do, indeed, find rejects in most batches. Also, pick an error threshold that is low and be disciplined about sticking to it - don't let drives spin out SMART errors sporadically for months ...

EDIT: Here is another thought and this goes back to pre-ZFS days and old-style RAID, etc. In normal operations we're aggressively failing out drives that misbehave BUT if you're in a marginal situation you need to flip that logic around - especially if the pool is in-use during resilver/rebuild.

If you've lost 2 drives in a raidz3 (or, say, 1 drive in a RAID6) that remaining protection drive even if it is failing is like gold. If you fail it out, it's gone forever and has no relevance to the pool/array - but if you keep it, even as it's dying, you can either limp along OR you can even offline the pool/array and send that last drive to recovery or clone it or whatever ... the point is, when an array goes sideways every single bit of parity, no matter how poorly behaved, should be treated like gold.

Agreed on the limping along. I survived a RAIDZ1 double drive failure exactly like that, one drive completely died and a second started acting up during the resilver. Resilver slowed way down but made it eventually.

All those drives got replaced with a RAIDZ2 pool.

Thank you for this.

Can you speak to the retention rate post-culling using your proofing? And do you have preferred manufacturers?

Also, do you bother to do "the trick" where you build vdevs across heterogeneous SKUs to reduce the risk of use-case-coupled failures?

I don't have real numbers for this but I feel like post-burn-in retention is very high - like 99%.

We do not have preferred manufacturers. Data suggests that Hitachis are better than Seagates, but I reject this kind of Ford vs. Chevy preference in hard drives - the "good drives" can flip immediately, with a new model or even a new revision of an existing model. I don't think it makes sense to try to pursue a particular manufacturer.

FWIW, we have bought a ton of seagates over the last four years and they have been fine.


I was running 3x 12 drive vdevs in raidz2 and write performance was terrible. We moved some data to a different machine and rebuilt as 18 mirrors. This was a long time ago (as in running on Solaris long time ago), so maybe things are better now.

Each 12 drive array would have an interesting physical block layout.

What was your ashift, and what was your physical disk raw sector size? With 10 data disks per 12-drive raidz2 vdev, ashift=9 gives you 5120-byte stripes and ashift=12 gives you 40960-byte stripes. Neither of those lines up well with ZFS's default 128K records.

Additionally, if this was a long time ago the pool would have defaulted to ashift=9, and if you added any 4K drives you could have had write amplification occurring.

It looks as if your drives were doing a lot of unnecessary wasted work.
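If you want to check this on your own pool, one way (pool and device names are made up):

    # show the ashift each vdev was created with
    zdb -C tank | grep ashift

    # force 4K-native alignment when creating a new pool
    zpool create -o ashift=12 tank raidz2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5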

This is a huge thing (called "slop space") that I only noticed after I had bought all my ZFS NAS parts!

It wastes a lot of capacity by default if the drive count is not ideal. IIRC, using a larger recordsize (1MB or above) and enabling compression almost solves the issue. I use the default 128K recordsize for a small-files zvol for performance, and a larger recordsize for the rest of the zvols.
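For a filesystem dataset that would look something like this (note recordsize applies to filesystem datasets; zvols use volblocksize instead; the dataset name is made up):

    zfs set recordsize=1M tank/media
    zfs set compression=lz4 tank/media
    zfs get recordsize,compression tank/media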

https://wintelguy.com/zfs-calc.pl


Slow performance is always my impression with RAIDZ. There's an obvious performance hit if you are coming from conventional RAID setups (like mdadm). I've even seen unbelievably slow speeds like 1 MiB/s in the middle of copying a large git repo (which has a lot of small files; repeated git pulls also create a lot of fragmentation). But my experience was based on early ZFS on Linux; perhaps a real BSD will perform better. Anyway, I knew ZFS's first goal is data safety, not performance, and that's why I could tolerate it...

RAIDZ gives roughly the IOPS of one drive per vdev. A mirror gives one drive of write IOPS per vdev, and some say two drives of IOPS for reads.

You went from 3 drives' worth of IOPS to 18.

All rsync.net infrastructure is, and has been, built solely on FreeBSD.

Since 2001 ... :)


I love the backup solution you guys provide with Borg, the pricing is amazing and the product has been rock solid. Any chance of getting similar "expert level" pricing for accounts using ZFS send | receive ?

zfs send enabled accounts are (relatively) expensive because we need to give you a full-blown VPN with a routable ipv4, etc.

If you can make do with an ipv6 address and if you have a reasonable quantity, email us ... perhaps we can work something out ...

I'm relatively new to ZFS, but I thought you can send/receive over ssh (much like rsync). What's the purpose of the VPN layer?

My use is relatively small, just a home user with less than a TB worth of data. Definitely not worthwhile for your team to set up as a one off.

It is always worth writing an email to info@rsync.net to work out an acceptable offer. I did, and got a relatively good offer. I now have 2 TB at rsync.net.

rsync.net is not cheap, but you have to know what you get: ZFS with great redundancy and great support. All you have to do is use the service. You don't have to worry about maintenance.

I looked for, and didn't see, "SSD" in this article (let alone "NVMe"). Maybe because it's from 2015? But at any rate, I'm not sure the logic applies there. High performance SSDs remain much more expensive, so losing major capacity is a much costlier issue, and simultaneously they rebuild vastly faster. I thought about this when making a pool out of U.2 NVMe drives, and with rebuild times measured in minutes and given the cost/GB I think RAIDZ2 (or even Z1) vdevs are plenty sufficient for most use cases.

By the same token, what does the backup system and unique pool data lifetime look like? If someone is using a very fast/smaller/expensive pool as a local working space, but it's constantly being replicated to a much more heavily redundant pool of spinning rust in turn backing up sufficiently fast to remote, it may be perfectly acceptable to have minimal redundancy (I still like being able to heal from corruption) in the working pool. If the whole thing going kaput only means losing a few minutes of data it's totally reasonable to consider how much money that's actually worth.

I guess a lot of the blanket advice for ZFS rubs me the wrong way. It offers a very powerful toolbox full of options that are genuinely great in different circumstances, and there aren't many footguns (dedup being the biggest one that immediately comes to mind) that are hard to reason about. It's a shame if users aren't considering their own budgets, needs, hardware, and so on, and taking advantage of it to get the most out of their setup.

As always with RAID-style setups, there's an inevitable trade off of cost vs capacity vs performance.

There's still a place for RAIDZ/RAIDZ2, and in my opinion that place is storing bulk data that isn't too heavily accessed or that needs to be stored with an eye towards keeping £/GB down.

Yes, mirrors are faster. Yes, mirrors are easier to expand. But across 12 4TB disks that is 24TB instead of 40TB with RAIDZ2 - and that's a lot of capacity to lose if you're on a budget.

The rebuild times in this post seem high to me, though. I replaced 7x 2TB disks (nearly full) in a raidz1 in a backup pool with larger drives in about 30 hours.

> The rebuild times in this post seem high to me, though.

Two factors. Disks are getting large, and the rebuild time for RAID-Z[1] is dependent on the fragmentation. In combination it means it can take ages.

I just had to replace a failing WD Red 3TB[2] in an old 4xRAID-Z1 pool, and it took 9 hours. That was a single 3TB disk. The disks in my new pool are 14TB and 16TB.

[1]: https://youtu.be/Efl0Kv_hXwY

[2]: power-on hours in SMART showed over 7 years

OpenZFS 2.0 has sequential rebuilds:

> The sequential reconstruction feature adds a more traditional RAID rebuild mechanism to ZFS. Specifically, it allows for mirror vdevs to be rebuilt in LBA order. Depending on the pools average block size, overall fragmentation, and the performance characteristics of the devices (SMR) sequential reconstruction can restore redundancy in less time than a traditional healing resilver. However, it cannot verify block checksums as part of the rebuild. Therefore a scrub is automatically started when the last active sequential resilver completes.

* https://github.com/openzfs/zfs/pull/10349

The one thing in the article I agree with without reservation though is that you should always have a backup of your pool!

ZFS makes doing good backups easy, with zfs send | zfs receive.
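A minimal sketch of that over SSH, assuming made-up dataset and host names (-R sends the whole dataset tree with its properties, -I sends everything between two snapshots, and receive -u avoids mounting on the backup side):

    # initial full replication
    zfs snapshot -r tank/personal@2020-12-01
    zfs send -R tank/personal@2020-12-01 | ssh backuphost zfs receive -u backup/personal

    # later: incremental replication between snapshots
    zfs snapshot -r tank/personal@2020-12-08
    zfs send -R -I tank/personal@2020-12-01 tank/personal@2020-12-08 | \
        ssh backuphost zfs receive -u backup/personal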

I mean, is this controversial? The problem for a budget conscious home user is: where do you back up to? Doing a complete backup to a remote storage provider isn't a practical option for many users, which leaves the affordable options being to only back up critical data and/or use something like a big USB external drive.

Neither of these options are great, so in reality many people end up relying on RAID or mirroring to provide some measure of redundancy regardless of whether or not it's a good idea.

I have a raidz2 that I keep everything on: (A) Personal, (B) core media, and (C) all other media.

I back up A and B on a workstation drive, and I further back up A to S3 Glacier for $1/month/100GB, syncing every few months.

It isn't all or nothing, 3 copies vs. a striped pool. Mirrors are fine if you have quick access to fix dead drives, but I like the idea of being able to lose any two drives vs. only one drive in a given mirror. And while I currently have a 4-drive RAIDZ2 (same disk utilization as mirrors) I wish I had a 4+2 array, which I'll probably build next year.

Yeah, this is basically what I do as well, I didn't mean to imply it is all or nothing. The practical solution is tiered backup of the stuff you really care about + some local fault tolerance for stuff you don't care about as much.

What I wanted to point out is that of course backing up everything is the solution per the parent comment, but the reality is that costs for punting and mindlessly backing up say 10 TB of data properly are still fairly high. So you pick and choose the important stuff and do the best you can for the rest.

I just started playing around with s3blkdevd. S3-compatible storage at backblaze B2 is $0.005/GB-month and I'll see if I can manage a vdev made out of a huge nbd disk. Topical to the article, backblaze has piles of RAID6 and appears to be happy with it.

Another thought has been to find a ZFS-friend in another part of the world and exchange snapshots. Inline encryption makes that a little more viable now.

Right now I'm sending incremental snapshots to AWS Glacier and have another local box mirroring the data (mostly to fully verify the snapshots before sending them to Glacier).


There's also the new "special" vdev feature (which allows you to segregate metadata onto separate SSD mirrors), which should further speed up the remaining unavoidable random reads during scrub/resilver (for metadata traversal).
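A sketch of adding one, with made-up device names (note the special vdev holds pool-critical metadata, so it should be at least as redundant as the rest of the pool):

    # mirrored special vdev for metadata (OpenZFS 0.8+)
    zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

    # optionally route small file blocks there too
    zfs set special_small_blocks=32K tank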

My gripe with this is that the author assumes everyone has the same workload and prefers the same set of tradeoffs. I use raidz1 on my home desktop precisely because I would much rather wait out a resilver than lose data.

"So backup your data!" - of course, but that's just an implicitly larger pool.


TFA suggests using mirrors instead of RAIDZ. There is no case in which RAIDZ1 is more durable against data loss than a mirror is.


Doesn't that ignore the fact that RAIDZ1 gives you more storage space than a mirror?

I use 10x8TB in RAIDZ2 in my home server: Time Machine backups for 6 people, Docker volumes, and an excessively huge media collection.

The TimeMachine datasets are backed up offsite.

Losing this pool would be a PITA, but not critical.

My primary goal with ZFS is some data redundancy. At a good cost. And quick remote backup for a fraction of the pool. Not performance.

At one point, 2 disks died within 2 days. While there was some panic involved, the data on the server could be reproduced with some time.

There isn't a best solution, that fits all needs. If there was, ZFS wouldn't offer all the options it does.

This article is pretty hand-wavy. It doesn't give any empirical numbers at all.

I've had a small FreeNAS server using 4 x 3TB SATA drives in a mirrored config for years now and it's out of space. I'm about to build a new server using used 10 x 3TB SAS drives and intend to put all ten disks into a RAIDZ2 vdev. I care more about space than performance or rebuild times. Before I load it with data, I'll do some testing of read/write performance and rebuilding times. If they're unacceptable, I'll try two smaller RAIDZ vdevs and if that still doesn't work, I'll go back to mirrors.

The article is another entry in a long series of bad ZFS articles.

For some reason a lot of people get to a point where they're comfortable with it and suddenly their use case is everyone's, they've become an expert, and you should Just Do What They Say.

I highly recommend people ignore articles like this. ZFS is very flexible, and can serve a variety of workloads. It also assumes you know what you're doing, and the tradeoffs are not always apparent up-front.

If you want to become comfortable enough with ZFS to make your own choices, I recommend standing up your ZFS box well before you need it, and playing with it. Set up configs you'd never use in production, just to see what happens. Yank a disk, figure out how to recover. If you have time, fill it up with garbage and see for yourself how fragmentation affects resilver times. If you're a serious user, join the mailing lists - they are high-signal.
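For the "yank a disk" exercise, the recovery usually boils down to something like this (pool and device names are made up):

    # see which vdev is degraded and which disk faulted
    zpool status -v tank

    # swap in a replacement and watch the resilver run
    zpool replace tank /dev/ada3 /dev/ada7
    zpool status tank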

And value random articles on the interwebs telling you the Real Way at what they cost you.

I'm convinced articles like this are a big part of what gives ZFS a bad name. People follow authoritative-sounding bad advice and blame their results on the file system.

bang-on.

This article should be thrown in the trash.

The EXAMPLES for raidz, raidz2 and raidz3 in the article are misaligned. They will obviously have terrible performance and waste space, because blocks don't fit nicely given the non-standard physical block sizes.

Mirrors are easy because you don't have to worry about ashift values, device physical block sizes, or making sure your stripes divide nicely into 128K records.

I get the feeling that the author of the article has never actually run a properly configured raidz(x) with SLOG and cache devices.

I strongly disagree with this old blogpost.

I feel that this advice is a somewhat dishonest attempt to plaster over the fact that you can't expand a VDEV.

https://louwrentius.com/the-hidden-cost-of-using-zfs-for-you...

So you try to bury that fact by promoting mirrors. But mirrors aren't as safe as RAIDZ2 and they aren't as space-efficient.

It all depends on circumstances, but if you want to store a ton of data, RAIDZ(2|3) seems the right way to go.

Use RAIDZ(2|3) vdevs, not mirrors.

No thanks. I use raidz2.

With mirror vdevs if you lose the wrong 2 drives you lose everything. I can lose any 2 drives and be totally fine.

The probability of losing the wrong 2 drives at once is small, sure. But I would rather just not care about that probability. And I don't lose half my capacity, which for a home user (I don't have an unlimited budget!) matters a whole lot more than having the absolute best iops.

The original article already included a reply to this question.

> But wait, why would I want to trade guaranteed two disk failure in RAIDZ2 with only 85.7% survival of two disk failure in a pool of mirrors? Because of the drastically shorter time to resilver, and drastically lower load placed on the pool while doing so. The only disk more heavily loaded than usual during a mirror vdev resilvering is the other disk in the vdev – which might sound bad, but remember that it's no more heavily loaded than it would've been as a RAIDZ member. Each block resilvered on a RAIDZ vdev requires a block to be read from each surviving RAIDZ member; each block written to a resilvering mirror only requires one block to be read from a surviving vdev member. For a six-disk RAIDZ1 vs a six disk pool of mirrors, that's five times the extra I/O demands required of the surviving disks.

I think it's perfectly okay to disagree. But a comment must contain a counterargument to be useful - for example, one could argue that load is not an issue in a small array, and I might happily accept the other side of the argument. However, your comment doesn't include any counterargument, so it's not useful; please read the article more carefully next time.


It's not a question, it's a statement and my opinion on the matter. I'd respectfully point out that your comment isn't helpful at all, nor a counterargument, and is therefore useless as well. Please read my comment more carefully next time.


You don't lose everything, only the files that were on that vdev. I've been through this. It's not fun, though, and your pool is irreversibly damaged, but your data is not all lost.


One problem with mirror vdevs vs. RAIDZ2: in the RAIDZ2 case you can lose any two drives and still have your data; in the mirror vdev case, 2 drives failing in the same vdev means all your data is gone. You could potentially be okay with losing half your drives, but only if all the failed drives happen to be in different vdevs.


"Use RAID10" is basically the storage equivalent of "use paper ballots". Sure, it feels suboptimal, but critically your intuition on how it can fail (mostly) works. That's a really nice property to have for storage.

As long as you aren't relying on it as backup, it doesn't matter for most use cases.

I'm about to build a zpool consisting of nothing but 3-wide raidz1 vdevs. I can tolerate one drive dying. In the ~8 years or so I've been running a NAS, I've had precisely one drive failure. I am fully aware that survivorship bias is a thing, and anecdotes aren't data, but it's good enough for me.

Anything important is backed up locally and to the cloud. Everything else is merely annoying to have to download again.

I disagree.

With co-located boxes and drop-shipped drive replacements, the time between a FAULT and the resilver can be multiple days. Even though a resilver goes faster on a mirror vdev with one disk remaining than on a raidz2 (or higher), mirrors still increase the risk of data loss irrespective of resilver times, because of drop-ship drive replacement time.

3TB resilver on my last mechanical drive failure took 6 hours 30 minutes. Plus an additional 3 days for the drive to arrive.

With mirror vdev setups you lose significantly more space as well. If you argue speed is worth it, I would instead invest the money saved by going with a raidz2 into an NVMe cache and SLOG.

Users won't notice the resilver event at all with a significant amount of memory and an NVMe cache + NVMe SLOG, tuned with a high /sys/module/zfs/parameters/zfs_dirty_data_max and a larger-than-default /sys/module/zfs/parameters/zfs_txg_timeout.
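For reference, those knobs look like this on ZFS on Linux (the numbers are placeholders, not recommendations - tune for your own hardware):

    # runtime
    echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max   # 4 GiB of dirty data
    echo 15 > /sys/module/zfs/parameters/zfs_txg_timeout              # seconds between txg syncs

    # persistent, in /etc/modprobe.d/zfs.conf:
    # options zfs zfs_dirty_data_max=4294967296 zfs_txg_timeout=15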

I've been using ZFS for a few years now and it's just been amazing. Storage is so cheap these days that it's just much simpler and straight-forward to use mirrored vdevs. If a single drive fails, the pool is still completely usable, and all I have to do is swap out the bad drive (when I get around to it, no hurry usually) and resilver. Resilvering can take a while, but everything is still completely usable while it's happening so it just runs in the background and I don't even notice.

I just upgraded a pool of 4 drives from 4TB each to 10TB - I'd never done anything like that before and was rather nervous, but it was just so simple with the mirrored setup: just swap out each drive and resilver, one by one.
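For anyone wanting to do the same, the per-disk loop is roughly this, with made-up device names (with autoexpand=on the pool grows once the last disk in the vdev has been replaced):

    zpool set autoexpand=on tank

    # for each disk in turn: replace it, then wait for the resilver to finish
    zpool replace tank /dev/ada0 /dev/ada4
    zpool status tank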

I would like to see an update of this for SSDs. The devices are black boxes of mystery, quite possibly likely to fail nearly simultaneously (at least in terms of being unable to write). Early on we left 20% of capacity unformatted to allow more wear leveling, but running them in RAID1 seemed both necessary and likely to cause synchronous errors. RAIDZ1 over three disks might be okay. Obviously there's huge R/W bandwidth, so resilvering doesn't take long.

Also, it isn't mentioned, but having the ZIL on battery-backed flash can give huge improvements in IOPS to anything, and is far more valuable than extra TBs or spindles.


Battery-backed flash sounds like a huge PITA to deal with, though. Where do you even get that?


I had my pitchfork out because I really like my raidz1 setup, but the arguments for mirrors are very good. When I set my pool up, I chose raidz1 because, compared to mirroring, I still get 37.5% more space, which is huge if the budget is constrained. But if you feel paranoid and raidz3 begins to sound good, that 12.5% extra space doesn't seem worth the extra complexity over a much more straightforward mirror.

There's no planet on which I would trust a mirror with 10TB+ SATA drives. There's a reason the major storage vendors already have, or are working on, 3-disk parity for large NL-SAS/SATA drives.

Give me RAID-Z2 or RAID-Z3 with dRAID all day long (although I wouldn't deploy dRAID quite yet on production workloads).

Every time I make this argument (whether regular RAID or ZFS), everyone pulls out their pitchforks and tells me about how they have run RAID 5 or 6 for years and never had a problem, plus "I can have two drives fail! You would lose your array if the wrong two drives fail!"

But drive failure is much more likely during an operation like a resilver/resync, and I wonder if the majority of people espousing riskier RAID setups (especially single parity) are those who just want hundreds of TB for a growing media collection.

I know for my critical data, I don't trust parity. Plus I can't afford the days or weeks long resilver operation, I need a working storage array that doesn't suffer drastically in performance if one drive goes bad.

People act like resilvers on parity are vastly different than on mirrors. They are not; it's just reads and writes, granted in a different pattern. My RAIDZ2 resilvers at about what I'd expect the old drives in it to write at, ~100MB/sec. In a mirrored setup it all depends on the one drive doing the reading surviving the process - the drive that is most likely the same age as the one that just failed.

A resilver isn't more likely to kill an existing drive than a scrub is. And the general advice is to do those semi-regularly.

A blanket statement equating parity RAID with "more risky" is absurd, and leads to people thinking they need a massive pool of mirrors just for their home files. ZFS has numerous options to prioritize different IO higher or lower (priority to resilvers or to end-user data).


I then tell a real horror story from my past: not so long ago there was an EVA (a SAN from HP) whose hardware RAID started freaking out, out of nowhere (uptime about 200 days, no firmware update beforehand, nothing). The story goes on, but the lesson... no more HW RAID/proprietary stuff for me EVER again.

None for me either, but I used to look down on HW RAID etc. Lately I've realized such solutions are genuinely useful and more likely to survive/work properly if the people managing them do not understand the technology, don't want to understand it, or just don't have the time for it.

HW RAID with red blinking lights seems more likely to survive than a SW RAID that works until it doesn't, because you've lost all the redundancy and no one there knows how either works.

>HW RAID with red blinking lights seems more likely to survive than a SW RAID that works until it doesn't, because you've lost all the redundancy and no one there knows how either works.

Well, that's if anyone ever sees your red blinking light at all, what with the new printer standing in front of the server :)

Critical data for the average user typically does not require much space, so the cost of storage overhead in a mirrored ZFS pool is not too bad.

RAIDZ is much more cost efficient when you need to reliably store large content that isn't really critical. As you've noted, media is one good example of this.

In my case, I'm using a RAIDZ1 vdev of 3 disks, with the understanding that things could go south during a resilver.


Yeah, true. In my home server I have one mirror (2 disks for OS/swap etc.) and one RAIDZ1 (5 disks for data); for me it's just important that I have no bit rot etc. Daily backups (restic) are made, so no problem whatsoever.


For big installations I do striped (sometimes mirrored) vdevs (2 disks) and on top of them RAIDZ2/3. Works fast as hell and reliably.


I recently built a ZFS raidz2 array with 4 disks. Is the article saying it's better to mirror over four vdevs (or 4 disks)?


I did the same thing as you. I've had to sit through an RMA period on replacing a drive in a home RAID5 and just never wanted anything like that again; every replacement of a disk in a mirrored vdev would feel the same. Data is backed up but it's a headache to have data unavailable.


Source: https://news.ycombinator.com/item?id=25358268
