Don't Make ZFS Re-Read What It Just Read
May 28, 2026, 5:24 p.m.
I have already mentioned one of my improvements to scrub in an earlier post, Teaching ZFS about time. In that post I also claimed that scrub can be expensive, and that we were looking for ways to limit the amount of data we actually want to scrub. A scrub on a multi-petabyte pool can run for days, and some of us want a smaller, easier check that everything is still going right.
For people who scrub on a schedule - say once a month - most of those days are spent re-reading blocks that the previous scrub already verified, and that nothing has touched since. Of course, other bad things can happen, like physical degradation of a disk, but sometimes we can assume that "the disks were perfectly happy then, and the blocks are not rabbits - they have not hopped away". Under that assumption, the only blocks worth re-reading are the ones written since the last scrub finished.
So, before working on crrd, I added another small fix:
Remember the TXG up to which the last scrub finished. Next time, scrub from there.
The reason it took a while to land is less the idea and more the bookkeeping around it.
A scrub can complete, pause, be cancelled, hit an error, or be replaced by a resilver halfway through.
Only one of those outcomes - "completed normally" - means the data is actually verified up to some point.
Everything else has to leave the saved TXG alone, or the next incremental scrub (zpool scrub -C, introduced below) will happily skip blocks that were never checked.
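The rule fits in a few lines. This is only a sketch of the decision, not the code that landed: the enum, its members, and next_saved_txg() are hypothetical names I made up for illustration.

```c
#include <stdint.h>

/* Hypothetical outcome codes; these are not the names OpenZFS uses. */
typedef enum {
	SCRUB_COMPLETED,
	SCRUB_PAUSED,
	SCRUB_CANCELED,
	SCRUB_ERRORED,
	SCRUB_REPLACED_BY_RESILVER
} scrub_outcome_t;

/*
 * Decide which TXG to keep once a scrub ends.
 *   saved       - TXG recorded by the last scrub that completed normally
 *   scrubbed_to - upper TXG bound of the scrub that just ended
 */
uint64_t
next_saved_txg(uint64_t saved, uint64_t scrubbed_to, scrub_outcome_t outcome)
{
	/* Anything short of a normal completion verified nothing for sure. */
	if (outcome != SCRUB_COMPLETED)
		return (saved);
	return (scrubbed_to);
}
```

Putting the "leave it alone" case first makes the safe behaviour the default: a new way for a scrub to end has to be added explicitly before it can move the saved TXG.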
The nice part is that the scrub code already knew how to walk a range.
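When kicking off a scrub in the kernel, we can specify scn_min_txg and scn_max_txg, which let us skip the Merkle tree branches we do not want to walk.
Because ZFS is copy-on-write, a block pointer is never older than anything beneath it, so a branch born at or before scn_min_txg can be skipped without descending into it.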
So the only piece missing was persistence: somewhere to remember that number across reboots and pool imports.
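To make the pruning concrete, here is a toy model of a range-limited walk. It is a sketch under simplifying assumptions: toy_block_t and toy_scrub_range() are invented for this post, and a real blkptr_t and the real scan code carry far more state.

```c
#include <stddef.h>
#include <stdint.h>

#define	TOY_MAX_CHILDREN	4

/* Toy stand-in for a block pointer; the real blkptr_t is much richer. */
typedef struct toy_block {
	uint64_t birth_txg;	/* TXG in which this block was written */
	size_t nchildren;	/* 0 for a data (leaf) block */
	const struct toy_block *children[TOY_MAX_CHILDREN];
} toy_block_t;

/*
 * Count the leaf blocks with min_txg < birth_txg <= max_txg.
 * A parent is never older than its children, so a subtree whose root was
 * born at or before min_txg holds nothing new and is skipped whole.
 * The upper bound can only be applied at the leaves: a parent newer than
 * max_txg may still have older children that fall inside the range.
 */
size_t
toy_scrub_range(const toy_block_t *bp, uint64_t min_txg, uint64_t max_txg)
{
	if (bp == NULL || bp->birth_txg <= min_txg)
		return (0);
	if (bp->nchildren == 0)
		return (bp->birth_txg <= max_txg ? 1 : 0);

	size_t verified = 0;
	for (size_t i = 0; i < bp->nchildren; i++)
		verified += toy_scrub_range(bp->children[i], min_txg, max_txg);
	return (verified);
}
```

With min_txg set to the saved value, whole branches of old data cost one comparison each instead of a read per block.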
The on-disk side is small.
We add a single ZAP entry under the MOS directory, holding one uint64_t:
#define DMU_POOL_LAST_SCRUBBED_TXG "last_scrubbed_txg"
And one matching field on the in-memory spa_t:
struct spa {
	...
	uint64_t spa_scrubbed_last_txg; /* last txg scrubbed */
	...
};
On pool load, we pull the value out of the MOS into spa_scrubbed_last_txg.
On successful scrub completion, we write it back.
That is the entire persistence story, and it intentionally stays that boring - no new feature flag, no on-disk format bump, just one ZAP entry that older code does not know to look for.
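The load and store paths can be sketched with a toy name-to-integer table standing in for the MOS directory. Everything prefixed toy_ is hypothetical and exists only so the example compiles on its own; the kernel goes through the ZAP interfaces and a transaction instead. The part worth noticing is that a missing entry is treated as 0 rather than as an error.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define	DMU_POOL_LAST_SCRUBBED_TXG	"last_scrubbed_txg"
#define	TOY_DIR_SLOTS			8

/* Toy name -> uint64_t directory standing in for the MOS ZAP object. */
typedef struct toy_dir {
	const char *names[TOY_DIR_SLOTS];
	uint64_t values[TOY_DIR_SLOTS];
	int used;
} toy_dir_t;

/* Look up a name; returns 0 on success, ENOENT if the entry is absent. */
int
toy_dir_lookup(const toy_dir_t *dir, const char *name, uint64_t *out)
{
	for (int i = 0; i < dir->used; i++) {
		if (strcmp(dir->names[i], name) == 0) {
			*out = dir->values[i];
			return (0);
		}
	}
	return (ENOENT);
}

/* Insert or overwrite a name; returns ENOSPC when the toy table is full. */
int
toy_dir_update(toy_dir_t *dir, const char *name, uint64_t value)
{
	for (int i = 0; i < dir->used; i++) {
		if (strcmp(dir->names[i], name) == 0) {
			dir->values[i] = value;
			return (0);
		}
	}
	if (dir->used == TOY_DIR_SLOTS)
		return (ENOSPC);
	dir->names[dir->used] = name;
	dir->values[dir->used] = value;
	dir->used++;
	return (0);
}

/* Pool load: no entry simply means "never scrubbed", so report 0. */
uint64_t
toy_load_last_scrubbed_txg(const toy_dir_t *dir)
{
	uint64_t txg = 0;

	if (toy_dir_lookup(dir, DMU_POOL_LAST_SCRUBBED_TXG, &txg) != 0)
		txg = 0;
	return (txg);
}

/* Successful scrub completion: write the new value back. */
int
toy_store_last_scrubbed_txg(toy_dir_t *dir, uint64_t txg)
{
	return (toy_dir_update(dir, DMU_POOL_LAST_SCRUBBED_TXG, txg));
}
```

Treating "not found" as zero is what lets a pool that has never seen this entry behave exactly like a freshly created one.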
The user-facing side is two pieces.
First, a new read-only pool property so you can see what is stored:
$ zpool get last_scrubbed_txg tank
NAME PROPERTY VALUE SOURCE
tank last_scrubbed_txg 53830995 -
Zero means "no scrub has ever finished on this pool since the feature became available", which is the same state a freshly created pool starts in.
Second, a new flag for zpool scrub:
# Scrub only data written since the last successful scrub.
$ zpool scrub -C tank
# Same thing, but wait for it.
$ zpool scrub -w -C tank
-C is mutually exclusive with -s (stop), -p (pause), and -e (error scrub).
The first two are obvious - you cannot continue and stop in the same breath.
The third is a little subtler: -e only revisits the blocks listed in the error log rather than walking the whole pool, so combining it with -C would be asking the kernel two contradictory questions at once.
That combination is not supported today, but maybe somewhere down the road.
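As a rough illustration of the -C rule, here is how such a check could look. The scrub_opts_t struct and scrub_opts_valid() are hypothetical; zpool's real option parsing is organized differently, and the other restrictions it enforces are not modeled here.

```c
#include <stdbool.h>

/* Hypothetical parsed options; one flag per zpool scrub switch. */
typedef struct scrub_opts {
	bool stop;	/* -s */
	bool pause;	/* -p */
	bool errors;	/* -e */
	bool from_last;	/* -C */
	bool wait;	/* -w */
} scrub_opts_t;

/* Return true when the combination passes the -C exclusivity rule. */
bool
scrub_opts_valid(const scrub_opts_t *o)
{
	/* -C cannot be combined with -s, -p, or -e. */
	if (o->from_last && (o->stop || o->pause || o->errors))
		return (false);
	/* -w is fine either way, as in "zpool scrub -w -C tank". */
	return (true);
}
```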
And of course, the feature only helps for scrubs that complete after it lands.
The first scrub on an upgraded pool still has to do the full sweep, because last_scrubbed_txg is 0 and -C would degenerate into a full scrub anyway.
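The degenerate case falls out of how the lower bound is chosen. A minimal sketch, with scrub_min_txg() being a hypothetical helper rather than a function from the tree:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the lower TXG bound for a new scrub.
 *   from_last         - true when the user asked for -C
 *   last_scrubbed_txg - value loaded from the pool, 0 if never scrubbed
 */
uint64_t
scrub_min_txg(bool from_last, uint64_t last_scrubbed_txg)
{
	/* A bound of 0 excludes nothing, which is a full scrub. */
	return (from_last ? last_scrubbed_txg : 0);
}
```

With last_scrubbed_txg at 0, both branches return the same bound, so the first -C scrub and a plain scrub cover the same blocks.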
This shipped as openzfs/zfs#16301, merged as commit 4b4e346, and is in OpenZFS 2.4.0.
The whole change is, code-wise, very small - one ZAP entry, one in-memory field, one extra branch in the ioctl, and one new CLI flag. That is also the nicest thing about it. The scrub machinery already knew how to walk a TXG range; it just needed someone to remember where it left off.
This development effort was sponsored by Wasabi Technologies, Inc. and Klara, Inc.