ZFS 2.3 released with ZFS raidz expansion
I just don't get how the Windows world - by far the largest PC platform per userbase - still doesn't have any answer to ZFS. Microsoft had WinFS and then ReFS, but it's on the backburner, and while there is active development (Win11 ships some bits from time to time), a release is nowhere in sight. There are some lone warriors attempting the giant task of creating a ZFS compatibility layer in a few projects, but these are far from being mature/usable.
How come that Windows still uses a 32 year old file system?
> How come that Windows still uses a 32 year old file system?
Simple. Because most of the burden is taken by the (enterprise) storage hardware hosting the FS. Snapshots, block level deduplication, object storage technologies, RAID/Resiliency, size changes, you name it.
Modern storage appliances are black magic, and you don't need much more features from NTFS. You either transparently access via NAS/SAN or store your NTFS volumes on capable disk boxes.
In the Linux world, at the higher end, there's Lustre and GPFS. ZFS is mostly for resilient, but not performance critical needs.
>ZFS is mostly for resilient, but not performance critical needs.
Los Alamos disagrees ;)
https://www.lanl.gov/media/news/0321-computational-storage
But yes, in general you are right, Cern for example uses Ceph:
https://indico.cern.ch/event/1457076/attachments/2934445/515...
I think what LLNL did predates GPUDirect and other new technologies came after 2022, but that's a good start.
CERN's Ceph is also for their "General IT" needs. Their clusters are independent from that. Also, most of CERN's processing is distributed across Europe. We are part of that network.
Many, if not all, of the HPC centers we talk with use Lustre as their "immediate" storage. Also, there's Weka now, a closed source storage system supporting insane speeds and tons of protocols at the same time. Mostly used for and by GPU clusters around the world. You connect terabits to that cluster casually. It's all flash, and flat out fast.
Did you confuse LANL for LLNL?
It's just a typo, not a confusion, and I'm well beyond the edit window.
So private consumers should just pay cloud subscription if they want safer/modern data storage for their PC? (without NAS)
No, private consumers have a choice, since Linux and FreeBSD run well on their hardware. Microsoft is too busy shoveling their crappy AI and convincing OEMs to put a second Windows button (the CoPilot button) on their keyboards.
Probably. There are levels of backups, and a cloud subscription SHOULD give you copies in geographically separate locations with someone to help you (who probably isn't into computers and doesn't want to learn the complex details) restore when (NOT IF!) needed.
I have all my backups on a NAS in the next room. This covers the vast majority of use cases for backups, but if my house burns down everything is lost. I know I'm taking that risk, but really I should have better. Just paying someone to do it all in the cloud should be better for me as well and I keep thinking I should do this.
Of course paying someone assumes they will do their job. There are always incompetent companies out there to take your money.
My setup is similar to yours, but I also distribute my most important data in compressed (<5GB) encrypted backups to several free-tier cloud storage accounts. I could restore it by copying one key and running one script.
I lost faith in most paid operators. Whoops, this thing that absolutely can happen to home users and we're supposed to protect them from now actually happened to us and we were not prepared. We're so sorry!
Nah. Give me access to 5-15 cloud storage accounts, I'll handle it myself. Have done so for years.
If you need Windows, you can use something like restic (checksums and compression) and external drives (more than one, stored in more than one place) to make a backup. Optionally (it is not required), you can also use ReFS on your non-Windows partition; it is included in the Workstation/Enterprise editions of Windows.
I trust my own backups much more than any subscription, not essentially from a technical point of view, but from an access point of view (e.g. losing access to your Google account).
EDIT: You have to enable check-summing and/or compression for data on ReFS manually
https://learn.microsoft.com/en-us/windows-server/storage/ref...
> I trust my own backups much more than any subscription, not from a technical standpoint but from an access one (for example, losing access to your google account).
I personally use cloud storage extensively, but I keep a local version with periodic rclone/borg. It allows me access from everywhere and sleep well at night.
NTFS has Volume Shadow Copy, which is "good enough" for private users if they want to create image backups while their system is running.
First of all, that's not a backup, that's a snapshot, and NO, that's not "good enough", tell your grandma that all her digitised pictures are gone because her hard drive exploded, or that one most important jpeg is now unwatchable because of bitrot.
Just because someone is a private user doesn't mean that the data is less important, often it's quite the opposite, for example a family album vs your cloned git repository.
... VSS is used to create backups. Re-read parent.
Not good enough. You can make 10000 backups of bitrotten data; if you don't have check-sums on your block (zfs) or files (restic) nothing can help you. That's the same integrity as copying stuff onto your thumb drive.
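To make the file-level checksum idea concrete, here is a minimal sketch of a SHA-256 manifest check. This is not how restic or ZFS implement it; the function names and the demo files are made up for illustration, and a real tool would also need to tell deliberate edits apart from silent corruption.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that are missing or whose content no longer matches."""
    damaged = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(rel)
    return damaged


def demo() -> list[str]:
    """Hypothetical demo: record digests, flip one bit, then re-verify."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "album.jpg").write_bytes(b"\xff\xd8 family photo bytes")
        (root / "notes.txt").write_bytes(b"unchanged")
        manifest = build_manifest(root)
        # Simulate bitrot: flip a single bit in the stored file.
        data = bytearray((root / "album.jpg").read_bytes())
        data[3] ^= 0x01
        (root / "album.jpg").write_bytes(bytes(data))
        return find_damaged(root, manifest)
```

Without the stored digests, the flipped bit would simply be copied into every later backup unnoticed, which is the point being made above.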
The same applies to those filesystems on Linux which don't check for bit-rottenness, which will be the majority of installs.
Your average grandma would use ext4 when using Linux. Android phones don't do that either, and I don't know about iOS, but apparently APFS only does metadata checksumming.
>> if you don't have check-sums on your block (zfs) or files (restic)
....
I think Microsoft has discontinued Windows 7 backup to force people to buy OneDrive subscriptions. They also forcefully enabled the feature when they first introduced it.
So, I think that your answer for this question is "unfortunately, yes".
Not that I support the situation.
No, if they need ZFS-like function, they just pay for NAS.
ZFS is not in the same market with AWS S3.
Having a NAS is life-changing. Doesn't have to be some large 20-bay monstrosity, just something that will give you redundancy and has an ethernet jack.
To be honest, the situation with Linux is barely better.
ZFS has license issues with Linux, preventing full integration, and Btrfs is 15 years in the making and still doesn't match ZFS in features and stability.
Most Linux distros still use ext4 by default, which is 19 years old, but ext4 is little more than a series of extensions on top of ext2, which is the same age as NTFS.
In all fairness, there are few OS components that are as critical as the filesystem, and many wouldn't touch filesystems that have less than a decade of proven track record in production.
ZFS might be better than any other FS on Linux (I don't judge that).
But you must admit that the situation on Linux is quite a bit better than on Windows. Linux has so many FS in the main branch. There is a lot of development. BTRFS had a rocky start, but it got better.
I’m interested to know what ‘full integration’ does look like, I use ZFS in Proxmox (Debian-based) and it’s really great and super solid, but I haven’t used ZFS in more vanilla Linux distros. Does Proxmox have things that regular Linux is missing out on, or are there shortcomings and things I just don’t realise about Proxmox?
The difference is that the ZFS kernel module is included by default with Proxmox, whereas with e.g. Debian, you would need to install it manually.
And you can't follow the latest kernel before the ZFS module supports it.
There is a trick for this:
* Step 1: Make friends with a ZFS developer.
* Step 2: Guilt him into writing patches to add support as soon as a new kernel is released.
* Step 3: Enjoy
Adding support for a new kernel release to ZFS is usually only a few hours of work. I have done it in the past more than a dozen times.
Try CachyOS https://cachyos.org/ , you can even swap from an existing Arch installation:
https://wiki-dev.cachyos.org/sk/cachyos_repositories/how_to_...
I use NixOS, and it simply updates to the latest kernel that supports zfs, with a single, declarative option.
for Debian that's not exactly a problem
Unless you’re using Debian backports, and they backport a new kernel a week before the zfs backport package update happens.
Happened to me more than once. I ended up manually changing the kernel version limitations the second time just to get me back online, but I don’t recall if that ended up hurting me in the long run or not.
You probably don’t realise how important encryption is.
It’s still not supported by Proxmox, yes, you can do it yourself somehow but you are alone then and miss features and people report problems with double or triple file system layers.
I do not understand how they do not have encryption out of the box; this seems to be a problem.
I'm not sure about proxmox, but ZFS on Linux does have encryption.
as far as stability goes, btrfs is used by meta, synology and many others, so I wouldn't say it's not stable, but some features are lacking
My understanding is that single-disk btrfs is good, but raid is decidedly dodgy; https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5... states that:
> The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6.
> There are some implementation and design deficiencies that make it unreliable for some corner cases and *the feature should not be used in production, only for evaluation or testing*.
> The power failure safety for metadata with RAID56 is not 100%.
I have personally been bitten once (about 10 years ago) by btrfs just failing horribly on a single desktop drive. I've used either mdadm + ext4 (for /) or zfs (for large /data mounts) ever since. Zfs is fantastic and I genuinely don't understand why it's not used more widely.
One problem with your setup is that ZFS by design can't use a traditional *nix filesystem buffer cache. Instead it has to use its own ARC (adaptive replacement cache) with end-to-end checksumming, transparent compression, and copy-on-write semantics. This can lead to annoying performance problems when the two types of file system caches contest for available memory. There is a back pressure mechanism, but it effectively pauses other writes while evicting dirty cache entries to release memory.
Traditionally, you have the page cache on top of the FS and the buffer cache below the FS, with the two being unified such that double caching is avoided in traditional UNIX filesystems.
ZFS goes out of its way to avoid the buffer cache, although Linux does not give it the option to fully opt out of it since the block layer will buffer reads done by userland to disks underneath ZFS. That is why ZFS began to purge the buffer cache on every flush 11 years ago:
https://github.com/openzfs/zfs/commit/cecb7487fc8eea3508c3b6...
That is how it still works today:
https://github.com/openzfs/zfs/blob/fe44c5ae27993a8ff53f4cef...
If I recall correctly, the page cache is also still above ZFS when mmap() is used. There was talk about fixing it by having mmap() work out of ARC instead, but I don’t believe it was ever done, so there is technically double caching done there.
what's the best way to deal with this then? disable filecache of linux? I've tried disabling/minimizing arc in the past to avoid the oom reaper, but the arc was stubborn and its RAM usage remained as is
These days, ZFS frees memory fast enough when Linux requests memory to be freed that you generally do not see OOM because of ZFS, but if you have a workload where it is not fast enough, you can limit the maximum arc size to try to help:
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
I didn't have any trouble limiting zfs_arc_max to 3GB on one system where I felt that it was important. I ran it that way for a fair number of years and it always stayed close to that bound (if it was ever exceeded, it wasn't by a noteworthy amount at any time when I was looking).
At the time, I had it this way because I had fear of OOM events causing [at least] unexpected weirdness.
A few months ago I discovered weird issues with a fairly big, persistent L2ARC being ignored at boot due to insufficient ARC. So I stopped arbitrarily limiting zfs_arc_max and just let it do its default self-managed thing.
So far, no issues. For me. With my workload.
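For anyone who wants to check how the ARC behaves against a cap like the 3GB one above, here is a rough sketch. It assumes OpenZFS on Linux, where the ARC counters are exposed in /proc/spl/kstat/zfs/arcstats and the limit in /sys/module/zfs/parameters/zfs_arc_max; the helper names are invented for illustration, and "3GB" is treated as 3 GiB.

```python
from pathlib import Path

# Paths as exposed by OpenZFS on Linux (assumed; other platforms differ).
ARCSTATS = Path("/proc/spl/kstat/zfs/arcstats")
ARC_MAX_PARAM = Path("/sys/module/zfs/parameters/zfs_arc_max")


def gib(n: float) -> int:
    """Convert GiB to bytes, the unit zfs_arc_max expects."""
    return int(n * 1024**3)


def parse_arcstats(text: str) -> dict[str, int]:
    """Parse the 'name type data' rows of arcstats into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields and a numeric value;
        # this skips the kstat header row and the column-title row.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_report(text: str) -> str:
    """Summarise current ARC size against its configured maximum."""
    stats = parse_arcstats(text)
    size = stats["size"] / 1024**3
    c_max = stats["c_max"] / 1024**3
    return f"ARC size {size:.2f} GiB of max {c_max:.2f} GiB"


def set_arc_max(limit_bytes: int) -> None:
    """Apply the limit at runtime (needs root, does not survive a reboot)."""
    ARC_MAX_PARAM.write_text(str(limit_bytes))
```

On a live system that would be `print(arc_report(ARCSTATS.read_text()))` and, as root, `set_arc_max(gib(3))`. To keep the limit across reboots, the usual route is a line such as `options zfs zfs_arc_max=3221225472` in a file under /etc/modprobe.d/.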
Are you having issues with this, or is it a theoretical problem?
I was assuming OP wants to highlight filesystem use on a workstation/desktop, not for a file server/NAS. I had a similar experience a decade ago, but these days single drives just work, same with mirroring. For such setups btrfs should be stable. I've never seen a workstation with a raid5/6 setup. Secondly, filesystems and volume managers are something else, even if e.g. btrfs and ZFS are essentially both.
For a NAS setup I would still prefer ZFS with truenas scale (or proxmox if virtualization is needed), just because all these scenarios are supported as well. And as far as ZFS goes, encryption is still something I am not sure about especially since I want to use snapshots sending those as a backup to remote machine.
RAID5/6 is not needed with btrfs. One should use RAID1, which supports striping the same data onto multiple drives in a redundant way.
How can you achieve 2-disk fault tolerance using btrfs and RAID 1?
By using three drives.
RAID1 is just making literal copies, so each additional drive in a RAID1 is a self-sufficient copy. You want two drives of fault tolerance? Use three drives, so if you lose two copies you still have one left.
This is of course hideously inefficient as you scale larger, but that is not the question posed.
> This is of course hideously inefficient as you scale larger, but that is not the question posed.
It's not just inefficient, you literally can't scale larger. Mirroring is all that RAID 1 allows for. To scale, you'd have to switch to RAID 10, which doesn't allow two-disk fault tolerance (you can get lucky if they are in different stripes, but this isn't fault tolerance.)
But you're right - RAID 1 also scales terribly compared to RAID 6, even before introducing striping. Imagine you have 6 x 16 TB disks:
With RAID 6, usable space of 64 TB, two-drive fault tolerance.
With RAID 1, usable space of 16 TB, five-drive fault tolerance.
With RAID 10 (three striped two-way mirrors), usable space of 48 TB, one-drive fault tolerance.
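The arithmetic behind that comparison can be sketched as follows. It assumes equal-sized disks and the textbook layouts, ignores filesystem overhead, and counts only the number of failures that is guaranteed to be survivable; the function names are made up.

```python
def raid6(disks: int, size_tb: int) -> tuple[int, int]:
    """Two disks' worth of parity: (usable TB, guaranteed failures survived)."""
    if disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (disks - 2) * size_tb, 2


def raid1(disks: int, size_tb: int) -> tuple[int, int]:
    """Every disk holds a full copy, so capacity is that of a single disk."""
    if disks < 2:
        raise ValueError("RAID 1 needs at least 2 disks")
    return size_tb, disks - 1


def raid10(disks: int, size_tb: int, mirror_width: int = 2) -> tuple[int, int]:
    """Stripe across mirrors that each contain mirror_width disks."""
    if disks % mirror_width or disks < 2 * mirror_width:
        raise ValueError("disks must form at least two complete mirrors")
    # Losing every disk of one mirror loses the array, so only
    # mirror_width - 1 failures are guaranteed to be survivable.
    return (disks // mirror_width) * size_tb, mirror_width - 1


print(raid6(6, 16))      # (64, 2)
print(raid1(6, 16))      # (16, 5)
print(raid10(6, 16))     # (48, 1)
print(raid10(6, 16, 3))  # (32, 2)
```

With the usual two-way mirrors, six 16 TB disks in RAID 10 give 48 TB; a 32 TB figure corresponds to three-way mirrors, which would then tolerate any two failures.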
Btrfs did not support that until Linux 5.5, when it added RAID1c3. Instead of mirroring onto every member device, its RAID1 profile just stores 2 copies, no matter how many mirror members you have.
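A small sketch of what that means for capacity and fault tolerance, assuming equal-sized devices and ignoring metadata overhead. The profile names are the real btrfs ones; the helper itself is hypothetical.

```python
# Number of copies each btrfs mirror profile keeps of every block.
BTRFS_COPIES = {"raid1": 2, "raid1c3": 3, "raid1c4": 4}


def btrfs_mirror_profile(profile: str, disks: int, size_tb: float) -> tuple[float, int]:
    """Return (approximate usable TB, guaranteed device failures survived)."""
    copies = BTRFS_COPIES[profile]
    if disks < copies:
        raise ValueError(f"{profile} needs at least {copies} devices")
    # Each block is stored `copies` times, spread over all devices,
    # so adding devices adds capacity rather than extra redundancy.
    return disks * size_tb / copies, copies - 1


print(btrfs_mirror_profile("raid1", 3, 16))    # (24.0, 1)
print(btrfs_mirror_profile("raid1c3", 3, 16))  # (16.0, 2)
```

So three drives in plain btrfs RAID1 still only survive one failure; it takes RAID1c3 to get the two-drive tolerance asked about.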
> I have personally been bitten once (about 10 years ago) by btrfs just failing horribly on a single desktop drive.
Me, too. The drive was unrecoverable. I had to reinstall from scratch.
Licensing incompatibilities.
It is possible to corrupt the file system from user space as a normal user with Btrfs. The PostgreSQL devs found that when working on async IO. And as far as I know that issue has not been fixed.
https://www.postgresql.org/message-id/CA%2BhUKGL-sZrfwcdme8j...
LMDB users also unearthed a btrfs data corruption bug last year: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
I'm similar to some other people here: I guess once you've been bitten by data loss due to btrfs, it's difficult to advocate for it.
I am assuming almost everybody at some point experienced data loss because they pulled out a flash drive too early. Is it safe to assume that we stopped using flash drives because of it?
I'm not sure we have stopped using flash, judging by the pile of USB sticks on my desk :) In relation to the fs analogy if you used a flash drive that you know corrupted your data, you'd throw it away for one you know works.
I once purchased a bunch of flash drives from Google’s online swag store and just unplugging them was often enough to put them in a state where they claimed to be 8MB devices and nothing I wrote to them was ever possible to read back in my limited tests. I stopped using those fast.
Do Synology actually use the multi-device options of btrfs, or are they using linux softraid + lvm underneath?
I know Synology Hybrid RAID is a clever use of LVM + MD raid, for example.
I believe Synology runs btrfs on top of regular mdraid + lvm, possibly with patches to let btrfs checksum failures reach into the underlying layers to find the right data to recover.
Related blog post: https://daltondur.st/syno_btrfs_1/
> Btrfs [...] still doesn't match ZFS in features [...]
Isn't the feature in question (array expansion) precisely one which btrfs already had for a long time? Does ZFS have the opposite feature (shrinking the array), which AFAIK btrfs also already had for a long time?
(And there's one feature which is important to many, "being in the upstream Linux kernel", that ZFS most likely will never have.)
ZFS also had expansion for a long time but it was offline expansion. I don't know if btrfs has also had online for a long time?
And shrinking no, that is a big missing feature in ZFS IMO. Understandable considering its heritage (large scale datacenters) but nevertheless an issue for home use.
But raidz is rock-solid. Btrfs' raid is not.
Raidz wasn't able to be expanded in place before this. You were able to add to a pool that included a raidz vdev, but that raidz vdev was immutable.
Oh ok, I've never done this, but I thought it was already there. Maybe this was the original ZFS from Sun? But maybe I just remember it incorrectly, sorry.
I've used it on multi-drive arrays but I never had the need for expansion.
You could add top level raidz vdevs or replace the members of a raid-z vdev with larger disks to increase storage space back then. You still have those options now.
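For reference, the three ways to grow a pool mentioned in this subthread map onto three zpool subcommands. The sketch below only builds the command lines and does not run anything; the pool name `tank`, the vdev name `raidz1-0` and the device names are placeholders, and the `zpool attach` form for raidz assumes OpenZFS 2.3 or later.

```python
import shlex


def add_raidz_vdev(pool: str, level: int, devices: list[str]) -> list[str]:
    """Old option 1: add a whole new top-level raidz vdev to the pool."""
    return ["zpool", "add", pool, f"raidz{level}", *devices]


def replace_member(pool: str, old: str, new: str) -> list[str]:
    """Old option 2: swap one member for a larger disk (repeat per member)."""
    return ["zpool", "replace", pool, old, new]


def expand_raidz(pool: str, raidz_vdev: str, new_device: str) -> list[str]:
    """New in 2.3: attach one more disk to an existing raidz vdev."""
    return ["zpool", "attach", pool, raidz_vdev, new_device]


# Placeholder names throughout; substitute your own pool, vdev and devices.
for cmd in (
    add_raidz_vdev("tank", 1, ["sdd", "sde", "sdf"]),
    replace_member("tank", "sda", "sdg"),
    expand_raidz("tank", "raidz1-0", "sdh"),
):
    print(shlex.join(cmd))
```

Note that the replace route only yields extra space once every member of the vdev has been replaced and the pool is allowed to expand (for example with the `autoexpand` pool property turned on).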
https://openzfs.github.io/openzfs-docs/Getting%20Started/index.html
ZFS runs on all major Linux distros, the source is compiled locally and there is no meaningful license problem. In datacenter and "enterprise" environments we compile ZFS "statically" with other kernel modules all the time.
For over six years now, there is an "experimental" option presented by the graphical Ubuntu installer to install the root filesystem on ZFS. Almost everyone I personally know (just my anecdote) chooses this "experimental" option. There has been an occasion here and there of ZFS snapshots taking up too much space, but other than this there have not been any problems.
I statically compile ZFS into a kernel that intentionally does not support loading modules on some of my personal laptops. My experience has been great, others' mileage may (certainly will) vary.
> Most Linux distros still use ext4 by default, which is 19 years old, but ext4 is little more than a series of extensions on top of ext2, which is the same age as NTFS.
However, ext4 and XFS are much simpler and more performant than BTRFS & ZFS as root drives on personal systems and small servers.
I personally won't use either on a single disk system as root FS, regardless of how fast my storage subsystem is.
ZFS will outscale ext4 in parallel workloads with ease. XFS will often scale better than ext4, but if you use L2ARC and SLOG devices, it is no contest. On top of that, you can use compression for an additional boost.
You might also find ZFS outperforms both of them in read workloads on single disks where ARC minimizes cold cache effects. When I began using ZFS for my rootfs, I noticed my desktop environment became more responsive and I attributed that to ARC.
Not on most database workloads. There zfs does not scale very well.
Percona and many others who benchmarked this properly would disagree with you. Percona found that ext4 and ZFS performed similarly when given identical hardware (with proper tuning of ZFS):
https://www.percona.com/blog/mysql-zfs-performance-update/
In this older comparison where they did not initially tune ZFS properly for the database, they found XFS to perform better, only for ZFS to outperform it when tuning was done and a L2ARC was added:
https://www.percona.com/blog/about-zfs-performance/
This is roughly what others find when they take the time to do proper tuning and benchmarks. ZFS outscales both ext4 and XFS, since it is a multiple block device filesystem that supports tiered storage while ext4 and XFS are single block device filesystems (with the exception of supporting journals on external drives). They need other things to provide them with scaling to multiple block devices and there is no block device level substitute for supporting tiered storage at the filesystem level.
That said, ZFS has a killer feature that ext4 and XFS do not have, which is low cost replication. You can snapshot and send/recv without affecting system performance very much, so even in situations where ZFS is not at the top in every benchmark such as being on equal hardware, it still wins, since the performance penalty of database backups on ext4 and XFS is huge.
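As a rough illustration of what that replication looks like in practice, here is a sketch that wraps `zfs snapshot`, `zfs send -i` and `zfs recv`. The dataset, snapshot and host names are placeholders, the wrapper functions are invented for this example, and error handling is kept to a minimum.

```python
import subprocess
from typing import Optional


def snapshot_cmd(dataset: str, name: str) -> list[str]:
    """Command to take a point-in-time snapshot, e.g. tank/db@tue."""
    return ["zfs", "snapshot", f"{dataset}@{name}"]


def send_cmd(dataset: str, new: str, prev: Optional[str] = None) -> list[str]:
    """Full send if prev is None, otherwise incremental from prev to new."""
    cmd = ["zfs", "send"]
    if prev is not None:
        cmd += ["-i", f"{dataset}@{prev}"]
    cmd.append(f"{dataset}@{new}")
    return cmd


def recv_cmd(host: str, target: str) -> list[str]:
    """Command to receive the stream on a remote machine over ssh."""
    return ["ssh", host, "zfs", "recv", target]


def replicate(dataset: str, new: str, prev: Optional[str], host: str, target: str) -> None:
    """Snapshot locally, then pipe `zfs send` into `zfs recv` on the remote."""
    subprocess.run(snapshot_cmd(dataset, new), check=True)
    sender = subprocess.Popen(send_cmd(dataset, new, prev), stdout=subprocess.PIPE)
    try:
        subprocess.run(recv_cmd(host, target), stdin=sender.stdout, check=True)
    finally:
        sender.stdout.close()
        if sender.wait() != 0:
            raise subprocess.CalledProcessError(sender.returncode, sender.args)
```

A nightly job could then call something like `replicate("tank/db", "tue", "mon", "backuphost", "backup/db")`; only the blocks changed since the `mon` snapshot cross the wire, which is why the impact on a running database stays small.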
There is no way that a CoW filesystem with parity calculations or striping is gonna beat XFS on multiple disks, specially on high speed NVMe.
The article provides great insight into optimizing ZFS, but using an EBS volume as store (with pretty poor IOPS) and then giving the NVMe as metadata cache only for ZFS feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we set up XFS with its metadata and log on a RAMFS it will beat ZFS :)
L2ARC is a cache. Cache is actually part of its full name, which is Level 2 Adaptive Replacement Cache. It is intended to make fast storage devices into extensions of the in memory Adaptive Replacement Cache. L2ARC functions as a victim cache. While L2ARC does cache metadata, it caches data too. You can disable the data caching, but performance typically suffers when you do. While you can put ZFS metadata on a special device if you want, that was not the configuration that Percona evaluated.
If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at price point that XFS cannot match is competition, not cheating.
ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:
https://www.percona.com/blog/zfs-for-mongodb-backups/
Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?
Refuting the "it doesn't scale" argument with data from a blog that showcases a single workload (TPC-C) with a 200G+10tables dataset (small to medium) on a 2vCPU (wtf) machine with 16 connections (no thread pool so overprovisioned) is not quite a definition of scale at all. It's a lost experiment if anything.
The guy did not have any data to justify his claims of not scaling. Percona’s data says otherwise. If you don’t like how they got their data, then I advise you to do your own benchmarks.
It is based on data from internal benchmarks. Zfs is fine for database workloads but scales worse than Xfs based on my personal experience. These are unpublished benchmarks and I do not have access to any farm to win a discussion on the internet.
I did internal benchmarks at ClusterHQ in 2016. Those benchmarks showed that a tuned ZFS FS of the time had 85% the performance of XFS on equal hardware (a beefy EC2 instance with 4 SSDs, with XFS using MD RAID 0), but it was considered a win for ZFS because of the performance difference when running backups. L2ARC was not considered since the underlying storage was already SSD based and there was nothing faster, but in practice, you often can use it with a faster tier of storage and that puts ZFS ahead even without considering the substantial performance dips of backups.
I don't have anything to like or not to like. I'm not a user of ZFS filesystem. I'm just dismissing your invalid argumentation. Percona's data is nothing about the scale for reasons I already mentioned.
The argument he made was invalid without data to back it up. I at least cited something. The remarks on the performance when backups are made and the benefits of L2ARC were really the most important points, and are far from invalid.
No doubt. I want to reiterate my point. Citing myself:
> "I personally won't use either on a single disk system as root FS, regardless of how fast my storage subsystem is." (emphasis mine)
We are no strangers to filesystems. I personally benchmarked a ZFS7320 extensively, writing a characterization report, plus we have a ZFS7420 for a very long time, complete with separate log SSDs for read and write on every box.
However, ZFS is not saturation proof, plus is nowhere near a Lustre cluster performance wise, when scaled.
What kills ZFS and BTRFS on desktop systems is write performance, esp. on heavy workloads like system updates. If I need a desktop server (performance-wise), I'd configure it accordingly and use these, but I'd never use BTRFS or ZFS on a single root disk due to their overhead, to reiterate myself thrice.
I am generally happy with the write performance of ZFS. I have not noticed slow system updates on ZFS (although I run Gentoo, so slow is relative here). In what ways is the write performance bad?
I am one of the OpenZFS contributors (although I am less active as of late). If you bring some deficiency to my attention, there is a chance I might spend the time needed to improve upon it.
By the way, ZFS limits the outstanding IO queue depth to try to keep latencies down as a type of QoS, but you can tune it to allow larger IO queue depths, which should improve write performance. If your issue is related to that, it is an area that is known to be able to use improvement in certain situations:
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
What I see with CoW filesystems is, when you force the FS to sync a lot (like apt does to keep immunity against power losses to a maximum), the write performance slouches visibly. This also means that when you're writing a lot of small files with a lot of processes and flood the FS with syncs, you get the same slouching, making everything slower in the process. This effect is better controlled in simpler filesystems, namely XFS and EXT4. This is why I keep backups elsewhere and keep my single disk rootfs on "simple" filesystems.
I'll be installing a 2 disk OpenZFS RAID1 volume on an SBC for high value files soon-ish, and I might be doing some tests on that when it's up. Honestly, I don't expect stellar performance since I'll be already putting it on constrained hardware, but I'll let you know if I experience anything that doesn't feel right.
Thanks for the doc links, I'll be devouring them when my volume is up and running.
Where do you prefer your (bug and other) reports? GitHub? E-mail? IP over Avian Carriers?
Heavy synchronous IO from incredibly frequent fsync is a weak point. You can make it better using SLOG devices. I realize what I am about to say is not what you want to hear, but any application doing excessive fsync operations is probably doing things wrong. This is a view that you will find prevalent among all filesystem developers (i.e. the ext4 and XFS guys will have this view too). That is because all filesystems run significantly faster when fsync() is used sparingly.
In the case of APT, it should install all of the files and then call sync() once. This is the equivalent of calling fsync on every file like APT currently does, but aggregates it for efficiency. The reason APT does not use sync() is probably a portability thing, because the standard does not require sync() to be blocking, but on Linux it is:
https://www.man7.org/linux/man-pages/man2/sync.2.html
From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package. Thus it does not really matter for power loss protection if you are using fsync() on all files or sync() once for all files, since what must happen next to fix it is the same. However, from a performance perspective, it really does matter.
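To show the difference between the two strategies, here is a simplified sketch. It is not APT's or dpkg's actual code; the function names and file names are invented, and it only contrasts one flush per file with one flush at the end.

```python
import os
import tempfile
from pathlib import Path


def install_with_fsync(files: dict[str, bytes], dest: Path) -> int:
    """Flush every file individually; returns the number of flush calls."""
    flushes = 0
    for name, data in files.items():
        with open(dest / name, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # forces this one file to stable storage
            flushes += 1
    return flushes


def install_with_single_sync(files: dict[str, bytes], dest: Path) -> int:
    """Write everything first, then ask the kernel to flush once."""
    for name, data in files.items():
        (dest / name).write_bytes(data)
    os.sync()  # flushes all dirty data; on Linux this blocks until done
    return 1


def demo() -> tuple[int, int]:
    """Install the same hypothetical package contents both ways."""
    files = {"a.conf": b"a", "b.so": b"b", "c.bin": b"c"}
    with tempfile.TemporaryDirectory() as one, tempfile.TemporaryDirectory() as two:
        per_file = install_with_fsync(files, Path(one))
        batched = install_with_single_sync(files, Path(two))
        assert all((Path(two) / n).read_bytes() == d for n, d in files.items())
    return per_file, batched
```

The flush count grows with the number of files in the first variant and stays at one in the second, which is where the performance difference comes from. The trade-off is that sync() also flushes unrelated filesystems.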
That said, slow fsync performance generally is not an issue for desktop workloads because they rarely ever use fsync. APT is the main exception. You are the first to complain about APT performance in years as far as I know (there were fixes to improve APT performance 10 years ago, when its performance was truly horrendous).
You can file bug reports against ZFS here:
https://github.com/openzfs/zfs
I suggest filing a bug report against APT. There is no reason for it to be doing fsync calls on every file it installs in the filesystem. It is inefficient.
Actually this was discussed recently [0]. While everybody knows it's not efficient, it's required to keep update process resilient against unwanted shutdowns (like power losses which corrupt the filesystem due to uncommitted work left on the filesystem).
> From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package.
Yes, but at least you have all the files, otherwise you can have 0 length files which can prevent you from booting your system. In this case, your system boots, all files are in place, but some packages are in semi-configured state. Believe me, apt can recover from many nasty corners without any ill effects as long as all files are there. I used to be a tech-lead for a Debian derivative back in the day, so I lived in the trenches in Debian for a long time, so I have seen things.
Again, it's been decided that the massive sync will stay in place for now, because the risks involved in the wild don't justify the performance difference yet. If you prefer to be reckless, there's "eatmydata" and "--force-unsafe-io" options baked in already.
Thanks for the links, I'll let you know if I find something. I just need to build the machine from the parts I have, then I'll be off to the races.
[0]: https://lists.debian.org/debian-devel/2024/12/msg00533.html [warning, long thread]
This email mentions a bunch of operations that are done per file to ensure the file put in the final location always has the correct contents:
https://lists.debian.org/debian-devel/2024/12/msg00540.html
It claims that the fsync is needed to avoid the file appearing at the final location with a zero length after a power loss. This is not true on ZFS.
ZFS puts every filesystem operation into a transaction group that is committed atomically about every 5 seconds by default. On power loss, the transaction group either succeeds or never happens. The result is that even without using fsync, there will never be a zero length file at the final location because the rename being part of a successful transaction group commit implies that the earlier writes also were part of a successful transaction group commit.
The result is that you can use --force-unsafe-io with dpkg on ZFS, things will run faster and there should be no issues for power loss recovery as far as zero length files go.
The following email mentions that sync() had been used at one point but caused problems when flash drives were connected, so it was dropped:
https://lists.debian.org/debian-devel/2024/12/msg00597.html
The timeline is unclear, but I suspect this happened before Linux 2.6.39 introduced syncfs(), which would have addressed that. Unfortunately, it would have had problems for systems with things like a separate /usr mount, which requires the package manager to realize multiple syncfs calls are needed. It sounds like dpkg was calling sync() per file, which is even worse than calling fsync() per file, although it would have ensured that the directory entries for prior files were there following a power loss event.
The email also mentions that fsync is not called on directories. The result is that a power loss event (on any Linux filesystem, not just ZFS) could have the files missing from multiple packages marked as installed in the package database, which is said to use fsync to properly record installations. I find this situation weird since I would use sync() to avoid this, but if they are comfortable having systems have multiple “installed” packages missing files in the filesystem after a power loss, then there is no need to use sync().
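A minimal sketch of the two alternatives discussed here, assuming a POSIX system. It is not dpkg's code; `install_batch`, `fsync_dir` and the `.tmp-new` suffix are made up for the example. Python's `os` module exposes `sync()` but not `syncfs()`, so the batch version below flushes every mounted filesystem rather than just one.

```python
import os
import tempfile


def fsync_dir(path):
    """Flush one directory's entries to stable storage (POSIX only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def install_batch(dest_dir, files):
    """Write each file under a temporary name, rename it, then sync once."""
    for name, data in files.items():
        final = os.path.join(dest_dir, name)
        tmp = final + ".tmp-new"   # hypothetical suffix
        with open(tmp, "wb") as f:
            f.write(data)
        os.rename(tmp, final)
    os.sync()   # one flush for the whole batch instead of one fsync per file


with tempfile.TemporaryDirectory() as d:
    install_batch(d, {"a.conf": b"alpha", "b.conf": b"beta"})
    fsync_dir(d)   # the per-directory flush the email says is skipped
    print(sorted(os.listdir(d)))   # ['a.conf', 'b.conf']
```

The single `os.sync()` covers both file contents and directory entries, which is why it closes the gap described above without a separate call per directory.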
Hi! I am quite a beginner when it comes to file systems. Would this sync effect not be helped by direct IO in ZFS's case?
Also, given that you seem quite knowledgeable of the topic, what is your go-to backup solution?
I initially thought about storing `zfs send` files into backblaze (as backup at a different location), but without recv-ing these, I don't think the usual checksumming works properly. I can checksum the whole before and after updating, but I'm not convinced if this is the best solution.
No, it will not. It would be helped by APT switching to using a single sync/syncfs call after installing all files, which is the performant way to do what it wants on Linux:
After studying the DPKG developers’ reasoning for using fsync excessively, it turns out that there is no need for them to use fsync on a ZFS rootfs. When the rootfs is ZFS, you can use --force-unsafe-io to skip the fsync operations for a speed improvement and there will be no safety issues due to how ZFS is designed.
DPKG will write each file to a temporary location and then rename it to the final location. On ext4, without fsync, when a power loss event occurs, it is possible for the rename to the final location to be done, without any of the writes such that you have a zero length file. On ZFS, the rename being done after the writes means that the rename being done implies the writes were done due to the sequential nature of ZFS’ transaction group commit, so the file will never appear in the final location without the file contents following a power loss event, which is why ZFS does not need the fsync there.
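The write-then-rename pattern described above looks roughly like this in Python. It is an illustration of the technique, not dpkg's implementation; `install_file` and its `safe_io` flag are hypothetical, with `safe_io=False` loosely standing in for what `--force-unsafe-io` skips.

```python
import os
import tempfile


def install_file(final_path, data, safe_io=True):
    """Write to a temporary name, optionally fsync, then rename into place."""
    tmp_path = final_path + ".dpkg-new"   # suffix chosen for illustration
    with open(tmp_path, "wb") as f:
        f.write(data)
        if safe_io:
            f.flush()              # push Python's buffer to the kernel
            os.fsync(f.fileno())   # force the contents to disk before the rename
    os.rename(tmp_path, final_path)   # atomic replacement on POSIX


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "libfoo.so")
    install_file(target, b"v1")                  # conservative path
    install_file(target, b"v2", safe_io=False)   # relies on the filesystem's ordering
    with open(target, "rb") as f:
        print(f.read())   # b'v2'
```

On a filesystem that may persist the rename before the data, only the `safe_io=True` path rules out a zero-length file after power loss. The argument above is that ZFS gives that ordering without the fsync.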
ZFS on OS X was killed because of Oracle licensing drama. I don’t expect anything better on Windows either.
There is a third party port here:
https://openzfsonosx.org/wiki/Main_Page
It was actually the NetApp lawsuit that caused problems for Apple’s adoption of ZFS. Apple wanted indemnification from Sun because of the lawsuit, but Sun’s CEO did not sign the agreement before Oracle’s acquisition of Sun happened, and Oracle had no interest in granting it, so the official Apple port was cancelled.
I heard this second hand years later from people who were insiders at Sun.
That’s a shame re: NetApp/ZFS.
While third-party ports are great, they lack deep integration that first-party support would have brought (non-kludgy Time Machine which is technically fixed with APFS).
> ZFS on OS X was killed because of Oracle licensing drama.
It was killed because Apple and Sun couldn't agree on a 'support contract'. From Jeff Bonwick, one of the co-creators of ZFS:
>> Apple can currently just take the ZFS CDDL code and incorporate it (like they did with DTrace), but it may be that they wanted a "private license" from Sun (with appropriate technical support and indemnification), and the two entities couldn't come to mutually agreeable terms.
> I cannot disclose details, but that is the essence of it.
* https://archive.is/http://mail.opensolaris.org/pipermail/zfs...
Apple took DTrace, licensed via CDDL just like ZFS, and put it into the kernel without issue. Of course a file system is much more central to an operating system, so they wanted much more of a CYA for that.
>ZFS on OS X was killed because of Oracle licensing drama.
Naa, it was Jobs' ego, not the license:
>>Only one person at Steve Jobs' company announces new products: Steve Jobs.
https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-ap...
It’s a cute story that plays into the same old assertions about Steve Jobs, but the conclusion is mostly baseless. There are many other, more credible, less conspiratorial, possible explanations.
It could have played into it though, but I agree the support contract that couldn't be worked out mentioned elsewhere in the thread is more likely.
But I think these things are usually a combination. When a business relationship sours, agreements are suddenly much harder to work out. The negotiators are still people and they have feelings that will affect their decisionmaking.
You've been able to add and remove devices at will for a long time with btrfs (only recently supported in zfs with lots of caveats)
Btrfs also supports async/offline dedupe
You can also layer it on top of mdadm. Iirc zfs strongly discourages using anything but direct attached physical disks.
The license is not a real issue. ZFS just has to be distributed as a separate module. No big hurdle.
The main hurdle is hostile Linux kernel developers who aren't held accountable intentionally breaking ZFS for their own petty ideological reasons e.g. removing the in-kernel FPU/SIMD register save/restore API and replacing it with a "new" API to do the same.
What's "new" about the "new" API? Its symbols are GPL2 only to deny its use to non-GPL2 modules (like ZFS). Guess that's an easy way to make sure that BTRFS is faster than ZFS or set yourself up as the (to be) injured party.
Of course a reimplementation of the old API in terms of the new is an evil "GPL condom" violating the kernel license right? Why can't you see ZFS's CDDL license is the real problem here for being the wrong flavour of copyleft license. Way to claim the moral high ground you short-sighted, bigoted pricks. sigh
From my point of view it is a real usability issue.
zfs modules are not in the official repos. You either have to compile it on each machine or use unofficial repos, which is not exactly ideal and can break things if those repos are not up to date. And I guess it also needs some additional steps for secureboot setup on some distros?
I really want to try zfs because btrfs has some issues with RAID5 and RAID6 (it is not recommended so I don't use it) but I am not sure I want to risk the overall system stability, I would not want to end up in a situation where my machines don't boot and I have to fix it manually.
I have been using ZFS on Mint and Alpine Linux for years for all drives (including root) and have never had an issue. It's been fantastic and is super fast. My linux/zfs laptop loads games much faster than an identical machine running Windows.
I have never had data corruption issues with ZFS, but I have had both xfs and ext4 destroy entire discs.
Why are you considering raid5/6? Are you considering building a large storage array? If the data will fit comfortably (50-60% utilization) on one drive, all you need is raid1. Btrfs is fine for raid1 (raid1c3 for extra redundancy); it might have hidden bugs, but no filesystem is immune from those; zfs had a data loss bug (it was rare, but it happened) a year ago.
Why use zfs for a boot partition? Unless you're using every disk mounting point and nvme slot for a single large raid array, you can use a cheap 512GB nvme drive or old spare 2.5" ssd for the boot volume. Or two, in btrfs raid1 if you absolutely must... but do you even need redundancy or datasum (which can hurt performance) to protect OS files? Do you really care if static package files get corrupted? Those are easily reinstalled, and modern quality brand SSDs are quite reliable.
I am already using ext4 for /boot and / on nvme, and I am happy with that.
I want to use raid 5 for the large storage mount point that holds non-OS files. I want both space and redundancy. Currently I have several separate raid1 btrfs mounts, since btrfs raid5 is recommended against.
It is a problem because most of the internal kernel APIs are GPL-only, which limit the abilities of the ZFS module. It is a common source of argument between the Linux guys and the ZFS on Linux guys.
The reason for this is not just to piss off non-GPL module developers. GPL-only internal APIs are subject to change without notice, even more so than the rest of the kernel. And because the licence may not allow the Linux kernel developers to make the necessary changes to the module when it happens, there is a good chance it breaks without warning.
And even with that, all internal APIs may change, it is just a bit less likely than for the GPL-only ones, and because ZFS on Linux is a separate module, there is no guarantee for it to not break with successive Linux versions, in fact, it is more like a guarantee that it will break.
Linux is proudly monolithic, and with a constantly evolving monolithic kernel, developers need to have control over the entire project. It is also community-driven. Combined, you need rules to have the community work together, or everything will break down, and that's what the GPL is for.
I remember it being a pain in the ass on Fedora which tracks closely to mainline. Frequently a new kernel version would come out that zfs module didn't support so you'd have to downgrade and hold back the package until support was added.
Fedora packages zfs-fuse. I think some distros have arrangements to make sure kernels have zfs support. It may be less of a headache on those
In tree fs don't break that way
>ZFS has license issues with Linux, preventing full integration
No one wants that, openZFS is much healthier without Linux and its "Foundation/Politics".
> No one wants that
I want that
Then let me tell you that FreeBSD or OmniOS is what you really want ;)
You're now 0 for 2 at telling me what I want
The customer is not always right, however a good/modern Filesystem really would be something for Linux ;)
> The customer is not always right,
An uninvited door-to-door salesman is rarely, if ever right.
HN is more like a Tupperware party. ;)
Well then you ought to go somewhere more appreciative of your pitches ;)
XFS is 22 and still the best in-tree FS there is :)
> I just don't get it how the Windows world - by far the largest PC platform per userbase - still doesn't have any answer to ZFS.
The mainline Linux kernel doesn't either, and I think the answer is because it's hard and high risk with a return mostly measured in technical respect?
Technically speaking, bcachefs has been merged into the Linux Kernel - that makes your initial assertion wrong.
But considering it's had two drama events within 1 year of getting merged... I think we can safely confirm your conclusion of it being really hard
> Technically speaking, bcachefs has been merged into the Linux Kernel - that makes your initial assertion wrong.
bcachefs doesn't implement its erasure coding/RAID yet? Doesn't implement send/receive. Doesn't implement scrub/fsck. See: https://bcachefs.org/Roadmap, https://bcachefs.org/Wishlist/
btrfs is still more of a legit competitor to ZFS these days and it isn't close to touching ZFS where it matters. If the perpetually half-finished bcachefs and btrfs are the "answer" to ZFS that seems like too little, too late to me.
Erasure coding is almost done; all that's missing is some of the device evacuate and reconstruct paths, and people have been testing it and giving positive feedback (especially w.r.t. performance).
It most definitely does have fsck and has since the beginning, and it's a much more robust and dependable fsck than btrfs's. Scrub isn't quite done - I actually was going to have it ready for this upcoming merge window except for a nasty bout of salmonella :)
Send/recv is a long ways off, there might be some low level database improvements needed before that lands.
Short term (next year or two) priorities are finishing off online fsck, more scalability work (upcoming version for this merge window will do 50PB, but now we need to up the limit on number of drives), and quashing bugs.
Hearing that it is missing some code for reconstruction makes it sound like it is missing something fairly important. The original purpose of parity RAID is to support reconstruction.
We can do reconstruct reads, what's missing is the code to rewrite missing blocks in a stripe after a drive dies.
In general, due to the scope of the project, I've been prioritizing the functionality that's needed to validate the design and the parts that are needed for getting the relationships between different components correct.
e.g. recently I've been doing a bunch of work on backpointers scalability, and that plus scrub are leading to more back and forth iteration on minor interactions with erasure coding.
So: erasure coding is complete enough to know that it works and for people to torture test it, but yes you shouldn't be running it in production yet (and it's explicitly marked as such). What's remaining is trivial but slightly tedious stuff that's outside the critical path of the rest of the design.
Some of the code I've been writing for scrub is turning out to also be what we want for reconstruct, so maybe we'll get there sooner rather than later...
>except for a nasty bout of salmonella
Did the Linux Foundation send you some "free" sushi? ;)
However keep the good work rolling, super happy about a good, usable and modern Filesystem native to Linux.
FYI: the main reason I gave up on bcachefs is that I can't use devices with native 16K blocks.
Hope that's coming this year. I have a bunch of old HDDs and SSDs and I could very easily assemble a spare storage server with about 4TB capacity. Already tested bcachefs with most of the drives and it performed very well.
Also lack of ability to reconstruct seems like another worrying omission.
I wasn't aware there were actual users needing bs > ps yet. Cool :)
That should be completely trivial for bcachefs to support, it'll mostly just be a matter of finding or writing the tests.
Seriously? But... NVMe drives! I stopped testing because I only have one spare NVMe and couldn't use it with bcachefs.
If you or others can get it done I'm absolutely starting to use bcachefs the month after. I do need fast storage servers in my home office.
You can do this on ZFS today with `zpool create -o ashift=14 ...`.
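For anyone wondering where the 14 comes from: `ashift` is the base-2 logarithm of the sector size the pool should assume, so 16K sectors mean 2^14 = 16384 bytes. A small sketch of that arithmetic (the helper name `ashift_for` is made up):

```python
def ashift_for(sector_bytes):
    """Return log2 of a power-of-two sector size, i.e. the matching ashift."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


print(ashift_for(16384))   # 14
print(ashift_for(4096))    # 12
print(ashift_for(512))     # 9
```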
Yeah I know, thanks. But ZFS still mostly requires drives with the same sizes. My main NAS is like that but I can't expand it even though I want to, with drives of different sizes I have lying around, and I am not keen on spending for new HDDs right now. So I thought I'll make a secondary NAS with bcachefs and all the spare drives I have.
As for ZFS, I'll be buying some extra drives later this year and will make use of direct_io so I can use another NVMe spare for faster access.
If you don’t care about redundancy, you could add all of them as top level vdevs and then ZFS will happily use all of the space on them until one fails. Performance should be great until there is a failure. Just have good backups.
Thank you, looking forward to it!
Honest question. As an end user that uses Windows and Linux and does not use ZFS, what am I missing?
Way better data security, resilience against file rotting. This goes for both HDDs or SSDs. Copy-on-write, snapshots, end to end integrity. Also easier to extend the storage for safety/drive failure (and SSDs corrupt in a more sneaky way) with pools.
How many of us are using single disks on our laptops? I have a NAS and use all of the above but that doesn’t help people with single drive systems. Or help me understand why I would want it on my laptop.
My thinkpad from college uses ZFS as its rootfs. The benefits are:
* If the hard drive / SSD corrupted blocks, the corruption would be identified.
* Ditto blocks allow for self healing. Usually, this only applies to metadata, but if you set copies=2, you can get this on data too. It is a poor man’s RAID.
* ARC made the desktop environment very responsive since unlike the LRU cache, ARC resists cold cache effects from transient IO workloads.
* Transparent compression allowed me to store more on the laptop than otherwise possible.
* Snapshots and rollback allowed me to do risky experiments and undo them as if nothing happened.
* Backups were easy via send/receive of snapshots.
* If the battery dies while you are doing things, you can boot without any damage to the filesystem.
That said, I use a MacBook these days when I need to go outside. While I miss ZFS on it, I have not felt motivated to try to get a ZFS rootfs on it since the last I checked, Apple hardcoded the assumption that the rootfs is one of its own filesystems into the XNU kernel and other parts of the system.
Not ever having to deal with partitions and instead using datasets, each of which can have its own properties such as compression, size quota, encryption etc., is another benefit. Also using zfsbootmenu instead of grub enables booting from different datasets or snapshots as well as mounting and fixing datasets all from the bootloader!
Alright that's a bit mind blowing. TIL. Thank you. =)
NTFS has had compression since not even sure when.
For other stuff, let that nerdy CorpIT handle your system.
NTFS compression is slow and has a low compression ratio. ZFS has both zstd and lz4.
yes but NTFS is bad enough that no one needs to be told how bad it is.
If the single drive in your laptop corrupts data, you won't know. ZFS can't fix corruption without extra copies, but it's still useful to catch the problem and notify the user.
Also snapshots are great regardless.
In some circumstances it can.
Every ZFS block pointer has room for 3 disk addresses; by default, the extras are used only for redundant metadata, but they can also be used for user data.
When you turn on ditto blocks for data (zfs set copies=2 rpool/foo), zfs can fix corruption even on single-drive systems at the cost of using double or triple the space. Note that (like compression), this only affects blocks written after the setting is in place, but (if you can pause writes to the filesystem) you can use zfs send|zfs recv to rewrite all blocks to ensure all blocks are redundant.
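A toy illustration of why a second copy turns detection into repair on a single drive. This is a sketch of the idea only, not ZFS's on-disk format: SHA-256 stands in for the pool's checksum (ZFS defaults to fletcher4), and `write_block`/`read_block` are hypothetical helpers. The checksum is kept apart from the replicas, much as ZFS keeps it in the parent block pointer.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


def write_block(data, copies=2):
    """Store `copies` replicas plus a checksum held separately from them."""
    return {
        "checksum": checksum(data),
        "replicas": [bytearray(data) for _ in range(copies)],
    }


def read_block(block):
    """Return the first replica that verifies and rewrite any bad ones."""
    good = None
    for replica in block["replicas"]:
        if checksum(bytes(replica)) == block["checksum"]:
            good = bytes(replica)
            break
    if good is None:
        raise OSError("all replicas failed checksum verification")
    for i, replica in enumerate(block["replicas"]):
        if bytes(replica) != good:
            block["replicas"][i] = bytearray(good)   # self-heal from the good copy
    return good


block = write_block(b"important data", copies=2)
block["replicas"][0][0] ^= 0x01            # flip one bit in the first copy
print(read_block(block))                   # b'important data'
print(bytes(block["replicas"][0]))         # b'important data' (repaired)
```

With `copies=1` the same bit flip is still detected, but `read_block` can only raise an error, which matches the single-copy behaviour described earlier in the thread.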
It provides encryption by default without having to deal with LUKS. And no need to ever do fsck again.
Except that swap on OpenZFS still deadlocks 7 years later (https://github.com/openzfs/zfs/issues/7734) so you're still going to need LUKS for your swap anyway.
Another option is to go without swap. I avoid swap on my machines unless I want hibernation support.
The data security and rot resilience only goes for systems with ECC memory. Correct data with a faulty checksum will be treated the same as incorrect data with a correct checksum.
Windows has its own extended filesystem through Storage Spaces, with many ZFS features added as lesser used Storage Spaces options, especially when combined with ReFS.
This has nothing to do with ZFS as a filesystem. It has integrity verification on duplicated raid configurations. If the system memory flips a bit, it will get written to disk like all filesystems. If a bit flips on a disk, however, it can be detected and repaired. Without ECC, your source of truth can corrupt, but this is true of any system.
Please stop repeating this, it is incorrect. ECC helps with any system, but it isn't necessary for ZFS checksums to work.
On zfs there is the ARC (adaptive replacement cache); on non-zfs systems the equivalent read cache is the page/buffer cache. Both reside in memory, so ECC is equally important for both systems.
Rot means changing bits without accessing those bits, and that's ~not possible with zfs, additionally you can enable check-summing IN the ARC (disabled by default), and with that you can say that ECC and "enterprise" quality hardware is even more important for non-ZFS systems.
>Correct data with a faulty checksum will be treated the same as incorrect data with a correct checksum.
There is no such thing as "correct" data, only a block with a correct checksum, if the checksum is not correct, the block is not ok.
"data security and rot resilience only goes for systems with ECC memory."
No. Bad HDDs/SSDs or bad SATA cables/ports cause a lot more data corruption than bad RAM. And ZFS will correct these cases even without ECC memory. It's a myth that the data healing properties of ZFS are useless without ECC memory.
Precisely this. And don’t forget about bugs in virtualization layers/drivers — ZFS can very often save your data in those cases, too.
I once managed to use ZFS to detect a bit flip on a machine that did not have ECC RAM. All python programs started crashing in libpython.so on my old desktop one day. I thought it was a bug in ZFS, so I started debugging. I compared the in-memory buffer from ARC with the on-disk buffer for libpython.so and found a bit flip. At the time, accessing a snapshot through .zfs would duplicate the buffer in ARC, which made it really easy to compare the in-memory buffer against the on-disk buffer. I was in shock as I did not expect to ever see one in person. Since then, I always insist on my computers having ECC.
For a while I ran Open Solaris with ZFS as root filesystem.
The key feature for me, which I miss, is the snapshotting integrated into the package manager.
ZFS allows snapshots more or less for free (due to copy on write), including cron-based snapshotting every 15 minutes. So if I made a mistake anywhere there was a way to recover.
And that integrated with the update manager and boot manager means that on an update a snapshot is created and during boot one can switch between states. Never had a broken update, but gave a good feeling.
On my home server I like the raid features and on Solaris it was nicely integrated with NFS etc so that one can easily create volumes and export them and set restrictions (max size etc.) on it.
> is the snapshotting integrated into the package manager.
some linux distros have that by default with btrfs. And usually it's a package install away if you're already on btrfs.
Much faster launch of applications/files you use regularly. Ability to always roll back updates in seconds if they cause issues, thanks to snapshots. Fast backups with snapshots + zfs send/receive to a remote machine. Compressed disks, which both lets you store more on a drive and makes accessing files faster. Easy encryption. Ability to mirror 2 large usb disks so you never have your data corrupted or lose it from drive failures. Can move your data or entire os install to a new computer easily by using a live disk and just doing a send/receive to the new pc.
(I have never used dedup, but it's there if you want I guess)
Online filesystem checking and repair.
Reading any file will tell you with 100% guarantee if it is corrupt or not.
Snapshots that you can `cd` into, so you can compare any prior version of your FS with the live version of your FS.
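Because snapshots show up as ordinary read-only directories under `.zfs/snapshot/<name>` at the dataset's mountpoint, comparing an old version with the live one needs nothing ZFS-specific. A small sketch, where the helper name `changed_since` and the example paths are hypothetical:

```python
import filecmp
import os


def changed_since(mountpoint, snapshot, relpath):
    """True if `relpath` differs between the named snapshot and the live dataset."""
    snap_file = os.path.join(mountpoint, ".zfs", "snapshot", snapshot, relpath)
    live_file = os.path.join(mountpoint, relpath)
    # shallow=False compares file contents rather than just size and mtime.
    return not filecmp.cmp(snap_file, live_file, shallow=False)
```

For example, `changed_since("/home", "before-upgrade", "alice/.bashrc")` would report whether that file was edited after the (hypothetical) `before-upgrade` snapshot was taken. The `.zfs` directory is hidden from directory listings by default but can still be entered by name.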
Block level compression.
>Reading any file will tell you with 100% guarantee if it is corrupt or not.
Only possible if it was not corrupted in RAM before it was written to disk.
Using ECC memory is important, irrespective of ZFS.
Cross platform native encryption with sane fs for removable media.
Who would that help?
MacOS also defaults to a non-portable FS for likely similar reasons, if one was being cynical.
It would help users using USB sticks, external drives?
Couple it with encrypted zfs send/receive for cross platform secure backups.
I meant, why would they prioritize cross platform when it doesn’t help them?
I'm missing file clones/copy-on-write.
Snapshots (Note: NTFS does have this in the way of Volume Shadow Copy but it's not as easily accessible as a feature to the end user as it is in ZFS). Copy on Write for reliability under crashes. Block checksumming for data protection (bitrot)
NTFS has been extended in various ways over the years, to the point that what you could do with an NTFS drive 32 years ago feels like a completely different filesystem from what you can do with it on current Windows.
Honestly I really like ReFS, particularly in context of storage spaces, but I don't think it's relevant to Microsoft's consumer desktop OS where users don't have 6 drives they need to pool together. Don't get me wrong, I use ZFS because that's what I can get running on a Linux server and I'm not going to go run Windows Server just for the storage pooling... but ReFS + Storage Spaces wins my heart with the 256 MB slab approach. This means you can add+remove mixed sized drives and get the maximum space utilization for the parity settings of the pool. Here ZFS is still getting to online adds of same or larger drives 10 years later.
OS development pretty much stopped around 2000. ZFS is from 2001. I don't count a new way to organise my photos or integrate with a search engine as "OS" though.
There is occasional talk of moving the Windows implementation of OpenZFS (https://github.com/openzfsonwindows/openzfs/releases) into an officially supported tier, though that will probably come after the MacOS version (https://github.com/openzfsonosx) is officially supported.
The same reason file deduplication is not enabled for client Windows: greed.
For example, there are numerous new file systems people use: OneDrive, Google Drive, iCloud Storage. Do you get it?
What do you mean by a ZFS compatibility layer? There is a Windows port:
https://github.com/openzfsonwindows/openzfs
Note that it is a beta.
NTFS is good enough for most people, who have a laptop with one SSD in it.
The benefits of ZFS don't need multiple drives to be useful. I've been running ZFS on root for years now and snapshots have saved my bacon several times. Also with block checksums you can at least detect bitrot. And COW is always useful.
Windows manages volume snapshots on NTFS through VSS. I think ZFS snapshots are a bit "cleaner" of a design, and the tooling is a bit friendlier IMO, but the functionality to snapshot, rollback, and save your bacon is there regardless. Outside of the automatically enabled "System Restore" (which only uses VSS to snapshot specific system files during updates) I don't think anyone bothers to use it though.
CoW, advanced parity, and checksumming are the big ones NTFS lacks. CoW is just inherently not how NTFS is designed and checksumming isn't there. Anything else (encryption, compression, snapshots, ACLs, large scale, virtual devices, basic parity) is done through NTFS on Windows.
Yes I know that NTFS has snapshots, I mentioned that in another comment. I don't think NTFS is as relevant in comparison though. People who choose windows will have no interest in ZFS and vice versa (someone considering ZFS will not pick Windows).
And I don't think anyone bothers to use it due to the lack of user-facing tooling around it. If it were as easy to create snapshots as it is on ZFS, more people would use it, I'm sure. It's just so amazing to try something out, screw up my system and just revert :P But VSS is more of a system API than a user-facing feature.
VSS is also used by backup software to quiet the filesystem by the way.
But yeah the others are great features. My main point was though that almost all the features of ZFS are very beneficial even on a single drive. You don't need an array to take advantage of Snapshots, the crash reliability that CoW offers, and checksumming (though you will lack the repair option obviously)
> I don't think NTFS is as relevant in comparison though. People who choose windows will have no interest in ZFS and vice versa (someone considering ZFS will not pick Windows).
ZFS on Windows, as a first-class supported-by-Microsoft option would be killer. It won't ever happen, but it would be great. (NTFS / VSS with filesystem/snapshot send/receive would "scratch" a lot of that "itch", too.)
> And I don't think anyone bothers to use it due to the lack of user-facing tooling around it. If it were as easy to create snapshots as it is on ZFS, more people would use it, I'm sure. It's just so amazing to try something out, screw up my system and just revert :P But VSS is more of a system API than a user-facing feature.
VSS on NTFS is handy and useful but in my experience brittle compared to ZFS snapshots. Sometimes VSS just doesn't work. I've had repeated cases over the years where accessing a snapshot failed (with traditional unhelpful Microsoft error messages) until the host machine was rebooted. Losing VSS snapshots on a volume is much easier than trashing a ZFS volume.
VSS straddles the filesystem and application layers in a way that ZFS doesn't. I think that contributes to some of the jank (VSS writers becoming "unstable", for example). It also straddles hardware interfaces in a novel way that ZFS doesn't (using hardware snapshot functionality-- somewhat like using a GPU versus "software rendering"). I think that also opens up a lot of opportunity for jank, as compared to ZFS treating storage as dumb blocks.
It's good to see that they were pretty conservative about the expansion.
Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.
That said, there is one tiny caveat people should be aware of:
> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).
I'm not sure that's really a caveat, it just means old data might be in a suboptimal layout. Even with that, you still get the full benefits of raidzN, where up to N disks can completely fail and the pool will remain functional.
I think it's a huge caveat, because it makes upgrades a lot less efficient than you'd expect.
For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with a 55% storage efficiency rather than 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
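The 55% versus 66% figures can be checked with a little arithmetic. The sketch below is a simplified model, assuming equal-sized disks and ignoring padding, metadata and allocation overhead; the function name and its `fill` parameter (how full the vdev was when expanded) are hypothetical.

```python
from fractions import Fraction


def efficiency_after_expansion(old_width, new_width, parity=1, fill=Fraction(1)):
    """Usable fraction of raw capacity after a raidz expansion.

    Raw capacity is counted in whole disks of equal size. Old blocks keep
    the old data-to-parity ratio; remaining space is filled at the new ratio.
    """
    old_ratio = Fraction(old_width - parity, old_width)
    new_ratio = Fraction(new_width - parity, new_width)
    old_raw_used = old_width * fill             # raw space holding old blocks
    new_raw_free = new_width - old_raw_used     # raw space left for new blocks
    usable = old_raw_used * old_ratio + new_raw_free * new_ratio
    return usable / new_width


print(efficiency_after_expansion(2, 3))                       # 5/9, about 55.6%
print(efficiency_after_expansion(2, 3, fill=Fraction(0)))     # 2/3, the ideal
print(efficiency_after_expansion(2, 3, fill=Fraction(1, 2)))  # 11/18, about 61.1%
```

The gap between 5/9 and 2/3 is 1/9 of raw capacity, which is the roughly 11% quoted above. The third line shows the loss shrinking when the expansion happens before the vdev is full.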
You have a couple options:
1. Delete the snapshots and rewrite the files in place like how people do when they want to rebalance a pool.
2. Use send/receive inside the pool.
Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
Well, when you start a raidz with 2 devices you've already done goofed. Start with a mirror or at least 3 devices.
Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do! you're creating massive fragmentation upwards of ~85%) efficiency in the real world will be better.
It still seems pretty minor. If you want extreme optimization, feel free to destroy the pool and create it new, or create it with the ideal layout from the beginning.
Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.
Is that the case? What if I expand a 3-1 array to 3-2? Won't the old blocks remain 3-1?
I don't believe it supports adding parity drives, only data drives.
Ahh interesting, thanks.
Since preexisting blocks are kept at their current parity ratio and not modified (only redistributed among all devices), increasing the parity level of new blocks won't really be useful in practice anyway.
Caveat is very much expected, you should expect ZFS features to not rewrite blocks. Changes to settings only apply to new data for example.
Yeah, it's a pretty huge caveat to be honest.
Da1 Db1 Dc1 Pa1 Pb1
Da2 Db2 Dc2 Pa2 Pb2
Da3 Db3 Dc3 Pa3 Pb3
___ ___ ___ Pa4 Pb4
___ represents free space. After expansion by one disk you would logically expect something like: Da1 Db1 Dc1 Da2 Pa1 Pb1
Db2 Dc2 Da3 Db3 Pa2 Pb2
Dc3 ___ ___ ___ Pa3 Pb3
___ ___ ___ ___ Pa4 Pb4
But as I understand it it would actually expand to: Da1 Db1 Dc1 Dd1 Pa1 Pb1
Da2 Db2 Dc2 Dd2 Pa2 Pb2
Da3 Db3 Dc3 Dd3 Pa3 Pb3
___ ___ ___ ___ Pa4 Pb4
Where the Dd1-3 blocks are just wasted. Meaning by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of storage free originally, and you have 4TB free before expansion, you would have 5TB free after expansion.
Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.
ZFS RAID-Z does not have parity disks. The parity and data are interleaved to allow data reads to be done from all disks rather than just the data disks.
The slides here explain how it works:
https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.
I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, allowing any single disk to die.
I've been of two minds on the persistent myth of "parity disks" but I usually ignore it, because it's a convenient lie to understand your data is safe, at least. It's also a little bit the same way that raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6"; the effective benefits are the same, but the implementation is totally different.
Unless I misunderstood you, you're describing more how classical RAID would work. The RAID-Z expansion works like you note you would logically expect. You added a drive with four blocks of free space, and you end up with four blocks more of free space afterwards.
You can see this in the presentation[1] slides[2].
The reason this is sub-optimal post-expansion is because, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.
Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then
Da1 Db1 Dc1 Pa1 Pb1
Da2 Db2 Dc2 Pa2 Pb2
Da3 Db3 Pa3 Pb3 ___
would after RAID-Z expansion become:
Da1 Db1 Dc1 Pa1 Pb1 Da2
Db2 Dc2 Pa2 Pb2 Da3 Db3
Pa3 Pb3 ___ ___ ___ ___
Ie you added a disk with 3 new blocks, and so total free space after is 1+3 = 4 blocks.
However if the same data was written in the post-expanded vdev configuration, it would have become
Da1 Db1 Dc1 Dd1 Pa1 Pb1
Da2 Db2 Dc2 Dd2 Pa2 Pb2
___ ___ ___ ___ ___ ___
Ie, you'd have 6 free blocks not just 4 blocks.
Of course this doesn't count for writes which end up taking less than the maximal stripe width.
[1]: https://www.youtube.com/watch?v=tqyNHyq0LYM
[2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10 disk raid-z2 vdev that is ashift=12 for 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there is no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio. In this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.
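The parity counts above can be reproduced with a short sketch. It is a simplification: it counts only data and parity sectors, ignores padding sectors and compression, and is not the actual OpenZFS allocation code. The post-expansion width of 11 disks is an assumption (the comment does not say how many disks are added); `parity_sectors` is a name made up here.

```python
import math

def parity_sectors(data_sectors, width, nparity):
    """Parity sectors for one block: nparity for every row of
    (width - nparity) data sectors, partial rows included."""
    rows = math.ceil(data_sectors / (width - nparity))
    return nparity * rows

SECTOR = 4096  # ashift=12 means 4K sectors

for size in (4 * 1024, 72 * 1024):
    data = math.ceil(size / SECTOR)
    print(size // 1024, "K file:", data, "data sectors,",
          parity_sectors(data, 10, 2), "parity at width 10,",
          parity_sectors(data, 11, 2), "parity at width 11 (assumed)")
```

With these inputs the 4K file needs 2 parity sectors at either width, while the 72K file drops from 6 to 4 only if it is rewritten at the wider layout.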
There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
What are the errors? I tried to show exactly what you talk about.
edit: ok, I didn't consider the exact locations of the parity, I was only concerned with space usage.
The 8 data blocks need three stripes on a 3+2 RAID-Z2 setup both pre and post expansion, the last being a partial stripe, but when written in the 4+2 setup only needs 2 full stripes, leading to more total free space.
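The free-space comparison from the small diagram can be checked with a few lines. This is only a sketch of the counting argument in the comments: it treats the vdev as a 3-row grid of slots, gives every stripe (including the partial one) two parity blocks, and uses a hypothetical helper name, `slots_used`.

```python
import math

def slots_used(data_blocks, data_per_stripe, nparity):
    """Slots taken by data plus parity when each stripe,
    full or partial, carries nparity parity blocks."""
    stripes = math.ceil(data_blocks / data_per_stripe)
    return data_blocks + nparity * stripes

ROWS = 3          # rows shown in the diagram
DATA_BLOCKS = 8   # Da1..Dc1, Da2..Dc2, Da3, Db3

# Expanded in place: blocks keep their 3+2 stripes, the vdev is now 6 wide.
free_after_expansion = 6 * ROWS - slots_used(DATA_BLOCKS, 3, 2)
# Same data written fresh on a 4+2 vdev.
free_if_rewritten = 6 * ROWS - slots_used(DATA_BLOCKS, 4, 2)

print(free_after_expansion, free_if_rewritten)  # 4 6
```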
This is huge news for ZFS users (probably mostly those in the hobbyist/home use space, but still). raidz expansion has been one of the most requested features for years.
I'm not yet familiar with zfs and couldn't find it in the release notes: Does expansion only work with disks of the same size? Or is adding bigger/smaller disks possible, or do all disks need to have the same size?
You can use different sized disks, but RAID-Z will truncate the space it uses to the lowest common denominator. If you increase the lowest common denominator, RAID-Z should auto-expand to use the additional space. All parity RAID technologies truncate members to the lowest common denominator, rather than just ZFS.
Is it definitely the LCD? Given drives of size 15 and 20 the LCD would be 1, no? I had assumed it would just use the size of the smallest drive on every drive (so 15+20->15+15=30). When I first read your comment I was thinking of GCF but even that would be fairly inefficient (GCF(15,20) = 5, so 15+20->5+5=10).
That's not entirely true, Unraid has mechanisms for unbalanced disks, but they come at a high cost in terms of usability by standard workloads.
Unraid is not a RAID technology:
> Unraid saves data to individual drives rather than spreading single files out over multiple drives
https://en.wikipedia.org/wiki/Unraid#Software-defined_NAS
At least, it is not one in the sense of the original RAID paper that coined the term:
https://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.pd...
IIRC, you could always replace drives in a raidset with larger devices. When the last drive is replaced, then the new space is recognized.
This new operation seems somewhat more sophisticated.
As far as I understand, ZFS doesn't work at all with disks of differing sizes (in the same array). So if you try it, it just finds the size of the smallest disk, and uses that for all disks. So if you put an 8TB drive in an array with a bunch of 10TB drives, they'll all be treated as 8TB drives, and the extra 2TB will be ignored on those disks.
However, if you replace the smallest disk with a new, larger drive, and resilver, then it'll now use the new smallest disk as the baseline, and use that extra space on the other drives.
(Someone please correct me if I'm wrong.)
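The "smallest disk sets the baseline" behaviour described above can be put into numbers with a rough sketch. It counts raw space only and ignores parity, metadata overhead, and partition rounding. The count of three 10TB drives is an assumption (the comment only says "a bunch"), and both function names are made up for illustration.

```python
def raidz_raw_capacity(disk_sizes_tb):
    """Raw space a raidz vdev uses: every member is treated as the smallest one."""
    smallest = min(disk_sizes_tb)
    return smallest * len(disk_sizes_tb)

def ignored_space(disk_sizes_tb):
    """Space left unused on the larger members."""
    return sum(disk_sizes_tb) - raidz_raw_capacity(disk_sizes_tb)

mixed = [8, 10, 10, 10]      # one 8TB drive among 10TB drives
print(raidz_raw_capacity(mixed), ignored_space(mixed))        # 32 6

replaced = [10, 10, 10, 10]  # after swapping out the 8TB drive and resilvering
print(raidz_raw_capacity(replaced), ignored_space(replaced))  # 40 0
```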
> As far as I understand, ZFS doesn't work at all with disks of differing sizes (in the same array).
This might be misleading; however, it may only be my understanding of the word "array".
You can use 2x10TB mirrors as vdev0, and 6x12TB in RAIDZ2 as vdev1 in the same pool/array. You can also stack as many unevenly sized disks as you want in a pool. The actual problem is when you want a different drive topology within a pool or vdev, or you want to mismatch, say, 3 oddly sized drives to create some synthetic redundancy level (2x4TB and 1x8TB to achieve two copies on two disks) like btrfs does/tries to do.
This is the case with any parity based raid, they just hide it or lie to you in various ways. If you have two 6TB drives and two 12TB drives in a single raid-6 array, it is physically impossible to have two drive parity once you exceed 12TB of written capacity. BTRFS and bcachefs can’t magically create more space where none exists on your 6TB drives. They resort to dropping to mirror protection for the excess capacity which you could also do manually with ZFS by giving it partitions instead of the whole drive.
You need to buy the same exact drive with the same capacity and speed. Your raidz vdev will be as small and as slow as your smallest and slowest drive.
btrfs and the new bcachefs can do RAID with mixed drives, but I can’t trust either of them with my data yet.
It doesn't have to be the same exact drive. Mixing drives from different manufacturers (with the same capacity) is often used to prevent correlated failure. ZFS does not use the whole disk, so different disks can be mixed even though disks of the same nominal size often vary slightly in capacity.
You can run raid-z across partitions to utilize the full drive just like synology does with their “hybrid raid” - you just shouldn’t.
> You need to buy the same exact drive
AFAIK you can add larger and faster drives, you will just not get any benefits from it.
You can get read speed benefits with faster drives, but your writes will be limited by your slowest.
Just have backups. I used btrfs and zfs for different purposes. Never had any lost data or downtime with btrfs since 2016. I only use raid 0 and raid 1 and compression. Btrfs does not have a hungry RAM requirement.
Neither does zfs, that’s a widely repeated red herring from people trying to do dedup in the very early days, and people who misunderstood how it used ram to do caching.
Tbh the idea of keeping backups defeats the purpose of using RAIDZ (especially RAIDZ3). I don’t want to buy an LTO drive, so if I backup, it’s either buying more HDDs or S3 Glacier ($$$). I like RAIDZ so I don’t have to buy so many drives. I guess it protects you if your house burns down, but how many people do offsite backups for their personal files? And dormant, unpowered HDDs die a lot faster than live, powered HDDs.
Yes, seriously handling your data is expensive. I am talking about buying new hard drives.
How does ZFS compare to btrfs? I'm currently using btrfs for my home server, but I've had some strange troubles with it. I'm thinking about switching to ZFS, but I don't want to end up in the same situation.
I first tried btrfs 15 years ago with Linux 2.6.33-rc4 if I recall. It developed an unlinkable file within 3 days, so I stopped using it. Later, I found ZFS. It had a few less significant problems, but I was a CS student at the time and I thought I could fix them since they seemed minor in comparison to the issue I had with btrfs, so over the next 18 months, I solved all of the problems that it had that bothered me and sent the patches to be included in the then ZFSOnLinux repository. My effort helped make it production ready on Linux. I have used ZFS ever since and it has worked well for me.
If btrfs had been in better shape, I would have been a btrfs contributor. Unfortunately for btrfs, it not only was in bad shape back then, but other btrfs issues continued to bite me every time I tried it over the years for anything serious (e.g. frequent ENOSPC errors when there is still space). ZFS on the other hand just works. Myself and many others did a great deal of work to ensure it works well.
The main reason for the difference is that ZFS had a very solid foundation, which was achieved by having some fantastic regression testing facilities. It has a userland version that randomly exercises the code to find bugs before they occur in production and a test suite that is run on every proposed change to help shake out bugs.
ZFS also has more people reviewing proposed changes than other filesystems. The Btrfs developers will often state that there is a significant man power difference between the two file systems. I vaguely recall them claiming the difference was a factor of 6.
Anyway, few people who use ZFS regret it, so I think you will find you like it too.
ZFS has been in production use for almost 20 years now. BTRFS is not fully fit for production, according to BTRFS: https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5...
Some simple use-cases are arguably production ready with BTRFS, YMMV.
btrfs has similar aims to ZFS, but is far less mature. i used it for my root partitions due to it not needing DKMS, but had many troubles. i used it in a fairly simple way, just a mirror. one day, one of the drives in the array started to have issues, and btrfs fell on its face. it remounted everything read-only if i remember correctly, and would not run in degraded mode by default. even mdraid would do better than this without checksumming and so forth. ZFS likewise says that the array is faulted, but of course allows it to be used. the fact the default behavior was not RAID, because it's literally missing the R part for reading the data back, made me lose any faith in it. i moved to ZFS and haven't had issues since. there is much more of a community and lots of good tooling around it.
I used Btrfs for a few years but switched away a couple years ago. I also had one or two incidents with Btrfs where some weirdness happened, but I was able to recover everything in the end. Overall I liked the flexibility of Btrfs, but mostly I found it too slow.
I use ZFS on Arch Linux and overall have had no problems with it so far. There's more customization and methods to optimize performance. My one suggestion is to do a lot of research and testing with ZFS. There is a bit of a learning curve, but it's been worth the switch for me.
FINALLY!
You can do borderline insane single-vdev setups like RAID-Z3 with 4 disks (3 Disks worth of redundancy) of the most expensive and highest density hard drives money can buy right now, for an initial effective space usage of 25% and then keep buying and expanding Disk by Disk, with the space demand growing, up to something like 12ish disks. Disk prices dropping as time goes on and a spread out failure chance with disks being added at different times.
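To put numbers on that growth path, here is a small sketch of the best-case data fraction for a RAID-Z3 vdev at each width from 4 to 12 disks. It only considers full-width stripes and applies to data written at that width; blocks written before an expansion keep their old ratio. `nominal_efficiency` is a name made up here.

```python
def nominal_efficiency(width, nparity=3):
    """Best-case data fraction for full-width stripes on a raidz vdev."""
    return (width - nparity) / width

for width in range(4, 13):
    print(width, "disks:", f"{nominal_efficiency(width):.0%}")
```

The range runs from 25% at 4 disks to 75% at 12 disks.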
Yes but see my sibling comment.
When you expand your array, your existing data will not be stored any more efficiently.
To get the new parity/data ratios, you would have to force copies of the data and delete the old, inefficient versions, e.g. with something like this [1]
My personal take is that it's a much better idea to buy individual complete raid-z configurations and add new ones / replace old ones (disk by disk!) as you go.
I wish something like this would be built into ZFS, so snapshots and current access would not be broken.
True, but I have a gut feeling that a lot of these thorny issues would come up again:
Happy to see the ARC bypass for NVMe performance. ZFS really fails to exploit NVMe's potential. Online expansion might be interesting. I tried to use ZFS for some very busy databases and ended up getting bitten badly by the fragmentation bug. The only way to restore performance appears to be copying the data off the volume, nuking it and then copying it back. Now -perhaps- if I expand the zpool then I might be able to reduce fragmentation by copying the tablespace on the same volume.
Worth noting that TrueNAS already supports this[0] (I assume using 2.3.0rc3?). Not sure about the stability, but very exciting.
Note: This is online expansion. Expansion was always possible but you did need to take the array down to do it. You could also move to bigger drives but you also had to do that one at a time (and only gain the new capacity once all drives were upgraded of course)
As far as I know shrinking a pool is still not possible though. So if you have a pool with 5 drives and add a 6th, you can't go back to 5 drives even if there is very little data in it.
Can someone describe why they would use ZFS (or similar) for home usage?
Good reasons for me:
Checksums: this is even more important in home usage as the hardware is usually of lower quality. Faulty controllers, crappy cables, hard disks stored in a higher than advised temperature... many reasons for bogus data to be saved, and zfs handles that well and automatically (if you have redundancy)
Snapshots: very useful to make backups and quickly go back to an older version of a file when mistakes are made
Ease of mind: compared to the alternatives, I find that zfs is easier to use and makes it harder to make a mistake that could bring data loss (e.g. remove by mistake the wrong drive when replacing a faulty one, pool becomes unusable, "oops!", put the disk back, pool goes back to work as nothing happened). Maybe it is different now with mdadm, but when I used it years ago I was always worried about making a destructive mistake.
> Snapshots: very useful to make backups and quickly go back to an older version of a file when mistakes are made
Piling on here: Sending snapshots to remote machines (or removable drives) is very easy. That makes snapshots viable as a backup mechanism (because they can exist off-site and offline).
To give an answer that nobody else has given, ZFS is great for storing Steam games. Set recordsize=1M and compression=zstd and you can often store about 33% more games in the same space.
A friend uses ZFS to store his Steam games on a couple of hard drives. He gave ZFS a SSD to use as L2ARC. ZFS automatically caches the games he likes to run on the SSD so that they load quickly. If he changes which games he likes to run, ZFS will automatically adapt to cache those on the SSD instead.
The compression and ARC will make games load much faster than they would on NTFS even without having a separate drive for the ARC.
As I understand, L2ARC doesn't work across reboots which unfortunately makes it almost useless for systems that get rebooted regularly, like desktops.
L2ARC has had persistence support for a few years now.
Wow thanks for pointing that out, apparently it's been around for four years, since the first 2.0 release, without me noticing.
I replicate my entire filesystem to a local NAS every 10 minutes using zrepl. This has already saved my bacon once when a WD_BLACK SN850 suddenly died on me [1]. It's also recovered code from some classic git blunders. It shouldn't be possible any more to lose data to user error or single device failure. We have the technology.
Several reasons, but major ones (for me) are reliability (checksums and self-healing) and portability (no other modern filesystem can be read and written on Linux, FreeBSD, Windows, and macOS).
Snapshots ("boot environments") are also supported by Btrfs (my Linux installations use that so I don't have to worry about having the 3rd party kernel module to read my rootfs). Performance isn't that great either and, assuming Linux, XFS is a better choice if that is your main concern.
It's relatively easy, and yet powerful. Before that I had MDADM + LVM + dm-crypt + ext4, which also worked but all the layers got me into a headache.
Automated snapshots are super easy and fast. Also easy to access if you deleted a file, you don't have to restore the whole snapshot, you can just cp from the hidden .zfs/ folder.
I run it on 6x 8TB disk for a couple of years now. I run it in a raidz2, which means up to 2 disk can die. Would I use it on a single disk on a Desktop? Probably not.
> Would I use it on a single disk on a Desktop? Probably not.
I do. Snapshots and replication and checksumming are awesome.
I have a home built NAS that uses ZFS for the storage array and the checksumming has been really quite useful in detecting and correcting bit rot. In the past I used MDADM and EXT over the top and that worked but it didn't defend against bit rot. I have considered BTRFS since it would get me the same checksumming without the rest of ZFS but it's not considered reliable for systems with parity yet (although I think it is likely more than reliable enough now).
I do occasionally use snapshots and the compression feature is handy on quite a lot of my data set but I don't use the user and group limitations or remote send and receive etc. ZFS does a lot more than I need but it also works really well and I wouldn't move away from a checksumming filesystem now.
Apart from just peace of mind from bitrot, I use it for the snapshotting capability which makes it super easy to do backups. You can snapshot and send the snapshots to other storage with e.g zfs-autobackup and it's trivial and you can't screw it up. If the snapshots exist on the other drive, you know you have a backup.
I use it on a NAS for:
- Confidence in my long-term storage of some data I care about, as zpool scrub protects against bit rot
- Cheap snapshots that provide both easy checkpoints for work saved to my network share, and resilience against ransomware attacks against my other computers' backups to my NAS
- Easy and efficient (zfs send) replication to external hard drives for storage pool backup
- Built-in and ergonomic encryption
And it's really pretty easy to use. I started with FreeNAS (now TrueNAS), but eventually switched to just running FreeBSD + ZFS + Samba on my file server because it's not that complicated.
I use it on my work laptop. Reasons:
- a single solution that covers the entire storage domain (I don't have to learn multiple layers, like logical volume manager vs. ext4 vs. physical partitions)
- cheap/free snapshots. I have been glad to have been able to revert individual files or entire file systems to an earlier state. E.g., create a snapshot before doing a major distro update.
- easy to configure/well documented
Like others have said, at this point I would need a good reason NOT to use ZFS on a system.
I used it on my home NAS (4x3TB drives, holding all of my family's backups, etc.) for the data security / checksumming features. IMO it's performant, robust and well-designed in ways that give me reassurance regarding data integrity and help prevent me shooting myself in the foot.
> describe why they would use ZFS (or similar) for home usage
Mostly because it's there, but also the snapshots have a `diff` feature that's occasionally useful.
I'm trying to find a reason not to use ZFS at home.
Requirement for enterprise quality disks, huge RAM (1 gig per TB), ECC, at least x5 disks of redundancy. None of these are things, but people will try to educate you anyway. So use it but keep it to yourself. :)
No need to keep it to yourself. As you've mentioned, all of these requirements are misinformation so you can ignore people who repeat them (or even better, tell them to stop spreading misinformation).
For those not in the know:
You don't need to use enterprise quality disks. There is nothing in the ZFS design that requires enterprise quality disks any more than any other file system. In fact, ZFS has saved my data through multiple consumer-grade HDD failures over the years thanks to raidz.
The 1 gig per TB figure is ONLY for when using the ZFS dedup feature, which is widely regarded as a bad idea except in VERY specific use cases. 99.9% of ZFS users should not and will not use dedup and therefore they do not need ridiculous piles of RAM.
There is nothing in the design of ZFS any more dangerous to run without ECC than any other filesystem. ECC is a good idea regardless of filesystem but it's certainly not a requirement.
And you don't need x5 disks of redundancy. It runs great and has benefits even on single-disk systems like laptops. Naturally, having parity drives is better in case a drive fails but on single disk systems you still benefit from the checksumming, snapshotting, boot environments, transparent compression, incremental zfs send/recv, and cross-platform native encryption.
One reason why it might be a good idea to use higher quality drives when using ZFS is because it seems like in some scenarios ZFS can result in more writes being done to the drive than when other file systems are used. This can be a problem for some QLC and TLC drives that have low endurance.
I'm in the process of setting up a server at home and was testing a few different file systems. I was doing a test where I had a program continuously synchronously writing just a single byte every second (like might happen for some programs that are writing logs fairly continuously). For most of my tests I was just using the default settings for each file system. When using ext4 this resulted in 28 KB/s of actual writes being done to the drive which seems reasonable due to 4 KB blocks needing to be written, journaling, writing metadata, etc... BTRFS generated 68 KB/s of actual writes which still isn't too bad. When using ZFS about the best I could get it to do after trying various settings for volblocksize, ashift, logbias, atime, and compression settings still resulted in 312 KB/s of actual writes being done to the drive which I was not pleased with. At the rate ZFS was writing data, over a 10 year span that same program running continuously would result in about 100 TB of writes being done to the drive which is about a quarter of what my SSD is rated for.
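The 10-year figure can be sanity-checked with a few lines of arithmetic. This sketch assumes the quoted rates are in decimal kilobytes per second (1 KB = 1000 bytes) and that the write rate stays constant; `total_written_tb` is a name made up here.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def total_written_tb(kb_per_second, years):
    """Total bytes written at a constant rate, in decimal terabytes."""
    return kb_per_second * 1000 * SECONDS_PER_YEAR * years / 1e12

for name, rate in (("ext4", 28), ("BTRFS", 68), ("ZFS", 312)):
    print(name, round(total_written_tb(rate, 10), 1), "TB over 10 years")
```

At 312 KB/s this comes out a little under 100 TB, in line with the "about 100 TB" estimate; the ext4 and BTRFS rates land near 9 TB and 21 TB respectively.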
One knob you could change that should radically alter that is zfs_txg_timeout which is how many seconds ZFS will accumulate writes before flushing them out to disk. The default is 5 seconds, but I usually increase mine to 20. When writing a lot of data, it'll get flushed to disk more often, so this timer is only for when you're writing small amounts of data like the test you just described.
> like might happen for some programs that are writing logs fairly continuously
On Linux, I think journald would be aggregating your logs from multiple services so at least you wouldn't be incurring that cost on a per-program basis. On FreeBSD with syslog we're doomed to separate log files.
> over a 10 year span that same program running continuously would result in about 100 TB of writes being done to the drive which is about a quarter of what my SSD is rated for
I sure hope I've upgraded SSDs by the year 2065.
>I sure hope I've upgraded SSDs by the year 2065.
My mind jumped at that too when I first read parent's comment. But presumably he's writing other files to disk too. Not just that one file. :)
> The 1 gig per TB figure is ONLY for when using the ZFS dedup feature, which is widely regarded as a bad idea except in VERY specific use cases. 99.9% of ZFS users should not and will not use dedup and therefore they do not need ridiculous piles of RAM.
You also really don't need 1GB of RAM per TB unless you have a very high write volume. YMMV but my experience is that it's closer to 1GB for 10TB.
The interesting part about the enterprise quality disk misinformation is how wrong it is. The core idea of ZFS was to detect issues when those drives or their drivers are faulty. And this happened more often with cheap non-enterprise disks at that time.
I use ZFS for boot and storage volumes on my main workstation, which is primarily that--a workstation, not a server or NAS. Some benefits:
- Excellent filesystem level backup facility. I can transfer snapshots to a spare drive, or send/receive to a remote (at present a spare computer, but rsync.net looks better every year I have to fix up the spare).
- Unlike other fs-level backup solutions, the flexibility of zvols means I can easily expand or shrink the scope of what's backed up.
- It's incredibly easy to test (and restore) backups. Pointing my to-be-backed-up volume, or my backup volume, to a previous backup snapshot is instant, and provides a complete view of the filesystem at that point in time. No "which files do you want to restore" hassles or any of that, and then I can re-point back to latest and keep stacking backups. Only Time Machine has even approached that level of simplicity in my experience, and I have tried a lot of backup tools. In general, backup tools/workflows that uphold "the test process is the restoration process, so we made the restoration process as easy and reversible as possible" are the best ones.
- Dedup occasionally comes in useful (if e.g. I'm messing around with copies of really large AI training datasets or many terabytes of media file organization work). It's RAM-expensive, yes, but what's often not mentioned is that you can turn it on and off for a volume--if you rewrite data. So if I'm looking ahead to a week of high-volume file wrangling, I can turn dedup on where I need it, start a snapshot-and-immediately-restore of my data (or if it's not that many files, just cp them back and forth), and by the next day or so it'll be ready. Turning it off when I'm done is even simpler. I imagine that the copy cost and unpredictable memory usage mean that this kind of "toggled" approach to dedup isn't that useful for folks driving servers with ZFS, but it's outstanding on a workstation.
- Using ZFSBootMenu outside of my OS means I can be extremely cavalier with my boot volume. Not sure if an experimental kernel upgrade is going to wreck my graphics driver? Take a snapshot and try it! Not sure if a curl | bash invocation from the internet is going to rm -rf /? Take a snapshot and try it! If my boot volume gets ruined, I can roll it back to a snapshot in the bootloader from outside of the OS. For extra paranoia I have a ZFSBootMenu EFI partition on a USB drive if I ever wreck the bootloader as well, but the odds are that if I ever break the system that bad the boot volume is damaged at the block level and can't restore local snapshots. In that case, I'd plug in the USB drive and restore a snapshot from the adjacent data volume, or my backup volume ... all without installing an OS or leaving the bootloader. The benefits of this to mental health are huge; I can tend towards a more "college me" approach to trying random shit from StackOverflow for tweaking my system without having to worry about "adult professional me" being concerned that I don't know what running some random garbage will do to my system. Being able to experiment first, and then learn what's really going on once I find what works, is very relieving and makes tinkering a much less fraught endeavor.
- Being able to per-dataset enable/disable ARC and ZIL means that I can selectively make some actions really fast. My Steam games, for example, are in a high-ARC-bias dataset that starts prewarming (with throttled IO) in the background on boot. Game load times are extremely fast--sometimes at better than single-ext4-SSD levels--and I'm storing all my game installs on spinning rust for $35 (4x 500GB + 2x 32GB cheap SSD for cache)!
It's great to hear that you're using ZFSBootMenu the way I envisioned it! There's such a sense of relief and freedom having snapshots of your whole OS taken every 15 minutes.
One thing that you might not be aware of is that you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab. Keep in mind though that you can only have one checkpoint at a time, they keep growing and growing, and a rollback is for EVERYTHING on the pool.
Oh, are you zdykstra? If so, thanks for creating an invaluable tool!
> you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab
Good to know! Snapshots meet most of my needs at present (since my boot volume is a single fast drive, snapshots ~~ checkpoints in this case), but I could see this coming in useful for future scenarios where I need to do complex or risky things with data volumes or SAN layout changes.
Been running it since rc2. It’s insane how long this took to finally ship.
Can someone provide details on this bit please? "Direct IO: Allows bypassing the ARC for reads/writes, improving performance in scenarios like NVMe devices where caching may hinder efficiency".
ARC is based in RAM, so how could it reduce performance when used with NVMe devices? They are fast, but they aren't RAM-fast ...
Because with an (ARC) cache you have to copy from the app to the cache and then DMA to disk. With direct IO you can DMA directly from the app's RAM to the disk.
Yes - interested in this too. Is this for both ARC and L2ARC, or just L2ARC?
Would love to use ZFS, but unfortunately Fedora just can't keep up with it...
Not sure if it helps you at all, but I have a simple Ruby script that I use to build kernels on Fedora with a specified ZFS version.
https://github.com/kaspergrubbe/fedora-kernel-compilation/bl...
It builds on top of the exploded fedora kernel tree, adds zfs and spits out a .rpm that you can install with rpm -ivh.
It doesn't play well with dkms because it tries to interfere, so I disable it on my system.
I could never get it working on rpm-ostree distros.
I've been running Fedora on top of the excellent ZFSBootMenu[1] for about a year. You need to pay attention to the kernel versions supported by OpenZFS and might have to wait for support for a couple of weeks. The setup works fine otherwise.
If you delay upgrading the kernel on occasion, it is more or less fine.
The annual reminder that if Oracle wanted to contribute positively to the Linux ecosystem, they would update the CDDL license ZFS uses to GPL compatible.
This is the annual reply that Oracle cannot change the OpenZFS license because OpenZFS contributors removed the “or any later version” part of the license from their contributions.
By the way, comments such as yours seem to assume that Oracle is somehow involved with OpenZFS. Oracle has no connection with OpenZFS outside of owning copyright on the original OpenSolaris sources and a few tiny commits their employees contributed before Oracle purchased Sun. Oracle has its own internal ZFS fork and they have zero interest in bringing it to Linux. They want people to either go on their cloud or buy this:
Is there a reason the OpenZFS contributors don't want to dual-license their code? I'm not too familiar with the CDDL but I'm not sure what advantage it brings to an open source project compared to something like GPL? Having to deal with DKMS is one of the reasons why I'm sticking with BTRFS for doing ZFS-like stuff.
The OpenZFS code is based on the original OpenSolaris code, and the license used is the CDDL because that is what OpenSolaris used. Dual licensing requires the current OpenSolaris copyright holder to agree. That is unlikely without writing a very big check. Further speculation is not a productive thing to do, but since I know a number of people assume that the OpenSolaris copyright holder is the only one preventing this, let me preemptively say that it is not so simple. Different groups have different preferred licenses. Some groups cannot stand certain licenses. Other groups might detest the idea of dual licensing in general since it causes community fragmentation whenever contributors decide to publish changes only under 1 of the 2 licenses.
The CDDL was designed to ensure that if Sun Microsystems were acquired by a company hostile to OSS, people could still use Sun’s open source software. In particular, the CDDL has an explicit software patent grant. Some consider that to have been invaluable in preempting lawsuits from a certain company that would rather have ZFS be closed source software.
Oracle changing the license would not make a huge difference to OpenZFS.
Oracle only owns the copyright to the original Sun Microsystems code. It doesn’t apply to all ZFS implementations (probably not OracleZFS, perhaps not IllumosZFS) but in the specific case of OpenZFS the majority of the code is no longer Sun code.
Don’t forget that SunZFS was open sourced in 2005 before Oracle bought Sun Microsystems in 2009. Oracle have created their own closed source version of ZFS but outside some Oracle shops nobody uses it (some people say Oracle stopped working on OracleZFS altogether some time ago).
Considering the forks (first from Sun to the various open source implementations and later the fork from open source into Oracle's closed source version) were such a long time ago, there is not that much original code left. A lot of storage tech, or even entire storage concepts, did not exist when Sun open sourced ZFS. Various ZFS implementations developed their own support for TRIM, or Sequential Resilvering, or Zstd compression, or Persistent L2ARC, or Native ZFS Encryption, or Fusion Pools, or Allocation Classes, or dRAID, or RAIDZ expansion long after 2005. That is why the majority of the code in OpenZFS 2 is from long after the fork from Sun code twenty years ago.
Modern OpenZFS contains new code contributions from Nexenta Systems, Delphix, Intel, iXsystems, Datto, Klara Systems and a whole bunch of other companies that have voluntarily offered their code when most of the non-Oracle ZFS implementations merged to become OpenZFS 2.0.
If you'd want to relicense OpenZFS you could get Oracle to agree for the bit under Sun copyright but for the majority of the code you'd have to get a dozen or so companies to agree to relicensing their contributions (probably not that hard) and many hundreds of individual contributors over two decades (a big task and probably not worth it).
The only thing Oracle wants to "contribute positively to" is Larry's next yacht.
Honestly the cddl being incompatible with the gpl is one of the weirder statements to come out of the fsf. It comes up every time the cddl is mentioned but no one really knows why they are incompatible, it is basically "the fsf says they are incompatible" and when really pressed, they dithered until 2016 then came up with some hand waving that the incompatibility is some minutia as to what scope each license applies to.
The whole thing smells of some FSF agenda to me. if you ship a cddl file in your gpl project it is still a gpl licensed project and the cddl file is still a cddl licensed file.
Marvelous!
After years in the making, ZFS raidz expansion is finally here.
Major features added in release:
> RAIDZ Expansion: Add new devices to an existing RAIDZ pool, increasing storage capacity without downtime.
More specifically:
> A new device (disk) can be attached to an existing RAIDZ vdev
The first 4 seem like really big deals.
The fifth is also, once you consider non-ascii names.
Could someone show a legit reason to use 1000-character filenames? Seems to me, when filenames are long like that, they are actually capturing several KEYS that can be easily searched via ls & re's. e.g.
2025-Jan-14-1258.93743_Experiment-2345_Gas-Flow-375.3_etc_etc.dat
But to me this stuff should be in metadata. It's just that we don't have great tools for grepping the metadata.
Heck, the original Macintosh FS had no subdirectories - they were faked by burying subdirectory names in the (flat filesystem) filename. The original Macintosh File System (MFS) did not support true hierarchical subdirectories. Instead, the illusion of subdirectories was created by embedding folder-like names into the filenames themselves.
This was done by using colons (:) as separators in filenames. A file named Folder:Subfolder:File would appear to belong to a subfolder within a folder. This was entirely a user interface convention managed by the Finder. Internally, MFS stored all files in a flat namespace, with no actual directory hierarchy in the filesystem structure.
So, there is 'utility' in "overloading the filename space". But...
> Could someone show a legit reason to use 1000-character filenames?
1023 byte names can mean fewer than 250 characters due to use of unicode and utf-8. Add to it unicode normalization which might "expand" some characters into two or more combining characters, deliberate use of combining characters, emoji, rare characters, and you might end up with many "characters" taking more than 4 bytes. A single "country flag" character will usually be 8 bytes, most emoji will be at least 4 bytes, skin tone modifiers will add 4 bytes, etc.
this ' ' takes 27 bytes in my terminal, '' takes 28, another combo I found is 35 bytes.
And that's on top of just getting a long title using, let's say, one of the CJK or other less common scripts - an early manuscript of a somewhat successful Japanese novel has a non-normalized filename of 119 bytes, and it's nowhere close to actually long titles, something that someone might reasonably have on disk. A random find on the internet easily points to a book title that takes over 300 bytes in non-normalized utf8.
P.S. proper title of "Robinson Crusoe" if used as filename takes at least 395 bytes...
hah. Apparently HN eradicated the carefully pasted complex unicode emojis.
The first was "man+woman kissing" with skin tone modifier, then there were a few flags
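Since the emoji got stripped, the byte counts can be reproduced with escape sequences instead. A minimal Python sketch, where the sample strings are my own picks for illustration and the 1023-byte limit is the figure quoted above:

```python
import unicodedata

NAME_LIMIT_BYTES = 1023  # limit quoted in the comment above


def utf8_len(text: str) -> int:
    """Length of the string once encoded as UTF-8, which is what the limit counts."""
    return len(text.encode("utf-8"))


samples = {
    # A flag is two regional indicator code points, 4 bytes each.
    "flag": "\U0001F1EF\U0001F1F5",
    "thumbs up": "\U0001F44D",
    # The skin tone modifier is a separate 4-byte code point.
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",
    # man, ZWJ, heart, variation selector, ZWJ, kiss mark, ZWJ, woman
    "kiss (ZWJ sequence)": "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F469",
    # Same visible character, composed vs decomposed.
    "e-acute NFC": unicodedata.normalize("NFC", "\u00E9"),
    "e-acute NFD": unicodedata.normalize("NFD", "\u00E9"),
}

for label, text in samples.items():
    size = utf8_len(text)
    print(f"{label}: {size} bytes, {NAME_LIMIT_BYTES // size} fit in one name")
```

The kiss sequence without skin tones comes out at 27 bytes, so only 37 of those "characters" fit in a 1023-byte name.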
So if I’m running a Proxmox on ZFS and NVMEs, will I be better off enabling Direct IO when 2.3 gets rolled out? What are the use cases for it?
I would guess for very high performance NVMe drives.
How well tested is this in combination with encryption?
Is the ZFS team handling encryption as a first class priority at all?
ZFS on Linux inherited a lot of fame from ZFS on Solaris, but everyone using it in production should study the issue tracker very well for a realistic impression of the situation.
Main issue with encryption is occasional attempts by certain (specific) Linux kernel developer to lock ZFS out of access to advanced instruction set extensions (far from the only weird idea of that specific developer).
The way ZFS encryption is layered, the features should be pretty much orthogonal to each other, but I'll admit that ZFS native encryption is somewhat lacking (though mainly in upper layer tooling in my experience rather than the actual on-disk encryption parts).
These are actually wrappers around CPU instructions, so what ZFS does is implement its own equivalents. This does not affect encryption (beyond the inconvenience that we did not have SIMD acceleration for a while on certain architectures).
>occasional attempts by certain (specific) Linux kernel developer
Can we please refer to them by the actual name?
Greg Kroah-Hartman.
The new features should interact fine with encryption. They are implemented at different parts of ZFS' internal stack.
There have been many man hours put into investigating bug reports involving encryption and some fixes were made. Unfortunately, something appears to be going wrong when non-raw sends of encrypted datasets are received by another system:
https://github.com/openzfs/zfs/issues/12014
I do not believe anyone has figured out what is going wrong there. It has not been for lack of trying. Raw sends from encrypted datasets appear to be fine.
But I presume it is still not possible to remove a vdev.
That was added a while ago:
https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...
It works by making a readonly copy of the vdev being removed inside the remaining space. The existing vdev is then removed. Data can still be accessed from the copy, but new writes will go to an actual vdev, while space on the copy is gradually reclaimed as the old data is no longer needed.
Although "Top-level vdevs can only be removed if the primary pool storage does not contain a top-level raidz vdev, all top-level vdevs have the same sector size, and the keys for all encrypted datasets are loaded."
I forgot we still did not have that last bit implemented. However, it is less important now that we have expansion.
> However, it is less important now that we have expansion.
Not really sure if that's true. They seem like two different/distinct use cases, though there's probably some small overlap.
And in my case all the vdevs are raidz
Is this possible elsewhere (re: other filesystems)?
It is possible with windows storage space (remove drive from a pool) and mdadm/lvm (remove disk from a RAID array, remove volume from lvm), which to me are the two major alternatives. Don't know about unraid.
IIUC the ask (I have a hard time wrapping my head around zfs vernacular), btrfs allows this at least in some cases.
If you can convince btrfs balance to not use the dev you want to remove, it will simply rebalance data to the other devs and then you can run btrfs device remove.
> It is possible with windows storage space (remove drive from a pool) and mdadm/lvm (remove disk from a RAID array, remove volume from lvm), which to me are the two major alternatives. Don't know about unraid.
Perhaps I am misunderstanding you, but you can offline and remove drives from a ZFS pool.
Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure the drive topology?
So for instance I have a ZFS pool with 3 HDD data vdevs, and 2 SSD special vdevs. I want to convert the two SSD vdevs into a single one (or possibly remove one of them). From what I read the only way to do that is to destroy the entire pool and recreate it (it's in a server in a datacentre, don't want to reupload that much data).
In windows, you can set a disk for removal, and as long as the other disks have enough space and are compatible with the virtual disks (eg you need at least 5 disks if you have parity with number of columns=5), it will rebalance the blocks onto the other disks until you can safely remove the disk. If you use thin provisioning, you can also change your mind about the settings of a virtual disk, create a new one on the same pool, and move the data from one to the other.
Mdadm/lvm will do the same, albeit with more of a pain in the arse, as RAID has to resilver not just the occupied space but also the free space, so it takes a lot more time and IO than it should.
It's one of my beefs with ZFS: there are lots of no-return decisions. That, and I ran into some race conditions when loading a ZFS array on boot with NVMe drives on Ubuntu. They seem to not be ready, resulting in randomly degraded arrays. Fixed by loading the pool with a delay.
The man page says that your example is doable with zpool remove:
https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...
My understanding is that ZFS does virtual <-> physical translation in the vdev layer, i.e. all block references in ZFS contain a (vdev, vblock) tuple, and the vdev knows how to translate that virtual block offset into actual on-disk block offset(s).
This kinda implies that you can't actually remove data vdevs, because in practice you can't rewrite all references. You also can't do offline deduplication without rewriting references (i.e. actually touching the files in the filesystem). And that's why ZFS can't deduplicate snapshots after the fact.
On the other hand, reshaping a vdev is possible, because that "just" requires shuffling the vblock -> physical block associations inside the vdev.
There is a clever trick that is used to make top level removal work. The code will make the vdev readonly. Then it will copy its contents into free space on other vdevs (essentially, the contents will be stored behind the scenes in a file). Finally, it will redirect reads on that vdev into the stored vdev. This indirection allows you to remove the vdev. It is not implemented for raid-z at present though.
Though the vdev itself still exists after doing that? It just happens to be backed by, essentially, a "file" in the pool, instead of the original physical block devices, right?
Yes.
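For anyone who wants to see the indirection spelled out, here is a toy Python model of the idea described in this subthread. It is a sketch under heavy assumptions: `Vdev`, `IndirectVdev`, `Pool` and their methods are names invented for illustration, not OpenZFS structures, and real removal handles space accounting, concurrency and gradual cleanup that this ignores.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

BlockPtr = Tuple[int, int]  # (vdev id, virtual offset), as in the comment above


@dataclass
class Vdev:
    """Toy top-level vdev: virtual offset -> stored data."""
    blocks: Dict[int, bytes] = field(default_factory=dict)
    next_free: int = 0

    def allocate(self, data: bytes) -> int:
        offset = self.next_free
        self.blocks[offset] = data
        self.next_free += 1
        return offset


@dataclass
class IndirectVdev:
    """Stand-in left behind after removal: old offset -> new location."""
    mapping: Dict[int, BlockPtr]


class Pool:
    def __init__(self, vdev_count: int) -> None:
        self.vdevs: Dict[int, object] = {i: Vdev() for i in range(vdev_count)}

    def write(self, vdev_id: int, data: bytes) -> BlockPtr:
        return (vdev_id, self.vdevs[vdev_id].allocate(data))

    def read(self, ptr: BlockPtr) -> bytes:
        vdev_id, offset = ptr
        vdev = self.vdevs[vdev_id]
        if isinstance(vdev, IndirectVdev):
            # Old pointers are never rewritten; the read is redirected instead.
            return self.read(vdev.mapping[offset])
        return vdev.blocks[offset]

    def remove(self, vdev_id: int, target_id: int) -> None:
        """Copy the vdev's contents elsewhere and leave an indirection behind."""
        old, target = self.vdevs[vdev_id], self.vdevs[target_id]
        mapping = {off: (target_id, target.allocate(data))
                   for off, data in old.blocks.items()}
        self.vdevs[vdev_id] = IndirectVdev(mapping)


pool = Pool(vdev_count=2)
ptr = pool.write(0, b"hello")
pool.remove(0, target_id=1)
print(ptr, pool.read(ptr))  # the pointer still names vdev 0, the data lives on vdev 1
```

The point of the design is that every existing block pointer stays valid, which is why nothing in the tree has to be rewritten.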
> Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure of the drive topo?
mdadm can convert RAID-5 to a larger or smaller RAID-5, RAID-6 to a larger or smaller RAID-6, RAID-5 to RAID-6 or the other way around, RAID-0 to a degraded RAID-5, and many other fairly reasonable operations, while the array is online, resistant to power loss and the likes.
I wrote the first version of this md code in 2005 (against kernel 2.6.13), and Neil Brown rewrote and mainlined it at some point in 2006. ZFS is… a bit late to the party.
Doing this with the on disk data in a merkle tree is much harder than doing it on more conventional forms of storage.
By the way, what does MD do when there is corrupt data on disk that makes it impossible to know what the correct reconstruction is during a reshape operation? ZFS will know what file was damaged and proceed with the undamaged parts. ZFS might even be able to repair the damaged data from ditto blocks. I don’t know what the MD behavior is, but its options for handling this are likely far more limited.
Well, then they made a design choice in their RAID implementation that made fairly reasonable things hard.
I don't know what md does if the parity doesn't match up, no. (I've never ever had that happen, in more than 25 years of pretty heavy md use on various disks.)
I am not sure if reshaping is a reasonable thing. It is not so reasonable in other fields. In architecture, if you build a bridge and then want more lanes, you usually build a new bridge, rather than reshape the bridge. The idea of reshaping a bridge while cars are using it would sound insane there, yet that is what people want from storage stacks.
Reshaping traditional storage stacks does not consider all of the ways things can go wrong. Handling all of them well is hard, if not impossible to do in traditional RAID. There is a long history of hardware analogs to MD RAID killing parity arrays when they encounter silent corruption that makes it impossible to know what is supposed to be stored there. There is also the case where things are corrupted such that there is a valid reconstruction, but the reconstruction produces something wrong silently.
Reshaping certainly is easier to do with MD RAID, but the feature has the trade off that edge cases are not handled well. For most people, I imagine that risk is fine until it bites them. Then it is not fine anymore. ZFS made an effort to handle all of the edge cases so that they do not bite people and doing that took time.
> I am not sure if reshaping is a reasonable thing.
Yet people are celebrating when ZFS adds it. Was it all for nothing?
People wanted it, but it was very hard to do safely. While ZFS now can do it safely, many other storage solutions cannot.
Those corruption issues I mentioned, where the RAID controller has no idea what to do, affect far more than just reshaping. They affect traditional RAID arrays when disks die and when patrol scrubs are done. I have not tested MD RAID on edge cases lately, but the last time I did, I found MD RAID ignored corruption whenever possible. It would not detect corruption in normal operation because it assumed all data blocks are good unless SMART said otherwise. Thus, it would randomly serve bad data from corrupted mirror members and always serve bad data from RAID 5/6 members whenever the data blocks were corrupted. This was particularly tragic on RAID 6, where MD RAID is hypothetically able to detect and correct the corruption if it tried. Doing that would come with such a huge performance overhead that it is clear why it was not done.
Getting back to reshaping, while I did not explicitly test it, I would expect that unless a disk is missing or disappears during a reshape, MD RAID would ignore any corruption that can be detected using parity and assume all data blocks are good just like it does in normal operation. It does not make sense for MD RAID to look for corruption during a reshape operation, since not only would it be slower, but even if it finds corruption, it has no clue how to correct the corruption unless RAID 6 is used, there are no missing/failed members and the affected stripe does not have any read errors from SMART detecting a bad sector that would effectively make it as if there was a missing disk.
You could do your own tests. You should find that ZFS handles edge cases where the wrong thing is in a spot where something important should be gracefully while MD RAID does not. MD RAID is a reimplementation of a technology from the 1960s. If 1960s storage technology handled these edge cases well, Sun Microsystems would not have made ZFS to get away from older technologies.
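The difference between "parity says something is off" and "I know which block is bad" is easy to demonstrate. A minimal Python sketch, assuming a single-parity stripe of three data blocks and SHA-256 standing in for whatever checksum the filesystem uses; the helper names are invented for illustration and none of this is MD or ZFS code:

```python
import hashlib
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte (single-parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


# A healthy stripe, its parity, and checksums kept separately from the data.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
checksums = [checksum(block) for block in data]

# Silent corruption: one block changes on disk and no read error is reported.
on_disk = [data[0], b"BxBB", data[2]]

# Parity alone shows the stripe is inconsistent, but not which member is wrong
# (the parity block itself could equally be the bad one).
parity_mismatch = xor_blocks(on_disk) != parity

# Checksums pinpoint the bad block, so it can be rebuilt from the others plus parity.
bad = [i for i, block in enumerate(on_disk) if checksum(block) != checksums[i]]
survivors = [block for i, block in enumerate(on_disk) if i not in bad]
repaired = xor_blocks(survivors + [parity])

print(parity_mismatch, bad, repaired)
```

With only the parity check, the array has to guess which member to trust. With the checksum it knows, and single parity is then enough to rebuild the block.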
> While ZFS now can do it safely ...
It's the first release with the code, so "safely" might not be the right description until a few point releases happen. ;)
It was in development for 8 years. I think it is safe, but time will tell.
I’ve experienced bit rot on md. It was not fun, and the tooling was of approximately no help recovering.
Storage Spaces doesn't dedicate a drive to a single purpose. It operates in chunks (256MB, I think). So one drive can, at the same time, be part of a mirror and raid-5 and raid-0. This allows fully using drives of various sizes. And choosing to remove a drive will cause it to redistribute the chunks to other available drives, without going offline.
And as a user it seems to me to be the most elegant design. The quality of the implementation (parity write performance in particular) is another matter.
btrfs has supported online adding and removing of devices to the pool from the start
Bcachefs allows it
Cool, just have to wait before it is stable enough for daily use of mission critical data. I am personally optimistic about bcachefs, but incredibly pessimistic about changing filesystems.
It seems easier to copy data to a new ZFS pool if you need to remove RAID-Z top level vdevs. Another possibility is to just wait for someone to implement it in ZFS. ZFS already has top level vdev removal for other types of vdevs. Support for top level raid-z vdev removal just needs to be implemented on top of that.
Btrfs
Except you shouldn’t use btrfs for any parity based raid if you value your data at all. In fact, I’m not aware of any vendor that has implemented btrfs with parity based raid, they all resort to btrfs on md.