btrfs-progs: fix an old bug in lowmem mode #992

adam900710 · 2025-05-29T02:20:12Z

Inspired by Mark's recent enhanced super block dev item check against
chunk tree device item, it turns out there are several existing images
not passing the check.

One of the check is fsck/020/keyed_data_ref_with_reloc_leaf.img, which
may be caused by some older kernels (the bug is already fixed a long
time ago).

The first patch is to fix a lowmem mode check bug, that it doesn't
correctly account the keyed backref with shared backref.

This is exposed by the image from the next patch.

Then the last patch is to update the image with a proper note on how to
re-create it.
And the new image also acts as a regression test for the above lowmem
bug.

Please note this is not the last fsck image which has such problem,
fsck/057 is another one, and will be addressed in a dedicated patch.

[BUG] When a device replace failed, e.g. try to replace a device on a RO mounted btrfs, the error message is incorrectly broken into two lines: [adam@btrfs-vm ~]$ sudo btrfs replace start -fB 1 /dev/test/scratch3 /mnt/btrfs/ Performing full device TRIM /dev/mapper/test-scratch3 (10.00GiB) ... ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Read-only file system [adam@btrfs-vm ~]$ Note the newline after the "Read-only file system" error message. [CAUSE] Inside cmd_replace_start(), if the ioctl failed we need to handle the error messages different depeneding on start_args.result. If the result is not BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT we will append extra info to the error message. But the initial error message is using error(), which implies a newline. This results the above incorrectly splitted error message. [FIX] Instead of manually appending an extra reason to the existing error message, just do different output depending on the start_args.result in the first place. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: Anand Jain <[email protected]>

It's a long known problem that direct IO can lead to data checksum mismatches if the user space is also modifying its buffer during writeback. Although it's fixed in the recent v6.15 release (and backported to older kernels), we still need a user friendly way to fix those problems. This patch introduce the dryrun version of "btrfs rescue fix-data-checksum", which reports the logical bytenr and corrupted mirrors. Signed-off-by: Qu Wenruo <[email protected]>

@ioff

[BUG] There is a seldomly utilized function, btrfs_find_item(), which has no document and is not behaving correctly. Inside backref.c, iterate_inode_refs() and btrfs_ref_to_path() both utilize this function, to find the parent inode. However btrfs_find_item() will never return 0 if @ioff is passed as 0 for such usage, result early failure for all kinds of inode iteration functions. [CAUSE] Both functions pass 0 as the @ioff parameter initially, e.g: We have the following fs tree root: item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 3 transid 9 size 6 nbytes 16384 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 index 0 namelen 2 name: .. item 2 key (256 DIR_ITEM 2507850652) itemoff 16078 itemsize 33 location key (257 INODE_ITEM 0) type FILE transid 9 data_len 0 name_len 3 name: foo item 3 key (256 DIR_INDEX 2) itemoff 16045 itemsize 33 location key (257 INODE_ITEM 0) type FILE transid 9 data_len 0 name_len 3 name: foo item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 generation 9 transid 9 size 16384 nbytes 16384 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 4 flags 0x0(none) item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13 index 2 namelen 3 name: foo item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 generation 9 type 1 (regular) extent data disk byte 13631488 nr 16384 extent data offset 0 nr 16384 ram 16384 extent compression 0 (none) Then we call paths_from_inode() with: - @Inum = 257 - ipath = {.fs_root = 5} Then we got the following sequence: iterate_irefs(257, fs_root, inode_to_path) |- iterate_inode_refs() |- inode_ref_info() |- btrfs_find_item(257, 0, fs_root) | Returned 1, with @found_key updated to | (257, INODE_REF, 256). This makes iterate_irefs() exit immediately, but obviously that btrfs_find_item() call is to find any INODE_REF, not to find the exact match. [FIX] If btrfs_find_item() found an item matching the objectid and type, then it should return 0 other than 1. Fix it and keep the behavior the same across btrfs-progs and the kernel. Since we're here, also add some comments explaining the function. Signed-off-by: Qu Wenruo <[email protected]>

Previously "btrfs rescue fix-data-checksum" only show affected logical bytenr, which is not helpful to determine which files are affected. Enhance the output by also outputting the affected subvolumes (in numeric form), and the file paths inside that subvolume. The example looks like this: logical=13631488 corrtuped mirrors=1 affected files: (subvolume 5)/foo (subvolume 5)/dir/bar logical=13635584 corrtuped mirrors=1 affected files: (subvolume 5)/foo (subvolume 5)/dir/bar Although the end result is still not perfect, it's still much easier to find out which files are affected. Signed-off-by: Qu Wenruo <[email protected]>

This mode will ask user for how to fix each block. User input can match the first letter or the whole action name to specify given action, the input is verified case insensitive. If no user input is provided, the default action is to ignore the corrupted block. If the input matches no action, a warning is outputted and user must retry until a valid input is provided. Signed-off-by: Qu Wenruo <[email protected]>

This adds a new group of action in the interactive mode to fix a csum mismatch. The output looks like this: logical=13631488 corrtuped mirrors=1 affected files: (subvolume 5)/foo (subvolume 5)/dir/bar <<I>>gnore/<1>:1 Csum item for logical 13631488 updated using data from mirror 1 In the interactive mode, the update-csum-item behavior is outputted as all available mirror numbers. Considering all the existing (and future) action should starts with an alphabet, it's pretty easy to distinguish mirror number from other actions. The update-csum-item action itself is pretty straight-forward, just read out the data from specified mirror, then calculate a new checksum, and update the corresponding csum item in csum tree. Signed-off-by: Qu Wenruo <[email protected]>

This option allows "btrfs rescue fix-data-checksum" to use the specified mirror number to update checksum item for all corrupted mirrors. If the specified number is larger than the max number of mirrors, the real mirror number will be `num % (num_mirrors + 1)`. Signed-off-by: Qu Wenruo <[email protected]>

[BUG] For a btrfs which is under metadata balance and interrupted (powerloss etc), we can have the following extent backref item for a data extent: item 22 key (13631488 EXTENT_ITEM 524288) itemoff 3161 itemsize 108 refs 180 gen 9 flags DATA (178 0xdfb591fbbf5f519) extent data backref root FS_TREE objectid 258 offset 0 count 77 (178 0xdfb591fa80d95ea) extent data backref root FS_TREE objectid 257 offset 0 count 1 (184 0x151e000) shared data backref parent 22142976 count 51 (184 0x512000) shared data backref parent 5316608 count 51 Then lowmem mode will cause the following false alert: [3/8] checking extents ERROR: extent[13631488, 524288] referencer count mismatch (root: 5, owner: 258, offset: 0) wanted: 77, have: 128 ERROR: errors found in extent allocation tree or chunk allocation [CAUSE] When shared and keyed backref items are found, we must follow the following rules to avoid incorrect refs count: - If the leaf belongs to the shared backref parent Then any found EXTENT_DATA inside the leaf will be contributed to the shared backref values. - Otherwise any found EXTENT_DATA can contributed to the keyed backref In above case, if our leaf is 5316608 or 22142976, then we should not contribute the number of found EXTENT_DATA to the keyed backref. Unfortunately the original fix d53d42f ("btrfs-progs: lowmem: fix false alerts of referencer count mismatch for blocks relocated") is not following the above strict rule, but relying on the flag of the leaf. However that RELOC flag is not a correct indicator, e.g in above case the leaf at 5316608 is not yet being relocated, thus no RELOC flag. [FIX] Instead of check the unreliable RELOC flag, follow the correct rule when checking the leaf. Before we start checking the content of a leaf for EXTENT_DATA items, make sure the leaf's bytenr is not in any shared backref item. If so skip to the next leaf. Fixes: d53d42f ("btrfs-progs: lowmem: fix false alerts of referencer count mismatch for blocks relocated") Signed-off-by: Qu Wenruo <[email protected]>

[CI FAILURE] With Mark's incoming fsck enhancement on super block device item, it exposed the following fs corruption on an existing image dump: restoring image keyed_data_ref_with_reloc_leaf.img ====== RUN CHECK /home/runner/work/btrfs-progs/btrfs-progs/btrfs check ./keyed_data_ref_with_reloc_leaf.img.restored [1/8] checking log skipped (none written) [2/8] checking root items [3/8] checking extents device 1's bytes_used was 55181312 in tree but 59375616 in superblock ERROR: errors found in extent allocation tree or chunk allocation [4/8] checking free space cache [5/8] checking fs roots [6/8] checking only csums items (without verifying data) [7/8] checking root refs [8/8] checking quota groups skipped (not enabled on this FS) Opening filesystem to check... Checking filesystem on ./keyed_data_ref_with_reloc_leaf.img.restored UUID: ca3568a3-d9d8-4901-b4dd-b180a437a07f found 1175552 bytes used, error(s) found total csum bytes: 512 total tree bytes: 651264 total fs tree bytes: 606208 total extent tree bytes: 16384 btree space waste bytes: 291631 file data blocks allocated: 67633152 referenced 1048576 [CAUSE] The image has the following device item in the super block: dev_item.uuid 5a1c9f96-b581-4fd3-a73a-5cfd789c3c84 dev_item.fsid ca3568a3-d9d8-4901-b4dd-b180a437a07f [match] dev_item.type 0 dev_item.total_bytes 67108864 dev_item.bytes_used 59375616 Meanwhile the tree dump shows the following device item in the chunk tree: item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 3897 itemsize 98 devid 1 total_bytes 67108864 bytes_used 55181312 io_align 4096 io_width 4096 sector_size 4096 type 0 The bytes_used value does not match and triggered the fsck failure. The root cause is that kernel has a bug that removed device extents will not be accounted during transaction commit, this bug will be fixed by Mark's patch: https://lore.kernel.org/linux-btrfs/[email protected]/ [FIX] Re-generate the image with the latest kernel so that the mismatch won't happen. Also for any future updates on this image, here is the needed steps to re-create the image. - Modify the kernel btrfs module to commit transaction more frequently The idea is to replace the btrfs_end_transaction_throttle() call with btrfs_commit_transaction() in the while() loop of relocate_block_group(). So that every time a tree block is relocated, the fs will be committed. - Use the following script to populate the fs with log-writes This requries CONFIG_DM_LOG_WRITES to be enabled. ``` dev_src="/dev/test/scratch1" dev_log="/dev/test/log" dev="/dev/mapper/log" mnt="/mnt/btrfs/" table="0 $(blockdev --getsz $dev_src) log-writes $dev_src $dev_log" fail() { echo "!!! FAILED !!!" exit 1 } umount $dev &> /dev/null umount $mnt &> /dev/null dmsetup create log --table "$table" || fail mkfs.btrfs -b 1G -n 4k -f -m single $dev mount $dev $mnt xfs_io -f -c "pwrite 0 512k" $mnt/src > /dev/null sync for ((i = 0; i < 128; i++)); do xfs_io -f -c "reflink $mnt/src $(($i * 4096)) $(($i * 4096)) 4K" $mnt/file1 > /dev/null || fail done for ((i = 0; i < 128; i++)); do xfs_io -f -c "pwrite 0 2k" $mnt/inline_$i > /dev/null || fail done sync dmsetup message log 0 mark init btrfs balance start -m $mnt umount $mnt dmsetup remove log ``` - Replay the log and run "btrfs dump-tree -t extent" for each fua Use the log-writes tool (https://github.com/josefbacik/log-writes.git) to replay the log. And there will be one with the following contents for the data extent item, with mixed keyed and shared backref. item 22 key (13631488 EXTENT_ITEM 524288) itemoff 3161 itemsize 108 refs 180 gen 9 flags DATA (178 0xdfb591fbbf5f519) extent data backref root FS_TREE objectid 258 offset 0 count 77 (178 0xdfb591fa80d95ea) extent data backref root FS_TREE objectid 257 offset 0 count 1 (184 0x151e000) shared data backref parent 22142976 count 51 (184 0x512000) shared data backref parent 5316608 count 51 - Take the image of that fs This image has one extra benefit, that it exposed a bug in the older lowmem code where shared backref number is not correctly skipped, thus this image will also act as a regression test. Signed-off-by: Qu Wenruo <[email protected]>

adam900710 added 7 commits May 14, 2025 09:47

adam900710 mentioned this pull request May 29, 2025

btrfs-progs: check that device byte values in superblock match those in chunk root #991

Open

adam900710 force-pushed the test_img_fixes branch 2 times, most recently from 19a411b to b4eefe1 Compare May 29, 2025 07:21

adam900710 changed the title ~~btrfs-progs: fsck-tests: fix an image which has incorrect super dev item~~ btrfs-progs: fix an old bug in lowmem mode May 29, 2025

adam900710 added 2 commits May 30, 2025 09:48

adam900710 force-pushed the test_img_fixes branch from b4eefe1 to dbcb2d8 Compare May 30, 2025 00:23

kdave force-pushed the devel branch 6 times, most recently from ef43ce6 to 3eff852 Compare May 30, 2025 14:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

btrfs-progs: fix an old bug in lowmem mode #992

btrfs-progs: fix an old bug in lowmem mode #992

Uh oh!

adam900710 commented May 29, 2025 •

edited

Loading

Uh oh!

Uh oh!

btrfs-progs: fix an old bug in lowmem mode #992

Are you sure you want to change the base?

btrfs-progs: fix an old bug in lowmem mode #992

Uh oh!

Conversation

adam900710 commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

adam900710 commented May 29, 2025 •

edited

Loading