Skip to content

btrfs-progs: fix an old bug in lowmem mode #992

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: devel
Choose a base branch
from

Conversation

adam900710
Copy link
Collaborator

@adam900710 adam900710 commented May 29, 2025

Inspired by Mark's recent enhanced super block dev item check against
chunk tree device item, it turns out there are several existing images
not passing the check.

One of the check is fsck/020/keyed_data_ref_with_reloc_leaf.img, which
may be caused by some older kernels (the bug is already fixed a long
time ago).

The first patch is to fix a lowmem mode check bug, that it doesn't
correctly account the keyed backref with shared backref.

This is exposed by the image from the next patch.

Then the last patch is to update the image with a proper note on how to
re-create it.
And the new image also acts as a regression test for the above lowmem
bug.

Please note this is not the last fsck image which has such problem,
fsck/057 is another one, and will be addressed in a dedicated patch.

[BUG]
When a device replace failed, e.g. try to replace a device on a RO
mounted btrfs, the error message is incorrectly broken into two lines:

 [adam@btrfs-vm ~]$ sudo btrfs replace start -fB 1 /dev/test/scratch3  /mnt/btrfs/
 Performing full device TRIM /dev/mapper/test-scratch3 (10.00GiB) ...
 ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Read-only file system

 [adam@btrfs-vm ~]$

Note the newline after the "Read-only file system" error message.

[CAUSE]
Inside cmd_replace_start(), if the ioctl failed we need to handle the
error messages different depeneding on start_args.result.

If the result is not BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT we will
append extra info to the error message.

But the initial error message is using error(), which implies a newline.

This results the above incorrectly splitted error message.

[FIX]
Instead of manually appending an extra reason to the existing error
message, just do different output depending on the start_args.result in
the first place.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
It's a long known problem that direct IO can lead to data checksum
mismatches if the user space is also modifying its buffer during
writeback.

Although it's fixed in the recent v6.15 release (and backported to older
kernels), we still need a user friendly way to fix those problems.

This patch introduce the dryrun version of "btrfs rescue
fix-data-checksum", which reports the logical bytenr and corrupted mirrors.

Signed-off-by: Qu Wenruo <[email protected]>
[BUG]
There is a seldomly utilized function, btrfs_find_item(), which has no
document and is not behaving correctly.

Inside backref.c, iterate_inode_refs() and btrfs_ref_to_path() both
utilize this function, to find the parent inode.

However btrfs_find_item() will never return 0 if @ioff is passed as 0
for such usage, result early failure for all kinds of inode iteration
functions.

[CAUSE]
Both functions pass 0 as the @ioff parameter initially, e.g:

 We have the following fs tree root:

  	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		generation 3 transid 9 size 6 nbytes 16384
		block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
		sequence 1 flags 0x0(none)
	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
		index 0 namelen 2 name: ..
	item 2 key (256 DIR_ITEM 2507850652) itemoff 16078 itemsize 33
		location key (257 INODE_ITEM 0) type FILE
		transid 9 data_len 0 name_len 3
		name: foo
	item 3 key (256 DIR_INDEX 2) itemoff 16045 itemsize 33
		location key (257 INODE_ITEM 0) type FILE
		transid 9 data_len 0 name_len 3
		name: foo
	item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
		generation 9 transid 9 size 16384 nbytes 16384
		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
		sequence 4 flags 0x0(none)
	item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
		index 2 namelen 3 name: foo
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		generation 9 type 1 (regular)
		extent data disk byte 13631488 nr 16384
		extent data offset 0 nr 16384 ram 16384
		extent compression 0 (none)

  Then we call paths_from_inode() with:
  - @Inum = 257
  - ipath = {.fs_root = 5}

  Then we got the following sequence:

  iterate_irefs(257, fs_root, inode_to_path)
  |- iterate_inode_refs()
     |- inode_ref_info()
        |- btrfs_find_item(257, 0, fs_root)
	|  Returned 1, with @found_key updated to
	|  (257, INODE_REF, 256).

  This makes iterate_irefs() exit immediately, but obviously that
  btrfs_find_item() call is to find any INODE_REF, not to find the
  exact match.

[FIX]
If btrfs_find_item() found an item matching the objectid and type, then
it should return 0 other than 1.

Fix it and keep the behavior the same across btrfs-progs and the kernel.

Since we're here, also add some comments explaining the function.

Signed-off-by: Qu Wenruo <[email protected]>
Previously "btrfs rescue fix-data-checksum" only show affected logical
bytenr, which is not helpful to determine which files are affected.

Enhance the output by also outputting the affected subvolumes (in
numeric form), and the file paths inside that subvolume.

The example looks like this:

  logical=13631488 corrtuped mirrors=1 affected files:
    (subvolume 5)/foo
    (subvolume 5)/dir/bar
  logical=13635584 corrtuped mirrors=1 affected files:
    (subvolume 5)/foo
    (subvolume 5)/dir/bar

Although the end result is still not perfect, it's still much easier to
find out which files are affected.

Signed-off-by: Qu Wenruo <[email protected]>
This mode will ask user for how to fix each block.

User input can match the first letter or the whole action name to
specify given action, the input is verified case insensitive.

If no user input is provided, the default action is to ignore the
corrupted block.

If the input matches no action, a warning is outputted and user must
retry until a valid input is provided.

Signed-off-by: Qu Wenruo <[email protected]>
This adds a new group of action in the interactive mode to fix a csum
mismatch.

The output looks like this:

  logical=13631488 corrtuped mirrors=1 affected files:
    (subvolume 5)/foo
    (subvolume 5)/dir/bar
  <<I>>gnore/<1>:1
  Csum item for logical 13631488 updated using data from mirror 1

In the interactive mode, the update-csum-item behavior is outputted as
all available mirror numbers.

Considering all the existing (and future) action should starts with an
alphabet, it's pretty easy to distinguish mirror number from other
actions.

The update-csum-item action itself is pretty straight-forward, just read
out the data from specified mirror, then calculate a new checksum, and
update the corresponding csum item in csum tree.

Signed-off-by: Qu Wenruo <[email protected]>
This option allows "btrfs rescue fix-data-checksum" to use the specified
mirror number to update checksum item for all corrupted mirrors.

If the specified number is larger than the max number of mirrors, the
real mirror number will be `num % (num_mirrors + 1)`.

Signed-off-by: Qu Wenruo <[email protected]>
@adam900710 adam900710 force-pushed the test_img_fixes branch 2 times, most recently from 19a411b to b4eefe1 Compare May 29, 2025 07:21
@adam900710 adam900710 changed the title btrfs-progs: fsck-tests: fix an image which has incorrect super dev item btrfs-progs: fix an old bug in lowmem mode May 29, 2025
[BUG]

For a btrfs which is under metadata balance and interrupted (powerloss
etc), we can have the following extent backref item for a data extent:

        item 22 key (13631488 EXTENT_ITEM 524288) itemoff 3161 itemsize 108
                refs 180 gen 9 flags DATA
                (178 0xdfb591fbbf5f519) extent data backref root FS_TREE objectid 258 offset 0 count 77
                (178 0xdfb591fa80d95ea) extent data backref root FS_TREE objectid 257 offset 0 count 1
                (184 0x151e000) shared data backref parent 22142976 count 51
                (184 0x512000) shared data backref parent 5316608 count 51

Then lowmem mode will cause the following false alert:

[3/8] checking extents
ERROR: extent[13631488, 524288] referencer count mismatch (root: 5, owner: 258, offset: 0) wanted: 77, have: 128
ERROR: errors found in extent allocation tree or chunk allocation

[CAUSE]
When shared and keyed backref items are found, we must follow the
following rules to avoid incorrect refs count:

- If the leaf belongs to the shared backref parent
  Then any found EXTENT_DATA inside the leaf will be contributed to the
  shared backref values.

- Otherwise any found EXTENT_DATA can contributed to the keyed backref

In above case, if our leaf is 5316608 or 22142976, then we should not
contribute the number of found EXTENT_DATA to the keyed backref.

Unfortunately the original fix d53d42f ("btrfs-progs: lowmem: fix
false alerts of referencer count mismatch for blocks relocated") is not
following the above strict rule, but relying on the flag of the leaf.

However that RELOC flag is not a correct indicator, e.g in above case
the leaf at 5316608 is not yet being relocated, thus no RELOC flag.

[FIX]
Instead of check the unreliable RELOC flag, follow the correct rule when
checking the leaf.

Before we start checking the content of a leaf for EXTENT_DATA items,
make sure the leaf's bytenr is not in any shared backref item.
If so skip to the next leaf.

Fixes: d53d42f ("btrfs-progs: lowmem: fix false alerts of referencer count mismatch for blocks relocated")
Signed-off-by: Qu Wenruo <[email protected]>
[CI FAILURE]
With Mark's incoming fsck enhancement on super block device item, it
exposed the following fs corruption on an existing image dump:

  restoring image keyed_data_ref_with_reloc_leaf.img
  ====== RUN CHECK /home/runner/work/btrfs-progs/btrfs-progs/btrfs check ./keyed_data_ref_with_reloc_leaf.img.restored
  [1/8] checking log skipped (none written)
  [2/8] checking root items
  [3/8] checking extents
  device 1's bytes_used was 55181312 in tree but 59375616 in superblock
  ERROR: errors found in extent allocation tree or chunk allocation
  [4/8] checking free space cache
  [5/8] checking fs roots
  [6/8] checking only csums items (without verifying data)
  [7/8] checking root refs
  [8/8] checking quota groups skipped (not enabled on this FS)
  Opening filesystem to check...
  Checking filesystem on ./keyed_data_ref_with_reloc_leaf.img.restored
  UUID: ca3568a3-d9d8-4901-b4dd-b180a437a07f
  found 1175552 bytes used, error(s) found
  total csum bytes: 512
  total tree bytes: 651264
  total fs tree bytes: 606208
  total extent tree bytes: 16384
  btree space waste bytes: 291631
  file data blocks allocated: 67633152
   referenced 1048576

[CAUSE]
The image has the following device item in the super block:

  dev_item.uuid		5a1c9f96-b581-4fd3-a73a-5cfd789c3c84
  dev_item.fsid		ca3568a3-d9d8-4901-b4dd-b180a437a07f [match]
  dev_item.type		0
  dev_item.total_bytes	67108864
  dev_item.bytes_used	59375616

Meanwhile the tree dump shows the following device item in the chunk
tree:

	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 3897 itemsize 98
		devid 1 total_bytes 67108864 bytes_used 55181312
		io_align 4096 io_width 4096 sector_size 4096 type 0

The bytes_used value does not match and triggered the fsck failure.

The root cause is that kernel has a bug that removed device extents will
not be accounted during transaction commit, this bug will be fixed by
Mark's patch: https://lore.kernel.org/linux-btrfs/[email protected]/

[FIX]
Re-generate the image with the latest kernel so that the mismatch won't
happen.

Also for any future updates on this image, here is the needed steps to
re-create the image.

- Modify the kernel btrfs module to commit transaction more frequently
  The idea is to replace the btrfs_end_transaction_throttle() call
  with btrfs_commit_transaction() in the while() loop of
  relocate_block_group().

  So that every time a tree block is relocated, the fs will be
  committed.

- Use the following script to populate the fs with log-writes
  This requries CONFIG_DM_LOG_WRITES to be enabled.

```
dev_src="/dev/test/scratch1"
dev_log="/dev/test/log"
dev="/dev/mapper/log"
mnt="/mnt/btrfs/"
table="0 $(blockdev --getsz $dev_src) log-writes $dev_src $dev_log"

fail()
{
	echo "!!! FAILED !!!"
	exit 1
}

umount $dev &> /dev/null
umount $mnt &> /dev/null

dmsetup create log --table "$table" || fail
mkfs.btrfs -b 1G -n 4k -f -m single $dev
mount $dev $mnt

xfs_io -f -c "pwrite 0 512k" $mnt/src > /dev/null
sync

for ((i = 0; i < 128; i++)); do
	xfs_io -f -c "reflink $mnt/src $(($i * 4096)) $(($i * 4096)) 4K" $mnt/file1 > /dev/null || fail
done
for ((i = 0; i < 128; i++)); do
	xfs_io -f -c "pwrite 0 2k" $mnt/inline_$i > /dev/null || fail
done
sync
dmsetup message log 0 mark init
btrfs balance start -m $mnt
umount $mnt
dmsetup remove log
```

- Replay the log and run "btrfs dump-tree -t extent" for each fua
  Use the log-writes tool (https://github.com/josefbacik/log-writes.git)
  to replay the log.
  And there will be one with the following contents for the data extent
  item, with mixed keyed and shared backref.

          item 22 key (13631488 EXTENT_ITEM 524288) itemoff 3161 itemsize 108
                refs 180 gen 9 flags DATA
                (178 0xdfb591fbbf5f519) extent data backref root FS_TREE objectid 258 offset 0 count 77
                (178 0xdfb591fa80d95ea) extent data backref root FS_TREE objectid 257 offset 0 count 1
                (184 0x151e000) shared data backref parent 22142976 count 51
                (184 0x512000) shared data backref parent 5316608 count 51

- Take the image of that fs

This image has one extra benefit, that it exposed a bug in the older
lowmem code where shared backref number is not correctly skipped, thus
this image will also act as a regression test.

Signed-off-by: Qu Wenruo <[email protected]>
@kdave kdave force-pushed the devel branch 6 times, most recently from ef43ce6 to 3eff852 Compare May 30, 2025 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant