Skip to content

Commit 10b3772

Browse files
adam900710gregkh
authored andcommitted
btrfs: do proper folio cleanup when cow_file_range() failed
[ Upstream commit 06f3642 ] [BUG] When testing with COW fixup marked as BUG_ON() (this is involved with the new pin_user_pages*() change, which should not result new out-of-band dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW fixup path. This BUG_ON() happens just after a failed btrfs_run_delalloc_range(): BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:1444! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : extent_writepage_io+0x2d4/0x308 [btrfs] lr : extent_writepage_io+0x2d4/0x308 [btrfs] Call trace: extent_writepage_io+0x2d4/0x308 [btrfs] extent_writepage+0x218/0x330 [btrfs] extent_write_cache_pages+0x1d4/0x4b0 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x180/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] That failure is mostly from cow_file_range(), where we can hit -ENOSPC. Although the -ENOSPC is already a bug related to our space reservation code, let's just focus on the error handling. For example, we have the following dirty range [0, 64K) of an inode, with 4K sector size and 4K page size: 0 16K 32K 48K 64K |///////////////////////////////////////| |#######################################| Where |///| means page are still dirty, and |###| means the extent io tree has EXTENT_DELALLOC flag. - Enter extent_writepage() for page 0 - Enter btrfs_run_delalloc_range() for range [0, 64K) - Enter cow_file_range() for range [0, 64K) - Function btrfs_reserve_extent() only reserved one 16K extent So we created extent map and ordered extent for range [0, 16K) 0 16K 32K 48K 64K |////////|//////////////////////////////| |<- OE ->|##############################| And range [0, 16K) has its delalloc flag cleared. But since we haven't yet submit any bio, involved 4 pages are still dirty. - Function btrfs_reserve_extent() returns with -ENOSPC Now we have to run error cleanup, which will clear all EXTENT_DELALLOC* flags and clear the dirty flags for the remaining ranges: 0 16K 32K 48K 64K |////////| | | | | Note that range [0, 16K) still has its pages dirty. - Some time later, writeback is triggered again for the range [0, 16K) since the page range still has dirty flags. - btrfs_run_delalloc_range() will do nothing because there is no EXTENT_DELALLOC flag. - extent_writepage_io() finds page 0 has no ordered flag Which falls into the COW fixup path, triggering the BUG_ON(). Unfortunately this error handling bug dates back to the introduction of btrfs. Thankfully with the abuse of COW fixup, at least it won't crash the kernel. [FIX] Instead of immediately unlocking the extent and folios, we keep the extent and folios locked until either erroring out or the whole delalloc range finished. When the whole delalloc range finished without error, we just unlock the whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared. And the involved folios will be properly submitted, with their dirty flags cleared during submission. For the error path, it will be a little more complex: - The range with ordered extent allocated (range (1)) We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining flags are cleaned up by btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered(). For folios we finish the IO (clear dirty, start writeback and immediately finish the writeback) and unlock the folios. - The range with reserved extent but no ordered extent (range(2)) - The range we never touched (range(3)) For both range (2) and range(3) the behavior is not changed. Now even if cow_file_range() failed halfway with some successfully reserved extents/ordered extents, we will keep all folios clean, so there will be no future writeback triggered on them. CC: [email protected] Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
1 parent 5d47918 commit 10b3772

File tree

1 file changed

+37
-40
lines changed

1 file changed

+37
-40
lines changed

fs/btrfs/inode.c

Lines changed: 37 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -1408,6 +1408,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
14081408

14091409
alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
14101410

1411+
/*
1412+
* We're not doing compressed IO, don't unlock the first page (which
1413+
* the caller expects to stay locked), don't clear any dirty bits and
1414+
* don't set any writeback bits.
1415+
*
1416+
* Do set the Ordered (Private2) bit so we know this page was properly
1417+
* setup for writepage.
1418+
*/
1419+
page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
1420+
page_ops |= PAGE_SET_ORDERED;
1421+
14111422
/*
14121423
* Relocation relies on the relocated extents to have exactly the same
14131424
* size as the original extents. Normally writeback for relocation data
@@ -1470,6 +1481,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
14701481
file_extent.offset = 0;
14711482
file_extent.compression = BTRFS_COMPRESS_NONE;
14721483

1484+
/*
1485+
* Locked range will be released either during error clean up or
1486+
* after the whole range is finished.
1487+
*/
14731488
lock_extent(&inode->io_tree, start, start + ram_size - 1,
14741489
&cached);
14751490

@@ -1515,27 +1530,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
15151530

15161531
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
15171532

1518-
/*
1519-
* We're not doing compressed IO, don't unlock the first page
1520-
* (which the caller expects to stay locked), don't clear any
1521-
* dirty bits and don't set any writeback bits
1522-
*
1523-
* Do set the Ordered (Private2) bit so we know this page was
1524-
* properly setup for writepage.
1525-
*/
1526-
page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
1527-
page_ops |= PAGE_SET_ORDERED;
1528-
1529-
extent_clear_unlock_delalloc(inode, start, start + ram_size - 1,
1530-
locked_folio, &cached,
1531-
EXTENT_LOCKED | EXTENT_DELALLOC,
1532-
page_ops);
1533-
if (num_bytes < cur_alloc_size)
1533+
if (num_bytes < ram_size)
15341534
num_bytes = 0;
15351535
else
1536-
num_bytes -= cur_alloc_size;
1536+
num_bytes -= ram_size;
15371537
alloc_hint = ins.objectid + ins.offset;
1538-
start += cur_alloc_size;
1538+
start += ram_size;
15391539
extent_reserved = false;
15401540

15411541
/*
@@ -1546,6 +1546,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
15461546
if (ret)
15471547
goto out_unlock;
15481548
}
1549+
extent_clear_unlock_delalloc(inode, orig_start, end, locked_folio, &cached,
1550+
EXTENT_LOCKED | EXTENT_DELALLOC, page_ops);
15491551
done:
15501552
if (done_offset)
15511553
*done_offset = end;
@@ -1561,40 +1563,35 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
15611563
* Now, we have three regions to clean up:
15621564
*
15631565
* |-------(1)----|---(2)---|-------------(3)----------|
1564-
* `- orig_start `- start `- start + cur_alloc_size `- end
1566+
* `- orig_start `- start `- start + ram_size `- end
15651567
*
15661568
* We process each region below.
15671569
*/
15681570

1569-
clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
1570-
EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
1571-
page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
1572-
15731571
/*
15741572
* For the range (1). We have already instantiated the ordered extents
15751573
* for this region. They are cleaned up by
15761574
* btrfs_cleanup_ordered_extents() in e.g,
1577-
* btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
1578-
* already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
1579-
* EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
1580-
* function.
1575+
* btrfs_run_delalloc_range().
1576+
* EXTENT_DELALLOC_NEW | EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV
1577+
* are also handled by the cleanup function.
15811578
*
1582-
* However, in case of @keep_locked, we still need to unlock the pages
1583-
* (except @locked_folio) to ensure all the pages are unlocked.
1579+
* So here we only clear EXTENT_LOCKED and EXTENT_DELALLOC flag, and
1580+
* finish the writeback of the involved folios, which will be never submitted.
15841581
*/
1585-
if (keep_locked && orig_start < start) {
1582+
if (orig_start < start) {
1583+
clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC;
1584+
page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
1585+
15861586
if (!locked_folio)
15871587
mapping_set_error(inode->vfs_inode.i_mapping, ret);
15881588
extent_clear_unlock_delalloc(inode, orig_start, start - 1,
1589-
locked_folio, NULL, 0, page_ops);
1589+
locked_folio, NULL, clear_bits, page_ops);
15901590
}
15911591

1592-
/*
1593-
* At this point we're unlocked, we want to make sure we're only
1594-
* clearing these flags under the extent lock, so lock the rest of the
1595-
* range and clear everything up.
1596-
*/
1597-
lock_extent(&inode->io_tree, start, end, NULL);
1592+
clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
1593+
EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
1594+
page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
15981595

15991596
/*
16001597
* For the range (2). If we reserved an extent for our delalloc range
@@ -1608,11 +1605,11 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
16081605
*/
16091606
if (extent_reserved) {
16101607
extent_clear_unlock_delalloc(inode, start,
1611-
start + cur_alloc_size - 1,
1608+
start + ram_size - 1,
16121609
locked_folio, &cached, clear_bits,
16131610
page_ops);
1614-
btrfs_qgroup_free_data(inode, NULL, start, cur_alloc_size, NULL);
1615-
start += cur_alloc_size;
1611+
btrfs_qgroup_free_data(inode, NULL, start, ram_size, NULL);
1612+
start += ram_size;
16161613
}
16171614

16181615
/*

0 commit comments

Comments
 (0)