
zpool import on boot takes forever, causes OOM, some sort of memory deadlock #18075

@darkpixel

System information

Type                 | Version/Name
Distribution Name    | Debian Trixie
Distribution Version | 13
Kernel Version       | 6.12.45+deb13-amd64
Architecture         | amd64
OpenZFS Version      | 2.3.2-2

Describe the problem you're observing

This box has been running for ~5 years now. It's been upgraded over the years from Debian Bullseye on up through Trixie. There are six 12 TB HGST drives in a RAIDZ1, which gives around 60 TB of storage. It has 32 GB RAM and pretty much nothing running on it, just the non-GUI Debian install. It holds backups. A lot of backups. It spends its days automatically SSHing out to various client sites and doing zfs receives on ~10 datasets per client site.

We have a script that loops through each backed-up dataset, keeps the latest ~350 snapshots, and deletes anything older. This usually keeps us at ~2 TB of free space. Checking the backup server and running the script is part of my morning routine. That 2 TB of free space is more than enough in case I forget to check it for 2-3 days.
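The script itself isn't reproduced here. As a rough sketch of that kind of retention loop (not the actual script), assuming snapshots are pruned per dataset by creation order: the function names, the `KEEP` constant, and the dry-run default below are all hypothetical, and only the `zfs list` and `zfs destroy` invocations are real commands.

```python
import subprocess

KEEP = 350  # approximate retention count described above (assumed exact here)


def snapshots_to_destroy(snapshots, keep=KEEP):
    """Given snapshot names ordered oldest to newest, return the ones
    that fall outside the newest `keep` entries."""
    if keep <= 0:
        raise ValueError("keep must be positive")
    return snapshots[:-keep] if len(snapshots) > keep else []


def list_snapshots(dataset):
    """Return the snapshot names of `dataset` itself (no children), oldest first."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name",
         "-s", "creation", "-d", "1", dataset],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def prune(dataset, keep=KEEP, dry_run=True):
    """Destroy everything older than the newest `keep` snapshots of `dataset`."""
    for snap in snapshots_to_destroy(list_snapshots(dataset), keep):
        if dry_run:
            print("would destroy", snap)
        else:
            subprocess.run(["zfs", "destroy", snap], check=True)
```

Calling `prune("tank/some-client/data")` (a made-up dataset name) only prints what would be removed; passing `dry_run=False` actually destroys the snapshots.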

The box has been running Trixie for about a month now.

Recently it decided to do a scrub. I think (not certain) that this is the first scrub under Trixie.
The scrub got to ~90% complete over ~5 days...and then I did something stupid.

I got busy, distracted, and then sick for a few days...and I completely forgot to run the script.
The pool ran completely out of space.
When I checked it, the scrub was hung at ~90%, as were all the automatic backup jobs.

I ran my script from the console, and it slowly (very slowly) started freeing up space in the pool.
The scrub crawled back to life, and the backup jobs un-blocked and then failed because they had been hung for a while.

I watched the pool crawl up to ~980 GB free and the scrub climb to 90.89% via my SSH connection...and then I heard the "beep" of a server rebooting...and my SSH connection dropped.

I waited ~10 minutes and the box still hadn't come back up, so I connected a keyboard and monitor to it. It was sitting at the 'zpool import' step of the boot process, with messages that zpool import had been killed by the OOM killer.

I rebooted with the "init" boot arg set to /bin/bash and got to a command line.
zpool import showed my pool.
zpool import tank started churning away.
It ran for a while...maybe 15 minutes (with intermittent messages from the kernel about a hung task)...and then spit out OOM messages again.

I hit Ctrl+Alt+Del...and got a spew of messages about memory deadlocks.

I tried again. Same issue.

I can't seem to get the pool imported.

Describe how to reproduce the problem

zpool import tank
wait ~10 minutes
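Since the console scrollback is lost, one way to capture how memory behaves during those ~10 minutes is to log it while the import runs. This is only a sketch under assumptions: Python is available in the rescue shell, /proc is mounted, and the log path is writable (with init=/bin/bash the root filesystem may still be read-only). The function names and `/import-mem.log` are hypothetical, and MemAvailable is just one coarse indicator.

```python
import subprocess
import time


def mem_available_kib(meminfo_text):
    """Extract the MemAvailable value (in kB) from /proc/meminfo-style text.
    Returns None if the field is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def import_with_memory_log(pool, interval=5.0, log_path="/import-mem.log"):
    """Run `zpool import <pool>` and append a MemAvailable sample to
    log_path every `interval` seconds until the import exits."""
    proc = subprocess.Popen(["zpool", "import", pool])
    with open(log_path, "a") as log:
        while proc.poll() is None:
            with open("/proc/meminfo") as f:
                avail = mem_available_kib(f.read())
            log.write(f"{time.time():.0f} MemAvailable={avail} kB\n")
            log.flush()  # keep samples on disk even if the box dies mid-import
            time.sleep(interval)
    return proc.returncode
```

Running `import_with_memory_log("tank")` would leave a timestamped trail of available memory up to the point where the OOM killer fires.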

Include any warning/errors/backtraces from the system logs

I can't get my console buffer to scroll back, but I will attach a snap from my cell phone.

[Image: cell phone photo of the console output]
