-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian Trixie |
| Distribution Version | 13 |
| Kernel Version | 6.12.45+deb13-amd64 |
| Architecture | amd64 |
| OpenZFS Version | 2.3.2-2 |
Describe the problem you're observing
This box has been running for ~5 years now. It's been upgraded over the years from Debian Bullseye on up through Trixie. There are 6x HGST 12 TB drives in a RAIDZ1 which gives around 60 TB of storage. It has 32 GB RAM and pretty much nothing running on it. Just the non-GUI Debian install. It holds backups. A lot of backups. It spends its days automatically SSHing out to various client sites and doing zfs receives on ~10 datasets per client site.
We have a script that loops through each backed up dataset that keeps the latest ~350 snapshots and deletes anything older. This usually keeps us around ~2 TB of free space. It's in my morning procedures to check the backup server and run the script every morning. That 2 TB of free space is more than enough in case I forget to check it for 2-3 days.
The box has been running Trixie for about a month now.
Recently it decided to do a scrub. I think (not certain) that this is the first scrub under Trixie.
The scrub got to ~90% complete over ~5 days...and then I did something stupid.
I got busy, distracted, and then sick for a few days...and I completely forgot to run the script.
The pool ran completely out of space.
When I checked it the scrub was hanged at ~90%, as well as all the automatic backup jobs.
I ran my script from the console, and it slowly (very slowly) started freeing up space in the pool.
The scrub crawled back to life, and the backup jobs un-blocked and failed because they had been hanged for a while.
I watched the pool crawl up to ~980 GB free and the scrub climb to 90.89% via my SSH connection...and then I heard the "beep" of a server rebooting...and my SSH connection dropped.
I waited ~10 minutes and the box still hadn't come back up, so I connected a keyboard and monitor to it and saw it at the 'zpool import' step of the boot process and messages that zpool import was killed by OOM.
I rebooted and set the "init" boot arg to /bin/bash and got to a command line.
zpool import showed my pool.
zpool import tank started churning away.
It ran for a while...maybe 15 minutes (with intermittent messages from the kernel about a hanged task)...and then spit out OOM messages again.
I hit CTRL ALT DEL...and got a spew of messages about memory deadlocks.
I tried again. Same issue.
I can't seem to get the pool imported.
Describe how to reproduce the problem
zpool import tank
wait ~10 minutes
Include any warning/errors/backtraces from the system logs
I can't get my console buffer to scroll back, but I will attach a snap from my cell phone.
