nvme: Give reset a chance to undo failure
Needs Review, Public

Authored by imp on Mar 1 2024, 9:49 PM.

Details

Reviewers
mav
chuck
chs
Summary

There are times when we fail a drive because it stops responding, but
are nevertheless able to reset the controller and bring it back
online. While a reset won't always fix the problem, certain controllers
have been observed to stop responding so completely that we fail them,
only to recover later after a reset (sometimes with manual intervention
before the reset to send a vendor-specific FTL reset command). Allowing
a reset to be attempted in these cases lets us avoid a reboot.

Sponsored by: Netflix
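
The behaviour described in the summary can be modelled in a few lines of userspace C. This is only a sketch, not the driver change under review: `struct model_ctrlr`, `model_hw_reset()` and `model_ctrlr_reset()` are hypothetical names, and the `hw_responds` flag stands in for whatever the real hardware does when it is reset. The point it illustrates is that a controller already marked failed is still given a reset attempt, and the failed flag is cleared only if that attempt succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-controller state. */
struct model_ctrlr {
	bool is_failed;		/* the driver has given up on the device */
	bool hw_responds;	/* simulated: does the device answer a reset? */
};

/* Simulated hardware reset: succeeds only if the device responds. */
static bool
model_hw_reset(const struct model_ctrlr *c)
{
	return (c->hw_responds);
}

/*
 * Attempt a reset even when the controller is already marked failed.
 * Returns true if the controller is usable after the attempt.
 */
static bool
model_ctrlr_reset(struct model_ctrlr *c)
{
	assert(c != NULL);

	/* No early return on is_failed: the reset gets its chance. */
	if (model_hw_reset(c))
		c->is_failed = false;	/* the reset undid the failure */
	else
		c->is_failed = true;	/* still unresponsive */

	return (!c->is_failed);
}
```

In this model, a controller that starts with `is_failed` set and `hw_responds` true comes back from `model_ctrlr_reset()` with `is_failed` cleared, while one that stays unresponsive remains failed.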

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 56389
Build 53277: arc lint + arc unit

Event Timeline

On failure we've already notified consumers that the controller has failed. What will report that it is back? And is there even a device to send the request IOCTL to?

In D44180#1008994, @mav wrote:

On failure we've already notified consumers that the controller has failed. What will report that it is back? And is there even a device to send the request IOCTL to?

Yeah, I also hit this, and I'll need to rework more. A device is passed into the ioctl implicitly; otherwise we wouldn't have ctrlr...

mav@ is correct.

We need to do more here. If we have failed, we need to try the reset to see if that gets us out of the failed state. And if it does, we need to call the new-controller notification to build back up all the downstream consumers that we've torn down. I'll do that as a separate review, though, once I get it tested out. The drives that are going wonky take between 5 minutes and 2 weeks to trigger...
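
The follow-up described above (reset out of the failed state, then re-announce the controller so consumers rebuild what they tore down) might look roughly like the userspace model below. Everything here is an assumption made for illustration: the `sk_` names, the fixed-size consumer array, and the `hw_responds` flag are hypothetical and are not taken from the nvme driver or from the planned separate review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_CONSUMERS 4

/* Hypothetical downstream consumer, for example a disk layer. */
struct sk_consumer {
	bool attached;	/* true while the consumer holds state for the ctrlr */
};

/* Hypothetical controller together with its registered consumers. */
struct sk_ctrlr {
	bool is_failed;
	bool hw_responds;	/* simulated: does the device answer a reset? */
	struct sk_consumer *consumers[SK_MAX_CONSUMERS];
	size_t nconsumers;
};

/* Failure notification: every consumer tears down its state. */
static void
sk_notify_fail(struct sk_ctrlr *c)
{
	for (size_t i = 0; i < c->nconsumers; i++)
		c->consumers[i]->attached = false;
}

/* New-controller notification: every consumer rebuilds its state. */
static void
sk_notify_new(struct sk_ctrlr *c)
{
	for (size_t i = 0; i < c->nconsumers; i++)
		c->consumers[i]->attached = true;
}

/* Mark the controller failed and notify consumers exactly once. */
static void
sk_ctrlr_fail(struct sk_ctrlr *c)
{
	assert(c != NULL);
	if (c->is_failed)
		return;
	c->is_failed = true;
	sk_notify_fail(c);
}

/*
 * Try a reset. If the controller was failed and the reset brings it
 * back, send the new-controller notification so consumers reattach.
 */
static bool
sk_ctrlr_reset(struct sk_ctrlr *c)
{
	bool was_failed;

	assert(c != NULL);
	was_failed = c->is_failed;
	if (!c->hw_responds) {
		sk_ctrlr_fail(c);	/* no-op if already failed */
		return (false);
	}
	c->is_failed = false;
	if (was_failed)
		sk_notify_new(c);	/* answers "what will report it is back?" */
	return (true);
}
```

The design choice worth noting is that the new-controller notification is sent only on the failed-to-working transition, so a routine reset of a healthy controller does not make consumers attach twice.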