On 07/24/2017 07:35 PM, Dan Williams wrote:
On Mon, Jul 24, 2017 at 4:15 PM, Linda Knippers wrote:
> Hi Dan,
> I've got 4 NVDIMMs in an interleave set in a configuration that supports labels.
> I'm running a 4.12 kernel with the latest ndctl.
> I had three namespaces configured and all seemed well. When I configured the
> fourth one, I made a mistake in the name so I hit control-c. I wasn't sure what
> state I was in but according to what I could see with ndctl, it had created the
> namespace but not enabled it, so I enabled it manually with ndctl and that
> seemed ok.
> Then I tried to use ndctl create-namespace to change the name, which failed
> because the namespace was enabled so I disabled it and tried again. At some
> point, not really sure where, I got this kernel warning:
> # [ 5224.196085] nd namespace4.3: failed to track label: 4
> (details in the attached file)
> At this point I rebooted the system. When it came back up, nmem0 was disabled.
> I dumped the labels (also attached) and I see that nmem0 has some extra labels
> that correspond to the namespace that I was struggling with.
> I think my troubles started with the control-c. It doesn't look like ndctl traps
> signals when creating namespaces so perhaps we can get into an inconsistent state.
> It also seems like that kernel warning is a bit more important than a
> WARN_ONCE would imply. I think that was the beginning of the end of my
> configuration. It might have been better to just panic.
In general if the system is even remotely recoverable we don't panic.
In this case it is recoverable. The WARN_ONCE() is really there as a
loud, "this is a kernel bug, but we'll do our best to keep going".
Keeping going is ok unless you're risking data.
> I was trying to figure out if I could fix my configuration without
> losing the good namespaces but I don't see a way. The check-labels option
> isn't very helpful because I think it only looks at the info blocks,
> which are fine, even though the labels on nmem0 are not. The destroy-namespace
> option doesn't help because it only works with a good namespace.
> I'm going to wipe my nvdimms and start over. I suspect the problem is
> reproducible but it could depend on the timing of the control-c, unless
> the root cause was actually trying to rename a namespace. Maybe I'll try
> that again but not today.
The recovery method when the labels are corrupted is:
    ndctl disable-region all
    ndctl zero-labels all
    ndctl enable-region all
...and that should get you back to square one.
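A minimal sketch of that sequence as a script, assuming you want a JSON copy of the label areas saved before they are zeroed. The ndctl commands are the ones quoted in this thread; DRY_RUN, run, recover_labels and the default backup file name are hypothetical names made up for illustration. It defaults to a dry run that only prints the commands, since zero-labels removes every namespace label on every DIMM.

```shell
#!/bin/sh
# Sketch only: back up the label areas, then reset them with the three
# commands from the thread. DRY_RUN, run and recover_labels are hypothetical
# helper names, not part of ndctl.

# Default to a dry run so nothing destructive happens by accident.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_labels() {
    # $1: file that receives the JSON label dump (hypothetical default name)
    backup="${1:-labels-backup.json}"

    # Keep a copy of the current label areas for later inspection.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: ndctl read-labels -j all > $backup"
    else
        ndctl read-labels -j all > "$backup" || return 1
    fi

    run ndctl disable-region all || return 1
    run ndctl zero-labels all || return 1   # wipes ALL namespace labels
    run ndctl enable-region all
}
```

Calling recover_labels with the default DRY_RUN=1 prints the four steps; running the script with DRY_RUN=0 and then calling recover_labels executes them. The backup step is an addition of this sketch, not part of the recovery method itself.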
Right, but that blows away all my namespaces. I was hoping to find a way
to just fix up (delete) what appeared to be extraneous labels.
If you are able to reproduce I'd like to see the state of the DIMM
label areas. You can dump them in json format with the following:
    ndctl read-labels -j all
That was in one of the attachments.
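For a reproduction attempt, one way to capture the label state before and after is sketched below. It assumes only the read-labels command quoted above; NDCTL, dump_labels and the file naming are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch only: save a JSON dump of the label areas under a tag so that two
# dumps can be compared. NDCTL and dump_labels are hypothetical names.

# Allow the ndctl binary to be overridden (for example, for testing).
NDCTL="${NDCTL:-ndctl}"

dump_labels() {
    # $1: output directory, $2: tag such as "before" or "after"
    out="$1/labels-$2.json"
    "$NDCTL" read-labels -j all > "$out" || return 1
    # Print the path of the dump so the caller can capture it.
    echo "$out"
}

# Example use around a reproduction attempt:
#   before=$(dump_labels /tmp before)
#   ... repeat the create-namespace steps that led to the warning ...
#   after=$(dump_labels /tmp after)
#   diff "$before" "$after"
```

Comparing the two files with diff should show which labels were added on which DIMM, which is the information the attachments were meant to convey.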