Which driver failure would cause all disk access to cease without system panic?

@Mahmoud Al-Qudsi

NUC SKU: NUC6i3SYH

NVMe SSD: Intel SSD 600p Series (128GB, M.2 2280 80mm NVMe PCIe 3.0 x4, 3D1, TLC)

That NUC would have a sixth-generation Skylake CPU, and I think it’s running on Intel’s HM170 chipset. Assuming you and I are seeing the same issue, there isn’t a very clear overlap between our configurations.

I’ve had no problems with these configurations (uptimes of a month or more):
* Intel 750 SSD running on a fourth-generation (Haswell) Xeon E5-1650 v3, X99 chipset
* Samsung 950 Pro running on a sixth-generation (Skylake) Xeon E3-1545M v5, CM236 chipset

But I’ve been seeing this bug in the following configurations:
* Samsung 950 Pro running on a seventh-generation (Kaby Lake) Xeon E3-1505M v6, CM238 chipset
* Samsung 950 Pro running on a seventh-generation (Kaby Lake) Core i7-7700HQ, CM238 chipset

Is there some way to enable verbose logging for anything to do with storport?

I used perfmon to set up a data collector for Microsoft-Windows-StorPort, but it seems to stop logging at random, without rhyme or reason. All limits are disabled, and I’m logging to a USB stick (since logging to the local disk obviously won’t work when the disk itself is what hangs).
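For reference, here’s roughly the equivalent session started from an elevated prompt with logman (a sketch: the session name and the E: drive letter for the USB stick are my placeholders; the all-ones keyword mask and 0xff level just turn on everything the provider offers):

```
:: Start a circular StorPort trace, logging to the USB stick
logman create trace "storport" -ow -o E:\storport.etl ^
  -p "Microsoft-Windows-StorPort" 0xffffffffffffffff 0xff ^
  -mode Circular -f bincirc -max 4096 -ets

:: ...and to tear it down once the hang has been captured
logman stop "storport" -ets
```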

I even tried setting it up to restart automatically every ten minutes, but that just made things worse: after the first, manually initiated run, subsequent runs would create the ETL file but leave it blank at 1 KB (even though all limits were disabled). I think there may be a bug in Windows 10 where it applies limits that aren’t actually checked, so I’m trying again with the limits set to 24 hours and the job set to restart more frequently than that. We’ll see if that works.
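In case anyone else is fighting the same scheduler: an alternative would be to skip the restart schedule entirely and let logman roll to a new file on a timer. A sketch (session name and paths are again placeholders; note this variant is a regular data collector set rather than an -ets session, so it has to be started explicitly):

```
:: Roll to a fresh, timestamped ETL every ten minutes instead of
:: restarting the collector (-cnf = new file interval, -v = name suffix)
logman create trace "storport-rolling" -o E:\storport ^
  -p "Microsoft-Windows-StorPort" 0xffffffffffffffff 0xff ^
  -cnf 00:10:00 -v mmddhhmm
logman start "storport-rolling"
```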

Back to the matter at hand: I tried disabling PCIe link power management in the power options (given that NVMe is just a glorified PCIe protocol), and of course it seemed to be working for over 24 hours… but then, inevitably, probably less than two minutes after the storport data collector died (of course!), it happened again. Such is life.
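(For the record, the same setting can be flipped from an elevated prompt; a sketch using powercfg’s built-in aliases, where 0 means “Off”:)

```
:: Turn off PCIe Link State Power Management (ASPM) on AC and battery
powercfg /setacvalueindex SCHEME_CURRENT SUB_PCIEXPRESS ASPM 0
powercfg /setdcvalueindex SCHEME_CURRENT SUB_PCIEXPRESS ASPM 0
powercfg /setactive SCHEME_CURRENT
```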

The odd thing is that with PCIe link power management turned off, the system was much more stable. With it enabled, stressing disk access (such as restoring a Chrome session with 60 or so tabs all at once) triggered the bug far more often… but it might just be a coincidence.

I managed to get the data logger working, but the stupid thing refuses to write directly to a file and insists on using buffered I/O. How do MS driver devs get anything done when their data collection is at the mercy of pure luck?

Even with the buffer size set to 1 KB (the minimum), the flush timer set to 1 second (the minimum), and all storport/miniport logging enabled, I was unable to capture the last commands before the system froze. The ETL file did have some calls toward the end that were taking on the order of 7,000,000 ns (7 ms) to execute (SCSI command 0x35, SCSIOP_SYNCHRONIZE_CACHE), but those were followed by read/write commands back in the tens-to-hundreds of nanoseconds range, so I don’t know what to think.
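For completeness: those buffer-size and flush-timer knobs map to logman’s -bs (KB) and -ft (seconds) flags on the create command above, and the captured ETL can be decoded for offline inspection with tracerpt (paths are placeholders):

```
:: Dump the trace to CSV so per-command latencies can be eyeballed
tracerpt E:\storport.etl -o storport.csv -of CSV -y
```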

@Alex Grig

Sorry, my second failing machine actually has an HM175 chipset, which is an updated version of the HM170 (the chipset in your NUC).

So I cloned my disk over to a 256GB Toshiba THNSN5256GPUK (aka Toshiba XG4) on the 24th, and have been running without any hitches since then.

I still don’t think it’s any sort of hardware failure in the Samsung 950 Pro itself; the drive has also been running in another machine since then without failure. It seems to be some sort of incompatibility between the Samsung NVMe drive and the chipset.

The Samsung 960 Pro I ordered is coming in on Friday (D.V.); we’ll see what happens when I clone everything over to that drive!

If your system supports S3, see if the issue appears more often after an S3-to-S0 cycle.

No; the most common freeze occurs after a full restart, when my autorun entries load immediately after login and I reopen a Chrome session that tries to load several dozen tabs at once.
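(Aside: powercfg /a will list which sleep states a given machine actually supports, for anyone who wants to check whether S3 is even in play:)

```
:: List the sleep states available on this system
powercfg /a
```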

Looks like multiple commands issued to the drive at the same time are what mess it up.