Virtual PDU for the HLK Flush test

CaptainFlint · September 11, 2024, 5:51pm

I'm running HLK tests for a virtual storage device (qemu virtio), and the Flush test is the only one that cannot be run properly, because it requires a power distribution unit that would control the client machine's power. Our virtual machines are not controlled by a PDU. RedHat had obtained an errata ID from Microsoft to skip this test, and probably so should we, but I've got an idea: what if I set up a virtual software PDU for controlling the client machine? Maybe then we can make the test pass and thus eliminate the manual review stage (which over the years seems to be becoming more and more finicky).

The problem is, I can't understand the architecture of the expected hardware configuration. Here is the information page for the Flush test:

It shows a picture, where the PDU is connected via COM port to the machine, and it says that the test will use both the IP address and the port (I take it, they mean the COM port) to control the power.

There are several things about it that I can't get my head around.

1. The setup example picture says: "IP Address assigned to APC via COM port is the IP address asked in job parameters". But how can a network address be assigned via COM? And who is doing the assigning, then? If it's me (or network administrator), then what does COM port have to do with it? And if it's the test that will send some configuration commands via the COM port, then how would I even know which IP address it's going to assign?

2. How will the test know which COM port to use for communicating? When I start the test, I can only give the IP address, and the Outlet port number, but not the COM port.

3. For the sake of experiment, I added a single serial port to the client virtual machine (so that there simply was no choice), and set it to save all data into a file on the host machine. I tested that writing into COM1 results in the data appearing in that output file. Then I launched the Flush test, specified some random IP address belonging to the same network, and a random outlet number. The test started, and in the test output on the client machine I saw messages that the machine is expected to be rebooted. But there was not a single byte sent into the COM1 port, and not a single network packet sent towards the IP address (monitored that with Wireshark running both on the Studio and the Client machines). The test does not even attempt to use the PDU, and I have no idea what is wrong. I can't find any relevant messages in the test log either.

Could it be that the PDU is expected to announce itself, first, somehow? Send a broadcast network message, or some data via the COM port? And since I don't have any of this currently, the test does not receive any of that and decides not to communicate with the PDU at all?

Mark_Roddy · September 11, 2024, 7:51pm

Typically horrible HLK docs. Also typically PDUs have some sort of serial port, for legacy access, and a network connection. They appear to assume APC.

All you should need is the IP address and an endpoint that provides the SNMP object.

CaptainFlint · September 11, 2024, 10:10pm

That's the problem. I have configured a virtualpdu instance on an IP address, specified that IP address in the Flush test parameters when I launched it, and the virtualpdu SNMP server never received even a single request, and Wireshark never detected any request being sent. All the firewalls are turned off, of course.

That's what made me read the docs more carefully and notice that minor remark about the COM port also being used. But even after I added a COM port, again, I could not detect any attempt from the test to manipulate the machine power via either the network or the COM port. It's like the whole PDU functionality is non-existent. I can't understand what I need to do to get any kind of request from it.

CaptainFlint · January 6, 2025, 3:12am

I'm happy to report that the issue is successfully resolved. Unfortunately, I have no idea why all my previous experiments did not show any network communication; probably it was some really stupid configuration mistake on my part, but that old environment is long gone by now. The new attempts immediately revealed all the network packets being sent to the PDU. And the COM port is not required at all.

In case somebody else needs this, here is what it looks like.
When the Flush Test needs the machine to lose power, the Client machine under test sends an SNMP request. The target IP address is the one specified in the test run parameters (and the default UDP port 161 is used). The request carries the PDU command "immediate reboot", with the outlet number appended to the end of the OID. I find it a bit strange that they chose to use the Client machine for controlling its own power supply; the Studio machine would be a more solid choice, IMO. But since they are sending the reboot command and not a "shut down + power up" sequence, that works.
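
To make the request layout above more concrete, here is a minimal Python sketch that pulls the OID and the integer value out of the first variable binding of an SNMP v1/v2c packet. It is only an illustration: the function names are made up for this example, the sample packet is hand-built, and its OID (under a placeholder enterprise number 99999) and community string are assumptions, not the values the Flush test actually sends. Capture a real request with Wireshark to see those.

```python
def read_tlv(buf, pos):
    """Return (tag, value_bytes, next_pos) for the BER element starting at pos."""
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:  # long form: low 7 bits give the number of length bytes
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return tag, buf[pos:pos + length], pos + length


def decode_oid(data):
    """Decode a BER-encoded OID into a list of integers (first arc 0 or 1 assumed)."""
    parts = [data[0] // 40, data[0] % 40]
    value = 0
    for byte in data[1:]:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear marks the last byte of a sub-identifier
            parts.append(value)
            value = 0
    return parts


def parse_first_varbind(packet):
    """Return (pdu_tag, oid, int_value) for the first varbind of an SNMP message."""
    _, msg, _ = read_tlv(packet, 0)        # outer SEQUENCE
    _, _version, pos = read_tlv(msg, 0)    # INTEGER version
    _, _community, pos = read_tlv(msg, pos)  # OCTET STRING community
    pdu_tag, pdu, _ = read_tlv(msg, pos)   # 0xA3 is a SetRequest
    pos = 0
    for _ in range(3):                     # request-id, error-status, error-index
        _, _, pos = read_tlv(pdu, pos)
    _, varbinds, _ = read_tlv(pdu, pos)    # SEQUENCE of varbinds
    _, varbind, _ = read_tlv(varbinds, 0)  # first varbind
    _, oid_bytes, pos = read_tlv(varbind, 0)
    _, value, _ = read_tlv(varbind, pos)
    return pdu_tag, decode_oid(oid_bytes), int.from_bytes(value, "big", signed=True)


# Hand-built sample: SetRequest, community "private",
# placeholder OID 1.3.6.1.4.1.99999.1.7 (outlet 7 at the end), INTEGER value 3.
SAMPLE = bytes.fromhex(
    "302a" "020100" "040770726976617465"
    "a31c" "020101" "020100" "020100"
    "3011" "300f" "060a2b06010401868d1f0107" "020103"
)

pdu_tag, oid, value = parse_first_varbind(SAMPLE)
print(hex(pdu_tag), oid, "outlet:", oid[-1], "command:", value)
```

The last element of the decoded OID is the outlet number given in the test parameters; a software PDU that serves a single machine could simply ignore it.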

I tried two simple implementations:

1. A program listening on UDP port 161 runs directly on the Client machine; the test is started with IP=127.0.0.1. When the request is received, the program calls NtShutdownSystem(ShutdownReboot) to reboot the machine as fast as possible. Not exactly the same as a hard reset, but good enough.

2. A program runs on the Studio machine; the test's IP parameter is set to the Studio machine's IP (as visible from the Client). When the request is received, the program connects to the Client VM's QEMU telnet interface and sends the system_reset command to force-reboot the machine.
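
The Studio-side variant could be sketched roughly like this in Python. This is not the program used above, just a minimal illustration: the function names, the monitor address 192.0.2.10:4444, and the decision to treat any incoming datagram as a reboot request (without parsing it or sending an SNMP response) are all assumptions. It also assumes the QEMU monitor is exposed over TCP/telnet and accepts a plain command line.

```python
import socket


def send_monitor_command(host, port, command="system_reset"):
    """Send one command line to a QEMU monitor exposed over TCP/telnet."""
    with socket.create_connection((host, port), timeout=5) as mon:
        mon.sendall(command.encode("ascii") + b"\n")


def serve_once(udp_sock, monitor_addr):
    """Wait for one datagram on udp_sock, then force-reset the VM.

    Any datagram is treated as a reboot request; the payload is not
    inspected and no SNMP response is sent back (an assumption).
    """
    _data, peer = udp_sock.recvfrom(4096)
    send_monitor_command(*monitor_addr)
    return peer


def run_forever(listen_addr=("0.0.0.0", 161), monitor_addr=("192.0.2.10", 4444)):
    """Listen on the SNMP port and reset the VM on every request.

    Both addresses are placeholders; binding to port 161 usually needs
    administrator rights.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
        udp.bind(listen_addr)
        while True:
            peer = serve_once(udp, monitor_addr)
            print("reset requested by", peer)
```

Calling run_forever() on the Studio machine with the real monitor address starts the loop. If several Client VMs share one emulator, the outlet number at the end of the OID could be used to pick which monitor to contact.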

Both approaches worked fine, letting the Flush Test pass successfully. (There are some issues with running the test for Windows 11 and Windows Server 2022 targets, but they don't seem to be PDU-related.) Obviously, other solutions are possible, e.g. running the PDU emulator directly on the host, or killing/restarting the QEMU process instead of using telnet, but those are just minor details now.