What's the real story about KeQueryPerformanceCounter?

So, I’ve got a scenario where I’d like to use KeQueryPerformanceCounter, but the documentation looks to be pulling some scare tactics. What’s the real story here?

“Depending on the platform, KeQueryPerformanceCounter can disable system-wide interrupts for a minimal interval.”

How can I determine if my platform may be a victim? And how long is this minimal interval?

Here’s why I want to use KeQueryPerformanceCounter…
In hardware I have a decrementing counter that decrements at a variable frequency. The goal is to have it reach 0 exactly on the second. So, I can tune the frequency to make these adjustments. I have also provided myself with a free-running counter that increments at the current frequency. What I am going to do is periodically (say, with a 1-second timer) take a sample from this free-running counter and take a KeQueryPerformanceCounter sample. I’ll then find the difference in time (say, in microseconds) from the last sample set and determine how much my variable frequency needs to be adjusted to align with the system clock.

Is calling KeQueryPerformanceCounter periodically, on the order of seconds, going to degrade my I/O performance and/or the whole system as the documentation warns? Should I be approaching this problem in a different manner? In my Linux driver I can retrieve microsecond values from the current system time. It appears the best I can do with Windows is millisecond resolution, and even those values only get updated once every 10 ms. To me it seems KeQueryPerformanceCounter is my only choice.
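
To make it concrete, here is roughly the sampling step I have in mind. Just a sketch; ReadFreeRunningCounter() is a stand-in for my hardware register access, and the actual frequency tuning is elided:

#include <ntddk.h>

// Stand-in for my hardware access to the free-running counter.
ULONGLONG ReadFreeRunningCounter(VOID);

VOID SamplePll(VOID)
{
    static LARGE_INTEGER lastQpc;   // zero until the first call primes it
    static ULONGLONG lastHw;

    LARGE_INTEGER freq;
    LARGE_INTEGER qpc = KeQueryPerformanceCounter(&freq);
    ULONGLONG hw = ReadFreeRunningCounter();

    if (lastQpc.QuadPart != 0)
    {
        // Elapsed wall time in microseconds, per the performance counter.
        LONGLONG elapsedUs =
            (qpc.QuadPart - lastQpc.QuadPart) * 1000000 / freq.QuadPart;
        ULONGLONG hwTicks = hw - lastHw;

        // Compare hwTicks with the count expected for elapsedUs and
        // nudge the variable frequency accordingly (hardware-specific).
        DbgPrint("PLL: %I64d us -> %I64u hw ticks\n", elapsedUs, hwTicks);
    }

    lastQpc = qpc;
    lastHw = hw;
}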

xxxxx@hotmail.com wrote:

So, I’ve got a scenario where I’d like to use KeQueryPerformanceCounter, but the documentation looks to be pulling some scare tactics. What’s the real story here?

“Depending on the platform, KeQueryPerformanceCounter can disable system-wide interrupts for a minimal interval.”

How can I determine if my platform may be a victim? And how long is this minimal interval?

With a uniprocessor HAL, the system uses the motherboard countdown
timer. The countdown register is split across two I/O ports, so
interrupts must be disabled while reading it to stay in sync. It takes
a few microseconds (less than 5).

With a multiprocessor HAL, it uses the Pentium cycle counter.
Interrupts do not need to be disabled, and it returns quickly. In
exchange for that, the value is no longer monotonically increasing if
you happen to read consecutive values on two separate cores.
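
If the non-monotonicity matters to you, it is cheap to clamp the value
yourself. A minimal sketch, assuming a single caller; concurrent
callers would need an interlocked update:

#include <ntddk.h>

// Returns a performance counter value that never runs backwards, by
// clamping against the last value handed out.
LONGLONG ReadQpcMonotonic(VOID)
{
    static LONGLONG last;
    LONGLONG now = KeQueryPerformanceCounter(NULL).QuadPart;

    if (now < last)
    {
        now = last;    // this core read behind another core; clamp
    }
    last = now;
    return now;
}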

Here’s why I want to use KeQueryPerformanceCounter…
In hardware I have a decrementing counter that decrements at a variable frequency. The goal is to have it reach 0 exactly on the second.

“On the second” by whose reckoning? The clocks in your PC are not all
that accurate. It’s not uncommon to gain or lose five or more seconds
in a day. Are you syncing with some external time base? If not, then
what’s the point of your PLL?

So, I can tune the frequency to make these adjustments. I have also provided myself with a free-running counter that increments at the current frequency. What I am going to do is periodically (say, with a 1-second timer) take a sample from this free-running counter and take a KeQueryPerformanceCounter sample. I’ll then find the difference in time (say, in microseconds) from the last sample set and determine how much my variable frequency needs to be adjusted to align with the system clock.

It is surprisingly difficult in Windows to identify a single entity that
can be called “the system clock”.

Is calling KeQueryPerformanceCounter periodically, on the order of seconds, going to degrade my I/O performance and/or the whole system as the documentation warns?

No. If you call it tens of thousands of times a second, then it can be
a problem.

Should I be approaching this problem in a different manner? In my Linux driver I can retrieve microsecond values from the current system time. It appears the best I can do with Windows is millisecond resolution, and even those values only get updated once every 10 ms. To me it seems KeQueryPerformanceCounter is my only choice.

Yes, but remember that the system clock only updates during those 10ms
or 16ms scheduler intervals. If your goal is to align with the ticking
of the system clock, then there is no particular reason to be more
accurate than the system clock, is there?
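
Incidentally, you don’t have to guess at the tick length;
KeQueryTimeIncrement reports it in 100-nanosecond units. A quick
sketch:

#include <ntddk.h>

VOID ReportTickLength(VOID)
{
    // 100ns units added to the system time per clock interrupt;
    // commonly 100144 (about 10 ms) or 156250 (15.625 ms).
    ULONG increment = KeQueryTimeIncrement();

    DbgPrint("System clock tick = %lu.%04lu ms\n",
             increment / 10000, increment % 10000);
}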


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> “On the second” by whose reckoning? The clocks in your PC are not all
> that accurate. It’s not uncommon to gain or lose five or more seconds
> in a day. Are you syncing with some external time base? If not, then
> what’s the point of your PLL?

I am decoding an external high-precision time source and setting my Windows system time. When my incoming time source goes away, I need to start sourcing time out based on the Windows system time. I am aware that the time will drift; I just need to align my outgoing time with my Windows time.

> Yes, but remember that the system clock only updates during those 10ms
> or 16ms scheduler intervals. If your goal is to align with the ticking
> of the system clock, then there is no particular reason to be more
> accurate than the system clock, is there?

You are probably correct. I just wanted to make the adjustments to the PLL used to generate outgoing time as minor as possible, and I thought I could avoid large sinusoidal overshoots by using time references with near-microsecond resolution as opposed to millisecond resolution. Perhaps I should go back and review some of my control theory and just use the ticking of the system clock.

Thanks for the info concerning KeQueryPerformanceCounter. It seems every time I read the documentation, I cower away from using it.

Oh, and if you have any ideas of a better way to handle my scenario I’d love to hear them. Especially if it gets me out of control theory review; haha.

xxxxx@hotmail.com wrote:

> “On the second” by who’s reckoning? The clocks in your PC are not all
> that accurate. It’s not uncommon to gain or lose five or more seconds
> in a day. Are you syncing with some external time base? If not, then
> what’s the point of your PLL?
>
I am decoding an external high-precision time source and setting my Windows system time. When my incoming time source goes away, I need to start sourcing time out based on the Windows system time. I am aware that the time will drift; I just need to align my outgoing time with my Windows time.

This may be a red herring, but have you looked into NTP (Network Time
Protocol)? There are some open source NTP servers and clients for
Windows, and they do processing very similar to this. They sync to
precision time sources over the Internet, and try to serve the time
themselves in between contacts with their servers.

It might behoove you to look through the source. The server code is
rather complicated, because they take the job very, very seriously, but
it might provide some useful new avenues of thought for you.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim Roberts wrote:

With a uniprocessor HAL, the system uses the motherboard countdown
timer.
[…]
With a multiprocessor HAL, it uses the Pentium cycle counter.

I’m not sure this is accurate. I think it uses several timers, depending not only on uni- vs. multiprocessor HAL, but also on OS version, service pack, ACPI vs. non-ACPI HAL, chipset, etc.

Besides the timers you mentioned, it can also use the (so-called) ACPI PM timer, and on Vista and later, the HPET.

I do agree with Tim about the main issues though.

Jorge, you’re right. Tim’s description would have been right for NT4.

  • Jake

Jake Oshins wrote:

Jorge, you’re right. Tim’s description would have been right for NT4.

Speaking about KeQueryPerformanceCounter, would you mind commenting a bit on HPET?
Is it as disappointing as it seems (at least to me)?

Some time ago I was going to ask here why Vista doesn’t support HPET 64-bit counters. But then I learnt that there is HPET hardware out there that doesn’t support 64-bit timers. I further learnt that even when it does, and even in a 64-bit environment, it is still not guaranteed that the hardware won’t split a 64-bit access into two 32-bit ones.

That’s what I call disappointing. Seems like this (almost) brand new standard is already (almost) obsolete?

I’m also wondering when we are going to see support for aperiodic interrupts. (yeah, I know you are not going to comment on future Windows features).

About the only useful comment I can give you is that I don’t think it
matters whether an HPET implementation is 64- or 32-bit. In either
case, the software has to manage an offset, since the HPET will be
reset when the machine goes to sleep.

  • Jake

Jake Oshins wrote:

About the only useful comment I can give you is that I don’t think it
matters whether an HPET implementation is 64- or 32-bit. In either
case, the software has to manage an offset, since the HPET will be
reset when the machine goes to sleep.

This might be getting a bit OT, and I don’t want to abuse your time, so feel free to ignore me. But I’m not quite sure I get your point:

Would management of a “sleep offset” require synchronization logic on every KeQueryPerformanceCounter invocation? I would assume that this offset should be pretty static and updated only at wake-up time.

I admit I’m not familiar with the wake-up procedure. If every call needed to hold a spinlock against the wake-up code, then clearly the benefit of 64-bit counters is not that significant. OTOH, if the wake-up procedure can update the offset before any possible race, then adding the offset would be very cheap. Much cheaper than syncing against the timer interrupt and/or reading the hardware timer multiple times, as you need to when using 32-bit counters.

Additionally, there are many cases where you don’t care about any sleep offset. So just reading the hardware timer with a single 64-bit access would be enough. I realize this is not relevant for KeQueryPerformanceCounter though.

Again, feel free to ignore me, or to “educate” me if I’m talking nonsense. :-)

Well, it’s tempting to just say that you’re talking nonsense, as long
as I’ve been invited to do so…

Seriously though, 32-bit hardware doesn’t require a spinlock
acquisition any more than 64-bit hardware would. You read the timer
value, compare it to a recently read value to check for rollover,
possibly correct for rollover, add it to the offset and return the
result.

The only complication is storing the recently read value. The most
convenient way to do that is to update it when you take a scheduler
timer tick. It really doesn’t matter whether that update is atomic
either, as any recent value will do when checking for rollover, as
long as the timer tick period is much shorter than the rollover
period.
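
In pseudo-C, that read path looks something like this. A sketch of
the scheme just described, not the actual HAL source;
ReadHardwareTimer32 stands in for the device access:

// Recent64 holds a recently extended counter value (sans sleep
// offset), refreshed on every scheduler tick. SleepOffset is rebuilt
// at wake-up, before anyone can call ReadTimer64 again.
extern ULONGLONG Recent64;
extern ULONGLONG SleepOffset;

ULONG ReadHardwareTimer32(VOID);    // hypothetical device access

ULONGLONG ReadTimer64(VOID)
{
    ULONGLONG recent = Recent64;    // any recent value will do
    ULONG now = ReadHardwareTimer32();

    // Splice the fresh low 32 bits onto the recent upper 32 bits.
    ULONGLONG value = (recent & ~0xFFFFFFFFULL) | now;

    // If the low bits went backwards relative to the recent sample,
    // the hardware rolled over since that sample was taken.
    if (now < (ULONG)recent)
    {
        value += 0x100000000ULL;
    }
    return value + SleepOffset;
}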

Even when you look at other 32- to 64-bit algorithms, you can do most
of them without a spinlock. Some require an interlocked update, some
don’t.

My favorite is one that you can use when there’s only one writer and
many readers of a 64-bit value on a 32-bit machine. You keep two
high-dword values and one low-dword value in a struct that looks like
this:

struct {
    ULONG High1;
    ULONG High2;
    ULONG Low;
} value;

Then, the writer updates it like this:

value.High1 = highword;
value.Low = lowword;
// fence, if the architecture requires it
value.High2 = highword;

The readers read like this:

for (;;)
{
    highword = value.High1;
    lowword = value.Low;
    // fence, if the architecture requires it

    if (highword == value.High2)
    {
        break;    // no high-word update raced with our reads
    }
}
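
For what it’s worth, the reason this works: the writer publishes
High2 last, so if a reader sees High1 == High2, no high-word update
landed between its two reads, and the Low it grabbed belongs with
that high word. If a write intervened, the mismatch just sends the
reader around the loop once more. No locks, no interlocked
operations.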

  • Jake

Jake Oshins wrote:

Seriously though, 32-bit hardware doesn’t require a spinlock
acquisition any more than 64-bit hardware would.

I was thinking (well, I obviously wasn’t thinking) that some kind of sync logic was required. I was clearly wrong. Then the benefit of 64-bit counters, at least here, is not really significant. Even the overhead of the 64-to-32-bit split (which I wouldn’t call significant) is not needed at all in a 64-bit environment.

I still think it is something of a miss that the standard doesn’t cover a full 64-bit implementation. AFAIK Intel implementations are already 64-bit “single access” (not split into two 32-bit ones). All that would be required is an additional capability flag describing whether the hardware supports it. Obviously this would be possible on chipset implementations only; add-on cards can’t have enough knowledge about the hardware platform.

But considering that this feature would not benefit Windows, or probably any other standard PC OS for that matter, then I guess nobody cares.

It goes deeper than that. The HPET has to count at a constant rate
regardless of what is happening to the clock frequency on the
frontside bus, the PCIe bus or the processor itself. This pretty much
dictates that it’s built into the chipset and/or it has its own
crystal. (I’ve never seen a non-prototype plug-in implementation.)

And since it needs to continue counting reliably even in the face of
aggressive power management, you want it to have the fewest gates
possible, which pretty much leads to a 32-bit implementation.

Lastly, we’ve seen chipset implementers get clock logic wrong many,
many times. The common failure mode involves implementing the clock
as two or more clocks, with the fast-counting lower bits feeding
events into the slow-counting upper bits. If software can observe any
anomalies during this lower-bit-group rollover feeding into the upper
group(s) then the software has to be re-coded to deal with the
possibility of garbage, which makes functions like
KeQueryPerformanceCounter much heavier-weight, as they have to read
the hardware multiple times on every call.

So we (at Microsoft) want the simplest possible implementation of the
HPET which meets our requirements.

  • Jake Oshins

Jake Oshins wrote:

And since it needs to continue counting reliably even in the face of
aggressive power management, you want it to have the fewest gates
possible, which pretty much leads to a 32-bit implementation.

Lastly, we’ve seen chipset implementers get clock logic wrong many,
many times…
So we (at Microsoft) want the simplest possible implementation of the
HPET which meets our requirements.

I see your point, but I don’t think it’s relevant for the current standard and available implementations.

The standard already covers 64-bit timers; it actually recommends that the main counter be 64 bits. Furthermore, Intel implementations are 64-bit already.

So it is not about increasing the power consumption (they already have that many gates clocking all the time). It is not about making it more complicated (it already is). It is just about adding a capability flag. That would have zero cost, no complication whatsoever, and zero power consumption.

If you were advocating that the standard be changed to discourage 64-bit counters, then I might still not agree, but I would understand your point. Otherwise, in what way could a new optional capability flag possibly do any harm?

It seems that I misunderstood your issue. I’ll go back and look at
it. (I haven’t really thought hard about the HPET in a few years.)

- Jake

xxxxx@rahul.net wrote:

> It is not about making it more complicated (it already is).
> It is just about adding a capability flag. That would have zero cost,
> no complication whatsoever, and zero power consumption.

Please don’t talk about “zero cost, no complication” here:

If you add *any* flag that indicates a hardware difference in a machine,
you need to (a) add program logic to check it in your software, (b) make
two versions of everything - for implementations “with” and “without”
the hardware difference, and (c) test it (flag check and added code).

If this flag is “public”, most likely (d) someone else from a different
company will manage to screw up the design, which leads to (e) either
not being able to use the capability at all or (f) having to go through
costly/cumbersome detection logic to make sure you actually can use
the capability that is indicated. Plus testing it a lot.

And this is true for most hardware differences, and most operating
systems. IMHO this is also one of the reasons why companies that have a
very limited set of supported hardware for their OS (like Apple) can put
their focus on GUI design / usability / “ease of use”.

What MS Windows has to deal with is a daunting task in comparison -
literally millions of possible hardware combinations, backward
compatibility (in some cases back to Win32s), and thousands of people
from different backgrounds writing software and drivers for the OS.

Hagen Patzke wrote:

Please don’t talk about “zero cost, no complication” here:

I think you misunderstood. I wasn’t talking about the software side, but about the hardware side. That was in response to Jake, who would prefer simpler hardware to avoid possible bugs in the *hardware* (again, not in the software).

If you add *any* flag that indicates a hardware difference in a machine,
you need to (a) add program logic to check it in your software,

Of course you don’t *need* to. As long as the common ground is enough for you, you ignore the difference completely.

(b) make two versions of everything - for implementations “with” and “without”
the hardware difference,

Not at all. If you are happy with the “without” version (as Windows is), and as long as the “with” version is backwards compatible (it is), then you make a single version.

and (c) test it (flag check and added code).

Again, not at all.

And this is true for most hardware differences, and most operating
systems. IMHO this is also one of the reasons why companies that have a
very limited set of supported hardware for their OS (like Apple) can put
their focus on GUI design / usability / “ease of use”.

That might be true, but in this case the difference already exists, whether you like it or not.

xxxxx@rahul.net wrote:

> you need to (a) add program logic to check it in your software,

Of course you don’t *need* to. As long as the common ground is enough
for you, you ignore the difference completely.

If I get you correctly, HPET implementations already differ, and you
want to have a capability flag that signals that a specific
implementation is 64-bit.

Adding this capability flag means you want to exploit that 32/64bit
difference - and as soon as you start to exploit it, everything I wrote
is true.

OTOH, if you don’t want to make use of any HW difference, then you also
don’t need a flag for it. Or am I missing something? ;-)

Hagen Patzke wrote:

If I get you correctly, HPET implementations already differ, and you
want to have a capability flag that signals that a specific
implementation is 64-bit.

Not exactly. The standard already establishes how to signal if the counters are 64-bit or not. But the standard warns you that there is no guarantee that the platform won’t split a 64-bit access into two 32-bit ones, not even in a 64-bit OS/environment.

Consider a plug-in conventional PCI (32-bit) implementation. You might be using a 64-bit OS, but no matter what, the plug-in card would see two separate 32-bit accesses. So plug-in implementations don’t seem realistic. But conceivably a chipset implementation might still, internally, use a 32-bit architecture. I don’t know if there are actual implementations that support 64-bit counters over a 32-bit “path”. But the point is that the standard warns about this possibility, and there is no provision in the standard to specify a “guaranteed 64-bit single access” (at least on the chipset/HPET hardware side).

If there is the possibility that a 64-bit access might be split, then (at least) some of the benefits of using 64-bit counters are lost. The timer might wrap right between the two accesses, and you would then need some rollover detection.
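
For illustration, the usual defense when a 64-bit read might be split
is the read-high, read-low, re-read-high loop. A sketch against a
hypothetically mapped HPET block; per the spec, the main counter’s
low/high dwords sit at offsets 0xF0/0xF4:

#include <ntddk.h>

// HpetBase points at the memory-mapped HPET register block.
extern volatile ULONG *HpetBase;

#define HPET_MAIN_LOW   (0xF0 / sizeof(ULONG))   // main counter, low dword
#define HPET_MAIN_HIGH  (0xF4 / sizeof(ULONG))   // main counter, high dword

ULONGLONG ReadHpetMain64(VOID)
{
    ULONG high, low;

    do
    {
        high = HpetBase[HPET_MAIN_HIGH];
        low  = HpetBase[HPET_MAIN_LOW];
        // If the high dword changed while we read the low dword, the
        // counter wrapped between the two accesses; go around again.
    } while (high != HpetBase[HPET_MAIN_HIGH]);

    return ((ULONGLONG)high << 32) | low;
}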

Adding this capability flag means you want to exploit that 32/64bit
difference - and as soon as you start to exploit it, everything I wrote
is true.

I never said that *exploiting* the capability would have no cost. I said that adding the capability flag in the hardware would have zero cost, and that the impact on software that decides to ignore it would also be zero.

Whoever wants to exploit the capability will need to consider whether it is worthwhile. They might need to use two different code paths at run time (which might be worth it or not). They might decide to install different routines at setup time. Or in some specific cases they might simply require this capability to be present and refuse to run if it is not (sorry, your machine doesn’t meet the minimum requirements).

xxxxx@rahul.net wrote:

I never said that *exploiting* the capability would have no cost.

I said that adding the capability flag in the hardware would have
zero cost,

When I see them again, I’ll ask some chip designers about that. :-)

and that the impact on software that decides to ignore it
would also be zero.

…as you can see in the time required to boot some well-known OS. ;-)

What I wanted to point out is that there is no “zero cost” for any
feature. Sometimes not even for features you omit from the design. :-)

But I definitely agree with you on one point: it *is* necessary to have
some way to identify a chip/board/implementation. Wouldn’t this solve
your problem, too?

The idea of a “capability directory” (for more than your flag) is
appealing, but what if some implementations are extended?
Or the opposite: if an implementation is broken?
Then you have to go for some other ID anyway, to avoid problems.

Hagen Patzke wrote:

> I said that adding the capability flag in the hardware would have
> zero cost,

When I see them again, I’ll ask some chip designers about that. :-)

Please do, and you’ll see I am right. But before you do, make sure you understand exactly what “adding a capability flag” means here, because the chip designer won’t be able to provide an answer without more details.

So what are we talking about? Well, from the point of view of the hardware it is essentially the same as changing a single *read-only* bit in the revision ID in the header. How much does this increase the cost? One transistor, one gate in the worst case? Possibly; possibly not at all; and possibly this could actually reduce costs, depending on the exact design of the mux (which might benefit from a constant that has now changed from zero to one).

Or are you talking about the intrinsic cost of any change, even one that reduces manufacturing costs? Well, if you were going to produce a new chip mask *just for this purpose*, you would certainly incur a huge non-recurring charge. You obviously won’t introduce this change until a new chip version, or at the very least until a new errata version.

> and that the impact on software that decides to ignore it
> would also be zero.

…as you can see in the time required to boot some well-known OS. ;-)

Sorry, I don’t see how this could possibly affect boot time. From the point of view of software that doesn’t exploit the new capability, the change would be the same as defining a new, previously reserved bit in CPUID. You are already reading the register, and you are already ignoring and masking away the reserved bits.

What I wanted to point out is that there is no “zero cost” for any feature

This might be, by definition, true if you take “zero” in the strict mathematical sense. But nobody cares whether it is absolutely zero or just close enough to zero to be insignificant.

But I definitely agree with you on one point: it *is* necessary to have
some way to identify a chip/board/implementation. Wouldn’t this solve
your problem, too?

I sense that you may be misunderstanding what is needed in this specific case. Please read the HPET specs. It is not just a simple timer, and you don’t need to add any new register or a capability directory. It already has configuration and capability registers (in a similar, though not identical, way to a PCI function device). So all that is required is to change the meaning of a previously reserved bit.