Detours?

Prokash,

You have your reason, and I have mine. Before we go really off-topic, we
should probably go off-line if there is a need…

Indeed, it could be an interesting idea (at least for me) to discuss it in private - there is a good chance that you have some interesting ideas on the subject that I am just unaware of. If you don’t mind, could you please contact me at ‘xxxxx@hotmail.com’, and let’s discuss it…

Anton Bassov

It’s way more complicated than that. There’s a pipeline, right? By the time
an instruction is executed, it was fetched long before. Reasoning at run
time won’t do; you have to reason at fetch time. Using one-byte hooks is OK,
but not, for example, for a profiler such as TrueTime, because it falsifies
the timing. Also note that, ideally, the hooking engine should act outside
the bounds of the OS, so that when it hooks it knows that it’s not going to
be preempted. Once you go into the hooking business, you may be assuming
control and bumping the OS one level higher!

Alberto.

----- Original Message -----
From: “Matt Miller”
To: “Windows System Software Devs Interest List”
Sent: Monday, August 27, 2007 10:16 PM
Subject: Re: [ntdev] Detours?

> On Mon, Aug 27, 2007 at 10:02:00PM -0400, Alberto Moreira wrote:
>> Oh, wow, SoftICE reinvented? We used to call it “Capt’n Hook”, the hooking
>> engine. BoundsChecker, TrueTime and TrueCoverage used it too. By the way,
>> if you want to safely intercept code, multiprocessor proof, you can use the
>> CMPXCHG8B instruction.
>
> As Anton pointed out, the use of cmpxchg8b will not solve the problem of
> another thread executing within the sequence of instructions that are
> overwritten by the jump instruction. Alternative tricks would be needed
> to ensure that this does not happen. This is likely one of the reasons
> why Microsoft uses a two-byte no-op (mov edi, edi) rather than two
> one-byte no-op (nop / nop) instructions in their binaries that are
> compiled to support hotpatching.
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer

The two-byte nop sequence at the beginning of a function can be guaranteed
to fall on a 2-, 4-, 8-, or 16-byte boundary, depending upon the linker. Since
MS writes the linker, they can force it to put that nop on whichever
boundary they find useful. Since it is also the beginning of a function,
and I suspect not a function that can be ‘fallen into’, there is a lot more
safety involved in using a 16-bit locked move of some sort - maybe an
interlocked call or a lock prefix or a compare and exchange of the correct
size. Since it is at the beginning of the function, replacing those 16 bits
cannot leave a prefetch with half of each: it gets either all of the nop or
all of the new code.
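
To make the two-byte replacement concrete, here is a minimal sketch in C. It assumes the commonly described hot-patch layout, which this thread does not spell out: five filler bytes directly in front of the function, and `mov edi, edi` (bytes 8B FF) as its first instruction. The buffer only simulates code memory, and `install_hotpatch`, `PAD` and the addresses are illustrative names, not a Microsoft API. On live code the last write would have to be one aligned 16-bit store rather than two byte stores.

```c
#include <assert.h>
#include <stdint.h>

enum { PAD = 5 };  /* assumed: 5 filler bytes sit directly before the entry */

/* Simulated hot patch. `code` points at the padding; the function entry is
 * at code[PAD]. `entry_addr` is the address the entry would have in memory
 * and `hook_addr` the address of the replacement routine.
 * Returns 0 on success, -1 if the entry is not a two-byte mov edi, edi. */
int install_hotpatch(uint8_t *code, uint32_t entry_addr, uint32_t hook_addr)
{
    uint8_t *entry = code + PAD;
    if (entry[0] != 0x8B || entry[1] != 0xFF)
        return -1;

    /* Step 1: build "jmp rel32" (E9 + 4 bytes) in the padding. Nothing
     * executes these bytes yet. The jump ends exactly at entry_addr, so the
     * displacement is hook_addr - entry_addr (wraps correctly if negative). */
    uint32_t rel32 = hook_addr - entry_addr;
    code[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        code[1 + i] = (uint8_t)(rel32 >> (8 * i));   /* little-endian bytes */

    /* Step 2: replace the no-op with "jmp short -7" (EB F9). The short jump
     * ends at entry+2, and entry+2-7 is entry-5, the start of the padding.
     * In this simulation it is two byte stores; see the note above. */
    entry[0] = 0xEB;
    entry[1] = 0xF9;
    return 0;
}
```

A caller would pass a buffer that starts at the padding, for example `install_hotpatch(buf, 0x1000, 0x2000)`. A processor that fetches the entry sees either the complete no-op or the complete short jump only if the real store is a single aligned 16-bit write, which is the property argued for above.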

Run windbg on an 8-CPU target and do a manual break. It takes a while before
the target gets all the CPUs marshaled into limbo land and returns to the
debugger on the host. This makes some of the tricks, such as using affinity
and busy loops for all the CPUs except the patching one, far more obvious
and user-unfriendly.

Good idea about becoming a hypervisor and doing the patching.


David J. Craig
Engineer, Sr. Staff Software Systems
Broadcom Corporation

> Subject: Detours?
> From: Mark McDougall
> Date: Mon, 27 Aug 2007 15:16:43 +1000
> X-Message-Number: 1
>
> Hi,
> Can anyone tell me if there’s a way to use Detours with kernel-mode
> drivers?

If you want to use Detours in kernel mode, you have to take into account
that Microsoft PatchGuard blocks detoured kernel code on x64 versions of
Windows, including Vista x64…

Regards,
MG.

You know, we looked at that 2-byte edi magic often enough when we were
writing DriverStudio code, and we were baffled by it, because we didn’t see
any reason for it. Now I understand the reasoning behind it, but I’m still
not sure it’s needed.

I haven’t looked at it in much detail recently, it’s been over two years now,
but I am going to bet that in most x86-compatible architectures a jump or a
subroutine call fetches at least one cache line to start running. Chances
are that the whole preamble will be fetched in one go, especially if one
takes care to align one’s functions on cache line boundaries (actually, I
bet an 8-byte alignment will do, since a jump takes only 5 bytes on a 32-bit
machine) to speed up the fetching. The key to enlightenment, I believe, is to
think about things as they happen at fetch time.
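
The alignment part of that argument is easy to check mechanically. The sketch below is a hypothetical helper (`crosses_boundary` is not from any library) that tests whether a patch of a given length crosses an 8-byte or cache-line boundary. It only verifies the geometric precondition; whether staying inside one fetch block is sufficient on a particular processor is a separate question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the byte range [addr, addr + len) crosses a boundary of `block`
 * bytes. `block` must be a power of two (8 for a quadword, 64 for a typical
 * cache line) and `len` must be non-zero. */
bool crosses_boundary(uintptr_t addr, size_t len, size_t block)
{
    assert(len > 0 && block > 0 && (block & (block - 1)) == 0);
    /* Compare the block index of the first and the last byte of the range. */
    return (addr / block) != ((addr + len - 1) / block);
}
```

For a 5-byte jump, `crosses_boundary(entry, 5, 8)` is false whenever `entry` is 8-byte aligned, and a range that stays inside one aligned quadword also stays inside one cache line, because cache-line sizes are multiples of 8.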

Furthermore, safely planting a jump in a sequence is rather easy: for
example, one can stuff an int 3 in the first byte of the function, have the
ISR handle the problematic issues, give it a little time to stabilize, and
then stuff the jump.
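
The order of the stores in that scheme can be sketched as three phases over a simulated code buffer. This is only an illustration under assumptions: the int 3 handler that deals with a thread hitting the breakpoint, and the wait between phases, are not shown, and `apply_phase` and the phase names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum patch_phase { ARM_INT3, WRITE_TAIL, WRITE_OPCODE };

/* Perform one phase of planting "jmp target_addr" (E9 rel32, 5 bytes) at
 * `site`, which would live at address `site_addr`. The caller is expected
 * to let the other processors settle between phases. */
void apply_phase(uint8_t *site, uint32_t site_addr, uint32_t target_addr,
                 enum patch_phase phase)
{
    /* Displacement is relative to the end of the 5-byte jump. */
    uint32_t rel32 = target_addr - (site_addr + 5);

    switch (phase) {
    case ARM_INT3:
        /* One-byte store: anyone arriving from now on traps into the
         * int 3 handler instead of running half-written bytes. */
        site[0] = 0xCC;
        break;
    case WRITE_TAIL:
        /* Fill in the displacement while the breakpoint guards the entry. */
        assert(site[0] == 0xCC);
        for (int i = 0; i < 4; i++)
            site[1 + i] = (uint8_t)(rel32 >> (8 * i));
        break;
    case WRITE_OPCODE:
        /* One-byte store again: the breakpoint becomes the jmp opcode. */
        assert(site[0] == 0xCC);
        site[0] = 0xE9;
        break;
    }
}
```

Calling the three phases in order leaves `E9` followed by the little-endian displacement at the site. At every intermediate point the first byte is either the original opcode, `CC`, or `E9`, so an arriving thread never decodes a jump with a partly written displacement.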

But the easiest thing to do is to lock them all out during the hooking: an
IPI does wonders.

Alberto.

----- Original Message -----
From: “David J. Craig”
Newsgroups: ntdev
To: “Windows System Software Devs Interest List”
Sent: Tuesday, August 28, 2007 8:30 PM
Subject: Re:[ntdev] Detours?

I wouldn’t rely on the effects of caching to save you…


Ken Johnson (Skywing)
Windows SDK MVP
http://www.nynaeve.net

It’s not about caching. It’s about fetching. Instructions are fetched in
chunks of 32 or 64 bytes, or even more in some machines. This length is set
to coincide with cache lines to optimize access time.

Alberto.

----- Original Message -----
From: “Skywing”
Newsgroups: ntdev
To: “Windows System Software Devs Interest List”
Sent: Thursday, August 30, 2007 2:23 PM
Subject: Re:[ntdev] Re:Detours?

Interesting stuff. But I do not understand what you say is the reasoning
behind this seemingly useless instruction, and why is it magic? I always
thought they left this space open to allow easy patching, but how exactly
does this relate to instructions being fetched by the processor?

/Daniel

The processor fetches instructions one cache line at a time. If I am
patching some stretch of code and my patch fits within one cache line,
there’s no chance that the reported inconsistency will happen. If I use an
8-byte compare-replace instruction, the only way the patch may end up
inconsistent is if the patch itself straddles two cache lines. But take
my word with a pinch of salt because it has been a couple of years since I
touched this stuff.
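
As a rough illustration of the 8-byte compare-replace, the sketch below uses a C11 atomic compare-exchange on a `uint64_t` as a stand-in for CMPXCHG8B. The slot simulates eight aligned bytes of code, and `patch_qword` is a hypothetical name. It shows only that the five jump bytes land in a single store while the remaining three bytes are preserved. As Matt pointed out earlier in the thread, it does nothing for a thread that is already executing inside the overwritten instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Overwrite the first 5 bytes of an aligned 8-byte slot with a 5-byte jump
 * in one compare-and-exchange, keeping the other 3 bytes as they were.
 * Returns false if the slot changed between the read and the exchange. */
bool patch_qword(_Atomic uint64_t *slot, const uint8_t jmp[5])
{
    uint64_t old_val = atomic_load(slot);

    /* Build the new quadword from the old bytes plus the jump. Going
     * through memcpy keeps the byte order exactly as it is in memory. */
    uint8_t bytes[8];
    memcpy(bytes, &old_val, sizeof bytes);
    memcpy(bytes, jmp, 5);

    uint64_t new_val;
    memcpy(&new_val, bytes, sizeof new_val);

    /* Succeeds only if the slot still holds old_val. */
    return atomic_compare_exchange_strong(slot, &old_val, new_val);
}
```

Because an `_Atomic uint64_t` is naturally aligned to 8 bytes on common x86 targets, the patched range cannot straddle a cache line, which is the one failure case singled out above.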

Alberto.

----- Original Message -----
From: “Daniel Terhell”
Newsgroups: ntdev
To: “Windows System Software Devs Interest List”
Sent: Friday, August 31, 2007 6:28 PM
Subject: Re:[ntdev] Re:Detours?
