Storport PDO I/O thread performance question

In order to test performance of my storport driver, I created a simple RAMdisk plugin for it.
Then I ran ATTO Disk Benchmark on the resulting drive. Results were surprisingly low:
80 MB/s @ 4KB block size
160 MB/s @ 8KB block size
…
1.7 GB/s @ 128KB block size

Basically, transfer speed roughly doubles for each doubling of transfer size.
That means my current design can only do about 20000 transactions per second.
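
The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope helper, not driver code: the function names are made up for illustration, and it assumes decimal megabytes (10^6 bytes) and 1024-byte kilobytes, since the benchmark's exact unit convention isn't stated here.

```c
#include <stdio.h>

/* Requests per second implied by a throughput figure and a block size.
 * Assumption: MB = 10^6 bytes, KB = 1024 bytes. */
double iops_from_throughput(double mb_per_s, double block_kb)
{
    return (mb_per_s * 1e6) / (block_kb * 1024.0);
}

/* Average time per request in microseconds, assuming requests are
 * serviced strictly one after another. */
double usec_per_request(double iops)
{
    return 1e6 / iops;
}

/* Print the implied request rate for one measured data point. */
void print_data_point(double mb_per_s, double block_kb)
{
    double iops = iops_from_throughput(mb_per_s, block_kb);
    printf("%7.0f MB/s @ %3.0fKB -> %6.0f IO/s, %5.1f us per request\n",
           mb_per_s, block_kb, iops, usec_per_request(iops));
}
```

Calling `print_data_point(80, 4)` and `print_data_point(160, 8)` gives the same request rate for both, a little under 20K IO/s, or roughly 50 microseconds per request. Under the same assumptions the 128KB figure works out somewhat lower, so the "doubling" is only approximate at the large end.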

I’m using LIST_ENTRY (doubly linked list) to store event log, KEVENT to signal the thread and KSPIN_LOCK to synchronise access to the list.

There’s not much other code in there. After getting event signal, the thread only calls these:
ExInterlockedRemoveHeadList (get one request to process)
CONTAINING_RECORD (get the Irp from list entry)
IoGetCurrentIrpStackLocation
MmGetSystemAddressForMdlSafe (get the buffer)
RtlCopyMemory (copy from memory to buffer)

Note: Compiling in release mode doesn’t change performance by more than 5% at 4KB block size. Interestingly, at 128KB block size the speed increase is 15%.

Is this (20K IO/s) really the top performance I can get out of this implementation? Should I move to direct request handling if I want more?

You’re saying the most you’re getting is 20K IOPs?

SOMEthing isn’t right.

We’ve measured more than 500K IOPs through a StorPort Miniport driver… on a real (commodity) disk. So, the problem surely isn’t StorPort.

Peter
OSR
@OSRDrivers

On Thu, Apr 2, 2015 at 5:22 AM, wrote:

>
> MmGetSystemAddressForMdlSafe (get the buffer)
> RtlCopyMemory (copy from memory to buffer)

Why are you doing that?

Mark Roddy

In place of I/O, to implement his RAMDISK, I would guess?

Peter

Exactly. I’m doing it to transfer data to / from my “disk”, which is an allocated memory block.
Also: Why is MmGetSystemAddressForMdlSafe a “problem”? AFAIK I need it to get the buffer provided (by the client user-mode app) which will receive data (read) or provide data (write).

I know storport is capable of more. After all, performance of my driver doesn’t match performance of my SSD disk. As pointed out by M M (his response was spun off into another thread for some reason), the problem must be in thread synchronisation.

I’m currently researching lock-free queues to implement something with less overhead.
Ideologically, I’d assume this is a single-producer, single consumer scenario, but I “highly suspect” that IRP_MJ_READ, IRP_MJ_WRITE and IRP_MJ_SCSI require a multiple-producer optimised solution. I have yet to find a good solution for that.

In the meantime, I want to skip the thread to see how fast direct request / response can work. I’m still too much of a n00b though; so far I’m only getting BSODs. I really have to make that debugger work as it’s supposed to (I opened a separate thread for that).

> I’m currently researching lock-free queues to implement something with less overhead.

Why do you need any threads or queues?

Tell STORPORT you can only support 1 request per LUN at a time; then there is no need for your own queue (or even your own locks).

A thread is also not necessary. Just do the memcpy() from StartIo directly. Yes, on DISPATCH, but this raises the perf a lot.

And yes, allocate the memory for your disk using MmAllocatePagesForMdl and map them (if the disk is not large) immediately.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

OK, I finally managed to make IRP_MJ_SCSI / SRB_FUNCTION_EXECUTE_SCSI / SCSIOP_READ - SCSIOP_WRITE process the request immediately.

It did not have the desired effect.

Write speed went up some 20%, while read speed went up to some 3.3x the previous figure.
Speeds are now (write/read):
95/255 MB/s @ 4KB (97/264 in release)
180/460 MB/s @ 8KB (190/512 in release)
…
1470/2060 MB/s @ 128KB (2030/3850 in release)

There’s a huge discrepancy between read and write even though the code path is exactly the same, save for the actual RtlCopyMemory call which only reverses src and dest between them.
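
For reference, the copy path described here can be modelled in user mode roughly like this. It is a sketch under stated assumptions, not the driver's code: the `ramdisk` struct, the function names and the 512-byte sector size are all hypothetical, and plain memory stands in for the disk buffer and for the mapped request buffer. Only the READ(10)/WRITE(10) opcode values and the 10-byte CDB layout come from the SCSI command set.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u   /* assumed sector size */
#define OP_READ10   0x28u  /* SCSI READ(10) opcode */
#define OP_WRITE10  0x2Au  /* SCSI WRITE(10) opcode */

/* Hypothetical user-mode stand-in for the RAM disk state. */
typedef struct ramdisk {
    uint8_t *base;     /* backing memory */
    uint32_t sectors;  /* capacity in sectors */
} ramdisk;

/* Bytes 2..5 of a 10-byte CDB hold the big-endian logical block address. */
uint32_t cdb10_lba(const uint8_t *cdb)
{
    return ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
           ((uint32_t)cdb[4] << 8)  |  (uint32_t)cdb[5];
}

/* Bytes 7..8 hold the big-endian transfer length in blocks. */
uint16_t cdb10_blocks(const uint8_t *cdb)
{
    return (uint16_t)((cdb[7] << 8) | cdb[8]);
}

/* Execute one read or write against the RAM disk.
 * Returns 0 on success, -1 on a bad opcode, range or buffer size. */
int ramdisk_execute(ramdisk *d, const uint8_t *cdb, uint8_t *buf, size_t buf_len)
{
    uint32_t lba = cdb10_lba(cdb);
    uint32_t blocks = cdb10_blocks(cdb);
    size_t bytes = (size_t)blocks * SECTOR_SIZE;

    /* Reject transfers that run past the end of the disk or the buffer. */
    if (lba > d->sectors || blocks > d->sectors - lba)
        return -1;
    if (bytes > buf_len)
        return -1;

    uint8_t *disk = d->base + (size_t)lba * SECTOR_SIZE;
    switch (cdb[0]) {
    case OP_READ10:  memcpy(buf, disk, bytes); return 0;  /* disk -> caller */
    case OP_WRITE10: memcpy(disk, buf, bytes); return 0;  /* caller -> disk */
    default:         return -1;
    }
}
```

As in the description, the read and write branches differ only in which side is the source of the copy.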

While this is certainly an improvement, I don’t think it’s a good solution for two reasons:

  1. It might lead to two requests writing to the same memory area at the same time. Therefore it just serves to prove that something’s wrong.
  2. This is the fastest code path available to me and it still only serves about 24K IO/s. Explanation: this is a function directly registered as DriverObject->MajorFunction. As I switch through possible command codes, SRB_FUNCTION_EXECUTE_SCSI is the first, as are SCSIOP_READ / SCSIOP_WRITE. The entire code path is really short with no synchronisation code in it, yet performance still sucks :(

Granted, I’m running these tests inside a VM, but I don’t think this is the limiting factor. Must get a production certificate to be sure.

This is a huge puzzle for me, both differences in R vs W performance as well as performance as such.

The previous post doesn’t account for Maxim’s suggestions; I was typing it for much too long :)

@Maxim:
In what way does StartIo behave differently than IRP_MJ_READ / WRITE and IRP_MJ_SCSI? I already have trouble with those, and now there’s a third I/O concept? Is there a fourth?
Why does the driver framework enable so many different code paths for the same functionality? Is there documentation describing the differences among them?

Is there a difference between memcpy and RtlCopyMemory? Also, what DISPATCH?

Finally, I don’t understand anything about your last paragraph:

> And yes, allocate the memory for your disk using MmAllocatePagesForMdl and map them (if the disk is not large) immediately.

The only way I “know” how to work with MDLs is through Irp.MdlAddress. Isn’t that IRP specific? How can I allocate memory using that?
How do you map this immediately? Currently I’m mapping the buffer that came with the IRP, not my disk buffer.

Thanks

xxxxx@gmail.com wrote:

> Is there a difference between memcpy and RtlCopyMemory?

C:\tmp>findstr RtlCopyMemory \DDK\7600\inc\ddk\wdm.h
#define RtlCopyMemory(Destination,Source,Length) memcpy((Destination),(Source),(Length))


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> In what way does StartIo behave differently than IRP_MJ_READ / WRITE and IRP_MJ_SCSI? I already have trouble with those, now a third I/O concept? Is there a fourth?
> Why does driver framework enable so many different code paths for the same functionality? Is there documentation describing differences among them?
>
> Is there a difference between memcpy and RtlCopyMemory? Also, what DISPATCH?

I cannot understand why you are talking about dispatch routines when you have a StorPort miniport.

MajorFunction belongs to STORPORT, not your code.

You only write StartIo, the interrupt handler (if any), the init path and a couple of additional functions with STORPORT.


Maxim S. Shatskih

> testing performance with non-optimized code (debug build) is a waste of time.

I would say more: debug builds are a waste of time. They are only useful to investigate some (not all) specific bugs.

Having a debug build as a part of your daily/hourly personal software process is a waste of time.

> Max is suggesting that you implement Fast IO.

No, for sure I could not suggest FastIo for a STORPORT miniport.

Just implement the miniport correctly, without any “smartness” that goes outside the STORPORT architecture, and ensure you’ve filled in STORPORT’s init data correctly.


Maxim S. Shatskih

Why are you accessing the IRP? With storport, you should be using the SRB.
To queue the SRB in your own worker queue (LIST_ENTRY), just create a small
structure to hold a pointer to the SRB and the list entry, and allocate it in
your StartIo (or have the port driver give you an SRB extension and link
those together in your list; I prefer the former). For efficiency, use a
lookaside list to allocate and free your data structures. To access the IO
data buffer from the SRB, simply use:

PVOID buf;
StorPortGetSystemAddress(hba_ext, srb, &buf);
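
The small structure and queue described above can be modelled in plain user-mode C roughly as follows. This is an illustration only: `list_entry`, `work_item`, `CONTAINER_OF` and the list helpers are hypothetical stand-ins that mimic the shape of the kernel's LIST_ENTRY and CONTAINING_RECORD, the `srb` field is a plain pointer rather than a real SRB, and there is no locking, which a real driver queue would need.

```c
#include <stddef.h>

/* User-mode model of a circular doubly linked list node (like LIST_ENTRY). */
typedef struct list_entry {
    struct list_entry *flink;  /* next entry */
    struct list_entry *blink;  /* previous entry */
} list_entry;

/* Recover the enclosing structure from a pointer to one of its fields,
 * the same idea as the kernel's CONTAINING_RECORD macro. */
#define CONTAINER_OF(ptr, type, field) \
    ((type *)((char *)(ptr) - offsetof(type, field)))

/* Hypothetical work item: a pointer to the request plus the list linkage. */
typedef struct work_item {
    void      *srb;   /* stand-in for a pointer to the SRB */
    list_entry link;  /* linkage into the worker queue */
} work_item;

void list_init(list_entry *head)
{
    head->flink = head->blink = head;
}

int list_is_empty(const list_entry *head)
{
    return head->flink == head;
}

/* Append an entry at the tail (producer side). */
void list_insert_tail(list_entry *head, list_entry *e)
{
    e->flink = head;
    e->blink = head->blink;
    head->blink->flink = e;
    head->blink = e;
}

/* Detach and return the first entry, or NULL if the list is empty
 * (consumer side). */
list_entry *list_remove_head(list_entry *head)
{
    list_entry *e = head->flink;
    if (e == head)
        return NULL;
    head->flink = e->flink;
    e->flink->blink = head;
    return e;
}
```

The consumer would call `list_remove_head` and then `CONTAINER_OF(entry, work_item, link)` to get back to the request pointer, so requests come out in the order they were queued.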

I also suggest using as many storport support routines as you can, and only
use things outside of the storport model when absolutely required. For
example I recommend using StorPortCopyMemory() rather than RtlCopyMemory()
or memcpy(). Do the same for your spinlocks as well (STOR_SPINLOCK) and
memory allocations (StorPortAllocatePool()). Remember also that some
storport routines are not available on earlier versions of Windows, but
most are available from Windows 7 upwards. Windows 8 added some new stuff
for worker queues and such.

When you speak of block size, what are you exactly doing? In the config
setup of the storport miniport, you set the max IO size with
config_info->MaximumTransferLength. Unless smaller blocks are intentionally
read, the port driver will try to give you as much as it can up to the
maximum. I have found that setting the max transfer length to 32K is a good
balance.
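
To see what such a cap means for a benchmark, here is a tiny helper. It assumes that anything larger than the maximum transfer length reaches the miniport as several smaller requests; the function name is made up for illustration.

```c
#include <stdint.h>

/* Number of requests needed to move total_bytes when no single request
 * may exceed max_transfer_bytes (ceiling division).
 * Returns 0 if max_transfer_bytes is 0. */
uint64_t requests_needed(uint64_t total_bytes, uint32_t max_transfer_bytes)
{
    if (max_transfer_bytes == 0)
        return 0;
    return (total_bytes + max_transfer_bytes - 1) / max_transfer_bytes;
}
```

With a 32K maximum, a 128KB benchmark block would arrive as four requests under that assumption, while 4KB and 8KB blocks would still be one request each.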

At the end of the day, are you talking to real hardware or will you require
the thread model in your final driver?

If you don’t actually need a thread in your final driver, then you can
ignore a lot of what I said :) But do try to stick to using as many
StorPortXxx() support routines as you can.


Jamey Kirby
Disrupting the establishment since 1964

This is a personal email account and as such, emails are not subject to
archiving. Nothing else really matters.

> I also suggest using as many storport support routines as you can, and only use things outside of the storport model when absolutely required.

Yes, and this includes using StorPort’s queues.


Maxim S. Shatskih

From Windows 8 onwards, you can also use the storport support routines to
manage a miniport internal queue.

https://msdn.microsoft.com/en-us/library/windows/hardware/hh451486(v=vs.85).aspx

Jamey Kirby

Max, I also agree 100% with respect to letting StorPort manage your IO
queue. However, there can be a situation in a virtual miniport where you
may need to process the request in a system thread at passive level. In
this situation, you may wish to use an internal thread and manage your own
queue; but even in this situation, you can still use the StorPort queue and
configure the driver to only process one request at a time. It is always
best not to reinvent the wheel, and to use the APIs that Microsoft has
defined for a particular domain.

Jamey Kirby

I have now confirmed that the performance limits explained in OP are the maximum I can ever get with current driver design.

Thanks Maxim & Jamey, your posts were very informative and helpful - though at times so far ahead of my current knowledge that I couldn’t understand what you were saying. I went exploring and discovered that my driver isn’t really a StorPort, but a “normal” driver that only “fakes” storport. It deals with SRBs, but it extracts them from IRP data.

I went browsing through the StorPort example and there’s some (much) stuff I don’t understand yet in there. In addition, my driver still needs to be deployed on some XP machines which the sample explicitly omits from compatible OS list. I have decided that proper StorPort implementation takes second priority and I will take it on later when I finish the functional upgrade I’m working on right now. 20K IO/s may be a low number for a RAMdisk, but it is plenty for network-based storage.

Thanks again for all your help. It is very much appreciated.

P.S. I really wish the documentation for driver development was better. There’s no “concepts” section explaining the basics to newbies like me. Often the documentation states “don’t use this” without any explanation. Even more often there are statements clearly indicating that I should already have certain knowledge (but I don’t), with no link to that knowledge nor any online info on it (if I am at all capable of working out what I should actually go looking for). It reminds me a lot of how we do development at the company I work for: lacking documentation, you learn by asking your coworkers :)
Anyway, the samples, while very informative, are also under-documented. Just browsing through Virtual StorPort MiniPort sample raised the following questions for me:

  1. Why does one have to install the driver multiple times to have multiple devices? Shouldn’t WMI / a simple console .exe handle that?
  2. What is WMI doing in there? What exactly does it handle if not #1? Also, what’s with auto-generating that WMI header file?
  3. How would one go about creating a comms path to the driver so that a new device could be registered through an IoControl (or some such) call instead of a fixed .inf file?

These are just the most basic questions asked by a total newbie before they even went in-depth. I feel so stupid with this at times, with no apparent way to even gain the required knowledge other than by bugging you guys.

Take heart and know this: The StorPort documentation sucks. I don’t mean it sucks a little and could use a bit of improvement here and there. I mean it totally, completely, well and truly sucks. It is mostly re-purposed ScsiPort documentation, or when it isn’t, it says “this is the same as (or different than) SCSI Port” without any more elaboration. We write StorPort drivers of the most demanding type here. I can tell you that, even for us who’ve been doing this for years, trying to use the documentation is an exercise in frustration.

So… it’s not you, OK?

Now… in terms of overall Windows driver development. This isn’t like programming in user-mode. There IS no short cut. There is some significant prerequisite information that you need to know about Windows I/O Subsystem architecture. I was just discussing this last week with some members of the WDK doc team (good folks all, by the way). We were talking about “People want samples, people want instant gratification… but they really NEED to understand some of the prerequisites. How and where do we explain those prerequisites?” Of course, we came to no conclusions.

Let me see if I can answer your questions:

  1. Well, YOU shouldn’t have to install the driver multiple times. Once the driver’s information (the VID and PID it supports, for example) is installed, it should automatically be recognized and started.

     In Windows there’s ONE driver instance, and that driver will support multiple devices of the same “type” (where “type” is defined as VID and PID, for example).

  2. Nothing. It has absolutely, positively, nothing to do with installing, starting, or registering drivers. Mostly, you just ignore WMI in kernel-mode (a gross simplification, but…). When we DO deal with it in driver land, it’s mostly just an optional method of supplying statistics and event reporting back to user land.

  3. In the world of Windows, that question doesn’t really make sense. An INF file describes the devices that a driver supports to Setup, so the PnP Manager knows what driver to load when it encounters a new device. This PnP process takes place (a) during startup, (b) whenever you plug a new device into the system.

     IoControl (IOCTL) is an I/O operation that’s neither a read nor a write, and is processed by a driver on behalf of a device.

You should take our driver seminar. Then YOU’d be answering these questions for people… ;)

Peter

Thanks, Peter for this post.

The problem with virtual drivers is that a new device isn’t plugged in, so such a driver must have an alternate means of telling the OS about a “new” device. I’m handling that through IoControl calls right now.

About the seminars: I suppose you meant multiple seminars. There are 4 of them on the OSR front page right now and they all seem interesting. They are however all in the U.S. and I’m sure my boss wouldn’t be too happy if I went missing for two months right now, even though I myself wouldn’t mind a tour of the U.S. in the meantime :) Any seminars planned in the EU?

Why do you have to use multiple root-enumerated virtual miniports? Microsoft iSCSI, for example, works as a single instance without problems.

It sounds like you have some sort of hybrid SCSI driver. I think you can do
what you want with SCSIPORT and STORPORT and still support XP. Take a look
at this open source project:

http://www.arsenalrecon.com/apps/image-mounter/

Jamey Kirby