PRP clarification

Hello,

Maybe it is a bit off-topic, but I would appreciate some clarification regarding the NVMe PRP format. Here is an example.

Let’s say we have an S/G list with the following physical segments:

  1. ph.addr: 0x20100, len: 0x0300
  2. ph.addr: 0x30100, len: 0x0F00
  3. ph.addr: 0x31000, len: 0x1000
  4. ph.addr: 0x32000, len: 0x0400
  5. ph.addr: 0x40100, len: 0x0F00
  6. ph.addr: 0x41000, len: 0x0B00

(I have already cut them down to pages; the page size is 0x1000.)

In such a case a conventional S/G list is trivial:

0x20100, 0x0300
0x30100, 0x2300
0x40100, 0x1A00

(total I/O is obviously 0x4000).

The space for the PRP list is allocated starting at the physical address 0x800000.
To my understanding, in order to describe such an S/G list in PRP terms we need a surprisingly large allocation, because it has to be “linked”.

I.e. at the address 0x800000 we would have something like:

0x0000 0000 2000 0100 <– the page address + offset

0x0000 0008 0000 1000 <– next list starts at this physical

Then the thing continues at 0x801000 with something like:

0x0000 0000 3000 0100 <– the address + offset

0x0000 0000 3100 0000 <– the address + offset

0x0000 0000 3200 0000 <– the address + offset

0x0000 0008 0000 2000 <– next list starts at this physical

Then the thing continues at 0x802000 with something like:

0x0000 0000 4000 0100 <– the address + offset

0x0000 0000 4100 0000 <– the address + offset

Altogether we consume three pages. Or maybe I have completely misunderstood?

If that’s the case, the PRP way could be fast for the hardware - but somewhat tedious to construct and quite space-consuming.

Thanks!

On Jan 23, 2018, at 11:17 PM, xxxxx@gmail.com wrote:
>
> Let’s say we have an S/G list with the following physical segments:
>
> 1. ph.addr: 0x20100, len: 0x0300
> 2. ph.addr: 0x30100, len: 0x0F00
> 3. ph.addr: 0x31000, len: 0x1000
> 4. ph.addr: 0x32000, len: 0x0400
> 5. ph.addr: 0x40100, len: 0x0F00
> 6. ph.addr: 0x41000, len: 0x0B00
>
> (I have already cut them down to pages; the page size is 0x1000.)

A scatter/gather list spans a single buffer. Because of that, only the first and last entries can be less than full, and only the first page has an offset other than zero. Each submission queue entry is going to be a single buffer, so the case you’ve described can never occur.
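
To make that shape concrete, here is a minimal sketch (plain C++, hypothetical names, not any particular driver API) of the check it implies: in the list of physical fragments describing one buffer, only the first fragment may start at a non-zero page offset, and only the last may end short of a page boundary. The six segments above fail it, because segments 2 and 5 start at offset 0x100.

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;   // assumed host page size

struct Fragment { uint64_t phys; uint64_t len; };

// True if the fragments could have come from one virtually contiguous
// buffer: every fragment except the first starts on a page boundary,
// and every fragment except the last ends on one.
bool looksLikeSingleBuffer(const std::vector<Fragment> &frags)
{
    for (size_t i = 0; i < frags.size(); ++i) {
        const bool first = (i == 0);
        const bool last  = (i + 1 == frags.size());
        if (!first && (frags[i].phys % kPageSize) != 0)
            return false;   // interior fragment starting at an offset
        if (!last && ((frags[i].phys + frags[i].len) % kPageSize) != 0)
            return false;   // interior fragment ending mid-page
    }
    return true;
}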

> In such a case a conventional S/G list is trivial:
>
> 0x20100, 0x0300
> 0x30100, 0x2300
> 0x40100, 0x1A00
>
> (total I/O is obviously 0x4000).
>
> The space for the PRP list is allocated starting at the physical address 0x800000.
> To my understanding, in order to describe such an S/G list in PRP terms we need a surprisingly large allocation, because it has to be “linked”.

Not so. The base request has two PRP slots. Depending on the command format, that’s either two PRPs, or one PRP and one PRP list pointer, or one PRP list pointer. However, a PRP list contains nothing but PRPs. There is no linking to additional lists. (How would you do that? There’s no field to signal that it’s a link).

In any single request, only the very first PRP can have a non-zero offset.
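
For what it’s worth, a minimal sketch (plain C++, hypothetical names) of how those two slots are typically filled for a single buffer, assuming the DMA layer has already produced one physical address per page, with only the first one carrying an offset, and assuming the PRP list fits in a single pre-allocated page:

#include <cstddef>
#include <cstdint>

struct SqeDataPointer {
    uint64_t prp1;   // first data page; only this entry may carry an offset
    uint64_t prp2;   // second data page, or the physical address of a PRP list
};

// page_phys[0] may have an offset; page_phys[1..n-1] are page-aligned.
// list_phys / list_virt describe one pre-allocated page used for the
// optional PRP list.
SqeDataPointer buildDataPointer(const uint64_t *page_phys, size_t npages,
                                uint64_t list_phys, uint64_t *list_virt)
{
    SqeDataPointer dp = { page_phys[0], 0 };

    if (npages == 1)
        return dp;                       // PRP1 alone covers the transfer

    if (npages == 2) {
        dp.prp2 = page_phys[1];          // PRP2 is a second data page
        return dp;
    }

    // Three or more pages: PRP2 points at a PRP list holding entries 2..n.
    for (size_t i = 1; i < npages; ++i)
        list_virt[i - 1] = page_phys[i];
    dp.prp2 = list_phys;
    return dp;
}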

> I.e. at the address 0x800000 we would have something like:
>
> 0x0000 0000 2000 0100 <– the page address + offset
> 0x0000 0008 0000 1000 <– next list starts at this physical

Remember that the “base address” of the page does not include the low 12 bits. The value of the PRP entry is simply the raw 64-bit physical address. Your entries would actually be:

0000 0000 0020 0100
0000 0000 0080 1000

Although, as I’ve said, the example is not actually valid, because the list won’t have any forward links.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Many thanks for the clarification regarding the PRP list.
Apologies, it was not entirely clear to me that the list can’t be linked, and I had not understood the format well.

And even more apologies: ultimately this question is about NVMe, not Windows - I am not working under Windows.
There aren’t many forums for the “rest of us”, and where they exist, I do not intend to attract additional attention.

“My” OS is somewhat related to FreeBSD (comes from a well-known manufacturer of phones and desktop computers).
I am using the FreeBSD driver as a starting point (a modification of it even “works”).

Unlike other OSes, there is a C++ framework (think of it as a much-extended StorPort) that a driver has to be based on.

Again, apologies that a question with a “competitor” OS in the background has hijacked your forum.

As for NVMe - maybe my brain is blocked - but then I simply do not see how my (not-so-theoretical) sample buffer descriptor can be processed by NVMe at all (if an S/G list is not allowed).

Normally it does not make much sense for a simple storage driver to do anything beyond the OS VM allocation process and create extra trouble for the HBA driver. If the buffer in question is the result of a simple memory allocation, I am fine.

However, in the case of software RAID that’s precisely what happens, since the RAID driver dissects the memory regions according to its own logic and sends the new (much more highly fragmented!) memory regions to the driver.

For the technically curious: in this “different OS” such buffers can be created intentionally (by the software RAID) from existing buffers with merely two calls:

  • IOSubMemoryDescriptor::withSubRange()
  • IOMemoryDescriptor::withOptions()

So back to my sample.

Your driver received from a software RAID driver the following buffer, which can be described in terms of physical ranges as follows:

from the physical address 0x20100: 0x0300 bytes
from the physical address 0x30100: 0x2300 bytes
from the physical address 0x40100: 0x1A00 bytes

You start to dissect these ranges into PRPs and get the following partial result:

  1. ph.addr: 0x20100, len: 0x0300
  2. ph.addr: 0x30100, len: 0x0F00
  3. ph.addr: 0x31000, len: 0x1000
  4. ph.addr: 0x32000, len: 0x0400
  5. ph.addr: 0x40100, len: 0x0F00
  6. ph.addr: 0x41000, len: 0x0B00

This is NOT a valid PRP list, precisely because the second and fifth elements aren’t page-aligned - they can’t be.
You can’t even do a partial I/O in this case.

I can of course always either explicitly double-buffer or tell all storage drivers that I can’t accept buffers with more than two fragments.
(My understanding is that I can “fit” the first fragment in PRP1, and PRP2 can point to a PRP list - correct?)

But this is somewhat… brutal.

xxxxx@gmail.com wrote:

> As for NVMe - maybe my brain is blocked - but then I simply do not see how my (not-so-theoretical) sample buffer descriptor can be processed by NVMe at all (if an S/G list is not allowed).

Your example IS theoretical, because it cannot occur in real life.  What
you showed is not “a buffer”.  It is three different buffers.

A “buffer”, by definition, is a set of full pages, with a potential
partial page at the beginning and end.

> Your driver received from a software RAID driver the following buffer, which can be described in terms of physical ranges as follows:
>
> from the physical address 0x20100: 0x0300 bytes
> from the physical address 0x30100: 0x2300 bytes
> from the physical address 0x40100: 0x1A00 bytes

That is not “a buffer”.  It is three different buffers, which will
clearly be going to three different disk locations, and will thus be
submitted in three different requests.  How would you specify the
destination addresses for those buffers?

> You start to dissect these ranges into PRPs and get the following partial result:
>
>   1. ph.addr: 0x20100, len: 0x0300
>   2. ph.addr: 0x30100, len: 0x0F00
>   3. ph.addr: 0x31000, len: 0x1000
>   4. ph.addr: 0x32000, len: 0x0400
>   5. ph.addr: 0x40100, len: 0x0F00
>   6. ph.addr: 0x41000, len: 0x0B00
>
> This is NOT a valid PRP list, precisely because the second and fifth elements aren’t page-aligned - they can’t be.
> You can’t even do a partial I/O in this case.

Right.  This list of transfers will have to be done in three different
requests.

> I can of course always either explicitly double-buffer or tell all storage drivers that I can’t accept buffers with more than two fragments.

How do you think you’re going to be getting multiple buffers in one request?

> (My understanding is that I can “fit” the first fragment in PRP1, and PRP2 can point to a PRP list - correct?)

No.  Only the first PRP in a request can start at a non-zero address.

Remember, the PRP entries do not have a length field!  All you have is
one single length field for the entire request.  Each PRP entry only
specifies a starting address.  Data will be consumed from that base
through the end of the page, until the total overall length for the
request is consumed.  So, for your entry number 4 above, there is simply
no way to say “only use the first 0x400 bytes of this page”.
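
A small simulation may make that clearer. This is only a sketch (hypothetical names), but it mimics how a controller consumes PRP entries: one total length for the whole command, each entry used from its starting address to the end of its page. Fed the six fragments quoted earlier, it charges 0xF00 bytes to the 0x20100 entry and a full 0x1000 to the 0x32000 entry - not what the RAID layer intended (and a real device may simply reject the non-zero offsets after the first entry anyway).

#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;

// Walk the PRPs the way a controller would: consume from each entry's
// address up to the end of that page, until the command's total length
// is exhausted.
void walkPrps(const std::vector<uint64_t> &prps, uint64_t total)
{
    uint64_t remaining = total;
    for (uint64_t prp : prps) {
        if (remaining == 0)
            break;
        uint64_t toPageEnd = kPageSize - (prp % kPageSize);
        uint64_t used = remaining < toPageEnd ? remaining : toPageEnd;
        std::printf("PRP 0x%llx -> 0x%llx bytes\n",
                    (unsigned long long)prp, (unsigned long long)used);
        remaining -= used;
    }
}

int main()
{
    walkPrps({ 0x20100, 0x30100, 0x31000, 0x32000, 0x40100, 0x41000 }, 0x4000);
    return 0;
}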


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@probo.com wrote:

> No.  Only the first PRP in a request can start at a non-zero address.

s/address/offset/


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

It’s been a few years since I’ve dealt with this… but I *definitely* recall that the whole issue of how to interpret PRPs and PRP lists was, ah, “somewhat ambiguous.” In fact, different vendors had different opinions on the subject.

I’m really struggling to remember how this is supposed to work…

From the spec:

So, the last entry can EITHER be a pointer to a memory page to transfer, OR a pointer to the next PRP List.

I recall this, specifically, because there was a bug in one of my drivers.

I’m not sure that’s been helpful at all. If not, sorry.

Peter
OSR
@OSRDrivers

xxxxx@osr.com wrote:

> I’m really struggling to remember how this is supposed to work…
>
> From the spec:
>
> So, the last entry can EITHER be a pointer to a memory page to transfer, OR a pointer to the next PRP List.

Thanks for this, I missed the implications.  I’m not a disk driver guy,
so my responses in this thread are based on Googling and code reading
after I read his question.

But to apply this to the original poster, the only time that this
“linkage” applies is if the PRP list exceeds a complete page.  So, if a
PRP list has <= 511 entries, it is one list.  If it takes 512 entries or
more, the 512th entry of the first list is the address of the next part
of the list.  You cannot use the linked lists to transfer discontiguous
buffers as in the original question.
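
For the record, here is a minimal sketch (hypothetical names, assuming a 4 KiB page, i.e. 512 eight-byte entries per list page) of the layout described above: when the PRP list itself does not fit in one page, the last slot of each full list page holds the physical address of the next list page instead of a data page.

#include <cstddef>
#include <cstdint>

constexpr size_t kEntriesPerListPage = 0x1000 / sizeof(uint64_t);   // 512

// dataPages:     page-aligned physical addresses of the data pages
//                (everything after whatever went into PRP1)
// listPagesVirt: kernel-virtual pointers to the pre-allocated list pages
// listPagesPhys: physical addresses of those same list pages
void fillChainedPrpList(const uint64_t *dataPages, size_t count,
                        uint64_t *const *listPagesVirt,
                        const uint64_t *listPagesPhys)
{
    size_t page = 0, slot = 0;
    for (size_t i = 0; i < count; ++i) {
        const bool lastSlot = (slot == kEntriesPerListPage - 1);
        const bool moreData = (i + 1 < count);
        if (lastSlot && moreData) {
            // This list page is full and more data follows: chain to the
            // next list page and keep filling there.
            listPagesVirt[page][slot] = listPagesPhys[page + 1];
            ++page;
            slot = 0;
        }
        listPagesVirt[page][slot++] = dataPages[i];
    }
}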


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I don’t think that’s accurate, because that was MY original reading of the NVMe spec, too… and that’s not how it works.

From the original bug report I received (from a major drive manufacturer):



This isn’t the NVMe forum, and the last time I touched the NVMe spec was more than 3 years ago… which, in code years, is an eternity.

I’d suggest the OP seek out help from the NVMe Implementor’s Forum or some such. I don’t object to the question being posted here (even if it IS about iOS)… but I’m not sure how much we’ll be able to help him.

Peter
OSR
@OSRDrivers

Hi Tim,

I apologize, but in “our” world, outside of Windows this IS a buffer. More precisely, a physical representation of the corresponding class.
It is something unusual; we rarely deal with “buffers” in the conventional sense.

What I provided is something out of an IOMemoryDescriptor, a thingy that describes such cases.

Apologies, this is quite a special framework I have to deal with. Think of it as a C++ version of StorPort on steroids, where everything around you is part of that framework (so not only the storage).

Regarding your question about the I/O: in my example I can very well get this as a single request.

So what I will get from my supervisor (who actually does the software RAID), or from the “mothership” company (in their software RAID), is a call to my “public” obligatory routine, which is mandated to be declared like this:

IOReturn
MyStorageDevice::doAsyncReadWrite(IOMemoryDescriptor *pBuffer,
                                  UInt64 block,
                                  UInt64 nblks,
                                  IOStorageAttributes *pAttributes,
                                  IOStorageCompletion *pCompletion)

Where:

  • The IOMemoryDescriptor *pBuffer parameter is NOT a buffer in your sense, but a buffer in that C++ framework’s sense, created the way I described.
  • UInt64 block is the LBA.
  • UInt64 nblks is how many blocks (this may seem a bit redundant because the IOMemoryDescriptor class carries the size information, too - but OK).
  • IOStorageAttributes *pAttributes - usually null (for now, ignored).
  • IOStorageCompletion *pCompletion is the completion routine.

Once my storage nub receives this call, it will pass it to the HBA driver.

Here, basically (if you have seen the FreeBSD driver we are starting from), it essentially calls “nvme_ns_strategy” (but “behind” a thing called “IOCommandGate”, so it goes single-threaded from that point).

But back to your concern: there is indeed no such thing as a “buffer” in your sense for me. Instead, there is a pointer to an IOMemoryDescriptor instance which describes a bunch of buffers. All the data from these buffers has to be transferred (according to the mandated APIs) from / to UInt64 block (that’s the LBA).

In any case, what I described is a valid physical representation of the IOMemoryDescriptor.
If it looks unusual - yes, it is. When we started to move to that framework it did look quite unusual, indeed.

But again, this is not about StorPort vs. IOKit APIs. I have no choice: the IOKit framework will “feed” me such “buffers” whether I want it (no, I don’t!) or not.

The problem is: does the NVMe spec allow handling this within the PRP scheme, or had we better manufacture controllers which support SGL?

As far as I understand, SGL is an option, not mandatory.

Thanks

> As far as I understand, SGL is an option, not mandatory.

As far as I remember, that is correct. Initially, nobody was doing SGLs. It took ages for the LeCroy NVMe Emulator to support SGL transfers (if it ever did).

I wish I could help you more… but I’m seriously out of practice and out of touch on NVMe.

Peter
OSR
@OSRdrivers

Some more (exotic technical) details.

You write the HBA driver. You have several choices. Regardless of which one, for each discovered namespace you spawn a little mini-driver called a “nub”.
That driver acts as a “freight forwarder” between your HBA driver and the mass storage driver.

All of this is done in a certain “matching” chain, so basically once your entities are published, every entity in the Registry will look for a “date”.

It is your responsibility to ensure the marriage is OK. It is a bit different from Windows, because usually you get the “date” / “marriage” of another driver from “above” (so they are already there).

Once the relationship chain is established, your “marriage partners” will call your public routines.

You have to maintain certain routines as “public” in order to function; which ones is well documented.

The one I described to you (doAsyncReadWrite()) is one of them. Everything is in the Embedded C++ dialect.

It’s quite neat - but as you can see, sometimes we get a surprise where the “buffer” is not what every normal mortal expects.

iOS wouldn’t have a problem because there is no RAID stack there.
It’s the “desktop / server thing”, close enough - but called a bit differently ;-)

xxxxx@gmail.com wrote:

> I apologize, but in “our” world, outside of Windows this IS a buffer. More precisely, a physical representation of the corresponding class. It is something unusual; we rarely deal with “buffers” in the conventional sense.

There’s really no need to apologize.  I did find a PowerPoint deck that
talks about the impact of software RAID on NVMe requests.  I find it
difficult to understand how one could have discontiguous regions going
to a single LBA, but that’s just me.

> The problem is: does the NVMe spec allow handling this within the PRP scheme, or had we better manufacture controllers which support SGL?

Clearly, it is impossible to map this into PRPs, simply because you
can’t specify the length of each PRP entry.  The PowerPoint deck from
openfabrics.org says you either wait for SGL in NVMe 1.1, or you split
it into multiple commands.  Slide 22 here:

http://downloads.openfabrics.org/Media/Monterey_2012/2012_Workshop_Tues_NVME_Windows.pdf

> As far as I understand, SGL is an option, not mandatory.

Yep, looks like it.
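
For reference, a minimal sketch (hypothetical struct and field names) of the check a driver might make at attach time: the Identify Controller data carries an SGL Support (SGLS) field, and if it reads as zero the controller offers no SGLs, leaving PRPs as the only option for I/O commands.

#include <cstdint>

// Hypothetical, heavily abbreviated view of the Identify Controller data;
// only the field of interest here is shown.
struct IdentifyController {
    uint32_t sgls;   // SGL Support (SGLS) field
};

enum class XferMode { PrpOnly, SglAllowed };

XferMode pickTransferMode(const IdentifyController &idc)
{
    // The low bits of SGLS say whether (and how) SGLs are supported;
    // zero means the controller does not implement them at all.
    return (idc.sgls & 0x3) ? XferMode::SglAllowed : XferMode::PrpOnly;
}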


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks for the link!

As for splitting the I/O - that would obviously work only if the “split” falls on a sector-size boundary.

In my case I intentionally put the possible “split” at 0x300 bytes. You can’t transfer 0x300 bytes as a whole command (it is not a multiple of the sector size).

That means double-buffering in one form or another. The easiest way (for me) is to declare in the corresponding mandatory registry property: “don’t send me any I/O with more than two fragments”.

That works fine (for now). Thanks to all of you - we clearly need NVMe with SGL support.

I mean, if we think seriously about the software RAID.