| author | fishsoupisgood <github@madingley.org> | 2019-04-29 01:17:54 +0100 | 
|---|---|---|
| committer | fishsoupisgood <github@madingley.org> | 2019-05-27 03:43:43 +0100 | 
| commit | 3f2546b2ef55b661fd8dd69682b38992225e86f6 (patch) | |
| tree | 65ca85f13617aee1dce474596800950f266a456c /docs/specs | |
Diffstat (limited to 'docs/specs')
| Mode | File | Lines added |
|---|---|---|
| -rw-r--r-- | docs/specs/acpi_cpu_hotplug.txt | 24 |
| -rw-r--r-- | docs/specs/acpi_mem_hotplug.txt | 94 |
| -rw-r--r-- | docs/specs/acpi_pci_hotplug.txt | 45 |
| -rw-r--r-- | docs/specs/edu.txt | 110 |
| -rw-r--r-- | docs/specs/fw_cfg.txt | 226 |
| -rw-r--r-- | docs/specs/ivshmem_device_spec.txt | 96 |
| -rw-r--r-- | docs/specs/pci-ids.txt | 54 |
| -rw-r--r-- | docs/specs/pci-serial.txt | 34 |
| -rw-r--r-- | docs/specs/pci-testdev.txt | 26 |
| -rw-r--r-- | docs/specs/ppc-spapr-hcalls.txt | 78 |
| -rw-r--r-- | docs/specs/ppc-spapr-hotplug.txt | 305 |
| -rw-r--r-- | docs/specs/pvpanic.txt | 39 |
| -rw-r--r-- | docs/specs/qcow2.txt | 362 |
| -rw-r--r-- | docs/specs/qed_spec.txt | 138 |
| -rw-r--r-- | docs/specs/rocker.txt | 1014 |
| -rw-r--r-- | docs/specs/standard-vga.txt | 81 |
| -rw-r--r-- | docs/specs/vhost-user.txt | 266 |
| -rw-r--r-- | docs/specs/vmw_pvscsi-spec.txt | 92 |
18 files changed, 3084 insertions, 0 deletions
diff --git a/docs/specs/acpi_cpu_hotplug.txt b/docs/specs/acpi_cpu_hotplug.txt
new file mode 100644
index 00000000..340b751a
--- /dev/null
+++ b/docs/specs/acpi_cpu_hotplug.txt
@@ -0,0 +1,24 @@
+QEMU<->ACPI BIOS CPU hotplug interface
+--------------------------------------
+
+QEMU supports CPU hotplug via ACPI. This document
+describes the interface between QEMU and the ACPI BIOS.
+
+ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
+-----------------------------------------
+
+Generic ACPI GPE block. Bit 2 (GPE.2) used to notify CPU
+hot-add/remove event to ACPI BIOS, via SCI interrupt.
+
+CPU present bitmap for:
+  ICH9-LPC (IO port 0x0cd8-0xcf7, 1-byte access)
+  PIIX-PM  (IO port 0xaf00-0xaf1f, 1-byte access)
+---------------------------------------------------------------
+One bit per CPU. Bit position reflects corresponding CPU APIC ID.
+Read-only.
+
+CPU hot-add/remove notification:
+-----------------------------------------------------
+QEMU sets/clears corresponding CPU bit on hot-add/remove event.
+CPU present map read by ACPI BIOS GPE.2 handler to notify OS of CPU
+hot-(un)plug events.
diff --git a/docs/specs/acpi_mem_hotplug.txt b/docs/specs/acpi_mem_hotplug.txt
new file mode 100644
index 00000000..3df3620c
--- /dev/null
+++ b/docs/specs/acpi_mem_hotplug.txt
@@ -0,0 +1,94 @@
+QEMU<->ACPI BIOS memory hotplug interface
+--------------------------------------
+
+ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add
+and hot-remove events.
+
+Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access):
+---------------------------------------------------------------
+0xa00:
+  read access:
+      [0x0-0x3] Lo part of memory device phys address
+      [0x4-0x7] Hi part of memory device phys address
+      [0x8-0xb] Lo part of memory device size in bytes
+      [0xc-0xf] Hi part of memory device size in bytes
+      [0x10-0x13] Memory device proximity domain
+      [0x14] Memory device status fields
+          bits:
+              0: Device is enabled and may be used by guest
+              1: Device insert event, used to distinguish device for which
+                 no device check event to OSPM was issued.
+                 It's valid only when bit 1 is set.
+              2: Device remove event, used to distinguish device for which
+                 no device eject request to OSPM was issued.
+              3-7: reserved and should be ignored by OSPM
+      [0x15-0x17] reserved
+
+  write access:
+      [0x0-0x3] Memory device slot selector, selects active memory device.
+                All following accesses to other registers in 0xa00-0xa17
+                region will read/store data from/to selected memory device.
+      [0x4-0x7] OST event code reported by OSPM
+      [0x8-0xb] OST status code reported by OSPM
+      [0xc-0x13] reserved, writes into it are ignored
+      [0x14] Memory device control fields
+          bits:
+              0: reserved, OSPM must clear it before writing to register.
+                 Due to BUG in versions prior 2.4 that field isn't cleared
+                 when other fields are written. Keep it reserved and don't
+                 try to reuse it.
+              1: if set to 1 clears device insert event, set by OSPM
+                 after it has emitted device check event for the
+                 selected memory device
+              2: if set to 1 clears device remove event, set by OSPM
+                 after it has emitted device eject request for the
+                 selected memory device
+              3: if set to 1 initiates device eject, set by OSPM when it
+                 triggers memory device removal and calls _EJ0 method
+              4-7: reserved, OSPM must clear them before writing to register
+
+Selecting memory device slot beyond present range has no effect on platform:
+   - write accesses to memory hot-plug registers not documented above are
+     ignored
+   - read accesses to memory hot-plug registers not documented above return
+     all bits set to 1.
+
+Memory hot remove process diagram:
+----------------------------------
+ +-------------+     +-----------------------+      +------------------+
+ |  1. QEMU    |     | 2. QEMU               |      |3. QEMU           |
+ |  device_del +---->+ device unplug request +----->+Send SCI to guest,|
+ |             |     |         cb            |      |return control to |
+ +-------------+     +-----------------------+      |management        |
+                                                    +------------------+
+
+ +---------------------------------------------------------------------+
+
+ +---------------------+              +-------------------------+
+ | OSPM:               | remove event | OSPM:                   |
+ | send Eject Request, |              | Scan memory devices     |
+ | clear remove event  +<-------------+ for event flags         |
+ |                     |              |                         |
+ +---------------------+              +-------------------------+
+           |
+           |
+ +---------v--------+            +-----------------------+
+ | Guest OS:        |  success   | OSPM:                 |
+ | process Ejection +----------->+ Execute _EJ0 method,  |
+ | request          |            | set eject bit in flags|
+ +------------------+            +-----------------------+
+           |failure                         |
+           v                                v
+ +------------------------+      +-----------------------+
+ | OSPM:                  |      | QEMU:                 |
+ | set OST event & status |      | call device unplug cb |
+ | fields                 |      |                       |
+ +------------------------+      +-----------------------+
+          |                                  |
+          v                                  v
+ +------------------+              +-------------------+
+ |QEMU:             |              |QEMU:              |
+ |Send OST QMP event|              |Send device deleted|
+ |                  |              |QMP event          |
+ +------------------+              |                   |
+                                   +-------------------+
diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
new file mode 100644
index 00000000..a839434f
--- /dev/null
+++ b/docs/specs/acpi_pci_hotplug.txt
@@ -0,0 +1,45 @@
+QEMU<->ACPI BIOS PCI hotplug interface
+--------------------------------------
+
+QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document
+describes the interface between QEMU and the ACPI BIOS.
+
+ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
+-----------------------------------------
+
+Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
+event to ACPI BIOS, via SCI interrupt.
+
+PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
+---------------------------------------------------------------
+Slot injection notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.1 handler to notify OS of injection
+events.  Read-only.
+
+PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
+-----------------------------------------------------
+Slot removal notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.1 handler to notify OS of removal
+events.  Read-only.
+
+PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
+----------------------------------------
+
+Write: Used by ACPI BIOS _EJ0 method to request device removal.
+One bit per slot.
+
+Read: Hotplug features register.  Used by platform to identify features
+available.  Current base feature set (no bits set):
+ - Read-only "up" register @0xae00, 4-byte access, bit per slot
+ - Read-only "down" register @0xae04, 4-byte access, bit per slot
+ - Read/write "eject" register @0xae08, 4-byte access,
+   write: bit per slot eject, read: hotplug feature set
+ - Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot
+
+PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
+-----------------------------------------------
+
+Used by ACPI BIOS _RMV method to indicate removability status to OS. One
+bit per slot.  Read-only
diff --git a/docs/specs/edu.txt b/docs/specs/edu.txt
new file mode 100644
index 00000000..7f814678
--- /dev/null
+++ b/docs/specs/edu.txt
@@ -0,0 +1,110 @@
+
+EDU device
+==========
+
+Copyright (c) 2014-2015 Jiri Slaby
+
+This document is licensed under the GPLv2 (or later).
+
+This is an educational device for writing (kernel) drivers. Its original
+intention was to support the Linux kernel lectures taught at the Masaryk
+University. Students are given this virtual device and are expected to write a
+driver with I/Os, IRQs, DMAs and such.
+
+The devices behaves very similar to the PCI bridge present in the COMBO6 cards
+developed under the Liberouter wings. Both PCI device ID and PCI space is
+inherited from that device.
+
+Command line switches:
+    -device edu[,dma_mask=mask]
+
+    dma_mask makes the virtual device work with DMA addresses with the given
+    mask. For educational purposes, the device supports only 28 bits (256 MiB)
+    by default. Students shall set dma_mask for the device in the OS driver
+    properly.
+
+PCI specs
+---------
+
+PCI ID: 1234:11e8
+
+PCI Region 0:
+   I/O memory, 1 MB in size. Users are supposed to communicate with the card
+   through this memory.
+
+MMIO area spec
+--------------
+
+Only size == 4 accesses are allowed for addresses < 0x80. size == 4 or
+size == 8 for the rest.
+
+0x00 (RO) : identification (0xRRrr00edu)
+	    RR -- major version
+	    rr -- minor version
+
+0x04 (RW) : card liveness check
+	    It is a simple value inversion (~ C operator).
+
+0x08 (RW) : factorial computation
+	    The stored value is taken and factorial of it is put back here.
+	    This happens only after factorial bit in the status register (0x20
+	    below) is cleared.
+
+0x20 (RW) : status register, bitwise OR
+	    0x01 -- computing factorial (RO)
+	    0x80 -- raise interrupt 0x01 after finishing factorial computation
+
+0x24 (RO) : interrupt status register
+	    It contains values which raised the interrupt (see interrupt raise
+	    register below).
+
+0x60 (WO) : interrupt raise register
+	    Raise an interrupt. The value will be put to the interrupt status
+	    register (using bitwise OR).
+
+0x64 (WO) : interrupt acknowledge register
+	    Clear an interrupt. The value will be cleared from the interrupt
+	    status register. This needs to be done from the ISR to stop
+	    generating interrupts.
+
+0x80 (RW) : DMA source address
+	    Where to perform the DMA from.
+
+0x88 (RW) : DMA destination address
+	    Where to perform the DMA to.
+
+0x90 (RW) : DMA transfer count
+	    The size of the area to perform the DMA on.
+
+0x98 (RW) : DMA command register, bitwise OR
+	    0x01 -- start transfer
+	    0x02 -- direction (0: from RAM to EDU, 1: from EDU to RAM)
+	    0x04 -- raise interrupt 0x100 after finishing the DMA
+
+IRQ controller
+--------------
+An IRQ is generated when written to the interrupt raise register. The value
+appears in interrupt status register when the interrupt is raised and has to
+be written to the interrupt acknowledge register to lower it.
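Editor's illustration (not part of the patch): the liveness and interrupt registers of the EDU device described above can be sketched as a small software model. This is a hedged sketch, not QEMU's implementation. The `edu_model` struct, the `EDU_REG_*` names and the three functions are hypothetical; only the register offsets and the invert / OR / clear semantics come from the spec text. Treating the IRQ line as asserted while the status is non-zero is an assumption drawn from "has to be written to the interrupt acknowledge register to lower it".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets from the MMIO area spec; the identifier names are illustrative. */
enum edu_reg {
    EDU_REG_LIVENESS   = 0x04, /* RW: reads back the inverted written value */
    EDU_REG_IRQ_STATUS = 0x24, /* RO: values that raised the interrupt */
    EDU_REG_IRQ_RAISE  = 0x60, /* WO: value is ORed into the status */
    EDU_REG_IRQ_ACK    = 0x64  /* WO: value is cleared from the status */
};

/* Hypothetical model state; the real device has more registers. */
struct edu_model {
    uint32_t liveness;
    uint32_t irq_status;
};

/* Model a 4-byte MMIO write. Offsets not modelled here are ignored
 * (an assumption of this sketch). */
void edu_model_write(struct edu_model *m, uint32_t offset, uint32_t value)
{
    assert(m != NULL);
    switch (offset) {
    case EDU_REG_LIVENESS:
        m->liveness = ~value;      /* simple value inversion */
        break;
    case EDU_REG_IRQ_RAISE:
        m->irq_status |= value;    /* raise: bitwise OR into status */
        break;
    case EDU_REG_IRQ_ACK:
        m->irq_status &= ~value;   /* acknowledge: clear from status */
        break;
    default:
        break;
    }
}

/* Model a 4-byte MMIO read. Unmodelled offsets read as 0 (assumption). */
uint32_t edu_model_read(const struct edu_model *m, uint32_t offset)
{
    assert(m != NULL);
    switch (offset) {
    case EDU_REG_LIVENESS:
        return m->liveness;
    case EDU_REG_IRQ_STATUS:
        return m->irq_status;
    default:
        return 0;
    }
}

/* Assumption: the line stays asserted while any status bit is set. */
int edu_model_irq_asserted(const struct edu_model *m)
{
    assert(m != NULL);
    return m->irq_status != 0;
}
```

A driver's ISR would follow the same pattern against the real device: read the status at 0x24, handle each set bit, then write the handled bits to 0x64 so that the interrupt stops firing.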
+
+DMA controller
+--------------
+One has to specify, source, destination, size, and start the transfer. One
+4096 bytes long buffer at offset 0x40000 is available in the EDU device. I.e.
+one can perform DMA to/from this space when programmed properly.
+
+Example of transferring a 100 byte block to and from the buffer using a given
+PCI address 'addr':
+addr     -> DMA source address
+0x40000  -> DMA destination address
+100      -> DMA transfer count
+1        -> DMA command register
+while (DMA command register & 1)
+	;
+
+0x40000  -> DMA source address
+addr+100 -> DMA destination address
+100      -> DMA transfer count
+3        -> DMA command register
+while (DMA command register & 1)
+	;
diff --git a/docs/specs/fw_cfg.txt b/docs/specs/fw_cfg.txt
new file mode 100644
index 00000000..74351dd1
--- /dev/null
+++ b/docs/specs/fw_cfg.txt
@@ -0,0 +1,226 @@
+QEMU Firmware Configuration (fw_cfg) Device
+===========================================
+
+= Guest-side Hardware Interface =
+
+This hardware interface allows the guest to retrieve various data items
+(blobs) that can influence how the firmware configures itself, or may
+contain tables to be installed for the guest OS. Examples include device
+boot order, ACPI and SMBIOS tables, virtual machine UUID, SMP and NUMA
+information, kernel/initrd images for direct (Linux) kernel booting, etc.
+
+== Selector (Control) Register ==
+
+* Write only
+* Location: platform dependent (IOport or MMIO)
+* Width: 16-bit
+* Endianness: little-endian (if IOport), or big-endian (if MMIO)
+
+A write to this register sets the index of a firmware configuration
+item which can subsequently be accessed via the data register.
+
+Setting the selector register will cause the data offset to be set
+to zero. The data offset impacts which data is accessed via the data
+register, and is explained below.
+
+Bit14 of the selector register indicates whether the configuration
+setting is being written. A value of 0 means the item is only being
+read, and all write access to the data port will be ignored. A value
+of 1 means the item's data can be overwritten by writes to the data
+register. In other words, configuration write mode is enabled when
+the selector value is between 0x4000-0x7fff or 0xc000-0xffff.
+
+NOTE: As of QEMU v2.4, writes to the fw_cfg data register are no
+      longer supported, and will be ignored (treated as no-ops)!
+
+Bit15 of the selector register indicates whether the configuration
+setting is architecture specific. A value of 0 means the item is a
+generic configuration item. A value of 1 means the item is specific
+to a particular architecture. In other words, generic configuration
+items are accessed with a selector value between 0x0000-0x7fff, and
+architecture specific configuration items are accessed with a selector
+value between 0x8000-0xffff.
+
+== Data Register ==
+
+* Read/Write (writes ignored as of QEMU v2.4)
+* Location: platform dependent (IOport [*] or MMIO)
+* Width: 8-bit (if IOport), 8/16/32/64-bit (if MMIO)
+* Endianness: string-preserving
+
+[*] On platforms where the data register is exposed as an IOport, its
+port number will always be one greater than the port number of the
+selector register. In other words, the two ports overlap, and can not
+be mapped separately.
+
+The data register allows access to an array of bytes for each firmware
+configuration data item. The specific item is selected by writing to
+the selector register, as described above.
+
+Initially following a write to the selector register, the data offset
+will be set to zero. Each successful access to the data register will
+increment the data offset by the appropriate access width.
+
+Each firmware configuration item has a maximum length of data
+associated with the item. After the data offset has passed the
+end of this maximum data length, then any reads will return a data
+value of 0x00, and all writes will be ignored.
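Editor's illustration (not part of the patch): the selector bit layout and the data-offset behaviour described above can be sketched in C. This is a minimal model under stated assumptions, not the QEMU implementation. The `SEL_*` macros, both structs and all function names are hypothetical. The sketch also assumes that a read past the end of an item returns 0x00 without advancing the offset, which the text does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions from the selector register description; names are illustrative. */
#define SEL_WRITE_BIT  0x4000u /* bit 14: write mode (ignored as of QEMU v2.4) */
#define SEL_ARCH_BIT   0x8000u /* bit 15: architecture-specific item */
#define SEL_INDEX_MASK 0x3fffu /* remaining 14 bits: item index */

struct fw_cfg_selector {
    bool write_mode;
    bool arch_specific;
    uint16_t index;
};

/* Split a 16-bit selector key into its three fields. */
struct fw_cfg_selector fw_cfg_decode_selector(uint16_t key)
{
    struct fw_cfg_selector s;
    s.write_mode = (key & SEL_WRITE_BIT) != 0;
    s.arch_specific = (key & SEL_ARCH_BIT) != 0;
    s.index = (uint16_t)(key & SEL_INDEX_MASK);
    return s;
}

/* Hypothetical model of the currently selected item and its data offset. */
struct fw_cfg_item_model {
    const uint8_t *data;
    size_t len;
    size_t offset;
};

/* Selecting an item resets the data offset to zero. */
void fw_cfg_model_select(struct fw_cfg_item_model *it,
                         const uint8_t *data, size_t len)
{
    assert(it != NULL);
    it->data = data;
    it->len = len;
    it->offset = 0;
}

/* One 8-bit read of the data register: the next byte, or 0x00 past the end. */
uint8_t fw_cfg_model_read8(struct fw_cfg_item_model *it)
{
    assert(it != NULL);
    if (it->offset < it->len) {
        return it->data[it->offset++];
    }
    return 0x00;
}
```

For example, decoding key 0xc019 in this model yields write mode set, architecture-specific set, and index 0x0019, which falls in the 0xc000-0xffff range described above.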
+
+An N-byte wide read of the data register will return the next available
+N bytes of the selected firmware configuration item, as a substring, in
+increasing address order, similar to memcpy().
+
+== Register Locations ==
+
+=== x86, x86_64 Register Locations ===
+
+Selector Register IOport: 0x510
+Data Register IOport:     0x511
+
+== Firmware Configuration Items ==
+
+=== Signature (Key 0x0000, FW_CFG_SIGNATURE) ===
+
+The presence of the fw_cfg selector and data registers can be verified
+by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE),
+and reading four bytes from the data register. If the fw_cfg device is
+present, the four bytes read will contain the characters "QEMU".
+
+=== Revision (Key 0x0001, FW_CFG_ID) ===
+
+A 32-bit little-endian unsigned int, this item is used as an interface
+revision number, and is currently set to 1 by QEMU when fw_cfg is
+initialized.
+
+=== File Directory (Key 0x0019, FW_CFG_FILE_DIR) ===
+
+Firmware configuration items stored at selector keys 0x0020 or higher
+(FW_CFG_FILE_FIRST or higher) have an associated entry in a directory
+structure, which makes it easier for guest-side firmware to identify
+and retrieve them. The format of this file directory (from fw_cfg.h in
+the QEMU source tree) is shown here, slightly annotated for clarity:
+
+struct FWCfgFiles {		/* the entire file directory fw_cfg item */
+    uint32_t count;		/* number of entries, in big-endian format */
+    struct FWCfgFile f[];	/* array of file entries, see below */
+};
+
+struct FWCfgFile {		/* an individual file entry, 64 bytes total */
+    uint32_t size;		/* size of referenced fw_cfg item, big-endian */
+    uint16_t select;		/* selector key of fw_cfg item, big-endian */
+    uint16_t reserved;
+    char name[56];		/* fw_cfg item name, NUL-terminated ascii */
+};
+
+=== All Other Data Items ===
+
+Please consult the QEMU source for the most up-to-date and authoritative
+list of selector keys and their respective items' purpose and format.
+
+=== Ranges ===
+
+Theoretically, there may be up to 0x4000 generic firmware configuration
+items, and up to 0x4000 architecturally specific ones.
+
+Selector Reg.    Range Usage
+---------------  -----------
+0x0000 - 0x3fff  Generic (0x0000 - 0x3fff, RO)
+0x4000 - 0x7fff  Generic (0x0000 - 0x3fff, RW, ignored in QEMU v2.4+)
+0x8000 - 0xbfff  Arch. Specific (0x0000 - 0x3fff, RO)
+0xc000 - 0xffff  Arch. Specific (0x0000 - 0x3fff, RW, ignored in v2.4+)
+
+In practice, the number of allowed firmware configuration items is given
+by the value of FW_CFG_MAX_ENTRY (see fw_cfg.h).
+
+= Host-side API =
+
+The following functions are available to the QEMU programmer for adding
+data to a fw_cfg device during guest initialization (see fw_cfg.h for
+each function's complete prototype):
+
+== fw_cfg_add_bytes() ==
+
+Given a selector key value, starting pointer, and size, create an item
+as a raw "blob" of the given size, available by selecting the given key.
+The data referenced by the starting pointer is only linked, NOT copied,
+into the data structure of the fw_cfg device.
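Editor's illustration (not part of the patch): on the guest side, the file directory layout given earlier (a big-endian 32-bit count followed by 64-byte entries) could be searched roughly as follows. This is a sketch under stated assumptions. It expects the directory item to have already been read in full into a byte buffer, for example through the data register after selecting key 0x0019. The function and macro names are hypothetical and do not come from fw_cfg.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIR_ENTRY_SIZE 64u /* size, select, reserved, name[56] */
#define DIR_NAME_SIZE  56u

/* Read big-endian integers byte by byte, independent of host endianness. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/*
 * Look up `name` in a raw file directory blob.
 * Returns the item's selector key, or -1 if the name is not present or the
 * buffer ends before the entry is reached. If size_out is non-NULL, it
 * receives the item size on success.
 */
int fw_cfg_dir_lookup(const uint8_t *dir, size_t dir_len,
                      const char *name, uint32_t *size_out)
{
    assert(dir != NULL && name != NULL);
    if (dir_len < 4) {
        return -1;
    }
    uint32_t count = load_be32(dir);
    for (uint32_t i = 0; i < count; i++) {
        size_t off = 4 + (size_t)i * DIR_ENTRY_SIZE;
        if (off + DIR_ENTRY_SIZE > dir_len) {
            return -1; /* directory truncated */
        }
        const uint8_t *entry = dir + off;
        /* name starts at byte 8 of the entry and is NUL-terminated */
        if (strncmp((const char *)(entry + 8), name, DIR_NAME_SIZE) == 0) {
            if (size_out != NULL) {
                *size_out = load_be32(entry);
            }
            return load_be16(entry + 4);
        }
    }
    return -1;
}
```

The returned key can then be written to the selector register to read the item's bytes. Decoding the fields byte by byte avoids relying on the host's endianness or on struct packing.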
+
+== fw_cfg_add_string() ==
+
+Instead of a starting pointer and size, this function accepts a pointer
+to a NUL-terminated ascii string, and inserts a newly allocated copy of
+the string (including the NUL terminator) into the fw_cfg device data
+structure.
+
+== fw_cfg_add_iXX() ==
+
+Insert an XX-bit item, where XX may be 16, 32, or 64. These functions
+will convert a 16-, 32-, or 64-bit integer to little-endian, then add
+a dynamically allocated copy of the appropriately sized item to fw_cfg
+under the given selector key value.
+
+== fw_cfg_add_file() ==
+
+Given a filename (i.e., fw_cfg item name), starting pointer, and size,
+create an item as a raw "blob" of the given size. Unlike fw_cfg_add_bytes()
+above, the next available selector key (above 0x0020, FW_CFG_FILE_FIRST)
+will be used, and a new entry will be added to the file directory structure
+(at key 0x0019), containing the item name, blob size, and automatically
+assigned selector key value. The data referenced by the starting pointer
+is only linked, NOT copied, into the fw_cfg data structure.
+
+== fw_cfg_add_file_callback() ==
+
+Like fw_cfg_add_file(), but additionally sets pointers to a callback
+function (and opaque argument), which will be executed host-side by
+QEMU each time a byte is read by the guest from this particular item.
+
+NOTE: The callback function is given the opaque argument set by
+fw_cfg_add_file_callback(), but also the current data offset,
+allowing it the option of only acting upon specific offset values
+(e.g., 0, before the first data byte of the selected item is
+returned to the guest).
+
+== fw_cfg_modify_file() ==
+
+Given a filename (i.e., fw_cfg item name), starting pointer, and size,
+completely replace the configuration item referenced by the given item
+name with the new given blob. If an existing blob is found, its
+callback information is removed, and a pointer to the old data is
+returned to allow the caller to free it, helping avoid memory leaks.
+If a configuration item does not already exist under the given item
+name, a new item will be created as with fw_cfg_add_file(), and NULL
+is returned to the caller. In any case, the data referenced by the
+starting pointer is only linked, NOT copied, into the fw_cfg data
+structure.
+
+== fw_cfg_add_callback() ==
+
+Like fw_cfg_add_bytes(), but additionally sets pointers to a callback
+function (and opaque argument), which will be executed host-side by
+QEMU each time a guest-side write operation to this particular item
+completes fully overwriting the item's data.
+
+NOTE: This function is deprecated, and will be completely removed
+starting with QEMU v2.4.
+
+== Externally Provided Items ==
+
+As of v2.4, "file" fw_cfg items (i.e., items with selector keys above
+FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file
+directory structure) may be inserted via the QEMU command line, using
+the following syntax:
+
+    -fw_cfg [name=]<item_name>,file=<path>
+
+where <item_name> is the fw_cfg item name, and <path> is the location
+on the host file system of a file containing the data to be inserted.
+
+NOTE: Users *SHOULD* choose item names beginning with the prefix "opt/"
+when using the "-fw_cfg" command line option, to avoid conflicting with
+item names used internally by QEMU. For instance:
+
+    -fw_cfg name=opt/my_item_name,file=./my_blob.bin
+
+Similarly, QEMU developers *SHOULD NOT* use item names prefixed with
+"opt/" when inserting items programmatically, e.g. via fw_cfg_add_file().
diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
new file mode 100644
index 00000000..667a8628
--- /dev/null
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -0,0 +1,96 @@
+
+Device Specification for Inter-VM shared memory device
+------------------------------------------------------
+
+The Inter-VM shared memory device is designed to share a region of memory to
+userspace in multiple virtual guests.  The memory region does not belong to any
+guest, but is a POSIX memory object on the host.  Optionally, the device may
+support sending interrupts to other guests sharing the same memory region.
+
+
+The Inter-VM PCI device
+-----------------------
+
+*BARs*
+
+The device supports three BARs.  BAR0 is a 1 Kbyte MMIO region to support
+registers.  BAR1 is used for MSI-X when it is enabled in the device.  BAR2 is
+used to map the shared memory object from the host.  The size of BAR2 is
+specified when the guest is started and must be a power of 2 in size.
+
+*Registers*
+
+The device currently supports 4 registers of 32-bits each.  Registers
+are used for synchronization between guests sharing the same memory object when
+interrupts are supported (this requires using the shared memory server).
+
+The server assigns each VM an ID number and sends this ID number to the QEMU
+process when the guest starts.
+
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12
+};
+
+The first two registers are the interrupt mask and status registers.  Mask and
+status are only used with pin-based interrupts.  They are unused with MSI
+interrupts.
+
+Status Register: The status register is set to 1 when an interrupt occurs.
+
+Mask Register: The mask register is bitwise ANDed with the interrupt status
+and the result will raise an interrupt if it is non-zero.  However, since 1 is
+the only value the status will be set to, it is only the first bit of the mask
+that has any effect.  Therefore interrupts can be masked by setting the first
+bit to 0 and unmasked by setting the first bit to 1.
+
+IVPosition Register: The IVPosition register is read-only and reports the
+guest's ID number.  The guest IDs are non-negative integers.  When using the
+server, since the server is a separate process, the VM ID will only be set when
+the device is ready (shared memory is received from the server and accessible via
+the device).  If the device is not ready, the IVPosition will return -1.
+Applications should ensure that they have a valid VM ID before accessing the
+shared memory.
+
+Doorbell Register:  To interrupt another guest, a guest must write to the
+Doorbell register.  The doorbell register is 32-bits, logically divided into
+two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low
+16-bits are the interrupt vector to trigger.  The semantics of the value
+written to the doorbell depends on whether the device is using MSI or a regular
+pin-based interrupt.  In short, MSI uses vectors while regular interrupts set the
+status register.
+
+Regular Interrupts
+
+If regular interrupts are used (due to either a guest not supporting MSI or the
+user specifying not to use them on startup) then the value written to the lower
+16-bits of the Doorbell register results is arbitrary and will trigger an
+interrupt in the destination guest.
+
+Message Signalled Interrupts
+
+A ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
+written to the Doorbell register must be between 0 and the maximum number of
+vectors the guest supports.  The lower 16 bits written to the doorbell is the
+MSI vector that will be raised in the destination guest.  The number of MSI
+vectors is configurable but it is set when the VM is started.
+
+The important thing to remember with MSI is that it is only a signal, no status
+is set (since MSI interrupts are not shared).  All information other than the
+interrupt itself should be communicated via the shared memory region.  Devices
+supporting multiple MSI vectors can use different vectors to indicate different
+events have occurred.  The semantics of interrupt vectors are left to the
+user's discretion.
+
+
+Usage in the Guest
+------------------
+
+The shared memory device is intended to be used with the provided UIO driver.
+Very little configuration is needed.  The guest should map BAR0 to access the
+registers (an array of 32-bit ints allows simple writing) and map BAR2 to
+access the shared memory region itself.  The size of the shared memory region
+is specified when the guest (or shared memory server) is started.  A guest may
+map the whole shared memory region or only part of it.
diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt
new file mode 100644
index 00000000..0adcb89a
--- /dev/null
+++ b/docs/specs/pci-ids.txt
@@ -0,0 +1,54 @@
+
+PCI IDs for qemu
+================
+
+Red Hat, Inc. donates a part of its device ID range to qemu, to be used for
+virtual devices.  The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36.
+
+Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned
+for your devices.
+
+1af4 vendor ID
+--------------
+
+The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
+Note that this allocation separate from the virtio device IDs, which are
+maintained as part of the virtio specification.
+
+1af4:1000  network device
+1af4:1001  block device
+1af4:1002  balloon device
+1af4:1003  console device
+1af4:1004  SCSI host bus adapter device
+1af4:1005  entropy generator device
+1af4:1009  9p filesystem device
+
+1af4:10f0  Available for experimental usage without registration.  Must get
+   to      official ID when the code leaves the test lab (i.e. when seeking
+1af4:10ff  upstream merge or shipping a distro/product) to avoid conflicts.
+
+1af4:1100  Used as PCI Subsystem ID for existing hardware devices emulated
+           by qemu.
+
+1af4:1110  ivshmem device (shared memory, docs/specs/ivshmem_device_spec.txt)
+
+All other device IDs are reserved.
+
+1b36 vendor ID
+--------------
+
+The 0000 -> 00ff device ID range is used as follows for QEMU-specific
+PCI devices (other than virtio):
+
+1b36:0001  PCI-PCI bridge
+1b36:0002  PCI serial port (16550A) adapter (docs/specs/pci-serial.txt)
+1b36:0003  PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
+1b36:0004  PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
+1b36:0005  PCI test device (docs/specs/pci-testdev.txt)
+1b36:0006  PCI Rocker Ethernet switch device
+1b36:0007  PCI SD Card Host Controller Interface (SDHCI)
+1b36:000a  PCI-PCI bridge (multiseat)
+
+All these devices are documented in docs/specs.
+
+The 0100 device ID is used for the QXL video card device.
diff --git a/docs/specs/pci-serial.txt b/docs/specs/pci-serial.txt
new file mode 100644
index 00000000..66c761f2
--- /dev/null
+++ b/docs/specs/pci-serial.txt
@@ -0,0 +1,34 @@
+
+QEMU pci serial devices
+=======================
+
+There is one single-port variant and two muliport-variants.  Linux
+guests out-of-the box with all cards.  There is a Windows inf file
+(docs/qemupciserial.inf) to setup the single-port card in Windows
+guests.
+
+
+single-port card
+----------------
+
+Name:   pci-serial
+PCI ID: 1b36:0002
+
+PCI Region 0:
+   IO bar, 8 bytes long, with the 16550 uart mapped to it.
+   Interrupt is wired to pin A.
+
+
+multiport cards
+---------------
+
+Name:   pci-serial-2x
+PCI ID: 1b36:0003
+
+Name:   pci-serial-4x
+PCI ID: 1b36:0004
+
+PCI Region 0:
+   IO bar, with two/four 16550 uart mapped after each other.
+   The first is at offset 0, second at offset 8, ...
+   Interrupt is wired to pin A.
diff --git a/docs/specs/pci-testdev.txt b/docs/specs/pci-testdev.txt
new file mode 100644
index 00000000..128ae222
--- /dev/null
+++ b/docs/specs/pci-testdev.txt
@@ -0,0 +1,26 @@
+pci-test is a device used for testing low level IO
+
+device implements up to two BARs: BAR0 and BAR1.
+Each BAR can be memory or IO. Guests must detect
+BAR type and act accordingly.
+ +Each BAR size is up to 4K bytes. +Each BAR starts with the following header: + +typedef struct PCITestDevHdr { +    uint8_t test;  <- write-only, starts a given test number +    uint8_t width_type; <- read-only, type and width of access for a given test. +                           1,2,4 for byte,word or long write. +                           any other value if test not supported on this BAR +    uint8_t pad0[2]; +    uint32_t offset; <- read-only, offset in this BAR for a given test +    uint32_t data;    <- read-only, data to use for a given test +    uint32_t count;  <- for debugging. number of writes detected. +    uint8_t name[]; <- for debugging. 0-terminated ASCII string. +} PCITestDevHdr; + +All registers are little endian. + +device is expected to always implement tests 0 to N on each BAR, and to add new +tests with higher numbers.  In this way a guest can scan test numbers until it +detects an access type that it does not support on this BAR, then stop. diff --git a/docs/specs/ppc-spapr-hcalls.txt b/docs/specs/ppc-spapr-hcalls.txt new file mode 100644 index 00000000..667b3fa0 --- /dev/null +++ b/docs/specs/ppc-spapr-hcalls.txt @@ -0,0 +1,78 @@ +When used with the "pseries" machine type, QEMU-system-ppc64 implements +a set of hypervisor calls using a subset of the server "PAPR" specification +(IBM internal at this point), which is also what IBM's proprietary hypervisor +adheres to. + +The subset is selected based on the requirements of Linux as a guest. + +In addition to those calls, we have added our own private hypervisor +calls which are mostly used as a private interface between the firmware +running in the guest and QEMU. + +All those hypercalls start at hcall number 0xf000 which corresponds +to an implementation-specific range in PAPR. + +- H_RTAS (0xf000) + +RTAS is a set of runtime services generally provided by the firmware +inside the guest to the operating system.
It predates the existence +of hypervisors (it was originally an extension to Open Firmware) and +is still used by PAPR to provide various services that aren't performance +sensitive. + +We currently implement the RTAS services in QEMU itself. The actual RTAS +"firmware" blob in the guest is a small stub of a few instructions which +calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU. + +Arguments: + +  r3 : H_RTAS (0xf000) +  r4 : Guest physical address of RTAS parameter block + +Returns: + +  H_SUCCESS   : Successfully called the RTAS function (RTAS result +                will have been stored in the parameter block) +  H_PARAMETER : Unknown token + +- H_LOGICAL_MEMOP (0xf001) + +When the guest runs in "real mode" (in powerpc lingo this means +with MMU disabled, i.e. guest effective == guest physical), it only +has access to a subset of memory and no IOs. + +PAPR provides a set of hypervisor calls to perform cacheable or +non-cacheable accesses to any guest physical addresses that the +guest can use in order to access IO devices while in real mode. + +This is typically used by the firmware running in the guest. + +However, doing a hypercall for each access is extremely inefficient +(even more so when running KVM) when accessing the frame buffer. In +that case, things like scrolling become unusably slow. + +This hypercall allows the guest to request a "memory op" to be applied +to memory. The supported memory ops at this point are to copy a range +of memory (supports overlap of source and destination) and XOR which +is used by our SLOF firmware to invert the screen.
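
As a rough illustration of the two memory ops just described, the Python sketch below models guest memory as a bytearray. The function name `logical_memop`, the string return values and the bytearray model are hypothetical and chosen for illustration only. The element size encoding (0..3 for 1, 2, 4, 8 bytes) and the op codes (0 = copy, 1 = xor) follow the argument list of this hypercall. The sketch assumes that the XOR op inverts every bit of the source (XOR with all ones), which fits the "invert the screen" use case; the text does not spell out the operand, so treat that as an assumption.

```python
H_SUCCESS = "H_SUCCESS"      # placeholder values, not the PAPR numeric codes
H_PARAMETER = "H_PARAMETER"

OP_COPY = 0
OP_XOR = 1


def logical_memop(mem, dst, src, esize, count, op):
    """Apply a copy or invert memory op to `mem` (a bytearray).

    esize is the encoded element size: 0, 1, 2, 3 mean 1, 2, 4, 8 bytes.
    """
    if esize not in (0, 1, 2, 3) or op not in (OP_COPY, OP_XOR):
        return H_PARAMETER
    if count < 0 or dst < 0 or src < 0:
        return H_PARAMETER
    nbytes = count << esize
    if dst + nbytes > len(mem) or src + nbytes > len(mem):
        return H_PARAMETER
    # Read the whole source range first so that overlapping source and
    # destination ranges behave like memmove().
    data = bytes(mem[src:src + nbytes])
    if op == OP_XOR:
        # Assumption: "xor" means XOR with all ones, i.e. bitwise invert.
        data = bytes(b ^ 0xFF for b in data)
    mem[dst:dst + nbytes] = data
    return H_SUCCESS
```

A firmware scroll operation would map to a single copy call with overlapping ranges, which is why the overlap guarantee matters for the frame buffer case.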
+ +Arguments: + +  r3: H_LOGICAL_MEMOP (0xf001) +  r4: Guest physical address of destination +  r5: Guest physical address of source +  r6: Individual element size +        0 = 1 byte +        1 = 2 bytes +        2 = 4 bytes +        3 = 8 bytes +  r7: Number of elements +  r8: Operation +        0 = copy +        1 = xor + +Returns: + +  H_SUCCESS   : Success +  H_PARAMETER : Invalid argument + diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt new file mode 100644 index 00000000..46e07196 --- /dev/null +++ b/docs/specs/ppc-spapr-hotplug.txt @@ -0,0 +1,305 @@ += sPAPR Dynamic Reconfiguration = + +sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration +to handle hotplugging of dynamic "physical" resources like PCI cards, or +"logical"/paravirtual resources like memory, CPUs, and "physical" +host-bridges, which are generally managed by the host/hypervisor and provided +to guests as virtualized resources. The specifics of dynamic-reconfiguration +are documented extensively in PAPR+ v2.7, Section 13.1. This document +provides a summary of that information as it applies to the implementation +within QEMU. + +== Dynamic-reconfiguration Connectors == + +To manage hotplug/unplug of these resources, a firmware abstraction known as +a Dynamic Resource Connector (DRC) is used to assign a particular dynamic +resource to the guest, and provide an interface for the guest to manage +configuration/removal of the resource associated with it. + +== Device-tree description of DRCs == + +A set of 4 Open Firmware device tree array properties are used to describe +the name/index/power-domain/type of each DRC allocated to a guest at +boot-time. There may be multiple sets of these arrays, rooted at different +paths in the device tree depending on the type of resource the DRCs manage. + +In some cases, the DRCs themselves may be provided by a dynamic resource, +such as the DRCs managing PCI slots on a hotplugged PHB. 
In this case the +arrays would be fetched as part of the device tree retrieval interfaces +for hotplugged resources described under "Guest->Host interface". + +The array properties are described below. Each entry/element in an array +describes the DRC identified by the element in the corresponding position +of ibm,drc-indexes: + +ibm,drc-names: +  first 4-bytes: BE-encoded integer denoting the number of entries +  each entry: a NULL-terminated <name> string encoded as a byte array + +  <name> values for logical/virtual resources are defined in PAPR+ v2.7, +  Section 13.5.2.4, and basically consist of the type of the resource +  followed by a space and a numerical value that's unique across resources +  of that type. + +  <name> values for "physical" resources such as PCI or VIO devices are +  defined as being "location codes", which are the "location labels" of +  each encapsulating device, starting from the chassis down to the +  individual slot for the device, concatenated by a hyphen. This provides +  a mapping of resources to a physical location in a chassis for debugging +  purposes. For QEMU, this mapping is less important, so we assign a +  location code that conforms to naming specifications, but is simply a +  location label for the slot by itself to simplify the implementation. +  The naming convention for location labels is documented in detail in +  PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>" +  for PCI/VIO device slots, where <n> is unique across all PCI/VIO +  device slots. 
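
The ibm,drc-names layout described above (a big-endian entry count followed by NUL-terminated strings) can be sketched in Python as follows. The helper names `build_drc_names` and `parse_drc_names` are hypothetical, and the sample names in the usage note are illustrative values that follow the naming rules above; they are not taken from a real guest.

```python
import struct


def build_drc_names(names):
    """Encode a list of DRC names as an ibm,drc-names style byte array."""
    out = struct.pack(">I", len(names))       # first 4 bytes: BE entry count
    for name in names:
        out += name.encode("ascii") + b"\0"   # each entry is NUL-terminated
    return out


def parse_drc_names(prop):
    """Decode an ibm,drc-names style byte array into a list of strings."""
    (count,) = struct.unpack_from(">I", prop, 0)
    names = []
    pos = 4
    for _ in range(count):
        end = prop.index(b"\0", pos)          # ValueError if unterminated
        names.append(prop[pos:end].decode("ascii"))
        pos = end + 1
    return names
```

For example, `parse_drc_names(build_drc_names(["CPU 0", "C1"]))` round-trips a logical CPU name and a "C<n>" slot location label. The same framing applies to ibm,drc-types, which also stores NUL-terminated strings after the count.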
+ +ibm,drc-indexes: +  first 4-bytes: BE-encoded integer denoting the number of entries +  each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs +    in the machine + +  <index> is arbitrary, but in the case of QEMU we try to maintain the +  convention used to assign them to pSeries guests on pHyp: + +    bit[31:28]: integer encoding of <type>, where <type> is: +                  1 for CPU resource +                  2 for PHB resource +                  3 for VIO resource +                  4 for PCI resource +                  8 for Memory resource +    bit[27:0]: integer encoding of <id>, where <id> is unique across +                 all resources of specified type + +ibm,drc-power-domains: +  first 4-bytes: BE-encoded integer denoting the number of entries +  each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the +    power domain the resource will be assigned to. In the case of QEMU +    we associated all resources with a "live insertion" domain, where the +    power is assumed to be managed automatically. The integer value for +    this domain is a special value of -1. + + +ibm,drc-types: +  first 4-bytes: BE-encoded integer denoting the number of entries +  each entry: a NULL-terminated <type> string encoded as a byte array + +  <type> is assigned as follows: +    "CPU" for a CPU +    "PHB" for a physical host-bridge +    "SLOT" for a VIO slot +    "28" for a PCI slot +    "MEM" for memory resource + +== Guest->Host interface to manage dynamic resources == + +Each DRC is given a globally unique DRC Index, and resources associated with +a particular DRC are configured/managed by the guest via a number of RTAS +calls which reference individual DRCs based on the DRC index. This can be +considered the guest->host interface. 
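
The DRC index convention described under ibm,drc-indexes (type in bits 31:28, per-type id in bits 27:0) and the framing of that property can be sketched in Python as below. The function and constant names are hypothetical; only the bit layout and the type codes come from the text.

```python
import struct

DRC_TYPE_SHIFT = 28
DRC_ID_MASK = (1 << DRC_TYPE_SHIFT) - 1

# Type codes as listed for bit[31:28] of the DRC index.
DRC_TYPES = {1: "CPU", 2: "PHB", 3: "VIO", 4: "PCI", 8: "Memory"}


def make_drc_index(type_code, res_id):
    """Combine a type code and a per-type id into a 32-bit DRC index."""
    if type_code not in DRC_TYPES:
        raise ValueError("unknown DRC type code: %r" % (type_code,))
    if not 0 <= res_id <= DRC_ID_MASK:
        raise ValueError("id does not fit in 28 bits: %r" % (res_id,))
    return (type_code << DRC_TYPE_SHIFT) | res_id


def split_drc_index(index):
    """Return (type_code, res_id) for a 32-bit DRC index."""
    return (index >> DRC_TYPE_SHIFT) & 0xF, index & DRC_ID_MASK


def parse_drc_indexes(prop):
    """Decode an ibm,drc-indexes style array: BE count, then BE 32-bit entries."""
    (count,) = struct.unpack_from(">I", prop, 0)
    return list(struct.unpack_from(">%dI" % count, prop, 4))
```

A guest that receives a DRC index in an RTAS call argument could use `split_drc_index` to tell, for instance, a PCI slot (type 4) from a memory resource (type 8), although the text notes the index value is formally arbitrary.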
+ +rtas-set-power-level: +  arg[0]: integer identifying power domain +  arg[1]: new power level for the domain, 0-100 +  output[0]: status, 0 on success +  output[1]: power level after command + +  Set the power level for a specified power domain + +rtas-get-power-level: +  arg[0]: integer identifying power domain +  output[0]: status, 0 on success +  output[1]: current power level + +  Get the power level for a specified power domain + +rtas-set-indicator: +  arg[0]: integer identifying sensor/indicator type +  arg[1]: index of sensor, for DR-related sensors this is generally the +          DRC index +  arg[2]: desired sensor value +  output[0]: status, 0 on success + +  Set the state of an indicator or sensor. For the purpose of this document we +  focus on the indicator/sensor types associated with a DRC. The types are: + +    9001: isolation-state, controls/indicates whether a device has been made +          accessible to a guest + +          supported sensor values: +            0: isolate, device is made unaccessible by guest OS +            1: unisolate, device is made available to guest OS + +    9002: dr-indicator, controls "visual" indicator associated with device + +          supported sensor values: +            0: inactive, resource may be safely removed +            1: active, resource is in use and cannot be safely removed +            2: identify, used to visually identify slot for interactive hotplug +            3: action, in most cases, used in the same manner as identify + +    9003: allocation-state, generally only used for "logical" DR resources to +          request the allocation/deallocation of a resource prior to acquiring +          it via isolation-state->unisolate, or after releasing it via +          isolation-state->isolate, respectively. for "physical" DR (like PCI +          hotplug/unplug) the pre-allocation of the resource is implied and +          this sensor is unused. 
+ +          supported sensor values: +            0: unusable, tell firmware/system the resource can be +               unallocated/reclaimed and added back to the system resource pool +            1: usable, request the resource be allocated/reserved for use by +               guest OS +            2: exchange, used to allocate a spare resource to use for fail-over +               in certain situations. unused in QEMU +            3: recover, used to reclaim a previously allocated resource that's +               not currently allocated to the guest OS. unused in QEMU + +rtas-get-sensor-state: +  arg[0]: integer identifying sensor/indicator type +  arg[1]: index of sensor, for DR-related sensors this is generally the +          DRC index +  output[0]: status, 0 on success + +  Used to read an indicator or sensor value. + +  For DR-related operations, the only noteworthy sensor is dr-entity-sense, +  which has a type value of 9003, as allocation-state does in the case of +  rtas-set-indicator. 
The semantics/encodings of the sensor values are distinct +  however: + +  supported sensor values for dr-entity-sense (9003) sensor: +    0: empty, +         for physical resources: DRC/slot is empty +         for logical resources: unused +    1: present, +         for physical resources: DRC/slot is populated with a device/resource +         for logical resources: resource has been allocated to the DRC +    2: unusable, +         for physical resources: unused +         for logical resources: DRC has no resource allocated to it +    3: exchange, +         for physical resources: unused +         for logical resources: resource available for exchange (see +           allocation-state sensor semantics above) +    4: recovery, +         for physical resources: unused +         for logical resources: resource available for recovery (see +           allocation-state sensor semantics above) + +rtas-ibm-configure-connector: +  arg[0]: guest physical address of 4096-byte work area buffer +  arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero +          if a prior RTAS response indicated a need for additional memory +  output[0]: status: +               0: completed transmittal of device-tree node +               1: instruct guest to prepare for next DT sibling node +               2: instruct guest to prepare for next DT child node +               3: instruct guest to prepare for next DT property +               4: instruct guest to ascend to parent DT node +               5: instruct guest to provide additional work-area buffer +                  via arg[1] +            990x: instruct guest that operation took too long and to try +                  again later + +  Used to fetch an OF device-tree description of the resource associated with +  a particular DRC. The DRC index is encoded in the first 4-bytes of the first +  work area buffer. 
+ +  Work area layout, using 4-byte offsets: +    wa[0]: DRC index of the DRC to fetch device-tree nodes from +    wa[1]: 0 (hard-coded) +    wa[2]: for next-sibling/next-child response: +             wa offset of null-terminated string denoting the new node's name +           for next-property response: +             wa offset of null-terminated string denoting new property's name +    wa[3]: for next-property response (unused otherwise): +             byte-length of new property's value +    wa[4]: for next-property response (unused otherwise): +             new property's value, encoded as an OFDT-compatible byte array + +== hotplug/unplug events == + +For most DR operations, the hypervisor will issue host->guest add/remove events +using the EPOW/check-exception notification framework, where the host issues a +check-exception interrupt, then provides an RTAS event log via an +rtas-check-exception call issued by the guest in response. This framework is +documented by PAPR+ v2.7, and already in use by QEMU for generating powerdown +requests via EPOW events. + +For DR, this framework has been extended to include hotplug events, which were +previously unneeded due to direct manipulation of DR-related guest userspace +tools by host-level management such as an HMC. This level of management is not +applicable to PowerKVM, hence the reason for extending the notification +framework to support hotplug events. + +Note that these events are not yet formally part of the PAPR+ specification, +but support for this format has already been implemented in DR-related +guest tools such as powerpc-utils/librtas, as well as kernel patches that have +been submitted to handle in-kernel processing of memory/cpu-related hotplug +events[1], and is planned for formal inclusion in the PAPR+ specification.
The +hotplug-specific payload is implemented in QEMU as follows (with all values +encoded in big-endian format): + +struct rtas_event_log_v6_hp { +#define SECTION_ID_HOTPLUG              0x4850 /* HP */ +    struct section_header { +        uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */ +        uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp), +                                         * plus the length of the DRC name +                                         * if a DRC name identifier is +                                         * specified for hotplug_identifier +                                         */ +        uint8_t section_version;        /* version 1 */ +        uint8_t section_subtype;        /* unused */ +        uint16_t creator_component_id;  /* unused */ +    } hdr; +#define RTAS_LOG_V6_HP_TYPE_CPU         1 +#define RTAS_LOG_V6_HP_TYPE_MEMORY      2 +#define RTAS_LOG_V6_HP_TYPE_SLOT        3 +#define RTAS_LOG_V6_HP_TYPE_PHB         4 +#define RTAS_LOG_V6_HP_TYPE_PCI         5 +    uint8_t hotplug_type;               /* type of resource/device */ +#define RTAS_LOG_V6_HP_ACTION_ADD       1 +#define RTAS_LOG_V6_HP_ACTION_REMOVE    2 +    uint8_t hotplug_action;             /* action (add/remove) */ +#define RTAS_LOG_V6_HP_ID_DRC_NAME      1 +#define RTAS_LOG_V6_HP_ID_DRC_INDEX     2 +#define RTAS_LOG_V6_HP_ID_DRC_COUNT     3 +    uint8_t hotplug_identifier;         /* type of the resource identifier, +                                         * which serves as the discriminator +                                         * for the 'drc' union field below +                                         */ +    uint8_t reserved; +    union { +        uint32_t index;                 /* DRC index of resource to take action +                                         * on +                                         */ +        uint32_t count;                 /* number of DR resources to take +                                         * 
action on (guest chooses which) +                                         */ +        char name[1];                   /* string representing the name of the +                                         * DRC to take action on +                                         */ +    } drc; +} QEMU_PACKED; + +== ibm,lrdr-capacity == + +ibm,lrdr-capacity is a property in the /rtas device tree node that identifies +the dynamic reconfiguration capabilities of the guest. It consists of a triple +consisting of <phys>, <size> and <maxcpus>. + +  <phys>, encoded in BE format represents the maximum address in bytes and +  hence the maximum memory that can be allocated to the guest. + +  <size>, encoded in BE format represents the size increments in which +  memory can be hot-plugged to the guest. + +  <maxcpus>, a BE-encoded integer, represents the maximum number of +  processors that the guest can have. + +pseries guests use this property to note the maximum allowed CPUs for the +guest. + +[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867 diff --git a/docs/specs/pvpanic.txt b/docs/specs/pvpanic.txt new file mode 100644 index 00000000..c7bbacc7 --- /dev/null +++ b/docs/specs/pvpanic.txt @@ -0,0 +1,39 @@ +PVPANIC DEVICE +============== + +pvpanic device is a simulated ISA device, through which a guest panic +event is sent to qemu, and a QMP event is generated. This allows +management apps (e.g. libvirt) to be notified and respond to the event. + +The management app has the option of waiting for GUEST_PANICKED events, +and/or polling for guest-panicked RunState, to learn when the pvpanic +device has fired a panic event. + +ISA Interface +------------- + +pvpanic exposes a single I/O port, by default 0x505. On read, the bits +recognized by the device are set. Software should ignore bits it doesn't +recognize. On write, the bits not recognized by the device are ignored. +Software should set only bits both itself and the device recognize. 
+Currently, only bit 0 is recognized, setting it indicates a guest panic +has happened. + +ACPI Interface +-------------- + +pvpanic device is defined with ACPI ID "QEMU0001". Custom methods: + +RDPT:       To determine whether guest panic notification is supported. +Arguments:  None +Return:     Returns a byte, bit 0 set to indicate guest panic +            notification is supported. Other bits are reserved and +            should be ignored. + +WRPT:       To send a guest panic event +Arguments:  Arg0 is a byte, with bit 0 set to indicate guest panic has +            happened. Other bits are reserved and should be cleared. +Return:     None + +The ACPI device will automatically refer to the right port in case it +is modified. diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt new file mode 100644 index 00000000..121dfc8c --- /dev/null +++ b/docs/specs/qcow2.txt @@ -0,0 +1,362 @@ +== General == + +A qcow2 image file is organized in units of constant size, which are called +(host) clusters. A cluster is the unit in which all allocations are done, +both for actual guest data and for image metadata. + +Likewise, the virtual disk as seen by the guest is divided into (guest) +clusters of the same size. + +All numbers in qcow2 are stored in Big Endian byte order. + + +== Header == + +The first cluster of a qcow2 image contains the file header: + +    Byte  0 -  3:   magic +                    QCOW magic string ("QFI\xfb") + +          4 -  7:   version +                    Version number (valid values are 2 and 3) + +          8 - 15:   backing_file_offset +                    Offset into the image file at which the backing file name +                    is stored (NB: The string is not null terminated). 0 if the +                    image doesn't have a backing file. + +         16 - 19:   backing_file_size +                    Length of the backing file name in bytes. Must not be +                    longer than 1023 bytes. 
Undefined if the image doesn't have +                    a backing file. + +         20 - 23:   cluster_bits +                    Number of bits that are used for addressing an offset +                    within a cluster (1 << cluster_bits is the cluster size). +                    Must not be less than 9 (i.e. 512 byte clusters). + +                    Note: qemu as of today has an implementation limit of 2 MB +                    as the maximum cluster size and won't be able to open images +                    with larger cluster sizes. + +         24 - 31:   size +                    Virtual disk size in bytes + +         32 - 35:   crypt_method +                    0 for no encryption +                    1 for AES encryption + +         36 - 39:   l1_size +                    Number of entries in the active L1 table + +         40 - 47:   l1_table_offset +                    Offset into the image file at which the active L1 table +                    starts. Must be aligned to a cluster boundary. + +         48 - 55:   refcount_table_offset +                    Offset into the image file at which the refcount table +                    starts. Must be aligned to a cluster boundary. + +         56 - 59:   refcount_table_clusters +                    Number of clusters that the refcount table occupies + +         60 - 63:   nb_snapshots +                    Number of snapshots contained in the image + +         64 - 71:   snapshots_offset +                    Offset into the image file at which the snapshot table +                    starts. Must be aligned to a cluster boundary. + +If the version is 3 or higher, the header has the following additional fields. +For version 2, the values are assumed to be zero, unless specified otherwise +in the description of a field. + +         72 -  79:  incompatible_features +                    Bitmask of incompatible features. An implementation must +                    fail to open an image if an unknown bit is set. 
+ +                    Bit 0:      Dirty bit.  If this bit is set then refcounts +                                may be inconsistent, make sure to scan L1/L2 +                                tables to repair refcounts before accessing the +                                image. + +                    Bit 1:      Corrupt bit.  If this bit is set then any data +                                structure may be corrupt and the image must not +                                be written to (unless for regaining +                                consistency). + +                    Bits 2-63:  Reserved (set to 0) + +         80 -  87:  compatible_features +                    Bitmask of compatible features. An implementation can +                    safely ignore any unknown bits that are set. + +                    Bit 0:      Lazy refcounts bit.  If this bit is set then +                                lazy refcount updates can be used.  This means +                                marking the image file dirty and postponing +                                refcount metadata updates. + +                    Bits 1-63:  Reserved (set to 0) + +         88 -  95:  autoclear_features +                    Bitmask of auto-clear features. An implementation may only +                    write to an image with unknown auto-clear features if it +                    clears the respective bits from this field first. + +                    Bits 0-63:  Reserved (set to 0) + +         96 -  99:  refcount_order +                    Describes the width of a reference count block entry (width +                    in bits: refcount_bits = 1 << refcount_order). For version 2 +                    images, the order is always assumed to be 4 +                    (i.e. refcount_bits = 16). +                    This value may not exceed 6 (i.e. refcount_bits = 64). + +        100 - 103:  header_length +                    Length of the header structure in bytes. 
For version 2 +                    images, the length is always assumed to be 72 bytes. + +Directly after the image header, optional sections called header extensions can +be stored. Each extension has a structure like the following: + +    Byte  0 -  3:   Header extension type: +                        0x00000000 - End of the header extension area +                        0xE2792ACA - Backing file format name +                        0x6803f857 - Feature name table +                        other      - Unknown header extension, can be safely +                                     ignored + +          4 -  7:   Length of the header extension data + +          8 -  n:   Header extension data + +          n -  m:   Padding to round up the header extension size to the next +                    multiple of 8. + +Unless stated otherwise, each header extension type shall appear at most once +in the same image. + +If the image has a backing file then the backing file name should be stored in +the remaining space between the end of the header extension area and the end of +the first cluster. It is not allowed to store other data here, so that an +implementation can safely modify the header and add extensions without harming +data of compatible features that it doesn't support. Compatible features that +need space for additional data can use a header extension. + + +== Feature name table == + +The feature name table is an optional header extension that contains the name +for features used by the image. It can be used by applications that don't know +the respective feature (e.g. because the feature was introduced only later) to +display a useful error message. + +The number of entries in the feature name table is determined by the length of +the header extension data. 
Each entry looks like this: + +    Byte       0:   Type of feature (select feature bitmap) +                        0: Incompatible feature +                        1: Compatible feature +                        2: Autoclear feature + +               1:   Bit number within the selected feature bitmap (valid +                    values: 0-63) + +          2 - 47:   Feature name (padded with zeros, but not necessarily null +                    terminated if it has full length) + + +== Host cluster management == + +qcow2 manages the allocation of host clusters by maintaining a reference count +for each host cluster. A refcount of 0 means that the cluster is free, 1 means +that it is used, and >= 2 means that it is used and any write access must +perform a COW (copy on write) operation. + +The refcounts are managed in a two-level table. The first level is called the +refcount table and has a variable size (which is stored in the header). The +refcount table can cover multiple clusters, however it needs to be contiguous +in the image file. + +It contains pointers to the second level structures which are called refcount +blocks and are exactly one cluster in size. + +Given an offset into the image file, the refcount of its cluster can be obtained +as follows: + +    refcount_block_entries = (cluster_size * 8 / refcount_bits) + +    refcount_block_index = (offset / cluster_size) % refcount_block_entries +    refcount_table_index = (offset / cluster_size) / refcount_block_entries + +    refcount_block = load_cluster(refcount_table[refcount_table_index]); +    return refcount_block[refcount_block_index]; + +Refcount table entry: + +    Bit  0 -  8:    Reserved (set to 0) + +         9 - 63:    Bits 9-63 of the offset into the image file at which the +                    refcount block starts. Must be aligned to a cluster +                    boundary. + +                    If this is 0, the corresponding refcount block has not yet +                    been allocated.
All refcounts managed by this refcount block +                    are 0. + +Refcount block entry (x = refcount_bits - 1): + +    Bit  0 -  x:    Reference count of the cluster. If refcount_bits implies a +                    sub-byte width, note that bit 0 means the least significant +                    bit in this context. + + +== Cluster mapping == + +Just as for refcounts, qcow2 uses a two-level structure for the mapping of +guest clusters to host clusters. They are called the L1 and L2 tables. + +The L1 table has a variable size (stored in the header) and may use multiple +clusters, however it must be contiguous in the image file. L2 tables are +exactly one cluster in size. + +Given an offset into the virtual disk, the offset into the image file can be +obtained as follows: + +    l2_entries = (cluster_size / sizeof(uint64_t)) + +    l2_index = (offset / cluster_size) % l2_entries +    l1_index = (offset / cluster_size) / l2_entries + +    l2_table = load_cluster(l1_table[l1_index]); +    cluster_offset = l2_table[l2_index]; + +    return cluster_offset + (offset % cluster_size) + +L1 table entry: + +    Bit  0 -  8:    Reserved (set to 0) + +         9 - 55:    Bits 9-55 of the offset into the image file at which the L2 +                    table starts. Must be aligned to a cluster boundary. If the +                    offset is 0, the L2 table and all clusters described by this +                    L2 table are unallocated. + +        56 - 62:    Reserved (set to 0) + +             63:    0 for an L2 table that is unused or requires COW, 1 if its +                    refcount is exactly one. This information is only accurate +                    in the active L1 table. + +L2 table entry: + +    Bit  0 -  61:   Cluster descriptor + +              62:   0 for standard clusters +                    1 for compressed clusters + +              63:   0 for a cluster that is unused or requires COW, 1 if its +                    refcount is exactly one.
This information is only accurate +                    in L2 tables that are reachable from the active L1 +                    table. + +Standard Cluster Descriptor: + +    Bit       0:    If set to 1, the cluster reads as all zeros. The host +                    cluster offset can be used to describe a preallocation, +                    but it won't be used for reading data from this cluster, +                    nor is data read from the backing file if the cluster is +                    unallocated. + +                    With version 2, this is always 0. + +         1 -  8:    Reserved (set to 0) + +         9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a +                    cluster boundary. If the offset is 0, the cluster is +                    unallocated. + +        56 - 61:    Reserved (set to 0) + + +Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)): + +    Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a +                    cluster boundary! + +       x+1 - 61:    Compressed size of the images in sectors of 512 bytes + +If a cluster is unallocated, read requests shall read the data from the backing +file (except if bit 0 in the Standard Cluster Descriptor is set). If there is +no backing file or the backing file is smaller than the image, they shall read +zeros for all parts that are not covered by the backing file. + + +== Snapshots == + +qcow2 supports internal snapshots. Their basic principle of operation is to +switch the active L1 table, so that a different set of host clusters are +exposed to the guest. + +When creating a snapshot, the L1 table should be copied and the refcount of all +L2 tables and clusters reachable from this L1 table must be increased, so that +a write causes a COW and isn't visible in other snapshots.
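
Because a snapshot switch only swaps the L1 table, the lookup path from the "Cluster mapping" section stays the same for every snapshot. The Python sketch below shows the index arithmetic and the decoding of an L2 entry for standard clusters. The function names and the returned dictionary keys are hypothetical. Compressed descriptors are returned undecoded on purpose, so the sketch makes no assumption about their exact bit split.

```python
def split_guest_offset(offset, cluster_bits):
    """Return (l1_index, l2_index, offset_in_cluster) for a guest offset."""
    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8            # sizeof(uint64_t) == 8
    cluster_no = offset // cluster_size
    return (cluster_no // l2_entries,
            cluster_no % l2_entries,
            offset % cluster_size)


# Bits 9-55 of a standard cluster descriptor hold the host cluster offset.
HOST_OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)


def decode_l2_entry(entry):
    """Decode the flag bits and, for standard clusters, the descriptor."""
    info = {
        "refcount_is_one": bool((entry >> 63) & 1),   # bit 63
        "compressed": bool((entry >> 62) & 1),        # bit 62
    }
    descriptor = entry & ((1 << 62) - 1)              # bits 0-61
    if info["compressed"]:
        info["raw_descriptor"] = descriptor           # left undecoded here
    else:
        info["reads_as_zero"] = bool(descriptor & 1)  # bit 0 (version 3 only)
        info["host_offset"] = descriptor & HOST_OFFSET_MASK
    return info
```

With 64 KiB clusters (cluster_bits = 16) each L2 table holds 8192 entries, so one L1 entry covers 512 MiB of guest address space. After a snapshot is loaded, the `refcount_is_one` flag must be recomputed from the refcount structures before it can be trusted, as the next paragraph explains.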
+ +When loading a snapshot, bit 63 of all entries in the new active L1 table and +all L2 tables referenced by it must be reconstructed from the refcount table +as it doesn't need to be accurate in inactive L1 tables. + +A directory of all snapshots is stored in the snapshot table, a contiguous area +in the image file, whose starting offset and length are given by the header +fields snapshots_offset and nb_snapshots. The entries of the snapshot table +have variable length, depending on the length of ID, name and extra data. + +Snapshot table entry: + +    Byte 0 -  7:    Offset into the image file at which the L1 table for the +                    snapshot starts. Must be aligned to a cluster boundary. + +         8 - 11:    Number of entries in the L1 table of the snapshots + +        12 - 13:    Length of the unique ID string describing the snapshot + +        14 - 15:    Length of the name of the snapshot + +        16 - 19:    Time at which the snapshot was taken in seconds since the +                    Epoch + +        20 - 23:    Subsecond part of the time at which the snapshot was taken +                    in nanoseconds + +        24 - 31:    Time that the guest was running until the snapshot was +                    taken in nanoseconds + +        32 - 35:    Size of the VM state in bytes. 0 if no VM state is saved. +                    If there is VM state, it starts at the first cluster +                    described by first L1 table entry that doesn't describe a +                    regular guest cluster (i.e. VM state is stored like guest +                    disk content, except that it is stored at offsets that are +                    larger than the virtual disk presented to the guest) + +        36 - 39:    Size of extra data in the table entry (used for future +                    extensions of the format) + +        variable:   Extra data for future extensions. Unknown fields must be +                    ignored. 
Currently defined are (offset relative to snapshot +                    table entry): + +                    Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM +                                    state is saved. If this field is present, +                                    the 32-bit value in bytes 32-35 is ignored. + +                    Byte 48 - 55:   Virtual disk size of the snapshot in bytes + +                    Version 3 images must include extra data at least up to +                    byte 55. + +        variable:   Unique ID string for the snapshot (not null terminated) + +        variable:   Name of the snapshot (not null terminated) + +        variable:   Padding to round up the snapshot table entry size to the +                    next multiple of 8. diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt new file mode 100644 index 00000000..7982e058 --- /dev/null +++ b/docs/specs/qed_spec.txt @@ -0,0 +1,138 @@ +=Specification= + +The file format looks like this: + + +----------+----------+----------+-----+ + | cluster0 | cluster1 | cluster2 | ... | + +----------+----------+----------+-----+ + +The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters. + +Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster. + +All fields are little-endian. 
+ +==Header== + Header { +     uint32_t magic;               /* QED\0 */ +  +     uint32_t cluster_size;        /* in bytes */ +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */ +     uint32_t header_size;         /* in clusters */ +  +     uint64_t features;            /* format feature bits */ +     uint64_t compat_features;     /* compat feature bits */ +     uint64_t autoclear_features;  /* self-resetting feature bits */ + +     uint64_t l1_table_offset;     /* in bytes */ +     uint64_t image_size;          /* total logical image size, in bytes */ +  +     /* if (features & QED_F_BACKING_FILE) */ +     uint32_t backing_filename_offset; /* in bytes from start of header */ +     uint32_t backing_filename_size;   /* in bytes */ + } + +Field descriptions: +* ''cluster_size'' must be a power of 2 in range [2^12, 2^26]. +* ''table_size'' must be a power of 2 in range [1, 16]. +* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters. +* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps.  They work as follows: +** An image with unknown ''features'' bits enabled must not be opened.  File format changes that are not backwards-compatible must use ''features'' bits. +** An image with unknown ''compat_features'' bits enabled can be opened safely.  The unknown features are simply ignored and represent backwards-compatible changes to the file format. +** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits.  This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled by a new program later. +* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''. +* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes. 
+* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form.  It is not NUL-terminated and has no alignment constraints.  The string must be stored within the first ''header_size'' clusters.  The backing filename may be an absolute path or relative to the image file. + +Feature bits: +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file. +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use. +* QED_F_BACKING_FORMAT_NO_PROBE = 0x04.  The backing file is a raw disk image and no file format autodetection should be attempted.  This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants. + +There are currently no defined ''compat_features'' or ''autoclear_features'' bits. + +Fields predicated on a feature bit are only used when that feature is set.  The fields always take up header space, regardless of whether or not the feature bit is set. + +==Tables== + +Tables provide the translation from logical offsets in the block device to cluster offsets in the file. + + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t)) +   + Table { +     uint64_t offsets[TABLE_NOFFSETS]; + } + +The tables are organized as follows: + +                    +----------+ +                    | L1 table | +                    +----------+ +               ,------'  |  '------. +          +----------+   |    +----------+ +          | L2 table |  ...   | L2 table | +          +----------+        +----------+ +      ,------'  |  '------. + +----------+   |    +----------+ + |   Data   |  ...   |   Data   | + +----------+        +----------+ + +A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables. 
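The header constraints and the table size arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and the last helper simply multiplies out what two table levels can address.

```python
# Hypothetical helpers for QED geometry; not part of any QEMU API.

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def valid_geometry(cluster_size, table_size):
    """Check the documented ranges for cluster_size and table_size."""
    return (is_power_of_two(cluster_size)
            and 2 ** 12 <= cluster_size <= 2 ** 26
            and is_power_of_two(table_size)
            and 1 <= table_size <= 16)


def table_noffsets(cluster_size, table_size):
    """Number of uint64_t offsets in one table (TABLE_NOFFSETS)."""
    return table_size * cluster_size // 8


def max_image_size(cluster_size, table_size):
    """Largest logical size reachable through an L1 and an L2 table."""
    n = table_noffsets(cluster_size, table_size)
    return n * n * cluster_size
```

For cluster_size=64 KB and table_size=4, each table holds 32768 offsets in 256 KB, and two levels of such tables cover 2^46 bytes (64 TB) of logical image.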
+ +The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table: + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size + +L1, L2, and data cluster offsets must be aligned to header.cluster_size.  The following offsets have special meanings: + +===L2 table offsets=== +* 0 - unallocated.  The L2 table is not yet allocated. + +===Data cluster offsets=== +* 0 - unallocated.  The data cluster is not yet allocated. +* 1 - zero.  The data cluster contents are all zeroes and no cluster is allocated. + +Future format extensions may wish to store per-offset information.  The least significant 12 bits of an offset are reserved for this purpose and must be set to zero.  Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed. + +===Unallocated L2 tables and data clusters=== +Reads to an unallocated area of the image file access the backing file.  If there is no backing file, then zeroes are produced.  The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes. + +Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated.  The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written. + +===Zero data clusters=== +Zero data clusters are a space-efficient way of storing zeroed regions of the image. + +Reads to a zero data cluster produce zeroes.  Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file. + +Writes to a zero data cluster cause a new data cluster to be allocated.  The new data cluster is populated with zeroes and the data being written. 
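The read rules for unallocated and zero clusters reduce to a small decision function. The sketch below is illustrative only: the function name and the returned labels are made up here, and it does not model a backing file that is shorter than the image (reads past its end produce zeroes).

```python
# Special offset values defined by the QED specification.
UNALLOCATED = 0
ZERO_CLUSTER = 1


def read_source(l2_table_offset, data_cluster_offset, has_backing_file):
    """Say where the data for a read comes from.

    Returns "image", "backing" or "zeroes". data_cluster_offset is
    ignored when the L2 table itself is unallocated.
    """
    if l2_table_offset == UNALLOCATED or data_cluster_offset == UNALLOCATED:
        # Unallocated areas fall through to the backing file, if any.
        return "backing" if has_backing_file else "zeroes"
    if data_cluster_offset == ZERO_CLUSTER:
        # A zero cluster stops the lookup; the backing file is not read.
        return "zeroes"
    return "image"
```

The only case where the backing file setting changes the outcome is the unallocated one, which is exactly the difference between offset 0 and offset 1.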
+ +===Logical offset translation=== +Logical offsets are translated into cluster offsets as follows: + +  table_bits table_bits    cluster_bits +  <--------> <--------> <---------------> + +----------+----------+-----------------+ + | L1 index | L2 index |     byte offset | + +----------+----------+-----------------+ +  +       Structure of a logical offset + + offset_mask = ~(cluster_size - 1) # mask for the image file byte offset +  + def logical_to_cluster_offset(l1_index, l2_index, byte_offset): +   l2_offset = l1_table[l1_index] +   l2_table = load_table(l2_offset) +   cluster_offset = l2_table[l2_index] & offset_mask +   return cluster_offset + byte_offset + +==Consistency checking== + +This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit. + +The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure.  A dirty image must be checked on open because its metadata may not be consistent. + +The consistency check covers the following invariants: +# Each cluster is referenced once and only once.  It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables.  A cluster has been leaked if it has no references. +# Offsets must be within the image file size and must be ''cluster_size'' aligned. +# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table. + +The consistency check process starts from ''l1_table_offset'' and scans all L2 tables.  After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed. 
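The "Structure of a logical offset" diagram in the QED specification can be turned into code that splits a logical offset into the three fields expected by logical_to_cluster_offset(). This is a sketch with a hypothetical function name; it assumes cluster_size and table_size are powers of two, as the header rules require.

```python
def split_logical_offset(pos, cluster_size, table_size):
    """Return (l1_index, l2_index, byte_offset) for a logical offset."""
    noffsets = table_size * cluster_size // 8      # TABLE_NOFFSETS
    cluster_bits = cluster_size.bit_length() - 1   # log2(cluster_size)
    table_bits = noffsets.bit_length() - 1         # log2(TABLE_NOFFSETS)

    byte_offset = pos & (cluster_size - 1)
    l2_index = (pos >> cluster_bits) & (noffsets - 1)
    l1_index = pos >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset
```

With cluster_size=64 KB and table_size=4, cluster_bits is 16 and table_bits is 15, so the L1 index starts at bit 31 of the logical offset.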
diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.txt new file mode 100644 index 00000000..1c743515 --- /dev/null +++ b/docs/specs/rocker.txt @@ -0,0 +1,1014 @@ +Rocker Network Switch Register Programming Guide +Copyright (c) Scott Feldman <sfeldma@gmail.com> +Copyright (c) Neil Horman <nhorman@tuxdriver.com> +Version 0.11, 12/29/2014 + +LICENSE +======= + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 2 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +SECTION 1: Introduction +======================= + +Overview +-------- + +This document describes the hardware/software interface for the Rocker switch +device.  The intended audience is authors of OS drivers and device emulation +software. + +Notations and Conventions +------------------------- + +o In register descriptions, [n:m] indicates a range from bit n to bit m, +inclusive. +o Use of leading 0x indicates a hexadecimal number. +o Use of leading 0b indicates a binary number. +o The use of RSVD or Reserved indicates that a bit or field is reserved for +future use. +o Field width is in bytes, unless otherwise noted. +o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear +on read +o TLV values in network-byte-order are designated with (N). 
+ + +SECTION 2: PCI Configuration Registers +====================================== + +PCI Configuration Space +----------------------- + +Each switch instance registers as a PCI device with PCI configuration space: + +	offset	width	description		value +	--------------------------------------------- +	0x0	2	Vendor ID		0x1b36 +	0x2	2	Device ID		0x0006 +	0x4	4	Command/Status +	0x8	1	Revision ID		0x01 +	0x9	3	Class code		0x2800 +	0xC	1	Cache line size +	0xD	1	Latency timer +	0xE	1	Header type +	0xF	1	Built-in self test +	0x10	4	Base address low +	0x14	4	Base address high +	0x18-28		Reserved +	0x2C	2	Subsystem vendor ID	* +	0x2E	2	Subsystem ID		* +	0x30-38		Reserved +	0x3C	1	Interrupt line +	0x3D	1	Interrupt pin		0x00 +	0x3E	1	Min grant		0x00 +	0x3F	1	Max latency		0x00 +	0x40	1	TRDY timeout +	0x41	1	Retry count +	0x42	2	Reserved + + +* Assigned by sub-system implementation + +SECTION 3: Memory-Mapped Register Space +======================================= + +There are two memory-mapped BARs.  BAR0 maps device register space and is +0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in +size, allowing for 256 MSI-X vectors. + +All registers are 4 or 8 bytes long.  It is assumed host software will access 4 +byte registers with one 4-byte access, and 8 byte registers with either two +4-byte accesses or a single 8-byte access.  In the case of two 4-byte accesses, +access must be lower and then upper 4-bytes, in that order. + +BAR0 device register space is organized as follows: + +	offset		description +	------------------------------------------------------ +	0x0000-0x000f	Bogus registers to catch misbehaving +			drivers.  Writes do nothing.  Reads +			back as 0xDEADBABE. +	0x0010-0x00ff	Test registers +	0x0300-0x03ff	General purpose registers +	0x1000-0x1fff	Descriptor control + +Holes in register space are reserved.  Writes to reserved registers do nothing. +Reads of reserved registers read back as 0. 
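The rule for accessing an 8-byte register with two 4-byte accesses (lower half first, then upper half) can be sketched as follows. The write32 callback is a hypothetical stand-in for whatever MMIO write primitive the driver environment provides, and placing the upper half at offset + 4 is an assumption that follows from the little-endian register layout described elsewhere in this guide.

```python
def write64_split(write32, offset, value):
    """Write a 64-bit register as two 32-bit writes, lower half first.

    write32(offset, value) is a caller-supplied 32-bit MMIO write
    (hypothetical; not a real library call).
    """
    write32(offset, value & 0xFFFFFFFF)               # lower 4 bytes first
    write32(offset + 4, (value >> 32) & 0xFFFFFFFF)   # then upper 4 bytes
```

For example, recording the calls for a write of 0x1122334455667788 to TEST_REG64 at 0x0018 yields a write of 0x55667788 to 0x18 followed by a write of 0x11223344 to 0x1c.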
+ +No fancy stuff like write-combining is enabled on any of the registers. + +BAR1 MSI-X register space is organized as follows: + +	offset		description +	------------------------------------------------------ +	0x0000-0x0fff	MSI-X vector table (256 vectors total) +	0x1000-0x1fff	MSI-X PBA table + + +SECTION 4: Interrupts, DMA, and Endianness +========================================== + +PCI Interrupts +-------------- + +The device supports only MSI-X interrupts.  BAR1 memory-mapped region contains +the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors. + +The vector assignment is: + +	vector		description +	----------------------------------------------------- +	0		Command descriptor ring completion +	1		Event descriptor ring completion +	2		Test operation completion +	3		RSVD +	4-255		Tx and Rx descriptor ring completion +			  Tx vector is even +			  Rx vector is odd + +A MSI-X vector table entry is 16 bytes: + +	field		offset	width	description +	------------------------------------------------------------- +	lower_addr	0x0	4	[31:2] message address[31:2] +					[1:0] Rsvd (4 byte alignment +						    required) +	upper_addr	0x4	4	[31:19] Rsvd +					[14:0] message address[46:32] +	data		0x8	4	message data[31:0] +	control		0xc	4	[31:1] Rsvd +					[0] mask (0 = enable, +						  1 = masked) + +Software should install the Interrupt Service Routine (ISR) before any ports +are enabled or any commands are issued on the command ring. + +DMA Operations +-------------- + +DMA operations are used for packet DMA to/from the CPU, command and event +processing.  Command processing includes statistical counters and table dumps, +table insertion/deletion, and more.  Event processing provides an async +notification method for device-originating events.  Each DMA operation has a +set of control registers to manage a descriptor ring.  
The descriptor rings are +allocated from contiguous host DMA-able memory and registers specify the ring's +base address, size and current head and tail indices.  Software always writes +the head, and hardware always writes the tail. + +The higher-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion +of a descriptor.  Software will clear this bit when posting a descriptor to the +ring, and hardware will set this bit when the descriptor is complete. + +Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries. +Descriptor rings' base address must be 8-byte aligned.  Descriptors must be +packed within ring.  Each descriptor in each ring must also be aligned on an 8 +byte boundary.  Each descriptor ring will have these registers: + +	DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W) +	DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W) +	DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W) +	DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R) +	DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W) +	DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W) +	DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W) + +Where x is descriptor ring index: + +	index		ring +	-------------------- +	0		CMD +	1		EVENT +	2		TX (port 0) +	3		RX (port 0) +	4		TX (port 1) +	5		RX (port 1) +	. +	. +	. +	124		TX (port 61) +	125		RX (port 61) +	126		Resv +	127		Resv + +Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero.  HEAD cannot be +written past TAIL.  To do so would wrap the ring.  An empty ring is when HEAD +== TAIL.  A full ring is when HEAD is one position behind TAIL.  Both HEAD and +TAIL increment and modulo wrap at the ring size. 
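The register offsets, ring index assignment, and empty/full conditions can be written out as small helpers. This is a sketch with hypothetical names; the port argument is the zero-based index used in the ring index table, not the port number from the 32-bit port space.

```python
# Base offsets of the per-ring registers for ring index 0.
RING_REG_BASE = {
    "BASE_ADDR": 0x1000, "SIZE": 0x1008, "HEAD": 0x100C,
    "TAIL": 0x1010, "CTRL": 0x1014, "CREDITS": 0x1018,
}


def ring_reg_offset(ring, name):
    """BAR0 offset of register `name` for descriptor ring `ring`."""
    return RING_REG_BASE[name] + ring * 32


def tx_ring(port):
    return 2 + 2 * port      # rings 2, 4, ..., 124


def rx_ring(port):
    return 3 + 2 * port      # rings 3, 5, ..., 125


def valid_ring_size(size):
    """Power of 2, from 2 to 64K entries."""
    return 2 <= size <= 65536 and (size & (size - 1)) == 0


def ring_is_empty(head, tail):
    return head == tail


def ring_is_full(head, tail, size):
    """Full when HEAD is one position behind TAIL, modulo ring size."""
    return (head + 1) % size == tail
```

Because one slot is always left unused to tell a full ring from an empty one, a ring of N entries holds at most N - 1 outstanding descriptors.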
+ +CTRL register bits: + +	bit	name		description +	------------------------------------------------------------------------ +	[0]	CTRL_RESET	Reset the descriptor ring +	[1:31]	Reserved + +All descriptor types share some common fields: + +	field			width	description +	------------------------------------------------------------------- +	DMA_DESC_BUF_ADDR	8	Phys addr of desc payload, 8-byte +					aligned +	DMA_DESC_COOKIE		8	Desc cookie for completion matching, +					upper-most bit is reserved +	DMA_DESC_BUF_SIZE	2	Desc payload size in bytes +	DMA_DESC_TLV_SIZE	2	Desc payload total size in bytes +					used for TLVs.  Must be <= +					DMA_DESC_BUF_SIZE. +	DMA_DESC_COMP_ERR	2	Completion status of associated +					desc payload.  High order bit is +					clear on new descs, toggled by +					hw for completed items. + +To support forward- and backward-compatibility, descriptor and completion +payloads are specified in TLV format.  Fields are packed with Type=field name, +Length=field length, and Value=field value.  Software will ignore unknown fields +filled in by the switch.  Likewise, the switch will ignore unknown fields +filled in by software. + +Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned.  The +value within a TLV is also 8-byte aligned.  The (packed, 8 byte) TLV header is: + +	field	width	description +	----------------------------- +	type	4	TLV type +	len	2	TLV value length +	pad	2	Reserved + +The alignment requirements for descriptors and TLVs are to avoid unaligned +access exceptions in software.  Note that the payload for each TLV is also +8 byte aligned. + +Figure 1 shows an example descriptor buffer with two TLVs. 
+ +                  <------- 8 bytes -------> + +  8-byte  +––––+  +–––––––––––+–––––+–––––+                     +–+ +  align           |   type    | len | pad |    TLV#1 hdr          | +                  +–––––––––––+–––––+–––––+    (len=22)           | +                  |                       |                       | +                  |  value                |    TLV#1 value        | +                  |                       |    (padded to 8-byte  | +                  |                 +–––––+     alignment)        | +                  |                 |/////|                       | +   8-byte +––––+  +–––––––––––+–––––––––––+                       | +   align          |   type    | len | pad |    TLV#2 hdr    DESC_BUF_SIZE +                  +–––––+–––––+–––––+–––––+    (len=2)            | +                  |value|/////////////////|    TLV#2 value        | +                  +–––––+/////////////////|                       | +                  |///////////////////////|                       | +                  |///////////////////////|                       | +                  |///////////////////////|                       | +                  |////////unused/////////|                       | +                  |////////space//////////|                       | +                  |///////////////////////|                       | +                  |///////////////////////|                       | +                  |///////////////////////|                       | +                  +–––––––––––––––––––––––+                     +–+ + +				fig. 1 + +TLVs can be nested within the NEST TLV type. + +Interrupt credits +^^^^^^^^^^^^^^^^^ + +MSI-X vectors used for descriptor ring completions use a credit mechanism for +efficient device, PCIe bus, OS and driver operations.  Each descriptor ring has +a credit count which represents the number of outstanding descriptors to be +processed by the driver.  
As the device marks descriptors complete, the credit +count is incremented.  As the driver processes those outstanding descriptors, +it returns credits back to the device.  This way, the device knows the driver's +progress and can make decisions about when to fire the next interrupt or not. +When the credit count is zero, and the first descriptors are posted for the +driver, a single interrupt is fired.  Once the interrupt is fired, the +interrupt is disabled (auto-masked*).  In response to the interrupt, the driver +will process descriptors and PIO write a returned credit value for that +descriptor ring.  If the driver returns all credits (the driver caught up with +the device and there is no outstanding work), then the interrupt is unmasked, +but not fired.  If only partial credits are returned, the interrupt remains +masked but the device generates an interrupt, signaling the driver that more +outstanding work is available. + +(* this masking is unrelated to the MSI-X interrupt mask register) + +Endianness +---------- + +Device registers are hard-coded to little-endian (LE).  The driver should +convert to/from host endianness to LE for device register accesses. + +Descriptors are LE.  Descriptor buffer TLVs will have LE type and length +fields, but the value field can either be LE or network-byte-order, depending +on context.  TLV values containing network packet data will be in network-byte +order.  A TLV value containing a field or mask used to compare against network +packet data is network-byte order.  For example, flow match fields (and masks) +are network-byte-order since they're matched directly, byte-by-byte, against +network packet data.  All non-network-packet TLV multi-byte values will be LE. + +TLV values in network-byte-order are designated with (N). 
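The TLV layout used in descriptor buffers (8-byte LE header, value padded to an 8-byte boundary) can be sketched with Python's struct module. The helper names and the TLV type numbers in the usage note are made up for illustration; a real consumer would parse only the first DMA_DESC_TLV_SIZE bytes of the buffer, and the caller is responsible for putting (N) values in network byte order.

```python
import struct

# TLV header: 4-byte type, 2-byte value length, 2-byte pad, little-endian.
TLV_HDR = struct.Struct("<IHH")


def pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def pack_tlv(tlv_type, value):
    """Return header plus value, zero-padded to 8-byte alignment."""
    header = TLV_HDR.pack(tlv_type, len(value), 0)
    return header + value.ljust(pad8(len(value)), b"\0")


def parse_tlvs(buf):
    """Return a list of (type, value) pairs from a packed TLV buffer."""
    tlvs, pos = [], 0
    while pos + TLV_HDR.size <= len(buf):
        tlv_type, length, _pad = TLV_HDR.unpack_from(buf, pos)
        pos += TLV_HDR.size
        tlvs.append((tlv_type, buf[pos:pos + length]))
        pos += pad8(length)      # skip the value and its padding
    return tlvs
```

Packing a 22-byte value and a 2-byte value, as in figure 1, gives 32 + 16 = 48 bytes: each TLV occupies its 8-byte header plus the value rounded up to 24 and 8 bytes respectively.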
+ + +SECTION 5: Test Registers +========================= + +Rocker has several test registers to support troubleshooting register access, +interrupt generation, and DMA operations: + +	TEST_REG, offset 0x0010, 32-bit (R/W) +	TEST_REG64, offset 0x0018, 64-bit (R/W) +	TEST_IRQ, offset 0x0020, 32-bit (R/W) +	TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W) +	TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W) +	TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W) + +Reads to TEST_REG and TEST_REG64 will read a value equal to twice the last +value written to the register.  The 32-bit and 64-bit versions are for testing +32-bit and 64-bit host accesses. + +A vector can be written to TEST_IRQ and the device will generate an interrupt +for that vector. + +To test basic DMA operations, allocate a DMA-able host buffer and put the +buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE.  Then, write to +TEST_DMA_CTRL to manipulate the buffer contents.  TEST_DMA_CTRL operations are: + +	operation		value	description +	----------------------------------------------------------- +	TEST_DMA_CTRL_CLEAR	1	clear buffer +	TEST_DMA_CTRL_FILL	2	fill buffer bytes with 0x96 +	TEST_DMA_CTRL_INVERT	4	invert bytes in buffer + +Various buffer addresses and sizes should be tested to verify no address boundary +issue exists.  In particular, buffers that start on odd-8-byte boundary and/or +span multiple PAGE sizes should be tested. + + +SECTION 6: Ports +================ + +Physical and Logical Ports +------------------------------------ + +The switch supports up to 62 physical (front-panel) ports.  Register +PORT_PHYS_COUNT returns the actual number of physical ports available: + +	PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R) + +In addition to front-panel ports, the switch supports logical ports for +tunnels. + +Front-panel ports and logical tunnel ports are mapped into a single 32-bit port +space.  A special CPU port is assigned port 0.  The front-panel ports are +mapped to ports 1-62.  
A special loopback port is assigned port 63.  Logical +tunnel ports are assigned ports 0x00010000-0x0001ffff. +To summarize the port assignments: + +	port			mapping +	------------------------------------------------------- +	0			CPU port (for packets to/from host CPU) +	1-62			front-panel physical ports +	63			loopback port +	64-0x0000ffff		RSVD +	0x00010000-0x0001ffff	logical tunnel ports +	0x00020000-0xffffffff	RSVD + +Physical Port Mode +------------------ + +Switch front-panel ports operate in a mode.  Currently, the only mode is +OF-DPA.  OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA) +Abstract Switch Specification, Version 1.0, from Broadcom Corporation.  To +set/get the mode for front-panel ports, see port settings, below. + +Port Settings +------------- + +Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS: + +	PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R) + +	Value is port bitmap.  Bits 0 and 63 always read 0.  Bits 1-62 +	read 1 for link UP and 0 for link DOWN for respective front-panel ports. 
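The port number space and the PORT_PHYS_LINK_STATUS bitmap can be decoded with two short helpers. This is an illustrative sketch; the function names and the returned labels are not part of the device interface.

```python
def classify_port(port):
    """Map a 32-bit port number to its role in the port space."""
    if port == 0:
        return "cpu"
    if 1 <= port <= 62:
        return "physical"
    if port == 63:
        return "loopback"
    if 0x00010000 <= port <= 0x0001FFFF:
        return "tunnel"
    return "reserved"


def ports_with_link_up(link_status):
    """List front-panel ports (1-62) whose bit is set in the bitmap."""
    return [port for port in range(1, 63) if (link_status >> port) & 1]
```

The second helper only inspects bits 1-62, so it gives the same answer even if an emulated device were to leave bits 0 or 63 set.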
+ +Other properties for front-panel ports are available via DMA CMD descriptors: + +	Get PORT_SETTINGS descriptor: + +		field		width	description +		---------------------------------------------- +		PORT_SETTINGS	2	CMD_GET +		PPORT		4	Physical port # + +	Get PORT_SETTINGS completion: + +		field		width	description +		---------------------------------------------- +		PPORT		4	Physical port # +		SPEED		4	Current port interface speed, in Mbps +		DUPLEX		1	1 = Full, 0 = Half +		AUTONEG		1	1 = enabled, 0 = disabled +		MACADDR		6	Port MAC address +		MODE		1	0 = OF-DPA +		LEARNING	1	MAC address learning on port +						1 = enabled +						0 = disabled +		PHYS_NAME	<var>	Physical port name (string) + +	Set PORT_SETTINGS descriptor: + +		field		width	description +		---------------------------------------------- +		PORT_SETTINGS	2	CMD_SET +		PPORT		4	Physical port # +		SPEED		4	Port interface speed, in Mbps +		DUPLEX		1	1 = Full, 0 = Half +		AUTONEG		1	1 = enabled, 0 = disabled +		MACADDR		6	Port MAC address +		MODE		1	0 = OF-DPA + +Port Enable +----------- + +Front-panel ports are initially disabled, which means port ingress and egress +packets will be dropped.  To enable or disable a port, use PORT_PHYS_ENABLE: + +	PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W) + +	Value is bitmap of first 64 ports.  Bits 0 and 63 are ignored +	and always read as 0.  Write 1 to enable port; write 0 to disable it. +	Default is 0. + + +SECTION 7: Switch Control +========================= + +This section covers switch-wide register settings. + +Control +------- + +This register is used for low level control of the switch. 
+ +	CONTROL: offset 0x0300, 32-bit, (W) + +	bit	name		description +	------------------------------------------------------------------------ +	[0]	CONTROL_RESET	If set, device will perform reset +	[1:31]	Reserved + +Switch ID +--------- + +The switch has a SWITCH_ID to be used by software to uniquely identify the +switch: + +	SWITCH_ID: offset 0x0320, 64-bit, (R) + +	Value is opaque to switch software and no special encoding is implied. + + +SECTION 8: Events +================= + +Non-I/O asynchronous events from the device are notified to the host using the +event ring.  The TLV structure for events is: + +	field		width	description +	--------------------------------------------------- +	TYPE		4	Event type, one of: +					1: LINK_CHANGED +					2: MAC_VLAN_SEEN +	INFO		<nest>	Event info (details below) + +Link Changed Event +------------------ + +When link status changes on a physical port, this event is generated. + +	field		width	description +	--------------------------------------------------- +	INFO		<nest> +	  PPORT		4	Physical port +	  LINKUP	1	Link status: +					0: down +					1: up + +MAC VLAN Seen Event +------------------- + +When a packet ingresses on a port and the source MAC/VLAN isn't known to the +device, the device will generate this event.  In response to the event, the +driver should install to the device the MAC/VLAN on the port into the bridge +table.  Once installed, the MAC/VLAN is known on the port and this event will +no longer be generated. + +	field		width	description +	--------------------------------------------------- +	INFO		<nest> +	  PPORT		4	Physical port +	  MAC		6	MAC address +	  VLAN		2	VLAN ID + + +SECTION 9: CPU Packet Processing +================================ + +Ingress packets directed to the host CPU for further processing are delivered +in the DMA RX ring.  Likewise, host CPU originating packets destined to egress +on switch ports are scheduled by software using the DMA TX ring. 
+ +Tx Packet Processing +-------------------- + +Software schedules packets for egress on switch ports using the DMA TX ring.  A +TX descriptor buffer describes the packet location and size in host DMA-able +memory, the destination port, and any hardware-offload functions (such as L3 +payload checksum offload).  Software then bumps the descriptor head to signal +hardware of new Tx work.  In response, hardware will DMA read Tx descriptors up +to head, DMA read descriptor buffer and packet data, perform offloading +functions, and finally frame packet on wire (network).  Once packet processing +is complete, hardware will writeback status to descriptor(s) to signal to +software that Tx is complete and software resources (e.g. skb) backing packet +can be released. + +Figure 2 shows an example 3-fragment packet queued with one Tx descriptor.  A +TLV is used for each packet fragment. + +	                                           pkt frag 1 +	                                           +–––––––+  +–+ +	                                       +–––+       |    | +	                         desc buf      |   |       |    | +	                        +––––––––+     |   |       |    | +	        Tx ring     +–––+        +–––––+   |       |    | +	      +–––––––––+   |   |  TLVs  |         +–––––––+    | +	      |         +–––+   +––––––––+         pkt frag 2   | +	      | desc 0  |       |        +–––––+   +–––––––+    | +	      +–––––––––+       |  TLVs  |     +–––+       |    | +	head+–+         |       +––––––––+         |       |    | +	      | desc 1  |       |        +–––––+   +–––––––+    |pkt +	      +–––––––––+       |  TLVs  |     |                | +	      |         |       +––––––––+     |   pkt frag 3   | +	      |         |                      |   +–––––––+    | +	      +–––––––––+                      +–––+       |    | +	      |         |                          |       |    | +	      |         |                          |       |    | +	      +–––––––––+          
                |       |    |
+	      |         |                          |       |    |
+	      |         |                          |       |    |
+	      +–––––––––+                          |       |    |
+	      |         |                          +–––––––+  +–+
+	      |         |
+	      +–––––––––+
+
+				fig 2.
+
+The TLVs for Tx descriptor buffer are:
+
+	field			width	description
+	---------------------------------------------------------------------
+	PPORT			4	Destination physical port #
+	TX_OFFLOAD		1	Hardware offload modes:
+					  0: no offload
+					  1: insert IP csum (ipv4 only)
+					  2: insert TCP/UDP csum
+					  3: L3 csum calc and insert
+					     into csum offset (TX_L3_CSUM_OFF)
+					     16-bit 1's complement csum value.
+					     IPv4 pseudo-header and IP
+					     already calculated by OS
+					     and inserted.
+					  4: TSO (TCP Segmentation Offload)
+	TX_L3_CSUM_OFF		2	For L3 csum offload mode, the offset,
+					from the beginning of the packet,
+					of the csum field in the L3 header
+	TX_TSO_MSS		2	For TSO offload mode, the
+					Maximum Segment Size in bytes
+	TX_TSO_HDR_LEN		2	For TSO offload mode, the
+					length of ethernet, IP, and
+					TCP/UDP headers, including IP
+					and TCP options.
+	TX_FRAGS		<array>	Packet fragments
+	  TX_FRAG		<nest>	Packet fragment
+	    TX_FRAG_ADDR	8	DMA address of packet fragment
+	    TX_FRAG_LEN		2	Packet fragment length
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	reason
+	--------------------------------------------------------------------
+	0		OK
+	-ROCKER_ENXIO	address or data read err on desc buf or packet
+			fragment
+	-ROCKER_EINVAL	bad pport or TSO or csum offloading error
+	-ROCKER_ENOMEM	no memory for internal staging tx fragment
+
+Rx Packet Processing
+--------------------
+
+Packets ingressing on switch ports that are not forwarded by the switch but
+rather directed to the host CPU for further processing are delivered in the DMA
+RX ring.  Rx descriptor buffers are allocated by software and placed on the
+ring.  Hardware will fill Rx descriptor buffers with packet data, write the
+completion, and signal to software that a new packet is ready.  Since Rx packet
+size is not known a priori, the Rx descriptor buffer must be allocated for
+worst-case packet size.  A single Rx descriptor will contain the entire Rx
+packet data in one RX_FRAG.  Other Rx TLVs describe any hardware offloads
+performed on the packet, such as checksum validation.
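As an illustration of how software might consume the offload information delivered with an Rx packet, here is a minimal sketch in C. The bit positions are taken from the RX_FLAGS entry in the Rx TLV table of this section; the macro and function names are hypothetical and are not claimed to match any driver source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the RX_FLAGS table. */
#define SKETCH_RX_FLAG_IPV4           (1u << 0)
#define SKETCH_RX_FLAG_IPV6           (1u << 1)
#define SKETCH_RX_FLAG_CSUM_CALC      (1u << 2)
#define SKETCH_RX_FLAG_IPV4_CSUM_GOOD (1u << 3)
#define SKETCH_RX_FLAG_IP_FRAG        (1u << 4)
#define SKETCH_RX_FLAG_TCP            (1u << 5)
#define SKETCH_RX_FLAG_UDP            (1u << 6)
#define SKETCH_RX_FLAG_L4_CSUM_GOOD   (1u << 7)
#define SKETCH_RX_FLAG_FWD_OFFLOAD    (1u << 8)

/* RX_CSUM is only meaningful when the "csum calculated" flag is set. */
static bool sketch_rx_csum_valid(uint16_t rx_flags)
{
    return (rx_flags & SKETCH_RX_FLAG_CSUM_CALC) != 0;
}

/* True when the device already forwarded the packet, in which case the
 * host CPU should not forward it again. */
static bool sketch_rx_already_forwarded(uint16_t rx_flags)
{
    return (rx_flags & SKETCH_RX_FLAG_FWD_OFFLOAD) != 0;
}

/* True for a TCP or UDP packet whose L4 checksum hardware reported good. */
static bool sketch_rx_l4_csum_ok(uint16_t rx_flags)
{
    bool is_l4 = (rx_flags & (SKETCH_RX_FLAG_TCP | SKETCH_RX_FLAG_UDP)) != 0;
    return is_l4 && (rx_flags & SKETCH_RX_FLAG_L4_CSUM_GOOD) != 0;
}
```

A receive path would typically call these helpers on the RX_FLAGS value after parsing the descriptor TLVs, and fall back to software checksum validation when the hardware did not report a result.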
+
+The TLVs for Rx descriptor buffer are:
+
+	field		width	description
+	---------------------------------------------------
+	PPORT		4	Source physical port #
+	RX_FLAGS	2	Packet parsing flags:
+				  (1 << 0): IPv4 packet
+				  (1 << 1): IPv6 packet
+				  (1 << 2): csum calculated
+				  (1 << 3): IPv4 csum good
+				  (1 << 4): IP fragment
+				  (1 << 5): TCP packet
+				  (1 << 6): UDP packet
+				  (1 << 7): TCP/UDP csum good
+				  (1 << 8): Offload forward
+	RX_CSUM		2	IP calculated checksum:
+				  IPv4: IP payload csum
+				  IPv6: header and payload csum
+				(Only valid if RX_FLAGS:csum calc is set)
+	RX_FRAG_ADDR	8	DMA address of packet fragment
+	RX_FRAG_MAX_LEN	2	Packet maximum fragment length
+	RX_FRAG_LEN	2	Actual packet fragment length after receive
+
+Offload forward RX_FLAG indicates the device has already forwarded the packet
+so the host CPU should not also forward the packet.
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	reason
+	--------------------------------------------------------------------
+	0		OK
+	-ROCKER_ENXIO	address or data read err on desc buf
+	-ROCKER_ENOMEM	no memory for internal staging desc buf
+	-ROCKER_EMSGSIZE Rx descriptor buffer wasn't big enough to contain
+			packet data TLV and other TLVs.
+
+
+SECTION 10: OF-DPA Mode
+=======================
+
+OF-DPA mode allows the switch to offload flow packet processing functions to
+hardware.  An OpenFlow controller would communicate with an OpenFlow agent
+installed on the switch.  The OpenFlow agent would (directly or indirectly)
+communicate with the Rocker switch driver, which in turn would program switch
+hardware with flow functionality, as defined in OF-DPA.
The block diagram is:
+
+		+–––––––––––––––----–––+
+		|        OF            |
+		|  Remote Controller   |
+		+––––––––+––----–––––––+
+		         |
+		         |
+		+––––––––+–––––––––+
+		|       OF         |
+		|   Local Agent    |
+		+––––––––––––––––––+
+		|                  |
+		|   Rocker Driver  |
+		+––––––––––––––––––+
+		    <this spec>
+		+––––––––––––––––––+
+		|                  |
+		|   Rocker Switch  |
+		+––––––––––––––––––+
+
+To participate in flow functions, ports must be configured for OF-DPA mode
+during switch initialization.
+
+OF-DPA Flow Table Interface
+---------------------------
+
+There are commands to add, modify, delete, and get stats of flow table entries.
+The commands are issued using the DMA CMD descriptor ring.  The following
+commands are defined:
+
+	CMD_ADD:		add an entry to flow table
+	CMD_MOD:		modify an entry in flow table
+	CMD_DEL:		delete an entry from flow table
+	CMD_GET_STATS:		get stats for flow entry
+
+TLVs for add and modify commands are:
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_CMD		2	CMD_[ADD|MOD]
+	OF_DPA_TBL		2	Flow table ID
+					  0: ingress port
+					  10: vlan
+					  20: termination mac
+					  30: unicast routing
+					  40: multicast routing
+					  50: bridging
+					  60: ACL policy
+	OF_DPA_PRIORITY		4	Flow priority
+	OF_DPA_HARDTIME		4	Hard timeout for flow
+	OF_DPA_IDLETIME		4	Idle timeout for flow
+	OF_DPA_COOKIE		8	Cookie
+
+Additional TLVs based on flow table ID:
+
+Table ID 0: ingress port
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+
+Table ID 10: vlan
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to
drop
+	OF_DPA_NEW_VLAN_ID	2 (N)	new vlan ID
+
+Table ID 20: termination mac
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_IN_PPORT_MASK	4	ingress physical port number mask
+	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
+	OF_DPA_DST_MAC		6 (N)	destination MAC
+	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
+	OF_DPA_GOTO_TBL		2	only acceptable values are
+					unicast or multicast routing
+					table IDs
+	OF_DPA_OUT_PPORT	2	if specified, must be
+					controller, set zero otherwise
+
+Table ID 30: unicast routing
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
+	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
+					Must be unicast address
+	OF_DPA_DST_IP_MASK	4 (N)	IP mask.  Must be prefix mask
+	OF_DPA_DST_IPV6		16 (N)	destination IPv6 address.
+					Must be unicast address
+	OF_DPA_DST_IPV6_MASK	16 (N)	IPv6 mask. Must be prefix mask
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+	OF_DPA_GROUP_ID		4	data for GROUP action must
+					be an L3 Unicast group entry
+
+Table ID 40: multicast routing
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_SRC_IP		4 (N)	source IPv4. Optional,
+					can contain IPv4 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IP_MASK	4 (N)	IP Mask
+	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
+					Must be multicast address
+	OF_DPA_SRC_IPV6		16 (N)	source IPv6 Address. Optional.
+					Can contain IPv6 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IPV6_MASK	16 (N)	IPv6 mask.
+	OF_DPA_DST_IPV6		16 (N)	destination IPv6 Address.
Must
+					be multicast address
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+	OF_DPA_GROUP_ID		4	data for GROUP action must
+					be an L3 multicast group entry
+
+Table ID 50: bridging
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_TUNNEL_ID	4	tunnel ID
+	OF_DPA_DST_MAC		6 (N)	destination MAC
+	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+	OF_DPA_GROUP_ID		4	data for GROUP action must
+					be an L2 Interface, L2
+					Multicast, L2 Flood,
+					or L2 Overlay group entry
+					as appropriate
+	OF_DPA_TUNNEL_LPORT	4	unicast Tenant Bridging
+					flows specify a tunnel
+					logical port ID
+	OF_DPA_OUT_PPORT	2	data for OUTPUT action,
+					restricted to CONTROLLER,
+					set to 0 otherwise
+
+Table ID 60: acl policy
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_IN_PPORT_MASK	4	ingress physical port number mask
+	OF_DPA_ETHERTYPE	2 (N)	ethertype
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
+	OF_DPA_VLAN_PCP		2 (N)	vlan Priority Code Point
+	OF_DPA_VLAN_PCP_MASK	2 (N)	vlan Priority Code Point mask
+	OF_DPA_SRC_MAC		6 (N)	source MAC
+	OF_DPA_SRC_MAC_MASK	6 (N)	source MAC mask
+	OF_DPA_DST_MAC		6 (N)	destination MAC
+	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
+	OF_DPA_TUNNEL_ID	4	tunnel ID
+	OF_DPA_SRC_IP		4 (N)	source IPv4. Optional,
+					can contain IPv4 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IP_MASK	4 (N)	IP Mask
+	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
+					Must be multicast address
+	OF_DPA_DST_IP_MASK	4 (N)	IP Mask
+	OF_DPA_SRC_IPV6		16 (N)	source IPv6 Address. Optional.
+					Can contain IPv6 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IPV6_MASK	16 (N)	IPv6 mask
+	OF_DPA_DST_IPV6		16 (N)	destination IPv6 Address. Must
+					be multicast address.
+	OF_DPA_DST_IPV6_MASK	16 (N)	IPv6 mask
+	OF_DPA_SRC_ARP_IP	4 (N)	source IPv4 address in the ARP
+					payload.  Only used if ethertype
+					== 0x0806.
+	OF_DPA_SRC_ARP_IP_MASK	4 (N)	IP Mask
+	OF_DPA_IP_PROTO		1	IP protocol
+	OF_DPA_IP_PROTO_MASK	1	IP protocol mask
+	OF_DPA_IP_DSCP		1	DSCP
+	OF_DPA_IP_DSCP_MASK	1	DSCP mask
+	OF_DPA_IP_ECN		1	ECN
+	OF_DPA_IP_ECN_MASK	1	ECN mask
+	OF_DPA_L4_SRC_PORT	2 (N)	L4 source port, only for
+					TCP, UDP, or SCTP
+	OF_DPA_L4_SRC_PORT_MASK	2 (N)	L4 source port mask
+	OF_DPA_L4_DST_PORT	2 (N)	L4 destination port, only for
+					TCP, UDP, or SCTP
+	OF_DPA_L4_DST_PORT_MASK	2 (N)	L4 destination port mask
+	OF_DPA_ICMP_TYPE	1	ICMP type, only if IP
+					protocol is 1
+	OF_DPA_ICMP_TYPE_MASK	1	ICMP type mask
+	OF_DPA_ICMP_CODE	1	ICMP code
+	OF_DPA_ICMP_CODE_MASK	1	ICMP code mask
+	OF_DPA_IPV6_LABEL	4 (N)	IPv6 flow label
+	OF_DPA_IPV6_LABEL_MASK	4 (N)	IPv6 flow label mask
+	OF_DPA_GROUP_ID		4	data for GROUP action
+	OF_DPA_QUEUE_ID_ACTION	1	write the queue ID
+	OF_DPA_NEW_QUEUE_ID	1	queue ID
+	OF_DPA_VLAN_PCP_ACTION	1	write the VLAN priority
+	OF_DPA_NEW_VLAN_PCP	1	VLAN priority
+	OF_DPA_IP_DSCP_ACTION	1	write the DSCP
+	OF_DPA_NEW_IP_DSCP	1	new DSCP
+	OF_DPA_TUNNEL_LPORT	4	restrict to valid tunnel
+					logical port, set to 0
+					otherwise.
+	OF_DPA_OUT_PPORT	2	data for OUTPUT action,
+					restricted to CONTROLLER,
+					set to 0 otherwise
+	OF_DPA_CLEAR_ACTIONS	4	if 1 packets matching flow are
+					dropped (all other instructions
+					ignored)
+
+TLVs for flow delete and get stats command are:
+
+	field			width	description
+	---------------------------------------------------
+	OF_DPA_CMD		2	CMD_[DEL|GET_STATS]
+	OF_DPA_COOKIE		8	Cookie
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+	field			width	description
+	---------------------------------------------------
+	OF_DPA_STAT_DURATION	4	Flow duration
+	OF_DPA_STAT_RX_PKTS	8	Received packets
+	OF_DPA_STAT_TX_PKTS	8	Transmit packets
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	command			reason
+	--------------------------------------------------------------------
+	0		all			OK
+	-ROCKER_EFAULT	all			head or tail index outside
+						of ring
+	-ROCKER_ENXIO	all			address or data read err on
+						desc buf
+	-ROCKER_EMSGSIZE GET_STATS		cmd descriptor buffer wasn't
+						big enough to contain write-back
+						TLVs
+	-ROCKER_EINVAL	all			invalid parameters passed in
+	-ROCKER_EEXIST	ADD			entry already exists
+	-ROCKER_ENOSPC	ADD			no space left in flow table
+	-ROCKER_ENOENT	MOD|DEL|GET_STATS	cookie invalid
+
+Group Table Interface
+---------------------
+
+There are commands to add, modify, delete, and get stats of group table
+entries.  The commands are issued using the DMA CMD descriptor ring.
The
+following commands are defined:
+
+	CMD_ADD:		add an entry to group table
+	CMD_MOD:		modify an entry in group table
+	CMD_DEL:		delete an entry from group table
+	CMD_GET_STATS:		get stats for group entry
+
+TLVs for add and modify commands are:
+
+	field			width	description
+	-----------------------------------------------------------
+	FLOW_GROUP_CMD		2	CMD_[ADD|MOD]
+	FLOW_GROUP_ID		2	Flow group ID
+	FLOW_GROUP_TYPE		1	Group type:
+					  0: L2 interface
+					  1: L2 rewrite
+					  2: L3 unicast
+					  3: L2 multicast
+					  4: L2 flood
+					  5: L3 interface
+					  6: L3 multicast
+					  7: L3 ECMP
+					  8: L2 overlay
+	FLOW_VLAN_ID		2	Vlan ID (types 0, 3, 4, 6)
+	FLOW_L2_PORT		2	Port (type 0)
+	FLOW_INDEX		4	Index (all types but 0)
+	FLOW_OVERLAY_TYPE	1	Overlay sub-type (type 8):
+					  0: Flood unicast tunnel
+					  1: Flood multicast tunnel
+					  2: Multicast unicast tunnel
+					  3: Multicast multicast tunnel
+	FLOW_GROUP_ACTION		nest
+	  FLOW_GROUP_ID		2	next group ID in chain (all
+					types except 0)
+	  FLOW_OUT_PORT		4	egress port (types 0, 8)
+	  FLOW_POP_VLAN_TAG	1	strip outer VLAN tag (type 1
+					only)
+	  FLOW_VLAN_ID		2	(types 1, 5)
+	  FLOW_SRC_MAC		6	(types 1, 2, 5)
+	  FLOW_DST_MAC		6	(types 1, 2)
+
+TLVs for flow delete and get stats command are:
+
+	field			width	description
+	-----------------------------------------------------------
+	FLOW_GROUP_CMD		2	CMD_[DEL|GET_STATS]
+	FLOW_GROUP_ID		2	Flow group ID
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+	field			width	description
+	---------------------------------------------------
+	FLOW_GROUP_ID		2	Flow group ID
+	FLOW_STAT_DURATION	4	Flow duration
+	FLOW_STAT_REF_COUNT	4	Flow reference count
+	FLOW_STAT_BUCKET_COUNT	4	Flow bucket count
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	command			reason
+	--------------------------------------------------------------------
+	0
	all			OK
+	-ROCKER_EFAULT	all			head or tail index outside
+						of ring
+	-ROCKER_ENXIO	all			address or data read err on
+						desc buf
+	-ROCKER_ENOSPC	GET_STATS		cmd descriptor buffer wasn't
+						big enough to contain write-back
+						TLVs
+	-ROCKER_EINVAL	ADD|MOD			invalid parameters passed in
+	-ROCKER_EEXIST	ADD			entry already exists
+	-ROCKER_ENOSPC	ADD			no space left in flow table
+	-ROCKER_ENOENT	MOD|DEL|GET_STATS	group ID invalid
+	-ROCKER_EBUSY	DEL			group reference count non-zero
+	-ROCKER_ENODEV	ADD			next group ID doesn't exist
+
+
+
+References
+==========
+
+[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
+Version 1.0, from Broadcom Corporation, February 21, 2014.
diff --git a/docs/specs/standard-vga.txt b/docs/specs/standard-vga.txt
new file mode 100644
index 00000000..19d2a745
--- /dev/null
+++ b/docs/specs/standard-vga.txt
@@ -0,0 +1,81 @@
+
+QEMU Standard VGA
+=================
+
+Exists in two variants, for isa and pci.
+
+command line switches:
+    -vga std               [ picks isa for -M isapc, otherwise pci ]
+    -device VGA            [ pci variant ]
+    -device isa-vga        [ isa variant ]
+    -device secondary-vga  [ legacy-free pci variant ]
+
+
+PCI spec
+--------
+
+Applies to the pci variant only for obvious reasons.
+
+PCI ID: 1234:1111
+
+PCI Region 0:
+   Framebuffer memory, 16 MB in size (by default).
+   Size is tunable via vga_mem_mb property.
+
+PCI Region 1:
+   Reserved (so we have the option to make the framebuffer bar 64bit).
+
+PCI Region 2:
+   MMIO bar, 4096 bytes in size (qemu 1.3+)
+
+PCI ROM Region:
+   Holds the vgabios (qemu 0.14+).
+
+
+The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER
+instead of PCI_CLASS_DISPLAY_VGA.
+
+
+IO ports used
+-------------
+
+Doesn't apply to the legacy-free pci variant, use the MMIO bar instead.
+
+03c0 - 03df : standard vga ports
+01ce        : bochs vbe interface index port
+01cf        : bochs vbe interface data port (x86 only)
+01d0        : bochs vbe interface data port
+
+
+Memory regions used
+-------------------
+
+0xe0000000 : Framebuffer memory, isa variant only.
+
+The pci variant used to mirror the framebuffer bar here, qemu 0.14+
+stops doing that (except when in -M pc-$old compat mode).
+
+
+MMIO area spec
+--------------
+
+Likewise applies to the pci variant only for obvious reasons.
+
+0000 - 03ff : reserved, for possible virtio extension.
+0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1.
+              word access is supported, bytes are written
+              in little endian order (aka index port first),
+              so indexed registers can be updated with a
+              single mmio write (and thus only one vmexit).
+0500 - 0515 : bochs dispi interface registers, mapped flat
+              without index/data ports.  Use (index << 1)
+              as offset for (16bit) register access.
+
+0600 - 0607 : qemu extended registers.  qemu 2.2+ only.
+              The pci revision is 2 (or greater) when
+              these registers are present.  The registers
+              are 32bit.
+  0600      : qemu extended register region size, in bytes.
+  0604      : framebuffer endianness register.
+              - 0xbebebebe indicates big endian.
+              - 0x1e1e1e1e indicates little endian.
diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
new file mode 100644
index 00000000..650bb181
--- /dev/null
+++ b/docs/specs/vhost-user.txt
@@ -0,0 +1,266 @@
+Vhost-user Protocol
+===================
+
+Copyright (c) 2014 Virtual Open Systems Sarl.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+===================
+
+This protocol is aiming to complement the ioctl interface used to control the
+vhost implementation in the Linux kernel.
It implements the control plane needed
+to establish virtqueue sharing with a user space process on the same host. It
+uses communication over a Unix domain socket to share file descriptors in the
+ancillary data of the message.
+
+The protocol defines 2 sides of the communication, master and slave. Master is
+the application that shares its virtqueues, in our case QEMU. Slave is the
+consumer of the virtqueues.
+
+In the current implementation QEMU is the Master, and the Slave is intended to
+be a software Ethernet switch running in user space, such as Snabbswitch.
+
+Master and slave can be either a client (i.e. connecting) or server (listening)
+in the socket communication.
+
+Message Specification
+---------------------
+
+Note that all numbers are in the machine native byte order. A vhost-user message
+consists of 3 header fields and a payload:
+
+------------------------------------
+| request | flags | size | payload |
+------------------------------------
+
+ * Request: 32-bit type of the request
+ * Flags: 32-bit bit field:
+   - Lower 2 bits are the version (currently 0x01)
+   - Bit 2 is the reply flag - needs to be sent on each reply from the slave
+ * Size - 32-bit size of the payload
+
+
+Depending on the request type, payload can be:
+
+ * A single 64-bit integer
+   -------
+   | u64 |
+   -------
+
+   u64: a 64-bit unsigned integer
+
+ * A vring state description
+   ---------------
+   | index | num |
+   ---------------
+
+   Index: a 32-bit index
+   Num: a 32-bit number
+
+ * A vring address description
+   --------------------------------------------------------------
+   | index | flags | size | descriptor | used | available | log |
+   --------------------------------------------------------------
+
+   Index: a 32-bit vring index
+   Flags: a 32-bit vring flags
+   Descriptor: a 64-bit user address of the vring descriptor table
+   Used: a 64-bit user address of the vring used ring
+   Available: a 64-bit user address of the vring available
ring
+   Log: a 64-bit guest address for logging
+
+ * Memory regions description
+   ---------------------------------------------------
+   | num regions | padding | region0 | ... | region7 |
+   ---------------------------------------------------
+
+   Num regions: a 32-bit number of regions
+   Padding: 32-bit
+
+   A region is:
+   -----------------------------------------------------
+   | guest address | size | user address | mmap offset |
+   -----------------------------------------------------
+
+   Guest address: a 64-bit guest address of the region
+   Size: a 64-bit size
+   User address: a 64-bit user address
+   mmap offset: 64-bit offset where region starts in the mapped memory
+
+In QEMU the vhost-user message is implemented with the following struct:
+
+typedef struct VhostUserMsg {
+    VhostUserRequest request;
+    uint32_t flags;
+    uint32_t size;
+    union {
+        uint64_t u64;
+        struct vhost_vring_state state;
+        struct vhost_vring_addr addr;
+        VhostUserMemory memory;
+    };
+} QEMU_PACKED VhostUserMsg;
+
+Communication
+-------------
+
+The protocol for vhost-user is based on the existing implementation of vhost
+for the Linux Kernel. Most messages that can be sent via the Unix domain socket
+implementing vhost-user have an equivalent ioctl to the kernel implementation.
+
+The communication consists of master sending message requests and slave sending
+message replies. Most of the requests don't require replies. Here is a list of
+the ones that do:
+
+ * VHOST_USER_GET_FEATURES
+ * VHOST_USER_GET_VRING_BASE
+
+There are several messages that the master sends with file descriptors passed
+in the ancillary data:
+
+ * VHOST_USER_SET_MEM_TABLE
+ * VHOST_USER_SET_LOG_FD
+ * VHOST_USER_SET_VRING_KICK
+ * VHOST_USER_SET_VRING_CALL
+ * VHOST_USER_SET_VRING_ERR
+
+If Master is unable to send the full message or receives a wrong reply it will
+close the connection. An optional reconnection mechanism can be implemented.
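The header layout from the Message Specification and the "wrong reply" rule above can be illustrated with a minimal C sketch. All names here are hypothetical and chosen for this example; only the bit layout (version in the lower 2 bits, reply flag in bit 2) comes from the text. Checking that the reply carries the same request type as the message sent is an assumption about what counts as a wrong reply, not something this document states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; the bit layout follows the "Flags" description. */
#define SKETCH_VERSION_MASK 0x3u       /* lower 2 bits: protocol version */
#define SKETCH_REPLY_FLAG   (1u << 2)  /* bit 2: set on each slave reply */
#define SKETCH_VERSION      0x1u       /* current version per this spec */

/* The three 32-bit header fields that precede the payload. */
struct sketch_hdr {
    uint32_t request;
    uint32_t flags;
    uint32_t size;
};

static uint32_t sketch_hdr_version(const struct sketch_hdr *h)
{
    return h->flags & SKETCH_VERSION_MASK;
}

static bool sketch_hdr_is_reply(const struct sketch_hdr *h)
{
    return (h->flags & SKETCH_REPLY_FLAG) != 0;
}

/*
 * Master-side acceptance check for a reply.  Assumption: a reply is
 * "wrong" if it has a different request type, lacks the reply flag,
 * or uses an unexpected version.
 */
static bool sketch_reply_ok(uint32_t sent_request,
                            const struct sketch_hdr *reply)
{
    return reply->request == sent_request &&
           sketch_hdr_is_reply(reply) &&
           sketch_hdr_version(reply) == SKETCH_VERSION;
}
```

A master following the rule above would close the connection whenever a check like sketch_reply_ok() fails.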
+
+Message types
+-------------
+
+ * VHOST_USER_GET_FEATURES
+
+      Id: 1
+      Equivalent ioctl: VHOST_GET_FEATURES
+      Master payload: N/A
+      Slave payload: u64
+
+      Get from the underlying vhost implementation the features bitmask.
+
+ * VHOST_USER_SET_FEATURES
+
+      Id: 2
+      Equivalent ioctl: VHOST_SET_FEATURES
+      Master payload: u64
+
+      Enable features in the underlying vhost implementation using a bitmask.
+
+ * VHOST_USER_SET_OWNER
+
+      Id: 3
+      Equivalent ioctl: VHOST_SET_OWNER
+      Master payload: N/A
+
+      Issued when a new connection is established. It sets the current Master
+      as an owner of the session. This can be used on the Slave as a
+      "session start" flag.
+
+ * VHOST_USER_RESET_OWNER
+
+      Id: 4
+      Equivalent ioctl: VHOST_RESET_OWNER
+      Master payload: N/A
+
+      Issued when a connection is about to be closed. The Master will no
+      longer own this connection (and will usually close it).
+
+ * VHOST_USER_SET_MEM_TABLE
+
+      Id: 5
+      Equivalent ioctl: VHOST_SET_MEM_TABLE
+      Master payload: memory regions description
+
+      Sets the memory map regions on the slave so it can translate the vring
+      addresses. In the ancillary data there is an array of file descriptors
+      for each memory mapped region. The size and ordering of the fds matches
+      the number and ordering of memory regions.
+
+ * VHOST_USER_SET_LOG_BASE
+
+      Id: 6
+      Equivalent ioctl: VHOST_SET_LOG_BASE
+      Master payload: u64
+
+      Sets the logging base address.
+
+ * VHOST_USER_SET_LOG_FD
+
+      Id: 7
+      Equivalent ioctl: VHOST_SET_LOG_FD
+      Master payload: N/A
+
+      Sets the logging file descriptor, which is passed as ancillary data.
+
+ * VHOST_USER_SET_VRING_NUM
+
+      Id: 8
+      Equivalent ioctl: VHOST_SET_VRING_NUM
+      Master payload: vring state description
+
+      Sets the number of vrings for this owner.
+
+ * VHOST_USER_SET_VRING_ADDR
+
+      Id: 9
+      Equivalent ioctl: VHOST_SET_VRING_ADDR
+      Master payload: vring address description
+      Slave payload: N/A
+
+      Sets the addresses of the different aspects of the vring.
+
+ * VHOST_USER_SET_VRING_BASE
+
+      Id: 10
+      Equivalent ioctl: VHOST_SET_VRING_BASE
+      Master payload: vring state description
+
+      Sets the base offset in the available vring.
+
+ * VHOST_USER_GET_VRING_BASE
+
+      Id: 11
+      Equivalent ioctl: VHOST_GET_VRING_BASE
+      Master payload: vring state description
+      Slave payload: vring state description
+
+      Get the available vring base offset.
+
+ * VHOST_USER_SET_VRING_KICK
+
+      Id: 12
+      Equivalent ioctl: VHOST_SET_VRING_KICK
+      Master payload: u64
+
+      Set the event file descriptor for adding buffers to the vring. It
+      is passed in the ancillary data.
+      Bits (0-7) of the payload contain the vring index. Bit 8 is the
+      invalid FD flag. This flag is set when there is no file descriptor
+      in the ancillary data. This signals that polling should be used
+      instead of waiting for a kick.
+
+ * VHOST_USER_SET_VRING_CALL
+
+      Id: 13
+      Equivalent ioctl: VHOST_SET_VRING_CALL
+      Master payload: u64
+
+      Set the event file descriptor to signal when buffers are used. It
+      is passed in the ancillary data.
+      Bits (0-7) of the payload contain the vring index. Bit 8 is the
+      invalid FD flag. This flag is set when there is no file descriptor
+      in the ancillary data. This signals that polling will be used
+      instead of waiting for the call.
+
+ * VHOST_USER_SET_VRING_ERR
+
+      Id: 14
+      Equivalent ioctl: VHOST_SET_VRING_ERR
+      Master payload: u64
+
+      Set the event file descriptor to signal when error occurs. It
+      is passed in the ancillary data.
+      Bits (0-7) of the payload contain the vring index. Bit 8 is the
+      invalid FD flag.
This flag is set when there is no file descriptor
+      in the ancillary data.
diff --git a/docs/specs/vmw_pvscsi-spec.txt b/docs/specs/vmw_pvscsi-spec.txt
new file mode 100644
index 00000000..49affb2a
--- /dev/null
+++ b/docs/specs/vmw_pvscsi-spec.txt
@@ -0,0 +1,92 @@
+General Description
+===================
+
+This document describes the VMware PVSCSI device interface specification.
+Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
+Based on source code of PVSCSI Linux driver from kernel 3.0.4
+
+PVSCSI Device Interface Overview
+================================
+
+The interface is based on a memory area shared between hypervisor and VM.
+The memory area is obtained by the driver as a device IO memory resource of
+PVSCSI_MEM_SPACE_SIZE length.
+The shared memory consists of a registers area and a rings area.
+The registers area is used to raise hypervisor interrupts and issue device
+commands. The rings area is used to transfer data descriptors and SCSI
+commands from VM to hypervisor and to transfer messages produced by
+hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA.
+
+PVSCSI Device Registers
+=======================
+
+The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES).
+The structure of the registers area is described by the PVSCSIRegOffset enum.
+There are registers to issue a device command (with optional short data),
+issue a device interrupt, and control interrupt masking.
+
+PVSCSI Device Rings
+===================
+
+There are three rings in shared memory:
+
+    1. Request ring (struct PVSCSIRingReqDesc *req_ring)
+        - ring for OS to device requests
+    2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring)
+        - ring for device request completions
+    3. Message ring (struct PVSCSIRingMsgDesc *msg_ring)
+        - ring for messages from device.
+       This ring is optional and the guest might not configure it.
+There is a control area (struct PVSCSIRingsState *rings_state) used to control
+rings operation.
+
+PVSCSI Device to Host Interrupts
+================================
+The PVSCSI device supports the following interrupt types:
+    1. Completion interrupts (completion ring notifications):
+        PVSCSI_INTR_CMPL_0
+        PVSCSI_INTR_CMPL_1
+    2. Message interrupts (message ring notifications):
+        PVSCSI_INTR_MSG_0
+        PVSCSI_INTR_MSG_1
+
+Interrupts are controlled via the PVSCSI_REG_OFFSET_INTR_MASK register.
+A set bit means the interrupt is enabled, a cleared bit means it is disabled.
+
+Interrupt modes supported are legacy, MSI and MSI-X.
+In case of legacy interrupts, register PVSCSI_REG_OFFSET_INTR_STATUS
+is used to check which interrupt has arrived.  Interrupts are
+acknowledged when the corresponding bit is written to the interrupt
+status register.
+
+PVSCSI Device Operation Sequences
+=================================
+
+1. Startup sequence:
+    a. Issue PVSCSI_CMD_ADAPTER_RESET command;
+    aa. Windows driver reads interrupt status register here;
+    b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data,
+       check status and disable device messages if error returned;
+       (Omitted if device messages disabled by driver configuration)
+    c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration
+       as struct PVSCSICmdDescSetupRings;
+    d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide
+       rings configuration as struct PVSCSICmdDescSetupMsgRing;
+    e. Unmask completion and message (if device messages enabled) interrupts.
+
+2. Shutdown sequences
+    a. Mask interrupts;
+    b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO;
+    c. Issue PVSCSI_CMD_ADAPTER_RESET command.
+
+3. Send request
+    a. Fill next free request ring descriptor;
+    b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations;
+       or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations.
+
+4. Abort command
+    a.
Issue PVSCSI_CMD_ABORT_CMD command;
+
+5. Request completion processing
+    a. Upon completion interrupt arrival process completion
+       and message (if enabled) rings.
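The interrupt handling rules in this specification (a set mask bit enables an interrupt; legacy interrupts are acknowledged by writing the corresponding bit to the status register) can be sketched in C as below. The bit assignments and all names are assumptions made for illustration; the authoritative values are the PVSCSI_INTR_* definitions in the driver headers, which this document does not list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignments, for illustration only. */
#define SKETCH_INTR_CMPL_0 (1u << 0)
#define SKETCH_INTR_CMPL_1 (1u << 1)
#define SKETCH_INTR_MSG_0  (1u << 2)
#define SKETCH_INTR_MSG_1  (1u << 3)

/*
 * Value to write to the interrupt mask register at the end of the
 * startup sequence: completion interrupts are always unmasked, message
 * interrupts only when device messages are enabled.
 */
static uint32_t sketch_startup_intr_mask(bool msg_ring_enabled)
{
    uint32_t mask = SKETCH_INTR_CMPL_0 | SKETCH_INTR_CMPL_1;

    if (msg_ring_enabled) {
        mask |= SKETCH_INTR_MSG_0 | SKETCH_INTR_MSG_1;
    }
    return mask;
}

/*
 * Legacy interrupt mode: given the value read from the interrupt status
 * register, return the bits to write back in order to acknowledge the
 * interrupts that are both pending and enabled.
 */
static uint32_t sketch_intr_ack_bits(uint32_t status, uint32_t mask)
{
    return status & mask;
}
```

In a real driver the results would be written to the PVSCSI_REG_OFFSET_INTR_MASK and PVSCSI_REG_OFFSET_INTR_STATUS registers through the device's IO memory mapping, which is omitted here.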
