=== How the Blkif Drivers Work ===
Andrew Warfield
andrew.warfield@cl.cam.ac.uk

The intent of this document is to explain at a fairly detailed level how the
split device drivers work in Xen 1.3 (aka 2.0beta).  The intended
audience for this, I suppose, is anyone who intends to work with the
existing blkif interfaces and wants something to help them get up to
speed with the code in a hurry.  Secondly though, I hope to break out
the general mechanisms that are used in the drivers that are likely to
be necessary to implement other driver interfaces.

As a point of warning before starting, it is worth mentioning that I
anticipate much of the specifics described here changing in the near
future.  There has been talk about making the blkif protocol
a bit more efficient than it currently is.  Keir's addition of grant
tables will change the current remapping code that is used when shared
pages are initially set up.

Also, writing other control interface types will likely need support
from Xend, which at the moment has a steep learning curve... this
should be addressed in the future.

For more information on the driver model as a whole, read the
"Reconstructing I/O" technical report
(http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf).

==== High-level structure of a split-driver interface ====

Why would you want to write a split driver in the first place?  As Xen
is a virtual machine manager and focuses on isolation as an initial
design principle, it is generally considered unwise to share physical
access to devices across domains.  The reasons for this are obvious:
when device resources are shared, misbehaving code or hardware can
result in the failure of all of the client applications.  Moreover, as
virtual machines in Xen are entire OSs, standard device drivers that
they might use cannot have multiple instantiations for a single piece
of hardware.  In light of all this, the general approach in Xen is to
give a single virtual machine hardware access to a device, and where
other VMs want to share the device, export a higher-level interface to
facilitate that sharing.  If you don't want to share, that's fine.
There are currently Xen users actively exploring running two
completely isolated X-Servers on a Xen host, each with its own video
card, keyboard, and mouse.  In these situations, the guests need only
be given physical access to the necessary devices and left to go on
their own.  However, for devices such as disks and network interfaces,
where sharing is required, the split driver approach is a good
solution.

The structure is like this:

   +--------------------------+  +--------------------------+
   | Domain 0 (privileged)    |  | Domain 1 (unprivileged)  |
   |                          |  |                          |
   | Xend ( Application )     |  |                          |
   | Blkif Backend Driver     |  | Blkif Frontend Driver    |
   | Physical Device Driver   |  |                          |
   +--------------------------+  +--------------------------+
   +--------------------------------------------------------+
   |                X       E       N                       |
   +--------------------------------------------------------+


The Blkif driver is in two parts, which we refer to as a frontend (FE)
and a backend (BE).  Together, they serve to proxy device requests
between the guest operating system in an unprivileged domain, and the
physical device driver in the privileged domain.  An additional benefit
to this approach is that the FE driver can provide a single interface
for a whole class of physical devices.  The blkif interface mounts
IDE, SCSI, and our own VBD-structured disks, independent of the
physical driver underneath.  Moreover, supporting additional OSs only
requires that a new FE driver be written to connect to the existing
backend.

==== Inter-Domain Communication Mechanisms ====

===== Event Channels =====

Before getting into the specifics of the block interface driver, it is
worth discussing the mechanisms that are used to communicate between
domains.  Two mechanisms are used to allow the construction of
high-performance drivers: event channels and shared-memory rings.

Event channels are an asynchronous interdomain notification
mechanism.  Xen allows channels to be instantiated between two
domains, and domains can request that a virtual irq be attached to
notifications on a given channel.  The result of this is that the
frontend domain can send a notification on an event channel, resulting
in an interrupt entry into the backend at a later time.

The event channel between two domains is instantiated in the Xend code
during driver startup (described later).  Xend's channel.py
(tools/python/xen/xend/server/channel.py) defines the function


def eventChannel(dom1, dom2):
    return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2)


which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c,
which in turn generates a hypercall to Xen to patch the event channel
between the domains.  Only a privileged domain can request the
creation of an event channel.

Once the event channel is created in Xend, its ends are passed to both the
front and backend domains over the control channel.  The end that is
passed to a domain is just an integer "port" uniquely identifying the
event channel's local connection to that domain.  An example of this
setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in
blkif_connect(), which receives several status change events as
the driver starts up.  It is passed an event channel end in a
BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this:


   blkif_evtchn = status->evtchn;
   blkif_irq    = bind_evtchn_to_irq(blkif_evtchn);
   if ( (rc = request_irq(blkif_irq, blkif_int, 
                          SA_SAMPLE_RANDOM, "blkif", NULL)) )
       printk(KERN_ALERT"blkfront request_irq failed (%ld)\n",rc);


This code associates a virtual irq with the event channel, and
attaches the function blkif_int() as an interrupt handler for that
irq.  blkif_int() simply handles the notification and returns; it does
not need to interact with the channel at all.

An example of generating a notification can also be seen in blkfront.c:


static inline void flush_requests(void)
{
    DISABLE_SCATTERGATHER();
    wmb(); /* Ensure that the backend can see the requests. */
    blk_ring->req_prod = req_prod;   /* Publish the private producer index. */
    notify_via_evtchn(blkif_evtchn);
}

notify_via_evtchn() issues a hypercall to set the event waiting flag on
the other domain's end of the channel.
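
To make the asynchronous behaviour concrete, here is a minimal user-space
sketch of an event channel.  It is a toy model, not Xen code: the
EventChannel class, its method names, and the port numbers 3 and 7 are
all made up for illustration.  The only things taken from the text above
are that each end is an integer port, that a notification just sets a
flag on the remote end, and that the handler runs at some later time.

```python
class EventChannel:
    """Toy model of an interdomain event channel with two integer ports."""

    def __init__(self, port_a, port_b):
        self.peer = {port_a: port_b, port_b: port_a}
        self.pending = {port_a: False, port_b: False}
        self.handlers = {}

    def bind(self, port, handler):
        # Plays the role of bind_evtchn_to_irq() followed by request_irq().
        self.handlers[port] = handler

    def notify(self, local_port):
        # Plays the role of notify_via_evtchn(): only sets the remote flag.
        self.pending[self.peer[local_port]] = True

    def deliver(self):
        # Stand-in for Xen entering each domain's interrupt handler later on.
        for port, is_pending in self.pending.items():
            if is_pending and port in self.handlers:
                self.pending[port] = False
                self.handlers[port]()


log = []
chan = EventChannel(port_a=3, port_b=7)      # hypothetical port numbers
chan.bind(7, lambda: log.append("backend woken"))

chan.notify(3)        # the frontend kicks its end twice before delivery
chan.notify(3)
before = list(log)    # nothing has run yet: notification is asynchronous
chan.deliver()
```

In this model two notifications sent before delivery collapse into a
single handler call, because the flag is either set or not.  That is why
a handler should drain everything it finds on the ring rather than
assume one notification per message.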

===== Communication Rings =====

Event channels are strictly a notification mechanism between domains.
To move large chunks of data back and forth, Xen allows domains to
share pages of memory.  We use communication rings as a means of
managing access to a shared memory page for message passing between
domains.  These rings are not explicitly a mechanism of Xen, which is
only concerned with the actual sharing of the page and not how it is
used.  They are, however, worth discussing, as they are used in many
places in the current code and are a useful model for communicating
across a shared page.

A shared page is set up by a front end guest first allocating and passing 
the address of a page in its own address space to the backend driver.  

Consider the following code, also from blkfront.c.  Note that this code
is in blkif_disconnect():  the driver transitions from STATE_CLOSED
to STATE_DISCONNECTED before becoming CONNECTED.  The state automaton
is in blkif_status().

   blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL);
   blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0;
   ...
   /* Construct an interface-CONNECT message for the domain controller. */
   cmsg.type      = CMSG_BLKIF_FE;
   cmsg.subtype   = CMSG_BLKIF_FE_INTERFACE_CONNECT;
   cmsg.length    = sizeof(blkif_fe_interface_connect_t);
   up.handle      = 0;
   up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT;
   memcpy(cmsg.msg, &up, sizeof(up));  


blk_ring will be the shared page.  The producer and consumer pointers
are then initialised (these will be discussed soon), and then the
machine address of the page is sent to the backend via a control
channel to Xend.  This control channel itself uses the notification
and shared memory mechanisms described here, but is set up for each
domain automatically at startup.

The backend, which runs in a privileged domain, then takes the page address
and maps it into its own address space (in
linux26/drivers/xen/blkback/interface.c:blkif_connect()):


void blkif_connect(blkif_be_connect_t *connect)
{
   ...
   unsigned long shmem_frame = connect->shmem_frame;
   ...

   /* Reserve a page of kernel virtual address space for the ring. */
   if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL )
   {
      connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY;
      return;
   }

   /* Map the frontend's machine frame at that virtual address. */
   prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED);
   error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr),
                                   shmem_frame<<PAGE_SHIFT, PAGE_SIZE,
                                   prot, domid);

   ...

   blkif->blk_ring_base = (blkif_ring_t *)vma->addr;
}

The machine address of the page is passed in the shmem_frame field of
the connect message.  This is then mapped into the virtual address
space of the backend domain, and saved in the blkif structure
representing this particular backend connection.
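
The frame-number handoff between the two snippets above is just a pair
of shifts.  The sketch below shows the round trip in isolation.  It
assumes 4 KiB pages (PAGE_SHIFT of 12), the helper names frame_of() and
base_of() are mine, and the machine address is a made-up value.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages, as on x86
PAGE_SIZE = 1 << PAGE_SHIFT


def frame_of(machine_addr):
    """Frontend side: the value that goes into up.shmem_frame."""
    return machine_addr >> PAGE_SHIFT


def base_of(shmem_frame):
    """Backend side: the machine address handed to the remap call."""
    return shmem_frame << PAGE_SHIFT


ring_machine_addr = 0x1A2B3000  # hypothetical page-aligned machine address
frame = frame_of(ring_machine_addr)
```

Because the ring occupies a whole page, nothing is lost by dropping the
low PAGE_SHIFT bits: base_of(frame_of(addr)) returns the original
address for any page-aligned addr, and the start of the containing page
otherwise.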

NOTE:  New mechanisms will be added very shortly to allow domains to
explicitly grant access to their pages to other domains.  This "grant
table" support is in the process of being added to the tree, and will
change the way a shared page is set up.  In particular, it will remove
the need for the remapping domain to be privileged.

Sending data across shared rings:

Shared rings avoid the potential for write interference between
domains in a very cunning way.  A ring is partitioned into a request
and a response region, and domains only work within their own space.
This can be thought of as a double producer-consumer ring -- the ring
is described by four pointers into a circular buffer of fixed-size
records.  Pointers may only advance, and may not pass one another.


                         resp_cons----+
                                      V
           +----+----+----+----+----+----+----+
           |    |    |  free(A)     |RSP1|RSP2|
           +----+----+----+----+----+----+----+
 req_prod->|    |       -------->        |RSP3|
           +----+                        +----+
           |REQ8|                        |    |<-resp_prod
           +----+                        +----+
           |REQ7|                        |    |
           +----+                        +----+
           |REQ6|       <--------        |    |
           +----+----+----+----+----+----+----+
           |REQ5|REQ4|    free(B)   |    |    |
           +----+----+----+----+----+----+----+
  req_cons---------^


By adopting the convention that every request will receive a response,
not all four pointers need be shared and flow control on the ring
becomes very easy to manage.  Each domain manages its own
consumer pointer, and the two producer pointers are visible to both
(xen/include/public/io/blkif.h):


/* NB. Ring size must be small enough for sizeof(blkif_ring_t) <= PAGE_SIZE. */
#define BLKIF_RING_SIZE        64

  ...

/*
 * We use a special capitalised type name because it is _essential_ that all
 * arithmetic on indexes is done on an integer type of the correct size.
 */
typedef u32 BLKIF_RING_IDX;

/*
 * Ring indexes are 'free running'. That is, they are not stored modulo the
 * size of the ring buffer. The following macro converts a free-running counter
 * into a value that can directly index a ring-buffer array.
 */
#define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1))

typedef struct {
    BLKIF_RING_IDX req_prod;  /*  0: Request producer. Updated by front-end. */
    BLKIF_RING_IDX resp_prod; /*  4: Response producer. Updated by back-end. */
    union {                   /*  8 */
        blkif_request_t  req;
        blkif_response_t resp;
    } PACKED ring[BLKIF_RING_SIZE];
} PACKED blkif_ring_t;
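
The header's warning about integer width is easiest to see with numbers.
The following sketch redoes the index arithmetic in Python, which has
unbounded integers, so the u32 wrap has to be written out explicitly.
The helper names mask_blkif_idx() and outstanding() and the U32 constant
are mine; only BLKIF_RING_SIZE and the masking expression come from the
header.

```python
BLKIF_RING_SIZE = 64
U32 = 0xFFFFFFFF                     # BLKIF_RING_IDX is a u32 in the header


def mask_blkif_idx(i):
    """Python rendering of MASK_BLKIF_IDX()."""
    return i & (BLKIF_RING_SIZE - 1)


def outstanding(prod, cons):
    """Slots produced but not yet consumed, computed in u32 arithmetic."""
    return (prod - cons) & U32


# The mask trick only works when the ring size is a power of two.
is_pow2 = (BLKIF_RING_SIZE & (BLKIF_RING_SIZE - 1)) == 0

cons = 0xFFFFFFFE                    # consumer just below the u32 wrap point
prod = (cons + 5) & U32              # producer has wrapped round to 3
```

Even though prod is now numerically smaller than cons, the difference
taken modulo 2^32 is still 5, and the masked values (62 for cons, 3 for
prod) still name the right slots.  Doing the subtraction in a wider or
signed type would break this.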


As shown in the diagram above, the rules for using a shared memory
ring are simple.  

 1. A ring is full when a domain's producer and consumer pointers
    refer to the same slot because the producer has wrapped all the
    way around the ring.  With the free-running indexes defined
    above, this is req_prod - resp_cons == BLKIF_RING_SIZE for the
    frontend.  In this situation, the consumer pointer must be
    advanced before anything more can be produced.  Furthermore, if
    the consumer pointer is equal to the other domain's producer
    pointer (e.g. resp_cons == resp_prod), then the other domain has
    all the buffers.

 2. Producer pointers point to the next buffer that will be written to.
    (So blk_ring->ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.)

 3. Consumer pointers point to a valid message, so long as they are not
    equal to the associated producer pointer.

 4. A domain should only ever write to the message pointed
    to by its producer index, and read from the message at its
    consumer.  More generally, the domain may be thought of as having
    exclusive access to the messages between its consumer and producer,
    and should absolutely not read or write outside this region.

    Thus the front end has exclusive access to the free(A) region
    in the figure above, and the back end driver has exclusive
    access to the free(B) region.

In general, drivers keep a private copy of their producer pointer and
then set the shared version when they are ready for the other end to
process a set of messages.  Additionally, it is worth paying attention
to the use of memory barriers (rmb/wmb) in the code, to ensure that
rings that are shared across processors behave as expected.
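
The rules above can be exercised with a small single-process model.
This is a sketch under several assumptions: the ring has only 4 slots,
the class and method names are invented, the backend "services" a
request by upper-casing it, and it does so synchronously, whereas the
real backend responds later from a completion callback.  There are no
memory barriers because nothing here runs concurrently.

```python
RING_SIZE = 4                      # small power of two for illustration


def slot(i):
    return i & (RING_SIZE - 1)


class SharedRing:
    """The shared page: two producer indexes plus the slots."""
    def __init__(self):
        self.req_prod = 0
        self.resp_prod = 0
        self.ring = [None] * RING_SIZE


class Frontend:
    def __init__(self, shared):
        self.shared = shared
        self.req_prod = 0          # private copy, published by flush()
        self.resp_cons = 0         # private consumer index

    def full(self):
        return self.req_prod - self.resp_cons == RING_SIZE

    def queue(self, req):
        if self.full():
            return False
        self.shared.ring[slot(self.req_prod)] = ("REQ", req)
        self.req_prod += 1
        return True

    def flush(self):
        self.shared.req_prod = self.req_prod   # a wmb() precedes this in C

    def collect(self):
        out = []
        while self.resp_cons != self.shared.resp_prod:
            out.append(self.shared.ring[slot(self.resp_cons)][1])
            self.resp_cons += 1
        return out


class Backend:
    def __init__(self, shared):
        self.shared = shared
        self.req_cons = 0          # private consumer index

    def service(self):
        # Each response reuses a slot whose request was already consumed.
        while self.req_cons != self.shared.req_prod:
            _, req = self.shared.ring[slot(self.req_cons)]
            self.req_cons += 1
            self.shared.ring[slot(self.shared.resp_prod)] = ("RSP", req.upper())
            self.shared.resp_prod += 1


shared = SharedRing()
fe, be = Frontend(shared), Backend(shared)
accepted = [fe.queue(r) for r in ["a", "b", "c", "d", "e"]]
fe.flush()
be.service()
responses = fe.collect()
```

The fifth request is refused because four are already outstanding.  Once
the frontend has consumed the responses, its consumer index has caught
up and the slots can be reused.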

==== Structure of the Blkif Drivers ====

Now that the communications primitives have been discussed, I'll
quickly cover the general structure of the blkif driver.  This is
intended to give a high-level idea of what is going on, in an effort
to make reading the code a more approachable task.

There are three key software components that are involved in the blkif
drivers (not counting Xen itself):  the frontend and backend drivers,
and Xend, which coordinates their initial connection.  Xend may also
be involved in control-channel signalling in some cases after startup,
for instance to manage reconnection if the backend is restarted.

===== Frontend Driver Structure =====

The frontend domain uses a single event channel and a shared memory
ring to trade control messages with the backend.  These are both set up
during domain startup, which will be discussed shortly.  The shared
memory ring is called blk_ring, and the private ring indexes are
resp_cons and req_prod.  The ring is protected by blkif_io_lock.
Additionally, the frontend keeps a list of outstanding requests in
rec_ring[].  These are uniquely identified by a guest-local id number,
which is associated with each request sent to the backend, and
returned with the matching responses.  Information about the actual
disks is stored in major_info[], of which only the first nr_vbds
entries are valid.  Finally, the global 'recovery' indicates that the
connection between the backend and frontend drivers has been broken
(possibly due to a backend driver crash) and that the frontend is in
recovery mode, in which case it will attempt to reconnect and reissue
outstanding requests.
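
The role of rec_ring[] in recovery can be sketched as follows.  This is
a toy model with invented names: a dict stands in for the array, the
request strings are placeholders, and the "sent" list stands in for
whatever has been placed on the shared ring.  The only behaviour taken
from the text is that each request carries an id, responses return that
id, and unanswered requests are reissued after a reconnect.

```python
class FrontendModel:
    """Tracks in-flight requests by id so they can be reissued on recovery."""

    def __init__(self):
        self.rec_ring = {}        # id -> request; stands in for rec_ring[]
        self.next_id = 0
        self.sent = []            # what has been handed to the modelled backend

    def issue(self, request):
        req_id = self.next_id
        self.next_id += 1
        self.rec_ring[req_id] = request
        self.sent.append((req_id, request))
        return req_id

    def complete(self, req_id):
        # A response carries the id of the request it answers.
        return self.rec_ring.pop(req_id)

    def recover(self):
        # After reconnecting, replay everything that never got a response.
        self.sent = []
        for req_id, request in sorted(self.rec_ring.items()):
            self.sent.append((req_id, request))


fe = FrontendModel()
a = fe.issue("read sector 0")
b = fe.issue("write sector 8")
c = fe.issue("read sector 16")
fe.complete(b)            # only b was answered before the backend went away
fe.recover()
```

After recover(), only the two unanswered requests are replayed, with
their original ids, so late bookkeeping in the guest still matches up.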

The frontend driver is single-threaded and after setup is entered only
through three points:  (1) read/write requests from the XenLinux guest
that it is a part of, (2) interrupts from the backend driver on its
event channel (blkif_int()), and (3) control messages from Xend
(blkif_ctrlif_rx()).

===== Backend Driver Structure =====

The backend driver is slightly more complex as it must manage any
number of concurrent frontend connections.  For each domain it
manages, the backend driver maintains a blkif structure, which
describes all the connection and disk information associated with that
particular domain.  This structure is associated with the interrupt
registration, and allows the backend driver to have immediate context
when it takes a notification from some domain.

All of the blkif structures are stored in a hash table (blkif_hash),
which is indexed by a hash of the domain id and a "handle".  The handle
is really a per-domain blkif identifier, in case a domain wants to have
multiple connections.
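
A lookup table keyed on (domain id, handle) might look like the sketch
below.  The table size, the XOR hash, and the BlkifTable class are all
hypothetical; the real hash function is whatever the backend source
defines.  The point is only that both values take part in the key, so
one domain can own several blkifs, and that colliding entries must be
told apart by comparing both fields.

```python
BLKIF_HASHSZ = 1024   # hypothetical table size, chosen as a power of two


def blkif_hash_index(domid, handle):
    # Hypothetical hash: any function of (domid, handle) would do here.
    return (domid ^ handle) & (BLKIF_HASHSZ - 1)


class BlkifTable:
    def __init__(self):
        self.buckets = [[] for _ in range(BLKIF_HASHSZ)]

    def find(self, domid, handle):
        for b in self.buckets[blkif_hash_index(domid, handle)]:
            if b["domid"] == domid and b["handle"] == handle:
                return b
        return None

    def create(self, domid, handle):
        if self.find(domid, handle) is not None:
            return None                     # this connection already exists
        blkif = {"domid": domid, "handle": handle}
        self.buckets[blkif_hash_index(domid, handle)].append(blkif)
        return blkif


table = BlkifTable()
first = table.create(domid=5, handle=0)
second = table.create(domid=5, handle=1)    # same domain, second connection
collide = table.create(domid=4, handle=1)   # 4^1 == 5^0: shares first's bucket
```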

The per-connection blkif structure is of type blkif_t.  It contains
all of the communication details (event channel, irq, shared memory
ring and indexes), and blk_ring_lock, which is the backend mutex on
the shared ring.  The structure also contains vbd_rb, which is a
red-black tree, containing an entry for each device/partition that is
assigned to that domain.  This structure is filled by xend passing
disk information to the backend at startup, and is protected by
vbd_lock.  Finally, the blkif struct contains a status field, which
describes the state of the connection.

The backend driver spawns a kernel thread at startup
(blkio_schedule()), which handles requests to and from the actual disk
device drivers.  This scheduler thread maintains a list of blkif
structures that have pending requests, and services them round-robin
with a maximum per-round request limit.  blkifs are added to the list
in the interrupt handler (blkif_be_int()) using
add_to_blkdev_list_tail(), and removed in the scheduler loop after
calling do_block_io_op(), which processes a batch of requests.  The
scheduler thread is explicitly activated at several points in the code
using maybe_trigger_blkio_schedule().
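
The round-robin policy described above can be sketched in a few lines.
The function names blkio_schedule() and do_block_io_op() are borrowed
from the text, but their signatures here, the Blkif class, and the
per-round limit of 2 are all assumptions made for the example.  The
model runs to completion instead of sleeping when the list is empty.

```python
from collections import deque

MAX_PER_ROUND = 2     # hypothetical per-round request limit


class Blkif:
    def __init__(self, name, requests):
        self.name = name
        self.requests = deque(requests)


def do_block_io_op(blkif, max_to_do, issued):
    """Issue up to max_to_do requests; return True if more remain."""
    for _ in range(max_to_do):
        if not blkif.requests:
            break
        issued.append((blkif.name, blkif.requests.popleft()))
    return bool(blkif.requests)


def blkio_schedule(schedule_list):
    issued = []
    while schedule_list:
        blkif = schedule_list.popleft()          # take from the list head
        if do_block_io_op(blkif, MAX_PER_ROUND, issued):
            schedule_list.append(blkif)          # still busy: back to the tail
    return issued


pending = deque([
    Blkif("dom1", [1, 2, 3]),
    Blkif("dom2", [10]),
    Blkif("dom3", [20, 21]),
])
order = blkio_schedule(pending)
```

With these inputs dom1 gets two requests issued, then yields to dom2 and
dom3 before its third request goes out, which is the fairness property
the per-round limit is there to provide.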

Pending requests between the backend driver and the physical device
drivers use another ring, pending_ring.  Requests are placed in this
ring in the scheduler thread and issued to the device.  A completion
callback, end_block_io_op, indicates that requests have been serviced
and generates a response on the appropriate blkif ring.  pending_reqs[]
stores a list of outstanding requests with the physical drivers.

So, control entries to the backend are (1) the blkio scheduler thread,
which sends requests to the real device drivers, (2) end_block_io_op,
which is called as serviced requests complete, (3) blkif_be_int(),
which handles notifications from the frontend drivers in other domains,
and (4) blkif_ctrlif_rx(), which handles control messages from xend.

==== Driver Startup ====

Prior to starting a new guest using the frontend driver, the backend
will have been started in a privileged domain.  The backend
initialisation code initialises all of its data structures, such as
the blkif hash table, and starts the scheduler thread as a kernel
thread. It then sends a driver status up message to let xend know it
is ready to take frontend connections.

When a new domain that uses the blkif frontend driver is started,
there is a series of interactions between it, xend, and the specified
backend driver.  These interactions are as follows:

The domain configuration given to xend will specify the backend domain
and disks that the new guest is to use.  Prior to actually running the
domain, xend and the backend driver interact to set up the initial
blkif record in the backend.

(1) Xend sends a BLKIF_BE_CREATE message to backend.

  Backend does blkif_create(), having been passed FE domid and handle.
  It creates and initialises a new blkif struct, and puts it in the
  hash table.
  It then returns a STATUS_OK response to xend.

(2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend.
 
  Backend adds a vbd entry in the red-black tree for the
  specified (dom, handle) blkif entry.
  Sends a STATUS_OK response.

(3) Xend sends a BLKIF_BE_VBD_GROW message to the backend.

  Backend takes the physical device information passed in the
  message and assigns it to the newly created vbd struct.

(2) and (3) repeat as any additional devices are added to the domain.
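
Steps (1) to (3) amount to building up a small nested record in the
backend.  The sketch below models that with dicts.  The message type
strings follow the names used above, but the message layout, the field
names vdevice and extent, the device numbers, and the extent tuples are
all invented for illustration.  Replying STATUS_OK to VBD_GROW is also
an assumption, since the text above does not say what is returned.

```python
STATUS_OK = "STATUS_OK"


class BackendModel:
    def __init__(self):
        self.blkifs = {}                       # (domid, handle) -> blkif record

    def handle(self, msg):
        key = (msg["domid"], msg["handle"])
        if msg["type"] == "BLKIF_BE_CREATE":
            self.blkifs[key] = {"vbds": {}}    # a dict stands in for the rb-tree
        elif msg["type"] == "BLKIF_BE_VBD_CREATE":
            self.blkifs[key]["vbds"][msg["vdevice"]] = []     # no extents yet
        elif msg["type"] == "BLKIF_BE_VBD_GROW":
            self.blkifs[key]["vbds"][msg["vdevice"]].append(msg["extent"])
        else:
            raise ValueError("unknown message type: " + msg["type"])
        return STATUS_OK


be = BackendModel()
dom = {"domid": 1, "handle": 0}
msgs = [
    dict(dom, type="BLKIF_BE_CREATE"),
    dict(dom, type="BLKIF_BE_VBD_CREATE", vdevice=0x0301),
    dict(dom, type="BLKIF_BE_VBD_GROW", vdevice=0x0301, extent=("sda7", 0, 2048)),
    dict(dom, type="BLKIF_BE_VBD_CREATE", vdevice=0x0302),
    dict(dom, type="BLKIF_BE_VBD_GROW", vdevice=0x0302, extent=("sda8", 0, 4096)),
]
replies = [be.handle(m) for m in msgs]
```

Note that the ordering matters: a VBD_CREATE for a (dom, handle) pair
that has not been created yet fails in this model with a KeyError.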

At this point, the backend has enough state to allow the frontend
domain to start.  The domain is run, and eventually gets to the
frontend driver initialisation code.  After setting up the frontend
data structures, this code continues the communications with xend and
the backend to negotiate a connection:

(4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message.

  This message tells xend that the driver is up.  The init function
  now spin-waits until driver setup is complete in order to prevent
  Linux from attempting to boot before the disks are connected.

(5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message

  This message specifies that the interface is now disconnected
  (instead of closed).
  The domain updates its state, and allocates the shared blk_ring
  page.

(6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message

  This message specifies the domain and handle, and includes the
  address of the newly created page.

(7) Xend sends the backend a BLKIF_BE_CONNECT message

  The backend fills in the blkif connection information, maps the
  shared page, and binds an irq to the event channel.
  
(8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message

  This message takes the frontend driver to a CONNECTED state, at
  which point it binds an irq to the event channel and calls
  xlvbd_init to initialise the individual block devices.

The frontend Linux is still spin-waiting at this point, until all of
the disks have been probed.  Messaging now is directly between the
front and backend domain using the new shared ring and event channel.

(9) The frontend sends a BLKIF_OP_PROBE directly to the backend.

  This message includes a reference to an additional page that the
  backend can use for its reply.  The backend responds with an array
  of the domain's disks (as vdisk_t structs) on the provided page.

The frontend now initialises each disk, calling xlvbd_init_device()
for each one.
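
Seen from the frontend, steps (4) to (9) are a three-state machine
driven by status messages from xend.  The sketch below models first-time
startup only.  The class, the action strings, and the choice to raise on
an unexpected transition are mine; the real blkif_status() also has to
cope with recovery, which is left out here.

```python
class FrontendStartup:
    def __init__(self):
        self.state = "CLOSED"
        self.actions = []      # log of what the driver would do at each step

    def driver_up(self):
        self.actions.append("send DRIVER_STATUS_CHANGED")             # step 4

    def interface_status(self, status):
        if self.state == "CLOSED" and status == "DISCONNECTED":       # step 5
            self.state = "DISCONNECTED"
            self.actions.append("allocate blk_ring page")
            self.actions.append("send INTERFACE_CONNECT")             # step 6
        elif self.state == "DISCONNECTED" and status == "CONNECTED":  # step 8
            self.state = "CONNECTED"
            self.actions.append("bind irq to event channel")
            self.actions.append("send BLKIF_OP_PROBE")                # step 9
        else:
            raise ValueError(f"unexpected {status} in state {self.state}")


fe = FrontendStartup()
fe.driver_up()
fe.interface_status("DISCONNECTED")
fe.interface_status("CONNECTED")
```

Step (7) does not appear because it happens between xend and the
backend; the frontend only sees its effect when the CONNECTED status
arrives.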