=== How the Blkif Drivers Work === Andrew Warfield andrew.warfield@cl.cam.ac.uk The intent of this is to explain at a fairly detailed level how the split device drivers work in Xen 1.3 (aka 2.0beta). The intended audience for this, I suppose, is anyone who intends to work with the existing blkif interfaces and wants something to help them get up to speed with the code in a hurry. Secondly though, I hope to break out the general mechanisms that are used in the drivers that are likely to be necessary to implement other drivers interfaces. As a point of warning before starting, it is worth mentioning that I anticipate much of the specifics described here changing in the near future. There has been talk about making the blkif protocol a bit more efficient than it currently is. Keir's addition of grant tables will change the current remapping code that is used when shared pages are initially set up. Also, writing other control interface types will likely need support from Xend, which at the moment has a steep learning curve... this should be addressed in the future. For more information on the driver model as a whole, read the "Reconstructing I/O" technical report (http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf). ==== High-level structure of a split-driver interface ==== Why would you want to write a split driver in the first place? As Xen is a virtual machine manager and focuses on isolation as an initial design principle, it is generally considered unwise to share physical access to devices across domains. The reasons for this are obvious: when device resources are shared, misbehaving code or hardware can result in the failure of all of the client applications. Moreover, as virtual machines in Xen are entire OSs, standard device drives that they might use cannot have multiple instantiations for a single piece of hardware. In light of all this, the general approach in Xen is to give a single virtual machine hardware access to a device, and where other VMs want to share the device, export a higher-level interface to facilitate that sharing. If you don't want to share, that's fine. There are currently Xen users actively exploring running two completely isolated X-Servers on a Xen host, each with it's own video card, keyboard, and mouse. In these situations, the guests need only be given physical access to the necessary devices and left to go on their own. However, for devices such as disks and network interfaces, where sharing is required, the split driver approach is a good solution. The structure is like this: +--------------------------+ +--------------------------+ | Domain 0 (privileged) | | Domain 1 (unprivileged) | | | | | | Xend ( Application ) | | | | Blkif Backend Driver | | Blkif Frontend Driver | | Physical Device Driver | | | +--------------------------+ +--------------------------+ +--------------------------------------------------------+ | X E N | +--------------------------------------------------------+ The Blkif driver is in two parts, which we refer to as frontend (FE) and a backend (BE). Together, they serve to proxy device requests between the guest operating system in an unprivileged domain, and the physical device driver in the physical domain. An additional benefit to this approach is that the FE driver can provide a single interface for a whole class of physical devices. The blkif interface mounts IDE, SCSI, and our own VBD-structured disks, independent of the physical driver underneath. Moreover, supporting additional OSs only requires that a new FE driver be written to connect to the existing backend. ==== Inter-Domain Communication Mechanisms ==== ===== Event Channels ===== Before getting into the specifics of the block interface driver, it is worth discussing the mechanisms that are used to communicate between domains. Two mechanisms are used to allow the construction of high-performance drivers: event channels and shared-memory rings. Event channels are an asynchronous interdomain notification mechanism. Xen allows channels to be instantiated between two domains, and domains can request that a virtual irq be attached to notifications on a given channel. The result of this is that the frontend domain can send a notification on an event channel, resulting in an interrupt entry into the backend at a later time. The event channel between two domains is instantiated in the Xend code during driver startup (described later). Xend's channel.py (tools/python/xen/xend/server/channel.py) defines the function def eventChannel(dom1, dom2): return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2) which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c, which in turn generates a hypercall to Xen to patch the event channel between the domains. Only a privileged domain can request the creation of an event channel. Once the event channel is created in Xend, its ends are passed to both the front and backend domains over the control channel. The end that is passed to a domain is just an integer "port" uniquely identifying the event channel's local connection to that domain. An example of this setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in blkif_connect(), which receives several status change events as the driver starts up. It is passed an event channel end in a BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this: blkif_evtchn = status->evtchn; blkif_irq = bind_evtchn_to_irq(blkif_evtchn); if ( (rc = request_irq(blkif_irq, blkif_int, SA_SAMPLE_RANDOM, "blkif", NULL)) ) printk(KERN_ALERT"blkfront request_irq failed (%ld)\n",rc); This code associates a virtual irq with the event channel, and attaches the function blkif_int() as an interrupt handler for that irq. blkif_int() simply handles the notification and returns, it does not need to interact with the channel at all. An example of generating a notification can also be seen in blkfront.c: static inline void flush_requests(void) { DISABLE_SCATTERGATHER(); wmb(); /* Ensure that the frontend can see the requests. */ blk_ring->req_prod = req_prod; notify_via_evtchn(blkif_evtchn); } }}} notify_via_evtchn() issues a hypercall to set the event waiting flag on the other domain's end of the channel. ===== Communication Rings ===== Event channels are strictly a notification mechanism between domains. To move large chunks of data back and forth, Xen allows domains to share pages of memory. We use communication rings as a means of managing access to a shared memory page for message passing between domains. These rings are not explicitly a mechanism of Xen, which is only concerned with the actual sharing of the page and not how it is used, they are however worth discussing as they are used in many places in the current code and are a useful model for communicating across a shared page. A shared page is set up by a front end guest first allocating and passing the address of a page in its own address space to the backend driver. Consider the following code, also from blkfront.c. Note: this code is in blkif_disconnect(). The driver transitions from STATE_CLOSED to STATE_DISCONNECTED before becoming CONNECTED. The state automata is in blkif_status(). blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL); blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0; ... /* Construct an interface-CONNECT message for the domain controller. */ cmsg.type = CMSG_BLKIF_FE; cmsg.subtype = CMSG_BLKIF_FE_INTERFACE_CONNECT; cmsg.length = sizeof(blkif_fe_interface_connect_t); up.handle = 0; up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT; memcpy(cmsg.msg, &up, sizeof(up)); blk_ring will be the shared page. The producer and consumer pointers are then initialised (these will be discussed soon), and then the machine address of the page is send to the backend via a control channel to Xend. This control channel itself uses the notification and shared memory mechanisms described here, but is set up for each domain automatically at startup. The backend, which is a privileged domain then takes the page address and maps it into its own address space (in linux26/drivers/xen/blkback/interface.c:blkif_connect()): void blkif_connect(blkif_be_connect_t *connect) ... unsigned long shmem_frame = connect->shmem_frame; ... if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL ) { connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY; return; } prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED); error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr), shmem_frame<blk_ring_base = (blkif_ring_t *)vma->addr }}} The machine address of the page is passed in the shmem_frame field of the connect message. This is then mapped into the virtual address space of the backend domain, and saved in the blkif structure representing this particular backend connection. NOTE: New mechanisms will be added very shortly to allow domains to explicitly grant access to their pages to other domains. This "grant table" support is in the process of being added to the tree, and will change the way a shared page is set up. In particular, it will remove the need of the remapping domain to be privileged. Sending data across shared rings: Shared rings avoid the potential for write interference between domains in a very cunning way. A ring is partitioned into a request and a response region, and domains only work within their own space. This can be thought of as a double producer-consumer ring -- the ring is described by four pointers into a circular buffer of fixed-size records. Pointers may only advance, and may not pass one another. resp_cons----+ V +----+----+----+----+----+----+----+ | | | free(A) |RSP1|RSP2| +----+----+----+----+----+----+----+ req_prod->| | --------> |RSP3| +----+ +----+ |REQ8| | |<-resp_prod +----+ +----+ |REQ7| | | +----+ +----+ |REQ6| <-------- | | +----+----+----+----+----+----+----+ |REQ5|REQ4| free(B) | | | +----+----+----+----+----+----+----+ req_cons---------^ By adopting the convention that every request will receive a response, not all four pointers need be shared and flow control on the ring becomes very easy to manage. Each domain manages its own consumer pointer, and the two producer pointers are visible to both (xen/include/public/io/blkif.h): /* NB. Ring size must be small enough for sizeof(blkif_ring_t) <=PAGE_SIZE.*/ #define BLKIF_RING_SIZE 64 ... /* * We use a special capitalised type name because it is _essential_ that all * arithmetic on indexes is done on an integer type of the correct size. */ typedef u32 BLKIF_RING_IDX; /* * Ring indexes are 'free running'. That is, they are not stored modulo the * size of the ring buffer. The following macro converts a free-running counter * into a value that can directly index a ring-buffer array. */ #define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1)) typedef struct { BLKIF_RING_IDX req_prod; /* 0: Request producer. Updated by front-end. */ BLKIF_RING_IDX resp_prod; /* 4: Response producer. Updated by back-end. */ union { /* 8 */ blkif_request_t req; blkif_response_t resp; } PACKED ring[BLKIF_RING_SIZE]; } PACKED blkif_ring_t; As shown in the diagram above, the rules for using a shared memory ring are simple. 1. A ring is full when a domain's producer and consumer pointers are equal (e.g. req_prod == resp_cons). In this situation, the consumer pointer must be advanced. Furthermore, if the consumer pointer is equal to the other domain's producer pointer, (e.g. resp_cons = resp_prod), then the other domain has all the buffers. 2. Producer pointers point to the next buffer that will be written to. (So blk_ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.) 3. Consumer pointers point to a valid message, so long as they are not equal to the associated producer pointer. 4. A domain should only ever write to the message pointed to by its producer index, and read from the message at it's consumer. More generally, the domain may be thought of to have exclusive access to the messages between its consumer and producer, and should absolutely not read or write outside this region. Thus the front end has exclusive access to the free(A) region in the figure above, and the back end driver has exclusive access to the free(B) region. In general, drivers keep a private copy of their producer pointer and then set the shared version when they are ready for the other end to process a set of messages. Additionally, it is worth paying attention to the use of memory barriers (rmb/wmb) in the code, to ensure that rings that are shared across processors behave as expected. ==== Structure of the Blkif Drivers ==== Now that the communications primitives have been discussed, I'll quickly cover the general structure of the blkif driver. This is intended to give a high-level idea of what is going on, in an effort to make reading the code a more approachable task. There are three key software components that are involved in the blkif drivers (not counting Xen itself). The frontend and backend driver, and Xend, which coordinates their initial connection. Xend may also be involved in control-channel signalling in some cases after startup, for instance to manage reconnection if the backend is restarted. ===== Frontend Driver Structure ===== The frontend domain uses a single event channel and a shared memory ring to trade control messages with the backend. These are both setup during domain startup, which will be discussed shortly. The shared memory ring is called blkif_ring, and the private ring indexes are resp_cons, and req_prod. The ring is protected by blkif_io_lock. Additionally, the frontend keeps a list of outstanding requests in rec_ring[]. These are uniquely identified by a guest-local id number, which is associated with each request sent to the backend, and returned with the matching responses. Information about the actual disks are stored in major_info[], of which only the first nr_vbds entries are valid. Finally, the global 'recovery' indicates that the connection between the backend and frontend drivers has been broken (possibly due to a backend driver crash) and that the frontend is in recovery mode, in which case it will attempt to reconnect and reissue outstanding requests. The frontend driver is single-threaded and after setup is entered only through three points: (1) read/write requests from the XenLinux guest that it is a part of, (2) interrupts from the backend driver on its event channel (blkif_int()), and (3) control messages from Xend (blkif_ctrlif_rx). ===== Backend Driver Structure ===== The backend driver is slightly more complex as it must manage any number of concurrent frontend connections. For each domain it manages, the backend driver maintains a blkif structure, which describes all the connection and disk information associated with that particular domain. This structure is associated with the interrupt registration, and allows the backend driver to have immediate context when it takes a notification from some domain. All of the blkif structures are stored in a hash table (blkif_hash), which is indexed by a hash of the domain id, and a "handle", really a per-domain blkif identifier, in case it wants to have multiple connections. The per-connection blkif structure is of type blkif_t. It contains all of the communication details (event channel, irq, shared memory ring and indexes), and blk_ring_lock, which is the backend mutex on the shared ring. The structure also contains vbd_rb, which is a red-black tree, containing an entry for each device/partition that is assigned to that domain. This structure is filled by xend passing disk information to the backend at startup, and is protected by vbd_lock. Finally, the blkif struct contains a status field, which describes the state of the connection. The backend driver spawns a kernel thread at startup (blkio_schedule()), which handles requests to and from the actual disk device drivers. This scheduler thread maintains a list of blkif structures that have pending requests, and services them round-robin with a maximum per-round request limit. blkifs are added to the list in the interrupt handler (blkif_be_int()) using add_to_blkdev_list_tail(), and removed in the scheduler loop after calling do_block_io_op(), which processes a batch of requests. The scheduler thread is explicitly activated at several points in the code using maybe_trigger_blkio_schedule(). Pending requests between the backend driver and the physical device drivers use another ring, pending_ring. Requests are placed in this ring in the scheduler thread and issued to the device. A completion callback, end_block_io_op, indicates that requests have been serviced and generates a response on the appropriate blkif ring. pending reqs[] stores a list of outstanding requests with the physical drivers. So, control entries to the backend are (1) the blkio scheduler thread, which sends requests to the real device drivers, (2) end_block_io_op, which is called as serviced requests complete, (3) blkif_be_int() handles notifications from the frontend drivers in other domains, and (4) blkif_ctrlif_rx() handles control messages from xend. ==== Driver Startup ==== Prior to starting a new guest using the frontend driver, the backend will have been started in a privileged domain. The backend initialisation code initialises all of its data structures, such as the blkif hash table, and starts the scheduler thread as a kernel thread. It then sends a driver status up message to let xend know it is ready to take frontend connections. When a new domain that uses the blkif frontend driver is started, there are a series of interactions between it, xend, and the specified backend driver. These interactions are as follows: The domain configuration given to xend will specify the backend domain and disks that the new guest is to use. Prior to actually running the domain, xend and the backend driver interact to setup the initial blkif record in the backend. (1) Xend sends a BLKIF_BE_CREATE message to backend. Backend does blkif_create(), having been passed FE domid and handle. It creates and initialises a new blkif struct, and puts it in the hash table. It then returns a STATUS_OK response to xend. (2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend. Backend adds a vbd entry in the red-black tree for the specified (dom, handle) blkif entry. Sends a STATUS_OK response. (3) Xend sends a BLKIF_BE_VBD_GROW message to the backend. Backend takes the physical device information passed in the message and assigns them to the newly created vbd struct. (2) and (3) repeat as any additional devices are added to the domain. At this point, the backend has enough state to allow the frontend domain to start. The domain is run, and eventually gets to the frontend driver initialisation code. After setting up the frontend data structures, this code continues the communications with xend and the backend to negotiate a connection: (4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message. This message tells xend that the driver is up. The init function now spin-waits until driver setup is complete in order to prevent Linux from attempting to boot before the disks are connected. (5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message This message specifies that the interface is now disconnected (instead of closed). The domain updates it's state, and allocates the shared blk_ring page. Next, (6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message This message specifies the domain and handle, and includes the address of the newly created page. (7) Xend sends the backend a BLKIF_BE_CONNECT message The backend fills in the blkif connection information, maps the shared page, and binds an irq to the event channel. (8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message This message takes the frontend driver to a CONNECTED state, at which point it binds an irq to the event channel and calls xlvbd_init to initialise the individual block devices. The frontend Linux is stall spin waiting at this point, until all of the disks have been probed. Messaging now is directly between the front and backend domain using the new shared ring and event channel. (9) The frontend sends a BLKIF_OP_PROBE directly to the backend. This message includes a reference to an additional page, that the backend can use for it's reply. The backend responds with an array of the domains disks (as vdisk_t structs) on the provided page. The frontend now initialises each disk, calling xlvbd_init_device() for each one.