Xen Shadow Pagetables Notes


Shadow 3 in Xen 3.3

From blog.xen.org:

Shadow 3 is the next step in the evolution of the shadow pagetable code. By making the shadow pagetables behave more like a TLB, we take advantage of guest operating system TLB behavior to reduce and coalesce the number of guest pagetable changes that the hypervisor has to translate to the shadow pagetables. This can dramatically reduce the virtualization overhead for HVM guests.

Shadow paging overhead is one of the largest sources of CPU virtualization overhead for HVM guests. Because HVM guest operating systems don’t know the physical frame numbers of the pages assigned to them, they use guest frame numbers instead. This requires the hypervisor to translate each guest frame number into a machine frame in the shadow pagetables before it can be used by the guest.
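A minimal sketch of that translation, under simplifying assumptions: `p2m` is a hypothetical dictionary from guest frame numbers to machine frame numbers, and a pagetable entry is modeled as a frame number above 12 flag bits with bit 0 as "present". None of these names come from the Xen source.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
FLAGS_MASK = (1 << PAGE_SHIFT) - 1   # low bits hold the entry's flags
PRESENT = 0x1                        # assumed position of the present bit


def shadow_pte(guest_pte, p2m):
    """Build a shadow entry by replacing the guest frame with its machine frame."""
    flags = guest_pte & FLAGS_MASK
    if not flags & PRESENT:
        return 0                     # nothing to map: shadow stays not-present
    gfn = guest_pte >> PAGE_SHIFT    # frame number as the guest sees it
    mfn = p2m[gfn]                   # frame number the hardware must use
    return (mfn << PAGE_SHIFT) | flags


# Hypothetical guest-to-machine mapping, for illustration only.
p2m = {0x10: 0x7A3, 0x11: 0x1F0}
guest_entry = (0x10 << PAGE_SHIFT) | 0x3   # present + writable
shadow_entry = shadow_pte(guest_entry, p2m)
```

The guest only ever sees `guest_entry`; the hardware walks pagetables built from values like `shadow_entry`.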

Those who have been around awhile may remember the Shadow-1 code. Its method of propagating changes from guest pagetables to the shadow pagetables was as follows:

While this method worked so-so for Linux, it was disastrous for Windows. Windows heavily uses a technique called demand-paging. Resyncing a guest page is an expensive operation, and under Shadow-1, every page that was faulted in caused an out-of-sync, a write, and a resync.

The next step, Shadow-2, among many other things, did away with the out-of-sync mechanism and instead emulated every write to guest pagetables. Emulation avoids the expensive unsync-resync cycle for demand paging. However, it removes any “batching” effects: every write is immediately reflected in the shadow pagetables, even though the guest operating system may not have been expecting the address change to be available until later.

Furthermore, Windows will frequently write “transition values” into pagetable entries when a page is being mapped in or mapped out. The cycle for demand-faulting zero pages in 32-bit Windows looks like:

On bare hardware, this looks like “Page fault / memory write / memory write”. Memory writes are relatively inexpensive. But in Shadow-2, this looks like:

Each emulated write involves a VMEXIT/VMENTER as well as about 8000 cycles of emulation inside the hypervisor, which is much more expensive than a mere memory write.
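As rough arithmetic on that figure: the bare-hardware sequence above has two memory writes per demand fault, and Shadow-2 emulates every pagetable write, so both are emulated. The sketch below uses the approximate 8000-cycle figure from the text. The cost of the VMEXIT/VMENTER pair is not given, so it is left as a parameter that defaults to zero, and the function name is illustrative.

```python
EMULATION_CYCLES = 8000   # approximate per-write figure quoted in the text


def emulated_write_cycles(pagetable_writes, vm_transition_cycles=0):
    """Estimate cycles spent emulating pagetable writes.

    vm_transition_cycles stands for the VMEXIT/VMENTER cost per write; the
    text gives no number for it, so the default of 0 yields a lower bound.
    """
    return pagetable_writes * (EMULATION_CYCLES + vm_transition_cycles)


# Two pagetable writes per demand fault, both emulated under Shadow-2.
cycles_per_demand_fault = emulated_write_cycles(2)
```

With these inputs the lower bound comes to 16000 cycles of emulation for each page demand-faulted in, before counting the VM transitions or the page fault itself.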

Shadow-3 brings back the out-of-sync mechanism, but with some key changes. First, only L1 pagetables are allowed to go out-of-sync. All L2+ pagetables are emulated. Second, we don’t necessarily resync on the next page fault. One thing this enables is a “lazy pull-through”: if we get a page fault where the shadow is not-present but the guest is present, we can simply propagate that entry to the shadows, and return to the guest, leaving the rest of the page out-of-sync. This means that once a page is out-of-sync, demand-faulting looks like this:

Pulling through a single guest value is actually cheaper than emulation. So for demand-paging under Windows, we have one-third fewer trips into the hypervisor. Furthermore, batch updates, like process destruction or mapping large address spaces, are propagated to the shadows in a batch at the next CR3 switch, rather than going into and out of the hypervisor on each individual write.
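The lazy pull-through and the batched resync can be sketched with a toy model of a single L1 pagetable that is already out-of-sync. This is an illustration under stated assumptions, not Xen's implementation: the class and method names are hypothetical, the guest-to-machine frame translation is omitted (the shadow simply copies the guest value), and every page fault and CR3 switch is counted as one trip into the hypervisor.

```python
PRESENT = 0x1   # assumed position of the present bit


class OutOfSyncL1:
    """Toy model of one L1 pagetable that is already out-of-sync."""

    def __init__(self, entries=512):
        self.guest = [0] * entries    # what the guest wrote
        self.shadow = [0] * entries   # what the hardware actually walks
        self.hypervisor_trips = 0

    def guest_write(self, index, pte):
        # Page is out-of-sync, so this is a plain memory write: no trap.
        self.guest[index] = pte

    def page_fault(self, index):
        # Lazy pull-through: guest entry present, shadow entry not-present.
        self.hypervisor_trips += 1
        if self.guest[index] & PRESENT and not self.shadow[index] & PRESENT:
            self.shadow[index] = self.guest[index]   # propagate one entry only
            return True
        return False   # some other kind of fault; not modeled here

    def cr3_switch(self):
        # Batched resync: bring the whole page up to date in one trip.
        self.hypervisor_trips += 1
        changed = sum(1 for g, s in zip(self.guest, self.shadow) if g != s)
        self.shadow = list(self.guest)
        return changed


l1 = OutOfSyncL1()
l1.guest_write(5, 0x2000 | PRESENT)
pulled = l1.page_fault(5)                 # one entry pulled through

for i in range(100, 200):                 # e.g. mapping a large region
    l1.guest_write(i, (i << 12) | PRESENT)
resynced = l1.cr3_switch()                # all 100 entries in one trip
```

In this model the 100 batched writes cost a single trip at the CR3 switch, where emulating each write would have cost 100 trips.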

All of this adds up to greatly improved performance for workloads like compilation, compression, databases, and any workload which does a lot of memory management in an HVM guest.