Security Research & Defense
In January 2018, Microsoft published an advisory and security updates for a new class of hardware vulnerabilities involving speculative execution side channels (known as Spectre and Meltdown). In this blog post, we will provide a technical analysis of an additional subclass of speculative execution side channel vulnerability known as Speculative Store Bypass (SSB), which has been assigned CVE-2018-3639. SSB was independently discovered by Ken Johnson of the Microsoft Security Response Center (MSRC) and Jann Horn (@tehjh) of Google Project Zero (GPZ).
This post is primarily geared toward security researchers and engineers who are interested in a technical analysis of SSB and the mitigations that are relevant to it. If you are interested in more general guidance, please refer to our advisory for Speculative Store Bypass and our knowledge base articles for Windows Server, Windows Client, and Microsoft cloud services.
Please note that the information in this post is current as of the date of this post.

TL;DR
Before diving into the technical details, below is a brief summary of the CPUs that are affected by SSB, Microsoft’s assessment of the risk, and the mitigations identified to date.

What is affected? AMD, ARM, and Intel CPUs are affected by CVE-2018-3639 to varying degrees.

What is the risk? Microsoft currently assesses the risk posed by CVE-2018-3639 to our customers as low. We are not aware of any exploitable instances of this vulnerability class in our software at this time, but we are continuing to investigate and we encourage researchers to find and report any exploitable instances of CVE-2018-3639 as part of our Speculative Execution Side Channel Bounty program. We will adapt our mitigation strategy for CVE-2018-3639 as our understanding of the risk evolves.

What is the mitigation? Microsoft has already released mitigations as part of our response to Spectre and Meltdown that are applicable to CVE-2018-3639 in certain scenarios, such as reducing timer precision in Microsoft Edge and Internet Explorer. Software developers can address individual instances of CVE-2018-3639 if they are discovered by introducing a speculation barrier instruction as described in Microsoft’s C++ developer guidance for speculative execution side channels.
Microsoft is working with CPU manufacturers to assess the availability and readiness of new hardware features that can be used to resolve CVE-2018-3639. In some cases, these features will require a microcode or firmware update to be installed. Microsoft plans to provide a mitigation that leverages the new hardware features in a future Windows update.

Speculative Store Bypass (SSB) overview
In our blog post on mitigating speculative execution side channel hardware vulnerabilities, we described three speculation primitives that can be used to create the conditions for a speculative execution side channel. These three primitives provide the fundamental methods for entering speculative execution along a non-architectural path and consist of conditional branch misprediction, indirect branch misprediction, and exception delivery or deferral. Speculative Store Bypass (SSB) belongs to a new category of speculation primitive that we refer to as memory access misprediction.
SSB arises due to a CPU optimization that can allow a potentially dependent load instruction to be speculatively executed ahead of an older store. Specifically, if a load is predicted as not being dependent on a prior store, then the load can be speculatively executed before the store. If the prediction is incorrect, this can result in the load reading stale data and possibly forwarding that data onto other dependent micro-operations during speculation. This can potentially give rise to a speculative execution side channel and the disclosure of sensitive information.
To illustrate how this might occur, it may help to consider the following simple example. In this example, RDI and RSI are assumed to be equal to the same address on the architectural path.

01: 88040F      mov [rdi+rcx],al
02: 4C0FB6040E  movzx r8,byte [rsi+rcx]
03: 49C1E00C    shl r8,byte 0xc
04: 428B0402    mov eax,[rdx+r8]
In this example, the MOV instruction on line 1 may take additional time to execute (e.g. if the computation of the address expression for RDI+RCX is waiting on prior instructions to execute). If this occurs, the CPU may predict that the MOVZX is not dependent on the MOV and may speculatively execute it ahead of the MOV that performs the store. This can result in stale data from the memory located at RSI+RCX being loaded into R8 and fed to a dependent load on line 4. If the byte value in R8 is sensitive, then it may be observed through a side channel by leveraging a cache-based disclosure primitive such as FLUSH+RELOAD (if RDX refers to shared memory) or PRIME+PROBE. The CPU will eventually detect the misprediction and discard the state that was computed, but the data that was accessed during speculation may have created residual side effects in the cache by this point that can then be measured to infer the value that was loaded into R8.
This example is simplified for the purposes of explaining the issue, but it is possible to imagine generalizations of this concept that could occur. For example, it may be possible for similar sequences to exist where SSB could give rise to a speculative out-of-bounds read, type confusion, indirect branch, and so on. We have revised our C++ Developer Guidance for Speculative Execution Side Channels to include additional examples of code patterns and conditions that could give rise to an instance of CVE-2018-3639. In practice, finding an exploitable instance of CVE-2018-3639 will require an attacker to identify an instruction sequence where:
- The sequence is reachable across a trust boundary, e.g. an attacker in user mode can trigger the sequence in kernel mode through a system call.
- The sequence contains a load instruction that is architecturally dependent on a prior store.
- The stale data that is read by the load instruction is sensitive and is used in a way that can create a side channel on the non-architectural path, e.g. the data feeds a disclosure gadget.
- The store instruction does not execute before the load and the dependent instructions that compose the disclosure gadget are speculatively executed.
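As a rough illustration, the data flow in the earlier assembly example can be modeled with a toy Python simulator. This is purely conceptual (real speculation happens in hardware, and the addresses and values here are illustrative), but it shows how a bypassed store lets stale data leave a measurable cache footprint:

```python
# Toy model of a speculative store bypass. A store to the aliased address
# sits in a store buffer while a younger load is (mis)predicted as
# independent and reads stale memory, feeding a cache side channel before
# the pipeline rolls back.

CACHE_LINE = 0x1000  # stride mapping a byte value to a distinct "cache line"

def run(bypass_store):
    memory = {0x2000: 0x41}          # stale (secret) byte at the aliased address
    store_buffer = [(0x2000, 0x00)]  # older store to the same address, not yet retired
    touched_lines = set()            # residual cache side effects (FLUSH+RELOAD target)

    if bypass_store:
        # Misprediction: the load executes ahead of the store and reads stale data.
        value = memory[0x2000]
    else:
        # Architecturally correct result: the store's data is used.
        value = dict(store_buffer)[0x2000]

    # The dependent load (line 4 in the example) leaves a cache footprint
    # even though the speculative results are later discarded.
    touched_lines.add(value * CACHE_LINE)
    return touched_lines

print(run(bypass_store=True))   # footprint encodes the stale byte 0x41
print(run(bypass_store=False))  # footprint encodes the stored byte 0x00
```

In this model, inserting a speculation barrier between the store and the load corresponds to forcing the `bypass_store=False` path: the store must become visible before the younger load may execute.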
While our research into this new vulnerability class is ongoing, we have not identified instruction sequences that satisfy all of the above criteria and we are currently not aware of any exploitable instances of CVE-2018-3639 in our software.
There are multiple mitigations that are applicable to SSB. In our previous blog post on mitigating speculative execution side channels, we characterized the software security models that can generally be at risk and the various tactics for mitigating speculative execution side channels. We will reuse the previously established terminology from that post to frame the mitigation options available for SSB.

Relevance to software security models
The following summarizes the potential relevance of SSB to the various intra-device attack scenarios that software security models are typically concerned with. As with CVE-2017-5753 (Spectre variant 1), SSB is theoretically applicable to each of the following attack scenarios:

- Inter-VM: hypervisor-to-guest, host-to-guest, and guest-to-guest
- Intra-OS: kernel-to-user, process-to-process, and intra-process
- Enclave: enclave-to-any

Preventing speculation techniques involving SSB
As we’ve noted in the past, one of the best ways to mitigate a vulnerability is by addressing the issue as close to the root cause as possible. In the case of SSB, there are a few techniques that can be used to prevent speculation techniques that rely on SSB as the speculation primitive.

Speculation barrier via serializing instruction
As with CVE-2017-5753 (Spectre variant 1), it is possible to mitigate SSB by using an instruction which is architecturally defined to serialize execution, thus acting as a speculation barrier. In the case of SSB, a serializing instruction (such as LFENCE on x86/x64 and SSBB on ARM) can be inserted between the store instruction and the load that could be speculatively executed ahead of the store. For example, inserting an LFENCE on line 2 mitigates the simplified example from this post. Additional information can be found in the C++ Developer Guidance for Speculative Execution Side Channels.

01: 88040F      mov [rdi+rcx],al
02: 0FAEE8      lfence
03: 4C0FB6040E  movzx r8,byte [rsi+rcx]
04: 49C1E00C    shl r8,byte 0xc
05: 428B0402    mov eax,[rdx+r8]

Speculative store bypass disable (SSBD)
In some cases, CPUs can provide facilities for inhibiting a speculative store bypass from occurring and can therefore offer a categorical mitigation for SSB. AMD, ARM, and Intel have documented new hardware features that can be used by software to accomplish this. Microsoft is working with AMD, ARM, and Intel to assess the availability and readiness of these features. In some cases, these features will require a microcode or firmware update to be installed. Microsoft plans to provide a mitigation that leverages the new hardware features in a future Windows update.

Generally applicable mitigations for SSB
There are a number of previously described mitigations that are also generally applicable to SSB. These include mitigations that involve removing sensitive content from memory or removing observation channels. Generally speaking, the mitigation techniques for these two tactics that are effective against CVE-2017-5753 (Spectre variant 1) are also applicable to SSB.

Applicability of mitigations
The complex nature of these issues makes it difficult to understand the relationship between mitigations, speculation techniques, and the attack scenarios to which they apply. This section provides tables to help describe these relationships. Some of the mitigation techniques mentioned in the tables below are described in our previous blog post on this subject.
Mitigation relationship to attack scenarios

The following summarizes the relationship between the attack scenarios (inter-VM, intra-OS, and enclave) and the applicable mitigations, grouped by mitigation tactic:

- Prevent speculation techniques: speculation barrier via execution serializing instruction; security domain CPU core isolation; indirect branch speculation barrier on demand and mode change; non-speculated or safely-speculated indirect branches; Speculative Store Bypass Disable (SSBD)
- Remove sensitive content from memory: hypervisor address space segregation; split user and kernel page tables (“KVA Shadow”)
- Remove observation channels: map guest memory as noncacheable in root extended page tables; do not share physical pages across guests; decrease browser timer precision

Mitigation relationship to variants

The same set of mitigations also relates to the individual variants: CVE-2017-5753 (variant 1), CVE-2017-5715 (variant 2), CVE-2017-5754 (variant 3), and CVE-2018-3639 (SSB).

Wrapping up
In this post, we analyzed a new class of speculative execution side channel hardware vulnerabilities known as Speculative Store Bypass (SSB). This analysis provided the basis for evaluating the risk associated with this class of vulnerability and the mitigation options that exist. As we noted in our previous post, research into speculative execution side channels is ongoing and we will continue to evolve our response and mitigations as we learn more. While we currently assess the risk of SSB as low, we encourage researchers to help further our understanding of the true risk and to report any exploitable instances of CVE-2018-3639 that may exist as part of our Speculative Execution Side Channel bounty program.
Microsoft Security Response Center (MSRC)
The security of Microsoft’s cloud services is a top priority for us. One of the technologies that is central to cloud security is Microsoft Hyper-V which we use to isolate tenants from one another in the cloud. Given the importance of this technology, Microsoft has made and continues to make significant investment in the security of Hyper-V and the powerful security features that it enables, such as Virtualization-Based Security (VBS). To reinforce this commitment, Microsoft offers rewards of up to $250,000 USD for the discovery of vulnerabilities in Hyper-V through our Hyper-V Bounty Program.
We would like to share with the security community that we have now released debugging symbols for many of the core components in Hyper-V, with some exceptions, such as the hypervisor, where we would like to avoid customers taking a dependency on undocumented hypercalls.
The symbols that have been made available allow security researchers to better analyze Hyper-V’s implementation and report any vulnerabilities that may exist as part of our Hyper-V Bounty Program. The list of the components that now have debugging symbols available can be found in this blog post by the Microsoft Virtualization team.
We believe this is a step toward contributing more of our internal knowledge back to the security research community. As always, please let us know if you find any new vulnerabilities at firstname.lastname@example.org, or reach us with any other questions @msftsecresponse.
MSRC Vulnerabilities and Mitigations Team
DLL planting issues (aka binary planting/hijacking/preloading) resurface every now and then, and it is not always clear how Microsoft will respond to a given report. This blog post aims to clarify the parameters we consider when triaging DLL planting issues.
It is well known that when an application loads a DLL without specifying a fully qualified path, Windows attempts to locate the DLL by searching a well-defined set of directories in an order known as the DLL search order. The search order used in the default SafeDllSearchMode is as follows:
- The directory from which the application loaded.
- The system directory. Use the GetSystemDirectory function to get the path of this directory.
- The 16-bit system directory. There is no function that obtains the path of this directory, but it is searched.
- The Windows directory. Use the GetWindowsDirectory function to get the path of this directory.
- The current directory.
- The directories that are listed in the PATH environment variable. Note that this does not include the per-application path specified by the App Paths registry key. The App Paths key is not used when computing the DLL search path.
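The resolution rule implied by the list above (return the first match in order) can be sketched as a small function. This is a conceptual model, not Windows loader code; the directory paths and contents are illustrative:

```python
# Sketch of SafeDllSearchMode resolution: return the full path of the
# first directory in the search order that contains the requested DLL.

SEARCH_ORDER = [
    r"C:\Program Files\App",   # 1. application directory
    r"C:\Windows\System32",    # 2. system directory
    r"C:\Windows\System",      # 3. 16-bit system directory
    r"C:\Windows",             # 4. Windows directory
    r"D:\temp",                # 5. current directory (CWD)
    r"C:\Tools",               # 6. PATH directories
]

def resolve_dll(name, dir_contents, search_order=SEARCH_ORDER):
    """dir_contents maps a directory to the set of DLL names it holds."""
    for directory in search_order:
        held = {n.lower() for n in dir_contents.get(directory, set())}
        if name.lower() in held:
            return directory + "\\" + name
    return None  # the load fails

contents = {r"C:\Windows\System32": {"kernel32.dll"}, r"D:\temp": {"foo.dll"}}
print(resolve_dll("kernel32.dll", contents))  # found in the system directory
print(resolve_dll("foo.dll", contents))       # falls all the way through to the CWD
```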
The default DLL search order can be changed via various options, as noted in one of our previous blog posts, “Load Library Safely”.
A DLL load becomes a DLL planting vulnerability if an attacker can plant a malicious DLL in one of the directories that is searched, and the DLL is not found in a directory searched earlier in the order to which the attacker has no access. For example, if an application loads foo.dll and that DLL is not present in the application directory, the system directory, or the Windows directory, an attacker who has access to the current working directory has an opportunity to plant foo.dll. DLL planting vulnerabilities are convenient and require little work from an attacker: they yield very easy code execution, since DllMain() is called immediately when the DLL is loaded. Attackers don’t have to worry about bypassing any mitigations if the application allows loading unsigned binaries.
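The planting condition described above can be expressed as a simple predicate over the search order: planting succeeds if an attacker-writable directory is reached before any directory that actually contains the DLL. This is a conceptual sketch with illustrative directory labels:

```python
# Sketch of the DLL planting condition: walk the search order and see
# whether an attacker-writable directory is reached before a legitimate
# copy of the DLL is found.

def is_plantable(search_order, dirs_with_dll, attacker_writable):
    for directory in search_order:
        if directory in attacker_writable:
            return True   # attacker's directory is searched before a real copy
        if directory in dirs_with_dll:
            return False  # legitimate copy found first; planting is defeated
    return False          # the DLL is never loaded from a searched directory

order = ["app_dir", "system32", "16bit_system", "windows", "cwd", "path"]

# foo.dll is absent from the trusted locations, attacker controls the CWD:
print(is_plantable(order, dirs_with_dll=set(), attacker_writable={"cwd"}))         # True
# The DLL exists in System32, so the CWD is never consulted:
print(is_plantable(order, dirs_with_dll={"system32"}, attacker_writable={"cwd"}))  # False
```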
Based on where the malicious DLL can be planted in the DLL search order the vulnerability broadly falls into one of the three categories:
- Application Directory (App Dir) DLL planting.
- Current Working Directory (CWD) DLL planting.
- PATH Directories DLL planting.
The above categories are what guide our response. Let’s look at these categories to see how we triage each of them.

Application Directory (App Dir) DLL planting
The application directory is where an application keeps its dependent non-system DLLs and trusts them to be intact. Files located in a program's installation directory are presumed to be benevolent and trustworthy, and a directory ACL is typically used to safeguard them. Anyone able to replace a binary in the installation directory presumably has the privileges necessary to write or overwrite files there. The application directory is considered a code directory, where code-related artifacts for the application should be stored. If an attacker can overwrite a DLL within the application directory without being on the directory ACL, that is a much bigger issue than replacing or planting a single DLL.
Let’s look at some of the scenarios involved with application directory DLL planting:
Scenario 1: Malicious binary planting in a trusted application directory.
Properly installed applications generally safeguard the application directory with ACLs, so elevated access (typically admin) is required to modify the contents of the application directory in this scenario. For example, Microsoft Word’s installation location is “C:\Program Files (x86)\Microsoft Office\root\Office16\”. Admin access is required to modify anything in this directory. A victim who has admin rights can be tricked or socially engineered into planting DLLs in a trusted location, but if that is the case, they can also be tricked into doing worse things.
Scenario 2: Malicious binary planted in an untrusted application directory.
Applications installed via XCOPY (with no installer used), run from a share, downloaded from the internet, or shipped as a standalone executable in a non-ACLed directory are some of the scenarios that fall into the untrusted category. An example is an installer (including redistributables, setup.exe files generated by ClickOnce, and self-extracting archives generated by IExpress) downloaded from the internet and run from the default “Downloads” folder. Launching an application from an untrustworthy location is dangerous; a victim can easily be tricked or fooled into planting DLLs in these untrusted locations.
A DLL planting issue that falls into this category, Application Directory DLL planting, is treated as a defense-in-depth issue that will be considered for updates in future versions only. We resolve any MSRC case that falls into this category as a vNext consideration, mainly due to the amount of social engineering involved in the attack and the by-design nature of the bug. A victim would have to be tricked into placing the malicious DLL (malware) where it can be triggered AND perform a non-recommended action (like running an installer in the same directory as the malware). A non-installed application has no reference point for a “known good directory/binaries”, unless it creates the directory itself. Ideally, the installer should create a temporary directory with a randomized name (to prevent further DLL planting), extract its binaries to it, and use them to install the application. While an attacker can use a drive-by download to place the malware on the victim's system, such as into the “Downloads” folder, the essence of the attack is social engineering.
In the Windows 10 Creators Update, we added a new process mitigation that can be used to mitigate Application Directory DLL planting vulnerabilities. This new process mitigation, PreferSystem32, when opted in, toggles the order of the application directory and System32 in the DLL search order. As a result, a system binary can no longer be hijacked by planting a malicious copy of it in the application directory. This mitigation can be enabled in scenarios where process creation can be controlled.

Current Working Directory (CWD) DLL planting
Applications typically set the directory from which they are invoked as the CWD; this applies even when the application is invoked based on the default file association. For example, opening the file “D:\temp\file.abc” makes “D:\temp” the CWD for the application associated with the .abc file type.
Hosting files on a remote share, especially a WebDAV share, makes CWD DLL planting easier to exploit: an attacker can host the malicious DLL alongside the file and socially engineer the victim into opening the file, causing the malicious DLL to be loaded into the target application.
Scenario 3: Malicious binary planted in the CWD.
An application loading a DLL that is not present in any of the first three trusted locations will look for it in the untrusted CWD. A victim opening a .doc file from the location \\server1\share2\ will launch Microsoft Word; if Microsoft Word can’t find one of its dependent DLLs, oart.dll, in a trusted location, it will try to load it from the CWD \\server1\share2\. Since the share is an untrusted location, an attacker can easily plant oart.dll there to feed it to the application.
Trigger => \\server1\share2\openme.doc
Application => C:\Program Files (x86)\Microsoft Office\root\Office16\Winword.exe
App Dir=> C:\Program Files (x86)\Microsoft Office\root\Office16\
CWD => \\server1\share2\
Malicious DLL => \\server1\share2\OART.DLL
A DLL planting issue that falls into this category of CWD DLL planting is treated as an Important-severity issue, and we will issue a security patch for it. Most of the DLL planting issues that we have fixed in the past fall into this category; advisory 2269637 lists a subset of them. This raises the question of why any Microsoft application would load DLLs that are not present in its application directory, the system directory, or the Windows directory. It so happens that various optional components, OS editions, and architectures ship with different sets of binaries, which applications sometimes fail to account for or verify effectively before loading the DLLs.

PATH Directories DLL planting
The last resort for finding a DLL in the DLL search order is the PATH directories, a set of directories that various applications have added to make it easier to locate applications and their artifacts.
The directories listed in the PATH environment variable are expected to be admin-ACLed, so a normal user can’t modify their contents. If a world-writable directory is exposed via PATH, then that is a bigger issue than the single instance of DLL planting, and we deal with it as an Important-severity issue. The DLL planting issue by itself, however, is considered low severity, since we don’t expect it to cross any security boundary. Thus, DLL planting issues that fall into the category of PATH directories DLL planting are treated as won’t fix.

Conclusion
We hope this clears up questions about how we triage a reported DLL planting issue and which situations we consider severe enough to warrant a security patch. Below is a quick guide to what we will and won’t fix via a security release (down-level).

What Microsoft will address with a security fix
CWD scenarios – for example, an associated application loading a DLL from the untrusted CWD.

What Microsoft will consider addressing the next time a product is released
Application directory scenarios – This is at the complete discretion of the product group, based on whether it is an explicit load or an implicit load. Explicit loads can be tweaked, but implicit loads (dependent DLLs) are strictly by design, as the path can’t be controlled.

What Microsoft won't address (not a vulnerability)
PATH directory scenarios – Since there shouldn’t be a non-admin-ACLed directory in the PATH, this can’t be exploited.
Antonio Galvan, MSRC
Swamy Shivaganga Nagaraju, MSRC Vulnerabilities and Mitigations Team
On January 3rd, 2018, Microsoft released an advisory and security updates that relate to a new class of discovered hardware vulnerabilities, termed speculative execution side channels, that affect the design methodology and implementation decisions behind many modern microprocessors. This post dives into the technical details of Kernel Virtual Address (KVA) Shadow which is the Windows kernel mitigation for one specific speculative execution side channel: the rogue data cache load vulnerability (CVE-2017-5754, also known as “Meltdown” or “Variant 3”). KVA Shadow is one of the mitigations that is in scope for Microsoft's recently announced Speculative Execution Side Channel bounty program.
It’s important to note that there are several different types of issues that fall under the category of speculative execution side channels, and that different mitigations are required for each type of issue. Additional information about the mitigations that Microsoft has developed for other speculative execution side channel vulnerabilities (“Spectre”), as well as additional background information on this class of issue, can be found here.
Please note that the information in this post is current as of the date of this post.

Vulnerability description & background
The rogue data cache load hardware vulnerability relates to how certain processors handle permission checks for virtual memory. Processors commonly implement a mechanism to mark virtual memory pages as owned by the kernel (sometimes termed supervisor), or as owned by user mode. While executing in user mode, the processor prevents accesses to privileged kernel data structures by way of raising a fault (or exception) when an attempt is made to access a privileged, kernel-owned page. This protection of kernel-owned pages from direct user mode access is a key component of privilege separation between kernel and user mode code.
Certain processors capable of speculative out-of-order execution, including many currently in-market processors from Intel, and some ARM-based processors, are susceptible to a speculative side channel that is exposed when an access to a page incurs a permission fault. On these processors, an instruction that performs an access to memory that incurs a permission fault will not update the architectural state of the machine. However, these processors may, under certain circumstances, still permit a faulting internal memory load µop (micro-operation) to forward the result of the load to subsequent, dependent µops. These processors can be said to defer handling of permission faults to instruction retirement time.
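The deferral just described can be modeled with a toy Python sketch. This is purely conceptual (real behavior lives in the processor pipeline; the values and the cache-line stride are illustrative), showing how forwarding a faulting load's data to dependent µops leaves an observable trace:

```python
# Toy model of deferring a permission fault to retirement. On an affected
# part, the faulting load's data is forwarded to dependent micro-ops,
# which leave a cache footprint before the whole chain is cancelled.

CACHE_LINE = 0x1000  # stride mapping a byte value to a distinct "cache line"

def user_mode_read(kernel_byte, forwards_on_fault):
    touched_lines = set()

    # µop 1: user-mode load from a kernel-owned page. A permission fault
    # is incurred, but its handling is deferred until retirement.
    if forwards_on_fault:
        # Affected processors forward the loaded data to dependents anyway:
        touched_lines.add(kernel_byte * CACHE_LINE)  # µop 2: dependent cache load

    # Retirement: the fault is raised, the architectural result is
    # discarded, but the cache footprint survives.
    return None, touched_lines

print(user_mode_read(0x5A, forwards_on_fault=True))   # footprint encodes 0x5A
print(user_mode_read(0x5A, forwards_on_fault=False))  # no footprint left behind
```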
Out-of-order processors are obligated to “roll back” the architecturally-visible effects of speculative execution down paths that are proven to have never been reachable during in-program-order execution, and as such, any µops that consume the result of a faulting load are ultimately cancelled and rolled back by the processor once the faulting load instruction retires. However, these dependent µops may still have issued subsequent cache loads based on the (faulting) privileged memory load, or may otherwise have left additional traces of their execution in the processor’s caches. This creates a speculative side channel: the remnants of cancelled, speculative µops that operated on the data returned by a load incurring a permission fault may be detectable through disturbances to the processor cache, and this may enable an attacker to infer the contents of privileged kernel memory that they would not otherwise have access to. In effect, this enables an unprivileged user mode process to disclose the contents of privileged kernel mode memory.

Operating system implications
Most operating systems, including Windows, rely on per-page user/kernel ownership permissions as a cornerstone of enforcing privilege separation between kernel mode and user mode. A speculative side channel that enables unprivileged user mode code to infer the contents of privileged kernel memory is problematic given that sensitive information may exist in the kernel’s address space. Mitigating this vulnerability on affected, in-market hardware is especially challenging, as user/kernel ownership page permissions must be assumed to no longer prevent the disclosure (i.e., reading) of kernel memory contents from user mode. Thus, on vulnerable processors, the rogue data cache load vulnerability impacts the primary tool that modern operating system kernels use to protect themselves from privileged kernel memory disclosure by untrusted user mode applications.
In order to protect kernel memory contents from disclosure on affected processors, it is thus necessary to go back to the drawing board with how the kernel isolates its memory contents from user mode. With the user/kernel ownership permission no longer effectively safeguarding against memory reads, the only other broadly-available mechanism to prevent disclosure of privileged kernel memory contents is to entirely remove all privileged kernel memory from the processor’s virtual address space while executing user mode code.
This, however, is problematic, in that applications frequently make system service calls to request that the kernel perform operations on their behalf (such as opening or reading a file on disk). These system service calls, as well as other critical kernel functions such as interrupt processing, can only be performed if their requisite, privileged code and data are mapped in to the processor’s address space. This presents a conundrum: in order to meet the security requirements of kernel privilege separation from user mode, no privileged kernel memory may be mapped into the processor’s address space, and yet in order to reasonably handle any system service call requests from user mode applications to the kernel, this same privileged kernel memory must be quickly accessible for the kernel itself to function.
The solution to this quandary is, on transitions between kernel mode and user mode, to also switch the processor’s address space between a kernel address space (which maps the entire user and kernel address space) and a shadow user address space (which maps the entire user memory contents of a process, but only a minimal subset of kernel mode transition code and data pages needed to switch into and out of the kernel address space). The select set of privileged kernel transition code and data pages handling the details of these address space switches, which are “shadowed” into the user address space, are “safe” in that they do not contain any privileged data that would be harmful to the system if disclosed to an untrusted user mode application. In the Windows kernel, the use of this disjoint set of shadow address spaces for user and kernel modes is called “kernel virtual address shadowing”, or KVA shadow for short.
In order to support this concept, each process may now have up to two address spaces: the kernel address space and the user address space. As there is no virtual memory mapping for other, potentially sensitive privileged kernel data when untrusted user mode code executes, the rogue data cache load speculative side channel is completely mitigated. This approach is not, however, without substantial complexity and performance implications, as will later be discussed.
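The two-address-space arrangement can be sketched as a tiny model. The page names here are illustrative labels, not real Windows structures, but they capture why the mitigation is categorical: while user code runs, privileged kernel data simply is not mapped:

```python
# Toy model of KVA shadow: each process has a full kernel address space
# and a shadow user address space that maps only user pages plus a
# minimal set of transition pages used to enter and leave the kernel.

KERNEL_AS = {"user_code", "user_data", "transition", "kernel_code", "kernel_data"}
SHADOW_AS = {"user_code", "user_data", "transition"}

def is_mapped(page, mode):
    """True if the page is mapped in the address space active for this mode."""
    active = KERNEL_AS if mode == "kernel" else SHADOW_AS
    return page in active

# A rogue-data-cache-load gadget running in user mode has nothing to read:
print(is_mapped("kernel_data", mode="user"))    # not mapped at all
print(is_mapped("kernel_data", mode="kernel"))  # mapped after the switch
print(is_mapped("transition", mode="user"))     # needed to enter the kernel
```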
On a historical note, some operating systems have previously implemented similar mechanisms for a variety of unrelated reasons. For example, in 2003 (prior to the common introduction of 64-bit processors in most broadly-available consumer hardware), with the intention of addressing larger amounts of virtual memory on 32-bit systems, optional support was added to the 32-bit x86 Linux kernel to provide a 4GB virtual address space to user mode and a separate 4GB address space to the kernel, requiring address space switches on each user/kernel transition. More recently, a similar approach, termed KAISER, has been advocated to mitigate information leakage about the kernel virtual address space layout due to processor side channels. That concern is distinct from the rogue data cache load issue: prior to the discovery of speculative side channels, only address space layout information, not kernel memory contents, was considered to be at risk.

KVA shadow implementation in the Windows kernel
While the design requirements of KVA shadow may seem relatively innocuous (privileged kernel-mode memory must not be mapped into the address space when untrusted user mode code runs), the implications of these requirements are far-reaching throughout the Windows kernel architecture. They touch a substantial number of core facilities for the kernel, such as memory management, trap and exception dispatching, and more. The situation is further complicated by a requirement that the same kernel code and binaries must be able to run with and without KVA shadow enabled. Performance of the system in both configurations must be maximized, while simultaneously keeping the scope of the changes required for KVA shadow as contained as possible. This maximizes the maintainability of the code in both KVA shadow and non-KVA-shadow configurations.
This section focuses primarily on the implications of KVA shadow for the 64-bit x86 (x64) Windows kernel. Most considerations for KVA shadow on x64 also apply to 32-bit x86 kernels, though there are some divergences between the two architectures. This is due to ISA differences between 64-bit and 32-bit modes, particularly with trap and exception handling.
Please note that the implementation details described in this section are subject to change without notice in the future. Drivers and applications must not take dependencies on any of the internal behaviors described below without first checking for updated documentation.
The best way to understand the complexities involved with KVA shadow is to start with the underlying low-level interface in the kernel that handles the transitions between user mode and kernel mode. This interface, called the trap handling code, is responsible for fielding traps (or exceptions) that may occur from either kernel mode or user mode. It is also responsible for dispatching system service calls and hardware interrupts. There are several events that the trap handling code must handle, but the most relevant for KVA shadow are those called “kernel entry” and “kernel exit” events. These events, respectively, involve transitions from user mode into kernel mode, and from kernel mode into user mode.

Trap handling and system service call dispatching overview and retrospective
As a quick recap of how the Windows kernel dispatches traps and exceptions on x64 processors: traditionally, the kernel programs the current thread’s kernel stack pointer into the current processor’s TSS (task state segment), specifically into the KTSS64.Rsp0 field, which informs the processor which stack pointer (RSP) value to load on a ring transition to ring 0 (kernel mode) code. This field is traditionally updated by the kernel on context switch and several other related internal events; when a switch to a different thread occurs, the processor’s KTSS64.Rsp0 field is updated to point to the base of the new thread’s kernel stack, such that any kernel entry event that occurs while that thread is running enters the kernel already on that thread’s stack. The exception to this rule is system service calls, which typically enter the kernel with a “syscall” instruction; this instruction does not switch the stack pointer, and it is the responsibility of the operating system trap handling code to manually load an appropriate kernel stack pointer.
On typical kernel entry, the hardware has already pushed what is termed a “machine frame” (internally, MACHINE_FRAME) on the kernel stack; this is the processor-defined data structure that the IRETQ instruction consumes and removes from the stack to effect an interrupt-return, and includes details such as the return address, code segment, stack pointer, stack segment, and processor flags on the calling application. The trap handling code in the Windows kernel builds a structure called a trap frame (internally, KTRAP_FRAME) that begins with the hardware-pushed MACHINE_FRAME, and then contains a variety of software-pushed fields that describe the volatile register state of the context that was interrupted. System calls, as noted above, are an exception to this rule, and must manually build the entire KTRAP_FRAME, including the MACHINE_FRAME, after effecting a stack switch to an appropriate kernel stack for the current thread.

KVA shadow trap and system service call dispatching design considerations
With a basic understanding of how traps are handled without KVA shadow, let’s dive into the details of the KVA shadow-specific considerations of trap handling in the kernel.
When designing KVA shadow, several considerations applied to trap handling while KVA shadow is active: the security requirements had to be met, the performance impact on the system had to be minimized, and the changes to the trap handling code had to be kept as compartmentalized as possible in order to simplify the code and improve maintainability. For example, it is desirable to share as much trap handling code between the KVA shadow and non-KVA shadow configurations as practical, so that it is easier to make changes to the kernel’s trap handling facilities in the future.
When KVA shadowing is active, user mode code typically runs with the user mode address space selected. It is the responsibility of the trap handling code to switch to the kernel address space on kernel entry, and to switch back to the user address space on kernel exit. However, additional details apply: it is not sufficient to simply switch address spaces, because the only kernel transition pages that can be permitted to exist (or be “shadowed”) in the user address space are those that hold contents that are “safe” to disclose to user mode. The first complication that KVA shadow encounters is that it would be inappropriate to shadow the kernel stack pages for each thread into the user mode address space, as this would allow potentially sensitive, privileged kernel memory contents on kernel thread stacks to be leaked via the rogue data cache load speculative side channel.
It is also desirable to keep the set of code and data structures that are shadowed into the user mode address space to a minimum, and if possible, to only shadow permanent fixtures in the address space (such as portions of the kernel image itself, and critical per-processor data structures such as the GDT (Global Descriptor Table), IDT (Interrupt Descriptor Table), and TSS). This simplifies memory management, as handling setup and teardown of new mappings that are shadowed into user mode address spaces has associated complexities, as would enabling any shadowed mappings to become pageable. For these reasons, it was clear that it would not be acceptable for the kernel’s trap handling code to continue to use the per-kernel-thread stack for kernel entry and kernel exit events. Instead, a new approach would be required.
The solution that was implemented for KVA shadow was to switch to a mode of operation wherein a small set of per-processor stacks (internally called KTRANSITION_STACKs) are the only stacks that are shadowed into the user mode address space. Eight of these stacks exist for each processor, the first of which represents the stack used for “normal” kernel entry events, such as exceptions, page faults, and most hardware interrupts, and the remaining seven transition stacks represent the stacks used for traps that are dispatched using the x64-defined IST (Interrupt Stack Table) mechanism (note that Windows does not use all 7 possible IST stacks presently).
When KVA shadow is active, then, each processor’s KTSS64.Rsp0 field points to that processor’s first transition stack, and each of the KTSS64.Ist[n] fields points to the n-th KTRANSITION_STACK for that processor. For convenience, the transition stacks are located in a contiguous region of memory, internally termed the KPROCESSOR_DESCRIPTOR_AREA, that also contains the per-processor GDT, IDT, and TSS, all of which are required to be shadowed into the user mode address space for the processor itself to be able to handle ring transitions properly. This contiguous memory block is, itself, shadowed in its entirety.
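A minimal sketch of this per-processor layout follows (illustrative Python; the base address and 4KB stack size are invented for the example, and the real KPROCESSOR_DESCRIPTOR_AREA layout is an internal implementation detail subject to change):

```python
NUM_TRANSITION_STACKS = 8   # one "normal" stack plus seven IST stacks
STACK_SIZE = 0x1000         # assumed size, for illustration only

def build_processor_descriptor_area(base):
    """Lay out 8 contiguous transition stacks and return a toy KTSS64."""
    stacks = [base + i * STACK_SIZE for i in range(NUM_TRANSITION_STACKS)]
    tss = {
        # Stack 0 fields "normal" kernel entries; stacks 1..7 back the
        # x64 Interrupt Stack Table slots. Stacks grow down from the top.
        "Rsp0": stacks[0] + STACK_SIZE,
        "Ist": [stacks[n] + STACK_SIZE for n in range(1, 8)],
    }
    return tss, stacks

tss, stacks = build_processor_descriptor_area(0xFFFFF78000000000)
assert tss["Rsp0"] == stacks[0] + STACK_SIZE
assert len(tss["Ist"]) == 7
```

Keeping the stacks contiguous with the GDT, IDT, and TSS means the whole region can be shadowed into the user address space as one block.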
This configuration ensures that when a kernel entry event is fielded while KVA shadow is active, the current stack is both shadowed into the user mode address space and does not contain sensitive memory contents that would be risky to disclose to user mode. However, in order to maintain these properties, the trap dispatch code must be careful to push no sensitive information onto any transition stack at any time. This necessitates the first several rules for KVA shadow, which prevent any other memory contents from being stored onto the transition stacks: when executing on a transition stack, the kernel must be fielding a kernel entry or kernel exit event, interrupts must be disabled and must remain disabled throughout, and the code executing on a transition stack must be careful to never incur any other type of kernel trap. This also implies that the KVA shadow trap dispatch code can assume that traps arising in kernel mode are already executing with the correct CR3, and on the correct kernel stack (except for some special considerations for IST-delivered traps, as discussed below).

Fielding a trap with KVA shadow active
Based on the above design decisions, there is an additional set of tasks specific to KVA shadowing that must occur prior to the normal trap handling code in the kernel being invoked for a kernel entry trap event. In addition, there is a similar set of tasks related to KVA shadow that must occur at the end of trap processing, if a kernel exit is occurring.
On normal kernel entry, the following sequence of events must occur:
- The kernel GS base value must be loaded. This enables the remaining trap code to access per-processor data structures, such as those that hold the kernel CR3 value for the current processor.
- The processor’s address space must be switched to the kernel address space, so that all kernel code and data are accessible (i.e., the kernel CR3 value must be loaded). This necessitates that the kernel CR3 value be stored in a location that is, itself, shadowed. For the purposes of KVA shadow, a single per-processor KPRCB page that contains only “safe” contents maintains a copy of the current processor’s kernel CR3 value for easy access by the KVA shadow trap dispatch code. Context switches between address spaces, and process attach/detach operations, update the corresponding KPRCB fields with the new CR3 value on process address space changes.
- The machine frame previously pushed by hardware as a part of the ring transition from user mode to kernel mode must be copied from the current (transition) stack, to the per-kernel-thread stack for the current thread.
- The current stack must be switched to the per-kernel-thread stack. At this point, the “normal” trap handling code can largely proceed as usual, and without invasive modifications (save that the kernel GS base has already been loaded).
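The four entry steps above can be sketched as a toy simulation (illustrative Python; the cpu, thread, and prcb structures are stand-ins for the real hardware and kernel state, not actual Windows definitions):

```python
def kva_shadow_kernel_entry(cpu, thread, transition_stack):
    """Toy model of the four KVA shadow steps on a normal kernel entry."""
    # 1. Load the kernel GS base so per-processor data is reachable.
    cpu["gs_base"] = cpu["kernel_gs_base"]

    # 2. Switch to the kernel address space; the kernel CR3 value lives
    #    in a shadowed per-processor (KPRCB-like) page, so this code can
    #    read it even before the full kernel address space is selected.
    cpu["cr3"] = cpu["prcb"]["kernel_cr3"]

    # 3. Copy the hardware-pushed machine frame from the transition
    #    stack to the per-thread kernel stack.
    machine_frame = transition_stack.pop()
    thread["kernel_stack"].append(machine_frame)

    # 4. Switch to the per-thread kernel stack; the "normal" trap
    #    handling code can proceed from here largely as usual.
    cpu["rsp"] = thread["kernel_stack"]

cpu = {"gs_base": "user", "kernel_gs_base": "kernel",
       "cr3": "user_cr3", "prcb": {"kernel_cr3": "kernel_cr3"},
       "rsp": None}
thread = {"kernel_stack": []}
transition = [{"rip": 0x7FF612340000, "cs": 0x33, "rsp": 0x9000}]

kva_shadow_kernel_entry(cpu, thread, transition)
assert cpu["cr3"] == "kernel_cr3" and cpu["gs_base"] == "kernel"
assert thread["kernel_stack"][0]["cs"] == 0x33 and not transition
```

Note how nothing beyond the machine frame ever lands on the transition stack, which is the invariant the rules above exist to preserve.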
Roughly speaking, the inverse sequence of events must occur on normal kernel exit; the machine frame at the top of the current kernel thread stack must be copied to the transition stack for the processor, the stacks must be switched, CR3 must be reloaded with the corresponding value for the user mode address space of the current process, the user mode GS base must be reloaded, and then control may be returned to user mode.
System service call entry and exit through the SYSCALL/SYSRETQ instruction pair is handled slightly differently, in that the processor does not push a machine frame, because the kernel logically does not have a current stack pointer until it explicitly loads one. In this case, no machine frame needs to be copied on kernel entry and kernel exit, but the other basic steps must still be performed.
Special care needs to be taken by the KVA shadow trap dispatch code for NMI, machine check, and double fault type trap events, because these events may interrupt even normally uninterruptable code. This means that they could even interrupt the normally uninterruptable KVA shadow trap dispatch code itself, during a kernel entry or kernel exit event. These types of traps are delivered using the IST mechanism onto their own distinct transition stacks, and the trap handling code must carefully handle the case of the GS base or CR3 value being in any state due to the indeterminate state of the machine at the time in which these events may occur, and must preserve the pre-existing GS base or CR3 values.
At this point, the basics for how to enter and exit the kernel with KVA shadow are in place. However, it would be undesirable to inline the KVA shadow trap dispatch code into the standard trap entry and trap exit code paths, as the standard trap entry and trap exit code paths could be located anywhere in the kernel’s .text code section, and it is desirable to minimize the amount of code that needs to be shadowed into the user address space. For this reason, the KVA shadow trap dispatch code is collected into a series of parallel entry points packed within their own code section within the kernel image, and either the standard set of trap entry points or the KVA shadow trap entry points are installed into the IDT at system boot time, based on whether KVA shadow is in use at system boot. Similarly, the system service call entry points are also located in this special code section in the kernel image.
Note that one implication of this design choice is that KVA shadow does not protect against attacks against kernel ASLR using speculative side channels. This is a deliberate decision given the design complexity of KVA shadow, timelines involved, and the realities of other side channel issues affecting the same processor designs. Notably, processors susceptible to rogue data cache load are also typically susceptible to other attacks on their BTBs (branch target buffers), and other microarchitectural resources that may allow kernel address space layout disclosure to a local attacker that is executing arbitrary native code.

Memory management considerations for KVA shadow
Now that KVA shadow is able to handle trap entry and trap exit, it’s necessary to understand the implications of KVA shadowing on memory management. As with the trap handling design considerations for KVA shadow, ensuring the correct security properties, providing good performance characteristics, and maximizing the maintainability of code changes were all important design goals. Where possible, rules were established to simplify the memory management design implementation. For example, all kernel allocations that are shadowed into the user mode address space are shadowed system-wide and not per-process or per-processor. As another example, all such shadowed allocations exist at the same kernel virtual address in both the user mode and kernel mode address spaces and share the same underlying physical pages in both address spaces, and all such allocations are considered nonpageable and are treated as though they have been locked into memory.
The most apparent memory management consequence of KVA shadowing is that each process typically now needs a separate address space (i.e., page table hierarchy, or top level page directory page) allocated to describe the shadow user address space, and that the top level page directory entries corresponding to user mode VAs must be replicated from the process’s kernel address space top level page directory page to the process’s user address space top level page directory page.
The top level page directory page entries for the kernel half of the VA space are not replicated, however, and instead only correspond to a minimal set of page table pages needed to map the small subset of pages that have been explicitly shadowed into the user mode address space. As noted above, pages that are shadowed into the user mode address space are left nonpageable for simplicity. In practice, this is not a substantial hardship for KVA shadow, as only a very small number of fixed allocations are ever shadowed system-wide. (Remember that only the per-processor transition stacks are shadowed, not any per-thread data structures, such as per-thread kernel stacks.)
Memory management must then replicate any updates to top level user mode page directory page entries between the two process address spaces as they occur, and access bit handling for working set aging and other purposes must logically OR the access bits from both the user and kernel address spaces together if a top level page directory page entry is being considered (and, similarly, working set aging must clear access bits in both top level page directory pages if a top level entry is being considered). Similarly, memory management must be aware of both address spaces that may exist for processes in various other edge cases where top-level page directory pages are manipulated.
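A toy model of this replication and access-bit merging (illustrative Python; real top-level entries are hardware-defined 64-bit PML4 entries, and 0x20 is the architectural accessed bit, but the helper names here are invented for the example):

```python
ACCESSED = 0x20             # accessed bit in an x64 paging-structure entry

def set_user_pde(proc, slot, entry):
    """Any update to a user-half top-level entry hits both address spaces."""
    proc["kernel_dir"][slot] = entry
    proc["user_dir"][slot] = entry

def entry_accessed(proc, slot):
    """Working-set aging must OR the accessed bits from both directories,
    since the CPU may have set the bit in either address space."""
    return bool((proc["kernel_dir"].get(slot, 0) |
                 proc["user_dir"].get(slot, 0)) & ACCESSED)

proc = {"kernel_dir": {}, "user_dir": {}}
set_user_pde(proc, 5, 0x1000 | 0x7)     # present, writable, user
proc["user_dir"][5] |= ACCESSED         # CPU sets the A bit in one space only
assert entry_accessed(proc, 5)          # aging still observes the access
```

The same "both directories or neither" discipline applies when clearing the bits during aging.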
Finally, the kernel cannot mark any general purpose kernel allocations as “global” in their corresponding leaf page table entries. For KVA shadow protections to be effective, processors susceptible to rogue data cache load must not be able to observe cached virtual address translations for any privileged kernel pages that could contain sensitive memory contents while in user mode, and such global entries would still be cached in the processor translation buffer (TB) across an address space switch.

Booting is just the beginning of a journey
At this point, we have covered some of the major areas involved in the kernel with respect to KVA shadow. However, there’s much more that’s involved beyond just trap handling and memory management: For example, changes to how Windows handles multiprocessor initialization, hibernate and resume, processor shutdown and reboot, and many other areas were all required in order to make KVA shadow into a fully featured solution that works correctly in all supported software configurations.
Furthermore, preventing the rogue data cache load issue from exposing privileged kernel mode memory contents is just the beginning of turning KVA shadow into a feature that could be shipped to a diverse customer base. So far, we have only touched on the highlights of an unoptimized implementation of KVA shadow on x64 Windows. We’re far from done examining KVA shadowing, however; a substantial amount of additional work was required to reduce the performance overhead of KVA shadow to the minimum possible. As we’ll see, a number of options have been considered and employed to that end. The optimizations below are already included with the January 3rd, 2018 security updates that address rogue data cache load.

Performance optimizations
One of the primary challenges faced by the implementation of KVA shadow was maximizing system performance. The model of a unified, flat address space shared between user and kernel mode, with page permission bits to protect kernel-owned pages from access by unprivileged user mode code, is both convenient for an operating system kernel to implement, and easily amenable to high performance user/kernel transitions.
The reason why the traditional, unified address space model allows for fast user/kernel transitions relates to how processors handle virtual memory. Processors typically cache previously fetched virtual address translations in a small internal cache that is termed a translation buffer, (or TB, for short); some literature also refers to these types of address translation caches as translation lookaside buffers (or TLBs for short). The processor TB operates on the principle of locality: if an application (or the kernel) has referenced a particular virtual address translation recently, it is likely to do so again, and the processor can save the costly process of re-walking the operating system’s page table hierarchy if the requisite translation is already cached in the processor TB.
Traditionally, a TB contains information that is primarily local to a particular address space (or page table hierarchy), and when a switch to a different page table hierarchy occurs, such as with a context switch between threads in different processes, the processor TB must be flushed so that translations from one process are not improperly used in the context of a different process. This is critical, as two processes can, and frequently do, map the same user mode virtual address to completely different physical pages.
KVA shadowing requires switching address spaces much more frequently than operating systems have traditionally done so, however; on processors susceptible to the rogue data cache load issue, it is now necessary to switch the address space on every user/kernel transition, which are vastly more frequent events than cross-process context switches. In the absence of any further optimizations, the fact that the processor TB is flushed and invalidated on each user/kernel transition would substantially reduce the benefit of the processor TB, and would represent a significant performance cost on the system.
Fortunately, there are techniques that the Windows KVA shadow implementation employs to substantially mitigate the performance costs of KVA shadowing on processor hardware that is susceptible to rogue data cache load. Optimizing KVA shadow for maximum performance was a challenging exercise in finding creative ways to use existing, in-the-field hardware capabilities, sometimes outside the scope of their original intended use, while still maintaining system security and correct system operation.

PCID acceleration
The first optimization, the usage of PCID (process-context identifier) acceleration, is relevant to Intel Core-family processors of Haswell and newer microarchitectures. While the TB on many processors traditionally maintained information local to an address space, which had to be flushed on any address space switch, the PCID hardware capability allows address translations to be tagged with a logical PCID that informs the processor which address space they are relevant to. An address space (or page table hierarchy) can be tagged with a distinguished PCID value, and this tag is maintained with any non-global translations that are cached in the processor’s TB; then, on an address space switch to an address space with a different associated PCID, the processor can be instructed to preserve the previous TB contents. Because the processor requires the current address space’s PCID to match that of a cached translation for any TB lookup to hit, address translations from multiple address spaces can now be safely represented concurrently in the processor TB.
On hardware that is PCID-capable and which requires KVA shadowing, the Windows kernel employs two distinguished PCID values, which are internally termed PCID_KERNEL and PCID_USER. The kernel address space is tagged with PCID_KERNEL, and the user address space is tagged with PCID_USER, and on each user/kernel transition, the kernel will typically instruct the processor to preserve the TB contents when switching address spaces. This enables the preservation of the entire TB contents on system service calls and other high frequency user/kernel transitions, and in many workloads, substantially mitigates almost all of the cost of KVA shadowing. Some duplication of TB entries between user and kernel mode is possible if the same user mode VA is referenced by user and kernel code, and additional processing is also required on some types of TB flushes, as certain types of TB flushes (such as those that invalidate user mode VAs) must be replicated to both user and kernel PCIDs. However, this overhead is typically relatively minor compared to the loss of all TB entries if the entire TB were not preserved on each user/kernel transition.
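The PCID-tagged TB behavior described above can be modeled with a small simulation (illustrative Python; the PCID values here are hypothetical, though on real x64 hardware the PCID does occupy CR3 bits 0-11 and setting bit 63 on a CR3 write requests that cached translations be preserved):

```python
PCID_KERNEL, PCID_USER = 1, 2   # hypothetical values for the two tags

class Tlb:
    """Toy PCID-tagged TB: entries are keyed by (pcid, virtual address)."""
    def __init__(self):
        self.entries = {}       # (pcid, va) -> pa
        self.pcid = PCID_KERNEL

    def write_cr3(self, pcid, noflush):
        if not noflush:
            # A plain CR3 write invalidates non-global translations for
            # the incoming PCID; a no-flush write preserves everything.
            self.entries = {k: v for k, v in self.entries.items()
                            if k[0] != pcid}
        self.pcid = pcid

    def fill(self, va, pa):
        self.entries[(self.pcid, va)] = pa

    def lookup(self, va):
        # A hit requires the cached entry's PCID to match the current one.
        return self.entries.get((self.pcid, va))

tlb = Tlb()
tlb.write_cr3(PCID_USER, noflush=False)
tlb.fill(0x7000, 0xAAAA)                    # user-mode translation cached
tlb.write_cr3(PCID_KERNEL, noflush=True)    # kernel entry: TB preserved
assert tlb.lookup(0x7000) is None           # no cross-PCID hit is possible
tlb.write_cr3(PCID_USER, noflush=True)      # kernel exit
assert tlb.lookup(0x7000) == 0xAAAA         # user TB contents survived
```

This is the property the kernel exploits: a round trip through the kernel on a system service call costs no user TB entries.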
On address space switches between processes, such as context switches between two different processes, the entire TB is invalidated. This must be performed because the PCID values assigned by the kernel are not process-specific, but are global to the entire system. Assigning different PCID values to each process (which would be a more “traditional” usage of PCID) would preclude the need to flush the entire TB on context switches between processes, but would also require TB flush IPIs (interprocessor interrupts) to be sent to a potentially much larger set of processors (specifically, all of those that had previously loaded a given PCID), which is in and of itself a performance trade-off due to the cost involved in TB flush IPIs.
It’s important to note that PCID acceleration also requires the hypervisor to expose CR4.PCID and the INVPCID instruction to the Windows kernel. The Hyper-V hypervisor was updated to expose these capabilities with the January 3rd, 2018 security updates. Additionally, the underlying PCID hardware capability is only defined for the native 64-bit paging mode, and thus a 64-bit kernel is required to take advantage of PCID acceleration (32-bit applications running under a 64-bit kernel can still benefit from the optimization).

User/global acceleration
Although many modern processors can take advantage of PCID acceleration, older Intel Core family processors, and current Intel Atom family processors do not provide hardware support for PCID and thus cannot take advantage of that PCID support to accelerate KVA shadowing. These processors do allow a more limited form of TB preservation across address space switches, however, in the form of the “global” page table entry bit. The global bit allows the operating system kernel to communicate to the processor that a given leaf translation is “global” to the entire system, and need not be invalidated on address space switches. (A special facility to invalidate all translations including global translations is provided by the processor, for cases when the operating system changes global memory translations. On x64 and x86 processors, this is accomplished by toggling the CR4.PGE control register bit.)
Traditionally, the kernel would mark most kernel mode page translations as global, in order to indicate that these address translations can be preserved in the TB during cross-process address space switches while all non-global address translations are flushed from the TB. The kernel is then obligated to ensure that both incoming and outgoing address spaces provide consistent translations for any global translations in both address spaces, across a global-preserving address space switch, for correct system operation. This is a simple matter for the traditional use of kernel virtual address management, as most of the kernel address space is identical across all processes. The global bit, thus, elegantly allows most of the effective TB contents for kernel VAs to be preserved across context switches with minimal hardware and software complexity.
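The traditional global-bit behavior can be modeled with a toy TB simulation (illustrative Python, not real hardware behavior; under KVA shadow's user/global acceleration, the same mechanism is applied with the bit moved to the user half of the address space instead):

```python
class GlobalBitTlb:
    """Toy TB: a CR3 write flushes every non-global cached translation."""
    def __init__(self):
        self.entries = {}          # va -> (pa, is_global)

    def fill(self, va, pa, is_global):
        self.entries[va] = (pa, is_global)

    def write_cr3(self):
        # Address space switch: only global translations survive.
        self.entries = {va: e for va, e in self.entries.items() if e[1]}

tlb = GlobalBitTlb()
tlb.fill(0xFFFF800000001000, 0xBBBB, is_global=True)   # kernel page: global
tlb.fill(0x00007FF700000000, 0xAAAA, is_global=False)  # per-process user page
tlb.write_cr3()                                        # cross-process switch
assert 0xFFFF800000001000 in tlb.entries               # kernel TB preserved
assert 0x00007FF700000000 not in tlb.entries           # user TB flushed
```

Flipping which half of the address space carries the global bit, as user/global acceleration does, simply flips which set of translations this model preserves.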
In the context of KVA shadow, however, the global bit can be used for a completely different purpose than its original intention, in an optimization termed “user/global acceleration”. Instead of marking kernel pages as global, KVA shadow marks user pages as global, indicating to the processor that all pages in the user mode half of the address space are safe to preserve across address space switches. While an address space switch must still occur on each user/kernel transition, the global translations, and thus the user TB entries, are preserved across it. As most applications primarily spend their time executing in user mode, this mode of operation preserves the portion of the TB that is most relevant to most applications. The TB contents for kernel virtual addresses are unavoidably lost on each address space switch when user/global acceleration is in use, and as with PCID acceleration, some TB flushes must be handled differently (and cross-process context switches require an entire TB flush), but preserving the user TB contents substantially cuts the cost of KVA shadowing over the more naïve approach of marking no translations as global.

Privileged process acceleration
The purpose of KVA shadowing is to protect sensitive kernel mode memory contents from disclosure to untrusted user mode applications. This is required for security purposes in order to maintain privilege separation between kernel mode and user mode. However, highly-privileged applications that have complete control over the system are typically trusted by the operating system for a variety of tasks, up to and including loading drivers, creating kernel memory dumps, and so on. These applications effectively already have the privileges required in order to access kernel memory, and so KVA shadowing is of minimal benefit for these applications.
KVA shadow thus optimizes highly privileged applications (specifically, those that have a primary token which is a member of the BUILTIN\Administrators group, which includes LocalSystem, and processes that execute as a fully-elevated administrator account) by running these applications only with the KVA shadow “kernel” address space, which is very similar to how applications execute on processors that are not susceptible to rogue data cache load. These applications avoid most of the overhead of KVA shadowing, as no address space switch occurs on user/kernel transitions. Because these applications are fully trusted by the operating system, and already have (or could obtain) the capability to load drivers that could naturally access kernel memory, KVA shadowing is not required for fully-privileged applications.

Optimizations are ongoing
The introduction of KVA shadowing radically alters how the Windows kernel fields traps and exceptions from a processor, and significantly changes several key aspects of memory management. While several high-value optimizations have already been deployed with the initial release of operating system updates to integrate KVA shadow support, research into additional avenues of improvement and opportunities for performance tuning continues. KVA shadow represents a substantial departure from some existing operating system design paradigms, and with any such substantial shift in software design, exploring all possible optimizations and performance tuning opportunities is an ongoing effort.

Driver and application compatibility
A key consideration of KVA shadow was that existing applications and drivers must continue to work. Specifically, it would not have been acceptable to change the Windows ABI, or to invalidate how drivers work with user mode memory, in order to integrate KVA shadow support into the operating system. Applications and drivers that use supported and documented interfaces are highly compatible with KVA shadow, and no changes to how drivers access user mode memory through supported and documented means are necessary. For example, under a try/except block, it is still possible for a driver to use ProbeForRead to probe a user mode address for validity, and then to copy memory from that user mode virtual address (under try/except protection). Similarly, MDL mappings to/from user mode memory still function as before.
A small number of drivers and applications did, however, encounter compatibility issues with KVA shadow. By and large, the majority of incompatible drivers and applications used substantially unsupported and undocumented means to interface with the operating system. For example, Microsoft encountered several software applications from multiple software vendors that assumed that the raw machine instructions in certain, non-exported Windows kernel functions would remain static or unchanged with software updates. Such approaches are highly fragile and are subject to breaking at even slight perturbations of the operating system kernel code.
Operating system changes like KVA shadow, a security update that changed how the operating system manages memory and dispatches traps and exceptions, underscore the fragility of depending on unsupported and undocumented mechanisms in drivers and applications. Microsoft strongly encourages developers to use supported and documented facilities in drivers and applications. Keeping customers secure and up to date is a shared commitment, and avoiding dependencies on unsupported and undocumented facilities and behaviors is critical to meeting the expectations that customers have with respect to keeping their systems secure.

Conclusion
Mitigating hardware vulnerabilities in software is an extremely challenging proposition, whether you are an operating system vendor, a driver writer, or an application vendor. In the case of rogue data cache load and KVA shadow, the Windows kernel is able to provide a transparent and strong mitigation for drivers and applications, albeit at the cost of additional operating system complexity and, especially on older hardware, some potential performance cost depending on the characteristics of a given workload. The breadth of changes required to implement KVA shadowing was substantial, and KVA shadow support easily represents one of the most intricate, complex, and wide-ranging security updates that Microsoft has ever shipped. Microsoft is committed to protecting our customers, and we will continue to work with our industry partners to address speculative execution side channel vulnerabilities.
Ken Johnson, Microsoft Security Response Center (MSRC)
On January 3rd, 2018, Microsoft released an advisory and security updates related to a newly discovered class of hardware vulnerabilities involving speculative execution side channels (known as Spectre and Meltdown) that affect AMD, ARM, and Intel CPUs to varying degrees. If you haven’t had a chance to learn about these issues, we recommend watching The Case of Spectre and Meltdown by the team at TU Graz from BlueHat Israel, reading the blog post by Jann Horn (@tehjh) of Google Project Zero, or reading the FOSDEM 2018 presentation by Jon Masters of Red Hat.
This new hardware vulnerability class represents a major advancement in CPU side channel attacks, and we’re sincerely grateful to the researchers who discovered these issues, our industry partners, and the many individuals across Microsoft who have worked on these vulnerabilities. Mitigating hardware vulnerabilities through changes in firmware and software is a significant industry challenge, and over the past two months there has been some confusion for users as the industry has collectively continued to work through it.
In this post, we’ll provide insight into how Microsoft has approached mitigating speculative execution side channel hardware vulnerabilities to date. We’ll do this by briefly describing how we investigated this new vulnerability class, the taxonomy we established for reasoning about it, and the mitigations we have implemented as a result. This post is primarily geared toward security researchers and engineers who are interested in deeper technical details related to speculative execution side channel vulnerabilities and their respective mitigations. If you are interested in more general guidance, please refer to our knowledge base articles for Windows Server, Windows Client, and Microsoft cloud services.
Please note that the information in this post is current as of the date of this post.

Approaching the challenge
We first learned about speculative execution side channel vulnerabilities through the discoveries made by Jann Horn of Google Project Zero. Given the potential severity and the scope of these attacks, we kicked off our incident response process to drive coordination and remediation across Microsoft and with industry partners. This ultimately led to the mobilization of hundreds of individuals across the company, but before we could get there, we needed to assess the severity, impact, and root cause of the issue.
Conventional software vulnerabilities are well-understood and are relatively easy to perform root cause analysis on (we even have automation for many cases, see VulnScan). Speculative execution side channels, on the other hand, represented a fundamentally new hardware vulnerability class with no established process for determining their severity and their impact on existing software security models. To create this process, we and others in the industry needed to thoroughly research speculative execution side channels and establish a taxonomy and framework for reasoning about their effects and possible mitigations. The Microsoft Security Response Center (MSRC) brought in experts from across Microsoft (e.g. Microsoft Research Cambridge and the Windows Offensive Security Research team), and we hired Anders Fogh (@anders_fogh), of GDATA Advanced Analytics, as a consultant whose deep expertise on CPU side channel attacks greatly contributed to our understanding of these issues. This research gave us a foundation on which we and others were able to build.

A framework for speculative execution side channel vulnerabilities
Side channel attacks consist of three phases: the priming phase which is used to place a system into a desired initial state (e.g. flushing cache lines), the triggering phase which is used to perform the action that conveys information through the side channel, and the observing phase which is used to detect the presence of the information conveyed through the side channel. These phases can occur architecturally (by actually executing instructions), or speculatively (through speculative execution).
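The three phases above can be made concrete with a small sketch of a FLUSH+RELOAD-style cache side channel. This is an illustrative example, not code from any of the attacks discussed here; it assumes an x86-64 CPU with SSE2 and a timestamp counter, and all names are invented for the sketch.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_lfence (SSE2) */
#include <x86intrin.h>   /* __rdtsc */

static uint8_t probe_line[64];   /* one 64-byte cache line to probe */

/* Priming phase: flush the probe line so a later access is a cache miss. */
void prime(void) {
    _mm_clflush((void *)probe_line);
    _mm_lfence();                /* ensure the flush completes first */
}

/* Triggering phase: the "victim" action that conveys one bit of
 * information by touching (or not touching) the probe line. */
void trigger(int secret_bit) {
    if (secret_bit)
        (void)*(volatile uint8_t *)probe_line;
}

/* Observing phase: time a reload of the probe line. A fast reload means
 * the line was brought into the cache during the triggering phase. */
uint64_t observe(void) {
    _mm_lfence();
    uint64_t start = __rdtsc();
    (void)*(volatile uint8_t *)probe_line;   /* the timed reload */
    _mm_lfence();
    return __rdtsc() - start;
}
```

In practice, an attacker calibrates a cycle-count threshold for this CPU that separates cache hits from misses; after prime() and trigger(1), observe() typically reports far fewer cycles than after prime() and trigger(0). The same three phases apply whether the triggering phase executes architecturally or only speculatively.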
Side channel attacks involving speculative execution have four components: a speculation primitive, which provides the means for entering speculative execution down a non-architectural path; a windowing gadget, which provides a sufficient amount of time for speculative execution to convey information through a side channel; a disclosure gadget, which provides the means for communicating information through a side channel during speculative execution; and a disclosure primitive, which provides the means for reading the information that was communicated by the disclosure gadget. These four components can be combined in different ways to create a speculation technique.

Speculation primitives
Spectre and Meltdown make use of three different speculation primitives. The first is conditional branch misprediction (e.g. CVE-2017-5753, aka Variant 1), where speculation may occur when predicting whether a conditional branch is taken or not taken. The second is indirect branch misprediction (e.g. CVE-2017-5715, aka Variant 2), where speculation may occur when predicting the target of an indirect branch (including call, jump, and return). The third deals with exception delivery or deferral (e.g. CVE-2017-5754, aka Variant 3), where speculative execution may continue past the point at which a fault will be raised. Each of these primitives can be used to enter speculative execution along a non-architectural path, and attackers can attempt to trigger speculation in different ways (e.g. through training or colliding of predictors).

Windowing gadgets
Speculative execution is effectively a race condition between the retirement of the µop (micro-operation) that led to speculation and the µop being executed speculatively. Attackers must win this race condition for a disclosure gadget to execute speculatively. We have defined three categories of windowing gadgets. The first deals with non-cached loads that trigger a load from main memory, which typically takes hundreds of cycles on most CPUs. The second involves a dependency chain of loads, which may or may not be cached. The third involves a dependency chain of fixed-latency integer ALU operations. These windowing gadgets can occur naturally in code, or they can be manufactured by attackers in certain cases.

Disclosure gadgets
The primary purpose of speculative execution in the context of a side channel attack is to trigger the execution of a disclosure gadget. This is because the disclosure gadget is responsible for accessing information that is stored in a security domain which is not architecturally visible to the attacker and conveying that information through a side channel. This could include memory content, register state, and so on. As with gadgets used in Return Oriented Programming (ROP), there are many possible examples of disclosure gadgets and their form and function will vary.

Disclosure primitives
After speculative execution of a disclosure gadget has conveyed information via a side channel, the final step is to detect the presence of that information using a disclosure primitive. Each disclosure primitive is intrinsically linked to the communication channel through which information is conveyed (e.g., CPU data cache, etc.), and the manner through which a disclosure gadget conveyed the information. There has been a significant amount of prior research into disclosure primitives for other side channels, such as FLUSH+RELOAD and PRIME+PROBE. These primitives are generally applicable to speculative execution side channels as well.

Relevance to software security models
To illustrate this, the following table summarizes the attack scenarios that represent the software security models of concern. Each attack scenario is described in terms of the direction that information flows when performing a speculative execution side channel attack, and each of the three speculation primitives (conditional branch misprediction, indirect branch misprediction, and exception delivery or deferral) is applicable to some of these scenarios.

Attack Category   Attack Scenarios
Inter-VM          Hypervisor-to-guest; Host-to-guest; Guest-to-guest
Intra-OS          Kernel-to-user; Process-to-process; Intra-process
Enclave           Enclave-to-any

Inter-VM
This category deals with attacks on virtualization-based isolation, which relies on hardware support for virtualization extensions for security. Hypervisor-to-guest attacks involve a malicious guest attempting to read hypervisor memory. Host-to-guest attacks involve a malicious guest reading the memory of the privileged guest that provides virtualization assists (e.g. the root partition in Hyper-V or dom0 in Xen). Guest-to-guest attacks involve a malicious guest reading the memory of another guest.

Intra-OS

This category deals with attacks on the security boundaries traditionally enforced by the operating system. Kernel-to-user attacks involve user mode code reading kernel memory. Process-to-process attacks involve one process reading the memory of another process. Intra-process attacks involve code within a process, such as script executing in a sandbox, reading memory elsewhere in the same process that it should not have access to.
Enclave

The enclave category deals with enclave-to-any attacks where code executing outside of an enclave (e.g. an Intel SGX enclave) is able to read memory within the enclave.

Mitigations for speculative execution side channel vulnerabilities
The taxonomy and framework described above provide the basis for defining a strategy and a set of tactics for mitigating speculative execution side channel vulnerabilities. Through this process we identified three general tactics that can be employed by software (and hardware) to mitigate this issue with varying levels of completeness:

Prevent speculation techniques. Speculative execution attacks inherently rely on using a speculation primitive to execute a desired disclosure gadget. These attacks can be mitigated by preventing the use of a speculation primitive or a specific instance of a speculation technique. This tactic is desirable because it mitigates the issue at or near the root cause.

Remove sensitive content from memory. Speculative execution attacks rely on sensitive information being accessible in the victim’s address space. These attacks can be mitigated by removing sensitive information from memory such that it is not readable during speculative execution. This approach cannot protect against reading of register state or “non-sensitive” memory content (such as address space information).

Remove observation channels. Speculative execution attacks inherently rely on the ability to communicate information through a side channel (such as CPU data caches). These attacks can be mitigated by removing channels for communication, thereby preventing the use of certain disclosure primitives. This can provide a broad mitigation that is independent of any single speculation primitive.
In the sections that follow, we will describe the mitigations we have implemented and the impact they have on the speculation techniques that have been described thus far.

Preventing speculation techniques
One of the best ways to mitigate vulnerabilities is by eliminating a class of issues at the root cause. For speculative execution attacks, this can be accomplished by preventing speculation primitives from being used to execute a desired disclosure gadget.

Speculation barrier via execution serializing instruction
This mitigation involves inserting instructions that act as a speculation barrier by serializing execution. On AMD and Intel CPUs this involves the use of an LFENCE instruction whereas ARM recommends the use of conditional select/move (CSEL/MOVS) instructions along with a new explicit barrier instruction (CSDB). Microsoft has added support for the /Qspectre flag to Visual C++ which currently enables some narrow compile-time static analysis to identify at-risk code sequences related to CVE-2017-5753 and insert speculation barrier instructions. This flag has been used to rebuild at-risk code in Windows and was released with our January 2018 security updates. It is important to note, however, that the Visual C++ compiler cannot guarantee complete coverage for CVE-2017-5753 which means instances of this vulnerability may still exist. We recommend that software developers treat CVE-2017-5753 as a new hardware vulnerability class that can be mitigated in software by adding an explicit speculation barrier. While we are continuing to look for opportunities to improve Visual C++ support for /Qspectre, Microsoft has also announced a Speculative Execution Side Channel bounty program to encourage researchers to find and report any instances of CVE-2017-5753 that may remain.
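To make the pattern concrete, below is a hedged sketch of the well-known Variant 1 bounds-check gadget and the barrier-based fix. All names and sizes are illustrative, and the sketch assumes an x86-64 compiler that provides the SSE2 intrinsics; it is not code from any real product.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_lfence (SSE2) */

uint8_t array1[16] = {1, 2, 3, 4};
size_t  array1_size = 16;
uint8_t array2[256 * 64];   /* one cache line per possible byte value */

/* Vulnerable form: if the branch is mispredicted for an out-of-bounds x,
 * the dependent loads may execute speculatively and leave a cache
 * footprint in array2 that reveals the value of array1[x]. */
uint8_t victim(size_t x) {
    if (x < array1_size)
        return array2[array1[x] * 64];
    return 0;
}

/* Mitigated form: the LFENCE acts as a speculation barrier, so the
 * dependent loads do not execute until the bounds check has resolved.
 * (On ARM, the analogous pattern uses conditional select plus CSDB
 * rather than a full serializing barrier.) */
uint8_t victim_fixed(size_t x) {
    if (x < array1_size) {
        _mm_lfence();
        return array2[array1[x] * 64];
    }
    return 0;
}
```

This is the kind of transformation that /Qspectre applies automatically for the narrow set of code sequences its static analysis recognizes; instances it does not recognize must be fixed by hand using the same pattern.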
Security domain CPU core isolation

Modern CPUs typically store prediction state in per-core or per-SMT caches. This means that isolating workloads to distinct cores can robustly mitigate speculation techniques that rely on colliding prediction state. To that end, Microsoft has published documentation describing how minimum root (“minroot”) can be used to dedicate core(s) to the Hyper-V root partition and how CPU groups can be used to dedicate core(s) to guests.

Indirect branch speculation barrier on demand and mode change
Speculation techniques that rely on one security domain (user, kernel, hypervisor) being able to influence indirect branch predictions (e.g. CVE-2017-5715) in another security domain can be mitigated through the use of additional hardware interfaces provided by Intel, AMD, and ARM. In the case of Intel and AMD, these hardware interfaces require microcode updates to expose the features of which operating systems can take advantage. These features provide software with the ability to flush indirect branch prediction state, inhibit the use of branch predictions from less-privileged security domains, and protect against sibling hardware thread prediction interference. The microcode updates from Intel and AMD are currently at varying levels of readiness and availability, but they are generally expected to enable strong mitigations for speculation techniques that use indirect branch mispredictions across security domains. On March 1st, 2018, Microsoft announced the availability of Intel microcode updates through the Microsoft Update Catalog (KB4090007).

Non-speculated or safely-speculated indirect branches
Some types of indirect branches on existing Intel and AMD hardware are either not predicted at all or have generally safe prediction behavior and can therefore be used to mitigate CVE-2017-5715. For example, far JMP on existing Intel hardware is not predicted and can therefore be used as an alternative to near indirect CALL and JMP. This mitigation takes advantage of the far JMP behavior by transforming all at-risk indirect CALL and JMP instructions into far JMP for the Hyper-V hypervisor when running on Intel hardware. For the Hyper-V hypervisor running on AMD hardware, at-risk near indirect CALL and JMP are preceded by an execution serializing RDTSCP instruction as per AMD’s recommended guidance.
In a similar vein, Google has also proposed a software solution known as retpoline which transforms near indirect calls and jumps into “retpolines” that rely on safe and deterministic near return mispredictions when transferring control. This solution provides a strong mitigation for speculation techniques that use indirect branch misprediction on many existing CPUs. We believe retpoline can be a suitable mitigation for environments where it is practical to rebuild all at-risk code and we have pursued a similar path through the changes we have made to Hyper-V. We are also evaluating performance optimizations involving a hybrid mitigation for the Windows kernel and device drivers that would involve retpoline and hardware support for indirect branch speculation barriers.
There are some additional points that are important to consider regarding the use of retpoline. As noted by Intel, some modern CPUs only satisfy the near return prediction properties that retpoline relies upon if software prevents underflows of the RSB (Return Stack Buffer). These CPUs require software to perform RSB stuffing to prevent underflow, which we believe to be nontrivial for software to perform in the general case. In addition, scenarios that would require all at-risk software to be rebuilt face significant challenges in the context of expansive, multi-vendor, binary ecosystems such as multi-tenant cloud environments or traditional desktops and servers that make use of applications and other software produced by multiple vendors. For these environments, we believe it is unreasonable to expect that all at-risk software will be rebuilt. Looking toward the future, hardware security improvements such as Intel’s Control-flow Enforcement Technology (CET) will encounter compatibility issues with the use of retpoline, which software developers need to consider.

Removing sensitive content from memory
All attacks involving speculative execution side channels attempt to gain access to information across security domains. This means that attackers will attempt to disclose data from a victim security domain that they should not have access to. As such, removing sensitive information from the victim security domain can be an effective method of preventing information disclosure using speculation techniques.

Hypervisor address space segregation
The Microsoft Hyper-V hypervisor has historically maintained an identity map of all physical memory to accelerate memory accesses while executing in the hypervisor (known as the “physical map”). To minimize exposure to speculation techniques, we have removed the physical map entirely and no longer map all physical memory into the address space of the hypervisor, thus helping to mitigate the risk of cross-VM information disclosure through speculative execution.

Split user and kernel page tables
This mitigation is known as Kernel Virtual Address (KVA) Shadow on Windows, and it mitigates the speculation technique known as Meltdown (CVE-2017-5754) by creating two page directory bases per process. The first maps the “user” page tables, which contain only user mode mappings and a small number of kernel transition pages. The second maps the “kernel” page tables, which contain both user and kernel mappings for the process. The user page tables are active while executing code in user mode, and the kernel page tables are switched to when trapping into kernel mode and executing code on behalf of a process. This has the effect of removing sensitive kernel memory content from the virtual address space for a process, which thereby provides a robust mitigation for CVE-2017-5754. This mitigation draws inspiration from prior research known as KAISER, although KVA Shadow is not intended to enable robust local Kernel ASLR (KASLR). KVA Shadow is also similar to mitigations available for the Linux kernel (Kernel Page Table Isolation, KPTI), Apple MacOS and iOS, and Google Chromebook.

Removing observation channels
All speculation techniques implicitly rely on communicating information through a side channel in a manner that can be detected by a disclosure primitive. This means speculation techniques can be broadly mitigated by removing channels for communicating and observing information.

Map guest memory as noncacheable in the root extended page tables
Virtualization software often maps guest memory into the address space of a privileged guest (the “root” for Hyper-V, or dom0 for Xen) to enable fast communication through shared memory. In the context of speculation techniques, these shared memory regions can be used to communicate information through the loading of a shared cache line (e.g. FLUSH+RELOAD). One way to mitigate this is by mapping guest memory regions as noncacheable (UC) in the extended page tables for the root. On all CPUs, noncacheable memory cannot be loaded during speculative execution and therefore prevents loading of a shared cache line.

Do not share physical pages across guests
In some cases, virtualization software may share physical pages between guests. These shared memory regions can be used to communicate information through the loading of a shared cache line (similar to mapping guest memory as noncacheable). In this case, these shared regions can facilitate guest-to-guest attacks involving speculative execution. This mitigation implements the straightforward solution, which is to stop sharing physical pages between guests that are not part of the same security domain.

Decrease browser timer precision

Observing a cache side channel from a web browser requires a high-resolution timer that can distinguish cached from uncached loads. To help mitigate this, Microsoft Edge and Internet Explorer have decreased the precision of timers such as performance.now(), which makes it more difficult for disclosure primitives to accurately measure cache load times from script.
The complex nature of these issues makes it difficult to understand the relationship between mitigations, speculation techniques, and the attack scenarios to which they apply. The tables that follow summarize these relationships; each mitigation is either applicable or not applicable to a given attack scenario or variant.
Mitigation relationship to attack scenarios
The following table summarizes the relationship between attack scenarios (Inter-VM, Intra-OS, and Enclave) and applicable mitigations, grouped by mitigation tactic.

Mitigation Tactic                      Mitigation Name
Prevent speculation techniques         Speculation barrier via execution serializing instruction
                                       Security domain CPU core isolation
                                       Indirect branch speculation barrier on demand and mode change
                                       Non-speculated or safely-speculated indirect branches
Remove sensitive content from memory   Hypervisor address space segregation
                                       Split user and kernel page tables (“KVA Shadow”)
Remove observation channels            Map guest memory as noncacheable in root extended page tables
                                       Do not share physical pages across guests
                                       Decrease browser timer precision

Mitigation relationship to variants
The following table summarizes the relationship between the Spectre and Meltdown variants (CVE-2017-5753, variant 1; CVE-2017-5715, variant 2; CVE-2017-5754, variant 3) and applicable mitigations, again grouped by mitigation tactic.

Mitigation Tactic                      Mitigation Name
Prevent speculation techniques         Speculation barrier via execution serializing instruction
                                       Security domain CPU core isolation
                                       Indirect branch speculation barrier on demand and mode change
                                       Non-speculated or safely-speculated indirect branches
Remove sensitive content from memory   Hypervisor address space segregation
                                       Split user and kernel page tables (“KVA Shadow”)
Remove observation channels            Map guest memory as noncacheable in root extended page tables
                                       Do not share physical pages across guests
                                       Decrease browser timer precision

Wrapping up
In this post, we described our approach toward mitigating a new hardware vulnerability class involving speculative execution side channels. We believe the mitigations described in this post offer strong protections for these vulnerabilities. Going forward, we recommend that the software industry view these issues as a new vulnerability class that may require software changes in order to help mitigate (much as with buffer overruns, type confusions, use-after-frees, and so on). We expect this new hardware vulnerability class to be the subject of further research, as we’ve witnessed in the past with other vulnerability classes. Microsoft is committed to protecting our customers, and we will continue to work with our industry partners on mitigating speculative execution side channel vulnerabilities. To reinforce this commitment, we have launched the Speculative Execution Side Channel bounty program to encourage discovery and reporting of these vulnerabilities such that they can be fixed.
Matt Miller, Microsoft Security Response Center (MSRC)