Wed 06 June 2018

Hypervisor Not Required

by Paul Sherwood, 2018. Tags: hypervisor, linux, qnx, safety-critical, real-time, foss, automotive

Many embedded and automotive vendors recommend a hypervisor-based architecture as the best way to achieve electronic control unit (ECU) consolidation. The idea is to isolate functions into guest operating system virtual machines and restrict access to sensitive resources. Examples of the consolidated architecture look something like this:

[Figure: Hypervisor Guest Separation]

We suggest that this approach is fundamentally incorrect, for the following reasons:

  • Any operating system that we trust to run safety-critical processes on a multicore processor must be able to guarantee that it schedules all resources so that safety-critical processes get what they need, and are properly isolated from other processes. Without such guarantees, the operating system would not be fit for safety-critical use in any case.
  • If we have an operating system which provides trustable guarantees, we can rely on it to isolate and schedule multiple safety-critical processes on a single multicore processor. We don’t need additional separation. We either trust the OS or we don’t.
  • Given that we are bound to have multiple boxes, we do not actually need an architecture that supports multiple OSes on the same physical machine. We could dedicate some boxes to Linux, and others to QNX, for example. Or port the missing functionality to the other OS (we’ll be porting anyway, since a lot of the functions are currently bare-metal).
  • Multithreading operating systems are already designed to handle multiple applications. Moreover:
  • Each guest requires its own copy of the operating system and system libraries. The guests are consuming memory multiple times for the same things. We do not have lots of spare memory to waste on copies of system software. This would only be justifiable if the improvements in safety/reliability of the system can be shown, with evidence, to be worth the cost.
  • Adding a hypervisor increases the amount of software in the system. More software means more bugs, more security vulnerabilities, and an increased attack surface. As one executive commented: “How do they think they can close one door, by opening two?”
  • Once we put a hypervisor underneath an OS, all guarantees that we had for the OS itself no longer apply. All bets are off. A critical bug or vulnerability in the hypervisor can definitely take down the system (the same is true for any operating system, of course, and for the underlying hardware).
  • More software means more things to update, which is more complicated, more error-prone and higher risk than updating a single software stack.
  • The only fundamental justification for using a hypervisor is to support multiple different operating systems on a single CPU. Since we expect that each system/vehicle requires a set of domain controllers, it seems that we can avoid that situation altogether just by dedicating some controllers to run QNX, some to run Linux and so on.
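
The memory argument in the list above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and every figure in it are hypothetical placeholders rather than measurements, and it assumes that guests do not share memory pages for identical system software.

```python
def stacked_footprint_mib(guests, os_mib, libs_mib, hypervisor_mib):
    """System-software memory when every guest carries its own OS and libraries."""
    return hypervisor_mib + guests * (os_mib + libs_mib)


def shared_footprint_mib(os_mib, libs_mib):
    """System-software memory when all applications share one OS and one set of libraries."""
    return os_mib + libs_mib


if __name__ == "__main__":
    # Hypothetical placeholder figures, chosen only to show the shape of the sum.
    guests, os_mib, libs_mib, hypervisor_mib = 3, 64, 128, 16

    stacked = stacked_footprint_mib(guests, os_mib, libs_mib, hypervisor_mib)
    shared = shared_footprint_mib(os_mib, libs_mib)

    print(f"hypervisor with {guests} guests: {stacked} MiB of system software")
    print(f"single operating system:        {shared} MiB of system software")
    print(f"memory spent on duplicates:     {stacked - shared} MiB")
```

Whatever the real figures are for a given platform, the duplicated portion grows linearly with the number of guests, which is the cost the safety/reliability evidence would have to justify.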

Consolidation Without Hypervisors

If we choose an operating system that handles scheduling and separation of multiple applications across the available resources (memory and CPU cores), the overall consolidated approach is just:

[Figure: Operating System Separation]

Thus, if we want to combine Infotainment, Cluster and HUD, the desired approach would be to combine all of these functions on a single unit running Linux, QNX or similar. Separation can be achieved in various ways (for example namespaces or Discretionary Access Control) depending on the OS, and the architect clearly needs to establish that the OS is fit for this purpose. For example, in discussion around this document it was suggested that in critical scenarios the OS must perform as a separation kernel.
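
As a minimal sketch of what OS-level separation can look like in practice, the Python below uses two standard POSIX/Linux mechanisms: Discretionary Access Control through file permission bits, and CPU affinity to confine a process to chosen cores. The helper names and the file name are hypothetical, the affinity call is only available on Linux, and namespaces are not shown. It illustrates the mechanisms and is in no way evidence that a given OS is fit for safety-critical use.

```python
import os
import stat
import tempfile


def make_private_file(directory, name, data):
    """Create a file that only its owner may read or write (mode 0600)."""
    path = os.path.join(directory, name)
    # O_EXCL makes creation fail if the path already exists, so we never
    # reuse a file that was created earlier with looser permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    # The process umask can clear bits from the requested mode, so set the
    # final mode explicitly.
    os.chmod(path, 0o600)
    return path


def permission_bits(path):
    """Return only the permission bits of a file, for example 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)


def pin_to_cores(cores):
    """Restrict the calling process to the given CPU cores (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not exposed on this platform")
    # Only request cores that this process is currently allowed to use.
    usable = set(cores) & os.sched_getaffinity(0)
    if not usable:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(0, usable)  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # "cluster.cfg" is a made-up name standing in for data owned by one function.
        path = make_private_file(workdir, "cluster.cfg", b"speed_unit=kmh\n")
        print("permission bits:", oct(permission_bits(path)))
```

In a consolidated unit, each function would typically run under its own user account, so that permission bits like these keep one function's data away from the others, while affinity (or a stronger scheduler-level mechanism) reserves cores for the critical work.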

Discussions

There was a lot of useful discussion around the contents of this document which we distil down to two key points:

1) Improved security

"While adding the complexity of a hypervisor (or separation kernel) increases potential vulnerability attack surface, separation makes exploitation more difficult."

It’s not clear that there is any data or research on in-the-wild exploits to indicate whether the claimed increase in exploitation difficulty actually pays off. And as Geer’s Law states:

"Any security technology whose effectiveness can't be empirically determined is indistinguishable from blind luck."

Also, note that this approach does not mitigate the various hardware-level vulnerabilities exposed in modern microprocessors (Rowhammer, Spectre, Meltdown, etc.).

2) Reduced costs and risks

The main justification for consolidation is to reduce engineering cost and risk, because:

  • fewer boxes, less material, less weight, less physical space, less wiring
  • we can reuse the same code
  • no need to revalidate the whole ECU when we make a change in one of the 'guests' (depending on implementation a 'guest' may be a stack with an OS, or a single executable)

However, there are some clear additional costs and risks:

  • direct costs of the 'hypervisor' or 'separation kernel' itself (including costs for licensing, support, porting, integration and validation)
  • reduction in performance due to the hypervisor or separation kernel itself
  • reduction in available memory (which can also affect performance) due to guest OS footprints
  • increased complexity and risk for security updates (now we may need to update a hypervisor and/or multiple guest OSes)
  • another link in the 'chain of trust', resulting in an increase in the attack surface
  • another vendor in the supply chain
  • depending on the choice of hypervisor/separation-kernel, another binary blob where we have to trust the vendor for long-term support
  • risk of missed issues due to the assumption of 'we don't need to re-validate'
  • uncertainty leading to wrong implementation (will decision-makers safely distinguish between a dedicated minimal 'hypervisor' of 900 lines as described, a 'separation kernel', and some shiny 'product' based on KVM or similar?)
  • risk that the 're-use' turns out to require 're-work' costs because the vendor contribution doesn't play nicely as a guest after all
  • blame gaming between vendors when there are issues with shared functions such as power management, diagnostics, etc.
  • risk that people assume 'hypervisor' solves all the problems, until results show that it doesn't
  • costs associated with recalls and/or accidents arising from failure to address the above