Writing a virtual machine in Rust
Overview
To help me learn Rust, I decided to implement an interpreter for Lox, the toy scripting language in Robert Nystrom’s book Crafting Interpreters. I’ve been interested in scripting languages that are designed to be embedded in applications, and this seemed like a good way to learn more about how such languages are made.
Nystrom’s book walks you through two different implementations of the interpreter: a tree-walk interpreter in Java, and a bytecode virtual machine in C. I was more interested in the bytecode VM. Over the years, I’d written a couple of tree-walk interpreters, starting with Apache FreeMarker, as well as an optimising source-to-source compiler with type inference. But I hadn’t implemented a bytecode VM yet. The idea of it brought back good memories of doing assembly-language programming on the Apple II when I was fifteen years old. The instruction set for Nystrom’s virtual machine is simpler than the instruction set of the Apple II’s MOS Technology 6502 CPU, but the basic idea is the same. So I skipped the part of his book about the tree-walk interpreter and went directly to the part about the bytecode VM.
The result is OrganicLox. The design is broadly similar to the one in the book, but I’ve tried to write idiomatic Rust, with reasonable performance and no unsafe code. Functionally, OrganicLox should be identical to Nystrom’s clox: at least, it passes his 246 automated tests.
I decided against writing my own garbage collector, because everyone seems to agree that garbage collectors are particularly difficult to implement in Rust. To keep things simple, I used reference counting instead (via Rc). On the main branch, this means that cycles of object properties won’t be collected. On the with_dumpster_gc branch, I’ve replaced Rc with the dumpster crate, which uses reference counting but also deals with cycles.
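As a standalone illustration of that limitation (the Object type below is hypothetical, not OrganicLox’s actual representation), here’s the classic way a cycle of Rc references leaks:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A hypothetical object whose properties can point at other objects.
struct Object {
    name: &'static str,
    property: RefCell<Option<Rc<Object>>>,
}

impl Drop for Object {
    fn drop(&mut self) {
        println!("dropping {}", self.name);
    }
}

fn main() {
    let a = Rc::new(Object { name: "a", property: RefCell::new(None) });
    let b = Rc::new(Object { name: "b", property: RefCell::new(None) });

    // a.property -> b and b.property -> a form a cycle.
    *a.property.borrow_mut() = Some(Rc::clone(&b));
    *b.property.borrow_mut() = Some(Rc::clone(&a));

    // When `a` and `b` go out of scope here, each object is still
    // referenced by the other, so neither Drop ever runs: the memory leaks.
}
```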
My favourite parts
I liked this book very much: Nystrom’s clear explanations, useful illustrations, easygoing writing style, and humour made the book a pleasure to read.
For me, some of the most interesting parts of his design, which draws heavily on the design of the Lua VM, involved the implementation of local variables and closures in a stack-based VM. I enjoyed learning that the compiler associates local variables with specific positions on the stack, so the VM only needs to know about stack positions, not about the variable names they’re associated with.
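As a rough sketch of what that looks like (illustrative types, not OrganicLox’s actual code), here’s a VM that resolves locals purely by slot arithmetic:

```rust
// Illustrative opcodes: the operand is a stack slot index, not a name.
enum OpCode {
    GetLocal(usize), // push a copy of the value in the given slot
    SetLocal(usize), // overwrite the given slot with the value on top
}

#[derive(Clone, Debug)]
struct Value(f64);

struct Vm {
    stack: Vec<Value>,
    frame_base: usize, // where the current call frame's slots begin
}

impl Vm {
    fn execute(&mut self, op: &OpCode) {
        match op {
            OpCode::GetLocal(slot) => {
                // No name lookup at run time: just index arithmetic.
                let value = self.stack[self.frame_base + *slot].clone();
                self.stack.push(value);
            }
            OpCode::SetLocal(slot) => {
                let value = self.stack.last().expect("stack underflow").clone();
                self.stack[self.frame_base + *slot] = value;
            }
        }
    }
}

fn main() {
    let mut vm = Vm { stack: vec![Value(1.0), Value(2.0)], frame_base: 0 };
    vm.execute(&OpCode::GetLocal(1)); // pushes a copy of slot 1
    println!("{:?}", vm.stack); // [Value(1.0), Value(2.0), Value(2.0)]
}
```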
The implementation of upvalues, the variables that are captured by closures (and may need to be heap-allocated as a result), was tricky to get right. While fixing a bug, I was delighted to realise that a recursive call to a function defined in a local scope requires the function’s closure to be one of its own upvalues.
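The usual mechanism, sketched below with illustrative types rather than OrganicLox’s actual ones, is to keep an upvalue “open” while the captured variable is still live on the stack, and to “close” it, moving the value to the heap, when the variable’s slot is popped:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// An upvalue starts out pointing into the stack and is later "closed"
// by moving the captured value to the heap.
#[derive(Debug)]
enum Upvalue {
    Open(usize),  // index of the captured slot on the VM stack
    Closed(f64),  // the value itself, once the slot has been popped
}

fn main() {
    // Shared ownership lets the closure object and the VM point at the
    // same upvalue, so closing it is visible to both.
    let upvalue = Rc::new(RefCell::new(Upvalue::Open(3)));
    let captured_by_closure = Rc::clone(&upvalue);

    // When stack slot 3 is about to go out of scope, the VM closes the
    // upvalue by hoisting the value off the stack:
    *upvalue.borrow_mut() = Upvalue::Closed(42.0);

    println!("{:?}", captured_by_closure.borrow()); // Closed(42.0)
}
```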
What I wished for
Nystrom alternates between top-down and bottom-up explanations. The bottom-up explanations made it more difficult to translate his design into Rust, since I had to guess where he was going. Several times I had to refactor a lot of code because I had guessed incorrectly. Since implementing Lox in all sorts of programming languages seems to be a popular exercise, it would have been helpful to have more high-level notes about the design.
It was fun to learn about Vaughan Pratt’s top-down operator precedence parsing technique and to see that it’s actually possible to generate bytecode without first building an AST. But perhaps it would have been more useful to produce an AST and learn something about writing a second pass to optimise the bytecode.
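To give a flavour of the technique, here’s a minimal sketch, with made-up token and opcode types rather than anything from the book or OrganicLox, of a Pratt-style parser that emits stack-machine bytecode directly, with no AST in between:

```rust
#[derive(Clone, Copy)]
enum Token { Number(f64), Plus, Star, Eof }

#[derive(Debug)]
enum Op { Constant(f64), Add, Multiply }

struct Parser { tokens: Vec<Token>, pos: usize, code: Vec<Op> }

impl Parser {
    fn peek(&self) -> Token { self.tokens[self.pos] }
    fn advance(&mut self) -> Token { let t = self.peek(); self.pos += 1; t }

    // Each infix operator has a binding power (precedence level).
    fn infix_power(tok: Token) -> Option<u8> {
        match tok {
            Token::Plus => Some(1),
            Token::Star => Some(2),
            _ => None,
        }
    }

    // Parse any expression whose operators bind at least `min_power`,
    // emitting code for operands before their operators.
    fn expression(&mut self, min_power: u8) {
        match self.advance() {
            Token::Number(n) => self.code.push(Op::Constant(n)),
            _ => panic!("expected a number"),
        }
        while let Some(power) = Self::infix_power(self.peek()) {
            if power < min_power { break; }
            let op = self.advance();
            // Require strictly higher power on the right: left-associative.
            self.expression(power + 1);
            self.code.push(match op {
                Token::Plus => Op::Add,
                Token::Star => Op::Multiply,
                _ => unreachable!(),
            });
        }
    }
}

fn main() {
    // 1 + 2 * 3 compiles to: Constant(1) Constant(2) Constant(3) Multiply Add
    let mut p = Parser {
        tokens: vec![Token::Number(1.0), Token::Plus, Token::Number(2.0),
                     Token::Star, Token::Number(3.0), Token::Eof],
        pos: 0,
        code: Vec::new(),
    };
    p.expression(0);
    println!("{:?}", p.code);
}
```

Because each operand is emitted before its operator, the bytecode comes out in exactly the postfix order that a stack-based VM executes.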
Performance
The main optimisation I did at the outset was to have the scanner intern all identifiers (using string_interner), so neither the compiler nor the VM needs to do any string comparisons on identifiers.
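For the curious, interning looks roughly like this, assuming string_interner’s get_or_intern/resolve API and its default symbol type:

```rust
use string_interner::{DefaultSymbol, StringInterner};

fn main() {
    let mut interner = StringInterner::default();

    // The same identifier always maps to the same symbol, so the
    // compiler and the VM can compare small Copy symbols instead of strings.
    let a: DefaultSymbol = interner.get_or_intern("counter");
    let b = interner.get_or_intern("counter");
    assert_eq!(a, b);

    // The original text can still be recovered when needed, e.g. for
    // error messages.
    assert_eq!(interner.resolve(a), Some("counter"));
}
```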
After finishing Chapter 24, I spent a bit of time looking at performance, following the advice in The Rust Performance Book. I ran benchmarks with criterion, did some profiling with perf and hotspot, adjusted the build configuration, and inlined some functions.
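A criterion benchmark for this kind of measurement looks roughly like the sketch below; run_script is a placeholder for the interpreter’s real entry point, not OrganicLox’s actual API:

```rust
// benches/interpreter.rs (requires `harness = false` for this bench
// target in Cargo.toml).
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Placeholder: stands in for compiling and running Lox source.
fn run_script(source: &str) -> usize {
    source.len()
}

fn bench_fib(c: &mut Criterion) {
    let source =
        "fun fib(n) { if (n < 2) return n; return fib(n - 1) + fib(n - 2); } print fib(20);";
    c.bench_function("fib", |b| b.iter(|| run_script(black_box(source))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```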
I then looked at some other Rust implementations of Lox, and noticed that everyone seems to have found it difficult to approach the performance of Nystrom’s clox, and that the implementations that succeeded in doing so contained a lot of unsafe code, which I wanted to avoid. I highly recommend Manuel Cerón’s blog post about his impressive, determined, and partly successful attempt to match the speed of clox in Rust by writing more and more unsafe code.
For comparison, here are the results of some of Nystrom’s benchmarks run on these implementations:
- Nystrom’s clox
- The safe version of Cerón’s loxido
- The unsafe version of loxido
- OrganicLox with Rc
- OrganicLox with the dumpster garbage collector
OrganicLox with Rc is two to four times slower than clox, which is perhaps in the typical range for interpreters written in Rust compared to interpreters written in C. The version with dumpster is much slower. It seems clear that a fast garbage collector is essential for a high-performance VM.
How much do these differences matter? Benchmarks are designed to simulate extreme scenarios, e.g. by using long loops or deep recursion, or by creating a lot of work for the garbage collector, so they’re not necessarily representative of real use cases. The scripts run by applications with an embedded interpreter are often short and do very little work. We might get a more realistic idea of how a VM performs in such a scenario by running Nystrom’s integration test suite, which runs 246 short Lox scripts and checks their output. Here’s a comparison:
In an application where every CPU cycle counts, the differences in performance between these different implementations could be very significant. But in other use cases, they might not be.
Alternatives to this approach
It’s tempting to try adding a just-in-time (JIT) compiler to improve the performance of the interpreter, either by writing one from scratch or by using an existing backend like LLVM or Cranelift. Examples in Rust include Roto and Dora. But even if you do that, it’s not easy to achieve high performance. In these benchmarks, Python’s new JIT has no measurable effect on performance (see also these Dora benchmark results).
That’s one reason to consider using an existing interpreter that already has years of engineering invested in it, instead of creating a new one. Lua is one popular option, but it isn’t fast enough for some applications. The widely used LuaJIT VM is an order of magnitude faster than the standard Lua VM for some use cases, but it’s a tracing JIT, which has advantages and disadvantages. The Lua developers have an interesting paper in which they discuss the limitations of this approach, and propose a different solution (an additional statically typed language, a bit like Cython).
For some use cases, if embedded scripts don’t do much work, their performance might not be critical. If they do so much work that their performance becomes a problem, another option is not to embed an interpreter at all. Cloudflare, which had invested heavily for over a decade in scripts using LuaJIT embedded in OpenResty, recently replaced them with Rust code.