<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Eli Bendersky's website - Assembly</title><link href="https://eli.thegreenplace.net/" rel="alternate"></link><link href="https://eli.thegreenplace.net/feeds/assembly.atom.xml" rel="self"></link><id>https://eli.thegreenplace.net/</id><updated>2024-09-14T13:15:30-07:00</updated><entry><title>WebAssembly Text code samples</title><link href="https://eli.thegreenplace.net/2023/webassembly-text-code-samples/" rel="alternate"></link><published>2023-04-22T07:46:00-07:00</published><updated>2023-04-22T14:49:03-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2023-04-22:/2023/webassembly-text-code-samples/</id><summary type="html">&lt;p&gt;This post talks about writing WebAssembly by hand (using its textual format),
and mentions a new GitHub &lt;a class="reference external" href="https://github.com/eliben/wasm-wat-samples/"&gt;repository&lt;/a&gt; I've created with code samples.&lt;/p&gt;
&lt;p&gt;A bit of nomenclature first. &lt;strong&gt;WASM&lt;/strong&gt; stands for WebAssembly - it has a &lt;a class="reference external" href="https://webassembly.github.io/spec/core/binary/index.html"&gt;binary
format&lt;/a&gt; and a
&lt;a class="reference external" href="https://webassembly.github.io/spec/core/text/index.html"&gt;textual format&lt;/a&gt;.
The textual format, called WebAssembly Text or &lt;strong&gt;WAT&lt;/strong&gt;, is …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This post talks about writing WebAssembly by hand (using its textual format),
and mentions a new GitHub &lt;a class="reference external" href="https://github.com/eliben/wasm-wat-samples/"&gt;repository&lt;/a&gt; I've created with code samples.&lt;/p&gt;
&lt;p&gt;A bit of nomenclature first. &lt;strong&gt;WASM&lt;/strong&gt; stands for WebAssembly - it has a &lt;a class="reference external" href="https://webassembly.github.io/spec/core/binary/index.html"&gt;binary
format&lt;/a&gt; and a
&lt;a class="reference external" href="https://webassembly.github.io/spec/core/text/index.html"&gt;textual format&lt;/a&gt;.
The textual format, called WebAssembly Text or &lt;strong&gt;WAT&lt;/strong&gt;, is the subject of this
post.&lt;/p&gt;
&lt;div class="section" id="introduction-to-wat"&gt;
&lt;h2&gt;Introduction to WAT&lt;/h2&gt;
&lt;p&gt;WASM is a stack machine, and while stack machines can lead to wonderfully
compact bytecode, they can also be awkward to code by hand - because the
programmer needs to have a mental model of the top stack slots at all times,
remembering what they refer to. While you can certainly code directly to the
stack machine with WAT, it also has some programmer-friendly constructs
that significantly improve writability and readability. Here's an example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;(local.set $writeidx (i32.sub (local.get $writeidx) (i32.const 1)))
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is equivalent to &lt;tt class="docutils literal"&gt;writeidx &lt;span class="pre"&gt;-=&lt;/span&gt; 1&lt;/tt&gt; in many mainstream languages. The two
WAT features at play here are:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The ability to declare variables and to refer to them
by name (this includes function parameters).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Folded instructions&lt;/em&gt; - allowing the programmer to condense a sequence of
stack operations into a single &lt;a class="reference external" href="https://en.wikipedia.org/wiki/S-expression"&gt;s-expr&lt;/a&gt;. This is music to
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/lisp"&gt;my Lisper ears&lt;/a&gt;!&lt;/li&gt;
&lt;/ol&gt;
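&lt;p&gt;For comparison, here's a sketch of the same statement written without folding,
as a flat sequence of stack instructions; each operand is pushed before the
instruction that consumes it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;local.get $writeidx   ;; push the current value of $writeidx
i32.const 1           ;; push the constant 1
i32.sub               ;; pop both operands, push $writeidx - 1
local.set $writeidx   ;; pop the result into $writeidx
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both forms should assemble to the same instruction sequence; folding only
changes how the source reads.&lt;/p&gt;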
&lt;p&gt;These folded instructions can go as deep as we wish; here's an even more nested
example involving memory access:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;(local.set
    $next_env_ptr
    (i32.load (i32.add  (global.get $env_ptrs)
                        (i32.mul (local.get $i) (i32.const 4)))))
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In pseudo-C, this is equivalent to &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;next_env_ptr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;env_ptrs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;WAT has some additional ergonomic features that I like. For example, named
functions with named parameters, as well as declared return values:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;(func $itoa (export &amp;quot;itoa&amp;quot;) (param $num i32) (result i32 i32)
  ...
)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This function has a name we can refer to in calls, a single parameter with
a name (&lt;tt class="docutils literal"&gt;$num&lt;/tt&gt;) and two return values. Calling this function can be done
in a folded expression like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;(call $itoa (i32.add (local.get $n) (i32.const 1)))
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which is equivalent to &lt;tt class="docutils literal"&gt;itoa(n+1)&lt;/tt&gt;. Another feature this example demonstrates
is &lt;em&gt;types&lt;/em&gt; - WAT functions and values (parameters, globals and locals) have
types, which makes code easier to read and understand, and also provides the
compiler an opportunity to check for correctness at compile time.&lt;/p&gt;
&lt;p&gt;Moreover, in the WASM model, type checking goes deeper and extends to stack
interactions; the WASM compiler knows how many stack slots each instruction
consumes and produces, and this is verified as well - so common mistakes are
easily caught. I find that once I get WAT code to compile, it's correct much
more often than code in other assembly languages.&lt;/p&gt;
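&lt;p&gt;To illustrate the stack checks, here's a small hypothetical function (the name
&lt;tt class="docutils literal"&gt;$bad&lt;/tt&gt; is made up for this sketch) that should be rejected at validation
time: it declares a single &lt;tt class="docutils literal"&gt;i32&lt;/tt&gt; result but leaves two values on the stack.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;(func $bad (result i32)
  (i32.const 1)
  (i32.const 2)   ;; a second value is left on the stack: type mismatch
)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Wrapping the two constants in &lt;tt class="docutils literal"&gt;i32.add&lt;/tt&gt;, so that exactly one value remains,
makes the function valid.&lt;/p&gt;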
&lt;/div&gt;
&lt;div class="section" id="samples-of-wat-code"&gt;
&lt;h2&gt;Samples of WAT code&lt;/h2&gt;
&lt;p&gt;Back to the original goal of this post. While I enjoy writing WAT code, one
aspect of the experience that could be improved is documentation. The
&lt;a class="reference external" href="https://webassembly.github.io/spec/core/index.html"&gt;WASM spec&lt;/a&gt; is
much more suitable for formal verification than for actual documentation
purposes; specifically, it's hard to grep and doesn't provide much in terms
of examples. This is alright for a spec, but I couldn't find complementary
resources that just show code samples.&lt;/p&gt;
&lt;p&gt;Therefore, I've decided to collect some of the WAT snippets I've written so far
into a GitHub repository named &lt;a class="reference external" href="https://github.com/eliben/wasm-wat-samples/"&gt;wasm-wat-samples&lt;/a&gt;. It's my humble contribution to
the world of WAT documentation. The goal of the repository is to demonstrate how
WAT concepts (including WASI) and constructs are used in practice; it's
optimized for &lt;em&gt;greppability&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I hope others find it useful as well - feel free to suggest additional samples
in issues and PRs!&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;memory&lt;/tt&gt; here stands for the module's linear memory, which is byte-addressed; &lt;tt class="docutils literal"&gt;i32.load&lt;/tt&gt; reads a 4-byte value starting at the given address.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="WebAssembly"></category><category term="Assembly"></category></entry><entry><title>itoa (integer to string) in WebAssembly</title><link href="https://eli.thegreenplace.net/2023/itoa-integer-to-string-in-webassembly/" rel="alternate"></link><published>2023-04-17T20:09:00-07:00</published><updated>2024-05-04T19:46:23-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2023-04-17:/2023/itoa-integer-to-string-in-webassembly/</id><summary type="html">&lt;p&gt;&lt;strong&gt;Update (2023-04-22)&lt;/strong&gt;: here's a repository with many WAT code samples,
including this one: &lt;a class="reference external" href="https://github.com/eliben/wasm-wat-samples"&gt;wasm-wat-samples&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is a brief blog post that mostly consists of a single, well-documented code
snippet.&lt;/p&gt;
&lt;p&gt;I've been getting more and more interested
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/webassembly"&gt;in WebAssembly recently&lt;/a&gt;,
and found there's a dearth of high-quality WAT (&lt;em&gt;WebAssembly Text&lt;/em&gt;
language …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Update (2023-04-22)&lt;/strong&gt;: here's a repository with many WAT code samples,
including this one: &lt;a class="reference external" href="https://github.com/eliben/wasm-wat-samples"&gt;wasm-wat-samples&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is a brief blog post that mostly consists of a single, well-documented code
snippet.&lt;/p&gt;
&lt;p&gt;I've been getting more and more interested
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/webassembly"&gt;in WebAssembly recently&lt;/a&gt;,
and found there's a dearth of high-quality WAT (&lt;em&gt;WebAssembly Text&lt;/em&gt;
language) code samples dealing with some of the trickier aspects of WASM like
working with strings and passing data between WASM and the host &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; via memory.&lt;/p&gt;
&lt;p&gt;So here's a complete WASM module written in WAT that exports an &lt;tt class="docutils literal"&gt;itoa&lt;/tt&gt;
function -- just like its C counterpart, it converts an integer into a string
representation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;module&lt;/span&gt;
    &lt;span class="c1"&gt;;; Logging function imported from the environment; will print a single&lt;/span&gt;
    &lt;span class="c1"&gt;;; i32.&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;env&amp;quot;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;log&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="nv"&gt;$log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;param&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="c1"&gt;;; Declare linear memory and export it to host. The offset returned by&lt;/span&gt;
    &lt;span class="c1"&gt;;; $itoa is relative to this memory.&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;memory&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;;; Using some memory for a number--&amp;gt;digit ASCII lookup-table, and then the&lt;/span&gt;
    &lt;span class="c1"&gt;;; space for writing the result of $itoa.&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mf"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;0123456789&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="nv"&gt;$itoa_out_buf&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mf"&gt;8010&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;;; itoa: convert an integer to its string representation. Only supports&lt;/span&gt;
    &lt;span class="c1"&gt;;; numbers &amp;gt;= 0.&lt;/span&gt;
    &lt;span class="c1"&gt;;; Parameter: the number to convert&lt;/span&gt;
    &lt;span class="c1"&gt;;; Result: address and length of string in memory.&lt;/span&gt;
    &lt;span class="c1"&gt;;; Note: this result is only valid until the next call to $itoa which will&lt;/span&gt;
    &lt;span class="c1"&gt;;; overwrite it; obviously, this isn&amp;#39;t concurrency-safe either.&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="nv"&gt;$itoa&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;itoa&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;param&lt;/span&gt; &lt;span class="nv"&gt;$num&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nv"&gt;$numtmp&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nv"&gt;$numlen&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nv"&gt;$writeidx&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nv"&gt;$digit&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nv"&gt;$dchar&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;;; Count the number of characters in the output, save it in $numlen.&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.lt_s&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$numlen&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$numlen&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$numtmp&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$num&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;loop&lt;/span&gt; &lt;span class="nv"&gt;$countloop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;block&lt;/span&gt; &lt;span class="nv"&gt;$breakcountloop&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.eqz&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$numtmp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="nb"&gt;br_if&lt;/span&gt; &lt;span class="nv"&gt;$breakcountloop&lt;/span&gt;

                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$numtmp&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.div_u&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$numtmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$numlen&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.add&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$numlen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
                &lt;span class="nb"&gt;br&lt;/span&gt; &lt;span class="nv"&gt;$countloop&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;

        &lt;span class="c1"&gt;;; Now that we know the length of the output, we will start populating&lt;/span&gt;
        &lt;span class="c1"&gt;;; digits into the buffer. E.g. suppose $numlen is 4:&lt;/span&gt;
        &lt;span class="c1"&gt;;;&lt;/span&gt;
        &lt;span class="c1"&gt;;;                     _  _  _  _&lt;/span&gt;
        &lt;span class="c1"&gt;;;&lt;/span&gt;
        &lt;span class="c1"&gt;;;                     ^        ^&lt;/span&gt;
        &lt;span class="c1"&gt;;;  $itoa_out_buf -----|        |---- $writeidx&lt;/span&gt;
        &lt;span class="c1"&gt;;;&lt;/span&gt;
        &lt;span class="c1"&gt;;;&lt;/span&gt;
        &lt;span class="c1"&gt;;; $writeidx starts by pointing to $itoa_out_buf+3 and decrements until&lt;/span&gt;
        &lt;span class="c1"&gt;;; all the digits are populated.&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$writeidx&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.sub&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.add&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$itoa_out_buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$numlen&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;loop&lt;/span&gt; &lt;span class="nv"&gt;$writeloop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;block&lt;/span&gt; &lt;span class="nv"&gt;$breakwriteloop&lt;/span&gt;
            &lt;span class="c1"&gt;;; digit &amp;lt;- $num % 10&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$digit&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.rem_u&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="c1"&gt;;; set the char value from the lookup table of digit chars&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$dchar&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.load8_u&lt;/span&gt; &lt;span class="k"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$digit&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

            &lt;span class="c1"&gt;;; mem[writeidx] &amp;lt;- dchar&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.store8&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$writeidx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$dchar&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="c1"&gt;;; num &amp;lt;- num / 10&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$num&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.div_u&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

            &lt;span class="c1"&gt;;; If after writing a digit we see we wrote to the first index in&lt;/span&gt;
            &lt;span class="c1"&gt;;; the output buffer, we&amp;#39;re done.&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.eq&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$writeidx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$itoa_out_buf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nb"&gt;br_if&lt;/span&gt; &lt;span class="nv"&gt;$breakwriteloop&lt;/span&gt;

            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$writeidx&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.sub&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$writeidx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="nb"&gt;br&lt;/span&gt; &lt;span class="nv"&gt;$writeloop&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;;; return (itoa_out_buf, numlen)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$itoa_out_buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$numlen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Some notes about this code:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;itoa&lt;/tt&gt; uses the &lt;a class="reference external" href="https://github.com/WebAssembly/multi-value"&gt;multi-value&lt;/a&gt;
feature of WASM to return multiple values. This feature is supported pretty
&lt;a class="reference external" href="https://webassembly.org/roadmap/"&gt;uniformly by WASM hosts&lt;/a&gt; at this point.&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;itoa&lt;/tt&gt; writes its string output into memory, and returns the address of this
string and its length to the host. This address (in WASM's linear memory) is
currently hard-coded; there is no dynamic memory allocation built into WASM.
It's possible to implement it, and higher-level languages do, but for a simple
example this will do.&lt;/li&gt;
&lt;li&gt;The module exports its linear memory to the host so that the host can read
the string &lt;tt class="docutils literal"&gt;itoa&lt;/tt&gt; wrote.&lt;/li&gt;
&lt;li&gt;The algorithm is straightforward and unoptimized; it runs one O(log(N)) loop
to find the output size, and another such loop to populate the output.&lt;/li&gt;
&lt;/ul&gt;
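&lt;p&gt;As a sketch of how the two results can be consumed from within WAT itself,
here's a hypothetical helper (&lt;tt class="docutils literal"&gt;$log_itoa_len&lt;/tt&gt; is not part of the module
above) that could be added inside the same module; it assumes the &lt;tt class="docutils literal"&gt;$log&lt;/tt&gt;
import and the &lt;tt class="docutils literal"&gt;$itoa&lt;/tt&gt; function shown earlier:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;;; Hypothetical helper: logs the length of the string $itoa produces for $n.
(func $log_itoa_len (param $n i32)
    ;; After the call the stack holds the address, with the length on top.
    (call $itoa (local.get $n))
    call $log   ;; consumes the length
    drop        ;; discards the address
)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Results are pushed in declaration order, so the last declared result (the
length) is the first one popped.&lt;/p&gt;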
&lt;p&gt;For the accompanying host code and instructions for compiling &amp;amp; running, see
&lt;a class="reference external" href="https://github.com/eliben/code-for-blog/tree/main/2023/wasm-itoa"&gt;the GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="exercises"&gt;
&lt;h2&gt;Exercises&lt;/h2&gt;
&lt;p&gt;The &lt;tt class="docutils literal"&gt;itoa&lt;/tt&gt; function presented here has a few limitations that can be fixed
without too much effort; if you're interested in WAT programming, these could
be good exercises:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Extend it to support negative numbers&lt;/li&gt;
&lt;li&gt;Extend it to work for other common bases like hexadecimal (will require an
additional parameter).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A note on nomenclature: by &lt;em&gt;host&lt;/em&gt; I mean the execution environment a
WASM module runs in. The most common environment is a web browser, but
recently it's more and more common to see WASM executed in non-browser
environments like Node.js, wasmtime (Rust) or wazero (Go).&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="WebAssembly"></category><category term="Assembly"></category></entry><entry><title>An Intel 8080 assembler and online simulator</title><link href="https://eli.thegreenplace.net/2020/an-intel-8080-assembler-and-online-simulator/" rel="alternate"></link><published>2020-07-25T16:00:00-07:00</published><updated>2024-09-14T13:15:30-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2020-07-25:/2020/an-intel-8080-assembler-and-online-simulator/</id><summary type="html">&lt;p&gt;While going through Charles Petzold's &amp;quot;Code&amp;quot; book again, I was looking for an
easy-to-use online assembler and simulator for the classic &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Intel_8080"&gt;Intel 8080 CPU&lt;/a&gt;, but couldn't find anything that
fit my needs exactly. There are some well-done tools out there, but they seem to
be more geared to running game …&lt;/p&gt;</summary><content type="html">&lt;p&gt;While going through Charles Petzold's &amp;quot;Code&amp;quot; book again, I was looking for an
easy-to-use online assembler and simulator for the classic &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Intel_8080"&gt;Intel 8080 CPU&lt;/a&gt;, but couldn't find anything that
fit my needs exactly. There are some well-done tools out there, but they seem to
be more geared to running game ROMs and large programs on an emulator; my need
was different - I just wanted something to play with, to practice 8080 assembly
programming.&lt;/p&gt;
&lt;p&gt;So I ended up rolling my own, and the &lt;a class="reference external" href="https://github.com/eliben/js-8080-sim/"&gt;js-8080-sim&lt;/a&gt; project was born. The project has
three main parts:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;An assembler for the 8080: translating assembly language code into 8080
machine code. I wrote a custom assembler for this.&lt;/li&gt;
&lt;li&gt;A CPU simulator: simulating 8080 machine code. For this purpose I cloned
the &lt;a class="reference external" href="https://github.com/maly/8080js"&gt;maly/8080js&lt;/a&gt; project into my
repository &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; and tweaked it a little bit.&lt;/li&gt;
&lt;li&gt;A simple web UI for writing 8080 assembly code, running it and observing the
results (as changed values in memory and registers). I wrote a basic UI in
JS:&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="js 8080 web UI screenshot" class="align-center" src="https://github.com/eliben/js-8080-sim/blob/main/doc/js-sim-screenshot.png?raw=true" style="width: 650px;" /&gt;
&lt;p&gt;If you want to play with the simulator, a live version is available online at
&lt;a class="reference external" href="https://eliben.org/js8080"&gt;https://eliben.org/js8080&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The UI is purely client-side; it makes no requests and just uses your browser
as a GUI. It does use the browser's local storage to save the last program you
ran.&lt;/p&gt;
&lt;p&gt;Issues and PRs &lt;a class="reference external" href="https://github.com/eliben/js-8080-sim/"&gt;on GitHub&lt;/a&gt; welcome!&lt;/p&gt;
&lt;div class="section" id="on-javascript-and-frameworks"&gt;
&lt;h2&gt;On JavaScript and frameworks&lt;/h2&gt;
&lt;p&gt;Using JS for a project like this is very natural, because ultimately what I'm
interested in is having a convenient web UI to play with the simulator. When
I do this, I almost always end up writing vanilla HTML+CSS+JS, avoiding
frameworks. I don't write JS often, so whenever I get to work on a new project,
the framework &lt;em&gt;du jour&lt;/em&gt; has typically changed from the last time, and I just
don't have the time to keep track. Vanilla HTML+CSS+JS has much better
longevity, IMHO, although it does mean somewhat more manual work (e.g. to keep
the UI in sync with the application state).&lt;/p&gt;
&lt;p&gt;The only framework I was tempted to use is Bootstrap for the CSS and layout,
but eventually decided against it in the interest of simplicity.&lt;/p&gt;
&lt;p&gt;We're fortunate to have much more stable and usable JS and web APIs in 2020
compared to just a few years ago. For the simulator I've been using the ES6
version of JS, which is widely supported today and offers many niceties.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;I went with vendoring 8080js because it appears to be unmaintained,
and I also wanted to avoid a dependency, preferring the project to be
self-contained. This was easy with 8080js because it's a single JS file
and it has a permissive 2-clause BSD license. I've reproduced the license
in full in the cloned source file. FWIW, 8080js itself is also based on
an earlier BSD-licensed simulator; OSS at its best :-)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="JavaScript"></category><category term="Assembly"></category></entry><entry><title>"Beating" C with 400 lines of unoptimized assembly</title><link href="https://eli.thegreenplace.net/2019/beating-c-with-400-lines-of-unoptimized-assembly/" rel="alternate"></link><published>2019-11-23T09:46:00-08:00</published><updated>2022-10-04T14:08:24-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2019-11-23:/2019/beating-c-with-400-lines-of-unoptimized-assembly/</id><summary type="html">&lt;p&gt;Earlier this week I ran into a fun quick blog post named
&lt;a class="reference external" href="https://ajeetdsouza.github.io/blog/posts/beating-c-with-70-lines-of-go/"&gt;Beating C with 70 lines of Go&lt;/a&gt;,
which reimplements the basic functionality of &lt;tt class="docutils literal"&gt;wc&lt;/tt&gt; in Go using various
approaches and compares their performance. Apparently it's inspired by an
earlier &lt;a class="reference external" href="https://chrispenner.ca/posts/wc"&gt;Haskell-based post&lt;/a&gt; and several
other offshoots.&lt;/p&gt;
&lt;p&gt;This reminded me …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Earlier this week I ran into a fun quick blog post named
&lt;a class="reference external" href="https://ajeetdsouza.github.io/blog/posts/beating-c-with-70-lines-of-go/"&gt;Beating C with 70 lines of Go&lt;/a&gt;,
which reimplements the basic functionality of &lt;tt class="docutils literal"&gt;wc&lt;/tt&gt; in Go using various
approaches and compares their performance. Apparently it's inspired by an
earlier &lt;a class="reference external" href="https://chrispenner.ca/posts/wc"&gt;Haskell-based post&lt;/a&gt; and several
other offshoots.&lt;/p&gt;
&lt;p&gt;This reminded me of my earlier post about &lt;a class="reference external" href="https://eli.thegreenplace.net/2016/wc-in-x64-assembly/"&gt;reimplementing wc in pure x64
assembly&lt;/a&gt;, where I
also measured the performance of my program against &lt;tt class="docutils literal"&gt;wc&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;The optimized approach taken in the Go implementation is very similar to what I
did in assembly, so it seemed like an interesting comparison. I started by
generating a ~580 MiB file using &lt;a class="reference external" href="https://github.com/eliben/xmlgen"&gt;xmlgen&lt;/a&gt;
and ran the various versions against each other:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;LC_TYPE=POSIX wc&lt;/tt&gt;: 2.13 sec&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;wc-naive.go&lt;/span&gt;&lt;/tt&gt;: 3.53 sec&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;wc-chunks.go&lt;/span&gt;&lt;/tt&gt;: 1.37 sec&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;wcx64&lt;/tt&gt;: 1.2 sec&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note the &lt;tt class="docutils literal"&gt;LC_TYPE&lt;/tt&gt; setting for the system's &lt;tt class="docutils literal"&gt;wc&lt;/tt&gt;. This is important for a
fair comparison, because without this &lt;tt class="docutils literal"&gt;wc&lt;/tt&gt; will attempt to do &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;utf-8&lt;/span&gt;&lt;/tt&gt;
decoding on all bytes in the file, which results in significant slowdowns. Since
the Go versions use byte-counts and so does my &lt;tt class="docutils literal"&gt;wcx64&lt;/tt&gt;, I force the comparison
to be fair. In fact, this isn't a bad result for Go - the straightforward
solution is almost as fast as the same approach direct-coded in assembly!&lt;/p&gt;
&lt;p&gt;The Go blog post follows with parallelized versions which are much faster than
the serial one, but I'm excluding them here because all the other competitors are
single-threaded. This is not a serious benchmark anyway. If you prefer to
be serious, &lt;a class="reference external" href="https://github.com/expr-fi/fastlwc/"&gt;this response using SIMD-optimized C&lt;/a&gt; blows everything out of the water:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;fastlwc&lt;/tt&gt;: 0.11 sec&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The conclusion? Well, there's no real conclusion here, beyond that coding
exercises like this are fun in any language :-)&lt;/p&gt;
</content><category term="misc"></category><category term="Assembly"></category><category term="C &amp; C++"></category><category term="Go"></category></entry><entry><title>Adventures in JIT compilation: Part 2 - an x64 JIT</title><link href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-2-an-x64-jit/" rel="alternate"></link><published>2017-03-22T06:32:00-07:00</published><updated>2024-05-04T19:46:23-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2017-03-22:/2017/adventures-in-jit-compilation-part-2-an-x64-jit/</id><summary type="html">&lt;p&gt;In the &lt;a class="reference external" href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-1-an-interpreter/"&gt;first part of the series&lt;/a&gt;
I've briefly introduced the BF source language and went on to present four
interpreters with increasing degree of optimization. That post should serve as a
good background before diving into actual JIT-ing.&lt;/p&gt;
&lt;p&gt;Another important part of the background puzzle is my &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introduction"&gt;How to …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the &lt;a class="reference external" href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-1-an-interpreter/"&gt;first part of the series&lt;/a&gt;
I've briefly introduced the BF source language and went on to present four
interpreters with increasing degree of optimization. That post should serve as a
good background before diving into actual JIT-ing.&lt;/p&gt;
&lt;p&gt;Another important part of the background puzzle is my &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introduction"&gt;How to JIT - an
introduction&lt;/a&gt; post from
2013; there, I discuss some of the basic tools needed to emit executable x64
machine code at run-time and actually run it on Linux. Please go through it
quickly if these things are new to you.&lt;/p&gt;
&lt;div class="section" id="the-two-phases-of-jit"&gt;
&lt;h2&gt;The two phases of JIT&lt;/h2&gt;
&lt;p&gt;As I wrote &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introduction"&gt;previously&lt;/a&gt;, the JIT
technique is easier to understand when divided into two distinct phases:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Create machine code at program run-time.&lt;/li&gt;
&lt;li&gt;Execute that machine code, also at program run-time.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Phase 2 for our BF JIT is identical to the method described in that
introductory post. Take a look at the &lt;tt class="docutils literal"&gt;JitProgram&lt;/tt&gt; class in
&lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2017/bfjit/jit_utils.h"&gt;jit_utils&lt;/a&gt;
for details. We'll be more focused on phase 1, which will be translating
BF to x64 machine code; per the definition quoted in part 1 of the
series, we're going to develop an actual BF compiler (compiling from BF source
to x64 machine code).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="compilers-assemblers-and-instruction-encoding"&gt;
&lt;h2&gt;Compilers, assemblers and instruction encoding&lt;/h2&gt;
&lt;p&gt;Traditionally, compilation was divided into several stages. The actual source
language compiler would translate some higher-level language to target-specific
assembly; then, an assembler would translate assembly to actual machine code
&lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. There's a number of important benefits assembly language provides over raw
machine code. Salient examples include:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Instruction encoding: it's certainly nicer to write &lt;tt class="docutils literal"&gt;inc %r13&lt;/tt&gt; to increment
the contents of register &lt;tt class="docutils literal"&gt;r13&lt;/tt&gt; than to write &lt;tt class="docutils literal"&gt;0x49, 0xFF, 0xC5&lt;/tt&gt;.
Instruction encoding for the popular architectures is &lt;a class="reference external" href="http://ref.x86asm.net/"&gt;notoriously complicated&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Naming labels and procedures for jumps/calls: it's easier to write &lt;tt class="docutils literal"&gt;jl loop&lt;/tt&gt;
than to figure out the encoding for the instruction, along with the relative
position of the &lt;tt class="docutils literal"&gt;loop&lt;/tt&gt; label and encoding the delta to it (not to mention
this delta changes every time we add instructions in between and needs to be
recomputed). Similarly for functions, &lt;tt class="docutils literal"&gt;call foo&lt;/tt&gt; instead of doing it by
address.&lt;/li&gt;
&lt;/ul&gt;
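&lt;p&gt;To make the label bookkeeping concrete, here is a minimal sketch (written for this explanation, not taken from any assembler) of the displacement computation hidden behind &lt;tt class="docutils literal"&gt;jl loop&lt;/tt&gt;. It assumes the 6-byte near form of the instruction (&lt;tt class="docutils literal"&gt;0x0F, 0x8C&lt;/tt&gt; followed by a 32-bit displacement), where the displacement is measured from the end of the jump instruction; the helper name &lt;tt class="docutils literal"&gt;Rel32&lt;/tt&gt; and the offsets are made up for illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdio&amp;gt;

// Hypothetical helper: displacement for a jump whose first byte sits at
// jump_offset and which is jump_size bytes long, targeting target_offset.
// x64 relative jumps are measured from the end of the jump instruction.
int32_t Rel32(size_t jump_offset, size_t jump_size, size_t target_offset) {
  int64_t next = static_cast&amp;lt;int64_t&amp;gt;(jump_offset + jump_size);
  return static_cast&amp;lt;int32_t&amp;gt;(static_cast&amp;lt;int64_t&amp;gt;(target_offset) - next);
}

int main() {
  // Made-up layout: the loop label is at offset 10 and a 6-byte jl starts at
  // offset 40.
  std::printf(&amp;quot;%d\n&amp;quot;, Rel32(40, 6, 10));  // prints -36
  // Adding 3 bytes of instructions between the label and the jump changes
  // the delta, so it has to be recomputed.
  std::printf(&amp;quot;%d\n&amp;quot;, Rel32(43, 6, 10));  // prints -39
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;An assembler does this arithmetic for every label reference; when encoding by hand, it's on us.&lt;/p&gt;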
&lt;p&gt;One of my guiding principles through the field of programming is that before
diving into the possible solutions for a problem (for example, some library for
doing X) it's worth working through the problem manually first (doing X by hand,
without libraries). Grinding your teeth over issues for a while is the best way
to appreciate what the shrinkwrapped solution/library does for you.&lt;/p&gt;
&lt;p&gt;In this spirit, our first JIT is going to be completely hand-written.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="simple-jit-hand-rolling-x64-instruction-encoding"&gt;
&lt;h2&gt;Simple JIT - hand-rolling x64 instruction encoding&lt;/h2&gt;
&lt;p&gt;Our first JIT for this post is &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2017/bfjit/simplejit.cpp"&gt;simplejit.cpp&lt;/a&gt;. Similarly to the
interpreters of part 1, all the action happens in a single function (here called
&lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt;) invoked from &lt;tt class="docutils literal"&gt;main&lt;/tt&gt;. &lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt; goes through the BF source
and emits x64 machine code into a memory buffer; in the end, it jumps to this
machine code to run the BF program.&lt;/p&gt;
&lt;p&gt;Here's its beginning:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MEMORY_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="c1"&gt;// Registers used in the program:&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// r13: the data pointer -- contains the address of memory.data()&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// rax, rdi, rsi, rdx: used for making system calls, per the ABI.&lt;/span&gt;

&lt;span class="n"&gt;CodeEmitter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="c1"&gt;// Throughout the translation loop, this stack contains offsets (in the&lt;/span&gt;
&lt;span class="c1"&gt;// emitter code vector) of locations for fixup.&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="c1"&gt;// movabs &amp;lt;address of memory.data&amp;gt;, %r13&lt;/span&gt;
&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x49&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xBD&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitUint64&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As usual, we have our BF memory buffer in a &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::vector&lt;/span&gt;&lt;/tt&gt;. The comments reveal
some of the conventions used throughout the emitted program: our &amp;quot;data pointer&amp;quot;
will be in &lt;tt class="docutils literal"&gt;r13&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;CodeEmitter&lt;/tt&gt; is a very simple utility to append bytes and words to a vector
of bytes. Its full code &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2017/bfjit/jit_utils.cpp"&gt;is here&lt;/a&gt;.
It's platform independent except for the assumption of little-endian byte order (for
&lt;tt class="docutils literal"&gt;EmitUint64&lt;/tt&gt; it will write the lowest byte of the 64-bit word first, then the
second lowest byte, etc.).&lt;/p&gt;
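&lt;p&gt;The little-endian order can be sketched as follows. This is a minimal stand-in written for this explanation, not the actual &lt;tt class="docutils literal"&gt;CodeEmitter&lt;/tt&gt; code linked above; the function name &lt;tt class="docutils literal"&gt;AppendUint64LE&lt;/tt&gt; and the sample value are made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical helper: append the 64-bit value v to code, lowest byte first.
void AppendUint64LE(std::vector&amp;lt;uint8_t&amp;gt;&amp;amp; code, uint64_t v) {
  for (int i = 0; i &amp;lt; 8; ++i) {
    code.push_back(static_cast&amp;lt;uint8_t&amp;gt;(v &amp;gt;&amp;gt; (8 * i)));
  }
}

int main() {
  std::vector&amp;lt;uint8_t&amp;gt; code;
  AppendUint64LE(code, 0x1122334455667788ULL);
  // Prints the bytes in memory order: 88 77 66 55 44 33 22 11
  for (uint8_t b : code) {
    std::printf(&amp;quot;%02X &amp;quot;, static_cast&amp;lt;unsigned&amp;gt;(b));
  }
  std::printf(&amp;quot;\n&amp;quot;);
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Writing the bytes out with shifts, rather than copying the integer's in-memory representation, produces the same byte order on any host.&lt;/p&gt;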
&lt;p&gt;Our first bit of actual machine code emission follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;// movabs &amp;lt;address of memory.data&amp;gt;, %r13&lt;/span&gt;
&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x49&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xBD&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitUint64&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And it's a cool one, mixing elements from the host (the C++ program doing the
emission) and the JITed code. First note the usage of &lt;tt class="docutils literal"&gt;movabs&lt;/tt&gt;, an x64
instruction useful for placing 64-bit immediates in a register. This is exactly
what we're doing here - placing the address of the data buffer of &lt;tt class="docutils literal"&gt;memory&lt;/tt&gt;
in &lt;tt class="docutils literal"&gt;r13&lt;/tt&gt;. The call to &lt;tt class="docutils literal"&gt;EmitBytes&lt;/tt&gt; with a cryptic sequence of hex values is
preceded by a snippet of assembly in a comment - the assembly conveys the
meaning for human readers, the hex values are the actual encoding the machine
will understand.&lt;/p&gt;
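&lt;p&gt;For readers curious where &lt;tt class="docutils literal"&gt;0x49, 0xBD&lt;/tt&gt; come from, here's a small sketch that derives them. It assumes the form of &lt;tt class="docutils literal"&gt;mov&lt;/tt&gt; that takes a 64-bit immediate (a REX.W prefix followed by opcode &lt;tt class="docutils literal"&gt;0xB8&lt;/tt&gt; plus the register number), which is what AT&amp;amp;T syntax calls &lt;tt class="docutils literal"&gt;movabs&lt;/tt&gt;. The helper name &lt;tt class="docutils literal"&gt;MovabsPrefix&lt;/tt&gt; and its parameters are mine for illustration; nothing like it appears in simplejit.cpp.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdio&amp;gt;

// Hypothetical helper: computes the two leading bytes of
// &amp;quot;movabs $imm64, %reg&amp;quot; for a register number in 0..15 (r13 is 13).
void MovabsPrefix(unsigned reg, uint8_t* rex, uint8_t* opcode) {
  // REX prefix bits: 0100WRXB. W=1 selects a 64-bit operand; B carries the
  // top bit of the register number.
  *rex = static_cast&amp;lt;uint8_t&amp;gt;(0x48 | (reg &amp;gt;&amp;gt; 3));
  // The opcode is 0xB8 plus the low three bits of the register number.
  *opcode = static_cast&amp;lt;uint8_t&amp;gt;(0xB8 | (reg &amp;amp; 7));
}

int main() {
  uint8_t rex, opcode;
  MovabsPrefix(13, &amp;amp;rex, &amp;amp;opcode);
  // prints 49 BD
  std::printf(&amp;quot;%02X %02X\n&amp;quot;, static_cast&amp;lt;unsigned&amp;gt;(rex),
              static_cast&amp;lt;unsigned&amp;gt;(opcode));
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The eight bytes of the immediate follow these two, in little-endian order.&lt;/p&gt;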
&lt;p&gt;Then comes the BF compilation loop, which looks at the next BF instruction and
emits the appropriate machine code for it. Our compiler works in a single pass;
this means that there's a bit of trickiness in handling the jumps, as we will
soon see.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// inc %r13&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x49&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC5&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;&amp;lt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// dec %r13&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x49&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xCD&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Our memory is byte-addressable, so using addb/subb for modifying it.&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// addb $1, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// subb $1, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x6D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These are pretty straightforward; since &lt;tt class="docutils literal"&gt;r13&lt;/tt&gt; is the data pointer, &lt;tt class="docutils literal"&gt;&amp;gt;&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;&amp;lt;&lt;/tt&gt; increment and decrement it, while &lt;tt class="docutils literal"&gt;+&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;-&lt;/tt&gt; increment and decrement
what it's pointing to. One slightly subtle aspect is that I chose byte-sized
memory cells for our BF implementations; this means we have to be careful when
reading or writing memory and use byte-sized accesses (the &lt;tt class="docutils literal"&gt;b&lt;/tt&gt; suffixes on &lt;tt class="docutils literal"&gt;add&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;sub&lt;/tt&gt; above) rather than 64-bit ones.&lt;/p&gt;
&lt;p&gt;The code emitted for &lt;tt class="docutils literal"&gt;.&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;,&lt;/tt&gt; is a bit more exciting; in an effort to
avoid any external dependencies, we're going to invoke
&lt;a class="reference external" href="http://man7.org/linux/man-pages/man2/syscalls.2.html"&gt;Linux system calls&lt;/a&gt;
directly. &lt;tt class="docutils literal"&gt;WRITE&lt;/tt&gt; for &lt;tt class="docutils literal"&gt;.&lt;/tt&gt;; &lt;tt class="docutils literal"&gt;READ&lt;/tt&gt; for &lt;tt class="docutils literal"&gt;,&lt;/tt&gt;. We're using the x64 ABI
here with the syscall identifier in &lt;tt class="docutils literal"&gt;rax&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// To emit one byte to stdout, call the write syscall with fd=1 (for&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// stdout), buf=address of byte, count=1.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// mov $1, %rax&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// mov $1, %rdi&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// mov %r13, %rsi&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// mov $1, %rdx&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// syscall&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x4C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x89&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xEE&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x0F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x05&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// To read one byte from stdin, call the read syscall with fd=0 (for&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// stdin),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// buf=address of byte, count=1.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x4C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x89&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xEE&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xC2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x0F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x05&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The comments certainly help, don't they? I hope these snippets are a great
motivation for using assembly language rather than encoding instructions
manually :-)&lt;/p&gt;
&lt;p&gt;The jump instructions are always the most interesting in BF. For &lt;tt class="docutils literal"&gt;[&lt;/tt&gt; we do:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;[&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// For the jumps we always emit the instruction for 32-bit pc-relative&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// jump, without worrying about potentially short jumps and relaxation.&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// cmpb $0, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Save the location in the stack, and emit JZ (with 32-bit relative&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// offset) with 4 placeholder zeroes that will be fixed up later.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x0F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x84&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitUint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note that we don't know where this jump leads at this point - it will go to the
matching &lt;tt class="docutils literal"&gt;]&lt;/tt&gt;, which we haven't encountered yet! Therefore, to keep our
compilation in a single pass &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt; we use the time-honored technique of
&lt;em&gt;backpatching&lt;/em&gt; by emitting a placeholder value for the jump and fixing it up
once we encounter the matching label. Another thing to note is always using a
32-bit pc-relative jump, for simplicity; we could save a couple of bytes with a
short jump in most cases (see &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/01/03/assembler-relaxation"&gt;my article on assembler relaxation&lt;/a&gt; for the full
scoop), but I don't think it's worth the effort here.&lt;/p&gt;
&lt;p&gt;Compiling the matching &lt;tt class="docutils literal"&gt;]&lt;/tt&gt; is a bit trickier; I hope the comments do a good
job explaining what's going on, and the code itself is optimized for readability
rather than cleverness:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;DIE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;unmatched closing &amp;#39;]&amp;#39; at pc=&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open_bracket_offset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// cmpb $0, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// open_bracket_offset points to the JZ that jumps to this closing&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// bracket. We&amp;#39;ll need to fix up the offset for that JZ, as well as emit a&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// JNZ with a correct offset back. Note that both [ and ] jump to the&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// instruction *after* the matching bracket if their condition is&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// fulfilled.&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Compute the offset for this jump. The jump start is computed from after&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// the jump instruction, and the target is the instruction after the one&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// saved on the stack.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jump_back_from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jump_back_to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open_bracket_offset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pcrel_offset_back&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;compute_relative_32bit_offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jump_back_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jump_back_to&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// jnz &amp;lt;open_bracket_location&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitBytes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mh"&gt;0x0F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x85&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitUint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcrel_offset_back&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Also fix up the forward jump at the matching [. Note that here we don&amp;#39;t&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// need to add the size of this jmp to the &amp;quot;jump to&amp;quot; offset, since the jmp&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// was already emitted and the emitter size was bumped forward.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jump_forward_from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open_bracket_offset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jump_forward_to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pcrel_offset_forward&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;compute_relative_32bit_offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jump_forward_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jump_forward_to&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReplaceUint32AtOffset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;open_bracket_offset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="n"&gt;pcrel_offset_forward&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This concludes the compiler loop; we end up with a bunch of potentially
executable machine code in the emitter's &lt;tt class="docutils literal"&gt;vector&lt;/tt&gt;. This code refers to the host program (the
address of &lt;tt class="docutils literal"&gt;memory.data()&lt;/tt&gt;), but that's OK since the host program's lifetime
wraps the lifetime of the JITed code. What's remaining is to actually invoke
this machine code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;// ... after the compilation loop&lt;/span&gt;

&lt;span class="c1"&gt;// The emitted code will be called as a function from C++; therefore it has to&lt;/span&gt;
&lt;span class="c1"&gt;// use the proper calling convention. Emit a &amp;#39;ret&amp;#39; for orderly return to the&lt;/span&gt;
&lt;span class="c1"&gt;// caller.&lt;/span&gt;
&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmitByte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xC3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="c1"&gt;// Load the emitted code to executable memory and run it.&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;emitted_code&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;JitProgram&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;jit_program&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emitted_code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="c1"&gt;// JittedFunc is the C++ type for the JIT function emitted here. The emitted&lt;/span&gt;
&lt;span class="c1"&gt;// function is callable from C++ and follows the x64 System V ABI.&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;JittedFunc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;JittedFunc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JittedFunc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;jit_program&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;program_memory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The call should be familiar from reading the &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introduction"&gt;How to JIT&lt;/a&gt; post.
Note that here we opted for the simplest function possible - no arguments, no
return value; in future sections we'll spice it up a bit.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="taking-our-jit-for-a-spin"&gt;
&lt;h2&gt;Taking our JIT for a spin&lt;/h2&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-1-an-interpreter/"&gt;part 1&lt;/a&gt;,
I presented a trivial BF program that prints the numbers 1 to 5 to the screen:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;++++++++ ++++++++ ++++++++ ++++++++ ++++++++ ++++++++
&amp;gt;+++++
[&amp;lt;+.&amp;gt;-]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let's see what our compiler translates it to. Even though the code vector inside
&lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt; is ephemeral (lives only temporarily in memory), we can serialize
it to a binary file which we can then disassemble (with &lt;tt class="docutils literal"&gt;objdump &lt;span class="pre"&gt;-D&lt;/span&gt; &lt;span class="pre"&gt;-b&lt;/span&gt; binary
&lt;span class="pre"&gt;-mi386:x86-64&lt;/span&gt;&lt;/tt&gt;). The following is the disassembly listing with comments I
embedded to explain what's going on:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt; # The runtime address of memory.data() goes into r13; note that this will
 # likely be a different value in every invocation of the JIT.

  0:   49 bd f0 54 e3 00 00    movabs $0xe354f0,%r13
  7:   00 00 00

 # A sequence of 48 instructions that all do the same, for the initial sequence
 # of +s; this makes me miss our optimizing interpreter, but worry not - we&amp;#39;ll
 # make this go away later in the post.

  a:   41 80 45 00 01          addb   $0x1,0x0(%r13)
  f:   41 80 45 00 01          addb   $0x1,0x0(%r13)

 # [...] 46 more &amp;#39;addb&amp;#39;

 # &amp;gt;+++++

 fa:   49 ff c5                inc    %r13
 fd:   41 80 45 00 01          addb   $0x1,0x0(%r13)
102:   41 80 45 00 01          addb   $0x1,0x0(%r13)
107:   41 80 45 00 01          addb   $0x1,0x0(%r13)
10c:   41 80 45 00 01          addb   $0x1,0x0(%r13)
111:   41 80 45 00 01          addb   $0x1,0x0(%r13)

 # Here comes the loop! Note that the relative jump offset is already inserted
 # into the &amp;#39;je&amp;#39; instruction by the backpatching process.

116:   41 80 7d 00 00          cmpb   $0x0,0x0(%r13)
11b:   0f 84 35 00 00 00       je     0x156
121:   49 ff cd                dec    %r13
124:   41 80 45 00 01          addb   $0x1,0x0(%r13)

 # The &amp;#39;.&amp;#39; is translated into a syscall to WRITE

129:   48 c7 c0 01 00 00 00    mov    $0x1,%rax
130:   48 c7 c7 01 00 00 00    mov    $0x1,%rdi
137:   4c 89 ee                mov    %r13,%rsi
13a:   48 c7 c2 01 00 00 00    mov    $0x1,%rdx
141:   0f 05                   syscall
143:   49 ff c5                inc    %r13
146:   41 80 6d 00 01          subb   $0x1,0x0(%r13)
14b:   41 80 7d 00 00          cmpb   $0x0,0x0(%r13)

 # Jump back to beginning of loop

150:   0f 85 cb ff ff ff       jne    0x121

 # We&amp;#39;re done

156:   c3                      retq
&lt;/pre&gt;&lt;/div&gt;
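&lt;p&gt;The two jump displacements in this listing can be checked by hand. Here's a
minimal sketch of that arithmetic; &lt;tt class="docutils literal"&gt;relative_32bit_offset&lt;/tt&gt; is a hypothetical
stand-in for the &lt;tt class="docutils literal"&gt;compute_relative_32bit_offset&lt;/tt&gt; helper called in the compiler
loop (the real implementation may well differ), and it assumes the displacement is
measured from the end of the 6-byte jump instruction:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;

// Hypothetical stand-in for compute_relative_32bit_offset: the displacement
// runs from the end of the jump instruction to the target. Unsigned
// subtraction wraps around, so a backward jump comes out in two&amp;#39;s complement.
uint32_t relative_32bit_offset(size_t jump_from, size_t jump_to) {
  return static_cast&amp;lt;uint32_t&amp;gt;(jump_to - jump_from);
}

// The je at 0x11b is 6 bytes long and targets 0x156.
uint32_t forward_offset() { return relative_32bit_offset(0x11b + 6, 0x156); }

// The jne at 0x150 is 6 bytes long and targets 0x121.
uint32_t backward_offset() { return relative_32bit_offset(0x150 + 6, 0x121); }
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;forward_offset()&lt;/tt&gt; evaluates to &lt;tt class="docutils literal"&gt;0x35&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;backward_offset()&lt;/tt&gt; to
&lt;tt class="docutils literal"&gt;0xffffffcb&lt;/tt&gt;; stored in little-endian order these are the &lt;tt class="docutils literal"&gt;35 00 00 00&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;cb ff ff ff&lt;/tt&gt; bytes that follow the &lt;tt class="docutils literal"&gt;je&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;jne&lt;/tt&gt; opcodes in the listing.&lt;/p&gt;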
&lt;/div&gt;
&lt;div class="section" id="how-does-it-perform"&gt;
&lt;h2&gt;How does it perform?&lt;/h2&gt;
&lt;p&gt;It's time to measure the performance of our JIT against the interpreters from
part 1. &lt;tt class="docutils literal"&gt;optinterp3&lt;/tt&gt; was about 10x faster than the naive interpreter - how
will this JIT measure up? Note that it has no optimizations (except not having
to recompute the jump destination for every loop iteration as the naive
interpreter did). Can you guess? The results may surprise you...&lt;/p&gt;
&lt;p&gt;The simple JIT runs &lt;tt class="docutils literal"&gt;mandelbrot&lt;/tt&gt; in 2.89 seconds, and &lt;tt class="docutils literal"&gt;factor&lt;/tt&gt; in 0.94
seconds - much faster still than &lt;tt class="docutils literal"&gt;optinterp3&lt;/tt&gt;; here's the comparison plot
(omitting the slower interpreters since they skew the scale):&lt;/p&gt;
&lt;img alt="BF opt3 vs simplejit" class="align-center" src="https://eli.thegreenplace.net/images/2017/bf-runtime-vs-simplejit.png" /&gt;
&lt;p&gt;Why is this so? &lt;tt class="docutils literal"&gt;optinterp3&lt;/tt&gt; is heavily optimized - it folds entire loops into
a single operation; &lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt; does none of this - we've just seen the
embarrassing sequence of &lt;tt class="docutils literal"&gt;addb&lt;/tt&gt;s it emits for a long sequence of &lt;tt class="docutils literal"&gt;+&lt;/tt&gt;s.&lt;/p&gt;
&lt;p&gt;The reason is that the &lt;em&gt;baseline&lt;/em&gt; performance of the JIT is vastly better. I've
mentioned this briefly in part 1 - imagine what's needed to interpret a
single instruction in the fastest interpreter.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Advance &lt;tt class="docutils literal"&gt;pc&lt;/tt&gt; and compare it to program size.&lt;/li&gt;
&lt;li&gt;Grab the instruction at &lt;tt class="docutils literal"&gt;pc&lt;/tt&gt;.&lt;/li&gt;
&lt;li&gt;Switch on the value of the instruction to the right &lt;tt class="docutils literal"&gt;case&lt;/tt&gt;.&lt;/li&gt;
&lt;li&gt;Execute the &lt;tt class="docutils literal"&gt;case&lt;/tt&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This requires a whole sequence of machine instructions, with at least two
branches (one for the loop, one for the &lt;tt class="docutils literal"&gt;switch&lt;/tt&gt;). On the other hand, the JIT
just emits a &lt;em&gt;single instruction&lt;/em&gt; - no branches. I would say that - depending on
what the compiler did while compiling the interpreter - the JIT is between 4 and
8 times faster at running any given BF operation. It has to run many more BF
operations because it doesn't optimize, but this difference is insufficient to
close the huge baseline gap. Later in this post we're going to see an optimized
JIT which performs even better.&lt;/p&gt;
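&lt;p&gt;To make the four steps concrete, here's a minimal sketch of an interpreter
dispatch loop. It's a hypothetical, stripped-down illustration (the &lt;tt class="docutils literal"&gt;interpret&lt;/tt&gt;
name is made up, and only &lt;tt class="docutils literal"&gt;+&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;-&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;&amp;gt;&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;&amp;lt;&lt;/tt&gt; are handled), not
the code from part 1:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical stripped-down interpreter: handles only + - &amp;gt; &amp;lt; and assumes
// the data pointer stays within memory.
void interpret(const std::string&amp;amp; program, std::vector&amp;lt;uint8_t&amp;gt;&amp;amp; memory) {
  size_t dataptr = 0;
  // Step 1: advance pc and compare it to the program size (first branch).
  for (size_t pc = 0; pc &amp;lt; program.size(); ++pc) {
    // Step 2: grab the instruction at pc.
    char instruction = program[pc];
    // Step 3: switch on it to reach the right case (second branch).
    switch (instruction) {
      // Step 4: execute the case.
      case &amp;#39;+&amp;#39;: memory[dataptr]++; break;
      case &amp;#39;-&amp;#39;: memory[dataptr]--; break;
      case &amp;#39;&amp;gt;&amp;#39;: dataptr++; break;
      case &amp;#39;&amp;lt;&amp;#39;: dataptr--; break;
      default: break;
    }
  }
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Every &lt;tt class="docutils literal"&gt;+&lt;/tt&gt; pays for the loop test and the &lt;tt class="docutils literal"&gt;switch&lt;/tt&gt; dispatch in addition to the
increment itself, whereas the JIT emits only &lt;tt class="docutils literal"&gt;addb $1, 0(%r13)&lt;/tt&gt; for it.&lt;/p&gt;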
&lt;p&gt;But first, let's talk about this painful instruction encoding business.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="manually-encoding-instructions"&gt;
&lt;h2&gt;Manually encoding instructions&lt;/h2&gt;
&lt;p&gt;As promised, &lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt; is completely self-contained. It doesn't use any
external libraries, and encodes all the instructions by hand. It's not hard to
see how painful that process is, and the code is absolutely unreadable unless
accompanied by detailed comments; moreover, changing the code is a pain, and
changes happen in unexpected ways. For example, if we want to use some other
register in an instruction, the change to emitted code won't be intuitive.
&lt;tt class="docutils literal"&gt;add %r9, %rax&lt;/tt&gt; is encoded as &lt;tt class="docutils literal"&gt;0x4C, 0x01, 0xC8&lt;/tt&gt;, but &lt;tt class="docutils literal"&gt;add %r10, %rax&lt;/tt&gt; is
&lt;tt class="docutils literal"&gt;0x4C, 0x01, 0xD0&lt;/tt&gt;; since registers are specified in sub-byte bit fields,
one needs very good memory and tons of experience to predict what goes where.&lt;/p&gt;
&lt;p&gt;Would you expect related instructions to look somewhat similar?
They don't. &lt;tt class="docutils literal"&gt;inc %r13&lt;/tt&gt; is encoded as &lt;tt class="docutils literal"&gt;0x49, 0xFF, 0xC5&lt;/tt&gt;, for example.
To put it bluntly - unless you're &lt;a class="reference external" href="http://www.catb.org/jargon/html/story-of-mel.html"&gt;Mel&lt;/a&gt;, you're going to have a
hard time. Now imagine that you have to support emitting code for multiple
architectures!&lt;/p&gt;
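&lt;p&gt;To show where the register numbers actually land, here's a minimal sketch that
encodes the 64-bit register-to-register form of an instruction.
&lt;tt class="docutils literal"&gt;encode_reg_reg&lt;/tt&gt; is a hypothetical helper, not part of &lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt;; it assumes
an opcode of the "r/m64, r64" kind (such as &lt;tt class="docutils literal"&gt;0x89&lt;/tt&gt; for &lt;tt class="docutils literal"&gt;mov&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;0x01&lt;/tt&gt; for
&lt;tt class="docutils literal"&gt;add&lt;/tt&gt;) and registers numbered 0 to 15 (&lt;tt class="docutils literal"&gt;rax&lt;/tt&gt; is 0, &lt;tt class="docutils literal"&gt;rsi&lt;/tt&gt; is 6, &lt;tt class="docutils literal"&gt;r13&lt;/tt&gt; is 13):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;#include &amp;lt;array&amp;gt;
#include &amp;lt;cstdint&amp;gt;

// Hypothetical helper: encodes &amp;quot;op %src, %dst&amp;quot; (AT&amp;amp;T operand order) for
// 64-bit registers numbered 0..15. Bit 3 of each register number goes into
// the REX prefix; the low three bits go into the ModRM byte.
std::array&amp;lt;uint8_t, 3&amp;gt; encode_reg_reg(uint8_t opcode, unsigned src,
                                      unsigned dst) {
  // REX = 0100WRXB: W=1 for 64-bit operands, R extends src, B extends dst.
  uint8_t rex = static_cast&amp;lt;uint8_t&amp;gt;(0x48 | ((src &amp;gt;&amp;gt; 3) &amp;lt;&amp;lt; 2) | (dst &amp;gt;&amp;gt; 3));
  // ModRM = mod (2 bits), reg (3 bits), rm (3 bits); mod=11 is register-direct.
  uint8_t modrm = static_cast&amp;lt;uint8_t&amp;gt;(0xC0 | ((src &amp;amp; 7) &amp;lt;&amp;lt; 3) | (dst &amp;amp; 7));
  return {rex, opcode, modrm};
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;encode_reg_reg(0x89, 13, 6)&lt;/tt&gt; produces &lt;tt class="docutils literal"&gt;0x4C, 0x89, 0xEE&lt;/tt&gt;, the same bytes
&lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt; emits for &lt;tt class="docutils literal"&gt;mov %r13, %rsi&lt;/tt&gt; in its syscall sequences. A single
register number is split across two bytes, which is why changing a register
rarely changes the encoding in an obvious way.&lt;/p&gt;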
&lt;p&gt;This is why all compilers, VMs and related projects have their own layers to
help with this encoding task, along with related tasks like labels and jump
computations. Most are not exposed for easy usage outside their project; others,
like &lt;a class="reference external" href="http://luajit.org/dynasm.html"&gt;DynASM&lt;/a&gt; (developed as part of the LuaJIT
project) are packaged for separate usage. DynASM is an example of a low-level
framework - providing instruction encoding and not much else; some frameworks
are higher-level, doing more compiler-y things like register allocation. One
example is &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/10/17/getting-started-with-libjit-part-1"&gt;libjit&lt;/a&gt;;
another is LLVM.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="asmjit"&gt;
&lt;h2&gt;asmjit&lt;/h2&gt;
&lt;p&gt;While looking for a library to help me encode instructions, I initially tried
DynASM. It's an interesting approach - and you can see &lt;a class="reference external" href="http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-simple-jits.html"&gt;Josh Haberman's post&lt;/a&gt;
about using it for a simple BF JIT, but I found it to be a bit too
abandonware-ish for my taste. Besides, I don't like the funky preprocessor
approach with a dependency on Lua.&lt;/p&gt;
&lt;p&gt;So I found another project that seemed to fit the bill - &lt;a class="reference external" href="https://github.com/asmjit/asmjit"&gt;asmjit&lt;/a&gt; - a pure C++ library without any
preprocessing. &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; began about 3 years ago to ease its author's
development of fast kernels for graphics code. Its documentation isn't much
better than &lt;tt class="docutils literal"&gt;dynasm&lt;/tt&gt;'s, but being just a C++ library I found it easier to dive
into the source when questions arose that the docs couldn't answer. Besides, the
author is very active and quick in answering questions on GitHub and adding
missing features. Therefore, the rest of this post shows BF JITs that use
&lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; - these can also serve as a non-trivial tutorial for the library.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="simpleasmjit-jit-with-sane-instruction-encoding"&gt;
&lt;h2&gt;simpleasmjit - JIT with sane instruction encoding&lt;/h2&gt;
&lt;p&gt;Enter &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2017/bfjit/simpleasmjit.cpp"&gt;simpleasmjit.cpp&lt;/a&gt; -
the same simple JIT (no optimizations) as &lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt;, but using &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt;
for the instruction encoding, labels and so on. Just for fun, we'll mix things
up a bit. First, we'll change the JITed function signature from &lt;tt class="docutils literal"&gt;void
&lt;span class="pre"&gt;(*)(void)&lt;/span&gt;&lt;/tt&gt; to &lt;tt class="docutils literal"&gt;void &lt;span class="pre"&gt;(*)(uint64_t)&lt;/span&gt;&lt;/tt&gt;; the address of the BF memory buffer will
be passed as argument into the JITed function rather than hard-coded into it.&lt;/p&gt;
&lt;p&gt;Second, we'll use actual C functions to emit / input characters, rather than
system calls. Moreover, since &lt;tt class="docutils literal"&gt;putchar&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;getchar&lt;/tt&gt; may be macros on some
systems, taking their address can be unsafe. So we'll wrap them in actual C++
functions, whose address it &lt;em&gt;is&lt;/em&gt; safe to take in emitted code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;myputchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;putchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mygetchar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getchar&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;simpleasmjit&lt;/tt&gt; starts by initializing an &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; runtime, code holder and
assembler &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;JitRuntime&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jit_runtime&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CodeHolder&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jit_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getCodeInfo&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;X86Assembler&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next, we'll give a mnemonic name to our data pointer, and emit code that copies
the address of the memory buffer into it (it's in &lt;tt class="docutils literal"&gt;rdi&lt;/tt&gt; initially, as the first
integer argument in the System V x64 ABI used on Linux):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;// We pass the data pointer as an argument to the JITed function, so it&amp;#39;s&lt;/span&gt;
&lt;span class="c1"&gt;// expected to be in rdi. Move it to r13.&lt;/span&gt;
&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;X86Gp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;r13&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rdi&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then we get to the usual BF processing loop that emits code for every BF op:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// inc %r13&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;&amp;lt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// dec %r13&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// addb $1, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// subb $1, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Notice the difference! No more obscure hex codes - &lt;tt class="docutils literal"&gt;assm.inc(dataptr)&lt;/tt&gt; is so
much nicer than &lt;tt class="docutils literal"&gt;0x49, 0xFF, 0xC5&lt;/tt&gt;, isn't it?&lt;/p&gt;
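&lt;p&gt;To see what the assembler spares us, here is a minimal sketch of producing
those bytes by hand. The helper names (&lt;tt class="docutils literal"&gt;encode_ff_group64&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;encode_inc64&lt;/tt&gt;,
&lt;tt class="docutils literal"&gt;encode_dec64&lt;/tt&gt;) are hypothetical and not part of &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; or of the code for
this post; the layout assumes the standard x86-64 rules: a REX prefix with the W
bit set (plus the B bit for r8-r15), opcode 0xFF, and a ModRM byte whose reg
field holds the opcode extension.&lt;/p&gt;

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of asmjit): encodes a one-operand instruction
// from the 0xFF opcode group on a 64-bit register.
//   reg: register number 0-15 (rax = 0, ..., r13 = 13)
//   ext: opcode extension stored in ModRM.reg (/0 = inc, /1 = dec)
std::vector<uint8_t> encode_ff_group64(unsigned reg, unsigned ext) {
  // REX prefix has the bit layout 0100WRXB. W=1 selects a 64-bit operand;
  // B supplies the fourth bit of the register number for r8-r15.
  uint8_t rex = static_cast<uint8_t>(0x48 | ((reg >> 3) & 1));
  // ModRM: mod=11 (register direct), reg=extension, r/m=low 3 register bits.
  uint8_t modrm = static_cast<uint8_t>(0xC0 | ((ext & 7) << 3) | (reg & 7));
  return {rex, 0xFF, modrm};
}

// inc r64 is 0xFF /0, dec r64 is 0xFF /1.
std::vector<uint8_t> encode_inc64(unsigned reg) {
  return encode_ff_group64(reg, 0);
}

std::vector<uint8_t> encode_dec64(unsigned reg) {
  return encode_ff_group64(reg, 1);
}
```

&lt;p&gt;Calling &lt;tt class="docutils literal"&gt;encode_inc64(13)&lt;/tt&gt; produces &lt;tt class="docutils literal"&gt;0x49, 0xFF, 0xC5&lt;/tt&gt;, the bytes for
&lt;tt class="docutils literal"&gt;inc %r13&lt;/tt&gt;. Every instruction form needs this kind of bit-level bookkeeping,
which is exactly the work a library takes over.&lt;/p&gt;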
&lt;p&gt;For input and output we emit calls to our wrapper functions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// call myputchar [dataptr]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;movzx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rdi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;imm_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;myputchar&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// [dataptr] = call mygetchar&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Store only the low byte to memory to avoid overwriting unrelated data.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;imm_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mygetchar&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;al&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The magic is in the &lt;tt class="docutils literal"&gt;imm_ptr&lt;/tt&gt; modifier, which places the address of the
function in the emitted code.&lt;/p&gt;
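&lt;p&gt;One way to picture "placing the address in the emitted code" is the classic
absolute-call sequence: load the 64-bit address into a scratch register, then
call through that register. The sketch below encodes &lt;tt class="docutils literal"&gt;movabs $address, %rax&lt;/tt&gt;
followed by &lt;tt class="docutils literal"&gt;call *%rax&lt;/tt&gt;. It is an illustration only: the helper name
&lt;tt class="docutils literal"&gt;encode_call_absolute&lt;/tt&gt; is hypothetical, and &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; may well choose a
different encoding for a call to an immediate address.&lt;/p&gt;

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not asmjit's implementation): emits machine code that
// calls the function located at the given absolute address.
std::vector<uint8_t> encode_call_absolute(uint64_t address) {
  // movabs $imm64, %rax  ->  REX.W (0x48), opcode 0xB8, 8-byte immediate.
  std::vector<uint8_t> bytes = {0x48, 0xB8};
  // x86-64 is little-endian: the least significant byte goes first.
  for (int i = 0; i < 8; ++i) {
    bytes.push_back(static_cast<uint8_t>(address >> (8 * i)));
  }
  // call *%rax  ->  opcode 0xFF with ModRM 0xD0 (/2, register direct, rax).
  bytes.push_back(0xFF);
  bytes.push_back(0xD0);
  return bytes;
}
```

&lt;p&gt;In a JIT, the argument would be the wrapper's address converted to an
integer, for example &lt;tt class="docutils literal"&gt;reinterpret_cast&amp;lt;uint64_t&amp;gt;(&amp;amp;myputchar)&lt;/tt&gt;. The sequence
clobbers &lt;tt class="docutils literal"&gt;rax&lt;/tt&gt;, which is acceptable here because &lt;tt class="docutils literal"&gt;rax&lt;/tt&gt; is caller-saved and
holds the return value after the call anyway.&lt;/p&gt;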
&lt;p&gt;Finally, the code handling &lt;tt class="docutils literal"&gt;[&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;]&lt;/tt&gt; is also much simpler due to asmjit's
&lt;em&gt;labels&lt;/em&gt;, which can be used as jump targets before they're actually bound:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;[&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open_label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newLabel&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;close_label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newLabel&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Jump past the closing &amp;#39;]&amp;#39; if [dataptr] = 0; close_label wasn&amp;#39;t bound&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// yet (it will be bound when we handle the matching &amp;#39;]&amp;#39;), but asmjit lets&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// us emit the jump now and will handle the back-patching later.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;close_label&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// open_label is bound past the jump; all in all, we&amp;#39;re emitting:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//    cmpb $0, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//    jz close_label&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// open_label:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//    ...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;open_label&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Save both labels on the stack.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BracketLabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;open_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;close_label&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;&amp;#39;]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;DIE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;unmatched closing &amp;#39;]&amp;#39; at pc=&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;BracketLabels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;open_bracket_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//    cmpb $0, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//    jnz open_label&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// close_label:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//    ...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jnz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_label&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close_label&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We just have to remember which label we used for the jump and bind the exact
same &lt;tt class="docutils literal"&gt;Label&lt;/tt&gt; object - &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; handles the backpatching on its own!
Moreover, all the jump offset computations are performed automatically.&lt;/p&gt;
&lt;p&gt;Finally, after emitting the code we can call it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;JittedFunc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;JittedFunc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jit_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="c1"&gt;// [...]&lt;/span&gt;
&lt;span class="c1"&gt;// Call it, passing the address of memory as a parameter.&lt;/span&gt;
&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
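&lt;p&gt;The calling side can be tried out without any JIT at all. In the sketch below
an ordinary C++ function with the same signature stands in for the emitted
code; &lt;tt class="docutils literal"&gt;fake_jitted&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;run_on_fresh_memory&lt;/tt&gt; are hypothetical names used only
for this illustration. It shows the contract the emitted code has to honor: the
single &lt;tt class="docutils literal"&gt;uint64_t&lt;/tt&gt; argument is the address of the first BF memory cell.&lt;/p&gt;

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using JittedFunc = void (*)(uint64_t);

// Hypothetical stand-in for JITed code; behaves like the BF program "+++>++".
void fake_jitted(uint64_t memory_address) {
  // Recover the data pointer from the integer argument, as the emitted code
  // does when it copies rdi into its data pointer register.
  uint8_t* dataptr = reinterpret_cast<uint8_t*>(memory_address);
  *dataptr += 3;  // +++
  ++dataptr;      // >
  *dataptr += 2;  // ++
}

// Allocates zeroed BF memory, runs the function on it, returns the memory.
std::vector<uint8_t> run_on_fresh_memory(JittedFunc func, size_t size) {
  std::vector<uint8_t> memory(size, 0);
  func(reinterpret_cast<uint64_t>(memory.data()));
  return memory;
}
```

&lt;p&gt;Passing the address as an integer keeps the JITed function's signature
trivial; a &lt;tt class="docutils literal"&gt;uint8_t*&lt;/tt&gt; parameter would arrive in the same register.&lt;/p&gt;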
&lt;p&gt;That's it. This JIT emits virtually the same code as &lt;tt class="docutils literal"&gt;simplejit&lt;/tt&gt;, and
thus we don't expect it to perform any differently. The main point of this
exercise is to show how much simpler and more pleasant emitting code is with a
library like &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt;. It hides all the icky encoding and offset computations,
letting us focus on what's actually unique to our program - the sequence of
instructions emitted.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="optasmjit-combining-bf-optimizations-with-a-jit"&gt;
&lt;h2&gt;optasmjit - combining BF optimizations with a JIT&lt;/h2&gt;
&lt;p&gt;Finally, it's time to combine the clever optimizations we've developed in part 1
with the JIT. Here, I'm essentially taking &lt;tt class="docutils literal"&gt;optinterp3&lt;/tt&gt; from part
1 and bolting a JIT backend onto it. The result is &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2017/bfjit/optasmjit.cpp"&gt;optasmjit.cpp&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Recall that instead of the 8 BF ops, we have an extended set, with integer
arguments, that conveys higher-level ops in some cases:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;enum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;BfOpKind&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;INVALID_OP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;INC_PTR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;DEC_PTR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;INC_DATA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;DEC_DATA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;READ_STDIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;WRITE_STDOUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;LOOP_SET_TO_ZERO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;LOOP_MOVE_PTR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;LOOP_MOVE_DATA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;JUMP_IF_DATA_ZERO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;JUMP_IF_DATA_NOT_ZERO&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
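&lt;p&gt;As a reminder of where the integer arguments come from, here is a minimal
sketch that folds runs of a repeated BF op into a single op with a count. It
assumes the run-folding approach from part 1 and handles only the four
pointer/data ops; &lt;tt class="docutils literal"&gt;SimpleOpKind&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;SimpleOp&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;collapse_runs&lt;/tt&gt; are hypothetical
names, not the types used in &lt;tt class="docutils literal"&gt;optasmjit.cpp&lt;/tt&gt;.&lt;/p&gt;

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, reduced version of the op set, for illustration only.
enum class SimpleOpKind { INC_PTR, DEC_PTR, INC_DATA, DEC_DATA };

struct SimpleOp {
  SimpleOpKind kind;
  int argument;  // number of consecutive repetitions folded into this op
};

// Folds runs such as ">>>" into a single {INC_PTR, 3}. Characters other than
// > < + - are skipped in this sketch.
std::vector<SimpleOp> collapse_runs(const std::string& program) {
  std::vector<SimpleOp> ops;
  size_t i = 0;
  while (i < program.size()) {
    char c = program[i];
    // Find the end of the run of identical characters starting at i.
    size_t run_end = i;
    while (run_end < program.size() && program[run_end] == c) {
      ++run_end;
    }
    int count = static_cast<int>(run_end - i);
    switch (c) {
      case '>': ops.push_back({SimpleOpKind::INC_PTR, count}); break;
      case '<': ops.push_back({SimpleOpKind::DEC_PTR, count}); break;
      case '+': ops.push_back({SimpleOpKind::INC_DATA, count}); break;
      case '-': ops.push_back({SimpleOpKind::DEC_DATA, count}); break;
      default: break;  // not handled here
    }
    i = run_end;
  }
  return ops;
}
```

&lt;p&gt;For the input &lt;tt class="docutils literal"&gt;&amp;gt;&amp;gt;&amp;gt;++-&lt;/tt&gt; this yields three ops: &lt;tt class="docutils literal"&gt;INC_PTR&lt;/tt&gt; with argument 3,
&lt;tt class="docutils literal"&gt;INC_DATA&lt;/tt&gt; with argument 2 and &lt;tt class="docutils literal"&gt;DEC_DATA&lt;/tt&gt; with argument 1. A JIT backend can then
emit one instruction per op instead of one per BF character.&lt;/p&gt;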
&lt;p&gt;The translation phase from BF ops to a sequence of &lt;tt class="docutils literal"&gt;BfOpKind&lt;/tt&gt; is exactly the
same as it was in &lt;tt class="docutils literal"&gt;optinterp3&lt;/tt&gt;. Let's take a look at how a couple of the new
ops are implemented now:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;BfOpKind&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;INC_PTR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argument&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As before with the interpreters, an increment of 1 is replaced by the addition
of an argument. We use a different instruction for this - &lt;tt class="docutils literal"&gt;add&lt;/tt&gt; instead of
&lt;tt class="docutils literal"&gt;inc&lt;/tt&gt; &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;. How about something more interesting:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;BfOpKind&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;LOOP_MOVE_DATA&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Only move if the current data isn&amp;#39;t 0:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//   cmpb $0, 0(%r13)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//   jz skip_move&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//   &amp;lt;...&amp;gt; move data&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// skip_move:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;skip_move&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newLabel&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skip_move&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;r14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argument&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;r14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argument&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;r14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argument&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Use rax as a temporary holding the value at the original pointer;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// then use al to add it to the new location, so that only the target&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// location is affected: addb %al, 0(%r14)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;r14&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;al&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asmjit&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;x86&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte_ptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataptr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;assm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skip_move&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'll just note again how much simpler this code is to write with &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; than
without it. Also note the careful handling of byte-granular data when
touching memory - I ran into a number of nasty bugs when developing this. In
fact, using the native machine word size (64 bits in this case) for BF memory
cells would've made everything much simpler, but 8-bit cells are closer to the
common semantics of the language and provide an extra challenge.&lt;/p&gt;
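&lt;p&gt;To make the byte-level behavior concrete, here is a minimal sketch of what the
code emitted for &lt;tt class="docutils literal"&gt;LOOP_MOVE_DATA&lt;/tt&gt; is meant to compute, written as plain C++
rather than as emitted assembly. The function &lt;tt class="docutils literal"&gt;loop_move_data&lt;/tt&gt; and its parameter
names are made up for illustration and are not part of the JIT's source; it
assumes the caller keeps both indices within the memory buffer.&lt;/p&gt;

```cpp
// Hypothetical reference model (not taken from the JIT source): the effect of
// the code emitted for LOOP_MOVE_DATA, assuming 8-bit memory cells.
// `memory`, `dataptr` and `offset` are illustrative names; `offset` plays the
// role of op.argument and may be negative.
void loop_move_data(unsigned char* memory, long dataptr, long offset) {
  // Mirrors the cmp + jz skip_move pair: do nothing if the cell is zero.
  if (memory[dataptr] == 0) {
    return;
  }
  // Only a single byte at the target is updated; the sum wraps around
  // modulo 256, just like an 8-bit add would.
  memory[dataptr + offset] += memory[dataptr];
  // The source cell is cleared once its value has been moved.
  memory[dataptr] = 0;
}
```

&lt;p&gt;For example, if the current cell holds 200 and the target cell holds 100, the
target ends up holding 44 (300 wraps around modulo 256) and the current cell is
reset to 0, while no neighboring cells are touched.&lt;/p&gt;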
&lt;/div&gt;
&lt;div class="section" id="performance"&gt;
&lt;h2&gt;Performance&lt;/h2&gt;
&lt;p&gt;Let's see how &lt;tt class="docutils literal"&gt;optasmjit&lt;/tt&gt; fares against the fastest interpreter and the
unoptimized JIT: it takes 0.93 seconds for &lt;tt class="docutils literal"&gt;mandelbrot&lt;/tt&gt; and 0.3 seconds for &lt;tt class="docutils literal"&gt;factor&lt;/tt&gt; -
another factor of 3 in performance:&lt;/p&gt;
&lt;img alt="BF opt3 vs simplejit vs optasmjit" class="align-center" src="https://eli.thegreenplace.net/images/2017/bf-runtime-vs-optasmjit.png" /&gt;
&lt;p&gt;Notably, the performance delta with the optimized interpreter is huge: the JIT
is more than 4x faster. If we compare it all the way to the initial simple
interpreter, &lt;tt class="docutils literal"&gt;optasmjit&lt;/tt&gt; is about 40x faster - making it hard to even
compare on the same chart :-)&lt;/p&gt;
&lt;img alt="BF full performance comparison for part 2" class="align-center" src="https://eli.thegreenplace.net/images/2017/bf-runtime-full-part2.png" /&gt;
&lt;/div&gt;
&lt;div class="section" id="jits-are-fun"&gt;
&lt;h2&gt;JITs are fun!&lt;/h2&gt;
&lt;p&gt;I find writing JITs lots of fun. It's really nice to be able to hand-craft every
instruction emitted by the compiler. While this is quite painful to do without
any encoding help, libraries like &lt;tt class="docutils literal"&gt;asmjit&lt;/tt&gt; make the process much more
pleasant.&lt;/p&gt;
&lt;p&gt;We've done quite a bit in this part of the series. &lt;tt class="docutils literal"&gt;optasmjit&lt;/tt&gt; is a genuine
optimizing JIT for BF! It:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Parses BF source&lt;/li&gt;
&lt;li&gt;Translates it to a sequence of higher-level ops&lt;/li&gt;
&lt;li&gt;Optimizes these ops&lt;/li&gt;
&lt;li&gt;Compiles the ops to tight x64 assembly in memory and runs them&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let's connect these steps to some real compiler jargon. &lt;tt class="docutils literal"&gt;BfOpKind&lt;/tt&gt; ops can be
seen as the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Intermediate_representation"&gt;compiler IR&lt;/a&gt;. Translation of
human-readable source code to IR is often the first step in compilation (though
this step is itself sometimes divided into multiple steps for realistic languages).
The translation/compilation of ops to assembly is often called &amp;quot;lowering&amp;quot;; in
some compilers this involves multiple steps and intermediate IRs.&lt;/p&gt;
&lt;p&gt;I left a lot of code out of the blog post - otherwise it would be huge! I
encourage you to go back through the full source files discussed here and
understand what's going on - every JIT is a single standalone C++ file.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Links to all posts in this series:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-1-an-interpreter/"&gt;Part 1 - an interpreter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-2-an-x64-jit/"&gt;Part 2 - an x64 JIT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-3-llvm/"&gt;Part 3 - LLVM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-4-in-python/"&gt;Part 4 - Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;I said &lt;em&gt;traditionally&lt;/em&gt; because many modern compilers no longer work this
way. For example, LLVM compiles IR to another, much lower-level IR that
represents machine-code level instructions; assembly can be emitted from
this IR, but also machine code directly - so the assembler is integrated
into the compiler.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Some compilers would do two passes; this is similar to our first
interpreter optimization in part 1: the first pass collects information
(such as the locations of all matching &lt;tt class="docutils literal"&gt;]&lt;/tt&gt;s), so the second pass already
knows what offsets to emit.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Please refer to asmjit's documentation for the full scoop. I'll also
mention that asmjit has a &amp;quot;compiler&amp;quot; layer which does more sophisticated
things like register allocation; in this post I'm only using the base
assembly layer.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Wondering whether we could have
just used &lt;tt class="docutils literal"&gt;add 1&lt;/tt&gt; instead of &lt;tt class="docutils literal"&gt;inc&lt;/tt&gt; in the first place? Certainly! In
fact, while there probably used to be a good reason for a separate
&lt;tt class="docutils literal"&gt;inc&lt;/tt&gt; instruction, in these days of complex multi-port pipelined x64
CPUs, it's not clear which one is faster. I just wanted to show both for
diversity.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Compilation"></category><category term="Code generation"></category><category term="Assembly"></category></entry></feed>