Eli Bendersky's website - Go

Implementing Raft: Part 4 - Key/Value Database

2024-10-10T19:50:00-07:00

This is Part 4 in a series of posts describing the Raft distributed consensus algorithm and its complete implementation in Go. Here is a list of posts in the series:

In this part, we're going to use our Raft module to implement a simple but realistic application - a replicated key / value database with strong consistency semantics. All the code for this part is located in this directory.

Key / value database as a state machine

First of all, what's a key / value database (KV DB)? Think of it as a Go map, or as an extremely simple version of NoSQL databases like Redis or CouchDB. The basic operations our KV DB supports are:

PUT(k,v): assign value v to key k
GET(k): retrieve the value associated with key k
CAS(k, cmp, v): atomic compare-and-swap. First, it reads curV - the current value associated with key k. If curV==cmp, assigns value v to k instead; otherwise, it's a no-op. In any case, curV is returned.

For example, suppose the commands in some Raft log are (in order from left to right):

PUT(x,2)  PUT(y,3)  PUT(x,4)  PUT(z,5)  CAS(x,4,8)  CAS(z,4,9)

Applied to an empty DB, this log will result in these keys / values:

x=8
y=3
z=5
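
As a quick sanity check, the example above can be replayed in a few lines of Go. This is only a minimal sketch: the op type and the replay and exampleLog functions are hypothetical names made up for illustration (they are not part of the series' code), and CAS on a missing key is assumed to be a no-op.

```go
package main

import "fmt"

// op is a hypothetical, simplified log command used only in this sketch.
type op struct {
	kind     string // "PUT" or "CAS"
	key      string
	cmp, val string // cmp is only used by CAS
}

// replay applies the log, in order, to an empty map and returns the result.
func replay(log []op) map[string]string {
	db := make(map[string]string)
	for _, o := range log {
		switch o.kind {
		case "PUT":
			db[o.key] = o.val
		case "CAS":
			// Swap only if the key exists and holds the expected value.
			if cur, ok := db[o.key]; ok && cur == o.cmp {
				db[o.key] = o.val
			}
		}
	}
	return db
}

// exampleLog returns the sample log from the text, left to right.
func exampleLog() []op {
	return []op{
		{kind: "PUT", key: "x", val: "2"},
		{kind: "PUT", key: "y", val: "3"},
		{kind: "PUT", key: "x", val: "4"},
		{kind: "PUT", key: "z", val: "5"},
		{kind: "CAS", key: "x", cmp: "4", val: "8"}, // succeeds: x is 4
		{kind: "CAS", key: "z", cmp: "4", val: "9"}, // no-op: z is 5
	}
}

func main() {
	db := replay(exampleLog())
	fmt.Println(db["x"], db["y"], db["z"]) // 8 3 5
}
```

Because replay is deterministic, every replica that applies the same log in the same order ends up with the same map; this is the property the replicated state machine relies on.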

System diagram

In this part we're going to build a complete KV DB system - including the service and a client library:

The diagram presents a cluster with 3 replicas [1]. Each replica is a KV DB service.
A KV service contains a Raft Consensus Module (the diagram doesn't show the log, assuming it's just part of the CM), and a data store module that implements the actual database.
The Raft CM of each replica is connected to the others via RPCs - these are the Raft protocol RPCs discussed extensively in previous parts.
The KV service presents a REST API to the external world; clients can send HTTP commands to the service and get results.
"KV Client" is a client library with a convenient API that encapsulates the HTTP interactions with KV services. This is also part of our demo, and we'll discuss it later in the post.

KV service architecture

The KV service consists of several key components:

An instance of a Raft server; as described back in Part 1, a Raft Server wraps a consensus module with some RPC scaffolding. In this part we reuse our final Raft server code from Part 3, without any modifications.
An underlying "data store". For our demonstration, a simple mutex-protected Go map will do; this is implemented in kvservice/datastore.go. This data store implements the Get, Put and CAS commands described earlier. All keys and values are Go strings (naturally, anything can be encoded in a string value).
An HTTP server for the REST API of the service exposed to the external world.
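
To make the "mutex-protected Go map" idea concrete, here's a minimal sketch of what such a data store could look like. It is not the actual contents of kvservice/datastore.go; the method signatures are inferred from how the service calls the store later in this post, and treating CAS on a missing key as a no-op is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// DataStore is a sketch of a string map guarded by a mutex.
type DataStore struct {
	mu   sync.Mutex
	data map[string]string
}

func NewDataStore() *DataStore {
	return &DataStore{data: make(map[string]string)}
}

// Get returns the value stored for key and whether the key was found.
func (ds *DataStore) Get(key string) (string, bool) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	v, ok := ds.data[key]
	return v, ok
}

// Put assigns value to key. It returns the previous value and whether the
// key was present before the assignment.
func (ds *DataStore) Put(key, value string) (string, bool) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	prev, ok := ds.data[key]
	ds.data[key] = value
	return prev, ok
}

// CAS assigns value to key only if the key exists and currently holds
// compare. It returns the previous value and whether the key was present.
func (ds *DataStore) CAS(key, compare, value string) (string, bool) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	prev, ok := ds.data[key]
	if ok && prev == compare {
		ds.data[key] = value
	}
	return prev, ok
}

func main() {
	ds := NewDataStore()
	ds.Put("x", "4")
	prev, found := ds.CAS("x", "4", "8")
	fmt.Println(prev, found) // 4 true
	v, _ := ds.Get("x")
	fmt.Println(v) // 8
}
```

Holding the lock for the whole body of CAS is what makes the compare and the swap a single atomic step with respect to other goroutines.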

Commands

If you recall from Part 2, we submit new commands to the Raft cluster with the ConsensusModule.Submit method. A Command is an arbitrary any value; whenever the Raft cluster reaches consensus on a log entry, it sends a "commit entry" with this command on the commit channel. Commands are application-specific, and since we're working on a concrete application now, it's time to define our command for the KV service:

// Command is the concrete command type KVService submits to the Raft log to
// manage its state machine. It's also used to carry the results of the command
// after it's applied to the state machine. These are the supported commands:
//
// CommandGet: queries a key's value
//
// * Key is the key to get, Value is ignored
// * CompareValue is ignored
// * ResultFound is true iff Key was found in the store
// * ResultValue is the value, if Key was found in the store
//
// CommandPut: assigns value to the key
//
// * Key,Value are the pair to assign (store[key]=value)
// * CompareValue is ignored
// * ResultFound is true iff Key was previously found in the store
// * ResultValue is the old value of Key, if it was previously found
//
// CommandCAS: atomic compare-and-swap, performs:
//
//    if Store[Key] == CompareValue {
//      Store[Key] = Value
//    } else {
//      nop
//    }
//
// * Key is the key this command acts on
// * CompareValue is the previous value the command compares to
// * Value is the new value the command assigns
// * ResultFound is true iff Key was previously found in the store
// * ResultValue is the old value of Key, if it was previously found
type Command struct {
  Kind CommandKind

  Key, Value string

  CompareValue string

  ResultValue string
  ResultFound bool

  // Id is the Raft ID of the server submitting this command.
  Id int
}

type CommandKind int

const (
  CommandInvalid CommandKind = iota
  CommandGet
  CommandPut
  CommandCAS
)

For simplicity, I chose to include fields for several commands in the same struct instead of using an algebraic data type here.
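
For comparison, here's a rough sketch of what the alternative could look like. Go has no built-in algebraic data types, so this emulates one with a "sealed" interface and one struct per command. The names Get, Put, CAS, isCommand and describe are made up for this illustration and don't appear in the series' code.

```go
package main

import "fmt"

// Command is a sealed interface: isCommand is unexported, so only types
// declared in this package can implement it.
type Command interface{ isCommand() }

// Each variant carries only the fields it needs.
type Get struct{ Key string }
type Put struct{ Key, Value string }
type CAS struct{ Key, CompareValue, Value string }

func (Get) isCommand() {}
func (Put) isCommand() {}
func (CAS) isCommand() {}

// describe dispatches on the concrete variant with a type switch.
func describe(c Command) string {
	switch c := c.(type) {
	case Get:
		return fmt.Sprintf("GET(%s)", c.Key)
	case Put:
		return fmt.Sprintf("PUT(%s,%s)", c.Key, c.Value)
	case CAS:
		return fmt.Sprintf("CAS(%s,%s,%s)", c.Key, c.CompareValue, c.Value)
	default:
		return "invalid"
	}
}

func main() {
	fmt.Println(describe(Put{Key: "x", Value: "4"}))                   // PUT(x,4)
	fmt.Println(describe(CAS{Key: "x", CompareValue: "4", Value: "8"})) // CAS(x,4,8)
}
```

The trade-off is more types and more boilerplate: result fields would need a home of their own, and each concrete type would have to be registered with gob separately since commands travel as interface values. A single struct with a Kind field avoids all of that.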

One important thing to note is that the service's Raft cluster ID is part of the command; it will soon become clear why this is needed.

Life of a PUT request to the service

Before we dive deep into the code, let's examine the journey a successful PUT request makes through the system:

1. A client sends a PUT("k", "v") request to a service, via HTTP. Let's assume it reaches the service which is currently the Raft cluster leader (we'll discuss what happens if it reaches a follower later on).
2. The service's HTTP handler receives the request, constructs a Command of kind CommandPut representing it and submits it to its Raft CM.
   2.1. At this point, the HTTP handler waits; it can't reply to the client until it knows that the command was properly replicated to the Raft cluster and committed by the CM.
   2.2. Once the command it submitted appears on the commit channel, the HTTP handler can return a success status to the client.
3. Meanwhile, a process in the service watches its commit channel for new commands that reached consensus by the cluster, and updates the underlying data store.
4. At the same time, the other services in the cluster - the followers - are also watching their commit channels and update their own replicas of the data store with the new PUT command.

Note that steps 2.2 and 3 happen concurrently. One process (in the sense of CSP) handles a client request, while another process takes care to execute commands arriving on the commit channel. In fact, there's more concurrency here than meets the eye. Our service can handle multiple concurrent requests, each with its own command - and it should all just work. This kind of concurrency is natural in Go - and now it's time to see how it works.

KV service code walk-through

All the code described in this section is located in kvservice/kvservice.go. Here's the struct defining the service:

type KVService struct {
  sync.Mutex

  // id is the service ID in a Raft cluster.
  id int

  // rs is the Raft server that contains a CM
  rs *raft.Server

  // commitChan is the commit channel passed to the Raft server; when commands
  // are committed, they're sent on this channel.
  commitChan chan raft.CommitEntry

  // commitSubs are the commit subscriptions currently active in this service.
  // See the createCommitSubsciption method for more details.
  commitSubs map[int]chan raft.CommitEntry

  // ds is the underlying data store implementing the KV DB.
  ds *DataStore

  // srv is the HTTP server exposed by the service to the external world.
  srv *http.Server
}

Don't worry about understanding exactly what each field means right now; note the correlation to the descriptions in "KV service architecture", though. A service holds a Raft server, a datastore, and an HTTP server. Other entities, like the commit channel, should be familiar by now.

A new service is created with this constructor:

// New creates a new KVService
//
//   - id: this service's ID within its Raft cluster
//   - peerIds: the IDs of the other Raft peers in the cluster
//   - storage: a raft.Storage implementation the service can use for
//     durable storage to persist its state.
//   - readyChan: notification channel that has to be closed when the Raft
//     cluster is ready (all peers are up and connected to each other).
func New(id int, peerIds []int, storage raft.Storage, readyChan <-chan any) *KVService {
  gob.Register(Command{})
  commitChan := make(chan raft.CommitEntry)

  // raft.Server handles the Raft RPCs in the cluster; after Serve is called,
  // it's ready to accept RPC connections from peers.
  rs := raft.NewServer(id, peerIds, storage, readyChan, commitChan)
  rs.Serve()
  kvs := &KVService{
    id:         id,
    rs:         rs,
    commitChan: commitChan,
    ds:         NewDataStore(),
    commitSubs: make(map[int]chan raft.CommitEntry),
  }

  kvs.runUpdater()
  return kvs
}

We'll get back to what runUpdater is a little later; for now, let's look at how the HTTP server is launched:

// ServeHTTP starts serving the KV REST API on the given TCP port. This
// function does not block; it fires up the HTTP server and returns. To properly
// shut down the server, call the Shutdown method.
func (kvs *KVService) ServeHTTP(port int) {
  if kvs.srv != nil {
    panic("ServeHTTP called with existing server")
  }
  mux := http.NewServeMux()
  mux.HandleFunc("POST /get/", kvs.handleGet)
  mux.HandleFunc("POST /put/", kvs.handlePut)
  mux.HandleFunc("POST /cas/", kvs.handleCAS)

  kvs.srv = &http.Server{
    Addr:    fmt.Sprintf(":%d", port),
    Handler: mux,
  }

  go func() {
    kvs.kvlog("serving HTTP on %s", kvs.srv.Addr)
    if err := kvs.srv.ListenAndServe(); err != http.ErrServerClosed {
      log.Fatal(err)
    }
    kvs.srv = nil
  }()
}

This should be familiar if you've written Go HTTP servers before. Listening is done in a goroutine to enable clean shutdown of the HTTP server specifically and the whole service in general; check out the Shutdown method for more details.

In the previous section, I mentioned that multiple HTTP requests can be handled concurrently; this is just the nature of the standard Go HTTP server. Here we see the handleXXX handlers registered with the server; each handler is invoked in a separate goroutine, and our code has to account for this. To understand what this means in practice, let's look at the updater goroutine.

// runUpdater runs the "updater" goroutine that reads the commit channel
// from Raft and updates the data store; this is the Replicated State Machine
// part of distributed consensus!
// It also notifies subscribers (registered with createCommitSubsciption).
func (kvs *KVService) runUpdater() {
  go func() {
    for entry := range kvs.commitChan {
      cmd := entry.Command.(Command)

      switch cmd.Kind {
      case CommandGet:
        cmd.ResultValue, cmd.ResultFound = kvs.ds.Get(cmd.Key)
      case CommandPut:
        cmd.ResultValue, cmd.ResultFound = kvs.ds.Put(cmd.Key, cmd.Value)
      case CommandCAS:
        cmd.ResultValue, cmd.ResultFound = kvs.ds.CAS(cmd.Key, cmd.CompareValue, cmd.Value)
      default:
        panic(fmt.Errorf("unexpected command %v", cmd))
      }

      // We're modifying the command to include results from the datastore,
      // so clone an entry with the update command for the subscribers.
      newEntry := raft.CommitEntry{
        Command: cmd,
        Index:   entry.Index,
        Term:    entry.Term,
      }

      // Forward this entry to the subscriber interested in its index, and
      // close the subscription - it's single-use.
      if sub := kvs.popCommitSubscription(entry.Index); sub != nil {
        sub <- newEntry
        close(sub)
      }
    }
  }()
}

The updater goroutine is responsible for implementing step (3) described in the "Life of..." section. It watches the commit channel for new committed commands, applies these commands to the datastore and then notifies "subscribers" about it. The first two tasks are what we'd expect from an implementation of a Raft-based replicated state machine; the last task needs some elaboration.

Recall step 2.1 from the "Life of..." section; once an HTTP handler submits a command to the Raft cluster, it has to wait and see if this command was properly committed. The way we implement it is:

The handler submits a command to the Raft CM, and keeps note of the log index the command is placed in.
The handler then registers a "subscription" with the updater, telling it: "hey, if you see a command committed at this index, let me know". The subscription is implemented with a channel.
The handler can then wait on the channel.

Here's the code of handlePut, demonstrating this in action:

func (kvs *KVService) handlePut(w http.ResponseWriter, req *http.Request) {
  pr := &api.PutRequest{}
  if err := readRequestJSON(req, pr); err != nil {
    http.Error(w, err.Error(), http.StatusBadRequest)
    return
  }
  kvs.kvlog("HTTP PUT %v", pr)

  // Submit a command into the Raft server; this is the state change in the
  // replicated state machine built on top of the Raft log.
  cmd := Command{
    Kind:  CommandPut,
    Key:   pr.Key,
    Value: pr.Value,
    Id:    kvs.id,
  }
  logIndex := kvs.rs.Submit(cmd)
  // If we're not the Raft leader, send an appropriate status
  if logIndex < 0 {
    renderJSON(w, api.PutResponse{RespStatus: api.StatusNotLeader})
    return
  }

  // Subscribe for a commit update for our log index. Then wait for it to
  // be delivered.
  sub := kvs.createCommitSubsciption(logIndex)

  // Wait on the sub channel: the updater will deliver a value when the Raft
  // log has a commit at logIndex. To ensure clean shutdown of the service,
  // also select on the request context - if the request is canceled, this
  // handler aborts without sending data back to the client.
  select {
  case entry := <-sub:
    // If this is our command, all is good! If it's some other server's command,
    // this means we lost leadership at some point and should return an error
    // to the client.
    entryCmd := entry.Command.(Command)
    if entryCmd.Id == kvs.id {
      renderJSON(w, api.PutResponse{
        RespStatus: api.StatusOK,
        KeyFound:   entryCmd.ResultFound,
        PrevValue:  entryCmd.ResultValue,
      })
    } else {
      renderJSON(w, api.PutResponse{RespStatus: api.StatusFailedCommit})
    }
  case <-req.Context().Done():
    return
  }
}

The code is well-commented, but I want to specifically call out a few important points:

When kvs.rs.Submit is called with the command, it returns -1 if the current Raft CM is not the leader. In this case, we return a special status to the client - "I'm not the leader" - and abort the handler. We'll see what the client does about this further down in the post.

For a leader, Submit returns the log index at which the command was submitted. This is the index used to subscribe to notifications from the commit channel.
The handler waits on a receive on this channel. This can be canceled if the HTTP request is canceled by the client (e.g. timeout); otherwise, we just wait. In practice, with the optimizations in Part 3, it takes just a handful of milliseconds to fully commit new commands in a functioning Raft cluster. In case of problems (disconnections, crashes etc.) this may take longer, but our application prioritizes consistency over availability (see Part 0 on fault tolerance in Raft and the CAP theorem).
When notified that a commit was made for this log index, there's still an important safety check to make! Is it actually our command that was committed there? This is what the Id field on the command is for.

Consider the following case: peer A is the leader, and a client submits a command. A places it in log index 42, but gets disconnected before it manages to tell followers about it. After a while, C becomes the new leader; C is unaware that A placed something in its log at index 42. Therefore, when C receives a new command from another client, it commits it at index 42 (since this is still the "next index for entries" for all connected cluster members). At some point later, A gets reconnected to the cluster, becomes a follower (since its term is out of date), and sees the commit from C at index 42. At this point it realizes that it failed to commit its own command (because the ID doesn't match), and replies with a "failed commit" status to the client.

I'll leave figuring out the mechanics of channel subscriptions to you as an exercise. Just read the createCommitSubscription and popCommitSubscription methods - they're fairly straightforward.
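
If you'd like something to compare your reading against, here's one way such single-use subscriptions could be implemented. This is a self-contained sketch under assumptions, not the code from kvservice.go: CommitEntry is a local stand-in for raft.CommitEntry, and the subscriptions type with its create and pop methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// CommitEntry stands in for raft.CommitEntry in this sketch.
type CommitEntry struct {
	Command any
	Index   int
	Term    int
}

// subscriptions maps a log index to the channel of the handler waiting on it.
type subscriptions struct {
	mu   sync.Mutex
	subs map[int]chan CommitEntry
}

func newSubscriptions() *subscriptions {
	return &subscriptions{subs: make(map[int]chan CommitEntry)}
}

// create registers interest in logIndex and returns the channel to wait on.
// The channel has a buffer of 1, so the updater's single send never blocks,
// even if the waiting handler has already given up.
func (s *subscriptions) create(logIndex int) chan CommitEntry {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.subs[logIndex]; exists {
		panic(fmt.Sprintf("duplicate subscription for index %d", logIndex))
	}
	ch := make(chan CommitEntry, 1)
	s.subs[logIndex] = ch
	return ch
}

// pop removes and returns the subscription for logIndex, or nil if nobody
// subscribed to that index.
func (s *subscriptions) pop(logIndex int) chan CommitEntry {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := s.subs[logIndex]
	delete(s.subs, logIndex)
	return ch
}

func main() {
	s := newSubscriptions()
	sub := s.create(42) // handler side

	// Updater side: deliver the committed entry and close the channel.
	if ch := s.pop(42); ch != nil {
		ch <- CommitEntry{Command: "PUT(k,v)", Index: 42, Term: 3}
		close(ch)
	}

	entry := <-sub
	fmt.Println(entry.Index, entry.Command) // 42 PUT(k,v)
	fmt.Println(s.pop(42) == nil)           // true
}
```

The mutex matters because handlers run in their own goroutines and register subscriptions while the updater goroutine is popping them.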

Consistency guarantees

I wrote in detail about linearizable semantics recently. Our KV service is linearizable based on that definition, due to the nature of Raft consensus. An operation only becomes visible to clients after it's committed; and it's committed by cluster consensus, at a "moment in time" relative to other operations in the Raft log.

Moreover, it's also serializable for transactions like CAS: these are performed by a single service (the leader) atomically, so clients can never observe the results of sub-operations in isolation.

By being both linearizable and serializable, our service is strict serializable, which is the strongest consistency guarantee for distributed systems.

As discussed before, this strong consistency comes at the expense of availability in the face of network partitions (as it must, due to the CAP theorem limits). It's a "CP" system; the following diagram is from Wikipedia:

What are such services good for? Though it can serve as a NoSQL database, it won't be very performant - every operation has to reach consensus among multiple peers before being considered "done". Instead, such strict serializable services are used as the very bottom layer of large distributed systems. For example, it can be used to coordinate distributed locks, elect leaders (these are fairly easy to build on top of our CAS primitive) or store some critical low-volume configuration data for a complex system.
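
To illustrate the claim about locks, here's a minimal sketch of a lock built on CAS. It runs against a plain in-memory map (casStore) instead of the replicated service, and the names casStore, unlocked, tryLock and unlock are hypothetical. It also assumes the lock key was created beforehand with the "unlocked" value, since CAS on a missing key is taken to be a no-op.

```go
package main

import "fmt"

// casStore is an in-memory stand-in for the replicated KV service.
type casStore map[string]string

// CAS swaps in val only if key exists and currently holds cmp. It returns
// the previous value and whether the key was found.
func (s casStore) CAS(key, cmp, val string) (string, bool) {
	prev, ok := s[key]
	if ok && prev == cmp {
		s[key] = val
	}
	return prev, ok
}

// unlocked is a hypothetical sentinel meaning "nobody holds the lock".
const unlocked = ""

// tryLock attempts to take the lock stored under key on behalf of owner.
// owner must be a non-empty string.
func tryLock(s casStore, key, owner string) bool {
	prev, found := s.CAS(key, unlocked, owner)
	return found && prev == unlocked
}

// unlock releases the lock only if owner is the current holder.
func unlock(s casStore, key, owner string) bool {
	prev, found := s.CAS(key, owner, unlocked)
	return found && prev == owner
}

func main() {
	s := casStore{"lock": unlocked} // created up front, e.g. with a PUT

	fmt.Println(tryLock(s, "lock", "client-1")) // true
	fmt.Println(tryLock(s, "lock", "client-2")) // false
	fmt.Println(unlock(s, "lock", "client-2"))  // false
	fmt.Println(unlock(s, "lock", "client-1"))  // true
	fmt.Println(tryLock(s, "lock", "client-2")) // true
}
```

A production-grade lock would need more than this (for instance, some way to recover when the holder crashes without unlocking), but the core decision of "who got the lock" reduces to a single CAS.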

Plumbing read-only operations through the Raft log

You'll note that all the commands our KV service supports - PUT, GET and CAS - are implemented fairly consistently and follow the sequence described in the "Life of..." section. This raises an important question: is this really necessary for the read-only GET operations? After all, they don't really change the state machine, so why add them as Raft log commands?

While it's true that a stray GET command won't harm the integrity of the internal data store, it may result in stale reads or other events inconsistent with the linearizable semantics of our service.

To see why, let's work by contradiction; assume we don't plumb GET through the Raft log, but instead let leaders immediately reply to GET requests based on their local datastore. Here's what can happen:

The KV DB has the key-value pair k=v.
A used to be a leader, but got disconnected from its peers; after a suitable election timeout, C was elected as the new leader. A still thinks it's the leader, however.
At some point, a client contacts C and submits PUT(k,v2). C successfully replicates this command to the remaining connected peers.
A bit later, another client sends GET(k) to C and gets the correct response v2.
Then, a different client sends GET(k) to A (perhaps the client remembered that the previous time it contacted the service, A was the leader [2]). Since A still thinks it's the leader, it will happily reply with the value v to the client's request.

This sequence of events breaks the linearizability guarantees of our service! The read GET(k) --> v is stale, since another client already read the value as v2. There is no single-threaded history in which this sequence of events is possible.

This problem is explicitly called out in Section 8 of the Raft paper. The canonical solution is what our service is doing: plumb all commands - even the read-only ones - through the Raft log [3]. A service won't respond to a client's request unless it was able to successfully commit this command to the Raft log.

Since we plumb GET commands through the Raft log, in our example the problem in the last step couldn't happen, because A would not respond to its client while disconnected from the cluster. Instead, it would have to wait to be reconnected, and at that point would discover that it's no longer the leader. The client would then ask the real leader and get the right response. However, even if due to additional disconnections or crashes A resumed leadership, it would have to process the PUT(k,v2) before processing the client's GET(k), since the state machine is updated in log order.

KV client

Now it's time to discuss the final piece of our system - the KV client library. Since the KV service API is just REST, we don't necessarily need a client library - we could just use curl calls or any other way to generate HTTP requests to interact with it. However, a convenient, idiomatic client library goes a long way in improving the quality of life of users - and it will be particularly useful in this case because it encodes some essential logic - finding and keeping track of the cluster leader.

So far, everything in our system has been replicated by N, which is the Raft cluster size (typically 3 or 5). The client is a single entity - just user code that wants to use the KV service. All the client code is in kvclient/kvclient.go; let's walk through how a single request works, starting with the type and constructor:

type KVClient struct {
  addrs []string

  // assumedLeader is the index (in addrs) of the service we assume is the
  // current leader. It is zero-initialized by default, without loss of
  // generality.
  assumedLeader int

  clientID int32
}

// New creates a new KVClient. serviceAddrs is the addresses (each a string
// with the format "host:port") of the services in the KVService cluster the
// client will contact.
func New(serviceAddrs []string) *KVClient {
  return &KVClient{
    addrs:         serviceAddrs,
    assumedLeader: 0,
    clientID:      clientCount.Add(1),
  }
}

// clientCount is used internally for debugging
var clientCount atomic.Int32

To create a client, we have to provide it with a list of addresses for the KV services that constitute a cluster; before the client sends its first request, the services should be launched and listening on these addresses.

All client requests follow the same steps; let's use Put as an example:

// Put the key=value pair into the store. Returns an error, or
// (prevValue, keyFound, nil), where keyFound specifies whether the key was
// found in the store prior to this command, and prevValue is its previous
// value if it was found.
func (c *KVClient) Put(ctx context.Context, key string, value string) (string, bool, error) {
  putReq := api.PutRequest{
    Key:   key,
    Value: value,
  }
  var putResp api.PutResponse
  err := c.send(ctx, "put", putReq, &putResp)
  return putResp.PrevValue, putResp.KeyFound, err
}

Types like PutRequest and PutResponse are defined in api/api.go (you may have noticed them in the service code as well); they're trivial, so I won't spend more time on them.

All the client logic is encapsulated in the send method:

func (c *KVClient) send(ctx context.Context, route string, req any, resp api.Response) error {
  // This loop rotates through the list of service addresses until we get
  // a response that indicates we've found the leader of the cluster. It
  // starts at c.assumedLeader
FindLeader:
  for {
    // There's a two-level context tree here: we have the user context - ctx,
    // and we create our own context to impose a timeout on each request to
    // the service. If our timeout expires, we move on to try the next service.
    // In the meantime, we have to keep an eye on the user context - if that's
    // canceled at any time (due to timeout, explicit cancellation, etc), we
    // bail out.
    retryCtx, retryCtxCancel := context.WithTimeout(ctx, 50*time.Millisecond)
    path := fmt.Sprintf("http://%s/%s/", c.addrs[c.assumedLeader], route)

    c.clientlog("sending %#v to %v", req, path)
    if err := sendJSONRequest(retryCtx, path, req, resp); err != nil {
      // Since the contexts are nested, the order of testing here matters.
      // We have to check the parent context first - if it's done, it means
      // we have to return.
      if contextDone(ctx) {
        c.clientlog("parent context done; bailing out")
        retryCtxCancel()
        return err
      } else if contextDeadlineExceeded(retryCtx) {
        // If the parent context is not done, but our retry context is done,
        // it's time to retry a different service.
        c.clientlog("timed out: will try next address")
        c.assumedLeader = (c.assumedLeader + 1) % len(c.addrs)
        retryCtxCancel()
        continue FindLeader
      }
      retryCtxCancel()
      return err
    }
    c.clientlog("received response %#v", resp)

    // No context/timeout on this request - we've actually received a response.
    switch resp.Status() {
    case api.StatusNotLeader:
      c.clientlog("not leader: will try next address")
      c.assumedLeader = (c.assumedLeader + 1) % len(c.addrs)
      retryCtxCancel()
      continue FindLeader
    case api.StatusOK:
      retryCtxCancel()
      return nil
    case api.StatusFailedCommit:
      retryCtxCancel()
      return fmt.Errorf("commit failed; please retry")
    default:
      panic("unreachable")
    }
  }
}

There's some context subtlety going on here - hopefully the comments make that clear enough.

The client keeps track of the last service it saw that accepted a command as a leader. When asked to send a new command to the service, this is the service it starts from. If its request to the assumed leader times out, or that service says it's no longer the leader, the client retries with the next service in the cluster.

During normal operation, the leader will typically be stable, each client will quickly discover who it is and from that point on will address the leader directly. When there's a cluster disruption, the client will spend a bit of time looking for the leader - but this can be optimized if needed [4].

If a client can't find a leader, it will just keep trying; since we use the Go context idiom, this can always be controlled by the user - by imposing a timeout on client operations, or canceling them for other reasons.
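
The essence of this search loop, stripped of HTTP and per-request timeouts, can be sketched as follows. The names status, findLeader and demo are hypothetical, and the try callback stands in for a real request to a service; in real code a caller would more likely use context.WithTimeout than cancel the context up front.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// status is a simplified stand-in for the response status of a service.
type status int

const (
	statusOK status = iota
	statusNotLeader
)

// findLeader rotates through n services, starting at index start, until try
// reports statusOK or ctx is done. It returns the index it stopped at.
func findLeader(ctx context.Context, n, start int, try func(i int) status) (int, error) {
	i := start
	for {
		// The caller's context is the only thing that bounds the search.
		if err := ctx.Err(); err != nil {
			return i, err
		}
		if try(i) == statusOK {
			return i, nil
		}
		i = (i + 1) % n
	}
}

// demo runs two scenarios: a cluster of 3 whose leader is service 2, and a
// leaderless cluster searched with an already-canceled context.
func demo() (leader int, canceled bool) {
	leader, _ = findLeader(context.Background(), 3, 0, func(i int) status {
		if i == 2 {
			return statusOK
		}
		return statusNotLeader
	})

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := findLeader(ctx, 3, 0, func(int) status { return statusNotLeader })
	return leader, errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println(demo()) // 2 true
}
```

Without the context check the second scenario would spin forever, which is why the user-supplied context is the natural place to put an upper bound on how long a client operation may take.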

Future work

The KV service presented in this post provides strong consistency guarantees, as discussed. However, keeping systems linearizable all the way through the client is notoriously tricky, and the simple client we presented in this post is not immune to issues.

The problem is with its retry logic; when a client sends a PUT command to a leader and the request times out, what is the right thing to do? Our client just retries, looking for a different leader. Is this the right approach?

Not necessarily! Consider what happens if the leader committed the command, but crashed before responding to the client. If the client now retries, the command may end up duplicated in the log. While it may seem like this shouldn't be a problem because PUT is idempotent [5], it can in fact cause non-linearizable behavior to be observed, if some other client managed to PUT another value for the same key between the original command and its retry.
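
A tiny simulation makes the anomaly visible. This is only an illustrative sketch: the put type and the observe function are made-up names, and the three-entry log is one hypothetical outcome of the scenario described above.

```go
package main

import "fmt"

// put is a simplified PUT log entry.
type put struct{ key, val string }

// observe applies the log in order and records the value of key after each
// entry - the sequence of values a reader could see over time.
func observe(log []put, key string) []string {
	db := map[string]string{}
	var seen []string
	for _, p := range log {
		db[p.key] = p.val
		seen = append(seen, db[key])
	}
	return seen
}

func main() {
	log := []put{
		{"k", "v1"}, // client 1's PUT: committed, but the reply is lost
		{"k", "v2"}, // client 2's PUT
		{"k", "v1"}, // client 1's retry: committed as a second entry
	}
	fmt.Println(observe(log, "k")) // [v1 v2 v1]
}
```

From the clients' point of view only two operations were issued: one PUT(k,v1) and one PUT(k,v2). No ordering of those two operations explains a reader seeing v1, then v2, then v1 again.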

This isn't a trivial problem; in fact, it's also mentioned in section 8 of the Raft paper. We'll spend the next part in the series discussing this problem in detail, presenting one potential solution and talking about how real-world distributed KV services deal with it.

[1]	For the terms used in this description, refer to Part 0.

[2]	This is exactly how our client implementation works, as we'll see soon.

[3]	The paper also discusses some ideas for optimizations of this process. Since this optimizes the uncommon path (when crashes and disconnections disrupt the normal operation of the Raft cluster), I leave this out of my implementation.

[4]	Here's an exercise: the AppendEntries RPC sent by leaders to followers contains a "leader ID" field, so followers know who the current leader is. We already have it in our Raft implementation; try to plumb this information all the way through to the client. When a follower sends an "I'm not a leader" response to the client, it can include the ID of the service it thinks is the current leader; this can reduce the search time somewhat.

[5]	Applying `PUT(k1, v1)` right after another `PUT(k1,v1)` doesn't affect the correctness of the DB.

Notes on running Go in the browser with WebAssembly

2024-09-14T06:05:00-07:00

Recently I've had to compile Go to WebAssembly to run in the browser in a couple of small projects (#1, #2), and in general spent some time looking at WebAssembly. I find WebAssembly to be an exciting technology, both for the web and for other uses (e.g. with WASI); specifically, it's pretty great that we can take existing projects and components written in Go and run them in the browser.

In this post, I will summarize some useful patterns in running Go in the browser via WebAssembly. All the patterns are demonstrated by small, self-contained programs you can find in this GitHub repository.

Basics: calling Go from JS

This sample serves as the basis for other samples in this post: let's write a Go function that we'll call in the browser using JS. This function uses Go's math/big stdlib package to calculate the sum of the harmonic series for some duration [1], and returns the result with high precision:

// calcHarmonic calculates the harmonic series for approximately the given
// number of seconds and returns the accumulated result in a string.
func calcHarmonic(nsecs float64) string {
  d := time.Duration(nsecs * float64(time.Second))
  start := time.Now()
  r1 := big.NewRat(1, 1)
  for i := 2; ; i++ {
    addend := big.NewRat(1, int64(i))
    r1 = r1.Add(r1, addend)

    if i%10 == 0 && time.Since(start) >= d {
      break
    }
  }
  return r1.FloatString(40)
}
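
Since calcHarmonic is driven by wall-clock time, its output varies between runs. To check the arithmetic in isolation, here's a small deterministic variant that sums a fixed number of terms; harmonicN is a hypothetical helper for illustration and isn't part of the sample code.

```go
package main

import (
	"fmt"
	"math/big"
)

// harmonicN returns the exact sum 1 + 1/2 + ... + 1/n as a rational number.
func harmonicN(n int) *big.Rat {
	sum := new(big.Rat) // the zero value of big.Rat is 0
	for i := 1; i <= n; i++ {
		sum.Add(sum, big.NewRat(1, int64(i)))
	}
	return sum
}

func main() {
	h := harmonicN(4)
	fmt.Println(h.RatString())     // 25/12
	fmt.Println(h.FloatString(10)) // 2.0833333333
}
```

Because big.Rat keeps the sum as an exact fraction, precision is only lost at the very end, when FloatString renders the requested number of decimal digits.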

To export this function to JS in the browser, we add the following code:

func main() {
  // Export the name "calcHarmonic" to JS, with our wrapper as value
  js.Global().Set("calcHarmonic", jsCalcHarmonic)

  // The Go main function compiled to WASM is expected to block
  // indefinitely.
  select {}
}

// wrap calcHarmonic to be callable from JS
var jsCalcHarmonic = js.FuncOf(func(this js.Value, args []js.Value) any {
  if len(args) != 1 {
    panic("want one argument")
  }

  s := calcHarmonic(args[0].Float())
  return js.ValueOf(s)
})

This Go file is compiled to the WASM/js target with:

GOOS=js GOARCH=wasm go build -o harmonic.wasm harmonic.go

It's then loaded from JS:

// Instantiate a new Go object (defined in wasm_exec.js)
const go = new Go();
WebAssembly.instantiateStreaming(fetch("harmonic.wasm"), go.importObject).then(
    (result) => {
        go.run(result.instance);
    });

The JS code that calls calcHarmonic is:

let buttonElement = document.getElementById("submitButton");
buttonElement.addEventListener("click", () => {
    let input = document.getElementById("timeInput").value;
    let s = calcHarmonic(parseFloat(input));
    document.getElementById("outputDiv").innerText = s;
});

Finally, the wasm_exec.js file from the Go distribution has to be included with something like:

<script src="wasm_exec.js"></script>

The easiest way to obtain this file is to download it from the Go project's GitHub mirror (for the same Go version your Go code is compiled with); this is handled by the Makefile in our sample project:

wasm_exec.js:
  wget https://raw.githubusercontent.com/golang/go/release-branch.go1.22/misc/wasm/wasm_exec.js

This is the basic recipe for invoking Go from JS in the browser: the Go code is platform-agnostic and presents some API, while all the glue logic is done in JS. The next samples show some variations on this basic scheme.

Link to the full code for this sample.

DOM manipulation from Go

In the previous example, Go implemented the calcHarmonic function, but the rest of the program's logic was in JS - setting up an event listener for a button click, updating output, etc.

We can move more of the code to Go, if we want. The calcHarmonic function remains unchanged, but our main function in Go becomes:

func main() {
  doc := js.Global().Get("document")
  buttonElement := doc.Call("getElementById", "submitButton")
  inputElement := doc.Call("getElementById", "timeInput")
  outputElement := doc.Call("getElementById", "outputDiv")

  buttonElement.Call("addEventListener", "click", js.FuncOf(
    func(this js.Value, args []js.Value) any {
      input := inputElement.Get("value")
      inputFloat, err := strconv.ParseFloat(input.String(), 64)
      if err != nil {
        log.Println(err)
        return nil
      }
      s := calcHarmonic(inputFloat)
      outputElement.Set("innerText", s)
      return nil
    }))

  select {}
}

We obtain JS values from the js.Global() context and can call functions or set attributes on them. If you squint, this looks very similar to JS code, but written in Go-ish.

This code sample demonstrates some useful capabilities of DOM manipulation in Go:

Adding event listeners on DOM elements, with Go callbacks
Getting values from DOM elements
Setting attributes on DOM elements

The only JS code remaining in our index.html is the WebAssembly loader:

const go = new Go();
WebAssembly.instantiateStreaming(fetch("harmonic.wasm"), go.importObject).then(
    (result) => {
        go.run(result.instance);
    });

All the rest is done in Go! Link to the full code for this sample.

For a more full-featured sample, check out this directory. It implements a simple Game of Life running in the browser, entirely in Go. All the game logic, canvas manipulation and event management is done in Go; here too, the only JS code in the project is the few lines used to load the WebAssembly module.

I personally prefer keeping the UI logic in JS, but if you're interested in Go purity all the way - it's definitely feasible.

Using TinyGo as an alternative compiler

The Go compiler's support for WebAssembly is pretty good these days, but there's a small snag that may be important to users: the entire Go runtime is compiled into the WASM binary. On my machine, the .wasm files produced for the sample Go code weigh in at around 2.5 MiB, which will take some time to load in the browser - especially on slow connections [2].

There's an alternative: TinyGo is a Go toolchain "for small places", specializing in embedded controllers; the same considerations apply to WASM. The TinyGo runtime is lightweight compared to Go, and the binaries are about 1/4 the size. Not everything is perfect with TinyGo, though: compilation is much slower, and the resulting code is a bit slower as well. Finally, TinyGo has some limitations that make stdlib packages that rely on reflection not work; this can be painful when interacting with JS because encoding/json relies on reflection - so you may need to look for an alternative JSON package.

The dom-in-go sample directory also shows how to build the project with TinyGo; take a look at the Makefile. Note that TinyGo has its own wasm_exec.js support file - it won't work with the one taken from the standard Go distribution; the Makefile handles this too.

Keeping the main thread free: WebAssembly in a web worker

If we come back to the original sample and run the calculation for some non-trivial amount of time (say, 2 seconds or more) - you may notice something: the page appears "frozen" while the calculation is running. You can't interact with the UI in any way, can't select text with the mouse; if you try to add periodic console.log printouts or some spinner animation - nothing will show until calcHarmonic returns with the result.

This is the expected behavior for JS when it calls a blocking, CPU-intensive function! Let's revisit the code:

let buttonElement = document.getElementById("submitButton");
buttonElement.addEventListener("click", () => {
    let input = document.getElementById("timeInput").value;
    let s = calcHarmonic(parseFloat(input));
    document.getElementById("outputDiv").innerText = s;
});

The call to calcHarmonic will block the main thread for 2+ seconds, but the main thread in JS is also used for all the UI interaction. This is one of the most common manifestations of the function coloring problem - blocking is problematic. Luckily, all modern browsers support Web Workers - isolated threads that can execute concurrently.

It's not hard to make web workers work with WebAssembly, which is what our next demo shows. The main HTML file includes, in addition to the UI logic:

const worker = new Worker("worker.js");
worker.onmessage = ({ data }) => {
    let { action, payload } = data;
    switch (action) {
        case "log":
            console.log(`worker.log: ${payload}`);
            break;
        case "result":
            resultReady(payload);
            break;
        default:
            console.error(`Unknown action: ${action}`);
    }
};

Where worker.js is:

importScripts("wasm_exec.js");
console.log("Worker is running");

// Load the WASM module with Go code.
const go = new Go();
WebAssembly.instantiateStreaming(fetch("harmonic.wasm"), go.importObject).then(
    (result) => {
        go.run(result.instance);
        console.log("Worker loaded WASM module");
    }).catch((err) => {
        console.error("Worker failed to load WASM module: ", err)
    });

onmessage = ({ data }) => {
    let { action, payload } = data;
    postMessage({
        action: "log",
        payload: `Worker received message ${action}: ${payload}`,
    });
    switch (action) {
        case "calculate":
            let result = calcHarmonic(payload);
            postMessage({ action: "result", payload: result });
            break;
        default:
            throw (`unknown action '${action}'`);
    }
};

(The Go code remains unchanged.)

We see that the worker does the WebAssembly loading now, meaning that the Go code executes in a separate thread and the UI thread is free to run while the computation is ongoing. This sample adds a spinner that animates until the web worker returns calcHarmonic's answer, to show the effect.

Link to the full code for this sample.

Talking on a Web Socket with Go

A few years ago I published a sample of a Go server talking via web sockets with JavaScript client code. Well, since the theme here is porting all client code to Go, how about we replace that JavaScript client with yet more Go?

This turns out to be fairly simple - not much different from the "DOM manipulation in Go" section, in fact. But there are some nuances I want to cover.

The application is simple - we display a box, and whenever there's mouse movement over the box, the client sends messages to the server via a web socket; the server echoes the message back and the client uses it to update a text div.

The server code is standard Go using the golang.org/x/net/websocket package. On the client, however, we have to use browser APIs. Here's the interesting part of the code:

const wsServerAddress = "ws://127.0.0.1:4050"

// These are equivalent to the following in JS:
//
//   ws = new WebSocket(addr) ...
//
wsCtor := js.Global().Get("WebSocket")
wsEcho := wsCtor.New(wsServerAddress + "/wsecho")
wsTime := wsCtor.New(wsServerAddress + "/wstime")

To send on a web socket, we'll use this function:

// wsSend sends a message on a web socket; the web socket must be active and
// open (otherwise wsSend logs an error and doesn't send anything).
// The message will be serialized to JSON prior to sending.
func wsSend(sock js.Value, msg any) {
  if !sock.IsNull() && sock.Get("readyState").Equal(js.Global().Get("WebSocket").Get("OPEN")) {
    b, err := json.Marshal(msg)
    if err != nil {
      log.Fatal(err)
    }
    sock.Call("send", string(b))
  } else {
    log.Println("socket is not open")
  }
}

And here's how receiving looks, registering the message event listener:

wsEcho.Call("addEventListener", "message", js.FuncOf(
  func(this js.Value, args []js.Value) any {
    event := args[0]
    var ev Event
    if err := json.Unmarshal([]byte(event.Get("data").String()), &ev); err != nil {
      log.Fatal(err)
    }
    coordMsg := fmt.Sprintf("Coordinates: (%v, %v)", ev.X, ev.Y)
    outputElement.Set("innerText", coordMsg)
    return nil
  }))

As before, this is just straightforward translation of JS into Go [3]. Note something interesting that's going on here: we have two different Go programs, talking over web sockets with each other using completely different underlying libraries. One uses a Go-native implementation of web sockets; the other uses the browser implementation, exposed via a JS API. In a realistic program, it would make sense to abstract over these details so the same code could be used to send/receive data over web sockets, whether it runs on the server or the client.
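One way such an abstraction could look is sketched below. The Transport interface, the SendJSON helper and the memTransport type are hypothetical names invented for this illustration (they are not part of the sample code), and the JSON field names on Event are assumed. A browser build would implement Transport by calling send on a js.Value, while a server build would wrap its own connection type; the in-memory implementation here just makes the sketch runnable anywhere.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Transport is a hypothetical abstraction over a web socket connection.
// The browser and the server would each provide their own implementation.
type Transport interface {
	SendText(s string) error
}

// Event mirrors the kind of message exchanged in the sample; the JSON
// field names are assumptions made for this sketch.
type Event struct {
	X int `json:"x"`
	Y int `json:"y"`
}

// SendJSON serializes msg to JSON and sends it over t. This code is the
// same regardless of which Transport implementation is plugged in.
func SendJSON(t Transport, msg any) error {
	if t == nil {
		return errors.New("nil transport")
	}
	b, err := json.Marshal(msg)
	if err != nil {
		return fmt.Errorf("marshal: %w", err)
	}
	return t.SendText(string(b))
}

// memTransport is an in-memory Transport that records what was sent;
// it stands in for a real web socket here.
type memTransport struct {
	sent []string
}

func (m *memTransport) SendText(s string) error {
	m.sent = append(m.sent, s)
	return nil
}

func main() {
	mt := &memTransport{}
	if err := SendJSON(mt, Event{X: 10, Y: 20}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mt.sent[0]) // prints {"x":10,"y":20}
}
```

A side benefit of this shape is that the sending logic can be unit-tested without a browser or a running server.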

Link to the full code for this sample.

Testing locally with Node.js

This section isn't strictly about "running in the browser", but it covers the important topic of local testing. Sometimes we don't want the browser in the loop for our tests; well, good news - we can leverage Node.js's ability to load and execute WebAssembly modules to run GOOS=js GOARCH=wasm Go binaries locally!

The interesting tidbit here is that we can leverage special support implemented in the Go toolchain to make these invocations similar to running/testing regular Go programs. Here's an excerpt from go help run describing it:

By default, 'go run' runs the compiled binary directly: 'a.out arguments...'.
If the -exec flag is given, 'go run' invokes the binary using xprog:
  'xprog a.out arguments...'.
If the -exec flag is not given, GOOS or GOARCH is different from the system
default, and a program named go_$GOOS_$GOARCH_exec can be found
on the current search path, 'go run' invokes the binary using that program,
for example 'go_js_wasm_exec a.out arguments...'. This allows execution of
cross-compiled programs when a simulator or other execution method is
available.

The Makefile in our sample handles this fully; we can run a test like this locally, without opening the browser:

//go:build js && wasm

package main

import (
  "log"
  "syscall/js"
  "testing"
)

func TestJSArr(t *testing.T) {
  log.Println("hello from test in js/wasm")

  objs := js.Global().Call("eval", `({
arr: [41,42,43],
})`)

  arr := objs.Get("arr")
  if got := arr.Length(); got != 3 {
    t.Errorf("got %#v, want %#v", got, 3)
  }

  if got := arr.Index(1).Int(); got != 42 {
    t.Errorf("got %#v, want %#v", got, 42)
  }
}

With an invocation like:

GOOS=js GOARCH=wasm go test -exec=supportfiles/go_js_wasm_exec -v .

Link to the full code for this sample.

[1]	The harmonic series is known to diverge, but very slowly. You need over 200 million elements to get to the sum of 20, etc. (see A004080).

[2]	There are some additional mitigations we can explore, like compressing the WASM binary. This is outside the scope of this post, and it applies to the TinyGo output as well.

[3]	To be honest, this makes me appreciate JS as an extension language. It has such a simple ABI! Everything is an object, and we can get/set object properties (which can be other objects), and call functions/methods - that's all we need to access all of the browser APIs.

SentencePiece BPE Tokenizer in Go

2024-08-23T10:35:00-07:00

Earlier this year I wrote a post about implementing BPE tokenization in Go, which made it possible to reproduce OpenAI's tokenizer.

Today I want to mention a new project I've been hacking on recently: go-sentencepiece - a pure Go implementation of the SentencePiece tokenizer that's used for Google AI's models like Gemma and Gemini. SentencePiece has a canonical C++ implementation and Python bindings (using SWIG). While it's not too hard to wrap the C++ code with cgo, in some cases a C compiler dependency isn't desirable, so a pure Go solution may be useful. This is what go-sentencepiece is for.

A disclaimer: while SentencePiece contains implementations for both BPE and Unigram tokenizers, go-sentencepiece only implements BPE because this is the one used in practice by models. Also, go-sentencepiece doesn't implement the training phase of the tokenizer, only encoding & decoding. For training, feel free to review my previous post.

There are a couple of ways in which SentencePiece works differently from OpenAI's variant of BPE:

The text is not pre-split by whitespace using a regexp; instead, whitespace is considered just another part of the input and has its own tokens. You can even see it in the screenshot above - it's marked by the "fat underscore" character (U+2581). While single-space runes are usually part of the next non-space token, multi-space tokens exist as distinct tokens.
Instead of being configured by just a vocabulary and a regexp, SentencePiece tokenizers have a whole protobuf for configuration, with many options. go-sentencepiece only supports the set of options used for Google AI's models, but more can be added easily.

The whitespace difference turns out to play a crucial role in performance. My original BPE implementation was fairly naive, using simple quadratic algorithms for encoding; this was OK, because these algorithms were working on one word at a time, so the N was very small.

This is no longer sufficient for SentencePiece, however, since N is now the length of the full text. Therefore, the implementation adopts some more sophisticated algorithms from the C++ SentencePiece codebase; in particular:

To match a prefix of a long string from a set of candidates, we use a trie data structure. The prefixmatcher package implements this and may be generally interesting.
To figure out which pair of tokens to try merging next, we use a heap-based priority queue; this is implemented in the generic priorityqueue package.
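To make the first of these techniques concrete, here is a minimal sketch of trie-based longest-prefix matching. It is only an illustration of the idea: the Trie type and its methods are hypothetical and do not reflect the actual API of the prefixmatcher package, and the sketch works on bytes for simplicity.

```go
package main

import "fmt"

// trieNode is one node of a byte-keyed trie.
type trieNode struct {
	children map[byte]*trieNode
	terminal bool // true if the path from the root spells a complete word
}

// Trie finds the longest stored word that is a prefix of a given text.
type Trie struct {
	root *trieNode
}

// NewTrie builds a trie holding all the given words.
func NewTrie(words []string) *Trie {
	t := &Trie{root: &trieNode{children: map[byte]*trieNode{}}}
	for _, w := range words {
		t.add(w)
	}
	return t
}

func (t *Trie) add(word string) {
	node := t.root
	for i := 0; i < len(word); i++ {
		child, ok := node.children[word[i]]
		if !ok {
			child = &trieNode{children: map[byte]*trieNode{}}
			node.children[word[i]] = child
		}
		node = child
	}
	node.terminal = true
}

// LongestPrefix returns the length in bytes of the longest stored word
// that is a prefix of text, or 0 if no stored word matches.
func (t *Trie) LongestPrefix(text string) int {
	node := t.root
	best := 0
	for i := 0; i < len(text); i++ {
		child, ok := node.children[text[i]]
		if !ok {
			break
		}
		node = child
		if node.terminal {
			best = i + 1
		}
	}
	return best
}

func main() {
	t := NewTrie([]string{"ham", "hammer", "hamster"})
	fmt.Println(t.LongestPrefix("hammock"))  // 3
	fmt.Println(t.LongestPrefix("hammers!")) // 6
	fmt.Println(t.LongestPrefix("hello"))    // 0
}
```

The lookup walks the text once and stops at the first byte with no matching child, so its cost depends on the length of the match rather than on the number of candidates.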

While I didn't spend much time on micro-optimizing the implementation, these algorithmic improvements sped up the encoder by about 100x compared to a naive approach, and it's now so fast that I don't think it will ever be a bottleneck in reality.
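For the heap-based selection of merge candidates mentioned above, a rough sketch using the standard library's container/heap could look like this. The candidate type, its fields and the tie-breaking rule (leftmost position wins on equal scores) are assumptions made for the illustration; this is a stand-in for, not a copy of, the generic priorityqueue package.

```go
package main

import (
	"container/heap"
	"fmt"
)

// candidate is a hypothetical merge candidate: a pair of adjacent tokens
// starting at position pos, with a score where higher is better.
type candidate struct {
	pos   int
	score float64
}

// candidateHeap implements heap.Interface as a max-heap on score;
// ties are broken in favor of the leftmost position.
type candidateHeap []candidate

func (h candidateHeap) Len() int { return len(h) }

func (h candidateHeap) Less(i, j int) bool {
	if h[i].score != h[j].score {
		return h[i].score > h[j].score
	}
	return h[i].pos < h[j].pos
}

func (h candidateHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }

func (h *candidateHeap) Push(x any) { *h = append(*h, x.(candidate)) }

func (h *candidateHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// drain returns all candidates in the order the heap would yield them.
// The input slice is copied, so the caller's data is left untouched.
func drain(cs []candidate) []candidate {
	h := candidateHeap(append([]candidate(nil), cs...))
	heap.Init(&h)
	out := make([]candidate, 0, len(cs))
	for h.Len() > 0 {
		out = append(out, heap.Pop(&h).(candidate))
	}
	return out
}

func main() {
	cs := []candidate{
		{pos: 0, score: -3.5},
		{pos: 1, score: -1.2},
		{pos: 2, score: -2.0},
		{pos: 3, score: -1.2},
	}
	for _, c := range drain(cs) {
		fmt.Printf("pos=%d score=%.1f\n", c.pos, c.score)
	}
}
```

Each push and pop costs O(log N), which is what keeps picking the next merge cheap when N is the length of the whole text rather than of a single word.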

Config and set up

As mentioned earlier, SentencePiece is configurable with a protobuf file. There are two parts to this: first is a .proto file defining the schema of the protobuf. This is vendored into my repository, copied from the C++ SentencePiece repository. The .pb.go file is also in the tree so you don't need to run the protobuf compiler unless the .proto changes.

The second part is the protobuf itself, which contains the tokenizer vocabulary and a bunch of configuration options. This can be downloaded from the official Gemma repository. go-sentencepiece should be able to load this file.

Online demo

As before, I've implemented an online demo of this tokenizer by compiling it into WebAssembly and adding some HTML+JS scaffolding around it. This is where the screenshot above is from.

You can play with it here: https://eliben.github.io/go-sentencepiece/ (the model protobuf is quite big though, so this page may take a few seconds to load if you have a slow connection).

Building static binaries with Go on Linux

2024-07-30T14:35:00-07:00

One of Go's advantages is being able to produce statically-linked binaries [1]. This doesn't mean that Go always produces such binaries by default, however; in some scenarios it requires extra work to make this happen. Specifics here are OS-dependent; here we focus on Unix systems.

Basics - hello world

This post goes over a series of experiments: we take simple programs and use go build to produce binaries on a Linux machine. We then examine whether the produced binary is statically or dynamically linked. The first example is a simple "hello, world":

package main

import "fmt"

func main() {
  fmt.Println("hello world")
}

After building it with go build, we get a binary. There are a few ways on Linux to determine whether a binary is statically or dynamically linked. One is the file tool:

$ file ./helloworld
helloworld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=Flm7stIXKLPfvBhTgXmR/PPwdjFUEkc9NCSPRC7io/PofU_qoulSqJ0Ktvgx5g/eQXbAL15zCEIXOBSPZgY, with debug_info, not stripped

You can see it says "statically linked". Another way is to use ldd, which prints the shared object dependencies of a given binary:

$ ldd ./helloworld
  not a dynamic executable

Alternatively, we can also use the ubiquitous nm tool, asking it to list the undefined symbols in a binary (these are symbols the binary expects the dynamic linker to provide at run-time from shared objects):

$ nm -u ./helloworld
<empty output>

All of these tell us that a simple helloworld is a statically-linked binary. Throughout the post I'll mostly be using ldd (out of habit), but you can use any approach you like.
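The same check can also be approximated programmatically with Go's debug/elf package. This is a minimal sketch, not part of the experiments in this post: isDynamic is a hypothetical helper, and I'm assuming that "dynamically linked" can be detected by the presence of a PT_INTERP program header or of DT_NEEDED entries, which is roughly the information file and ldd rely on.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// isDynamic reports whether the ELF file at path requests a program
// interpreter (PT_INTERP) or lists shared libraries it needs (DT_NEEDED).
func isDynamic(path string) (bool, error) {
	f, err := elf.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			return true, nil
		}
	}

	// ImportedLibraries returns no libraries for a binary without a
	// dynamic section.
	libs, err := f.ImportedLibraries()
	if err != nil {
		return false, err
	}
	return len(libs) > 0, nil
}

func main() {
	// Inspect the binary given on the command line, or this program's
	// own executable when no argument is passed.
	path, err := os.Executable()
	if len(os.Args) > 1 {
		path, err = os.Args[1], nil
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}

	dyn, err := isDynamic(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	if dyn {
		fmt.Println(path, "is dynamically linked")
	} else {
		fmt.Println(path, "is statically linked")
	}
}
```

This only works on ELF files, so it applies to Linux binaries like the ones examined here.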

DNS and user groups

There are two pieces of functionality the Go standard library defers to the system's libc on Unix machines, when some conditions are met. When cgo is enabled (as it often - but not always - is on Unix machines), Go will call the C library for DNS lookups in the net package and for user and group ID lookups in the os/user package.

Let's observe this with an experiment:

package main

import (
  "fmt"
  "net"
)

func main() {
  fmt.Println(net.LookupHost("go.dev"))
}

If we build this program, we notice it's dynamically linked, expecting to load a libc shared object at run-time:

$ go build lookuphost.go
$ ldd ./lookuphost
  linux-vdso.so.1 (0x00007b50cb22a000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007b50cae00000)
  /lib64/ld-linux-x86-64.so.2 (0x00007b50cb22c000)

This is explained in the net package documentation in some detail. The Go standard library does have a pure Go implementation of this functionality (although it may lack some advanced features). We can ask the toolchain to use it in a couple of ways. First, we can set the netgo build tag:

$ go build -tags netgo lookuphost.go
$ ldd ./lookuphost
  not a dynamic executable

Second, we can disable cgo entirely with the CGO_ENABLED env var. This env var is usually on by default on Unix systems:

$ go env CGO_ENABLED
1

If we disable it explicitly for our build, we'll get a static binary again:

$ CGO_ENABLED=0 go build lookuphost.go
$ ldd ./lookuphost
  not a dynamic executable

Similarly, some of the functionality of the os/user package uses libc by default. Here's an example:

package main

import (
  "encoding/json"
  "log"
  "os"
  "os/user"
)

func main() {
  user, err := user.Lookup("bob")
  if err != nil {
    log.Fatal(err)
  }

  je := json.NewEncoder(os.Stdout)
  je.Encode(user)
}

This produces a dynamically-linked binary:

$ go build userlookup.go
$ ldd ./userlookup
  linux-vdso.so.1 (0x0000708301084000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000708300e00000)
  /lib64/ld-linux-x86-64.so.2 (0x0000708301086000)

As with net, we can ask the Go toolchain to use the pure Go implementation of this user lookup functionality. The build tag for this is osusergo:

$ go build -tags osusergo userlookup.go
$ ldd ./userlookup
  not a dynamic executable

Or, we can disable cgo:

$ CGO_ENABLED=0 go build userlookup.go
$ ldd ./userlookup
  not a dynamic executable

Linking C into our Go binary

We've seen that the standard library has some functionality that may require dynamic linking by default, but this is relatively easy to override. What happens when we actually have C code as part of our Go program, though?

Go supports C extensions and FFI using cgo. For example:

package main

// #include <stdio.h>
// void helloworld() {
//   printf("hello, world from C\n");
// }
import "C"

func main() {
  C.helloworld()
}

A program built from this source will be dynamically linked, due to cgo:

$ go build cstdio.go
$ ldd ./cstdio
  linux-vdso.so.1 (0x00007bc6d68e3000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007bc6d6600000)
  /lib64/ld-linux-x86-64.so.2 (0x00007bc6d68e5000)

In our C code, printf is a call to libc; even if we don't explicitly call into the C runtime in our C code, cgo may do it in the scaffolding code it generates.

Note that cgo may be involved even if your project has no C code of its own; several dependencies may bring in cgo. Some popular packages - like the go-sqlite3 driver - depend on cgo, and importing them will impose a cgo requirement on a program.

Obviously, building with CGO_ENABLED=0 is no longer an option. So what's the recourse?

Linking a `libc` statically

To recap, once we have C code as part of our Go binary, it's going to be dynamically linked on Unix, because:

1. The C code calls into libc (the C runtime)
2. The libc typically used on Unix systems is glibc
3. The recommended way to link to glibc is dynamically (for various technical and license-related reasons that are outside the scope of this post)
4. Therefore, go build produces dynamically-linked Go binaries

To change this flow of events, we can interpose at step (2) - use a different libc implementation, one that's statically linked. Luckily, such an implementation exists and is well used and tested - musl.

To follow along, start by installing musl. The standard instructions using ./configure --prefix=<MUSLDIR> and make / make install work well. We'll use $MUSLDIR to refer to the directory where musl is installed. musl comes with a gcc wrapper that makes it easy to pass all the right flags. To re-build our cstdio example using musl, run:

$ CC=$MUSLDIR/bin/musl-gcc go build --ldflags '-linkmode external -extldflags "-static"' cstdio.go
$ ldd ./cstdio
  not a dynamic executable

The CC env var tells go build which C compiler to use for cgo; the linker flags instruct it to use an external linker for the final build (read this for the gory details) and then to perform a static link.

This approach works for more complex use cases as well! I won't paste the code here, but the sample repository accompanying this post has a file called use-sqlite.go; it uses the go-sqlite3 package. Try go build-ing it normally and observe the dynamically linked binary produced; next, try to build it with the flags shown above to use musl, and observe that the produced binary will be statically linked.

Another curious tidbit is that we now have another way to build a statically-linked lookuphost program - by linking it with musl:

$ CC=$MUSLDIR/bin/musl-gcc go build --ldflags '-linkmode external -extldflags "-static"' lookuphost.go
$ ldd ./lookuphost
  not a dynamic executable

Since we didn't provide -tags netgo and didn't disable cgo, the Go toolchain uses calls into libc to implement DNS lookup; however, since these calls end up in the statically-linked musl, the final binary is statically linked!

Using Zig as our C compiler

Another alternative emerged recently to achieve what we want: using the Zig toolchain. Zig is a new systems programming language, which uses a bundled toolchain approach similar to Go. Its toolchain bundles together a Zig compiler, C/C++ compiler, linker and libc for static linking. Therefore, Zig can actually be used to link Go binaries statically with C code!

Instead of installing musl, we could install Zig and use its x86_64-linux-musl target (adjust the architecture if needed). This is done by pointing to the zig binary as our CC= env var; assuming Zig is installed in $ZIGDIR:

$ CC="$ZIGDIR/zig cc -target x86_64-linux-musl" go build cstdio.go
$ CC="$ZIGDIR/zig cc -target x86_64-linux-musl" go build use-sqlite.go

These will produce statically-linked Go binaries; the zig driver takes care of setting the right linker flags automatically, so the command-line ends up being slightly simpler than invoking musl-gcc. Another advantage of Zig here is that it enables cross-compilation of Go programs that include C code [2].

I did find some issues with this approach, however; for example, attempting to link the lookuphost.go sample fails with a slew of linker errors.

Summary

Making sure Go produces a statically-linked binary on Linux takes a little bit of effort, but works well overall.

There's a long-standing accepted proposal about adding a -static flag to go build that would take care of setting up all the flags required for a static build. AFAICT, the proposal is just waiting for someone with enough grit and dedication to implement and test it in all the interesting scenarios.

Code

The code for all the experiments described in this post is available on GitHub.

[1]	A statically-linked binary doesn't have run-time dependencies on other libraries (typically in the form of shared objects), not even the C runtime library (`libc`). I wrote much more about this topic in the past.

[2]	Go is well-known for its cross-compilation capabilities, but it depends on the C toolchain to compile C code. Therefore, when cgo is involved, cross-compilation is challenging. Zig can help with this because its toolchain supports cross compilation for Zig and C! It does so by bundling LLVM with a bunch of targets linked in.

Locally patching dependencies in Go

2024-07-04T06:35:00-07:00

In a previous post I talked about how each Go module is its own self-contained "virtual environment" during development. Among other benefits, this makes the dependencies of a module explicit and simple to tweak.

Locally patching a dependency

To use a concrete example, suppose our module depends on the popular package go-cmp, that lets us deep-compare arbitrary Go values. Say we're debugging an intricate scenario and want to either:

Add a log statement inside the dependency to see what our code is passing to it (e.g. "do I ever invoke cmp.Equal with these specific options?")
Test a suspicion of a bug in the dependency by temporarily modifying its code and seeing if this has an effect on our module.

The Go module system makes this easy to accomplish; this post will demonstrate several ways of doing this.

Setting up

Let's set up a test module to demonstrate this. The full code can be found on GitHub, or just follow along:

In a directory, run go mod init example.com (the module name is just a placeholder - it's a local experiment, we don't intend it to be imported or even published online). This creates a go.mod file; now, let's write this code:

package main

import (
  "fmt"

  "github.com/google/go-cmp/cmp"
  "github.com/google/go-cmp/cmp/cmpopts"
)

func main() {
  s1 := []int{42, 12, 23, 2}
  s2 := []int{12, 2, 23, 42}

  if cmp.Equal(s1, s2, cmpopts.SortSlices(intLess)) {
    fmt.Println("slices are equal")
  }
}

func intLess(x, y int) bool {
  return x < y
}

And then run go mod tidy; this should get the github.com/google/go-cmp dependency, and the go.mod file will look something like:

module example.com

go 1.22.2

require github.com/google/go-cmp v0.6.0

(your Go version and the dependency version will likely be different, of course)

Now, we'll download the dependency locally and patch it. Clone the https://github.com/google/go-cmp/ repository into a local directory; we'll call it $DEP (on my machine DEP=/home/eliben/test/go-cmp). Next, edit $DEP/cmp/compare.go to add a log statement:

func Equal(x, y interface{}, opts ...Option) bool {
  log.Println("options:", opts)
  s := newState(opts)
  s.compareAny(rootStep(x, y))
  return s.result.Equal()
}

If we run our test module now we don't see any effect yet:

$ go run .
slices are equal

This is to be expected! Go has no idea we've cloned the dependency locally and want it to be used in the build process of our test module. This is the next step.

Using a module `replace` directive

The most basic way to accomplish what we need is using a replace directive in the go.mod file of our test module.

In our module directory, run:

$ go mod edit -replace github.com/google/go-cmp=$DEP

If you look in your go.mod file, you'll see a new replace directive added there, redirecting uses of github.com/google/go-cmp to whatever directory DEP stands for on your machine.

If we now run the test module, it will pick up the patched dependency:

$ go run .
2024/06/29 06:57:17 options: [FilterValues(cmpopts.sliceSorter.filter, Transformer(cmpopts.SortSlices, cmpopts.sliceSorter.sort))]
slices are equal

Using Go workspaces

Go workspaces (go.work files) have been with us since version 1.18; a workspace makes it easier to work with multi-module repositories and large monorepos. It can also be leveraged to implement our use case very easily.

Get back to a clean go.mod file without a replace directive (you can either undo the change using source control, run go mod edit -dropreplace ... or just remove the replace directive from the go.mod file).

Now, run these commands in the test module's directory:

$ go work init
$ go work use . $DEP

This asks the Go tool to:

Initialize an empty workspace in the current directory; a go.work file will be created.
Add use directives to go.work for including the current directory . and the place where we checked out a local version of the dependency ($DEP).

If you look around, a new file was created - go.work; go.mod itself was not modified. If we run the module with go run ., we'll see that the local patch was picked up!

I like this approach a bit more than planting replace directives in the go.mod file, since it provides a cleaner separation between temporary patching and the module's actual source code. While go.mod files are checked into source control and provide a critical source of truth for building the module, go.work files aren't typically checked in and are used to set up a convenient local development environment. Using go.work for temporary patching is thus safer - it's more difficult to leave behind a replace directive in the go.mod file and commit it (this can cause all kinds of inconveniences when testing, for example).

Using `gohack`

gohack is a tool designed especially to address our use case; it predates Go workspaces. Start by installing it:

$ go install github.com/rogpeppe/gohack@latest

Now run:

$ gohack get github.com/google/go-cmp
github.com/google/go-cmp => $HOME/gohack/github.com/google/go-cmp

This invocation does two things:

Fetch the dependency's code and store it somewhere locally. You can control where these are stored by setting the $GOHACK env var; the default is $HOME/gohack.
Add a replace line to our go.mod file to point there.

Since gohack placed the dependency in a new location, we'll have to edit its cmp/compare.go file again to add the log statement. If we go run . in our test module, we'll see the change picked up.

It's also fairly easy to undo changes with the gohack undo command.

Which approach to use?

gohack can be useful in some cases where a quick check is all you need. Since gohack obtains the dependency on its own, it makes it a bit faster to use than cloning manually. That said, I'd be concerned about committing the replace line accidentally, which is why I think the workspace approach is safer (and also more explicit).

Update 2024-07-05: Sean Liao reminded me that go mod vendor is yet another way to accomplish this. This approach comes with its own tradeoffs; read the documentation to learn more.

Reading Google Sheets from a Go program

2024-05-31T18:07:00-07:00

I recently needed to process some data from a Google Sheet in a Go program, and was looking for the most straightforward way to do so on my local machine. This post lists some approaches that I found to work, with full source code.

To access the Sheets API, you'll need a GCP project, and would typically have the gcloud command-line tool installed. To enable the Sheets API for your project, run:

$ gcloud services enable sheets.googleapis.com --project=<PROJECT-NAME>

If you want to list which APIs are already enabled, you can do:

$ gcloud services list --enabled --project=<PROJECT-NAME>

The simplest approach I found to work was using a service account. This post demonstrates this approach, as well as a (slightly) more involved approach that uses OAuth 2.0.

Service account

A service account on GCP can be thought of as a virtual account, along with its own email address, attached to a project. These accounts have their own auth, permissions, etc. This is very useful for running on a VM - you typically don't want the VM to be logged in with your primary Google account, and this service account can be specific to a given VM (or a group thereof).

Start by creating a new service account on this page. Once created, select Manage Keys in the Actions menu, and add a new key. This will download a private key to your machine; keep it safe! The following program expects this key file to be provided with the -keyfile flag:

package main

import (
  "context"
  "flag"
  "fmt"
  "log"
  "os"

  "golang.org/x/oauth2/google"
  "google.golang.org/api/option"
  "google.golang.org/api/sheets/v4"
)

func main() {
  keyFilePath := flag.String("keyfile", "", "path to the credentials file")
  flag.Parse()

  ctx := context.Background()
  // os.ReadFile supersedes the deprecated ioutil.ReadFile.
  credentials, err := os.ReadFile(*keyFilePath)
  if err != nil {
    log.Fatalf("unable to read key file: %v", err)
  }

  scopes := []string{
    "https://www.googleapis.com/auth/spreadsheets.readonly",
  }
  config, err := google.JWTConfigFromJSON(credentials, scopes...)
  if err != nil {
    log.Fatalf("unable to create JWT configuration: %v", err)
  }

  srv, err := sheets.NewService(ctx, option.WithHTTPClient(config.Client(ctx)))
  if err != nil {
    log.Fatalf("unable to retrieve sheets service: %v", err)
  }

  // ...

We can specify the requested scopes (permissions) when creating an auth config. Here we're asking for read-only access to the Google Sheets.

Once auth succeeds (sheets.NewService returns without an error), we can use the sheets package to read and analyze the sheet; the code below simply prints the document's title and emits all the values from columns A and B in Sheet1.

  docId := "1qsNWsZuw98r9HEl01vwxCO5O1sIsI-fr0bJ4KGVvWsU"
  doc, err := srv.Spreadsheets.Get(docId).Do()
  if err != nil {
    log.Fatalf("unable to retrieve data from document: %v", err)
  }
  fmt.Printf("The title of the doc is: %s\n", doc.Properties.Title)

  val, err := srv.Spreadsheets.Values.Get(docId, "Sheet1!A:B").Do()
  if err != nil {
    log.Fatalf("unable to retrieve range from document: %v", err)
  }

  fmt.Printf("Selected major dimension=%v, range=%v\n", val.MajorDimension, val.Range)
  for _, row := range val.Values {
    fmt.Println(row)
  }
}

Note the docId passed to the sheets package; this is the path segment in your spreadsheet's URL following the /d/. In this example, I'm using a test sheet I've created.
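If you'd rather paste the whole spreadsheet URL than pick the ID out by hand, a small helper can extract it. This is only a sketch: sheetIDFromURL is a hypothetical name that isn't part of the sheets package, and it assumes the usual URL shape in which the ID is the path segment right after /d/.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sheetIDFromURL returns the path segment that immediately follows "/d/" in a
// spreadsheet URL. Hypothetical helper, not part of the sheets package.
func sheetIDFromURL(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	parts := strings.Split(u.Path, "/")
	for i, p := range parts {
		if p == "d" && i+1 < len(parts) && parts[i+1] != "" {
			return parts[i+1], nil
		}
	}
	return "", fmt.Errorf("no /d/<id> segment in %q", rawURL)
}

func main() {
	id, err := sheetIDFromURL("https://docs.google.com/spreadsheets/d/1qsNWsZuw98r9HEl01vwxCO5O1sIsI-fr0bJ4KGVvWsU/edit#gid=0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(id)
}
```

The returned string can then be passed wherever the sample uses docId.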

Important: unless your sheet is world-readable, your service account won't be able to access it. Here the account's email comes in handy; you can take it from the service account's GCP IAM page (Details tab), and give this email permissions to the sheet. This way you can have the program processing a private sheet that only you have access to.

OAuth

Another way to achieve what we want is with OAuth. This also requires a bit of setup in your project's GCP console. Follow the Go quickstart docs for that. Our sample assumes you've saved the credentials.json file somewhere locally and will pass it through the -credfile flag. Unlike the quickstart, it handles all the token exchange process automatically without having to ask you to copy a code from a web page. You still have to authenticate the first time you run it, of course.

The full code of the sample is available on GitHub; while the auth part is different, the actual sheets processing code is identical to the service account sample.

For an overview of the OAuth protocol, see my earlier post.

P.S. ADC

Initially, I had trouble accessing the sheet using ADC (Application Default Credentials), but following a HN comment on this post, I was motivated to try again and it worked. I may have mixed up my auth JSON files previously, because the code is identical to what I originally tried. In any case, the code is available on GitHub along with the other options. Depending on the exact use case, ADC may be simpler than using a service account (though IMHO the service account is a more "reliable" method across machines because its configuration is more explicit - less is happening under the hood).

Tokens for LLMs: Byte Pair Encoding in Go

2024-04-25T06:34:00-07:00

A basic unit of currency in modern LLMs is the token; exciting new models have long context windows of millions of tokens. API pricing for the large providers is per-token. We're even seeing the invention of new, derived units like TPM (tokens per minute).

But what are tokens?

This OpenAI help article tells us that tokens are pieces of words, and gives some useful rules of thumb like a token being equivalent to approximately 4 characters or 3/4 of a word for the English language.

In this post I want to review the most commonly used algorithm for splitting text into tokens, provide a complete implementation in Go, and show a playground for experimenting with it. While my implementation isn't tuned for speed, it aims to be complete, readable and compatible with OpenAI's tiktoken library, generating identical results and working with the same vocabulary files.

Byte pair encoding - introduction

Byte pair encoding (BPE) is an algorithm originally designed for data compression. A 2016 paper suggested re-purposing it for "word segmentation" for machine learning tasks. The colloquial term for word segmentation is tokenization.

Input: arbitrary text with words, numbers, whitespace and punctuation.
Output: list of tokens representing the same text. Each token is an integer identifier which can be looked up in a vocabulary to reproduce the input text [1].

The BPE algorithm has an important pre-processing step: splitting the input text into words. The splitting is customizable and different models / vocabularies use different regexps for splitting (more on this later). The main idea is some sort of whitespace-based splitting (though whitespace itself is preserved) because we typically don't want inter-word tokens [2].

We'll be using this line from a catchy 1990s song as an example:

i'm blue dabadee dabadam

A word splitter will produce something like the following list, where spaces are replaced by underscores _ for the sake of presentation (they remain as spaces in the actual implementation of the algorithm and its trained vocabulary):

i
'm
_blue
_dabadee
_dabadam

A few things to note:

The contraction 'm is split from i - this is common for English language splitters, which want things like 'm, 'll, 're as separate words.
Whitespace is preserved and attached at the start of a word. Whitespace is important because tokens at the beginning of words sometimes have different semantic meaning from tokens not at the beginning of words. The choice of where it's attached is arbitrary. From this point on, whitespace bytes are considered like any other bytes in the BPE algorithm.
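To make the splitting step concrete, here is a minimal sketch of a word splitter built on the standard regexp package. The pattern is a simplified stand-in I'm assuming for illustration, not the regexp any real model uses: it handles a few English contractions, attaches one leading space to each word, and keeps any remaining whitespace as its own word.

```go
package main

import (
	"fmt"
	"regexp"
)

// simpleSplitter is an illustrative pattern (an assumption, not a real model's
// splitter): contractions first, then runs of letters, digits or punctuation
// with an optional leading space, then any other whitespace.
var simpleSplitter = regexp.MustCompile(
	`'(?:s|t|re|ve|m|ll|d)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`)

// splitWords breaks text into words in the BPE pre-processing sense.
func splitWords(text string) []string {
	return simpleSplitter.FindAllString(text, -1)
}

func main() {
	for _, w := range splitWords("i'm blue dabadee dabadam") {
		fmt.Printf("%q\n", w) // spaces stay as real spaces, not underscores
	}
}
```

For our example line this yields the same five words as the list above, with actual spaces where the list shows underscores.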

Now is a good time for some terminology we'll be using while talking about BPE:

Word: produced by the splitter in pre-processing, like the list shown above.
Token: typically a sub-word sequence of bytes; the output of the tokenizer is a list of tokens, by ID.
Token ID: unique numerical identifier for a token.
Vocabulary: a mapping of token IDs --> token values learned by the tokenizer during the training process.
Training: the process in which BPE learns a vocabulary from a corpus of text.
Splitter regexp: regular expression used to split text into words during pre-processing. Given an algorithm (in this case BPE), the pair vocabulary + splitter regexp unambiguously defines how a given text will be tokenized.
Encoder: given a vocabulary and a splitter regexp, tokenizes any text into a list of IDs from the vocabulary.
Decoder: given a list of IDs and the vocabulary, reconstructs the original text.

Training

BPE training proceeds by first assuming each byte is its own token, and then successively merging pairs of tokens into longer tokens and adding these to the vocabulary, until the desired vocabulary size is achieved.

Let's reuse our example, starting with these words:

i
'm
_blue
_dabadee
_dabadam

The BPE process starts by creating a token for each byte in the inclusive range [0..255]. So the minimal vocabulary size is 256; this guarantees that from the very start, there's a valid encoded representation of any text.

Then, the following process is repeated:

Count how many times each ordered pair of bytes appears in the input. Ordered pair here means two bytes right next to each other. In our example, some such pairs are "bl", "da", "de", "ee" etc.
Find the pair with the highest count, and create a new token from it (create a new token ID, mapping it to the concatenation of the most common pair).
Replace this most common pair with the combined token in the input set.

In our example, we start by splitting input words to bytes, so it's a list of single-byte token lists. This is our working list:

[i]
[' m]
[_ b l u e]
[_ d a b a d e e]
[_ d a b a d a m]

Next, we count the frequency of appearance of each ordered pair:

[d a] --> 3
[a b] --> 2
[b a] --> 2
[' m] --> 1
[_ b] --> 1
[l u] --> 1
[u e] --> 1
[_ d] --> 2
[a d] --> 2
[d e] --> 1
[e e] --> 1
[b l] --> 1
[a m] --> 1

The pair "da" is the most common one, so we're creating a new token for it, and substituting it everywhere in the working list:

[i]
[' m]
[_ b l u e]
[_ da b a d e e]
[_ da b a da m]

As you can see, in every instance "d" followed by "a" was combined into "da". Now we repeat the process, counting the pairs in this new working list:

[e e] --> 1
[a da] --> 1
[l u] --> 1
[_ da] --> 2
[da b] --> 2
[a d] --> 1
[d e] --> 1
[da m] --> 1
[' m] --> 1
[_ b] --> 1
[b l] --> 1
[u e] --> 1
[b a] --> 2

Several pairs have a count of 2, so we pick one arbitrarily. Let's say it's _da (a space followed by "da"). We add _da as a new token and make replacements in the working list:

[i]
[' m]
[_ b l u e]
[_da b a d e e]
[_da b a da m]

And so on. When does this process stop? When we either run out of pairs (every word consists of a single token) or - more realistically for an actual training corpus - when we reach our desired vocabulary size. For example, the vocabulary used for GPT-4 has around 100,000 tokens (more on this later).

The output of the training process is a vocabulary; let's say we've only run two cycles on our input text as described. The vocabulary will have 258 tokens in it: 256 for the single bytes, one for da and another for _da. Each of these would have a unique integer ID.

In our Go sample code, the training is implemented in this file. You can set the debugTrain variable to true to follow the process on some sample text.
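Independently of the sample repository, the training loop described in this section can be sketched in a few dozen lines. This is a simplified illustration, not the code from the repository: tokens are represented as strings instead of byte slices, the function names are made up for this sketch, and ties between equally common pairs are broken by picking the pair that appears first in the working list (the post only says the choice is arbitrary).

```go
package main

import "fmt"

// pair is an ordered pair of adjacent tokens.
type pair struct{ left, right string }

// splitBytes turns a word into a list of single-byte tokens.
func splitBytes(word string) []string {
	toks := make([]string, len(word))
	for i := range toks {
		toks[i] = word[i : i+1]
	}
	return toks
}

// countPairs counts adjacent pairs over all words. order lists the pairs by
// first appearance, so that ties can be broken deterministically.
func countPairs(words [][]string) (map[pair]int, []pair) {
	counts := make(map[pair]int)
	var order []pair
	for _, w := range words {
		for i := 0; i+1 < len(w); i++ {
			p := pair{w[i], w[i+1]}
			if counts[p] == 0 {
				order = append(order, p)
			}
			counts[p]++
		}
	}
	return counts, order
}

// mergePair replaces every occurrence of p in word with the combined token.
func mergePair(word []string, p pair) []string {
	var out []string
	for i := 0; i < len(word); {
		if i+1 < len(word) && word[i] == p.left && word[i+1] == p.right {
			out = append(out, p.left+p.right)
			i += 2
		} else {
			out = append(out, word[i])
			i++
		}
	}
	return out
}

// trainMerges runs up to n merge rounds, updating words in place, and returns
// the newly learned tokens in the order they were learned.
func trainMerges(words [][]string, n int) []string {
	var learned []string
	for round := 0; round < n; round++ {
		counts, order := countPairs(words)
		if len(order) == 0 {
			break // every word is a single token
		}
		best := order[0]
		for _, p := range order[1:] {
			if counts[p] > counts[best] {
				best = p
			}
		}
		learned = append(learned, best.left+best.right)
		for i, w := range words {
			words[i] = mergePair(w, best)
		}
	}
	return learned
}

func main() {
	var words [][]string
	for _, w := range []string{"i", "'m", " blue", " dabadee", " dabadam"} {
		words = append(words, splitBytes(w))
	}
	fmt.Printf("%q\n", trainMerges(words, 2))
	fmt.Printf("%q\n", words)
}
```

With this tie-breaking rule, two rounds on our example learn "da" and then " da", the same choices made in the walkthrough above.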

Encoding

Having learned a vocabulary, the process of encoding is what happens every time we feed text into an LLM and it needs to be tokenized. The input is arbitrary text, a splitting regexp and a vocabulary. For example, let's take the input text "yada daba". Splitting is performed as before, and the input is broken into individual bytes:

[y a d a]
[_ d a b a]

BPE encoding takes the vocabulary and tries to apply learned tokens to the input text, word by word. The process is greedy - tokens are applied in the same order they've been learned (this is easy to accomplish by assigning monotonically increasing integer IDs to new tokens in the vocabulary, and then prioritizing lower-numbered tokens for encoding).

The first token we learned was da, so let's apply that:

[y a da]
[_ da b a]

The next token we learned was _da:

[y a da]
[_da b a]

This is the final stage; there are no more learned tokens to apply. The result will consist of 6 tokens.

In our sample code, the encoder is in this file.
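As a rough illustration of the procedure just described (and not the code from the repository), the sketch below applies learned merges to each word in the order they were learned. The merge type and function names are assumptions made for this example, and tokens are again plain strings for readability.

```go
package main

import "fmt"

// merge is a learned rule: left followed by right becomes one token.
type merge struct{ left, right string }

// applyMerge replaces every adjacent (left, right) occurrence in word with
// the combined token.
func applyMerge(word []string, m merge) []string {
	var out []string
	for i := 0; i < len(word); {
		if i+1 < len(word) && word[i] == m.left && word[i+1] == m.right {
			out = append(out, m.left+m.right)
			i += 2
		} else {
			out = append(out, word[i])
			i++
		}
	}
	return out
}

// encodeWord splits word into single-byte tokens and then applies the merges
// in the order they were learned.
func encodeWord(word string, merges []merge) []string {
	toks := make([]string, 0, len(word))
	for i := 0; i < len(word); i++ {
		toks = append(toks, word[i:i+1])
	}
	for _, m := range merges {
		toks = applyMerge(toks, m)
	}
	return toks
}

func main() {
	// The two merges learned in the training example: "da", then " da".
	merges := []merge{{"d", "a"}, {" ", "da"}}
	for _, w := range []string{"yada", " daba"} {
		fmt.Printf("%q\n", encodeWord(w, merges))
	}
}
```

A real encoder would finish by mapping each token to its integer ID in the vocabulary; that lookup is omitted here to keep the sketch short.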

Realistic vocabulary and splitting

The examples shown so far have been toys, but the algorithms are real and work with the actual vocabularies and splitters used in modern models. As a case study, the tokenizer used for OpenAI's GPT-4 uses a vocabulary called cl100k_base, which contains 100k tokens in addition to the 256 byte-sized ones. This is also the vocabulary (encoding) the tiktoken library uses. It can be freely downloaded from OpenAI - a copy is available in my sample repository. The token values in the file are base64 encoded, which is easy to decode; once decoded, we'll see tokens like:

" Fritz"  91083
"Initially"  91084
"nodeValue"  91085
"_TRIANGLES"  91086
"-backend"  91087

The token string value is to the left, and the numerical token ID is to the right. As you can see, the algorithm is not particularly discerning about what it learns - names, pieces of code - whatever works!
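Decoding such a file takes only a few lines of Go. The sketch below assumes each line holds a base64-encoded token, a space, and the decimal token ID; that layout is my assumption about the file rather than something shown in this post, and parseVocab is a name made up for the example. The sample input is built inside the program instead of being copied from the real file.

```go
package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

// parseVocab reads lines of the form "<base64 token> <id>" (an assumed layout)
// and returns a map from the decoded token to its ID.
func parseVocab(data string) (map[string]int, error) {
	vocab := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		b64, idStr, ok := strings.Cut(line, " ")
		if !ok {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		tok, err := base64.StdEncoding.DecodeString(b64)
		if err != nil {
			return nil, err
		}
		id, err := strconv.Atoi(idStr)
		if err != nil {
			return nil, err
		}
		vocab[string(tok)] = id
	}
	return vocab, sc.Err()
}

func main() {
	// Build two sample lines locally rather than reading the real file.
	sample := base64.StdEncoding.EncodeToString([]byte(" Fritz")) + " 91083\n" +
		base64.StdEncoding.EncodeToString([]byte("Initially")) + " 91084\n"
	vocab, err := parseVocab(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(vocab[" Fritz"], vocab["Initially"])
}
```

The map goes from token to ID because that's the direction an encoder needs; a decoder would use the inverse mapping.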

The other important data needed to reproduce OpenAI's tokenization is the splitting regexp, which is this:

(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+

It's just a combination of several alternatives. You could use one of the many "regexp explainer" websites out there to study it, or ask a modern LLM, but the gist of it is: this regexp splits space-delimited words, leaving spaces in front of the words, with some special provisions like English contractions (being separate words) and long numbers being split to groups of 3. For Go programmers, it's important to note that this pattern uses ?! - negative lookahead - which the standard regexp package doesn't support. Therefore, we'll have to reach for the 3rd party regexp2 to implement this [3].
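The lookahead limitation is easy to check for yourself. The following sketch compiles two of the alternatives from the splitting regexp with the standard regexp package; compiles is just a helper name chosen for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether the standard regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(compiles(`\s+`))       // plain alternative: accepted
	fmt.Println(compiles(`\s+(?!\S)`)) // negative lookahead: rejected

	// Print the error the standard library reports for the lookahead.
	_, err := regexp.Compile(`\s+(?!\S)`)
	fmt.Println(err)
}
```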

In our sample repository, take a look at this test that ties everything together - it loads the cl100k_base encoding and uses it alongside the splitting regexp to tokenize some real text.

Full online demo with a web UI and WebAssembly

My goal with this project wasn't only to understand the BPE algorithm, but to also try reproducing the actual tokenizer used by OpenAI for its most modern models. And this goal was accomplished!

OpenAI has a nice website here that lets you enter text and see how it's tokenized. I've managed to reproduce this UI - see the cmd/wasm directory in the repository. I've also placed it online - it can run in your browser from here. Here's a screenshot [4]:

How it works: the Go implementation of BPE is compiled to a WebAssembly binary that's loaded from a bit of glue JavaScript embedded in a simple HTML page. The JavaScript watches the text box as you type and sends the string to a Go function exported from the WASM, which tokenizes it on the fly. So we get a nice effect of "tokens updated as we type". The selection button at the bottom also lets us see the numerical IDs for these tokens - they should be equivalent to what tiktoken is producing.

[1]	For simplicity, this post will focus on English. As you'll see, however, the BPE algorithm is language-agnostic.

[2]	There's also a performance implication: if we make tokenization word-oriented, we can easily implement streaming tokenization without depending on previous words.

[3]	I think it would be possible - with a bit of effort - to work around this limitation and stick to the standard library, but just using `regexp2` is simpler, and it's also what tiktoken-go is doing.

[4]	You'll notice that in this example every word (except contractions) is a separate token; this shouldn't be surprising, since these are all very common words and the vocabulary is large! Try playing with it a bit though, giving it longer words (like "discombobulated") or non-trivial variable names from a programming language.

The life of an Ollama prompt

2024-03-06T05:28:00-08:00

In a previous post I described how - thanks to standardized tooling - we could use a locally-running Gemma model from a Go program within hours of its public release.

This post dives into the internals of Ollama - a popular and extremely convenient open-source Go project that makes such workflows possible.

HTTP request to Ollama

Having installed Ollama and run ollama run gemma, we're ready to send HTTP requests to it. There are several ways to do so:

Sending a raw HTTP request with a tool like curl
Using Ollama's own client libraries (currently available in Go, Python and JS)
Using a provider-agnostic client like LangChainGo

For options (2) and (3) see the Appendix; here we'll focus on (1) for simplicity and to remove layers from the explanation.

Let's send an HTTP request to the api/generate endpoint of Ollama with curl:

$ curl http://localhost:11434/api/generate -d '{
  "model": "gemma",
  "prompt": "very briefly, tell me the difference between a comet and a meteor",
  "stream": false
}' | jq .

[...]

{
  "model": "gemma",
  "created_at": "2024-03-04T14:43:51.665311735Z",
  "response": "Sure, here is the difference between a comet and a meteor:

  **Comet:**
  - A celestial object that orbits the Sun in a highly elliptical path.
  - Can be seen as a streak of light in the sky, often with a tail.
  - Comets typically have a visible nucleus, meaning a solid core that
    can be seen from Earth.

  **Meteor:**
  - A streak of hot gas or plasma that appears to move rapidly across the sky.
  - Can be caused by small pieces of rock or dust from space that burn up
    in the atmosphere.
  - Meteors do not have a visible nucleus.",
  "done": true,
  "context":
[...]
}

(The response is JSON and I've reformatted the text for clarity)

Ollama's HTTP API is documented here. For each endpoint, it lists a description of parameters and the data returned.

Ollama service

Ollama itself is a client-server application; when the installation script is run, it does several things:

Download Ollama binary
Place it in $PATH
Run ollama serve as a background service

The service checks the value of the OLLAMA_HOST env var to figure out which host and port to use. The default is port 11434 on localhost (hence you can see our curl request is made to localhost:11434). It then listens on the port, presenting the API discussed above.

What's interesting to note is that when we run ollama run <model> from the command-line, this invokes the Ollama binary in client mode; in this mode, it sends requests to the service using the same API. For example, here are two ways to invoke it - interactive:

$ ollama run gemma
>>> translate naranjo to english
Naranjo translates to Orange in English.

Naranjo is the Spanish word for Orange.

>>> <Ctrl+D>

And piping to stdin:

$ echo "translate naranjo to english" | ollama run gemma
Naranjo translates to Orange in English. Orange is the English word equivalent of the word Naranjo.

In both these cases, the Ollama binary sends an HTTP request to http://localhost:11434/api/generate, just like the one we've made manually with curl.

The `generate` API endpoint

Now that we know where our prompt to Ollama ends up (whether we issue it using an HTTP request or the Ollama command-line tool), let's see what the generate API endpoint actually does.

Ollama uses the Gin web framework, and the API route is fairly standard:

r.POST("/api/generate", GenerateHandler)

This routes HTTP POST requests for /api/generate to a handler function called GenerateHandler, which is defined in the same source file:

func GenerateHandler(c *gin.Context) {
  [...]
}

After parsing and validating the request, GenerateHandler starts by fetching the model the request asked for with the "model" field. It then loads the right model and runs it, feeding it with the prompt provided in the request. The next sections describe these two steps.

Fetching and loading the model

When Ollama is looking for a model (by name), it first checks if it already has it downloaded and stored locally. On my Linux machine, Ollama stores its local cache of models at /usr/share/ollama/.ollama/models/blobs. If the model is already available locally, there's not much to do for this step.

Otherwise, Ollama looks in its online library of models. Specifically, the service makes a request to https://registry.ollama.ai/v2/library/ to check if a model exists. At the time of writing, it's not clear if anyone except the Ollama maintainers can upload new models to the library - but it seems like they're working on this option.

But where do these models come from? As this doc explains, models are imported from other sources in formats like GGUF or Safetensors. The topic of these formats is very interesting, but I won't be covering it in this post; if you're interested, a recent blog post by Vicki Boykis provides useful historic background.

While models can be imported from a variety of formats, Ollama's library stores them as GGUF and that's what the service expects to find.

For the purpose of this explanation, it's sufficient to know that GGUF stores some metadata about the model (e.g. its architecture and parameters, like numbers of layers in different parts, etc) as well as its actual weights. The weights can be stored in different formats - some more suitable for GPUs, some for CPUs. Quantization is common, especially for CPU-oriented models. The model file is usually a giant multi-GiB binary blob that needs to be downloaded and cached locally.

Running the underlying model with a prompt

To run the model, Ollama turns to another project - llama.cpp. llama.cpp arose as a local inference engine for the Llama model when it was originally released. Since the model architecture and weights were published, it became possible to implement inference for the model without relying on full-blown Python ML frameworks like TensorFlow, PyTorch or JAX. It relies on ggml, a separate project by the same author: an efficient C++ library of ML primitives that can run on CPUs and GPUs.

Originally llama.cpp just hard-coded Llama's architecture and loaded the weights, but in time it grew to incorporate additional open-sourced models and its implementation became a kind of a switch based on the model's architecture.

For example, this commit added Gemma support to llama.cpp [1]. Once this is in place, all it needs is to load the weights and some parameterization of the model from its GGUF file and it's ready to go.

llama.cpp is a C++ project that was originally designed as a command-line utility you can use to load models and chat with them. C++ is not known for having a pleasant or stable ABI to work with, so many projects wrapped llama.cpp with a lightweight C ABI in order to create bindings into other languages.

Ollama, as a Go project, did the same. It went a step further though, and cleverly leverages llama.cpp's server sample, which encapsulates all operations in functions that take JSON inputs and return JSON outputs. Ollama added some glue in ext_server, and wrapped it with cgo to be able to invoke llama.cpp inference in-process.

The generate endpoint calls llm.Predict, which after some hops ends up in llama.cpp's request_completion.

Afterword: standard interfaces

In my previous post, I mentioned that the flow works and is easy to set up due to standardized interfaces that have been implemented in OSS projects.

After reading this post about Ollama internals, I hope it's clear what standardized interfaces come into play here.

First and foremost is llama.cpp and its associated GGUF format. While the internals of llama.cpp are somewhat clunky, this project is unapologetically pragmatic and a true boon for the ecosystem because of the way it standardizes LLM inference (and embeddings). Given a model architecture implemented in C++ in the innards of llama.cpp, variations can be easily explored and run on compatible CPUs and GPUs. Slight model modifications? Tuning? Trying some new kind of quantizations? Just create a GGUF file and llama.cpp will run it for you.

The other half of the solution is Ollama, which wraps llama.cpp in a conveniently packaged tool, API and ecosystem [2]. As a Go project, it's easily distributable and makes it trivial to hack on a powerful API server. The REST API it presents can then be leveraged by any tool capable of issuing HTTP requests.

Appendix: Go client libraries for the Ollama API

If you want to use LLMs programmatically from Go through Ollama, the most convenient options are either using Ollama's own Go client library or through LangChainGo. Another option - as discussed above - is to send raw HTTP requests.

The Ollama Go client library is a great option because it's what the Ollama client itself uses to talk to the service; it's as battle-tested and functional as you can hope for. On the other hand, LangChainGo is convenient if you use multiple providers and want code that's consistent and provider-agnostic.

This sample lists Go code to ask Ollama a question using (1) the Ollama Go library or (2) LangChainGo.
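For completeness, here's what the raw HTTP option might look like using only the standard library. Treat it as a sketch: the struct types and function names are made up for this example, and they mirror only the JSON fields visible in the curl request and response shown earlier, not the full API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// generateRequest and generateResponse cover only the fields that appear in
// the curl example; the real API has more.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

type generateResponse struct {
	Model    string `json:"model"`
	Response string `json:"response"`
	Done     bool   `json:"done"`
}

// encodeRequest builds the JSON body of a non-streaming generate request.
func encodeRequest(model, prompt string) ([]byte, error) {
	return json.Marshal(generateRequest{Model: model, Prompt: prompt, Stream: false})
}

// generate posts the prompt to baseURL + "/api/generate" and returns the
// text of the model's reply.
func generate(baseURL, model, prompt string) (string, error) {
	body, err := encodeRequest(model, prompt)
	if err != nil {
		return "", err
	}
	client := &http.Client{Timeout: 5 * time.Minute}
	resp, err := client.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var gr generateResponse
	if err := json.NewDecoder(resp.Body).Decode(&gr); err != nil {
		return "", err
	}
	return gr.Response, nil
}

func main() {
	answer, err := generate("http://localhost:11434", "gemma",
		"very briefly, tell me the difference between a comet and a meteor")
	if err != nil {
		fmt.Println("request failed (is the Ollama service running?):", err)
		return
	}
	fmt.Println(answer)
}
```

Setting "stream" to false keeps the example simple, since the whole reply arrives as a single JSON object, just like in the curl invocation.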

[1]	The Gemma announcement points to this official documentation and implementation - https://github.com/google-deepmind/gemma - it can be used to re-implement Gemma inference, along with the pre-trained model weights Google released.

[2]	Ollama has additional capabilities I haven't mentioned here, like Modelfiles for creating and sharing models.

Gemma, Ollama and LangChainGo

2024-02-22T16:24:00-08:00

Yesterday Google released Gemma - an open LLM that folks can run locally on their machines (similarly to llama2). I was wondering how easy it would be to run Gemma on my computer, chat with it and interact with it from a Go program.

Turns out - thanks to Ollama - it's extremely easy! Gemma was already added to Ollama, so all one has to do is run:

$ ollama run gemma

And wait for a few minutes while the model downloads. From this point on, my previous post about using Ollama locally in Go applies with pretty much no changes. Gemma becomes available through a REST API locally, and can be accessed from ollama-aware libraries like LangChainGo.

I went ahead and added a --model flag to all my code samples from that post, and they can all run with --model gemma now. It all just works, due to the magic of standard interfaces:

Gemma is packaged in a standard interface for inclusion in Ollama
Ollama then presents a standardized REST API for this model, just like it does for other compatible models
LangChainGo has an Ollama provider that lets us write code to interact with any model running through Ollama

So we can write code like:

package main

import (
  "context"
  "flag"
  "fmt"
  "log"

  "github.com/tmc/langchaingo/llms"
  "github.com/tmc/langchaingo/llms/ollama"
)

func main() {
  modelName := flag.String("model", "", "ollama model name")
  flag.Parse()

  llm, err := ollama.New(ollama.WithModel(*modelName))
  if err != nil {
    log.Fatal(err)
  }

  if flag.NArg() < 1 {
    log.Fatal("expected the query as a command-line argument")
  }
  query := flag.Args()[0]
  ctx := context.Background()
  completion, err := llms.GenerateFromSinglePrompt(ctx, llm, query)
  if err != nil {
    log.Fatal(err)
  }

  fmt.Println("Response:\n", completion)
}

And then run it as follows:

$ go run ollama-completion-arg.go --model gemma "what should be added to 91 to make -20?"
Response:
 The answer is -111.

91 + (-111) = -20

Gemma seems relatively fast for a model running on a CPU. Judging by published benchmarks, the default 7B model is much more capable than the default 7B llama2; on my machine it also runs about 30% faster.

Without LangChainGo

While LangChainGo offers a convenient API that's standardized across LLM providers, its use is by no means required for this sample. Ollama itself has a Go API as part of its structure and it can be used externally as well. Here's an equivalent sample that doesn't require LangChainGo:

package main

import (
  "context"
  "flag"
  "fmt"
  "log"

  "github.com/jmorganca/ollama/api"
)

func main() {
  modelName := flag.String("model", "", "ollama model name")
  flag.Parse()

  client, err := api.ClientFromEnvironment()
  if err != nil {
    log.Fatal(err)
  }

  if flag.NArg() < 1 {
    log.Fatal("expected the prompt as a command-line argument")
  }
  req := &api.GenerateRequest{
    Model:  *modelName,
    Prompt: flag.Args()[0],
    Stream: new(bool), // disable streaming
  }

  ctx := context.Background()
  var response string
  respFunc := func(resp api.GenerateResponse) error {
    response = resp.Response
    return nil
  }

  err = client.Generate(ctx, req, respFunc)
  if err != nil {
    log.Fatal(err)
  }

  fmt.Println("Response:\n", response)
}

gemini-cli: Access Gemini models from the command-line

2024-02-21T06:04:00-08:00

This post is about a new command-line tool I've recently built in Go - gemini-cli, and how to use it for LLM-based data analysis with Google's Gemini models.

Background: I've been reading Simon Willison's posts about LLMs with interest, especially his work on tools that leverage LLMs and SQLite to create fun little analysis pipelines for local documents. Since I've recently done some Go work on Google's Gemini SDKs (also in langchaingo) and wrote a couple of blog posts about it, I was interested in creating a similar pipeline for myself using Go and Gemini models. This is how the idea for gemini-cli was born.

The tool

Like any Go command-line tool, gemini-cli is very easy to install:

$ go install github.com/eliben/gemini-cli@latest

And you're good to go! It will want a Gemini API key set in the GEMINI_API_KEY env var or passed with the --key flag. If you don't have an API key yet, you can get one quickly and for free from https://ai.google.dev/

The motivating task

For a while I've been interested in adding a "related posts" feature to my blog. It was clear that I'd want to use embeddings to convert my posts to vector space and then use vector similarity to find related posts. Check out my earlier post on RAG for additional information on these techniques.

Before starting to write the code, however, I wanted to experiment with a command-line tool so I could rapidly prototype. Think of it as crafting some text processing pipeline from classical Unix command-line tools before trying to implement it in a programming language. gemini-cli excels at precisely such prototyping.

Finding related posts

Let's see how to use gemini-cli for my task. I have access to the contents of my blog posts on the file system as a large bunch of reStructuredText and HTML files. These are private, but you're free to replicate this experiment for any collection of textual documents you have handy. It will even work on programming language source code!

Let's first get the lay of the land - how many files are there [1]?

$ pss -f --rst content/|wc -l
279
$ pss -f --html content/|wc -l
1064

OK, so a bit over 1300 overall. Let's start by computing the embeddings for the reST files. We'll ask gemini-cli to write it into a new SQLite DB called blogemb.db, using its embed db subcommand:

$ export GEMINI_API_KEY=...
$ gemini-cli embed db blogemb.db --files content/,"*.rst"
Found 279 values to embed
Splitting to 9 batches
Embedding batch #1 / 9, size=32
Embedding batch #2 / 9, size=32
Embedding batch #3 / 9, size=32
Embedding batch #4 / 9, size=32
Embedding batch #5 / 9, size=32
Embedding batch #6 / 9, size=32
Embedding batch #7 / 9, size=32
Embedding batch #8 / 9, size=32
Embedding batch #9 / 9, size=23
Collected 279 embeddings; inserting into table embeddings

Let's look at the DB file using the sqlite3 command-line tool:

$ sqlite3 blogemb.db
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.

sqlite> .tables
embeddings

sqlite> .schema
CREATE TABLE embeddings (
id TEXT PRIMARY KEY,
embedding BLOB
);

sqlite> select count(*) from embeddings;
279

sqlite> select id, length(embedding) from embeddings limit 10;
content/2014/blogging-setup-with-pelican.rst|3072
content/2014/c++-perfect-forwarding-and-universal-references.rst|3072
content/2014/derivation-normal-equation-linear-regression.rst|3072
content/2014/goodbye-wordpress.rst|3072
content/2014/highlight-tab-gnome-terminal.rst|3072
content/2014/meshgrids-and-disambiguating-rows-and-columns-from-cartesian-coordinates.rst|3072
content/2014/samples-for-llvm-clang-library.rst|3072
content/2014/sfinae-and-enable-if.rst|3072
content/2014/summary-of-reading-july-september-2014.rst|3072
content/2014/summary-of-reading-october-december-2014.rst|3072

As expected, we see 279 entries in the table; for each row the id column value is the path of the file and embedding contains the embedding as a blob. Embeddings are returned by the model as arrays of 32-bit floats, and gemini-cli encodes them into a blob as follows:

// encodeEmbedding encodes an embedding into a byte buffer, e.g. for DB
// storage as a blob.
func encodeEmbedding(emb []float32) []byte {
  buf := new(bytes.Buffer)
  for _, f := range emb {
    err := binary.Write(buf, binary.LittleEndian, f)
    if err != nil {
      panic(err)
    }
  }
  return buf.Bytes()
}

Each float32 thus occupies 4 bytes; since our DB blobs are 3072 bytes long, we can infer that each embedding vector has 768 elements; the embedding model projects our text into 768-dimensional space [2]!
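To read such a blob back into a vector, the encoding has to be reversed. Here is a minimal sketch of the inverse; `decodeEmbedding` is a hypothetical name used for illustration, not necessarily how gemini-cli does it, and it assumes the same little-endian layout produced by `encodeEmbedding`:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeEmbedding mirrors the encoder shown above, repeated here so that
// the sketch is self-contained.
func encodeEmbedding(emb []float32) []byte {
	buf := new(bytes.Buffer)
	for _, f := range emb {
		if err := binary.Write(buf, binary.LittleEndian, f); err != nil {
			panic(err)
		}
	}
	return buf.Bytes()
}

// decodeEmbedding is a hypothetical inverse of encodeEmbedding: it
// interprets the blob as consecutive little-endian float32 values.
func decodeEmbedding(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of 4", len(b))
	}
	emb := make([]float32, len(b)/4)
	if err := binary.Read(bytes.NewReader(b), binary.LittleEndian, emb); err != nil {
		return nil, err
	}
	return emb, nil
}

func main() {
	blob := encodeEmbedding(make([]float32, 768))
	emb, err := decodeEmbedding(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(blob), len(emb)) // prints "3072 768"
}
```

The length check is the same arithmetic as in the text: a 3072-byte blob divided by 4 bytes per element yields 768 elements.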

Back to our task, though. Note that gemini-cli uses the batch-embedding API of Gemini under the hood, so it's efficient for large input corpora. We can control the batch size with a flag; just for fun, let's do this when embedding the HTML files since there are so many of them:

$ gemini-cli embed db blogemb.db --batch-size=64 --files content/,"*.html"
Found 1064 values to embed
Splitting to 17 batches
Embedding batch #1 / 17, size=64
Embedding batch #2 / 17, size=64
Embedding batch #3 / 17, size=64
Embedding batch #4 / 17, size=64
Embedding batch #5 / 17, size=64
Embedding batch #6 / 17, size=64
Embedding batch #7 / 17, size=64
Embedding batch #8 / 17, size=64
Embedding batch #9 / 17, size=64
Embedding batch #10 / 17, size=64
Embedding batch #11 / 17, size=64
Embedding batch #12 / 17, size=64
Embedding batch #13 / 17, size=64
Embedding batch #14 / 17, size=64
Embedding batch #15 / 17, size=64
Embedding batch #16 / 17, size=64
Embedding batch #17 / 17, size=40
Collected 1064 embeddings; inserting into table embeddings

A brief note on performance: with a batch size of 64, this process took only 17 seconds - not bad for over a thousand documents. In the future I plan to improve this time further with more concurrency and smarter batch size selection [3].
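The batch counts and sizes in the logs above follow from simple chunking arithmetic. As a rough sketch (the helper `batchSizes` is hypothetical and only illustrates the splitting; it is not taken from gemini-cli's source):

```go
package main

import "fmt"

// batchSizes returns the size of each batch when n items are split into
// consecutive batches of at most batchSize items. Hypothetical helper
// for illustration only.
func batchSizes(n, batchSize int) []int {
	if batchSize <= 0 {
		return nil
	}
	var sizes []int
	for start := 0; start < n; start += batchSize {
		end := start + batchSize
		if end > n {
			end = n
		}
		sizes = append(sizes, end-start)
	}
	return sizes
}

func main() {
	rst := batchSizes(279, 32)
	fmt.Println(len(rst), rst[len(rst)-1]) // prints "9 23"

	html := batchSizes(1064, 64)
	fmt.Println(len(html), html[len(html)-1]) // prints "17 40"
}
```

This matches both runs: 279 values in batches of 32 give 8 full batches plus one of 23, and 1064 values in batches of 64 give 16 full batches plus one of 40.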

Let's examine the resulting SQLite DB with all the embeddings:

$ stat -c %s blogemb.db
5627904
$ echo "select count(*) from embeddings" | sqlite3 blogemb.db
1343

All 1343 entries have made it into the embeddings table, and the total size of the DB is just over 5 MiB.

Now we're ready to look for related posts. The embed similar subcommand takes the name of a SQLite DB that holds all embeddings (like the one we've just created) and a string of content to compare; it also accepts `-` as an indication that the input content will be piped through standard input, so let's use that:

$ gemini-cli embed similar blogemb.db - < content/2023/better-http-server-routing-in-go-122.rst
{"id":"content/2023/better-http-server-routing-in-go-122.rst","score":"1.0000001"}
{"id":"content/2021/rest-servers-in-go-part-2-using-a-router-package.rst","score":"0.8904768"}
{"id":"content/2021/life-of-an-http-request-in-a-go-server.rst","score":"0.83037585"}
{"id":"content/2021/rest-servers-in-go-part-5-middleware.rst","score":"0.8136583"}
{"id":"content/2022/serving-static-files-and-web-apps-in-go.rst","score":"0.7732284"}

The output is in the JSON Lines format, and by default prints the ID and the similarity score (using cosine similarity), sorted by decreasing similarity. Unsurprisingly, the most similar post is... itself, with a similarity score of 1.0 (up to floating-point rounding).

The results look pretty good! The most similar posts found are indeed very relevant to the one we were asking about. For fun, let's try a book review, this time with a larger list of output candidates (using the --topk flag):

$ gemini-cli embed similar blogemb.db --topk=10 - < content/2011/book-review-the-voyage-of-the-beagle-by-charles-darwin.html
{"id":"content/2011/book-review-the-voyage-of-the-beagle-by-charles-darwin.html","score":"1"}
{"id":"content/2008/book-review-the-origin-of-species-by-charles-darwin.html","score":"0.80570847"}
{"id":"content/2011/summary-of-reading-april-june-2011.html","score":"0.7939675"}
{"id":"content/2006/book-review-the-selfish-gene-by-richard-dawkins.html","score":"0.7845073"}
{"id":"content/2005/book-review-around-the-world-in-80-days-by-jules-verne.html","score":"0.7792236"}
{"id":"content/2004/book-review-a-short-history-of-nearly-by-bill-bryson.html","score":"0.7784306"}
{"id":"content/2005/book-review-the-double-helix-by-james-watson.html","score":"0.7658307"}
{"id":"content/2008/book-review-after-tamerlane-by-john-darwin.html","score":"0.7641713"}
{"id":"content/2005/book-review-mysterious-island-by-jules-verne.html","score":"0.7605505"}
{"id":"content/2008/book-review-the-adventures-of-tom-sawyer-by-mark-twain.html","score":"0.75610566"}

What's next

For my task, I now have the basic information available to implement it, and all the infrastructure for running experiments; with gemini-cli in hand, this took less than 5 minutes. All I needed to do was write the tool :-)

I really enjoyed building gemini-cli; it's true to the spirit of simple, textual Unix CLIs that can be easily combined together through pipes. Using SQLite as the storage and retrieval format is also quite pleasant, and provides interoperability for free.

For you - if you're a Go developer interested in building stuff with LLMs and getting started for free - I hope you find gemini-cli useful. I've only shown its embed * subcommands, but the CLI also lets you chat with an LLM through the terminal, query the API for various model details, and everything is configurable with extra flags.

It's open-source, of course; the README file rendered on GitHub has extensive documentation, and more is available by running gemini-cli help. Try it, ask questions, open issues!

[1]	I like using pss, but feel free to use your favorite tools - `git grep`, `ag` or just a concoction of `find` and `grep`.

[2]	A word of caution: LLMs have limited context window sizes; for embeddings, if the input is larger than the model's context window it may get truncated - so it's the user's responsibility to ensure that input documents are properly sized.

gemini-cli will report the maximal number of input tokens for supported models when you invoke the gemini-cli models command.

[3]	We have to be careful with too much parallelism, because at the free tier the Gemini SDK may be rate-limited.
