<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Eli Bendersky's website - Machine Learning</title><link href="https://eli.thegreenplace.net/" rel="alternate"></link><link href="https://eli.thegreenplace.net/feeds/machine-learning.atom.xml" rel="self"></link><id>https://eli.thegreenplace.net/</id><updated>2026-02-05T03:38:39-08:00</updated><entry><title>Rewriting pycparser with the help of an LLM</title><link href="https://eli.thegreenplace.net/2026/rewriting-pycparser-with-the-help-of-an-llm/" rel="alternate"></link><published>2026-02-04T19:35:00-08:00</published><updated>2026-02-05T03:38:39-08:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2026-02-04:/2026/rewriting-pycparser-with-the-help-of-an-llm/</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://github.com/eliben/pycparser"&gt;pycparser&lt;/a&gt; is my most widely used open
source project (with ~20M daily downloads from PyPI &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;). It's a pure-Python
parser for the C programming language, producing ASTs inspired by &lt;a class="reference external" href="https://docs.python.org/3/library/ast.html"&gt;Python's
own&lt;/a&gt;. Until very recently, it's
been using &lt;a class="reference external" href="https://www.dabeaz.com/ply/ply.html"&gt;PLY: Python Lex-Yacc&lt;/a&gt; for
the core parsing.&lt;/p&gt;
&lt;p&gt;In this post, I'll describe how …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://github.com/eliben/pycparser"&gt;pycparser&lt;/a&gt; is my most widely used open
source project (with ~20M daily downloads from PyPI &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;). It's a pure-Python
parser for the C programming language, producing ASTs inspired by &lt;a class="reference external" href="https://docs.python.org/3/library/ast.html"&gt;Python's
own&lt;/a&gt;. Until very recently, it's
been using &lt;a class="reference external" href="https://www.dabeaz.com/ply/ply.html"&gt;PLY: Python Lex-Yacc&lt;/a&gt; for
the core parsing.&lt;/p&gt;
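&lt;p&gt;To make the rest of the post concrete, here is a minimal sketch of how
pycparser is typically used - parsing a small, already-preprocessed C snippet
and dumping the resulting AST (an illustrative example, not an excerpt from the
project's documentation):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from pycparser import c_parser

src = '''
int add(int x, int y) {
    return x + y;
}
'''

parser = c_parser.CParser()
ast = parser.parse(src)   # returns a FileAST node
ast.show()                # dumps the tree: FuncDef, Decl, Return, BinaryOp, ...
&lt;/pre&gt;&lt;/div&gt;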
&lt;p&gt;In this post, I'll describe how I collaborated with an LLM coding agent (Codex)
to help me rewrite pycparser to use a hand-written recursive-descent parser and
remove the dependency on PLY. This has been an interesting experience and the
post contains lots of information and is therefore quite long; if you're just
interested in the final result, check out the latest code of pycparser - the
&lt;tt class="docutils literal"&gt;main&lt;/tt&gt; branch already has the new implementation.&lt;/p&gt;
&lt;img alt="meme picture saying &amp;quot;can't come to bed because my AI agent produced something slightly wrong&amp;quot;" class="align-center" src="https://eli.thegreenplace.net/images/2026/cantcometobed.png" /&gt;
&lt;div class="section" id="the-issues-with-the-existing-parser-implementation"&gt;
&lt;h2&gt;The issues with the existing parser implementation&lt;/h2&gt;
&lt;p&gt;While pycparser has been working well overall, there were a number of nagging
issues that persisted over years.&lt;/p&gt;
&lt;div class="section" id="parsing-strategy-yacc-vs-hand-written-recursive-descent"&gt;
&lt;h3&gt;Parsing strategy: YACC vs. hand-written recursive descent&lt;/h3&gt;
&lt;p&gt;I began working on pycparser in 2008, and back then using a YACC-based approach
for parsing a whole language like C seemed like a no-brainer to me. Isn't this
what everyone does when writing a serious parser? Besides, the K&amp;amp;R2 book
famously carries the entire grammar of ANSI C in an appendix - so it
seemed like a simple matter of translating that to PLY-yacc syntax.&lt;/p&gt;
&lt;p&gt;And indeed, it wasn't &lt;em&gt;too&lt;/em&gt; hard, though there definitely were some complications
in building the ASTs for declarations (C's &lt;a class="reference external" href="https://eli.thegreenplace.net/2008/10/18/implementing-cdecl-with-pycparser"&gt;gnarliest part&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Shortly after completing pycparser, I got more and more interested in compilation
and started learning about the different kinds of parsers more seriously. Over
time, I grew convinced that &lt;a class="reference external" href="https://eli.thegreenplace.net/tag/recursive-descent-parsing"&gt;recursive descent&lt;/a&gt; is the way to
go - producing parsers that are easier to understand and maintain (and are often
faster!).&lt;/p&gt;
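&lt;p&gt;For readers who haven't written one: a recursive-descent parser is just a set
of mutually-recursive functions, one per grammar rule, each consuming tokens and
returning an AST node. Here is a tiny self-contained sketch for a toy grammar of
additive expressions - purely illustrative, and unrelated to pycparser's actual
code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Toy grammar:  expr := term (('+' | '-') term)*   ;   term := NUMBER
import re

def tokenize(s):
    return re.findall(r'\d+|[+\-]', s)

def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos &amp;lt; len(tokens) and tokens[pos] in '+-':
        op = tokens[pos]
        rhs, pos = parse_term(tokens, pos + 1)
        node = (op, node, rhs)
    return node, pos

def parse_term(tokens, pos):
    return int(tokens[pos]), pos + 1

print(parse_expr(tokenize('12+34-5'))[0])   # ('-', ('+', 12, 34), 5)
&lt;/pre&gt;&lt;/div&gt;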
&lt;p&gt;It all ties in to the &lt;a class="reference external" href="https://eli.thegreenplace.net/2017/benefits-of-dependencies-in-software-projects-as-a-function-of-effort/"&gt;benefits of dependencies in software projects as a
function of effort&lt;/a&gt;.
Using parser generators is a heavy &lt;em&gt;conceptual&lt;/em&gt; dependency: it's really nice
when you have to churn out many parsers for small languages. But when you have
to maintain a single, very complex parser, as part of a large project - the
benefits quickly dissipate and you're left with a substantial dependency that
you constantly grapple with.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-other-issue-with-dependencies"&gt;
&lt;h3&gt;The other issue with dependencies&lt;/h3&gt;
&lt;p&gt;And then there are the usual problems with dependencies; dependencies get
abandoned, and they may also develop security issues. Sometimes, both of these
become true.&lt;/p&gt;
&lt;p&gt;Many years ago, pycparser forked and started vendoring its own version of PLY.
This was part of transitioning pycparser to a dual Python 2/3 code base when PLY
was slower to adapt. I believe this was the right decision, since PLY &amp;quot;just
worked&amp;quot; and I didn't have to deal with active (and very tedious in the Python
ecosystem, where packaging tools are replaced faster than dirty socks)
dependency management.&lt;/p&gt;
&lt;p&gt;A couple of weeks ago &lt;a class="reference external" href="https://github.com/eliben/pycparser/issues/588"&gt;this issue&lt;/a&gt;
was opened for pycparser. It turns out that some old PLY code triggers security
checks used by some Linux distributions; while this code was fixed in a later
commit of PLY, PLY itself was apparently abandoned and archived in late 2025.
And guess what? That happened in the middle of a large rewrite of the package,
so re-vendoring the pre-archiving commit seemed like a risky proposition.&lt;/p&gt;
&lt;p&gt;On the issue it was suggested that &amp;quot;hopefully the dependent packages move on to
a non-abandoned parser or implement their own&amp;quot;; I originally laughed this idea
off, but then it got me thinking... which is what this post is all about.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="growing-complexity-of-parsing-a-messy-language"&gt;
&lt;h3&gt;Growing complexity of parsing a messy language&lt;/h3&gt;
&lt;p&gt;The original K&amp;amp;R2 grammar had - famously - a single shift-reduce
conflict having to do with dangling &lt;tt class="docutils literal"&gt;else&lt;/tt&gt;s belonging to the most recent
&lt;tt class="docutils literal"&gt;if&lt;/tt&gt; statement. And indeed, other than the famous &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Lexer_hack"&gt;lexer hack&lt;/a&gt;
used to deal with &lt;a class="reference external" href="https://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-cs-grammar-revisited"&gt;C's type name / ID ambiguity&lt;/a&gt;,
pycparser only had this single shift-reduce conflict.&lt;/p&gt;
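&lt;p&gt;To illustrate the dangling &lt;tt class="docutils literal"&gt;else&lt;/tt&gt;: in the snippet below the
&lt;tt class="docutils literal"&gt;else&lt;/tt&gt; could syntactically belong to either &lt;tt class="docutils literal"&gt;if&lt;/tt&gt;, and the
grammar resolves it in favor of the nearest one. One way to see this is to parse
it with pycparser and look at the resulting AST (an illustrative sketch):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from pycparser import c_parser

src = '''
void f(int a, int b) {
    if (a)
        if (b)
            a = 1;
        else
            a = 2;
}
'''
ast = c_parser.CParser().parse(src)
ast.show()   # the inner If node carries both branches; the outer If has no else
&lt;/pre&gt;&lt;/div&gt;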
&lt;p&gt;But things got more complicated. Over the years, features were added that
weren't strictly in the standard but were supported by all the industrial
compilers. The more advanced C11 and C23 standards weren't beholden to the
promises of conflict-free YACC parsing (since almost no industrial-strength
compilers use YACC at this point), so all caution went out of the window.&lt;/p&gt;
&lt;p&gt;The latest (PLY-based) release of pycparser has many reduce-reduce conflicts
&lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;; these are a severe maintenance hazard because it means the parsing rules
essentially have to be tie-broken by order of appearance in the code. This is
very brittle; pycparser has only managed to maintain its stability and quality
through its comprehensive test suite. Over time, it became harder and harder to
extend, because YACC parsing rules have all kinds of spooky-action-at-a-distance
effects. The straw that broke the camel's back was &lt;a class="reference external" href="https://github.com/eliben/pycparser/pull/590"&gt;this PR&lt;/a&gt; which again proposed to
increase the number of reduce-reduce conflicts &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This - again - prompted me to think &amp;quot;what if I just dump YACC and switch to
a hand-written recursive descent parser&amp;quot;, and here we are.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-mental-roadblock"&gt;
&lt;h2&gt;The mental roadblock&lt;/h2&gt;
&lt;p&gt;None of the challenges described above are new; I've been pondering them for
many years now, and yet biting the bullet and rewriting the parser didn't feel
like something I'd like to get into. By my private estimates it'd take at least
a week of deep heads-down work to port the gritty 2000 lines of YACC grammar
rules to a recursive descent parser &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;. Moreover, it wouldn't be a
particularly &lt;em&gt;fun&lt;/em&gt; project either - I didn't feel like I'd learn much new and
my interests have shifted away from this project. In short, the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Potential_well"&gt;potential well&lt;/a&gt; was just too deep.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="why-would-this-even-work-tests"&gt;
&lt;h2&gt;Why would this even work? Tests&lt;/h2&gt;
&lt;p&gt;I've definitely noticed the improvement in capabilities of LLM coding
agents in the past few months, and many reputable people online rave about using
them for increasingly larger projects. That said, would an LLM agent really be
able to accomplish such a complex project on its own? This isn't just a toy;
it's thousands of lines of dense parsing code.&lt;/p&gt;
&lt;p&gt;What gave me hope is the concept of &lt;a class="reference external" href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-year-of-conformance-suites"&gt;conformance suites mentioned by
Simon Willison&lt;/a&gt;.
Agents seem to do well when there's a very clear and rigid
goal function - such as a large, high-coverage conformance test suite.&lt;/p&gt;
&lt;p&gt;And pycparser has a &lt;a class="reference external" href="https://github.com/eliben/pycparser/blob/main/tests/test_c_parser.py"&gt;very extensive one&lt;/a&gt;.
Over 2500 lines of test code parsing various C snippets to ASTs with expected
results, grown over a decade and a half of real issues and bugs reported by
users.&lt;/p&gt;
&lt;p&gt;I figured the LLM can either succeed or fail and throw its hands up in despair,
but it's quite unlikely to produce a &lt;em&gt;wrong&lt;/em&gt; port that would still pass all
the tests. So I set it to run.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-initial-port"&gt;
&lt;h2&gt;The initial port&lt;/h2&gt;
&lt;p&gt;I fired up Codex in pycparser's repository, and wrote this prompt just to
make sure it understands me and can run the tests:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;This repository implements a Python parser producing an AST. It has a very
extensive test suite; can you first verify you successfully run all the tests?
(with python3 -m unittest discover, from the root dir)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Codex figured it out (I gave it the exact command, after all!); my next
prompt was the real thing &lt;a class="footnote-reference" href="#footnote-5" id="footnote-reference-5"&gt;[5]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pycparser uses the PLY library for its lexer and parser, and its own AST
package for the ASTs created by the parser. I want to explore replacing the
Parser component (leaving the Lexer alone for now) with a hand written
recursive descent parser that will pass all the same tests. Can you try that?
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here Codex went to work and churned &lt;em&gt;for over an hour&lt;/em&gt;. Having never
observed an agent work for nearly this long, I kind of assumed it had gone off
the rails and would fail sooner or later. So I was rather surprised and skeptical
when it eventually came back with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Replaced the PLY parse path with a handwritten recursive‑descent parser while
keeping the existing lexer and AST helpers intact; CParser now points to the
new RD implementation, with the old PLY version preserved as PLYCParser in
pycparser/c_parser.py. The RD parser includes a token stream, declarator/type
parsing, expression precedence handling, pragmas, and statement parsing to
match the previous AST output and coordinate behavior.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It took me a while to poke around the code and run it until I was convinced - it
had actually done it! It wrote a new recursive descent parser with only
ancillary dependencies on PLY, and that parser passed the test suite. After a
few more prompts, we removed the ancillary dependencies and made the
structure clearer. I hadn't looked too deeply into code quality at this point,
but at least on the functional level - it succeeded. This was very impressive!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-quick-note-on-reviews-and-branches"&gt;
&lt;h2&gt;A quick note on reviews and branches&lt;/h2&gt;
&lt;p&gt;A change like the one described above is impossible to code-review as one PR in
any meaningful way, so I used a different strategy. Before embarking on this
path, I created a new branch; once Codex finished the initial rewrite, I
committed the change, knowing that I would review it in detail, piece by piece,
later on.&lt;/p&gt;
&lt;p&gt;Even though coding agents have their own notion of history and can &amp;quot;revert&amp;quot;
certain changes, I felt much safer relying on Git. In the worst case, if all of
this went south, I could nuke the branch and it would be as if nothing ever happened.
I was determined to only merge this branch onto &lt;tt class="docutils literal"&gt;main&lt;/tt&gt; once I was fully
satisfied with the code. In what follows, I had to &lt;tt class="docutils literal"&gt;git reset&lt;/tt&gt; several times
when I didn't like the direction in which Codex was going. In hindsight, doing
this work in a branch was absolutely the right choice.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-long-tail-of-goofs"&gt;
&lt;h2&gt;The long tail of goofs&lt;/h2&gt;
&lt;p&gt;Once I'd sufficiently convinced myself that the new parser was actually working,
I used Codex to similarly rewrite the lexer and get rid of the PLY dependency
entirely, deleting it from the repository. Then, I started looking more deeply
into code quality - reading the code created by Codex and trying to wrap my head
around it.&lt;/p&gt;
&lt;p&gt;And - oh my - this was quite the journey. Much has been written about the code
produced by agents, and much of it seems to be true. Maybe it's a setting I'm
missing (I'm not using my own custom &lt;tt class="docutils literal"&gt;AGENTS.md&lt;/tt&gt; yet, for instance), but
Codex seems to be that eager programmer that wants to get from A to B whatever
the cost. Readability, minimalism and code clarity are very much secondary
goals.&lt;/p&gt;
&lt;p&gt;Using &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;raise...except&lt;/span&gt;&lt;/tt&gt; for control flow? Yep. Abusing Python's dynamic typing
(like having &lt;tt class="docutils literal"&gt;None&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;False&lt;/tt&gt; and other values all mean different things
for a given variable)? For sure. Spreading the logic of a complex function
all over the place instead of putting all the key parts in a single switch
statement? You bet.&lt;/p&gt;
&lt;p&gt;Moreover, the agent is hilariously &lt;em&gt;lazy&lt;/em&gt;. More than once I had to convince it
to do something it initially said was impossible, and then insisted was
impossible again in follow-up messages. The anthropomorphization here is mildly concerning, to be
honest. I could never imagine I would be writing something like the following to
a computer, and yet - here we are: &amp;quot;Remember how we moved X to Y before? You
can do it again for Z, definitely. Just try&amp;quot;.&lt;/p&gt;
&lt;p&gt;My process was to see how I can instruct Codex to fix things, and intervene
myself (by rewriting code) as little as possible. I've &lt;em&gt;mostly&lt;/em&gt; succeeded in
this, and did maybe 20% of the work myself.&lt;/p&gt;
&lt;p&gt;My branch grew &lt;em&gt;dozens&lt;/em&gt; of commits, falling into roughly these categories:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The code in X is too complex; why can't we do Y instead?&lt;/li&gt;
&lt;li&gt;The use of X is needlessly convoluted; change Y to Z, and T to V in all
instances.&lt;/li&gt;
&lt;li&gt;The code in X is unclear; please add a detailed comment - with examples - to
explain what it does.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Interestingly, after doing (3), the agent was often more effective in giving
the code a &amp;quot;fresh look&amp;quot; and succeeding in either (1) or (2).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-end-result"&gt;
&lt;h2&gt;The end result&lt;/h2&gt;
&lt;p&gt;Eventually, after many hours spent in this process, I was reasonably pleased
with the code. It's far from perfect, of course, but taking the essential
complexities into account, it's something I could see myself maintaining (with
or without the help of an agent). I'm sure I'll find more ways to improve it
in the future, but I have a reasonable degree of confidence that this will be
doable.&lt;/p&gt;
&lt;p&gt;It passes all the tests, so I've been able to release a new version (3.00)
without major issues so far. The only issue I've discovered is that some of
CFFI's tests are overly precise about the phrasing of errors reported by
pycparser; this was &lt;a class="reference external" href="https://github.com/python-cffi/cffi/pull/224"&gt;an easy fix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new parser is also faster, by about 30% based on my benchmarks! This is
typical of recursive descent when compared with YACC-generated parsers, in my
experience. After reviewing the initial rewrite of the lexer, I've spent a while
instructing Codex on how to make it faster, and it worked reasonably well.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="followup-static-typing"&gt;
&lt;h2&gt;Followup - static typing&lt;/h2&gt;
&lt;p&gt;While working on this, it became quite obvious that static typing would make the
process easier. LLM coding agents really benefit from closed loops with strict
guardrails (e.g. a test suite to pass), and type-annotations act as such.
For example, had pycparser already been type annotated, Codex would probably not
have overloaded values to multiple types (like &lt;tt class="docutils literal"&gt;None&lt;/tt&gt; vs. &lt;tt class="docutils literal"&gt;False&lt;/tt&gt; vs.
others).&lt;/p&gt;
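&lt;p&gt;As a hypothetical illustration (this is not actual pycparser code): with an
annotation like the one below, a checker such as &lt;tt class="docutils literal"&gt;ty&lt;/tt&gt; or mypy
would reject a &lt;tt class="docutils literal"&gt;return False&lt;/tt&gt; where &lt;tt class="docutils literal"&gt;None&lt;/tt&gt; is expected,
which is exactly the kind of value-overloading described above; the agent gets
corrective feedback without a human in the loop.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from typing import List, Optional

# Hypothetical helper: return the qualifiers found in a token list, or None if
# there are none. The annotation rules out returning False as a third meaning.
def declared_quals(tokens: List[str]) -&amp;gt; Optional[List[str]]:
    quals = [t for t in tokens if t in ('const', 'volatile', 'restrict')]
    return quals or None
&lt;/pre&gt;&lt;/div&gt;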
&lt;p&gt;In a followup, I asked Codex to type-annotate pycparser (running checks using
&lt;tt class="docutils literal"&gt;ty&lt;/tt&gt;), and this was also a back-and-forth because the process exposed some
issues that required refactoring. Time will tell, but hopefully it will make
further changes in the project simpler for the agent.&lt;/p&gt;
&lt;p&gt;Based on this experience, I'd bet that coding agents will be somewhat more
effective in strongly typed languages like Go, TypeScript and especially Rust.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="conclusions"&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Overall, this project has been a really good experience, and I'm impressed with
what modern LLM coding agents can do! While there's no reason to expect that
progress in this domain will stop, even if it does - these are already very
useful tools that can significantly improve programmer productivity.&lt;/p&gt;
&lt;p&gt;Could I have done this myself, without an agent's help? Sure. But it would have
taken me &lt;em&gt;much&lt;/em&gt; longer, assuming that I could even muster the will and
concentration to engage in this project. I estimate it would take me at least
a week of full-time work (so 30-40 hours) spread over who knows how long to
accomplish. With Codex, I put in an order of magnitude less work into this
(around 4-5 hours, I'd estimate) and I'm happy with the result.&lt;/p&gt;
&lt;p&gt;It was also &lt;em&gt;fun&lt;/em&gt;. At least in one sense, my professional life can be described
as the pursuit of focus, deep work and &lt;em&gt;flow&lt;/em&gt;. It's not easy for me to get into
this state, but when I do I'm highly productive and find it very enjoyable.
Agents really help me here. When I know I need to write some code and it's
hard to get started, asking an agent to write a prototype is a great catalyst
for my motivation. Hence the meme at the beginning of the post.&lt;/p&gt;
&lt;div class="section" id="does-code-quality-even-matter"&gt;
&lt;h3&gt;Does code quality even matter?&lt;/h3&gt;
&lt;p&gt;One can't avoid a nagging question - does the quality of the code produced
by agents even matter? Clearly, the agents themselves can understand it (if not
today's agent, then at least next year's). Why worry about future
maintainability if the agent can maintain it? In other words, does it make sense
to just go full vibe-coding?&lt;/p&gt;
&lt;p&gt;This is a fair question, and one I don't have an answer to. Right now, for
projects I maintain and &lt;em&gt;stand behind&lt;/em&gt;, it seems obvious to me that the code
should be fully understandable and accepted by me, and the agent is just a tool
helping me get to that state more efficiently. It's hard to say what the future
holds here; it's going to be interesting, for sure.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;pycparser has a fair number of &lt;a class="reference external" href="https://deps.dev/pypi/pycparser/3.0.0/dependents"&gt;direct dependents&lt;/a&gt;,
but the majority of downloads comes through &lt;a class="reference external" href="https://github.com/python-cffi/cffi"&gt;CFFI&lt;/a&gt;,
which itself is a major building block for much of the Python ecosystem.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The table-building report says 177, but that's certainly an
over-dramatization because it's common for a single conflict to
manifest in several ways.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It didn't help the PR's case that it was almost certainly vibe coded.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;p class="first"&gt;There was also the lexer to consider, but this seemed like a much
simpler job. My impression is that in the early days of computing,
&lt;tt class="docutils literal"&gt;lex&lt;/tt&gt; gained prominence because of strong regexp support which wasn't
very common yet. These days, with excellent regexp libraries
existing for pretty much every language, the added value of &lt;tt class="docutils literal"&gt;lex&lt;/tt&gt; over
a &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/06/25/regex-based-lexical-analysis-in-python-and-javascript"&gt;custom regexp-based lexer&lt;/a&gt;
isn't very high.&lt;/p&gt;
&lt;p class="last"&gt;That said, it wouldn't make much sense to embark on a journey to rewrite
&lt;em&gt;just&lt;/em&gt; the lexer; the dependency on PLY would still remain, and besides,
PLY's lexer and parser are designed to work well together. So it wouldn't
help me much without tackling the parser beast.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-5"&gt;[5]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;I decided to ask it to port the parser first, leaving the lexer
alone. This was to split the work into reasonable chunks. Besides, I
figured that the parser was the harder job anyway - if it succeeded there,
the lexer would be easy. That assumption turned out to be correct.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Python"></category><category term="Machine Learning"></category><category term="Compilation"></category><category term="Recursive descent parsing"></category></entry><entry><title>Sparsely-gated Mixture Of Experts (MoE)</title><link href="https://eli.thegreenplace.net/2025/sparsely-gated-mixture-of-experts-moe/" rel="alternate"></link><published>2025-04-18T09:33:00-07:00</published><updated>2025-04-18T16:33:37-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2025-04-18:/2025/sparsely-gated-mixture-of-experts-moe/</id><summary type="html">&lt;p&gt;In &lt;a class="reference external" href="https://arxiv.org/pdf/1706.03762"&gt;transformer models&lt;/a&gt;, the
&lt;a class="reference external" href="https://eli.thegreenplace.net/2025/notes-on-implementing-attention/"&gt;attention block&lt;/a&gt;
is typically followed by a &lt;em&gt;feed forward&lt;/em&gt; layer (FF), which is a simple fully-connected
NN with a hidden layer and nonlinearity. Here's the code for such a block that
uses ReLU:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;feed_forward_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Feed-forward layer with ReLU activation.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        x: Input …&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;In &lt;a class="reference external" href="https://arxiv.org/pdf/1706.03762"&gt;transformer models&lt;/a&gt;, the
&lt;a class="reference external" href="https://eli.thegreenplace.net/2025/notes-on-implementing-attention/"&gt;attention block&lt;/a&gt;
is typically followed by a &lt;em&gt;feed forward&lt;/em&gt; layer (FF), which is a simple fully-connected
NN with a hidden layer and nonlinearity. Here's the code for such a block that
uses ReLU:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;feed_forward_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Feed-forward layer with ReLU activation.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        x: Input tensor (B, N, D).&lt;/span&gt;
&lt;span class="sd"&gt;        Wh: Weights for the hidden layer (D, DH).&lt;/span&gt;
&lt;span class="sd"&gt;        Wo: Weights for the output layer (DH, D).&lt;/span&gt;

&lt;span class="sd"&gt;    Returns:&lt;/span&gt;
&lt;span class="sd"&gt;        Output tensor (B, N, D).&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;  &lt;span class="c1"&gt;# hidden layer (B, N, DH)&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ReLU activation (B, N, DH)&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;  &lt;span class="c1"&gt;# output layer (B, N, D)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This layer typically holds most of the weights in the transformer, because the
hidden dimension (DH in this post, &lt;tt class="docutils literal"&gt;hidden_dim&lt;/tt&gt; in some papers) is large - 4x
the embedding depth D is common. Intuitively, this makes sense because this
layer does the majority of the heavy lifting; while the attention block mixes
the embeddings of tokens together to express their relationships to one another,
the FF block does the actual &amp;quot;reasoning&amp;quot; on these tokens.&lt;/p&gt;
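&lt;p&gt;For a rough sense of scale (purely illustrative numbers, not taken from any
particular model): with D=4096 and DH four times that, a single FF block already
holds over a hundred million weights.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Illustrative parameter count for one FF block (the numbers are assumptions).
D = 4096
DH = 4 * D
ff_params = D * DH + DH * D   # W1 of shape (D, DH) plus W2 of shape (DH, D)
print(f'{ff_params:,}')       # 134,217,728 weights in a single block
&lt;/pre&gt;&lt;/div&gt;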
&lt;p&gt;Transformer blocks are repeated dozens of times in a model, so the total size
of these layers becomes problematic.
One approach for improving efficiency that became very popular is called
&lt;em&gt;sparsely-gated mixture of experts&lt;/em&gt; (&lt;a class="reference external" href="https://arxiv.org/pdf/1701.06538"&gt;paper&lt;/a&gt;).&lt;/p&gt;
&lt;div class="section" id="mixture-of-experts-architecture-moe"&gt;
&lt;h2&gt;Mixture of Experts architecture (MoE)&lt;/h2&gt;
&lt;p&gt;The basic idea of MoE is this:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The large FF layer is split into a number (NEXP) of blocks called &amp;quot;experts&amp;quot;. Each
expert is still a FF layer. It takes a vector of size D and transforms it
to another vector of size D.&lt;/li&gt;
&lt;li&gt;There's an additional piece called &amp;quot;router&amp;quot; or &amp;quot;gate&amp;quot;. This is just a
fully-connected layer (D, NEXP) that takes a token and produces a score for
each expert. The router is learned by the model, along with the experts
themselves.&lt;/li&gt;
&lt;li&gt;K experts with the highest scores are selected for each
token, and the token is only fed through these experts.&lt;/li&gt;
&lt;li&gt;The scores are also used to calculate a weighted average from the experts'
outputs, eventually
producing an answer of size D.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's a diagram for a single token, assuming &lt;tt class="docutils literal"&gt;NEXP=8&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;TOPK=2&lt;/tt&gt; (the two
highest scoring experts are selected for each token, out of a total of eight):&lt;/p&gt;
&lt;img alt="mixture of experts diagram" class="align-center" src="https://eli.thegreenplace.net/images/2025/moe.png" /&gt;
&lt;p&gt;Notes:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Experts #1 and #5 are selected because the router produced the highest
scores for them among all the experts. The input token is routed to these
experts, but not to the others.&lt;/li&gt;
&lt;li&gt;The output of each expert is scaled by a corresponding
&lt;em&gt;weight&lt;/em&gt;, calculated by applying a softmax to the scores of the selected
experts (so that each token's weights sum to 1).&lt;/li&gt;
&lt;li&gt;The weighted expert outputs are added up for the final output of the layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key point to understand about this architecture is: the experts that were
not among the top K for a token &lt;em&gt;aren't used at all&lt;/em&gt; - the computation required
to propagate the token through these experts is eschewed (both on the forward and
backward passes).&lt;/p&gt;
&lt;p&gt;This is the goal of the MoE architecture - we increase the overall model size,
but keep the computational cost in check by only using a portion of the
parameters for every single token.
This is also reflected in the models' names; for example, the &lt;a class="reference external" href="https://arxiv.org/pdf/2401.04088"&gt;Mixtral model&lt;/a&gt; has size 8x7B; it has 8 experts, and it
would be incorrect to just multiply the size of each expert by 8 because not
all these parameters participate in the calculation of every token &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.
According to the Mixtral paper, the model only uses 13B active parameters
for each token.&lt;/p&gt;
&lt;p&gt;A summary of the idea behind MoE is:&lt;/p&gt;
&lt;blockquote&gt;
MoE increases the model's capacity without proportionally increasing its
computational cost.&lt;/blockquote&gt;
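&lt;p&gt;To put rough numbers on that statement (again, purely illustrative
dimensions): with 8 experts and top-2 routing, the layer holds 8x the parameters
of a single FF block, but each token only pays the compute of 2 experts.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Illustrative capacity-vs-compute comparison (all numbers are assumptions).
D, DH, NEXP, TOPK = 512, 2048, 8, 2
dense_params  = 2 * D * DH          # a single FF layer: W1 plus W2
moe_params    = NEXP * 2 * D * DH   # total parameters across all experts
active_params = TOPK * 2 * D * DH   # parameters actually used per token
print(dense_params, moe_params, active_params)   # 2097152 16777216 4194304
&lt;/pre&gt;&lt;/div&gt;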
&lt;/div&gt;
&lt;div class="section" id="numpy-implementation"&gt;
&lt;h2&gt;Numpy implementation&lt;/h2&gt;
&lt;p&gt;Here's a well-commented implementation of the MoE layer using pure Numpy. First,
some parameters:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Parameters for a feed-forward layer with a fixed activation function.&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FFParams&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;Wh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;
    &lt;span class="n"&gt;Wo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;


&lt;span class="c1"&gt;# Parameters for a Mixture of Experts (MoE) layer.&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MoEParams&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Embedding dimension of each token (a.k.a. model dimension, Dmodel)&lt;/span&gt;
    &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

    &lt;span class="c1"&gt;# Hidden dimension in FF layers&lt;/span&gt;
    &lt;span class="n"&gt;DH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

    &lt;span class="c1"&gt;# Total number of experts&lt;/span&gt;
    &lt;span class="n"&gt;NEXP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

    &lt;span class="c1"&gt;# K in the top-k selection of top experts per token&lt;/span&gt;
    &lt;span class="n"&gt;TOPK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

    &lt;span class="c1"&gt;# List of experts: each expert is a forward layer with FFParams.&lt;/span&gt;
    &lt;span class="n"&gt;ff_weights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FFParams&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Router weights: a linear layer (D, NEXP) that maps input to expert scores.&lt;/span&gt;
    &lt;span class="n"&gt;router_weights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And now the implementation. Note that it takes a general (B, N, D) input, assuming
batch dimension B and sequence length N:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;moe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MoEParams&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Mixture of Experts (MoE) layer.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        x: Input tensor (B, N, D).&lt;/span&gt;
&lt;span class="sd"&gt;        params: MoEParams.&lt;/span&gt;

&lt;span class="sd"&gt;    Returns:&lt;/span&gt;
&lt;span class="sd"&gt;        Output tensor (B, N, D).&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="c1"&gt;# Run input through router to get expert scores for each token.&lt;/span&gt;
    &lt;span class="n"&gt;expert_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router_weights&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, NEXP)&lt;/span&gt;

    &lt;span class="c1"&gt;# Select the top-k expert scores and their indices for each token.&lt;/span&gt;
    &lt;span class="n"&gt;top_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_experts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topk_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expert_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TOPK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, TOPK)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply softmax to the top scores to get weights that sum to 1.&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, TOPK)&lt;/span&gt;

    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="c1"&gt;# Unvectorized implementation: for each token in the batch and&lt;/span&gt;
            &lt;span class="c1"&gt;# sequence, select the top-k experts and apply them with the&lt;/span&gt;
            &lt;span class="c1"&gt;# calculated weights.&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;expert_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_experts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
                &lt;span class="n"&gt;expert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ff_weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;expert_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;feed_forward_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;expert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Calculating the experts themselves is not vectorized here - it is done token
by token. MoE is inherently sparse: different tokens in the same sequence (and
batch) may go through different sets of experts. Vectorizing this
efficiently is tricky in general
and depends on the HW we run the model on &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.
For a popular approach on GPUs, see the
&lt;a class="reference external" href="https://arxiv.org/pdf/2211.15841"&gt;MegaBlocks paper&lt;/a&gt; from 2022. This remains
an active area of research.&lt;/p&gt;
&lt;p&gt;All that's left are some helper functions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;topk_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Get the top k elements and their indices.&lt;/span&gt;

&lt;span class="sd"&gt;    x is an arbitrary array with at least two dimensions. The returned&lt;/span&gt;
&lt;span class="sd"&gt;    array has the same shape as x, but its elements are the top k elements&lt;/span&gt;
&lt;span class="sd"&gt;    across the last dimension. The indices of the top k elements are also&lt;/span&gt;
&lt;span class="sd"&gt;    returned.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argpartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take_along_axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Compute softmax across last dimension of x.&lt;/span&gt;

&lt;span class="sd"&gt;    x is an arbitrary array with at least two dimensions. The returned array has&lt;/span&gt;
&lt;span class="sd"&gt;    the same shape as x, but its elements sum up to 1 across the last dimension.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="c1"&gt;# Subtract the max for numerical stability&lt;/span&gt;
    &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Divide by sums across last dimension&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="additional-considerations"&gt;
&lt;h2&gt;Additional considerations&lt;/h2&gt;
&lt;p&gt;A major area of focus with MoE architectures is &lt;em&gt;load balancing&lt;/em&gt; among experts.
Without special provisions, the model may learn to prefer certain experts over
others and this leads to inefficient utilization of the model's weights. There
are various approaches to tackle this, for example:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Adding noise to the top-k selection process to inject randomness&lt;/li&gt;
&lt;li&gt;Defining a special loss function during training that encourages experts
to receive a roughly equal number of training samples (sketched below)&lt;/li&gt;
&lt;/ul&gt;
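&lt;p&gt;Here is a sketch of the second approach - one common formulation, in the
spirit of later MoE work such as the Switch Transformer, and not part of this
post's original code: penalize the product of how often each expert is chosen
and how much router probability it receives. The term is minimized when both are
uniform across experts, and during training it would be scaled by a small
coefficient and added to the model's main loss.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;def load_balancing_loss(expert_scores, top_experts, NEXP):
    # expert_scores: (B, N, NEXP) raw router scores.
    # top_experts:   (B, N, TOPK) indices of the selected experts.
    # Reuses softmax_lastdim from above; assumes numpy is imported as np.
    probs = softmax_lastdim(expert_scores)         # (B, N, NEXP)
    P = probs.reshape(-1, NEXP).mean(axis=0)       # mean router probability per expert
    counts = np.bincount(top_experts.ravel(), minlength=NEXP)
    f = counts / top_experts.size                  # fraction of assignments per expert
    return NEXP * np.sum(f * P)                    # equals 1.0 when perfectly balanced
&lt;/pre&gt;&lt;/div&gt;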
&lt;/div&gt;
&lt;div class="section" id="code"&gt;
&lt;h2&gt;Code&lt;/h2&gt;
&lt;p&gt;The full code for this post is &lt;a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/main/transformer-attention/moe.py"&gt;available on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Another way to think about MoE is that each &amp;quot;expert&amp;quot; specializes in
a certain area of the model's capability. For example, one expert would
be good at math, another at prose, etc. This is a very rough
approximation, though, because transformer models consist of dozens of
repeating blocks, and all these different experts end up thoroughly
intermixed as tokens flow through the entire model.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;p class="first"&gt;In the sparsely-gated mixture of experts paper, this is referred to as &lt;em&gt;The Shrinking Batch Problem&lt;/em&gt;:&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;&amp;quot;In modern CPUs and GPUs, large batch sizes are necessary for computational efficiency, so as
to amortize the overhead of parameter loads and updates. If the gating network chooses k out of
n experts for each example, then for a batch of b examples, each expert receives a much smaller
batch of approximately kb/n &amp;lt;&amp;lt; b examples. This causes a naive MoE implementation to become
very inefficient as the number of experts increases&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Math"></category><category term="Machine Learning"></category><category term="Python"></category></entry><entry><title>Cross-entropy and KL divergence</title><link href="https://eli.thegreenplace.net/2025/cross-entropy-and-kl-divergence/" rel="alternate"></link><published>2025-04-12T06:54:00-07:00</published><updated>2025-04-13T18:02:41-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2025-04-12:/2025/cross-entropy-and-kl-divergence/</id><summary type="html">&lt;p&gt;Cross-entropy is widely used in modern ML to compute the loss for classification
tasks. This post is a brief overview of the math behind it and a related
concept called Kullback-Leibler (KL) divergence.&lt;/p&gt;
&lt;div class="section" id="information-content-of-a-single-random-event"&gt;
&lt;h2&gt;Information content of a single random event&lt;/h2&gt;
&lt;p&gt;We'll start with a single event (&lt;em&gt;E&lt;/em&gt;) that has probability …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Cross-entropy is widely used in modern ML to compute the loss for classification
tasks. This post is a brief overview of the math behind it and a related
concept called Kullback-Leibler (KL) divergence.&lt;/p&gt;
&lt;div class="section" id="information-content-of-a-single-random-event"&gt;
&lt;h2&gt;Information content of a single random event&lt;/h2&gt;
&lt;p&gt;We'll start with a single event (&lt;em&gt;E&lt;/em&gt;) that has probability &lt;em&gt;p&lt;/em&gt;. The information
content (or &amp;quot;degree of surprise&amp;quot;) of this event occurring is defined as:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/411b392d3cb3d6df381212ad075bb742324667f9.svg" style="height: 43px;" type="image/svg+xml"&gt;\[I(E) = \log_2 \left (\frac{1}{p} \right )\]&lt;/object&gt;
&lt;p&gt;The base 2 here is used so that we can count the information in units of &lt;em&gt;bits&lt;/em&gt;.
Thinking about this definition intuitively, imagine an event with probability
&lt;em&gt;p=1&lt;/em&gt;; using the formula, the information we gain by observing this event
occurring is 0, which makes sense. On the other extreme, as the probability
&lt;em&gt;p&lt;/em&gt; approaches 0, the information we gain is huge. An equivalent way to write
the formula is:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/3fff25a47de21db8fc5f51d5c8162c4a4ac70884.svg" style="height: 19px;" type="image/svg+xml"&gt;\[I(E) = -\log_2 p\]&lt;/object&gt;
&lt;p&gt;Some numeric examples: suppose we flip a fair coin and it comes out heads. The
probability of this event happening is 1/2, therefore:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/7617949f4d165a96deecb06720eaaa7a1535e63a.svg" style="height: 36px;" type="image/svg+xml"&gt;\[I(E_{heads})=-\log_2 \frac{1}{2} = 1\]&lt;/object&gt;
&lt;p&gt;Now suppose we roll a fair die and it lands on 4. The probability of this event
happening is 1/6, therefore:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/f01c001877cad81ba2f23415978b493f18a99941.svg" style="height: 36px;" type="image/svg+xml"&gt;\[I(E_4)=-\log_2 \frac{1}{6} = 2.58\]&lt;/object&gt;
&lt;p&gt;In other words, the degree of surprise for rolling a 4 is higher than the degree
of surprise for flipping to heads - which makes sense, given the probabilities
involved.&lt;/p&gt;
&lt;p&gt;Other than behaving correctly for boundary values, the logarithm function makes
sense for calculating the degree of surprise for another important reason: the
way it behaves for a combination of events.&lt;/p&gt;
&lt;p&gt;Consider this: we flip a fair coin and roll a fair die; the coin comes out
heads, and the die lands on 4. What is the probability of this event happening?
Because the two events are independent, the probability is the product of the
probabilities of the individual events, so 1/12, and then:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/dce1d3b5f70fdd0d2d208e83aa21218761a31ef8.svg" style="height: 36px;" type="image/svg+xml"&gt;\[I(E_{heads}\cap E_{4})=-\log_2 \frac{1}{12} = 3.58\]&lt;/object&gt;
&lt;p&gt;Note that the information content of the combined event is the precise &lt;em&gt;sum&lt;/em&gt; of the information contents of the individual events.
This is to be expected - we need so many bits for one of the events, and so many
for the other; the total of the bits adds up. The logarithm function gives us
exactly this behavior for probabilities:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/52ad04456bdd7fb6c8e23f3d071219b42ef7e72e.svg" style="height: 19px;" type="image/svg+xml"&gt;\[\log(p_1 \cap p_2) = \log(p_1 \cdot p_2) = \log(p_1) + \log(p_2)\]&lt;/object&gt;
&lt;/div&gt;
&lt;div class="section" id="entropy"&gt;
&lt;h2&gt;Entropy&lt;/h2&gt;
&lt;p&gt;Given a random variable &lt;em&gt;X&lt;/em&gt; with values &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/d601c93a21050cc76c2120e759f794765487e037.svg" style="height: 11px;" type="image/svg+xml"&gt;x_1\dots x_n&lt;/object&gt; and associated
probabilities &lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/4f2e682f2f2e50ecdc8a8709afd7ea86cf5c9baa.svg" style="height: 12px;" type="image/svg+xml"&gt;p_1\dots p_n&lt;/object&gt;, the &lt;em&gt;entropy of X&lt;/em&gt; is defined as the
expected value of information for &lt;em&gt;X&lt;/em&gt;:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/aa60fc303ec00fbe8dd3e4dd6ec0574daaec661b.svg" style="height: 52px;" type="image/svg+xml"&gt;\[H(X)=-\sum_{j=1}^{n}p_j \log_2 p_j\]&lt;/object&gt;
&lt;p&gt;High entropy means high uncertainty, while low entropy means low uncertainty.
Let's look at a couple of examples:&lt;/p&gt;
&lt;img alt="distribution with single value at probability 1, others at 0" class="align-center" src="https://eli.thegreenplace.net/images/2025/distrib-1-0s.png" /&gt;
&lt;p&gt;This is a random variable with 5 distinct values; the probability of &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/593f4cff5d4210d46e140db57bafc4f692493f76.svg" style="height: 11px;" type="image/svg+xml"&gt;x_1&lt;/object&gt;
is 1, and the rest is 0. The entropy here is 0, because &lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/22dc559721b05f362dd835ea5e94678bb7c89f45.svg" style="height: 16px;" type="image/svg+xml"&gt;1\cdot \log 1 = 0&lt;/object&gt;
and also &lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/52fbfb13262ae7b2bd6f03a574408208896b0c9a.svg" style="height: 16px;" type="image/svg+xml"&gt;0\cdot \log 0 = 0&lt;/object&gt;  &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. We gain no
information by observing an event sampled from this distribution, because we
knew ahead of time what would happen.&lt;/p&gt;
&lt;p&gt;Another example is a uniform distribution for the 5 possible outcomes:&lt;/p&gt;
&lt;img alt="distribution with uniform probabilities 0.2 per value" class="align-center" src="https://eli.thegreenplace.net/images/2025/distrib-uniform.png" /&gt;
&lt;p&gt;The entropy for this distribution is:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/40d3d1432da4d5a07dfb6440da7814c1cc4d72c9.svg" style="height: 56px;" type="image/svg+xml"&gt;\[H(X)=-\sum_{j=1}^{5}0.2 \log_2 0.2 = 2.32\]&lt;/object&gt;
&lt;p&gt;Intuitively: we have 5 different values with equal probabilities, so we'll need
&lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3304ed08b474e66efb2b577f4017f7a6c1114c2b.svg" style="height: 17px;" type="image/svg+xml"&gt;\log_{2} 5=2.32&lt;/object&gt; bits to encode which value occurred. Note that entropy is always
non-negative, because
&lt;object class="valign-m6" data="https://eli.thegreenplace.net/images/math/048f921e590e2392d2ffd293286dc52e74aa6371.svg" style="height: 18px;" type="image/svg+xml"&gt;0\leq p_j \leq 1&lt;/object&gt; and therefore &lt;object class="valign-m6" data="https://eli.thegreenplace.net/images/math/3ee0f682a5c1e670e20f57bc74961d6d491e7575.svg" style="height: 18px;" type="image/svg+xml"&gt;\log_2 p_j \leq 0&lt;/object&gt; for all &lt;em&gt;j&lt;/em&gt;
in a proper probability distribution.&lt;/p&gt;
&lt;p&gt;It's not hard to show that the maximum possible entropy for a random variable
occurs for a uniform distribution. In any other distribution, some values are
more likely than others, which makes the outcome somewhat less surprising.&lt;/p&gt;
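&lt;p&gt;Here's a small Python helper (again, just an illustrative snippet) that reproduces
the entropy of the two distributions shown above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;import math

def entropy(ps):
    """Entropy (in bits) of a distribution given as a list of probabilities."""
    # Zero-probability values are skipped, since 0*log(0) is taken to be 0.
    return sum(-p * math.log2(p) for p in ps if p)

print(entropy([1.0, 0, 0, 0, 0]))          # 0.0
print(entropy([0.2, 0.2, 0.2, 0.2, 0.2]))  # 2.32
&lt;/pre&gt;&lt;/div&gt;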
&lt;/div&gt;
&lt;div class="section" id="cross-entropy"&gt;
&lt;h2&gt;Cross-entropy&lt;/h2&gt;
&lt;p&gt;Cross-entropy extends the concept of entropy to the case where two different
probability distributions are involved. The typical formulation useful for
machine learning is:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/b6e46340bd9f28350c89fb8ee386022ecc751dc2.svg" style="height: 52px;" type="image/svg+xml"&gt;\[H(P,Q)=-\sum_{j=1}^{n}p_j \log_2 q_j\]&lt;/object&gt;
&lt;p&gt;Where:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;em&gt;P&lt;/em&gt; is the actual observed data distribution&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Q&lt;/em&gt; is the predicted data distribution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Similarly to entropy, cross-entropy is non-negative; in fact, it collapses to
the entropy formula when &lt;em&gt;P&lt;/em&gt; and &lt;em&gt;Q&lt;/em&gt; are the same:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/6e1b6cb172af5f3c986755f0994d5bb41aa04a63.svg" style="height: 52px;" type="image/svg+xml"&gt;\[H(P,P)=-\sum_{j=1}^{n}p_j \log_2 p_j=H(P)\]&lt;/object&gt;
&lt;p&gt;An information-theoretic interpretation of cross-entropy: the average number
of bits required to encode data coming from the actual distribution &lt;em&gt;P&lt;/em&gt;, when our
encoding assumes the data follows &lt;em&gt;Q&lt;/em&gt; instead.&lt;/p&gt;
&lt;p&gt;Here's a numeric example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Plotted:&lt;/p&gt;
&lt;img alt="plotting p vs q" class="align-center" src="https://eli.thegreenplace.net/images/2025/pq-n-vs-uniform.png" /&gt;
&lt;p&gt;The cross-entropy of these two distributions is 2.32.&lt;/p&gt;
&lt;p&gt;Now let's try a &lt;em&gt;Q&lt;/em&gt; that's slightly closer to &lt;em&gt;P&lt;/em&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;img alt="plotting p vs q" class="align-center" src="https://eli.thegreenplace.net/images/2025/pq-n-vs-n2.png" /&gt;
&lt;p&gt;The cross-entropy of these two distributions is somewhat lower, 2.16; this is
expected, because they're more similar. In other words, an outcome sampled from
&lt;em&gt;P&lt;/em&gt; when our model predicted &lt;em&gt;Q&lt;/em&gt; is less surprising.&lt;/p&gt;
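&lt;p&gt;These cross-entropy values are easy to reproduce in Python (an illustrative
snippet, not part of any library used later in this post):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;import math

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits."""
    return -sum(pj * math.log2(qj) for pj, qj in zip(p, q) if pj)

p = [0.1, 0.2, 0.4, 0.2, 0.1]
q_uniform = [0.2, 0.2, 0.2, 0.2, 0.2]
q_closer = [0.15, 0.175, 0.35, 0.175, 0.15]

print(cross_entropy(p, q_uniform))  # 2.32
print(cross_entropy(p, q_closer))   # 2.16
&lt;/pre&gt;&lt;/div&gt;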
&lt;/div&gt;
&lt;div class="section" id="kl-divergence"&gt;
&lt;h2&gt;KL divergence&lt;/h2&gt;
&lt;p&gt;Cross-entropy is useful for tracking the training loss of a model (more on this
in the next section),
but it has some mathematical properties that make it less than ideal
as a statistical tool to compare two probability distributions. Specifically,
&lt;object class="valign-m5" data="https://eli.thegreenplace.net/images/math/d2d1a0f5ebf075a2675d7945869f40c18e3426c9.svg" style="height: 19px;" type="image/svg+xml"&gt;H(P,P)=H(P)&lt;/object&gt;, which isn't (usually) zero; this is the lowest value
possible for cross-entropy. In other words, cross-entropy
always retains the inherent uncertainty of &lt;em&gt;P&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The KL divergence fixes this by subtracting &lt;object class="valign-m5" data="https://eli.thegreenplace.net/images/math/c1bafacb42854a23560796fb336e80c95c319031.svg" style="height: 19px;" type="image/svg+xml"&gt;H(P)&lt;/object&gt; from cross-entropy:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/44f211c9fc3917937101677e3b8f97a0983d2ac4.svg" style="height: 64px;" type="image/svg+xml"&gt;\[D_{KL}(P,Q)=H(P,Q)-H(P)=-\left (\sum_{j=1}^{n}p_j \log_2 q_j - \sum_{j=1}^{n}p_j \log_2 p_j \right )\]&lt;/object&gt;
&lt;p&gt;Manipulating the logarithms, we can also get these alternative formulations:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/4d4b8725d2df5a7f92fc22dfd10259e99a1ca5c9.svg" style="height: 52px;" type="image/svg+xml"&gt;\[D_{KL}(P,Q)=-\sum_{j=1}^{n}p_j \log_2 \frac{q_j}{p_j}=\sum_{j=1}^{n}p_j \log_2 \frac{p_j}{q_j}\]&lt;/object&gt;
&lt;p&gt;Thus, the KL divergence is more useful as a &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Divergence_(statistics)"&gt;measure of divergence&lt;/a&gt;
between
two probability distributions, since &lt;object class="valign-m5" data="https://eli.thegreenplace.net/images/math/07660ab74385d62ad0200d3248a2d2c35ed2b35c.svg" style="height: 19px;" type="image/svg+xml"&gt;D_{KL}(P,P)=0&lt;/object&gt;. Note, however, that
it's not a true &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Metric_space"&gt;distance metric&lt;/a&gt;
because it's not symmetric:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/64e352ee5152f703b40e74c72b72f20ac38d7175.svg" style="height: 19px;" type="image/svg+xml"&gt;\[D_{KL}(P,Q)\ne D_{KL}(Q,P)\]&lt;/object&gt;
&lt;/div&gt;
&lt;div class="section" id="uses-in-machine-learning"&gt;
&lt;h2&gt;Uses in machine learning&lt;/h2&gt;
&lt;p&gt;In ML, we often have a model that makes a prediction and a set of training data
which defines a real-world probability distribution. It's natural to define
a loss function in terms of the difference between the two distributions (the
model's prediction and the real data).&lt;/p&gt;
&lt;p&gt;Cross-entropy is very useful as a loss function because it's non-negative and
provides a single scalar number that's lower for similar distributions and
higher for dissimilar distributions. Moreover, if we think of cross-entropy
in terms of KL divergence:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/a31aa0b9de98d1aa1023600fa24f1397d517ef84.svg" style="height: 19px;" type="image/svg+xml"&gt;\[H(P,Q)=D_{KL}(P,Q)+H(P)\]&lt;/object&gt;
&lt;p&gt;We'll notice that &lt;object class="valign-m5" data="https://eli.thegreenplace.net/images/math/c1bafacb42854a23560796fb336e80c95c319031.svg" style="height: 19px;" type="image/svg+xml"&gt;H(P)&lt;/object&gt; - the entropy of the real-world distribution - does
not depend on the model at all. Therefore, minimizing the cross-entropy is equivalent
to minimizing the KL divergence. I wrote about concrete uses of cross-entropy
as a loss function in previous posts:
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://eli.thegreenplace.net/2016/logistic-regression/"&gt;Logistic regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/"&gt;Softmax for multiclass classification&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That said, the KL divergence is also sometimes useful more directly; for example
in the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Evidence_lower_bound"&gt;evidence lower bound&lt;/a&gt;
used for &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Variational_autoencoder"&gt;Variational autoencoders&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="relation-to-maximum-likelihood-estimation"&gt;
&lt;h2&gt;Relation to Maximum Likelihood Estimation&lt;/h2&gt;
&lt;p&gt;There's an interesting relation between the concepts discussed in this post
and &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation"&gt;Maximum Likelihood Estimation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Suppose we have a true probability distribution &lt;em&gt;P&lt;/em&gt;, and a parameterized model
that predicts the probability distribution &lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e8cde2aa522fd55113d3c90a6e02ac752e6ad9de.svg" style="height: 16px;" type="image/svg+xml"&gt;Q_\theta&lt;/object&gt;. &lt;img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /&gt;
stands for all the parameters of our model (e.g. all the weights of a deep
learning network).&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;likelihood&lt;/em&gt; of observing a set of samples &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/703980c1da69640a87feb2a4e0977836026442bb.svg" style="height: 11px;" type="image/svg+xml"&gt;x_1\cdots x_n&lt;/object&gt; drawn
from &lt;em&gt;P&lt;/em&gt; is:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/eae84dafb00d4e74b504fdb974b9c03925224af1.svg" style="height: 49px;" type="image/svg+xml"&gt;\[L=\prod ^{n}_{i=1}P(x_i)\]&lt;/object&gt;
&lt;p&gt;However, we don't really know &lt;em&gt;P&lt;/em&gt;; what we do know is &lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e8cde2aa522fd55113d3c90a6e02ac752e6ad9de.svg" style="height: 16px;" type="image/svg+xml"&gt;Q_\theta&lt;/object&gt;, so
we can calculate:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/dd18eafdbf58c511a91a4efb3ad973586161fb40.svg" style="height: 49px;" type="image/svg+xml"&gt;\[L(\theta)=\prod ^{n}_{i=1}Q_\theta(x_i)\]&lt;/object&gt;
&lt;p&gt;The idea is to find an optimal set of parameters &lt;object class="valign-0" data="https://eli.thegreenplace.net/images/math/5f8bf92383eafb1f6f5fbffb6dcb58d1bf1a9319.svg" style="height: 19px;" type="image/svg+xml"&gt;\widehat{\theta}&lt;/object&gt;
such that this likelihood is maximized; in other words:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/419c9a0618b474805144567b8c9d4f1d290848ca.svg" style="height: 49px;" type="image/svg+xml"&gt;\[\widehat{\theta}=\underset{\theta}{argmax}\ L(\theta)=\underset{\theta}{argmax}\ \prod ^{n}_{i=1}Q_\theta(x_i)\]&lt;/object&gt;
&lt;p&gt;Working with products is inconvenient, however, so a logarithm is used instead
to convert a product to a sum (since &lt;object class="valign-m5" data="https://eli.thegreenplace.net/images/math/53f77c87c1ab5f4c797a46613e0db513cc75dd67.svg" style="height: 19px;" type="image/svg+xml"&gt;log(f(x))&lt;/object&gt; is a monotonically
increasing function, maximizing it is equivalent to maximizing &lt;img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" /&gt; itself):&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/7502cc60fad662058324ef62d8e7bf3e8e8bb7b7.svg" style="height: 49px;" type="image/svg+xml"&gt;\[\widehat{\theta}=\underset{\theta}{argmax}\ \log L(\theta)=\underset{\theta}{argmax}\ \sum ^{n}_{i=1}\log Q_\theta(x_i)\]&lt;/object&gt;
&lt;p&gt;This is known as maximizing the &lt;em&gt;log-likelihood&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Now a clever statistical trick is employed; first, we multiply the function
we're maximizing by the constant &lt;object class="valign-m6" data="https://eli.thegreenplace.net/images/math/da4e4caaf82d121438b1882f8b0a08baff2aee00.svg" style="height: 22px;" type="image/svg+xml"&gt;\frac{1}{n}&lt;/object&gt; - this doesn't affect the
maxima, of course:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/c2758a1be51374150e26bde87e58aa29adb2f152.svg" style="height: 49px;" type="image/svg+xml"&gt;\[\widehat{\theta}=\underset{\theta}{argmax}\ \frac{1}{n}\sum ^{n}_{i=1}\log Q_\theta(x_i)\]&lt;/object&gt;
&lt;p&gt;The function inside the &lt;em&gt;argmax&lt;/em&gt; is now the average across &lt;em&gt;n&lt;/em&gt; samples obtained
from the true probability distribution &lt;em&gt;P&lt;/em&gt;. The &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Law_of_large_numbers"&gt;Law of Large numbers&lt;/a&gt;
states that with a large enough &lt;em&gt;n&lt;/em&gt;, this average converges to the expected
value of drawing from this distribution:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/b9c22053a992846decc249254004e3b57fb9444c.svg" style="height: 49px;" type="image/svg+xml"&gt;\[\widehat{\theta}=\underset{\theta}{argmax}\ \sum ^{n}_{i=1}P(x_i)\log Q_\theta(x_i)\]&lt;/object&gt;
&lt;p&gt;This should start looking familiar; all that's left is to negate the sum and
minimize the negative instead:&lt;/p&gt;
&lt;object class="align-center" data="https://eli.thegreenplace.net/images/math/069d4d7ead9db053ea269d38c24deb3e003ecfe4.svg" style="height: 49px;" type="image/svg+xml"&gt;\[\widehat{\theta}=\underset{\theta}{argmin}\ -\sum ^{n}_{i=1}P(x_i)\log Q_\theta(x_i)\]&lt;/object&gt;
&lt;p&gt;The function we're now minimizing is the &lt;em&gt;cross-entropy&lt;/em&gt; between &lt;em&gt;P&lt;/em&gt; and
&lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e8cde2aa522fd55113d3c90a6e02ac752e6ad9de.svg" style="height: 16px;" type="image/svg+xml"&gt;Q_\theta&lt;/object&gt;. We've shown that maximum likelihood estimation is equivalent
to minimizing the cross-entropy between the true and predicted data
distributions.&lt;/p&gt;
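&lt;p&gt;The Law of Large Numbers step above is easy to check numerically. The following
standalone snippet (reusing the example distributions from earlier in the post,
purely for illustration) draws many samples from &lt;em&gt;P&lt;/em&gt; and verifies that the
sample average of the log-probabilities under &lt;em&gt;Q&lt;/em&gt; approaches the negated
cross-entropy:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;import math
import random

p = [0.1, 0.2, 0.4, 0.2, 0.1]
q = [0.15, 0.175, 0.35, 0.175, 0.15]

# Average of log2(q[x]) over many samples x drawn from p...
n = 200_000
samples = random.choices(range(len(p)), weights=p, k=n)
avg_log_q = sum(math.log2(q[x]) for x in samples) / n

# ...should approach -H(P, Q), the negated cross-entropy.
cross_entropy = -sum(pj * math.log2(qj) for pj, qj in zip(p, q))
print(avg_log_q)       # roughly -2.16
print(-cross_entropy)  # -2.16
&lt;/pre&gt;&lt;/div&gt;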
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;This can be proven by taking the limit &lt;object class="valign-m6" data="https://eli.thegreenplace.net/images/math/cb7cbf90311ea6820f39588c1af3de46e31e3389.svg" style="height: 18px;" type="image/svg+xml"&gt;\lim_{p\to 0} p \log p&lt;/object&gt;
and using L'Hopital's rule to show that it goes to 0.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Math"></category><category term="Machine Learning"></category></entry><entry><title>Reproducing word2vec with JAX</title><link href="https://eli.thegreenplace.net/2025/reproducing-word2vec-with-jax/" rel="alternate"></link><published>2025-04-05T06:19:00-07:00</published><updated>2025-04-05T13:18:49-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2025-04-05:/2025/reproducing-word2vec-with-jax/</id><summary type="html">&lt;p&gt;The word2vec model was proposed in a 2013 paper by Google researchers called
&lt;a class="reference external" href="https://arxiv.org/pdf/1301.3781"&gt;&amp;quot;Efficient Estimation of Word Representations in Vector Space&amp;quot;&lt;/a&gt;,
and was further refined by additional papers from the same team. It kick-started
the modern use of &lt;em&gt;embeddings&lt;/em&gt; - dense vector representation of words (and later
tokens) for language models …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The word2vec model was proposed in a 2013 paper by Google researchers called
&lt;a class="reference external" href="https://arxiv.org/pdf/1301.3781"&gt;&amp;quot;Efficient Estimation of Word Representations in Vector Space&amp;quot;&lt;/a&gt;,
and was further refined by additional papers from the same team. It kick-started
the modern use of &lt;em&gt;embeddings&lt;/em&gt; - dense vector representation of words (and later
tokens) for language models.&lt;/p&gt;
&lt;p&gt;Also, the code - with some instructions - &lt;a class="reference external" href="https://code.google.com/archive/p/word2vec/"&gt;was made available openly&lt;/a&gt;.
This post reproduces the word2vec results using JAX, and also talks about
reproducing it using the original C code (see the &lt;em&gt;Original word2vec code&lt;/em&gt;
section for that).&lt;/p&gt;
&lt;div class="section" id="embeddings"&gt;
&lt;h2&gt;Embeddings&lt;/h2&gt;
&lt;p&gt;First, a brief introduction to embeddings.
Wikipedia has a good definition:&lt;/p&gt;
&lt;blockquote&gt;
In natural language processing, a word embedding is a representation of a
word. The embedding is used in text analysis. Typically, the representation is a
real-valued vector that encodes the meaning of the word in such a way that the
words that are closer in the vector space are expected to be similar in meaning&lt;/blockquote&gt;
&lt;p&gt;Here's a framework that made sense to me when I was first learning about
embeddings many years ago:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;ML models and NNs specifically are all about vector math.&lt;/li&gt;
&lt;li&gt;Words in a human language (like English) are just sequences of characters
with no semantic meaning (there's nothing in the word &amp;quot;dog&amp;quot; that conveys
dog-ness any more than the same concept in other human languages). Also, words
have different lengths which isn't convenient.&lt;/li&gt;
&lt;li&gt;To represent words as vectors, we typically use indices into a vocabulary;
equivalently, this can be seen as a one-hot vector with the value at the
correct vocabulary index being 1, and the rest 0.&lt;/li&gt;
&lt;li&gt;This latter vector representation has no semantic meaning either, because
&amp;quot;Paris&amp;quot; and &amp;quot;France&amp;quot; will be as different from each other as &amp;quot;Paris&amp;quot; and
&amp;quot;Armadillo&amp;quot;. Also, these vectors are huge (a typical vocabulary can have
tens of thousands of words, just for a single language!)&lt;/li&gt;
&lt;li&gt;Therefore, we need some magic to convert words into vectors that carry
meaning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Embeddings are that magic. They are dense vectors of floats - with typically
hundreds or thousands of elements, and serve as representations of these words
in high-dimensional space.&lt;/p&gt;
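&lt;p&gt;To make the one-hot vs. dense distinction concrete, here's a tiny NumPy sketch;
the three-word vocabulary and the random embedding values are made up purely for
illustration:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;import numpy as np

vocab = {"paris": 0, "france": 1, "armadillo": 2}  # made-up tiny vocabulary
V, D = len(vocab), 4                               # vocabulary size, embedding depth

# One-hot representation: as long as the vocabulary, and carries no semantic meaning.
one_hot = np.zeros(V)
one_hot[vocab["paris"]] = 1.0

# Dense embedding: each row of this (V, D) matrix is a word's learned vector.
embeddings = np.random.default_rng(0).normal(size=(V, D))
paris_vec = embeddings[vocab["paris"]]  # equivalent to one_hot @ embeddings
&lt;/pre&gt;&lt;/div&gt;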
&lt;/div&gt;
&lt;div class="section" id="the-word2vec-cbow-architecture"&gt;
&lt;h2&gt;The word2vec CBOW architecture&lt;/h2&gt;
&lt;p&gt;The word2vec paper proposed two related architectures: CBOW (Continuous Bag Of
Words) and Continuous Skip Gram. The two are fairly similar, and in this post
I'm going to focus on CBOW.&lt;/p&gt;
&lt;p&gt;The idea of the CBOW approach is to teach the model to predict a word from its
surrounding words. Here's an example with window size of four &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;:&lt;/p&gt;
&lt;img alt="CBOW - showing word in center of window, with context words around" class="align-center" src="https://eli.thegreenplace.net/images/2025/word2vec-cbow.png" /&gt;
&lt;p&gt;The goal here is to have the model predict that &amp;quot;liberty&amp;quot; should be the word
in the middle, given the context words in peach-colored boxes. This is an
&lt;em&gt;unsupervised&lt;/em&gt; model - it learns by consuming text, sliding its window word
by word over arbitrary amounts of (properly formatted and sanitized) input.&lt;/p&gt;
&lt;p&gt;Concretely, the following diagram shows the model architecture; here are
the dimensions involved:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;B: batch (for computational efficiency, whole batches are processed together)&lt;/li&gt;
&lt;li&gt;V: vocabulary size (the number of unique words in our vocabulary)&lt;/li&gt;
&lt;li&gt;D: model depth (the size of the dense embedding vectors we're trying to learn)&lt;/li&gt;
&lt;li&gt;W: window size&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="word2vec CBOW model architecture" class="align-center" src="https://eli.thegreenplace.net/images/2025/word2vec-arch.png" /&gt;
&lt;p&gt;Here's the flow of data in the forward pass:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;context&lt;/tt&gt; is the context words for a given position. For example, in the
sample diagram above the context would be of length 8. Each element is an
integer representation of a word (its index into the vocabulary). Since
we're processing batches, the shape of this array is (B,2W).&lt;/li&gt;
&lt;li&gt;The context &lt;em&gt;indexes&lt;/em&gt; into a projection matrix &lt;tt class="docutils literal"&gt;P&lt;/tt&gt;, which has the learned
embedding per row - one for each word in the vocabulary. The result is
&lt;tt class="docutils literal"&gt;projection&lt;/tt&gt; with shape (B,2W,D). The first two dimensions remain the same
(because we still have the same batch and window size), but every integer
is replaced with the word's embedding - so an extra dimension is added.&lt;/li&gt;
&lt;li&gt;Next, a &lt;em&gt;mean&lt;/em&gt; (arithmetic average) is taken across the window dimension.
The embeddings of all the words in the window are averaged together. The
result is (B,D) where each row is the average of the embeddings of 2W words.&lt;/li&gt;
&lt;li&gt;Finally, the &lt;em&gt;hidden&lt;/em&gt; layer matrix &lt;tt class="docutils literal"&gt;H&lt;/tt&gt; is used to map the dense
representation back into a sparse one &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt; - this is the prediction of the middle
word. Recall that this tries to predict a one-hot encoding of the word's
vocabulary index.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For training, the loss is calculated by comparing &lt;tt class="docutils literal"&gt;out&lt;/tt&gt; to the one-hot
encoding of the actual target word for this window, and the calculated gradient
is propagated backwards to train the model.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="jax-implementation"&gt;
&lt;h2&gt;JAX implementation&lt;/h2&gt;
&lt;p&gt;The JAX implementation of the model described above is clean and compact:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@jax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;word2vec_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Forward pass of the word2Vec model.&lt;/span&gt;

&lt;span class="sd"&gt;    context is a (batch_size, 2*window_size) array of word IDs.&lt;/span&gt;

&lt;span class="sd"&gt;    V is the vocabulary size, D is the embedding dimension.&lt;/span&gt;
&lt;span class="sd"&gt;    params[&amp;quot;projection&amp;quot;] is a (V, D) matrix of word embeddings.&lt;/span&gt;
&lt;span class="sd"&gt;    params[&amp;quot;hidden&amp;quot;] is a (D, V) matrix of weights for the hidden layer.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="c1"&gt;# Indexing into (V, D) matrix with a batch of IDs. The output shape&lt;/span&gt;
    &lt;span class="c1"&gt;# is (batch_size, 2*window_size, D).&lt;/span&gt;
    &lt;span class="n"&gt;projection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;projection&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Compute average across the context word. The output shape is&lt;/span&gt;
    &lt;span class="c1"&gt;# (batch_size, D).&lt;/span&gt;
    &lt;span class="n"&gt;avg_projection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jnp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# (batch_size, D) @ (D, V) -&amp;gt; (batch_size, V)&lt;/span&gt;
    &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jnp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_projection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hidden&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;


&lt;span class="nd"&gt;@jax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;word2vec_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Compute the loss of the word2Vec model.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word2vec_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (batch_size, V)&lt;/span&gt;

    &lt;span class="n"&gt;target_onehot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# (batch_size, V)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;softmax_cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_onehot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="training"&gt;
&lt;h2&gt;Training&lt;/h2&gt;
&lt;p&gt;For training, I've been relying on the same dataset used by the original word2vec
code - a 100MB text file downloaded from &lt;a class="reference external" href="http://mattmahoney.net/dc/text8.zip"&gt;http://mattmahoney.net/dc/text8.zip&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This file contains all-lowercase text with no punctuation, so it requires very
little cleaning and processing. What it &lt;em&gt;does&lt;/em&gt; require for higher-quality
training is &lt;em&gt;subsampling&lt;/em&gt;: throwing away some of the most common words (e.g.
&amp;quot;and&amp;quot;, &amp;quot;is&amp;quot;, &amp;quot;not&amp;quot; in English), since they appear so much in the text. Here's
my code for this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subsample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Subsample frequent words, return a new list of words.&lt;/span&gt;

&lt;span class="sd"&gt;    Follows the subsampling procedure described in the paper &amp;quot;Distributed&lt;/span&gt;
&lt;span class="sd"&gt;    Representations of Words and Phrases and their Compositionality&amp;quot; by&lt;/span&gt;
&lt;span class="sd"&gt;    Mikolov et al. (2013).&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;word_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_counts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

    &lt;span class="c1"&gt;# Common words (freq(word) &amp;gt; threshold) are kept with a computed&lt;/span&gt;
    &lt;span class="c1"&gt;# probability, while rare words are always kept.&lt;/span&gt;
    &lt;span class="n"&gt;p_keep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_counts&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;p_keep&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We also have to create a vocabulary with some limited size:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_vocabulary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Creates a vocabulary from a list of words.&lt;/span&gt;

&lt;span class="sd"&gt;    Keeps the top_k most common words and assigns an index to each word. The&lt;/span&gt;
&lt;span class="sd"&gt;    index 0 is reserved for the &amp;quot;&amp;lt;unk&amp;gt;&amp;quot; token.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;word_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;lt;unk&amp;gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_counts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The preprocessing step generates the list of subsampled words and the
vocabulary, and stores them in a pickle file for future reference. The
training loop uses these data to train a model from a random initialization.
Pay special attention to the hyper-parameters defined at the top of the
&lt;tt class="docutils literal"&gt;train&lt;/tt&gt; function. I set these to be as close as possible to the original
word2vec code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="n"&gt;LEARNING_RATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;
    &lt;span class="n"&gt;WINDOW_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
    &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
    &lt;span class="n"&gt;EPOCHS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;

    &lt;span class="n"&gt;initializer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initializers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;glorot_uniform&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;projection&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;initializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PRNGKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;501337&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;hidden&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;initializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PRNGKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;501337&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LEARNING_RATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;opt_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Approximate number of batches:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EPOCHS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;=== Epoch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;epoch_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;generate_train_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WINDOW_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Shuffle the batch.&lt;/span&gt;
            &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permutation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_batch&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;target_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;context_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context_batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="c1"&gt;# Compute the loss and gradients; optimize.&lt;/span&gt;
            &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_and_grad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word2vec_loss&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;
                &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_batch&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;updates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opt_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opt_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply_updates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;updates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;epoch_loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Batch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Epoch loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epoch_loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;.2f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;checkpoint_filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;checkpoint-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;03&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.pickle&amp;quot;&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Saving checkpoint to&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checkpoint_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpoint_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;wb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The only thing I'm not showing here is the &lt;tt class="docutils literal"&gt;generate_train_vectors&lt;/tt&gt; function,
as it's not particularly interesting; you can find it
&lt;a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/main/word2vec-jax/train.py"&gt;in the full code&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I don't have a particularly powerful GPU, so on my machine training this model
for 25 epochs takes 20-30 minutes.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="extracting-embeddings-and-finding-word-similarities"&gt;
&lt;h2&gt;Extracting embeddings and finding word similarities&lt;/h2&gt;
&lt;p&gt;The result of the training is the &lt;tt class="docutils literal"&gt;P&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;H&lt;/tt&gt; arrays with trained weights;
&lt;tt class="docutils literal"&gt;P&lt;/tt&gt; is exactly the embedding matrix we need! It maps vocabulary words to their
dense embedding representation. Using &lt;tt class="docutils literal"&gt;P&lt;/tt&gt;, we can create the fun word demos
that made word2vec famous. The full code has a script named &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;similar-words.py&lt;/span&gt;&lt;/tt&gt;
that does this. Some examples:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ uv run similar-words.py -word paris \
      -checkpoint checkpoint.pickle \
      -traindata train-data.pickle
Words similar to &amp;#39;paris&amp;#39;:
paris           1.00
france          0.50
french          0.49
la              0.42
le              0.41
henri           0.40
toulouse        0.38
brussels        0.38
petit           0.38
les             0.38
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ uv run similar-words.py -analogy berlin,germany,tokyo \
      -checkpoint checkpoint.pickle \
      -traindata train-data.pickle
Analogies for &amp;#39;berlin is to germany as tokyo is to ?&amp;#39;:
tokyo           0.70
japan           0.45
japanese        0.44
osaka           0.40
china           0.36
germany         0.35
singapore       0.32
han             0.31
gu              0.31
kyushu          0.31
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This brings us to the intuition for how word2vec works: the basic idea is that
semantically similar words will appear in the vicinity of roughly similar
context words, but also that words are generally related to words in the
context they appear in. This lets the model learn that some words are more
related than others; for example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ uv run similar-words.py -sims soccer,basketball,chess,cat,bomb \
      -checkpoint checkpoint.pickle \
      -traindata train-data.pickle
Similarities for &amp;#39;soccer&amp;#39; with context words [&amp;#39;basketball&amp;#39;, &amp;#39;chess&amp;#39;, &amp;#39;cat&amp;#39;, &amp;#39;bomb&amp;#39;]:
basketball      0.40
chess           0.22
cat             0.14
bomb            0.13
&lt;/pre&gt;&lt;/div&gt;
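&lt;p&gt;For reference, here's a minimal sketch of how such similarity queries can be
computed from the trained &lt;tt class="docutils literal"&gt;P&lt;/tt&gt; matrix using cosine similarity; the
actual logic lives in &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;similar-words.py&lt;/span&gt;&lt;/tt&gt; in the
repository and may differ in its details:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;import numpy as np

def most_similar(word, vocab, P, topn=10):
    """Finds the words whose embeddings have the highest cosine similarity to
    the embedding of the given word. P is the trained (V, D) projection matrix."""
    # Normalize all embeddings to unit length; cosine similarity then becomes
    # a plain matrix-vector product.
    normalized = P / np.linalg.norm(P, axis=1, keepdims=True)
    sims = normalized @ normalized[vocab[word]]
    idx2word = {i: w for w, i in vocab.items()}
    best = np.argsort(-sims)[:topn]
    return [(idx2word[i], sims[i]) for i in best]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Normalizing the rows once up front makes each query a single matrix-vector
product over the whole vocabulary.&lt;/p&gt;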
&lt;/div&gt;
&lt;div class="section" id="optimizations"&gt;
&lt;h2&gt;Optimizations&lt;/h2&gt;
&lt;p&gt;The word2vec model can be optimized in several ways, many of which are focused
on avoiding the giant matrix multiplication by &lt;tt class="docutils literal"&gt;H&lt;/tt&gt; at the very end. The
word2vec authors have a followup paper called &lt;a class="reference external" href="https://arxiv.org/pdf/1310.4546"&gt;&amp;quot;Distributed Representations of
Words and Phrases and their Compositionality&amp;quot;&lt;/a&gt;
where these are described; I'm leaving them out of my implementation, for
simplicity.&lt;/p&gt;
&lt;p&gt;Implementing these optimizations would make it feasible to improve the model's quality
considerably, by increasing the model depth (it's currently 200, which is
very low by modern LLM standards) and the amount of data we train on. That
said, these days word2vec is mostly of historical interest anyway; the
&lt;em&gt;Modern text embeddings&lt;/em&gt; section will have more to say on how embeddings are
trained as part of modern LLMs.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="original-word2vec-code"&gt;
&lt;h2&gt;Original word2vec code&lt;/h2&gt;
&lt;p&gt;As mentioned above, the original website for the word2vec model is available
on an &lt;a class="reference external" href="https://code.google.com/archive/p/word2vec/"&gt;archived version of Google Code&lt;/a&gt;.
That page is still useful reading, but the Subversion instructions to obtain
the actual code no longer work.&lt;/p&gt;
&lt;p&gt;I was able to find a GitHub mirror with a code export here: &lt;a class="reference external" href="https://github.com/tmikolov/word2vec"&gt;https://github.com/tmikolov/word2vec&lt;/a&gt;
(the username certainly checks out, though it's hard to know for sure!)&lt;/p&gt;
&lt;p&gt;The awesome thing is that this code still builds and runs perfectly, many years
later. Hurray to self-contained C programs with no dependencies; all I needed
was to run &lt;tt class="docutils literal"&gt;make&lt;/tt&gt;, and then use the included shell scripts to download the
data and run training. This code uses the CPU for training; it takes a while,
but I was able to reproduce the similarity / analogy results fairly easily.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="modern-text-embeddings"&gt;
&lt;h2&gt;Modern text embeddings&lt;/h2&gt;
&lt;p&gt;The word2vec model trains an embedding matrix; this &lt;em&gt;pre-trained&lt;/em&gt; matrix can
then be used as part of other ML models. This approach was used for a while,
but it's no longer popular.&lt;/p&gt;
&lt;p&gt;These days, an embedding matrix is trained as part of a larger model.
For example, GPT-type transformer-based LLMs have an embedding matrix as the
first layer in the model. This is basically just the &lt;tt class="docutils literal"&gt;P&lt;/tt&gt; matrix from the
diagram above &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. LLMs learn both the
embeddings and their specific task (generating tokens from a given context)
at the same time. This makes some sense because:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;LLMs process enormous amounts of data, and consuming this data multiple times
to train embeddings separately is wasteful.&lt;/li&gt;
&lt;li&gt;Embeddings trained together with the LLM are inherently tuned to the LLM's
specific task and hyper-parameters (e.g. the kind of tokenizer used, the
model depth etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Specifically, modern embedding matrices differ from word2vec in two important
aspects:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Instead of being &lt;em&gt;word&lt;/em&gt; embeddings, they are &lt;em&gt;token&lt;/em&gt; embeddings. I wrote much
more on &lt;a class="reference external" href="https://eli.thegreenplace.net/2024/tokens-for-llms-byte-pair-encoding-in-go/"&gt;tokens for LLMs here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The model depth (D) is &lt;em&gt;much&lt;/em&gt; larger; GPT-3 has D=12288, and in newer models
it's probably even larger. Deep embedding vectors help the models capture more
nuance and semantic meaning about tokens. Naturally, they also require much
more data to be trained effectively.&lt;/li&gt;
&lt;/ul&gt;
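&lt;p&gt;To make this concrete, here is a minimal sketch (with made-up shapes, not
taken from any specific model) of what the embedding layer at the front of a
GPT-style model does - it's just a row lookup into the learned (V, D) matrix:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

V, D = 50000, 12288         # made-up vocabulary size; GPT-3-like depth
P = np.random.randn(V, D)   # the learned token embedding matrix

token_ids = np.array([[17, 42, 3], [5, 5, 99]])   # a (B, N) batch of token ids
emb = P[token_ids]          # (B, N, D) - one row of P per token
print(emb.shape)            # (2, 3, 12288)
&lt;/pre&gt;&lt;/div&gt;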
&lt;/div&gt;
&lt;div class="section" id="full-code"&gt;
&lt;h2&gt;Full code&lt;/h2&gt;
&lt;p&gt;The full code for this post is &lt;a class="reference external" href="https://github.com/eliben/deep-learning-samples/tree/main/word2vec-jax"&gt;available here&lt;/a&gt;.
If you want to reproduce my word2vec results, check out the README file - it
contains full instructions on which scripts to run and in which order.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The window size is
how many words to the left and right of the target word to take into account,
and it's a configurable hyper-parameter during training.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;p class="first"&gt;The terms &lt;em&gt;dense&lt;/em&gt; and &lt;em&gt;sparse&lt;/em&gt; are used in the post in the following
sense:&lt;/p&gt;
&lt;p&gt;A sparse array is one where almost all entries are 0. This is true
for one-hot vectors representing vocabulary words (all entries are 0
except a single one that has the value 1).&lt;/p&gt;
&lt;p class="last"&gt;A dense array is filled with arbitrary floating-point
values. An embedding vector is dense in this sense - it's typically
short compared to the sparse vector (in the word2vec example used in
this post D=200, while V=20000), but full of data (hence &amp;quot;dense&amp;quot;). An
embedding matrix is dense since it consists of dense vectors (one per
word index).&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The rest (mean calculation, hidden layer) isn't needed since
it's only there to &lt;em&gt;train&lt;/em&gt; the word2vec CBOW model.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Python"></category><category term="Math"></category><category term="Machine Learning"></category></entry><entry><title>Notes on implementing Attention</title><link href="https://eli.thegreenplace.net/2025/notes-on-implementing-attention/" rel="alternate"></link><published>2025-03-26T17:15:00-07:00</published><updated>2025-05-02T01:36:27-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2025-03-26:/2025/notes-on-implementing-attention/</id><summary type="html">&lt;p&gt;Some notes on implementing attention blocks in pure Python +
Numpy. The focus here is on the exact implementation in code, explaining all the
shapes throughout the process. The motivation for why attention works is not
covered here too deeply - there are plenty of excellent online resources
explaining it.&lt;/p&gt;
&lt;p&gt;Several papers …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Some notes on implementing attention blocks in pure Python +
Numpy. The focus here is on the exact implementation in code, explaining all the
shapes throughout the process. The motivation for why attention works is not
covered here too deeply - there are plenty of excellent online resources
explaining it.&lt;/p&gt;
&lt;p&gt;Several papers are mentioned throughout the code; they are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;AIAYN - &lt;a class="reference external" href="https://arxiv.org/abs/1706.03762"&gt;Attention Is All You Need&lt;/a&gt; by
Vaswani et al.&lt;/li&gt;
&lt;li&gt;GPT-3 - &lt;a class="reference external" href="https://arxiv.org/abs/2005.14165"&gt;Language Models are Few-Shot Learners&lt;/a&gt; by Brown et al.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="section" id="basic-scaled-self-attention"&gt;
&lt;h2&gt;Basic scaled self-attention&lt;/h2&gt;
&lt;p&gt;We'll start with the most basic scaled dot product self-attention, working on a
single sequence of tokens, without masking.&lt;/p&gt;
&lt;p&gt;The input is a 2D array of shape (N, D). N is the length of the sequence (how
many tokens it contains) and D is the embedding depth - the length of the
embedding vector representing each token &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. D could be something like
512, or more, depending on the model.&lt;/p&gt;
&lt;img alt="input array N by D" class="align-center" src="https://eli.thegreenplace.net/images/2025/nd-array.png" /&gt;
&lt;p&gt;A self-attention module is parameterized with three weight matrices, &lt;tt class="docutils literal"&gt;Wk&lt;/tt&gt;,
&lt;tt class="docutils literal"&gt;Wq&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;Wv&lt;/tt&gt;. Some variants also have accompanying bias vectors, but the
AIAYN paper doesn't use them, so I'll skip them here. In the general case,
the shape of each weight matrix is (D, HS), where HS is some fraction of
D. HS stands for &amp;quot;head size&amp;quot; and we'll see what this means soon. This is a
diagram of a self-attention module (the diagram assumes
N=6, D is some large number and so is HS). In the diagram, &lt;tt class="docutils literal"&gt;&amp;#64;&lt;/tt&gt; stands for
matrix multiplication (Python/Numpy syntax):&lt;/p&gt;
&lt;img alt="schematic of a single attention head" class="align-center" src="https://eli.thegreenplace.net/images/2025/attention-head.png" /&gt;
&lt;p&gt;Here's a basic Numpy implementation of this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# self_attention the way it happens in the Transformer model. No bias.&lt;/span&gt;
&lt;span class="c1"&gt;# D = model dimension/depth (length of embedding)&lt;/span&gt;
&lt;span class="c1"&gt;# N = input sequence length&lt;/span&gt;
&lt;span class="c1"&gt;# HS = head size&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;# x is the input (N, D), each token in a row.&lt;/span&gt;
&lt;span class="c1"&gt;# Each of W* is a weight matrix of shape (D, HS)&lt;/span&gt;
&lt;span class="c1"&gt;# The result is (N, HS)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;self_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Each of these is (N, D) @ (D, HS) = (N, HS)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt;

    &lt;span class="c1"&gt;# kq: (N, N) matrix of dot products between each pair of q and k vectors.&lt;/span&gt;
    &lt;span class="c1"&gt;# The division by sqrt(HS) is the scaling.&lt;/span&gt;
    &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# att: (N, N) attention matrix. The rows become the weights that sum&lt;/span&gt;
    &lt;span class="c1"&gt;# to 1 for each output vector.&lt;/span&gt;
    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;  &lt;span class="c1"&gt;# (N, HS)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &amp;quot;scaled&amp;quot; part is just dividing &lt;tt class="docutils literal"&gt;kq&lt;/tt&gt; by the square root of &lt;tt class="docutils literal"&gt;HS&lt;/tt&gt;, which
is done to keep the values of the dot products manageable (otherwise they would
grow with the size of the contracted dimension).&lt;/p&gt;
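&lt;p&gt;A quick way to see why this matters is to check how dot products of random
vectors grow with the vector length; this is a small standalone check, not part
of the attention code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
for d in (16, 256, 4096):
    a = rng.standard_normal((10000, d))
    b = rng.standard_normal((10000, d))
    dots = (a * b).sum(axis=-1)
    # The spread of the raw dot products grows like sqrt(d);
    # after dividing by sqrt(d) it stays around 1.
    print(d, dots.std(), (dots / np.sqrt(d)).std())
&lt;/pre&gt;&lt;/div&gt;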
&lt;p&gt;The only dependency is a function for &lt;a class="reference external" href="https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/"&gt;calculating Softmax&lt;/a&gt;
across the last dimension of an input array:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Compute softmax across last dimension of x.&lt;/span&gt;

&lt;span class="sd"&gt;    x is an arbitrary array with at least two dimensions. The returned array has&lt;/span&gt;
&lt;span class="sd"&gt;    the same shape as x, but its elements sum up to 1 across the last dimension.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="c1"&gt;# Subtract the max for numerical stability&lt;/span&gt;
    &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Divide by sums across last dimension&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When the input is 2D, the &amp;quot;last dimension&amp;quot; is the columns. Colloquially, this
Softmax function acts on each row of &lt;em&gt;x&lt;/em&gt; separately; it applies the Softmax
formula to the elements (columns) of the row, ending up with a row of numbers in
the range &lt;tt class="docutils literal"&gt;[0,1]&lt;/tt&gt; that all sum up to 1.&lt;/p&gt;
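&lt;p&gt;For instance, a quick check of this behavior using the
&lt;tt class="docutils literal"&gt;softmax_lastdim&lt;/tt&gt; function defined above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 4.0, 4.0]])
s = softmax_lastdim(x)
print(s)               # rows are roughly [0.09, 0.24, 0.67] and [0.33, 0.33, 0.33]
print(s.sum(axis=-1))  # [1. 1.]
&lt;/pre&gt;&lt;/div&gt;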
&lt;p&gt;Another note on the dimensions: it's possible for the &lt;tt class="docutils literal"&gt;Wv&lt;/tt&gt; matrix to have a
different second dimension from &lt;tt class="docutils literal"&gt;Wq&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;Wk&lt;/tt&gt;. If you look at the diagram,
you can see this will work out, since the softmax produces (N, N), and whatever
the second dimension of V is, will be the second dimension of the output. The
AIAYN paper designates these dimensions as &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/85b78059c9068f7b482e10c9a50fad97ad8cdcf5.svg" style="height: 15px;" type="image/svg+xml"&gt;d_k&lt;/object&gt; and &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/c347663196aa1510668fbd932f8f62aa92e8b15d.svg" style="height: 15px;" type="image/svg+xml"&gt;d_v&lt;/object&gt;, but in
practice &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/8ee5144f6c9132f363cbc0a8a61b08a78b578968.svg" style="height: 15px;" type="image/svg+xml"&gt;d_k=d_v&lt;/object&gt; in all the variants it lists. I found that these
dimensions are typically the same in other papers as well. Therefore, for
simplicity I just made them all equal to D in this post; if desired, a variant
with different &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/85b78059c9068f7b482e10c9a50fad97ad8cdcf5.svg" style="height: 15px;" type="image/svg+xml"&gt;d_k&lt;/object&gt; and &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/c347663196aa1510668fbd932f8f62aa92e8b15d.svg" style="height: 15px;" type="image/svg+xml"&gt;d_v&lt;/object&gt; is a fairly trivial modification to
this code.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="batched-self-attention"&gt;
&lt;h2&gt;Batched self-attention&lt;/h2&gt;
&lt;p&gt;In the real world, the input array is unlikely to be 2D because models are
trained on &lt;em&gt;batches&lt;/em&gt; of input sequences. To leverage the parallelism of modern
hardware, whole batches are typically processed in the same operation.&lt;/p&gt;
&lt;img alt="input array (B, N, D)" class="align-center" src="https://eli.thegreenplace.net/images/2025/bnd-array.png" /&gt;
&lt;p&gt;The batched version of scaled self-attention is very similar to the non-batched
one, due to the magic of Numpy matrix multiplication and broadcasts. Now the
input shape is (B, N, D), where B is the batch dimension. The &lt;tt class="docutils literal"&gt;W*&lt;/tt&gt; matrices
are still (D, HS); multiplying a (B, N, D) array by (D, HS) performs contraction
between the last axis of the first array and the first axis of the second array,
resulting in (B, N, HS). Here's the code, with the dimensions annotated
for each operation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# self_attention with inputs that have a batch dimension.&lt;/span&gt;
&lt;span class="c1"&gt;# x has shape (B, N, D)&lt;/span&gt;
&lt;span class="c1"&gt;# Each of W* has shape (D, D)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;self_attention_batched&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;

    &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;swapaxes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, N)&lt;/span&gt;

    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, N)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note that the only difference between this and the non-batched version is the
line calculating &lt;tt class="docutils literal"&gt;kq&lt;/tt&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Since &lt;tt class="docutils literal"&gt;k&lt;/tt&gt; is no longer 2D, the notion of &amp;quot;transpose&amp;quot; is ambiguous so we
explicitly ask to swap the last and the penultimate axis, leaving the first
axis (B) intact.&lt;/li&gt;
&lt;li&gt;When calculating the scaling factor we use &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;k.shape[-1]&lt;/span&gt;&lt;/tt&gt; to select the
&lt;em&gt;last&lt;/em&gt; dimension of &lt;tt class="docutils literal"&gt;k&lt;/tt&gt;, instead of &lt;tt class="docutils literal"&gt;k.shape[1]&lt;/tt&gt; which only selects the
last dimension for 2D arrays.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In fact, this function could also calculate the non-batched version! From now
on, we'll assume that all inputs are batched, and all operations are implicitly
batched. I'm not going to be using the &amp;quot;batched&amp;quot; prefix or suffix on functions
any more.&lt;/p&gt;
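&lt;p&gt;Here is a short usage sketch of &lt;tt class="docutils literal"&gt;self_attention_batched&lt;/tt&gt; with
made-up shapes; it also demonstrates the non-batched case mentioned above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
B, N, D, HS = 4, 6, 32, 8
x = rng.standard_normal((B, N, D))
Wq = rng.standard_normal((D, HS))
Wk = rng.standard_normal((D, HS))
Wv = rng.standard_normal((D, HS))

out = self_attention_batched(x, Wk, Wq, Wv)
print(out.shape)    # (4, 6, 8)

# The same function works on a single (N, D) sequence as well, since all
# the operations only touch the last two axes:
print(self_attention_batched(x[0], Wk, Wq, Wv).shape)    # (6, 8)
&lt;/pre&gt;&lt;/div&gt;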
&lt;p&gt;The basic underlying idea of the attention module is to shift around the
multi-dimensional representations of tokens in the sequence towards a better
representation of the entire sequence. The tokens &lt;em&gt;attend to&lt;/em&gt; each other.
Specifically, the matrix produced by the Softmax operation is called the
&lt;em&gt;attention matrix&lt;/em&gt;. It's (N, N); for each token it specifies how much
information from every other token in the sequence should be taken into account.
For example, a higher number in cell (R, C) means that the token at index R
in the sequence attends more strongly to the token at index C.&lt;/p&gt;
&lt;p&gt;Here's a nice example from the AIAYN paper, showing a word sequence and the
weights produced by two attention heads (purple and brown) for a given position
in the input sequence:&lt;/p&gt;
&lt;img alt="attention paper screenshot showing learned attention" class="align-center" src="https://eli.thegreenplace.net/images/2025/aiayn-paper-screenshot.png" /&gt;
&lt;p&gt;This shows how the model is learning to resolve what the word &amp;quot;its&amp;quot; refers to
in the sentence. Let's take just the purple head as an example. The index of
token &amp;quot;its&amp;quot; in the sequence is 8, and the index of &amp;quot;Law&amp;quot; is 1. In the attention
matrix for this head, the value at index (8, 1) will be very high (close to 1),
with other values in the same row much lower.&lt;/p&gt;
&lt;p&gt;While this intuitive explanation isn't critical to understand how attention is
implemented, it will become more important when we talk about &lt;em&gt;masked&lt;/em&gt;
self-attention later on.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="multi-head-attention"&gt;
&lt;h2&gt;Multi-head attention&lt;/h2&gt;
&lt;p&gt;The attention mechanism we've seen so far has a single set of K, Q and V
matrices. This is called one &amp;quot;head&amp;quot; of attention. In today's models, there
are typically multiple heads. Each head does its attention job separately, and
in the end all these results are concatenated and fed through a linear layer.&lt;/p&gt;
&lt;p&gt;In what follows, NH is the number of heads and HS is the head size.
Typically, NH times HS would be D; for example, the AIAYN paper mentions
several configurations for D=512: NH=8 and HS=64, NH=32 and HS=16, and so on &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.
However, the math works out even if this isn't the case, because the final linear
(&amp;quot;projection&amp;quot;) layer maps the output back to (N, D).&lt;/p&gt;
&lt;p&gt;Assuming the previous diagram showing a self-attention module is a single head
with input (N, D) and output (N, HS), this is how multiple heads are combined:&lt;/p&gt;
&lt;img alt="schematic of multiple attention heads" class="align-center" src="https://eli.thegreenplace.net/images/2025/multi-head-attention.png" /&gt;
&lt;p&gt;Each of the (NH) heads has its own parameter weights for Q, K and
V. Each attention head outputs a (N, HS) matrix; these are concatenated along
the last dimension to (N, NH * HS), which is passed through a final linear
projection.&lt;/p&gt;
&lt;p&gt;Here's a function implementing (batched) multi-head attention; for now, please
ignore the code inside &lt;tt class="docutils literal"&gt;do_mask&lt;/tt&gt; conditions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# x has shape (B, N, D)&lt;/span&gt;
&lt;span class="c1"&gt;# In what follows:&lt;/span&gt;
&lt;span class="c1"&gt;#   NH = number of heads&lt;/span&gt;
&lt;span class="c1"&gt;#   HS = head size&lt;/span&gt;
&lt;span class="c1"&gt;# Each W*s is a list of NH weight matrices of shape (D, HS).&lt;/span&gt;
&lt;span class="c1"&gt;# Wp is a weight matrix for the final linear projection, of shape (NH * HS, D)&lt;/span&gt;
&lt;span class="c1"&gt;# The result is (B, N, D)&lt;/span&gt;
&lt;span class="c1"&gt;# If do_mask is True, each attention head is masked from attending to future&lt;/span&gt;
&lt;span class="c1"&gt;# tokens.&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multihead_attention_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Check shapes.&lt;/span&gt;
    &lt;span class="n"&gt;NH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;HS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wqs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Wqs&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Wks&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;HS&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;NH&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;HS&lt;/span&gt;

    &lt;span class="c1"&gt;# List of head outputs&lt;/span&gt;
    &lt;span class="n"&gt;head_outs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;do_mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# mask is a lower-triangular (N, N) matrix, with zeros above&lt;/span&gt;
        &lt;span class="c1"&gt;# the diagonal and ones on the diagonal and below.&lt;/span&gt;
        &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tril&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Calculate self attention for each head separately&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;
        &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;
        &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;

        &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;swapaxes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, N)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;do_mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Set the masked positions to -inf, to ensure that a token isn&amp;#39;t&lt;/span&gt;
            &lt;span class="c1"&gt;# affected by tokens that come after it in the softmax.&lt;/span&gt;
            &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, N)&lt;/span&gt;
        &lt;span class="n"&gt;head_outs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, HS)&lt;/span&gt;

    &lt;span class="c1"&gt;# Concatenate the head outputs and apply the final linear projection&lt;/span&gt;
    &lt;span class="n"&gt;all_heads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_outs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, NH * HS)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_heads&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, D)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It is possible to vectorize this code even further; you'll sometimes see the
heads laid out in a separate (4th) dimension instead of being a list. See
the &lt;em&gt;Vectorizing across the heads dimension&lt;/em&gt; section.&lt;/p&gt;
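&lt;p&gt;Before moving on, here is a short usage sketch of
&lt;tt class="docutils literal"&gt;multihead_attention_list&lt;/tt&gt; with made-up sizes (chosen so that
NH * HS = D):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
B, N, D, NH = 2, 6, 32, 8
HS = D // NH

x = rng.standard_normal((B, N, D))
Wqs = [rng.standard_normal((D, HS)) for _ in range(NH)]
Wks = [rng.standard_normal((D, HS)) for _ in range(NH)]
Wvs = [rng.standard_normal((D, HS)) for _ in range(NH)]
Wp = rng.standard_normal((NH * HS, D))

out = multihead_attention_list(x, Wqs, Wks, Wvs, Wp, do_mask=True)
print(out.shape)    # (2, 6, 32)
&lt;/pre&gt;&lt;/div&gt;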
&lt;/div&gt;
&lt;div class="section" id="masked-or-causal-self-attention"&gt;
&lt;h2&gt;Masked (or Causal) self-attention&lt;/h2&gt;
&lt;p&gt;Attention modules can be used in both &lt;em&gt;encoder&lt;/em&gt; and &lt;em&gt;decoder&lt;/em&gt; blocks. &lt;em&gt;Encoder&lt;/em&gt;
blocks are useful for things like language understanding or translation; for
these, it makes sense for each token to attend to all the other tokens in the
sequence.&lt;/p&gt;
&lt;p&gt;However, for generative models this presents a problem: if during training a
word attends to future words, the model will just &amp;quot;cheat&amp;quot; and not really learn
how to generate the next word from only past words. Generation is done in a &lt;em&gt;decoder&lt;/em&gt;
block, and for this we need to add masking to attention.&lt;/p&gt;
&lt;p&gt;Conceptually, masking is very simple. Consider the sentence:&lt;/p&gt;
&lt;blockquote&gt;
People like watching funny cat videos&lt;/blockquote&gt;
&lt;p&gt;When our attention code generates the &lt;tt class="docutils literal"&gt;att&lt;/tt&gt; matrix, it's a square (N, N)
matrix with attention weights from each token to each other token in the
sequence:&lt;/p&gt;
&lt;img alt="attention masking" class="align-center" src="https://eli.thegreenplace.net/images/2025/attention-masking.png" /&gt;
&lt;p&gt;What we want is for all the gray cells in this matrix to be zero, to ensure
that a token doesn't attend to future tokens. The blue cells in the matrix
add up to 1 in each row, after the softmax operation.&lt;/p&gt;
&lt;p&gt;Now take a look at the previous code sample and see what happens when
&lt;tt class="docutils literal"&gt;do_mask=True&lt;/tt&gt;:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;First, a (N, N) lower-triangular array is prepared with zeros above the
diagonal and ones on the diagonal and below.&lt;/li&gt;
&lt;li&gt;Then, before we pass the scaled &lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/fa80a943cc46f6c1154320719b40219b80c9e5e4.svg" style="height: 19px;" type="image/svg+xml"&gt;QK^T&lt;/object&gt; to softmax, we set its values
to &lt;object class="valign-0" data="https://eli.thegreenplace.net/images/math/18787d835dea1ca698e365c252f82b506cecfce7.svg" style="height: 8px;" type="image/svg+xml"&gt;-\infty&lt;/object&gt; wherever the mask matrix is 0. This ensures that the
softmax function will assign zeros to outputs at these indices, while still
producing the proper values in the rest of the row.&lt;/li&gt;
&lt;/ol&gt;
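&lt;p&gt;Here is a small standalone sketch of these two steps for N=4, reusing
&lt;tt class="docutils literal"&gt;softmax_lastdim&lt;/tt&gt; from above, with random numbers standing in for
the scaled attention scores:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

N = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((N, N))            # stand-in for the scaled QK^T

mask = np.tril(np.ones((N, N)))                 # ones on and below the diagonal
masked = np.where(mask == 0, -np.inf, scores)   # -inf above the diagonal
att = softmax_lastdim(masked)

# Row i has non-zero weights only in columns 0..i, and each row sums to 1.
print(np.round(att, 2))
print(att.sum(axis=-1))    # [1. 1. 1. 1.]
&lt;/pre&gt;&lt;/div&gt;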
&lt;p&gt;Another name for masked self-attention is &lt;em&gt;causal&lt;/em&gt; self-attention. This is a
very good name that comes from &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Causal_system"&gt;causal systems&lt;/a&gt;
in control theory.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="intuition-what-attention-does"&gt;
&lt;h2&gt;Intuition - what attention does&lt;/h2&gt;
&lt;p&gt;What does the attention block try to accomplish? To think about it intuitively,
let's focus on a single token in the input (ignoring batch) - &lt;tt class="docutils literal"&gt;x[i]&lt;/tt&gt;. For
this token, the attention block produces an output token &lt;tt class="docutils literal"&gt;out[i]&lt;/tt&gt; that
blends &lt;tt class="docutils literal"&gt;x[i]&lt;/tt&gt;'s embedding (multi-dimensional dense vector representation)
with contextual information from all the tokens preceding it in
the sequence, i.e. &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;x[:i]&lt;/span&gt;&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;The way this is done is first calculating the &lt;em&gt;query&lt;/em&gt; vector &lt;tt class="docutils literal"&gt;q&lt;/tt&gt; for &lt;tt class="docutils literal"&gt;x[i]&lt;/tt&gt; (using
&lt;tt class="docutils literal"&gt;Wq&lt;/tt&gt;). This query can be thought of as &amp;quot;what attributes does this token care
about in its context tokens&amp;quot;.&lt;/p&gt;
&lt;p&gt;Then, for each of the context tokens (including &lt;tt class="docutils literal"&gt;x[i]&lt;/tt&gt; itself) we calculate:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;em&gt;Key&lt;/em&gt; (using &lt;tt class="docutils literal"&gt;Wk&lt;/tt&gt;): these are the attributes of the token that queries may
refer to.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Value&lt;/em&gt; (using &lt;tt class="docutils literal"&gt;Wv&lt;/tt&gt;): these are the associated values tokens carry.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When attention calculates &lt;tt class="docutils literal"&gt;q &amp;#64; K.T&lt;/tt&gt; for each token, the result is - for each
context token - the weights to use for mixing in the token's value. Then, when
this is multiplied by &lt;tt class="docutils literal"&gt;V&lt;/tt&gt;, the values are properly weighted.&lt;/p&gt;
&lt;p&gt;So this is a very general approach for the model to learn what kind of
information each token &amp;quot;cares&amp;quot; about in its context tokens, and how to blend
the token's embedding with those of the preceding context tokens, to properly
encode the context the token is encountered in.&lt;/p&gt;
&lt;p&gt;Our implementation, starting with the basic scaled self-attention, implements
this for all tokens in the input sequence simultaneously; hence, we don't just
take a single &lt;tt class="docutils literal"&gt;x[i]&lt;/tt&gt;, calculate its &lt;tt class="docutils literal"&gt;q&lt;/tt&gt; and then multiply that by &lt;tt class="docutils literal"&gt;K.T&lt;/tt&gt;.
Rather, we calculate &lt;tt class="docutils literal"&gt;Q&lt;/tt&gt; from all &lt;tt class="docutils literal"&gt;x&lt;/tt&gt;, and continue using matrix
multiplications to vectorize these calculations across the entire sequence.&lt;/p&gt;
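&lt;p&gt;To verify that the vectorized formulation really computes this per-token
blend, here is a small check (with made-up shapes) that recovers the output for
a single token from the basic, unmasked &lt;tt class="docutils literal"&gt;self_attention&lt;/tt&gt; function
shown earlier:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
N, D, HS = 6, 16, 8
x = rng.standard_normal((N, D))
Wq = rng.standard_normal((D, HS))
Wk = rng.standard_normal((D, HS))
Wv = rng.standard_normal((D, HS))

full = self_attention(x, Wk, Wq, Wv)     # (N, HS)

i = 3
q_i = x[i] @ Wq                          # (HS,) query for token i
scores = (x @ Wk) @ q_i / np.sqrt(HS)    # (N,) one score per context token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over the context tokens
out_i = weights @ (x @ Wv)               # (HS,) weighted blend of the values
print(np.allclose(out_i, full[i]))       # True
&lt;/pre&gt;&lt;/div&gt;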
&lt;p&gt;It's important to keep in mind that this intuitive explanation suffers from
anthropomorphism. We try to explain what the model does intuitively, but in
reality this is only a very abstract approximation of what's happening (consider
that attention has multiple heads, and also that LLMs typically have dozens
of repeating transformer layers with self-attention blocks, applying the same
mechanism over and over again).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cross-attention"&gt;
&lt;h2&gt;Cross-attention&lt;/h2&gt;
&lt;p&gt;So far we've been working with self-attention blocks, where the &lt;em&gt;self&lt;/em&gt; suggests
that elements in the input sequence attend to other elements in the same input
sequence.&lt;/p&gt;
&lt;p&gt;Another variant of attention is &lt;em&gt;cross-attention&lt;/em&gt;, where elements of one
sequence attend to elements in another sequence. This variant exists in the
decoder block of the AIAYN paper. This is a single head of
cross-attention:&lt;/p&gt;
&lt;img alt="cross-attention with different Nq, Nv" class="align-center" src="https://eli.thegreenplace.net/images/2025/cross-attention-head.png" /&gt;
&lt;p&gt;Here we have two sequences with potentially different lengths: &lt;tt class="docutils literal"&gt;xq&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;xv&lt;/tt&gt;. &lt;tt class="docutils literal"&gt;xq&lt;/tt&gt; is used for the query part of attention, while &lt;tt class="docutils literal"&gt;xv&lt;/tt&gt; is used for
the key and value parts. The rest of the dimensions remain as before. The output
of such a block is shaped (Nq, HS).&lt;/p&gt;
&lt;p&gt;This is an implementation of multi-head cross-attention; it doesn't include
masking, since masking is not typically necessary in cross-attention - it's OK
for elements of &lt;tt class="docutils literal"&gt;xq&lt;/tt&gt; to attend to all elements of &lt;tt class="docutils literal"&gt;xv&lt;/tt&gt; &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Cross attention between two input sequences that can have different lengths.&lt;/span&gt;
&lt;span class="c1"&gt;# xq has shape (B, Nq, D)&lt;/span&gt;
&lt;span class="c1"&gt;# xv has shape (B, Nv, D)&lt;/span&gt;
&lt;span class="c1"&gt;# In what follows:&lt;/span&gt;
&lt;span class="c1"&gt;#   NH = number of heads&lt;/span&gt;
&lt;span class="c1"&gt;#   HS = head size&lt;/span&gt;
&lt;span class="c1"&gt;# Each W*s is a list of NH weight matrices of shape (D, HS).&lt;/span&gt;
&lt;span class="c1"&gt;# Wp is a weight matrix for the final linear projection, of shape (NH * HS, D)&lt;/span&gt;
&lt;span class="c1"&gt;# The result is (B, Nq, D)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multihead_cross_attention_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Check shapes.&lt;/span&gt;
    &lt;span class="n"&gt;NH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;HS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wqs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Wqs&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Wks&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;HS&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;NH&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;HS&lt;/span&gt;

    &lt;span class="c1"&gt;# List of head outputs&lt;/span&gt;
    &lt;span class="n"&gt;head_outs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wvs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xq&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wq&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nq, HS)&lt;/span&gt;
        &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xv&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wk&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nv, HS)&lt;/span&gt;
        &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xv&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wv&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nv, HS)&lt;/span&gt;

        &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;swapaxes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nq, Nv)&lt;/span&gt;

        &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nq, Nv)&lt;/span&gt;
        &lt;span class="n"&gt;head_outs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nq, HS)&lt;/span&gt;

    &lt;span class="c1"&gt;# Concatenate the head outputs and apply the final linear projection&lt;/span&gt;
    &lt;span class="n"&gt;all_heads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_outs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nq, NH * HS)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_heads&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;  &lt;span class="c1"&gt;# (B, Nq, D)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
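&lt;p&gt;A quick usage sketch with two sequences of different (made-up) lengths,
showing that the output keeps the length of the query sequence:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
B, Nq, Nv, D, NH = 2, 5, 9, 32, 8
HS = D // NH

xq = rng.standard_normal((B, Nq, D))
xv = rng.standard_normal((B, Nv, D))
Wqs = [rng.standard_normal((D, HS)) for _ in range(NH)]
Wks = [rng.standard_normal((D, HS)) for _ in range(NH)]
Wvs = [rng.standard_normal((D, HS)) for _ in range(NH)]
Wp = rng.standard_normal((NH * HS, D))

out = multihead_cross_attention_list(xq, xv, Wqs, Wks, Wvs, Wp)
print(out.shape)    # (2, 5, 32)
&lt;/pre&gt;&lt;/div&gt;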
&lt;/div&gt;
&lt;div class="section" id="vectorizing-across-the-heads-dimension"&gt;
&lt;h2&gt;Vectorizing across the heads dimension&lt;/h2&gt;
&lt;p&gt;The &lt;tt class="docutils literal"&gt;multihead_attention_list&lt;/tt&gt; implementation shown above uses lists of weight
matrices as input. While this makes the code clearer, it's not a particularly
friendly format for an optimized implementation - especially on accelerators
like GPUs and TPUs. We can vectorize it further by creating a new dimension for
attention heads.&lt;/p&gt;
&lt;p&gt;To understand the trick being used, consider a basic matmul of (8, 6) by
(6, 2):&lt;/p&gt;
&lt;img alt="basic matrix multiplication" class="align-center" src="https://eli.thegreenplace.net/images/2025/matmul-28.png" /&gt;
&lt;p&gt;Now suppose we want to multiply our LHS by &lt;em&gt;another&lt;/em&gt; (6, 2) matrix. We can do
it all in the same operation by concatenating the two RHS matrices along
columns:&lt;/p&gt;
&lt;img alt="concatenated basic matrix multiplication" class="align-center" src="https://eli.thegreenplace.net/images/2025/matmul-concat-two.png" /&gt;
&lt;p&gt;If the yellow RHS block in both diagrams is identical, the green block of the
result will be as well. And the violet block is just the matmul of the LHS by
the red block of the RHS. This stems from the semantics of matrix
multiplication, and is easy to verify on paper.&lt;/p&gt;
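&lt;p&gt;A quick Numpy check of this claim, with small random matrices:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
B1 = rng.standard_normal((6, 2))
B2 = rng.standard_normal((6, 2))

combined = A @ np.concatenate([B1, B2], axis=1)      # (8, 4)
separate = np.concatenate([A @ B1, A @ B2], axis=1)  # (8, 4)
print(np.allclose(combined, separate))               # True
&lt;/pre&gt;&lt;/div&gt;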
&lt;p&gt;Now back to our multi-head attention. Note that we multiply the input &lt;em&gt;x&lt;/em&gt; by
a whole list of weight matrices - in fact, by &lt;em&gt;three&lt;/em&gt; lists (one list for Q,
one for K, and another for V). We can use the same vectorization technique by
concatenating all these weight matrices into a single one. Assuming that
&lt;tt class="docutils literal"&gt;NH * HS = D&lt;/tt&gt;, the shape of the combined matrix is (D, 3 * D). Here's
the vectorized implementation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# x has shape (B, N, D)&lt;/span&gt;
&lt;span class="c1"&gt;# In what follows:&lt;/span&gt;
&lt;span class="c1"&gt;#   NH = number of heads&lt;/span&gt;
&lt;span class="c1"&gt;#   HS = head size&lt;/span&gt;
&lt;span class="c1"&gt;#   NH * HS = D&lt;/span&gt;
&lt;span class="c1"&gt;# W is expected to have shape (D, 3 * D), with all the weight matrices for&lt;/span&gt;
&lt;span class="c1"&gt;# Qs, Ks, and Vs concatenated along the last dimension, in this order.&lt;/span&gt;
&lt;span class="c1"&gt;# Wp is a weight matrix for the final linear projection, of shape (D, D).&lt;/span&gt;
&lt;span class="c1"&gt;# The result is (B, N, D).&lt;/span&gt;
&lt;span class="c1"&gt;# If do_mask is True, each attention head is masked from attending to future&lt;/span&gt;
&lt;span class="c1"&gt;# tokens.&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multihead_attention_vec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;qkv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, 3 * D)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qkv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, D) each&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;do_mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# mask is a lower-triangular (N, N) matrix, with zeros above&lt;/span&gt;
        &lt;span class="c1"&gt;# the diagonal and ones on the diagonal and below.&lt;/span&gt;
        &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tril&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="n"&gt;HS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;NH&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, NH, N, HS)&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, NH, N, HS)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, NH, N, HS)&lt;/span&gt;

    &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;swapaxes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# (B, NH, N, N)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;do_mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Set the masked positions to -inf, to ensure that a token isn&amp;#39;t&lt;/span&gt;
        &lt;span class="c1"&gt;# affected by tokens that come after it in the softmax.&lt;/span&gt;
        &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;softmax_lastdim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (B, NH, N, N)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;  &lt;span class="c1"&gt;# (B, NH, N, HS)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Wp&lt;/span&gt;  &lt;span class="c1"&gt;# (B, N, D)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This code computes Q, K and V in a single matmul, and then splits them
into separate arrays (note that on accelerators these splits and later
transposes may be very cheap or even free as they represent a different access
pattern into the same data).&lt;/p&gt;
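&lt;p&gt;To see why the fused projection works, here's a small standalone sketch (with
made-up dimensions, and assuming W is simply the per-projection weights Wq, Wk and
Wv concatenated along the last axis); splitting the fused result recovers the
three separate projections:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 6, 8

x = rng.standard_normal((B, N, D))
Wq = rng.standard_normal((D, D))
Wk = rng.standard_normal((D, D))
Wv = rng.standard_normal((D, D))

# Fused weight matrix (D, 3 * D): the column blocks are [Wq | Wk | Wv].
W = np.concatenate([Wq, Wk, Wv], axis=1)

# A single matmul followed by a split along the last axis...
q, k, v = np.split(x @ W, 3, axis=-1)

# ...is equivalent to three separate projections.
assert np.allclose(q, x @ Wq)
assert np.allclose(k, x @ Wk)
assert np.allclose(v, x @ Wv)
&lt;/pre&gt;&lt;/div&gt;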
&lt;p&gt;Each of Q, K and V is initially (B, N, D), so they are reshaped into a more
convenient layout: first splitting D into (NH, HS), and then
changing the order of dimensions to get (B, NH, N, HS). In this format, both
B and NH are considered batch dimensions that are fully parallelizable.
The &lt;object class="valign-m4" data="https://eli.thegreenplace.net/images/math/fa80a943cc46f6c1154320719b40219b80c9e5e4.svg" style="height: 19px;" type="image/svg+xml"&gt;QK^T&lt;/object&gt; computation can then proceed as before, and Numpy will
automatically perform the matmul over all the batch dimensions.&lt;/p&gt;
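&lt;p&gt;Here's a quick standalone sketch (again with made-up dimensions) of this
batching behavior: the &lt;tt class="docutils literal"&gt;@&lt;/tt&gt; operator performs the matmul on the
last two axes and treats all leading axes as batch dimensions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
B, NH, N, HS = 2, 4, 6, 8

q = rng.standard_normal((B, NH, N, HS))
k = rng.standard_normal((B, NH, N, HS))

# The matmul runs on the last two axes, independently for every (B, NH) pair.
scores = q @ k.swapaxes(-1, -2)
assert scores.shape == (B, NH, N, N)

# The same result, computed one (batch, head) pair at a time.
for b in range(B):
    for h in range(NH):
        assert np.allclose(scores[b, h], q[b, h] @ k[b, h].T)
&lt;/pre&gt;&lt;/div&gt;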
&lt;p&gt;Sometimes you'll see an alternative notation for these matrix
multiplications in papers and code: Einstein summation, available in Numpy as
&lt;tt class="docutils literal"&gt;numpy.einsum&lt;/tt&gt;. For
example, in our last code sample the computation of &lt;tt class="docutils literal"&gt;kq&lt;/tt&gt; could also be
written as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;einsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bhqd,bhkd-&amp;gt;bhqk&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://eli.thegreenplace.net/2025/understanding-numpys-einsum/"&gt;this post for my detailed notes on this notation&lt;/a&gt;.&lt;/p&gt;
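&lt;p&gt;As a sanity check, here's a small standalone sketch (with made-up dimensions)
showing that the &lt;tt class="docutils literal"&gt;einsum&lt;/tt&gt; spelling matches the plain matmul
spelling, and that the attention-weighted sum over V can be expressed the same
way:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

rng = np.random.default_rng(0)
B, NH, N, HS = 2, 4, 6, 8

q = rng.standard_normal((B, NH, N, HS))
k = rng.standard_normal((B, NH, N, HS))
v = rng.standard_normal((B, NH, N, HS))

# Scaled attention scores: einsum vs. explicit matmul.
kq_einsum = np.einsum(&amp;quot;bhqd,bhkd-&amp;gt;bhqk&amp;quot;, q, k) / np.sqrt(HS)
kq_matmul = q @ k.swapaxes(-1, -2) / np.sqrt(HS)
assert np.allclose(kq_einsum, kq_matmul)

# The attention-weighted sum over V, written both ways.
att = rng.standard_normal((B, NH, N, N))
assert np.allclose(np.einsum(&amp;quot;bhqk,bhkd-&amp;gt;bhqd&amp;quot;, att, v), att @ v)
&lt;/pre&gt;&lt;/div&gt;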
&lt;/div&gt;
&lt;div class="section" id="code"&gt;
&lt;h2&gt;Code&lt;/h2&gt;
&lt;p&gt;The full code for these samples, with tests, is available
&lt;a class="reference external" href="https://github.com/eliben/deep-learning-samples/tree/main/transformer-attention"&gt;in this repository&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;In LLM papers, D is often called &lt;object class="valign-m3" data="https://eli.thegreenplace.net/images/math/8594e2a5169d08eec15a946ef8fadc74c00423cd.svg" style="height: 15px;" type="image/svg+xml"&gt;d_{model}&lt;/object&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;In the GPT-3 paper, this is also true for all model variants. For example,
the largest 175B model has NH=96, HS=128 and D=12288.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It's also not as easy to define mathematically: how do we make a
non-square matrix triangular? And what does it mean when the lengths
of the two inputs are different?&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Math"></category><category term="Machine Learning"></category><category term="Python"></category></entry></feed>