One of the most exciting areas of LLM-related development in 2023 is the availability of powerful (and sometimes even open-source) models we can run locally on our machines.

Several tools exist that make it relatively easy to obtain, run and manage such models locally; for example Ollama (written in Go!) and LocalAI (also largely written in Go!).

In this post I'm going to describe how to use Ollama to run a model locally, communicate with it using its REST API, and integrate it into a Go program with LangChainGo.

Ollama logo, taken from the Ollama website

Setting up Ollama

To start, follow the installation and setup instructions from the Ollama website. Ollama runs as a service, exposing a REST API on a localhost port. Once it's installed, you can invoke ollama run <modelname> to talk to a model; the model is downloaded and cached the first time it's requested.

In this blog post, we'll be talking to the llama2 model, so run ollama run llama2. After the ollama command finishes downloading the model, we'll see a prompt and will be able to chat with it [1]:

>>> very briefly, tell me the difference between a comet and a meteor

 Sure! Here's a brief difference:

A comet is a small, icy body that orbits the sun. When a comet approaches the
inner solar system, the heat from the sun causes the comet to release gas and
dust, creating a bright tail that can be seen from Earth.

A meteor, on the other hand, is a small piece of rock or metal that enters the
Earth's atmosphere. As it travels through the atmosphere, the friction causes
the meteor to heat up and burn, producing a bright streak of light in the sky,
commonly known as a shooting star. If the meteor survives its passage through
the atmosphere and lands on Earth, it is called a meteorite.

Manually invoking the REST API

ollama runs in the background and exposes a REST API on port 11434. We can talk to it "manually" using curl commands:

$ curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "very briefly, tell me the difference between a comet and a meteor",
  "stream": false
}'
{"model":"llama2","created_at":"2023-11-20T14:53:47.32607236Z",
 "response":"\nSure! Here's the difference:\n\nA comet is a small,
  icy body that orbits the sun. Comets are composed of dust and frozen
  gases, such as water, methane, and ammonia. When a comet approaches
  the inner solar system, the sun's heat causes the comet's ices
  to vaporize, creating a bright tail of gas and dust that can be seen
  from Earth.\n\nA meteor, on the other hand, is a small body of rock
[...]
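
The same request is easy to issue from Go without any third-party packages, using just net/http and encoding/json from the standard library. Here's a minimal sketch; the request fields mirror the curl invocation above, and the response struct only declares the fields we actually look at (the real reply contains more):

package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "log"
  "net/http"
)

// generateResponse decodes just the fields we need from the
// /api/generate reply; the full reply contains more fields.
type generateResponse struct {
  Model    string `json:"model"`
  Response string `json:"response"`
  Done     bool   `json:"done"`
}

func main() {
  body, err := json.Marshal(map[string]any{
    "model":  "llama2",
    "prompt": "very briefly, tell me the difference between a comet and a meteor",
    "stream": false,
  })
  if err != nil {
    log.Fatal(err)
  }

  resp, err := http.Post("http://localhost:11434/api/generate",
    "application/json", bytes.NewReader(body))
  if err != nil {
    log.Fatal(err)
  }
  defer resp.Body.Close()

  var gr generateResponse
  if err := json.NewDecoder(resp.Body).Decode(&gr); err != nil {
    log.Fatal(err)
  }
  fmt.Println(gr.Response)
}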

This request may take a bit of time, especially if your machine doesn't have a powerful GPU. We can also ask Ollama to stream the model's response, so we get output as soon as it's ready instead of waiting for the model to complete its reply. We can do that by passing "stream": true:

$ curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "very briefly, tell me the difference between a comet and a meteor",
  "stream": true
}'
{"model":"llama2","created_at":"2023-11-20T14:57:06.709696317Z","response":"\n","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:06.89584866Z","response":" Sure","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:07.053242632Z","response":"!","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:07.217867169Z","response":" Here","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:07.374557181Z","response":"'","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:07.560674269Z","response":"s","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:07.719981235Z","response":" the","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:07.878008762Z","response":" quick","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:08.035846088Z","response":" and","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:08.192951527Z","response":" dirty","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:08.372491712Z","response":":","done":false}
{"model":"llama2","created_at":"2023-11-20T14:57:08.530388951Z","response":"\n","done":false}
[...]

The response is broken into separate JSON messages with "done": false. The last message will have "done": true.
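
Since the streamed body is just a sequence of JSON objects, one per line, consuming it from Go is a matter of decoding object after object until we see "done": true. Here's a minimal sketch along the lines of the previous snippet (again, only the fields used here are declared):

package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "log"
  "net/http"
)

// streamChunk holds a single JSON object from the streamed reply;
// the final object has Done set to true.
type streamChunk struct {
  Response string `json:"response"`
  Done     bool   `json:"done"`
}

func main() {
  body, err := json.Marshal(map[string]any{
    "model":  "llama2",
    "prompt": "very briefly, tell me the difference between a comet and a meteor",
    "stream": true,
  })
  if err != nil {
    log.Fatal(err)
  }

  resp, err := http.Post("http://localhost:11434/api/generate",
    "application/json", bytes.NewReader(body))
  if err != nil {
    log.Fatal(err)
  }
  defer resp.Body.Close()

  // json.Decoder reads the concatenated JSON objects one at a time.
  dec := json.NewDecoder(resp.Body)
  for {
    var chunk streamChunk
    if err := dec.Decode(&chunk); err != nil {
      log.Fatal(err)
    }
    fmt.Print(chunk.Response)
    if chunk.Done {
      break
    }
  }
  fmt.Println()
}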

We can send other kinds of requests to the model; for example, we can ask it to calculate embeddings:

$ curl http://localhost:11434/api/embeddings -d '{
  "model": "llama2",
  "prompt": "article about asteroids"
}' | jq
{
  "embedding": [
    0.5615004897117615,
    -2.90958833694458,
    0.836567759513855,
    -0.3081018626689911,
    -1.1424092054367065,
    -1.5503573417663574,
    0.93345707654953,
    -3.008531093597412,
    3.6917684078216553,
    0.3383431136608124,
    1.0924581289291382,
    -2.1573197841644287,
[...]
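
As with generation, this endpoint is easy to call from Go with the standard library alone. Here's a minimal sketch that only decodes the "embedding" array shown in the output above:

package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "log"
  "net/http"
)

func main() {
  body, err := json.Marshal(map[string]any{
    "model":  "llama2",
    "prompt": "article about asteroids",
  })
  if err != nil {
    log.Fatal(err)
  }

  resp, err := http.Post("http://localhost:11434/api/embeddings",
    "application/json", bytes.NewReader(body))
  if err != nil {
    log.Fatal(err)
  }
  defer resp.Body.Close()

  // The reply is a single JSON object with an "embedding" array.
  var er struct {
    Embedding []float64 `json:"embedding"`
  }
  if err := json.NewDecoder(resp.Body).Decode(&er); err != nil {
    log.Fatal(err)
  }
  fmt.Printf("embedding length: %d\n", len(er.Embedding))
}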

Programmatic access to models through Ollama

The Ollama README lists some ways to interact with ollama models programmatically; the most popular way seems to be through LangChain and related tools. LangChain is emerging as a common framework for interacting with LLMs; it offers high-level tools for chaining LLM-related tasks together, but also low-level SDKs for each model's REST API.

Here I will show how to talk to Ollama via LangChainGo, the Go port of LangChain.

Let's start with a simple non-streaming completion request:

package main

import (
  "context"
  "fmt"
  "log"

  "github.com/tmc/langchaingo/llms"
  "github.com/tmc/langchaingo/llms/ollama"
)

func main() {
  llm, err := ollama.New(ollama.WithModel("llama2"))
  if err != nil {
    log.Fatal(err)
  }

  query := "very briefly, tell me the difference between a comet and a meteor"

  ctx := context.Background()
  completion, err := llms.GenerateFromSinglePrompt(ctx, llm, query)
  if err != nil {
    log.Fatal(err)
  }

  fmt.Println("Response:\n", completion)
}
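
To try this out locally, the only dependency to fetch is LangChainGo itself; inside an existing Go module, something along the lines of go get github.com/tmc/langchaingo followed by go run on the file should be all that's needed (with the Ollama service already running, of course).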

For streaming, GenerateFromSinglePrompt will take a streaming function as an option. The streaming function is invoked with each chunk of data as it's received; at the end, it's called with an empty chunk:

package main

import (
  "context"
  "fmt"
  "log"

  "github.com/tmc/langchaingo/llms"
  "github.com/tmc/langchaingo/llms/ollama"
)

func main() {
  llm, err := ollama.New(ollama.WithModel("llama2"))
  if err != nil {
    log.Fatal(err)
  }

  query := "very briefly, tell me the difference between a comet and a meteor"

  ctx := context.Background()
  _, err = llms.GenerateFromSinglePrompt(ctx, llm, query,
    llms.WithStreamingFunc(func(ctx context.Context, chunk []byte) error {
      fmt.Printf("chunk len=%d: %s\n", len(chunk), chunk)
      return nil
    }))
  if err != nil {
    log.Fatal(err)
  }
}

The final completion is still returned from GenerateFromSinglePrompt, in case it's needed. Running this, we'll get something like the following output:

$ go run ollama-completion-stream.go
chunk len=1:

chunk len=5:  Sure
chunk len=1: !
chunk len=5:  Here
chunk len=1: '
chunk len=1: s
chunk len=2:  a
chunk len=6:  brief
chunk len=12:  explanation
[...]
chunk len=0:

Finally, we can also obtain embeddings from the model through LangChainGo:

package main

import (
  "context"
  "fmt"
  "log"

  "github.com/tmc/langchaingo/llms/ollama"
)

func main() {
  llm, err := ollama.New(ollama.WithModel("llama2"))
  if err != nil {
    log.Fatal(err)
  }

  texts := []string{
    "meteor",
    "comet",
    "puppy",
  }

  ctx := context.Background()
  embs, err := llm.CreateEmbedding(ctx, texts)
  if err != nil {
    log.Fatal(err)
  }

  fmt.Printf("Got %d embeddings:\n", len(embs))
  for i, emb := range embs {
    fmt.Printf("%d: len=%d; first few=%v\n", i, len(emb), emb[:4])
  }
}
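
A natural thing to do with these vectors is to compare them to each other. As a quick sketch (not part of the original sample), here's a small cosine similarity helper that could be appended to the program above; the generic constraint sidesteps whether the embeddings come back as []float32 or []float64. We'd expect "comet" and "meteor" to score noticeably closer to each other than either does to "puppy".

// cosineSimilarity returns the cosine similarity of two equal-length
// vectors: their dot product divided by the product of their norms.
// (Add "math" to the imports of the program above to use it.)
func cosineSimilarity[T float32 | float64](a, b []T) float64 {
  var dot, normA, normB float64
  for i := range a {
    dot += float64(a[i]) * float64(b[i])
    normA += float64(a[i]) * float64(a[i])
    normB += float64(b[i]) * float64(b[i])
  }
  return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// In main, after CreateEmbedding returns:
//   fmt.Println("meteor vs. comet:", cosineSimilarity(embs[0], embs[1]))
//   fmt.Println("meteor vs. puppy:", cosineSimilarity(embs[0], embs[2]))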

Code

The full code for this post is available on GitHub.

Update 2024-02-22: See a followup post on using additional models like Google's Gemma with the same setup.


[1] ML models involve a huge amount of mathematical computation and typically run best on beefy GPUs. If your machine (like mine!) doesn't have a GPU installed, the model will still work on the CPU, but will run very slowly.