Google has recently made their newest family of multimodal LLMs available via an API with a generous free tier. Google also released SDKs in several popular programming languages, including Go.

This post is a quick overview of how to get started with the Go SDK to ask the model questions that mix text with images.

The task

We'll be asking the model to explain the difference between two images of turtles; this one:

first turtle image

And this one:

second turtle image

Using the Google AI SDK

With the Google AI SDK, all you need to access the model is to generate an API key (similar to how OpenAI's API works). The Go SDK lives at https://github.com/google/generative-ai-go, with package documentation at https://pkg.go.dev/github.com/google/generative-ai-go; it has a good section of examples we can follow.

Here's the code for our task:

package main

import (
  "context"
  "encoding/json"
  "fmt"
  "log"
  "os"

  "github.com/google/generative-ai-go/genai"
  "google.golang.org/api/option"
)

func main() {
  ctx := context.Background()
  client, err := genai.NewClient(ctx, option.WithAPIKey(os.Getenv("API_KEY")))
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  model := client.GenerativeModel("gemini-pro-vision")

  imgData1, err := os.ReadFile("../images/turtle1.png")
  if err != nil {
    log.Fatal(err)
  }

  imgData2, err := os.ReadFile("../images/turtle2.png")
  if err != nil {
    log.Fatal(err)
  }

  prompt := []genai.Part{
    genai.ImageData("png", imgData1),
    genai.ImageData("png", imgData2),
    genai.Text("Describe the difference between these two pictures, with scientific detail"),
  }
  resp, err := model.GenerateContent(ctx, prompt...)

  if err != nil {
    log.Fatal(err)
  }

  // Dump the whole response as indented JSON for inspection.
  bs, err := json.MarshalIndent(resp, "", "    ")
  if err != nil {
    log.Fatal(err)
  }
  fmt.Println(string(bs))
}
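The program above passes "png" to genai.ImageData by hand for each file. If the images may come in other formats, the format string could be derived from the file contents instead. Here is a minimal sketch using only the standard library; imageFormat is a hypothetical helper (not part of the SDK), and it assumes the format argument is the MIME subtype, as the "png" above suggests:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// imageFormat sniffs the leading bytes of data and returns the MIME subtype
// ("png", "jpeg", ...). It returns an error when the bytes are not
// recognized as an image. Hypothetical helper, not part of the SDK.
func imageFormat(data []byte) (string, error) {
	mime := http.DetectContentType(data) // e.g. "image/png"
	if !strings.HasPrefix(mime, "image/") {
		return "", fmt.Errorf("not a recognized image type: %s", mime)
	}
	return strings.TrimPrefix(mime, "image/"), nil
}

func main() {
	// The 8-byte PNG signature is enough for detection.
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	format, err := imageFormat(pngHeader)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(format)
}
```

In the main program, the returned string would take the place of the hardcoded "png" argument for each image.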

Since the LLM API is multimodal, the SDK provides helper types like genai.ImageData and genai.Text to wrap inputs in a type-safe way. When we run this sample, we get the model's response dumped as a JSON object. The important part is:

"Content": {
  "Parts": [
    "The first picture is of a tortoise, which is a reptile characterized by
    its hard shell. The second picture is of a sea turtle, which is a reptile
    characterized by its flippers and streamlined shell. Tortoises are
    terrestrial animals, while sea turtles are marine animals. Tortoises have
    a domed shell, while sea turtles have a flattened shell. Tortoises have
    thick, scaly skin, while sea turtles have smooth, leathery skin. Tortoises
    have short legs with claws, while sea turtles have long flippers.
    Tortoises have a slow metabolism and can live for over 100 years, while
    sea turtles have a faster metabolism and typically live for around 50
    years."
  ],
  "Role": "model"
},

OK, so now we know :-)
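If you only want the text rather than the whole dump, it can be pulled out of the JSON with encoding/json alone. The sketch below is a rough illustration, not the SDK's own accessor: extractText and the response struct are hypothetical names, and the outer "Candidates" list is an assumption about the shape of the full dump (the excerpt above shows only the "Content" object).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// response mirrors only the fields of the JSON dump needed here.
// The "Candidates" wrapper is an assumption; "Content", "Parts" and "Role"
// appear in the excerpt shown in the post.
type response struct {
	Candidates []struct {
		Content struct {
			Parts []any // text parts decode as strings; other kinds are skipped
			Role  string
		}
	}
}

// extractText returns all text parts from all candidates, joined by newlines.
func extractText(dump []byte) (string, error) {
	var r response
	if err := json.Unmarshal(dump, &r); err != nil {
		return "", fmt.Errorf("decoding response dump: %w", err)
	}
	var texts []string
	for _, c := range r.Candidates {
		for _, p := range c.Content.Parts {
			if s, ok := p.(string); ok {
				texts = append(texts, s)
			}
		}
	}
	return strings.Join(texts, "\n"), nil
}

func main() {
	dump := []byte(`{"Candidates": [{"Content": {
		"Parts": ["The first picture is of a tortoise."],
		"Role": "model"}}]}`)
	text, err := extractText(dump)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

Decoding Parts as []any keeps the sketch tolerant of non-text parts, which are simply skipped.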

Using the GCP Vertex SDK

If you're a GCP customer and have your GCP project set up with billing and everything else, you may want to use the Vertex Go SDK instead.

The great thing about how the Go SDKs work is that you barely have to change your code at all! Only the import line and the client creation change. The import goes from:

"github.com/google/generative-ai-go/genai"

To:

"cloud.google.com/go/vertexai/genai"

The client creation also changes, since authentication works differently. For Vertex, the client should be created like this:

client, err := genai.NewClient(ctx, os.Getenv("GCP_PROJECT_ID"), "us-central1")

Here GCP_PROJECT_ID is an environment variable holding your GCP project ID, and the location/region can be set based on your preferences. The rest of the code remains exactly the same!
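Since an unset GCP_PROJECT_ID would hand an empty project to the client, it may be worth validating the settings first. Here is a minimal sketch under stated assumptions: vertexSettings is a hypothetical helper, GCP_LOCATION is a made-up variable name for overriding the region, and the fallback is the "us-central1" used above. The lookup function is passed in so the logic can be exercised without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// vertexSettings resolves the project and location for the Vertex client.
// getenv is typically os.Getenv. GCP_LOCATION is a hypothetical variable
// name; when it is unset, the region from the post is used.
func vertexSettings(getenv func(string) string) (project, location string, err error) {
	project = getenv("GCP_PROJECT_ID")
	if project == "" {
		return "", "", errors.New("GCP_PROJECT_ID is not set")
	}
	location = getenv("GCP_LOCATION")
	if location == "" {
		location = "us-central1"
	}
	return project, location, nil
}

func main() {
	project, location, err := vertexSettings(os.Getenv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Printf("project=%s location=%s\n", project, location)
}
```

The two returned values would then be passed to genai.NewClient in place of the os.Getenv call and the literal region.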

There are two SDKs because the features offered by the two products can differ in some cases. For example, the GCP one may allow you to read data directly from your storage buckets or database tables.

Code

The full code for all the samples in this post - along with the sample images - is available on GitHub.

Update 2024-01-31: see this post about accessing the Gemini models via langchaingo.