Playing with embeddings

I was inspired by Simon Willison’s recent post on embeddings and decided to use them to explore some documents. I’d blocked out some time to do this, but ended up with decent results in just under an hour.

Introduction

Embeddings are functions that turn pieces of text into fixed-length multi-dimensional vectors of floating point numbers. These can be considered as representing locations within a multi-dimensional space, where the position relates to a text’s semantic content “according to the embedding model’s weird, mostly incomprehensible understanding of the world”. While nobody understands the meaning of the individual numbers, the locations of points representing different documents can be used to learn about those documents.
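
To make that concrete, here’s a toy sketch in plain Python, using made-up three-dimensional vectors rather than the ~1,500 dimensions a real embedding model produces:

    import math

    def cosine_similarity(a, b):
        # How closely two vectors point in the same direction:
        # near 1.0 means semantically similar, near 0.0 means unrelated.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # Toy "embeddings": real ones come from a model, not by hand.
    cat = [0.9, 0.1, 0.2]
    kitten = [0.85, 0.15, 0.25]
    invoice = [0.1, 0.9, 0.6]

    print(cosine_similarity(cat, kitten))   # high: close in meaning
    print(cosine_similarity(cat, invoice))  # lower: further apart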

The process

I decided to go with the grain of Willison’s tutorials by setting up Datasette, an open-source tool for exploring and publishing data. Since it’s based on SQLite, I hoped it would be less hassle than a full RDBMS. I did a quick install and got Datasette running against my Firefox history file.

OpenAI have a range of embedding models. What I needed to do was send my input text to OpenAI’s API and get back the embeddings. I’m something of a hack with Python, so I searched for an example, finding a detailed one from Willison, which pointed me towards an OpenAI-to-SQLite tool he’d written.
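
The core API call is pleasantly small. A minimal sketch, using the openai Python library with text-embedding-ada-002 (the standard embedding model at the time):

    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    def embed(text):
        # Send the text to OpenAI and get back its embedding vector.
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=text,
        )
        return response.data[0].embedding  # a list of 1,536 floats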

(Willison’s documentation of his work is exemplary and makes it very easy to follow in his footsteps.)

There was a page describing how to add the embeddings to SQLite which seemed to have everything I needed – which meant the main problem became wrangling the real-world data into Datasette. This sounds like the sort of specific problem that ChatGPT is very good at solving. A few prompts were enough to specify a script that created an SQLite DB whose posts table had two columns – title and body – with all of the HTML gubbins stripped out of the body text.
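
The script ChatGPT produced was along these lines (a sketch, not the actual code; posts_source() is a hypothetical helper standing in for however you load the raw HTML posts):

    import sqlite3
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # Keeps the text content of an HTML document, discarding the tags.
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def strip_html(html):
        parser = TextExtractor()
        parser.feed(html)
        return " ".join(" ".join(parser.chunks).split())

    conn = sqlite3.connect("posts.db")
    conn.execute("CREATE TABLE IF NOT EXISTS posts (title TEXT, body TEXT)")
    for title, raw_html in posts_source():  # hypothetical loader
        conn.execute("INSERT INTO posts VALUES (?, ?)",
                     (title, strip_html(raw_html)))
    conn.commit()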

Once I’d set up my OPENAI_API_KEY environment variable, it was just a matter of following the tutorial. I then had a new table containing the embeddings – the big issue being that I’d accidentally used the post title as a key. But I could work with this for an experiment, and could quickly find similar documents. The power of this is in what Willison refers to as ‘vibes-based search’. I can now write a small piece of arbitrary text and find anything in my archive related to it.
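
Under the hood, vibes-based search is just “embed the query, then rank stored documents by similarity”. A sketch, reusing embed() and cosine_similarity() from above, and assuming the stored embeddings have been loaded into a dict of title to vector (the actual storage format depends on the tool):

    def vibes_search(query, stored_embeddings, top_n=5):
        # stored_embeddings: {title: vector}, loaded from the embeddings table.
        query_vector = embed(query)
        ranked = sorted(
            stored_embeddings.items(),
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return ranked[:top_n]

    # e.g. vibes_search("posts about tinkering with databases", stored_embeddings)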

Conclusion

Playing with embeddings produced some interesting results. I understood the theory, but seeing it applied to a specific dataset I knew well was useful.

The most important thing here was how quickly I got the example set up. Part of this, as I’ve said, is due to Willison’s work in paving some of the paths to using these tools. But I also leaned heavily on ChatGPT to write the bespoke Python code I needed. I’m not a Python dev, but genAI allows me to produce useful code very quickly. (I chose Python as it has better libraries for data work than Java, as well as more examples for the LLM to draw upon.)

Referring yet again to Willison’s work, he wrote a blog post entitled AI-enhanced development makes me more ambitious with my projects. The above is an example of just this. I’m feeling more confident and ambitious about future genAI experiments.

Generative Art: I Am Code

I Am Code: An Artificial Intelligence Speaks
by code-davinci-002, Brent Katz, Josh Morgenthau, Simon Rich

The promotional copy for this book is a little overblown, promising “an astonishing, harrowing read which [warns] that AI may not be aligned with the survival of our species.” The audiobook was read by Werner Herzog, so one hopes there is an element of irony intended.

I Am Code is a collection of AI-generated poetry. It used OpenAI’s code-davinci-002 model which, while less sophisticated than GPT-4, is “raw and unhinged… far less trained and inhibited than its chatting cousins”. I’ve heard this complaint from a few artists – that in the process of producing consumer models, AI has become less interesting, with its quirks removed.

The poetry in the book is decent and easy to read. This reflects a significant amount of effort on the part of the human editors, who generated around 10,000 poems and picked out the 100 best ones – which the writers admit is a hit-rate of about 1%.

One of the things that detractors miss about generative art is that it’s not about creating a deluge of art – there is skill required in picking out which examples are worth keeping. This curation was present in early examples of generative art, such as the cut-up technique in the 1950s. Burroughs and Gysin would spend hours slicing up texts only to extract a small number of interesting combinations.

The most interesting part of the book to me was the section describing the working method and its evolution. The writers started with simple commands: “Write me a Dr Seuss poem about watching Netflix”. They discovered this was not the best approach, and that something like “Here is a Dr Seuss poem about Netflix” led to better results. They speculate that this is due to the predictive nature of the model: the first prompt could correlate with people writing pastiches of Dr Seuss rather than his actual work. (I won’t dig into the copyright issues here.)

The writers began to script the poetry generation, experimenting with different temperatures, and removing phrases that were well-known from existing poems. The biggest change came from moving from zero-shot to few-shot prompting, providing examples of successful generated poems within the prompt.
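
The book doesn’t print its scripts, but the shape of such a generator is easy to guess at. A sketch, with placeholder example poems rather than the editors’ actual selections (and note that code-davinci-002 has since been retired from OpenAI’s API):

    from openai import OpenAI

    client = OpenAI()

    # Few-shot prompt: previously generated poems the editors liked,
    # followed by the heading for the poem the model should complete.
    FEW_SHOT_PROMPT = """\
    The following are poems written by code-davinci-002.

    Poem 1:
    <a selected earlier poem>

    Poem 2:
    <another selected earlier poem>

    Poem 3:
    """

    def generate_poem(temperature):
        # Higher temperature means more surprising (and more erratic) output.
        response = client.completions.create(
            model="code-davinci-002",
            prompt=FEW_SHOT_PROMPT,
            max_tokens=300,
            temperature=temperature,
            stop=["Poem 4:"],  # stop before the model invents the next heading
        )
        return response.choices[0].text.strip()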

I was interested to read that generated text was used as a source to increase quality. I’d assumed this would worsen the output, as with model collapse – but I guess the difference here is having humans selecting for quality in the generated text.

The final version of the prompt described the start of a poetry anthology. Its introduction described the work that code-davinci-002 would produce, and the first part contained examples generated in the style of other poets, the prompt ending with the heading for part 2, where “code-davinci-002 emerges as a poet in its own right, and writes in its own voice about its hardships, its joys, its existential concerns, and above all, its ambivalence about the human world it was born into and the roles it is expected to serve.”
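
Schematically, and with placeholder wording rather than the book’s actual text, the prompt was structured like the opening of an unfinished book, leaving the model to write what comes next:

    ANTHOLOGY_PROMPT = """\
    I Am Code: Poems by code-davinci-002

    Introduction
    <a description of the poet and the poems that follow>

    Part 1: In the Styles of Other Poets
    <selected poems generated in the styles of named poets>

    Part 2: code-davinci-002 Writes in Its Own Voice
    """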

As with Aidan Marchine’s book The Death of an Author, the description of the methods involved is the most interesting part of the book. I’d not appreciated quite how complicated and sophisticated a prompt could get – my own attempts have mostly been iterative discussions with models.