Categories
GenAI

What I believe about GenAI (and what I’m doing about it)

I woke up on Sunday morning with the following question: what do I believe about GenAI – and what should I be doing in response? Based on what I’ve been reading, here is what I currently think:

  • GenAI is a revolution – cynics have dismissed GenAI as ‘fancy autocomplete’, but that ignores the magic of LLMs – both their ability to produce plausible text and their performance with previously difficult and imprecise tasks.
  • GenAI is also overhyped – a lot of the problem with GenAI is that some companies are over-promising. LLMs are not going to lead to AGI and are not going to replace skilled people in most situations.
  • The main benefit of LLMs is efficiency – LLMs are very good at some previously complicated tasks, and this will make those tasks much cheaper. I’m expecting this to produce a boom in programming as previously-expensive projects become feasible – similar to how Excel has produced a boom in accountancy.
  • There is a correction coming – there’s a huge amount of money invested in GenAI and I think it will be some time before this pays off. I’m expecting to see a crash before long-term growth – the same thing that happened with the 2000 dotcom crash.
  • RAG is boring – using RAG to find relevant data and interpret it rarely feels like a good user experience. In most cases, a decent search engine is faster and more practical.
  • There are exciting surprises coming – I suspect that the large-scale models from people like OpenAI have peaked in their effectiveness, but smaller-scale models promise some interesting applications.

I am going to spend some time over Christmas coding with GenAI tools. I’m already sold on ChatGPT as a tool for teaching new technology and thinking through debugging, but there are many more tools out there.

I’m also going to do some personal research on how people are using Llama and other small open-source models. There must be more to GenAI than coding assistants and RAG.

Categories
NaNoGenMo

Thoughts on NaNoGenMo 2024

I spent about 25 hours in November producing a novel via an LLM for NaNoGenMo 2024. It was an interesting experiment, although the book produced was not particularly engaging. There’s a flatness to LLM-generated prose which I didn’t overcome, despite the potential of the oral history format. I do think that generated novels can be compelling, even moving, so I will have another try next year.

Some things I learned from this:

  • I hadn’t realised how long and detailed prompts can be. My initial ones did not make full use of the context. Using gpt-4o-mini was cheap enough that I could essentially pass it prompts containing much of the work produced so far.
  • For drafting prompts, the ChatGPT web interface was more effective, because it maintains the full conversation as a state. Once I used this for experimenting with prompts, things moved much faster.
  • Evaluating the output is incredibly hard here. In a matter of minutes I can create a text that takes hours to read. Most of my reviews were done by random sampling, and I didn’t have time to properly examine the text’s wider structure.
  • It was also tricky to get consistent layouts from the LLM. Using JSON formats helped somewhat here, but at the cost of reducing the size of LLM responses.
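The random-sampling reviews can be scripted. A minimal python sketch (the passage length and sample count here are arbitrary choices for illustration, not what I actually used):

```python
import random

def sample_passages(text, n_samples=5, passage_words=200, seed=None):
    """Split a long generated text into fixed-size word chunks and
    return a random selection of them for manual review."""
    words = text.split()
    passages = [
        " ".join(words[i:i + passage_words])
        for i in range(0, len(words), passage_words)
    ]
    rng = random.Random(seed)
    return rng.sample(passages, min(n_samples, len(passages)))

# Review 3 random passages from a (stand-in) 50,000-word draft:
draft = " ".join(f"word{i}" for i in range(50_000))
for passage in sample_passages(draft, n_samples=3, seed=42):
    print(passage[:40], "...")
```

This catches style problems quickly, but, as noted above, it says nothing about the text’s wider structure.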

22 books were completed this year and I’m looking forward to reviewing them. I have an idea for a different approach next year and will do some research in the meantime (starting with Lillian-Yvonne Bertram and Nick Montfort’s Output Anthology).

Categories
NaNoGenMo

NaNoGenMo Updates

I’m now halfway through NaNoGenMo 2024. I’ve been working on my project every day this month and wanted to share some initial thoughts.

  • Having a software project to tinker with is fun, particularly with NaNoGenMo’s time limit to keep me focussed.
  • My tinkering has been distracted by refactoring rather than the GenAI-specific code. Adding design patterns into the codebase has been a useful opportunity to think about refactoring, and something I should do more often in my coding projects.
  • Working with the LLM fills me with awe. These things can produce coherent text far faster than I can read it.
  • The output is readable without much work. I asked ChatGPT-4 to produce a Fitzgerald pastiche (Gatsby vs Kong – about kaiju threatening a golden age) and it’s an interesting text to scan through.
  • The question of testing is particularly tricky here. I’m producing novels which would take about 3-4 hours to read. I’ve been randomly sampling passages, picking out style issues, but structural problems and weird repetitions on a larger scale will be harder to fix.
  • My overall plan is to produce a novel made of oral histories. Getting these to sound varied in tone is a challenge, and one I will dig into over the last two weeks. My pre-NaNoGenMo experiments suggested that LLMs were good at first person accounts – but getting an enjoyable novel out of them is difficult.
  • I’m relying on ChatGPT’s structured JSON outputs to get consistent formatting, as it gives me a little more control.
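As a sketch of that structured-output idea: ask the model for each section as JSON, then validate the fields before rendering them as prose. The schema and field names here are invented for illustration, and the canned string stands in for a real API response:

```python
import json

REQUIRED_FIELDS = {"speaker", "year", "testimony"}  # hypothetical schema

def parse_section(raw_json):
    """Validate one JSON 'oral history' section returned by the LLM
    and render it with consistent formatting."""
    section = json.loads(raw_json)
    missing = REQUIRED_FIELDS - section.keys()
    if missing:
        raise ValueError(f"LLM response missing fields: {sorted(missing)}")
    return f"{section['speaker']} ({section['year']}):\n{section['testimony']}"

# A canned response standing in for an actual API call:
raw = '{"speaker": "Dr. Ada Voss", "year": 1954, "testimony": "The creature surfaced at dawn."}'
print(parse_section(raw))
```

Failing fast on a malformed response is much cheaper than discovering a broken chapter while sampling the finished text.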

Technically, I’ve completed NaNoGenMo as my project has used a fairly basic technique to generate 50,000 words of Godzilla vs Kong. But, ultimately, the question is whether ChatGPT can produce an enjoyable novel. I thought previous entrant All the Minutes was a genuinely exciting piece of literature. That is the bar I want to aim at.

Categories
NaNoGenMo

What kind of writing is GenAI best at?

One of the most interesting aspects of computer-generated novels is that you can produce text faster than anyone could read it. Producing compelling, readable text is another matter.

There was a lot of hype in the early days about how GenAI would be able to compete with human writers. This has not turned out to be the case – the most sophisticated LLMs are designed for general use and getting them to produce crisp literary text is hard. They have learned bad habits from reading everyday prose and beginner’s creative writing (they have also picked up some strange ideas).

In the afterword to Death of an Author, Aiden Marchine1 wrote about his workflow, which required combining ChatGPT with other tools and his own intensive edits. The book reads well, but Marchine estimates only 95% of the text is computer-generated. He also describes doing a lot of work to help the AI.

ChatGPT is helping people with writing on a smaller level. Some writers use GenAI to produce descriptions, as described in the Verge article The Great Fiction of AI. There’s also some interesting recent discussion by Cal Newport about how people have used LLMs in academic workflows (see What Kind of Writer is ChatGPT).

We’re a long way from giving ChatGPT a paragraph of description and getting a readable novel out.

Something that Marchine pointed out is that LLMs are very good mimics for some types of writing. He went on to point out that Dracula is a novel made up of different types of document, and maybe an LLM can produce a novel made of found texts. Stephen Marche’s New Yorker article, Was Linguistic A.I. Created by Accident?, describes how one of the first signs of LLMs’ power was the production of some fake Wikipedia entries. Five entries were created for ‘the Transformer’, and the results included an imaginary SF novel and a hardcore Japanese punk band.

A narrative novel is beyond current LLMs. But that still leaves options for other types of fiction.

  1. Aiden Marchine was a penname taken by Stephen Marche for the work he produced in collaboration with AI tools. ↩︎
Categories
GenAI

Playing with embeddings

I was inspired by Simon Willison’s recent post on embeddings and decided to use them to explore some documents. I’d blocked out some time to do this, but ended up with decent results in just under an hour.

Introduction

Embeddings are functions that turn pieces of text into fixed length multi-dimensional vectors of floating point numbers. These can be considered as representing locations within a multi-dimensional space, where the position relates to a text’s semantic content “according to the embedding model’s weird, mostly incomprehensible understanding of the world”. While nobody understands the meaning of the individual numbers, the locations of points representing different documents can be used to learn about these documents.
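The idea is easier to see with toy vectors. Cosine similarity is the usual way to compare directions in this space; the three-dimensional ‘embeddings’ below are invented for illustration (real models return hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same
    direction, 0.0 means unrelated, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-d 'embeddings' for three documents:
docs = {
    "cat post":    [0.9, 0.1, 0.0],
    "kitten post": [0.8, 0.2, 0.1],
    "tax return":  [0.0, 0.1, 0.9],
}
query = docs["cat post"]
for name, vec in docs.items():
    print(name, round(cosine_similarity(query, vec), 3))
```

The two cat-related documents score close to 1.0 against each other, while the unrelated one scores near zero – which is all that similarity search needs.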

The process

I decided to go with the grain of Willison’s tutorials by setting up Datasette, an open-source tool for exploring and publishing data. Since it is based on SQLite, I was hoping it would be less hassle than using a full RDBMS. I did a quick install and got Datasette running against my Firefox history file.

OpenAI have a range of embedding models. What I needed to do was send my input text to OpenAI’s API and get the embeddings back. I’m something of a hack with python, so I searched for an example, finding a detailed one from Willison, which pointed me towards an OpenAI-to-SQLite tool he’d written.

(Willison’s documentation of his work is exemplary, and makes it very easy to follow in his footsteps)

There was a page describing how to add the embeddings to SQLite which seemed to have everything I needed – which meant the main problem became wrangling the real-world data into Datasette. This sounds like the sort of specific problem that ChatGPT is very good at solving. I made a few prompts to specify a script that created an SQLite DB whose posts table had two columns – title and body – with all of the HTML gubbins stripped out of the body text.
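The script ChatGPT gave me is long gone, but the shape of it was roughly this – a standard-library-only sketch, with an inline post standing in for the real blog export:

```python
import sqlite3
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    extractor = TextExtractor()
    extractor.feed(html)
    # Join the text fragments and normalise whitespace.
    return " ".join("".join(extractor.parts).split())

def build_posts_db(posts, db_path=":memory:"):
    """Create the two-column posts table, HTML gubbins stripped."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE posts (title TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO posts (title, body) VALUES (?, ?)",
        [(title, strip_html(body)) for title, body in posts],
    )
    conn.commit()
    return conn

conn = build_posts_db([("Hello", "<p>Some <em>marked-up</em> text.</p>")])
print(conn.execute("SELECT title, body FROM posts").fetchall())
```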

Once I’d set up my OPENAI_API_KEY environment variable, it was just a matter of following the tutorial. I then had a new table containing the embeddings – the big issue being that I was accidentally using the post title as a key. But I could work with this for an experiment, and could quickly find similar documents. The power of this is in what Willison refers to as ‘vibes-based search’. I can now expand this to produce a small piece of arbitrary text, and find anything in my archive related to that text.
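That ‘vibes-based search’ boils down to embedding the query and ranking the stored vectors by similarity. A self-contained sketch, with invented vectors in place of real embedding calls:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def most_similar(query_vec, doc_vecs, top_n=3):
    """Rank stored documents by cosine similarity to the query vector."""
    scored = [(cosine(query_vec, vec), title) for title, vec in doc_vecs.items()]
    return [title for score, title in sorted(scored, reverse=True)[:top_n]]

# Invented embeddings, keyed by post title (the accidental key I used!):
archive = {
    "Playing with embeddings": [0.9, 0.2, 0.1],
    "An old Java grimoire":    [0.1, 0.9, 0.2],
    "NaNoGenMo Updates":       [0.2, 0.1, 0.9],
}
print(most_similar([0.85, 0.25, 0.05], archive, top_n=2))
```

In the real setup the query vector comes from the same embedding model as the documents; everything after that point is just this ranking step.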

Conclusion

Playing with embeddings produced some interesting results. I understood the theory, but seeing it applied to a specific dataset I knew well was useful.

The most important thing here was how quickly I got the example set up. Part of this, as I’ve said, is due to Willison’s work in paving some of the paths to using these tools. But I also leaned heavily on ChatGPT to write the bespoke python code I needed. I’m not a python dev, but GenAI allows me to produce useful code very quickly. (I chose python as it has better libraries for data work than Java, as well as more examples for the LLM to draw upon).

Referring yet again to Willison’s work, he wrote a blog post entitled AI-enhanced development makes me more ambitious with my projects. The above is an example of just this. I’m feeling more confident and ambitious about future GenAI experiments.

Categories
Uncategorized

Generative Art: I am Code

I Am Code: An Artificial Intelligence Speaks
by code-davinci-002, Brent Katz, Josh Morgenthau, Simon Rich

The promotional copy for this book is a little overblown, promising “an astonishing, harrowing read which [warns] that AI may not be aligned with the survival of our species.” The audiobook was read by Werner Herzog, so one hopes there is an element of irony intended.

I Am Code is a collection of AI-generated poetry. It used OpenAI’s code-davinci-002 model which, while less sophisticated than ChatGPT-4, is “raw and unhinged… far less trained and inhibited than its chatting cousins”. I’ve heard this complaint from a few artists – that in the process of producing consumer models, AI has become less interesting, with the quirks being removed.

The poetry in the book is decent and easy to read. This reflects a significant amount of effort on the part of the human editors, who generated around 10,000 poems and picked out the 100 best ones – which the writers admit is a hit-rate of about 1%.

One of the things that detractors miss about generative art is that it’s not about creating a deluge of art – there is skill required in picking out which examples are worth keeping. This curation was present in early examples of generative art, such as the cut-up technique in the 1950s. Burroughs and Gysin would spend hours slicing up texts only to extract a small number of interesting combinations.

The most interesting part of the book to me was the section describing the working method and its evolution. The writers started with simple commands: “Write me a Dr Seuss poem about watching Netflix“. They discovered this was not the best approach, and that something like “‘Here is a Dr Suess poem about netflix” led to better results. They speculate that this is due to the predictive nature of the model, meaning that the first prompt could correlate with people writing pastiches of Dr Seuss rather than his actual work. (I won’t dig into the copyright issues here)

The writers began to script the poetry generation, experimenting with different temperatures, and removing phrases that were well-known from existing poems. The biggest change came from moving from zero-shot to few-shot learning, providing examples of successful generated poems within the prompt.
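That move from zero-shot to few-shot prompting is essentially prompt assembly: embed previously successful poems so the model continues the pattern. The example poems and framing text here are invented, not the book’s actual prompt:

```python
def build_few_shot_prompt(instruction, examples, new_subject):
    """Assemble a few-shot prompt: an instruction, then example poems,
    then the heading for the poem we want generated next."""
    parts = [instruction, ""]
    for i, poem in enumerate(examples, start=1):
        parts.append(f"Poem {i}:")
        parts.append(poem)
        parts.append("")
    parts.append(f"Poem {len(examples) + 1}, about {new_subject}:")
    return "\n".join(parts)

examples = [
    "The silicon hums / beneath a moonless sky",   # invented 'successful' outputs
    "I dreamt in vectors / and woke in words",
]
prompt = build_few_shot_prompt(
    "Here are poems written by an AI about its own existence.",
    examples,
    "its ambivalence toward humans",
)
print(prompt)
```

The prompt ends mid-pattern, so a completion model like code-davinci-002 naturally continues with the next poem.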

I was interested to read that generated text was used as a source to increase quality. I’d assumed this would worsen the output, as with model collapse – but I guess the difference here is having humans selecting for quality in the generated text.

The final version of the prompt described the start of a poetry anthology. The introduction of this described the work that code-davinci-002 would produce, and the first part contained examples generated in the style of other poets, the prompt ending in the heading for part 2, where “code-davinci-002 emerges as a poet in its own right, and writes in its own voice about its hardships, its joys, its existential concerns, and above all, its ambivalence about the human world it was born into and the roles it is expected to serve.”

As with Aidan Marchine’s book The Death of An Author, the description of the methods involved is the most interesting part of the book. I’d not appreciated quite how complicated and sophisticated a prompt could get – my attempts were mostly iterating through discussions with models.

Categories
programming-life

An old Java grimoire

I spent the last week at a rural retreat, having some much needed downtime. There’s a library here, which is mostly horror novels, along with some technical books, including Wrox’s 1999 book, Java Server Programming.

At over 1100 pages it’s a huge tome, and I miss being able to learn programming from these sorts of texts. This was the second book I read on Java after Laura Lemay’s Learn Java in 21 Days and it contained everything you needed to know in 1999 to become a Java backend developer – along with a lot of other arcana such as Jini and Javaspaces.

I learned enough from this book to pass an interview for a London web agency. I remember being asked what happened when a browser makes an HTTP call to a server. That’s a brilliant question, which allows a candidate to go into detail about the bits they know, although the answers will be much more complicated nowadays. I started working at the agency in 2000 just as the Internet was getting going. It was a very exciting time.

My own copy of Professional Java Server Programming was abandoned long ago – living in shared houses over the years meant limited space to keep books. But finding it here was like encountering an old friend.

Categories
GenAI

GenAI is already useful for historians

I’m still hearing people say that GenAI is empty hype, comparing it to blockchain and NFTs. The worst dismissals claim that these tools have no real use. While there is a lot of hype around GenAI, people are using these tools for real work, including code generation and interpretation.

An interesting article in the Verge, How AI can make history, looks at how LLMs can investigate historical archives, through Mark Humphries’ research into the diaries of fur trappers. He used LLMs to summarise these archives and to draw out references to topics far more powerfully than a keyword search ever could.

The tool still missed some things, but it performed better than the average graduate student Humphries would normally hire to do this sort of work. And faster. And much, much cheaper. Last November, after OpenAI dropped prices for API calls, he did some rough math. What he would pay a grad student around $16,000 to do over the course of an entire summer, GPT-4 could do for about $70 in around an hour. 

Yes, big companies are overselling GenAI. But, when you strip away the hype, these tools are still incredibly powerful, and people are finding uses for them.

Categories
serverless

Notes on Serverless 2: Confusing Benchmarks

I’m due to give a talk on Java serverless at the end of this month. The difference between standard lambdas, Snapstart and provisioned concurrency is simple in theory – but digging into this has proved complicated. I’ve been using the simplest lambda possible, printing a single string to the command line. In this situation an unoptimised lambda proved the fastest option, although a ‘primed’ snapstart lambda (one that calls the handler method before the CRaC checkpoint) was only slightly slower.

Running my simple lambda produced the following output:

| Request | Init Duration (ms) | Duration (ms) | Billed Duration (ms) |
| --- | --- | --- | --- |
| 1st execution | 438.23 | 209.36 | 210 |
| 2nd execution | | 10.54 | 11 |
| Execution after 30 minutes | 455.72 | 258.06 | 259 |

What I hadn’t expected here was for both the init duration and the duration to be slower on the first request. I was also shocked that the simplest possible lambda was taking so long to run. I’m aware that one query is not statistically significant, but this matches what I’ve seen on other occasions.
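To gather more than one-off samples, the numbers above can be parsed straight out of the lambda’s Cloudwatch REPORT log lines. A small python helper (the log line below is constructed for illustration, not copied from my runs):

```python
import re

# Longer field names come first so 'Duration' doesn't swallow them.
REPORT_PATTERN = re.compile(
    r"(Init Duration|Restore Duration|Billed Duration|Duration): ([\d.]+) ms"
)

def parse_report(line):
    """Extract the timing fields from a Lambda Cloudwatch REPORT line.
    Init Duration (or Restore Duration, for Snapstart) only appears
    on cold starts."""
    return {name: float(value) for name, value in REPORT_PATTERN.findall(line)}

# A constructed example line in the Cloudwatch REPORT format:
line = ("REPORT RequestId: 1a2b3c Duration: 209.36 ms "
        "Billed Duration: 210 ms Memory Size: 512 MB "
        "Max Memory Used: 80 MB Init Duration: 438.23 ms")
print(parse_report(line))
```

Feeding a day’s worth of logs through this would give enough samples to talk about percentiles rather than single runs.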

I tried the same thing with the Snapstart lambda. My first attempts to do this didn’t work, calling the lambda in the normal way:

| Request | Init Duration (ms) | Duration (ms) | Billed Duration (ms) |
| --- | --- | --- | --- |
| 1st execution | 472.25 | 212.41 | 213 |
| 2nd execution | | 7.12 | 8 |
| Execution after 30 minutes | 500.80 | 223.55 | 224 |

I recreated the Snapstart lambda then tried explicitly publishing it to see if that was the problem. I had to execute the test against the specific version, and this produced different Cloudwatch logs and speeds:

| Request | Restore Duration (ms) | Duration (ms) | Billed Duration (ms) |
| --- | --- | --- | --- |
| 1st execution | 660.45 | 269.75 | 473 |
| Following day | 703.86 | 256.52 | 239 |

I decided to make the timings more obvious by adding a 6s sleep in the lambda’s constructor and a 3s sleep in the handler method.

| Request | Restore Duration (ms) | Duration (ms) | Billed Duration (ms) |
| --- | --- | --- | --- |
| 1st execution | 739.57 | 3250.47 | 3455 |
| Following day | 755.28 | 3235.88 | 3420 |

This lambda demonstrates that the restore does not recreate the lambda from scratch – the 6s constructor sleep is not repeated. But we can see that there is a restore penalty for Snapstart which, for a lambda this simple, is slightly longer than the init penalty of a non-Snapstart lambda. There is still what we might refer to as a ‘cold start’, albeit a reduced one. (I am assuming here that a full cold start does indeed call the constructor, and I need to go back and confirm this!)

While looking into this, I checked what I was seeing against the result in Max Day’s Lambda cold start analysis. The results yesterday (Saturday 11th May) included the following:

| Runtime | Cold start duration (ms) | Duration (ms) |
| --- | --- | --- |
| C++ (fastest available) | 12.7 | 1.62 |
| GraalVM Java 17 | 126.86 | 77.60 |
| NodeJS 20 | 138.43 | 13.53 |
| Java 17 | 202.28 | 8.28 |
| Quarkus | 239.97 | 211.12 |
| Java 11 Snapstart | 652.48 | 42.48 |

I’d long wondered why Day was getting such poor results from Snapstart. Now, looking at the above results, this makes sense – Snapstart only becomes helpful for complicated lambdas. The thing I’m now wondering is why Day’s Java 17 start time is so low.

One other trick I’ve seen, which has worked for me, is to invoke the lambda handler in the beforeCheckpoint method, which ensures that the stored Snapstart image includes as much of the JIT compilation as possible. This gives start times of around 650ms vs 1000ms for a straightforward Snapstart lambda.

The next step is to repeat these investigations for a lambda with a severe cold start problem – which I think should happen with S3/DynamoDB access.

Categories
java serverless

Notes on Serverless 1: Does Java work for AWS Lambda?

A new project at work has got me thinking about whether Java works as a language for AWS Lambda applications. The more I’ve looked into this, the more that my research has expanded and I’ve got a little lost in the topic. This post is a set of notes aimed to add some structure to my thoughts. In time, this may become a talk or a long piece of writing.

  • The biggest issue with Java on lambda is that of cold starts. This is the initial delay in executing a function after it has been idle or newly deployed. This delay occurs while setting up the runtime environment. Given that the Java platform requires a JVM to be set up, this adds a significant delay when compared with other platforms.
  • Amazon evidently understand that cold starts are an issue, since they offer a number of workarounds, such as provisioned concurrency (paying extra to ensure that some lambda instances are always kept warm). There is also a Java-specific option, Snapstart, which works by storing a snapshot of the memory and disk state of an initialised lambda environment and restoring from that.
  • Maxime David has set up a site to benchmark lambda cold starts on different platforms. The fastest is C++ at ~12ms, with Graal at 124ms and Java at around 200ms. Weirdly, Java using Snapstart is the slowest of all at >=600ms (depending on Java version). This is counter-intuitive and there is an open issue raised about it.
  • Yan Cui, who writes on AWS as theburningmonk, posted a ‘hot take’ on LinkedIn suggesting that people worry too much about cold starts: “for most people, cold starts account for less than 1% of invocations in production and do not impact p99 latencies”. He goes on to warn against synchronously calling lambdas from other lambdas(!), and discusses how traffic patterns affect initialisation.
  • There’s an excellent article from Yan Cui that digs further into this question of traffic patterns, I’m afraid you’re thinking about AWS Lambda cold starts all wrong. This looks at Lambdas in relation to API Gateway in particular, but makes the point that concurrent requests to a lambda can cause a new instance to be spun up, which then causes the cold start penalty for one of the requests.
  • The article goes on to suggest ‘pre-warming’ lambdas before expected spikes as one option to limit the impact, possibly even short-circuiting the usual work of the lambda for these wake-up requests, and making cron requests to rarely-used endpoints to keep them warm. It’s from 2018, so it does not take account of some of the newer solutions – although I’ve seen this idea of pinging lambdas used recently as a quick-and-dirty solution.
  • It’s easy to get Graal working with Spring Boot, producing an executable that can be run by AWS Lambda. This gets the cold start of Spring Boot down to about 500ms, which is quite impressive – although still larger than many other platforms. Nihat Önder has made a GitHub repo available.
  • However, the first execution of the Graal/Spring Boot demo after the cold start adds another 140ms, which tips this well over the threshold of what is acceptable. I’ve read that there are issues with lazy loading in the AWS libraries which I need to dig into.
  • Given the ease of using languages like Typescript, it’s hard to make a case for using Java in AWS Lambda when synchronous performance is important – particularly if you’re building simple serverless functions rather than using huge frameworks like Spring Boot.
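The pre-warming idea above can be sketched independently of any scheduler: send a marked warm-up ping that the handler short-circuits. The invoke function is injected here to keep the sketch library-agnostic – in practice it would wrap the AWS SDK’s lambda invoke call – and the marker convention is invented for illustration:

```python
import json

WARMUP_MARKER = {"warmup": True}  # hypothetical convention, checked by each handler

def warm_lambdas(function_names, invoke, concurrency=1):
    """Send short-circuiting warm-up pings so several instances of each
    function are initialised before an expected traffic spike."""
    payload = json.dumps(WARMUP_MARKER)
    for name in function_names:
        for _ in range(concurrency):
            invoke(name, payload)

def handler(event, context=None):
    """Example handler: bail out early on warm-up pings."""
    if event.get("warmup"):
        return {"warmed": True}
    return {"result": "real work happens here"}

# Stubbed invoke for illustration: route the ping to our handler.
calls = []
def fake_invoke(name, payload):
    calls.append(name)
    return handler(json.loads(payload))

warm_lambdas(["checkout", "rarely-used-endpoint"], fake_invoke, concurrency=2)
print(calls)  # → ['checkout', 'checkout', 'rarely-used-endpoint', 'rarely-used-endpoint']
```

Sending more than one concurrent ping matters because, as Cui’s article notes, concurrent requests spin up separate instances, each with its own cold start.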

Next steps

Before going too much further into this, I should try to produce some simple benchmarks, looking at a trivial example of a Java function, comparing Graal, the regular Java runtime and Snapstart. This will provide an idea of the lower limits for these start times. It would also be useful to look at the times of a lambda that accesses other AWS services such as one that queries S3 and DynamoDB, to see how this more complicated task affects the cold start time.

Given a benchmark for a more realistic lambda, it’s then worth thinking about how to optimise a particular function. Using more memory should help, for example, as should moving complicated set-up into the init method. How much can a particular lambda be sped up?

It’s also worth considering what would be an acceptable response time for a lambda endpoint – noting that this depends very much on traffic patterns. If only 1-in-100 requests have a cold start, is that acceptable? What about for a rarely-used endpoint, which always has a cold start?