Notes on Serverless 2: Confusing Benchmarks

I’m due to give a talk on Java serverless at the end of this month. The difference between standard lambdas, Snapstart and provisioned concurrency is simple in theory – but digging into this has proved complicated. I’ve been using the simplest lambda possible, printing a single string to the command line. In this situation an unoptimised lambda proved the fastest option, although a ‘primed’ snapstart lambda (one that calls the handler method before the CRaC checkpoint) was only slightly slower.
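For reference, a handler along the following lines is about as simple as a Java lambda gets. This is a sketch rather than the exact code behind the numbers below: the class name, the message and the use of a plain Map for the event are all assumptions, and it relies on my understanding that Lambda can be pointed at a plain public method (with a handler string such as HelloHandler::handleRequest) without implementing the AWS interfaces.

```java
import java.util.Map;

// Minimal "print one string" handler. All names here are hypothetical.
public class HelloHandler {

    // Lambda creates the handler object once per execution environment,
    // so anything in this constructor runs during the init phase.
    public HelloHandler() {
    }

    // Called once per invocation; the event is ignored in this sketch.
    public String handleRequest(Map<String, Object> event) {
        String message = "Hello from Lambda";
        System.out.println(message); // standard output ends up in the logs
        return message;
    }
}
```

Because nothing here touches the network or loads a framework, almost all of the time measured for it is JVM and runtime overhead.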

Running my simple lambda produced the following output:

Request | Init Duration (ms) | Duration (ms) | Billed Duration (ms)
1st execution | 438.23 | 209.36 | 210
2nd execution | - | 10.54 | 11
Execution after 30 minutes | 455.72 | 258.06 | 259

What I hadn’t expected here was for the first request to pay twice: an init duration on top of a much slower handler duration. I was also shocked that the simplest lambda possible was taking so long to run. I’m aware that a single run is not statistically significant, but this matches what I’ve seen on other occasions.

I tried the same thing with the Snapstart lambda. My first attempts, calling the lambda in the normal way, didn’t work, and the logs still reported an init duration rather than a restore duration:

Request | Init Duration (ms) | Duration (ms) | Billed Duration (ms)
1st execution | 472.25 | 212.41 | 213
2nd execution | - | 7.12 | 8
Execution after 30 minutes | 500.80 | 223.55 | 224

I recreated the Snapstart lambda, then tried explicitly publishing a version to see if that was what was wrong. I had to run the test against that specific version, and this produced different CloudWatch logs and timings:

Request | Restore Duration (ms) | Duration (ms) | Billed Duration (ms)
1st execution | 660.45 | 269.75 | 473
Following day | 703.86 | 256.52 | 239

I decided to make the timings more obvious by adding a 6s sleep in the lambda’s constructor and a 3s sleep in the handler method.

Request | Restore Duration (ms) | Duration (ms) | Billed Duration (ms)
1st execution | 739.57 | 3250.47 | 3455
Following day | 755.28 | 3235.88 | 3420

This lambda demonstrates that restoring from the snapshot does not recreate the lambda: the restore duration barely moves despite the 6s sleep in the constructor. We can also see that there is a restore penalty for snapstart, which is slightly longer than the init time of a non-snapstart lambda when the lambda is simple. There is still what we might refer to as a ‘cold start’, albeit a reduced one. (I am assuming here that the cold start does indeed call the constructor, and I need to go back and confirm this!)
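The sleep experiment could be sketched roughly as follows. This is an illustration rather than the code that produced the table: the class name is hypothetical, and the second constructor exists only so the delays can be shortened when running the class locally.

```java
import java.util.Map;

// Handler with deliberate delays: 6s at construction, 3s per invocation.
public class SleepyHandler {
    private final long handlerSleepMillis;

    // No-argument constructor, the one Lambda would use.
    public SleepyHandler() {
        this(6_000, 3_000);
    }

    // Extra constructor (an assumption for local runs) with shorter delays.
    public SleepyHandler(long constructorSleepMillis, long handlerSleepMillis) {
        this.handlerSleepMillis = handlerSleepMillis;
        pause(constructorSleepMillis); // paid once, when the object is created
    }

    public String handleRequest(Map<String, Object> event) {
        pause(handlerSleepMillis); // paid on every invocation
        return "done";
    }

    // Sleeps without forcing callers to handle a checked exception.
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sleeping", e);
        }
    }
}
```

If the constructor ran again on restore, the restore duration would be expected to grow by about six seconds. In the table it stays under a second, while the handler duration picks up the 3s sleep.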

While looking into this, I checked what I was seeing against the results in Max Day’s Lambda cold start analysis. The results yesterday (Saturday 11th May) included the following:

Runtime | Cold start Duration (ms) | Duration (ms)
C++ (fastest available) | 12.7 | 1.62
GraalVM Java 17 | 126.86 | 77.60
NodeJS 20 | 138.43 | 13.53
Java 17 | 202.28 | 8.28
Java 11 Snapstart | 652.48 | 42.48

I’d long wondered why Day was getting such poor results from Snapstart. Now, looking at the above results, this makes sense – Snapstart only becomes helpful for complicated lambdas. The thing I’m now wondering is how come Day’s Java 17 start time is so low.

One other trick I’ve seen, which has worked for me, is to invoke the lambda handler in the beforeCheckpoint method, which ensures that the stored Snapstart image includes as much of the JIT compilation as possible. This seems to work, with start times of around 650ms vs 1000ms for a straightforward Snapstart lambda.
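The priming trick can be sketched like this. To keep the example compilable without any libraries, CheckpointHook is a stand-in I have made up for the real org.crac.Resource interface, whose beforeCheckpoint and afterRestore methods take a Context argument. With the real library the handler would also register itself, for example via Core.getGlobalContext().register(this). The class name and the dummy event are hypothetical.

```java
import java.util.Map;

// Stand-in for org.crac.Resource, reduced to the single hook used here.
interface CheckpointHook {
    void beforeCheckpoint();
}

public class PrimedHandler implements CheckpointHook {
    private int handlerCalls = 0;

    public PrimedHandler() {
        // With the real CRaC library, registration with the global
        // context would happen here so the hook is called before the snapshot.
    }

    // Runs before the snapshot is taken: call the handler once with a dummy
    // event so the work done by a first invocation happens before the checkpoint.
    @Override
    public void beforeCheckpoint() {
        handleRequest(Map.of("priming", true));
    }

    public String handleRequest(Map<String, Object> event) {
        handlerCalls++;
        String message = "Hello from Lambda";
        System.out.println(message);
        return message;
    }

    // Exposed only so the effect of priming can be observed.
    public int handlerCalls() {
        return handlerCalls;
    }
}
```

One thing to watch with this approach is that the priming call runs real handler code, so a handler with side effects (writes, outgoing messages) would need a way to recognise and skip them for the dummy event.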

The next step is to repeat these investigations for a lambda with a severe cold start problem – which I think should happen with S3/DynamoDB access.


Notes on Serverless 1: Does Java work for AWS Lambda?

A new project at work has got me thinking about whether Java works as a language for AWS Lambda applications. The more I’ve looked into this, the more my research has expanded, and I’ve got a little lost in the topic. This post is a set of notes intended to add some structure to my thoughts. In time, this may become a talk or a long piece of writing.

  • The biggest issue with Java on lambda is that of cold starts. This is the initial delay in executing a function after it has been idle or newly deployed. This delay occurs while setting up the runtime environment. Given that the Java platform requires a JVM to be set up, this adds a significant delay when compared with other platforms.
  • Amazon evidently understand that cold starts are an issue, since they offer a number of workarounds, such as provisioned concurrency (paying extra to ensure that some lambda instances are always kept warm). There is also a Java-specific option, Snapstart, which works by storing a snapshot of the memory and disk state of an initialised lambda environment and restoring from that.
  • Maxime David has set up a site to benchmark lambda cold starts on different platforms. The fastest is C++ at ~12ms, with Graal at 124ms and Java at around 200ms. Weirdly, Java using Snapstart is the slowest of all at 600ms or more (depending on Java version). This is counter-intuitive and there is an open issue raised about it.
  • Yan Cui, who writes on AWS as theburningmonk, posted a ‘hot take’ on LinkedIn suggesting that people worry too much about cold starts: “for most people, cold starts account for less than 1% of invocations in production and do not impact p99 latencies“. He goes on to warn against synchronously calling lambdas from other lambdas(!), and discusses how traffic patterns affect initialisation.
  • There’s an excellent article from Yan Cui that digs further into this question of traffic patterns, I’m afraid you’re thinking about AWS Lambda cold starts all wrong. This looks at Lambdas in relation to API Gateway in particular, but makes the point that concurrent requests to a lambda can cause a new instance to be spun up, which then causes the cold start penalty for one of the requests.
  • The article goes on to suggest ‘pre-warming’ lambdas before expected spikes as one option to limit the impact, possibly even short-circuiting the usual work of that lambda for these wake-up requests. It also suggests using cron to make requests to rarely-used endpoints to keep them warm. The article is from 2018, so does not take account of some of the newer solutions – although I’ve seen this idea of pinging lambdas used recently as a quick-and-dirty solution.
  • It’s easy to get Graal working with Spring Boot, producing an executable that can be run by AWS lambda. This gets the cold start of Spring Boot down to about 500ms, which is quite impressive – although still larger than many other platforms. Nihat Önder has made a GitHub repo available.
  • However, the first execution of the Graal/Spring Boot demo after the cold start adds another 140ms, which tips this well over the threshold of what is acceptable. I’ve read that there are issues with lazy loading in the AWS libraries which I need to dig into.
  • Given the ease of using languages like Typescript, it’s hard to make a case for using Java in AWS Lambda when synchronous performance is important – particularly if you’re building simple serverless functions rather than using huge frameworks like Spring Boot.
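The ‘pre-warming’ idea from the list above, where a wake-up request short-circuits the lambda’s usual work, might look something like this. It is only a sketch: the "warmup" key is a convention I have invented for the example, and doRealWork stands in for whatever the lambda normally does.

```java
import java.util.Map;

public class WarmableHandler {
    // Hypothetical marker that a scheduled keep-warm ping would include.
    static final String WARMUP_KEY = "warmup";

    public String handleRequest(Map<String, Object> event) {
        // A ping only needs the environment to exist, so return immediately.
        if (event != null && Boolean.TRUE.equals(event.get(WARMUP_KEY))) {
            return "warm";
        }
        return doRealWork(event);
    }

    // Placeholder for the normal request handling.
    private String doRealWork(Map<String, Object> event) {
        int fields = (event == null) ? 0 : event.size();
        return "processed " + fields + " field(s)";
    }
}
```

A scheduled rule sending an event containing the marker would keep one instance warm. It does nothing for the extra instances spun up by concurrent requests, which is the case Yan Cui’s article focuses on.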

Next steps

Before going too much further into this, I should try to produce some simple benchmarks, looking at a trivial example of a Java function, comparing Graal, the regular Java runtime and Snapstart. This will provide an idea of the lower limits for these start times. It would also be useful to look at the times of a lambda that accesses other AWS services such as one that queries S3 and DynamoDB, to see how this more complicated task affects the cold start time.

Given a benchmark for a more realistic lambda, it’s then worth thinking about how to optimise a particular function. Using more memory should help, for example, as should moving complicated set-up into the init method. How much can a particular lambda be sped up?
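As an illustration of moving set-up into initialisation, the sketch below does its (pretend) expensive work in the constructor, which runs once per execution environment, and leaves the handler method with only per-request work. The names are hypothetical and loadConfig stands in for something genuinely slow, such as creating SDK clients.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class InitOnceHandler {
    // Counts how often the set-up code runs, purely for demonstration.
    static final AtomicInteger SETUP_RUNS = new AtomicInteger();

    private final Map<String, String> config;

    // Runs during the init phase, once per execution environment.
    public InitOnceHandler() {
        this.config = loadConfig();
    }

    // Stand-in for slow set-up work.
    private static Map<String, String> loadConfig() {
        SETUP_RUNS.incrementAndGet();
        return Map.of("greeting", "Hello");
    }

    // Per-request work only: reuses the state built by the constructor.
    public String handleRequest(Map<String, Object> event) {
        Object name = (event == null) ? null : event.get("name");
        return config.get("greeting") + ", " + (name == null ? "world" : name);
    }
}
```

The set-up cost does not disappear, it just moves into the cold start, so warm invocations get cheaper while the first one stays slow unless something like Snapstart captures the initialised state.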

It’s also worth considering what would be an acceptable response time for a lambda endpoint – noting that this depends very much on traffic patterns. If only 1-in-100 requests have a cold start, is that acceptable? What about for a rarely-used endpoint, which always has a cold start?
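To make the 1-in-100 question concrete, here is a small sketch that computes the p99 latency for a batch of requests containing a given number of cold starts. The 10ms warm and 650ms cold figures are illustrative round numbers rather than measurements, and the nearest-rank percentile definition is one choice among several.

```java
import java.util.Arrays;

public class ColdStartPercentile {

    // Nearest-rank percentile: the value at position ceil(percent/100 * n)
    // in the sorted samples (1-based).
    static double percentile(double[] samples, int percent) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (percent * sorted.length + 99) / 100; // integer ceiling
        return sorted[Math.max(rank, 1) - 1];
    }

    // Builds a batch of latencies with the given number of cold starts.
    static double[] latencies(int total, int coldCount, double warmMs, double coldMs) {
        double[] out = new double[total];
        Arrays.fill(out, warmMs);
        for (int i = 0; i < coldCount; i++) {
            out[i] = coldMs;
        }
        return out;
    }

    public static void main(String[] args) {
        double warmMs = 10;  // illustrative
        double coldMs = 650; // illustrative
        // One cold start in 100 requests: p99 is still a warm request.
        System.out.println(percentile(latencies(100, 1, warmMs, coldMs), 99)); // prints 10.0
        // Two cold starts in 100 requests: p99 is now a cold start.
        System.out.println(percentile(latencies(100, 2, warmMs, coldMs), 99)); // prints 650.0
    }
}
```

Under these assumptions a cold start rate of 1% or below leaves p99 untouched, which fits Yan Cui’s point, but the picture flips as soon as the rate goes above that. A rarely-used endpoint that is cold on every call has no warm requests to hide behind at all.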


Thoughts about serverless

I drafted this post some time in 2022, and never got around to posting it. I wanted to publish it as it contains some good links and thinking points.

My last role gave me a chance to play more with serverless code in the form of AWS Lambda. While the issues around cold starts still need managing in some way, I am excited about serverless as a technology and think it should be more widely adopted.

The main advantage of serverless is not having to think about servers. They are still there, but can be mostly ignored. As Justin Etheredge neatly put it, “Managing servers is a nasty side effect of wanting to execute code.”

Not having to think about servers means a lot of things become simpler. Most compelling of these is having a smaller attack surface against hackers. Another is not having to maintain servers. Amazon has dedicated engineers responsible for managing the machines and upgrading them, and has the advantage of massive economies of scale. Companies can focus on the code that delivers the value for their customers.

We’ve moved from having ‘servers-as-pets’, keeping the same instance running for months; to ‘servers-as-cattle’, with Puppet to create new ones; to ephemeral containers – but we still have to manage resources, even if they’re just Docker config files. This is a very different role to programming, and leads to the dev/ops split. All servers are a drag, even if they are containers being managed by Kubernetes.

Which is not to say that serverless means being able to ignore ops completely, as Charity Majors has explained. Observability is vital, and you will still encounter issues where the abstractions of serverless leak through. The structure of an application comes to contain a significant amount of logic (for example which queues connect serverless applications) and one needs to be careful of this.

For me, one of the main advantages of serverless is that it enforces good behaviour. AWS Lambda pushes you towards stateless code, since you cannot rely on any state lasting beyond a single request. Paying for the time a request takes focuses developers on writing smaller pieces of code, thereby following more effective cloud patterns. The ease of adding lambdas also avoids the problem with persistent servers where it is easier to add to an existing microservice than handle the overhead of creating a new one, even where a new one is necessary.

One of the risks is lock-in. From a code point of view, serverless abstractions have appeared, and well-written code ought to be easy to port. However, moving the data for a cloud application would likely be fearsome and expensive, and I’ve not seen much writing about how that would occur. Picking serverless over container-based code is probably the least of your problems with that sort of migration.

Another issue is that serverless is not perfect for all situations – long running processes or those dealing with calls to high-latency services are probably better handled by container-based services – although I think people do not make enough use of serverless.

I’ve seen less discussion than I would expect of serverless as a hobbyist option. In one way, it’s as straightforward as a CGI-BIN, but there is the risk of cost, given that you’re paying for every bot that visits your application. Having said that, serverless applications can still be as cost-effective as hosted applications for small-scale apps. The monitoring and management of AWS costs is an ongoing problem.

Gunnar Morling gave a good talk at QCon, Serverless Search for My Blog with Java, Quarkus & AWS Lambda, which explored all aspects of using serverless for a hobby project. There is also Robin Sloan’s discussion of cloud on his blog, including how he uses a hack to get around the cold-start issue. Such hacks are probably more relevant to hobby sites than production software, but his discussion of the topic is illuminating.


First steps in serverless

I’m starting a new job next month where I’ll be using AWS Lambda. In preparation, I’ve been cramming on the topic. The main resource I’ve used is O’Reilly’s Programming AWS Lambda, and I’m enjoying learning from an actual physical book with an animal on the cover.

Here’s a quick summary of some of the other sources I’ve been looking at:

  • Mike Roberts’s Serverless Architectures post is massive, and full of really useful discussion. This includes: a comparison between serverless and stored procedures (vendor lock-in, difficulty testing and versioning); the value of reduced time to market; environmental benefits of serverless; and the challenges of integration testing.
  • Gunnar Morling produced a good InfoQ talk, Serverless Search for my blog, which discusses AWS Lambda used for a Lucene-based blog search. Morling uses Quarkus to avoid lock-in, and also suggests this gets around the cold-start problem. He also suggested Funqy as a vendor-independent abstraction for serverless code. Morling points out that serverless has a smaller attack surface, but looked in detail at dealing with a ‘denial of wallet’ attack.
  • Bruce Schneier discussed The Misaligned Incentives for Cloud Security, warning that the cloud has a few large providers making technical decisions for millions of users; and that security problems such as data breaches affect their customers more than they affect the providers.
  • Guy Podjarny talks about the security issues in greater detail in Serverless security: What’s left to protect. He points out that one still needs to consider dependency vulnerabilities. While security permissions in serverless can be very granular, there is also a risk of this sprawling. Podjarny makes a number of suggestions including having critical and non-critical functionality in different accounts or regions.
  • Serverless and Chatbots: A Match Made in the Cloud by Gillian Armstrong was focussed on chatbots, but had a good overview of a lambda-based platform in production. Armstrong also noted that while lambdas scale very quickly, other parts of an infrastructure such as datastores might not.
  • A 2020 article Why the Serverless Revolution Has Stalled takes a more cynical approach, looking at four potential issues: limited programming languages; vendor lock-in; performance; inability to replace monolithic applications. Some of these issues have been solved by some teams, but all these points are worth considering.
  • Cloud study by the writer Robin Sloan discusses his use of cloud functions to provide simple support for running his newsletter. His solution to the cold start problem is, he admits, not best practice, but works for him: “Instead of deploying each of my functions as Actually Different cloud functions, I’ve rolled them up into one “mega function”—really almost a tiny app.” This solves a lot of issues for this small piece of functionality, not least that it fails fast: “if something isn’t working, nothing is working.”
  • Another post on cold starts suggested reducing the artefact size and had a good discussion of using pings to keep services live.
  • Operational Best Practices #serverless talked about how serverless limits the amount of code an enterprise needs, and that BaaS and FaaS can both help speed up dev, particularly early in the process: “You get to rent engineers from Google, AWS, Pagerduty, Pingdom, Heroku, etc for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t because talent is scarce.”
  • That piece also contains a stern warning: “there is no such thing as having the luxury of not having to understand how your storage systems work. Queries will get slow, and you’ll need to be able to figure out why and fix them. You’ll hit scaling cliffs where suddenly a perfectly-usable app just starts timing everything out because of that extra second of latency coming from … The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.”
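
Sloan’s “mega function” idea from the list above can be sketched in Java as a single handler that routes on a field of the incoming event. This is not his code, and the "action" field and the two routes are invented for the example. The point is only that one deployed function stays warm on behalf of several small pieces of functionality.

```java
import java.util.Map;
import java.util.function.Function;

public class MegaFunction {
    // One entry per piece of functionality that would otherwise be its own
    // cloud function. The route names and behaviour are hypothetical.
    private final Map<String, Function<Map<String, Object>, String>> routes = Map.of(
            "subscribe", event -> "subscribed " + event.get("email"),
            "unsubscribe", event -> "unsubscribed " + event.get("email"));

    public String handleRequest(Map<String, Object> event) {
        Object action = (event == null) ? null : event.get("action");
        // Reject anything that is not a known route name.
        if (!(action instanceof String) || !routes.containsKey(action)) {
            return "unknown action";
        }
        return routes.get(action).apply(event);
    }
}
```

Any request to any route keeps the whole thing warm, at the cost of every route sharing one deployment, one set of permissions and one failure mode.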

Using serverless for hobby projects does look attractive. But, having tried to get S3 and IAM working on AWS, I’d be reluctant to suggest that to anyone – particularly given the financial perils of AWS.