JVM struggles and the BEAM

I’d like to talk about how the opaqueness of the JVM has been plaguing me over the last few months and how I fell in love with the Erlang virtual machine, more famously known as the BEAM (and by extension, the Elixir programming language).

TL;DR: the BEAM is built for dealing with highly concurrent applications. It has world-class introspection that gives you the power to observe and manipulate your running application through a REPL. It has a simple (actor-based) concurrency model that's easy to teach and makes it easy to reason about and scale your systems. The Elixir programming language is a modern BEAM language that allows for terse and readable code while fully leveraging Erlang and OTP. If you need to build and maintain high-traffic backend services, you will not regret learning about this stack.

Hitting walls with the JVM

At work, I've been maintaining a business-critical, JVM-based backend service that has to handle thousands of requests per second (we'll call that a "high traffic" service). Without going into details, the service basically calls out to a database and another backend service on every request and then responds to the user. It also does some off-request operations that are necessary for dealing with the database. The point is that there are some concurrent operations going on within the service for every request.

So I wanted to know how a single instance of my JVM-based system would perform under double its current load. I wanted to see where it would break down, where its bottlenecks were.

The state of the art in observing JVM-based applications is a combination of thread dumps, GC logs, and thread activity visualizations. A thread dump gives you a snapshot of each thread's name, its current running state (waiting, blocked, etc.), and the stacktrace of the work it's currently doing. GC logs give you a record of when and how much garbage was collected. Thread activity visualizations show you a timeline of threads moving between running states.

After performing some load tests and using these tools, I was able to find a completely underutilized thread pool (where most of the threads were doing nothing at all); at least, that’s what I gathered from piecing together disparate clues and developing hunches.

When it came to finding out which threads were overutilized, I wasn't so lucky. Seeing that random threads were active didn't tell me whether they were too busy or whether I needed to increase a threadpool's size. Even trying to understand the flow of control of my program (i.e., the handoff of control between threads) seemed impossible. I might have been able to piece together stacktraces from thread dumps (assuming VisualVM had perfect sampling of the threads' stacktraces), but I was left defeated.

Outside of observability, I would also need to lean on monitoring/metrics to help find bottlenecks. I would need to add instrumentation to threadpools to gauge whether they were busy. Sadly, though, you'd need to instrument every library in your stack that uses threadpools to know whether you need to tweak the concurrency configuration of those libraries. Good luck with that.

I blew days trying to understand what was going on in my system (using all of the aforementioned tools and studying many of the libraries in our standard backend stack) and trying to piece together clues for possible bottlenecks. I got nowhere. I felt helpless and frustrated. I started to talk to some experienced JVM engineers about the issues I was having with visibility into the system, and that didn't really help either; they didn't have any better tools to guide my search for the truth. Most folks had learned from trial and error, or from other Java devs' hard-won lessons. Getting a visualization of the control flow of the system (like tracing, but for threads) would either require really heavy instrumentation all the way down or not be possible at all in some cases.

How I coincidentally found the BEAM

A talented, newly hired teammate joined my squad, saw my struggles with reasoning about the internals of my fairly simple JVM service, and mentioned that the Elixir programming language made all of this a lot easier.

I thought it was another "why are you all using Java?! I used X language at a previous company and it was great" type of argument. I was very wrong. After a few weeks of hearing about Elixir in passing (the actor model, the iex shell, fault-tolerance, and the :observer), I got curious. I spent a few nights reading up on the tech and watched a lot of YouTube videos from various Elixir conferences.

The turning points for me were the following:

Between Sasa's talk and the visualixir library, I learned that the Elixir stack (really, the BEAM) supplied the level of deep introspection I was after — without the need to heavily instrument my software or compare dozens of metric graphs to try to understand what was happening in my system. I became intrigued.

Finding bottlenecks with the BEAM

I had the incredible opportunity to run an experiment at work with the BEAM and deepen my study of the stack. I built a clone of my JVM service in Elixir, to see how it felt to write a service in Elixir and to confirm if the qualities I kept reading about were actually helpful.

After building the clone in a few days (literally going from zero to having something in production), I wanted to see if I could easily find a bottleneck.

I mirrored traffic from my JVM service to the Elixir service. In fact, I had all of the global traffic from my horizontally scaled JVM service going to a single Elixir service on an 8-core machine.

The Elixir service (running Plug) handled the load reasonably well: the service kept accepting requests even with the CPU at 100%, and memory usage remained fairly flat. However, I wasn't getting many responses to the requests. Something was choking internally and causing my requests to go unfulfilled.

So I fired up a CLI version of :observer called :observer_cli.
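If you want to try it yourself, :observer_cli is a regular hex package; a minimal setup looks something like this (the version number is just illustrative — check hex.pm for the latest):

```elixir
# In mix.exs, add the dependency:
#   {:observer_cli, "~> 1.7"}
#
# Then, from an iex session attached to your app:
:observer_cli.start()
```

It renders a top-like, full-terminal view of the VM's processes, which is exactly what you want when you're ssh'd into a production box.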

Because the BEAM and its applications adhere to the actor model (which lets you decompose your application into a set of completely isolated processes that communicate solely via messages), you can use the observer to reverse-sort processes by mailbox size. This shows you at a glance which processes are backed up in processing their messages. In other words, the VM gives you the ability to see the bottlenecks in your system.
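A minimal sketch of what that means in practice. The process below is illustrative (the message shapes are made up), but the introspection call is the real Erlang/OTP API — `Process.info/2` with `:message_queue_len` is the very number :observer_cli sorts on:

```elixir
# Spawn an isolated process with its own mailbox and a simple receive loop.
pid =
  spawn(fn ->
    loop = fn loop ->
      receive do
        {:query, from, sql} ->
          send(from, {:result, "ran: " <> sql})
          loop.(loop)
      end
    end

    loop.(loop)
  end)

# Send it more messages than it can keep up with...
for i <- 1..10_000, do: send(pid, {:query, self(), "SELECT #{i}"})

# ...then ask the VM how backed up its mailbox is.
{:message_queue_len, backlog} = Process.info(pid, :message_queue_len)
IO.puts("mailbox backlog: #{backlog}")
```

No instrumentation was added to the process itself; the mailbox length is something the VM tracks for every process, for free.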

I found that a process pool that was in charge of talking to our database had too few processes to handle the incoming load. So the managing process’ mailbox kept growing and growing — causing requests to stall until the database query messages could be processed by the pool.

I changed the configuration (which we could perhaps have done on the running app itself, thanks to the Elixir REPL) to increase the database client's process pool size, redeployed the change, fired up :observer_cli again, and saw no bottleneck. Looking at the server logs, we were fulfilling requests now. I had literally found the bottleneck within an hour and increased the throughput of our system. I was sold.
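For the curious, a live change like that could look roughly like the sketch below. The application, repo, and supervisor names are hypothetical, and many pool libraries only read their config at startup — hence the child restart — so treat this as a shape, not a recipe:

```elixir
# From a remote shell attached to the running node
# (e.g. iex --remsh my_app@host).

# Bump the pool size in the application environment.
# :my_app and MyApp.Repo are placeholder names.
config = Application.get_env(:my_app, MyApp.Repo)
Application.put_env(:my_app, MyApp.Repo, Keyword.put(config, :pool_size, 50))

# Restart the pool's supervised child so it picks up the new size.
Supervisor.terminate_child(MyApp.Supervisor, MyApp.Repo)
Supervisor.restart_child(MyApp.Supervisor, MyApp.Repo)
```

The point is less the exact incantation and more that the VM lets you attach a REPL to a live node and poke at it safely.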

Long live the BEAM

Over the course of continuing to iterate on that clone, and through building out some infrastructure for getting VM and application metrics into our metric store and into Grafana, the BEAM continued to shine. Not only did :observer_cli keep giving me the visibility to find bottlenecks with ease, but other aspects of the stack (like the Elixir REPL, fault-tolerance via supervisors, and the preemptive process scheduling that avoids CPU starvation) blew my mind. My mind was blown by this stack weekly, I kid you not.
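The VM-metrics side of that work leaned on introspection calls the runtime ships with. These three are real Erlang/OTP APIs; how you report the numbers (StatsD, Prometheus, etc.) is up to your own pipeline:

```elixir
# Total bytes of memory currently allocated by the VM.
total_memory = :erlang.memory(:total)

# Number of processes currently alive on this node.
process_count = :erlang.system_info(:process_count)

# Runnable processes waiting for a scheduler — a great saturation signal.
run_queue = :erlang.statistics(:run_queue)

IO.puts("memory: #{total_memory} B, processes: #{process_count}, run queue: #{run_queue}")
```

A steadily growing run queue is the BEAM's equivalent of "your threadpools are drowning," except you get it without instrumenting a single library.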

Having a stack that gives you the tools and guarantees to safely operate highly concurrent systems has been a real game changer. It's taught me more about the shortcomings of the JVM through comparison (but also about where the JVM shines). I highly recommend playing with the BEAM and Elixir if you get the chance.

After watching Sasa’s talk, I recommend checking out the following resources to learn more:

Staff Software Engineer @Spotify.