Problems with GenServer Callback Types
Update (October 2, 2020)
After publishing this and trying out some of the type helpers I suggested, I realized that I had misunderstood how Dialyzer works. Success typing, which Dialyzer implements, does not actually solve the problems I call out below. I think this only strengthens the argument for the benefits of the work done by Gleam and the WhatsApp team. I published a new post about my misunderstanding, which you can find here.
At work, we are currently implementing a GraphQL API in Elixir which fronts and
orchestrates several other APIs. Several of the APIs we interact with use OAuth
token flows for authentication. We use the OTP GenServer[1] behaviour to
implement a cache for these tokens. GenServer has worked very well for this use
case, and it has proven particularly useful for the more complicated OAuth
refresh token flow, which has two separate expiration timelines. Clients invoke
GenServer.call/3 to request a token, and internally the cache uses
Process.send_after/4 to implement the expiration timelines.
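A minimal sketch of the shape of such a cache, not our production code. The module name `TokenCache`, the `fetch_fun` argument, and the `:expire` message are all assumptions made for illustration, and the sketch tracks a single expiration timeline rather than the two that the refresh token flow needs:

```elixir
defmodule TokenCache do
  @moduledoc "Illustrative token cache; every name here is hypothetical."
  use GenServer

  # Client API: callers go through GenServer.call/3.
  def start_link(fetch_fun) when is_function(fetch_fun, 0) do
    GenServer.start_link(__MODULE__, fetch_fun)
  end

  def get_token(server), do: GenServer.call(server, :get_token)

  # Server callbacks
  @impl true
  def init(fetch_fun), do: {:ok, %{fetch_fun: fetch_fun, token: nil}}

  @impl true
  def handle_call(:get_token, _from, %{token: nil} = state) do
    # Nothing cached: fetch a token and schedule its expiration.
    # fetch_fun is assumed to return {token, ttl_in_milliseconds}.
    {token, ttl_ms} = state.fetch_fun.()
    Process.send_after(self(), :expire, ttl_ms)
    {:reply, token, %{state | token: token}}
  end

  def handle_call(:get_token, _from, state) do
    # Cached token is still valid: reply without fetching.
    {:reply, state.token, state}
  end

  @impl true
  def handle_info(:expire, state) do
    # The timer fired: drop the token so the next call fetches a fresh one.
    {:noreply, %{state | token: nil}}
  end
end
```

Keeping the timer inside the server means clients never reason about expiry; they only ever ask for a token.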
Dialyzer Lets Me Down
I recently extended the implementation of these token caches to support refresh
token flows and ran into some trouble. This trouble left me dissatisfied with
the behaviour of Dialyzer for OTP behaviours. Our team uses Dialyzer to provide
type checking as we develop our software. Dialyzer does not provide a
full-fledged type system, but the success typing it provides is useful for
detecting errors during development. Additionally, we use Credo to enforce that
all public functions in a module have a type spec. And we use the
typed_ecto_schema[3] libraries to generate type specifications for our
user-defined structs and Ecto schema-defined structs. With these tools in place,
Dialyzer is able to detect many classes of programmer errors. In particular, it
is a useful tool for us during refactoring. For example, if we update the return
type of one function clause without updating the return type of the other
clause, Dialyzer will quickly detect the error.
As I worked to refactor our GenServer token cache implementation, Dialyzer let me down. This is no fault of Dialyzer; in fact, it’s not really the fault of anything. The callbacks of GenServer all have type specifications, but because GenServer is so generic, the types must be very loose. In particular, GenServer allows any value to be stored as its state. Not only that, each callback can change the type of the state. It is this aspect of the specification that bit me during this refactoring. Previously, our token server used a plain map as its state, but because the refresh token flow is more complicated, I updated the server to use a typed struct as the state. I updated most of the callback function clauses to return the struct as their state, but I missed some. A slightly stricter specification for the callback type would have easily caught this error: that is, a specification that the type of the state in the return value is the same as the type of the state in the callback arguments.[4]
Had I not caught this error by luck, it would have caused issues in production. These issues would have perhaps been tricky to pin down, because they would occur not in the incorrectly implemented callback, but rather in the next invocation of any other callback which expected the state to be our struct type. Thankfully, I detected the error by eye and disaster was averted.
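The failure mode can be reproduced in a few lines. This is a sketch under assumed names (`LeakyTokenServer`, its nested `State` struct, and the `:put_token` message are hypothetical), not the server from our codebase:

```elixir
defmodule LeakyTokenServer do
  use GenServer

  defmodule State do
    defstruct access_token: nil, refresh_token: nil

    @type t :: %__MODULE__{
            access_token: String.t() | nil,
            refresh_token: String.t() | nil
          }
  end

  @impl true
  def init(_arg), do: {:ok, %State{}}

  @impl true
  def handle_call(:get_token, _from, %State{} = state) do
    # This clause expects the struct, as every callback should after the refactor.
    {:reply, state.access_token, state}
  end

  @impl true
  def handle_cast({:put_token, token}, _state) do
    # Missed during the refactor: the new state is still a plain map.
    # The GenServer callback spec accepts any term as state, so nothing complains here.
    {:noreply, %{access_token: token}}
  end
end
```

The cast itself succeeds. The crash only shows up on the next `:get_token` call, when `handle_call/3` finds a map where it expects `%State{}` and no clause matches, which is exactly why this kind of bug is hard to trace back to its source.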
Enforcing our Team’s Conventions
These helpers do not actually work; read this section with a grain of salt.
Based on this experience, I’m planning to introduce some GenServer type helpers into our codebase. By convention, our team will agree that a GenServer callback should always return a new state of the same type as the state it received. The type helpers we implement will enforce this convention. The type helper will look something like this:
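A minimal sketch of such helpers, assuming hypothetical names (`GenServerTypes`, `TypedTokenServer`) chosen for illustration rather than taken from our codebase. Each helper type is parameterized on the state type, so a callback spec names the state type once in its arguments and once in its return value:

```elixir
defmodule GenServerTypes do
  @moduledoc "Hypothetical helper types that pin a callback's returned state to one type."

  # Optional third/fourth element of the standard GenServer return tuples.
  @type extra :: timeout() | :hibernate | {:continue, term()}

  @type init_return(state) ::
          {:ok, state} | {:ok, state, extra()} | :ignore | {:stop, term()}

  @type call_return(reply, state) ::
          {:reply, reply, state}
          | {:reply, reply, state, extra()}
          | {:noreply, state}
          | {:noreply, state, extra()}
          | {:stop, term(), reply, state}
          | {:stop, term(), state}

  @type noreply_return(state) ::
          {:noreply, state} | {:noreply, state, extra()} | {:stop, term(), state}
end

defmodule TypedTokenServer do
  use GenServer

  defmodule State do
    defstruct access_token: nil
    @type t :: %__MODULE__{access_token: String.t() | nil}
  end

  @impl true
  @spec init(term()) :: GenServerTypes.init_return(State.t())
  def init(_arg), do: {:ok, %State{}}

  @impl true
  @spec handle_call(:get_token, GenServer.from(), State.t()) ::
          GenServerTypes.call_return(String.t() | nil, State.t())
  def handle_call(:get_token, _from, %State{} = state) do
    {:reply, state.access_token, state}
  end

  @impl true
  @spec handle_cast({:put_token, String.t()}, State.t()) ::
          GenServerTypes.noreply_return(State.t())
  def handle_cast({:put_token, token}, %State{} = state) do
    {:noreply, %State{state | access_token: token}}
  end
end
```

As the note at the top of this section says, specs like these record the convention; they should not be read as a guarantee that Dialyzer rejects every mismatched return.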
These type specs would have caught my errors above, and they could be extended
to catch other categories of error. For example, we could define stricter types
for the request arguments in handle_call/3. But there are still a couple of
places where Dialyzer can’t help us. For example, although we can have stricter
types for the callbacks, these types will not be visible when clients call
GenServer.call/3. As it currently stands, I think this issue is unsolvable.
These functions must have very loose typing because they are called by every
client of every GenServer.
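To make the gap concrete, here is a small sketch with a hypothetical `EchoServer`. Its `handle_call/3` carries a narrow spec, but `GenServer.call/3` is specified to take a `term()` request and return a `term()`, so nothing connects the client call to that spec:

```elixir
defmodule EchoServer do
  use GenServer

  @impl true
  def init(token), do: {:ok, token}

  @impl true
  @spec handle_call(:get_token, GenServer.from(), String.t()) ::
          {:reply, String.t(), String.t()}
  def handle_call(:get_token, _from, token), do: {:reply, token, token}
end

# Client side: the request and the reply are both just term() as far as the
# GenServer.call/3 spec is concerned.
{:ok, pid} = GenServer.start(EchoServer, "abc")
reply = GenServer.call(pid, :get_token)

# A misspelled request such as GenServer.call(pid, :get_tokn) compiles just as
# well; the mistake only surfaces at runtime, when no handle_call/3 clause matches.
```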
There are a couple of projects in the works that I am hopeful will improve the developer experience with these kinds of errors. I’m following both with quite a bit of interest. The first is a new programming language targeting the BEAM called Gleam.[5] Gleam is a statically typed language which takes inspiration from the ML family of languages. In addition to providing a robust type system, Gleam hopes to provide a type-safe implementation of OTP.[6] A Gleam implementation of GenServer will hopefully be able to make explicit the connections between client code and GenServer callbacks.
The second project is work that the Erlang team within WhatsApp at Facebook is doing to improve the Erlang developer experience.[7] In fact, the loose specification of GenServer is a shortcoming that they have explicitly identified. The team is currently prototyping a declarative and statically typed GenServer API which they hope to release by the end of 2020. The WhatsApp team has engaged with the Elixir core team, the implementer of Gleam, and other members of the BEAM community. While their improvements are targeted explicitly at Erlang, I’m hopeful that their broad collaboration will lead to improvements that can be useful across BEAM languages.