Writer monads for knowledge tracking in OCaml

Published on January 27, 2020 Writer monads for knowledge tracking in OCaml

Tracking knowledge across large areas of a codebase is a fairly common task. Keeping track of whether or not some cache miss happens when loading files is an example of this — you could want to track this in a metric. If we model knowledge monoidally, we can use a writer monad to track this knowledge without boilerplate in a simple and threadsafe manner.

Feel free to skip the background section if you don't care how this problem came up in a real project.

Background

The Mina Protocol uses zk-SNARKs to compress the blockchain down to a constant size. These SNARKs are proof objects certifying that some computation has been run correctly. Some nodes in the network create these SNARKs and others verify them. To create a SNARK one needs to have a specific large proving key, and to verify a SNARK one needs to a specific small verification key. These proving and verification keys are indexed by the SNARK logic itself. In other words, if any computation changes then the proving and verifying keys need to change. In other other words, whenever some code is changed in the SNARK circuit during the development of Mina, both proving keys and verification keys must be regenerated. Key generation also involves some randomness.

We want developers to be productive and so we want proving and verification key generation to happen transparently at build time.

Naively, one can just always regenerate all proving and verification keys every build, but this has some drawbacks: (1) As randomness is involved during key generation, keys created from different builds will be incompatible. For example, you wouldn't be able to connect to a live network that was built on CI with some local branch of code, even though when you haven't touched any SNARK logic that would invalidate the keys. (2) Also, key generation is slow so we'd like to skip that step if possible.

A nice solution to this would be to remove the randomness from key generation under certain conditions. This may be done by adding some debug branch through the code that generates. Unfortunately, it turned out to be quite a bit of work to get rid of all the randomness in the key generation logic. It's enough work that we decided to punt on it temporarily.

Another solution is to introduce a series of layered caches that would be placed in front of key-generation in the key-loading process. To load keys in Mina, the following process is followed:

Try to load the keys from a manual override path
Try to load the keys from the normal installation path
Try to load the keys from an S3 installation path
Try to load the keys from S3
Try to load the keys from the auto-generation path
Auto-generate the keys
Store the keys in the auto-generation path (only if the keys were auto-generated)
Store the keys in S3 (only if the keys were auto-generated)

To further complicate things, we need to track the outcome of the key generation process and propagate it to various interested observers for legacy reasons.

Finally, we're sufficiently motivated.

Simplified Model

The following is a simplified model of the problem. In reality, there may be several more places where loading occurs:

module Action : sig
  type 'a t
end

Actions are what happens when you invoke a load-or-generate function.

module A (Intf: sig
  type elem1
  type elem2
  type t
  val load_or_gen_a1 : unit -> elem1 Action.t
  val load_or_gen_a2 : unit -> elem2 Action.t
  val build : elem1 -> elem2 -> t
end) = struct
  include Intf
  let load_or_gen_a =
    let open Deferred.Let_syntax in
    let* a1 = loadOrGenA1 () in
    let+ a2 = loadOrGenA2 () in
    build a1 a2
end

In one spot we load a1 and a2 to create an A.t.

module B = struct
  type t
  let load_or_gen_b : unit -> t Action.t
end

In another, we load a B.t.

module Subroutine = struct
  let load_or_gen () =
    let open Deferred.Let_syntax in
    let* a = load_or_gen_a () in
    let+ b = load_or_gen_b () in
    (a, b)
end

Finally we load A.t and B.t together.

To model whether or not generation has occurred:

module Track_generated = struct
  type t = [`Generated_something | `Cache_hit]
end

The task is to somehow incorporate generated knowledge into the output of load_or_gen in a clean way.

In this model, actions are the asynchronous monad. Let's assume we're using Jane Street's Async library and then implement Action as follows:

module Action = struct
  type 'a t = 'a Deferred.t
end

Tracking Knowledge v1

A naive way to track whether or not generation has occurred is to introduce some sort of global mutable state.

module Global_mutable_state = struct
  let generation_occurred =
    let state = ref `Cache_hit in
    fun () ->
      state := `Generated_something
end

Then in load_or_gen_a1, load_or_gen_a2, and load_or_gen_b we need to call generation_occurred () whenever necessary.

This is easy to implement, but it's a very terrible solution! Global mutable state means we need to be careful about concurrent writes. We also need to make sure we don't forget to call this function if we ever introduce a new place load_or_gen leaf in the call graph. It's error-prone and brittle.

Extending our knowledge model

The concept of knowledge has an interesting property: Once I know a, when presented with new information b I'll never un-know a. As a consequence, knowledge gives rise to a semilattice.

At the bottom, I know nothing or $\bot$ . Afterward, I can learn about $a$ or $b$ , and once I see the other one I know both $\{a,b\}$ .

This means we can implement a well-defined commutative, associative, idempotent binary operation $\vee$ that incorporates new knowledge. And in this case, since we don't act when Cache_hit, we can treat Cache_hit as our identity or bottom state, and Generated_something is our top state.

Semilattice cache_hit at the bottom generated_something at the top

module Track_generated = struct
  type t = [`Generated_something | `Cache_hit]
  let bottom = `Cache_hit
  (* In Haskell, we'd provide a Monoid instance and use
   * <>, but OCaml uses <> for does-not-equal-to *)
  let (+) x y =
    match (x, y) with
    | (`Generated_something, _)
    | (_, `Generated_something) -> `Generated_something
    | (`Cache_hit, `Cache_hit) -> `Cache_hit
end

Tracking Knowledge v2

Now we can ditch our global mutable state, and instead change our actions to also give us the new knowledge about Track_generated.t:

module Action = struct
  type 'a t = ('a * Track_generated.t) Deferred.t
end

Then we update load_or_gen_a and load_or_gen to thread through the knowledge. Here's what load_or_gen_a would look like afterwards:

let load_or_gen_a () =
    let* (a1, t1) = load_or_gen_a1 () in
    let+ (a2, t2) = load_or_gen_a2 () in
    (build a1 a2, t1 + t2)

A better solution! We've eliminated global mutable state, so we've solved the concurrency problem. This is parallelizable since knowledge aggregation is commutative and associative. Unfortunately, we still have a bunch of boilerplate that we can't forget to add. And it's still brittle: The type-system will remind us to remember that actions have a tupled result, but we need to rely on warnings at best to remember to join together all of our knowledge.

Writer Monad

The writer monad wraps computation alongside some monoidal value. Traditionally, this is used in a language like Haskell to aggregate logs — some call it the logging monad. However, we're free to use any monoid here. Monoids are values that have an identity and an associative binary operation — just like $\vee$ that we've described above.

For reference, here's an OCaml implementation of a writer monad:

module type Monoid_intf = sig
  type t
  val empty : t
  val (+) : t -> t -> t
end
module Writer (M : Monoid) = struct
  type 'a t = 'a * M.t
  include Monad.Make (struct
    type nonrec 'a t = 'a t
    let return x = (x, M.empty)
    let map = `Define_using_bind
    let bind (a, m1) ~f =
      let open M in
      let (b, m2) = f a in
      (b, m1 + m2)
  end)
end

We can layer Writer on Track_generated under the async monad to get a monad that both tracks knowledge and runs code concurrently.

module With_track_generated = struct
  type 'a t = {data: 'a, dirty: Track_generated.t}
end
module Deferred_with_track_generated = struct
  type 'a t = 'a With_track_generated.t Deferred.t
  include Monad.Make (struct
    type nonrec 'a t = 'a t
    let return x =
      Deferred.return
        {With_track_generated.data= x;
         dirty= Track_generated.empty}
    let map = `Define_using_bind
    let bind t ~f =
      let open Deferred.Let_syntax in
      let* {With_track_generated.data; dirty= dirty1} = t in
      let+ {With_track_generated.data= output; dirty= dirty2} = f data in
      { With_track_generated.data= output
      ; dirty= Track_generated.(dirty1 + dirty2) }
  end)
end

Tracking Knowledge v3

This time, we'll change Action once more:

module Action = struct
  type 'a t = 'a Deferred_with_track_generated.t
end

Then we'll get compilation errors until we change our leaf load_or_gen functions to return that initial Track_generated.t. And we're done! We don't have to change load_or_gen_a or load_or_gen_b at all other than opening this monad instead of Deferred.

This is the best solution! There is no global mutable state. There is no boilerplate. Our code is clean and knowledge propagation is neatly handled for you. If later more load_or_gen functions are introduced, the type system will demand that the author remembers to send a Track_generated.t with the output.

Summary

Writer monads provide a boilerplate-free, thread-safe, non-error-prone way to propagate knowledge through large swaths of a codebase. In Mina, we used the Deferred with_track_generated monad to propagate knowledge that some generation occurs to give developers a transparent way to keep their code as compatible as possible with other builds to make them more productive.

Thank you Christina Lee and Omer Zach for your thoughtful reviews on this post!