This blog is the second part of a series on understanding the Jepsen toolkit for distributed systems testing. I would encourage you to go through part 1 before continuing here.

Major components of a Jepsen test

Generators

A Jepsen history is a list of operations: invocations and completions. A generator’s job is to specify which invocations to perform, and when. In a sense, a generator’s output becomes a history as Jepsen incrementally applies it to a database.

Conceptually, then, a generator is a graph of events, some of which have not yet occurred. Some events are invocations: these are the operations the generator will provide to clients. Some events are completions: these are provided by clients to the generator. Other events are temporal: a certain time has passed.

This event graph has some invocations which are ready to perform. When Jepsen sees a ready invocation, it applies the invocation using the client, obtains a completion, and applies the completion back to the graph, obtaining a new graph.

The Generator protocol contains two methods: op and update.

(defprotocol Generator
  (update [gen test context event]
          "Updates the generator to reflect an event having taken place.")

  (op [gen test context]
      "Obtains the next operation from this generator."))

Jepsen calls (op gen test context) to ask the generator for the next invocation that should be processed.

The operation can have three forms:

  1. The generator may return nil, which means the generator is done, and there is nothing more to do. Once a generator does this, it must never return anything other than nil, even if the context changes.

  2. The generator may return :pending, which means there might be more ops later, but it can’t tell yet.

  3. The generator may return an operation, in which case:

    • If its time is in the past, it can be evaluated now
    • If its time is in the future, Jepsen’s test runner will wait until either
      • The time arrives
      • Circumstances change (e.g. we update the generator)

But (op gen test context) returns more than just an operation; it also returns the subsequent state of the generator, if that operation were to be performed. The two are bundled into a tuple.

(op gen test context) => [op gen']      ; known op and subsequent state
                         [:pending gen] ; unsure and same state
                         nil            ; exhausted
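
To make the [op gen'] contract concrete, here is a minimal sketch of a custom generator (illustrative only, not how Jepsen’s built-ins are written). It fills in :process and :time from the standard context fields described in the Contexts section below, emits a single read invocation, and then reports exhaustion:

(defrecord SingleRead [done?]
  Generator
  (update [this test context event]
    ;; Ignore completions and other events
    this)
  (op [this test context]
    (cond
      ;; Once we return nil, we must always return nil
      done? nil
      ;; No idle thread yet: there may be more ops later
      (empty? (:free-threads context)) [:pending this]
      :else
      (let [thread (first (:free-threads context))]
        ;; Return the invocation and the generator's subsequent state
        [{:type    :invoke
          :process (get (:workers context) thread)
          :time    (:time context)
          :f       :read
          :value   nil}
         (SingleRead. true)]))))

;; (SingleRead. false) yields one read invocation, then is exhausted.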

Default generators

There are some default generators which can be used for writing tests:

nil is a valid generator; it ignores updates and always yields nil for all operations.

IPersistentMaps are generators which ignore updates and return exactly one operation which looks like the map itself, but with default values for time, process, and type provided based on the context. This means you can write a generator like

{:f :write, :value 2}

;; and it will generate a single op like
{:type :invoke, :process 3, :time 1234, :f :write, :value 2}

To produce an infinite series of ops drawn from the same map, use

(repeat {:f :write, :value 2})

Sequences are generators which assume the elements of the sequence are themselves generators. They ignore updates, and return all operations from the first generator in the sequence, then all operations from the second, and so on.

Functions are generators which ignore updates and can take either test and context as arguments, or no args. When a function is used as a generator, its return value is used as a generator; that generator is used until exhausted, and then the function is called again to produce a new generator. For instance:

; Produces a series of different random writes, e.g. 1, 5, 2, 3...
(fn [] {:f :write, :value (rand-int 5)})

; Alternating write/read ops, e.g. write 2, read, write 5, read, ...
(fn [] (map gen/once [{:f :write, :value (rand-int 5)}
                      {:f :read}]))

Promises and delays are generators which ignore updates, yield :pending until realized, then are replaced by whatever generator they contain. Delays are not evaluated until they could produce an op, so you can include them in sequences, phases, etc., and they’ll be evaluated only once prior ops have been consumed.
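
For example, a promise can hold back part of a test until another thread supplies the ops (a small sketch):

;; Yields :pending until delivered, then acts as the delivered generator
(def later-reads (promise))

;; ... later, perhaps once some setup completes on another thread:
(deliver later-reads (repeat {:f :read}))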

Look at the extend-protocol implementation for Generator in the jepsen.generator.pure namespace for more details on the default implementations.

Wrapper generators

Built on this mechanism, there are wrapper generators provided out of the box which do things like catch exceptions thrown in generator functions and convert them into ::op-threw and ::update-threw events. There is also a wrapper for adding traces, i.e. trace.

There are map and f-map wrappers which transform each operation produced by the underlying generator by applying (f op), and pass updates through to the underlying generator as-is.

There is a filter wrapper for filtering out ops which don’t match a predicate.

There are any, mix, limit and repeat wrappers which do pretty much what you expect them to.

There are timing-related wrappers like time-limit, delay and stagger, which bound how long ops are generated for and introduce fixed or randomized delays between ops, respectively.

phases is an important wrapper because it ensures all ops from the previous generator are exhausted before moving to the next generator in the chain.
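
A rough sketch composing a few of these wrappers (required here as jepsen.generator; in the version this post describes they live in jepsen.generator.pure):

(require '[jepsen.generator :as gen])

(gen/phases
  ;; Phase 1: random writes, ~100ms apart on average, for 30 seconds
  (->> (fn [] {:f :write, :value (rand-int 5)})
       (gen/stagger 1/10)
       (gen/time-limit 30))
  ;; Phase 2: once phase 1 is exhausted, at most 10 reads
  (gen/limit 10 (repeat {:f :read})))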

Contexts

A context is a map which provides information about the state of the world to generators. For instance, a generator might need to know the number of threads which will ask it for operations. It can get that number from the context. Users can add their own values to the context map, which allows two generators to share state. When one generator calls another, it can pass a modified version of the context, which allows us to write generators that, say, run two independent workloads, each with their own concurrency and thread mappings.

The standard context mappings, which are provided by Jepsen when invoking the top-level generator, and can be expected by every generator, are:

  :time           The current Jepsen linear time, in nanoseconds
  :free-threads   A collection of idle threads which could perform work
  :workers        A map of thread identifiers to process identifiers
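
Concretely, a context map might look something like this (values are illustrative):

{:time         4234567890              ; current Jepsen linear time, in ns
 :free-threads #{0 2 :nemesis}         ; threads ready to invoke an op
 :workers      {0 0, 1 1, 2 2,         ; thread identifier ->
                :nemesis :nemesis}}    ;   process identifier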

Transaction generators

The MongoDB transactions test suite uses the list-append workload, whose generator is jepsen.tests.cycle.append/gen. This generator wraps the elle.list-append/gen generator. The client needs to understand operations of the form:

{:type :invoke, :f :txn, :value [[:r 3 nil] [:append 3 2] [:r 3]]}

and return completions like:

{:type :ok, :f :txn, :value [[:r 3 [1]] [:append 3 2] [:r 3 [1 2]]]}

where the key 3 identifies some list, whose value is initially [1], and becomes [1 2] after an append operation.
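
A hedged sketch of the client side of this workload; read-list and append-to-list! are hypothetical stand-ins for the real database calls:

(defn apply-mop!
  "Applies one [f k v] micro-op to the database, returning the
  completed micro-op with any read value filled in."
  [conn [f k v]]
  (case f
    :r      [:r k (read-list conn k)]        ; observe the current list
    :append (do (append-to-list! conn k v)   ; perform the append
                [:append k v])))

;; A client's invoke! would run apply-mop! over (:value op) inside one
;; transaction, returning the op with :type :ok and the filled-in value.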

The elle.list-append generator generates operations where values are transactions made up of reads and appends to integer keys.

The list-append generator, i.e. append-txns, creates wr-txns and replaces each :w operation with :append.

Example output from the generator

(take 10 (la/gen opts))
({:type :invoke, :f :txn, :value [[:r 9 nil] [:append 9 1] [:append 9 2] [:r 9 nil]]}
 {:type :invoke, :f :txn, :value [[:r 9 nil]]}
 {:type :invoke, :f :txn, :value [[:r 7 nil] [:r 9 nil] [:r 9 nil]]}
 {:type :invoke, :f :txn, :value [[:append 8 1]]}
 {:type :invoke, :f :txn, :value [[:append 6 1] [:r 8 nil] [:append 7 1] [:append 9 3]]}
 {:type :invoke, :f :txn, :value [[:append 9 4]]}
 {:type :invoke, :f :txn, :value [[:append 9 5]]}
 {:type :invoke, :f :txn, :value [[:r 9 nil] [:r 6 nil] [:r 9 nil]]}
 {:type :invoke, :f :txn, :value [[:r 9 nil] [:append 8 2]]}
 {:type :invoke, :f :txn, :value [[:append 8 3] [:r 9 nil]]})

wr-txns generates a lazy seq of transactions where each is of the form

{:type :invoke, :f :txn, :value [[:r 9 nil] [:w 8 2]]}

The length of the txn is controlled by the max-txn-length and min-txn-length keys.

Every write is also unique per key, meaning that if 1 is written for key 5, it is never written again for key 5. This is ensured by using a state variable which tracks the current value assoc’d with each key.
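
A minimal sketch of how that could work (not Elle’s actual implementation):

;; key -> last value written; each write takes the next value for its key
(def last-written (atom {}))

(defn fresh-write [k]
  (get (swap! last-written update k (fnil inc 0)) k))

;; (fresh-write 5) => 1, (fresh-write 5) => 2, never repeating per key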

Keys appear in transactions based on the key-distribution type. It can either be uniform, meaning each key has the same probability of appearing, or exponential, which means that key i in the current key pool is k^i times more likely than the first key to be chosen, where k is the base (defaulting to 2).
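
For instance, with the default base k = 2, the relative weights over a pool of four keys come out as:

(map #(Math/pow 2 %) (range 4))
;; => (1.0 2.0 4.0 8.0) ; the newest key is 8x more likely than the oldest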

Nemesis

jepsen.nemesis/Nemesis is the main protocol which represents a way to introduce faults into the tests.

(defprotocol Nemesis
  (setup! [this test] "Set up the nemesis to work with the cluster. Returns the
                      nemesis ready to be invoked")
  (invoke! [this test op] "Apply an operation to the nemesis, which alters the
                          cluster.")
  (teardown! [this test] "Tear down the nemesis when work is complete"))
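
For instance, jepsen.nemesis/noop is a nemesis which injects no faults at all; its shape is simply this (comments mine):

(def noop
  "Does nothing."
  (reify Nemesis
    (setup!    [this test] this)    ; nothing to prepare
    (invoke!   [this test op] op)   ; inject no fault; echo the op back
    (teardown! [this test] this))) ; nothing to clean up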

Important functions are bisect, complete-grudge and bridge, which split the given nodes into subsets that are then used by functions like partitioner to create network partitions.
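
For example (a sketch; shapes per jepsen.nemesis’s docstrings):

(require '[jepsen.nemesis :as nemesis])

;; Cut a node list in two, smaller half first
(nemesis/bisect [:n1 :n2 :n3 :n4 :n5])
;; => [(:n1 :n2) (:n3 :n4 :n5)]

;; Turn components into a grudge: a map of each node to the set of
;; nodes it should refuse to talk to
(nemesis/complete-grudge (nemesis/bisect [:n1 :n2 :n3 :n4 :n5]))
;; => {:n1 #{:n3 :n4 :n5}, :n2 #{:n3 :n4 :n5},
;;     :n3 #{:n1 :n2}, :n4 #{:n1 :n2}, :n5 #{:n1 :n2}}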

Network manipulation occurs via the jepsen.net/Net protocol. Out-of-the-box implementations are available for iptables and ipfilter; the default is iptables.

The compose function can combine multiple nemeses, routing each op to the nemesis whose set of :f values matches it. For example,

(compose {#{:start :stop} (partition-random-halves)
          #{:kill}        (process-killer)})

This routes :kill ops to process killer, and :start/:stop to the partitioner.

node-start-stopper is an interesting function which takes a targeting function that selects the node or subset of nodes to act on, a start! function which says what to do when the nemesis receives :start, and a stop! function which says what to do when the nemesis receives :stop.

An example usage of this function is the hammer-time nemesis, which pauses the given process on a node using SIGSTOP and then resumes it using SIGCONT.

(defn hammer-time
  "Pauses the given process on a target node with SIGSTOP on :start,
  and resumes it with SIGCONT on :stop."
  [targeter process] ; targeter defaults to rand-nth in jepsen.nemesis
  (node-start-stopper targeter
                      (fn start [test node]
                        (c/su (c/exec :killall :-s "STOP" process))
                        [:paused process])
                      (fn stop [test node]
                        (c/su (c/exec :killall :-s "CONT" process))
                        [:resumed process])))

Nemesis packages

jepsen.nemesis.combined/nemesis-package creates a nemesis map from given options.

This map has a nemesis, a generator for its ops, a final-generator to clean up failure modes at the end of the test, and perf data for rendering test performance graphs.

This is a generic nemesis which includes partitioning faults, clock-based faults, and db faults (start, stop, kill, pause, etc.). This nemesis package can be used as the first thing to try, to then see which kinds of faults need more drilling down.

Every sub-package, i.e. partition-package, clock-package and db-package, is also a map with similar information: nemesis, generator, final-generator and perf.

db-nemesis is the basic reified nemesis which responds to :start, :stop, :pause and :resume commands, and uses the relevant methods from the DB protocol to implement those operations.

There is something called node-specs which is used everywhere in the nemesis world. This is just a set of keywords identifying which subset of the nodes to operate on:

  • nil - Chooses a random, non-empty subset of nodes
  • :one - Chooses a single random node
  • :minority - Chooses a random minority of nodes
  • :majority - Chooses a random majority of nodes
  • :primaries - A random nonempty subset of nodes which we think are primaries
  • :all - All nodes
  • ["a", …] - The specified nodes
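
A hedged sketch of building such a package (option keys per jepsen.nemesis.combined’s docs; db here is assumed to be your test’s DB):

(require '[jepsen.nemesis.combined :as nc])

(def package
  (nc/nemesis-package
    {:db        db                          ; your jepsen.db/DB instance
     :faults    #{:partition :kill :clock}  ; which fault classes to include
     :partition {:targets [:one :majority]} ; node-specs from the list above
     :interval  10}))                       ; rough seconds between faults

;; Plug (:nemesis package), (:generator package) and
;; (:final-generator package) into your test map.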

Consistency checkers : Elle

This section consists of my notes from an excellent talk by Kyle Kingsbury in which he explains Elle.

Elle is a transactional consistency checker for databases, i.e. a tool for verifying whether DBs are serializable, snapshot isolated, etc.

Serializability doesn’t say anything about the time at which transactions happen, only that there should be some order of txns. So checking for serializability is essentially an ordering problem. Trivially, we can enumerate every possible ordering and eliminate the ones which don’t match our observed reads and writes, but this approach scales as n!, which is bad.

Serializability can be decomposed into 3 properties (according to a paper on which Gretchen is based).

  • Internal : inside each transaction, you should observe values consistent with prior reads and writes
  • External : you need to see the values which a previous transaction wrote
  • Total visibility : the order of txns should be a total order, it can’t be partial

For external consistency, we need to build a dependency graph of transactions. If I read a value 3, I know it has come from some txn that wrote 3. But if two txns write the same value, it’s difficult to establish the dependency. If you find a cycle in the dependency graph, there is no total order, so serializability is violated.

Elle is based on a paper by Adya, Liskov, and O’Neil titled Generalized Isolation Level Definitions.

This paper describes database isolation levels in terms of dependency graphs and cycles.

Types of dependency

write-read : T1 w(xi) & T2 r(xi) then T2 has to come later as it observed a value T1 wrote

write-write : T1 w(xi) & T2 w(xj) then T2 has to come later as xi « xj in the version history. This essentially means that if we see values 2 and 3 in an incrementing counter, we know 3 is a later value, so it happened later. This is a priori knowledge.

read-write : T1 r(xi) & T2 w(xj) then T2 has to come later since T2 wrote a later version of x. This is known as an anti-dependency.

From this we can create a direct serialization graph comprising wr, ww and rw edges, and solve for cycles. The nodes are txns.

Let’s say we have histories of operations:

T1 : w(x1) w(y1)
T2 : r(x1) w(y2)

But we are missing the relation between y1 and y2, and we are also not sure whether the x1 that we read could only have been written by T1.

  • If y1 < y2, this is serializable, T1 → T2

  • If y2 < y1, there is a cycle, and hence the history is not serializable

Recoverability

If we can look at a value xi and trace back to the only txn which could have written xi, we have recovered it from the history. Another problem is that when we write a value to a register, it replaces the current value and hence destroys the history of operations. So if we write xk, we can’t be sure that there were previous versions xj and xi before it.

Traceability

If we read a value xk, we can trace it back to a write of xk, and also to a write of xj before it and of xi before that. There are several types of registers we can use:

  • Simple write, if register contains 3 and we write 5, it gets replaced
  • Incrementing counter, if register contains 3 and we write 5, it gets added to become 8
  • Set, if the register contains #{1, 2} and we write 5, it gets added to the set to become #{1, 2, 5}. This is useful because if we ever read the set and see a value of 5, we are sure that at least one write of 5 happened prior (assuming unique writes)
  • Append-only list, if the register contains [1, 2] and we write 3, it becomes [1, 2, 3], and we can also infer that the write of 2 happened just before this and the write of 1 before that, since the register maintains order
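
That last property is what makes append-only lists so powerful: a single read recovers the whole version order. For example:

;; Reading [1 2 3] for some key tells us 1 << 2 << 3 in version history,
;; so every adjacent pair implies a ww edge between the writing txns:
(partition 2 1 [1 2 3])
;; => ((1 2) (2 3))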

Traceability is similar to audit trails, or oplogs in that sense.

Example

Imagine we are given two txns:

T1 : w(x, 1) w(y, 1)
T2 : w(x, 2) w(y, 2)

This in itself isn’t helpful, but suppose we add a third txn which reads the two registers:

T3 : r(x, [1,2]) r(y, [2,1])

From x’s read [1, 2], T1’s write precedes T2’s (a ww edge T1 → T2), while from y’s read [2, 1], T2’s write precedes T1’s (a ww edge T2 → T1). This forms a cycle: a dirty write anomaly!

So how does it all work?

  • Generate and execute a whole bunch of transactions

  • Find wr, ww, rw edges in them by observing the values read and written and build a graph out of them

  • Use a specialised linear time algorithm to find strongly connected components (fancy speak for cycles) in the graph. Each of these is an anomaly

  • Now if we want to check other types of isolation levels, we can add more constraints (edges) to the graph. For example, add an edge between successive txns from the same process to check process order

  • We can also add edges for real-time (single clock of truth) order and check for strict serializability or strong snapshot isolation
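
A hedged sketch of driving Elle’s list-append checker directly, along the lines of Elle’s README (newer versions may also want :index and :time on each op):

(require '[elle.list-append :as la])

;; T2 reads T1's append of x, and T1 reads T2's append of y: a wr cycle
(def history
  [{:process 0, :type :invoke, :f :txn, :value [[:append :x 1] [:r :y nil]]}
   {:process 0, :type :ok,     :f :txn, :value [[:append :x 1] [:r :y [1]]]}
   {:process 1, :type :invoke, :f :txn, :value [[:append :y 1] [:r :x nil]]}
   {:process 1, :type :ok,     :f :txn, :value [[:append :y 1] [:r :x [1]]]}])

(la/check {:consistency-models [:serializable]} history)
;; => {:valid? false, ...} ; reports the cycle between the two txns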

Conclusion

Hopefully this two-part blog series has given you a slightly more in-depth and hands-on view of the amazing work by the Jepsen team. I would encourage every distributed systems learner to explore and understand the toolkit. It has really helped me understand such analyses more clearly and make more informed choices about the databases that we rely upon so heavily in our day jobs!