Back to Articles

Finite-State Machines at Luno

Timothy Stranex
5 minute read

Executing a multi-step business process (e.g. processing a financial transaction) is trivial in a linear self-contained program, but can become very complex in the context of a distributed system with multiple concurrent servers and external dependencies, such as the systems we run at Luno. With even a small number of services that need to coordinate, it quickly becomes difficult to keep track of what’s happening and it’s very easy to introduce bugs.

finite-state machines at Luno

Finite-state machines are a useful way to formalise such processes to make them easier to reason about. Formalising the process often seem overkill at first but it’s essential in a distributed context.

As you may recall, a finite-state machine consists of a set of states and a set of transitions between these states. Let’s consider a simplified version of a process that we deal with at Luno: sending a Bitcoin transaction.

Finite-state machines at Luno_cycle

 

Each of these steps will be executed by a different service. The initial request will come from a frontend API server while the Bitcoin transaction will be transmitted by a backend Bitcoin network node.

Keeping track of the state

To keep track of the process, we need to store the state somewhere. At Luno, we typically use MySQL databases for this purpose. Each different process typically has its own table. In this case, the table is “withdrawals” and the “status” column will store the current state.

For redundancy, we have multiple backend servers capable of processing each step. However, we can’t allow two servers to work on the same step concurrently otherwise we may end up sending the payment twice. How can we ensure that only one server executes the state transition?

Most people will suggest using SQL transactions at this point. Here is some code doing that in Go, which is the language we use for all our backend servers:

func ProcessWithdrawal(dbc *sql.DB, id int64) error {
    tx, err := dbc.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback()

    var status Status
    if err := tx.QueryRow(“select status from withdrawals where id=?”, id).Scan(&status); err != nil {
        return err
    }

    if status != StatusPending {
        return errors.New(“wrong state”)
    }

    if _, err := tx.Exec(“update withdrawals set status=? Where id=?”, StatusProcessing, id); err != nil {
        return err
    }

    return tx.Commit()
}

It looks plausible but this code is actually dangerously wrong!

The reason is that MySQL transactions, by default, are not actually executed in an isolated serial fashion. Instead, reads are allowed to execute concurrently for performance reasons. If two servers run ProcessWithdrawal() concurrently, the initial SELECT statement will return status PENDING on both machines and then they’ll both update the status to PROCESSING, and both will think that they’re meant to send the payment.

This kind of behaviour of often blamed on “NoSQL” databases. In fact, most SQL databases will happily exhibit the same behaviour too if you’re not careful because they typically run in “REPEATABLE READ” mode. In this mode, reads are allowed to execute concurrently to greatly improve read performance.

Several Bitcoin exchange and wallet platforms have made this mistake in the past causing significant loss of their customers’ funds and for the companies themselves to go bankrupt!

The solution is update the state like this:

func UpdateState(dbc *sql.DB, id int64, oldStatus, newStatus Status) error {
    r, err := dbc.Exec(“update withdrawals set status=? where id=? and status=?”, newStatus, id, oldStatus)
    if err != nil {
        return err
    }

    n, err := r.RowsAffected()
    if err != nil {
        return err
    }

    if n != 1 {
        return errors.New(“failed to update state”)
    }

    return nil
}

By starting the transaction with an update, MySQL acquires an exclusive lock on the specific row. This prevents any other transaction from concurrently updating the state. However, another transaction may have already run and updated the state earlier. Checking the number of rows affected tells us if we were the one who successfully made the transition. If we were successful, we know that we can now safely proceed to send the Bitcoin payment. If not, it means another server has taken ownership of that transition so we should give up and go try to process the another withdrawal.

Once we have this UpdateState function we can now rewrite the ProcessWithdrawal function like this:

func ProcessWithdrawal(dbc *sql.DB, id int64) error {
    if err := UpdateState(dbc, id, StatusPending, StatusProcessing); err != nil {
        return err
    }

    rawTx := signTransaction(id)
    bitcoin.Send(rawTx)

    return UpdateState(dbc, id, StatusProcessing, StatusComplete)
}

Failure handling

In order to build a robust system, you have to be prepared for a failure to happen at any time. Having the states of a process mapped out makes this easier to analyse. For each state, consider what failures could occur before reaching the next state. The way I usually think about this is by going through each line of the code and asking “assume someone pulled out the power cable after this line - then what?”.

In our withdrawals example, we are guaranteed that only one server will manage to move the withdrawal into PROCESSING state. However, it’s possible for that server to lose power before it finishes sending the transaction and updates the state to COMPLETE. If this happens, the withdrawal will get stuck in the PROCESSING state since no other server is able to deal with it.

Although getting into a “stuck” state isn’t ideal, when dealing with financial transactions, this failure mode is usually better than the alternative of possibly sending a payment twice.

We can’t tell from the state alone whether the failure occurred before or after the Bitcoin transaction was transmitted. If the server failed before, the remedy is to update the withdrawal back to PENDING so that the step can be repeated. If the server failed after the transaction was transmitted, the withdrawal should be moved to COMPLETE. External information is needed to decide how to resume the process.

Conclusion

This is a fairly simple example, but the principles remain useful for more complicated processes:

  • Map out your process as a finite state machine diagram
  • Identity which services are responsible for each state transition
  • Store the process in a table with a column to track the state
  • Use the UpdateState function to safely update the state
  • Analyse the possible failures that can happen on each state and handle them accordingly

In summary, finite-state machines help us to analyse and reason about distributed processes. It is easy to use a database like MySQL to keep track of the state, but you should beware of the common mistakes.

Want to join our team of talented engineers? See our available positions here.

Avatar Timothy Stranex
Author

Timothy Stranex

Timothy is co-founder and CTO at Luno. Before founding Luno, Timothy was a software engineer at Google in Switzerland. He holds an MSc in Theoretical Physics from the University of Zürich and a BSc in Physics, Mathematics and Computer Science from the University of Cape Town.

It’s never too late to get started with Bitcoin. Learn, buy and use Bitcoin with Luno now.

Desktop Icon Apple App Store Logo Google Play Store Logo