Defensive programming: the Byzantine generals problem

May 24, 2023

I love the movie Interstellar, which is where I first learned about Murphy's law:

Anything that can go wrong will go wrong.

In software development, many things can go wrong. A developer must be familiar with the various components of software and the communication paths that connect them.

Creators of mature tools invest heavily in the reliability of their tools and put suitable protocols in place to handle failure. In contrast, many application developers aren't aware of what could go wrong and only learn about it after something actually does. By then it's too late.

Take a look at this code:

$order = Order::create($request->input('order'));

OrderProduct::insert(
    Arr::map(
        $request->input('products'),
        fn ($product) => [...$product, 'order_id' => $order->id]
    )
);

SendOrderToVendor::dispatch($order->id);

This code inserts an order in the database, inserts products attached to this order and dispatches a job to the queue to be processed in the background.

An order is meaningless without its products, and products are meaningless if they aren't attached to an order. A seasoned developer will see right away that these two queries must run as part of a single database transaction.

$order = DB::transaction(function () use ($request) {
    $order = Order::create($request->input('order'));

    OrderProduct::insert(
        Arr::map(
            $request->input('products'),
            fn ($product) => [...$product, 'order_id' => $order->id]
        )
    );

    // Return the order so it is available outside the closure.
    return $order;
});

SendOrderToVendor::dispatch($order->id);

With that, either both queries are persisted or neither is. If even one of the queries fails, the entire transaction is rolled back.
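The all-or-nothing behaviour can be seen outside the framework too. The following is a minimal sketch in plain PHP, assuming the pdo_sqlite extension and an in-memory database; the table layout and the placeOrder() function are made up for illustration and are not part of the application above.

```php
<?php
// In-memory SQLite database (assumes the pdo_sqlite extension is installed).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical tables, just enough to show the rollback.
$pdo->exec('CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)');
$pdo->exec('CREATE TABLE order_products (order_id INTEGER NOT NULL, sku TEXT NOT NULL)');

function placeOrder(PDO $pdo, string $customer, array $skus): int
{
    $pdo->beginTransaction();

    try {
        $pdo->prepare('INSERT INTO orders (customer) VALUES (?)')->execute([$customer]);
        $orderId = (int) $pdo->lastInsertId();

        $insert = $pdo->prepare('INSERT INTO order_products (order_id, sku) VALUES (?, ?)');
        foreach ($skus as $sku) {
            // A null SKU violates the NOT NULL constraint and throws.
            $insert->execute([$orderId, $sku]);
        }

        $pdo->commit();

        return $orderId;
    } catch (Throwable $e) {
        // Undo the order row as well, so no order exists without its products.
        $pdo->rollBack();

        throw $e;
    }
}

try {
    placeOrder($pdo, 'alice', ['sku-1', null]);
} catch (PDOException $e) {
    // The second product failed, so the whole order was rolled back.
}

$orderCount = (int) $pdo->query('SELECT COUNT(*) FROM orders')->fetchColumn();
$productCount = (int) $pdo->query('SELECT COUNT(*) FROM order_products')->fetchColumn();
```

Both counts end up at zero: the order row and the first product were written inside the transaction, but the rollback discarded them when the second product failed.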

What would happen if the queue service was down and the SendOrderToVendor job never reached the queue? Two things:

  1. The vendor won't ever be aware that an order was placed.
  2. The customer will believe the order was not submitted after receiving an error message. They'll probably keep submitting it again and again.

The effect of a single failure is catastrophic. Customers will be patiently waiting for orders that vendors have never received, and some of them will have submitted multiple copies of the same order.

Immediate vs. eventual consistency

This is an example of eventual consistency. The order is saved in the database and is eventually dispatched to the queue. Let us try something different:

$order = DB::transaction(function () use ($request) {
    $order = Order::create($request->input('order'));

    OrderProduct::insert(/** ... **/);

    SendOrderToVendor::dispatch($order->id);
});

The entire transaction is now rolled back if dispatching the job fails. The user encounters an error and attempts to re-submit the order. There will be no duplicate orders in this case because the order was never persisted when the error occurred.

But there's a catch: what if the job is dispatched and picked up by a worker before the transaction commits? A queue worker can start processing a job faster than the database commits the transaction. In that case, the job will fail because it will be unable to locate the order in the database.

We now have a failed job and a vendor who is unaware that an order has been placed. We're back where we started.

The issue here is ensuring that an order is only persisted when the job is dispatched, and that the job is only dispatched after the order is persisted. This is one version of "the Byzantine generals problem".

The web general sent an order message to the queue general. The queue general will be unaware of the order if the message fails to deliver. If the message was delivered and the queue general acted too quickly, they will not find any orders to deliver and will assume the order was cancelled.

One solution to this problem is to instruct the queue general to wait for a predetermined amount of time before acting on a message. Just enough time for the web general to confirm the order.

$order = DB::transaction(function () use ($request) {
    $order = Order::create($request->input('order'));

    OrderProduct::insert(/** ... **/);

    SendOrderToVendor::dispatch($order->id)
        ->delay(5); // Delay processing for 5 seconds.
});

This will not work if the mission's success depends on acting on the message as soon as possible. For example, while processing is delayed, the vendor may fulfill another order, from a different system, containing the same items. In that case, the vendor will be unable to complete our order by the time it arrives. We need to dispatch the order as soon as possible to increase the likelihood of its fulfillment.

Another approach is to request that the web general wait for confirmation from the queue general before responding to the user.

$order = DB::transaction(function () use ($request) {
    $order = Order::create($request->input('order'));

    OrderProduct::insert(/** ... **/);

    return $order;
});

retry(
    times: 3,
    callback: function () use ($order) {
        // Dispatch as a statement so a failed push throws inside retry().
        SendOrderToVendor::dispatch($order->id);
    },
    sleepMilliseconds: 1000
);

We try dispatching the job three times and sleep one second between attempts. This works well for a temporary failure on the queue service's end. However, if the failure lasts longer than about two seconds, the problem will reappear.

Of course, we could retry more, but that would mean the request would be slow in responding to users. If we have too many requests coming in and they are all busy retrying communication with the queue, every worker in our web server pool will be tied up and no new requests will be accepted. That is not what we want.
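To make the retry budget concrete, here is a small plain-PHP sketch. retryTimes() and flakyQueue() are hypothetical stand-ins written for this example (they are not Laravel's retry() helper or a real queue client), and the sleep is shortened to 10 milliseconds so the example runs quickly.

```php
<?php
// Hypothetical retry helper: call $callback up to $times times,
// sleeping $sleepMs milliseconds between attempts.
function retryTimes(int $times, callable $callback, int $sleepMs = 0)
{
    $attempt = 0;

    while (true) {
        $attempt++;

        try {
            return $callback($attempt);
        } catch (Throwable $e) {
            if ($attempt >= $times) {
                throw $e; // Budget exhausted: the caller sees the failure.
            }

            usleep($sleepMs * 1000);
        }
    }
}

// Simulated queue that rejects the first $failures push attempts.
function flakyQueue(int $failures): callable
{
    $calls = 0;

    return function () use (&$calls, $failures) {
        if (++$calls <= $failures) {
            throw new RuntimeException('Queue unavailable');
        }

        return "pushed on attempt {$calls}";
    };
}

// The outage ends within the retry budget.
$short = retryTimes(3, flakyQueue(2), 10);

// The outage outlasts the retry budget.
try {
    $long = retryTimes(3, flakyQueue(5), 10);
} catch (RuntimeException $e) {
    $long = 'gave up: ' . $e->getMessage();
}
```

The first call succeeds on its third attempt. The second exhausts all three attempts and the exception reaches the caller, which is exactly the case where the user sees an error even though the order is already saved.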

Introducing more generals

In our example, there are actually four generals:

  1. The web general: receives requests.
  2. The database general: stores orders.
  3. The queue general: stores jobs.
  4. The worker general: processes jobs.

The web general coordinates with the database and queue generals, and the queue general coordinates with the worker general.

The issue with the web general is that it must respond to the user quickly. It cannot wait until all parties have confirmed receipt of their messages. We need a general who is willing to wait as long as it takes to coordinate with the other generals. This is the scheduler general.

This general operates in the background, away from user land, and can take its time to ensure that everything runs smoothly. We will rely on it to scan our orders and ensure that they are all dispatched to the queue. Here's how it's done:

$order = DB::transaction(function () use ($request) {
    $order = Order::create($request->input('order'));

    OrderProduct::insert(/** ... **/);

    return $order;
});

try {
    SendOrderToVendor::dispatch($order->id);
} catch (Throwable $e) {
    // Don't fail the request: the scheduler will re-dispatch this order.
    report($e);
}

We will wait for the database general to confirm that the order was persisted; this is critical because the database general must receive all of the vital information for the mission to succeed. If our database general is unreliable, we have bigger problems to solve.

When the database successfully persists the order, we will attempt to dispatch the job to the queue, catch any errors, and respond immediately.

If the queue general confirms they received the message, all is fine. If not, we still respond to the user and let the web general handle more requests.

In the background, we will have our scheduler general periodically scan all unfulfilled orders and re-communicate them with the queue general.

Order::where('fulfilled', 0)->each(
    fn ($order) => SendOrderToVendor::dispatch($order->id)
);

We can configure the scheduler to run this task every minute; it queries the database for all unfulfilled orders and adds them to the queue.

I've simplified the code for brevity, but you can add additional checks to only scan applicable orders, for example by excluding orders cancelled by customers.
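As a rough illustration of such checks, the sketch below filters a hypothetical in-memory list of orders in plain PHP. The cancelled and created_at fields, the ordersToRedispatch() function, and the 60-second grace period are all assumptions made for this example; the grace period is there so the sweep doesn't re-dispatch an order whose first job was pushed only moments ago.

```php
<?php
// Fixed "current time" so the example is deterministic.
$now = 1700000000;

// Hypothetical orders; in the application these would be database rows.
$orders = [
    ['id' => 1, 'fulfilled' => true,  'cancelled' => false, 'created_at' => $now - 600],
    ['id' => 2, 'fulfilled' => false, 'cancelled' => false, 'created_at' => $now - 600],
    ['id' => 3, 'fulfilled' => false, 'cancelled' => true,  'created_at' => $now - 600],
    ['id' => 4, 'fulfilled' => false, 'cancelled' => false, 'created_at' => $now - 5],
];

// Return the IDs of orders the scheduler should send to the queue again.
function ordersToRedispatch(array $orders, int $now, int $graceSeconds = 60): array
{
    $due = array_filter(
        $orders,
        fn (array $order) => ! $order['fulfilled']
            && ! $order['cancelled']
            && $order['created_at'] <= $now - $graceSeconds
    );

    return array_column($due, 'id');
}

// Stand-in for the queue: collect the jobs that would be dispatched.
$queue = [];
foreach (ordersToRedispatch($orders, $now) as $orderId) {
    $queue[] = ['job' => 'SendOrderToVendor', 'order_id' => $orderId];
}
```

Only order 2 is queued again: order 1 is already fulfilled, order 3 was cancelled, and order 4 is still inside the grace period.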

Idempotency

The scheduler general helped us by coordinating between the database and queue generals in the background. However, we must prepare for the possibility that the queue general receives multiple messages for the same order. This could occur if the worker general was too busy to handle the job before the scheduler general noticed the order was incomplete and dispatched another job.

We must ensure that our vendor will only process a single order, regardless of how many times we send the same order. This is known as idempotency.

Here's how Stripe describes it:

This API supports idempotency for safely retrying requests without accidentally performing the same operation twice.

And Shopify:

Shopify APIs support idempotency, which allows you to safely retry API requests that might have failed due to connection issues, without causing duplication or conflicts.

Many vendor APIs support idempotency. This lets us safely dispatch the same job for the same order multiple times without the vendor mistaking repeated deliveries of one order for several separate orders.

Check your vendor's API documentation and make sure to implement the idempotency requirements. It is typically very simple.
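A common way to meet those requirements is to send an idempotency key with every request. The sketch below is plain PHP with a made-up FakeVendorApi class standing in for the vendor; real APIs differ in how the key is passed, so treat the names here as assumptions. The important part is that the key is derived from the order itself, so every job for the same order sends the same key.

```php
<?php
// Hypothetical vendor API that remembers the idempotency keys it has seen.
final class FakeVendorApi
{
    /** @var array<string, string> Idempotency key => vendor reference. */
    private array $seen = [];

    public int $ordersCreated = 0;

    public function createOrder(string $idempotencyKey, array $payload): string
    {
        if (isset($this->seen[$idempotencyKey])) {
            // Replay of a known request: return the original result.
            return $this->seen[$idempotencyKey];
        }

        $this->ordersCreated++;

        return $this->seen[$idempotencyKey] = 'vendor-ref-' . $this->ordersCreated;
    }
}

// Derive the key from the order ID, never from a random value per attempt.
function idempotencyKeyFor(int $orderId): string
{
    return 'order-' . $orderId;
}

$vendor = new FakeVendorApi();

// Two jobs for the same order, e.g. the original and a scheduler re-dispatch.
$first = $vendor->createOrder(idempotencyKeyFor(42), ['sku' => 'sku-1']);
$second = $vendor->createOrder(idempotencyKeyFor(42), ['sku' => 'sku-1']);
```

Both calls return the same vendor reference and the vendor records a single order, no matter how many times the job runs.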

More can go wrong

We'll probably spend ages building anything if we keep Murphy's law in mind while designing every piece of software. Everything has the potential to go wrong. The internet is made up of millions of computers linked together by cables in the ocean and satellites in space. It takes an insane amount of effort to plan for every possible failure.

You must identify and accept the failures you can tolerate from time to time. Budget, reliability targets, and development speed all influence where you draw that line.

However, it is critical that you understand the layers of abstraction upon which your code is built. Frameworks and third-party software libraries can introduce layers of abstraction that hide weak links. If you are not aware of these, they will surprise you and you will be unprepared.

Open-source software is a gift from others that we cannot afford to turn down. However, we must assess its maturity and understand how it works in order to identify any hidden weak links. The same is true for other abstractions, such as cloud services and SaaS APIs.


I'm Mohamed Said. I work with companies and teams all over the world to build and scale web applications in the cloud. Find me on twitter @themsaid.

