pa'i Benchmarks

Published on 03/25/2020, 1519 words, 6 minutes to read

In my last post I mentioned that pa'i was faster than Olin's cwa binary written in Go without giving any benchmarks. I've been working on new ways to gather and visualize these benchmarks, and here they are.

Benchmarking WebAssembly implementations is slightly hard. A lot of existing benchmark tools simply do not run in WebAssembly as is, not to mention inside the Olin ABI. However, I have created a few tasks that I feel represent common tasks that pa'i (and later wasmcloud) will run:

compressing data with Snappy
parsing JSON
parsing YAML
recursive Fibonacci number calculation
BLAKE2 hashing

As always, if you don't trust my numbers, you don't have to. Commands will be given to run these benchmarks on your own hardware. These may not be the most scientifically accurate benchmarks possible, but they should help to give a reasonable idea of the speed gains from using Rust instead of Go.

You can run these benchmarks in the docker image xena/pahi. You may need to replace ./result/ with / for running this inside Docker.

$ docker run --rm -it xena/pahi bash -l

Compressing Data with Snappy

This is implemented as cpustrain.wasm. Here is the source code used in the benchmark:

#![no_main]
#![feature(start)]

extern crate olin;

use olin::{entrypoint, Resource};
use std::io::Write;

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    let fout = Resource::open("null://").expect("opening /dev/null");
    let data = include_bytes!("/proc/cpuinfo");

    let mut writer = snap::write::FrameEncoder::new(fout);

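    for _ in 0..256 {
        // write_all keeps going until the whole buffer has been handed to
        // the encoder; a bare write is allowed to accept only part of it
        writer.write_all(data)?;
    }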

    Ok(())
}

This compresses my machine's copy of /proc/cpuinfo 256 times. This number was chosen arbitrarily.
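One thing worth knowing about loops like this: `std::io::Write::write` is allowed to accept fewer bytes than it was given, while `write_all` keeps calling it until the whole buffer is consumed. Here is a minimal sketch of the difference. `ChunkedSink` and `short_write_demo` are hypothetical names made up for this illustration; they are not part of Olin or the snap crate, and whether a given encoder ever does short writes depends on its implementation.

```rust
use std::io::{self, Write};

/// Hypothetical sink that accepts at most `chunk` bytes per `write` call,
/// standing in for any writer that performs short writes.
struct ChunkedSink {
    chunk: usize,
    received: Vec<u8>,
}

impl Write for ChunkedSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Take only the first `chunk` bytes and report how many were taken.
        let n = buf.len().min(self.chunk);
        self.received.extend_from_slice(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Returns how many bytes arrive after a single `write` call and after
/// `write_all`, using two fresh sinks with the same chunk size.
fn short_write_demo(data: &[u8], chunk: usize) -> io::Result<(usize, usize)> {
    let mut single = ChunkedSink { chunk, received: Vec::new() };
    let _accepted = single.write(data)?; // may be less than data.len()

    let mut all = ChunkedSink { chunk, received: Vec::new() };
    all.write_all(data)?; // loops over `write` until everything is taken

    Ok((single.received.len(), all.received.len()))
}

fn main() -> io::Result<()> {
    let (single, all) = short_write_demo(b"0123456789", 4)?;
    println!("write: {} bytes, write_all: {} bytes", single, all);
    Ok(())
}
```

With a 4-byte chunk and 10 bytes of input, the single `write` delivers 4 bytes and `write_all` delivers all 10.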

Here are the results I got from the following command:

$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/cpustrain.wasm' \
        './result/bin/cwa result/wasm/cpustrain.wasm' \
        './result/bin/pahi --no-cache result/wasm/cpustrain.wasm' \
        './result/bin/pahi result/wasm/cpustrain.wasm'

CPU	cwa	pahi --no-cache	pahi	multiplier
Ryzen 5 3600	2.392 seconds	38.6 milliseconds	17.7 milliseconds	pahi is 135 times faster than cwa
Intel Xeon E5-1650	7.652 seconds	99.3 milliseconds	53.7 milliseconds	pahi is 142 times faster than cwa
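
The multiplier column looks like the cwa time divided by the cached pahi time. That is my reading of the table rather than something hyperfine prints in this form, so treat this as a sketch. The `speedup` helper is a hypothetical name, and the numbers are the rounded values from the table above, so results can differ slightly from ratios computed on unrounded timings.

```rust
/// Ratio of two wall-clock times in the same unit. Values above 1.0 mean
/// the candidate is faster than the baseline.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // (CPU, cwa in ms, cached pahi in ms), copied from the table above.
    let rows = [
        ("Ryzen 5 3600", 2392.0, 17.7),
        ("Intel Xeon E5-1650", 7652.0, 53.7),
    ];

    for &(cpu, cwa_ms, pahi_ms) in rows.iter() {
        println!(
            "{}: pahi is about {:.0} times faster than cwa",
            cpu,
            speedup(cwa_ms, pahi_ms)
        );
    }
}
```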

Parsing JSON

This is implemented as bigjson.wasm. Here is the source code of the benchmark:


#![no_main]
#![feature(start)]

extern crate olin;

use olin::entrypoint;
use serde_json::{from_slice, to_string, Value};

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    let input = include_bytes!("./bigjson.json");

    if let Ok(val) = from_slice(input) {
        let v: Value = val;
        if let Err(_why) = to_string(&v) {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "oh no json encoding failed!",
            ));
        }
    } else {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "oh no json parsing failed!",
        ));
    }

    Ok(())
}

This decodes and encodes this rather large JSON file. This is a very large file (over 64k of JSON) and should represent over 65536 times the average JSON payload size.

Here are the results I got from the following command:

$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/bigjson.wasm' \
        './result/bin/cwa result/wasm/bigjson.wasm' \
        './result/bin/pahi --no-cache result/wasm/bigjson.wasm' \
        './result/bin/pahi result/wasm/bigjson.wasm'

CPU	cwa	pahi --no-cache	pahi	multiplier
Ryzen 5 3600	257 milliseconds	49.4 milliseconds	20.4 milliseconds	pahi is 12.62 times faster than cwa
Intel Xeon E5-1650	935.5 milliseconds	135.4 milliseconds	101.4 milliseconds	pahi is 9.22 times faster than cwa

Parsing YAML

This is implemented as k8sparse.wasm. Here is the source code of the benchmark:

#![no_main]
#![feature(start)]

extern crate olin;

use olin::entrypoint;
use serde_yaml::{from_slice, to_string, Value};

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    let input = include_bytes!("./k8sparse.yaml");

    if let Ok(val) = from_slice(input) {
        let v: Value = val;
        if let Err(_why) = to_string(&v) {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "oh no yaml encoding failed!",
            ));
        }
    } else {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "oh no yaml parsing failed!",
        ));
    }

    Ok(())
}

This decodes and encodes this Kubernetes manifest set from my cluster. This is a set of a few normal Kubernetes deployments and isn't as much of a worst-case scenario as the other tests.

Here are the results I got from running the following command:

$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/k8sparse.wasm' \
        './result/bin/cwa result/wasm/k8sparse.wasm' \
        './result/bin/pahi --no-cache result/wasm/k8sparse.wasm' \
        './result/bin/pahi result/wasm/k8sparse.wasm'

CPU	cwa	pahi --no-cache	pahi	multiplier
Ryzen 5 3600	211.7 milliseconds	125.3 milliseconds	8.5 milliseconds	pahi is 25.04 times faster than cwa
Intel Xeon E5-1650	674.1 milliseconds	342.7 milliseconds	30.8 milliseconds	pahi is 21.85 times faster than cwa
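
The gap between the `pahi --no-cache` and `pahi` columns is large in this test, and most of it is presumably compilation (that attribution is an assumption; cache loading and other startup work are in there too). As a rough sketch, this is how the average per-run time shrinks as the cache gets reused. `amortized_ms` is a hypothetical helper, and it assumes the first run costs the `--no-cache` time and every later run costs the cached time.

```rust
/// Average per-run time over `runs` runs, assuming the first run pays the
/// uncached cost and every later run pays the cached cost.
fn amortized_ms(no_cache_ms: f64, cached_ms: f64, runs: u32) -> f64 {
    assert!(runs > 0, "need at least one run");
    (no_cache_ms + cached_ms * f64::from(runs - 1)) / f64::from(runs)
}

fn main() {
    // Ryzen 5 3600 numbers for k8sparse.wasm from the table above.
    let (no_cache_ms, cached_ms) = (125.3, 8.5);

    for &runs in [1u32, 10, 100, 1000].iter() {
        println!(
            "{:>4} runs: {:.2} ms per run on average",
            runs,
            amortized_ms(no_cache_ms, cached_ms, runs)
        );
    }
}
```

The average approaches the cached time as the run count grows, which is the sense in which the compilation cost is a one-time expense.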

Recursive Fibonacci Number Calculation

This is implemented as fibber.wasm. Here is the source code used in the benchmark:

#![no_main]
#![feature(start)]

extern crate olin;

use olin::{entrypoint, log};

entrypoint!();

fn fib(n: u64) -> u64 {
    if n <= 1 {
        return 1;
    }
    fib(n - 1) + fib(n - 2)
}

fn main() -> Result<(), std::io::Error> {
    log::info("starting");
    fib(30);
    log::info("done");
    Ok(())
}

Calculating Fibonacci numbers recursively is incredibly expensive: the number of function calls grows exponentially with the input. This is the worst possible case for this kind of calculation, as it doesn't cache results from the fib function.
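
To put a number on how much work the uncached recursion does, here is a small sketch that counts the calls and compares against a simple loop. `fib_counted` and `fib_iter` are hypothetical names for this illustration and are not part of the benchmark; they keep the benchmark's convention that fib(0) and fib(1) are both 1.

```rust
/// Same recurrence as the benchmark, with a counter bumped on every call.
fn fib_counted(n: u64, calls: &mut u64) -> u64 {
    *calls += 1;
    if n <= 1 {
        return 1;
    }
    fib_counted(n - 1, calls) + fib_counted(n - 2, calls)
}

/// Iterative version that produces the same values with one addition per step.
fn fib_iter(n: u64) -> u64 {
    let (mut prev, mut cur) = (1u64, 1u64);
    for _ in 1..n {
        let next = prev + cur;
        prev = cur;
        cur = next;
    }
    cur
}

fn main() {
    let mut calls = 0u64;
    let recursive = fib_counted(30, &mut calls);
    println!("recursive fib(30) = {} after {} calls", recursive, calls);
    println!("iterative fib(30) = {}", fib_iter(30));
}
```

For n = 30 the recursive version makes about 2.7 million calls to arrive at the value the loop reaches in 29 additions.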

Here are the results I got from running the following command:

$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/fibber.wasm' \
        './result/bin/cwa result/wasm/fibber.wasm' \
        './result/bin/pahi --no-cache result/wasm/fibber.wasm' \
        './result/bin/pahi result/wasm/fibber.wasm'

CPU	cwa	pahi --no-cache	pahi	multiplier
Ryzen 5 3600	13.6 milliseconds	13.7 milliseconds	2.7 milliseconds	pahi is 5.13 times faster than cwa
Intel Xeon E5-1650	41.0 milliseconds	27.3 milliseconds	7.2 milliseconds	pahi is 5.70 times faster than cwa

BLAKE2 Hashing

This is implemented as blake2stress.wasm. Here's the source code for this benchmark:

#![no_main]
#![feature(start)]

extern crate olin;

use blake2::{Blake2b, Digest};
use olin::{entrypoint, log};

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    let json: &'static [u8] = include_bytes!("./bigjson.json");
    let yaml: &'static [u8] = include_bytes!("./k8sparse.yaml");
    for _ in 0..8 {
        let mut hasher = Blake2b::new();
        hasher.input(json);
        hasher.input(yaml);
        hasher.result();
    }

    Ok(())
}

This runs the blake2b hashing algorithm on the JSON and YAML files used earlier eight times. This is supposed to represent a few hundred thousand invocations of production code.

Here are the results I got from running the following command:

$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/blake2stress.wasm' \
        './result/bin/cwa result/wasm/blake2stress.wasm' \
        './result/bin/pahi --no-cache result/wasm/blake2stress.wasm' \
        './result/bin/pahi result/wasm/blake2stress.wasm'

CPU	cwa	pahi --no-cache	pahi	multiplier
Ryzen 5 3600	358.7 milliseconds	17.4 milliseconds	5.0 milliseconds	pahi is 71.76 times faster than cwa
Intel Xeon E5-1650	1.351 seconds	35.5 milliseconds	11.7 milliseconds	pahi is 115.04 times faster than cwa

Conclusions

From these tests, we can roughly conclude that pa'i is about 54 times faster than Olin's cwa tool. A lot of this speed gain is arguably the result of pa'i using an ahead-of-time compiler (namely cranelift as wrapped by wasmer). Compilation time also became a somewhat notable factor when comparing performance; however, the compilation cost only has to be eaten once.
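
The figure of 54 appears to match the arithmetic mean of the ten multipliers in the tables above; that is an inference from the numbers, not something stated outright. The sketch below recomputes it. It also prints the geometric mean, which is a common alternative when averaging ratios and is included only for comparison, not as the method used for the headline figure. The helper names are hypothetical.

```rust
/// The ten "pahi vs cwa" multipliers, copied from the result tables.
const MULTIPLIERS: [f64; 10] = [
    135.0, 142.0, // Snappy compression
    12.62, 9.22, // JSON
    25.04, 21.85, // YAML
    5.13, 5.70, // recursive Fibonacci
    71.76, 115.04, // blake2b
];

/// Plain average: sum divided by count.
fn arithmetic_mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Geometric mean, computed through logarithms to avoid a huge product.
fn geometric_mean(xs: &[f64]) -> f64 {
    (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
}

fn main() {
    println!("arithmetic mean: {:.1}", arithmetic_mean(&MULTIPLIERS));
    println!("geometric mean:  {:.1}", geometric_mean(&MULTIPLIERS));
}
```

The arithmetic mean is pulled up by the two compression results, so the geometric mean comes out noticeably lower. Either way, pa'i is well ahead on every test.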

Another conclusion I've made is very unsurprising. My old 2013 Mac Pro with an Intel Xeon E5-1650 is significantly slower in real-world computing tasks than the new Ryzen 5 3600. Both of these machines were using the same nix closure for running the binaries and they are running NixOS 20.03.

As always, if you have any feedback for what other kinds of benchmarks to run and how these benchmarks were collected, I welcome it. Please comment wherever this article is posted or contact me.

Here are the /proc/cpuinfo files for each machine being tested:

shachi (Ryzen 5 3600) /proc/cpuinfo
chrysalis (Intel Xeon E5-1650) /proc/cpuinfo

If you run these benchmarks on your own hardware and get different data, please let me know and I will be more than happy to add your results to these tables. I will need the CPU model name and the output of hyperfine for each of the above commands.


Tags: wasm, rust, golang, pahi