
nnnoiseless: porting audio code from C to rust

I ported a C library to rust last week, and it went pretty smoothly. This is the story, and here is the repo.

The library in question is RNNoise, a library for removing noise from audio. It works well, it runs fast, and best of all it has no knobs that you need to tune. There’s even a rust binding.

So why bother porting it? Well, I needed to patch it so that it would compile with MSVC, but my PR went unnoticed for a month. I thought about maintaining my own fork, but it’s been more than 10 years since I last wrote anything in C or C++. And that’s how I ended up porting RNNoise to rust. It probably wasn’t the most efficient use of my time, but I had fun and learned something.

There’s a lot of information out there about porting C to rust, but the most useful resource for me was the fantastic talk by Carol (Nichols || Goulding). It lays out a simple process for porting one function at a time: first, you set up cargo to compile your crate as a static library, and you set up the C build system to link that static library into the C build (see the slides for the relevant Makefile and Cargo.toml snippets). Then you can port one function at a time: the C code goes like this:

+extern void _celt_lpc(opus_val16 *_lpc, const opus_val32 *ac, int p);
+void __celt_lpc(opus_val16 *_lpc, const opus_val32 *ac, int p)
-void _celt_lpc(opus_val16 *_lpc, const opus_val32 *ac, int p)
{
    /* C body of _celt_lpc */
}

and the rust code goes like this:

+#[no_mangle]
+pub extern "C" fn _celt_lpc(lpc: *mut f32, ac: *const f32, p: c_int) {
+    unsafe {
+        let lpc_slice = std::slice::from_raw_parts_mut(lpc, p as usize);
+        let ac_slice = std::slice::from_raw_parts(ac, p as usize + 1);
+        rs_celt_lpc(lpc_slice, ac_slice);
+    }
+}
+
+fn rs_celt_lpc(lpc: &mut [f32], ac: &[f32]) {
+    // rust body of celt_lpc
+}

If you’ve watched the talk (which you should), you might notice that this is a tiny bit different from what they recommend: I’ve renamed the original C function instead of deleting it. I found that this helped me narrow down porting mistakes, because it made it easy to switch back and forth between the C and rust implementations.

Pain points

Most of the porting process was mechanical and easy. One of the less fun parts was porting code involving C structs. RNNoise has structs that (when ported to rust) look like this:

#[repr(C)]
struct RnnState {
    model: *const RnnModel,
    // Various buffers, whose sizes are determined by some subfields of `model`.
    vad_gru_state: *mut f32,
    noise_gru_state: *mut f32,
    denoise_gru_state: *mut f32,
}

An idiomatic rust version might look something like

struct RnnState {
    model: &'static RnnModel,
    vad_gru_state: Vec<f32>,
    noise_gru_state: Vec<f32>,
    denoise_gru_state: Vec<f32>,
}

but this isn’t layout-compatible with the original C version, and so I needed to stick with the original struct for as long as RnnState was being accessed by both C and rust code. This increased the amount of unsafe sprinkled around the rust code, and it was also the source of an annoying bug of the sort that I thought I had left behind by moving to rust.

A Heisenbug

At some point during the porting process, my tests started failing in release mode, but not in debug mode. It was most likely some undefined behavior triggered by my amateurish attempts at unsafe code, but I couldn’t quickly spot the problem, and the prospect of a more careful round of debugging didn’t spark a whole lot of joy. So I did something that I never would have dared to do in my C/C++ days: I ignored the problem and kept porting; after all, the tests were still working in debug mode. And sure enough, a few more ported functions later, rustc found the problem for me: in a function taking a &RnnState parameter, I was modifying data in the vad_gru_state buffer. Since I was using unsafe code, rustc didn’t complain at first. But once I ported the RnnState struct to safe and idiomatic rust, the compiler flagged the problem immediately.
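Here’s a contrived sketch of the kind of bug this was; the names are mine and not the actual nnnoiseless code. With the C-compatible struct, nothing stops you from mutating a buffer through a shared reference:

#[repr(C)]
struct RawRnnState {
    // Pointer into a buffer owned elsewhere.
    vad_gru_state: *mut f32,
}

fn compute_vad(state: &RawRnnState) {
    unsafe {
        // rustc accepts this, but whether it is sound depends entirely on
        // where the pointer came from and who else is reading it.
        *state.vad_gru_state = 0.0;
    }
}

struct SafeRnnState {
    vad_gru_state: Vec<f32>,
}

// With an owned Vec, the same mutation needs `&mut`, so any caller that only
// holds a shared reference becomes a compile error instead of a Heisenbug.
fn compute_vad_safe(state: &mut SafeRnnState) {
    state.vad_gru_state[0] = 0.0;
}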

Performance

After getting everything to 100% safe (if not particularly idiomatic) rust, it was time to check whether performance had suffered.

[Figure: initial benchmark]

Yes, apparently, by about 50%. The most obvious culprit was bounds checking: there was a lot of indexing in the C code, and some of it wasn’t trivial to convert to a more rust-friendly, iterator-based version. The first priority was the neural network evaluation:

let m = ...; // At most 114.
let n = ...; // At most 96.

for i in 0..n {
    output[i] = layer.bias[i] as f32;
    for j in 0..m {
        output[i] += layer.input_weights[j * n + i] as f32 * input[j];
    }
}

I can already see you shaking your head. I’m doing naive matrix-vector multiplication with a 100x100ish matrix in column-major format? Not only is this costing me bounds checks, it’s terrible for memory locality. Swapping the weights storage from column- to row-major order only made things about 1.5% faster, but more importantly it made the whole thing iterator-friendly. Converting to zips and sums bought another 15%, leaving me only about 25-30% slower than the C code.

for i in 0..n {
    output[i] =
        layer.bias[i] as f32 +
        layer.input_weights[(i * m)..((i + 1) * m)]
            .iter()
            .zip(input)
            .map(|(&x, &y)| x as f32 * y)
            .sum::<f32>();
}

For my next optimization opportunity, I moved on to the function that computes cross-correlations. The un-optimized version of this function looks like

fn pitch_xcorr(xs: &[f32], ys: &[f32], xcorr: &mut [f32]) {
    for i in 0..xcorr.len() {
        xcorr[i] = xs.iter().zip(&ys[i..]).map(|(&x, &y)| x * y).sum();
    }
}

but the C code contained a massive, manually-unrolled version. I’d skipped it while porting, but maybe I’d gain something from porting it over. Here’s an abbreviated version of the optimized function, assuming that all lengths are a multiple of 4 (the real code also handles the case that they aren’t).

for i in (0..xcorr.len()).step_by(4) {
    let mut c0 = 0.0;
    let mut c1 = 0.0;
    let mut c2 = 0.0;
    let mut c3 = 0.0;

    let mut y0 = ys[i + 0];
    let mut y1 = ys[i + 1];
    let mut y2 = ys[i + 2];
    let mut y3 = ys[i + 3];

    for (x, y) in xs.chunks_exact(4).zip(ys[(i + 4)..].chunks_exact(4)) {
        c0 += x[0] * y0;
        c1 += x[0] * y1;
        c2 += x[0] * y2;
        c3 += x[0] * y3;

        y0 = y[0];
        c0 += x[1] * y1;
        c1 += x[1] * y2;
        c2 += x[1] * y3;
        c3 += x[1] * y0;

        y1 = y[1];
        c0 += x[2] * y2;
        c1 += x[2] * y3;
        c2 += x[2] * y0;
        c3 += x[2] * y1;

        y2 = y[2];
        c0 += x[3] * y3;
        c1 += x[3] * y0;
        c2 += x[3] * y1;
        c3 += x[3] * y2;

        y3 = y[3];
    }
    // Store the four cross-correlations computed in this pass of the outer loop.
    xcorr[i] = c0;
    xcorr[i + 1] = c1;
    xcorr[i + 2] = c2;
    xcorr[i + 3] = c3;
}

Basically, both the inner and outer loops have been unrolled four times, and I’ve exploited the inner loop’s unrolling to optimize the memory access pattern. Thanks to the amazing cargo asm, I can happily report that there’s no bounds-checking in the inner loop and that all the arithmetic has been auto-vectorized to operate on four f32s at a time. (Maybe it would get even faster if I unrolled 8 times and compiled with AVX enabled; I haven’t tried that yet.)

This change more than doubled the speed of pitch_xcorr, and gained me about 10% overall. More importantly, it showed me how to coerce the compiler into auto-vectorizing something that it hadn’t auto-vectorized before. I went back to the neural network code and replaced things like

xs.iter().zip(ys).map(|(&x, &y)| x as f32 * y).sum()

with things like

{
    let mut sum0 = 0.0;
    let mut sum1 = 0.0;
    let mut sum2 = 0.0;
    let mut sum3 = 0.0;

    for (x, y) in xs.chunks_exact(4).zip(ys.chunks_exact(4)) {
        sum0 += x[0] * y[0];
        sum1 += x[1] * y[1];
        sum2 += x[2] * y[2];
        sum3 += x[3] * y[3];
    }
    sum0 + sum1 + sum2 + sum3
}

for another 20% improvement.

Current score: the rust version (still 100% safe) is about 15% faster, and there’s probably plenty more still on the table.

[Figure: final benchmark]

The performance lesson I learned from this is that bounds checking can be expensive in numerical code and iterator-style code can help a bit, but if you really want faster numerical code then you need to write in a style that the auto-vectorizer likes. (Or you could use the SIMD intrinsics directly, but that’s another story.)

A huge thank you to cargo

Like I wrote above, it’s been a while since I did any C/C++, and because of that I’ve started to take tools like cargo for granted. This little porting project brought back some memories, mostly because about half of the code in RNNoise was actually “vendored” from opus. I put “vendored” in quotes because I usually think of vendoring as involving a subdirectory (maybe even a git submodule if I’m lucky) with its own build artifacts. That’s not what’s going on here, though; I’m just talking about files that were copied from the source directory of one project to the source directory of another, complete with never-used functions and never-def’ed ifdefs. The thing is, though, that I understand exactly why they did it: it’s by far the easiest way to share code between C projects. So I just want to finish by saying a big “thank you” to cargo and crates.io for making me not have to deal with C dependency management any more.

Part 5: Pseudo-edges

This is the fifth (and final planned) post in a series on some new ideas in version control. To start at the beginning, go here.

The goal of this post is to describe pseudo-edges: what they are, how to compute them efficiently, and how to update them efficiently upon small changes. The important point to recall from the last post is that deleted nodes stay in the graph forever, so the graph keeps growing with the history, and naively traversing it to produce output takes time proportional to that whole history rather than to the current contents of the file.

The main idea for solving this is to add “pseudo-edges” to the graph: for every path that connects two live nodes through a sequence of deleted nodes, add a corresponding edge to the graph. Once this is done, we can render the current output without traversing the deleted parts of the graph, because every ordering constraint that used to depend on some deleted parts is now represented by some pseudo-edge. Here’s an example: the deleted nodes are in gray, and the pseudo-edge that they induce is the dashed arrow.

We haven’t really solved anything yet, though: once we have the pseudo-edges, we can efficiently render the output, but how do we compute the pseudo-edges? The naive algorithm (look at every pair of live nodes, and check if they’re connected by a path of deleted nodes) still depends on the number of deleted nodes. Clearly, what we need is some sort of incremental way to update the pseudo-edges.

Deferring pseudo-edges

The easiest way that we can reduce the amount of time required for computing pseudo-edges is simply to do it rarely. Specifically, remember that applying a patch can be very fast, and that pseudo-edges only need to be computed when outputting a file. So, obviously, we should only update the pseudo-edges when it’s time to actually output the file. This sounds trivial, but it can actually be significant. Imagine, for example, that you’re cloning a repository that has a long history; let’s say it has n patches, each of which has a constant size, and let’s assume that computing pseudo-edges takes time O(m), where m is the size of the history. Cloning a repository involves downloading all of those patches, and then applying them one-by-one. If we recompute the pseudo-edges after every patch application, the total amount of time required to clone the repository is O(n^2); if we apply all the patches first and only compute the pseudo-edges at the end, the total time is O(n).

You can see how ojo implements this deferred pseudo-edge computation here: first, it applies all of the patches; then it recomputes the pseudo-edges.
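As a toy illustration of the cloning argument (the types and method names below are made up for the example, not ojo’s actual API), the deferred version applies all n patches and then pays the expensive pass exactly once:

struct Graph { /* nodes, edges, tombstones, ... */ }
struct Patch { /* nodes and edges to add or delete ... */ }

impl Graph {
    fn apply(&mut self, _patch: &Patch) {
        // Cheap: proportional to the size of the patch.
    }
    fn recompute_pseudo_edges(&mut self) {
        // Expensive: proportional to the size of the history.
    }
}

fn clone_repo(graph: &mut Graph, patches: &[Patch]) {
    for patch in patches {
        graph.apply(patch);
    }
    // Deferred: one recomputation at the end instead of one per patch.
    graph.recompute_pseudo_edges();
}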

Connected deleted components

Deferring the pseudo-edge computation certainly helps, but we’d also like to speed up the computation itself. The main idea is to avoid unnecessary recomputation by only examining parts of the graph that might have actually changed. At this point, I need to admit that I don’t know whether what I’m about to propose is the best way of updating the pseudo-edges. In particular, its efficiency rests on a bunch of assumptions about what sort of graphs we’re likely to encounter. I haven’t made any attempt to test these assumptions on actual large repositories (although that’s something I’d like to try in the future).

The main assumption is that while there may be many deleted nodes, they tend to be collected into a large number of connected components, each of which tends to be small. What’s more, each patch (I’ll assume) tends to only affect a small number of these connected components. In other words, the plan is to group the deleted nodes into connected components, to keep track of which components each patch touches, and to recompute pseudo-edges only for the components that actually changed.

Before talking about algorithms, here are some pictures that should help unpack what it is that I actually mean. Here is a graph containing three connected components of deleted nodes (represented by the rounded rectangles):

When I delete node h, it gets added to one of the connected components, and I can update relevant pseudo-edges without looking at the other two connected components:

If I delete node d then it will cause all of the connected components to merge:

This isn’t hard to handle, it just means that we should run our pseudo-edge-checking algorithm on the merged component.

Maintaining the components

To maintain the partition of deleted nodes into connected components, we use a disjoint-set data structure. This is very fast (pretty close to constant time) when applying patches, because applying patches can only enlarge deleted components. It’s slower when reverting patches, because the disjoint-set algorithm doesn’t allow splitting: when reverting patches, connected components could split into smaller ones. Our approach is to defer the splitting: we just mark the original connected component as dirty. When it comes time to compute the pseudo-edges, we explore the original component, and figure out what the new connected pieces are.
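Here’s a minimal sketch of the kind of bookkeeping involved (this is an illustration, not the ojo_partition API): a plain union-find with a dirty flag on each representative.

/// A toy disjoint-set structure over nodes 0..n.
struct Partition {
    parent: Vec<usize>,
    dirty: Vec<bool>,
}

impl Partition {
    fn new(n: usize) -> Partition {
        Partition { parent: (0..n).collect(), dirty: vec![false; n] }
    }

    /// Finds the representative of x, with path compression.
    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }

    /// Merges two components; this is all that applying a patch needs.
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }

    /// Reverting a patch cannot split a component here, so we just mark the
    /// component dirty and sort out the real pieces later.
    fn mark_dirty(&mut self, x: usize) {
        let root = self.find(x);
        self.dirty[root] = true;
    }
}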

The disjoint-set data structure is implemented in the ojo_partition subcrate. It appears in the Graggle struct; note also the dirty_reps member: that’s for keeping track of which parts of the partition have been modified by a patch and require recomputing pseudo-edges.

We recompute the components here. Specifically, we consider the subgraph consisting only of nodes that belong to one of the dirty connected components. We run Tarjan’s algorithm on that subgraph to find out what the new connected components are. On each of those components, we recompute the pseudo-edges.

Recomputing the pseudo-edges

The algorithm for this is: after deleting the node, look at the deleted connected component that it belongs to, including the “boundary” consisting of live nodes:

Using depth-first search, check which of the live boundary nodes (in this case, just a and i) are connected by a path within that component (in this case, they are). If so, add a pseudo-edge. The complexity of this algorithm is O(nm), where n is the number of boundary nodes, and m is the total number of nodes in the component, including the boundary (because we need to run n DFSes, and each one takes O(m) time). The hope here is that m and n are small, even for large histories. For example, I hope that n is almost always 2; at least, this is the case if the final live graph is totally ordered.
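In code, the check looks roughly like this (a sketch with made-up types, not ojo’s implementation): from each live boundary node, run a DFS that is only allowed to continue through deleted nodes, and record every other boundary node it reaches.

use std::collections::{HashMap, HashSet};

/// Returns the pseudo-edges induced by one deleted component.
/// `edges` holds the forward edges among the component and its boundary.
fn pseudo_edges(
    edges: &HashMap<usize, Vec<usize>>,
    deleted: &HashSet<usize>,
    boundary: &[usize],
) -> Vec<(usize, usize)> {
    let mut result = Vec::new();
    for &start in boundary {
        let mut stack = edges.get(&start).cloned().unwrap_or_default();
        let mut seen = HashSet::new();
        while let Some(node) = stack.pop() {
            if !seen.insert(node) {
                continue;
            }
            if deleted.contains(&node) {
                // Deleted nodes are passed through...
                stack.extend(edges.get(&node).cloned().unwrap_or_default());
            } else if node != start {
                // ...and reaching another live node means we need a pseudo-edge.
                result.push((start, node));
            }
        }
    }
    result
}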

This algorithm is implemented here.

Unapplying, and pseudo-edge reasons

There’s one more wrinkle in the pseudo-edge computation, and it has to do with reverting patches: if applying a patch created a pseudo-edge, removing a patch might cause that pseudo-edge to get deleted. But we have to be very careful when doing so, because a pseudo-edge might have multiple reasons for existing. You can see why in this example from before:

The pseudo-edge from a to d is caused independently by both the b -> c component and the cy -> cl -> e component. If by unapplying some patch we destroy the b -> c component but leave the cy -> cl -> e component untouched, we have to be sure not to delete the pseudo-edge from a to d.

The solution to this is to track the “reasons” for pseudo-edges, where each “reason” is a deleted connected component. This is a many-to-many mapping between connected deleted components and pseudo-edges, and it’s stored in the pseudo_edge_reasons and reason_pseudo_edges members of the GraggleData struct. Once we store pseudo-edge reasons, it’s easy to figure out when a pseudo-edge needs deleting: whenever its last reason becomes obsolete.
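A sketch of that bookkeeping (the field names follow the post, but the types are invented for the example) might look like this: removing a reason only deletes a pseudo-edge once its set of reasons is empty.

use std::collections::{HashMap, HashSet};

type ComponentId = usize;
type Edge = (usize, usize);

struct PseudoEdgeIndex {
    // reason (a deleted component) -> pseudo-edges it justifies
    reason_pseudo_edges: HashMap<ComponentId, HashSet<Edge>>,
    // pseudo-edge -> every component that justifies it
    pseudo_edge_reasons: HashMap<Edge, HashSet<ComponentId>>,
}

impl PseudoEdgeIndex {
    /// Removes one reason and returns the pseudo-edges that lost their
    /// last reason (and can therefore be deleted from the graph).
    fn remove_reason(&mut self, reason: ComponentId) -> Vec<Edge> {
        let mut orphaned = Vec::new();
        for edge in self.reason_pseudo_edges.remove(&reason).unwrap_or_default() {
            let last_reason_gone = match self.pseudo_edge_reasons.get_mut(&edge) {
                Some(reasons) => {
                    reasons.remove(&reason);
                    reasons.is_empty()
                }
                None => false,
            };
            if last_reason_gone {
                self.pseudo_edge_reasons.remove(&edge);
                orphaned.push(edge);
            }
        }
        orphaned
    }
}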

Pseudo-edge spamming: an optimization

We’ve finished describing ojo’s algorithm for keeping pseudo-edges up to date, but there’s still room for improvement. Here, I’ll describe a potential optimization that I haven’t implemented yet. It’s based on a simple, but not-quite-correct, algorithm for adding pseudo-edges incrementally: every time you mark a node as deleted, add a pseudo-edge from each of its in-neighbors to each of its out-neighbors. I call this “pseudo-edge spamming” because it just eagerly throws in as many pseudo-edges as needed. In pictures, if we have this graph

and we delete the “deleted” line, then we’ll add a pseudo-edge from the in-neighbor of “deleted” (namely, “first”) to the out-neighbor of “deleted” (namely, “last”).
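The rule itself is only a few lines; here’s a sketch with a deliberately simplified graph representation (adjacency lists of node indices, not ojo’s actual data structures):

/// When a node is marked as deleted, eagerly connect each of its
/// in-neighbors to each of its out-neighbors with a pseudo-edge.
fn spam_pseudo_edges(
    in_neighbors: &[usize],
    out_neighbors: &[usize],
    pseudo_edges: &mut Vec<(usize, usize)>,
) {
    for &src in in_neighbors {
        for &dst in out_neighbors {
            pseudo_edges.push((src, dst));
        }
    }
}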

This algorithm has two problems. The first is that it isn’t complete: you might also need to add pseudo-edges when adding an edge where at least one end is deleted. Consider this example, where our graph consists of two disconnected parts.

If we add an edge from “deleted 1” to “deleted 2”, clearly we also need to add a pseudo-edge between each of the “first” nodes and each of the “last” nodes. In order to handle this case, we really do need to explore the deleted connected component (which could be slow).

The second problem with our pseudo-edge spamming algorithm is that it doesn’t handle reverting patches: it only describes how to add pseudo-edges, not delete them.

The nice thing about pseudo-edge spamming is that even if it isn’t completely correct, it can be used as a fast-path in the correct algorithm: when applying a patch, if it modifies the boundary of a deleted connected component that isn’t already dirty, use pseudo-edge spamming to update the pseudo-edges (and don’t mark the component as dirty). In every other case, fall back to the previous algorithm.