"Clean" Code, Horrible Performance in Rust
I was just recently going through some old episodes of Software Engineering Radio when I came across an episode featuring Casey Muratori, where he goes through some of his thoughts around his video from February 2023, titled "‘Clean’ Code, Horrible Performance". I was already aware of the video by this time, but listening to the episode gave me an itch to see these concepts in my own reality and experiment with them myself.
I chose to rewrite the ideas Casey presented in the video in Rust. Being idiomatic is not really the goal of the solutions you’ll see below; rather, the goal is to experiment with how structuring the code may lead to a big performance penalty.
All the code I reference in this post is available in this Gist.
You can also check out this awesome video by @lavafroth, which goes over this article and further explains some areas I might have missed: Two Decades of Hardware Optimizations Down The Drain
Base Case: Traits
Following all the “Clean” Code principles, we would likely implement something that looks like the code below:
```rust
trait Shape {
    fn area(&self) -> f32;
}

struct Square {
    side: f32,
}

impl Shape for Square {
    fn area(&self) -> f32 {
        self.side * self.side
    }
}

// ... the same for Rectangle, Triangle and Circle
```
Running the accumulator test Casey shows in the video yields a runtime of `54,956 ns/iter` for 10,240 items (more detailed information in the annex down below). We’ll use this as our baseline going forward.
```rust
#[bench]
fn total_area(b: &mut Bencher) {
    let shapes: Vec<Box<dyn Shape>> = init(COUNT);

    b.iter(|| {
        let mut accum = 0f32;
        for shape in &shapes {
            accum += shape.area();
        }

        accum
    });
}
```
If you’re more familiar with Rust, you’ll immediately catch the use of `Box` there as a possible location for optimization. For our constraints, we do need to box the values when constructing them, as we need to have a “polymorphic list”, and because we’re using a trait as our abstraction for `Shape`, the values of our shapes vec are of unknown size to the compiler.
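The `init` helper isn’t shown in the post; a minimal hypothetical version (the name and the shape mix are my own, not necessarily the gist’s) that builds such a “polymorphic list” might look like this:

```rust
trait Shape {
    fn area(&self) -> f32;
}

struct Square {
    side: f32,
}

struct Circle {
    radius: f32,
}

impl Shape for Square {
    fn area(&self) -> f32 {
        self.side * self.side
    }
}

impl Shape for Circle {
    fn area(&self) -> f32 {
        std::f32::consts::PI * self.radius * self.radius
    }
}

// `dyn Shape` is unsized, so each element must live behind a pointer;
// Box gives each shape its own heap allocation plus a vtable pointer.
fn init(count: usize) -> Vec<Box<dyn Shape>> {
    (0..count)
        .map(|i| -> Box<dyn Shape> {
            if i % 2 == 0 {
                Box::new(Square { side: 1.0 })
            } else {
                Box::new(Circle { radius: 1.0 })
            }
        })
        .collect()
}

fn main() {
    let shapes = init(4);
    assert_eq!(shapes.len(), 4);
    println!("first area: {}", shapes[0].area());
}
```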
Our first performance gain comes without touching much of the implementation code, but rather the code on the “caller” side (the benchmark function): using four accumulators instead of one.
```rust
#[bench]
fn total_area_sum4(b: &mut Bencher) {
    let shapes: Vec<Box<dyn Shape>> = init(COUNT);

    b.iter(|| {
        let mut accum1 = 0f32;
        let mut accum2 = 0f32;
        let mut accum3 = 0f32;
        let mut accum4 = 0f32;

        let count = COUNT / 4;
        let mut iter = 0;
        while iter < count {
            accum1 += shapes[iter * 4].area();
            accum2 += shapes[1 + (iter * 4)].area();
            accum3 += shapes[2 + (iter * 4)].area();
            accum4 += shapes[3 + (iter * 4)].area();

            iter += 1;
        }

        accum1 + accum2 + accum3 + accum4
    });
}
```
This improves on the baseline by 3.1x, executing at a runtime of `17,725 ns/iter`. Why? Better utilization of CPU cycles: modern CPUs can execute independent computations such as the ones above in parallel, but the compiler may not take advantage of this if the independence isn’t easily recognizable.
Of course, the code above is not really something we can easily copy-paste into other solutions, because there are a lot of assumptions we need to hold, the biggest one being: is `COUNT` divisible by 4? But that is beside the point, as we’re mostly bit twiddling here.
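For reference, the same multi-accumulator idea can be written without the divisibility assumption by iterating over exact chunks and folding the leftover shapes in at the end. This is a sketch of the pattern, not the code that was benchmarked:

```rust
// Sketch: the four-accumulator pattern without assuming the length is
// divisible by 4. `Shape`/`Square` stand in for the trait from the post.
trait Shape {
    fn area(&self) -> f32;
}

struct Square {
    side: f32,
}

impl Shape for Square {
    fn area(&self) -> f32 {
        self.side * self.side
    }
}

fn total_area_sum4(shapes: &[Box<dyn Shape>]) -> f32 {
    let mut accum = [0f32; 4];
    let mut chunks = shapes.chunks_exact(4);
    for chunk in &mut chunks {
        // Four independent accumulators let the CPU overlap the additions.
        for (a, s) in accum.iter_mut().zip(chunk) {
            *a += s.area();
        }
    }
    // Fold in the 0..=3 leftover shapes that didn't fill a whole chunk.
    let tail: f32 = chunks.remainder().iter().map(|s| s.area()).sum();
    accum.iter().sum::<f32>() + tail
}

fn main() {
    let shapes: Vec<Box<dyn Shape>> = (1..=5)
        .map(|i| Box::new(Square { side: i as f32 }) as Box<dyn Shape>)
        .collect();
    // 1 + 4 + 9 + 16 + 25 = 55
    assert_eq!(total_area_sum4(&shapes), 55.0);
    println!("total area: {}", total_area_sum4(&shapes));
}
```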
Still Idiomatic: Match
Going forward, Casey then breaks the “Polymorphism Rule” of “Clean” Code and writes the same implementation using a switch statement. In idiomatic Rust, we can equate that to a match on an enum.
```rust
use std::f32::consts::PI;

enum Shape {
    Square { side: f32 },
    Rectangle { width: f32, height: f32 },
    Triangle { base: f32, height: f32 },
    Circle { radius: f32 },
}

impl Shape {
    fn area(&self) -> f32 {
        match self {
            Shape::Square { side } => side * side,
            Shape::Rectangle { width, height } => width * height,
            Shape::Triangle { base, height } => 0.5f32 * base * height,
            Shape::Circle { radius } => PI * radius * radius,
        }
    }
}
```
The code here is still very readable, follows some very good conventions, and is something that you’d see every day in Rust. There is a drawback to this implementation if you’re writing a library: the dependents of said library would not be able to extend the `Shape` system and add a `Hexagon`, for example.
With the code structured this way, we see a whopping 3.2x improvement over the baseline when using the regular `for` loop, and a massive 8.5x improvement when using the `_sum4` variant (`17,141 ns/iter` and `6,434 ns/iter`, respectively).
This is the biggest takeaway from this experiment for me, because this is still very idiomatic code, and knowing about this performance difference is beneficial when tackling performance-critical code. This is where I can see the most value in the “Performance-aware Programming” that Casey advocates for.
My assumption is that the performance gains here come mostly from the removal of the `Box` construct. Having the shapes represented as enum variants instead of trait implementors makes their size known at compile time and removes the need for the `Box` to represent a “polymorphic list” (the quotes here are very applicable, because this is not really a polymorphic list in the original sense of the word).
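This assumption can be poked at directly with `std::mem::size_of`: the enum has a fixed inline size, while the boxed trait object is a fat pointer (data pointer plus vtable pointer) into a separate heap allocation. A small sketch (the enum’s exact byte count depends on the compiler and target):

```rust
use std::mem::size_of;

// Reduced two-variant version of the enum from the post.
enum Shape {
    Square { side: f32 },
    Rectangle { width: f32, height: f32 },
}

trait ShapeTrait {
    fn area(&self) -> f32;
}

fn main() {
    // Every enum value occupies the same known number of bytes inline,
    // so a Vec<Shape> is one contiguous allocation with no indirection.
    println!("enum: {} bytes", size_of::<Shape>());
    // A boxed trait object is a fat pointer (data + vtable), so every
    // area() call is an indirect load followed by an indirect call.
    println!(
        "Box<dyn ShapeTrait>: {} bytes",
        size_of::<Box<dyn ShapeTrait>>()
    );
}
```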
Breaking with Idiomatic Rust: Lookup Table
Moving forward with the optimizations, we then come across using a lookup table for the multiplier constants. If you look closely at the `match` code above, you’ll notice a pattern. Casey mentions this pattern in his video and also claims that moving the code out of the polymorphic variant and into the if-else/switch style (or, in our case, the `match` style) highlights this pattern.
Trying to still maintain some semblance of idiomatic Rust, I went with a different approach for the lookup table, using a constructor function for `Shape`s and storing the multiplier within the struct itself. This leaves us with:
```rust
enum ShapeType {
    Square { side: f32 },
    Rectangle { width: f32, height: f32 },
    Triangle { base: f32, height: f32 },
    Circle { radius: f32 },
}

struct Shape {
    multiplier: f32,
    width: f32,
    height: f32,
}

impl Shape {
    fn new(shape_type: ShapeType) -> Self {
        match shape_type {
            ShapeType::Square { side } => Shape {
                multiplier: 1f32,
                width: side,
                height: side,
            },
            // ... the same for Rectangle, Triangle and Circle
        }
    }

    fn area(&self) -> f32 {
        self.multiplier * self.width * self.height
    }
}
```
Styling the code this way allows us to create shapes like so:

```rust
let shape = Shape::new(ShapeType::Square { side: 2f32 });
```
This is not so different from what we saw in the more idiomatic examples using `trait`s and `match`.
This code is 6.4x faster than the baseline using the simple benchmark and 11x faster using the `_sum4` variant. In this example, the sizes are also known at compile time, and there’s probably some other optimization of the instructions sent to the CPU that I’m not aware of.
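For comparison, a lookup table closer to Casey’s original keeps the multipliers in a `const` array indexed by the shape’s discriminant. This is my own illustration of that shape of solution, not the `m03_table` module from the gist:

```rust
use std::f32::consts::PI;

// The discriminant doubles as an index into the multiplier table.
#[derive(Clone, Copy)]
enum ShapeKind {
    Square = 0,
    Rectangle = 1,
    Triangle = 2,
    Circle = 3,
}

// One multiplier per shape kind: 1, 1, 0.5, PI.
const MULTIPLIER: [f32; 4] = [1.0, 1.0, 0.5, PI];

struct Shape {
    kind: ShapeKind,
    width: f32,
    height: f32, // for Square/Circle, width == height (side or radius)
}

fn area(shape: &Shape) -> f32 {
    // A single data-driven formula replaces the per-shape branches.
    MULTIPLIER[shape.kind as usize] * shape.width * shape.height
}

fn main() {
    let tri = Shape {
        kind: ShapeKind::Triangle,
        width: 4.0,
        height: 3.0,
    };
    // 0.5 * 4 * 3 = 6
    assert_eq!(area(&tri), 6.0);
    println!("triangle area: {}", area(&tri));
}
```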
Breaking the “small functions” principle: corner_area
To break the “small functions that only do a single thing” mantra from “Clean” Code, Casey introduces a `CornerCount()` to the `Shape` struct. The resulting value should be `1 / (1 + shape.CornerCount() + shape.Area())`. Implementation-wise they are very similar to what we’ve seen so far, and the performance gains we saw before were also observed when calculating with the corner count.
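A sketch of what the `match`-based variant might look like with corner counts, using the formula as stated above (a reduced two-variant version, not the benchmarked code):

```rust
// Reduced sketch: corner counts added to the match-based enum,
// accumulating 1 / (1 + corner_count + area) per shape.
enum Shape {
    Square { side: f32 },
    Triangle { base: f32, height: f32 },
}

impl Shape {
    fn area(&self) -> f32 {
        match self {
            Shape::Square { side } => side * side,
            Shape::Triangle { base, height } => 0.5 * base * height,
        }
    }

    fn corner_count(&self) -> f32 {
        match self {
            Shape::Square { .. } => 4.0,
            Shape::Triangle { .. } => 3.0,
        }
    }
}

fn corner_area(shapes: &[Shape]) -> f32 {
    shapes
        .iter()
        .map(|s| 1.0 / (1.0 + s.corner_count() + s.area()))
        .sum()
}

fn main() {
    let shapes = [Shape::Square { side: 1.0 }];
    // 1 / (1 + 4 + 1) = 1/6
    assert!((corner_area(&shapes) - 1.0 / 6.0).abs() < 1e-6);
    println!("corner area: {}", corner_area(&shapes));
}
```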
The base case yielded a runtime of `54,302 ns/iter`. Changing the trait implementation into a `match` statement, we see the performance jump 4.4x this time (interestingly, we don’t see that much difference in the `_sum4` variants; see the annex for more information). The lookup table implementation took the performance to new levels, with a boost of 6.4x with the naive `for` loop and, once again, 11x with the `_sum4` variant.
Conclusion
Even though most of what we’ve seen here is hardcore bit twiddling, being a “Performance-aware Programmer” is not a bad thing. Most of the people reading this (and me writing!) will most likely be developing software that does not necessarily need to make these trade-offs for performance, but that’s not the point, at least not for me.
The point is: we should be aware of them!
We should always be aware when we’re doing trade-offs and we should strive to learn new things that challenge our status quo.
I’ll probably not directly use many of the constructs I went through in this post in my day-to-day job, but it sure as hell was fun researching and quickly doing them this Sunday evening!
Annex 1: Benchmark Results
For all the results we’ve seen thus far in this post, I used `Vec::with_capacity(10240)` and the `cargo bench` command; below you can see the execution times:
```
$ cargo bench -- total_area
   Compiling clean_code v0.1.0 (/home/chrs/Workspace/clean_code)
    Finished `bench` profile [optimized] target(s) in 0.58s
     Running unittests src/lib.rs (target/release/deps/clean_code-3affde07f20d1563)

running 8 tests
test m01_trait::tests::total_area                 ... bench:      54,956 ns/iter (+/- 280)
test m01_trait::tests::total_area_sum4            ... bench:      17,725 ns/iter (+/- 1,373)
test m02_match::tests::total_area                 ... bench:      17,141 ns/iter (+/- 304)
test m02_match::tests::total_area_sum4            ... bench:       6,434 ns/iter (+/- 5,001)
test m03_table::tests::total_area                 ... bench:       8,560 ns/iter (+/- 169)
test m03_table::tests::total_area_sum4            ... bench:       5,734 ns/iter (+/- 1,005)
test m04_table_multiplier::tests::total_area      ... bench:       8,555 ns/iter (+/- 29)
test m04_table_multiplier::tests::total_area_sum4 ... bench:       5,002 ns/iter (+/- 82)

$ cargo bench -- corner_area
    Finished `bench` profile [optimized] target(s) in 0.01s
     Running unittests src/lib.rs (target/release/deps/clean_code-3affde07f20d1563)

running 8 tests
test m01_trait::tests::corner_area                 ... bench:      54,302 ns/iter (+/- 233)
test m01_trait::tests::corner_area_sum4            ... bench:      35,922 ns/iter (+/- 563)
test m02_match::tests::corner_area                 ... bench:      12,160 ns/iter (+/- 2,019)
test m02_match::tests::corner_area_sum4            ... bench:      12,128 ns/iter (+/- 41)
test m03_table::tests::corner_area                 ... bench:       8,554 ns/iter (+/- 23)
test m03_table::tests::corner_area_sum4            ... bench:       5,736 ns/iter (+/- 15)
test m04_table_multiplier::tests::corner_area      ... bench:       8,547 ns/iter (+/- 10)
test m04_table_multiplier::tests::corner_area_sum4 ... bench:       4,995 ns/iter (+/- 6)

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 8 filtered out; finished in 5.24s
```
You’ll see an `m03_table` that has been omitted from the blog post just due to it being very similar to `m04_table_multiplier`; the `m03` module was my first naive translation of the lookup table optimization (I’ve left it in the gist if you’re curious).