I've since noticed this many times. Optimizing is like meditation to me. It's very mechanical (measure), with a sprinkle of creative work (once you know what is slow, it's quite obvious how to make it faster, but just challenging enough to be engaging), and it has a very nice tight feedback loop: Something is slow. I make a change. Now it's fast. Next.
Optimizing is my happy place.
This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
Measuring/profiling just means observing the system you want to optimize in a systematic way. You certainly won't be very effective at optimizing anything if you don't observe it.
Theoretical calculation means you've formed a model of what's happening and you're devising a plan to optimize against that model. But of course, a useful model needs to represent the significant aspects of your system (and a good model should exclude most insignificant ones). Failing to observe your system means your model could be bad -- focused on insignificant aspects and missing significant ones -- and you'd never know.
One of my early-career successes was just creating a framework for generating every permutation of perf optimizations for every (log-scaled -- clz is very fast) input size and checking which was best, dropping the results into a lookup table of function pointers to branch on. The university had a large supply of heterogeneous computers, replete with all the normal problems like being able to double floating-point addition throughput on Haswell CPUs by abusing the fmadd instruction, so I made a framework (probably closer to a DSL) for encoding your algorithms in a way that you could analyze perf tradeoffs at compile time and tune your result for the given computer. It's kind of like what ATLAS does for some linear algebra tasks.
Such practices are almost never optimal, but they're pretty easy to implement, and the results are near-optimal for almost all inputs. In the tradeoff between human and computer performance, I think it's a nice option.
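A minimal C++ sketch of the shape that framework takes (not the commenter's actual code): one kernel per floor(log2 n) bucket, selected through a lookup table of function pointers. The kernels and bucket boundaries here are invented; the real table would be generated from benchmark results on the target machine.

    #include <bit>
    #include <cstddef>

    using Kernel = double (*)(const double*, std::size_t);

    // Two hypothetical implementations with different sweet spots.
    double sum_scalar(const double* p, std::size_t n) {        // cheap for tiny n
        double s = 0;
        for (std::size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }
    double sum_pairwise(const double* p, std::size_t n) {      // better for large n
        if (n <= 64) return sum_scalar(p, n);
        const std::size_t h = n / 2;
        return sum_pairwise(p, h) + sum_pairwise(p + h, n - h);
    }

    // One entry per floor(log2 n) bucket; in the real framework these choices
    // come from benchmarking every permutation on the given computer.
    constexpr Kernel kTable[] = {
        sum_scalar, sum_scalar, sum_scalar, sum_scalar,   // n in [1, 15]
        sum_scalar, sum_scalar,                           // n in [16, 63]
        sum_pairwise, sum_pairwise,                       // n in [64, 255]
    };
    constexpr std::size_t kBuckets = sizeof(kTable) / sizeof(kTable[0]);

    double sum(const double* p, std::size_t n) {
        // std::bit_width(n) - 1 == floor(log2 n); it compiles down to clz/lzcnt.
        std::size_t b = n ? std::size_t(std::bit_width(n)) - 1 : 0;
        if (b >= kBuckets) b = kBuckets - 1;              // clamp the tail buckets
        return kTable[b](p, n);
    }

The table lookup plus indirect call costs a few cycles, which is usually noise next to running the wrong kernel for a given size.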
> There is no way to provide both optimized assembly and equivalent C code and let the compiler use the former in the general case and the latter in special cases.
This is true, but can be seen as a failure of language and tooling. For example, Halide [1] pioneered (AFAIK) the concept of separating algorithm from implementation at the language level. This separation lets you express the algorithm once, and then "schedule" it by specifying parallelism, vectorization, etc. You can provide multiple schedules for one algorithm, which allows you to specialize / make different choices depending on varying factors.
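For a concrete flavor, here is roughly what that separation looks like in Halide's C++ embedding (a sketch from memory, not tested; the blur and the vector width are arbitrary). The algorithm block states what to compute; the schedule line says how, and could be swapped per target without touching the algorithm.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        ImageParam input(UInt(16), 2);
        Var x("x"), y("y");

        // Algorithm: what to compute, written once.
        Func blur("blur");
        blur(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3;

        // Schedule: how to compute it on this target. A different target could
        // get a different schedule (tiling, GPU offload, ...) for the same algorithm.
        blur.parallel(y).vectorize(x, 16);

        blur.compile_jit();  // or ahead-of-time compilation for deployment
        return 0;
    }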
It's a really interesting concept, though maybe limited in practice to DSLs. I'm not sure a general purpose language would be a good fit for this model, but then, for general purpose programs written in general purpose languages, perf optimization at the level TFA discusses is frequently limited to just specific hot sections. Those hot sections could be extracted out into specialized components written in such a DSL.
Strategic optimisation is often basically free if you have domain expertise. It's easy to know that the business wants outcome x and that algorithm y is the right choice, etc., when it's all internal thought processes. Whereas if you don't know enough, you're likely to make decisions that are very expensive to undo.
I'm not so sure. How many cycles would you expect this code to take?
mov dword [rsi], eax    ; store eax to [rsi]
add dword [rsi], 5      ; read-modify-write the same address
mov ebx, dword [rsi]    ; load it back into ebx
According to Agner Fog, these 3 instructions have a latency of 15 cycles on an AMD Zen 1 processor. On Zen 2, the latency is 2 cycles. This is because the CPU was given the ability to assign a register to `dword [rsi]`, overcoming the limit of 16 architectural registers. This optimization is subject to problems: pointer aliasing will obviously cause the CPU to make the wrong assumption at times, causing a situation not entirely unlike a branch mispredict.
There are constraints imposed by the micro-architecture for this feature. For you and me, a big one is that it only works with general-purpose registers. But is there a reason it couldn't or shouldn't be done for vectors? It seems like a micro-arch issue to me. Perhaps in a few years, or in a lot of years, we'll have a CPU that can do this optimization for vectors.
"Even Apple’s LLVM fork lacks scheduling annotations for Apple Silicon. How am I supposed to write efficient code when Apple doesn’t bother to tune their own compiler?"
In addition to its public fork of LLVM, Apple also keeps a very private fork where (one assumes) they keep all the really juicy stuff.
I agree that it is frustrating not to have the data, but don't confuse "I don't have the data" with "no one has the data".
That doesn't seem to be what this post is talking about. It seems to be talking about trying to improve the state of the art in well-worn areas. An example that illustrates it for me is DeepMind's AlphaTensor finding a better way to multiply matrices [0] in 2022. It wasn't a brute-force solution, but the scale of it makes it appear so.
> On 4x4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen’s algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5x5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4x4 algorithm, and separately tweaked Deepmind's 96-step 5x5 algorithm down to 95 steps in mod 2 arithmetic and to 97[24] in normal arithmetic.[25] Some algorithms were completely new: for example, (4, 5, 5) was improved to 76 steps from a baseline of 80 in both normal and mod 2 arithmetic.
To me this shows that direct profiling and observation wouldn't have led to the optimization. Improvements needed a sort-of, but not actually, brute-force effort of many people trying, while also being clever with their attempts.
[0] https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...
It's really just debugging and troubleshooting, but with a different goal in mind.
After narrowing down the smallest leaf, think about another branch and whether some of the optimizations you studied could lead to nice results there too.
With time they learn which branches are more promising and which optimizations work well for their problem beforehand.
I guess this could be called branch and bound with memoization instead of brute force, haha.
Note: we write code in CUDA
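A toy, non-CUDA rendering of that "branch and bound with memoization" framing, with everything hypothetical (the option tree, measure(), lower_bound()): each tree level picks one optimization option, complete configurations are timed at most once, and branches that can't beat the best time seen so far are abandoned.

    #include <algorithm>
    #include <functional>
    #include <map>
    #include <vector>

    using Config = std::vector<int>;  // the option chosen at each tree level

    double search(const std::vector<int>& options_per_level,
                  const std::function<double(const Config&)>& measure,      // hypothetical: times one build
                  const std::function<double(const Config&)>& lower_bound,  // hypothetical: optimistic estimate
                  Config& cfg, std::map<Config, double>& memo, double best) {
        if (cfg.size() == options_per_level.size()) {      // leaf: a complete configuration
            auto [it, fresh] = memo.try_emplace(cfg, 0.0);
            if (fresh) it->second = measure(cfg);          // memoization: time it only once
            return std::min(best, it->second);
        }
        if (lower_bound(cfg) >= best) return best;         // bound: prune this whole branch
        for (int opt = 0; opt < options_per_level[cfg.size()]; ++opt) {
            cfg.push_back(opt);                            // branch: try the next option
            best = search(options_per_level, measure, lower_bound, cfg, memo, best);
            cfg.pop_back();
        }
        return best;
    }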
For this we need programmable languages in which users can code their own optimization cases.
There is such a thing as "I would like a certain optimization, that should not be upstreamed into the compiler".
That's exactly what's in this article.
The construct:
HashSet::from([a, b]).contains(&c);
has some definite AST pattern (or can be made to have one with a sufficiently pinned-down language). We can think about adding a pattern match for that AST pattern to the compiler, along with a right-hand side to rewrite it into.
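In C++ terms (purely for illustration; the original example is Rust), the pattern and the right-hand side it could be rewritten into might look like:

    #include <unordered_set>

    bool contains_via_set(int a, int b, int c) {
        // the pattern: build a two-element set, then test membership
        return std::unordered_set<int>{a, b}.contains(c);   // allocates + hashes
    }

    bool contains_rewritten(int a, int b, int c) {
        // the rewrite target: what such a pattern-matching rule would emit instead
        return c == a || c == b;                             // two comparisons
    }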
Fundamentally disagree. First it is the building of a mental model of what happens, a kind of analysis stage, and then comparing it to a mental model of how it should or could work, or producing a more efficient algorithm/way of accomplishing the target task.
When people try to brute-force (let's try this or this) without having the model in mind, that is frequently a waste, and even when/if it produces some improvement, there is no understanding and no guarantee of what use cases the improvement will cover, whether it would regress some use cases, or whether it will still be there after we push new features/fixes/etc.
Not so much as collateral damage when different optimizations negatively interact though.
Even when further optimization is not possible[0], careful avoidance of distinct pessimizations can lead toward the same kind of beneficial outcomes.
[0] Or at the other extreme, not the least bit possible.
It’s a process of continuous uncovering, and unless you have visibility across the whole stack — from kernel to cluster — you’ll spend all your time slicing through surface layers with lots of tears being shed.
Fortunately, there are software automation solutions to this.
In my experience most webapps can fix a lot of low-hanging performance issues by shaping the API to match how it's used in the client. That removes so much mapping and combining of data all over the place.
For them, the systematic way to optimize goes: profile your code, and then apply domain knowledge of the product to optimize the hotspots with these common techniques:
Don't do repeated work. (If you have an expensive invariant calculation in a loop or function call, move it out, to a higher level of the program where it can be done once.)
Save your work. (Caching and memoization; see the sketch below.)
Do less work. (Alter the product requirements to use less computationally-intensive techniques in cases where users won't notice the difference.)
Do work when the user isn't looking. (Move computations to background tasks, apply concurrency, perform async requests.)
If all else fails, call in a performance engineer like the author to micro-optimize your building blocks.
You can often get speedups of 3-6 orders of magnitude by applying these techniques, simply because the original code is so brain-dead. Performance engineers like the author tend to work on code that has already been tightly optimized, and so there is less low-hanging fruit to pick.
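A minimal sketch of the "save your work" bullet above, with a hypothetical expensive_report() standing in for whatever hot path profiling actually surfaced (single-threaded; a real cache also needs invalidation and locking):

    #include <map>
    #include <string>

    std::string expensive_report(int customer_id) {
        // stand-in for the expensive computation found by profiling
        return "report for customer " + std::to_string(customer_id);
    }

    std::string cached_report(int customer_id) {
        static std::map<int, std::string> cache;           // lives for the process lifetime
        auto it = cache.find(customer_id);
        if (it != cache.end()) return it->second;          // hit: skip the expensive work
        auto result = expensive_report(customer_id);       // miss: do it once...
        cache.emplace(customer_id, result);                // ...and remember it
        return result;
    }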
In C++, you can achieve performance using the often-denigrated standard template library, if you would only pay attention to the documented performance requirements for the given function calls. But even there it's not the panacea that it could be, because it often provides an amortized cost while handwaving away the access pattern (for example: std::unordered_map is great in algorithmic cost theory and terrible in memory access patterns).
What's the algorithmic cost (big-O) of this function call? What's the memory footprint of that function call? Worse: is that documented big-O estimated across a contiguous dataset or discontiguous paged datasets, and does it have to dereference pointers? What's the network cost of a given call? It's hard to know if you don't know that a `write()` to a socket could incur 40 or 400 bytes across a network, and don't know whether that network is 1 Mbps, 10 Mbps, 1 Gbps, or localhost, etc.; and with how much latency.
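To make the std::unordered_map point concrete, a hedged sketch of the two access patterns: both look fine on paper, but the node-based map pays a dependent cache miss per lookup while a sorted contiguous vector stays in cache lines (actual numbers depend entirely on your data and hardware; measure).

    #include <algorithm>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // O(1) average, but each element is a separately allocated node:
    // a lookup is a hash plus at least one pointer chase into cold memory.
    int lookup_hashed(const std::unordered_map<int, int>& m, int key) {
        auto it = m.find(key);
        return it == m.end() ? -1 : it->second;
    }

    // O(log n) in a contiguous sorted vector: more comparisons on paper,
    // but the probed elements share cache lines and prefetch well.
    int lookup_sorted(const std::vector<std::pair<int, int>>& v, int key) {
        auto it = std::lower_bound(v.begin(), v.end(), key,
                                   [](const auto& p, int k) { return p.first < k; });
        return (it != v.end() && it->first == key) ? it->second : -1;
    }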
For example, when I hand-rolled some x86_64 SIMD instructions to analyze some DNA, I found the Intel Intrinsics Guide [0] 1000% helpful, because many of the instructions detail what sort of performance to expect on specific processor architectures (or at least in general). If you read the Intel Instruction Set Reference [1], a lot of similar performance information can be found. The performance I achieved was approximately at the theoretical bottleneck between CPU and main RAM; getting better performance would have required a complete algorithm change, which would have been worthy of a Ph.D. paper.
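Not the parent's code, but an illustrative AVX2 sketch of the kind of thing those guides document well: counting how many bases in a buffer equal a given nucleotide, 32 bytes per iteration (requires a CPU with AVX2 and e.g. -mavx2).

    #include <immintrin.h>
    #include <bit>
    #include <cstddef>
    #include <cstdint>

    std::size_t count_base(const char* seq, std::size_t n, char base) {
        std::size_t count = 0;
        const __m256i needle = _mm256_set1_epi8(base);           // broadcast the base
        std::size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i chunk = _mm256_loadu_si256(                   // unaligned 32-byte load
                reinterpret_cast<const __m256i*>(seq + i));
            __m256i eq = _mm256_cmpeq_epi8(chunk, needle);        // 0xFF where bytes match
            std::uint32_t mask = static_cast<std::uint32_t>(
                _mm256_movemask_epi8(eq));                        // one bit per byte
            count += std::popcount(mask);                         // count the matches
        }
        for (; i < n; ++i)                                         // scalar tail
            count += (seq[i] == base);
        return count;
    }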
Of course, sometimes even low level hardware can have performance-killing bugs.
[0]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
[1]: https://www.intel.com/content/www/us/en/developer/articles/t...
The fact that sometimes optimization work is tricky or requires some pre-thinking, or is even *gasp* counter-intuitive, is such a hilarious way to say "this is hard". That's just the table-stakes starting point for so-so-so much work.
Edit: Heck, even deciding whether to prioritize optimization work or feature work is usually a harder problem than the actual optimization work itself.
Seeing as how I fall under 'any developer', I'm going to bite the bullet and ask:
How are they supposed to be equivalent? One is a HashSet, the other is a boolean? And what is `c`?
My father had a PhD in Operations Research/Industrial Engineering.