NHacker Next
login
▲Blazing Matrix Productspanadestein.github.io
48 points by Bogdanp 13 hours ago | 6 comments
Loading comments...
imurray 7 hours ago [-]
This post is for those interested high-performance matrix multiplication in BQN (an APL-like array language).

The main thing I got out of it was the footnotes, in particular: https://en.algorithmica.org/hpc/algorithms/matmul/ is a really nice post on fast matrix multiplication, and is a chapter of what looks like a nice online book.

bee_rider 6 hours ago [-]
Does BQN just natively vectorize the code? I was surprised not to see anything about that.
dzaima 16 minutes ago [-]
More generally than the other replies, BQN is an array language, and as such "a+b" & "a×b" etc automatically map over arrays, and the interpreter can thus trivially vectorize them (and, if you're curious, yes, in a naive interpreter (which CBQN currently is) this does mean intermediate arrays go through memory, which is probably a significant part of the difference in speed).
mlochbaum 5 hours ago [-]
The relevant operations for matrix multiply are leading-axis extension, shown near the end of [0], and Insert +˝ shown in [1]. Both for floats; the leading-axis operation is × but it's the same speed as + with floating-point SIMD. We don't handle these all that well, with needless copying in × and a lot of per-row overhead in +˝, but of course it's way better than scalar evaluation.

[0] https://mlochbaum.github.io/bencharray/pages/arith.html

[1] https://mlochbaum.github.io/bencharray/pages/fold.html

mlochbaum 5 hours ago [-]
And the reason +˝ is fairly fast for long rows, despite that page claiming no optimizations, is that ˝ is defined to split its argument into cells, e.g. rows of a matrix, and apply + with those as arguments. So + is able to apply its ordinary vectorization, while it can't in some other situations where it's applied element-wise. This still doesn't make great use of cache and I do have some special code working for floats that does much better with a tiling pattern, but I wanted to improve +˝ for integers along with it and haven't finished those (widening on overflow is complicated).
icen 6 hours ago [-]
Yes; the CBQN interpreter has a number of specialised vectorised codepaths. It picks a good one for the arrays at runtime.