> We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess, yielding the current warp and let other warps work).
> To leverage this, we develop a similar script to modify the FFMA instructions in the compiled binary. Besides simply modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.
I would say it is really mind-blowing.
blackeyeblitzar 62 days ago [-]
From what I read elsewhere, this is the typical kind of performance optimization for matrix math you would see when performance is critical. It just hasn't been applied to this specific problem by other AI players yet, since it wasn't a necessity for other companies. But eventually everyone would probably end up here regardless.
mitthrowaway2 62 days ago [-]
How many people does it take to implement this? A 10% gain in performance could pay for a lot of people's salaries when your company is spending hundreds of millions on GPU clusters.
fulafel 62 days ago [-]
If you count how many people looked at this and failed to find the optimization during the community's earlier performance efforts, you could argue for quite a big number.
rfoo 62 days ago [-]
Uh, three? I worked at $CORP where we had a three-person sub-team; they reverse engineered most of Volta's SASS instruction encoding and built a working SASS assembler (before the open-source one, of course), with the ultimate goal of making GEMM / Conv faster. And they did it. Though it wasn't applied to anything high-profile enough, so we never heard about it :>
If you don't believe me: previous open-source SASS assemblers were mostly from universities, and they surely didn't have that many people.
bjourne 62 days ago [-]
Did $CORP also release the implementation to make it trivial for others to replicate their work?
rfoo 62 days ago [-]
I think we did release some of the optimized kernels, but I don't think we released any with the SASS black magic, at least not before I left. Already sanctioned by BIS; better not annoy NVIDIA further.
DannyBee 62 days ago [-]
Actually, a number of them did. Even Google did.
saagarjha 62 days ago [-]
I mean it’s not a significant change so one? But that isn’t to say anyone could do it.
rvz 62 days ago [-]
Just a reminder: this is the third of many open-source releases DeepSeek is willing to make, and finding optimizations like this when needed is a fairly low bar for them.
I guess, since the majority here are blown away by the very low-level code involved, that tells me they're likely not ready to use it, or have been stuck on very high-level tools that abstract this away.
randomNumber7 62 days ago [-]
I'll tell you a secret: most devs get something wrong when they start rolling their own linear algebra library. That's why people use LAPACK, BLAS, etc.
KeplerBoy 62 days ago [-]
The thing is, most people don't use LAPACK or BLAS. Most people are at higher levels of abstraction than torch.matmul.
rowanG077 61 days ago [-]
Just a few highly skilled people.
Bimos 62 days ago [-]
I think most AI players rely on high-performance GEMM. But most would be satisfied with CUTLASS or cuBLAS, and the rest implement GEMM themselves, though not necessarily using undocumented features?
creato 62 days ago [-]
Using undocumented features is not rare. People reverse engineered Apple's undocumented AMX instructions on their CPU, and I know people use undocumented/private extensions for several different kinds of GPUs.
Zacharias030 62 days ago [-]
I've only seen it done by hedge funds so far. What were you referring to?
fracon 62 days ago [-]
[flagged]
shaklee3 62 days ago [-]
Scott Gray figured out this exact thing and more back in 2015 for Maxwell, and it's been written about many times since by other people.
ETH_start 62 days ago [-]
[flagged]
tough 62 days ago [-]
I think he might mean hyperbolically figuratively so
dang 62 days ago [-]
Literally literally means not literally.
I love it when words turn into their opposites!
Bimos 62 days ago [-]
I edited it.
kneegerman 62 days ago [-]
orthogonally
fulafel 62 days ago [-]
This kind of stuff is an interesting demonstration of how far compilers are from extracting the full performance of the hardware from high-level code.
What would it take for traditional compiler tech or AI assisted optimization agents to come up with something like it?
WithinReason 62 days ago [-]
a lot of trial and error in a reinforcement learning feedback loop
shihab 62 days ago [-]
The speedup figures they report are compared to their own CUTLASS-based baseline. Has anyone done a performance comparison against cuBLAS?
All CUTLASS results I have seen so far for GEMM are within ~10% of cuBLAS. If the 2x-2.5x speedup they report holds up, that would be extremely impressive.
ashvardanian 61 days ago [-]
I generally avoid FP8 and prefer I8, but your question got me wondering how well cuBLAS performs.
First of all, cuBLAS needs the cuBLASLt extension API for mixed-precision workloads to handle FP8. Second, some adequate type combinations, like E5M2 x E5M2 for A x B, are not supported, while others, like E5M2 x E4M3, are! Moreover, matrix A must always come in a transposed layout for Ampere, Hopper, and Blackwell... and the list of constraints goes on.
I've integrated FP8 cuBLASLt benchmarks into my "Less Slow C++" repository <https://github.com/ashvardanian/less_slow.cpp>, adding to the list of existing cuBLAS and hand-rolled CUDA and PTX benchmarks. I'm running them on H200 GPUs, which should have the same performance as H100. For square inputs, the throughput peaks around 1.35 Peta-ops.
That's around 67% of the advertised number for dense GEMM <https://resources.nvidia.com/en-us-data-center-overview-mc/e...>.
I heard that it is possible to achieve better performance than cuBLAS using CUTLASS? I thought they chose the better one among cuBLAS and CUTLASS as baseline.
WiSaGaN 62 days ago [-]
I think this kind of open-sourcing really shows their objective of driving efficiency across the industry. The reason: this kind of software mostly benefits the big players serving the model (competitors to DeepSeek themselves, if they are interested in being a provider), rather than the general open-source community that wants to learn and tinker or serve models on consumer hardware.
fspeech 62 days ago [-]
Efficiency could lead to cheaper hardware for everyone, themselves included.
jmward01 62 days ago [-]
I'm not sure the push toward lower and lower precision is a good idea long term. It suggests that models are really sparse, and that may be true right now, but I think that's likely just because we have some bad ideas about how to train them, not because they really should be that sparse.
rfoo 62 days ago [-]
Well, let's enjoy the free "sparsity" while it lasts. Being able to train a really good model only at higher precision is a research problem. Low-precision training and inference is an engineering one.
We've been doing this since CNN days (9 years ago if not more), and I believe we have a good few years left.
sudosysgen 62 days ago [-]
The activation function throws away much of the dynamic range of floating-point numbers; it's relatively clear that having a lot of range where the activation is already saturated is unlikely to be useful.
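A quick way to see the point (a toy sketch, not specific to FP8): after a ReLU, every representable negative value collapses to zero, so the negative half of the format's range carries no information downstream.

    import torch

    x = torch.linspace(-4, 4, steps=9)   # symmetric range around zero
    print(torch.relu(x))                 # tensor([0., 0., 0., 0., 0., 1., 2., 3., 4.])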
jmward01 62 days ago [-]
That depends on the activation function. I personally think LayerNorm is destroying density (and I have some solid evidence for this), but it's in use because there's a lot of missing supporting structure to really help pump data into the weights, so it helps as long as we are using simple linear combinations.
sudosysgen 62 days ago [-]
Basically every activation function throws away half of the dynamic range at every neuron, which across a deep network is a lot.
You make a good point about LayerNorm, it's probably even worse.
nbonaparte 62 days ago [-]
This might be rendered moot by native microscaling support in Blackwell (MXFP). They've manually done a coarser-grained version of that for Hopper, but with full FP32 scaling factors.
rfoo 62 days ago [-]
Yes.
These are very good, high-profile public demonstrations of where $NVDA's moat is: GPGPU is very flexible, and you can program it to do a lot of stuff that makes perfect sense but wasn't in the minds of the hardware vendors.
Now, if you predict that the future will eventually converge on more and more dedicated hardware support, to the point that there are no more software optimizations like these, then the so-called "CUDA moat" breaks.
To stay in this game, NVIDIA is breaking down their own moat :p
imtringued 62 days ago [-]
Nvidia's moat lies in not repelling its casual userbase.
rfoo 62 days ago [-]
I agree. Gamers are cursing Nvidia right now though, and sadly university labs doing serious research on gaming cards are also a thing of the past :(
alecco 62 days ago [-]
Wow, MIT license. I hope some big players embrace this open source cooperative approach.
niemandhier 62 days ago [-]
I keep wondering why there even are undocumented instructions.
Wouldn’t it make sense to provide these to the user? Even if they might not be perfectly reliable.
This stuff must be documented internally, why not just release it?
Security by obscurity does not work: your competitors reverse engineer everything you do anyway.
Tanjreeve 62 days ago [-]
Probably the same reason anything you work on has undocumented stuff: a combination of lack of time and/or not wanting to imply support for unstable/experimental features. If you're only screwing over the team at the next desk or whatever, it's a lot easier to change things.
rfoo 62 days ago [-]
> This stuff must be documented internally
Probably not. They are likely only documented in architectural design docs / specs, etc., which you surely do not want to share.
62 days ago [-]
dr_kretyn 62 days ago [-]
Honestly, this is beyond my usage and understanding. But I really appreciate sharing findings and improvements like this so that everyone can benefit from them. It's refreshing.
greenavocado 62 days ago [-]
FFMA (Fused Floating-point Multiply-Add) is a fundamental GPU instruction that performs D = A*B + C in a single operation. This instruction is critical for matrix multiplication and deep learning workloads.
In NVIDIA's SASS (Streaming Assembly), FFMA instructions are encoded as 64-bit or 128-bit instructions with various control bits that determine their exact behavior.
When the yield bit is set, it tells the warp scheduler that the current warp can yield execution after this instruction. The hardware can then schedule a different warp to execute, potentially hiding latency.
GPUs achieve high throughput through massive parallelism. When one warp stalls (e.g., waiting for memory), others can proceed. The yield bit creates explicit opportunities for the scheduler to switch warps.
The reuse bit indicates whether a source register's value should be held in the operand reuse cache so subsequent instructions can read it without going back to the register file. When the yield bit is set, the reuse bit must be cleared: if a warp yields, it might not be the next one to issue, another warp's instructions will occupy the reuse cache slots, and the hardware cannot guarantee the cached operand values survive across yields.
By setting the yield bit in an alternating pattern across FFMA instructions, the newer compiler (and DeepSeek's patch script, which mimics it) creates explicit scheduling points where other warps can make progress. When modifying the yield bit, they also had to clear the reuse bit on the affected instructions to maintain correctness. This specifically helps overlap two types of operations: MMA (Matrix Multiply-Accumulate) instructions, the heavy tensor-core operations at the core of the matrix multiplication, and promotion FFMA instructions, the operations that convert between precision formats (likely promoting FP8 partial results to higher precision for accumulation).
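To make the mechanics concrete, here is a rough sketch (in Python) of what that kind of post-compile patch could look like. Every constant below is a placeholder: the yield-bit position, the reuse-flag mask, and the FFMA opcode test are undocumented and architecture-specific, and DeepSeek's actual script uses values they reverse engineered.

    import struct

    # PLACEHOLDER encodings -- not real values. The true bit positions and
    # opcode patterns are undocumented and differ per architecture.
    YIELD_BIT = 1 << 45                           # hypothetical yield-flag position
    REUSE_MASK = 0b1111 << 58                     # hypothetical operand-reuse field
    FFMA_OPCODE_MASK, FFMA_OPCODE = 0xFFF, 0x223  # hypothetical opcode check

    def is_ffma(lo_word: int) -> bool:
        return (lo_word & FFMA_OPCODE_MASK) == FFMA_OPCODE

    def interleave_yield(code: bytes) -> bytes:
        """Set the yield bit (and clear reuse) on every other FFMA instruction."""
        words = list(struct.unpack(f"<{len(code) // 8}Q", code))
        ffma_seen = 0
        for i in range(0, len(words) - 1, 2):  # 128-bit instructions: two 64-bit halves
            lo, hi = words[i], words[i + 1]
            if not is_ffma(lo):
                continue
            if ffma_seen % 2 == 0:             # alternate: every other FFMA yields
                hi |= YIELD_BIT                # let the scheduler switch to another warp
                hi &= ~REUSE_MASK              # a yielded warp cannot reuse cached operands
                words[i + 1] = hi
            ffma_seen += 1
        return struct.pack(f"<{len(words)}Q", *words)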
FP8 (8-bit floating point) GEMM operations have specific characteristics that make this optimization particularly effective. FP8 calculations typically require conversion to higher precision for accumulation and back, creating additional FFMA operations. FP8 reduces memory bandwidth requirements but creates complex computation patterns with promotion/demotion operations. The mention of "fine-grained scaling" suggests these are operations where precision is carefully managed at multiple points in the calculation.
The yield bit manipulation creates a more optimal interleaving of compute operations and format conversions, allowing the GPU to utilize its execution units more efficiently. Without this optimization, the warp scheduler might not find natural opportunities to switch between warps, leading to underutilization of compute resources.
jarbus 62 days ago [-]
This is crazy insightful, thanks! I’d really love to learn how to get to this level of understanding, but can’t seem to figure out what curriculum I’d follow where I’d end up with this level of technical competence.
randomNumber7 62 days ago [-]
You need to understand how the GPU architecture works on an abstract level. Try to understand the SIMT (Single Instruction, Multiple Threads) principle.
Doing some shader programming or writing a cuda kernel could be a nice exercise.
In a nutshell, if you want to add two vectors with a hundred elements, instead of looping from 0 to 99 you would call a function called a "kernel" (or "shader" in graphics programming) 100 times and pass it a different index each time.
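A minimal sketch of that model (assuming a CUDA-capable GPU and the numba package): each thread handles exactly one index instead of a loop handling all of them.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(a, b, out):
        i = cuda.grid(1)      # this thread's global index
        if i < out.size:      # guard threads past the end of the array
            out[i] = a[i] + b[i]

    n = 100
    a = np.arange(n, dtype=np.float32)
    b = np.arange(n, dtype=np.float32)
    out = np.zeros(n, dtype=np.float32)

    threads_per_block = 32    # one warp per block, just for illustration
    blocks = (n + threads_per_block - 1) // threads_per_block
    add_kernel[blocks, threads_per_block](a, b, out)  # 4 blocks x 32 threads cover all 100 elements
    print(out[:5])            # [0. 2. 4. 6. 8.]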
Then research how it is realized in hardware with "warps" (or "wavefronts" on AMD, I think). How the cache works is also very important here; sadly, the information on the internet is relatively sparse.
apples_oranges 62 days ago [-]
Perhaps I know as much as you, but to begin, I would dive into CUDA and running code on GPUs.
fotta 62 days ago [-]
They should call the warp that is yielded to the weft.
dekhn 62 days ago [-]
No, that doesn't make sense; both the yielder and yieldee are warps, the PC is the weft (approximately).
CamperBob2 61 days ago [-]
Or the woof, an amusing older term.
rramadass 62 days ago [-]
Very Nice!
Can you recommend some good resources/books on GPU/TPU/ML accelerator architecture/ISAs where I can read about the above details? Also on computer arithmetic, where one can study how FP8 etc. works?
zackangelo 62 days ago [-]
My go-to is Programming Massively Parallel Processors by Wen-Mei Hwu; an excellent, really approachable introduction. [0]
[0] https://a.co/d/9fmbZqg
Nice. I had looked at the older editions of this book but don't recall that it covered GPU ISAs (I may be wrong here since I have not really put in the time to study GPUs)?
An Amazon search brought up the following two interesting books; perhaps somebody who has browsed/read them can chime in:
1) Advanced GPU Assembly Programming: A Technical Reference for NVIDIA and AMD Architectures by Gareth Thomas.
2) Numerical Computations with GPUs edited by Volodymyr Kindratenko.
loginx 62 days ago [-]
I'm gonna ask my chatgpt to write like you ;)
I went from understanding none of it, to everything making sense. Thanks!
Zacharias030 62 days ago [-]
HN at its best! What do you do for a living, sir?
m3kw9 62 days ago [-]
The $20 question: what can I do with this?
devit 62 days ago [-]
Multiply FP8 matrices with FP32 scaling factors giving a bfloat16 matrix result on an nVidia Hopper or newer GPU.
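In other words, the kernel fuses something like the following PyTorch reference (a conceptual sketch only: the 128-wide blocking and the scale-tensor shapes here are simplifying assumptions, and the real kernel never materializes the dequantized FP32 matrices).

    import torch  # needs a recent PyTorch with float8 dtypes

    def ref_scaled_fp8_gemm(a_fp8, a_scale, b_fp8, b_scale, block=128):
        # Promote FP8 -> FP32, apply the per-block FP32 scale factors (this is
        # the "promotion FFMA" work discussed above), accumulate in FP32,
        # then cast the result to BF16.
        a = a_fp8.to(torch.float32)
        b = b_fp8.to(torch.float32)
        m, k = a.shape
        n = b.shape[1]
        out = torch.zeros(m, n, dtype=torch.float32)
        for k0 in range(0, k, block):
            k1 = min(k0 + block, k)
            blk = k0 // block
            out += (a[:, k0:k1] * a_scale[:, blk:blk + 1]) @ \
                   (b[k0:k1, :] * b_scale[blk:blk + 1, :])
        return out.to(torch.bfloat16)

    # Toy usage: quantize random FP32 data to FP8 (e4m3) with unit scales.
    m, k, n = 64, 256, 32
    a = torch.randn(m, k).to(torch.float8_e4m3fn)
    b = torch.randn(k, n).to(torch.float8_e4m3fn)
    a_scale = torch.ones(m, k // 128)
    b_scale = torch.ones(k // 128, n)
    c = ref_scaled_fp8_gemm(a, a_scale, b, b_scale)
    print(c.dtype, c.shape)  # torch.bfloat16 torch.Size([64, 32])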
Maxious 62 days ago [-]
Just tested and it doesn't work out of the box on the consumer 50 series, i.e. the 5080:
    Testing GEMM:
    Assertion failed: deep_gemm/jit/../include/deep_gemm/fp8_gemm.cuh:369, condition: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess
    terminate called after throwing an instance of 'AssertionException'
      what(): Assertion failed: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess
devit 62 days ago [-]
Perhaps your card has less per-SM shared memory than the GPUs DeepSeek uses.
Try to lower the sm90_capacity value in gemm.py: I think 128KB is the correct value for RTX 5080 compared to 256KB for the H100/H800.
And probably add ", 3, 2, 1" after "6, 5, 4".
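If you want to check what your card actually supports before editing the constant, something like this prints the relevant limits (using numba here as an assumption about the environment; the attribute names follow the CUDA device-attribute enums):

    from numba import cuda

    dev = cuda.get_current_device()
    # Maximum dynamic shared memory a kernel can opt into per block, and the
    # total shared memory per SM; compare these against the smem_size that
    # the failing cudaFuncSetAttribute call is asking for.
    print(dev.MAX_SHARED_MEMORY_PER_BLOCK_OPTIN)
    print(dev.MAX_SHARED_MEMORY_PER_MULTIPROCESSOR)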
these guys are on fire! seriously, kudos to the deepseek team.
dbfudyfg 62 days ago [-]
[dead]
cde-v 62 days ago [-]
Interesting timing with NVDA releasing results tomorrow.
wenc 62 days ago [-]
Lol. I don't think the people that buy/sell NVDA even know what this is about.
> DeepGEMM exclusively supports NVIDIA Hopper tensor cores
This is a highly specialized linear algebra library to do general matrix-matrix multiplications for low-precision floats (FP8, vs FP32 (float), FP64 (double)) while maintaining accuracy.
porridgeraisin 62 days ago [-]
Careful, they're going to tell you it's "priced in" next.