    Compiling LLMs into a MegaKernel: A path to low-latency inference (zhihaojia.medium.com)
    132 points by matt_d - 1 day ago

  • Ollama integration?
    by NitroPython - 24 hours ago
  • Next step - compile straight to verilog so I can buy some LLMs on aliexpress
    by baq - 24 hours ago
  • > Traditional LLM systems often rely on sequences of GPU kernel launches and external communication calls, resulting in underutilized hardware.

    What? Why? This seems like an obvious optimization if it's possible.
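
    For context, the contrast the article seems to be drawing is roughly the one in this toy CUDA sketch (made-up kernels, not the project's actual code): the first path pays a host-side launch per op, while the persistent version launches once and loops over the op list on-device.

      // toy sketch -- made-up kernels, not MPK's code
      #include <cuda_runtime.h>

      __global__ void scale(float* x, int n, float a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) x[i] *= a;
      }

      // "traditional" path: one host-side launch per op
      void run_per_op(float* d_x, int n, int num_ops) {
          for (int op = 0; op < num_ops; ++op)
              scale<<<(n + 255) / 256, 256>>>(d_x, n, 1.001f);
          cudaDeviceSynchronize();
      }

      // persistent "megakernel"-style path: one launch, the kernel loops over
      // the ops itself, so per-op launch overhead and inter-kernel gaps disappear
      __global__ void persistent(float* x, int n, int num_ops) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          for (int op = 0; op < num_ops; ++op) {
              if (i < n) x[i] *= 1.001f;
              // a real megakernel needs grid-wide synchronization / dependency
              // tracking here; __syncthreads() only orders threads within a block
              __syncthreads();
          }
      }

      int main() {
          const int n = 1 << 20, num_ops = 512;
          float* d_x;
          cudaMalloc(&d_x, n * sizeof(float));
          run_per_op(d_x, n, num_ops);
          persistent<<<(n + 255) / 256, 256>>>(d_x, n, num_ops);
          cudaDeviceSynchronize();
          cudaFree(d_x);
          return 0;
      }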

    by scotty79 - 23 hours ago
  • This is very cool. I enjoyed going through the writeup and GitHub README.

    I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
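
    Roughly the kind of overlap I mean, as a toy CUDA sketch (two streams plus events; the gradient all-reduce is stood in by a plain device-to-device copy, and all names are made up):

      // toy sketch -- the all-reduce is stood in by a device-to-device copy
      #include <cuda_runtime.h>

      __global__ void backward_bucket(float* grad, int n) {   // stand-in backward op
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) grad[i] = grad[i] * 0.5f + 1.0f;
      }

      int main() {
          const int buckets = 8, n = 1 << 18;
          float *grad, *commbuf;
          cudaMalloc(&grad, buckets * n * sizeof(float));
          cudaMalloc(&commbuf, buckets * n * sizeof(float));

          cudaStream_t compute, comm;
          cudaStreamCreate(&compute);
          cudaStreamCreate(&comm);
          cudaEvent_t done[buckets];

          // walk buckets in reverse (backward order); as soon as one bucket's
          // gradients are ready, "communicate" them on the comm stream while
          // the compute stream keeps producing the next bucket
          for (int b = buckets - 1; b >= 0; --b) {
              cudaEventCreateWithFlags(&done[b], cudaEventDisableTiming);
              backward_bucket<<<(n + 255) / 256, 256, 0, compute>>>(grad + b * n, n);
              cudaEventRecord(done[b], compute);
              cudaStreamWaitEvent(comm, done[b], 0);
              cudaMemcpyAsync(commbuf + b * n, grad + b * n, n * sizeof(float),
                              cudaMemcpyDeviceToDevice, comm);   // would be an all-reduce
          }
          cudaDeviceSynchronize();
          return 0;
      }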

    I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:

    FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667

    by bytepoet - 23 hours ago
  • The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.

    That being said, this one persistent kernel per SM reminds me of Larrabee, and now I'm wondering what the world would look like if we had taken the traditional process-thread-SIMD path rather than the CUDA path.

    by liuliu - 23 hours ago
  • After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I had envisioned a successor project would look like - analyzing an operation dependency graph and then fusing (or, at a minimum, scheduling tasks smarter). Congrats to the team.
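
    By smarter scheduling I mean something in the spirit of this toy sketch (made-up scheme, not the project's actual runtime): one persistent kernel in which each block claims tasks from a shared counter and spin-waits on per-task dependency flags before executing.

      // toy sketch -- made-up scheme, not the project's actual runtime
      #include <cstdio>
      #include <cuda_runtime.h>

      #define NTASKS 64
      #define CHUNK 1024

      struct Task {
          int num_deps;       // number of producer tasks that must finish first
          int producer[2];    // indices of up to two producers (-1 = unused)
      };

      __device__ int done_flag[NTASKS];   // set to 1 once a task has finished

      __global__ void megakernel(const Task* tasks, float* data, int* next) {
          while (true) {
              // each block claims the next task in (topological) order
              __shared__ int t;
              if (threadIdx.x == 0) t = atomicAdd(next, 1);
              __syncthreads();
              if (t >= NTASKS) return;

              // spin until every producer of this task has published completion
              if (threadIdx.x == 0)
                  for (int d = 0; d < tasks[t].num_deps; ++d)
                      while (atomicAdd(&done_flag[tasks[t].producer[d]], 0) == 0) { }
              __syncthreads();

              // "execute" the task (stand-in for a real fused op)
              for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
                  data[t * CHUNK + i] += 1.0f;

              // publish completion so dependent tasks may proceed
              __threadfence();
              __syncthreads();
              if (threadIdx.x == 0) atomicExch(&done_flag[t], 1);
          }
      }

      int main() {
          Task h_tasks[NTASKS];
          for (int i = 0; i < NTASKS; ++i) {        // simple chain: task i needs i-1
              h_tasks[i].num_deps = (i == 0) ? 0 : 1;
              h_tasks[i].producer[0] = i - 1;
              h_tasks[i].producer[1] = -1;
          }
          Task* d_tasks; float* d_data; int* d_next;
          cudaMalloc(&d_tasks, sizeof(h_tasks));
          cudaMalloc(&d_data, NTASKS * CHUNK * sizeof(float));
          cudaMalloc(&d_next, sizeof(int));
          cudaMemcpy(d_tasks, h_tasks, sizeof(h_tasks), cudaMemcpyHostToDevice);
          cudaMemset(d_data, 0, NTASKS * CHUNK * sizeof(float));
          cudaMemset(d_next, 0, sizeof(int));
          megakernel<<<8, 128>>>(d_tasks, d_data, d_next);
          cudaDeviceSynchronize();
          printf("%s\n", cudaGetErrorString(cudaGetLastError()));
          return 0;
      }
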
    by kp1197 - 23 hours ago
  • Does anyone have an intuition on why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable?
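
    To make the I/O-marshalling point concrete, a toy sketch (made-up kernels; the three-argument cudaGraphInstantiate is the CUDA 12-style signature): a graph keeps the kernel boundaries, so intermediates still round-trip through global memory, whereas a fused kernel keeps them in registers.

      // toy sketch -- made-up kernels; CUDA 12-style cudaGraphInstantiate
      #include <cuda_runtime.h>

      __global__ void mul2(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] * 2.0f;          // intermediate goes to global memory
      }
      __global__ void add1(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] + 1.0f;          // ...and is read back here
      }
      // fused version: the intermediate never leaves registers
      __global__ void mul2_add1(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] * 2.0f + 1.0f;
      }

      int main() {
          const int n = 1 << 20;
          float *x, *tmp, *y;
          cudaMalloc(&x, n * sizeof(float));
          cudaMalloc(&tmp, n * sizeof(float));
          cudaMalloc(&y, n * sizeof(float));

          // CUDA graph: launch cost is low, but the kernel boundary (and the
          // global-memory round trip through `tmp`) is still there
          cudaStream_t s; cudaStreamCreate(&s);
          cudaGraph_t graph; cudaGraphExec_t exec;
          cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
          mul2<<<(n + 255) / 256, 256, 0, s>>>(x, tmp, n);
          add1<<<(n + 255) / 256, 256, 0, s>>>(tmp, y, n);
          cudaStreamEndCapture(s, &graph);
          cudaGraphInstantiate(&exec, graph, 0);
          cudaGraphLaunch(exec, s);

          // fused/megakernel-style: one launch, no intermediate traffic
          mul2_add1<<<(n + 255) / 256, 256, 0, s>>>(x, y, n);
          cudaStreamSynchronize(s);
          return 0;
      }
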
    by skavi - 23 hours ago
  • This project is from CMU. Hazy Research at Stanford talked about the megakernel too: https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles

    Good to see the competition in this area.

    (Edited): Related paper covering the larger "Mirage" project, though it doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751

    by flakiness - 22 hours ago
  • really cool. would love to try it for our 3b model.
    by olivia111 - 22 hours ago
  • any detailed tutorial about how to use it?
    by olivia111 - 22 hours ago
  • Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX?
    by fxtentacle - 21 hours ago
  • Certainly an important development for utilizing these models on scaled hardware. This approach could likely be applied beyond LLMs to other types of neural networks. That would be an interesting space to explore.
    by bdbenton5255 - 21 hours ago
  • If you want to try it on a 5090, it's not supported yet:

    > Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.
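
    For anyone unfamiliar, warp specialization roughly means dedicating some warps in a block to data movement and others to compute. A toy sketch of the pattern (hand-rolled shared-memory flags, not MPK's actual implementation):

      // toy sketch of warp specialization -- not MPK's implementation
      #include <cuda_runtime.h>

      #define TILE 256

      __global__ void specialized(const float* in, float* out, int ntiles) {
          __shared__ float buf[2][TILE];
          __shared__ int ready[2];      // last tile published into each buffer
          __shared__ int consumed[2];   // last tile fully consumed from each buffer

          int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
          if (threadIdx.x == 0) { ready[0] = ready[1] = -1; consumed[0] = consumed[1] = -1; }
          __syncthreads();

          if (warp == 0) {
              // loader warp: stream tiles into a shared-memory double buffer
              for (int t = 0; t < ntiles; ++t) {
                  int b = t & 1;
                  while (atomicAdd(&consumed[b], 0) < t - 2) { }   // buffer still busy
                  for (int i = lane; i < TILE; i += 32) buf[b][i] = in[t * TILE + i];
                  __syncwarp();
                  __threadfence_block();
                  if (lane == 0) atomicExch(&ready[b], t);
              }
          } else if (warp == 1) {
              // compute warp: work on each tile as soon as it is published
              for (int t = 0; t < ntiles; ++t) {
                  int b = t & 1;
                  while (atomicAdd(&ready[b], 0) < t) { }          // tile not staged yet
                  for (int i = lane; i < TILE; i += 32) out[t * TILE + i] = buf[b][i] * 2.0f;
                  __syncwarp();
                  __threadfence_block();
                  if (lane == 0) atomicExch(&consumed[b], t);
              }
          }
      }

      int main() {
          const int ntiles = 128;
          float *in, *out;
          cudaMalloc(&in, ntiles * TILE * sizeof(float));
          cudaMalloc(&out, ntiles * TILE * sizeof(float));
          specialized<<<1, 64>>>(in, out, ntiles);   // warp 0 loads, warp 1 computes
          cudaDeviceSynchronize();
          return 0;
      }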

    by tuananh - 20 hours ago
  • Probably should make this into a torch.compile backend
    by qihqi - 20 hours ago
