- Ollama integration? (by NitroPython, 24 hours ago)
- Next step: compile straight to Verilog so I can buy some LLMs on AliExpress (by baq, 24 hours ago)
- > Traditional LLM systems often rely on sequences of GPU kernel launches and external communication calls, resulting in underutilized hardware. (by scotty79, 23 hours ago)
What? Why? This seems like an obvious optimization if it's possible.
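For context on the quoted sentence: in a standard serving stack, every framework-level operation in a decoding step is dispatched as its own GPU kernel, and at small batch sizes those kernels are tiny enough that launch latency and the gaps between launches dominate. A minimal PyTorch sketch of that pattern (illustrative only; the shapes are made up):

```python
# Each framework-level op below becomes its own GPU kernel launch. During
# small-batch decoding the kernels are tiny, so the GPU idles between launches,
# which is the "underutilized hardware" the quoted sentence refers to.
import torch

x = torch.randn(1, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

def decode_step(x):
    h = x @ w            # kernel launch 1: GEMM
    h = torch.relu(h)    # kernel launch 2: elementwise
    h = h * 0.5          # kernel launch 3: elementwise
    return h + x         # kernel launch 4: elementwise

# A megakernel-style approach instead keeps one resident kernel on the GPU and
# feeds it the whole dependency graph, so there are no per-op launch gaps.
```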
- This is very cool. I enjoyed going through the writeup and GitHub README. (by bytepoet, 23 hours ago)
I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667
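On the training question above: overlapping backward compute with gradient communication is usually done by launching an asynchronous all-reduce as soon as each gradient is produced, so communication for already-finished gradients runs while backward continues on earlier layers. A rough sketch of that idea using per-parameter hooks (assumes torch.distributed is already initialized with a suitable backend such as NCCL; DDP's gradient bucketing is a more refined version of the same thing):

```python
import torch
import torch.distributed as dist

def attach_overlapped_allreduce(model, handles):
    """Launch an async all-reduce for each gradient the moment it is computed."""
    def hook(grad):
        # Communication for this gradient overlaps with the backward compute
        # of the layers that have not produced their gradients yet.
        handles.append(dist.all_reduce(grad, async_op=True))
        return grad
    for p in model.parameters():
        p.register_hook(hook)

# Usage sketch:
#   handles = []
#   attach_overlapped_allreduce(model, handles)
#   loss.backward()
#   for h in handles:
#       h.wait()          # all reductions finished before the optimizer step
```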
- The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one. (by liuliu, 23 hours ago)
That being said, this one-persistent-kernel-per-SM design reminds me of Larrabee, and now I'm wondering what the world would look like if we just took the traditional process/thread/SIMD path rather than the CUDA path.
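The persistent-kernel pattern the comment alludes to is essentially a long-lived worker loop per SM pulling tasks from a queue, rather than one kernel launch per operation. A CPU-threads analogy, purely illustrative and with no CUDA involved, of that execution model:

```python
# Illustrative analogy only: each thread plays the role of an SM running one
# persistent worker loop that pulls tasks from a shared queue until told to stop.
import queue
import threading

tasks = queue.Queue()

def persistent_worker():
    while True:
        task = tasks.get()
        if task is None:      # sentinel: shut this worker down
            break
        task()                # run the task; ordering/dependencies live in the queue

workers = [threading.Thread(target=persistent_worker) for _ in range(4)]
for w in workers:
    w.start()
for i in range(8):
    tasks.put(lambda i=i: print("task", i))
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
```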
- After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I had envisioned a successor project would look like: analyzing an operation dependency graph and then fusing (or, at a minimum, scheduling tasks smarter). Congrats to the team. (by kp1197, 23 hours ago)
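To make the dependency-graph framing concrete: once the ops of a step are represented as a DAG, anything whose inputs are all ready can be grouped into the same wave and co-scheduled or considered for fusion. A toy sketch using the standard-library graphlib (the op names and edges are invented for illustration):

```python
# Toy op-dependency graph: node -> set of ops it depends on.
from graphlib import TopologicalSorter

graph = {
    "rmsnorm":   set(),
    "q_proj":    {"rmsnorm"},
    "k_proj":    {"rmsnorm"},
    "v_proj":    {"rmsnorm"},
    "attention": {"q_proj", "k_proj", "v_proj"},
    "o_proj":    {"attention"},
}

ts = TopologicalSorter(graph)
ts.prepare()
while ts.is_active():
    wave = ts.get_ready()           # all ops with no unmet dependencies
    print("schedule together:", wave)  # e.g. q_proj/k_proj/v_proj land in one wave
    ts.done(*wave)
```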
- Does anyone have an intuition on why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable? (by skavi, 23 hours ago)
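For reference, here is the CUDA Graphs capture/replay pattern as exposed in PyTorch. It removes per-launch CPU overhead, but the replay still runs the ops as separate kernels: intermediates go through global memory and each kernel waits at its boundary. The extra headroom a megakernel can claim comes from fusing across those boundaries (keeping data in registers/shared memory, overlapping compute with communication), not from cheaper launches. Shapes below are made up:

```python
import torch

x = torch.randn(1, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

def step(x):
    return torch.relu(x @ w) + x

# Warm up on a side stream (the pattern recommended in the PyTorch docs),
# then capture the kernel sequence into a graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = step(x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = step(x)

x.copy_(torch.randn_like(x))   # refill the static input buffer
g.replay()                     # re-launches the same kernels with near-zero CPU overhead
```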
- This project is from CMU. Hazy Research at Stanford talked about the megakernel too: https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles (by flakiness, 22 hours ago)
Good to see the competition in this area.
(Edited) Related paper covering the larger "Mirage" project, though it doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751
- Really cool. Would love to try it for our 3B model. (by olivia111, 22 hours ago)
- Is there a detailed tutorial on how to use it? (by olivia111, 22 hours ago)
- Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX? (by fxtentacle, 21 hours ago)
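On the JAX comparison: jax.jit hands the whole function to XLA, which will typically fuse chains of elementwise ops into single kernels, but the compiled result is still executed as a sequence of separately launched kernels per step. The megakernel approach instead compiles the entire decoding step, scheduling and communication included, into one persistent kernel. A minimal jax.jit example of the XLA-style fusion being discussed (shapes are made up):

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(x, w):
    # XLA can fuse the relu / scale / add below into one kernel,
    # but the jitted step as a whole still runs as several kernel launches.
    h = x @ w
    return jax.nn.relu(h) * 0.5 + x

x = jnp.ones((1, 4096))
w = jnp.ones((4096, 4096))
print(step(x, w).shape)
```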
- Certainly an important development for running these models on scaled hardware. This approach could likely be applied beyond LLMs to other types of neural networks, which would be an interesting space to explore. (by bdbenton5255, 21 hours ago)
- If you want to try it on a 5090: it's not supported yet. (by tuananh, 20 hours ago)
> Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.
- Probably should make this into a torch.compile backend. (by qihqi, 20 hours ago)
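Mechanically, a torch.compile backend is just a callable that receives the captured FX graph and returns something callable, so a megakernel compiler could in principle be plugged in at that point. A placeholder sketch; the backend below only inspects the graph and falls back to eager execution, and `megakernel_backend` is a hypothetical name:

```python
import torch

def megakernel_backend(gm: torch.fx.GraphModule, example_inputs):
    # torch.compile hands us the captured op graph here; a real backend would
    # lower it to its own compiler (hypothetically, something like MPK).
    print(gm.graph)        # inspect the captured op-dependency graph
    return gm.forward      # no-op placeholder: fall back to eager execution

@torch.compile(backend=megakernel_backend)
def step(x, w):
    return torch.relu(x @ w) + x

# First call triggers graph capture and hands it to megakernel_backend:
#   step(torch.randn(8, 64), torch.randn(64, 64))
```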