    Compiling LLMs into a MegaKernel: A path to low-latency inference (zhihaojia.medium.com)
    132 points by matt_d - 1 day ago

  • Ollama integration?
    by NitroPython - 24 hours ago
  • Next step - compile straight to verilog so I can buy some LLMs on aliexpress
    by baq - 24 hours ago
  • > Traditional LLM systems often rely on sequences of GPU kernel launches and external communication calls, resulting in underutilized hardware.

    What? Why? This seems like an obvious optimization if it's possible.
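
    For context, the contrast the article seems to be drawing is roughly the one in this toy CUDA sketch (made-up kernels, not the project's actual code): the first path pays a host-side launch per op, while the persistent version launches once and loops over the op list on-device.

      // toy sketch -- made-up kernels, not MPK's code
      #include <cuda_runtime.h>

      __global__ void scale(float* x, int n, float a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) x[i] *= a;
      }

      // "traditional" path: one host-side launch per op
      void run_per_op(float* d_x, int n, int num_ops) {
          for (int op = 0; op < num_ops; ++op)
              scale<<<(n + 255) / 256, 256>>>(d_x, n, 1.001f);
          cudaDeviceSynchronize();
      }

      // persistent "megakernel"-style path: one launch, the kernel loops over
      // the ops itself, so per-op launch overhead and inter-kernel gaps disappear
      __global__ void persistent(float* x, int n, int num_ops) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          for (int op = 0; op < num_ops; ++op) {
              if (i < n) x[i] *= 1.001f;
              // a real megakernel needs grid-wide synchronization / dependency
              // tracking here; __syncthreads() only orders threads within a block
              __syncthreads();
          }
      }

      int main() {
          const int n = 1 << 20, num_ops = 512;
          float* d_x;
          cudaMalloc(&d_x, n * sizeof(float));
          run_per_op(d_x, n, num_ops);
          persistent<<<(n + 255) / 256, 256>>>(d_x, n, num_ops);
          cudaDeviceSynchronize();
          cudaFree(d_x);
          return 0;
      }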

    by scotty79 - 23 hours ago
  • This is very cool. I enjoyed going through the writeup and GitHub README.

    I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
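
    Roughly the kind of overlap I mean, as a toy CUDA sketch (two streams plus events; the gradient all-reduce is stood in by a plain device-to-device copy, and all names are made up):

      // toy sketch -- the all-reduce is stood in by a device-to-device copy
      #include <cuda_runtime.h>

      __global__ void backward_bucket(float* grad, int n) {   // stand-in backward op
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) grad[i] = grad[i] * 0.5f + 1.0f;
      }

      int main() {
          const int buckets = 8, n = 1 << 18;
          float *grad, *commbuf;
          cudaMalloc(&grad, buckets * n * sizeof(float));
          cudaMalloc(&commbuf, buckets * n * sizeof(float));

          cudaStream_t compute, comm;
          cudaStreamCreate(&compute);
          cudaStreamCreate(&comm);
          cudaEvent_t done[buckets];

          // walk buckets in reverse (backward order); as soon as one bucket's
          // gradients are ready, "communicate" them on the comm stream while
          // the compute stream keeps producing the next bucket
          for (int b = buckets - 1; b >= 0; --b) {
              cudaEventCreateWithFlags(&done[b], cudaEventDisableTiming);
              backward_bucket<<<(n + 255) / 256, 256, 0, compute>>>(grad + b * n, n);
              cudaEventRecord(done[b], compute);
              cudaStreamWaitEvent(comm, done[b], 0);
              cudaMemcpyAsync(commbuf + b * n, grad + b * n, n * sizeof(float),
                              cudaMemcpyDeviceToDevice, comm);   // would be an all-reduce
          }
          cudaDeviceSynchronize();
          return 0;
      }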

    I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:

    FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667

    by bytepoet - 23 hours ago
  • The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.

    That being said, this one persistent kernel per SM reminds me of Larrabee, and now I'm wondering what the world would look like if we had taken the traditional process-thread-SIMD path rather than the CUDA path.

    by liuliu - 23 hours ago
  • After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I had envisioned a successor project would look like - analyzing an operation dependency graph and then fusing (or, at a minimum, scheduling tasks smarter). Congrats to the team.
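
    By smarter scheduling I mean something in the spirit of this toy sketch (made-up scheme, not the project's actual runtime): one persistent kernel in which each block claims tasks from a shared counter and spin-waits on per-task dependency flags before executing.

      // toy sketch -- made-up scheme, not the project's actual runtime
      #include <cstdio>
      #include <cuda_runtime.h>

      #define NTASKS 64
      #define CHUNK 1024

      struct Task {
          int num_deps;       // number of producer tasks that must finish first
          int producer[2];    // indices of up to two producers (-1 = unused)
      };

      __device__ int done_flag[NTASKS];   // set to 1 once a task has finished

      __global__ void megakernel(const Task* tasks, float* data, int* next) {
          while (true) {
              // each block claims the next task in (topological) order
              __shared__ int t;
              if (threadIdx.x == 0) t = atomicAdd(next, 1);
              __syncthreads();
              if (t >= NTASKS) return;

              // spin until every producer of this task has published completion
              if (threadIdx.x == 0)
                  for (int d = 0; d < tasks[t].num_deps; ++d)
                      while (atomicAdd(&done_flag[tasks[t].producer[d]], 0) == 0) { }
              __syncthreads();

              // "execute" the task (stand-in for a real fused op)
              for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
                  data[t * CHUNK + i] += 1.0f;

              // publish completion so dependent tasks may proceed
              __threadfence();
              __syncthreads();
              if (threadIdx.x == 0) atomicExch(&done_flag[t], 1);
          }
      }

      int main() {
          Task h_tasks[NTASKS];
          for (int i = 0; i < NTASKS; ++i) {        // simple chain: task i needs i-1
              h_tasks[i].num_deps = (i == 0) ? 0 : 1;
              h_tasks[i].producer[0] = i - 1;
              h_tasks[i].producer[1] = -1;
          }
          Task* d_tasks; float* d_data; int* d_next;
          cudaMalloc(&d_tasks, sizeof(h_tasks));
          cudaMalloc(&d_data, NTASKS * CHUNK * sizeof(float));
          cudaMalloc(&d_next, sizeof(int));
          cudaMemcpy(d_tasks, h_tasks, sizeof(h_tasks), cudaMemcpyHostToDevice);
          cudaMemset(d_data, 0, NTASKS * CHUNK * sizeof(float));
          cudaMemset(d_next, 0, sizeof(int));
          megakernel<<<8, 128>>>(d_tasks, d_data, d_next);
          cudaDeviceSynchronize();
          printf("%s\n", cudaGetErrorString(cudaGetLastError()));
          return 0;
      }
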
    by kp1197 - 23 hours ago
  • Does anyone have an intuition on why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable?
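
    To make the I/O-marshalling point concrete, a toy sketch (made-up kernels; the three-argument cudaGraphInstantiate is the CUDA 12-style signature): a graph keeps the kernel boundaries, so intermediates still round-trip through global memory, whereas a fused kernel keeps them in registers.

      // toy sketch -- made-up kernels; CUDA 12-style cudaGraphInstantiate
      #include <cuda_runtime.h>

      __global__ void mul2(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] * 2.0f;          // intermediate goes to global memory
      }
      __global__ void add1(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] + 1.0f;          // ...and is read back here
      }
      // fused version: the intermediate never leaves registers
      __global__ void mul2_add1(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] * 2.0f + 1.0f;
      }

      int main() {
          const int n = 1 << 20;
          float *x, *tmp, *y;
          cudaMalloc(&x, n * sizeof(float));
          cudaMalloc(&tmp, n * sizeof(float));
          cudaMalloc(&y, n * sizeof(float));

          // CUDA graph: launch cost is low, but the kernel boundary (and the
          // global-memory round trip through `tmp`) is still there
          cudaStream_t s; cudaStreamCreate(&s);
          cudaGraph_t graph; cudaGraphExec_t exec;
          cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
          mul2<<<(n + 255) / 256, 256, 0, s>>>(x, tmp, n);
          add1<<<(n + 255) / 256, 256, 0, s>>>(tmp, y, n);
          cudaStreamEndCapture(s, &graph);
          cudaGraphInstantiate(&exec, graph, 0);
          cudaGraphLaunch(exec, s);

          // fused/megakernel-style: one launch, no intermediate traffic
          mul2_add1<<<(n + 255) / 256, 256, 0, s>>>(x, y, n);
          cudaStreamSynchronize(s);
          return 0;
      }
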
    by skavi - 23 hours ago
  • This project is from CMU. Hazy Research at Stanford talked about the megakernel too: https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles

    Good to see the competition in this area.

    (Edited): Related paper covering the larger "Mirage" project, though it doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751

    by flakiness - 22 hours ago
  • really cool. would love to try it for our 3b model.
    by olivia111 - 22 hours ago
  • any detailed tutorial about how to use it?
    by olivia111 - 22 hours ago
  • Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX?
    by fxtentacle - 21 hours ago
  • Certainly an important development for utilizing these models on scaled hardware. This approach could likely be applied beyond LLMs to other types of neural networks. That would be an interesting space to explore.
    by bdbenton5255 - 21 hours ago
  • If you want to try it on a 5090, it's not supported yet:

    > Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.
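
    For anyone unfamiliar, warp specialization roughly means dedicating some warps in a block to data movement and others to compute. A toy sketch of the pattern (hand-rolled shared-memory flags, not MPK's actual implementation):

      // toy sketch of warp specialization -- not MPK's implementation
      #include <cuda_runtime.h>

      #define TILE 256

      __global__ void specialized(const float* in, float* out, int ntiles) {
          __shared__ float buf[2][TILE];
          __shared__ int ready[2];      // last tile published into each buffer
          __shared__ int consumed[2];   // last tile fully consumed from each buffer

          int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
          if (threadIdx.x == 0) { ready[0] = ready[1] = -1; consumed[0] = consumed[1] = -1; }
          __syncthreads();

          if (warp == 0) {
              // loader warp: stream tiles into a shared-memory double buffer
              for (int t = 0; t < ntiles; ++t) {
                  int b = t & 1;
                  while (atomicAdd(&consumed[b], 0) < t - 2) { }   // buffer still busy
                  for (int i = lane; i < TILE; i += 32) buf[b][i] = in[t * TILE + i];
                  __syncwarp();
                  __threadfence_block();
                  if (lane == 0) atomicExch(&ready[b], t);
              }
          } else if (warp == 1) {
              // compute warp: work on each tile as soon as it is published
              for (int t = 0; t < ntiles; ++t) {
                  int b = t & 1;
                  while (atomicAdd(&ready[b], 0) < t) { }          // tile not staged yet
                  for (int i = lane; i < TILE; i += 32) out[t * TILE + i] = buf[b][i] * 2.0f;
                  __syncwarp();
                  __threadfence_block();
                  if (lane == 0) atomicExch(&consumed[b], t);
              }
          }
      }

      int main() {
          const int ntiles = 128;
          float *in, *out;
          cudaMalloc(&in, ntiles * TILE * sizeof(float));
          cudaMalloc(&out, ntiles * TILE * sizeof(float));
          specialized<<<1, 64>>>(in, out, ntiles);   // warp 0 loads, warp 1 computes
          cudaDeviceSynchronize();
          return 0;
      }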

    by tuananh - 20 hours ago
  • Probably should make this into a torch.compile backend
    by qihqi - 20 hours ago
