Today we’re excited to launch BaldEagle, a simplified and accessible unofficial implementation of the influential EAGLE speculative decoding paper series. BaldEagle is designed for clarity, easy customization, and compatibility with consumer GPUs (such as the RTX 3090), specifically to empower both the open-source community and researchers looking to innovate on existing methods.
We also release an EAGLE model trained using the BaldEagle repo that speeds up Llama 3.1 8B by 3.17x (49.24 tok/s -> 156.33 tok/s)! See the model card for more information.
BaldEagle Github Repository: https://github.com/NickL77/BaldEagle
What is Speculative Decoding?

Speculative decoding is an inference optimization technique that can dramatically accelerate token generation, achieving up to ~3x and ~6x faster generation than standard autoregressive decoding, as demonstrated by EAGLE 2 and EAGLE 3 respectively. It works by using a smaller, efficient "draft" model to rapidly propose multiple candidate tokens, which the larger target model then validates in a single forward pass with minimal overhead.
By efficiently parallelizing token generation and validation, speculative decoding significantly reduces inference latency without compromising the quality of the generated text. This makes it particularly valuable for latency-sensitive applications such as interactive chatbots and real-time assistants, and for significantly reducing the cost of deploying powerful large language models. The EAGLE paper series focuses on training state-of-the-art draft models, and BaldEagle simplifies the training, data generation, and benchmarking process.
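To make the mechanism concrete, here is a minimal greedy draft-then-verify step in plain PyTorch, assuming HuggingFace-style causal LMs for both models. It only illustrates the general speculative decoding pattern; EAGLE itself drafts from the target model's hidden features and verifies a tree of candidates rather than a single chain.

import torch

@torch.no_grad()
def speculative_decode_step(target_model, draft_model, input_ids, k=5):
    """One greedy draft-then-verify step (batch size 1 assumed for simplicity)."""
    # 1. Draft: the small model greedily proposes k candidate tokens.
    draft_ids = input_ids
    for _ in range(k):
        next_id = draft_model(draft_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2. Verify: one forward pass of the large model scores all k drafted positions at once.
    target_preds = target_model(draft_ids).logits[:, :-1, :].argmax(dim=-1)

    # 3. Accept the longest prefix the target model agrees with; at the first
    #    disagreement, keep the target model's own token instead and stop.
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        drafted = draft_ids[:, n_prompt + i]
        verified = target_preds[:, n_prompt + i - 1]
        if torch.equal(drafted, verified):
            accepted = torch.cat([accepted, drafted.unsqueeze(-1)], dim=-1)
        else:
            accepted = torch.cat([accepted, verified.unsqueeze(-1)], dim=-1)
            break
    return accepted

Because the verification pass reproduces exactly what the target model would have generated one token at a time, output quality is unchanged while several tokens can be accepted per target forward pass.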
If you want to learn more about EAGLE and speculative decoding, here's a good video from the authors:
Why an Unofficial Implementation?
While the EAGLE authors are focused on churning out further innovations for faster inference, their codebase has been somewhat neglected: dangling files, confusing file names, missing modules, and unaddressed issues and questions. This creates an unnecessarily long and difficult onboarding process before you can train your own EAGLE model.
BaldEagle maintains a simple implementation for training EAGLE models on top of HuggingFace Trainer that the open-source community is already familiar with. This approach ensures minimal friction for new contributors and encourages rapid experimentation and innovation. The BaldEagle open-source repository will be a place for the community to actively discuss improvements on the existing training recipe and model architecture.
Key Features of BaldEagle
BaldEagle comes packed with features to make training EAGLE speculative decoding models straightforward and accessible:
Simple training loop built on the HuggingFace Trainer for easy development and modification (see the sketch after this list)
Model reuses Llama implementation where possible to reduce complexity
Thorough logging via Weights and Biases
Improved data generation scripts to easily change models and add datasets
Included benchmarking scripts using a state-of-the-art inference server
Llama 8B draft model trainable on a single RTX 3090
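As a rough sketch of the Trainer-based approach mentioned above, the toy example below wires an EAGLE-style draft head (regress the target model's next hidden state, plus a token-level cross entropy) into the HuggingFace Trainer. The class, loss weighting, and random data are placeholders, not the actual BaldEagle recipe, which lives in the repo.

import torch
from torch import nn
from datasets import Dataset
from transformers import Trainer, TrainingArguments

# Toy stand-in: the real draft model and preprocessed dataset come from the BaldEagle repo.
class ToyDraftHead(nn.Module):
    def __init__(self, hidden=64, vocab=128):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, hidden_states=None, target_hidden=None, labels=None):
        hidden_states = hidden_states.float()
        target_hidden = target_hidden.float()
        pred_hidden = self.proj(hidden_states)
        logits = self.lm_head(pred_hidden)
        # EAGLE-style objective: regress the target model's next hidden state and
        # predict the next token; the 0.1 weighting here is purely illustrative.
        reg_loss = nn.functional.smooth_l1_loss(pred_hidden, target_hidden)
        ce_loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )
        return {"loss": reg_loss + 0.1 * ce_loss, "logits": logits}

# Random tensors stand in for (hidden state, label) pairs generated offline from the target model.
data = Dataset.from_dict({
    "hidden_states": torch.randn(32, 8, 64).tolist(),
    "target_hidden": torch.randn(32, 8, 64).tolist(),
    "labels": torch.randint(0, 128, (32, 8)).tolist(),
})
data.set_format("torch")

args = TrainingArguments(
    output_dir="toy-draft",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    report_to="none",  # switch to "wandb" for Weights & Biases logging
)

Trainer(model=ToyDraftHead(), args=args, train_dataset=data).train()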
How to Use BaldEagle Models?
EAGLE speculative decoding is already implemented in production-grade inference servers like SGLang and vLLM, so BaldEagle models can be used immediately.
With SGLang, a few additional parameters are required when starting the server from the command line (the --model target should be the same Llama 3.1 8B Instruct checkpoint the draft model was trained against):
python3 -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-algo EAGLE \
--speculative-draft NickL77/BaldEagle-Llama-3.1-8B-Instruct \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64 \
--dtype bfloat16 \
--port 30000 \
--mem-fraction-static 0.65
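The server exposes an OpenAI-compatible API, so speculative decoding is transparent to clients. For example, assuming the server above is running locally on port 30000 (the model name should match the path passed to --model):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)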
In vLLM, even fewer parameters are required (though this comes at the cost of some configurability):
# in Python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "NickL77/BaldEagle-Llama-3.1-8B-Instruct",
        "num_speculative_tokens": 5,
    },
)
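Generation then works exactly as it does without speculation. Continuing from the llm object above (the prompt and sampling settings here are arbitrary):

from vllm import SamplingParams

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)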
What's Next and How You Can Get Involved
We're excited about the upcoming milestones and actively looking for collaborators to help drive BaldEagle forward. Here's what's on our roadmap:
Implementing EAGLE 3: Achieve even faster decoding speeds (targeting ~6x speed-up).
Enhancing Data Generation: Develop more efficient and flexible scripts.
Expanding Draft Models: We invite community members to train draft models for additional architectures like Qwen 3.
Architectural Improvements: We always welcome contributions and ideas for further model and training enhancements.
Let's push the boundaries of speculative decoding together—check out the repo, star it if you find it useful, and help us shape what's next!