[01] // CASE STUDY / ELLECI

Elleci.

Elleci starts from a simple constraint: build a serious LLM without lab-grade hardware. Instead of chasing more VRAM, I worked on the architecture: BitNet, Mamba-2, EG-MLA, and MoE in the same model, with training on a single A100 and inference designed for consumer GPUs.

Client: Independent research
Industry: ML Research
Year: 2025
  • PyTorch
  • Mamba-2
  • BitNet 1.58b
  • DeepSpeed ZeRO-2
  • Liger-Kernel
  • WandB
  • EG-MLA
  • MoE
  • CUDA

For an independent researcher, the limit is not the idea. It is the hardware. The most capable models already need GPUs with tens of gigabytes of VRAM just for inference, and much more for training. In practice, that locks out anyone who is not working in a lab or a large tech company.

Elleci starts there: instead of adapting a model built for expensive clusters, the goal was to design it from the beginning around real constraints. Training on a single rented A100 40GB on Vast.ai. Inference on consumer GPUs with 9-16GB VRAM.

That changes the question. Not "how much hardware does it take?", but "which architectural choices let capacity and cost stay under control at the same time?". The whole project is built around that constraint.

[A] BitNet 1.58b
Ternary weights {-1, 0, +1} instead of FP16.
Cuts the memory footprint by roughly 70% and moves the cost where it hurts less. This is not a compression step applied after training. It is a core design choice that keeps the model inside accessible hardware limits.
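To make the ternary idea concrete, here is a minimal sketch of absmean quantization in the style described for BitNet b1.58, in plain Python. It is an illustration only: the function names `ternarize` and `dequantize`, the per-tensor scale, and the sample weights are assumptions, not Elleci's actual quantizer.

```python
import math

def ternarize(weights, eps=1e-8):
    """Map floats to {-1, 0, +1} plus one shared scale (absmean scheme).

    Hypothetical helper for illustration; not taken from the Elleci codebase.
    """
    # Scale is the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def dequantize(ternary, scale):
    """Approximate reconstruction of the original weights."""
    return [t * scale for t in ternary]

weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]  # made-up example values
ternary, scale = ternarize(weights)
approx = dequantize(ternary, scale)

# Three states carry log2(3) bits of information, hence the "1.58b" name.
bits_per_weight = math.log2(3)
```

Each weight then needs only a ternary code plus a share of one scale, rather than 16 bits in FP16.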
[B] Differential Mamba2Block
SSM (State Space Model) layer replacing standard attention. Present in 3 out of 4 layers.
Brings O(n) complexity in sequence length instead of the O(n²) of full attention. In practice, that means longer sequences without the memory blow-up that full attention creates on every layer.
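The linear cost comes from the recurrent form of a state space model: one pass over the sequence with a fixed-size state. The sketch below is a deliberately simplified scalar, time-invariant recurrence, not Mamba-2 itself (which uses input-dependent parameters and a multi-dimensional state). The name `ssm_scan` and the constants `a`, `b`, `c` are assumptions chosen for illustration.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    One loop over the sequence, one scalar of state: O(n) time and
    O(1) state, regardless of how long the sequence is.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # update the fixed-size state
        ys.append(c * h)   # read out the current output
    return ys

# An impulse at t=0 decays geometrically through the state.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

Full attention, by contrast, compares every token with every earlier token, which is where the quadratic term comes from.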
[C] EG-MLA (Multi-head Latent Attention)
Compresses the KV-cache through a latent projection (2560→128 dimensions). Appears once every 4 layers.
Cuts the KV-cache by about 85% versus standard MHA while preserving global recall, which complements Mamba's strength on local patterns. Because attention appears in only one layer out of four, the full memory cost is not paid on every layer.
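The interleaving described in [B] and [C] can be sketched as a simple layer schedule. This is an assumption-laden illustration: the function name `layer_kinds`, the layer count of 8, and the choice to place the attention layer last in each group of four are hypothetical. The source only states the 3-to-1 ratio.

```python
def layer_kinds(num_layers, attention_every=4):
    """Return the block type for each layer index.

    Assumption: the attention block is the last layer of each group;
    the write-up does not say which position it occupies.
    """
    return [
        "eg_mla" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(num_layers)
    ]

kinds = layer_kinds(8)  # 8 layers is an arbitrary example size
num_attention = kinds.count("eg_mla")
num_mamba = kinds.count("mamba2")
```

Only the `eg_mla` layers keep a KV-cache, so the cache grows with a quarter of the depth rather than all of it.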
[D] MoE (Mixture of Experts)
Conditional routing: only K of N expert FFNs are activated per token.
Keeps total capacity high without activating everything every time. The model reaches 5.84B total parameters, while active compute per token stays closer to that of a much smaller model.
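Conditional routing can be illustrated with a small top-k gate in plain Python. This is a sketch under stated assumptions: the names `route_top_k` and `moe_forward`, the softmax taken over only the selected experts, and the toy scalar "experts" are all hypothetical. Elleci's router and its values of K and N are not given in this write-up.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Assumption: softmax is applied over the selected experts only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum over the chosen experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Four toy experts that just scale their input (stand-ins for expert FFNs).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

routes = route_top_k(logits, k=2)
y = moe_forward(1.0, experts, logits, k=2)
```

With k=2 of 4 experts, half of the expert parameters sit idle for this token, which is how total capacity and per-token compute are decoupled.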
[Figure: Brutalist mockup of Elleci: hybrid Mamba-2/EG-MLA architecture, 3B-parameter training, 2,700 tokens/sec on consumer VRAM.]
One A100. 2-3k tokens/sec. Consumer GPUs at inference.

→ These four choices are designed to work together to reduce memory and compute. They are not optimizations bolted on after the fact.

Elleci shows that you can work on new LLM architectures without starting from a private cluster. Training runs on a single rented A100 40GB, and inference is designed for consumer GPUs with 9-16GB VRAM.

The most important technical result is not just the parameter count or the throughput of roughly 2,000-3,000 tokens/sec on A100. It is the fact that BitNet, Mamba-2, EG-MLA, and MoE coexist in the same model without breaking the training loop.

In practice, the project moves the effort where it matters: less budget spent chasing ever larger hardware, more room to test real architectural ideas in an open-source codebase that can keep evolving.

A100 training
2-3k Tokens/sec inference
3090+ Consumer GPU deploy
[06] // Let's talk

Want to use or study the architecture?

The project is open source. If you want to inspect the code or discuss it, start from the repository.

STATUS

Coming soon

Repository is being cleaned up and will be published soon.