A recent paper presents SpikingBrain, a new family of large language models (LLMs) inspired by neuroscience, aimed at making training and inference more efficient, especially for very long sequences (arXiv).
The problem they address
- Traditional Transformer-based LLMs scale poorly with sequence length: self-attention compute during training grows roughly quadratically with sequence length, and inference memory (the KV cache) also grows with context (arXiv). A back-of-the-envelope comparison follows this list.
- Most large-scale LLM development is also done on NVIDIA GPUs; doing this at scale on non-NVIDIA hardware adds engineering challenges (arXiv).
- Energy consumption and resource demands are huge. The authors want models that handle long contexts while using less data and less compute, and that behave more like brains: sparse, event-driven, efficient (arXiv).
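To make the scaling argument concrete, here is a back-of-the-envelope comparison in Python. The hidden size, layer count, and simplified FLOP formulas are illustrative assumptions, not numbers from the paper.

```python
# Rough attention-cost comparison; all constants are illustrative assumptions.
d = 4096       # hidden dimension (assumed)
layers = 32    # number of layers (assumed)

def softmax_attention_flops(L):
    # QK^T scores plus the weighted sum over values: ~2 * L^2 * d per layer
    return layers * 2 * L * L * d

def linear_attention_flops(L):
    # kernelized attention keeps a d x d running state: ~2 * L * d^2 per layer
    return layers * 2 * L * d * d

for L in (8_192, 131_072, 1_048_576):
    ratio = softmax_attention_flops(L) / linear_attention_flops(L)
    print(f"L = {L:>9,} tokens -> quadratic/linear cost ratio ~ {ratio:,.0f}x")
```

The ratio grows as L/d, which is why the gap only becomes dramatic at very long contexts.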
Their approach: key innovations
They combine several strategies inspired by brain/neural computation to get efficient and scalable LLMs:
- Hybrid linear / hybrid attention architectures
  - They use linear attention (which has linear complexity in sequence length) together with sliding-window / local attention to get good trade-offs between efficiency and quality (arXiv). A minimal sketch of both primitives appears right after this group.
  - They build two models: SpikingBrain-7B, a 7-billion-parameter linear model focused on long-context efficiency, and SpikingBrain-76B, a hybrid-linear Mixture-of-Experts (MoE) model that balances performance and capacity (arXiv).
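The two attention primitives can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a simple ReLU feature map and a tiny window; the paper's actual kernels, gating, and layer placement will differ.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention: O(L * d^2) time, O(d^2) state.
    Q, K, V: (L, d) arrays. Feature map: ReLU (one common, assumed choice)."""
    phi_q, phi_k = np.maximum(Q, 0), np.maximum(K, 0)
    L, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of phi(k) v^T
    z = np.zeros(d)                 # running sum of phi(k)
    out = np.empty_like(V)
    for t in range(L):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = phi_q[t] @ S / (phi_q[t] @ z + eps)
    return out

def sliding_window_attention(Q, K, V, window=4):
    """Causal softmax attention restricted to the last `window` tokens."""
    L, d = Q.shape
    out = np.empty_like(V)
    for t in range(L):
        lo = max(0, t - window + 1)
        scores = Q[t] @ K[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[t] = (w / w.sum()) @ V[lo:t + 1]
    return out

# tiny smoke test
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(linear_attention(Q, K, V).shape, sliding_window_attention(Q, K, V).shape)
```

The point of the linear form is that the running state S has a fixed d x d size, so memory does not grow with context length; the sliding window keeps a cheap local view of recent tokens.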
- Conversion-based training
  - Instead of training completely from scratch, they start from a pre-trained Transformer checkpoint (here Qwen2.5-7B) and convert it to their efficient hybrid-attention plus spiking-neuron architecture (arXiv). A sketch of the conversion idea appears after this group.
  - They extend the sequence length gradually during training, in stages from 8k to 32k to 128k tokens (arXiv).
  - They also use MoE upcycling: dense feed-forward networks are replicated into sparse experts, adding capacity without a proportional increase in compute or memory (arXiv). See the upcycling sketch below.
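A minimal PyTorch sketch of the conversion idea: the replacement attention keeps the same parameter layout as the pretrained block, so checkpoint weights carry over and training only has to adapt rather than restart. The class and method names (SoftmaxAttention, from_pretrained_block) and the stage list are hypothetical, not the paper's code; forward passes are omitted because the sketch is only about weight reuse.

```python
import torch.nn as nn

class SoftmaxAttention(nn.Module):
    """Stand-in for one pretrained attention block (projections only)."""
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.k_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)
        self.o_proj = nn.Linear(d, d, bias=False)

class LinearAttention(SoftmaxAttention):
    """Efficient replacement with the same parameter layout, so the pretrained
    projections load verbatim and only the attention math would change."""
    @classmethod
    def from_pretrained_block(cls, src):
        dst = cls(src.q_proj.in_features)
        dst.load_state_dict(src.state_dict())   # carry the checkpoint weights over
        return dst

# Convert every layer of a toy pretrained stack, then continue pretraining
# with staged context lengths, roughly as described: 8k -> 32k -> 128k tokens.
d, n_layers = 64, 4
pretrained_layers = nn.ModuleList(SoftmaxAttention(d) for _ in range(n_layers))
converted_layers = nn.ModuleList(
    LinearAttention.from_pretrained_block(layer) for layer in pretrained_layers
)

for stage, seq_len in enumerate((8_192, 32_768, 131_072), start=1):
    print(f"stage {stage}: continual pretraining at sequence length {seq_len:,}")
```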
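MoE upcycling can likewise be sketched in PyTorch: every expert starts as a copy of the dense FFN, and a small router picks the top-k experts per token, so capacity grows while per-token compute stays close to top-k dense FFNs. Routing details (initialization noise, load-balancing losses, capacity limits) are omitted, and the sizes are toy values.

```python
import copy
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))
    def forward(self, x):
        return self.net(x)

class UpcycledMoE(nn.Module):
    """MoE layer whose experts all start as copies of one dense FFN."""
    def __init__(self, dense, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(dense) for _ in range(n_experts))
        self.router = nn.Linear(dense.net[0].in_features, n_experts, bias=False)
        self.top_k = top_k
    def forward(self, x):                         # x: (tokens, d)
        gates = torch.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize chosen gates
        out = torch.zeros_like(x)
        for slot in range(self.top_k):            # naive routing loop, clarity over speed
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

dense = DenseFFN(d=32, hidden=128)
moe = UpcycledMoE(dense, n_experts=4, top_k=2)
x = torch.randn(10, 32)
print(moe(x).shape)   # torch.Size([10, 32]); capacity x4, per-token compute ~ 2 experts
```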
- Spiking / adaptive neuron model
  - They define a spiking scheme that converts activations into integer spike counts, so that inference can use event-driven, sparse operations via spike coding (arXiv). A toy encoder appears after this group.
  - They propose an adaptive threshold on neurons to avoid both over-spiking and silent neurons (arXiv).
  - They experiment with several spike-encoding schemes, binary {0,1}, ternary {−1,0,1}, bitwise, and so on, trading expressivity against time steps and sparsity (arXiv).
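A toy version of the spike-count idea, assuming a single scalar threshold updated from a running activation scale; the paper's adaptive-threshold rule, per-channel statistics, and the unrolling of counts into time steps for event-driven execution are more involved than this.

```python
import numpy as np

class SpikeEncoder:
    """Toy integer spike-count encoder with an adaptive threshold.
    The threshold tracks a running scale of the activations so that neurons
    neither over-spike (threshold too low) nor go silent (threshold too high)."""

    def __init__(self, scheme="ternary", max_count=4, momentum=0.9):
        self.scheme = scheme          # "binary", "ternary", or plain integer counts
        self.max_count = max_count
        self.momentum = momentum
        self.theta = 1.0              # scalar threshold here; per-channel in practice

    def __call__(self, x):
        # Nudge the threshold toward the mean activation magnitude.
        self.theta = self.momentum * self.theta + \
            (1 - self.momentum) * float(np.abs(x).mean() + 1e-8)
        counts = np.round(x / self.theta).astype(int)
        if self.scheme == "binary":       # {0, 1}: spike iff activation exceeds threshold
            return (counts > 0).astype(int)
        if self.scheme == "ternary":      # {-1, 0, 1}: adds an inhibitory spike
            return np.clip(counts, -1, 1)
        return np.clip(counts, -self.max_count, self.max_count)  # integer counts

rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 16))       # pretend these are layer activations
enc = SpikeEncoder(scheme="ternary")
spikes = enc(acts)
print(spikes[0], f"sparsity={float((spikes == 0).mean()):.0%}, threshold={enc.theta:.2f}")
```

Zeros dominate the encoded tensor, which is where the reported sparsity, and hence the potential energy savings on event-driven hardware, comes from.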
- System engineering on non-NVIDIA hardware (MetaX GPUs)
  - They implement operator libraries, a custom training framework, communication strategies, and parallelism schemes, all tuned for a MetaX C550 GPU cluster (arXiv).
  - They adapt CUDA / Triton operators and parallelize across devices in several ways (data parallelism, expert parallelism, sequence parallelism, and so on) (arXiv). A toy illustration of these partitioning axes follows this list.
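As a rough picture of how those parallelism axes divide the work, here is a toy partitioning of a long sequence and a set of MoE experts over a small device grid. The group sizes and layout are invented for illustration; the real system maps these axes onto the MetaX cluster with tuned communication and overlap.

```python
# Toy parallelism layout: data / sequence / expert axes (sizes are assumed).
n_data, n_seq, n_exp = 2, 4, 2
devices = [(d, s, e) for d in range(n_data) for s in range(n_seq) for e in range(n_exp)]

seq_len, n_experts = 131_072, 16
tokens_per_device = seq_len // n_seq       # sequence parallelism: shard the context
experts_per_device = n_experts // n_exp    # expert parallelism: shard the MoE experts

for d, s, e in devices[:4]:
    # each data-parallel index d would also see a different micro-batch
    print(f"device(data={d}, seq={s}, exp={e}): "
          f"tokens [{s * tokens_per_device}, {(s + 1) * tokens_per_device}), "
          f"experts [{e * experts_per_device}, {(e + 1) * experts_per_device})")
```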
What they achieved
Some of their results and metrics:
- SpikingBrain-7B recovers a large fraction of its base model's performance on standard benchmarks and matches many open-source Transformer baselines, while being more efficient (arXiv).
- For very long inputs, SpikingBrain-7B shows large speedups in Time to First Token (TTFT): at 1 million tokens, TTFT is ~26.5× faster than a comparable baseline, and extrapolating to 4 million tokens they estimate over 100× speedup (arXiv).
- The spiking scheme reaches over 69% sparsity, which helps reduce energy use (arXiv).
- Their conversion-based continual pretraining (CPT) uses only ~150B tokens, far less than the ~10T tokens used in many large-scale training runs, so they reach good performance with much less data (arXiv).
Significance / why it matters
- It shows a promising path toward more efficient large models, especially for tasks that need long context (long documents, code, logs, and so on).
- It points toward combining ideas from neuroscience (sparsity, spiking, threshold adaptation) with pragmatic engineering and modern LLM frameworks.
- Being able to do this on non-NVIDIA hardware also matters for hardware diversity, cost, and supply.
- It could help reduce the energy use, carbon footprint, and compute cost of large models.
Limitations & open questions
- Although they close much of the performance gap, some drop remains, especially for the purely linear 7B model compared with full Transformer baselines (arXiv).
- The inference gains are large for very long sequences; for shorter sequences and general usage they are likely more modest.
- The spike / event-driven benefits are largely theoretical without hardware that supports asynchronous or neuromorphic computation. On GPUs many of these benefits are not fully realized; on general-purpose hardware the event-driven behaviour is partially simulated rather than natively exploited (arXiv).
- Dataset and domain coverage, alignment, safety, and multilingual / cross-domain generalization remain to be validated.

