" MicromOne: SpikingBrain: Brain-Inspired Large Models That Learn Efficiently

Pagine

SpikingBrain: Brain-Inspired Large Models That Learn Efficiently

 

A recent paper presents SpikingBrain, a new family of large language models (LLMs) inspired by neuroscience, aimed at making training and inference more efficient, especially for very long sequences. arXiv

The problem they address

  • Traditional Transformer-based LLMs scale poorly with sequence length: training compute grows roughly quadratically with sequence length, and inference memory (e.g. the KV cache) grows with context as well. arXiv

  • In addition, most large-scale LLM development targets NVIDIA GPUs; training at this scale on non-NVIDIA hardware introduces extra engineering challenges. arXiv

  • Energy consumption and resource demands are also substantial. The authors want models that handle long contexts while using less data and compute, and that behave more like brains: sparse, event-driven, and efficient. arXiv

Their approach: key innovations

They combine several strategies inspired by brain/neural computation to get efficient and scalable LLMs:

  1. Hybrid Linear / Hybrid Attention architectures

    • They use linear attention (which has linear complexity in sequence length) together with sliding-window/local attention to get good trade-offs between efficiency and quality (a minimal complexity sketch follows this item). arXiv

    • They build two models: SpikingBrain-7B, a 7-billion-parameter linear model focused on long-context efficiency, and SpikingBrain-76B, a hybrid-linear Mixture-of-Experts model that balances performance and capacity. arXiv
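
To make the complexity contrast concrete, here is a minimal sketch (not the paper's actual kernels): standard softmax attention materializes an n×n score matrix, while a kernelized linear-attention form accumulates a d×d state and scales linearly in sequence length. The elu(x)+1 feature map, the non-causal formulation, and the single-head, unbatched shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: the (n, n) score matrix makes cost quadratic in sequence length n.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized (non-causal) attention with phi(x) = elu(x) + 1, one common choice assumed here.
    # Cost is O(n * d^2): a (d, d) running summary replaces the (n, n) score matrix.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = phi_k.transpose(-1, -2) @ v                                      # (d, d) summary of keys and values
    norm = phi_q @ phi_k.sum(dim=0, keepdim=True).transpose(-1, -2) + eps # (n, 1) normalizer
    return (phi_q @ kv) / norm

n, d = 1024, 64                                                           # toy sizes: single head, no batch
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```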

  2. Conversion-based training

    • Instead of training completely from scratch, they start from a pre-trained Transformer checkpoint (here, Qwen2.5-7B) and convert it to their efficient hybrid-attention + spiking-neuron architecture. arXiv

    • They gradually extend sequence length during training (from 8k, to 32k, to 128k tokens) in stages. arXiv

    • They also use MoE upcycling: replicating dense feed-forward networks into sparse experts to add capacity without a proportional increase in compute or memory (sketched after this item). arXiv
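
A minimal sketch of the MoE-upcycling idea: clone a trained dense feed-forward block into several experts and route each token only to its top-k experts, so parameter count grows without a proportional increase in per-token compute. The expert count, top-2 routing, and module layout below are illustrative assumptions, not the paper's configuration.

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    """Turn one dense FFN into `num_experts` copies with a learned top-k router."""

    def __init__(self, dense_ffn: nn.Module, d_model: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Upcycling: every expert starts as a clone of the pretrained dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        weights = torch.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)     # renormalize the kept routing weights
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts (sparse compute).
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

d_model = 64
dense = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
moe = UpcycledMoE(dense, d_model)
print(moe(torch.randn(8, d_model)).shape)                   # torch.Size([8, 64])
```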

  3. Spiking / adaptive neuron model

    • They define a “spiking” scheme that converts activations into integer spike counts, so that inference can, in principle, use event-driven sparse operations. arXiv

    • They propose an adaptive threshold on neurons so that they neither over-spike nor fall silent. arXiv

    • They experiment with several spike-encoding schemes: binary {0,1}, ternary {−1,0,1}, bit-wise, and so on, trading off expressivity against time steps and sparsity (see the sketch after this item). arXiv
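
A minimal sketch of the spike-coding and adaptive-threshold ideas above: activations are quantized into integer spike counts relative to a threshold that adapts to the activation scale, with optional binary or ternary variants. The specific thresholding rule here is an illustrative assumption rather than the paper's formulation.

```python
import torch

def spike_encode(x: torch.Tensor, mode: str = "ternary"):
    """Convert real activations into integer spike counts with an adaptive threshold."""
    # Adaptive threshold: scale with the mean activation magnitude so the layer
    # is neither mostly silent nor saturated (illustrative rule, not the paper's).
    threshold = x.abs().mean().clamp(min=1e-8)
    counts = torch.round(x / threshold).to(torch.int64)     # signed integer spike counts
    if mode == "binary":                                    # {0, 1}
        counts = (counts > 0).to(torch.int64)
    elif mode == "ternary":                                 # {-1, 0, 1}
        counts = counts.clamp(-1, 1)
    # mode == "count": keep full integer counts (an event-driven runtime would expand
    # them bit-wise or over time steps).
    sparsity = (counts == 0).float().mean().item()
    return counts, threshold, sparsity

acts = torch.randn(4, 8) * 0.5
spikes, thr, sparsity = spike_encode(acts, mode="ternary")
print(spikes.unique(), f"threshold={thr.item():.3f}", f"sparsity={sparsity:.2f}")
```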

  4. System engineering on non-NVIDIA hardware (MetaX GPUs)

    • They implement operator libraries, custom training frameworks, communication strategies, parallelism, etc., all tuned for the MetaX C550 GPU cluster. arXiv

    • They also adapt CUDA / Triton operators and parallelize across devices in several ways (data parallelism, expert parallelism, sequence parallelism, etc.). arXiv

What they achieved

Some of their results and metrics:

  • SpikingBrain-7B performs comparably to many open-source Transformer baselines while being more efficient, recovering a large fraction of the base model’s performance on standard benchmarks. arXiv

  • For very long inputs, SpikingBrain-7B shows large speedups in Time to First Token (TTFT). At 1 million tokens, TTFT is ~26.5× faster than a comparable baseline; extrapolating to 4 million tokens, they estimate over 100× speedup. arXiv

  • They report over 69% sparsity with the spiking scheme, which helps reduce energy use. arXiv

  • Their conversion-based continual pretraining (CPT) uses only ~150B tokens, far less than the ~10T tokens used in many large-scale training runs (roughly 1.5% as many tokens), so they reach good performance with far less data. arXiv

Significance / why it matters

  • This shows a promising path toward making large models more efficient, especially for tasks that need long context (e.g. long documents, code, or logs).

  • It points toward combining ideas from brain/neuroscience (sparsity, spiking, threshold adaptation) with pragmatic engineering and modern LLM frameworks.

  • Being able to do this on non-NVIDIA hardware also matters for hardware diversity, cost, and supply.

  • Could help reduce energy usage, carbon footprint, and compute costs of large models.

Limitations & open questions

  • Although they close much of the performance gap, some drop in performance remains, especially for the purely linear 7B model compared with full Transformer baselines. arXiv

  • The inference gains for very long sequences are large, but for “shorter” sequences or general usage, gains might be more modest.

  • The spike/event-driven benefits remain largely theoretical without hardware that supports asynchronous or neuromorphic computation; on GPUs and other general-purpose hardware, the event-driven behavior is partly “simulated” rather than natively exploited. arXiv

  • Dataset and domain coverage, alignment, safety, and multilingual / cross-domain generalization remain to be validated.