Training large neural networks is famously slow and expensive. In our recent paper, Accelerating Training with Neuron Interaction and Nowcasting Networks, presented at ICLR 2025 in Singapore, we introduced a new way to speed things up: treat a neural network as what it really is — a graph of interacting neurons (or “neural graph”).

Due to massive training costs, recent years have seen a surge of faster adaptive optimizers: Shampoo, Muon, SOAP, and others. Each improves on Adam through smarter scaling and/or orthogonalization of the parameter updates for a given weight matrix, often inspired by second-order methods or preconditioning. But they work at the level of parameters or layers, not the network as a whole. Moreover, they do not learn from previous optimization runs, i.e. these algorithms are based on manually designed gradient-descent rules.
Our method, NiNo (Neuron Interaction and Nowcasting), is different. We model neurons as nodes and weights as edges, and use a graph neural network (GNN) to predict how weights will evolve. This lets us “nowcast” future parameters and reduce the number of steps required to reach the same performance metric.
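To make the neural-graph view concrete, here is a minimal sketch (not the paper's implementation) of turning a small 2-layer MLP into a graph: each neuron becomes a node, and each weight becomes an edge whose feature is the weight value. The layer sizes and tensor names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(16, 8)   # layer 1: 8 input neurons -> 16 hidden neurons (rows = outputs)
W2 = torch.randn(4, 16)   # layer 2: 16 hidden neurons -> 4 output neurons

n_in, n_hid, n_out = 8, 16, 4
n_nodes = n_in + n_hid + n_out  # one node per neuron

# edge_index: 2 x num_edges (source node id, target node id);
# edge_attr: the connecting weight as the edge feature
src1, dst1 = torch.meshgrid(torch.arange(n_in),
                            torch.arange(n_hid) + n_in, indexing='ij')
src2, dst2 = torch.meshgrid(torch.arange(n_hid) + n_in,
                            torch.arange(n_out) + n_in + n_hid, indexing='ij')
edge_index = torch.cat([torch.stack([src1.flatten(), dst1.flatten()]),
                        torch.stack([src2.flatten(), dst2.flatten()])], dim=1)
# W.t().flatten() lists weights in input-major order, matching the meshgrid above
edge_attr = torch.cat([W1.t().flatten(), W2.t().flatten()])
```

A GNN operating on `(edge_index, edge_attr)` can then exploit symmetries of the network (e.g. hidden neurons can be permuted without changing the function), which flat parameter vectors cannot express.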
So NiNo is essentially a GNN that takes a history of past parameter values along the optimization trajectory (which can be obtained with Adam or another optimizer) and makes a jump by predicting future parameter values. After the prediction, optimization continues with Adam, followed by another NiNo prediction, and so on. This basic idea is borrowed from the Weight Nowcaster Network (WNN), which revealed predictable patterns in optimization trajectories, but neural graphs combined with GNNs make it work really well.
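The alternating schedule can be sketched on a toy problem. In this hypothetical stand-in, a simple linear extrapolation of the last two parameter states replaces NiNo's GNN prediction; the period, learning rate, and toy objective are all illustrative assumptions, not the paper's settings.

```python
import collections
import torch

torch.manual_seed(0)
w = torch.zeros(3)                       # toy "parameters"
target = torch.tensor([1.0, -2.0, 0.5])
opt = torch.optim.Adam([w.requires_grad_()], lr=0.05)

history = collections.deque(maxlen=5)    # past parameter states along the trajectory
period = 20                              # "nowcast" every 20 steps (NiNo's default is 1000)

for step in range(200):
    if step > 0 and step % period == 0 and len(history) >= 2:
        # the jump: predict future parameters from the trajectory history
        # (linear extrapolation here; NiNo uses a learned GNN instead)
        with torch.no_grad():
            w.copy_(history[-1] + (history[-1] - history[-2]) * 2.0)
    else:
        # regular base-optimizer step (the vast majority of steps)
        opt.zero_grad()
        loss = ((w - target) ** 2).sum()
        loss.backward()
        opt.step()
    history.append(w.detach().clone())

final_loss = ((w.detach() - target) ** 2).sum().item()
```

Because the base optimizer runs between jumps, any overshoot from a prediction is corrected by the subsequent ordinary steps.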
As NiNo is itself a neural network, it needs to be trained before we can use it to speed up optimization. To do so, we collected and publicly released a dataset of checkpoints with optimization trajectories on four vision and language tasks. Even though collecting these checkpoints and training NiNo is computationally expensive, this one-time cost is amortized: the same trained NiNo can potentially be used on millions of tasks, eventually saving far more compute than it took to train.
To make NiNo work well for LLMs and Transformers in general, it was critical to carefully construct the neural graph for multi-head self-attention; this sets it apart from WNN, which ignores the network structure. This is a tricky part, as the illustration below shows, but we implemented it for many different Transformer layers.
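A hypothetical illustration (not the paper's exact construction) of why attention is tricky: the packed query projection mixes all heads in one matrix, so the neural graph must split it per head. Each head gets its own block of "query" neurons, all fed by the same shared input neurons; the dimensions below are made-up.

```python
import torch

d_model, n_heads = 8, 2
d_head = d_model // n_heads
Wq = torch.randn(d_model, d_model)  # packed query projection (rows = outputs for all heads)

# rows [h * d_head, (h + 1) * d_head) belong to head h
Wq_per_head = Wq.reshape(n_heads, d_head, d_model)

# nodes: d_model shared input neurons plus d_head query neurons per head
n_nodes = d_model + n_heads * d_head
# edges: every input neuron connects to every query neuron of every head
n_edges = d_model * d_head * n_heads

# treating Wq as one undifferentiated d_model x d_model block would erase
# which head each query neuron belongs to
assert Wq_per_head[1].equal(Wq[d_head:2 * d_head])
```

The same per-head grouping applies analogously to the key and value projections.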

Our work connects to the broader “learning to optimize” literature, including seminal works by Luke Metz et al. such as VeLO: Training Versatile Learned Optimizers by Scaling Up. Unlike many learned optimizers, which struggle with cost and stability or show only a ~1.2–1.3x speedup¹, NiNo is conceptually lightweight and stable: it is applied only every 1,000 steps (by default), while any base optimizer, such as Adam, handles all the other steps and allows for stable convergence. At the same time, recent learned optimizers such as our Celo: Training Versatile Learned Optimizers on a Compute Diet and μLO: Compute-Efficient Meta-Generalization of Learned Optimizers take a significant step toward making learned optimizers more cost-effective and stable (i.e. without big loss spikes) to train and use.
¹While a ~1.2–1.3x speedup is remarkable, in practice it usually doesn’t justify the immense amount of extra work (e.g. efficient distributed implementation, tuning, potential instabilities especially in mixed/low-bit precision, investigating unexpected side effects like overfitting or poor generalization) required for actual large-scale usefulness.

Even though NiNo shows great speedups, it requires further work, and we see many exciting paths ahead.
We’ve open-sourced our code at github.com/SamsungSAILMontreal/nino under the MIT License and welcome contributions. As we show below, applying NiNo is straightforward:
```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM
from optim import NiNo

model = AutoModelForCausalLM.from_config(...)  # some model

# NiNo is implemented as a wrapper around the base optimizer
# any optimizer other than Adam should also be possible to use with NiNo
opt = NiNo(base_opt=torch.optim.AdamW(model.parameters(),
                                      lr=1e-3,
                                      weight_decay=1e-2),
           ckpt='checkpoints/nino.pt',
           subgraph=False,       # can be set to True for larger models (see Llama 3.2 example below)
           edge_sample_ratio=0,  # can be set to a small positive number for larger models (see Llama 3.2 example below)
           model=model,
           period=1000,
           max_train_steps=10000)

for step in range(10000):
    if opt.need_grads:  # True/False based on the step number and period
        opt.zero_grad()  # zero out gradients
        data, targets = ...  # get some batch of data
        # base optimizer step (majority of the time)
        outputs = model(data)  # forward pass
        loss = F.cross_entropy(outputs, targets)  # compute some loss
        loss.backward()  # only compute gradients for the base optimizer
    opt.step()  # base_opt step or nowcast params every 1000 steps using NiNo
    ...
```
To conclude, by recognizing that neural networks are networks — graphs of interacting parts — we may be opening the door to a new generation of efficient and learnable optimizers.
See my website for my other latest research and updates (if you are a student passionate about this kind of topic, contact me for a potential collaboration).
Updated on Sep 30, 2025.