Merging ViTs and LLMs using a pretrained (graph) neural net.
In AI, the dominant approach to developing neural nets is to first pretrain a large model on a big general dataset and then fine-tune it on specific tasks. In vision, there are, for example, Vision Transformers (ViTs) pretrained on ImageNet and fine-tuned on downstream tasks (e.g., classification of satellite images, textures, etc.). In language modeling, there are, for example, large language models (LLMs) specialized in Math, Coding or specific languages (e.g., French). Even though the original pretrained model can also perform well on many tasks, fine-tuning remains essential for optimal performance.
As AI tackles an increasing number of tasks, fine-tuning costs and storage requirements add up. Consider a situation where we first collected a lot of Math data and fine-tuned an LLM on it, then some other lab released a model excelling at Code, and another lab released a model excelling at French. First, we have now tripled the number of models to store and maintain. Second, despite having three models, we cannot effectively solve mixed tasks, such as a Math task in French, since no single model may have been trained on such a mixture. To solve the issues in this example, we could collect a desired mix of data and fine-tune the model on it, but this is costly and time-consuming, especially if we need to change the mixing ratio. We could also build a smart ensemble with routing conditioned on the input, but doing this effectively is not trivial and still requires some training of the router and storing an entire ensemble.
Model merging is a recent training-free approach to combining multiple models into one. As many papers and our own experiments show, model merging can mitigate the limitations of naive fine-tuning in a simple yet effective way.
In this post, we introduce a new model merging method called MetaMerge.
MetaMerge uses a pretrained graph neural network (GNN) that takes the weights of multiple models as input and produces the weights of a single merged model as output, without any training or fine-tuning.
Before introducing MetaMerge and why it is called “Meta”, let us briefly describe a common baseline approach and several key concepts.
In general, model merging can be defined as some aggregation function $f$ that takes weights of $N$ models as input and produces the weights of a single merged model as output:
\[\mathbf{W}_{\text{merged}} = f \big( \mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_N \big),\]where \(\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_N\) are the weights of the individual models, and \(\mathbf{W}_{\text{merged}}\) are the weights of the merged model.
Most merging methods implement $f$ directly in the weight space. For example, the simplest way to merge models is to average their weights:
\[\mathbf{W}_{\text{merged}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{W}_i.\]More advanced methods include task arithmetic.
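To make the weight-space view concrete, here is a minimal sketch of uniform averaging and task arithmetic over toy state dicts (NumPy arrays stand in for real model weights; the function names and the scaling coefficient `lam` are our own illustration, not code from any merging library):

```python
import numpy as np

def average_merge(state_dicts):
    # Uniform averaging: W_merged = (1/N) * sum_i W_i, applied per parameter.
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}

def task_arithmetic_merge(pretrained, finetuned, lam=0.4):
    # Task arithmetic: W_merged = W_0 + lam * sum_i (W_i - W_0),
    # i.e., add scaled "task vectors" to the pretrained weights.
    return {k: w0 + lam * sum(sd[k] - w0 for sd in finetuned)
            for k, w0 in pretrained.items()}
```

Both methods are purely element-wise in the weight space, which is what makes them training-free and cheap to apply.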
The key concept behind MetaMerge is meta networks: neural networks that take the parameters of other neural networks as input. Among various meta networks, we focus on NiNo, a graph neural network originally trained to accelerate optimization by periodically predicting future parameter values from the past $c$ checkpoints of a training run.
This prediction step is applied every 1,000 steps, while Adam or SGD is used for the rest of the steps. So, the original NiNo that we are going to use is not trained for any model merging objective, but for accelerating training. However, this does not mean it cannot be used for merging, as we show.
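The periodic schedule can be sketched as follows; note that the "nowcast" jump here is a toy linear extrapolation standing in for the real NiNo prediction, and all names are hypothetical:

```python
from collections import deque

def train_with_nowcasting(num_steps, c=5, period=1000):
    # Toy training loop: a plain gradient-like update on most steps, and
    # every `period` steps a "nowcast" jump predicted from the last `c`
    # checkpoints. The jump below is a linear extrapolation stand-in for
    # NiNo, NOT the real model.
    w = 0.0
    history = deque(maxlen=c)
    nowcast_steps = 0
    for step in range(1, num_steps + 1):
        w -= 0.01                 # placeholder for an Adam/SGD update
        history.append(w)
        if step % period == 0 and len(history) == c:
            # extrapolate the recent checkpoint trajectory forward in time
            w = history[-1] + (history[-1] - history[0])
            nowcast_steps += 1
    return w, nowcast_steps
```

The point of the sketch is only the schedule: the meta network sees a short window of checkpoints and replaces the current parameters with a predicted future point, then regular optimization resumes.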
The only challenge when adapting NiNo to the merging setup is preparing its input. Specifically, NiNo expects $c$ input models in an order corresponding to an optimization trajectory (where $c=5$ in the pretrained NiNo models). One straightforward way to address this is to construct the “trajectory” as [pretrained model, model fine-tuned for task 1, model fine-tuned for task 2, ...]. Since the tasks have no particular order and there may be fewer or more than $c-1$ of them, certain heuristics are needed. In particular, for the setup with two tasks, we found the following heuristic to work well in practice:
For the setup with four tasks, the equation is more straightforward since we have exactly 5 models to feed into NiNo:
\[W_{\text{merged}} = \text{NiNo}([W_0, W_{\text{task1}},W_{\text{task2}},W_{\text{task3}},W_{\text{task4}}]).\]In principle, the order of the tasks can be optimized, but we fixed it after some trial and error.
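Assembling the NiNo input for the four-task case then amounts to stacking each parameter across the five models along a new leading “trajectory” axis, in the fixed order [pretrained, task1, ..., task4]. A sketch (the helper name is ours, and the real NiNo preprocessing also involves building a neuron graph, which we omit here):

```python
import numpy as np

def build_nino_input(pretrained, task_models):
    # Stack each parameter across models into shape (c, *param_shape),
    # ordered as [pretrained, task1, ..., taskN] -- the "trajectory" axis
    # that NiNo consumes as its sequence of past checkpoints.
    models = [pretrained] + list(task_models)
    return {k: np.stack([m[k] for m in models], axis=0)
            for k in pretrained}
```

With four tasks this yields exactly $c=5$ entries along the leading axis, so no repetition heuristic is needed.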
We evaluate MetaMerge on two vision and two language setups. For vision, we use a ViT-B-16 pretrained on ImageNet and fine-tuned on the DTD, RESISC45, MNIST and SVHN datasets. This is a subset of the standard 8 tasks used in model merging papers.
For ViT-B-16, we use the Task Arithmetic code and their checkpoints for evaluation, which contain trained classification heads for all the tasks, so merging is done only for the backbone.
Table 1. Results using ViT-B-16 on 2 tasks.
| Model | N models | DTD | RESISC45 | Avg |
|---|---|---|---|---|
| Zero Shot (pretrained) | 1 | 44.68 | 66.38 | 55.53 |
| Finetuned DTD | 1 | 82.07 | 50.54 | 66.31 |
| Finetuned RESISC45 | 1 | 36.44 | 96.89 | 66.67 |
| Finetuned (per task) | 2 | 82.07 | 96.89 | 89.48 |
| Merged Model (avg merge) | 1 | 72.66 | 94.13 | 83.39 |
| Merged Model (meta merge, mlp) | 1 | 78.99 | 90.43 | 84.71 |
| Merged Model (meta merge) | 1 | 76.76 | 91.60 | 84.18 |
Table 2. Results using ViT-B-16 on 4 tasks.
| Model | N models | DTD | RESISC45 | MNIST | SVHN | Avg |
|---|---|---|---|---|---|---|
| Zero Shot (pretrained) | 1 | 44.68 | 66.38 | 51.73 | 51.99 | 53.70 |
| Finetuned (per task) | 4 | 82.07 | 96.89 | 99.76 | 97.86 | 94.15 |
| Merged Model (avg merge) | 1 | 57.18 | 84.06 | 98.55 | 87.28 | 81.77 |
| Merged Model (meta merge, mlp) | 1 | 30.90 | 30.90 | 98.72 | 97.60 | 62.44 |
| Merged Model (meta merge) | 1 | 46.54 | 78.46 | 98.43 | 96.23 | 79.92 |
Surprisingly, on 2 tasks (Table 1) the ablated NiNo variant (without the GNN part) actually outperforms the full NiNo, even though the full model excelled at accelerating training in the NiNo paper’s ablations. One plausible explanation is that merging is quite different from predicting future parameters, so good results on merging are not expected in the first place. Also, the MLP version of NiNo has a simplicity inductive bias, so its predictions can be more generic (a kind of trend prediction that may be closer to weight averaging) and less overfitted to the training objective of accelerating training. On 4 tasks (Table 2), however, the ablated variant performs much worse. In future work, it would be interesting to push NiNo’s performance specifically for merging; however, defining a proper objective for that is not trivial.
Below we visualize detailed results from Table 2 for all the tasks and models.
In the language setup, we first fine-tuned base Qwen3 models on the train split of GSM8K and a subset of luth-sft, which are Math and French datasets, respectively. We chose to fine-tune our own models instead of using existing checkpoints to have a more controlled, consistent setup across model sizes.
In the Qwen3 experiments, we do not have a model fine-tuned on GSM8K-Fr, to showcase a common scenario where a training dataset for the mixed task is unavailable, as is typical for rare mixes. Therefore, Qwen3 (fine-tuned per task) only contains two models. Models with “Math-Fr” in their name are obtained by merging the Math and French models using either simple averaging or MetaMerge. The code to merge Qwen models using MetaMerge is provided in merge_qwen.py.
Table 3. Results using Qwen3-0.6B on 3 tasks.
| Model | N models | GSM8K | French | GSM8K-Fr | Avg |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | 21.0 | 24.4 | 19.6 | 21.7 |
| Qwen3-0.6B-Math | 1 | 46.3 | 25.4 | 29.2 | 33.6 |
| Qwen3-0.6B-Fr | 1 | 36.1 | 26.5 | 26.5 | 29.7 |
| Qwen3-0.6B (fine-tuned per task) | 2 | 46.3 | 26.5 | 26.5 | 33.1 |
| Qwen3-0.6B-Math-Fr (avg merge) | 1 | 48.4 | 27.4 | 33.9 | 36.6 |
| Qwen3-0.6B-Math-Fr (meta merge, mlp) | 1 | 47.8 | 25.8 | 30.9 | 34.8 |
| Qwen3-0.6B-Math-Fr (meta merge) | 1 | 45.1 | 25.7 | 31.6 | 34.1 |
Table 4. Results using Qwen3-1.7B on 3 tasks.
| Model | N models | GSM8K | French | GSM8K-Fr | Avg |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | 20.6 | 26.2 | 20.2 | 22.3 |
| Qwen3-1.7B-Math | 1 | 62.1 | 28.3 | 41.5 | 43.9 |
| Qwen3-1.7B-Fr | 1 | 60.9 | 32.8 | 43.9 | 45.9 |
| Qwen3-1.7B (fine-tuned per task) | 2 | 62.1 | 32.8 | 43.9 | 46.3 |
| Qwen3-1.7B-Math-Fr (avg merge) | 1 | 64.0 | 31.4 | 46.9 | 47.4 |
| Qwen3-1.7B-Math-Fr (meta merge, mlp) | 1 | 63.0 | 28.3 | 43.8 | 45.0 |
| Qwen3-1.7B-Math-Fr (meta merge) | 1 | 62.7 | 28.4 | 43.2 | 44.8 |
For Qwen3-0.6B, simple averaging outperforms MetaMerge, and as in the ViT experiments on 2 tasks, the MLP version of NiNo outperforms the full NiNo, potentially due to its simplicity inductive bias. One possible explanation for the gap to averaging is that the pretrained NiNo has only seen tiny GPT2-style transformers (with <=1.6M parameters) during training, so making predictions for a Qwen architecture with 600M parameters is an extreme out-of-distribution scenario for NiNo. Qwen3-1.7B is an even more severe out-of-distribution scenario, so it is not surprising that MetaMerge does not perform as well as simple averaging there either. Despite the average benchmark scores being worse or comparable, an interesting question for future studies is whether MetaMerge produces functionally different models than simple averaging or the ablated NiNo variant.
In this demo, we compare weight averaging to MetaMerge on ViT using 4 tasks. We sampled 100 weight positions per layer (the same positions for all models) to enable efficient visualization. For each position, parameters are ordered in the way we pass them to NiNo (i.e., pretrained, DTD, RESISC45, MNIST, SVHN). Since NiNo was trained to predict future parameters, we show the MetaMerge prediction as the last point on the trajectory. The weight averaging baseline is shown as a horizontal line for comparison.
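A sketch of the sampling step, assuming each model is a dict of arrays (the helper name is ours): per layer, the same flat positions are sampled once and then gathered from every model, so each sampled position yields one short trajectory across the model sequence:

```python
import numpy as np

def sample_trajectories(models, n_positions=100, seed=0):
    # For each layer, sample the SAME flat positions in every model, so each
    # position gives a trajectory across the model sequence
    # (e.g., [pretrained, dtd, resisc45, mnist, svhn]).
    rng = np.random.default_rng(seed)
    out = {}
    for k in models[0]:
        flat_size = models[0][k].size
        idx = rng.choice(flat_size, size=min(n_positions, flat_size),
                         replace=False)
        # resulting shape per layer: (n_models, n_positions)
        out[k] = np.stack([m[k].reshape(-1)[idx] for m in models], axis=0)
    return out
```

Fixing the seed and reusing the same indices across models is what makes the per-position trajectories comparable.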
MetaMerge is used for all layers in the visual backbone (those whose names start with “visual.”), except for conv1, as explained in our merge_vit.py code. For conv1 and for layers not starting with “visual.”, we use weight averaging, so in such cases the “meta-merge” point in the visualization coincides with the “average” point. Overall, the visualization shows that there is no simple relationship between the input models and the MetaMerge prediction. But it may give some insight into problematic behavior (e.g., for biases, the MetaMerge predictions tend to be too far from the overall trajectory).
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0, unless noted otherwise.
@misc{knyazev2026metamerge,
  title={MetaMerge: Model Merging with Meta Networks},
  author={Boris Knyazev and Albert M. Orozco Camacho},
  year={2026},
  url={https://bknyaz.github.io/blog/2026/metamerge/}
}