WeightFlow: Learning Stochastic Dynamics via Evolving Weight of Neural Network

💐 Paper accepted as AAAI'26 Oral

Ruikun Li1, Jiazhen Liu2, Huandong Wang2*, Qingmin Liao1, Yong Li2

1 Shenzhen International Graduate School, Tsinghua University,
2 Department of Electronic Engineering, BNRist, Tsinghua University

* Corresponding Author (wanghuandong@tsinghua.edu.cn)

arXiv Paper | Appendix | Code
WeightFlow Conceptual Overview
Learning stochastic dynamics: WeightFlow projects the evolution of probability distributions (middle) from stochastic state trajectories (top) into a continuous path in the parameterized weight space of a neural network (bottom).

🔍 Highlights

⚙️ WeightFlow Framework

WeightFlow models the neural network weights as a graph and employs a graph neural differential equation to learn the continuous dynamics of this weight graph. The framework consists of two main parts:

1. Backbone ($\theta_t$): A backbone network with parameters $\theta_t$ models the static probability distribution at time $t$ using an autoregressive factorization:

\[ p(x,t)=p_{\theta_{t}}(x)=\prod_{i=1}^{d}p_{\theta_{t}}(x_{i}|x_{1},...,x_{i-1}) \]

2. Hypernetwork ($g_{\phi}$): A graph hypernetwork $g_{\phi}$ then models the continuous evolution of these weights $\theta_t$ as a Controlled Differential Equation (CDE), as illustrated in the code sketch below:

\[ \theta_{\tau}=\theta_{0}+\int_{0}^{\tau}g_{\phi}(\theta_{t},t)\frac{dZ_{t}}{dt}dt \]
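For concreteness, here is a minimal PyTorch sketch of both components. Everything in it is illustrative: the sizes (`d`, `L`, `h`), the plain MLP standing in for the paper's graph hypernetwork, and a control path $Z_t$ assumed linear in $t$ are our assumptions, not the released implementation.

```python
# Minimal sketch (illustrative only): an autoregressive backbone p_theta(x)
# and a hypernetwork that evolves its flattened weight vector theta_t.
# A plain MLP stands in for the paper's graph hypernetwork, and Z_t is
# assumed linear in t, so dZ/dt is a constant folded into g_phi.
import torch
import torch.nn as nn

d, L, h = 10, 4, 8  # system dimensions, states per dimension, hidden size

class Backbone(nn.Module):
    """p_theta(x) = prod_i p_theta(x_i | x_1, ..., x_{i-1})."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(L + 1, h)      # +1 for a BOS token
        self.rnn = nn.GRUCell(h, h)
        self.head = nn.Linear(h, L)            # logits over L states

    def log_prob(self, x):                     # x: (batch, d) integer states
        tok = torch.full((x.shape[0],), L)     # start from BOS
        hid = torch.zeros(x.shape[0], h)
        logp = torch.zeros(x.shape[0])
        for i in range(d):                     # O(d) sequential steps
            hid = self.rnn(self.emb(tok), hid)
            logits = self.head(hid).log_softmax(-1)
            logp = logp + logits.gather(1, x[:, i:i+1]).squeeze(1)
            tok = x[:, i]
        return logp

backbone = Backbone()
theta_0 = nn.utils.parameters_to_vector(backbone.parameters()).detach()

# Hypernetwork g_phi(theta_t, t); Euler steps approximate the CDE integral.
n = theta_0.numel()
g_phi = nn.Sequential(nn.Linear(n + 1, 64), nn.Tanh(), nn.Linear(64, n))

def evolve(theta, tau, steps=20):
    dt = tau / steps
    t = torch.zeros(1)
    for _ in range(steps):
        theta = theta + g_phi(torch.cat([theta, t])) * dt  # dZ/dt == 1 here
        t = t + dt
    return theta

# Load the evolved weights back into the backbone for inference. (Training
# would instead use a functional call, e.g. torch.func.functional_call,
# so gradients can flow into g_phi.)
nn.utils.vector_to_parameters(evolve(theta_0, tau=1.0), backbone.parameters())
```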

WeightFlow Framework Diagram
The framework of WeightFlow, illustrating the backbone, weight graph, path projection, and the Graph Neural Differential Equation.

📊 Experimental Results

We empirically evaluate WeightFlow on a diverse set of simulated and real-world stochastic dynamics, demonstrating its superior performance and robustness.

Simulated Datasets: Discrete State Systems

We first benchmarked WeightFlow against several state-of-the-art baselines on five discrete stochastic systems. As shown in Table 1, WeightFlow significantly outperforms all baselines, reducing the Wasserstein ($\mathcal{W}$) and Jensen-Shannon (JSD) distances by 32.04% and 53.99% on average, respectively (a sketch of how both metrics are computed from samples follows the table).

Table 1: Statistical results on various stochastic dynamical systems. (All values $\times 10^{-1}$; lower is better.)

| Model | Epidemic $\mathcal{W}\downarrow$ | Epidemic JSD $\downarrow$ | Toggle Switch $\mathcal{W}\downarrow$ | Toggle Switch JSD $\downarrow$ | Signalling Cascade 1 $\mathcal{W}\downarrow$ | Signalling Cascade 1 JSD $\downarrow$ | Signalling Cascade 2 $\mathcal{W}\downarrow$ | Signalling Cascade 2 JSD $\downarrow$ | Ecological Evolution $\mathcal{W}\downarrow$ | Ecological Evolution JSD $\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Latent SDE | 3.14±0.25 | 4.22±0.26 | 2.34±0.15 | 1.27±0.12 | 3.04±0.17 | 0.85±0.14 | 3.59±0.13 | 1.02±0.06 | 8.04±0.33 | 3.52±0.23 |
| Neural MJP | 1.88±0.14 | 1.61±0.14 | 2.13±0.26 | 0.94±0.14 | 1.69±0.15 | 0.30±0.04 | 1.68±0.11 | 0.36±0.01 | 1.68±0.18 | 0.51±0.03 |
| T-IB | 2.62±0.17 | 3.52±0.29 | 1.59±0.20 | 0.88±0.11 | 1.66±0.16 | 0.32±0.04 | 2.16±0.17 | 0.40±0.03 | 2.17±0.24 | 0.56±0.06 |
| NLSB | 3.27±0.28 | 1.65±0.14 | 2.97±0.30 | 1.32±0.20 | 1.50±0.10 | 0.39±0.05 | 1.83±0.15 | 0.48±0.05 | 3.09±0.26 | 2.80±0.32 |
| DeepRUOT | 1.78±0.13 | 1.08±0.09 | 1.37±0.17 | 0.77±0.05 | 0.52±0.02 | 0.07±0.00 | 0.51±0.01 | 0.08±0.00 | 3.27±0.31 | 2.47±0.36 |
| WeightFlow (Ours) | 1.10±0.14 | 0.34±0.01 | 0.82±0.07 | 0.33±0.02 | 0.48±0.03 | 0.04±0.00 | 0.49±0.07 | 0.06±0.01 | 0.51±0.07 | 0.12±0.02 |
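As a reference for how these two metrics are computed from samples, here is a hedged NumPy/SciPy sketch on a single discrete marginal; the Poisson stand-in data and the unit-width histogram binning are assumptions for illustration.

```python
# Hedged sketch: empirical Wasserstein-1 and Jensen-Shannon divergence
# between predicted and ground-truth samples of one discrete marginal.
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
pred = rng.poisson(4.0, size=5000)   # stand-in predicted samples
true = rng.poisson(5.0, size=5000)   # stand-in ground-truth samples

w = wasserstein_distance(pred, true)

# JSD on state histograms; jensenshannon returns the *distance*
# (the square root of the divergence), so square it to get JSD.
edges = np.arange(max(pred.max(), true.max()) + 2)
p, _ = np.histogram(pred, bins=edges, density=True)
q, _ = np.histogram(true, bins=edges, density=True)
jsd = jensenshannon(p, q, base=2) ** 2

print(f"W = {w:.3f}, JSD = {jsd:.3f}")
```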

In the ecological evolution system (visualized below), a 2D genetic phenotype (Locus 1, Locus 2) evolves towards a global peak on a fitness landscape. WeightFlow accurately predicts the distribution's evolution, capturing both macroscopic landscape shifts and fine-grained local dynamics.

Ecological Evolution Results
Joint and marginal distributions predicted by WeightFlow over time on the Ecological Evolution system.

Real-World Datasets: Single-Cell Data

We also evaluated WeightFlow on high-dimensional, continuous-space single-cell differentiation datasets. The visualization below shows our model's predictions along the pancreatic β-cell differentiation path. WeightFlow is significantly more accurate on higher-order moments such as skewness and kurtosis, reproducing fine-grained distribution structures (see the sketch after Table 2).

Beta-cell Differentiation Results
Weight prediction for β-cell differentiation, showing continuous evolution and comparison to DeepRUOT.
Table 2: Statistical results on real-world cell datasets.

| Model | $\beta$-cell $\mathcal{W}\downarrow$ | $\beta$-cell MMD $\downarrow$ | Embryoid $\mathcal{W}\downarrow$ | Embryoid MMD $\downarrow$ |
|---|---|---|---|---|
| NLSB | 11.18±0.22 | 0.07±0.01 | 14.39±0.40 | 0.10±0.03 |
| RUOT | 10.99±0.20 | 0.06±0.01 | 14.71±0.49 | 0.15±0.03 |
| WeightFlow (Ours) | 9.73±0.27 | 0.02±0.01 | 14.18±0.43 | 0.03±0.01 |
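For the MMD column and the higher-order-moment claim above, here is a hedged sketch of an unbiased RBF-kernel MMD² estimate plus per-feature skewness and kurtosis; the Gaussian stand-in data, bandwidth `sigma`, and sample sizes are assumptions.

```python
# Hedged sketch: unbiased RBF-kernel MMD^2 plus skewness/kurtosis comparison
# between predicted and observed cell populations (all sizes illustrative).
import numpy as np
from scipy.stats import skew, kurtosis

def mmd2_rbf(x, y, sigma=1.0):
    """Unbiased MMD^2 estimate with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    np.fill_diagonal(kxx, 0.0)          # drop self-terms for unbiasedness
    np.fill_diagonal(kyy, 0.0)
    return kxx.sum() / (n * (n - 1)) + kyy.sum() / (m * (m - 1)) - 2 * kxy.mean()

rng = np.random.default_rng(0)
pred_cells = rng.normal(0.0, 1.0, size=(500, 2))   # stand-in predictions
real_cells = rng.normal(0.1, 1.1, size=(500, 2))   # stand-in observations

print("MMD^2:", mmd2_rbf(pred_cells, real_cells))
# Higher-order moments per feature, the statistics discussed above.
print("skewness:", skew(pred_cells), skew(real_cells))
print("kurtosis:", kurtosis(pred_cells), kurtosis(real_cells))
```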

Ablation Studies

We performed ablation studies to validate key design choices of WeightFlow.

Component Ablations

Autoregressive Order Ablation
Autoregressive Order: A random order performs similarly to the sequential one, confirming robustness.
Backbone Architecture Ablation
Backbone Architecture: Performance is similar for both GRU and Transformer backbones.
Sequential Aligning Ablation
Sequential Aligning: Disabling the warm-start strategy (w/o Aligned) leads to significant performance degradation.

Sensitivity Analysis

We analyzed WeightFlow's sensitivity to various hyperparameters, demonstrating its robustness.

Impact of Backbone, Data, and Path Dimension

Backbone Size Sensitivity
Backbone Size: A small hidden dimension (e.g., 8) is sufficient.
Data Ratio Sensitivity
Data Size: Performance is stable even with only 20% of the data.
Path Dimension Sensitivity
Path Dimension: A 1-dim path is sufficient to capture dynamics.

Time and Space Cost Analysis

WeightFlow is designed to be scalable and efficient. The backbone's size is independent of the system dimension $d$: it has $O(L)$ space complexity, where $L$ is the number of candidate states per dimension, and $O(d)$ inference time. The hypernetwork's $O(N_{\text{nodes}}^2)$ complexity, in the number of weight-graph nodes, is likewise independent of $d$. This design effectively avoids the curse of dimensionality.
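A back-of-envelope sketch of this scaling argument: with a GRU backbone of hidden size $h$ over $L$ candidate states (matching the illustrative backbone sketched earlier, not the released code), the parameter count grows with $L$ but never references $d$.

```python
# Hedged sketch: backbone parameter count depends on L and h, never on d.
def backbone_params(L, h):
    emb = (L + 1) * h                     # state embedding (incl. BOS token)
    gru = 2 * 3 * (h * h) + 2 * 3 * h     # GRUCell: W_ih, W_hh, two biases
    head = h * L + L                      # output projection to L logits
    return emb + gru + head               # O(L) for fixed h

for L in (4, 16, 64):
    print(f"L={L:3d}: {backbone_params(L, h=8)} params")  # linear in L
# Inference over a d-dimensional state still takes d sequential
# autoregressive steps, i.e. O(d) time with an O(L)-sized model.
```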

Inference Time vs Error
Inference Time vs. Error: WeightFlow achieves the Pareto frontier, offering the best trade-off compared to baselines.
Model Size Scaling
Model Size: The parameter size of both the backbone and hypernetwork scales only linearly with the number of candidate states, $L$.

📚 Citation

If you find our work useful for your research, please consider citing:

```bibtex
@article{li2025weightflow,
  title={WeightFlow: Learning Stochastic Dynamics via Evolving Weight of Neural Network},
  author={Li, Ruikun and Liu, Jiazhen and Wang, Huandong and Liao, Qingmin and Li, Yong},
  journal={arXiv preprint arXiv:2508.00451},
  year={2025}
}
```