Add Bayesian Flows by valsdav · Pull Request #66 · probabilists/zuko

valsdav · 2025-04-23T11:32:32Z

Bayesian Flows in zuko would be a very useful feature for uncertainty estimation with normalizing flows.

This PR introduces Bayesian flows in zuko with minimal changes: BayesianLinear layers are provided as an alternative way of building the MLPs at the base of zuko transformations.

All flows can be transformed to their Bayesian version by adding bayesian=True to the building arguments.

A utility function total_KL_divergence(model) is added to zuko.utils to compute the total KL divergence of all the Bayesian layers in a model, to be added to the loss.

Let me know if you agree with the design of this feature! I added tests and a tutorial ;)

P.S.: CNF flows are not working for the moment but I haven't investigated why yet.

francois-rozet · 2025-05-01T21:35:50Z

Hi @valsdav 👋 Thank you for your PR! I haven't had the time to take a look at it yet. I have a deadline by the end of May.

Until then, could you give a bit more context on this PR? Why is this feature necessary? What does it allow? Are there alternatives?

valsdav · 2025-05-03T12:03:18Z

Hi @francois-rozet! Sure, take your time. I'm already using and testing the PR further in my group and so far so good!

This implementation is one of the ways of encoding training uncertainty in a normalizing flow model. I'm interested in this application in the field of high energy physics, where estimating an uncertainty on the learnt probability density is crucial for downstream tasks.

In particular, Bayesian networks can learn a posterior distribution over the model output by sampling each network weight from its own Gaussian distribution. These networks, when properly trained with a KL divergence term in the loss, approximate well the statistical uncertainty due to the limited amount of training data.

In zuko the implementation is quite straightforward, as it is enough to "make Bayesian" all the linear layers used inside the definition of the flow. At each forward call the weights of the linear layers get sampled and the rest of the flow implementation stays the same. At inference time the user can call the forward function multiple times to build a distribution of the output probability density.
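To make the idea concrete, here is a minimal sketch of a linear layer whose weights are re-sampled at every forward call. It is not the code from this PR: the class name `ToyBayesianLinear`, its attribute names, the standard normal prior in the KL term and the initial log-variance of -9 are all assumptions chosen for illustration.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBayesianLinear(nn.Module):
    """Hypothetical linear layer with one Gaussian per weight (illustration only)."""

    def __init__(self, in_features: int, out_features: int, init_logvar: float = -9.0):
        super().__init__()
        self.weight_mean = nn.Parameter(
            torch.randn(out_features, in_features) / math.sqrt(in_features)
        )
        self.weight_logvar = nn.Parameter(
            torch.full((out_features, in_features), init_logvar)
        )
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reparameterized sample: W = mean + std * eps, with eps ~ N(0, I).
        std = torch.exp(0.5 * self.weight_logvar)
        weight = self.weight_mean + std * torch.randn_like(std)
        return F.linear(x, weight, self.bias)

    def kl_divergence(self) -> torch.Tensor:
        # Closed-form KL(N(mean, var) || N(0, 1)), summed over all weights.
        var = torch.exp(self.weight_logvar)
        return 0.5 * torch.sum(var + self.weight_mean**2 - 1.0 - self.weight_logvar)


torch.manual_seed(0)
layer = ToyBayesianLinear(3, 2)
x = torch.randn(4, 3)

# Each forward call draws new weights, so repeated calls give a spread of outputs.
outputs = torch.stack([layer(x) for _ in range(16)])
spread = outputs.std(dim=0)

# The KL term would be added (suitably weighted) to the negative log-likelihood.
kl = layer.kl_divergence()
```

The spread over repeated calls is what would be read as an uncertainty estimate at inference time.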

The idea of Bayesian NFs is explored in https://arxiv.org/abs/2104.04543

Let me know if you prefer to have more references or more explanations!

valsdav · 2025-06-02T15:42:50Z

Hi @francois-rozet! Did you get a chance to have a look at it?

We are using it successfully in our applications and we haven't had any issues with the code so far :)

francois-rozet · 2025-06-03T10:49:37Z

Hi @valsdav, thank you for reminding me of the PR!

I have been taking a look this morning. Overall I think the code is clean and understandable. However I have a few concerns regarding invertibility.

Transformations in normalizing flows are deterministic functions. Some (most) transformations call their underlying neural networks several times during a single forward. This is notably the case of auto-regressive transformations. It is heavily assumed that each call to the underlying network leads to the same deterministic output.

With your implementation where weights are sampled during the forward of the linear layers, I believe these assumptions are broken. As such the "Bayesian" flows are not invertible anymore.

>>> import torch
>>> import zuko
>>> flow = zuko.flows.MAF(3, bayesian=True)
>>> x = torch.randn(3)
>>> y = flow().transform(x)
>>> z = flow().transform.inv(y)
>>> x - z  # should be zero
tensor([ 0.0023,  0.0017, -0.0067], grad_fn=<SubBackward0>)

For this use case, I think it would be easier to take a functional approach, where the sampling of the weights is tackled outside of the normalizing flow, the weights are loaded into the flow, and then the flow is used (without randomness).

I think this could be achieved with torch.func.functional_call.

torch.func.functional_call is a wrapper around torch.nn.utils.stateless._reparametrize_module. The latter seems exactly what we need. I would suggest writing an interface around _reparametrize_module for the Bayesian flows.
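As a rough illustration of this functional approach, the sketch below samples the weights once, outside of the network, and then evaluates the network with those weights through `torch.func.functional_call` (available in PyTorch 2.0 and later). It uses a plain `nn.Linear` as a stand-in for the network inside a transformation, not a zuko flow, and the `means`/`logvars` dictionaries are hypothetical names for the variational parameters.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

torch.manual_seed(0)
net = nn.Linear(3, 2)  # stand-in for the network inside a flow transformation
x = torch.randn(5, 3)

# Hypothetical variational parameters: one mean and one log-variance per weight.
means = {
    name: p.detach().clone().requires_grad_() for name, p in net.named_parameters()
}
logvars = {
    name: torch.full_like(p, -9.0).requires_grad_() for name, p in net.named_parameters()
}

# Sample the weights once, outside of the network (reparameterized sampling).
phi = {
    name: means[name] + torch.exp(0.5 * logvars[name]) * torch.randn_like(means[name])
    for name in means
}

# With `phi` fixed, the network is deterministic, so repeated calls agree.
y1 = functional_call(net, phi, (x,))
y2 = functional_call(net, phi, (x,))

# Gradients flow back to the variational parameters, not to `net` itself.
y1.sum().backward()
```

Because the randomness lives entirely in the sampling of `phi`, any number of calls made with the same `phi` see the same weights.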

valsdav · 2025-06-03T11:10:26Z

Hi @francois-rozet, I overlooked the fact that a single MLP may be called multiple times during a single forward call.

Having looked at functional_call, setting just the random tensor should be enough.
Would you recommend doing it with a utility function in zuko that calls the flow after setting the necessary random tensors if "Bayesian" layer types are found?

francois-rozet · 2025-06-03T11:37:28Z

Would you recommend doing it with a utility function in zuko that calls the flow after setting the necessary random tensors if "Bayesian" layer types are found?

What I imagine is to remove "Bayesian" layers, and instead provide functional helpers that operate on a standard (non-Bayesian) flow. Something like

flow = zuko.flows.MAF(3)  # not Bayesian
prior = zuko.bayesian.diagonal_gaussian_prior(flow.parameters())  # special object

optim = torch.optim.AdamW(prior.parameters(), lr=1e-3)

for x in train_loader:
    phi = prior().rsample()  # optimizable sampling
    with zuko.bayesian.parameterize(flow, phi):
        log_p = flow().log_prob(x)
    loss = -log_p.mean()  # + KL term of the prior I guess
    loss.backward()
    
    optim.step()
    optim.zero_grad()

phi = prior().sample()
with zuko.bayesian.parameterize(flow, phi):
    x = flow().sample()

Note that prior here is a (kind of) distribution over the parameters. It can be more than a diagonal Gaussian.

valsdav · 2025-06-03T11:45:26Z

In practice this prior object would store the mean and std of the weights of Linear layers.

In my opinion this seems an elegant way of making normal flows "Bayesian", but it makes it more difficult for the user to properly store the "prior" parameters alongside the flow model itself.

What if we wrap this "prior + flow" object in a new flow object? That way we can store them together and provide the same user interface.

Something like:

flow = zuko.flows.MAF(3)  # not Bayesian
bayesian_flow = zuko.bayesian.wrap_bayesian_flow(flow)

for x in train_loader:
    with bayesian_flow.sample_prior():
        bayesian_flow().log_prob(x)

with bayesian_flow.sample_prior():
    x = bayesian_flow().sample()

francois-rozet · 2025-06-03T11:55:21Z

In practice this prior object would store the mean and std of the weights of Linear layers.

Yes exactly! Maybe of the biases as well, although I don't know if this is common.

but then it makes more difficult on the user side to store properly the "prior" parameters alongside the flow model itself.

Is your concern about saving and loading the models to disk? In this case only the weights of the prior have to be saved.

What if we wrap this "prior + flow" object in a new flow object? That way we can store them together and provide the same user interface.

This is another option to consider. We should however ensure that the user understands what this new "prior + flow" object represents and how to use it "like other Zuko flows".

I usually like when everything is very explicit for the user. In my snippet it is clear that what is optimized is the prior, not the flow.

valsdav · 2025-06-03T12:37:39Z

Ok, thanks for the discussion. I will give this new implementation a try and get back to you :)

francois-rozet · 2025-06-03T12:51:14Z

Something like: [...]

Yes, this is an option. But I would like to dissociate the object that samples the weights, from the model that computes the density/generates samples. It could be

bayesian_flow = zuko.bayesian.BayesianModel(zuko.flows.MAF(3), **prior_kwargs)

with bayesian_flow.sample() as flow:  # flow is a deepcopy, bayesian_flow is not modified
    flow().log_prob(x)
    flow().rsample()
    flow().transform
    # ...

Ok, thanks for the discussion. I will give this new implementation a try and get back to you :)

Thanks! I'll check it out right away this time so I don't waste too much of your time.

valsdav · 2025-09-24T07:26:42Z

Hi @francois-rozet I finally finished this. The API is now as we discussed:

from zuko.bayesian import BayesianModel
from zuko.flows import MAF

net = MAF(3, 5)
bnet = BayesianModel(net)

with bnet.sample() as sampled_net:
    y = sampled_net(x)

# test single sampled model
sampled_net = bnet.sample_model()
y = sampled_net(x)

I don't quite understand why the tests on the GF flow are failing. Does it have some special implementation that collides with this one?

francois-rozet

Hi @valsdav, thank you very much for coming back to this!

I did a first quick read and review, and have left a few comments and questions. Overall this looks good!

francois-rozet · 2025-09-24T08:26:54Z

+        if isinstance(module, zuko_nn.MaskedBayesianLinear) or isinstance(
+            module, zuko_nn.BayesianLinear
+        ):


These classes do not exist. I guess this is a residue of the previous implementation?

francois-rozet · 2025-09-24T08:46:28Z

+            sampled_params = self._sample_params()
+            proxy = self._create_sampled_proxy(sampled_params)


Is there a reason for not using torch.nn.utils.stateless._reparametrize_module ?

with torch.nn.utils.stateless._reparametrize_module(self.base, sampled_params):
    yield self.base

Hi @francois-rozet! I tried using functional_call but realized that it was not working, as the transformations call the forward method again after the first call to forward. So in my case the reparametrization was not actually used to change the parameters of the flow while calling rsample or log_prob.

I can try using torch.nn.utils.stateless._reparametrize_module directly, as it does not only change the parametrization during the first call.

I added this implementation in the last commit and this works fine!

francois-rozet · 2025-09-24T08:50:07Z

+        """Context manager yielding a proxy model with sampled parameters."""
+        # print(f"[BayesianFlow.sample] Starting sample context (training={self.training})")
+
+        if self.training:


Do you think adding an argument trick: bool to the method instead of relying on self.training would make sense?

I wonder if some users would like to rely on the non-trick implementation even during training.

valsdav · 2025-09-26T08:55:18Z

Hi @francois-rozet, I did some more tests on the latest version and we have a regression somewhere. The flow is not learning anymore during training. Did I mess up the gradient propagation? I'm having a closer look.

valsdav · 2025-09-26T09:04:34Z

+            var_out = var_out + torch.exp(b_logvar)
+
+        # Sample output using reparameterization trick
+        result = torch.normal(mu_out, var_out.sqrt(), generator=generator)


This is the problem, apparently this stops the gradients.

Fix:

result = mu_out + var_out.sqrt() * torch.randn(mu_out.shape, device=mu_out.device, generator=generator)
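A quick way to check that the explicit form is differentiable with respect to both the mean and the variance is the following self-contained sketch. The variable names mirror the snippet above, but the tensor values and shapes are made up for the check.

```python
import torch

torch.manual_seed(0)
mu_out = torch.zeros(4, requires_grad=True)
var_out = torch.ones(4, requires_grad=True)
generator = torch.Generator().manual_seed(0)

# Explicit reparameterization: the noise is a constant with respect to autograd,
# so the result is a differentiable function of both mu_out and var_out.
eps = torch.randn(mu_out.shape, device=mu_out.device, generator=generator)
result = mu_out + var_out.sqrt() * eps

result.sum().backward()

# d(result)/d(mu_out) = 1 and d(result)/d(var_out) = eps / (2 * sqrt(var_out)).
grad_mu = mu_out.grad
grad_var = var_out.grad
```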

francois-rozet · 2025-09-27T09:37:05Z

Hi @valsdav, I went over the code. I have refactored it to make it future-proof (e.g. self.means instead of self.weight_means, which should make it possible to generalize to non-linear layers). Instead of replacing . with _, I replaced it with -. This prevents potential name clashes (e.g. my_layer.weight and my.layer.weight). I have also added docstrings.

The bug with the GF was due to the initialization of the weights, which were too large. I prefer to fall back to the initial base model weights.

Finally, sample_model should never be used to train as load_state_dict breaks the gradients to the Bayesian model parameters.
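The effect of `load_state_dict` on gradients can be reproduced on a toy module. In this sketch, which is independent of zuko, `mean` is a hypothetical variational parameter and the sampled weight depends on it through autograd until it is copied into the module.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(3, 1)
mean = torch.zeros(1, 3, requires_grad=True)  # hypothetical variational mean

# A sampled weight that still depends on `mean` through autograd.
sampled_weight = mean + 0.01 * torch.randn(1, 3)

# load_state_dict copies the values into the existing parameters without
# recording the copy in the autograd graph.
net.load_state_dict({"weight": sampled_weight, "bias": torch.zeros(1)})

loss = net(torch.randn(8, 3)).sum()
loss.backward()

# The module's own parameter receives a gradient, but `mean` does not.
weight_grad = net.weight.grad
mean_grad = mean.grad
```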

Let me know what you think.

francois-rozet · 2025-09-28T14:02:15Z

Hey @valsdav, I generalized the code to any kind of models, even with non-linear layers. The local reparameterization trick is only applied to linear layers, but the parameters of non-linear layers are still sampled randomly!
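For reference, the local reparameterization trick for a linear layer can be summarized as follows, assuming independent Gaussian weights and biases. The symbols are introduced here for illustration; the names mu_out and var_out refer to the quantities in the diff snippet quoted earlier in the thread.

```latex
\begin{align*}
y_i &= \sum_j W_{ij} x_j + b_i,
  \qquad W_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}^2),
  \quad b_i \sim \mathcal{N}\big(\mu^b_i, (\sigma^b_i)^2\big) \\
\Rightarrow \quad y_i &\sim \mathcal{N}\Big(
  \underbrace{\textstyle\sum_j \mu_{ij} x_j + \mu^b_i}_{\text{mu\_out}},\;
  \underbrace{\textstyle\sum_j \sigma_{ij}^2 x_j^2 + (\sigma^b_i)^2}_{\text{var\_out}}
  \Big) \\
y_i &= \text{mu\_out}_i + \sqrt{\text{var\_out}_i}\,\varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, 1)
\end{align*}
```

The noise is thus drawn per output rather than per weight, which is only valid because a linear layer maps Gaussian weights to Gaussian outputs.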

I have also replaced the generator seed with a cache, which is much more efficient as it does not require GPU-CPU synchronization.

valsdav · 2025-09-29T20:39:24Z

Hi @francois-rozet looks great to me! The code has become much better :)

Thanks a lot, +1 for merging on my side, we will start using this new version right away.

francois-rozet · 2025-09-29T20:47:43Z

Hey @valsdav , great!

What do you think about the way to include/exclude layers? Is it easy enough to use for you? I was thinking maybe allowing regex patterns could be nice too.

Would you have the time to update the notebook with the new API? And maybe test that training is working like you would expect in a real-case scenario?

Once you are done with the notebook, I will merge 🎉

valsdav · 2025-09-29T20:51:26Z

Excluding with a pattern is neat, but my use case would be to make only the last layers of a deeper net Bayesian (a common practice to avoid making large Bayesian nets hard to train). In this case one would need to exclude all the other layers. I can try to implement an "include_only" option in a similar way.

I will have another look at the notebook and update it in the next few days 👍

francois-rozet · 2025-09-29T21:15:44Z

Oh it makes sense for the include 🤔 I think we can have both an include and exclude list then. We first include if any(name.startswith(prefix) for prefix in include_modules) and then exclude if any(name.startswith(prefix) for prefix in exclude_modules).

If we want to allow pattern matching, I think wildcards * would be nice (e.g. **.hyper.*.* matches transform.hyper.1.weight).
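One possible reading of this include/exclude logic with wildcards is sketched below, using only the standard library. The semantics are assumptions for illustration, not necessarily what zuko implements: `*` is taken to match within a single dotted segment, `**` to match one or more segments, and the helper names `pattern_to_regex` and `is_selected` are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a dotted wildcard pattern into a regex (assumed semantics)."""
    parts = []
    for segment in pattern.split("."):
        if segment == "**":
            # One or more dotted segments.
            parts.append(r"[^.]+(?:\.[^.]+)*")
        else:
            # "*" matches any run of characters inside a single segment.
            parts.append(r"[^.]*".join(re.escape(s) for s in segment.split("*")))
    return re.compile(r"\.".join(parts))


def is_selected(name: str, include=("**",), exclude=()) -> bool:
    """Keep `name` if it matches any include pattern and no exclude pattern."""
    included = any(pattern_to_regex(p).fullmatch(name) for p in include)
    excluded = any(pattern_to_regex(p).fullmatch(name) for p in exclude)
    return included and not excluded


names = ["transform.hyper.1.weight", "transform.hyper.1.bias", "base.loc"]

# Only parameters under a "hyper" module, two levels deep.
hyper_only = [n for n in names if is_selected(n, include=("**.hyper.*.*",))]

# Everything except biases.
no_bias = [n for n in names if is_selected(n, exclude=("**.bias",))]
```

Applying the include list first and the exclude list second, as in the comment above, lets a broad include be narrowed down by a few targeted excludes.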

francois-rozet · 2025-10-01T21:01:54Z

I have added the include list, as well as pattern matching, which is quite fun to use 😁

I have also identified a bug with NAF and UNAF with torch<=1.13. Both flows rely on self.parameters() to propagate gradients, which is incompatible with _reparametrize_module. See pytorch/pytorch#92295. Nothing we can do on our side unfortunately.

valsdav · 2025-10-02T07:47:13Z

Hi @francois-rozet, I added back the init_logvar parameter of the model because that's a parameter that users may customize to decide how "Bayesian" the model is at the beginning of training.

I updated the notebook with the new API and added examples for the include/exclude feature.

francois-rozet · 2025-10-02T13:36:14Z

Thanks @valsdav! Shouldn't the default init_logvar be a negative number ($\ln 10^{-4} \approx -9$)?
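The arithmetic behind this remark can be checked in a few lines. The target variance of 1e-4 comes from the comment above; the variable names are illustrative.

```python
import math

target_variance = 1e-4
init_logvar = math.log(target_variance)  # natural log of the variance
init_std = math.exp(0.5 * init_logvar)   # corresponding standard deviation
```

Here `init_logvar` is about -9.21, and the corresponding standard deviation is 0.01, so the sampled weights start very close to their means.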

valsdav · 2025-10-02T13:39:27Z

you're right!! fixed

francois-rozet · 2025-10-03T15:19:30Z

Hi @valsdav, after fixing the tests, I noticed more bugs with NAF and UNAF. This is due to the use of self.parameters() in MNN and UMNN, which breaks with _reparametrize_module and the local reparameterization trick (in subtle, different ways). I have decided to drop official support for these flows.

I have updated the tutorial to be more consistent with the other ones. I am ready to merge 🥳

valsdav · 2025-10-06T07:23:11Z

Thanks @francois-rozet !

valsdav added 4 commits April 23, 2025 00:34

added Bayesian MLP layers for Bayesian flows

178b1d6

formatting and docs

aa8b250

Adding tests for bayesian networks

6b9cd3c

Added tutorial

b1aa49a

valsdav force-pushed the bayesian-flow branch from bd3ab55 to b1aa49a April 23, 2025 13:22

valsdav changed the title ~~Added Bayesian Flows 🚀~~ Add Bayesian Flows 🚀 Apr 24, 2025

valsdav added 3 commits September 24, 2025 08:39

implementation of the bayesian wrapper

bd10e33

new options for bayesian model

698d85e

added tests for bayesian models

b015cab

new implementazion of context manager

916cf5d