Add Bayesian Flows#66

Merged
francois-rozet merged 27 commits into
probabilists:masterfrom
valsdav:bayesian-flow
Oct 3, 2025

Conversation

@valsdav
Contributor

@valsdav valsdav commented Apr 23, 2025

Bayesian Flows in zuko would be a very useful feature for uncertainty estimation with normalizing flows.

This PR introduces Bayesian flows in zuko with minimal changes: BayesianLinear layers are provided as an alternative way of building the MLPs at the base of zuko transformations.

All flows can be transformed to their Bayesian version by adding bayesian=True to the building arguments.

A utility function total_KL_divergence(model) is added to zuko.utils to compute the total KL divergence of all the Bayesian layers in a model, to be added to the loss.

Let me know if you agree with the design of this feature! I added tests and a tutorial ;)

P.S.: CNF flows are not working for the moment but I haven't investigated why yet.

@valsdav valsdav changed the title Added Bayesian Flows 🚀 Add Bayesian Flows 🚀 Apr 24, 2025
@francois-rozet
Member

francois-rozet commented May 1, 2025

Hi @valsdav 👋 Thank you for your PR! I haven't had the time to take a look at it yet. I have a deadline by the end of May.

Until then, could you give a bit more context on this PR? Why is this feature necessary? What does it allow? Are there alternatives?

@valsdav
Contributor Author

valsdav commented May 3, 2025

Hi @francois-rozet! Sure, take your time. I'm already using and testing the PR further in my group, and so far so good!

This implementation is one way of encoding training uncertainty in a normalizing flow model. I'm interested in this application in the field of high energy physics, where estimating an uncertainty on the learnt probability density is crucial for downstream tasks.

In particular, Bayesian networks can learn a posterior distribution over the model output by sampling each network weight from its own Gaussian distribution. These networks, when properly trained with a KL divergence component in the loss, approximate well the statistical uncertainty due to the limited amount of training data.
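As an illustration of the "KL divergence component", here is a minimal sketch of the closed-form KL divergence between a factorized Gaussian posterior over the weights and a standard normal prior. The function name and the choice of a standard normal prior are assumptions made for this example; they are not taken from the PR.

```python
import math


def kl_diag_gaussian_to_standard_normal(means, logvars):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over independent weights."""
    total = 0.0
    for mu, logvar in zip(means, logvars):
        # Closed form for two univariate Gaussians, with a N(0, 1) prior.
        total += 0.5 * (math.exp(logvar) + mu * mu - 1.0 - logvar)
    return total


# A posterior equal to the prior has zero divergence.
print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# Moving the means or shrinking the variances away from the prior increases it.
print(kl_diag_gaussian_to_standard_normal([1.0, -0.5], [-2.0, -2.0]))
```

In a training loop this term would be added to the negative log-likelihood, typically with a weight that depends on the dataset size.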

In zuko the implementation is quite straightforward, as it is enough to "make Bayesian" all the linear models used inside the definition of the flow. At each forward call the linear NNs get sampled and the rest of the flow implementation stays the same. At inference time the user can call the forward function multiple times to build a distribution of the output probability density.

The idea of Bayesian NFs is explored in https://arxiv.org/abs/2104.04543

Let me know if you prefer to have more references or more explanations!

@valsdav
Contributor Author

valsdav commented Jun 2, 2025

Hi @francois-rozet! Did you get a chance to have a look at it?

We are using it successfully in our applications and we didn't have any issue with the code so far :)

@francois-rozet
Member

francois-rozet commented Jun 3, 2025

Hi @valsdav, thank you for reminding me of the PR!

I have been taking a look this morning. Overall I think the code is clean and understandable. However I have a few concerns regarding invertibility.

Transformations in normalizing flows are deterministic functions. Some (most) transformations call their underlying neural networks several times during a single forward. This is notably the case of auto-regressive transformations. It is heavily assumed that each call to the underlying network leads to the same deterministic output.

With your implementation where weights are sampled during the forward of the linear layers, I believe these assumptions are broken. As such the "Bayesian" flows are not invertible anymore.

>>> flow = zuko.flows.MAF(3, bayesian=True)
>>> x = torch.randn(3)
>>> y = flow().transform(x)
>>> z = flow().transform.inv(y)
>>> x - z  # should be zero
tensor([ 0.0023,  0.0017, -0.0067], grad_fn=<SubBackward0>)
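The failure mode above can be reproduced without zuko. The following toy sketch uses a hypothetical two-dimensional transform whose "network" is a single weight; it is only meant to show that resampling the weight on every call breaks the forward/inverse round trip, while a fixed weight preserves it.

```python
import random


def make_conditioner(noisy, rng):
    """Return a function x1 -> shift, playing the role of the flow's network."""

    def conditioner(x1):
        # When noisy, the weight is resampled on every call.
        weight = 2.0 + (rng.gauss(0.0, 0.1) if noisy else 0.0)
        return weight * x1

    return conditioner


def forward(x, conditioner):
    x1, x2 = x
    return (x1, x2 + conditioner(x1))  # one network call


def inverse(y, conditioner):
    y1, y2 = y
    return (y1, y2 - conditioner(y1))  # another network call


rng = random.Random(0)
x = (1.0, -0.5)

stochastic = make_conditioner(noisy=True, rng=rng)
z_noisy = inverse(forward(x, stochastic), stochastic)
print(x[1] - z_noisy[1])  # non-zero: the two calls used different weights

deterministic = make_conditioner(noisy=False, rng=rng)
z_fixed = inverse(forward(x, deterministic), deterministic)
print(x[1] - z_fixed[1])  # 0.0: the same weight is used in both calls
```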

For this use case, I think it would be easier to take a functional approach, where the sampling of the weights is tackled outside of the normalizing flow, the weights are loaded into the flow, and then the flow is used (without randomness).

I think this could be achieved with torch.func.functional_call.

torch.func.functional_call is a wrapper around torch.nn.utils.stateless._reparametrize_module. The latter seems exactly what we need. I would suggest writing an interface around _reparametrize_module for the Bayesian flows.
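A minimal sketch of this functional approach, assuming PyTorch 2.0 or later (where torch.func is available). A plain torch.nn.Linear stands in for a flow's network, and the means/log_stds dictionaries are hypothetical variational parameters introduced only for this example.

```python
import torch
from torch.func import functional_call

torch.manual_seed(0)
net = torch.nn.Linear(3, 2)  # stand-in for a network inside a flow

# Hypothetical variational parameters: one mean and one log-std per weight.
means = {k: v.detach().clone() for k, v in net.named_parameters()}
log_stds = {k: torch.full_like(v, -2.0) for k, v in means.items()}

# Sample the weights once, outside of the network.
phi = {
    k: means[k] + log_stds[k].exp() * torch.randn_like(means[k]) for k in means
}

x = torch.randn(4, 3)
y1 = functional_call(net, phi, (x,))
y2 = functional_call(net, phi, (x,))

# The network itself is deterministic: same sampled weights, same output.
print(torch.equal(y1, y2))
```

Because the randomness lives entirely in the sampling of phi, repeated evaluations with the same phi agree, which is the property the invertibility argument relies on.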

@valsdav
Contributor Author

valsdav commented Jun 3, 2025

Hi @francois-rozet, I overlooked the fact that a single MLP may be called multiple times during a single call.

Having a look at functional_call, setting just the random tensor should be enough.
Would you recommend doing it with a utility function in zuko that calls the flow after setting the necessary random tensors if "Bayesian" layer types are found?

@francois-rozet
Member

francois-rozet commented Jun 3, 2025

> Would you recommend doing it with a utility function in zuko that calls the flow after setting the necessary random tensors if "Bayesian" layer types are found?

What I imagine is to remove "Bayesian" layers, and instead provide functional helpers that operate on a standard (non-Bayesian) flow. Something like

flow = zuko.flows.MAF(3)  # not Bayesian
prior = zuko.bayesian.diagonal_gaussian_prior(flow.parameters())  # special object

optim = torch.optim.AdamW(prior.parameters(), lr=1e-3)

for x in train_loader:
    phi = prior().rsample()  # optimizable sampling
    with zuko.bayesian.parameterize(flow, phi):
        log_p = flow().log_prob(x)
    loss = -log_p.mean()  # + KL term of the prior I guess
    loss.backward()
    
    optim.step()
    optim.zero_grad()

phi = prior().sample()
with zuko.bayesian.parameterize(flow, phi):
    x = flow().sample()

Note that prior here is a (kind of) distribution over the parameters. It can be more than diagonal Gaussian.

@valsdav
Contributor Author

valsdav commented Jun 3, 2025

In practice this prior object would store the mean and std of the weights of Linear layers.

In my opinion this seems an elegant way of making normal flows "Bayesian", but it makes it more difficult for the user to properly store the "prior" parameters alongside the flow model itself.

What if we wrap this "prior+flow" object in a new flow object, so that we can store them together and provide the same user interface?

Something like:

flow = zuko.flows.MAF(3)  # not Bayesian
bayesian_flow = zuko.bayesian.wrap_bayesian_flow(flow)

for x in train_loader:
    with bayesian_flow.sample_prior():
        bayesian_flow().log_prob(x)

with bayesian_flow.sample_prior():
    x = bayesian_flow().sample()

@francois-rozet
Member

francois-rozet commented Jun 3, 2025

> In practice this prior object would store the mean and std of the weights of Linear layers.

Yes exactly! Maybe of the biases as well, although I don't know if this is common.

> but it makes it more difficult for the user to properly store the "prior" parameters alongside the flow model itself.

Is your concern about saving and loading the models to disk? In this case only the weights of the prior have to be saved.

> What if we wrap this "prior+flow" object in a new flow object, so that we can store them together and provide the same user interface?

This is another option to consider. We should however ensure that the user understands what this new "prior + flow" object represents and how to use it "like other Zuko flows".

I usually like when everything is very explicit for the user. In my snippet it is clear that what is optimized is the prior, not the flow.

@valsdav
Contributor Author

valsdav commented Jun 3, 2025

Ok thanks for the discussion. I will give a try to this new implementation and come back to you :)

@francois-rozet
Member

francois-rozet commented Jun 3, 2025

> Something like: [...]

Yes, this is an option. But I would like to dissociate the object that samples the weights, from the model that computes the density/generates samples. It could be

bayesian_flow = zuko.bayesian.BayesianModel(zuko.flows.MAF(3), **prior_kwargs)

with bayesian_flow.sample() as flow:  # flow is a deepcopy, bayesian_flow is not modified
    flow().log_prob(x)
    flow().rsample()
    flow().transform
    # ...

> Ok thanks for the discussion. I will give a try to this new implementation and come back to you :)

Thanks! I'll check it out right away this time so I don't waste too much of your time.

@valsdav
Contributor Author

valsdav commented Sep 24, 2025

Hi @francois-rozet I finally finished this. The API is now as we discussed:

from zuko.bayesian import BayesianModel
from zuko.flows import MAF

net = MAF(3, 5)
bnet = BayesianModel(net)

with bnet.sample() as sampled_net:
    y = sampled_net(x)

# test single sampled model
sampled_net = bnet.sample_model()
y = sampled_net(x)

I don't quite understand why the tests on the GF flow are failing. Does it have some special implementation that collides with this one, maybe?

Member

@francois-rozet francois-rozet left a comment

Hi @valsdav, thank you very much for coming back to this!

I did a first quick read and review, and have left a few comments and questions. Overall this looks good!

Comment thread zuko/utils.py Outdated
Comment on lines +627 to +629
if isinstance(module, zuko_nn.MaskedBayesianLinear) or isinstance(
module, zuko_nn.BayesianLinear
):
Member

These classes do not exist. I guess this is a residue of the previous implementation?

Comment thread zuko/bayesian.py Outdated
Comment on lines +195 to +196
sampled_params = self._sample_params()
proxy = self._create_sampled_proxy(sampled_params)
Member

Is there a reason for not using torch.nn.utils.stateless._reparametrize_module ?

with torch.nn.utils.stateless._reparametrize_module(self.base, sampled_params):
    yield self.base

Contributor Author

Hi @francois-rozet! I tried using functional_call but realized that it was not working, as the transformations call the forward method again after the first call to forward. So in my case the reparametrization was not actually used to change the parameters of the flow when calling rsample or log_prob.

I can try using torch.nn.utils.stateless._reparametrize_module directly, as its reparametrization is not limited to the first call.

Contributor Author

I added this implementation in the last commit and this works fine!
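For illustration, a minimal sketch of the context-manager form, with a plain torch.nn.Linear standing in for the flow. Note that _reparametrize_module is a private PyTorch helper, so its availability and signature may differ between versions; the sampled dictionary is a hypothetical stand-in for the sampled parameters.

```python
import torch
from torch.nn.utils import stateless

torch.manual_seed(0)
net = torch.nn.Linear(3, 2)  # stand-in for the base model
original = {k: v.detach().clone() for k, v in net.named_parameters()}

# Hypothetical sampled parameters: the originals plus a small perturbation.
sampled = {k: v + 0.1 * torch.randn_like(v) for k, v in original.items()}

x = torch.randn(4, 3)

with stateless._reparametrize_module(net, sampled):
    # Every call inside the block sees the sampled weights,
    # not only the first one.
    y1 = net(x)
    y2 = net(x)

y3 = net(x)  # outside the block, the original weights are back in place
```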

Comment thread zuko/bayesian.py Outdated
"""Context manager yielding a proxy model with sampled parameters."""
# print(f"[BayesianFlow.sample] Starting sample context (training={self.training})")

if self.training:
Member

Do you think adding an argument trick: bool to the method instead of relying on self.training would make sense?

I wonder if some users would like to rely on the non-trick implementation even during training.

@valsdav
Contributor Author

valsdav commented Sep 26, 2025

Hi @francois-rozet, I did some more tests on the latest version and we have a regression somewhere. The flow is not learning anymore during training. Did I mess up the gradient propagation? Having a closer look.

Comment thread zuko/bayesian.py Outdated
var_out = var_out + torch.exp(b_logvar)

# Sample output using reparameterization trick
result = torch.normal(mu_out, var_out.sqrt(), generator=generator)
Contributor Author

This is the problem, apparently this stops the gradients.

Fix:

result = mu_out + var_out.sqrt() * torch.randn(
    mu_out.shape, device=mu_out.device, generator=generator
)
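A minimal, self-contained check of the reparameterized form, using stand-in tensors for mu_out and var_out (the values are arbitrary and chosen only for this example). The noise is drawn independently of the parameters, so the sample is a differentiable function of both.

```python
import torch

torch.manual_seed(0)
mu_out = torch.zeros(5, requires_grad=True)
var_out = torch.ones(5, requires_grad=True)

# Reparameterized sample: noise first, then a deterministic transformation.
eps = torch.randn(mu_out.shape)
result = mu_out + var_out.sqrt() * eps

result.sum().backward()

print(mu_out.grad)   # d result / d mu = 1 for every element
print(var_out.grad)  # d result / d var = eps / (2 * sqrt(var))
```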

@francois-rozet
Member

francois-rozet commented Sep 27, 2025

Hi @valsdav, I went over the code. I have refactored it to make it future-proof (e.g. self.means instead of self.weight_means, which should allow generalizing to non-linear layers). Instead of replacing . with _, I replaced it with -. This prevents potential name clashes (e.g. my_layer.weight and my.layer.weight). I have also added docstrings.
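The name clash can be seen with plain string manipulation. The flatten helper below is hypothetical and only mirrors the renaming described in the comment; dots presumably need replacing because PyTorch does not allow them in registered parameter names.

```python
def flatten(name: str, separator: str) -> str:
    """Turn a dotted parameter name into a key without dots."""
    return name.replace(".", separator)


names = ["my_layer.weight", "my.layer.weight"]

underscore_keys = {flatten(n, "_") for n in names}
dash_keys = {flatten(n, "-") for n in names}

print(underscore_keys)  # a single key: both names collapse to 'my_layer_weight'
print(dash_keys)        # two distinct keys
```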

The bug with the GF was due to the initialization of the weights, which were too large. I prefer to fall back to the initial base model weights.

Finally, sample_model should never be used to train as load_state_dict breaks the gradients to the Bayesian model parameters.

Let me know what you think.

@francois-rozet
Member

francois-rozet commented Sep 28, 2025

Hey @valsdav, I generalized the code to any kind of models, even with non-linear layers. The local reparameterization trick is only applied to linear layers, but the parameters of non-linear layers are still sampled randomly!
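For reference, the local reparameterization trick for a linear layer can be written as follows, assuming independent Gaussian weights $W_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$ and biases $b_i \sim \mathcal{N}(\mu^b_i, (\sigma^b_i)^2)$. The symbols are chosen for this note and are not taken from the code.

$$
m_i = \sum_j \mu_{ij} x_j + \mu^b_i, \qquad
v_i = \sum_j \sigma_{ij}^2 x_j^2 + (\sigma^b_i)^2, \qquad
y_i = m_i + \sqrt{v_i} \, \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, 1)
$$

Instead of sampling every weight, the noise is sampled directly at the level of the outputs $y_i$, which matches the mu_out and var_out quantities in the reviewed snippet.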

I have also replaced the generator seed with a cache, which is much more efficient as it does not require GPU-CPU synchronization.

@valsdav
Contributor Author

valsdav commented Sep 29, 2025

Hi @francois-rozet looks great to me! The code has become much better :)

Thanks a lot, +1 for merging on my side, we would start using this new version right away.

@francois-rozet
Member

francois-rozet commented Sep 29, 2025

Hey @valsdav , great!

What do you think about the way to include/exclude layers? Is it easy enough to use for you? I was thinking maybe allowing regex patterns could be nice too.

Would you have the time to update the notebook with the new API? And maybe test that training is working like you would expect in a real-case scenario?

Once you are done with the notebook, I will merge 🎉

@valsdav
Contributor Author

valsdav commented Sep 29, 2025

Excluding with a pattern is neat, but the use case for me would be to make only the last layers of a deeper net Bayesian (a common practice to avoid large Bayesian nets that are hard to train). In this case one would need to exclude all the other layers. I can try to implement an "include_only" option in a similar way.

I will have a look again at the notebook and update it in the next days 👍

@francois-rozet
Member

Oh it makes sense for the include 🤔 I think we can have both an include and exclude list then. We first include if any(name.startswith(prefix) for prefix in include_modules) and then exclude if any(name.startswith(prefix) for prefix in exclude_modules).

If we want to allow pattern matching, I think wildcards * would be nice (e.g. **.hyper.*.* matches transform.hyper.1.weight).
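One possible reading of these wildcards, sketched with the standard library: * matches a single name segment (no dots) and ** matches any number of segments. The helpers wildcard_to_regex and select are hypothetical and are not the functions merged in zuko.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for token in re.split(r"(\*\*|\*)", pattern):
        if token == "**":
            parts.append(r".*")     # any characters, dots included
        elif token == "*":
            parts.append(r"[^.]*")  # a single segment, without dots
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts))


def select(names, include=("**",), exclude=()):
    """Keep names matching any include pattern and no exclude pattern."""
    inc = [wildcard_to_regex(p) for p in include]
    exc = [wildcard_to_regex(p) for p in exclude]
    return [
        n
        for n in names
        if any(r.fullmatch(n) for r in inc)
        and not any(r.fullmatch(n) for r in exc)
    ]


names = [
    "transform.hyper.0.weight",
    "transform.hyper.0.bias",
    "transform.hyper.1.weight",
    "base.loc",
]

selected = select(names, include=["**.hyper.*.*"], exclude=["**.bias"])
print(selected)
```

Applying the include list first and the exclude list second follows the order suggested in the comment above.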

@francois-rozet
Member

francois-rozet commented Oct 1, 2025

I have added the include list, as well as pattern matching, which is quite fun to use 😁

I have also identified a bug with NAF and UNAF with torch<=1.13. Both flows rely on self.parameters() to propagate gradients, which is incompatible with _reparametrize_module. See pytorch/pytorch#92295. Nothing we can do on our side unfortunately.

@valsdav
Contributor Author

valsdav commented Oct 2, 2025

Hi @francois-rozet I added back the init_logvar parameter of the model because that's a parameter that users may customize to decide how much the model is "bayesian" at the beginning of the training.

I updated the notebook with the new API and added examples for the include/exclude feature.

@francois-rozet
Member

Thanks @valsdav! Shouldn't the default init_logvar be a negative number ($\ln 10^{-4} \approx -9$)?
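The arithmetic behind this suggestion, assuming init_logvar is the natural logarithm of the initial variance (as the name suggests) and taking the value 1e-4 from the comment:

```python
import math

init_logvar = math.log(1e-4)            # log-variance for a variance of 1e-4
init_std = math.exp(0.5 * init_logvar)  # corresponding standard deviation

print(round(init_logvar, 2))  # -9.21
print(round(init_std, 4))     # 0.01
```

A positive init_logvar would instead correspond to a variance larger than 1, that is, very noisy weights at the start of training.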

@valsdav
Contributor Author

valsdav commented Oct 2, 2025

you're right!! fixed

@francois-rozet
Member

francois-rozet commented Oct 3, 2025

Hi @valsdav, after fixing the tests, I noticed more bugs with NAF and UNAF. This is due to the use of self.parameters() in MNN and UMNN, which breaks with _reparametrize_module and the local reparametrization trick (in subtle, different ways). I have decided to drop official support for these flows.

I have updated the tutorial to be more consistent with the other ones. I am ready to merge 🥳

@francois-rozet francois-rozet merged commit 3f09b10 into probabilists:master Oct 3, 2025
6 of 7 checks passed
@valsdav
Contributor Author

valsdav commented Oct 6, 2025

Thanks @francois-rozet !

@francois-rozet francois-rozet changed the title Add Bayesian Flows 🚀 Add Bayesian Flows Mar 7, 2026