Add Bayesian Flows #66
|
Hi @valsdav 👋 Thank you for your PR! I haven't had the time to take a look at it yet. I have a deadline by the end of May. Until then, could you give a bit more context on this PR? Why is this feature necessary? What does it allow? Are there alternatives? |
|
Hi @francois-rozet! Sure, take your time. I'm already using and testing the PR in my group, and so far so good! This implementation is one way of encoding training uncertainty in a normalizing flow model. I'm interested in this application in the field of high energy physics, where estimating an uncertainty on the learnt probability density is crucial for downstream tasks. In particular, Bayesian networks can learn a posterior distribution over a model's output by sampling each network weight from its own Gaussian distribution. These networks, when properly trained with a KL divergence component in the loss, approximate well the statistical uncertainty arising from the limited amount of training data. In zuko the implementation is quite straightforward, as it is enough to "make Bayesian" all the linear layers used inside the definition of the flow. At each forward call the linear layers get sampled and the rest of the flow implementation stays the same. At inference time the user can call the forward function multiple times to build a distribution of the output probability density. The idea of Bayesian NFs is explored in arXiv:2104.04543 (https://arxiv.org/abs/2104.04543). Let me know if you would prefer more references or explanations! |
|
Hi @francois-rozet! Did you get a chance to have a look at it? We are using it successfully in our applications and we haven't had any issues with the code so far :) |
|
Hi @valsdav, thank you for reminding me of the PR! I have been taking a look this morning. Overall I think the code is clean and understandable. However, I have a few concerns regarding invertibility.
Transformations in normalizing flows are deterministic functions. Some (most) transformations call their underlying neural networks several times during a single forward. This is notably the case for autoregressive transformations. It is heavily assumed that each call to the underlying network leads to the same deterministic output. With your implementation, where weights are sampled during the forward of the linear layers, I believe these assumptions are broken. As such, the "Bayesian" flows are not invertible anymore.
>>> flow = zuko.flows.MAF(3, bayesian=True)
>>> x = torch.randn(3)
>>> y = flow().transform(x)
>>> z = flow().transform.inv(y)
>>> x - z # should be zero
tensor([ 0.0023, 0.0017, -0.0067], grad_fn=<SubBackward0>)
For this use case, I think it would be easier to take a functional approach, where the sampling of the weights is tackled outside of the normalizing flow, the weights are loaded into the flow, and then the flow is used (without randomness). I think this could be achieved with
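The invertibility concern can be reproduced without zuko. The following is a minimal sketch, assuming a toy affine transformation whose shift and log-scale come from a single linear layer; `NoisyLinear`, `affine_forward` and `affine_inverse` are hypothetical names used only for illustration and are not part of zuko or of this PR.

```python
import torch
import torch.nn as nn


class NoisyLinear(nn.Linear):
    """Hypothetical layer that resamples a weight perturbation at every call."""

    def __init__(self, in_features, out_features, noise=0.1):
        super().__init__(in_features, out_features)
        self.noise = noise  # set to 0.0 to recover a deterministic layer

    def forward(self, x):
        weight = self.weight + self.noise * torch.randn_like(self.weight)
        return nn.functional.linear(x, weight, self.bias)


def affine_forward(net, x, c):
    shift, log_scale = net(c).chunk(2, dim=-1)
    return x * log_scale.exp() + shift


def affine_inverse(net, y, c):
    shift, log_scale = net(c).chunk(2, dim=-1)  # second, independent call to the network
    return (y - shift) * (-log_scale).exp()


torch.manual_seed(0)
x, c = torch.randn(3), torch.randn(5)
net = NoisyLinear(5, 6)

# Weights are resampled between the forward and the inverse pass.
error_noisy = (affine_inverse(net, affine_forward(net, x, c), c) - x).abs().max()

# Same layer without resampling: both passes see identical weights.
net.noise = 0.0
error_fixed = (affine_inverse(net, affine_forward(net, x, c), c) - x).abs().max()
```

With resampling, the inverse pass undoes a different affine map than the one applied in the forward pass, so `error_noisy` is not expected to vanish, whereas `error_fixed` is limited only by floating-point precision.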
|
|
Hi @francois-rozet, I overlooked the fact that a single MLP may be called multiple times during a single call. Having had a look at functional_call, setting just the random tensor should be enough. |
What I imagine is to remove "Bayesian" layers, and instead provide functional helpers that operate on a standard (non-Bayesian) flow. Something like
flow = zuko.flows.MAF(3)  # not Bayesian
prior = zuko.bayesian.diagonal_gaussian_prior(flow.parameters())  # special object
optim = torch.optim.AdamW(prior.parameters(), lr=1e-3)

for x in train_loader:
    phi = prior().rsample()  # optimizable sampling
    with zuko.bayesian.parameterize(flow, phi):
        log_p = flow().log_prob(x)
    loss = -log_p.mean()  # + KL term of the prior I guess
    loss.backward()
    optim.step()
    optim.zero_grad()

phi = prior().sample()
with zuko.bayesian.parameterize(flow, phi):
    x = flow().sample()
Note that |
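To make the functional idea concrete, here is a minimal sketch using only PyTorch. It assumes a small `nn.Sequential` as a stand-in for a flow, a regression loss in place of the negative log-likelihood, and a standard normal reference distribution for the KL term; `sample_params` and `kl_to_standard_normal` are hypothetical helpers, not the `zuko.bayesian` functions imagined above. It relies on `torch.func.functional_call`, which substitutes parameters for a single call of the module.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

torch.manual_seed(0)

# Stand-in for a flow: any deterministic module works for this sketch.
model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))

# Variational parameters: one mean and one log-std per model parameter.
mean = {k: v.detach().clone().requires_grad_() for k, v in model.named_parameters()}
log_std = {k: torch.full(v.shape, -3.0).requires_grad_() for k, v in model.named_parameters()}
optim = torch.optim.AdamW([*mean.values(), *log_std.values()], lr=1e-2)


def sample_params():
    # Reparameterized draw, differentiable w.r.t. mean and log_std.
    return {k: mean[k] + log_std[k].exp() * torch.randn_like(mean[k]) for k in mean}


def kl_to_standard_normal():
    # KL(N(mean, std^2) || N(0, 1)), summed over all parameters.
    return sum(
        0.5 * (mean[k] ** 2 + (2 * log_std[k]).exp() - 1 - 2 * log_std[k]).sum()
        for k in mean
    )


x = torch.randn(64, 2)
y = x.sum(dim=-1, keepdim=True)

for step in range(5):
    phi = sample_params()
    pred = functional_call(model, phi, (x,))  # the model runs with the sampled weights
    loss = (pred - y).pow(2).mean() + 1e-3 * kl_to_standard_normal()
    optim.zero_grad()
    loss.backward()
    optim.step()
```

In this sketch only the variational parameters are optimized; the parameters stored in `model` never receive gradients, which matches the intent that the sampling lives outside the model.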
|
In practice this
In my opinion this seems an elegant solution for making normal flows "Bayesian", but it makes it more difficult for the user to properly store the "prior" parameters alongside the flow model itself. What if we wrap this "prior + flow" object in a new flow object, so that we can store them together and provide the same user interface? Something like:
flow = zuko.flows.MAF(3)  # not Bayesian
bayesian_flow = zuko.bayesian.wrap_bayesian_flow(flow)

for x in train_loader:
    with bayesian_flow.sample_prior():
        bayesian_flow().log_prob(x)

with bayesian_flow.sample_prior():
    x = bayesian_flow().sample() |
Yes exactly! Maybe of the biases as well, although I don't know if this is common.
Is your concern about saving and loading the models to disk? In this case only the weights of the
This is another option to consider. We should however ensure that the user understands what this new "prior + flow" object represents and how to use it "like other Zuko flows". I usually like when everything is very explicit for the user. In my snippet it is clear that what is optimized is the prior, not the flow. |
|
OK, thanks for the discussion. I will give this new implementation a try and come back to you :) |
Yes, this is an option. But I would like to dissociate the object that samples the weights from the model that computes the density/generates samples. It could be
bayesian_flow = zuko.bayesian.BayesianModel(zuko.flows.MAF(3), **prior_kwargs)
with bayesian_flow.sample() as flow:  # flow is a deepcopy, bayesian_flow is not modified
    flow().log_prob(x)
    flow().rsample()
    flow().transform
    # ...
Thanks! I'll check it out right away this time so I don't waste too much of your time. |
|
Hi @francois-rozet, I finally finished this. The API is now as we discussed: I don't quite understand why the tests on the GF flow are failing. Could it have some special implementation that collides with this one? |
francois-rozet
left a comment
Hi @valsdav, thank you very much for coming back to this!
I did a first quick read and review, and have left a few comments and questions. Overall this looks good!
| if isinstance(module, zuko_nn.MaskedBayesianLinear) or isinstance( | ||
| module, zuko_nn.BayesianLinear | ||
| ): |
These classes do not exist. I guess this is a residue of the previous implementation?
| sampled_params = self._sample_params() | ||
| proxy = self._create_sampled_proxy(sampled_params) |
Is there a reason for not using torch.nn.utils.stateless._reparametrize_module ?
with torch.nn.utils.stateless._reparametrize_module(self.base, sampled_params):
    yield self.base
Hi @francois-rozet! I tried using functional_call, but I realized that it was not working because the transformations call the forward method again after the first call to forward. So in my case the reparametrization was not actually used to change the parameters of the flow while calling rsample or log_prob.
I can try using torch.nn.utils.stateless._reparametrize_module directly, as it does not only change the parametrization during the first call.
I added this implementation in the last commit and this works fine!
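The behaviour described here can be checked on a plain module. This is a small sketch, assuming a PyTorch version in which the private helper `torch.nn.utils.stateless._reparametrize_module` is available as a context manager (being private, it may change between releases); the `nn.Linear` and the hand-written `sampled` dictionary are stand-ins for the flow and its sampled parameters.

```python
import torch
import torch.nn as nn
from torch.nn.utils.stateless import _reparametrize_module  # private API, may change

torch.manual_seed(0)

base = nn.Linear(3, 2)
x = torch.randn(4, 3)
reference = base(x).detach()  # output with the original parameters

# Stand-in for sampled parameters: zero weight and unit bias.
sampled = {"weight": torch.zeros(2, 3), "bias": torch.ones(2)}

with _reparametrize_module(base, sampled):
    first = base(x)
    second = base(x)  # every call inside the context sees the swapped-in tensors

restored = base(x).detach()  # original parameters are back after the context exits
```

Because the swap lasts for the whole `with` block rather than for one call, a transformation that invokes its network several times would see the same sampled weights each time.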
| """Context manager yielding a proxy model with sampled parameters.""" | ||
| # print(f"[BayesianFlow.sample] Starting sample context (training={self.training})") | ||
|
|
||
| if self.training: |
Do you think adding an argument trick: bool to the method instead of relying on self.training would make sense?
I wonder if some would like to rely on the non-trick implementation even during training.
|
Hi @francois-rozet, I did some more tests on the latest version and we have a regression somewhere. The flow is not learning anymore during training. Did I mess up the gradient propagation? I am having a closer look. |
| var_out = var_out + torch.exp(b_logvar) | ||
|
|
||
| # Sample output using reparameterization trick | ||
| result = torch.normal(mu_out, var_out.sqrt(), generator=generator) |
This is the problem, apparently this stops the gradients.
Fix:
result = mu_out + var_out.sqrt() * torch.randn(mu_out.shape, device=mu_out.device,
generator=generator)
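For reference, the fix amounts to the usual reparameterization: draw noise that does not depend on the parameters, then shift and scale it. Below is a minimal sketch of this for a single linear layer with Gaussian weights and biases; the names `w_mu` and `w_logvar`, the shapes and the initial values are assumptions made for illustration, and only `mu_out`, `var_out` and `b_logvar` mirror the snippet under review.

```python
import torch

torch.manual_seed(0)

x = torch.randn(8, 4)

# Hypothetical variational parameters of one linear layer (3 outputs, 4 inputs).
w_mu = torch.randn(3, 4, requires_grad=True)
w_logvar = torch.full((3, 4), -4.0, requires_grad=True)
b_mu = torch.zeros(3, requires_grad=True)
b_logvar = torch.full((3,), -4.0, requires_grad=True)

# Mean and variance of the layer output under independent Gaussian weights and biases.
mu_out = x @ w_mu.T + b_mu
var_out = (x ** 2) @ w_logvar.exp().T + b_logvar.exp()

# The noise is independent of the parameters, so gradients flow through mu_out and var_out.
eps = torch.randn(mu_out.shape)
result = mu_out + var_out.sqrt() * eps

result.sum().backward()
```

After the backward pass, both the mean and the log-variance parameters hold gradients, which is what the training loop needs.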
|
Hi @valsdav, I went over the code. I have refactored it to make it future proof (e.g. The bug with the Finally, Let me know what you think. |
|
Hey @valsdav, I generalized the code to any kind of models, even with non-linear layers. The local reparameterization trick is only applied to linear layers, but the parameters of non-linear layers are still sampled randomly! I have also replaced the generator |
|
Hi @francois-rozet looks great to me! The code has become much better :) Thanks a lot, +1 for merging on my side, we would start using this new version right away. |
|
Hey @valsdav, great! What do you think about the way to include/exclude layers? Is it easy enough to use for you? I was thinking maybe allowing regex patterns could be nice too. Would you have the time to update the notebook with the new API? And maybe test that training is working like you would expect in a real-case scenario? Once you are done with the notebook, I will merge 🎉 |
|
Excluding with a pattern is neat, but the use case for me would be to make only the last layers of a deeper net Bayesian (a common practice to avoid large Bayesian nets that are hard to train). In this case we would need to exclude all the other layers. I can try to implement an "include_only" option in a similar way. I will have another look at the notebook and update it in the next few days 👍 |
|
Oh it makes sense for the include 🤔 I think we can have both an include and exclude list then. We first include If we want to allow pattern matching, I think wildcards |
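One possible reading of the include/exclude idea with wildcards is sketched below using the standard-library `fnmatch` module. The function `select_names`, its defaults, and the order (include first, then exclude) are assumptions for illustration and not necessarily what was implemented in the PR.

```python
from fnmatch import fnmatchcase


def select_names(names, include=("*",), exclude=()):
    """Keep names matching any include pattern and no exclude pattern."""
    selected = []
    for name in names:
        included = any(fnmatchcase(name, pattern) for pattern in include)
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
        if included and not excluded:
            selected.append(name)
    return selected


# Parameter names as they would appear in a small two-layer network.
names = ["net.0.weight", "net.0.bias", "net.2.weight", "net.2.bias"]

last_layer = select_names(names, include=["net.2.*"])  # only the last layer
no_biases = select_names(names, exclude=["*.bias"])  # everything except biases
```

With an include list, making only the last layers Bayesian needs a single pattern instead of an exclude entry for every other layer.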
|
I have added the I have also identified a bug with |
|
Hi @francois-rozet I added back the I updated the notebook with the new API and added examples for the include/exclude feature. |
|
Thanks @valsdav! Shouldn't the default |
|
you're right!! fixed |
|
Hi @valsdav, after fixing the tests, I noticed more bugs with I have updated the tutorial to be more consistent with the other ones. I am ready to merge 🥳 |
|
Thanks @francois-rozet ! |
Bayesian Flows in zuko would be a very useful feature for uncertainty estimation with normalizing flows.
This PR introduces Bayesian Flow in zuko with minimal changes:
- BayesianLinear layers are provided as an alternative way of building the MLPs at the base of zuko transformations.
- All flows can be transformed to their Bayesian version by adding bayesian=True to the building arguments.
- A utility function total_KL_divergence(model) is added in zuko.utils to compute the total KL divergence from all the Bayesian layers in a model, to be added to the loss.

Let me know if you agree with the design of this feature! I added tests and a tutorial ;)
P.S.: CNF flows are not working for the moment but I haven't investigated why yet.