In baseclip_graph_v1.py, lines 215-220:
inputs_text = self.base_text_features.unsqueeze(dim=1) #[100, 1, 1024]
inputs_img = img_feature.unsqueeze(dim=1)
node_cluster_tt = node_cluster_t[:, :, index, :].repeat(inputs_text.size()[0], 1, 1) #[100, 100, 1024] t->t
node_cluster_it = node_cluster_i[:, :, index, :].repeat(inputs_text.size()[0], 1, 1) # i -> t
feat_tt = torch.cat([inputs_text, node_cluster_tt], dim=1)
feat_it = torch.cat([inputs_text, node_cluster_it], dim=1)
In these lines, inputs_img is created but never used. Is it unnecessary?
The paper says: "As shown in Fig. 3, to construct the visual knowledge sub-graph Gv = {Cv, Ev}, we pass the augmented image group from the same class into visual encoder to obtain their visual features, and then compute the mean features of them as the nodes."
Based on that description, should the last line instead be feat_it = torch.cat([inputs_img, node_cluster_it], dim=1)?
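To make the question concrete, here is a minimal sketch of the shapes as I understand them. Only the [100, 1, 1024] and [100, 100, 1024] shapes come from the code comments; the shapes of img_feature and node_cluster_i, and the values of batch, num_slots and index, are my assumptions and not taken from the repository.

```python
import torch

num_classes, dim = 100, 1024   # from the comments in the quoted code
batch = 4                      # hypothetical image batch size (assumption)
num_slots = 2                  # hypothetical size of the indexed axis (assumption)
index = 0                      # hypothetical index value (assumption)

base_text_features = torch.randn(num_classes, dim)
img_feature = torch.randn(batch, dim)                         # assumed shape
node_cluster_i = torch.randn(1, num_classes, num_slots, dim)  # assumed shape

inputs_text = base_text_features.unsqueeze(dim=1)  # [100, 1, 1024]
inputs_img = img_feature.unsqueeze(dim=1)          # [4, 1, 1024]

# Current code: image-derived cluster nodes are concatenated with TEXT features.
node_cluster_it = node_cluster_i[:, :, index, :].repeat(inputs_text.size()[0], 1, 1)
feat_it = torch.cat([inputs_text, node_cluster_it], dim=1)  # [100, 101, 1024]

# Variant asked about: concatenate with IMAGE features instead.
# Under the assumed shapes, the repeat count must then follow inputs_img,
# otherwise dim 0 (4 vs. 100) does not match and torch.cat raises an error.
node_cluster_it_img = node_cluster_i[:, :, index, :].repeat(inputs_img.size()[0], 1, 1)
feat_it_img = torch.cat([inputs_img, node_cluster_it_img], dim=1)  # [4, 101, 1024]

print(tuple(feat_it.shape), tuple(feat_it_img.shape))
```

If these assumed shapes are right, simply swapping inputs_text for inputs_img would not be enough, because the repeat on the line above would also have to change. That makes me unsure whether using inputs_text here is intentional.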
I am confused about this problem and would appreciate your response. Thank you very much.