The validity of the <d, c, u> tuples is the foundation of this study. Although the paper validates them in RQ1, the current evaluation does not seem sufficient to establish their reliability. Specifically, the average similarity of similar and dissimilar pairs, as currently presented, appears too coarse to reflect the quality of the <d, c, u> tuples, because it depends heavily on the distribution of the data points. I suggest providing more fine-grained results that offer deeper insight into the validity of <d, c, u>. For example, the authors could construct triples <c1, c2, c3>, where c1 and c2 are similar while c1 and c3 are dissimilar, and check within each triple whether the cosine similarity of the <d, c, u> representations is higher for the pair (c1, c2) than for the pair (c1, c3). Statistical testing could also be used to establish the significance of the results more rigorously. Furthermore, the design of <d, c, u> was changed in RQ2 to probe whether the models capture abstractions of syntactic information. This design change was not validated in RQ1, which raises concerns about the validity of the RQ2 findings. To address this, I recommend validating the modified <d, c, u> design as part of RQ1.
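To make the triple-based suggestion concrete, here is a minimal sketch of the check I have in mind. It is my own illustration, not the authors' pipeline: the function names are hypothetical, each code snippet is assumed to be represented by a flat vector derived from its <d, c, u> tuple, and the one-sided exact sign test is only one possible choice of test (a Wilcoxon signed-rank test on the margins would be equally reasonable).

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# A triple holds the (assumed) <d, c, u> vectors of c1, c2, c3,
# where c1/c2 are labeled similar and c1/c3 dissimilar.
Triple = Tuple[Vector, Vector, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def triple_margins(triples: Sequence[Triple]) -> List[float]:
    """Per-triple margin: sim(c1, c2) - sim(c1, c3). Positive is the expected sign."""
    return [cosine(c1, c2) - cosine(c1, c3) for c1, c2, c3 in triples]


def triple_accuracy(margins: Sequence[float]) -> float:
    """Fraction of triples in which the similar pair scores strictly higher."""
    return sum(m > 0 for m in margins) / len(margins)


def sign_test_p_value(margins: Sequence[float]) -> float:
    """One-sided exact sign test of H0: a positive margin is no more likely
    than a negative one. Ties (zero margins) are dropped."""
    wins = sum(m > 0 for m in margins)
    losses = sum(m < 0 for m in margins)
    n = wins + losses
    if n == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(n, 0.5)
    return sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

Reporting the per-triple accuracy together with the p-value would show how often the representation orders each pair correctly, which the aggregate averages cannot show on their own. The same check could be rerun on the modified <d, c, u> design used in RQ2.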
Here is what needs to be addressed: