Category: A1; Team name: GAAIMC; Dataset: Github (MUSAE) #217

Open
ixime wants to merge 5 commits into geometric-intelligence:main from ixime:musae_github

ixime commented Nov 7, 2025

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • My PR follows PEP8 guidelines. (refer to comment below)
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This pull request adds the Github (MUSAE) dataset, published in [1], for the TAG-DS Topological Deep Learning Challenge 2025: expanding the data landscape.

This dataset contains a large social network of GitHub developers which was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories and edges are mutual follower relationships between them. The vertex features are extracted based on the location, repositories starred, employer and e-mail address. The task related to the graph is binary node classification - one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user. [2]
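
To make the structure described above concrete, here is a minimal sketch of how such a graph could be assembled into arrays. It is not the loader added in this PR: the function name `build_graph`, the toy inputs, and the assumed raw layout (features given as a mapping from node id to a list of active feature indices) are all hypothetical and only illustrate the node, edge, and label roles.

```python
import numpy as np


def build_graph(edges, features, targets):
    """Assemble arrays for an undirected graph with binary node labels.

    edges    : iterable of (u, v) pairs, one per mutual-follower relationship
    features : dict, node id -> list of active feature indices (assumed layout)
    targets  : dict, node id -> 0 or 1 (the two developer classes)
    """
    num_nodes = len(targets)
    num_features = 1 + max(i for idx in features.values() for i in idx)

    # Multi-hot feature matrix: x[n, j] = 1 if feature j is active for node n.
    x = np.zeros((num_nodes, num_features), dtype=np.float32)
    for node, idx in features.items():
        x[node, idx] = 1.0

    # Store every undirected edge in both directions: shape (2, 2 * num_edges).
    e = np.asarray(list(edges), dtype=np.int64).T
    edge_index = np.concatenate([e, e[::-1]], axis=1)

    # Binary node labels, ordered by node id.
    y = np.array([targets[n] for n in range(num_nodes)], dtype=np.int64)
    return x, edge_index, y


# Toy example with three developers and two mutual-follower edges.
edges = [(0, 1), (1, 2)]
features = {0: [0, 3], 1: [1], 2: [2, 3]}
targets = {0: 0, 1: 1, 2: 0}
x, edge_index, y = build_graph(edges, features, targets)
```

Storing each edge in both directions mirrors the usual convention for undirected graphs in PyG-style `edge_index` arrays; a different convention would work equally well as long as it is applied consistently.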

This dataset is available in PyG [3], but its download URL is broken, so we download it from [2] instead. In [4], the features were reduced to a dimensionality of 128 using SVD. We added this dimensionality reduction as a data transformation, and it is applied by default for this dataset. The complete feature data is kept, however, in case another kind of data transformation is chosen.

The same data transformation is used in PRs #214, #216, and #229.
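
As an illustration of the SVD-based reduction mentioned above, the sketch below projects a feature matrix onto its top 128 singular directions using plain NumPy. It is an assumption about how such a transform could look, not the transform implemented in this PR: the function name `svd_reduce`, the use of a dense `np.linalg.svd`, the absence of centering, and the random toy features are all hypothetical.

```python
import numpy as np


def svd_reduce(x, n_components=128):
    """Project the rows of x onto the top n_components right singular vectors."""
    k = min(n_components, min(x.shape))
    # Thin SVD: x = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    # Same as x @ vt[:k].T: coordinates of each row in the reduced space.
    return u[:, :k] * s[:k]


# Toy multi-hot features: 300 nodes, 500 sparse binary features.
rng = np.random.default_rng(0)
x_full = (rng.random((300, 500)) < 0.05).astype(float)

x_reduced = svd_reduce(x_full)  # shape (300, 128)
# x_full is left untouched, so other transformations can still start from it.
```

A dense SVD keeps the sketch short. For a large sparse feature matrix, a truncated solver that computes only the leading components would likely be the more practical choice.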

References:

[1] B. Rozemberczki, C. Allen and R. Sarkar. Multi-scale Attributed Node Embedding. 2019.
[2] SNAP: Network Datasets: Github (MUSAE)
[3] Github in PyG
[4] B. Rozemberczki and R. Sarkar. Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models. 2020.

Issue

Additional context

levtelyatnikov added the category-a1 label (Submission to TDL Challenge 2025: Mission A, Category 1.) on Nov 24, 2025