Datasets are not sorted by time, model uses information from the future

I went through the code and one thing is bothering me: I think there is a major bug in the implementation. It is possible that I have misunderstood something, so please correct me if I'm wrong, but as far as I can tell this code trains and validates using information "from the future".

If you examine the values computed at the line below, you will see that some of the time deltas are negative.
https://github.com/DyGRec/TGSRec/blob/0c7ba17b1b787648ac0af0b57dfc7b91f2f00654/model.py#L557

As far as I can see, the mask is created only for the 0 values, so the negative values are still used.
https://github.com/DyGRec/TGSRec/blob/0c7ba17b1b787648ac0af0b57dfc7b91f2f00654/model.py#L575
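To make the concern concrete, here is a minimal sketch in plain Python. It is not the code from `model.py`: the helper names (`time_deltas`, `padding_mask`, `causal_mask`) and the sample values are hypothetical, and I am assuming that id 0 marks a padded slot and that the delta is the prediction time minus the neighbor's timestamp.

```python
def time_deltas(cut_time, neighbor_times):
    """Delta between the prediction time and each neighbor's interaction time."""
    return [cut_time - t for t in neighbor_times]


def padding_mask(neighbor_ids):
    """Mask only padded slots (assumed id 0); True means 'ignore this position'."""
    return [nid == 0 for nid in neighbor_ids]


def causal_mask(neighbor_ids, deltas):
    """Additionally mask neighbors with a negative delta, i.e. from the future."""
    return [nid == 0 or d < 0 for nid, d in zip(neighbor_ids, deltas)]


# Illustrative values: one padded slot, one past neighbor, one future neighbor.
neighbor_ids = [0, 5, 7]
neighbor_times = [0.0, 90.0, 130.0]
deltas = time_deltas(100.0, neighbor_times)

print(deltas)                             # [100.0, 10.0, -30.0]
print(padding_mask(neighbor_ids))         # [True, False, False]
print(causal_mask(neighbor_ids, deltas))  # [True, False, True]
```

With a padding-only mask, the neighbor with delta -30.0 still takes part in the computation, which is the behaviour I suspect is happening here.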

The data here is sorted by edge_ids (`x[1]`), not by timestamps, so a possible fix would be to sort by `x[2]` instead of `x[1]`.
https://github.com/DyGRec/TGSRec/blob/0c7ba17b1b787648ac0af0b57dfc7b91f2f00654/graph.py#L34-L39
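A rough sketch of the proposed change, assuming each adjacency entry is a `(neighbor, edge_idx, timestamp)` tuple as in the linked lines. `flatten_adjacency` and `neighbors_before` are hypothetical names of mine, and the binary-search lookup is only an assumption about how the flattened lists might be queried later, not something I have verified in the repository.

```python
from bisect import bisect_left


def flatten_adjacency(adj_list, sort_index):
    """Flatten per-node adjacency lists, sorting each by the given tuple index."""
    n_idx_l, e_idx_l, n_ts_l = [], [], []
    for curr in adj_list:
        curr = sorted(curr, key=lambda x: x[sort_index])
        n_idx_l.extend(x[0] for x in curr)
        e_idx_l.extend(x[1] for x in curr)
        n_ts_l.extend(x[2] for x in curr)
    return n_idx_l, e_idx_l, n_ts_l


def neighbors_before(n_idx_l, n_ts_l, cut_time):
    """Hypothetical lookup: only correct if n_ts_l is in ascending order."""
    return n_idx_l[:bisect_left(n_ts_l, cut_time)]


# Illustrative data: edge ids do not follow chronological order.
adj_list = [[(2, 0, 30.0), (3, 1, 10.0), (4, 2, 20.0)]]

by_edge = flatten_adjacency(adj_list, sort_index=1)  # current behaviour
by_time = flatten_adjacency(adj_list, sort_index=2)  # proposed fix

print(by_edge[2])  # [30.0, 10.0, 20.0]
print(by_time[2])  # [10.0, 20.0, 30.0]
print(neighbors_before(by_edge[0], by_edge[2], 25.0))  # [2, 3, 4]
print(neighbors_before(by_time[0], by_time[2], 25.0))  # [3, 4]
```

In this toy case the edge-sorted lists return neighbor 2 (timestamp 30.0) for a query at time 25.0, while the timestamp-sorted lists return only the two earlier neighbors.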

If you look at [TGSRec](https://github.com/DyGRec/TGSRec)/[datasets](https://github.com/DyGRec/TGSRec/tree/master/datasets)/[ml-100k](https://github.com/DyGRec/TGSRec/tree/master/datasets/ml-100k)/u.data, the data is not sorted by timestamp, and as far as I can tell there is no point in your codebase where this sorting happens.
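A quick way to check this, and to produce a sorted copy, could look like the sketch below. It assumes the standard MovieLens 100K layout for `u.data` (tab-separated `user id, item id, rating, timestamp`); the function names and the sample rows are made up for illustration and are not taken from the repository or the dataset.

```python
def timestamps(lines):
    """Extract the timestamp (fourth tab-separated column) from each row."""
    return [int(line.split("\t")[3]) for line in lines if line.strip()]


def is_sorted_by_time(lines):
    """True if the timestamps never decrease from one row to the next."""
    ts = timestamps(lines)
    return all(a <= b for a, b in zip(ts, ts[1:]))


def sort_by_time(lines):
    """Return the non-empty rows ordered by timestamp (stable for ties)."""
    rows = [line for line in lines if line.strip()]
    return sorted(rows, key=lambda line: int(line.split("\t")[3]))


# Made-up rows in the assumed u.data layout, deliberately out of order.
sample = ["1\t10\t5\t300\n", "2\t11\t3\t100\n", "1\t12\t4\t200\n"]

print(is_sorted_by_time(sample))                # False
print(is_sorted_by_time(sort_by_time(sample)))  # True
```

To run it on the real file, read the rows with `open("datasets/ml-100k/u.data").readlines()` and pass them to `is_sorted_by_time`.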

I ran experiments on ml-100k for both scenarios, your original implementation and the same code with the input data sorted by timestamp, and the results with sorted data are significantly worse, at least in the early stages of training. I haven't run it for 200 epochs, so the final results may be closer to each other, but first I would like to check whether my assumption is correct.

**Results after 20 epochs:**
*Without sorting:*

    valid acc: 0.7069337926425662
    valid auc: 0.8038448618385599
    valid f1: 0.7070636805233961
    valid ap: 0.8172432697828477

*With sorting*:

    valid acc: 0.5271334211112526
    valid auc: 0.7374971517076878
    valid f1: 0.6789490018391757
    valid ap: 0.7001478155001294



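For reference, the `graph.py` lines linked above:

    for i in range(len(adj_list)):
        curr = adj_list[i]
        # Entries are sorted by x[1], which ends up in e_idx_l (the edge index)
        curr = sorted(curr, key=lambda x: x[1])
        n_idx_l.extend([x[0] for x in curr])
        e_idx_l.extend([x[1] for x in curr])
        n_ts_l.extend([x[2] for x in curr])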
