Description
I am using the new_training_yamls branch. My system has 1 TB of memory, so datagen wanted to create 8,559,921 files. Unfortunately, the training run went all weekend at 100% CPU but never finished, with practically no I/O... not great for a storage test.
I limited it to 100,000 files combined with `--client-host-memory-in-gb 1`, and it ran fairly quickly, e.g.:
781 steps completed in 91.85 s
However, bumping it up to 1,000,000 files took much more than 10x longer:
3906 steps completed in 4376.46 s
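For what it's worth, the slowdown is superlinear: the step count grew about 5x while wall time grew almost 48x, so per-step time went from roughly 0.12 s to 1.12 s. A quick sketch of that arithmetic, just restating the numbers above:

```python
# Sanity check on how run time scaled between the two runs reported above.
small_steps, small_secs = 781, 91.85      # 100,000-file run
large_steps, large_secs = 3906, 4376.46   # 1,000,000-file run

step_ratio = large_steps / small_steps    # ~5x more steps
time_ratio = large_secs / small_secs      # ~47.6x more wall time

print(f"steps grew {step_ratio:.1f}x, time grew {time_ratio:.1f}x")
print(f"per-step time: {small_secs / small_steps:.3f}s -> {large_secs / large_steps:.3f}s")
```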
The completed runs also reported all zeros:
[METRIC] Number of Simulated Accelerators: 4
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
So the questions I have are:
- Is this the right place to ask this question?
- Am I using the right code?
- What should I try next to make the retinanet test work?
Thanks,
Mark