Description
I am using the new_training_yamls branch. My system has 1 TB of memory, so datagen wanted to create 8,559,921 files. Unfortunately, the training run went all weekend at 100% CPU but never finished, with practically no I/O... not great for a storage test.
I limited it to 100,000 files combined with `--client-host-memory-in-gb 1`, and it ran fairly quickly, e.g.:
781 steps completed in 91.85 s
However, bumping it up to 1,000,000 files took much more than 10x longer:
3906 steps completed in 4376.46 s
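For what it's worth, the slowdown is superlinear: the step count grew about 5x while wall time grew almost 48x, so per-step time went from roughly 0.12 s to 1.12 s. A quick sketch of that arithmetic, just restating the numbers above:

```python
# Sanity check on how run time scaled between the two runs reported above.
small_steps, small_secs = 781, 91.85      # 100,000-file run
large_steps, large_secs = 3906, 4376.46   # 1,000,000-file run

step_ratio = large_steps / small_steps    # ~5x more steps
time_ratio = large_secs / small_secs      # ~47.6x more wall time

print(f"steps grew {step_ratio:.1f}x, time grew {time_ratio:.1f}x")
print(f"per-step time: {small_secs / small_steps:.3f}s -> {large_secs / large_steps:.3f}s")
```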
The completed runs also reported all zeros:
[METRIC] Number of Simulated Accelerators: 4
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
So the questions I have are:
- Is this the right place to ask this question?
- Am I using the right code?
- What should I try next to make the retinanet test work?
Thanks,
Mark