retinanet runs forever without generating I/O #258

@pyite

I am using the new_training_yamls branch. My system has 1 TB of memory, so the benchmark asked me to create 8,559,921 files during datagen. Unfortunately, the training run then ran over the weekend at 100% CPU without ever finishing, and there was practically no I/O, which is not much good for a storage test.

I then limited the dataset to 100,000 files and passed --client-host-memory-in-gb 1, and the run completed fairly quickly, e.g.:

781 steps completed in 91.85 s

However, bumping it up to 1,000,000 files took far more than 10x as long:

3906 steps completed in 4376.46 s
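
To put numbers on the slowdown, here is a small sketch that uses only the two timings quoted above. The variable names are mine and purely illustrative. It suggests the wall-clock time grew roughly 47.6x while the step count grew only about 5x, so each step got roughly 9.5x slower.

```python
# Figures copied from the two completed runs quoted above.
runs = {
    100_000: {"steps": 781, "seconds": 91.85},
    1_000_000: {"steps": 3906, "seconds": 4376.46},
}


def seconds_per_step(run):
    """Average wall-clock time spent on one step."""
    return run["seconds"] / run["steps"]


small, large = runs[100_000], runs[1_000_000]

wall_ratio = large["seconds"] / small["seconds"]      # total time growth
step_ratio = large["steps"] / small["steps"]          # step count growth
per_step_ratio = seconds_per_step(large) / seconds_per_step(small)

print(f"wall-clock ratio:  {wall_ratio:.1f}x")
print(f"step-count ratio:  {step_ratio:.1f}x")
print(f"per-step slowdown: {per_step_ratio:.1f}x")
```

So the extra time is not explained by more steps alone; something per step appears to scale with the file count.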

The runs that did complete also reported zeroes for every utilization and throughput metric:

[METRIC] Number of Simulated Accelerators: 4
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
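
In case it helps anyone checking their own output for the same symptom, below is a rough sketch that parses metric lines in the format shown above and lists the ones that are exactly zero. The regular expression and the helper name `parse_metrics` are my own assumptions based only on these five lines, not part of the benchmark. I also do not know what the value in parentheses represents, so the sketch ignores it.

```python
import re

# Metric lines copied verbatim from the output above.
LOG = """\
[METRIC] Number of Simulated Accelerators: 4
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
"""

# Assumed layout: "[METRIC] <name>: <number>" with an optional
# parenthesised second number, which is matched but not captured.
METRIC_RE = re.compile(
    r"^\[METRIC\] (?P<name>.+): (?P<value>-?\d+(?:\.\d+)?)"
    r"(?: \(-?\d+(?:\.\d+)?\))?$"
)


def parse_metrics(text):
    """Return {metric name: float value} for numeric metric lines only."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_RE.match(line.strip())
        if match:  # non-numeric lines such as "...: fail" are skipped
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


metrics = parse_metrics(LOG)
zero_metrics = [name for name, value in metrics.items() if value == 0.0]

for name in zero_metrics:
    print(f"zero metric: {name}")
```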

So the questions I have are:

  1. Is this the right place to ask this question?
  2. Am I using the right code?
  3. What should I try next to make the retinanet test work?

Thanks,
Mark