TensorBoard data loading time: when an experiment has many datasets, each with multiple metrics, and about 5 GB of data in total, loading takes approximately 20 seconds. #7053

@tianenshi69-max

I have analyzed the source code of TensorBoard version 2.20. The time is spent in the _stub.ListScalars call, which is handled in tensorboard-2.20.0/tensorboard/data/server/server.rs.

Thread-210 (process_request_thread)[7f849acfe640] LEAVE _stub.ListScalars - 12.963066s elapsed

After a thorough examination of the overall architecture, I have drawn the following conclusions:

Full Load on Every Tag Fetch: RunData combines a double-lock structure with a fully in-memory model, so any change to this design would have wide, cascading effects.

Nested Double-Layer Locking: `RwLock<HashMap<Run, RwLock<RunData>>>`

Outer Lock: Protects the collection of runs (for adding or deleting runs).

Inner Lock: Protects all data (scalars, tensors, etc.) within a single run.

Inefficient Data Access: Accessing any data requires first acquiring a read lock on the outer layer to locate the run. Then, the inner RwLock is cloned (performing a deep copy of the entire run's data). After this, the outer lock is released, and processing is done on the clone. Direct reference passing is impossible, necessitating a clone. This results in significant memory overhead and prevents integration with external storage.

Fully Loaded, No Lazy Loading:

All tags and their corresponding data points are loaded entirely at startup or when a run is loaded.

Even when querying just a single tag, the entire RunData must be cloned.
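To make the cost difference concrete, here is a minimal model of the nested layout described above. `Run`, `RunData`, and both helper functions are simplified, hypothetical stand-ins rather than the types in server.rs. The sketch contrasts a cloning read path with one that borrows through both read guards. Which path the real server takes should be confirmed against the source.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical, simplified stand-ins; the real `Run` and `RunData` carry more fields.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Run(pub String);

#[derive(Debug, Default, Clone)]
pub struct RunData {
    // tag name -> (step, value) points
    pub scalars: HashMap<String, Vec<(i64, f32)>>,
}

// Outer lock guards the set of runs; inner lock guards one run's data.
pub type Runs = RwLock<HashMap<Run, RwLock<RunData>>>;

/// Cloning path: deep-copies the whole run to answer a one-tag query.
pub fn last_step_cloning(runs: &Runs, run: &Run, tag: &str) -> Option<i64> {
    let copy: RunData = {
        let outer = runs.read().ok()?;
        let inner = outer.get(run)?.read().ok()?;
        let cloned = RunData::clone(&inner);
        cloned
    }; // both guards are released here; work continues on the copy
    let step = copy.scalars.get(tag)?.last()?.0;
    Some(step)
}

/// Borrowing path: holds both read guards for the lookup and copies only one integer.
pub fn last_step_borrowing(runs: &Runs, run: &Run, tag: &str) -> Option<i64> {
    let outer = runs.read().ok()?;
    let inner = outer.get(run)?.read().ok()?;
    let step = inner.scalars.get(tag)?.last()?.0;
    Some(step)
}
```

The borrowing path trades memory for lock hold time: nothing is copied, but writers to that run wait until the guards are dropped.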

Given this analysis, how can we improve the data loading efficiency for experiments? I believe the following solutions could be considered:

Parallelize Run Processing: Process runs concurrently instead of sequentially to reduce the time per request.
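One way the per-run loop could be made concurrent is with scoped threads from the standard library. This is a sketch under assumptions, not a patch for server.rs: `Series`, `summarize_run`, and `summarize_all` are hypothetical names, runs are keyed by plain strings, and one thread is spawned per run. A bounded pool would likely be preferable when there are thousands of runs.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

// Hypothetical simplified series: (step, wall_time) points in step order.
pub type Series = Vec<(i64, f64)>;
pub type TagSummary = (String, i64, f64); // (tag, max_step, max_wall_time)

/// Summarizes one run; tags with no points are skipped.
pub fn summarize_run(scalars: &HashMap<String, Series>) -> Vec<TagSummary> {
    let mut out: Vec<TagSummary> = scalars
        .iter()
        .filter_map(|(tag, pts)| {
            let max_step = pts.last()?.0;
            let max_wt = pts.iter().map(|p| p.1).fold(f64::NEG_INFINITY, f64::max);
            Some((tag.clone(), max_step, max_wt))
        })
        .collect();
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}

/// Summarizes every run on its own scoped thread, then sorts by run name.
pub fn summarize_all(
    runs: &HashMap<String, RwLock<Series2>>,
) -> Vec<(String, Vec<TagSummary>)> {
    let mut results = thread::scope(|s| {
        // Spawn all workers first so they run concurrently, then join them.
        let handles: Vec<_> = runs
            .iter()
            .map(|(name, data)| {
                s.spawn(move || {
                    let guard = data.read().expect("run lock poisoned");
                    (name.clone(), summarize_run(&guard))
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect::<Vec<_>>()
    });
    results.sort_by(|a, b| a.0.cmp(&b.0));
    results
}

// One run's data: tag name -> series.
pub type Series2 = HashMap<String, Series>;
```

Each worker takes only its own run's read lock, so the runs do not contend with each other. The final sort restores a deterministic order, since threads may finish in any order.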

Re-evaluate the Double-Lock Mechanism: The nested locking significantly impacts performance. Is it possible to redesign this without nested locks?
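As one possible direction for that redesign, a single lock over reference-counted snapshots avoids the inner lock entirely. The sketch below is an assumption-laden illustration: `Store`, `snapshot`, and `append` are hypothetical names, and `RunData` is a simplified stand-in. Readers copy only an `Arc` pointer. Writers use `Arc::make_mut`, which copies the run's data only when a reader still holds an older snapshot.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical, simplified stand-in for the real per-run data.
#[derive(Debug, Default, Clone)]
pub struct RunData {
    // tag name -> (step, value) points
    pub scalars: HashMap<String, Vec<(i64, f32)>>,
}

/// One lock over the run map; each run is an immutable, shareable snapshot.
#[derive(Default)]
pub struct Store {
    runs: RwLock<HashMap<String, Arc<RunData>>>,
}

impl Store {
    /// Returns a cheap handle to the run's current data (a pointer copy, not a deep copy).
    pub fn snapshot(&self, run: &str) -> Option<Arc<RunData>> {
        let runs = self.runs.read().ok()?;
        let snap = runs.get(run).cloned();
        snap
    }

    /// Appends a point. Copy-on-write: the data is cloned only if a snapshot is still shared.
    pub fn append(&self, run: &str, tag: &str, point: (i64, f32)) {
        let mut runs = self.runs.write().expect("store lock poisoned");
        let data = runs.entry(run.to_string()).or_default();
        Arc::make_mut(data)
            .scalars
            .entry(tag.to_string())
            .or_default()
            .push(point);
    }
}
```

The trade-off is that a write during a long-lived read pays for one copy of that run, so this fits workloads where reads dominate. Finer-grained sharing (for example, an `Arc` per tag) would reduce that cost further.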

Modify Frontend Default Behavior: Could the frontend interface be adjusted to not select all data by default? Loading all data at once places excessive pressure on the system.

```rust
async fn list_scalars(
    &self,
    req: Request<data::ListScalarsRequest>,
) -> Result<Response<data::ListScalarsResponse>, Status> {
    let req = req.into_inner();
    let want_plugin = parse_plugin_filter(req.plugin_filter)?;
    let (run_filter, tag_filter) = parse_rtf(req.run_tag_filter);
    let runs = self.read_runs()?;

    let mut res: data::ListScalarsResponse = Default::default();
    for (run, data) in runs.iter() {
        if !run_filter.want(run) {
            continue;
        }
        let data = data
            .read()
            .map_err(|_| Status::internal(format!("failed to read run data for {:?}", run)))?;
        let mut run_res: data::list_scalars_response::RunEntry = Default::default();
        for (tag, ts) in &data.scalars {
            if !tag_filter.want(tag) {
                continue;
            }
            if plugin_name(&ts.metadata) != Some(&want_plugin) {
                continue;
            }
            let max_step = match ts.valid_values().last() {
                None => continue,
                Some((step, _, _)) => step,
            };
            // TODO(@wchargin): Consider tracking this on the time series itself?
            let max_wall_time = ts
                .valid_values()
                .map(|(_, wt, _)| wt)
                .max()
                .expect("have valid values for step but not wall time");
            run_res.tags.push(data::list_scalars_response::TagEntry {
                tag_name: tag.0.clone(),
                metadata: Some(data::ScalarMetadata {
                    max_step: max_step.into(),
                    max_wall_time: max_wall_time.into(),
                    summary_metadata: Some(*ts.metadata.clone()),
                    ..Default::default()
                }),
            });
        }
        if !run_res.tags.is_empty() {
            run_res.run_name = run.0.clone();
            res.runs.push(run_res);
        }
    }

    Ok(Response::new(res))
}
```
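The TODO in the function above hints at another saving: the loop walks every point of every matching series twice just to find the last step and the maximum wall time. A rough sketch of tracking the maximum at insertion time follows. `TimeSeries` here is a hypothetical simplified type, and the sketch ignores whatever invalidation `valid_values()` filters out in the real code, which would require recomputing the cached value.

```rust
/// Hypothetical simplified series that caches its maximum wall time on insert.
#[derive(Debug, Default)]
pub struct TimeSeries {
    // (step, wall_time, value), appended in step order
    points: Vec<(i64, f64, f32)>,
    // Running maximum over all wall times pushed so far
    max_wall_time: Option<f64>,
}

impl TimeSeries {
    /// Appends a point and updates the cached maximum in O(1).
    pub fn push(&mut self, step: i64, wall_time: f64, value: f32) {
        self.points.push((step, wall_time, value));
        self.max_wall_time = Some(match self.max_wall_time {
            Some(current) => current.max(wall_time),
            None => wall_time,
        });
    }

    /// Last step pushed, without scanning the series.
    pub fn max_step(&self) -> Option<i64> {
        self.points.last().map(|p| p.0)
    }

    /// Cached maximum wall time, without scanning the series.
    pub fn max_wall_time(&self) -> Option<f64> {
        self.max_wall_time
    }
}
```

With both values available in constant time, a listing request would cost time proportional to the number of tags rather than the number of data points.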
