TensorBoard data loading time: when an experiment involves numerous datasets, each containing multiple metrics, and the total data volume is around 5 GB, loading the data takes approximately 20 seconds.

I analyzed the source code of TensorBoard 2.20. Most of the time is spent in the _stub.ListScalars call, whose server-side handler is list_scalars in tensorboard-2.20.0/tensorboard/data/server/server.rs. One such call accounts for roughly 13 of the 20 seconds:

Thread-210 (process_request_thread)[7f849acfe640] LEAVE _stub.ListScalars - 12.963066s elapsed

After a thorough examination of the overall architecture, I have drawn the following conclusions:

Full Loading on Every Tag Fetch: RunData uses a double-lock structure combined with a full in-memory model. Any change to this design would have widespread, cascading effects.

Nested Double-Layer Locking: RwLock<HashMap<Run, RwLock<RunData>>>

Outer Lock: Protects the collection of runs (for adding or deleting runs).

Inner Lock: Protects all data (scalars, tensors, etc.) within a single run.
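To make the two layers concrete, here is a minimal sketch of the nested structure. The types are simplified stand-ins (plain `String` keys and a `RunData` with a single `scalars` field are my assumptions, not the real definitions). The lock acquisition order mirrors the `list_scalars` handler quoted in this issue: outer read lock first, then the inner read lock of the selected run.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Simplified stand-in for the real RunData; the field layout is an assumption.
#[derive(Default)]
pub struct RunData {
    /// tag name -> (step, value) points, in ascending step order
    pub scalars: HashMap<String, Vec<(i64, f32)>>,
}

/// Outer lock guards the set of runs; each inner lock guards one run's data.
pub type Runs = RwLock<HashMap<String, RwLock<RunData>>>;

/// Returns the last step recorded for `tag` in `run`, reading through both locks.
/// Returns None if a lock is poisoned, or the run or tag is missing or empty.
pub fn last_step(runs: &Runs, run: &str, tag: &str) -> Option<i64> {
    // Outer read guard: the set of runs cannot change while it is held.
    let outer = runs.read().ok()?;
    // Inner read guard: this run's data cannot change while it is held.
    let inner = outer.get(run)?.read().ok()?;
    inner.scalars.get(tag)?.last().map(|&(step, _)| step)
}
```

Both guards are held for the whole lookup, so a writer that wants to add a run or append to this run has to wait until the function returns.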

Inefficient Data Access: Accessing any data requires first acquiring a read lock on the outer layer to locate the run. Then, the inner RwLock<RunData> is cloned (performing a deep copy of the entire run's data). After this, the outer lock is released, and processing is done on the clone. Direct reference passing is impossible, necessitating a clone. This results in significant memory overhead and prevents integration with external storage.

Fully Loaded, No Lazy Loading:

All tags and their corresponding data points are loaded entirely at startup or when a run is loaded.

Even when querying just a single tag, the entire RunData must be cloned.

Given this analysis, how can we improve the data loading efficiency for experiments? I believe the following solutions could be considered:

Parallelize Run Processing: Change the current sequential loop over runs so that runs are processed concurrently.
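As a rough illustration of the parallel variant, the sketch below summarizes each run on its own scoped thread using only the standard library. Everything here is hypothetical: `RunData`, `TagSummary`, `summarize_run`, and `summarize_all` are simplified names I made up, and the real handler is an async gRPC method, so an actual change would more likely use a bounded thread pool instead of one thread per run.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

// Simplified stand-in: tag name -> (step, wall_time) points in ascending step order.
pub struct RunData {
    pub scalars: HashMap<String, Vec<(i64, f64)>>,
}

/// Per-tag summary: (tag, max_step, max_wall_time).
pub type TagSummary = (String, i64, f64);

/// Summarizes one run; tags with no points are skipped.
fn summarize_run(data: &RunData) -> Vec<TagSummary> {
    let mut out: Vec<TagSummary> = data
        .scalars
        .iter()
        .filter_map(|(tag, points)| {
            let max_step = points.last()?.0;
            let max_wt = points.iter().map(|p| p.1).fold(f64::NEG_INFINITY, f64::max);
            Some((tag.clone(), max_step, max_wt))
        })
        .collect();
    out.sort_by(|a, b| a.0.cmp(&b.0)); // deterministic order for callers
    out
}

/// Summarizes all runs concurrently, one scoped thread per run.
pub fn summarize_all(runs: &HashMap<String, RwLock<RunData>>) -> Vec<(String, Vec<TagSummary>)> {
    let mut results: Vec<(String, Vec<TagSummary>)> = thread::scope(|s| {
        let handles: Vec<_> = runs
            .iter()
            .map(|(name, lock)| {
                s.spawn(move || {
                    // Each worker takes only its own run's read lock.
                    let data = lock.read().expect("run lock poisoned");
                    let summary = summarize_run(&data);
                    (name.clone(), summary)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    results.sort_by(|a, b| a.0.cmp(&b.0));
    results
}
```

Scoped threads can borrow the run map directly, so nothing is copied. Whether this helps in practice depends on whether the time goes to per-run computation or to something shared, which I have not measured.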

Re-evaluate the Double-Lock Mechanism: The nested locking significantly impacts performance. Is it possible to redesign this without nested locks?
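One possible direction, purely as a sketch: keep a single lock around the run map and store each run as an immutable `Arc<RunData>` snapshot. Readers clone the `Arc` (a reference-count increment, not a deep copy) and release the lock immediately; writers publish a new snapshot. `Store`, `publish`, and `snapshot` are hypothetical names, and this is not how the current server is written.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Simplified stand-in for the real RunData; the field layout is an assumption.
pub struct RunData {
    pub scalars: HashMap<String, Vec<(i64, f32)>>,
}

/// Single-lock store: the map is locked, the per-run data is shared immutably.
pub struct Store {
    runs: RwLock<HashMap<String, Arc<RunData>>>,
}

impl Store {
    pub fn new() -> Self {
        Store { runs: RwLock::new(HashMap::new()) }
    }

    /// Replaces the snapshot for `run`. Existing readers keep their old snapshot.
    pub fn publish(&self, run: &str, data: RunData) {
        self.runs
            .write()
            .expect("runs lock poisoned")
            .insert(run.to_string(), Arc::new(data));
    }

    /// Returns a shared handle to the current snapshot; the lock is released on return.
    pub fn snapshot(&self, run: &str) -> Option<Arc<RunData>> {
        let runs = self.runs.read().ok()?;
        runs.get(run).cloned() // clones the Arc pointer only
    }
}
```

The trade-off is on the write side: appending points means building and publishing a new `RunData`, which is expensive for large runs unless the data is split into smaller shared pieces, for example one `Arc` per tag.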

Modify Frontend Default Behavior: Could the frontend interface be adjusted to not select all data by default? Loading all data at once places excessive pressure on the system.


```rust
    async fn list_scalars(
        &self,
        req: Request<data::ListScalarsRequest>,
    ) -> Result<Response<data::ListScalarsResponse>, Status> {
        let req = req.into_inner();
        let want_plugin = parse_plugin_filter(req.plugin_filter)?;
        let (run_filter, tag_filter) = parse_rtf(req.run_tag_filter);
        let runs = self.read_runs()?;

        let mut res: data::ListScalarsResponse = Default::default();
        for (run, data) in runs.iter() {
            if !run_filter.want(run) {
                continue;
            }
            let data = data
                .read()
                .map_err(|_| Status::internal(format!("failed to read run data for {:?}", run)))?;
            let mut run_res: data::list_scalars_response::RunEntry = Default::default();
            for (tag, ts) in &data.scalars {
                if !tag_filter.want(tag) {
                    continue;
                }
                if plugin_name(&ts.metadata) != Some(&want_plugin) {
                    continue;
                }
                let max_step = match ts.valid_values().last() {
                    None => continue,
                    Some((step, _, _)) => step,
                };
                // TODO(@wchargin): Consider tracking this on the time series itself?
                let max_wall_time = ts
                    .valid_values()
                    .map(|(_, wt, _)| wt)
                    .max()
                    .expect("have valid values for step but not wall time");
                run_res.tags.push(data::list_scalars_response::TagEntry {
                    tag_name: tag.0.clone(),
                    metadata: Some(data::ScalarMetadata {
                        max_step: max_step.into(),
                        max_wall_time: max_wall_time.into(),
                        summary_metadata: Some(*ts.metadata.clone()),
                        ..Default::default()
                    }),
                });
            }
            if !run_res.tags.is_empty() {
                run_res.run_name = run.0.clone();
                res.runs.push(run_res);
            }
        }

        Ok(Response::new(res))
    }
```
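The TODO in the handler hints at one cheap improvement: `max_wall_time` is recomputed by scanning every valid value of every tag on each request. A minimal sketch of tracking it on the time series instead is shown here. This `TimeSeries` is a simplified stand-in of my own, not the real type; in particular, I assume points are only appended, so any eviction, downsampling, or invalidation the real type performs would also have to refresh the cached value.

```rust
/// Simplified stand-in for a scalar time series (hypothetical, append-only).
pub struct TimeSeries {
    /// (step, wall_time, value), appended in ascending step order.
    points: Vec<(i64, f64, f32)>,
    /// Cached maximum wall time over all points; None while empty.
    max_wall_time: Option<f64>,
}

impl TimeSeries {
    pub fn new() -> Self {
        TimeSeries { points: Vec::new(), max_wall_time: None }
    }

    /// Appends a point and updates the cached maximum in O(1).
    pub fn push(&mut self, step: i64, wall_time: f64, value: f32) {
        self.max_wall_time = Some(match self.max_wall_time {
            Some(current) => current.max(wall_time),
            None => wall_time,
        });
        self.points.push((step, wall_time, value));
    }

    /// Last step, relying on ascending step order.
    pub fn max_step(&self) -> Option<i64> {
        self.points.last().map(|p| p.0)
    }

    /// Cached maximum wall time; no scan over the points is needed.
    pub fn max_wall_time(&self) -> Option<f64> {
        self.max_wall_time
    }
}
```

With a cache like this, the per-tag cost of listing scalars no longer depends on the number of stored points, though whether this scan is the dominant cost in the 13-second call would need to be confirmed by profiling.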


Tensorboard data loading time: When an experiment involves numerous datasets, each containing multiple metrics, and the total data volume is around 5GB, loading the data can take approximately 20 seconds. #7053
