
use concurrent.futures.Executor instead of multiprocessing pool to resolve conflict with duet #7938

Open

NoureldinYosri wants to merge 9 commits into quantumlib:main from NoureldinYosri:z_phase_threads

Conversation

@NoureldinYosri (Collaborator) commented Mar 7, 2026

Fixes b/490175992

@NoureldinYosri NoureldinYosri requested review from a team, mrwojtek and vtomole as code owners March 7, 2026 00:39
@github-actions github-actions bot added the size: S (10 < lines changed < 50) label Mar 7, 2026
@eliottrosenberg (Collaborator) left a comment

Fixes b/490175992 (discussed in quantumlib/ReCirq#461 (review))

@github-actions github-actions bot added size: M (50 < lines changed < 250) and removed size: S (10 < lines changed < 50) labels Mar 11, 2026
@codecov codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.63%. Comparing base (336a71f) to head (23a175e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7938   +/-   ##
=======================================
  Coverage   99.63%   99.63%           
=======================================
  Files        1110     1110           
  Lines       99685    99698   +13     
=======================================
+ Hits        99318    99336   +18     
+ Misses        367      362    -5     

☔ View full report in Codecov by Sentry.

@pavoljuhas (Collaborator) left a comment

Please check if the new ThreadPoolExecutor default is faster than a serial run. If it makes no difference or is worse, consider switching to a serial evaluation by default.

It may also be worthwhile to check whether creating a local multiprocessing.Pool with the spawn start method instead of fork would help with the warnings in the bug.

Ref: https://docs.python.org/3.11/library/multiprocessing.html#multiprocessing.get_context
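A minimal sketch of that spawn-pool suggestion, assuming the mapped function is a picklable top-level function (the names here are illustrative, not code from this PR):

```python
import multiprocessing


def square(x: int) -> int:
    # Must be a top-level function: the "spawn" start method pickles it
    # by reference for the fresh child interpreters.
    return x * x


def run_with_spawn_pool(values: list[int]) -> list[int]:
    # "spawn" starts clean child processes instead of forking, which
    # avoids fork-in-a-threaded-process warnings.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(run_with_spawn_pool([1, 2, 3]))  # prints [1, 4, 9]
```

Note the `if __name__ == "__main__"` guard: spawn re-imports the main module in each child, so pool creation must not run at import time.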

        pool = num_workers_or_pool  # pragma: no cover
    elif num_workers_or_pool != 0:
-        pool = multiprocessing.Pool(num_workers_or_pool if num_workers_or_pool > 0 else None)
+        pool = cf.ThreadPoolExecutor(num_workers_or_pool if num_workers_or_pool > 0 else None)
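For reference, the one-line swap works because Executor.map mirrors Pool.map's call shape: a function plus an iterable, with results yielded in input order. A small standalone check (illustrative values only):

```python
import concurrent.futures as cf


def double(x: int) -> int:
    return 2 * x


# Executor.map takes the same (function, iterable) arguments as Pool.map
# and yields results in input order, so the swap is API-compatible here.
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(double, [1, 2, 3]))
# results == [2, 4, 6]
```

One difference worth noting: Executor.map returns a lazy iterator rather than a list, and with max_workers=None ThreadPoolExecutor picks its own default (min(32, os.cpu_count() + 4) in recent CPython) rather than os.cpu_count().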
@pavoljuhas (Collaborator) commented Mar 11, 2026

ThreadPoolExecutor is subject to the GIL. Unless the mapped function spends most of its time in numpy calls or waiting on IO, execution will take about as long as a serial call, or longer due to thread-switching overhead.

I made a quick test with many-term sums computed in series and in parallel, comparing multiprocessing.Pool.map vs ThreadPoolExecutor.map. The ThreadPoolExecutor took about 2.5 times longer than a serial evaluation.

example timing code
def partial_sum(start_end: tuple[int, int]) -> float:
    total = 0
    for i in range(*start_end, 3):
        total += (-1) ** i * 1.0 / i
    return total


def tedious_sum(terms_count: int, mapfunc) -> float:
    total = sum(mapfunc(partial_sum, ((start, terms_count) for start in (1, 2, 3))))
    return total

# %timeit tedious_sum(10_000_000, map)
# 2.93 s ± 51.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pool = multiprocessing.Pool(3)
# %timeit tedious_sum(10_000_000, pool.map)
# 1.01 s ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# tpx = concurrent.futures.ThreadPoolExecutor(3)
# %timeit tedious_sum(10_000_000, tpx.map)
# 8.46 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
 

Can you make a quick comparison of the z_phase_calibration_workflow duration with ThreadPoolExecutor compared to a serial run?

If they are comparable, I'd suggest making serial evaluation the default.
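Such a comparison can be sketched with a toy harness (hypothetical code, not from the PR); for pure-Python CPU-bound work the GIL usually keeps the threaded path at or below serial speed:

```python
import concurrent.futures as cf
import time


def work(n: int) -> int:
    # Pure-Python CPU-bound toy workload; it holds the GIL throughout.
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(mapfunc, n: int = 200_000, chunks: int = 4) -> float:
    # Time one mapfunc(work, ...) pass over `chunks` equal work items.
    start = time.perf_counter()
    list(mapfunc(work, [n] * chunks))
    return time.perf_counter() - start


serial_s = timed(map)
with cf.ThreadPoolExecutor(max_workers=4) as tpx:
    threaded_s = timed(tpx.map)
# For pure-Python CPU work, threaded_s is typically >= serial_s.
```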

@pavoljuhas (Collaborator) commented

It is a bit hard to isolate the duration of z_phase_calibration_workflow calls in that colab.
If I patch the test here, which runs that function with 2 workers, to run it in serial instead, i.e.,

diff --git a/cirq-core/cirq/experiments/z_phase_calibration_test.py b/cirq-core/cirq/experiments/z_phase_calibration>
index 48db85ad2..309651e50 100644
--- a/cirq-core/cirq/experiments/z_phase_calibration_test.py
+++ b/cirq-core/cirq/experiments/z_phase_calibration_test.py
@@ -223,3 +223,3 @@ def test_calibrate_z_phases_workflow_no_options_no_pool(angles, error) -> None:
         random_state=_SEED,
-        num_workers_or_pool=2,
+        num_workers_or_pool=0,
     )

the test duration goes down from ~1.6 s to ~1.4 s, measured with

pytest -p no:randomly -Wignore --durations=1 \
"cirq-core/cirq/experiments/z_phase_calibration_test.py::test_calibrate_z_phases_workflow_no_options_no_pool[angles1-error1]"

@NoureldinYosri (Collaborator, Author) commented

The comparison I did was between serial and num_workers=None (i.e., os.cpu_count()), which on my device is 24.

    random_state: cirq.RANDOM_STATE_OR_SEED_LIKE = None,
    atol: float = 1e-3,
-    num_workers_or_pool: int | multiprocessing.pool.Pool = -1,
+    num_workers_or_pool: int | multiprocessing.pool.Pool | cf.Executor = -1,
@pavoljuhas (Collaborator) commented

AFAICT, the code later needs only the Pool.map or Executor.map functions.

Would it be possible to change this to accept either an int for a number of workers or a parallel-map function?

@NoureldinYosri (Collaborator, Author) commented

I think this is too much indirection; also, everything else in experiments/ expects either a pool or a number of workers.

@pavoljuhas (Collaborator) left a comment

The ThreadPoolExecutor seems to be slower than serial in the one test that uses it here. That said, I am fine going ahead if you are sure it works better in a more realistic use.


Labels

size: M (50 < lines changed < 250)


3 participants