use concurrent.futures.Executor instead of multiprocessing pool to resolve conflict with duet #7938
NoureldinYosri wants to merge 9 commits into quantumlib:main
Conversation
eliottrosenberg left a comment
Fixes b/490175992 (discussed in quantumlib/ReCirq#461 (review))
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #7938   +/-   ##
=======================================
  Coverage   99.63%   99.63%
=======================================
  Files        1110     1110
  Lines       99685    99698    +13
=======================================
+ Hits        99318    99336    +18
+ Misses        367      362     -5
```
Please check if the new ThreadPoolExecutor default is faster than a serial run. If it makes no difference or is worse, consider switching to a serial evaluation by default.
Also, it may be worthwhile to check whether creating a local multiprocessing.Pool with a spawn start method instead of fork would help with the warnings in the bug.
Ref: https://docs.python.org/3.11/library/multiprocessing.html#multiprocessing.get_context
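A minimal sketch of the spawn-context suggestion (the function and names here are illustrative, not the PR's actual code):

```python
import multiprocessing


def square(x: int) -> int:
    # Trivial work function; with the "spawn" start method it must be
    # defined at module top level, because spawn re-imports the module
    # in each worker process.
    return x * x


def run_with_spawn_pool(values: list[int]) -> list[int]:
    # get_context("spawn") starts fresh interpreter processes instead of
    # fork()-ing the parent, which avoids the fork-related warnings seen
    # when the parent process is multi-threaded (e.g. under pytest).
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(run_with_spawn_pool([1, 2, 3]))  # [1, 4, 9]
```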
```diff
     pool = num_workers_or_pool  # pragma: no cover
 elif num_workers_or_pool != 0:
-    pool = multiprocessing.Pool(num_workers_or_pool if num_workers_or_pool > 0 else None)
+    pool = cf.ThreadPoolExecutor(num_workers_or_pool if num_workers_or_pool > 0 else None)
```
ThreadPoolExecutor is subject to the GIL. Unless the mapped function spends most of its time in numpy calls or waiting on IO, execution will be the same as a serial call, or worse due to thread-switching overhead.
I made a quick test with a many-term sum computed in series vs in parallel with multiprocessing.Pool.map vs ThreadPoolExecutor.map. The ThreadPoolExecutor took about 2.5 times longer than a serial evaluation.
example timing code

```python
import concurrent.futures
import multiprocessing


def partial_sum(start_end: tuple[int, int]) -> float:
    # Sum every 3rd term of the alternating series (-1)**i / i over
    # [start, end); pure-Python CPU-bound work, so it holds the GIL.
    start, end = start_end
    total = 0.0
    for i in range(start, end, 3):
        total += (-1) ** i * 1.0 / i
    return total


def tedious_sum(terms_count: int, mapfunc) -> float:
    # Split the series into 3 interleaved partial sums and combine them
    # with the supplied map function (serial map, Pool.map, or Executor.map).
    return sum(mapfunc(partial_sum, ((start, terms_count) for start in (1, 2, 3))))


# %timeit tedious_sum(10_000_000, map)
# 2.93 s ± 51.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pool = multiprocessing.Pool(3)
# %timeit tedious_sum(10_000_000, pool.map)
# 1.01 s ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# tpx = concurrent.futures.ThreadPoolExecutor(3)
# %timeit tedious_sum(10_000_000, tpx.map)
# 8.46 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Can you make a quick comparison of the z_phase_calibration_workflow duration with ThreadPoolExecutor vs a serial run?
If they are comparable, I'd suggest making serial evaluation the default.
see https://colab.sandbox.google.com/drive/17eskpU9OH_L-mLIIcrFNkafhsZwgT4BN#scrollTo=GH8YAuLgFZ__

serial: 4m35s
with ThreadPoolExecutor(): 3m54s
It is a bit hard to isolate the duration of the z_phase_calibration_workflow calls in that colab.
If I patch the test here, which runs that function with 2 workers, so that it runs serially, i.e.,
```diff
diff --git a/cirq-core/cirq/experiments/z_phase_calibration_test.py b/cirq-core/cirq/experiments/z_phase_calibration>
index 48db85ad2..309651e50 100644
--- a/cirq-core/cirq/experiments/z_phase_calibration_test.py
+++ b/cirq-core/cirq/experiments/z_phase_calibration_test.py
@@ -223,3 +223,3 @@ def test_calibrate_z_phases_workflow_no_options_no_pool(angles, error) -> None:
         random_state=_SEED,
-        num_workers_or_pool=2,
+        num_workers_or_pool=0,
     )
```
the test duration goes down from ~1.6 s to ~1.4 s, as measured with

```shell
pytest -p no:randomly -Wignore --durations=1 \
    "cirq-core/cirq/experiments/z_phase_calibration_test.py::test_calibrate_z_phases_workflow_no_options_no_pool[angles1-error1]"
```
the comparison I did was between serial and num_workers=None (i.e., os.cpu_count()), which on my device is 24
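One subtlety worth noting: multiprocessing.Pool and ThreadPoolExecutor interpret a worker count of None differently (per the CPython docs, Pool defaults to os.cpu_count() while ThreadPoolExecutor caps at min(32, cpus + 4) on recent versions), so "None" can mean different degrees of parallelism. A small sketch of the documented defaults, purely for illustration:

```python
import os


def default_workers() -> tuple[int, int]:
    # Documented defaults in CPython:
    #   multiprocessing.Pool(processes=None)      -> os.cpu_count() workers
    #   ThreadPoolExecutor(max_workers=None)      -> min(32, cpus + 4)
    cpus = os.cpu_count() or 1
    pool_default = cpus
    executor_default = min(32, cpus + 4)
    return pool_default, executor_default
```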
```diff
     random_state: cirq.RANDOM_STATE_OR_SEED_LIKE = None,
     atol: float = 1e-3,
-    num_workers_or_pool: int | multiprocessing.pool.Pool = -1,
+    num_workers_or_pool: int | multiprocessing.pool.Pool | cf.Executor = -1,
```
AFAICT, the code later needs only the Pool.map or Executor.map functions.
Would it be possible to change this to accept either an int for a number of workers or a parallel-map function?
I think this is too much indirection; also, everywhere else in experiments/ expects either a pool or a number of workers
pavoljuhas left a comment
The ThreadPoolExecutor seems to be slower than serial in the one test that uses it here. That said, I am fine going ahead if you are sure it works better in a more realistic use case.
Fixes b/490175992