Description
We are using F1 instances (2 cores). In Python 3, with no meaningful code changes, we are seeing 1.5-2x higher median latency across the app, an even greater increase in created/active/billed instances with the same App Engine settings, and some new errors and warnings we never saw before. For example, we now see this a lot with datastore operations:
...
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/datastore/datastore_query.py", line 2949, in __query_result_hook
self._batch_shared.conn.check_rpc_success(rpc)
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/datastore/datastore_rpc.py", line 1365, in check_rpc_success
rpc.check_success()
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/api/apiproxy_stub_map.py", line 614, in check_success
self.__rpc.CheckSuccess()
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/api/apiproxy_rpc.py", line 149, in CheckSuccess
raise self.exception
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/runtime/default_api_stub.py", line 276, in _CaptureTrace
f(**kwargs)
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/runtime/default_api_stub.py", line 261, in _SendRequest
raise self._ErrorException(*_DEFAULT_EXCEPTION)
google.appengine.runtime.apiproxy_errors.RPCFailedError: The remote RPC to the application server failed for call datastore_v3.RunQuery().
^^ (similar traces also occur with datastore_v3.Get())
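We could wrap the affected calls in a crude retry, something like the sketch below (my own code, not anything from the SDK; the backoff values and the Order model are made up for illustration), but that only masks individual RPCFailedErrors rather than explaining why they started happening:

import functools
import logging
import time

from google.appengine.ext import ndb
from google.appengine.runtime import apiproxy_errors


def retry_rpc(retries=3, initial_delay=0.05):
    """Retries a call that fails with RPCFailedError, with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except apiproxy_errors.RPCFailedError:
                    if attempt == retries:
                        raise
                    logging.warning(
                        "RPC failed (attempt %d), retrying in %.2fs", attempt + 1, delay)
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator


class Order(ndb.Model):
    # Made-up model, standing in for one of our real queries.
    status = ndb.StringProperty()


@retry_rpc()
def fetch_open_orders():
    return Order.query(Order.status == "open").fetch(100)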
We also see tons of these warnings, not related to any outgoing web calls our app is making (I know it's just a warning and might be harmless, but the extremely high volume in our logs is worrying):
Connection pool is full, discarding connection: appengine.googleapis.internal. Connection pool size: 10
I've been tweaking app.yaml in various ways, but I can't find a configuration that solves or even significantly reduces either issue. Our app is doing exactly the same operations as before, under the same load. What can we try? This isn't sustainable: we can't afford this jump in cost, and the app is performing much worse now. Our unit tests run in about 30% less time than under Python 2, so I didn't expect this; it seems like it could be I/O related?
Not sure if this is relevant, but I noticed this in google/appengine/api/apiproxy_rpc.py:
_MAX_CONCURRENT_API_CALLS = 100
_THREAD_POOL = futures.ThreadPoolExecutor(_MAX_CONCURRENT_API_CALLS)
But then in the standard library's concurrent/futures/thread.py we see this comment in the constructor:
class ThreadPoolExecutor(_base.Executor):

    # Used to assign unique thread names when thread_name_prefix is not supplied.
    _counter = itertools.count().__next__

    def __init__(self, max_workers=None, thread_name_prefix='',
                 initializer=None, initargs=()):
        """Initializes a new ThreadPoolExecutor instance.

        Args:
            max_workers: The maximum number of threads that can be used to
                execute the given calls.
            thread_name_prefix: An optional name prefix to give our threads.
            initializer: A callable used to initialize worker threads.
            initargs: A tuple of arguments to pass to the initializer.
        """
        if max_workers is None:
            # ThreadPoolExecutor is often used to:
            # * CPU bound task which releases GIL
            # * I/O bound task (which releases GIL, of course)
            #
            # We use cpu_count + 4 for both types of tasks.
            # But we limit it to 32 to avoid consuming surprisingly large resource
            # on many core machine.
            max_workers = min(32, (os.cpu_count() or 1) + 4)
So on these 2-core instances the stdlib default would be 6 workers, but apiproxy_rpc.py sets it to 100 regardless of the actual number of CPU cores available, and the stdlib default is capped at 32 precisely to avoid "consuming surprisingly large resource" on many-core machines. What led to the decision to use 100 here? (Edit: is this maybe OK because RPC calls aren't CPU-bound? Should the urllib3 connection pool size be increased to match it?)
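To illustrate why I'm asking: the standalone sketch below (not App Engine code; the local server and the maxsize=10 / 100-worker numbers are my own assumptions, chosen to mirror _MAX_CONCURRENT_API_CALLS and the "Connection pool size: 10" warning above) reproduces the same urllib3 warning whenever far more worker threads than pool slots share one connection pool:

import http.server
import logging
import threading
from concurrent import futures

import urllib3

logging.basicConfig(level=logging.WARNING)


class QuietHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive, so connections are reused

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # suppress per-request stderr noise


# Tiny local server so the demo needs no network access.
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# block=False is urllib3's default: when more than `maxsize` connections are
# checked out at once, the surplus ones are discarded on return and the
# "Connection pool is full, discarding connection" warning is logged.
pool = urllib3.PoolManager(maxsize=10, block=False)

with futures.ThreadPoolExecutor(max_workers=100) as executor:
    list(executor.map(lambda _: pool.request("GET", url).status, range(500)))

server.shutdown()

If 100 concurrent workers is intentional because the RPCs are I/O-bound, it seems like the pool's maxsize should be raised toward that number rather than left at 10, which is the second part of my question above.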