We host an Elasticsearch cluster on Elastic Cloud and call it from Dataflow (GCP). The job works fine in dev, but when we deploy to prod we see lots of connection timeouts on the client side.
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 570, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "main.py", line 159, in process
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 1617, in search
    body=body,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 390, in perform_request
    raise e
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 365, in perform_request
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 258, in perform_request
    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fe5d04e5690>: Failed to establish a new connection: [Errno 110] Connection timed out) caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fe5d04e5690>: Failed to establish a new connection: [Errno 110] Connection timed out)
I increased the timeout setting in the Elasticsearch client to 300s as shown below, but it didn't seem to help.
self.elasticsearch = Elasticsearch([es_host], http_auth=http_auth, timeout=300)
Looking at the deployment at https://cloud.elastic.co/deployments//metrics
CPU and memory usage are very low (below 10%) and search response time is on the order of 200ms.
What could be the bottleneck here, and how can we avoid such timeouts?
As seen in the log below, most requests fail with a connection timeout, while successful requests receive a response very quickly:
I tried to ssh into the VM where we experience the connection error. netstat showed about 60 ESTABLISHED connections to the Elasticsearch IP address. When I curl the Elasticsearch address from the VM, I can reproduce the timeout. I can curl other URLs fine from the VM, and I can curl Elasticsearch fine from my local machine, so the issue is only the connection between the VM and the Elasticsearch server.
Does Dataflow (Compute Engine) or Elasticsearch have a limit on the number of concurrent connections? I could not find any information online.
I did a little bit of research about the connector for Elasticsearch. There are two principles that you may want to apply to make your connector as efficient as possible.
Note: Setting a maximum number of workers, as suggested in the other answer, will probably not help as much (for now). Let's first improve utilization of your Beam/Elastic cluster resources; if we start hitting limits for either, we can consider restricting the number of workers. Right now, you can try to improve your connector.
Using bulk requests to external services
The code you provided issues an individual search request for every element coming into the DoFn. As you've noted, this works, but it makes your pipeline spend too much time waiting on external requests: one round trip per element, so the total wait for round trips is O(n).
Fortunately, the Elasticsearch client has an msearch method, which should allow you to perform searches in bulk. You can do something like this:
class PredictionFn(beam.DoFn):
    def __init__(self, ...):
        self.buffer = []
        ...

    def process(self, element):
        # Buffer elements and only hit Elasticsearch once per batch
        self.buffer.append(element)
        if len(self.buffer) > BATCH_SIZE:
            return self.flush()

    def flush(self):
        result = []
        # Perform the search requests for user ids
        user_ids = [uid for cid, did, uid in self.buffer]
        user_ids_request = self._build_uid_reqs(user_ids)
        resp = self.elasticsearch.msearch(body=user_ids_request)
        user_id_and_device_id_lists = []
        for r, elm in zip(resp['responses'], self.buffer):
            if len(r["hits"]["hits"]) == 0:
                continue
            # Get new device_id_list
            user_id_and_device_id_lists.append((elm[2],  # User ID
                                                device_id_list))
        device_id_lists = [elm[1] for elm in user_id_and_device_id_lists]
        device_ids_request = self._build_device_id_reqs(device_id_lists)
        resp = self.elasticsearch.msearch(body=device_ids_request)
        # Handle the result, output anything necessary
        # Start a fresh batch so elements are not searched twice
        self.buffer = []
        return result

    def _build_uid_reqs(self, uids):
        # Relying on this answer: https://stackoverflow.com/questions/28546253/how-to-create-request-body-for-python-elasticsearch-msearch/37187352
        res = []
        for uid in uids:
            res.append(json.dumps({'index': 'sessions'}))  # Request HEAD
            res.append(json.dumps({"query": {"match": {"userId": uid}}}))  # Request BODY
        return '\n'.join(res)
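To make the request/response pairing above concrete, here is a minimal standalone sketch that uses only the standard library. The helper names (build_msearch_body, pair_hits) and the canned responses are my own assumptions for illustration; in the real DoFn the responses would come from the client's msearch call, which returns them in the same order as the requests.

```python
import json


def build_msearch_body(index, user_ids):
    """Build an NDJSON body: one header line and one query line per user id."""
    lines = []
    for uid in user_ids:
        lines.append(json.dumps({"index": index}))  # request header
        lines.append(json.dumps({"query": {"match": {"userId": uid}}}))  # request body
    # Newline-delimited JSON: every line, including the last, ends with "\n".
    return "".join(line + "\n" for line in lines)


def pair_hits(buffer, responses):
    """Pair each buffered element with its hits, skipping empty results."""
    paired = []
    for element, response in zip(buffer, responses):
        hits = response["hits"]["hits"]
        if hits:
            paired.append((element, hits))
    return paired


# Hypothetical buffered elements shaped like (cid, did, uid).
buffer = [("c1", "d1", "u1"), ("c2", "d2", "u2")]
body = build_msearch_body("sessions", [uid for _cid, _did, uid in buffer])

# Canned responses standing in for msearch(body=body)["responses"].
fake_responses = [{"hits": {"hits": [{"_id": "1"}]}}, {"hits": {"hits": []}}]
matched = pair_hits(buffer, fake_responses)
```

One thing to keep in mind with any buffering DoFn: whatever is still in the buffer when a bundle ends also needs to be flushed (Beam's finish_bundle hook is the usual place), otherwise the last partial batch is silently dropped.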
Reusing the client as it's thread-safe
The Elasticsearch client is also thread safe!
So rather than creating a new one every time, you can do something like this:
class PredictionFn(beam.DoFn):
    CLIENT = None

    def init_elasticsearch(self):
        if PredictionFn.CLIENT is not None:
            return PredictionFn.CLIENT
        es_host = fetch_host()
        http_auth = fetch_auth()
        PredictionFn.CLIENT = Elasticsearch([es_host], http_auth=http_auth,
                                            timeout=300, sniff_on_connection_fail=True,
                                            retry_on_timeout=True, max_retries=2,
                                            maxsize=5)  # 5 connections per client
        return PredictionFn.CLIENT
This should ensure that you keep a single client for each worker, and you won't be creating so many connections to ElasticSearch - and thus not getting the rejection messages.
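If several threads in the same worker process can reach the initialization at the same time, the check-then-create step above can run twice. A minimal sketch of guarding it with a lock follows; SharedClient and fake_factory are hypothetical names, and the factory stands in for the real Elasticsearch(...) constructor call.

```python
import threading


class SharedClient:
    """Holds one lazily created client per process."""

    _client = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, factory):
        # Double-checked locking: only the first caller pays for creation,
        # and concurrent callers cannot create a second client.
        if cls._client is None:
            with cls._lock:
                if cls._client is None:
                    cls._client = factory()
        return cls._client


calls = []


def fake_factory():
    # Stand-in for Elasticsearch([es_host], http_auth=http_auth, ...).
    calls.append(1)
    return object()


first = SharedClient.get(fake_factory)
second = SharedClient.get(fake_factory)
```

Both calls return the same object and the factory runs only once, which is the property you want from a per-worker client.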
Let me know if these two help, or if we need to try further improvements!
EDIT: This was a red herring. CLOSE_WAIT is not related. I had the same issue again, and this time most of the connections were in ESTABLISHED status :/
While both of the answers below are insightful, I don't think they answer the question.
After some more investigation, I found that somehow elasticsearch-py (or urllib3), in combination with Dataflow, leaves connections in CLOSE_WAIT status. Once a connection reaches this status it gets stuck (the OS will not release the socket because it expects the application code to close it), so after the job has run for some time, all of the connections in my connection pool are in CLOSE_WAIT and I cannot make any new connections. If I don't use a connection pool and instead instantiate an Elasticsearch client for each ParDo, it just gets worse; somehow connections get stuck even faster.
I reported the issue here https://github.com/elastic/elasticsearch-py/issues/1459 but honestly the issue seems to be deeper in the stack, because I had a similar issue when I directly used the requests package's connection pool (which I believe also uses urllib3 under the hood).
Dataflow has no limit on the number of outgoing connections.
It uses a K8s cluster under the hood, and every Python thread lives in its own Docker container.
API calls to Elastic Cloud are rate-limited (take a look at the x-rate-limit-{interval,limit,remaining} fields in the response headers).
With Dataflow it is very easy to hit API rate limits if you run a lot of parallel jobs and/or Google Cloud scales up the nodes of your job to make it faster.
Possible workarounds in your Dataflow / Apache Beam job:
1 - (no code required) Play with the Dataflow execution parameters (https://cloud.google.com/dataflow/docs/guides/specifying-exec-params) to limit the number of concurrent processing threads.
The three parameters you need to tweak are:
max_num_workers: maximum number of worker instances (machines) running.
number_of_worker_harness_threads: by default, 1 thread per CPU your instance has.
machine_type: the instance type you will use.
2 - Implement rate limiting in your code. See "Timely (and stateful) processing with Apache Beam".
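For the second workaround, the simplest form of client-side rate limiting is to space out calls inside each worker process. The sketch below is only an illustration under that assumption: RateLimiter is a hypothetical helper, not a Beam or Elasticsearch API, and the clock and sleep functions are injected so the example runs instantly.

```python
import time


class RateLimiter:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def acquire(self):
        # Wait until the next allowed slot, then reserve the following one.
        wait = max(0.0, self.next_allowed - self.clock())
        if wait:
            self.sleep(wait)
        self.next_allowed = max(self.clock(), self.next_allowed) + self.min_interval
        return wait


# Fake clock so the demonstration does not really sleep.
now = [0.0]


def fake_clock():
    return now[0]


def fake_sleep(seconds):
    now[0] += seconds


limiter = RateLimiter(rate=10, clock=fake_clock, sleep=fake_sleep)
waits = [limiter.acquire() for _ in range(3)]
```

Call acquire() right before each request to the external service. Note that this limits a single process only, so the effective rate across the job is roughly the per-process rate multiplied by the number of processes, which is why it is usually combined with a cap on the number of workers.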
I have a couple different needs for asynchrony in my Python 3.6 Flask RESTful web service running under Gunicorn.
1) I'd like for one of my service's routes to be able to send an HTTP request to another HTTP service and, without waiting for the response, send a response back to the client that called my service.
Some example code:
#route
def fire_and_forget():
    # Send request to other server without waiting
    # for it to send a response.
    # Return my own response.
2) I'd like for another one of my service's routes to be able to send 2 or more asynchronous HTTP requests to other HTTP services and wait for them all to reply before my service sends a response.
Some example code:
#route
def combine_results():
    # Send request to service A
    # Send request to service B
    # Wait for both to return.
    # Do something with both responses
    # Return my own response.
Thanks in advance.
EDIT: I am trying to avoid the additional complexity of using a queue (e.g. celery).
You can use eventlet for the second use case. It's pretty easy to do:
import eventlet

providers = [EventfulPump(), MeetupPump()]
try:
    pool = eventlet.GreenPool()
    pile = eventlet.GreenPile(pool)
    for each in providers:
        pile.spawn(each.get, [], 5, loc)  # call the interface method
except (PumpFailure, PumpOverride):
    return abort(503)

results = []
for res in pile:
    results += res
You can wrap each of your api endpoints in a class that implements a "common interface" (in the above it is the get method) and you can make the calls in parallel. I just place them all in a list.
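If you would rather not depend on eventlet, the same "call in parallel, then wait for all" shape can be sketched with the standard library's thread pool. This swaps green threads for OS threads; call_service_a and call_service_b are hypothetical stand-ins for the real HTTP calls.

```python
from concurrent.futures import ThreadPoolExecutor


def call_service_a():
    # Hypothetical stand-in for an HTTP request to service A.
    return {"a": 1}


def call_service_b():
    # Hypothetical stand-in for an HTTP request to service B.
    return {"b": 2}


def combine_results():
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both calls start immediately and run concurrently.
        future_a = pool.submit(call_service_a)
        future_b = pool.submit(call_service_b)
        # result() blocks until the call finishes and re-raises its exception.
        result_a = future_a.result(timeout=5)
        result_b = future_b.result(timeout=5)
    return {**result_a, **result_b}


combined = combine_results()
```

The route handler then takes about as long as the slowest upstream call rather than the sum of both.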
Your other use case is harder to accomplish in straight python. At least a few years ago you would be forced to introduce some sort of worker process like celery to get something like that done. This question seems to cover all the issues:
Making an asynchronous task in Flask
Perhaps things have changed in flask land?
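For what it's worth, if weaker guarantees are acceptable, one way to get fire-and-forget without a queue is to hand the outgoing call to a background thread pool and return right away. This is only a sketch under that assumption: notify_other_service is a hypothetical stand-in for the outgoing HTTP request, and the Event exists purely so the example can observe completion.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One pool per process, created at import time.
_background = ThreadPoolExecutor(max_workers=4)

done = threading.Event()
delivered = []


def notify_other_service(payload):
    # Hypothetical stand-in for the outgoing HTTP request.
    delivered.append(payload)
    done.set()


def fire_and_forget(payload):
    # submit() returns immediately; the work continues on a pool thread.
    _background.submit(notify_other_service, payload)
    return "accepted", 202


response = fire_and_forget({"id": 1})
done.wait(timeout=5)  # only for the demonstration
```

The trade-off is the reason task queues exist: exceptions in the background call go unnoticed unless you inspect the returned future, and pending work is lost if the worker process is restarted before it runs.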
I've struggled for two days to understand how a REST API gateway should answer GET requests from browsers when the backend service runs on AMQP (without using WebSockets or polling).
I have successfully done RPC between AMQP services (with RabbitMQ's reply_to & correlation_id), but with a Flask HTTP request waiting for the reply I'm still lost.
gateway.py - Response Handler Inside The HTTP Handler, Times out
def products_get():
    def handler(ch=None, method=None, properties=None, body=None):
        if body:
            return body
        return False

    return_queue = 'products.get.return'
    broker.channel.queue_declare(return_queue)
    broker.channel.basic_consume(handler, return_queue)
    broker.publish(exchange='', routing_key='products.get', body='Request data', properties=pika.BasicProperties(reply_to=return_queue))
    now = time.time()  # for timeout. Not having this returns 'no content' immediately
    while time.time() < now + 1:
        if handler():
            return handler()
    return 'Time out'
POST/PUT can simply send the AMQP message, return 200/201/202 immediately, and let the service work at its own pace. A separate REST interface just for GET requests seems implausible, but I don't know what the other options are.
Regards
I think what you're asking is "how to perform asynchronous GET requests", and I reckon that the answer is: you can't, and you should not. It's bad practice or bad design, and it does not scale.
Why are you trying to get your GET response payload from AMQP?
If the payload (the content of the response) can be pulled from some DB, just pull it from there. That's called a synchronous request.
If the payload must be processed in some backend, send it away and don't have the requester wait for a response. You could assign some ID and have the requester ask again later (or collect a callback URL from the requester and have your backend POST the response once it's ready, which is a less common design).
EDIT:
So, given that you have to work with an AMQP-backed backend, I would do something a little more elaborate: spawn a thread or a process in your front end that constantly consumes from AMQP and stores the results locally or in some DB, and serve GET results based on the data that you stored. If the data isn't available yet, just return 404. Ideally you'll need to reshape your API: split it into "post" requests (that trigger work at the backend) and "get" requests (that return the results if they're available).
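A minimal sketch of that shape, using only the standard library: a queue.Queue stands in for the AMQP reply queue, and the names (results, consume_forever, get_result) are my own assumptions rather than pika or Flask APIs.

```python
import queue
import threading

results = {}
results_lock = threading.Lock()
incoming = queue.Queue()  # stand-in for the AMQP reply queue


def consume_forever():
    # Background consumer: store every reply under its correlation id.
    while True:
        message = incoming.get()
        if message is None:  # sentinel used to stop the demonstration
            break
        correlation_id, body = message
        with results_lock:
            results[correlation_id] = body


def get_result(correlation_id):
    """Return (body, status) the way a GET handler might."""
    with results_lock:
        if correlation_id in results:
            return results[correlation_id], 200
    return "Not ready", 404


consumer = threading.Thread(target=consume_forever, daemon=True)
consumer.start()

before = get_result("req-1")  # nothing stored yet
incoming.put(("req-1", "product list"))
incoming.put(None)
consumer.join(timeout=5)
after = get_result("req-1")  # the reply has been stored
```

The GET handler never blocks on the broker; it only reads what the consumer has already stored, so the client can poll with the ID it received from the earlier POST.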
I have a two-part standalone Python application: a publisher and a subscriber.
The publisher generates fake JSON device objects and publishes them on a channel called "devices." And as you would guess, the subscriber subscribes to the channel "devices."
(Additionally, given optional command line arguments, the publisher or subscriber can write JSON objects to a socket or a local directory, where an Apache Spark Streaming context picks up the JSON objects and processes them. For now, this is not in the picture, as it's optional.)
However, my problem is that when my subscriber runs, after the publisher has finished, I get "ERROR: Forbidden".
Here are the respective python code snippets for the publisher:
pubnub = Pubnub(publish_key="my_key", subscribe_key="my_key")
....
pubnub.publish(ch, device_msg)
In the subscriber python file I have the following init code:
def receive(message, channel):
    json.dumps(message)

def on_error(message):
    print ("ERROR: " + str(message))
....
pubnub = Pubnub(publish_key="my_keys", subscribe_key="my_keys")
# subscribe to a channel and invoke the appropriate callback when a message arrives on that
# channel
#
pubnub.subscribe(channels=ch, callback=receive, error=on_error)
pubnub.start()
The publisher, when run, seems to publish all 120 JSON messages in a loop, whereas the subscriber, when run, fails with the following error message:
ERROR: Forbidden
My attempts to use "demo" keys have made no difference. Note that I'm using a trial account for PubNub.
Since this is one of my first apps using its API, has anyone seen this problem before? Surely something very obvious or trivial is amiss here.
Answer was that there was a copy/paste error with the pub/sub keys.
I'm using a service that publishes messages to Amazon SQS, but my messages come out garbled when I do the following in Python, via boto:
queue = SQS_CONNECTION.get_queue(QUEUE_NAME)
messages = queue.get_messages()
The messages are returned as strings of what appears to be base64-encoded data.
Following this discussion https://groups.google.com/forum/#!topic/boto-users/Pv5fUc_RdVU ,
the solution is as follows:
from boto.sqs.message import RawMessage
queue = SQS_CONNECTION.get_queue(QUEUE_NAME)
queue.set_message_class(RawMessage)
messages = queue.get_messages()