Elasticsearch/Dataflow - connection timeout after ~60 concurrent connections - Python

We host an Elasticsearch cluster on Elastic Cloud and call it from Dataflow (GCP). The job works fine in dev, but when we deploy to prod we see lots of connection timeouts on the client side.
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 570, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "main.py", line 159, in process
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 1617, in search
    body=body,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 390, in perform_request
    raise e
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 365, in perform_request
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 258, in perform_request
    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fe5d04e5690>: Failed to establish a new connection: [Errno 110] Connection timed out) caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fe5d04e5690>: Failed to establish a new connection: [Errno 110] Connection timed out)
I increased the timeout setting in the Elasticsearch client to 300s, as shown below, but it didn't seem to help.
self.elasticsearch = Elasticsearch([es_host], http_auth=http_auth, timeout=300)
Looking at the deployment at https://cloud.elastic.co/deployments//metrics
CPU and memory usage are very low (below 10%), and search response time is on the order of 200 ms.
What could be the bottleneck here, and how can we avoid such timeouts?
In the logs, most requests fail with a connection timeout, while successful requests receive a response very quickly.
I tried to SSH into the VM where we experience the connection error. netstat showed about 60 ESTABLISHED connections to the Elasticsearch IP address. When I curl the Elasticsearch address from the VM, I can reproduce the timeout. I can curl other URLs fine. I can also curl Elasticsearch fine from my local machine, so the issue is only in the connection between the VM and the Elasticsearch server.
Does Dataflow (Compute Engine) or Elasticsearch have a limit on the number of concurrent connections? I could not find any information online.

I did a little bit of research about the connector for Elasticsearch. There are two principles you may want to apply to make your connector as efficient as possible.
Note: Setting a maximum number of workers, as suggested in the other answer, will probably not help much for now. Let's first improve utilization of your Beam/Elastic cluster resources; if we start hitting limits for either, we can consider restricting the number of workers. Right now, you can try to improve your connector.
Using bulk requests to external services
The code you provided issues an individual search request for every element coming into the DoFn. As you've noted, this works, but it makes your pipeline spend too much time waiting on an external request for each element, so the total wait for round trips is O(n).
Fortunately, the Elasticsearch client has an msearch method, which lets you perform searches in bulk. You can do something like this:
class PredictionFn(beam.DoFn):
    def __init__(self, ...):
        self.buffer = []
        ...
    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) > BATCH_SIZE:
            return self.flush()
    def flush(self):
        result = []
        # Perform the search requests for user ids
        user_ids = [uid for cid, did, uid in self.buffer]
        user_ids_request = self._build_uid_reqs(user_ids)
        resp = es.msearch(body=user_ids_request)
        user_id_and_device_id_lists = []
        for r, elm in zip(resp['responses'], self.buffer):
            if len(r["hits"]["hits"]) == 0:
                continue
            # Get new device_id_list
            user_id_and_device_id_lists.append((elm[2],  # User ID
                                                device_id_list))
        device_id_lists = [elm[1] for elm in user_id_and_device_id_lists]
        device_ids_request = self._build_device_id_reqs(device_id_lists)
        resp = es.msearch(body=device_ids_request)
        resp = self.elasticsearch.search(index="sessions", body={"query": {"match": {"userId": user_id }}})
        # Handle the result, output anything necessary
    def _build_uid_reqs(self, uids):
        # Relying on this answer: https://stackoverflow.com/questions/28546253/how-to-create-request-body-for-python-elasticsearch-msearch/37187352
        res = []
        for uid in uids:
            res.append(json.dumps({'index': 'sessions'}))  # Request HEAD
            res.append(json.dumps({"query": {"match": {"userId": uid }}}))  # Request BODY
        return '\n'.join(res)
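The buffering idea can be tried out without Beam or Elasticsearch installed. What follows is a minimal sketch, not the code from this answer: build_msearch_body and Batcher are hypothetical names, send stands in for whatever callable wraps the msearch call, and the "sessions" index and userId field are taken from the snippet above. One thing it makes explicit is that a buffer which only flushes when full leaves a partial batch behind, so a final flush is needed (in a Beam DoFn, finish_bundle is the usual place for that).

```python
import json


def build_msearch_body(user_ids, index="sessions"):
    """Build a newline-delimited multi-search body: a header line and a query line per user."""
    lines = []
    for uid in user_ids:
        lines.append(json.dumps({"index": index}))                       # request header
        lines.append(json.dumps({"query": {"match": {"userId": uid}}}))  # request body
    # A non-empty multi-search body is terminated by a newline.
    return "\n".join(lines) + "\n" if lines else ""


class Batcher:
    """Collect elements and hand them to `send` in groups of at most `batch_size`."""

    def __init__(self, send, batch_size):
        self.send = send              # hypothetical callable, e.g. one wrapping es.msearch
        self.batch_size = batch_size
        self.buffer = []

    def add(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Call this once more at the end so the last partial batch is not lost.
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
```

With a batch size of 100, for example, n elements cost roughly n/100 round trips instead of n.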
Reusing the client as it's thread-safe
The Elasticsearch client is also thread safe!
So rather than creating a new one every time, you can do something like this:
class PredictionFn(beam.DoFn):
    CLIENT = None
    def init_elasticsearch(self):
        if PredictionFn.CLIENT is not None:
            return PredictionFn.CLIENT
        es_host = fetch_host()
        http_auth = fetch_auth()
        PredictionFn.CLIENT = Elasticsearch([es_host], http_auth=http_auth,
                                            timeout=300, sniff_on_connection_fail=True,
                                            retry_on_timeout=True, max_retries=2,
                                            maxsize=5)  # 5 connections per client
        return PredictionFn.CLIENT
This should ensure that you keep a single client per worker, so you won't create so many connections to Elasticsearch and should stop getting the rejection messages.
Let me know if these two help, or if we need to try further improvements!
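One caveat with a shared client: the "is it already created?" check is not guarded, so if several threads in the same worker process reach it at once, more than one client could be created. Below is a minimal sketch of a guarded variant, assuming nothing about Beam itself. get_client and factory are hypothetical names; factory would be something like a function that builds the Elasticsearch client with the arguments shown above.

```python
import threading

_client = None
_client_lock = threading.Lock()


def get_client(factory):
    """Return the process-wide client, creating it with `factory` on first use."""
    global _client
    if _client is None:
        with _client_lock:
            # Re-check inside the lock so only one thread runs the factory.
            if _client is None:
                _client = factory()
    return _client
```

Every thread that calls get_client then shares one client, and therefore one connection pool, per process.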

EDIT: This was a red herring. CLOSE_WAIT is not related. I had the same issue again, and most of the connections are now in ESTABLISHED status :/
While both of the answers below are insightful, I don't think they answer the question.
After some more investigation, I found out that elasticsearch-py (or urllib3), in combination with Dataflow, somehow leaves connections in CLOSE_WAIT status. Once a connection is in this state it gets stuck (the OS will not release the socket because it expects the application code to close it), so after the job has been running for some time, all the connections in my connection pool are in CLOSE_WAIT and I cannot make any new connections. If I don't use a connection pool and instead instantiate an Elasticsearch client for each ParDo, it just gets worse; somehow connections get stuck even faster.
I reported the issue here: https://github.com/elastic/elasticsearch-py/issues/1459. Honestly, though, the issue seems to sit deeper in the stack, because I had a similar issue when I directly used the requests package's connection pool (which I believe also uses urllib3 under the hood).
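To watch how the connection states evolve on a worker VM, the netstat output can be summarized per state. This is a small sketch under assumptions: it expects the Linux `netstat -tn` column layout (proto, recv-q, send-q, local address, foreign address, state), and count_tcp_states is a hypothetical helper, not part of any library.

```python
from collections import Counter


def count_tcp_states(netstat_output, remote_ip):
    """Count connection states for sockets whose foreign address is `remote_ip`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign_host = fields[4].rsplit(":", 1)[0]
        if foreign_host == remote_ip:
            states[fields[5]] += 1
    return states
```

Running it periodically against the Elasticsearch IP shows whether sockets pile up in ESTABLISHED, CLOSE_WAIT, or some other state as the job progresses.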

Dataflow has no limit on the number of outgoing connections.
It uses a K8s cluster under the hood, and every Python thread lives in its own Docker container.
API calls to Elastic Cloud are rate-limited (take a look at the x-rate-limit-{interval,limit,remaining} fields in the response headers).
With Dataflow it is very easy to hit API rate limits if you run a lot of parallel jobs and/or Google Cloud scales up the nodes of your job to make it faster.
Possible workarounds in your Dataflow / Apache Beam job:
1 - (no code required) Play with the Dataflow execution parameters (https://cloud.google.com/dataflow/docs/guides/specifying-exec-params) to limit the number of concurrent processing threads.
The three parameters you need to tweak are:
max_num_workers: maximum number of worker instances (machines) running.
number_of_worker_harness_threads: by default, 1 thread per CPU your instance has.
machine_type: the instance type you will use.
2 - Implement rate limiting in your code. See "Timely (and stateful) processing with Apache Beam".
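As a rough illustration of option 2, here is a client-side token bucket. Note that this is a different, simpler technique than the stateful-processing approach from the Beam article: it only throttles calls within one process, so the job-wide rate is approximately the per-process rate multiplied by the number of worker processes. TokenBucket is a hypothetical class, and the sketch is not thread-safe as written.

```python
import time


class TokenBucket:
    """Allow at most `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock                # injectable so the behavior can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check try_acquire() before each request and sleep briefly when it returns False.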

Related

Flask Cloudant slow response time

I am creating a Flask application that is connecting to a Cloudant database using the python cloudant library.
My response time when I just add the connect statement (with no queries) can be anywhere from 0.4s to 12s. My connect statement looks like this:
client = Cloudant(USERNAME, PASSWORD, url=URL, connect=True)
When I remove the connection code, my response time is very low.
I have run a profiler on my system and it shows that the increase in response time is due to reading an ssl socket.
I have also tried using the default example from IBM Bluemix Github and got similar results for response time.
I am running my Flask application using the built in development web server. I have tried connecting to the database before every request and I have tried having a single connection that gets reused. Could this delay be due to my local machine? And what would cause it to be quick some times and not others? Other posts have suggested issues with IPv6 or DNS, but I do not think that is the case.
With API calls like:
ddoc = DesignDocument(g.db, '_design/docs')
g.myview = View(ddoc, 'my-view')
g.myview(key=[somekey])['rows']
I have already created the views, and they are indexed by the appropriate key, so the slowness is not due to indexing.
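Try using this code to connect to your Cloudant database:
from cloudant.client import Cloudant
def conn(user, pwd, db, **kwargs):
    # The account name defaults to the user name unless 'host' is passed
    client = Cloudant(user, pwd, account=kwargs.get('host', user))
    client.connect()
    # Open the named database on the connected client
    database = client[db]
    return database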

Flask RESTful API request, Broken pipe [Errno 32] !

I'm new to web development and I'm trying to create a RESTful web service using the Flask micro-framework.
Here is my code:
import json
from bson import json_util
from flask import Flask
from pymongo import MongoClient
app = Flask(__name__)
client = MongoClient()
db = client.markets
def toJson(data):
    return json.dumps(data, default=json_util.default)
@app.route('/', methods=['GET'])
def get_tasks():
    cursor = db.europe.find()
    items = []
    for i in cursor:
        items.append(i)
    return toJson(items)
When I send the request from my browser, it keeps waiting for the server and nothing is returned.
Eventually, the Flask server running in the terminal gives me: [Errno 32] Broken pipe.
My collection has 1.5 million entries, each with about 20 attributes. Could it be because the request is too large?
Thanks in advance.
The Broken pipe indicates that the other end of a socket or pipe that your flask process wants to talk to has died. Considering that you are interacting with the database it's very likely that the database has terminated the connection or the connection has died for other reasons.
Probably you should be analyzing the query that you run on your db, because the code itself doesn't seem to have an obvious problem.
Try running the query on your MongoDB manually and see what happens. Does the query return successfully?
You mention that it takes a long time before you get that error. Could it be that some indexes are missing or not properly used in your schema, which makes the query execute very slowly, so that after a long wait it reaches a timeout (e.g. maxTimeMS)?
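If the size of the response turns out to be part of the problem, one option is to avoid building the whole list and serializing it in a single step. The sketch below is only an illustration under that assumption: iter_json_array is a hypothetical helper that yields a JSON array piece by piece from any iterable (such as a cursor), and default would be json_util.default for MongoDB documents.

```python
import json


def iter_json_array(docs, default=None):
    """Yield the pieces of a JSON array, serializing one document at a time."""
    yield "["
    first = True
    for doc in docs:
        if not first:
            yield ","
        yield json.dumps(doc, default=default)
        first = False
    yield "]"
```

Flask can stream a generator like this by wrapping it in a Response object, so the client starts receiving data before the cursor is exhausted.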

Azure ML with python - (SSLError(SSLError('The write operation timed out',),),) when doing a table storage entity query

Hi, I am trying to start an Azure ML algorithm by executing a Python script that queries data from a table storage account. I do it using this:
entities_Azure=table_session.query_entities(table_name=table_name,
    filter="PartitionKey eq '" + partitionKey + "'",
    select='PartitionKey,RowKey,Timestamp,value',
    next_partition_key = next_pk,
    next_row_key = next_rk, top=1000)
I pass in the variables needed when calling the function that this bit of code sits in, and I include the function by including a zip file in Azure ML.
I assume the error is due to the query taking too long, or something like that, but it has to take a long time because I might have to query loads of data.... I looked at this SO post Windows Azure Storage Table connection timed out which is a similar issue I think with regard to hitting specified thresholds for these queries, but I don't know how I'd be able to avoid it. The run time of the program is only about 1.5 mins before timing out..
Any ideas as to why this is happening and how I might be able to solve it?
Edit:
As per Peter Pan - MSFT's advice I ran a query that was more specific:
entities_Azure=table_service.query_entities(table_name='#######',select='PartitionKey,RowKey,Timestamp,value', next_partition_key = None, next_row_key = None, top=2)
This returned the following error log:
Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
data:text/plain,Caught exception while executing function: Traceback (most recent call last):
  File "C:\server\invokepy.py", line 169, in batch
    odfs = mod.azureml_main(*idfs)
  File "C:\temp\azuremod.py", line 61, in azureml_main
    entities_Azure=table_service.query_entities(table_name='######',select='PartitionKey,RowKey,Timestamp,value', next_partition_key = None, next_row_key = None, top=2)
  File "./Script Bundle\azure\storage\table\tableservice.py", line 421, in query_entities
    response = self._perform_request(request)
  File "./Script Bundle\azure\storage\storageclient.py", line 171, in _perform_request
    resp = self._filter(request)
  File "./Script Bundle\azure\storage\table\tableservice.py", line 664, in _perform_request_worker
    return self._httpclient.perform_request(request)
  File "./Script Bundle\azure\storage\_http\httpclient.py", line 181, in perform_request
    self.send_request_body(connection, request.body)
  File "./Script Bundle\azure\storage\_http\httpclient.py", line 145, in send_request_body
    connection.send(None)
  File "./Script Bundle\azure\storage\_http\requestsclient.py", line 81, in send
    self.response = self.session.request(self.method, self.uri, data=request_body, headers=self.headers, timeout=self.timeout)
  File "C:\pyhome\lib\site-packages\requests\sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\pyhome\lib\site-packages\requests\sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "C:\pyhome\lib\site-packages\requests\adapters.py", line 382, in send
    raise SSLError(e, request=request)
SSLError: The write operation timed out
---------- End of error message from Python interpreter
---------- Start time: UTC 11/18/2015 11:39:32 End time: UTC 11/18/2015 11:40:53
Hopefully this brings more insight to the situation!
I filled a table storage with data I generated myself and tried to reproduce your issue with a query like yours, but could not.
I found a table storage query timeout issue described for the REST API (the Azure Storage SDK for Python wraps the REST API). The page (https://msdn.microsoft.com/en-us/library/azure/dd894042.aspx) "Query Timeout and Pagination" for the Table Service REST API says:
A query against the Table service may return a maximum of 1,000 items at one time and may execute for a maximum of five seconds. If the result set contains more than 1,000 items, if the query did not complete within five seconds, or if the query crosses the partition boundary, the response includes headers which provide the developer with continuation tokens to use in order to resume the query at the next item in the result set. Continuation token headers may be returned for a Query Tables operation or a Query Entities operation.
Note that the total time allotted to the request for scheduling and processing the query is 30 seconds, including the five seconds for query execution.
It is possible for a query to return no results but to still return a continuation header.
I think the issue was caused by hitting these specified thresholds.
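The continuation mechanism described in the quoted documentation amounts to a loop that keeps requesting pages until no token comes back. Here is a minimal sketch of that loop. fetch_page is a hypothetical wrapper, not an SDK function: it is assumed to call query_entities with the given next_partition_key and next_row_key and to return the page together with the next pair of tokens. How the tokens are exposed depends on the SDK version and is not shown here.

```python
def fetch_all(fetch_page):
    """Collect every entity by following continuation tokens until none is returned."""
    entities = []
    next_pk, next_rk = None, None
    while True:
        page, next_pk, next_rk = fetch_page(next_pk, next_rk)
        entities.extend(page)
        # A page may be empty and still carry a continuation token,
        # so only the absence of tokens ends the loop.
        if next_pk is None and next_rk is None:
            break
    return entities
```

Because each request stays within the per-query limits, a large result set is retrieved as many short queries rather than one long one.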
Also, I used the Reader module in Data Input and Output, with Azure Table as the data source, and read 1000 entities successfully and quickly in an Azure ML Studio experiment.
For this scenario, I suggest using a more specific query filter for your table storage, such as the following:
entities_Azure=table_session.query_entities(table_name=table_name,
    filter="PartitionKey eq '" + partitionKey + "' and RowKey eq '" + rowkey + "'",
    select='PartitionKey,RowKey,Timestamp,value',
    next_partition_key = next_pk,
    next_row_key = next_rk, top=1000)
This code lets us judge whether the problem is a connection issue or a thresholds issue.
If you have any concerns, please feel free to let me know.
I ran into a very similar problem in Access Azure blog storage from within an Azure ML experiment. I didn't realize they were similar when I first posted. However, it became very clear as the debugging and help continued.
Bottom Line: SSLError with Timeout occurs when azure.storage.* is accessed over HTTPS/SSL. If you change the creation of the 'TableService' to force the use of HTTP (protocol='http') the timeout errors will cease.
table_service = TableService(account_name='myaccount', account_key='mykey',protocol='http')
The full analysis can be found at the StackOverflow post above. However, I saw this and felt I should mention it directly here to help with searching. The fix applies to azure.storage.table, azure.storage.blob, azure.storage.page and azure.storage.queue.
PS. Yes, I know that using HTTP isn't optimal, however, you are running everything here within Azure. And when you leave Azure ML (or Azure App Service) you can switch back to HTTPS.

Python suds "RuntimeError: maximum recursion depth exceeded while calling a Python object"

I'm trying to consume a SOAP web service using Python suds but I am getting the error "RuntimeError: maximum recursion depth exceeded while calling a Python object".
According to the trace, there is infinite recursion at "suds/bindings/multiref.py", line 69.
The web service I'm trying to access is http://www.reactome.org:8080/caBIOWebApp/services/caBIOService?wsdl.
The method I'm trying to access is loadPathwayForId.
Here's the part of my code that consumes the web service:
from suds.client import Client
client = Client('http://www.reactome.org:8080/caBIOWebApp/services/caBIOService?wsdl')
pathway = client.service.loadPathwayForId(2470946)
I'm not sure what is responsible for the infinite recursion. I tried to look up this problem, and there have been reports of issues with suds and infinite recursion, but the traces are different from mine (the recursive code is different), so I suspect my problem has other origins.
The full trace:
  File "C:\Python27\lib\suds\bindings\multiref.py", line 69, in update
    self.update(c)
  File "C:\Python27\lib\suds\bindings\multiref.py", line 69, in update
    self.update(c)
  ...
  File "C:\Python27\lib\suds\bindings\multiref.py", line 69, in update
    self.update(c)
  File "C:\Python27\lib\suds\bindings\multiref.py", line 69, in update
    self.update(c)
  File "C:\Python27\lib\suds\bindings\multiref.py", line 67, in update
    self.replace_references(node)
  File "C:\Python27\lib\suds\bindings\multiref.py", line 80, in replace_references
    href = node.getAttribute('href')
  File "C:\Python27\lib\suds\sax\element.py", line 404, in getAttribute
    prefix, name = splitPrefix(name)
  File "C:\Python27\lib\suds\sax\__init__.py", line 49, in splitPrefix
    if isinstance(name, basestring) \
RuntimeError: maximum recursion depth exceeded while calling a Python object
Thanks in advance for the help!
After more testing, it seems that (unfortunately) suds has trouble interpreting Java Collection objects serialized as XML. I ended up using SOAPpy instead to avoid this issue. If someone can suggest a fix, that would be awesome! I really like suds for its other merits over SOAPpy.
I tried lots of suds versions and forks, and finally found one that works with proxies, HTTPS and authenticated services. You can find it here:
https://github.com/unomena/suds
Also, here is example code showing simple usage:
from suds.client import Client
# SOAP WSDL url
url = 'https://example.com/ws/service?WSDL'
# SOAP service username and password for authentication, if needed
username = 'user_name'
password = 'pass_word'
# local intranet proxy definition to get to the internet, if needed
proxy = dict(http='http://username:password@localproxy:8080',
             https='http://username:password@localproxy:8080')
# unauthenticated, no proxy
# client = Client(url)
# use a proxy to connect to the service
# client = Client(url, proxy=proxy)
# no proxy, authenticated service
# client = Client(url, username=username, password=password)
# use a proxy to connect to an authenticated service
client = Client(url, proxy=proxy, username=username, password=password)
print(client)

DNS query using Google App Engine socket

I'm trying to use the new socket support for Google App Engine in order to perform some DNS queries. I'm using dnspython to perform the query, and the code works fine outside GAE.
The code is the following:
class DnsQuery(webapp2.RequestHandler):
    def get(self):
        domain = self.request.get('domain')
        logging.info("Test Query for "+domain)
        answers = dns.resolver.query(domain, 'TXT', tcp=True)
        logging.info("DNS OK")
        for rdata in answers:
            rc = str(rdata.exchange).lower()
            logging.info("Record "+rc)
When I run in GAE I get the following error:
  File "/base/data/home/apps/s~/one.366576281491296772/main.py", line 37, in post
    return self.get()
  File "/base/data/home/apps/s~/one.366576281491296772/main.py", line 41, in get
    answers = dns.resolver.query(domain, 'TXT', tcp=True)
  File "/base/data/home/apps/s~/one.366576281491296772/dns/resolver.py", line 976, in query
    raise_on_no_answer, source_port)
  File "/base/data/home/apps/s~/one.366576281491296772/dns/resolver.py", line 821, in query
    timeout = self._compute_timeout(start)
  File "/base/data/home/apps/s~/one.366576281491296772/dns/resolver.py", line 735, in _compute_timeout
    raise Timeout
This is raised by dnspython when no answer is returned within the time limit. I've raised the time limit to 60 seconds, and DnsQuery is a task, but I still get the same error.
Is there any limitation in the Google App Engine socket implementation that prevents the execution of DNS requests?
This is a bug and will be fixed ASAP.
As a workaround, pass in the source='' argument to dns.resolver.query.
tcp=True is not necessary.
No. There is no limit on UDP ports (only on SMTP ports over TCP).
It is possible there is an issue with the socket service routing. Please file an issue with the App Engine issue tracker: https://code.google.com/p/googleappengine/issues/list
dnspython uses socket. However, socket is only available in paid apps.
