Elasticsearch Python client Reindex Timed Out - python

I'm trying to reindex with the Elasticsearch Python client, using https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex, but I keep getting the following exception: elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeout
The stack trace of the error is:
Traceback (most recent call last):
File "~/es_test.py", line 33, in <module>
main()
File "~/es_test.py", line 30, in main
target_index='users-2')
File "~/ENV/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 306, in reindex
chunk_size=chunk_size, **kwargs)
File "~/ENV/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 182, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "~/ENV/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 124, in streaming_bulk
raise e
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeout(HTTPSConnectionPool(host='myhost', port=9243): Read timed out. (read timeout=10))
Is there any way to prevent this exception besides increasing the timeout?
EDIT: Python code

from requests.auth import HTTPBasicAuth
from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers

es = Elasticsearch(connection_class=RequestsHttpConnection,
                   host='myhost',
                   port=9243,
                   http_auth=HTTPBasicAuth(username, password),
                   use_ssl=True,
                   verify_certs=True,
                   timeout=600)

helpers.reindex(es, source_index=old_index, target_index=new_index)

I had been suffering from this issue for a couple of days. Changing the request_timeout parameter to 30 (i.e. 30 seconds) didn't work.
Finally I had to edit the streaming_bulk and reindex helpers inside elasticsearch-py:
Change the chunk_size parameter from the default 500 (i.e. 500 documents per batch) to a smaller number of documents per batch. I changed mine to 50, which worked fine for me. No more read timeout errors.
def streaming_bulk(client, actions, chunk_size=50, raise_on_error=True,
                   expand_action_callback=expand_action, raise_on_exception=True,
                   **kwargs):

def reindex(client, source_index, target_index, query=None, target_client=None,
            chunk_size=50, scroll='5m', scan_kwargs={}, bulk_kwargs={}):
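Depending on your elasticsearch-py version, you may not need to patch the library at all: as the signature above shows, reindex already accepts a chunk_size keyword argument, so the smaller batch size can be passed from the caller. A minimal sketch, reusing the es client and index names from the question:

from elasticsearch import helpers

# Smaller batches keep each bulk request short enough to finish
# within the read timeout.
helpers.reindex(es, source_index=old_index, target_index=new_index,
                chunk_size=50)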

It may be happening because of an OutOfMemoryError for Java heap space, which means you are not giving Elasticsearch enough memory for what you want to do.
Take a look at /var/log/elasticsearch and check whether there is an exception like that.
https://github.com/elastic/elasticsearch/issues/2636

Related

Using pysolr, getting 400 error when trying to add data to Solr

I'm using pysolr with Python 3.9 and trying to add a document to Solr, but I get the 400 error below even with only one or two fields.
The fields that I'm using here are dynamic fields.
There is no issue connecting to Solr.
#!/usr/bin/python
import pysolr

solr = pysolr.Solr('http://localhost:8080/solr/', always_commit=True)
if solr.ping():
    print('connection successful')

docs = [{'id': '123c', 's_chan_name': 'TV-201'}]
solr.add(docs)
res = solr.search('123c')
print(res)
I'm getting the error below:
connection successful
Traceback (most recent call last):
File "/tmp/test-solr.py", line 8, in <module>
solr.add(docs)
File "/usr/local/lib/python3.9/site-packages/pysolr.py", line 1042, in add
return self._update( File "/usr/local/lib/python3.9/site-packages/pysolr.py", line 568, in
_update
return self._send_request( File "/usr/local/lib/python3.9/site-packages/pysolr.py", line 463, in
_send_request
raise SolrError(error_message % (resp.status_code, solr_message))
pysolr.SolrError: Solr responded with an error (HTTP 400): [Reason: None]
<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 400 Unexpected character &apos;[&apos; (code 91) in prolog;
expected &apos;<&apos; at [row,col {unknown-source}]: [1,1]</title>
</head><body><h2>HTTP ERROR 400</h2><p>Problem accessing /solr/update/.
Reason:<pre> Unexpected character &apos;[&apos; (code 91) in prolog;
expected &apos;<&apos; at [row,col {unknown-source}]: [1,1]</pre>
</p><hr><a href="http://eclipse.org/jetty">
Powered by Jetty://9.4.15.v20190215</a><hr/></body></html>
I’d recommend looking at the debug logs (which might require you to use logging.basicConfig() to enable them) and testing the URLs it’s using yourself. Depending on your Solr configuration, it might be the case that your Solr URL needs to have the core at the end (e.g. http://localhost:8983/solr/mycore).
In general, most reports like this which we get turn out to be oddities in how someone configured Solr. In addition to the debug logging, I highly recommend looking through the test suite since that does everything needed to run Solr and can be a handy point for comparison:
https://github.com/django-haystack/pysolr
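For example, a minimal check along those lines might look like this (a sketch assuming the core is named mycore and Solr is on the default port 8983; adjust the URL to your setup):

import logging

import pysolr

# Enable debug logging so pysolr shows the exact URLs and payloads it sends.
logging.basicConfig(level=logging.DEBUG)

# Point the client at the core itself, not at the bare /solr/ root.
solr = pysolr.Solr('http://localhost:8983/solr/mycore', always_commit=True)
solr.add([{'id': '123c', 's_chan_name': 'TV-201'}])
print(solr.search('id:123c'))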
The solr.add method supports only a single document:
docs = {'id': '123c', 's_chan_name': 'TV-201'}
solr.add(docs)
It is not suitable for an iterable. For a list of documents, use:
docs = [{'id': '123c', 's_chan_name': 'TV-201'}]
solr.add_many(docs)
That will work.

How to call an Odoo model's method with no parameter (except self) on a specific record through xmlrpc in Odoo 13?

I am developing a script to create records in a model of an Odoo instance, and I need to run some of that model's methods on specific records. In my case, the method I need to run on a specific record doesn't take any parameter (just self). I want to know how I can run the method on a specific record of the model through an xmlrpc call from the client to the Odoo server. Below is how I tried to call the method and pass the id of the specific record in question.
xmlrpc_object.execute('test_db', user, 'admin', 'test.test', 'action_check_constraint', [record_id])
action_check_constraint checks some constraints on each record of the model and, if all the constraints pass, changes the state of the record or raises validation errors. But the above method call over xmlrpc raises the error below:
xmlrpc.client.Fault: <Fault cannot marshal None unless allow_none is enabled: 'Traceback (most recent call last):\n File "/home/ibrahim/workspace/odoo13/odoo/odoo/addons/base/controllers/rpc.py", line 60, in xmlrpc_1\n response = self._xmlrpc(service)\n File "/home/ibrahim/workspace/odoo13/odoo/odoo/addons/base/controllers/rpc.py", line 50, in _xmlrpc\n return dumps((result,), methodresponse=1, allow_none=False)\n File "/usr/local/lib/python3.8/xmlrpc/client.py", line 968, in dumps\n data = m.dumps(params)\n File "/usr/local/lib/python3.8/xmlrpc/client.py", line 501, in dumps\n dump(v, write)\n File "/usr/local/lib/python3.8/xmlrpc/client.py", line 523, in __dump\n f(self, value, write)\n File "/usr/local/lib/python3.8/xmlrpc/client.py", line 527, in dump_nil\n raise TypeError("cannot marshal None unless allow_none is enabled")\nTypeError: cannot marshal None unless allow_none is enabled\n'>
> /home/ibrahim/workspace/scripts/automate/automate_record_creation.py(328)create_record()
Can anyone help with the correct and best way of calling a model's method (with no parameter except self) on a specific record through an xmlrpc client to the Odoo server?
That error is raised because the xmlrpc library does not allow None as a return value by default. You can change that behaviour by simply allowing it.
The following line is from Odoo's external API documentation, extended to allow None as a return value:
models = xmlrpc.client.ServerProxy(
    '{}/xmlrpc/2/object'.format(url), allow_none=True)
For more information about xmlrpc.client.ServerProxy, look into the Python documentation.
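A rough end-to-end sketch of the call; the url, database, credentials and record_id below are placeholders, so replace them with your own values:

import xmlrpc.client

url = 'http://localhost:8069'                         # placeholder server URL
db, username, password = 'test_db', 'user', 'admin'   # placeholder credentials
record_id = 1                                         # placeholder record id

common = xmlrpc.client.ServerProxy(
    '{}/xmlrpc/2/common'.format(url), allow_none=True)
uid = common.authenticate(db, username, password, {})

models = xmlrpc.client.ServerProxy(
    '{}/xmlrpc/2/object'.format(url), allow_none=True)

# Call the parameterless method on one specific record by passing its id
# inside the list of ids.
models.execute_kw(db, uid, password,
                  'test.test', 'action_check_constraint', [[record_id]])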
You can get that error if action_check_constraint does not return anything (i.e. it returns None by default).
Try running the server with the log-level option set to debug_rpc_answer to get more details.
After a lot of searching and trying, I first used this fix to solve the error, but I don't think that fix is best practice. So I found OdooRPC, which does the same job but handles the above case, so there is no such error for model methods that return None. Using OdooRPC solved my problem and I got done what I needed to do with xmlrpc in Odoo.
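For reference, a minimal OdooRPC sketch of the same call (host, port, database, credentials and the record id are placeholders):

import odoorpc

odoo = odoorpc.ODOO('localhost', port=8069)   # placeholder host/port
odoo.login('test_db', 'user', 'admin')        # placeholder db/credentials

record = odoo.env['test.test'].browse(1)      # placeholder record id
# Methods that return None are handled fine here.
record.action_check_constraint()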

How to best handle Scrapy cache at 'OSError: [Errno 28] No space left on device' failure?

What's the advised action to take should Scrapy fail with exception:
OSError: [Errno 28] No space left on device
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 86, in process_response
self._cache_response(spider, response, request, cachedresponse)
File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 106, in _cache_response
self.storage.store_response(spider, request, response)
File "/usr/lib/python3.6/site-packages/scrapy/extensions/httpcache.py", line 317, in store_response
f.write(to_bytes(repr(metadata)))
OSError: [Errno 28] No space left on device
In this specific case, a ramdisk/tmpfs limited to 128 MB was used as the cache disk, with the Scrapy setting HTTPCACHE_EXPIRATION_SECS = 300 on httpcache.FilesystemCacheStorage:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 300
HTTPCACHE_DIR = '/tmp/ramdisk/scrapycache' # (tmpfs on /tmp/ramdisk type tmpfs (rw,relatime,size=131072k))
HTTPCACHE_IGNORE_HTTP_CODES = ['400','401','403','404','500','504']
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
I might be wrong, but I get the impression that Scrapy's FilesystemCacheStorage might not be managing its cache (storage limitations) all that well.
Might it be better to use LevelDB?
You are right. Nothing is deleted after the cache expires. HTTPCACHE_EXPIRATION_SECS only decides whether to use the cached response or re-download, and that holds for every HTTPCACHE_STORAGE backend.
If your cache data is very large, you should consider using a database for storage instead of the local file system. Alternatively, you can extend the backend storage and add a LoopingCall task that continuously deletes expired cache entries (a sketch follows below).
Why does Scrapy keep around data that it is going to ignore?
I think there are two points:
HTTPCACHE_EXPIRATION_SECS controls whether to use a cached response or re-download; it only guarantees that you use non-expired cache entries. Different spiders may set different expiration_secs values, so deleting the cache would leave it in a confused state.
If you wanted to delete expired cache entries, you would need a LoopingCall task to check for them continuously, which would make the Scrapy extension more complex than Scrapy wants it to be.
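A minimal sketch of that LoopingCall idea, assuming the on-disk layout used by FilesystemCacheStorage (one directory per request fingerprint containing a pickled_meta file) and an arbitrary 60-second prune interval; the class name PruningFilesystemCacheStorage is my own:

import os
import shutil
import time

from scrapy.extensions.httpcache import FilesystemCacheStorage
from twisted.internet import task


class PruningFilesystemCacheStorage(FilesystemCacheStorage):

    def open_spider(self, spider):
        super(PruningFilesystemCacheStorage, self).open_spider(spider)
        # Check for expired entries every 60 seconds (arbitrary choice).
        self._prune_loop = task.LoopingCall(self._prune_expired, spider)
        self._prune_loop.start(60, now=False)

    def close_spider(self, spider):
        if self._prune_loop.running:
            self._prune_loop.stop()
        super(PruningFilesystemCacheStorage, self).close_spider(spider)

    def _prune_expired(self, spider):
        root = os.path.join(self.cachedir, spider.name)
        if not os.path.isdir(root):
            return
        now = time.time()
        for prefix in os.listdir(root):
            prefix_dir = os.path.join(root, prefix)
            for fingerprint in os.listdir(prefix_dir):
                entry = os.path.join(prefix_dir, fingerprint)
                meta = os.path.join(entry, 'pickled_meta')
                if not os.path.exists(meta):
                    continue
                # Same expiration rule the storage applies when reading.
                if 0 < self.expiration_secs < now - os.stat(meta).st_mtime:
                    shutil.rmtree(entry, ignore_errors=True)

Point HTTPCACHE_STORAGE at this class (e.g. 'myproject.cache.PruningFilesystemCacheStorage', a hypothetical module path) to use it instead of the default backend.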

How to fix IncompleteRead error on Linux using Py2Neo

I am updating data on a Neo4j server using Python (2.7.6) and Py2Neo (1.6.4). My load function is:
from py2neo import neo4j, node, rel, cypher

session = cypher.Session('http://my_neo4j_server.com.mine:7474')

def load_data():
    tx = session.create_transaction()
    for row in dataframe.iterrows():  # dataframe is a pandas dataframe
        name = row[1].name
        id = row[1].id
        merge_query = "MERGE (a:label {name:'%s', name_var:'%s'}) " % (id, name)
        tx.append(merge_query)
    tx.commit()
When I execute this from Spyder on Windows it works great. All the data from the dataframe is committed to Neo4j and visible in the graph. However, when I run it from a Linux server (different from the Neo4j server) I get the following error at tx.commit(). Note that I have the same versions of Python and py2neo.
INFO:py2neo.packages.httpstream.http:>>> POST http://neo4j1.qs:7474/db/data/transaction/commit [1360120]
INFO:py2neo.packages.httpstream.http:<<< 200 OK [chunked]
ERROR:__main__:some part of process failed
Traceback (most recent call last):
File "my_file.py", line 132, in load_data
tx.commit()
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher.py", line 242, in commit
return self._post(self._commit or self._begin_commit)
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher.py", line 208, in _post
j = rs.json
File "/usr/local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 563, in json
return json.loads(self.read().decode(self.encoding))
File "/usr/local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 634, in read
data = self._response.read()
File "/usr/local/lib/python2.7/httplib.py", line 543, in read
return self._read_chunked(amt)
File "/usr/local/lib/python2.7/httplib.py", line 597, in _read_chunked
raise IncompleteRead(''.join(value))
IncompleteRead: IncompleteRead(128135 bytes read)
This post (IncompleteRead using httplib) suggests that it is an httplib error. I am not sure how to handle this since I am not calling httplib directly.
Any suggestions for getting this load to work on Linux, or for what the IncompleteRead error message means?
UPDATE :
The IncompleteRead error is being caused by a Neo4j error being returned. The line returned in _read_chunked that is causing the error is:
pe}"}]}],"errors":[{"code":"Neo.TransientError.Network.UnknownFailure"
Neo4j docs say this is an unknown network error.
Although I can't say for sure, this implies some kind of local network issue between client and server rather than a bug within the library. Py2neo wraps httplib (which is pretty solid itself) and, from the stack trace, it looks as though the client is expecting more chunks from a chunked response.
To diagnose further, you could make some curl calls from your Linux application server to your database server and see what succeeds and what doesn't. If that works, try writing a quick and dirty python script to make the same calls with httplib directly.
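A quick-and-dirty version of that httplib check might look like the following, posting a trivial statement to the transactional endpoint shown in the log line above (the host, port and RETURN 1 statement are placeholders; add authentication headers if your server requires them):

import httplib
import json

conn = httplib.HTTPConnection('neo4j1.qs', 7474)
payload = json.dumps({"statements": [{"statement": "RETURN 1"}]})
conn.request('POST', '/db/data/transaction/commit', payload,
             {'Content-Type': 'application/json',
              'Accept': 'application/json'})
resp = conn.getresponse()
print(resp.status)
print(resp.read())
conn.close()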
UPDATE 1: Given the update above and the fact that the server streams its responses, I'm thinking that the chunk size might represent the intended payload but the error cuts the response short. Recreating the issue with curl certainly seems like the best next step to help determine whether it is a fault in the driver, the server or something else.
UPDATE 2: Looking again this morning, I notice that you're using Python substitution for the properties within the MERGE statement. As good practice, you should use parameter substitution at the Cypher level:
merge_query = "MERGE (a:label {name:{name}, name_var:{name_var}})"
merge_params = {"name": id, "name_var": name}
tx.append(merge_query, merge_params)

Pymongo failing but won't give exception

Here is the query in Pymongo
import mong  # just my library for initializing
collection_1 = mong.init(collect="col_1")
collection_2 = mong.init(collect="col_2")

for name in collection_2.find({"field1": {"$exists": 0}}):
    try:
        to_query = name['something']
        actual_id = collection_1.find_one({"something": to_query})['_id']
        crap_id = name['_id']
        # match on this document's own _id, and use upsert (not "upset")
        collection_2.update({"_id": crap_id}, {"$set": {"new_name": actual_id}}, upsert=True)
    except:
        # write a string representation; the raw dict can't be written directly
        open('couldn_find_id.txt', 'a').write(str(name) + '\n')
All this does is take a field from one collection, find the id for that field, and update the id in another collection. It works for about 1000-5000 iterations, but it periodically fails with the traceback below and then I have to restart the script.
> Traceback (most recent call last):
File "my_query.py", line 6, in <module>
for name in collection_2.find({"field1":{"$exists":0}}):
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 814, in next
if len(self.__data) or self._refresh():
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 776, in _refresh
limit, self.__id))
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 720, in __send_message
self.__uuid_subtype)
File "/home/user/python_mods/pymongo/pymongo/helpers.py", line 98, in _unpack_response
cursor_id)
pymongo.errors.OperationFailure: cursor id '7578200897189065658' not valid at server
^C
bye
Does anyone have any idea what this failure is, and how I can handle it as an exception so my script can continue even when it happens?
Thanks
The reason for the problem is described in PyMongo's FAQ:
Cursors in MongoDB can timeout on the server if they’ve been open for
a long time without any operations being performed on them. This can
lead to an OperationFailure exception being raised when attempting to
iterate the cursor.
This is because of the timeout argument of collection.find():
timeout (optional): if True (the default), any returned cursor is
closed by the server after 10 minutes of inactivity. If set to False,
the returned cursor will never time out on the server. Care should be
taken to ensure that cursors with timeout turned off are properly
closed.
Passing timeout=False to the find should fix the problem:
for name in collection_2.find({"field1":{"$exists":0}}, timeout=False):
But, be sure you are closing the cursor properly.
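For example, a small sketch of keeping the no-timeout cursor while guaranteeing it gets closed (the loop body is a placeholder for the processing in the original script):

cursor = collection_2.find({"field1": {"$exists": 0}}, timeout=False)
try:
    for name in cursor:
        pass  # process the document as in the original loop
finally:
    cursor.close()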
Also see:
mongodb cursor id not valid error
