Timeout error when saving large data in Elasticsearch - Python

I am working on a Python application that saves data from a zip file to an Elasticsearch database; the zip file contains HTML pages and domain names. I need to push the data from that file into an array and then save it in Elasticsearch.
The issue is that sometimes, when the data is large (the HTML can be of any size), I get this error:
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300)
ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300))
I have tried increasing the timeout value, but I don't know how large the data may become in future saves or updates, so I am not sure what value to put there.
Is increasing the timeout the only way, or is there a better way to fix this?
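One alternative to only raising the timeout is to bound the size of each request: index the documents in fixed-size batches with the bulk helpers, so no single request grows with the total amount of HTML. A minimal sketch with the elasticsearch-py client (7.x-style parameter names; newer clients use request_timeout, and the index name "pages" is a placeholder):

```python
from elasticsearch import Elasticsearch, helpers

# Per-request timeout; with bounded batches each request stays small,
# so this value does not have to grow with the size of the zip file.
es = Elasticsearch(["http://localhost:9200"], timeout=120)

def index_pages(docs):
    """docs: iterable of dicts such as {"domain": ..., "html": ...}."""
    actions = (
        {"_index": "pages", "_source": doc}  # "pages" is a placeholder index name
        for doc in docs
    )
    # Send documents in batches capped both by count and by bytes,
    # instead of one huge request holding every HTML page at once.
    helpers.bulk(
        es,
        actions,
        chunk_size=200,                    # at most 200 docs per request
        max_chunk_bytes=10 * 1024 * 1024,  # and at most ~10 MB per request
    )
```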

Related

How to solve gspread API ReadTimeout limit (Python)?

I am using the gspread module to read data from a Google Sheet. However, some sheets are quite large, and whenever I try to read (get) the values from the sheet I get a timeout error like the following:
ReadTimeout: HTTPSConnectionPool(host='sheets.googleapis.com', port=443): Read timed out. (read timeout=120)
One solution that comes to mind is to extend the timeout value, but I don't know exactly how.
If you know how, or have any kind of solution to this issue, I would really appreciate your help.
Hi, if you look at the gspread repository, it recently merged a new PR that introduces timeouts in the client. When it is released, just update gspread to the latest version and you'll be able to set a timeout on your requests.
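For reference, the usage should then look roughly like this. This is only a sketch: the credentials path and spreadsheet name are placeholders, and the set_timeout call is an assumption based on the PR mentioned above.

```python
import gspread

# Authenticate with a service account (the credentials path is a placeholder).
gc = gspread.service_account(filename="credentials.json")

# Timeout in seconds applied to every HTTP request the client makes;
# set_timeout is the knob introduced by the PR referenced above.
gc.set_timeout(300)

sheet = gc.open("My big spreadsheet").sheet1  # placeholder spreadsheet name
values = sheet.get_all_values()
```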

How can I upload a 1 crore (10 million) row .csv file using Django?

I am getting a memory error while uploading a CSV file of around 650 MB with shape (10882101, 6).
How can I upload such a file into Postgres using the Django framework?
You haven't shared many details (error logs, which Python package you are using, etc.).
You might like to read Most efficient way to parse a large .csv in python? and https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d
How I would do it with the Django framework:
I would use Celery to run the job as a background process, since waiting for the file to be uploaded and processed completely before returning a response might cause an HTTP timeout (see the sketch after the link below).
Celery quickstart with Django
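A minimal sketch of such a task, assuming a hypothetical Record model whose field names match the CSV columns: the file is read row by row and inserted with bulk_create in batches, so the whole 650 MB never sits in memory and the HTTP request only has to enqueue the job.

```python
# tasks.py
import csv

from celery import shared_task

from myapp.models import Record  # hypothetical model matching the CSV columns


@shared_task
def import_csv(path, batch_size=5000):
    """Load the CSV at `path` into Postgres in batches."""
    batch = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            batch.append(Record(**row))  # assumes CSV headers == model field names
            if len(batch) >= batch_size:
                Record.objects.bulk_create(batch)
                batch = []
    if batch:
        Record.objects.bulk_create(batch)
```

The upload view would then just save the file to disk (or object storage), call import_csv.delay(saved_path), and return a response immediately.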

Insert large amount of data to BigQuery via bigquery-python library

I have large CSV and Excel files. I read them and dynamically create the needed CREATE TABLE script depending on the fields and types they contain, then insert the data into the created table.
I have read this and understood that I should send the data with jobs.insert() instead of tabledata.insertAll() for large amounts of data.
This is how I call it (works for smaller files, not for large ones):
result = client.push_rows(datasetname,table_name,insertObject) # insertObject is a list of dictionaries
When I use the library's push_rows it gives this error on Windows:
[Errno 10054] An existing connection was forcibly closed by the remote host
and this on Ubuntu:
[Errno 32] Broken pipe
When I went through the BigQuery-Python code, I saw that it uses table_data.insertAll().
How can I do this with this library? I know we can upload through Google Cloud Storage, but I need a direct upload method with this library.
When handling large files, don't use streaming; batch load instead. Streaming will easily handle up to 100,000 rows per second, which is pretty good for streaming, but not for loading large files.
The linked sample code is doing the right thing (batch instead of streaming), so what we see is a different problem: the sample code is trying to load all this data straight into BigQuery, but the upload through POST fails. gsutil has a more robust uploading algorithm than a plain POST.
Solution: instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read the files from GCS, as in the sketch below.
See also BigQuery script failing for large file
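A rough sketch of that flow using the official google-cloud-storage and google-cloud-bigquery clients rather than the BigQuery-Python wrapper (bucket, dataset and table names are placeholders):

```python
from google.cloud import bigquery, storage

# 1. Stage the local CSV in Google Cloud Storage.
storage_client = storage.Client()
bucket = storage_client.bucket("my-staging-bucket")
blob = bucket.blob("staging/data.csv")
blob.upload_from_filename("data.csv")  # resumable upload handled by the client

# 2. Tell BigQuery to batch-load the file from GCS (no streaming, no big POST).
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # or pass an explicit schema built from your type detection
)
load_job = bq_client.load_table_from_uri(
    "gs://my-staging-bucket/staging/data.csv",
    "datasetname.table_name",
    job_config=job_config,
)
load_job.result()  # waits for the load job to finish and raises on error
```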

How to find out how much of a file was served by nginx?

I want to serve some files with a Django + nginx setup. The problem is that many of the files being served are huge, and users have a quota for downloading files.
So how can I find out how much of a file was actually served to a user? What I mean is that a user may close the connection mid-download, so how can I find the real amount served?
thanks!
With nginx you can log the amount of bandwidth used to a log file by using the log module. This is not exactly what you want, but it can help you achieve it.
You will then have logs that you can parse to get the bytes actually sent and the total bandwidth used; a script can update a database with those numbers, and you can then authorize (or refuse) future downloads depending on whether the user is within their limit or some soft-limit range. A parsing sketch follows the linked question below.
Server Fault with similar question
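As a sketch of that log-parsing approach: with the default combined log format, $body_bytes_sent records how many bytes nginx actually sent, so aborted downloads only count what was transferred. Assuming downloads are authenticated so $remote_user (or a user id you add to the log format) identifies the user, a script along these lines could feed your quota database:

```python
import re
from collections import defaultdict

# Matches the default "combined" access-log format; the "bytes" group is
# $body_bytes_sent, i.e. what was really sent before the client disconnected.
LINE_RE = re.compile(
    r'(?P<addr>\S+) \S+ (?P<user>\S+) \[[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+)'
)


def bytes_per_user(log_path="/var/log/nginx/access.log"):
    """Return a {user: total_bytes_served} mapping parsed from the access log."""
    totals = defaultdict(int)
    with open(log_path) as log:
        for line in log:
            match = LINE_RE.match(line)
            if match:
                totals[match.group("user")] += int(match.group("bytes"))
    return totals
```

A cron job (or a Celery beat task, since Django is already in the stack) could run this periodically on rotated logs and update each user's remaining quota.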
Another option, making an assumption, is to keep the file sizes in a database and keep a tally per request: whenever a user hits a download link, immediately increment their download count; if they are over their limit, invalidate the link, otherwise make the link valid and hand them over to nginx.
Another option would be to write a custom nginx module that performs the increment at a much more fine-grained level, but that could be more work than your situation requires.

Fetch a large chunk of data with TaskQueue

I'd like to fetch a large file from a URL, but it always raises a DeadlineExceededError, even though I have tried using a TaskQueue and setting deadline=600 on the fetch.
The problem comes from the fetch itself, so backends cannot help here: even if I launched a backend from a TaskQueue, I'd have 24 hours to return, but there would still be the 10-minute limit on the fetch, right?
Is there a way to fetch from one offset of the file to another offset? Then I could split the fetch and put all the parts together afterwards.
Any ideas?
Actually the file to fetch is not really large: between 15 and 30 MB, but the server is probably extremely slow and constantly hammered...
If the server supports it, you can supply the HTTP Range header to specify a subset of the file that you want to fetch. If the content is being served statically, the server will probably respect range requests; if it's dynamic, it depends on whether the author of the code that generates the response allowed for them.
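A rough sketch of chunked fetching with App Engine's urlfetch, assuming the server honours Range requests (the 5 MB chunk size is an arbitrary choice; each chunk just needs to finish well within the fetch deadline):

```python
from google.appengine.api import urlfetch

CHUNK = 5 * 1024 * 1024  # 5 MB per request; an arbitrary choice for this sketch


def fetch_in_chunks(url):
    """Fetch `url` piece by piece using HTTP Range requests."""
    parts = []
    offset = 0
    while True:
        byte_range = "bytes=%d-%d" % (offset, offset + CHUNK - 1)
        resp = urlfetch.fetch(url, headers={"Range": byte_range}, deadline=60)
        if resp.status_code not in (200, 206):
            raise urlfetch.DownloadError("unexpected status %d" % resp.status_code)
        parts.append(resp.content)
        # 200 means the server ignored Range and sent everything at once;
        # a short chunk means we just read the tail of the file.
        if resp.status_code == 200 or len(resp.content) < CHUNK:
            break
        offset += CHUNK
    return "".join(parts)
```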
