I am getting a memory error while uploading a CSV file of around 650 MB with shape (10882101, 6).
How can I upload such a file into Postgres using the Django framework?
You haven't shared many details (error logs, which Python package you are using, etc.).
You might like to read Most efficient way to parse a large .csv in Python? and https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d
How I would do it with the Django framework:
I would use Celery to run the job as a background process, since waiting for the file to be processed completely before returning a response is likely to cause an HTTP timeout; a sketch of such a task follows below.
Celery quickstart with Django
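To make that concrete, here is a minimal sketch of such a Celery task, assuming a Django model named Record whose fields match the six CSV columns (the app, model, and field names are hypothetical). It reads the file in fixed-size batches and inserts them with bulk_create, so the 10-million-row file is never held in memory at once.

# tasks.py -- minimal sketch; app, model, and field names are hypothetical
import csv

from celery import shared_task

from myapp.models import Record  # hypothetical app and model


@shared_task
def import_csv(path, batch_size=10000):
    """Read the CSV in fixed-size batches and bulk-insert them into Postgres."""
    inserted = 0
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        batch = []
        for row in reader:
            # map your six CSV columns onto the model fields here
            batch.append(Record(col_a=row["col_a"], col_b=row["col_b"]))
            if len(batch) >= batch_size:
                Record.objects.bulk_create(batch)
                inserted += len(batch)
                batch = []
        if batch:  # flush the final partial batch
            Record.objects.bulk_create(batch)
            inserted += len(batch)
    return inserted

In the view you would save the uploaded file somewhere the worker can read it and call import_csv.delay(path), so the HTTP request returns immediately while the import runs in the background.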
Related
I have a Flask application where I query and filter large datasets from S3 in a Celery task. I want to serve the filtered data to the user as a CSV of up to 100 MB. My initial thought was to have Celery save the dataset as a CSV to disk and then use send_file in the Flask route, but I am deploying on Heroku, which has an ephemeral filesystem, so a file saved by the Celery worker won't be visible to the web worker. I also tried running the S3 query directly in the Flask route and sending the file without saving it to the server, but the query takes up to 30 seconds, so I want to keep it as a background job in Celery.
The other thought was to upload the filtered data to S3 and remove it after the download. However, this seems inefficient because it means downloading, filtering, and re-uploading.
Is there any way to do this efficiently, or should I move off Heroku to something where I have SSD space? Thanks!
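One way to make the "write the filtered data back to S3" idea workable on Heroku is to have the Celery task upload the CSV and return a short-lived presigned URL that the web dyno hands to the browser, so nothing ever touches the ephemeral filesystem. The sketch below assumes boto3; the bucket, key, and helper names are hypothetical, and fetch_filtered_rows stands in for the existing query/filter step.

# tasks.py -- sketch only; bucket, key, and helper names are hypothetical
import csv
import io

import boto3
from celery import shared_task

s3 = boto3.client("s3")
BUCKET = "my-results-bucket"  # hypothetical bucket name


def fetch_filtered_rows(query_params):
    """Placeholder for the existing S3 query/filter step."""
    yield ["col_a", "col_b"]      # header row
    yield ["example", "data"]


@shared_task
def filter_to_csv(query_params, result_key):
    """Filter the data, write the CSV straight to S3, and return a download URL."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in fetch_filtered_rows(query_params):
        writer.writerow(row)
    s3.put_object(Bucket=BUCKET, Key=result_key, Body=buf.getvalue().encode("utf-8"))
    # The browser downloads directly from S3 via a temporary link, so the
    # web dyno never needs local disk space for the file.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": result_key},
        ExpiresIn=3600,
    )

Cleaning up the result objects afterwards could be left to an S3 lifecycle rule on that prefix instead of an explicit delete step.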
I am working on a single-page web app that will allow users to upload a large CSV file (greater than 5 GB) and send it to a Python Flask server to be streamed into a database. Large CSVs are hard to read in Python because of memory issues, so I think it's best to load the CSV data into a database such as SQLite and then query the database to get the data back. However, I tried reading a 6.9 GB CSV file in chunks using pandas and storing it in an SQLite DB with the df.to_sql() method, and it took about an hour, which would be inefficient in a Flask server and a terrible user experience. From my research it seems I can use worker processes/task queues like Celery or Redis Queue to speed this up, but I'm not too sure how to go about it or whether I need sockets to do this. I am a junior dev, so any reference to tutorials, examples, or advice would be greatly appreciated.
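A minimal sketch of that chunked load wrapped in a background task, assuming pandas, SQLAlchemy, and Celery (Redis Queue would work the same way); the database URL, table name, and chunk size are assumptions to adjust.

# tasks.py -- sketch only; database URL, table name, and chunk size are assumptions
import pandas as pd
from celery import shared_task
from sqlalchemy import create_engine

engine = create_engine("sqlite:///uploads.db")  # hypothetical SQLite database


@shared_task
def load_csv(path, table="uploaded_data", chunksize=100000):
    """Stream the CSV into the database chunk by chunk instead of all at once."""
    rows = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk.to_sql(table, engine, if_exists="append", index=False)
        rows += len(chunk)
    return rows

The Flask upload route then only has to save the file and enqueue load_csv.delay(path); the browser can poll a status endpoint (or use a socket) to learn when the load has finished.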
I am trying to upload a 400 GB .ibd file from a MySQL DB machine into D42/S3.
I am using the set_contents_from_file function from Python boto, but it is taking a lot of time and I cannot see the progress (how much has been uploaded and how much is left).
Does anyone have a Python script that does a threaded or parallel multipart upload? It's a very simple use case for an end user, but boto's documentation doesn't have any function like this.
In the end I did it with s3cmd and not with Python.
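For reference, this can also be done in Python with boto3's managed transfer (a different library from the boto used in the question, so treat this as a swapped-in approach): it splits the file into parts, uploads the parts in parallel threads, and reports progress through a callback. The path, bucket, and key below are hypothetical.

# multipart_upload.py -- sketch using boto3; path, bucket, and key are hypothetical
import os
import sys
import threading

import boto3
from boto3.s3.transfer import TransferConfig


class Progress:
    """Prints a running byte count as the transfer callback is invoked."""
    def __init__(self, filename):
        self._size = os.path.getsize(filename)
        self._seen = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen += bytes_amount
            pct = self._seen / self._size * 100
            sys.stdout.write(f"\r{self._seen} / {self._size} bytes ({pct:.1f}%)")
            sys.stdout.flush()


config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # use multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
    max_concurrency=10,                    # parts uploaded in parallel threads
)

s3 = boto3.client("s3")
path = "/var/lib/mysql/big_table.ibd"      # hypothetical file
s3.upload_file(path, "my-backup-bucket", "backups/big_table.ibd",
               Config=config, Callback=Progress(path))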
My Flask application will allow the upload of large files (up to 100 MB) to my server. I was wondering how Flask manages the chunked file if the client decides to stop the upload halfway through. I read the documentation about file uploads but wasn't able to find this mentioned.
Does Flask automatically delete the file? How can it know that the user won't retry it? Or do I have to manually delete the aborted files in the temporary folder?
Werkzeug (the library that Flask uses for many tasks including this one) uses a tempfile.TemporaryFile object to receive the WSGI file stream when uploading. The object automatically manages the open file.
The file is immediately deleted on disk: there is no entry in the directory table anymore, but the process retains an open file handle.
When the TemporaryFile object is cleared (no references remain, usually because the request ended), the file object is closed and the operating system clears the disk space used.
As such, the file data is deleted when a request is aborted.
Flask does not handle the case where a user uploads the file again; there is no standard way to handle that anyway. You'd have to come up with your own solution there.
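A small standalone demonstration of that behaviour (nothing Flask-specific), just to show why there is nothing to clean up by hand:

# The temporary file has no lasting directory entry; once the last handle is
# closed (e.g. when the request ends), the OS reclaims the disk space.
import tempfile

tmp = tempfile.TemporaryFile()
tmp.write(b"partial upload data")
tmp.seek(0)
print(tmp.read())   # b'partial upload data' -- usable while the handle is open

tmp.close()         # nothing left on disk to delete manually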
I have a web service on Google App Engine (programmed in Python), and every day I have to update it with data from an FTP source.
My daily job, which runs outside of GAE, downloads the data from the FTP server, then parses and enriches it with other information sources; this process takes nearly 2 hours.
After all this, I upload the data to my server using the bulk upload function of appcfg.py (command line).
Since I want better reports on this process, I need to know how many records were really uploaded by each call to appcfg (there are more than 10 calls).
My question is: can I get the number of lines uploaded from appcfg.py without having to parse its output?
Bonus question: does anyone else do this kind of daily routine, or is it a bad practice?
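One workaround sketch, rather than anything appcfg.py itself provides: count the rows in each source file yourself before invoking appcfg as a subprocess, and report those counts. This tells you how many records were submitted per call, not how many the server actually accepted, and the job/command shapes below are hypothetical placeholders for your real appcfg invocations.

# report_uploads.py -- workaround sketch; the appcfg invocations are placeholders
import subprocess


def count_rows(path):
    """Count data rows in a CSV, skipping the header line."""
    with open(path) as fh:
        return sum(1 for _ in fh) - 1


def upload_with_report(jobs):
    """jobs: list of (csv_path, appcfg_command_list) pairs."""
    total = 0
    for path, cmd in jobs:
        rows = count_rows(path)
        subprocess.check_call(cmd)   # your real appcfg.py bulk-upload command
        total += rows
        print(f"{path}: {rows} records submitted")
    print(f"Total: {total} records across {len(jobs)} calls")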