Downloading large .bz2 files with Python requests library - python

I am currently developing a Python script that calls a REST API to download data that is made available every day through the API. The files that I am trying to download have a `.txt.bz2` extension.
The API documentation recommends using curl to download data from the API. In particular, the recommended command to download the data is:
curl --user Username:Password https://api.endpoint.com/data/path/to/file -o my_filename.txt.bz2
where, of course, the URL of the API data endpoint here is just fictitious.
Since the documentation recommends curl, my current implementation of the Python script leverages the subprocess library to call curl within Python:
import subprocess

def data_downloader(download_url, file_name, api_username, api_password):
    # build the same curl command that the API documentation recommends
    args = ['curl', '--user', f'{api_username}:{api_password}', f'{download_url}', '-o', f'{file_name}']
    subprocess.call(args)
    return file_name
However, since I am extensively using the requests library in other parts of the application that I am developing, mainly to send requests to the API and to walk its file-system-like structure, I have tried to implement the download function with this library as well. In particular, I have used this other Stack Overflow thread as the reference for my alternative implementation, and the two functions that I have implemented with the requests library look like this:
import requests
import shutil

def download_file(download_url, file_name, api_username, api_password, chunk_size):
    # stream the response and write it to disk chunk by chunk
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return file_name
def shutil_download(download_url, file_name, api_username, api_password):
    # copy the raw response stream straight into the file object
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return file_name
With the subprocess implementation I am able to download the entire file without any issue, but with either of the two requests implementations I always end up with a downloaded file of about 1 KB, which is clearly wrong since most of the files I am downloading are >10 GB.
I suspect that the issue is caused by the format of the data that I am attempting to download, as I have seen .zip and .gzip files downloaded successfully with the same logic used in the two functions. Hence I am wondering if anyone has an explanation for the issue I am experiencing, or can provide a working solution to the problem.
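For what it's worth, a quick inspection of the response before writing it to disk can narrow this kind of problem down. The sketch below reuses the variables from the functions above and checks the status code, the reported length, and the bzip2 magic bytes (the header names depend on what the API actually returns):
import requests

# sketch only: download_url, api_username and api_password as in the functions above
with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
    print(r.status_code)                       # expected: 200
    print(r.headers.get('Content-Length'))     # expected: roughly the file size
    print(r.headers.get('Content-Type'))       # e.g. application/x-bzip2, not text/html
    first_chunk = next(r.iter_content(chunk_size=16), b'')
    print(first_chunk.startswith(b'BZh'))      # bzip2 files start with the 'BZh' magic bytes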
UPDATE
I had a chance to discuss the issue with the owner of the API and, upon analysis of the logs on their end, they found issues on their side that prevented the request from going through. On my side the status code was signalling a successful request, but the returned data was not the correct one.
The two functions that use the requests library work as expected and the issue can be considered solved.

Related

How do I download and upload files as effectively as possible using Python?

So here is what I want to do: I want to download JSON files from an API, then I want to upload them to a container in Azure.
Here is my current solution: I use the module requests and
file = requests.get('<link>')
to download the data, from which I then create a JSON file and store it on my local computer like so:
import os

local_path = os.getcwd()
local_file_name = 'file'
with open(os.path.join(local_path, local_file_name + '.json'), 'wb') as f:
    f.write(file.content)
then I find the local file, upload it to Azure, and delete the local file. This is a process I repeat for all the files, and I feel it is quite inefficient. Is there a better way to do it? I am wondering if I can upload the request payloads directly, with a specified name for every individual file, without having to store them locally. Thoughts?
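One option worth considering (a sketch, assuming the azure-storage-blob package; the connection string, container and blob names are placeholders) is to pass the downloaded bytes straight to a blob client, so nothing is written to the local disk:
import requests
from azure.storage.blob import BlobServiceClient

# placeholder connection string
service = BlobServiceClient.from_connection_string('<connection-string>')

def upload_json(url, container_name, blob_name):
    # download the JSON payload into memory
    payload = requests.get(url).content
    # upload the bytes directly to the container, no local file involved
    blob = service.get_blob_client(container=container_name, blob=blob_name)
    blob.upload_blob(payload, overwrite=True)

# placeholder URL and names
upload_json('<link>', 'my-container', 'file.json')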
You could probably use asyncio and aiohttp to download the files: asyncio provides the concurrency, and aiohttp fetches web resources much like requests does. There are two good tutorials below on how to use aiohttp together with async/await and asyncio.
https://www.queness.com/post/17688/aiohttp-tutorial-how-does-it-worka
Aiohttp tutorial
https://realpython.com/async-io-python/
asyncio/async tutorial
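A minimal sketch of what that can look like for downloading several JSON payloads concurrently (the URLs are placeholders):
import asyncio
import aiohttp

async def fetch(session, url):
    # each request yields control while waiting on the network
    async with session.get(url) as resp:
        return await resp.read()

async def download_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# placeholder URLs
payloads = asyncio.run(download_all(['https://example.com/a.json',
                                     'https://example.com/b.json']))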

Best practice: send multiple files to rest endpoint using python requests

What is the best way to send a lot of POST requests to a REST endpoint via Python?
E.g. I want to upload ~500k files to a database.
What I've done so far is a loop that creates a new request for each file using the requests package.
from os import listdir

# get list of files
files = [f for f in listdir(folder_name)]

# loop through the list
for file_name in files:
    try:
        # open file and get content
        with open(folder_name + "\\" + file_name, "r") as file:
            f = file.read()
        # create request
        req = make_request(url, f)
    except Exception:
        # error handling, logging, ...
        pass
But as this is quite slow: what is the best practice for doing this? Thank you.
First approach:
I don't know if it is the best practice, but you can split the files into batches of 1000, zip each batch, and send the archives as POST requests using threads (set the number of threads to the number of processor cores), as sketched below.
(The REST endpoint can extract the zipped contents and then process them.)
Second approach:
Zip the files in batches and transfer them batch by batch.
After the transfer is completed, validate on the server side.
Then start the database upload in one go.
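A rough sketch of the first approach; the batch size, the endpoint URL, and the assumption that the endpoint accepts a zip archive as the request body are all mine:
import io
import os
import zipfile
import requests
from concurrent.futures import ThreadPoolExecutor

def zip_batch(folder_name, batch):
    # build an in-memory zip archive for one batch of files
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        for file_name in batch:
            zf.write(os.path.join(folder_name, file_name), arcname=file_name)
    buf.seek(0)
    return buf

def post_batch(url, folder_name, batch):
    # assumed: the endpoint accepts a zip archive as the request body
    buf = zip_batch(folder_name, batch)
    return requests.post(url, data=buf, headers={'Content-Type': 'application/zip'})

def upload_all(url, folder_name, batch_size=1000, workers=os.cpu_count()):
    names = os.listdir(folder_name)
    batches = [names[i:i + batch_size] for i in range(0, len(names), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: post_batch(url, folder_name, b), batches))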
The first thing you want to do is determine exactly which part of your script is the bottleneck. You have both disk and network I/O here (reading files and sending HTTP requests, respectively).
Assuming that the HTTP requests are the actual bottleneck (highly likely), consider using aiohttp instead of requests. The docs have some good examples to get you started and there are plenty of "Quick Start" articles out there. This would allow your network requests to be cooperative, meaning that other Python code can run while one of your network requests is waiting. Just be careful not to overwhelm whatever server is receiving the requests.
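A sketch of what bounded concurrency can look like with aiohttp; the endpoint URL, the way the file contents are sent, and the concurrency limit are all assumptions:
import asyncio
import aiohttp

async def post_file(session, sem, url, path):
    async with sem:                     # caps the number of in-flight requests
        with open(path, 'rb') as fh:    # blocking read, acceptable for a sketch
            data = fh.read()
        async with session.post(url, data=data) as resp:
            return resp.status

async def upload_all(url, paths, limit=20):
    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(post_file(session, sem, url, p) for p in paths))

# statuses = asyncio.run(upload_all('https://example.com/endpoint', file_paths))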

Write-streaming to Google Cloud Storage in Python

I am trying to migrate to CF an AWS Lambda function written in Python that
unzips on-the-fly and reads line-by-line
performs some light transformations on each line
writes output (a line at a time or in chunks) uncompressed to GCS
The output is > 2GB - but slightly less than 3GB so it fits in Lambda, just.
Well, it seems impossible or way more involved in GCP:
the uncompressed output cannot fit in memory or /tmp (limited to 2048 MB as of writing this), so the Python client library's upload_from_file (or _filename) cannot be used
there is this official paper but, to my surprise, it refers to boto, a library initially designed for AWS S3 and quite outdated now that boto3 has been out for some time. No genuine GCP method to stream-write or stream-read
Node.js has a simple createWriteStream() - nice article here btw - but no equivalent one-liner in Python
Resumable media upload sounds like it, but it is a lot of code for something handled much more easily in Node
App Engine had cloudstorage, but it is not available outside of it - and it is obsolete
little to no example out there of a working wrapper for writing text/plain data line-by-line as if GCS were a local filesystem. This is not limited to Cloud Functions and is a lacking feature of the Python client library, but it is more acute in CF due to the resource constraints. Btw, I was part of a discussion to add a writeable IOBase function, but it had no traction.
Obviously, using a VM or DataFlow is out of the question for the task at hand.
In my mind, stream (or stream-like) reading/writing from cloud-based storage should even be included in the Python standard library.
As recommended back then, one can still use GCSFS, which behind the scenes commits the upload in chunks for you while you are writing stuff to a FileObj.
The same team wrote s3fs. I don't know for Azure.
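A minimal sketch of that pattern with GCSFS; the project, bucket and object names are placeholders, and transformed_lines() stands in for whatever produces the output:
import gcsfs

# placeholder project and bucket names
fs = gcsfs.GCSFileSystem(project='my-project')

# the file object buffers writes and commits them to GCS in chunks
with fs.open('my-bucket/output/result.txt', 'w') as f:
    for line in transformed_lines():   # hypothetical generator of output lines
        f.write(line + '\n')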
AFAIC, I will stick to AWS Lambda as the output can fit in memory - for now - but multipart upload is the way to go to support any output size with a minimum of memory.
Thoughts or alternatives ?
I got confused with multipart vs. resumable upload. The latter is what you need for "streaming" - it's actually more like uploading chunks of a buffered stream.
Multipart upload is to load data and custom metadata at once, in the same API call.
While I like GCSFS very much - Martin, its main contributor, is very responsive - I recently found an alternative that uses the google-resumable-media library.
GCSFS is built upon the core HTTP API, whereas Seth's solution uses a low-level library maintained by Google that is more in sync with API changes and includes exponential backoff. The latter is really a must for large/long streams, as the connection may drop, even within GCP - we faced the issue with GCF.
On a closing note, I still believe that the Google Cloud Library is the right place to add stream-like functionality, with basic write and read. It has the core code already.
If you too are interested in that feature in the core lib, thumbs up the issue here - assuming priority is based thereon.
smart_open now has support for GCS, as well as for on-the-fly decompression.
import lzma
from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# stream from GCS
with open('gs://my_bucket/my_file.txt.xz') as fin:
    for line in fin:
        print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt.xz', 'wb') as fout:
    fout.write(b'hello world')

Streaming audio using Python (without GStreamer)

I'm working on a project that involves streaming .OGG (or .mp3) files from my webserver. I'd prefer not to have to download the whole file and then play it; is there a way to do that in pure Python (no GStreamer - hoping to make it truly cross-platform)? Is there a way to use urllib to download the file chunks at a time and load that into, say, PyGame to do the actual audio playing?
Thanks!
I suppose your server supports Range requests. You ask the server via the Range header, giving the start byte and end byte of the range you want:
import urllib2
req = urllib2.Request(url)
req.headers['Range'] = 'bytes=%s-%s' % (startByte, endByte)
f = urllib2.urlopen(req)
f.read()
You can implement a file object that always downloads just the needed chunk of the file from the server. Almost every library accepts a file object as input.
It will probably be slow because of network latency. You would need to download bigger chunks of the file, preload the file in a separate thread, etc. In other words, you would need to implement the streaming client logic yourself.
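A bare-bones sketch of such a file object using requests and Range headers; it assumes the server honours Range and reports Content-Length, and it does no caching or read-ahead:
import requests

class HttpRangeFile:
    """File-like object that fetches only the requested byte range."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # assumes the server reports the total size on a HEAD request
        self.size = int(requests.head(url).headers['Content-Length'])

    def read(self, size=-1):
        if size < 0:
            size = self.size - self.pos
        if size == 0 or self.pos >= self.size:
            return b''
        end = min(self.pos + size, self.size) - 1
        headers = {'Range': 'bytes=%d-%d' % (self.pos, end)}
        data = requests.get(self.url, headers=headers).content
        self.pos += len(data)
        return data

    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        else:
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos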

Serving Files with Pyramid

I am serving quite large files from a Pyramid Application I have written.
My only problem is download managers don't want to play nice.
I can't get resume downloading or segmenting to work with download managers like DownThemAll.
import os
from pyramid.response import Response

size = os.path.getsize(Path + dFile)
response = Response(content_type='application/force-download',
                    content_disposition='attachment; filename=' + dFile)
response.app_iter = open(Path + dFile, 'rb')
response.content_length = size
I think the problem may lie with paste.httpserver but I am not sure.
The web server on the Python side needs to support partial downloads, which happen through the HTTP Accept-Ranges and Range headers. This blog post digs a bit into this matter with an example in Python:
Python sample: Downloading file through HTTP protocol with multi-threads
Pyramid 1.3 adds new response classes, FileResponse and FileIter for manually serving files.
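A minimal sketch with FileResponse; the view name and file path are placeholders:
import os
from pyramid.response import FileResponse

def download_view(request):
    # placeholder path; in the question this would be Path + dFile
    file_path = '/srv/files/large_file.bin'
    response = FileResponse(file_path, request=request,
                            content_type='application/force-download')
    response.content_disposition = 'attachment; filename=' + os.path.basename(file_path)
    return response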
Having been working on this problem for a while, I found
http://docs.webob.org/en/latest/file-example.html
to be a great help.
