I am trying to retrieve data from an API and immediately write the JSON response directly to a file, without storing any part of the response in memory. The reason for this requirement is that I'm running this script on an AWS Linux EC2 instance that only has 2 GB of memory, and if I try to hold everything in memory and then write the responses to a file, the process fails because it runs out of memory.
I've tried using f.write() as well as sys.stdout.write(), but both of these approaches seemed to write the file only after all the queries had executed. While this worked with my small example, it didn't work with my actual data.
The problem with both approaches below is that the file doesn't populate until the loop is complete. This won't work with my actual process, as the machine doesn't have enough memory to hold all the responses.
How can I adapt either of the approaches below, or come up with something new, to write data received from the API immediately to a file without saving anything in memory?
Note: I'm using Python 3.7, but I'm happy to upgrade if there is something that would make this easier.
My Approach 1
# script1.py
import requests
import json

with open('data.json', 'w') as f:
    for i in range(0, 100):
        r = requests.get("https://httpbin.org/uuid")
        data = r.json()
        f.write(json.dumps(data) + "\n")
f.close()  # redundant: the with block already closes the file
My Approach 2
# script2.py
import requests
import json
import sys

for i in range(0, 100):
    r = requests.get("https://httpbin.org/uuid")
    data = r.json()
    sys.stdout.write(json.dumps(data))
    sys.stdout.write("\n")
With Approach 2, I tried using > to redirect the output to a file:
python script2.py > data.json
You can use response.iter_content to download the content in chunks. For example:
import requests
url = 'https://httpbin.org/uuid'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('data.json', 'wb') as f_out:
        for chunk in r.iter_content(chunk_size=8192):
            f_out.write(chunk)
Saves data.json with content:
{
"uuid": "991a5843-35ca-47b3-81d3-258a6d4ce582"
}
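For the looping use case in the question, the same idea can be combined with a flush after each response so nothing accumulates between iterations. This is a minimal sketch built on the answer above; the per-iteration f.flush() and the newline separator are my additions, not something the answer prescribes:

import requests

# Sketch: stream each response straight into the file, one request at a time.
with open('data.json', 'wb') as f:
    for _ in range(100):
        with requests.get("https://httpbin.org/uuid", stream=True) as r:
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
        f.write(b"\n")   # newline-delimited JSON, as in the question's scripts
        f.flush()        # push buffered bytes out to disk after every response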
Related
I have a local Python file that decodes binary files. The script opens a file in binary mode, reads it into a buffer, and then interprets it. Reading it is simply:
with open(filepath, 'rb') as f:
    buff = f.read()
    read_all(buff)
This works fine locally. Now I'd like to set up an Azure Python job where I can send the file, approx. 100 KB, over an HTTP POST and then read back the interpreted metadata, which my original Python script handles well.
I've removed the read-from-disk step so that I now work with the buffer only.
In my Azure Python job I have the following, triggered by an HttpRequest:
my_data = reader.read_file(req.get_body())
To test the sending side, I've tried the following in Python:
import requests
url = 'http://localhost:7071/api/HttpTrigger'
files = {'file': open('test', 'rb')}
with open('test', 'rb') as f:
    buff = f.read()
r = requests.post(url, files=files) #Try using files
r = requests.post(url, data=buff) #Try using data
I've also tried, in Postman, adding the file to the body as binary and setting the Content-Type header to application/octet-stream.
None of this sends the binary file the same way as the original f.read() did, so I'm getting a wrong interpretation of the binary file.
What is file.read() doing differently from how I'm sending it over in the HTTP body?
Printing out the first line from the locally read file gives:
b'\n\n\xfe\xfe\x00\x00\x00\x00\\\x18,A\x18\x00\x00\x00(\x00\x00\x00\x1f\x00\x00\
Whereas printing it out from req.get_body() shows me:
b'\n\n\xef\xbf\xbd\xef\xbf\xbd\x00\x00\x00\x00\\\x18,A\x18\x00\x00\x00(\x00\x00\x00\x1f\x00\
So something is clearly wrong. Any idea why these could be different?
Thanks
EDIT:
I've implemented a similar function in Flask and it works well.
The code in Flask simply grabs the file from the POST; no encoding/decoding.
if request.method == 'POST':
    f = request.files['file']
    # f.save(secure_filename(f.filename))
    my_data = reader.read_file(f.read())
Why is the Azure Function different?
You can try decoding with UTF-16 and then do the further processing in your code.
Here is the code for that:
with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.rstrip(b"\n").decode("utf-16")
Basically, after calling req.get_body(), perform the operation below:
contents = contents.rstrip(b"\n").decode("utf-16")
See if it gives you the same output as you receive in the local Python file.
Hope it helps.
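For context, here is a minimal sketch of how that suggestion might sit inside the HTTP-triggered function. Handing the decoded string (rather than the raw bytes) to the question's reader.read_file is an assumption, and that reader module isn't shown here, so the sketch just reports the decode result:

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_body()                            # bytes, per the question
    contents = body.rstrip(b"\n").decode("utf-16")   # the suggested decode step
    # ...hand 'contents' to the question's reader.read_file() here...
    return func.HttpResponse(f"decoded {len(contents)} characters")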
I have a data export job that reads data from a REST endpoint and then saves the data in a temporary compressed file before being written to S3. This was working for smaller payloads:
import gzip
import urllib2

# Fails when writing too much data at once
def get_data(url, params, fileobj):
    request = urllib2.urlopen(url, params)
    event_data = request.read()
    with gzip.open(fileobj.name, 'wb') as f:
        f.write(event_data)
However, as the data size increased I got an error that seems to indicate I'm writing too much data at once:
File "/usr/lib64/python2.7/gzip.py", line 241, in write
self.fileobj.write(self.compress.compress(data))
OverflowError: size does not fit in an int
I tried modifying the code to read from the REST endpoint line by line and write each line to the file, but this was incredibly slow, probably because the endpoint isn't set up to handle that.
# Incredibly slow
def get_data(url, params, fileobj):
    request = urllib2.urlopen(url, params)
    with gzip.open(fileobj.name, 'wb') as f:
        for line in request:
            f.write(line)
Is there a more efficient way to do this, such as by reading the entire payload at once, like in the first example, but then efficiently reading line-by-line from the data now residing in memory?
Turns out this is what StringIO is for. By turning my payload into a StringIO object I was able to read from it line-by-line and write to a gzipped file without any errors.
import gzip
import urllib2
from StringIO import StringIO

def get_data(url, params, fileobj):
    request = urllib2.urlopen(url, params)
    event_data = StringIO(request.read())
    with gzip.open(fileobj.name, 'wb') as f:
        for line in event_data:
            f.write(line)
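For anyone doing this on Python 3, the same pattern maps onto io.BytesIO and urllib.request; this is just a sketch of the equivalent, not part of the original answer:

import gzip
from io import BytesIO
from urllib.request import urlopen

def get_data(url, params, fileobj):
    # Read the full payload once, then iterate over it line by line from memory.
    event_data = BytesIO(urlopen(url, params).read())
    with gzip.open(fileobj.name, 'wb') as f:
        for line in event_data:
            f.write(line)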
I am new to Python. Here is my environment setup:
I have Anaconda 3 (Python 3). I would like to be able to download a CSV file from this website:
https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
I would like to use the requests library. I would appreciate any help in figuring out how I can use the requests library to download the CSV file to a local directory on my machine.
It is recommended to download the data as a stream and flush it into the target (or an intermediate local) file.
import requests

def download_file(url, output_file, compressed=True):
    """
    compressed: enable response compression support
    """
    # NOTE the stream=True parameter: it enables chunked, buffered loading of the data.
    headers = {}
    if compressed:
        headers["Accept-Encoding"] = "gzip"

    r = requests.get(url, headers=headers, stream=True)
    with open(output_file, 'wb') as f:  # open for block (binary) writes
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
        f.flush()  # finally, force any buffered data out to the file (optional)
    return output_file
Considering the original post:
remote_csv = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
local_output_file = "test.csv"
download_file(remote_csv, local_output_file)
#Check file content, just for test purposes:
print(open(local_output_file).read())
Base code was extracted from this post: https://stackoverflow.com/a/16696317/176765
Here you can find more detailed information about the body content workflow in the requests lib:
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
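As a small variation (my own sketch, not from the linked docs or post), the response can also be used as a context manager so the connection is released promptly, and raise_for_status() surfaces HTTP errors before anything is written:

import requests

def download_file(url, output_file):
    # Sketch: stream to disk, fail fast on HTTP errors, close the connection when done.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(output_file, 'wb') as f:
            for chunk in r.iter_content(chunk_size=4096):
                f.write(chunk)
    return output_file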
I am downloading multiple CSV files from a website using Python. I would like to be able to check the response code on each request.
I know how to download the file using wget, but not how to check the response code:
os.system('wget http://example.com/test.csv')
I've seen a lot of people suggesting using requests, but I'm not sure that's quite right for my use case of saving CSV files.
r = requests.get('http://example.com/test.csv')
r.status_code # 200
# Pipe response into a CSV file... hm, seems messy?
What's the neatest way to do this?
You can use the stream argument - along with iter_content() it's possible to stream the response contents right into a file (docs):
import requests
r = None
try:
    r = requests.get('http://example.com/test.csv', stream=True)
    with open('test.csv', 'wb') as f:
        for data in r.iter_content():
            f.write(data)
finally:
    if r is not None:
        r.close()
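Since the question is specifically about checking the response code, the status check can go right before the file is written. A sketch layered on the answer above (the explicit chunk_size and the else branch are my additions):

import requests

with requests.get('http://example.com/test.csv', stream=True) as r:
    if r.status_code == 200:                  # or: r.raise_for_status()
        with open('test.csv', 'wb') as f:
            for data in r.iter_content(chunk_size=8192):
                f.write(data)
    else:
        print('Download failed with status code', r.status_code)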
I have a Pylons controller action that needs to return a file to the client. (The file is outside the web root, so I can't just link directly to it.) The simplest way is, of course, this:
with open(filepath, 'rb') as f:
    response.write(f.read())
That works, but it's obviously inefficient for large files. What's the best way to do this? I haven't been able to find any convenient methods in Pylons to stream the contents of the file. Do I really have to write the code to read a chunk at a time myself from scratch?
The correct tool to use is shutil.copyfileobj, which copies from one file object to the other a chunk at a time.
Example usage:
import shutil
with open(filepath, 'rb') as f:
    shutil.copyfileobj(f, response)
This will not result in very large memory usage, and does not require implementing the code yourself.
The usual care with exceptions should be taken - if you handle signals (such as SIGCHLD) you have to handle EINTR because the writes to response could be interrupted, and IOError/OSError can occur for various reasons when doing I/O.
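If the copy granularity ever needs tuning, copyfileobj also accepts a length argument that sets its internal chunk size. A minimal sketch, assuming response behaves as a writable file-like object as in the example above:

import shutil

with open(filepath, 'rb') as f:
    # Copy in 64 KB chunks instead of the default buffer size.
    shutil.copyfileobj(f, response, length=64 * 1024)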
I finally got it to work using the FileApp class, thanks to Chris AtLee and THC4k (from this answer). This method also allowed me to set the Content-Length header, something Pylons has a lot of trouble with, which enables the browser to show an estimate of the time remaining.
Here's the complete code:
def _send_file_response(self, filepath):
    user_filename = '_'.join(filepath.split('/')[-2:])
    file_size = os.path.getsize(filepath)
    headers = [('Content-Disposition', 'attachment; filename="' + user_filename + '"'),
               ('Content-Type', 'text/plain'),
               ('Content-Length', str(file_size))]
    from paste.fileapp import FileApp
    fapp = FileApp(filepath, headers=headers)
    return fapp(request.environ, self.start_response)
The key here is that WSGI, and Pylons by extension, works with iterable responses. So you should be able to write some code like this (warning, untested code below!):
def file_streamer():
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(4096)
            if not block:
                break
            yield block

response.app_iter = file_streamer()
Also, paste.fileapp.FileApp is designed to be able to return file data for you, so you can also try:
return FileApp(filepath)
in your controller method.