JSON Lines issue when loading from import.io using Python

I'm having a hard time loading an API response from import.io into a file or a list.
The endpoint I'm using is https://data.import.io/extractor/{0}/json/latest?_apikey={1}
Previously all my scripts were set up to use plain JSON and everything worked well, but they have now switched to JSON Lines, and somehow the response seems malformed.
I adapted my scripts to read the API response in the following way:
url_call = 'https://data.import.io/extractor/{0}/json/latest?_apikey={1}'.format(extractors_row_dict['id'], auth_key)
r = requests.get(url_call)

with open(temporary_json_file_path, 'w') as outfile:
    json.dump(r.content, outfile)

data = []
with open(temporary_json_file_path) as f:
    for line in f:
        data.append(json.loads(line))
The problem is that when I check data[0], the entire file content has been dumped into it, and data[1] raises IndexError: list index out of range.
Here is an example of data[0][:300]:
u'{"url":"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de","result":{"extractorData":{"url":"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de","resourceId":"23455234","data":[{"group":[{"Brand":[{"text":"Brand","href":"https://www.example.com'
Does anyone have experience with this API's responses?
All the other JSON Lines reads I do from other sources work fine, except this one.
EDIT based on comment:
print repr(open(temporary_json_file_path).read(300))
gives this:
'"{\\"url\\":\\"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de\\",\\"result\\":{\\"extractorData\\":{\\"url\\":\\"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de\\",\\"resourceId\\":\\"df8de15cede2e96fce5fe7e77180e848\\",\\"data\\":[{\\"group\\":[{\\"Brand\\":[{\\"text\\":\\"Bra'

You've got a bug in your code: you are double encoding. r.content is already a JSON Lines string, so passing it to json.dump serializes it a second time as one big JSON string, which is why the repr shows escaped quotes and why the whole body parses as a single element:
with open(temporary_json_file_path, 'w') as outfile:
    json.dump(r.content, outfile)
Try:
with open(temporary_json_file_path, 'w') as outfile:
    outfile.write(r.content)
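If the temporary file isn't actually needed, here is a minimal sketch that parses the JSON Lines body straight off the response (extractors_row_dict and auth_key as in the question):
import json
import requests

url_call = 'https://data.import.io/extractor/{0}/json/latest?_apikey={1}'.format(
    extractors_row_dict['id'], auth_key)
r = requests.get(url_call)

# JSON Lines: one JSON document per non-empty line of the response body.
data = [json.loads(line) for line in r.text.splitlines() if line.strip()]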

Related

Immediate write JSON API response to file with Python Requests

I am trying to retrieve data from an API and immediately write the JSON response directly to a file, without holding any part of the response in memory. The reason for this requirement is that I'm executing this script on an AWS EC2 Linux instance with only 2GB of memory; if I try to hold everything in memory and write the responses to a file afterwards, the process fails for lack of memory.
I've tried using f.write() as well as sys.stdout.write(), but both approaches seemed to write the file only after all the queries were executed. This worked with my small example, but it doesn't work with my actual data.
The problem with both approaches below is that the file doesn't populate until the loop is complete. That won't work for my actual process, since the machine doesn't have enough memory to hold all the responses.
How can I adapt either of the approaches below, or come up with something new, to write the data received from the API to a file immediately, without keeping anything in memory?
Note: I'm using Python 3.7 but happy to update if there is something that would make this easier.
My Approach 1
# script1.py
import requests
import json

with open('data.json', 'w') as f:
    for i in range(0, 100):
        r = requests.get("https://httpbin.org/uuid")
        data = r.json()
        f.write(json.dumps(data) + "\n")
f.close()
My Approach 2
# script2.py
import requests
import json
import sys

for i in range(0, 100):
    r = requests.get("https://httpbin.org/uuid")
    data = r.json()
    sys.stdout.write(json.dumps(data))
    sys.stdout.write("\n")
With approach 2, I tried using > to redirect the output to a file:
python script2.py > data.json
You can use response.iter_content with stream=True to download the content in chunks. For example:
import requests

url = 'https://httpbin.org/uuid'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('data.json', 'wb') as f_out:
        for chunk in r.iter_content(chunk_size=8192):
            f_out.write(chunk)
Saves data.json with content:
{
  "uuid": "991a5843-35ca-47b3-81d3-258a6d4ce582"
}
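If the real symptom is only that the file stays empty until the loop finishes, that is write buffering rather than memory use; a hedged variant of approach 1 that flushes after every record (same httpbin endpoint as above):
import json
import requests

with open('data.json', 'w') as f:
    for _ in range(100):
        r = requests.get("https://httpbin.org/uuid")
        f.write(json.dumps(r.json()) + "\n")
        f.flush()  # push each record from Python's buffer to the OS right away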

Python: Read .JS file without missing any data

I'm trying to read a .js file, but when I do, some lines go missing compared to the original file. Can anyone please help?
The code I tried is
import json

file_path = 'C:/Users/smith/Desktop/inject.bundle.js'
with open(file_path, 'r', encoding='utf-8') as dataFile:
    data = dataFile.read()
print(data)
The js file I tried is here.
Why not download the js file with a REST call?
import requests

res = requests.get("https://raw.githubusercontent.com/venkatesanramasekar/jstesting/master/inject.bundle.js")
data = res.text

# To save it to a file:
with open("inject.bundle.js", "w") as f:
    f.write(data)
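Since the original complaint was missing lines, a quick hedged check on the saved file above (comparing line counts instead of eyeballing console output, which terminals often truncate):
# If the line count matches the source file, nothing was lost and the
# "missing" lines were likely just console truncation.
with open("inject.bundle.js", "r", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
print(line_count)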

Read JSON file correctly

I am trying to read a JSON file (BioRelEx dataset: https://github.com/YerevaNN/BioRelEx/releases/tag/1.0alpha7) in Python. The JSON file is a list of objects, one per sentence.
This is how I try to do it:
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        for line in data_file.readlines():
            if not line:
                continue
            items = json.loads(line)
            text = items["text"]
            label = items.get("label")
My code is failing on items = json.loads(line). It looks like the data is not formatted as the code expects it to be, but how can I change it?
Thanks in advance for your time!
Best,
Julia
With json.load() you don't need to read each line; you can do either of these:
import json

def open_json(path):
    with open(path, 'r') as file:
        return json.load(file)

data = open_json('./1.0alpha7.dev.json')
Or, even cooler, you can GET request the json from GitHub
import requests

url = 'https://github.com/YerevaNN/BioRelEx/releases/download/1.0alpha7/1.0alpha7.dev.json'
response = requests.get(url)
data = response.json()
Both of these give the same output: data will be a list of dictionaries that you can iterate over in a for loop for further processing.
Your code reads one line at a time and parses each line individually as JSON. Unless the creator of the file deliberately wrote it in that format (unlikely, given the .json extension), this won't work: JSON does not use line breaks to mark the end of an object.
Load the whole file content as JSON instead, then process the items in the resulting array.
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        data = json.load(data_file)
        for item in data:
            text = item["text"]
label appears to be buried in item["interaction"]
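A hedged sketch of that last point, assuming each item carries an "interaction" list whose entries have a "label" key (an assumption inferred from the note above, not verified against the BioRelEx files):
import json

# file_path as in the question; the "interaction" layout is an assumption.
with open(file_path, "r") as data_file:
    data = json.load(data_file)

for item in data:
    text = item["text"]
    labels = [inter.get("label") for inter in item.get("interaction", [])]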

Passing Binary file over HTTP POST

I have a local Python script that decodes binary files. It opens the file as binary, reads it into a buffer, and interprets it. Reading it is simply:
with open(filepath, 'rb') as f:
    buff = f.read()
    read_all(buff)
This works fine locally. Now I'd like to set up an Azure Python job where I can send the file (approx. 100kB) over an HTTP POST and get back the interpreted metadata that my original Python script produces.
I've removed the read function, so I now work with the buffer only.
In my Azure Python job I have the following, triggered by an HttpRequest:
my_data = reader.read_file(req.get_body())
To test the sending side I've tried the following in Python:
import requests

url = 'http://localhost:7071/api/HttpTrigger'

files = {'file': open('test', 'rb')}
with open('test', 'rb') as f:
    buff = f.read()

r = requests.post(url, files=files)  # try using files
r = requests.post(url, data=buff)  # try using data
I've also tried Postman, adding the file to the body as binary and setting the Content-Type header to application/octet-stream.
None of this sends the binary file the way the original f.read() did, so I'm getting a wrong interpretation of the binary file.
What does file.read() do differently from how I'm sending the file in the HTTP body?
Printing the first bytes read by the local Python script gives:
b'\n\n\xfe\xfe\x00\x00\x00\x00\\\x18,A\x18\x00\x00\x00(\x00\x00\x00\x1f\x00\x00\
Whereas printing req.get_body() shows:
b'\n\n\xef\xbf\xbd\xef\xbf\xbd\x00\x00\x00\x00\\\x18,A\x18\x00\x00\x00(\x00\x00\x00\x1f\x00\
So something is clearly wrong. Any idea why these could be different?
Thanks
EDIT:
I've implemented a similar function in Flask and it works well.
The Flask code simply grabs the file from the POST, with no encoding or decoding:
if request.method == 'POST':
    f = request.files['file']
    # f.save(secure_filename(f.filename))
    my_data = reader.read_file(f.read())
Why is the Azure Function different?
You can try decoding as UTF-16 and then doing the further processing in your code.
Here is the code for that:
with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.rstrip(b"\n").decode("utf-16")
Basically, after calling req.get_body(), perform the operation below:
contents = contents.rstrip(b"\n").decode("utf-16")
See if that gives you the same output as you receive in the local Python file.
Hope it helps.
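For the sending side, a hedged sketch that posts the raw bytes with an explicit Content-Type, mirroring the Postman attempt in the question (url and test file as in the question; whether the Azure host leaves such a body untouched is worth verifying):
import requests

url = 'http://localhost:7071/api/HttpTrigger'

with open('test', 'rb') as f:
    buff = f.read()

# Send the raw bytes untouched; the explicit Content-Type asks the server
# not to treat the body as text.
r = requests.post(url, data=buff,
                  headers={'Content-Type': 'application/octet-stream'})
print(r.status_code)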

Python: Download CSV file, check return code?

I am downloading multiple CSV files from a website using Python. I would like to be able to check the response code on each request.
I know how to download the file using wget, but not how to check the response code:
os.system('wget http://example.com/test.csv')
I've seen a lot of people suggesting requests, but I'm not sure that's quite right for my use case of saving CSV files.
r = requests.get('http://example.com/test.csv')
r.status_code  # 200
# Pipe response into a CSV file... hm, seems messy?
What's the neatest way to do this?
You can use the stream argument - along with iter_content() it's possible to stream the response contents right into a file (docs):
import requests

r = None
try:
    r = requests.get('http://example.com/test.csv', stream=True)
    with open('test.csv', 'wb') as f:  # binary mode: iter_content yields bytes
        for data in r.iter_content(chunk_size=8192):
            f.write(data)
finally:
    if r is not None:
        r.close()
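Since the question specifically asks about checking the response code, a hedged sketch that combines the streaming download with a status check (same example URL; raise_for_status turns 4xx/5xx responses into exceptions):
import requests

with requests.get('http://example.com/test.csv', stream=True) as r:
    r.raise_for_status()  # raises on 4xx/5xx; or inspect r.status_code yourself
    with open('test.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)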
