From gzip to json to dataframe to csv - python

I am trying to get some data from an open API:
https://data.brreg.no/enhetsregisteret/api/enheter/lastned
but I am having difficulties understanding the different types of objects and the order the conversions should be in. Is it strings to bytes, is it BytesIO or StringIO, is it decode('utf-8') or decode('unicode'), etc.?
So far:
url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
and this is where I am stuck; how should I write the next line of code?
json_str = json.loads(decompressed_file.read().decode('utf-8'))
My workaround: if I write it out as a JSON file, read it back in, and then do the transformation to a DataFrame, it works:
with io.open('brreg.json', 'wb') as f:
    f.write(decompressed_file.read())
with open('brreg.json', encoding='utf-8') as fin:
    d = json.load(fin)
df = json_normalize(d)
with open('brreg_2.csv', 'w', encoding='utf-8', newline='') as fout:
    fout.write(df.to_csv())
I found many SO posts about it, but I am still confused. The first one explains it quite well, but I still need some spoon-feeding.
Python 3, read/write compressed json objects from/to gzip file
TypeError when trying to convert Python 2.7 code to Python 3.4 code
How can I create a GzipFile instance from the “file-like object” that urllib.urlopen() returns?
JSONDecodeError: Expecting value: line 1 column 1 (char 0)

It works fine for me using the decompress function rather than the GzipFile class to decompress the file, but I'm not sure why yet...
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())

decompressed_file = gzip.decompress(compressed_file.read())
json_str = json.loads(decompressed_file.decode('utf-8'))
EDIT, in fact the following also works fine for me which appears to be your exact code...
(Further edit, turns out it's not quite your exact code because your final line was outside the with block which meant response was no longer open when it was needed - see comment thread)
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    json_str = json.loads(decompressed_file.read().decode('utf-8'))
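To round the thread off, here is a minimal end-to-end sketch of the whole gzip to JSON to DataFrame to CSV pipeline. It assumes pandas 1.0+ (where json_normalize is exposed as pd.json_normalize; on older versions it lives in pandas.io.json) and that the endpoint returns a gzipped JSON document:
import gzip
import io
import json
import urllib.request

import pandas as pd

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    # buffer the compressed body so it outlives the response object
    compressed_file = io.BytesIO(response.read())

# GzipFile yields bytes, so decode explicitly before parsing the JSON
with gzip.GzipFile(fileobj=compressed_file) as decompressed_file:
    records = json.loads(decompressed_file.read().decode('utf-8'))

df = pd.json_normalize(records)
df.to_csv('brreg.csv', index=False, encoding='utf-8')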

Related

read json file with german characters python 2 (ironpython)

I am working with IronPython, so Python 2, and to read the .json file with German characters I am using encoding='utf-8', but I get the following error: open() got an unexpected keyword argument 'encoding'.
Here an example of the code:
def get_data(self):
    # Open and read data from json file into dict
    with open(self.file, encoding='utf-8') as j_file:
        data_dict = json.load(j_file)
    return data_dict
Python 2.x's built-in open() doesn't support the encoding parameter. You must import the io module to specify the encoding:
open Function - pythontips
import io

def get_data(self):
    # Open and read data from json file into dict
    with io.open(self.file, encoding='utf-8') as j_file:
        data_dict = json.load(j_file)
    return data_dict
Python 2's built-in open also does not support an encoding argument directly.
As seen here: https://stackoverflow.com/a/30700186/12711820
You can use something like this:
import codecs
f = codecs.open(fileName, 'r', encoding='utf-8', errors='ignore')
The encoding argument to open was added in Python 3. If you want to specify an encoding in Python 2.x, use the following:
import io

f = io.open('file.json', encoding='utf-8')
# Do stuff with f
f.close()
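Since io.open accepts an encoding argument on both Python 2 and Python 3, the same helper can stay portable across both. A small sketch, assuming a hypothetical file.json containing German characters:
# -*- coding: utf-8 -*-
import io
import json

def get_data(path):
    # io.open takes encoding= on Python 2 (incl. IronPython) and Python 3 alike
    with io.open(path, encoding='utf-8') as j_file:
        return json.load(j_file)

data = get_data('file.json')  # e.g. {"stadt": "München"}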

Compressing a list of json as Gzip using a file object

I have to compress a list of dictionaries using gzip and send it as a request parameter to a backend API. While using the gzip library I couldn't send the JSON because every time I got a bad request from the server. Here is an example of the data:
[{'name':'John'}, {'name': 'Clark'}]
First I tried version 1 of the code below, but I always get a
TypeError: string argument expected, got 'bytes'
So I tried version 2, which returns a bad request. There is another similar answer, but I need to do this with a file object. Could you point out where I am going wrong?
Version 1
import gzip
import json
import io
inp1 = {"name": "John"}
inp2 = {"name": "Clark"}
final_inp = []
final_inp.append(inp1)
final_inp.append(inp2)
print(final_inp)
data = json.dumps(final_inp)
out = io.StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(data)
res = out.getvalue()
print(res)
Version 2
# same imports and declarations as version 1
# this code works, but I have to do it using version 1.
data = json.dumps(final_inp)
out = io.BytesIO()
with gzip.GzipFile(fileobj=out, mode="wb") as f:
    f.write(data.encode())
res = out.getvalue()
print(res)
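The TypeError in version 1 comes from the StringIO: GzipFile always writes compressed bytes to its underlying fileobj, and a StringIO only accepts str, so the buffer underneath has to be a BytesIO, exactly as in version 2. A sketch of the working pattern, with a hypothetical endpoint and headers (what your backend actually expects may differ):
import gzip
import io
import json

final_inp = [{"name": "John"}, {"name": "Clark"}]

# GzipFile emits bytes, so the underlying buffer must be binary
out = io.BytesIO()
with gzip.GzipFile(fileobj=out, mode="wb") as f:
    f.write(json.dumps(final_inp).encode("utf-8"))
payload = out.getvalue()

# Hypothetical upload; the header names your backend expects may differ
# import requests
# requests.post("https://backend.example.com/upload", data=payload,
#               headers={"Content-Encoding": "gzip",
#                        "Content-Type": "application/json"})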

How to Replace \n, b and single quotes from Raw File from GitHub?

I am trying to download a file from GitHub (raw file) and then run this file as a .sql file.
import snowflake.connector
from codecs import open
import logging
import requests
from os import getcwd
import os
import sys
# logging
logging.basicConfig(
    filename='C:/Users/abc/Documents/Test.log',
    level=logging.INFO
)
url = "https://github.com/raw/abc/master/file_name?token=Anvn3lJXDks5ciVaPwA%3D%3D"
directory = getcwd()
filename = os.path.join(getcwd(),'VIEWS.SQL')
r = requests.get(url)
filename.decode("utf-8")
f = open(filename,'w')
f.write(str(r.content))
with open(filename,'r') as theFile, open(filename,'w') as outFile:
    data = theFile.read().split('\n')
    data = theFile.read().replace('\n','')
    data = theFile.read().replace("b'","")
    data = theFile.read()
    outFile.write(data)
However I get this error
syntax error line 1 at position 0 unexpected 'b'
My converted SQL file has a b at the beginning and a bunch of newline \n characters in the file. Also the entire output file is wrapped in single quotes, 'text'. Can anyone help me get rid of these? It looks like replace isn't working.
OS: Windows
Python Version: 3.7.0
You introduced a b'.. prefix by converting the response.content bytes value to a string with str():
>>> import requests
>>> r = requests.get("https://github.com/raw/abc/master/file_name?token=Anvn3lJXDks5ciVaPwA%3D%3D")
>>> r.content
b'Not Found'
>>> str(r.content)
"b'Not Found'"
Of course, the specific dummy URL you gave in your question produces a 404 Not Found response, hence the Not Found content of the response body:
>>> r.status_code
404
so the contents in this demonstration are not actually all that useful. However, even for your real URL you probably want to test for a 200 status code before moving to write the data to a file!
What is going wrong in the above is that str(bytesvalue) converts a bytes object to its representation. You'd normally want to decode a bytes value with a text codec, using the bytes.decode() method. But because you are writing the data to a file here, you should instead just open the file in binary mode and write the bytes object without decoding:
r = requests.get(url)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(r.content)
The 'wb' mode opens the file for writing in binary mode. Writing binary content to a binary file is the most efficient; decoding it first then writing to a text file requires that it is encoded again. Better to avoid doing double work.
As a side note: there is no need to join a local filename with getcwd(); relative paths always end up in the current working directory, and otherwise it's better to use os.path.abspath(filename).
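For instance (using the VIEWS.SQL filename from the question):
import os

filename = 'VIEWS.SQL'                # a relative path resolves against the CWD anyway
abs_path = os.path.abspath(filename)  # explicit absolute path, if you really need one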
You could also trust that GitHub sets the correct character set in the Content-Type headers and have the response decode the value to str for you in the form of the response.text attribute:
r = requests.get(url)
if r.status_code == 200:
    with open(filename, 'w') as f:
        f.write(r.text)
but again, that's really doing extra work for nothing, first decoding the binary content from the request, then encoding again when writing to a text file.
Finally, for larger file responses it is better to stream the data and copy it directly to a file. The shutil.copyfileobj() function can take a raw response fileobject directly, provided you enable transparent transport decompression:
import shutil

r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        # enable transparent transport decompression handling
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
Depending on your version of Python/OS it could be as simple as writing the file in binary first and then, if the stray characters are still there, chaining the replaces over the decoded text:
with open(filename, 'wb') as outFile:
    outFile.write(r.content)
with open(filename, 'r') as theFile:
    data = theFile.read().replace('\\n', '\n').replace("b'", "")
with open(filename, 'w') as outFile:
    outFile.write(data)
It would help to have a copy of the file and the line the error is occurring on.

How to download a csv file in python from a server?

from pip._vendor import requests
import csv
url = 'https://docs.google.com/spreadsheets/abcd'
dataReader = csv.reader(open(url), delimiter=',', quotechar='"')
exampleData = list(dataReader)
exampleData
Use Python Requests.
import requests
r = requests.get(url)
lines = r.text.splitlines()
We use splitlines to turn the text into an iterable, like a file handle. You should probably wrap it up in a try/except block in case of errors.
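For instance, a minimal sketch of that pattern, reusing the placeholder spreadsheet URL from the question:
import csv
import requests

url = 'https://docs.google.com/spreadsheets/abcd'
try:
    r = requests.get(url)
    r.raise_for_status()  # raise on 4xx/5xx responses
except requests.RequestException as e:
    print('Download failed:', e)
else:
    dataReader = csv.reader(r.text.splitlines(), delimiter=',', quotechar='"')
    exampleData = list(dataReader)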
You need to use something like urllib2 to retrieve the file.
For example:
import urllib2
import csv
csvfile = urllib2.urlopen('https://docs.google.com/spreadsheets/abcd')
dataReader = csv.reader(csvfile, delimiter=',', quotechar='"')
do_stuff(dataReader)
You can import urllib.request and then simply call data_stream = urllib.request.urlopen(url) to get a buffer of the file. You can then save the csv data as data = str(data_stream.read()), which may be a bit unclean depending on your source or encoding, so you may need to do some manipulation; if not, you can just throw it into csv.reader(data, delimiter=',').
An example requiring translating from byte format that may work for you:
data = urllib.request.urlopen(url)
data_csv = str(data.read())
# split off the b' flag from the string, then also split at newlines up to the last one
dataReader = csv.reader(data_csv.split("b'", 1)[1].split("\\n")[:-1], delimiter=",")
headers = dataReader.__next__()
exampleData = list(dataReader)

Open a file from urlfetch in GAE

I'm trying to open a file in GAE that was retrieved using urlfetch().
Here's what I have so far:
from google.appengine.api import urlfetch
result = urlfetch.fetch('http://example.com/test.txt')
data = result.content
## f = open(...) <- what goes in here?
This might seem strange but there's a very similar function in the BlobStore that can write data to a blobfile:
f = files.blobstore.create(mime_type='txt', _blobinfo_uploaded_filename='test')
with files.open(f, 'a') as data:
    data.write(result.content)
How can I write data into an arbitrary file object?
Edit: I should have been clearer; I'm trying to urlfetch any file and open result.content in a file object, so it might be a .doc instead of a .txt.
You can use the StringIO module to emulate a file object using the contents of your string.
from google.appengine.api import urlfetch
from StringIO import StringIO
result = urlfetch.fetch('http://example.com/test.txt')
f = StringIO(result.content)
You can then read() from the f object or use other file object methods like seek(), readline(), etc.
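For example, treating the fetched content like any other readable file:
first_line = f.readline()  # read a single line
f.seek(0)                  # rewind to the beginning
whole_body = f.read()      # read the entire body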
You do not have to open a file. You have already received the text data in data = result.content.
