Read a JSON file with German characters in Python 2 (IronPython)

I am working with IronPython (so Python 2), and to read a .json file with German characters I am using encoding='utf-8', but I get the following error: open() got an unexpected keyword argument 'encoding'.
Here is an example of the code:
def get_data(self):
    # Open and read data from json file into dict
    with open(self.file, encoding='utf-8') as j_file:
        data_dict = json.load(j_file)
    return data_dict

Python 2.x's built-in open() doesn't support the encoding parameter. You must use the io module's open() to specify the encoding:
open Function - pythontips
import io
import json

def get_data(self):
    # Open and read data from json file into dict
    with io.open(self.file, encoding='utf-8') as j_file:
        data_dict = json.load(j_file)
    return data_dict

Python 2's built-in open also does not support the encoding argument directly.
As seen here: https://stackoverflow.com/a/30700186/12711820
You can use something like this:
import codecs
f = codecs.open(fileName, 'r', encoding='utf-8', errors='ignore')
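For the question's actual task, a minimal sketch combining codecs.open with json.load; the file name here is hypothetical:

import codecs
import json

with codecs.open('data.json', 'r', encoding='utf-8', errors='ignore') as j_file:
    data_dict = json.load(j_file)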

The encoding argument to open() was added in Python 3. To specify an encoding in Python 2.x, use the following:
import io
f = io.open('file.json', encoding='utf-8')
# Do stuff with f
f.close()
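The same thing wrapped in a with block, so the close() call can't be forgotten; a sketch reusing the question's json.load step:

import io
import json

with io.open('file.json', encoding='utf-8') as f:
    data = json.load(f)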

Related

From gzip to json to dataframe to csv

I am trying to get some data from an open API:
https://data.brreg.no/enhetsregisteret/api/enheter/lastned
but I am having difficulties understanding the different types of objects and the order the conversions should be in. Is it strings to bytes, is it BytesIO or StringIO, is it decode('utf-8') or decode('unicode'), etc.?
So far:
url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
and this is where I am stuck; how should I write the next line of code?
json_str = json.loads(decompressed_file.read().decode('utf-8'))
My workaround is to write it out as a JSON file, read it back in, and then do the transformation to a DataFrame; that works:
with io.open('brreg.json', 'wb') as f:
    f.write(decompressed_file.read())
with open(f_path, encoding='utf-8') as fin:
    d = json.load(fin)
df = json_normalize(d)
with open('brreg_2.csv', 'w', encoding='utf-8', newline='') as fout:
    fout.write(df.to_csv())
I found many SO posts about it, but I am still confused. The first one explains it quite well, but I still need some spoon-feeding.
Python 3, read/write compressed json objects from/to gzip file
TypeError when trying to convert Python 2.7 code to Python 3.4 code
How can I create a GzipFile instance from the “file-like object” that urllib.urlopen() returns?
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
It works fine for me using the decompress function rather than the GzipFile class to decompress the file, but I'm not sure why yet...
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.decompress(compressed_file.read())
    json_str = json.loads(decompressed_file.decode('utf-8'))
EDIT: in fact, the following also works fine for me, and appears to be your exact code...
(Further edit: it turns out it's not quite your exact code, because your final line was outside the with block, which meant response was no longer open when it was needed - see the comment thread.)
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    json_str = json.loads(decompressed_file.read().decode('utf-8'))
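To finish the gzip-to-DataFrame-to-CSV pipeline without the intermediate file, a sketch building on json_str from the block above; pd.json_normalize assumes pandas >= 1.0 (older versions import json_normalize from pandas.io.json), and the output file name is illustrative:

import pandas as pd

df = pd.json_normalize(json_str)  # json_str is the parsed object from above
df.to_csv('brreg_2.csv', index=False, encoding='utf-8')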

Avoiding UnicodeEncodeError in Python

I tried to parse an HTML table into CSV using Python with the following script:
from bs4 import BeautifulSoup
import requests
import csv

csvFile = open('log.csv', 'w', newline='')
writer = csv.writer(csvFile)

def parse():
    html = requests.get('https://en.wikipedia.org/wiki/Comparison_of_text_editors')
    bs = BeautifulSoup(html.text, 'lxml')
    table = bs.select_one('table.wikitable')
    rows = table.select('tr')
    for row in rows:
        csvRow = []
        for cell in row.findAll(['th', 'td']):
            csvRow.append(cell.getText())
        writer.writerow(csvRow)
        print(csvRow)

parse()
csvFile.close()
This code output a cleanly formatted CSV file with no encoding issues.
All was fine until it reached Enrico Tröger's Geany: the script was unable to write ö into the CSV file, so I tried this:
csvRow.append(cell.text.encode('ascii', 'replace')) instead of: csvRow.append(cell.getText())
That worked, except that each table cell was then wrapped in b''. So, how can I get a cleanly formatted CSV file without encoding issues (like in the first screenshot), with all non-ASCII symbols replaced or ignored (like in the second screenshot), using my script?
Change this one:
csvFile = open('log.csv', 'w', newline='')
To this one:
csvFile = open('log.csv', 'w', newline='', encoding='utf8')
csv module documentation:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
I suppose your system default encoding is not utf8.
You can check it like this:
import locale
locale.getpreferredencoding()
Hope it helps!
Looks like the csv module expects strings, not bytes, so you could decode your bytes before passing them:
cell.text.encode('ascii', 'replace').decode('ascii')
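In context, that looks like the following sketch: the cells stay str, and any non-ASCII character such as ö is replaced with ?:

for cell in row.findAll(['th', 'td']):
    csvRow.append(cell.getText().encode('ascii', 'replace').decode('ascii'))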

how to clean a JSON file and store it to another file in Python

I am trying to read a JSON file with Python. This file is described by the authors as not strict JSON. In order to convert it to strict JSON, they suggest this approach:
import gzip
import json

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield json.dumps(eval(l))
However, not being familiar with Python, I am able to execute the script but not able to produce any output file with the new clean JSON. How should I modify the script in order to produce a new JSON file? I have tried this:
import json

class Amazon():
    def parse(self, inpath, outpath):
        g = open(inpath, 'r')
        out = open(outpath, 'w')
        for l in g:
            yield json.dumps(eval(l), out)

amazon = Amazon()
amazon.parse("original.json", "cleaned.json")
but the output is an empty file. Any help is more than welcome.
Your parse is a generator, so calling amazon.parse(...) without iterating it never executes the body; also, json.dumps does not write to a file (that is json.dump). Write to the output file directly instead:

import json

class Amazon():
    def parse(self, inpath, outpath):
        g = open(inpath, 'r')
        with open(outpath, 'w') as fout:
            for l in g:
                fout.write(json.dumps(eval(l)) + '\n')  # newline keeps one JSON object per line

amazon = Amazon()
amazon.parse("original.json", "cleaned.json")
Another, shorter way of doing this:

import json

class Amazon():
    def parse(self, readpath, writepath):
        with open(readpath) as g, open(writepath, 'w') as fout:
            for l in g:
                json.dump(eval(l), fout)
                fout.write('\n')  # one object per line

amazon = Amazon()
amazon.parse("original.json", "cleaned.json")
When handling JSON data it is better to use the json module: json.dump(obj, output_file) to write JSON to a file and json.load(file_obj) to load it back. That way the data stays valid JSON while saving and reading. For very large amounts of data, say 1k+ records, consider the pandas module.
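For the pandas route, a sketch that reads the cleaned file in memory-bounded chunks; the file name and chunk size are illustrative, and chunksize requires lines=True (one JSON object per line):

import pandas as pd

for chunk in pd.read_json('cleaned.json', lines=True, chunksize=10000):
    print(chunk.shape)  # replace with real per-chunk processing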

function for loading both strings and files on disk?

I have a design question. I have a function loadImage() for loading an image file. It currently accepts a string which is a file path, but I also want to be able to load files which are not on a physical disk, e.g. generated procedurally. I could have it accept a string, but then how would it know whether the string is a file path or file data? I could add an extra boolean argument to specify that, but that doesn't sound very clean. Any ideas?
It's something like this now:
def loadImage(filepath):
    file = open(filepath, 'rb')
    data = file.read()
    # do stuff with data
The other version would be
def loadImage(data):
    # do stuff with data
How to have this function accept both 'filepath' or 'data' and guess what it is?
You can change your loadImage function to expect an opened file-like object, such as:
def load_image(f):
    data = f.read()
... and then have that called from two functions, one of which expects a path and the other a string that contains the data:
from StringIO import StringIO

def load_image_from_path(path):
    with open(path, 'rb') as f:
        load_image(f)

def load_image_from_string(s):
    sio = StringIO(s)
    try:
        load_image(sio)
    finally:
        sio.close()
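From the caller's side, usage of the two wrappers would then look like this (the path and the variable are hypothetical):

load_image_from_path('/tmp/photo.png')   # image stored on disk
load_image_from_string(generated_data)   # image bytes built in memory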
How about just creating two functions, loadImageFromString and loadImageFromFile?
This being Python, you can easily distinguish between a filename and a data string. I would do something like this:
import os.path as P
from StringIO import StringIO

def load_image(im):
    fin = None
    if P.isfile(im):
        fin = open(im, 'rb')
    else:
        fin = StringIO(im)
    # Read from fin like you would from any open file object
Other ways to do it would be a try block instead of using os.path, but the essence of the approach remains the same.
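A sketch of that try-based variant; the exception list is a best guess (in Python 2, open raises IOError for a missing file and TypeError/ValueError for strings with embedded NUL bytes):

from StringIO import StringIO

def load_image(im):
    try:
        fin = open(im, 'rb')        # treat the string as a path first
    except (IOError, TypeError, ValueError):
        fin = StringIO(im)          # fall back to treating it as raw data
    # Read from fin like you would from any open file object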

Python encoding conversion

I wrote a Python script that processes CSV files with non-ASCII characters, encoded in UTF-8. However, the encoding of the output is broken. So, from this in the input:
"d\xc4\x9bjin hornictv\xc3\xad"
I get this in the output:
"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"
Can you suggest where the encoding error might come from? Have you seen similar behaviour previously?
EDIT: I'm using the csv standard library with the UnicodeWriter class featured in the docs, on Python version 2.6.6.
EDIT 2: The code to reproduce the behaviour:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
from pymarc import MARCReader  # The pymarc package, available on PyPI: http://pypi.python.org/pypi/pymarc/2.71
from UnicodeWriter import UnicodeWriter  # The UnicodeWriter from: http://docs.python.org/library/csv.html

def getRow(tag, record):
    if record[tag].is_control_field():
        row = [tag, record[tag].value()]
    else:
        row = [tag] + record[tag].subfields
    return row

inputFile = open("input.mrc", "r")
outputFile = open("output.csv", "wb")

reader = MARCReader(inputFile, to_unicode=True)
writer = UnicodeWriter(outputFile, delimiter=",", quoting=csv.QUOTE_MINIMAL)

for record in reader:
    if bool(record["001"]):
        tags = [field.tag for field in record.get_fields()]
        tags.sort()
        for tag in tags:
            writer.writerow(getRow(tag, record))

inputFile.close()
outputFile.close()
The input data is available here (large file).
It seems adding the force_utf8=True argument to the MARCReader constructor solved the problem:
reader = MARCReader(inputFile, to_unicode=True, force_utf8=True)
According to inspection of the source code (via inspect), it does something like:
string.decode("utf-8", "strict")
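One way to reproduce that inspection, assuming pymarc is importable (this is a sketch, not part of the original answer):

import inspect
import pymarc

print(inspect.getsource(pymarc.MARCReader))  # dump the class source to look for the decode call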
You can try to open the file with UTF-8 encoding:
import codecs
codecs.open('myfile.txt', encoding='utf8')
