Formatting json data fetched from URL by removing escape character - python

I have fetched JSON data from a URL and written it to a file named urljson.json.
I want to format the JSON data by removing the '\' characters and the "result" [] key, as my requirements call for.
In my JSON file the data is arranged like this:
{\"result\":[{\"BldgID\":\"1006AVE \",\"BldgName\":\"100-6th Avenue SW (Oddfellows) \",\"BldgCity\":\"Calgary \",\"BldgState\":\"AB \",\"BldgZip\":\"T2G 2C4 \",\"BldgAddress1\":\"100-6th Avenue Southwest \",\"BldgAddress2\":\"ZZZ None\",\"BldgPhone\":\"4035439600 \",\"BldgLandlord\":\"1006AV\",\"BldgLandlordName\":\"100-6 TH Avenue SW Inc. \",\"BldgManager\":\"AVANDE\",\"BldgManagerName\":\"Alyssa Van de Vorst \",\"BldgManagerType\":\"Internal\",\"BldgGLA\":\"34242\",\"BldgEntityID\":\"1006AVE \",\"BldgInactive\":\"N\",\"BldgPropType\":\"ZZZ None\",\"BldgPropTypeDesc\":\"ZZZ None\",\"BldgPropSubType\":\"ZZZ None\",\"BldgPropSubTypeDesc\":\"ZZZ None\",\"BldgRetailFlag\":\"N\",\"BldgEntityType\":\"REIT \",\"BldgCityName\":\"Calgary \",\"BldgDistrictName\":\"Downtown \",\"BldgRegionName\":\"Western Canada \",\"BldgAccountantID\":\"KKAUN \",\"BldgAccountantName\":\"Kendra Kaun \",\"BldgAccountantMgrID\":\"LVALIANT \",\"BldgAccountantMgrName\":\"Lorretta Valiant \",\"BldgFASBStartDate\":\"2012-10-24\",\"BldgFASBStartDateStr\":\"2012-10-24\"}]}
I want it in this format:
[
  {
    "BldgID":"1006AVE",
    "BldgName":"100-6th Avenue SW (Oddfellows) ",
    "BldgCity":"Calgary ",
    "BldgState":"AB ",
    "BldgZip":"T2G 2C4 ",
    "BldgAddress1":"100-6th Avenue Southwest ",
    "BldgAddress2":"ZZZ None",
    "BldgPhone":"4035439600 ",
    "BldgLandlord":"1006AV",
    "BldgLandlordName":"100-6 TH Avenue SW Inc. ",
    "BldgManager":"AVANDE",
    "BldgManagerName":"Alyssa Van de Vorst ",
    "BldgManagerType":"Internal",
    "BldgGLA":"34242",
    "BldgEntityID":"1006AVE ",
    "BldgInactive":"N",
    "BldgPropType":"ZZZ None",
    "BldgPropTypeDesc":"ZZZ None",
    "BldgPropSubType":"ZZZ None",
    "BldgPropSubTypeDesc":"ZZZ None",
    "BldgRetailFlag":"N",
    "BldgEntityType":"REIT ",
    "BldgCityName":"Calgary ",
    "BldgDistrictName":"Downtown ",
    "BldgRegionName":"Western Canada ",
    "BldgAccountantID":"KKAUN ",
    "BldgAccountantName":"Kendra Kaun ",
    "BldgAccountantMgrID":"LVALIANT ",
    "BldgAccountantMgrName":"Lorretta Valiant ",
    "BldgFASBStartDate":"2012-10-24",
    "BldgFASBStartDateStr":"2012-10-24"
  }
]
I have tried replace("\","") but nothing changed.
Here is my code:
import json
import urllib2
urllink=urllib2.urlopen("url").read()
# print urllink
with open('urljson.json','w') as outfile:
    json.dump(urllink,outfile)
jsonfile='urljson.json'
jsondata=open(jsonfile)
data=json.load(jsondata)
# data.replace('\'," ")
print (data)
But it says the file object has no replace attribute. I couldn't find any way to remove 'result' and the outermost "{}".
Kindly guide me.
I think the file object is somehow not being parsed into a string. I am a beginner in Python.
Thank you.

JSON is a serialized encoding for data. urllink=urllib2.urlopen("url").read() reads that serialized string. With json.dump(urllink,outfile) you serialized that already-serialized JSON string again. You double-encoded it, and that's why you see those extra "\" escape characters: json has to escape the inner quotes so they are not confused with the quotes it uses to delimit strings.
If you wanted the file to hold the original JSON, you wouldn't need to encode it again; just do
with open('urljson.json','w') as outfile:
    outfile.write(urllink)
But it looks like you want to grab the "result" list and only save that. So, decode the JSON into python, grab the bits you want, and encode it again.
import json
import urllib2
# read a JSON string from the URL
urllink=urllib2.urlopen("url").read()
# decode and grab the result list
result = json.loads(urllink)['result']
# write the list back out as JSON
with open('urljson.json','w') as outfile:
    json.dump(result, outfile)
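The double encoding described in this answer can be reproduced without a network call. This is a minimal Python 3 sketch; the `raw` payload is a made-up stand-in for the text returned by the URL, not the real response.

```python
import json

# Hypothetical payload standing in for the text returned by the URL.
raw = '{"result": [{"BldgID": "1006AVE ", "BldgCity": "Calgary "}]}'

# Dumping a str that already holds JSON wraps it in quotes and escapes
# every inner quote: this is the double encoding.
double_encoded = json.dumps(raw)

# Decoding first and dumping only the list avoids it.
result = json.loads(raw)["result"]
clean = json.dumps(result, indent=2)

print(double_encoded)
print(clean)
```

The first print shows the backslash-escaped form seen in the question; the second shows the bare list with no escapes.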

\ is the escape character in JSON.
You can load the JSON string into a Python dict:
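A minimal Python 3 sketch of that loading step, assuming the file already holds the double-encoded text (a JSON string, surrounding quotes included, whose content is itself JSON). The `file_text` value is a shortened, hypothetical copy of such a file.

```python
import json

# Hypothetical file content: a JSON string that itself contains JSON.
file_text = r'"{\"result\":[{\"BldgID\":\"1006AVE \"}]}"'

inner = json.loads(file_text)  # first pass: back to the original JSON text
data = json.loads(inner)       # second pass: JSON text -> Python dict

print(type(inner).__name__)
print(data["result"])
```

Loading twice undoes the two rounds of encoding, so no manual replace of backslashes is needed.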

Tidy up the JSON object before writing it to the file. It has a lot of whitespace noise. Try it like this:
urllink = {a.strip():b.strip() for a,b in json.loads(urllink).values()[0][0].items()}
jsonobj = json.loads(json.dumps(urllink))
with open('urljson.json','w') as outfile:
    json.dump(jsonobj, outfile)
For all objects:
jsonlist = []
for dirtyobj in json.loads(urllink)['result']:
    jsonlist.append(json.loads(json.dumps({a.strip():b.strip() for a,b in dirtyobj.items()})))
with open('urljson.json','w') as outfile:
    json.dump(json.loads(json.dumps(jsonlist)), outfile)
Don't want to tidy up? Then simply do this:
jsonobj = json.loads(urllink)
And you can't write '\'; it's a syntax error. The second ' is escaped and is not treated as the closing quote.
data.replace('\'," ")
Why can't Python's raw string literals end with a single backslash?
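The whitespace tidy-up above relies on Python 2 behaviour (`.values()[0]` is not indexable in Python 3). A rough Python 3 equivalent might look like the sketch below; `strip_strings` and the sample `urllink` text are illustrative names and data, not part of the original answer.

```python
import json

# Hypothetical response text with the padded values seen in the question.
urllink = '{"result": [{"BldgID": "1006AVE ", "BldgCity": "Calgary ", "BldgGLA": "34242"}]}'

def strip_strings(record):
    """Return a copy of record with whitespace trimmed from keys and string values."""
    return {
        key.strip(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }

cleaned = [strip_strings(rec) for rec in json.loads(urllink)["result"]]
text = json.dumps(cleaned, indent=2)
print(text)
```

As for the backslash itself: a single literal backslash is written '\\' in Python source, which is why '\' alone does not parse.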

Related

How to convert binary data to json

I want to convert the below data to json in python.
I have the data in the following format.
b'{"id": "1", "name": " value1"}\n{"id":"2", name": "value2"}\n{"id":"3", "name": "value3"}\n'
This has multiple JSON objects separated by \n. I was trying to load this as JSON.
I converted the data into a string first and loaded it as JSON, but I am getting an exception.
my_json = content.decode('utf8')
json_data = json.loads(my_json)
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 2306)
You need to decode it, then split it on '\n' and load each JSON object separately. If you store your byte string in a variable called byte_string you could do something like:
json_str = byte_string.decode('utf-8')
json_objs = json_str.split('\n')
for obj in json_objs:
    if obj.strip():  # skip the empty string left by the trailing newline
        json.loads(obj)
For the particular string that you have posted here, though, you will get an error on the second object because the second key in it is missing a double quote. It is name" in the string you posted.
First, this isn't valid JSON since it's not a single object. Second, there is a typo: the "id":"2" entry is missing a double quote before the name key.
As an alternative to processing one dict at a time, you can replace the newlines with "," and turn the whole thing into an array. This is a fragile solution, since it requires exactly one newline between each dict, but it is compact:
s = b'{"id": "1", "name": " value1"}\n{"id":"2", "name": "value2"}\n{"id":"3", "name": "value3"}\n'
my_json = s.decode('utf8')
json_data = json.loads("[" + my_json.rstrip().replace("\n", ",") + "]")
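If the one-newline assumption is too fragile, one alternative is to walk the text with `json.JSONDecoder.raw_decode`, which parses one value and reports where it stopped. This is a sketch only; `iter_json_objects` is a hypothetical helper name, and the sample input adds an extra blank line on purpose.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value found in text, ignoring whitespace between them."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

s = b'{"id": "1", "name": " value1"}\n\n{"id":"2", "name": "value2"}\n{"id":"3", "name": "value3"}\n'
records = list(iter_json_objects(s.decode("utf8")))
print(records)
```

This tolerates blank lines and objects that span several lines, at the cost of a few more lines of code.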
You have to first decode your JSON to a string. So you can just say:
your_json_string = the_json.decode()
now you have a string.
Now what you want to do is:
your_json_string = your_json_string.replace("\\n", "")
so you are replacing the \n with nothing basically. Note that the two backslashes are required, this is not a typo.
Now you can just say:
your_json = json.loads(your_json_string)

Python string / bytes encoding with non ascii characters

I need to write a very simple function that reads from a json file and writes some of the content back to csv file.
The trouble is that the input JSON file has a weird encoding format, for example:
{
"content": "b\"Comment minimiser l'impact environnemental d\\xe8s la R&D des proc\\xe9d\\xe9s micro\\xe9lectroniques."
}
I would like to write back
Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
The first problem is the 'b' so the content should read as a byte array but it is read as a string.
The second one is how to replace the weird characters ?
Thank you
You could use something like this:
import json
json_file_path = 'your_json_file.json'
with open(json_file_path, 'r', encoding='utf-8') as j:
    # Remove the problematic "b\ prefix
    raw = j.read().replace('\"b\\', "")
# Parse the JSON
contents = json.loads(raw)
# Encode, then decode with unicode_escape so the doubled backslashes are interpreted
output = contents['content'].encode('utf-8').decode('unicode_escape')
print(output)
# Output
# Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
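The same idea can be tried without a file. The sketch below parses the JSON first and strips the `b"` prefix afterwards; the `raw` text is a shortened, hypothetical copy of the input. It encodes with latin-1 on the assumption that the text contains only characters up to U+00FF, because `unicode_escape` interprets the remaining bytes as Latin-1.

```python
import json

# Hypothetical, shortened copy of the file content from the question.
raw = r'''{"content": "b\"Comment minimiser l'impact environnemental d\\xe8s la R&D."}'''

text = json.loads(raw)["content"]

# Drop the leftover bytes-literal prefix, if present.
if text.startswith('b"'):
    text = text[2:]

# Turn the literal \xNN sequences into the characters they stand for.
decoded = text.encode("latin-1").decode("unicode_escape")
print(decoded)
```

If the text could contain characters beyond U+00FF, the encode step would raise UnicodeEncodeError, so this is best treated as a repair for this specific kind of damaged input.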

Convert a bytes array into JSON format

I want to parse a bytes string in JSON format to convert it into python objects. This is the source I have:
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
And this is the desired outcome I want to have:
[{
"Date": "2016-05-21T21:35:40Z",
"CreationDate": "2012-05-05",
"LogoType": "png",
"Ref": 164611595,
"Classes": [
"Email addresses",
"Passwords"
],
"Link": "http://some_link.com"}]
First, I converted the bytes to string:
my_new_string_value = my_bytes_value.decode("utf-8")
but when I try to invoke loads to parse it as JSON:
my_json = json.loads(my_new_string_value)
I get this error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 174 (char 173)
Your bytes object is almost JSON, but it's using single quotes instead of double quotes, and it needs to be a string. So one way to fix it is to decode the bytes to str and replace the quotes. Another option is to use ast.literal_eval; see below for details. If you want to print the result or save it to a file as valid JSON you can load the JSON to a Python list and then dump it out. Eg,
import json
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
# Decode UTF-8 bytes to Unicode, and convert single quotes
# to double quotes to make it valid JSON
my_json = my_bytes_value.decode('utf8').replace("'", '"')
print(my_json)
print('- ' * 20)
# Load the JSON to a Python list & dump it back out as formatted JSON
data = json.loads(my_json)
s = json.dumps(data, indent=4, sort_keys=True)
print(s)
output
[{"Date": "2016-05-21T21:35:40Z", "CreationDate": "2012-05-05", "LogoType": "png", "Ref": 164611595, "Classe": ["Email addresses", "Passwords"],"Link":"http://some_link.com"}]
- - - - - - - - - - - - - - - - - - - -
[
{
"Classe": [
"Email addresses",
"Passwords"
],
"CreationDate": "2012-05-05",
"Date": "2016-05-21T21:35:40Z",
"Link": "http://some_link.com",
"LogoType": "png",
"Ref": 164611595
}
]
As Antti Haapala mentions in the comments, we can use ast.literal_eval to convert my_bytes_value to a Python list, once we've decoded it to a string.
from ast import literal_eval
import json
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
data = literal_eval(my_bytes_value.decode('utf8'))
print(data)
print('- ' * 20)
s = json.dumps(data, indent=4, sort_keys=True)
print(s)
Generally, this problem arises because someone has saved data by printing its Python repr instead of using the json module to create proper JSON data. If it's possible, it's better to fix that problem so that proper JSON data is created in the first place.
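One reason to prefer `ast.literal_eval` over the quote replacement is data that contains an apostrophe. A small sketch with a made-up record (the `Name` field is not in the original data):

```python
import json
from ast import literal_eval

# Hypothetical record whose value contains an apostrophe.
raw = b"[{'Name': \"O'Brien\", 'Ref': 164611595}]"
text = raw.decode("utf8")

# Blindly swapping quotes corrupts the apostrophe inside the value.
try:
    json.loads(text.replace("'", '"'))
    replace_worked = True
except json.JSONDecodeError:
    replace_worked = False

# literal_eval parses the Python literal as written.
data = literal_eval(text)
print(replace_worked)
print(json.dumps(data, indent=4, sort_keys=True))
```

Here the replace approach fails to parse, while `literal_eval` recovers the list and `json.dumps` then produces valid JSON.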
You can simply use,
import json
json.loads(my_bytes_value)
Python 3.5 + Use io module
import json
import io
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
fix_bytes_value = my_bytes_value.replace(b"'", b'"')
my_json = json.load(io.BytesIO(fix_bytes_value))
d = json.dumps(byte_str.decode('utf-8'))
To convert this bytesarray directly to json, you could first convert the bytesarray to a string with decode(), utf-8 is standard. Change the quotation markers.. The last step is to remove the " from the dumped string, to change the json object from string to list.
dumps(s.decode()).replace("'", '"')[1:-1]
Better solution is:
import json
byte_array_example = b'{"text": "\u0627\u06CC\u0646 \u06CC\u06A9 \u0645\u062A\u0646 \u062A\u0633\u062A\u06CC \u0641\u0627\u0631\u0633\u06CC \u0627\u0633\u062A."}'
res = json.loads(byte_array_example.decode('unicode_escape'))
print(res)
result:
{'text': 'این یک متن تستی فارسی است.'}
Decoding with utf-8 cannot decode the escaped unicode characters. The right solution is unicode_escape.
It is OK
If you have a bytes object and want to store it in a JSON file, then you should first decode the bytes object, because JSON only has a few data types and raw byte data isn't one of them. It has arrays, decimal numbers, strings, and objects.
To decode a byte object you first have to know its encoding. For this, you can use
import chardet
encoding = chardet.detect(your_byte_object)['encoding']
Then you can save this object to your JSON file like this:
data = {"data": your_byte_object.decode(encoding)}
with open('request.txt', 'w') as file:
    json.dump(data, file)
The most simple solution is to use the json function that comes with http request.
For example:

How do I decode unicode characters via python?

I am trying to import the following json file using python:
The file is called new_json.json:
{
"nextForwardToken": "f/3208873243596875673623625618474139659",
"events": [
{
"ingestionTime": 1045619,
"timestamp": 1909000,
"message": "2 32823453119 eni-889995t1 54.25.64.23 156.43.12.120 3389 23 6 342 24908 143234809 983246 ACCEPT OK"
}]
}
I have the following code to read the json file, and remove the unicode characters:
JSON_FILE = "new_json.json"
with open(JSON_FILE) as infile:
    print infile
    print '\n type of infile is \n', infile
    data = json.load(infile)
    str_data = str(data) # convert to string to remove unicode characters
    wo_unicode = str_data.decode('unicode_escape').encode('ascii','ignore')
    print 'unicode characters have been removed \n'
    print wo_unicode
But print wo_unicode still prints with the unicode markers (i.e. the u prefix) in it.
The unicode characters cause a problem when trying to treat the JSON as a dictionary:
for item in data:
    iden = item.get['nextForwardToken']
...results in an error:
AttributeError: 'unicode' object has no attribute 'get'
This has to work in Python2.7. Is there an easy way around this?
The error has nothing to do with unicode. You are trying to treat the keys as dicts; just use data itself to get 'nextForwardToken':
print data.get('nextForwardToken')
When you iterate over data, you are iterating over the keys, so 'nextForwardToken'.get('nextForwardToken'), "events".get('nextForwardToken'), etc. are obviously not going to work even with the correct syntax.
Whether you access by data.get(u'nextForwardToken') or data.get('nextForwardToken'), both will return the value for the key:
In [9]: 'nextForwardToken' == u'nextForwardToken'
Out[9]: True
In [10]: data[u'nextForwardToken']
Out[10]: u'f/3208873243596875673623625618474139659'
In [11]: data['nextForwardToken']
Out[11]: u'f/3208873243596875673623625618474139659'
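To make the keys-versus-values point concrete, here is a short sketch that inlines the document from the question. It is written in Python 3 syntax, so strings print without the u prefix; the variable names are illustrative.

```python
import json

# The document from the question, inlined so the sketch is self-contained.
text = '''
{
  "nextForwardToken": "f/3208873243596875673623625618474139659",
  "events": [
    {
      "ingestionTime": 1045619,
      "timestamp": 1909000,
      "message": "2 32823453119 eni-889995t1 54.25.64.23 156.43.12.120 3389 23 6 342 24908 143234809 983246 ACCEPT OK"
    }
  ]
}
'''
data = json.loads(text)

keys = list(data)                     # iterating a dict yields its keys (strings)
token = data.get("nextForwardToken")  # look the key up on the dict itself
messages = [event["message"] for event in data["events"]]

print(keys)
print(token)
print(messages)
```

The loop belongs on data["events"], which is the list of dicts, rather than on data itself.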
This code will give you the values as str without the unicode
import json
JSON_FILE = "/tmp/json.json"
with open(JSON_FILE) as infile:
    print infile
    print '\n type of infile is \n', infile
    data = json.load(infile)
    print data
    str_data = json.dumps(data)
    print str_data

Converting a utf16 csv to array

I've tried to convert a CSV file encoded in UTF-16 (exported by another program) to a simple array in Python 2.7, with very little luck.
Here's the nearest solution I've found:
import csv
from io import BytesIO
with open ('c:\\pfm\\bdh.txt','rb') as f:
    x = f.read().decode('UTF-16').encode('UTF-8')
for line in csv.reader(BytesIO(x)):
    print line
This code returns:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
['1\tnom1\tetq1\text1 ...
What I'm trying to get is something like this:
[['','Nombre','Etiqueta','Extensión de archivo','Tamaño lógico','Categoría']
['1','nom1','etq1','ext1','123','cat1']
['2','nom2','etq2','ext2','456','cat2']]
So, I'd need to convert those hexadecimal chars to Latin characters (such as á, é, í, ó, ú, or ñ), and those tab-separated strings into array fields.
Do I really need to use dictionaries for the first part? I think there should be an easier solution, as I can see and type all these characters on my keyboard.
For the second part, I think the CSV library won't help in this case, as I read it can't handle UTF-16 yet.
Could you give me a hand? Thank you!
ITEM #1: The hexadecimal characters
You are getting the:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
output because you are printing a list. The behaviour of the list is to print the representation of each item. That is, it is the equivalent of:
print('[{0}]'.format(', '.join(repr(item) for item in lst)))
If you use print(line[0]) you will get the output of the line.
ITEM #2: The output
The problem here is that the csv parser is not parsing the content as a tab-separated file, but a comma-separated file. You can fix this by using:
for line in csv.reader(BytesIO(x), delimiter='\t'):
    print(line)
instead.
This will give you the desired result.
Processing a UTF-16 file with the csv module in Python 2 can indeed be a pain. Re-encoding to UTF-8 works, but you then still need to decode the resulting columns to produce unicode values.
Note also that your data appears to be tab delimited; the csv.reader() by default uses commas, not tabs, to separate columns. You'll need to configure it to use tabs instead by setting delimiter='\t' when constructing the reader.
Use io.open() to read UTF-16 and produce unicode lines. You can then use codecs.iterencode() to translate the decoded unicode values from the UTF-16 file to UTF-8.
To decode the rows back to unicode values, you could use an extra generator to do so as you iterate:
import csv
import codecs
import io
def row_decode(reader, encoding='utf8'):
    for row in reader:
        yield [col.decode(encoding) for col in row]
with io.open('c:\\pfm\\bdh.txt', encoding='utf16') as f:
    wrapped = codecs.iterencode(f, 'utf8')
    reader = csv.reader(wrapped, delimiter='\t')
    for row in row_decode(reader):
        print row
Each line will still use repr() on each contained value, which means that you'll see Python string literal syntax to represent strings. Any non-printable or non-ASCII codepoint will be represented by an escape code:
>>> [u'', u'Nombre', u'Etiqueta', u'Extensión de archivo', u'Tamaño lógico', u'Categoría']
[u'', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
This is normal; the output is meant to be useful as a debugging aid and can be pasted back into any Python session to reproduce the original value, without worrying about terminal encodings.
For example, ó is represented as \xf3, representing the Unicode codepoint U+00F3 LATIN SMALL LETTER O WITH ACUTE. If you were to print this one column, Python will encode the Unicode string to bytes matching your terminal encoding, resulting in your terminal showing you the correct string again:
>>> u'Extensi\xf3n de archivo'
u'Extensi\xf3n de archivo'
>>> print u'Extensi\xf3n de archivo'
Extensión de archivo
Demo:
>>> import csv, codecs, io
>>> io.open('/tmp/demo.csv', 'w', encoding='utf16').write(u'''\
... \tNombre\tEtiqueta\tExtensi\xf3n de archivo\tTama\xf1o l\xf3gico\tCategor\xeda
... ''')
63L
>>> def row_decode(reader, encoding='utf8'):
...     for row in reader:
...         yield [col.decode('utf8') for col in row]
...
>>> with io.open('/tmp/demo.csv', encoding='utf16') as f:
...     wrapped = codecs.iterencode(f, 'utf8')
...     reader = csv.reader(wrapped, delimiter='\t')
...     for row in row_decode(reader):
...         print row
...
[u' ', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
>>> # the row is displayed using repr() for each column; the values are correct:
...
>>> print row[3], row[4], row[5]
Extensión de archivo Tamaño lógico Categoría
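The question asks for Python 2.7, but for comparison, in Python 3 the csv module works on text, so the re-encoding step is not needed. A minimal sketch, using a temporary file and made-up sample rows rather than the real bdh.txt:

```python
import csv
import os
import tempfile

# Build a small UTF-16, tab-separated sample file (hypothetical data and path).
sample = "\tNombre\tEtiqueta\tExtensión de archivo\n1\tnom1\tetq1\text1\n"
path = os.path.join(tempfile.mkdtemp(), "bdh.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write(sample)

# In Python 3 the csv module reads text, so open() does the decoding.
with open(path, encoding="utf-16", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))

print(rows)
```

The result is a list of lists of str, one inner list per row, with the accented characters intact.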
