Error while loading a specific json file in Python [duplicate] - python

I have some json files created by powershell using the ConvertTo-Json command. The content of the json file looks like
{
"Key1": "Value1",
"Key2": "Value2"
}
I ran the python interpreter to see if I could read the file but I get this weird output
>>> f=open('test.json', 'r')
>>> f.read()
'ÿ\xfe{\x00\n\x00\n\x00 \x00 \x00 \x00 \x00"\x00K\x00e\x00y\x001\x00"\x00:\x00 \x00 \x00"\x00V\x00a\x00l\x00u\x00e\x001\x00"\x00,\x00\n\x00\n\x00 \x00 \x00 \x00 \x00"\x00K\x00e\x00y\x002\x00"\x00:\x00 \x00 \x00"\x00V\x00a\x00l\x00u\x00e\x002\x00"\x00\n\x00\n\x00}\x00\n\x00\n\x00'
For some reason all the characters are escaped byte characters and there's a weird ÿ at the beginning (PowerShell error?).
The weird thing is this:
>>> f=open('test.json', 'r')
>>> str=f.read()
>>> type(str)
<class 'str'>
>>> json.loads(str)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So the input is a string, but the json module can't parse it (json.load(f) returns the same error). What is causing this error? Is it a Python thing, a PowerShell thing, a JSON thing?

As pointed out by jwodder, PowerShell has encoded your JSON using UTF-16LE. To load this data correctly, you need to open the file using the correct encoding, e.g.:
import json
with open("test.json", "r", encoding="utf16") as f:
    json_string = f.read()
my_dict = json.loads(json_string)
You don't need to tell Python which variant of UTF-16 is being used. This is the purpose of the first two bytes of the text file. It's called a Byte Order Mark (BOM). It lets a program know if UTF-16LE or UTF-16BE has been used to encode the text file.
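A minimal sketch of that BOM behaviour, using made-up in-memory bytes rather than the asker's actual file: the generic "utf-16" codec reads the BOM to pick the byte order and then drops it, whereas naming the variant explicitly leaves the BOM in the decoded text as U+FEFF.

```python
import codecs
import json

# Hypothetical sample payload, not the asker's actual file.
text = '{"Key1": "Value1", "Key2": "Value2"}'

# Build the two on-disk variants: BOM followed by the encoded text.
raw_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
raw_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" codec consumes the BOM and picks the byte order itself.
from_le = json.loads(raw_le.decode("utf-16"))
from_be = json.loads(raw_be.decode("utf-16"))

# Naming the variant explicitly keeps the BOM as a leading U+FEFF character.
with_bom = raw_le.decode("utf-16-le")
print(from_le == from_be, repr(with_bom[0]))
```

This is why encoding="utf16" is usually the more convenient choice for files that carry a BOM.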

It seems that you have a BOM at the start of your file. You can verify it in a hex editor or with a good text editor (Notepad++ shows if BOM is present).
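If you would rather check for a BOM programmatically than with a hex editor, something along these lines should work. It is only a sketch: sniff_bom and sniff_file are hypothetical helper names, and a file without a BOM simply yields None, so the encoding still has to be known some other way in that case.

```python
import codecs

# BOM signatures mapped to a codec that strips them while decoding.
# UTF-32 entries come first because the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head):
    """Return a codec name for the BOM at the start of `head`, or None."""
    for bom, codec in BOMS:
        if head.startswith(bom):
            return codec
    return None


def sniff_file(path):
    """Hypothetical helper: inspect only the first four bytes of a file."""
    with open(path, "rb") as fh:
        return sniff_bom(fh.read(4))
```

For the file in the question, whose first bytes are FF FE, sniff_bom would report "utf-16".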

If you want to load text files with Unicode BOM headers like yours, you are better off using the codecs.open function with an explicit encoding instead of open, as the default open is not able to interpret the BOM.
Or you can have a look at tendo.unicode - a small library that I wrote that can improve life for people that are not used to Unicode texts.

Related

Load string into dictionary when some values contain single quotes

I have a string that I need to load into a dictionary. I'm trying to use json.loads() for this, but it fails. By default the strings wrap property names and values in single quotes, so I have to replace single quotes with double quotes first; however, in some cases a value is wrapped in double quotes because it contains a single quote.
Here is a reproducible example using Python 3.8:
>>> import json
>>> example = "{'msg': \"I'm a string\"}"
>>> json.loads(example.replace("'", '"'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 12 (char 11)
Is there a better way to load a string into a dictionary? The json module seems to work only with double quotes, and producing those is not possible here (I have no control over the formatting of the strings), which is why I'm resorting to the unreliable replace.
My desired result is to have a dictionary like this.
{'msg': "I'm a string"}
Use ast.literal_eval() (see the docs):
>>> import ast
>>> ast.literal_eval(example)
{'msg': "I'm a string"}
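A slightly fuller sketch of the same idea, in case you need to guard against strings that are not valid Python literals or want real JSON afterwards. The helper name parse_literal is made up for illustration.

```python
import ast
import json

# The string from the question: Python-literal syntax, not JSON.
example = "{'msg': \"I'm a string\"}"

# literal_eval only accepts literals (dicts, lists, strings, numbers, ...),
# so it does not execute arbitrary code the way eval() would.
parsed = ast.literal_eval(example)

# Once parsed, json.dumps produces real JSON if that is needed downstream.
as_json = json.dumps(parsed)


def parse_literal(text):
    """Hypothetical helper: return the parsed value, or None if `text` is not a literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None
```

One limitation to keep in mind: JSON-only tokens such as true, false and null are not Python literals, so literal_eval rejects strings that contain them.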

Problem with loading a json file for geo_data in python

I am currently trying to use folium library in python to create webmaps. I have a file world.json which contains geo_data. I have provided a link to the file at the end of this post. I tried the following code:
data = [json.loads(line) for line in open('world.json', 'r')]
and received the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\json\__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
How can I load this file?
What I essentially want to achieve is to obtain the population data, create a choropleth, and overlay it on my webmap.
Edit: Forgot the link:
https://1drv.ms/u/s!Army95vqcKXpaooVAZU_g-VCAVw?e=vwTknq
Edit: Previous link to skydrive stopped working due to "high traffic". Below is link to dropbox, hopefully this works:
https://www.dropbox.com/s/gmm8db0g03rc7cv/world.json?dl=0
Good news/bad news:
It turns out that this file was encoded in a locale that we are not accustomed to, and json/ascii cannot make sense of some of the character encoding. I tried this, and it seems to be working for me -- with a major caveat:
with open("world.json", "r") as fh:
    contents = fh.read()
asciiContents = contents.encode("ascii", errors="ignore")
data = json.loads(asciiContents)
The major caveat is that only 3 countries come through with no encoding errors:
>>> len(data["features"])
3
Maybe there is another source for this data that is closer to a native English locale, or maybe someone else can provide wisdom on encoding foreign data in a more friendly way...
The open command will return a file handle, not string lines. I would do:
with open('world.json', 'r') as fh:
    data = json.load(fh)
data will then be your contents converted to python (list or dictionary, etc)
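A minimal sketch of the difference between parsing line by line and parsing the whole file. It uses a small made-up document written to a temporary file, since the real world.json is not reproduced here, and it assumes the file is UTF-8, which the thread does not confirm.

```python
import json
import os
import tempfile

# Hypothetical stand-in for world.json, including a non-ASCII name.
sample = {"features": [{"properties": {"name": "Côte d'Ivoire", "population": 123}}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "world_sample.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh, ensure_ascii=False, indent=2)

    # Parsing each line separately fails for a document spread over several lines.
    try:
        with open(path, "r", encoding="utf-8") as fh:
            per_line = [json.loads(line) for line in fh]
    except json.JSONDecodeError:
        per_line = None

    # Parsing the file as a whole works; the explicit encoding avoids relying
    # on the platform default (assumed to be UTF-8 here).
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
```

Passing the encoding explicitly also keeps non-ASCII country names intact, rather than discarding them as the errors="ignore" approach does.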

python unable to load a json file with utf-8 encoding

With the following python code:
filePath = urllib2.urlopen('xx.json')
fileJSON = json.loads(filePath.read().decode('utf-8'))
Where the xx.json looks like:
{
"tags": [{
"id": "123",
"name": "Airport",
"name_en": "Airport",
"name_cn": "机场",
"display": false
}]
}
I see the following exception:
fileJSON = json.loads(filePath.read().decode('utf-8'))
File "/usr/lib64/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
The code worked before the Chinese characters were added to the JSON file; I added the .decode('utf-8') after read() when I added them.
I am not sure what needs to be done?
$ wget https://s3.amazonaws.com/wherego-sims/tags.json
$ file tags.json
tags.json: UTF-8 Unicode (with BOM) text, with CRLF line terminators
This file begins with a byte order mark (EF BB BF), which is illegal in JSON (JSON Specification and usage of BOM/charset-encoding). You must first decode this using 'utf-8-sig' in Python to get a valid JSON unicode string.
json.loads(filePath.read().decode('utf-8-sig'))
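To illustrate the difference between the two codecs without downloading anything, here is a small sketch in Python 3 syntax. The bytes are made up to mimic a UTF-8 file saved with a BOM; they are not the contents of tags.json.

```python
import codecs
import json

# Made-up bytes mimicking a UTF-8 file saved with a BOM (EF BB BF).
raw = codecs.BOM_UTF8 + '{"name_cn": "机场", "display": false}'.encode("utf-8")

# Plain "utf-8" keeps the BOM as U+FEFF, which the JSON parser rejects.
try:
    json.loads(raw.decode("utf-8"))
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False

# "utf-8-sig" drops the BOM if present and behaves like "utf-8" otherwise.
tags = json.loads(raw.decode("utf-8-sig"))
no_bom = json.loads(raw[3:].decode("utf-8-sig"))
```

Because 'utf-8-sig' also accepts input without a BOM, it is a reasonable default when you cannot control how the file was saved.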
For what it's worth, Python 3 (which you should be using) will give a specific error in this case and guide you in handling this malformed file:
json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
Namely, by specifying that you wish to discard the BOM if it exists (again, it's not conventional to use this in UTF-8, particularly with JSON which is always encoded in UTF-8 so it is worse than useless):
>>> import json
>>> json.load(open('tags.json', encoding='utf-8-sig'))

Python function to turn internationalized domain name from U-Label to A-Label? [duplicate]

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
I'm struggling to get it to work with a small script reading data from the text file.
#!/usr/bin/python
import sys
infile = open(sys.argv[1])
for line in infile:
    print line,
    domain = unicode(line.strip())
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
File "./idn.py", line 9, in <module>
domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs
infile = codecs.open(sys.argv[1], "r", "utf8")
for line in infile:
    print line,
    domain = line.strip()
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
Which gave me:
$ ./idn.py ./test
Traceback (most recent call last):
File "./idn.py", line 8, in <module>
for line in infile:
File "/usr/lib/python2.6/codecs.py", line 679, in next
return self.reader.next()
File "/usr/lib/python2.6/codecs.py", line 610, in next
line = self.readline()
File "/usr/lib/python2.6/codecs.py", line 525, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.6/codecs.py", line 472, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
I'm very aware of my need to understand unicode now.
Thanks,
Peter
You need to know which encoding your file was saved in. This would be something like 'utf-8' (which is NOT Unicode), 'iso-8859-1', 'cp1252', or similar.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])
for line in infile:
    print line,
    domain = line.strip().decode('utf-8')
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, Python tries to decode it first with the default codec 'ascii', which fails for non-ASCII values.
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.
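As a rough sketch of that diagnosis, written for Python 3 rather than the Python 2 used in the question: the bytes below are made up to stand in for one line of a test file saved as cp1252, and to_a_label is a hypothetical helper name.

```python
# Made-up bytes standing in for a line of the test file saved as cp1252.
line = b"pfarmer\xfc.com\n"


def to_a_label(raw_line, encoding):
    """Hypothetical helper: decode one line and return its IDNA (A-label) form."""
    domain = raw_line.strip().decode(encoding)
    return domain.encode("idna").decode("ascii")


# Byte 0xfc is "ü" in cp1252 but is not valid UTF-8, so the wrong guess fails loudly.
try:
    to_a_label(line, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

a_label = to_a_label(line, "cp1252")
print(utf8_ok, a_label)
```

With the right codec the result matches the interactive session in the question, xn--pfarmer-t2a.com.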

python, vobject, encoding, vcards

I am using vobject in python. I am attempting to parse the vcard located here:
http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150
To do this, I do the following:
import urllib
import vobject
vcard = urllib.urlopen("http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150").read()
vcard_object = vobject.readOne(vcard)
Whenever I do this, I get the following error:
Traceback (most recent call last):
File "<pyshell#86>", line 1, in <module>
vobject.readOne(urllib.urlopen("http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150").read())
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 1078, in readOne
ignoreUnreadable, allowQP).next()
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 1031, in readComponents
vline = textLineToContentLine(line, n)
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 888, in textLineToContentLine
return ContentLine(*parseLine(text, n), **{'encoded':True, 'lineNumber' : n})
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 262, in __init__
self.value = str(self.value).decode('quoted-printable')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 29: ordinal not in range(128)
I have tried a number of other variations on this, such as converting vcard into unicode, using various encodings, etc. But I always get the same, or a very similar, error message.
Any ideas on how to fix this?
It's failing on line 13 of the vCard because the ADR property is incorrectly marked as being encoded in the "quoted-printable" encoding. The ü character should have been encoded as =FC but appears unencoded, which is why vobject is throwing the error.
The file is downloaded as a UTF-8 (I think) encoded string, but the library tries to interpret it as ASCII.
Try adding the following line after urlopen:
vcard = vcard.decode('utf-8')
The readOne method of the vobject library is pretty awkward.
To avoid problems, I decided to store the vCards in my database as quoted-printable data, which readOne accepts.
Assuming some_vcard is a UTF-8 encoded string:
quopried_vcard = quopri.encodestring(some_vcard)
The quopried_vcard is what gets persisted, and when needed it is parsed with:
vobj = vobject.readOne(quopried_vcard)
To get the decoded data back, e.g. for the fn field of the vCard:
quopri.decodestring(vobj.fn.value)
Maybe somebody can handle UTF-8 with readOne better. If so, I would love to see it.
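For reference, the quoted-printable round trip on its own looks roughly like this in Python 3. This sketch deliberately leaves vobject out and uses a made-up field value, so it only shows what quopri does to the bytes, not how readOne reacts to them.

```python
import quopri

# Hypothetical vCard field value containing a non-ASCII character.
some_value = "Zürich"

# quopri works on bytes, so encode the text to UTF-8 first.
encoded = quopri.encodestring(some_value.encode("utf-8"))

# The result is pure ASCII, so an ASCII-only parser can read it.
is_ascii = all(byte < 128 for byte in encoded)

# Decoding reverses both steps: quoted-printable first, then UTF-8.
decoded = quopri.decodestring(encoded).decode("utf-8")
```

Each non-ASCII byte becomes an =XX escape, which is why the stored form no longer triggers the 'ascii' codec error.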
