Loading a JSON file in Python

I've got multiple files to load as JSON. They are all formatted the same way, but one of them raises an exception when I load it. This is where you can find the file:
File
I wrote the following code:
import json

def from_seed_data_extract_summoners():
    summonerIds = set()
    for i in range(1,11):
        file_name = 'data/matches%s.json' % i
        print file_name
        with open(file_name) as data_file:
            data = json.load(data_file)
        for match in data['matches']:
            for summoner in match['participantIdentities']:
                summonerIds.add(summoner['player']['summonerId'])
    return summonerIds
The error occurs when I do the following: json.load(data_file). I suppose there is a special character but I can't find it and don't know how to replace it. The error generated is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xeb in position 6: invalid continuation byte
Do you know how I can get rid of it?

json.load is trying to decode the file's bytes into Unicode text, not just leave them as a plain byte string. The file contains a byte (0xeb according to the error message, probably a character that is not very noticeable) that is not valid UTF-8 at that position, so it cannot be decoded.
How to get string objects instead of Unicode ones from JSON in Python?
That is a great thread about making JSON objects more manageable in python.
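One way to check which byte is at fault is to read the file as raw bytes and catch the decode error yourself. This is only a sketch in Python 3 (the question uses Python 2); describe_decode_error is a hypothetical helper name, and the sample bytes are made up to reproduce the same kind of error, not taken from the asker's file.

```python
import json


def describe_decode_error(raw, encoding="utf-8", context=10):
    """Return None if raw decodes cleanly, else a dict describing the first bad byte."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        return {
            "position": exc.start,
            "byte": hex(raw[exc.start]),
            "reason": exc.reason,
            "context": raw[start:exc.start + context],
        }
    return None


# Made-up sample: "ë" stored as the single Latin-1 byte 0xEB instead of UTF-8.
sample = b'{"n":"\xebl"}'
info = describe_decode_error(sample)
print(info)

# If the file turns out to be Latin-1 / cp1252, decoding it that way lets json parse it.
data = json.loads(sample.decode("latin-1"))
print(data)
```

If the bytes around the reported position look like Latin-1 text, opening the file with that encoding is usually a better fix than dropping characters.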

Replace file_name = 'data/matches%s.json' % i with file_name = 'data/matches%i.json' % i.
Note that json.load expects an open file object rather than a file name, so keep the with block:
with open(file_name) as data_file:
    data = json.load(data_file)
EDIT:
def from_seed_data_extract_summoners():
    summonerIds = set()
    for i in range(1,11):
        file_name = 'data/matches%i.json' % i
        with open(file_name) as f:
            data = json.load(f, encoding='utf-8')
        for match in data['matches']:
            for summoner in match['participantIdentities']:
                summonerIds.add(summoner['player']['summonerId'])
    return summonerIds

Try:
json.loads(unicode(data_file.read(), errors='ignore'))
or :
json.loads(unidecode.unidecode(unicode(data_file.read(), errors='ignore')))
(for the second, you would need to install unidecode)
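For readers on Python 3, where the unicode built-in no longer exists, a rough equivalent is to pass errors= to open. The sketch below assumes the file is mostly UTF-8 with a few stray bytes; load_json_lossy is a hypothetical helper name and the temporary file is made up for the demonstration.

```python
import json
import os
import tempfile


def load_json_lossy(path, encoding="utf-8", errors="replace"):
    """Parse a JSON file, substituting U+FFFD for bytes that do not decode."""
    with open(path, encoding=encoding, errors=errors) as data_file:
        return json.load(data_file)


# Made-up file containing one byte (0xEB) that is not valid UTF-8 here.
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    tmp.write(b'{"name": "No\xebl"}')

try:
    replaced = load_json_lossy(tmp.name)
    ignored = load_json_lossy(tmp.name, errors="ignore")
finally:
    os.remove(tmp.name)

print(replaced)
print(ignored)
```

errors="replace" leaves a visible marker where data was lost, while errors="ignore" drops the byte silently, so the former is usually easier to debug.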

Try:
json.loads(data_file.read(), encoding='utf-8')

Related

<class 'UnicodeDecodeError'> that only appears in Python 3 but not Python 2

I am doing a project analyzing tweets for an Urban Policy class. The purpose of this script is to parse out certain information from JSON files that a colleague downloaded. Here's a link to a sample Tweet I am trying to parse:
https://www.dropbox.com/s/qf1e06601m2mrxr/5thWardChicago.0.json?dl=0
I had a friend of mine test the following script in some version of Python 2 (Windows) and it worked. However, my machine (Windows 10) is running a recent version of Python 3 and it's not working for me.
import json
import collections
import sys, os
import glob
from datetime import datetime
import csv
def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
def to_ilan_csv(json_files):
    # write the column headers
    csv_writer = csv.writer(open("test.csv", "w"))
    headers = ["tweet_id", "handle", "username", "tweet_text", "has_image", "image_url", "created_at", "retweets", "hashtags", "mentions", "isRT", "isMT"]
    csv_writer.writerow(headers)
    # open the JSON files we stored and parse them into the CSV file we're working on
    try:
        #json_files = glob.glob(folder)
        print("Parsing %s files." % len(json_files))
        for file in json_files:
            f = open(file, 'r')
            if f != None:
                for line in f:
                    # hack to avoid the trailing \n at the end of the file - sticking point LH 4/7/16
                    if len(line) > 3:
                        i = 0
                        tweets = convert(json.loads(line))
                        for tweet in tweets:
                            has_media = False
                            is_RT = False
                            is_MT = False
                            hashtags_list = []
                            mentions_list = []
                            media_list = []
                            entities = tweet["entities"]
                            # old tweets don't have key "media" so need a workaround
                            if entities.has_key("media"):
                                has_media = True
                                for item in entities["media"]:
                                    media_list.append(item["media_url"])
                            for hashtag in entities["hashtags"] :
                                hashtags_list.append(hashtag["text"])
                            for user in entities["user_mentions"]:
                                mentions_list.append(user["screen_name"])
                            if tweet["text"][:2] == "RT":
                                is_RT = True
                            if tweet["text"][:2] == "MT":
                                is_MT = True
                            values = [
                                tweet["id_str"],
                                tweet["user"]["id_str"],
                                tweet["user"]["screen_name"],
                                tweet["text"],
                                has_media,
                                ','.join(media_list) if len(media_list) > 0 else "",
                                datetime.strptime(tweet["created_at"], '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d %H:%M:%S'),
                                tweet["retweet_count"],
                                ','.join(hashtags_list) if len(hashtags_list) > 0 else "",
                                ','.join(mentions_list) if len(mentions_list) > 0 else "",
                                is_RT,
                                is_MT
                            ]
                            csv_writer.writerow(values)
                    else:
                        continue
            f.close()
    except:
        print("Something went wrong. Quitting.")
        for i in sys.exc_info():
            print(i)
def parse_tweets():
    file_names = []
    file_names.append("C:\\Users\\Adam\\Downloads\\Test Code\\sample1.json")
    file_names.append("C:\\Users\\Adam\\Downloads\\Test Code\\sample2.json")
    to_ilan_csv(file_names)
Then I execute by simply performing
parse_tweets()
But I get the following error:
Parsing 2 files.
Something went wrong. Quitting.
<class 'UnicodeDecodeError'>
'charmap' codec can't decode byte 0x9d in position 3338: character maps to <undefined>
<traceback object at 0x0000016CCFEE5648>
I sought help from a CS friend of mine but he was unable to diagnose the problem. So I've come here.
MY QUESTION
What is this error and why is it only arising in Python 3 instead of Python 2?
For those who want to try, the code as presented should be able to be run using a Jupyter notebook and the copy of the file in the drop box link I provided.
So, after a bit of debugging in chat, here's the solution:
Apparently, the file OP was using was not being read as UTF-8: Python 3 opened it with the Windows default encoding, so iterating over the file (with for line in f) caused the UnicodeDecodeError from the cp1252 encoding module. We fixed that by explicitly opening the file as UTF-8:
f = open(file, 'r', encoding='utf-8')
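As for why this only shows up in Python 3: Python 2's open returns raw bytes and never decodes them, whereas Python 3's text mode decodes with the locale's preferred encoding, which is cp1252 on many Windows setups. The sketch below shows one plausible source of a 0x9d byte; that the tweet actually contains a curly quote is an assumption, not something confirmed in the thread.

```python
# U+201D (right double quotation mark), common in tweets, is three bytes in UTF-8.
raw = "\u201d".encode("utf-8")
print(raw)

try:
    raw.decode("cp1252")          # cp1252 has no character assigned to byte 0x9d
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(exc)

text = raw.decode("utf-8")        # decoding with the right codec works
print(text)
```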
After we did that, the file could be opened correctly and OP ran into the Python 3 issues we all have been expecting and seeing before. The following three issues came up:
'dict' object has no attribute 'iteritems'
dict.iteritems() no longer exists in Python 3, so we just switch to dict.items() here:
return {convert(key): convert(value) for key, value in input.items()}
name 'unicode' is not defined
Unicode is no longer a separate type in Python 3, the normal string type is already capable of unicode, so we just delete this case:
elif isinstance(input, unicode):
    return input.encode('utf-8')
'dict' object has no attribute 'has_key'
To check whether a key exists in a dictionary, we use the in operator, so the if check becomes the following:
if "media" in entities:
Afterwards, the code should run fine with Python 3.
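Put together, the reading side of the fixes might look like the following Python 3 sketch. It assumes, as the loop in the question implies, that each non-empty line holds a JSON array of tweets; media_urls and iter_tweets are hypothetical names, and the sample data is made up.

```python
import json
import os
import tempfile


def media_urls(tweet):
    """Return the media URLs of one tweet dict; older tweets have no "media" key."""
    entities = tweet["entities"]
    if "media" in entities:  # replaces entities.has_key("media")
        return [item["media_url"] for item in entities["media"]]
    return []


def iter_tweets(path):
    """Yield tweet dicts from a file with one JSON array of tweets per line."""
    # Explicit encoding instead of the OS default (cp1252 on the asker's machine).
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank trailing lines
                for tweet in json.loads(line):
                    yield tweet


# Made-up sample: one line holding a JSON array of two minimal tweets.
sample = [
    {"text": "old tweet \u201cquoted\u201d", "entities": {"hashtags": []}},
    {"text": "new tweet", "entities": {"media": [{"media_url": "http://example.com/a.jpg"}]}},
]
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as out:
        out.write(json.dumps(sample, ensure_ascii=False) + "\n\n")
    urls = [media_urls(t) for t in iter_tweets(path)]
finally:
    os.remove(path)

print(urls)
```

No convert step is needed here, because json.loads already returns str objects in Python 3.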

Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte

I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); here is the full json-to-csv.py file.
The conversion works if I set
#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8','ignore')
and
fp = open(json_file_path, 'r', encoding='utf-8')
but when I import the CSV into MS Excel I see bad Cyrillic characters, for example \xe0\xf1; English text is OK.
I experimented with setting encode('cp1251','ignore'), but then I got an error
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined> (as in the question "UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>")
import sys
import json
import csv
##
# This function converts an item like
# {
# "item_1":"value_11",
# "item_2":"value_12",
# "item_3":"value_13",
# "item_4":["sub_value_14", "sub_value_15"],
# "item_5":{
# "sub_item_1":"sub_item_value_11",
# "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
# }
# }
# To
# {
# "node_item_1":"value_11",
# "node_item_2":"value_12",
# "node_item_3":"value_13",
# "node_item_4_0":"sub_value_14",
# "node_item_4_1":"sub_value_15",
# "node_item_5_sub_item_1":"sub_item_value_11",
# "node_item_5_sub_item_2_0":"sub_item_value_12",
# "node_item_5_sub_item_2_1":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item
    #Reduction Condition 1
    if type(value) is list:
        i=0
        for sub_item in value:
            reduce_item(key+'_'+str(i), sub_item)
            i=i+1
    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key+'_'+str(sub_key), value[sub_key])
    #Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251','ignore')
if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        #Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]
        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)
        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)
        header = list(set(header))
        header.sort()
        with open(csv_file_path, 'wt+') as f:#wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)
        print("Just completed writing csv file with %d columns" % len(header))
How do I convert the Cyrillic text correctly? I also want to skip bad characters.
You need to know which Cyrillic encoding is used by the file you are going to open.
For example, this is enough in Python 3:
with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
    structure = json.loads(data)
In Python 3 the data variable is then a str (Unicode text), already decoded from the encoding you passed to open. In Python 2 there might be a problem with feeding the input to json.
Also try printing a line in the Python interpreter to see whether the characters are right. Without the input file it is hard to tell whether everything is right. Also, are you sure that it is a Python problem and not an Excel one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?
The most important thing when working with encodings is checking that the input and output are right. I would suggest to look here.
Maybe you could use chardet to detect the file's encoding.
import json
import chardet
File='arq.GeoJson'
enc=chardet.detect(open(File,'rb').read())['encoding']
with open(File,'r', encoding = enc) as f:
    data=json.load(f)
This avoids having to guess the encoding.
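On the Excel side of the question: in Python 3, calling .encode() on each value stores bytes objects, and the csv module then writes their repr (such as b'\xe0\xf1'), which is a likely source of the \xe0\xf1 text seen in Excel. A minimal sketch of an alternative, assuming the rows are already flattened dicts: keep the values as str and let the output file do the encoding. write_rows is a hypothetical name, and the "utf-8-sig" choice relies on Excel using the BOM to recognise UTF-8, which is common but worth checking on your version.

```python
import csv
import os
import tempfile


def write_rows(csv_path, rows):
    """Write a list of flat dicts to CSV as UTF-8 with a BOM."""
    header = sorted({key for row in rows for key in row})
    # newline="" is what the csv module documentation asks for on text files.
    with open(csv_path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for row in rows:
            # Keep values as str; do not call .encode() on them.
            writer.writerow({key: str(value) for key, value in row.items()})


# Made-up row with Cyrillic text.
rows = [{"node_name": "Привет", "node_id": 1}]
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    write_rows(path, rows)
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(str(b"\xe0\xf1"))  # what csv would have written for a bytes value
```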

Save file without first and last double quotes

I am trying to save my data to a file. My problem is that the file I saved contains double quotes at the start and end of each line. I have tried many ways to solve it: str.replace(), strip, csv, json, pickle. However, the problem persists and I am stuck. Please help me. I will detail my problem below.
Firstly, I have a file called angles.txt like this:
{'left_w0': -2.6978887076110842, 'left_w1': -1.3257428944152834, 'left_w2': -1.7533400385498048, 'left_e0': 0.03566505327758789, 'left_e1': 0.6948932961181641, 'left_s0': -1.1665923878540039, 'left_s1': -0.6726505747192383}
{'left_w0': -2.6967382220214846, 'left_w1': -0.8440729275695802, 'left_w2': -1.7541070289428713, 'left_e0': 0.036048548474121096, 'left_e1': 0.16682041049194338, 'left_s0': -0.7731263162109375, 'left_s1': -0.7056311616210938}
I read the text file line by line and store each line in a dict variable called data. Here is the code that reads the file:
def read_data_from_file(file_name):
    data = dict()
    f = open(file_name, 'r')
    for index_line in range(1, number_lines +1):
        data[index_line] = eval(f.readline())
    f.close()
    return data
Then I changed something in the data. Something like data[index_line]['left_w0'] = data[index_line]['left_w0'] + 0.0006. After that I wrote my data into another text file. Here is the code:
def write_data_to_file(data, file_name):
    f = open(file_name, 'wb')
    data_convert = dict()
    for index_line in range(1, number_lines):
        data_convert[index_line] = repr(data[index_line])
        data_convert[index_line] = data_convert[index_line].replace('"','') # I also used strip
        json.dump(data_convert[index_line], f)
        f.write('\n')
    f.close()
The result I received in the new file is:
"{'left_w0': -2.6978887076110842, 'left_w1': -1.3257428944152834, 'left_w2': -1.7533400385498048, 'left_e0': 0.03566505327758789, 'left_e1': 0.6948932961181641, 'left_s0': -1.1665923878540039, 'left_s1': -0.6726505747192383}"
"{'left_w0': -2.6967382220214846, 'left_w1': -0.8440729275695802, 'left_w2': -1.7541070289428713, 'left_e0': 0.036048548474121096, 'left_e1': 0.16682041049194338, 'left_s0': -0.7731263162109375, 'left_s1': -0.7056311616210938}"
I cannot remove the surrounding "".
You could simplify your code by removing unnecessary transformations:
import json
def write_data_to_file(data, filename):
    with open(filename, 'w') as file:
        json.dump(data, file)
def read_data_from_file(filename):
    with open(filename) as file:
        return json.load(file)
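The quotes in the question come from handing json.dump a string (the output of repr) rather than the dict itself: a Python string is serialised as a JSON string, which is always wrapped in double quotes. A small sketch of the difference follows, using a shortened made-up line; ast.literal_eval is used here as a safer stand-in for eval when reading the original single-quoted file, which is my suggestion rather than part of the answer above.

```python
import ast
import json

line = "{'left_w0': -2.6978887076110842, 'left_s1': -0.6726505747192383}"

# The input lines are Python dict literals (single quotes), not JSON,
# so ast.literal_eval parses them without executing arbitrary code.
record = ast.literal_eval(line)
record["left_w0"] += 0.0006

quoted = json.dumps(repr(record))  # a JSON *string*: wrapped in double quotes
plain = json.dumps(record)         # a JSON *object*: no surrounding quotes

print(quoted[0], plain[0])
restored = json.loads(plain)
```

Note that the JSON output uses double quotes around the keys, so a file written this way should be read back with json.loads, not eval.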

Python UnicodeEncodeError with pre decoded UTF-8

I'm trying to parse through a bunch of logfiles (up to 4GiB) in a tar.gz file. The source files come from RedHat 5.8 Server systems and SunOS 5.10, processing has to be done on WindowsXP.
I iterate through the tar.gz files, read the files, decode the file contents to UTF-8 and parse them with regular expressions before further processing.
When I'm writing out the processed data along with the raw-data that was read from the tar.gz, I get the following error:
Traceback (most recent call last):
File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 375, in <module>
p.analyze_longtails()
File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 196, in analyze_longtails
oFile.write(entries[key]['source'] + '\n')
File "C:\Python\3.2\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24835-24836: character maps to <undefined>
Here's the part where I read and parse the logfiles:
def getSalesSoaplogEntries(perfid=None):
for tfile in parser.salestarfiles:
path = os.path.join(parser.logpath,tfile)
if os.path.isfile(path):
if tarfile.is_tarfile(path):
tar = tarfile.open(path,'r:gz')
for tarMember in tar.getmembers():
if 'salescomponent-soap.log' in tarMember.name:
tarMemberFile = tar.extractfile(tarMember)
content = tarMemberFile.read().decode('UTF-8','surrogateescape')
for m in parser.soaplogregex.finditer(content):
entry = {}
entry['time'] = datetime(datetime.now().year, int(m.group('month')), int(m.group('day')),int(m.group('hour')), int(m.group('minute')), int(m.group('second')), int(m.group('millis'))*1000)
entry['perfid'] = m.group('perfid')
entry['direction'] = m.group('direction')
entry['payload'] = m.group('payload')
entry['file'] = tarMember.name
entry['source'] = m.group(0)
sm = parser.soaplogmethodregex.match(entry['payload'])
if sm:
entry['method'] = sm.group('method')
if entry['time'] >= parser.starttime and entry['time'] <= parser.endtime:
if perfid and entry['perfid'] == perfid:
yield entry
tar.members = []
And here's the part where I write out the processed information along with the raw data (it's an aggregation of all log entries for one specific process):
if len(entries) > 0:
    time = perfentry['time']
    filename = time.isoformat('-').replace(':','').replace('-','') + 'longtail_' + perfentry['perfid'] + '.txt'
    oFile = open(os.path.join(parser.logpath,filename), 'w')
    oFile.write(perfentry['source'] +'\n')
    oFile.write('------\n')
    for key in sorted(entries.keys()):
        oFile.write('------\n')
        oFile.write(entries[key]['source'] + '\n') #<-- here it is failing
What I don't get is why, when it seems to be correct to read the files as UTF-8, it is not possible to just write them out as UTF-8. What am I doing wrong?
Your output file is using the default encoding for your OS, which is not UTF-8. Use codecs.open instead of open and specify encoding='utf-8'.
oFile = codecs.open(os.path.join(parser.logpath,filename), 'w', encoding='utf-8')
See http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
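In Python 3 (the traceback shows 3.2) the built-in open also accepts encoding and errors, so codecs.open is optional. One extra wrinkle, since the question decodes with 'surrogateescape': undecodable bytes become lone surrogates in the str, and a strict UTF-8 encoder rejects those, so the same error handler is needed when writing. A sketch with a made-up log line:

```python
import os
import tempfile

# Made-up log line: one stray Latin-1 byte (0xE9), then a UTF-8 encoded euro sign.
raw = b"caf\xe9 \xe2\x82\xac"
text = raw.decode("utf-8", "surrogateescape")

try:
    text.encode("utf-8")  # strict encoding rejects the lone surrogate
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Same codec and error handler on the way out as on the way in.
    with open(path, "w", encoding="utf-8", errors="surrogateescape", newline="") as out:
        out.write(text)
    with open(path, "rb") as check:
        written = check.read()
finally:
    os.remove(path)

print(written == raw)
```

With both settings the original bytes round-trip unchanged, including the ones that were never valid UTF-8.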

python display unicode in html

I'm writing a script to export my links and their titles from Chrome to HTML.
Chrome bookmarks are stored as JSON, in UTF-8 encoding.
Some titles are in Russian, so they are stored like this:
"name": "\u0425\u0430\u0431\u0440\ ..."
import codecs
f = codecs.open("chrome.json","r", "utf-8")
data = f.readlines()
urls = [] # for links
names = [] # for link titles
ind = 0
for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1
fw = codecs.open("chrome.html","w","utf-8")
fw.write("<html><body>\n")
for n in names:
    fw.write(n + '<br>')
    # print type(n) # this will return <type 'unicode'> for each url!
fw.write("</body></html>")
Now, in chrome.html these are displayed as \u0425\u0430\u0431...
How can I turn them back into Russian?
I'm using Python 2.5.
**Edit: Solved!**
s = '\u041f\u0440\u0438\u0432\u0435\u0442 world!'
type(s)
<type 'str'>
print s.decode('raw-unicode-escape').encode('utf-8')
Привет world!
That's what I needed, to convert str of \u041f... into unicode.
f = open("chrome.json", "r")
data = f.readlines()
f.close()
urls = [] # for links
names = [] # for link titles
ind = 0
for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1
fw = open("chrome.html","w")
fw.write("<html><body>\n")
for n in names:
    fw.write(n.decode('raw-unicode-escape').encode('utf-8') + '<br>')
fw.write("</body></html>")
By the way, it's not just Russian; non-ASCII characters are quite common in page names. Example:
name=u'Python Programming Language \u2013 Official Website'
url=u'http://www.python.org/'
As an alternative to fragile code like
urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
# (1) relies on name being 2 lines before url
# (2) fails if there is a `"` in the name
# example: "name": "The \"Fubar\" website",
you could process the input file using the json module. For Python 2.5, you can get simplejson.
Here's a script that emulates yours:
try:
    import json
except ImportError:
    import simplejson as json
import sys
def convert_file(infname, outfname):
    def explore(folder_name, folder_info):
        for child_dict in folder_info['children']:
            ctype = child_dict.get('type')
            name = child_dict.get('name')
            if ctype == 'url':
                url = child_dict.get('url')
                # print "name=%r url=%r" % (name, url)
                fw.write(name.encode('utf-8') + '<br>\n')
            elif ctype == 'folder':
                explore(name, child_dict)
            else:
                print "*** Unexpected ctype=%r ***" % ctype
    f = open(infname, 'rb')
    bmarks = json.load(f)
    f.close()
    fw = open(outfname, 'w')
    fw.write("<html><body>\n")
    for folder_name, folder_info in bmarks['roots'].iteritems():
        explore(folder_name, folder_info)
    fw.write("</body></html>")
    fw.close()
if __name__ == "__main__":
    convert_file(sys.argv[1], sys.argv[2])
Tested using Python 2.5.4 on Windows 7 Pro.
It's a JSON file, so read it using a JSON parser. That will give you a Unicode string directly, without you having to unescape it. This is going to be much more reliable (as well as simpler), since JSON strings are not the same format as Python strings.
(They're pretty similar and both use the \u format, but your current code will fall over badly for other escaped characters, not to mention that it relies on the exact attribute order and whitespace settings of a JSON file, which makes it very fragile indeed.)
import json, cgi, codecs
with open('chrome.json') as fp:
    bookmarks= json.load(fp)
with codecs.open('chrome.html', 'w', 'utf-8') as fp:
    fp.write(u'<html><body>\n')
    for root in bookmarks[u'roots'].values():
        for child in root['children']:
            fp.write(u'<a href="%s">%s</a><br>\n' % (
                cgi.escape(child[u'url']),
                cgi.escape(child[u'name'])
            ))
    fp.write(u'</body></html>')
Note also the use of cgi.escape to HTML-encode any < or & characters in the strings.
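For anyone reading this on Python 3: cgi.escape was removed in Python 3.8, and html.escape is the replacement. Below is a rough Python 3 sketch of the same idea that also descends into nested folders. iter_links and bookmarks_to_html are hypothetical names, and the sample data only mimics the shape the answers above assume for Chrome's bookmark file.

```python
import html
import json


def iter_links(node):
    """Yield (url, name) pairs from a bookmark node and all nested folders."""
    if node.get("type") == "url":
        yield node["url"], node["name"]
    for child in node.get("children", []):
        yield from iter_links(child)


def bookmarks_to_html(bookmarks):
    """Build an HTML page with one link per bookmark."""
    lines = ['<html><head><meta charset="utf-8"></head><body>']
    for root in bookmarks["roots"].values():
        if isinstance(root, dict):  # skip any non-folder entries under "roots"
            for url, name in iter_links(root):
                lines.append('<a href="%s">%s</a><br>' % (
                    html.escape(url, quote=True), html.escape(name)))
    lines.append("</body></html>")
    return "\n".join(lines)


# Made-up data; the raw string keeps the \u escapes for the JSON parser to decode.
sample = json.loads(r'''
{"roots": {"bookmark_bar": {"type": "folder", "name": "Bar", "children": [
    {"type": "url", "name": "\u0425\u0430\u0431\u0440 & \"co\"", "url": "http://example.com/?a=1&b=2"}
]}}}
''')
page = bookmarks_to_html(sample)
print(page)
```

To save the result, open the output file with open('chrome.html', 'w', encoding='utf-8') so the Russian titles are written as UTF-8.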
I'm not sure where you're trying to display the Russian text, but in the interpreter you can do the following to see it:
s = '\u0425\u0430\u0431'
l = s.split('\u')
l.remove('')
for x in l:
    print(unichr(int(x, 16))),
This will give the following output:
Х а б
If you're storing it in html, better off to leave it as '\u0425...' until you need to convert it.
Hope this helps.
You could include the utf-8 BOM, so chrome knows to read it as utf-8, not ascii:
fw = codecs.open("chrome.html","w","utf-8")
fw.write(codecs.BOM_UTF8.decode('utf-8'))
fw.write(u'你好')
Oh, but if you open fw in python, remember to use 'utf-8-sig' to strip the BOM.
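In Python 3 the same effect is available through the "utf-8-sig" codec on the built-in open, which writes the BOM when saving and strips it when reading. A small sketch using a temporary file:

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    # "utf-8-sig" writes the BOM once at the start of the file.
    with open(path, "w", encoding="utf-8-sig") as fw:
        fw.write("你好")
    with open(path, "rb") as f:
        raw = f.read()
    with open(path, encoding="utf-8") as f:
        with_bom = f.read()     # plain utf-8 keeps the BOM as U+FEFF
    with open(path, encoding="utf-8-sig") as f:
        without_bom = f.read()  # utf-8-sig strips it
finally:
    os.remove(path)

print(raw.startswith(codecs.BOM_UTF8))
```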
Maybe you need to encode the unicode into utf-8, but I think codecs does that already, right:
