PyCharm console unicode to readable string - Python

I am studying Python with this tutorial.
The problem is that when I try to get Cyrillic characters, I get Unicode escape sequences in the PyCharm console.
import requests
from bs4 import BeautifulSoup
import operator
import codecs

def start(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code)
    for post_text in soup.findAll('a', {'class': 'b-tasks__item__title js-set-visited'}):
        content = post_text.string
        words = content.lower().split()
        for each_word in words:
            word_list.append(each_word)
    clean_up_list(word_list)
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "!##$%^&*()_+{}|:<>?,./;'[]\=-\""
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    create_dictionary(clean_word_list)
def create_dictionary(clean_word_list):
    word_count = {}
    for word in clean_word_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    for key, value in sorted(word_count.items(), key=operator.itemgetter(1)):
        print(key, value)
When I change print(key, value) to print(key.decode('utf8'), value) I get "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)".
start('https://youdo.com/tasks-all-opened-all-moscow-1')
There are some suggestions on the internet about changing the encoding in certain configuration files, but I don't really get it. Can't I just read it in the console? I am on OS X.
UPD
key.encode("utf-8")
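On Python 2, one workaround for that console error (a sketch, assuming stdout's encoding is being detected as ascii) is to wrap sys.stdout so unicode strings are encoded as UTF-8 on output instead of raising:
# A minimal sketch (Python 2): encode unicode as UTF-8 on the way to
# stdout instead of letting the ascii codec raise UnicodeEncodeError.
import sys
import codecs

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'\u041f\u0440\u0438\u0432\u0435\u0442'  # prints: Привет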

UTF-8 is sometimes painful. I created a file with a line in Latin characters and another one with Russian ones. The following code:
# encoding: utf-8
with open("testing.txt", "r", encoding='utf-8') as f:
    line = f.read()
print(line)
outputs both lines correctly in PyCharm. Note the two encoding entries: the coding comment at the top of the script and the encoding argument to open().
Since you are getting data from a web page, you must make sure that you use the right encoding as well. The following code
# encoding: utf-8
r = requests.get('http://www.pravda.ru/')
r.encoding = 'utf-8'
print(r.text)
also outputs correctly in PyCharm.
Please note that you must explicitly set the encoding to match that of the page.
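If you don't know the page's encoding up front, requests can also guess it from the response body; a minimal sketch using the library's apparent_encoding attribute:
# A minimal sketch: let requests guess the encoding from the response
# body instead of hard-coding it.
import requests

r = requests.get('http://www.pravda.ru/')
r.encoding = r.apparent_encoding  # guessed by charset detection
print(r.text)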

Related

Saving text with Polish characters (utf-8) to a textfile from JSON in Python

I am trying to save a conversation from Messenger to a text file, including things like timestamps and senders.
In the JSON file downloaded from Messenger, the emojis and Polish characters appear as escaped UTF-8 byte sequences (e.g. "ą" as \xc4\x85).
After executing this program:
import json
from datetime import datetime

messages = []
jsonfiles = ["message_1.json","message_2.json","message_3.json","message_4.json","message_5.json", "message_6.json","message_7.json","message_8.json","message_9.json","message_10.json","message_11.json"]

def filldict(textfile, jsonfile):
    with open(textfile, "a", encoding="utf-8") as w:
        with open(jsonfile, "r", encoding="utf-8") as j:
            data = json.load(j)
            i = 0
            while i < len(data["messages"]):
                message = {}
                if "content" in data["messages"][len(data["messages"])-1-i]:
                    stamp = int(data["messages"][len(data["messages"])-1-i]["timestamp_ms"])
                    date = datetime.fromtimestamp(stamp/1000)
                    message['timestamp'] = stamp
                    message['date'] = date
                    w.write(str(date))
                    w.write(" ")
                    w.write(data["messages"][len(data["messages"])-1-i]["sender_name"])
                    message['sender'] = data["messages"][len(data["messages"])-1-i]["sender_name"]
                    w.write(": ")
                    if "content" in str(data["messages"][len(data["messages"])-1-i]):
                        w.write(data["messages"][len(data["messages"])-1-i]["content"])
                        message['content'] = data["messages"][len(data["messages"])-1-i]["content"]
                    w.write("\n")
                i += 1
                messages.append(message)
                message = {}

j = len(jsonfiles)
while j > 0:
    filldict("messages11.txt", jsonfiles[j-1])
    j -= 1
print("process finished")
the output text file contains those UTF-8 byte sequences instead of the characters they represent. What can I do to fix this and display the Polish characters (and, if possible, emojis) in the text file? I thought that including encoding='utf-8' would be enough. Thank you for any clues.
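Judging by the "ą" as \xc4\x85 example, this is the well-known mojibake in Facebook's JSON exports: each string escapes the UTF-8 bytes of a character as separate Latin-1 characters. A commonly cited fix (a sketch, assuming your export has this exact issue) is to round-trip every decoded string through Latin-1:
# A minimal sketch: recover "ą" from the two Latin-1 characters that
# actually carry its UTF-8 bytes (assumes the usual Messenger mojibake).
def fix_mojibake(s):
    return s.encode('latin-1').decode('utf-8')

print(fix_mojibake(u'\u00c4\u0085'))  # prints: ą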

Loading a json file in python

I've got multiple files to load as JSON. They are all formatted the same way, but one of them can't be loaded without raising an exception. This is where you can find the file:
File
I wrote the following code:
def from_seed_data_extract_summoners():
    summonerIds = set()
    for i in range(1,11):
        file_name = 'data/matches%s.json' % i
        print file_name
        with open(file_name) as data_file:
            data = json.load(data_file)
            for match in data['matches']:
                for summoner in match['participantIdentities']:
                    summonerIds.add(summoner['player']['summonerId'])
    return summonerIds
The error occurs when I call json.load(data_file). I suppose there is a special character, but I can't find it and don't know how to replace it. The error generated is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xeb in position 6: invalid continuation byte
Do you know how I can get rid of it?
json.load decodes everything into unicode, and your file contains an embedded byte (probably in a space or something not very noticeable) that is not valid UTF-8.
How to get string objects instead of Unicode ones from JSON in Python?
That is a great thread about making JSON objects more manageable in Python.
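For reference, the approach discussed in that thread boils down to a recursive converter along these lines (a Python 2 sketch, not the thread's exact code):
# A minimal sketch (Python 2): recursively turn the unicode values that
# json.loads produces back into UTF-8 byte strings.
def byteify(obj):
    if isinstance(obj, dict):
        return {byteify(k): byteify(v) for k, v in obj.iteritems()}
    if isinstance(obj, list):
        return [byteify(item) for item in obj]
    if isinstance(obj, unicode):
        return obj.encode('utf-8')
    return obj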
Replace file_name = 'data/matches%s.json' % i with file_name = 'data/matches%i.json' % i.
Note that json.load() expects an open file object rather than a path, so keep the form:
with open(file_name) as data_file:
    data = json.load(data_file)
EDIT:
def from_seed_data_extract_summoners():
    summonerIds = set()
    for i in range(1,11):
        file_name = 'data/matches%i.json' % i
        with open(file_name) as f:
            data = json.load(f, encoding='utf-8')
            for match in data['matches']:
                for summoner in match['participantIdentities']:
                    summonerIds.add(summoner['player']['summonerId'])
    return summonerIds
Try:
json.loads(unicode(data_file.read(), errors='ignore'))
or:
json.loads(unidecode.unidecode(unicode(data_file.read(), errors='ignore')))
(for the second, you would need to install unidecode)
Try:
json.loads(data_file.read(), encoding='utf-8')
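If ignoring errors is too lossy, another possibility (an assumption based on the error message: byte 0xeb is 'ë' in Latin-1) is that the file simply isn't UTF-8 at all, in which case decoding it as Latin-1 always succeeds:
# A minimal sketch, assuming the offending file is really Latin-1
# (0xeb is 'ë' there); every byte maps to some character, so the
# decode cannot fail.
import io
import json

with io.open('data/matches1.json', encoding='latin-1') as f:
    data = json.load(f)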

Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte

I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); here is the full json-to-csv.py file.
The conversion works if I set
#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8','ignore')
and
fp = open(json_file_path, 'r', encoding='utf-8')
but when I import the CSV into MS Excel I see mangled Cyrillic characters, for example \xe0\xf1; English text is fine.
I experimented with encode('cp1251','ignore'), but then I get an error:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
import sys
import json
import csv

##
# This function converts an item like
# {
#   "item_1":"value_11",
#   "item_2":"value_12",
#   "item_3":"value_13",
#   "item_4":["sub_value_14", "sub_value_15"],
#   "item_5":{
#       "sub_item_1":"sub_item_value_11",
#       "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
#   }
# }
# To
# {
#   "node_item_1":"value_11",
#   "node_item_2":"value_12",
#   "node_item_3":"value_13",
#   "node_item_4_0":"sub_value_14",
#   "node_item_4_1":"sub_value_15",
#   "node_item_5_sub_item_1":"sub_item_value_11",
#   "node_item_5_sub_item_2_0":"sub_item_value_12",
#   "node_item_5_sub_item_2_1":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item

    # Reduction Condition 1
    if type(value) is list:
        i = 0
        for sub_item in value:
            reduce_item(key + '_' + str(i), sub_item)
            i = i + 1
    # Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key + '_' + str(sub_key), value[sub_key])
    # Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251', 'ignore')

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        # Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]

        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)

        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

        with open(csv_file_path, 'wt+') as f:  # wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)

        print("Just completed writing csv file with %d columns" % len(header))
How do I convert the Cyrillic correctly, and also skip bad characters?
You need to know the Cyrillic encoding of the file you are going to open.
For example, this is enough in Python 3:
with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
    structure = json.loads(data)
In Python 3 the data variable is then a proper Unicode string. In Python 2 there might be problems feeding the input to json.
Also try printing a line in the Python interpreter to see whether the symbols are right. Without the input file it is hard to tell whether everything is correct. Are you sure it is a Python problem and not an Excel one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?
The most important thing when working with encodings is checking that both input and output are right. I would suggest looking here.
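On the Excel side specifically, a common fix is to keep everything in UTF-8 and write the CSV with a BOM ('utf-8-sig'), which Excel uses to pick the right code page. A minimal Python 3 sketch, assuming the flattening step has already produced a list of flat dicts called rows:
# A minimal sketch (Python 3): write the CSV as 'utf-8-sig' so Excel
# sees the BOM and decodes Cyrillic correctly. 'rows' is assumed to be
# the already-flattened list of dicts.
import csv

def write_csv(rows, csv_file_path):
    header = sorted({key for row in rows for key in row})
    with open(csv_file_path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(rows)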
Maybe you could use chardet to detect the file's encoding:
import chardet
import json

File = 'arq.GeoJson'
enc = chardet.detect(open(File, 'rb').read())['encoding']
with open(File, 'r', encoding=enc) as f:
    data = json.load(f)
This avoids having to guess the encoding.

Python: Incorrect dictionary encoding

I'm building a dictionary that contains words with characters like č, ě, á (Czech alphabet). When I print these words to the console before adding them to the dictionary, I see them correctly encoded. The problem is that when I add a word to the dictionary and print it as a value, I see it in the wrong encoding. Here is a screenshot of my console; the first row is print name and the second row is print dict.
These words / sentences should be the same.
Info:
PyCharm IDE, Python 2.7.8, default encoding: "utf-8"
Thank you for your advice!
EDIT: Attaching the code ('url' is the url of the web page):
def getSoup(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    page = response.read()
    soup = BeautifulSoup(page, 'xml')
    return soup

a = 0
klubyDict = dict()
index = getSoup("url")
all = index.findAll('A')

for i in all:
    okres = getSoup("http://url%s" % (i['HREF']))
    kluby = okres.findAll('A')
    # print(kluby[0]['HREF'])
    print "Okrsok...%s" % (i.text)
    for klub in kluby:
        klubHtml = getSoup("http://url%s" % (klub['HREF']))
        name = klub.text
        print name
        emailTag = klubHtml.find('td', text=re.compile("Email:"))
        email = emailTag.text[7:]
        if len(name) > 0:
            klubyDict[name] = email if len(email) > 0 else "email nezadany"

print klubyDict
print "Saving to file..."
with open('futbaloveKluby', 'wb') as f:
    pickle.dump(klubyDict, f)
EDIT 2: Adding the data to the Excel file:
# -*- coding: utf-8 -*-
import cPickle as pickle
import xlsxwriter

dict = dict()
workbook = xlsxwriter.Workbook('Futbal.xlsx')
worksheet = workbook.add_worksheet()

with open('futbaloveKluby', 'rb') as f:
    dict = pickle.load(f)

colKlub = 0
colEmail = 1
row = 0

for klub in dict.keys():
    worksheet.write(row, colKlub, klub)
    worksheet.write(row, colEmail, dict[klub])
    row += 1

workbook.close()
The main thing is that after this code runs, I put the values of this dictionary into an Excel table using XlsxWriter. When I open the Excel file I see wrong characters.
Import codecs and use this code for your name:
name = klub.text
print name.decode('utf-8')
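It may also help to know that the console output can be misleading here: in Python 2, printing a whole dict shows the repr() of its keys and values, so correctly stored unicode appears as escape sequences. A minimal sketch (the dict contents are hypothetical):
# -*- coding: utf-8 -*-
# A minimal sketch (Python 2): the dict stores the text correctly;
# only the container's repr() shows escapes.
d = {u'klub': u'Plzeň'}
print d            # {u'klub': u'Plze\u0148'} -- repr of the contents
print d[u'klub']   # Plzeň -- the actual value is fine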

python display unicode in html

I'm writing a script to export my links and their titles from Chrome to HTML.
Chrome bookmarks are stored as JSON, in UTF-8 encoding.
Some titles are in Russian, so they are stored like this:
"name": "\u0425\u0430\u0431\u0440\ ..."
import codecs

f = codecs.open("chrome.json", "r", "utf-8")
data = f.readlines()

urls = []   # for links
names = []  # for link titles
ind = 0

for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1

fw = codecs.open("chrome.html", "w", "utf-8")
fw.write("<html><body>\n")
for n in names:
    fw.write(n + '<br>')
    # print type(n)  # this will return <type 'unicode'> for each url!
fw.write("</body></html>")
Now, in chrome.html those titles are displayed literally as \u0425\u0430\u0431...
How can I turn them back into Russian?
Using Python 2.5.
**Edit: Solved!**
>>> s = '\u041f\u0440\u0438\u0432\u0435\u0442 world!'
>>> type(s)
<type 'str'>
>>> print s.decode('raw-unicode-escape').encode('utf-8')
Привет world!
That's what I needed: converting a str of \u041f... escapes into unicode.
f = open("chrome.json", "r")
data = f.readlines()
f.close()
urls = [] # for links
names = [] # for link titles
ind = 0
for i in data:
if i.find('"url":') != -1:
urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
ind += 1
fw = open("chrome.html","w")
fw.write("<html><body>\n")
for n in names:
fw.write(n.decode('raw-unicode-escape').encode('utf-8') + '<br>')
fw.write("</body></html>")
By the way, it's not just Russian; non-ASCII characters are quite common in page names. Example:
name=u'Python Programming Language \u2013 Official Website'
url=u'http://www.python.org/'
As an alternative to fragile code like
urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
# (1) relies on name being 2 lines before url
# (2) fails if there is a `"` in the name
# example: "name": "The \"Fubar\" website",
you could process the input file using the json module. For Python 2.5, you can get simplejson.
Here's a script that emulates yours:
try:
import json
except ImportError:
import simplejson as json
import sys
def convert_file(infname, outfname):
def explore(folder_name, folder_info):
for child_dict in folder_info['children']:
ctype = child_dict.get('type')
name = child_dict.get('name')
if ctype == 'url':
url = child_dict.get('url')
# print "name=%r url=%r" % (name, url)
fw.write(name.encode('utf-8') + '<br>\n')
elif ctype == 'folder':
explore(name, child_dict)
else:
print "*** Unexpected ctype=%r ***" % ctype
f = open(infname, 'rb')
bmarks = json.load(f)
f.close()
fw = open(outfname, 'w')
fw.write("<html><body>\n")
for folder_name, folder_info in bmarks['roots'].iteritems():
explore(folder_name, folder_info)
fw.write("</body></html>")
fw.close()
if __name__ == "__main__":
convert_file(sys.argv[1], sys.argv[2])
Tested using Python 2.5.4 on Windows 7 Pro.
It's a JSON file, so read it using a JSON parser. That will give you a Unicode string directly, without you having to unescape it. This is going to be much more reliable (as well as simpler), since JSON strings are not the same format as Python strings.
(They're pretty similar and both use the \u format, but your current code will fall over badly for other escaped characters, not to mention that it relies on the exact attribute order and whitespace settings of a JSON file, which makes it very fragile indeed.)
import json, cgi, codecs

with open('chrome.json') as fp:
    bookmarks = json.load(fp)

with codecs.open('chrome.html', 'w', 'utf-8') as fp:
    fp.write(u'<html><body>\n')
    for root in bookmarks[u'roots'].values():
        for child in root['children']:
            fp.write(u'<a href="%s">%s</a><br>\n' % (
                cgi.escape(child[u'url']),
                cgi.escape(child[u'name'])
            ))
    fp.write(u'</body></html>')
Note also the use of cgi.escape to HTML-encode any < or & characters in the strings.
I'm not sure where you're trying to display the Russian text, but in the interpreter you can do the following to see it:
s = '\u0425\u0430\u0431'
l = s.split('\u')
l.remove('')
for x in l:
    print(unichr(int(x, 16))),
This will give the following output:
Х а б
If you're storing it in HTML, you're better off leaving it as '\u0425...' until you need to convert it.
Hope this helps.
You could include the UTF-8 BOM, so Chrome knows to read it as UTF-8, not ASCII:
fw = codecs.open("chrome.html","w","utf-8")
fw.write(codecs.BOM_UTF8.decode('utf-8'))
fw.write(u'你好')
Oh, but if you open the file again in Python, remember to use 'utf-8-sig' to strip the BOM.
You might think you need to encode the unicode into UTF-8 yourself, but codecs already does that on write.
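A quick sketch to confirm that (the file name is just an example):
# A minimal sketch: a file opened with codecs.open accepts unicode
# directly and encodes it on write, so no explicit .encode('utf-8')
# is needed.
import codecs

with codecs.open('out.html', 'w', 'utf-8') as fw:
    fw.write(u'\u4f60\u597d')  # written to disk as UTF-8 bytes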
