Python using json to read a string with emoticons - python

I have a giant .json file.
I'm reading it with:
import json
json_data=open('file.json')
data = json.load(json_data)
for item in data['payload']['actions']:
    print item['author']
    print item['action_id']
    print item['body']
json_data.close()
Eventually one of the item['body'] values contains this string (these are actually Facebook emoticons):
words words stuff stuff\ud83c\udf89\ud83c\udf8a\ud83c\udf87\ud83c\udf86\ud83c\udf08\ud83d\udca5\u2728\ud83d\udcab\ud83d\udc45\ud83d\udeb9\ud83d\udeba\ud83d\udc83\ud83d\ude4c\ud83c\udfc3\ud83d\udc6c
which makes it give this error:
Traceback (most recent call last):
File "curse.py", line 15, in <module>
print item['body']
File "C:\python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 35-63: character maps to <undefined>
Is there a way to make it ignore these?

You can use string.printable (the printable ASCII characters) to drop whatever the console cannot encode:
import string
try:
    print item['body']
except UnicodeEncodeError:
    print(''.join(c for c in item['body'] if c in string.printable))
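A hedged alternative sketch: instead of filtering character by character, the text can be round-tripped through the console encoding with errors="replace". That keeps accented letters the code page does support and turns everything else into "?". The helper names below are made up for illustration, the default cp437 is taken from the traceback in the question, and the snippet assumes Python 3 (where one emoji is a single code point).

```python
import string

def console_safe(text, encoding="cp437"):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

def ascii_printable_only(text):
    """Drop every character that is not printable ASCII (the string.printable approach)."""
    return "".join(c for c in text if c in string.printable)

sample = "caf\u00e9 party \U0001F389"

# cp437 has an e-acute, so only the emoji is replaced.
replaced = console_safe(sample)

# string.printable is ASCII only, so the e-acute is dropped as well.
filtered = ascii_printable_only(sample)
```

The trade-off is that the filter silently loses accented letters, while the round trip only loses what the console genuinely cannot show.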

Related

Unicode-encode issues while sending desktop notification using Python

I am fetching latest football scores from a website and sending a notification on the desktop (OS X). I am using BeautifulSoup to scrape the data. I had issues with the unicode data which was generating this error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128).
So I inserted this at the beginning which solved the problem while outputting on the terminal.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
But the problem exists when I am sending notifications on the desktop. I use terminal-notifier to send desktop-notifications.
def notify (title, subtitle, message):
    t = '-title {!r}'.format(title)
    s = '-subtitle {!r}'.format(subtitle)
    m = '-message {!r}'.format(message)
    os.system('terminal-notifier {}'.format(' '.join((m, t, s))))
The images below depict the output on the terminal vs. the desktop notification.
[Image: output on terminal]
[Image: desktop notification]
Also, if I try to replace the comma in the string, I get this error:
new_scorer = str(new_scorer[0].text).replace(",","")
File "live_football_bbc01.py", line 41, in get_score
new_scorer = str(new_scorer[0].text).replace(",","")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
How do I get the output on the desktop notifications like the one on the terminal? Thanks!
Edit : Snapshot of the desktop notification. (Solved)
You are formatting using !r, which gives you the repr output. Forget the terrible reload logic and either use unicode everywhere:
def notify (title, subtitle, message):
    t = u'-title {}'.format(title)
    s = u'-subtitle {}'.format(subtitle)
    m = u'-message {}'.format(message)
    os.system(u'terminal-notifier {}'.format(u' '.join((m, t, s))))
or encode:
def notify (title, subtitle, message):
    t = '-title {}'.format(title.encode("utf-8"))
    s = '-subtitle {}'.format(subtitle.encode("utf-8"))
    m = '-message {}'.format(message.encode("utf-8"))
    os.system('terminal-notifier {}'.format(' '.join((m, t, s))))
When you call str(new_scorer[0].text).replace(",","") you are implicitly encoding to ascii; you need to specify the encoding to use:
In [13]: s1=s2=s3= u'\xfc'
In [14]: str(s1) # tries to encode to ascii
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-14-589849bdf059> in <module>()
----> 1 str(s1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
In [15]: "{}".format(s1) + "{}".format(s2) + "{}".format(s3) # tries to encode to ascii
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-15-7ca3746f9fba> in <module>()
----> 1 "{}".format(s1) + "{}".format(s2) + "{}".format(s3)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
You can encode straight away:
In [16]: "{}".format(s1.encode("utf-8")) + "{}".format(s2.encode("utf-8")) + "{}".format(s3.encode("utf-8"))
Out[16]: '\xc3\xbc\xc3\xbc\xc3\xbc'
Or use unicode throughout, prepending a u to the format strings and encoding last:
In [17]: out = u"{}".format(s1) + u"{}".format(s2) + u"{}".format(s3)
In [18]: out
Out[18]: u'\xfc\xfc\xfc'
In [19]: out.encode("utf-8")
Out[19]: '\xc3\xbc\xc3\xbc\xc3\xbc'
If you use !r you are always going to get the repr, with its escaped bytes, in the output:
In [30]: print "{}".format(s1.encode("utf-8"))
ü
In [31]: print "{!r}".format(s1).encode("utf-8")
u'\xfc'
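For comparison, a small sketch of the same conversion flags under Python 3 (an assumption; the session above is Python 2). There the repr no longer escapes ü, but !r still wraps the text in quotes, and !a gives the escaped form seen above. The sample name is made up.

```python
name = "M\u00fcller"

plain = "{}".format(name)      # str(): the text itself
quoted = "{!r}".format(name)   # repr(): the text wrapped in quotes
escaped = "{!a}".format(name)  # ascii(): quotes plus escaped non-ASCII characters

# Lengths make the difference visible without printing non-ASCII text.
print(len(plain), len(quoted), len(escaped))  # prints: 6 8 11
```

Either way, a conversion flag puts a debugging representation into the command line, which is why the notification shows quotes and escapes.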
You can also pass the args using subprocess:
from subprocess import check_call
def notify (title, subtitle, message):
    check_call(['terminal-notifier','-title',title.encode("utf-8"),
                '-subtitle',subtitle.encode("utf-8"),
                '-message',message.encode("utf-8")])
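To illustrate why the argument-list form is more robust, here is a minimal sketch assuming Python 3, where str is already unicode and no explicit encode is needed. The helper build_notifier_args and the sample score text are hypothetical; the -title, -subtitle and -message flags are the ones used in the question.

```python
import subprocess

def build_notifier_args(title, subtitle, message):
    """Build an argument list; each value stays a single argument, whatever it contains."""
    return ["terminal-notifier",
            "-title", title,
            "-subtitle", subtitle,
            "-message", message]

def notify(title, subtitle, message):
    # No shell parses the command, so quotes, spaces and non-ASCII text need no escaping.
    subprocess.check_call(build_notifier_args(title, subtitle, message))

# The apostrophe and the u-umlaut pass through untouched.
args = build_notifier_args("Bayern 1-0", "Goal", "M\u00fcller 23'")
```

With os.system the same message would have to be quoted by hand, and a stray apostrophe in a scorer's name would break the command.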
Use `sys.getfilesystemencoding()` to get your encoding.
Encode your string with it, ignoring or replacing errors:
import sys
encoding = sys.getfilesystemencoding()
msg = new_scorer[0].text.replace(",", "")
print(msg.encode(encoding, errors="replace"))

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 4: ordinal not in range(256)

I am using eyeD3 to edit metadata of mp3 files. I am unable to set lyrics tag.
def fetch_lyrics(title, artist):
    URL='http://makeitpersonal.co/lyrics?artist=%s&title=%s'
    webaddr=(URL %(artist, title)).replace(" ", "%20")
    print webaddr
    response = requests.get(webaddr)
    if response.content=="Sorry, We don't have lyrics for this song yet.":
        return 0
    else:
        return response.content
def get_lyrics(pattern, path=os.getcwd()):
    files=find(pattern, path)
    matches = len(files)
    if matches==1:
        tag = eyeD3.Tag()
        tag.link(files[0])
        lyrics = tag.getLyrics()
        if lyrics:
            for l in lyrics:
                print l.lyrics
        else:
            print "Lyrics not found. Searching online..."
            tag = eyeD3.Tag()
            tag.link(files[0])
            artist = tag.getArtist()
            title = tag.getTitle()
            l = fetch_lyrics(title, artist)
            if l==0:
                print "No matches found."
            else:
                #print l
                tag.addLyrics(l.decode('utf-8'))
                tag.update()
The traceback that I got is:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "lyrics.py", line 99, in get_lyrics
    tag.update()
  File "/usr/lib/python2.7/dist-packages/eyeD3/tag.py", line 526, in update
    self.__saveV2Tag(version);
  File "/usr/lib/python2.7/dist-packages/eyeD3/tag.py", line 1251, in __saveV2Tag
    raw_frame = f.render();
  File "/usr/lib/python2.7/dist-packages/eyeD3/frames.py", line 1200, in render
    self.lyrics.encode(id3EncodingToString(self.encoding))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 4: ordinal not in range(256)
I don't understand the error. Do I need to pass any other parameter to the update() or addLyrics() functions? Any help?
I imagine you're trying to write an ID3v1 (or ID3v2 single-byte) tag, which only permits latin-1.
I think I had to patch my eyeD3 once to fix that problem. Try to turn ID3v1 off and set ID3v2 to v2.4 UTF-8.
Ideally: catch, turn off ID3v1, retry. The specific problem is that the ’ quote (U+2019) has no latin-1 representation.
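The "catch, then retry with a wider encoding" idea can be sketched without touching the eyeD3 API at all. The helper names below are hypothetical; the point is only to show that U+2019 fails in latin-1 and succeeds in UTF-8, so a tag writer could pick the text encoding from the content.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def choose_text_encoding(text):
    """Prefer latin-1 when it suffices, otherwise fall back to UTF-8."""
    return "latin-1" if can_encode(text, "latin-1") else "utf-8"

plain_lyrics = u"caf\u00e9 au lait"   # e-acute exists in latin-1
curly_lyrics = u"don\u2019t stop"     # U+2019 does not

plain_choice = choose_text_encoding(plain_lyrics)   # "latin-1"
curly_choice = choose_text_encoding(curly_lyrics)   # "utf-8"
```

How the chosen encoding is then handed to the tagging library depends on the eyeD3 version, which this sketch deliberately leaves out.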

convert pdf to text file in python

My code works perfectly for some PDFs, but others raise this error:
Traceback (most recent call last):
File "con.py", line 24, in <module>
print getPDFContent("abc.pdf")
File "con.py", line 17, in getPDFContent
f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)
My code is
import pyPdf
def getPDFContent(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        f=open("xxx.txt",'a')
        content= pdf.getPage(i).extractText() + "\n"
        import string
        c=content.split()
        for a in c:
            f.write(" ")
            f.write(a)
        f.write('\n')
        f.close()
    return content
print getPDFContent("abc.pdf")
Your problem is that when you call f.write() with a unicode string, Python 2 tries to encode it using the ascii codec. Your PDF contains characters that cannot be represented in ascii. Try explicitly encoding the string, e.g.
a = a.encode('utf-8')
f.write(a)
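Another way to get the same effect, sketched here under the assumption that io.open is available (Python 2.6+ and Python 3): open the file with an explicit encoding and let the file object do the encoding on every write. The function name write_words and the temporary path are made up for the example; the sample text uses the U+02DD character from the traceback.

```python
import io
import os
import tempfile

def write_words(path, text):
    """Append the whitespace-separated words of text to path, encoded as UTF-8."""
    with io.open(path, "a", encoding="utf-8") as f:
        for word in text.split():
            f.write(u" ")
            f.write(word)
        f.write(u"\n")

# Write a string containing U+02DD, then read it back to check the round trip.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_words(path, u"na\u02ddve text")
with io.open(path, encoding="utf-8") as f:
    result = f.read()
```

This keeps unicode inside the program and confines the encoding decision to the one place where the file is opened.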
Try
import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())

Python script won't work on Autokey

I'm trying to make an HTML entities encoder/decoder in Python that behaves similarly to PHP's htmlentities and html_entity_decode. It works normally as a standalone script:
My input:
Lorem ÁÉÍÓÚÇÃOÁáéíóúção ##$%*()[]<>+ 0123456789
python decode.py
Output:
Lorem ÁÉÍÓÚÇÃOÁáéíóúção ##$%*()[]<>+ 0123456789
Now if I run it as an Autokey script I get this error:
Script name: 'html_entity_decode'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/autokey/service.py", line 454, in execute
exec script.code in scope
File "<string>", line 40, in <module>
File "/usr/local/lib/python2.7/dist-packages/autokey/scripting.py", line 42, in send_keys
self.mediator.send_string(keyString.decode("utf-8"))
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-12: ordinal not in range(128)
What am I doing wrong? Here's the script:
import htmlentitydefs
import re
entity_re = re.compile(r'&(%s|#(\d{1,5}|[xX]([\da-fA-F]{1,4})));' % '|'.join(
    htmlentitydefs.name2codepoint.keys()))
def html_entity_decode(s, encoding='utf-8'):
    if not isinstance(s, basestring):
        raise TypeError('argument 1: expected string, %s found' \
                        % s.__class__.__name__)
    def entity_2_unichr(matchobj):
        g1, g2, g3 = matchobj.groups()
        if g3 is not None:
            codepoint = int(g3, 16)
        elif g2 is not None:
            codepoint = int(g2)
        else:
            codepoint = htmlentitydefs.name2codepoint[g1]
        return unichr(codepoint)
    if isinstance(s, unicode):
        entity_2_chr = entity_2_unichr
    else:
        entity_2_chr = lambda o: entity_2_unichr(o).encode(encoding,
                                                           'xmlcharrefreplace')
    def silent_entity_replace(matchobj):
        try:
            return entity_2_chr(matchobj)
        except ValueError:
            return matchobj.group(0)
    return entity_re.sub(silent_entity_replace, s)
text = clipboard.get_selection()
text = html_entity_decode(text)
keyboard.send_keys("%s" % text)
I found it on a Gist https://gist.github.com/607454, I'm not the author.
Looking at the backtrace, the likely problem is that you are passing a unicode string to keyboard.send_keys, which expects a UTF-8 encoded bytestring. autokey then tries to decode your string, which fails because the input is unicode instead of UTF-8. This looks like a bug in autokey: it should not try to decode strings unless they are really plain (byte) strings.
If this guess is correct, you should be able to work around this by making sure you pass a UTF-8 encoded str instance to send_keys. Try something like this:
text = clipboard.get_selection()
if isinstance(text, unicode):
    text = text.encode('utf-8')
text = html_entity_decode(text)
assert isinstance(text, str)
keyboard.send_keys(text)
The assert is not needed but is a handy sanity check to make sure html_entity_decode does the right thing.
The problem is that the output of:
clipboard.get_selection()
is a unicode string.
To solve the problem, replace:
text = clipboard.get_selection()
with:
text = clipboard.get_selection().encode("utf8")
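Both answers boil down to "hand send_keys UTF-8 bytes, whatever the clipboard returned". A small sketch of that normalisation, with a hypothetical helper name; it does not call any Autokey API, and it relies on bytes being an alias of str in Python 2.6+ so the same code also runs on Python 3.

```python
def ensure_utf8_bytes(text):
    """Return text as UTF-8 encoded bytes, leaving byte strings untouched."""
    if isinstance(text, bytes):
        return text
    return text.encode("utf-8")

from_unicode = ensure_utf8_bytes(u"\u00e7\u00e3o")   # unicode input gets encoded
from_bytes = ensure_utf8_bytes(b"plain")            # byte input passes through
```

In the script above, the idea would be to wrap the result of clipboard.get_selection() in such a helper before any further processing.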

python: unicode problem

I am trying to decode a string I took from a file:
file = open ("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]
'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'
Adding "ignore" does not really help:
In [69]: data[2]
Out[69]: u'\u6700\u6100\u7200\u6400\u6500\u6e00\u2000\u6c00\u6100\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u3800\u3000\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u0900\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'
In [70]: data[2].decode("utf-8", "replace")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/Users/oleg/ in ()
/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)
In [71]:
This looks like UTF-16 data. So try
data[0].rstrip("\n").decode("utf-16")
Edit (for your update): Try to decode the whole file at once, that is
data = open(...).read()
data.decode("utf-16")
The problem is that the line breaks in UTF-16 are "\n\x00", but using readlines() will split at the "\n", leaving the "\x00" character for the next line.
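The effect is easy to reproduce on a tiny made-up sample. This sketch builds little-endian UTF-16 bytes with a BOM (like the file in the question), splits them at the newline byte the way readlines() does, and compares that with decoding first and splitting afterwards.

```python
import codecs

text = u"a\tb\nc\td\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting the raw bytes at "\n" strands the "\x00" half of every newline,
# so the later pieces start mid-character and have an odd number of bytes.
pieces = raw.split(b"\n")
odd_lengths = [len(p) % 2 for p in pieces]   # [0, 1, 1]

# Decoding the whole buffer first and splitting afterwards keeps every
# two-byte character intact.
lines = raw.decode("utf-16").splitlines()    # [u'a\tb', u'c\td']
```

Only the first piece still carries the BOM and an even byte count, which is why data[0] can be decoded but the following lines come out as garbage.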
This file is a UTF-16-LE encoded file, with an initial BOM.
import codecs
fp= codecs.open("a", "r", "utf-16")
lines= fp.readlines()
EDIT
Since you posted 2.7 this is the 2.7 solution. Decode the whole file before splitting it into lines, because iterating over the raw file splits at "\n" and cuts UTF-16 characters in half:
file = open("./Downloads/lamp-post.csv", "rb")
data = file.read().decode("utf-16", "replace").splitlines()
Ignoring undecodable characters:
file = open("./Downloads/lamp-post.csv", "rb")
data = file.read().decode("utf-16", "ignore").splitlines()
