I am trying to figure out this error that pops up from this code:
filename = os.path.join(os.path.expanduser("~"), "data", "blogs",
                        "1005545.male.25.Engineering.Sagittarius.xml")
#filename = open('C:/Users/spenc/data/blogs/1005545.male.25.Engineering.Sagittarius.xml',
#encoding='utf-8', errors = 'ignore')
all_posts = []
allPosts = []
with open(filename) as inf:
    postStart = False
    post = []
    for line in inf:
        line = line.strip()
        if line == "<post>":
            postStart = True
        elif line == "</post>":
            postStart = False
            allPosts.append("\n".join(post))
            post = []
        elif postStart:
            post.append(line)
print(allPosts[0])
print(len(allPosts))
filename.close()
and get this error:
File "D:\Anaconda-Python\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4836: character maps to <undefined>
I am just trying to get past the encoding error so that the code can print the first post and the number of posts, but it keeps getting caught up on the allPosts.append line. I am not sure of any workaround, or whether there is a newer way of doing something of this sort. I am following a textbook and can't continue with the chapter until this has been worked out.
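One possible workaround is to pass an explicit encoding to open() instead of relying on the platform default (cp1252 here, per the traceback). The sketch below assumes the file is actually UTF-8: byte 0x9d has no mapping in cp1252, but it is the last byte of the UTF-8 encoding of a right double quotation mark. The helper name read_posts and the errors="replace" choice are illustrative, not taken from the textbook.

```python
def read_posts(path, encoding="utf-8"):
    """Collect the text between <post> and </post> lines (hypothetical helper)."""
    posts = []
    current = []
    in_post = False
    # An explicit encoding avoids the platform default codec;
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(path, encoding=encoding, errors="replace") as inf:
        for line in inf:
            line = line.strip()
            if line == "<post>":
                in_post = True
            elif line == "</post>":
                in_post = False
                posts.append("\n".join(current))
                current = []
            elif in_post:
                current.append(line)
    return posts
```

The with block closes the file on exit, so no separate close() call is needed; filename is a plain string and has no close() method anyway.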
My Code:
import re
import urllib.request
url="https://www.google.com/search?sxsrf="
stock=input("Enter your stock: ") # Enter your stock: FB
url=url+stock
print(url) # https://www.google.com/search?sxsrf=FB
data=urllib.request.urlopen(url).read()
data1=data.decode("utf-8")
My Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 12387:
invalid start byte
The data isn't UTF-8-encoded; it's ISO-8859-1.
>>> url="https://www.google.com/search?sxsrf=FB"
>>> d = urllib.request.urlopen(url)
>>> dict(d.getheaders())['Content-Type']
'text/html; charset=ISO-8859-1'
>>> data1 = d.read().decode('iso-8859-1')
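Rather than hard-coding either encoding, the charset can be read from the Content-Type header and used for decoding. The sketch below works offline on a header string and a byte string; the helper names and the UTF-8 fallback are assumptions. On a live urllib response, d.headers.get_content_charset() should give the same result, since the headers object is an email.message.Message subclass.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default

def decode_body(raw, content_type):
    """Decode response bytes with the charset the server declared."""
    return raw.decode(charset_from_content_type(content_type), errors="replace")

header = "text/html; charset=ISO-8859-1"
print(charset_from_content_type(header))
print(decode_body(b"caf\xe9\xa0", header))
```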
In my snippet below, I'm processing a string of text that is: Déclaration.png
I return the description as unicode:
return self.render_json(request, {..."description": u''.join((instance.description)),..})
In another function, I use the description above as follows:
if document.description:
    file_name = document.description.split(".")
    file_name = "{}.{}.{}".format(
        "_".join(file_name[:-1]),
        str(document.id),
        file_name[-1]
    )
file_name is: [u'De\u0301claration', u'png']
When I try .format() on file_name I get the following error:
error: 'latin-1' codec can't encode character u'\u0301' in position 2: ordinal not in range(256)
Any ideas?
"{}.{}.{}" is a byte string (str), but you are trying to fill it with unicode values. Use a unicode format string instead:
...
file_name = u"{}.{}.{}".format(
...
Also have a look at this nice talk: https://www.youtube.com/watch?v=sgHbC6udIqc
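As a side note, the u'\u0301' in the error is a combining acute accent: the description holds 'e' followed by U+0301 rather than the single character 'é' (U+00E9). If the name later has to be encoded as Latin-1 (an assumption based on the error message), NFC normalization composes the pair into one character that Latin-1 can represent. A Python 3 sketch, where build_file_name and the id 42 are hypothetical:

```python
import unicodedata

def build_file_name(description, doc_id):
    """Hypothetical helper: insert an id before the extension of a file name."""
    # NFC composes 'e' + U+0301 (combining acute) into the single character U+00E9.
    normalized = unicodedata.normalize("NFC", description)
    parts = normalized.split(".")
    return "{}.{}.{}".format("_".join(parts[:-1]), doc_id, parts[-1])

name = build_file_name("De\u0301claration.png", 42)
print(name)
print(name.encode("latin-1"))  # succeeds because U+00E9 exists in Latin-1
```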
I'm facing an issue while trying to concatenate strings with gzipped content:
import gzip
import StringIO

content = "Some Long Content"
out = StringIO.StringIO()
with gzip.GzipFile(fileobj=out, mode='w') as f:
    f.write(content)
gzipped_content = out.getvalue()
part1 = 'Something'
part2 = 'SomethingElse'
complete_content = part1 + part2 + gzipped_content
During execution, this raises a UnicodeDecodeError:
complete_content = part1 + part2 + gzipped_content
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)
I'm unable to figure out why an ASCII decode is required for string concatenation.
Is there a way to make the concatenation work?
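Byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so Python is trying to decode the compressed data as ASCII text. In Python 2 that happens when a byte string is concatenated with a unicode object, which suggests one of the parts is actually unicode. A minimal Python 3 sketch that keeps everything as bytes, assuming the text parts can be encoded as UTF-8 (the helper name is hypothetical):

```python
import gzip
import io

def prefix_gzipped(content, *parts):
    """Gzip `content` and prepend the given text parts, all as bytes."""
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb") as f:
        f.write(content.encode("utf-8"))
    gzipped = out.getvalue()
    # Encode the text parts explicitly so no implicit decoding can occur.
    prefix = b"".join(p.encode("utf-8") for p in parts)
    return prefix + gzipped

complete_content = prefix_gzipped("Some Long Content", "Something", "SomethingElse")
print(type(complete_content), len(complete_content))
```

The result is binary data, so it should be written to files or sockets opened in binary mode rather than treated as text.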
I am fetching latest football scores from a website and sending a notification on the desktop (OS X). I am using BeautifulSoup to scrape the data. I had issues with the unicode data which was generating this error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128).
So I inserted this at the beginning which solved the problem while outputting on the terminal.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
But the problem exists when I am sending notifications on the desktop. I use terminal-notifier to send desktop-notifications.
def notify(title, subtitle, message):
    t = '-title {!r}'.format(title)
    s = '-subtitle {!r}'.format(subtitle)
    m = '-message {!r}'.format(message)
    os.system('terminal-notifier {}'.format(' '.join((m, t, s))))
[Images omitted: output on the terminal vs. the desktop notification.]
Also, if I try to replace the comma in the string, I get the error,
new_scorer = str(new_scorer[0].text).replace(",","")
File "live_football_bbc01.py", line 41, in get_score
new_scorer = str(new_scorer[0].text).replace(",","")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
How do I get the output on the desktop notifications like the one on the terminal? Thanks!
Edit: [image omitted: snapshot of the desktop notification] (solved)
You are formatting with !r, which gives you the repr output. Forget the terrible reload logic and either use unicode everywhere:
def notify(title, subtitle, message):
    t = u'-title {}'.format(title)
    s = u'-subtitle {}'.format(subtitle)
    m = u'-message {}'.format(message)
    os.system(u'terminal-notifier {}'.format(u' '.join((m, t, s))))
or encode:
def notify(title, subtitle, message):
    t = '-title {}'.format(title.encode("utf-8"))
    s = '-subtitle {}'.format(subtitle.encode("utf-8"))
    m = '-message {}'.format(message.encode("utf-8"))
    os.system('terminal-notifier {}'.format(' '.join((m, t, s))))
When you call str(new_scorer[0].text).replace(",","") you are implicitly encoding to ASCII; you need to specify the encoding to use:
In [13]: s1=s2=s3= u'\xfc'
In [14]: str(s1) # tries to encode to ascii
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-14-589849bdf059> in <module>()
----> 1 str(s1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
In [15]: "{}".format(s1) + "{}".format(s2) + "{}".format(s3) # tries to encode to ascii
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-15-7ca3746f9fba> in <module>()
----> 1 "{}".format(s1) + "{}".format(s2) + "{}".format(s3)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
You can encode straight away:
In [16]: "{}".format(s1.encode("utf-8")) + "{}".format(s2.encode("utf-8")) + "{}".format(s3.encode("utf-8"))
Out[16]: '\xc3\xbc\xc3\xbc\xc3\xbc'
Or use unicode throughout by prepending a u to the format strings, and encode last:
In [17]: out = u"{}".format(s1) + u"{}".format(s2) + u"{}".format(s3)
In [18]: out
Out[18]: u'\xfc\xfc\xfc'
In [19]: out.encode("utf-8")
Out[19]: '\xc3\xbc\xc3\xbc\xc3\xbc'
If you use !r you are always going to get the escaped repr (such as u'\xfc') in the output rather than the character itself:
In [30]: print "{}".format(s1.encode("utf-8"))
ü
In [31]: print "{!r}".format(s1).encode("utf-8")
u'\xfc'
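For comparison, here is a small sketch of the same distinction in Python 3 (an assumption, since the session above is Python 2). The default conversion inserts the text itself, !r inserts the repr with its quotes, and !a additionally escapes non-ASCII characters. The sample name is illustrative only.

```python
name = "M\u00fcller"  # illustrative value, not from the original post

plain = "-title {}".format(name)         # the text as-is
with_repr = "-title {!r}".format(name)   # repr(): adds quotes
with_ascii = "-title {!a}".format(name)  # ascii(): quotes plus \x escapes

print(plain)
print(with_repr)
print(with_ascii)
```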
You can also pass the args using subprocess:
from subprocess import check_call

def notify(title, subtitle, message):
    check_call(['terminal-notifier', '-title', title.encode("utf-8"),
                '-subtitle', subtitle.encode("utf-8"),
                '-message', message.encode("utf-8")])
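A Python 3 variant of the list-based call might look like the sketch below. It assumes Python 3, where str arguments can be passed to subprocess directly without manual encoding; build_notifier_args is a hypothetical helper, split out so the argument list can be inspected without running terminal-notifier.

```python
import subprocess

def build_notifier_args(title, subtitle, message):
    """Build the argument list for terminal-notifier (hypothetical helper)."""
    return ["terminal-notifier",
            "-title", title,
            "-subtitle", subtitle,
            "-message", message]

def notify(title, subtitle, message):
    # A list of arguments bypasses the shell, so spaces and quotes
    # in the text need no escaping.
    subprocess.check_call(build_notifier_args(title, subtitle, message))

print(build_notifier_args("Goal", "Bayern 1-0", "M\u00fcller 45'"))
```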
Use `sys.getfilesystemencoding()` to get your encoding.
Encode your string with it, ignoring or replacing errors:
import sys
encoding = sys.getfilesystemencoding()
msg = new_scorer[0].text.replace(",", "")
print(msg.encode(encoding, errors="replace"))
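To illustrate what the errors argument does, the Python 3 sketch below encodes a sample string to ASCII, which cannot represent 'ü'. ASCII is chosen here only to make the effect deterministic; the value returned by sys.getfilesystemencoding() depends on the platform. The sample text is an assumption.

```python
text = "M\u00fcller"  # illustrative value

replaced = text.encode("ascii", errors="replace")  # unencodable chars become '?'
ignored = text.encode("ascii", errors="ignore")    # unencodable chars are dropped
utf8 = text.encode("utf-8")                        # UTF-8 can represent everything

print(replaced, ignored, utf8)
```

Both handlers lose information, so they are best kept for display output rather than for data that is stored or processed further.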