Why is the error message wrong when a malformed Unicode escape is inside an elif block?

As an example, if you run this code:
text = "Hi"
if text == "Hello":
    print("Hello")
elif text == "Hi":
    emoji = '\U000274C'
    print(emoji)
else:
    print("")
You will get
"IndentationError: unexpected indent"
as the error message, but if you run just emoji = '\U000274C' on its own you get the correct error:
"SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 0-8: truncated \UXXXXXXXX escape"
Any ideas why? Is it a bug or a feature?
The correct message is really useful because it makes clear that a digit is missing from the Unicode escape, while the indentation error is totally useless.
I was expecting a useful error message, and it's not clear to me whether this behavior is correct or a bug.

It seems you did not write the emoji's Unicode escape correctly.
A \U escape takes exactly 8 hex digits. You wrote only 7, so check this part.
I slightly modified your 'elif' branch to use a popular example code point. Both the \U escape and the \N{...} name form work and return 😀. It works even on Online-Python-Compiler.
Another try with '\U0000274C', which seems closest to your escape, returned ❌.
text = "Hi"
if text == "Hello":
    print("Hello")
elif text == "Hi":
    emoji = "\U0001f600"
    # emoji = "\N{grinning face}"
    print(emoji)
else:
    print("")
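To check which spelling the compiler accepts without crashing the script, the snippets can be handed to the built-in compile(). This is only a minimal sketch: the helper name compile_status is made up for illustration, and the exact exception class reported for the nested case may differ between Python versions, which is why it catches SyntaxError (the parent class of IndentationError).

```python
def compile_status(source):
    """Return 'ok' if the source compiles, else the exception class name."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return type(exc).__name__
    return "ok"

# Backslashes are doubled so that the escape reaches compile() unprocessed.
truncated = "emoji = '\\U000274C'\n"   # 7 hex digits: rejected
complete = "emoji = '\\U0000274C'\n"   # 8 hex digits: accepted

print(compile_status(truncated))
print(compile_status(complete))

# The three spellings below name the same character, U+274C.
assert "\U0000274C" == "\N{CROSS MARK}" == chr(0x274C)
```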

Related

How to correctly represent a supplementary unicode char in python3 (3.6.1+) by using \u or \U escape within string

Recently I've been learning Python and have run into a problem with Unicode escape literals in Python 3.
It seems that, as in Java, the \u escape is interpreted as a UTF-16 code unit, but here comes the problem:
For example, suppose I try to put a character that takes 3 bytes in UTF-8, like "♬" (https://unicode-table.com/en/266C/), or even a supplementary character like "𠜎" (https://unicode-table.com/en/2070E/), into a normal string using the \uXXXX or \UXXXXXXXX format, as follows:
print('\u00E2\u99AC') # UTF-8 bytes: garbled output, as expected
print('\U00E299AC') # UTF-8 bytes in an 8-digit \U: (unicode error), as expected
print('\u266C') # UTF-16 BE: the music note appears
# from which I suppose \u and \U work the same way as they do in Java
# (maybe a little differently, since in Java they act like macros and can be used in comments)
# However, while print('\u266C') gives me '♬', '\u266C' == '♬' evaluates to False,
# whereas it is true under Java semantics.
# Furthermore, print('\UD841DF0E') didn't give me '𠜎': (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
# which I thought it should, so it appears I may have gotten something wrong
# Here again: print('\uD841\uDF0E') # Error, 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
print('\xD8\x41\xDF\x0E') # also tried this: garbled output
# maybe UTF-16 LE?
print('\u41D8\u0EDF') # garbled output
print('\U41D80EDF') # error
So it looks to me as if Python "doesn't support supplementary escape literals", and its behavior is also weird.
I already know a correct way to decode and encode such characters:
s_decoded = '\\xe2\\x99\\xac'.encode().decode('unicode-escape')\
    .encode('latin-1').decode('utf-8')
print(b'\xf0\xa0\x9c\x8e'.decode('utf-8'))
print(b'\xd8\x41\xdf\x0e'.decode('utf-16-be'))
assert s_decoded == '♬'
But I still don't get how to do it right using the \u and \U escape literals. Hopefully someone can point out what I'm doing wrong and how it differs from Java's way. Thanks!
By the way, my environment is PyCharm on Windows, Python 3.6.1, and the source code is encoded as UTF-8.

Python 3.6.3:
>>> print('\u266c') # U+266C
♬
>>> print('\U0002070E') # U+2070E. Python is not Java
𠜎
>>> '\u266c' == '♬'
True
>>> '\U0002070E' == '𠜎'
True
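The session above works because \u and \U take the code point itself, not its UTF-8 or UTF-16 bytes. The following sketch relates the byte sequences from the question to those code points; the helper from_surrogates is a made-up name for illustration, showing the standard UTF-16 surrogate arithmetic.

```python
note = "\u266c"        # U+266C, written with a 4-digit \u escape
han = "\U0002070E"     # U+2070E, needs the 8-digit \U escape

# Each is a single code point, whatever its byte length in an encoding.
assert len(note) == 1 and len(han) == 1

# The byte sequences from the question are encodings of those code points.
assert note.encode("utf-8") == b"\xe2\x99\xac"
assert han.encode("utf-8") == b"\xf0\xa0\x9c\x8e"
assert han.encode("utf-16-be") == b"\xd8\x41\xdf\x0e"

def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into the character it encodes."""
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

assert from_surrogates(0xD841, 0xDF0E) == han
```

So a Java-style pair such as \uD841\uDF0E is written in Python as the single escape \U0002070E.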

Ascii encoding error during sending a mail [duplicate]

This question already has an answer here:
The smtplib.server.sendmail function in python raises UnicodeEncodeError: 'ascii' codec can't encode character
(1 answer)
Closed 2 years ago.
I am new to Python and am trying to receive and resend an email using poplib and smtplib:
messages = [pop_conn.retr(i)[1] for i in range(1, mail_count + 1)]
# Decode messages
messages = [[line.decode("utf-8") for line in message] for message in messages]
# Concat messages
messages = ["\n".join(msg) for msg in messages]
#...
for message in messages:
    smtp_conn.sendmail(args.address, args.target, message)
In the debugger all message strings look good, but the sendmail call raises the following error:
msg = _fix_eols(msg).encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 17938: ordinal not in range(128)
What am I doing wrong?

The character \xa0 has no representation in ASCII. \xa0 is the Unicode code point for a non-breaking space.
Since this is really just a space, you could try replacing all \xa0 characters in your strings:
messages = [msg.replace(u'\xa0', u' ') for msg in messages]
To be fair, spaces and non-breaking spaces behave differently, so depending on where this character appears in your message, the output could look slightly different after replacing the non-breaking spaces with regular spaces.
Another option is to ignore any characters that produce an error. This solution is not ideal because you could lose characters, which can change the formatting (or sometimes the meaning) of your text. Replacing the non-breaking space with a normal space is smart to do regardless, but for all other pesky characters:
msg.encode("ascii", errors="ignore")
Alternatively, you can do msg.encode("ascii", errors="replace"), but that will replace these characters with a '?', which doesn't look so nice.
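The three options can be compared on a small made-up string (the sample text is an assumption for illustration). Note that encode() returns bytes, not str:

```python
text = "price:\xa0100"  # contains a non-breaking space (U+00A0)

replaced = text.replace("\xa0", " ")                 # keep a visible space
ignored = text.encode("ascii", errors="ignore")      # drop the character
question = text.encode("ascii", errors="replace")    # substitute '?'

print(replaced)   # price: 100
print(ignored)    # b'price:100'
print(question)   # b'price:?100'
```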

You are trying to encode a character that is not part of ASCII as ASCII. A0 is a non-breaking space. If that's the only character that can't be encoded as ASCII, you can just replace it with a normal space:
spaced_message = message.replace("\xa0", " ")
Otherwise, look into https://en.wikipedia.org/wiki/Unicode_and_email#Unicode_support_in_message_bodies
Encoding strings as UTF-7 (yes, 7) usually works, but it's officially deprecated in many systems. UTF-8 usually needs a transfer encoding such as base64 on top, which is a bit tricky.

I solved this error by editing the smtplib source code on line 859.
Replace 'ascii' on line 859
msg = _fix_eols(msg).encode('ascii')
with 'utf-8'
msg = _fix_eols(msg).encode('utf-8')
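An alternative that leaves the library untouched: the smtplib documentation says sendmail also accepts a byte string, which it sends unmodified, so the ASCII encoding step only applies to str messages. A minimal sketch under that assumption, skipping the decode step from the question and joining the raw lines instead; join_raw_lines and sample_lines are hypothetical names, and no server is contacted here.

```python
def join_raw_lines(lines):
    """Join byte lines (as returned by poplib's retr) into one CRLF-delimited message."""
    return b"\r\n".join(lines) + b"\r\n"

# Stand-in for pop_conn.retr(i)[1]; the body holds the UTF-8 bytes of "caf\u00e9\u00a0bar".
sample_lines = [b"Subject: test", b"", b"caf\xc3\xa9\xc2\xa0bar"]
raw = join_raw_lines(sample_lines)

# With a live connection this would be (not executed here):
# smtp_conn.sendmail(args.address, args.target, raw)
print(raw)
```

Whether the receiving server accepts 8-bit content unchanged is a separate question that this sketch does not address.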

Getting error on if and elif

Does anyone know why I keep getting an error with this section of my code?
if db_orientation2 =="Z":
    a="/C=C\"
elif db_orientation2=="E":
    a="\C=C\"
This is the error:
File "<ipython-input-7-25cda51c429e>", line 11
a="/C=C\"
^
SyntaxError: EOL while scanning string literal
The elif is highlighted in red, as if the operation were not allowed...

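String literals cannot end with a single backslash: a backslash right before the closing quote escapes the quote, so the string never terminates. You'll have to double it:
a="/C=C\\"
#      ^
The highlighting of your code also clearly shows the problem.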
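Applied to both branches, the fix could look like the sketch below. The function name double_bond is made up for illustration; the strings are the ones from the question. The leading backslash in the E branch is doubled as well, since \C is not a recognized escape sequence.

```python
def double_bond(orientation):
    """Return the bond pattern from the question for 'Z' or 'E'."""
    if orientation == "Z":
        return "/C=C\\"      # the resulting text is /C=C\
    elif orientation == "E":
        return "\\C=C\\"     # the resulting text is \C=C\
    raise ValueError("expected 'Z' or 'E'")

print(double_bond("Z"))
print(double_bond("E"))
```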

Python: UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-39: ordinal not in range(128)

I have Twitter log data, and I need to sort the file to show a ranking of each user's most-retweeted tweets.
Here's the code.
import codecs
with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()
    tweet_list.pop(0)
facul={}
for t in tweet_list:
    t = t.split('\t')
    t[-2] = int(t[-2])
    if t[-2] <= 0:
        continue
    if not t[0] in facul:
        facul[t[0]] = []
    facul[t[0]].append(t)
def cmp_retweet(a,b):
    if a[-2] < b[-2]:
        return 1
    if a[-2] > b[-2]:
        return -1
    return 0
for f in sorted(facul.keys()):
    facul[f].sort(cmp=cmp_retweet)
    print ('[%s]' %(f))
    for t in facul[f][:5]:
        print ('%d:%s:%s' % (t[-2], t[2], t[-1].strip())
Somehow I got an error saying:
print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())
UnicodeEncodeError: 'ascii' codec can't encode characters in position
34-39: ordinal not in range(128)
It looks like the Japanese characters can't be encoded. How can I fix this?
I tried to use sys.setdefaultencoding("utf-8") but then I got an error:
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
This is how I tried it:
import codecs
import sys
sys.setdefaultencoding("utf-8")
with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()
p.s. I am using Python version 2.7.5

The basic issue, as you have discovered, is that ASCII cannot represent most of Unicode.
So you have to choose how to handle it:
don't display non-ASCII chars
display non-ASCII chars as some other type of representation
The first choice would look like this:
for t in facul[f][:5]:
    print ('%d:%s:%s' % (
        t[-2],
        t[2].encode('ascii', errors='ignore'),
        t[-1].encode('ascii', errors='ignore').strip()
    ))
While the second choice would replace ignore with something like replace, xmlcharrefreplace, or backslashreplace.
Here's the reference.
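To make the difference between those error handlers concrete, here is a small Python 3 sketch (the question uses Python 2.7, and the sample text is an assumption chosen for illustration):

```python
text = "\u65e5\u672c"  # two Japanese characters, U+65E5 and U+672C

results = {
    handler: text.encode("ascii", errors=handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

for handler, encoded in results.items():
    print(handler, encoded)

# ignore             b''
# replace            b'??'
# xmlcharrefreplace  b'&#26085;&#26412;'
# backslashreplace   b'\\u65e5\\u672c'
```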

The error message is giving you two clues: first, the problem is in the statement
print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())
Second, the problem is related to an encode operation. If you don't remember what is meant by "encode", now would be a good time to re-read the Unicode HOWTO in the Python 2.7 docs.
It looks like your list t[] contains Unicode strings. The print statement emits byte strings. Converting Unicode strings to byte strings is encoding. Because you aren't specifying an encoding, Python implicitly applies a default one. It uses the ascii codec, which cannot handle accented or non-Latin characters.
Try splitting that print statement into two parts. First, insert the unicode t[] values into a unicode format string. Note the use of the u'' syntax. Second, encode the unicode string to UTF-8 and print.
s = u'%d:%s:%s' %(t[-2], t[2], t[-1].strip())
print s.encode('utf8')
(I haven't tested this change to your code. Let me know if it doesn't work.)
I think sys.setdefaultencoding() is probably a red herring, but I don't know your environment well.
By the way, the statement, as you write it above, has unbalanced parentheses. Did you drop a right parenthesis when you pasted in the code?
print ('%d:%s:%s' %(t[-2], t[2], t[-1].strip())
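The same two-step idea (format as text first, then encode explicitly) can be sketched in Python 3, where every str is already Unicode. The helper name format_row and the sample row are assumptions for illustration; the column positions follow the question's t[-2], t[2] and t[-1].

```python
def format_row(row):
    """Format one parsed row as text, then encode it to UTF-8 bytes."""
    line = "%d:%s:%s" % (row[-2], row[2], row[-1].strip())
    return line.encode("utf-8")

# Hypothetical row: user, id, tweet text, retweet count, last column.
sample = ["alice", "1", "\u65e5\u672c", 5, " tail \n"]
encoded = format_row(sample)
print(encoded)
```

The resulting bytes can be written to a file opened in binary mode, which sidesteps whatever default encoding the terminal uses.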

How to completely sanitize a string of illegal characters in python?

My program has a feature where the user can upload a CSV file, which the program then reads and uses as input. One user is complaining that his input throws an error. The error is caused by an illegal character that is encoded incorrectly. The character is below:
�
Sometimes it appears as a diamond with a "?" in the middle, sometimes it appears as a double diamond with "?" in the middle, sometimes it appears as "\xa0", and sometimes it appears as "\xa0\xa0".
In my program if I do:
print str_with_weird_char
The string will show up in my terminal with the diamond "?" in place of the weird character. If I copy+paste that string into ipython, it will exit with this message:
In [1]: g="blah��blah"
WARNING:
********
You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!
Notice how the diamond "?" is doubled now. For some reason copy+paste doubles it...
In the django traceback page, it looks like this:
UnicodeDecodeError at /chris/import.html
('ascii', 'blah \xa0 BLAH', 14, 15, 'ordinal not in range(128)')
The thing that messes me up is that I can't do anything with this string without it throwing an exception. I tried unicode(), I tried str(), I tried .encode(), I tried .encode("utf-8"), no matter what it throws up an error.
What can I do to get this thing to be a working string?

You can pass "ignore" to .encode/.decode to skip invalid characters,
like "ILLEGAL".decode("utf8","ignore")
>>> "ILLEGA\xa0L".decode("utf8")
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 6: unexpected code byte
>>> "ILLEGA\xa0L".decode("utf8","ignore")
u'ILLEGAL'
>>>
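The session above is Python 2. A comparable sketch for Python 3, where only bytes objects have decode(), shows three ways of handling the stray byte; the variable names are made up for illustration.

```python
raw = b"ILLEGA\xa0L"  # a lone 0xA0 byte is not valid UTF-8

dropped = raw.decode("utf-8", errors="ignore")    # silently removes the byte
marked = raw.decode("utf-8", errors="replace")    # inserts U+FFFD, the "diamond ?"
as_latin1 = raw.decode("latin-1")                 # reads 0xA0 as a non-breaking space

print(repr(dropped))    # 'ILLEGAL'
print(repr(marked))     # 'ILLEGA\ufffdL'
print(repr(as_latin1))  # 'ILLEGA\xa0L'
```

If the file was really saved as Latin-1 or a similar single-byte encoding, decoding with that encoding keeps the character instead of discarding it.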

Declare the source encoding at the top of your script. Python only recognizes the declaration on the first or second line, so with a shebang it goes on the second. Like
#!/usr/bin/python
# coding=utf-8
This might be enough to solve your problem all by itself. If not, see str.encode('utf-8') and str.decode('utf-8').

You can also use:
python3 -c "import sys, urllib.parse; print(urllib.parse.quote_plus(sys.stdin.read()))"
adapted from https://wiki.python.org/moin/Powerful%20Python%20One-Liners
P.S. The wiki gives the Python 2 form (print urllib.quote_plus(...)); under python3 the print call needs parentheses and quote_plus lives in urllib.parse.

One way to do it (at least in Python 2) is to use unicodedata.normalize:
unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')
Calling decode('utf-8', 'ignore') on a unicode object will just raise an exception.
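A Python 3 sketch of the same normalization idea follows. It swaps in the ascii codec for the final step (the answer above uses utf-8) so that anything left over after decomposition is dropped; to_ascii is a hypothetical helper name.

```python
import unicodedata

def to_ascii(text):
    """Decompose characters (NFKD), then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# NFKD maps the non-breaking space to a plain space and splits
# the accented e into "e" plus a combining accent, which is then dropped.
print(to_ascii("blah\xa0BLAH caf\xe9"))  # blah BLAH cafe
```

This is lossy by design: characters with no ASCII decomposition simply disappear.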
