Unicode literals causing invalid syntax - python

The following code:
s = s.replace(u"&", u"&amp;")
causes an error in Python:
SyntaxError: invalid syntax
Removing the u's before the quotes fixes the problem, but shouldn't this work as is? I'm using Python 3.1.

The u is no longer used in Python 3. String literals are unicode by default. See What's New in Python 3.0.
You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.

On Python 3, strings are unicode. There is no need to (and as you've discovered, you can't) put a u before the string literal to designate unicode.
Instead, you have to put a b before a byte literal to designate that it isn't unicode.

In Python 3.3+ the unicode literal is valid again; see What's New In Python 3.3:
New syntax features:
New yield from expression for generator delegation.
The u'unicode' syntax is accepted again for str objects.
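The answers above can be checked with a minimal sketch, assuming Python 3.3 or later (where the u prefix is accepted again). The variable names are illustrative only:

```python
# Assumes Python 3.3+; on Python 3.0-3.2 the u"..." line is a SyntaxError.
plain = "caf\u00e9"      # unprefixed literals are already unicode (str)
prefixed = u"caf\u00e9"  # the u prefix is accepted but changes nothing
raw_bytes = b"cafe"      # binary data needs the b prefix

print(type(plain).__name__)      # str
print(type(raw_bytes).__name__)  # bytes
print(plain == prefixed)         # True
```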

Related

Unicode Regex in Python 3 (from Python 2 Code)

I'm trying to convert my Python 2 script to Python 3. How do we do Regex with Unicode?
This is what I had in Python 2, which works. It replaces straight quotes with « and »:
text = re.sub(ur'"(.*?)"', ur'«\1»', text)
I have some really complex ones, which the "ur" prefix made easy. But it doesn't work in Python 3:
text = re.sub(ur'ه\sایم([\]\.،\:»\)\s])', ur'ه\u200cایم\1', text)
All strings in Python 3 are unicode by default. Just remove the u and you should be fine.
In Python 2, strings are sequences of bytes by default, so we use u to mark them as unicode strings.
Since Python 3.0, the language features a str type that contains
Unicode characters, meaning any string created using "unicode rocks!",
'unicode rocks!', or the triple-quoted string syntax is stored as
Unicode.
The Unicode HOWTO will help you.
So you just do what you did in Python 2, and it will work, with no side effects.
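A minimal Python 3 sketch of the advice above, using made-up sample text. One caveat worth checking against your own patterns: once the u is dropped, a replacement such as r'...\u200c...' is a raw string, so the Python parser no longer expands \u200c, and as far as I know re does not expand \u in replacement templates either. The second call below sidesteps that by using a regular string for the replacement:

```python
import re

# Pattern and replacement as plain raw strings: no u prefix needed in Python 3.
text = 'He said "hello" and "bye".'
quoted = re.sub(r'"(.*?)"', r'«\1»', text)
print(quoted)  # He said «hello» and «bye».

# When the replacement needs a \uXXXX escape, use a regular (non-raw) string so
# the Python parser expands it, and double the backslash of the group reference.
joined = re.sub(r'(\w)\s(\w)', '\\1\u200c\\2', 'a b')
print(ascii(joined))  # 'a\u200cb'
```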

python version 3.4 does not support a 'ur' prefix

I have some Python code written for an older version of Python (2.x) and I'm struggling to make it work. I'm using Python 3.4.
_eng_word = ur"[a-zA-Z][a-zA-Z0-9'.]*"
(it's part of a tokenizer)
http://bugs.python.org/issue15096
Title: Drop support for the "ur" string prefix
When PEP 414 restored support for explicit Unicode literals in Python 3, the "ur" string prefix was deemed to be a synonym for the "r" prefix.
So, use 'r' instead of 'ur'
Indeed, Python 3.4 only supports u'...' (to support code that needs to run on both Python 2 and 3) and r'...', but not both together. That's because the semantics of ur'...' in Python 2 differ from how ur'...' would work in Python 3 (in Python 2, \uhhhh and \Uhhhhhhhh escapes are still processed; in Python 3 an r'...' string would not process them).
Note that in this specific case there is no difference between the raw string literal and the regular one! You can just use:
_eng_word = u"[a-zA-Z][a-zA-Z0-9'.]*"
and it'll work in both Python 2 and 3.
For cases where a raw string literal does matter, you could decode the raw string from raw_unicode_escape on Python 2, catching the AttributeError on Python 3:
_eng_word = r"[a-zA-Z][a-zA-Z0-9'.]*"
try:
    # Python 2
    _eng_word = _eng_word.decode('raw_unicode_escape')
except AttributeError:
    # Python 3
    pass
If you are writing Python 3 code only (so it doesn't have to run on Python 2 anymore), just drop the u entirely:
_eng_word = r"[a-zA-Z][a-zA-Z0-9'.]*"
This table compares (some of) the different string literal prefixes in Python 2(.7) and 3(.4+):
As you can see, in Python 3 there's no way to have a literal that doesn't process ordinary escapes, but does process \u unicode escapes. To get such a string with code that works in both Python 2 and 3, use:
br"[a-zA-Z][a-zA-Z0-9'.]*".decode('raw_unicode_escape')
Actually, your example is not very good, since it doesn't have any unicode literals, or escape sequences. A better example would be:
br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
In Python 2:
>>> br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
u"[\u03b1-\u03c9\u0391-\u03a9][\\t'.]*"
In Python 3:
>>> br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
"[α-ωΑ-Ω][\\t'.]*"
Which is really the same thing.
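The same behaviour as a small runnable check, assuming Python 3.3+ (the name pattern is just for illustration). The \u escapes are expanded by the codec, while \t stays a literal backslash followed by t:

```python
# Decode a raw bytes literal so that only \uXXXX escapes are interpreted.
pattern = br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')

print(pattern)              # [α-ωΑ-Ω][\t'.]*
print('\u03b1' in pattern)  # True: \u03b1 became the character α
print('\t' in pattern)      # False: no real tab, \t was left untouched
```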

Why doesn't the Python interpreter use the file encoding for decoding?

The code below will cause a UnicodeDecodeError:
#-*- coding:utf-8 -*-
s="中文"
u=u"123"
u=s+u
I know it's because the Python interpreter is using ascii to decode s.
Why doesn't the Python interpreter use the file encoding (utf-8) for decoding?
Implicit decoding cannot know what source encoding was used. That information is not stored with strings.
All that Python has after importing is a byte string with characters representing bytes in the range 0-255. You could have imported that string from another module, or read it from a file object, etc. The fact that the parser knew what encoding was used for those bytes doesn't even matter for plain byte strings.
As such, it is always better to decode bytes explicitly, rather than rely on the implicit decoding. Either use a Unicode literal for s as well, or explicitly decode using str.decode():
u = s.decode('utf8') + u
The types of the two strings are different: the first is a normal (byte) string, the second is a unicode string, hence the error.
So, instead of doing s="中文", do the following to get unicode strings for both:
s=u"中文"
u=u"123"
u=s+u
The code works perfectly fine on Python 3.
However, in Python 2, if you do not add a u before a string literal, you are constructing a string of bytes. When one wants to combine a string of bytes and a string of characters, one either has to decode the string of bytes, or encode the string of characters. Python 2.x opted for the former. In order to prevent accidents (for example, someone appending binary data to a user input and thus generating garbage), the Python developers chose ascii as the encoding for that conversion.
You can add a line
from __future__ import unicode_literals
after the #coding declaration so that literals without u or b prefixes are always character and not byte literals.
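For comparison, here is a rough Python 3 analogue of the situation discussed above (a sketch, not the original Python 2 code). Python 3 dropped the implicit ascii conversion entirely, so mixing bytes and str raises a TypeError, and explicit decoding is the only option:

```python
# Python 3 only: bytes and str never mix implicitly.
s = "中文".encode("utf-8")  # bytes, like an unprefixed Python 2 literal in a UTF-8 file
u = "123"                   # str (unicode)

try:
    s + u
    mixed_ok = True
except TypeError:
    mixed_ok = False        # Python 3 refuses instead of guessing an encoding

combined = s.decode("utf-8") + u  # decode explicitly with the known encoding
print(mixed_ok)   # False
print(combined)   # 中文123
```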

Python .split() without 'u

In Python, if I have a string like:
a =" Hello - to - everybody"
And I do
a.split('-')
then I get
[u'Hello', u'to', u'everybody']
This is just an example.
How can I get a simple list without that annoying u'??
The u means that it's a unicode string - your original string must also have been a unicode string. Generally it's a good idea to keep strings Unicode as trying to convert to normal strings could potentially fail due to characters with no equivalent.
The u is purely used to let you know it's a unicode string in the representation - it will not affect the string itself.
In general, unicode strings work exactly as normal strings, so there should be no issue with leaving them as unicode strings.
In Python 3.x, unicode strings are the default, and don't have the u prepended (instead, bytes (the equivalent to old strings) are prepended with b).
If you really, really need to convert to a normal string (rarely the case, but potentially an issue if you are using an extension library that doesn't support unicode strings, for example), take a look at unicode.encode() and unicode.decode(). You can either do this before the split, or after the split using a list comprehension.
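A short Python 3 sketch of both points: the repr no longer shows a u prefix, and a list comprehension can encode each piece after the split if a library really needs byte strings. The strip() call is an addition of mine to drop the spaces around each piece:

```python
# Python 3: str is already unicode, so the repr shows no u prefix.
a = " Hello - to - everybody"

parts = [part.strip() for part in a.split('-')]
print(parts)  # ['Hello', 'to', 'everybody']

# Only if a library insists on byte strings: encode each piece after the split.
encoded = [part.encode('utf-8') for part in parts]
print(encoded)  # [b'Hello', b'to', b'everybody']
```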
I had the opposite problem. The str '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀' needs to be split on the unicode character. I mistakenly wrote split('\u'), which led to a unicode syntax error.
I should have written split('\u3000').
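As a quick check in Python 3, the full escape \u3000 (ideographic space) works as a separator; a truncated '\u' is rejected by the parser before split() is ever called. The name title is illustrative:

```python
# The escape must be complete: '\u3000' is one character, the ideographic space.
title = '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀'

parts = title.split('\u3000')
print(len(parts))  # 2
print(parts[0])    # 第一回
```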

Is there any good reason not to use unicode as opposed to string?

Many problems I've run into in Python have been related to not having something in Unicode. Is there any good reason to not use Unicode by default? I understand needing to translate something in ASCII, but it seems to be the exception and not the rule.
I know Python 3 uses Unicode for all strings. Should this encourage me as a developer to unicode() all my strings?
Generally, I'm going to say "no" there's not a good reason to use string over unicode. Remember, as well, that you don't have to call unicode() to create a unicode string, you can do so by prefixing the string with a lowercase u like u"this is a unicode string".
In Python 2.x:
A str object is basically just a sequence of bytes.
A unicode object is a sequence of characters.
Knowing this, it should be easy to choose the correct type:
If you want a string of characters use unicode.
If you want a string encoded as bytes use str (in many other languages you'd use byte[] here).
In Python 3.x the type str is a string of characters, just as you would expect. You can use bytes if you want a sequence of bytes.
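The character/byte distinction is easy to see in Python 3 by comparing lengths; this is a small sketch with illustrative names:

```python
text = "中文"                 # str: a sequence of characters
data = text.encode("utf-8")  # bytes: the same text encoded as UTF-8

print(len(text))  # 2 characters
print(len(data))  # 6 bytes, three per character in UTF-8
print(data.decode("utf-8") == text)  # True: decoding round-trips
```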
