I have some python code writen in an older version of python(2.x) and I struggle to make it work. I'm using python 3.4
_eng_word = ur"[a-zA-Z][a-zA-Z0-9'.]*"
(it's part of a tokenizer)
http://bugs.python.org/issue15096
Title: Drop support for the "ur" string prefix
When PEP 414 restored support for explicit Unicode literals in Python 3, the "ur" string prefix was deemed to be a synonym for the "r" prefix.
So, use 'r' instead of 'ur'
Indeed, Python 3.4 only supports u'...' (to support code that needs to run on both Python 2 and 3) and r'....', but not both. That's because the semantics of how ur'..' works in Python 2 are different from how ur'..' would work in Python 3 (in Python 2, \uhhhh and \Uhhhhhhhh escapes still are processed, in Python 3 a `r'...' string would not).
Note that in this specific case there is no difference between the raw string literal and the regular! You can just use:
_eng_word = u"[a-zA-Z][a-zA-Z0-9'.]*"
and it'll work in both Python 2 and 3.
For cases where a raw string literal does matter, you could decode the raw string from raw_unicode_escape on Python 2, catching the AttributeError on Python 3:
_eng_word = r"[a-zA-Z][a-zA-Z0-9'.]*"
try:
# Python 2
_eng_word = _eng_word.decode('raw_unicode_escape')
except AttributeError:
# Python 3
pass
If you are writing Python 3 code only (so it doesn't have to run on Python 2 anymore), just drop the u entirely:
_eng_word = r"[a-zA-Z][a-zA-Z0-9'.]*"
This table compares (some of) the different string literal prefixes in Python 2(.7) and 3(.4+):
As you can see, in Python 3 there's no way to have a literal that doesn't process escapes, but does process unicode literals. To get such a string with code that works in both Python 2 and 3, use:
br"[a-zA-Z][a-zA-Z0-9'.]*".decode('raw_unicode_escape')
Actually, your example is not very good, since it doesn't have any unicode literals, or escape sequences. A better example would be:
br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
In python 2:
>>> br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
u"[\u03b1-\u03c9\u0391-\u03a9][\\t'.]*"
In Python 3:
>>> br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
"[α-ωΑ-Ω][\\t'.]*"
Which is really the same thing.
Related
I'm porting a library from Python 2 to Python 3 and now I'm working on the docstrings. I wonder if I could safely replace all occurrences of unicode type with str in docstrings, since all strings in Python 3 are stored as unicode, or if on the other hand is there any potential case in Python 3 where a type should be specified as unicode.
I already know r"string" in Python 2.7 often used for regex patterns. I also have seen u"string" for, I think, Unicode strings. Now with Python 3 we see b"string".
I have searched for these in different sources / questions, such as What does a b prefix before a python string mean?, but it's difficult to see the big picture of all these strings with prefixes in Python, especially with Python 2 vs 3.
Question: would you have a rule of thumb to remember the different types of strings with prefixes in Python? (or maybe a table with a column for Python 2 and one for Python 3?)
NB: I have read a few questions+answers but I haven't found an easy to remember comparison with all prefixes / Python 2+3
From the python docs for literals: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.
Both string and bytes literals may optionally be prefixed with a
letter 'r' or 'R'; such strings are called raw strings and treat
backslashes as literal characters. As a result, in string literals,
'\U' and '\u' escapes in raw strings are not treated specially. Given
that Python 2.x’s raw unicode literals behave differently than Python
3.x’s the 'ur' syntax is not supported.
and
A string literal with 'f' or 'F' in its prefix is a formatted string
literal; see Formatted string literals. The 'f' may be combined with
'r', but not with 'b' or 'u', therefore raw formatted strings are
possible, but formatted bytes literals are not.
So:
r means raw
b means bytes
u means unicode
f means format
The r and b were already available in Python 2, as such in many other languages (they are very handy sometimes).
Since the strings literals were not unicode in Python 2, the u-strings were created to offer support for internationalization. As of Python 3, u-strings are the default strings, so "..." is semantically the same as u"...".
Finally, from those, the f-string is the only one that isn't supported in Python 2.
u-strings if for unicode in python 2. Most probably you should forget this, if you're working with modern applications — default strings in python 3 is all unicode, and if you're migrating from python 2, you'll most probably use from __future__ import unicode_literals, which makes [almost] the same for python 2
b-strings is for raw bytes — have no idea of text, rather just stream of bytes. Rarely used as input for your source, most often as result of network or low-level code — reading data in binary format, unpacking archives, working with encryption libraries.
Moving from/to b-string to str done via
# python 3
>>> 'hēllö'.encode('utf-8')
b'h\xc4\x93ll\xc3\xb6'
>>> b'h\xc4\x93ll\xc3\xb6'.decode()
'hēllö'
# python 2 without __future__
>>> u'hēllö'.encode('utf-8')
'h\xc4\x93ll\xc3\xb6'
>>> 'h\xc4\x93ll\xc3\xb6'.decode('utf-8')
u'h\u0113ll\xf6' # this is correct representation
r-strings is not specifically for regex, this is "raw" string. Unlike regular string literals, r-string doesn't give any special meaning for escape characters. I.e. normal string 'abc\n' is 4 characters long, last char is "newline" special character. To provide it in literal, we're using escaping with \. For raw strings, r'abc\n' is 5-length string, last two characters is literally \ and n. Two places to see raw strings often:
regex patterns — to not mess escaping with actual special characters in patters
file path notations for windows systems, as windows family uses \ as delimeter, normal string literals will look like 'C:\\dir\\file', or '\\\\share\\dir', while raw would be nicer: r'C:\dir\file' and r'\\share\dir' respectively
One more notable is f-strings, which came to life with python 3.6 as simple and powerful way of formatting strings:
f'a equals {a} and b is {b}' will substitute variables a and b in runtime.
There are really only two types of string (or string-like object) in Python.
The first is 'Unicode' strings, which are a sequence of characters.
The second is bytes (or 'bytestrings'), which are a sequence of bytes.
The first is a series of letter characters found in the Unicode specification.
The second is a series of integers between 0 and 255 that are usually rendered to text using some assumed encoding such as ASCII or UTF-8 (which is a specification for encoding Unicode characters in a bytestream).
In Python 2, the default "my string" is a bytestring.
The prefix 'u' indicates a 'Unicode' string, e.g. u"my string".
In Python 3, 'Unicode' strings became the default, and thus "my string" is equivalent to u"my string".
To get the old Python 2 bytestrings, you use the prefix b"my string" (not in the oldest versions of Python 3).
There are two further prefixes, but they do not affect the type of string object, just the way it is interpreted.
The first is 'raw' strings which do not interpret escape characters such as \n or \t. For example, the raw string r"my_string\n" contains the literal backslash and 'n' character, while "my_string\n" contains a linebreak at the end of the line.
The second was introduced in the newest versions of Python 3: formatted strings with the prefix 'f'. In these, curly braces are used to show expressions to be interpreted. For example, the string in:
my_object = 'avocado'
f"my {0.5 + 1.0, my_object} string"
will be interpreted to "my (1.5, avocado) string" (where the comma created a tuple). This interpretation happens immediately when the code is read; there is nothing special subsequently about the string.
And finally, you can use the multiline string notation:
"""this is my
multiline
string"""
with 'r' or 'f' specifiers as you wish.
In Python 2, if you have used no prefix or only an 'r' prefix, it is a bytestring, and if you have used a 'u' prefix it is a Unicode string.
In Python 3, if you have used no prefix or only a combination of 'r', 'f' and 'u', it is a Unicode string. If you have used a 'b' prefix it is a bytestring. Using both 'b' and 'u' is obviously not allowed.
This is what I observed (seems confirmed by other answers):
Python 2 Python 3
-----------------------------------------------
"hello" b"hello"
b"hello" <=>
<type 'str'> <class 'bytes'>
-----------------------------------------------
u"hello" <=> "hello"
u"hello"
<type 'unicode'> <class 'str'>
I'm trying to convert my Python 2 script to Python 3. How do we do Regex with Unicode?
This is what I had in Python 2 which works It replaces quotes to « and »:
text = re.sub(ur'"(.*?)"', ur'«\1»', text)
I have some really complex ones which the "ur" made it so easy. But it doesn't work in Python 3:
text = re.sub(ur'ه\sایم([\]\.،\:»\)\s])', ur'ه\u200cایم\1', text)
All strings in Python3 are unicode by default. Just remove the u and you should be fine.
In Python2 strings are lists of bytes by default, so we use u to mark them as unicode strings.
Since Python 3.0, the language features a str type that contain
Unicode characters, meaning any string created using "unicode rocks!",
'unicode rocks!', or the triple-quoted string syntax is stored as
Unicode.
Unicode HOWTO This doc will help you.
so, you just do want every you do in Python2, and it will works, no extra effects.
Apparently the ur"" syntax has been disabled in Python 3. However, I need it! "Why?", you may ask. Well, I need the u prefix because it is a unicode string and my code needs to work on Python 2. As for the r prefix, maybe it's not essential, but the markup format I'm using requires a lot of backslashes and it would help avoid mistakes.
Here is an example that does what I want in Python 2 but is illegal in Python 3:
tamil_letter_ma = u"\u0bae"
marked_text = ur"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
After coming across this problem, I found http://bugs.python.org/issue15096 and noticed this quote:
It's easy to overcome the limitation.
Would anyone care to offer an idea about how?
Related: What exactly do "u" and "r" string flags do in Python, and what are raw string literals?
Why don't you just use raw string literal (r'....'), you don't need to specify u because in Python 3, strings are unicode strings.
>>> tamil_letter_ma = "\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
'\\aம\\bthe Tamil\\cletter\\dMa\\e'
To make it also work in Python 2.x, add the following Future import statement at the very beginning of your source code, so that all the string literals in the source code become unicode.
from __future__ import unicode_literals
The preferred way is to drop u'' prefix and use from __future__ import unicode_literals as #falsetru suggested. But in your specific case, you could abuse the fact that "ascii-only string" % unicode returns Unicode:
>>> tamil_letter_ma = u"\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
u'\\a\u0bae\\bthe Tamil\\cletter\\dMa\\e'
Unicode strings are the default in Python 3.x, so using r alone will produce the same as ur in Python 2.
The following code:
s = s.replace(u"&", u"&")
is causing an error in python:
SyntaxError: invalid syntax
removing the u's before the " fixes the problem, but this should work as is? I'm using Python 3.1
The u is no longer used in Python 3. String literals are unicode by default. See What's New in Python 3.0.
You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.
On Python 3, strings are unicode. There is no need to (and as you've discovered, you can't) put a u before the string literal to designate unicode.
Instead, you have to put a b before a byte literal to designate that it isn't unicode.
In Python3.3+ unicode literal is valid again, see What’s New In Python 3.3:
New syntax features:
New yield from expression for generator delegation.
The u'unicode' syntax is accepted again for str objects.