Unicode Regex in Python 3 (from Python 2 Code)

I'm trying to convert my Python 2 script to Python 3. How do we do regex with Unicode?
This is what I had in Python 2, which works. It replaces straight quotes with « and »:
text = re.sub(ur'"(.*?)"', ur'«\1»', text)
I have some really complex ones, which the "ur" prefix made so easy. But it doesn't work in Python 3:
text = re.sub(ur'ه\sایم([\]\.،\:»\)\s])', ur'ه\u200cایم\1', text)

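All strings in Python 3 are Unicode by default. Just remove the u (keep the r) and you should be fine for most patterns. One caveat: Python 3 raw strings do not process \u escapes, so a replacement such as r'ه\u200cایم\1' has to be written as a normal string with the backslash of the group reference doubled.
In Python 2, plain string literals are sequences of bytes by default, so we use u to mark them as Unicode strings.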
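As a rough sketch of how the two calls from the question might be ported to Python 3 (the function names fix_quotes and join_suffix are made up for illustration): the pattern stays a raw string, while the replacement that needs the zero-width non-joiner becomes a normal string so that \u200c is turned into a real character.

```python
import re


def fix_quotes(text):
    # Neither argument needs a \u escape, so plain raw strings are enough.
    return re.sub(r'"(.*?)"', r'«\1»', text)


def join_suffix(text):
    # The pattern stays raw. The replacement is a normal string so that
    # \u200c becomes an actual character; the group reference is written
    # as \\1 so that re.sub still receives \1.
    return re.sub(r'ه\sایم([\]\.،\:»\)\s])', 'ه\u200cایم\\1', text)


print(fix_quotes('say "hi"'))
```

Passing r'ه\u200cایم\1' as the replacement would not work here, because the re module does not accept \u escapes in replacement templates.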

Since Python 3.0, the language features a str type that contains
Unicode characters, meaning any string created using "unicode rocks!",
'unicode rocks!', or the triple-quoted string syntax is stored as
Unicode.
That is from the Unicode HOWTO, which will help you.
So you can do whatever you did in Python 2, just without the u prefix, and it will work with no extra effort.

Related

extract some arabic/persian (unicode) words with regex using python [duplicate]

I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using Python's re library.
Example (the word "شرکت" means "company" and we want to extract the company name):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer], and it would be fine to replace "university" with "شرکت" in that example. But I don't understand how to match keywords written in Arabic script with a regex when this does not work:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None

Python 2 does not treat plain string literals as Unicode (for example when you paste non-ASCII letters, or use \u in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic text will be stored as the encoded bytes, which are different from the Unicode code points, and the string on the right will contain literal backslashes, since Python 2 does not recognize \u as a valid escape in a byte string.
Another option is to import from the future. In Python 3 every string literal is parsed as Unicode, which makes u"..." somewhat obsolete:
from __future__ import unicode_literals
This makes Python 2 parse plain string literals as Unicode, with no u"" needed.
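For the extraction itself, a minimal Python 3 sketch could look like the following. The helper name extract_company_name and the rule "everything after the keyword and some whitespace" are assumptions made for illustration, not something the question specifies.

```python
import re

# The keyword written with escapes; in Python 3 this is the same string
# as the literal word for "company" typed directly.
KEYWORD = "\u0634\u0631\u06A9\u062A"


def extract_company_name(text):
    # Hypothetical rule: the name is whatever follows the keyword.
    match = re.match(KEYWORD + r"\s+(.+)", text)
    return match.group(1).strip() if match else None


sample = KEYWORD + " تست گستران خلیج فارس"
print(extract_company_name(sample))
```

Because both the pattern and the input are str objects in Python 3, no u prefix or __future__ import is needed.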

r"string" b"string" u"string" Python 2 / 3 comparison

I already know that r"string" in Python 2.7 is often used for regex patterns. I have also seen u"string" for, I think, Unicode strings. Now with Python 3 we see b"string".
I have searched for these in different sources and questions, such as What does a b prefix before a python string mean?, but it's difficult to see the big picture of all these string prefixes in Python, especially with Python 2 vs 3.
Question: would you have a rule of thumb to remember the different types of strings with prefixes in Python? (Or maybe a table with a column for Python 2 and one for Python 3?)
NB: I have read a few questions and answers, but I haven't found an easy-to-remember comparison of all prefixes in Python 2 and 3.

From the Python docs on string and bytes literals: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.
Both string and bytes literals may optionally be prefixed with a
letter 'r' or 'R'; such strings are called raw strings and treat
backslashes as literal characters. As a result, in string literals,
'\U' and '\u' escapes in raw strings are not treated specially. Given
that Python 2.x’s raw unicode literals behave differently than Python
3.x’s the 'ur' syntax is not supported.
and
A string literal with 'f' or 'F' in its prefix is a formatted string
literal; see Formatted string literals. The 'f' may be combined with
'r', but not with 'b' or 'u', therefore raw formatted strings are
possible, but formatted bytes literals are not.
So:
r means raw
b means bytes
u means unicode
f means format
The r and b prefixes were already available in Python 2, as in many other languages (they are very handy sometimes).
Since string literals were not Unicode in Python 2, u-strings were created to offer support for internationalization. As of Python 3, Unicode strings are the default, so "..." is semantically the same as u"...".
Finally, of those, the f-string is the only one that isn't supported in Python 2.
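A small Python 3 sketch of what each prefix produces; the variable names are arbitrary and chosen only for this illustration.

```python
name = "world"

raw = r"a\tb"              # backslash and 't' stay as two characters
plain = "a\tb"             # \t becomes a single tab character
data = b"abc"              # bytes, not str
legacy = u"abc"            # accepted in Python 3.3+, same type as "abc"
formatted = f"hello {name}"
raw_formatted = rf"{name}\n"   # 'r' and 'f' may be combined

for value in (raw, plain, data, legacy, formatted, raw_formatted):
    print(type(value).__name__, repr(value))
```

Only the b prefix changes the resulting type; the others all yield str.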

u-strings are for Unicode in Python 2. You can mostly forget about them in modern applications: default strings in Python 3 are all Unicode, and if you're migrating from Python 2 you'll most probably use from __future__ import unicode_literals, which gives [almost] the same behavior in Python 2.
b-strings are for raw bytes: they carry no notion of text, just a stream of bytes. They are rarely typed into your source; most often they are the result of network or low-level code, such as reading data in binary format, unpacking archives, or working with encryption libraries.
Converting between a b-string and a str is done via encode() and decode():
# python 3
>>> 'hēllö'.encode('utf-8')
b'h\xc4\x93ll\xc3\xb6'
>>> b'h\xc4\x93ll\xc3\xb6'.decode()
'hēllö'
# python 2 without __future__
>>> u'hēllö'.encode('utf-8')
'h\xc4\x93ll\xc3\xb6'
>>> 'h\xc4\x93ll\xc3\xb6'.decode('utf-8')
u'h\u0113ll\xf6' # this is correct representation
r-strings are not specifically for regex; the r stands for "raw" string. Unlike regular string literals, an r-string gives no special meaning to escape sequences. For example, the normal string 'abc\n' is 4 characters long and its last character is the special "newline" character, which the literal expresses by escaping with \. The raw string r'abc\n' is 5 characters long, and its last two characters are literally \ and n. Two places where you often see raw strings:
regex patterns, so that regex escapes don't get mixed up with string-literal escapes
file paths on Windows systems: Windows uses \ as the path delimiter, so normal string literals look like 'C:\\dir\\file' or '\\\\share\\dir', while the raw forms are nicer: r'C:\dir\file' and r'\\share\dir' respectively
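The length claims and the two typical uses above can be checked with a short sketch like this one (the sample strings are made up for illustration):

```python
import re

normal = 'abc\n'       # 4 characters, the last one is a newline
raw = r'abc\n'         # 5 characters, ending in a backslash and 'n'
print(len(normal), len(raw))

# Windows-style paths: the raw and the escaped spellings are equal.
same_path = 'C:\\dir\\file' == r'C:\dir\file'
same_share = '\\\\share\\dir' == r'\\share\dir'
print(same_path, same_share)

# Regex: the raw string passes \d through to the re module untouched.
numbers = re.findall(r'\d+', 'a1 b22')
print(numbers)
```

One limitation worth remembering is that a raw string literal cannot end in a single backslash.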
Also worth noting are f-strings, which arrived with Python 3.6 as a simple and powerful way of formatting strings:
f'a equals {a} and b is {b}' substitutes the values of the variables a and b at runtime.

There are really only two types of string (or string-like object) in Python.
The first is 'Unicode' strings, which are a sequence of characters.
The second is bytes (or 'bytestrings'), which are a sequence of bytes.
The first is a series of characters (code points) defined in the Unicode specification.
The second is a series of integers between 0 and 255 that are usually rendered to text using some assumed encoding such as ASCII or UTF-8 (which is a specification for encoding Unicode characters in a bytestream).
In Python 2, the default "my string" is a bytestring.
The prefix 'u' indicates a 'Unicode' string, e.g. u"my string".
In Python 3, 'Unicode' strings became the default, and thus "my string" is equivalent to u"my string" (the u prefix is not accepted in Python 3.0 through 3.2, and is allowed again from 3.3).
To get the old Python 2 bytestrings, you use the prefix b, as in b"my string".
There are two further prefixes, but they do not affect the type of string object, just the way it is interpreted.
The first is 'raw' strings which do not interpret escape characters such as \n or \t. For example, the raw string r"my_string\n" contains the literal backslash and 'n' character, while "my_string\n" contains a linebreak at the end of the line.
The second was introduced in Python 3.6: formatted strings with the prefix 'f'. In these, curly braces mark expressions to be evaluated. For example, the string in:
my_object = 'avocado'
f"my {0.5 + 1.0, my_object} string"
evaluates to "my (1.5, 'avocado') string" (the comma creates a tuple, so its elements are shown with repr and the inner string keeps its quotes). The expressions are evaluated once, when the f-string expression itself is executed; the result is an ordinary str with nothing special about it afterwards.
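A short sketch, reusing the avocado example, that shows both the tuple formatting and the fact that the result does not change when the variable is rebound later:

```python
my_object = 'avocado'

as_tuple = f"my {0.5 + 1.0, my_object} string"   # the comma builds a tuple
as_value = f"my {my_object} string"              # a single expression

# Rebinding the name afterwards does not affect strings already built.
my_object = 'banana'

print(as_tuple)
print(as_value)
```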
And finally, you can use the multiline string notation:
"""this is my
multiline
string"""
with 'r' or 'f' specifiers as you wish.
In Python 2, if you have used no prefix or only an 'r' prefix, it is a bytestring, and if you have used a 'u' prefix it is a Unicode string.
In Python 3, if you have used no prefix, or 'u', 'r', 'f', or 'r' together with 'f', it is a Unicode string. If you have used a 'b' prefix (optionally with 'r') it is a bytestring. Combining 'u' with any other prefix, including 'b', is not allowed.

This is what I observed (seems confirmed by other answers):
Python 2                      Python 3
-----------------------------------------------
"hello"                       b"hello"
b"hello"            <=>
<type 'str'>                  <class 'bytes'>
-----------------------------------------------
u"hello"            <=>       "hello"
                              u"hello"
<type 'unicode'>              <class 'str'>

python version 3.4 does not support a 'ur' prefix

I have some Python code written for an older version of Python (2.x) and I'm struggling to make it work. I'm using Python 3.4.
_eng_word = ur"[a-zA-Z][a-zA-Z0-9'.]*"
(it's part of a tokenizer)

http://bugs.python.org/issue15096
Title: Drop support for the "ur" string prefix
When PEP 414 restored support for explicit Unicode literals in Python 3, the "ur" string prefix was deemed to be a synonym for the "r" prefix.
So, use 'r' instead of 'ur'.

Indeed, Python 3.4 supports u'...' (for code that needs to run on both Python 2 and 3) and r'...', but not the two combined. That's because ur'..' in Python 2 behaves differently from how ur'..' would behave in Python 3: in Python 2, \uhhhh and \Uhhhhhhhh escapes are still processed inside a raw Unicode literal, whereas in a Python 3 r'...' string they are not.
Note that in this specific case there is no difference between the raw string literal and the regular! You can just use:
_eng_word = u"[a-zA-Z][a-zA-Z0-9'.]*"
and it'll work in both Python 2 and 3.
For cases where a raw string literal does matter, you could decode the raw string from raw_unicode_escape on Python 2, catching the AttributeError on Python 3:
_eng_word = r"[a-zA-Z][a-zA-Z0-9'.]*"
try:
    # Python 2
    _eng_word = _eng_word.decode('raw_unicode_escape')
except AttributeError:
    # Python 3
    pass
If you are writing Python 3 code only (so it doesn't have to run on Python 2 anymore), just drop the u entirely:
_eng_word = r"[a-zA-Z][a-zA-Z0-9'.]*"
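As a Python 3 only sketch, the tokenizer pattern can be compiled from a plain raw string. The helper ur_prefix_is_rejected is a made-up name; it compiles a tiny source snippet to show that the ur prefix is a syntax error without breaking this file itself. The sample sentence is also invented.

```python
import re

ENG_WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9'.]*")


def ur_prefix_is_rejected():
    # Compile the snippet from a string so the SyntaxError can be caught.
    try:
        compile('ur"abc"', "<demo>", "eval")
    except SyntaxError:
        return True
    return False


tokens = ENG_WORD.findall("It's 3 o'clock in N.Y.")
print(tokens)
print(ur_prefix_is_rejected())
```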

This table compares (some of) the different string literal prefixes in Python 2(.7) and 3(.4+):
In short, Python 3 has no literal form that leaves ordinary escapes such as \t unprocessed while still processing \u escapes. To get such a string with code that works in both Python 2 and 3, use:
br"[a-zA-Z][a-zA-Z0-9'.]*".decode('raw_unicode_escape')
Actually, your example is not very good, since it doesn't have any unicode literals, or escape sequences. A better example would be:
br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
In python 2:
>>> br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
u"[\u03b1-\u03c9\u0391-\u03a9][\\t'.]*"
In Python 3:
>>> br"[\u03b1-\u03c9\u0391-\u03a9][\t'.]*".decode('raw_unicode_escape')
"[α-ωΑ-Ω][\\t'.]*"
Which is really the same thing.
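If the string is only ever used as a regex pattern on Python 3.3 or later, the decode step may not be needed at all, because the re module itself understands \uXXXX escapes inside patterns. A sketch comparing both routes (the sample text is made up):

```python
import re

# Route 1: let the re module interpret the \u escapes in a raw pattern.
greek_raw = re.compile(r"[\u03b1-\u03c9\u0391-\u03a9]+")

# Route 2: decode a raw bytes literal, as described above.
decoded = br"[\u03b1-\u03c9\u0391-\u03a9]+".decode("raw_unicode_escape")
greek_decoded = re.compile(decoded)

sample = "abc \u03b1\u03b2\u03b3 \u0394\u0395"
print(greek_raw.findall(sample))
print(greek_decoded.findall(sample))
```

The two compiled patterns match the same text; only the second one contains the actual Greek characters in its pattern string.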

Python .split() without 'u

In Python, if I have a string like:
a =" Hello - to - everybody"
And I do
a.split('-')
then I get
[u'Hello', u'to', u'everybody']
This is just an example.
How can I get a simple list without that annoying u'??

The u means that it's a unicode string - your original string must also have been a unicode string. Generally it's a good idea to keep strings Unicode as trying to convert to normal strings could potentially fail due to characters with no equivalent.
The u is purely used to let you know it's a unicode string in the representation - it will not affect the string itself.
In general, unicode strings work exactly as normal strings, so there should be no issue with leaving them as unicode strings.
In Python 3.x, unicode strings are the default, and don't have the u prepended (instead, bytes (the equivalent to old strings) are prepended with b).
If you really, really need to convert to a normal string (rarely the case, but potentially an issue if you are using an extension library that doesn't support unicode strings, for example), take a look at unicode.encode() and unicode.decode(). You can either do this before the split, or after the split using a list comprehension.
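For comparison, a sketch of the same split in Python 3, where the list shows no u at all. The strip() call is an addition made here to drop the spaces around each piece, and the UTF-8 encoding is an assumed choice for the rare case where bytes are required.

```python
a = " Hello - to - everybody"

# split() keeps the surrounding spaces, so strip each piece.
parts = [part.strip() for part in a.split('-')]
print(parts)

# Joining or printing the items never shows a prefix; the prefix only
# appears in the repr of a list in Python 2.
print(', '.join(parts))

# Only if a library really requires bytes: encode after splitting.
encoded = [part.encode('utf-8') for part in parts]
print(encoded)
```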

I had the opposite problem. The str '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀' needs to be split on the Unicode character \u3000 (ideographic space). By mistake I wrote split('\u'), which led to a Unicode syntax error, because \u must be followed by four hex digits.
I should have written split('\u3000').
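A quick sketch of the working call in Python 3. The helper is_valid_literal is a hypothetical name; it compiles a snippet from a string so that the truncated escape can be shown without making this file itself invalid.

```python
title = '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀'

# Split on the ideographic space only; the ordinary space stays in place.
parts = title.split('\u3000')
print(parts)


def is_valid_literal(source):
    # True if the given source text compiles as an expression.
    try:
        compile(source, "<demo>", "eval")
    except SyntaxError:
        return False
    return True


print(is_valid_literal(r"'\u'"))       # truncated escape
print(is_valid_literal(r"'\u3000'"))   # complete escape
```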

Unicode literals causing invalid syntax

The following code:
s = s.replace(u"&", u"&")
is causing an error in python:
SyntaxError: invalid syntax
Removing the u's before the quotes fixes the problem, but shouldn't this work as is? I'm using Python 3.1.

The u prefix is no longer used in Python 3, and in Python 3.0 through 3.2 it is a syntax error. String literals are Unicode by default. See What's New in Python 3.0:
You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.

On Python 3, strings are unicode. There is no need to (and as you've discovered, you can't) put a u before the string literal to designate unicode.
Instead, you have to put a b before a byte literal to designate that it isn't unicode.

In Python 3.3+ the u'...' literal is valid again; see What's New In Python 3.3:
New syntax features:
New yield from expression for generator delegation.
The u'unicode' syntax is accepted again for str objects.
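On Python 3.3 or later, a sketch like the following runs without a syntax error. The sample strings are invented for illustration, since the question's own replacement values are not clear from the post.

```python
legacy = u"&"      # accepted again since Python 3.3
modern = "&"
same = legacy == modern and type(legacy) is str

text = "fish & chips".replace(u"&", "and")
print(same, text)

# Binary data still needs the b prefix, and the two types do not mix.
data = b"fish & chips".replace(b"&", b"and")
try:
    "fish & chips".replace(b"&", "and")
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(data, mixed_ok)
```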
