I would like to use regular expressions on bytestrings in python of which I know the encoding (utf-8). I am facing difficulties trying to use character classes that involve characters that are encoded using more than one bit block. They appear to become two or more 'characters' that are matched separately in the character class.
Performing the search on (unicode) strings instead is possible, but I would like to know if there is a solution to defining character classes for the case of bytestrings as well. Maybe it's just not possible!?
Below is a python 3 example that shows what happens when I try to replace different line breaks with '\n':
import re
def show_pattern(pattern):
print(f"\nPattern repr:\t{repr(pattern)}")
def test_sub(pattern, replacement, text):
print(f"Before repr:\t{repr(text)}")
result = re.sub(pattern, replacement, text)
print(f"After repr:\t{repr(result)}")
# Pattern for line breaks
PATTERN = '[' + "\u000A\u000B\u000C\u000D\u0085\u2028\u2029" + ']'
REPLACEMENT = '\n'
TEXT = "How should I replace my unicode string\u2028using utf-8-encoded bytes?"
show_pattern(PATTERN)
test_sub(PATTERN, REPLACEMENT, TEXT)
# expected output:
# Pattern repr: '[\n\x0b\x0c\r\x85\u2028\u2029]'
# Before repr: 'How should I replace my unicode string\u2028using utf-8-encoded bytes?'
# After repr: 'How should I replace my unicode string\nusing utf-8-encoded bytes?'
ENCODED_PATTERN = PATTERN.encode('utf-8')
ENCODED_REPLACEMENT = REPLACEMENT.encode('utf-8')
ENCODED_TEXT = TEXT.encode('utf-8')
show_pattern(ENCODED_PATTERN)
test_sub(ENCODED_PATTERN, ENCODED_REPLACEMENT, ENCODED_TEXT)
# expected output:
# Pattern repr: b'[\n\x0b\x0c\r\xc2\x85\xe2\x80\xa8\xe2\x80\xa9]'
# Before repr: b'How should I replace my unicode string\xe2\x80\xa8using utf-8-encoded bytes?'
# After repr: b'How should I replace my unicode string\n\n\nusing utf-8-encoded bytes?'
In the encoded version, I end up with three '\n''s instead of one. Similar things happen for a more complicated document where it's not obvious what the correct output should be.
You may use an alternation based pattern rather than a character class, as you will want to match sequences of bytes:
PATTERN = "|".join(['\u000A','\u000B','\u000C','\u000D','\u0085','\u2028','\u2029'])
See the online demo.
If you prefer to initialize the pattern from a string use
CHARS = "\u000A\u000B\u000C\u000D\u0085\u2028\u2029"
PATTERN = "|".join(CHARS)
I'm using the argparse library in Python 3 to read in Unicode strings from a command line parameter. Often those strings contain "ordinary" Unicode characters (extended Latin, etc.), but sometimes--particularly when the characters belong to a right-to-left script--it's easier to encode the strings as Unicode code points, like \u0644. But argparse treats these designators as a sequence of characters, and does not convert them into the character they designate. For instance, if a command line parameter is
... -a "abc\06d2d" ...
then what I get in the argparse variable is
"abc\06d2d"
rather than the expected
"abcےd"
(the character between the 'c' and 'd' is the yeh baree). Of course both outcomes are logical, it's just that the second one is the one I want.
I tried to reproduce this in an interpreter, but under most circumstances Python3 automagically converts a string like "abc\06d2d" into "abcےd". Not so when I read the string using argparse...
I came up with a function to do the conversion, see below. But I feel like I'm missing something much simpler. Is there an easier way to do this conversion? (Obviously I could use str.startswith(), or regex's to match the entire thing, rather than going character by character, but the code below is really just an illustration. It seems like I shouldn't have to create my own function to do this at all, especially since in some circumstances it seems to happen automatically.)
---------My code to do this follows---------
def ParseString2Unicode(sInString):
"""Return a version of sInString in which any Unicode code points of the form
\uXXXX (X = hex digit)
have been converted into their corresponding Unicode characters.
Example:
"\u0064b\u0065"
becomes
"dbe"
"""
sOutString = ""
while sInString:
if len(sInString) >= 6 and \
sInString[0] == "\\" and \
sInString[1] == "u" and \
sInString[2] in "0123456789ABCDEF" and \
sInString[3] in "0123456789ABCDEF" and \
sInString[4] in "0123456789ABCDEF" and \
sInString[5] in "0123456789ABCDEF":
#If we get here, the first 6 characters of sInString represent
# a Unicode code point, like "\u0065"; convert it into a char:
sOutString += chr(int(sInString[2:6], 16))
sInString = sInString[6:]
else:
#Strip a single char:
sOutString += sInString[0]
sInString = sInString[1:]
return sOutString
What you may want to look at is the raw_unicode_escape encoding.
>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1
So, the function would be:
def ParseString2Unicode(sInString):
try:
decoded = sInString.encode('utf-8')
return decoded.decode('raw_unicode_escape')
except UnicodeError:
return sInString
This, however, also matches other unicode escape sequences, like \Uxxxxxxxx. If you just want to match \uxxxx, use a regex, like so:
import re
escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')
def _escape_sequence_to_char(match):
return chr(int(match[0][2:], 16))
def ParseString2Unicode(sInString):
return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)
A concise, flexible way of handling this would be to use a regular expression:
return re.sub(
r"\\u([0-9A-Fa-f]{4})",
lambda m: chr(int(m[1], 16)),
sInString
)
I have a py3 string that includes escaped utf-8 sequencies, such as "Company\\ffffffc2\\ffffffae", which I would like to convert to the correct utf 8 string (which would in the example be "Company®", since the escaped sequence is c2 ae). I've tried
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
"\\\\ffffff", "\\x"), "ascii").decode("utf-8"))
result: Company\xc2\xae
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace (
"\\\\ffffff", "\\x"), "ascii").decode("unicode_escape"))
result: Company®
(wrong, since chracters are treated separately, but they should be treated together.
If I do
print (b"Company\xc2\xae".decode("utf-8"))
It gives the correct result.
Company®
How can i achieve that programmatically (i.e. starting from a py3 str)
A simple solution is:
import ast
test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = ast.literal_eval("b'''" + test_in.replace('\\\\ffffff','\\x') + "'''").decode('utf-8')
print(test_out)
However it will fail if there is a triple quote ''' in the input string itself.
Following code does not have this problem, but it is not as simple as the first one.
In the first step the string is split on a regular expression. The odd items are ascii parts, e.g. "Company"; each even item corresponds to one escaped utf8 code, e.g. "\\\\ffffffc2". Each substring is converted to bytes according to its meaning in the input string. Finally all parts are joined together and decoded from bytes to a string.
import re
REGEXP = re.compile(r'(\\\\ffffff[0-9a-f]{2})', flags=re.I)
def convert(estr):
def split(estr):
for i, substr in enumerate(REGEXP.split(estr)):
if i % 2:
yield bytes.fromhex(substr[-2:])
elif substr:
yield bytes(substr, 'ascii')
return b''.join(split(estr)).decode('utf-8')
test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert(test_in))
The code could be optimized. Ascii parts do not need encode/decode and consecutive hex codes should be concatenated.
I have been looking for a while now but I am not able to find a proper solution.
I have a database with Dutch, French and German words which all have their special characters. e.g. é, è, ß, ç, etc...
For some cases, like in a url, I would like to replace these with alphanumeric characters. respectively e, e, ss, c, etc...
Is there a generic function or Python package that does this?
I could do this with Regex of course, but something generic would be great here.
Thanks.
try this package: https://pypi.python.org/pypi/Unidecode
>>> import unidecode
>>> unidecode.unidecode(u'çß')
'css'
As you say, this could be done using a Regex sub. You would of course need to include upper and lowercase variants.
import re
data = "é, è, ß, ç, äÄ"
lookup = {'é':'e', 'è':'e', 'ß':'ss', 'ç':'c', 'ä':'a', 'Ä':'A'}
print re.sub(r'([éèßçäÄ])', lambda x: lookup[x.group(1)], data)
This would display the following:
e, e, ss, c, aA
you can almost get away with the builtin unicode data (unfortunately a few of your characters break it)
>>> import unicodedata
>>> s=u"é, è, ß, ç"
>>> unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
'e, e, , c'
here is a solution that has the codepoints hardcoded stolen from http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
def latin1_to_ascii (unicrap):
"""This takes a UNICODE string and replaces Latin-1 characters with
something equivalent in 7-bit ASCII. It returns a plain ASCII string.
This function makes a best effort to convert Latin-1 characters into
ASCII equivalents. It does not just strip out the Latin-1 characters.
All characters in the standard 7-bit ASCII range are preserved.
In the 8th bit range all the Latin-1 accented letters are converted
to unaccented equivalents. Most symbol characters are converted to
something meaningful. Anything not converted is deleted.
"""
xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
0xc6:'Ae', 0xc7:'C',
0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
0xd0:'Th', 0xd1:'N',
0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
0xdd:'Y', 0xde:'th', 0xdf:'ss',
0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
0xe6:'ae', 0xe7:'c',
0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
0xf0:'th', 0xf1:'n',
0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
0xfd:'y', 0xfe:'th', 0xff:'y',
0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
0xd7:'*', 0xf7:'/'
}
r = ''
for i in unicrap:
if xlate.has_key(ord(i)):
r += xlate[ord(i)]
elif ord(i) >= 0x80:
pass
else:
r += str(i)
return r
of coarse you could just as easily use a regex as indicated in the other answers
(Python 3.3.2) I have to unescape some non ASCII escaped characters returned by a call to re.escape(). I see here and here methods that doesn't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS : I edited my post thanks to the Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what's going wrong within encode() and decode() of python3, but my friend solve this problem when we are writing some tools.
How we did is to bypass the encoder("utf_8") after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that: though the result of decode("unicode_escape") looks wired, the bytes object actually contain the correct bytes of your strings(with utf-8 encoding), in this case, "\xe2\x82\xac\n"
And we now do not print the str object directly, neither do we use encode("utf_8"), we use ord() to create the bytes object b'\xe2\x82\xac\n'.
And you can get the correct str from this bytes object, just put it into str()
BTW, the tool my friend and me want to make is a wrapper that allow user to input c-like string literal, and convert the escaped sequence automatically.
User input:\n\x61\x62\n\x20\x21 # 20 characters, which present 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a powerful tool for user to input some non-printable character in terminal.
Our final tools is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh: # binary mode
fh.write(b'print("' + s.encode('unicode_escape') + b'")')
import string
printable = string.printable
printable = printable + '€'
def cod(c):
return c.encode('unicode_escape').decode('ascii')
def unescape(s):
return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)
mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.