Regex sub function not working with unicode string - python

I'm trying to use Python's sub function and I'm having a problem getting it to work. From the troubleshooting I've been doing I believe it has something to do with the unicode characters in the string.
# -*- coding: utf-8 -*-
import re
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

def someFunction(string):
    string = string.decode('utf-8')
    match = re.search(ur'éé', string)
    if match:
        print >> sys.stderr, "It was found"
    else:
        print >> sys.stderr, "It was NOT found"
    if isinstance(string, str):
        print >> sys.stderr, 'string is a string object'
    elif isinstance(string, unicode):
        print >> sys.stderr, 'string is a unicode object'
    new_string = re.sub(ur'éé', ur'é:', string)
    return new_string

stringNew = 'éégktha'
returnedString = someFunction(stringNew)
print >> sys.stderr, "After printing it: " + returnedString
# At this point in the code string = 'éégktha'
returnString = someFunction(string)
print >> sys.stderr, "After printing it: " + returnedString
So I would like 'é:gktha'. Below is what is printed to the error log when I run this code.
It was found
string is a unicode object
é:gktha
It was NOT found
string is a unicode object
éégktha
So I'm thinking it must be something to do with the string that is passed into my function. When I declare it as a unicode string, or as a string literal that I then decode, the pattern is found. But the pattern is not found in the string being passed in. I thought my string = string.decode('utf-8') statement would convert any string passed into the function and that the regex would then work.
I tried to do this in the python interpreter to work through this and when I declare string as a unicode string it works.
string = u'éégktha'
So to simulate the function I declared the string, decoded it, and then tried my regex statement, and it worked.
string = 'éégktha'
newString = string.decode('utf8')
string = re.sub(ur'éé', ur'é:', newString)
print string #é:gktha
This is a web app that works with a lot of unicode characters. It runs on Python 2.5, and I've always had a hard time working with unicode characters. Any help and knowledge is greatly appreciated.

You should print what is returned by someFunction.
>>> string = 'éégktha'
>>> def someFunction(string):
...     #string = 'éégktha'
...     string = string.decode('utf8')
...     new_string = re.sub(ur'éé', ur'é:', string)
...     return new_string
...
>>> import re
>>> someFunction(string)
u'\xe9:gktha'
>>> print someFunction(string)
é:gktha
Your function is fine. In the simulation you use print, which shows what __str__ returns, whereas when a value is merely returned the interactive interpreter echoes the __repr__ of new_string/newString.
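The same distinction can be sketched in Python 3. This is an illustration only, not a reproduction of the Python 2.5 session above: in Python 3 every str is already Unicode, so no decode step is needed, and the built-in ascii() is used here (an assumption made for the sake of the demo) to mimic the escaped repr that Python 2 shows for unicode objects.

```python
import re

# Python 3: string literals are already Unicode, so no .decode() is needed.
new_string = re.sub('éé', 'é:', 'éégktha')

# print() uses str(): the characters themselves are shown.
print(new_string)

# ascii() escapes non-ASCII characters, much like Python 2's repr() of a
# unicode object (without the leading u prefix).
print(ascii(new_string))
```

The substitution succeeds in both cases; only the way the result is displayed differs.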

Related

Use single quote and double quote same time as string python

How can I use single quotes and double quotes at the same time in a Python string?
For example:
string = "Let's print "Happines" out"
result should be Let's print "Happines" out
I tried using a backslash, but then a \ is printed before the 's, which should not be there.
In Python there are many ways to write string literals.
For this example you can:
print('Let\'s print "Happiness" out')
print("Let's print \"Happiness\" out")
print('''Let's print "Happiness" out''')
print("""Let's print "Happiness" out""")
Any of the above will behave as expected.
Taking this string:
string = "Let's print "Happines" out"
If you want to mix quotes, use the triple single quotes:
>>> string = '''Let's print "Happines" out'''
>>> print(string)
Let's print "Happines" out
Using triple double quotes is acceptable too:
>>> string = """Let's print "Happines" out"""
>>> print(string)
Let's print "Happines" out
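As a quick check, the sketch below compares the four spellings and shows where a stray backslash can come from. The variable names are hypothetical, and the guess that the backslash in the question came from the interpreter echoing repr() rather than from print() is an assumption.

```python
# Four spellings of the same string value.
escaped_single = 'Let\'s print "Happiness" out'
escaped_double = "Let's print \"Happiness\" out"
triple_single = '''Let's print "Happiness" out'''
triple_double = """Let's print "Happiness" out"""

variants = [escaped_single, escaped_double, triple_single, triple_double]
all_equal = len(set(variants)) == 1
print(all_equal)

# print() shows the text itself, with no backslash.
print(escaped_single)

# repr() (what the interactive prompt echoes) has to escape one kind of
# quote when the string contains both, so a backslash appears here.
echoed = repr(escaped_single)
print(echoed)
```

The backslash is part of the literal's notation only; it is never stored in the string.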

Turning a unicode code point into a unicode character in Python

I'm parsing hex/unicode escapes from text.
So I'll have an input string like
\x{abcd}
which is easy enough - I wind up with an array ["ab", "cd"] which I call digits and do this to it:
return bytes(int(d, 16) for d in digits).decode("utf-8")
So I basically accept everything between the {} as a UTF-8-encoded character and turn it into a character. Simple.
>>> bytes(int(d, 16) for d in ["e1", "88", "92"]).decode("utf-8")
'ሒ'
But I want to go the other way: \u{1212} should result in the same character. The problem is, I don't know how to treat the resulting ["12", "12"] as a unicode code point instead of UTF-8 bytes to get the ሒ character again.
How can I do this in python 3?
You can use chr after parsing the number as base-16:
>>> chr(int('1212', 16))
'ሒ'
>>> '\u1212'
'ሒ'
If you're replacing this globally in some string, using re.sub with a substitution function could make this simple:
import re

def replacer(match):
    if match.group(2) == 'u':
        return chr(int(match.group(3), 16))
    elif match.group(2) == 'x':
        return # ...

re.sub(r'(\\(x|u)\{(.*?)\})', replacer, r'\x{abcd} foo \u{1212}')
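One way the elided branch might be completed, as a sketch only: it assumes that \x{...} holds an even number of hex digits forming valid UTF-8 bytes (as in the question) and that \u{...} holds a single code point. The pattern name ESCAPE and the sample input are made up for the illustration; the input uses e18892 because the bytes ab cd are not valid UTF-8 on their own.

```python
import re

# Hypothetical pattern: \x{...} or \u{...} with hex digits inside the braces.
ESCAPE = re.compile(r'\\(x|u)\{([0-9A-Fa-f]+)\}')

def replacer(match):
    kind, digits = match.group(1), match.group(2)
    if kind == 'u':
        # Treat the digits as one Unicode code point.
        return chr(int(digits, 16))
    # kind == 'x': treat the digits as UTF-8 encoded bytes.
    # bytes.fromhex raises ValueError on an odd number of digits, and
    # decode raises UnicodeDecodeError on invalid UTF-8.
    return bytes.fromhex(digits).decode('utf-8')

text = r'\x{e18892} foo \u{1212}'
result = ESCAPE.sub(replacer, text)
print(result)
```

Both escapes in the sample resolve to the same character, once from its UTF-8 bytes and once from its code point.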
Do you mean to encode the string like this?
>>> print("\u1212")
ሒ
>>> print("\u00A9")
©
Edit:
If you start with a string, it's just
>>> chr(int("1212", 16))
'ሒ'

Convert Python's internal str to print equivalent

Currently I have:
>>> class_name = 'AEROSPC\xc2\xa01A'
>>> print(class_name)
AEROSPC 1A
>>> 'AEROSPC 1A' == class_name
False
How can I convert class_name into 'AEROSPC 1A'? Thanks!
Convert to Unicode
Converting that string directly gives interesting errors, so I first decoded it from UTF-8:
my_utf8 = 'AEROSPC\xc2\xa01A'.decode('utf8', 'ignore')
my_utf8
returns:
u'AEROSPC\xa01A'
Then I normalize the string. The \xa0 is a non-breaking space, which NFKC normalization maps to an ordinary space.
import unicodedata
my_normed_utf8 = unicodedata.normalize('NFKC', my_utf8)
print my_normed_utf8
prints:
AEROSPC 1A
Convert back to String
which I can then convert back to an ASCII string:
my_str = str(my_normed_utf8)
print my_str
prints:
AEROSPC 1A
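For comparison, a Python 3 sketch of the same two steps. It assumes the value arrives as UTF-8 bytes; if it is already a str containing U+00A0, the decode step can be skipped. The variable names are hypothetical.

```python
import unicodedata

raw = b'AEROSPC\xc2\xa01A'       # UTF-8 bytes, as in the question
text = raw.decode('utf-8')        # str containing U+00A0 (non-breaking space)

# NFKC replaces compatibility characters such as U+00A0 with their plain
# equivalents, here an ordinary space.
normalized = unicodedata.normalize('NFKC', text)

print(text == 'AEROSPC 1A')
print(normalized == 'AEROSPC 1A')
```

In Python 3 there is no final str() step, since the normalized value is already an ordinary str.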

Python3 : unescaping non ascii characters

(Python 3.3.2) I have to unescape some escaped non-ASCII characters returned by a call to re.escape(). The methods I found here and here don't work. I'm working in a 100% UTF-8 environment.
import codecs

# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )

# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)

# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS : I edited my post thanks to the Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what goes wrong inside encode() and decode() in Python 3, but my friend solved this problem while we were writing some tools.
What we did was avoid calling encode("utf_8") after the unescaping step.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that although the result of decode("unicode_escape") looks weird, the code points of the resulting str are exactly the UTF-8 bytes of your string, in this case b'\xe2\x82\xac\n'.
So we neither print the str object directly nor call encode("utf_8") on it; instead we use ord() on each character to build the bytes object b'\xe2\x82\xac\n'.
You can then get the correct str from this bytes object by passing it to str() together with the encoding.
BTW, the tool my friend and I wanted to make is a wrapper that lets the user type a C-like string literal and converts the escape sequences automatically.
User input:\n\x61\x62\n\x20\x21 # 20 characters, which represent 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a powerful way for a user to enter non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
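The ord() loop above is equivalent to encoding with Latin-1, which maps code points 0 to 255 onto the same byte values. A minimal sketch of that variant follows; the helper name unescape_keep_utf8 is hypothetical. It shares the limits of the original approach: it fails if an escape such as \u20ac yields a code point above U+00FF, or if the resulting bytes are not valid UTF-8.

```python
def unescape_keep_utf8(text):
    """Process backslash escapes in text while keeping non-ASCII characters.

    Hypothetical helper for illustration only.
    """
    # Step 1: unicode_escape interprets the escapes but reads every
    # non-escape byte as a Latin-1 character.
    decoded = text.encode('utf-8').decode('unicode_escape')
    # Step 2: Latin-1 turns those characters back into the original bytes,
    # which can then be decoded as the UTF-8 they really are.
    return decoded.encode('latin-1').decode('utf-8')

result = unescape_keep_utf8('€\\n')   # 3 characters in: €, backslash, n
print(repr(result))
```

The input here is the three-character string from the answer, and the result is the euro sign followed by a real newline.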
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh: # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
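A small round-trip sketch of the decode-on-input, encode-on-output idea described above. The name wire is hypothetical and stands in for bytes received from a file or the network.

```python
# Escaped bytes as they might arrive from outside the application.
wire = b'\\u20ac\\n'

# Decode on the way in: the escapes become real characters.
s = wire.decode('unicode_escape')
print(len(s))

# Encode on the way out: the characters become ASCII-only escapes again.
back = s.encode('unicode_escape')
print(back == wire)
```

Inside the application the value is an ordinary two-character str; the escaped form exists only at the boundary.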
import string

printable = string.printable
printable = printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.

How do I get the raw representation of a string in Python?

I am making a class that relies heavily on regular expressions.
Let's say my class looks like this:
class Example:
    def __init__(self, regex):
        self.regex = regex

    def __repr__(self):
        return 'Example({})'.format(repr(self.regex.pattern))
And let's say I use it like this:
import re
example = Example(re.compile(r'\d+'))
If I do repr(example), I get 'Example('\\\\d+')', but I want 'Example(r'\\d+')'. (Note the extra backslashes: they are there so that the value appears correctly when printed.) I suppose I could implement it to return "r'{}'".format(regex.pattern), but that doesn't sit well with me. In the unlikely event that the Python Software Foundation someday changes the syntax for raw string literals, my code won't reflect that. That's hypothetical, though. My main concern is whether this always works; I can't think of an edge case off the top of my head. Is there a more formal way of doing this?
EDIT: Nothing seems to appear in the Format Specification Mini-Language, the printf-style String Formatting guide, or the string module.
The problem with a raw string representation is that you cannot represent everything in a portable manner (i.e. without using control characters). For example, if your string contained a line break, you would have to literally break the literal onto the next line, because a line break cannot be escaped in a raw string.
That said, the actual way to get rawstring representation is what you already gave:
"r'{}'".format(regex.pattern)
The definition of raw strings is that no escape rules are applied, except that the literal ends at the quotation character it started with and that you can escape said quotation character using a backslash (the backslash itself stays in the string). Thus, for example, you cannot write a string consisting of a single backslash, "\\", as a raw string literal (r"\" yields a SyntaxError and r"\\" yields "\\\\").
If you really want to do this, you should use a wrapper like:
def rawstr(s):
    """
    Return the raw string representation (using r'' literals) of the string
    *s* if it is available. If any invalid characters are encountered (or a
    string which cannot be represented as a rawstr), the default repr() result
    is returned.
    """
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'
    return pattern.format(s)
Tests:
>>> test1 = "\\"
>>> test2 = "foobar \n"
>>> test3 = r"a \valid rawstring"
>>> test4 = "foo \\\\\\"
>>> test5 = r"foo \\"
>>> test6 = r"'"
>>> test7 = r'"'
>>> print(rawstr(test1))
'\\'
>>> print(rawstr(test2))
'foobar \n'
>>> print(rawstr(test3))
r'a \valid rawstring'
>>> print(rawstr(test4))
'foo \\\\\\'
>>> print(rawstr(test5))
r'foo \\'
>>> print(rawstr(test6))
r"'"
>>> print(rawstr(test7))
r'"'
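Since the question's main worry is whether a raw literal always works, one alternative is to let Python check each candidate. The sketch below swaps in a different technique from the rule-based wrapper above: it builds a candidate literal and keeps it only if ast.literal_eval parses it back to the same string. The helper name raw_repr is hypothetical.

```python
import ast

def raw_repr(s):
    """Return a raw string literal for s if one round-trips, else repr(s).

    Hypothetical helper: each candidate is verified by parsing it back.
    """
    for quote in ("'", '"'):
        candidate = 'r' + quote + s + quote
        try:
            if ast.literal_eval(candidate) == s:
                return candidate
        except (SyntaxError, ValueError):
            # Not a valid literal (trailing backslash, embedded newline,
            # clashing quotes, ...): try the next quote style.
            pass
    return repr(s)

print(raw_repr(r'\d+'))        # a typical regex pattern
print(raw_repr('\\'))          # single backslash: falls back to repr()
print(raw_repr("foobar \n"))   # newline: falls back to repr()
```

Because every candidate is checked against the real parser, the helper does not need its own list of edge cases, at the cost of one parse per call.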
