Replacing Unicode character / Python / Django

I'm pretty much forced to replace some Unicode characters in a string returned by an OCR service, and the only way I have found to do it is to replace them "one by one". This is done using the following code:
def recode(mystr):
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106', '\u0106')
    ...
    ...
    mystr = mystr.replace(r'\u017a', '\u017a')
    mystr = mystr.replace(r'\u017c', '\u017c')
    return mystr
I know this might be confusing. The string returned by the OCR API contains literal escape sequences: for example "\u017a" is not the mapped Unicode character but rather the six characters "\", "u", "0", "1", "7", "a". This can't be changed on my end.
The above solution is very messy and unprofessional. However, if I try to loop through all the characters I want to "replace", it seems to do nothing:
def recode(mystr):
    for foo in ['\u0106', '\u0118', '\u0141', ...... , '\u017a', '\u017c']:
        mystr = mystr.replace(r'%s' % foo, foo)
    return mystr
Why is the foo string not read as raw text in this case, when in the first scenario it is? What is the difference?

The reason foo is not read as raw text is that the r prefix only plays a role when the string literal is created; afterwards the value behaves like any other string. So when the %-operator is applied, r'%s' % foo is exactly the same as '%s' % foo.
As a solution to what you want to do, you can try something like this:
bar = r"\u0104"
mystr = mystr.replace(bar, chr(int(bar[2:], 16)))
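Applied to your loop, the same idea could look like this (a minimal sketch; the list of escape sequences is only an illustrative subset of the ones you actually need):
def recode(mystr):
    # Each entry is the literal six-character sequence ("\", "u", four hex digits)
    # exactly as the OCR API returns it.
    for esc in [r'\u0104', r'\u0106', r'\u0118', r'\u0141', r'\u017a', r'\u017c']:
        # int(esc[2:], 16) reads the four hex digits; chr() turns them into the character.
        mystr = mystr.replace(esc, chr(int(esc[2:], 16)))
    return mystr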

This is an X-Y problem. The API is returning literal Unicode escape sequences. It may actually be JSON, in which case the OP should be calling json.loads() on the returned data, but if not you can use the unicode_escape codec to translate the escape codes. That codec requires a byte string, so the text may need to be encoded via ascii or latin1 first:
def recode(mystr):
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106', '\u0106')
    mystr = mystr.replace(r'\u017a', '\u017a')
    mystr = mystr.replace(r'\u017c', '\u017c')
    return mystr

def recode2(s):
    return s.encode('latin1').decode('unicode_escape')

s = r'\u0104\u017c\u0106\u017a\u017c'
print(s)
print(recode(s))
print(recode2(s))
Output:
\u0104\u017c\u0106\u017a\u017c
ĄżĆźż
ĄżĆźż
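If the API response is actually JSON, there is no need to recode anything by hand, because json.loads() already interprets \uXXXX escapes (a sketch with a made-up payload, assuming the data really is JSON):
import json

payload = r'{"text": "\u0104\u017c\u0106\u017a\u017c"}'  # hypothetical response body
data = json.loads(payload)   # json.loads decodes the \uXXXX escapes itself
print(data['text'])          # ĄżĆźż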

Related

python3 - json.loads for a string that contains " in a value

I'm trying to transform a string that contains a dict into a dict object using json.
But the data contains a stray "
example:
string = '{"key1":"my"value","key2":"my"value2"}'
js = json.loads(string, strict=False)
This raises json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 13 (char 12), because " is a delimiter and there are too many of them.
What is the best way to achieve my goal ?
The solution I have found is to perform several .replace calls on the string, replacing the legit " with placeholder patterns until only the illegal " remain, then replacing the patterns back with the legit ".
After that I can use json.loads and then replace the remaining pattern with the illegal ".
But there must be another way.
ex :
string = '{"key1":"my"value","key2":"my"value2"}'
string = string.replace('{"','__pattern_1')
string = string.replace('}"','__pattern_2')
...
...
string = string.replace('"','__pattern_42')
string = string.replace('__pattern_1','{"')
string = string.replace('__pattern_2','}"')
...
...
js = json.loads(s,strict=False)
This should work. What I am doing here is simply replacing all the expected double quotes with placeholders, then removing the unwanted double quotes, and then converting everything back.
import re
import json

def fix_json_string(st):
    st = re.sub(r'","', "!!", st)
    st = re.sub(r'":"', "--", st)
    st = re.sub(r'{"', "{{", st)
    st = re.sub(r'"}', "}}", st)
    st = st.replace('"', '')
    st = re.sub(r'}}', '"}', st)
    st = re.sub(r'{{', '{"', st)
    st = re.sub(r'--', '":"', st)
    st = re.sub(r'!!', '","', st)
    return st

broken_string = '{"key1":"my"value","key2":"my"value2"}'
fixed_string = fix_json_string(broken_string)
print(fixed_string)
js = json.dumps(eval(fixed_string))
print(js)
Output -
{"key1":"myvalue","key2":"myvalue2"} # str
{"key1": "myvalue", "key2": "myvalue2"} # converted to json
The variable string is not a valid JSON string.
The correct string should be:
string = '{"key1":"my\\"value","key2":"my\\"value2"}'
The problem is that the string is not valid JSON.
In '{"key1": "my"value", "key2": "my"value2"}' the value of key1 ends at "my", and the additional characters value" break the format.
You can use character escaping; valid JSON would look like:
{"key1": "my\"value", "key2": "my\"value2"}.
Since you are defining it as a Python string literal, you then need to escape the escape characters:
string = '{"key1": "my\\"value", "key2": "my\\"value2"}'
There is a lot of educational material online about character escaping. I recommend checking it out if something is unclear.
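To illustrate, the properly escaped string parses without any workaround:
import json

string = '{"key1": "my\\"value", "key2": "my\\"value2"}'
print(json.loads(string))   # {'key1': 'my"value', 'key2': 'my"value2'}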
Edit: If you insist on fixing the string in code (which I don't recommend), you can do something like this:
import re
import json

string = '{"key1":"my"value","key2":"my"value2"}'
# Find the contents of keys and values, assuming the real key/value ending double quotes
# are followed by one of the characters , } : ]
m = re.finditer(r'"([^:]+?)"(?:[,}:\]])', string)
new_string = string
for i in reversed(list(m)):
    ss, se = i.span(1)  # the first group holds the content
    # Escape the inner double quotes in the content and stitch the pieces back together.
    # Note that this is not efficient; larger numbers of replacements would require a
    # different approach to concatenation.
    new_string = new_string[:ss] + new_string[ss:se].replace('"', '\\"') + new_string[se:]
json.loads(new_string)
This assumes that the real ending double quotes are followed by one of , : } ]. In other cases this won't work.
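For reference, running the snippet above on the example input gives (under the same assumption about the trailing characters):
print(new_string)              # {"key1":"my\"value","key2":"my\"value2"}
print(json.loads(new_string))  # {'key1': 'my"value', 'key2': 'my"value2'}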

Character classes using byte regexes for characters encoded as multiple bytes

I would like to use regular expressions on byte strings in Python for which I know the encoding (UTF-8). I am running into difficulties with character classes that involve characters encoded as more than one byte: such characters appear to become two or more 'characters' that are matched separately within the character class.
Performing the search on (Unicode) strings instead is possible, but I would like to know if there is a solution for defining such character classes on byte strings as well. Maybe it's just not possible!?
Below is a python 3 example that shows what happens when I try to replace different line breaks with '\n':
import re

def show_pattern(pattern):
    print(f"\nPattern repr:\t{repr(pattern)}")

def test_sub(pattern, replacement, text):
    print(f"Before repr:\t{repr(text)}")
    result = re.sub(pattern, replacement, text)
    print(f"After repr:\t{repr(result)}")

# Pattern for line breaks
PATTERN = '[' + "\u000A\u000B\u000C\u000D\u0085\u2028\u2029" + ']'
REPLACEMENT = '\n'
TEXT = "How should I replace my unicode string\u2028using utf-8-encoded bytes?"
show_pattern(PATTERN)
test_sub(PATTERN, REPLACEMENT, TEXT)
# expected output:
# Pattern repr: '[\n\x0b\x0c\r\x85\u2028\u2029]'
# Before repr: 'How should I replace my unicode string\u2028using utf-8-encoded bytes?'
# After repr: 'How should I replace my unicode string\nusing utf-8-encoded bytes?'
ENCODED_PATTERN = PATTERN.encode('utf-8')
ENCODED_REPLACEMENT = REPLACEMENT.encode('utf-8')
ENCODED_TEXT = TEXT.encode('utf-8')
show_pattern(ENCODED_PATTERN)
test_sub(ENCODED_PATTERN, ENCODED_REPLACEMENT, ENCODED_TEXT)
# expected output:
# Pattern repr: b'[\n\x0b\x0c\r\xc2\x85\xe2\x80\xa8\xe2\x80\xa9]'
# Before repr: b'How should I replace my unicode string\xe2\x80\xa8using utf-8-encoded bytes?'
# After repr: b'How should I replace my unicode string\n\n\nusing utf-8-encoded bytes?'
In the encoded version, I end up with three '\n' characters instead of one. Similar things happen in a more complicated document where it's not obvious what the correct output should be.
You may use an alternation based pattern rather than a character class, as you will want to match sequences of bytes:
PATTERN = "|".join(['\u000A','\u000B','\u000C','\u000D','\u0085','\u2028','\u2029'])
If you prefer to initialize the pattern from a string use
CHARS = "\u000A\u000B\u000C\u000D\u0085\u2028\u2029"
PATTERN = "|".join(CHARS)

converting Unicode code point numbers to Unicode characters

I'm using the argparse library in Python 3 to read in Unicode strings from a command line parameter. Often those strings contain "ordinary" Unicode characters (extended Latin, etc.), but sometimes--particularly when the characters belong to a right-to-left script--it's easier to encode the strings as Unicode code points, like \u0644. But argparse treats these designators as a sequence of characters, and does not convert them into the character they designate. For instance, if a command line parameter is
... -a "abc\06d2d" ...
then what I get in the argparse variable is
"abc\06d2d"
rather than the expected
"abcےd"
(the character between the 'c' and 'd' is the yeh baree). Of course both outcomes are logical, it's just that the second one is the one I want.
I tried to reproduce this in an interpreter, but under most circumstances Python 3 automagically converts a string like "abc\u06d2d" into "abcےd". Not so when I read the string using argparse...
I came up with a function to do the conversion, see below. But I feel like I'm missing something much simpler. Is there an easier way to do this conversion? (Obviously I could use str.startswith(), or regexes to match the entire thing, rather than going character by character, but the code below is really just an illustration. It seems like I shouldn't have to create my own function to do this at all, especially since in some circumstances it seems to happen automatically.)
---------My code to do this follows---------
def ParseString2Unicode(sInString):
    """Return a version of sInString in which any Unicode code points of the form
         \uXXXX (X = hex digit)
       have been converted into their corresponding Unicode characters.
       Example:
         "\u0064b\u0065"
       becomes
         "dbe"
    """
    sOutString = ""
    while sInString:
        if len(sInString) >= 6 and \
           sInString[0] == "\\" and \
           sInString[1] == "u" and \
           sInString[2] in "0123456789ABCDEF" and \
           sInString[3] in "0123456789ABCDEF" and \
           sInString[4] in "0123456789ABCDEF" and \
           sInString[5] in "0123456789ABCDEF":
            # If we get here, the first 6 characters of sInString represent
            # a Unicode code point, like "\u0065"; convert it into a char:
            sOutString += chr(int(sInString[2:6], 16))
            sInString = sInString[6:]
        else:
            # Strip a single char:
            sOutString += sInString[0]
            sInString = sInString[1:]
    return sOutString
What you may want to look at is the raw_unicode_escape encoding.
>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1
So, the function would be:
def ParseString2Unicode(sInString):
    try:
        decoded = sInString.encode('utf-8')
        return decoded.decode('raw_unicode_escape')
    except UnicodeError:
        return sInString
This, however, also matches other unicode escape sequences, like \Uxxxxxxxx. If you just want to match \uxxxx, use a regex, like so:
import re

escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')

def _escape_sequence_to_char(match):
    return chr(int(match[0][2:], 16))

def ParseString2Unicode(sInString):
    return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)
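For example, applied to the string from the question:
print(ParseString2Unicode(r"abc\u06d2d"))   # abcےd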
A concise, flexible way of handling this would be to use a regular expression:
return re.sub(
    r"\\u([0-9A-Fa-f]{4})",
    lambda m: chr(int(m[1], 16)),
    sInString
)

Python3 : unescaping non ascii characters

(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I have seen methods elsewhere that don't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS: I edited my post thanks to Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = "€\\n"?
mystring = "€\n"   # that's 2 chars: "€" and a newline
mystring = "€\\n"  # that's 3 chars: "€", "\" and "n"
I don't really understand what goes wrong inside Python 3's encode() and decode(), but my friend solved this problem while we were writing some tools.
What we did is bypass the "utf_8" encoder after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that although the result of decode("unicode_escape") looks weird, the bytes object actually contains the correct bytes of your string (with UTF-8 encoding), in this case b'\xe2\x82\xac\n'.
So we do not print the str object directly, and we do not use encode("utf_8") on it either; instead we use ord() to build the bytes object b'\xe2\x82\xac\n'.
You can then get the correct str from that bytes object by passing it to str() with the "utf_8" encoding.
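Wrapped up as a function, the round trip described above could look like this (a small sketch of the same ord() trick, not part of the original tool):
def unescape_utf8(s):
    # decode("unicode_escape") resolves the backslash escapes but reads the raw
    # bytes as latin-1; for byte-valued escapes like these, every resulting code
    # point stays below 256, so ord() recovers the original utf-8 bytes.
    decoded = s.encode("utf_8").decode("unicode_escape")
    return bytes(ord(ch) for ch in decoded).decode("utf_8")

print(repr(unescape_utf8("€\\n")))   # '€\n'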
BTW, the tool my friend and I wanted to make is a wrapper that allows the user to input C-like string literals and converts the escape sequences automatically.
User input: \n\x61\x62\n\x20\x21   # 20 characters, which represent 6 characters semantically
Output:
            # \n  (a blank line)
ab          # \x61\x62\n
 !          # \x20\x21
That's a powerful way for the user to enter non-printable characters in the terminal.
Our final tool is:
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh:  # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
import string

printable = string.printable
printable = printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c) >= 32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.

How do I get the raw representation of a string in Python?

I am making a class that relies heavily on regular expressions.
Let's say my class looks like this:
class Example:
    def __init__(self, regex):
        self.regex = regex

    def __repr__(self):
        return 'Example({})'.format(repr(self.regex.pattern))
And let's say I use it like this:
import re
example = Example(re.compile(r'\d+'))
If I do repr(example), I get 'Example('\\\\d+')', but I want 'Example(r'\\d+')'. Note the extra backslashes; when printed, the pattern displays correctly. I suppose I could implement it to return "r'{}'".format(regex.pattern), but that doesn't sit well with me. In the unlikely event that the Python Software Foundation someday changes the syntax for raw string literals, my code won't reflect that. That's hypothetical, though. My main concern is whether or not this always works. I can't think of an edge case off the top of my head, though. Is there a more formal way of doing this?
EDIT: Nothing seems to appear in the Format Specification Mini-Language, the printf-style String Formatting guide, or the string module.
The problem with raw string representation is that you cannot represent everything in a portable manner (i.e. without using control characters). For example, if you had a line break in your string, you would have to literally break the string onto the next line, because it cannot be represented as a raw string.
That said, the actual way to get a raw string representation is what you already gave:
"r'{}'".format(regex.pattern)
The definition of raw strings is that no escape rules are applied except that they end at the quotation character they start with, and that a backslash prevents that quotation character from terminating the string. Thus, for example, you cannot store the equivalent of a string like "\" in raw string representation (r"\" yields a SyntaxError and r"\\" yields "\\\\").
If you really want to do this, you should use a wrapper like:
def rawstr(s):
    """
    Return the raw string representation (using r'' literals) of the string
    *s* if it is available. If any invalid characters are encountered (or a
    string which cannot be represented as a rawstr), the default repr() result
    is returned.
    """
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'
    return pattern.format(s)
Tests:
>>> test1 = "\\"
>>> test2 = "foobar \n"
>>> test3 = r"a \valid rawstring"
>>> test4 = "foo \\\\\\"
>>> test5 = r"foo \\"
>>> test6 = r"'"
>>> test7 = r'"'
>>> print(rawstr(test1))
'\\'
>>> print(rawstr(test2))
'foobar \n'
>>> print(rawstr(test3))
r'a \valid rawstring'
>>> print(rawstr(test4))
'foo \\\\\\'
>>> print(rawstr(test5))
r'foo \\'
>>> print(rawstr(test6))
r"'"
>>> print(rawstr(test7))
r'"'
