Replace Unicode code point with actual character using regex - python

I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.
I've tried to accomplish this with
re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)
but obviously this won't work because we cannot insert the group into this unicode escape.
What's the best/easiest way to do this? I found many questions trying to do the exact opposite but nothing useful to 're-encode' these code points as actual characters...

When you have the number of a character, you can call chr(number) to get the character for that number.
Because we have a string, we need to parse it as an int with base 16.
Both of those together:
>>> chr(int("0001F44D", 16))
'👍'
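However, the replacement is now the result of a function call, not a fixed string we can simply substitute. A quick search shows that you can pass a function as the replacement argument to re.sub; it is called with each match object.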
Now we get:
re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
PS: Don't name your string just str; you'll shadow the built-in str type.
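Putting the pieces above together, a minimal self-contained sketch might look like the following. The names CODE_POINT, restore_code_points, and sample are made up for illustration, and accepting lowercase hex digits is an assumption that goes beyond the pattern in the question.

```python
import re

# Matches placeholders such as <U+0001F44D>; group 1 captures the hex digits.
# Lowercase digits are accepted as well (an assumption, not in the question).
CODE_POINT = re.compile(r'<U\+([0-9A-Fa-f]+)>')

def restore_code_points(text):
    """Replace every <U+XXXX> placeholder with the character it names."""
    # re.sub calls the lambda once per match and inserts its return value.
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = "Nice work <U+0001F44D> see you at the caf<U+00E9>"
restored = restore_code_points(sample)
print(restored)
```

Note that chr raises ValueError for numbers above 0x10FFFF, so a malformed placeholder with too many digits would need separate handling.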

Related

How to modify unicode code as it is a string

I have a list of partial Unicode codes for cuneiform characters.
For example, I have 12220, which Python couldn't render as 𒈠, which is what I wanted. Then I realized that adding \U000 in front of these partial codes produces the results I want. The problem is that I can't modify the Unicode escape.
"\U000{}".format(12220) doesn't work. Clearly, adding a string to a Unicode escape is not possible. I don't want to merge 375 characters by hand. Can anyone help me with this?
Use this:
print(chr(int("12220", 16)))
The chr function returns the character for an int, and the second parameter of int is the base in which the string is interpreted.
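For a whole list of partial codes, the same call can be applied in a list comprehension. This is only a sketch: partial_codes is a hypothetical three-item sample standing in for the 375 codes mentioned in the question.

```python
# Hypothetical sample of partial codes; the real list has 375 entries.
partial_codes = ["12220", "12000", "1202D"]

# int(code, 16) parses the hexadecimal text; chr() maps the number to a character.
signs = [chr(int(code, 16)) for code in partial_codes]
print(signs)
```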

How do I urlencode all the characters in a string, including safe characters?

Using Python, how would I encode all the characters of a string to a URL-encoded string?
As of now, just about every answer eventually references the same methods, such as urllib.parse.quote() or urllib.parse.urlencode(). While these answers are technically valid (they follow the required specifications for encoding special characters in URLs), I have not managed to find a single answer that describes how to encode other/non-special characters as well (such as lowercase or uppercase letters).
How do I take a string and encode every character into a URL-encoded string?
This gist reveals a very nice answer to this problem. The final function code is as follows:
def encode_all(string):
    return "".join("%{0:0>2}".format(format(ord(char), "x")) for char in string)
Let's break this down.
The first thing to notice is that the return value is a generator expression (... for char in string) wrapped in a str.join call ("".join(...)). This means we will be performing an operation for each character in the string, then finally joining each outputted string together (with the empty string, "").
The operation performed on each character in the string is "%{0:0>2}".format(format(ord(char), "x")). This can be broken down into the following:
ord(char): Convert each character to the corresponding number.
format(..., "x"): Convert the number to a hexadecimal value.
"%{0:0>2}".format(...): Format the hexadecimal value into a string (with a prefixed "%").
When you look at the whole function from an overview, it is converting each character to a number, converting that number to hexadecimal, then jamming all the hexadecimal values into a string (which is then returned).
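One caveat: because the function above formats ord(char) directly, it only yields valid percent-encoding for code points up to 0xFF. A possible variant, sketched here under the assumption that the receiving side expects UTF-8, encodes the string to bytes first and formats each byte. The name encode_all_utf8 is made up for this example.

```python
from urllib.parse import unquote

def encode_all_utf8(text):
    """Percent-encode every byte of the UTF-8 encoding of text."""
    # Each byte is 0..255, so two hex digits always suffice.
    return "".join("%{:02X}".format(byte) for byte in text.encode("utf-8"))

encoded = encode_all_utf8("a\u00e9")
print(encoded)           # %61%C3%A9
print(unquote(encoded))  # decodes back to the original two characters
```

Round-tripping through urllib.parse.unquote is a quick way to check that the output is still well-formed percent-encoding.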

Parsing variable length data with X escapes in Python 2

I am trying to parse elements in this string with Python 2.7.
r='\x01\x99h\x1bu=https://cpr.sm/eIOxaAZ-he'
'\x01' , '\x99', and 'h' are all separate elements r[0],r[1],r[2].
But I am trying to extract variable length data here, specifically, the concatenation of '\x99' and 'h' in positions r[1] and r[2]. That concatenation will then be decoded via LEB 128 format. But the portion I'm looking for, in this case '\x99h', can be of variable length. Sometimes it will be one byte, so just r[1], sometimes more, like r[1]+r[2]+r[3]. The only way to know is when the next X escape '\x' occurs.
But I can't for the life of me figure out how to parse this data for the '\x' escapes into a more manageable format.
TL;DR: how do I replace '\x' escapes in my string, or at least identify where they occur? Also, str.replace('\x','') doesn't work; I get "invalid \x escape".
Before I answer this, you need to understand something.
Every character in a Python 2 str is a byte. Every byte can be represented as a \x-escaped literal (recall: 8 bits in a byte, 2**8 == 256 possible values; hence the range \x00 to \xFF). When those literals happen to fall within the ASCII-printable range and you print out the string, Python prints the associated ASCII character instead of the \x-escaped version.
But make no mistake: they are 100% equivalent.
In [7]: '\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64'
Out[7]: 'hello world'
So, let's assume there's some meaningful boundary that you can give me. (there has to be one, since a variable-length encoding like LEB128 needs some method to say "hey, the data stops here") Perhaps \x1b, which is the ASCII escape character. Were you looking for that escape character?
If so, extracting it is quite easy:
r='\x01\x99h\x1bu=https://cpr.sm/eIOxaAZ-he'
r[1:r.index('\x1b')]
Out[15]: '\x99h'
And then you can run that through whatever LEB128 decoding algorithm you'd like. The one on the wiki seems serviceable, and gives me:
leb128_decode(r[1:r.index('\x1b')])
Out[16]: (13337, 2) # 13337 is the value encoded by these two bytes
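The leb128_decode function itself is not shown in the answer. A minimal sketch of an unsigned LEB128 decoder, written for Python 3 bytes rather than Python 2 strings (an assumption), could look like this. Since unsigned LEB128 sets the high bit on every byte except the last, the decoder can also find the end of the value on its own, without slicing at \x1b first.

```python
def leb128_decode(data):
    """Decode one unsigned LEB128 integer from the start of data.

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)  # low 7 bits carry the payload
        if byte & 0x80 == 0:                   # high bit clear: last byte
            return value, index + 1
    raise ValueError("truncated LEB128 value")

r = b'\x01\x99h\x1bu=https://cpr.sm/eIOxaAZ-he'
print(leb128_decode(r[1:]))                    # (13337, 2)
print(leb128_decode(r[1:r.index(b'\x1b')]))    # (13337, 2)
```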
You have two options: either use raw strings (preferable), in which the backslash is not treated as the start of an escape sequence, or escape the \ in the original string as \\ so that \x is not treated as a special sequence.
>>> str = r'hello\nhello\t\nhello\r'
>>> str.replace(r'\n', 'x')
'helloxhello\\txhello\\r'
or
>>> str = r'hello\nhello\t\nhello\r'
>>> str.replace('\\n', 'x')
'helloxhello\\txhello\\r'

How to escape special char

I have the following code to handle Chinese characters and other special characters in a PowerPoint file, because I would like to use the content of the ppt as the file name when saving.
If the content contains a special character, an exception is thrown, so I use the following code to handle it.
It works fine under Python 2.7, but when I run it with Python 3.0 it gives me the following error:
if not (char in '<>:"/\|?*'):
TypeError: 'in <string>' requires string as left operand, not int
I Googled the error message but I don't understand how to resolve it. I know the code if not (char in '<>:"/\|?*'): is to convert the character to ASCII code number, right?
Is there any example to fix my problem in Python 3?
def rm_invalid_char(self,str):
    final=""
    dosnames=['CON', 'PRN', 'AUX', 'NUL', 'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9', 'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9']
    for char in str:
        if not (char in '<>:"/\|?*'):
            if ord(char)>31:
                final+=char
    if final in dosnames:
        #oh dear...
        raise SystemError('final string is a DOS name!')
    elif final.replace('.', '')=='':
        print ('final string is all periods!')
        pass
    return final
Simple: use this
re.escape(YourStringHere)
From the docs:
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
You are passing an iterable whose first element is an integer (232) to rm_invalid_char(). The problem does not lie with this function, but with the caller.
Some debugging is in order: right at the beginning of rm_invalid_char(), you should do print(repr(str)): you will not see a string, contrary to what is expected by rm_invalid_char(). You must fix this until you see the string that you were expecting, by adjusting the code before rm_invalid_char() is called.
The problem is likely due to how Python 2 and Python 3 handle strings (in Python 2, str objects are strings of bytes, while in Python 3, they are strings of characters).
I'm curious why there is something in "str" that is acting like an integer - something strange is going on with the input.
However, I suspect if you:
Change the name of your str value to something else, e.g. char_string
Right after for char in char_string coerce whatever your input is to a string
then the problem you describe will be solved.
You might also consider adding a random bit to the end of your generated file name so you don't have to worry about colliding with the DOS reserved names.
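To illustrate the bytes-versus-str point made above: in Python 3, iterating over a bytes object yields integers, which is what triggers the TypeError. A rough sketch of a version that accepts either type follows. The encoding parameter and the INVALID_CHARS name are assumptions for this example, and the DOS-name checks from the original are left out for brevity.

```python
# Characters that Windows does not allow in file names.
INVALID_CHARS = set('<>:"/\\|?*')

def rm_invalid_char(name, encoding="utf-8"):
    """Drop characters that are not allowed in Windows file names."""
    if isinstance(name, bytes):
        # Iterating over bytes yields ints in Python 3, so decode first.
        # The encoding is an assumption; use whatever the caller really has.
        name = name.decode(encoding)
    # Keep printable characters (code point above 31) that are not forbidden.
    return "".join(ch for ch in name if ch not in INVALID_CHARS and ord(ch) > 31)

print(rm_invalid_char("a<b>:c?.ppt"))
print(rm_invalid_char("标题/一".encode("utf-8")))
```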

how to remove '\xe2' from a list

I am new to Python and am using it with nltk in my project. After word-tokenizing the raw data obtained from a webpage, I got a list containing '\xe2', '\xe3', '\x98', etc. However, I do not need these and want to delete them.
I simply tried
if '\x' in a
and
if a.startswith('\xe')
and it gives me an error saying invalid \x escape
But when I try a regular expression
re.search('^\\x',a)
I get
Traceback (most recent call last):
File "<pyshell#83>", line 1, in <module>
print re.search('^\\x',a)
File "C:\Python26\lib\re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "C:\Python26\lib\re.py", line 245, in _compile
raise error, v # invalid expression
error: bogus escape: '\\x'
Even re.search('^\\x',a) is not identifying it.
I am confused by this, and even googling didn't help (I might be missing something). Please suggest a simple way to remove such strings from the list, and explain what was wrong with the above.
Thanks in advance!
You can use unicode(a, 'ascii', 'ignore') to remove all non-ascii characters in the string at once.
It helps here to understand the difference between a string literal and a string.
A string literal is a sequence of characters in your source code. When parsed and compiled by the Python interpreter, it produces a string, which is a sequence of characters in memory.
For example, the string literal "a" produces the string a.
String literals can take a number of forms. All of these produce the same string a:
"a"
'a'
r"a"
"""a"""
r'''a'''
Source code is traditionally ASCII-only, but we'd like it to contain string literals that can produce characters beyond ASCII. To do this, escapes can be used. For example, the string literal "\xe2" produces a single-character string whose character has the integer value E2 hexadecimal, or 226 decimal.
This explains the error about "\x" being an invalid escape: the parser is expecting you to specify the hexadecimal value of a character.
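The difference between the literal and the resulting string can be checked directly. This short sketch uses Python 3 syntax (an assumption, since the question uses Python 2.6); the variable names are illustrative only.

```python
escaped = "\xe2"   # escape sequence: the parser turns it into one character
raw = r"\xe2"      # raw literal: backslash, x, e, 2 are kept as four characters

print(len(escaped), ord(escaped))  # 1 226
print(len(raw))                    # 4
```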
To detect if a string has any characters in a certain range, you can use a regex with a character class specifying the lower and upper bounds of the characters you don't want:
if re.search(r"[\x90-\xff]", a):
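Applied to a whole list, that check can be used as a filter. The sketch below uses Python 3 and a made-up tokens list; the character class is the one from the answer, so anything between \x80 and \x8f would still pass through.

```python
import re

# Same character class as in the check above.
UNWANTED = re.compile(r"[\x90-\xff]")

# Hypothetical tokenizer output containing stray high characters.
tokens = ["hello", "\xe2", "world", "\xe3", "\x98", "caf\xe9"]

# Keep only the tokens that contain none of the unwanted characters.
kept = [t for t in tokens if not UNWANTED.search(t)]
print(kept)  # ['hello', 'world']
```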
'\xe2' is one character: \x is an escape sequence that is followed by a hex number and is used to specify a byte literally.
That means you have to specify the whole expression:
>>> s = '\xe2hello'
>>> s
'\xe2hello'
>>> s.replace('\xe2', '')
'hello'
More information can be found in the Python docs.
I see other answers have done a good job in explaining your confusion with respect to '\x', but while suggesting that you may not want to completely remove non-ASCII characters, have not provided a specific way to do other normalization beyond such removing.
If you want to obtain some "reasonably close ASCII character" (e.g., strip accents from letters but leave the underlying letter, etc.), this SO answer may help. The code in the accepted answer, using only the standard Python library, is:
import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
Of course, you'll need to apply this function to each string item in the list you mention in the title, e.g.
cleanedlist = [strip_accents(s) for s in mylist]
if all items in mylist are strings.
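As a quick worked example, here is the function applied to a small made-up list in Python 3, where str is already Unicode (in Python 2 the items would need to be unicode objects). The contents of mylist are hypothetical.

```python
import unicodedata

def strip_accents(s):
    # NFD splits an accented letter into base letter + combining mark;
    # category 'Mn' (nonspacing mark) identifies the marks to drop.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

mylist = ["\xe2ge", "caf\xe9", "plain"]
cleanedlist = [strip_accents(s) for s in mylist]
print(cleanedlist)  # ['age', 'cafe', 'plain']
```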
Let's stand back and think about this a little bit ...
You're using nltk (natural language toolkit) to parse (presumably) natural language.
Your '\xe2' is highly likely to represent U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â).
Your '\xe3' is highly likely to represent U+00E3 LATIN SMALL LETTER A WITH TILDE (ã).
They look like natural language letters to me. Are you SURE that you don't need them?
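One way to check what such a byte stands for is to decode it and ask unicodedata for its name. This sketch assumes the bytes are Latin-1, which matches the reading above but may not match the actual page encoding.

```python
import unicodedata

names = {}
for raw in (b"\xe2", b"\xe3"):
    ch = raw.decode("latin-1")  # assumption: the data is Latin-1 encoded
    names[raw] = unicodedata.name(ch)
    print(repr(raw), ch, names[raw])
```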
If you only want to enter this pattern and avoid the error,
you can try inserting a + between \ and x, like this:
re.search('\+x[0123456789abcdef]*',a)
