I have a dataset containing some poorly parsed text that includes a lot of Unicode characters (like 'a', '{', 'Ⅷ', '♞', ...) that were not properly converted to Unicode.
All of the backslashes are escaped, so every unicode escape sequence was interpreted as a \ next to a u instead of a single character, \u.
More specifically, I have strings that look like this:
>>> '\\u00e9'
'\\u00e9'
And I want them to look like this:
>>> '\u00e9'
'é'
How can I convert the first string to the second?
Here is one way to accomplish this without importing another module.
input_string = '\\u00e9'
print(input_string.encode('latin-1').decode('unicode-escape'))
# output
é
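One caveat with the latin-1 round trip above: encode('latin-1') raises UnicodeEncodeError if the text already contains real characters above U+00FF (such as '♞'). Here is a minimal sketch of an alternative, assuming only four-digit \uXXXX escapes need handling. The name unescape_u is made up for illustration, and escaped surrogate pairs are not recombined.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_u(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# The knight is already a real character and is left untouched.
print(unescape_u('caf\\u00e9 \u265e'))
```

Characters that are not part of an escape pass through unchanged, so mixed input is safe with this approach.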
First you need to identify the string as hex
classmethod fromhex(string)
This bytes class method returns a bytes object, decoding the given string object. The string must contain two hexadecimal digits per byte, with ASCII whitespace being ignored.
https://docs.python.org/3/library/stdtypes.html#bytes.fromhex
Next we need to convert the hex to Unicode
bytes.decode(encoding="utf-8", errors="strict")
https://docs.python.org/3/library/stdtypes.html#bytes.decode
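So it would look something like this
char = '\\u00e9'
# Drop the leading backslash and 'u' so only the hex digits '00e9' remain,
# then read the two resulting bytes as one big-endian UTF-16 code unit.
print(bytes.fromhex(char[2:]).decode('utf-16-be'))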
Related
There are characters like '' that are not visible, so I can't copy and paste them. I want to convert any character to its code point, like '\u200D'.
Another example: 'abc' => '\u0061\u0062\u0063'
Allow me to rephrase your question. The title, "convert a string to its codepoint in python", clearly did not get through to everyone, mostly, I think, because we can't imagine what you want it for.
What you want is a string containing a representation of Unicode escapes.
You can do that this way:
print(''.join("\\u{:04x}".format(b) for b in b'abc'))
\u0061\u0062\u0063
If you display that printed value as a string literal you will see doubled backslashes, because backslashes have to be escaped in a Python string. So it will look like this:
'\\u0061\\u0062\\u0063'
The reason for that is that if you simply put unescaped backslashes in your string literal, like this:
a = "\u0061\u0062\u0063"
when you display a at the prompt you will get:
>>> a
'abc'
'\u0061\u0062\u0063'.encode('utf-8') encodes the text to UTF-8 bytes (b'abc'); it does not show the code points.
Edit:
Since Python interprets the escapes in a string literal automatically, you can't see the code point values directly, but you can write a function that generates them.
def get_string_unicode(string_to_convert):
    res = ''
    for letter in string_to_convert:
        # hex() gives e.g. '0x61'; strip the '0x' and pad to four digits
        res += '\\u' + (hex(ord(letter))[2:]).zfill(4)
    return res
Result:
>>> get_string_unicode('abc')
'\\u0061\\u0062\\u0063'
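A four-digit \u escape can only name code points up to U+FFFF. As a hedged variant of the function above (the name to_unicode_escapes is hypothetical), this sketch switches to the eight-digit \U form for anything larger, which is the form Python string literals accept for those characters:

```python
def to_unicode_escapes(text):
    """Return text with every character written as a \\uXXXX or \\UXXXXXXXX escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            parts.append('\\u{:04x}'.format(cp))
        else:
            # Characters outside the Basic Multilingual Plane need eight digits.
            parts.append('\\U{:08x}'.format(cp))
    return ''.join(parts)

print(to_unicode_escapes('abc'))
print(to_unicode_escapes('\u200d'))  # invisible zero-width joiner
```

Writing the invisible character as '\u200d' in the source avoids having to copy and paste it at all.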
I'm a Python newbie and I'm trying to make one script that writes some strings in a file if there's a difference. Problem is that original string has some characters in \uNNNN Unicode format and I cannot convert the new string to the same Unicode format.
The original string I'm trying to compare: \u00A1 ATENCI\u00D3N! \u25C4
New string is received as: ¡ ATENCIÓN! ◄
And this is the code
str = u'¡ ATENCIÓN! ◄'
print(str)
str1 = str.encode('unicode_escape')
print (str1)
str2 = str1.decode()
print (str2)
And the result is:
¡ ATENCIÓN! ◄
b'\\xa1 ATENCI\\xd3N! \\u25c4'
\xa1 ATENCI\xd3N! \u25c4
So, how can I get \xa1 ATENCI\xd3N! \u25c4 converted to \u00A1 ATENCI\u00D3N! \u25C4 as this is the only Unicode format I can save?
Note: Cases of characters in strings also need to be the same for comparison.
The issue is, according to the docs (read down a little bit, between the escape sequences tables), the \u, \U, and \N Unicode escape sequences are only recognized in string literals. That means that once the literal is evaluated in memory, such as in a variable assignment:
s = "\u00A1 ATENCI\u00D3N! \u25C4"
encoding it with str.encode('unicode_escape') produces a bytes object that uses \x escapes where it can (for code points below U+0100) and \u escapes otherwise:
b'\\xa1 ATENCI\\xd3N! \\u25c4'
Using
b'\\xa1 ATENCI\\xd3N! \\u25c4'.decode("unicode_escape")
will convert it back to '¡ ATENCIÓN! ◄'. This uses the actual (intended) representation of the characters, and not the \uXXXX escape sequences of the original string s.
So, what you should do is not mess around with encoding and decoding things. Observe:
print("\u00A1 ATENCI\u00D3N! \u25C4" == '¡ ATENCIÓN! ◄')
True
That's all the comparison you need to do.
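If the file really must hold the \uNNNN form with uppercase hex digits, the new string can be formatted by hand instead of going through the unicode_escape codec. This is only a sketch: escape_non_ascii is a made-up name, it assumes every character fits in four hex digits (U+FFFF or below), and it does not escape backslashes already present in the text.

```python
def escape_non_ascii(text):
    """Write non-ASCII characters as \\uXXXX with uppercase hex digits."""
    return ''.join(
        ch if ord(ch) < 128 else '\\u{:04X}'.format(ord(ch))
        for ch in text
    )

new_string = '¡ ATENCIÓN! ◄'
print(escape_non_ascii(new_string))
```

The result can then be compared character for character with the escaped text read from the file.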
For further reading, you may be interested in:
How to work with surrogate pairs in Python?
Encodings and Unicode from the Python docs.
Hi, I am having trouble trying to print a literal string in a proper format.
For starters, I have an object with a string parameter which is used for metadata, such that it looks like:
obj {
metadata: <str>
}
The object is being returned as a protocol response and we have the object to use as such.
print obj gives:
metadata: "\n)\n\022foobar"
When I print the obj.metadata python treats the value as a string and converts the escapes to linebreaks and the corresponding ascii values as expected.
When I tried
print repr(obj.metadata)
"\n)\n\x12foobar"
Unfortunately python prints the literal but converts the escaped characters from octal to hex. Is there a way i can print the python literal with the escaped characters in octal or convert the string such that I can have the values printed as it is in the object?
Thanks for the help
The extremely bad solution I have so far is
print str(obj).rstrip('\"').lstrip('metadata: \"')
to get the correct answer, but I am assuming there must be a smarter way
TLDR:
x = "\n)\n\022foobar"
print x
)
foobar
print repr(x)
'\n)\n\x12foobar'
How do I get x to print the way it was assigned?
Please try this:
print('\\n)\\n\\022foobar')
or
print(r'\n)\n\022foobar')
The escape character '\' changes how the character following it is interpreted; for example, \n stands for a new line.
Doubling the backslash ('\\') or prefixing the literal with the letter r (a raw string) turns that interpretation off. The doubled backslash works the same way in C.
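Both forms above only help when the literal can be retyped. When the string arrives in a variable, one option is to build the escaped text by hand. The following is a rough sketch, not a library function: octal_escaped is a hypothetical helper, and it assumes every character has a code point below 256 so that three octal digits are enough.

```python
def octal_escaped(text):
    """Return text with newlines as \\n and other non-printables as \\ooo."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if ch == '\n':
            parts.append('\\n')
        elif ch == '\\':
            parts.append('\\\\')          # keep backslashes unambiguous
        elif 32 <= cp < 127:
            parts.append(ch)              # printable ASCII stays as it is
        else:
            parts.append('\\{:03o}'.format(cp))
    return ''.join(parts)

x = "\n)\n\022foobar"
print(octal_escaped(x))
```

Other short escapes such as \t could be added as extra branches in the same way.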
I have a string which contains unicode characters which looks like this,
"u'type' does not belong to [u'item1', u'item2']"
How would I unescape the unicode parts?
So that it prints out,
"'type' does not belong to ['item1', 'item2']"
You could use the replace method for the string
stringToModify = "u'type' does not belong to [u'item1', u'item2']"
# replace() returns a new string; the original is left unchanged
print(stringToModify.replace("u'", "'"))
What you show here is actually a byte string displaying the representation of three Unicode strings (all of them containing only ASCII characters FWIW).
If you cannot change the code producing this string, a more robust option is re.sub().
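A minimal sketch of the re.sub() approach, assuming the prefixes always look like u' with a single quote (strip_u_prefixes is a name chosen here for illustration):

```python
import re

def strip_u_prefixes(text):
    # \b requires the 'u' to start a word, so a word that merely
    # ends in u before a quote (e.g. menu') is left untouched.
    return re.sub(r"\bu'", "'", text)

message = "u'type' does not belong to [u'item1', u'item2']"
print(strip_u_prefixes(message))
```

Unlike a plain replace("u'", "'"), the word boundary keeps ordinary words ending in "u" intact. Prefixes in front of double quotes would need a second pattern.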
I am trying to parse elements in this string with Python 2.7.
r='\x01\x99h\x1bu=https://cpr.sm/eIOxaAZ-he'
'\x01' , '\x99', and 'h' are all separate elements r[0],r[1],r[2].
But I am trying to extract variable length data here, specifically, the concatenation of '\x99' and 'h' in positions r[1] and r[2]. That concatenation will then be decoded via LEB 128 format. But the portion I'm looking for, in this case '\x99h', can be of variable length. Sometimes it will be one byte, so just r[1], sometimes more, like r[1]+r[2]+r[3]. The only way to know is when the next X escape '\x' occurs.
But I can't for the life of me figure out how to parse this data for the '\x' escapes into a more manageable format.
TL;DR: how do I replace '\x' escapes in my string, or at least identify where they occur? Also, str.replace('\x','') doesn't work; I get "invalid \x escape".
Before I answer this, you need to understand something.
In Python 2.7, every character in a str is a byte. Every byte can be represented as a \x-escaped literal. (Recall: 8 bits in a byte, 2**8 == 256 possible values; hence the range \x00 to \xFF.) When those literals happen to fall within the ASCII-printable range and you print out the string, Python will print the associated ASCII character instead of the \x-escaped version.
But make no mistake: they are 100% equivalent.
In [7]: '\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64'
Out[7]: 'hello world'
So, let's assume there's some meaningful boundary that you can give me. (there has to be one, since a variable-length encoding like LEB128 needs some method to say "hey, the data stops here") Perhaps \x1b, which is the ASCII escape character. Were you looking for that escape character?
If so, extracting it is quite easy:
r='\x01\x99h\x1bu=https://cpr.sm/eIOxaAZ-he'
r[1:r.index('\x1b')]
Out[15]: '\x99h'
And then you can run that through whatever LEB128 decoding algorithm you'd like. The one on the wiki seems serviceable, and gives me:
leb128_decode(r[1:r.index('\x1b')])
Out[16]: (13337, 2) # 13337 is the value encoded by these two bytes
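The decoder itself is not shown above, so here is a minimal sketch of an unsigned LEB128 decoder, written for Python 3 bytes and not taken from any particular library. It reuses the name leb128_decode only to match the call above. Each byte carries 7 payload bits, and a clear high bit marks the last byte, so the decoder can find the end of the value on its own without searching for a delimiter such as \x1b.

```python
def leb128_decode(data):
    """Decode one unsigned LEB128 integer from the start of data.

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    shift = 0
    for count, byte in enumerate(bytearray(data), start=1):
        value |= (byte & 0x7F) << shift    # low 7 bits carry the payload
        if (byte & 0x80) == 0:             # high bit clear: this is the last byte
            return value, count
        shift += 7
    raise ValueError("truncated LEB128 value")

r = b'\x01\x99h\x1bu=https://cpr.sm/eIOxaAZ-he'
print(leb128_decode(r[1:]))  # decoding starts at r[1]; trailing bytes are ignored
```

Here \x99 contributes 0x19 (25) and 'h' (0x68) contributes 104 << 7, which sums to 13337 over 2 bytes, matching the result quoted above.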
You have two options: either use raw strings (preferable), where the backslash is not treated as a special character, or escape the \ in the original string so that sequences such as \x are not interpreted as escapes.
>>> str = r'hello\nhello\t\nhello\r'
>>> str.replace(r'\n', 'x')
'helloxhello\\txhello\\r'
or
>>> str = r'hello\nhello\t\nhello\r'
>>> str.replace('\\n', 'x')
'helloxhello\\txhello\\r'
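For the "identify where they occur" part of the question, keep in mind that the \x escapes only exist in the source literal; in memory there are just bytes. What can be located are the bytes that Python would display as \x escapes. A sketch under that assumption, using Python 3 bytes and a made-up helper name:

```python
def non_printable_positions(data):
    """Return the indexes of bytes outside the printable ASCII range."""
    return [i for i, b in enumerate(bytearray(data)) if b < 0x20 or b > 0x7E]

r = b'\x01\x99h\x1bu=https://cpr.sm/eIOxaAZ-he'
print(non_printable_positions(r))
```

For the sample data this reports positions 0, 1 and 3, which are the \x01, \x99 and \x1b bytes.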