convert a string to its codepoint in python - python

There are characters like '‌' that are not visible, so I can't copy and paste them. I want to convert any character to its code point, like '\u200D'.
another example is: 'abc' => '\u0061\u0062\u0063'

Allow me to rephrase your question. The title "convert a string to its codepoint in python" clearly did not get through to everyone, mostly, I think, because we can't imagine what you want it for.
What you want is a string containing a representation of Unicode escapes.
You can do that this way:
print(''.join("\\u{:04x}".format(ord(c)) for c in 'abc'))
\u0061\u0062\u0063
If you display that printed value as a string literal you will see doubled backslashes, because backslashes have to be escaped in a Python string. So it will look like this:
'\\u0061\\u0062\\u0063'
The reason for that is that if you simply put unescaped backslashes in your string literal, like this:
a = "\u0061\u0062\u0063"
when you display a at the prompt you will get:
>>> a
'abc'

'\u0061\u0062\u0063'.encode('utf-8') will encode the text to UTF-8 bytes (b'abc'), not to its escape sequences.
Edit:
Python interprets the escapes in a string literal as soon as it reads it, so you can't see them in the value, but you can write a function that generates them.
def get_string_unicode(string_to_convert):
    res = ''
    for letter in string_to_convert:
        # hex() returns e.g. '0x61'; drop the '0x' and pad to four digits
        res += '\\u' + (hex(ord(letter))[2:]).zfill(4)
    return res
Result:
>>> get_string_unicode('abc')
'\\u0061\\u0062\\u0063'
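Both answers above pad to four hex digits, which only fits characters in the Basic Multilingual Plane. A minimal sketch of a variant that also covers higher code points by switching to Python's eight-digit \U form is shown here; the function name to_unicode_escapes is made up for illustration.

```python
def to_unicode_escapes(text):
    """Write every character of text as a \\uXXXX or \\UXXXXXXXX escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            parts.append("\\u{:04x}".format(cp))
        else:
            # Code points above U+FFFF need the eight-digit form.
            parts.append("\\U{:08x}".format(cp))
    return "".join(parts)


print(to_unicode_escapes("abc"))         # \u0061\u0062\u0063
print(to_unicode_escapes("\u200d"))      # \u200d
print(to_unicode_escapes("\U0001F600"))  # \U0001f600
```

Because the input is iterated as a str, invisible characters such as the zero-width joiner are handled the same way as any other character.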

Related

Converting Python unicode code point string to its actual unicode character

I have a dataset containing some poorly parsed text that includes a lot of Unicode characters (like 'a', '{', 'Ⅷ', '♞', ...) that have been improperly converted to Unicode.
All of the backslashes are escaped, so every unicode escape sequence was interpreted as a \ next to a u instead of a single character, \u.
More specifically, I have strings that look like this:
>>> '\\u00e9'
'\\u00e9'
And I want them to look like this:
>>> '\u00e9'
'é'
How can I convert the first string to the second?
Here is one way to accomplish this without importing another module.
input_string = '\\u00e9'
print(input_string.encode('latin-1').decode('unicode-escape'))
# output
é
First, you need to parse the hex digits that follow the \u prefix:
classmethod fromhex(string)
This bytes class method returns a bytes object, decoding the given string object. The string must contain two hexadecimal digits per byte, with ASCII whitespace being ignored.
https://docs.python.org/3/library/stdtypes.html#bytes.fromhex
Next, we need to decode those bytes to a string:
bytes.decode(encoding="utf-8", errors="strict")
https://docs.python.org/3/library/stdtypes.html#bytes.decode
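So it would look something like this (the slice drops the leading \u, and the two remaining bytes are read as one big-endian UTF-16 code unit):
char = '\\u00e9'
print(bytes.fromhex(char[2:]).decode('utf-16-be'))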
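The encode('latin-1') approach fails if the string also contains real characters outside Latin-1 (such as '♞'), and the fromhex approach handles only a single escape. As a rough sketch, assuming the text only contains four-digit \uXXXX escapes, a regular expression can replace each escape in place and leave everything else untouched. The names _ESCAPE and unescape_u are invented for this example.

```python
import re

# Matches a literal backslash, a 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_u("\\u00e9"))                # é
print(unescape_u("caf\\u00e9 \u265e"))      # café ♞
```

This sketch does not combine surrogate pairs, so characters above U+FFFF that were written as two \uXXXX escapes would need extra handling.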

How can I make a Python string to include unicode code points?

I want to have an ASCII representation of a string that could contain non-ascii characters such as German umlauts. The way the non-ascii characters should be encoded is as unicode code points, e.g. ß would be \u00df.
The problem is that I have those escape sequences in my database. It gets displayed like I want it, but when the user searches for something, he enters ß and not \u00df. For ß, it works for me to simply make search_query.replace('ß', r'\u00df'), but there are (many) more possible escape sequences.
What I tried
>>> name = 'Ein Spaß'
>>> name.encode('ascii', 'backslashreplace')
b'Ein Spa\\xdf'
>>> name.encode('ascii', 'xmlcharrefreplace')
b'Ein Spa&#223;'
What I want to get:
'Ein Spa\\u00df'
As a dumb workaround, stdlib json encoding will use the 4 digit unicode escapes:
>>> import ast, json
>>> name = 'Ein Spaß'
>>> json.dumps(name)
'"Ein Spa\\u00df"'
>>> ast.literal_eval(json.dumps(name)) == name
True
However, this will not really solve your search problem robustly. You'll need to normalize the query text before searching. And you'll want to normalize unicode data on the way into the database, too - or use a db + ORM which handles such details for you.
See this answer for details about a better tool for the job here: unicodedata.normalize.
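To illustrate what normalization buys you, here is a small sketch using unicodedata.normalize from the standard library. The choice of the NFC form and the helper name normalize_text are assumptions made for this example; the important part is to apply the same form to stored data and to the query.

```python
import unicodedata


def normalize_text(text):
    """Bring text into one canonical form (NFC is assumed here)."""
    return unicodedata.normalize("NFC", text)


composed = "M\u00fcller"     # 'ü' stored as a single code point
decomposed = "Mu\u0308ller"  # 'u' followed by a combining diaeresis

print(composed == decomposed)                                  # False
print(normalize_text(composed) == normalize_text(decomposed))  # True
```

Both strings look identical on screen, which is why a search can miss matches unless both sides are normalized first.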
Encode each character as ASCII if possible; otherwise replace it with its code point written as a \uXXXX escape.
ord() returns a character's code point as an integer.
new = []
for e in name:
    try:
        new.append(e.encode("ascii").decode())
    except UnicodeEncodeError:
        # not representable in ASCII: emit the escaped code point instead
        new.append("\\u%04x" % ord(e))
"".join(new)
If the data in your database is stored as escaped unicode, you can use codecs.decode with encoding set to unicode_escape:
>>> import codecs
>>> name = "Ein Spa\\u00df"
>>> codecs.decode(name, "unicode_escape")
'Ein Spaß'

Printing a literal string python in octal

Hi, I am having trouble trying to print a literal string in a proper format.
For starters, I have an object with a string parameter which is used for metadata, such that it looks like:
obj {
metadata: <str>
}
The object is being returned as a protocol response and we have the object to use as such.
print obj gives:
metadata: "\n)\n\022foobar"
When I print obj.metadata, Python treats the value as a string and converts the escapes to line breaks and the corresponding ASCII values, as expected.
When I tried
print repr(obj.metadata)
"\n)\n\x12foobar"
Unfortunately, Python prints the literal but converts the escaped characters from octal to hex. Is there a way I can print the Python literal with the escaped characters in octal, or convert the string so that the values are printed as they appear in the object?
Thanks for the help
The extremely bad solution I have so far is
print str(obj).rstrip('\"').lstrip('metadata: \"')
to get the correct answer, but I am assuming there must be a smarter way.
TLDR:
x = "\n)\n\022foobar"
print x
)
foobar
print repr(x)
'\n)\n\x12foobar'
How do I get x to print the way it was assigned?
Please try this:
print('\\n)\\n\\022foobar')
or
print(r'\n)\n\022foobar')
The escape character '\' changes how the character following it is interpreted; for example, \n means a newline.
Doubling the backslash ('\\') or prefixing the literal with r (a raw string) turns that interpretation off. The doubled backslash works the same way in C.
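The suggestions above require retyping the literal, which does not help when the string arrives at runtime. A possible sketch for that case builds the octal-escaped text from an existing string. It is written for Python 3, the function name octal_escape is hypothetical, and the set of named escapes is an assumption rather than the exact rules of any particular protocol library.

```python
def octal_escape(text):
    """Render text with control and high bytes as three-digit octal escapes."""
    # Escapes that keep their short named form (an assumed, minimal set).
    named = {"\n": "\\n", "\t": "\\t", "\r": "\\r", "\\": "\\\\", '"': '\\"'}
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in named:
            out.append(named[ch])
        elif 32 <= cp < 127 or cp > 0xFF:
            # Printable ASCII, or a character that does not fit in one byte:
            # pass it through unchanged.
            out.append(ch)
        else:
            out.append("\\{:03o}".format(cp))
    return "".join(out)


x = "\n)\n\022foobar"
print(octal_escape(x))  # \n)\n\022foobar
```

Characters above U+00FF are passed through unchanged here because a three-digit octal escape can only describe a single byte.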

Python - How to print one backslash in a string within a dictionary?

I have a dictionary with some strings, in one of the string there are two backslashes. I want to replace them with a single backslash.
These are the backslashes: IfNotExist\\u003dtrue
Configurations = {
"javax.jdo.option.ConnectionUserName": "test",
"javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
"javax.jdo.option.ConnectionPassword": "sxxxsasdsasad",
"javax.jdo.option.ConnectionURL": "jdbc:mysql://hive-metastore.cr.eu-west-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist\\u003dtrue"
}
print (Configurations)
When I print it, it keeps showing the two backslashes. I know that the way to escape a backslash is to use \\; this works in a regular string, but it does not work in a dictionary.
Any ideas?
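The problem comes from the escaping.
In fact, \u003d is the Unicode escape for =.
In your literal the backslash is escaped by another backslash, so the string contains a literal backslash followed by u003d rather than an = sign.
You can either:
Replace \u003d with = in the string.
Write the escape with a single backslash so that Python interprets it as =. In Python 2 this requires a unicode literal with the u prefix, for example u"hi \u003d".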
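If the value arrives at runtime and cannot be retyped, the escape can also be decoded programmatically. This is a sketch under the assumption that the string is pure ASCII apart from its escapes; the host name db.example.invalid is a placeholder, not the real endpoint.

```python
# Placeholder URL; the doubled backslash in the literal is ONE backslash
# in the actual string, followed by the characters u003d.
url = "jdbc:mysql://db.example.invalid:3306/hive?createDatabaseIfNotExist\\u003dtrue"

# Encode to ASCII bytes, then let the unicode_escape codec interpret \u003d.
decoded = url.encode("ascii").decode("unicode_escape")

print(decoded)
# jdbc:mysql://db.example.invalid:3306/hive?createDatabaseIfNotExist=true
```

Whether the consumer of the URL expects the escaped or the decoded form depends on that system, so check before replacing the stored value.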
Printing the dictionary shows you a representation of the dictionary object. It doesn't necessarily show you a nice representation of everything inside it. To do that you need to do:
for value in Configurations.values():
    print(value)
When you print out your dictionary using
print (Configurations), it will print out the repr() value of the dictionary
You will get
{'javax.jdo.option.ConnectionDriverName': 'org.mariadb.jdbc.Driver', 'javax.jdo.option.ConnectionUserName': 'test', 'javax.jdo.option.ConnectionPassword': 'sxxxsasdsasad', 'javax.jdo.option.ConnectionURL': 'jdbc:mysql://hive-metastore.cr.eu-west-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist\\u003dtrue'}
You need to print out your dictionary with
print (Configurations["javax.jdo.option.ConnectionURL"])
or
print (str(Configurations["javax.jdo.option.ConnectionURL"]))
Note: str() is added
Then the output will be
jdbc:mysql://hive-metastore.cr.eu-west-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist\u003dtrue
For more detail check Python Documentation - Fancier Output Formatting
The str() function is meant to return representations of values which
are fairly human-readable, while repr() is meant to generate
representations which can be read by the interpreter (or will force a
SyntaxError if there is no equivalent syntax).
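The difference between the two representations can be checked directly. The following sketch uses a shortened, made-up value and a hypothetical key name; it shows that the string holds a single backslash and that only repr() doubles it.

```python
value = "IfNotExist\\u003dtrue"  # the literal has two backslashes, the string has one
config = {"url": value}          # 'url' is a made-up key for this example

print(len(value))      # 20
print(repr(value))     # 'IfNotExist\\u003dtrue'
print(config)          # {'url': 'IfNotExist\\u003dtrue'}
print(config["url"])   # IfNotExist\u003dtrue
```

Printing the dictionary uses repr() for the strings inside it, which is where the doubled backslash comes from.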
If you want to represent that string by using a single backslash instead of a double backslash, then you need the str() representation, not the repr(). When you print a dictionary, you always get the repr() of the included strings.
You can print the str() by formatting the dictionary yourself, like so:
print("{" +
      ', '.join("'{key}': '{value}'".format(key=key, value=value)
                for key, value in Configurations.items()) +
      "}")
Depending on how you display your string, Python will show two backslashes where the string actually has only one. This is Python's way of indicating that the backslash is an actual backslash and not part of an escape sequence; repr() will likewise show a newline as '\n', for example.
Try writing the string to a file and then opening the file in an editor.
(Linux..)
>>> with open('/tmp/somefile.txt', 'w') as f:
...     f.write(sometextwithbackslashes)
...
$ vi /tmp/somefile.txt

Replace a character with backslash bug - Python

This feels like a bug to me. I am unable to replace a character in a string with a single backslash:
>>> st = "a&b"
>>> st.replace('&','\\')
'a\\b'
I know that '\' isn't a legitimate string because the \ escapes the last '.
However, I don't want the result to be 'a\\b'; I want it to be 'a\b'. How is this possible?
You are looking at the string representation, which is itself a valid Python string literal.
The \\ is itself just one backslash, but it is displayed as an escape sequence so that the value is a valid Python string literal. You can copy and paste that string back into Python and it'll produce the same value.
Use print(st.replace('&','\\')) to see the actual value, or check the length of the result:
>>> st = "a&b"
>>> print(st.replace('&','\\'))
a\b
>>> len(st.replace('&','\\'))
3
