Python - How can I convert a special character to the unicode representation?

In a dictionary, I have the following value containing equals signs:
{"appVersion":"o0u5jeWA6TwlJacNFnjiTA=="}
To be explicit, I need to replace each = with its Unicode escape '\u003d' (basically the reverse of what json.loads() does). How can I store that value in a variable without ending up with two escapes (\\u003d)?
I've tried several approaches, including encode/decode, repr(), unichr(61), etc., and even after a lot of searching I couldn't find anything that does this. Every approach gives me the following final result (or the original value):
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
Thanks in advance for your attention.
EDIT
When I debug the code, the variable's value is shown with two escapes. The program takes this value and uses it in the following steps, extra escape included. I'm using this code to build JSON with json.dumps(), and the result returned is a unicode string with two escapes.
Below is a screenshot of the final result after the JSON construction. I need to find a way to store the value in the variable with just one escape.
I don't know if it makes a difference, but I'm doing this for a custom Burp plugin that manipulates selected requests.
Here is an image of my POC, showing the value of the variable.

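The extra backslash is not actually added. The Python interpreter displays the repr() of a string, which doubles the backslash to show that it is a literal backslash rather than the start of an escape such as \t or \n. Printing the string shows what it really contains. (This is Python 2, where '\u003d' in a plain str literal is kept as six characters; in Python 3 you would write '\\u003d' or r'\u003d' instead.)
I hope this helps: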
>>> t['appVersion'] = t["appVersion"].replace('=', '\u003d')
>>> t['appVersion']
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
>>> print(t['appVersion'])
o0u5jeWA6TwlJacNFnjiTA\u003d\u003d
>>> t['appVersion'] == 'o0u5jeWA6TwlJacNFnjiTA\u003d\u003d'
True
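
For readers on Python 3, where '\u003d' in a string literal is already '=', here is a minimal sketch of the same idea. It assumes the goal is JSON text that carries a single-backslash \u003d; the variable names are illustrative and not taken from the plugin.

```python
import json

value = "o0u5jeWA6TwlJacNFnjiTA=="

# Spell the backslash explicitly ('\\u003d' or r'\u003d' both work).
escaped = value.replace("=", "\\u003d")

print(repr(escaped))  # debugger-style view: each backslash shown doubled
print(escaped)        # actual content: one backslash per escape
single_count = escaped.count("\\")  # one backslash per replaced '='

# json.dumps() escapes backslashes itself, so dumping the already-escaped
# value really does produce two backslashes per escape in the JSON text.
doubled = json.dumps({"appVersion": escaped})

# Patching the serialized text instead keeps one backslash per escape,
# and the result still decodes back to the original value.
patched = json.dumps({"appVersion": value}).replace("=", "\\u003d")
round_trip = json.loads(patched)
print(doubled)
print(patched)
```

Replacing on the serialized text is safe here only because = has no structural meaning in JSON; it would also rewrite any = inside other keys or values.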

Related

Python - Issues with Unicode String from API Call

I'm using Python to call an API that returns the last name of some soccer players. One of the players has a "ć" in his name.
When I call the endpoint, the name prints out with a Unicode escape in it:
>>> last_name = (json.dumps(response["response"][2]["player"]["lastname"]))
>>> print(last_name)
"Mitrovi\u0107"
>>> print(type(last_name))
<class 'str'>
If I copy and paste that output and put it in a string literal of its own, like so:
>>> print("Mitrovi\u0107")
Mitrović
>>> print(type("Mitrovi\u0107"))
<class 'str'>
then it prints just fine.
What is wrong with the API endpoint call and the string that comes from it?

Well, you serialise the string with json.dumps() before printing it, that's why you get a different output.
Compare the following:
>>> print("Mitrović")
Mitrović
and
>>> print(json.dumps("Mitrović"))
"Mitrovi\u0107"
The second command adds double quotes to the output and escapes non-ASCII chars, because that's how strings are encoded in JSON. So it's possible that response["response"][2]["player"]["lastname"] contains exactly what you want, but maybe you fooled yourself by wrapping it in json.dumps() before printing.
Note: don't confuse Python string literals with the JSON serialisation of strings. They share some features, but they aren't the same (e.g. JSON strings can't be single-quoted), and they serve different purposes (the former are for writing strings in source code, the latter for encoding data to send across the network).
Another note: You can avoid most of the escaping with ensure_ascii=False in the json.dumps() call:
>>> print(json.dumps("Mitrović", ensure_ascii=False))
"Mitrović"

Count the number of characters in your string and I'll bet you'll notice that the result of json.dumps() is 13 characters between the quotes:
"M-i-t-r-o-v-i-\-u-0-1-0-7", or "Mitrovi\\u0107"
When you copy "Mitrovi\u0107" into Python source, you get 8 characters, and the '\u0107' is a single Unicode character.
That would suggest the endpoint is not sending properly JSON-escaped Unicode, or that somewhere in your code you're reading it as ASCII first. Look carefully at exactly what you're receiving.
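
The difference between the two strings can be checked directly. The following is a small sketch rather than a diagnosis of the API in question; it only uses the standard json module, and the name is written with an escape so the literal is unambiguous.

```python
import json

name = "Mitrovi\u0107"  # the 8-character string "Mitrović"

encoded = json.dumps(name)                       # ASCII-only JSON text
readable = json.dumps(name, ensure_ascii=False)  # keeps the ć as-is
decoded = json.loads(encoded)                    # back to the original str

# The JSON text is longer: two quotes plus a six-character \u0107 escape.
print(len(name), len(encoded), len(readable))
print(encoded, readable, decoded)
```

If the value has already been parsed from the response, printing it without json.dumps() shows the name as expected.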

Is there a way to append r' to a string in Python dynamically?

I am new to Python and I am working on calculating the checksum of a string that contains backslashes (\). I was able to come up with the logic to calculate the checksum, but that function needs the string to be in raw format (r') to calculate the correct checksum. However, this function will be invoked by another function, so I will not be able to convert the string to a raw string manually. Can someone please help me achieve this dynamically?
Here is the string:
raw_message1 = '=K+8\\\08|'
This string has 3 backslashes, but it shows only 2 after saving. The string may also vary, so replacing characters after processing will not help.
And when I print the result:
print(message1)
=K+8\ 8|
What I need is to have something that retains the backslashes as is. I cannot go for any other character as every character has its own ASCII value and checksum would differ. I tried all the other options mentioned before and see different result for each case.

You can define the string as so:
message1 = r'=K+8\\\08|'
This puts the literal for message1 in raw form, so every backslash is kept exactly as typed.
Let me know if this helps. I don't really understand what you mean by converting to a raw string manually versus dynamically, so this is the most I can help with for now.

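The r prefix does not "magically" retain anything in an existing string. It only tells Python not to process escape sequences in that literal, so every backslash you type stays a single backslash in the resulting string. It is equivalent to doubling each backslash in a normal literal, which is also how repr() displays them.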
Look at the following example:
>>> s = r"\rando\\\mst\ri\ngo\\flet\ters"
>>> s
'\\rando\\\\\\mst\\ri\\ngo\\\\flet\\ters'
>>> print(s)
\rando\\\mst\ri\ngo\\flet\ters
>>>
The string itself contains single backslashes; repr() displays each one doubled, and print() shows them as they really are.
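
A short sketch of why there is nothing to convert at runtime: the r prefix only affects how a literal in source code is read, so a string that reaches a function as an argument already holds whatever backslashes it was given. The checksum function below is a hypothetical stand-in (sum of byte values modulo 256), not the asker's actual algorithm.

```python
def checksum(text):
    """Hypothetical stand-in checksum: sum of ASCII byte values modulo 256."""
    return sum(text.encode("ascii")) % 256

raw_literal = r'=K+8\\\08|'        # backslashes kept exactly as typed
escaped_literal = '=K+8\\\\\\08|'  # same string, each backslash doubled
plain_literal = '=K+8\\\08|'       # '\\' is one backslash, '\0' is a NUL

# The first two literals produce the identical string object contents.
same_string = raw_literal == escaped_literal

# The plain literal lost characters when the source was parsed, before
# any function ever saw it, so its checksum differs.
print(len(raw_literal), raw_literal.count('\\'))
print(len(plain_literal), plain_literal.count('\\'))
print(checksum(raw_literal) == checksum(plain_literal))
```

Data read from a file, socket, or user input is never escape-processed, so it can be passed to the checksum function unchanged.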

Behavior of python's repr method

I understand that the goal of repr is to be unambiguous, but the behavior of repr really confused me.
repr('"1"')
Out[84]: '\'"1"\''
repr("'1'")
Out[85]: '"\'1\'"'
Based on the above code, I think repr just puts '' around the string.
But when I try this:
repr('1')
Out[82]: "'1'"
repr("1")
Out[83]: "'1'"
repr puts "" around the string, and repr("1") and repr('1') are the same.
Why?

There are three levels of quotes going on here!
1. The quotes inside the string you're passing (only present in your first example).
2. The quotes in the string produced by repr. Keep in mind that repr tries to return a string representation that would work as Python code, so if you pass it a string, it will add quotes around the string.
3. The quotes added by your Python interpreter upon printing the output. These are probably what confuses you. Probably your interpreter is calling repr again, in order to give you an idea of the type of object being returned. Otherwise, the string 1 and the number 1 would look identical.
To get rid of this extra level of quoting, so you can see the exact string produced by repr, use print(repr(...)) instead.

The Python REPL (and IPython in your case) prints the repr() of the output value, so your input is getting repr'd twice.
To avoid this, print it out instead.
>>> repr('1') # what you're doing
"'1'"
>>> print(repr('1')) # if you print it out
'1'
>>> print(repr(repr('1'))) # what really happens in the first line
"'1'"
The original (outer) quotes may not be preserved since the object being repred has no idea what they originally were.

From documentation:
repr(object): Return a string containing a printable representation of
an object.
So it returns a string that, when given to Python, can be used to recreate that object.
Your first example:
repr('"1"') # the string <"1"> is passed as an argument
Out[84]: '\'"1"\'' # to create your string you need to type '"1"'.
# The outer quotes are just interpreter formatting
Your second example:
repr("'1'") # you pass the string <'1'>
Out[85]: '"\'1\'"' # to recreate it you have to type "'1'" or '\'1\'',
# depending on the type of quotes you use (<'> and <"> are equivalent in Python)
Last,
repr('1') # you pass <1> as a string
Out[82]: "'1'" # to make that string in Python you type '1', right?
repr("1") # you pass the same <1> as a string
Out[83]: "'1'" # to recreate it you can type either '1' or "1"; it does not matter. Hence the output.
Both the interpreter and repr choose ' or " as the surrounding quotes depending on the content, to minimize escaping, which is why the output differs.
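
The round-trip property described in these answers can be checked with a short script. This is only a sketch: ast.literal_eval is used here as a safe way to evaluate the text that repr returns, and the sample strings are illustrative.

```python
import ast

samples = ['1', '"1"', "'1'", 'it\'s "both"']

for s in samples:
    shown = repr(s)
    # Evaluating the repr as a Python literal gives back the same string.
    assert ast.literal_eval(shown) == s
    # print() shows the repr text itself, without a second layer of quotes.
    print(s, '->', shown)

# The original quote style is not part of the string, so both spellings
# of the literal produce the same repr.
same = repr('1') == repr("1")
print(same)
```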

How to convert python byte string containing a mix of hex characters?

Specifically, I am receiving a stream of bytes from a TCP socket that looks something like this:
inc_tcp_data = b'\x02hello\x1cthisisthedata'
The stream uses hex values to denote different parts of the incoming data. However, I want to use inc_tcp_data in the following format:
converted_data = '\x02hello\x1cthisisthedata'
Essentially, I want to get rid of the b and just literally spit out what came in.
I've tried various struct.unpack methods as well as .decode("encoding"). I could not get the former to work at all, and the latter would strip out the hex values if there was no visual way to encode them, or convert them to characters if it could. Any ideas?
Update:
I was able to get my desired result with the following code:
inc_tcp_data = b'\x02hello\x3Fthisisthedata'.decode("ascii")
d = repr(inc_tcp_data)
print(d)
print(len(d))
print(len(inc_tcp_data))
the output is:
'\x02hello?thisisthedata'
25
20
However, this still doesn't help me, because I actually need the regular expression that follows to see \x02 as a single hex value and not as a four-character string.
What am I doing wrong?
UPDATE
I've solved this issue by not solving it. The reason I wanted the hex characters to remain unchanged was so that a regular expression would be able to detect them further down the road. However, what I should have done (and did) was simply change the regular expression to analyze the bytes without decoding them. Once I had separated out all the parts via the regular expression, I decoded the parts with .decode("ascii") and everything worked out great.
I'm just updating this if it happens to help someone else.
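
For anyone following the same route, here is a minimal sketch of matching on the raw bytes first and decoding afterwards. The pattern and the group names head and body are assumptions made for illustration; the original regular expression was not posted.

```python
import re

inc_tcp_data = b'\x02hello\x1cthisisthedata'

# Bytes pattern: \x02 starts the frame and \x1c separates the two parts.
# The group names are hypothetical.
FRAME = re.compile(rb'\x02(?P<head>[^\x1c]*)\x1c(?P<body>.*)', re.DOTALL)

match = FRAME.match(inc_tcp_data)
if match:
    # Decode only after the parts have been separated.
    head = match.group('head').decode('ascii')
    body = match.group('body').decode('ascii')
    print(head, body)
```

A bytes pattern can only be applied to bytes input, which keeps the control bytes intact until the text parts are decoded.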

Assuming you are using Python 3:
>>> inc_tcp_data.decode('ascii')
'\x02hello\x1cthisisthedata'

Bytes string in Python

Would you know by any chance how to get rid of the bytes identifier in front of a string in a Python list? Perhaps there is some global setting that can be amended?
I retrieve a query from Postgres 9.3 and create a list from that query. It looks like Python 3.3 interprets records in columns of type char(4) as if they were byte strings, for example:
Funds[1][1]
b'FND3'
Funds[1][1].__class__
<class 'bytes'>
So the implication is:
Funds[1][1]=='FND3'
False
I have some control over that database so I could change the column type to varchar(4), and it works well:
Funds[1][1]=='FND3'
True
But this is only a temporary solution.
The little b has made my life a nightmare for the last two days ;), and I would appreciate your help with this problem.
Thanks and Regards
Peter

You have to either manually implement __str__/__repr__ or, if you're willing to take the risk, do some sort of Regex-replace over the string.
Example __repr__:
def stringify(lst):
    return "[{}]".format(", ".join(repr(x)[1:] if isinstance(x, bytes) else repr(x) for x in lst))

The b isn't part of the string, any more than the quotes around it are; they're just part of the representation when you print the string out. So, you're chasing the wrong problem, one that doesn't exist.
The problem is that the byte string b'FND3' is not the same thing as the string 'FND3'. In this particular example, that may seem silly, but if you might ever have any non-ASCII characters anywhere, it stops being silly.
For example, the string 'é' is the same as the byte string b'\xe9' in Latin-1, and it's also the same as the byte string b'\xc3\xa9' in UTF-8. And of course b'\xc3\xa9' is the same as the string 'Ã©' in Latin-1.
So, you have to be explicit about what encoding you're using:
Funds[1][1].decode('utf-8')=='FND3'
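
The following sketch illustrates why the encoding has to be stated, and one way to decode a whole row. The row contents are hypothetical, and UTF-8 is assumed to be the encoding the database client uses.

```python
text = '\xe9'  # the one-character string 'é'

latin1_bytes = text.encode('latin-1')    # one byte
utf8_bytes = text.encode('utf-8')        # two bytes
mojibake = utf8_bytes.decode('latin-1')  # wrong codec gives 'Ã©'
print(latin1_bytes, utf8_bytes, mojibake)

# bytes and str never compare equal in Python 3, even for plain ASCII.
print(b'FND3' == 'FND3')

# Hypothetical row as a driver might return it; decode only the bytes values.
row = [1, b'FND3', 'already text']
decoded_row = [v.decode('utf-8') if isinstance(v, bytes) else v for v in row]
print(decoded_row)
```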
But why is PostgreSQL returning you byte strings? Well, that's what a char column is. It's up to the Python bindings to decide what to do with them. And without knowing which of the multiple PostgreSQL bindings you're using, and which version, it's impossible to tell you what to do. But, for example, in recent-ish psycopg, you just have to set an encoding on the connection (e.g., conn.set_client_encoding('UTF-8')); in older versions you had to register a standard typecaster and do some more stuff; in py-postgresql you have to register lambda s: s.decode('utf-8'); and so on.
