Specifically, I am receiving a stream of bytes from a TCP socket that looks something like this:
inc_tcp_data = b'\x02hello\x1cthisisthedata'
The stream uses hex values to denote the different parts of the incoming data. However, I want to use inc_tcp_data in the following format:
converted_data = '\x02hello\x1cthisisthedata'
Essentially, I want to get rid of the b prefix and just get out exactly what came in.
I've tried various struct.unpack methods as well as .decode("<encoding>"). I could not get the former to work at all, and the latter would strip out the hex escapes when there was no printable way to display them, or convert them to characters when it could. Any ideas?
Update:
I was able to get my desired result with the following code:
inc_tcp_data = b'\x02hello\x3Fthisisthedata'.decode("ascii")
d = repr(inc_tcp_data)
print(d)
print(len(d))
print(len(inc_tcp_data))
the output is:
'\x02hello?thisisthedata'
25
20
However, this still doesn't help me, because the regular expression that follows actually needs to see \x02 as a single byte value and not as a four-character string.
What am I doing wrong?
UPDATE
I've solved this issue by not solving it. The reason I wanted the hex characters to remain unchanged was so that a regular expression would be able to detect them further down the road. However, what I should have done (and did) was simply change the regular expression to analyze the bytes without decoding them. Once I had separated out all the parts via the regular expression, I decoded the parts with .decode("ascii") and everything worked out great.
I'm just updating this if it happens to help someone else.
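For anyone who wants a concrete sketch of that bytes-first approach (the pattern and variable names below are illustrative, based on the sample data in the question, not the original poster's actual code):

import re

inc_tcp_data = b'\x02hello\x1cthisisthedata'

# Match the raw bytes with a bytes pattern, then decode each captured part afterwards.
m = re.match(rb'\x02(.+?)\x1c(.+)', inc_tcp_data)
if m:
    header = m.group(1).decode('ascii')    # 'hello'
    payload = m.group(2).decode('ascii')   # 'thisisthedata'
    print(header, payload)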
Assuming you are using Python 3:
>>> inc_tcp_data.decode('ascii')
'\x02hello\x1cthisisthedata'
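As a side note (mine, not the original answerer's): after the decode, '\x02' is still a single control character, even though repr() prints it four characters wide, so a str regex can match it directly:

>>> import re
>>> converted_data = b'\x02hello\x1cthisisthedata'.decode('ascii')
>>> len(converted_data)        # '\x02' counts as one character, not four
20
>>> re.search('\x02(.*)\x1c', converted_data).group(1)
'hello'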
Related
I have string data in various languages where parts of the strings have seen some wrong encoding/decoding while others are correct, I need to fix the wrong ones:
Here's an example for the german word "Zubehör":
correct = "ZUBEHÖR"
incorrect = "ZUBEHÃ\x96R"
I already found out that I can correct the errors like this:
incorrect.encode("raw_unicode_escape").decode("utf8")
However, using this on the already-correct strings yields an error. I could iterate over all strings and use a try-statement (as sketched below), but I don't know if this will work reliably and I'd like to know a more elegant way.
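For reference, a minimal sketch of that try-statement approach (not from the original post; note the caveat in the comments):

def try_fix(s):
    # Round-trip repair; if decoding raises, assume the string was already correct.
    # Caveat: a correct string containing characters above U+00FF (e.g. Cyrillic)
    # would be silently turned into literal \uXXXX text, so this is not fully reliable.
    try:
        return s.encode("raw_unicode_escape").decode("utf8")
    except UnicodeDecodeError:
        return s

print(try_fix("ZUBEHÃ\x96R"))   # ZUBEHÖR
print(try_fix("ZUBEHÖR"))       # ZUBEHÖR (unchanged)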
Also, while the \x96 is written out when printing, it's actually only one character:
incorrect[-3]
Out[34]: 'Ã'
incorrect[-2]
Out[33]: '\x96'
How can I reliably only find those strings that have these odd unicode characters in them like ZUBEHÃ\x96R?
EDIT:
Here's something else I stumbled upon while experimenting:
When I do incorrect.encode("raw_unicode_escape") then the result is b'ZUBEH\xc3\x96R'.
But when I do this with e.g. a cyrillic word like this:
"Персонализированные".encode("raw_unicode_escape")
Then the result is b'\\u041f\\u0435\\u0440\\u0441\\u043e\\u043d\\u0430\\u043b\\u0438\\u0437\\u0438\\u0440\\u043e\\u0432\\u0430\\u043d\\u043d\\u044b\\u0435'
Why am I getting \x-escapes in the first case and \u-escapes in the second case while doing the exact same thing?
And why can I .decode("utf8") back the \x-escapes into a readable format but not the \u-escapes?
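As a quick REPL check (added here for reference, not taken from the answer below): raw_unicode_escape emits characters below U+0100 as raw Latin-1 bytes, and everything above as literal ASCII \uXXXX text. The Latin-1 bytes of the mojibake case happen to form valid UTF-8, which is why only that case survives a round trip through .decode("utf8").

>>> "Ö".encode("raw_unicode_escape")    # U+00D6 -> one raw byte
b'\xd6'
>>> "П".encode("raw_unicode_escape")    # U+041F -> six ASCII characters
b'\\u041f'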
You should try the fixes-text-for-you library (ftfy):
>>> import ftfy
>>> ftfy.fix_text("ZUBEHÃ\x96R")
'ZUBEHÖR'
It operates line by line, so if you have clean and corrupt strings mixed together, but on separate lines, ftfy can probably handle it.
Note: This is not an exact science.
The way ftfy works involves a lot of educated guesses.
The tool is very well made, but it may not guess correctly in all cases you have.
If you can, it is always better to fix the errors at the source (ie. make sure all text is correctly decoded in the first place).
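One way to narrow down which strings need attention (a heuristic of my own, not a documented ftfy detection API) is to compare input and output:

>>> [s for s in ["ZUBEHÖR", "ZUBEHÃ\x96R"] if ftfy.fix_text(s) != s]
['ZUBEHÃ\x96R']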
Why does a regex fail on a string cast from a bytes object when line breaks are present?
That is, why does this fail to find a match (i.e. print 'Green') in a string created from str(obj):
import re
s = str(b'Package Name: Green\r\n Release version: 8.1\r\n')
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
When this succeeds using a string created from obj.decode()?
import re
s = (b'Package Name: Green\r\n Release version: 8.1\r\n').decode()
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
No matter what search pattern I tried, searching the string created by str(obj) failed to find a match...
The reason you get different results is that you’re doing different things. Calling str on a bytes with newline characters returns a string with a literal backslash and n; calling decode returns a string with a newline character in it. So, if you’re searching the results for newline characters, the second one will succeed, and the first will fail. And it’s the second one that you wanted.
In other words, using decode here is right, and str is wrong; that’s why you get different results. If you can’t think through the difference, try just printing them out: print(b.decode()); print(str(b)) and you’ll see the difference immediately.
In fact, you should usually be decoding the strings as soon as you receive them, and never looking at the bytes again. Then you never have to worry about the str representation of bytes objects (except maybe in some code that logs errors caused by invalid strings that you couldn’t decode). The only exception is when you know the bytes are some kind of encoded text, but can’t be sure what the encoding is. For example, if you’re parsing HTTP headers or email messages or Python source code, you don’t know the character set until you read part of the file and search it for special ASCII-encoded strings. Or, if you’re converting a bunch of old text files from Windows to Unix line endings and some are cp1252 while others are cp1250, you don’t care which is which because they both encode line endings the same way. For those cases, just stick with bytes, and search for b'\n' instead of '\n'.
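To make the difference concrete, here is a quick sketch using the data from the question (both searches are mine, not part of the original answer):

import re

raw = b'Package Name: Green\r\n Release version: 8.1\r\n'

# Decoding first gives a str with real CR/LF characters, so a str pattern matches.
m = re.search(r'Package Name: (.*)\r\n', raw.decode())
print(m.group(1))             # Green

# Staying in bytes also works, as long as the pattern is a bytes pattern too.
m = re.search(rb'Package Name: (.*)\r\n', raw)
print(m.group(1).decode())    # Green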
If you want to know why Python makes this so complicated:
bytes objects are used to store strings encoded in your default encoding—but they’re also used to store strings encoded in different encodings, and binary data that isn’t a string at all. And a bytes object has no idea which of those it’s storing; they’re all just sequences of numbers.
Python 2 effectively assumed that a bytes was being used to store a string in your default encoding, so it let you convert back and forth to Unicode by calling functions like str, or even concatenating a bytes and Unicode string. That turned out to be one of the biggest sources of errors in the language. You still see Python 2 users posting questions here every few days asking why they got a UnicodeEncodeError when they weren’t calling encode anywhere (or, worse, when they were calling decode), and fixing that was one of the main reasons for Python 3’s existence.
The human-readable representation of a bytes object has to be something that can be produced without error, and read unambiguously, whether it’s a string in the default encoding, a string in a completely different encoding, or a sequence of pixel brightness values ranging from 0 to 255. The compromise solution (for things like that HTTP headers case above) is the backslash-escaped quoted string.
By the way, during the Python 2 to 3 transition, the core devs assumed multiple people would come up with clever EncodedBytes types that carried around their encoding, and could therefore act more like Python 2 byte strings but without all the associated errors, and after a couple years one of them would be the clear winner on PyPI and maybe they could add it to Python 3.3 or so. That’s what you’re probably instinctively reaching for here. But, as it turned out, nobody used any such libraries, because it’s almost always easier to just decode and encode at the edges of your program and use Unicode everywhere, and the exceptions are almost always cases where you don’t know the encoding so EncodedBytes wouldn’t help.
One last thing: thinking of functions like str or float as "casts" is misleading. While it looks superficially similar to the way you do explicit casts in C or Java or Go or whatever language you're used to, it has a very different meaning.
In a dictionary, I have the following value with equals signs:
{"appVersion":"o0u5jeWA6TwlJacNFnjiTA=="}
To be explicit, I need to replace the = with the unicode representation '\u003d' (basically the reverse of what json.loads() does). How can I assign the unicode value to a variable without storing it with two escapes (\\u003d)?
I've tried different ways, including encode/decode, repr(), unichr(61), etc., and even after searching a lot I couldn't find anything that does this; every approach gives me the following final result (or the original value back):
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
Thanks in advance for your attention.
EDIT
When I debug the code, it shows the value of the variable with 2 escapes. The program will take this value and use it in the following actions, extra escape included. I'm using this value to construct a JSON with json.dumps(), and the result returned is a unicode string with 2 escapes.
Below is a print of the final result after the JSON construction. I need to find a way to store the value in the variable with just one escape.
I don't know if it makes a difference, but I'm doing this for a custom Burp plugin, manipulating some selected requests.
Here is an image of my POC, getting the value of the variable.
The extra backslash is not actually added; the Python interpreter uses the repr() to indicate that it's a literal backslash, not the start of an escape like \t or \n, when a string containing \ gets printed:
I hope this helps:
>>> t['appVersion'] = t["appVersion"].replace('=', '\u003d')
>>> t['appVersion']
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
>>> print(t['appVersion'])
o0u5jeWA6TwlJacNFnjiTA\u003d\u003d
>>> t['appVersion'] == 'o0u5jeWA6TwlJacNFnjiTA\u003d\u003d'
True
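One thing worth noting (my observation, not part of the original answer): the session above behaves like Python 2, where '\u003d' in a plain str literal is six literal characters. In Python 3 that same literal is just '=', so the backslash has to be escaped explicitly:

>>> t = {"appVersion": "o0u5jeWA6TwlJacNFnjiTA=="}
>>> t["appVersion"] = t["appVersion"].replace("=", "\\u003d")   # Python 3: escape the backslash
>>> print(t["appVersion"])
o0u5jeWA6TwlJacNFnjiTA\u003d\u003d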
I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:
>>> unicode('©', 'utf-8').encode('utf-8')
'\xc2\xa9'
Note that I'm using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want: a string consisting of two separate code points, U+00C2 and U+00A9. (When UTF-8-decoded, it gives back the original string, u'\xa9'.)
Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:
>>> import json; json.dumps('\xc2\xa9')
'"\\u00a9"'
Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.
TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?
If you really want "\u00c2\u00a9" as the output, give json a Unicode string as input.
>>> print json.dumps(u'\xc2\xa9')
"\u00c2\u00a9"
You can generate this Unicode string from the raw bytes:
s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)
I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.
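An equivalent and arguably simpler way to build that Unicode string (my suggestion, in the same Python 2 style as the answer above) is to decode the raw bytes as latin-1, which maps every byte 0x00-0xFF to the code point of the same value:

import json

s = unicode('©', 'utf-8').encode('utf-8')   # '\xc2\xa9'
s2 = s.decode('latin-1')                    # u'\xc2\xa9'
print json.dumps(s2)                        # prints "\u00c2\u00a9"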
I have a hex-string made from a unicode string with that function:
def toHex(s):
    res = ""
    for c in s:
        res += "%02X" % ord(c)  # at least 2 hex digits, can be more
    return res

hex_str = toHex(u"...")
This returns a string like this one:
"80547CFB4EBA5DF15B585728"
That's a sequence of 6 Chinese characters.
But
u"Knödel"
converts to
"4B6EF664656C"
What I need now is a function to convert this back to the original unicode. The Chinese symbols seem to have a 2-byte representation, while the second example has 1-byte representations for all characters, so I can't just use unichr() on each 1- or 2-byte block.
I've already tried
binascii.unhexlify(hex_str)
but this seems to convert byte-by-byte and returns a string, not unicode. I've also tried
binascii.unhexlify(hex_str).decode(...)
with different formats. Never got the original unicode string.
Thank you a lot in advance!
This seems to work just fine:
binascii.unhexlify(binascii.hexlify(u"Knödel".encode('utf-8'))).decode('utf-8')
This comes back to the original object. You can do the same for the Chinese text if it's encoded properly; however, ord(x) already destroys the text you started from, because the resulting hex blocks have no fixed width. You'll need to encode it first and only then treat it like a string of bytes.
Can't be done. Using %02X loses too much information. You should be using something like UTF-8 first and converting that, instead of inventing a broken encoding.
>>> u"Knödel".encode('utf-8').encode('hex')
'4b6ec3b664656c'
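If you control the toHex side, encoding to UTF-8 before hex-dumping avoids the ambiguity entirely; a minimal round-trip sketch (the function names are mine):

import binascii

def to_hex(u):
    # Encode to UTF-8 first so every character becomes an unambiguous byte sequence.
    return binascii.hexlify(u.encode('utf-8')).upper()

def from_hex(hex_str):
    return binascii.unhexlify(hex_str).decode('utf-8')

print(from_hex(to_hex(u"Knödel")))   # Knödel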
When I was working with Unicode in a VB app a while ago, the first 1 or 2 digits would be removed if they were a "0", meaning "&H00A2" would automatically be converted to "&HA2". I just created a small function to check the length of the string and, if it was less than 4 characters, add the missing 0's. I'm not sure if this is what's happening to you, but I thought I would offer this bit of information as something to be aware of.