Python encode/decode hex string to UTF-8 string

I am trying to decode a hex string in Python.
value = ""
for i in "54 C3 BC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61".split(" "):
    value += chr(int(i, 16))
print(value)
Result:
TÃ¼r offen Klima
The expected result is "Tür offen Klima".
How can I make this work properly?

Your data is encoded as UTF-8, which means that you sometimes have to look at more than one byte to get one character. The easiest way to do this is probably to decode your string into a sequence of bytes, and then decode those bytes into a string. Python has built-in features for both:
value = bytes.fromhex("54 C3 BC").decode("utf-8")
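Applied to the full string from the question, the same call might look like the sketch below. The variable names are illustrative and not from the original post; the last step shows the reverse direction (string back to a hex string) as a sanity check.

```python
hex_string = "54 C3 BC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61"

# bytes.fromhex skips the spaces between the byte pairs
raw = bytes.fromhex(hex_string)

# C3 BC is a two-byte UTF-8 sequence, so 16 bytes become 15 characters
text = raw.decode("utf-8")
print(text)

# Reverse direction: encode to UTF-8 and format every byte as two hex digits
encoded = " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))
print(encoded)
```

The length difference between `raw` and `text` is the point: the loop in the question produces one character per byte, which cannot be right for multi-byte sequences.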

The problem is that the string
"54 C3 BC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61"
decoded one byte per character, is indeed
TÃ¼r offen Klima
The hex string that results in "Tür offen Klima" when each byte is treated as one character (that is, Latin-1) is actually:
"54 FC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61"
Therefore, the code below would generate the result you expected:
value = ""
for i in "54 FC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61".split(" "):
    value += chr(int(i, 16))
print(value)
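To make the difference between the two hex strings concrete, here is a small sketch. It relies on the fact that `chr(int(pair, 16))` for a single byte gives the same character as decoding that byte as Latin-1; the helper name `per_byte` is made up for illustration.

```python
utf8_hex = "54 C3 BC 72"    # "Tür" encoded as UTF-8
latin1_hex = "54 FC 72"     # "Tür" encoded as Latin-1

def per_byte(hex_string):
    """Mimic the loop above: one character per byte."""
    return "".join(chr(int(pair, 16)) for pair in hex_string.split())

# The loop is equivalent to a Latin-1 decode
loop_on_utf8 = per_byte(utf8_hex)
latin1_on_utf8 = bytes.fromhex(utf8_hex).decode("latin-1")

# Each hex string decodes cleanly only with its own encoding
from_utf8 = bytes.fromhex(utf8_hex).decode("utf-8")
from_latin1 = bytes.fromhex(latin1_hex).decode("latin-1")

# FC is not a valid start byte in UTF-8, so this direction fails loudly
try:
    bytes.fromhex(latin1_hex).decode("utf-8")
    latin1_as_utf8_failed = False
except UnicodeDecodeError:
    latin1_as_utf8_failed = True
```

So which fix applies depends on where the bytes come from: if the data really is UTF-8, decoding it as UTF-8 is the safer route than changing the data.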

Related

Python email module behaves unexpected when trying to parse "raw" subject lines

I have trouble parsing an email which is encoded in windows-1251 and contains the following header (literally like that in the file):
Subject: Счета на оплату по заказу . .
Here is a hexdump of that area:
000008a0 56 4e 4f 53 41 52 45 56 40 41 43 43 45 4e 54 2e |VNOSAREV@ACCENT.|
000008b0 52 55 3e 0d 0a 53 75 62 6a 65 63 74 3a 20 d1 f7 |RU>..Subject: ..|
000008c0 e5 f2 e0 20 ed e0 20 ee ef eb e0 f2 f3 20 ef ee |... .. ...... ..|
000008d0 20 e7 e0 ea e0 e7 f3 20 20 20 2e 20 20 2e 20 20 | ......   .  .  |
000008e0 20 20 0d 0a 58 2d 4d 61 69 6c 65 72 3a 20 4f 73 |  ..X-Mailer: Os|
000008f0 74 72 6f 53 6f 66 74 20 53 4d 54 50 20 43 6f 6e |troSoft SMTP Con|
I realize that this encoding doesn't adhere to the usual RFC 1342 style encoding of =?charset?encoding?encoded-text?= but I assume that many email clients will still correctly display the subject and hence I would like to extract it correctly as well. For context: I am not making these emails up or creating them, they are given and I need to deal with them as is.
My approach so far was to use the email module that comes with Python:
import email
with open('data.eml', 'rb') as fp:
    content = fp.read()
mail = email.message_from_bytes(content)
print(mail.get('subject'))
# ����� �� ������ �� ������ . .
print(mail.get('subject').encode())
# '=?unknown-8bit?b?0ffl8uAg7eAg7u/r4PLzIO/uIOfg6uDn8yAgIC4gIC4gICAg?='
My questions are:
Can I somehow convince the email module to parse mails with subjects like this correctly?
If not, can I somehow access the "raw" data of this header, i.e. the entries of mail._headers, without accessing private properties?
If not, can someone recommend a more versatile Python module for email parsing?
Some random observations:
a) Poking around in the internal data structure of mail, I arrived at [hd[1] for hd in mail._headers if hd[0] == 'Subject'] which is:
['\udcd1\udcf7\udce5\udcf2\udce0 \udced\udce0 \udcee\udcef\udceb\udce0\udcf2\udcf3 \udcef\udcee \udce7\udce0\udcea\udce0\udce7\udcf3 . . ']
b) According to the docs, mail.get_charsets() returns a list of character sets in the case of a multipart message, and it returns [None, 'windows-1251', None] here. So at least theoretically, the module does have a chance of guessing the correct charset.
For completeness, the SHA256 hash of the email file is 1aee4d068c2ae4996a47a3ae9c8c3fa6295a14b00d9719fb5ac0291a229b4038 (and I uploaded it to MalShare and VirusTotal).
The string you are seeing is just a normal unicode string which contains a lot of characters from the low surrogate range. I am quite sure that in this case, the string came about by using the .decode method with a surrogateescape error handler. Indeed:
In [1]: a = "Счета на оплату по заказу"
In [2]: a.encode("windows-1251").decode("utf8", "surrogateescape")
Out[2]: '\udcd1\udcf7\udce5\udcf2\udce0 \udced\udce0 \udcee\udcef\udceb\udce0\udcf2\udcf3 \udcef\udcee \udce7\udce0\udcea\udce0\udce7\udcf3'
To undo the damage, you should be able to use .encode("utf8", "surrogateescape").decode("windows-1251").
It is unclear to me whether they actually used utf8 with the surrogateescape handler, and you would have to match the charset that they (incorrectly) decode with. However, since the string matches yours perfectly, I think utf8 is what is being used.
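The round trip described above can be sketched end to end. The snippet assumes, as the answer does, that the header bytes were windows-1251 and that the lossy step used the `surrogateescape` error handler; the variable names are illustrative only.

```python
original = "Счета на оплату по заказу"

# What is actually stored in the mail file: windows-1251 bytes
raw_bytes = original.encode("windows-1251")

# What the parser apparently does: every undecodable byte 0xNN
# becomes the lone surrogate U+DCNN instead of raising an error
mangled = raw_bytes.decode("utf-8", "surrogateescape")

# Undo it: the same error handler turns the surrogates back into bytes
restored_bytes = mangled.encode("utf-8", "surrogateescape")

# Now decode with the charset the sender really used
recovered = restored_bytes.decode("windows-1251")
print(recovered)
```

Because `surrogateescape` is lossless for the bytes it escapes, the only real guess left is the final charset.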
mail.get_charsets() probably returns the right values (checked here with the provided hexdump hard-coded):
x = '56 4e 4f 53 41 52 45 56 40 41 43 43 45 4e 54 2e' + \
'52 55 3e 0d 0a 53 75 62 6a 65 63 74 3a 20 d1 f7' + \
'e5 f2 e0 20 ed e0 20 ee ef eb e0 f2 f3 20 ef ee' + \
'20 e7 e0 ea e0 e7 f3 20 20 20 2e 20 20 2e 20 20' + \
'20 20 0d 0a 58 2d 4d 61 69 6c 65 72 3a 20 4f 73' + \
'74 72 6f 53 6f 66 74 20 53 4d 54 50 20 43 6f 6e'
print(bytes.fromhex(x).decode('windows-1251'))
VNOSAREV@ACCENT.RU>
Subject: Счета на оплату по заказу . .
X-Mailer: OstroSoft SMTP Con
Your mail.get('subject').encode() does return exactly the bytes you put in. There is no "correctly" beyond this point; you have to know, or guess, the correct encoding.
mail.raw_items() returns what purports to be the "raw" headers from the message, but they are actually encoded. @Jesko's answer shows how to take the encoded value and transform it back to the original bytes, provided you know which encoding to use.
(The surrogate encoding is apparently a hack to allow Python to keep the raw bytes in a form which cannot accidentally leak back into a proper decoded string. You have to know how it was assembled and explicitly request it to be undone.)
Going out on a limb, you can try all the encodings of the body of the message, and check if any of them return a useful decoding.
The following uses the modern EmailMessage API where mail.get('subject').encode() no longer behaves like in your example (I think perhaps this is a bug?)
import email
from email.policy import default

content = b'''\
From: <VNOSAREV@ACCENT.example.RU>
Subject: \xd1\xf7\xe5\xf2\xe0 \xed\xe0 \xee\xef\xeb\xe0\xf2\xf3 \xef\xee \xe7\xe0\xea\xe0\xe7\xf3 . .
Content-type: text/plain; charset="windows-1251"

\xef\xf0\xe8\xe2\xe5\xf2
'''

# notice use of modern EmailMessage API, by specifying a policy
mail = email.message_from_bytes(content, policy=default)
# print(mail.get("subject"))

# candidate encodings: whatever charsets the message parts declare
charsets = set(mail.get_charsets()) - {None}

for header, value in mail.raw_items():
    if header == "Subject":
        # turn the surrogate-escaped string back into the original bytes
        value = value.encode("utf-8", "surrogateescape")
        for enc in charsets:
            try:
                print(value.decode(enc))
                break
            except (UnicodeEncodeError, UnicodeDecodeError):
                pass
This crude heuristic could still misfire in a number of situations. If you know the encoding, probably hardcode it.
To the extent that mail clients are able to display the raw header correctly, I'm guessing it's mainly pure luck. If your system is set up to use code page 1251 by default, that probably helps some clients. Some mail clients also let you manually select an encoding for each message, so you can play around until you get the right one (and perhaps leave it at that setting if you receive many messages with this problem).

How to find out address in binary file with python code only?

I have a binary, for example https://github.com/andrew-d/static-binaries/blob/master/binaries/linux/x86_64/nmap
1) How do I find the address of this series of bytes: 48 8B 45 A8 48 8D 1C 02 48 8B 45 C8? The result needs to be 0x6B0C67.
2) How do I find the 12 bytes at address 0x6B0C67? The result needs to be 48 8B 45 A8 48 8D 1C 02 48 8B 45 C8.
3) How do I find which address references a specific string? For example, i + 1 == features[i].index, which is located at 0x6FC272? The result needs to be 0x4022F6.
How can I find all of this without opening IDA, using only Python/C code?
Thanks
For 1) Is your file small enough to be loaded into memory? Then it's as simple as
offset = open(file, 'rb').read().find(
    bytes.fromhex("48 8B 45 A8 48 8D 1C 02 48 8B 45 C8")
)
# offset will be -1 if not found
If not, you will need to read it in chunks.
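A chunked search could look roughly like the sketch below. The function name `find_in_file` and the `chunk_size` parameter are made up for illustration; the one detail that matters is carrying over the last `len(pattern) - 1` bytes of each window so that a match straddling two chunks is not missed. The demo runs on a throwaway file rather than on the nmap binary.

```python
import os
import tempfile

def find_in_file(path, pattern, chunk_size=1 << 20):
    """Return the file offset of the first occurrence of pattern, or -1."""
    if not pattern:
        return 0
    overlap = len(pattern) - 1
    with open(path, "rb") as stream:
        tail = b""   # bytes carried over from the previous window
        base = 0     # file offset of the first byte of the current window
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                return -1
            window = tail + chunk
            index = window.find(pattern)
            if index != -1:
                return base + index
            # Keep just enough bytes to catch a match across the boundary
            keep = window[-overlap:] if overlap else b""
            base += len(window) - len(keep)
            tail = keep

# Demo: a tiny chunk size forces the pattern to straddle chunk boundaries
pattern = bytes.fromhex("48 8B 45 A8 48 8D 1C 02 48 8B 45 C8")
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\x00" * 1000 + pattern + b"\x00" * 10)
    demo_path = handle.name
try:
    found = find_in_file(demo_path, pattern, chunk_size=7)
    missing = find_in_file(demo_path, b"\xff\xff", chunk_size=7)
finally:
    os.remove(demo_path)
print(found, missing)
```

For very large files, `mmap` from the standard library is another option worth considering, since a mapped file also offers a `find` method.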
For 2), do
with open(file, 'rb') as stream:
    stream.seek(0x6b0c67)
    data = stream.read(12)
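To get the bytes in the same "48 8B 45 ..." form used in the question, the read can be wrapped as in this sketch. The helper names `read_bytes_at` and `format_hex` are hypothetical, and the demo again uses a throwaway file with the pattern placed at a known offset.

```python
import os
import tempfile

def read_bytes_at(path, offset, length):
    """Read `length` bytes starting at file offset `offset`."""
    with open(path, "rb") as stream:
        stream.seek(offset)
        return stream.read(length)

def format_hex(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)

# Demo file: 32 filler bytes followed by the 12 bytes from the question
pattern = bytes.fromhex("48 8B 45 A8 48 8D 1C 02 48 8B 45 C8")
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\x90" * 32 + pattern)
    demo_path = handle.name
try:
    shown = format_hex(read_bytes_at(demo_path, 32, 12))
finally:
    os.remove(demo_path)
print(shown)
```

One caveat, stated as a general property of executables and not something checked against this particular nmap build: the address a disassembler displays is usually a virtual address, which need not equal the offset inside the file. If the two differ, the offset passed to `seek` has to be translated first, for example using the segment information in the ELF program headers.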
I'm afraid I don't understand the question in 3)...

How can I read 32-bit binary numbers from a binary file in python?

I have data files which contain series of 32-bit binary "numbers."
I say "numbers" because the 32 ones and zeros encode what type of data the sensors were picking up, when they picked it up, which sensor it was, etc., so the decimal value of the numbers is of no concern to me. In particular, some (most) of the data will begin with possibly up to 5 zeros.
I simply need a way in python to read these files, get a list containing each 32-bit number, and then I'll need to mess around with it a little (delete some events) and rewrite it to a new file.
Can anyone help me with this? I've tried the following so far but the numbers which should be corresponding to the time data we encode seem to be impossible.
import struct
with open(lm_file, mode='rb') as file:
    bytes_read = file.read(struct.calcsize("I"))
    while bytes_read:
        idList = struct.unpack("I", bytes_read)[0]
        idList = bin(idList)
        print(idList)
        bytes_read = file.read(struct.calcsize("=l"))
Output of hexdump:
00000000 80 0a 83 4d ba a5 80 0c c0 00 7b 42 cb 90 0f 41 |...M......{B...A|
00000010 98 c9 9c 53 4c 15 35 52 d8 54 f7 0a 5d 87 16 4d |...SL.5R.T..]..M|
00000020 89 6a 3f 04 f2 eb c4 4a e2 37 e6 08 23 5e ca 06 |.j?....J.7..#^..|
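One thing worth checking in a case like this is byte order, because `struct.unpack("I", ...)` uses the machine's native order and `bin()` drops leading zeros. The sketch below is only a guess at what might help: it assumes the words are stored big-endian (pass `"<"` if they turn out to be little-endian), and the helper names are hypothetical. The sample data is the first 8 bytes of the hexdump above.

```python
import struct

def read_words(data, byte_order=">"):
    """Unpack a bytes object into a list of unsigned 32-bit integers."""
    count = len(data) // 4
    return list(struct.unpack(f"{byte_order}{count}I", data[:count * 4]))

def write_words(words, byte_order=">"):
    """Pack a list of unsigned 32-bit integers back into bytes."""
    return struct.pack(f"{byte_order}{len(words)}I", *words)

sample = bytes.fromhex("80 0a 83 4d ba a5 80 0c")

words = read_words(sample)
# format(..., "032b") keeps leading zeros, unlike bin()
bits = [format(word, "032b") for word in words]
for line in bits:
    print(line)

# Filter or edit `words` here, then write them back in the same order
round_trip = write_words(words)
```

In practice the bytes would come from `open(lm_file, "rb").read()`, and the result of `write_words` would go to a new file opened with mode `"wb"`.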

Redirect stdout to a file with unicode encoding while keeping windows eol in python 2

I hit a wall here. I need to redirect all output to a file but I need this file to be encoded in utf-8. Problem is that when using codecs.open:
# errLog = io.open(os.path.join(os.getcwdu(),u'BashBugDump.log'), 'w',
# encoding='utf-8')
errLog = codecs.open(os.path.join(os.getcwdu(), u'BashBugDump.log'),
'w', encoding='utf-8')
sys.stdout = errLog
sys.stderr = errLog
codecs opens the file in binary mode, resulting in \n line terminators. I tried using io.open, but this does not play well with the print statement used all over the codebase (see Python 2.7: print doesn't speak unicode to the io module? or python: TypeError: can't write str to text stream).
I am not the only one having this issue, for instance see here, but the solution they adopted is specific to the logging module, which we do not use.
See also this "won't fix" bug in Python: https://bugs.python.org/issue2131
So what's the one right way of doing this in Python 2?
Option 1
Redirection is a shell operation. You don't have to change the Python code at all, but you do have to tell Python what encoding to use if redirected. That is done with an environment variable. The following code redirects both stdout and stderr to a UTF-8-encoded file:
test.bat
set PYTHONIOENCODING=utf8
python test.py >out.txt 2>&1
test.py
#coding:utf8
import sys
print u"我不喜欢你女朋友!"
print >>sys.stderr, u"你需要一个新的。"
out.txt (encoded in UTF-8)
我不喜欢你女朋友!
你需要一个新的。
Hex dump of out.txt
0000: E6 88 91 E4 B8 8D E5 96 9C E6 AC A2 E4 BD A0 E5
0010: A5 B3 E6 9C 8B E5 8F 8B EF BC 81 0D 0A E4 BD A0
0020: E9 9C 80 E8 A6 81 E4 B8 80 E4 B8 AA E6 96 B0 E7
0030: 9A 84 E3 80 82 0D 0A
Note: You do need to print Unicode strings for this to work. Print byte strings and you get the bytes you print.
Option 2
codecs.open may force binary mode, but codecs.getwriter doesn't. Give it a file opened in text mode:
#coding:utf8
import sys
import codecs
sys.stdout = sys.stderr = codecs.getwriter('utf8')(open('out.txt','w'))
print u"我不喜欢你女朋友!"
print >>sys.stderr, u"你需要一个新的。"
(same output and hexdump as above)
It appears that the Python 2 version of io doesn't play well with the print statement, but it will work if you use the print function.
Demo:
from __future__ import print_function
import sys
import io
errLog = io.open('test.log', mode='wt', buffering=1, encoding='utf-8', newline='\r\n')
sys.stdout = errLog
print(u'This is a ™ test')
print(u'Another © line')
contents of 'test.log'
This is a ™ test
Another © line
hexdump of 'test.log'
00000000 54 68 69 73 20 69 73 20 61 20 e2 84 a2 20 74 65 |This is a ... te|
00000010 73 74 0d 0a 41 6e 6f 74 68 65 72 20 c2 a9 20 6c |st..Another .. l|
00000020 69 6e 65 0d 0a |ine..|
00000025
I ran this code on Python 2.6 on Linux, YMMV.
If you don't want to use the print function, you can implement your own file-like encoding class.
import sys
class Encoder(object):
    def __init__(self, fname):
        self.file = open(fname, 'wb')
    def write(self, s):
        self.file.write(s.replace('\n', '\r\n').encode('utf-8'))
errlog = Encoder('test.log')
sys.stdout = errlog
sys.stderr = errlog
print 'hello\nthere'
print >>sys.stderr, u'This is a ™ test'
print u'Another © line'
print >>sys.stderr, 1, 2, 3, 4
print 5, 6, 7, 8
contents of 'test.log'
hello
there
This is a ™ test
Another © line
1 2 3 4
5 6 7 8
hexdump of 'test.log'
00000000 68 65 6c 6c 6f 0d 0a 74 68 65 72 65 0d 0a 54 68 |hello..there..Th|
00000010 69 73 20 69 73 20 61 20 e2 84 a2 20 74 65 73 74 |is is a ... test|
00000020 0d 0a 41 6e 6f 74 68 65 72 20 c2 a9 20 6c 69 6e |..Another .. lin|
00000030 65 0d 0a 31 20 32 20 33 20 34 0d 0a 35 20 36 20 |e..1 2 3 4..5 6 |
00000040 37 20 38 0d 0a |7 8..|
00000045
Please bear in mind that this is just a quick demo. You may want a more sophisticated way to handle newlines, eg you probably don't want to replace \n if it's already preceded by \r. OTOH, with normal Python text handling that shouldn't be an issue...
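As a rough sketch of that more careful newline handling, a negative lookbehind can restrict the replacement to line feeds that are not already preceded by a carriage return. The helper name `to_crlf` is made up; the snippet runs on Python 3, and nothing in it should be specific to that version.

```python
import re

# A "\n" that is not already preceded by "\r"
_BARE_LF = re.compile(r'(?<!\r)\n')

def to_crlf(text):
    """Convert bare LF to CRLF and leave existing CRLF pairs alone."""
    return _BARE_LF.sub('\r\n', text)

mixed = to_crlf(u'hello\nthere\r\ndone\n')
doubled = to_crlf(u'\n\n')
```

This only looks inside a single string, so a '\r' at the end of one write call followed by a '\n' at the start of the next would still be doubled up; handling that would need a little state in the Encoder class.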
Here's yet another version which combines the 2 previous strategies. I don't know if it's any faster than the second version.
import sys
import io
class Encoder(object):
    def __init__(self, fname):
        self.file = io.open(fname, mode='wt', encoding='utf-8', newline='\r\n')
    def write(self, s):
        self.file.write(unicode(s))
errlog = Encoder('test.log')
sys.stdout = errlog
sys.stderr = errlog
print 'hello\nthere'
print >>sys.stderr, u'This is a ™ test'
print u'Another © line'
print >>sys.stderr, 1, 2, 3, 4
print 5, 6, 7, 8
This produces the same output as the previous version.

Why do i get an added 'i' character at the end of string while using .writelines() in Python ?

I used this Python code to get 3 inputs and write them to a file, but when I looked at the file named output.txt under /tmp, I saw an added 'i' at the end of the string. To illustrate, when I give a, b and c as input, respectively, I get output like this: a b ci. What can I do to fix this problem?
name = raw_input("Name:")
email = raw_input("Email:")
phone = raw_input("Telephone:")
a=open("/tmp/output.txt","w")
a.writelines( name+' '+email+' '+phone)
a.close()
I don't:
$ cat >/tmp/t <<_EOF_
> name = raw_input("Name:")
> email = raw_input("Email:")
> phone = raw_input("Telephone:")
>
> a=open("/tmp/output.txt","w")
> a.writelines( name+' '+email+' '+phone)
> a.close()
> _EOF_
$ (echo "John Smith"; echo "nobody@example.com"; echo "123-45-678") | python /tmp/t
$ hexdump -C /tmp/output.txt
00000000 4a 6f 68 6e 20 53 6d 69 74 68 20 6e 6f 62 6f 64 |John Smith nobod|
00000010 79 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 20 31 32 |y@example.com 12|
00000020 33 2d 34 35 2d 36 37 38 |3-45-678|
00000028
Either phone has 'i' at the end or something later appends 'i' to the file.
.writelines() accepts a sequence of strings. It works here by accident (any string is also a sequence of one character strings). To write a string to file you could use .write() instead. Add '\n' at the end if you need it; neither .writelines() nor .write append a newline for you.
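The difference between the two methods can be seen in a small sketch. It uses `io.StringIO` as a stand-in for the real file and Python 3 syntax, both of which are choices made here for illustration and not part of the original code.

```python
import io

# A plain string is iterated character by character, so this
# "works" only because the characters are written one after another
buf = io.StringIO()
buf.writelines("a b c")
from_string = buf.getvalue()

# The intended use: a sequence of strings, written back to back
buf = io.StringIO()
buf.writelines(["a", " ", "b", " ", "c"])
from_list = buf.getvalue()

# write() takes one string; the newline has to be added explicitly
buf = io.StringIO()
buf.write("a b c" + "\n")
from_write = buf.getvalue()
```

All three calls produce the same visible text, and none of them adds a stray character, which supports the point above that the extra 'i' has to come from somewhere else.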
Try print:
print phone
to see whether the content of the variable "phone" is what you expect.
