DEFLATE discrepancy?

So I'm trying to create a python script to generate a level for a game made in MMF2+Lua, and I've run into something I can't figure out how to fix.
Generating a 16x16 empty level with borders with the game gives this (deflated):
78 5E 63 20 0A FC 27 00 40 86 8C AA C1 1D 02 23 3D 7C 08 27 32 00 9F 62 FE 10
which should be a flattened 18x18 array with the edge having 0x00, and the rest having 0xFF.
My Python script generates this when passing the exact same input to zlib.compress:
78 9C 63 60 20 06 FC 27 00 46 D5 8C AA C1 A7 86 30 00 00 9F 62 FE 10
They're different, but inflating them gives the same exact data. However, when I put the data into the game, it crashes when trying to load the level.
What's really different between the two values, and am I able to fix it?

Those are two different encodings of the same data, both valid. They differ in the sequence of copies. Here are readable forms of both, first from the game:
! infgen 2.6 output
!
zlib
!
last
fixed
literal 0
match 37 1
literal 255
match 31 1
match 4 69
match 258 36
match 26 258
match 256 288
match 34 613
end
!
adler
then from zlib:
! infgen 2.6 output
!
zlib
!
last
fixed
literal 0 0
match 36 1
literal 255
match 31 1
match 258 36
match 258 36
match 28 36
match 34 1
end
!
adler
literal gives a byte or bytes inserted in the stream. match is a copy of previous bytes in the stream (possibly overlapped with bytes being copied), where the first parameter is the number of bytes to copy, and the second parameter is the distance back in bytes to copy from.
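To check that the two listings really describe the same bytes, the literal/match operations can be replayed by hand. The following is only a sketch: `expand`, `GAME_OPS`, `ZLIB_OPS`, and `LEVEL` are names made up for this illustration, the operations are transcribed from the two infgen listings above, and the two-bytes-per-cell layout of the expected level is an assumption inferred from the 648-byte total, not something stated in the question.

```python
def expand(ops):
    """Replay infgen-style literal/match operations into a byte string."""
    out = bytearray()
    for kind, *args in ops:
        if kind == "literal":
            out.extend(args)  # one or more literal byte values
        elif kind == "match":
            length, distance = args
            # Copy byte by byte so a match may overlap the bytes it produces.
            for _ in range(length):
                out.append(out[-distance])
        else:
            raise ValueError(f"unknown operation: {kind}")
    return bytes(out)


# Transcribed from the listing of the game's stream.
GAME_OPS = [
    ("literal", 0), ("match", 37, 1),
    ("literal", 255), ("match", 31, 1),
    ("match", 4, 69), ("match", 258, 36), ("match", 26, 258),
    ("match", 256, 288), ("match", 34, 613),
]

# Transcribed from the listing of the zlib stream.
ZLIB_OPS = [
    ("literal", 0, 0), ("match", 36, 1),
    ("literal", 255), ("match", 31, 1),
    ("match", 258, 36), ("match", 258, 36), ("match", 28, 36),
    ("match", 34, 1),
]

# Expected level. Assumption: 18x18 cells, 2 bytes per cell,
# 0x00 on the border and 0xFF everywhere else.
SIDE, CELL = 18, 2
border_row = b"\x00" * (SIDE * CELL)
inner_row = b"\x00" * CELL + b"\xff" * ((SIDE - 2) * CELL) + b"\x00" * CELL
LEVEL = border_row + inner_row * (SIDE - 2) + border_row

print(expand(GAME_OPS) == expand(ZLIB_OPS) == LEVEL)
```

Both operation lists expand to the same 648 bytes, so the difference between the streams lies only in how the compressor chose its copies, not in the data they carry.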

Related

Python email module behaves unexpected when trying to parse "raw" subject lines

I have trouble parsing an email which is encoded in win-1252 and contains the following header (literally like that in the file):
Subject: Счета на оплату по заказу . .
Here is a hexdump of that area:
000008a0 56 4e 4f 53 41 52 45 56 40 41 43 43 45 4e 54 2e |VNOSAREV@ACCENT.|
000008b0 52 55 3e 0d 0a 53 75 62 6a 65 63 74 3a 20 d1 f7 |RU>..Subject: ..|
000008c0 e5 f2 e0 20 ed e0 20 ee ef eb e0 f2 f3 20 ef ee |... .. ...... ..|
000008d0 20 e7 e0 ea e0 e7 f3 20 20 20 2e 20 20 2e 20 20 | ...... . . |
000008e0 20 20 0d 0a 58 2d 4d 61 69 6c 65 72 3a 20 4f 73 | ..X-Mailer: Os|
000008f0 74 72 6f 53 6f 66 74 20 53 4d 54 50 20 43 6f 6e |troSoft SMTP Con|
I realize that this encoding doesn't adhere to the usual RFC 1342 style encoding of =?charset?encoding?encoded-text?= but I assume that many email clients will still correctly display the subject and hence I would like to extract it correctly as well. For context: I am not making these emails up or creating them, they are given and I need to deal with them as is.
My approach so far was to use the email module that comes with Python:
import email
with open('data.eml', 'rb') as fp:
    content = fp.read()
mail = email.message_from_bytes(content)
print(mail.get('subject'))
# ����� �� ������ �� ������ . .
print(mail.get('subject').encode())
# '=?unknown-8bit?b?0ffl8uAg7eAg7u/r4PLzIO/uIOfg6uDn8yAgIC4gIC4gICAg?='
My questions are:
can I somehow convince the email module to parse mails with subjects like this correctly?
if not, can I somehow access the "raw" data of this header? i.e. the entries of mail._headers without accessing private properties?
if not, can someone recommend a more versatile Python module for email parsing?
Some random observations:
a) Poking around in the internal data structure of mail, I arrived at [hd[1] for hd in mail._headers if hd[0] == 'Subject'] which is:
['\udcd1\udcf7\udce5\udcf2\udce0 \udced\udce0 \udcee\udcef\udceb\udce0\udcf2\udcf3 \udcef\udcee \udce7\udce0\udcea\udce0\udce7\udcf3 . . ']
b) According to the docs, mail.get_charsets() returns a list of character sets in the case of a multipart message, and it returns [None, 'windows-1251', None] here. So, at least theoretically, the module does have a chance of guessing the correct charset.
For completeness, the SHA256 hash of the email file is 1aee4d068c2ae4996a47a3ae9c8c3fa6295a14b00d9719fb5ac0291a229b4038 (and I uploaded it to MalShare and VirusTotal).

The string you are seeing is just a normal unicode string which contains a lot of characters from the low surrogate range. I am quite sure that in this case, the string came about by using the .decode method with a surrogateescape error handler. Indeed:
In [1]: a = "Счета на оплату по заказу"
In [2]: a.encode("windows-1251").decode("utf8", "surrogateescape")
Out[2]: '\udcd1\udcf7\udce5\udcf2\udce0 \udced\udce0 \udcee\udcef\udceb\udce0\udcf2\udcf3 \udcef\udcee \udce7\udce0\udcea\udce0\udce7\udcf3'
To undo the damage, you should be able to use .encode("utf8", "surrogateescape").decode("windows-1251").
It is unclear to me whether they actually used utf8 with the surrogateescape handler, and you would have to match the charset that they (incorrectly) decode with. However, since the string matches yours perfectly, I think utf8 is what is being used.
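The round trip described above can be wrapped in a small helper. This is a minimal sketch: `recover_header` is a hypothetical name, and the default charset is an assumption taken from the `windows-1251` value reported by `get_charsets()` in the question. Re-encoding with `surrogateescape` turns each escaped code point (U+DC80 to U+DCFF) back into its original byte, so the result can then be decoded with the charset that was actually used.

```python
def recover_header(value, charset="windows-1251"):
    """Turn a surrogate-escaped header string back into readable text."""
    # Each lone surrogate U+DCxx becomes the original byte 0xxx again.
    raw = value.encode("utf-8", "surrogateescape")
    return raw.decode(charset)


original = "Счета на оплату по заказу"
# Reproduce the mangled form shown in the question.
mangled = original.encode("windows-1251").decode("utf-8", "surrogateescape")

print(recover_header(mangled))
```

If the charset is wrong, `decode` may either raise `UnicodeDecodeError` or silently return the wrong letters, so the caller still has to know or guess the encoding.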

mail.get_charsets() probably returns the right values. Decoding the bytes from the hexdump you provided (hard-coded here) as windows-1251 gives readable text:
x = '56 4e 4f 53 41 52 45 56 40 41 43 43 45 4e 54 2e' + \
'52 55 3e 0d 0a 53 75 62 6a 65 63 74 3a 20 d1 f7' + \
'e5 f2 e0 20 ed e0 20 ee ef eb e0 f2 f3 20 ef ee' + \
'20 e7 e0 ea e0 e7 f3 20 20 20 2e 20 20 2e 20 20' + \
'20 20 0d 0a 58 2d 4d 61 69 6c 65 72 3a 20 4f 73' + \
'74 72 6f 53 6f 66 74 20 53 4d 54 50 20 43 6f 6e'
print(bytes.fromhex(x).decode('windows-1251'))
VNOSAREV@ACCENT.RU>
Subject: Счета на оплату по заказу . .
X-Mailer: OstroSoft SMTP Con

Your mail.get('subject').encode() does return exactly the bytes you put in. There is no "correctly" beyond this point; you have to know, or guess, the correct encoding.
mail.raw_items() returns what purports to be the "raw" headers from the message, but they are actually encoded. @Jesko's answer shows how to take the encoded value and transform it back to the original bytes, provided you know which encoding to use.
(The surrogate encoding is apparently a hack to allow Python to keep the raw bytes in a form which cannot accidentally leak back into a proper decoded string. You have to know how it was assembled and explicitly request it to be undone.)
Going out on a limb, you can try all the encodings of the body of the message, and check if any of them return a useful decoding.
The following uses the modern EmailMessage API, where mail.get('subject').encode() no longer behaves like in your example (I think perhaps this is a bug?):
import email
from email.policy import default

content = b'''\
From: <VNOSAREV@ACCENT.example.RU>
Subject: \xd1\xf7\xe5\xf2\xe0 \xed\xe0 \xee\xef\xeb\xe0\xf2\xf3 \xef\xee \xe7\xe0\xea\xe0\xe7\xf3 . .
Content-type: text/plain; charset="windows-1251"

\xef\xf0\xe8\xe2\xe5\xf2
'''
# notice use of modern EmailMessage API, by specifying a policy
mail = email.message_from_bytes(content, policy=default)
# print(mail.get("subject"))
# candidate encodings: every charset declared anywhere in the message
charsets = set(mail.get_charsets()) - {None}
for header, value in mail.raw_items():
    if header == "Subject":
        # recover the original bytes from the surrogate-escaped string
        value = value.encode("utf-8", "surrogateescape")
        for enc in charsets:
            try:
                print(value.decode(enc))
                break
            except (UnicodeEncodeError, UnicodeDecodeError):
                pass
This crude heuristic could still misfire in a number of situations. If you know the encoding, probably hardcode it.
To the extent that mail clients are able to display the raw header correctly, I'm guessing it's mainly pure luck. If your system is set up to use code page 1251 by default, that probably helps some clients. Some mail clients also let you manually select an encoding for each message, so you can play around until you get the right one (and perhaps leave it at that setting if you receive many messages with this problem).

Extract specific bytes in payload from a pcap file using scapy

I am trying to extract a specific byte from each packet in a pcap file. All packets are ICMP.
In the data section, there is a byte that changes each packet. It is in the same position for each one. I would like to extract that byte.
Using scapy:
from scapy.all import rdpcap, PacketList
pkts = rdpcap('test.pcap')
pl = PacketList([p for p in pkts])
bytes(pl[12].payload)
returns the following:
b'E\x00\x00T]\xa7\x00\x00***J***\x01!A\xc0\xa88\x01\xc0\xa88o\x08\x004\xe9\xbf2\x00\x00^"\x87\xbe\x00\x0c2\xf4\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./01234567'
I have enclosed the byte that I want to extract within three stars. However, if I print out the bytes for each packet, the byte that I want to extract will be in a different offset.
If I run a hexdump for each packet as follows:
hexdump(bytes(pl[12].payload))
The specific byte I want to extract is always in the same position, but I don't know how to extract it.
How can I extract a specific byte from a pcap using scapy?
Following this answer here: Get specific bytes in payload from a pcap file
If I execute the same command, it doesn't do anything useful:
>>> hexdump(pkts[14][2].load[8])
0000 00 00 00 00 00 00 00 00 ........
>>>

You want the TTL?
Let's start at the high level, and move down.
Scapy is giving you the constructed packet. If you want the TTL of the packet, call the attribute:
>>> plist[182].ttl
64
If you want to get the specific byte of the packet, let's look at the hexdump:
>>> hexdump(plist[182])
0000 AA BB CC 66 42 DE AA BB CC 3F 52 A3 08 00 45 00 .a.lM..M.AK...E.
0010 00 5B 58 B9 40 00 40 06 64 96 C0 A8 01 28 AC D9 .[X.@.@.d....(..
...
These are in hex, the first field 0000 is the offset, then 16 bytes in hex, then ascii.
Offset Bytes ASCII
====== =============================================== ================
0000 AA BB CC 66 42 DE AA BB CC 3F 52 A3 08 00 45 00 .a.lM..M.AK...E.
Offsets start at 0, so the byte addresses are 0..15 for the first line.
On the second line, the offset is 16 (16 * 1), so the byte addresses are 16..31.
On the third line, the offset is 32 (16 * 2), so the byte addresses are 32..47.
You highlighted the 7th byte on row 2:
Offset  Bytes                                            ASCII
        0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
======  ===============================================  ================
0010    00 5B 58 B9 40 00 40 06 64 96 C0 A8 01 28 AC D9  .[X.@.@.d....(..
That address is:
offset + byte_address.
offset = 16 * 1
byte_address = 6
Which gives us:
16 + 6 = 22
With that, we can now get byte address 22 out of the raw packet:
>>> b = raw(plist[182])
>>> b[22]
64
Note that Wireshark packet numbers start at 1, while the packets in Python start at 0. So in my example, packet 182 corresponds to packet 183 in Wireshark.
plist[182].payload gives you the IP portion of the packet, so the offsets are going to be different since we aren't looking at the whole packet anymore. We can get the same value using the '.ttl' attribute. Or, knowing that the address is byte 8 in the IP header:
>>> plist[182].payload.ttl
64
>>> raw(plist[182].payload)[8]
64
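The offset arithmetic above can be written down as a tiny helper. This is a sketch in plain Python, without scapy, so it runs on its own: `byte_at` is a hypothetical name, and `frame` holds just the two hexdump rows shown in this answer rather than a real capture.

```python
def byte_at(data, row_offset, column):
    """Return the byte at a hexdump position (row offset plus column)."""
    return data[row_offset + column]


# The first two rows of the hexdump shown above.
frame = bytes.fromhex(
    "AA BB CC 66 42 DE AA BB CC 3F 52 A3 08 00 45 00 "
    "00 5B 58 B9 40 00 40 06 64 96 C0 A8 01 28 AC D9 "
)

ttl_from_frame = byte_at(frame, 0x10, 6)  # row 0010, column 6 -> byte 22
ip_header = frame[14:]                    # skip the 14-byte Ethernet header
ttl_from_ip = ip_header[8]                # TTL is byte 8 of the IP header

print(ttl_from_frame, ttl_from_ip)
```

With scapy the same index would presumably be applied to every packet in the capture, for example `[raw(p.payload)[8] for p in rdpcap('test.pcap')]`, assuming each packet is an Ethernet frame carrying IP as in the question.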

How to find out address in binary file with python code only?

I have a binary, for example https://github.com/andrew-d/static-binaries/blob/master/binaries/linux/x86_64/nmap
1) How can I find the address of this series of bytes: 48 8B 45 A8 48 8D 1C 02 48 8B 45 C8? The result should be 0x6B0C67.
2) How can I read the 12 bytes at address 0x6B0C67? The result should be 48 8B 45 A8 48 8D 1C 02 48 8B 45 C8.
3) How can I find which address references a specific string? For example, the string i + 1 == features[i].index is located at 0x6FC272; the result should be 0x4022F6.
How can I find all of this without opening IDA, using only Python/C code?
Thanks

For 1): is your file small enough to be loaded into memory? Then it's as simple as
offset = open(file, 'rb').read().find(
    bytes.fromhex("48 8B 45 A8 48 8D 1C 02 48 8B 45 C8")
)
# offset will be -1 if not found
If not, you will need to read it in chunks.
For 2), do
with open(file, 'rb') as stream:
    stream.seek(0x6b0c67)
    data = stream.read(12)
I'm afraid I don't understand the question in 3)...
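For the "read it in chunks" case, one possible approach is to keep the last len(needle) - 1 bytes of each chunk so that a match straddling a chunk boundary is not missed. The sketch below is an illustration only: `find_in_file` and its `chunk_size` parameter are hypothetical names, and the demo runs against a small temporary file instead of the nmap binary. Note also that these are file offsets; a disassembler normally shows virtual addresses, which differ from file offsets by the load address of the containing segment, so the two numbers should not be expected to match directly.

```python
import os
import tempfile


def find_in_file(path, needle, chunk_size=1 << 20):
    """Return the file offset of the first occurrence of needle, or -1."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1
    base = 0    # file offset of buf[0]
    buf = b""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                return -1
            buf += chunk
            index = buf.find(needle)
            if index != -1:
                return base + index
            # Keep only the tail that could still start a match.
            start = max(0, len(buf) - overlap)
            base += start
            buf = buf[start:]


# Demo on a throwaway file; the tiny chunk size forces boundary crossings.
needle = bytes.fromhex("48 8B 45 A8 48 8D 1C 02 48 8B 45 C8")
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.bin")
    with open(path, "wb") as f:
        f.write(b"\x00" * 10 + needle + b"\x00" * 5)
    found = find_in_file(path, needle, chunk_size=4)
    missing = find_in_file(path, b"\xde\xad\xbe\xef", chunk_size=4)

print(found, missing)
```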

How can I read 32-bit binary numbers from a binary file in python?

I have data files which contain series of 32-bit binary "numbers."
I say "numbers" because the 32 bits encode what type of data the sensors were picking up, when, which sensor, etc., so the decimal value of the numbers is of no concern to me. In particular, some (most) of the words will begin with up to 5 zeros.
I simply need a way in python to read these files, get a list containing each 32-bit number, and then I'll need to mess around with it a little (delete some events) and rewrite it to a new file.
Can anyone help me with this? I've tried the following so far but the numbers which should be corresponding to the time data we encode seem to be impossible.
with open(lm_file, mode='rb') as file:
    bytes_read = file.read(struct.calcsize("I"))
    while bytes_read:
        idList = struct.unpack("I", bytes_read)[0]
        idList=bin(idList)
        print(idList)
        bytes_read = file.read(struct.calcsize("=l"))
Output of hexdump:
00000000 80 0a 83 4d ba a5 80 0c c0 00 7b 42 cb 90 0f 41 |...M......{B...A|
00000010 98 c9 9c 53 4c 15 35 52 d8 54 f7 0a 5d 87 16 4d |...SL.5R.T..]..M|
00000020 89 6a 3f 04 f2 eb c4 4a e2 37 e6 08 23 5e ca 06 |.j?....J.7..#^..|
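One thing that may be worth checking here is byte order and zero padding. The sketch below is not a confirmed fix: `read_words`, `to_bits`, and `write_words` are hypothetical helpers, and big-endian (`">"`) is purely an assumption, since the question does not say how the words were written. Two details matter either way: a bare `"I"` format uses the machine's native byte order, and `bin()` drops leading zeros, whereas `format(word, "032b")` always yields 32 digits.

```python
import struct


def read_words(data, byte_order=">"):
    """Split raw bytes into unsigned 32-bit words (ignores a partial tail)."""
    usable = len(data) - len(data) % 4
    return [w for (w,) in struct.iter_unpack(byte_order + "I", data[:usable])]


def to_bits(word):
    """Render a word as exactly 32 binary digits, keeping leading zeros."""
    return format(word, "032b")


def write_words(words, byte_order=">"):
    """Pack the (possibly filtered) words back into bytes."""
    return struct.pack(f"{byte_order}{len(words)}I", *words)


# First eight bytes of the hexdump in the question.
sample = bytes.fromhex("80 0a 83 4d ba a5 80 0c")
words = read_words(sample)           # big-endian interpretation
words_le = read_words(sample, "<")   # little-endian interpretation

for word in words:
    print(to_bits(word))
```

Comparing the two interpretations against a word whose fields are known should show which byte order the file uses; after deleting events from the list, `write_words` produces the bytes for the new file.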

Reading fortran binary (streaming access) with np.fromfile or open & struct

The following Fortran code:
INTEGER*2 :: i, Array_A(32)
Array_A(:) = (/ (i, i=0, 31) /)
OPEN (unit=11, file = 'binary2.dat', form='unformatted', access='stream')
Do i=1,32
WRITE(11) Array_A(i)
End Do
CLOSE (11)
Produces streaming binary output with numbers from 0 to 31 in integer 16bit. Each record is taking up 2 bytes, so they are written at byte 1, 3, 5, 7 and so on. The access='stream' suppresses the standard header of Fortran for each record (I need to do that to keep the files as tiny as possible).
Looking at it with a Hex-Editor, I get:
00 00 01 00 02 00 03 00 04 00 05 00 06 00 07 00
08 00 09 00 0A 00 0B 00 0C 00 0D 00 0E 00 0F 00
10 00 11 00 12 00 13 00 14 00 15 00 16 00 17 00
18 00 19 00 1A 00 1B 00 1C 00 1D 00 1E 00 1F 00
which is completely fine (despite the fact that the second byte is never used, because decimals are too low in my example).
Now I need to import these binary files into Python 2.7, but I can't. I tried many different routines, but I always fail in doing so.
1. attempt: "np.fromfile"
with open("binary2.dat", 'r') as f:
    content = np.fromfile(f, dtype=np.int16)
returns
[ 0 1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22 23
24 25 0 0 26104 1242 0 0]
2. attempt: "struct"
import struct
with open("binary2.dat", 'r') as f:
    content = f.readlines()
struct.unpack('h' * 32, content)
delivers
struct.error: unpack requires a string argument of length 64
because
print content
['\x00\x00\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00\x07\x00\x08\x00\t\x00\n', '\x00\x0b\x00\x0c\x00\r\x00\x0e\x00\x0f\x00\x10\x00\x11\x00\x12\x00\x13\x00\x14\x00\x15\x00\x16\x00\x17\x00\x18\x00\x19\x00']
(note the delimiter, the t and the n which shouldn't be there according to what Fortran's "streaming" access does)
3. attempt: "FortranFile"
f = FortranFile("D:/Fortran/Sandbox/binary2.dat", 'r')
print(f.read_ints(dtype=np.int16))
With the error:
TypeError: only length-1 arrays can be converted to Python scalars
(remember how it detected a delimiter in the middle of the file, but it would also crash for shorter files without line break (e.g. decimals from 0 to 8))
Some additional thoughts:
Python seems to have troubles with reading parts of the binary file. For np.fromfile it reads Hex 19 (dec: 25), but crashes for Hex 1A (dec: 26). It seems to be confused with the letters, although 0A, 0B ... work just fine.
For attempt 2 the content-result is weird. Decimals 0 to 8 work fine, but then there is this strange \t\x00\n thing. What is it with hex 09 then?
I've been spending hours trying to find the logic, but I'm stuck and really need some help. Any ideas?

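The problem is the file open mode. The default is text mode; open the file in binary mode instead:
with open("binary2.dat", 'rb') as f:
    content = np.fromfile(f, dtype=np.int16)
and all the numbers will be read successfully. In text mode on Windows, the byte 0x1A (Ctrl-Z) is treated as end-of-file and line endings are translated, which would explain why the read stopped after 25 (hex 19). See the Binary Files chapter of Dive Into Python for more details.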
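The same fix applies to the struct attempt: read the whole file as bytes in binary mode and unpack it in one call, instead of using readlines(). The following is a sketch written for Python 3 with only the standard library; it recreates the 64-byte file in a temporary directory, and the little-endian layout (`"<"`) is an assumption read off the hex editor output in the question (00 00 01 00 02 00 ...).

```python
import os
import struct
import tempfile

# Recreate the file contents: 32 little-endian 16-bit integers, 0..31.
payload = struct.pack("<32h", *range(32))

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "binary2.dat")
    with open(path, "wb") as f:
        f.write(payload)

    # Binary mode: no newline translation, no special end-of-file byte.
    with open(path, "rb") as f:
        content = f.read()

count = len(content) // 2  # two bytes per INTEGER*2 value
decoded = list(struct.unpack("<%dh" % count, content))

print(decoded)
```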
