Python read .bin data and convert to string

I have multiple .bin files and I want to extract the data from them, but the results I'm getting are pretty weird.
For example, my first file does the following:
path = r'D:\lut.bin'  # raw string so the backslash is not treated as an escape
with open(path, 'rb') as file:  # 'b' is important -> binary mode
    fileContent = file.read()
print(fileContent)
Output:
xc7\xfb\x99\x0c\x8e\xf9~7\xb9a\xb1*\x06\xd2o\xb8\xb2 \x81\x8bg\xd2\xc6bE\r\xb9KL7\xa0\xa52\xa5\xd2\x17\x86(\xe9\x02\xbf\xeb\x8fDvk\xe7\x8d\x03\x872\x9fop\xbck\xe1\x94\x02\xdc\xef\x85I\t\xc8\x8d\xdfl\x90\xcf*\xb1\x02(\x16~)\xc7\xa2\x1f\xf6o\xdc\x1en\x84H\xf6%\xfaW/\xee\xbc\xdd^/\x9b\x9a\xe5\x99\xa2\xd7\xe4\x93U\xd4\xef$\xa5\x8aW\xf6\xc9\xb0T\xe3<\x147\xcc\x08}\xc8\x15J3v\n\x9d\x16\xa3\x8d\r\xa2\xc4\x15\xf13!\xa2\x01\x14\xef\xaf\x06\x83p\xa7Ot\x8cr\xdf\xef\xbe\x93\xc2D`y\\\xdb\x8a\x1c\\H\x9cE\xabF\xd6\xe1B\xdd\xbc\x8a\xdb\x06|\x05{!\xf0K25K0\xb9\xfe\xa6n\xd7-\xd1\xcb\xefQ\xd9w\x08{4\x13\xba8\x06\x00}S\xe4\xd8*\xe2\x81f\x8d\xc4P\xde\x88/\xa6q\x7fG\x99\xbd\xa84v\xcfS+\xc6\xc5#\x0ey\xd8\xcd\xf2!\xf8`1\x03k5\xb9\xee\xb3V\xc3">\xdd\xf4\x94\x1b\x83\xf9\xdbe\xfcw\xf4+O\xf4\xf1\xfc\xa2 \xc5\xccq\xd1\xc8dH\x00\xf7K|7\x87\xa8$\xb8\x92^\x90.\xffK\xbf\xf6\xcaHv9l\xa6\x0e\xd5"\xd6`>}f\xfc\xd1\x15\xd0\xf0\x89\xb7\x12\xdf\xc9\xdfn\x97\xc7O\xf8\x05)Ua|\xd6\xd5\x03P\xf3\xcd\x08 \xc6\xc7\xe2"\xae\x1fz\xb9\xbd\x99\x100\x9a\x8d\xeb\x89\xa3T\xa0\xc7S\xcc\xe4h\xbe\xf3R\xe9\x9d\xf4Y\xe91\xa4%\x85>mn\xc3\x1e\x8a}\x04\xd9:\xb5\xde\x01h\x90y\xfe4&\xea\x1d\x9a\xbd\xac\x1a\x8e{\xb2Y\xcb\xc47\xd8\xe2\xf6\xd6\xdc\x91,]\x1d\xca\x90_sb\x86X\xad]\x8e\xe1A\x1a\xaa\xc6\xdf\x1ca#A\x1a\xa2\t!3\x06y\x92\x96\xebg\xdb3\xdd\x9f\xefh\x9d6\x17c0\x0e\xfe\x9a\x06\x06;\x16\xa7\x
I have no idea what this is, but it does not look like readable text. Is there even a way to convert this?
My other file looks like this:
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
Again with the same code above.
I have tried decoding it, but I keep getting decoding errors, and decoding as UTF-8 doesn't help either.
I want to get the text from this; these files came with a book I downloaded from the Play Store.

.bin files are just binary data: each byte can hold any value between 0 and 255 (binary 00000000 to 11111111, hex 0x00 to 0xFF).
Printable characters are only a subset of those values.
This means not every .bin file can be converted to text.
When displaying a bytes object, Python already visualises the byte stream for you: bytes that correspond to printable ASCII characters are shown as those characters, and every other byte is shown as its \xNN escape code (where each N is a hex digit).
This means the
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UU
is in fact
\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03\x48\x00\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30\x55\x55
[Copy this into your Python interpreter as a bytes literal (i.e. in quotes with a b prefix) and see how it visually converts itself when it is displayed/printed!]
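For example, here is a sketch of what that interpreter session shows (the bytes literal below is the expanded form from above):
>>> data = b'\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03\x48\x00\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30\x55\x55'
>>> data
b'U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UU'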
The parts:
decoded: U
not decoded: \xff\xf3\xe8
decoded: d
not decoded: \x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03
decoded: H
not decoded: \x00\x00\x00\x00
decoded: LAME3.100UU
Can you extract some data from it? Depending on the type of the .bin, you may well find some strings stored directly in there, like the LAME3.100 that looks like some encoder name/version, but I really doubt you will find anything useful. It can be literally anything, just dumped there: text, a photo, a memory dump...

This is very late, but LAME3.100 followed by a bunch of U characters is actually the start of a particular encoding of an .mp3 file, and knowing that it may be incomplete, you could try to convert it with https://ffmpeg.org into a proper .mp3 container.
Once you have ffmpeg on your PATH, a command such as ffmpeg -i "D:/lut.bin" "D:/lut.mp3" should hopefully decode and re-encode it.

Related

how to convert specific characters into string [duplicate]


How to convert the content of a .dat file to a human readable form using Python?

There is a file called "settings.dat" which I want to read and edit. On opening this file through Notepad, I get an unreadable encoding.
I'm thinking this is probably a binary file. And the encoding is probably UTF-16, as far as I can tell. This is how I tried to convert it:
with open('settings.dat', 'rb') as binary_file:
    raw_data = binary_file.read()

str_data = raw_data.decode('utf-16', 'ignore')
print(str_data)
The output is again unreadable, with characters that look Chinese. Isn't this supposed to be a simple bytes-to-string conversion problem? Here is the output:
䕗䙃h 3 Ԁ ː ᙫ ␐☐ᜐ┐Ⱀ⨐ᴐሐ⼐【ㄐ㈐䠐倐䬐䴐ᄐἐḐ‐점퀐쬐촐
.dat files are generic files and can be either binary or text. These files are usually accessed and used only for application support, and each application treats .dat files differently. Hence, .dat files follow no specific format shared by all .dat files, unlike .gif or .docx files.
If you want to understand how a .dat file works and convert it to human-readable form, you need to know beforehand how the application handles these files.
As for the Chinese-looking characters: you tried to decode the binary .dat file as UTF-16. That does not change the file content; it just groups the bytes in pairs, so every 16 bits (bbbbbbbb bbbbbbbb) become one code unit written as four hex digits (xxxx), where the b's are bits and the x's are hexadecimal digits.
Many Unicode code points in that range are Chinese [technically they are called ideographs or ideographic], whereas others are unused, i.e. reserved.
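As a quick interpreter sketch of that effect (using utf-16le explicitly, which is an assumption about the byte order): the first two characters of the output above can be produced from four plain ASCII bytes:
>>> b'WECF'.decode('utf-16le')
'䕗䙃'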
Not a Python answer, but the strings command-line tool is often invaluable for reverse engineering data formats, letting you easily skim through a binary in search of familiar plaintext patterns. Obviously, if some kind of encryption/compression is used (such as the commonly used gzip), it won't help and some preprocessing is needed first.
Calling it is as simple as this:
user@host:~/ $ strings mydir/settings.dat
If it's a binary file, then why do you want to view it? Unless you're aware beforehand that settings.dat contains human-readable characters, it does not make sense to attempt to "find" an encoding so that the output is human-readable characters, because you won't be successful.
On the other hand, if you do know that settings.dat contains human-readable characters, then maybe utf-16 is the wrong encoding.
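If you would rather stay in Python, a rough equivalent of strings can be sketched with a regular expression over the raw bytes (a minimal illustration, not a full replacement for the tool; the minimum run length of 4 is an arbitrary choice):
import re

with open('settings.dat', 'rb') as f:
    data = f.read()

# print every run of at least 4 printable ASCII bytes
for match in re.finditer(rb'[\x20-\x7e]{4,}', data):
    print(match.group().decode('ascii'))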

Converting a PDF file (or any binary) to a string in python (not grab text out of pdf)

I am using an API that only takes strings. It's intended to store things. I would like to be able to read in a binary file, convert the binary data to a string, and store the string. Then I would like to retrieve the string, convert it back to binary, and save the file.
so what I am trying to do is (in python):
PDF -> load into program as string -> store string -> retrieve string -> save as binary PDF file
For example, I have a PDF called PDFfile. I want to read it in:
datafile=open(PDFfile,'rb')
pdfdata=datafile.read()
When I read up on the read function, it says it's supposed to return a string. It does not, or if it does, it is keeping the parts that mark it as binary as well. I have two lines of code later that print it out:
print(pdfdata[:20])
print(str(pdfdata[:20]))
The result is this:
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
Those look like two bytes objects to me, but apparently the second one is a string. When I do type(pdfdata) I get bytes.
I am struggling to get a clean string that represents the PDF file, which I can then convert back to bytes. The API fails if I send the data without stringifying it:
str(pdfdata)
I have also tried playing around with encode and decode, but I get errors that encode/decode can't handle 0xc4, which is apparently in the binary file.
The final oddity:
When I store str(pdfdata) and retrieve it into retdata, I print some bytes out of it and compare to the original:
print(pdfdata[:20])
print(retdata[:20])
I get really different results:
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe
But the data is there; if I show 50 characters of retdata:
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\
Needless to say, when I retrieve the data and store it as a PDF, it's corrupted and doesn't work.
When I save the stringified pdf and the string version of the retrieved data, they are identical. so the storage and retrieval of a string is working fine.
So I think the corruption is happening when I convert to a string.
I know I'm getting loquacious, but you guys like to have all the info.
OK I got this to work. The key was to properly encode the binary data BEFORE it was turned into a string.
Step 1) Read in binary data
datafile=open(PDFfile,'rb')
pdfdatab=datafile.read() #this is binary data
datafile.close()
Step 2) base64-encode the data (the result is still a bytes object)
import codecs
b64PDF = codecs.encode(pdfdatab, 'base64')
Step 3) convert the base64 bytes into a string
Sb64PDF=b64PDF.decode('utf-8')
Now the string can be restored. To get it back, you just go through the reverse. Load string data from storage into string variable retdata.
# so we have a string and want it to be bytes
bretdata = retdata.encode('utf-8')
# now let's get it back into the binary file format
bPDFout = codecs.decode(bretdata, 'base64')
# open a new file and write the decoded binary data into it
datafile=open(newPDFFile,'wb')
datafile.write(bPDFout)
datafile.close()
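For what it's worth, the standard base64 module does the same round trip a bit more directly; here is a minimal sketch (the file names are placeholders, not from the question):
import base64

with open('input.pdf', 'rb') as datafile:
    pdfdatab = datafile.read()                        # binary data
Sb64PDF = base64.b64encode(pdfdatab).decode('ascii')  # bytes -> base64 string

# ...store and later retrieve the string...

bPDFout = base64.b64decode(Sb64PDF)                   # base64 string -> original bytes
with open('output.pdf', 'wb') as datafile:
    datafile.write(bPDFout)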

Using Python to overwrite resource section in C program

I have a C program that has a resource section.
IDS_STRING 87 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
In a hex editor, each A is stored as the two bytes 41 00.
I use code such as this in Python to search and replace the A's:
str = b'\x00A'*40                     # the A's as they appear in the file
str1 = b"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
if str in file:                       # `file` holds the binary file contents read earlier
    print("A in file")
    f.write(file.replace(str, str1))  # `f` is the output file handle
This makes the new file store just a single byte 42 for each B.
So I am wondering: why are the A's stored as '41 00', but when I overwrite them they are just '42'?
Is this a WCHAR thing?
I did a test where I loaded the string and printed it out.
This is some text.AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
But then when I used my Python and overwrote the A's with the B's it does this..
This is some text.???????????????????????????????????????B
So, with my limited knowledge of C: if I want to put things into the resource section, should I place them in as WCHARs?
UPDATE:
My main issue with this is that I have a hex string similar to the one below:
'685308358035803507835083408303508350835083508350835083083508'
I want to put that into the resource section. But if I do that the same way I am replacing above, i.e. by doing
f.write(file.replace(str, b'685308358035803507835083408303508350835083508350835083083508'))
then it goes into the resource section as a single byte per character.
If it goes in like that, things break, because the program seems to read two bytes per character.
The reason I am asking is that when I replace the A's with my hex string and run the program, it does not work. But if I place the hex directly into the resource section in Visual Studio and run it, it does work. When I replace with Python it is '34322424...', but when the same string is placed in the resource section it is '3400220042004....'
2nd UPDATE:
It seems that the resource-section string table does store characters as 2 bytes each.
https://learn.microsoft.com/en-us/windows/desktop/debug/pe-format#the-rsrc-section
Resource Directory Strings
Two-byte-aligned Unicode strings, which serve as string data that is pointed to by directory entries.
It looks like UTF-16 encoding. So you can use regular Python Unicode strings; just make sure you open and write the file in text mode, with a UTF-16 encoding.
If you use binary mode, each ASCII character you write is represented by a single byte. If you use text mode with a UTF-16 encoding, each character you write is represented by two bytes. If the text uses only low Unicode code points, one of the two bytes will be null; if you write some Chinese text, both bytes are needed.
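As a quick illustration of that byte layout (the separator argument to .hex() needs Python 3.8+); note how each ASCII character becomes a byte followed by a null byte, matching the 41 00 pairs in your hex dump:
>>> 'AAAA'.encode('utf-16le')
b'A\x00A\x00A\x00A\x00'
>>> '685'.encode('utf-16le').hex(' ')
'36 00 38 00 35 00'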
The hex dump you posted doesn't show a BOM at the start, so you might have to use utf-16le instead of utf-16.
with open('foo.txt', 'r', encoding='utf-16le') as fp:
    text = fp.read()

with open('foo.txt', 'w', encoding='utf-16le') as fp:
    fp.write(text.replace('AAAAAA', 'BBBBBB'))
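Alternatively, you can stay in binary mode and do the UTF-16-LE encoding yourself before searching and replacing; a sketch under the same assumptions (the file names are placeholders):
old = ('A' * 40).encode('utf-16le')      # b'A\x00A\x00...' -- the 41 00 pairs from the hex dump
new = ('B' * 40).encode('utf-16le')

with open('program.exe', 'rb') as fp:
    data = fp.read()
with open('program_patched.exe', 'wb') as fp:
    fp.write(data.replace(old, new))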

Python write as a file, unknown format

So here's a small snippet of a file opened in Python. I'm not sure what kind of format this is; it doesn't look like binary. How do I write this back out into a file?
hDuwHkAbG9hZGVyX21jAEAAiQYJHgBpAEAQAHBAAIkGCR4AaQBAEACsQACJBgkeAGkAQBAA5EAAiQYJHgBpAEAQARxAAIkGCR4AaQBAEAFUQACJBgkeAGkAQBABkEAAiQYJHgBpAEAQAchAAIkGCR4AaQBAEAIAQACJBgkeAGkAQBACOEAAiQYJHgBpAEAQAnBAAIkGCR4AaQBAEAKsQACJBgkeAGkAQ
It's a fragment of a base-64 encoded binary file of some sort. Unfortunately, it hasn't been cut at a byte boundary; however, when I insert a letter at the front, the decoded version looks like this:
ÆîÀy�loader_mc�#� �i�#�p#� �i�#�¬#� �i�#�ä#� �i�##� �i�#T#� �i�##� �i�#È#� �i�#�#� �i�#8#� �i�#p#� �i�#¬#� �i�
You can see some clear meaningful ASCII in there. However, what the binary data represents is anyone's guess.
If you just need to decode it into the above binary format, use the base64 module.
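For example, a sketch of decoding it (the leading 'A' is just an arbitrary alignment character, as described above, and only the first characters of the snippet are used here):
import base64

fragment = 'hDuwHkAbG9hZGVyX21jAEAA'                  # start of the snippet above
aligned = 'A' + fragment                               # shift by one character to fix alignment
aligned = aligned[:len(aligned) - len(aligned) % 4]    # base64 needs a multiple of 4 characters
print(base64.b64decode(aligned))                       # the output contains b'loader_mc'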
