Python write as a file, unknown format - python

So here's a small snippet of a file opened in python. But I'm not sure what kind of formatting this is. Doesn't look like binary. How do I write this back out into a file?
hDuwHkAbG9hZGVyX21jAEAAiQYJHgBpAEAQAHBAAIkGCR4AaQBAEACsQACJBgkeAGkAQBAA5EAAiQYJHgBpAEAQARxAAIkGCR4AaQBAEAFUQACJBgkeAGkAQBABkEAAiQYJHgBpAEAQAchAAIkGCR4AaQBAEAIAQACJBgkeAGkAQBACOEAAiQYJHgBpAEAQAnBAAIkGCR4AaQBAEAKsQACJBgkeAGkAQ

It's a fragment of a base-64 encoded binary file of some sort. Unfortunately, it hasn't been cut at a byte boundary; however, when I insert a letter at the front, the decoded version looks like this:
ÆîÀy�loader_mc�#� �i�#�p#� �i�#�¬#� �i�#�ä#� �i�##� �i�#T#� �i�##� �i�#È#� �i�#�#� �i�#8#� �i�#p#� �i�#¬#� �i�
You can see some clear meaningful ASCII in there. However, what the binary data represents is anyone's guess.
If you just need to decode it into the above binary format, use the base64 module.
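As a rough sketch of that suggestion, the helper below realigns a fragment by prepending filler characters and then decodes it with base64.b64decode. The names decode_fragment and write_fragment and the choice of "A" as filler are illustrative assumptions, not part of the original answer. The bytes produced from the filler are partly made up, so only the data after the first group can be trusted.

```python
import base64
from pathlib import Path


def decode_fragment(fragment, shift=0):
    """Decode a base64 fragment that may not start on a 4-character boundary.

    `shift` filler characters are prepended to realign the text, so the
    first decoded bytes are partly made up.
    """
    text = "A" * shift + "".join(fragment.split())
    # Drop trailing characters that do not complete a 4-character group.
    text = text[: len(text) - len(text) % 4]
    return base64.b64decode(text)


def write_fragment(fragment, path, shift=0):
    """Write the decoded bytes to `path` in binary mode."""
    Path(path).write_bytes(decode_fragment(fragment, shift))


# First 19 characters of the fragment from the question, shifted by one.
print(decode_fragment("hDuwHkAbG9hZGVyX21j", shift=1))
```

If the right shift is not known, trying 0 to 3 and looking for readable ASCII in the result is usually enough to find it.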

Related

how to convert specific characters into string [duplicate]

There is a file called "settings.dat" which I want to read and edit. On opening this file through Notepad, I get an unreadable encoding.
I'm thinking this is probably a binary file. And the encoding is probably UTF-16, as far as I can tell. This is how I tried to convert it:
with open('settings.dat', 'rb') as binary_file:
    raw_data = binary_file.read()
str_data = raw_data.decode('utf-16', 'ignore')
print(str_data)
The Output is again an unreadable form, with characters that look Chinese. Isn't this supposed to be a simple bytes-to-string conversion problem? Here is the output:
䕗䙃h 3 Ԁ ː ᙫ ␐☐ᜐ┐Ⱀ⨐ᴐሐ⼐【ㄐ㈐䠐倐䬐䴐ᄐἐḐ‐점퀐쬐촐
.dat files are generic files, and can either be binary or text. These files are usually accessed and used only for application support, and each application treats .dat files differently. Hence, .dat files follow no specific protocols which affect all .dat files, unlike .gif or .docx files.
If you want to understand how .dat files work and convert to human-readable form, you need to know how the application handles these files beforehand.
For the Chinese characters: you tried to decode the binary .dat file as UTF-16. That does not change the file content; it only groups the bytes into 16-bit units, bbbb bbbb bbbb bbbb = xxxx, where each b is a bit and each x is a hexadecimal digit, and treats each unit as a character code.
A large part of the 16-bit range is assigned to CJK ideographs, which is why arbitrary binary data decoded this way tends to look Chinese; other values are unassigned, aka reserved.
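A small sketch of that grouping, using four arbitrary ASCII bytes as a stand-in for binary file content (the sample bytes are an assumption chosen for illustration). Both resulting code points fall in the CJK Unified Ideographs Extension A block (U+3400 to U+4DBF).

```python
raw = b"ABCD"  # four arbitrary bytes standing in for binary file content

# UTF-16 (little-endian) pairs the bytes up:
# 0x41 0x42 -> U+4241, 0x43 0x44 -> U+4443.
text = raw.decode("utf-16-le")
code_points = [hex(ord(ch)) for ch in text]
print(code_points)

# The bytes themselves are unchanged: encoding the text again
# gives back exactly the original data.
assert text.encode("utf-16-le") == raw
```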
Not a Python answer, but the strings command-line tool is often invaluable in reverse engineering data formats, letting you easily skim through a binary in search of familiar plaintext patterns. Obviously, if some kind of encryption or compression is used (such as the commonly used gzip), it won't help, and the data needs some preprocessing first.
Calling it is as simple as this:
$ strings mydir/settings.dat
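If the strings tool is not available, something similar can be approximated in Python. This is only a minimal sketch: extract_strings is a hypothetical helper, it looks for runs of printable ASCII only, and the default minimum length of 4 is an assumption.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of at least `min_len` printable ASCII bytes found in `data`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


sample = b"\x00\x01loader_mc\x00\xffab\x00LAME3.100"
print(extract_strings(sample))
```

For a real file, the bytes would come from open(path, 'rb').read().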
If it's a binary file, then why do you want to view it? Unless you're aware beforehand that settings.dat contains human-readable characters, it does not make sense to attempt to "find" an encoding so that the output is human-readable characters, because you won't be successful.
On the other hand, if you do know that settings.dat contains human-readable characters, then maybe utf-16 is the wrong encoding.

Python read .bin data and convert to string

I have multiple bin files, and I want to extract the data from them, but the results I'm getting are pretty weird.
For example, my first file does the following:
path = r'D:\lut.bin'  # raw string so the backslash is taken literally
with open(path, 'rb') as file:  # b is important -> binary
    fileContent = file.read()
print(fileContent)
Output:
xc7\xfb\x99\x0c\x8e\xf9~7\xb9a\xb1*\x06\xd2o\xb8\xb2 \x81\x8bg\xd2\xc6bE\r\xb9KL7\xa0\xa52\xa5\xd2\x17\x86(\xe9\x02\xbf\xeb\x8fDvk\xe7\x8d\x03\x872\x9fop\xbck\xe1\x94\x02\xdc\xef\x85I\t\xc8\x8d\xdfl\x90\xcf*\xb1\x02(\x16~)\xc7\xa2\x1f\xf6o\xdc\x1en\x84H\xf6%\xfaW/\xee\xbc\xdd^/\x9b\x9a\xe5\x99\xa2\xd7\xe4\x93U\xd4\xef$\xa5\x8aW\xf6\xc9\xb0T\xe3<\x147\xcc\x08}\xc8\x15J3v\n\x9d\x16\xa3\x8d\r\xa2\xc4\x15\xf13!\xa2\x01\x14\xef\xaf\x06\x83p\xa7Ot\x8cr\xdf\xef\xbe\x93\xc2D`y\\\xdb\x8a\x1c\\H\x9cE\xabF\xd6\xe1B\xdd\xbc\x8a\xdb\x06|\x05{!\xf0K25K0\xb9\xfe\xa6n\xd7-\xd1\xcb\xefQ\xd9w\x08{4\x13\xba8\x06\x00}S\xe4\xd8*\xe2\x81f\x8d\xc4P\xde\x88/\xa6q\x7fG\x99\xbd\xa84v\xcfS+\xc6\xc5#\x0ey\xd8\xcd\xf2!\xf8`1\x03k5\xb9\xee\xb3V\xc3">\xdd\xf4\x94\x1b\x83\xf9\xdbe\xfcw\xf4+O\xf4\xf1\xfc\xa2 \xc5\xccq\xd1\xc8dH\x00\xf7K|7\x87\xa8$\xb8\x92^\x90.\xffK\xbf\xf6\xcaHv9l\xa6\x0e\xd5"\xd6`>}f\xfc\xd1\x15\xd0\xf0\x89\xb7\x12\xdf\xc9\xdfn\x97\xc7O\xf8\x05)Ua|\xd6\xd5\x03P\xf3\xcd\x08 \xc6\xc7\xe2"\xae\x1fz\xb9\xbd\x99\x100\x9a\x8d\xeb\x89\xa3T\xa0\xc7S\xcc\xe4h\xbe\xf3R\xe9\x9d\xf4Y\xe91\xa4%\x85>mn\xc3\x1e\x8a}\x04\xd9:\xb5\xde\x01h\x90y\xfe4&\xea\x1d\x9a\xbd\xac\x1a\x8e{\xb2Y\xcb\xc47\xd8\xe2\xf6\xd6\xdc\x91,]\x1d\xca\x90_sb\x86X\xad]\x8e\xe1A\x1a\xaa\xc6\xdf\x1ca#A\x1a\xa2\t!3\x06y\x92\x96\xebg\xdb3\xdd\x9f\xefh\x9d6\x17c0\x0e\xfe\x9a\x06\x06;\x16\xa7\x
I have no idea what this is, but it does not look like readable text. Is there even a way to convert this?
My other file looks like this:
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
Again with the same code above.
I have tried decoding it, but I keep getting decoding errors, and encoding the text to utf-8 doesn't help either.
I want to get the text from this. These files came with a book I downloaded from the Play Store.
Bin files are just binary data, meaning each byte can be any value between 0 and 255 (00000000 and 11111111, hex 0x00-0xFF).
Printable characters are a subset of those codes.
This means, not every bin file can be converted to text.
Python tries to visualise the byte stream already by putting those printable characters in place of their \xNN code (where N is a hex digit). The rest of the characters are printed as their codes.
This means the
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UU
is in fact
\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03\x48\x00\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30\x55\x55
[Copy this into your Python interpreter as a string (i.e. in quotes) and see how it visually converts itself when displaying/printing!]
The parts:
decoded U
not decoded \xff\xf3\xe8
decoded d
not decoded \x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03
decoded H
not decoded \x00\x00\x00\x00
decoded LAME3.100UU
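The same behaviour can be checked directly in the interpreter. The sketch below builds a few bytes from their numeric values (a shortened sample, not the full file) and compares the two ways of displaying them.

```python
data = bytes([0x55, 0xFF, 0xF3, 0xE8, 0x64, 0x00, 0x4C, 0x41, 0x4D, 0x45])

# repr() shows printable ASCII bytes as characters and everything else as \xNN.
shown = repr(data)
print(shown)

# hex() shows every byte the same way, which is often easier to compare.
print(data.hex())
```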
Can you extract some data from it? Depending on the type of the bin file, you may find some strings stored directly in it, like the LAME3.100 that looks like some code/version, but I really doubt that you would find anything useful. It can be literally anything, just dumped there: text, a photo, a memory dump...
This is very late, but LAME3.100 followed by a bunch of U characters is actually the start of a certain encoding of an .mp3 file. Knowing that it may be incomplete, you could try to convert it into a proper .mp3 container using ffmpeg (https://ffmpeg.org).
Once ffmpeg is on your PATH, a command such as ffmpeg -i "D:/lut.bin" "D:/lut.mp3" should hopefully decode and re-encode it.
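Before reaching for ffmpeg, a quick heuristic check in Python can hint at whether a file is MPEG audio. This is a sketch only: find_mpeg_sync and looks_like_mp3 are hypothetical helpers, the 4096-byte window is an arbitrary assumption, and the check can give false positives on other binary data.

```python
def find_mpeg_sync(data):
    """Return the offset of the first MPEG audio frame sync, or None.

    A frame header starts with 11 set bits: a 0xFF byte followed by a
    byte whose top three bits are set.
    """
    for i in range(len(data) - 1):
        if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0:
            return i
    return None


def looks_like_mp3(data):
    """Heuristic only: a frame sync near the start plus a LAME encoder tag."""
    head = data[:4096]
    return find_mpeg_sync(head) is not None and b"LAME" in head


sample = (b"U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00"
          b"\x03H\x00\x00\x00\x00LAME3.100UU")
print(find_mpeg_sync(sample), looks_like_mp3(sample))
```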

Converting a PDF file (or any binary) to a string in python (not grab text out of pdf)

I am using an API that only takes strings. It's intended to store things. I would like to be able to read in a binary file, convert the binary data to a string, and store the string. Then I would like to retrieve the string, convert it back to binary, and save the file.
so what I am trying to do is (in python):
PDF -> load into program as string -> store string ->retrieve string ->save as binary PDF file
For example, I have a PDF called PDFfile. I want to read it in:
datafile=open(PDFfile,'rb')
pdfdata=datafile.read()
When I read up on the .read function, it says that it's supposed to result in a string. It does not, or if it does, it's taking the parts that define it as binary also. I have two lines of code later that print it out:
print(pdfdata[:20])
print(str(pdfdata[:20]))
The result is this:
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
Those look like 2 bytes types to me, but apparently, the second one is a string. When I do type(pdfdata) I get bytes.
I am struggling to try to get a clean string that represents the PDF file, that I can then convert back to a bytes format. The API fails if I send it without stringifying it.
str(pdfdata)
I have also tried playing around with encode and decode, but I get errors that encode/decode can't handle 0xc4, which is apparently in the binary file.
The final oddity:
When I store the str(pdfdata) and retrieve it into 'retdata' I print some bytes out of it and compare to the original
print(pdfdata[:20])
print(retdata[:20])
I get really different results:
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe
But the data is there, if I show 50 characters of the retdata
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\
Needless to say, when I retrieve the data and store it as a PDF, it's corrupted and doesn't work.
When I save the stringified PDF and the string version of the retrieved data, they are identical. So the storage and retrieval of a string is working fine.
So I think the corruption is happening when I convert to a string.
I know I'm getting loquacious, but you guys like to have all the info.
OK, I got this to work. The key was to properly encode the binary data BEFORE it was turned into a string.
Step 1) Read in the binary data
with open(PDFfile, 'rb') as datafile:
    pdfdatab = datafile.read()  # this is binary data
Step 2) Encode the data as base64 (the result is still a bytes object)
import codecs
b64PDF = codecs.encode(pdfdatab, 'base64')
Step 3) Convert the base64 bytes into a string
Sb64PDF = b64PDF.decode('utf-8')
Now the string can be stored. To get it back, you just go through the reverse. Load the string data from storage into the string variable retdata.
# so we have a string and want it to be bytes
bretdata = retdata.encode('utf-8')
# now let's get it back into the binary file format
bPDFout = codecs.decode(bretdata, 'base64')
# open a new file and write the decoded data into it
with open(newPDFFile, 'wb') as datafile:
    datafile.write(bPDFout)
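The same round trip can also be sketched with the base64 module. The helper names bytes_to_text and text_to_bytes are illustrative assumptions, and the sample bytes are the first 20 bytes printed in the question. The sketch also shows why str() alone corrupted the data: it returns the printable representation of the bytes object rather than a reversible encoding.

```python
import base64


def bytes_to_text(data):
    """Encode arbitrary bytes as an ASCII-only base64 string."""
    return base64.b64encode(data).decode("ascii")


def text_to_bytes(text):
    """Reverse bytes_to_text()."""
    return base64.b64decode(text.encode("ascii"))


original = b"%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4"

# str() does not convert the data; it returns the printable representation,
# including the leading b' and the backslash escapes.
print(str(original)[:12])

stored = bytes_to_text(original)
assert text_to_bytes(stored) == original
```

Base64 makes the stored string about a third larger than the file, which is the price of keeping it to plain ASCII.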

How do I read the codes of a .dat file(binary file) in Python?

My friend mailed me a binary file, "Masters.dat", coded in Python. I want to read the codes inside the binary file, so how do I do it?
I have tried:
file = open(r"C:\Users\Samanyou\Desktop\Source_XII\Project\Masters.dat", "rb")
read = file.readlines()
print(read)
But this gives me the result in ASCII or something else, not in human-readable form.
readlines is meant to work with text files, not binary ones. For binary files you'd typically use read to get a chunk of bytes, but there's no way to make such chunks "human readable" unless you know the detailed format that was used to write the file (in which case you can use e.g. struct to decode it back into Python data and format it as you wish). So your friend had better send you info about exactly how the file was written in the first place!
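To illustrate the struct approach, here is a minimal sketch that assumes a made-up record layout: a little-endian 32-bit id, an 8-byte name, and a 64-bit float. The layout, the RECORD name, and decode_record are all hypothetical; the real layout of Masters.dat is unknown and has to come from whoever wrote the file.

```python
import struct

# Hypothetical layout: little-endian uint32 id, 8-byte name, float64 score.
RECORD = struct.Struct("<I8sd")


def decode_record(chunk):
    """Unpack one record and turn the padded name into a str."""
    ident, name, score = RECORD.unpack(chunk)
    return ident, name.rstrip(b"\x00").decode("ascii"), score


# Build one record in memory so the sketch runs without a real file.
packed = RECORD.pack(7, b"Masters", 91.5)
print(len(packed), decode_record(packed))
```

With a real file, each chunk would come from file.read(RECORD.size) on a file opened in 'rb' mode.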
