how to convert specific characters into string [duplicate] - python

There is a file called "settings.dat" which I want to read and edit. On opening this file through Notepad, I get an unreadable encoding.
I'm thinking this is probably a binary file. And the encoding is probably UTF-16, as far as I can tell. This is how I tried to convert it:
with open('settings.dat', 'rb') as binary_file:
raw_data = binary_file.read()
str_data = raw_data.decode('utf-16', 'ignore')
print(str_data)
The Output is again an unreadable form, with characters that look Chinese. Isn't this supposed to be a simple bytes-to-string conversion problem? Here is the output:
䕗䙃h 3 Ԁ ː ᙫ ␐☐ᜐ┐Ⱀ⨐ᴐሐ⼐【ㄐ㈐䠐倐䬐䴐ᄐἐḐ‐점퀐쬐촐

.dat files are generic files, and can either be binary or text. These files are usually accessed and used only for application support, and each application treats .dat files differently. Hence, .dat files follow no specific protocols which affect all .dat files, unlike .gif or .docx files.
If you want to understand how .dat files work and convert to human-readable form, you need to know how the application handles these files beforehand.
For the Chinese characters, you tried to decode the binary .dat file by the UTF-16 format. That does not change the file content; you are just grouping sequences of bytes of repeating sequences of bbbb bbbb bbbb bbbb = xxxx where the b are the bytes and the x are the hexadecimal digits.
Many Unicode characters are Chinese [technically they are called ideographs or ideographic] whereas others are unused, aka reserved.

Not a python answer, but the strings command line tool is often invaluable in reverse engineering data formats, letting you easily skim through a binary in search for finding familiar plaintext patterns. Obviously if some kind of encryption/compression is used (such as commonly used gzip) it won't help and needs some preprocessing first.
Calling it is as simple as that:
user#host:~/ $ strings mydir/settings.dat

If it's a binary file, then why do you want to view it? Unless you're aware beforehand that settings.dat contains human-readable characters, it does not make sense to attempt to "find" an encoding so that the output is human-readable characters, because you won't be successful.
On the other hand, if you do know that settings.dat contains human-readable characters, then maybe utf-16 is the wrong encoding.

Related

Python - pdfme - writing utf-8 characters to file

I would like to generate report to pdf using pdfme library. I need the Polish characters to be there as well. The example report end with:
with open('document.pdf', 'wb') as f:
build_pdf(document, f)
So I cannot add encoding = "utf-8". Is there any way I can still use Polish characters?
I tried:
Change to write mode and set encoding to utf-8. Getting: "TypeError: write() argument must be str, not bytes".
While having Polish characters add .encode("utf-8"). Example: "Paweł".encode("utf-8"). Getting: "TypeError: value of . attr must be of type str, list or tuple: b'Pawe\xc5\x82'"
In this case, the part of the code responsible for dealing with the unicode characters is the PDF library. The build_pdf call there, for whatever library it is, has to be able to handle any character in "document". And if it fails it is the context for the PDF library, owner of the "build_pdf" call that has to be changed so that it will handle all the characters you need.
"utf-8" is just one form os expressing characters as bytes - aPDF file is a binary file, and it does have internal headers, structures and settings to do its own character encoding handling: your text may endup inside the PDF either encoded as utf-8, or some other, legacy encoding- but that will be transparent for you and anyone using the PDF file.
It may be that the document, if it is text (we don't know if it is plain text, or if it is some object from your library that has already been pre-processed) - but if it is text, and your library says that build_pdf can accept bytes instead, you can encode the document prior to this call:
build_pdf(document.encode('utf-8', f) - but that would be some strange way of working - it is likely that either build_pdf will do the encoding, or whatever process generated the document had already done so.
To get more meaningful help, you have to say which library you are using to geneate the PDF, and include the import lines in your code,including the creation of your document so that we have a minimal reproducible example: i.e. I can copy your code, paste in a .py file here, install the lib, run it, and see a corrupted PDF file with the Polish characters magled: then I, and others, can be able to fix it. Otherwise, this answer is as far as I can get.

Python read .bin data and convert to string

I have multiple bin files, and I want to extract the data from them, but the results i'm getting are pretty weird.
For example, my first file does the following:
path = 'D:\lut.bin'
with open(path, 'rb') as file: # b is important -> binary
fileContent = file.read()
print(fileContent)
Output:
xc7\xfb\x99\x0c\x8e\xf9~7\xb9a\xb1*\x06\xd2o\xb8\xb2 \x81\x8bg\xd2\xc6bE\r\xb9KL7\xa0\xa52\xa5\xd2\x17\x86(\xe9\x02\xbf\xeb\x8fDvk\xe7\x8d\x03\x872\x9fop\xbck\xe1\x94\x02\xdc\xef\x85I\t\xc8\x8d\xdfl\x90\xcf*\xb1\x02(\x16~)\xc7\xa2\x1f\xf6o\xdc\x1en\x84H\xf6%\xfaW/\xee\xbc\xdd^/\x9b\x9a\xe5\x99\xa2\xd7\xe4\x93U\xd4\xef$\xa5\x8aW\xf6\xc9\xb0T\xe3<\x147\xcc\x08}\xc8\x15J3v\n\x9d\x16\xa3\x8d\r\xa2\xc4\x15\xf13!\xa2\x01\x14\xef\xaf\x06\x83p\xa7Ot\x8cr\xdf\xef\xbe\x93\xc2D`y\\\xdb\x8a\x1c\\H\x9cE\xabF\xd6\xe1B\xdd\xbc\x8a\xdb\x06|\x05{!\xf0K25K0\xb9\xfe\xa6n\xd7-\xd1\xcb\xefQ\xd9w\x08{4\x13\xba8\x06\x00}S\xe4\xd8*\xe2\x81f\x8d\xc4P\xde\x88/\xa6q\x7fG\x99\xbd\xa84v\xcfS+\xc6\xc5#\x0ey\xd8\xcd\xf2!\xf8`1\x03k5\xb9\xee\xb3V\xc3">\xdd\xf4\x94\x1b\x83\xf9\xdbe\xfcw\xf4+O\xf4\xf1\xfc\xa2 \xc5\xccq\xd1\xc8dH\x00\xf7K|7\x87\xa8$\xb8\x92^\x90.\xffK\xbf\xf6\xcaHv9l\xa6\x0e\xd5"\xd6`>}f\xfc\xd1\x15\xd0\xf0\x89\xb7\x12\xdf\xc9\xdfn\x97\xc7O\xf8\x05)Ua|\xd6\xd5\x03P\xf3\xcd\x08 \xc6\xc7\xe2"\xae\x1fz\xb9\xbd\x99\x100\x9a\x8d\xeb\x89\xa3T\xa0\xc7S\xcc\xe4h\xbe\xf3R\xe9\x9d\xf4Y\xe91\xa4%\x85>mn\xc3\x1e\x8a}\x04\xd9:\xb5\xde\x01h\x90y\xfe4&\xea\x1d\x9a\xbd\xac\x1a\x8e{\xb2Y\xcb\xc47\xd8\xe2\xf6\xd6\xdc\x91,]\x1d\xca\x90_sb\x86X\xad]\x8e\xe1A\x1a\xaa\xc6\xdf\x1ca#A\x1a\xa2\t!3\x06y\x92\x96\xebg\xdb3\xdd\x9f\xefh\x9d6\x17c0\x0e\xfe\x9a\x06\x06;\x16\xa7\x
I have no idea what this is, but it does not look like readable text, is there a way to even convert this?
My other file looks like this:
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
Again with the same code above.
I have tried decoding it, I keep getting decoding errors, and text encoding to utf-8 doesn't help either.
I want to get the text from this, these files came with book on the playstore I downloaded.
Bin files are just binary data, meaning each byte can be any value between 0 and 255 (00000000 and 11111111, hex 0x00-0xFF).
Printable characters are a subset of those codes.
This means, not every bin file can be converted to text.
Python tries to visualise the byte stream already by putting those printable characters in place of their \xNN code (where N is a hex digit). The rest of the characters are printed as their codes.
This means the
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UU
is in fact
\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03\x48\x00\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30\x55\x55
[Copy this into your Python interpreter as a string (i.e. in quotes) and see how it visually converts itself when displaying/printing!]
The parts:
decoded U,
then not decoded \xff\xf3\xe8
decoded d
not decoded \x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03
decodedH
not decoded \x00\x00\x00\x00
decoded LAME3.100UU
Can you extract some data from it? Depending on the type of the bin, you may probably find some strings directly put in there - like the LAME3.10 that looks like some code/version... but I really doubt that you would find anything useful. It can be literally anything, just dumped there: text, photo, memory dump...
This is very late, but LAME3.100 followed by a bunch of U characters is actually the start of a certain encoding of .mp3 file, and knowing that it may possibly be incomplete, you could try and convert it using https://ffmpeg.org into a proper .mp3 container
After you have ffmpeg in your path, a command such as ffmpeg -i "D:/lut.bin" "D:/lut.mp3" should hopefully decode and re-encode it

How to convert the content of a .dat file to a human readable form using Python?

There is a file called "settings.dat" which I want to read and edit. On opening this file through Notepad, I get an unreadable encoding.
I'm thinking this is probably a binary file. And the encoding is probably UTF-16, as far as I can tell. This is how I tried to convert it:
with open('settings.dat', 'rb') as binary_file:
raw_data = binary_file.read()
str_data = raw_data.decode('utf-16', 'ignore')
print(str_data)
The Output is again an unreadable form, with characters that look Chinese. Isn't this supposed to be a simple bytes-to-string conversion problem? Here is the output:
䕗䙃h 3 Ԁ ː ᙫ ␐☐ᜐ┐Ⱀ⨐ᴐሐ⼐【ㄐ㈐䠐倐䬐䴐ᄐἐḐ‐점퀐쬐촐
.dat files are generic files, and can either be binary or text. These files are usually accessed and used only for application support, and each application treats .dat files differently. Hence, .dat files follow no specific protocols which affect all .dat files, unlike .gif or .docx files.
If you want to understand how .dat files work and convert to human-readable form, you need to know how the application handles these files beforehand.
For the Chinese characters, you tried to decode the binary .dat file by the UTF-16 format. That does not change the file content; you are just grouping sequences of bytes of repeating sequences of bbbb bbbb bbbb bbbb = xxxx where the b are the bytes and the x are the hexadecimal digits.
Many Unicode characters are Chinese [technically they are called ideographs or ideographic] whereas others are unused, aka reserved.
Not a python answer, but the strings command line tool is often invaluable in reverse engineering data formats, letting you easily skim through a binary in search for finding familiar plaintext patterns. Obviously if some kind of encryption/compression is used (such as commonly used gzip) it won't help and needs some preprocessing first.
Calling it is as simple as that:
user#host:~/ $ strings mydir/settings.dat
If it's a binary file, then why do you want to view it? Unless you're aware beforehand that settings.dat contains human-readable characters, it does not make sense to attempt to "find" an encoding so that the output is human-readable characters, because you won't be successful.
On the other hand, if you do know that settings.dat contains human-readable characters, then maybe utf-16 is the wrong encoding.

Using Python to overwrite resource section in C program

I have a C program that has a resource section.
IDS_STRING 87 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
In the hex editor it looks like this
I use code such as this in Python to search and replace the A's:
str = b'\x00A'*40
str1 = b"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
if str in file:
print("A in file")
f.write(file.replace(str, str1))
This makes the new file look like this:
So I am wondering why the A's are stored like '41 00' and then when I overwrite them they are just '42'.
Is this a WCHAR thing?
I did a test where I loaded the string and printed it out.
This is some text.AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
But then when I used my Python and overwrote the A's with the B's it does this..
This is some text.???????????????????????????????????????B
So with my limited knowledge of C, If I want to put things into the resource section I should place them in as WCHAR?
UPDATE:
My main issue with this is I have a hex string similar to below:
'685308358035803507835083408303508350835083508350835083083508'
I want to put that into the resource section. But if I do that similar to the way I am replacing, so by doing
f.write(file.replace(str, '685308358035803507835083408303508350835083508350835083083508'))
Then it puts it into the resource section as:
If it goes in like this, it causes things to break because it is grabbing 2 bytes at a time it seems like.
The reason I am asking this is because when I replace the A's with my hex and run the program. It does not work. But if I place the hex directly into the resource section in Visual Studio and run it, it does work. When I replace with Python it is '34322424...' but when the same string is placed in the resource section is it '3400220042004....'
2nd UPDATE:
It seems that the resource section string table does store in a 2 bytes.
https://learn.microsoft.com/en-us/windows/desktop/debug/pe-format#the-rsrc-section
Resource Directory Strings
Two-byte-aligned Unicode strings, which serve as string data that is pointed to by directory entries.
It looks like utf-16 encoding. So you can use regular python unicode strings, and make sure you open and write to the file in text mode, and with utf16 encoding.
If you use binary mode, each ascii character you write will be represented in a single byte. If you use text mode, each character you write will be represented by two bytes. If the text you write is only using low unicode code points, there will be a bunch of null bytes. If you write some Chinese text, you need both bytes.
The hex dump you posted don't show a BOM at the start, so you might have to use utf-16le instead of utf-16.
with open('foo.txt', 'r', encoding='utf-16le') as fp:
text = fp.read()
with open('foo.txt', 'w', encoding='utf-16le') as fp:
fp.write(text.replace('AAAAAA', 'BBBBBB'))

How do I read the codes of a .dat file(binary file) in Python?

my friend mailed me a binary file "Masters.dat" coded in Python.I want to read the codes inside the binary file,so how do I do it?
I have tried :-
file = open("C:\Users\Samanyou\Desktop\Source_XII\Project\Masters.dat", "rb")
read=file.readlines()
print read
But this gives me the result in ASCII or something else but not in human readable form.
readlines is meant to work with text files, not binary ones. For binary files you'd typically use read to get chunk of bytes -- but there's no way to make such chunks "human readable" unless you know what detailed format was used to write the file (in which case you can use e.g struct to decode it back into Python data and format them as you wish). So your friend had better send you info about exactly how the file was written in the first place!-)

Categories