I have a C program that has a resource section.
IDS_STRING 87 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
In a hex editor, each A shows up as the byte pair 41 00.
I use code such as this in Python to search and replace the A's:
str = b'\x00A'*40
str1 = b"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
if str in file:
    print("A in file")
    f.write(file.replace(str, str1))
After the replacement, the new file shows a single 42 byte where each 41 00 pair used to be.
So I am wondering why the A's are stored like '41 00' and then when I overwrite them they are just '42'.
Is this a WCHAR thing?
I did a test where I loaded the string and printed it out.
This is some text.AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
But then when I used my Python script and overwrote the A's with the B's, it prints this:
This is some text.???????????????????????????????????????B
So, with my limited knowledge of C: if I want to put things into the resource section, should I place them in as WCHAR?
UPDATE:
My main issue with this is I have a hex string similar to below:
'685308358035803507835083408303508350835083508350835083083508'
I want to put that into the resource section. But if I do it the same way I am replacing above, by doing
f.write(file.replace(str, '685308358035803507835083408303508350835083508350835083083508'))
then it goes into the resource section as single bytes, one per character.
If it goes in like this, it causes things to break, because the program seems to grab two bytes at a time.
The reason I am asking is that when I replace the A's with my hex string and run the program, it does not work. But if I place the hex directly into the resource section in Visual Studio and run it, it does work. When I replace with Python the bytes are '34322424...', but when the same string is placed in the resource section they are '3400220042004....'
2nd UPDATE:
It seems that the resource section string table does store characters as two bytes each.
https://learn.microsoft.com/en-us/windows/desktop/debug/pe-format#the-rsrc-section
Resource Directory Strings
Two-byte-aligned Unicode strings, which serve as string data that is pointed to by directory entries.
It looks like UTF-16 encoding. So you can use regular Python unicode strings; just make sure you open and write to the file in text mode, with a UTF-16 encoding.
If you use binary mode, each ASCII character you write will be represented as a single byte. If you use text mode, each character you write will be represented by two bytes. If the text you write only uses low Unicode code points, there will be a bunch of null bytes; if you write some Chinese text, both bytes are needed.
The hex dump you posted doesn't show a BOM at the start, so you might have to use utf-16le instead of utf-16.
with open('foo.txt', 'r', encoding='utf-16le') as fp:
    text = fp.read()
with open('foo.txt', 'w', encoding='utf-16le') as fp:
    fp.write(text.replace('AAAAAA', 'BBBBBB'))
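If you'd rather not open a whole executable in text mode (the rest of the file is not valid UTF-16 text), here is a minimal sketch of the same replacement in binary mode, assuming a hypothetical file name; note that an in-place byte patch only keeps the resource offsets valid when the old and new strings have the same length:
with open('program.exe', 'rb') as fp:
    data = fp.read()
# Encode both sides as UTF-16LE so every character takes two bytes,
# matching the '41 00' pairs seen in the hex dump.
old = ('A' * 40).encode('utf-16le')  # b'A\x00A\x00...'
new = ('B' * 40).encode('utf-16le')  # same length, so offsets stay valid
with open('program.exe', 'wb') as fp:
    fp.write(data.replace(old, new))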
Related
I would like to generate a PDF report using the pdfme library. I need Polish characters to be there as well. The example report ends with:
with open('document.pdf', 'wb') as f:
    build_pdf(document, f)
So I cannot add encoding = "utf-8". Is there any way I can still use Polish characters?
I tried:
Changing to write mode and setting encoding to utf-8. Getting: "TypeError: write() argument must be str, not bytes".
Adding .encode("utf-8") to strings containing Polish characters. Example: "Paweł".encode("utf-8"). Getting: "TypeError: value of . attr must be of type str, list or tuple: b'Pawe\xc5\x82'"
In this case, the part of the code responsible for dealing with the unicode characters is the PDF library. The build_pdf call there, for whatever library it is, has to be able to handle any character in document. And if it fails, it is the code inside the PDF library, the owner of the build_pdf call, that has to be changed so that it will handle all the characters you need.
"utf-8" is just one form of expressing characters as bytes. A PDF file is a binary file, and it has internal headers, structures and settings to do its own character-encoding handling: your text may end up inside the PDF encoded as utf-8, or as some other, legacy encoding, but that will be transparent to you and anyone using the PDF file.
We don't know whether document is plain text or some object from your library that has already been pre-processed; but if it is text, and your library says that build_pdf can accept bytes instead, you can encode the document prior to the call:
build_pdf(document.encode('utf-8'), f) - but that would be a strange way of working: it is likely that either build_pdf does the encoding, or whatever process generated the document has already done so.
To get more meaningful help, you have to say which library you are using to generate the PDF, and include the import lines in your code, including the creation of your document, so that we have a minimal reproducible example: i.e. I can copy your code, paste it in a .py file here, install the lib, run it, and see a corrupted PDF file with the Polish characters mangled. Then I, and others, will be able to fix it. Otherwise, this answer is as far as I can get.
I have multiple bin files, and I want to extract the data from them, but the results I'm getting are pretty weird.
For example, my first file does the following:
path = 'D:\lut.bin'
with open(path, 'rb') as file:  # b is important -> binary
    fileContent = file.read()
print(fileContent)
Output:
xc7\xfb\x99\x0c\x8e\xf9~7\xb9a\xb1*\x06\xd2o\xb8\xb2 \x81\x8bg\xd2\xc6bE\r\xb9KL7\xa0\xa52\xa5\xd2\x17\x86(\xe9\x02\xbf\xeb\x8fDvk\xe7\x8d\x03\x872\x9fop\xbck\xe1\x94\x02\xdc\xef\x85I\t\xc8\x8d\xdfl\x90\xcf*\xb1\x02(\x16~)\xc7\xa2\x1f\xf6o\xdc\x1en\x84H\xf6%\xfaW/\xee\xbc\xdd^/\x9b\x9a\xe5\x99\xa2\xd7\xe4\x93U\xd4\xef$\xa5\x8aW\xf6\xc9\xb0T\xe3<\x147\xcc\x08}\xc8\x15J3v\n\x9d\x16\xa3\x8d\r\xa2\xc4\x15\xf13!\xa2\x01\x14\xef\xaf\x06\x83p\xa7Ot\x8cr\xdf\xef\xbe\x93\xc2D`y\\\xdb\x8a\x1c\\H\x9cE\xabF\xd6\xe1B\xdd\xbc\x8a\xdb\x06|\x05{!\xf0K25K0\xb9\xfe\xa6n\xd7-\xd1\xcb\xefQ\xd9w\x08{4\x13\xba8\x06\x00}S\xe4\xd8*\xe2\x81f\x8d\xc4P\xde\x88/\xa6q\x7fG\x99\xbd\xa84v\xcfS+\xc6\xc5#\x0ey\xd8\xcd\xf2!\xf8`1\x03k5\xb9\xee\xb3V\xc3">\xdd\xf4\x94\x1b\x83\xf9\xdbe\xfcw\xf4+O\xf4\xf1\xfc\xa2 \xc5\xccq\xd1\xc8dH\x00\xf7K|7\x87\xa8$\xb8\x92^\x90.\xffK\xbf\xf6\xcaHv9l\xa6\x0e\xd5"\xd6`>}f\xfc\xd1\x15\xd0\xf0\x89\xb7\x12\xdf\xc9\xdfn\x97\xc7O\xf8\x05)Ua|\xd6\xd5\x03P\xf3\xcd\x08 \xc6\xc7\xe2"\xae\x1fz\xb9\xbd\x99\x100\x9a\x8d\xeb\x89\xa3T\xa0\xc7S\xcc\xe4h\xbe\xf3R\xe9\x9d\xf4Y\xe91\xa4%\x85>mn\xc3\x1e\x8a}\x04\xd9:\xb5\xde\x01h\x90y\xfe4&\xea\x1d\x9a\xbd\xac\x1a\x8e{\xb2Y\xcb\xc47\xd8\xe2\xf6\xd6\xdc\x91,]\x1d\xca\x90_sb\x86X\xad]\x8e\xe1A\x1a\xaa\xc6\xdf\x1ca#A\x1a\xa2\t!3\x06y\x92\x96\xebg\xdb3\xdd\x9f\xefh\x9d6\x17c0\x0e\xfe\x9a\x06\x06;\x16\xa7\x
I have no idea what this is, but it does not look like readable text. Is there a way to even convert this?
My other file looks like this:
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
Again with the same code above.
I have tried decoding it, but I keep getting decoding errors, and encoding the text to utf-8 doesn't help either.
I want to get the text from this; these files came with a book I downloaded from the Play Store.
Bin files are just binary data, meaning each byte can be any value between 0 and 255 (00000000 and 11111111, hex 0x00-0xFF).
Printable characters are a subset of those codes.
This means, not every bin file can be converted to text.
When displaying the byte stream, Python already visualises it by putting printable characters in place of their \xNN codes (where N is a hex digit). The rest of the bytes are printed as their codes.
This means the
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UU
is in fact
\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03\x48\x00\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30\x55\x55
[Copy this into your Python interpreter as a string (i.e. in quotes) and see how it visually converts itself when displaying/printing!]
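For instance, pasting a shortened prefix of the dump into the interpreter:
>>> b'\x55\xff\xf3\xe8\x64\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30'
b'U\xff\xf3\xe8d\x00\x00\x00LAME3.100'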
The parts:
decoded U,
then not decoded \xff\xf3\xe8
decoded d
not decoded \x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03
decoded H
not decoded \x00\x00\x00\x00
decoded LAME3.100UU
Can you extract some data from it? Depending on the type of the bin, you may well find some strings directly put in there - like the LAME3.100 that looks like some code/version string... but I really doubt that you would find anything useful. It can be literally anything, just dumped there: text, a photo, a memory dump...
This is very late, but LAME3.100 followed by a bunch of U characters is actually the start of a certain encoding of an .mp3 file. Knowing that it may possibly be incomplete, you could try converting it into a proper .mp3 container using https://ffmpeg.org
After you have ffmpeg in your path, a command such as ffmpeg -i "D:/lut.bin" "D:/lut.mp3" should hopefully decode and re-encode it
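If you want to drive that step from Python instead of the shell, here is a minimal sketch, assuming ffmpeg is on your PATH:
import subprocess
# Decode the raw dump and re-encode it into a proper .mp3 container.
subprocess.run(['ffmpeg', '-i', 'D:/lut.bin', 'D:/lut.mp3'], check=True)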
I am reading code from someone else and have come to the part concerning unicode, which is always a headache for me. It would really help a lot if you could give some hints.
The situation is this:
I have a stopword file named stopword.txt in the form of following:
1 781037
2 650706 damen
3 196100 löwe
4 146044 lego
5 138280 monster
6 136410 high
7 100657 kost%c3%bcm #this % seems to be strange already
8 94084 schuhe
9 93680 kinder
10 87308 mit
and the code that tries to read it in looks like:
with open('%s/%s'%('path_to_stopwords.txt'), 'r') as f:
    stoplines = [line.decode('utf-8').strip() for line in f.readlines()]
This decode('utf-8') seems very mysterious to me. As I understand it, without extra specification the open method reads files in as strings, which are automatically encoded as ASCII (so in this case there is already information loss if the opened file contains characters whose code points are outside of 128, like löwe, because when it is read into the program with the ASCII encoding, the ö will be truncated?). What is the meaning of trying to decode it into utf-8 after reading it into the program?
And to verify my ideas, I tried to check what is in each line with this code:
for line in stoplines:
    print line
which gives me:
%09
%21%21%21
%26
%26amp%3b
%28buch%29
%28gr.
%2b
%2bbarbie
I am quite confused about where these % come from. Have I correctly read in the contents of the file?
Thank you very much.
In Python 2, when you open a file and read from it, you get an str instance back, not a unicode string (in Python 3, you'd get a str, which is unicode).
str.decode('utf-8') lets you decode that str into a unicode string (assuming the encoding is UTF8!).
It seems like your stopwords are URL-encoded:
import urllib
print urllib.unquote('%c3%bc')
ü
It is indeed redundant to use urlencoding if the file is supposed to be UTF8 (which natively supports characters such as ü), but my intuition would be that this file is in fact ASCII, not UTF8.
All ASCII chars map to the same char in UTF8, so this works, despite being wrong.
A few points:
If the file is UTF-8, you should decode all of it as UTF-8, not line by line. Either read it all and then decode (i.e. f.read().decode("utf-8")) or open it with codecs.open using UTF-8.
You don't need f.readlines(); you can simply do "for line in f". It's more memory-efficient and shorter.
'%s/%s'%('path_to_stopwords.txt') does not even work. Make sure you're doing it correctly. You might want to use os.path.join to join the paths.
The % encoding is url encoding. As Thomas above me wrote, you can use urllib.unquote.
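Putting the two together, here is a minimal Python 2 sketch (the file name comes from the question; unquote first, on the raw bytes, then decode):
# -*- coding: utf-8 -*-
import urllib

with open('stopword.txt', 'r') as f:
    # Undo the URL encoding on the raw bytes, then decode as UTF-8.
    stoplines = [urllib.unquote(line).decode('utf-8').strip() for line in f]

for line in stoplines:
    print line  # e.g. kost%c3%bcm now prints as kostüm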
If a python script uses the open("filename", "r") function to open, and subsequently read, the contents of a text file, how can I tell which encoding this file is supposed to have?
Note that since I'm executing this script from my own program, if there is any way to control this through environment variables, then that is good enough for me.
This is Python 2.7 by the way.
The code in question comes from Mercurial; it can be given a list of files to, say, add to the repository, through a file on disk instead of passing them on the command line.
So basically, instead of this:
hg add A B C
I can write out A, B and C to a file, with newlines between each, and then execute the following:
hg add listfile:input.txt
The code that ends up reading this file is this:
files = open(name, 'r').read().split(delimiter)
Hence my question. The answer I was given on IRC when I asked which encoding I should use was this:
it is the same encoding than the one you use on command line when passing a file argument
I take this to mean that it is the same encoding I "use" when I execute Mercurial (hg). Since I have no idea which encoding that is (I just hand everything to the .NET Process object), I ask here.
You can't. Reading a file is independent of its encoding; you'll need to know the encoding in advance in order to properly interpret the bytes you read in.
For example, if you know the file is encoded in UTF-8:
with open('filename', 'rb') as f:
    contents = f.read().decode('utf-8-sig')  # -sig deals with BOM, if present
Or if you know the file is ASCII only:
with open('filename', 'r') as f:
    contents = f.read()  # results in a str object
If you really don't know the encoding of the file, then there's obviously no guarantee that you can read it properly; however, you can guess at the encoding using a tool like chardet.
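A minimal sketch of that, assuming the third-party chardet package is installed (pip install chardet); detect() only returns a guess with a confidence score, so treat it as a hint, not a guarantee:
import chardet

with open('filename', 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
contents = raw.decode(guess['encoding'])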
UPDATE:
I think I understand your question now. I thought you had a file you needed to write code for, but it seems you have code you need to write a file for ;-)
The code in question probably only deals properly with plain ASCII (it's possible the strings are converted later, but unlikely I think). So you'll want to make a text file that contains only ASCII (codepoint < 128) characters, and make sure it is saved in an ASCII encoding (i.e. not UTF-16 or anything like that). This is a little unfortunate considering that Mercurial deals with filenames, which can contain Unicode characters.
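So, on the calling side, a minimal sketch of producing such a list file (file names taken from the example above), keeping everything in plain ASCII:
# Write the file names A, B and C, one per line, as ASCII-only text.
names = ['A', 'B', 'C']
with open('input.txt', 'w') as f:
    f.write('\n'.join(names))
after which hg add listfile:input.txt picks them up.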
I'm programming in Python 3 and I'm having a small problem which I can't find any reference to it on the net.
As far as I understand, the default string in Python 3 is utf-16, but I must work with utf-8, and I can't find the command that will convert from the default one to utf-8.
I'd appreciate your help very much.
In Python 3 there are two different datatypes that are important when you are working with string manipulation. First there is the string class, an object that represents unicode code points. The important thing to get is that this string is not some bytes, but really a sequence of characters. Secondly, there is the bytes class, which is just a sequence of bytes, often representing a string stored in an encoding (like utf-8 or iso-8859-15).
What does this mean for you? As far as I understand, you want to read and write utf-8 files. Let's make a program that replaces all 'ć' characters with 'ç':
def main():
    # First open an output file. The encoding argument tells Python that
    # whatever we write to this file should be encoded as utf-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. We give open() the encoding so it returns
        # unicode strings.
        for line in open('input_file', encoding='utf-8'):
            # Replace the characters we want. A string literal in Python 3 is
            # automatically a unicode string, so no worries about encoding
            # there. Because we opened the output file with the utf-8
            # encoding, print() will encode the whole string to utf-8.
            # (end='' because the line already carries its newline.)
            print(line.replace('ć', 'ç'), end='', file=out_file)
So when should you use bytes? Not often. An example I can think of is when you read something from a socket. If you have the data in a bytes object, you can make it a unicode string by doing bytes.decode('encoding'), and vice versa with str.encode('encoding'). But as said, you probably won't need it.
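For example, with the 'ć' from the example above:
>>> 'ć'.encode('utf-8')          # str -> bytes
b'\xc4\x87'
>>> b'\xc4\x87'.decode('utf-8')  # bytes -> str
'ć'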
Still, because it is interesting, here the hard way, where you encode everything yourself:
def main():
    # Open the output file in binary mode, so we write bytes instead of strings.
    with open('output_file', 'wb') as out_file:
        # Read every line. Again, we open the input binary, so we get bytes.
        for line_bytes in open('input_file', 'rb'):
            # Convert the bytes to a string
            line_string = line_bytes.decode('utf-8')
            # Replace the characters we want
            line_string = line_string.replace('ć', 'ç')
            # Encode the string back to bytes
            out_bytes = line_string.encode('utf-8')
            # Write the bytes
            out_file.write(out_bytes)
Good reading about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Really recommended read!
Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(P.S. As you see, I didn't mention utf-16 in this post. I actually don't know whether Python uses it as its internal encoding or not, but it is totally irrelevant: the moment you are working with a string, you work with characters (code points), not bytes.)