binary -> UTF-8 -> string - python

I'm trying to understand Unicode and all the things associated with it. I made a utf8.txt file which, obviously, is encoded in UTF-8. It has "Hello world!" inside.
Here's what I do:
f = open('utf8.txt', mode = 'r', encoding = 'utf8')
f.read()
What I get is: '\ufeffHello world!' Where did the prefix come from?
Another try:
f = open('utf8.txt', 'rb')
byte = f.read()
Printing byte gives: b'\xef\xbb\xbfHello world!' I assume that's the same prefix, shown as hex bytes.
byte.decode('utf8')
The above code again gives me: '\ufeffHello world!'
What am I doing wrong? How do I retrieve the text from a UTF-8 file into Python?
Thanks for the feedback!

Your utf8.txt is encoded as UTF-8 with a BOM, which is different from plain UTF-8. For such a file, the BOM character '\uFEFF' is written at the beginning of the file. Instead of using encoding = 'utf8', try encoding = 'utf-8-sig', which strips the BOM automatically:
f = open('utf8.txt', mode = 'r', encoding = 'utf-8-sig')
print(f.read())
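If you would rather handle the raw bytes yourself, the same BOM can be stripped manually before decoding; a minimal sketch:
import codecs

with open('utf8.txt', 'rb') as f:
    data = f.read()

# codecs.BOM_UTF8 is b'\xef\xbb\xbf'; drop it if present, then decode
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
print(data.decode('utf8'))  # 'Hello world!'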

Related

How to read a binary file as text in Python 3.9

I have a .sql file that I want to read into my Python session (Python 3.9). I'm opening it using the file context manager.
with open('file.sql', 'r') as f:
    text = f.read()
When I print the text, I still get the binary characters, i.e., \xff\xfe\r\x00\n\x00-\x00-..., etc.
I've tried all the arguments such as 'rb', encoding='utf-8', etc., but the results are still binary text. It should be noted that I've used this very same procedure many times before in my code, and it has not been a problem.
Did something change in Python 3.9?
The first two bytes \xff\xfe look like a BOM (Byte Order Mark), and the table on the Wikipedia page for BOM shows that \xff\xfe can mean the encoding is UTF-16-LE.
So you could try:
with open('file.sql', 'r', encoding='utf-16-le') as f:
    text = f.read()
EDIT:
There is also the module chardet, which you can try using to detect the encoding.
import chardet

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

info = chardet.detect(data)
print(info['encoding'])
text = data.decode(info['encoding'])
Usually files don't have a BOM, but if they do, you can try to detect it using the example from unicodebook.readthedocs.io/guess_encoding/check-for-bom-markers
from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE

# The UTF-32 BOMs are listed before the UTF-16 ones because BOM_UTF32_LE
# starts with the same two bytes as BOM_UTF16_LE.
BOMS = (
    (BOM_UTF8, "UTF-8"),
    (BOM_UTF32_BE, "UTF-32-BE"),
    (BOM_UTF32_LE, "UTF-32-LE"),
    (BOM_UTF16_BE, "UTF-16-BE"),
    (BOM_UTF16_LE, "UTF-16-LE"),
)

def check_bom(data):
    return [encoding for bom, encoding in BOMS if data.startswith(bom)]

# ---------

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

encoding = check_bom(data)
print(encoding)

if encoding:
    text = data.decode(encoding[0])
else:
    print('unknown encoding')
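For the .sql file from the question, whose bytes start with \xff\xfe\r\x00, only BOM_UTF16_LE matches (the UTF-32-LE BOM would require \xff\xfe\x00\x00), so check_bom returns ['UTF-16-LE'] and the decode picks the right codec.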

How to convert cp866 to utf-8 encoding in Python

I have a file in cp866 encoding, and I open it:
input_file = open(file_name, 'r', encoding = 'cp866')
How can I print() lines from this file as utf-8?
I need to decode this file to utf-8 and print it.
Well, the characters read from the file will be decoded and stored in memory as Python strings. You can print them on screen and they should be correct. You can then save the data as utf-8.
You can try a simple conversion:
result = text.encode('utf8')  # text is already a decoded str after reading with encoding='cp866'; one encode gives UTF-8 bytes
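Expanded into a full read-and-save round trip, a minimal sketch (input.txt and output.txt are placeholder names):
# Read cp866-encoded text; Python decodes it to str in memory.
with open('input.txt', 'r', encoding='cp866') as input_file:
    text = input_file.read()

print(text)  # prints correctly if the terminal can display the characters

# Write the same text back out encoded as UTF-8.
with open('output.txt', 'w', encoding='utf-8') as output_file:
    output_file.write(text)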
data = myfile.read()
b = bytes(data, "KOI8-R")
data_encoding = str(b, "cp1251")
This reinterprets the text as cp1251; usually from cp866 you want to decode to cp1251.
If you only need to decode once, try this web-site.

Python reading a PE file and changing resource section

I am trying to open a Windows PE file and alter some strings in the resource section.
f = open('c:\\test\\file.exe', 'rb')
file = f.read()
if b'A'*10 in file:
    s = file.replace(b'A'*10, newstring)
In the resource section I have a string that is just:
AAAAAAAAAA
And I want to replace that with something else. When I read the file I get:
\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A
I have tried opening with UTF-16 and decoding as UTF-16, but then I run into an error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1604-1605: illegal encoding
Everyone I've seen who had the same issue fixed it by decoding to UTF-16. I am not sure why this doesn't work for me.
If the resource inside the binary file is encoded as UTF-16, you shouldn't change the encoding. Encode your search string the same way instead.
Try this:
f = open('c:\\test\\file.exe', 'rb')
file = f.read()

# 'UTF-16' prepends a BOM (b'\xff\xfe') that is not present inside the file,
# so use an explicit endianness variant, which writes no BOM. PE resource
# strings are normally little-endian; if your dump really pairs as \x00A
# rather than A\x00, try 'UTF-16-BE' instead.
unicode_str = u'AAAAAAAAAA'
encoded_str = unicode_str.encode('UTF-16-LE')

if encoded_str in file:
    s = file.replace(encoded_str, new_utf_string.encode('UTF-16-LE'))
Keep in mind that everything inside a binary file is encoded.
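One caveat: replacing bytes with a sequence of a different length shifts every offset after it and will usually corrupt the PE. A minimal sketch that pads the replacement to the original length (the 'BBBB' replacement is a made-up example):
old = u'AAAAAAAAAA'.encode('UTF-16-LE')
new = u'BBBB'.encode('UTF-16-LE')

# Pad with NUL bytes so the patched file keeps exactly the same size.
assert len(new) <= len(old)
s = file.replace(old, new + b'\x00' * (len(old) - len(new)))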

Convert the content of a file from hex to base64 and print out the result

So I'm trying to create a very simple program that opens a file, reads it, and converts its contents from hex to base64 using Python 3.
I tried this :
file = open("test.txt", "r")
contenu = file.read()
encoded = contenu.decode("hex").encode("base64")
print (encoded)
but I get the error:
AttributeError: 'str' object has no attribute 'decode'
I tried multiple other things but always get the same error.
inside the test.txt is :
4B
If you guys can explain what I'm doing wrong, that would be awesome.
Thank you
EDIT:
I should get Sw== as output.
This should do the trick. Your code works for Python <= 2.7 but needs updating in later versions.
import base64

file = open("test.txt", "r")
contenu = file.read()

# Use a name other than "bytes" so the builtin type is not shadowed.
raw = bytearray.fromhex(contenu)
encoded = base64.b64encode(raw).decode('ascii')
print(encoded)
You need to convert the hex string from test.txt to a bytes-like object using bytes.fromhex() before encoding it to base64.
import base64

with open("test.txt", "r") as file:
    content = file.read()

encoded = base64.b64encode(bytes.fromhex(content))
print(encoded)
You should always use the with statement when opening your file, to automatically close the I/O when finished.
In IDLE:
>>> import base64
>>>
>>> with open('test.txt', 'r') as file:
...     content = file.read()
...     encoded = base64.b64encode(bytes.fromhex(content))
...
>>> encoded
b'Sw=='
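To print it as plain text rather than a bytes repr, decode it first:
>>> print(encoded.decode('ascii'))
Sw==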

Encoding - Problems creating JSON using a Unicode Object in Python

I've got some real problems encoding/decoding strings to a specific charset (UTF-8).
My Unicode Object is:
>> u'Valor Econ\xf4mico - Opini\xe3o'
When I call print from python it returns:
>> Valor Econômico - Opinião
When I call .encode("utf-8") from my unicode object to write it to a JSON it returns:
>> 'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o'
What am I doing wrong? What exactly is print() doing that I'm not?
Note: I'm creating this unicode object from a line of a file.
import codecs

with codecs.open(path, 'r') as local_file:
    for line in local_file:
        obj = unicode(line.replace(codecs.BOM_UTF8, '').replace('\n', ''), 'utf-8')
'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o' is the UTF-8 byte representation, displayed with escape sequences for a non-UTF-8 terminal, probably in the interactive shell. If you were to write it to a file (open("myfile", "wb").write('Valor Econ\xc3\xb4mico - Opini\xc3\xa3o')), then you'd have a valid UTF-8 file.
To create Unicode strings from a file, you can use automatic decoding in the io module (codecs.open() is being deprecated). Use the utf-8-sig codec so that a BOM, if present, is removed automatically:
import io

with io.open(path, "r", encoding="utf-8-sig") as local_file:
    for line in local_file:
        unicode_obj = line.strip()
When it comes to creating a JSON response, use the result of json.dumps(my_object). It will return a str with all non-ASCII characters escaped as \uXXXX Unicode code points.
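A minimal sketch of that under Python 2, as in the question (my_object here is a made-up dict):
import json

my_object = {'title': u'Valor Econ\xf4mico - Opini\xe3o'}
print(json.dumps(my_object))
# {"title": "Valor Econ\u00f4mico - Opini\u00e3o"}  - plain ASCII str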
