Change file encoding scheme in Python

I'm trying to open a file using latin-1 encoding in order to produce a file with a different encoding. I get a NameError stating that unicode is not defined. Here is the piece of code I use to do this:
sourceEncoding = "latin-1"
targetEncoding = "utf-8"
source = open(r'C:\Users\chsafouane\Desktop\saf.txt')
target = open(r'C:\Users\chsafouane\Desktop\saf2.txt', "w")
target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
I'm not used to handling files at all, so I don't know whether there is a module I should import to use "unicode".

The fact that unicode is not defined suggests that you're on Python 3, where the unicode built-in no longer exists (str is already Unicode). Here's a snippet that first generates a latin1-encoded file, then does what you want to do: it slurps the latin1-encoded file and spits out a UTF-8-encoded file:
# Generate a latin1-encoded file
txt = u'U+00AxNBSP¡¢£¤¥¦§¨©ª«¬SHY­®¯U+00Bx°±²³´µ¶·¸¹º»¼½¾¿U+00CxÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏU+00DxÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßU+00ExàáâãäåæçèéêëìíîïU+00Fxðñòóôõö÷øùúûüýþÿ'
latin1 = txt.encode('latin1')
with open('example-latin1.txt', 'wb') as fid:
    fid.write(latin1)

# Read in the latin1 file
with open('example-latin1.txt', 'r', encoding='latin1') as fid:
    contents = fid.read()
assert contents == latin1.decode('latin1')  # sanity check

# Spit out a UTF-8-encoded file (specify the encoding explicitly,
# otherwise open() uses the platform default)
with open('converted-utf8.txt', 'w', encoding='utf8') as fid:
    fid.write(contents)
If you want the output to be something other than UTF-8, add a different encoding argument to open, e.g.:
with open('converted-utf_32.txt', 'w', encoding='utf_32') as fid:
    fid.write(contents)
The docs have a list of all supported codecs.
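The whole conversion can also be wrapped in a small helper. This is just a sketch; the reencode name and the demo file names are my own, not from the answer above:

```python
# A tiny latin1-encoded source file for the demo.
with open('demo-latin1.txt', 'w', encoding='latin1') as f:
    f.write('café ¡hola!')

def reencode(src, dst, src_enc, dst_enc):
    # Read with the source encoding, write with the target one;
    # on Python 3, open() does the decode/encode steps for us.
    with open(src, 'r', encoding=src_enc) as fin:
        contents = fin.read()
    with open(dst, 'w', encoding=dst_enc) as fout:
        fout.write(contents)

reencode('demo-latin1.txt', 'demo-utf8.txt', 'latin1', 'utf8')

with open('demo-utf8.txt', 'rb') as f:
    print(f.read())  # b'caf\xc3\xa9 \xc2\xa1hola!'
```

Any pair of codec names accepted by open() works here, so the same helper covers latin-1 to utf-8, cp1252 to utf-16, and so on.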

Related

How to read binary file to text in python 3.9

I have a .sql file that I want to read into my python session (python 3.9). I'm opening it using the file context manager.
with open('file.sql', 'r') as f:
    text = f.read()
When I print the text, I still get the binary characters, i.e., \xff\xfe\r\x00\n\x00-\x00-..., etc.
I've tried all the arguments such as 'rb', encoding='utf-8', etc., but the result is still binary text. It should be noted that I've used this very same procedure many times before in my code and it has not been a problem.
Did something change in python 3.9?
The first two bytes \xff\xfe look like a BOM (Byte Order Mark), and the table on the Wikipedia page for BOM shows that \xff\xfe can mean the encoding UTF-16-LE.
So you could try:
with open('file.sql', 'r', encoding='utf-16-le') as f:
    text = f.read()
EDIT:
There is a module chardet which you may also try, to detect the encoding.
import chardet

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

info = chardet.detect(data)
print(info['encoding'])

text = data.decode(info['encoding'])
Usually files don't have a BOM, but if they do, then you may try to detect it using the example from unicodebook.readthedocs.io/guess_encoding/check-for-bom-markers:
from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE

BOMS = (
    (BOM_UTF8, "UTF-8"),
    (BOM_UTF32_BE, "UTF-32-BE"),
    (BOM_UTF32_LE, "UTF-32-LE"),
    (BOM_UTF16_BE, "UTF-16-BE"),
    (BOM_UTF16_LE, "UTF-16-LE"),
)

def check_bom(data):
    return [encoding for bom, encoding in BOMS if data.startswith(bom)]

# ---------

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

encoding = check_bom(data)
print(encoding)

if encoding:
    text = data.decode(encoding[0])
else:
    print('unknown encoding')
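Python can also strip the BOM for you: the 'utf-16' codec consumes a leading BOM and picks the byte order from it. A sketch, using a small file I create here for demonstration (demo.sql is made up):

```python
import codecs

# Create a small UTF-16-LE file with a BOM, like the one in the question.
raw = codecs.BOM_UTF16_LE + '--hello\r\n'.encode('utf-16-le')
with open('demo.sql', 'wb') as f:
    f.write(raw)

# encoding='utf-16' reads the BOM, infers the byte order, and strips it.
with open('demo.sql', 'r', encoding='utf-16') as f:
    text = f.read()

print(repr(text))  # '--hello\n' (text mode also folds \r\n to \n)
```

Similarly, 'utf-8-sig' handles an optional UTF-8 BOM, which is handy for files exported from Windows tools.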

csv.writer encoding 'utf-8', but reading encoding 'cp1252'

When writing to a file I use the following code. Here it's upper case, but I've also seen the encoding in lower case utf-8.
path_to_file = os.path.join(r'C:\Users\jpm\Downloads', 'c19_Vaccine_Current.csv')

# write to file
with open(path_to_file, 'w', newline='', encoding='UTF-8') as csvfile:
    f = csv.writer(csvfile)
    # write the headers of the csv file
    f.writerow(['County', 'AdminCount', 'AdminCountChange', 'RollAvg', 'AllocDoses',
                'FullyVaccinated', 'FullyVaccinatedChange', 'ReportDate', 'Pop',
                'PctVaccinated', 'LHDInventory', 'CommInventory',
                'TotalInventory', 'InventoryDate'])
And to check if the *.csv is in fact utf-8 I open it and read it:
with open(path_to_file, 'r') as r:
    print(r)
I'm expecting the encoding to be utf-8, but I get:
<_io.TextIOWrapper name='C:\\Users\\jpm\\Downloads\\c19_Vaccine_Current.csv' mode='r' encoding='cp1252'>
I pretty much borrowed the code from this answer. And I've also read the doc. It's crucial that I have the *.csv file as utf-8, but that doesn't appear to be the case.
The encoding has to be specified on open for reading as well as for writing. The default encoding a file is opened with is platform dependent; it appears to be cp1252 on Windows.
You can check the platform's default encoding with this (on Mac it gives utf-8):
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
with open('file', 'r', encoding='utf-8'):
    ...
with open('file', 'w', encoding='utf-8'):
    ...
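Applied to the question's CSV, that means passing the same explicit encoding when reading the file back. A sketch with a made-up file name:

```python
import csv

# Write a small CSV explicitly as UTF-8.
with open('demo.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['County', 'Pop'])
    writer.writerow(['Córdoba', '329'])

# Read it back with the same explicit encoding instead of relying on
# the platform default (which may be cp1252 on Windows).
with open('demo.csv', 'r', newline='', encoding='utf-8') as csvfile:
    rows = list(csv.reader(csvfile))

print(rows)  # [['County', 'Pop'], ['Córdoba', '329']]
```

The TextIOWrapper repr in the question only reports the encoding the file was opened with, not what is stored on disk, so opening with encoding='utf-8' would have shown utf-8 there.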

How can I manipulate a txt file to be all in lowercase in python?

Let's say that I have a txt file that I have to get all in lowercase. I tried this
def lowercase_txt(file):
    file = file.casefold()
    with open(file, encoding="utf8") as f:
        f.read()
Here I get "'str' object has no attribute 'read'"
then I tried
def lowercase_txt(file):
    with open(poem_filename, encoding="utf8") as f:
        f = f.casefold()
        f.read()
and here '_io.TextIOWrapper' object has no attribute 'casefold'
What can I do?
EDIT: I re-ran this exact code and now there are no errors (I don't know why), but the file doesn't change at all; all the letters stay the way they are.
This will rewrite the file. Warning: if there is some type of error in the middle of processing (power failure, you spill coffee on your computer, etc.) you could lose your file. So, you might want to first make a backup of your file:
def lowercase_txt(file_name):
    """
    file_name is the full path to the file to be opened
    """
    with open(file_name, 'r', encoding="utf8") as f:
        contents = f.read()  # read contents of file
    contents = contents.lower()  # convert to lower case
    with open(file_name, 'w', encoding="utf8") as f:  # open for output
        f.write(contents)
For example:
lowercase_txt('/mydirectory/test_file.txt')
Update
The following version opens the file for reading and writing. After the file is read, the file position is reset to the start of the file before the contents are rewritten. This might be a safer option.
def lowercase_txt(file_name):
    """
    file_name is the full path to the file to be opened
    """
    with open(file_name, 'r+', encoding="utf8") as f:
        contents = f.read()  # read contents of file
        contents = contents.lower()  # convert to lower case
        f.seek(0, 0)  # position back to start of file
        f.write(contents)
        f.truncate()  # in case the new content is shorter than the old
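If losing the file on a crash is a real concern, a common pattern (not from the answer above; the function name is mine) is to write a temporary file first and atomically swap it into place with os.replace, so a failure mid-write leaves the original untouched:

```python
import os
import tempfile

def lowercase_txt_atomic(file_name):
    # Read the original contents.
    with open(file_name, 'r', encoding='utf8') as f:
        contents = f.read()

    # Write the lower-cased text to a temp file in the same directory,
    # then atomically replace the original.
    dir_name = os.path.dirname(os.path.abspath(file_name))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf8') as f:
            f.write(contents.lower())
        os.replace(tmp_path, file_name)
    except BaseException:
        os.remove(tmp_path)  # clean up the temp file on failure
        raise
```

The temp file lives in the same directory as the target so that os.replace stays on one filesystem, which is what makes the rename atomic.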

Python reading a PE file and changing resource section

I am trying to open a Windows PE file and alter some strings in the resource section.
f = open('c:\test\file.exe', 'rb')
file = f.read()

if b'A'*10 in file:
    s = file.replace(b'A'*10, newstring)
In the resource section I have a string that is just:
AAAAAAAAAA
And I want to replace that with something else. When I read the file I get:
\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A
I have tried opening with UTF-16 and decoding as UTF-16, but then I run into an error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1604-1605: illegal encoding
Everyone I've seen who had the same issue fixed it by decoding to UTF-16. I am not sure why this doesn't work for me.
If the resource inside the binary file is encoded as UTF-16, you shouldn't change the encoding of the file itself; encode your search and replacement strings to bytes instead and work on the raw bytes.
Try this:
f = open('c:\\test\\file.exe', 'rb')
file = f.read()

new_string = u'BBBBBBBBBB'  # the replacement text

# The bytes shown in the question (b'\x00A\x00A...') are UTF-16-BE
# without a BOM; plain 'UTF-16' would prepend a BOM and use
# little-endian order, so encode explicitly as big-endian.
encoded_str = u'AAAAAAAAAA'.encode('utf-16-be')
if encoded_str in file:
    s = file.replace(encoded_str, new_string.encode('utf-16-be'))
Keep in mind that inside a binary file everything is encoded.
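The byte-order distinction is easy to check interactively; this sketch shows the same two characters under the three UTF-16 spellings:

```python
# Big-endian, no BOM - matches the b'\x00A\x00A...' dump in the question.
print('AA'.encode('utf-16-be'))  # b'\x00A\x00A'

# Little-endian, no BOM - the bytes are swapped.
print('AA'.encode('utf-16-le'))  # b'A\x00A\x00'

# Plain 'utf-16' prepends a BOM and uses the native byte order
# (little-endian on common platforms), so it won't match either dump as-is.
print('AA'.encode('utf-16'))
```

This is why searching for unicode_str.encode('UTF-16') can fail to find a string that is plainly visible in a hex dump: the BOM and the byte order have to match the bytes actually stored in the file.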

Python: how to convert from Windows 1251 to Unicode?

I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.
#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think that the file may be
    # encoded in, inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)
    # now start iterating over our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then the continue
            # keyword will start again with the second encoding
            # from the tuple and so on... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue
    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)
    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()
How can I do that?
Thank you
import codecs

f = codecs.open(filename, 'r', 'cp1251')
u = f.read()  # now the contents have been transformed to a Unicode string

out = codecs.open(output, 'w', 'utf-8')
out.write(u)  # and now the contents have been output as UTF-8
Is this what you intend to do?
This is just a guess, since you didn't specify what you mean by "doesn't work".
If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).
If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".
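On Python 3 the codecs module is no longer needed for this, since open takes an encoding argument directly. A sketch of the same cp1251-to-UTF-8 conversion, with file names made up here:

```python
# Write a small cp1251 file to convert (Cyrillic 'Привет').
with open('input.txt', 'w', encoding='cp1251') as f:
    f.write('Привет')

# Read it as cp1251 - the result is already a (Unicode) str...
with open('input.txt', encoding='cp1251') as f:
    text = f.read()

# ...and write it back out as UTF-8.
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

print(text)  # Привет
```

The decode and encode steps happen inside open, so no explicit .decode()/.encode() calls are needed.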
