I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, it seems the characters are encoded in EUC-KR, but the files themselves were saved as UTF-8 with BOM.
So far I managed to fix a file with the following steps:
1. Open the file with EditPlus (it shows the file's encoding as UTF8+BOM).
2. In EditPlus, save the file as ANSI.
3. Lastly, in Python:
import codecs

with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()
with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do step 2 in Python.
I am presuming here that you have text that was encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin-1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring: Latin-1 maps code points 0–255 one-to-one back to bytes 0–255.
Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:
import io

with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')
with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
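To automate the whole thing over many files, a small loop around that should do. A sketch, assuming the files match *.html and can be overwritten in place:

import glob
import io

for html in glob.glob('*.html'):
    with io.open(html, encoding='utf-8-sig') as infh:
        data = infh.read().encode('latin1').decode('euc-kr')
    with io.open(html, 'w', encoding='utf8') as outfh:
        outfh.write(data)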
Related
I have an exercise to make a script which converts UTF-16 files to UTF-8, so I wanted one example file with UTF-16 encoding. The problem is that the encoding Python shows me for every file is 'cp1250' (no matter the format, .csv or .txt). What am I missing here? I also have example files from the Internet, but Python recognizes them as cp1250 too. Even when I save a file as UTF-8, Python shows cp1250 encoding.
This is the code I use:
with open('FILE') as f:
    print(f.encoding)
The encoding reported by open is simply your system's default encoding; open does not detect anything. To open the file in something else, you have to say so explicitly.
To actually convert a file, try something like
with open('input', encoding='cp1252') as input, \
        open('output', 'w', encoding='utf-16le') as output:
    for line in input:
        output.write(line)
Converting a legacy 8-bit file to Unicode isn't really useful because it only exercises a small subset of the character set. See if you can find a good "hello world" sample file. https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html is one for UTF-8.
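For the exercise itself (UTF-16 in, UTF-8 out), a minimal sketch could look like the following; the file names and the sample text are placeholders:

# Create a UTF-16 sample file (the codec writes a BOM automatically),
# then convert it to UTF-8.
sample = 'Zażółć gęślą jaźń\n'  # any non-ASCII text will do

with open('sample.txt', 'w', encoding='utf-16') as f:
    f.write(sample)

with open('sample.txt', encoding='utf-16') as src, \
        open('sample_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())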
I am trying to write a Python application for converting old DOS code page text files to their Unicode equivalent. Now, I have done this before using Turbo Pascal by creating a look-up table and I'm sure the same can be done using a Python dictionary. My question is: How do I index into the dictionary to find the character I want to convert and send the equivalent Unicode to a Unicode output file?
I realize that this may be a repeat of a similar question but nothing I searched for here quite matches my question.
Python has codecs to do the conversion:
#!python3
# Create a test file containing bytes 0-255.
with open('dos.txt', 'wb') as f:
    f.write(bytes(range(256)))

# Read the file and decode using code page 437 (DOS OEM-US).
# Write the file in the UTF-8 encoding ("Unicode" is not an encoding;
# UTF-8, UTF-16 and UTF-32 are encodings that support all Unicode codepoints).
with open('dos.txt', encoding='cp437') as infile:
    with open('unicode.txt', 'w', encoding='utf8') as outfile:
        outfile.write(infile.read())
You can also stream the conversion line by line; the text-mode open() calls do the decoding and encoding for you:
with open('dos.txt', 'r', encoding='cp437') as infile, \
        open('unicode.txt', 'w', encoding='utf8') as outfile:
    for line in infile:
        outfile.write(line)
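If you do want the explicit look-up table from the question, a dictionary indexed by byte value works as well; this sketch builds the table from the cp437 codec instead of typing it in by hand:

# byte value (0-255) -> the corresponding Unicode character
table = {i: bytes([i]).decode('cp437') for i in range(256)}

with open('dos.txt', 'rb') as infile, \
        open('unicode.txt', 'w', encoding='utf8') as outfile:
    for byte in infile.read():  # iterating over bytes yields ints in Python 3
        outfile.write(table[byte])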
I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do it for 600,000 files. Can I do this in R or Python?
I found this link, but the Python script in it does not run:
notepad++ converting ansi encoded file to utf-8
Why don't you read the file and write it as UTF-8? You can do that in Python.
# to support encodings
import codecs

# read the input file; 'ANSI' on Western Windows usually means cp1252
with codecs.open(path, 'r', encoding='cp1252') as f:
    lines = f.read()

# write the output file as UTF-8
with codecs.open(path, 'w', encoding='utf8') as f:
    f.write(lines)
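For the 600,000-file case, wrap that in a loop. A sketch, assuming the files sit under one directory and that 'ANSI' means cp1252 on your machine:

from pathlib import Path

root = Path('path/to/files')   # placeholder: your folder of ANSI files
for p in root.rglob('*.txt'):  # adjust the pattern to match your files
    text = p.read_text(encoding='cp1252')
    p.write_text(text, encoding='utf-8')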
I appreciate that this is an old question, but having just resolved a similar problem recently I thought I would share my solution.
I had a file being prepared by one program that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.
The ANSI encoding is recognised as 'mbcs' in Python (a Windows-only codec), and so the code I used, adapted from something else I found, is:
import codecs

blockSize = 1048576  # copy in 1 MiB chunks
with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
        while True:
            contents = sourceFile.read(blockSize)
            if not contents:
                break
            targetFile.write(contents)
The link below contains some information on the encoding types that I found during my research:
https://docs.python.org/2.4/lib/standard-encodings.html
I want to open files according to their encoding, so I detect it first as follows:
import magic
import csv

i_file = open(filename).read()
mag = magic.Magic(mime_encoding=True)
encoding = mag.from_buffer(i_file)
print "The encoding is", encoding
Once I know the encoding format, I try to open the file using the right one:
import codecs

csvlist = []
with codecs.open(filename, "rb", encoding) as f_obj:
    reader = csv.reader(f_obj)
    for row in reader:
        csvlist.append(row)
However, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
when trying to open a CSV file whose detected encoding is:
The encoding is utf-16le
The funny part comes here: if utf-16le is replaced with utf-16, the UTF-16LE CSV file is read properly. However, that replacement breaks reading plain ASCII CSV files.
What am I doing wrong?
Python 2's csv module doesn't support Unicode. Is switching to Python 3 an option? If not, can you convert the input file to UTF-8 first?
From the docs linked above:
The csv module doesn’t directly support reading and writing Unicode,
but it is 8-bit-clean save (sic!) for some problems with ASCII NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.
Quick and dirty example:
with codecs.open(filename, "rb", encoding) as f_obj:
    with codecs.open(filename + "u8", "wb", "utf-8") as utf8:
        utf8.write(f_obj.read())

# Hand csv.reader plain UTF-8 bytes (not unicode); decode individual
# cells afterwards if needed.
with open(filename + "u8", "rb") as f_obj:
    reader = csv.reader(f_obj)
    # etc.
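As for why utf-16 works where utf-16le doesn't: the utf-16 codec consumes the BOM, while utf-16le treats it as data, which is where the u'\ufeff' in your traceback comes from. A quick Python 2 demonstration:

data = u'\ufeffname,city'.encode('utf-16-le')  # BOM followed by the payload
print repr(data.decode('utf-16'))     # u'name,city'        (BOM consumed)
print repr(data.decode('utf-16-le'))  # u'\ufeffname,city'  (BOM kept as data)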
This may be a bit useful to you. Check out the Python 2 documentation:
https://docs.python.org/2/library/csv.html
Especially this section:
For all other encodings the following UnicodeReader and UnicodeWriter
classes can be used. They take an additional encoding parameter in
their constructor and make sure that the data passes the real reader
or writer encoded as UTF-8:
Look at the bottom of the page!
I have to import some data into my Yahoo Marketing account, and the CSV file has to be encoded as Yahoo specifies: "CSV/TSV files: Unicode (technically UTF-16LE encoding)".
writer = csv.writer(open('new_yahoo.csv','w', encoding='utf-16-le'), delimiter="\t")
writer.writerows(reader)
If you scroll down to the examples provided on the Python CSV page, you'll find this:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
But if you do need to do unicode, it looks like this could help:
unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings).
...
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
So it looks like the example they provide at the bottom should do the encoding you want.
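If it helps, here is a condensed Python 2 sketch of that UnicodeWriter recipe, with the encoding defaulted to UTF-16LE for the Yahoo case: csv writes UTF-8 bytes into an in-memory queue, which is then recoded to the target encoding and pushed to the real stream.

import csv, codecs, cStringIO

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding='utf-16-le', **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        # csv can only handle bytes, so encode each cell to UTF-8 first...
        self.writer.writerow([s.encode('utf-8') for s in row])
        # ...then fetch the UTF-8 line and recode it to the target encoding.
        data = self.queue.getvalue().decode('utf-8')
        self.stream.write(self.encoder.encode(data))
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)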
It looks like you are using Python 3.x, judging by the open call. What you have should work, although you may need to pass the newline parameter as well: newline='' disables Python's newline translation, so the csv writer's own line endings (\r\n by default) pass through untouched, but Yahoo may have other requirements. The code below generated the file correctly on Windows with CRLF line endings.
import csv

data = [
    ['One', 'Two', 'Three'],
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

with open('new_yahoo.csv', 'w', newline='', encoding='utf-16-le') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
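To sanity-check the result, you can look at the first few bytes: plain utf-16-le writes no BOM, while utf-16 would prepend b'\xff\xfe' (worth checking against what Yahoo actually expects).

with open('new_yahoo.csv', 'rb') as f:
    print(f.read(8))  # b'O\x00n\x00e\x00\t\x00' -- no BOM with utf-16-le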