Converting DOS text files to Unicode using Python - python

I am trying to write a Python application for converting old DOS code page text files to their Unicode equivalent. Now, I have done this before using Turbo Pascal by creating a look-up table and I'm sure the same can be done using a Python dictionary. My question is: How do I index into the dictionary to find the character I want to convert and send the equivalent Unicode to a Unicode output file?
I realize that this may be a repeat of a similar question but nothing I searched for here quite matches my question.

Python has the codecs to do the conversions:
#!python3
# Test file with bytes 0-255.
with open('dos.txt','wb') as f:
    f.write(bytes(range(256)))
# Read the file and decode using code page 437 (DOS OEM-US).
# Write the file as UTF-8 encoding ("Unicode" is not an encoding)
# UTF-8, UTF-16, UTF-32 are encodings that support all Unicode codepoints.
with open('dos.txt',encoding='cp437') as infile:
    with open('unicode.txt','w',encoding='utf8') as outfile:
        outfile.write(infile.read())

You can also pass the encoding to the built-in open() function and copy the file line by line:
with open('dos.txt', 'r', encoding='cp437') as infile, \
     open('unicode.txt', 'w', encoding='utf8') as outfile:
    for line in infile:
        outfile.write(line)
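
For the look-up table the question asks about, here is a minimal sketch. It assumes code page 437 and builds the table from Python's own cp437 codec instead of typing it by hand; the names CP437_TABLE and dos_to_unicode are made up for this illustration. In practice bytes.decode('cp437') gives the same result in one call.

```python
# Look-up table: byte value (0-255) -> Unicode character, built from the
# cp437 codec that ships with Python. CP437_TABLE is a hypothetical name.
CP437_TABLE = {b: bytes([b]).decode('cp437') for b in range(256)}


def dos_to_unicode(data):
    """Convert DOS (cp437) bytes to a str by indexing the table per byte."""
    # Iterating over a bytes object yields integers, which are the dict keys.
    return ''.join(CP437_TABLE[b] for b in data)


sample = bytes([0x48, 0x80, 0x81, 0xE1])  # "H" followed by three high bytes
converted = dos_to_unicode(sample)

# The table approach and the built-in codec agree.
assert converted == sample.decode('cp437')
```

Indexing works because each element of a bytes object is already an integer, so no ord() call is needed in Python 3.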

Related

read a file and try to remove all non UTF-8 chars

I am trying to read a file and convert the string to a UTF-8 string, in order to remove some non utf-8 chars in the file string,
file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')
but I got the following error,
AttributeError: 'str' object has no attribute 'decode'
Update: I tried the code as suggested by the answer,
file_str = open(file_path, 'r', encoding='utf-8').read()
but it didn't eliminate the non utf-8 chars, so how to remove them?
Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 the open() call with text mode (the default) returned a file object that decodes the data to Unicode strings for you.
You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:
file_str = open(file_path, 'r', encoding='utf8').read()
For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have mojibake, as the UTF-8 data is decoded using CP1252 or a similar 8-bit codec.
See the open() function documentation for further details.
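
A small sketch of the mojibake effect described above, using an in-memory byte string instead of a file. The sample word is an arbitrary choice for illustration.

```python
text = 'café'
utf8_bytes = text.encode('utf-8')       # 'é' becomes the two bytes C3 A9

# Decoding UTF-8 data with an 8-bit codec turns each byte into its own character.
mojibake = utf8_bytes.decode('cp1252')
# Decoding with the codec the data was written in restores the text.
correct = utf8_bytes.decode('utf-8')

print(repr(mojibake))
print(repr(correct))
```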
If you use
file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()
then bytes that are not valid UTF-8 are silently dropped. The open() function documentation has a section on the possible values for the errors parameter.
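
To see what the errors parameter does, here is a short sketch that decodes a byte string directly; bytes.decode() accepts the same errors values as open(). The sample bytes are made up for the illustration.

```python
raw = b'caf\xe9 ok'  # \xe9 is a Latin-1 'é', which is not valid UTF-8 here

# 'ignore' drops the offending byte without any trace.
ignored = raw.decode('utf-8', errors='ignore')
# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode('utf-8', errors='replace')

print(repr(ignored))
print(repr(replaced))
```

Using 'replace' while investigating makes it easier to spot where data was lost before deciding to drop it.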

Encoding error even using the right codec

I want to open files depending on the encoding format, therefore I do the following:
import magic
import csv
i_file = open(filename).read()
mag = magic.Magic(mime_encoding=True)
encoding = mag.from_buffer(i_file)
print "The encoding is ",encoding
Once I know the encoding format, I try to open the file using the right one:
with codecs.open(filename, "rb", encoding) as f_obj:
    reader = csv.reader(f_obj)
    for row in reader:
        csvlist.append(row)
However, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
when trying to open a CSV file whose encoding is:
The encoding is utf-16le
Here is the odd part: if utf-16le is replaced by utf-16, the UTF-16LE CSV file is read properly. However, that setting does not work for ASCII CSV files.
What am I doing wrong?
Python 2's csv module doesn't support Unicode. Is switching to Python 3 an option? If not, can you convert the input file to UTF-8 first?
From the docs linked above:
The csv module doesn’t directly support reading and writing Unicode,
but it is 8-bit-clean save (sic!) for some problems with ASCII NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.
Quick and dirty example:
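import codecs
import csv
# Re-encode the input as UTF-8, which Python 2's csv module can handle.
with codecs.open(filename, "rb", encoding) as f_obj:
    with codecs.open(filename+"u8", "wb", "utf-8") as utf8:
        utf8.write(f_obj.read())
# Read the UTF-8 copy as plain bytes; csv then yields UTF-8 byte strings.
with open(filename+"u8", "rb") as f_obj:
    reader = csv.reader(f_obj)
    # etc.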
This may be useful to you.
Check out the Python 2 documentation:
https://docs.python.org/2/library/csv.html
Especially this section:
For all other encodings the following UnicodeReader and UnicodeWriter
classes can be used. They take an additional encoding parameter in
their constructor and make sure that the data passes the real reader
or writer encoded as UTF-8:
The classes are listed at the bottom of that page.
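
If switching to Python 3 is an option, as suggested above, the csv module works on decoded text and the whole problem largely goes away. The following is a sketch under that assumption; read_csv_rows and the temporary demo file are invented for the example. The utf-16 codec consumes the byte order mark, whereas utf-16-le keeps it as a U+FEFF character at the start of the data.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding):
    """Read a CSV file as lists of str, decoding with the given encoding."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, newline='', encoding=encoding) as f_obj:
        return list(csv.reader(f_obj))


# Build a small UTF-16 file (with BOM) to try the function on.
fd, demo_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(demo_path, 'w', newline='', encoding='utf-16') as f_obj:
    f_obj.write('name,city\r\nJosé,Zürich\r\n')

rows = read_csv_rows(demo_path, 'utf-16')
os.remove(demo_path)
print(rows)
```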

Fixing corrupt encoding (with Python)

I have a bunch of text files that contain Korean characters with the wrong encoding. Specifically, it seems the characters are encoded with EUC-KR, but the files themselves were saved as UTF8+BOM.
So far I managed to fix a file with the following:
1. Open a file with EditPlus (it shows the file's encoding is UTF8+BOM)
2. In EditPlus, save the file as ANSI
3. Lastly, in Python:
with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()
with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do step 2 (saving as ANSI) in Python.
I am presuming here that you have text already encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin 1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:
import io
with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')
with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
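
The demo above is Python 2. A Python 3 sketch of the same decode, encode, decode chain follows; the function name repair_double_encoded is hypothetical, and it assumes the damage is exactly "EUC-KR bytes misread as Latin-1, then saved as UTF-8 with BOM".

```python
def repair_double_encoded(raw, inner='euc-kr'):
    """Undo text that was encoded as `inner`, misread as Latin-1, then saved as UTF-8."""
    text = raw.decode('utf-8-sig')          # strip the BOM, undo the outer UTF-8 layer
    original_bytes = text.encode('latin1')  # map each character back to one byte
    return original_bytes.decode(inner)     # decode with the codec actually used


# Same bytes as in the demo above.
broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
fixed = repair_double_encoded(broken)
print(fixed)
```

If a file contains characters outside Latin-1, the encode('latin1') step raises UnicodeEncodeError, which is a useful signal that the assumption does not hold for that file.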

Encoding issue when writing to text file, with Python

I'm writing a short Python script to 'manually' rearrange a CSV file into proper JSON syntax. I use readlines() to read the input file as a list of rows, which I manipulate and concatenate into a single string that is then written to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and it is double-spaced horizontally (a whitespace character is added between each pair of characters). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what. When I check the encoding of the input and output files (using the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string
file_name = "input_file.txt"
my_file = open(file_name)
# make each line of input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list
for i in range(0,len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
I highly recommend Ned Batchelder's presentation
http://nedbatchelder.com/text/unipain.html
for details.
There's an explanation about the use of "unicode" as an encoding on Windows: What's the difference between Unicode and UTF-8?
TLDR:
Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode" as they also use it internally.
Even if Python 2 is a bit lenient as to string/unicode conversions, you should get used to always decoding on input and encoding on output.
In your case
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF16")
# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)
encoded_result = result.encode("UTF-16") #really, just using UTF8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
You need to tell Python which encoding to use when decoding the Hebrew characters; "Unicode" on Windows usually means UTF-16.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
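
The "double-spaced" output in the question is consistent with UTF-16 data being treated as single bytes, although the question does not confirm the file's encoding. The sketch below, with a made-up sample line, shows where the extra characters come from and how decoding first avoids them.

```python
line = 'id\tשלום'                    # hypothetical tab-separated sample row
utf16 = line.encode('utf-16-le')

# Each ASCII character is stored as its ASCII byte followed by a NUL byte,
# which is what shows up as a "space" between characters.
print(utf16[:4])

# Decode on input, work with text, encode on output.
decoded = utf16.decode('utf-16-le')
columns = decoded.split('\t')
utf8 = decoded.encode('utf-8')
print(columns)
```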

Python - Save CSV file in UTF-16LE

I have to import some data into my Yahoo Marketing account, and the CSV file has to be encoded as Yahoo specifies: CSV/TSV files: Unicode (technically UTF-16LE encoding)
writer = csv.writer(open('new_yahoo.csv','w', encoding='utf-16-le'), delimiter="\t")
writer.writerows(reader)
If you scroll down to the examples provided on the Python CSV page, you'll find that it says:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
But if you do need Unicode, it looks like this could help:
unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings).
...
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
So it looks like the example they provide at the bottom should do the encoding you want.
It looks like you are using Python 3.x, judging by the open() call. What you have should work, although you should also pass the newline parameter. With newline='' Python does no newline translation on output, so the csv writer's own line terminator (CRLF by default) is written as-is; Yahoo may have other requirements. The code below generated the file correctly on Windows with CRLF line endings.
import csv
data = [
    ['One','Two','Three'],
    [1,2,3],
    [4,5,6],
    [7,8,9]]
with open('new_yahoo.csv','w', newline='', encoding='utf-16-le') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
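
One detail worth checking is the byte order mark: Python's utf-16-le codec writes no BOM, while utf-16 writes one. Whether the Yahoo importer expects a BOM is not stated in the question, so this sketch only shows how to inspect the difference in memory; csv_bytes is a hypothetical helper.

```python
import csv
import io


def csv_bytes(rows, encoding):
    """Return the exact bytes csv.writer would put in a file for `rows`."""
    raw = io.BytesIO()
    # newline='' keeps the csv module's own '\r\n' terminator untouched.
    text = io.TextIOWrapper(raw, encoding=encoding, newline='')
    csv.writer(text, delimiter='\t').writerows(rows)
    text.flush()
    data = raw.getvalue()
    text.detach()  # leave the BytesIO alone when the wrapper goes away
    return data


without_bom = csv_bytes([['One', 'Two'], [1, 2]], 'utf-16-le')
with_bom = csv_bytes([['One', 'Two'], [1, 2]], 'utf-16')
print(without_bom[:6])
print(with_bom[:2])
```

If a BOM turns out to be required together with little-endian byte order, writing the character '\ufeff' as the first thing in a utf-16-le file is one way to get it.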
