Python - Save CSV file in UTF-16LE

I have to import some data into my Yahoo Marketing account, and the CSV file has to be encoded as specified by Yahoo: "CSV/TSV files: Unicode (technically UTF-16LE encoding)".
writer = csv.writer(open('new_yahoo.csv','w', encoding='utf-16-le'), delimiter="\t")
writer.writerows(reader)

If you scroll down to the examples on the Python csv module documentation page, you'll find that it says:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
But if you do need to handle Unicode, it looks like this could help:
unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings).
...
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
So it looks like the example they provide at the bottom should do the encoding you want.

It looks like you are using Python 3.x, judging by the open call. What you have should work, although you may need to pass the newline parameter as well. newline='' disables Python's newline translation, so the csv module's own line terminator (\r\n by default) is written unchanged; Yahoo may have other requirements. The code below generated the file correctly on Windows with CRLF line endings.
import csv

data = [
    ['One', 'Two', 'Three'],
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

f = open('new_yahoo.csv', 'w', newline='', encoding='utf-16-le')
writer = csv.writer(f, delimiter='\t')
writer.writerows(data)
f.close()
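One thing worth checking (this is an assumption about Yahoo's importer, not something stated in the requirement quoted above): many tools that consume UTF-16 expect a leading byte-order mark, which 'utf-16-le' does not write. Opening the file with encoding='utf-16' instead prepends a BOM automatically:

```python
import csv

data = [['One', 'Two', 'Three'], [1, 2, 3]]

# 'utf-16' writes a byte-order mark (BOM) before the data and uses the
# platform's native byte order; 'utf-16-le' writes no BOM at all.
with open('new_yahoo_bom.csv', 'w', newline='', encoding='utf-16') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
```

If Yahoo's importer rejects files with a BOM, stick with 'utf-16-le' as in the code above.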

Related

Handle Multiple Languages with xml.tree.ElementTree Module

I'm attempting to parse an XML file and print sections of the contents into a CSV file for manipulation with a program such as Microsoft Excel. The issue I'm running into is that the XML file contains multiple alphabets (Arabic, Cyrillic, etc.) and I'm getting confused over what encoding I should be using.
import csv
import os
import xml.etree.ElementTree as ET

file = 'example.xml'
csvf = open(os.path.splitext(file)[0] + '.csv', 'w', newline='')
csvw = csv.writer(csvf, delimiter=',')
root = ET.parse(file).getroot()
name_base = root.find("name")
name_base_string = ET.tostring(name_base, encoding="unicode", method="xml").strip()
csvw.writerow([name_base_string])
csvf.close()
I do not know what encoding to pass to the tostring() method. If I use 'unicode' it returns a Python string and all is well when writing to the CSV file, but Excel seems to handle this improperly (all editors on Windows and Linux display the character sets correctly). If I use encoding 'UTF-8' the method returns a bytes object, and if I pass that to the CSV writer without decoding I get the literal string b'stuff' in the CSV document.
Is there something I'm missing here? Does Excel just handle certain encodings badly? I've read that UTF-8 is an encoding and Unicode is a character set (so the two can't really be compared), but I'm still confused.
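This question has no answer in the thread, but a common workaround for Excel (my assumption, not from this page) is to keep encoding="unicode" in tostring() so you get a Python string, and open the output CSV with utf-8-sig: the UTF-8 BOM it writes is what Excel uses to detect UTF-8 instead of falling back to the local ANSI code page. A minimal sketch:

```python
import csv

rows = [['пример', 'مثال']]  # Cyrillic and Arabic sample text

# 'utf-8-sig' prepends a UTF-8 BOM; Excel uses the BOM to recognise the
# file as UTF-8, while ordinary editors simply ignore it.
with open('example.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f, delimiter=',').writerows(rows)
```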

Python 3.5.1 mixed line code file UTF-8 and UTF-16

I have successfully been parsing data files that I receive with a simple Python script I wrote. The files I get are like this:
file.txt, ~50 columns of data, x 1000s of rows
abcd1,1234a,efgh1,5678a,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
Unfortunately, sometimes some of the lines contain UTF-16 symbols and look like this:
abcd1,12341,efgh1,UTF-16 symbols here,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
I have been able to work around this by using "latin-1" encoding in my script, for example:
open('file fixed.txt', 'w', encoding="latin-1").writelines([line for line in open('file.txt', 'r', encoding="latin-1")])
My problem lies in code such as:
for line in fileinput.FileInput('file fixed.txt', inplace=1):
    line = line.replace(":", ",")
    print(line, ",")
I am unable to get past the encoding errors for the last command. I have tried enforcing the coding of:
# -*- coding: latin-1 -*-
at the top of the document as well as before the last mentioned command (find and replace). How can I get mixed-encoding files to work with the above command? I would like to preserve the UTF-16 (Unicode) symbols as they appear in the new file. Thanks in advance.
EDIT: Thanks to Alexis I was able to determine that fileinput would not work for setting another encoding method. I used the code below to resolve my issue.
f = open(filein,'r', encoding="latin-1")
filedata = f.read()
f.close()
newdata = filedata.replace("old data","new data")
f = open(fileout,'w', encoding="latin-1")
f.write(newdata)
f.close()
You can tell fileinput how to open your files. As the documentation says:
You can control how files are opened by providing an opening hook via the openhook parameter to fileinput.input() or FileInput(). The hook must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. Two useful hooks are already provided by this module.
So you'd do it like this:
def open_utf16(name, m):
    return open(name, m, encoding="utf-16")

for line in fileinput.FileInput("file fixed.txt", openhook=open_utf16):
    ...
I use "utf-16" as the encoding since this is your file's encoding, not "latin-1". 8-bit encodings don't have error checking so Latin1 will read the bytes without noticing there's anything wrong, but you're likely to have problems down the line. If this gives you errors, your file is not in utf-16.
If your file has mixed encoding, you need to read it as binary and then decode different parts as necessary, or just process the whole thing as binary instead. The latin-1 solution in the question works by accident really.
In your example that would be something like:
with open('the/path', 'rb') as fi:
    data = fi.read().replace(b'old data', b'new data')

with open('other/path', 'wb') as fo:
    fo.write(data)
This is the closest to what you ask for - as far as I understand you don't even care about that field with potentially different encoding - you just want to change some content and copy the rest of the file as is. Binary mode allows you to do that.
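If you do want to interpret the text rather than treat it as opaque bytes, one per-line strategy (a sketch of my own, not from the answers above) is to try a strict decoding first and fall back to Latin-1, which maps every possible byte and therefore never raises:

```python
def decode_best_effort(raw: bytes) -> str:
    """Try strict UTF-8 first; fall back to Latin-1, which accepts any byte."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')

print(decode_best_effort(b'caf\xc3\xa9'))  # valid UTF-8 -> café
print(decode_best_effort(b'caf\xe9'))      # invalid UTF-8, Latin-1 fallback -> café
```

Note that the fallback can still misinterpret bytes that were really UTF-16; without knowing each field's true encoding, a guess like this is the best a line-oriented pass can do.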

Converting DOS text files to Unicode using Python

I am trying to write a Python application for converting old DOS code page text files to their Unicode equivalent. Now, I have done this before using Turbo Pascal by creating a look-up table and I'm sure the same can be done using a Python dictionary. My question is: How do I index into the dictionary to find the character I want to convert and send the equivalent Unicode to a Unicode output file?
I realize that this may be a repeat of a similar question but nothing I searched for here quite matches my question.
Python has the codecs to do the conversions:
#!python3
# Test file with bytes 0-255.
with open('dos.txt', 'wb') as f:
    f.write(bytes(range(256)))

# Read the file, decoding with code page 437 (DOS OEM-US), and write it
# back out with UTF-8 encoding. ("Unicode" is not an encoding; UTF-8,
# UTF-16 and UTF-32 are encodings that support all Unicode code points.)
with open('dos.txt', encoding='cp437') as infile:
    with open('unicode.txt', 'w', encoding='utf8') as outfile:
        outfile.write(infile.read())
You can also let the file objects handle the decoding and encoding, streaming the conversion line by line:
with open('dos.txt', 'r', encoding='cp437') as infile, \
        open('unicode.txt', 'w', encoding='utf8') as outfile:
    for line in infile:
        outfile.write(line)
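For completeness, the lookup-table approach the question describes (the Turbo Pascal way) can be built directly from the codec. The codecs-based answers above are simpler, so treat this purely as an illustration:

```python
# Build a CP437 -> Unicode lookup table, one entry per byte value,
# analogous to the Turbo Pascal lookup table described in the question.
table = {b: bytes([b]).decode('cp437') for b in range(256)}

def dos_to_unicode(data: bytes) -> str:
    # Index into the table byte by byte and join the results.
    return ''.join(table[b] for b in data)

print(dos_to_unicode(b'\xe0'))  # CP437 0xE0 is the Greek letter alpha
```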

Encoding error even using the right codec

I want to open files depending on the encoding format, therefore I do the following:
import magic
import csv
import codecs

i_file = open(filename).read()
mag = magic.Magic(mime_encoding=True)
encoding = mag.from_buffer(i_file)
print "The encoding is ",encoding
Once I know the encoding format, I try to open the file using the right one:
with codecs.open(filename, "rb", encoding) as f_obj:
    reader = csv.reader(f_obj)
    for row in reader:
        csvlist.append(row)
However, I get the next error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
trying to open a csv file which encoding is:
The encoding is utf-16le
The funny part comes here: if utf-16le is replaced by utf-16, the UTF-16LE CSV file is read properly. However, with utf-16 hard-coded, plain ASCII CSV files are no longer read correctly.
What am I doing wrong?
Python 2's csv module doesn't support Unicode. Is switching to Python 3 an option? If not, can you convert the input file to UTF-8 first?
From the docs linked above:
The csv module doesn’t directly support reading and writing Unicode,
but it is 8-bit-clean save (sic!) for some problems with ASCII NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.
Quick and dirty example:
with codecs.open(filename, "rb", encoding) as f_obj:
    with codecs.open(filename + "u8", "wb", "utf-8") as utf8:
        utf8.write(f_obj.read())

with codecs.open(filename + "u8", "rb", "utf-8") as f_obj:
    reader = csv.reader(f_obj)
    # etc.
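For reference, the Python 3 route suggested above needs no conversion step at all, because Python 3's csv module works on decoded text streams. A sketch (filenames assumed for illustration):

```python
import csv

# Create a small UTF-16 CSV so the example is self-contained.
with open('data.csv', 'w', newline='', encoding='utf-16') as f:
    csv.writer(f).writerows([['a', 'b'], ['1', '2']])

# Opening the file with the right encoding is the whole fix: 'utf-16'
# consumes the BOM and picks the byte order automatically.
with open('data.csv', newline='', encoding='utf-16') as f:
    csvlist = [row for row in csv.reader(f)]

print(csvlist)  # [['a', 'b'], ['1', '2']]
```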
This may be useful to you. Check out the Python 2 documentation:
https://docs.python.org/2/library/csv.html
Especially this section:
For all other encodings the following UnicodeReader and UnicodeWriter
classes can be used. They take an additional encoding parameter in
their constructor and make sure that the data passes the real reader
or writer encoded as UTF-8:
See the UnicodeReader and UnicodeWriter example at the bottom of that page.

Fixing corrupt encoding (with Python)

I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, it seems the characters are encoded with EUC-KR, but the files themselves were saved as UTF8+BOM.
So far I have managed to fix a file with the following steps:
1. Open the file with EditPlus (it shows the file's encoding is UTF8+BOM)
2. In EditPlus, save the file as ANSI
3. Lastly, in Python:
with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()

with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do step 2 in Python.
I am presuming here that you have text already encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin 1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:
import io

with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')

with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
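The demo uses Python 2 syntax; in Python 3 the same round trip reads:

```python
# UTF-8 BOM followed by EUC-KR bytes that were mis-encoded as UTF-8.
broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'

# Undo the layers in reverse: strip the BOM while decoding as UTF-8,
# recover the original bytestring via Latin-1, then decode as EUC-KR.
fixed = broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
print(fixed)  # 미술
```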
