Handle Multiple Languages with xml.etree.ElementTree Module - Python

I'm attempting to parse an XML file and print sections of the contents into a CSV file for manipulation with a program such as Microsoft Excel. The issue I'm running into is that the XML file contains multiple alphabets (Arabic, Cyrillic, etc.) and I'm getting confused over what encoding I should be using.
import csv
import os
import xml.etree.ElementTree as ET

file = 'example.xml'
csvf = open(os.path.splitext(file)[0] + '.csv', 'w+', newline='')
csvw = csv.writer(csvf, delimiter=',')

root = ET.parse(file).getroot()
name_base = root.find("name")
name_base_string = ET.tostring(name_base, encoding="unicode", method="xml").strip()
csvw.writerow([name_base_string])
csvf.close()
I do not know what encoding to pass to the tostring() method. If I use 'unicode' it returns a Python str and all is well when writing to the CSV file, but Excel displays the non-Latin characters incorrectly (every editor I tried on Windows and Linux shows the character sets properly). If I use encoding 'utf-8' the method returns a bytes object, and if I pass that to the CSV writer without decoding I end up with the literal string b'stuff' in the CSV document.
Is there something I'm missing here? Does Excel just suck at handling certain encodings? I've read up on how UTF-8 is an encoding and Unicode is just a character set (that you can't really compare them) but I'm still confused.
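One workaround that is often suggested (my own note, not part of the original question): Excel detects UTF-8 reliably only when the file starts with a byte order mark, which Python's 'utf-8-sig' codec writes for you. A minimal sketch, assuming the same example.xml and <name> element as above:

import csv
import xml.etree.ElementTree as ET

root = ET.parse('example.xml').getroot()
name = ET.tostring(root.find('name'), encoding='unicode', method='xml').strip()

# 'utf-8-sig' prepends a BOM, which Excel uses to recognise the file as UTF-8.
with open('example.csv', 'w', newline='', encoding='utf-8-sig') as csvf:
    csv.writer(csvf, delimiter=',').writerow([name])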

Related

How can I recode mbox files to UTF-8 in Python?

I have exported a bunch of Gmail messages and would like to parse them and get insights using Python. However, after exporting I noticed a strange encoding in these mbox files: for example, the character 'é' appears as =E9, and quote symbols (“ and ”) appear as =E2=80=9C and =E2=80=9D. My emails often contain a lot of foreign script, so it is very important for me to decode these files into UTF-8. I also often have messages with emojis, which convey important sentiment information that I need to preserve.
I found out that this encoding is called Quoted Printable and I tried using the quopri Python module, however, without success.
Here is my simplified code:
import os
import quopri
from pathlib import Path

for filename in os.listdir(directory):
    if filename.endswith(".mbox"):
        input_filename = Path(os.path.join(directory, filename))
        output_filename = Path(os.path.join(directory, filename + '_utf-8'))
        with open(input_filename, 'rb'):
            quopri.decode(input_filename, output_filename)
However, when running this, I get the following error at the last line: AttributeError: 'WindowsPath' object has no attribute 'read'. I don't understand why this error appears, as the path defined points to the file.
quopri.decode() expects file objects, not paths, so you need to bind names to the opened files in the with statements and pass those file objects along, like this:
with input_filename.open('rb') as infile, output_filename.open('wb') as outfile:
    quopri.decode(infile, outfile)
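Put back into the loop from the question, the whole thing would look roughly like this (a sketch; it assumes directory is already defined and that quoted-printable decoding is all you need for these files):

import os
import quopri
from pathlib import Path

directory = 'mbox_exports'  # assumed location of the exported .mbox files

for filename in os.listdir(directory):
    if filename.endswith('.mbox'):
        input_filename = Path(directory, filename)
        output_filename = Path(directory, filename + '_utf-8')
        # quopri.decode() expects file objects opened in binary mode,
        # not Path objects.
        with input_filename.open('rb') as infile, output_filename.open('wb') as outfile:
            quopri.decode(infile, outfile)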

'utf-8' codec can't decode byte - Python

My Django application works with both .txt and .doc file types. It opens a file, compares it with other files in the database, and prints out a report.
The problem is that when the file type is .txt, I get a 'utf-8' codec can't decode byte error (I'm opening the file with encoding='utf-8'). When I switch encoding='utf-8' to encoding='ISO-8859-1', the error changes to 'latin-1' codec can't decode byte.
I want to find such encoding format that works with every type of a file. This is a small part of my function:
views.py:
@login_required(login_url='sign_in')
def result(request):
    last_uploaded = OriginalDocument.objects.latest('id')
    original = open(str(last_uploaded.document), 'r', encoding='utf-8')
    original_words = original.read().lower().split()
    words_count = len(original_words)
    open_original = open(str(last_uploaded.document), "r")
    read_original = open_original.read()
    report_fives = open("static/report_documents/" + str(last_uploaded.student_name) +
                        "-" + str(last_uploaded.document_title) + "-5.txt", 'w')
    # Path to the documents the original doc is compared against
    path = 'static/other_documents/doc*.txt'
    files = glob.glob(path)
    rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(
        last_uploaded, 5, original_words, report_fives, files)
    context = {
        ...
    }
    return render(request, 'result.html', context)
There is no universal encoding that can automatically decode a file that was encoded in some other, unknown encoding.
UTF-8 is a good option and is compatible with many other encodings. You can, for example, simply ignore or replace characters that aren't decodable, like this:
from codecs import open
original = open(str(last_uploaded.document), encoding="utf-8", errors="ignore")
original_words = original.read().lower().split()
...
original.close()
Or, even better, use a context manager (with statement), which closes the file for you:
with open(str(last_uploaded.document), encoding="utf-8", errors="ignore") as fr:
    original_words = fr.read().lower().split()
    ...
(Note: You do not need to use the codecs library if you're using Python 3, but you have tagged your question with python-2.7.)
You can see the advantages and disadvantages of using different error handlers here and here. Be aware that not specifying an error handler defaults to errors="strict", which you probably do not want. The other options are mostly self-explanatory, e.g.:
using errors="replace" will replace an undecodable character with a suitable replacement marker
using errors="ignore" will simply ignore the character and continue reading the file data.
What you should use depends on your needs and usecase(s).
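A quick illustration of the difference between the two handlers (my own example, not from the original answer):

data = b'caf\xe9'  # Latin-1 encoded bytes, invalid as UTF-8

print(data.decode('utf-8', errors='replace'))  # 'caf\ufffd' - the bad byte becomes U+FFFD
print(data.decode('utf-8', errors='ignore'))   # 'caf'      - the bad byte is dropped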
You say that you have encoding problems not only with plain text files but also with proprietary .doc files:
The .doc format is not plain text that you can simply read with open() or codecs.open(), since a lot of its information is stored in binary form; see this site for more information. So you need a special reader for .doc files to get the text out of them. Which library to use depends on your Python version and possibly on your operating system. Maybe here is a good starting point for you.
Unfortunately, using a library does not completely protect you from encoding errors (perhaps it does, but I'm not sure whether the encoding is stored in the file itself, as it is in a .docx file). You may also be able to figure out the file's encoding yourself. How encoding errors can be handled likely depends on the library itself.
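As one illustration (my own sketch, not part of the original answer; it assumes the third-party textract package is installed and has a working .doc backend on your system):

import textract

# textract dispatches to a backend based on the file extension and
# returns the extracted text as bytes.
raw = textract.process('static/report_documents/example.doc')  # hypothetical path
text = raw.decode('utf-8', errors='replace')

words = text.lower().split()
print(len(words))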
So my guess is that you are trying to open .doc files as plain text files. Then you will get decoding errors, because their content is not stored as human-readable text. And even if you get rid of the error, you will only see non-human-readable content (I created a simple document with LibreOffice in .doc format (Microsoft Word 1997-2003)):
In [1]: open("./test.doc", "r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
In [2]: open("./test.doc", "r", errors="replace").read() # or open("./test.doc", "rb").read()
'��\x11\u0871\x1a�\x00\x00\x00' ...

Python 3.5.1 - file with lines in mixed UTF-8 and UTF-16 encodings

I have successfully been parsing data files that I receive with a simple Python script I wrote. The files I get are like this:
file.txt, ~50 columns of data, x 1000s of rows
abcd1,1234a,efgh1,5678a,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
Unfortunately, some of the lines sometimes contain UTF-16 symbols and look like this:
abcd1,12341,efgh1,UTF-16 symbols here,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
I have been able to use "latin-1" encoding for commands in my script like:
open('file fixed.txt', 'w', encoding="latin-1").writelines([line for line in open('file.txt', 'r', encoding="latin-1")])
My problem lies in code such as:
for line in fileinput.FileInput('file fixed.txt', inplace=1):
    line = line.replace(":", ",")
    print(line, ",")
I am unable to get past the encoding errors for the last command. I have tried enforcing the encoding by putting:
# -*- coding: latin-1 -*-
at the top of the script, as well as just before the last mentioned command (the find and replace). How can I get mixed-encoding files to work with the above command? I would like to preserve the UTF-16 (Unicode) symbols as they appear in the new file. Thanks in advance.
EDIT: Thanks to Alexis I was able to determine that fileinput would not work for setting another encoding method. I used the code below to resolve my issue.
f = open(filein,'r', encoding="latin-1")
filedata = f.read()
f.close()
newdata = filedata.replace("old data","new data")
f = open(fileout,'w', encoding="latin-1")
f.write(newdata)
f.close()
You can tell fileinput how to open your files. As the documentation says:
You can control how files are opened by providing an opening hook via the openhook parameter to fileinput.input() or FileInput(). The hook must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. Two useful hooks are already provided by this module.
So you'd do it like this:
def open_utf16(name, m):
    return open(name, m, encoding="utf-16")

for line in fileinput.FileInput("file fixed.txt", openhook=open_utf16):
    ...
I use "utf-16" as the encoding since this is your file's encoding, not "latin-1". 8-bit encodings don't have error checking so Latin1 will read the bytes without noticing there's anything wrong, but you're likely to have problems down the line. If this gives you errors, your file is not in utf-16.
If your file has mixed encoding, you need to read it as binary and then decode different parts as necessary, or just process the whole thing as binary instead. The latin-1 solution in the question works by accident really.
In your example that would be something like:
with open('the/path', 'rb') as fi:
    data = fi.read().replace(b'old data', b'new data')

with open('other/path', 'wb') as fo:
    fo.write(data)
This is the closest to what you ask for - as far as I understand you don't even care about that field with potentially different encoding - you just want to change some content and copy the rest of the file as is. Binary mode allows you to do that.

Fixing corrupt encoding (with Python)

I have a bunch of text files containing Korean characters with the wrong encodings. Specifically, it seems the characters are encoded with EUC-KR, but the files themselves were saved as UTF8+BOM.
So far I managed to fix a file with the following:
1. Open a file with EditPlus (it shows the file's encoding is UTF8+BOM)
2. In EditPlus, save the file as ANSI
3. Lastly, in Python:
import codecs

with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()

with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do step 2 programmatically.
I am presuming here that you have text already encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin 1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:
import io

with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')

with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술

Python - Save CSV file in UTF-16LE

I have to import some data into my Yahoo Marketing account, and the CSV file has to be encoded the way Yahoo specifies - "CSV/TSV files: Unicode (technically UTF-16LE encoding)".
writer = csv.writer(open('new_yahoo.csv','w', encoding='utf-16-le'), delimiter="\t")
writer.writerows(reader)
If you scroll down through the examples provided on the Python 2 csv documentation page, you'll find this note:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
But if you do need to do unicode, it looks like this could help:
unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings).
...
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
So it looks like the example they provide at the bottom should do the encoding you want.
It looks like you are using Python 3.x, judging by the open call. What you have should work, although you may need to pass the newline parameter as well: newline='' stops Python from translating line endings and lets the csv module write its own (CRLF by default), but Yahoo may have other requirements. The code below generated the file correctly on Windows with CRLF line endings.
data = [
    ['One', 'Two', 'Three'],
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]]

import csv

f = open('new_yahoo.csv', 'w', newline='', encoding='utf-16-le')
writer = csv.writer(f, delimiter='\t')
writer.writerows(data)
f.close()
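One extra point worth checking (my addition, not part of the original answer): the 'utf-16-le' codec does not write a byte order mark, while plain 'utf-16' does. If Yahoo or Excel expects a BOM, something like this would add it:

import csv

data = [['One', 'Two', 'Three'], [1, 2, 3]]

# 'utf-16' (without the '-le' suffix) prepends a BOM and uses the
# platform's native byte order, which is little-endian on most machines.
with open('new_yahoo.csv', 'w', newline='', encoding='utf-16') as f:
    csv.writer(f, delimiter='\t').writerows(data)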
