I have exported a bunch of Gmail messages and would like to parse them and get insights using Python. However, upon exporting I noticed a strange encoding in these mbox files: for example, the character 'é' shows up as =E9, and the quote symbols “ and ” show up as =E2=80=9C and =E2=80=9D. My emails often contain a lot of foreign script, so it is important for me to decode these files into UTF-8. I also often have messages with emojis, which convey sentiment information I need to preserve.
I found out that this encoding is called quoted-printable, and I tried using Python's quopri module, but without success.
Here is my simplified code:
import os
import quopri
from pathlib import Path

for filename in os.listdir(directory):
    if filename.endswith(".mbox"):
        input_filename = Path(os.path.join(directory, filename))
        output_filename = Path(os.path.join(directory, filename + '_utf-8'))
        with open(input_filename, 'rb'):
            quopri.decode(input_filename, output_filename)
However, when running this, I get the following error on the last line: AttributeError: 'WindowsPath' object has no attribute 'read'. I don't understand why this error appears, as the paths point to existing files.
quopri.decode reads from and writes to file objects, not paths. You need to open both files and declare names for them in the with statement, like this:
with input_filename.open('rb') as infile, output_filename.open('wb') as outfile:
    quopri.decode(infile, outfile)
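Note that quopri only reverses the transfer encoding: the decoded bytes are still in whatever charset each message declared (=E9 is 'é' in Latin-1, while =E2=80=9C is already a UTF-8 sequence). A minimal sketch using the standard mailbox and email machinery, which reads the charset from each part's MIME headers and so also preserves emoji; the filename is hypothetical:

import mailbox

box = mailbox.mbox("example.mbox")
for message in box:
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the quoted-printable (or base64) transfer encoding
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            print(payload.decode(charset, errors="replace"))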
I would like to generate a PDF report using the pdfme library, and I need Polish characters to come through as well. The example report ends with:
with open('document.pdf', 'wb') as f:
    build_pdf(document, f)
So I cannot add encoding = "utf-8". Is there any way I can still use Polish characters?
I tried:
Changing to text mode ('w') and setting the encoding to utf-8. I get: "TypeError: write() argument must be str, not bytes".
Adding .encode("utf-8") to strings containing Polish characters, e.g. "Paweł".encode("utf-8"). I get: "TypeError: value of . attr must be of type str, list or tuple: b'Pawe\xc5\x82'".
In this case, the part of the code responsible for dealing with the Unicode characters is the PDF library itself. The build_pdf call, for whatever library it is, has to be able to handle any character in document. If it fails, it is the code of the PDF library, the owner of the build_pdf call, that has to be changed so that it handles all the characters you need.
"utf-8" is just one way of expressing characters as bytes. A PDF file is a binary file, and it has its own internal headers, structures and settings to do its own character encoding handling: your text may end up inside the PDF encoded as utf-8, or as some other, legacy encoding, but that will be transparent to you and to anyone using the PDF file.
If the document is text (we don't know whether it is plain text or some object from your library that has already been pre-processed), and your library says that build_pdf can accept bytes instead, you could encode the document prior to the call: build_pdf(document.encode('utf-8'), f). But that would be a strange way of working: it is likely that either build_pdf does the encoding itself, or whatever process generated the document has already done so.
To get more meaningful help, you have to say which library you are using to generate the PDF and include the import lines in your code, including the creation of your document, so that we have a minimal reproducible example: i.e. I can copy your code, paste it into a .py file here, install the lib, run it, and see a corrupted PDF file with the Polish characters mangled. Then I, and others, will be able to fix it. Otherwise, this answer is as far as I can get.
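For what it's worth, a minimal sketch of how I would expect this to look with pdfme itself; the dict layout ("sections" with "content" lists) is my assumption about its document schema, not something taken from your code:

from pdfme import build_pdf

# Keep Polish text as a plain str; build_pdf does its own encoding internally.
document = {
    "sections": [
        {"content": ["Paweł, Łódź, zażółć gęślą jaźń"]},
    ]
}

# Binary mode is correct here: a PDF is a binary format.
with open('document.pdf', 'wb') as f:
    build_pdf(document, f)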
I'm attempting to parse an XML file and print sections of the contents into a CSV file for manipulation with a program such as Microsoft Excel. The issue I'm running into is that the XML file contains multiple alphabets (Arabic, Cyrillic, etc.) and I'm getting confused over what encoding I should be using.
import csv
import os
import xml.etree.ElementTree as ET

file = 'example.xml'
csvf = open(os.path.splitext(file)[0] + '.csv', "w+", newline='')
csvw = csv.writer(csvf, delimiter=',')

root = ET.parse(file).getroot()
name_base = root.find("name")
name_base_string = ET.tostring(name_base, encoding="unicode", method="xml").strip()
csvw.writerow([name_base_string])
csvf.close()
I do not know what encoding to pass to the tostring() method. If I use 'unicode' it returns a Python string and all is well when writing to the CSV file, but Excel seems to handle the result improperly (all the editors I tried on Windows and Linux display the character sets correctly). If I use encoding 'UTF-8' the method returns a bytes object, which, if I pass it to the CSV writer without decoding, ends up as the string b'stuff' in the CSV document.
Is there something I'm missing here? Does Excel just suck at handling certain encodings? I've read up on how UTF-8 is an encoding and Unicode is just a character set (and that you can't really compare them), but I'm still confused.
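One well-known wrinkle fits these symptoms: Excel only auto-detects UTF-8 in a CSV when the file starts with a byte-order mark, which Python's "utf-8-sig" codec writes for you. A minimal sketch, assuming that is the issue:

import csv

# "utf-8-sig" prepends a BOM; Excel uses it to detect UTF-8, and most
# other editors simply ignore it.
with open('example.csv', 'w', newline='', encoding='utf-8-sig') as csvf:
    csvw = csv.writer(csvf, delimiter=',')
    csvw.writerow(['пример', 'مثال', 'example'])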
I have been successfully parsing data files that I receive with a simple Python script I wrote. The files I get are like this:
file.txt, ~50 columns of data, x 1000s of rows
abcd1,1234a,efgh1,5678a,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
Unfortunately, sometimes some of the lines contain UTF-16 symbols and look like this:
abcd1,12341,efgh1,UTF-16 symbols here,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
I have been able to apply the "latin-1" encoding in commands in my script like:

open('file fixed.txt', 'w', encoding="latin-1").writelines(
    [line for line in open('file.txt', 'r', encoding="latin-1")])
My problem lies in code such as:

import fileinput

for line in fileinput.FileInput('file fixed.txt', inplace=1):
    line = line.replace(":", ",")
    print(line, ",")
I am unable to get past the encoding errors for the last command. I have tried enforcing the encoding by adding:

# -*- coding: latin-1 -*-

at the top of the script, as well as just before the last mentioned command (the find-and-replace). How can I get mixed-encoding files to process through the above command? I would like to preserve the UTF-16 (Unicode) symbols as they appear in the new file. Thanks in advance.
EDIT: Thanks to Alexis I was able to determine that fileinput would not work for setting another encoding method. I used the code below to resolve my issue.
f = open(filein,'r', encoding="latin-1")
filedata = f.read()
f.close()
newdata = filedata.replace("old data","new data")
f = open(fileout,'w', encoding="latin-1")
f.write(newdata)
f.close()
You can tell fileinput how to open your files. As the documentation says:
You can control how files are opened by providing an opening hook via the openhook parameter to fileinput.input() or FileInput(). The hook must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. Two useful hooks are already provided by this module.
So you'd do it like this:
def open_utf16(name, m):
    return open(name, m, encoding="utf-16")

for line in fileinput.FileInput("file fixed.txt", openhook=open_utf16):
    ...
I use "utf-16" as the encoding since this is your file's encoding, not "latin-1". 8-bit encodings don't have error checking so Latin1 will read the bytes without noticing there's anything wrong, but you're likely to have problems down the line. If this gives you errors, your file is not in utf-16.
If your file has mixed encodings, you need to read it as binary and then decode the different parts as necessary, or just process the whole thing as binary. The latin-1 solution in the question really works by accident.
In your example that would be something like:
with open('the/path', 'rb') as fi:
    data = fi.read().replace(b'old data', b'new data')

with open('other/path', 'wb') as fo:
    fo.write(data)
This is the closest to what you ask for: as far as I understand, you don't even care about the field with the potentially different encoding; you just want to change some content and copy the rest of the file as-is. Binary mode lets you do that.
Over the past few days I have been attempting to create a script which would 1) extract the XML from a Word document, 2) modify that XML, and 3) use the new XML to create and save a new Word document. With the help of many stackoverflow users I was eventually able to find code that looks very promising. Here it is:
import zipfile
import os
import tempfile
import shutil

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename, "rb"))
    xmlString = zip.read("word/document.xml").decode("utf-8")
    return xmlString

def createNewDocx(originalDocx, xmlString, newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx, "rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir, "word/document.xml"), "w") as f:
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename, "w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir, filename), filename)
    shutil.rmtree(tmpDir)
getXml extracts the XML from docxFilename as a string. createNewDocx takes the original Word document and replaces its XML with xmlString, which is a modified version of the original XML, and saves the resulting Word document as newFilename.
To check that the script works as intended, I first created a test document ("test.docx") and ran createNewDocx("test.docx",getXml("test.docx"),"test2.docx"). If everything worked as intended, this was supposed to create an identical copy of test.docx saved as test2.docx. Indeed, that was the case.
I then made the test document more elaborate and experimented with modifying it. And the script still worked!
I then confidently applied my script to the Word document I was actually interested in modifying: template.docx. I ran createNewDocx("template.docx", getXml("template.docx"), "template2.docx"), expecting the script to generate an identical copy of template.docx named template2.docx. Unfortunately, the new Word document could not be opened; apparently there was an illegal character in the XML.
I really don't understand why my code would work for my test document but not for my actual document. I would post template.docx's XML but it contains personal information. One important difference between test.docx and template.docx is that template.docx is written in French, and therefore contains special characters like accents, and also the apostrophes look different. I have no idea if this is what's causing my trouble but I have no other ideas.
The problem is that you are accidentally changing the encoding on word/document.xml in template2.docx. word/document.xml (from template.docx) is initially encoded as UTF-8 (as is the default encoding for XML documents).
xmlString = zip.read("word/document.xml").decode("utf-8")
However, when you copy it for template2.docx you are changing the encoding to CP-1252. According to the documentation for open(file, "w"),
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
You indicated that calling locale.getpreferredencoding(False) gives you cp1252, which is therefore the encoding word/document.xml is being written in.
Since you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the beginning of word/document.xml, Word (or any other XML reader) will read it as UTF-8 instead of CP-1252 which is what gives you the illegal XML character error.
So you want to specify the encoding as UTF-8 when writing by using the encoding argument to open():
with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
    f.write(xmlString)
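An equivalent sketch, if you would rather sidestep the locale entirely: write the file in binary mode and encode the XML string yourself.

with open(os.path.join(tmpDir, "word/document.xml"), "wb") as f:
    f.write(xmlString.encode("utf-8"))  # bytes out, no locale involved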
I was using bz2 earlier to try to decompress an input. The input I wanted to decode was already in compressed format, so I decided to paste it into the interactive Python console:
>>> import bz2
>>> bz2.decompress(input)
This worked just fine without any errors. However, I got different results when I tried to extract the text from a html file and then decompress it:
file = open("example.html", "r")
contents = file.read()
# Insert code to pull out the text, which is of type 'str'
result = bz2.decompress(parsedString)
I've checked the string I parsed against the original one, and they look identical. Furthermore, when I copy and paste the string I wish to decompress into my .py file (basically enclosing it in double quotes ""), it works fine. I have also tried opening with "rb" in hopes that it would read the .html file as binary, though that failed to work as well.
My questions are: what is the difference between these two strings? They are both of type 'str', so I'm assuming there is an underlying difference I am missing. Furthermore, how would I go about retrieving the bz2 content from the .html in such a way that it will not be treated as an incorrect datastream? Any help is appreciated. Thanks!
My guess is that the html file has the text representation of the data instead of the actual binary data in the file itself.
For instance, take a look at the following code:

>>> t = '\x80'
>>> len(t)
1

Here t is a single byte. But say I create a text file whose contents are the four characters \x80 and do:

with open('file') as f:
    t = f.read()

Then t holds the literal text rather than the byte, and I get back:

>>> t
'\\x80'
>>> len(t)
4
If this is the case, you could use eval to get the desired result:

result = bz2.decompress(eval('"' + parsedString + '"'))
Just make sure that you only do this for trusted data.
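A slightly safer sketch, given the Python 2 prompts above: the string_escape codec interprets backslash escapes like \x80 without evaluating arbitrary code the way eval can.

# Python 2 only: turn the literal text '\x80' back into the raw byte.
result = bz2.decompress(parsedString.decode('string_escape'))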