I want to treat an Outlook .msg file as a string and check whether a substring exists in it.
So I thought importing the win32 library, which is suggested in similar SO threads, would be overkill.
Instead, I tried to just open the file the same way as a .txt file:
file_path= 'O:\\MAP\\177926 Delete comiitted position.msg'
mail = open(file_path)
mail_contents = mail.read()
print(mail_contents)
However, I get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 870: character maps to <undefined>
Is there any decoding I can specify to make it work?
I have also tried
mail = open(file_path, encoding='utf-8')
which returns
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
Unless you're willing to do a lot of work, you really should use a library for this.
First, a .msg file is a binary file, so its contents should not be read in as a string. Opening it in text mode makes Python decode every byte with a text codec, and that is what raises the UnicodeDecodeError you are seeing; binary files contain many byte values (null bytes included) that are not valid text in any single encoding.
Also, the .msg file can have plain ASCII and/or Unicode in different parts/blocks of the file, so it would be really hard to treat it as one string and search it for a substring.
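If a rough check is all you need, one option is to read the file as raw bytes and look for the text in the encodings it is likely to be stored in. This is only a sketch, not a parser: contains_text and file_contains_text are hypothetical helpers, and the code assumes the text is stored uncompressed as either ASCII-compatible bytes or UTF-16-LE, which is not guaranteed (a body kept only in a compressed form would be missed).

```python
def contains_text(raw: bytes, needle: str) -> bool:
    """Return True if needle occurs in raw in one of a few guessed encodings."""
    for encoding in ("utf-8", "utf-16-le"):
        if needle.encode(encoding) in raw:
            return True
    return False


def file_contains_text(path: str, needle: str) -> bool:
    # Mode 'rb' returns bytes, so nothing is decoded and no
    # UnicodeDecodeError can be raised while reading.
    with open(path, "rb") as fh:
        return contains_text(fh.read(), needle)
```

A miss from this check does not prove the text is absent; for a reliable answer a library that understands the format is still the better route.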
As an alternative you could save the mails as .eml (i.e. the plain text version of an e-mail), but there would still be some problems to overcome in order to search for a specific text:
All data in an e-mail is lower ASCII (1-127), which means special characters have to be encoded to lower ASCII bytes. There are several different encodings for headers (for example 'Subject'), body, and attachments.
Body text: can be plain text or HTML (or both). Lines and words can be split because there is a maximum line length. Different encodings can be used, even base64, in which you would never find the text you're looking for.
A lot more would have to be done to properly decode everything, but this should give you an idea of the work you would have to do in order to find the text you're looking for.
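For the .eml route, Python's standard email package can do most of that decoding. The following is a minimal sketch, assuming the message is available as raw bytes; eml_contains is a hypothetical helper name, and it only looks at the Subject header and the text parts.

```python
from email import message_from_bytes, policy


def eml_contains(raw: bytes, needle: str) -> bool:
    """Search the decoded Subject and text parts of a raw .eml message."""
    msg = message_from_bytes(raw, policy=policy.default)
    # With policy.default, encoded header values are decoded to text.
    if needle in str(msg.get("Subject", "")):
        return True
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes base64 / quoted-printable and the charset.
            if needle in part.get_content():
                return True
    return False
```

Searching the raw file for the same text would fail for a base64 body, which is the situation described above.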
When you face this type of issue, it is good practice to try Python's Latin-1 encoding.
mail = open(file_path, encoding='Latin-1')
It is easy to confuse the Windows cp1252 encoding with Python's actual Latin-1. Unlike cp1252, Latin-1 maps all possible byte values to the first 256 Unicode code points, so decoding never fails.
See this for more information.
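The difference between the two codecs can be checked directly. A small sketch (the variable names are only illustrative):

```python
all_bytes = bytes(range(256))

# Latin-1 assigns a code point to every byte, so decoding cannot fail...
text = all_bytes.decode("latin-1")
# ...and the round trip back to bytes is lossless.
assert text.encode("latin-1") == all_bytes

# Python's cp1252 codec leaves a few bytes (0x90 among them) undefined.
try:
    b"\x90".decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False
```

Keep in mind that this only avoids the exception. Decoding a binary file such as a .msg with Latin-1 gives you a str, but most of it will not be meaningful text.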
Using Python, I am fetching some text data from an API and storing it in a text file after some transformations and then reading this text file from a different process.
There are no problems while reading data from API, but I am getting this error while reading the text file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 907: invalid start byte
The byte being read as '0x96' is actually "–" character in API data and this error occurs only when encoding argument is explicitly specified as 'utf-8'. It doesn't occur when encoding is not explicitly passed to open function while opening the text file.
My questions:
Why do we get this error only when encoding is specified? I think, we should get the same error in other case as well since default encoding is also 'UTF-8'. (Please correct me if I am wrong)
Is it possible to resolve this issue without changing the way I am reading the text file? (i.e. Can I make any changes to the stage where I am creating this text file from API data?)
Really appreciate you looking into it. Thanks!
In open(), the default encoding is platform dependent. You can find out the default for your system by checking what locale.getpreferredencoding() returns. This is from the documentation
For the 2nd part of your question, since you are not getting an error when you do not specify utf-8 as encoding, you could just use the output for locale.getpreferredencoding() as the encoding method.
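The mismatch can be reproduced without any files. The sketch below assumes the file was written with cp1252, a common Windows default; that is an assumption here, so check what locale.getpreferredencoding() returns on your machine.

```python
import locale

# The encoding open() falls back to when none is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

dash = b"\x96"
as_cp1252 = dash.decode("cp1252")  # the en dash, U+2013

try:
    dash.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # In UTF-8, 0x96 can only be a continuation byte, never a start byte.
    utf8_ok = False
```

If you control the stage that creates the file, passing the same explicit encoding (for example encoding="utf-8") to open() in both the writing and the reading process removes the dependence on the platform default.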
You could also replace the byte in each line of the text, since 0x96 is considered a "non-printable" character:
import re
...
line = re.sub(r'\x96',r'\x2D', line)
I am working on a Machine Learning Project which filters spam/phishing emails out of all emails. For this, I am using the SpamAssassin dataset. The dataset contains different mails in this format:
To identify phishing emails, the first thing I have to do is find out how many web links each email has. For that, I have written the following code:
import csv
import os
import re

wordsInLine = []
tempWord = []
urlList = []
base_dir = "C:/Users/keert/Downloads/Spam_Assassin/spam"

def count():
    flag = 0
    print("Reading all file names in sorted order")
    for filename in sorted(os.listdir(base_dir)):
        # No encoding is given here, so the platform default is used
        file = open(os.path.join(base_dir, filename))
        count1 = 0
        for line in file:
            wordsInLine = line.split(' ')
            for word in wordsInLine:
                if re.search('href="http', word, re.I):
                    count1 = count1 + 1
        file.close()
        urlList.append(count1)
        if flag != 0:
            print("File Name = " + filename)
            print("Number of links = ", count1)
        flag = flag + 1

count()
final = urlList[1:]
print("List of number of links in each email")
print(final)
# In Python 3 the csv module needs a text-mode file opened with newline=''
with open('count_links.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for val in final:
        wr.writerow([val])
print("CSV file generated")
But this code gives me an error saying: 'charmap' codec can't decode byte 0x81 in position 3124: character maps to
I have even tried opening the file with the encoding='utf8' option. But the problem remains, and I get an error like: 'utf-8' codec can't decode byte 0x81 in position 3124: character maps to
I guess this is due to the special characters in the file. Is there any way to deal with this? I cannot skip the special characters, as they are also important. Please suggest a way of doing this. Thank you in advance.
You have to open and read the file using the same encoding that was used to write the file. In this case, that might be a bit difficult, since you are dealing with e-mails and they can be in any encoding, dependent on the sender. In the example file you showed, the message is encoded using 'iso-8859-1' encoding.
However, e-mails are a bit strange, since they consist of a header (which is in ASCII format as far as I know), followed by an empty line and the body. The body is encoded in the encoding that was specified in the header. So two different encodings could be used in the same file!
If you're sure that all the e-mails use iso-8859-1 encoding and you're looking for a quick-and-dirty solution, then you could also just open the file using 'iso-8859-1' encoding, since e-mail headers are compatible with iso-8859-1. However, be prepared that you will have to deal with other e-mail formatting/encoding/escaping issues as well, or your script might not work completely as expected.
I think the best solution would be to look for a Python module that can handle e-mails, so it will deal with all the decoding stuff and you don't have to worry about that. It will also solve other problems such as escape characters and line breaks.
I don't have experience with this myself, but it seems that Python has built-in support for parsing e-mails in the email package. I recommend taking a look at that.
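As a rough illustration of that suggestion, here is a sketch that counts links with the standard email package. count_links is a hypothetical helper, it assumes each file holds one complete message read in binary mode, and it counts every occurrence of href="http in the text parts rather than one per word as the code in the question does.

```python
import re
from email import message_from_bytes, policy

LINK_RE = re.compile(r'href="http', re.IGNORECASE)


def count_links(raw: bytes) -> int:
    """Count href="http occurrences in the decoded text parts of one e-mail."""
    msg = message_from_bytes(raw, policy=policy.default)
    total = 0
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() applies the transfer encoding and charset
            # declared in the part's own headers.
            total += len(LINK_RE.findall(part.get_content()))
    return total
```

Reading each file with open(path, "rb") and passing the bytes to count_links avoids choosing a single encoding for the whole file.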
My Django application works with both .txt and .doc file types. The application opens a file, compares it with other files in the db, and prints out a report.
Now the problem is that when the file type is .txt, I get a 'utf-8' codec can't decode byte error (here I'm using encoding='utf-8'). When I switch encoding='utf-8' to encoding='ISO-8859-1', the error changes to 'latin-1' codec can't decode byte.
I want to find an encoding that works with every type of file. This is a small part of my function:
views.py:
#login_required(login_url='sign_in')
def result(request):
    last_uploaded = OriginalDocument.objects.latest('id')
    original = open(str(last_uploaded.document), 'r', encoding='utf-8')
    original_words = original.read().lower().split()
    words_count = len(original_words)
    open_original = open(str(last_uploaded.document), "r")
    read_original = open_original.read()
    report_fives = open("static/report_documents/" + str(last_uploaded.student_name) +
                        "-" + str(last_uploaded.document_title) + "-5.txt", 'w')
    # Path to the documents with which original doc is comparing
    path = 'static/other_documents/doc*.txt'
    files = glob.glob(path)
    rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(last_uploaded, 5, original_words, report_fives, files)
    context = {
        ...
    }
    return render(request, 'result.html', context)
There is no general encoding which automatically knows how to decode an already encoded file in a specific encoding.
UTF-8 is a good default choice, and it is backward compatible with ASCII. You can, for example, simply ignore or replace characters which aren't decodable, like this:
from codecs import open
original = open(str(last_uploaded.document), encoding="utf-8", errors="ignore")
original_words = original.read().lower().split()
...
original.close()
Or, even better, use a context manager (with statement), which closes the file for you:
with open(str(last_uploaded.document), encoding="utf-8", errors="ignore") as fr:
    original_words = fr.read().lower().split()
    ...
(Note: You do not need to use the codecs library if you're using Python 3, but you have tagged your question with python-2.7.)
You can see the advantages and disadvantages of the different error handlers here and here. Be aware that not passing an error handler defaults to errors="strict", which you probably do not want. The other options are nearly self-explanatory, e.g.:
using errors="replace" will replace an undecodable character with a suitable replacement marker
using errors="ignore" will simply ignore the character and continue reading the file data.
What you should use depends on your needs and usecase(s).
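To make the difference concrete, here is a small sketch with a byte string that is valid Latin-1 but not valid UTF-8 (the sample data is made up for illustration):

```python
data = b"caf\xe9 au lait"  # 0xe9 is 'é' in Latin-1, but invalid here as UTF-8

# "replace" substitutes U+FFFD, the official replacement character.
replaced = data.decode("utf-8", errors="replace")

# "ignore" silently drops the undecodable byte.
ignored = data.decode("utf-8", errors="ignore")

try:
    data.decode("utf-8")  # errors="strict" is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

Both lenient handlers lose information, so they suit cases like word counting better than cases where the exact text matters.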
You're saying that you also have encoding problems not only with plain text files, but also with proprietary doc files:
The .doc format is not a plain text format that you can simply read with open() or codecs.open(), since much of the information is stored in binary form; see this site for more information. So you need a special reader for .doc files to get the text out of them. Which library to use depends on your Python version and maybe also on your operating system. Maybe here is a good starting point for you.
Unfortunately, using a library does not protect you completely from encoding errors. (Maybe it does, but I'm not sure whether the encoding is saved in the file itself, as it is in a .docx file.) You may also be able to figure out the encoding of the file. How you can handle encoding errors likely depends on the library itself.
So I'm just guessing that you are trying to open .doc files as simple text files. Then you will get decoding errors, because the content is not saved as human-readable text. And even if you get rid of the error, you will only see non-human-readable text (I've created a simple text file with LibreOffice in .doc format (Microsoft Word 1997-2003)):
In [1]: open("./test.doc", "r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
In [2]: open("./test.doc", "r", errors="replace").read() # or open("./test.doc", "rb").read()
'��\x11\u0871\x1a�\x00\x00\x00' ...
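The 0xd0 at position 0 is the first byte of the OLE2 compound-file signature that old .doc files start with, so one way to avoid decoding such a file as text is to check for that signature first. This is a sketch under that assumption; looks_like_ole2 and read_text_or_none are hypothetical helpers, and the check says nothing about other binary formats.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_ole2(head: bytes) -> bool:
    """True if the data starts with the OLE2 compound-file signature."""
    return head.startswith(OLE2_MAGIC)


def read_text_or_none(path: str, encoding: str = "utf-8"):
    """Return the file's text, or None when it looks like a binary .doc."""
    with open(path, "rb") as fh:
        raw = fh.read()
    if looks_like_ole2(raw):
        return None  # hand the file to a .doc reader instead
    return raw.decode(encoding, errors="replace")
```

A caller can then route files that return None to whichever .doc library it ends up using.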
My files are in US-ASCII and a command like a = file( 'main.html') and a.read() loads them as an ASCII text. How do I get it to load as UTF8?
The problem I am trying to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating as in template_str.format(attrib=val). But the string to interpolate is of a superset of ASCII.
Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?
You are opening files without specifying an encoding, which means that Python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the .decode() function:
template_str = template_str.decode('utf8')
Your val variable you tried to interpolate into your template is itself a unicode value, and python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both, and it'll use the default encoding to do so.
Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
A solution working in Python2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
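For comparison, this is roughly what the same decoding step looks like in Python 3, where the decoded value is a str. The sample bytes and the mark placeholder are made up for illustration.

```python
raw = b"Registered{mark} product"  # bytes, as read from an ASCII file in 'rb' mode

template = raw.decode("ascii")  # bytes -> str
result = template.format(mark="\xae")  # interpolating non-ASCII text now works
```

Because both sides of the format call are text, no implicit ASCII conversion takes place.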
My aim is to write an XML file with a few tags whose values are in the regional language. I'm using Python to do this and using IDLE (Python GUI) for programming.
When I try to write the words to an XML file, it gives the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
For now, I'm not using any xml writer library; instead, I'm opening a file "test.xml" and writing the data into it. This error is encountered by the line:
f.write(data)
If I replace the above write statement with print statement then it prints the data properly on the Python shell.
I'm reading the data from an Excel file which is not in the UTF-8, 16, or 32 encoding formats. It's in some other format. cp1252 is reading the data properly.
Any help in getting this data written to an XML file would be highly appreciated.
You should .decode your incoming cp1252 to get Unicode strings, and .encode them in utf-8 (by far the preferred encoding for XML) at the time you write, i.e.
f.write(unicodedata.encode('utf-8'))
where unicodedata is obtained by .decode('cp1252') on the incoming bytestrings.
It's possible to put lipstick on it by using the codecs module of the standard Python library to open the input and output files each with their proper encodings in lieu of plain open, but what I show is the underlying mechanism (and it's often, though not invariably, clearer and more explicit to apply it directly, rather than indirectly via codecs -- a matter of style and taste).
What does matter is the general principle: translate your input strings to unicode as soon as you can right after you obtain them, use unicode throughout your processing, and translate them back to byte strings as late as you can, just before you output them. This gives you the simplest, most straightforward life!-)
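That principle might look like this in code. It is a sketch only: cp1252_to_utf8, write_xml, and the name tag are made up for illustration, and no XML escaping is done.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    text = raw.decode("cp1252")           # decode as soon as the bytes arrive
    xml = "<name>{}</name>".format(text)  # work with text only in between
    return xml.encode("utf-8")            # encode just before output


def write_xml(path: str, raw: bytes) -> None:
    # Binary mode: the bytes written are exactly the UTF-8 produced above.
    with open(path, "wb") as f:
        f.write(cp1252_to_utf8(raw))
```

Opening the output file in text mode with encoding="utf-8" and writing the text directly would achieve the same result; the explicit version just makes the two conversion points visible.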