Extracting Text from Gmail eml file using Python - python

Good Morning,
I have downloaded my *.eml from my Gmail and wanted to extract the content of the email as text.
I used the following code:
import email
from email import policy
from email.parser import BytesParser
# Raw string so the backslashes in the Windows path are not treated as escapes.
filepath = r'Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml'
with open(filepath, 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
text = msg.get_body(preferencelist=('plain',)).get_content()
I am unable to extract the content of the email. The length of text is 0.
When I attempted to open the *.eml using Word/Outlook, I could see the content.
When I use a normal file handler to open it:
fhandle = open(filepath)
print(fhandle)
print(fhandle.read())
I get
<_io.TextIOWrapper name='Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml' mode='r' encoding='cp1252'>
And the contents look something like the one below:
Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8
PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy9XM0MvL0RURCBYSFRNTCAxLjAgVHJhbnNpdGlvbmFs
Ly9FTiIgImh0dHA6Ly93d3cudzMub3JnL1RSL3hodG1sMS9EVEQveGh0bWwxLXRyYW5zaXRpb25h
bC5kdGQiPgo8aHRtbCB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94aHRtbCI+CjxoZWFk
I might have underestimated the amount of code needed to extract the email body from an *.eml file in Python.

I do not have access to your email, but I was able to extract text from an email that I downloaded myself as an .eml file from Google.
import email
with open('email.eml') as email_file:
    email_message = email.message_from_file(email_file)
print(email_message.get_payload())
When working with files, consider using a context manager as I did in my example: it ensures that the file handle is closed and cleaned up when it is no longer needed.
I briefly read over https://docs.python.org/3/library/email.parser.html for additional information on how to achieve the intended goal.

I realised the email is multipart, so I need to get to the specific part and decode it. Doing so returns a chunk of HTML. To strip the HTML tags and get plain text, I used html2text.
import email
import html2text
filepath = r'Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml'
with open(filepath) as email_file:
    email_message = email.message_from_file(email_file)
if email_message.is_multipart():
    for part in email_message.walk():
        # Container parts have no payload of their own: decode=True returns None.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Decode the bytes with the part's charset; str() on bytes would give "b'...'".
        message = payload.decode(part.get_content_charset() or 'utf-8', errors='replace')
        plain_message = html2text.html2text(message)
        print(plain_message)
        print()
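For comparison, here is a minimal sketch that stays with the policy-based parser from the question. It assumes the message has only an HTML body (as the base64 dump suggests), so it asks get_body() for 'plain' first and falls back to 'html'; get_content() then undoes the base64 transfer encoding and the charset. The helper name extract_text and the crude standard-library tag stripper are my own illustration, standing in for html2text.

```python
from email import policy
from email.parser import BytesParser
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Very crude HTML-to-text converter (illustrative stand-in for html2text)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collect every run of text found between tags.
        self.chunks.append(data)


def extract_text(raw_bytes):
    """Return the readable body of an RFC 822 message given as bytes."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    # Prefer text/plain, fall back to text/html; None if neither exists.
    body = msg.get_body(preferencelist=('plain', 'html'))
    if body is None:
        return ''
    content = body.get_content()  # transfer encoding and charset already decoded
    if body.get_content_type() == 'text/html':
        stripper = _TextOnly()
        stripper.feed(content)
        stripper.close()
        # Collapse whitespace left behind by the removed markup.
        return ' '.join(' '.join(stripper.chunks).split())
    return content
```

To try it on the file from the question, open the file with mode 'rb' and pass the result of read() to extract_text().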

Related

Getting invalid zip error when I try to open emailed Zip file using Python

First of all, I'm not really sure I'm attaching this zip file the right way. I've hacked a few different things together, and I'm getting the error message shown below my code.
from email.message import EmailMessage
import shutil
import smtplib
import base64
message = EmailMessage()
message["From"] = "superdummy#idiot.com"
message["To"] = "superdummy#idiot.com"
message["Subject"] = "Testing Zip"
path = "Testing.zip"
with open(path, "rb") as f:
    bytes = f.read()
encoded = base64.b64encode(bytes)
message.add_attachment(encoded, maintype='application/zip', subtype='octet-stream', filename="Testing.zip")
s = smtplib.SMTP(host='blah.blah.yum', port=99)
s.send_message(message)
When I receive the email and I try to open it up, I get the error message:
"Windows cannot open the folder, the compressed Zip folder is invalid"
Couldn't find similar error messages on Stackoverflow or other places. I'd really appreciate some direction. Thanks!
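A likely cause, offered as a guess since I cannot see the received file: add_attachment() already applies base64 as the transfer encoding, so passing pre-encoded bytes means the recipient gets the base64 text itself rather than the archive. The sketch below passes the raw bytes and splits the MIME type into maintype='application' and subtype='zip'. The function name build_zip_message is hypothetical and the addresses are placeholders.

```python
from email.message import EmailMessage


def build_zip_message(zip_bytes, filename, sender, recipient):
    """Build a message carrying zip_bytes as an attachment (hypothetical helper)."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Testing Zip"
    message.set_content("Zip file attached.")
    # Pass the raw bytes: the email package base64-encodes them itself.
    message.add_attachment(
        zip_bytes,
        maintype="application",
        subtype="zip",
        filename=filename,
    )
    return message
```

The returned object can be handed to s.send_message() exactly as in the question, after reading Testing.zip with open(path, "rb").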

Extract images from an eml file

I was looking for a way to get a specific image from an eml file, but I always get text data without images.
Here is the code I used, which gives me only text data:
from email.parser import BytesParser
from email import policy
with open(em, 'rb') as fp:
    name = fp.name  # Get file name
    msg = BytesParser(policy=policy.default).parse(fp)
# The with block closes the file, so no explicit fp.close() is needed.
data = msg.get_body(preferencelist=('plain')).get_content()
print(data)
Is there a way to solve this? I'm eager to know the method.
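get_body() only ever returns a text part, which is why no images appear. A minimal sketch, assuming the images are stored in the message as image/* parts (inline or attached): walk every part and keep those whose main type is image. The function name extract_images is my own.

```python
from email import policy
from email.parser import BytesParser


def extract_images(raw_bytes):
    """Return a list of (filename, image_bytes) for every image/* part."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    images = []
    for part in msg.walk():
        if part.get_content_maintype() != 'image':
            continue
        # Inline images often have no filename, so fall back to a generated one.
        name = part.get_filename() or f'image_{len(images)}.{part.get_content_subtype()}'
        # decode=True undoes the transfer encoding and returns bytes.
        images.append((name, part.get_payload(decode=True)))
    return images
```

Each returned pair can then be written to disk with open(name, 'wb'), or filtered by name to pick one specific image.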

Importing mime .eml file to gmail API using the import function

I am a Python developer and somewhat new to using Google's Gmail API to import .eml files into a Gmail account.
I've gotten all of the groundwork done: my OAuth credentials are working, etc.
However, I am stuck where I load in the data file. I need help loading the message data into a variable.
How do I create the message_data variable, in the appropriate format, from my sample email file (stored in RFC 822 format) on disk?
Assuming I have a file on disk at /path/to/file/sample.eml, how do I load that into message_data in the proper format for the Gmail API import call?
...
# how do I properly load message_data from the rfc822 disk file?
media = MediaIoBaseUpload(message_data, mimetype='message/rfc822')
message_response = service.users().messages().import_(
    userId='me',
    fields='id',
    neverMarkSpam=True,
    processForCalendar=False,
    internalDateSource='dateHeader',
    media_body=media).execute(num_retries=2)
...
You want to import an eml file using Gmail API.
You have already been able to get and put values for Gmail API.
You want to achieve this using google-api-python-client.
service in your script can be used for uploading the eml file.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Modification point:
In this case, the "Users.messages: insert" method is used rather than import.
Modified script:
Before you run the script, please set the filename with the path of the eml file.
eml_file = "###" # Please set the filename with the path of the eml file.
user_id = "me"
# Read the file as bytes so the message is uploaded unchanged, whatever its encoding.
with open(eml_file, "rb") as f:
    message_data = io.BytesIO(f.read())
media = MediaIoBaseUpload(message_data, mimetype='message/rfc822', resumable=True)
metadata = {'labelIds': ['INBOX']}
res = service.users().messages().insert(userId=user_id, body=metadata, media_body=media).execute()
print(res)
The script above also requires the following modules.
import io
from googleapiclient.http import MediaIoBaseUpload
Note:
In the modified script above, {'labelIds': ['INBOX']} is used as the metadata, so the imported eml file appears in the Gmail INBOX. Modify this if you want a different label.
Reference:
Users.messages: insert
If I misunderstood your question and this was not the result you want, I apologize.
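If a resumable media upload is not needed, the Gmail API's Message resource also has a raw field that holds the whole RFC 822 message as a base64url string. The sketch below only builds that request body with the standard library. The helper name eml_to_gmail_body is hypothetical, and passing the result as body= to insert() or import_() is my assumption about usage, left as a comment because it needs real credentials.

```python
import base64


def eml_to_gmail_body(eml_path, label_ids=("INBOX",)):
    """Read an .eml file and wrap it as a Gmail API Message body (hypothetical helper)."""
    # Read bytes, not text, so the message is passed through unchanged.
    with open(eml_path, "rb") as f:
        raw_bytes = f.read()
    return {
        "raw": base64.urlsafe_b64encode(raw_bytes).decode("ascii"),
        "labelIds": list(label_ids),
    }


# Assumed usage with an authorised `service` object (not run here):
# body = eml_to_gmail_body("/path/to/file/sample.eml")
# res = service.users().messages().insert(userId="me", body=body).execute()
```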

Send data as .csv email attachment Python

I have some data, let's say
data = [
    ['header_1', 'header_2'],
    ['row_1_1', 'row_1_2'],
    ['row_2_1', 'row_2_2'],
]
I need to send that data as a .csv file attachment in an email message.
I cannot save it as a .csv and then attach the existing file: the application runs in the Google App Engine sandbox environment, so no files can be saved.
As I understand, email attachment consists of file name and file encoded as base64.
I tried to make attachment body in the following way:
import base64
import csv
import sys
if sys.version_info >= (3, 0):
    from io import StringIO
else:
    from StringIO import StringIO
inmemory_data = StringIO()
csv.writer(inmemory_data).writerows(data)
encoded = base64.b64encode(inmemory_data.getvalue())
But the file I received by email is not a valid file with 2 columns and 3 rows: it contains just one long string (see the picture).
csv_screen
What am I doing wrong?
I've found the mistake. I should have converted it to a bytearray instead of encoding it to base64:
encoded = bytearray(inmemory_data.getvalue(), "utf-8")
Worked fine that way.
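For Python 3, the same idea can be sketched with the standard email package: write the rows to an in-memory buffer, encode the text to UTF-8 bytes, and let add_attachment() handle the transfer encoding. The helper name csv_attachment_message is hypothetical, and App Engine's own mail API may expect a different attachment object.

```python
import csv
import io
from email.message import EmailMessage


def csv_attachment_message(rows, filename="data.csv"):
    """Build a message with rows attached as a CSV file, without touching disk."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    message = EmailMessage()
    message.set_content("CSV report attached.")
    # Attach the encoded bytes; do not base64-encode them by hand.
    message.add_attachment(
        buffer.getvalue().encode("utf-8"),
        maintype="text",
        subtype="csv",
        filename=filename,
    )
    return message
```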

Django: CSV in email attachment including Unicode characters results in extra linebreak

I have a reporting feature on my site that sends a CSV file as an email attachment. I recently noticed that if one of the strings includes an accented character, the attached CSV has an extra line break. The strange thing is that I don't see any extra line breaks if the strings contain no accents.
Code looks a bit like this:
# -*- coding: utf8 -*-
import unicodecsv
from StringIO import StringIO
from django.core.mail import EmailMultiAlternatives
# Generating the CSV
csvfile = StringIO()
writer = unicodecsv.writer(csvfile, encoding='utf-8')
writer.writerow([u'Test', u'Linebreak è'])
writer.writerow([u'Another', u'line'])
# Email
msg = EmailMultiAlternatives(
    'csv report',
    'Here is your attached report',
    'email#from.com',
    ['email#to.com']  # "to" must be a list or tuple
)
msg.attach('your_report.csv', csvfile.getvalue(), 'text/csv')
msg.send()
Opening the file with VIM shows me something like that:
Test,Linebreak è^M
Another,line
In comparison, if the CSV rows are:
writer.writerow([u'Test', u'Linebreak'])
writer.writerow([u'Another', u'line'])
The attached CSV will look like that:
Test,Linebreak
Another,line
getvalue() seems to output the correct EOL characters, but something seems to happen once the file is attached. Has anyone else noticed a similar issue?
(Running Django 1.6 on Python 2.7)
Edit: I have found the root of my problem. It turns out I'm using SendGrid to send my emails, and for some reason their system adds an extra line break to my CSV when it contains an accented character...
As per a commenter's request, I'll add a solution that uses Python's standard SMTP library instead of SendGrid.
As with the OP's code, we use CSV data that is Unicode. When it's time to prepare the message, we explicitly add the data as a UTF-8-encoded text attachment, and construct the message object like so:
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
# Write CSV data ...
msg = MIMEMultipart()
msg['Subject'] = subject
msg['From'] = sender
msg['To'] = recipients
msg.preamble = subject + '\n'
# Message body
body_part = MIMEText(body, 'plain', 'utf-8')
# CSV data (named csv_part so it does not shadow the csv module)
csv_part = MIMEText(csvfile.getvalue(), 'csv', 'utf-8')
csv_part.add_header("Content-Disposition", "attachment",
                    filename=report_filename)
# Add body and attachment to message
msg.attach(body_part)
msg.attach(csv_part)
You can read more about MIMEText in the Python library documentation. I find that passing it unicode strings (as opposed to str/bytes) works as long as there is a properly declared charset.
Also, I have to note that I'm not sure whether the newline problem was solved simply by using a MIMEText attachment or by the encoding. It's possible that using a MIMEText object as the attachment in the OP's code would solve the problem. I leave experimentation to you, though.
For those who use SendGrid as an SMTP provider and have noticed a similar issue: I fixed my problem by using SendGrid's Web API instead of SMTP (via https://github.com/elbuo8/sendgrid-django).
No more extra lines in my CSV reports now!
