I'm trying to get all the messages from a Gmail account that may contain some large attachments (about 30 MB). I just need the attachment names, not the whole files. I found a piece of code that gets a message and the attachment's name, but it downloads the whole file and only then reads the name:
import imaplib, email

# log in and select the inbox
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('username', 'password')
mail.select('inbox')

# get uids of all messages
result, data = mail.uid('search', None, 'ALL')
uids = data[0].split()

# read the latest message
result, data = mail.uid('fetch', uids[-1], '(RFC822)')
m = email.message_from_string(data[0][1])

if m.get_content_maintype() == 'multipart':  # multipart messages only
    for part in m.walk():
        # find the attachment part
        if part.get_content_maintype() == 'multipart': continue
        if part.get('Content-Disposition') is None: continue
        # save the attachment in the program directory
        filename = part.get_filename()
        fp = open(filename, 'wb')
        fp.write(part.get_payload(decode=True))
        fp.close()
        print '%s saved!' % filename
I have to do this once a minute, so I can't download hundreds of MB of data. I am new to web scripting, so could anyone help me? I don't actually need to use imaplib; any Python library will be OK for me.
Best regards
Rather than fetch RFC822, which is the full content, you could specify BODYSTRUCTURE.
The resulting data structure from imaplib is pretty confusing, but you should be able to find the filename, content-type and sizes of each part of the message without downloading the entire thing.
If you know something about the file name, you can use the X-GM-RAW gmail extensions for imap SEARCH command. These extensions let you use any gmail advanced search query to filter the messages. This way you can restrict the downloads to the matching messages, or exclude some messages you don't want.
mail.uid('search', None, 'X-GM-RAW',
         'has:attachment filename:pdf in:inbox -label:parsed')
The above searches for messages with PDF attachments in INBOX that are not labeled "parsed".
Some pro tips:
label the messages you have already parsed, so you don't need to fetch them again (the -label:parsed filter in the above example)
always use the uid version instead of the standard sequential ids (you are already doing this)
unfortunately MIME is messy: there are a lot of clients that do weird (or plain wrong) things. You could try to download and parse only the headers, but is it worth the trouble?
[edit]
If you label a message after parsing it, you can skip the messages you have parsed already. This should be reasonable enough to monitor your class mailbox.
Perhaps you live in a corner of the world where internet bandwidth is more expensive than programmer time; in that case, you can fetch only the headers and look for a Content-Disposition header of the form "attachment; filename=somefilename.ext".
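To illustrate the BODYSTRUCTURE idea above, here is a minimal sketch that pulls filenames and encoded sizes out of a hand-made sample response. A real response comes from mail.uid('fetch', uid, '(BODYSTRUCTURE)') and contains more fields, so treat this as a sketch, not a full parser:

```python
import re

# Canned BODYSTRUCTURE response, shaped like what
# mail.uid('fetch', uid, '(BODYSTRUCTURE)') returns (hand-made sample):
raw = (b'1 (BODYSTRUCTURE (("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL "7BIT" 52 2 NIL NIL NIL)'
       b'("APPLICATION" "PDF" ("NAME" "invoice.pdf") NIL NIL "BASE64" 31337'
       b' NIL ("ATTACHMENT" ("FILENAME" "invoice.pdf")) NIL) "MIXED"))')

# Pull attachment names and encoded sizes out of the structure;
# the multi-MB payloads themselves are never downloaded.
names = [n.decode() for n in re.findall(rb'"FILENAME" "([^"]+)"', raw)]
sizes = re.findall(rb'"BASE64" (\d+)', raw)
print(names)   # ['invoice.pdf']
print(sizes)   # [b'31337']
```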
A FETCH of the RFC822 message data item is functionally equivalent to BODY[]. IMAP4 supports other message data items, listed in section 6.4.5 of RFC 3501.
Try requesting a different set of message data items to get just the information that you need. For example, you could try RFC822.HEADER or maybe BODY.PEEK[MIME].
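As a rough sketch of the header-only idea: assuming the fetch returned the MIME headers of a single part (the sample bytes below are made up; real data comes from the IMAP server), the email module can read the disposition and filename without the payload ever being downloaded:

```python
import email
from email import policy

# Stand-in for what a header-only fetch (e.g. BODY.PEEK[MIME] of one part)
# might return; in real use these bytes come from the server.
sample_part_headers = (
    b'Content-Type: application/pdf; name="report.pdf"\r\n'
    b'Content-Disposition: attachment; filename="report.pdf"\r\n'
    b'Content-Transfer-Encoding: base64\r\n\r\n'
)

part = email.message_from_bytes(sample_part_headers, policy=policy.default)
disposition = part.get_content_disposition()  # 'attachment', 'inline' or None
filename = part.get_filename()
print(disposition, filename)  # attachment report.pdf
```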
Old question, but I just wanted to share the solution I came up with today. It searches for all emails with attachments and outputs the uid, sender, subject, and a formatted list of attachments. I've edited the relevant code to show how to parse the BODYSTRUCTURE response:
data = mailobj.uid('fetch', mail_uid, '(BODYSTRUCTURE)')[1]
struct = data[0].split()
filenames = []  # holds list of attachment filenames
for j, k in enumerate(struct):
    if k == '("FILENAME"':
        count = 1
        val = struct[j + count]
        while val[-3] != '"':
            count += 1
            val += " " + struct[j + count]
        filenames.append(val[1:-3])
    elif k == '"FILENAME"':
        count = 1
        val = struct[j + count]
        while val[-1] != '"':
            count += 1
            val += " " + struct[j + count]
        filenames.append(val[1:-1])
I've also published it on GitHub.
EDIT
The above solution is good, but the logic that extracts the attachment file name from the payload is not robust. It fails when the file name contains spaces and its first word is only two characters long, for example "ad cde gh.png".
Try this:
import re  # somewhere at the top

result, data = mailobj.uid("fetch", mail_uid, "BODYSTRUCTURE")
itr = re.finditer(r'("FILENAME" "([^/:*?"<>|]+)")', data[0].decode("ascii"))
for match in itr:
    print(f"File name: {match.group(2)}")
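A quick way to convince yourself the regex handles the problematic case is to run it against a canned BODYSTRUCTURE fragment (the fragment below is hand-made):

```python
import re

# The case the edit mentions: a filename containing spaces.
sample = b'("ATTACHMENT" ("FILENAME" "ad cde gh.png")) NIL'

for match in re.finditer(r'"FILENAME" "([^/:*?"<>|]+)"', sample.decode("ascii")):
    print(match.group(1))  # ad cde gh.png
```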
I use this code to scrape odds data. It sends me an email every time there is a discrepancy in odds between bookmakers. I run the code multiple times every day manually through the terminal, so I get many duplicate emails about the same odds discrepancies. The content of the emails is exactly the same except for the subject line, which shows the time the email was sent. Is there any way to log the content of the emails so that the code prevents duplicate emails from being sent? I would like to prevent the emails from being sent at all, not just filter them out once they've arrived in Gmail. The log would also have to reset each day. Sorry for the complex question.
You could log the data to a CSV file in append mode every time an email is sent, using Python's [CSV module][1]. If the CSV file exists, read it and make sure the values haven't already been written to it. Name the file with the current date; that way you can keep or delete the logs as required. I'm assuming the variables you want to check are dfm2 and dfm3, but you can replace them with any other variables you need to check. There are other logging options, but a CSV file is an easy solution.
from datetime import datetime
from os import path
import csv
import smtplib

# your code prior to below line here
dfmerge5 = dfmerge1.loc[dfmerge1['ComparedOdds'] == 1, 'AvgOdds'].values
csv_file = f'<path to log file>/odds-log-{datetime.today().strftime("%d-%m")}.csv'

for dfm2, dfm3, dfm4, dfm5 in zip(dfmerge2, dfmerge3, dfmerge4, dfmerge5):
    mail_sent = False
    if path.isfile(csv_file):
        with open(csv_file, 'r') as r_file:
            c_reader = csv.reader(r_file)
            for row in c_reader:
                if dfm2 in row and dfm3 in row:
                    mail_sent = True
    if not mail_sent:
        with open(csv_file, 'a') as w_file:
            c_writer = csv.writer(w_file)
            c_writer.writerow([dfm2, dfm3])
        dt = datetime.now()
        subject = f'Overlay - {dt}'
        body = f'Hi Harrison,\n\nYou have one overlay:\n\nFor {dfm2} in {dfm3}, Bet365 are displaying odds of ${dfm4}, whereas the average odds are ${dfm5}.\n\nBest of luck.'
        message = f'Subject: {subject}\n\n{body}'
        mail = smtplib.SMTP('smtp.gmail.com', 587)
        mail.ehlo()
        mail.starttls()
        mail.login('sender@gmail.com', 'password')  # fill in your credentials
        mail.sendmail('sender@gmail.com', 'recipient@example.com', message)  # from addr, to addr, message
        mail.close()
[1]: https://docs.python.org/3/library/csv.html#csv.writer
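The duplicate check from the answer can also be pulled into two small helpers (names are illustrative), which makes the send-or-skip decision easy to test on its own:

```python
import csv
import os
import tempfile

def already_logged(csv_file, a, b):
    """Return True if the (a, b) pair is already in the log file."""
    if not os.path.isfile(csv_file):
        return False
    with open(csv_file, newline="") as f:
        return any(a in row and b in row for row in csv.reader(f))

def log_pair(csv_file, a, b):
    """Append the (a, b) pair to the log file."""
    with open(csv_file, "a", newline="") as f:
        csv.writer(f).writerow([a, b])

log = os.path.join(tempfile.mkdtemp(), "odds-log.csv")
print(already_logged(log, "TeamA", "League1"))  # False: not yet logged
log_pair(log, "TeamA", "League1")
print(already_logged(log, "TeamA", "League1"))  # True: duplicate detected
```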
I want to be able to split some large JSON files in blob storage (~1GB each) into individual files (one file per record)
I have tried using get_blob_to_stream from the Azure Python SDK, but am getting the following error:
AzureHttpError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
To test, I've just been printing the text downloaded from the blob, and haven't yet tried writing back to individual JSON files.
with BytesIO() as document:
    block_blob_service = BlockBlobService(account_name=STORAGE_ACCOUNT_NAME, account_key=STORAGE_ACCOUNT_KEY)
    block_blob_service.get_blob_to_stream(container_name=CONTAINER_NAME, blob_name=BLOB_ID, stream=document)
    print(document.getvalue())
Interestingly, when I limit the size of the blob information that I'm downloading, the error message doesn't appear, and I can get some information out:
with BytesIO() as document:
    block_blob_service = BlockBlobService(account_name=STORAGE_ACCOUNT_NAME, account_key=STORAGE_ACCOUNT_KEY)
    block_blob_service.get_blob_to_stream(container_name=CONTAINER_NAME, blob_name=BLOB_ID, stream=document, start_range=0, end_range=100000)
    print(document.getvalue())
Does anyone know what is going on here, or have any better approaches to splitting a large JSON out?
Thanks!
You usually get the error message "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature" when the Authorization header is not formed correctly. The full response typically looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Error>
  <Code>AuthenticationFailed</Code>
  <Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:096c6d73-f01e-0054-6816-e8eaed000000
Time:2019-03-31T23:08:43.6593937Z</Message>
  <AuthenticationErrorDetail>Authentication scheme Bearer is not supported in this version.</AuthenticationErrorDetail>
</Error>
and the solution is to add the header below:
x-ms-version: 2017-11-09
But since you are saying that it works when you limit the size, you should write your code using a chunked approach. Here is something you can try.
import io
import datetime
from azure.storage.blob import BlockBlobService

acc_name = 'myaccount'
acc_key = 'my key'
container = 'storeai'
blob = "orderingai2.csv"

block_blob_service = BlockBlobService(account_name=acc_name, account_key=acc_key)
props = block_blob_service.get_blob_properties(container, blob)
blob_size = int(props.properties.content_length)
index = 0
chunk_size = 104858  # ~0.1 MB; don't make this too big or you will get a memory error
output = io.BytesIO()

def worker(data):
    print(data)

while index < blob_size:
    now_chunk = datetime.datetime.now()
    block_blob_service.get_blob_to_stream(container, blob, stream=output,
                                          start_range=index,
                                          end_range=index + chunk_size - 1,
                                          max_connections=50)
    if output is None:
        continue
    output.seek(index)
    data = output.read()
    length = len(data)
    index += length
    if length > 0:
        worker(data)
        if length < chunk_size:
            break
    else:
        break
Hope it helps.
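Once the whole blob has been accumulated, the actual splitting into one file per record can be sketched like this. This is a minimal sketch assuming the blob parses as a JSON array and each record has an id field; for arrays too large for memory you would need a streaming parser instead:

```python
import json
import os
import tempfile

# Hypothetical downloaded blob: a JSON array of records (in practice this
# is the text the chunked get_blob_to_stream calls accumulated).
blob_text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

out_dir = tempfile.mkdtemp()
for record in json.loads(blob_text):
    # One output file per record, named after the record's id.
    path = os.path.join(out_dir, f"record_{record['id']}.json")
    with open(path, "w") as f:
        json.dump(record, f)

print(sorted(os.listdir(out_dir)))  # ['record_1.json', 'record_2.json']
```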
I'm trying to get one sentence from a lot of HTML emails. The sentence is located in the exact same place in every email (including the same line if you view the source code).
So far I have used imaplib to set up the connection to the correct mailbox, search and fetch the body of the email.
response_code_fetch, data_fetch = mail.fetch('1', '(BODY.PEEK[TEXT])')
if response_code_fetch == "OK":
    print("Body Text: " + str(data_fetch[0]))
else:
    print("Unable to find requested messages")
However, I get an incoherent list that has the entire body of the email at index [0] of the returned list. I've tried str(data_fetch[0]) and then using the splitlines method, but it doesn't work.
I've also found the below suggestion online using the email module, but it doesn't seem to work as it prints the else statement.
my_email = email.message_from_string(data_fetch)
body = ""
if my_email.is_multipart():
    for part in my_email.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))
        print(ctype, cdispo)
else:
    # not multipart - i.e. plain text, no attachments, keeping fingers crossed
    print("Email is not multipart")
    body = my_email.get_payload(decode=True)
    print(body)
I won't include the whole result as it's very long but it basically looks like I get the code for the email, HTML formatting, body text and all:
Body Text: [(b'1 (BODY[TEXT] {78687}', b'--_av-
uaAIyctTRCxY0f6Fw54pvw\r\nContent-Type: text/plain; charset=utf-
8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n
Does anyone know how can I get the one sentence out of the body text?
I think the b in front of your string makes it a byte literal. What if you put a .decode('UTF-8') behind your Body Text string?
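A minimal sketch of that decode step, using made-up bytes in place of the real fetch response: once the bytes are decoded to a string, splitlines() lets you index the line you want, since the sentence sits on the same line in every email.

```python
# data_fetch[0] is a tuple; data_fetch[0][1] holds the raw body bytes.
# Sample bytes standing in for a real fetch response:
raw_body = b"Content-Type: text/plain\r\n\r\nFirst line.\r\nThe sentence you want.\r\n"

text = raw_body.decode("utf-8")  # bytes -> str
lines = text.splitlines()        # now a normal list of lines
print(lines[-1])                 # pick the known line position
```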
I was looking for a solution to get all the text content (not attachments) of the messages in my Gmail account into text files, and I found a piece of code which promises to do so. I have Python 3.4 and 2.7 both installed on Windows 7. I know PHP a bit, but with Python I am at zero. Right now, I have the code copied into Notepad and saved as test.py. Here is the complete code.
import imaplib
import email

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('myemailid#gmail.com', 'mypassword')
mail.list()
mail.select('inbox')

result, data = mail.uid('search', None, "ALL")
i = len(data[0].split())
for x in range(i):
    latest_email_uid = data[0].split()[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1]
    raw_email_string = raw_email.decode('utf-8')
    email_message = email.message_from_string(raw_email_string)
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True)
            save_string = r"F:\result\email_" + str(x) + ".txt"
            myfile = open(save_string, 'a')
            myfile.write(body.decode('utf-8'))
            myfile.close()
        else:
            continue
ISSUE : The code when run gives nothing in return.
UPDATE : Actually I have been going through a lot of threads here. Closest to what I am asking are here and here but no one has a unified answer and the solutions are scattered in bits and pieces, or it is too specific so it is very difficult for me to make sense out of it. Trust me, I am testing and trying. Even a Similar question remains unanswered here.
For those who may find it useful: this script runs fine on Python 3.4 with these changes.
save_string = "F:\\result\\email_" + str(x) + ".txt"
myfile.write(str(body))
Just change the directory and the folder name according to your need. Thanks to the original code provider.
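One more pitfall worth noting (not part of the original answer): if a message isn't UTF-8, body.decode('utf-8') raises and the script can die with nothing written. Decoding with the charset the message declares avoids that; the sample bytes below are made up:

```python
import email

# Tiny sample message; in the script above the raw bytes come from the fetch.
raw = (b'Content-Type: text/plain; charset=iso-8859-1\r\n\r\n'
       b'caf\xe9')

msg = email.message_from_bytes(raw)
payload = msg.get_payload(decode=True)            # undecoded bytes
charset = msg.get_content_charset() or 'utf-8'    # fall back when absent
print(payload.decode(charset, errors='replace'))  # café
```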
I want to send a zip file through SOAP (from a SOAP client to a SOAP server) in python.
Following the reading of this SO question, I choose to use suds as my python client.
But according to this, suds does not support sending attachments. A method is given to circumvent the problem, but I've not been able to make it work; I'm puzzled over what I'm supposed to give as parameters.
Anyone know how to send a file through Soap in python ?
If needed I'll switch to another SOAP client library.
Download the provided wrapper, and then where you would normally say something like...
client.service.fooMethod(fooParam1,fooParam2,...)
...instead do...
soap_attachments.with_soap_attachment(client.service.fooMethod,binaryParam,fooParam1,fooParam2,...)
Where binaryParam is of the type expected by soap_attachments.py. For example, if you wanted to send a png image I think (never done this) you would do:
import uuid

imageFile = open('imageFile.png', 'rb')
imageData = imageFile.read()
mimeType = 'image/png'
binaryParam = (imageData, uuid.uuid4(), mimeType)
Attachments are the best way to send a binary file through SOAP. If you can't use any other method but SOAP, just encode your binary with Base64 and pass it to the SOAP method as a parameter. It isn't pure, but it works great with small attachments. Large binaries? Use FTP, WebDAV, or other native ways of sending files between hosts.
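A minimal sketch of the Base64 approach (the uploadFile method name is made up; substitute whatever your service's WSDL defines):

```python
import base64

# Hypothetical file bytes; in practice: zip_bytes = open('file.zip', 'rb').read()
zip_bytes = b'PK\x03\x04 fake zip payload'

# Encode to a plain ASCII string that can be passed as an ordinary SOAP
# string parameter, e.g. client.service.uploadFile(name, encoded):
encoded = base64.b64encode(zip_bytes).decode('ascii')

# The server side decodes it back to the original bytes:
print(base64.b64decode(encoded) == zip_bytes)  # True
```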
I made the following changes to soap_attachments.py under suds to get my own uploads to work. You may not need some of the changes that I've made to this, but hopefully it'll at least give you a start.
--- /home/craig/Downloads/soap_attachments.py	2011-07-08 20:38:55.708038918 -0400
+++ soap_attachments.py	2011-06-21 10:29:50.090243052 -0400
@@ -1,4 +1,8 @@
+import uuid
+import re
 def with_soap_attachment(suds_method, attachment_data, *args, **kwargs):
+    HUD_ARM_SERVICE_URL = suds_method.client.wsdl.url
+    HUD_ARM_SERVICE_URL = HUD_ARM_SERVICE_URL.replace('wsdl', 'xsd')
     """ Add an attachment to a suds soap request.
     attachment_data is assumed to contain a list:
@@ -16,7 +20,9 @@
     soap_method = suds_method.method
     if len(attachment_data) == 3:
+        print "here"
         data, attachment_id, attachment_mimetype = attachment_data
+        attachment_id = uuid.uuid4()
     elif len(attachment_data) == 2:
         data, attachment_id = attachment_data
         attachment_mimetype = MIME_DEFAULT
@@ -55,7 +61,7 @@
     ])
     # Build the full request
-    request_text = '\n'.join([
+    request_text = '\r\n'.join([
         '',
         '--%s' % boundary_id,
         soap_headers,
I then use:
f = open(dir_path + infile,'rb')
data_file = f.read()
data_file_type = mimetypes.guess_type(infile)[0]
(filename,ext) = infile.split('.')
...
clientargs = [...]
identifier = with_soap_attachment(client.service.fooThing, [data_file, '1', data_file_type], credentials['foo'],credentials['bar'], morefoo)
You might not need all of these changes, but it's what got me going.
Hope this helps!