Python code to extract single/multiple attachments(.pdf/.png) from .msg file

I have a master folder containing 10-15 .msg files.
Each file may or may not have attachments, in either PDF or PNG format.
Is there any Python code to extract those attachments?
P.S. I already tried pywin32, but it is specific to Windows.
I want to run my code in a Linux/Ubuntu terminal.

This can be done with the extract_msg package, as shown below in a minimal working example (MWE). It does not loop over all mail files and does not guard against overwriting attachments that share a filename.
import extract_msg
with extract_msg.openMsg('Mail.msg') as msg:
    for attm in msg.attachments:
        file = attm.save()
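The MWE above can be extended to a whole folder. The following is only a sketch under assumptions: the helper names find_msg_files and save_all_attachments are made up for illustration, and the customPath keyword of attm.save() is assumed to be available in the installed extract_msg version. Each message gets its own output subfolder, which avoids collisions between attachments of different mails; filtering by .pdf/.png is not covered here.

```python
from pathlib import Path


def find_msg_files(folder):
    """Return the .msg files directly inside `folder` (case-insensitive suffix)."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".msg")


def save_all_attachments(folder, out_root):
    """Save the attachments of every .msg file into out_root/<message name>/."""
    # Imported here so this module still loads where the third-party
    # package is not installed (pip install extract-msg).
    import extract_msg

    for msg_path in find_msg_files(folder):
        # One subfolder per message, so equal attachment names in
        # different mails do not overwrite each other.
        target = Path(out_root) / msg_path.stem
        target.mkdir(parents=True, exist_ok=True)
        with extract_msg.openMsg(str(msg_path)) as msg:
            for attm in msg.attachments:
                # Assumption: save() accepts customPath in this version.
                attm.save(customPath=str(target))
```

A call such as save_all_attachments('master_folder', 'attachments') would then process every mail in the folder; both folder names are placeholders.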

Related

zipfile.ZipFile extracts the wrong file

I am working on a project that manipulates a document's XML file. My approach is as follows: first convert the DOCX document into a ZIP archive, then extract the contents of that archive to get access to the document.xml file, and finally convert the XML to a txt file in order to work with it.
I did all of the above on one document and everything worked perfectly. But when I switched to a different document, the zipfile library did not extract the contents of the new ZIP archive. Instead it somehow extracted the contents of the old document that I processed before, and document.xml was converted into document.txt without me even running the block of code that converts the XML into txt.
The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the contents of that particular document when it is not even in the path.
This is the code I am using in a Jupyter notebook.
import os
import shutil
import zipfile
# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')
# Read the txt file
with open('extracted/word/document.txt') as intxt:
    data = intxt.read()
This is the directory tree for the extracted zip archive of the first document.
data/
    docx/
    zip/
    extracted/
        customXml/
        docProps/
        _rels/
        [Content_Types].xml
        word/
            document.txt
The second document's directory tree should be as follows:
data/
    docx/
    zip/
    extracted/
        customXml/
        docProps/
        _rels/
        [Content_Types].xml
        word/
            document.xml
But zipfile is extracting the contents of the first document even though its DOCX file is no longer in the directory. I am also using Ubuntu 20.04, so I am not sure whether it has to do with my OS.

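I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. In your code, for example, the archive is written to data/zip/document.zip but opened from zip/document.zip, and the contents are extracted to data/extracted/ but read from extracted/word/, so which files you get depends on the current working directory. I'd suggest using absolute paths to avoid this. You may also want to delete the extracted files and directories once you are done processing them. That way you won't encounter any issues with lingering files.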
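Both suggestions can be combined in a few lines. This is a sketch, not the asker's actual pipeline: the function name read_document_xml is made up for illustration, and it relies on the fact that zipfile can open a .docx directly, so the copy-and-rename step is skipped. The scratch directory is created fresh for every call and removed afterwards, so files from an earlier document cannot linger.

```python
import tempfile
import zipfile
from pathlib import Path


def read_document_xml(docx_path):
    """Return the text of word/document.xml from a .docx file."""
    # Resolve to an absolute path so the result does not depend on the
    # notebook's current working directory.
    docx_path = Path(docx_path).resolve()
    # A fresh scratch directory per call; it is deleted when the block
    # exits, so nothing from a previous run can be picked up by mistake.
    with tempfile.TemporaryDirectory() as workdir:
        with zipfile.ZipFile(docx_path) as archive:
            archive.extract("word/document.xml", workdir)
        xml_path = Path(workdir) / "word" / "document.xml"
        return xml_path.read_text(encoding="utf-8")
```

With the layout from the question this would be called as data = read_document_xml('data/docx/input.docx').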

Why are excel files uploaded as zip file?

I have an Excel sheet called last_run.xlsx, and I use a small piece of Python code to upload it to Slack, as follows:
import os
import slack
token= XXX
client = slack.WebClient(token=slack_token)
response = client.files_upload(
    channels="#viktor",
    file="last_run.xlsx")
But when I receive it on Slack it is a weird zip file and not an Excel file anymore. Any idea what I am doing wrong?

Excel files are actually zipped collections of XML documents, so it appears that Slack's automatic file detection recognizes the upload as a ZIP file for that reason.
Manually specifying xlsx as the filetype does not change that behavior.
What works is to also specify a filename. Then the file is correctly recognized and uploaded as an Excel file.
Code:
import os
import slack
client = slack.WebClient(token="MY_TOKEN")
response = client.files_upload(
    channels="#viktor",
    file="last_run.xlsx",
    filename="last_run.xlsx")
This looks like a bug in the automatic file detection to me, so I would suggest submitting a bug report to Slack about this behavior.

Is it possible to download just part of a ZIP file using python zipfile library

I was wondering whether there is any way to download only a part of a .rar or .zip file without downloading the whole file. Say there is a zip file containing files A, B, C and D, and I only need A. Can I somehow use the zipfile module so that I download only that one file?
I am trying the code below:
r = c.get(file)
z = ZipFile.ZipFile(BytesIO(r.content))
for file1 in z.namelist():
    if 'time' not in file1:
        print("hi")
        z.extractall(file1,download_path + filename)
This code downloads the whole zip file and only then extracts the specific member. Can I somehow download only the file I need?
There is a similar question here, but it only shows an approach using the Linux command line. That question doesn't address how it can be done using Python libraries.

The question @Juggernaut mentioned in a comment is actually very helpful, as it points you in the direction of the solution.
You need to create a replacement for BytesIO that returns the necessary information to ZipFile. You will need to get the length of the file, and then fetch whatever sections ZipFile asks for.
How large are those files? Is it really worth the trouble?
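A rough sketch of such a replacement follows. RangeReader and fetch are hypothetical names; fetch(start, end) stands in for whatever returns the bytes of that range, which against a real server would be an HTTP GET with a Range header (assuming the server supports it). To keep the example runnable without a network, the "server" here is an in-memory archive, and the list fetched records how many bytes ZipFile actually asked for.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by a fetch(start, end) callable."""

    def __init__(self, size, fetch):
        self._size = size    # total length of the remote file
        self._fetch = fetch  # hypothetical: returns bytes start..end-1
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence")
        if new_pos < 0:
            raise OSError("negative seek position")
        self._pos = new_pos
        return self._pos

    def read(self, n=-1):
        # n < 0 means "read to the end", as with regular files.
        end = self._size if n is None or n < 0 else min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data


# Stand-in for the remote archive: one small file and three large ones.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("A.txt", "the file we want")
    for name in ("B.bin", "C.bin", "D.bin"):
        zf.writestr(name, bytes(200_000))
remote = buffer.getvalue()

fetched = []  # sizes of the ranges that were requested


def fetch(start, end):
    fetched.append(end - start)
    return remote[start:end]


with zipfile.ZipFile(RangeReader(len(remote), fetch)) as zf:
    content = zf.read("A.txt")
```

ZipFile first reads the directory at the end of the archive and then only the byte range of the requested member, so sum(fetched) stays far below len(remote). Every read turns into one request, so a real implementation would probably want some buffering.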

Use remotezip: https://github.com/gtsystem/python-remotezip. You can install it using pip:
pip install remotezip
Usage example:
from remotezip import RemoteZip
with RemoteZip("https://path/to/zip/file.zip") as zip_file:
    for file in zip_file.namelist():
        if 'time' not in file:
            print("hi")
            zip_file.extract(file, path="/path/to/extract")
Note that to use this approach, the web server from which you receive the file needs to support the Range header.

Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?

The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
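import os
import tarfile
# Raw string, so that backslash sequences such as \f are not treated as escapes
subset_path = r'c:\data\grant\files'
# tarfile handles both the gzip decompression and the unpacking
with tarfile.open("subset_full.tar.gz") as tar:
    tar.extractall(subset_path)
subset_data_path = os.path.join(subset_path, 'subset_full')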
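Whether the archive really unpacks into a subset_full folder depends on how it was created, so it can help to look inside before extracting. The sketch below uses only the standard tarfile module; the two helper names are made up for illustration.

```python
import tarfile


def list_tar_members(path):
    """Return the names of all entries in a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()


def read_tar_member(path, member):
    """Return the bytes of one regular file without unpacking to disk."""
    with tarfile.open(path, "r:gz") as tar:
        extracted = tar.extractfile(member)
        if extracted is None:
            # extractfile() returns None for directories and similar entries.
            raise ValueError("%s is not a regular file" % member)
        return extracted.read()
```

Calling list_tar_members("subset_full.tar.gz") shows the top-level folder name to pass to os.path.join, and read_tar_member is enough for the small text files; the h5 files will usually still need to be extracted to disk.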

Extracting .app from zip file in Python

(Python 2.7)
I have a program that will download a .zip file from a server, containing a .app file which I'd like to run. The .zip downloads fine from the server, and trying to extract it outside of Python works fine. However, when I try to extract the zip from Python, the .app doesn't run - it does not say the file is corrupted or damaged, it simply won't launch. I've tried this with other .app files, and I get the same problem, and was wondering if anyone else has had this problem before and a way to fix it?
The code I'm using:
for a in gArchives:
    if (a['fname'].endswith(".build.zip") or a['fname'].endswith(".patch.zip")):
        #try to extract: if not, delete corrupted zip
        try :
            zip_file = zipfile.ZipFile(a['fname'], 'r')
        except:
            os.remove(a['fname'])
        for files in zip_file.namelist() :
            #deletes local files in the zip that already exist
            if os.path.exists(files) :
                try :
                    os.remove(files)
                except:
                    print("Cannot remove file")
                try :
                    shutil.rmtree(files)
                except:
                    print("Cannot remove directory")
            try :
                zip_file.extract(files)
            except:
                print("Extract failed")
        zip_file.close()
I've also tried using zip_file.extractall(), and I get the same problem.

Testing on my MacBook Pro, the problem appears to be with the way Python extracts the files.
If you run
diff -r python_extracted_zip normal_extracted_zip
you will come across messages like this:
File Seashore.app/Contents/Frameworks/TIFF.framework/Resources is a directory while file here/Seashore.app/Contents/Frameworks/TIFF.framework/Resources is a regular file
So obviously the issue is with the filenames it's coming across as it's extracting them. You will need to implement some checking of the filenames as you extract them.
EDIT: It appears to be a bug within python 2.7.* as found here - Sourced from another question posted here.

Managed to resolve this myself - the problem was not to do with directories not being extracted correctly, but in fact with permissions as eri mentioned above.
When the files were being extracted with Python, the permissions were not being kept as they were inside the .zip, so all executable files were set to be not executable. This problem was resolved with a call to the following on all files I extracted, where 'path' is the path of the file:
os.chmod(path, 0755)
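Instead of setting 0755 on every file, the original mode can often be restored per file. This is a sketch under the assumption that the archive was created on a Unix-like system, where the upper 16 bits of ZipInfo.external_attr hold the file mode; the function name is made up for illustration, and symlinks inside the bundle are not handled. Note that Python 3 spells the octal literal 0o755.

```python
import os
import zipfile


def extract_with_permissions(zip_path, dest_dir):
    """Extract all members and re-apply the permission bits stored in the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            extracted_path = archive.extract(info, dest_dir)
            # Assumption: archive written on Unix, so the high 16 bits of
            # external_attr carry the mode; keep only the permission bits.
            mode = (info.external_attr >> 16) & 0o777
            if mode:  # 0 means no Unix mode was recorded for this entry
                os.chmod(extracted_path, mode)
```

Entries without a recorded mode are left with whatever permissions extract() gave them, so the blanket chmod from above remains a reasonable fallback for those.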
