Reading mbox files with Python's mailbox module

Good afternoon. I'm working on a kind of spam filter in Python, and I've downloaded some spam and ham emails from this corpus:
https://spamassassin.apache.org/publiccorpus/
This is the code I wrote to read the mbox files:
import os
import mailbox
import sys
import pprint
print("Reading emails:")
for mbox_file in os.listdir(os.getcwd()+"/spam"):
    print("Processing "+mbox_file)
    mbox = mailbox.mbox(mbox_file)
    for message in mbox:
        print(message['from'])
The problem is that it apparently does not recognize the files, because it never reads anything at all. I created a separate .mbox file by copying the contents of one of the files into it, and that was read perfectly. I also tried reading the files with read(), which throws an error saying the file does not exist. I do not know what I'm missing; any help would be appreciated. Thanks for your time.
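One likely cause, assuming the script is run from the directory that contains spam/: os.listdir returns bare file names, so mailbox.mbox(mbox_file) looks for each file in the current directory rather than in spam/. Because mailbox.mbox defaults to create=True, a path that does not exist quietly yields a new, empty mailbox, which would explain why nothing is read, and read() on the same bare name fails with "file does not exist". A minimal sketch that joins the directory and the file name (the helper name read_senders is made up for illustration):

```python
import mailbox
import os


def read_senders(directory):
    """Return (file name, From header) pairs for every mbox file in directory."""
    senders = []
    for name in sorted(os.listdir(directory)):
        # os.listdir returns bare names, so join them with the directory.
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # create=False makes a wrong path raise instead of creating an empty mailbox.
        mbox = mailbox.mbox(path, create=False)
        try:
            for message in mbox:
                senders.append((name, message['from']))
        finally:
            mbox.close()
    return senders


if __name__ == "__main__":
    spam_dir = os.path.join(os.getcwd(), "spam")
    if os.path.isdir(spam_dir):
        for name, sender in read_senders(spam_dir):
            print(name, sender)
```

If this is the cause, the original script has probably left empty files named after the corpus files in the working directory, which is a quick way to confirm it.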

Related

Running an ipynb script on many files at once/an entire directory?

I will be the first to tell you that my Python skills are beginner at best, so please forgive my ignorance here.
By way of background, I have created a Python script in Anaconda Jupyter Notebooks that reads a single PDF from a folder, C:\Users\...\PDFs , extracts the text of said PDF, and then through some splicing puts the text of interest into a CSV file that it creates.
The problem is that I want to execute this script on hundreds of PDFs (the ipynb script itself works just fine when executed on individual PDFs, I just don't want to keep manually changing the file name in the Notebook/Python script). Using pdfreader, my script starts with the following:
import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer
fd = open(r'C:\Users\...\PDFs\[pdf name].pdf', 'rb')
viewer = SimplePDFViewer(fd)
doc = PDFDocument(fd)
This is where I get stuck: I cannot figure out how to run this on (or import) all PDFs in the folder. I have seen some people use a variable file name with an asterisk, e.g. C:\Users\...\PDFs\*.pdf, but I can't get that to work. It seems like it might be possible to save my ipynb as a .py file and then somehow run it in Anaconda Prompt, but I have struggled to get this method to work as well. I am unfamiliar with .bat files, but those too seem potentially promising.
Does anyone know of a way to run this script on many PDFs in a single directory at once? I have scrounged around a ton, but for the life of me cannot figure this out. Any help would be greatly appreciated! :)
You can use the glob module to gather all of the file names, then loop through them.
import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer
from glob import glob
# glob expands the * wildcard into a list of matching paths
pdf_files = glob(r'C:\Users\...\PDFs\*.pdf')
for path in pdf_files:
    # the with block closes the file even if processing raises
    with open(path, 'rb') as fd:
        viewer = SimplePDFViewer(fd)
        doc = PDFDocument(fd)
        ...
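Since the goal is one CSV built from many files, the same loop can also write one row per file. Here is a rough sketch using only the standard library; extract_fields is a hypothetical placeholder for the pdfreader-based extraction, and the column names are made up for illustration:

```python
import csv
from pathlib import Path


def extract_fields(path):
    """Hypothetical placeholder: return the values of interest for one file."""
    # In the real script, the pdfreader-based extraction would go here.
    return [path.name, path.stat().st_size]


def write_summary(folder, csv_path, pattern="*.pdf"):
    """Write one CSV row per file in folder matching pattern; return the file count."""
    files = sorted(Path(folder).glob(pattern))
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "size_bytes"])
        for path in files:
            writer.writerow(extract_fields(path))
    return len(files)
```

Calling write_summary with the PDF folder and an output path would then replace the manual edit-and-rerun cycle in the notebook.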

Why are excel files uploaded as zip file?

I have an Excel sheet called last_run.xlsx, and I use a small piece of Python code to upload it to Slack, as follows:
import os
import slack
slack_token = "XXX"
client = slack.WebClient(token=slack_token)
response = client.files_upload(
    channels="#viktor",
    file="last_run.xlsx")
But when I receive it on Slack, it is a weird zip file and not an Excel file anymore. Any idea what I am doing wrong?
Excel files are actually a zipped collection of XML documents, so it appears that Slack's automatic file type detection recognizes the upload as a ZIP file for that reason.
Manually specifying xlsx as the filetype does not change that behavior either.
What works is also specifying a filename. Then the file is correctly recognized and uploaded as an Excel file.
Code:
import os
import slack
client = slack.WebClient(token="MY_TOKEN")
response = client.files_upload(
    channels="#viktor",
    file="last_run.xlsx",
    filename="last_run.xlsx")
This looks like a bug in the automatic file type detection to me, so I would suggest submitting a bug report to Slack about this behavior.
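The point that an .xlsx file is a ZIP container can be checked locally with the standard zipfile module. This is only a sketch: the helper names are made up, and the usage below builds a stand-in archive rather than a real workbook.

```python
import zipfile


def is_zip_container(path):
    """Return True if the file starts with the ZIP signature and parses as a ZIP archive."""
    with open(path, "rb") as handle:
        signature = handle.read(4)
    return signature == b"PK\x03\x04" and zipfile.is_zipfile(path)


def list_parts(path):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


if __name__ == "__main__":
    # Stand-in file: a plain ZIP archive saved with an .xlsx extension.
    with zipfile.ZipFile("stand_in.xlsx", "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("xl/workbook.xml", "<workbook/>")
    print(is_zip_container("stand_in.xlsx"))
    print(list_parts("stand_in.xlsx"))
```

Running is_zip_container on a real last_run.xlsx should show the same thing, which is consistent with content-based detection classifying the upload as a ZIP file when no file name is given.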

How to import a module from a zip file which is encrypted

I have a module which I've written, and it consists of several files, so I've packed it into a zip file and then I added the zip into the path variable:
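import os
import sys
# Put the zip archive on the import path so "import mylib" is resolved from it
mylib_zip_dir = os.path.dirname(os.path.realpath(__file__))
mylib_zip_path = os.path.join(mylib_zip_dir, "mylib.zip")
sys.path.insert(0, mylib_zip_path)
import mylib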
I've been asked if I could encrypt the zip file but still use the module as I currently use it. I've seen ways for Python to extract an encrypted zip file, but can't figure out how I could still use it the way I currently do.
Is there a way to add the zip file's password to my script so that it knows how to decrypt and import the module? I'm aware that this goes against the open source nature of Python, but I would still like to know if and how I could achieve that.
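As far as I know, the sys.path approach has no place to supply a password, so one workaround is to extract the archive with the zipfile module first and import from the extracted copy. The sketch below rests on several assumptions: import_from_zip is a made-up helper name, the archive uses the traditional ZIP encryption (the only kind the standard zipfile module can decrypt; it cannot create encrypted archives at all), and it is acceptable for the code to sit unencrypted in a temporary directory while the program runs.

```python
import importlib
import sys
import tempfile
import zipfile


def import_from_zip(zip_path, module_name, password=None):
    """Extract zip_path to a temporary directory and import module_name from it."""
    target_dir = tempfile.mkdtemp(prefix="unzipped_")
    with zipfile.ZipFile(zip_path) as archive:
        # pwd must be bytes; it is ignored for members that are not encrypted.
        archive.extractall(target_dir, pwd=password)
    # Make the extracted files importable, ahead of other path entries.
    sys.path.insert(0, target_dir)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

Usage would look like mylib = import_from_zip(mylib_zip_path, "mylib", password=b"..."). Note that anyone who can read the script can read the password, so this hides the code from casual inspection at best.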

Why is my glob.glob loop not iterating through all text files in folder?

I am attempting to read from a folder containing text documents with Python 3. Specifically, this is a modification of the LingSpam email spam dataset. I expect the code I wrote to return all 1893 text document names, but it returns only the first 420 filenames. I do not understand why it stops short of the total number of filenames. Any ideas?
import glob
import os
if not os.path.exists('train'): # download data
    from urllib.request import urlretrieve
    import tarfile
    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()
abc = []
for f in glob.glob("train/*.txt"):
    print(f)
    abc.append(f)
print(len(abc))
I've tried changing the glob params but still no success.
Edit: Apparently my code works for everyone but me.
Success! The problem was this line:
if not os.path.exists('train'): # download data
To check my output, I had already downloaded the files onto my computer. Because this line only downloads and extracts the archive when the train folder is missing, and the folder already existed, the download was skipped and glob read my existing local copy. I deleted the files from my machine and now it works as it should, though I suspect running
from urllib.request import urlretrieve
import tarfile
urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
tar = tarfile.open('lingspam.tgz')
tar.extractall()
tar.close()
without the if statement would have had the same result.
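A way to catch this kind of mismatch without deleting anything is to compare the folder on disk against the archive's own member list. The following is only a sketch: missing_from_folder is a made-up helper, and it assumes the archive stores the files directly under a directory with the same name as the local folder (train in this case).

```python
import glob
import os
import posixpath
import tarfile


def missing_from_folder(archive_path, folder, suffix=".txt"):
    """Return sorted base names that are in the archive's folder but not on disk."""
    folder_name = os.path.basename(os.path.normpath(folder))
    with tarfile.open(archive_path) as tar:
        # Member names inside a tar archive always use forward slashes.
        archived = {
            posixpath.basename(member.name)
            for member in tar.getmembers()
            if member.isfile()
            and member.name.endswith(suffix)
            and posixpath.basename(posixpath.dirname(member.name)) == folder_name
        }
    on_disk = {
        os.path.basename(path)
        for path in glob.glob(os.path.join(folder, "*" + suffix))
    }
    return sorted(archived - on_disk)
```

An empty list from missing_from_folder('lingspam.tgz', 'train') would mean the local folder holds everything the archive does; a non-empty list points at a partial local copy rather than at glob.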

Opening a CSV from a Different Directory in Python

I've been working on a project where I need to import csv files. Previously the csv files have been in the same working directory, but now the project is getting bigger, so for security and organizational reasons I'd much prefer to keep them in a different directory.
I've had a look at some other questions asking similar things, but I couldn't figure out how to apply them to my code: each time I tried, I kept getting the same error message, namely:
IOError: [Errno 2] No such file or directory:
My original attempts all looked something like this:
import csv # Import the csv module
import MySQLdb # Import MySQLdb module
def connect():
    login = csv.reader(file('/~/Projects/bmm_private/login_test.txt'))
I changed the path several times as well, dropping the first /, then the ~, then the / after that, but each time I got the error message. I then tried another method suggested by several people, importing the os module:
import os
import csv # Import the csv module
import MySQLdb # Import MySQLdb module
def connect():
    path = r'F:\Projects\bmm_private\login_test.txt'
    f = os.path.normpath(path)
    login = csv.reader(file(f))
But I got the error message yet again.
Any help here would be much appreciated, if I could request that you use the real path (~/Projects/bmm_private/login_test.txt) in any answers you know of so it's very clear to me what I'm missing out here.
I'm pretty new to python so I may struggle to understand without extra clarity/explanation. Thanks in advance!
The tilde tells me that this is the home folder (e.g. C:\Users\<username> on my Windows system, or /Users/<username> on my Mac). If you want to expand the user, use the os.path.expanduser call:
full_path = os.path.expanduser('~/Projects/bmm_private/login_test.txt')
# Now you can open it
Another approach is to locate the file relative to your current script. For example, if your script is in ~/Projects/my_scripts, then:
script_dir = os.path.dirname(__file__) # Script directory
full_path = os.path.join(script_dir, '../bmm_private/login_test.txt')
# Now use full_path
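Putting the first approach together with the csv module: the ~ is normally expanded by the shell, not by open() or file(), which is why the literal '/~/Projects/...' path is not found. A short sketch, written for Python 3 (where the file() builtin no longer exists and open() replaces it); read_rows is a made-up helper name:

```python
import csv
import os


def read_rows(path):
    """Expand a leading ~ and return the rows of the CSV file as lists of strings."""
    full_path = os.path.expanduser(path)
    # newline="" is what the csv module documentation recommends for files it reads.
    with open(full_path, newline="") as handle:
        return list(csv.reader(handle))


if __name__ == "__main__":
    login_path = "~/Projects/bmm_private/login_test.txt"
    if os.path.exists(os.path.expanduser(login_path)):
        print(read_rows(login_path))
```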
