Extract e-mail addresses from .txt files in Python

I would like to parse e-mail addresses out of several text files in Python. As a first attempt, I tried to get the following element, which includes an e-mail address, from a list of strings ('2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n').
When I try to find the list element that includes the e-mail address via i.find("@") == 0, it does not give me content[i]. Am I misunderstanding the .find() function? Is there a better way to do this?
from os import listdir

TextFileList = []
PathInput = "C:/Users/p282705/Desktop/PythonProjects/ExtractingEmailList/text/"

# Count the number of different files you have!
for filename in listdir(PathInput):
    if filename.endswith(".txt"):  # In case you accidentally put other files in directory
        TextFileList.append(filename)

for i in TextFileList:
    file = open(PathInput + i, 'r')
    content = file.readlines()
    file.close()
    for i in content:
        if i.find("@") == 0:
            print(i)

The standard way to check whether a string contains a character in Python is the in operator. In your case, that would be:
for i in content:
    if "@" in i:
        print(i)
The find method, as you were using it, returns the position at which the "@" character is located, counting from 0, as described in the official Python documentation.
For instance, in the string abc@google.com it returns 3. If the character is not found, it returns -1. The equivalent code would be:
for i in content:
    if i.find("@") != -1:
        print(i)
However, this is considered unpythonic, and using the in operator is preferred.

find returns the index at which the substring you are searching for starts, so i.find("@") == 0 only matches lines that begin with "@"; that isn't what you are trying to do here.
You would be better off using a regular expression (re) to search for an occurrence of "@". In your case, you may also run into a situation where there is more than one e-mail address per line (again, I don't know your input data, so I can't guess).
Something along these lines would benefit you:
import re

for i in content:
    findEmail = re.search(r'[\w\.-]+@[\w\.-]+', i)
    if findEmail:
        print(findEmail.group(0))
You would need to adjust this for valid email addresses... I'm not entirely sure if you can have symbols like +...
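For example, a slightly wider pattern that also allows '+' in the local part (a sketch only, still far from a full RFC-compliant validator):
import re

# '+' is legal in the local part of an address, so include it in the character class
email_pattern = re.compile(r'[\w.+-]+@[\w.-]+\.\w+')

line = '2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n'
match = email_pattern.search(line)
if match:
    print(match.group(0))   # prints joachim+pnas@uci.edu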

The find function in Python returns the index of that character in the string (or -1 if it is not there). Maybe you can try this?
words = i.split(' ')   # split the string into words
for x in words:        # search each word for the "@" character
    if x.find("@") != -1:
        print(x)
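One thing to watch with this word-splitting approach: in the sample line the address is followed by a period ("joachim+pnas@uci.edu.\n"), so you may want to strip whitespace and trailing punctuation from each candidate word. A minimal sketch:
line = '2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n'
for word in line.split():         # split() without arguments also handles the trailing '\n'
    if "@" in word:
        print(word.strip('.,;'))  # prints joachim+pnas@uci.edu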

Related

PYTHON: How do I grab a specific part of a string?

I have the following string: "https://www.instagram.com/paula.mtzm/"
I want to put the user "paula.mtzm" into a variable.
Does anyone know how to do this? Maybe you can somehow delete part of the string, like "https://www.instagram.com/", and then delete the last character "/"?
"https://www.instagram.com/paula.mtzm/".split(".com/")[-1].replace("/", "")
This should do what you want. Effectively it splits the string into a list using the separator .com/, gets the last item of that list ("paula.mtzm/"), and finally removes any remaining /s
I'm not sure how specific your use-case is so I don't know how suitable this is in general.
This is actually pretty easy: strings are indexed in Python just like a list. So:
string = "potato"
print(string[0])      # this prints "p" to the console
# we can 'slice' with the index too
print(string[0:3])    # this prints "pot" to the console
So for your specific problem you could have your code search for the third forward slash and grab everything after that (a sketch of this appears after this answer). If you always know the web address, you can just start your index where the user name begins:
string = "https://www.instagram.com/paula.mtzm/"
string_index = 26                      # the 'p' in paula begins here
user_name = string[string_index:-1]    # stop at -1 to drop the trailing '/'
print(user_name)                       # outputs paula.mtzm
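For the "everything after the third slash" idea mentioned above, a minimal sketch using split (assuming the user name is always the first path segment after the domain):
url = "https://www.instagram.com/paula.mtzm/"
# Splitting on "/" gives ['https:', '', 'www.instagram.com', 'paula.mtzm', '']
user_name = url.split("/")[3]
print(user_name)                       # paula.mtzm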

How to get everything after string x in python

I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/, so even if I have s3://tester/folder/anotherone/test.pdf I get the entire path after s3://tester/.
I have attempted to use the split and partition methods but I can't seem to get it.
Currently I am trying:
string.partition('/')[3]
But I get an error saying that the index is out of range.
EDIT: I should have specified that the name of the bucket will not always be the same, so I want to make sure it only grabs whatever comes after the third '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
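For example, with the longer path from the question:
import re

s = 's3://tester/folder/anotherone/test.pdf'
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
print(p)   # folder/anotherone/test.pdf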
UPDATE
Just saw the updates. I got the wrong end of the stick on this one, my bad.
The re method below should work no matter what the bucket name is, returning everything after the third / in the string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]

Making python input search function case insensitive

I'm creating a Python 3 search function that accepts input from the user. The user inputs a string, which is then searched for in a Word document.
I have tested the function with 2 docx files - one with the word "hello" in it and the other with "There".
When I input the words with the exact case as they appear in the docx file, the search function returns the name of the file - success! However, when I input the words without the correct case, nothing is returned.
I've seen a few questions on here about case insensitive options but I couldn't really find one that was similar enough to my project. Any help would be much appreciated.
import os
import docx2txt

os.chdir('c:/users/Says/desktop/projectx')
path = ('c:/users/Says/desktop/projectx')

files = []
x = str(input("search: "))

for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file)

for i in range(len(files)):
    text = docx2txt.process(files[i])
    if x in text:
        print(files[i])
You could use .lower() to make both strings lower case.
if x.lower() in text.lower():
    print(files[i])
One way I can suggest is to convert both x and text to either upper or lower case and then compare, which will give you the desired result - either:
if x.upper() in text.upper():
or:
if x.lower() in text.lower():
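Folding that into the loop from the question, a minimal sketch (files and docx2txt set up exactly as above):
x = input("search: ").lower()        # normalise the query once
for i in range(len(files)):
    text = docx2txt.process(files[i])
    if x in text.lower():            # case-insensitive membership test
        print(files[i])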

Python Regex with Paramiko

I am using the Python Paramiko module to SFTP into one of my servers. I did a listdir() to get all of the files in the folder. Out of that folder I'd like to use a regex to find the matching pattern and then print out the entire string.
listdir() returns a list of XML files with this format:
LOG_MMDDYYYY_HHMM.XML
LOG_07202018_2018 --> this is for the date 07/20/2018 at the time 20:18
I'd like to use a regex to find all the XML files for a particular date and store them in a list or a variable. I can then pass this variable to Paramiko to get the file.
for log in file_list:
    regex_pattern = 'POSLog_' + date + '*'
    if re.search(regex_pattern, log) != None:
        matchObject = re.findall(regex_pattern, log)
        print(matchObject)
The code above just prints:
['Log_07202018']
I want it to store the entire string Log_07202018_2018.XML in a variable.
How would I go about doing this?
Thank you
If you are looking for a fixed string, don't use regex.
search_str = 'POSLog_' + date
for line in file_list:
    if search_str in line:
        print(line)
Alternatively, a list comprehension can make a list of matching lines in one go:
log_lines = [line for line in file_list if search_str in line]
for line in log_lines:
    print(line)
If you must use regex, there are a few things to change:
Any variable part that you put into the regex pattern must either be guaranteed to be a regex itself, or it must be escaped.
"The rest of the line" is not *, it's .*.
The start-of-line anchor ^ should be used to speed up the search - this way the regex fails faster when there is no match on a given line.
To support the ^ on multiple lines instead of only at the start of the entire string, the MULTILINE flag is needed.
There are several ways of getting all matches. One could do "for each line, if there is a match, print line", same as above. Here I'm using .finditer() and a search over the whole input block (i.e. not split into lines).
log_pattern = '^POSLog_' + re.escape(date) + '.*'
for match in re.finditer(log_pattern, whole_file, re.MULTILINE):
    print(match.group(0))  # the matched line; match.string would be the entire input
As for your original code: because findall only returns the part that matched, just do print(log) instead and it'll print the whole filename.
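Putting those points together, a sketch of the per-filename loop with the fixes applied (the 'POSLog_' prefix comes from the question's code; the filenames below are made-up examples of what listdir() might return):
import re

date = '07202018'                                 # MMDDYYYY, as in the question
file_list = ['POSLog_07202018_2018.XML',          # example names only
             'POSLog_07192018_1530.XML']

# re.escape() keeps the date literal; '^' anchors at the start of the filename
log_pattern = re.compile('^POSLog_' + re.escape(date) + '.*')

matching_files = [log for log in file_list if log_pattern.search(log)]
print(matching_files)                             # ['POSLog_07202018_2018.XML']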

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to read in a whole file and search through it:
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so I have spaces and it is not a list. I can match things such as a single letter: if I search for just 't', I get a bunch of t's. So this tells me I am matching single letters, whereas I want to match whole words, and moreover to preserve the wildcard behaviour.
What I would like to happen is that a user enters text (including what will be a wildcard) and I substitute it in where 'th*' is, with the wildcard still doing what it should. That leads to the question: can I just stick a variable holding the search text in for 'th*'? After some investigation I wondered whether I am somehow supposed to translate the 'th*', and found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas", and what do I have wrong in the code, since it does not operate correctly on the incoming text in the second block of code as it does in the first?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split(' ')  # split on spaces to make a list
filtered = fnmatch.filter(file_contents, 'th*')
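As for substituting the user's input for 'th*': fnmatch.filter() just takes the pattern as a string, so a variable works directly. A minimal sketch (same file path as in the question):
import fnmatch

with open('testfilefolder/wssnt10.txt') as f:
    words = f.read().lower().split()                       # whole file as a list of words

pattern = input("search pattern (wildcards allowed): ")    # e.g. th*
filtered = fnmatch.filter(words, pattern.lower())
print(sorted(set(filtered)))                               # unique matching words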
