Extract Message-ID from a file

Extract Message-ID from a file - python

I have the following code that extracts the Message-Id in gathers them in a Dataframe.It works and gives me the follwing results :
This an example of the lines in the dataframe :
Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>
What I want to have is only the string after < character and the before >. Because Message-ID ends with >. Also I have some lines where the Message-ID value is empty. I want to delete these lines.
Here is the code that I wrote
import pandas as pd
import numpy as np
f = open('C:\\Users\\hmk\\Desktop\\PFE 2019\\ML\\MachineLearningPhishing-
master\\MachineLearningPhishing-master\\code\\resources\\emails-
enron.mbox','r')
line_num = 0
e = []
search_phrase = "Message-ID"
for line in f.readlines():
line_num += 1
if line.find(search_phrase) >= 0:
#line = line[13:]
#line = line[:-2]
e.append(line)
f.close()
dfObj = pd.DataFrame(e)

One way to do it is using regex and pandas DataFrame replace:
clean_df = df.replace(to_replace='\<|\>', value='', regex=True)
clean_df = clean_df.replace(to_replace='(Message-ID:\s*$)', value=np.nan, regex=True).dropna()
the first line of code is removing the < and >, assuming the msgs will only contain those two
the second is checking if there is a message id on the body, if not it will replace for NaN.
note that I used numpy.nan just to simplify the process of dropping the blank msgs

You can use a regex which will extract the desired Message-ID for you.
So your first part for extracting the message id would be like below:
import re # import regex
s = 'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>'
message_id = re.search(r'Message-ID: <(.*?)>', s).group(1)
print('message_id: ', message_id)
Your ideal Message ID:
>>> message_id: 23272646.1075847145300.JavaMail.evans#thyme>
So you can loop through your data end check for the regex like this:
for line in f.readlines():
line_num += 1
message_id = re.search(r'Message-ID: <(.*?)>', line)
if message_id:
msg_id_string = message_id.group(1)
e.append(line)
# your other works
The if message_id: checks whether there is a match for your Message-ID and if it doesn't match it will return None and won't go through the if instructions.

You want a substring of your lines
for line in f.readlines():
if all(word in line for word in [search_phrase, "<", ">"]):
e.append(line[line.find("<")+1:-1])
#-1 suppose ">" as the last character
Use in to check if a string is inside another string
Use find to get the index of your pattern
Use [in:out] to get substring between your two values

s = "We want <This text inside only>. yes we do."
s2 = s[s.find("<")+1:s.find(">")]
print(s2) # Prints : This text inside only
# If you want to remove empty lines :
lines = filter(lambda x: x.strip(), lines)
filter goes through the whole lines, no need for a for loop that way.

One suggestion for you:
import re
f = open('PATH/TO/FILE', 'r').read()
msgID = re.findall(r'(?<=<).*?(?=>)', f)

Related

Find first line of text according to value in Python

How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In linux I can get the closest line and do the treatment I need using grep, sed, cut and others, but I'd like in Python.
Any help will be greatly appreciated!
Thank you.

How can I do a search of a value of the first "latitude, longitude"
coordinate in a "file.txt" list in Python and get 3 rows above and 3
rows below?*
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139

You can use pandas to import the data in a dataframe and then easily manipulate it. As per your question the value to check is not the exact match and therefore I have converted it to string.
import pandas as pd
data = pd.read_csv("file.txt", header=None, names=["latitude","longitude"]) #imports text file as dataframe
value_to_check = 37.0459 # user defined
for i in range(len(data)):
if str(value_to_check) == str(data.iloc[i,0])[:len(str(value_to_check))]:
break
print(data.iloc[i-3:i+4,:])
output
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948

A solution with iterators, that only keeps in memory the necessary lines and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice
def find_in_file(file, target, before=3, after=3):
queue = deque(maxlen=before)
with open(file) as f:
for line in f:
if target in map(float, line.split(',')):
out = list(queue) + [line] + list(islice(f, 3))
return out
queue.append(line)
else:
raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04565,-95.58073\n', '37.04565,-95.58073\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there is less than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.

This solution will print the before and after elements even if they are less than 3.
Also I am using string as it is implied from the question that you want partial matches also. ie. 37.0459 will match 37.04597
search_term='37.04462'
with open('file.txt') as f:
lines = f.readlines()
lines = [line.strip().split(',') for line in lines] #remove '\n'
for lat,lon in lines:
if search_term in lat:
index=lines.index([lat,lon])
break
left=0
right=0
for k in range (1,4): #bcoz last one is not included
if index-k >=0:
left+=1
if index+k<=(len(lines)-1):
right+=1
for i in range(index-left,index+right+1): #bcoz last one is not included
print(lines[i][0],lines[i][1])

Regex Python find everything between four characters

I have a string that holds data. And I want everything in between ({ and })
"({Simple Data})"
Should return "Simple Data"

Or regex:
s = '({Simple Data})'
print(re.search('\({([^})]+)', s).group(1))
Output:
'Simple Data'

You could try the following:
^\({(.*)}\)$
Group 1 will contain Simple Data.
See an example on regexr.

If the brackets are always positioned at the beginning and the end of the string, then you can do this:
l = "({Simple Data})"
print(l[2:-2])
Which resulst in:
"Simple Data"
In Python you can access single characters via the [] operator. With this you can access the sequence of characters starting with the third one (index = 2) up to the second-to-last (index = -2, second-to-last is not included in the sequence).

You could try this regex (?s)\(\{(.*?)\}\)
which simply captures the contents between the delimiters.
Beware though, this doesn't account for nesting.
If nesting is a concern, the best you can to with standard Python re engine
is to get the inner nest only, using this regex:
\(\{((?:(?!\(\{|\}\).)*)\}\)

Hereby I designed a tokenizer aimming at nesting data. OP should check out here.
import collections
import re
Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])
def tokenize(code):
token_specification = [
('DATA', r'[ \t]*[\w]+[\w \t]*'),
('SKIP', r'[ \t\f\v]+'),
('NEWLINE', r'\n|\r\n'),
('BOUND_L', r'\(\{'),
('BOUND_R', r'\}\)'),
('MISMATCH', r'.'),
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
line_num = 1
line_start = 0
lines = code.splitlines()
for mo in re.finditer(tok_regex, code):
kind = mo.lastgroup
value = mo.group(kind)
if kind == 'NEWLINE':
line_start = mo.end()
line_num += 1
elif kind == 'SKIP':
pass
else:
column = mo.start() - line_start
yield Token(kind, value, line_num, column)
statements = '''
({Simple Data})
({
Parent Data Prefix
({Nested Data (Yes)})
Parent Data Suffix
})
'''
queue = collections.deque()
for token in tokenize(statements):
if token.typ == 'DATA' or token.typ == 'MISMATCH':
queue.append(token.value)
elif token.typ == 'BOUND_L' or token.typ == 'BOUND_R':
print(''.join(queue))
queue.clear()
Output of this code should be:
Simple Data
Parent Data Prefix
Nested Data (Yes)
Parent Data Suffix

Replace "*" (asterics) in HTML file with increasing number with python

I have a HTML file that has a series of * (asterics) in it and would like to replace it with numbers starting from 0 and on until it replaces all * (asterics) with a number.
I am unsure if this is possible in python or if another methods would be better.
Edit 2
Here is a short snippet from the TXT file that I am working on
<td nowrap>4/29/2011 14.42</td>
<td align="center">*</td></tr>
I made a file just containing those lines to test out the code.
And here is the code that I am attempting to use to change the asterics:
number = 0
with open('index.txt', 'r+') as inf:
text = inf.read()
while "*" in text:
print "I am in the loop"
text = text.replace("*", str(number), 1)
number += 1
I think that is as much detail as I can go into. Please let me know if I should just add this edit as another comment or keep it as an edit.
And thanks for all the quick responses so far~!

Use the re.sub() function, this lets you produce a new value for each replacement by using a function for the repl argument:
from itertools import count
with open('index.txt', 'r') as inf:
text = inf.read()
text = re.sub(r'\*', lambda m, c=count(): str(next(c)), text)
with open('index.txt', 'w') as outf:
outf.write(text)
The count is taken care of by itertools.count(); each time you call next() on such an object the next value in the series is produced:
>>> import re
>>> from itertools import count
>>> sample = '''\
... foo*bar
... bar**foo
... *hello*world
... '''
>>> print(re.sub(r'\*', lambda m, c=count(): str(next(c)), sample))
foo0bar
bar12foo
3hello4world
Huapito's approach would work too, albeit slowly, provided you limit the number of replacements and actually store the result of the replacement:
with open('index.txt', 'r') as inf:
text = inf.read()
while "*" in text:
text = text.replace("*", str(number), 1)
number += 1
Note the third argument to str.replace(); that tells the method to only replace the first instance of the character.

html = 'some string containing html'
new_html = list(html)
count = 0
for char in range(0, len(new_html)):
if new_html[char] == '*':
new_html[char] = count
count += 1
new_html = ''.join(new_html)
This would replace each asteric with the numbers 1 to one less than the number of asterics, in order.

You need to iterate over each char, you can write to a tempfile and then replace the original with shutil.move using itertools.count to assign a number incrementally each time you find an asterix:
from tempfile import NamedTemporaryFile
from shutil import move
from itertools import count
cn = count()
with open("in.html") as f, NamedTemporaryFile("w+",dir="",delete=False) as out:
out.writelines((ch if ch != "*" else str(next(cn))
for line in f for ch in line ))
move(out.name,"in.html")
using a test file with:
foo*bar
bar**foo
*hello*world
Will output:
foo1bar
bar23foo
4hello5world

It is possible. Have a look at the docs. You should use something like a 'while' loop and 'replace'
Example:
number=0 # the first number
while "*" in text: #repeats the following code until this is false
text = text.replace("*", str(number), maxreplace=1) # replace with 'number'
number+=1 #increase number

Use fileinput
import fileinput
with fileinput.FileInput(fileToSearch, inplace=True) as file:
number=0
for line in file:
print(line.replace("*", str(number))
number+=1

Splitting strings in python in user defined manner

I have data in the following format:
Input Data:
## <id_1d3s2ia_p3m_zkjp59>
<Eckhard_Christian> <hasGender> <ma<>le> .
## <id_1jmz109_1gi_t71dyx>
<Peter_Pinn<>e> <created> <In_Your_Arms_(Love_song_from_"Neighbours")> .
## <id_v9bcjt_ice_fraki6>
<Blanchester,_Ohio> <hasWebsite> <http://www.blanchester.com/> .
## <id_10tunwc_p3m_zkjp59>
<Hub_(bassi~st)> <hasGender> <ma??le> .
Output Data:
<Eckhard_Christian> <hasGender> <male> <id_1d3s2ia_p3m_zkjp59>.
<Peter_Pinne> <created> <In_Your_Arms_(Love_song_from_"Neighbours")> <id_1jmz109_1gi_t71dyx>.
<Blanchester,_Ohio> <hasWebsite> <http://www.blanchester.com/> <id_v9bcjt_ice_fraki6>.
<Hub_(bassist)> <hasGender> <male> <id_10tunwc_p3m_zkjp59>.
That is in the output data I want to delete all other characters except: alphanumeric and :,/,/,.,_,(,) between any two starting and ending < >. I know python allows me to split using string.split() but if I split using < > as demarkers then for <ma<>le> I get (<ma,<>,le>).
Is there some other way by which I may split in python so that I may get the data in the desired form. Also I want preceding lines < > (following # #) to appear as the last column.

Assuming that there is always a whitespace character before/after the "proper" < and >, you could try something like this, using regular expressions:
import re
with open('data') as data:
for line in data:
if line.startswith('##'):
id_ = re.search('\s(<.*>)', line).group(1)
fields = re.findall('(<.*?>)\s', next(data))
fields = ['<' + re.sub(r'[^\w:/._()"]', '', f) + '>' for f in fields]
print fields + [id_]
Output:
['<Eckhard_Christian>', '<hasGender>', '<male>', '<id_1d3s2ia_p3m_zkjp59>']
['<Peter_Pinne>', '<created>', '<In_Your_Arms_(Love_song_from_"Neighbours")>', '<id_1jmz109_1gi_t71dyx>']
['<Blanchester_Ohio>', '<hasWebsite>', '<http://www.blanchester.com/>', '<id_v9bcjt_ice_fraki6>']
['<Hub_(bassist)>', '<hasGender>', '<male>', '<id_10tunwc_p3m_zkjp59>']

python regex unicode - extracting data from an utf-8 file

I am facing difficulties for extracting data from an UTF-8 file that contains chinese characters.
The file is actually the CEDICT (chinese-english dictionary) and looks like this :
賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/
Until now, I manage to get the first two fields using split() but I can't find out how I should proceed to extract the two other fields (let's say for the second line "bin1 zhu3" and "host and guest". I have been trying to use regex but it doesn't work for a reason I ignore.
#!/bin/python
#coding=utf-8
import re
class REMatcher(object):
def __init__(self, matchstring):
self.matchstring = matchstring
def match(self,regexp):
self.rematch = re.match(regexp, self.matchstring)
return bool(self.rematch)
def group(self,i):
return self.rematch.group(i)
def look(character):
myFile = open("/home/quentin/cedict_ts.u8","r")
for line in myFile:
line = line.rstrip()
elements = line.split(" ")
try:
if line != "" and elements[1] == character:
myFile.close()
return line
except:
myFile.close()
break
myFile.close()
return "Aucun résultat :("
translation = look("賓主") # translation contains one line of the file
elements = translation.split()
traditionnal = elements[0]
simplified = elements[1]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
m = REMatcher(translation)
tr = ""
if m.match(r"\[(\w+)\]"):
tr = m.group(1)
print "Pronouciation:" + tr
Any help appreciated.

This builds a dictionary to look up translations by either simplified or traditional characters and works in both Python 2.7 and 3.3:
# coding: utf8
import re
import codecs
# Process the whole file decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8',encoding='utf8') as datafile:
D = {}
for line in datafile:
# Skip comment lines
if line.startswith('#'):
continue
trad,simp,pinyin,trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/',line).groups()
D[trad] = (simp,pinyin,trans)
D[simp] = (trad,pinyin,trans)
Output (Python 3.3):
>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')
>>> D['一路顺风']
('一路順風', 'yi1 lu4 shun4 feng1', 'to have a pleasant journey (idiom)')
>>> D['馬克']
('马克', 'Ma3 ke4', 'Mark (name)')
Output (Python 2.7, you have to print strings to see non-ASCII characters):
>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克

I would continue to use splits instead of regular expressions, with the maximum split number given. It depends on how consistent the format of the input file is.
elements = translation.split(' ',2)
traditionnal = elements[0]
simplified = elements[1]
rest = elements[2]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
elems = rest.split(']')
tr = elems[0].strip('[')
print "Pronouciation:" + tr
Output:
Traditionnal:賓主
Simplified:宾主
Pronouciation:bin1 zhu3
EDIT: To split the last field into a list, split on the /:
translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash,
#then split on the slashes
Output (for the first line of input):
['visitor', 'guest', 'object (in grammar)']

Heh, I've done this exact same thing before. Basically you just need to use regex with groupings. Unfortunately, I don't know python regex super well (I did the same thing using C#), but you should really do something like this:
matcher = "(\b\w+\b) (\b\w+\b) \[(\.*?)\] /(.*?)/"
basically you match the entire line using one expression, but then you use ( ) to separate each item into a regex-group. Then you just need to read the groups and voila!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract Message-ID from a file - python

s = "We want <This text inside only>. yes we do." s2 = s[s.find("<")+1:s.find(">")] print(s2) # Prints : This text inside only # If you want to remove empty lines : lines = filter(lambda x: x.strip(), lines) filter goes through the whole lines, no need for a for loop that way.

One suggestion for you: import re f = open('PATH/TO/FILE', 'r').read() msgID = re.findall(r'(?<=<).*?(?=>)', f)

Related

Find first line of text according to value in Python

Regex Python find everything between four characters

Replace "*" (asterics) in HTML file with increasing number with python

Splitting strings in python in user defined manner

python regex unicode - extracting data from an utf-8 file

Categories

Resources