Check if a string qualifies before extracting text from it - Regex - Python

I am new to regex. I don't just have one kind of string to extract the desired text from; I also need to allow other types of strings that don't match the regex to pass through to my other functions.
Here is what I'm trying to achieve:
I am running through device names from a CSV file. If a row has only a DeviceName, without the pattern mentioned below, the function should simply return it as-is and let the other functions take care of it.
The string where I want to use regex looks like
'Card ADFGTR43567 on "DeviceName"' where I want to extract only the DeviceName from it.
ADFGTR43567 is a serial number, 11 characters long, consisting of letters and digits in no definite positions.
Here DeviceName could be anything, for example ARIEFRETO002, ARIERDTOT5968.na.abc.com, or even just a plain MAC address like 1234.abcd.5678.
So whenever a string has the pattern 'Card serialnumber on DeviceName' (even without quotes around the device name), I want to extract DeviceName and run it against the other functions in my code. If the device name in my CSV comes without such a pattern, I still want it passed along to the other functions.
I have written code with my functions, but I haven't been able to work regex in. This is what I have tried so far (only pasting the necessary parts):
def validnames():
    idx = col[0].find('-')
    if idx > -1:
        user = col[0][idx + 1:idx + 4]
        if user.upper() in d:
            return col[0].split('.')[0]
        else:
            return 'Not Found'
    else:
        return 'Not Found'

def pingable():
    response = subprocess.Popen(['ping.exe', validnames()], stdout=subprocess.PIPE).communicate()[0]
    response = response.decode()
    if 'bytes=32' in response:
        status = 'Up'
        return status
    else:
        status = 'Down'
        return status

with open("Book2.csv", 'r') as lookuplist:
    for col in csv.reader(lookuplist):
        if validnames() == 'Not Found': continue
        if pingable() == 'Down': continue
        if validnames().lower() not in data:
            with open('Test.csv', 'a', newline='') as csvoutput:
                output = csv.writer(csvoutput)
                output.writerows([[validnames()] + [pingable()]])
                print("Device: %s" % validnames(), pingable())
validnames() checks whether a device is eligible for the ping operation (based on the condition). I was thinking of putting the regex in that function, but I got lost there completely. Maybe it belongs in another function, but I can't quite see how to do it with regex.
UPDATE: This is how I have integrated the two functions, based on the accepted answer.
def regexfilter():
    try:
        rx = r'\bon\s+(\S+)'
        m = re.search(rx, col[0])
        if m:
            return m.group(1)
        else:
            return col[0]
    except:
        return col[0]

def validnames():
    idx = regexfilter().find('-')
    if idx > -1:
        user = regexfilter()[idx + 1:idx + 4]
        if user.upper() in d:
            return regexfilter().split('.')[0]
        else:
            return 'Not Found'
    else:
        return 'Not Found'
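For what it's worth, the two functions can also be merged so the regex runs only once per row instead of on every call. A minimal sketch (the merged helper name is hypothetical; col[0] and the lookup d come from the surrounding code):
import re

def extract_device(raw, d):
    # Pull DeviceName out of 'Card <serial> on DeviceName'; fall back to the raw value.
    m = re.search(r'\bon\s+(\S+)', raw)
    name = m.group(1) if m else raw
    # Apply the original eligibility check to the extracted name.
    idx = name.find('-')
    if idx > -1 and name[idx + 1:idx + 4].upper() in d:
        return name.split('.')[0]
    return 'Not Found'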

Since you want to match any non-whitespace chunk of text that follows the word on, you may use the following regex:
\bon\s+(\S+)
Details
\b - a word boundary
on - the word on
\s+ - 1+ whitespaces
(\S+) - Capturing group 1: one or more non-whitespace chars.
See Python demo:
import re
rx = r'\bon\s+(\S+)'
s = "Card ADFGTR43567 on DeviceName"
m = re.search(rx, s)
if m:
    print(m.group(1))  # => DeviceName
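Since rows without the pattern should fall through unchanged, the same search can be wrapped in a small fallback helper (the name here is hypothetical):
import re

def device_or_original(s):
    # Return the captured DeviceName if the 'Card ... on ...' pattern is present,
    # otherwise hand the string back untouched.
    m = re.search(r'\bon\s+(\S+)', s)
    return m.group(1) if m else s

print(device_or_original('Card ADFGTR43567 on ARIEFRETO002'))  # => ARIEFRETO002
print(device_or_original('1234.abcd.5678'))                    # => 1234.abcd.5678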

Related

Extract Message-ID from a file

I have the following code that extracts the Message-IDs and gathers them in a DataFrame. It works and gives me the following results.
This is an example of the lines in the DataFrame:
Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>
What I want is only the string after the < character and before the >, because the Message-ID ends with >. Also, I have some lines where the Message-ID value is empty; I want to delete those lines.
Here is the code that I wrote:
import pandas as pd
import numpy as np

f = open('C:\\Users\\hmk\\Desktop\\PFE 2019\\ML\\MachineLearningPhishing-master\\MachineLearningPhishing-master\\code\\resources\\emails-enron.mbox', 'r')
line_num = 0
e = []
search_phrase = "Message-ID"
for line in f.readlines():
    line_num += 1
    if line.find(search_phrase) >= 0:
        #line = line[13:]
        #line = line[:-2]
        e.append(line)
f.close()
dfObj = pd.DataFrame(e)
One way to do it is using regex and pandas DataFrame replace:
clean_df = df.replace(to_replace='\<|\>', value='', regex=True)
clean_df = clean_df.replace(to_replace='(Message-ID:\s*$)', value=np.nan, regex=True).dropna()
The first line removes the < and >, assuming the messages will only contain those two.
The second checks whether there is a Message-ID in the body; if not, it replaces the cell with NaN.
Note that I used numpy.nan just to simplify the process of dropping the blank messages.
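To illustrate, a minimal end-to-end sketch with a toy one-column DataFrame standing in for your dfObj:
import numpy as np
import pandas as pd

df = pd.DataFrame(['Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>',
                   'Message-ID: '])
clean_df = df.replace(to_replace=r'\<|\>', value='', regex=True)
clean_df = clean_df.replace(to_replace=r'(Message-ID:\s*$)', value=np.nan, regex=True).dropna()
print(clean_df)  # only the row with an actual ID survives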
You can use a regex which will extract the desired Message-ID for you.
So the first part, extracting the message id, would look like this:
import re # import regex
s = 'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>'
message_id = re.search(r'Message-ID: <(.*?)>', s).group(1)
print('message_id: ', message_id)
Your ideal Message ID:
>>> message_id:  23272646.1075847145300.JavaMail.evans#thyme
So you can loop through your data and check for the regex like this:
for line in f.readlines():
    line_num += 1
    message_id = re.search(r'Message-ID: <(.*?)>', line)
    if message_id:
        msg_id_string = message_id.group(1)
        e.append(line)
        # your other work
The if message_id: check works because re.search returns None when there is no match for your Message-ID, so the if body is skipped for lines that don't contain one.
You want a substring of each line:
for line in f.readlines():
    if all(word in line for word in [search_phrase, "<", ">"]):
        e.append(line[line.find("<")+1:-1])
        # -1 assumes ">" is the last character
Use in to check if a string is inside another string
Use find to get the index of your pattern
Use [in:out] to get substring between your two values
s = "We want <This text inside only>. yes we do."
s2 = s[s.find("<")+1:s.find(">")]
print(s2) # Prints : This text inside only
# If you want to remove empty lines :
lines = filter(lambda x: x.strip(), lines)
filter goes through all the lines, so there is no need for a for loop.
One suggestion for you:
import re
f = open('PATH/TO/FILE', 'r').read()
msgID = re.findall(r'(?<=<).*?(?=>)', f)
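Here (?<=<) and (?=>) are a lookbehind and a lookahead, so the angle brackets themselves are not part of the match. A quick check, reusing the sample line from above:
import re

f = 'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>'
msgID = re.findall(r'(?<=<).*?(?=>)', f)
print(msgID)  # => ['23272646.1075847145300.JavaMail.evans#thyme']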

Regex Python find everything between four characters

I have a string that holds data, and I want everything in between ({ and }).
"({Simple Data})"
Should return "Simple Data"
With a regex:
s = '({Simple Data})'
print(re.search('\({([^})]+)', s).group(1))
Output:
'Simple Data'
You could try the following:
^\({(.*)}\)$
Group 1 will contain Simple Data.
If the brackets are always positioned at the beginning and the end of the string, then you can do this:
l = "({Simple Data})"
print(l[2:-2])
Which results in:
"Simple Data"
In Python you can access single characters via the [] operator. With a slice you can take the sequence of characters starting at the third one (index 2) up to, but not including, the second-to-last (index -2).
You could try this regex (?s)\(\{(.*?)\}\)
which simply captures the contents between the delimiters.
Beware though, this doesn't account for nesting.
If nesting is a concern, the best you can do with the standard Python re engine
is to get the inner nest only, using this regex:
\(\{((?:(?!\(\{|\}\)).)*)\}\)
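For example, against a nested input this pattern pulls out only the innermost pair:
import re

s = '({ Parent Data Prefix ({Nested Data (Yes)}) Parent Data Suffix })'
m = re.search(r'(?s)\(\{((?:(?!\(\{|\}\)).)*)\}\)', s)
if m:
    print(m.group(1))  # => Nested Data (Yes)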
I designed a tokenizer aimed at nested data, which the OP may want to check out.
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('DATA', r'[ \t]*[\w]+[\w \t]*'),
        ('SKIP', r'[ \t\f\v]+'),
        ('NEWLINE', r'\n|\r\n'),
        ('BOUND_L', r'\(\{'),
        ('BOUND_R', r'\}\)'),
        ('MISMATCH', r'.'),
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    lines = code.splitlines()
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        else:
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

statements = '''
({Simple Data})
({
Parent Data Prefix
({Nested Data (Yes)})
Parent Data Suffix
})
'''

queue = collections.deque()
for token in tokenize(statements):
    if token.typ == 'DATA' or token.typ == 'MISMATCH':
        queue.append(token.value)
    elif token.typ == 'BOUND_L' or token.typ == 'BOUND_R':
        print(''.join(queue))
        queue.clear()
Output of this code should be:
Simple Data
Parent Data Prefix
Nested Data (Yes)
Parent Data Suffix

Python parse text from multiple txt file

Seeking advice on how to mine items from multiple text files to build a dictionary.
This text file: https://pastebin.com/Npcp3HCM
Was manually transformed into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view
There are thousands of such text files and they may have different section headings as shown in these examples:
https://pastebin.com/wWSPGaLX
https://pastebin.com/9Up4RWHu
I started off by reading the files:
from glob import glob

txtPth = '../tr-txt/*.txt'
txtFiles = glob(txtPth)

with open(txtFiles[0], 'r') as tf:
    allLines = [line.rstrip() for line in tf]

sectionHeading = ['Corporate Participants',
                  'Conference Call Participants',
                  'Presentation',
                  'Questions and Answers']

for lineNum, line in enumerate(allLines):
    if line in sectionHeading:
        print(lineNum, allLines[lineNum])
My idea was to look for the line numbers where section headings existed and try to extract the content in between those line numbers, then strip out separators like dashes. That didn't work, and I got stuck trying to create a dictionary of the following kind so that I can later run various natural language processing algorithms on the extracted items.
{file-name1:{
{date-time:[string]},
{corporate-name:[string]},
{corporate-participants:[name1,name2,name3]},
{call-participants:[name4,name5]},
{section-headings:{
{heading1:[
{name1:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name3:[speechOrderNum, text-content]}],
{heading2:[
{name1:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name3:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name1:[speechOrderNum, text-content]},
{name4:[speechOrderNum, text-content]}],
{heading3:[text-content]},
{heading4:[text-content]}
}
}
}
The challenge is that different files may have different headings and numbers of headings. But there will always be a section called "Presentation" and very likely a "Question and Answer" section. These section headings are always separated by a string of equals signs, and the content of different speakers is always separated by a string of dashes. The "speech order" for the Q&A section is indicated with a number in square brackets. The participants are always indicated at the beginning of the document with an asterisk before their name, and their title is always on the next line.
Any suggestion on how to parse the text files is appreciated. The ideal help would be to provide guidance on how to produce such a dictionary (or other suitable data structure) for each file that can then be written to a database.
Thanks
--EDIT--
One of the files looks like this: https://pastebin.com/MSvmHb2e
In which the "Question & Answer" section is mislabeled as "Presentation" and there is no other "Question & Answer" section.
And final sample text: https://pastebin.com/jr9WfpV8
The comments in the code should explain everything. Let me know if anything is under-specified and needs more comments.
In short, I leverage regex to find the '=' delimiter lines to subdivide the entire text into subsections, then handle each type of section separately for clarity's sake (so you can tell how I am handling each case).
Side note: I am using the words 'attendee' and 'author' interchangeably.
EDIT: Updated the code to sort based on the '[x]' pattern found right next to the attendee/author in the presentation/QA section. Also changed the pretty-print part, since pprint does not handle OrderedDict very well.
To strip whitespace, including \n, from both ends of a string, simply do str.strip(). If you specifically need to strip only \n, then just do str.strip('\n').
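A quick illustration of the difference between the two:
s = '  \n  some text  \n'
print(repr(s.strip()))      # => 'some text'
print(repr(s.strip('\n')))  # => '  \n  some text  '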
I have modified the code to strip any whitespace in the talks.
import json
import re
from collections import OrderedDict
from pprint import pprint

# Subdivides a collection of lines based on the delimiting regular expression.
# >>> example_string =' =============================
# asdfasdfasdf
# sdfasdfdfsdfsdf
# =============================
# asdfsdfasdfasd
# =============================
# >>> subdivide(example_string, "^=+")
# >>> ['asdfasdfasdf\nsdfasdfdfsdfsdf\n', 'asdfsdfasdfasd\n']
def subdivide(lines, regex):
    equ_pattern = re.compile(regex, re.MULTILINE)
    sections = equ_pattern.split(lines)
    sections = [section.strip('\n') for section in sections]
    return sections

# For processing sections with dashes in them; returns the heading of the section along with
# a dictionary where each key is the subsection's header, and each value is the text in the subsection.
def process_dashed_sections(section):
    subsections = subdivide(section, "^-+")
    heading = subsections[0]  # header of the section.
    d = {key: value for key, value in zip(subsections[1::2], subsections[2::2])}
    index_pattern = re.compile(r"\[(.+)\]", re.MULTILINE)

    # Sort the dictionary by first capturing the pattern '[x]' and extracting the 'x' number.
    # This is passed to 'sorted' as the key function to sort based on 'x'.
    def cmp(d):
        mat = index_pattern.findall(d[0])
        if mat:
            print(mat[0])
            return int(mat[0])
        # There are issues when dealing with subsections containing '-'s but not
        # containing the '[x]' pattern. This is just to deal with that small issue.
        else:
            return 0

    o_d = OrderedDict(sorted(d.items(), key=cmp))
    return heading, o_d

# This renames the keys of the 'd' dictionary to the proper names present in the attendees.
# It searches for the best match for the key in the 'attendees' list, and replaces the corresponding key.
# >>> d = {'mr. man ceo of company [1]' : ' This is talk a' ,
# ...      'ms. woman ceo of company [2]' : ' This is talk b'}
# >>> l = ['mr. man', 'ms. woman']
# >>> new_d = assign_attendee(d, l)
# new_d = {'mr. man': 'This is talk a', 'ms. woman': 'This is talk b'}
def assign_attendee(d, attendees):
    new_d = OrderedDict()
    for key, value in d.items():
        a = [a for a in attendees if a in key]
        if len(a) == 1:
            # Strip out any additional whitespace anywhere in the text, including '\n'.
            new_d[a[0]] = value.strip()
        elif len(a) == 0:
            # Strip out any additional whitespace anywhere in the text, including '\n'.
            new_d[key] = value.strip()
    return new_d

if __name__ == '__main__':
    with open('input.txt', 'r') as input:
        lines = input.read()

    # Regex pattern for matching headers of each section.
    header_pattern = re.compile("^.*[^\n]", re.MULTILINE)
    # Regex pattern for matching the sections that contain
    # the list of attendees (those that start with asterisks).
    ppl_pattern = re.compile(r"^(\s+\*)(.+)(\s.*)", re.MULTILINE)
    # Regex pattern for matching sections with subsections in them.
    dash_pattern = re.compile("^-+", re.MULTILINE)

    ppl_d = dict()
    talks_d = dict()

    # Step 1. Divide the entire document into sections using the '=' divider.
    sections = subdivide(lines, "^=+")
    header = []
    print(sections)

    # Step 2. Handle each section like a switch case.
    for section in sections:
        # Handle headers.
        if len(section.split('\n')) == 1:  # likely to match only a header
            header = header_pattern.match(section).string
        # Handle attendees/authors.
        elif ppl_pattern.match(section):
            ppls = ppl_pattern.findall(section)
            d = {key.strip(): value.strip() for (_, key, value) in ppls}
            ppl_d.update(d)
            # Assuming that if the previous section was detected as a header,
            # then this section will relate to that header.
            if header:
                talks_d.update({header: ppl_d})
        # Handle subsections.
        elif dash_pattern.findall(section):
            heading, d = process_dashed_sections(section)
            talks_d.update({heading: d})
        # Else it's just some random text.
        else:
            # Assuming that if the previous section was detected as a header,
            # then this section will relate to that header.
            if header:
                talks_d.update({header: section})

    #pprint(talks_d)
    # Assign the talk material to the appropriate attendee/author. Still works if no match is found.
    for key, value in talks_d.items():
        talks_d[key] = assign_attendee(value, ppl_d.keys())

    # OrderedDict does not pretty print using 'pprint', so a small hack to make use of json output to pretty print.
    print(json.dumps(talks_d, indent=4))
Could you please confirm whether you only require the "Presentation" and "Question and Answer" sections?
Also, regarding the output, is it OK to dump a CSV format similar to what you have "manually transformed"?
Updated solution to work for every sample file you provided.
Output covers cells D:H as per the "Parsed-transcript" file shared.
#state = ["other", "head", "present", "qa", "speaker", "data"]
# codes : 0, 1, 2, 3, 4, 5
def writecell(out, data):
out.write(data)
out.write(",")
def readfile(fname, outname):
initstate = 0
f = open(fname, "r")
out = open(outname, "w")
head = ""
head_written = 0
quotes = 0
had_speaker = 0
for line in f:
line = line.strip()
if not line: continue
if initstate in [0,5] and not any([s for s in line if "=" != s]):
if initstate == 5:
out.write('"')
quotes = 0
out.write("\n")
initstate = 1
elif initstate in [0,5] and not any([s for s in line if "-" != s]):
if initstate == 5:
out.write('"')
quotes = 0
out.write("\n")
initstate = 4
elif initstate == 1 and line == "Presentation":
initstate = 2
head = "Presentation"
head_written = 0
elif initstate == 1 and line == "Questions and Answers":
initstate = 3
head = "Questions and Answers"
head_written = 0
elif initstate == 1 and not any([s for s in line if "=" != s]):
initstate = 0
elif initstate in [2, 3] and not any([s for s in line if ("=" != s and "-" != s)]):
initstate = 4
elif initstate == 4 and '[' in line and ']' in line:
comma = line.find(',')
speech_st = line.find('[')
speech_end = line.find(']')
if speech_st == -1:
initstate = 0
continue
if comma == -1:
firm = ""
speaker = line[:speech_st].strip()
else:
speaker = line[:comma].strip()
firm = line[comma+1:speech_st].strip()
head_written = 1
if head_written:
writecell(out, head)
head_written = 0
order = line[speech_st+1:speech_end]
writecell(out, speaker)
writecell(out, firm)
writecell(out, order)
had_speaker = 1
elif initstate == 4 and not any([s for s in line if ("=" != s and "-" != s)]):
if had_speaker:
initstate = 5
out.write('"')
quotes = 1
had_speaker = 0
elif initstate == 5:
line = line.replace('"', '""')
out.write(line)
elif initstate == 0:
continue
else:
continue
f.close()
if quotes:
out.write('"')
out.close()
readfile("Sample1.txt", "out1.csv")
readfile("Sample2.txt", "out2.csv")
readfile("Sample3.txt", "out3.csv")
Details
In this solution there is a state machine, which works as follows:
1. detects whether a heading is present; if yes, writes it
2. detects speakers after the heading is written
3. writes notes for that speaker
4. switches to the next speaker, and so on...
You can later process the csv files as you want.
You can also populate the data in any format you want once basic processing is done.
Edit:
Please replace the function "writecell"
def writecell(out, data):
    data = data.replace('"', '""')
    out.write('"')
    out.write(data)
    out.write('"')
    out.write(",")

regex number of repetitions from code

Can you use values from a script to tell a regex dynamically how to operate?
For example:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{n_rep}'
line_matches = re.findall(new_pattern, some_text)
I keep running into problems trying to get the grouping to work.
Explanation
I am attempting to find the most common number of repetitions of a regex pattern in a text file in order to find table type data within files.
I have the idea to make a regex such as this:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
line_matches = np.array([re.findall(base_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
# Find where the text has similar number of words/data in each line
where_same_pattern = np.where(np.diff([len(x) for x in line_matches]) == 0)
line_matches_where_same = line_matches[where_same_pattern]
# Extract out just the lines which have data
interesting_lines = np.array([x for x in line_matches_where_same if x != []])
# Find how many words in each line of interest
len_of_lines = [len(l) for l in interesting_lines]
# Use the most prevalent as the most likely number of columns of data
n_cols = Counter(len_of_lines).most_common()[0][0]
# Rerun the data through a regex to find the columns
new_pattern = base_pattern + '{n_cols}'
line_matches = np.array([re.findall(new_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
You need to use the value of the variable, not a string literal with the name of the variable, and convert it to a string before concatenating, e.g.:
new_pattern = base_pattern + '{' + str(n_cols) + '}'
Your pattern is just a string. So, all you need is to convert your number into a string. You can use format (for example, https://infohost.nmt.edu/tcc/help/pubs/python/web/new-str-format.html) to do that:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{{{0}}}'.format(n_rep)
print(new_pattern)  # '\s*(([\d.\w]+)[ \h]+){6}'
Note that the first two and the last two curly braces produce the literal braces in the new pattern, while {0} is replaced by the number n_rep.
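On Python 3.6+ an f-string reads a little more directly; doubled braces are literal and {n_rep} interpolates. A small sketch (note that \h is a PCRE escape which Python's re module does not support, so [ \t] is used here instead):
import random

base_pattern = r'\s*(([\d.\w]+)[ \t]+)'
n_rep = random.randint(1, 9)
new_pattern = rf'{base_pattern}{{{n_rep}}}'
print(new_pattern)  # e.g. '\s*(([\d.\w]+)[ \t]+){6}'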

Python Regex Match Integer After String

I need a regex in python to match and return the integer after the string "id": in a text file.
The text file contains the following:
{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p
I need to get the 807 after the "id", using a regular expression.
Is this what you mean?
#!/usr/bin/env python
import re

subject = '{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p'
match = re.search('"id":([^,]+)', subject)
if match:
    result = match.group(1)
else:
    result = "no result"
print(result)
The Output: 807
Edit:
In response to your comment, adding one simple way to ignore the first match. If you use this, remember to add something like "id":809 to your subject so that we can ignore 807 and find 809.
n = 1
for match in re.finditer('"id":([^,]+)', subject):
    if n == 1:
        print("ignoring the first match")
    else:
        print(match.group(1))
    n += 1
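The same skip-the-first logic reads a little more cleanly with enumerate; a sketch with a hypothetical subject containing two ids:
import re

subject = '... "id":807, ... "id":809, ...'
for n, match in enumerate(re.finditer('"id":([^,]+)', subject)):
    if n == 0:
        continue  # ignore the first match
    print(match.group(1))  # => 809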
Assuming that there is more to the file than that:
import json

with open('/path/to/file.txt') as f:
    data = json.loads(f.read())

print(data['results'][0]['id'])
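If results holds several records, the same lookup extends naturally; a small sketch continuing from the snippet above:
ids = [result['id'] for result in data['results']]
print(ids)  # e.g. [807, ...]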
If the file is not valid JSON, then you can get the value of id with:
from re import compile, IGNORECASE

r = compile(r'"id"\s*:\s*(\d+)', IGNORECASE)
with open('/path/to/file.txt') as f:
    for match in r.findall(f.read()):
        print(match)
