How can text be used for multiple regex matches (Python)? - python

I am trying to read a file that has the following structure.
Question 1 What is the weather today? Answer 1 It is hot and sunny. Question 2 What day is it today? Answer 2 Thursday Question 3 How many legs does a dog have? Answer 3 Four legs
I want to put the content in a dictionary with questions and answers, so something like this:
dict = {
"What is the weather today?": "It is hot and sunny.",
"What day is it today?": "Thursday",
"How many legs does a dog have?": "Four legs"
}
To find the questions and answers in the text, I created this regular expression:
\s?(Question|Answer)\s\d+\s?(.*)\s?(Question|Answer)\s\d+\s?
I also tested the regex with the example in an online demo. It finds one big match instead of multiple smaller matches. I assume the Question and Answer markers need to take part in two matches, because Question 2, for example, marks both the end of the Answer 1 match and the start of the Question 2 match. How can I extract the questions and the answers themselves correctly, so that I can put them in a dictionary (including the last answer, after which no new 'Question X' follows), as shown in the example dictionary?

If each question is always followed by an answer, you don't need the alternation |; you can match Question first and then Answer:
\bQuestion\s+\d+\s+(\S.*?)\s+Answer\s+\d+\s+(\S.*?)\s*(?=Question|$)
\bQuestion\s+\d+\s+ Match Question followed by 1+ digits between whitespace chars
(\S.*?) Capture group 1, match at least a single non whitespace char
\s+Answer\s+\d+\s+ Match Answer followed by 1+ digits between whitespace chars
(\S.*?) Capture group 2, match at least a single non whitespace char
\s*(?=Question|$) Match optional whitespace char asserting either another question to the right or the end of the string in case of the last question
Then you could for example use re.findall to get the group 1 and group 2 values and fill a dictionary.
import re
qa = {}  # named qa rather than dict to avoid shadowing the built-in type
regex = r"\bQuestion\s+\d+\s+(\S.*?)\s+Answer\s+\d+\s+(\S.*?)\s*(?=Question|$)"
s = "Question 1 What is the weather today? Answer 1 It is hot and sunny. Question 2 What day is it today? Answer 2 Thursday Question 3 How many legs does a dog have? Answer 3 Four legs"
# findall returns one (question, answer) tuple per match
for m in re.findall(regex, s):
    qa[m[0]] = m[1]
print(qa)
Output
{'What is the weather today?': 'It is hot and sunny.', 'What day is it today?': 'Thursday', 'How many legs does a dog have?': 'Four legs'}
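Because the pattern has exactly two capture groups, re.findall returns (question, answer) tuples, and dict() accepts such pairs directly. The following is a minimal sketch of that shortcut, assuming the same "Question N ... Answer N ..." layout. The names QA_PATTERN and parse_qa, and the re.DOTALL flag (so that an entry may span line breaks), are additions for illustration and not part of the answer above.

```python
import re

# Same pattern as in the answer. DOTALL lets "." cross newlines, which is an
# assumption for multi-line files and not part of the original answer.
QA_PATTERN = re.compile(
    r"\bQuestion\s+\d+\s+(\S.*?)\s+Answer\s+\d+\s+(\S.*?)\s*(?=Question|$)",
    re.DOTALL,
)

def parse_qa(text):
    """Build a {question: answer} dict from the (group 1, group 2) tuples."""
    return dict(QA_PATTERN.findall(text))

sample = (
    "Question 1 What is the weather today? Answer 1 It is hot and sunny. "
    "Question 2 What day is it today? Answer 2 Thursday "
    "Question 3 How many legs does a dog have? Answer 3 Four legs"
)
qa = parse_qa(sample)
print(qa)
```

Note that an answer containing the literal word "Question" would be cut short by the lookahead, so this only fits input where that word is reserved for the markers.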

Related

How to slice a string input at a certain unknown index

A string is given as an input (e.g. "What is your name?"). The input always contains a question which I want to extract. But the problem that I am trying to solve is that the input is always with unneeded input.
So the input could be (but not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn" 2- "What is your\nlastname and email?\ndasf?lkjas" 3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that at the third input, the question starts with the word "Given" and end with "yourself?")
The above input examples are generated by the pytesseract OCR library of scanning an image and converting it into text
I only want to extract the question from the garbage input and nothing else.
I tried to use the string method find('?', 1) to get the index of the last character of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and take the first \n I encounter, but the question doesn't always have a \n before its first letter.
def extractQuestion(q):
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0  # TODO
    question = '\n '.join(q[index_first_letter_of_q:index_end_q])
A way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words I suppose). A way to do that would be using pyenchant:
#!/usr/bin/env python
import enchant
GLOSSARY = enchant.Dict("en_US")
def isWord(word):
    return True if GLOSSARY.check(word) else False
sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
for sentence in sentences:
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break
The above piece of code gives as a result:
index: 3 => What
index: 0 => What
index: 0 => Given
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
then a sequence of non-questionmark-characters [^?]+,
followed by a literal question mark \?.
This can still produce some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
...          "What is your\nlastname and email?\ndasf?lkjas",
...          "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
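OCR output may contain no question at all, in which case re.search returns None and calling .group() on it raises an AttributeError. A small wrapper can guard against that. This is only a sketch around the pattern from the answer; the names QUESTION_RE and extract_question are hypothetical.

```python
import re

# Pattern from the answer: a word starting with an upper case letter followed
# by a lower case one, then everything up to and including the first "?".
QUESTION_RE = re.compile(r"\b[A-Z][a-z][^?]+\?")

def extract_question(text):
    """Return the first question-like span, or None if nothing matches."""
    match = QUESTION_RE.search(text)
    return match.group() if match else None

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
print(extract_question("only garbage, no capitalized question"))
```

The first call prints the question and the second prints None, so the caller can decide how to handle scans without a recognizable question.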

How to write regular expression for all text after ":" [duplicate]

I need to filter the text and select only a few terms from it.
For example, I have sample text:
ID: a9000006
NSF Org : DMI
Total Amt. : $225024
Abstract :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,
chemical stability and low viscosity
token = re.compile('a90[0-9][0-9][0-9][0-9][0-9]| [$][\d]+ |')
re.findall(token, filetext)
I get 'a9000006','$225024', but I do not know how to write a regex for the three upper case letters right after "NSF Org:" (which is "DMI") and for all the text after "Abstract:".
If you want to create a single regex which will match each of those 4 fields with explicit checks on each, then use this regex: :\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)
>>> token = re.compile(r':\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)', re.DOTALL) # flag needed
>>> re.findall(token, filetext)
['a9000006', 'DMI', '$225024', 'This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization, \n chemical stability and low viscosity']
However, since you're searching for all of them at the same time, it would be better to use a pattern that matches all 4 fields together and generically, such as the one in another answer.
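If only the two fields the question asks about are needed, one targeted search per field may be easier to read than a single combined pattern. This is a sketch under two assumptions: the labels are spelled exactly "NSF Org" and "Abstract", and Abstract is the last field in the text. The function name extract_fields and the dictionary keys are made up for illustration.

```python
import re

filetext = """ID: a9000006
NSF Org : DMI
Total Amt. : $225024
Abstract :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,
chemical stability and low viscosity"""

def extract_fields(text):
    """Pull the three-letter NSF Org code and the abstract out of the text."""
    # Three upper case letters after the "NSF Org" label and its colon.
    org = re.search(r"NSF Org\s*:\s*([A-Z]{3})\b", text)
    # DOTALL lets "." cross line breaks, so the abstract runs to the end.
    abstract = re.search(r"Abstract\s*:\s*(.*)", text, re.DOTALL)
    return {
        "nsf_org": org.group(1) if org else None,
        # Collapse line breaks and repeated spaces into single spaces.
        "abstract": " ".join(abstract.group(1).split()) if abstract else None,
    }

fields = extract_fields(filetext)
print(fields["nsf_org"])
print(fields["abstract"])
```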
This must do the job.
: .*

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I have been testing this in an online regex tester. I am trying to get all words between 3 and 5 characters long but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter followed by any 2 to 4 characters and then a literal dot. You can tighten that up if required: for example, [A-Z][a-z]{2,4}\. matches an upper case character followed by 2 to 4 lowercase characters and then a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
{'Rosse.', 'King.'}
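A set discards the order in which the names appear. If order matters, dict.fromkeys can de-duplicate while keeping first-seen order. The sketch below combines that with the tightened pattern mentioned above; the leading \b, the name NAME_RE, and the abridged sample text are assumptions added for illustration.

```python
import re

# Abridged sample from the question.
text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
King. No more that Thane of Cawdor shall deceiue
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne."""

# \b keeps the match from starting in the middle of a longer word.
NAME_RE = re.compile(r"\b[A-Z][a-z]{2,4}\.")

names = NAME_RE.findall(text)
unique_names = list(dict.fromkeys(names))  # de-duplicate, keep first-seen order
print(names)
print(unique_names)
```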
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # capitalized, ends with '.', and 3 to 5 characters before the dot
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Trovato: " + result.group())
But what I need is to obtain the first 2 word before and after the match. For example if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I am looking for. So after I have matched it with my regex, I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re
line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        # split off everything up to the next occurrence; the rest stays in line
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                # the previous occurrence itself counts as preceding words
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
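The same result can also be reached with re.finditer, which reports the position of every occurrence, so the words before and after can be sliced from the untouched original string. This is a minimal sketch rather than a replacement for the answers above: the function name context_words, the parameter n (assumed to be at least 1), and the \b word boundaries (which stop "here is" from matching inside "there is") are assumptions of mine.

```python
import re

def context_words(text, phrase, n=2):
    """Return a (before, after) pair of word lists for each occurrence of phrase."""
    # re.escape treats the phrase literally; \b avoids hits inside longer words.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    results = []
    for m in pattern.finditer(text):
        before = text[:m.start()].split()[-n:]  # up to n words before the match
        after = text[m.end():].split()[:n]      # up to n words after the match
        results.append((before, after))
    return results

print(context_words("Parking here is horrible, this shop sucks.", "here is"))
```

Slicing the whole prefix on every match is simple but costs time proportional to the text length per occurrence, which may matter for very long descriptions with many matches.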

Regex, how to remove all non-alphanumeric except colon in a 12/24 hour timestamp?

I have a string like:
Today, 3:30pm - Group Meeting to discuss "big idea"
How do you construct a regex such that after parsing it would return:
Today 3:30pm Group Meeting to discuss big idea
I would like it to remove all non-alphanumeric characters except for those that appear in a 12 or 24 hour time stamp.
# this: D:DD, DD:DDam/pm 12/24 hr
pattern = r':(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)'
A colon must be preceded by at least one digit and followed by at least two digits: then it's a time. All other colons will be considered textual colons.
How it works
: // match a colon
(?=.. // look ahead over the next two chars without consuming them
(?<! // start a negative look-behind group (if it matches, the whole fails)
\d:\d\d // time stamp
) // end neg. look behind
) // end look-ahead
| // or
[^a-zA-Z0-9 ] // match anything that is not a letter, a digit, or a space
(?<!:) // that isn't a colon
Then when applied to this silly text:
Today, 3:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good
...changes it into:
Today 3:30pm  Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 16:47 is also good
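In Python, the pattern can be applied with re.sub. The sketch below wraps it in a helper and also collapses the runs of spaces left behind where punctuation such as " - " was removed. The names TIME_SAFE_RE and strip_punctuation_keep_times are hypothetical, and the whitespace collapsing is an addition to the answer. One caveat of the pattern as written: a colon with fewer than two characters after it (for example at the very end of the string) is left in place.

```python
import re

# Pattern from the answer above: remove a colon unless it sits inside
# digit:digit-digit, and remove every other non-alphanumeric, non-space char.
TIME_SAFE_RE = re.compile(r":(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)")

def strip_punctuation_keep_times(text):
    """Drop punctuation but keep colons that belong to a time stamp."""
    cleaned = TIME_SAFE_RE.sub("", text)
    return " ".join(cleaned.split())  # collapse repeated spaces

silly = ('Today, 3:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 '
         '"big idea" on 03:33pm or 16:47 is also good')
print(strip_punctuation_keep_times(silly))
```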
Python.
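import string
import time
punct = string.punctuation
s = 'Today, 3:30pm - Group Meeting:am to discuss "big idea" by our madam'
for item in s.split():
    try:
        # tokens that parse as a time such as 3:30pm are kept unchanged
        time.strptime(item, "%H:%M%p")
    except ValueError:
        # everything else has its punctuation stripped
        item = ''.join(c for c in item if c not in punct)
    print(item, end=' ')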
output
$ ./python.py
Today 3:30pm Group Meetingam to discuss big idea by our madam
# change to s='Today, 15:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good'
$ ./python.py
Today 15:30pm Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 1647 is also good
NB: Method should be improved to check for valid time only when necessary(by imposing conditions) , but i will leave it as that for now.
I assume you'd like to keep spaces as well. This implementation is in Python, but the pattern uses only basic regex syntax, so it should be portable.
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
re.sub(r'[^a-zA-Z0-9: ]', '', x)
Output: 'Today 3:30pm  Group Meeting to discuss big idea'
For a slightly cleaner result (no double spaces):
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
tmp = re.sub(r'[^a-zA-Z0-9: ]', '', x)
re.sub(r'[ ]+', ' ', tmp)
Output: 'Today 3:30pm Group Meeting to discuss big idea'
You can try, in Javascript:
var re = /(\W+(?!\d{2}[ap]m))/gi;
var input = 'Today, 3:30pm - Group Meeting to discuss "big idea"';
alert(input.replace(re, " "))
A correct regexp for that would be:
'(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]'
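As a rough check, this lookaround-based pattern can be run through re.sub. It removes a colon when it is not preceded by a digit or not followed by two digits, and removes every other character that is not a letter, digit, space, or colon. The names STRICT_RE and clean are hypothetical, and collapsing the leftover double spaces is an extra step not stated in the answer.

```python
import re

STRICT_RE = re.compile(r"(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]")

def clean(text):
    """Remove punctuation except colons that look like part of a time."""
    return " ".join(STRICT_RE.sub("", text).split())

print(clean('Today, 3:30pm - Group Meeting to discuss "big idea"'))
print(clean("Note: call at 16:47, ok: 2:3"))
```

In the second call the colons after "Note" and "ok" are removed because no digit precedes them, and the colon in "2:3" is removed because only one digit follows it.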
s="Call me, my dear, at 3:30"
re.sub(r'[^\w :]','',s)
'Call me my dear at 3:30'
