A string is given as input (e.g. "What is your name?"). The input always contains a question, which I want to extract. The problem I am trying to solve is that the question always arrives surrounded by unneeded text.
So the input could be (but not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn" 2- "What is your\nlastname and email?\ndasf?lkjas" 3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")
The above input examples are produced by the pytesseract OCR library, which scans an image and converts it to text.
I only want to extract the question from the garbage input and nothing else.
I tried to use the string method find('?', 1) to get the index of the end of the question (assuming for now that the first question mark always ends the question and is never part of the text I don't want). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and take the first \n spotted in the input, but the question doesn't always have a \n before its first letter.
def extractQuestion(q):
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0  # TODO
    question = q[index_first_letter_of_q:index_end_q + 1]
A way to find the question's first word would be to search for the first word that has an actual meaning (you're interested in English words, I suppose). One way to do that is with pyenchant:
#!/usr/bin/env python
import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    return GLOSSARY.check(word)

sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break
The above piece of code gives the following result:
index: 3 => What
index: 0 => What
index: 0 => Given
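To complete the extraction, one could combine that first-word index with the position of the question mark. A minimal sketch, assuming the first dictionary word starts the question and the first '?' after it ends it (extract_question is my own helper name):

import enchant

GLOSSARY = enchant.Dict("en_US")

def extract_question(text):
    words = text.split()
    # the first dictionary word is assumed to start the question
    start = next((i for i, w in enumerate(words) if GLOSSARY.check(w)), 0)
    candidate = ' '.join(words[start:])
    # the first question mark is assumed to end the question
    end = candidate.find('?')
    return candidate[:end + 1] if end != -1 else candidate

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
# What is your name?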
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
then a sequence of non-questionmark-characters [^?]+,
followed by a literal question mark \?.
This can still produce false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
Let's assume I have the sentence:
sentence = "A cow runs on the grass"
If I want to replace the word cow with some special token, I can do:
to_replace = "cow"
# A <SPECIAL> runs on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace)})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Additionally, if I want to replace it's plural form, I could do:
sentence = "The cows run on the grass"
to_replace = "cow"
# The <SPECIAL> run on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace) + 's?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
which performs the replacement even though the word to replace stays in its singular form cow; the s? is what makes the trailing s optional and lets the pattern match the plural.
My question is how to apply the same in a more general way, i.e., how to find and replace words which can be singular, plural ending in s, or plural ending in es (note that I'm intentionally ignoring the many edge cases discussed in the comments of the question). Another way to frame it: how can I add multiple optional suffixes to a word, so that it works for the following examples:
to_replace = "cow"
sentence1 = "The cow runs on the grass"
sentence2 = "The cows run on the grass"
# --------------
to_replace = "gas"
sentence3 = "There are many natural gases"
I suggest using regular Python logic; remember not to stretch regexes too far if you don't need to:
phrase = "There are many cows in the field cowes"
for word in phrase.split():
if word == "cow" or word == "cow" + "s" or word == "cow" + "es":
phrase = phrase.replace(word, "replacement")
print(phrase)
Output:
There are many replacement in the field replacement
Apparently, for the use-case I posted, I can make the suffix optional. So it could go as:
re.sub(rf"(?!\B\w)({re.escape(to_replace) + '(s|es)?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Note that this would not work for many edge cases discussed in the comments!
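For completeness, a quick sketch of that pattern applied to the examples above (using a non-capturing (?:s|es)? variant, which behaves the same here):

import re

def replace_word(to_replace, sentence):
    # optional plural suffix: matches cow, cows, gases, ...
    pattern = rf"(?!\B\w)({re.escape(to_replace)}(?:s|es)?)(?<!\w\B)"
    return re.sub(pattern, "<SPECIAL>", sentence, count=1)

print(replace_word("cow", "The cow runs on the grass"))     # The <SPECIAL> runs on the grass
print(replace_word("cow", "The cows run on the grass"))     # The <SPECIAL> run on the grass
print(replace_word("gas", "There are many natural gases"))  # There are many natural <SPECIAL>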
I have a large number of sentences from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is", etc. (essentially eliminating the words that appear before the word-pairs). Both the sentences and the word-pairs are stored in a DataFrame:
   'Sentence'                                     'First2'
0  If this is a string what does it say?       0  can I
1  And this is a string, should it say more?   1  should it
2  This is yet another string.                 2  what does
3  etc. etc.                                   3  etc. etc.
The result I want from the above example would be:
0 what does it say?
1 should it say more?
2
The most obvious solution (at least to me), shown below, does not work. For every sentence it only tests the first word-pair in b and never reaches the other b's.
a = df['Sentence']
b = df['First2']

# The function seems to loop over all r's but only over the first b:
def func(r):
    for x in b:
        if x in r:
            s = r[r.index(x):]
            return s
        else:
            return ''

df['Segments'] = a.apply(func)
It seems that looping over two DataFrames simultaneously in this way does not work. Is there a more efficient and effective way to do this?
I believe there is a bug in your code.
else:
    return ''
This means if the 1st comparison is not a match, 'func' will return immediately. That might be why the code does not return any matches.
A sample working code is below:
# Loop over all the word-pairs; return '' only after none of them matched:
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            s = sentence[sentence.index(first_two):]
            return s
    return ''

df['Segments'] = a.apply(func)
And the output:
df:
{
    'First2': ['can I', 'should it', 'what does'],
    'Segments': ['what does it say? ', 'should it say more?', ''],
    'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string. ']
}
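For larger DataFrames, a vectorized alternative is to build a single alternation pattern from all the word-pairs and let pandas do the matching. A sketch, assuming the same df as above:

import re
import pandas as pd

df = pd.DataFrame({
    'Sentence': ['If this is a string what does it say?',
                 'And this is a string, should it say more?',
                 'This is yet another string.'],
    'First2': ['can I', 'should it', 'what does'],
})

# one regex that matches any word-pair and everything after it
pattern = '((?:' + '|'.join(map(re.escape, df['First2'])) + ').*)'
df['Segments'] = df['Sentence'].str.extract(pattern, expand=False).fillna('')
print(df['Segments'].tolist())
# ['what does it say?', 'should it say more?', '']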
You can loop over two iterables simultaneously via zip(iterator, iterator_foo).
My question was answered by the following code:
def func(r):
    for i in b:
        if i in r:
            q = r[r.index(i):]
            return q
    return ''

df['Segments'] = a.apply(func)
The solution was pointed out here by Daming Lu (only the last line is different from his). The problem was in the last two lines of the original code:
else:
    return ''
This caused the function to return too early. Daming Lu's answer was better than the answer to the possible duplicate question python for-loop only executes once?, which created other problems, as explained in my response to wii. (So I am not sure mine really is a duplicate.)
I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Trovato: " + result.group())
But what I need is to obtain the first 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the word that I looking for. So after I matched it with my regex I need the 2 words (if exists) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTENTION
The description can be very long, and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
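A quick check of how that pattern behaves on the sample line (a sketch):

import re

p = r'((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})'
line = 'Parking here is horrible, this shop sucks.'
print(re.findall(p, line, re.IGNORECASE))
# [('Parking ', 'horrible, this ')]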
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
This way you will always have 4 groups (which may need trimming) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
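A quick check of the group behavior on the sample line (a sketch; the groups still need trimming):

import re

p = r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)'
m = re.search(p, 'Parking here is horrible, this shop sucks.', re.IGNORECASE)
print([g.strip() for g in m.groups()])
# ['Parking', '', 'horrible,', 'this'] -- group 2 empty: only one word before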
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'). That is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you instead of some workshop requirement.
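As a rough illustration of that top-down split (just a sketch; scan_line and classify_stress are hypothetical stubs for steps 2 and 3):

def scan_line(line):
    # step 2 (stub): return the stress pattern of one line
    return ''

def classify_stress(stress):
    # step 3 (stub): map a stress pattern to a form name
    return 'unknown'

def analyze(corpus):
    # step 1: split corpus into lines
    for line in corpus.split("\n"):
        print(line, classify_stress(scan_line(line)))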
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last-year undergrad should not have too much trouble with it (hint hint).
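For illustration, a minimal Levenshtein distance over stress strings, normalized to a 0.0-1.0 similarity so that scores are comparable across line lengths (a sketch, not a tuned implementation):

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # 1.0 means identical, 0.0 completely different
    return 1.0 - float(levenshtein(a, b)) / max(len(a), len(b), 1)

print(similarity('1101111101', '0101010101'))  # how close the scanned line is to iambic pentameter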
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
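For instance, if each word contributes a list of candidate stresses (both '0' and '1' for the ambiguous single-syllable ones), the line's possible patterns can be enumerated with itertools.product. A sketch with made-up per-word candidates:

from itertools import product

# one candidate list per word; ambiguous monosyllables get both stresses
word_stresses = [['1', '0'], ['01'], ['1', '0'], ['0101']]
patterns = {''.join(p) for p in product(*word_stresses)}
print(sorted(patterns))
# all combinations, to be matched against the known forms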
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permutate all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow.
And I'm a Python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
The code included in this question helped me, so I'm posting what I came up with, building on that foundation. It is one way to extract the stress as a single string, correct the cmudict bias toward '1' with a 'fudging factor', and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s), word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws", ws
        for ch in list(ws):
            #print "ch", ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm", nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)

"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
"""
I am making a simple chat bot in Python. It has a text file with regular expressions which help to generate the output. The user input and the bot output are separated by a | character.
my name is (?P<'name'>\w*) | Hi {'name'}!
This works fine for single sets of input and output responses, however I would like the bot to be able to store the regex values the user inputs and then use them again (i.e. give the bot a 'memory'). For example, I would like to have the bot store the value input for 'name', so that I can have this in the rules:
my name is (?P<'word'>\w*) | You said your name is {'name'} already!
my name is (?P<'name'>\w*) | Hi {'name'}!
Having no value for 'name' yet, the bot will first output 'Hi steve'; once the bot does have this value, the 'word' rule will apply. I'm not sure if this is easily feasible given the way I have structured my program. I have made it so that the text file is turned into a dictionary, with the key and value separated by the | character. When the user inputs some text, the program checks whether the user input matches an input stored in the dictionary and prints out the corresponding bot response (there is also an 'else' case if no match is found).
I need something to happen at the comparing part of the process so that the user's regular-expression text is saved and then substituted back into the dictionary somehow. All of my regular expressions have different names associated with them (there are no two instances of 'word', for example; there is 'word', 'word2', etc.). I did this as I thought it would make this part of the process easier. I may have structured the whole thing wrong for this task, though.
Edit: code
import re

io = {}
with open("rules.txt") as brain:
    for line in brain:
        key, value = line.split('|')
        io[key] = value

string = str(raw_input('> ')).lower() + ' word'
x = 1
while x == 1:
    for regex, output in io.items():
        match = re.match(regex, string)
        if match:
            print(output.format(**match.groupdict()))
            string = str(raw_input('> ')).lower() + ' word'
        else:
            print ' Sorry?'
            string = str(raw_input('> ')).lower() + ' word'
I had some difficulty understanding the principle of your algorithm because I'm not used to employing named groups.
The following code is the way I would solve your problem, I hope it will give you some ideas.
I think that having only one dictionary isn't a good principle; it increases the complexity of the reasoning and of the algorithm. So I based the code on two dictionaries: direg and memory.
These two dictionaries have keys that are indexes of groups; not all the indexes, only some particular ones: the indexes of the groups that come last in each individual pattern.
Because, for the fun of it, I decided that the regexes must be able to have several groups.
What I call individual patterns in my code are the following strings:
"[mM]y name [Ii][sS] (\w*)"
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
You see that the second individual pattern has 2 capturing groups: consequently there are 3 individual patterns, but a total of 4 groups across them all.
So the creation of the dictionaries needs some additional care, to account for the fact that the index of the last matching group (which I obtain via the lastindex attribute of a regex MatchObject) may not correspond to the numbering of the individual regexes present in the regex pattern: it's harder to explain than to understand. That's the reason why, in the function distr(), I count the occurrences of the strings {0} {1} {2} {3} {4} etc., whose number MUST be the same as the number of groups defined in the corresponding individual pattern.
I found the suggestion of Laurence D'Oliveiro to use '||' instead of '|' as separator interesting.
My code simulates a session in which several inputs are done:
import re

regi = ("[mM]y name [Ii][sS] (\w*)"
        "||Hi {0}!"
        "||You said that your name was {0} !!!",
        "[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
        "||OK here's your file {0}\\{1} :"
        "||I already gave you the file {0}\\{1} !",
        "[Ii] [wW][aA][nN][tT] to ([ \w]*)"
        "||OK, I will do {0}"
        "||You already did {0}. Do yo really want again ?")

direg = {}
memory = {}

def distr(regi, cnt=0, di=direg, mem=memory,
          regnb=re.compile('{\d+}')):
    for i, el in enumerate(regi, start=1):
        sp = el.split('||')
        cnt += len(regnb.findall(sp[1]))
        di[cnt] = sp[1]
        mem[cnt] = sp[2]
        yield sp[0]

regx = re.compile('|'.join(distr(regi)))

print 'direg :\n', direg
print
print 'memory :\n', memory

for inp in ('I say that my name is Armano the 1st',
            'In repertory ONE I want file SPACE',
            'I want to record music',
            'In repertory ONE I want file SPACE',
            'I say that my name is Armstrong',
            'But my name IS Armstrong now !!!',
            'In repertory TWO I want file EARTH',
            'Now my name is Helena'):
    print '\ninput ==', inp
    mat = regx.search(inp)
    if direg[mat.lastindex]:
        print 'output ==', direg[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        direg[mat.lastindex] = None
        memory[mat.lastindex] = memory[mat.lastindex]\
            .format(*(d for d in mat.groups() if d))
    else:
        print 'output ==', memory[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        if not memory[mat.lastindex].startswith('Sorry'):
            memory[mat.lastindex] = 'Sorry, ' \
                + memory[mat.lastindex][0].lower()\
                + memory[mat.lastindex][1:]
result
direg :
{1: 'Hi {0}!', 3: "OK here's your file {0}\\{1} :", 4: 'OK, I will do {0}'}
memory :
{1: 'You said that your name was {0} !!!', 3: 'I already gave you the file {0}\\{1} !', 4: 'You already did {0}. Do yo really want again ?'}
input == I say that my name is Armano the 1st
output == Hi Armano!
input == In repertory ONE I want file SPACE
output == OK here's your file ONE\SPACE :
input == I want to record music
output == OK, I will do record music
input == In repertory ONE I want file SPACE
output == I already gave you the file ONE\SPACE !
input == I say that my name is Armstrong
output == You said that your name was Armano !!!
input == But my name IS Armstrong now !!!
output == Sorry, you said that your name was Armano !!!
input == In repertory TWO I want file EARTH
output == Sorry, i already gave you the file ONE\SPACE !
input == Now my name is Helena
output == Sorry, you said that your name was Armano !!!
OK, let me see if I understand this:
You want a dictionary of key-value pairs. This will be the “memory” of the chatbot.
You want to apply regular-expression rules to user input. But which rules might apply is conditional on which keys are already present in the memory dictionary: if “name” is not yet defined, then the rule that defines “name” applies; but if it is, then the rule that mentions “word” applies.
Seems to me you need more information attached to your rules. For example, the “word” rule you gave above shouldn’t actually add “word” to the dictionary, otherwise it would only apply once (imagine if the user keeps trying to say “my name is x” more than twice).
Does that give you a bit more idea about how to proceed?
Oh, by the way, I think “|” is a poor choice for a separator character, because it can occur in regular expressions. Not sure what to suggest: how about “||”?
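To make the idea concrete, here is a minimal sketch of memory-conditional rules (the rule tuples and the respond helper are my own hypothetical structure, not your text-file format):

import re

# each rule: pattern, first-time reply, reply once the value is already known
rules = [
    (r"my name is (?P<name>\w+)", "Hi {name}!", "You said your name is {name} already!"),
]
memory = {}

def respond(text):
    for pattern, first, again in rules:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            key, value = next(iter(match.groupdict().items()))
            if key in memory:
                return again.format(**{key: memory[key]})
            memory[key] = value
            return first.format(**{key: value})
    return "Sorry?"

print(respond("my name is Steve"))  # Hi Steve!
print(respond("my name is Steve"))  # You said your name is Steve already!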