This question already has answers here:
Regular expression: Match everything after a particular word
(4 answers)
Closed 4 years ago.
I need to filter the sentence and select only few terms from the whole sentence
For example, I have sample text:
ID: a9000006
NSF Org : DMI
Total Amt. : $225024
Abstract :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,
chemical stability and low viscosity
token = re.compile('a90[0-9][0-9][0-9][0-9][0-9]| [$][\d]+ |')
re.findall(token, filetext)
I get 'a9000006','$225024', but I do not know how to write regex for three upper case letter right after "NSF Org:" which is "DMI" and all text after "Abstract:"
If you want to create a single regex which will match each of those 4 fields with explicit checks on each, then use this regex: :\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)
>>> token = re.compile(r':\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)', re.DOTALL) # flag needed
>>> re.findall(token, filetext)
['a9000006', 'DMI', '$225024', 'This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals wi
th ultra-high polarization, \n chemical stability and low viscosity']
>>>
However, since you're searching for all at the same time, would be better to use one which matches all 4 together and generically, such as the one in this answer here.
This must do the job.
: .*
You can check this here.
check
Related
I'm trying to extract ONLY one string that contains $ character. The input based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple you don't need a regex solution, this should sufice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like that (provided input is the string you wrote above)-
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
IF there is only 1 occurrence like you said.
Alternative- you could use regex like that-
price = re.findall('\S*\$\S*\d', input)[0]
price = price.replace('$', '')
A string is given as an input (e.g. "What is your name?"). The input always contains a question which I want to extract. But the problem that I am trying to solve is that the input is always with unneeded input.
So the input could be (but not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn" 2- "What is your\nlastname and email?\ndasf?lkjas" 3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that at the third input, the question starts with the word "Given" and end with "yourself?")
The above input examples are generated by the pytesseract OCR library of scanning an image and converting it into text
I only want to extract the question from the garbage input and nothing else.
I tried to use find('?', 1) function of the re library to get index of last part of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and get the first spotted \n in the input, but the question doesn't always have \n before the first letter of the question.
def extractQuestion(q):
index_end_q = q.find('?', 1)
index_first_letter_of_q = 0 # TODO
question = '\n ' . join(q[index_first_letter_of_q :index_end_q ])
A way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words I suppose). A way to do that would be using pyenchant:
#!/usr/bin/env python
import enchant
GLOSSARY = enchant.Dict("en_US")
def isWord(word):
return True if GLOSSARY.check(word) else False
sentences = [
"eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
for sentence in sentences:
for i,w in enumerate(sentence.split()):
if isWord(w):
print('index: {} => {}'.format(i, w))
break
The above piece of code gives as a result:
index: 3 => What
index: 0 => What
index: 0 => Given
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
then a sequence of non-questionmark-characters [^?]+,
followed by a literal question mark \?.
This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for you examples it works quite well.
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
d = cmudict.dict()
text = nltk.Text(tokens)
words = [w.lower() for w in text]
regexp = "[A-Za-z]+"
exp = re.compile(regexp)
def nsyl(word):
lowercase = word.lower()
if lowercase not in d:
return 0
else:
first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
second = ''.join(first)
third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
return third
#return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])
sum1 = 0
for a in words:
if exp.match(a):
print a,nsyl(a),
sum1 = sum1 + len(str(nsyl(a)))
print "\nTotal syllables:",sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word - '1' - when there is not - '0 - . Which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function but the second step would be better placed in its own function and the third step would require to be split up itself, and would probably be better tackled with an object oriented approach. If you're in academy you might be able to convince the CS faculty to lend you a post-grad for a couple of months and help you instead of some workshop requirement.
Now to your other questions:
Not loosing line breaks: as #ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permutate all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
And the code included in this question helped me, so I post what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct with a 'fudging factor' for the cmudict bias, and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
stress=""
stress_no_punct=""
print line
tokens = [words.lower() for words in nltk.word_tokenize(line)]
for word in tokens:
word_punct = strip_punctuation_stressed(word.lower())
word = word_punct['word']
punct = word_punct['punct']
#print word
if word not in prondict:
# if word is not in dictionary
# add it to the string that includes punctuation
stress= stress+"*"+word+"*"
else:
zero_bool=True
for s in prondict[word]:
# oppose the cmudict bias toward 1
# search for a zero in array returned from prondict
# if it exists use it
# print strip_letters(s),word
if strip_letters(s)=="0":
stress = stress + "0"
stress_no_punct = stress_no_punct + "0"
zero_bool=False
break
if zero_bool:
stress = stress + strip_letters(prondict[word][0])
stress_no_punct=stress_no_punct + strip_letters(prondict[word][0])
if len(punct)>0:
stress= stress+"*"+punct+"*"
return {'stress':stress,'stress_no_punct':stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
# define punctuations
punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
my_str = word
# remove punctuations from the string
no_punct = ""
punct=""
for char in my_str:
if char not in punctuations:
no_punct = no_punct + char
else:
punct = punct+char
return {'word':no_punct,'punct':punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
#print "strip_letters"
nm = ''
for ws in ls:
#print "ws",ws
for ch in list(ws):
#print "ch",ch
if ch.isdigit():
nm=nm+ch
#print "ad to nm",nm, type(nm)
return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
abstract: "...long abstract here..."
highlights: [
{
concept: 'a word',
start: 1,
end: 10
}
{
concept: 'cancer',
start: 123,
end: 135
}
]
}
I am looping over each highlight, locating it's start index in the abstract (the end doesn't really matter as I just need to get a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tonenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it.. Or somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation since it is robust enough to handle delimiting mostly all sentences including ending a sentence with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
for highlight in highlights:
if highlight in sentence:
sentencesWithHighlights.append(sentence)
break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
for index in highlightIndices:
if 0 < index - subtractFromIndex < len(sentence):
sentencesWithHighlights.append(sentence)
break
subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).
Another option (though it's tough to say how reliable it would be with variably defined text), would be to split the text into a list of sentences and test against them:
re.split('(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
Consider I have following string with a tab in between left & right part in a text file:
The dreams of REM (Geo) sleep The sleep paralysis
I want to match the above string that match both left part & right part in each line of another following file:
The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
If can not match with fill string, then try to match with substring.
I want to search with leftmost and rightmost pattern.
eg.(leftmost cases)
The dreams of REM sleep paralysis
The dreams of REM sleep The sleep
eg.(Right most cases):
REM sleep The sleep paralysis
The dreams of The sleep paralysis
Thanks a lot again for any kind of help.
(Ok, you clarified most of what you want. Let me restate, then clarify the points I listed below as remaining unclear... Also take the starter code I show you, adapt it, post us the result.)
You want to search, line-by-line, case-insensitive, for the longest contiguous matches to each of a pair of match-patterns. All the patterns seem to be disjoint (impossible to get a match on both patternX and patternY, since they use different phrases, e.g. can't match both 'frontal lobe' and 'prefrontal cortex').
Your patterns are supplied as a sequence of pairs ('dom','rang'), => let's just refer to them by their subscript [0] and [1, you can use string.split('\t') to get that.)
The important thing is a matching line must match both the dom and rang patterns (fully or partially).
Order is independent, so we can match rang then dom, or vice versa => use 2 separate regexes per line, and test d and r matched.
Patterns have optional parts, in parentheses => so just write/convert them to regex syntax using (optionaltext)? syntax already, e.g.: re.compile('Frontallobes of (leftside)? the brain', re.IGNORECASE)
The return value should be the string buffer with the longest substring match so far.
Now this is where several things remain to be clarified - please edit your question to explain the following:
If you find full matches to any pair of patterns, then return that.
If you can't find any full matches, then search for partial matches of both of the pair of patterns. Where 'partial match' means 'the most words' or 'the highest proportion(%) of words' from a pattern? Presumably we exclude spurious matches to words like 'the', in which case we lose nothing by simply omitting 'the' from your search patterns, then this guarantees that all partial matches to any pattern are significant.
We score the partial matches (somehow), e.g. 'contains most words from pattern X', or 'contains highest % of words from pattern X'. We should do this for all patterns, then return the pattern with the highest score. You'll need to think about this a little, is it better to match 2 words of a 5-word pattern (40%) e.g. 'dreams of', or 1 of 2 (50%) e.g. 'prefrontal BUT NOT cortex'? How do we break ties, etc? What happens if we match 'sleep' but nothing else?
Each of the above questions will affect the solution, so you need to answer them for us. There's no point in writing pages of code to solve the most general case when you only needed something simple.
In general this is called 'NLP' (natural language processing). You might end up using an NLP library.
The general structure of the code so far is sounding like:
import re
# normally, read your input directly from file, but this allows us to test:
input = """The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
The optic tract is a part of the visual system in the brain.
The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain.
The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain, lying in front of the motor and premotor areas.
There are three possible ways to define the prefrontal cortex as the granular frontal cortex as that part of the frontal cortex whose electrical stimulation does not evoke movements.
This allowed the establishment of homologies despite the lack of a granular frontal cortex in nonprimates.
Modern tracing studies have shown that projections of the mediodorsal nucleus of the thalamus are not restricted to the granular frontal cortex in primates.
""".split('\n')
patterns = [
('(dreams of REM (Geo)? sleep)', '(sleep paralysis)'),
('(frontal lobe)', '(inferior frontal gyrus)'),
('(prefrontal cortex)', '(frontallobes of (leftside )?(the )?brain)'),
('(modern tract)', '(probably mediodorsal nucleus)') ]
# Compile the patterns as regexes
patterns = [ (re.compile(dstr),re.compile(rstr)) for (dstr,rstr) in patterns ]
def longest(t):
"""Get the longest from a tuple of strings."""
l = list(t) # tuples can't be sorted (immutable), so convert to list...
l.sort(key=len,reverse=True)
return l[0]
def custommatch(line):
for (d,r) in patterns:
# If got full match to both (d,r), return it immediately...
(dm,rm) = (d.findall(line), r.findall(line))
# Slight design problem: we get tuples like: [('frontallobes of the brain', '', 'the ')]
#... so return the longest match strings for each of dm,rm
if dm and rm: # must match both dom & rang
return [longest(dm), longest(rm)]
# else score any partial matches to (d,r) - how exactly?
# TBD...
else:
# We got here because we only have partial matches (or none)
# TBD: return the 'highest-scoring' partial match
return ('TBD... partial match')
for line in input:
print custommatch(line)
and running on the 7 lines of input you supplied currently gives:
TBD... partial match
TBD... partial match
['frontal lobe', 'inferior frontal gyrus']
['prefrontal cortex', ('frontallobes of the brain', '', 'the ')]
TBD... partial match
TBD... partial match
TBD... partial match
TBD... partial match