How can I use regexes to find definitions in Python? - python

This is what my prof has given us for clues:
import re

text = '''I make my own cheese. Cheese is a dairy product, derived from milk and produced in wide ranges of flavors, textures and forms by coagulation of the milk protein casein. I personally really love cheese. Casein: a family of related phosphoproteins. These proteins are commonly found in mammalian milk'''
for r in re.finditer(r'\w+', text):  # Here, I would split my text into sentences
    word = r.group(0)
    if re.search(r'lly\b', word):  # Here, I would identify a type of sentence
        print(word)
    if re.search(r'tion\b', word):  # Here, I would identify another type of sentence
        print(word)
Basically, what I gathered from my own text is that there are two types of definitions: one that is integrated into the sentence, where the defined word is usually followed by a descriptive verb ("Cheese is..."), and one where the defined word is followed by a colon and its definitory (invented word?) sentence ("Casein: [...]"). I've racked my brain the whole week trying to find a way to extract and print these sentences, without any luck. As a Linguistics major who's just trying to get by, I would greatly appreciate any help. Thanks.
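One possible way to approach the two definition types described above is to split the text into sentences first and then test each sentence against one pattern per type. This is only a rough sketch: the names `sentences`, `copula` and `colon` are made up for illustration, the sentence splitter is naive (it breaks on abbreviations such as "e.g."), and the patterns assume the defined term is a single capitalized word at the start of the sentence.

```python
import re

text = ('I make my own cheese. Cheese is a dairy product, derived from milk '
        'and produced in wide ranges of flavors, textures and forms by '
        'coagulation of the milk protein casein. I personally really love '
        'cheese. Casein: a family of related phosphoproteins. These proteins '
        'are commonly found in mammalian milk')

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text)

# Type 1 (assumed shape): "<Term> is/are a/an/the ..."
copula = re.compile(r'^[A-Z]\w*\s+(?:is|are)\s+(?:a|an|the)\b')
# Type 2 (assumed shape): "<Term>: ..."
colon = re.compile(r'^[A-Z]\w*:\s')

definitions = [s for s in sentences if copula.search(s) or colon.search(s)]

for d in definitions:
    print(d)
```

On the sample text this keeps the "Cheese is a dairy product..." and "Casein: a family..." sentences and drops the other three. Real text would need more patterns (multi-word terms, other defining verbs such as "means" or "refers to").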

Related

Grabbing words surrounding keywords when keywords show up multiple times in string

Intro: I am trying to extract surrounding words next to a key word of choice. I've learned how to do this using re.compile. Other threads address this issue, e.g.: Finding words after keyword in python
Problem: My issue is what to do when the keyword appears more than once in a string. E.g., my keyword is "pie" and my string (s1) is: "i like to eat blueberry pie it is delicious but i also like to eat apple pie it is cool."
How do I extract "blueberry", "it", "apple" and "it"? My current method only extracts "blueberry" and "it."
Current method:
re.compile(r'((?:\w+(?:\s+|$)){1})+pie\s+((?:\w+(?:\s+|$)){1})').findall(s1)
Thanks, and let me know if you need clarification!
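For what it's worth, one way this could be done is with a simpler pattern that captures exactly one word on each side of the keyword, applied with `re.findall` so that every occurrence is returned. A minimal sketch, where the helper name `neighbors` is hypothetical:

```python
import re

def neighbors(text, keyword):
    """Return (word_before, word_after) for each occurrence of keyword."""
    pattern = r'(\w+)\s+' + re.escape(keyword) + r'\s+(\w+)'
    return re.findall(pattern, text)

s1 = ("i like to eat blueberry pie it is delicious "
      "but i also like to eat apple pie it is cool.")

print(neighbors(s1, "pie"))
# [('blueberry', 'it'), ('apple', 'it')]
```

Because `findall` returns non-overlapping matches, two occurrences of the keyword that share a neighboring word would only be reported once; a lookahead-based pattern would be needed for that case.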

Cross-Lingual Word Sense Disambiguation

I am a beginner in computer programming and I am completing an essay on Parallel Corpora in Word Sense Disambiguation.
Basically, I intend to show that substituting a sense for a word translation simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier to calculate the probability of a translation word given the contextual features of the tokens which surround the ambiguous word in the source text.
So, my question is: how do you extract instances of an ambiguous word from a parallel corpus WITH its aligned translation?
I have tried various scripts in Python, but these rely on the assumptions that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work.
e.g.
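from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

w_lemma = WordNetLemmatizer()

def ambigu_word2(document, document2):
    words = ['letter']
    # Assumes sentence i of document is the translation of sentence i of document2
    for i, sentence in enumerate(document):
        tokens = word_tokenize(sentence)
        for item in tokens:
            x = w_lemma.lemmatize(item)
            for w in words:
                if w == x:
                    print(sentence, document2[i])

ambigu_word2(raw1, raw2)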
I would be really grateful if you could provide any guidance on this matter.
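A rough sketch of how the aligned translation could be read directly from the alignment output, rather than from two separate corpora. It assumes the alignment file uses a three-lines-per-pair layout (a `# Sentence pair` header, one sentence as plain tokens, then the other sentence with each word followed by `({ positions })` pointing into the plain-token line); please check this against your own GIZA++ output, since the layout and the direction depend on how the tool was run. The sample data and the names `token_re` and `aligned_instances` are made up for illustration.

```python
import re

# Hypothetical sample in the assumed three-line-per-pair layout.
sample = """# Sentence pair (1) source length 4 target length 4 alignment score : 0.001
he recibido la carta
NULL ({ }) i ({ 1 }) received ({ 2 }) the ({ 3 }) letter ({ 4 })
"""

# One "word ({ 1 2 })" unit on the alignment line.
token_re = re.compile(r'(\S+) \(\{([\d ]*)\}\)')

def aligned_instances(lines, ambiguous_word):
    """Yield (sentence, aligned_words) for each pair containing ambiguous_word."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 2, 3):
        other = lines[i + 1].split()              # plain-token sentence
        units = token_re.findall(lines[i + 2])    # [(word, ' 1 2 '), ...]
        sentence = ' '.join(w for w, _ in units if w != 'NULL')
        for word, positions in units:
            if word.lower() == ambiguous_word:
                # Positions are 1-based indexes into the plain-token line.
                yield sentence, [other[int(p) - 1] for p in positions.split()]

for sentence, translation in aligned_instances(sample.splitlines(), 'letter'):
    print(sentence, '->', translation)
```

Each yielded pair gives the context sentence and the word(s) aligned to the ambiguous word, which could then serve as the label for a classifier trained on the surrounding tokens.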

find job role from text data

I have a text file from which I have to extract the role each person works in: "Mechanical engineer", "software developer", etc.
I have used NLTK to extract this using a grammar like:
grammer= r"""
NP: {<NN.*|JJ>*<NN.*>} """
The results I am getting are good, but for a few documents junk still comes through. For those lines I want to apply regular expressions.
My sample texts look like this:
"I am software developement engineer in microsoft"
"I am mechanical engineer with 10 years experience"
What I want is to extract the two or three words before "Engineer".
I am using a regular expression like:
regex=re.compile('|'.join([r'(?:\S+\s)?\S*[eE]ngineer']))
But it extracts only one word before the specific word. How can I make it extract two or more words?
I tried putting {2-3} in place of "?" in the expression, but I am not getting the desired result.
Is my approach correct?
Or is there another approach that extracts this specific phrase in a better way?
The regex
(\w+\s){2,3}dog
will match
over the lazy dog
the lazy dog
in
The quick brown fox jumps over the lazy dog the lazy dog
This should get you started, I think.
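To try the `{2,3}` quantifier from the answer above in Python, something like the following sketch could be used. The helper name `words_before` is hypothetical, and the group is made non-capturing so that `group(0)` holds the whole phrase.

```python
import re

def words_before(text, keyword, low=2, high=3):
    """Return each match of `low` to `high` words followed by keyword."""
    pattern = re.compile(
        r'(?:\w+\s){%d,%d}%s' % (low, high, re.escape(keyword)),
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]

print(words_before(
    "The quick brown fox jumps over the lazy dog the lazy dog", "dog"))
# ['over the lazy dog', 'the lazy dog']

print(words_before("I am mechanical engineer with 10 years experience",
                   "engineer"))
# ['I am mechanical engineer']
```

Note that the quantifier is written `{2,3}` with a comma; `{2-3}` is not a repetition count and is treated as literal text. The match also picks up filler words such as "I am", so some post-filtering (for example with the part-of-speech tags from the NLTK chunker) would still be needed.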

Search key phrases in text

I'm looking for a fast solution which allows me to find predefined phrases (1-5 words) in a (not big) text.
There can be up to 1000 phrases. I suppose the simple find() function is not a good solution.
Could you advise what I should use?
Thanks in advance.
Update
Why I don't want to use brute-force search:
I believe it is not fast enough.
The text can have extra words inside the phrases, i.e. the phrase can be Bank America, but the text has bank of America.
Phrases can be slightly changed: apostrophes, an -s ending, etc.
I'm not sure about your goal, but you can easily find predefined phrases in text like this:
predefined_phrases = ["hello", "unicorns with a big mouth!", "Sweet donats"]
isnt_big_text = "A big mouse fly by unicorns with a big mouth! with hello wold."
for phrase in predefined_phrases:
    if phrase in isnt_big_text:
        print("Phrase '%s' found in text" % phrase)
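The plain `in` check above does not cover the variations mentioned in the update (an extra word inside the phrase, a different case, an -s ending). One possible alternative is to compile all phrases into a single regular expression that tolerates those variations, so the text is scanned once. This is a sketch under stated assumptions: the helper name `build_phrase_regex` is hypothetical, at most one extra word is allowed between two phrase words, and only a trailing `s` or `'s` is tolerated.

```python
import re

def build_phrase_regex(phrases):
    """Compile one case-insensitive pattern matching any of the phrases."""
    alternatives = []
    for phrase in phrases:
        # Each phrase word may carry an optional s / 's ending.
        words = [re.escape(w) + r"(?:'?s)?" for w in phrase.split()]
        # Between two phrase words, allow at most one extra word.
        alternatives.append(r'\s+(?:\w+\s+)?'.join(words))
    return re.compile(r'\b(?:' + '|'.join(alternatives) + r')\b',
                      re.IGNORECASE)

pattern = build_phrase_regex(["Bank America", "key phrase"])
text = "She works at bank of America and collects key phrases."
print(pattern.findall(text))
# ['bank of America', 'key phrases']
```

With around 1000 phrases a single alternation should still be workable for a small text, although a dedicated multi-pattern algorithm may be faster if this becomes a bottleneck.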

Add in word boundary syntax to list of strings

Please point me to a post if one already exists for this question.
How might I efficiently add in word boundary syntax to list of strings?
So for instance, I want to make sure the words below in badpositions only match a word in its entirety, so I'd like to use re.search(r'\bword\b', text).
How do I get the words in badpositions to take the form [r'\bPresident\b', r'\bProvost\b'], etc.?
text = ['said Duke University President Richard H. Brodhead. "Our faculty look forward']
badpositions = ['President', 'Provost', 'University President', 'Senior Vice President']
import re

re_badpositions = [r"\b{word}\b".format(word=word) for word in badpositions]
# text is a one-element list, so search its first (only) string
indexes = {word: re.search(pattern, text[0]) for word, pattern in zip(badpositions, re_badpositions)}
If I understand you correctly, you're looking to find the starting index of all words that match exactly (that is, \bWORD\b) in your text string. The code above is how I'd do that, but I'm certainly adding a step here. You could just as easily do this (note the raw string, since a plain "\b" is a backspace character, and text[0], since text is a list):
indexes = {word: re.search(r"\b{word}\b".format(word=word), text[0]) for word in badpositions}
I find it a little more intelligible to create a list of regexes to search with, then search by them separately than to plunk those regexes in place at the same time. This is ENTIRELY due to personal preference, though.
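As a self-contained illustration of the word-boundary approach discussed above, here is a small sketch that records the starting index of each whole-word match, or None when there is none. It assumes `text` is a plain string rather than a list, and it adds `re.escape` (not in the original code) in case a title ever contains regex metacharacters; the names `patterns` and `starts` are made up for this example.

```python
import re

text = 'said Duke University President Richard H. Brodhead. "Our faculty look forward'
badpositions = ['President', 'Provost', 'University President',
                'Senior Vice President']

# Raw strings keep \b as a word boundary instead of a backspace character.
patterns = {word: re.compile(r'\b' + re.escape(word) + r'\b')
            for word in badpositions}

starts = {}
for word, pattern in patterns.items():
    match = pattern.search(text)
    starts[word] = match.start() if match else None

print(starts)
```

Here 'President' is found at index 21 and 'University President' at index 10, while the other two titles map to None.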
