I have a text file from which I have to extract the role each person works in: "mechanical engineer", "software developer", etc.
I have used NLTK to extract this, using a grammar like
grammar = r"""
NP: {<NN.*|JJ>*<NN.*>}
"""
The results I am getting are good, but for a few documents junk still comes through. For those lines I want to apply regular expressions.
My sample texts are like this:
"I am software developement engineer in microsoft"
"I am mechanical engineer with 10 years experience"
What I want is to extract the two or three words before "engineer".
I am using a regular expression like
regex = re.compile(r'(?:\S+\s)?\S*[eE]ngineer')
but it extracts only one word before the target word. How can I make it extract two or more words?
I tried putting {2-3} in place of the "?" in the expression, but I am not getting the desired result.
Is my approach correct, or is there a better way to extract this specific phrase?
The regex
(\w+\s){2,3}dog
will match
over the lazy dog
the lazy dog
in
The quick brown fox jumps over the lazy dog the lazy dog
This should get you started, I think.
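Applied to the job-title samples in the question, a sketch along the same lines might look like this (a rough illustration, not a polished extractor; stop words such as "am" can still slip into the match and may need filtering afterwards):
import re

# (?:\w+\s){1,2} grabs the 1-2 words immediately before the token that
# ends in "engineer", giving a 2-3 word phrase in total.
pattern = re.compile(r'(?:\w+\s){1,2}\w*engineer', re.IGNORECASE)

lines = ["I am software developement engineer in microsoft",
         "I am mechanical engineer with 10 years experience"]
for line in lines:
    match = pattern.search(line)
    if match:
        print(match.group(0))
# -> software developement engineer
# -> am mechanical engineer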
I am currently trying to convert a PDF into text for ML purposes, but whenever I do so, it returns the text in broken lines, which makes the text less readable.
Here is what I am currently doing to convert the text:
import fitz, spacy

with fitz.open("asset/example2.pdf") as doc:
    text_ = ""
    for page in doc:
        text_ += page.getText()
and here are the results:
Animals - Animals have
always been near my
heart and it has led me to
involve myself in animal
rights events and
protests. It still stays as
a dream of mine to go
volunteer at an animal
sanctuary one day.
Food Travel - Through a
diet change, I have
found my love for food
and exploring different
food cultures across the
world. I feel confident
saying that I could write
an extensive
encyclopaedia for great
vegan restaurants.
What would be the best way to approach this?
I don't quite understand what result you are looking for, but if you would like all the text to be on one line you can use text.replace('\n', ''). You may also find text.split(separator) and separator.join(list) useful for formatting your string, for example:
string = 'This is my \nfirst sentence. This \nsecond sentence\n.'
print(string)
string = string.replace('\n', '')
sentenceList = string.split('.')
string = '.\n'.join(sentenceList)
print(string)
I hope this answers your question.
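If the goal is instead to stitch the broken PDF lines back into paragraphs, a heuristic sketch like this might help (assuming blocks are separated by blank lines, which depends on the PDF and on PyMuPDF's output):
# Collapse the line breaks inside each block, keeping blank lines as
# paragraph boundaries. text_ stands in for the string extracted above.
text_ = "Animals - Animals have\nalways been near my\nheart.\n\nFood Travel - Through a\ndiet change, I have\nfound my love for food."
paragraphs = [' '.join(block.split()) for block in text_.split('\n\n')]
print('\n\n'.join(paragraphs))
# -> Animals - Animals have always been near my heart.
#
#    Food Travel - Through a diet change, I have found my love for food.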
This is what my prof has given us for clues:
text = '''I make my own cheese. Cheese is a dairy product, derived from milk and produced in wide ranges of flavors, textures and forms by coagulation of the milk protein casein. I personally really love cheese. Casein: a family of related phosphoproteins. These proteins are commonly found in mammalian milk'''
import re

for r in re.finditer(r'\w+', text):  # Here, I would split my text into sentences
    word = r.group(0)
    if re.search(r'lly\b', word):    # Here, I would identify a type of sentence
        print(word)
    if re.search(r'tion\b', word):   # Here, I would identify another type of sentence
        print(word)
Basically, what I gathered from my own text are two types of definitions: one that is integrated into the sentence, usually introduced by a descriptive verb ("Cheese is..."), and one where the defined word is followed by a colon and its defining sentence ("Casein: [...]"). I've racked my brain all week trying to find a way to extract and print these sentences, without any luck. As a Linguistics major who's just trying to get by, I'd greatly appreciate any help. Thanks.
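For illustration, one possible direction (a rough sketch of the professor's hints applied to sentence-level splitting, not the assignment's official solution) could look like this:
import re

text = ("Cheese is a dairy product, derived from milk. "
        "I personally really love cheese. "
        "Casein: a family of related phosphoproteins.")

# Split on sentence boundaries instead of iterating over single words.
sentences = re.split(r'(?<=[.!?])\s+', text)

for s in sentences:
    if re.search(r'^[A-Z]\w*\s+is\b', s):  # "Cheese is ..." style definition
        print('Integrated definition:', s)
    elif re.match(r'^\w+:\s', s):          # "Casein: ..." style definition
        print('Colon definition:', s)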
I'm new to regex and am trying to create a "Dadbot" in Discord that uses regex to respond to messages in a #text-channel with "Hi, ____ I'm Dad". The problem I have is that it accepts "im" on its own. That isn't bad by itself; however, in cases where the word "him" is used, the "im" inside "him" is accepted too. I'd like to fix this regex so that:
- it reads I'm (and its variations)
- it includes the next 3 words after I'm, or stops when it reaches a period.
I'm not sure if I'm writing this correctly. I've used regex101.com to check my regex, and my original regex was this monster: "(I'm|Im|I am|im|i am|i'm)\s+([a-zA-z]+)". I use groups to capture the second group.
Things I've tried:
"(I'm|Im|I am|im|i am|i'm)\s+([a-zA-z]+)"
"\bi'?m\s+(\w+)\b"
"/\bi'?m\s+(\w+)\b/gi"
Here's the part of the code that grabs the second group
if dadCheck.search(message.content):
    match = dadCheck.search(message.content).group(2)
    await channel.send("Hi, " + match + ". I'm Dad. ;D")
These are the expected results, given an accepted message:
Input: Hello everyone. My name is Brad and I'm a cool guy.
Output: Hi, "a cool guy". I'm Dad!
/(im|i am|i'm)\s(\w*\s?){1,3}/i
Regex is awesome and can definitely cover your use case. The regex above looks for the "I'm" tag and then grabs the next 1-3 words (plus trailing spaces) in a capture group for you to reference. Strings it works with:
I'm a cool guy.
Im a cool guy.
i am a cool guy.
i'm a cool guy.
im a cool guy.
im a cool guy and I like to watch football games with friends
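In Python, one way this could be wired up is the sketch below (hypothetical, with dad_check standing in for the bot's compiled pattern; note that a repeated capture group in Python's re module keeps only its last repetition, so the 1-3 words are wrapped in an outer group here):
import re

# \b prevents "him" from matching; the outer group keeps all 1-3 words.
dad_check = re.compile(r"\b(i'?m|i am)\s+((?:\w+\s?){1,3})", re.IGNORECASE)

message = "Hello everyone. My name is Brad and I'm a cool guy."
match = dad_check.search(message)
if match:
    print("Hi, " + match.group(2).strip() + ". I'm Dad. ;D")
# -> Hi, a cool guy. I'm Dad. ;D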
This is the answer I ultimately landed on:
\b(\s)?[Ii]((\sa)?|\'{0,1})[mM]\b\s+((\w*\s?)[^\.\!;:\(\)]\w+){1,3}
This captures all variations of I'm (im, Im, I'm, i'm) and ignores "i' am".
The only problem is that it captures only the last iteration of the repeated word group, so I'll have to fix either the capture groups or the algorithm I use to print them.
I want to find and count specific bigrams, such as "red apple", in a text file.
I have already turned the text file into a word list, so I can't use a regex to count the whole phrase (i.e. the bigram). Or can I?
How can I count a specific bigram in the text file without using NLTK or another module? Could regex be a solution?
Why have you made the text file into a list? It's not memory-efficient either.
Instead, you can call the file.read() method and work on the text directly.
import re

text = 'I like red apples and green apples but I like red apples more.'
bigram = ['red apples', 'green apples']
for i in bigram:
    print('Found', i, len(re.findall(i, text)))
out:
Found red apples 2
Found green apples 1
Are you looking only for specific bigrams, or might you need to extend the search to detect any bigrams common in your text? In the latter case, have a look at the NLTK collocations module. You say you want to do this without NLTK or other modules, but in practice that's a very, very bad idea: you'll miss what you're looking for when the text contains, e.g., 'red apple' rather than 'red apples'. NLTK, on the other hand, provides useful tools for lemmatization, calculating tons of statistics, and such.
And think about this: why and how did you turn the lines into a list of words? Not only is this inefficient, but depending on exactly how you did it you may have lost information about word order, processed punctuation improperly, messed up uppercase/lowercase, or made any of a million other mistakes. Which, again, is why NLTK is what you need.
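For the collocations route, a minimal NLTK sketch might look like this (assuming the punkt tokenizer data has been downloaded):
import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# nltk.download('punkt') may be needed once before tokenizing.
text = 'I like red apples and green apples but I like red apples more.'
words = nltk.word_tokenize(text.lower())

finder = BigramCollocationFinder.from_words(words)
print(finder.ngram_fd[('red', 'apples')])        # count of one specific bigram
print(finder.nbest(BigramAssocMeasures.pmi, 3))  # top bigrams by PMI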
I'm looking for a fast solution that allows me to find predefined phrases (1-5 words each) in a (not big) text.
There can be up to 1000 phrases, so I suppose the simple find() function is not a good solution.
Could you advise what I should use? Thanks in advance.
Update
Why I don't want to use brute-force search:
- I believe it is not fast enough.
- The text can have insertions inside the phrases, i.e. the phrase can be Bank America but the text has bank of America.
- The phrases can be slightly changed: apostrophes, -s endings, etc.
I'm not sure about your exact goal, but you can easily find predefined phrases in a text like this:
predefined_phrases = ["hello", "unicorns with a big mouth!", "Sweet donats"]
isnt_big_text = "A big mouse fly by unicorns with a big mouth! with hello wold."
for phrase in predefined_phrases:
    if phrase in isnt_big_text:
        print("Phrase '%s' found in text" % phrase)