how can I count the specific bigram words? - python

I want to find and count a specific bigram, such as "red apple", in a text file.
I have already turned the text file into a word list, so I can't use regex to count the whole phrase, i.e. the bigram (or can I?).
How can I count a specific bigram in the text file without using NLTK or another module? Could regex be a solution?

Why have you turned the text file into a list? It's also not memory efficient.
Instead, you can call file.read() and work with the text directly.
import re

text = 'I like red apples and green apples but I like red apples more.'
bigrams = ['red apples', 'green apples']
for phrase in bigrams:
    print('Found', phrase, len(re.findall(phrase, text)))
Output:
Found red apples 2
Found green apples 1
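If the file has already been turned into a word list, as the question says, the bigram can also be counted without regex by pairing adjacent words with zip(). A minimal sketch (the word list below just mirrors the example sentence):

words = ['i', 'like', 'red', 'apples', 'and', 'green', 'apples',
         'but', 'i', 'like', 'red', 'apples', 'more']
target = ('red', 'apples')
# zip(words, words[1:]) yields every adjacent word pair, i.e. every bigram
count = sum(1 for pair in zip(words, words[1:]) if pair == target)
print('Found', ' '.join(target), count)  # Found red apples 2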

Are you looking only for specific bigrams, or might you need to extend the search to detect any bigrams that are common in your text, or something similar? In the latter case, have a look at the NLTK collocations module. You say you want to do this without NLTK or other modules, but in practice that's a very bad idea: you'll miss what you are looking for because the text contains, e.g., 'red apple' rather than 'red apples'. NLTK, on the other hand, provides useful tools for lemmatization, calculating all sorts of statistics, and so on.
And think about this: why and how have you turned the lines into a list of words? Not only is this inefficient, but depending on exactly how you did it you may have lost the word order, mishandled punctuation, messed up uppercase/lowercase, or made a million other mistakes. Which, again, is why NLTK is what you need.
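For reference, a rough sketch of the collocations route; the NLTK calls are from memory (and the punkt tokenizer data must be downloaded), so treat the details as an assumption rather than a recipe:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = 'I like red apples and green apples but I like red apples more.'
tokens = nltk.word_tokenize(text.lower())  # needs nltk.download('punkt')

finder = BigramCollocationFinder.from_words(tokens)
# Top bigrams ranked by pointwise mutual information
print(finder.nbest(BigramAssocMeasures.pmi, 10))
# Raw frequency of one specific bigram
print(finder.ngram_fd[('red', 'apples')])  # 2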

Related

How can I use regexes to find definitions in Python?

This is what my prof has given us for clues:
text = '''I make my own cheese. Cheese is a dairy product, derived from milk and produced in wide ranges of flavors, textures and forms by coagulation of the milk protein casein. I personally really love cheese. Casein: a family of related phosphoproteins. These proteins are commonly found in mammalian milk'''
for r in re.finditer(r'\w+', text):  # Here, I would split my text into sentences
    word = r.group(0)
    if re.search(r'lly\b', word):  # Here, I would identify a type of sentence
        print(word)
    if re.search(r'tion\b', word):  # Here, I would identify another type of sentence
        print(word)
Basically, what I gathered from my own text are two types of definitions: one that is integrated into the sentence, usually introduced by a descriptive verb ("Cheese is..."), and one that is the defined word followed by a colon and its definitory (invented word?) sentence ("Casein: [...]"). I've racked my brain the whole week trying to find a way to extract and print these sentences without any luck. As a Linguistics major who's just trying to get by, any help would be greatly appreciated. Thanks.
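Since the thread gives no answer here, a hedged sketch of one way to approach it: split the text into sentences first, then match each definition pattern. The two patterns are deliberately naive assumptions, not a general solution:

import re

text = '''I make my own cheese. Cheese is a dairy product, derived from milk and produced in wide ranges of flavors, textures and forms by coagulation of the milk protein casein. I personally really love cheese. Casein: a family of related phosphoproteins. These proteins are commonly found in mammalian milk'''

# Rough sentence split on ., ! or ? followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)

for sentence in sentences:
    # Type 1: "<Word> is/are ..." definitions
    if re.match(r'[A-Z]\w*\s+(?:is|are)\b', sentence):
        print('Copula definition:', sentence)
    # Type 2: "<Word>: ..." definitions
    elif re.match(r'[A-Z]\w*:\s', sentence):
        print('Colon definition:', sentence)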

words.words() from the nltk corpus seemingly contains strange non-valid words

This code loops through every word in words.words() from the nltk library and pushes each word into an array. It then checks every word in the array to see whether it is an actual word, using the same library, and somehow many of them are strange words that aren't real at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words

test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)

for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
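If the goal is a case-insensitive membership test, one workaround (my own sketch, not part of the answer above) is to lowercase the word list once and keep it in a set, which also avoids rescanning words.words() on every lookup:

from nltk.corpus import words

# Lowercase the vocabulary once so proper nouns like "Aani" also match "aani"
vocab = {w.lower() for w in words.words()}

test_array = [w.lower() for w in words.words()]
missing = [w for w in test_array if w not in vocab]
print(missing)  # expected to be empty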

find job role from text data

I have a text file from which I have to extract the roles people work in, such as "mechanical engineer", "software developer", etc.
I have used NLTK to extract this with a grammar like
grammar = r"""
NP: {<NN.*|JJ>*<NN.*>} """
The results I am getting are good, but for a few documents junk still comes through. For those lines I want to apply regular expressions.
My sample texts are like this:
"I am software developement engineer in microsoft"
"I am mechanical engineer with 10 years experience"
What I want is to extract the two or three words before "engineer".
I am using a regular expression like
regex = re.compile('|'.join([r'(?:\S+\s)?\S*[eE]ngineer']))
but it extracts only one word before the specific word. How do I make it extract two or more words?
I tried putting {2-3} in place of the "?" in the expression, but I am not getting the desired result.
Is my approach correct?
Or is there a better approach to extract this specific phrase?
The regex
(\w+\s){2,3}dog
will match
over the lazy dog
the lazy dog
in
The quick brown fox jumps over the lazy dog the lazy dog
This should get you started, I think.
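Applied to the question's "engineer" example, a minimal sketch (the sample sentences come from the question; the {1,3} bound and the printed output are my own illustration):

import re

texts = [
    "I am software developement engineer in microsoft",
    "I am mechanical engineer with 10 years experience",
]

# Capture up to three whitespace-separated words immediately before "engineer";
# the {1,3} repetition replaces the single optional group from the question's regex.
pattern = re.compile(r'((?:\w+\s+){1,3})engineer', re.IGNORECASE)

for text in texts:
    match = pattern.search(text)
    if match:
        print((match.group(1) + 'engineer').strip())
# Prints "am software developement engineer" and "I am mechanical engineer";
# leading filler words such as "am" would still need to be filtered out.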

Search key phrases in text

I'm looking for a fast solution that allows me to find predefined phrases (1-5 words) in a (not big) text.
There can be up to 1000 phrases. I suppose the simple find() function is not a good solution.
Could you advise what I should use?
Thanks in advance.
Update
Why I don't want to use a brute-force search:
I believe it is not fast enough.
The text can have insertions inside the phrases, i.e. the phrase can be "Bank America" but the text reads "bank of America".
Phrases can be slightly changed: apostrophes, an -s ending, etc.
I'm not sure about your goal, but you can easily find predefined phrases in text like this:
predefined_phrases = ["hello", "unicorns with a big mouth!", "Sweet donats"]
isnt_big_text = "A big mouse fly by unicorns with a big mouth! with hello wold."
for phrase in predefined_phrases:
    if phrase in isnt_big_text:
        print("Phrase '%s' found in text" % phrase)

Finding the common words between two text corpus in NLTK

I am very new to NLTK and am trying to do something.
What would be the best way to find the common words between two bodies of text? Basically, I have one long text file, say text1, and another, say text2. I want to find the common words that appear in both files using NLTK.
Is there a direct way to do so? What would be the best approach?
Thanks!
It seems to me that unless you need to do something special with regard to language processing, you don't need NLTK:
words1 = "This is a simple test of set intersection".lower().split()
words2 = "Intersection of sets is easy using Python".lower().split()
intersection = set(words1) & set(words2)
print(intersection)  # {'of', 'is', 'intersection'}
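The same idea applied to two files, with NLTK tokenization instead of str.split(), would look roughly like this; "text1.txt" and "text2.txt" are placeholder names and the punkt tokenizer data is assumed to be downloaded:

import nltk

with open('text1.txt') as f1, open('text2.txt') as f2:
    words1 = nltk.word_tokenize(f1.read().lower())
    words2 = nltk.word_tokenize(f2.read().lower())

# The common words are simply the intersection of the two token sets
print(set(words1) & set(words2))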
