I want to automatically extract some desirable concepts (noun phrases) from text. My plan is to extract all noun phrases, label each one as one of two classes (desirable or non-desirable), and then train a classifier to classify them. What I am trying to do now is extract all possible phrases to build the training set. For example, one sentence is: "Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described." I want to get all phrases like shoulder, richer mix, shoulder of richer mix, junctions, junctions of columns and beams, columns and beams, columns, beams, or whatever else is possible. The desirable phrases are shoulder, junctions, junctions of columns and beams. I don't care about correctness at this step; I just want to build the training set first. Are there tools available for such a task?
I tried Rake from rake_nltk, but the results failed to include my desirable phrases (i.e., it did not extract all possible phrases):
from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams']
(Missed junctions of columns and beams here)
I also tried phrasemachine; its results also missed some desirable phrases.
import spacy
import phrasemachine

nlp = spacy.load('en_core_web_sm')
doc = nlp(data)  # same sentence as above
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])
Result:
[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix']
(Missed many noun phrases here)
You may wish to make use of the noun_chunks attribute:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')
phrases = set()
for nc in doc.noun_chunks:
    phrases.add(nc.text)
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
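If you also want the shorter sub-spans (e.g. shoulder on its own, or columns and beams), one hedged option is to also enumerate the subtree span of every noun token; this is a sketch of my own, not part of the original answer, and the exact spans depend on the parse:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, '
          'or at junctions of columns and beams, the items are so described.')

candidates = set()
for token in doc:
    if token.pos_ in ('NOUN', 'PROPN'):
        # the bare head word itself
        candidates.add(token.text)
        # the full span governed by this token (its syntactic subtree)
        candidates.add(doc[token.left_edge.i:token.right_edge.i + 1].text)

print(candidates)

With a typical parse this should surface spans such as shoulder, richer mix, columns and beams, and junctions of columns and beams alongside the noun chunks above.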
I'm trying a sentiment-analysis-based approach on YouTube comments, but the comments often contain words like mrbeast, tiger/'s, lion/'s, pewdiepie, james, etc., which do not add any feeling to the sentence. I've gone through nltk's averaged_perceptron_tagger, but it didn't work well, as it gave these results:
my input:
"mrbeast james lion tigers bad sad clickbait fight nice good"
words that I need in my sentence:
"bad sad clickbait fight nice good"
what I got using averaged_perceptron_tagger:
[('mrbeast', 'NN'),
('james', 'NNS'),
('lion', 'JJ'),
('tigers', 'NNS'),
('bad', 'JJ'),
('sad', 'JJ'),
('clickbait', 'NN'),
('fight', 'NN'),
('nice', 'RB'),
('good', 'JJ')]
So as you can see, if I remove mrbeast (i.e., NN), then words like clickbait and fight will also get removed, which ultimately removes expression from the sentence.
Okay, this is what I do for companies that report on the LSE. You can do something similar with your words.
# define what you consider to be positive, negative or neutral keywords
posKeyWords = ['profit', 'increase', 'pleased', 'excellent', 'good', 'solid financial', 'robust', 'significantly improved', 'improve']
negKeyWords = ['loss', 'decrease', 'dissapoint', 'poor', 'bad','decline', 'negative', 'bad', 'weather', 'covid' ]
neutralKeyWords = ['financial']
keyWords = posKeyWords + neutralKeyWords + negKeyWords
Next you get your data as text (from whatever source you choose) and read it into a string.
dataText = resp.text  # or whatever source you are reading from
Mine is the response from a web query, but yours could be from a text file or other source.
Next, create an empty dictionary to hold the keyword counts (dict lookups are fast because of hashing).
keyWordSummary = {} # dictionary of keywords & values
Finally, loop through the keywords and put them into the dict.
import re

# look for each keyword and count its occurrences in the text
for kw in keyWords:
    kwVal = re.findall(kw, dataText)
    # print('keyword count:', kw, len(kwVal))
    # put the count into the dict
    keyWordSummary[kw] = len(kwVal)
You now have a dictionary of keyword frequencies, which you could analyse in a dataframe, for example (that part is outside the scope of this particular question).
There are multiple ways of doing this:
You can create a set of positive and negative words, and for each word in your text check whether it exists in the set; if it does, keep the word, else delete it. This, however, first requires a dataset of positive and negative words.
You can use something like TextBlob, which can give you the sentiment score of a word or a sentence. With a cutoff sentiment score you can then filter out the words that you don't need (a sketch follows below).
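For the second option, here is a minimal sketch using TextBlob's polarity score; the 0.1 cutoff is an arbitrary assumption you would want to tune, and words missing from TextBlob's lexicon score 0, so the cutoff needs care:

from textblob import TextBlob

comment = "mrbeast james lion tigers bad sad clickbait fight nice good"

kept = []
for word in comment.split():
    polarity = TextBlob(word).sentiment.polarity  # -1.0 (negative) .. 1.0 (positive)
    if abs(polarity) >= 0.1:  # assumed cutoff
        kept.append(word)

print(kept)
# likely keeps 'bad', 'sad', 'nice', 'good'; out-of-lexicon words like 'clickbait'
# score 0 and are dropped, so a pure cutoff may be too aggressive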
As I've stated in the title, I'm trying to calculate the phrase frequency of a given list of sequences that appear in a list of strings. The problem is that the words in the phrases do not have to appear next to each other; there may be one or more words in between.
Example:
Sequence: ('able', 'help', 'number') in a sentence "Please call us, we may be able to help, our phone number is 1234"
I remove the stopwords (NLTK stopwords), remove punctuation, lowercase all letters and tokenize the sentence, so the processed sequence looks like ['please', 'call', 'us', 'able', 'help', 'phone', 'number', '1234']. I have about 30,000 sequences varying in length from 1 (single words) to 3, and I'm searching in almost 6,000 short sentences. My current approach is presented below:
import itertools

from collections import Counter
from tqdm import tqdm
import nltk

# Get term frequency per sentence
def get_bow(sen, vocab):
    vector = [0] * len(vocab)
    tokenized_sentence = nltk.word_tokenize(sen)
    # all 1-, 2- and 3-word combinations of the sentence tokens
    combined_sentence = list(itertools.chain.from_iterable([
        itertools.combinations(tokenized_sentence, 1),
        itertools.combinations(tokenized_sentence, 2),
        itertools.combinations(tokenized_sentence, 3)]))
    for el in combined_sentence:
        if el in vocab:
            cnt = combined_sentence.count(el)
            idx = vocab.index(el)
            vector[idx] = cnt
    return vector

sentence_vectors = []
for sentence in tqdm(text_list):
    sentence_vectors.append(get_bow(sentence, phrase_list))
phrase_list is a list of tuples with the sequences, and text_list is a list of strings. Currently, the frequencies take over an hour to calculate, and I'm trying to find a more efficient way to get the list of frequencies associated with the given terms. I've also tried sklearn's CountVectorizer, but it can't process sequences with gaps, so they aren't counted at all.
I'd be grateful if anyone could give me some insight into how to make my script more efficient. Thanks in advance!
EDIT:
Example of phrase_list: [('able',), ('able', 'us', 'software'), ('able', 'back'), ('printer', 'holidays'), ('printer', 'information')]
Example of text_list: ['able add printer mac still working advise calling support team mon fri excluding bank holidays would able look', 'absolutely one cat coyote peterson', 'accurate customs checks cause delays also causing issues expected delivery dates changing', 'add super mario flair fridge desk supermario dworld bowsersfury magnet set available platinum points shipping costs mynintendo reward get', 'additional information though pass comments team thanks']
Expected output: [2, 0, 0, 1, 0] - a vector with the occurrence count of each phrase; the order of values should be the same as in phrase_list. My code returns a per-sentence vector of phrase occurrences, because I was trying to implement something like a bag-of-words.
There are many aspects that could be made faster, but here is the main problem:
combined_sentence = list(itertools.chain.from_iterable([
    itertools.combinations(tokenized_sentence, 1),
    itertools.combinations(tokenized_sentence, 2),
    itertools.combinations(tokenized_sentence, 3)]))
You generate all possible combinations of 1, 2 or 3 words of the sentence; the number of combinations grows roughly cubically with sentence length. This is always bad, no matter what you want to do.
Sentence: "Master Yoda about sentence structure care does not."
If you really want to treat this sentence as if it contained "Yoda does not", you should still not generate all combinations. There are much faster ways, but I will only spend time on this if that indeed is your goal.
If you want to treat this sentence as one that does NOT contain "Yoda does not", then I think you can figure out for yourself how to speed up your code. Maybe look here.
I hope this helped. Let me know in case you need option 1.
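For reference, here is a hedged sketch (mine, not the answerer's) of what a faster take on option 1 could look like: check each phrase directly as an ordered subsequence of the sentence tokens instead of materialising every 1-3 word combination. The helper names are made up, and it counts each phrase at most once per sentence:

from collections import defaultdict

def contains_subsequence(tokens, phrase):
    """True if all words of `phrase` appear in `tokens` in order, gaps allowed."""
    it = iter(tokens)
    return all(word in it for word in phrase)

def phrase_counts(phrase_list, text_list):
    counts = [0] * len(phrase_list)
    # index phrases by their first word so each sentence only checks relevant phrases
    by_first_word = defaultdict(list)
    for i, phrase in enumerate(phrase_list):
        by_first_word[phrase[0]].append(i)
    for sentence in text_list:
        tokens = sentence.split()  # sentences are assumed to be preprocessed already
        for word in set(tokens):
            for i in by_first_word.get(word, []):
                if contains_subsequence(tokens, phrase_list[i]):
                    counts[i] += 1
    return counts

phrase_list = [('able',), ('able', 'us', 'software'), ('able', 'back'),
               ('printer', 'holidays'), ('printer', 'information')]
text_list = ['able add printer mac still working advise calling support team mon fri '
             'excluding bank holidays would able look',
             'additional information though pass comments team thanks']
print(phrase_counts(phrase_list, text_list))  # [1, 0, 0, 1, 0]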
I have been generating topics from the Yelp dataset of customer reviews using Latent Dirichlet Allocation (LDA) in Python (gensim package). While generating tokens, I am selecting only words of length >= 3 from the reviews (using RegexpTokenizer):
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w{3,}')
tokens = tokenizer.tokenize(review)
This will allow us to filter out noisy words shorter than 3 characters while creating the corpus document.
How will filtering out these words affect the performance of the LDA algorithm?
Generally speaking, for the English language, one- and two-letter words don't add information about the topic. If they don't add value, they should be removed during the pre-processing step. As with most algorithms, less data in speeds up the execution time.
Words shorter than three letters are generally stop words. LDA builds topics, so imagine you generate this topic:
[I, him, her, they, we, and, or, to]
compared to:
[shark, bull, greatwhite, hammerhead, whaleshark]
Which is more telling? This is why it is important to remove stopwords. This is how I do that:
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words of 3 letters or fewer
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic,
# then append the tokens to a list
def preprocess(text):
    newStopWords = ['your_stopword1', 'your_stopword2']
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS \
                and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
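A quick usage sketch, assuming the code above and that the NLTK WordNet data has been downloaded (the review text here is made up):

sample_review = "The hammerhead and the great white are beautiful sharks"
print(preprocess(sample_review))
# roughly ['hammerhead', 'great', 'white', 'beauti', 'shark'];
# the exact output depends on the stemmer and the stopword lists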
Given a single word such as "table", I want to identify its most common usage: noun, verb or adjective. I want to do this in Python. Is there anything available besides WordNet? I'd prefer not to use WordNet, but if I do use it, how exactly would I do this with it?
import nltk
text = 'This is a table. We should table this offer. The table is in the center.'
text = nltk.word_tokenize(text)
result = nltk.pos_tag(text)
result = [i for i in result if i[0].lower() == 'table']
print(result) # [('table', 'JJ'), ('table', 'VB'), ('table', 'NN')]
If you have a word out of context and want to know its most common use, you could look at someone else's frequency table (e.g. WordNet), or you can do your own counts: just find a tagged corpus that's large enough for your purposes and count the word's instances. If you want to use a free corpus, NLTK includes the Brown corpus (1 million words). NLTK also provides methods for working with larger, non-free corpora (e.g., the British National Corpus).
import nltk
from nltk.corpus import brown
table = nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() == 'table')
print(table.most_common())
[('NN', 147), ('NN-TL', 50), ('VB', 1)]
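If you only need the single most frequent tag, FreqDist exposes it via max(). Since the Brown corpus uses its own tagset ('NN-TL' is a Brown tag, not a Penn Treebank one), you can also map to the coarse universal tagset; this is a sketch assuming the code above and NLTK's universal_tagset data:

# single most frequent tag for 'table' in Brown
print(table.max())  # 'NN'

# optionally collapse Brown's fine-grained tags to the universal tagset
table_universal = nltk.FreqDist(
    t for w, t in brown.tagged_words(tagset='universal') if w.lower() == 'table')
print(table_universal.max())  # most likely 'NOUN'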
I've come up with the regex below. I've narrowed the problem down to the inability to capture both 1-word and 2-word proper nouns.
(1) It would be great if I could add a condition that defaults to the longer capture when given a choice between two.
AND
(2) It would be great if I could tell the regex to only consider a string if it starts with a preposition, such as On|At|For. I was playing around with something like this, but it isn't working:
(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})
How would I do 1 and 2?
My current regex:
r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'
I'd like to capture Ashoka, Shift Series, Compass Partners, and Kenneth Cole from:
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
What you're trying to do here is called "named entity recognition" (NER) in natural language processing. If you really want an approach that finds proper nouns, you may have to step up to NER. Thankfully, there are some easy-to-use functions in the nltk library:
import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)
Results:
res.productions()
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]
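If you only need the entity strings rather than the full parse, one way (a sketch, not part of the original answer) is to walk the chunk tree's subtrees and join their leaves:

entities = []
for subtree in res.subtrees():
    # named-entity chunks carry labels like ORGANIZATION, PERSON, GPE
    if subtree.label() in ('ORGANIZATION', 'PERSON', 'GPE'):
        entities.append(' '.join(word for word, tag in subtree.leaves()))

print(entities)  # e.g. ['Shift Series', 'Compass Partners', 'Kenneth Cole']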
I would use an NLP tool; the most popular for Python seems to be nltk. Regular expressions are really not the right way to go here... There's an example on the front page of the nltk site, copy-pasted below:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> entities = nltk.chunk.ne_chunk(tagged)
entities now contains your words tagged according to the Penn Treebank and chunked into named entities.
Not entirely correct, but this will match most of what you are looking for, with the exception that On is also captured (handled by the filter below).
import re
text = """
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
matches = p.findall(text)
print(matches)
output:
[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]
And then maybe you could implement a filter to go over this list.
def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an", "on", "in", "foo", "bar"]  # etc.
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches
or because python is cool:
def filter_false_positive(unfiltered_matches):
    black_list = ["an", "on", "in", "foo", "bar"]  # etc.
    return [match for match in unfiltered_matches if match.lower() not in black_list]
and you could use it like this:
# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]
matches = filter_false_positive(matches)
print(matches)
giving the final output:
['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
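To also cover requirement (2) from the question (only consider strings that start with a preposition such as On, At or For), one hedged option is to gate the proper-noun regex behind a simple prefix check instead of folding everything into one pattern; the greedy (?:\s[A-Z][a-z]+)* also helps with requirement (1), since it prefers the longest run of capitalised words. This is a sketch of my own, not part of the original answer:

import re

proper_noun_regex = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)*')
starts_with_prep = re.compile(r'^(?:On|At|For)\b', re.IGNORECASE)

lines = [
    "On its 25th anniversary, Ashoka",
    "at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole",
]

for line in lines:
    if starts_with_prep.match(line):
        matches = proper_noun_regex.findall(line)
        # drop the capitalised preposition itself if it was caught at the start
        matches = [m for m in matches if m not in ('On', 'At', 'For')]
        print(matches)
# ['Ashoka']
# ['Shift Series', 'Compass Partners', 'Kenneth Cole']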
The problem of determining whether a word is capitalized because it occurs at the beginning of a sentence or because it is a proper noun is not trivial.
'Kenneth Cole is a brand name.' vs. 'Can I eat something now?' vs. 'An Englishman had tea.'
In this case it is pretty difficult, so without something that can recognize a proper noun by other means (a black list, a database, etc.) it won't be easy. Regex is awesome, but I don't think it can interpret English at a grammatical level in any trivial way...
That being said, good luck!