I am trying to remove certain words (in addition to using stopwords) from a list of text strings, but it is not working for some reason.
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
exclude = ['am', 'there','here', 'for', 'of', 'user']
new_doc = [word for word in documents if word not in exclude]
print new_doc
OUTPUT
['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']
As you can see, no words in EXCLUDE are removed from DOCUMENTS ("for" is a prime example).
It works with this expression:
new_doc = [word for word in str(documents).split() if word not in exclude]
But then how do I get back the initial elements of DOCUMENTS (albeit "cleaned" ones)?
I will greatly appreciate your help!
You should split each line into words before filtering them:
new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]
You are looping over the sentences, not the words. You need to split each sentence, use a nested loop to filter its words, and then join the result back together.
>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>>
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
Alternatively, instead of a nested list comprehension with splitting and filtering, you can use a regex to replace the excluded words with an empty string via the re.sub function:
>>> import re
>>>
>>> new_doc = [re.sub(r'|'.join(exclude),'',sent) for sent in documents]
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
r'|'.join(exclude) concatenates the words with a pipe (which means logical OR in regex).
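Note that a plain alternation like this can also match inside longer words and leaves doubled spaces wherever a word is removed. As a minimal sketch of a safer variant (assuming the documents and exclude lists above), you could anchor the pattern with word boundaries and collapse the leftover whitespace:
import re

# word boundaries prevent partial matches inside longer words;
# the second sub collapses the doubled spaces left by the removal
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')
new_doc = [re.sub(r'\s{2,}', ' ', pattern.sub('', sent)).strip() for sent in documents]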
I am looking for a way to find, in a sentence, whether a common noun refers to a place. This is easy for proper nouns, but I didn't find any straightforward solution for common nouns.
For example, given the sentence "After a violent and brutal attack, a group of college students travel into the countryside to find refuge from the town they fled, but soon discover that the small village is also home to a coven of serial killers", I would like to mark the following nouns as referring to places: countryside, town, small village, home.
Here is the code I'm using:
import spacy
nlp = spacy.load('en_core_web_lg')
# Process whole documents
text = ("After a violent and brutal attack, a group of college students travel into the countryside to find refuge from the town they fled, but soon discover that the small village is also home to a coven of satanic serial killers")
doc = nlp(text)
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
Which gives as output the following:
Noun phrases: ['a violent and brutal attack', 'a group', 'college students', 'the countryside', 'refuge', 'the town', 'they', 'the small village', 'a coven', 'serial killers']
Verbs: ['travel', 'find', 'flee', 'discover']
You can use WordNet for this.
from nltk.corpus import wordnet as wn

loc = wn.synsets("location")[0]

def is_location(candidate):
    for ss in wn.synsets(candidate):
        # only get those where the synset matches exactly
        name = ss.name().split(".", 1)[0]
        if name != candidate:
            continue
        hit = loc.lowest_common_hypernyms(ss)
        if hit and hit[0] == loc:
            return True
    return False

# true things
for word in ("countryside", "town", "village", "home"):
    print(is_location(word), word, sep="\t")

# false things
for word in ("cat", "dog", "fish", "cabbage", "knife"):
    print(is_location(word), word, sep="\t")
Note that sometimes the synsets are wonky, so be sure to double-check everything.
Also, for things like "small village", you'll have to pull out the head noun, but it'll just be the last word.
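For instance, here is a minimal sketch (assuming the spaCy doc from the question and the is_location helper above) that checks the head noun of each noun chunk:
# check the head noun of each spaCy noun chunk against WordNet
for chunk in doc.noun_chunks:
    head = chunk.root.text.lower()  # e.g. "village" for "the small village"
    print(is_location(head), head, sep="\t")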
I'm new to NLP and to Python.
I'm trying to use object standardization to replace abbreviations with their full meaning. I found code online and altered it to test it on a Wikipedia excerpt, but all the code does is print out the original text. Can anyone help out a newbie in need?
Here's the code:
import nltk

lookup_dict = {'EC': 'European Commission', 'EU': 'European Union',
               "ECSC": "European Coal and Steel Commuinty",
               "EEC": "European Economic Community"}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text

_lookup_words(
    "The High Authority was the supranational administrative executive of the new European Coal and Steel Community ECSC. It took office first on 10 August 1952 in Luxembourg. In 1958, the Treaties of Rome had established two new communities alongside the ECSC: the eec and the European Atomic Energy Community (Euratom). However their executives were called Commissions rather than High Authorities")
Thanks in advance, any help is appreciated!
In your case, the lookup dict has the abbreviations EC and ECSC among the words found in your input sentence. Calling split() splits the input on whitespace, but your sentence contains the tokens "ECSC." and "ECSC:" rather than "ECSC", so the lookup never matches. I would suggest stripping the punctuation and running it again.
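As a minimal sketch (assuming the lookup_dict above; note also that the code compares word.lower() against uppercase keys, so even a clean "ECSC" would not match), you could strip punctuation and normalize the case before the lookup:
import string

def _lookup_words(input_text):
    new_words = []
    for word in input_text.split():
        # "ECSC." -> "ECSC", "eec" -> "EEC"; note that trailing punctuation
        # on a matched abbreviation is dropped in this sketch
        key = word.strip(string.punctuation).upper()
        new_words.append(lookup_dict.get(key, word))
    return " ".join(new_words)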
I've been working on latent semantic analysis (LSA) and applied this example: https://radimrehurek.com/gensim/tut2.html
It shows how terms cluster under topics, but I couldn't find anything about how to cluster documents under topics.
In that example, it says that 'It appears that according to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic'.
How can I relate those five documents to their topics with Python code?
You can find my python code below. I would appreciate any help.
from numpy import asarray
from gensim import corpora, models, similarities
#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]
# extract 2 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
#for i in range(0, lsi.num_topics-1):
for i in range(0, 3):
    print lsi.print_topics(i)
for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)
corpus_lsi is a list of 9 vectors, one per document.
Each vector stores at its i-th index the weight with which that document loads on topic i.
If you just want to assign each document to a single topic, choose the topic index with the highest (absolute) value in its vector.
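For example, here is a minimal sketch (assuming the corpus_lsi computed in the code above) that assigns each document to its strongest topic:
# pick, for each document, the topic with the largest absolute LSI weight
for doc_id, doc in enumerate(corpus_lsi):
    if doc:  # skip documents with an empty LSI vector
        topic_id, weight = max(doc, key=lambda item: abs(item[1]))
        print("document %d -> topic %d (weight %.3f)" % (doc_id, topic_id, weight))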
I'm trying to run this example code in Python 2.7 for LSI text clustering.
import gensim
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]
# extract 400 LSI topics; use the default one-pass algorithm
lsi = gensim.models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=400)
# print the most contributing words (both positively and negatively) for each of the first ten topics
lsi.print_topics(10)
But the second-to-last command produces this warning:
Warning (from warnings module):
File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2499
VisibleDeprecationWarning)
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
Please let me know what I am doing wrong or if I need to update anything to make it work.
The lda.show_topics() method in the following code only prints the distribution of the top 10 words for each topic. How do I print out the full distribution of all the words in the corpus?
from gensim import corpora, models
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)
for i in lda.show_topics():
    print i
There is a parameter called topn in show_topics() where you can specify the number of top N words you want from the word distribution over each topic; see http://radimrehurek.com/gensim/models/ldamodel.html
So instead of the default lda.show_topics(), you can pass len(dictionary) to get the full word distribution for each topic:
for i in lda.show_topics(topn=len(dictionary)):
    print i
show_topics() takes two parameters, num_topics and num_words: for num_topics topics it returns the num_words most significant words (10 words per topic, by default); see http://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.show_topics
So you can use len(lda.id2word) for the full word distribution of each topic, and lda.num_topics to cover all topics in your LDA model:
for i in lda.show_topics(formatted=False, num_topics=lda.num_topics, num_words=len(lda.id2word)):
    print i
The code below will print your words as well as their probabilities. I have printed the top 10 words; you can change num_words=10 to print more words per topic.
for words in lda.show_topics(formatted=False, num_words=10):
    print(words[0])
    print("******************************")
    for word_prob in words[1]:
        print("(", dictionary[int(word_prob[0])], ",", word_prob[1], ")", end="")
    print("")
    print("******************************")