Remove stopwords in Python

I'm developing an algorithm to remove stopwords.
I read a .txt file into a list and then pass that list to the removal function.
Example of file lines:
'mora vai nascer viver cair falar','positivo'
'deixa ver entendi vai crescer vai passar ve','positivo'
'so deveria ter foi agradeco de passei passei fez','positivo'
'nunca nao nao muito nao mais','negativo'
'a nao ate infelizmente ai ate quando','negativo'
'nao perto nao quanto menos nao sim nao nem simplesmente','negativo'
Code
with open('BasePalavras.txt') as arquivo:
    baseTeste = [linha.strip() for linha in arquivo]
stopwords = ['a', 'agora', 'algum', 'alguma', 'aquele', 'aqueles', 'de', 'deu', 'do', 'e', 'estou', 'esta', 'esta',
'ir', 'meu', 'muito', 'mesmo', 'no', 'nossa', 'o', 'outro', 'para', 'que', 'sem', 'talvez', 'tem', 'tendo',
'tenha', 'teve', 'tive', 'todo', 'um', 'uma', 'umas', 'uns', 'vou']
def removestopword(texto):
    frases = []
    for (palavras, emocao) in texto:
        semstopwords = [p for p in palavras.splits() if p not in stopwords]
        frases.append((semstopwords, emocao))
    return frases

print (removestopword(baseTeste))
ERROR
Traceback (most recent call last):
File "C:/Users/Rivaldo/PycharmProjects/Mineracao/Principal.py", line 22, in <module>
print (removestopword(baseTeste))
File "C:/Users/Rivaldo/PycharmProjects/Mineracao/Principal.py", line 17, in removestopword
for(palavras, emocao) in texto:
ValueError: too many values to unpack

Try this:
with open('BasePalavras.txt') as arquivo:
    baseTeste = [linha.strip().split(',') for linha in arquivo]
stopwords = ['a', 'agora', 'algum', 'alguma', 'aquele', 'aqueles', 'de', 'deu', 'do', 'e', 'estou', 'esta', 'esta',
'ir', 'meu', 'muito', 'mesmo', 'no', 'nossa', 'o', 'outro', 'para', 'que', 'sem', 'talvez', 'tem', 'tendo',
'tenha', 'teve', 'tive', 'todo', 'um', 'uma', 'umas', 'uns', 'vou']
def removestopword(texto):
    frases = []
    for (palavras, emocao) in texto:
        semstopwords = [p for p in palavras.split() if p not in stopwords]
        frases.append((semstopwords, emocao))
    return frases

print(removestopword(baseTeste))
I changed baseTeste = [linha.strip() for linha in arquivo] to baseTeste = [linha.strip().split(',') for linha in arquivo],
and
semstopwords = [p for p in palavras.splits() if p not in stopwords] to semstopwords = [p for p in palavras.split() if p not in stopwords].

Here's how I would do it.
stopwords = ['a', 'agora', 'algum', 'alguma', 'aquele', 'aqueles', 'de', 'deu', 'do', 'e', 'estou', 'esta', 'esta',
'ir', 'meu', 'muito', 'mesmo', 'no', 'nossa', 'o', 'outro', 'para', 'que', 'sem', 'talvez', 'tem', 'tendo',
'tenha', 'teve', 'tive', 'todo', 'um', 'uma', 'umas', 'uns', 'vou']
def remove_stopwords(text):
    phrases = []
    for (sentence, _) in text:
        sentence_without_stopwords = [word for word in sentence.split() if word not in stopwords]
        phrases.append(sentence_without_stopwords)
    return phrases

with open('input.txt') as raw_text:
    sentence_sentiments = []
    lines = [line for line in raw_text]
    for line in lines:
        # strip the trailing newline first, so the closing quote is removed correctly below
        sentence, sentiment = line.strip().split(',')
        sentence_sentiments.append((sentence[1:-1], sentiment[1:-1]))

print(remove_stopwords(sentence_sentiments))
Notice that, in your provided code, baseTeste is just a list of strings, one string per line of your input file. This is not what you want: the loop for (palavras, emocao) in texto: expects (sentence, sentiment) pairs, so you are missing the middle step of splitting each line into such a pair.
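Since the sample lines are comma-separated with single-quoted fields, the csv module could also do that middle step, including stripping the quotes. This is just a minimal sketch (assuming the same file name and the removestopword function defined above), not the original code:
import csv

# Sketch only: csv handles both the comma split and the single quotes around each field.
with open('BasePalavras.txt', newline='') as arquivo:
    leitor = csv.reader(arquivo, quotechar="'")
    baseTeste = [(frase, emocao) for frase, emocao in leitor]

print(removestopword(baseTeste))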

Related

How does the ISRI Stemmer give better stem words than the Lancaster or Snowball Stemmer?

I have this sample text which I want to tokenize and subsequently find the stem words for:
sample_text = "'I am a student from the University of Alabama. \
I was born in Ontario, Canada and I am a huge fan of the United States. \
I am going to get a degree in Philosophy to improve\
my chances of becoming a Philosophy professor. \
I have been working towards this goal for 4 years. \
I am currently enrolled in a PhD program. \
It is very difficult, but I am confident that it will be a good decision'"
Using the Lancaster Stemmer, I get the following result:
import re
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer

sentences = sent_tokenize(sample_text)
lancaster = LancasterStemmer()

for i in range(len(sentences)):
    sentences[i] = re.sub('[^A-Za-z0-9]', ' ', sentences[i])
    sentences[i] = word_tokenize(sentences[i])
    stopwds = [word.lower() for word in stopwords.words('english')]
    sentences[i] = [word.lower() for word in sentences[i] if word.lower() not in stopwds]
    sentences[i] = [lancaster.stem(word) for word in sentences[i]]
    print(sentences[i])
Output of Lancaster Stemmer:
['stud', 'univers', 'alabam']
['born', 'ontario', 'canad', 'hug', 'fan', 'unit', 'stat']
['going', 'get', 'degr', 'philosoph', 'improvemy', 'chant', 'becom', 'philosoph', 'profess']
['work', 'toward', 'goal', '4', 'year']
['cur', 'enrol', 'phd', 'program']
['difficult', 'confid', 'good', 'decid']
Output with the Snowball Stemmer:
['student', 'univers', 'alabama']
['born', 'ontario', 'canada', 'huge', 'fan', 'unit', 'state']
['go', 'get', 'degre', 'philosophi', 'improvemi', 'chanc', 'becom', 'philosophi', 'professor']
['work', 'toward', 'goal', '4', 'year']
['current', 'enrol', 'phd', 'program']
['difficult', 'confid', 'good', 'decis']
Output of Porter Stemmer
['student', 'univers', 'alabama']
['born', 'ontario', 'canada', 'huge', 'fan', 'unit', 'state']
['go', 'get', 'degre', 'philosophi', 'improvemi', 'chanc', 'becom', 'philosophi', 'professor']
['work', 'toward', 'goal', '4', 'year']
['current', 'enrol', 'phd', 'program']
['difficult', 'confid', 'good', 'decis']
Whereas the ISRI Stemmer gives me almost the same results as if the words had been lemmatized:
from nltk.stem import ISRIStemmer

sentences = sent_tokenize(sample_text)
isri = ISRIStemmer()

for i in range(len(sentences)):
    sentences[i] = re.sub('[^A-Za-z0-9]', ' ', sentences[i])
    sentences[i] = word_tokenize(sentences[i])
    stopwds = [word.lower() for word in stopwords.words()]
    sentences[i] = [word.lower() for word in sentences[i] if word.lower() not in stopwds]
    sentences[i] = [isri.stem(word) for word in sentences[i]]
    print(sentences[i])
Output:
['student', 'university', 'alabama']
['born', 'ontario', 'canada', 'huge', 'fan', 'united', 'states']
['going', 'get', 'degree', 'philosophy', 'improvemy', 'chances', 'becoming', 'philosophy', 'professor']
['working', 'towards', 'goal', '4', 'years']
['currently', 'enrolled', 'phd', 'program']
['difficult', 'confident', 'good', 'decision']
Can someone explain how the ISRI Stemmer gives almost lemmatized words?
The ISRI stemmer was developed specifically for Arabic, which is why you are getting such results when using it on English.
Here is a link to the NLTK source: https://www.nltk.org/_modules/nltk/stem/isri.html
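To see this from Python (a minimal sketch, assuming NLTK is installed), compare ISRI with Porter on a few English words. ISRI only strips Arabic prefixes, suffixes and patterns, so Latin-script words usually come back unchanged, which is why the output above looks almost lemmatized:
from nltk.stem import ISRIStemmer, PorterStemmer

isri = ISRIStemmer()
porter = PorterStemmer()

for word in ['university', 'currently', 'decision', 'confident']:
    # ISRI finds no Arabic affixes in these words, so it returns them as-is
    print(word, '| ISRI:', isri.stem(word), '| Porter:', porter.stem(word))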

Removing accents from keyword strings

This is word-processing code for a chatbot; it removes some articles and prepositions to make the input easier for the bot to read.
import json
from random import choice

class ChatterMessage:
    def __init__(self, raw):
        self.raw = str(raw).lower()
        self.processed_str = self.reduce()
        self.responses = self.get_responses()
        self.data = self.process_response()
        self.response = choice(self.data['response'])

    def remove_unwanted_chars(self, string):
        list_of_chars = ['?', ".", ",", "!", "#", "[", "]", "{", "}", "#", "$", "%", "*", "&", "(", ")", "-", "_", "+", "="]
        new_str = ""
        for char in string:
            if char not in list_of_chars:
                new_str += str(char)
        return new_str

    def get_responses(self, response_file="info.json"):
        with open(response_file, 'r') as file:
            return json.loads(file.read())

    def reduce(self):
        stopwords = ['de', 'a', 'o', 'que', 'e', 'é', 'do', 'da', 'em', 'um', 'para', 'com', 'não', 'uma', 'os', 'no', 'se', 'na', 'por', 'mais', 'as', 'dos', 'como', 'mas', 'ao', 'ele', 'das', 'à', 'seu', 'sua', 'ou', 'quando', 'muito', 'nos', 'já', 'eu', 'também', 'só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'depois', 'sem', 'mesmo', 'aos', 'seus', 'quem', 'nas', 'me', 'esse', 'eles', 'você', 'essa', 'num', 'nem', 'suas', 'meu', 'às', 'minha', 'numa', 'pelos', 'elas', 'qual', 'nós', 'lhe', 'deles', 'essas', 'esses', 'pelas', 'este', 'dele', 'tu', 'te', 'vocês', 'vos', 'lhes', 'meus', 'minhas', 'teu', 'tua', 'teus', 'tuas', 'nosso', 'nossa', 'nossos', 'nossas', 'dela', 'delas', 'esta', 'estes', 'estas', 'aquele', 'aquela', 'aqueles', 'aquelas', 'isto', 'aquilo', 'estou', 'está', 'estamos', 'estão', 'estive', 'esteve', 'estivemos', 'estiveram', 'estava', 'estávamos', 'estavam', 'estivera', 'estivéramos', 'esteja', 'estejamos', 'estejam', 'estivesse', 'estivéssemos', 'estivessem', 'estiver', 'estivermos', 'estiverem', 'hei', 'há', 'havemos', 'hão', 'houve', 'houvemos', 'houveram', 'houvera', 'houvéramos', 'haja', 'hajamos', 'hajam', 'houvesse', 'houvéssemos', 'houvessem', 'houver', 'houvermos', 'houverem', 'houverei', 'houverá', 'houveremos', 'houverão', 'houveria', 'houveríamos', 'houveriam', 'sou', 'somos', 'são', 'era', 'éramos', 'eram', 'fui', 'foi', 'fomos', 'foram', 'fora', 'fôramos', 'seja', 'sejamos', 'sejam', 'fosse', 'fôssemos', 'fossem', 'for', 'formos', 'forem', 'serei', 'será', 'seremos', 'serão', 'seria', 'seríamos', 'seriam', 'tenho', 'tem', 'temos', 'tém', 'tinha', 'tínhamos', 'tinham', 'tive', 'teve', 'tivemos', 'tiveram', 'tivera', 'tivéramos', 'tenha', 'tenhamos', 'tenham', 'tivesse', 'tivéssemos', 'tivessem', 'tiver', 'tivermos', 'tiverem', 'terei', 'terá', 'teremos', 'terão', 'teria', 'teríamos', 'teriam']
        custom_filter = []
        keywords_list = []
        strlist = self.raw.split(" ")
        for x in strlist:
            if x not in stopwords and x not in custom_filter:
                keywords_list.append(self.remove_unwanted_chars(x))
        return keywords_list

    def process_response(self):
        percentage = lambda x, y: (100 * y) / x
        total = sum(len(x['keywords']) for x in self.responses)
        most_acc = 0
        response_data = None
        acc = 0
        for value in self.responses:
            c = 0
            for x in value['keywords']:
                if str(x).lower() in self.processed_str:
                    c += 1
            if c > most_acc:
                most_acc = c
                acc = percentage(total, most_acc)
                print(acc)
                response_data = value
        if acc < 6:
            return {"response": "Sorry, I do not understand. Be more clear please"}
        for x in self.processed_str:
            if x not in response_data['keywords']:
                response_data['keywords'].append(x)
        return response_data

if __name__ == '__main__':
    while True:
        k = input("Você: ")
        res = ChatterMessage(k).response
        print("Bot:", res)
How do I remove accents from the keyword strings to "make it easier" for the chatbot to read? I found this explanation: How to remove string accents using Python 3? But I don't know how it would be applied to this code, as the bot always stops responding.
You could use the Python package unidecode, which replaces special characters with ASCII equivalents.
from unidecode import unidecode
text = "Björn, Łukasz and Σωκράτης."
print(unidecode(text))
# ==> Bjorn, Lukasz and Sokrates.
You could apply this to both the input and keywords.
# In the function definition of reduce(), place this line of code after
# stopwords = ['de', 'a', 'o', .....])
stopwords = [unidecode(s) for s in stopwords]
# In "__main__": replace k = input("Você: ") with the following line of code.
k = unidecode(input("Você: "))
If it makes sense, you could also force the strings to be all lowercase. This will make your string comparisons even more robust.
k = unidecode(input("Você: ").lower())
Because you requested the entire code:
import json
from random import choice
from unidecode import unidecode

class ChatterMessage:
    def __init__(self, raw):
        self.raw = str(raw).lower()
        self.processed_str = self.reduce()
        self.responses = self.get_responses()
        self.data = self.process_response()
        self.response = choice(self.data['response'])

    def remove_unwanted_chars(self, string):
        list_of_chars = ['?', ".", ",", "!", "#", "[", "]", "{", "}", "#", "$", "%", "*", "&", "(", ")", "-", "_", "+", "="]
        new_str = ""
        for char in string:
            if char not in list_of_chars:
                new_str += str(char)
        return new_str

    def get_responses(self, response_file="info.json"):
        with open(response_file, 'r') as file:
            return json.loads(file.read())

    def reduce(self):
        stopwords = ['de', 'a', 'o', 'que', 'e', 'é', 'do', 'da', 'em', 'um', 'para', 'com', 'não', 'uma', 'os', 'no', 'se', 'na', 'por', 'mais', 'as', 'dos', 'como', 'mas', 'ao', 'ele', 'das', 'à', 'seu', 'sua', 'ou', 'quando', 'muito', 'nos', 'já', 'eu', 'também', 'só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'depois', 'sem', 'mesmo', 'aos', 'seus', 'quem', 'nas', 'me', 'esse', 'eles', 'você', 'essa', 'num', 'nem', 'suas', 'meu', 'às', 'minha', 'numa', 'pelos', 'elas', 'qual', 'nós', 'lhe', 'deles', 'essas', 'esses', 'pelas', 'este', 'dele', 'tu', 'te', 'vocês', 'vos', 'lhes', 'meus', 'minhas', 'teu', 'tua', 'teus', 'tuas', 'nosso', 'nossa', 'nossos', 'nossas', 'dela', 'delas', 'esta', 'estes', 'estas', 'aquele', 'aquela', 'aqueles', 'aquelas', 'isto', 'aquilo', 'estou', 'está', 'estamos', 'estão', 'estive', 'esteve', 'estivemos', 'estiveram', 'estava', 'estávamos', 'estavam', 'estivera', 'estivéramos', 'esteja', 'estejamos', 'estejam', 'estivesse', 'estivéssemos', 'estivessem', 'estiver', 'estivermos', 'estiverem', 'hei', 'há', 'havemos', 'hão', 'houve', 'houvemos', 'houveram', 'houvera', 'houvéramos', 'haja', 'hajamos', 'hajam', 'houvesse', 'houvéssemos', 'houvessem', 'houver', 'houvermos', 'houverem', 'houverei', 'houverá', 'houveremos', 'houverão', 'houveria', 'houveríamos', 'houveriam', 'sou', 'somos', 'são', 'era', 'éramos', 'eram', 'fui', 'foi', 'fomos', 'foram', 'fora', 'fôramos', 'seja', 'sejamos', 'sejam', 'fosse', 'fôssemos', 'fossem', 'for', 'formos', 'forem', 'serei', 'será', 'seremos', 'serão', 'seria', 'seríamos', 'seriam', 'tenho', 'tem', 'temos', 'tém', 'tinha', 'tínhamos', 'tinham', 'tive', 'teve', 'tivemos', 'tiveram', 'tivera', 'tivéramos', 'tenha', 'tenhamos', 'tenham', 'tivesse', 'tivéssemos', 'tivessem', 'tiver', 'tivermos', 'tiverem', 'terei', 'terá', 'teremos', 'terão', 'teria', 'teríamos', 'teriam']
        stopwords = [unidecode(s) for s in stopwords]
        custom_filter = []
        keywords_list = []
        strlist = self.raw.split(" ")
        for x in strlist:
            if x not in stopwords and x not in custom_filter:
                keywords_list.append(self.remove_unwanted_chars(x))
        return keywords_list

    def process_response(self):
        percentage = lambda x, y: (100 * y) / x
        total = sum(len(x['keywords']) for x in self.responses)
        most_acc = 0
        response_data = None
        acc = 0
        for value in self.responses:
            c = 0
            for x in value['keywords']:
                if str(x).lower() in self.processed_str:
                    c += 1
            if c > most_acc:
                most_acc = c
                acc = percentage(total, most_acc)
                print(acc)
                response_data = value
        if acc < 6:
            return {"response": "Sorry, I do not understand. Be more clear please"}
        for x in self.processed_str:
            if x not in response_data['keywords']:
                response_data['keywords'].append(x)
        return response_data

if __name__ == '__main__':
    while True:
        k = unidecode(input("Você: "))
        res = ChatterMessage(k).response
        print("Bot:", res)

Got Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)

I was getting the following error while I was trying to process a list with spaCy.
TypeError: Argument 'string' has incorrect type (expected spacy.tokens.token.Token, got str)
Here is the code below
f= "MotsVides.txt"
file= open(f, 'r', encoding='utf-8')
stopwords = [line.rstrip() for line in file]
# stopwords =['alors', 'au', 'aucun', 'aussi', 'autre', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chaque', 'ci', 'comme', 'comment', 'ça', 'dans', 'des', 'du', 'dedans', 'dehors', 'depuis', 'deux', 'devrait', 'doit', 'donc', 'dos', 'droite', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'étaient', 'état', 'étions', 'été', 'être', 'fait', 'faites', 'fois', 'font', 'force', 'haut', 'hors', 'ici', 'il', 'ils', 's', 'juste', 'la', 'le', 'les', 'leur', 'là\t ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'nouveaux', 'ou', 'où', 'par', 'parce', 'parole', 'pas', 'personnes', 'peut', 'peu', 'pièce', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quels', 'qui\t sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'valeur', 'voie', 'voient', 'vont', 'votre', 'vous', 'vu']
def spacy_process(texte):
    for lt in texte:
        mytokens = nlp(lt)
        print(mytokens)
        mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word not in stopwords]
        print(type(mytokens2))

a = ['je suis la bonne personne et droit à la caricature.', 'Je suis la bonne personne et droit à la caricature.']
spacy_process(a)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-133-03cc18018278> in <module>
33
34 a = ['je suis la bonne personne et droit à la caricature.', 'Je suis la bonne personne et droit à la caricature.']
---> 35 spacy_process(a)
<ipython-input-133-03cc18018278> in spacy_process(texte)
28 mytokens = nlp(lt)
29 print(mytokens)
---> 30 mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word not in stopwords]
31
32 print(type(mytokens2))
<ipython-input-133-03cc18018278> in <listcomp>(.0)
28 mytokens = nlp(lt)
29 print(mytokens)
---> 30 mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word not in stopwords]
31
32 print(type(mytokens2))
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)
The issue is that word in word not in stopwords is a Token, not a string. The error is raised because the comparison pits the strings in your list against a Token object, which doesn't work.
With spacy you want to use word.text to get the string, not word.
The following code should work...
import spacy

stopwords = ['alors', 'au', 'aucun', 'aussi', 'autre']  # truncated for simplicity

nlp = spacy.load('en')

def spacy_process(texte):
    for lt in texte:
        mytokens = nlp(lt)
        mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word.text not in stopwords]
        print(mytokens2)

a = ['je suis la bonne personne et droit à la caricature.', 'Je suis la bonne personne et droit à la caricature.']
spacy_process(a)
BTW... Checking for a value in a list is fairly slow. You should convert your list to a set to speed things up.
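For example, a small sketch of that change:
# A set gives O(1) average-time membership tests, versus O(n) for a list.
stopwords = set(['alors', 'au', 'aucun', 'aussi', 'autre'])  # or set(your_full_stopword_list)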

Find words that can be made from a string in Python

I'm fairly new to Python and I'm not sure how to tackle my problem.
I'm trying to make a program that takes a string of 15 characters from a .txt file and finds the words you can make from those characters using a dictionary file, then outputs those words to another text file.
This is what I have tried:
attempting to find words that don't contain the characters and removing them from the list
various anagram-solver type programs off GitHub
I tried sudo pip3 install anagram-solver, but it has been running for 3 hours on 15 characters and is still going.
I'm new, so please tell me if I'm forgetting something.
If you're looking for "perfect" anagrams, i.e. those that contain exactly the same number of characters, not a subset, it's pretty easy:
take your word-to-find, sort it by its letters
take your dictionary, sort each word by its letters
if the sorted versions match, they're anagrams
def find_anagrams(seek_word):
    sorted_seek_word = sorted(seek_word.lower())
    for word in open("/usr/share/dict/words"):
        word = word.strip()  # remove trailing newline
        sorted_word = sorted(word.lower())
        if sorted_word == sorted_seek_word and word != seek_word:
            print(seek_word, word)

if __name__ == "__main__":
    find_anagrams("begin")
    find_anagrams("nicer")
    find_anagrams("decor")
prints (on my macOS machine – Windows machines won't have /usr/share/dict/words by default, and some Linux distributions need it installed separately)
begin being
begin binge
nicer cerin
nicer crine
decor coder
decor cored
decor Credo
EDIT
A second variation that finds all the words that can be assembled from the letters of the original word, using collections.Counter:
import collections

def find_all_anagrams(seek_word):
    seek_word_counter = collections.Counter(seek_word.lower())
    for word in open("/usr/share/dict/words"):
        word = word.strip()  # remove trailing newline
        word_counter = collections.Counter(word.strip())
        if word != seek_word and all(
            n <= seek_word_counter[l] for l, n in word_counter.items()
        ):
            yield word

if __name__ == "__main__":
    print("decoration", set(find_all_anagrams("decoration")))
Outputs e.g.
decoration {'carte', 'drona', 'roit', 'oat', 'cantred', 'rond', 'rid', 'centroid', 'trine', 't', 'tenai', 'cond', 'toroid', 'recon', 'contra', 'dain', 'cootie', 'iao', 'arctoid', 'oner', 'indart', 'tine', 'nace', 'rident', 'cerotin', 'cran', 'eta', 'eoan', 'cardoon', 'tone', 'trend', 'trinode', 'coaid', 'ranid', 'rein', 'end', 'actine', 'ide', 'cero', 'iodate', 'corn', 'oer', 'retia', 'nidor', 'diter', 'drat', 'tec', 'tic', 'creat', 'arent', 'coon', 'doater', 'ornoite', 'terna', 'docent', 'tined', 'edit', 'octroi', 'eric', 'read', 'toned', 'c', 'tera', 'can', 'rocta', 'cortina', 'adonite', 'iced', 'no', 'natr', 'net', 'oe', 'rodeo', 'actor', 'otarine', 'on', 'cretin', 'ericad', 'dance', 'tornade', 'tinea', 'coontie', 'anerotic', 'acrite', 'ra', 'danio', 'inroad', 'inde', 'tied', 'tar', 'coronae', 'tid', 'rad', 'doc', 'derat', 'tea', 'acerin', 'ronde', 'recti', 'areito', 'drain', 'odontic', 'octoad', 'rio', 'actin', 'tread', 'rect', 'ariot', 'road', 'doctrine', 'enactor', 'indoor', 'toco', 'ton', 'trice', 'norite', 'nea', 'coda', 'noria', 'rot', 'trona', 'rice', 'arite', 'eria', 'orad', 'rate', 'toed', 'enact', 'crinet', 'cento', 'arid', 'coot', 'nat', 'nar', 'cain', 'at', 'antired', 'ear', 'triode', 'doter', 'cedarn', 'orna', 'rand', 'tari', 'crea', 'tiar', 'retan', 'tire', 'cora', 'aroid', 'iron', 'tenio', 'enroot', 'd', 'oaric', 'acetin', 'tain', 'neat', 'noter', 'tien', 'aortic', 'tode', 'dicer', 'irate', 'tie', 'canid', 'ado', 'noticer', 'arn', 'nacre', 'ceration', 'ratine', 'denaro', 'cotoin', 'aint', 'canto', 'cinter', 'decani', 'roon', 'donor', 'acnode', 'aide', 'doer', 'tacnode', 'oread', 'acetoin', 'rine', 'acton', 'conoid', 'a', 'otocrane', 'norate', 'care', 'ticer', 'io', 'detain', 'cedar', 'ta', 'toadier', 'atone', 'cornet', 'dacoit', 'toric', 'orate', 'arni', 'adroit', 'rend', 'tanier', 'rooted', 'doit', 'dier', 'odorate', 'trica', 'rated', 'cotonier', 'dine', 'roid', 'cairned', 'cat', 'i', 'coin', 'octine', 'trod', 'orc', 'cardo', 'eniac', 'arenoid', 'erd', 'creant', 'oda', 'ratio', 'ceria', 'ad', 'acorn', 'dorn', 'deric', 'credit', 'door', 'cinder', 'cantor', 'er', 'doon', 'coner', 'donate', 'roe', 'tora', 'antic', 'racoon', 'ooid', 'noa', 'tae', 'coroa', 'earn', 'retain', 'canted', 'norie', 'rota', 'tao', 'redan', 'rondo', 'entia', 'ctenoid', 'cent', 'daroo', 'inrooted', 'roed', 'adore', 'coat', 'e', 'rat', 'deair', 'arend', 'coir', 'acid', 'coronate', 'rodent', 'acider', 'iota', 'codo', 'redaction', 'cot', 'aeric', 'tonic', 'candier', 'decart', 'dicta', 'dot', 'recoat', 'caroon', 'rone', 'tarie', 'tarin', 'teca', 'oar', 'ocrea', 'ante', 'creation', 'tore', 'conto', 'tairn', 'roc', 'conter', 'coeditor', 'certain', 'roncet', 'decator', 'not', 'coatie', 'toran', 'caid', 'redia', 'root', 'cad', 'cartoon', 'n', 'coed', 'cand', 'neo', 'coronadite', 'dare', 'dartoic', 'acoin', 'detar', 'dite', 'trade', 'train', 'ordinate', 'racon', 'citron', 'dan', 'doat', 'nito', 'tercia', 'rote', 'cooer', 'acone', 'rita', 'caret', 'dern', 'enatic', 'too', 'cried', 'tade', 'dit', 'orient', 'ria', 'torn', 'coati', 'cnida', 'note', 'tried', 'acrid', 'nitro', 'acron', 'tern', 'one', 'it', 'naio', 'dor', 'ea', 'ca', 'ire', 'inert', 'orcanet', 'cine', 'coe', 'nardoo', 'deota', 'den', 'toi', 'adion', 'to', 'rite', 'nectar', 'rane', 'riant', 'cod', 'de', 'adit', 'airt', 'ie', 'retin', 'toon', 'cane', 'aeon', 'are', 'cointer', 'actioner', 'crin', 'detrain', 'art', 'cant', 'ort', 'tored', 'antoeci', 'tier', 'cite', 'onto', 'coater', 'tranced', 'atonic', 'roi', 'in', 'roan', 'decoat', 'rain', 'cronet', 
'ronco', 'dont', 'citer', 'redact', 'cider', 'nor', 'octan', 'ration', 'doina', 'rie', 'aero', 'noted', 'crate', 'crain', 'cadet', 'condite', 'ran', 'odeon', 'date', 'eat', 'intoed', 'cation', 'carone', 'ratoon', 'retina', 'tiao', 'nice', 'nodi', 'codon', 'coo', 'torc', 'dent', 'entad', 'ne', 'toe', 'dae', 'decant', 'redcoat', 'coiner', 'irade', 'air', 'oint', 'coronet', 'radon', 'ce', 'octonare', 'oaten', 'citrean', 'dice', 'dancer', 'carotid', 'cretion', 'don', 'cion', 'nei', 'tead', 'nori', 'nacrite', 'ootid', 'rancid', 'dornic', 'orenda', 'cairn', 'aroon', 'coardent', 'aider', 'notice', 'cored', 'adorn', 'tad', 'carid', 'otic', 'dian', 'od', 'dint', 'tercio', 'die', 'conred', 'tice', 'rant', 'candor', 'anti', 'dar', 'antre', 'cornea', 'ordain', 'corona', 'recta', 'redo', 'tare', 'coranto', 'action', 'caird', 'creta', 'naid', 'tri', 'acre', 'crane', 'coated', 'citronade', 'anoetic', 'tenor', 'anode', 'triad', 'ceratoid', 'rod', 'idea', 'carton', 'cortin', 'endaortic', 'dicot', 'tend', 'da', 'tod', 'erotica', 'cord', 'coreid', 'toader', 'dace', 'tan', 'editor', 'rection', 'toner', 'cone', 'ni', 'tide', 'coder', 'din', 'ocote', 'ore', 'daer', 'octane', 'darn', 'do', 'reit', 'na', 'catenoid', 'tron', 'condor', 'crinated', 'cordon', 'crone', 'toad', 'noir', 'into', 'tirade', 'nadir', 'ant', 'ade', 'droit', 'icon', 'drone', 'ared', 'cardin', 'nid', 'dire', 'orcin', 'donator', 'rani', 'tane', 'ace', 'iodo', 'doria', 'ride', 'eon', 'ornate', 'cedrat', 'aire', 'carotin', 'dation', 'tear', 'onca', 'cote', 'taroc', 'con', 'nod', 'dinero', 'ecad', 'recant', 'ae', 'octad', 'cor', 'doctor', 'acridone', 'neti', 'cordite', 'crotin', 'aneroid', 'diota', 'coorie', 'dita', 'aconite', 'nard', 'cadent', 'ectad', 'rance', 'rea', 'tai', 'denat', 'rood', 'acne', 'decan', 'ani', 'rit', 'cit', 'cetin', 'odor', 'acorned', 'iceroot', 'inro', 'crood', 'daric', 'dacite', 'trone', 'acier', 'reina', 'oncia', 'drant', 'acrodont', 'nacred', 'cotrine', 'dinar', 'tean', 'atoner', 'toorie', 'nadorite', 'cardon', 'taen', 'tin', 'conte', 'acoine', 'dater', 'diact', 'aid', 'anodic', 'coronated', 'direct', 're', 'era', 'anticor', 'triace', 'octoid', 'dao', 'corta', 'edict', 'trode', 'ode', 'orant', 'niter', 'centrad', 'cater', 'tronc', 'coronad', 'r', 'toro', 'ar', 'once', 'ora', 'trace', 'creodont', 'erotic', 'ai', 'troca', 'ion', 'tecon', 'tra', 'acor', 'radio', 'acred', 'croon', 'tricae', 'recto', 'riden', 'andorite', 'taro', 'red', 'dear', 'ate', 'tinder', 'trin', 'deacon', 'ardent', 'aer', 'arc', 'crine', 'dart', 'diet', 'riot', 'tanrec', 'tor', 'noetic', 'ret', 'trance', 'ona', 'rind', 'coto', 'daoine', 'teind', 'toa', 'inter', 'code', 'cart', 'aion', 'detin', 'core', 'oont', 'rent', 'cedrin', 'card', 'trained', 'o', 'recoin', 'cro', 'and', 'diner', 'id', 'cordant', 'cedron', 'ditone', 'odic', 'cadi', 'cerin', 'nit', 'ecoid', 'nide', 'ean', 'andric', 'tind', 'raid', 'crena', 'oroide', 'roadite', 'canter', 'idant', 'cade', 'race', 'ten', 'caner', 'tarn', 'cooter', 'etna', 'tornadic', 'irone', 'ice', 'en', 'oord', 'oared', 'draine', 'cordate', 'react', 'reaction', 'tornado', 'troco', 'niota', 'carotenoid', 'an', 'cader', 'naric', 'car', 'centiar', 'ti', 'cearin', 'aroint', 'crined', 'iter', 'di', 'or', 'trio', 'dari', 'oration', 'orcein', 'coned', 'odorant', 'dean', 'coadore', 'cate', 'drate', 'dirten', 'ted', 'done', 'cadre', 'ocean', 'tired', 'adet', 'dirt', 'te', 'nae', 'ceti', 'cern', 'rotan', 'doe', 'roto', 'dote', 'node', 'ait', 'act', 'canoe', 'rode'}

Finding word context with regular expressions

I have created a function to search for the contexts of a given word (w) in a text, with left and right as parameters for flexibility in the number of words recorded.
import re

def get_context(text, w, left, right):
    text.insert(0, "*START*")
    text.append("*END*")
    all_contexts = []
    for i in range(len(text)):
        if re.match(w, text[i], 0):
            if i < left:
                context_left = text[:i]
            else:
                context_left = text[i-left:i]
            if len(text) < (i+right):
                context_right = text[i:]
            else:
                context_right = text[i:(i+right+1)]
            context = context_left + context_right
            all_contexts.append(context)
    return all_contexts
So, for example, if I have a text in the form of a list like this:
text = ['Python', 'is', 'dynamically', 'typed', 'language', 'Python',
'functions', 'really', 'care', 'about', 'what', 'you', 'pass', 'to',
'them', 'but', 'you', 'got', 'it', 'the', 'wrong', 'way', 'if', 'you',
'want', 'to', 'pass', 'one', 'thousand', 'arguments', 'to', 'your',
'function', 'then', 'you', 'can', 'explicitly', 'define', 'every',
'parameter', 'in', 'your', 'function', 'definition', 'and', 'your',
'function', 'will', 'be', 'automagically', 'able', 'to', 'handle',
'all', 'the', 'arguments', 'you', 'pass', 'to', 'them', 'for', 'you']
The function works fine for example:
get_context(text, "function",2,2)
[['language', 'python', 'functions', 'really', 'care'], ['to', 'your', 'function', 'then', 'you'], ['in', 'your', 'function', 'definition', 'and'], ['and', 'your', 'function', 'will', 'be']]
Now I am trying to build a dictionary with the contexts of every word in the text doing the following:
d = {}
for w in set(text):
    d[w] = get_context(text, w, 2, 2)
But I am getting this error.
Traceback (most recent call last):
File "<pyshell#32>", line 2, in <module>
d[w] = get_context(text,w,2,2)
File "<pyshell#20>", line 9, in get_context
if re.match(w,text[i], 0):
File "/usr/lib/python3.4/re.py", line 160, in match
return _compile(pattern, flags).match(string)
File "/usr/lib/python3.4/re.py", line 294, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.4/sre_compile.py", line 568, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.4/sre_parse.py", line 760, in parse
p = _parse_sub(source, pattern, 0)
File "/usr/lib/python3.4/sre_parse.py", line 370, in _parse_sub
itemsappend(_parse(source, state))
File "/usr/lib/python3.4/sre_parse.py", line 579, in _parse
raise error("nothing to repeat")
sre_constants.error: nothing to repeat
I don't understand this error. Can anyone help me with this?
The problem is that "*START*" and "*END*" are being interpreted as regular expressions. Also, note that inserting "*START*" and "*END*" into text at the beginning of the function will cause problems if you call it more than once; you should do it just once.
Here is a complete version of the working code:
import re

def get_context(text, w, left, right):
    all_contexts = []
    for i in range(len(text)):
        if re.match(w, text[i], 0):
            if i < left:
                context_left = text[:i]
            else:
                context_left = text[i-left:i]
            if len(text) < (i+right):
                context_right = text[i:]
            else:
                context_right = text[i:(i+right+1)]
            context = context_left + context_right
            all_contexts.append(context)
    return all_contexts
text = ['Python', 'is', 'dynamically', 'typed', 'language',
'Python', 'functions', 'really', 'care', 'about', 'what',
'you', 'pass', 'to', 'them', 'but', 'you', 'got', 'it', 'the',
'wrong', 'way', 'if', 'you', 'want', 'to', 'pass', 'one',
'thousand', 'arguments', 'to', 'your', 'function', 'then',
'you', 'can', 'explicitly', 'define', 'every', 'parameter',
'in', 'your', 'function', 'definition', 'and', 'your',
'function', 'will', 'be', 'automagically', 'able', 'to', 'handle',
'all', 'the', 'arguments', 'you', 'pass', 'to', 'them', 'for', 'you']
text.insert(0, "START")
text.append("END")

d = {}
for w in set(text):
    d[w] = get_context(text, w, 2, 2)
Maybe you can replace re.match(w,text[i], 0) with w == text[i].
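A sketch of that simplification (only the matching test changes; note that, unlike re.match, exact equality no longer matches 'functions' when w is 'function'):
def get_context(text, w, left, right):
    all_contexts = []
    for i in range(len(text)):
        if w == text[i]:  # plain equality instead of re.match
            context_left = text[max(i - left, 0):i]  # slicing clamps at the start
            context_right = text[i:i + right + 1]    # and past the end
            all_contexts.append(context_left + context_right)
    return all_contexts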
The whole thing can be rewritten very succinctly as follows,
text = 'Python is dynamically typed language Python functions really care about what you pass to them but you got it the wrong way if you want to pass one thousand arguments to your function then you can explicitly define every parameter in your function definition and your function will be automagically able to handle all the arguments you pass to them for you'
Keeping it as a str, and assuming the word of interest is 'function',
pat = re.compile(r'(\w+\s\w+\s)functions?(?=(\s\w+\s\w+))')
pat.findall(text)
[('language Python ', ' really care'),
('to your ', ' then you'),
('in your ', ' definition and'),
('and your ', ' will be')]
Now, minor customization of the regex will be needed to allow for words like, say, functional or functioning, not only function or functions. But the important idea is to do away with indexing and go more functional.
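For instance, one possible tweak (only checked against the sample text above) is to widen the suffix match:
# \w* after "function" also matches functional, functioning, etc.
pat = re.compile(r'(\w+\s\w+\s)function\w*(?=(\s\w+\s\w+))')
pat.findall(text)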
Please comment if this doesn't work out for you, when you apply it in bulk.
At least one of the elements in text contains characters that are special in a regular expression. If you're just trying to find whether the word is in the string, just use str.startswith, i.e.
if text[i].startswith(w): # instead of re.match(w,text[i], 0):
But I don't understand why you are checking for that anyway, and not for equality.
