I'm working on a classification task, using a movie reviews dataset from Kaggle. The part with which I'm struggling is a series of functions, in which the output of one becomes the input of the next.
Specifically, in the code provided, the function "word_token" takes the input "phraselist", tokenizes it, and returns a tokenized document titled "phrasedocs". The only problem is that it doesn't seem to be working, because when I take that theoretical document "phrasedocs" and enter it into the next function, "process_token", I get:
NameError: name 'phrasedocs' is not defined
I am completely willing to accept that there is something simple I have overlooked, but I've been on this for hours and I can't figure it out. I would appreciate any help.
I have tried proofreading and debugging the code, but my Python expertise is not great.
import os
import random
import re
import nltk

# This function obtains data from train.tsv
def processkaggle(dirPath, limitStr):
    # Convert the limit argument from a string to an int
    limit = int(limitStr)
    os.chdir(dirPath)
    f = open('./train.tsv', 'r')
    # Loop over the lines in the file, using the first limit of them
    phrasedata = []
    for line in f:
        # Ignore the first line starting with Phrase, then read all lines
        if (not line.startswith('Phrase')):
            # Remove the final end-of-line character
            line = line.strip()
            # Each line has four items, separated by tabs
            # Ignore the phrase and sentence IDs, keep the phrase and sentiment
            phrasedata.append(line.split('\t')[2:4])
    return phrasedata
# Randomize and subset data
def random_phrase(phrasedata):
    random.shuffle(phrasedata)  # phrasedata initiated in function processkaggle
    phraselist = phrasedata[:limit]
    for phrase in phraselist[:10]:
        print(phrase)
    return phraselist
# Tokenization
def word_token(phraselist):
    phrasedocs = []
    for phrase in phraselist:
        tokens = nltk.word_tokenize(phrase[0])
        phrasedocs.append((tokens, int(phrase[1])))
    return phrasedocs
# Pre-processing
# Convert all tokens to lower case
def lower_case(doc):
    return [w.lower() for w in doc]
# Clean text, fixing confusion over apostrophes
def clean_text(doc):
    cleantext = []
    for review_text in doc:
        review_text = re.sub(r"it 's", "it is", review_text)
        review_text = re.sub(r"that 's", "that is", review_text)
        review_text = re.sub(r"\'s", "\'s", review_text)
        review_text = re.sub(r"\'ve", "have", review_text)
        review_text = re.sub(r"wo n't", "will not", review_text)
        review_text = re.sub(r"do n't", "do not", review_text)
        review_text = re.sub(r"ca n't", "can not", review_text)
        review_text = re.sub(r"sha n't", "shall not", review_text)
        review_text = re.sub(r"n\'t", "not", review_text)
        review_text = re.sub(r"\'re", "are", review_text)
        review_text = re.sub(r"\'d", "would", review_text)
        review_text = re.sub(r"\'ll", "will", review_text)
        cleantext.append(review_text)
    return cleantext
# Remove punctuation and numbers
def rem_no_punct(doc):
    remtext = []
    for text in doc:
        punctuation = re.compile(r'[-_.?!/\%#,":;\'{}<>~`()|0-9]')
        word = punctuation.sub("", text)
        remtext.append(word)
    return remtext
# Remove stopwords
def rem_stopword(doc):
    stopwords = nltk.corpus.stopwords.words('english')
    updatestopwords = [word for word in stopwords if word not in ['not','no','can','has','have','had','must','shan','do','should','was','were','won','are','cannot','does','ain','could','did','is','might','need','would']]
    return [w for w in doc if w not in updatestopwords]
# Lemmatization
def lemmatizer(doc):
    wnl = nltk.WordNetLemmatizer()
    lemma = [wnl.lemmatize(t) for t in doc]
    return lemma
# Stemming
def stemmer(doc):
    porter = nltk.PorterStemmer()
    stem = [porter.stem(t) for t in doc]
    return stem
# This function combines all the previous pre-processing functions into one, which is helpful
# if I want to alter these settings for experimentation later
def process_token(phrasedocs):
    phrasedocs2 = []
    for phrase in phrasedocs:
        tokens = nltk.word_tokenize(phrase[0])
        tokens = lower_case(tokens)
        tokens = clean_text(tokens)
        tokens = rem_no_punct(tokens)
        tokens = rem_stopword(tokens)
        tokens = lemmatizer(tokens)
        tokens = stemmer(tokens)
        # Any words that pass through the processing steps above are added to phrasedocs2
        phrasedocs2.append((tokens, int(phrase[1])))
    return phrasedocs2
dirPath = 'C:/Users/J/kagglemoviereviews/corpus'
processkaggle(dirPath, 5000) # returns 'phrasedata'
random_phrase(phrasedata) # returns 'phraselist'
word_token(phraselist) # returns 'phrasedocs'
process_token(phrasedocs) # returns phrasedocs2
NameError Traceback (most recent call last)
<ipython-input-120-595bc4dcf121> in <module>()
5 random_phrase(phrasedata) # returns 'phraselist'
6 word_token(phraselist) # returns 'phrasedocs'
----> 7 process_token(phrasedocs) # returns phrasedocs2
8
9
NameError: name 'phrasedocs' is not defined
Simply you defined "phrasedocs" inside a function which is not seen from outside and the function return should be captured in a variable,
edit your code:
dirPath = 'C:/Users/J/kagglemoviereviews/corpus'
phrasedata = processkaggle(dirPath, 5000) # returns 'phrasedata'
phraselist = random_phrase(phrasedata) # returns 'phraselist'
phrasedocs = word_token(phraselist) # returns 'phrasedocs'
phrasedocs2 = process_token(phrasedocs) # returns phrasedocs2
You have only created the variable phrasedocs inside a function, so the variable is not defined for any of your code outside that function. When you pass the name as an input to the next function, Python can't find any variable called that. You must create a variable called phrasedocs in your main code, as shown above.
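To illustrate the scoping rule in isolation (a minimal sketch, separate from the pipeline above; the names here are just for illustration):
def make_list():
    inner = [1, 2, 3]   # 'inner' exists only while make_list() runs
    return inner

# print(inner)          # NameError: name 'inner' is not defined
captured = make_list()  # but the returned value can be captured in a new name
print(captured)         # [1, 2, 3]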
I wrote this search code, and I want to store whatever is between " " as a single item in the list; how can I do that? As it stands I get three lists, but the second one (should) is not what I want.
import re

message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
others = ' '.join(re.split(r'\(.*\)', message))
others_split = others.split()
to_compile = re.compile(r'.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
should = ors_string.split(' ')
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
Output:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly', 'needed"', 'empty']
must_not: ['russia', '"destination good"']
Wanted result:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly needed"', 'empty'] <---
must_not: ['russia', '"destination good"']
I also get an error when I edit the message; how do I handle that?
Traceback (most recent call last):
ors_string = to_match.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Your should list splits on whitespace: should = ors_string.split(' '), which is why the quoted phrase is split across list items. The following code gives you the output you requested, but I'm not sure that it solves your problem for future inputs.
import re

message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
others = ' '.join(re.split(r'\(.*\)', message))
others_split = others.split()
to_compile = re.compile(r'.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
# Split on OR instead of whitespace.
should = ors_string.split('OR')
to_remove_or = "OR"
while to_remove_or in should:
    should.remove(to_remove_or)
# Remove trailing whitespace that is left after the split.
should = [word.strip() for word in should]
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
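As for the AttributeError after editing the message: re.match returns None when the pattern finds no parenthesized group, so calling .group(1) on the result fails. A minimal guard (a sketch, using a hypothetical edited message) is to test the match object first:
import re

message = 'read "find find" "plane" -russia'  # hypothetical message with no (...) group
to_match = re.compile(r'.*\((.*)\).*').match(message)
if to_match is None:
    should = []  # no OR-group present, so nothing to split
else:
    should = [part.strip() for part in to_match.group(1).split('OR') if part.strip()]
print(should)  # []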
I created a csv file like this:
"CAMERA", "Camera", "kamera", "cam", "Kamera"
"PICTURE", "Picture", "bild", "photograph"
and used it somewhat like this:
nlp = de_core_news_sm.load()
text = "Cam is not good"
doc = nlp(text)
name_dict, desc_dict = load_entities()
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=96)
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)  # 342 is an arbitrary value here
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1])  # 100% prior probability P(entity|alias)
Printing values like this:
print(f"Entities in the KB: {kb.get_entity_strings()}")
print(f"Aliases in the KB: {kb.get_alias_strings()}")
gives me:
Entities in the KB: ['PICTURE', 'CAMERA']
Aliases in the KB: [' "Camera"', ' "Picture"']
However, if I try to check for candidates, I only get an empty list:
candidates = kb.get_candidates("Camera")
print(candidates)
for c in candidates:
print(" ", c.entity_, c.prior_prob, c.entity_vector)
It looks to me as if your parsing script added the literal string ' "Camera"', spaces and quotes and all, to the KB as an alias, instead of just the raw string Camera.
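Splitting each CSV line by hand would leave exactly that residue on every field. A minimal sketch of a cleaner loader (load_aliases is a hypothetical name; your load_entities presumably does something similar):
import csv

def load_aliases(path):
    # skipinitialspace=True lets the csv module parse the quoted fields,
    # so 'CAMERA' and 'Camera' come out without stray spaces or quotes.
    name_dict = {}
    with open(path, newline='') as f:
        for qid, *aliases in csv.reader(f, skipinitialspace=True):
            name_dict[qid] = aliases
    return name_dict

Each cleaned alias can then be passed to kb.add_alias, and kb.get_candidates("Camera") should find a match.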
So I have two functions. The first takes a string parameter and converts it into spacy tokens.
def preprocess(texts):
    case = truecase.get_true_case(texts)
    doc = nlp(case)
    return doc
The next function calls that one and processes the text into aggregated dictionaries.
def summarize_texts(texts):
    doc = preprocess(texts)  # another function that took text and processed it as a spacy doc
    actions = {}
    entities = {}
    for token in doc:
        if token.pos_ == "VERB":
            actions[token.lemma_] = actions.get(token.text, 0) + 1
    for token in doc.ents:
        entities[token.label_] = [token.text]
    return {
        'actions': actions,
        'entities': entities
    }
So when you call the function, you'll get these results:
summarize_texts("Play it again, Sam")
output: {'actions': {'play': 1}, 'entities': {'PERSON': ['Sam']}}
The issue I'm having is that my functions only work with a single string parameter, but they fail if I give them a parameter that's a list of sentences such as:
["Play something by Billie Holiday",
"Set a timer for five minutes",
"Play it again, Sam"]
and I'm not sure how to get it to work the way I want it to.
For example, if I called
summarize_texts(["Play it again, Sam", "Play something by Billie Holiday"])
output: {'actions': {'play': 2}, 'entities': {'PERSON': ['Sam', 'Billie']}}
However if I run
docs = [
"Play something by Billie Holiday",
"Set a timer for five minutes",
"Play it again, Sam"
]
summarize_texts(docs)
output is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-46-200347d5cac5> in <module>()
4 "Play it again, Sam"
5 ]
----> 6 summarize_texts(docs)
5 frames
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in _replace_html_entities(text, keep, remove_illegal, encoding)
257 return "" if remove_illegal else match.group(0)
258
--> 259 return ENT_RE.sub(_convert_entity, _str_to_unicode(text, encoding))
260
261
TypeError: expected string or bytes-like object
You can check the type of the input. Here I check whether it is a str or a list; if it is a str, I wrap it in a list with just that one sentence. Your output will then be a list of results. (Optionally, you can return just the single result when there was only one input:)
return result[0] if len(result)==1 else result
def preprocess(texts):
    case = truecase.get_true_case(texts)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    if type(texts) is str: texts = [texts]
    result = []
    for text in texts:
        doc = preprocess(text)  # another function that took text and processed it as a spacy doc
        actions = {}
        entities = {}
        for token in doc:
            if token.pos_ == "VERB":
                actions[token.lemma_] = actions.get(token.text, 0) + 1
        for token in doc.ents:
            entities[token.label_] = [token.text]
        result.append({
            'actions': actions,
            'entities': entities
        })
    return result

print(summarize_texts("Play it again, Sam"))
print(summarize_texts(["Play something by Billie Holiday", "Set a timer for five minutes", "Play it again, Sam"]))
se_eng_fr_dict = {'School': ['Skola', 'Ecole'], 'Ball': ['Boll', 'Ballon']}

choose_language = raw_input("Type 'English', for English. Skriv 'svenska' fo:r svenska. Pour francais, ecrit 'francais'. ")

if choose_language == 'English':
    word = raw_input("Type in a word:")
    swe_word = se_eng_fr_dict[word][0]
    fra_word = se_eng_fr_dict[word][1]
    print word, ":", swe_word, "pa. svenska,", fra_word, "en francais."
elif choose_language == 'Svenska':
    word = raw_input("Vilket ord:")
    for key, value in se_eng_fr_dict.iteritems():
        if value == word:
            print key
I want to create a dictionary (to be stored locally as a txt file) and the user can choose between entering a word in English, Swedish or French to get the translation of the word in the two other languages. The user should also be able to add data to the dictionary.
The code works when I look up the Swedish and French words using the English word. But how can I get the key and the second value if I only have the first value?
Is there a way or should I try to approach this problem in a different way?
A good option would be to store None for a translation that hasn't been set yet. While it would increase the amount of memory required, you could go a step further and store each translation under its language.
Example:
se_eng_fr_dict = {'pencil': {'se': None, 'fr': 'crayon'}}
def translate(word, lang):
    # If dict.get() finds no value with `word` it will return
    # None by default. We override it with an empty dictionary `{}`
    # so we can always call `.get` on the result.
    translated = se_eng_fr_dict.get(word, {}).get(lang)
    if translated is None:
        print("No {lang} translation found for {word}.".format(**locals()))
    else:
        print("{} is {} in {}".format(word, translated, lang))

translate('pencil', 'fr')
translate('pencil', 'se')
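With the sample entry above, those two calls should print something like:
pencil is crayon in fr
No se translation found for pencil.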
I hope there's a better solution, but here is mine:
class Word:
    def __init__(self, en, fr, se):
        self.en = en
        self.fr = fr
        self.se = se

    def __str__(self):
        return '<%s,%s,%s>' % (self.en, self.fr, self.se)
Then you dump all these Words into a mapping data structure. You can use a dictionary, but if you have a huge data set it's better to use a BST; have a look at https://pypi.python.org/pypi/bintrees/2.0.1
Let's say you have all these Words loaded in a list named words; then:
en_words = {w.en: w for w in words}
fr_words = {w.fr: w for w in words}
se_words = {w.se: w for w in words}
Again, a BST is recommended here.
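For instance, a minimal sketch with bintrees (untested, and assuming its documented dict-like interface):
from bintrees import RBTree  # red-black tree that behaves like a sorted dict

words = [Word('school', 'ecole', 'skola'), Word('ball', 'ballon', 'boll')]
en_words = RBTree((w.en, w) for w in words)  # keyed (and kept sorted) by the English form
print(en_words['ball'])  # <ball,ballon,boll>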
Maybe a set of nested lists would be better for this:
>>> my_list = [
...     ["School", "Skola", "Ecole"],
...     ["Ball", "Boll", "Ballon"],
... ]
Then you can access the set of translations by doing:
>>> position = [index for index, item in enumerate(my_list) for subitem in item if value == subitem][0]
This returns the index of the list, which you can grab:
>>> sub_list = my_list[position]
And the sublist will have all the translations in order.
For example:
>>> position = [index for index, item in enumerate(my_list) for subitem in item if "Ball" == subitem][0]
>>> print position
1
>>> my_list[position]
['Ball', 'Boll', 'Ballon']
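The same lookup can also be wrapped in a small helper (a sketch, arguably simpler than the index-based comprehension above):
def translations_for(word, table):
    # Return the sublist containing `word`, or None if it is absent.
    for row in table:
        if word in row:
            return row
    return None

>>> translations_for("Ball", my_list)
['Ball', 'Boll', 'Ballon']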
In order to speedup word lookups and achieve a good flexibility, I'd choose a dictionary of subdictionaries: each subdictionary translates the words of a language into all the available languages and the top-level dictionary maps each language into the corresponding subdictionary.
For example, if multidict is the top-level dictionary, then multidict['english']['ball'] returns the (sub)dictionary:
{'english':'ball', 'francais':'ballon', 'svenska':'ball'}
Below is a class Multidictionary implementing such an idea.
For convenience it assumes that all the translations are stored into a text file in CSV format, which is read at initialization time, e.g.:
english,svenska,francais,italiano
school,skola,ecole,scuola
ball,boll,ballon,palla
Any number of languages can be easily added to the CSV file.
class Multidictionary(object):

    def __init__(self, fname=None):
        '''Init a multidictionary from a CSV file.

        The file describes a word per line, separating all the available
        translations with a comma.
        First file line must list the corresponding languages.
        For example:

            english,svenska,francais,italiano
            school,skola,ecole,scuola
            ball,boll,ballon,palla
        '''
        self.fname = fname
        self.multidictionary = {}
        if fname is not None:
            import csv
            with open(fname) as csvfile:
                reader = csv.DictReader(csvfile)
                for translations in reader:
                    for lang, word in translations.iteritems():
                        self.multidictionary.setdefault(lang, {})[word] = translations

    def get_available_languages(self):
        '''Return the list of available languages.'''
        return sorted(self.multidictionary)

    def translate(self, word, language):
        '''Return a dictionary containing the translations of a word (in a
        specified language) into all the available languages.
        '''
        if language in self.get_available_languages():
            translations = self.multidictionary[language].get(word)
        else:
            print 'Invalid language %r selected' % language
            translations = None
        return translations

    def get_translations(self, word, language):
        '''Generate the string containing the translations of a word in a
        language into all the other available languages.
        '''
        translations = self.translate(word, language)
        if translations:
            other_langs = (lang for lang in translations if lang != language)
            lang_trans = ('%s in %s' % (translations[lang], lang) for lang in other_langs)
            s = '%s: %s' % (word, ', '.join(lang_trans))
        else:
            print '%s word %r not found' % (language, word)
            s = None
        return s

if __name__ == '__main__':
    multidict = Multidictionary('multidictionary.csv')
    print 'Available languages:', ', '.join(multidict.get_available_languages())
    language = raw_input('Choose the input language: ')
    word = raw_input('Type a word: ')
    translations = multidict.get_translations(word, language)
    if translations:
        print translations
I'm doing automatic language detection in Python using stopwords, but I'm getting a KeyError when trying to test the code.
This is the code:
import nltk
from nltk.corpus import stopwords

def scoreFunction(wholetext):
    dictiolist = {}
    scorelist = {}
    NLTKlanguages = ["dutch","finnish","german","italian","portuguese","spanish","turkish","danish","english","french","hungarian","norwegian","russian","swedish"]
    FREElanguages = [""]
    languages = NLTKlanguages + FREElanguages
    for lang in NLTKlanguages:
        dictiolist[lang] = stopwords.words(lang)
    tokens = nltk.tokenize.word_tokenize(wholetext)
    tokens = [t.lower() for t in tokens]
    freq_dist = nltk.FreqDist(tokens)
    for lang in languages:
        scorelist[lang] = 0
    for word in freq_dist.keys()[0:20]:
        if word in dictiolist[lang]:
            scorelist[lang] += 1
    return scorelist
def whichLanguage(scorelist):
    maximum = 0
    for item in scorelist:
        value = scorelist[item]
        if maximum < value:
            maximum = value
            lang = item
    return lang
When I run scoreFunction("hello my name is osfar and i'm very genius") I get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    scoreFunction("hello my name is osfar and i'm very genius")
  File "C:/Users/osama1/Desktop/fun-test", line 17, in scoreFunction
    if word in dictiolist[lang]:
KeyError: ''
Your problem is in the following block of code:
for word in freq_dist.keys()[0:20]:
if word in dictiolist[lang]:
scorelist[lang]+=1
You're using the variable lang in this for loop, but you aren't defining it anywhere in the loop, so it keeps whatever value it last had; as it happens, that value is "" (the empty string), because that was the last value it took in your previous for loop.
What you apparently meant to do is:
for word in freq_dist.keys()[0:20]:
    for lang in languages:
        if word in dictiolist[lang]:
            scorelist[lang] += 1
By the way, there's an easier way to do what you're trying to do: use a Counter. See http://docs.python.org/2.7/library/collections.html#counter-objects for more information.
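A rough sketch of that idea (a hypothetical rewrite, not your original code):
import nltk
from collections import Counter
from nltk.corpus import stopwords

langs = ["dutch", "english", "french", "german", "spanish", "swedish"]
dictiolist = {lang: set(stopwords.words(lang)) for lang in langs}  # language -> stopword set

def scoreFunction(wholetext):
    tokens = [t.lower() for t in nltk.tokenize.word_tokenize(wholetext)]
    common = [word for word, _ in Counter(tokens).most_common(20)]
    # Score each language by how many of the 20 most common tokens are its stopwords.
    return Counter({lang: sum(w in dictiolist[lang] for w in common) for lang in langs})

def whichLanguage(scorelist):
    return scorelist.most_common(1)[0][0]  # language with the highest score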