UnicodeDecodeError with word stemming in Python

I'm so stumped.
I have a list of a couple of thousand words
x = ['company', 'arriving', 'wednesday', 'and', 'then', 'beach', 'how', 'are', 'you', 'any', 'warmer', 'there', 'enjoy', 'your', 'day', 'follow', 'back', 'please', 'everyone', 'go', 'watch', 's', 'new', 'video', 'you', 'know', 'the', 'deal', 'make', 'sure', 'to', 'subscribe', 'and', 'like', '<http>', 'you', 'said', 'next', 'week', 'you', 'will', 'be', 'the', 'one', 'picking', 'me', 'up', 'lol', 'hindi', 'na', 'tl', 'huehue', 'that', 'works', 'you', 'said', 'everyone', 'of', 'us', 'my', 'little', 'cousin', 'keeps', 'asking', 'if', 'i', 'wanna', 'play', 'and', "i'm", 'like', 'yes', 'but', 'with', 'my', 'pals', 'not', 'you', "you're", 'welcome', 'pas', 'quand', 'tu', 'es', 'vers', '<num>', 'i', 'never', 'get', 'good', 'mornng', 'texts', 'sad', 'sad', 'moment', 'i', 'think', 'ima', 'go', 'get', 'a', 'glass', 'of', 'milk', 'ahah', 'for', 'the', 'first', 'time', 'i', 'actually', 'know', 'what', 'their', 'doing', 'd', 'thank', 'you', 'happy', 'birthday', 'hope', "you're"...........]
Now, I've confirmed the type of each element in this list to be a string
types = []
for word in x:
    types.append(type(word))
print set(types)
>>> set([<type 'str'>])
Now, I attempt to stem each word using NLTK's porter stemmer
import nltk
porter = nltk.PorterStemmer()
stemmed_x = [porter.stem(word) for word in x]
And I get this error that is clearly related to the stemming package and unicode somehow:
File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 633, in stem
stem = self.stem_word(word.lower(), 0, len(word) - 1)
File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 591, in stem_word
word = self._step1ab(word)
File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 289, in _step1ab
if word.endswith("ied"):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)
I have tried everything: using codecs.open, explicitly encoding each word as UTF-8 - it still produces the same error.
Please advise.
EDIT:
I should mention that this code worked perfectly on my PC running Ubuntu. I recently got a MacBook Pro and now I'm getting this error. I've checked the terminal settings on my Mac and they are set to UTF-8 encoding.
EDIT 2:
Interesting, with this piece of code, I have isolated the problem words:
for w in x:
    try:
        porter.stem(w)
    except UnicodeDecodeError:
        print w
#sagittarius”
#instadane…
#bleedblue”
#précieux
#على_شرفة_الماضي
#exploringsf…
#fishing…
#sindhubestfriend…
#الإستعداد_لإنهيار_ال_سعود
#jaredpreslar…
#femalepains”
#gobillings”
#juicing…
#instamood…
It seems like what they all have in common is extra punctuation at the end of the word, except for #précieux.

You probably have a multi-byte UTF-8 character lurking around, as 0xe2 is a possible leading byte of a three-byte UTF-8 sequence (the curly quotes and ellipses in your data are encoded this way). Since your program assumes ASCII characters, whose valid encoded values run from 0x00 to 0x7F, this value is rejected.
You might be able to identify the "bad" values with a simple comprehension, then fix them by hand (I assume from your data that you only want to deal with ASCII characters):
print [value for value in x if '\xe2' in value]
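If you would rather clean the list automatically than fix it by hand, here is a minimal sketch (assuming Python 2 byte strings and that dropping every non-ASCII character is acceptable):
# Decode each byte string as UTF-8, then drop anything outside the ASCII range
# (this strips the trailing curly quotes and ellipses entirely).
cleaned = [w.decode('utf-8', 'ignore').encode('ascii', 'ignore') for w in x]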

Using word.decode('utf-8') should solve this error.
import nltk
porter = nltk.PorterStemmer()
stemmed_x = [porter.stem(word.decode('utf-8')) for word in x]
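If some entries turn out not to be valid UTF-8, a slightly more defensive variant (a sketch, not part of the original answer) skips the offending words instead of crashing:
stemmed_x = []
for word in x:
    try:
        stemmed_x.append(porter.stem(word.decode('utf-8')))
    except UnicodeDecodeError:
        # Skip (or log) words that cannot be decoded as UTF-8.
        pass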

Related

Removing punctuation with an exception in python

I am trying to remove punctuation from a given string in python.
It works well; however, the data I am using includes lots of ":D", ":)" or ":(".
Therefore, when I analyse the data, I end up removing all of these text-smiles, or only the ":" in the ":D" case.
Following is an example code:
import string
import nltk
example_string = 'I would like to remove some punctiation, \
however some stuff like \':D\' causes errors. \
How would I not get rid of \':\', \
if it is followed by a \'D\'? '
translator = str.maketrans('', '', string.punctuation)
line = example_string.translate(translator)
line = nltk.word_tokenize(line)
line = [word.lower() for word in line
        if word not in ['\'', '’', '”', '“']]
print(line)
This produces as output:
['i', 'would', 'like', 'to', 'remove', 'some', 'punctiation',
'however', 'some', 'stuff', 'like', 'd', 'causes', 'errors',
'how', 'would', 'i', 'not', 'get', 'rid', 'of', 'if', 'it',
'is', 'followed', 'by', 'a', 'd']
What I would like to produce is (check the second line 5th word):
['i', 'would', 'like', 'to', 'remove', 'some', 'punctiation',
'however', 'some', 'stuff', 'like', ':d', 'causes', 'errors',
'how', 'would', 'i', 'not', 'get', 'rid', 'of', 'if', 'it',
'is', 'followed', 'by', 'a', 'd']
It also removes all occurrences of ":)" and ":(".
Is there a way to not remove ":" if it is followed by a "d"? Or to not remove ")" or "(" if the previous character is ":"?

Word Counter loop keeps loading forever in Python

I have a DataFrame comments as seen below. I want to make a Counter of words for the Text field. I have made a list of UserIds whose word counts are needed; those UserIds are stored in gold_users. But the loop to create the Counter just keeps running forever. Please help me fix this.
comments
This is just part of the dataframe; the original has many rows.
Id| Text | UserId
6| Before the 2006 course, there was Allen Knutso... | 3
8| Also, Theo Johnson-Freyd has some notes from M... | 1
Code
#Text Cleaning
punct = set(string.punctuation)
stopword = set(stopwords.words('english'))
lm = WordNetLemmatizer()
def clean_text(text):
    text = ''.join(char.lower() for char in text if char not in punct)
    tokens = re.split(r'\W+', text)
    text = [lm.lemmatize(word) for word in tokens if word not in stopword]
    return tuple(text)  # Writing only `return text` was giving an "unhashable type: 'list'" error
comments['Text'] = comments['Text'].apply(lambda x: clean_text(x))
for index, rows in comments.iterrows():
    gold_comments = rows[comments.Text.loc[comments.UserId.isin(gold_users)]]
    Counter(gold_comments)
Expected Output
[['scholar',20],['school',18],['bus',15],['class',14],['teacher',14],['bell',13],['time',12],['books',11],['bag',9],['student',7],......]
If you pass in a dataframe already filtered down to your gold_users ids and texts, the following pure Python function returns exactly what you need:
def word_count(df):
    counts = dict()
    for text in df['Text']:
        words = text.split()
        for word in words:
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    return list(counts.items())
Hope it helps!
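Hypothetical usage of the function above, assuming the 'Text' column still holds raw strings (the function splits on whitespace rather than using the cleaned tuples):
gold_users = [3, 1]
gold_df = comments[comments['UserId'].isin(gold_users)]
print(word_count(gold_df))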
You overcomplicated the problem, I am afraid. In Pandas, it is almost never desirable to iterate through the rows. Select the rows that meet your condition, add their Texts, and apply a Counter to the combined list:
gold_users = [3,1]
golden_comments = comments[comments['UserId'].isin(gold_users)]
counter = Counter(golden_comments['Text'].sum())
If necessary, convert the counter to a list of lists:
[[k, v] for k, v in counter.items()]
# [['2006', 1], ['course', 1], ['allen', 1], ...]
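If you want the most frequent words first, Counter can also give you that directly, already sorted by count:
counter.most_common(10)  # e.g. [('2006', 1), ('course', 1), ...]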
# Initialise packages in session:
import pandas as pd
import re
# comments => Data Frame
comments = pd.DataFrame({"Id": [6, 8],
"Text": ["Before the 2006 course, there was Allen Knutso...",
"Also, Theo Johnson-Freyd has some notes from M..."],
"UserId": [3, 1]})
# Stopwords to remove from text: stopwords_lst => list of strings
stopwords_lst = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these',
'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as',
'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in',
'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can',
'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've',
'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
"wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
# Clean lists of strings using regex: list of strings => function() => list of strings
def clean_string_list(str_lst):
    """Convert all alphanumeric characters in a list of strings to lowercase,
    turn non-alphanumeric characters into whitespace, and trim whitespace on both sides of each string.
    Args:
        str_lst (list): Function takes a list of strings.
    Returns:
        (list) A list of strings
    """
    return [*map(lambda x: re.sub(r'\W+', ' ', x.lower().strip('\s+\t\n\r')), str_lst)]
# Store a list of gold user's UserIds: gold_user_ids => list of integers:
gold_user_ids = [3, 1]
# Take Subset of Data Frame containing only gold users: gold_users => Data Frame
gold_users = comments[comments["UserId"].isin(gold_user_ids)]
# Apply the function to the list of stopwords and collapse the list into a single string: stopwords_re => string
stopwords_re = ' | '.join(clean_string_list(stopwords_lst))
# Clean strings, and remove stopwords: cleaned_text => vector of strings
gold_users['cleaned_text'] = [*map(lambda y: re.sub(stopwords_re, ' ', y), clean_string_list(gold_users['Text']))]
# Split each word on whitespace: words => list of strings
words = (' '.join(gold_users['cleaned_text'])).split()
# Count the number of occurrences of each word: word_count => dict
word_count = dict(zip(words, [*map(lambda z: words.count(z), words)]))
# Print words to console: dictionary => stdout
print(word_count)

nltk pos tagger looks to incorporate '.'

I am new to Python, NLP and NLTK, so please bear with me. I have a handful of articles (~200) that have been categorized by hand. I am looking to develop a heuristic to assist/recommend categories. To start, I was hoping to build a relationship between the current categories and the words within the documents.
My premise is that the nouns are more important to the category than any other part of speech. For example, the category "Energy" is probably driven almost completely by nouns: oil, battery, wind, etc.
The first thing I wanted to do was tag the parts of speech and evaluate them. On the first article I encountered some issues: some of the tokens are bound to punctuation.
for articles in articles[1]:
    articles_id, content = articles
    clean = nltk.clean_html(content).replace('’', "'")
    tokens = nltk.word_tokenize(clean)
    pos_document = nltk.pos_tag(tokens)
    pos = {}
    for pos_word in pos_document:
        word, part = pos_word
        if pos.has_key(part):
            pos[part].append(word)
        else:
            pos[part] = [word]
Formatted output:
{
'VBG': ['continuing', 'paying', 'falling', 'starting'],
'VBD': ['made', 'ended'], 'VBN': ['been', 'leaned', 'been', 'been'],
'VBP': ['know', 'hasn', 'have', 'continue', 'expect', 'take', 'see', 'have', 'are'],
'WDT': ['which', 'which'], 'JJ': ['negative', 'positive', 'top', 'modest', 'negative', 'real', 'financial', 'isn', 'important', 'long', 'short', 'next'],
'VBZ': ['is', 'has', 'is', 'leads', 'is', 'is'], 'DT': ['Another', 'the', 'the', 'any', 'any', 'the', 'the', 'a', 'the', 'the', 'the', 'the', 'a', 'the', 'a', 'a', 'the', 'a', 'the', 'any'],
'RP': ['back'],
'NN': [ 'listless', 'day', 'rsquo', 'll', 'progress', 'rsquo', 't', 'news', 'season', 'corner', 'surprise', 'stock', 'line', 'growth', 'question',
'stop', 'engineering', 'growth', 'isn', 'rsquo', 't', 'rsquo', 't', 'stock', 'market', 'look', 'junk', 'bond', 'market', 'turning', 'junk',
'rock', 'history', 'guide', 't', 'day', '%', '%', '%', 'level', 'move', 'isn', 'rsquo', 't', 'indication', 'way'],
',': [',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ','], '.': ['.'],
'TO': ['to', 'to', 'to', 'to', 'to', 'to', 'to'],
'PRP': ['them', 'they', 'they', 'we', 'you', 'they', 'it'],
'RB': ['then', 'there', 'just', 'just', 'always', 'so', 'so', 'only', 'there', 'right', 'there', 'much', 'typically', 'far', 'certainly'],
':': [';', ';', ';', ';', ';', ';', ';'],
'NNS': ['folks', 'companies', 'estimates', 'covers', 's', 'equities', 'bonds', 'equities', 'flats'],
'NNP': ['drift.', 'We', 'Monday', 'DC', 'note.', 'Earnings', 'EPS', 'same.', 'The', 'Street', 'now.', 'Since', 'points.', 'What', 'behind.', 'We', 'flat.', 'The'],
'VB': ['get', 'manufacture', 'buy', 'boost', 'look', 'see', 'say', 'let', 'rsquo', 'rsquo', 'be', 'build', 'accelerate', 'be'],
'WRB': ['when', 'where'],
'CC': ['&', 'and', '&', 'and', 'and', 'or', 'and', '&', '&', '&', 'and', '&', 'and', 'but', '&'],
'CD': ['47', '23', '30'],
'EX': ['there'],
'IN': ['on', 'if', 'until', 'of', 'around', 'as', 'on', 'down', 'since', 'of', 'for', 'under', 'that', 'about', 'at', 'at', 'that', 'like', 'if'],
'MD': ['can', 'will', 'can', 'can', 'will'],
'JJR': ['more']
}
Notice under NNP the word 'drift.' - shouldn't the period be removed? Do I need to remove this on my own, or am I missing something with the library?
NLTK's word tokenizer assumes that its input has already been separated into sentences. Therefore in order to get it to work, you need to call sent_tokenize on your input first. I think you can use the output of sent_tokenize as the input to word_tokenize, but typically you would want to iterate over your sentences.
for articles in articles[1]:
    articles_id, content = articles
    clean = nltk.clean_html(content).replace('’', "'")
    sents = nltk.sent_tokenize(clean)
    pos = {}
    for sent in sents:
        tokens = nltk.word_tokenize(sent)
        pos_document = nltk.pos_tag(tokens)
        for pos_word in pos_document:
            word, part = pos_word
            if pos.has_key(part):
                pos[part].append(word)
            else:
                pos[part] = [word]
I believe the reason this is necessary is to help distinguish punctuation periods at the ends of sentences from periods used in abbreviations (e.g. you wouldn't want "Mr. Smith" to be broken into 'Mr', '.', 'Smith').
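A quick illustration of the difference (a small made-up example; exact tokenization may vary slightly with your NLTK version):
import nltk

text = "Mr. Smith bought the stock. It will drift."
for sent in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sent))
# ['Mr.', 'Smith', 'bought', 'the', 'stock', '.']
# ['It', 'will', 'drift', '.']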

from a list of strings, how do you get a tuple/list of words which contain 3 or more characters and are evenly spaced in python [duplicate]

Possible Duplicate:
Checking if a string's characters are ascending alphabetically and its ascent is evenly spaced python
I have a list of strings/words:
mylist = ['twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe', '"beware', 'the', 'jabberwock', 'my', 'son', 'the', 'jaws', 'that', 'bite', 'the', 'claws', 'that', 'catch', 'beware', 'the', 'jubjub', 'bird', 'and', 'shun', 'the', 'frumious', 'bandersnatch', 'he', 'took', 'his', 'vorpal', 'sword', 'in', 'hand', 'long', 'time', 'the', 'manxome', 'foe', 'he', 'sought', 'so', 'rested', 'he', 'by', 'the', 'tumtum', 'tree', 'and', 'stood', 'awhile', 'in', 'thought', 'and', 'as', 'in', 'uffish', 'thought', 'he', 'stood', 'the', 'jabberwock', 'with', 'eyes', 'of', 'flame', 'came', 'whiffling', 'through', 'the', 'tulgey', 'wood', 'and', 'burbled', 'as', 'it', 'came', 'one', 'two', 'one', 'two', 'and', 'through', 'and', 'through', 'the', 'vorpal', 'blade', 'went', 'snicker-snack', 'he', 'left', 'it', 'dead', 'and', 'with', 'its', 'head', 'he', 'went', 'galumphing', 'back', '"and', 'has', 'thou', 'slain', 'the', 'jabberwock', 'come', 'to', 'my', 'arms', 'my', 'beamish', 'boy', 'o', 'frabjous', 'day', 'callooh', 'callay', 'he', 'chortled', 'in', 'his', 'joy', '`twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe']
Firstly, I need to get only the words which have 3 or more characters in them - I assume a for loop for that or something.
Then I need to get a list of words whose letters increase from left to right alphabetically and are a fixed number apart (e.g. ('ace', 2) or ('ceg', 2); it does not have to be 2). The list also has to be sorted in alphabetical order, and each element should be a tuple consisting of the word and the character difference.
I think I have to use a for loop, but I'm not sure how to use it in this case and am not sure how to do the second part.
For the list above, the answer I should get is:
([])
I do not have the newest version of python.
Any help is greatly appreciated.
You should probably start by learning how to use a for loop. A for loop will get things out of a collection and assign them to a variable:
for letter in "strings are collections":
    print letter
Or..
for thing in ['this is a list', 'of', 4, 'things']:
    if thing == 4:
        print '4 is in the list'
If you're able to do more than this, then try something, figure out where you get stuck, and ask more specifically about what you need help with.
Take this problem in steps.
To filter words with length >= 3:
[w for w in mylist if len(w) >= 3]
To see whether a word's letters increase at a regular interval, calculate the difference between consecutive letters, put the differences in a set, and check whether the set's length == 1:
diff = lambda word: len({ord(n)-ord(c) for n,c in zip(word[1:],word)}) == 1
Now use this new function to filter the remaining words:
[w for w in mylist if len(w) >= 3 and diff(w)]
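To also produce the sorted (word, step) tuples the question asks for, here is a minimal sketch, assuming "evenly spaced" means every consecutive pair of letters differs by the same positive amount:
def step(word):
    # Return the common difference between consecutive letters, or None if the
    # differences are not all equal.
    diffs = {ord(b) - ord(a) for a, b in zip(word, word[1:])}
    return diffs.pop() if len(diffs) == 1 else None

pairs = [(w, step(w)) for w in mylist if len(w) >= 3]
result = sorted((w, d) for w, d in pairs if d is not None and d > 0)
print(result)  # the question expects [] for the word list above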

Technique to remove common words(and their plural versions) from a string

I am attempting to find tags(keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions and a short blurb.
What do you think would be the most efficient way to remove common words from the tag list?
By common words, I mean words like: 'the', 'at', 'there', 'their' etc.
I have 2 methodologies I can use. Which do you think is more efficient in terms of speed, and do you know of a more efficient way I could do this?
Methodology 1:
- Determine the number of times each word occurs (using Counter from the collections library)
- Have a list of common words and remove all 'common words' from the Counter by attempting to delete each such key from it if it exists.
- Therefore the speed will be determined by the length of the variable delims
from collections import Counter
delims = ['there','there\'s','theres','they','they\'re']
# the above will end up being a really long list!
word_freq = Counter(recipe_str.lower().split())
for delim in set(delims):
    if delim in word_freq:
        del word_freq[delim]
return word_freq.most_common()
Methodology 2:
- For common words that can be plural, look at each word in the recipe string and check whether it partially contains the non-plural version of a common word. E.g. for the string "There's a test", check each word to see if it contains "there" and delete it if it does.
delims = ['this','at','them']  # words that can't be plural
partial_delims = ['there','they']  # words that could occur in many forms
word_freq = Counter(recipe_str.lower().split())
for delim in set(delims):
    if delim in word_freq:
        del word_freq[delim]
# really slow
for delim in set(partial_delims):
    for word in list(word_freq):
        if word.find(delim) != -1:
            del word_freq[word]
return word_freq.most_common()
I'd just do something like this:
from nltk.corpus import stopwords
s = set(stopwords.words('english'))
txt = "a long string of text about him and her"
print filter(lambda w: w not in s, txt.split())
which prints
['long', 'string', 'text']
and in terms of complexity it should be O(n) in the number of words in the string, if you believe that hashed set lookup is O(1).
FWIW, my version of NLTK defines 127 stopwords:
'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'
Obviously you can provide your own set; I agree with the comment on your question that it's probably easiest (and fastest) to just provide all the variations you want to eliminate up front. If you want to eliminate a lot more words than this, it becomes more a question of spotting interesting words than of eliminating spurious ones.
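For example, you could extend the default set with your own domain-specific stopwords (the extra terms here are hypothetical):
s = set(stopwords.words('english')) | {'cup', 'cups', 'tablespoon', 'teaspoon'}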
Your problem domain is "Natural Language Processing".
If you don't want to reinvent the wheel, use NLTK and search for stemming in the docs.
Given that NLP is one of the hardest subjects in computer science, reinventing this wheel is a lot of work...
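A minimal sketch of that idea, combining stemming with the stopword filtering from the answer above (the sample sentence is made up; exact stems depend on the stemmer):
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

recipe_str = "Chop the onions and add them to the onion mixture"
# Drop stopwords first, then stem so plural forms collapse onto one stem.
words = [w for w in recipe_str.lower().split() if w not in stop]
stems = [stemmer.stem(w) for w in words]
print Counter(stems).most_common()
# e.g. [('onion', 2), ('chop', 1), ('add', 1), ('mixtur', 1)]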
You ask about speed, but you should be more concerned with accuracy. Both your suggestions will make a lot of mistakes, removing either too much or too little (for example, there are a lot of words that contain the substring "at"). I second the suggestion to look into the nltk module. In fact, one of the early examples in the NLTK book involves removing common words until the most common remaining ones reveal something about the genre. You'll get not only tools, but instruction on how to go about it.
Anyway you'll spend much longer writing your program than your computer will spend executing it, so focus on doing it well.
