So I want to apply stemming using SnowballStemmer on a column (unstemmed) of a dataframe in order to use a classification algorithm.
My code looks like the following:
import pandas as pd
import nltk

df = pd.read_excel(...)
df["content"] = df['column2'].str.lower()
stopword_list = nltk.corpus.stopwords.words('dutch')
df['unstemmed'] = df['content'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopword_list]))
df["unstemmed"] = df["unstemmed"].str.replace(r"[^a-zA-Z ]+", " ", regex=True).str.strip()
df["unstemmed"] = df["unstemmed"].replace(r'\s+', ' ', regex=True)
df['unstemmed'] = df['unstemmed'].str.split()
df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x])
So first I convert everything to lower case and remove all Dutch stopwords. This is followed by removing all special characters and then splitting the text into words. I checked, and all columns have dtype "object".
I get the following error: stem() missing 1 required positional argument: 'token'
How can I solve this?
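That error typically means stem() is being called on the SnowballStemmer class itself rather than on an instance. A minimal sketch of the likely fix, assuming the Dutch stemmer (the exact stemmer setup isn't shown above):
from nltk.stem.snowball import SnowballStemmer

# The stemmer must be an instance, not the bare class; with
# stemmer = SnowballStemmer (no call), stem() receives the word as self
# and complains about the missing positional argument.
stemmer = SnowballStemmer("dutch")

df['stemmed'] = df['unstemmed'].apply(lambda words: [stemmer.stem(w) for w in words])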
I have the following sequences of strings within a column in pandas:
SEQ
An empty world
So the word is
So word is
No word is
I can check the similarity using fuzzywuzzy or cosine distance.
However, I would like to know how to get information about the words that change or move position from one row to another.
For example:
The similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3: they contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and similarly for the 3rd and 4th rows.
How can I see the changes between two rows/texts?
Assuming you're using Jupyter/IPython and you are just interested in comparisons between a row and the one preceding it, I would do something like this.
The general concept is:
find shared tokens between the two strings (by splitting on ' ' and finding the intersection of two sets).
apply some html formatting to the tokens shared between the two strings.
apply this to all rows.
output the resulting dataframe as html and render it in ipython.
import pandas as pd
data = ['An empty world',
'So the word is',
'So word is',
'No word is']
df = pd.DataFrame(data, columns=['phrase'])
bold = lambda x: f'<b>{x}</b>'
def highlight_shared(string1, string2, format_func):
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok in shared_toks else tok for tok in string1.split(' ')])
highlight_shared('the cat sat on the mat', 'the cat is fat', bold)
df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)
from IPython.core.display import HTML
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))
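Since the question is specifically about seeing which words changed or went missing, the same helper can be flipped to highlight the non-shared tokens instead. This is just a sketch building on the code above; the new column name is made up:
def highlight_diff(string1, string2, format_func):
    # bold the tokens of string1 that do NOT appear in string2
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok not in shared_toks else tok for tok in string1.split(' ')])

df['tokens_changed_from_previous'] = df.apply(lambda x: highlight_diff(x.phrase, x.previous_phrase, bold), axis=1)
HTML(df.loc[:, ['phrase', 'tokens_changed_from_previous']].to_html(escape=False))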
I have rows of blurbs (in text format) and I want to use tf-idf to define the weight of each word. Below is the code:
import string

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
df["punc_blurb"] = df["blurb"].apply(remove_punctuations)
df = pd.DataFrame(df["punc_blurb"])
vectoriser = TfidfVectorizer()
df["blurb_Vect"] = list(vectoriser.fit_transform(df["punc_blurb"]).toarray())
df_vectoriser = pd.DataFrame(x.toarray(),
columns = vectoriser.get_feature_names())
print(df_vectoriser)
All I get is a massive list of numbers, and I am not even sure anymore whether it is the TF or the TF-IDF that it is giving me, since the frequent words (the, and, etc.) all have a score of more than 0.
The goal is to see the weights in the tf-idf column shown below, and I am unsure if I am doing this in the most efficient way:
Goal Output table
You don't need a punctuation remover if you use TfidfVectorizer. It will take care of punctuation automatically, by virtue of the default token_pattern param:
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.DataFrame({"blurb":["this is a sentence", "this is, well, another one"]})
vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w+\\b')
df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"].values.astype("U")).toarray())
vocab = sorted(vectorizer.vocabulary_.keys())
df["tf_idf_dic"] = df["tf_idf"].apply(lambda x: {k:v for k,v in dict(zip(vocab,x)).items() if v!=0})
I have the following data frame called sentences
data = ["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], [ "Tomatoes are the best"] [ "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
And a dataframe called stopwords:
data = [["the"], ["it"], ["best"], [ "is"]]
stopwords = pd.DataFrame(data, columns = ['word'])
I want to remove all stopwords from sentences["sentence"]. I tried the code below but it does not work. I think there is an issue with my if statement. Can anyone help?
Def remove_stopwords(input_string, stopwords_list):
    stopwords_list = list(stopwords_list)
    my_string_split = input_string.split(' ')
    my_string = []
    for word in my_string_split:
        if word not in stopwords_list:
            my_string.append(word)
    my_string = " ".join(my_string)
    return my_string
sentence['cut_string']= sentence.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)
When I apply the function, it just returns the first or first few strings in the sentence but does not cut out stopwords at all. Kinda stuck here
You can convert the stopwords word column to a list and remove those words from sentences using a list comprehension,
stopword_list = stopwords['word'].tolist()
sentences['filtered'] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))
You get
0 Home of Jacksons
1 Is real thing?
2 What with you?
3 Tomatoes are
4 I think it's to path ways now
Or you can wrap the code in a function,
def remove_stopwords(input_string, stopwords_list):
    my_string = []
    for word in input_string.split():
        if word not in stopwords_list:
            my_string.append(word)
    return " ".join(my_string)
stopword_list = stopwords['word'].tolist()
sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))
You have many syntax errors in your code above. If you keep the stopwords as a list (or set) rather than a DataFrame, the following will work:
data = ["Home of the Jacksons", "Is it the real thing?", "What is it with you?", "Tomatoes are the best", "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
stopwords = ["the", "it", "best", "is"]
sentences.sentence.str.split().apply(lambda x: " ".join([y for y in x if y not in stopwords]))
The key to success is to convert the list of stopwords into a set(): sets have O(1) lookup times, while a list lookup is O(N).
stop_set = set(stopwords.word.tolist())
sentences.sentence.str.split()\
.apply(lambda x: ' '.join(w for w in x if w not in stop_set))
I have a pandas dataframe that contains the following columns: df['adjectives'], df['nouns'], and df['adverbs']. Each of these columns contains lists of tokens based on their respective parts of speech.
I would like to use TextBlob to create three new columns in my data frame, df['adjlemmatized'], df['nounlemmatized'], and df['advlemmatized'].
Each of these columns should contain wordlists consisting of words in their singularized, lemma form.
I have tried following the TextBlob documentation, but I am stuck writing functions that will iterate over my entire dataframe.
Words Inflection and Lemmatization
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.
>>> sentence = TextBlob('Use 4 spaces per indentation level.')
>>> sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
>>> sentence.words[2].singularize()
'space'
>>> sentence.words[-1].pluralize()
'levels'
Words can be lemmatized by calling the lemmatize method.
>>> from textblob import Word
>>> w = Word("octopi")
>>> w.lemmatize()
'octopus'
>>> w = Word("went")
>>> w.lemmatize("v") # Pass in WordNet part of speech (verb)
'go'
Here is the code I used to get the parts of speech from my text:
from textblob import TextBlob

# get adjectives
def get_adjectives(text):
    blob = TextBlob(text)
    print(text)
    return [word for (word, tag) in blob.tags if tag.startswith("JJ")]

df['adjectives'] = df['clean_reviews'].apply(get_adjectives)
If your words are already tokenized and you want to keep them that way, it's easy:
from textblob import Word

df['adjlemmatized'] = df.adjectives.apply(lambda x: [Word(w) for w in x])
df['adjlemmatized'] = df.adjlemmatized.apply(lambda x: [w.lemmatize() for w in x])
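To cover all three part-of-speech columns in singularized, lemma form as the question asks, the same idea can be wrapped in a small helper. A sketch, assuming the column names from the question:
from textblob import Word

def lemmatize_tokens(tokens):
    # singularize each word first, then take its lemma (both methods live on Word)
    return [Word(Word(w).singularize()).lemmatize() for w in tokens]

for src, dst in [('adjectives', 'adjlemmatized'),
                 ('nouns', 'nounlemmatized'),
                 ('adverbs', 'advlemmatized')]:
    df[dst] = df[src].apply(lemmatize_tokens)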
I am trying to create groups of words. First I count all the words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.
I have survey results stored in a python pandas dataframe structured like this
Question_ID | Customer_ID | Answer
1 234 Data is very important to use because ...
2 234 We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):
answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print(result)
I have been manually creating groups of words - but is there a way of doing it in python?
So based on the example shown above the group with word counts would look like this:
group "data":
data : 2
important: 1
value: 1
need:1
then when it goes through the whole file, there would be another group:
group "analytics:
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well - but I can do that manually if it's not possible.
Then the goal is to establish the 10 most used words (by word count) and create 10 groups with the words that appear in front of and behind those top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself here, to extract the three words before and after each occurrence of data.
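As a quick sanity check (a hypothetical run on a shortened version of the first survey answer), the pattern captures up to three neighbouring words on each side:
import re

data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'

# one tuple per occurrence of "data": (words before, words after)
re.findall(data_pat, "Data is very important to use because")
# -> [('', ' is very important')]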
First, let's remove all the words we don't like from the strings.
import re
# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)
This gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuples, then split each string then pull all the strings out of the lists they end up in into one big list.
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, given some list of words that we want to collect information about, key_words:
import re
from itertools import chain
from collections import Counter
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like
d= {'data' : Counter({ 'important' : 2,
'very' : 3}),
'analytics' : Counter({ 'boring' : 5,
'sleep' : 3})
}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
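Putting the pieces together (just a usage sketch of the code above), you can feed these automatically extracted key_words into the keyword loop and then ask each Counter for its most common neighbours:
# after running the keyword loop above with these key_words,
# d[keyword] is a Counter of the words found near that keyword
for keyword in key_words:
    print(keyword, d[keyword].most_common(10))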