Generate features from "comments" column in dataframe - python

I have a dataset with a column of comments. Each comment is a list of terms separated by commas.
df_pat['reason'] =
chest pain
chest pain, dyspnea
chest pain, hypertrophic obstructive cariomyop...
chest pain
chest pain
cad, rca stents
non-ischemic cardiomyopathy, chest pain, dyspnea
I would like to generate separate columns in the dataframe, one for each distinct term across all comments, holding 1 in the rows whose comment originally contained that term and 0 otherwise.
For example:
df_pat['chest_pain'] =
df_pat['dyspnea'] =
And so on...
Thank you!

sklearn.feature_extraction.text has something for you! It looks like you may be trying to predict something. If so, and if you plan to use scikit-learn at some point, you can skip building a dataframe with len(set(words)) columns and just use CountVectorizer. It returns a sparse matrix with dimensions (rows, columns) = (number of rows in the dataframe, number of unique terms in the entire 'reason' column).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'reason': ['chest pain', 'chest pain, dyspnea', 'chest pain, hypertrophic obstructive cariomyop', 'chest pain', 'chest pain', 'cad, rca stents', 'non-ischemic cardiomyopathy, chest pain, dyspnea']})
# turns body of text into a matrix of features
# split on commas instead of spaces, stripping the space left after each comma
vectorizer = CountVectorizer(tokenizer=lambda x: [t.strip() for t in x.split(",")])
# X is now an n_documents by n_distinct_terms matrix of features
X = vectorizer.fit_transform(df['reason'])
pandas plays really nicely with sklearn.
Or, a pure pandas solution. It should probably be vectorized, but it should work if you don't have much data:
# split on the comma instead of spaces to get "chest pain" instead of "chest" and "pain"
reasons = {reason.strip() for case in df['reason'] for reason in case.split(",")}
for reason in reasons:
    for idx in df.index:
        # substring test: 1 if the term appears in this row's comment, else 0
        if reason in df.loc[idx, 'reason']:
            df.loc[idx, reason] = 1
        else:
            df.loc[idx, reason] = 0
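A vectorized pandas variant of the same idea could look like the sketch below. It relies on Series.str.get_dummies. The whitespace normalization step and the names normalized, dummies, and df_wide are assumptions added here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'reason': [
    'chest pain',
    'chest pain, dyspnea',
    'chest pain, hypertrophic obstructive cariomyop',
    'chest pain',
    'chest pain',
    'cad, rca stents',
    'non-ischemic cardiomyopathy, chest pain, dyspnea',
]})

# Remove whitespace around commas so "a, b" and "a,b" yield the same terms
normalized = df['reason'].str.replace(r'\s*,\s*', ',', regex=True)

# One indicator column per distinct comma-separated term
dummies = normalized.str.get_dummies(sep=',')

df_wide = pd.concat([df, dummies], axis=1)
```

Unlike the loop, this matches whole terms rather than substrings, so a term such as "pain" would not be flagged inside "chest pain".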


sentiment analysis of a dataframe

I have a project that involves determining the sentiment of a text based on its adjectives. The data to be used is the adjectives column, which I derived like so:
def getAdjectives(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag == "JJ"]
dataset['adjectives'] = dataset['text'].apply(getAdjectives)
I obtained the dataframe from a json file using this code:
with open('reviews.json') as project_file:
    data = json.load(project_file)
I have done the sentiment analysis for the dataframe using this code:
dataset[['polarity', 'subjectivity']] = dataset['text'].apply(lambda text: pd.Series(TextBlob(text).sentiment))
print(dataset[['adjectives', 'polarity']])
This is the output:
adjectives polarity
0 [] 0.333333
1 [right, mad, full, full, iPad, iPad, bad, diff... 0.209881
2 [stop, great, awesome] 0.633333
3 [awesome] 0.437143
4 [max, high, high, Gorgeous] 0.398333
5 [decent, easy] 0.466667
6 [it’s, bright, wonderful, amazing, full, few... 0.265146
7 [same, same] 0.000000
8 [old, little, Easy, daily, that’s, late] 0.161979
9 [few, huge, storage.If, few] 0.084762
The code works, except that I want it to output the polarity of each adjective alongside the adjective (for example right, 0.00127, mad, -0.9888), even though they are in the same row of the dataframe.
Try this:
dataset = dataset.explode("adjectives")
Note that an empty list [] becomes a row holding np.nan, which you might want to remove beforehand or afterwards.
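As a rough sketch of how the exploded frame could then carry one score per adjective: the toy_scores dictionary below is a hypothetical stand-in for a real per-word scorer (with TextBlob, something like TextBlob(word).sentiment.polarity might play that role), and the column names adjective, text_polarity, and adjective_polarity are made up for this example.

```python
import pandas as pd

dataset = pd.DataFrame({
    "adjectives": [[], ["stop", "great", "awesome"], ["awesome"]],
    "polarity": [0.333333, 0.633333, 0.437143],
})

# Hypothetical per-word scores, standing in for a real sentiment scorer
toy_scores = {"stop": 0.0, "great": 0.8, "awesome": 1.0}

per_adjective = (
    dataset.explode("adjectives")        # one row per adjective
    .dropna(subset=["adjectives"])       # drop rows that came from empty lists
    .rename(columns={"adjectives": "adjective", "polarity": "text_polarity"})
)

# Look up a score for each individual adjective
per_adjective["adjective_polarity"] = per_adjective["adjective"].map(toy_scores)
```

The original row index is preserved by explode, so each adjective can still be traced back to its source text.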

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the names of movie titles and TV Series.
From specific keywords I want to classify each row as Movie or Series. However, because there is no space between the brackets and the keywords, they are not being picked up by the str.contains() function and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
                           ['James Bond'],
                           ['How I met your Mother (Avnsitt 3)'],
                           ['random name'],
                           ['Random movie 3 Episode 8383893']],
                          columns=['Title'])
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '', regex=False)
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', axis=1)
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that does not look at a string being the exact same but just containing the given word? Similar to how in SQL you have the "Like" functionality?
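You can use str.contains and then map the results. Your pattern 'Episode | Avnsitt' has spaces around the |, so ' Avnsitt' only matches when the keyword follows a space, which '(Avnsitt' does not. Without those spaces there is no need to strip the brackets: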
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
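The effect of the spaces inside the alternation can be checked with a small sketch like the one below. The variable names with_spaces and without_spaces are made up for the comparison, and case=False is carried over from the question.

```python
import pandas as pd

watched_df = pd.DataFrame({'Title': [
    'Love Death Robots (Episode 1)',
    'James Bond',
    'How I met your Mother (Avnsitt 3)',
    'random name',
    'Random movie 3 Episode 8383893',
]})

# Spaces are literal in a regex: this looks for "Episode " or " Avnsitt"
with_spaces = watched_df['Title'].str.contains('Episode | Avnsitt', case=False)

# Without the spaces, the keywords match wherever they occur
without_spaces = watched_df['Title'].str.contains('Episode|Avnsitt', case=False)

watched_df['Film_Type'] = without_spaces.map({True: 'Series', False: 'Movie'})
```

Only the "(Avnsitt 3)" row differs between the two masks, because "(" rather than a space precedes the keyword there.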

Conditional word frequency count in Pandas

I have a dataframe like below:
data = {'speaker':['Adam','Ben','Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
I want to count the number of words in the speech column but only for the words from a pre-defined list. For example, the list is:
wordlist = ['much', 'good','right']
I want to generate a new column which shows the frequency of these three words in each row. My expected output is therefore:
speaker speech words
0 Adam Thank you very much and good afternoon. 2
1 Ben Let me clarify that because I want to make sur... 1
2 Clair By now you should have received a copy of our ... 1
I tried:
df['total'] = 0
for word in df['speech'].str.split():
    if word in wordlist:
        df['total'] += 1
But after running it, the total column is always zero. What's wrong with my code?
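As for why the loop yields zero: iterating over df['speech'].str.split() produces one whole list of tokens per row, and a list is never equal to any string in wordlist, so the condition is never true. Even if it were, df['total'] += 1 would increment every row at once. A minimal sketch of this, with a per-row fix, follows; the names rows, whole_row_matches, and wanted are chosen here for illustration.

```python
import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']

# Each element is the full token list of one speech, not a single word
rows = list(df['speech'].str.split())

# The test from the original loop: a list compared against strings is never a match
whole_row_matches = [row in wordlist for row in rows]

# Per-row fix: count the tokens of each speech that are in the word list
wanted = set(wordlist)
df['total'] = [sum(token in wanted for token in row) for row in rows]
```

This counts exact whitespace-separated tokens, so a token with attached punctuation such as "good." would not be counted.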
You could use the following vectorised approach:
data = {'speaker':['Adam','Ben','Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good','right']
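# wrap the alternation in \b...\b so every word, including the first and last, is matched as a whole word
df['total'] = df['speech'].str.count(r'\b(?:' + '|'.join(wordlist) + r')\b')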
Which gives:
>>> df
speaker speech total
0 Adam Thank you very much and good afternoon. 2
1 Ben Let me clarify that because I want to make sur... 1
2 Clair By now you should have some good rest 1
This is a much faster (runtime-wise) solution if you have a very large word list and a large dataframe to search through.
I guess that is because it uses a dictionary, which takes O(N) to construct and O(1) per lookup, whereas a regex search is slower.
import pandas as pd
from collections import Counter
def occurrence_counter(target_string, search_list):
    # map each whitespace-separated token to its number of occurrences
    data = Counter(target_string.split())
    count = 0
    for key in search_list:
        if key in data:
            count += data[key]
    return count
data = {'speaker':['Adam','Ben','Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good','right']
df['total'] = df['speech'].apply(lambda x: occurrence_counter(x, wordlist))
Another option is to split each speech into words, explode to one word per row, and count the matches per speaker:
import pandas as pd
data = {'speaker': ['Adam', 'Ben', 'Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']
df["speech"] = df["speech"].str.split()
df = df.explode("speech")
# speakers with no matching words are absent from counts
counts = df[df.speech.isin(wordlist)].groupby("speaker").size()

Calculate TF-IDF using sklearn for n-grams in python

I have a vocabulary list that includes n-grams, as follows.
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
I want to use these terms to calculate TF-IDF values.
I also have a corpus stored as a dictionary (key = recipe number, value = recipe).
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently using the following code.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
Now I am printing the tokens or n-grams of recipe 1 in the corpus along with their TF-IDF values, as follows.
feature_names = tfidf.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
doc = 0
feature_index = tfs[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)
The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating TF-IDF values. Please let me know where my code goes wrong.
I want to get the TF-IDF matrix for the myvocabulary terms using the recipe documents in the corpus. In other words, the rows of the matrix represent myvocabulary and the columns represent the recipe documents of my corpus. Please help me.
Try increasing the ngram_range in TfidfVectorizer:
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
Edit: The output of TfidfVectorizer is the TF-IDF matrix in sparse format (or actually the transpose of it in the format you seek). You can print out its contents e.g. like this:
feature_names = tfidf.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])
which should yield
('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944
If the matrix is not large, it might be easier to examine it in dense form. Pandas makes this very convenient:
import pandas as pd
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
This results in
1 2 3
tim tam 0.000000 0.861037 0.000000
jam 0.000000 0.000000 0.000000
fresh milk 0.000000 0.000000 0.861037
chocolates 0.763228 0.508542 0.508542
biscuit pudding 0.646129 0.000000 0.000000
@user8566323 try using
df = pd.DataFrame(tfs.todense(), index=corpus_index, columns=feature_names)
instead of
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
i.e. without transposing (T) the matrix, so that the documents are the rows and the vocabulary terms are the columns.

Python/Pandas aggregation combined with NLTK

I want to do some text processing on a dataset containing Twitter messages. So far I'm able to load the data (.CSV) in a Pandas dataframe and index that by a (custom) column 'timestamp'.
df = pandas.read_csv(f)
df.index = pandas.to_datetime(df.pop('timestamp'))
Looks a bit like this:
user_name user_handle
2015-02-02 23:58:42 Netherlands Startups NLTechStartups
2015-02-02 23:58:42 shareNL share_NL
2015-02-02 23:58:42 BreakngAmsterdamNews iAmsterdamNews
[49570 rows x 8 columns]
I can create a new object (Series) containing just the text like so:
texts = pandas.Series(df['text'])
Which creates this:
2015-06-02 14:50:54 Business Update Meer cruiseschepen dan ooit in...
2015-06-02 14:50:53 RT #ProvincieNH: Provincie maakt Markermeerdij...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat: In ...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat http...
2015-06-02 14:50:53 Lugar secreto em Amsterdam: Begijnhof // Hidde...
Name: text, Length: 49570
1. Is this new object of the same sort of type (dataframe) as my initial df variable, just with different columns/rows?
Now, together with the nltk toolkit, I'm able to tokenize the strings using this:
for w in words:
This iterates over the array instead of mapping the 'text' column to a multi-column 'words' array.
2. How would I do this, and moreover, how do I then count the occurrences of each word?
I know there is a unique() method which I could use to create a distinct list of words. But then I'd need an extra column holding a count over the array, which I'm unable to produce in the first place. :)
3. Or would the next step towards counting occurrences of those words be grouping?
EDIT. 3: I seem to need "CountVectorizer", thanks EdChum
documents = df['text'].values
vectorizer = CountVectorizer(min_df=0, stop_words=[])
X = vectorizer.fit_transform(documents)
My main goal is to count the occurrences of each word and select the top X results. I feel I'm on the right track, but I can't get the final steps just right.
Building on EdChum's comments, here is a way to get the (I assume global) word counts from CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
df = pd.DataFrame({'text': ['cat on the cat','angel eyes has','blue red angel','one two blue','blue whales eat','hot tin roof','angel eyes has','have a cat'],
                   'class': ['a','a','a','a','c','c','b','e']})
X = vect.fit_transform(df['text'].values)
y = df['class'].values
Convert the sparse matrix returned by CountVectorizer to a dense matrix, and pass it and the feature names to the dataframe constructor. Then transpose the frame and sum along axis=1 to get the total per word:
# get_feature_names() in scikit-learn < 1.0
word_counts = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out()).T.sum(axis=1)
If all you are interested in is the frequency distribution of the words, consider using FreqDist from NLTK:
import nltk
import itertools
from nltk.probability import FreqDist
texts = ['cat on the cat','angel eyes has','blue red angel','one two blue','blue whales eat','hot tin roof','angel eyes has','have a cat']
# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first
texts = [nltk.word_tokenize(text) for text in texts]
# collapse into a single list
tokens = list(itertools.chain(*texts))
FD = FreqDist(tokens)
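If the NLTK tokenizer data is not at hand, a similar frequency table can be sketched with the standard library alone. This swaps in plain whitespace splitting for nltk.word_tokenize and collections.Counter for FreqDist, so it is an approximation of the approach above rather than the same thing.

```python
from collections import Counter

texts = ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
         'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat']

# Whitespace splitting stands in for a real tokenizer here
tokens = [token for text in texts for token in text.split()]

counts = Counter(tokens)

# The three most frequent tokens with their counts
top_three = counts.most_common(3)
```

For text with punctuation, a proper tokenizer would usually give cleaner tokens than str.split().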
