I have an Excel file with many rows and columns. I want to do the following. First, I want to filter the rows based on a text match. Second, I want to choose a particular column and generate word frequencies for ALL THE WORDS in that column. Third, I want to graph the words and their frequencies.
I have figured out the first part. My question is how to apply Counter() to a dataframe. If I just use Counter(df), it returns an error. So I used the following code to convert each row into a list and then applied Counter. When I do this, I get the word frequency for each row separately if I use Counter within the for loop; otherwise I get the word frequency for just one row. However, I want a word count for all the rows put together. Appreciate any inputs. Thanks!
The following is some example data.
product  review
a        Great Product
a        Delivery was fast
a        Product received in good condition
a        Fast delivery but useless product
b        Dont recommend
b        I love it
b        Please dont buy
b        Second purchase
My desired output is like this: for product a, (product, 3), (delivery, 2), (fast, 2), etc. My current output is like (great, 1), (product, 1) for the first row.
This is the code I used.
strdata = column.values.tolist()
tokens = [tokenizer.tokenize(str(i)) for i in strdata]
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words]
    stemmed = [stemmer.stem(i) for i in stopped]
    cleaned_list.append(stemmed)  # append stemmed words to list
count = Counter(stemmed)
print(count.most_common(10))
First, use groupby to concatenate the strings from the same group.
Second, apply Counter() to the joined strings.
joined = df.groupby('product', as_index=False).agg({'review' : ' '.join})
joined['count'] = joined.apply(lambda x: collections.Counter(x['review'].split(' ')), axis=1)
# print(joined)
product review count
0 a Great Product Delivery was fast Product receiv... {'Great': 1, 'Product': 2, 'Delivery': 1, 'was...
1 b Dont recommend I love it Please dont buy Secon... {'Dont': 1, 'recommend': 1, 'I': 1, 'love': 1,...
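If you also want the lower-cased (word, count) pairs the question describes, one small extension of the same idea (my own sketch, not part of the original answer) is:
# Lower-case before counting so 'Product' and 'product' collapse, then take the
# most common pairs per product, building on the 'joined' frame above.
joined['count'] = joined['review'].str.lower().str.split().apply(collections.Counter)
joined['top3'] = joined['count'].apply(lambda c: c.most_common(3))
print(joined[['product', 'top3']])
# product a comes out as [('product', 3), ('delivery', 2), ('fast', 2)]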
You can use the following function. The idea is to:
Group your data by byvar and combine all the words in yvar into a list.
Apply Counter and, if you want, select the most common words.
Explode to get a long-format dataframe (easier to analyze afterwards).
Keep only the relevant columns (word and count) in a new dataframe:
from collections import Counter
import pandas as pd
def count_words_by(data, yvar, byvar):
    cw = pd.DataFrame({'counter': data
                       .groupby(byvar)
                       .apply(lambda s: ' '.join(s[yvar]).split())
                       .apply(lambda s: Counter(s).most_common())
                       # .apply(lambda s: Counter(s).most_common(10))  # use this line instead if you only want the top 10 words
                       .explode()}
                      )
    cw[['word', 'count']] = pd.DataFrame(cw['counter'].tolist(), index=cw.index)
    cw_red = cw[['word', 'count']].reset_index()
    return cw_red
count_words_by(data = df, yvar = "review", byvar = "product")
where I assume you start from this dataframe:
product  review
a        Great Product
a        Delivery was fast
a        Product received in good condition
a        Fast delivery but useless product
b        Dont recommend
b        I love it
b        Please dont buy
b        Second purchase
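For completeness, here is one way to build that example frame (my own sketch; the answer assumes df already exists):
df = pd.DataFrame({
    'product': ['a'] * 4 + ['b'] * 4,
    'review': ['Great Product', 'Delivery was fast',
               'Product received in good condition', 'Fast delivery but useless product',
               'Dont recommend', 'I love it', 'Please dont buy', 'Second purchase'],
})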
I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

fileDrug20Q4.close()  # redundant: the with block above already closes the file

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code pulls out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL"? I can't simply list all the varieties, as there are countless variations, so I need something that captures anything with the term "OLMESARTAN" in it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
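Applied to the question's dataframe, the exact-match .loc filter could be replaced with something like this (the frame and column names come from the question; the result variable is just for illustration):
# Keep every row whose 'Drug Name' contains OLMESARTAN anywhere in it;
# case=False also matches 'Olmesartan', and na=False treats missing names as no match.
mask = drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN', case=False, na=False)
olmesartan20Q4 = drugNameDataFrame20Q4[mask]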
I've managed to get close, as I'm able to get the total to correctly sum up each individual group; my issue is with getting the total to appear at the end of each group. Below is my working code
(I'm using Django, so I'm using a queryset as the data):
df = pd.DataFrame(list(
    get_daily_transaction_object(value, SQLLIST, date).values('posgroupid',
                                                               'posid',
                                                               'cardscheme',
                                                               'transactionamount',
                                                               'transactiontype',
                                                               'currencycode')))

def f(x):
    a = x['transactionamount'].nunique()
    b = x[df['transactiontype'] == 1]['transactionamount'].sum()
    c = x[df['transactiontype'] == 4]['transactionamount'].sum()
    d = x[df['transactiontype'] == 3]['transactionamount'].sum()
    e = x['transactionamount'].sum()
    return pd.Series([a, b, c, d, e], index=['transactions', 'sales', 'refund', 'cashback', 'Total'])

grouped_df = df.groupby(['currencycode',
                         'posgroupid',
                         'posid',
                         'cardscheme']).apply(f)

subtotal = grouped_df.sum(level=[0, 1, 2]).assign(cardscheme='Total').set_index('cardscheme', append=True)
grouped_new = pd.concat([grouped_df, subtotal]).sort_index()

context = {'desc': 'Transaction Report',
           'report': grouped_new.to_html(classes='white_space_df')
           }
The above calculates correctly but places the total in seemingly random places, which causes an issue when I have dynamically sized data.
Is there a way to always have the total appear at the end of a group?
It looks like you are setting Total as an instance of cardscheme so as to group it under the cardscheme column. Then you set cardscheme as an index and finally sorted the index. Hence, Total is sorted (as an index value) together with the other cardscheme values, e.g. Visa, MasterCard, in alphabetical order.
How about sorting the other cardscheme items first (by sorting the index), then adding Total without sorting the index again? Just an example (not fully tested); change the code sequence to:
grouped_df = df.groupby(['currencycode',
                         'posgroupid',
                         'posid',
                         'cardscheme']).apply(f)
# add the following to set index and sort the index before concat Total
sorted_grouped_df = grouped_df.set_index('cardscheme').sort_index()
subtotal = grouped_df.sum(level=[0, 1, 2]).assign(cardscheme='Total').set_index('cardscheme', append=True)
# revise to concat the sorted_grouped_df with subtotal AND remove sort index
grouped_new = pd.concat([sorted_grouped_df, subtotal])
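If the sorted-then-concatenated order still does not put every total at the end of its group, another idea (my own sketch, not part of the answer above; it assumes pandas 1.1+ for the key argument and that no real card scheme name sorts after the substitute label) is to sort with a key that pushes 'Total' last within each group:
# Remap 'Total' on its index level to a label that sorts after real scheme names,
# so every group ends with its total row.
grouped_new = pd.concat([grouped_df, subtotal]).sort_index(
    key=lambda level: level.map(lambda v: 'zzz_total' if v == 'Total' else v)
)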
I have a dataframe and one column contains the lemmatized words of a paragraph. I wish to count the frequency of each word within the whole dataframe, not just within the record. There are over 40,000 records so the computation has to be quick and not reach the limit of my RAM.
For example, this basic input:
ID lemm
1 ['test','health']
2 ['complete','health','science']
would have this desired output:
'complete':1
'health':2
'science':1
'test':1
This is my current code:
from collections import Counter
cnt = Counter()
for entry in df.lemm:
    for word in entry:
        cnt[word] += 1
cnt
This works when I manually enter a list of lists of strings (e.g. [['completing', 'dog', 'cat'], ['completing', 'degree', 'health', 'health']]), but not when it iterates through the df.
I have also tried this:
top_N=20
word_dist = nltk.FreqDist(df_main.stem)
print('All frequences')
print('='*60)
rslt=pd.DataFrame(word_dist.most_common(top_N),columns=['Word','Frequency'])
print(rslt)
to return the top 20 terms, but the output lists the frequencies of terms within the entry, not the entire dataframe.
Any help would be appreciated!
You can try explode if you have pandas 0.25+:
df.lemm.explode().value_counts()
Maybe you can change the Counter lines to:
cnt = Counter(word for entry in df.lemm for word in entry)
Refer to: How to find the lemmas and frequency count of each word in list of sentences in a list?
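As a quick check (my own addition), that one-liner reproduces the desired output on the question's two sample rows:
import pandas as pd
from collections import Counter

df = pd.DataFrame({"ID": [1, 2],
                   "lemm": [["test", "health"], ["complete", "health", "science"]]})
cnt = Counter(word for entry in df.lemm for word in entry)
print(cnt)  # Counter({'health': 2, 'test': 1, 'complete': 1, 'science': 1})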
Assuming your column names and input data:
data = {
    "ID": [1, 2],
    "lemm": [['test', 'health'], ['complete', 'health', 'science']]
}
df = pd.DataFrame(data)
freq = df.explode("lemm").groupby(["lemm"]).count().rename(columns={"ID": "Frequency"})
Output:
          Frequency
lemm
complete          1
health            2
science           1
test              1
from collections import Counter
cnt = df.apply(lambda x:Counter(x['lemm']),axis=1).sum()
This will do it for you. It makes cnt a Counter object, so you can call most_common on it or use anything else Counter offers.
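Since cnt is then an ordinary Counter, getting the top terms (as in the question's other attempt) is just, for example:
top_words = cnt.most_common(20)  # list of (word, count) pairs, most frequent first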
Sample df:
filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])
I want a unique list of the words used in the Tags column. So I'm trying to loop through df, pull each row of Tags, split on ',', and add the words to a list. I could either check and add only unique words, or add them all and then pull the unique ones. I would like a solution for both methods, if possible, to see which is faster.
So expected output should be:
5, Blue, Football, Baseball, College, 1993, Green.
I have tried these:
tagslist = df['Tags'][0].split(',') # To give me initial starting words
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = tagslist.extend(thesetags)
    return tagslist

tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
and
tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    for word in thesetags:
        if word not in tagslist:
            tagslist.append(word)

tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
These two are essentially the same with one looking only for unique words. Both of these return a list of 'None'.
I have also tried this:
tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = list(set(tagslist + thesetags))
    return tagslist

tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
This one adds unique values for each row, but not the individual words in each row. So even though I tried to split on ',', it still treats the entire text as one string instead of using the individual words.
Use Series.str.split to split the strings, then np.hstack to horizontally stack all the lists in the Tags column, and finally np.unique on the stacked array to find the unique elements:
import numpy as np

lst = np.unique(np.hstack(df['Tags'].str.split(','))).tolist()
Another possible idea using Series.explode + Series.unique:
lst = df['Tags'].str.split(',').explode().unique().tolist()
Result:
['1993', '5', 'Baseball', 'Blue', 'College', 'Football', 'Green']
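A side note of mine: the explode-based variant preserves first-appearance order, so with the sample frame it matches the order given in the question's expected output:
print(df['Tags'].str.split(',').explode().unique().tolist())
# ['5', 'Blue', 'Football', 'Baseball', 'College', '1993', 'Green']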
I want to remove ambiguity when I train the data, so I want to clean it well. How can I remove all rows that contain 3 words or fewer in Python?
Hello World! This will be my first contribution ever to SO :-)
Let's create some data:
data = {'Source': ['Hello all Im Happy', 'Its a lie, dont trust him', 'Oops', 'foo', 'bar']}
df = pd.DataFrame(data, columns=['Source'])
My approach is very straightforward, simple, a little "brute" and inefficient; however, I ran it on a large dataframe (1,013,952 rows) and the time was fairly acceptable.
Let's find the indices of the dataframe rows that have fewer than n tokens in the chosen column (these are the rows we will drop):
from nltk.tokenize import word_tokenize
def get_indices(df, col, n):
    """
    Get the indices of dataframe rows that have fewer than n tokens in a specific column

    Parameters:
        df (pandas dataframe)
        n (int): threshold value for minimum words
        col (string): column name
    """
    tmp = []
    for i in range(len(df)):  # df.iterrows() wasn't working for me
        if len(word_tokenize(df[col][i])) < n:
            tmp.append(i)
    return tmp
Next we just need to call the function (here with the sample column and a threshold of 4 tokens) and drop the rows at those indices:
tmp = get_indices(df, 'Source', 4)  # e.g. drop rows with fewer than 4 tokens in the 'Source' column
df_clean = df.drop(tmp)
Best!
df = pd.DataFrame({"mycolumn": ["", " ", "test string", "test string 1", "test string 2 2"]})
df = df.loc[df["mycolumn"].str.count(" ") >= 2]
You should never loop over a dataframe, always use vectorized operations.