Frequency count based on column values in Pandas - python

For example I have a data frame which looks like this:
First Image
And I would like to make a new data frame which shows the number of times a word was marked as spam or ham. I want it to look like this:
Second image
I have tried the following code to build a list of spam counts per word (as a test), but it does not seem to work and crashes the kernel in Jupyter Notebook:
words = []
for word in df["Message"]:
    words.extend(word.split())

sentences = []
for word in df["Message"]:
    sentences.append(word.split())

spam = []
ham = []
for word in words:
    sc = 0
    hc = 0
    for index, sentence in enumerate(sentences):
        if word in sentence:
            print(word)
            if (df["Category"][index]) == "ham":
                hc += 1
            else:
                sc += 1
    spam.append(sc)
spam
Where df is the data frame shown in the First Image.
How can I go about doing this?

You can build two dictionaries, spam and ham, to store the number of occurrences of each word in spam/ham messages.
from collections import defaultdict as dd

spam = dd(int)
ham = dd(int)
for i in range(len(sentences)):
    if df['Category'][i] == 'ham':
        p = sentences[i]
        for x in p:
            ham[x] += 1
    else:
        p = sentences[i]
        for x in p:
            spam[x] += 1
The output obtained from the code above, for input similar to yours, is shown below.
>>> spam
defaultdict(<class 'int'>, {'ok': 1, 'lar': 1, 'joking': 1, 'wtf': 1, 'u': 1, 'oni': 1, 'free': 1, 'entry': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1})
>>> ham
defaultdict(<class 'int'>, {'go': 1, 'until': 1, 'jurong': 1, 'crazy': 1, 'available': 1, 'only': 1, 'in': 1, 'u': 1, 'dun': 1, 'say': 1, 's': 1, 'oearly': 1, 'nah': 1, 'I': 1, 'don’t': 1, 'think': 1, 'he': 1, 'goes': 1, 'to': 1, 'usf': 1})
Now you can manipulate the data and export it in the required format.
EDIT:
answer = []
for x in spam:
    answer.append([x, spam[x], ham[x]])
for x in ham:
    if x not in spam:
        answer.append([x, spam[x], ham[x]])
Here the number of rows in the answer list is equal to the number of distinct words across all the messages. The first column in every row is the word itself, and the second and third columns are the number of occurrences of that word in spam and ham messages respectively.
The output obtained from my code is shown below.
['ok', 1, 0]
['lar', 1, 0]
['joking', 1, 0]
['wif', 1, 0]
['u', 1, 1]
['oni', 1, 0]
['free', 1, 0]
['entry', 1, 0]
['in', 1, 1]
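If the required format is a data frame like the one in your second image, the answer list converts directly (a minimal sketch; the column names are my assumption, since the image is not reproduced here):
import pandas as pd

# each row of `answer` is [word, spam_count, ham_count]
result = pd.DataFrame(answer, columns=['Word', 'Spam', 'Ham'])
print(result.head())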

This would be better:
https://docs.python.org/3.8/library/collections.html#collections.Counter
from collections import Counter
import pandas as pd
df # the data frame in your first image
df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))
def func(df: pd.DataFrame):
    for category, data in df.groupby('Category'):
        count = Counter()
        for var in data.Counter:
            count += var
        cur = pd.DataFrame.from_dict(count, orient='index', columns=[category])
        yield cur

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
EDIT:
from collections import Counter
import pandas as pd
df  # the data frame in your first image. Works in both cases (whether it is a slice of the complete data frame or not)
def func(df: pd.DataFrame):
    res = df.groupby('Category').Message.apply(' '.join).str.split().apply(Counter)
    for category, count in res.to_dict().items():
        yield pd.DataFrame.from_dict(count, orient='index', columns=[category])

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
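Words that appear in only one category come out as NaN after the outer merge, so a natural follow-up (my addition, not part of the original answer) is to fill them with zero and cast back to integers:
df2 = df2.fillna(0).astype(int)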

Related

Python dataframe merge on condition

I have two data frames and 3 conditions for building a new data frame:
1) df1["Product"] == df2["Product"] and df2["Date"] >= df1["Date"]
2) Then loop over the matching df2["Product"] rows, summing df2["Count"] and checking on each iteration whether the running sum equals df1["Count"]
Example:
df1["Product"][2] = "147326.A", df1["Date"][2] = "1/03/22" and df1["Count"][2] = 4.
Now we check df2: if there is a match, i.e. df2["Product"][1] == df1["Product"][2] and df2["Date"][1] >= df1["Date"][2], the first condition is met. We then sum df2["Count"] and on each iteration compare it to df1["Count"]; if df1["Count"] == df2["Count"], the row is added to the new data frame.
df1 = pd.DataFrame({"Date":["11/01/22", "1/02/22", "1/03/22", "1/04/22", "2/02/22"],"Product" :["315114.A", "147326.A", "147326.A", "91106.A", "283214.A"],"Count":[3,1,4,1,2]})
df2 = pd.DataFrame({"Date" : ["15/01/22", "4/02/22", "7/03/22", "1/04/22", "2/02/22", "15/01/22","1/06/22","1/06/22"],"Product" : ["315114.A", "147326.A ", "147326.A", "91106.A", "283214.A", "315114.A","147326.A","147326.A" ],"Count" : [1, 1, 2, 1, 2, 2, 1, 1]})
The following data should be a match:
df1 = pd.DataFrame({"Date" : ["01/03/2022"],"Product":["91106.A"],"Count":[2]})
df2 = pd.DataFrame({"Date" : ["01/03/2022", "7/03/2022", "7/03/2022", "7/03/2022","7/03/2022", "7/03/2022"],"Product" : ["91106.A", "91106.A","91106.A", "91106.A", "91106.A", "91106.A"],"Count" : [1, 1, 1, 1, 1, 1]})
You could solve this in a list comprehension (within a pd.DataFrame):
df3 = pd.DataFrame([j.to_dict() for i, j in df1.iterrows() if
                    j["Count"] == df2[(df2["Product"] == j["Product"]) &
                                      (df2["Date"] >= j["Date"])]["Count"].sum()])
Splitting this up into lots of lines would look like this:
l = []
for i, j in df1.iterrows():
    if j["Count"] == df2[(df2["Product"] == j["Product"]) &
                         (df2["Date"] >= j["Date"])]["Count"].sum():
        x = j.to_dict()
        l.append(x)
df3 = pd.DataFrame(l)
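One caveat (my addition, not part of the answer above): Date is stored as strings in both frames, so df2["Date"] >= j["Date"] compares text rather than calendar dates. Converting to real datetimes first avoids that, assuming day-first dates as in the sample:
df1["Date"] = pd.to_datetime(df1["Date"], dayfirst=True)
df2["Date"] = pd.to_datetime(df2["Date"], dayfirst=True)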

count the number of unique column elements in python example

Imagine that this data frame is a small sample of a bigger data frame with 11 pianists, each producing an emotion of Angry, Happy, Relaxed, or Sad in a listener. Now I want to count, for every pianist, the number of occurrences of each emotion, since I want to plot it later to see a pattern in the data.
I am struggling to get this done. I somehow managed it to a certain degree, but it is very bad code and would be very long if I had to do it for all 11 pianists.
Could somebody please help me automate this with more efficient, better code?
My Work:
import pandas as pd

d = {
    'pianist_id':
        [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'class':
        ['Angry', 'Sad', 'Sad', 'Angry', 'Angry', 'Angry', 'Relaxed', 'Happy', 'Happy', 'Happy']
}
df = pd.DataFrame(d)

count = 0
for i in range(df.shape[0]):
    if df['pianist_id'][i] == 1:
        count += 1

df_split_1 = df.iloc[:count]
print(df_split_1['class'].value_counts())
pianist_1 = df_split_1['class'].value_counts().to_dict()

dict_pianist_1 = {}
dict_pianist_1['1'] = pianist_1
I want to have something like this for all 11 pianists.
{
    '1': {
        'Sad': 67,
        'Happy': 66,
        'Angry': 54,
        'Relaxed': 50
    },
    '2': {
        'Angry',,,,,''
    },
    ,,,,,,
}
Thanks for the help!
You can group by the pianist_id column and then use value_counts to get the count of each value of the class column. Finally, use to_dict to convert the result to a dict.
d = df.groupby('pianist_id').apply(lambda group: group['class'].value_counts().to_dict()).to_dict()
print(d)
{1: {'Sad': 2, 'Angry': 1}, 2: {'Angry': 3}, 3: {'Relaxed': 1, 'Happy': 1}, 4: {'Happy': 2}}
You can compute the size of each pair:
df.groupby(['pianist_id', 'class']).size()
Which gives the following output:
pianist_id  class
1           Angry      1
            Sad        2
2           Angry      3
3           Happy      1
            Relaxed    1
4           Happy      2
dtype: int64
To get the format you need, you have to unstack the index, filling the missing values at the same time, and then convert the final DataFrame to a dict:
df.groupby(['pianist_id', 'class']).size().unstack(fill_value=0).to_dict(orient='index')
Producing the output :
{1: {'Angry': 1, 'Happy': 0, 'Relaxed': 0, 'Sad': 2}, 2: {'Angry': 3, 'Happy': 0, 'Relaxed': 0, 'Sad': 0}, 3: {'Angry': 0, 'Happy': 1, 'Relaxed': 1, 'Sad': 0}, 4: {'Angry': 0, 'Happy': 2, 'Relaxed': 0, 'Sad': 0}}
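For comparison (my addition, not part of the answer above), pd.crosstab builds the same zero-filled table in one call:
pd.crosstab(df['pianist_id'], df['class']).to_dict(orient='index')
# {1: {'Angry': 1, 'Happy': 0, 'Relaxed': 0, 'Sad': 2}, 2: {'Angry': 3, 'Happy': 0, 'Relaxed': 0, 'Sad': 0}, ...}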
Since the end result specified in the question is a Python dict of dicts, you may prefer a more Python-centric approach than a pandas-centric one. Here's an answer giving several alternatives in which pandas usage is limited to taking the original dataframe as input, calling its apply method, and accessing its 'pianist_id' and 'class' columns:
result = {id : {} for id in df['pianist_id'].unique()}

def updateEmotionCount(id, emotion):
    result[id].update({emotion : result[id].get(emotion, 0) + 1})

df.apply(lambda x: updateEmotionCount(x['pianist_id'], x['class']), axis = 1)
print(result)
... or, in two lines using just lambda:
result = {id : {} for id in df['pianist_id'].unique()}
df.apply(lambda x: result[x['pianist_id']].update({x['class'] : result[x['pianist_id']].get(x['class'], 0) + 1}), axis = 1)
... or, using more lines but benefitting from the convenience of defaultdict:
import collections

result = {id : collections.defaultdict(int) for id in df['pianist_id'].unique()}

def updateEmotionCount(id, emotion):
    result[id][emotion] += 1

df.apply(lambda x: updateEmotionCount(x['pianist_id'], x['class']), axis = 1)
result = {id : dict(result[id]) for id in result}
... or (finally) using the walrus operator := to eliminate the separate function and just use lambda (there is an argument that this approach is somewhat cryptic ... but the same could be said of pandas-centric solutions):
Using regular dict datatype:
result = {id : {} for id in df['pianist_id'].unique()}
df.apply(lambda x: (id := x['pianist_id'], emotion := x['class'], result[id].update({emotion : result[id].get(emotion, 0) + 1})), axis = 1)
Using defaultdict:
import collections
result = {id : collections.defaultdict(int) for id in df['pianist_id'].unique()}
df.apply(lambda x: (id := x['pianist_id'], emotion := x['class'], result[id].update({emotion : result[id][emotion] + 1})), axis = 1)
result = {id : dict(result[id]) for id in result}
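For the sample frame in the question, each of these variants ends with the same nested dict as the groupby answers above (a quick check, my addition):
print(result)
# {1: {'Angry': 1, 'Sad': 2}, 2: {'Angry': 3}, 3: {'Relaxed': 1, 'Happy': 1}, 4: {'Happy': 2}}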

How to use .apply() to combine a column of dictionaries into one dictionary?

I have a column of dictionaries within a pandas data frame.
srs_tf = pd.Series([{'dried': 1, 'oak': 2},{'fruity': 2, 'earthy': 2},{'tones': 2, 'oak': 4}])
srs_b = pd.Series([2,4,6])
df = pd.DataFrame({'tf': srs_tf, 'b': srs_b})
df
                           tf  b
0      {'dried': 1, 'oak': 2}  2
1  {'fruity': 2, 'earthy': 2}  4
2      {'tones': 2, 'oak': 4}  6
These dictionaries represent word frequency in descriptions of wines (Ex input dictionary:{'savory': 1, 'dried': 3, 'thyme': 1, 'notes':..}). I need to create an output dictionary from this column of dictionaries that contains all of the keys from the input dictionaries and maps them to the number of input dictionaries in which those keys are present. For example, the word 'dried' is a key in 850 of the input dictionaries, so in the output dictionary {.. 'dried': 850...}.
I want to try using the data frame .apply() method but I believe that I am using it incorrectly.
def worddict(row, description_counter):
    for key in row['tf'].keys():
        if key in description_counter.keys():
            description_counter[key] += 1
        else:
            description_counter[key] = 1
    return description_counter

description_counter = {}
output_dict = df_wine_list.apply(lambda x: worddict(x, description_counter), axis = 1)
So, a couple of things. I think that my axis should be 0 rather than 1, but I get this error when I try that: KeyError: ('tf', 'occurred at index Unnamed: 0')
When I do use axis = 1, my function returns a column of identical dictionaries rather than a single dictionary.
You can use chain and Counter:
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(df['tf']))
# Counter({'dried': 1, 'earthy': 1, 'fruity': 1, 'oak': 2, 'tones': 1})
Or,
Counter(y for x in df['tf'] for y in x)
# Counter({'dried': 1, 'earthy': 1, 'fruity': 1, 'oak': 2, 'tones': 1})
You can also use Index.value_counts:
pd.concat(map(pd.Series, df['tf'])).index.value_counts().to_dict()
# {'dried': 1, 'earthy': 1, 'fruity': 1, 'oak': 2, 'tones': 1}
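With the sample frame from the question (column tf), a self-contained run of the first approach looks like this (my illustration):
from collections import Counter
from itertools import chain
import pandas as pd

srs_tf = pd.Series([{'dried': 1, 'oak': 2}, {'fruity': 2, 'earthy': 2}, {'tones': 2, 'oak': 4}])
df = pd.DataFrame({'tf': srs_tf, 'b': [2, 4, 6]})

# iterating over a dict yields its keys, so each key is counted once per row it appears in
output_dict = dict(Counter(chain.from_iterable(df['tf'])))
print(output_dict)  # {'dried': 1, 'oak': 2, 'fruity': 1, 'earthy': 1, 'tones': 1}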

Group list items according to digit counts in second list

The goal is to create a stacked bar graph showing the sentiment of tweets (that I got from tweepy) over a total of 360 seconds (by second). I have two lists. The first one has the sentiment analysis of the tweets in chronological order and the second one has the amount of tweets per second, also in chronological order.
list1 = ("neg", "pos", "pos", "neu", "neg", "pos", "neu", "neu",...)
list2 = (2, 1, 3, 2,...)
Now I would like to create some sort of nested loop and use list2 to count the items in list1. I would then have 3 lists with 360 values for each sentiment that I can use for the graph. It should give me an output similar to this:
lis_negative = (1, 0, 1, 0, ...)
lis_positive = (1, 1, 1, 0, ...)
lis_neutral = (0, 0, 1, 2, ...)
How can I create this loop and is there maybe a simpler approach to it? I would prefer not to use any library for it other than matplotlib.
Code:
from itertools import islice
from collections import Counter

def categorize(clas, amounts):
    cats = {'neg': [], 'pos': [], 'neu': []}
    clas = iter(clas)
    for a in amounts:
        cs = Counter(islice(clas, a))  # take a items
        for cat in cats:
            cats[cat].append(cs[cat])
    return cats
Demo:
>>> t1 = ('neg', 'pos', 'pos', 'neu', 'neg', 'pos', 'neu', 'neu')
>>> t2 = (2, 1, 3, 2)
>>>
>>> categorize(t1, t2)
{'neg': [1, 0, 1, 0], 'neu': [0, 0, 1, 2], 'pos': [1, 1, 1, 0]}
As requested, a solution without imports:
def make_counter(iterable):
    c = {}
    for x in iterable:
        c[x] = c.get(x, 0) + 1
    return c

def categorize(clas, amounts):
    cats = {'neg': [], 'pos': [], 'neu': []}
    pos = 0
    for a in amounts:
        chunk = clas[pos:pos+a]
        pos += a
        cs = make_counter(chunk)
        for cat in cats:
            cats[cat].append(cs.get(cat, 0))
    return cats
Edit: a shorter import-less solution:
def categorize(clas, amounts):
    cats = {k: [0]*len(amounts) for k in ('neg', 'pos', 'neu')}
    pos = 0
    for i, a in enumerate(amounts):
        chunk = clas[pos:pos+a]
        pos += a
        for c in chunk:
            cats[c][i] += 1
    return cats
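A quick check (my addition) that the shorter version matches the demo above:
t1 = ('neg', 'pos', 'pos', 'neu', 'neg', 'pos', 'neu', 'neu')
t2 = (2, 1, 3, 2)
print(categorize(t1, t2))
# {'neg': [1, 0, 1, 0], 'pos': [1, 1, 1, 0], 'neu': [0, 0, 1, 2]}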

How to return the number of characters whose frequency is above a threshold

How do I print the number of upper case characters whose frequency is above a threshold (in the tutorial)?
The homework question is:
Your task is to write a function which takes as input a single non-negative number and returns (not print) the number of characters in the tally whose count is strictly greater than the argument of the function. Your function should be called freq_threshold.
My answer is:
mobyDick = "Blah blah A B C A RE."
def freq_threshold(threshold):
    tally = {}
    for char in mobyDick:
        if char in tally:
            tally[char] += 1
        else:
            tally[char] = 1
    for key in tally.keys():
        if key.isupper():
            print tally[key], tally.keys
            if threshold > tally[key]: return threshold
            else: return tally[key]
It doesn't work, but I don't know where it is wrong.
Your task is to return the number of characters that satisfy the condition. You're trying to return the count of occurrences of some character. Try this:
result = 0
for key in tally.keys():
    if key.isupper() and tally[key] > threshold:
        result += 1
return result
You can make this code more pythonic. I wrote it this way to make it more clear.
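Putting that fix back into the original function, a minimal sketch (keeping mobyDick as the module-level string from the question):
mobyDick = "Blah blah A B C A RE."

def freq_threshold(threshold):
    tally = {}
    for char in mobyDick:
        tally[char] = tally.get(char, 0) + 1
    result = 0
    for key, count in tally.items():
        if key.isupper() and count > threshold:
            result += 1
    return result

print(freq_threshold(1))  # 2 -> 'A' and 'B' each appear more than once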
The part where you tally up the number of each character is fine:
>>> pprint.pprint ( tally )
{' ': 5,
'.': 1,
'A': 2,
'B': 2,
'C': 1,
'E': 1,
'R': 1,
'a': 2,
'b': 1,
'h': 2,
'l': 2,
'\x80': 2,
'\xe3': 1}
The error is in how you are summarising the tally.
Your assignment asked you to print the number of characters occurring more than n times in the string.
What you are returning is either n or the number of times one particular character occurred.
You instead need to step through your tally of characters and character counts, and count how many characters have frequencies exceeding n.
Do not reinvent the wheel; use a Counter object, e.g.:
>>> from collections import Counter
>>> mobyDick = "Blah blah A B C A RE."
>>> c = Counter(mobyDick)
>>> c
Counter({' ': 6, 'a': 2, 'B': 2, 'h': 2, 'l': 2, 'A': 2, 'C': 1, 'E': 1, '.': 1, 'b': 1, 'R': 1})
from collections import Counter
def freq_threshold(s, n):
    cnt = Counter(s)
    return [i for i in cnt if cnt[i] > n and i.isupper()]
To reinvent the wheel:
def freq_threshold(s, n):
    d = {}
    for i in s:
        d[i] = d.get(i, 0) + 1
    return [i for i in d if d[i] > n and i.isupper()]
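Note that the assignment asks for the number of such characters rather than the characters themselves, so either version can be wrapped to return a count instead (my note, not part of the answer above):
from collections import Counter

def freq_threshold(s, n):
    # same tally, but return how many uppercase characters exceed the threshold
    cnt = Counter(s)
    return sum(1 for i in cnt if cnt[i] > n and i.isupper())

print(freq_threshold("Blah blah A B C A RE.", 1))  # 2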
