Count the number of unique column elements in Python

Imagine that this data frame is a small sample of a larger data frame with 11 pianists, each conveying an emotion (Angry, Happy, Relaxed, or Sad) to a listener. For every pianist, I want to count how often each emotion occurs so that I can plot the counts later and look for a pattern in the data.
I am struggling to get this done. I managed it to a certain degree, but the code is poor and would get very long if I had to repeat it for all 11 pianists.
Could somebody please help me automate this with more efficient, cleaner code?
My work:
import pandas as pd

d = {
    'pianist_id':
        [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'class':
        ['Angry', 'Sad', 'Sad', 'Angry', 'Angry', 'Angry', 'Relaxed', 'Happy', 'Happy', 'Happy']
}
df = pd.DataFrame(d)
# Count the rows that belong to pianist 1
count = 0
for i in range(df.shape[0]):
    if df['pianist_id'][i] == 1:
        count += 1
# Take the first `count` rows (this relies on pianist 1's rows coming first)
df_split_1 = df.iloc[:count]
print(df_split_1['class'].value_counts())
pianist_1 = df_split_1['class'].value_counts().to_dict()
dict_pianist_1 = {}
dict_pianist_1['1'] = pianist_1
I want to have something like this for each of the 11 pianists:
{
'1': {
'Sad': 67,
'Happy': 66,
'Angry': 54,
'Relaxed': 50
},
'2':{
'Angry',,,,,''
},
,,,,,,
}
Thanks for the help!

You can group by the pianist_id column, use value_counts to count each value of the class column within each group, and finally call to_dict to convert the result to a dict.
d = df.groupby('pianist_id').apply(lambda group: group['class'].value_counts().to_dict()).to_dict()
print(d)
{1: {'Sad': 2, 'Angry': 1}, 2: {'Angry': 3}, 3: {'Relaxed': 1, 'Happy': 1}, 4: {'Happy': 2}}

You can compute the size of each (pianist_id, class) pair:
df.groupby(['pianist_id', 'class']).size()
Which gives the following output:
pianist_id  class
1           Angry      1
            Sad        2
2           Angry      3
3           Happy      1
            Relaxed    1
4           Happy      2
dtype: int64
To get the format you need, unstack the index (which also lets you fill the missing values at the same time) and then convert the resulting DataFrame to a dict:
df.groupby(['pianist_id', 'class']).size().unstack(fill_value=0).to_dict(orient='index')
This produces the output:
{1: {'Angry': 1, 'Happy': 0, 'Relaxed': 0, 'Sad': 2}, 2: {'Angry': 3, 'Happy': 0, 'Relaxed': 0, 'Sad': 0}, 3: {'Angry': 0, 'Happy': 1, 'Relaxed': 1, 'Sad': 0}, 4: {'Angry': 0, 'Happy': 2, 'Relaxed': 0, 'Sad': 0}}
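A possibly more direct route to the same table is pd.crosstab, which can be convenient when the counts are meant for plotting. The following is a minimal sketch, assuming the sample data from the question; the variable names counts and result are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'pianist_id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'class': ['Angry', 'Sad', 'Sad', 'Angry', 'Angry', 'Angry',
              'Relaxed', 'Happy', 'Happy', 'Happy'],
})

# One row per pianist, one column per emotion; absent pairs become 0
counts = pd.crosstab(df['pianist_id'], df['class'])
print(counts)

# Same dict-of-dicts shape as the groupby/unstack version
result = counts.to_dict(orient='index')
print(result)
```

Because counts is an ordinary DataFrame, it can be passed straight to a plotting call such as counts.plot(kind='bar') if matplotlib is installed.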

Since the end result specified in the question is a Python dict of dicts, you may prefer to use a more Python-centric than pandas-centric approach. Here's an answer that gives several alternatives for which pandas usage is limited to taking the original dataframe as input, calling its apply method and accessing its 'pianist_id' and 'class' columns:
result = {id : {} for id in df['pianist_id'].unique()}
def updateEmotionCount(id, emotion):
    result[id].update({emotion : result[id].get(emotion, 0) + 1})
df.apply(lambda x: updateEmotionCount(x['pianist_id'], x['class']), axis = 1)
print(result)
... or, in two lines using just lambda:
result = {id : {} for id in df['pianist_id'].unique()}
df.apply(lambda x: result[x['pianist_id']].update({x['class'] : result[x['pianist_id']].get(x['class'], 0) + 1}), axis = 1)
... or, using more lines but benefitting from the convenience of defaultdict:
import collections
result = {id : collections.defaultdict(int) for id in df['pianist_id'].unique()}
def updateEmotionCount(id, emotion):
    result[id][emotion] += 1
df.apply(lambda x: updateEmotionCount(x['pianist_id'], x['class']), axis = 1)
result = {id : dict(result[id]) for id in result}
... or (finally) using the walrus operator := to eliminate the separate function and just use lambda (there is an argument that this approach is somewhat cryptic ... but the same could be said of pandas-centric solutions):
Using regular dict datatype:
result = {id : {} for id in df['pianist_id'].unique()}
df.apply(lambda x: (id := x['pianist_id'], emotion := x['class'], result[id].update({emotion : result[id].get(emotion, 0) + 1})), axis = 1)
Using defaultdict:
import collections
result = {id : collections.defaultdict(int) for id in df['pianist_id'].unique()}
df.apply(lambda x: (id := x['pianist_id'], emotion := x['class'], result[id].update({emotion : result[id][emotion] + 1})), axis = 1)
result = {id : dict(result[id]) for id in result}
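If the goal is to keep pandas usage to a minimum, another option is to skip apply entirely and iterate over the two columns with zip. This is only a sketch, not one of the original alternatives; it assumes the sample data from the question and uses collections.Counter for the inner counts.

```python
from collections import Counter, defaultdict

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'pianist_id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'class': ['Angry', 'Sad', 'Sad', 'Angry', 'Angry', 'Angry',
              'Relaxed', 'Happy', 'Happy', 'Happy'],
})

# Each pianist gets its own Counter the first time it is seen
result = defaultdict(Counter)
for pianist, emotion in zip(df['pianist_id'], df['class']):
    result[pianist][emotion] += 1

# Convert to plain ints and dicts for the final dict of dicts
result = {int(pianist): dict(counts) for pianist, counts in result.items()}
print(result)
```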

Related

Converting collections.Counters of combinations frequency from dataframe multi-index into string

I would like to ask for some advice on how to do this properly. I'm new to Python.
Initially I wanted to find the frequency of each combination in a multi-index. I tried a few approaches, such as loops, itertuples, and iterrows, and found that collections.Counter is the fastest with the least overhead.
However, it uses tuples of the multi-index combinations as the counter's dict keys. The tuple keys make later processing hard.
So I am trying to figure out how to turn them into strings with separators to make later processing easier to manage.
For example this multi-index below:
import collections
import pandas as pd

# testing
def testing():
    testing_df = pd.read_csv("data/testing.csv", float_precision="high")
    testing_df = testing_df.set_index(["class", "table", "seat"]).sort_index()
    print("\n1: \n" + str(testing_df.to_string()))
    print("\n2 test: \n" + str(testing_df.index))
    occurrences = collections.Counter(testing_df.index)
    print("\n3: \n" + str(occurrences))
Output:
1:
                    random_no
class   table seat
Emerald 1     0         55.00
Ruby    0     0         33.67
              0         24.01
              1         87.00
Topaz   0     0         67.00
2 test:
MultiIndex([('Emerald', 1, 0),
            (   'Ruby', 0, 0),
            (   'Ruby', 0, 0),
            (   'Ruby', 0, 1),
            (  'Topaz', 0, 0)],
           names=['class', 'table', 'seat'])
3:
Counter({('Ruby', 0, 0): 2, ('Emerald', 1, 0): 1, ('Ruby', 0, 1): 1, ('Topaz', 0, 0): 1})
As we can see from 3), it returns the combinations as tuples of mixed data types for the dict keys, which makes them hard to process.
I tried to split the tuples or convert them to strings so that processing would be easier.
I tried the code below and got an error:
x = "|".join(testing_df.index)
print(x)
x = "|".join(testing_df.index)
TypeError: sequence item 0: expected str instance, tuple found
and the code below, which also fails:
x = "|".join(testing_df.index[0])
print(x)
x = "|".join(testing_df.index[0])
TypeError: sequence item 1: expected str instance, numpy.int64 found
Basically, it's either:
1. I convert the combinations into strings before calculating collections.Counter, or
2. after building the collections.Counter, where all the keys are tuples, I convert those keys into strings.
How do I do this properly?
Thank you very much!
I can offer a solution for 2., convert key tuples into strings:
from collections import Counter
# recreate your problem
occurrences = Counter([('Ruby', 0, 0),
('Ruby', 0, 0),
('Emerald', 1, 0),
('Ruby', 0, 1),
('Topaz', 0, 0)])
# convert tuple keys to string keys
new_occurrences = {'|'.join(str(index) for index in key) : value for key,value in occurrences.items()}
print(new_occurrences)
{'Ruby|0|0': 2, 'Emerald|1|0': 1, 'Ruby|0|1': 1, 'Topaz|0|0': 1}
Counter is a subclass of dict, therefore you can use fancy things like dict-comprehensions and .items() to loop over keys and values at the same time.
Depending on how you intend to process your data further, it might be more useful to convert the result of your counter to a pandas DataFrame, simply because pandas offers more and easier functionality for processing.
Here's how:
import pandas as pd
df = pd.DataFrame({'class': [k[0] for k in occurrences.keys()],
'table': [k[1] for k in occurrences.keys()],
'seat': [k[2] for k in occurrences.keys()],
'counts': [v for _,v in occurrences.items()]})
df.head()
     class  table  seat  counts
0     Ruby      0     0       2
1  Emerald      1     0       1
2     Ruby      0     1       1
3    Topaz      0     0       1
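For option 1 from the question (building the strings before counting), a sketch could look like the following. It assumes the multi-index shown in the question, rebuilt here with MultiIndex.from_tuples so the example is self-contained. Each level value is passed through str first, because str.join only accepts strings.

```python
from collections import Counter

import pandas as pd

# Rebuild the index from the question's output (stand-in for testing_df.index)
index = pd.MultiIndex.from_tuples(
    [('Emerald', 1, 0), ('Ruby', 0, 0), ('Ruby', 0, 0),
     ('Ruby', 0, 1), ('Topaz', 0, 0)],
    names=['class', 'table', 'seat'],
)

# Iterating a MultiIndex yields one tuple per row; stringify and join each one
occurrences = Counter('|'.join(map(str, key)) for key in index)
print(occurrences)
```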

Frequency count based on column values in Pandas

For example I have a data frame which looks like this:
First Image
And I would like to make a new data frame which shows the number of times a word was marked as spam or ham. I want it to look like this:
Second image
I tried the following code to build a list of spam counts per word as a test, but it does not work and crashes the kernel in Jupyter Notebook:
words = []
for word in df["Message"]:
    words.extend(word.split())
sentences = []
for word in df["Message"]:
    sentences.append(word.split())
spam = []
ham = []
for word in words:
    sc = 0
    hc = 0
    for index,sentence in enumerate(sentences):
        if word in sentence:
            print(word)
            if(df["Category"][index])=="ham":
                hc+=1
            else:
                sc+=1
    spam.append(sc)
spam
Where df is the data frame shown in the First Image.
How can I go about doing this?
You can build two dictionaries, spam and ham, to store the number of occurrences of each word in spam and ham messages.
from collections import defaultdict as dd
spam = dd(int)
ham = dd(int)
for i in range(len(sentences)):
    if df['Category'][i] == 'ham':
        p = sentences[i]
        for x in p:
            ham[x] += 1
    else:
        p = sentences[i]
        for x in p:
            spam[x] += 1
The output obtained from the code above for similar input to yours is as below.
>>> spam
defaultdict(<class 'int'>, {'ok': 1, 'lar': 1, 'joking': 1, 'wif': 1, 'u': 1, 'oni': 1, 'free': 1, 'entry': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1})
>>> ham
defaultdict(<class 'int'>, {'go': 1, 'until': 1, 'jurong': 1, 'crazy': 1, 'available': 1, 'only': 1, 'in': 1, 'u': 1, 'dun': 1, 'say': 1, 's': 1, 'oearly': 1, 'nah': 1, 'I': 1, 'don’t': 1, 'think': 1, 'he': 1, 'goes': 1, 'to': 1, 'usf': 1})
Now you can manipulate the data and export it in the required format.
EDIT:
answer = []
for x in spam:
    answer.append([x,spam[x],ham[x]])
for x in ham:
    if x not in spam:
        answer.append([x,spam[x],ham[x]])
Here the number of rows in the answer list is equal to the number of distinct words across all messages. The first column in each row is the word, and the second and third columns are the number of occurrences of that word in spam and ham messages respectively.
The output obtained for my code is as below.
['ok', 1, 0]
['lar', 1, 0]
['joking', 1, 0]
['wif', 1, 0]
['u', 1, 1]
['oni', 1, 0]
['free', 1, 0]
['entry', 1, 0]
['in', 1, 1]
This would be better:
https://docs.python.org/3.8/library/collections.html#collections.Counter
from collections import Counter
import pandas as pd
df # the data frame in your first image
df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))
def func(df: pd.DataFrame):
    for category, data in df.groupby('Category'):
        count = Counter()
        for var in data.Counter:
            count += var
        cur = pd.DataFrame.from_dict(count, orient='index', columns=[category])
        yield cur
demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
EDIT:
from collections import Counter
import pandas as pd
df # the data frame in your first image. Suits both cases (whether it is a slice of the complete data frame or not)
def func(df: pd.DataFrame):
    res = df.groupby('Category').Message.apply(' '.join).str.split().apply(Counter)
    for category, count in res.to_dict().items():
        yield pd.DataFrame.from_dict(count, orient='index', columns=[category])
demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
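A more compact pandas-only variant is also possible: split each message into words, put one word on each row, and count (word, Category) pairs. The sketch below is an assumption-laden illustration. The small data frame is a hypothetical stand-in for the one in the question's first image, with only the Category and Message column names taken from the question's code.

```python
import pandas as pd

# Hypothetical stand-in for the data frame in the question's first image
df = pd.DataFrame({
    'Category': ['ham', 'spam', 'ham'],
    'Message': ['go until jurong', 'free entry in a comp', 'u dun say so in time'],
})

# One row per (Category, word) pair
words = df.assign(word=df['Message'].str.split()).explode('word')

# Rows are words, columns are categories, cells are occurrence counts
table = words.groupby(['word', 'Category']).size().unstack(fill_value=0)
print(table)
```

Words that never occur in a category get a 0 in that column, so no separate merge step is needed.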

How to find the number of every length of contiguous sequences of values in a list?

Problem
Given a sequence (list or numpy array) of 1's and 0's how can I find the number of contiguous sub-sequences of values? I want to return a JSON-like dictionary of dictionaries.
Example
[0, 0, 1, 1, 0, 1, 1, 1, 0, 0] would return
{
0: {
1: 1,
2: 2
},
1: {
2: 1,
3: 1
}
}
Tried
This is the function I have so far
def foo(arr):
    prev = arr[0]
    count = 1
    lengths = dict.fromkeys(arr, {})
    for i in arr[1:]:
        if i == prev:
            count += 1
        else:
            if count in lengths[prev].keys():
                lengths[prev][count] += 1
            else:
                lengths[prev][count] = 1
            prev = i
            count = 1
    return lengths
It is outputting identical dictionaries for 0 and 1 even if their appearance in the list is different. And this function isn't picking up the last value. How can I improve and fix it? Also, does numpy offer any quicker ways to solve my problem if my data is in a numpy array? (maybe using np.where(...))
You're suffering from Ye Olde Replication Error. Let's instrument your function to show the problem, adding one line to check the object ID of each inner dict:
lengths = dict.fromkeys(arr, {})
print(id(lengths[0]), id(lengths[1]))
Output:
140130522360928 140130522360928
{0: {2: 2, 1: 1, 3: 1}, 1: {2: 2, 1: 1, 3: 1}}
The problem is that you gave the same dict as initial value for each key. When you update either of them, you're changing the one object to which they both refer.
Replace it with an explicit loop -- not a mutable function argument -- that will create a new object for each dict entry:
for key in lengths:
    lengths[key] = {}
print(id(lengths[0]), id(lengths[1]))
Output:
139872021765576 139872021765288
{0: {2: 1, 1: 1}, 1: {2: 1, 3: 1}}
Now you have separate objects.
If you want a one-liner, use a dict comprehension:
lengths = {key: {} for key in lengths}
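The fix above separates the two dicts, but the final run is still never recorded, because a run is only counted when the value changes. One way to handle both issues at once is itertools.groupby, which yields one group per run of equal adjacent values. This is a sketch rather than a patch to the original function, and the name run_length_counts is made up for the example.

```python
from collections import defaultdict
from itertools import groupby


def run_length_counts(arr):
    """Map each value to {run length: number of runs of that length}."""
    lengths = defaultdict(lambda: defaultdict(int))
    # groupby yields (value, iterator over the run) for every maximal run,
    # including the last one
    for value, run in groupby(arr):
        lengths[value][sum(1 for _ in run)] += 1
    # Convert the nested defaultdicts to plain dicts
    return {value: dict(counts) for value, counts in lengths.items()}


print(run_length_counts([0, 0, 1, 1, 0, 1, 1, 1, 0, 0]))
```

For a numpy array, calling arr.tolist() first keeps the dictionary keys as plain Python ints.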

How to use .apply() to combine a column of dictionaries into one dictionary?

I have a column of dictionaries within a pandas data frame.
srs_tf = pd.Series([{'dried': 1, 'oak': 2},{'fruity': 2, 'earthy': 2},{'tones': 2, 'oak': 4}])
srs_b = pd.Series([2,4,6])
df = pd.DataFrame({'tf': srs_tf, 'b': srs_b})
df
                           tf  b
0      {'dried': 1, 'oak': 2}  2
1  {'fruity': 2, 'earthy': 2}  4
2      {'tones': 2, 'oak': 4}  6
These dictionaries represent word frequency in descriptions of wines (Ex input dictionary:{'savory': 1, 'dried': 3, 'thyme': 1, 'notes':..}). I need to create an output dictionary from this column of dictionaries that contains all of the keys from the input dictionaries and maps them to the number of input dictionaries in which those keys are present. For example, the word 'dried' is a key in 850 of the input dictionaries, so in the output dictionary {.. 'dried': 850...}.
I want to try using the data frame .apply() method but I believe that I am using it incorrectly.
def worddict(row, description_counter):
    for key in row['tf'].keys():
        if key in description_counter.keys():
            description_counter[key] += 1
        else:
            description_counter[key] = 1
    return description_counter
description_counter = {}
output_dict = df_wine_list.apply(lambda x: worddict(x, description_counter), axis = 1)
So a couple things. I think that my axis should = 0 rather than 1, but I get this error when I try that: KeyError: ('tf', 'occurred at index Unnamed: 0')
When I do use axis = 1, my function returns a column of identical dictionaries rather than a single dictionary.
You can use chain and Counter:
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(df['tf']))
# Counter({'dried': 1, 'earthy': 1, 'fruity': 1, 'oak': 2, 'tones': 1})
Or,
Counter(y for x in df['tf'] for y in x)
# Counter({'dried': 1, 'earthy': 1, 'fruity': 1, 'oak': 2, 'tones': 1})
You can also use Index.value_counts,
pd.concat(map(pd.Series, df['tf'])).index.value_counts().to_dict()
# {'dried': 1, 'earthy': 1, 'fruity': 1, 'oak': 2, 'tones': 1}
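Iterating over a dict yields its keys, so the approaches above count how many dictionaries contain each word, which is what the question asks for. To make the distinction concrete, the sketch below (using the sample data from the question; doc_freq and total_freq are illustrative names) contrasts that with summing the stored frequencies.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'tf': [{'dried': 1, 'oak': 2}, {'fruity': 2, 'earthy': 2}, {'tones': 2, 'oak': 4}],
    'b': [2, 4, 6],
})

# Number of dictionaries in which each key appears
doc_freq = Counter(chain.from_iterable(df['tf']))

# Sum of the values stored under each key across all dictionaries
total_freq = sum(map(Counter, df['tf']), Counter())

print(doc_freq['oak'], total_freq['oak'])
```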

saving more than one value in one Python array position (kinda container)

I would like to use something as a container, but I can't do objects... I believe there is some library or collection that could help me.
I want to save a few connected values into one array position:
array = []
array.append(value1 = 1, value2 = 2, value3 = 3)
array.append(value1 = 5, value2 = 7, value3 = 10)
array.append(value1 = 2, value2 = 3, value3 = 3)
Something like this... And then I would like to search in this array like
for n in array:
    n.value1 = ....
But I'm a beginner and don't know much about the language... Can you please help me?
You are looking for a dictionary. It can be used like this:
d = {"value1": 1, "value2": 2, "value3": 3}
for k in d:
    print("key: {}, value: {}".format(k, d[k]))
Here are the docs: https://docs.python.org/2/tutorial/datastructures.html#dictionaries
For your problem you'll need a list of dictionaries, like this:
list_of_dict = []
list_of_dict.append({"value1": 1, "value2": 2, "value3": 3})
list_of_dict.append({"value1": 5, "value2": 7, "value3": 10})
list_of_dict.append({"value1": 2, "value2": 3, "value3": 3})
for dct in list_of_dict:
    dct["value1"] = ...
As mentioned in the comment you are looking for a dictionary; see the docs or this tutorial.
Example code (the variable is named values rather than dict, to avoid shadowing the built-in type):
values = {'value1': 1, 'value2': 2, 'value3': 3}
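If the attribute-style access from the question (n.value1) is preferred over dictionary keys, one option that needs no class definition is types.SimpleNamespace from the standard library. This is a sketch of an alternative, not what the answers above propose. The names records and matches are illustrative.

```python
from types import SimpleNamespace

# Each list position holds one small record with named fields
records = []
records.append(SimpleNamespace(value1=1, value2=2, value3=3))
records.append(SimpleNamespace(value1=5, value2=7, value3=10))
records.append(SimpleNamespace(value1=2, value2=3, value3=3))

# Fields can be read and assigned with dot notation
for n in records:
    n.value1 = n.value1 * 10

# Searching is an ordinary loop or comprehension over the list
matches = [n for n in records if n.value3 == 3]
print(matches)
```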
