Using value_counts to find unique values across a combination of columns - python

I have been trying to get a count on multiple columns using value_counts. Right now, I have it working on one column, but not multiple.
EDIT: I needed a count of unique IDs previously, hence the count on 'id', but now I want a count of the services under 'id'. I'm editing the data below to more accurately explain the situation.
import pandas as pd
d = {'id': [1, 1, 2, 3, 3], 'service': [3, 3, 4, 2, 3], 'name': ['Joe', 'Joe', 'Bob', 'Val', 'Val']}
df = pd.DataFrame(data=d)
df['count'] = df['id'].map(df['id'].value_counts())
If I try
df['count'] = df['id'].map(df['id']['service'].value_counts())
I get a KeyError on service.
If I try
df['count'] = df['id']['service'].map(df['id'].value_counts())
I get the same error.
I'm hoping to get something along the lines of:
(id=1, service=3): 2
(id=2, service=4): 1
(id=3, service=2): 1
(id=3, service=3): 1
Am I using the wrong function?

There are a couple of ways: either use groupby with count, or create a tuple column and apply value_counts.
Both methods provide results that can be indexed via tuples.
Setup
import pandas as pd
d = {'id': [1, 2, 1], 'service': [3, 4, 3], 'name': ['Joe', 'Bob', 'Mark']}
df = pd.DataFrame(d)
Groupby method
As suggested by @Dark:
res = df.groupby(['id', 'service']).count()
#             name
# id service
# 1  3           2
# 2  4           1
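If you want id and service back as regular columns instead of a MultiIndex, a small variant (my addition, not part of the original answer) is size with reset_index, using the Setup frame above:
# size counts rows per group; reset_index turns the group keys back into columns
res = df.groupby(['id', 'service']).size().reset_index(name='count')
print(res)
#    id  service  count
# 0   1        3      2
# 1   2        4      1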
Tuple column method
df['id_service'] = list(zip(df.id, df.service))
res = df['id_service'].value_counts()
# (1, 3)    2
# (2, 4)    1
# Name: id_service, dtype: int64
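Since the edited question ultimately wants a per-row count column (df['count'] = ...), here is a minimal sketch using groupby().transform('count') on the question's edited data; any column works as the counting target, since transform only counts rows per group:
import pandas as pd

d = {'id': [1, 1, 2, 3, 3], 'service': [3, 3, 4, 2, 3], 'name': ['Joe', 'Joe', 'Bob', 'Val', 'Val']}
df = pd.DataFrame(data=d)

# broadcast each (id, service) pair's count back onto its rows
df['count'] = df.groupby(['id', 'service'])['id'].transform('count')
print(df)
#    id  service name  count
# 0   1        3  Joe      2
# 1   1        3  Joe      2
# 2   2        4  Bob      1
# 3   3        2  Val      1
# 4   3        3  Val      1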


Pandas: Setting a value in a cell when multiple columns are empty

I've been looking for ways to do this natively for a little while now and can't find a solution.
I have a large dataframe where I would like to set the value in other_col to 'True' for all rows where one of a list of columns is empty.
This works for a single column page_title:
df.loc[df['page_title'].isna(), ['other_col']] = ''
But not when using a list
df.loc[df[['page_title','brand','name']].isna(), ['other_col']] = ''
Any ideas of how I could do this without using Numpy or looping through all rows?
Thanks
Maybe this is what you are looking for:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['1', '2', '3', np.nan],
    'B': ['10', np.nan, np.nan, '40'],
    'C': ['test', 'test', 'test', 'test']})
df.loc[df[['A', 'B']].isna().any(axis=1), ['C']] = 'value'
print(df)
Result:
     A    B      C
0    1   10   test
1    2  NaN  value
2    3  NaN  value
3  NaN   40  value
This lets you specify which columns to check for np.nan and sets a True/False indicator accordingly:
import numpy as np
import pandas as pd

data = {
    'Column1': [1, 2, 3, np.nan],
    'Column2': [1, 2, 3, 4],
    'Column3': [1, 2, np.nan, 4]
}
df = pd.DataFrame(data)
# True where any of the three columns is NaN
df['other_col'] = np.where(df['Column1'].isna() | df['Column2'].isna() | df['Column3'].isna(), True, False)
df
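A more concise equivalent, as a sketch assuming you only need the boolean flag: .isna().any(axis=1) already returns True/False, so np.where is not strictly needed:
# True when at least one of the listed columns is NaN
df['other_col'] = df[['Column1', 'Column2', 'Column3']].isna().any(axis=1)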

Select a single value from a column after grouping by other columns in python

I tried to select a single value of the class column from each group of my dataframe after performing groupby on the columns first_register and second_register, but it did not work.
Suppose I have a dataframe like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'class': [1, 1, 1, 2, 2, 2, 0, 0, 1],
                   'first_register': ["70/20", "70/20", "70/20", "71/20", "71/20", "71/20", np.nan, np.nan, np.nan],
                   'second_register': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, "72/20", "72/20", "73/20"]})
What I tried, which did not work at all:
group_by_df = df.groupby(["first_register", "second_register"])
label_class = group_by_df["class"].unique()
print(label_class)
How can I select/access each single class label from each group of dataframe?
The desired output can be an ordered list like this to represent each class of each group from the first group to the final group:
label_class = [1, 2, 0, 1]
Use dropna=False:
group_by_df = df.groupby(["first_register", "second_register"], dropna=False)
label_class = group_by_df["class"].unique()
first_register  second_register
70/20           NaN                [1]
71/20           NaN                [2]
NaN             72/20              [0]
                73/20              [1]
Name: class, dtype: object
If you know that each group has exactly one unique class, or you just want the first or the last value:
label_class = group_by_df["class"].first()
Or:
label_class = group_by_df["class"].last()
Use GroupBy.first:
out = df.groupby(["first_register", "second_register"], dropna=False)["class"].first()
print (out)
first_register  second_register
70/20           NaN                1
71/20           NaN                2
NaN             72/20              0
                73/20              1
Name: class, dtype: int64
label_class = out.tolist()
print (label_class)
[1, 2, 0, 1]
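For completeness, a sketch without groupby, assuming the first row of each (first_register, second_register) pair carries the class you want; drop_duplicates treats NaN keys as equal for this purpose:
# keep the first row per register pair, then pull out its class
label_class = df.drop_duplicates(["first_register", "second_register"])["class"].tolist()
print(label_class)
# [1, 2, 0, 1]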

pandas largest value per group with multi columns / why does it only work when flattening?

For a pandas dataframe of:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1],
    'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
    'match_level_1': [np.nan, np.nan, 1, 1],
    'match_level_2': [np.nan, 1, 1, 1]
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to get the rows with the largest values per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but it requires flattening the MultiIndex columns.
Instead I want to directly use,
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails with the key not being found.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)
Grouping the selected Series by the first level of the MultiIndex works for me, though it seems like a bug that your approach does not:
print (df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id  match_level_0
1   1.0              55.0
2   1.0               8.0
Name: (anomaly_score, mean), dtype: float64
print (df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
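An alternative sketch that avoids nlargest entirely, assuming you want the top two means per id: sort the Series once, then take the first two rows of each group:
# sort descending, then keep the first two rows per id group
top2 = (df[('anomaly_score', 'mean')]
        .sort_values(ascending=False)
        .groupby(level='id')
        .head(2))
print(top2)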

Pandas groupby get row with max in multiple columns

Looking to get the row of a group that has the maximum value across multiple columns:
pd.DataFrame([{'grouper': 'a', 'col1': 1, 'col2': 3, 'uniq_id': 1}, {'grouper': 'a', 'col1': 2, 'col2': 4, 'uniq_id': 2}, {'grouper': 'a', 'col1': 3, 'col2': 2, 'uniq_id': 3}])
   col1  col2 grouper  uniq_id
0     1     3       a        1
1     2     4       a        2
2     3     2       a        3
In the above, I'm grouping by the "grouper" column. Within the "a" group, I want the row with the max across col1 and col2; here that is the row with uniq_id 2, because it has the highest col1/col2 value (4), so the outcome would be:
   col1  col2 grouper  uniq_id
1     2     4       a        2
In my actual example, I'm using timestamps, so I actually don't expect ties. But in the case of a tie, I am indifferent to which row I select in the group, so it would just be first of the group in that case.
One more way you can try:
# find row wise max value
df['row_max'] = df[['col1','col2']].max(axis=1)
# filter rows from groups
df.loc[df.groupby('grouper')['row_max'].idxmax()]
   col1  col2 grouper  uniq_id  row_max
1     2     4       a        2        4
Later you can drop row_max using df.drop('row_max', axis=1)
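The same idea as one chained expression, a sketch assuming you start from the original frame and don't want the helper column to persist:
# add row_max, pick each group's idxmax row, then drop the helper
out = (df.assign(row_max=df[['col1', 'col2']].max(axis=1))
         .loc[lambda d: d.groupby('grouper')['row_max'].idxmax()]
         .drop(columns='row_max'))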
IIUC, use transform to get each group's max, then compare with the original dataframe:
g = df.groupby('grouper')
s1 = g.col1.transform('max')
s2 = g.col2.transform('max')
s = pd.concat([s1, s2], axis=1).max(axis=1)
df.loc[df[['col1', 'col2']].eq(s, axis=0).any(axis=1)]
Out[89]:
   col1  col2 grouper  uniq_id
1     2     4       a        2
Interesting approaches all around. Adding another one just to show the power of apply (which I'm a big fan of) and using some of the other mentioned methods.
import pandas as pd
df = pd.DataFrame(
    [
        {"grouper": "a", "col1": 1, "col2": 3, "uniq_id": 1},
        {"grouper": "a", "col1": 2, "col2": 4, "uniq_id": 2},
        {"grouper": "a", "col1": 3, "col2": 2, "uniq_id": 3},
    ]
)

def find_max(grp):
    # find max value per row, then find index of row with max val
    max_row_idx = grp[["col1", "col2"]].max(axis=1).idxmax()
    return grp.loc[max_row_idx]
df.groupby("grouper").apply(find_max)
value = pd.concat([df['col1'], df['col2']], axis = 0).max()
df.loc[(df['col1'] == value) | (df['col2'] == value), :]
   col1  col2 grouper  uniq_id
1     2     4       a        2
This probably isn't the fastest way, but it will work in your case. Concat both the columns and find the max, then search the df for where either column equals the value.
You can use numpy and pandas as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [3, 4, 2],
                   'grouper': ['a', 'a', 'a'],
                   'uniq_id': [1, 2, 3]})
# row-wise max of the two columns
df['temp'] = np.max([df.col1.values, df.col2.values], axis=0)
idx = df.groupby('grouper')['temp'].idxmax()
df.loc[idx].drop('temp', axis=1)
   col1  col2 grouper  uniq_id
1     2     4       a        2

Select columns of pandas dataframe using a dictionary list value

I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select the columns named by the dictionary values 'b' and 'c' and save them into df_out?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove the nested list (the extra []):
df_out = df_in[ds['cols']]
print(df_out)
   b  c
0  3  4
1  4  5
Per the pandas indexing documentation, you just need to drop one set of brackets:
df_out = df_in[ds['cols']]
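If some of the dictionary's column names might be missing from the frame, a defensive sketch (my addition) intersects first:
# select only the listed columns that actually exist in df_in
df_out = df_in[df_in.columns.intersection(ds['cols'])]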
