I have a dataframe:
import pandas as pd

df_input = pd.DataFrame(
    {
        "col_cate": ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        "target_bool": [True, False, True, False, True, False, True, False]
    }
)
And I want to count the occurrences of each unique (col_cate, target_bool) combination. So I am expecting the output to be like this:
col_cate, target_bool, cnt
'A' , True , 2
'A' , False , 2
'B' , True , 2
'B' , False , 2
But df_input.groupby(["col_cate", "target_bool"]).count() gives
Empty DataFrame
Columns: []
Index: [(A, False), (A, True), (B, False), (B, True)]
But adding a dummy column to df_input works, e.g. df_input["dummy"] = 1.
How do I get the group-by count table without adding a dummy column?
df_input.groupby('col_cate')['target_bool'].value_counts()
col_cate target_bool
A False 2
True 2
B False 2
True 2
Then you can call reset_index() to turn the MultiIndex back into flat columns, as in the sketch below.
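A minimal sketch of the full chain (the name argument labels the counts column cnt, which also avoids a clash with the target_bool index level on older pandas versions):
out = (df_input.groupby('col_cate')['target_bool']
               .value_counts()
               .reset_index(name='cnt'))
print(out)
# columns: col_cate, target_bool, cnt (row order within each group may differ)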
GroupBy.count counts values while excluding missing values, so if both columns are passed in the by parameter of groupby, it is necessary to specify a column to count after the groupby:
df = (df_input.groupby(by=["col_cate", "target_bool"])['col_cate']
.count()
.reset_index(name='cnt'))
print (df)
col_cate target_bool cnt
0 A False 2
1 A True 2
2 B False 2
3 B True 2
If you want to count all columns (here both, which always gives the same output), specify both columns:
df1 = (df_input.groupby(["col_cate", "target_bool"])[['col_cate','target_bool']]
.count()
.add_suffix('_count')
.reset_index())
print (df1)
col_cate target_bool col_cate_count target_bool_count
0 A False 2 2
1 A True 2 2
2 B False 2 2
3 B True 2 2
Or use the GroupBy.size method, which works a bit differently: it counts all values, without excluding missing ones, so no column needs to be specified:
df = df_input.groupby(["col_cate", "target_bool"]).size().reset_index(name='cnt')
print (df)
col_cate target_bool cnt
0 A False 2
1 A True 2
2 B False 2
3 B True 2
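To make the count/size distinction concrete, a small sketch with a hypothetical missing value (not part of the original data):
df_na = pd.DataFrame({
    "col_cate": ['A', 'A', 'B'],
    "target_bool": [True, None, False],   # one missing value in group A
})
print(df_na.groupby('col_cate')['target_bool'].count())  # A -> 1 (missing excluded), B -> 1
print(df_na.groupby('col_cate')['target_bool'].size())   # A -> 2 (missing included), B -> 1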
This also works:
In [54]: df_input.groupby(df_input.columns.tolist()).size().reset_index().\
...: rename(columns={0:'cnt'})
Out[54]:
col_cate target_bool cnt
0 A False 2
1 A True 2
2 B False 2
3 B True 2
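In recent pandas versions, as_index=False gives a similar flat result directly; a sketch (the counts column is then named size rather than cnt):
df = df_input.groupby(["col_cate", "target_bool"], as_index=False).size()
print(df)
#   col_cate  target_bool  size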
I assume this is a simple task for pandas, but I don't get it.
I have data like this:
Group Val
0 A 0
1 A 1
2 A <NA>
3 A 3
4 B 4
5 B <NA>
6 B 6
7 B <NA>
And I want to know the frequency of valid and invalid values in Val per group Group. This is the expected result.
A B Total
Valid 3 2 5
NA 1 2 3
Here is code to generate that sample data.
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame({
    'Group': list('AAAABBBB'),
    'Val': range(8)
})
# some values to NA
for idx in [2, 5, 7]:
    df.iloc[idx, 1] = pd.NA
print(df)
What I tried is something with grouping
>>> df.groupby('Group').agg(lambda x: x.isna())
Val
Group
A [False, False, True, False]
B [False, True, False, True]
>>> df.groupby('Group').apply(lambda x: x.isna())
Group Val
0 False False
1 False False
2 False True
3 False False
4 False False
5 False True
6 False False
7 False True
You are close with using groupby and isna
new = (df.groupby(['Group', df['Val'].isna().replace({True: 'NA', False: 'Valid'})])['Group']
         .count()
         .unstack(level=0))
new['Total'] = new.sum(axis=1)
print(new)
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
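If the row order of the expected output matters (Valid before NA), you can reorder the result with reindex; a small sketch:
new = new.reindex(['Valid', 'NA'])
print(new)
# Group  A  B  Total
# Valid  3  2      5
# NA     1  2      3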
Here is one way to do it:
# cross-tab to summarize
# convert Val to NA or Valid depending on the value
df2 = (pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                   df['Group'])
       .reset_index()
       .rename_axis(columns=None))
df2['Total'] = df2.sum(axis=1, numeric_only=True)  # add Total column
out = df2.set_index('Val')  # set index to match expected output
out
A B Total
Val
NA 1 2 3
Valid 3 2 5
If you need both the row and column totals, it's even simpler with crosstab:
df2 = pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                  df['Group'],
                  margins=True, margins_name='Total')
df2
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
Total 4 4 8
Another possible solution, based on pandas.pivot_table and on the following ideas:
Add a new column, status, which contains NA or Valid if the corresponding value is or is not NaN, respectively.
Create a pivot table, using len as aggregation function.
Add the Total column, by summing by rows.
import numpy as np

(df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
   .pivot_table(
       index='status', columns='Group', values='Val',
       aggfunc=lambda x: len(x))
   .reset_index()
   .rename_axis(None, axis=1)
   .assign(Total=lambda x: x.sum(axis=1)))
Output:
status A B Total
0 NA 1 2 3
1 Valid 3 2 5
I have a dataframe with two columns:
name and version.
I want to add a boolean in an extra column: True if the row has the highest version for its name, otherwise False.
import pandas as pd
data = [['a', 1], ['b', 2], ['a', 2], ['a', 2], ['b', 4]]
df = pd.DataFrame(data, columns = ['name', 'version'])
df
Is it best to use groupby for this?
I have tried something like this, but I do not know how to add the extra column with the boolean.
df.groupby(['name']).max()
Compare against the maximal values per group: GroupBy.transform with 'max' generates a new Series filled with the per-group maxima, which can then be compared with the original column:
df['bool_col'] = df['version'] == df.groupby('name')['version'].transform('max')
print(df)
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 True
3 a 2 True
4 b 4 True
Detail:
print(df.groupby('name')['version'].transform('max'))
0 2
1 4
2 2
3 2
4 4
Name: version, dtype: int64
You can assign your column directly; note that this compares against the overall maximum of version rather than the per-name maximum:
df['bool_col'] = df['version'] == max(df['version'])
Output:
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 False
3 a 2 False
4 b 4 True
Is this what you were looking for?
I have a temp df and a dflst.
temp has as columns the unique column names from the dataframes in dflst.
dflst has a dynamic length; my problem arises when len(dflst) >= 4.
All DFs (temp and all the ones in dflst) have columns with True/False values and a p column with numbers.
Code to recreate the data:
import itertools
import numpy as np
import pandas as pd

# making temp df
var_cols = ['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)

# making dflst
df0 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p'] = np.random.randint(1, 5, df0.shape[0])
df1 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p'] = np.random.randint(1, 5, df1.shape[0])
df2 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c']))), columns=['a', 'c'])
df2['p'] = np.random.randint(1, 5, df2.shape[0])
df3 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p'] = np.random.randint(1, 5, df3.shape[0])
dflst = [df0, df1, df2, df3]
I want to merge the dfs in dflst into temp, so that the p column values from each df in dflst end up in temp, in the rows with compatible values between the two.
I am currently doing it with pd.merge as follows:
for df in dflst:
    temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results in a df that has the same name for different columns when dflst has 4 or more dfs. I understand that this is due to the suffixes merge adds, but it creates problems with column indexing.
How can I get unique names on the new columns that are added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
    temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
                      on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3
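If you prefer to avoid the explicit loop, the same chain of merges can be written with functools.reduce; a sketch under the same assumptions (each df in dflst has its p column last):
from functools import reduce

temp_merged = reduce(
    lambda acc, pair: acc.merge(
        pair[1].rename(columns={'p': f'p{pair[0]}'}),
        on=pair[1].columns[:-1].tolist(), how='right'),
    enumerate(dflst), temp)
print(temp_merged)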
I have a data frame with categories and values. I need to find the value in each category closest to a given value. I think I'm close, but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, if the input was defined as in the code below, the output should mark only (a, 1, True), (b, 2, True), (c, 2, True), and all other isClosest values should be False.
If multiple values are equally close, then the first value listed should be the one marked.
Here is the code I have, which works, but I can't get it to reapply to the dataframe correctly. I would love some pointers.
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category'] == c]
    sortargs = (filteredCategories['values'] - 2.0).abs().argsort()
    # how to use sortargs so that we set isClosest=True in df
    # for the closest value to 2.0 in each category?
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).
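A small sketch of the tie case (hypothetical two-row data, same column names as above):
tie = pd.DataFrame({'category': ['a', 'a'], 'values': [1, 3]})
tie['dif'] = (tie['values'] - 2).abs()  # both rows have dif == 1
# equality against the group minimum flags both tied rows
print(tie.groupby('category')['dif'].transform('min') == tie['dif'])   # True, True
# idxmin keeps only the first occurrence per group
print(tie.index.isin(tie.groupby('category')['dif'].idxmin()))         # [ True False]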
Solution with DataFrameGroupBy.idxmin: get the indexes of the minimal values per group and then assign a boolean mask, built with Index.isin, to the column isClosest:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False
I am making a pivot table using pandas. If I set aggfunc=sum or aggfunc=count on a column of boolean values, it works fine provided there's at least one True in the column. E.g. [True, False, True, True, False] would return 3. However, if all the values are False, then the pivot table outputs False instead of 0. No matter what, I can't get around it. The only way I can circumvent it is to define a function as follows:
def f(x):
    mySum = sum(x)
    return "0" if mySum == 0 else mySum
and then set aggfunc=lambda x: f(x). While that works visually, it still disturbs me that outputting a string is the only way I can get the 0 to stick. If I cast it as an int, or try to return 0.0, or do anything that's numeric at all, False is always the result.
Why is this, and how do I get the pivot table to actually give me 0 in this case (by only modifying the aggfunc, not the dataframe itself)?
df = pd.DataFrame({'count': [False] * 12, 'index': [0, 1] * 6, 'cols': ['a', 'b', 'c'] * 4})
print(df)
outputs
cols count index
0 a False 0
1 b False 1
2 c False 0
3 a False 1
4 b False 0
5 c False 1
6 a False 0
7 b False 1
8 c False 0
9 a False 1
10 b False 0
11 c False 1
You can use astype to cast the result to int after pivoting:
res = df.pivot_table(values='count', aggfunc=np.sum, columns='cols', index='index').astype(int)
print(res)
outputs
cols a b c
index
0 0 0 0
1 0 0 0