aggregating and counting in pandas - python

for the following df
group participated
A 1
A 1
B 0
A 0
B 1
A 1
B 0
B 0
I want to count the total number of values in the participated column for each value in the group column (groupby-count), and then also count how many 1s there are in each group.
Something like
group tot_participated 1s
A 4 3
B 4 1
I know the first part is simple and can be done with a simple
grouped_df = df.groupby('group').count().reset_index()
but I'm unable to wrap my head around the second part. Any help will be greatly appreciated!

You could follow the groupby with an aggregation as below:
grp_df = df.groupby('group', as_index=False).agg({'participated':['count','sum']})
grp_df.columns = ['group','tot_participated','1s']
grp_df.head()
The caveat to using .agg with multiple aggregation functions on the same column is that a MultiIndex is created on the columns. This can be remedied by resetting the column names, as in the second line.
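As a side note, newer pandas versions (0.25+) also support named aggregation, which avoids the MultiIndex on the columns entirely. A minimal sketch, assuming the same df as in the question ('1s' is not a valid Python identifier, hence the rename at the end):
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'B', 'A', 'B', 'A', 'B', 'B'],
                   'participated': [1, 1, 0, 0, 1, 1, 0, 0]})

# Named aggregation: each output column gets its own (column, function) pair,
# so no MultiIndex is created on the columns
grp_df = df.groupby('group', as_index=False).agg(
    tot_participated=('participated', 'count'),
    ones=('participated', 'sum')).rename(columns={'ones': '1s'})
print(grp_df)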

Related

merge a dataframe based on one column and sum the other columns - Python

I'm looking to merge rows of a dataframe based on one column while summing the other columns. I have attempted to do this via df.loc combined with sum, and have searched Stack extensively already. If any of you have suggestions, they are very welcome :)
Original:
Date        Value x   Value y
13-3-1920   1         0
13-3-1920   0         1
30-4-1920   0         1
30-4-1920   1         1
Desired Output:
Date        Value x   Value y
13-3-1920   1         1
30-4-1920   1         2
Thank you in advance!
"df" is the dataframe. I've considered date column to be of string data type.
The group by function gives the expected output.
df.groupby(['date']).sum()
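For a self-contained reproduction of the question's frame (column names taken from the question), passing as_index=False keeps Date as a regular column, which matches the desired output:
import pandas as pd

df = pd.DataFrame({'Date': ['13-3-1920', '13-3-1920', '30-4-1920', '30-4-1920'],
                   'Value x': [1, 0, 0, 1],
                   'Value y': [0, 1, 1, 1]})

# Group on Date and sum the remaining numeric columns;
# as_index=False keeps Date as a column rather than the index
out = df.groupby('Date', as_index=False).sum()
print(out)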

Lambda function with groupby and condition issue

I wanted to see how to count the unique values of column B for each unique value in column A, where the corresponding value in column C is > 0.
df:
A B C
1 10 0
1 12 3
2 3 1
I tried this, but it's missing the where clause to filter for C > 0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's start by creating the dataframe that OP mentions in the question:
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what OP wants, there are various options. One way is to select the rows of the df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A, and finally use nunique to count the unique values of column B. In one line it would look like the following:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants to sum the number of unique items that satisfy the condition across all groups, starting from the Series count above, one can simply do
count = count.sum()
[Out]:
2
If one wants to do everything in one line, one can chain pandas.Series.sum as
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
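As a side note, if one wants to keep OP's apply/lambda style, the condition can be applied inside the lambda instead; a minimal sketch using the same df as above:
# Filter inside the lambda: look up C for the rows of each group and keep only those with C > 0
count = df.groupby('A')['B'].apply(lambda b: b[df.loc[b.index, 'C'] > 0].nunique())
print(count)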

New column with incremental numbers that restart based on a different column value (pandas)

I want to add a column with incremental numbers for rows that have the same value in a defined column;
e.g. if I would have this df
df=pd.DataFrame([['a','b'],['a','c'],['c','b']])
and I want incremental numbers for the first column. It should look like this
df=pd.DataFrame([['a','b',1],['a','c',2],['c','b',1]])
I found SQL solutions, but I'm working with IPython/pandas. Can someone help me?
Use cumcount; for the name of the new column, use the length of the original columns:
print (len(df.columns))
2

df[len(df.columns)] = df.groupby(0).cumcount() + 1
print (df)
   0  1  2
0  a  b  1
1  a  c  2
2  c  b  1
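As a side note, a minimal sketch that gives the counter column a descriptive name instead of the integer 2 (the name group_seq is just an example):
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['a', 'c'], ['c', 'b']])
# cumcount numbers the rows within each group of column 0, starting at 0
df['group_seq'] = df.groupby(0).cumcount() + 1
print(df)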

Pandas COUNTIF based on column value

I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b,c,d) that match a. In row 1 for instance it should be 1 as only d matches a.
I have searched quite a bit for this but so far only found examples where it's a constant number (like counting all values greater than 0), not a value from a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, to which you can pass an axis parameter to specify the direction of the comparison; then do a row sum to count the number of matched values:
df.eq(df.a, axis=0).sum(1) - 1
#0    1
#1    1
#2    1
#dtype: int64
Or, equivalently, with apply:
df.apply(lambda x: (x == x[0]).sum() - 1, axis=1)
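A self-contained sketch reproducing the question's frame (values taken from the question) so the above can be run as-is:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 5], 'c': [3, 4, 6], 'd': [1, 2, 3]})

# Compare every column against column a row-wise, count the matches per row,
# then subtract 1 so that column a matching itself is not counted
matches = df.eq(df['a'], axis=0).sum(axis=1) - 1
print(matches)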

Group counts. Why every column?

I often need to know how many entries I have in each group in a dataframe in Pandas. The following does it, but it returns one value for every column in my dataframe.
df.groupby(['A', 'B', 'C']).count()
That is, if I have, say 20 columns (where A, B and C are three of them), it would return 17 counts, all identical (at least every time I have done it) within each group.
What is the rationale behind this?
Is there any way to restrict the count to only one column? (or have it return only one value per group?)
Would that speed up the counts in any way?
The method DataFrameGroupBy.count doesn't seem to have an argument to specify which columns to count (I also could not find it in the API reference).
groupby(...).count() returns the count of non-null values in each column, so it can potentially be different for each column.
example:
>>> df
   jim  joe  jolie
0    4  NaN      4
1    8    0    NaN
>>> df.groupby('jim').count()
     joe  jolie
jim
4      0      1
8      1      0
.groupby(...).size() returns the size of each group.
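To answer the second part of the question directly: selecting a single column before counting restricts the count to that column, and size() gives one value per group regardless of NaNs. A minimal sketch using the example frame above:
import pandas as pd
import numpy as np

df = pd.DataFrame({'jim': [4, 8], 'joe': [np.nan, 0], 'jolie': [4, np.nan]})

# Count non-null values of a single column per group
print(df.groupby('jim')['joe'].count())

# One row count per group, independent of NaNs in any column
print(df.groupby('jim').size())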
