I often need to know how many entries I have in each group in a dataframe in Pandas. The following does it, but it returns one value for every column in my dataframe.
df.groupby(['A', 'B', 'C']).count()
That is, if I have, say, 20 columns (where A, B, and C are three of them), it returns 17 counts per group, all identical (at least every time I have done it).
What is the rationale behind this?
Is there any way to restrict the count to only one column? (or have it return only one value per group?)
Would that speed up the counts in any way?
The method DataFrameGroupBy.count doesn't seem to have an argument specifying which columns to count (I also could not find one in the API reference).
groupby(...).count() returns the count of non-null values in each column, so it can potentially differ from column to column.
example:
>>> df
jim joe jolie
0 4 NaN 4
1 8 0 NaN
>>> df.groupby('jim').count()
joe jolie
jim
4 0 1
8 1 0
.groupby(...).size() returns the size of each group.
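To address the follow-up questions: selecting a single column before calling count restricts the result to one value per group, and size() counts rows per group regardless of NaNs, which is often the cheaper option. A minimal sketch, where 'D' stands in for any one of your other columns:
df.groupby(['A', 'B', 'C'])['D'].count()  # non-null values of D per group
df.groupby(['A', 'B', 'C']).size()        # number of rows per group, NaNs included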
I have the following problem. Given this example dataframe:
Name    Planet    Number Column #1    Number Column #2
John    Earth     2                   0
Peter   Terra     0                   0
Anna    Mars      5                   4
Robert  Knowhere  0                   1
Here, I want to remove only the rows in which all of the number values are 0. In this case that is the second row (Peter), so my dataframe has to become like this:
Name    Planet    Number Column #1    Number Column #2
John    Earth     2                   0
Anna    Mars      5                   4
Robert  Knowhere  0                   1
For this, I have the following solution:
new_df = old_df.loc[(old_df['Number Column #1'] > 0) | (old_df['Number Column #2'] > 0)]
This works; however, I have another problem: depending on the request, my dataframe will dynamically have a different number of number columns. For example:
Name    Planet    Number Column #1    Number Column #2    Number Column #3
John    Earth     2                   0                   1
Peter   Terra     0                   0                   0
Anna    Mars      5                   4                   2
Robert  Knowhere  0                   1                   1
This is the problematic part, as I am not sure how I can adjust my code to work with a dynamic number of columns. I've tried multiple things from StackOverflow and the Pandas documentation; however, most examples only work for dataframes in which all columns are integers. Pandas would then consider them booleans, and you can apply a simple solution like this:
new_df = (df != 0).any(axis=1)
In my case, however, the text columns (which are always the same) are the problematic ones. Does anyone have an idea for a solution here? Thanks a lot in advance!
P.S. I have the names of the number columns available beforehand in the code as a list, for example:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
# my pandas logic...
IIUC:
You can use select_dtypes() to select the int and float columns, then check your condition and filter your dataframe:
df=df.loc[~df.select_dtypes(['int','float']).eq(0).all(axis=1)]
#OR
df=df.loc[df.select_dtypes(['int','float']).ne(0).any(axis=1)]
Note: if needed, you can also include 'bool' columns, typecast them to float, and then check your condition:
df=df.loc[df.select_dtypes(['int','float','bool']).astype(float).ne(0).any(axis=1)]
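If the numeric column names are already known (as mentioned in the P.S.), a sketch of an alternative that selects them explicitly by name rather than by dtype:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
df = df.loc[df[my_num_columns].ne(0).any(axis=1)]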
For the following df:
group participated
A 1
A 1
B 0
A 0
B 1
A 1
B 0
B 0
I want to count the total number of values in the participated column for each value in the group column (a groupby count), and then also count how many 1s there are in each group.
Something like
group tot_participated 1s
A 4 3
B 4 1
I know the first part is simple and can be done with a simple
grouped_df = df.groupby('group').count().reset_index()
but I'm unable to wrap my head around the second part. Any help will be greatly appreciated!
You could follow the groupby with an aggregation as below:
grp_df = df.groupby('group', as_index=False).agg({'participated':['count','sum']})
grp_df.columns = ['group','tot_participated','1s']
grp_df.head()
The caveat to using .agg with multiple aggregation functions on the same column is that a MultiIndex is created on the columns. This can be remedied by overwriting the column names, as in the second line.
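If you are on pandas 0.25 or later, named aggregation gives flat column names directly; a sketch (the dict unpacking is only needed because '1s' is not a valid Python identifier):
grp_df = df.groupby('group', as_index=False).agg(
    tot_participated=('participated', 'count'),
    **{'1s': ('participated', 'sum')}
)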
I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b,c,d) that match a. In row 1 for instance it should be 1 as only d matches a.
I have searched quite a bit for this but so far have only found examples where the comparison is against a constant (like counting all values greater than 0), not against a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, to which you can pass an axis parameter specifying the direction of the comparison, and then take a row sum to count the number of matched values (subtracting 1 to discard the match of a with itself):
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
Alternatively, with apply along the rows (x.iloc[0] is the value in column a):
df.apply(lambda x: (x == x.iloc[0]).sum() - 1, axis=1)
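If you want to keep the counts alongside the original data, either result can be assigned to a new column; a small sketch (the column name matches is just an example):
df['matches'] = df.eq(df['a'], axis=0).sum(axis=1) - 1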
I have a dataset like this:
Participant Type Rating
1 A 6
1 A 5
1 B 4
1 B 3
2 A 9
2 A 8
2 B 7
2 B 6
I want to obtain this:
Type MeanRating
A mean(6,9)
A mean(5,8)
B mean(4,7)
B mean(3,6)
So, for each type, I want the mean of the highest value in each group, then the mean of the second-highest value in each group, and so on.
I can't think of a proper way to do this with pandas, since the means always seem to be computed within groups, not across them.
First use groupby.rank to create a column that allows you to align the highest values, second highest values, etc. Then perform another groupby using the newly created column to compute the means:
# Get the grouping column.
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False)
# Perform the groupby and format the result.
result = df.groupby(['Type', 'Grouper'])['Rating'].mean().rename('MeanRating')
result = result.reset_index(level=1, drop=True).reset_index()
The resulting output:
Type MeanRating
0 A 7.5
1 A 6.5
2 B 5.5
3 B 4.5
I used the method='first' parameter of groupby.rank to handle the case of duplicate ratings within a ['Type', 'Participant'] group. You can omit it if this is not a possibility within your dataset, but it won't change the output if you leave it and there are no duplicates.
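For reference, here is a self-contained sketch of the same approach that rebuilds the example data, so the intermediate Grouper column and the final result can be inspected directly:
import pandas as pd

df = pd.DataFrame({
    'Participant': [1, 1, 1, 1, 2, 2, 2, 2],
    'Type':        ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Rating':      [6, 5, 4, 3, 9, 8, 7, 6],
})

# Rank ratings within each (Type, Participant) group, highest first.
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(
    method='first', ascending=False)

# Average the ratings that share the same rank across participants.
result = (df.groupby(['Type', 'Grouper'])['Rating']
            .mean()
            .rename('MeanRating')
            .reset_index(level=1, drop=True)
            .reset_index())
print(result)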
I have two columns in my pandas DataFrame.
A B
0 1 5
1 2 3
2 3 2
3 4 0
4 5 1
I need the value in A where the value of B is minimum. In the above case my answer would be 4 since the minimum B value is 0.
Can anyone help me with it?
To find the minimum in column B, you can use df.B.min(). For your DataFrame this returns 0.
To find values at particular locations in a DataFrame, you can use loc:
>>> df.loc[(df.B == df.B.min()), 'A']
3 4
Name: A, dtype: int64
So here, loc picks out all of the rows where column B is equal to its minimum value (df.B == df.B.min()) and selects the corresponding values in column A.
This method returns all values in A corresponding to the minimum value in B. If you only need to find one of the values, a better way is to use idxmin, as @aus_lacy suggests.
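If you do want a single scalar from this approach rather than a Series, one option (a sketch) is to take the first matching row:
a_val = df.loc[df.B == df.B.min(), 'A'].iloc[0]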
Here's one way:
b_min = df.B.idxmin()
a_val = df.A[b_min]
idxmin() returns the index of the minimum value within column B. You then locate the value at that same index in column A.
Or, if you want a single (albeit less readable) line, you can do it like this:
a_val = df.A[df.B.idxmin()]
Also, as a disclaimer, this solution assumes that the minimum value in column B is unique. For example, if you were to have a data set that looked like this:
A B
1 2
2 5
3 0
4 3
5 0
My solution would return the first instance where B's minimum value is located, which in this case is the third row, with a corresponding A value of 3. If you believe that the minimum value of B is not unique, then you should go with @ajcr's solution.