Pandas COUNTIF based on column value - python

I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b,c,d) that match a. In row 1 for instance it should be 1 as only d matches a.
I have searched quite a bit for this but so far have only found examples where the comparison is against a common number (like counting all values greater than 0), not against a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.

You can use eq, which accepts an axis parameter to specify the direction of the comparison; a row-wise sum then counts the matched values. Subtract 1 because column a always matches itself:
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64

Alternatively, using apply (typically slower than eq on large frames):
df.apply(lambda x: (x == x['a']).sum() - 1, axis=1)
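As a minimal sketch of the same idea, the reference column can be dropped before the comparison so that no "- 1" correction is needed. The DataFrame below is rebuilt from the question's sample data; the variable names matches and counts are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 5], "c": [3, 4, 6], "d": [1, 2, 3]})

# Compare only b, c, d against a; axis=0 aligns the Series on the row index,
# so each row is compared with its own value of a.
matches = df.drop(columns="a").eq(df["a"], axis=0)

# Each True is one match; summing across columns gives the per-row count.
counts = matches.sum(axis=1)
print(counts.tolist())  # [1, 1, 1]
```

Dropping a first keeps the result correct even if the frame is later reordered or a is not the first column.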

Related

Count how many times any value appears in a combination of columns but not in other columns (pandas)

I have a data set that looks like this (NaN marks an empty cell):
A    B    C
1    2    NaN
NaN  3    4
1    NaN  5
1    2    4
1    NaN  2
1    NaN  NaN
I want to return the number of times any value appears in a combination of two columns but not the third column. The numeric values in this case are arbitrary. I just care about counting instances.
In other words:
1.) How many times does any value appear in column A and column B but not column C?
2.) How many times does any value appear in column B and column C but not column A?
3.) How many times does any value appear in column A and column C but not in column B?
My expected answers based on the mock data I have given above:
1.) 1 (row 1)
2.) 1 (row 2)
3.) 2 (rows 3 and 5)
You could use isna to create a boolean DataFrame. Then filter the rows that have only one NaN value (so that we exclude the last row). Finally sum vertically:
df_na = df.isna()
df_na[df_na.sum(axis=1).eq(1)].sum()
Output:
A 1
B 2
C 1
dtype: int64
Each count is the number of rows in which that column is the only empty one. For example, column "A" is empty while the other two have values in one row, column "B" is empty while the other two have values in two rows, and so on.
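The approach above can be checked end to end with a small runnable sketch. The placement of the empty cells is reconstructed from the expected answers in the question, and putting the last row's single value in column A is an assumption (it does not affect the result, since that row is filtered out).

```python
import numpy as np
import pandas as pd

# Reconstructed sample data; NaN marks an empty cell.
# Assumption: the lone value in the last row sits in column A.
df = pd.DataFrame({
    "A": [1, np.nan, 1, 1, 1, 1],
    "B": [2, 3, np.nan, 2, np.nan, np.nan],
    "C": [np.nan, 4, 5, 4, 2, np.nan],
})

df_na = df.isna()

# Keep only rows where exactly one column is empty.
exactly_one_missing = df_na.sum(axis=1).eq(1)

# Per column: number of rows where that column is the only empty one.
result = df_na[exactly_one_missing].sum()
print(result.to_dict())  # {'A': 1, 'B': 2, 'C': 1}
```

Reading the result: the count under C answers "A and B but not C", the count under A answers "B and C but not A", and the count under B answers "A and C but not B".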

aggregating and counting in pandas

For the following df:
group participated
A 1
A 1
B 0
A 0
B 1
A 1
B 0
B 0
I want to count the total number of values in the participated column for each value in the group column (groupby-count), and then also count how many 1s there are in each group.
Something like:
group tot_participated 1s
A 4 3
B 4 1
I know the first part is simple and can be done with
grouped_df=df.groupby('group').count().reset_index()
but I am unable to wrap my head around the second part. Any help will be greatly appreciated!
You could follow the groupby with an aggregation as below:
grp_df = df.groupby('group', as_index=False).agg({'participated':['count','sum']})
grp_df.columns = ['group','tot_participated','1s']
grp_df.head()
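The caveat to using .agg with multiple aggregation functions on the same column is that the resulting columns form a MultiIndex. This can be remedied by assigning flat column names, as in the second line. Note that sum counts the 1s only because the participated column contains nothing but 0s and 1s.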
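As an alternative sketch, named aggregation (available in pandas 0.25 and later) produces flat column names directly, so no renaming step is needed. The DataFrame is rebuilt from the question's sample data; because "1s" is not a valid Python keyword argument, it is passed through dictionary unpacking.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "group": list("AABABABB"),
    "participated": [1, 1, 0, 0, 1, 1, 0, 0],
})

# Named aggregation: output_name=(input_column, aggregation_function).
grp_df = df.groupby("group", as_index=False).agg(
    tot_participated=("participated", "count"),
    **{"1s": ("participated", "sum")},  # sum equals the number of 1s in a 0/1 column
)
print(grp_df)
```

If the column could hold values other than 0 and 1, a count of exact matches such as `lambda s: (s == 1).sum()` would be a safer aggregation than sum.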

Lambda function with groupby and condition issue

I want to count the unique values of column B for each unique value in column A, considering only rows where the corresponding value in column C is greater than 0.
df:
A B C
1 10 0
1 12 3
2 3 1
I tried this, but it's missing the where clause to filter for C>0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's start by creating the dataframe that OP mentions in the question:
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
There are various ways to achieve what OP wants. One is to select the rows of df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A, and finally use nunique on the grouped column B to count its unique values. In one line it looks like this:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants the total number of unique items that satisfy the condition across all groups, one can simply sum the resulting Series count:
count = count.sum()
[Out]:
2
To do everything in one line, one can chain pandas.Series.sum at the end:
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
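One thing to be aware of: filtering before grouping drops any value of A whose rows all have C <= 0. If those groups should appear with a count of 0, a possible sketch is to mask B instead of filtering the rows. The extra row with A equal to 3 is a hypothetical addition to the question's data, included only to show the difference.

```python
import pandas as pd

# Question's data plus one hypothetical row (A=3) that never satisfies C > 0.
df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [10, 12, 3, 7], "C": [0, 3, 1, 0]})

# Filter first: group 3 disappears from the result.
filtered = df[df["C"] > 0].groupby("A")["B"].nunique()

# Mask instead: B becomes NaN where C <= 0, and nunique ignores NaN by default,
# so every value of A is kept and group 3 gets a count of 0.
masked = df["B"].where(df["C"] > 0).groupby(df["A"]).nunique()

print(filtered.to_dict())  # {1: 1, 2: 1}
print(masked.to_dict())    # {1: 1, 2: 1, 3: 0}
```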

New columns with incremental numbers that initialize based on a different column value (pandas)

I want to add a column with incremental numbers for rows with the same value in a defined column;
e.g. if I have this df
df=pd.DataFrame([['a','b'],['a','c'],['c','b']])
and I want incremental numbers for the first column. It should look like this
df=pd.DataFrame([['a','b',1],['a','c',2],['c','b',1]])
I found sql solutions but I'm working with ipython/pandas. Can someone help me?
Use cumcount, which numbers the rows within each group starting from 0 (hence the + 1). For the name of the new column, use the length of the original columns:
print (len(df.columns))
2
df[len(df.columns)] = df.groupby(0).cumcount() + 1
print (df)
0 1 2
0 a b 1
1 a c 2
2 c b 1
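The same technique reads a little more clearly with named columns. In this sketch the column names key, value and n are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Same data as in the question, with hypothetical column names.
df = pd.DataFrame([["a", "b"], ["a", "c"], ["c", "b"]], columns=["key", "value"])

# cumcount numbers rows within each group from 0, so add 1 for a 1-based counter.
df["n"] = df.groupby("key").cumcount() + 1
print(df["n"].tolist())  # [1, 2, 1]
```

The counter follows the existing row order within each group, so sort the frame first if a particular ordering is needed.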

Conditional selection of data in a pandas DataFrame

I have two columns in my pandas DataFrame.
A B
0 1 5
1 2 3
2 3 2
3 4 0
4 5 1
I need the value in A where the value of B is minimum. In the above case my answer would be 4 since the minimum B value is 0.
Can anyone help me with it?
To find the minimum in column B, you can use df.B.min(). For your DataFrame this returns 0.
To find values at particular locations in a DataFrame, you can use loc:
>>> df.loc[(df.B == df.B.min()), 'A']
3 4
Name: A, dtype: int64
So here, loc picks out all of the rows where column B is equal to its minimum value (df.B == df.B.min()) and selects the corresponding values in column A.
This method returns all values in A corresponding to the minimum value in B. If you only need to find one of the values, the better way is to use idxmin as @aus_lacy suggests.
Here's one way:
b_min = df.B.idxmin()
a_val = df.A[b_min]
idxmin() returns the index label of the first occurrence of the minimum value in column B. You then look up the value at that same label in column A.
Or, if you want a single, albeit less readable, line, you can do it like this:
a_val = df.A[df.B.idxmin()]
As a disclaimer, this solution assumes that the minimum value in column B is unique. For example, suppose you had a data set that looked like this:
A B
1 2
2 5
3 0
4 3
5 0
My solution would return only the first instance of B's minimum value, which in this case is in the third row and has a corresponding A value of 3. If you believe that the minimum value of B is not unique, then you should go with @ajcr's solution.
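The difference between the two answers can be seen in a short sketch that uses the tied data set above. The variable names all_a and first_a are only illustrative.

```python
import pandas as pd

# Data set with a tied minimum in B (rows 3 and 5).
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [2, 5, 0, 3, 0]})

# Boolean mask: every A whose B equals the minimum.
all_a = df.loc[df["B"] == df["B"].min(), "A"]

# idxmin: only the A at the first occurrence of the minimum.
first_a = df.loc[df["B"].idxmin(), "A"]

print(all_a.tolist())  # [3, 5]
print(first_a)         # 3
```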
