Count the number of occurrences of the first column's value across each row - python

import pandas as pd

data = [['BAL', 'BAL', 'NO', 'DAL'], ['DAL', 'DAL', 'TEN', 'SF']]
df = pd.DataFrame(data)
I want to count the number of occurrences of the value in the first column in each row, across that row.
In this example, the number of times "BAL" appears in the first row, "DAL" in the second row, etc.
Then assign that count to a new column df['Count'].

You could do something like this:
df.assign(count=df.eq(df.iloc[:,0], axis=0).sum(axis=1))
Take the first column of your dataframe with iloc, compare the values using pd.DataFrame.eq with axis=0, then sum along axis=1.
Output:
0 1 2 3 count
0 BAL BAL NO DAL 2
1 DAL DAL TEN SF 2

We can compare the first column to all the remaining columns with DataFrame.eq then sum across the rows to add up the number of True values (matches):
df['count'] = df.iloc[:, 1:].eq(df.iloc[:, 0], axis=0).sum(axis=1)
df:
0 1 2 3 count
0 BAL BAL NO DAL 1
1 DAL DAL TEN SF 1
*Note this output is slightly different than the accepted answer in that it does not include the column containing the reference values in the row count.
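If you would rather match the accepted answer's count (which includes the reference column itself), a quick tweak of the line above is simply to add 1 back:
df['count'] = df.iloc[:, 1:].eq(df.iloc[:, 0], axis=0).sum(axis=1) + 1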

Related

Dropping rows, where a dynamic number of integer columns only contain 0's

I have the following problem - here is an example dataframe:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Peter   Terra     0                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
Here, I want to remove only the rows in which all the number columns are 0. In this case that is the second row, so my dataframe has to become like this:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
For this, I have the following solution:
new_df = old_df.loc[(old_df['Number Column #1'] > 0) | (old_df['Number Column #2'] > 0)]
This works; however, I have another problem: depending on the request, my dataframe will have a varying number of number columns. For example:
Name    Planet    Number Column #1  Number Column #2  Number Column #3
John    Earth     2                 0                 1
Peter   Terra     0                 0                 0
Anna    Mars      5                 4                 2
Robert  Knowhere  0                 1                 1
This is the problematic part, as I am not sure how to adjust my code to work with a dynamic number of columns. I've tried multiple things from StackOverflow and the pandas documentation, but most examples only work for dataframes in which all columns are numeric; pandas then treats them as booleans, and a simple solution like this is enough:
new_df = (df != 0).any(axis=1)
In my case, however, the text columns, which are always the same, are the problematic ones. Does anyone have an idea for a solution here? Thanks a lot in advance!
P.S. I have the names of the number columns available beforehand in the code as a list, for example:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
# my pandas logic...
IIUC:
You can use select_dtypes() to pick out the int and float columns, then check your condition and filter the dataframe:
df=df.loc[~df.select_dtypes(['int','float']).eq(0).all(axis=1)]
#OR
df=df.loc[df.select_dtypes(['int','float']).ne(0).any(axis=1)]
Note: if needed, you can also include 'bool' columns and typecast them to float before checking the condition:
df=df.loc[df.select_dtypes(['int','float','bool']).astype(float).ne(0).any(axis=1)]
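Alternatively, since the question mentions that the numeric column names are already available in a list (my_num_columns), a minimal sketch under that assumption is to filter on just those columns:
new_df = old_df.loc[old_df[my_num_columns].ne(0).any(axis=1)]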

Count the number of zeros with Pandas

I have the following dataframe:
import pandas as pd
data = {
    'State': ['EU','EU','EU','US','US','US','UK','UK','UK'],
    'Var': [4,1,6,2,1,6,2,0,1],
    'Mark': [0,1,1,1,0,0,0,0,0]
}
Table = pd.DataFrame(data)
Table
I want to count the number of zeros in the "Mark" column for each country. The output should be a table like this:
State #ofZeros
EU 1
US 2
UK 3
I managed to count the number of "1s" in the "Mark" column for each country with groupby:
Table.groupby('State')['Mark'].sum()
and it would be great to know whether it is also possible to count the zeros (or any other value) with groupby.
Group the dataframe by State, then call sum on the boolean mask for Mark == 0:
>>> Table.groupby('State', sort=False)['Mark'].agg(lambda x:x.eq(0).sum())
State
EU 1
US 2
UK 3
Name: Mark, dtype: int64
You can also call to_frame to convert it to a dataframe, then reset the index if needed:
Table.groupby('State', sort=False)['Mark'].agg(lambda x:x.eq(0).sum()).to_frame('#of Zeros').reset_index()
State #of Zeros
0 EU 1
1 US 2
2 UK 3
On a side note, regarding what you mentioned in the question (I managed to count the number of "1s" in the "Mark" column for each country with groupby: Table.groupby('State')['Mark'].sum()):
You are not actually counting the number of 1s there, you are just getting the sum of the values in the Mark column for each group. In your sample data the Mark column contains only 0 and 1, which is why the sum and the count of 1s happen to be equal. If the column contained other values as well, the sum would differ from the count of 1s.
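A small sketch to illustrate the distinction (the values here are made up purely for demonstration):
s = pd.Series([0, 1, 2, 1])
s.sum()        # 4 -> sum of the values
s.eq(1).sum()  # 2 -> count of 1s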
You can actually check the occurrences of value 0 in the column "Mark" using the code below.
Table[['State', 'Mark']][Table.Mark == 0].value_counts()
Table[['State', 'Mark']] narrows the columns that are required to be shown.
The output should be
State Mark
UK 0 3
US 0 2
EU 0 1
dtype: int64
You could use value_counts after filtering the dataframe on each state, then just look up the count of the value 0.
states = set(Table.State)
count_data = [[state, Table[Table.State == state].Mark.value_counts()[0]] for state in states]
df = pd.DataFrame(count_data, columns=['State', 'zeros'])
print(df)
>>
State zeros
0 US 2
1 UK 3
2 EU 1

aggregating and counting in pandas

for the following df
group participated
A 1
A 1
B 0
A 0
B 1
A 1
B 0
B 0
I want to count the total number of values in the participated column for each value in the group column (a groupby-count), and then also find how many 1s there are in each group.
Something like
group tot_participated 1s
A 4 3
B 4 1
I know the first part is simple and can be done with
grouped_df = df.groupby('group').count().reset_index()
but I am unable to wrap my head around the second part. Any help will be greatly appreciated!
You could follow the groupby with an aggregation as below:
grp_df = df.groupby('group', as_index=False).agg({'participated':['count','sum']})
grp_df.columns = ['group','tot_participated','1s']
grp_df.head()
The caveat to using .agg with multiple aggregation functions on the same column is that a multi-column index is created. This can be remedied by resetting the column names as in line 2.
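If you want to avoid the MultiIndex altogether, named aggregation is another option (available since pandas 0.25); a minimal sketch:
grp_df = (df.groupby('group', as_index=False)
            .agg(tot_participated=('participated', 'count'),
                 ones=('participated', 'sum'))
            .rename(columns={'ones': '1s'}))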

Pandas COUNTIF based on column value

I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b,c,d) that match a. In row 1 for instance it should be 1 as only d matches a.
I have searched quite a bit for this, but so far I have only found examples where the comparison is against a constant (like counting all values greater than 0), not against a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, to which you can pass an axis parameter to specify the direction of the comparison, then do a row sum to count the number of matched values (subtracting 1 so that column a does not count its own match):
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
Alternatively, apply a row-wise lambda that compares each row to its first element and subtracts 1 to exclude that reference value:
df.apply(lambda x: (x == x.iloc[0]).sum() - 1, axis=1)

pandas: append new column of row subtotals

This is very similar to this question, except I want my code to be able to apply to the length of a dataframe, instead of specific columns.
I have a DataFrame, and I'm trying to get a sum of each row to append to the dataframe as a column.
df = pd.DataFrame([[1,0,0],[20,7,1],[63,13,5]],columns=['drinking','drugs','both'],index = ['First','Second','Third'])
drinking drugs both
First 1 0 0
Second 20 7 1
Third 63 13 5
Desired output:
drinking drugs both total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
Current code:
df['total'] = df.apply(lambda row: (row['drinking'] + row['drugs'] + row['both']),axis=1)
This works great. But what if I have another dataframe, with seven columns, which are not called 'drinking', 'drugs', or 'both'? Is it possible to adjust this function so that it applies to the length of the dataframe? That way I can use the function for any dataframe at all, with a varying number of columns, not just a dataframe with columns called 'drinking', 'drugs', and 'both'?
Something like:
df['total'] = df.apply(for col in df: [code to calculate sum of each row]),axis=1)
You can use sum:
df['total'] = df.sum(axis=1)
If you need to sum only some columns, use a subset:
df['total'] = df[['drinking', 'drugs', 'both']].sum(axis=1)
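As a quick usage sketch, the same one-liner works for any number of columns, whatever they are called (the frame below is made up purely for illustration):
other = pd.DataFrame([[1, 0, 2, 3, 1, 0, 4],
                      [5, 1, 0, 0, 2, 2, 1]],
                     columns=list('abcdefg'))
other['total'] = other.sum(axis=1)  # row-wise sum over all seven columns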
What about something like this:
df.loc[:, 'Total'] = df.sum(axis=1)
with the output :
Out[4]:
drinking drugs both Total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
It will sum all columns by row.
