I have the following dataframe:
import pandas as pd
data = {
'State': ['EU','EU','EU','US','US','US','UK','UK','UK'],
'Var': [4,1,6,2,1,6,2,0,1],
'Mark': [0,1,1,1,0,0,0,0,0]
}
Table = pd.DataFrame(data)
Table
I want to count the number of zeros in the "Mark" column for each country. The output should be a table like this:
State #ofZeros
EU 1
US 2
UK 3
I managed to count the number of "1s" in the "Mark" column for each country with groupby:
Table.groupby('State')['Mark'].sum()
and it would be great to know whether it is also possible to count the zeros (or any other value) with groupby.
Group the dataframe by State, then sum the boolean mask Mark == 0.
>>> Table.groupby('State', sort=False)['Mark'].agg(lambda x:x.eq(0).sum())
State
EU 1
US 2
UK 3
Name: Mark, dtype: int64
You can also call to_frame to convert it to a dataframe, then reset the index if needed:
Table.groupby('State', sort=False)['Mark'].agg(lambda x:x.eq(0).sum()).to_frame('#of Zeros').reset_index()
State #of Zeros
0 EU 1
1 US 2
2 UK 3
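If you would rather name the output column in the same call, named aggregation also works; a sketch (the **{...} unpacking is only needed because '#of Zeros' is not a valid Python identifier):
Table.groupby('State', sort=False).agg(**{'#of Zeros': ('Mark', lambda x: x.eq(0).sum())}).reset_index()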
On a side note, regarding what you mentioned in the question ("I managed to count the number of 1s in the Mark column for each country with groupby: Table.groupby('State')['Mark'].sum()"):
No, you are not actually counting the number of 1s; you are just getting the sum of the values in the Mark column for each group. For your sample data the Mark column only contains 0 and 1, which is why the sum and the count of 1s happen to be equal. If the column also contained values other than 0 and 1, the sum would differ from the count of 1s.
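If you do want to count the 1s explicitly, the same eq-then-sum pattern works; a minimal, self-contained sketch (it stays correct even if Mark contained values other than 0 and 1):
import pandas as pd

Table = pd.DataFrame({
    'State': ['EU', 'EU', 'EU', 'US', 'US', 'US', 'UK', 'UK', 'UK'],
    'Var':   [4, 1, 6, 2, 1, 6, 2, 0, 1],
    'Mark':  [0, 1, 1, 1, 0, 0, 0, 0, 0],
})

# count rows where Mark == 1 in each State group (a count of 1s, not a sum of values)
ones_per_state = Table.groupby('State', sort=False)['Mark'].agg(lambda x: x.eq(1).sum())
print(ones_per_state)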
You can actually check the occurrences of value 0 in the column "Mark" using the code below.
Table[['State', 'Mark']][Table.Mark == 0].value_counts()
Table[['State', 'Mark']] narrows the columns that are required to be shown.
The output should be
State Mark
UK 0 3
US 0 2
EU 0 1
dtype: int64
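If you want that result in the State/#ofZeros shape from the question, one way (a sketch; '#ofZeros' is just the label used in the question) is to reset the index and keep only the relevant columns:
Table[['State', 'Mark']][Table.Mark == 0].value_counts().reset_index(name='#ofZeros')[['State', '#ofZeros']]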
You could use value_counts after filtering the dataframe on each state; then you just look up the count for the value 0.
states = set(Table.State)
count_data = [[state, Table[Table.State == state].Mark.value_counts()[0]] for state in states]
df = pd.DataFrame(count_data, columns=['State', 'zeros'])
print(df)
>>
State zeros
0 US 2
1 UK 3
2 EU 1
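One caveat with this approach: value_counts()[0] raises a KeyError for any state that has no zeros at all. A sketch using Series.get with a default of 0 avoids that:
states = set(Table.State)
count_data = [[state, Table[Table.State == state].Mark.value_counts().get(0, 0)] for state in states]
df = pd.DataFrame(count_data, columns=['State', 'zeros'])
print(df)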
Related
I have the following problem. Here is an example dataframe:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Peter   Terra     0                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
Here, I want to remove only the rows in which the number columns are all 0. In this case that is the second row, so my dataframe has to become like this:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
For this, I have the following solution:
new_df = old_df.loc[(old_df['Number Column #1'] > 0) | (old_df['Number Column #2'] > 0)]
This works; however, I have another problem: depending on the request, my dataframe will dynamically have a different number of number columns. For example:
Name    Planet    Number Column #1  Number Column #2  Number Column #3
John    Earth     2                 0                 1
Peter   Terra     0                 0                 0
Anna    Mars      5                 4                 2
Robert  Knowhere  0                 1                 1
This is the problematic part, as I am not sure how to adjust my code to work with a dynamic set of columns. I've tried multiple things from Stack Overflow and the pandas documentation; however, most examples only work for dataframes in which all columns are numeric, because then the comparison yields booleans everywhere and a simple solution like this works:
new_df = (df != 0).any(axis=1)
In my case, however, the text columns (which are always the same) are the problematic ones. Does anyone have an idea for a solution here? Thanks a lot in advance!
P.S. I have the names of the number columns available beforehand in the code as a list, for example:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
# my pandas logic...
IIUC:
You can try select_dtypes() to select the int and float columns; after that, check your condition and filter your dataframe:
df=df.loc[~df.select_dtypes(['int','float']).eq(0).all(axis=1)]
#OR
df=df.loc[df.select_dtypes(['int','float']).ne(0).any(axis=1)]
Note: if needed, you can also include 'bool' columns, typecast them to float, and then check your condition:
df=df.loc[df.select_dtypes(['int','float','bool']).astype(float).ne(0).any(axis=1)]
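Since you mention that the numeric column names are already available as a list (my_num_columns), a sketch that uses that list directly instead of dtype inspection:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
# keep rows where at least one of the known numeric columns is non-zero
new_df = df.loc[df[my_num_columns].ne(0).any(axis=1)]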
data = [['BAL', 'BAL', 'NO', 'DAL'], ['DAL', 'DAL', 'TEN', 'SF']]
df = pd.DataFrame(data)
I want to count the number of occurrences of the value in the first column in each row, across that row.
In this example, the number of times "BAL" appears in the first row, "DAL" in the second row, etc.
Then assign that count to a new column df['Count'].
You could do something like this:
df.assign(count=df.eq(df.iloc[:,0], axis=0).sum(axis=1))
Create a series from the first column of your dataframe using iloc, then compare values using pd.DataFrame.eq with axis=0 and sum along axis=1.
Output:
0 1 2 3 count
0 BAL BAL NO DAL 2
1 DAL DAL TEN SF 2
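Note that assign returns a copy; if you want the count stored on the frame under the name used in the question (df['Count']), assign it back directly, for example:
df['Count'] = df.eq(df.iloc[:, 0], axis=0).sum(axis=1)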
We can compare the first column to all the remaining columns with DataFrame.eq then sum across the rows to add up the number of True values (matches):
df['count'] = df.iloc[:, 1:].eq(df.iloc[:, 0], axis=0).sum(axis=1)
df:
0 1 2 3 count
0 BAL BAL NO DAL 1
1 DAL DAL TEN SF 1
*Note: this output differs slightly from the accepted answer in that it does not include the column containing the reference values in the row count.
I have a dataset of employees (their IDs) and the names of their bosses for several years.
df:
What I need to do is to see whether an employee had a change of boss. So, the desired output is:
For employees who appear in the df only once, I just assign 0 (no boss change). However, I cannot figure out how to do it for the employees who are in the df for several years.
I was thinking that first I need to assign 0 for the first year they appear in the df (because we do not know who the boss was before, so there is no boss change). Then I need to compare the name of the boss with the name in the next row and decide whether to assign 1 or 0 to the ManagerChange column.
So far I have split the df into two (unique IDs and duplicated IDs) and assigned 0 to ManagerChange for the unique IDs.
Then I group the duplicated IDs by ID and sort them by year. However, I am new to Python and cannot figure out how to compare strings and assign a result value to a new column inside the groupby. Please help.
Code I have so far:
# splitting database in two
bool_series = df["ID"].duplicated(keep=False)
df_duplicated=df[bool_series]
df_unique = df[~bool_series]
# assigning 0 for ManagerChange for the unique IDs
df_unique['ManagerChange'] = 0
# groupby by ID and sorting by year for the duplicated IDs
df_duplicated.groupby('ID').apply(lambda x: x.sort_values('Year'))
You can groupby, then shift() within each group and compare against the Boss column.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
# Compare Boss column with shifted Boss column
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1)).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
# Change the first in each group to 0
df.loc[df.groupby('ID').head(1).index, 'ManagerChange'] = 0
# print(df)
ID Year Boss ManagerChange
0 1234 2018 Anna 0
1 567 2019 Sarah 0
2 1234 2020 Michael 0
3 8976 2019 John 0
4 1234 2019 Michael 1
5 8976 2020 John 0
You could also make use of the fill_value argument; this lets you get rid of the last df.loc[] operation.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1, fill_value=group['Boss'].iloc[0])).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
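As a side note, the same result can usually be obtained without groupby().apply() by shifting the Boss column within each group directly; a sketch assuming the same ID/Year/Boss columns:
# Sort so that shift() compares consecutive years within each ID
df = df.sort_values(['ID', 'Year'])
# True where the boss differs from the previous year's boss for the same ID;
# the first row of each ID compares against NaN, so reset it to 0 afterwards
df['ManagerChange'] = (df['Boss'] != df.groupby('ID')['Boss'].shift()).astype(int)
df.loc[df.groupby('ID').head(1).index, 'ManagerChange'] = 0
# Restore the original row order
df = df.sort_index()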
I have a large dataframe that contains information about journal publications. For example:
Year Journal Subscription Known_author
0 2014 A 1 1
1 2014 A 1 0
2 2014 B 0 1
3 2014 C 1 0
4 2015 A 1 1
5 2015 B 1 1
6 2015 C 0 1
7 2015 D 0 0
I want to be able to group by year and create a table that contains (1) the number of unique journals per year, (2) number of unique journals that have a subscription, and (3) number of unique journals that have a subscription and a known author.
This would be the table that I am looking for in this scenario:
Year (1) Column (2) Column (3) Column
2014 3 2 1
2015 4 2 2
I have used:
(1) df.groupby('Pub_Date_Year')['Journal'].agg('nunique') for the first column
(2) df.loc[(df['Subscription']==1)&(df['Year']==2014),'Journal'].agg(['nunique']).values[0]
(3) df.loc[(df['Subscription']==1)&(df['Known_author']==1)&(df['Year']==2014),'Journal'].agg(['nunique']).values[0]
However, I want this table to be created in one go, presumably using groupby, aggregate, and some sort of lambda function. The ultimate idea is to automate this process as we get more data in, and not have to rely on manually changing the year in the df.loc code.
Is there a way this could be done?
As you guessed, you need to use groupby plus apply with a custom function.
def grouping(x):
    journal_uniq = x['Journal'].nunique()
    journal_subs = x.groupby('Journal').apply(lambda d: d['Subscription'].sum() > 0).sum()
    journal_author = x.groupby('Journal').apply(lambda d: ((d['Subscription'] == 1) & (d['Known_author'] == 1)).sum() > 0).sum()
    return pd.Series([journal_uniq, journal_subs, journal_author])
ddf = df.groupby('Year').apply(grouping)
Using your sample input, this will return:
0 1 2
Year
2014 3 2 1
2015 4 2 2
More details on the function:
journal_uniq is the value in the 1st column. It counts the unique values in column 'Journal' using nunique, you already did this step.
journal_subs is the value in the 2nd column. Since you want unique journals, you need to group on 'Journal' too and check whether the sum of 'Subscription' is greater than zero. The outer sum then counts the number of True values (True is cast to 1, False to 0).
journal_author is the value in the 3rd column. The logic is the same as for the 2nd column, but a bit more complex, since you need to check that both 'Subscription' and 'Known_author' are equal to 1 in the same row.
The returned pandas.Series is a row of the final dataframe.
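As an alternative sketch that avoids the nested groupby/apply, you can count unique journals on filtered views of the frame and combine the three Series (the column names here are just illustrative):
ddf = pd.DataFrame({
    'unique_journals': df.groupby('Year')['Journal'].nunique(),
    'with_subscription': df[df['Subscription'].eq(1)].groupby('Year')['Journal'].nunique(),
    'with_subscription_and_author': df[df['Subscription'].eq(1) & df['Known_author'].eq(1)].groupby('Year')['Journal'].nunique(),
}).fillna(0).astype(int)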
I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count the instances in a row (b, c, d) that match a. In the first row, for instance, it should be 1, as only d matches a.
I have searched quite a bit for this, but so far I have only found examples where the comparison value is a constant (like counting all values greater than 0), not one based on a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, to which you can pass an axis parameter to specify the direction of the comparison; then do a row sum to count the matched values (subtracting 1 removes the match of column a with itself):
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
Or, row-wise with apply (x.iloc[0] picks the value in column a for each row):
df.apply(lambda x: (x == x.iloc[0]).sum() - 1, axis=1)
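If you prefer to avoid the - 1 for the self-match, a sketch that compares only the remaining columns against column a (the 'Count' name is just the one used in the question):
df['Count'] = df.drop(columns='a').eq(df['a'], axis=0).sum(axis=1)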