Groupby multiple columns & Sum - Create new column with added If Condition - python

I need to groupby multiple columns and then get the sum in a new column, with an added if condition. I tried the following code and it worked great when grouping by a single column:
df['new column'] = (
    df['value'].where(df['value'] > 0).groupby(df['column1']).transform('sum')
)
However, when I try to group by multiple columns I get an error.
df['new_column'] = (
    df['value'].where(df['value'] > 0).groupby(df['column1', 'column2']).transform('sum')
)
Error (abridged traceback):
    return self._engine.get_loc(casted_key)
KeyError: ('column1', 'column2')

The above exception was the direct cause of the following exception:

    indexer = self.columns.get_loc(key)
    raise KeyError(key) from err
KeyError: ('column1', 'column2')
Could you please advise how I should change the code to get the same result but grouping by multiple columns?
Thank you

Cause of error
The syntax to select multiple columns, df['column1', 'column2'], is wrong; it should be df[['column1', 'column2']].
Even if you use df[['column1', 'column2']] for groupby, pandas will raise another error complaining that the grouper should be one-dimensional. This is because df[['column1', 'column2']] returns a DataFrame, which is a two-dimensional object.
How to fix the error?
Hard way:
Pass each of the grouping columns as a one-dimensional Series to groupby:
df['new_column'] = (
    df['value']
    .where(df['value'] > 0)
    .groupby([df['column1'], df['column2']])  # notice the change
    .transform('sum')
)
Easy way:
First assign the masked column values to the target column, then do groupby + transform as you normally would:
df['new_column'] = df['value'].where(df['value'] > 0)
df['new_column'] = df.groupby(['column1', 'column2'])['new_column'].transform('sum')
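For reference, a minimal sketch with made-up data showing the hard way end to end (column names and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    'column1': ['a', 'a', 'b', 'b'],
    'column2': ['x', 'x', 'x', 'y'],
    'value': [1, -2, 3, 4],
})

# mask out non-positive values, then group by the two key Series
df['new_column'] = (
    df['value']
    .where(df['value'] > 0)
    .groupby([df['column1'], df['column2']])
    .transform('sum')
)
print(df)
#   column1 column2  value  new_column
# 0       a       x      1         1.0
# 1       a       x     -2         1.0
# 2       b       x      3         3.0
# 3       b       y      4         4.0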

Related

Sort dataframe by value returns "For a multi-index, the label must be a tuple with elements corresponding to each level."

Objective: Based on a dataframe with 5 columns, return a dataframe with 3 columns, including one which is the count, sorted from largest count to smallest.
What I have tried:
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count'])
df = df.sort_values(by='NumInstances', ascending=False)
print(df)
Error:
ValueError: The column label 'NumInstances' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
Before this gets marked as a duplicate: I have gone through all the other suggested duplicates, and it seems they all suggest using the same code as I have above.
Is there something small that I am doing incorrectly?
Thanks!
I guess you need to remove the multi-index. Try this:
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count']).reset_index()
or -
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year'], as_index=False).agg(['count'])
Found the issue. Adding an agg to the NumInstances column made the NumInstances column label a tuple, ('NumInstances', 'count'), therefore I just updated the sort code to:
df = df.sort_values(by=('NumInstances', 'count'), ascending=False)
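To see why the tuple is needed, here is a minimal sketch with invented numbers; after .agg(['count']) the column index has two levels, so the label is a tuple:

import pandas as pd

df = pd.DataFrame({
    'Country': ['US', 'US', 'FR'],
    'Year': [2020, 2020, 2021],
    'NumInstances': [5, 7, 3],
})

out = df[['Country', 'Year', 'NumInstances']].groupby(['Country', 'Year']).agg(['count'])
print(out.columns.tolist())  # [('NumInstances', 'count')]

# sort by the full two-level label
out = out.sort_values(by=('NumInstances', 'count'), ascending=False)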

Looping through pandas dataframe using list comprehension

I'm trying to loop through a pandas dataframe and add a new column called upper, whose value for each row should be set according to a simple condition based on the values of two other columns of the same row.
I tried to do that using list comprehension:
df['upper'] = [df['Close'][i] if df['Close'][i] > df['Open'][i] else df['Open'][i] for i in df]
But this line of code gives me the following error:
raise KeyError(key) from err KeyError: 'Date'
Where Date is just another column of the dataframe that isn't even involved in that line of code. What am I doing wrong here? Is there a better way to do this? Thanks in advance!
pandas is an advanced, vectorized library, and looping over a DataFrame is bad practice. (Incidentally, iterating over a DataFrame yields its column names, so i takes values like 'Date', and df['Close']['Date'] is what raises the KeyError.)
df['upper'] = df[['Close', 'Open']].max(axis=1)
or, using NumPy:
import numpy as np
df['upper'] = np.maximum(df['Close'], df['Open'])
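If the condition were something other than a plain maximum, np.where keeps the same vectorized style; a small sketch with invented prices:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Open': [10, 20, 30], 'Close': [12, 18, 33]})

# take Close where it exceeds Open, otherwise Open
df['upper'] = np.where(df['Close'] > df['Open'], df['Close'], df['Open'])
print(df['upper'].tolist())  # [12, 20, 33]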

Pandas. Selecting rows with missing values in multiple columns

Suppose we have a dataframe with the columns 'Race', 'Age', 'Name'. I want to create two DFs:
1) without missing values in the columns 'Race' and 'Age'
2) only with missing values in the columns 'Race' and 'Age'
I wrote the following code
first_df = df[df[columns].notnull()]
second_df= df[df[columns].isnull()]
However, this code does not work. I solved the problem using this code:
first_df = df[df['Race'].notnull() & df['Age'].notnull()]
second_df = df[df['Race'].isnull() & df['Age'].isnull()]
But what if there are 10 columns? Is there a way to write this without logical operators, using only a list of columns?
If you select multiple columns you get a boolean DataFrame, so you need to test whether all columns are True with DataFrame.all, or whether at least one per row is True with DataFrame.any:
first_df = df[df[columns].notnull().all(axis=1)]
second_df = df[df[columns].isnull().all(axis=1)]
You can also use ~ to invert the mask (note that ~mask keeps rows with at least one missing value, whereas isnull().all(axis=1) above keeps only rows where every listed column is missing):
mask = df[columns].notnull().all(axis=1)
first_df = df[mask]
second_df= df[~mask]
Step 1: Make a new dataframe by dropping the missing data (NaN, pd.NaT, None) to filter out incomplete rows. DataFrame.dropna drops all rows containing at least one field with missing data.
Assume the new df is DF_updated and the original is DF_Original.
Step 2: Our solution DF is then the difference between the two DFs, which can be found with
pd.concat([DF_Original, DF_updated]).drop_duplicates(keep=False)
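For reference, a minimal sketch of the mask approach from the first answer, with invented rows:

import pandas as pd

df = pd.DataFrame({
    'Race': ['A', None, None],
    'Age': [30.0, 25.0, None],
    'Name': ['x', 'y', 'z'],
})
columns = ['Race', 'Age']

mask = df[columns].notnull().all(axis=1)
print(df[mask])   # row 0 only: complete in both columns
print(df[~mask])  # rows 1 and 2: at least one missing value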

dataframe index groupby error ValueError: 'GL' is both an index level and a column label

I have this code, which ran fine until this morning:
# delete rows of 2019
df.drop(df[df.month.str.contains('2019')].index, inplace=True)
df.sort_values(by=['GL', 'month'], inplace=True)
df["diffDebit"] = df.groupby('GL')['GL_Debit'].diff().fillna(df['GL_Debit'])
df["diffCredit"] = df.groupby('GL')['GL_Credit'].diff().fillna(df['GL_Credit'])
The error is:
ValueError: 'GL' is both an index level and a column label, which is ambiguous.
If I delete
df.drop(df[df.month.str.contains('2019')].index, inplace=True)
it works again, but I need to delete those rows first. Any idea?
Found a solution: in fact, just adding brackets around 'GL' in the groupby fixes it:
df["diffDebit"] = df.groupby(['GL'])['GL_Debit'].diff().fillna(df['GL_Debit'])

subtracting mean of each column away from the column and returning it

I have a dataset with many columns. I have to create a function which gets the mean of each column, subtracts it from each row in the column, and then returns the dataset with those means subtracted. I found a similar question asked here and applied the answer, but I keep getting an error. Here is my code:
def exercise1(df):
    df1 = DataFrame(df)
    df2 = df1 - df1.mean()
    return df2

exercise1(data)
# where data is a CSV file regarding salaries in the San Francisco area
I am getting the following error:
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')
I can't figure out what I am doing wrong.
You can do a for loop over the columns with try-except:
def exercise1(df):
    df1 = df.copy()
    for col in df1.columns:
        try:  # if we can compute the mean then subtract it
            df1[col] -= df1[col].mean()
        except TypeError:  # otherwise just ignore the column
            pass
    return df1
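For example, with a small invented frame, the string column passes through untouched while the numeric column is centered:

import pandas as pd

data = pd.DataFrame({'name': ['a', 'b'], 'salary': [100.0, 200.0]})
print(exercise1(data))
#   name  salary
# 0    a   -50.0
# 1    b    50.0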
df.mean() produces a pandas Series holding the means of only the numerical columns of your original DataFrame.
means = df.mean()
You can get the index values of that series by using:
means.index
Use this to slice your original DataFrame and subtract the mean values:
df2 = df[means.index] - means
You need to specify the column you're subtracting from:
df = {'values1': [1, 2, 3], 'values2': [4, 5, 6]}

def exercise1(df):
    df1 = pd.DataFrame(df)
    df2 = df1['values2'] - df1['values2'].mean()
    return df2

print(exercise1(df))
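A vectorized alternative (not from the answers above) is to center only the numeric columns via select_dtypes; demean_numeric is a hypothetical helper name:

import pandas as pd

def demean_numeric(df):
    # center every numeric column, leave non-numeric columns untouched
    out = df.copy()
    num = out.select_dtypes(include='number')
    out[num.columns] = num - num.mean()
    return out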
