How do I stop aggregate functions from adding unwanted rows to dataframe? - python

I wrote a line of code that groups the dataframe by column
df = df.groupby(['where','when']).agg({'col1': ['max'], 'col2': ['sum']})
After running the above code, the output has two extra header rows: 'max' and 'sum' appear on a second line below the 'col1' and 'col2' column labels. It looks like this:
           col1 col2
            max  sum
where when
home  1       a    a
work  2       b    b
This is my expected outcome:
where  when col1 col2
 home     1    a    a
 work     2    b    b
I want to bring col1 and col2 down onto the same header row as 'where' and 'when', and at the same time stop 'max' and 'sum' from showing. I couldn't think of a way to make this work, so any help would be appreciated.

What you need is reset_index, combined with named aggregation so the output column names are fixed up front.
Use the following:
df = df.groupby(['where', 'when']).agg(col1=('col1', 'max'), col2=('col2', 'sum')).reset_index()
Dataframe:
where when col1 col2
0 home 1 1 1
1 work 2 2 2
2 home 1 3 3
Output:
where when col1 col2
0 home 1 3 3
1 work 2 2 2
Update:
We can pass as_index=False to groupby, which stops pandas from putting the group keys into the index, so we don't need to reset the index afterwards:
df = df.groupby(['where', 'when'], as_index=False).agg(col1=('col1', 'max'), col2=('col2', 'sum'))
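If you want to keep the dict-style .agg from the question instead, here is a minimal sketch (column and key names as in the question): passing ['max'] as a list is what creates the extra header level, so either pass plain strings or flatten the MultiIndex columns afterwards:
agg = df.groupby(['where', 'when']).agg({'col1': ['max'], 'col2': ['sum']})
agg.columns = agg.columns.droplevel(1)  # drop the 'max' / 'sum' level
agg = agg.reset_index()                 # bring 'where' and 'when' back as columns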

Related

Find name of column which is non nan

I have a DataFrame defined like:
df1 = pd.DataFrame({"col1":[1,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
"col2":[np.nan,3,np.nan,4,np.nan,np.nan,np.nan,5,6],
"col3":[np.nan,np.nan,7,np.nan,np.nan,8,9,np.nan, np.nan]})
I want to transform it into a DataFrame like:
df2 = pd.DataFrame({"col_name":['col1','col2','col3','col2','col1',
'col3','col3','col2','col2'],
"value":[1,3,7,4,2,8,9,5,6]})
If possible, can we reverse this process too? By that I mean convert df2 into df1.
I don't want to go through the DataFrame iteratively as it becomes too computationally expensive.
You can stack it:
out = (df1.stack()                  # drops the NaNs, keeps (row, column) pairs
          .astype(int)
          .droplevel(0)             # keep only the column label
          .rename_axis('col_name')
          .reset_index(name='value'))
Output:
col_name value
0 col1 1
1 col2 3
2 col3 7
3 col2 4
4 col1 2
5 col3 8
6 col3 9
7 col2 5
8 col2 6
To go from out back to df1, you could pivot:
out1 = pd.pivot(out.reset_index(), index='index', columns='col_name', values='value')
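As a rough sanity check building on out1 above: pivot names the axes 'index' and 'col_name'; clearing those names makes the result display exactly like df1, and .equals can confirm the values round-trip:
out1.index.name = None
out1.columns.name = None
print(out1.equals(df1))   # should print True: values and NaN positions match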

How do you drop a column by index?

When I run this code it drops the first row instead of the first column:
df.drop(axis=1, index=0)
How do you drop a column by index?
You can use df.columns[i] to denote the column. Example:
df.drop(df.columns[0], axis=1)
Using the example
df = pd.DataFrame([[1023.423, 12.59595],
                   [1000, 11.63024902],
                   [975, 9.529815674],
                   [100, -48.20524597]], columns=['col1', 'col2'])
col1 col2
0 1023.423 12.595950
1 1000.000 11.630249
2 975.000 9.529816
3 100.000 -48.205246
If you do df.drop(index=0), the output drops the row with index 0:
col1 col2
1 1000.0 11.630249
2 975.0 9.529816
3 100.0 -48.205246
If you do df.drop('col1', axis=1), the output drops the column named 'col1':
col2
0 12.595950
1 11.630249
2 9.529816
3 -48.205246
Remember that drop returns a new DataFrame, so assign the result back or use inplace=True where necessary.
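A couple of small variants of the same idea, using the example df above (columns= is the keyword equivalent of axis=1):
df_no_first = df.drop(columns=df.columns[0])       # drops 'col1' by position
df_no_both = df.drop(columns=df.columns[[0, 1]])   # drops several columns by position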

How to compare two dataframes in Python pandas and output the difference?

I have two DataFrames with the same number of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is whatever is in df2 that was not previously in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried with
mask = df1['col2'] != df2['col2']
but it doesn't work when the DataFrames have different numbers of rows.
Use DataFrame.explode on the split values of column col2, then DataFrame.merge with a right join and the indicator parameter, filter with boolean indexing for rows marked right_only, and finally aggregate back with join:
# split the comma-separated values so each one gets its own row
df11 = df1.assign(col2=df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2=df2['col2'].str.split(',')).explode('col2')

# right join keeps everything from df2 and marks where each row came from
df = df11.merge(df22, indicator=True, how='right', on=['col1', 'col2'])

# keep only rows that exist in df2 alone, then join the values back per key
df = (df[df['_merge'].eq('right_only')]
      .groupby('col1')['col2']
      .agg(','.join)
      .reset_index(name='col2'))
print(df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
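An alternative sketch on the same data, comparing per-key sets instead of exploding and merging (the column names follow the example above):
old = df1.set_index('col1')['col2'].str.split(',').apply(set)

def new_values(row):
    # keep only the values of this row that df1 did not already have for this key
    seen = old.get(row['col1'], set())
    return ','.join(v for v in row['col2'].split(',') if v not in seen)

result = df2.assign(col2=df2.apply(new_values, axis=1))
result = result[result['col2'] != ''].reset_index(drop=True)
print(result)
#   col1   col2
# 0    c      2
# 1    d  1,2,3
# 2    e    1,2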

How to apply a function on a series of columns, based on the values in a corresponding series of columns?

I have a df with several columns where, based on the value (1-6) in each of those columns, I want to assign a value (0 or 1) to its corresponding column. I can do it on a column-by-column basis but would like to make it a single operation. Below is some example code:
import pandas as pd
df = pd.DataFrame({'col1': [1, 3, 6, 3, 5, 2], 'col2': [4, 5, 6, 6, 1, 3],
                   'col3': [3, 6, 5, 1, 1, 6], 'colA': [0, 0, 0, 0, 0, 0],
                   'colB': [0, 0, 0, 0, 0, 0], 'colC': [0, 0, 0, 0, 0, 0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column by column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = (df['colA']+ 1)
But I would like to be able to work with a list of columns, so to speak, and have it correspond with another list. Something like this (but that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!
The idea is to select both blocks of columns with DataFrame.loc, build the mask from the first block, rename its columns to match the second block, and finally use DataFrame.add on those columns only:
df1 = df.loc[:, 'col1':'col3']           # source values
df2 = df.loc[:, 'colA':'colC']           # target columns
d = dict(zip(df1.columns, df2.columns))  # col1 -> colA, col2 -> colB, col3 -> colC
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print (df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0
Here's what I would do:
# split up dataframe
sub_df = df.iloc[:,:3]
abc = df.iloc[:,3:]
# make numpy array truth table
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)
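Starting again from the original df, a compact variant of the mask-and-add idea (assuming the source and target columns pair up in order) is to build one boolean mask, relabel it with the target column names, and add it on:
mask = (df.loc[:, 'col1':'col3'] != 1) & (df.loc[:, 'col1':'col3'] < 6)
df.loc[:, 'colA':'colC'] += mask.set_axis(['colA', 'colB', 'colC'], axis=1).astype(int)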

how to create a dataframe aggregating (grouping?) a dataframe containing only strings

I would like to create a dataframe "aggregating" a larger data set.
Starting:
df:
col1 col2
1 A B
2 A C
3 A B
and getting:
df_aggregated:
col1 col2
1 A B
2 A C
without using any calculation (count()).
I would write:
df_aggregated = df.groupby('col1')
but I do not get anything
print(df_aggregated)
"error"
Any help appreciated.
You can accomplish this by simply dropping the duplicate entries with df.drop_duplicates. Note that the default keep='first' keeps one row per unique combination; keep=False would remove every duplicated row entirely and lose 'A B' from your example:
df_aggregated = df.drop_duplicates(subset=['col1', 'col2'])
print(df_aggregated)
col1 col2
1 A B
2 A C
Alternatively, you can use groupby with an aggregation function:
In [849]: df.groupby('col2', as_index=False).max()
Out[849]:
col2 col1
0 B A
1 C A
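For completeness, a minimal sketch (reconstructing the example data from the question) of why the bare groupby looks like "nothing": grouping is lazy and returns a GroupBy object until you aggregate or de-duplicate:
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A'], 'col2': ['B', 'C', 'B']}, index=[1, 2, 3])
print(df.groupby('col1'))    # a DataFrameGroupBy object, not a DataFrame
print(df.drop_duplicates())  # keeps one row per unique (col1, col2) pair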
