Sum pandas dataframe column values based on condition of column name - python

I have a DataFrame with column names in the shape of x.y, where I would like to sum up all columns with the same value on x without having to explicitly name them. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7

Another option: extract the prefix from the column names and use it as the grouping key:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
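Recent pandas releases (2.1 and later) deprecate the axis=1 argument of DataFrame.groupby, so the one-liner above may emit a warning or stop working. A minimal sketch of the same idea that avoids axis=1 by transposing first; the variable names prefixes and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'x.1': [1, 2, 3, 4], 'x.2': [5, 4, 3, 2],
                   'y.8': [19, 2, 1, 3], 'y.92': [10, 9, 2, 4]})

# Prefix before the first dot, one entry per column: x, x, y, y
prefixes = df.columns.str.split('.').str[0]

# Transpose so the columns become rows, group those rows by prefix,
# sum within each group, then transpose back.
result = df.T.groupby(prefixes).sum().T
print(result)
```

The double transpose copies the data, so on very wide or very large frames this may be slower than a direct column-wise approach.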

You can first create a MultiIndex by splitting the column names, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
x y
1 2 8 92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7

Related

Add all column values repeated of one data frame to other in pandas

Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated and create the following result. It is assumed that both data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, you can use DataFrame.assign and unpack the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values along df1.index with DataFrame.reindex and then use DataFrame.join (this works here because the first index value of df2 is the same as the first index value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the original df1 has no missing values, you can instead forward fill the missing values as the last step, but note that the dtypes of the filled columns change to float, thanks #Dishin H Goyan:
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
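Since the goal is to pair every row of df1 with the single row of df2, a cross join is another way to express it. A minimal sketch, assuming pandas 1.2 or later (where merge accepts how='cross'); no missing values are introduced along the way, so the integer dtypes are not expected to change:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'c': [7], 'd': [8]})

# how='cross' pairs every row of df1 with every row of df2.
# With a single-row df2 this simply repeats its values on each row of df1.
result = df1.merge(df2, how='cross')
print(result)
```

If df2 had more than one row, the result would contain len(df1) * len(df2) rows, which is not what the question asks for.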

Pandas 'partial melt' or 'group melt'

I have a DataFrame like this
>>> df = pd.DataFrame([[1,1,2,3,4,5,6],[2,7,8,9,10,11,12]],
columns=['id', 'ax','ay','az','bx','by','bz'])
>>> df
id ax ay az bx by bz
0 1 1 2 3 4 5 6
1 2 7 8 9 10 11 12
and I want to transform it into something like this
id name x y z
0 1 a 1 2 3
1 2 a 7 8 9
2 1 b 4 5 6
3 2 b 10 11 12
This is an unpivot / melt problem, but I don't know of any way to melt by keeping these groups intact. I know I can create projections across the original dataframe and then concat those but I'm wondering if I'm missing some common melt tricks from my toolbelt.
Use set_index, convert the columns to a MultiIndex, and stack:
df = df.set_index('id')
df.columns = [df.columns.str[1], df.columns.str[0]]
new_df = df.stack().reset_index().rename(columns = {'level_1': 'name'})
id name x y z
0 1 a 1 2 3
1 1 b 4 5 6
2 2 a 7 8 9
3 2 b 10 11 12
This is not melt, but wide_to_long combined with stack and unstack also works:
# df is the original frame here, with id still a regular column
pd.wide_to_long(df,['a','b'],i='id',j='drop',suffix=r'\w+').stack().unstack(1)
Out[476]:
drop x y z
id
1 a 1 2 3
b 4 5 6
2 a 7 8 9
b 10 11 12
In addition to the already excellent answers, pivot_longer from pyjanitor can help abstract the reshaping:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index = 'id',
names_to = ('name', '.value'),
names_pattern = r"(.)(.)")
id name x y z
0 1 a 1 2 3
1 2 a 7 8 9
2 1 b 4 5 6
3 2 b 10 11 12
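For comparison, the same reshaping can be sketched with only melt and pivot from pandas itself. This assumes pandas 1.1 or later (pivot with a list for index) and that every value column name is exactly two characters, group letter first; the helper column names name and axis are illustrative choices, not part of the original data:

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 2, 3, 4, 5, 6], [2, 7, 8, 9, 10, 11, 12]],
                  columns=['id', 'ax', 'ay', 'az', 'bx', 'by', 'bz'])

# Melt everything except id into long form: columns id, variable, value
long = df.melt(id_vars='id')

# Split each two-character column name into its group and its axis
long['name'] = long['variable'].str[0]   # 'a' or 'b'
long['axis'] = long['variable'].str[1]   # 'x', 'y' or 'z'

# Spread the axis values back out into columns, one row per (id, name)
result = (long.pivot(index=['id', 'name'], columns='axis', values='value')
              .reset_index()
              .rename_axis(columns=None))
print(result)
```

This is more verbose than the answers above, but each step is an ordinary melt or pivot, which can make it easier to adapt when the column naming scheme is less regular.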

Select Columns of a DataFrame based on another DataFrame

I am trying to select a subset of a DataFrame based on the columns of another DataFrame.
The DataFrames look like this:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I want to get all rows of the first DataFrame for the columns which are included in both DataFrames. My result should look like this:
a b
0 0 1
1 4 5
2 8 9
3 12 13
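You can use pd.Index.intersection. Older pandas versions accepted & as shorthand for it, but since pandas 2.0 the & operator on an Index is an element-wise logical operation rather than a set intersection, so the explicit method is the safer choice:
intersection_cols = df1.columns.intersection(df2.columns)
res = df1[intersection_cols]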
import pandas as pd
data1=[[0,1,2,3,],[4,5,6,7],[8,9,10,11],[12,13,14,15]]
data2=[[0,1],[2,3],[4,5],[6,7],[8,9]]
df1 = pd.DataFrame(data=data1,columns=['a','b','c','d'])
df2 = pd.DataFrame(data=data2,columns=['a','b'])
df1[df1.columns.intersection(df2.columns)]
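A boolean mask over the columns is another way to select the shared ones, and it keeps them in the order they appear in df1. A minimal sketch using Index.isin; the names mask and res are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]],
                   columns=['a', 'b'])

# True for each column of df1 whose name also appears in df2
mask = df1.columns.isin(df2.columns)

# Keep all rows, and only the columns flagged by the mask
res = df1.loc[:, mask]
print(res)
```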

Deleting multiple DataFrame columns in Pandas

Say I want to delete a set of adjacent columns in a DataFrame and my code looks something like this currently:
del df['1'], df['2'], df['3'], df['4'], df['5'], df['6']
This works, but I was wondering if there was a more efficient, compact, or aesthetically pleasing way to do it, such as:
del df['1','6']
I think you need drop; to select the columns, use range or numpy.arange:
import numpy as np
import pandas as pd
df = pd.DataFrame({'1':[1,2,3],
'2':[4,5,6],
'3':[7,8,9],
'4':[1,3,5],
'5':[7,8,9],
'6':[1,3,5],
'7':[5,3,6],
'8':[5,3,6],
'9':[7,4,3]})
print (df)
1 2 3 4 5 6 7 8 9
0 1 4 7 1 7 1 5 5 7
1 2 5 8 3 8 3 3 3 4
2 3 6 9 5 9 5 6 6 3
print (np.arange(1,7))
[1 2 3 4 5 6]
print (range(1,7))
range(1, 7)
#convert string column names to int
df.columns = df.columns.astype(int)
df = df.drop(np.arange(1,7), axis=1)
#another solution with range
#df = df.drop(range(1,7), axis=1)
print (df)
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
You can do this without modifying the columns, by slicing df.columns and passing the result to drop:
In [29]:
df.drop(df.columns[slice(df.columns.tolist().index('1'),df.columns.tolist().index('6')+1)], axis=1)
Out[29]:
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
This looks up the ordinal positions of the first and last columns to remove and uses them to build a slice object, which is then applied to the columns array.
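The positional lookup can also be left to pandas by slicing on the labels directly. A minimal sketch, assuming the column labels are unique strings and that '1' comes before '6' in column order; label slices with .loc include both endpoints:

```python
import pandas as pd

df = pd.DataFrame({'1': [1, 2, 3], '2': [4, 5, 6], '3': [7, 8, 9],
                   '4': [1, 3, 5], '5': [7, 8, 9], '6': [1, 3, 5],
                   '7': [5, 3, 6], '8': [5, 3, 6], '9': [7, 4, 3]})

# Label-based slice: every column from '1' through '6', inclusive
to_drop = df.loc[:, '1':'6'].columns

# drop returns a new frame; the original df is left unchanged
result = df.drop(columns=to_drop)
print(result)
```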

Pandas merge on aggregated columns

Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense because the grouped table has a Multi-index and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the original df, you should use transform. It produces a Series whose index is aligned with the original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
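The a_x and a_y suffixes appear because both frames have a column named a. A minimal sketch that avoids the clash by naming the aggregated column up front, assuming a pandas version with named aggregation (0.25 or later); the name nc is taken from the desired output in the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 13, 15],
                   "b": [4, 5, 6, 6, 6],
                   "c": ["wish", "you", "were", "here", "here"]})

# as_index=False keeps b and c as ordinary columns, so no reset_index is needed.
# nc=('a', 'nunique') names the output column nc and fills it with nunique of a.
gb = df.groupby(['b', 'c'], as_index=False).agg(nc=('a', 'nunique'))

# A left merge keeps every row of df in its original order
merged = df.merge(gb, on=['b', 'c'], how='left')
print(merged)
```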
