Sum of every two columns and leave one column in pandas dataframe - python

My task is like this:
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
Out:
a b c d e f
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
What I want is for the output dataframe to look like this:
Out
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
That is to say, sum the columns in pairs (a,b), (c,d), (e,f), keep the last column of each pair, and rename the sum columns to (s1, s2, s3). Could anyone help solve this problem in pandas? Thank you so much.

You can select columns by position with iloc, sum each group of 2 columns, and finally rename the columns with f-strings:
i = 2
for x in range(0, len(df.columns), i):
    df.iloc[:, x] = df.iloc[:, x:x+i].sum(axis=1)
    df = df.rename(columns={df.columns[x]: f's{x // i + 1}'})
print(df)
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
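
The same loop can be wrapped in a small helper that leaves the input frame untouched. This is only a sketch: the name `sum_pairs` and its `step` parameter are made up for illustration, and it assumes the number of columns is a multiple of `step`.

```python
import pandas as pd

def sum_pairs(df, step=2):
    """Hypothetical helper: replace the first column of each group of
    `step` columns with the group's row-wise sum and rename it s1, s2, ..."""
    out = df.copy()
    new_names = list(out.columns)
    for n, start in enumerate(range(0, out.shape[1], step), start=1):
        # Read from the original frame so earlier writes cannot leak in
        out.iloc[:, start] = df.iloc[:, start:start + step].sum(axis=1)
        new_names[start] = f"s{n}"
    out.columns = new_names
    return out

df = pd.DataFrame([(1, 2, 3, 4, 5, 6)] * 3, columns=list("abcdef"))
result = sum_pairs(df)
print(result)
```

Working on a copy means the original `df` keeps its columns a to f, which may be convenient if the raw values are needed later.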

For a single pair, do:
df['a'] = df['a'] + df['b']
df.rename(columns={'a': 's1'}, inplace=True)
You can use a loop to handle all pairs. Combining enumerate and zip generates
(0, ('a','b')), (1, ('c','d')), (2, ('e','f'))
Use these indexes and column pairs to do the sum and the renaming:
import pandas as pd
cols = ['a','b','c','d','e','f']
df = pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=cols)
for idx, (col1, col2) in enumerate(zip(cols[::2], cols[1::2])):
    df[col1] = df[col1] + df[col2]
    df.rename(columns={col1: 's'+str(idx+1)}, inplace=True)
print(df)

You can try this:-
res = pd.DataFrame()
for i in range(len(df.columns)-1):
    if i%2==0:
        res[df.columns[i]] = df[df.columns[i]]+df[df.columns[i+1]]
    else:
        res[df.columns[i]] = df[df.columns[i]]
res['f'] = df[df.columns[-1]]
res.columns = ['s1', 'b', 's2', 'd', 's3', 'f']
Output:-
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6

df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
df['s1'] = df['a'] + df['b']
df['s2'] = df['c'] + df['d']
df['s3'] = df['e'] + df['f']
which gives df:
   a  b  c  d  e  f  s1  s2  s3
0  1  2  3  4  5  6   3   7  11
1  1  2  3  4  5  6   3   7  11
2  1  2  3  4  5  6   3   7  11
Then you can remove the first column of each pair ('a', 'c' and 'e'), keeping the last column of each pair as the question asks:
df.pop('a')
df.pop('c')
df.pop('e')
which leaves df:
   b  d  f  s1  s2  s3
0  2  4  6   3   7  11
1  2  4  6   3   7  11
2  2  4  6   3   7  11

The columns come in groups of two, so we can split the dataframe's values with np.split:
res = np.split(df.to_numpy(), df.shape[-1] // 2, 1)
Next, we compute the new data, where we sum pairs of columns and keep the last column in each pair:
new_frame = np.hstack([np.vstack((np.sum(entry,1), entry[:,-1])).T for entry in res])
Create the new column names, taking the step of 2 into account:
new_cols = [f"s{ind//2+1}" if ind%2==0 else val for ind,val in enumerate(df.columns)]
Create the new dataframe:
pd.DataFrame(new_frame, columns=new_cols)
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
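
Put together with its imports, the NumPy steps above might look like the following sketch. It swaps `np.vstack(...).T` for `np.column_stack`, which builds the same two-column block, and it assumes an even number of columns; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2, 3, 4, 5, 6)] * 3, columns=list("abcdef"))

# Split the values into blocks of two adjacent columns (axis=1)
blocks = np.split(df.to_numpy(), df.shape[1] // 2, axis=1)

# For each block keep two columns: [row-wise sum, last column of the block]
new_values = np.hstack(
    [np.column_stack((block.sum(axis=1), block[:, -1])) for block in blocks]
)

# Even positions become s1, s2, ...; odd positions keep their original name
new_cols = [f"s{i // 2 + 1}" if i % 2 == 0 else name
            for i, name in enumerate(df.columns)]

result = pd.DataFrame(new_values, columns=new_cols, index=df.index)
print(result)
```

Passing `index=df.index` keeps the original row labels, which the plain `pd.DataFrame(new_frame, ...)` call would otherwise replace with a fresh range.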

Related

Print out pandas groupby without operation

So I have the following pandas dataframe:
import pandas as pd
sample_df = pd.DataFrame({'note': ['D','C','D','C'], 'time': [1,1,4,6], 'val': [6,4,7,9]})
which gives the result
note time val
0 D 1 6
1 C 1 4
2 D 4 7
3 C 6 9
What I want is
note index  time  val
C    1         1    4
     3         6    9
D    0         1    6
     2         4    7
I tried sample_df.set_index('note',append=True) and it didn't work.
Add DataFrame.swaplevel, then DataFrame.sort_index on the first level:
df = sample_df.set_index('note', append=True).swaplevel(1,0).sort_index(level=0)
print (df)
        time  val
note
C    1     1    4
     3     6    9
D    0     1    6
     2     4    7
If you need to set the level name, add DataFrame.rename_axis:
df = (sample_df.rename_axis('idx')
               .set_index('note',append=True)
               .swaplevel(1,0)
               .sort_index(level=0))
print (df)
          time  val
note idx
C    1       1    4
     3       6    9
D    0       1    6
     2       4    7
Alternatively:
sample_df.index.rename('old_index', inplace=True)
sample_df.reset_index(inplace=True)
sample_df.set_index(['note','old_index'], inplace=True)
sample_df.sort_index(level=0, inplace=True)
print (sample_df)
                time  val
note old_index
C    1             1    4
     3             6    9
D    0             1    6
     2             4    7
I am using MultiIndex to create the target index:
sample_df.index=pd.MultiIndex.from_arrays([sample_df.note,sample_df.index])
sample_df.drop('note',axis=1,inplace=True)
sample_df=sample_df.sort_index(level=0)
sample_df
        time  val
note
C    1     1    4
     3     6    9
D    0     1    6
     2     4    7
I would use set_index and pop to simultaneously discard the column 'note' and set the new index:
df.set_index([df.pop('note'), df.index]).sort_index(level=0)
Out[380]:
        time  val
note
C    1     1    4
     3     6    9
D    0     1    6
     2     4    7
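
Once the two-level index is in place, the outer level can be used for selection. The sketch below rebuilds the frame with the swaplevel approach and shows two follow-up operations; the names `out` and `c_rows` are illustrative only.

```python
import pandas as pd

sample_df = pd.DataFrame({'note': ['D', 'C', 'D', 'C'],
                          'time': [1, 1, 4, 6],
                          'val': [6, 4, 7, 9]})

out = (sample_df.rename_axis('idx')
                .set_index('note', append=True)
                .swaplevel(0, 1)
                .sort_index(level=0))

# Select every row stored under note 'C' via the outer index level
c_rows = out.loc['C']
print(c_rows)

# Flatten back to ordinary columns when the grouped view is no longer needed
print(out.reset_index())
```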

python - sum list of columns, even if not all there

I have a dataframe that looks like this
A B C D G
0 9 5 7 6 1
1 1 4 7 3 1
2 8 4 1 3 1
generated by this:
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
x=np.array([[1,2]])
df['G'] = np.repeat(x,5)
Suppose there are times when a certain column 'E' exists, and sometimes it doesn't depending on the time frame of the data.
So sometimes we have
A B C D E G
0 9 5 7 6 2 1
1 1 4 7 3 3 1
2 8 4 1 3 4 1
Either way, I'd like to group by column G and sum columns A, C, and E. When column E exists, I just use
df.groupby('G')[['A', 'C', 'E']].sum()
but when E doesn't exist, as in the first dataframe, it doesn't work.
What do I need to do in order to sum even if a column is missing?
You could store the columns you wish to sum in a list sum_cols = list('ACE'), and then intersect whatever DataFrame you're working with with this list.
df.groupby('G')[df.columns.intersection(sum_cols)].sum()
Demo
>>> df = pd.DataFrame(np.random.randint(0, 10, (2, 5)),
...                   columns=list('ABCDG'))
>>> df
A B C D G
0 9 5 9 2 6
1 3 1 1 1 3
>>> sum_cols = list('ACE')
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C
G
3 3 1
6 9 9
>>> df['E'] = [100, 200]
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C E
G
3 3 1 200
6 9 9 100
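
The same idea can be packaged as a small function with fixed data, so the result is reproducible. This is a sketch under assumptions: `grouped_sum` is a hypothetical helper name, and it filters with a list comprehension instead of `Index.intersection`, which keeps the columns in the order they were requested.

```python
import pandas as pd

def grouped_sum(df, by, wanted):
    """Hypothetical helper: group by `by` and sum only the `wanted`
    columns that are actually present in `df`."""
    present = [c for c in wanted if c in df.columns]
    return df.groupby(by)[present].sum()

df = pd.DataFrame({'A': [9, 1, 8], 'C': [7, 7, 1], 'G': [1, 1, 2]})

# Column E is missing here, so only A and C are summed
without_e = grouped_sum(df, 'G', ['A', 'C', 'E'])
print(without_e)

# With E present, it is picked up automatically
with_e = grouped_sum(df.assign(E=[2, 3, 4]), 'G', ['A', 'C', 'E'])
print(with_e)
```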

Drop level for index

I have the below result from a pivot table, which is about the count of customer grades that visited my stores. I used the 'droplevel' method to flatten the column header into 1 layer, how can I do the same for the index? I want to remove 'Grade' above the index, so that the column headers are at the same level as 'Store No_'.
It seems you need to remove the name of the columns:
df.columns.name = None
Or rename_axis:
df = df.rename_axis(None, axis=1)
Sample:
df = pd.DataFrame({'Store No_':[1,2,3],
                   'A':[4,5,6],
                   'B':[7,8,9],
                   'C':[1,3,5],
                   'D':[5,3,6],
                   'E':[7,4,3]})
df = df.set_index('Store No_')
df.columns.name = 'Grade'
print (df)
Grade A B C D E
Store No_
1 4 7 1 5 7
2 5 8 3 3 4
3 6 9 5 6 3
print (df.rename_axis(None, axis=1))
A B C D E
Store No_
1 4 7 1 5 7
2 5 8 3 3 4
3 6 9 5 6 3
df = df.rename_axis(None, axis=1).reset_index()
print (df)
Store No_ A B C D E
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
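
To see where the extra 'Grade' label comes from, here is a minimal sketch that starts from a pivot table. The raw `visits` data and its `Customer` column are invented for illustration; only the 'Store No_' and 'Grade' names come from the question.

```python
import pandas as pd

# Hypothetical raw data: one row per customer visit
visits = pd.DataFrame({'Store No_': [1, 1, 2, 2, 2],
                       'Grade': ['A', 'B', 'A', 'A', 'B'],
                       'Customer': ['c1', 'c2', 'c3', 'c4', 'c5']})

# Pivoting on 'Grade' stores that name on the resulting columns Index
pivot = visits.pivot_table(index='Store No_', columns='Grade',
                           values='Customer', aggfunc='count')
print(pivot.columns.name)

# Drop the columns name, then turn 'Store No_' back into a regular column
flat = pivot.rename_axis(None, axis=1).reset_index()
print(flat)
```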

collapse a pandas MultiIndex

Suppose I have a DataFrame with MultiIndex columns. How can I collapse the levels to a concatenation of the values so that I only have one level?
Setup
np.random.seed([3, 14])
col = pd.MultiIndex.from_product([list('ABC'), list('DE'), list('FG')])
df = pd.DataFrame(np.random.rand(4, 12) * 10, columns=col).astype(int)
print(df)
   A           B           C
   D     E     D     E     D     E
   F  G  F  G  F  G  F  G  F  G  F  G
0  2  1  1  7  5  9  9  2  7  4  0  3
1  3  7  1  1  5  3  1  4  3  5  6  0
2  2  6  9  9  9  5  7  0  1  2  7  5
3  2  2  8  0  3  9  4  7  0  8  2  5
I want the result to look like this:
ADF ADG AEF AEG BDF BDG BEF BEG CDF CDG CEF CEG
0 2 1 1 7 5 9 9 2 7 4 0 3
1 3 7 1 1 5 3 1 4 3 5 6 0
2 2 6 9 9 9 5 7 0 1 2 7 5
3 2 2 8 0 3 9 4 7 0 8 2 5
Solution
I did this
def collapse_columns(df):
    df = df.copy()
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.to_series().apply(lambda x: "".join(x))
    return df
I had to check whether it's a MultiIndex, because if it weren't, I'd split each string label into characters and recombine them with whatever separator I chose in the join.
you may try this:
In [200]: cols = pd.Series(df.columns.tolist()).apply(pd.Series).sum(axis=1)
In [201]: cols
Out[201]:
0 ADF
1 ADG
2 AEF
3 AEG
4 BDF
5 BDG
6 BEF
7 BEG
8 CDF
9 CDG
10 CEF
11 CEG
dtype: object
df.columns = df.columns.to_series().apply(''.join)
This will give no separation, but you can sub in '_' for '' or any other separator you might want.
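
A variant with a configurable separator could be sketched as below. The name `flatten_columns` and the `sep` parameter are hypothetical, and `map(str, ...)` is added on the assumption that some level values may not be strings.

```python
import pandas as pd

def flatten_columns(df, sep=''):
    """Hypothetical helper: join MultiIndex column levels into one label.
    Frames with ordinary (single-level) columns are returned unchanged."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Each column label is a tuple such as ('A', 'D'); join its parts
        out.columns = [sep.join(map(str, tup)) for tup in out.columns]
    return out

col = pd.MultiIndex.from_product([list('AB'), list('DE')])
df = pd.DataFrame([[1, 2, 3, 4]], columns=col)

plain = flatten_columns(df)
underscored = flatten_columns(df, sep='_')
print(plain.columns.tolist())
print(underscored.columns.tolist())
```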
Solution 1)
df.columns = df.columns.to_series().str.join('_')
print(df.columns.shape) #(1,_X_) # a 2 D Array.
OR BETTER Solution 2
pivoteCols = df.columns.to_series().str.join('_')
pivoteCols = pivoteCols.values.reshape(len(pivoteCols))
df.columns = pivoteCols
print(df.columns.shape) # One Dimensional

python pandas groupby() result

I have the following python pandas data frame:
df = pd.DataFrame({
    'A': [1,1,1,1,2,2,2,3,3,4,4,4],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like to add another column that stores the sum of the C values over all rows sharing the same A and B. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
    group_by_B = group_by_A.groupby('B', as_index = False)
    res[a] = group_by_B['C'].sum()
but I don't know how to get the results from res back into df in an orderly fashion. I would be very happy with any advice on this. Thank you.
Here's one way (though it feels this should work in one go with an apply, I can't get it).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The groupby size function is the one you want; we have to match it to 'A' and 'B' as the index:
In [13]: df1['D'] = g.size()  # unfortunately this doesn't play nice with as_index=False
                              # The same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
You could also do a one liner using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
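
The difference between aggregating and transforming can be checked on the question's data. In this sketch, `sums` and `expected` are illustrative names; the expected values are copied from the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 5, 6, 7, 5, 6, 6, 7, 7, 6, 7, 7],
    'C': [1] * 12,
})

# A plain aggregation returns one value per (A, B) group
sums = df.groupby(['A', 'B'])['C'].sum()
print(len(sums), len(df))

# transform broadcasts each group's sum back to the original rows,
# so the result lines up with df and can be assigned directly
df['D'] = df.groupby(['A', 'B'])['C'].transform('sum')

expected = [2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2]
print(df['D'].tolist() == expected)
```

Because the transformed result has the same index as `df`, no `set_index`/`reset_index` round trip or merge is needed.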
You could also do a one liner using merge as follows:
df = df.merge(pd.DataFrame({'D':df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)
You can use this method:
columns = ['col1','col2',...]
df.groupby('col')[columns].sum()
If you want, you can also call .sort_values(by = 'colx', ascending = True/False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
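
As a concrete illustration of the pattern, with invented data and the placeholder names `col`, `col1` and `col2` standing in for real column names:

```python
import pandas as pd

# Hypothetical data; 'col' is the grouping key
df = pd.DataFrame({'col': ['x', 'y', 'x', 'y'],
                   'col1': [1, 2, 3, 4],
                   'col2': [10, 0, 10, 0]})

columns = ['col1', 'col2']

# Sum the listed columns per group, then order the groups by col1, largest first
totals = (df.groupby('col')[columns]
            .sum()
            .sort_values(by='col1', ascending=False))
print(totals)
```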
