My task is like this:
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
Out:
a b c d e f
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
What I want is for the output dataframe to look like this:
Out
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
That is to say, sum the column pairs (a,b), (c,d), (e,f) separately, keep the second column of each pair, and rename the sum columns (s1,s2,s3). Could anyone help solve this problem in pandas? Thank you so much.
You can select columns by position with iloc, sum each pair of values, and then rename the summed columns with f-strings:
i = 2
for x in range(0, len(df.columns), i):
    df.iloc[:, x] = df.iloc[:, x:x+i].sum(axis=1)
    df = df.rename(columns={df.columns[x]: f's{x // i + 1}'})
print (df)
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
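A loop-free variant of the same idea is also possible (a sketch; it assumes an even number of columns arranged in adjacent pairs, with the sum landing in the left column of each pair):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2, 3, 4, 5, 6)] * 3, columns=['a', 'b', 'c', 'd', 'e', 'f'])

out = df.copy()
# write each pair's sum into the left column of the pair
out.iloc[:, ::2] = df.iloc[:, ::2].to_numpy() + df.iloc[:, 1::2].to_numpy()
# rename every even-positioned column to s1, s2, ...
out.columns = [f's{i // 2 + 1}' if i % 2 == 0 else c
               for i, c in enumerate(df.columns)]
print(out)
```

This avoids rebuilding the dataframe inside a loop, at the cost of a brief round trip through numpy for the aligned addition.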
For one pair, do:
df['a'] = df['a'] + df['b']
df.rename(columns={'a': 's1'}, inplace=True)
You can use a loop to do them all. The loop, using enumerate and zip, generates
(0, ('a','b')), (1, ('c','d')), (2, ('e','f'))
Use these indexes to do the sum and the renaming:
import pandas as pd

cols = ['a','b','c','d','e','f']
df = pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)], columns=cols)
for idx, (col1, col2) in enumerate(zip(cols[::2], cols[1::2])):
    df[col1] = df[col1] + df[col2]
    df.rename(columns={col1: 's' + str(idx + 1)}, inplace=True)
print(df)
You can try this:
res = pd.DataFrame()
for i in range(len(df.columns) - 1):
    if i % 2 == 0:
        res[df.columns[i]] = df[df.columns[i]] + df[df.columns[i+1]]
    else:
        res[df.columns[i]] = df[df.columns[i]]
res['f'] = df[df.columns[-1]]
res.columns = ['s1', 'b', 's2', 'd', 's3', 'f']
Output:
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
df['s1'] = df['a'] + df['b']
df['s2'] = df['c'] + df['d']
df['s3'] = df['e'] + df['f']
df is now:
a b c d e f s1 s2 s3
0 1 2 3 4 5 6 3 7 11
1 1 2 3 4 5 6 3 7 11
2 1 2 3 4 5 6 3 7 11
and you can remove the columns 'a', 'c' and 'e':
df.pop('a')
df.pop('c')
df.pop('e')
df is now:
b d f s1 s2 s3
0 2 4 6 3 7 11
1 2 4 6 3 7 11
2 2 4 6 3 7 11
The jump is in steps of two, so we can split the dataframe with np.split:
import numpy as np

res = np.split(df.to_numpy(), df.shape[-1] // 2, 1)
Next, we compute the new data, where we sum the pairs of columns and keep the last column of each pair:
new_frame = np.hstack([np.vstack((np.sum(entry,1), entry[:,-1])).T for entry in res])
Create the new column names, taking into account the jump of 2:
new_cols = [f"s{ind//2+1}" if ind%2==0 else val for ind,val in enumerate(df.columns)]
Create the new dataframe:
pd.DataFrame(new_frame, columns=new_cols)
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
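The numpy steps above, collected into one self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2, 3, 4, 5, 6)] * 3, columns=['a', 'b', 'c', 'd', 'e', 'f'])

# split the values into (n_cols // 2) arrays of two columns each
res = np.split(df.to_numpy(), df.shape[-1] // 2, 1)
# for each pair: its row-wise sum, then its last column
new_frame = np.hstack([np.vstack((np.sum(entry, 1), entry[:, -1])).T
                       for entry in res])
# rename every even-positioned column to s1, s2, ...
new_cols = [f"s{ind // 2 + 1}" if ind % 2 == 0 else val
            for ind, val in enumerate(df.columns)]
out = pd.DataFrame(new_frame, columns=new_cols)
print(out)
```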
I have a dataframe and I would like to insert rows at specific indexes at the beginning of each group within the dataframe. As an example lets say I have the following dataframe:
import pandas as pd
df = pd.DataFrame(data=[['A',1,1],['A',2,3],['A',5,4],['B',3,4],['B',2,6],['B',8,4],['C',9,3],['C',3,7],['C',1,9],['D',5,5],['D',8,3],['D',4,7]], columns=['Group','val1','val2'])
I would like to copy the first row of each unique value in the Group column and insert that row at the beginning of its group, growing the dataframe. I can currently achieve this with a for loop, but it is pretty slow because my dataframe is large, so I am looking for a vectorized solution.
I have a list of indexes where I would like to insert the rows.
idxs = [0, 3, 6, 9]
In each iteration of the loop I currently slice the dataframe at one of the idxs into two dataframes, insert the row, and concat them. Since my dataframe is very large, this process is very slow.
The solution would look like this:
Group val1 val2
0 A 1 1
1 A 1 1
2 A 2 3
3 A 5 4
4 B 3 4
5 B 3 4
6 B 2 6
7 B 8 4
8 C 9 3
9 C 9 3
10 C 3 7
11 C 1 9
12 D 5 5
13 D 5 5
14 D 8 3
15 D 4 7
You can do this by grouping on Group, concatenating each group's first row to the group itself, and then concatenating all of those results together.
Code:
import pandas as pd
df = pd.DataFrame(data=[['A',1,1],['A',2,3],['A',5,4],['B',3,4],['B',2,6],['B',8,4],['C',9,3],['C',3,7],['C',1,9],['D',5,5],['D',8,3],['D',4,7]], columns=['Group','val1','val2'])
df_new = pd.concat([
    pd.concat([grp.iloc[[0], :], grp])
    for key, grp in df.groupby('Group')
])
print(df_new)
Output:
Group val1 val2
0 A 1 1
0 A 1 1
1 A 2 3
2 A 5 4
3 B 3 4
3 B 3 4
4 B 2 6
5 B 8 4
6 C 9 3
6 C 9 3
7 C 3 7
8 C 1 9
9 D 5 5
9 D 5 5
10 D 8 3
11 D 4 7
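Since the question explicitly asks for a vectorized solution and the first-row positions are already available in idxs, an alternative sketch (treating idxs as positional offsets) builds the duplicated row order once and takes it with a single iloc call, which also restores the sequential 0..15 index shown in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A',1,1],['A',2,3],['A',5,4],['B',3,4],['B',2,6],['B',8,4],
                        ['C',9,3],['C',3,7],['C',1,9],['D',5,5],['D',8,3],['D',4,7]],
                  columns=['Group','val1','val2'])
idxs = [0, 3, 6, 9]  # positions of each group's first row

# positions at idxs appear twice after the concatenation; sorting
# puts the duplicate right next to the original row
take = np.sort(np.concatenate([np.arange(len(df)), idxs]))
out = df.iloc[take].reset_index(drop=True)
print(out)
```

This does no per-group Python work at all, so it should scale much better than slice-and-concat in a loop.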
So I have the following pandas dataframe:
import pandas as pd
sample_df = pd.DataFrame({'note': ['D','C','D','C'], 'time': [1,1,4,6], 'val': [6,4,7,9]})
which gives the result
note time val
0 D 1 6
1 C 1 4
2 D 4 7
3 C 6 9
What I want is
note index time val
C 1 1 4
3 6 9
D 0 1 6
2 4 7
I tried sample_df.set_index('note',append=True) and it didn't work.
Use DataFrame.swaplevel together with DataFrame.sort_index on the first level:
df = sample_df.set_index('note', append=True).swaplevel(1,0).sort_index(level=0)
print (df)
time val
note
C 1 1 4
3 6 9
D 0 1 6
2 4 7
If you need to set a level name, add DataFrame.rename_axis:
df = (sample_df.rename_axis('idx')
               .set_index('note', append=True)
               .swaplevel(1, 0)
               .sort_index(level=0))
print (df)
time val
note idx
C 1 1 4
3 6 9
D 0 1 6
2 4 7
Alternatively:
sample_df.index.rename('old_index', inplace=True)
sample_df.reset_index(inplace=True)
sample_df.set_index(['note','old_index'], inplace=True)
sample_df.sort_index(level=0, inplace=True)
print (sample_df)
time val
note old_index
C 1 1 4
3 6 9
D 0 1 6
2 4 7
I am using MultiIndex.from_arrays to create the target index:
sample_df.index=pd.MultiIndex.from_arrays([sample_df.note,sample_df.index])
sample_df.drop('note', axis=1, inplace=True)
sample_df=sample_df.sort_index(level=0)
sample_df
time val
note
C 1 1 4
3 6 9
D 0 1 6
2 4 7
I would use set_index and pop to simultaneously discard the column 'note' and set the new index:
df.set_index([df.pop('note'), df.index]).sort_index(level=0)
Out[380]:
time val
note
C 1 1 4
3 6 9
D 0 1 6
2 4 7
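One thing to watch with the pop-based one-liner: pop mutates the dataframe it is called on, so if you still need the original afterwards, run it on a copy (a sketch):

```python
import pandas as pd

sample_df = pd.DataFrame({'note': ['D', 'C', 'D', 'C'],
                          'time': [1, 1, 4, 6],
                          'val': [6, 4, 7, 9]})

df = sample_df.copy()  # pop removes 'note' from df in place, so work on a copy
res = df.set_index([df.pop('note'), df.index]).sort_index(level=0)
print(res)
```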
Given the following DataFrame:
>>> pd.DataFrame(data=[['a',1],['a',2],['b',3],['b',4],['c',5],['c',6],['d',7],['d',8],['d',9],['e',10]],columns=['key','value'])
key value
0 a 1
1 a 2
2 b 3
3 b 4
4 c 5
5 c 6
6 d 7
7 d 8
8 d 9
9 e 10
I'm looking for a method that will change the structure based on the key value, like so:
a b c d e
0 1 3 5 7 10
1 2 4 6 8 10 <- 10 is duplicated
2 2 4 6 9 10 <- 10 is duplicated
The number of result rows equals the longest group's count (d in the above example), and missing values are filled with duplicates of the last available value.
Create a MultiIndex with set_index and a counter column from cumcount, reshape with unstack, replace missing values with the last non-missing ones using ffill, and finally convert all data to integers if necessary:
df = df.set_index([df.groupby('key').cumcount(),'key'])['value'].unstack().ffill().astype(int)
Another solution with custom lambda function:
df = (df.groupby('key')['value']
        .apply(lambda x: pd.Series(x.values))
        .unstack(0)
        .ffill()
        .astype(int))
print (df)
key a b c d e
0 1 3 5 7 10
1 2 4 6 8 10
2 2 4 6 9 10
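For reference, the first one-liner run end-to-end on the sample data from the question (a sketch):

```python
import pandas as pd

df = pd.DataFrame(data=[['a',1],['a',2],['b',3],['b',4],['c',5],['c',6],
                        ['d',7],['d',8],['d',9],['e',10]],
                  columns=['key','value'])

# cumcount numbers the rows within each key: a->0,1  b->0,1  ...  d->0,1,2
out = (df.set_index([df.groupby('key').cumcount(), 'key'])['value']
         .unstack()       # keys become columns, counter becomes the index
         .ffill()         # forward-fill the short groups
         .astype(int))
print(out)
```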
Using pivot with groupby + cumcount:
df.assign(key2=df.groupby('key').cumcount()).pivot(index='key2', columns='key', values='value').ffill().astype(int)
Out[214]:
key a b c d e
key2
0 1 3 5 7 10
1 2 4 6 8 10
2 2 4 6 9 10
SQL: SELECT MAX(A), MIN(B), C FROM Table GROUP BY C
I want to do the same operation in pandas on a dataframe. The closest I got was:
DF2 = DF1.groupby(by=['C']).max()
where I end up getting the max of both columns. How do I do more than one aggregation while grouping?
You can use the agg function:
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
Sample:
print(DF1)
A B C D
0 1 5 a a
1 7 9 a b
2 2 10 c d
3 3 2 c c
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
print(DF2)
A B
C
a 7 5
c 3 2
GroupBy-fu: improvements in grouping and aggregating data in pandas - nice explanations.
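In newer pandas (0.25+), the same grouping can also be written with named aggregation, which lets you rename the output columns at the same time (a sketch; the names max_A and min_B are my own choice, and the sample data mirrors the table above):

```python
import pandas as pd

DF1 = pd.DataFrame({'A': [1, 7, 2, 3],
                    'B': [5, 9, 10, 2],
                    'C': ['a', 'a', 'c', 'c'],
                    'D': ['a', 'b', 'd', 'c']})

# named aggregation: result_column=(source_column, aggregation)
DF2 = DF1.groupby('C').agg(max_A=('A', 'max'), min_B=('B', 'min'))
print(DF2)
```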
Try the agg() function:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=list('ABC'))
print(df)
print(df.groupby('C').agg({'A': max, 'B':min}))
Output:
A B C
0 2 3 0
1 2 2 1
2 4 0 1
3 0 1 4
4 3 3 2
5 0 4 3
6 2 4 2
7 3 4 0
8 4 2 2
9 3 2 1
10 2 3 1
11 4 1 0
12 4 3 2
13 0 0 1
14 3 1 1
15 4 1 1
16 0 0 0
17 4 0 1
18 3 4 0
19 0 2 4
A B
C
0 4 0
1 4 0
2 4 2
3 0 4
4 0 1
Alternatively, you may want to check the pandas.read_sql_query() function...
You can use the agg function:
import pandas as pd
import numpy as np
df.groupby('something').agg({'column1': np.max, 'column2': np.min})