I have two DataFrame objects which I would like to multiply based on the column names and output the new column with a suffix...
df1 = pd.DataFrame(np.random.randint(0,10, size=(5,5)), columns=list('ABCDE'))
A B C D E
0 6 2 1 7 2
1 0 0 2 1 8
2 7 2 6 6 9
3 2 5 5 1 3
4 9 1 6 7 4
df2 = pd.DataFrame(np.random.randint(1, 10, size=(5,3)), columns=list('ABC'))
A B C
0 2 1 2
1 7 5 1
2 2 1 4
3 7 8 5
4 9 2 2
I would like the output to contain columns A_x, B_x and C_x holding the products of the matching columns in df1 and df2:
A B C A_x B_x C_x D E
0 6 2 1 12 2 2 7 2
1 0 0 2 0 0 2 1 8
2 7 2 6 14 2 24 6 9
3 2 5 5 14 40 25 1 3
4 9 1 6 81 2 12 7 4
You can use intersection to get the shared column names, multiply with mul, rename the results with add_suffix, and finally concat them to df1:
cols = df1.columns.intersection(df2.columns)
df = df1[cols].mul(df2[cols], axis=1).add_suffix('_x')
df = pd.concat([df1, df], axis=1)
print (df)
A B C D E A_x B_x C_x
0 6 2 1 7 2 12 2 2
1 0 0 2 1 8 0 0 2
2 7 2 6 6 9 14 2 24
3 2 5 5 1 3 14 40 25
4 9 1 6 7 4 81 2 12
If you need to change the order of the columns:
cols = df1.columns.intersection(df2.columns)
df = df1[cols].mul(df2[cols], axis=1).add_suffix('_x')
cols1 = cols.tolist() + \
df.columns.tolist() + \
df1.columns.difference(df2.columns).tolist()
df = pd.concat([df1, df], axis=1)
print (df[cols1])
A B C A_x B_x C_x D E
0 6 2 1 12 2 2 7 2
1 0 0 2 0 0 2 1 8
2 7 2 6 14 2 24 6 9
3 2 5 5 14 40 25 1 3
4 9 1 6 81 2 12 7 4
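The steps above can be collected into one self-contained sketch that rebuilds the question's data by hand so the products can be checked. The helper name multiply_shared and its suffix parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Data copied from the question so the result can be checked by hand.
df1 = pd.DataFrame({'A': [6, 0, 7, 2, 9],
                    'B': [2, 0, 2, 5, 1],
                    'C': [1, 2, 6, 5, 6],
                    'D': [7, 1, 6, 1, 7],
                    'E': [2, 8, 9, 3, 4]})
df2 = pd.DataFrame({'A': [2, 7, 2, 7, 9],
                    'B': [1, 5, 1, 8, 2],
                    'C': [2, 1, 4, 5, 2]})

def multiply_shared(left, right, suffix='_x'):
    """Hypothetical helper: multiply the columns that both frames share."""
    shared = left.columns.intersection(right.columns)
    product = left[shared].mul(right[shared]).add_suffix(suffix)
    rest = left.columns.difference(shared)
    out = pd.concat([left, product], axis=1)
    # Order: shared columns, their products, then columns only in `left`.
    return out[shared.tolist() + product.columns.tolist() + rest.tolist()]

result = multiply_shared(df1, df2)
print(result)
```

Rows are multiplied by index label, so both frames are assumed to share the same index, as they do here.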
Related
I have a data-frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column with the same value in groups of 3.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove the trailing rows with DataFrame.iloc if the length is not divisible by 3, then create a unique group number for every 3 rows with integer division:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
N = 3
num = len(df) // N * N
df = df.iloc[:num].copy()
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
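The question asked for letter labels (A, B, ...) rather than integers. A minimal sketch of one way to get them, assuming there are at most 26 groups so a plain alphabet lookup is enough:

```python
import string
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [4, 2, 3],
                   [3, 1, 9], [6, 7, 0], [9, 2, 5]],
                  columns=['a', 'b', 'd'])

N = 3
group_ids = np.arange(len(df)) // N            # 0, 0, 0, 1, 1, 1
# Map 0 -> 'A', 1 -> 'B', ...; this lookup assumes at most 26 groups.
letters = np.array(list(string.ascii_uppercase))
df['groups'] = letters[group_ids]
print(df)
```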
IIUC, groupby:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42
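To see where 30 and 42 come from, the one-liner can be split into its steps. This is only a sketch of the same idea on the question's data; the intermediate names chunk and row_totals are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [4, 2, 3],
                   [3, 1, 9], [6, 7, 0], [9, 2, 5]])

chunk = np.arange(len(df)) // 3          # 0, 0, 0, 1, 1, 1
row_totals = df.sum(axis=1)              # one total per row
# transform returns one value per row, so the result aligns with df.
df['new_col'] = row_totals.groupby(chunk).transform('sum')
print(df)
```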
If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
OR you can also use pd.merge (not a very clean solution though). Recent pandas versions reject on combined with left_index/right_index, so no helper column is needed; merging on the index alone is enough:
In [1309]: pd.merge(df, df1, left_index=True, right_index=True)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
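Another option, sketched here on the two frames from the question, is pd.concat with axis=1. It aligns rows on the index and keeps the columns in the order the frames are passed.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'F': [2, 5], 'Y': [3, 6]})
df1 = pd.DataFrame({'B': [7, 10], 'C': [8, 11], 'T': [9, 12]})

# Rows are aligned on the index; columns come from df first, then df1.
combined = pd.concat([df, df1], axis=1)
print(combined)
```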
I have a dataframe df with the following data.
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
2 4 6 4
3 1 4 1
3 2 0 4
4 1 2 6
5 1 2 4
5 2 8 3
grp = df.groupby('A')
Next, I want every group of df (grouped on column A) to have the same number of rows, either by truncating extra rows or by padding with rows of 0. For the data above, I want every group to have 3 rows. I need the following result.
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
3 1 4 1
3 2 0 4
3 0 0 0
4 1 2 6
4 0 0 0
4 0 0 0
5 1 2 4
5 2 8 3
5 0 0 0
Similarly, I may want to groupby on multiple columns, like
grp = df.groupby(['A','B'])
Use GroupBy.cumcount to create a counter column, then DataFrame.reindex with a MultiIndex built by MultiIndex.from_product:
df['g'] = df.groupby('A').cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(), range(3)], names=('A','g'))
df = (df.set_index(['A','g'])
.reindex(mux, fill_value=0)
.reset_index(level=1, drop=True)
.reset_index())
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
Another solution uses DataFrame.merge with a left join against a helper DataFrame:
from itertools import product
df['g'] = df.groupby('A').cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(), range(3))), columns=['A','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
EDIT: For grouping by multiple columns, add every key to cumcount and to the product:
df['g'] = df.groupby(['A','B']).cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(),
df['B'].unique(),
range(3)], names=('A','B','g'))
df = (df.set_index(['A','B','g'])
.reindex(mux, fill_value=0)
.reset_index(level=2, drop=True)
.reset_index())
print (df.head(10))
A B C D
0 1 1 3 1
1 1 1 0 0
2 1 1 0 0
3 1 2 9 8
4 1 2 0 0
5 1 2 0 0
6 1 3 3 9
7 1 3 0 0
8 1 3 0 0
9 1 4 0 0
from itertools import product
df['g'] = df.groupby(['A','B']).cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(),
df['B'].unique(),
range(3))), columns=['A','B','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)
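Note that taking the product of all unique A and B values also creates key pairs that never occur in the data (such as A=1, B=4 above). If only observed key combinations should be padded, one possible sketch is below. The helper name pad_groups is made up for illustration; it assumes all columns are numeric and a pandas version that supports how='cross' in merge (1.2 or later).

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5],
                   'B': [1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 1, 2],
                   'C': [3, 9, 3, 2, 1, 9, 6, 4, 0, 2, 2, 8],
                   'D': [1, 8, 9, 9, 4, 5, 4, 1, 4, 6, 4, 3]})

def pad_groups(frame, keys, size):
    """Hypothetical helper: truncate or zero-pad each observed group to `size` rows."""
    keys = [keys] if isinstance(keys, str) else list(keys)
    counted = frame.assign(g=frame.groupby(keys).cumcount())
    # One row per observed key combination, crossed with positions 0..size-1.
    skeleton = (frame[keys].drop_duplicates()
                .merge(pd.DataFrame({'g': range(size)}), how='cross'))
    # Positions >= size are absent from the skeleton, so extra rows are dropped.
    out = skeleton.merge(counted, on=keys + ['g'], how='left')
    return out.fillna(0).astype(int).drop(columns='g')

by_a = pad_groups(df, 'A', 3)
by_ab = pad_groups(df, ['A', 'B'], 3)
print(by_a)
```

When grouping by a single key the key pairs issue does not arise, so by_a should match the output of the single-column solutions above.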
I have a dataset where I want to add a suffix to column names based on their positions. For example, the 1st to 4th columns should be named 'abc_1', the 5th to 8th columns 'abc_2', and so on.
I have tried using dataframe.rename, but it is a time-consuming process. What would be the most efficient way to achieve this?
I think a good choice here is to create a MultiIndex to avoid duplicated column names: build the first level by floor division by 4 and add the prefix with f-strings:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))
df.columns = [[f'abc_{i+1}' for i in df.columns // 4], df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
A more general solution if the column names are not a RangeIndex:
cols = [f'abc_{i+1}' for i in np.arange(len(df.columns)) // 4]
df.columns = [cols, df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
It is also possible to name the MultiIndex levels with MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([cols, df.columns], names=('level0','level1'))
print (df)
level0 abc_1 abc_2 abc_3
level1 0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
Then it is possible to select each group of columns with xs:
print (df.xs('abc_2', axis=1))
4 5 6 7
0 3 9 6 1
1 3 4 0 0
2 7 2 4 8
3 1 5 6 2
4 6 2 4 4
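If flat, repeated names are really wanted, as the question literally asks, the same floor division can be applied to a plain list of labels. This is only a sketch: it deliberately creates duplicated column labels, which is the situation the MultiIndex above avoids, and the variable names block, flat and first_block are introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))

# Position-based block number: columns 0-3 -> 1, 4-7 -> 2, 8-9 -> 3.
block = np.arange(df.shape[1]) // 4 + 1
flat = df.copy()
flat.columns = [f'abc_{b}' for b in block]   # duplicated labels on purpose

# Selecting a duplicated label returns every column that carries it.
first_block = flat['abc_1']
print(flat.columns.tolist())
```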
Consider the dataframe df:
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such that, if I grouped by column 'A', I'd pull the first position from each group, then cycle back and get the second position from each group if any are remaining, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
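You can use cumcount to number the rows within each group, then sort_values, and finally reindex by the index of the resulting Series cum. A stable sort (kind='mergesort') keeps the original group order among rows with the same count:
cum = df.groupby('A')['B'].cumcount().sort_values(kind='mergesort')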
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Here's a NumPy approach -
def approach1(g, v):
    # Inputs : 1D arrays of groupby and value columns
    id_arr2 = np.ones(v.size,dtype=int)
    sf = np.flatnonzero(g[1:] != g[:-1])+1
    id_arr2[sf[0]] = -sf[0]+1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:]+1
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or using df.reindex from @jezrael's post:
df.reindex(approach1(df.A.values, df.B.values))
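As a cross-check, the NumPy ordering can be compared with the cumcount ordering. The sketch below is a hypothetical variant of approach1, named interleave_order here, that takes only the group labels and adds a guard for the single-group case, where the original would fail on sf[0]. Like the original, it assumes equal labels are contiguous.

```python
import numpy as np
import pandas as pd

def interleave_order(g):
    """Hypothetical variant of approach1; assumes equal labels are contiguous."""
    g = np.asarray(g)
    step = np.ones(g.size, dtype=int)
    starts = np.flatnonzero(g[1:] != g[:-1]) + 1   # first row of each later group
    if starts.size:                                # a single group needs no resets
        step[starts[0]] = 1 - starts[0]
        step[starts[1:]] = starts[:-1] - starts[1:] + 1
    # cumsum counts 1, 2, 3, ... within each group; the stable argsort interleaves.
    return step.cumsum().argsort(kind='mergesort')

df = pd.DataFrame({'A': list('aaaaabbbbccc'), 'B': range(12)})
numpy_order = interleave_order(df['A'].to_numpy())
pandas_order = (df.groupby('A').cumcount()
                  .sort_values(kind='mergesort').index.to_numpy())
print(df.iloc[numpy_order])
```

Both orderings agree on this data because the default RangeIndex makes positions and index labels identical.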