I have a dataset where I want to add a suffix to column names based on their positions. For example, the 1st to 4th columns should be named 'abc_1', the 5th to 8th columns 'abc_2', and so on.
I have tried using dataframe.rename, but it is a time-consuming process. What would be the most efficient way to achieve this?
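For reference, the direct positional rename described above can be sketched like this (my own minimal example, not the code from the question); it works, but it leaves duplicated column names, which the MultiIndex answer below avoids:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))

# every block of 4 positions gets the same label -> duplicated names
df.columns = [f'abc_{i // 4 + 1}' for i in range(df.shape[1])]
print(df.columns.tolist())
# ['abc_1', 'abc_1', 'abc_1', 'abc_1', 'abc_2', 'abc_2', 'abc_2', 'abc_2', 'abc_3', 'abc_3']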
A good choice here is to create a MultiIndex to avoid duplicated column names: build the first level by floor-dividing the positions by 4 and add the prefix with f-strings:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))
df.columns = [[f'abc_{i+1}' for i in df.columns // 4], df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
A more general solution, for when the column names are not a RangeIndex:
cols = [f'abc_{i+1}' for i in np.arange(len(df.columns)) // 4]
df.columns = [cols, df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
It is also possible to specify the MultiIndex level names with MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([cols, df.columns], names=('level0','level1'))
print (df)
level0 abc_1 abc_2 abc_3
level1 0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
Each group can then be selected with xs:
print (df.xs('abc_2', axis=1))
4 5 6 7
0 3 9 6 1
1 3 4 0 0
2 7 2 4 8
3 1 5 6 2
4 6 2 4 4
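A single original column inside a group can be selected with a tuple, and if flat suffixed names are wanted in the end, the two levels can be joined back together (a small sketch building on the frame above):
# pick one original column within group 'abc_2'
print(df[('abc_2', 5)])

# flatten back to single-level names like 'abc_2_5'
flat = df.copy()
flat.columns = [f'{a}_{b}' for a, b in flat.columns]
print(flat.columns.tolist())
# ['abc_1_0', 'abc_1_1', 'abc_1_2', 'abc_1_3', 'abc_2_4', 'abc_2_5', 'abc_2_6', 'abc_2_7', 'abc_3_8', 'abc_3_9']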
My df looks as follows:
import pandas as pd
d = {'col1': [1,2,3,3,1,2,2,3,4,1,1,2]}
df = pd.DataFrame(data=d)
Now I want to add a new column following this scheme:
col1  new_col
1     1
2     2
3     3
3     3
1     4
2     5
2     5
3     6
4     7
1     8
1     8
2     9
Once it starts again at 1, it should just keep counting.
At the moment I just add a column with the difference:
df['diff'] = df['col1'].diff()
How can I extend this approach?
Try with
df.col1.diff().ne(0).cumsum()
Out[94]:
0 1
1 2
2 3
3 3
4 4
5 5
6 5
7 6
8 7
9 8
10 8
11 9
Name: col1, dtype: int32
Try:
df["new_col"] = df["col1"].ne(df["col1"].shift()).cumsum()
>>> df
col1 new_col
0 1 1
1 2 2
2 3 3
3 3 3
4 1 4
5 2 5
6 2 5
7 3 6
8 4 7
9 1 8
10 1 8
11 2 9
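Both answers use the same idea: a new run starts wherever the value differs from the previous row, and a cumulative sum over those True flags numbers the runs. The intermediates can be inspected like this (my own illustration of the one-liners above):
s = df['col1']
starts = s.ne(s.shift())   # True at the first row of each run; row 0 is True because NaN != 1
print(starts.values)
# [ True  True  True False  True  True False  True  True  True False  True]
print(starts.cumsum().values)
# [1 2 3 3 4 5 5 6 7 8 8 9]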
I have a data frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column with the same value in groups of 3.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove the trailing rows with DataFrame.iloc if the length is not divisible by 3, and then create unique group labels with integer division by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- extra row, removed below
N = 3
num = len(df) // N * N
df = df.iloc[:num]
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
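If letter labels like A/B are wanted, as in the question's expected output, the numeric groups can be mapped through the alphabet (a sketch assuming at most 26 groups):
import string

df['groups'] = np.array(list(string.ascii_uppercase))[np.arange(len(df)) // N]
print(df)
#    a  b  d groups
# 0  1  2  3      A
# 1  4  5  6      A
# 2  4  2  3      A
# 3  3  1  9      B
# 4  6  7  0      B
# 5  9  2  5      B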
IIUC, the same positional key also works for per-group aggregation with groupby, e.g. summing each block of 3 rows:
df['new_col'] = df.sum(axis=1).groupby(np.arange(len(df)) // 3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42
Actually I thought this would be very easy. I have a pandas DataFrame with, let's say, 100 columns, and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:, 0:30]
df_2 = df.iloc[:, 77:99]
df2 = pd.concat([df_1, df_2], axis=1)
Is there an easier way?
Use numpy.r_ to concatenate the slice indices:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
Sample:
df = pd.DataFrame(np.random.randint(10, size=(5,15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
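For intuition, np.r_ just translates the slice notation into one flat array of integer positions, which iloc then consumes (a quick check, independent of pandas):
print(np.r_[0:3, 7:9])
# [0 1 2 7 8]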
The concat approach from the question gives the same result:
df_1 = df.iloc[:, 0:3]
df_2 = df.iloc[:, 7:9]
df2 = pd.concat([df_1, df_2], axis=1)
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
I have two data frames in pandas, and I need to get the rows of the second whose values across all columns do not appear in the first.
Example:
df A
A B C D
6 4 1 6
7 6 6 3
1 6 2 9
8 0 4 9
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
3 2 8 8
5 2 8 8
df B
A B C D
1 0 2 3
8 4 7 5
4 7 1 1
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
1 1 1 1
2 2 2 2
1 1 1 1
Required output:
A B C D
1 1 1 1
2 2 2 2
1 1 1 1
I tried using pd.merge with inner/left joins on all columns, but it takes much more computation time and resources as the rows and columns grow. Is there another way, such as iterating through each row of dfA against dfB on all columns and picking the rows that are only in dfB?
You can use merge with the indicator parameter:
df_b.merge(df_a, on=['A','B','C','D'],
how='left', indicator='ind')\
.query('ind == "left_only"')\
.drop('ind', axis=1)
Output:
A B C D
9 1 1 1 1
10 2 2 2 2
11 1 1 1 1
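If the merge is still too heavy, one alternative worth sketching is an index-based anti-join: turn the rows of both frames into MultiIndexes and keep only the rows of df_b whose tuple does not occur in df_a (my own suggestion, assuming exact matches on all four columns):
mask = pd.MultiIndex.from_frame(df_b).isin(pd.MultiIndex.from_frame(df_a))
print(df_b[~mask])
#     A  B  C  D
# 9   1  1  1  1
# 10  2  2  2  2
# 11  1  1  1  1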
Say I have a pandas DataFrame with the following structure:
pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,3,5,6],[1,4,6,7],[1,4,6,7],[1,4,6,7]])
Out[262]:
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 1 3 5 6
3 1 4 6 7
4 1 4 6 7
5 1 4 6 7
How can I add a column called 'ct' that counts, for each row, how many rows share the same values in columns 1-3, so that the completed DataFrame looks like this:
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
You can use groupby + transform + size:
df['ct'] = df.groupby([1,2,3])[1].transform('size')
#alternatively
#df['ct'] = df.groupby([1,2,3])[1].transform(len)
print (df)
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
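For reference, the per-combination counts that transform broadcasts back to the rows can be inspected directly (just an illustration of what 'size' computes here):
print(df.groupby([1, 2, 3]).size())
# 1  2  3
# 2  3  4    2
# 3  5  6    1
# 4  6  7    3
# dtype: int64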