Create a column in pandas dataframe - python

I have a dataframe as below:
df = pd.DataFrame({'ORDER':["A", "A", "A", "B", "B","B"], 'GROUP': ["A1C", "A1", "B1", "B1C", "M1", "M1C"]})
df['_A1_XYZ'] = 1
df['_A1C_XYZ'] = 2
df['_B1_XYZ'] = 3
df['_B1C_XYZ'] = 4
df['_M1_XYZ'] = 5
df
ORDER GROUP _A1_XYZ _A1C_XYZ _B1_XYZ _B1C_XYZ _M1_XYZ
0 A A1C 1 2 3 4 5
1 A A1 1 2 3 4 5
2 A B1 1 2 3 4 5
3 B B1C 1 2 3 4 5
4 B M1 1 2 3 4 5
5 B M1C 1 2 3 4 5
I want to create a column "NEW" based on the column "GROUP" and all the columns that end with XYZ, as below:
Based on the value of GROUP for each row, df["NEW"] = df["_<GROUP>_XYZ"].
For example, for the 1st row, GROUP = A1C, so "NEW" = 2 (_A1C_XYZ). Similarly, for the 2nd row, "NEW" = 1 (_A1_XYZ).
My expected output:
ORDER GROUP _A1_XYZ _A1C_XYZ _B1_XYZ _B1C_XYZ _M1_XYZ NEW
0 A A1C 1 2 3 4 5 2
1 A A1 1 2 3 4 5 1
2 A B1 1 2 3 4 5 3
3 B B1C 1 2 3 4 5 4
4 B M1 1 2 3 4 5 5
5 B M1C 1 2 3 4 5

Use pd.DataFrame.lookup (note: lookup was deprecated in pandas 1.2.0 and removed in 2.0):
df['NEW'] = df.lookup(df.index, '_'+df['GROUP']+'_XYZ')
df
Output:
ORDER GROUP _A1_XYZ _A1C_XYZ _B1_XYZ _B1C_XYZ _M1_XYZ _M1C_XYZ NEW
0 A A1C 1 2 3 4 5 6 2
1 A A1 1 2 3 4 5 6 1
2 A B1 1 2 3 4 5 6 3
3 B B1C 1 2 3 4 5 6 4
4 B M1 1 2 3 4 5 6 5
5 B M1C 1 2 3 4 5 6 6
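Since DataFrame.lookup is gone in current pandas, the same row-wise lookup can be sketched with NumPy integer indexing. This assumes every _<GROUP>_XYZ column actually exists, as in the edited question above (a missing column would make get_indexer return -1 and silently pick the last column):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ORDER': ["A", "A", "A", "B", "B", "B"],
                   'GROUP': ["A1C", "A1", "B1", "B1C", "M1", "M1C"]})
for i, name in enumerate(['_A1_XYZ', '_A1C_XYZ', '_B1_XYZ',
                          '_B1C_XYZ', '_M1_XYZ', '_M1C_XYZ'], start=1):
    df[name] = i

# Map each row's GROUP to the position of its matching column,
# then pick one value per row with integer indexing.
col_pos = df.columns.get_indexer('_' + df['GROUP'] + '_XYZ')
df['NEW'] = df.to_numpy()[np.arange(len(df)), col_pos]
```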
Updated after question edited.
Or use stack and reindex:
df['NEW'] = (df.stack()
               .reindex(list(zip(df.index, '_' + df['GROUP'] + '_XYZ')))
               .rename('NEW')
               .reset_index(level=1, drop=True))
df
Output:
ORDER GROUP _A1_XYZ _A1C_XYZ _B1_XYZ _B1C_XYZ _M1_XYZ NEW
0 A A1C 1 2 3 4 5 2
1 A A1 1 2 3 4 5 1
2 A B1 1 2 3 4 5 3
3 B B1C 1 2 3 4 5 4
4 B M1 1 2 3 4 5 5
5 B M1C 1 2 3 4 5 NaN

@ScottBoston's answer is better if all of the row values also exist as columns, but I thought I'd share mine! Essentially, I create a new dataframe from the relevant columns, drop the duplicates, rename the columns, transpose the dataframe, and merge the resulting column back in:
a = df.iloc[:,2:].drop_duplicates()
a.columns = [col.split('_')[1] for col in df.columns if '_' in col]
a = a.T.rename({0:'NEW'}, axis=1)
df = pd.merge(df, a, how='left', left_on='GROUP', right_index=True)
df
output:
ORDER GROUP _A1_XYZ _A1C_XYZ _B1_XYZ _B1C_XYZ _M1_XYZ NEW
0 A A1C 1 2 3 4 5 2.0
1 A A1 1 2 3 4 5 1.0
2 A B1 1 2 3 4 5 3.0
3 B B1C 1 2 3 4 5 4.0
4 B M1 1 2 3 4 5 5.0
5 B M1C 1 2 3 4 5 NaN


split a string into separate columns in pandas

I have a dataframe with lots of data and 1 column that is structured like this:
index var_1
1 a=3:b=4:c=5:d=6:e=3
2 b=3:a=4:c=5:d=6:e=3
3 e=3:a=4:c=5:d=6
4 c=3:a=4:b=5:d=6:f=3
I am trying to structure the data in that column to look like this:
index a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
I have done the following thus far:
df1 = df['var_1'].str.split(':', expand=True)
I can then loop through the cols of df1 and do another split on '=', but then I'll just have loads of disorganised label cols and value cols.
Use a list comprehension that builds a dictionary for each value, and pass it to the DataFrame constructor:
comp = [dict([y.split('=') for y in x.split(':')]) for x in df['var_1']]
df = pd.DataFrame(comp).fillna(0).astype(int)
print (df)
a b c d e f
0 3 4 5 6 3 0
1 4 3 5 6 3 0
2 4 0 5 6 3 0
3 4 5 3 6 0 3
Or use Series.str.split with expand=True for a DataFrame, reshape with DataFrame.stack, split again on '=', remove the first level of the MultiIndex, add a new level from column 0, and finally reshape with Series.unstack:
df = (df['var_1'].str.split(':', expand=True)
.stack()
.str.split('=', expand=True)
.reset_index(level=1, drop=True)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
.rename_axis(None, axis=1))
print (df)
a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
Here's one approach using str.get_dummies:
out = df.var_1.str.get_dummies(sep=':')
out = out * out.columns.str[2:].astype(int).values
out.columns = pd.MultiIndex.from_arrays([out.columns.str[0], out.columns])
print(out.max(axis=1, level=0))  # on pandas >= 2.0: out.T.groupby(level=0).max().T
a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
You can apply "extractall" and "pivot".
After "extractall" you get:
0 1
index match
1 0 a 3
1 b 4
2 c 5
3 d 6
4 e 3
2 0 b 3
1 a 4
2 c 5
3 d 6
4 e 3
3 0 e 3
1 a 4
2 c 5
3 d 6
4 0 c 3
1 a 4
2 b 5
3 d 6
4 f 3
And in one step:
rslt= df.var_1.str.extractall(r"([a-z])=(\d+)") \
.reset_index(level="match",drop=True) \
.pivot(columns=0).fillna(0)
1
0 a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
# Optionally flatten the MultiIndex columns: rslt.columns = rslt.columns.levels[1].values
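One caveat: str.extractall captures strings, so the pivoted values above are text mixed with the integer 0 from fillna. A minimal sketch of the same regex and reshaping with a numeric result (the astype placement is my addition, not from the answer):
```python
import pandas as pd

df = pd.DataFrame({'var_1': ['a=3:b=4:c=5:d=6:e=3',
                             'b=3:a=4:c=5:d=6:e=3',
                             'e=3:a=4:c=5:d=6',
                             'c=3:a=4:b=5:d=6:f=3']},
                  index=[1, 2, 3, 4])

rslt = (df.var_1.str.extractall(r"([a-z])=(\d+)")
          .reset_index(level="match", drop=True)
          .astype({1: int})        # convert the captured digits before pivoting
          .pivot(columns=0)
          .fillna(0)
          .astype(int))
rslt.columns = rslt.columns.levels[1].values  # flatten the MultiIndex columns
```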

add_suffix to column name based on position

I have a dataset where I want to add a suffix to column names based on their positions. For example, the 1st to 4th columns should be named 'abc_1', the 5th to 8th columns 'abc_2', and so on.
I have tried using dataframe.rename,
but it is a time-consuming process. What would be the most efficient way to achieve this?
I think a good choice here is to create a MultiIndex to avoid duplicated column names - create the first level by floor-dividing the column positions by 4 and add the prefix with f-strings:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))
df.columns = [[f'abc_{i+1}' for i in df.columns // 4], df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
A more general solution if the column names are not a RangeIndex:
cols = [f'abc_{i+1}' for i in np.arange(len(df.columns)) // 4]
df.columns = [cols, df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
It is also possible to specify the MultiIndex level names with MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([cols, df.columns], names=('level0','level1'))
print (df)
level0 abc_1 abc_2 abc_3
level1 0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
Then each first-level group can be selected with xs:
print (df.xs('abc_2', axis=1))
4 5 6 7
0 3 9 6 1
1 3 4 0 0
2 7 2 4 8
3 1 5 6 2
4 6 2 4 4
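If flat (single-level) column names are preferred and duplicates are unacceptable, one sketch appends the original position to the suffix. The 'abc_{group}_{position}' pattern is my own assumption about an acceptable naming scheme, not something the asker specified:
```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))

# pos // 4 picks the suffix number; keeping pos in the name guarantees uniqueness.
df.columns = [f'abc_{pos // 4 + 1}_{pos}' for pos in range(df.shape[1])]
```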

pandas stack second column below first and vice versa

I have a DataFrame with two columns and I would like to stack the second column below the first and the first below the second.
pd.DataFrame({'A':[1,2,3], 'B': [4,5,6]})
A B
0 1 4
1 2 5
2 3 6
Desired output:
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
So far I have tried:
pd.concat([df, df[['B','A']].rename(columns={'A':'B', 'B':'A'})])
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
Is this the cleanest way?
Concat is better if you ask me. But if you have 100 columns, renaming is a pain. As a generalized approach, here's one with numpy's fliplr and vstack, i.e.
import numpy as np
v = df.values
pd.DataFrame(np.vstack((v, np.fliplr(v))), columns=df.columns)
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
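Another generalized sketch that stays within pandas: reverse the column order with iloc and relabel the reversed block before concatenating (set_axis is just one way to reassign the labels):
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Reverse the columns, give the reversed block the original labels, then stack the two.
flipped = df.iloc[:, ::-1].set_axis(df.columns, axis=1)
out = pd.concat([df, flipped], ignore_index=True)
```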

Creating a new column in panda dataframe using logical indexing and group by

I have a data frame like below
df=pd.DataFrame({'a':['a','a','b','a','b','a','a','a'], 'b' : [1,0,0,1,0,1,1,1], 'c' : [1,2,3,4,5,6,7,8],'d':['1','2','1','2','1','2','1','2']})
df
Out[94]:
a b c d
0 a 1 1 1
1 a 0 2 2
2 b 0 3 1
3 a 1 4 2
4 b 0 5 1
5 a 1 6 2
6 a 1 7 1
7 a 1 8 2
I want something like below
df[(df['a']=='a') & (df['b']==1)]
In [97]:
df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].rank()
df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].rank()
Out[97]:
0 1
3 1
5 2
6 2
7 3
dtype: float64
I want this rank as a new column in dataframe df, and wherever there is no rank I want NaN. So the final output will be something like below:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
I will appreciate all the help and guidance. Thanks a lot.
Almost there, you just need to call transform to return a series with an index aligned to your original df:
In [459]:
df['rank'] = df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].transform(pd.Series.rank)
df
Out[459]:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
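A note on alignment: because assigning a Series to a DataFrame column aligns on the index, calling .rank() directly on the filtered groupby also works here. This is a sketch equivalent to the transform version above, not a different result:
```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'a', 'b', 'a', 'a', 'a'],
                   'b': [1, 0, 0, 1, 0, 1, 1, 1],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8],
                   'd': ['1', '2', '1', '2', '1', '2', '1', '2']})

mask = (df['a'] == 'a') & (df['b'] == 1)
# rank() returns a Series indexed by the filtered rows only;
# assignment aligns on the index and fills the remaining rows with NaN.
df['rank'] = df.loc[mask].groupby('d')['c'].rank()
```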

combination of two DF, pandas

I have two df,
First df
A B C
1 1 3
1 1 2
1 2 5
2 2 7
2 3 7
Second df
B D
1 5
2 6
3 4
The column B has the same meaning in both dfs. What is the easiest way to add column D with the corresponding values to the first df? The output should be:
A B C D
1 1 3 5
1 1 2 5
1 2 5 6
2 2 7 6
2 3 7 4
Perform a 'left' merge in your case on column 'B':
In [206]:
df.merge(df1, how='left', on='B')
Out[206]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
Another method would be to set 'B' on your second df as the index and then call map:
In [215]:
df1 = df1.set_index('B')
df['D'] = df['B'].map(df1['D'])
df
Out[215]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
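A third option, assuming 'B' is unique in the second df (as it is in the example): DataFrame.join against an indexed copy, which is essentially the map approach generalized to every remaining column:
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 3],
                   'C': [3, 2, 5, 7, 7]})
df1 = pd.DataFrame({'B': [1, 2, 3], 'D': [5, 6, 4]})

# join matches df['B'] against df1's index and brings in column D row by row.
out = df.join(df1.set_index('B'), on='B')
```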
