I have used pandas merge to bring together two dataframes (24 columns each), based on a set of conditions, to generate a dataframe containing the rows that share the same key values; naturally there are many other columns in each dataframe with different values. The code used to do this is:
Merged = pd.merge(Buy_MD, Sell_MD, on=['ID', 'LocName', 'Sub-Group', 'Month'], how='inner')
The result is a dataframe with 48 columns; I would now like to bring these together (possibly using melt). To visualise this:
Deal_x ID_x Location_x \ ... 21 other columns with _x suffix
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
Deal_y ID_y Location_y \ ... 21 other columns with _y suffix
0 155 9545 B
1 155 0345 C
2 155 0445 D
I want this to become:
Deal ID Location \
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
0 155 9545 B
1 155 0345 C
2 155 0445 D
How do I do this, please?
You can set custom suffixes in the merge, split the columns into a MultiIndex, and then stack:
Merged = pd.merge(Buy_MD, Sell_MD, on=['ID', 'LocName', 'Sub-Group', 'Month'], how='inner', suffixes=('_buy', '_sell'))
Merged.columns = pd.MultiIndex.from_tuples(Merged.columns.str.rsplit('_', n=1).map(tuple), names=('key', 'transaction'))
Merged = Merged.stack(level='transaction')
transaction Deal ID Location
0 buy 130 5845 A
0 sell 155 9545 B
1 buy 155 5845 B
1 sell 155 345 C
2 buy 138 6245 C
2 sell 155 445 D
If you want to get rid of the MultiIndex you can do:
Merged.index = Merged.index.droplevel('transaction')
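For a self-contained illustration, here is a minimal runnable sketch of the same idea on toy frames (buy and sell below are hypothetical stand-ins for Buy_MD and Sell_MD). Note that the join keys receive no suffix, so they are moved into the index before building the MultiIndex:
import pandas as pd

buy = pd.DataFrame({'ID': [1, 2], 'Month': ['Jan', 'Feb'], 'Deal': [130, 155]})
sell = pd.DataFrame({'ID': [1, 2], 'Month': ['Jan', 'Feb'], 'Deal': [155, 138]})

merged = pd.merge(buy, sell, on=['ID', 'Month'], how='inner', suffixes=('_buy', '_sell'))
# the key columns carry no suffix, so park them in the index first
merged = merged.set_index(['ID', 'Month'])
merged.columns = pd.MultiIndex.from_tuples(
    merged.columns.str.rsplit('_', n=1).map(tuple),
    names=('key', 'transaction'))
print(merged.stack(level='transaction'))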
First, get rid of the suffixes using df.columns.str.split and taking the first split value from each sub-list in the result.
df_list = [df1, df2, ...] # a generic solution for 2 or more frames
for i, df in enumerate(df_list):
    df_list[i].columns = df.columns.str.split('_').str[0]
Now, concatenate the result -
df = pd.concat(df_list, ignore_index=True)
df
Deal ID Location
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
4 155 9545 B
5 155 345 C
6 155 445 D
Also, if you're interested, use str.zfill on ID to get your expected output -
v = df.ID.astype(str)
v.str.zfill(v.str.len().max())
0 5845
1 5845
2 6245
3 7345
4 9545
5 0345
6 0445
Name: ID, dtype: object
Assign the result back.
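For example, the write-back is a one-liner:
df['ID'] = v.str.zfill(v.str.len().max())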
I have a wide dataframe I want to be able to reshape.
I have some columns that I want to preserve. I have been exploring melt and wide_to_long, but I'm not sure that's what I need.
Imagine I have some columns named: 'id', 'classroom', 'city'
And other columns called: 'alumn_x_subject_y_mark', 'alumn_x_subject_y_name', 'alumn_x_subject_y_teacher'
Here x and y take all values in the product of range(20) and range(10).
I would like to end up with a df that has the columns: id, classroom, city, alumn, subject, mark, name, teacher
With all the original 20*10 columns converted to rows.
An empty dataframe with that structure can be generated this way:
import pandas as pd
import itertools
vals = list(itertools.product(*[range(20), range(10)]))
pd.DataFrame(columns=['id', 'classroom', 'city']+ \
['alumn_{0}_subject_{1}_mark'.format(x, y) for x, y in vals] + \
['alumn_{0}_subject_{1}_name'.format(x, y) for x, y in vals] + \
['alumn_{0}_subject_{1}_teacher'.format(x, y) for x, y in vals]
, dtype=object)
I'm not building this dataframe but receiving it from a file, that's why it has so many columns and I cannot change that.
If you had only 2 parameters to extract, wide_to_long would work.
Here you have 3, thus you can perform a manual reshaping with a MultiIndex:
regex = r'alumn_(\d+)_subject_(\d+)_(.*)'
out = (df
.set_index(['id', 'classroom', 'city'])
.pipe(lambda d: d.set_axis(pd.MultiIndex
.from_frame(d.columns.str.extract(regex),
names=['alumn', 'subject', None]
),
axis=1))
.stack(['alumn', 'subject'])
.reset_index()
)
output:
Empty DataFrame
Columns: [id, classroom, city, alumn, subject, mark, name, teacher]
Index: []
output with a single row (after df.loc[0] = range(df.shape[1])):
id classroom city alumn subject mark name teacher
0 0 1 2 0 0 3 203 403
1 0 1 2 0 1 4 204 404
2 0 1 2 0 2 5 205 405
3 0 1 2 0 3 6 206 406
4 0 1 2 0 4 7 207 407
.. .. ... ... ... ... ... ... ...
195 0 1 2 9 5 98 298 498
196 0 1 2 9 6 99 299 499
197 0 1 2 9 7 100 300 500
198 0 1 2 9 8 101 301 501
199 0 1 2 9 9 102 302 502
[200 rows x 8 columns]
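If you'd rather stay with wide_to_long, one workaround (a sketch, not a drop-in: it assumes you first rewrite the column names so the two numeric parts form a single suffix) looks like this:
import re
import pandas as pd

tmp = df.copy()
# 'alumn_3_subject_7_mark' -> 'mark_3_7', etc.; id/classroom/city are untouched
tmp.columns = [re.sub(r'alumn_(\d+)_subject_(\d+)_(\w+)', r'\3_\1_\2', c)
               for c in tmp.columns]
out = pd.wide_to_long(tmp.reset_index(), stubnames=['mark', 'name', 'teacher'],
                      i='index', j='alumn_subject', sep='_', suffix=r'\d+_\d+'
                      ).reset_index()
# split the combined suffix back into the two id columns
out[['alumn', 'subject']] = out['alumn_subject'].str.split('_', expand=True)
out = out.drop(columns=['index', 'alumn_subject'])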
I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,np.nan,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns ["A","B","C"].
Please note that the numbers in df2 are chosen like this for simplicity; in practice they can be any values.
How do I perform such a merge? Thanks!
IIUC, you need a double merge/join.
First, melt df1 to get a single column, while keeping the index. Then merge to get the matches. Finally join to the original DataFrame.
s = (df1
.reset_index().melt(id_vars='index')
.merge(df2, left_on='value', right_on='ID')
.set_index('index')['Value']
)
# index
# 0 422
# 1 278
# 0 72
# 2 198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
Alternative with stack + map:
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))
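The stack/map version works because stack() flattens df1 to one value per row-column pair, droplevel(1) discards the column-name level so only the original row index remains, and map performs the ID-to-Value lookup; dropna then keeps only the values that found a match.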
I have this df1:
CHR SNP Pos Ref Min
1 rs3094315 113934 A G
1 rs12124819 126070 A G
1 rs28765502 135853 C T
1 rs9419478 158202 C T
1 rs4881551 159076 G A
and this df2:
CHR SNP A1 A2 MAF NCHROBS
1 rs3094315 G A 0.1402 214
1 rs12124819 G A 0.1887 212
1 rs28765502 C T 0.3113 212
1 rs7419119 G T 0.2243 214
1 rs950122 C G 0.1944 216
The first three rows have the same SNP names, so what I want to do is merge df2 with df1 on "SNP"; where they match, check whether "A1" in df2 and "Ref" in df1 are the same, and if they are not, swap the allele letters and replace the MAF value with 1 - MAF, ending up with this:
CHR SNP A1 A2 MAF NCHROBS
1 rs3094315 A G 0.8598 214
1 rs12124819 A G 0.8113 212
1 rs28765502 C T 0.3113 212
1 rs7419119 G T 0.2243 214
1 rs950122 C G 0.1944 216
What I tried is
import pandas as pd
import numpy as np
df3 = pd.merge(df2, df1, on=['SNP'])
# intended: put 1-MAF in a 'subtract' column where the alleles differ
df3['subtract'] = np.where(df3['A1'] != df3['Ref'], 1 - df3['MAF'], df3['MAF'])
But I don't want to lose the rows that don't match in the merge, and I still need the 1 - MAF result for the matching ones.
You can first right-merge on "SNP", then use np.where to evaluate the condition. Then fill the NaN values with the corresponding values. Finally, drop the columns with missing values and rearrange to fit the desired outcome:
merged_df = df1.merge(df2, on='SNP', how='right', suffixes=('_',''))
merged_df['MAF'] = np.where(merged_df['Ref'].eq(merged_df['A2']), 1-merged_df['MAF'], merged_df['MAF'])
merged_df['Ref'] = merged_df['Ref'].fillna(merged_df['A1'])
merged_df['Min'] = merged_df['Min'].fillna(merged_df['A2'])
merged_df = (merged_df.dropna(axis=1)
                      .drop(columns=['A2', 'A1'])
                      .rename(columns={'Ref': 'A1', 'Min': 'A2'})
                      [['CHR', 'SNP', 'A1', 'A2', 'MAF', 'NCHROBS']])
Output:
CHR SNP A1 A2 MAF NCHROBS
0 1 rs3094315 A G 0.8598 214
1 1 rs12124819 A G 0.8113 212
2 1 rs28765502 C T 0.3113 212
3 1 rs7419119 G T 0.2243 214
4 1 rs950122 C G 0.1944 216
I have two dataframes, one with data, one with a list of forecasting assumptions. The column names correspond, but the index levels do not (by design). Please show me how to multiply columns A, B, and C in df1 by the relevant columns in df2, as in my example below, and with the remainder of the original dataframe (aka column D) intact. Thanks!
df1list=[1,2,3,4,5,6,7,8,9,10]
df2list=[2017]
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'), index=list(df1list))
df2 = pd.DataFrame(np.random.randint(1,4,size=(1, 3)), columns=list('ABC'), index=list(df2list))
>>> df[['A','B','C']] * df2.values
A B C
1 81 168 116
2 21 8 6
3 147 108 52
4 54 64 114
5 48 16 20
6 72 116 12
7 36 188 178
8 90 96 162
9 63 166 156
10 120 22 10
So to overwrite you can do:
df.loc[:,['A','B','C']] = df[['A','B','C']] * df2.values
And I guess to be more programmatic you can do:
df[df2.columns] *= df2.values
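Note that multiplying by df2.values (rather than df2 itself) is what makes this work: the two frames' indices (1-10 versus 2017) don't align, so multiplying the DataFrames directly would align on the index and produce NaN everywhere, while the raw (1, 3) array simply broadcasts across the rows.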
I have some data imported from a CSV; to create something similar I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
You'll notice that each column actually contains two variables (group number and level, Low vs High). (It was set up this way for running repeated measures ANOVA in SPSS.)
I want to split the columns up, so I can also groupby "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
low high
split sex group
0 0 95 265
0 0 1 123 54
1 0 120 220
1 1 98 111
1 0 0 150 190
0 1 211 300
1 0 139 86
1 1 132 250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]:
group_level mean
split sex
0 0 group0Low 2
0 group0High 3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]:
group_level mean group level
split sex
0 0 group0Low 2 group0 Low
0 group0High 3 group0 High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
mean
group level
group0 High 12
Low 8
group1 High 20
Low 16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
mean
split sex group level
0 0 group0 High 3
Low 2
group1 High 5
Low 4
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
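For instance, a regex split (a sketch, assuming the names always look like group<N> followed by Low or High) avoids hard-coding position 6:
stacked[['group', 'level']] = stacked['group_level'].str.extract(r'^(group\d+)(Low|High)$')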
This can be done by first constructing a MultiIndex on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
group0Low group0High group1Low group1High
split sex
0 0 222 97 167 242
1 117 245 153 59
1 0 261 71 292 86
1 137 120 266 138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val high low
split sex group
0 0 0 97 222
1 242 167
1 0 245 117
1 59 153
1 0 0 71 261
1 86 292
1 0 120 137
1 138 266