I have this df1:
CHR SNP Pos Ref Min
1 rs3094315 113934 A G
1 rs12124819 126070 A G
1 rs28765502 135853 C T
1 rs9419478 158202 C T
1 rs4881551 159076 G A
and this df2:
CHR SNP A1 A2 MAF NCHROBS
1 rs3094315 G A 0.1402 214
1 rs12124819 G A 0.1887 212
1 rs28765502 C T 0.3113 212
1 rs7419119 G T 0.2243 214
1 rs950122 C G 0.1944 216
The first three rows have the same SNP names. What I want to do is merge df2 with df1 on "SNP"; where they match, check whether "A1" in df2 and "Ref" in df1 are the same, and if not, swap the two allele letters and replace MAF with 1 - MAF, so the result looks like this:
CHR SNP A1 A2 MAF NCHROBS
1 rs3094315 A G 0.8598 214
1 rs12124819 A G 0.8113 212
1 rs28765502 C T 0.3113 212
1 rs7419119 G T 0.2243 214
1 rs950122 C G 0.1944 216
What I tried is
import pandas as pd
import numpy as np
df3=pd.merge(df2,df1, on=['SNP'])
df3['subtract'] = np.where(df3['A1'] != df3['Ref'], ...)  # here I want to put 1 - MAF into a 'subtract' column, but I don't know how
But I don't want to lose the rows that don't match in the merge, and I still want the 1 - MAF result where the alleles differ.
You can first right-merge on "SNP", then use np.where to evaluate the condition, then fill the NaN values in Ref/Min with the corresponding A1/A2 values. Finally, drop the columns left with missing values and rearrange to fit the desired outcome:
import numpy as np

# right merge keeps every row of df2; df1-only columns are NaN where there is no match
merged_df = df1.merge(df2, on='SNP', how='right', suffixes=('_', ''))
# where Ref equals A2, the alleles are flipped relative to df1, so take 1 - MAF
merged_df['MAF'] = np.where(merged_df['Ref'].eq(merged_df['A2']), 1 - merged_df['MAF'], merged_df['MAF'])
# for SNPs absent from df1, fall back to the original A1/A2
merged_df['Ref'] = merged_df['Ref'].fillna(merged_df['A1'])
merged_df['Min'] = merged_df['Min'].fillna(merged_df['A2'])
# drop the leftover df1 columns, then rename Ref/Min into the A1/A2 slots
merged_df = merged_df.dropna(axis=1).drop(columns=['A2','A1']).rename(columns={'Ref':'A1', 'Min':'A2'})[['CHR','SNP','A1','A2','MAF','NCHROBS']]
Output:
CHR SNP A1 A2 MAF NCHROBS
0 1 rs3094315 A G 0.8598 214
1 1 rs12124819 A G 0.8113 212
2 1 rs28765502 C T 0.3113 212
3 1 rs7419119 G T 0.2243 214
4 1 rs950122 C G 0.1944 216
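For a self-contained run, the two inputs can be rebuilt from the tables in the question (a minimal sketch, with the values copied from above):
import pandas as pd

df1 = pd.DataFrame({'CHR': [1]*5,
                    'SNP': ['rs3094315', 'rs12124819', 'rs28765502', 'rs9419478', 'rs4881551'],
                    'Pos': [113934, 126070, 135853, 158202, 159076],
                    'Ref': ['A', 'A', 'C', 'C', 'G'],
                    'Min': ['G', 'G', 'T', 'T', 'A']})
df2 = pd.DataFrame({'CHR': [1]*5,
                    'SNP': ['rs3094315', 'rs12124819', 'rs28765502', 'rs7419119', 'rs950122'],
                    'A1': ['G', 'G', 'C', 'G', 'C'],
                    'A2': ['A', 'A', 'T', 'T', 'G'],
                    'MAF': [0.1402, 0.1887, 0.3113, 0.2243, 0.1944],
                    'NCHROBS': [214, 212, 212, 214, 216]})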
I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,np.nan,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns ["A","B","C"].
Note that the numbers in df2 were chosen like this for simplicity; in practice they can be arbitrary.
How do I perform such a merge? Thanks!
IIUC, you need a double merge/join.
First, melt df1 to get a single column, while keeping the index. Then merge to get the matches. Finally join to the original DataFrame.
s = (df1
     .reset_index().melt(id_vars='index')
     .merge(df2, left_on='value', right_on='ID')
     .set_index('index')['Value']
     )
# index
# 0 422
# 1 278
# 0 72
# 2 198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
Alternative with stack + map:
# look up each df1 value in df2's ID -> Value mapping, dropping values with no match
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))
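For reference, the intermediate Series from the stack + map route looks like this on the sample data (worked out from the frames above; note the float dtype, a side effect of the NaNs that dropna removes):
# s.rename('Value')
# 0    422.0
# 0     72.0
# 1    278.0
# 2    198.0
# Name: Value, dtype: float64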
I have used pandas merge to bring together two dataframes (24 columns each), based on a set of conditions, to generate a dataframe containing the rows that share the same key values; naturally there are many other columns in each dataframe with different values. The code used to do this is:
Merged=pd.merge(Buy_MD,Sell_MD, on= ['ID','LocName','Sub-Group','Month'], how = 'inner' )
The result is a dataframe with 48 columns. I would now like to stack these back on top of each other (possibly using melt). To visualise this:
Deal_x ID_x Location_x \... 21 other columns with _x postfix
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
Deal_y ID_y Location_y \ ... 21 other columns with _y postfix
0 155 9545 B
1 155 0345 C
2 155 0445 D
I want this to become:
Deal ID Location \
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
0 155 9545 B
1 155 0345 C
2 155 0445 D
How do I do this?
You can do something with the suffixes: split the columns into a MultiIndex, and then stack:
Merged = pd.merge(Buy_MD, Sell_MD, on=['ID','LocName','Sub-Group','Month'], how='inner', suffixes=('_buy', '_sell'))
Merged.columns = pd.MultiIndex.from_tuples(Merged.columns.str.rsplit('_').map(tuple), names=('key', 'transaction'))
Merged = Merged.stack(level='transaction')
transaction Deal ID Location
0 buy 130 5845 A
0 sell 155 9545 B
1 buy 155 5845 B
1 sell 155 345 C
2 buy 138 6245 C
2 sell 155 445 D
If you want to get rid of the MultiIndex you can do:
Merged.index = Merged.index.droplevel('transaction')
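One caveat with the rsplit above: with no limit it splits on every underscore, so a column name that itself contains an underscore would produce a tuple with more than two parts, and the unsuffixed join keys produce one-element tuples. A sketch that avoids both, assuming every non-key column carries a suffix, by setting the keys aside as the index and splitting only on the last underscore:
Merged = Merged.set_index(['ID', 'LocName', 'Sub-Group', 'Month'])
Merged.columns = pd.MultiIndex.from_tuples(Merged.columns.str.rsplit('_', n=1).map(tuple), names=('key', 'transaction'))
Merged = Merged.stack(level='transaction')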
First, get rid of the suffixes using df.columns.str.split and taking the first split value from each sub-list in the result.
df_list = [df1, df2, ...]  # a generic solution for 2 or more frames
for i, df in enumerate(df_list):
    df_list[i].columns = df.columns.str.split('_').str[0]
Now, concatenate the result -
df = pd.concat(df_list, ignore_index=True)
df
Deal ID Location
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
4 155 9545 B
5 155 345 C
6 155 445 D
Also, if you're interested, use str.zfill on ID to get your expected output -
v = df.ID.astype(str)
v.str.zfill(v.str.len().max())
0 5845
1 5845
2 6245
3 7345
4 9545
5 0345
6 0445
Name: ID, dtype: object
Assign the result back.
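For example:
df['ID'] = v.str.zfill(v.str.len().max())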
I searched a lot for an answer; the closest question was Compare 2 columns of 2 different pandas dataframes, if the same insert 1 into the other in Python, but the answer to that person's particular problem was a simple merge, which doesn't answer the question in a general way.
I have two large dataframes, df1 (typically about 10 million rows), and df2 (about 130 million rows). I need to update values in three columns of df1 with values from three columns of df2, based on two df1 columns matching two df2 columns. It is imperative that the order of df1 remains unchanged, and that only rows with matching values get updated.
This is what the dataframes look like:
df1
chr snp x pos a1 a2
1 1-10020 0 10020 G A
1 1-10056 0 10056 C G
1 1-10108 0 10108 C G
1 1-10109 0 10109 C G
1 1-10139 0 10139 C T
Note that it's not always the case that the value of "snp" is chr-pos; it can take many other values with no link to any of the columns (like rs1234, indel-6032, etc.)
df2
ID CHR STOP OCHR OSTOP
rs376643643 1 10040 1 10020
rs373328635 1 10066 1 10056
rs62651026 1 10208 1 10108
rs376007522 1 10209 1 10109
rs368469931 3 30247 1 10139
I need to update ['snp', 'chr', 'pos'] in df1 with df2[['ID', 'OCHR', 'OSTOP']] only when df1[['chr', 'pos']] matches df2[['OCHR', 'OSTOP']]
so in this case, after update, df1 would look like:
chr snp x pos a1 a2
1 rs376643643 0 10040 G A
1 rs373328635 0 10066 C G
1 rs62651026 0 10208 C G
1 rs376007522 0 10209 C G
3 rs368469931 0 30247 C T
I have used merge as a workaround:
df1 = pd.merge(df1, df2, how='left', left_on=["chr", "pos"], right_on=["OCHR", "OSTOP"],
               left_index=False, right_index=False, sort=False)
and then
df1.loc[~df1.OCHR.isnull(), ["snp", "chr", "pos"]] = df1.loc[~df1.OCHR.isnull(), ["ID", "CHR", "STOP"]].values
and then remove the extra columns.
Yes, it works, but what would be a way to do this directly by comparing the values of both dataframes? I just don't know how to formulate it, and I couldn't find an answer anywhere; I guess it would be useful to get a general answer on this.
I tried that but it doesn't work:
df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["snp", "chr", "pos"]] = df2.loc[df2[['OCHR', 'OSTOP']] == df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["chr", "pos"]],['ID', 'CHR', 'STOP']].values
Thanks,
Stephane
You can use the update function (requires setting the matching criteria to index). I've modified your sample data to allow some mismatch.
# your data
# =====================
# df1 pos is modified from 10020 to 10010
print(df1)
chr snp x pos a1 a2
0 1 1-10020 0 10010 G A
1 1 1-10056 0 10056 C G
2 1 1-10108 0 10108 C G
3 1 1-10109 0 10109 C G
4 1 1-10139 0 10139 C T
print(df2)
ID CHR STOP OCHR OSTOP
0 rs376643643 1 10040 1 10020
1 rs373328635 1 10066 1 10056
2 rs62651026 1 10208 1 10108
3 rs376007522 1 10209 1 10109
4 rs368469931 3 30247 1 10139
# processing
# ==========================
# set matching columns to multi-level index
x1 = df1.set_index(['chr', 'pos'])['snp']
x2 = df2.set_index(['OCHR', 'OSTOP'])['ID']
# call update function, this is inplace
x1.update(x2)
# replace the values in original df1
df1['snp'] = x1.values
print(df1)
chr snp x pos a1 a2
0 1 1-10020 0 10010 G A
1 1 rs373328635 0 10056 C G
2 1 rs62651026 0 10108 C G
3 1 rs376007522 0 10109 C G
4 1 rs368469931 0 10139 C T
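The snippet above only rewrites snp, while the question also asks to update chr and pos. One way to extend the same trick is to stage the key values as ordinary columns before updating; a sketch, assuming the (chr, pos) pairs are unique (new_chr and new_pos are names introduced here, and note that update may upcast integer columns to float, so you may want to cast back afterwards):
x1 = df1.set_index(['chr', 'pos'])
# seed the replacement columns with the current key values
x1['new_chr'] = x1.index.get_level_values('chr')
x1['new_pos'] = x1.index.get_level_values('pos')
x2 = df2.set_index(['OCHR', 'OSTOP']).rename(columns={'ID': 'snp', 'CHR': 'new_chr', 'STOP': 'new_pos'})
x1.update(x2)  # in place, aligned on the (chr, pos) index
# row order is preserved by set_index, so copy back positionally
df1[['snp', 'chr', 'pos']] = x1[['snp', 'new_chr', 'new_pos']].values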
Start by renaming the columns you want to merge on in df2:
df2.rename(columns={'OCHR':'chr','OSTOP':'pos'},inplace=True)
Now merge on these columns
df_merged = pd.merge(df1, df2, how='inner', on=['chr', 'pos']) # you might have to preserve the df1 index at this stage, not sure
Next, build the frame that will do the updating:
updater = df_merged[['ID','CHR','STOP']] # this will be your update frame
updater.rename(columns={'ID':'snp','CHR':'chr','STOP':'pos'}, inplace=True) # rename columns to match the originals
Finally, update (see the documentation for DataFrame.update):
df1.update(updater) # updates in place
# chr snp x pos a1 a2
#0 1 rs376643643 0 10040 G A
#1 1 rs373328635 0 10066 C G
#2 1 rs62651026 0 10208 C G
#3 1 rs376007522 0 10209 C G
#4 3 rs368469931 0 30247 C T
update works by matching index/columns, so you might have to carry df1's index along for the entire process, then do updater.reindex(... before df1.update(updater)
I have some data imported from a CSV; to create something similar I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
You'll notice that each column actually contains two variables (the group number and the low/high level). (It was set up this way for running a repeated-measures ANOVA in SPSS.)
I want to split the columns up so I can also group by "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
low high
split sex group
0 0 95 265
0 0 1 123 54
1 0 120 220
1 1 98 111
1 0 0 150 190
0 1 211 300
1 0 139 86
1 1 132 250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]:
group_level mean
split sex
0 0 group0Low 2
0 group0High 3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]:
group_level mean group level
split sex
0 0 group0Low 2 group0 Low
0 group0High 3 group0 High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
mean
group level
group0 High 12
Low 8
group1 High 20
Low 16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
mean
split sex group level
0 0 group0 High 3
Low 2
group1 High 5
Low 4
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
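For instance, a regex-based split; a sketch (parts is a name introduced here) that assumes the names follow the group<digit><Low|High> pattern:
# extract the two parts with a regex instead of fixed string positions
parts = stacked.group_level.str.extract(r'(group\d+)(Low|High)')
stacked['group'] = parts[0]
stacked['level'] = parts[1]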
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
group0Low group0High group1Low group1High
split sex
0 0 222 97 167 242
1 117 245 153 59
1 0 261 71 292 86
1 137 120 266 138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val high low
split sex group
0 0 0 97 222
1 242 167
1 0 245 117
1 59 153
1 0 0 71 261
1 86 292
1 0 120 137
1 138 266
I want to extract a specific number of groups after grouping by a column, for example the first 2 or 3 groups.
I have a data frame:
id gender value
1 f 1123
1 f 10
2 m 123
2 m 154
2 m 165
3 m 654
3 m 987
4 f 7654
4 f 7654
4 f 7654
... ... ....
I want something like this
id gender value
2 m 123
2 m 154
3 m 654
3 m 987
... .. ...
My code is:
dtFrame2 = dtFrame.groupby('id').head(2)
dtFrameMale = dtFrame2.loc[dtFrame2.gender == 'm']  # the sample data uses 'm'
maleGroups = dtFrameMale.groupby('id')
temp = maleGroups.filter(lambda x: len(x) == 2)
The last statement gives me all the groups with two rows, but after that I want to extract only the first two, three, or n such groups.
Something like this
In [60]: s = df[df['gender'] == 'm'].groupby('id').size()
In [61]: s.name = 'size'
In [62]: df2 = df.join(s, on='id')
In [63]: df2[df2['size'] == 2]
Out[63]:
id gender value size
5 3 m 654 2
6 3 m 987 2
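To then keep only the first n qualifying groups, one option is to take the first n distinct ids in order of appearance; a sketch reusing the df2 with its size column from above (n, qualifying, first_ids, and result are names introduced here):
n = 2
qualifying = df2[df2['size'] == 2]         # groups with exactly two male rows
first_ids = qualifying['id'].unique()[:n]  # first n such ids, in order of appearance
result = qualifying[qualifying['id'].isin(first_ids)]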