import pandas as pd

df1 = pd.DataFrame({'Chr':['1', '1', '2', '2', '3','3','4'],
'position':[50, 500, 1030, 2005 , 3575,50, 250]})
df2 = pd.DataFrame({'Chr':['1', '1', '1', '1',
'1','2','2','2','2','2','3','3','3','3','3'],
'start':
[0,100,1000,2000,3000,0,100,1000,2000,3000,0,100,1000,2000,3000],
'end':
[100,1000,2000,3000,4000,100,1000,2000,3000,4000,100,1000,2000,3000,4000],
'logr':[3, 4, 5, 6, 7,8,9,10,11,12,13,15,16,17,18],
'seg':[0.2,0.5,0.2,0.1,0.5,0.5,0.2,0.2,0.1,0.2,0.1,0.5,0.5,0.9,0.3]})
I want to match each row of df1, using 'Chr' and 'position', to the row of df2 with the same 'Chr' whose interval contains the position (i.e. 'position' falls between 'start' and 'end'), and then add the 'logr' and 'seg' columns to df1.
my desired output is :
df3= pd.DataFrame({'Chr':['1', '1', '2', '2', '3','3','4'],
'position':[50, 500, 1030, 2005 , 3575,50, 250],
'logr':[3, 4, 10,11, 18,13, "NA"],
'seg':[0.2,0.5,0.2,0.1,0.3,0.1,"NA"]})
Thank you in advance.
Use DataFrame.merge with an outer join to get all combinations, then filter with Series.between and boolean indexing (DataFrame.pop extracts and removes the helper columns in one step), and finish with a left join to bring back the rows that have no match:
df3 = df1.merge(df2, on='Chr', how='outer')
#between is inclusive (>=, <=) by default; pass inclusive=False (or inclusive='neither' in newer pandas) for strict bounds (>, <)
df3 = df3[df3['position'].between(df3.pop('start'), df3.pop('end'))]
#if one bound should be inclusive and the other exclusive (e.g. >, <=)
#df3 = df3[(df3['position'] > df3.pop('start')) & (df3['position'] <= df3.pop('end'))]
df3 = df1.merge(df3, how='left')
print (df3)
Chr position logr seg
0 1 50 3.0 0.2
1 1 500 4.0 0.5
2 2 1030 10.0 0.2
3 2 2005 11.0 0.1
4 3 3575 18.0 0.3
5 3 50 13.0 0.1
6 4 250 NaN NaN
Another solution:
df3 = df1.merge(df2, on='Chr', how='outer')
s = df3.pop('start')
e = df3.pop('end')
df3 = df3[df3['position'].between(s, e) | s.isna() | e.isna()]
#if the bounds should be open/closed differently (e.g. >, <=)
#df3 = df3[(df3['position'] > s) & (df3['position'] <= e) | s.isna() | e.isna()]
print (df3)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Try using pd.merge() and np.where():
import pandas as pd
import numpy as np
res_df = pd.merge(df1,df2,on=['Chr'],how='outer')
res_df['check_between'] = np.where((res_df['position']>=res_df['start'])&(res_df['position']<=res_df['end']),True,False)
df3 = res_df[(res_df['check_between']==True) |
(res_df['start'].isnull())|
(res_df['end'].isnull()) ]
df3.drop(['check_between','start','end'],axis=1,inplace=True)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Do a left merge with indicator=True, then use query to keep rows where position lies between start and end or where _merge is 'left_only', and finally drop the helper columns:
df1.merge(df2, 'left', indicator=True).query('(start<=position<=end) | _merge.eq("left_only")') \
    .drop(['start', 'end', '_merge'], axis=1)
Out[364]:
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
I have 2 dataframes:
dfA = pd.DataFrame({'label':[1,5,2,4,2,3],
'group':['A']*3 + ['B']*3,
'x':[np.nan]*3 + [1,2,3],
'y':[np.nan]*3 + [1,2,3]})
dfB = pd.DataFrame({'uniqid':[1,2,3,4,5,6,7],
'horizontal':[34,41,23,34,23,43,22],
'vertical':[98,67,19,57,68,88,77]})
...which look like:
label group x y
0 1 A NaN NaN
1 5 A NaN NaN
2 2 A NaN NaN
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
uniqid horizontal vertical
0 1 34 98
1 2 41 67
2 3 23 19
3 4 34 57
4 5 23 68
5 6 43 88
6 7 22 77
Basically, dfB contains 'horizontal' and 'vertical' values for all unique IDs. I want to populate the 'x' and 'y' columns in dfA with the 'horizontal' and 'vertical' values in dfB but only for group A; data for group B should remain unchanged.
The desired output would be:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
I've used .merge() to add the extra columns for both groups A and B, then copied the data into the x and y columns for group A only, and finally dropped the columns that came from dfB.
dfA = dfA.merge(dfB, how = 'left', left_on = 'label', right_on = 'uniqid')
dfA.loc[dfA['group'] == 'A','x'] = dfA.loc[dfA['group'] == 'A','horizontal']
dfA.loc[dfA['group'] == 'A','y'] = dfA.loc[dfA['group'] == 'A','vertical']
dfA = dfA[['label','group','x','y']]
The correct output is produced:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
...but this is a really, really ugly solution. Is there a better solution?
combine_first
dfA.set_index(['label', 'group']).combine_first(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
fillna
Works as well
dfA.set_index(['label', 'group']).fillna(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
We can use loc to extract/update only the part we want. Since you are merging on a single column whose values are unique in dfB, you can use set_index together with loc/reindex:
mask = dfA['group']=='A'
dfA.loc[ mask, ['x','y']] = (dfB.set_index('uniqid')
.loc[dfA.loc[mask,'label'],
['horizontal','vertical']]
.values
)
Output:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
Note that the above would fail if some values of dfA.label are not in dfB.uniqid; in that case, we need to use reindex:
dfA.loc[mask, ['x','y']] = (dfB.set_index('uniqid')
                               .reindex(dfA.loc[mask, 'label'])
                               [['horizontal', 'vertical']]
                               .values
                            )
I want to use pd.cut() with two different bin sizes for two specific parts of a dataframe. I believe the easiest way to do that is to split my dataframe in two so I can apply pd.cut() to the two independent dataframes with two independent sets of bins.
I understand I could use df.head(), but the dataframe keeps changing and the parts don't always have the same size. For example, with the following dataframe:
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
5 0.4 0.1773
6 0.43 0.91773
7 0.5 0.891773
I want to have two dataframes for value of A higher or equal than 0.4 and lower than 0.4.
So I would have df2:
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
and df3:
A B
1 0.4 0.1773
2 0.43 0.91773
3 0.5 0.891773
Again, df.head(4) or df.tail(3) won't work.
df2 = df[df["A"] < 0.4]
df3 = df[df["A"] >= 0.4]
This should work:
import pandas as pd
data = {'A': [0.1,0.2,0.1,0.2,5,6,7,8], 'B': [5,0.2,4,8,11,9,10,14]}
df = pd.DataFrame(data)
df2 = df[df.A >= 0.4]
print(df2)
# A B
#4 5.0 11.0
#5 6.0 9.0
#6 7.0 10.0
#7 8.0 14.0
df3 = df[df.A < 0.4]
print(df3)
# A B
#0 0.1 5.0
#1 0.2 0.2
#2 0.1 4.0
#3 0.2 8.0
I added some fictitious data as an example:
data = {'A': [1,2,3,4,5,6,7,8], 'B': [5,8,9,10,11,12,13,14]}
df = pd.DataFrame(data)
df1 = df[df.A > 4]
df2 = df[df.A <13]
print(df1)
print(df2)
Output
>>> print(df1)
A B
4 5 11
5 6 12
6 7 13
7 8 14
>>> print(df2)
A B
0 1 5
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
6 7 13
7 8 14
I have the following table:
ID Metric Level Level(% Change) Level(Diff)
Index
0 2016 A 10 NaN NaN
1 2017 A 15 0.5 5
2 2018 A 20 0.3 5
3 2016 B 40 NaN NaN
4 2017 B 45 0.2 5
5 2018 B 50 0.1 5
I'd like to get the following:
A_Level B_Level A_Level(% Change) B_Level(% Change) A_Level(Diff) B_Level(Diff)
Index
2016 10 40 NaN NaN NaN NaN
2017 15 45 0.5 0.2 5 5
2018 20 50 0.3 0.1 5 5
I tried:
df = pd.pivot_table(df, index = 'ID', values = ['Level','Level(% Change)','Level(Diff)'], columns = ['Metric'])
df.columns = df.columns.map('_'.join)
However I only get the following table:
Level_A Level_B Level_A Level_B Level_A Level_B
Index
2016 10 40 NaN NaN NaN NaN
2017 15 45 0.5 0.2 5 5
2018 20 50 0.3 0.1 5 5
Basically, the data in the pivot is correct, but the first-level column labels are wrong: only 'Level' appears, while 'Level(% Change)' and 'Level(Diff)' are missing. I would also like 'A_Level' instead of 'Level_A'.
Thank you in advance
Use a list comprehension with f-strings, swapping a and b:
df = pd.pivot_table(df,
index = 'ID',
values = ['Level','Level(% Change)','Level(Diff)'],
columns = ['Metric'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
Or add DataFrame.swaplevel:
df.columns = df.swaplevel(0,1, axis=1).columns.map('_'.join)
print (df)
A_Level B_Level A_Level(% Change) B_Level(% Change) A_Level(Diff) \
ID
2016 10 40 NaN NaN NaN
2017 15 45 0.5 0.2 5.0
2018 20 50 0.3 0.1 5.0
B_Level(Diff)
ID
2016 NaN
2017 5.0
2018 5.0
Alternatively, you could melt the data, build the combined header by concatenating Metric with the variable name, and pivot back:
(df.melt(id_vars=['ID','Metric'])
.assign(header = lambda x:x.Metric + '_' + x.variable)
.pivot_table(index='ID', columns='header', values='value'))
I have a dataframe called df1:
ID Value Name Score
-1 10 A -1
-1 5 B -1
NaN 0.2 Track C 100
NaN 0.5 Track C 200
1 0 D 100
5 0 D 200
I want to fill the NaN in column ID with multiple rows of Score data from dataframe df2.
df2:
Score ID
100 1
100 2
100 3
100 4
200 5
200 6
200 7
So that ultimately, my final dataframe looks like this:
df3:
ID Value Name Score
-1 10 A -1
-1 5 B -1
1 0.2 Track C 100
2 0.2 Track C 100
3 0.2 Track C 100
4 0.2 Track C 100
5 0.5 Track C 200
6 0.5 Track C 200
7 0.5 Track C 200
1 0 D 100
5 0 D 200
How could I accomplish this?
I have a solution, but it is not elegant; I'd ask more experienced users to take a look at it.
To make it easier for others, here is the code to set up the test case:
import re
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
columns=\
'ID Value Name Score'.split(),
data = [
re.split(r'\s{2,}', line) for line in \
"""
-1 10 A -1
-1 5 B -1
NaN 0.2 Track C 100
NaN 0.5 Track C 200
1 0 D 100
5 0 D 200
""".strip().split('\n')
],
)
df1 = df1.replace({'NaN':np.nan})
df2 = pd.DataFrame(
columns=\
'Score ID'.split(),
data = [
re.split(r'\s{2,}', line) for line in \
"""
100 1
100 2
100 3
100 4
200 5
200 6
200 7
""".strip().split('\n')
],
)
and my solution is:
"""
The general first reaction is to use pd.merge();
however, the hurdle is how to fill the NaN values in the "ID" column.
Mine works, but it is too hard-coded.
"""
df = pd.merge(left=df1, right=df2, on='Score', how='left')
df['ID'] = df['ID_x'].fillna(df['ID_y'])
finalresult = df.drop(columns=['ID_x', 'ID_y']).drop_duplicates(subset=['ID','Name'])
OUTPUT:
Value Name Score ID
0 10 A -1 -1
1 5 B -1 -1
2 0.2 Track C 100 1
3 0.2 Track C 100 2
4 0.2 Track C 100 3
5 0.2 Track C 100 4
6 0.5 Track C 200 5
7 0.5 Track C 200 6
8 0.5 Track C 200 7
9 0 D 100 1
13 0 D 200 5
You can first use pandas.merge, then use pandas.concat to concatenate both dataframes along axis=0:
s = pd.merge(df2, df1, on='Score', how='left', suffixes=['', '_2'])\
.drop('ID_2', axis=1)\
.drop_duplicates('ID')
df3 = pd.concat([df1.dropna(), s], ignore_index=True)
Output
print(df3)
ID Name Score Value
0 -1.0 A -1 10.0
1 -1.0 B -1 5.0
2 1.0 D 100 0.0
3 5.0 D 200 0.0
4 1.0 Track C 100 0.2
5 2.0 Track C 100 0.2
6 3.0 Track C 100 0.2
7 4.0 Track C 100 0.2
8 5.0 Track C 200 0.5
9 6.0 Track C 200 0.5
10 7.0 Track C 200 0.5
Split your df, then use merge and concat to put it back together:
df1_1=df1.loc[df1.ID.isnull()].copy()
df1_2=df1.loc[df1.ID.notnull()].copy()
df1_1=df1_1.reset_index().drop('ID', axis=1).merge(df2,on='Score',how='left').set_index('index')
yourdf=pd.concat([df1_1,df1_2],sort=False).sort_index()
yourdf
Out[645]:
Value Name Score ID
0 10.0 A -1 -1.0
1 5.0 B -1 -1.0
2 0.2 TrackC 100 1.0
2 0.2 TrackC 100 2.0
2 0.2 TrackC 100 3.0
2 0.2 TrackC 100 4.0
3 0.5 TrackC 200 5.0
3 0.5 TrackC 200 6.0
3 0.5 TrackC 200 7.0
4 0.0 D 100 1.0
5 0.0 D 200 5.0
I frequently use pandas to merge (join) using a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id,B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile, the only way I found in pandas (without loops) is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
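(Side note for later readers: on pandas 1.2+ the dummy column can be avoided because merge supports how='cross'; a minimal sketch of the same idea:)
Temp = pd.merge(A, B, how='cross')   # cartesian product, pandas >= 1.2
Result = Temp[Temp.A_value.between(Temp.B_low, Temp.B_high)]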
Another solution I had is to apply, for each value of A, a search function on B using the mask B[(x >= B.B_low) & (x <= B.B_high)], but it sounds inefficient as well and might require index optimization.
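A rough sketch of that per-value lookup, using the column names above ('matching_B_ids' is just an illustrative helper column):
def find_matches(x):
    # rows of B whose [B_low, B_high] interval contains x
    return B.loc[(x >= B.B_low) & (x <= B.B_high), 'B_id'].tolist()

A['matching_B_ids'] = A['A_value'].map(find_matches)   # one Python-level lookup per row, hence slow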
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B
A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The ✌easiest✌ way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1).append(
A[~np.in1d(np.arange(len(A)), np.unique(i))],
ignore_index=True, sort=False
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Not sure whether it is more efficient, but you can use SQL directly with pandas (via the sqlite3 module, for instance, inspired by this question):
import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)
You can adapt the query as needed in your application
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. It's called pandasql. The documentation explicitly states that joins are supported. This might at least be easier to read, since SQL syntax is very readable.
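For illustration only, a minimal sketch of what that might look like with pandasql, assuming it is installed and reusing the A/B names from the question:
# pip install pandasql
from pandasql import sqldf

query = """
SELECT *
FROM A JOIN B
  ON A.A_value BETWEEN B.B_low AND B.B_high
"""
result = sqldf(query, locals())   # locals() makes the A and B DataFrames visible to the query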
conditional_join from pyjanitor may be helpful for abstraction/convenience:
# pip install pyjanitor
import pandas as pd
import janitor
inner join
A.conditional_join(B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
left join
A.conditional_join(
B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<='),
how = 'left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Let's take a simple example:
df=pd.DataFrame([2,3,4,5,6],columns=['A'])
returns
A
0 2
1 3
2 4
3 5
4 6
now lets define a second dataframe
df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]
results in
B_low B_high
0 1 2
1 6 8
2 2 4
3 3 6
4 5 6
Here we go; we want the output to be index 3 with A value 5:
df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()
results in
A
3 5.0
I know this is an old question but for newcomers there is now the pandas.merge_asof function that performs join based on closest match.
In case you want to do a merge so that a column of one DataFrame (df_right) is between 2 columns of another DataFrame (df_left) you can do the following:
df_left = pd.DataFrame({
"time_from": [1, 4, 10, 21],
"time_to": [3, 7, 15, 27]
})
df_right = pd.DataFrame({
"time": [2, 6, 16, 25]
})
df_left
time_from time_to
0 1 3
1 4 7
2 10 15
3 21 27
df_right
time
0 2
1 6
2 16
3 25
First, find the match in the right DataFrame that is closest to, but not smaller than, the left boundary (time_from) of the left DataFrame:
merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
left_on="time_from",
right_on="candidate_match_1",
direction="forward"
)
merged
time_from time_to candidate_match_1
0 1 3 2
1 4 7 6
2 10 15 16
3 21 27 25
As you can see the candidate match in index 2 is wrongly matched, as 16 is not between 10 and 15.
Then, find the match in the right DataFrame that is closest to, but not larger than, the right boundary (time_to) of the left DataFrame:
merged = pd.merge_asof(
left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
left_on="time_to",
right_on="candidate_match_2",
direction="backward"
)
merged
time_from time_to candidate_match_1 candidate_match_2
0 1 3 2 2
1 4 7 6 6
2 10 15 16 6
3 21 27 25 25
Finally, keep the rows where the two candidate matches are the same, meaning that the value from the right DataFrame lies between the two columns of the left DataFrame:
merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]
merged
time_from time_to candidate_match_1 candidate_match_2 match
0 1 3 2 2 2
1 4 7 6 6 6
2 10 15 16 6 None
3 21 27 25 25 25
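If only the rows that actually matched are wanted, a small follow-up (simply filtering on the match column) could be:
result = merged[merged["match"].notna()].drop(columns=["candidate_match_1", "candidate_match_2"])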