Merge two dataframes based on interval overlap - python

I have two dataframes A and B:
For example:
import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
A[["Start","End"]]
Out[37]:
Start End
0 10 11
1 11 11
2 20 35
3 62 70
4 198 200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})
B[["Start","End","Info"]]
Out[38]:
Start End Info
0 8 10 some_info0
1 5 90 some_info1
2 8 13 some_info2
3 60 75 some_info3
I would like to add the Info column to dataframe A based on whether the interval (Start-End) of A overlaps with an interval of B. If an A interval overlaps with more than one B interval, the info corresponding to the shorter B interval should be added.
I have been looking around for how to handle this and found similar questions, but most of the answers use iterrows(), which is not viable in my case since I am dealing with huge dataframes.
I would like something like:
A.merge(B,on="overlapping_interval", how="left")
And then drop duplicates keeping the info coming from the shorter interval.
The output should look like this:
In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})
C[["Start","End","Info"]]
Out[39]:
Start End Info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 NaN
I have found this question really interesting as it suggests the possibility of solving this issue using the pandas Interval object. But after many attempts I have not managed to solve it.
Any ideas?

I would suggest writing a function and then applying it to the rows.
First, compute the delta (End - Start) in B for sorting purposes:
B['delta'] = B.End - B.Start
Then define a function to get the info:
def get_info(x):
    # fully included
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    # start lower, end included
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    # start included, end higher
    c2 = (x.Start <= B.End) & (x.End >= B.End)
    # filter with the conditions and sort by delta
    _B = B[c0 | c1 | c2].sort_values('delta', ascending=True)
    # None if no info matches
    return None if len(_B) == 0 else _B.iloc[0].Info
Then you can apply this function to A:
A['info'] = A.apply(lambda x : get_info(x), axis='columns')
print(A)
Start End info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 None
Note:
Instead of using pd.Interval, write your own conditions. The cx conditions are your interval definitions; change them to get the exact behaviour you expect.
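As an aside, for closed intervals the three conditions above are together equivalent to the single standard overlap test (each interval starts no later than the other one ends), so the filter could be reduced to the following sketch (reusing the B['delta'] column computed above; the simplified function name is my own):
def get_info_simple(x):
    # closed intervals [Start, End] overlap iff each starts no later than the other ends
    overlaps = (x.Start <= B.End) & (x.End >= B.Start)
    _B = B[overlaps].sort_values('delta')
    return None if _B.empty else _B.iloc[0].Info

A['info'] = A.apply(get_info_simple, axis='columns')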

Related

Delete rows with overlapping intervals efficiently

Consider the following DataFrame
>>> df
Start End Tiebreak
0 1 6 0.376600
1 5 7 0.050042
2 15 20 0.628266
3 10 15 0.984022
4 11 12 0.909033
5 4 8 0.531054
Whenever the [Start, End] intervals of two rows overlap I want the row with lower tiebreaking value to be removed. The result of the example would be
>>> df
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
I have a double-loop which does the job inefficiently and was wondering whether there exists an approach which exploits built-ins and works columnwise.
import pandas as pd
import numpy as np
# initial data
df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': np.random.uniform(0, 1, 6)
})
# checking for overlaps
list_idx_drop = []
for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        idx_1 = df.index[i]
        idx_2 = df.index[j]
        cond_1 = (df.loc[idx_1, 'Start'] < df.loc[idx_2, 'End'])
        cond_2 = (df.loc[idx_2, 'Start'] < df.loc[idx_1, 'End'])
        # if the rows overlap
        if cond_1 & cond_2:
            tie_1 = df.loc[idx_1, 'Tiebreak']
            tie_2 = df.loc[idx_2, 'Tiebreak']
            # delete the row with the lower tiebreaking value
            if tie_1 < tie_2:
                df.drop(idx_1, inplace=True)
            else:
                df.drop(idx_2, inplace=True)
You could sort by End and check cases where the end is greater than the previous Start. Using that True/False value, you can create groupings on which to drop duplicates. Sort again by Tiebreak and drop duplicates on the group column.
import pandas as pd
df = pd.DataFrame({'Start': {0: 1, 1: 5, 2: 15, 3: 10, 4: 11, 5: 4}, 'End': {0: 6, 1: 7, 2: 20, 3: 15, 4: 12, 5: 8}, 'Tiebreak': {0: 0.3766, 1: 0.050042, 2: 0.628266, 3: 0.984022, 4: 0.909033, 5: 0.531054}})
df = df.sort_values(by='End', ascending=False)
df['overlap'] = df['End'].gt(df['Start'].shift(fill_value=0))
df['group'] = df['overlap'].eq(False).cumsum()
df = df.sort_values(by='Tiebreak', ascending=False)
df = df.drop_duplicates(subset='group').drop(columns=['overlap','group'])
print(df)
Output
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
You can sort the values by Start and compute a cummax of End, then form groups of non-overlapping intervals and get the max Tiebreak with groupby.idxmax:
keep = (df
    .sort_values(by=['Start', 'End'])
    .assign(max_End=lambda d: d['End'].cummax(),
            group=lambda d: d['Start'].ge(d['max_End'].shift()).cumsum())
    .groupby('group', sort=False)['Tiebreak'].idxmax()
)
out = df[df.index.isin(keep)]
Output:
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
[Figure: the grouping logic, drawn as intervals]
The logic is to move left to right and start a new group whenever there is a "jump" (no overlap). The intervals are drawn as solid lines (the greatest Tiebreak in bold), and the cummax of End as dotted lines.
Intermediates:
Start End Tiebreak max_End group
0 1 6 0.376600 6 0
5 4 8 0.531054 8 0
1 5 7 0.050042 8 0
3 10 15 0.984022 15 1 # 10 ≥ 8
4 11 12 0.909033 15 1
2 15 20 0.628266 20 2 # 15 ≥ 15

Compare two dataframes and conditionally capture random data in Python

The main logic of my question is about comparing two dataframes, but it differs from the existing questions here (Q1, Q2, Q3).
Let's create two dummy dataframes.
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 contains the full set of checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places with ids (10, 20, 30, 40, 50).
I want to randomly add places that user 1 did not visit and set their count column to 0.
My expectation dataframe like this
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Those reading the question may ask how many random rows should be added.
For this example, adding 3 non-visited places per user is enough.
This might not be the best solution but it works.
You have to take each user and then pick the checkinids which are not assigned to them:
# get all users
users = df1.user.unique()
for user in users:
    checkins = df1.loc[df1['user'] == user]
    # keep the checkinids this user never visited and sample 3 of them
    df = checkins.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'right_only'].sample(n=3)
    df['user'] = [user, user, user]
    df['count'] = [0, 0, 0]
    df.pop("_merge")
    df1 = pd.concat([df1, df], ignore_index=True)
# sort dataframe based on user
df1 = df1.sort_values(by=['user'])
# re-arrange cols
df1 = df1[['user', 'checkinid', 'count']]
# print df
print(df1)
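For larger data, here is a rough loop-free sketch (not part of the answer above): build every (user, checkinid) pair with a cross product, keep the non-visited pairs, and sample 3 per user. It assumes pandas >= 1.1 for GroupBy.sample and that every user has at least 3 non-visited ids.
import pandas as pd

all_pairs = pd.MultiIndex.from_product(
    [df1['user'].unique(), df2['checkinid']], names=['user', 'checkinid']
).to_frame(index=False)

# pairs present in df1 are 'both'; the rest are non-visited
merged = all_pairs.merge(df1[['user', 'checkinid']], how='left', indicator=True)
non_visited = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

negatives = non_visited.groupby('user').sample(n=3).assign(count=0)
result = (pd.concat([df1, negatives], ignore_index=True)
          .sort_values('user')[['user', 'checkinid', 'count']])
print(result)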

Pandas DataFrame group-by indexes matching list - indexes respectively smaller than list[i+1] and greater than list[i]

I have a DataFrame Times_df with times in a single column and a second DataFrame End_df with specific end times for each group indexed by group name.
import random
import numpy as np
import pandas as pd

Times_df = pd.DataFrame({'time': np.unique(np.cumsum(np.random.randint(5, size=(100,))), axis=0)})
End_df = pd.DataFrame({'end time': np.unique(random.sample(range(Times_df.index.values[0], Times_df.index.values[-1]), 10))})
End_df.index.name = 'group'
I want to add a group index for all times in Times_df that are smaller than or equal to each consecutive end time in End_df but greater than the previous one.
I can only do it for now with a loop, which takes forever ;(
lis = []
i = 1
for row in Times_df['time'].values:
    while i <= row:
        lis.append((End_df['end time'] == row).index)
        i += 1
Then I add the list lis as a new column to Times_df
Times_df['group']=lis
Another solution that sadly still uses a loop is this:
test_df = pd.DataFrame()
for group, index in End_df.iterrows():
    test = count.loc[count.index <= index['end time']][:]
    test['group'] = group
    test_df = pd.concat([test_df, test], axis=0, ignore_index=True)
I think what you are looking for is pd.cut to bin your values into the groups.
bins = [0, 3, 10, 20, 53, 59, 63, 65, 68, 74, np.inf]
groups = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Times_df["group"] = pd.cut(Times_df["time"], bins, labels=groups)
print(Times_df)
time group
0 2 0
1 3 0
2 7 1
3 11 2
4 15 2
5 16 2
6 18 2
7 22 3
8 25 3
9 28 3
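Since End_df is generated randomly, the bin edges above are specific to one run. A small sketch of building the edges directly from End_df (assuming the 'end time' values are sorted, which np.unique guarantees) could look like this; the default right=True gives intervals of the form (previous end, end], i.e. "<= this end time, > the previous one":
# edges: (-inf, e0], (e0, e1], ... ; labels are the group index of End_df
edges = np.concatenate(([-np.inf], End_df['end time'].to_numpy()))
Times_df['group'] = pd.cut(Times_df['time'], bins=edges, labels=End_df.index)
print(Times_df.head())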

How to write a nested query in Python pandas?

Hi all, I am new to pandas. I need some help with how to write a pandas query for my required output.
I want to retrieve output data like this:
when 0 < minimum_age < 10, I need sum(population) for that 0 to 10 range only
when 10 < minimum_age < 20, I need sum(population) for that 10 to 20 range only
and so on.
My Input Data Looks Like:
population,minimum_age,maximum_age,gender,zipcode,geo_id
50,30,34,f,61747,8600000US61747
5,85,NaN,m,64120,8600000US64120
1389,10,34,m,95117,8600000US95117
231,5,60,f,74074,8600000US74074
306,22,24,f,58042,8600000US58042
My Code:
import pandas as pd
import numpy as np
df1 = pd.read_csv(r"C:\Users\Rahul\Desktop\Desktop_Folders\Code\Population\population_by_zip_2010.csv")
df2 = df1.set_index("geo_id")
df2['sum_population'] = np.where(df2['minimum_age'] < 10, sum(df2['population']), 0)
print(df2)
You can try pandas cut along with groupby,
df.groupby(pd.cut(df['minimum_age'], bins=np.arange(0,100, 10), right=False)).population.sum().reset_index(name = 'sum of population')
minimum_age sum of population
0 [0, 10) 231.0
1 [10, 20) 1389.0
2 [20, 30) 306.0
3 [30, 40) 50.0
4 [40, 50) NaN
5 [50, 60) NaN
6 [60, 70) NaN
7 [70, 80) NaN
8 [80, 90) 5.0
Explanation: pandas cut helps create bins of minimum_age by putting the values in groups of 0-10, 10-20 and so on. This is how it looks:
pd.cut(df['minimum_age'], bins=np.arange(0, 100, 10), right=False)
0 [30, 40)
1 [80, 90)
2 [10, 20)
3 [0, 10)
4 [20, 30)
Now we use groupby on the output of pd.cut to find sum of population.
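If the interval notation in the result index is not wanted, pd.cut also accepts a labels argument to name the bins directly; a small sketch (the label strings are my own choice):
bins = np.arange(0, 100, 10)
labels = ['{}-{}'.format(lo, lo + 9) for lo in bins[:-1]]
df.groupby(pd.cut(df['minimum_age'], bins=bins, right=False, labels=labels)).population.sum()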

Write to file from dictionary instead of pandas

I would like to print dictionaries to file in a different way.
Right now, I am using Pandas to convert dictionaries to Dataframes, combine several Dataframes and then print them to file (see below code).
However, the Pandas operations seem to take a very long time and I would like to do this more efficiently.
Is it possible to do the below approach more efficiently while retaining the structure of the output files? (e.g. by printing from dictionary directly?)
import pandas as pd
labels = ["A", "B", "C"]
periods = [0, 1, 2]
header = ['key', 'scenario', 'metric', 'labels']
metrics_names = ["metric_balances", "metric_record"]
key = "key_x"
scenario = "base"
# The metrics are structured as dicts where the keys are `periods` and the values
# are arrays (where each array entry correspond to one of the `labels`)
metric_balances = {0: [1000, 100, 50], 1: [900, 150, 100], 2: [800, 350, 100]}
metric_record = {0: [20, 10, 5], 1: [90, 15, 10], 2: [80, 35, 10]}
# Combine all metrics into one output structure for key "x"
output_x = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)
key = "key_y"
scenario = "base_2"
metric_balances = {0: [2000, 200, 50], 1: [1900, 350, 100], 2: [1200, 750, 100]}
metric_record = {0: [40, 5, 3], 1: [130, 45, 10], 2: [82, 25, 18]}
# Combine all metrics into one output structure for key "y"
output_y = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)
# Concatenate all output dataframes
output = pd.concat([output_x, output_y], names=header)
# Print results to a csv file
output.to_csv("test.csv", index=False)
Below are the respective outputs:
OUTPUT X
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
-----------------------------------
OUTPUT Y
0 1 2
key scenario metric labels
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
------------------------------
OUTPUT COMBINED
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
I was looking into row-wise printing of the dictionaries, but I had difficulties merging the labels with the relevant arrays.
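One possible direction (a sketch of my own, not a tested replacement) is to write the rows straight from the dicts with the csv module. It assumes the variables defined above (labels, periods, header, key, scenario, metric_balances, metric_record) and, for brevity, writes only one key per call; whether it is actually faster would need profiling on the real data.
import csv

metrics = {"metric_balances": metric_balances, "metric_record": metric_record}

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header + [str(p) for p in periods])
    for metric_name, metric in metrics.items():
        for i, label in enumerate(labels):
            # one row per label: that label's value in every period
            writer.writerow([key, scenario, metric_name, label]
                            + [metric[p][i] for p in periods])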
