finding intersection of intervals in pandas - python

I have two dataframes
df_a=
Start Stop Value
0 0 100 0.0
1 101 200 1.0
2 201 1000 0.0
df_b=
Start Stop Value
0 0 50 0.0
1 51 300 1.0
2 301 1000 0.0
I would like to generate a DataFrame which contains the intervals, identified by Start and Stop, and indicates where Value was the same in df_a and df_b. For each interval I would like to store: whether Value was the same, and what the value was in df_a and in df_b.
Desired output:
df_out=
Start Stop SameValue Value_dfA Value_dfB
0 50 1 0 0
51 100 0 0 1
101 200 1 1 1
201 300 0 0 1
[...]

Not sure if this is the best way to do this but you can reindex, join, groupby and agg to get your intervals, e.g.:
Expand each df so that the index is every single value of the range (Start to Stop) using reindex() and padding the values:
In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).fillna(method='pad')
df_a_expanded
Out[]:
Stop Value
Start
0 100.0 0.0
1 100.0 0.0
2 100.0 0.0
3 100.0 0.0
4 100.0 0.0
...
997 1000.0 0.0
998 1000.0 0.0
999 1000.0 0.0
1000 1000.0 0.0
[1001 rows x 2 columns]
In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).fillna(method='pad')
Join the two expanded dfs:
In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df
Out[]:
Start Stop_dfA Value_dfA Stop_dfB Value_dfB
0 0 100.0 0.0 50.0 0.0
1 1 100.0 0.0 50.0 0.0
2 2 100.0 0.0 50.0 0.0
3 3 100.0 0.0 50.0 0.0
4 4 100.0 0.0 50.0 0.0
...
Note: you can ignore the Stop columns and could have dropped them in the previous step.
There is no standard way to groupby only consecutive values (à la itertools.groupby), so resorting to a cumsum() hack:
In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)
Now you can get the result you want by aggregating the group with min, max:
In []:
df_out = g['Start'].agg({'Start': 'min', 'Stop': 'max'})
df_out
Out[]:
Value_dfA Value_dfB Start Stop
0 0.0 0.0 0 50
1 0.0 1.0 51 100
2 1.0 1.0 101 200
3 0.0 1.0 201 300
4 0.0 0.0 301 1000
Now you just have to add the SameValue column and, if desired, order the columns to get the exact output you want:
In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Out[]:
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
This assumes the ranges of the two dataframes are the same, or you will need to handle the NaNs you will get with the join().
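Note: on recent pandas versions the dict-renaming form of agg used above raises a SpecificationError, and fillna(method='pad') is now spelled ffill(). A hedged sketch of the same aggregation using named aggregation, assuming df and groups from the previous steps:
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'])
df_out = (g['Start']
          .agg(Start='min', Stop='max')                   # named aggregation instead of the dict form
          .reset_index(level=['Value_dfA', 'Value_dfB'])  # keep the values as columns
          .reset_index(drop=True))                        # drop the cumsum group id
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB']).astype(int)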

I found a way but not sure it is the most efficient. You have the input data:
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})
First I would create the columns Start and Stop of df_out with:
df_out = pd.DataFrame({'Start': sorted(set(dfa['Start']) | set(dfb['Start'])),
                       'Stop': sorted(set(dfa['Stop']) | set(dfb['Stop']))})
Then, to get the value of dfa (and dfb) associated with the right (Start, Stop) range in a column named Value_dfA (and Value_dfB), I would do:
df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])
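As an aside, a vectorized alternative to the two apply() lookups above is pd.merge_asof, which matches each Start in df_out to the last row whose Start is less than or equal to it. This is only a sketch and relies on dfa, dfb and df_out all being sorted by Start, as they are here:
df_out = pd.merge_asof(df_out,
                       dfa[['Start', 'Value']].rename(columns={'Value': 'Value_dfA'}),
                       on='Start')
df_out = pd.merge_asof(df_out,
                       dfb[['Start', 'Value']].rename(columns={'Value': 'Value_dfB'}),
                       on='Start')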
To get the column SameValue, do:
df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0,axis=1)
If it matters, you can reorder the columns with:
df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Your output is then
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0

I have an O(n log n) solution, where n is the total number of rows of df_a and df_b. Here's how it goes:
Rename the value column of both dataframes to value_a and value_b respectively. Next, append df_b to df_a.
df = df_a.append(df_b)
Sort the df with respect to start column.
df = df.sort_values('start')
Resulting dataframe will look like this:
start stop value_a value_b
0 0 100 0.0 NaN
0 0 50 NaN 0.0
1 51 300 NaN 1.0
1 101 200 1.0 NaN
2 201 1000 0.0 NaN
2 301 1000 NaN 0.0
Forward fill the missing values:
df = df.fillna(method='ffill')
Compute same_value column:
df['same_value'] = df['value_a'] == df['value_b']
Recompute stop column:
df.stop = df.start.shift(-1)
You will get the dataframe you desire (except the first and last row which is pretty easy to fix):
start stop value_a value_b same_value
0 0 0.0 0.0 NaN False
0 0 51.0 0.0 0.0 True
1 51 101.0 0.0 1.0 False
1 101 201.0 1.0 1.0 True
2 201 301.0 0.0 1.0 False
2 301 NaN 0.0 0.0 True
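For reference, a runnable sketch of the same approach on current pandas, where DataFrame.append has been removed in favor of pd.concat (column names follow the question's dataframes, with the Value columns renamed as described above):
import pandas as pd

df_a = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
df_b = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})

df = pd.concat([df_a.rename(columns={'Value': 'value_a'}),
                df_b.rename(columns={'Value': 'value_b'})])
df = df.sort_values('Start', kind='stable')  # stable sort keeps df_a rows first on ties
df = df.ffill()                              # forward fill the missing values
df['same_value'] = df['value_a'] == df['value_b']
df['Stop'] = df['Start'].shift(-1)           # recompute stop from the next start
print(df)                                    # first and last rows still need the fix mentioned above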

Here is an answer which computes the overlapping intervals really quickly (which answers the question in the title):
from io import StringIO
import pandas as pd
from ncls import NCLS
c1 = StringIO("""Start Stop Value
0 100 0.0
101 200 1.0
201 1000 0.0""")
c2 = StringIO("""Start Stop Value
0 50 0.0
51 300 1.0
301 1000 0.0""")
df1 = pd.read_table(c1, sep=r"\s+")
df2 = pd.read_table(c2, sep=r"\s+")
ncls = NCLS(df1.Start.values, df1.Stop.values, df1.index.values)
x1, x2 = ncls.all_overlaps_both(df2.Start.values, df2.Stop.values, df2.index.values)
df1 = df1.reindex(x2).reset_index(drop=True)
df2 = df2.reindex(x1).reset_index(drop=True)
# print(df1)
# print(df2)
df = df1.join(df2, rsuffix="2")
print(df)
# Start Stop Value Start2 Stop2 Value2
# 0 0 100 0.0 0 50 0.0
# 1 0 100 0.0 51 300 1.0
# 2 101 200 1.0 51 300 1.0
# 3 201 1000 0.0 51 300 1.0
# 4 201 1000 0.0 301 1000 0.0
With this final df it should be simple to get to the result you need (but it is left as an exercise for the reader).
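For completeness, a hedged sketch of that exercise using the joined df above: the intersection of each overlapping pair runs from the larger Start to the smaller Stop, and SameValue is just a comparison of the two value columns.
df_out = pd.DataFrame({
    'Start': df[['Start', 'Start2']].max(axis=1),
    'Stop': df[['Stop', 'Stop2']].min(axis=1),
    'SameValue': (df['Value'] == df['Value2']).astype(int),
    'Value_dfA': df['Value'],
    'Value_dfB': df['Value2'],
})
print(df_out)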
See NCLS for the interval overlap data structure.

Is there a more efficient method than apply(), to check and replace row values in a Pandas Dataframe?

I'm running the function below on my DataFrame, where if a date is the same as a date on which there is an alert ('Y'), then the number in column 'tdelta' should be replaced with 250.
This is the original df:
reset category date id group tdelta tdelta reverse
6 N low 2021-06-23 16860 2.0 33.0 0.0
5 Y low 2021-05-21 16860 2.0 0.0 -33.0
10 N medium 2020-12-06 1111 1.0 29.0 0.0
1 Y low 2020-12-06 16860 1.0 0.0 0.0
2 N low 2020-12-06 16860 1.0 0.0 0.0
8 Y medium 2020-11-07 1111 1.0 0.0 -29.0
9 N medium 2020-11-07 1111 1.0 0.0 -29.0
4 N low 2019-11-08 16860 0.0 65.0 0.0
3 N low 2019-09-07 16860 0.0 3.0 -62.0
7 N medium 2019-09-04 1111 0.0 0.0 0.0
This is the code and resulting output:
def format(row):
    r_dates = df[(df['id'] == row['id']) & (df['reset'] == 'Y')]['date']
    r_dates = r_dates.tolist()
    if row['date'] in r_dates:
        val = 250
    else:
        val = row['tdelta']
    return val

df['tdelta'] = df.apply(format, axis=1)
print(df)
reset category date id group tdelta tdelta reverse
6 N low 2021-06-23 16860 2.0 33.0 0.0
5 Y low 2021-05-21 16860 2.0 250.0 -33.0
10 N medium 2020-12-06 1111 1.0 29.0 0.0
1 Y low 2020-12-06 16860 1.0 250.0 0.0
2 N low 2020-12-06 16860 1.0 250.0 0.0
8 Y medium 2020-11-07 1111 1.0 250.0 -29.0
9 N medium 2020-11-07 1111 1.0 250.0 -29.0
4 N low 2019-11-08 16860 0.0 65.0 0.0
3 N low 2019-09-07 16860 0.0 3.0 -62.0
7 N medium 2019-09-04 1111 0.0 0.0 0.0
0 N low 2019-09-04 16860 0.0 0.0 -65.0
However, when I apply this on a much larger dataset (approx. 200k rows), it seems to take a very long time. I'm wondering if there is a more efficient, practical way of accomplishing the above.
EDIT:
I have changed some of the ids in column "id" to reflect what happens with different ids.
The id column must also be taken into account, as in the much larger dataset (with many different ids in the id column) there are reset dates unique to those ids.
Therefore, I would need to consider all the resets that are "Y" for that id.
The issue is that other ids may fall on the same date as an alert for another id, and those rows shouldn't change, as shown in the third row (index 10): its date is the same as a reset date for id 16860, but it remains unchanged because there is no alert for id 1111 on that date.
After trying this:
r_id = df[df['reset'] == 'Y']['id']
r_dates = df[df['reset'] == 'Y']['date']
df['tdelta'] = np.where((df['id'].isin(r_id) & df['date'].isin(r_dates)),250,df['tdelta'])
The below shows an incorrect output:
reset category date id group tdelta tdelta reverse
6 N low 2021-06-23 16860 2.0 33.0 0.0
5 Y low 2021-05-21 16860 2.0 250.0 -33.0
10 N medium 2020-12-06 1111 1.0 250.0 0.0
1 Y low 2020-12-06 16860 1.0 250.0 0.0
2 N low 2020-12-06 16860 1.0 250.0 0.0
8 Y medium 2020-11-07 1111 1.0 250.0 -29.0
9 N medium 2020-11-07 1111 1.0 250.0 -29.0
4 N low 2019-11-08 16860 0.0 65.0 0.0
3 N low 2019-09-07 16860 0.0 3.0 -62.0
7 N medium 2019-09-04 1111 0.0 0.0 0.0
0 N low 2019-09-04 16860 0.0 0.0 -65.0
Timing of your apply():
In: %timeit df['tdelta'] = df.apply(format, axis =1)
Out: 11.3 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Last time I measured apply it was without the
df['id'] == row['id']
condition; my bad, now I understand what it's for.
To make it faster, we'll need 2 lists:
r_id = df[df['reset'] == 'Y']['id']
r_dates = df[df['reset'] == 'Y']['date']
Updated numpy.where:
df['tdelta'] = np.where((df['id'].isin(r_id) & df['date'].isin(r_dates)),250,df['tdelta'])
Generating each of these lists takes about half the time of the np.where call; all together (generating the lists plus the np.where):
1.63 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
example df:
df = pd.DataFrame({'id': [223, 227, 2214, 2215, 226, 2215, 2215, 238, 253],
                   'reset': ['N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'Y'],
                   'date': [3, 7, 15, 16, 15, 15, 15, 38, 53],
                   'tdelta': [3, 7, 14, 15, 17, 26, 32, 38, 53]})
r_id = df[df['reset'] == 'Y']['id']
r_dates = df[df['reset'] == 'Y']['date']
df['tdelta'] = np.where((df['id'].isin(r_id) & df['date'].isin(r_dates)),250,df['tdelta'])
df
out:
id reset date tdelta
0 223 N 3 3
1 227 Y 7 250
2 2214 N 15 14
3 2215 Y 16 250
4 226 N 15 17
5 2215 Y 15 250
6 2215 N 15 250
7 238 Y 38 250
8 253 Y 53 250
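Note that testing id and date with two independent isin calls can still flag rows whose id and date each appear in some reset, but never together, which is exactly the situation described in the question's edit. A hedged sketch that matches (id, date) pairs instead:
import numpy as np
import pandas as pd

# Build the set of (id, date) pairs that actually have a reset, then test
# each row's own (id, date) pair against it.
resets = df.loc[df['reset'] == 'Y', ['id', 'date']]
pair_mask = pd.MultiIndex.from_frame(df[['id', 'date']]).isin(
    pd.MultiIndex.from_frame(resets))
df['tdelta'] = np.where(pair_mask, 250, df['tdelta'])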

Pandas add missing weeks from range to dataframe

I am computing a DataFrame with weekly amounts and now I need to fill it with missing weeks from a provided date range.
This is how I'm generating the dataframe with the weekly amounts:
df['date'] = pd.to_datetime(df['date']) - timedelta(days=6)
weekly_data: pd.DataFrame = (df
.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
.reset_index()
)
Which outputs:
date sum
0 2020-10-11 78
1 2020-10-18 673
If a date range is given as start='2020-08-30' and end='2020-10-30', then I would expect the following dataframe:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0.0
So far, I have managed to just add the missing weeks and set the sum to 0, but it also replaces the existing values:
weekly_data = weekly_data.reindex(pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN')).fillna(0)
Which outputs:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 0.0 # should be 78
7 2020-10-18 0.0 # should be 673
8 2020-10-25 0.0
Remove reset_index so you keep a DatetimeIndex, because reindex works on the index; with the default RangeIndex nothing matches the new date range, so you get all 0 values:
weekly_data = (df.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
)
Then it is possible to use the fill_value=0 parameter and add reset_index at the end:
r = pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN', name='date')
weekly_data = weekly_data.reindex(r, fill_value=0).reset_index()
print (weekly_data)
date sum
0 2020-08-30 0
1 2020-09-06 0
2 2020-09-13 0
3 2020-09-20 0
4 2020-09-27 0
5 2020-10-04 0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0
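If you would rather keep the original reset_index step, a similar sketch sets the date column back as the index before reindexing, so the existing sums are preserved:
r = pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN', name='date')
weekly_data = (weekly_data.set_index('date')
                          .reindex(r, fill_value=0)
                          .reset_index())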

Filter pandas dataframe based on column list values

My dataframe has many columns. one of these columns is array
df
Out[191]:
10012005 10029008 10197000 ... filename_int filename result
0 0.0 0.0 0.0 ... 1 1.0 [280, NON]
1 0.0 0.0 0.0 ... 10 10.0 [286, NON]
2 0.0 0.0 0.0 ... 100 100.0 [NON, 285]
3 0.0 0.0 0.0 ... 10000 10000.0 [NON, 286]
4 0.0 0.0 0.0 ... 10001 10001.0 [NON]
... ... ... ... ... ... ...
52708 0.0 0.0 0.0 ... 9995 9995.0 [NON]
52709 0.0 0.0 0.0 ... 9996 9996.0 [NON]
52710 0.0 0.0 0.0 ... 9997 9997.0 [285, NON]
52711 0.0 0.0 0.0 ... 9998 9998.0 [NON]
52712 0.0 0.0 0.0 ... 9999 9999.0 [NON]
[52713 rows x 4289 columns]
the result column contains arrays (lists) of values like these
[NON]
[123,NON]
[357,938,837]
[455,NON,288]
[388,929,NON,020]
I want to filter my dataframe to only display records that have values other than NON
therefore values such as
[NON,NON]
[NON]
[]
these will be excluded
only values like the following should remain after the filter
[123,NON]
[357,938,837]
[455,NON,288]
[388,929,NON,020]
I tried this code
df[len(df["result"])!="NON"]
but I get this error !!
File "pandas\_libs\hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: True
how to filter my dataframe?
Try map with lambda here:
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [[280, 'NON'], ['NON'], [], [285]] })
df
A B
0 1 [280, NON]
1 2 [NON]
2 3 []
3 4 [285]
df[df['B'].map(lambda x: any(y != 'NON' for y in x))]
A B
0 1 [280, NON]
3 4 [285]
The generator expression inside map returns True if at least one item in the list is not "NON".
You can use apply to identify rows that meet your criteria. Here, the filter works because apply returns a boolean.
import pandas as pd
import numpy as np
vals = [280, 285, 286, 'NON', 'NON', 'NON']
listcol = [np.random.choice(vals, 3) for _ in range(100)]
df = pd.DataFrame({'vals': listcol})
def is_non(l):
    # True when the list contains at least one value other than 'NON'
    return len([i for i in l if i != 'NON']) > 0

df.loc[df.vals.apply(is_non), :]
I will do
s = pd.DataFrame(df.B.tolist())
df = df[(s.ne('NON') & s.notnull()).any(axis=1).to_numpy()].copy()
A B
0 1 [280, NON]
3 4 [285]
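Another option is a sketch based on explode (available in newer pandas), which turns each list element into its own row and then aggregates back per original row; empty lists explode to NaN, which the notna check handles:
exploded = df['B'].explode()
mask = (exploded.notna() & exploded.ne('NON')).groupby(level=0).any()
df[mask]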

Pandas DataFrame --> GroupBy --> MultiIndex Process

I'm trying to restructure a large DataFrame of the following form as a MultiIndex:
date store_nbr item_nbr units snowfall preciptotal event
0 2012-01-01 1 1 0 0.0 0.0 0.0
1 2012-01-01 1 2 0 0.0 0.0 0.0
2 2012-01-01 1 3 0 0.0 0.0 0.0
3 2012-01-01 1 4 0 0.0 0.0 0.0
4 2012-01-01 1 5 0 0.0 0.0 0.0
I want to group by store_nbr (1-45), within each store_nbr group by item_nbr (1-111) and then for the corresponding index pair (e.g., store_nbr=12, item_nbr=109), display the rows in chronological order, so that ordered rows will look like, for example:
store_nbr=12, item_nbr=109: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=0, snowfall=...
date=2014-02-08, units=0, snowfall=...
... ...
store_nbr=12, item_nbr=110: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=1, snowfall=...
date=2014-02-08, units=1, snowfall=...
...
It looks like some combination of groupby and set_index might be useful here, but I'm getting stuck after the following line:
grouped = stores.set_index(['store_nbr', 'item_nbr'])
This produces the following MultiIndex:
date units snowfall preciptotal event
store_nbr item_nbr
1 1 2012-01-01 0 0.0 0.0 0.0
2 2012-01-01 0 0.0 0.0 0.0
3 2012-01-01 0 0.0 0.0 0.0
4 2012-01-01 0 0.0 0.0 0.0
5 2012-01-01 0 0.0 0.0 0.0
Does anyone have any suggestions from here? Is there an easy way to do this by manipulating groupby objects?
You can sort your rows with:
df.sort_values(by='date')
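A hedged sketch of the full reshaping the question describes, assuming the frame is called stores as above: sort by store, item and date, then set the MultiIndex so each (store_nbr, item_nbr) pair lists its rows in chronological order:
grouped = (stores
           .sort_values(['store_nbr', 'item_nbr', 'date'])
           .set_index(['store_nbr', 'item_nbr']))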

Obtaining the class with maximum frequency(python)

so based on the following groupby code:
aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size().unstack()
I get the following output:
class_predict 0 1
S3bin2 S105_9bin2
50 50 16058.0 133.0
100 256.0 7.0
150 161.0 NaN
200 160.0 1.0
400000 4195.0 58.0
100 50 3480.0 20.0
100 68.0 NaN
150 43.0 1.0
200 48.0 1.0
400000 689.0 2.0
150 50 1617.0 6.0
100 73.0 NaN
150 33.0 NaN
200 52.0 NaN
400000 935.0 3.0
200 50 1155.0 8.0
100 73.0 1.0
150 37.0 NaN
200 45.0 NaN
400000 937.0 NaN
300000 50 11508.0 178.0
100 748.0 11.0
150 446.0 5.0
200 350.0 9.0
400000 13080.0 49.0
So for the group 50 in both S3bin2 and S105_9bin2, the frequency of 0 is the highest. Is it possible to run a function that prints the group for which 0 has the highest count, and also the count itself? I've tried transform(max) and other things but I'm not getting it.
Solution for testing the maximum across all of the data:
First you can remove the unstack, aggregate with max and idxmax, and last create the output with format:
s = aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size()
a = s.agg(['idxmax', 'max'])
print (a)
idxmax (50, 50, 0)
max 16058
dtype: object
print (s.index.names)
['S3bin2', 'S105_9bin2', None]
a,b,c = a['max'], a['idxmax'], s.index.names
d = 'Maximum failure ({0}) at {1[0]}({2[0]}) and {1[1]}({2[1]})'.format(a,b,c)
print (d)
Maximum failure (16058) at 50(S3bin2) and 50(S105_9bin2)
But if you want to test only column 0 or 1:
df = aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size().unstack()
#change 0 to 1 for test column 1
a = df[0].agg(['idxmax', 'max'])
print (a)
idxmax (50, 50)
max 16058
Name: 0, dtype: object
a,b,c = a['max'], a['idxmax'], df.index.names
d = 'Maximum failure ({0}) at {1[0]}({2[0]}) and {1[1]}({2[1]})'.format(a,b,c)
print (d)
Maximum failure (16058.0) at 50(S3bin2) and 50(S105_9bin2)
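If the goal is instead to see, for every (S3bin2, S105_9bin2) group, which class is most frequent and how often it occurs, a hedged sketch on the unstacked frame df would be:
most_frequent = pd.DataFrame({'class': df.idxmax(axis=1),   # column (0 or 1) with the highest count
                              'count': df.max(axis=1)})     # that highest count
print(most_frequent)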
