I have my code set up as follows
df = t_sepsis_col_adder(filename)
if df['SepsisLabel'].sum() > 0:
    cols = list(df.columns)
    cols_to_remove = ['Age', 'Gender', 'Unit1', 'Unit2', 'T_Sepsis', 'SepsisLabel', 'HospAdmTime']
    for col in cols:
        if df[col].isnull().all():
            cols_to_remove.append(col)
    for col in cols_to_remove:
        cols.remove(col)
    col_l, col_r = cols[:len(cols) // 2], cols[len(cols) // 2:]

    chart_l = alt.Chart(df).mark_line(point=True).encode(
        alt.X(alt.repeat("column"), type='quantitative', sort="descending"),
        alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
        order="T_Sepsis"
    ).properties(
        width=600,
        height=100
    ).repeat(
        row=col_l,
        column=['T_Sepsis']
    )
    chart_l.save("chart_l.png")

    chart_r = alt.Chart(df).mark_line(point=True).encode(
        alt.X(alt.repeat("column"), type='quantitative', sort="descending"),
        alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
        order="T_Sepsis"
    ).properties(
        width=600,
        height=100
    ).repeat(
        row=col_r,
        column=['T_Sepsis']
    )
    chart_concat = alt.hconcat(chart_l, chart_r)
I essentially have a lot of features to plot, so I decided to split the plots into two columns. The issue is that the line plots don't actually connect to the points most of the time; I will attach a screenshot below. Any ideas on how to fix this? The issue persists even if I stick to a single column of plots, so I don't think repeat is causing it. It's also worth noting that my data has a lot of NaN values, which I am deliberately not pre-processing since I need to plot the raw data. Thanks!
Part of the main chart
Edit: In addition to the above code, here is the function t_sepsis_col_adder.
def t_sepsis_col_adder(filename: str) -> pd.DataFrame:
    """Adds a column that gives the time till t_sepsis at each time step.

    Args:
        filename: name of the file being processed

    Returns:
        Returns a dataframe with the new column
    """
    df = df_instantiator_unaugmented(filename)
    number_of_sepsis_hours = df['SepsisLabel'].sum()
    if number_of_sepsis_hours > 0:
        if number_of_sepsis_hours >= 7:
            first_occurrence = df['SepsisLabel'].idxmax()
            t_sepsis = first_occurrence + 6
        else:
            t_sepsis = len(df.index) - 1
        temp_list = []
        for i in range(len(df.index)):
            temp_list.append(t_sepsis - i)
        df['T_Sepsis'] = temp_list
    else:
        t_end_recording = len(df.index)
        temp_list = []
        for i in range(len(df.index)):
            temp_list.append(t_end_recording - i - 1)
        df['T_EndRecording'] = temp_list
    return df
I can't confirm without a reproducible example, but this is likely due to there being NaNs in the data. Altair/Vega-Lite does not connect the line across NaN values by default, whereas values that are simply absent from the data would be connected. You can fix this by using dropna(subset=['column_name']) in pandas or .transform_filter('isValid(datum.column_name)') in Altair. Please see this answer for examples.
What I reference above also works for faceting; here is a reproducible example:
import pandas as pd
import altair as alt
import numpy as np
df = pd.DataFrame({'date': ['2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06',
'2020-04-03', '2020-04-04','2020-04-05','2020-04-06'],
'ID': ['a','a','a','a','b','b','b','b'],
'line': [8,np.nan,10,8, 4, 5,6,7] })
df
## out
date ID line
0 2020-04-03 a 8.0
1 2020-04-04 a NaN
2 2020-04-05 a 10.0
3 2020-04-06 a 8.0
4 2020-04-03 b 4.0
5 2020-04-04 b 5.0
6 2020-04-05 b 6.0
7 2020-04-06 b 7.0
transform_filter only excludes points from individual facets, not entire rows.
(alt.Chart(df).mark_line(point=True).encode(
alt.X('monthdate(date):O'), y='line:Q')
.transform_filter('isValid(datum.line)')
.facet('ID'))
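If you prefer to handle it on the pandas side instead, the equivalent is to drop the NaN rows for that column before building the chart. A minimal sketch reusing the same df and the 'line' column from the example above:
chart = alt.Chart(df.dropna(subset=['line'])).mark_line(point=True).encode(
    alt.X('monthdate(date):O'),
    y='line:Q'
).facet('ID')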
Related
I have many different tables that all have different column names, and each refers to an outcome, like glucose, insulin, leptin etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example (ignore that the function makes little sense). The code below works, but instead of copy-pasting final_report["outcome"] = ... over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
insulin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
leptin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id':ids,
'start':start,
'end':end})
def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
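If you prefer to avoid the explicit loop, a roughly equivalent sketch uses assign with a dict comprehension (same find_result and input_dfs as above; the df=df default argument just pins each dataframe defensively inside the lambda):
final_report = final_report.assign(**{
    f"{name}_result": final_report.apply(
        lambda x, df=df: find_result(x['id'], x['start'], x['end'], df),
        axis=1)
    for name, df in input_dfs.items()
})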
I have two data frames to which I am looking to apply two separate functions that will perform validation checks on each data frame independently; any differences that arise will then get concatenated into one transformed list.
The issue I am facing is that the first validation check should happen ONLY if numeric values exist in ALL of the numeric columns of whichever of the two data frames it is analyzing. If there are ANY NaN values in a row line for the first validation check, then that row should be skipped.
The second validation check does not need that specification.
Here are the data frames, functions, and transformations:
import pandas as pd
import numpy as np
df1 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.5,3,4,np.nan,3.5,1.5,2],'Amount':[40,19,np.nan,np.nan,60,70,80,np.nan,45,102],
'Quantity Frozen':[3,4,np.nan,15,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[37,12,np.nan,45,np.nan,61,np.nan,24,14,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,21,40]}
df1 = pd.DataFrame(df1, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
df2 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.6,3,4,np.nan,3.5,1.5,2],'Amount':[40,16,np.nan,np.nan,60,72,80,np.nan,45,100],
'Quantity Frozen':[3,4,np.nan,np.nan,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[np.nan,12,np.nan,45,np.nan,61,np.nan,24,15,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,20,40]}
df2 = pd.DataFrame(df2, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
#Validation Check 1:
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    dataset['dif_Stock on Hand'] = dataset['Quantity Fresh'] + dataset['Quantity Frozen']
    for varname, var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen': 'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
#Validation Check 2:
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    dataset['dif_Multiple'] = dataset['Price'] * dataset['Quantity Fresh']
    for varname, var in {'Multiple vs. Price x Quantity Fresh': 'dif_Multiple'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
# #Wrangling internal inconsistency data frames to be in correct format
inconsistency_vars = ['dif_Stock on Hand','dif_Multiple']
inconsistency_var_betternames = {'dif_Stock on Hand':'Stock on Hand = Quantity Fresh + Quantity Frozen','dif_Multiple':'Multiple = Price x Quantity on Hand'}
# #Rollup1
idvars1=['Fruits']
df1 = df1[idvars1 + inconsistency_vars]
df2 = df2[idvars1 + inconsistency_vars]
df1 = df1.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df2 = df2.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df1['dataset'] = 'Fruit Dataset1'
df2['dataset'] = 'Fruit Dataset2'
# #First table in Internal Inconsistencies Sheet (Table 5)
inconsistent = pd.concat([df1,df2])
inconsistent = inconsistent[['variable','Difference Magnitude','dataset','Fruits']]
inconsistent['variable'] = inconsistent['variable'].map(inconsistency_var_betternames)
inconsistent = inconsistent[inconsistent['Difference Magnitude'] != 0]
Here is the desired output, which for the first validation check skips rows in either data frame that have ANY NaN values in the numeric columns (every column but 'Fruits'):
#Desired output
inconsistent_true = {'variable': ["Stock on Hand = Quantity Fresh + Quantity Frozen","Stock on Hand = Quantity Fresh + Quantity Frozen","Multiple = Price x Quantity on Hand",
"Multiple = Price x Quantity on Hand","Multiple = Price x Quantity on Hand"],
'Difference Magnitude': [1,2,1,4.5,2.5],
'dataset':["Fruit Dataset1","Fruit Dataset1","Fruit Dataset2","Fruit Dataset2","Fruit Datset2"],
'Fruits':["Blueberry","Coconut","Blueberry","Cherry","Pear"]}
inconsistent_true = pd.DataFrame(inconsistent_true, columns = ['variable', 'Difference Magnitude','dataset','Fruits'])
A pandas function that may come in handy is pd.isnull(), which returns True for np.nan values.
For example, take df1:
pd.isnull(df1['Amount'][2])
True
This can be added as a check across all your numeric columns, and then you only use the rows whose 'numeric_check' value is 1:
df1['numeric_check'] = df1.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
                                 pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
                                 pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis=1)
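A more concise equivalent (sketch; it assumes every column except 'Fruits' is numeric) flags the same rows without spelling out each column:
# list the numeric columns once, then require all of them to be non-null
numeric_cols = ['Price', 'Amount', 'Quantity Frozen', 'Quantity Fresh', 'Multiple']
df1['numeric_check'] = df1[numeric_cols].notnull().all(axis=1).astype(int)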
Refer to the modified validation check 1:
#Validation Check 1:
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    if '1' in name:  # check to implement the condition for only df1
        # Adding the 'numeric_check' column to the dataset
        dataset['numeric_check'] = dataset.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
                                                 pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
                                                 pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis=1)
        # filter out NaN rows; they will not be considered for this check
        dataset = dataset.loc[dataset['numeric_check'] == 1]
    dataset['dif_Stock on Hand'] = dataset['Quantity Fresh'] + dataset['Quantity Frozen']
    for varname, var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen': 'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
I hope I got your intention.
# make boolean mask, True if all numeric values are not NaN
mask = df1.select_dtypes('number').notna().all(axis=1)
print(df1[mask])
Fruits Price Amount Quantity Frozen Quantity Fresh Multiple
0 Banana 2.0 40.0 3.0 37.0 74.0
1 Blueberry 1.5 19.0 4.0 12.0 17.0
5 Pineapple 4.0 70.0 9.0 61.0 244.0
9 Coconut 2.0 102.0 80.0 20.0 40.0
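A rough sketch of how that mask could be plugged into validation check 1, assuming it runs before the dif_ columns are added so that select_dtypes only sees the original numeric columns:
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    # keep only rows where every numeric column is non-null
    complete = dataset[dataset.select_dtypes('number').notna().all(axis=1)].copy()
    complete['dif_Stock on Hand'] = complete['Quantity Fresh'] + complete['Quantity Frozen']
    print('{} differences in Stock on Hand vs. Quantity Fresh + Quantity Frozen:'.format(name))
    print(complete['dif_Stock on Hand'].value_counts())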
So, for a forecasting project, I have a really long Dataframe of multiple time series of the following type (it has a numerical index):
date         time_series_id   value
2015-08-01   0                0
2015-08-02   0                1
2015-08-03   0                2
2015-08-04   0                3
2015-08-01   1                2
2015-08-02   1                3
2015-08-03   1                4
2015-08-04   1                5
My objective is to add 3 new columns to this dataset for each individual time series (each id), corresponding to trend, seasonal and resid.
Given the characteristics of the dataset, the series tend to have NaNs at the start and the end of the dates.
What I was trying to do was the following:
from statsmodels.tsa.seasonal import seasonal_decompose

df.assign(trend=lambda x: x.groupby("time_series_id")["value"].transform(
    lambda s: s.mask(~s.isna(), other=seasonal_decompose(
        s[~s.isna()], model='additive', extrapolate_trend='freq').trend)))
The expected output (trend value are not actual values) should be:
date         time_series_id   value   trend
2015-08-01   0                0       1
2015-08-02   0                1       1
2015-08-03   0                2       1
2015-08-04   0                3       1
2015-08-01   1                2       1
2015-08-02   1                3       1
2015-08-03   1                4       1
2015-08-04   1                5       1
But I get the following error message:
AttributeError: 'Int64Index' object has no attribute 'inferred_freq'
In a previous iteration of my code this worked for my individual time series data frames, since I had set the date column as the index of the data frame instead of keeping it as an additional column, so the "x" that the lambda function takes already had a datetime index appropriate for the seasonal_decompose function.
df.assign(
    trend=lambda x: x["value"].mask(~x["value"].isna(), other=seasonal_decompose(
        x["value"][~x["value"].isna()], model='additive', extrapolate_trend='freq').trend))
My questions are: first, is it possible to achieve this using groupby, or are there other possible approaches? Second, is it possible to handle this in a way that doesn't eat much memory? The original dataset I'm working on has approximately 1MM rows, so any help is really welcome :).
Did one of the already posed solutions work? If so, or if you found a different solution, please share. I tried each without success, but I'm new to Python so I'm probably missing something.
Here is what I came up with, using a for loop. For my dataset it took 8 minutes to decompose 20 million rows consisting of 6,000 different subsets. This works but I wish it were faster.
Date Time             Segment ID   Travel Time(Minutes)
2021-11-09 07:15:00   1            30
2021-11-09 07:30:00   1            18
2021-11-09 07:15:00   2            30
2021-11-09 07:30:00   2            17
import pandas as pd
import statsmodels.api as sm

# 'frame' is the full dataframe containing all segments
segments = set(frame['Segment ID'])
data = pd.DataFrame([])
for s in segments:
    df = frame[frame['Segment ID'] == s].set_index('Date Time').resample('H').mean()
    comp = sm.tsa.seasonal_decompose(x=df['Travel Time(Minutes)'], period=24*7, two_sided=False)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    # optional columns with some statistics to find outliers and trend changes
    df['resid zscore'] = (df['resid'] - df['resid'].mean()).div(df['resid'].std())
    df['trend pct_change'] = df.trend.pct_change()
    df['trend pct_change zscore'] = (df['trend pct_change'] - df['trend pct_change'].mean()).div(df['trend pct_change'].std())
    data = data.append(df.dropna())
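One caveat: DataFrame.append was removed in pandas 2.0. A drop-in variant of the loop above (same names, optional statistics columns omitted) collects the per-segment frames in a list and concatenates once at the end, which is also usually faster than growing a DataFrame inside the loop:
frames = []
for s in segments:
    df = frame[frame['Segment ID'] == s].set_index('Date Time').resample('H').mean()
    comp = sm.tsa.seasonal_decompose(x=df['Travel Time(Minutes)'], period=24*7, two_sided=False)
    # keep the decomposition components next to the resampled values
    frames.append(df.join(comp.trend).join(comp.seasonal).join(comp.resid).dropna())
data = pd.concat(frames)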
Where you have lambda x: x.groupby(..., you don't have anything to group; you are telling it to group a single row (I believe). You can try a setup like this, perhaps.
Here you define a function to act on the group you are sending via the apply() method. Then you should be able to use your original code.
I have not tested this, but I use this setup quite often to work on groups.
def trend_function(x):
    # do your lambda function here, since each group is sent to this function
    x = x.assign(
        trend=lambda x: x["value"].mask(~x["value"].isna(), other=seasonal_decompose(
            x["value"][~x["value"].isna()], model='additive', extrapolate_trend='freq').trend))
    return x

dfnew = df.groupby('time_series_id').apply(trend_function)
Use extrapolate_trend='freq' as a parameter, then pull the trend, seasonal and residual components out of the decomposition and plot them:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
import statsmodels.api as sm
date=['2015-08-01','2015-08-02','2015-08-03','2015-08-04','2015-08-01','2015-08-02','2015-08-03','2015-08-04']
time_series_id=[0,0,0,0,1,1,1,1]
value=[0,1,2,3,2,3,4,5]
df=pd.DataFrame({'date':date,'time_series_id':time_series_id,'value':value})
df['date']=pd.to_datetime(df['date'])
df=df.set_index('date')
print(df)
index_day = df.index.day
value_by_day = df.groupby(index_day)['value'].mean()
fig,ax = plt.subplots(figsize=(12,4))
value_by_day.plot(ax=ax)
plt.title('value by day')
plt.show()
df[['value']].boxplot()
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].hist(ax=ax, bins=5)
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].plot(kind='density', ax=ax)
plt.show()
plt.clf()
fig,ax = plt.subplots(figsize=(12,4))
plt.style.use('seaborn-pastel')
fig = tsaplots.plot_acf(df['value'], lags=4,ax=ax)
plt.show()
decomposition=sm.tsa.seasonal_decompose(x=df['value'],model='additive', extrapolate_trend='freq', period=1)
decomposition.plot()
plt.show()
decomposition_trend=decomposition.trend
ax= decomposition_trend.plot(figsize=(14,2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')
plt.show()
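If you also want the components back in the dataframe rather than only plotted, a small follow-up sketch (using .values to sidestep index alignment on the duplicated dates):
df['trend'] = decomposition.trend.values
df['seasonal'] = decomposition.seasonal.values
df['resid'] = decomposition.resid.values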
I changed the first piece of code according to my scenario.
Here's my code and the attached output:
data = pd.DataFrame([])
segments = set(subset['Planning_Material'])
for s in segments:
    df = subset[subset['Planning_Material'] == s].set_index('Cal_year_month').resample('M').sum()
    comp = sm.tsa.seasonal_decompose(df)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    df['Planning_Material'] = s
    data = pd.concat([data, df])
data = data.reset_index()
data = data[['Planning_Material', 'Cal_year_month', 'Historical_demand', 'trend', 'seasonal', 'resid']]
data
I have two dataframes
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now, in the first dataframe, I create a column:
test1['values'] = np.zeros(10)
I want to fill this column so that next to each date there is the index of the closest date from the second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course my real data is not evenly spaced and has minutes and seconds, but the idea is the same. I use the following code:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

for k in range(10):
    a = nearest(test2['time'], test1['time'][k])     # find nearest timestamp from second dataframe
    b = test2.index[test2['time'] == a].tolist()[0]  # identify the index of this timestamp
    test1['values'][k] = b                           # assign this value to the cell
This code is very slow on large datasets, how can I make it more efficient?
P.S. timestamps in my real data are sorted and increasing just like in these artificial examples.
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
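If apply turns out to be too slow, the same argmin idea can be fully vectorized by broadcasting the two time columns against each other. This is only a sketch for moderately sized frames, since memory grows as len(test1) x len(test2):
# absolute time differences for every pair of rows, then the column index of the minimum
diffs = np.abs(test1['time'].values[:, None] - test2['time'].values[None, :])
test1['values'] = diffs.argmin(axis=1)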
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days
if offset < 0:  # test1 time starts before test2 time, prepend zeros:
    offset = abs(offset)
    idx = np.append(np.zeros(offset), np.arange(len(test1['time']) - offset)).astype(int)
else:  # test1 time starts after or with test2 time, use arange right away:
    idx = np.arange(offset, offset + len(test1['time']))
test1['values'] = idx
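Since both time columns are sorted, another option worth considering (not tested against your real data) is pandas' merge_asof with direction='nearest', which does the nearest-timestamp lookup in a vectorized way; the nearest_idx column name below is just for illustration:
# expose test2's index as a column, then join on the nearest 'time'
t2 = test2.reset_index().rename(columns={'index': 'nearest_idx'})
merged = pd.merge_asof(test1, t2, on='time', direction='nearest')
test1['values'] = merged['nearest_idx'].values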
I already asked a similar question and was able to piece some more of it together, but I need more help: Determining how one date/time range overlaps with the second date/time range?
I want to be able to check when two date ranges, each with a start date/time and an end date/time, overlap. My type2 has about 50 rows while type1 has over 500. I want to take the start and end of each type2 record and see whether it falls within a type1 range. Here is a snip of the data; the dates do change down the list from 2019-04-01 to the following days.
type1 type1_start type1_end
a 2019-04-01T00:43:18.046Z 2019-04-01T00:51:35.013Z
b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z
c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z
d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z
type2 type2_start type2_end
1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z
2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z
3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z
4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z
I have been googling the best way to do this and have read through Determine Whether Two Date Ranges Overlap, and I understand how it should be done, but I don't know enough about how to call the variables and make them work.
Here is what I have, but I am stuck and have no clue where to go from here:
import pandas as pd
from pandas import Timestamp
import numpy as np
from collections import namedtuple
colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
data = pd.read_csv('test.csv', names=colnames, parse_dates=['type1_start', 'type1_end', 'type2_start', 'type2_end'])

A_start = data['type1_start']
A_end = data['type1_end']
B_start = data['type2_start']
B_end = data['type2_end']
t1 = data['type1']
t2 = data['type2']

r1 = (B_start, B_end)
r2 = (A_start, A_end)

def doesOverlap(r1, r2):
    if B_start > A_start:
        swap(r1, r2)
    if A_start > B_end:
        return false
    return true
It would be nice to have a csv with a result of true or false for each overlap. I was also able to get my data to run using Efficiently find overlap of date-time ranges from 2 dataframes, but the results aren't correct. I added a couple of rows that I know should overlap to the data, and it didn't work. I need each type2 start/end to be checked against each type1.
Any help would be greatly appreciated.
Here is one way to do it:
import pandas as pd
def overlaps(row):
    if ((row['type1_start'] < row['type2_start'] and row['type2_start'] < row['type1_end'])
            or (row['type1_start'] < row['type2_end'] and row['type2_end'] < row['type1_end'])):
        return True
    else:
        return False
colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
df = pd.read_csv('test.csv', names=colnames, parse_dates=[
'type1_start', 'type1_end', 'type2_start', 'type2_end'])
df['overlap'] = df.apply(overlaps, axis=1)
print('\n', df)
gives:
type1 type1_start type1_end type2 type2_start type2_end overlap
0 type1 type1_start type1_end type2 type2_start type2_end False
1 a 2019-03-01T00:43:18.046Z 2019-04-02T00:51:35.013Z 1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z True
2 b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z 2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z False
3 c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z 3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z False
4 d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z 4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z False
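For reference, the standard interval-overlap condition (two ranges overlap when each one starts before the other ends) can also be expressed without a row-wise apply, and it additionally catches the case where one range fully contains the other. This assumes the four timestamp columns are parsed as datetimes (ISO-format strings compare the same way lexicographically):
df['overlap'] = (df['type1_start'] < df['type2_end']) & (df['type2_start'] < df['type1_end'])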
Below df1 contains type1 records and df2 contains type2 records:
df_new = df1.assign(key=1)\
.merge(df2.assign(key=1), on='key')\
.assign(has_overlap=lambda x: ~((x.type2_start > x.type1_end) | (x.type2_end < x.type1_start)))
REF: Performant cartesian product (CROSS JOIN) with pandas
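If you then want one verdict per type2 record (does it overlap any type1 interval), a short follow-up sketch on the cross-joined frame could look like this:
overlap_per_type2 = df_new.groupby('type2')['has_overlap'].any().reset_index()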