Speeding up a Python datetime comparison to generate values

I want to do this in a pythonic way without 1) nested if statements and 2) iterrows.
I have columns:
Date In   Date Out   1/22   2/22   ...   12/22
1/1/19    5/5/22
5/5/22    7/7/22
For columns like '1/22', I want to insert a calculated value, which would be one of the following:
Not Created Yet
Closed
Open
For the first row, column 1/22 would read "Open" because the record was already open in Jan 2022. This would continue until column 5/22, at which point it would be labeled "Closed."
For the second row, column 1/22 would read "Not Created Yet" until 5/22, which would read "Open", until 7/22, which would have the value "Closed."
I don't need the full table necessarily, but I want to get a count of how many are closed/open/not created yet for every month.
Here is the code I'm using; it works, but takes longer than I think it should:
table = {}
for i in mcLogsClose.iterrows():
    table[i[0]] = {}
    for month in pd.date_range(start='9/2021', end='9/2022', freq='M'):
        if i[1]['Notif Date'] <= month:
            if i[1]['Completion Date'] <= month:
                table[i[0]][month] = "Closed"
            else:
                table[i[0]][month] = "Open"
        else:
            table[i[0]][month] = "Not Yet Created"
I then want to run table['1/22'].value_counts()
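(As a sketch of that last step: since table is a nested dict, it would first be wrapped in a DataFrame, because value_counts is a pandas method, not a dict method:)
status = pd.DataFrame.from_dict(table, orient="index")   # rows = log entries, columns = months
counts = status.apply(pd.Series.value_counts).fillna(0)  # Open/Closed/Not Yet Created per month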
Thanks for your attention!

1. Using a loop
import pandas as pd

# The date range you are calculating for
min_date = pd.Period("2022-01")
max_date = pd.Period("2022-12")
span = (max_date - min_date).n + 1

# Strip the "Date In" and "Date Out" columns down to the month
date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

data = []
for d_in, d_out in zip(date_in, date_out):
    if d_in > max_date:
        # If date in is after max date, the whole span is under "Not Created" status
        data.append((span, 0, 0))
    elif d_out < min_date:
        # If date out is before min date, the whole span is under "Closed" status
        data.append((0, span, 0))
    else:
        # Now that we have some overlap between (d_in, d_out) and (min_date,
        # max_date), we need to calculate time spent in each status
        closed = (max_date - min(d_out, max_date)).n
        not_created = (max(d_in, min_date) - min_date).n
        open_ = span - closed - not_created
        data.append((not_created, closed, open_))

cols = ["Not Created Yet", "Closed", "Open"]
df[cols] = pd.DataFrame(data, columns=cols, index=df.index)
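(For a quick sanity check, df can be built from the two rows in the question before running the snippet above; a sketch:)
df = pd.DataFrame({"Date In": ["1/1/19", "5/5/22"],
                   "Date Out": ["5/5/22", "7/7/22"]})
# ...run the loop above, then:
print(df[["Not Created Yet", "Closed", "Open"]])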
2. Using numpy
import numpy as np
import pandas as pd

def to_n(arr: np.ndarray) -> np.ndarray:
    """Convert an array of pd.Period to an array of integers."""
    return np.array([i.n for i in arr])

# The date range you are calculating for. Since we intend to use vectorized
# code, we need to turn them into numpy arrays
min_date = np.repeat(pd.Period("2022-01"), len(df))
max_date = np.repeat(pd.Period("2022-12"), len(df))
span = to_n(max_date - min_date) + 1

date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

df["Not Created Yet"] = np.where(
    date_in > max_date,
    span,
    to_n(np.max([date_in, min_date], axis=0) - min_date),
)
df["Closed"] = np.where(
    date_out < min_date,
    span,
    to_n(max_date - np.min([date_out, max_date], axis=0)),
)
df["Open"] = span - df["Not Created Yet"] - df["Closed"]
Result (with some rows added for my testing):
  Date In  Date Out  Not Created Yet  Closed  Open
0  1/1/19    5/5/22                0       7     5
1  5/5/22    7/7/22                4       5     3
2  1/1/20  12/12/20                0      12     0
3  1/1/23    6/6/23               12       0     0
4  6/6/21    6/6/23                0       0    12
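(An aside, not part of the answer above: if the per-month status counts from the original question are the real goal, a broadcast comparison over a PeriodIndex gives them directly. A sketch, assuming the same df; the month of Date Out counts as "Closed", mirroring the question's labeling:)
months = pd.period_range("2022-01", "2022-12", freq="M")
d_in = pd.to_datetime(df["Date In"]).dt.to_period("M").to_numpy()[:, None]
d_out = pd.to_datetime(df["Date Out"]).dt.to_period("M").to_numpy()[:, None]
# Broadcast each row's in/out month against every month in the range
status = np.where(d_in > months.to_numpy(), "Not Created Yet",
                  np.where(d_out <= months.to_numpy(), "Closed", "Open"))
# One value_counts per month column
counts = (pd.DataFrame(status, columns=months.astype(str))
            .apply(pd.Series.value_counts)
            .fillna(0)
            .astype(int))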

Related

Poor performance filtering one dataframe with another

I have two dataframes: one holds unique records of episodic data, the other lists of events. There are multiple events per episode. I need to loop through the episode data, find all the events that correspond to each episode, and write the resulting events to a new dataframe.
There are around 4,000 episodes and 20,000 events, and the process is painfully slow because for each episode I am searching all 20,000 events. I am guessing there is a way to reduce the number of events searched in each loop by removing the matched ones, but I am not sure. This is my code (there is additional filtering to assist with matching):
for idx, row in episode_df.iterrows():
    total_episodes += 1
    icu_admission = datetime.strptime(row['ICU_ADM'], '%d/%m/%Y %H:%M:%S')
    tmp_df = event_df.loc[event_df['ur'] == row['HRN']]
    if len(tmp_df.index) < 1:
        empty_episodes += 1
        continue
    # Loop through temp dataframe and write all records with an admission date
    # close to icu_admission to new dataframe
    for idx_a, row_a in tmp_df.iterrows():
        admission = datetime.strptime(row_a['admission'], '%Y-%m-%d %H:%M:%S')
        difference = admission - icu_admission
        if abs(difference.total_seconds()) > 14400:
            continue
        new_df = new_df.append(row_a)
        selected_records += 1
A simplified version of the dataframes:
episode_df:
episode_no  HRN    name  ICU_ADM
1           12345  joe   date1
2           78124  ann   date1
3           98374  bill  date2
4           76523  lucy  date3
event_df:
episode_no  ur     admission
1           12345  date1
1           12345  date1
1           12345  date5
7           67899  date9
Not all episodes have events and only events with episodes need to be copied.
This could work:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['ICU_ADM'] = [pd.to_datetime(f'2020-01-{x}') for x in range(1,10)]
df1['test_day'] = df1['ICU_ADM'].dt.day
df2 = pd.DataFrame()
df2['admission'] = [pd.to_datetime(f'2020-01-{x}') for x in range(2,10,3)]
df2['admission_day'] = df2['admission'].dt.day
df2['random_val'] = np.random.rand(len(df2),1)
pd.merge_asof(df1, df2, left_on=['ICU_ADM'], right_on=['admission'], tolerance=pd.Timedelta('1 day'))
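(Note that merge_asof keeps only the single nearest event per episode. Since the original loop keeps every event within four hours, a plain merge followed by a filter is closer to those semantics; a sketch using the column names from the question:)
episode_df['ICU_ADM'] = pd.to_datetime(episode_df['ICU_ADM'], format='%d/%m/%Y %H:%M:%S')
event_df['admission'] = pd.to_datetime(event_df['admission'], format='%Y-%m-%d %H:%M:%S')
merged = event_df.merge(episode_df[['HRN', 'ICU_ADM']],
                        left_on='ur', right_on='HRN', how='inner')
new_df = merged[(merged['admission'] - merged['ICU_ADM']).abs()
                <= pd.Timedelta(seconds=14400)]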

Calculate time difference between all dates in column python

I have a data frame that looks like this:
group  date                 value
g_1    1/2/2019 11:03:00    3
g_1    1/2/2019 11:04:00    5
g_1    1/2/2019 10:03:32    100
g_2    4/3/2019 09:11:09    46
I want to calculate the time difference between occurrences (in seconds) per group.
Example output:
groups_time_diff = {'g_1': [23, 5666, 7878], 'g_2': [0.2, 56, 2343], ...}
This is my code:
groups_time_diff = defaultdict(list)
for group in tqdm(groups):
    group_df = unit_df[unit_df['group'] == group]
    dates = list(group_df['time'])
    while len(dates) != 0:
        min_date = min(dates)
        dates.remove(min_date)
        if len(dates) > 0:
            second_min_date = min(dates)
            date_diff = second_min_date - min_date
            groups_time_diff[group].append(date_diff.seconds)
This takes forever to run and I am looking for a more time efficient way to get the desired output.
Any ideas?
Try this:
sorted_group_df = group_df.sort_values(by='time', ascending=True)
dates = sorted_group_df['time']
one = dates[1:].reset_index(drop=True)
two = dates[:-1].reset_index(drop=True)
date_difference = one - two
date_difference_in_seconds = date_difference.dt.seconds
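(Equivalently, pandas has a built-in for consecutive differences; a one-line variant of the above, where total_seconds() also avoids the day-wraparound quirk of .dt.seconds:)
date_difference_in_seconds = sorted_group_df['time'].diff().dropna().dt.total_seconds()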
Try sorting your dates first, then subtract the two shifted series (going through .values so pandas does not re-align them by index):
dates = dates.sort_values()
dates[1:].values - dates[:-1].values
You are using the min function twice in each iteration, which is not efficient.
Hope this helps.
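(Both answers handle one group at a time. A sketch of a fully vectorized per-group version, assuming columns named group and date as in the question:)
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['group', 'date'])
df['diff_s'] = df.groupby('group')['date'].diff().dt.total_seconds()
groups_time_diff = (df.dropna(subset=['diff_s'])
                      .groupby('group')['diff_s'].apply(list).to_dict())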

Group by date range in pandas dataframe

I have time series data in pandas, and I would like to group by a certain time window in each year and calculate its min and max.
For example:
times = pd.date_range(start = '1/1/2011', end = '1/1/2016', freq = 'D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
How can I group by a time window, e.g. 'Jan-10':'Mar-21', for each year and calculate the min and max of the value column?
You can use the resample method.
df.resample('5d').agg(['min','max'])
I'm not sure if there's a direct way to do it without first creating a flag for the days required. The following function creates that flag:
# Function for flagging the days required
def flag(x):
    if x.month == 1 and x.day >= 10: return True
    elif x.month in [2, 3, 4]: return True
    elif x.month == 5 and x.day <= 21: return True
    else: return False
Since you need this for each year, it would be a good idea to have the year as a column.
Then the min and max for each year for given periods can be obtained with the code below:
times = pd.date_range(start='1/1/2011', end='1/1/2016', freq='D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
df['Year'] = df.index.year
pd.pivot_table(df[list(pd.Series(df.index).apply(flag))],
               values=['value'], index=['Year'], aggfunc=[min, max])
The output will be one row per year with the min and max of value over the flagged window. [Sample output image omitted.]
Hope that answers your question... :)
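(The row-wise apply can also be replaced with a vectorized mask on the DatetimeIndex; a sketch covering the same Jan-10 to May-21 window as the flag function above:)
mask = (((df.index.month == 1) & (df.index.day >= 10))
        | df.index.month.isin([2, 3, 4])
        | ((df.index.month == 5) & (df.index.day <= 21)))
sub = df.loc[mask]
sub.groupby(sub.index.year)['value'].agg(['min', 'max'])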
You can define the bin edges, then throw out the bins you don't need (every other one) with .loc[::2, :]. Here I'll define two functions just to check we're getting the date ranges we want within groups (note that since left bin edges are open, we need to subtract one day):
import pandas as pd

edges = pd.to_datetime([x for year in df.index.year.unique()
                        for x in [f'{year}-02-09', f'{year}-03-21']])

def min_idx(x):
    return x.index.min()

def max_idx(x):
    return x.index.max()

df.groupby(pd.cut(df.index, bins=edges)).agg([min_idx, max_idx, min, max]).loc[::2, :]
Output:
                              value
                            min_idx     max_idx       min       max
(2011-02-09, 2011-03-21] 2011-02-10  2011-03-21  0.009343  0.990564
(2012-02-09, 2012-03-21] 2012-02-10  2012-03-21  0.026369  0.978470
(2013-02-09, 2013-03-21] 2013-02-10  2013-03-21  0.039491  0.946481
(2014-02-09, 2014-03-21] 2014-02-10  2014-03-21  0.029161  0.967490
(2015-02-09, 2015-03-21] 2015-02-10  2015-03-21  0.006877  0.969296
(2016-02-09, 2016-03-21]        NaT         NaT       NaN       NaN

New column based off certain input parameter to select what columns to use - Python

I have a pandas dataframe that includes multiple columns of monthly finance data, plus a period input that is specified by the person running the program. It's currently just saved as period, as shown below within the code.
#coded into python
period = ?? (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation of other columns.
So, in the above table, I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 was used, calc1 below would be completed (with the math actually done); period = 2 would use calc2, and period = 3 calc3. I only need one column calculated, depending on the period number, but I added three examples in the picture below to show how it would work.
I can do this in SQL using CASE WHEN: using the input period, I sum whichever columns I need.
select Account #,
       '&Period' AS Period,
       '&Year' AS YR,
       case
           when '&Period' = '1' then sum(d_cf + d_1)
           when '&Period' = '2' then sum(d_cf + d_1 + d_2)
           when '&Period' = '3' then sum(d_cf + d_1 + d_2 + d_3)
I am unsure how to do this easily in Python (newer learner). Yes, I could create a column for every possible period (1-12), one per calculation, and then select only the one I need, but I'd like to learn a more efficient way.
Can you help more or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
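(For instance, assigning the result to the new column the question asks for; a sketch with an example period:)
period = 2  # user input
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)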
You can do this using a simple function in Python:
def get_calculation(df, period=None):
    '''
    df = pandas data frame
    period = integer type
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)

new_df = get_calculation(df, period=1)
Setup:
df = pd.DataFrame({'d_0': list(range(1, 7)),
                   'd_1': list(range(10, 70, 10)),
                   'd_2': list(range(100, 700, 100)),
                   'd_3': list(range(1000, 7000, 1000))})
Setup:
import pandas as pd

ddict = {
    'Year': ['2018', '2018', '2018', '2018', '2018'],
    'Account_Num': ['1111', '1122', '1133', '1144', '1155'],
    'd_cf': ['1', '2', '3', '4', '5'],
}
data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
    # Convert the period to a string so we can iterate over its digits
    s = str(period)
    # The repeat count is the period number plus one
    n = int(period) + 1
    # Repeat each digit of the period n times
    return ''.join([i * n for i in s])
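(So, for example:)
get_calcs(3)  # -> '3333': the digit repeated period + 1 times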
Main function copies the data frame, iterates through period values, and sets calculated values to the correct spot index-wise for each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Create a temporary column
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator; capture the result
        calculated_value = get_calcs(i)
        # Create a new column based on our period number
        df[new_column] = ''
        # Use indexing to place the calculated value into our desired location
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
   Year Account_Num d_cf
0  2018        1111    1
1  2018        1122    2
2  2018        1133    3
3  2018        1144    4
4  2018        1155    5
Result:
process_data(data)
   Year Account_Num d_cf d_1  d_2   d_3    d_4     d_5
0  2018        1111    1  11
1  2018        1122    2      222
2  2018        1133    3           3333
3  2018        1144    4                 44444
4  2018        1155    5                        555555

Python Pandas: Count quarterly occurrence from start and end date range

I have a dataframe of jobs for different people, with start and end times for each job. I'd like to count, every four months, how many jobs each person is responsible for. I figured out a way to do it, but I'm sure it's tremendously inefficient (I'm new to pandas); it takes quite a while to compute when I run the code on my complete dataset (hundreds of people and jobs).
Here is what I have so far.
# create a data frame
import pandas as pd
import numpy as np

df = pd.DataFrame({'job': pd.Categorical(['job1', 'job2', 'job3', 'job4']),
                   'person': pd.Categorical(['p1', 'p1', 'p2', 'p2']),
                   'start': ['2015-01-01', '2015-06-01', '2015-01-01', '2016-01-01'],
                   'end': ['2015-07-01', '2015-12-31', '2016-03-01', '2016-12-31']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
This gives me a dataframe with parsed start and end dates. I then create a new dataset with:
bdate = min(df['start'])
edate = max(df['end'])
dates = pd.date_range(bdate, edate, freq='4MS')
people = sorted(set(list(df['person'])))
df2 = pd.DataFrame(np.zeros((len(dates), len(people))), index=dates, columns=people)

for d in pd.date_range(bdate, edate, freq='MS'):
    for p in people:
        contagem = df[(df['person'] == p) &
                      (df['start'] <= d) &
                      (df['end'] >= d)]
        pos = np.argmin(np.abs(dates - d))
        df2.iloc[pos][p] = len(contagem.index)
df2
And I get the counts I'm after. But I'm sure there must be a better way of doing this without having to loop through all dates and persons. How?
This answer assumes that each job-person combination is unique. It creates a series for every row, with the value equal to the job and an index that expands the dates. Then it resamples every 4 months (which is not quarterly, but is what your solution describes) and counts the unique non-NA occurrences.
def make_date_range(x):
    return pd.Series(index=pd.date_range(x.start.values[0], x.end.values[0], freq='M'),
                     data=x.job.values[0])

# Iterate through each job-person combo and make an entry for each month with the job as the value
df1 = df.groupby(['job', 'person']).apply(make_date_range).unstack('person')
# remove outer level from index
df1.index = df1.index.droplevel('job')
# resample every 4 months, counting only unique values
df1.resample('4MS').agg(lambda x: len(x[x.notnull()].unique()))
Output
person      p1  p2
2015-01-01   1   1
2015-05-01   2   1
2015-09-01   1   1
2016-01-01   0   2
2016-05-01   0   1
2016-09-01   0   1
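(An aside, not from the answer: the lambda is equivalent to the built-in nunique, which skips NaN by default:)
df1.resample('4MS').nunique()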
And here is a long one-line solution that iterates over every row, creates a new dataframe for each, stacks them all together via pd.concat, and then resamples.
pd.concat([pd.DataFrame(index=pd.date_range(tup.start, tup.end, freq='4MS'),
                        data=tup.job,
                        columns=[tup.person]) for tup in df.itertuples()])\
    .resample('4MS').count()
And another one that is faster:
df1 = pd.melt(df, id_vars=['job', 'person'], value_name='date').set_index('date')
# pd.Grouper replaces the now-deprecated pd.TimeGrouper
g = df1.groupby([pd.Grouper(freq='4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)
