How to identify edge date cases using pandas - python

I have a dataframe like as shown below
df1 = pd.DataFrame({'person_id': [11,11,11,11,11,12,12,12,12,12,13,13,13,13,14,14,14,14,14],
'date_birth': ['12/31/1961','01/01/1961','10/21/1961','12/11/1961','02/11/1961',
'05/29/1967','01/29/1967','04/29/1967','03/19/1967','01/01/1957',
'12/31/1959','01/01/1959','01/01/1959','07/27/1959',
'01/01/1957','01/01/1957','12/31/1957','12/31/1958','01/01/1957']})
df1 = df1.melt('person_id', value_name='dates')
df1['dates'] = pd.to_datetime(df1['dates'])
My objective is to identify the edge cases in this data frame.
An edge case is defined as a scenario when a subject has both Jan 1st and Dec 31st in their dates column.
For instance, in the sample data frame, person_id = 11 is an edge case because they have both Jan 1st and Dec 31st in their dates column, whereas person_id = 12 is not an edge case because they don't have both Dec 31st and Jan 1st.
This is what I tried
op_df = df1.groupby(['person_id'], sort=False).apply(lambda x: x.sort_values(['dates'], ascending=True)).reset_index(drop=True)
op_df['day'] = op_df.dates.dt.day
op_df['month'] = op_df.dates.dt.month
op_df['points'] = np.where(((op_df['day'] == 1) & (op_df['month'] == 1)) & ((op_df['day'] == 31) & (op_df['month'] == 12)),'edge','No')
But the code above doesn't filter correctly. It returns No for all my person_ids.
I expect my output to be as below

The problem is that no single row can be both Jan 1st and Dec 31st, so combining the two conditions with & is always False. Chain them with | (OR) instead:
op_df = df1.sort_values(['person_id','dates'])
op_df['day'] = op_df.dates.dt.day
op_df['month'] = op_df.dates.dt.month
op_df['points'] = np.where(((op_df['day'] == 1) & (op_df['month'] == 1)) | ((op_df['day'] == 31) & (op_df['month'] == 12)),'edge','No')
If you need a separate column for each edge, first create two boolean columns from the masks, aggregate with sum to count the True values per person, and then add an Edge column with DataFrame.insert: No if either count is 0, otherwise Yes:
#instead groupby + sort_values use sort_values by 2 columns
op_df = df1.sort_values(['person_id','dates'], ascending=True)
day = op_df.dates.dt.day
month = op_df.dates.dt.month
op_df['1.1'] = (day == 1) & (month == 1)
op_df['31.12'] = (day == 31) & (month == 12)
op_df = op_df.groupby('person_id', as_index=False)[['1.1','31.12']].sum()
op_df.insert(1, 'Edge', np.where(op_df[['1.1','31.12']].eq(0).any(axis=1),'No','Yes'))
print (op_df)
   person_id Edge  1.1  31.12
0         11  Yes    1      1
1         12   No    1      0
2         13  Yes    2      1
3         14  Yes    3      2
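If only the per-person flag is needed, the counts can be skipped. The following is a minimal sketch, assuming the sample data from the question; the names is_jan1, is_dec31 and edge are made up for illustration. It checks each edge date with groupby(...).any() and combines the two results:

```python
import pandas as pd

df1 = pd.DataFrame({
    'person_id': [11] * 5 + [12] * 5 + [13] * 4 + [14] * 5,
    'date_birth': ['12/31/1961', '01/01/1961', '10/21/1961', '12/11/1961', '02/11/1961',
                   '05/29/1967', '01/29/1967', '04/29/1967', '03/19/1967', '01/01/1957',
                   '12/31/1959', '01/01/1959', '01/01/1959', '07/27/1959',
                   '01/01/1957', '01/01/1957', '12/31/1957', '12/31/1958', '01/01/1957'],
})
dates = pd.to_datetime(df1['date_birth'], format='%m/%d/%Y')

# Row-level masks for the two edge dates
is_jan1 = (dates.dt.month == 1) & (dates.dt.day == 1)
is_dec31 = (dates.dt.month == 12) & (dates.dt.day == 31)

# A person is an edge case only if both masks are True somewhere in their rows
by_person = df1['person_id']
edge = is_jan1.groupby(by_person).any() & is_dec31.groupby(by_person).any()
print(edge)
```

The result is a boolean Series indexed by person_id, which can be mapped back onto the original rows with df1['person_id'].map(edge) if a row-level column is wanted.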

Related

Change value based on condition on slice of dataframe

I have a dataframe like this:
df = pd.DataFrame(columns=['Dog', 'Small', 'Adult'])
df.Dog = ['Poodle', 'Shepard', 'Bird dog','St.Bernard']
df.Small = [1,1,0,0]
df.Adult = 0
That will look like this:
Dog Small Adult
0 Poodle 1 0
1 Shepard 1 0
2 Bird dog 0 0
3 St.Bernard 0 0
Then I would like to change one column based on another. I can do that:
df.loc[df.Small == 0, 'Adult'] = 1
However, I only want to do so for the first three rows.
I can select the first three rows:
df.iloc[0:3]
But if I try to change values on the first three rows:
df.iloc[0:3, df.Small == 0, 'Adult'] = 1
I get an error.
I also get an error if I merge the two:
df.iloc[0:3].loc[df.Small == 0, 'Adult'] = 1
It tells me that I am trying to set a value on a copy of a slice.
How should I do this correctly?
You could include the range as another condition in your .loc selection (for the general case, I'll explicitly include the 0):
df.loc[(df.Small == 0) & (0 <= df.index) & (df.index <= 2), 'Adult'] = 1
Another option is to transform the index into a series to use pd.Series.between:
df.loc[(df.Small == 0) & (df.index.to_series().between(0, 2)), 'Adult'] = 1
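Conditions on the index like these only work if the index is already sorted. Alternatively, you can take the labels of the matching rows and slice them. Note that this picks the first two rows where Small == 0, wherever they are in the frame: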
ind = df[df.Small == 0].index[:2]
df.loc[ind, 'Adult'] = 1
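To restrict by position rather than by label, one option is to build a positional mask and combine it with the condition. This is a small sketch, assuming the sample frame from the question; in_first_three is a name made up here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Dog': ['Poodle', 'Shepard', 'Bird dog', 'St.Bernard'],
    'Small': [1, 1, 0, 0],
    'Adult': 0,
})

# True for the first three rows by position, whatever the index labels are
in_first_three = np.arange(len(df)) < 3

# Combine the positional mask with the condition and assign in one .loc call
df.loc[(df['Small'] == 0).to_numpy() & in_first_three, 'Adult'] = 1
print(df)
```

Because the assignment goes through a single .loc call on the original frame, it does not trigger the "copy of a slice" warning and does not depend on the index being sorted.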

How to discretize a datetime column?

I have a dataset with a datetime column covering one month, and I need to divide it into two blocks (day and night, or AM/PM) and then discretize the time in each block into 10-minute bins. I can add a 0/1 column to show whether it is AM or PM, but I cannot work out how to discretize it. Can you please help me with it?
df['started_at'] = pd.to_datetime(df['started_at'])
df['start hour'] = df['started_at'].dt.hour.astype('int')
df['mor/aft'] = np.where(df['start hour'] < 12, 1, 0)
df['started_at']
0 16:05:36
2 06:22:40
3 16:08:10
4 12:28:57
6 15:47:30
...
3084526 15:24:24
3084527 16:33:07
3084532 14:08:12
3084535 09:43:46
3084536 17:02:26
If I understood correctly you are trying to add a column for every interval of ten minutes to indicate if an observation is from that interval of time.
You can use lambda expressions to loop through each observation from the series.
Dividing by 10 and making this an integer gives the first digit of the minutes, based on which you can add indicator columns.
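I also included how to extract the day indicator column with a lambda expression for you to compare. It plays the same role as your np.where(), though note that your version marks the morning (hour < 12) with 1 while this one marks the afternoon.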
import pandas as pd
# make dataframe
df = pd.DataFrame({
    'started_at': ['14:20:56',
                   '00:13:24',
                   '16:01:33']
})
# convert column to datetime
df['started_at'] = pd.to_datetime(df['started_at'])
# make day indicator column (1 from 12:00 onwards, 0 before noon)
df['day'] = df['started_at'].apply(lambda ts: 1 if ts.hour >= 12 else 0)
# make indicator column for every ten minutes
for i in range(24):
    for j in range(6):
        col = 'hour_' + str(i) + '_min_' + str(j) + '0'
        df[col] = df['started_at'].apply(lambda ts: 1 if ts.minute // 10 == j and ts.hour == i else 0)
print(df)
Output first columns:
started_at day hour_0_min_00 hour_0_min_10 hour_0_min_20
0 2021-11-21 14:20:56 1 0 0 0
1 2021-11-21 00:13:24 0 0 1 0
2 2021-11-21 16:01:33 1 0 0 0
...
...
...
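If a single bin number per row is enough, the 144 indicator columns can be avoided. This is a rough sketch, assuming the times are plain HH:MM:SS strings as in the example above; the column names is_pm and bin_10min are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'started_at': ['14:20:56', '00:13:24', '16:01:33']})
ts = pd.to_datetime(df['started_at'], format='%H:%M:%S')

# 1 for PM (12:00 onwards), 0 for AM
df['is_pm'] = (ts.dt.hour >= 12).astype(int)

# 10-minute bin inside the 12-hour block: 6 bins per hour, numbered 0..71
df['bin_10min'] = (ts.dt.hour % 12) * 6 + ts.dt.minute // 10
print(df)
```

If indicator columns are still wanted afterwards, pd.get_dummies(df['bin_10min']) would produce them from the bin number.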

for loop indexing in python

What's wrong with my for loop? I am getting problem related to array indexing.
How can I fix the indexing problem inside the for loop?
for sales in months:
    quarter += sales
Create a months list, as well as an index, and set the quarter to 0
months = [100, 100, 150, 250 , 300, 10, 20]
quarter = 0
quarters = []
index = 0
Create for loop for quarter, print result, and increment the index
for sales in months:
quarter += sales
if index % 3 == 0 or index == len(months):
quarters.append(quarter)
quarter = 0
index = index + 1
print("The quarter totals are Q1: {}, Q2: {}, Q3: {}".format(quarters[0], quarters[1], quarters[2]))
It looks like some of the indentation was lost when you copied the code, but my guess is that in your code index = index + 1 is indented into the if block, so it stops incrementing. You can use enumerate instead to avoid that kind of bug completely.
for index, value in enumerate(collection):
    print(index, value) # your code here
Try this out:
# Month sales data
month_sales = [100, 100, 150, 250, 300, 10, 20]
# Empty list to hold quarter data
quarter_sales = []
# Current quarter iteration data
quarter = 0
# Iterate over monthly sales
for index, sale in enumerate(month_sales):
    # Add month's sales to running total for the quarter
    quarter += sale
    # If last month of quarter or end of list
    if (index + 1) % 3 == 0 or index == len(month_sales) - 1:
        # Add quarter sales data to new list and start over
        quarter_sales.append(quarter)
        quarter = 0
print(f"The quarter totals are Q1: {quarter_sales[0]}, Q2: {quarter_sales[1]}, Q3: {quarter_sales[2]}")
There were issues with your indices.
For instance, index % 3 == 0 would work if your index started at 1, but list indices start at 0, so month 3 is at index 2. The check is true at indices 0, 3 and 6, which closes the first quarter after a single month and shifts every later quarter by one. Also, index == len(months) is never true, because the last index is len(months) - 1. You need to account for both with (index + 1) % 3 == 0 and index == len(month_sales) - 1.
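Another way to sidestep the index arithmetic is to slice the list in steps of three and sum each slice. A minimal sketch using the same sales figures:

```python
month_sales = [100, 100, 150, 250, 300, 10, 20]

# Take months 0-2, 3-5, 6-8, ... and sum each chunk; the last chunk may be shorter
quarter_sales = [sum(month_sales[i:i + 3]) for i in range(0, len(month_sales), 3)]
print(quarter_sales)
```

Slicing past the end of a list is safe in Python, so the trailing partial quarter needs no special case.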

Repeating the pattern of Numbers thrice in a month

I want to distribute the numbers in a list across a whole month.
A) Given a holiday list, I want to dynamically assign '1' on each holiday date and '0' on each working day.
eg.
Holiday_List = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
Start_date = datetime.datetime(year=2020, month =1 , day=1)
end_date = datetime.datetime(year =2020,month =1,day=28 )
Below is the output I am looking for in a dataframe, where 'Date' and 'Holiday' are columns.
Date Holiday
01-01-2020 1
02-01-2020 0
03-01-2020 0
04-01-2020 0
05-01-2020 1
06-01-2020 0
07-01-2020 0
08-01-2020 0
09-01-2020 0
10-01-2020 0
11-01-2020 0
12-01-2020 1
13-01-2020 0
14-01-2020 0
15-01-2020 0
16-01-2020 0
17-01-2020 0
18-01-2020 0
19-01-2020 1
20-01-2020 0
21-01-2020 0
22-01-2020 0
23-01-2020 0
24-01-2020 0
25-01-2020 0
26-01-2020 1
27-01-2020 0
28-01-2020 0
B) Given a list of numbers like [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], I want to break it into 3 equal parts and store them in 3 different lists:
a=[1,2,3,4,5,6], b=[7,8,9,10,11,12], c=[13,14,15,16,17,18]
The order should be preserved: the first 6 elements go in a, the second 6 in b and the third 6 in c.
C) I want to distribute the above lists a, b, c across the whole month such that the gap between the first element of a, of b and of c is 8 days, and similarly for the other numbers.
There is one constraint: I cannot assign any number to a holiday.
Below is the final output I am looking for, where the list values are assigned in the column "Values". Here I assign the dummy value 'NW' to keep the gap of 8 days between the lists.
Date Holiday Values
01-01-2020 1 Holiday
02-01-2020 0 1
03-01-2020 0 2
04-01-2020 0 3
05-01-2020 1 Holiday
06-01-2020 0 4
07-01-2020 0 5
08-01-2020 0 6
09-01-2020 0 NW
10-01-2020 0 NW
11-01-2020 0 7
12-01-2020 1 Holiday
13-01-2020 0 8
14-01-2020 0 9
15-01-2020 0 10
16-01-2020 0 11
17-01-2020 0 12
18-01-2020 0 NW
19-01-2020 1 Holiday
20-01-2020 0 13
21-01-2020 0 14
22-01-2020 0 15
23-01-2020 0 16
24-01-2020 0 17
25-01-2020 0 18
26-01-2020 1 Holiday
27-01-2020 0 NW
28-01-2020 0 NW
A) You can use date_range to create column with dates
df = pd.DataFrame()
df['Date'] = pd.date_range(start_date, end_date)
Next you can create column Holiday with zeros in all cells
df['Holiday'] = 0
And next you can replace some values
for item in holiday_list:
    item = datetime.datetime.strptime(item, '%Y-%m-%d')
    df.loc[df['Date'] == item, 'Holiday'] = 1
but maybe this part could be simpler using isin()
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df.loc[mask, 'Holiday'] = 1
or using numpy.where()
import numpy as np
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Holiday'] = np.where(mask, 1, 0)
or simply keep it as True/False instead of 1/0
df['Holiday'] = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
import pandas as pd
import datetime
holiday_list = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020,month=1, day=28)
df = pd.DataFrame()
df['Date'] = pd.date_range(start_date, end_date)
df['Holiday'] = 0
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df.loc[mask, 'Holiday'] = 1
print(df)
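The round trip through strftime can also be avoided by converting the holiday strings to datetimes once and comparing datetimes directly. A short sketch under the same dates as above:

```python
import pandas as pd

holiday_list = ['2020-01-01', '2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26']

df = pd.DataFrame({'Date': pd.date_range('2020-01-01', '2020-01-28')})

# Parse the holiday strings once, then test membership on datetime values
holidays = pd.to_datetime(holiday_list)
df['Holiday'] = df['Date'].isin(holidays).astype(int)
print(df)
```

This keeps the comparison in datetime space, so it does not depend on the string format of the entries in holiday_list matching a particular strftime pattern.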
B) You could use [start:start+size] to split the list
numbers = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
size = len(numbers)//3
print(numbers[size*0:size*1], numbers[size*1:size*2], numbers[size*2:size*3])
or
print(numbers[:size], numbers[size:size*2], numbers[size*2:])
In a similar way you can split the dataframe (after filtering out "Holiday") to work with blocks of 8 days, [start:start+8], but I will use that in (C)
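The slicing in (B) generalizes to any number of parts. A small sketch; split_equal is a hypothetical helper name, and it assumes the list length is divisible by the number of parts (any remainder would be dropped):

```python
def split_equal(items, parts):
    """Split items into `parts` consecutive chunks of equal size."""
    size = len(items) // parts
    return [items[k * size:(k + 1) * size] for k in range(parts)]

numbers = list(range(1, 19))
a, b, c = split_equal(numbers, 3)
print(a, b, c)
```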
C) You can create column Values with NW in all cells
df['Values'] = 'NW'
Next you can use previous mask to assign "Holiday"
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df.loc[mask, 'Values'] = 'Holiday'
Using ~ you can negate the mask to reverse the selection, i.e. to select the cells without "Holiday"
selected = df.loc[~mask, 'Values'].copy()
and now I can try to assign
for a, b in zip(range(0, len(selected), 8), range(0, len(numbers), size)):
    selected.iloc[a:a+size] = numbers[b:b+size]
df.loc[~mask, 'Values'] = selected
but maybe it can be done in a simpler way. Maybe with groupby() or rolling()?
import pandas as pd
import datetime
holiday_list = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020,month=1, day=28)
df = pd.DataFrame()
# ---
df['Date'] = pd.date_range(start_date, end_date)
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Holiday'] = 0
df.loc[mask, 'Holiday'] = 1
# ---
df['Values'] = 'NW'
df.loc[mask, 'Values'] = 'Holiday'
numbers = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
size = len(numbers)//3
selected = df.loc[~mask, 'Values'].copy()
for a, b in zip(range(0, len(selected), 8), range(0, len(numbers), size)):
    selected.iloc[a:a+size] = numbers[b:b+size]
df.loc[~mask, 'Values'] = selected
print(df)
EDIT:
I created this code.
The main problem was that slicing sometimes creates a copy of the data, so the values change in the copy but not in the original dataframe; that is why I use masks instead of slices.
Chained assignment such as df['Values'][mask] = ... may display a warning that it changes values in a copy of the data (not in the original dataframe).
The pandas documentation section "Returning a view versus a copy" explains this; assigning with df.loc[mask, 'Values'] = ... avoids the warning.
import pandas as pd
import datetime
holiday_list = [
'2020-01-01','2020-01-05',
#'2020-01-10','2020-01-11', # add more to test when there is less then 7 NW
'2020-01-12','2020-01-19','2020-01-26'
]
start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020,month=1, day=28)
df = pd.DataFrame()
# ---
df['Date'] = pd.date_range(start_date, end_date)
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Holiday'] = 0
df.loc[mask, 'Holiday'] = 1
# ---
df['Values'] = 'NW'
df.loc[mask, 'Values'] = 'Holiday'
numbers = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
size = len(numbers)//3
start = 0
for b in range(0, len(numbers), size):
    # find first and last NW to replace (needs `start` to keep a few NW at the end of the previous 8-day gap)
    mask = (df['Values'] == 'NW') & (df.index >= start)
    # change size if there are fewer `NW` left than needed
    print('NW:', sum(mask)) # sum() counts all `True` in mask
    if sum(mask) <= size:
        left = size - sum(mask)
        size = sum(mask)
        print('shorter:', size, left)
    # first and last NW to replace
    start = df[ mask ].index[0]
    end = df[ mask ].index[size-1]
    print('start, end:', start, end)
    # use new mask to select and replace values
    # (using slicing [0:6] doesn't work because it creates a copy of the data
    # and it doesn't replace in the original dataframe)
    mask = mask & (df.index >= start) & (df.index <= end)
    df.loc[mask, 'Values'] = numbers[b:b+size]
    # create gap of 8 days
    start += 8+1
print(df)
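The same idea can be sketched without pandas, which makes the gap rule easier to see. This is only an illustration under the assumptions of the example above: each list starts on the first free day at least 9 calendar days after the previous list started, and holidays are skipped. The function name schedule and the parameter gap_days are made up here:

```python
from datetime import date, timedelta

def schedule(start, end, holidays, chunks, gap_days=9):
    """Map every day in [start, end] to 'Holiday', 'NW' or a number from chunks."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    values = {d: ('Holiday' if d in holidays else 'NW') for d in days}
    earliest = start
    for chunk in chunks:
        # Free (non-holiday, still 'NW') days from the earliest allowed start
        free = [d for d in days if d >= earliest and values[d] == 'NW']
        if not free:
            break
        # zip stops at the shorter sequence, so a chunk is truncated if days run out
        for d, number in zip(free, chunk):
            values[d] = number
        # The next chunk may not start before this one's start plus the gap
        earliest = free[0] + timedelta(days=gap_days)
    return values

holidays = {date(2020, 1, d) for d in (1, 5, 12, 19, 26)}
chunks = [list(range(1, 7)), list(range(7, 13)), list(range(13, 19))]
values = schedule(date(2020, 1, 1), date(2020, 1, 28), holidays, chunks)
for d, v in values.items():
    print(d.strftime('%d-%m-%Y'), int(d in holidays), v)
```

With the sample holidays this reproduces the table requested in the question; with more holidays the gap may need to be larger, as noted in the comments on the other answer.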
I hope you have solved it by now :) Anyway, this is my approach to the problem.
First of all, there is one assumption I made when writing the code:
The length of the given array of integers is <= 18, which makes the length of each of the arrays a, b, c <= 8.
First, we need to divide the given array into three equal parts,
and if the split arrays are shorter than 8 we need to pad them with the dummy value NW so that each array has length 8.
To do that easily, we can use numpy.array. The array needs to be split and to hold the string NW alongside the numbers, so we use object as the dtype of the array:
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], dtype=object)
then we need to split the array into three equal parts,
arr = np.split(arr,3)
The resulting arrays then need to be padded if their length is < 8, using np.insert (dummy holds the string 'NW'):
for i in range(len(arr[0]), 8):
    arr = np.insert(arr, i, dummy, axis=1) # fill remaining slots of arrays with dummy value (NW)
Then we need to consider,
Part- A
We need the number of days between the two dates, delta (that calculation can go inside the for statement).
We then get each date in that range with the help of the datetime module (datetime — Basic date and time types) and iteration.
delta = end_date - Start_date
for i in range(delta.days + 1):
    day = Start_date + timedelta(days=i)
we can use .strftime() to define the time format we need.
day.strftime("%d-%m-%Y")
Finally, we check whether the current date from the iteration is in Holiday_List and, if so, print 1 and Holiday next to the date. If not, we print 0 and the next element from the arrays next to the date. We also need to keep a gap of 8 days between the lists, with every empty day slot filled with the dummy value NW.
count = 0
for i in range(delta.days + 1):
    day = Start_date + timedelta(days=i)
    if day.strftime("%Y-%m-%d") in Holiday_List:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 1, hDay))
    else:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 0, arr[count//8][count%8]))
        count += 1
Here count//8 decides which array to print from and count%8 chooses which of its elements to print.
So the program,
import datetime
import numpy as np
from datetime import timedelta
Holiday_List = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
Start_date = datetime.datetime(year=2020, month =1 , day=1)
end_date = datetime.datetime(year =2020,month =1,day=28 )
delta = end_date - Start_date
print(delta)
hDay = "Holiday"
dummy = "NW"
# --- numpy array ---
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], dtype=object) #Assumed that the array length of is divisible by 3 every time
arr = np.split(arr,3) #spilts the array to three equal parts
for i in range(len(arr[0]), 8):
    arr = np.insert(arr, i, dummy, axis=1) # fill remaining slots with dummy value(NW)
print("{}\t{}\t{}".format("Date", "Holiday", "Values"))
count = 0
for i in range(delta.days + 1):
    day = Start_date + timedelta(days=i)
    if day.strftime("%Y-%m-%d") in Holiday_List:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 1, hDay))
    else:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 0, arr[count//8][count%8]))
        count += 1
EDIT:
The above code has an issue in the last part, which determines the gap and sets the dummy value NW:
"When there are no holidays then you would need 3 NW so I would add 3 NW to every list ('a', 'b', 'c'), and then I would work with every list separately. I would use external for-loop like for data in arr: instead of arr[count//8] and I would count gap to skip last element if gap is 8 and element is 'NW' (BTW: if you add more holidays then you have to create a gap bigger than 8). – @furas"
So with the help of @furas I was able to solve the issue (thanks to him) :). The excess NW dummy values are skipped while iterating through each list:
import datetime
import numpy as np
from datetime import timedelta
Holiday_List = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
Start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020, month=1, day=28)
delta = end_date - Start_date
print(delta)
hDay = "Holiday"
dummy = "NW"
# --- numpy array ---
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], dtype=object) # Assumed that the array length of is divisible by 3 every time
arr = np.split(arr, 3) # spilts the array to three equal parts
for i in range(len(arr[0]), 9): # add 3 'NW' instead of 2 'NW'
    arr = np.insert(arr, i, dummy, axis=1) # fill remaining slots with dummy value(NW)
print("{}\t{}\t{}".format("Date", "Holiday", "Values"))
# ---
i = 0
for numbers in arr:
    gap = 0
    numbers_index = 0
    numbers_count = len(numbers) - 3 # count numbers without 3 `NW`
    while i < delta.days + 1:
        day = Start_date + timedelta(days=i)
        i += 1
        if day.strftime("%Y-%m-%d") in Holiday_List:
            print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 1, hDay))
            if numbers_index > 0: # don't count Holiday before displaying first number from list `numbers` (ie. '2020-01-01')
                gap += 1
        else:
            value = numbers[numbers_index]
            # always put number (!='NW') or put 'NW' when gap is too small (<9)
            if value != 'NW' or gap < 9:
                print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 0, value))
                numbers_index += 1
                gap += 1
            # IDEA: maybe it could use `else:` to put `NW` without adding `NW` to `arr`
        # exit loop if all numbers are displayed and gap is big enough
        if numbers_index >= numbers_count and gap >= 9:
            break
The answer provided by @furas is less messy; you should study that.
Cheers mate, I learned a lot actually!

Pythonic Solution for Improving Runtime Efficiency

I would like to improve the runtime of a python program that takes a pandas dataframe and create two new variables (Group and Group date) based on several conditions (The code and logic are below). The code works fine on small datasets but on large datasets (20 million rows) it is taking 7+ hours to run.
Logic behind code
if the ID is the first ID encountered then group=1 and groupdate = date
else if not first ID and date - previous date > 10 or date - previous groupdate >10 then group=previous group # + 1 and groupdate = date
else if not first ID and date - previous date <= 10 or date - previous groupdate<=10 then group = previous group # and groupdate = previous groupdate.
Sample Code
import pandas as pd
import numpy as np
ID = ['a1','a1','a1','a1','a1','a2','a2','a2','a2','a2']
DATE = ['1/1/2014','1/15/2014','1/20/2014','1/22/2014','3/10/2015', \
'1/13/2015','1/20/2015','1/28/2015','2/28/2015','3/20/2015']
ITEM = ['P1','P2','P3','P4','P5','P1','P2','P3','P4','P5']
df = pd.DataFrame({"ID": ID, "DATE": DATE, "ITEM": ITEM})
df['DATE']= pd.to_datetime(df['DATE'], format = '%m/%d/%Y')
ids=df.ID
df['first_id'] = np.where((ids!=ids.shift(1)), 1, 0)
df['last_id'] = np.where((ids!=ids.shift(-1)), 1, 0)
print(df); print('\n')
for i in range(0,len(df)):
    if df.loc[i,'first_id']==1:
        df.loc[i,'group'] = 1
        df.loc[i,'groupdate'] = df.loc[i,'DATE']
    elif df.loc[i,'first_id']==0 and ((df.loc[i,'DATE'] - df.loc[i-1,'DATE']).days > 10) or \
         ((df.loc[i,'DATE'] - df.loc[i-1,'groupdate']).days > 10):
        df.loc[i,'group'] = df.loc[i-1,'group'] + 1
        df.loc[i,'groupdate'] = df.loc[i,'DATE']
    else:
        if df.loc[i,'first_id']==0 and ((df.loc[i,'DATE'] - df.loc[i-1,'DATE']).days <= 10) or \
           ((df.loc[i,'DATE'] - df.loc[i-1,'groupdate']).days <= 10):
            df.loc[i,'group'] = df.loc[i-1,'group']
            df.loc[i,'groupdate'] = df.loc[i-1,'groupdate']
print(df); print('\n')
Output
ID DATE ITEM GROUP GROUPDATE
1 1/1/2014 P1 1 1/1/2014
1 1/15/2014 P2 2 1/15/2014
1 1/20/2014 P3 2 1/15/2014
1 1/22/2014 P4 2 1/15/2014
1 3/10/2015 P5 3 3/10/2015
2 1/13/2015 P1 1 1/13/2015
2 1/20/2015 P2 1 1/13/2015
2 1/28/2015 P3 2 1/28/2015
2 2/28/2015 P4 3 2/28/2015
2 3/20/2015 P5 4 3/20/2015
Please don't take this as a full answer but as a work in progress and a starting point.
I think that your code runs into problems when it moves from one group to the next.
You should avoid the explicit loop, so I use groupby.
I'm not implementing your logic about the previous groupdate here.
Generate Data
import pandas as pd
import numpy as np
ID = ['a1','a1','a1','a1','a1','a2','a2','a2','a2','a2']
DATE = ['1/1/2014','1/15/2014','1/20/2014','1/22/2014','3/10/2015', \
'1/13/2015','1/20/2015','1/28/2015','2/28/2015','3/20/2015']
ITEM = ['P1','P2','P3','P4','P5','P1','P2','P3','P4','P5']
df = pd.DataFrame({"ID": ID, "DATE": DATE, "ITEM": ITEM})
df['DATE']= pd.to_datetime(df['DATE'], format = '%m/%d/%Y')
ids=df.ID
df['first_id'] = np.where((ids!=ids.shift(1)), 1, 0)
Function that work for every "ID"
def fun(x):
    # To compare with previous date I add a column
    x["PREVIOUS_DATE"] = x["DATE"].shift(1)
    x["DATE_DIFF1"] = (x["DATE"]-x["PREVIOUS_DATE"]).dt.days
    # These are your simplified conditions
    conds = [x["first_id"]==1,
             ((x["first_id"]==0) & (x["DATE_DIFF1"]>10)),
             ((x["first_id"]==0) & (x["DATE_DIFF1"]<=10))]
    # choices for date
    choices_date = [x["DATE"].astype(str),
                    x["DATE"].astype(str),
                    '']
    # choices for group
    # To get the expected output we'll need a cumsum
    choices_group = [ 1, 1, 0]
    # I use np.select you can check how it works
    x["group_date"] = np.select(conds, choices_date, default="")
    x["group"] = np.select(conds, choices_group, default=0)
    # some group_date are empty so I fill them
    # (empty strings become NaT; .ffill() replaces the deprecated fillna(method="ffill"))
    x["group_date"] = pd.to_datetime(x["group_date"]).ffill()
    # Here is the cumsum
    x["group"] = x["group"].cumsum()
    # Remove columns we don't need
    x = x.drop(["first_id", "PREVIOUS_DATE", "DATE_DIFF1"], axis=1)
    return x
How to use
df = df.groupby("ID").apply(fun)
ID DATE ITEM group_date group
0 a1 2014-01-01 P1 2014-01-01 1
1 a1 2014-01-15 P2 2014-01-15 2
2 a1 2014-01-20 P3 2014-01-15 2
3 a1 2014-01-22 P4 2014-01-15 2
4 a1 2015-03-10 P5 2015-03-10 3
5 a2 2014-01-01 P1 2014-01-01 1
6 a2 2014-01-15 P2 2014-01-15 2
7 a2 2014-01-20 P3 2014-01-15 2
8 a2 2014-01-22 P4 2014-01-15 2
9 a2 2015-03-10 P5 2015-03-10 3
Speed up
Here you could consider dask, modin or cuDF (see "modin vs cuDF"). But you should probably first work on how you organize your data before processing it. I'm talking about something like this (it's my own, sorry), which gives you an idea of how correctly partitioning the data can speed things up.
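Since the previous-groupdate rule from the question makes each row depend on the row before it, one option that keeps the full rule is a single pass over plain NumPy arrays instead of repeated df.loc lookups, which is typically much cheaper per row. This is only a sketch: assign_groups and max_gap_days are hypothetical names, and it assumes the frame is already sorted by ID and then DATE.

```python
import numpy as np
import pandas as pd

def assign_groups(df, max_gap_days=10):
    """Apply the group / groupdate rule from the question in one pass."""
    ids = df['ID'].to_numpy()
    # Dates as whole days since the epoch, so differences are plain integers
    days = df['DATE'].to_numpy().astype('datetime64[D]').astype('int64')
    group = np.empty(len(df), dtype='int64')
    group_day = np.empty(len(df), dtype='int64')
    for i in range(len(df)):
        if i == 0 or ids[i] != ids[i - 1]:
            # First row of an ID starts group 1
            group[i] = 1
            group_day[i] = days[i]
        elif (days[i] - days[i - 1] > max_gap_days
              or days[i] - group_day[i - 1] > max_gap_days):
            # Too far from the previous date or from the current group date
            group[i] = group[i - 1] + 1
            group_day[i] = days[i]
        else:
            group[i] = group[i - 1]
            group_day[i] = group_day[i - 1]
    out = df.copy()
    out['group'] = group
    out['groupdate'] = group_day.astype('datetime64[D]').astype('datetime64[ns]')
    return out

ID = ['a1'] * 5 + ['a2'] * 5
DATE = ['1/1/2014', '1/15/2014', '1/20/2014', '1/22/2014', '3/10/2015',
        '1/13/2015', '1/20/2015', '1/28/2015', '2/28/2015', '3/20/2015']
df = pd.DataFrame({'ID': ID, 'DATE': pd.to_datetime(DATE, format='%m/%d/%Y')})
result = assign_groups(df)
print(result)
```

On the sample data this gives the group numbers listed in the question's expected output. The loop is still Python-level, so for tens of millions of rows it may be worth compiling it (for example with numba) or running it per partition of IDs.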
