I have a pandas DataFrame with customer transactions, as shown below. How can I achieve the outcome below by comparing the Transaction Start Date and Transaction End Date columns?
The labels to be created are:
Start date appeared earlier in the end date column.
Start date did not appear earlier in the end date column.
Input
Transaction ID Transaction Start Date Transaction End Date
1 27-Oct-2014 11-Nov-2014
2 29-Oct-2014 30-Nov-2014
3 11-Nov-2014 20-Nov-2014
4 15-Nov-2014 28-Nov-2014
5 20-Nov-2014 05-Dec-2014
6 28-Nov-2014 15-Dec-2014
7 29-Nov-2014 20-Dec-2014
Desired output
Transaction ID Transaction Start Date Transaction End Date Label
1 27-Oct-2014 11-Nov-2014
2 29-Oct-2014 30-Nov-2014 start date did not appear earlier in the end date column
3 11-Nov-2014 20-Nov-2014 start date appeared earlier in end date column
4 15-Nov-2014 28-Nov-2014 start date did not appear earlier in the end date column
5 20-Nov-2014 05-Dec-2014 start date appeared earlier in end date column
6 28-Nov-2014 15-Dec-2014 start date appeared earlier in end date column
7 29-Nov-2014 20-Dec-2014 start date did not appear earlier in the end date column
Use:

import numpy as np
import pandas as pd

# convert values to datetimes
df['Transaction End Date'] = pd.to_datetime(df['Transaction End Date'])
df['Transaction Start Date'] = pd.to_datetime(df['Transaction Start Date'])

# check whether each start date appears among the previous end dates
# (list comprehension)
mask = [df['Transaction End Date'].iloc[:i].eq(x).any()
        for i, x in enumerate(df['Transaction Start Date'])]

# set labels
df['Label'] = np.where(mask,
                       'start date appeared earlier in end date column',
                       'start date did not appear earlier in the end date column')

# set first value to empty string
df.loc[0, 'Label'] = ''
print(df)
Transaction ID Transaction Start Date Transaction End Date \
0 1 2014-10-27 2014-11-11
1 2 2014-10-29 2014-11-30
2 3 2014-11-11 2014-11-20
3 4 2014-11-15 2014-11-28
4 5 2014-11-20 2014-12-05
5 6 2014-11-28 2014-12-15
6 7 2014-11-29 2014-12-20
Label
0
1 start date did not appear earlier in the end d...
2 start date appeared earlier in end date column
3 start date did not appear earlier in the end d...
4 start date appeared earlier in end date column
5 start date appeared earlier in end date column
6 start date did not appear earlier in the end d...
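For larger frames, a vectorized alternative is possible (a sketch; the list comprehension above is perfectly fine for small data): broadcast every start date against every end date and keep only comparisons against strictly earlier rows.

import numpy as np

starts = df['Transaction Start Date'].to_numpy()
ends = df['Transaction End Date'].to_numpy()
# eq[i, j] is True when start date i equals end date j
eq = starts[:, None] == ends[None, :]
# np.tri(..., k=-1) keeps only j < i, i.e. end dates from earlier rows
mask = (eq & np.tri(len(df), k=-1, dtype=bool)).any(axis=1)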
How to filter on dates starting from the last day of the previous month and ending on the current day? (If the last day of the previous month falls on a Saturday or Sunday, the preceding weekday should be used instead.)
For example:
example = pd.read_excel("C:/Users/USER/Downloads/Movimentos.xls")
example = example.drop(columns=['Data Valor', 'Descrição', 'Valor'])
example = example.sort_values(by='Data Operação', ascending=False)
display(example)
Data Operação Saldo
0 12/10/2022 310.36
1 12/10/2022 312.86
2 11/10/2022 315.34
3 11/10/2022 317.84
4 09/10/2022 326.44
5 30/09/2022 224.44
... ... ...
188 05/07/2022 128.40
189 01/07/2022 8.40
190 24/06/2022 18.40
191 22/06/2022 23.40
192 27/05/2022 50.00
In this case I would like to filter from row 5 (30/09/2022, 224.44), which is the last day of the previous month and a weekday, up to row 0 (12/10/2022, 310.36), which is the current day.
I've seen examples where you just enter a literal date such as '2022-09-30', but in this case the filter is recurring, so the end date needs to be something like:
today = datetime.date.today()
today.month  # for the end date
but I don't know how I'm supposed to compute the start date.
Here is one way to do it.
BTW, you can convert the date column to datetime once up front, to avoid converting while filtering:

from datetime import datetime
import pandas as pd

# optionally, parse the date column once (day-first format)
df['Data Operação'] = pd.to_datetime(df['Data Operação'], dayfirst=True)

# using pd.offsets.BusinessMonthEnd, get the business month end of the
# previous month, then use loc to select the rows matching the criteria
(df.loc[(pd.to_datetime(df['Data Operação'], dayfirst=True) <=
         pd.Timestamp(datetime.now()).strftime('%Y-%m-%d')) &
        (pd.to_datetime(df['Data Operação'], dayfirst=True) >=
         pd.Timestamp(datetime.now() + pd.offsets.BusinessMonthEnd(-1)).strftime('%Y-%m-%d'))
        ])
Or, to make it more comprehensible:

# create tday and lmonth variables, then use them for the comparison
tday = pd.Timestamp(datetime.now()).strftime('%Y-%m-%d')
lmonth = pd.Timestamp(datetime.now() + pd.offsets.BusinessMonthEnd(-1)).strftime('%Y-%m-%d')

(df.loc[(pd.to_datetime(df['Data Operação'], dayfirst=True) <= tday) &
        (pd.to_datetime(df['Data Operação'], dayfirst=True) >= lmonth)
        ])
  Data Operação   Saldo
0    12/10/2022  310.36
1    12/10/2022  312.86
2    11/10/2022  315.34
3    11/10/2022  317.84
4    09/10/2022  326.44
5    30/09/2022  224.44
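As a quick sanity check on the weekend rule (a sketch with hypothetical dates): BusinessMonthEnd(-1) rolls back to the last business day of the previous month, so weekends are skipped automatically.

from datetime import datetime
import pandas as pd

# 2022-09-30 was a Friday, so no adjustment is needed
print(pd.Timestamp(datetime(2022, 10, 12)) + pd.offsets.BusinessMonthEnd(-1))
# Timestamp('2022-09-30 00:00:00')

# 2022-07-31 was a Sunday, so the offset lands on Friday 2022-07-29
print(pd.Timestamp(datetime(2022, 8, 10)) + pd.offsets.BusinessMonthEnd(-1))
# Timestamp('2022-07-29 00:00:00')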
I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a column year_week, but my year starts at 2021-06-28, which is the Monday of the week containing the first of July.
I tried:

from datetime import datetime, timedelta

df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date'] - timedelta(days=datetime(2021, 6, 24).timetuple().tm_yday)
                   ).dt.isocalendar().week
I played around with the timedelta days value so that 2021-06-28 gets a value of 1.
But then I got problems with dates before my start date and dates more than a year after it:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the valid period runs from 2021-06-28 to one year later.
date year_week
2021-03-12 38 # LY38
2021-03-17 39 # LY39
2021-06-28 1 # correct
...
2022-05-21 47 # correct
2022-08-17 8 # NY8
Is there a way to get around this? As I am aggregating the data by year week, I get incorrect results because of the past and upcoming dates. I would want either negative values for the days before 2021-06-28, or a prefix like LY38 denoting the year week of the last year, and accordingly year weeks of 52+ or NY8 denoting the 8th week of the next year.
Here is a way; I added two dates that are more than a year away. You need the isocalendar of the difference between the date column and the day-of-year of your specific date. Then you can select the different scenarios depending on the year difference relative to your specific date; use np.select for the different result formats.
import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame(
    {'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
              '2022-05-21', '2022-08-17', '2023-08-17']}
)

# define the start date
d = pd.to_datetime('2021-6-24')

# subtract the day-of-year of the start date from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)).dt.isocalendar()

# get the difference in years
m = s['year'].astype('int32') - d.year

# all conditions and choices depending on the year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY', 'NY', (m + 1).astype(str) + 'LY', '+' + (m - 1).astype(str) + 'NY']

# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
I think pandas period_range can be of some help:

n_weeks = 10  # the number of weeks you want
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n_weeks))
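For instance, aligned with the question's start date (a sketch; n_weeks and the start date are placeholders), each element is a weekly Period that a date can be matched against:

import pandas as pd

weeks = pd.Series(pd.period_range("6/28/2021", freq="W", periods=3))
print(weeks)
# 0    2021-06-28/2021-07-04
# 1    2021-07-05/2021-07-11
# 2    2021-07-12/2021-07-18
# dtype: period[W-SUN]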
I have a dataframe with multiple columns, one of which is a date column. I'm interested in creating a new column that contains the number of months between the date column and a preset date. For example, one of the dates in the 'start date' column is '2019-06-30 00:00:00'; I want to calculate the number of months between that date and the end of 2021 (2021-12-31), place the answer in a new column, and do this for the entire date column. I haven't been able to work out how to go about this, but if the predetermined end date is 2021-12-31 I would like it to end up looking like this:

  start date  months
0 2019-06-30      30
1 2019-08-12      28
2 2020-01-24      23
You can do this using np.timedelta64 (note that dividing by np.timedelta64(1, 'M') relies on an average month length, and month-based timedelta units are deprecated in recent pandas versions; the period-based approach below avoids that):

import numpy as np
import pandas as pd

end_date = pd.to_datetime('2021-12-31')
df['start date'] = pd.to_datetime(df['start date'])
# divide the elapsed time by the average month length and truncate
df['month'] = ((end_date - df['start date']) / np.timedelta64(1, 'M')).astype(int)
print(df)

  start date  month
0 2019-06-30     30
1 2019-08-12     28
2 2020-01-24     23
Assume that the start date column is of datetime type (not string) and the reference date is defined as follows:

refDate = pd.to_datetime('2021-12-31')

or any other date of your choice. Then you can compute the number of months as:

df['months'] = (refDate.to_period('M') - df['start date']
                .dt.to_period('M')).apply(lambda x: x.n)
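A quick end-to-end demo of the period-based approach on the sample data (a sketch; column and variable names follow the question):

import pandas as pd

df = pd.DataFrame({'start date': ['2019-06-30', '2019-08-12', '2020-01-24']})
df['start date'] = pd.to_datetime(df['start date'])
refDate = pd.to_datetime('2021-12-31')

# period subtraction yields whole calendar months; .n extracts the integer
df['months'] = (refDate.to_period('M') - df['start date']
                .dt.to_period('M')).apply(lambda x: x.n)
print(df)
#   start date  months
# 0 2019-06-30      30
# 1 2019-08-12      28
# 2 2020-01-24      23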
I am currently struggling with counting the number of open tasks at month end with pyspark. I have a dataframe with two timestamps (one for the beginning of the task, one for the end). I would like to count the number of open tasks at each month end.
My dataframe df1 is a merge of two timestamp tables (the beginning and the end of each task). Let's say it looks like below:
TaskNumber Start Date StartTimestamp EndDate EndTimestamp EndOfTheMonth
1 2/15/2019 Start 4/18/2019 End 2/28/2019
2 2/16/2019 Start 2/23/2019 End 2/28/2019
3 2/17/2019 Start 3/4/2019 End 2/28/2019
4 3/1/2019 Start Null Null 3/31/2019
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when, last_day, sum as _sum

df2 = df1.withColumn("EndOfTheMonth", last_day("Start Date"))

df3 = (df2.withColumn("OpenTaskAtMonthEnd",
                      when((col("Start Date") <= col("EndOfTheMonth")) &
                           ((col("EndDate") >= col("EndOfTheMonth")) |
                            col("EndDate").isNull()), 1).otherwise(0))
          .withColumn("TaskStillOpened", when(col("EndTimestamp").isNull(), 1).otherwise(0))
          .withColumn("DateofToday", F.current_date()))

df4 = (df3.filter(col("OpenTaskAtMonthEnd") == 1)
          .withColumn("Pending Days",
                      when(col("TaskStillOpened") == 1,
                           F.datediff(F.to_date("DateofToday"), F.to_date("Start Date")))
                      .otherwise(F.datediff(F.to_date("EndDate"), F.to_date("Start Date")))))

dfOpen = (df4.groupBy("EndOfTheMonth")
             .agg(_sum(col("OpenTaskAtMonthEnd")).alias("NrTotalOpenTasksAtMonthEnd"),
                  _sum(col("TaskStillOpened")).alias("TotalWorkInProgress")))
TaskNumber Start Date EndDate EndOfTheMonth OpenTaskAtMonthEnd TaskStillOpened
1 2/15/2019 4/18/2019 2/28/2019 1 0
2 2/16/2019 2/23/2019 2/28/2019 0 0
3 2/17/2019 3/4/2019 2/28/2019 1 0
4 3/1/2019 Null 3/31/2019 1 1
I manage to count the open tasks for month n when they started in that same month n. But when a task began in month n-1 and ended only in month n+1, my current code does not find it among the open tasks for month n.
The expected output is the number of open tasks for each month, and ideally also for each week.
Thanks a lot for your support ;)
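One way around this (a sketch, not from the original thread; it assumes Spark 2.4+ for sequence() and the column names shown above): generate one row per (task, month end) the task was alive for, then aggregate. Tasks spanning several months are then counted at every month end they cross.

from pyspark.sql import functions as F

parsed = (df1
          .withColumn("start", F.to_date("Start Date", "M/d/yyyy"))
          .withColumn("end", F.to_date("EndDate", "M/d/yyyy")))

# one row per month between the start month and the end month (or today if open)
exploded = (parsed
    .withColumn("month", F.explode(F.sequence(
        F.trunc("start", "month"),
        F.trunc(F.coalesce("end", F.current_date()), "month"),
        F.expr("interval 1 month"))))
    .withColumn("month_end", F.last_day("month"))
    # open at that month end: started on or before it, ended on/after it or never
    .filter((F.col("start") <= F.col("month_end")) &
            (F.col("end").isNull() | (F.col("end") >= F.col("month_end")))))

dfOpen = (exploded.groupBy("month_end")
          .agg(F.count("*").alias("NrTotalOpenTasksAtMonthEnd"),
               F.sum(F.when(F.col("end").isNull(), 1).otherwise(0))
                .alias("TotalWorkInProgress")))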
I have a dataframe with a column that contains the date of the first Monday of every week between an arbitrary start date and now. I wish to generate a new column that jumps in two-week steps but has the same length as the original column, so it would contain repeated values. For example, this would be the result for the month of October, where the column 'weekly' exists and 'bi-weekly' is the target:

data = {'weekly': ['2018-10-08', '2018-10-15', '2018-10-22', '2018-10-29'],
        'bi-weekly': ['2018-10-08', '2018-10-08',
                      '2018-10-22', '2018-10-22']}
df = pd.DataFrame(data)

At the moment I am stuck with pd.date_range(start, end, freq='14D'), but this does not contain any repeated values, which I need in order to be able to groupby.
IIUC
df.groupby(np.arange(len(df))//2).weekly.transform('first')
Out[487]:
0 2018-10-08
1 2018-10-08
2 2018-10-22
3 2018-10-22
Name: weekly, dtype: datetime64[ns]
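A date-based alternative (a sketch, assuming 'weekly' is converted to datetime first): bucket rows by actual 14-day spans from the first Monday, which also behaves correctly if a week is missing from the data.

import pandas as pd

df['weekly'] = pd.to_datetime(df['weekly'])
# integer bucket: 0 for days 0-13 after the first Monday, 1 for days 14-27, ...
bucket = (df['weekly'] - df['weekly'].iloc[0]).dt.days // 14
df['bi-weekly'] = df.groupby(bucket)['weekly'].transform('first')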