I am currently struggling with counting the number of open tasks at month end with PySpark. I have a dataframe with two timestamps (one for the start of the task, one for the end), and I would like to count the number of tasks that are still open at each month end.
My dataframe df1 is a merge of two timestamp tables (the start and the end of each task). Let's say it looks like below:
TaskNumber Start Date StartTimestamp EndDate EndTimestamp EndOfTheMonth
1 2/15/2019 Start 4/18/2019 End 2/28/2019
2 2/16/2019 Start 2/23/2019 End 2/28/2019
3 2/17/2019 Start 3/4/2019 End 2/28/2019
4 3/1/2019 Start Null Null 3/31/2019
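For reference, a minimal sketch of how this sample frame could be built (the M/d/yyyy date format and the SparkSession setup are assumptions for illustration, not part of the original data):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, "2/15/2019", "Start", "4/18/2019", "End"),
     (2, "2/16/2019", "Start", "2/23/2019", "End"),
     (3, "2/17/2019", "Start", "3/4/2019",  "End"),
     (4, "3/1/2019",  "Start", None,        None)],
    ["TaskNumber", "Start Date", "StartTimestamp", "EndDate", "EndTimestamp"],
)

# cast the string dates to real dates so that comparisons and last_day work
df1 = (df1.withColumn("Start Date", F.to_date("Start Date", "M/d/yyyy"))
          .withColumn("EndDate", F.to_date("EndDate", "M/d/yyyy")))
With df1 defined like this, the code below can be run directly: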
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when, last_day, sum as _sum

df2 = df1.withColumn("EndOfTheMonth", last_day(col("Start Date")))

df3 = (df2.withColumn("OpenTaskAtMonthEnd",
                      when((col("Start Date") <= col("EndOfTheMonth")) &
                           ((col("EndDate") >= col("EndOfTheMonth")) | col("EndTimestamp").isNull()),
                           1).otherwise(0))
          .withColumn("TaskStillOpened", when(col("EndTimestamp").isNull(), 1).otherwise(0))
          .withColumn("DateofToday", F.current_date())
      )

df4 = (df3.filter(col("OpenTaskAtMonthEnd") == 1)
          .withColumn("Pending Days",
                      when(col("TaskStillOpened") == 1,
                           F.datediff(F.to_date(col("DateofToday")), F.to_date(col("Start Date"))))
                      .otherwise(F.datediff(F.to_date(col("EndDate")), F.to_date(col("Start Date"))))))

dfOpen = (df4.groupBy("EndOfTheMonth")
             .agg(_sum(col("OpenTaskAtMonthEnd")).alias("NrTotalOpenTasksAtMonthEnd"),
                  _sum(col("TaskStillOpened")).alias("TotalWorkinProgress")))
TaskNumber Start Date EndDate EndOfTheMonth OpenTaskAtMonthEnd TaskStillOpened
1 2/15/2019 4/18/2019 2/28/2019 1 0
2 2/16/2019 2/23/2019 2/28/2019 0 0
3 2/17/2019 3/4/2019 2/28/2019 1 0
4 3/1/2019 Null 3/31/2019 1 1
I manage to count the open tasks for month n when they started in that same month n. But when a task began in month n-1 and only ended in month n+1, my current code does not include it in the open tasks for month n.
The expected output is the number of open tasks at the end of each month, and ideally also at the end of each week.
Thanks a lot for your support ;)
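One possible direction (a rough sketch, not tested against the real data; it assumes df1 has proper date columns as above and Spark 2.4+ for sequence/explode) is to generate one row per month end that each task spans, so a task running from month n-1 into month n+1 is counted at every month end in between:
from pyspark.sql import functions as F

# one row per month end between the task's start and its end (or today if still open)
df_months = (df1
    .withColumn("EffectiveEnd", F.coalesce(F.col("EndDate"), F.current_date()))
    .withColumn("MonthEnd",
                F.explode(F.expr(
                    "sequence(last_day(`Start Date`), last_day(EffectiveEnd), interval 1 month)")))
    # re-anchor each step of the sequence to its own month end
    .withColumn("MonthEnd", F.last_day("MonthEnd")))

open_per_month = (df_months
    .filter((F.col("Start Date") <= F.col("MonthEnd")) &
            (F.col("EndDate").isNull() | (F.col("EndDate") >= F.col("MonthEnd"))))
    .groupBy("MonthEnd")
    .agg(F.count(F.lit(1)).alias("NrTotalOpenTasksAtMonthEnd")))
For a weekly figure, the same pattern should work with a sequence stepped by interval 1 week instead of 1 month.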
Related
How to filter on dates starting on the last day of the previous month and ending on the current day
(If the last day of the previous month is a Saturday or Sunday, the preceding weekday must be used)
For example:
import pandas as pd

example = pd.read_excel("C:/Users/USER/Downloads/Movimentos.xls")
example = example.drop(columns=['Data Valor', 'Descrição', 'Valor'])
example = example.sort_values(by='Data Operação', ascending=False)
display(example)
Data Operação Saldo
0 12/10/2022 310.36
1 12/10/2022 312.86
2 11/10/2022 315.34
3 11/10/2022 317.84
4 09/10/2022 326.44
5 30/09/2022 224.44
... ... ...
188 05/07/2022 128.40
189 01/07/2022 8.40
190 24/06/2022 18.40
191 22/06/2022 23.40
192 27/05/2022 50.00
In this case I would like to filter from row 5 (30/09/2022, 224.44), which is the last day of the previous month and a weekday, to row 0 (12/10/2022, 310.36), which is the current day.
I've seen some examples where you just had to enter the date '2022-09-30' directly, but in my case this will be recurring, so it needs to be something like:
today = datetime.date.today()
today.month (for the end date)
but for the start date I don't know how I'm supposed to do it.
Here is one way to do it.
By the way, you can convert the date column to datetime once up front, as follows, to avoid converting it while filtering:
from datetime import datetime
import pandas as pd

# optionally, convert once up front (day-first, i.e. dd/mm/yyyy)
df['Data Operação'] = pd.to_datetime(df['Data Operação'], dayfirst=True)
# convert (in memory) the dates to y-m-d format for the comparison
# using pd.offsets.BusinessMonthEnd, get the last business day of the previous month
# finally use loc to select the rows matching the criteria
(df.loc[(pd.to_datetime(df['Data Operação'], dayfirst=True) <=
         pd.Timestamp(datetime.now()).strftime('%Y-%m-%d')) &
        (pd.to_datetime(df['Data Operação'], dayfirst=True) >=
         pd.Timestamp(datetime.now() + pd.offsets.BusinessMonthEnd(-1)).strftime('%Y-%m-%d'))
        ]
)
Or, to make it more readable:
# create tday (today) and lmonth (last business day of the previous month) variables
# then use them for the comparison
tday = pd.Timestamp(datetime.now()).strftime('%Y-%m-%d')
lmonth = (pd.Timestamp(datetime.now() + pd.offsets.BusinessMonthEnd(-1))).strftime('%Y-%m-%d')
(df.loc[(pd.to_datetime(df['Data Operação'], dayfirst=True) <= tday) &
        (pd.to_datetime(df['Data Operação'], dayfirst=True) >= lmonth)
        ]
)
   Data Operação   Saldo
0     12/10/2022  310.36
1     12/10/2022  312.86
2     11/10/2022  315.34
3     11/10/2022  317.84
4     09/10/2022  326.44
5     30/09/2022  224.44
My DataFrame looks like this:
id  date        value
1   2021-07-16  100
2   2021-09-15  20
1   2021-04-10  50
1   2021-08-27  30
2   2021-07-22  15
2   2021-07-22  25
1   2021-06-30  40
3   2021-10-11  150
2   2021-08-03  15
1   2021-07-02  90
I want to group by the id and return the difference in total value between two 90-day periods.
Specifically, I want the sum of values over the last 90 days counted from today, and over the 90 days counted from 30 days ago.
For example, considering today is 2021-10-13, I would like to get:
the sum of all values per id between 2021-10-13 and 2021-07-15
the sum of all values per id between 2021-09-13 and 2021-06-15
And finally, subtract them to get the variation.
I've already managed to calculate it by creating separate temporary dataframes containing only the dates in those 90-day periods, grouping by id, and then merging these temp dataframes into a final one.
But I guess there should be an easier or simpler way to do it. Appreciate any help!
Btw, sorry if the explanation was a little messy.
If I understood correctly, you need something like this:
import pandas as pd
import datetime
## Calculate the dates we are going to need.
today = datetime.datetime.now()
delta = datetime.timedelta(days=120)
# Date 120 days ago
hundredTwentyDaysAgo = today - delta
delta = datetime.timedelta(days=90)
# Date 90 days ago
ninetyDaysAgo = today - delta
delta = datetime.timedelta(days=30)
# Date 30 days ago
thirtyDaysAgo = today - delta
## Initializing an example df.
df = pd.DataFrame({"id":[1,2,1,1,2,2,1,3,2,1],
"date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22", "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
"value": [100,20,50,30,15,25,40,150,15,90]})
## Casting date column
df['date'] = pd.to_datetime(df['date']).dt.date
grouped = df.groupby('id')
# Sum of last 90 days per id
ninetySum = grouped.apply(lambda x: x[x['date'] >= ninetyDaysAgo.date()]['value'].sum())
# Sum of last 90 days, starting from 30 days ago per id
hundredTwentySum = grouped.apply(lambda x: x[(x['date'] >= hundredTwentyDaysAgo.date()) & (x['date'] <= thirtyDaysAgo.date())]['value'].sum())
The output is
ninetySum - hundredTwentySum
id
1 -130
2 20
3 150
dtype: int64
You can double check to make sure these are the numbers you wanted by printing ninetySum and hundredTwentySum variables.
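For instance, assuming today is 2021-10-13 as in the question, the intermediate sums would come out as below (recomputed by hand from the sample data, so treat them as illustrative):
print(ninetySum)
# id
# 1    130
# 2     75
# 3    150
# dtype: int64

print(hundredTwentySum)
# id
# 1    260
# 2     55
# 3      0
# dtype: int64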
I have a pandas dataframe with customer transactions, as shown below. How can I achieve the outcome below (comparing the Transaction End Date and Transaction Start Date columns)?
The labels to be created are:
Start date appeared earlier in the end date column.
Start date did not appear earlier in the end date column.
Input
Transaction ID Transaction Start Date Transaction End Date
1 27-Oct-2014 11-Nov-2014
2 29-Oct-2014 30-Nov-2014
3 11-Nov-2014 20-Nov-2014
4 15-Nov-2014 28-Nov-2014
5 20-Nov-2014 05-Dec-2014
6 28-Nov-2014 15-Dec-2014
7 29-Nov-2014 20-Dec-2014
Desired output
Transaction ID Transaction Start Date Transaction End Date Label
1 27-Oct-2014 11-Nov-2014
2 29-Oct-2014 30-Nov-2014 start date did not appear earlier in the end date column
3 11-Nov-2014 20-Nov-2014 start date appeared earlier in end date column
4 15-Nov-2014 28-Nov-2014 start date did not appear earlier in the end date column
5 20-Nov-2014 05-Dec-2014 start date appeared earlier in end date column
6 28-Nov-2014 15-Dec-2014 start date appeared earlier in the end date column
7 29-Nov-2014 20-Dec-2014 start date did not appear earlier in the end date column
Use:
import numpy as np
import pandas as pd

#convert values to datetimes
df['Transaction End Date'] = pd.to_datetime(df['Transaction End Date'])
df['Transaction Start Date'] = pd.to_datetime(df['Transaction Start Date'])
#check whether each start date appears among the previous end dates, in a list comprehension
mask = [df['Transaction End Date'].iloc[:i].eq(x).any()
        for i, x in enumerate(df['Transaction Start Date'])]
#set labels
df['Label'] = np.where(mask,
                       'start date appeared earlier in end date column',
                       'start date did not appear earlier in the end date column')
#set first value to empty string
df.loc[0, 'Label'] = ''
print(df)
Transaction ID Transaction Start Date Transaction End Date \
0 1 2014-10-27 2014-11-11
1 2 2014-10-29 2014-11-30
2 3 2014-11-11 2014-11-20
3 4 2014-11-15 2014-11-28
4 5 2014-11-20 2014-12-05
5 6 2014-11-28 2014-12-15
6 7 2014-11-29 2014-12-20
Label
0
1 start date did not appear earlier in the end d...
2 start date appeared earlier in end date column
3 start date did not appear earlier in the end d...
4 start date appeared earlier in end date column
5 start date appeared earlier in end date column
6 start date did not appear earlier in the end d...
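As a side note, the list comprehension above rescans the End Date column once per row; for longer frames, a single pass with a running set of previously seen end dates gives the same mask (a sketch, not part of the original answer):
# walk the frame once, remembering every end date seen so far
seen = set()
mask = []
for start, end in zip(df['Transaction Start Date'], df['Transaction End Date']):
    mask.append(start in seen)   # was this start date already an earlier end date?
    seen.add(end)
The resulting mask can then be fed into the same np.where call as above.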
In the pandas dataframe example below, MyDate consists of the 1st day of the month and the last business day of the month. The dataset will always run to the 1st of (current month - 1).
I would like to dynamically increase MyDate by one month. In doing so however, the last business day is no longer the correct one. As such, I would also like to calculate the last business day based on the updated month.
Input:
MyDate MyValue
31/Mar/2020 0
01/Apr/2020 a
30/Apr/2020 b
01/May/2020 c
29/May/2020 d <<< note 29 May - last workday of month
01/Jun/2020 e
30/Jun/2020 f
01/Jul/2020 g
31/Jul/2020 h
01/Aug/2020 i
Desired output:
MyDate MyValue
30/Apr/2020 0
01/May/2020 a
29/May/2020 b <<< note 29 May - last workday of month
01/Jun/2020 c
30/Jun/2020 d
01/Jul/2020 e
31/Jul/2020 f
01/Aug/2020 g
31/Aug/2020 h
01/Sep/2020 i
I've broken down the problem into two parts:
1. Change the month to month+1 - using relativedelta
2. Get the last business day for the changed month - using pd.offsets.BMonthEnd()
but somehow I am stuck at #2, although I have attempted similar solutions posted on Stack Overflow.
This is my code:
import pandas as pd
from dateutil.relativedelta import relativedelta
...
# this solves part #1
df['MyDate']=df['MyDate'].dt.date + relativedelta(months=+1)
# attempt at solving part 2
df['MyDate']=pd.to_datetime(df['MyDate'])
mask = df['MyDate'].dt.day > 1
df.loc[mask, 'MyDate'] = df['MyDate'] + pd.offsets.BMonthEnd(1)
The last line is where I am stuck; obviously it does not generate the results I thought it would...
Any help with solving this, or a different "pandas-esque" approach of solving the problem as a whole, would be greatly appreciated.
You may create a boolean mask to identify business-month-end dates in your MyDate column (business-month-end dates return True, others return False). Use this mask to add 1 MonthBegin and 1 BMonthEnd separately:
m = df.MyDate == (df.MyDate + pd.offsets.BMonthEnd(0))
df.loc[m, 'MyDate'] = df.loc[m, 'MyDate'] + pd.offsets.BMonthEnd(1)
df.loc[~m, 'MyDate'] = df.loc[~m, 'MyDate'] + pd.offsets.MonthBegin(1)
print(df)
Output:
MyDate MyValue
0 2020-04-30 0
1 2020-05-01 a
2 2020-05-29 b
3 2020-06-01 c
4 2020-06-30 d
5 2020-07-01 e
6 2020-07-31 f
7 2020-08-01 g
8 2020-08-31 h
9 2020-09-01 i
Note: I assume your MyDate column is already in dtype: datetime64[ns]
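If it is not, something like the following should do the conversion first (a sketch; the %d/%b/%Y format matches the sample dates such as 31/Mar/2020):
# parse strings such as "31/Mar/2020" into datetime64[ns]
df['MyDate'] = pd.to_datetime(df['MyDate'], format='%d/%b/%Y')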
I have the following dataframe:
ID Minutes Datetime
1 30 6/4/2018 23:47:00
2 420
3 433 6/10/2018 2:50
4 580 6/9/2018 3:10
5 1020
I want to count the number of times the Minutes values fall within certain ranges. I want to do a similar count for the Datetime field (how many timestamps fall within each time-of-day range).
Below is the output I want:
MIN_RANGE COUNT
6-8 hours 2
8-10 hours 1
10-12 hours 0
12-14 hours 0
14-16 hours 0
16+ hours 1
RANGE COUNT
8pm - 10pm 0
10pm - 12am 1
12am - 2am 0
2am - 4am 2
4am - 6am 0
6am - 8am 0
8am - 10am 0
10am - 12pm 0
12pm - 2pm 0
2pm - 4pm 0
4pm - 6pm 0
6pm - 8pm 0
I have searched around Google and Stack Overflow for how to do this (binning and so on) but couldn't find anything directly related to what I am trying to do.
Help?
This is a complex problem that can be solved by using pd.date_range and pd.cut, and then some index manipulation.
First of all, you can start by cutting your data frame using pd.cut:
import pandas as pd

cuts = pd.cut(pd.to_datetime(df.Datetime), pd.date_range('02:00:00', freq='2H', periods=13))
0 (2018-07-09 22:00:00, 2018-07-10]
1 NaN
2 (2018-07-09 02:00:00, 2018-07-09 04:00:00]
3 (2018-07-09 02:00:00, 2018-07-09 04:00:00]
4 NaN
This will yield the cuts based on your Datetime column and the ranges defined.
Let's start with a base data frame whose counts are set to 0, so that we can update it later with your counts. Using your cuts from above,
cats = cuts.cat.categories
bases = ["{}-{}".format(v.left.strftime("%H%p"),v.right.strftime("%H%p")) for v in cats]
df_base = pd.DataFrame({"Range": bases, "Count":0}).set_index("Range")
which yields
Count
Range
02AM-04AM 0
04AM-06AM 0
06AM-08AM 0
08AM-10AM 0
10AM-12PM 0
12PM-14PM 0
14PM-16PM 0
16PM-18PM 0
18PM-20PM 0
20PM-22PM 0
22PM-00AM 0
00AM-02AM 0
Now, you can use collections.Counter to quickly count your occurrences:
from collections import Counter

x = Counter(cuts.dropna())
Notice that I have used dropna() so as not to count NaNs. With the x variable, we can
values = {"{}-{}".format(k.left.strftime("%H%p"), k.right.strftime("%H%p")) : v for k,v in x.items()}
counts_df = pd.DataFrame([values]).T
which yields
0
02AM-04AM 2
22PM-00AM 1
Finally, we just update our previous data frame with these values
df_base.loc[counts_df.index, "Count"] = counts_df[0]
Count
Range
02AM-04AM 2
04AM-06AM 0
06AM-08AM 0
08AM-10AM 0
10AM-12PM 0
12PM-14PM 0
14PM-16PM 0
16PM-18PM 0
18PM-20PM 0
20PM-22PM 0
22PM-00AM 1
00AM-02AM 0
import numpy as np
import pandas as pd

# bin edges in minutes: 6-8h, 8-10h, ..., 14-16h, and everything up to 24h as "16+"
counts = np.histogram(df['Minutes'],
                      bins=list(range(6*60, 18*60, 2*60)) + [24*60])[0]
bin_labels = ['6-8 hours',
              '8-10 hours',
              '10-12 hours',
              '12-14 hours',
              '14-16 hours',
              '16+ hours']
pd.Series(counts, index=bin_labels)
You can do a similar thing with the hours, using the hour attribute of datetime objects. You will have to fill in the empty parts of the Datetime column first.
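For example, something along these lines (a sketch; here the rows without a timestamp are simply dropped rather than filled in):
import numpy as np
import pandas as pd

# keep only rows that actually have a timestamp, then bin by hour of day in 2-hour steps
hours = pd.to_datetime(df['Datetime'], errors='coerce').dropna().dt.hour
hour_counts = np.histogram(hours, bins=range(0, 26, 2))[0]
hour_labels = ['{:02d}:00-{:02d}:00'.format(h, (h + 2) % 24) for h in range(0, 24, 2)]
pd.Series(hour_counts, index=hour_labels)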
#RafaelC has already addressed the binning and counting, but I'll make a note about reading the data from a file.
First, let's assume you separate your columns by commas (CSV), and start with:
dates.csv
ID,Minutes,Datetime
1,30,6/4/2018 23:47:00
2,420,
3,433,6/10/2018 2:50
4,580,6/9/2018 3:10
5,1020,
Then, you can read the values and parse the third column as dates as follows.
from datetime import datetime
import pandas as pd
def my_date_parser(date_str):
# Allow empty values to be coerced to NaT (Not a Time)
# rather than throw an exception
return pd.to_datetime(date_str, errors='coerce')
df = pd.read_csv(
'./dates.csv',
date_parser=my_date_parser,
parse_dates=['Datetime']
)
You can also get the counts by using the built-in floor method of datetime Series. In this case, you want to use a frequency of '2h' so that you are looking at 2-hour bins. Then just grab the time part:
import pandas as pd
df['Datetime'] = pd.to_datetime(df.Datetime)
df.Datetime.dt.floor('2h').dt.time
#0 22:00:00
#1 NaT
#2 02:00:00
#3 02:00:00
#4 NaT
(Alternatively you could also just use df.Datetime.dt.hour//2 to get the same grouping logic, but slightly different labels)
So you can easily just groupby this and then count:
df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
#Datetime
#02:00:00 2
#22:00:00 1
#dtype: int64
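For completeness, the dt.hour // 2 alternative mentioned above would look roughly like this (same counts, but integer bin labels; the missing timestamps again fall out of the groupby):
df.groupby(df.Datetime.dt.hour // 2).size()
# Datetime
# 1.0     2
# 11.0    1
# dtype: int64
Each label n corresponds to the 2-hour window starting at hour 2*n.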
Now to get the full list, we can just reindex, and change the index labels to be a bit more informative.
import datetime
import numpy as np
df_counts = df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
ids = [datetime.time(2*x,0) for x in range(12)]
df_counts = df_counts.reindex(ids).fillna(0).astype('int')
# Appropriately label the ranges with more info if needed
df_counts.index = '['+df_counts.index.astype(str) + ' - ' + np.roll(df_counts.index.astype(str), -1)+')'
Output:
df_counts
[00:00:00 - 02:00:00) 0
[02:00:00 - 04:00:00) 2
[04:00:00 - 06:00:00) 0
[06:00:00 - 08:00:00) 0
[08:00:00 - 10:00:00) 0
[10:00:00 - 12:00:00) 0
[12:00:00 - 14:00:00) 0
[14:00:00 - 16:00:00) 0
[16:00:00 - 18:00:00) 0
[18:00:00 - 20:00:00) 0
[20:00:00 - 22:00:00) 0
[22:00:00 - 00:00:00) 1
dtype: int64