I've been staring at this way too long and I think I've lost my mind; it really shouldn't be as complicated as I'm making it.
I have a df:
Date1       Date2
2022-04-01  2022-06-17
2022-04-15  2022-04-15
2022-03-03  NaT
2022-04-22  NaT
2022-05-06  2022-06-06
I want to fill the blanks in 'Date2': keep the existing 'Date2' values where they are present, but where 'Date2' is NaT, use the last day of the month after the 'Date1' month.
In the example above, the 2 NaT fields would become:
Date1       Date2
2022-03-03  2022-04-30
2022-04-22  2022-05-31
I know I have to use .fillna and the closest I've come is this:
df['Date2'] = (df['Date2'].fillna((df['Date1'] + pd.DateOffset(months=1)).replace)).to_numpy().astype('datetime64[M]')
This returns the first of the month, but it does so for all rows (not just the NaT rows), and it returns the first of the month rather than the last.
I'm pretty sure my parentheses are messed up, and I've tried many different combinations of subtracting a timedelta and similar.
What am I doing wrong here? TIA!
Your question can be interpreted in two ways given the provided example.
End of month of the next row's Date1 (which now does not seem to be what you want)
You need to use pd.offsets.MonthEnd and shift:
df['Date2'] = (df['Date2']
               .fillna(df['Date1'].add(pd.offsets.MonthEnd())
                                  .shift(-1))
               )
Next month's end (same row)
If you want the next month end of the same row:
df['Date2'] = (df['Date2']
               .fillna(df['Date1'].add(pd.offsets.MonthEnd(2)))
               )
Output:
Date1 Date2
0 2022-04-01 2022-06-17
1 2022-04-15 2022-04-15
2 2022-03-03 2022-04-30
3 2022-04-22 2022-05-31
4 2022-05-06 2022-06-06
Use MonthEnd and loc:
from pandas.tseries.offsets import MonthEnd
df.loc[df['Date2'].isnull(), 'Date2'] = df['Date1'] + pd.DateOffset(months=1) + MonthEnd(1)
Use MonthEnd with an offset of 2 (current month and next month):
df['Date2'] = df['Date2'].fillna(df['Date1'].add(pd.offsets.MonthEnd(2)))
print(df)
# Output
Date1 Date2
0 2022-04-01 2022-06-17
1 2022-04-15 2022-04-15
2 2022-03-03 2022-04-30
3 2022-04-22 2022-05-31
4 2022-05-06 2022-06-06
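To see why the offset is 2 (an illustrative aside, not part of the answers above): the first MonthEnd rolls forward to the end of the current month, and the second to the end of the following month.
import pandas as pd

ts = pd.Timestamp('2022-03-03')
print(ts + pd.offsets.MonthEnd(1))  # 2022-03-31 (end of the current month)
print(ts + pd.offsets.MonthEnd(2))  # 2022-04-30 (end of the next month)
# Caveat: if ts were already a month end, MonthEnd(2) would jump two months
# (e.g. 2022-03-31 -> 2022-05-31), so this relies on Date1 not being a month end.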
I have a pyspark dataframe with a column named 'datetime' of the 'datetime64[ns]' type in the format "yyyy-MM-dd HH:mm:ss".
I'm trying to group it by a given timewindow.
This is what I'm doing:
import pyspark.sql.functions as psf
dataframe.groupBy(psf.window('datetime', f'{interval} seconds'), 'player_id', 'media_id').count()
interval is a parameter received as a string such as 'hour', 'day', 'week'.
I then convert it to seconds, as in, 1 hour = 3600 seconds, 1 day = 86400 seconds.
When I group it by 1 hour it works fine; this is the result:
window_start         window_end           player_id  media_id  count
2022-08-01 00:00:00  2022-08-01 01:00:00  1          2841      22
2022-08-01 00:00:00  2022-08-01 01:00:00  1          2899      44
Since the first date in the dataframe is 2022-08-01, everything is fine. But when I try to group by week, this is the result:
window_start         window_end           player_id  media_id  count
2022-07-27 21:00:00  2022-08-03 21:00:00  1          1524      3
2022-07-27 21:00:00  2022-08-03 21:00:00  1          2841      1117
I'm positive there are no dates before 2022-08-01 in the dataframe.
Why is it doing this? I tried using the startTime parameter of the window function, but it only offsets the start; it doesn't specify the beginning of a valid interval.
Any thoughts?
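A sketch of the likely cause rather than a confirmed fix: Spark aligns tumbling windows to the Unix epoch (1970-01-01 00:00:00 UTC), not to the first timestamp in the data. 1970-01-01 was a Thursday, so 7-day windows start on Thursdays in UTC; rendered in a UTC-3 session timezone that is Wednesday 21:00, which matches the 2022-07-27 21:00:00 boundary above. startTime shifts that epoch alignment, so choosing it so that a boundary lands on Monday 2022-08-01 00:00:00 local time may give calendar weeks (the '3 hours' below is an assumption that the session timezone is UTC-3):
import pyspark.sql.functions as psf

# Shift the epoch-aligned weekly boundary (Thursday 00:00 UTC) by 4 days to
# reach Monday, plus 3 hours for the assumed UTC-3 session timezone.
result = (dataframe
          .groupBy(psf.window('datetime', '7 days', startTime='4 days 3 hours'),
                   'player_id', 'media_id')
          .count())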
If I have 2 different sets of date ranges:

Set 1:
01/05/2022 - 31/12/2022
01/01/2023 - 31/12/2023

Set 2:
01/05/2022 - 30/09/2022
01/10/2022 - 31/12/2022
01/01/2023 - 31/12/2023
I want to check whether each set of dates above is contiguous within the range of dates below:
Date 1 = 01/05/2022
Date 2 = 31/12/2023
Please suggest a solution.
It seems to me easier to use pandas to check if dates fall into the date range.
Your data is in day, month, year order; in my practice I usually see year, month, day.
I converted the variables 'Date_1' and 'Date_2' to the desired format, and split the date ranges themselves into two arrays, from-dates and to-dates. Then I filled a dataframe with these arrays and checked the date range. For clarity I added one extra row of data, 2023-01-01 2025-12-31; it is simply filtered out, since it does not satisfy the condition.
import pandas as pd
from datetime import datetime
Date_1 = '01/05/2022'
Date_2 = '31/12/2023'
Date_1 = datetime.strptime(Date_1, "%d/%m/%Y")
Date_2 = datetime.strptime(Date_2, "%d/%m/%Y")
start = [datetime.strptime(i, "%d/%m/%Y") for i in
         ['01/05/2022', '01/01/2023', '01/05/2022', '01/10/2022', '01/01/2023', '01/01/2023']]
finish = [datetime.strptime(i, "%d/%m/%Y") for i in
          ['31/12/2022', '31/12/2023', '30/09/2022', '31/12/2022', '31/12/2023', '31/12/2025']]
df = pd.DataFrame({'start': start, 'finish': finish})
print(df)
print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
Output of print(df):
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
5 2023-01-01 2025-12-31
Output of print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)]):
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
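Note that the filter above only checks that each range falls inside Date_1 to Date_2; it does not verify contiguity. A minimal sketch of an actual gapless check, reusing the df built above (is_contiguous is a hypothetical helper name):
def is_contiguous(ranges, date_1, date_2):
    # Sort by start; the set is contiguous if each range starts exactly one
    # day after the previous one ends, the first range starts on date_1 and
    # the last range ends on date_2.
    s = ranges.sort_values('start').reset_index(drop=True)
    gapless = (s['start'].iloc[1:].values ==
               (s['finish'].iloc[:-1] + pd.Timedelta(days=1)).values).all()
    return bool(gapless) and s['start'].iloc[0] == date_1 and s['finish'].iloc[-1] == date_2

print(is_contiguous(df.iloc[:2], Date_1, Date_2))   # True: the first set has no gaps
print(is_contiguous(df.iloc[2:5], Date_1, Date_2))  # True: so does the second set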
I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Here 'date' is Week/Year.
So I tried using the following code:
df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But it returns this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But the first day of the first week of 2020 is 2019-12-29, considering Sunday as the first day. How can I get 2019-12-29 as the first day of the first week of 2020 instead of 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: My original answer doesn't work for input 1/2023, and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer to add a custom function.
Here is a way with a custom function:
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
    # If week 0 and week 1 of the year parse to the same date, the year
    # starts on a Sunday and the plain %U parse is already correct;
    # otherwise it lands one week late and must be pulled back.
    date0 = datetime.strptime('0/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date1 = datetime.strptime('1/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date = datetime.strptime(date + '/0', "%U/%Y/%w")
    return date if date0 == date1 else date - timedelta(weeks=1)

df['new_date'] = df['date'].apply(get_first_day)
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
You'll want to use ISO week parsing directives, Ex:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
you can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]
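To make the caveat from the edit in the previous answer concrete (an illustrative aside): ISO weeks and Sunday-based %U weeks genuinely disagree in some years. For 1/2021, the shifted ISO parse gives 2021-01-03, while the %U-based custom function above gives 2020-12-27, so the two approaches are not interchangeable for every year.
import pandas as pd

# ISO week 1 of 2021 starts on Monday 2021-01-04; shifting back one day
# gives Sunday 2021-01-03, not 2020-12-27.
print(pd.to_datetime("1/2021/1", format="%V/%G/%u") - pd.Timedelta('1d'))
# 2021-01-03 00:00:00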
My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4-weeks long, except every 3rd month is 5-weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
import numpy as np
import pandas as pd

df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
    index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4 weeks starting on Sunday results in the following. The first three month start dates are correct, but I need every third month to be 5 weeks long; the 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports the 52-53-week fiscal (aka 4-4-5) calendar only at the yearly and quarterly level; see pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level while maintaining a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while start < end:
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency period, so it can move from date to date following the offsets in the given array:
offset_445 = [
pd.tseries.offsets.FY5253Quarter(weekday=6),
4*pd.tseries.offsets.Week(weekday=6),
4*pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
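As a quick sanity check (illustrative, not part of the original answer), the gaps between consecutive period starts should follow the 4-4-5 week pattern:
print(df_index_445.diff().dropna().dt.days.tolist())
# [28, 28, 35, 28, 28, 35, 28, 28, 35, 28]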
Once the index is created, then it's back to aggregations logic to get the data in the right row buckets. Assuming that you want the mean for the start of each 4 or 5 week period, according to the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
I have separate columns for start (timestamp) and end (timestamp), and I need to get the earliest start time and latest end time for each date.
number start end test time
0 1 2020-02-01 06:27:38 2020-02-01 08:29:42 1 02:02:04
1 1 2020-02-01 08:41:03 2020-02-01 11:05:30 2 02:24:27
2 1 2020-02-01 11:20:22 2020-02-01 13:03:49 1 01:43:27
3 1 2020-02-01 13:38:18 2020-02-01 16:04:31 2 02:26:13
4 1 2020-02-01 16:26:46 2020-02-01 17:42:49 1 01:16:03
5 1 2020-02-02 10:11:00 2020-02-02 12:11:00 1 02:00:00
I want the output for each date as: Date, Min, Max.
I'm fairly new to Pandas, and most of the solutions I've come across find the min and max datetime of a single column. What I want is the min and max datetime for each date, where the timestamps are spread over two columns.
expected output (ignore the date and time formats please)
date min max
1/2/2020 6:27 17:42
2/2/2020 10:11 12:11
I believe you need to start by creating a date column and then performing a groupby on date.
df['date'] = df['start'].dt.date
df['start_hm'] = df['start'].dt.strftime('%H:%M')
df['end_hm'] = df['end'].dt.strftime('%H:%M')
output = df.groupby('date').agg(min=pd.NamedAgg(column='start_hm', aggfunc='min'),
                                max=pd.NamedAgg(column='end_hm', aggfunc='max'))
Output:
min max
date
2020-02-01 06:27 17:42
2020-02-02 10:11 12:11
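An equivalent sketch that aggregates on the raw timestamps first and formats afterwards; a minor variation, since zero-padded 'HH:MM' strings do sort correctly, but it also keeps working if you later need seconds or a different display format:
out = df.groupby(df['start'].dt.date).agg(min=('start', 'min'),
                                          max=('end', 'max'))
out['min'] = out['min'].dt.strftime('%H:%M')
out['max'] = out['max'].dt.strftime('%H:%M')
out.index.name = 'date'  # grouping on a Series would otherwise name the index 'start'
print(out)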