Weird behavior for groupby using time interval in pyspark - python

I have a pyspark dataframe with a column named 'datetime' of the 'datetime64[ns]' type in the format "yyyy-MM-dd HH:mm:ss".
I'm trying to group it by a given timewindow.
This is what I'm doing:
import pyspark.sql.functions as psf
dataframe.groupBy(psf.window('datetime', f'{interval} seconds'), 'player_id', 'media_id').count()
interval is a parameter received as a string such as 'hour', 'day', 'week'.
I then convert it to seconds, as in, 1 hour = 3600 seconds, 1 day = 86400 seconds.
When I group it by 1 hour it works fine, this is the result:
window_start         window_end           player_id  media_id  count
2022-08-01 00:00:00  2022-08-01 01:00:00  1          2841      22
2022-08-01 00:00:00  2022-08-01 01:00:00  1          2899      44
Since the first date in the dataframe is 2022-08-01 everything is fine, but, when I try to group it by a week, this is the result:
window_start         window_end           player_id  media_id  count
2022-07-27 21:00:00  2022-08-03 21:00:00  1          1524      3
2022-07-27 21:00:00  2022-08-03 21:00:00  1          2841      1117
I'm positive there are no dates before 2022-08-01 in the dataframe.
Why is it doing this? I tried using the startTime parameter for the window function, but it is only to off-set the start, and not specify the beginning of a valid interval.
Any thoughts?
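One likely explanation, offered as a sketch rather than a confirmed diagnosis: tumbling windows are aligned to the Unix epoch (1970-01-01 00:00:00 UTC, a Thursday), not to the first timestamp in the data, so 7-day windows begin on Thursdays at 00:00 UTC. The 21:00 boundaries in the output above would then be Thursday midnight UTC shown in a UTC-3 session time zone (that time zone is an assumption inferred from the output). The pure-Python helper below, window_start, is hypothetical and only mimics the alignment arithmetic; it is not Spark code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # a Thursday
LOCAL = timezone(timedelta(hours=-3))  # assumed session time zone (UTC-3)


def window_start(ts, width, start_time=timedelta(0)):
    """Start of the tumbling window containing ts.

    Windows are aligned to EPOCH + start_time, not to the data.
    """
    elapsed = ts - EPOCH - start_time
    return ts - (elapsed % width)


first_row = datetime(2022, 8, 1, 0, 0, tzinfo=LOCAL)  # a Monday
week = timedelta(weeks=1)

# Default alignment: the window starts on the preceding Thursday 00:00 UTC.
default_start = window_start(first_row, week)
print(default_start)  # 2022-07-27 21:00:00-03:00

# Shift the grid by 4 days (Thursday -> Monday) plus 3 hours (UTC -> UTC-3).
monday_start = window_start(first_row, week, timedelta(days=4, hours=3))
print(monday_start)  # 2022-08-01 00:00:00-03:00
```

If this is what is happening, passing the equivalent offset as startTime (here 4 days plus 3 hours, i.e. 356400 seconds) should make the weekly windows start on Monday at local midnight. Hourly and daily windows look fine by comparison because the hourly grid coincides with local hours at a whole-hour UTC offset.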

Related

Out of different set of dates I want to check if all set of dates are contiguous

If I have 2 different sets of dates:
Set 1:
01/05/2022 - 31/12/2022
01/01/2023 - 31/12/2023
Set 2:
01/05/2022 - 30/09/2022
01/10/2022 - 31/12/2022
01/01/2023 - 31/12/2023
I want to check whether each set of dates above is contiguous across the range of dates below:
Date 1 = 01/05/2022
Date 2 = 31/12/2023
Please suggest a solution.
It seems easier to me to use pandas to check whether the dates fall into the date range.
Your data is ordered day, month, year; in my experience the usual order is year, month, day.
I converted 'Date_1' and 'Date_2' to datetime objects, and did the same for the date ranges, which I split into two lists: start and finish. Then I built a dataframe from those lists and filtered it by the date range. For clarity I deliberately added one extra row, 2023-01-01 2025-12-31, which is filtered out because it does not satisfy the condition.
import pandas as pd
from datetime import datetime
Date_1 = '01/05/2022'
Date_2 = '31/12/2023'
Date_1 = datetime.strptime(Date_1, "%d/%m/%Y")
Date_2 = datetime.strptime(Date_2, "%d/%m/%Y")
start = [datetime.strptime(i, "%d/%m/%Y") for i in ['01/05/2022', '01/01/2023', '01/05/2022', '01/10/2022', '01/01/2023', '01/01/2023']]
finish = [datetime.strptime(i, "%d/%m/%Y") for i in ['31/12/2022', '31/12/2023', '30/09/2022', '31/12/2022', '31/12/2023', '31/12/2025']]
df = pd.DataFrame({'start': start, 'finish': finish})
print(df)
print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
Output print(df)
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
5 2023-01-01 2025-12-31
Output print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
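The filter above only checks that each range lies inside Date_1 to Date_2; it does not show whether the ranges of one set join up without gaps. A minimal sketch of such a check follows, assuming the ranges are inclusive, day-level, and written as dd/mm/YYYY. The helper names parse and is_contiguous are hypothetical.

```python
from datetime import datetime, timedelta


def parse(text):
    """Parse a dd/mm/YYYY string into a date."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def is_contiguous(ranges, first, last):
    """True if the ranges cover first..last with no gap and no overlap."""
    spans = sorted((parse(start), parse(end)) for start, end in ranges)
    if not spans:
        return False
    # The set must begin and end exactly on the requested boundaries.
    if spans[0][0] != parse(first) or spans[-1][1] != parse(last):
        return False
    # Each range must start the day after the previous one ends.
    return all(
        next_start == prev_end + timedelta(days=1)
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    )


set_1 = [("01/05/2022", "31/12/2022"), ("01/01/2023", "31/12/2023")]
set_2 = [("01/05/2022", "30/09/2022"), ("01/10/2022", "31/12/2022"),
         ("01/01/2023", "31/12/2023")]

print(is_contiguous(set_1, "01/05/2022", "31/12/2023"))  # True
print(is_contiguous(set_2, "01/05/2022", "31/12/2023"))  # True
```

Dropping the middle range from set_2 leaves a gap from October to December 2022, and the function then returns False.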

Pandas dataframe - fillna with last of next month

I've been staring at this way too long and I think I've lost my mind; it really shouldn't be as complicated as I'm making it.
I have a df:
Date1       Date2
2022-04-01  2022-06-17
2022-04-15  2022-04-15
2022-03-03  NaT
2022-04-22  NaT
2022-05-06  2022-06-06
I want to fill the blanks in 'Date2' where it keeps the values from 'Date2' if they are present but if 'Date2' is NaT then I want it to be the last date of the subsequent month from 'Date1'.
In the example above, the 2 NaT fields would become:
Date1       Date2
2022-03-03  2022-04-30
2022-04-22  2022-05-31
I know I have to use .fillna and the closest I've come is this:
df['Date2'] = (df['Date2'].fillna((df['Date1'] + pd.DateOffset(months=1)).replace)).to_numpy().astype('datetime64[M]')
This returns the first of the month. However, it returns the first of the month for all rows (not just NaT rows) and it is returning the first of the month as opposed to the last of the month.
I'm pretty sure my parentheses are messed up, and I've tried many different combinations of - timedelta and similar.
What am I doing wrong here? TIA!
Your question can be interpreted in two ways given the provided example.
End of month of the next row's Date1 (which now does not seem to be what you want)
You need to use pd.offsets.MonthEnd and shift
df['Date2'] = (df['Date2']
.fillna(df['Date1'].add(pd.offsets.MonthEnd())
.shift(-1))
)
Next month's end (same row)
If you want the next month end of the same row:
df['Date2'] = (df['Date2']
.fillna(df['Date1'].add(pd.offsets.MonthEnd(2)))
)
Output:
Date1 Date2
0 2022-04-01 2022-06-17
1 2022-04-15 2022-04-15
2 2022-03-03 2022-04-30
3 2022-04-22 2022-05-31
4 2022-05-06 2022-06-06
Use MonthEnd and loc:
from pandas.tseries.offsets import MonthEnd
df.loc[df['Date2'].isnull(), 'Date2'] = df['Date1'] + pd.DateOffset(months=1) + MonthEnd(1)
Use MonthEnd with an offset of 2 (current month and next month):
df['Date2'] = df['Date2'].fillna(df['Date1'].add(pd.offsets.MonthEnd(2)))
print(df)
# Output
Date1 Date2
0 2022-04-01 2022-06-17
1 2022-04-15 2022-04-15
2 2022-03-03 2022-04-30
3 2022-04-22 2022-05-31
4 2022-05-06 2022-06-06
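One edge case may be worth checking with the answers above: MonthEnd(2) counts two month ends forward, so a Date1 that already falls on a month end (for example 2022-01-31) lands on 2022-03-31 rather than 2022-02-28. A possible alternative, sketched here with made-up sample rows, adds one calendar month and then rolls forward with MonthEnd(0), which leaves a date that is already a month end unchanged.

```python
import pandas as pd

# Sample data for illustration only; the last row is a month-end Date1.
df = pd.DataFrame({
    "Date1": pd.to_datetime(["2022-03-03", "2022-04-22", "2022-01-31"]),
    "Date2": pd.to_datetime(["2022-06-17", None, None]),
})

# Move one calendar month ahead, then snap to that month's last day.
fallback = df["Date1"] + pd.DateOffset(months=1) + pd.offsets.MonthEnd(0)

# Only the NaT rows take the fallback; existing Date2 values are kept.
df["Date2"] = df["Date2"].fillna(fallback)
print(df)
```

With this data the filled values should be 2022-05-31 and 2022-02-28, while the first row keeps 2022-06-17.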

Reading in Date / Time Values Correctly

Any ideas on how I can manipulate my current date-time data to make it suitable for use when converting the datatype to time?
For example:
df1['Date/Time'] = pd.to_datetime(df1['Date/Time'])
The current format for the data is mm/dd 00:00:00
an example of the column in the dataframe can be seen below.
Date/Time Dry_Temp[C] Wet_Temp[C] Solar_Diffuse_Rate[[W/m2]] \
0 01/01 00:10:00 8.45 8.237306 0.0
1 01/01 00:20:00 7.30 6.968360 0.0
2 01/01 00:30:00 6.15 5.710239 0.0
3 01/01 00:40:00 5.00 4.462898 0.0
4 01/01 00:50:00 3.85 3.226244 0.0
For the condition where the hour is denoted as 24, you have two choices. First you can simply reset the hour to 00 and second you can reset the hour to 00 and also add 1 to the date.
In either case the first step is detecting the condition which can be done with a simple find statement t.find(' 24:')
Having detected the condition, in the first case it is a simple matter of resetting the hour to 00 and proceeding with the process of formatting the field. In the second case, however, adding 1 to the day is a little more complicated because you can roll over to the next month.
Here is the approach I would use:
Given a df of form:
Date Time
0 01/01 00:00:00
1 01/01 00:24:00
2 01/01 24:00:00
3 01/31 24:00:00
The First Case
def parseDate(tx):
    # Look for an hour written as 24
    ti = tx.find(' 24:')
    if ti >= 0:
        # Reset the hour to 00 and keep the same date
        # (tx[9:] is the 'MM:SS' part of 'mm/dd HH:MM:SS')
        return pd.to_datetime(tx[:5]+' 00:'+tx[9:], format= '%m/%d %H:%M:%S')
    return pd.to_datetime(tx, format= '%m/%d %H:%M:%S')
df['Date Time'] = df['Date Time'].apply(lambda x: parseDate(x))
Produces the following:
Date Time
0 1900-01-01 00:00:00
1 1900-01-01 00:24:00
2 1900-01-01 00:00:00
3 1900-01-31 00:00:00
For the second case, I used relativedelta from the dateutil library and slightly modified my parseDate function as shown below:
from dateutil.relativedelta import relativedelta
def parseDate2(tx):
    # Look for an hour written as 24
    ti = tx.find(' 24:')
    if ti >= 0:
        # Reset the hour to 00, then add 24 hours so the date rolls over
        # (tx[9:] is the 'MM:SS' part of 'mm/dd HH:MM:SS')
        tk = pd.to_datetime(tx[:5]+' 00:'+tx[9:], format= '%m/%d %H:%M:%S')
        return tk + relativedelta(hours=+24)
    return pd.to_datetime(tx, format= '%m/%d %H:%M:%S')
df['Date Time'] = df['Date Time'].apply(lambda x: parseDate2(x))
Yields:
Date Time
0 1900-01-01 00:00:00
1 1900-01-01 00:24:00
2 1900-01-02 00:00:00
3 1900-02-01 00:00:00
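Because the format has no year, the parsed values default to 1900. A standard-library sketch of the second case that also supplies a year is shown below; the function name parse_with_hour_24 and the year 2021 are assumptions for illustration, since the real recording year is not given.

```python
from datetime import datetime, timedelta


def parse_with_hour_24(text, year):
    """Parse 'mm/dd HH:MM:SS', treating hour 24 as 00 of the next day."""
    date_part, time_part = text.split()
    rolls_over = time_part.startswith("24:")
    if rolls_over:
        # strptime rejects hour 24, so parse it as hour 00 first.
        time_part = "00:" + time_part[3:]
    parsed = datetime.strptime(
        f"{year}/{date_part} {time_part}", "%Y/%m/%d %H:%M:%S"
    )
    # Adding a timedelta handles month and year rollover automatically.
    return parsed + timedelta(days=1) if rolls_over else parsed


print(parse_with_hour_24("01/01 00:24:00", 2021))  # 2021-01-01 00:24:00
print(parse_with_hour_24("01/31 24:00:00", 2021))  # 2021-02-01 00:00:00
print(parse_with_hour_24("12/31 24:00:00", 2021))  # 2022-01-01 00:00:00
```

Supplying the year up front also avoids parsing 02/29 against the non-leap default year 1900.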
To access the values of the datetime (namely the time), you can use:
# These are now in a usable format
seconds = df1['Date/Time'].dt.second
minutes = df1['Date/Time'].dt.minute
hours = df1['Date/Time'].dt.hour
And if need be, you can create its own independent time series with:
df1['Date/Time'].dt.time

Date Time Format Issues Python

I am currently having issues with date-time format, particularly converting string input to the correct python datetime format
Date/Time Dry_Temp[C] Wet_Temp[C] Solar_Diffuse_Rate[[W/m2]] \
0 01/01 00:10:00 8.45 8.237306 0.0
1 01/01 00:20:00 7.30 6.968360 0.0
2 01/01 00:30:00 6.15 5.710239 0.0
3 01/01 00:40:00 5.00 4.462898 0.0
4 01/01 00:50:00 3.85 3.226244 0.0
These are examples of the timestamps I currently have in my data. I have tried splitting date and time, so that I now have the following columns:
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time
0 55.553640 18 26 1900-01-01 00:10:00
1 54.204342 18 26 1900-01-01 00:20:00
2 51.896272 18 26 1900-01-01 00:30:00
3 49.007770 18 26 1900-01-01 00:40:00
4 45.825810 18 26 1900-01-01 00:50:00
I have managed to get the year into datetime format, but there are still 2 problems to resolve:
the data was not recorded in 1900, so I would like to change the year in the Date,
I get the following error when trying to convert the time into Python datetime format:
pandas/_libs/tslibs/strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()
ValueError: time data '00:00:00' does not match format ' %m/%d %H:%M:%S' (match)
I tried having 24:00:00, however, python didn't like that either...
preferences:
I would prefer if they were both in the same cell without having to split this information into two columns.
I would also like to get rid of the seconds data as the data was recorded in 10 min intervals so there is no need for seconds in my case.
Any help would be greatly appreciated.
the data was not recorded in 1900, so I would like to change the year in the Date,
The datetime.datetime.replace method can be used for this task. Consider the following example:
import pandas as pd
df = pd.DataFrame({"when":pd.to_datetime(["1900-01-01","1900-02-02","1900-03-03"])})
df["when"] = df["when"].apply(lambda x:x.replace(year=2000))
print(df)
output
when
0 2000-01-01
1 2000-02-02
2 2000-03-03
Note that it can be used also without pandas for example
import datetime
d = datetime.datetime.strptime("","") # use all default values which result in midnight of Jan 1 of year 1900
print(d) # 1900-01-01 00:00:00
d = d.replace(year=2000)
print(d) # 2000-01-01 00:00:00
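The other points in the question might be handled at parse time, sketched here under a few assumptions: the leading space in the format string of the error message suggests stray whitespace in the raw values, so they are stripped first; the year 2021 is a hypothetical recording year; and the sample values are made up. Prefixing the year keeps date and time in one column, and formatting without %S drops the seconds for display.

```python
import pandas as pd

# Sample values for illustration, with a leading space as a guess at the raw data.
raw = pd.Series([" 01/01 00:10:00", " 01/01 00:20:00"])
year = 2021  # hypothetical recording year

# Strip whitespace, prefix the year, and parse into a single datetime column.
stamps = pd.to_datetime(f"{year}/" + raw.str.strip(), format="%Y/%m/%d %H:%M:%S")

# String labels without the seconds, for display purposes.
labels = stamps.dt.strftime("%Y-%m-%d %H:%M")
print(labels.tolist())
```

The stamps column stays a real datetime column for arithmetic, while labels is text only.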

Allocating the timeframe based on the datetime using pandas

I need to find the timeframe from the master based on the input time.
cust_id starttime
0 1 2000-01-01 09:00:03
1 2 2000-01-01 18:01:03
output i needed is
cust_id starttime timeframe
0 1 2000-01-01 09:00:03 morning
1 2 2000-01-01 18:01:03 evening
Code for creating master timeframe details
mastdf={'timeframe':['morning','latemorning','midnoon','evening'],'start_time':['8:00:00','11:00:00','13:00:00','17:00:00'],'end_time':['10:59:59','13:59:59','16:59:59','7:59:59']}
Code for creating input dataframe
inputdf={'cust_id':[1,2],'starttime':['2000-01-01 09:00:03', '2000-01-01 18:01:03']}
Use cut for binning, but first convert the values to timedeltas with to_timedelta. Build the bins from the start times plus a 24H endpoint; times between 00:00:00 and 8:00:00 fall outside the bins, so fill them with fillna using the last value of the timeframe column:
mastdf={'timeframe':['morning','latemorning','midnoon','evening'],
'start_time':['8:00:00','11:00:00','13:00:00','17:00:00'],
'end_time':['10:59:59','13:59:59','16:59:59','7:59:59']}
mastdf = pd.DataFrame(mastdf)
print (mastdf)
timeframe start_time end_time
0 morning 8:00:00 10:59:59
1 latemorning 11:00:00 13:59:59
2 midnoon 13:00:00 16:59:59
3 evening 17:00:00 7:59:59
inputdf={'cust_id':[1,2],'starttime':['2000-01-01 09:00:03', '2000-01-01 18:01:03']}
inputdf = pd.DataFrame(inputdf)
inputdf['starttime'] = pd.to_datetime(inputdf['starttime'])
start = pd.to_timedelta(mastdf['start_time']).tolist() + [pd.Timedelta(24, unit='h')]
s = pd.to_timedelta(inputdf['starttime'].dt.strftime('%H:%M:%S'))
last = mastdf['timeframe'].iat[-1]
inputdf['timeframe'] = pd.cut(s,
bins=start,
labels=mastdf['timeframe'], right=False).fillna(last)
print (inputdf)
cust_id starttime timeframe
0 1 2000-01-01 09:00:03 morning
1 2 2000-01-01 18:01:03 evening
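The same lookup can be sketched without pandas, which may make the wrap-around past midnight easier to see. This assumes, as the answer above does, that only the start times define the bins; the names STARTS, LABELS and timeframe are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, time

# Start of each timeframe, in ascending order, with matching labels.
STARTS = [time(8), time(11), time(13), time(17)]
LABELS = ["morning", "latemorning", "midnoon", "evening"]


def timeframe(ts):
    """Label of the last timeframe that starts at or before ts."""
    idx = bisect_right(STARTS, ts.time()) - 1
    # Before 08:00 idx is -1, which selects the last label ('evening'),
    # matching the evening frame that runs past midnight to 07:59:59.
    return LABELS[idx]


print(timeframe(datetime(2000, 1, 1, 9, 0, 3)))   # morning
print(timeframe(datetime(2000, 1, 1, 18, 1, 3)))  # evening
print(timeframe(datetime(2000, 1, 1, 3, 0, 0)))   # evening
```

For a dataframe column, the function could be applied row by row, though the cut approach above is likely faster on large data.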
