Spark - Statistical calculations sorted by timestamps ranges - python

I'm trying to calculate statistical measures based on a range of hours and\or days.
Meaning, I have a CSV file that is something like this:
TRANSACTION_URL START_TIME END_TIME SIZE FLAG
www.google.com 20170113093210 20170113093210 150 1
www.cnet.com 20170113114510 20170113093210 150 2
START_TIME and END_TIME are in yyyyMMddhhmmss format.
I'm first converting it to yyyy-MM-dd hh:mm:ss format by using the following code:
from_pattern = 'yyyyMMddhhmmss'
to_pattern = 'yyyy-MM-dd hh:mm:ss'
log_df = log_df.withColumn('START_TIME', from_unixtime(unix_timestamp(
log_df['START_TIME'].cast(StringType()), from_pattern), to_pattern).cast(TimestampType()))
And afterward, I would like to use groupBy() in order to calculate, for example, the mean of the SIZE column, based on the transaction TIME frame.
For example, I would like to do something like:
for all transactions that are between 09:00 to 11:00
calculate SIZE mean
for all transactions that are between 14:00 to 16:00
calculate SIZE mean
And also:
for all transactions that are in a WEEKEND date
calculate SIZE mean
for all transactions that are NOT in a WEEKEND date
calculate SIZE mean
I DO know how to use groupBy for a 'default' configuration, such as calculating statistical measures for SIZE column, based on FLAG column values. I'm using something like:
log_df.cache().groupBy('FLAG').agg(mean('SIZE').alias("Mean"), stddev('SIZE').alias("Stddev")).\
withColumn("Variance", pow(col("Stddev"), 2)).show(3, False)
So, my questions are:
How to achieve such grouping and calculating, for a range of hours? (1st pseudo code example)
How to achieve such grouping and calculating, by dates? (2nd pseudo code example)
Is there any python package that can receive yy-MM-dd and return true if it's a weekend date?
Thanks

Let's assume you have a function encode_dates which receives the date and returns a sequence of encoding for all times periods you are interested in. So for example for tuesday 9-12 it would return Seq("9-11","10-12","11-13","weekday"). This would be a regular scala function (unrelated to spark).
now you can make it a UDF and add it as a column and explode the column so you will have multiple copies. Now all you need to do is add this column for the groupby.
So it would look something like this:
val encodeUDF = udf(encode_dates _)
log_df.cache().withColumn("timePeriod", explode(encodeUDF($"start_date", $"end_date").groupBy('FLAG', 'timePeriod').agg(mean('SIZE').alias("Mean"), stddev('SIZE').alias("Stddev")).
withColumn("Variance", pow(col("Stddev"), 2)).show(3, False)

Related

Can you Group XArray Data Array by Day of Month?

If there was a variable in an xarray dataset with a time dimension with daily values over some multiyear time span
2017-01-01 ... 2018-12-31, then it is possible to group the data by month, or by the day of the year, using
.groupby("time.month") or .groupby("time.dayofyear")
Is there a way to efficiently group the data by the day of the month, for example if I wanted to calculate the mean value on the 21st of each month?
See the xarray docs on the DateTimeAccessor helper object. For more info, you can also check out the xarray docs on Working with Time Series Data: Datetime Components, which in turn refers to the pandas docs on date/time components.
You're looking for day. Unfortunately, both pandas and xarray simply describe .dt.day as referring to "the days of the datetime" which isn't particularly helpful. But if you take a look at python's native datetime.Date.day definition, you'll see the more specific:
date.day
Between 1 and the number of days in the given month of the given year.
So, simply
da.groupby("time.day")
Should do the trick!
I not sure, but maybe you can do like this:
import datetime
x = datetime.datetime.now()
day = x.strftime("%d")
month = x.strftime("%m")
year = x.strftime("%Y")
.groupby(month) or .groupby(year)

How to use resample count to screen original data

Say I have a dataset at daily scale, but not all days have valid data. In other words, some days are missing in the data. I want to compute the summer season mean from the dataset, and want to remove the month which has less than 20 days of valid data.
How do I achieve this (in pythonic fashion)?
Say my dataframe (df) is like this:
DATE VAR
1900-01-01 123
1900-01-02 456
1900-01-10 789
...
I know how to compute the count:
df_count = df.resample('MS').count()
I also know how to compute the summer season mean:
df_summer = df.resample('Q-NOV').mean()
You can based on df_count to filter out the month which have less than 20 days of valid data. After that compute the summer season mean using your formula.
df_count = df.resample('MS').count()
relevant_month = df_count[df_count > 10].index
df_summer = df[df.index.isin(relevant_month)].resample('Q-NOV').mean()
I suppose you store the month in index. If the month or time is stored in a different column, change df.index.isin(relevant_month) to df.columnName.isin(relevant_month).
I also don't know the format of your time column (date or datetime) so you might need to modify the code to change this part df.index.isin(relevant_month) accordingly. It is just the general idea.

maximum difference between two time series of different resolution

I have two time series data that gives the electricity demand in one-hour resolution and five-minute resolution. I am trying to find the maximum difference between these two time series. So the one-hour resolution data has 8760 rows (hourly for an year) and the 5-minute resolution data has 104,722 rows (5-minutly for an year).
I can only think of a method that will expand the hourly data into 5 minute resolution that will have 12 times repeating of the hourly data and find the maximum of the difference of the two data sets.
If this technique is the way to go, is there an easy way to convert my hourly data into 5-minute resolution by repeating the hourly data 12 times?
for your reference I posted a plot of this data for one day.
P.S> I am using Python to do this task
Numpy's .repeat() function
You can change your hourly data into 5-minute data by using numpy's repeat function
import numpy as np
np.repeat(hourly_data, 12)
I would strongly recommend against converting the hourly data into five-minute data. If the data in both cases refers to the mean load of those time ranges, you'll be looking at more accurate data if you group the five-minute intervals into hourly datasets. You'd get more granularity the way you're talking about, but the granularity is not based on accurate data, so you're not actually getting more value from it. If you aggregate the five-minute chunks into hourly chunks and compare the series that way, you can be more confident in the trustworthiness of your results.
In order to group them together to get that result, you can define a function like the following and use the apply method like so:
def to_hour(date):
date = date.strftime("%Y-%m-%d %H:00:00")
date = dt.strptime(date, "%Y-%m-%d %H:%M:%S")
return date
df['Aggregated_Datetime'] = df['Original_Datetime'].apply(lambda x: to_hour(x))
df.groupby('Aggregated_Datetime').agg('Real-Time Lo

How can a DataFrame change from having two columns (a "from" datetime and a "to" datetime) to having a single column for a date?

I've got a DataFrame that looks like this:
It has two columns, one of them being a "from" datetime and one of them being a "to" datetime. I would like to change this DataFrame such that it has a single column or index for the date (e.g. 2015-07-06 00:00:00 in datetime form) with the variables of the other columns (like deep) split proportionately into each of the days. How might one approach this problem? I've meddled with groupby tricks and I'm not sure how to proceed.
So I don't have time to work through your specific problem at the moment. But the way to approach this is to us pandas.resample(). Here are the steps I would take. 1) Resample your to date column by minute. 2) Populate the other columns out over that resample. 3) Add the date column back in as an index.
If this doesn't work or is being tricky to work with I would create a date range from your earliest date to your latest date (at the smallest interval you want - so maybe hourly?) and then run some conditional statements over your other columns to fill in the data.
Here is somewhat what your code may look like for the resample portion (replace day with hour or whatever):
drange = pd.date_range('01-01-1970', '01-20-2018', freq='D')
data = data.resample('D').fillna(method='ffill')
data.index.name = 'date'
Hope this helps!

Python Pandas: Group datetime column into hour and minute aggregations

This seems like it would be fairly straight forward but after nearly an entire day I have not found the solution. I've loaded my dataframe with read_csv and easily parsed, combined and indexed a date and a time column into one column but now I want to be able to just reshape and perform calculations based on hour and minute groupings similar to what you can do in excel pivot.
I know how to resample to hour or minute but it maintains the date portion associated with each hour/minute whereas I want to aggregate the data set ONLY to hour and minute similar to grouping in excel pivots and selecting "hour" and "minute" but not selecting anything else.
Any help would be greatly appreciated.
Can't you do, where df is your DataFrame:
times = pd.to_datetime(df.timestamp_col)
df.groupby([times.dt.hour, times.dt.minute]).value_col.sum()
Wes' code didn't work for me. But the DatetimeIndex function (docs) did:
times = pd.DatetimeIndex(data.datetime_col)
grouped = df.groupby([times.hour, times.minute])
The DatetimeIndex object is a representation of times in pandas. The first line creates a array of the datetimes. The second line uses this array to get the hour and minute data for all of the rows, allowing the data to be grouped (docs) by these values.
Came across this when I was searching for this type of groupby. Wes' code above didn't work for me, not sure if it's because changes in pandas over time.
In pandas 0.16.2, what I did in the end was:
grp = data.groupby(by=[data.datetime_col.map(lambda x : (x.hour, x.minute))])
grp.count()
You'd have (hour, minute) tuples as the grouped index. If you want multi-index:
grp = data.groupby(by=[data.datetime_col.map(lambda x : x.hour),
data.datetime_col.map(lambda x : x.minute)])
I have an alternative of Wes & Nix answers above, with just one line of code, assuming your column is already a datetime column, you don't need to get the hour and minute attributes separately:
df.groupby(df.timestamp_col.dt.time).value_col.sum()
This might be a little late but I found quite a good solution for any one that has the same problem.
I have a df like this:
datetime value
2022-06-28 13:28:08 15
2022-06-28 13:28:09 30
... ...
2022-06-28 14:29:11 20
2022-06-28 14:29:12 10
I want to convert those timestamps which are in intervals of a second to timestamps with an interval of minutes adding the value column in the process.
There is a neat way of doing it:
df['datetime'] = pd.to_datetime(df['datetime']) #if not already as datetime object
grouped = df.groupby(pd.Grouper(key='datetime', axis=0, freq='T')).sum()
print(grouped.head())
Result:
datetime value
2022-06-28 13:28:00 45
... ...
2022-06-28 14:29:00 30
freq='T' stands for minutes. You could also group it by hours or days. They are called Offset aliases.

Categories