This seems like it should be fairly straightforward, but after nearly an entire day I have not found the solution. I've loaded my dataframe with read_csv and easily parsed, combined and indexed a date and a time column into one column, but now I want to reshape and perform calculations based on hour and minute groupings, similar to what you can do in an Excel pivot table.
I know how to resample to hour or minute, but that maintains the date portion associated with each hour/minute, whereas I want to aggregate the data set ONLY by hour and minute, similar to grouping in Excel pivots where you select "hour" and "minute" but nothing else.
Any help would be greatly appreciated.
Can't you do, where df is your DataFrame:
times = pd.to_datetime(df.timestamp_col)
df.groupby([times.dt.hour, times.dt.minute]).value_col.sum()
Wes' code didn't work for me. But the DatetimeIndex function (docs) did:
times = pd.DatetimeIndex(df.datetime_col)
grouped = df.groupby([times.hour, times.minute])
The DatetimeIndex object is a representation of times in pandas. The first line creates an array of the datetimes. The second line uses this array to get the hour and minute data for all of the rows, allowing the data to be grouped (docs) by these values.
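From there you aggregate as usual; for example (value_col standing in for whichever column you want to summarise):
grouped.value_col.mean()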
Came across this when I was searching for this type of groupby. Wes' code above didn't work for me; I'm not sure whether that's because of changes in pandas over time.
In pandas 0.16.2, what I did in the end was:
grp = data.groupby(by=[data.datetime_col.map(lambda x: (x.hour, x.minute))])
grp.count()
You'd have (hour, minute) tuples as the grouped index. If you want a multi-index instead:
grp = data.groupby(by=[data.datetime_col.map(lambda x: x.hour),
                       data.datetime_col.map(lambda x: x.minute)])
Here's an alternative to Wes' and Nix's answers above, with just one line of code: assuming your column is already a datetime column, you don't need to pull out the hour and minute attributes separately:
df.groupby(df.timestamp_col.dt.time).value_col.sum()
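Note that grouping by .dt.time keeps the seconds as well; if you want to bucket strictly by hour and minute, one option (still assuming timestamp_col is a datetime column) is to floor the timestamps to the minute first:
df.groupby(df.timestamp_col.dt.floor('min').dt.time).value_col.sum()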
This might be a little late, but I found quite a good solution for anyone who has the same problem.
I have a df like this:
datetime value
2022-06-28 13:28:08 15
2022-06-28 13:28:09 30
... ...
2022-06-28 14:29:11 20
2022-06-28 14:29:12 10
I want to convert those timestamps, which come at one-second intervals, into timestamps at one-minute intervals, summing the value column in the process.
There is a neat way of doing it:
df['datetime'] = pd.to_datetime(df['datetime']) #if not already as datetime object
grouped = df.groupby(pd.Grouper(key='datetime', freq='T')).sum()
print(grouped.head())
Result:
datetime value
2022-06-28 13:28:00 45
... ...
2022-06-28 14:29:00 30
freq='T' stands for minutes ('min' also works). You could also group by hours or days; these frequency strings are called offset aliases.
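For example, the same grouping at hourly or daily resolution only needs a different alias:
grouped_hourly = df.groupby(pd.Grouper(key='datetime', freq='H')).sum()
grouped_daily = df.groupby(pd.Grouper(key='datetime', freq='D')).sum()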
My Plan:
I have two different datasets with OHLC values, one representing the Weekly (1W) timeframes: weekly_df and the other representing the hourly (1H) timeframes hourly_df.
Here's what the two data frames look like:
My goal is to merge the weekly OHLC values into the hourly df using pandas merge followed by ffill. However, before I do that, I need to get the date columns into the same format and type, meaning I need to reformat the weekly dates with 00:00:00 after the date. Here's how I'm doing it:
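The exact snippet isn't shown here; a minimal sketch of that kind of reformatting, assuming the weekly date column is still a datetime at this point, would be something like:
# format each weekly date as a 'YYYY-MM-DD 00:00:00' string
weekly_df['date'] = weekly_df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')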
The Problem:
Once this is done, everything is a string, and when I try to convert it back to datetime, the 00:00:00 in the date column disappears:
Once this is done, I want to merge the data frames by date and forward-fill, so that all the hourly OHLC rows on a given date also have columns showing their weekly OHLC values.
As of right now, this is not working: the merge only keeps the dates common to both data frames and omits the rest:
Is there an easier way to do this? Most of the methods I have tried return an error.
The two data frame CSV files:
In case you need to test it, here are the two CSV files:
Hourly
Weekly
Any help would be appreciated. Thanks in advance!
For anyone who would face a similar issue in the future, here's how I fixed it:
Since the datetime format applied to the dataframe does not enforce 00:00:00, I offset the time by 1 second, to 00:00:01, in both dataframes, as follows:
# the same 1-second offset is applied to both frames
hourly_df['date'] = hourly_df['date'] + pd.DateOffset(seconds=1)
weekly_df['date'] = weekly_df['date'] + pd.DateOffset(seconds=1)
This enforces the same format on the weekly df by offsetting it by 1 second as well.
Finally, since both date columns now share the same format, I can merge and ffill them as follows:
merged_df = hourly_df.merge(weekly_df, on=['date'], how='left').ffill()
which merges and displays the result as follows:
Do let me know if anyone finds another way to solve this while keeping the original time. Cheers!
So I have sales data that I'm trying to analyze. I have a datetime column ["Order Date Time"] and I'd like to see the most common hours for sales, but more importantly I'd like to see which minutes have NO sales.
I have been spinning my wheels for a while and I can't get my brain around a solution. Any help is greatly appreciated.
I import the data:
df = pd.read_excel('Audit Period.xlsx')
print(df)
I clean up the data:
# Keep only rows where "Order Date Time" is not null
time_df = df[df["Order Date Time"].notnull()]
# Keep only the "Order Date Time" column and reset the index so it's sequential
time_df = time_df[["Order Date Time"]].reset_index(drop=True)
# Select the first 10 rows
time_df.head(10)
I convert to datetime and I look at the month totals:
# Convert "Order Date Time" to datetime
time_df = time_df.copy()
time_df["Order Date Time"] = time_df["Order Date Time"].apply(pd.to_datetime)
time_df = time_df.set_index(time_df["Order Date Time"])
# Group by month
grouped = time_df.resample("M").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
I try to group by hour, but that gives me totals per day-and-hour rather than totals per hour of the day (e.g. every order ever placed at noon):
# Group into 2-hour bins
grouped = time_df.resample("2H").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
And that is where I'm stuck. I'm trying to integrate the below suggestions but can't quite get a grasp on them yet. Any help would be appreciated.
Not sure if this is the most brilliant solution, but I would start by generating a dataframe at the level of detail you want, whether that is 1-hour intervals, 5-minute intervals, etc. Then, in your df with the actual data, do the grouping as you are currently doing it above. Once it is grouped, join the two. That way you end up with one dataframe that includes rows for the time spans with no records. The tricky part is just making sure your dates and times are formatted so that they match and join properly.
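A minimal sketch of that idea, using time_df and the "Order Date Time" column from the question (as they look right after the clean-up step): count orders per (hour, minute) of the day, then reindex against every possible (hour, minute) combination so the quiet minutes show up as zero.
import pandas as pd

t = pd.to_datetime(time_df["Order Date Time"])

# count orders per (hour, minute) of the day, ignoring the date entirely
counts = t.groupby([t.dt.hour, t.dt.minute]).size()
counts.index.names = ["hour", "minute"]

# every possible (hour, minute) combination in a day
full_index = pd.MultiIndex.from_product([range(24), range(60)],
                                         names=["hour", "minute"])

# minutes with no sales show up with a count of 0 after reindexing
per_minute = counts.reindex(full_index, fill_value=0)
no_sales = per_minute[per_minute == 0]
print(no_sales)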
I've got a DataFrame that looks like this:
It has two columns, one of them being a "from" datetime and one of them being a "to" datetime. I would like to change this DataFrame such that it has a single column or index for the date (e.g. 2015-07-06 00:00:00 in datetime form) with the variables of the other columns (like deep) split proportionately into each of the days. How might one approach this problem? I've meddled with groupby tricks and I'm not sure how to proceed.
So I don't have time to work through your specific problem at the moment, but the way to approach this is to use pandas.resample(). Here are the steps I would take:
1) Resample your "to" date column by minute.
2) Populate the other columns out over that resample.
3) Add the date column back in as an index.
If that doesn't work or is tricky to work with, I would create a date range from your earliest date to your latest date (at the smallest interval you want, so maybe hourly?) and then run some conditional statements over your other columns to fill in the data.
Here is roughly what your code might look like for the resample portion (replace the daily frequency with hourly or whatever you need):
# a full range of dates at the frequency you want (daily here)
drange = pd.date_range('01-01-1970', '01-20-2018', freq='D')
# resample onto that frequency (assumes data has a DatetimeIndex) and forward-fill
data = data.resample('D').ffill()
data.index.name = 'date'
Hope this helps!
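A rough sketch of the proportional split itself, with the column names from, to and deep assumed from the question: each row is expanded into the hours it covers, its value is divided evenly across those hours, and the hourly pieces are then summed back up per day.
import pandas as pd

# assumed layout: one row per interval with 'from', 'to' and a value column 'deep'
df = pd.DataFrame({
    "from": pd.to_datetime(["2015-07-05 18:00", "2015-07-06 06:00"]),
    "to":   pd.to_datetime(["2015-07-06 06:00", "2015-07-06 18:00"]),
    "deep": [24.0, 12.0],
})

pieces = []
for _, row in df.iterrows():
    # hourly timestamps covering the interval, excluding the right endpoint
    hours = pd.date_range(row["from"], row["to"], freq="H")[:-1]
    # share the value equally across those hours
    pieces.append(pd.DataFrame({"hour": hours, "deep": row["deep"] / len(hours)}))

expanded = pd.concat(pieces, ignore_index=True)

# roll the hourly shares back up to one value per calendar day
per_day = expanded.groupby(expanded["hour"].dt.normalize())["deep"].sum()
print(per_day)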
I'm trying to calculate statistical measures based on a range of hours and/or days.
Meaning, I have a CSV file that is something like this:
TRANSACTION_URL    START_TIME       END_TIME         SIZE   FLAG
www.google.com     20170113093210   20170113093210   150    1
www.cnet.com       20170113114510   20170113093210   150    2
START_TIME and END_TIME are in yyyyMMddhhmmss format.
I first convert them to yyyy-MM-dd hh:mm:ss format using the following code:
from pyspark.sql.functions import from_unixtime, unix_timestamp
from pyspark.sql.types import StringType, TimestampType

from_pattern = 'yyyyMMddhhmmss'
to_pattern = 'yyyy-MM-dd hh:mm:ss'

log_df = log_df.withColumn('START_TIME', from_unixtime(unix_timestamp(
    log_df['START_TIME'].cast(StringType()), from_pattern), to_pattern).cast(TimestampType()))
And afterward, I would like to use groupBy() in order to calculate, for example, the mean of the SIZE column, based on the transaction TIME frame.
For example, I would like to do something like:
for all transactions that are between 09:00 to 11:00
calculate SIZE mean
for all transactions that are between 14:00 to 16:00
calculate SIZE mean
And also:
for all transactions that are in a WEEKEND date
calculate SIZE mean
for all transactions that are NOT in a WEEKEND date
calculate SIZE mean
I DO know how to use groupBy for a 'default' configuration, such as calculating statistical measures for SIZE column, based on FLAG column values. I'm using something like:
log_df.cache().groupBy('FLAG').agg(mean('SIZE').alias("Mean"), stddev('SIZE').alias("Stddev")).\
withColumn("Variance", pow(col("Stddev"), 2)).show(3, False)
So, my questions are:
How can I achieve such grouping and calculation for a range of hours? (first pseudo-code example)
How can I achieve such grouping and calculation by dates? (second pseudo-code example)
Is there any python package that can receive yy-MM-dd and return true if it's a weekend date?
Thanks
Let's assume you have a function encode_dates which receives the date and returns a sequence of encodings for all the time periods you are interested in. For example, for Tuesday 9-12 it would return Seq("9-11", "10-12", "11-13", "weekday"). This would be a regular Scala function (unrelated to Spark).
Now you can turn it into a UDF, add it as a column, and explode that column so you have multiple copies of each row. All that's left is to add this column to the groupBy.
So it would look something like this:
val encodeUDF = udf(encode_dates _)
log_df.cache()
  .withColumn("timePeriod", explode(encodeUDF($"start_date", $"end_date")))
  .groupBy("FLAG", "timePeriod")
  .agg(mean("SIZE").alias("Mean"), stddev("SIZE").alias("Stddev"))
  .withColumn("Variance", pow(col("Stddev"), 2))
  .show(3, false)
I have an input parameter dictionary as below:
InparamDict = {'DataInputDate': '2014-10-25'}
Using the field InparamDict['DataInputDate'], I want to pull up data from 2013-10-01 till 2013-10-25. What would be the best way to do this using pandas?
The SQL equivalent is:
DATEFROMPARTS(DATEPART(year,GETDATE())-1,DATEPART(month,GETDATE()),'01')
You forgot to mention whether you're trying to pull the data from a DataFrame, a Series, or something else. If you just want the date parts, get the corresponding attributes from the Timestamp object.
from pandas import Timestamp
dt = Timestamp(InparamDict['DataInputDate'])
dt.year, dt.month, dt.day
If the dates are in a DataFrame (df) and you convert them to dates instead of strings, you can select the data by ranges as well, for instance:
from datetime import datetime
df[df['DataInputDate'] > datetime(2013, 10, 1)]
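To mirror the DATEFROMPARTS expression from the question, a sketch like the following derives the window from InparamDict (df is assumed to already hold DataInputDate converted with pd.to_datetime):
import pandas as pd

InparamDict = {'DataInputDate': '2014-10-25'}

# end of the window: the same day one year earlier (2013-10-25)
end = pd.Timestamp(InparamDict['DataInputDate']) - pd.DateOffset(years=1)
# start of the window: the first day of that month (2013-10-01)
start = end.replace(day=1)

subset = df[(df['DataInputDate'] >= start) & (df['DataInputDate'] <= end)]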