I can create an hour of day column in Pandas using data['hod'] = [r.hour for r in data.index] which is really useful for groupby related analysis. However, I would like to be able to create a similar column for 1 hour intervals starting at 09:30 instead of 09:00. So the column values would be 09:30-10:30, 10:30-11:30 etc.
The aim is to be able to groupby these values in order to gain stats for the time period.
Using data as follows. I already added hour of day, day of week etc, I just need the same for time sliced from 09:30 onwards in one hour intervals:
data['2008-05-06 09:00:00':].head()
Open High Low Last Volume hod dow dom minute
Timestamp
2008-05-06 09:00:00 1399.50 1399.50 1399.25 1399.50 4 9 1 6 0
2008-05-06 09:01:00 1399.25 1399.75 1399.25 1399.50 5 9 1 6 1
2008-05-06 09:02:00 1399.75 1399.75 1399.00 1399.50 19 9 1 6 2
2008-05-06 09:03:00 1399.50 1399.75 1398.50 1398.50 37 9 1 6 3
2008-05-06 09:04:00 1398.75 1399.00 1398.75 1398.75 15 9 1 6 4
If the intervals start at the half-hour mark, the day splits into 25 sections instead of 24. Here is how I labelled those sections: Section -1: [0:00, 0:29], Section 0: [0:30, 1:29], Section 1: [1:30, 2:29] ... Section 22: [22:30, 23:29] and Section 23: [23:30, 23:59], where the first and last sections are only half an hour long.
And here is an implementation with pandas
import pandas as pd
import numpy as np
def shifted_hour_of_day(ts, beginning_of_hour=0):
    shift = pd.Timedelta('%dmin' % (beginning_of_hour))
    ts_shifted = ts - pd.Timedelta(shift)
    hour = ts_shifted.hour
    if ts_shifted.day != ts.day: # we shifted these timestamps to yesterday
        hour = -1 # label the first section as -1
    return hour
# Generate random data
timestamps = pd.date_range('2008-05-06 00:00:00', '2008-05-07 00:00:00', freq='10min')
vals = np.random.rand(len(timestamps))
df = pd.DataFrame(index=timestamps, data={'value': vals})
df.loc[:, 'hod'] = [r.hour for r in df.index]
# Test shifted_hour_of_day
df.loc[:, 'hod2'] = [shifted_hour_of_day(r, beginning_of_hour=20) for r in df.index]
df.head(20)
Now you can groupby this DataFrame on 'hod2'.
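As a possible alternative to the row-by-row function above, the same kind of labelling can be sketched with vectorized index arithmetic. The 30-minute offset, the small sample index, and the column name bin are assumptions made for illustration; each label is the start time of the one-hour window that begins at hh:30.

```python
import numpy as np
import pandas as pd

# Sample data: one value every 10 minutes between 09:00 and 12:00 (illustrative only)
idx = pd.date_range("2008-05-06 09:00", "2008-05-06 12:00", freq="10min")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

offset = pd.Timedelta(minutes=30)

# Shift back by the offset, floor to the whole hour, then shift forward again.
# The result is the start of the hh:30 window that contains each timestamp.
bin_start = (df.index - offset).floor("60min") + offset
df["bin"] = bin_start.strftime("%H:%M")

# Group on the new label to get per-window statistics
stats = df.groupby("bin")["value"].agg(["count", "mean"])
print(stats)
```

Timestamps from 09:00 to 09:29 fall into the window labelled 08:30, so the first window of a session may be only partly filled.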
So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.
df1 = base records
df2 = simulation records
Part 1: For each row in df2 (simulation records), I want to query df1 (base records) for the row with the most recent timestamp relative to the df2 row, where the "Name" and "Time" columns match exactly.
Part 2: Then I want to use an if/then function to determine whether a value in the simulation record row falls within a range built from two values in the base record row, and return a boolean.
low range = df1['Po']-df1['Ref']
high range = df1['Po']+df1['Ref']
if df2['Sim'] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score"
otherwise return false
Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.
helpful information:
df1 (base records) have more or less rows than df2 (simulation records)
df1 has more columns than df2
some columns in df1 have the same name but different values in df2
ideally want to be able to slice both dataframes where the if then function only sees the two rows used in the comparison
only need the most recent df1 base record to compare to the df2 simulation record
previously accomplished this in google sheets with if then & query combination formula dragged down entire sheet (want to replace with python & pandas)
df1 base records example (columns that matter)
Timestamp Name Time Po Ref
7/11/2022 11:30:00 trial 20 mins 5 2
7/10/2022 04:00:00 trial 20 mins 4 4
7/09/2022 02:45:00 trial 20 mins 2 2
6/28/2022 03:45:00 trial 20 mins 3 6
df2 simulation records example (columns that matter)
Timestamp Name Time Sim
7/10/2022 05:15:00 trial 20 mins 7
7/11/2022 12:45:00 trial 20 mins 4
7/12/2022 03:30:00 trial 20 mins 8
desired result of new column added to df2
Timestamp Name Time Sim Sim Score
7/10/2022 05:15:00 trial 20 mins 7 True
7/11/2022 12:45:00 trial 20 mins 4 True
7/12/2022 03:30:00 trial 20 mins 8 False
Use pandas.DataFrame.reindex with method='nearest' to match each row on the nearest index value. This only works on an index where a distance can be computed (e.g. timestamps; strings cannot be compared this way).
Alternatively, use merge_asof with direction='nearest'.
Method 1:
reindex() with method='nearest'
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
Name Time Po Ref l_r h_r
Timestamp
2022-07-11 11:30:00 trial 20 mins 5 2 3 7
2022-07-10 04:00:00 trial 20 mins 4 4 0 8
2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
Name Time Sim
Timestamp
2022-07-10 05:15:00 trial 20 mins 7
2022-07-11 12:45:00 trial 20 mins 4
2022-07-12 03:30:00 trial 20 mins 8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)
As you can see, this is a df2.join(df1), which joins DataFrame objects by index.
Because df1 is first reindexed onto df2's index with method='nearest', each df2 row is joined to the df1 row with the nearest Timestamp.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Method 2:
merge_asof() with direction='nearest'
This approach does not operate on the index, so we don't have to set the Timestamp column as the index. However, both DataFrames must be sorted by the merge key (here, the Timestamp column).
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
3 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
2 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
1 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge_asof(df2, df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)
As you can see, this is pd.merge_asof(df2, df1),
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
In my experience, working on an index tends to be faster if you have a large dataset.
Method 2 (on multiple keys)
I modified df1 to add rows with different Name and Time values:
df1 = pd.DataFrame({'Timestamp':['7/11/2022 11:30:00','7/11/2022 11:30:00','7/10/2022 04:00:00','7/10/2022 04:00:00','7/09/2022 02:45:00','6/28/2022 03:45:00'],
'Name':['trial','trial','trial','non-trial','trial','trial'],
'Time':['20 mins','30 mins','20 mins','20 mins','20 mins','20 mins'],
'Po':[5, 6, 4, 1, 2, 3],
'Ref':[2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
5 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
4 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
3 2022-07-10 04:00:00 non-trial 20 mins 1 3 -2 4
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
1 2022-07-11 11:30:00 trial 30 mins 6 2 4 8
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
Important:
merge_asof accepts only a single 'on' key; any additional keys that must match exactly are passed through by=.
temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Reference:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept
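The question asks for the most recent base record rather than the nearest one. With direction='nearest' a later base record can be picked if it happens to be closer in time. A minimal sketch of the stricter variant, assuming the sample tables from the question and using direction='backward' (the default of merge_asof), which takes the last base record at or before each simulation timestamp:

```python
import pandas as pd

# Sample data rebuilt from the tables in the question
df1 = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2022-07-11 11:30", "2022-07-10 04:00",
                                 "2022-07-09 02:45", "2022-06-28 03:45"]),
    "Name": "trial",
    "Time": "20 mins",
    "Po": [5, 4, 2, 3],
    "Ref": [2, 4, 2, 6],
}).sort_values("Timestamp")

df2 = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2022-07-10 05:15", "2022-07-11 12:45",
                                 "2022-07-12 03:30"]),
    "Name": "trial",
    "Time": "20 mins",
    "Sim": [7, 4, 8],
}).sort_values("Timestamp")

# For each df2 row, take the latest df1 row at or before its Timestamp,
# restricted to rows where Name and Time match exactly.
matched = pd.merge_asof(df2, df1, on="Timestamp",
                        by=["Name", "Time"], direction="backward")

# Score: is Sim inside [Po - Ref, Po + Ref] of the matched base record?
matched["Sim Score"] = matched["Sim"].between(matched["Po"] - matched["Ref"],
                                              matched["Po"] + matched["Ref"])
print(matched[["Timestamp", "Name", "Time", "Sim", "Sim Score"]])
```

On this sample both directions give the same scores, since every simulation timestamp is closest to the base record just before it.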
Because you don't provide code to construct the dataframe, I will sketch a potential solution:
First, I will assume that you have only one timestamp per day (which it looks like in your examples). Accordingly, I would truncate or split the timestamp to only have the date in one column. This is done so we can join the dataframes based on the date, i.e. use set_index("date_column") for both dataframes (use an inner-join to only keep the rows where the date was present in both dataframes). Finally, you can use apply() to check your condition:
df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po']+row['Ref'] >= row['Sim']), axis = 1)
You can do it via pandasql:
But note that you better add a unique constraint to one of the columns (e.g. a number of trial)
from pandasql import sqldf
df3 = sqldf('''
SELECT df2.Timestamp AS Date, df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
ORDER BY Date ASC
''')
Also to work with datetime format in sqldf you need to name your Timestamp column as date in the query. If you need to get only let's say first/earliest 5 results add LIMIT 5 in the end of the query.
If you need to get closest date in df2 to df1 try this:
from pandasql import sqldf
df3 = sqldf('''
SELECT df1.Timestamp AS Date1, df2.Timestamp AS Date2,
df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
and Date1 <= Date2
group by Date2
ORDER BY Date1 ASC
''')
I have the timeseries dataframe as:
timestamp              signal_value
2017-08-28 00:00:00    10
2017-08-28 00:05:00    3
2017-08-28 00:10:00    5
2017-08-28 00:15:00    5
I am trying to get the average Monthly percentage of the time where "signal_value" is greater than 5. Something like:
Month       metric
January     16%
February    2%
March       8%
April       10%
I tried the following code which gives the result for the whole dataset but how can I summarize it per each month?
total, count = 0, 0
for index, row in df.iterrows():
    total += 1
    if row["signal_value"] >= 5:
        count += 1
print((count/total)*100)
Thank you in advance.
Let us first generate some random data (generate random dates taken from here):
import pandas as pd
import numpy as np
import datetime
def randomtimes(start, end, n):
    frmt = '%d-%m-%Y %H:%M:%S'
    stime = datetime.datetime.strptime(start, frmt)
    etime = datetime.datetime.strptime(end, frmt)
    td = etime - stime
    dtimes = [np.random.random() * td + stime for _ in range(n)]
    return [d.strftime(frmt) for d in dtimes]
# Recreate some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
Now we can transform the timestamp column to pandas timestamps to extract month and year per timestamp:
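# the strings are day-first, so pass the format explicitly to avoid month/day mix-ups
df.timestamp = pd.to_datetime(df.timestamp, format='%d-%m-%Y %H:%M:%S')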
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
We generate a boolean column whether signal_value is larger than some threshold (here 5):
df["is_larger5"] = df.signal_value > 5
Finally, we can get the average for every month by using pandas.groupby:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64
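To report percentages per month as asked in the question, the grouped mean can be scaled by 100. A minimal sketch, assuming a small hand-made dataset (the values are made up for illustration) and grouping by calendar month through dt.to_period('M'):

```python
import pandas as pd

# Tiny illustrative dataset: four readings in January, four in February
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-01-01 00:00", "2017-01-01 00:05", "2017-01-01 00:10", "2017-01-01 00:15",
        "2017-02-01 00:00", "2017-02-01 00:05", "2017-02-01 00:10", "2017-02-01 00:15",
    ]),
    "signal_value": [10, 3, 6, 5, 1, 2, 3, 9],
})

# Boolean flag per row, averaged per month, scaled to a percentage
pct = (df["signal_value"].gt(5)
       .groupby(df["timestamp"].dt.to_period("M"))
       .mean()
       .mul(100))
print(pct)
```

The mean of a boolean column is the fraction of True values, which is why no explicit counting loop is needed. Use .ge(5) instead of .gt(5) if a value of exactly 5 should count.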
I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y,m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
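One caveat with comparing year and month separately: when the dates span several years, a past month such as 2019-12 is not recognised as past if "now" is 2020-08, because 12 < 8 is False. A minimal sketch of an alternative that compares monthly periods instead. The helper name days_elapsed and the fixed value of now are assumptions chosen to keep the example reproducible:

```python
import pandas as pd

def days_elapsed(dates, now):
    """Days in each date's month; capped at now.day for the current month, 0 for future months."""
    months = dates.dt.to_period("M")      # e.g. 2020-08
    current = now.to_period("M")
    out = dates.dt.days_in_month.copy()   # full month length by default
    out[months == current] = now.day      # current month: days so far
    out[months > current] = 0             # future months: nothing elapsed yet
    return out

dates = pd.Series(pd.to_datetime(["2019-12-15", "2020-02-10",
                                  "2020-08-05", "2020-09-01"]))
now = pd.Timestamp("2020-08-27")          # fixed instead of pd.Timestamp('now')

result = days_elapsed(dates, now)
print(result.tolist())
```

The sketch also uses the dt.days_in_month accessor directly, which avoids the per-row apply for the first part of the question.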
UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. m is a Hourly TimeSeries Index which starts from first hour of month to last hour of month.
The columns are the customers which are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour usage - so basically average of 1st hour of every day in a month, 2nd hour of every day in a month etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00",
"2018-11-01 05:00:00" "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main point of trouble is how to aggregate data both for every customer and the hour of day, or day together.
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
    df_temp[col,"TotalUsage"] = df[col].apply.sum()
However, neither this nor the many variations of it that I tried solves the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we can talk about Computational Complexity and how can we decrease computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour']=df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour=df.groupby('hour').mean()
2) Summation by user.
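# sum only the customer columns; the datetime and helper columns are dropped first
sum_by_users = df.drop(columns=['UsageDate', 'hour']).sum()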
3) Top 3 usage hours by customer. I don't quite understand your desired output; you might be asking too many different questions in one post. If you want the hour and not the value, I think you may have to iterate through the columns. Adding an example may help.
4) Same comment.
5) Mean by customer.
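# average only the customer columns; the datetime and helper columns are dropped first
mean_by_cust = df.drop(columns=['UsageDate', 'hour']).mean()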
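For the top and bottom 3 usage hours per customer, one possible reading of the question is "the three timestamps with the highest (or lowest) value in each customer column". A minimal sketch under that assumption, with made-up random data, hypothetical column names CustID1 to CustID3, and the timestamps moved into the index:

```python
import numpy as np
import pandas as pd

# Two days of hourly data for three hypothetical customers
rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-01", periods=48, freq="60min")
usage = pd.DataFrame(rng.random((48, 3)), index=idx,
                     columns=["CustID1", "CustID2", "CustID3"])

# 1) Mean usage for each hour of the day, per customer (24 rows)
hourly_mean = usage.groupby(usage.index.hour).mean()

# 2) and 5) Total and mean usage per customer
total = usage.sum()
mean = usage.mean()

# 3) and 4) Timestamps of the 3 highest and 3 lowest readings per customer
top3 = {c: usage[c].nlargest(3).index.tolist() for c in usage.columns}
bottom3 = {c: usage[c].nsmallest(3).index.tolist() for c in usage.columns}

print(top3["CustID1"])
```

The column-wise reductions (sum, mean, groupby) are vectorized over all customers at once. The top/bottom step loops over columns, so on roughly 100,000 customers it is likely to be the slowest part.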
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For top and bottom 3 if you want to find the min sum across rows then:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)
Maybe I just could not find an existing answer. Anyhow, with pandas '0.19.2' I have the following problem:
I have some timed events of associated groups which can be generated by:
from numpy.random import randint, seed
import pandas as pd
seed(42) # reproducibility
samp_N = 1000
# create times within 3 hours, and 15 random groups
df = pd.DataFrame({'time': randint(0,3*60*60, samp_N),
'group': randint(0, 15, samp_N)})
# make a resample-able index from the seconds time values
df.set_index(pd.TimedeltaIndex(df.time, 's'), inplace=True)
which looks like:
group time
02:01:10 10 7270
00:14:20 13 860
01:29:50 9 5390
01:26:31 13 5191
...
When I try to resample the events, I get something undesirable
df.resample('5T').count()
group time
00:00:04 28 28
00:05:04 18 18
00:10:04 32 32
...
Unfortunately the resampling periods start at arbitrary (first in data) offset values.
It is even more annoying if I group this (as ultimately required)
df.groupby('group').resample('5T').count()
then I get a new offset for each group
what I want is the precise start of sampling windows:
00:00:00 5 ...
00:05:00 17 ...
00:10:00 11 ...
...
there was a suggestion in: https://stackoverflow.com/a/23966229
df.groupby(pd.TimeGrouper('5Min')).count()
but it does not work either, as it also ruins the grouping required above.
thanks for hints!
Unfortunately I didn't come up with a nice solution, only a workaround. I added a dummy row with time value zero and then grouped by time and group:
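# dummy row at time zero; pd.concat is used because DataFrame.append was removed in pandas 2.0
dummy = pd.DataFrame({'time': [0], 'group': [-1]}, index=pd.to_timedelta([0], unit='s'))
df = pd.concat([dummy, df])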
df = df.groupby([pd.Grouper(freq='5Min'), 'group']).count().reset_index('group')
df = df.loc[df['group']!=-1]
df.head()
group time
0 days 0 2
0 days 1 4
0 days 2 3
0 days 3 1
0 days 4 2
I am not sure this is the result you want:
result = df.groupby(['group', pd.Grouper(freq='5Min')]).count().reset_index(level=0)
result.head()
>>> group time
00:05:00 0 2
00:10:00 0 1
00:15:00 0 3
00:20:00 0 2
00:30:00 0 1
result.sort_index().head()
>>> group time
0 days 10 1
0 days 14 3
0 days 2 1
0 days 13 1
0 days 4 3
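Another way to get windows that start exactly at multiples of five minutes, without a dummy row, is to floor the timedelta index and group on the floored values together with the group column. A minimal sketch, assuming data generated as in the question (with NumPy's newer random Generator) and a hypothetical level name bin:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Event times within 3 hours (in seconds) and 15 random groups
df = pd.DataFrame({"time": rng.integers(0, 3 * 60 * 60, n),
                   "group": rng.integers(0, 15, n)})
df.index = pd.to_timedelta(df["time"].to_numpy(), unit="s")

# Floor each event time to the start of its 5-minute window (00:00:00, 00:05:00, ...)
bins = df.index.floor("5min").rename("bin")

# Count events per group and window
counts = df.groupby(["group", bins]).size()
print(counts.head())
```

Windows with no events for a given group do not appear in the result; counts.unstack(fill_value=0) would show them as zeros in a wide layout, provided at least one group has events in that window.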