I have a dataset containing monthly observations of a time-series.
I want to convert the datetimes to year/quarter format and then relabel the first value, DATE[0], as the previous quarter. For example, 2006-10-31 belongs to Q4 of 2006, but I want to change it to 2006Q3.
For the extraction of the subsequent values I will just use the last value from each quarter.
So, for 2006Q4 I will keep BBGN, SSD, and QQ4567 values only from DATE[2]. Similarly, for 2007Q1 I will keep only DATE[5] values, and so forth.
Original dataset:
DATE BBGN SSD QQ4567
0 2006-10-31 00:00:00 1.210 22.022 9726.550
1 2006-11-30 00:00:00 1.270 22.060 9891.008
2 2006-12-31 00:00:00 1.300 22.080 10055.466
3 2007-01-31 00:00:00 1.330 22.099 10219.924
4 2007-02-28 00:00:00 1.393 22.110 10350.406
5 2007-03-31 00:00:00 1.440 22.125 10480.888
After processing the DATE
DATE BBGN SSD QQ4567
0 2006Q3 1.210 22.022 9726.550
2 2006Q4 1.300 22.080 10055.466
5 2007Q1 1.440 22.125 10480.888
The steps I have taken so far are:
Turn the values from the yyyy-mm-dd hh format to yyyyQQ format
DF['DATE'] = pd.to_datetime(DF['DATE']).dt.to_period('Q')
and I get this
DATE BBGN SSD QQ4567
0 2006Q4 1.210 22.022 9726.550
1 2006Q4 1.270 22.060 9891.008
2 2006Q4 1.300 22.080 10055.466
3 2007Q1 1.330 22.099 10219.924
4 2007Q1 1.393 22.110 10350.406
5 2007Q1 1.440 22.125 10480.888
The next step is to extract the last values from each quarter. But because I always want to keep the first row I will exclude DATE[0] from the function.
quarterDF = DF.iloc[1:,].drop_duplicates(subset='DATE', keep='last')
Now, my question is: how can I change the value in DATE[0] so that it is always the previous quarter, for example from 2006Q4 to 2006Q3? Also, how would this work if DATE[0] is 2007Q1? Can I change it to 2006Q4?
My suggestion would be to create a new date column shifted 3 months into the past, like this:
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.to_datetime(['2006-10-31', '2007-01-31'])
one_quarter = pd.tseries.offsets.DateOffset(months=3)
df['Last_quarter'] = df.Date - one_quarter
This will give you
Date Last_quarter
0 2006-10-31 2006-07-31
1 2007-01-31 2006-10-31
Then you can do the same process as you described above on Last_quarter
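Another option is to do the shift on the quarterly periods themselves: subtracting 1 from a pandas Period steps back one period, and the year rolls over on its own (2007Q1 becomes 2006Q4). Below is a minimal sketch under that assumption. It rebuilds the question's frame with only the BBGN column, and the names quarters, first, rest and quarter_df are illustrative, not from the original post.

```python
import pandas as pd

# Sample rebuilt from the question (only one value column, for brevity)
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2006-10-31", "2006-11-30", "2006-12-31",
        "2007-01-31", "2007-02-28", "2007-03-31",
    ]),
    "BBGN": [1.210, 1.270, 1.300, 1.330, 1.393, 1.440],
})

# Quarterly period of every row
quarters = df["DATE"].dt.to_period("Q")
df = df.assign(DATE=quarters)

# First row: relabel as the previous quarter (Period - 1 steps back one quarter)
first = df.iloc[[0]].assign(DATE=quarters.iloc[0] - 1)

# Remaining rows: keep only the last observation of each quarter
rest = df.iloc[1:].drop_duplicates(subset="DATE", keep="last")

quarter_df = pd.concat([first, rest])
print(quarter_df)
```

With this sample, the rows kept are the original rows 0, 2 and 5, labelled 2006Q3, 2006Q4 and 2007Q1.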
Here is a pivot_table approach
# Roll each date back to the previous quarter end and save it in a column
df['Q'] = df['DATE'] - pd.tseries.offsets.QuarterEnd()
#0 2006-09-30
#1 2006-09-30
#2 2006-09-30
#3 2006-12-31
#4 2006-12-31
#5 2006-12-31
#Name: Q, dtype: datetime64[ns]
# Drop the date columns so they are not aggregated, then pivot on the quarter
ndf = df.drop(columns=['DATE','Q']).pivot_table(index=pd.to_datetime(df['Q']).dt.to_period('Q'),aggfunc='last')
BBGN QQ4567 SSD
Qdate
2006Q3 1.30 10055.466 22.080
2006Q4 1.44 10480.888 22.125
My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4-weeks long, except every 3rd month is 5-weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
    index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4-weeks starting on Sunday results in the following. The first three month start dates are correct but I need every third month to be 5-weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly 52-53 week (aka 4-4-5 calendar) offsets.
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level and maintain a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill-in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    # Step back once so the first offset in the loop lands on the first period
    start -= offset_array[0]
    while start < end:
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency period, so it lets you move from date to date following the offsets in the given array:
offset_445 = [
    pd.tseries.offsets.FY5253Quarter(weekday=6),
    4*pd.tseries.offsets.Week(weekday=6),
    4*pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, it's back to aggregation logic to get the data into the right row buckets. Assuming that you want the mean for each 4- or 5-week period, labelled by its start date according to the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
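If the fiscal year start is known, the period starts asked for in the question can also be built from plain week counts, without the FY5253 offsets. The sketch below is one way to do that, not the method of the answer above. It assumes a 52-week year (no 53rd week) that begins on 2020-03-29, and period_starts_445 is a hypothetical helper name. The column a holds a simple counter instead of random numbers so the result is reproducible.

```python
import numpy as np
import pandas as pd

def period_starts_445(year_start, n_periods=12):
    """Start dates of consecutive 4-4-5 periods (hypothetical helper)."""
    # Repeat the 4-4-5 week pattern until n_periods lengths are available
    weeks = np.resize([4, 4, 5], n_periods)
    # Days elapsed before each period begins: 0, 28, 56, 91, ...
    days_before = np.concatenate([[0], np.cumsum(weeks[:-1]) * 7])
    starts = pd.to_timedelta(days_before, unit="D") + pd.Timestamp(year_start)
    return pd.DatetimeIndex(starts, name="date")

# Daily data covering one 52-week fiscal year
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame({"a": np.arange(len(df_index))}, index=df_index)

starts = period_starts_445("2020-03-29")

# For every day, find the position of the last period start on or before it
bucket = starts.searchsorted(df.index, side="right") - 1

# Group the days by the start date of their period and average
res = df.groupby(starts[bucket]).mean()
print(res)
```

The resulting index runs 2020-03-29, 2020-04-26, 2020-05-24, 2020-06-28 and so on, which are the period starts listed in the question. A 53-week year would need an extra week added to one period, which this sketch does not handle.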
I would like to calculate the mean of a timedelta series, excluding 00:00:00 values.
This is my time series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
My idea is to replace the values in rows 5 and 9 with NaN and then apply .mean() to the series. Since mean() skips NaN values, that would give me the desired result.
How can I do that?
I am trying:
`df["time_column"].replace('0 days 00:00:00', np.NaN).mean()`
but no values are replaced.
One idea is to use a zero Timedelta object instead of a string (np.nan is used here because the np.NaN alias was removed in NumPy 2.0):
out = df["time_column"].replace(pd.Timedelta(0), np.nan).mean()
print(out)
0 days 01:20:30
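The string version most likely fails because the column holds Timedelta values, and a string never compares equal to them. An alternative that avoids replace altogether is to filter out the zeros with a boolean mask before averaging. This is a small sketch on the series from the question; the names s and mean_nonzero are illustrative.

```python
import pandas as pd

# Series rebuilt from the question, with its original index labels
s = pd.to_timedelta(pd.Series(
    ["00:28:00", "01:57:00", "00:00:00", "01:27:00", "00:00:00", "01:30:00"],
    index=[1, 3, 5, 7, 9, 11],
))

# Keep only the non-zero durations, then average them
mean_nonzero = s[s != pd.Timedelta(0)].mean()
print(mean_nonzero)  # 0 days 01:20:30
```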
I have separate columns for start (timestamp) and end (timestamp), and I need to get the earliest start time and the latest end time for each date.
number start end test time
0 1 2020-02-01 06:27:38 2020-02-01 08:29:42 1 02:02:04
1 1 2020-02-01 08:41:03 2020-02-01 11:05:30 2 02:24:27
2 1 2020-02-01 11:20:22 2020-02-01 13:03:49 1 01:43:27
3 1 2020-02-01 13:38:18 2020-02-01 16:04:31 2 02:26:13
4 1 2020-02-01 16:26:46 2020-02-01 17:42:49 1 01:16:03
5 1 2020-02-02 10:11:00 2020-02-02 12:11:00 1 02:00:00
I want the output for each date as: Date, Min, Max.
I'm fairly new to pandas, and most of the solutions I've come across find the min and max datetime of a single column. What I want is the min and max datetime for each date, where the timestamps are spread over two columns.
Expected output (please ignore the date and time formats):
date min max
1/2/2020 6:27 17:42
2/2/2020 10:11 12:11
I believe you need to start by creating a date column and later performing groupby with date.
df['date'] = df['start'].dt.date
df['start_hm'] = df['start'].dt.strftime('%H:%M')
df['end_hm'] = df['end'].dt.strftime('%H:%M')
output = df.groupby('date').agg(
    min=pd.NamedAgg(column='start_hm', aggfunc='min'),
    max=pd.NamedAgg(column='end_hm', aggfunc='max'),
)
Output:
min max
date
2020-02-01 06:27 17:42
2020-02-02 10:11 12:11
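The same grouping also works on the timestamps directly, without first formatting them as strings. Keeping real datetimes makes later arithmetic possible, such as the elapsed time per day. The sketch below assumes start and end are already datetime columns; the duration column and the variable name day are additions for illustration and were not asked for in the question.

```python
import pandas as pd

# Start and end timestamps rebuilt from the question
df = pd.DataFrame({
    "start": pd.to_datetime([
        "2020-02-01 06:27:38", "2020-02-01 08:41:03", "2020-02-01 11:20:22",
        "2020-02-01 13:38:18", "2020-02-01 16:26:46", "2020-02-02 10:11:00",
    ]),
    "end": pd.to_datetime([
        "2020-02-01 08:29:42", "2020-02-01 11:05:30", "2020-02-01 13:03:49",
        "2020-02-01 16:04:31", "2020-02-01 17:42:49", "2020-02-02 12:11:00",
    ]),
})

# Calendar date of each row, used as the grouping key
day = df["start"].dt.date.rename("date")

# Earliest start and latest end per date, kept as full timestamps
out = df.groupby(day).agg(min=("start", "min"), max=("end", "max"))

# Illustrative extra: span between first start and last end of the day
out["duration"] = out["max"] - out["min"]
print(out)
```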
UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. The rows are indexed by an hourly time series that runs from the first hour of the month to the last hour of the month.
The columns are the customers, of which there are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour usage - so basically average of 1st hour of every day in a month, 2nd hour of every day in a month etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00",
"2018-11-01 05:00:00" "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main point of trouble is how to aggregate the data both per customer and per hour of day (or per day) at the same time.
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
`df_temp[col,"TotalUsage"] = df[col].apply.sum()`
However, this and the many variations of it that I tried do not solve the problem.
Please help me with an approach and with how to think about such problems.
Also, since the dataframe is large, it would be helpful to discuss computational complexity and how to decrease computation time.
This looks like a job for pandas DataFrame.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour']=df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour=df.groupby('hour').mean()
2) Summation by user.
sum_by_users = df.drop(columns=['UsageDate', 'hour']).sum()
3) Top 3 usage hours by customer. I don't quite understand your desired output; you might be asking too many different questions in one question. If you want the hour and not the value, I think you may have to iterate through the columns. Adding an example may help.
4) Bottom 3 usage hours. Same comment.
5) Mean by customer.
mean_by_cust = df.drop(columns=['UsageDate', 'hour']).mean()
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For the top and bottom 3, if you instead want to rank the hours by total usage summed across all customers, then (bottom 3 shown):
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)
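The snippets above rank hours across all customers combined. If the goal is the top and bottom 3 hours for each customer separately, one possibility is to call nlargest and nsmallest column by column. The sketch below assumes UsageDate is the index (as the question describes) and uses two made-up customers with simple counters as usage values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Two days of hourly data for two hypothetical customers
idx = pd.date_range("2018-01-01", periods=48,
                    freq=pd.Timedelta(hours=1), name="UsageDate")
usage = pd.DataFrame(
    {"CustID1": np.arange(48.0), "CustID2": np.arange(48.0)[::-1]},
    index=idx,
)

# 1) Mean usage per hour of day, for every customer at once
mean_by_hour = usage.groupby(usage.index.hour).mean()

# 2) and 5) Total and mean usage per customer (one value per column)
total = usage.sum()
monthly_mean = usage.mean()

# 3) and 4) Timestamps of the 3 highest / lowest readings per customer
top3 = pd.DataFrame({c: usage[c].nlargest(3).index for c in usage.columns})
bottom3 = pd.DataFrame({c: usage[c].nsmallest(3).index for c in usage.columns})

print(top3)
print(bottom3)
```

The aggregates in steps 1, 2 and 5 are single vectorised calls over all columns. Only the top/bottom selection loops over customers in Python, so with around 100,000 columns that loop is probably the part worth timing first.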
My data has trips with datetime info, user id for each trip and trip type (single, round, pseudo).
Here's a data sample (pandas dataframe), named All_Data:
HoraDTRetirada idpass type
2016-02-17 15:36:00 39579449489 'single'
2016-02-18 19:13:00 39579449489 'single'
2016-02-26 09:20:00 72986744521 'pseudo'
2016-02-27 12:11:00 72986744521 'round'
2016-02-27 14:55:00 11533148958 'pseudo'
2016-02-28 12:27:00 72986744521 'round'
2016-02-28 16:32:00 72986744521 'round'
I would like to count the number of times each category repeats in a "week of year" by user.
For example, if an event happens on a Monday and the next event happens on a Thursday for the same user, that makes two events in the same week; however, if one event happens on a Saturday and the next event happens on the following Monday, they happened in different weeks.
The output I am looking for would be in a form like this:
idpass weekofyear type frequency
39579449489 1 'single' 2
72986744521 2 'round' 3
72986744521 2 'pseudo' 1
11533148958 2 'pseudo' 1
Edit: this older question approaches a similar problem, but I don't know how to do it with pandas.
import pandas as pd
data = {"HoraDTRetirada": ["2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-12-31 09:20:00", "2016-02-28 12:11:00",
"2016-02-28 14:55:00", "2016-02-29 12:27:00", "2016-02-29 16:32:00"],
"idpass": ["39579449489", "39579449489", "72986744521", "72986744521", "11533148958", "72986744521",
"72986744521"],
"type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"]}
df = pd.DataFrame.from_dict(data)
print(df)
df["HoraDTRetirada"] = pd.to_datetime(df['HoraDTRetirada'])
df["week"] = df['HoraDTRetirada'].dt.strftime('%U')
k = df.groupby(["idpass", "week", "type"],as_index=False).count()
print(k)
Output:
HoraDTRetirada idpass type
0 2016-02-17 15:36:00 39579449489 single
1 2016-02-18 19:13:00 39579449489 single
2 2016-12-31 09:20:00 72986744521 pseudo
3 2016-02-28 12:11:00 72986744521 round
4 2016-02-28 14:55:00 11533148958 pseudo
5 2016-02-29 12:27:00 72986744521 round
6 2016-02-29 16:32:00 72986744521 round
idpass week type HoraDTRetirada
0 11533148958 09 pseudo 1
1 39579449489 07 single 2
2 72986744521 09 round 3
3 72986744521 52 pseudo 1
This is how I got what I was looking for:
Step 1 from suggested answers was skipped because timestamps were already in pandas datetime form.
Step 2: create column for week of year:
df['week'] = df['HoraDTRetirada'].dt.strftime('%U')
Step 3: group by user id, type and week, and count values with size()
df.groupby(['idpass','type','week']).size()
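Note that %U counts weeks starting on Sunday. If weeks should run Monday to Sunday, one alternative is the ISO week number from dt.isocalendar(). The sketch below applies it to the sample from the question and turns the counts into a frequency column. ISO week numbering is an assumption about what "week of year" should mean here, and the name freq is illustrative.

```python
import pandas as pd

# Sample rebuilt from the question
df = pd.DataFrame({
    "HoraDTRetirada": pd.to_datetime([
        "2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-02-26 09:20:00",
        "2016-02-27 12:11:00", "2016-02-27 14:55:00", "2016-02-28 12:27:00",
        "2016-02-28 16:32:00",
    ]),
    "idpass": [39579449489, 39579449489, 72986744521, 72986744521,
               11533148958, 72986744521, 72986744521],
    "type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"],
})

# ISO week number (weeks run Monday to Sunday)
df["week"] = df["HoraDTRetirada"].dt.isocalendar().week

# Count rows per user, week and trip type
freq = (
    df.groupby(["idpass", "week", "type"])
      .size()
      .reset_index(name="frequency")
)
print(freq)
```

Be aware that ISO weeks around New Year can belong to the neighbouring year, so data spanning several years would also need the ISO year in the grouping keys.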
My suggestion would be to do this:
make sure your timestamp is pandas datetime and add frequency column
df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])
df['freq'] = 1
Group it and count
res = df.groupby(['idpass', 'type', pd.Grouper(key='HoraDTRetirada', freq='1W')]).count().reset_index()
Convert time to week of a year
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
Final result looks like that:
EDIT:
You are right, in your case we should do step 3 before step 2, and if you want to do that, remember that groupby will change, so finally step 2 will be:
df['HoraDTRetirada'] = df['HoraDTRetirada'].apply(lambda x: x.week)
and step 3 :
res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()
It's a bit different because the "Hora" variable is not a time anymore, but just an int representing a week.