I have formatted my data with pandas so that I get the number of orders placed in every 2-hour period over the past 3 months. I need the total number of orders placed in each timeslot, broken down by day of the week.
Converted OrderCount day_of_week
2/1/2019 0:00 2 Friday
2/1/2019 2:00 0 Friday
2/1/2019 4:00 0 Friday
2/1/2019 6:00 0 Friday
2/1/2019 8:00 0 Friday
2/1/2019 10:00 1 Friday
2/1/2019 12:00 2 Friday
2/1/2019 14:00 3 Friday
2/1/2019 16:00 5 Friday
2/2/2019 0:00 2 Saturday
2/2/2019 2:00 1 Saturday
2/2/2019 4:00 0 Saturday
2/2/2019 6:00 0 Saturday
2/2/2019 8:00 0 Saturday
Here Converted is my index and the OrderCount column contains the count of orders per 2-hour timeslot.
I have tried the following code:
df.groupby([df.index.hour, df.index.weekday]).count()
But this gives a totally different result.
What I want is the total number of orders placed on a particular day of the week for each timeslot.
For example:
Converted OrderCount day_of_week
2/1/2019 0:00 2 Friday
2/8/2019 0:00 5 Friday
2/2/2019 4:00 1 Saturday
2/9/2019 4:00 10 Saturday
The Output Should be
TimeSlot OrderCount day_of_week
0:00 7 Friday
4:00 11 Saturday
where the total 7 is (2+5) and 11 is (1+10).
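One possible reason the attempt above looks wrong is that count() counts rows per group instead of adding up OrderCount. A minimal sketch of a sum-based version follows, assuming the index is a DatetimeIndex as described; the four sample rows are copied from the example, and the key names TimeSlot and day_of_week are chosen to match the desired output.

```python
import pandas as pd

# Sample rows taken from the example in the question.
df = pd.DataFrame(
    {"OrderCount": [2, 5, 1, 10]},
    index=pd.to_datetime(
        ["2019-02-01 00:00", "2019-02-08 00:00",
         "2019-02-02 04:00", "2019-02-09 04:00"]
    ),
)
df.index.name = "Converted"

# Group on the time-of-day label and the weekday name, then sum the counts.
time_slot = df.index.strftime("%H:%M").rename("TimeSlot")
weekday = df.index.day_name().rename("day_of_week")

result = df.groupby([time_slot, weekday])["OrderCount"].sum().reset_index()
print(result)
```

Note that strftime("%H:%M") zero-pads the hour, so the slot label reads 00:00 rather than 0:00.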
This is my first question on Stack Overflow, and I hope I have described my problem in enough detail.
I'm starting to learn data analysis with pandas and I've created a time series with daily data for the gas prices of a certain station. I've already grouped the hourly data into daily data.
I've been successful with a simple scatter plot over the year using plotly. As a next step, I would like to find which weekday is the cheapest or most expensive in every week, count the day names, and then see whether there is a pattern over the whole year.
count mean std min 25% 50% 75% max \
2022-01-01 35.0 1.685000 0.029124 1.649 1.659 1.689 1.6990 1.749
2022-01-02 27.0 1.673444 0.024547 1.649 1.649 1.669 1.6890 1.729
2022-01-03 28.0 1.664000 0.040597 1.599 1.639 1.654 1.6890 1.789
2022-01-04 31.0 1.635129 0.045069 1.599 1.599 1.619 1.6490 1.779
2022-01-05 33.0 1.658697 0.048637 1.599 1.619 1.649 1.6990 1.769
2022-01-06 35.0 1.658429 0.050756 1.599 1.619 1.639 1.6940 1.779
2022-01-07 30.0 1.637333 0.039136 1.599 1.609 1.629 1.6565 1.759
2022-01-08 41.0 1.655829 0.041740 1.619 1.619 1.639 1.6790 1.769
2022-01-09 35.0 1.647857 0.031602 1.619 1.619 1.639 1.6590 1.769
2022-01-10 31.0 1.634806 0.041374 1.599 1.609 1.619 1.6490 1.769
...
week weekday
2022-01-01 52 Saturday
2022-01-02 52 Sunday
2022-01-03 1 Monday
2022-01-04 1 Tuesday
2022-01-05 1 Wednesday
2022-01-06 1 Thursday
2022-01-07 1 Friday
2022-01-08 1 Saturday
2022-01-09 1 Sunday
2022-01-10 2 Monday
...
I tried with grouping and resampling but unfortunately I didn't get the result I was hoping for.
Can someone suggest a way to deal with this problem? Thanks!
Here's a way to do what I believe your question asks:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'count':[35,27,28,31,33,35,30,41,35,31]*40,
'mean':
[1.685,1.673444,1.664,1.635129,1.658697,1.658429,1.637333,1.655829,1.647857,1.634806]*40
},
index=pd.Series(pd.to_datetime(pd.date_range("2022-01-01", periods=400, freq="D"))))
print( '','input df:',df,sep='\n' )
df_date = df.reset_index()['index']
df['weekday'] = list(df_date.dt.day_name())
df['year'] = df_date.dt.year.to_numpy()
df['week'] = df_date.dt.isocalendar().week.to_numpy()
df['year_week_started'] = df.year - np.where((df.week>=52)&(df.week.shift(-7)==1),1,0)
print( '','input df with intermediate columns:',df,sep='\n' )
cols = ['year_week_started', 'week']
dfCheap = df.loc[df.groupby(cols)['mean'].idxmin(),:].set_index(cols)
dfCheap = ( dfCheap.groupby(['year_week_started', 'weekday'])['mean'].count()
.rename('freq').to_frame().set_index('freq', append=True)
.reset_index(level='weekday').sort_index(ascending=[True,False]) )
print( '','dfCheap:',dfCheap,sep='\n' )
dfExpensive = df.loc[df.groupby(cols)['mean'].idxmax(),:].set_index(cols)
dfExpensive = ( dfExpensive.groupby(['year_week_started', 'weekday'])['mean'].count()
.rename('freq').to_frame().set_index('freq', append=True)
.reset_index(level='weekday').sort_index(ascending=[True,False]) )
print( '','dfExpensive:',dfExpensive,sep='\n' )
Sample input:
input df:
count mean
2022-01-01 35 1.685000
2022-01-02 27 1.673444
2022-01-03 28 1.664000
2022-01-04 31 1.635129
2022-01-05 33 1.658697
... ... ...
2023-01-31 35 1.658429
2023-02-01 30 1.637333
2023-02-02 41 1.655829
2023-02-03 35 1.647857
2023-02-04 31 1.634806
[400 rows x 2 columns]
input df with intermediate columns:
count mean weekday year week year_week_started
2022-01-01 35 1.685000 Saturday 2022 52 2021
2022-01-02 27 1.673444 Sunday 2022 52 2021
2022-01-03 28 1.664000 Monday 2022 1 2022
2022-01-04 31 1.635129 Tuesday 2022 1 2022
2022-01-05 33 1.658697 Wednesday 2022 1 2022
... ... ... ... ... ... ...
2023-01-31 35 1.658429 Tuesday 2023 5 2023
2023-02-01 30 1.637333 Wednesday 2023 5 2023
2023-02-02 41 1.655829 Thursday 2023 5 2023
2023-02-03 35 1.647857 Friday 2023 5 2023
2023-02-04 31 1.634806 Saturday 2023 5 2023
[400 rows x 6 columns]
Sample output:
dfCheap:
weekday
year_week_started freq
2021 1 Monday
2022 11 Tuesday
10 Thursday
10 Wednesday
6 Sunday
5 Friday
5 Monday
5 Saturday
2023 2 Thursday
1 Saturday
1 Sunday
1 Wednesday
dfExpensive:
weekday
year_week_started freq
2021 1 Saturday
2022 16 Monday
10 Tuesday
6 Sunday
5 Friday
5 Saturday
5 Thursday
5 Wednesday
2023 2 Monday
1 Friday
1 Thursday
1 Tuesday
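A shorter variant of the same idea, offered only as a sketch: group by ISO year and ISO week directly, which avoids the year_week_started helper column, then count the weekday names of the weekly minimum and maximum. The 14 daily values below are made up for illustration, and the counts are taken over the whole sample rather than per year.

```python
import pandas as pd

# Two made-up ISO weeks of daily mean prices (Monday 2022-01-03 onward).
idx = pd.date_range("2022-01-03", periods=14, freq="D")
daily = pd.DataFrame(
    {"mean": [1.66, 1.60, 1.65, 1.65, 1.63, 1.65, 1.64,
              1.63, 1.59, 1.64, 1.70, 1.66, 1.65, 1.61]},
    index=idx,
)

# ISO year/week keep the days around New Year in the same week group.
iso = daily.index.isocalendar()
by_week = daily.groupby([iso["year"], iso["week"]])["mean"]

# idxmin/idxmax return the date of the weekly extreme; map it to a day name.
cheapest_days = pd.to_datetime(by_week.idxmin()).dt.day_name()
priciest_days = pd.to_datetime(by_week.idxmax()).dt.day_name()

cheap_counts = cheapest_days.value_counts()
pricey_counts = priciest_days.value_counts()
print(cheap_counts, pricey_counts, sep="\n")
```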
I have the following formula, which gets me the EOM (end-of-month) date every 3 months starting February 1990.
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
I am looking for a concise way to get the same table, but with the dates offset by x business days.
That is, if x = 2, I want the date 2 business days before each EOM date calculated every 3 months starting February 1990.
Thanks for the help.
import pandas as pd
from pandas.tseries.offsets import BDay
x = 2
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
>>> dates
DatetimeIndex(['1990-02-26', '1990-05-29', '1990-08-29', '1990-11-28',
'1991-02-26', '1991-05-29', '1991-08-29', '1991-11-28',
'1992-02-27', '1992-05-28',
...
'2027-05-27', '2027-08-27', '2027-11-26', '2028-02-25',
'2028-05-29', '2028-08-29', '2028-11-28', '2029-02-26',
'2029-05-29', '2029-08-29'],
dtype='datetime64[ns]', length=159, freq=None)
Example
x = 2
dti1 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
dti2 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
df = pd.DataFrame({"dti1": dti1.day_name(), "dti2": dti2.day_name()})
>>> df.head(20)
dti1 dti2
0 Wednesday Monday
1 Thursday Tuesday
2 Friday Wednesday
3 Friday Wednesday
4 Thursday Tuesday
5 Friday Wednesday
6 Saturday Thursday
7 Saturday Thursday
8 Saturday Thursday
9 Sunday Thursday
10 Monday Thursday
11 Monday Thursday
12 Sunday Thursday
13 Monday Thursday
14 Tuesday Friday
15 Tuesday Friday
16 Monday Thursday
17 Tuesday Friday
18 Wednesday Monday
19 Wednesday Monday
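Recent pandas releases deprecate the "M" month-end alias in favour of "ME", so the "3M" string may warn or fail depending on the installed version. A sketch that sidesteps the alias by passing the offset object MonthEnd(3) instead is shown below; the start, end and x values are the ones from the question.

```python
import pandas as pd
from pandas.tseries.offsets import BDay, MonthEnd

x = 2  # number of business days to step back

# Month-end dates every 3 months, built from an offset object
# rather than a frequency string.
month_ends = pd.date_range(start="1990-02-01", end="2029-09-30",
                           freq=MonthEnd(3))

# Step each month-end back by x business days.
dates = month_ends - BDay(x)
print(dates[:4])
```

When a month-end falls on a weekend, subtracting BDay counts back from that weekend date, so 1991-08-31 (a Saturday) becomes Thursday 1991-08-29, as in the table above. If the intent is x business days before the last business day of the month, the offset would need adjusting.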
I am working on a shift-wise (Day/Night) production analysis data set. The Day shift is 7 AM-7 PM and the Night shift is 7 PM-7 AM.
Sometimes a day or night shift is divided into two or more partitions (e.g. the 7 AM-7 PM Day shift can be split into 7 AM-10 AM and 10 AM-7 PM).
If a shift is divided into two or more partitions, I first need to check whether the Brand is the same across all partitions of that shift.
If yes, set the start time to the start of the first partition and the end time to the end of the last partition.
For Production: take the total production across the shift partitions.
For RPM: take the average across the shift partitions.
If no, get the appropriate values for each Brand separately.
(For a better understanding, please check the expected output.)
Sample of the Raw dataframe:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 17:07 Day A 5 50
7/9/2020 17:07 7/9/2020 17:58 Day A 10 100
7/9/2020 17:58 7/9/2020 19:00 Day A 5 60
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/9/2020 22:40 Night B 5 20
7/9/2020 22:40 7/10/2020 7:00 Night B 5 30
7/10/2020 7:00 7/10/2020 18:27 Day C 15 20
7/10/2020 18:27 7/10/2020 19:00 Day C 5 40
Expected Output:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 19:00 Day A 20 70
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/10/2020 7:00 Night B 10 25
7/10/2020 7:00 7/10/2020 19:00 Day C 20 30
Thanks in advance.
Here's a suggestion:
Make sure the columns Start and End have datetime values (I've renamed end to End and shift to Shift :)):
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
Then
df['Day'] = df['Start'].dt.strftime('%Y-%m-%d')
df = (df.groupby(['Day', 'Shift', 'Brand'])
        .agg(Start = pd.NamedAgg(column='Start', aggfunc='min'),
             End = pd.NamedAgg(column='End', aggfunc='max'),
             Production = pd.NamedAgg(column='Production', aggfunc='sum'),
             RPM = pd.NamedAgg(column='RPM', aggfunc='mean'))
        .reset_index()[df.columns]
        .drop('Day', axis='columns'))
gives you
Start End Shift Brand Production RPM
0 2020-07-08 19:00:00 2020-07-09 07:00:00 Night A 10 50
1 2020-07-09 07:00:00 2020-07-09 19:00:00 Day A 20 70
2 2020-07-09 19:00:00 2020-07-09 21:30:00 Night A 2 10
3 2020-07-09 21:30:00 2020-07-10 07:00:00 Night B 10 25
4 2020-07-10 07:00:00 2020-07-10 19:00:00 Day C 20 30
which seems to be your desired output (if I'm not mistaken).
If you want to transform the columns Start and End back to string with a format similar to the one you've given above (there's some additional padding):
df['Start'] = df['Start'].dt.strftime('%m/%d/%Y %H:%M')
df['End'] = df['End'].dt.strftime('%m/%d/%Y %H:%M')
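Grouping on the calendar date of Start works for the sample, but a Night partition that starts after midnight would fall under a different Day. An alternative sketch, assuming the rows are already in chronological order, labels each consecutive run of identical (Shift, Brand) pairs and aggregates per run. The helper name run_id is made up for illustration; the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": ["7/8/2020 19:00", "7/9/2020 7:00", "7/9/2020 17:07",
              "7/9/2020 17:58", "7/9/2020 19:00", "7/9/2020 21:30",
              "7/9/2020 22:40", "7/10/2020 7:00", "7/10/2020 18:27"],
    "End": ["7/9/2020 7:00", "7/9/2020 17:07", "7/9/2020 17:58",
            "7/9/2020 19:00", "7/9/2020 21:30", "7/9/2020 22:40",
            "7/10/2020 7:00", "7/10/2020 18:27", "7/10/2020 19:00"],
    "Shift": ["Night", "Day", "Day", "Day", "Night",
              "Night", "Night", "Day", "Day"],
    "Brand": ["A", "A", "A", "A", "A", "B", "B", "C", "C"],
    "Production": [10, 5, 10, 5, 2, 5, 5, 15, 5],
    "RPM": [50, 50, 100, 60, 10, 20, 30, 20, 40],
})
for col in ("Start", "End"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y %H:%M")

# A new run starts whenever Shift or Brand differs from the previous row.
keys = df[["Shift", "Brand"]]
run_id = (keys != keys.shift()).any(axis=1).cumsum()

out = (df.groupby(run_id)
         .agg(Start=("Start", "min"), End=("End", "max"),
              Shift=("Shift", "first"), Brand=("Brand", "first"),
              Production=("Production", "sum"), RPM=("RPM", "mean"))
         .reset_index(drop=True))
print(out)
```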
I am creating a dictionary for 7 days, from January 22nd to 29th. However, one day has two different values in the same column, Last Update: '1/25/2020 10:00 PM' and '1/25/2020 12:00 PM'. January 25th is a Saturday, so I want to map both of them to Saturday.
For understanding the column:
Last Update
0 1/22/2020 12:00
1 1/22/2020 12:00
2 1/22/2020 12:00
3 1/22/2020 12:00
4 1/22/2020 12:00
...
363 1/29/2020 21:00
364 1/29/2020 21:00
365 1/29/2020 21:00
366 1/29/2020 21:00
367 1/29/2020 21:00
This is how far I got:
day_map = {'1/22/2020 12:00': 'Wednesday', '1/23/20 12:00 PM': 'Thursday',
'1/24/2020 12:00 PM': 'Friday', .?.?.
You just need to convert the column to datetime and use the pandas .dt accessor. In this case
df["Last Update"] = pd.to_datetime(df["Last Update"])
# dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it
df["Last Update"].dt.day_name()
# returns
0 Wednesday
1 Wednesday
2 Wednesday
3 Wednesday
4 Wednesday
Name: Last Update, dtype: object
Here is my dataframe:
Dec-18 Jan-19 Feb-19 Mar-19 Apr-19 May-19
Saturday 2540.0 2441.0 3832.0 4093.0 1455.0 2552.0
Sunday 1313.0 1891.0 2968.0 2260.0 1454.0 1798.0
Monday 1360.0 1558.0 2967.0 2156.0 1564.0 1752.0
Tuesday 1089.0 2105.0 2476.0 1577.0 1744.0 1457.0
Wednesday 1329.0 1658.0 2073.0 2403.0 1231.0 874.0
Thursday 798.0 1195.0 2183.0 1287.0 1460.0 1269.0
I have tried some pandas ops but I am not able to do that.
This is what I want to do:
items
Saturday 2540.0
Sunday 1313.0
Monday 1360.0
Tuesday 1089.0
Wednesday 1329.0
Thursday 798.0
Saturday 2441.0
Sunday 1891.0
Monday 1558.0
Tuesday 2105.0
Wednesday 1658.0
Thursday 1195.0 ............ and so on
I want to stack the columns on top of each other into a single column. How can I do that?
df.reset_index().melt(id_vars='index').drop(columns='variable')
Output:
index value
0 Saturday 2540.0
1 Sunday 1313.0
2 Monday 1360.0
3 Tuesday 1089.0
4 Wednesday 1329.0
5 Thursday 798.0
6 Saturday 2441.0
7 Sunday 1891.0
8 Monday 1558.0
9 Tuesday 2105.0
10 Wednesday 1658.0
11 Thursday 1195.0
12 Saturday 3832.0
13 Sunday 2968.0
14 Monday 2967.0
15 Tuesday 2476.0
16 Wednesday 2073.0
17 Thursday 2183.0
18 Saturday 4093.0
19 Sunday 2260.0
20 Monday 2156.0
21 Tuesday 1577.0
22 Wednesday 2403.0
23 Thursday 1287.0
24 Saturday 1455.0
25 Sunday 1454.0
26 Monday 1564.0
27 Tuesday 1744.0
28 Wednesday 1231.0
29 Thursday 1460.0
30 Saturday 2552.0
31 Sunday 1798.0
32 Monday 1752.0
33 Tuesday 1457.0
34 Wednesday 874.0
35 Thursday 1269.0
Note: I just noticed a comment suggesting the same thing; I will delete my post if requested :)
Create it with numpy by reshaping the data.
import pandas as pd
import numpy as np
pd.DataFrame(df.to_numpy().flatten('F'),
index=np.tile(df.index, df.shape[1]),
columns=['items'])
Output:
items
Saturday 2540.0
Sunday 1313.0
Monday 1360.0
Tuesday 1089.0
Wednesday 1329.0
Thursday 798.0
Saturday 2441.0
...
Sunday 1798.0
Monday 1752.0
Tuesday 1457.0
Wednesday 874.0
Thursday 1269.0
You can do:
df = df.stack().sort_index(level=1).reset_index(level = 1, drop=True).to_frame('items')
It is interesting that this method got overlooked even though it is the fastest:
import time
start = time.time()
df.stack().sort_index(level=1).reset_index(level = 1, drop=True).to_frame('items')
end = time.time()
print("time taken {}".format(end-start))
yields: time taken 0.006181955337524414
while this:
start = time.time()
df.reset_index().melt(id_vars='index').drop(columns='variable')
end = time.time()
print("time taken {}".format(end-start))
yields: time taken 0.010072708129882812
And my output format matches the OP's request exactly.
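A single time.time() measurement is noisy, so a repeated measurement with timeit may give a fairer comparison of the three approaches. The sketch below uses a small made-up subset of the data and the hypothetical helper names via_melt, via_numpy and via_stack. It also checks row order: sort_index(level=1) sorts by column label and then by row label, so the stack variant can come out in alphabetical rather than original order.

```python
import timeit
import numpy as np
import pandas as pd

# Small subset of the question's data, for illustration only.
df = pd.DataFrame(
    {"Dec-18": [2540.0, 1313.0, 1360.0],
     "Jan-19": [2441.0, 1891.0, 1558.0],
     "Feb-19": [3832.0, 2968.0, 2967.0]},
    index=["Saturday", "Sunday", "Monday"],
)

def via_melt(frame):
    return frame.reset_index().melt(id_vars="index").drop(columns="variable")

def via_numpy(frame):
    return pd.DataFrame(frame.to_numpy().flatten("F"),
                        index=np.tile(frame.index, frame.shape[1]),
                        columns=["items"])

def via_stack(frame):
    return (frame.stack().sort_index(level=1)
                 .reset_index(level=1, drop=True).to_frame("items"))

# Average over many calls instead of timing a single run.
for fn in (via_melt, via_numpy, via_stack):
    seconds = timeit.timeit(lambda: fn(df), number=200)
    print(f"{fn.__name__}: {seconds / 200:.6f} s per call")

# Compare the row order produced by the numpy and stack variants.
same_order = (via_numpy(df)["items"].tolist()
              == via_stack(df)["items"].tolist())
print("stack keeps original order:", same_order)
```

Timings depend on the pandas version and on the frame size, so the ranking on a tiny frame may not carry over to larger data.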