This is my first question on Stack Overflow, and I hope I describe my problem in enough detail.
I'm starting to learn data analysis with pandas, and I've created a time series with daily gas-price data for a certain station. I've already grouped the hourly data into daily data.
I've been successful with a simple scatter plot over the year with Plotly, but as a next step I would like to analyze which weekday is the cheapest or most expensive in each week, count the day names, and then check whether there is a pattern over the whole year.
            count      mean       std    min    25%    50%     75%    max
2022-01-01 35.0 1.685000 0.029124 1.649 1.659 1.689 1.6990 1.749
2022-01-02 27.0 1.673444 0.024547 1.649 1.649 1.669 1.6890 1.729
2022-01-03 28.0 1.664000 0.040597 1.599 1.639 1.654 1.6890 1.789
2022-01-04 31.0 1.635129 0.045069 1.599 1.599 1.619 1.6490 1.779
2022-01-05 33.0 1.658697 0.048637 1.599 1.619 1.649 1.6990 1.769
2022-01-06 35.0 1.658429 0.050756 1.599 1.619 1.639 1.6940 1.779
2022-01-07 30.0 1.637333 0.039136 1.599 1.609 1.629 1.6565 1.759
2022-01-08 41.0 1.655829 0.041740 1.619 1.619 1.639 1.6790 1.769
2022-01-09 35.0 1.647857 0.031602 1.619 1.619 1.639 1.6590 1.769
2022-01-10 31.0 1.634806 0.041374 1.599 1.609 1.619 1.6490 1.769
...
week weekday
2022-01-01 52 Saturday
2022-01-02 52 Sunday
2022-01-03 1 Monday
2022-01-04 1 Tuesday
2022-01-05 1 Wednesday
2022-01-06 1 Thursday
2022-01-07 1 Friday
2022-01-08 1 Saturday
2022-01-09 1 Sunday
2022-01-10 2 Monday
...
I tried grouping and resampling, but unfortunately I didn't get the result I was hoping for.
Can someone suggest a way to deal with this problem? Thanks!
Here's a way to do what I believe your question asks:
import pandas as pd
import numpy as np

# Build a sample df analogous to the one in the question
df = pd.DataFrame({
    'count': [35,27,28,31,33,35,30,41,35,31]*40,
    'mean': [1.685,1.673444,1.664,1.635129,1.658697,
             1.658429,1.637333,1.655829,1.647857,1.634806]*40
}, index=pd.date_range("2022-01-01", periods=400, freq="D"))
print('', 'input df:', df, sep='\n')

df_date = df.reset_index()['index']
df['weekday'] = df_date.dt.day_name().to_numpy()
df['year'] = df_date.dt.year.to_numpy()
df['week'] = df_date.dt.isocalendar().week.to_numpy()
# Attribute ISO weeks 52/53 that span New Year to the year the week started
df['year_week_started'] = df.year - np.where((df.week>=52)&(df.week.shift(-7)==1), 1, 0)
print('', 'input df with intermediate columns:', df, sep='\n')

cols = ['year_week_started', 'week']
# Row with the weekly minimum (cheapest day of each week)
dfCheap = df.loc[df.groupby(cols)['mean'].idxmin(), :].set_index(cols)
dfCheap = ( dfCheap.groupby(['year_week_started', 'weekday'])['mean'].count()
    .rename('freq').to_frame().set_index('freq', append=True)
    .reset_index(level='weekday').sort_index(ascending=[True,False]) )
print('', 'dfCheap:', dfCheap, sep='\n')

# Row with the weekly maximum (most expensive day of each week)
dfExpensive = df.loc[df.groupby(cols)['mean'].idxmax(), :].set_index(cols)
dfExpensive = ( dfExpensive.groupby(['year_week_started', 'weekday'])['mean'].count()
    .rename('freq').to_frame().set_index('freq', append=True)
    .reset_index(level='weekday').sort_index(ascending=[True,False]) )
print('', 'dfExpensive:', dfExpensive, sep='\n')
Sample input:
input df:
count mean
2022-01-01 35 1.685000
2022-01-02 27 1.673444
2022-01-03 28 1.664000
2022-01-04 31 1.635129
2022-01-05 33 1.658697
... ... ...
2023-01-31 35 1.658429
2023-02-01 30 1.637333
2023-02-02 41 1.655829
2023-02-03 35 1.647857
2023-02-04 31 1.634806
[400 rows x 2 columns]
input df with intermediate columns:
count mean weekday year week year_week_started
2022-01-01 35 1.685000 Saturday 2022 52 2021
2022-01-02 27 1.673444 Sunday 2022 52 2021
2022-01-03 28 1.664000 Monday 2022 1 2022
2022-01-04 31 1.635129 Tuesday 2022 1 2022
2022-01-05 33 1.658697 Wednesday 2022 1 2022
... ... ... ... ... ... ...
2023-01-31 35 1.658429 Tuesday 2023 5 2023
2023-02-01 30 1.637333 Wednesday 2023 5 2023
2023-02-02 41 1.655829 Thursday 2023 5 2023
2023-02-03 35 1.647857 Friday 2023 5 2023
2023-02-04 31 1.634806 Saturday 2023 5 2023
[400 rows x 6 columns]
Sample output:
dfCheap:
weekday
year_week_started freq
2021 1 Monday
2022 11 Tuesday
10 Thursday
10 Wednesday
6 Sunday
5 Friday
5 Monday
5 Saturday
2023 2 Thursday
1 Saturday
1 Sunday
1 Wednesday
dfExpensive:
weekday
year_week_started freq
2021 1 Saturday
2022 16 Monday
10 Tuesday
6 Sunday
5 Friday
5 Saturday
5 Thursday
5 Wednesday
2023 2 Monday
1 Friday
1 Thursday
1 Tuesday
I am a beginner in Python. These readings are extracted from sensors that report to the system at 20-minute intervals. Now, I would like to find the total downtime from the start time until the time it recovered.
Original Data:
date, Quality Sensor Reading
1/1/2022 9:00 0
1/1/2022 9:20 0
1/1/2022 9:40 0
1/1/2022 10:00 0
1/1/2022 10:20 0
1/1/2022 10:40 0
1/1/2022 12:40 0
1/1/2022 13:00 0
1/1/2022 13:20 0
1/3/2022 1:20 0
1/3/2022 1:40 0
1/3/2022 2:00 0
1/4/2022 14:40 0
1/4/2022 15:00 0
1/4/2022 15:20 0
1/4/2022 17:20 0
1/4/2022 17:40 0
1/4/2022 18:00 0
1/4/2022 18:20 0
1/4/2022 18:40 0
The expected output is as below:
Quality Sensor = 0
Start_Time End_Time Total_Down_Time
2022-01-01 09:00:00 2022-01-01 10:40:00 100 minutes
2022-01-01 12:40:00 2022-01-01 13:20:00 40 minutes
2022-01-03 01:20:00 2022-01-03 02:00:00 40 minutes
2022-01-04 14:40:00 2022-01-04 15:20:00 40 minutes
2022-01-04 17:20:00 2022-01-04 18:40:00 80 minutes
First, let's break them into groups:
df.loc[df.date.diff().gt('00:20:00'), 'group'] = 1
df.group = df.group.cumsum().ffill().fillna(0)
Then, we can extract what we want from each group, and rename:
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
Finally, we'll add the interval column and convert it to minutes:
df2['down_time'] = df2.end_time.sub(df2.start_time)
# Optional, I wouldn't do this here. Note that dt.total_seconds() is safer
# than dt.seconds, which wraps around for gaps longer than a day:
df2.down_time = df2.down_time.dt.total_seconds()/60
Output:
start_time end_time down_time
0 2022-01-01 09:00:00 2022-01-01 10:40:00 100.0
1 2022-01-01 12:40:00 2022-01-01 13:20:00 40.0
2 2022-01-03 01:20:00 2022-01-03 02:00:00 40.0
3 2022-01-04 14:40:00 2022-01-04 15:20:00 40.0
4 2022-01-04 17:20:00 2022-01-04 18:40:00 80.0
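For reference, the snippets above can be assembled into a self-contained script (the timestamps below are a subset of the question's sample data):

```python
import pandas as pd

# Subset of the sample data from the question
df = pd.DataFrame({'date': pd.to_datetime([
    '2022-01-01 09:00', '2022-01-01 09:20', '2022-01-01 09:40',
    '2022-01-01 10:00', '2022-01-01 10:20', '2022-01-01 10:40',
    '2022-01-01 12:40', '2022-01-01 13:00', '2022-01-01 13:20',
])})

# A gap larger than the 20-minute reporting interval starts a new group
df.loc[df.date.diff().gt('00:20:00'), 'group'] = 1
df.group = df.group.cumsum().ffill().fillna(0)

# First and last timestamp of each contiguous run of readings
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
df2['down_time'] = df2.end_time.sub(df2.start_time).dt.total_seconds() / 60
print(df2)
```

Running this on the subset yields two downtime windows, of 100 and 40 minutes.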
Let's say the dates are listed in the dataframe df under the column date. You can use shift() to create a second column with the subsequent date/time, then create a third that holds the duration by subtracting them. Something like:
df['date2'] = df['date'].shift(-1)
df['difference'] = df['date2'] - df['date']
You'll obviously have one row at the end that doesn't have a following value, and therefore doesn't have a difference.
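A minimal sketch of that idea, with a few made-up timestamps:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2022-01-01 09:00', '2022-01-01 09:20', '2022-01-01 10:40'])})

df['date2'] = df['date'].shift(-1)           # next row's timestamp
df['difference'] = df['date2'] - df['date']  # time until the next reading
print(df)
```

The last row's difference is NaT, since there is no following value to subtract from.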
I have a pandas dataframe called df_dummy with 3 columns: Days, Vacations_per_day and day_of_week, and I have a list called legal_days. I want to check whether the values from df_dummy['Days'] are found in the list legal_days and, if found, change the value of the Vacations_per_day column to 4 for that specific row.
This is my code so far:
legal_days = ['2022-08-15', '2022-11-30', '2022-12-01', '2022-12-26']
for index, i in enumerate(df_dummy['Days']):
    if i in legal_days:
        df_dummy['Vacations_per_day'][index] = 4
    else:
        pass
And the output is this:
Days Vacations_per_day day_of_week
2 2022-06-13 0.0 Monday
3 2022-06-14 0.0 Tuesday
4 2022-06-15 0.0 Wednesday
5 2022-06-16 1.0 Thursday
6 2022-06-17 1.0 Friday
7 2022-06-20 0.0 Monday
8 2022-06-21 0.0 Tuesday
9 2022-06-22 0.0 Wednesday
10 2022-06-23 0.0 Thursday
11 2022-06-24 0.0 Friday
12 2022-06-27 0.0 Monday
13 2022-06-28 0.0 Tuesday
14 2022-06-29 1.0 Wednesday
15 2022-06-30 1.0 Thursday
16 2022-07-01 1.0 Friday
17 2022-07-04 1.0 Monday
18 2022-07-05 1.0 Tuesday
19 2022-07-06 1.0 Wednesday
20 2022-07-07 0.0 Thursday
21 2022-07-08 1.0 Friday
22 2022-07-11 1.0 Monday
23 2022-07-12 1.0 Tuesday
24 2022-07-13 1.0 Wednesday
25 2022-07-14 1.0 Thursday
26 2022-07-15 1.0 Friday
27 2022-07-18 0.0 Monday
28 2022-07-19 0.0 Tuesday
29 2022-07-20 0.0 Wednesday
30 2022-07-21 0.0 Thursday
31 2022-07-22 0.0 Friday
32 2022-07-25 1.0 Monday
33 2022-07-26 1.0 Tuesday
34 2022-07-27 1.0 Wednesday
35 2022-07-28 1.0 Thursday
36 2022-07-29 1.0 Friday
37 2022-08-01 1.0 Monday
38 2022-08-02 1.0 Tuesday
39 2022-08-03 1.0 Wednesday
40 2022-08-04 1.0 Thursday
41 2022-08-05 1.0 Friday
42 2022-08-08 0.0 Monday
43 2022-08-09 0.0 Tuesday
44 2022-08-10 0.0 Wednesday
45 2022-08-11 4.0 Thursday
46 2022-08-12 0.0 Friday
47 2022-08-15 0.0 Monday
The problem is that, instead of changing the value of the row with the date 2022-08-15, it changes the row with a date of 2022-08-11. Could anyone help me with this?
Looking at your code, the cause is that enumerate yields positional indices starting at 0, while df_dummy['Vacations_per_day'][index] indexes by the dataframe's label. Since your index starts at 2, every write lands two rows too early: 2022-08-15 sits at position 45, but label 45 is the 2022-08-11 row. In any case, looping like this is outside the normal use-case for modifying a pandas dataframe. You could accomplish all of this with loc and isin:
df_dummy.loc[df_dummy['Days'].isin(legal_days), 'Vacations_per_day'] = 4
This code works by looking up all the rows in your dataframe that have a value for Days in the legal_days list and then sets the associated value for the Vacations_per_day column to 4.
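A small self-contained demonstration (the data here is invented for illustration and uses a default integer index):

```python
import pandas as pd

df_dummy = pd.DataFrame({
    'Days': ['2022-08-11', '2022-08-12', '2022-08-15'],
    'Vacations_per_day': [0.0, 0.0, 0.0],
    'day_of_week': ['Thursday', 'Friday', 'Monday'],
})
legal_days = ['2022-08-15', '2022-11-30', '2022-12-01', '2022-12-26']

# Vectorized update: match on the Days value itself, not on position
df_dummy.loc[df_dummy['Days'].isin(legal_days), 'Vacations_per_day'] = 4
print(df_dummy)
```

Only the 2022-08-15 row is updated, regardless of how the index is numbered.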
I am new to quantitative finance in Python, so please bear with me. I have the following data set:
> head(df, 20)
# A tibble: 20 × 15
deal_id book counterparty commodity_name commodity_code executed_date first_delivery_date last_delivery_date last_trading_date volume buy_sell trading_unit tenor delivery_window strategy
<int> <chr> <chr> <chr> <chr> <dttm> <dttm> <dttm> <dttm> <int> <chr> <chr> <chr> <chr> <chr>
1 0 Book_7 Counterparty_3 api2coal ATW 2021-03-07 11:50:24 2022-01-01 00:00:00 2022-12-31 00:00:00 2021-12-31 00:00:00 23000 sell MT year Cal 22 NA
2 1 Book_7 Counterparty_3 oil B 2019-11-10 18:33:39 2022-01-01 00:00:00 2022-12-31 00:00:00 2021-11-30 00:00:00 16000 sell bbl year Cal 22 NA
3 2 Book_4 Counterparty_3 oil B 2021-02-25 11:44:20 2021-04-01 00:00:00 2021-04-30 00:00:00 2021-02-26 00:00:00 7000 buy bbl month Apr 21 NA
4 3 Book_3 Counterparty_3 gold GC 2022-05-27 19:28:48 2022-11-01 00:00:00 2022-11-30 00:00:00 2022-10-31 00:00:00 200 buy oz month Nov 22 NA
5 4 Book_2 Counterparty_3 czpower CZ 2022-09-26 13:14:31 2023-03-01 00:00:00 2023-03-31 00:00:00 2023-02-27 00:00:00 2 buy MW quarter Mar 23 NA
6 5 Book_1 Counterparty_3 depower DE 2022-08-29 10:28:34 2022-10-01 00:00:00 2022-10-31 00:00:00 2022-09-30 00:00:00 23 buy MW month Oct 22 NA
7 6 Book_3 Counterparty_1 api2coal ATW 2022-12-08 08:17:11 2023-01-01 00:00:00 2023-01-31 00:00:00 2022-12-30 00:00:00 29000 sell MT quarter Jan 23 NA
8 7 Book_3 Counterparty_2 depower DE 2020-10-16 17:36:13 2022-03-01 00:00:00 2022-03-31 00:00:00 2022-02-25 00:00:00 3 sell MW quarter Mar 22 NA
9 8 Book_7 Counterparty_1 api2coal ATW 2020-10-13 09:35:24 2021-02-01 00:00:00 2021-02-28 00:00:00 2021-01-29 00:00:00 1000 sell MT quarter Feb 21 NA
10 9 Book_2 Counterparty_1 api2coal ATW 2020-05-19 11:04:39 2022-01-01 00:00:00 2022-12-31 00:00:00 2021-12-31 00:00:00 19000 sell MT year Cal 22 NA
11 10 Book_6 Counterparty_1 oil B 2022-03-03 08:04:04 2022-08-01 00:00:00 2022-08-31 00:00:00 2022-06-30 00:00:00 26000 buy bbl month Aug 22 NA
12 11 Book_3 Counterparty_1 gold GC 2021-05-09 18:08:31 2022-05-01 00:00:00 2022-05-31 00:00:00 2022-04-29 00:00:00 1600 sell oz month May 22 NA
13 12 Book_5 Counterparty_2 oil B 2020-08-20 11:54:34 2021-04-01 00:00:00 2021-04-30 00:00:00 2021-02-26 00:00:00 6000 buy bbl month Apr 21 Strategy_3
14 13 Book_6 Counterparty_2 gold GC 2020-12-23 16:28:55 2021-12-01 00:00:00 2021-12-31 00:00:00 2021-11-30 00:00:00 1700 sell oz month Dec 21 NA
15 14 Book_2 Counterparty_1 depower DE 2021-08-11 12:54:23 2024-01-01 00:00:00 2024-12-31 00:00:00 2023-12-28 00:00:00 15 buy MW year Cal 24 NA
16 15 Book_5 Counterparty_1 czpower CZ 2022-02-15 07:45:24 2022-12-01 00:00:00 2022-12-31 00:00:00 2022-11-30 00:00:00 28 buy MW month Dec 22 Strategy_3
17 16 Book_7 Counterparty_2 oil B 2021-05-19 07:37:05 2022-02-01 00:00:00 2022-02-28 00:00:00 2021-12-31 00:00:00 11000 buy bbl quarter Feb 22 Strategy_3
18 17 Book_4 Counterparty_3 depower DE 2022-02-01 12:34:49 2022-06-01 00:00:00 2022-06-30 00:00:00 2022-05-31 00:00:00 14 sell MW month Jun 22 NA
19 18 Book_2 Counterparty_3 czpower CZ 2022-06-02 09:39:16 2023-02-01 00:00:00 2023-02-28 00:00:00 2023-01-30 00:00:00 21 buy MW quarter Feb 23 NA
20 19 Book_3 Counterparty_1 czpower CZ 2021-10-28 12:41:11 2022-09-01 00:00:00 2022-09-30 00:00:00 2022-08-31 00:00:00 3 sell MW month Sep 22 NA
And I am asked to extract some information from it while applying what is called yearly and quarterly futures cascading, which I do not know. The task is as follows:
Compute the position size (contracted volume) for a combination of books and commodities, for a selected time in history. The output format should be a data frame with future delivery periods as index (here comes yearly and quarterly cascading), commodities as column names and total volume as values. Provide negative values when the total volume for a given period was sold and positive values when it was bought.
I read some material online about cascading futures here and here, but it only gave me a vague idea of what it is about, doesn't help solve the problem at hand, and coding examples in Python are nonexistent.
Can someone please give me a hint as to how to approach this problem? I am a beginner in the field of quantitative finance and any help would be much appreciated.
I'm using Python, and I have a DataFrame in which all dates and weekdays are listed.
I want to divide them into weeks (like Thursday to Thursday).
Dataframe -
And now I want to divide this dataframe into this format -
Date Weekday
0 2021-01-07 Thursday
1 2021-01-08 Friday
2 2021-01-09 Saturday
3 2021-01-10 Sunday
4 2021-01-11 Monday
5 2021-01-12 Tuesday
6 2021-01-13 Wednesday
7 2021-01-14 Thursday,
Date Weekday
0 2021-01-14 Thursday
1 2021-01-15 Friday
2 2021-01-16 Saturday
3 2021-01-17 Sunday
4 2021-01-18 Monday
5 2021-01-19 Tuesday
6 2021-01-20 Wednesday
7 2021-01-21 Thursday,
Date Weekday
0 2021-01-21 Thursday
1 2021-01-22 Friday
2 2021-01-23 Saturday
3 2021-01-24 Sunday
4 2021-01-25 Monday
5 2021-01-26 Tuesday
6 2021-01-27 Wednesday
7 2021-01-28 Thursday,
Date Weekday
0 2021-01-28 Thursday
1 2021-01-29 Friday
2 2021-01-30 Saturday.
I want it in this format, but I don't know how I can divide this dataframe.
You can use pandas.to_datetime if Date is not yet of datetime type, then group by the ISO week number (the dt.week accessor is deprecated in recent pandas versions; dt.isocalendar().week is the replacement):
dfs = [g for _,g in df.groupby(pd.to_datetime(df['Date']).dt.isocalendar().week)]
Alternatively, if you have several years, use dt.to_period:
dfs = [g for _,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W'))]
output:
[ Date Weekday
0 2021-01-07 Thursday
1 2021-01-08 Friday
2 2021-01-09 Saturday
3 2021-01-10 Sunday,
Date Weekday
4 2021-01-11 Monday
5 2021-01-12 Tuesday
6 2021-01-13 Wednesday
7 2021-01-14 Thursday
8 2021-01-14 Thursday
9 2021-01-15 Friday
10 2021-01-16 Saturday
11 2021-01-17 Sunday,
Date Weekday
12 2021-01-18 Monday
13 2021-01-19 Tuesday
14 2021-01-20 Wednesday
15 2021-01-21 Thursday
16 2021-01-21 Thursday
17 2021-01-22 Friday
18 2021-01-23 Saturday
19 2021-01-24 Sunday,
Date Weekday
20 2021-01-25 Monday
21 2021-01-26 Tuesday
22 2021-01-27 Wednesday
23 2021-01-28 Thursday
24 2021-01-28 Thursday
25 2021-01-29 Friday
26 2021-01-30 Saturday]
variants
As dictionary:
{k:g for k,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W'))}
reset_index of subgroups:
[g.reset_index() for _,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W'))]
weeks ending on Wednesday/starting on Thursday with anchor offsets:
[g.reset_index() for _,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W-WED'))]
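Putting the anchored-offset variant together as a runnable sketch (the date range is reconstructed from the question's expected output):

```python
import pandas as pd

# Daily dates covering the question's example period
df = pd.DataFrame({'Date': pd.date_range('2021-01-07', '2021-01-30')})
df['Weekday'] = df['Date'].dt.day_name()

# W-WED weeks end on Wednesday, so each group runs Thursday through Wednesday
dfs = [g.reset_index(drop=True)
       for _, g in df.groupby(df['Date'].dt.to_period('W-WED'))]
print(len(dfs))
print(dfs[0])
```

The first three groups are full Thursday-to-Wednesday weeks; the last group holds the remaining days.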
I have this dataframe:
Matricule DateTime Date Time
1 10 2022-01-06 10:59:51 2022-01-06 10:59:51
2 10 2022-01-07 08:40:09 2022-01-07 08:40:09
3 10 2022-01-26 15:39:10 2022-01-26 15:39:10
4 11 2022-01-03 14:33:38 2022-01-03 14:33:38
81 11 2022-01-04 10:04:18 2022-01-04 10:04:18
... ... ... ... ...
15 18 2022-01-24 15:51:22 2022-01-24 15:51:22
15 18 2022-01-24 15:51:29 2022-01-24 15:51:29
15 18 2022-01-24 16:54:23 2022-01-24 16:54:23
15 18 2022-01-28 14:42:01 2022-01-28 14:42:01
15 18 2022-01-28 14:42:32 2022-01-28 14:42:32
I want to calculate the time difference between the first and last reading of the day, for each day and every employee, to know how many hours they spent at work daily. For example:
Matricule Date WorkTime
1 10 2022-01-06 1
2 10 2022-01-07 3
3 10 2022-01-26 5
4 11 2022-01-03 2
81 11 2022-01-04 8
You can use the split-apply-combine approach: write a function for each group and apply it to the groupby:
grpd = df.groupby(['Matricule', 'Date'])

def get_hours(df):
    # First and last reading of the day
    start = df['Time'].min()
    end = df['Time'].max()
    new_df = pd.DataFrame([end - start], columns=['WorkTime'])
    return new_df

grpd.apply(get_hours)
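A self-contained version of the same split-apply-combine idea (timestamps invented for illustration), shown here with an equivalent agg lambda instead of apply:

```python
import pandas as pd

df = pd.DataFrame({
    'Matricule': [10, 10, 18, 18, 18],
    'DateTime': pd.to_datetime([
        '2022-01-06 09:00:00', '2022-01-06 10:59:51',
        '2022-01-24 15:51:22', '2022-01-24 15:51:29', '2022-01-24 16:54:23',
    ]),
})
df['Date'] = df['DateTime'].dt.date
df['Time'] = df['DateTime']  # keep full timestamps so subtraction works

# First-to-last reading per employee per day
work = (df.groupby(['Matricule', 'Date'])['Time']
          .agg(lambda s: s.max() - s.min())
          .rename('WorkTime')
          .reset_index())
print(work)
```

Each row of work holds one employee-day with the elapsed time between the first and last reading as a Timedelta; format it to hours afterwards if needed.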