Dataset 1:

Date        Weekday   OpenPrice  ClosePrice
___________________________________________
28/07/2022  Thursday  5678       5674
04/08/2022  Thursday  5274       5674
11/08/2022  Thursday  7650       7652

Dataset 2:

Date        Weekday   Open Price  Close Price
_____________________________________________
29/07/2022  Friday    4371        4387
05/08/2022  Friday    6785        6790
12/08/2022  Friday    4367        6756
I would like to iterate over these two datasets and create a new dataset that shows the data below. This is the difference between the Friday Open Price of Week n-1 (e.g. Week 1) and the Thursday Close Price of Week n (e.g. Week 2).
Week   Difference
______________________________
Week2  543 (i.e. 5674 - 4371)
Week3  867 (i.e. 7652 - 6785)
Here is the real file:
https://github.com/ravindraprasad75/HRBot/blob/master/DatasetforSOF.xlsx
Don't iterate over dataframes. Merge them instead.
Reconstruction of your data (cf. "How to make good reproducible pandas examples" on how to share dataframes):
import pandas as pd
from io import StringIO

cols = ['Date', 'Weekday', 'OpenPrice', 'ClosePrice']

data1 = """28/07/2022 Thursday 5674 5678
04/08/2022 Thursday 5274 5278
11/08/2022 Thursday 7652 7687"""

data2 = """29/07/2022 Friday 4371 4387
05/08/2022 Friday 6785 6790
12/08/2022 Friday 4367 6756"""

df1, df2 = (pd.read_csv(StringIO(d),
                        header=None,
                        sep=r"\s+",
                        names=cols,
                        parse_dates=["Date"],
                        dayfirst=True)
            for d in (data1, data2))
Add Week column
df1['Week'] = df1.Date.dt.isocalendar().week
df2['Week'] = df2.Date.dt.isocalendar().week
Resulting dataframes:
>>> df1
Date Weekday OpenPrice ClosePrice Week
0 2022-07-28 Thursday 5674 5678 30
1 2022-08-04 Thursday 5274 5278 31
2 2022-08-11 Thursday 7652 7687 32
>>> df2
Date Weekday OpenPrice ClosePrice Week
0 2022-07-29 Friday 4371 4387 30
1 2022-08-05 Friday 6785 6790 31
2 2022-08-12 Friday 4367 6756 32
Merge on Week
df3 = df1.merge(df2, on="Week", suffixes=("_Thursday", "_Friday"))
Result:
>>> df3
Date_Thursday Weekday_Thursday OpenPrice_Thursday ClosePrice_Thursday \
0 2022-07-28 Thursday 5674 5678
1 2022-08-04 Thursday 5274 5278
2 2022-08-11 Thursday 7652 7687
Week Date_Friday Weekday_Friday OpenPrice_Friday ClosePrice_Friday
0 30 2022-07-29 Friday 4371 4387
1 31 2022-08-05 Friday 6785 6790
2 32 2022-08-12 Friday 4367 6756
Now you can simply do df3.OpenPrice_Friday - df3.ClosePrice_Thursday, using shift where you need to compare different weeks.
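For the question's expected output (Thursday close of week n minus Friday open of week n-1), a minimal sketch of that shift step, using df3 as built above (the numbers differ from the question because the reconstruction fixed apparent typos in the prices):

df3["Difference"] = df3.ClosePrice_Thursday - df3.OpenPrice_Friday.shift(1)
print(df3[["Week", "Difference"]])
#    Week  Difference
# 0    30         NaN
# 1    31       907.0
# 2    32       902.0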
This is my first question on Stack Overflow and I hope I have described my problem in enough detail.
I'm starting to learn data analysis with pandas, and I've created a time series with daily data for the gas prices of a certain station. I've already grouped the hourly data into daily data.
I've managed a simple scatter plot over the year with Plotly, but as a next step I would like to analyze which weekday is the cheapest or most expensive in each week, count the day names, and then see whether there is a pattern over the whole year.
count mean std min 25% 50% 75% max \
2022-01-01 35.0 1.685000 0.029124 1.649 1.659 1.689 1.6990 1.749
2022-01-02 27.0 1.673444 0.024547 1.649 1.649 1.669 1.6890 1.729
2022-01-03 28.0 1.664000 0.040597 1.599 1.639 1.654 1.6890 1.789
2022-01-04 31.0 1.635129 0.045069 1.599 1.599 1.619 1.6490 1.779
2022-01-05 33.0 1.658697 0.048637 1.599 1.619 1.649 1.6990 1.769
2022-01-06 35.0 1.658429 0.050756 1.599 1.619 1.639 1.6940 1.779
2022-01-07 30.0 1.637333 0.039136 1.599 1.609 1.629 1.6565 1.759
2022-01-08 41.0 1.655829 0.041740 1.619 1.619 1.639 1.6790 1.769
2022-01-09 35.0 1.647857 0.031602 1.619 1.619 1.639 1.6590 1.769
2022-01-10 31.0 1.634806 0.041374 1.599 1.609 1.619 1.6490 1.769
...
week weekday
2022-01-01 52 Saturday
2022-01-02 52 Sunday
2022-01-03 1 Monday
2022-01-04 1 Tuesday
2022-01-05 1 Wednesday
2022-01-06 1 Thursday
2022-01-07 1 Friday
2022-01-08 1 Saturday
2022-01-09 1 Sunday
2022-01-10 2 Monday
...
I tried grouping and resampling, but unfortunately I didn't get the result I was hoping for.
Can someone suggest a way to deal with this problem? Thanks!
Here's a way to do what I believe your question asks:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'count': [35, 27, 28, 31, 33, 35, 30, 41, 35, 31] * 40,
     'mean': [1.685, 1.673444, 1.664, 1.635129, 1.658697,
              1.658429, 1.637333, 1.655829, 1.647857, 1.634806] * 40},
    index=pd.Series(pd.to_datetime(pd.date_range("2022-01-01", periods=400, freq="D"))))
print('', 'input df:', df, sep='\n')

df_date = df.reset_index()['index']
df['weekday'] = list(df_date.dt.day_name())
df['year'] = df_date.dt.year.to_numpy()
df['week'] = df_date.dt.isocalendar().week.to_numpy()
# ISO weeks 52/53 that fall in early January started in the previous year
df['year_week_started'] = df.year - np.where((df.week >= 52) & (df.week.shift(-7) == 1), 1, 0)
print('', 'input df with intermediate columns:', df, sep='\n')

cols = ['year_week_started', 'week']
dfCheap = df.loc[df.groupby(cols)['mean'].idxmin(), :].set_index(cols)
dfCheap = (dfCheap.groupby(['year_week_started', 'weekday'])['mean'].count()
           .rename('freq').to_frame().set_index('freq', append=True)
           .reset_index(level='weekday').sort_index(ascending=[True, False]))
print('', 'dfCheap:', dfCheap, sep='\n')

dfExpensive = df.loc[df.groupby(cols)['mean'].idxmax(), :].set_index(cols)
dfExpensive = (dfExpensive.groupby(['year_week_started', 'weekday'])['mean'].count()
               .rename('freq').to_frame().set_index('freq', append=True)
               .reset_index(level='weekday').sort_index(ascending=[True, False]))
print('', 'dfExpensive:', dfExpensive, sep='\n')
Sample input:
input df:
count mean
2022-01-01 35 1.685000
2022-01-02 27 1.673444
2022-01-03 28 1.664000
2022-01-04 31 1.635129
2022-01-05 33 1.658697
... ... ...
2023-01-31 35 1.658429
2023-02-01 30 1.637333
2023-02-02 41 1.655829
2023-02-03 35 1.647857
2023-02-04 31 1.634806
[400 rows x 2 columns]
input df with intermediate columns:
count mean weekday year week year_week_started
2022-01-01 35 1.685000 Saturday 2022 52 2021
2022-01-02 27 1.673444 Sunday 2022 52 2021
2022-01-03 28 1.664000 Monday 2022 1 2022
2022-01-04 31 1.635129 Tuesday 2022 1 2022
2022-01-05 33 1.658697 Wednesday 2022 1 2022
... ... ... ... ... ... ...
2023-01-31 35 1.658429 Tuesday 2023 5 2023
2023-02-01 30 1.637333 Wednesday 2023 5 2023
2023-02-02 41 1.655829 Thursday 2023 5 2023
2023-02-03 35 1.647857 Friday 2023 5 2023
2023-02-04 31 1.634806 Saturday 2023 5 2023
[400 rows x 6 columns]
Sample output:
dfCheap:
weekday
year_week_started freq
2021 1 Monday
2022 11 Tuesday
10 Thursday
10 Wednesday
6 Sunday
5 Friday
5 Monday
5 Saturday
2023 2 Thursday
1 Saturday
1 Sunday
1 Wednesday
dfExpensive:
weekday
year_week_started freq
2021 1 Saturday
2022 16 Monday
10 Tuesday
6 Sunday
5 Friday
5 Saturday
5 Thursday
5 Wednesday
2023 2 Monday
1 Friday
1 Thursday
1 Tuesday
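If you only want the overall day-name counts across the whole period (the "pattern over the whole year" from the question), a shorter readout in the same spirit is possible. This sketch reuses df and cols from the code above:

# How often was each weekday the cheapest, across all weeks?
cheapest_days = df.loc[df.groupby(cols)['mean'].idxmin(), 'weekday'].value_counts()
print(cheapest_days)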
I have three columns in a data frame:
ID - A001
DoA - 15-03-2014 - Date of Admission
DoL - 17-08-2020 - Date of Leaving
Create three new columns:
Cal_Yr - Calendar Year
Str_Date - Start Date
End_Date - End Date
If the year of admission is earlier than 2015, then
Str_Date = 01-01-2015, else DoA
End_Date = 15-03-2015
I am dividing each year into two parts: one part before the anniversary date (the starting dd-mm of the year) and the other part after it, so that I can find the weight of both parts. Any date before 01-01-2015 should be revalued as 01-01-2015.
I have to design a loop which creates 12 repetitive rows as shown in the figure.
The input table is:

ID   DoA        status  DoL        Duration(years)  fee amt
A23  02-Jan-16  DH      18-Aug-18   2                2345
B23  01-Mar-09  IS      31-Dec-20  11                1000
C23  16-Sep-12  SU      12-Jul-19   7               14565
D23  01-Jun-20  LA      07-Sep-20   0                 123
E23  15-Sep-16  IS      31-Dec-20   4                6790
F23  01-Jan-19  IS      31-Dec-20   1                7272
This does what you want. It is not a hard job; like most similar tasks, you just have to take it step by step: "What do I know here?", "What information do I need here?" Note that I have converted the dates to datetime.date objects, assuming you will want to do some analyses based on the dates.
import pandas as pd
import datetime

data = [
    ["A001", "15-03-2014", "17-08-2020"],
    ["A002", "01-06-2018", "01-06-2020"],
]

rows = []
for id, stdate, endate in data:
    # Parse the dd-mm-yyyy strings into datetime.date objects
    s = stdate.split('-')
    startdate = datetime.date(int(s[2]), int(s[1]), int(s[0]))
    s = endate.split('-')
    enddate = datetime.date(int(s[2]), int(s[1]), int(s[0]))
    for year in range(startdate.year, enddate.year + 1):
        start1 = datetime.date(year, 1, 1)
        anniv = datetime.date(year, startdate.month, startdate.day)
        end1 = datetime.date(year, 12, 31)
        # Part of the year before the anniversary (not in the admission year)
        if year != startdate.year:
            rows.append([id, year, start1, anniv])
            if anniv == enddate:
                break
        # Part of the year after the anniversary
        if year != enddate.year:
            rows.append([id, year, anniv, end1])
        elif anniv < enddate:
            rows.append([id, year, anniv, enddate])

df = pd.DataFrame(rows, columns=["ID", "Cal_Yr", "Str_date", "End_date"])
print(df)
Output:
ID Cal_Yr Str_date End_date
0 A001 2014 2014-03-15 2014-12-31
1 A001 2015 2015-01-01 2015-03-15
2 A001 2015 2015-03-15 2015-12-31
3 A001 2016 2016-01-01 2016-03-15
4 A001 2016 2016-03-15 2016-12-31
5 A001 2017 2017-01-01 2017-03-15
6 A001 2017 2017-03-15 2017-12-31
7 A001 2018 2018-01-01 2018-03-15
8 A001 2018 2018-03-15 2018-12-31
9 A001 2019 2019-01-01 2019-03-15
10 A001 2019 2019-03-15 2019-12-31
11 A001 2020 2020-01-01 2020-03-15
12 A001 2020 2020-03-15 2020-08-17
13 A002 2018 2018-06-01 2018-12-31
14 A002 2019 2019-01-01 2019-06-01
15 A002 2019 2019-06-01 2019-12-31
16 A002 2020 2020-01-01 2020-06-01
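The Str_date and End_date columns hold datetime.date objects. If you would rather work with pandas' datetime64 dtype, e.g. to compute the length of each part as the "weight" mentioned in the question, a small follow-up sketch (the Days column name is my own):

# Convert to datetime64 and measure each interval in days
df['Str_date'] = pd.to_datetime(df['Str_date'])
df['End_date'] = pd.to_datetime(df['End_date'])
df['Days'] = (df['End_date'] - df['Str_date']).dt.days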
I have the following formula which gets me the EOM date every 3 months starting Feb 1990:
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
I am looking to get, in a condensed manner, the same table but with the dates offset by x business days.
This means that if x = 2, I want 2 business days before the EOM date calculated every 3 months starting Feb 1990.
Thanks for the help.
import pandas as pd
from pandas.tseries.offsets import BDay

x = 2
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
>>> dates
DatetimeIndex(['1990-02-26', '1990-05-29', '1990-08-29', '1990-11-28',
'1991-02-26', '1991-05-29', '1991-08-29', '1991-11-28',
'1992-02-27', '1992-05-28',
...
'2027-05-27', '2027-08-27', '2027-11-26', '2028-02-25',
'2028-05-29', '2028-08-29', '2028-11-28', '2029-02-26',
'2029-05-29', '2029-08-29'],
dtype='datetime64[ns]', length=159, freq=None)
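A side note: pandas 2.2+ deprecates the bare "M" month-end alias in favor of "ME". If you are on a recent version and see a FutureWarning, the equivalent call would be:

# "3ME" is the month-end alias spelling in pandas 2.2+
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3ME") - BDay(x)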
Example
x = 2
dti1 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
dti2 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
df = pd.DataFrame({"dti1": dti1.day_name(), "dti2": dti2.day_name()})
>>> df.head(20)
dti1 dti2
0 Wednesday Monday
1 Thursday Tuesday
2 Friday Wednesday
3 Friday Wednesday
4 Thursday Tuesday
5 Friday Wednesday
6 Saturday Thursday
7 Saturday Thursday
8 Saturday Thursday
9 Sunday Thursday
10 Monday Thursday
11 Monday Thursday
12 Sunday Thursday
13 Monday Thursday
14 Tuesday Friday
15 Tuesday Friday
16 Monday Thursday
17 Tuesday Friday
18 Wednesday Monday
19 Wednesday Monday
If a date in a datetime series falls on a weekend (US), I'd like to move that date forward to the following Monday. So far I've come up with this, but it obviously won't work, not least because the days parameter of timedelta can't be a Series.
df['Open Date'] = np.where(df['Open Date'].dt.weekday > 4, df['Open Date'] + timedelta(days=7-df['Open Date'].dt.weekday), df['Open Date'])
How can I change this to work with a series?
pd.offsets.BusinessDay(0) will shift weekends to the following Monday, leaving weekdays unchanged.
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2020-12-20', '2020-12-29', freq='D')})
df['date_shift'] = df['date'] + pd.offsets.BusinessDay(0)
date date_shift
0 2020-12-20 2020-12-21 # Sunday -> Monday
1 2020-12-21 2020-12-21 # Monday -> Monday
2 2020-12-22 2020-12-22
3 2020-12-23 2020-12-23
4 2020-12-24 2020-12-24
5 2020-12-25 2020-12-25 # Christmas Holiday Friday Unchanged
6 2020-12-26 2020-12-28 # Saturday -> Monday
7 2020-12-27 2020-12-28 # Sunday -> Monday
8 2020-12-28 2020-12-28
9 2020-12-29 2020-12-29
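As the Christmas row shows, BusinessDay knows nothing about holidays. If holidays should also roll forward to the next business day, one option is CustomBusinessDay with a holiday calendar; a sketch assuming US federal holidays are the right set:

from pandas.tseries.holiday import USFederalHolidayCalendar

# n=0 rolls any non-business day (weekend or listed holiday) forward
cbd = pd.offsets.CustomBusinessDay(0, calendar=USFederalHolidayCalendar())
df['date_shift_holidays'] = df['date'] + cbd
# 2020-12-25 (Christmas, a Friday) now maps to 2020-12-28 (Monday)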
I have a dataframe consisting of counts within 10-minute time intervals. How would I set count = 0 if a time interval doesn't exist?
DF1
import pandas as pd
import numpy as np

df = pd.DataFrame({'City': np.random.choice(['PHOENIX', 'ATLANTA', 'CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
print(df)
COUNT City Day Time
0 441 PHOENIX Thursday 10:20:00
1 641 ATLANTA Monday 14:30:00
2 661 PHOENIX Saturday 03:50:00
3 570 MIAMI Tuesday 21:00:00
4 222 CHICAGO Friday 15:00:00
DF2 - My approach is to create all the 10-minute time slots in a day (6 * 24 = 144 entries) and then use "not in":
df2 = pd.DataFrame({'TIME_BIN': np.arange(0, 86401, 600)})
df2['TIME_BIN'] = pd.to_datetime(df2['TIME_BIN'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
TIME_BIN
0 00:00:00
1 00:10:00
2 00:20:00
3 00:30:00
4 00:40:00
5 00:50:00
6 01:00:00
7 01:10:00
8 01:20:00
How do I check whether the time slots in DF2 are missing from DF1 for each city and day, and if so set count = 0? I basically just need to fill in all the missing time slots in DF1.
Attempt:
for each_city in df.City.unique():
    for each_day in df.Day.unique():
        df['Time'] = df.apply(lambda row: df2['TIME_BIN'] if row['Time'] not in (df2['TIME_BIN'].tolist()) else None)
I think you need to reindex by a MultiIndex created with from_product:
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'City': np.random.choice(['PHOENIX', 'ATLANTA', 'CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
df = df.drop_duplicates(['City', 'Day', 'Time'])
#print(df)

times = (pd.to_datetime(pd.Series(np.arange(0, 86401, 600)), unit='s')
           .dt.round('10min')
           .dt.strftime('%H:%M:%S'))

mux = pd.MultiIndex.from_product([df['City'].unique(),
                                  df['Day'].unique(),
                                  times], names=['City', 'Day', 'Time'])

df = (df.set_index(['City', 'Day', 'Time'])
        .reindex(mux, fill_value=0)
        .reset_index())

print(df.head(20))
City Day Time COUNT
0 CHICAGO Wednesday 00:00:00 66
1 CHICAGO Wednesday 00:10:00 205
2 CHICAGO Wednesday 00:20:00 260
3 CHICAGO Wednesday 00:30:00 127
4 CHICAGO Wednesday 00:40:00 594
5 CHICAGO Wednesday 00:50:00 683
6 CHICAGO Wednesday 01:00:00 203
7 CHICAGO Wednesday 01:10:00 0
8 CHICAGO Wednesday 01:20:00 372
9 CHICAGO Wednesday 01:30:00 109
10 CHICAGO Wednesday 01:40:00 32
11 CHICAGO Wednesday 01:50:00 184
12 CHICAGO Wednesday 02:00:00 630
13 CHICAGO Wednesday 02:10:00 108
14 CHICAGO Wednesday 02:20:00 35
15 CHICAGO Wednesday 02:30:00 604
16 CHICAGO Wednesday 02:40:00 500
17 CHICAGO Wednesday 02:50:00 367
18 CHICAGO Wednesday 03:00:00 118
19 CHICAGO Wednesday 03:10:00 546
One way is to convert to categories and use groupby to compute the Cartesian product.
In fact, given that your data is largely categorical, this is a good idea and would yield memory benefits for a large number of Time-City-Day combinations.
for col in ['Time', 'City', 'Day']:
    df[col] = df[col].astype('category')

bin_cats = sorted(set(pd.Series(pd.to_datetime(np.arange(0, 86401, 600), unit='s'))
                      .dt.round('10min').dt.strftime('%H:%M:%S')))
df['Time'] = df['Time'].cat.set_categories(bin_cats, ordered=True)

res = df.groupby(['Time', 'City', 'Day'], as_index=False)['COUNT'].sum()
res['COUNT'] = res['COUNT'].fillna(0).astype(int)
# Time City Day COUNT
# 0 00:00:00 ATLANTA Friday 521
# 1 00:00:00 ATLANTA Monday 767
# 2 00:00:00 ATLANTA Saturday 474
# 3 00:00:00 ATLANTA Sunday 1126
# 4 00:00:00 ATLANTA Thursday 157
# 5 00:00:00 ATLANTA Tuesday 720
# 6 00:00:00 ATLANTA Wednesday 0
# 7 00:00:00 CHICAGO Friday 1114
# 8 00:00:00 CHICAGO Monday 813
# 9 00:00:00 CHICAGO Saturday 137
# 10 00:00:00 CHICAGO Sunday 134
# 11 00:00:00 CHICAGO Thursday 0
# 12 00:00:00 CHICAGO Tuesday 168
# ..........
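One caveat: this relies on groupby defaulting to observed=False for categorical keys, which is what keeps the unobserved Time-City-Day combinations in the result. Newer pandas versions warn that this default is changing, so it is safer to be explicit; the same call spelled out:

# observed=False keeps category combinations that have no rows, which is the point here
res = df.groupby(['Time', 'City', 'Day'], as_index=False, observed=False)['COUNT'].sum()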
Then you can try the following:
df.groupby(['City', 'Day']).apply(lambda x: x.set_index('Time')
                                             .reindex(df2.TIME_BIN.unique())
                                             .fillna({'COUNT': 0})
                                             .ffill())
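Note that the result of this groupby/apply carries City and Day both in the index and as (forward-filled) columns. To get a flat frame you may want something along these lines, reusing df and df2 from the question:

out = (df.groupby(['City', 'Day'])
         .apply(lambda x: x.set_index('Time')
                           .reindex(df2.TIME_BIN.unique())
                           .fillna({'COUNT': 0})
                           .ffill())
         .drop(columns=['City', 'Day'])  # duplicated by the group keys in the index
         .reset_index())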