Getting index value in pandas multiindex based on condition - python

I have created a dataframe with the code below.
The objective is to find the weekly low and the dates on which each weekly low took place.
To do this:
import pandas as pd
from pandas_datareader import data as web
import datetime

start = datetime.datetime(2021, 1, 1)
end = datetime.datetime.today()
df = web.DataReader('GOOG', 'yahoo', start, end)

df['Date1'] = df.index
df['month'] = df.index.month
df['week'] = df.index.week
df['day'] = df.index.day
df.set_index('week', append=True, inplace=True)
df.set_index('day', append=True, inplace=True)
To get the weekly low:
df['Low'].groupby(['week']).min().tail(50)
I am trying to find out the date on which a given weekly low occurred, such as 1735.420044.
If I try this:
df['Low'].isin([1735.420044])
I get :
Date week day
2020-12-31 53 31 False
2021-01-04 1 4 False
2021-01-05 1 5 False
2021-01-06 1 6 False
2021-01-07 1 7 False
...
2021-08-02 31 2 False
2021-08-03 31 3 False
2021-08-04 31 4 False
2021-08-05 31 5 False
2021-08-06 31 6 False
Name: Low, Length: 151, dtype: bool
How can I get the actual dates for the lows?

To get the weekly lows together with the dates they occurred on, you can ask the groupby for the index of each minimum rather than the minimum value itself:
res = df['Low'].groupby('week').idxmin()
res is a series, indexed by week, whose values are the full (Date, week, day) index labels of the rows where each weekly low occurred. Note that if you aggregate with .min() instead, the result is indexed by week only, so the dates are no longer available.
To get just the dates as a series, this should work:
dates = res.map(lambda ix: ix[0])
PS: Clarification from the comments
df['Low'].isin([1735.420044]).any() # returns False
The above doesn't work for you (it should return True if there were an exact match) because .isin([<bunch of floats>]) compares floats for equality. The printed value 1735.420044 is a rounded display of the stored double, so parsing it back generally does not reproduce the stored value bit for bit, and floating-point equality tests require exactly that. Floating-point comparisons should be made within a tolerance rather than exactly (this is not Python specific; it is true in every language). Sometimes an exact comparison might seem to work in Python, but that is coincidental. Have a look at this thread to gain some (Python specific) insight into this.
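If you do need to locate a price you copied from printed output, compare within a tolerance instead of testing exact equality; a minimal sketch using numpy's isclose, assuming the df from the question:
import numpy as np

target = 1735.420044  # value copied from the printed weekly lows
mask = np.isclose(df['Low'], target)  # True where Low is within tolerance of target
print(df.index[mask].get_level_values('Date'))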

Related

Calculating rate of return for multiple time frames (annualized, quarterly) with daily time series data (S&P 500 (SPX index) daily prices)

I have a CSV file with some 30 years' worth of daily close prices for the S&P 500 (SPX) stock market index, and I read it into a dataframe with the dates set as the index.
Dataframe:
Date          Open      High      Low       Close
2023-01-13    3960.60   4003.95   3947.67   3999.09
2023-01-12    3977.57   3997.76   3937.56   3983.17
2023-01-11    3932.35   3970.07   3928.54   3969.61
2023-01-10    3888.57   3919.83   3877.29   3919.25
2023-01-09    3910.82   3950.57   3890.42   3892.09
1990-01-08    353.79    354.24    350.54    353.79
1990-01-05    352.20    355.67    351.35    352.20
1990-01-04    355.67    358.76    352.89    355.67
1990-01-03    358.76    360.59    357.89    358.76
1990-01-02    359.69    359.69    351.98    359.69
It effectively has a date (as index) column, and four columns (open, high, low, close) of daily prices. I am using close prices.
I would like a flexible function to calculate annual returns from the chosen start date to the end date using the formula:
(end_price / beginning_price - 1) * 100
So, the annual return for 2022 would be:
(SPX_Close_price_at_31_December_2022 / SPX_Close_price_at_31_December_2021 - 1) * 100
Ideally the same function could also handle monthly or quarterly date inputs. I would then like these periodic returns (%) added to the dataframe in a separate column and/or a new dataframe, with start and end dates matched across rows, so I can plot consecutive annual returns on a Matplotlib line chart. And I would like to do this for the whole 30-year series.
This is what I would like the final dataframe to look like (the return numbers below are examples only):
Date        Annual Return (%)
m/d/2022    -18
m/d/2021    20
m/d/2020    15
m/d/2019    18
I am a beginner with Python and am still struggling with date and datetime formats and with matching those dates to data in columns across selected rows.
Below is what I have got to so far, but it doesn't work properly. I will try the dateutil library, but I think building out efficient functions is still something I need to work on. This is my first question on Stack Overflow, so thanks for having me :)
def spx_return(df, sdate, edate):
    delta = dt.timedelta(days=365)
    while sdate <= edate:
        df2 = df['RoR'] = (df['Close'] / df['Close'].shift(-365) - 1) * 100
        sdate += delta
        #print(sdate, end="\n")
    return df2
To calculate annual and quarterly rates in a generic way, I came up with a function that takes the start date, the end date, and a frequency string that distinguishes between years and quarters. After restricting the dataframe to the requested date range, pd.Grouper() extracts the last row of each period, and your formula is applied to those rows on the following line. When computing the rate for the first period we need price data from before the start date, so the start of the range is pushed back by '366 days' or '90 days' depending on the frequency. I have not verified that these offsets give the correct result in all cases, because of market holidays such as the year-end and New Year holidays; setting a larger number of days may solve this problem.
import pandas as pd
import yfinance as yf
df = yf.download("^GSPC", start="2016-01-01", end="2022-01-01")
df.index = pd.to_datetime(df.index)
df.index = df.index.tz_localize(None)
def rating(data, startdate, enddate, freq):
    # look back one extra period so the first row has a reference price
    offset = '366 days' if freq == 'Y' else '90 days'
    dff = data.loc[(data.index >= pd.Timestamp(startdate) - pd.Timedelta(offset)) & (data.index <= pd.Timestamp(enddate))]
    # keep the last trading day of each year/quarter
    dfy = dff.groupby(pd.Grouper(level='Date', freq=freq)).tail(1)
    ratio = (dfy['Close'] / dfy['Close'].shift() - 1) * 100
    return ratio
period_rating = rating(df, '2017-01-01', '2019-12-31', freq='Y')
print(period_rating)
Date
2016-12-30 NaN
2017-12-29 19.419966
2018-12-31 -6.237260
2019-12-31 28.878070
Name: Close, dtype: float64
period_rating = rating(df, '2017-01-01', '2019-12-31', freq='Q')
print(period_rating)
Date
2016-12-30 NaN
2017-03-31 5.533689
2017-06-30 2.568647
2017-09-29 3.959305
2017-12-29 6.122586
2018-03-29 -1.224561
2018-06-29 2.934639
2018-09-28 7.195851
2018-12-31 -13.971609
2019-03-29 13.066190
2019-06-28 3.787754
2019-09-30 1.189083
2019-12-31 8.534170
Name: Close, dtype: float64
If your df has a DatetimeIndex, then you can use the .loc accessor with the date formatted as a string to retrieve the necessary values. For example, df.loc['2022-12-31'].Close should return the Close value on 2022-12-31.
In terms of efficiency, although you could use a shift operation, there isn't really a need to allocate more memory in a dataframe – you can use a loop instead:
annual_returns = []
end_dates = []
for year in range(1991, 2022):
    end_date = f"{year}-12-31"
    start_date = f"{year-1}-12-31"
    end_dates.append(end_date)
    end_price, start_price = df.loc[end_date].Close, df.loc[start_date].Close
    annual_returns.append((end_price / start_price - 1) * 100)
Then you can build your final dataframe from your lists:
df_final = pd.DataFrame(
    data=annual_returns,
    index=pd.DatetimeIndex(end_dates, name='Date'),
    columns=['Annual Return (%)']
)
Using some sample data from yfinance, I get the following:
>>> df_final
Annual Return (%)
Date
2008-12-31 -55.508475
2009-12-31 101.521206
2010-12-31 -4.195294
2013-12-31 58.431109
2014-12-31 -5.965609
2015-12-31 44.559938
2019-12-31 29.104585
2020-12-31 31.028712
2021-12-31 65.170561
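For comparison (not part of the answers above), pandas can also compute such period returns directly with resample and pct_change; a minimal sketch, assuming the same yfinance data as in the first answer:
import pandas as pd
import yfinance as yf

df = yf.download("^GSPC", start="2016-01-01", end="2022-01-01")
# last close of each calendar year, then the percentage change between consecutive year-ends
annual = df['Close'].resample('Y').last().pct_change() * 100
# the same idea at quarterly frequency
quarterly = df['Close'].resample('Q').last().pct_change() * 100
print(annual)
Note that recent pandas versions prefer the 'YE' and 'QE' frequency aliases over 'Y' and 'Q'.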

Pandas time conversion and elapsed time

I have a pandas dataframe that has columns:
['A'] departure time (listed as an integer, e.g. 700 or 403, meaning 7:00 and 4:03);
['B'] elapsed time (listed as an integer, e.g. 70 or 656, meaning 70 mins and 656 mins);
['C'] arrival time (listed as an integer, e.g. 1810 or 355, meaning 18:10 and 03:55).
I need to find a way to develop a new column ['D'] with a boolean value that returns True if the arrival is on the following day and False if arrival is on the same day.
I thought of accessing the -2 index of column A to convert hours to minutes, and then adding the remainder minutes to normalize the values, but I'm not sure how to do that, or whether there's a simpler way. The idea is to get the total minutes elapsed since the day started: if that total exceeds the total minutes in a day, I'd have my answer. I'm unsure whether this would work.
Similar to the method you outlined, you can accomplish the task by converting the integers in column A to a 24-hour datetime (which defaults to 1900-01-01), adding the integer number of minutes from column B as a timedelta, and then checking whether the result is still on day 1 of the month. As a sanity check, I made sure the last row returns True.
You can probably combine these steps without creating a new column, but I think the code is more readable this way.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [700, 403, 2359],
    'B': [70, 656, 2],
    'C': [810, 1059, 1]
})
# convert to string, add leading zeros, then parse column A as a time and add column B as minutes
df['arrival'] = pd.to_datetime(df['A'].astype(str).str.zfill(4), format='%H%M') + pd.to_timedelta(df['B'], 'm')
# check if you are still on day 1 of the month
df['D'] = np.where(df.arrival.dt.day > 1, True, False)
Output:
A B C arrival D
0 700 70 810 1900-01-01 08:10:00 False
1 403 656 1059 1900-01-01 14:59:00 False
2 2359 2 1 1900-01-02 00:01:00 True
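The total-minutes idea sketched in the question also works, without any datetime conversion; a minimal sketch, assuming the same integer encoding and the df built above (D_arith is just an illustrative column name):
# departure time in minutes since midnight, plus the elapsed minutes
total_minutes = (df['A'] // 100) * 60 + df['A'] % 100 + df['B']
# arrival is on the next day when the total spills past 24 hours
df['D_arith'] = total_minutes >= 24 * 60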

Create a boolean dataframe based on the difference between two datetimes

I have a pandas dataframe called "gaps" that looks like this:
Index Gap in days
0 2 days 00:00:00
1 8 days 00:00:00
2 4 days 00:00:00
3 15 days 00:00:00
...
201 21 days 00:00:00
The date format has been converted to the standard datetime format. I want to create a simple boolean dataframe that returns TRUE if the gap in days is more than 7 days, and FALSE otherwise.
My initial attempt was the simple:
morethan7days = gaps > 7
For which I get the error:
TypeError: invalid type comparison
Anybody know what I'm doing wrong and how to fix it?
Nevermind, I got the answer through trial and error:
morethan7days = gaps > datetime.timedelta(days=7)
You can convert the timedeltas to days with Series.dt.days and then compare against an integer:
gaps = df['Gap in days']
morethan7days = gaps.dt.days > 7
print(morethan7days)
0 False
1 True
2 False
3 True
4 True
Name: Gap in days, dtype: bool
Another solution is to compare with pandas.Timedelta:
gaps = df['Gap in days']
morethan7days = gaps > pd.Timedelta(7, unit='d')
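For a self-contained test of both approaches, the gaps series can be built directly with pd.to_timedelta (the sample values here are assumptions based on the question):
import pandas as pd

gaps = pd.Series(pd.to_timedelta(['2 days', '8 days', '4 days', '15 days', '21 days']), name='Gap in days')
print(gaps.dt.days > 7)                  # compare as integers
print(gaps > pd.Timedelta(7, unit='d'))  # compare as timedeltas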

calculate date difference between today's date and pandas date series

I want to calculate the difference in days between the pandas date series below -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
- and today's date.
I tried but could not come up with a logical solution.
Please help me with the code. I am new to Python and keep running into syntax errors when applying any function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 True
1 False
2 True
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
You can try as follows:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime:
>>> df['col1'] = pd.to_datetime(df['col1'], infer_datetime_format=True)
Set the current datetime in order to then get the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
With numpy you can solve it as shown in difference-two-dates-days-weeks-months-years-pandas-python-2. Bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# for days use 'D', for weeks use 'W', for months use 'M' and for years use 'Y'
df['diff_days'] = df['diff_days'] / np.timedelta64(1, 'D')
print(df)
If you want days as int and not as float, use:
df['diff_days'] = df['diff_days'] // np.timedelta64(1, 'D')
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before, but this suggests your pandas date series (a list-like object) is iterable and that each element of the series has a to_datetime method.
Assuming that is correct, the following function would take such a series and return a list of timedeltas (each representing the difference between two datetime objects).
from datetime import datetime

def convert(pandas_series):
    # get the current date and time
    now = datetime.now()
    # use a list comprehension and each element's to_datetime method to calculate timedeltas
    return [now - pandas_element.to_datetime() for pandas_element in pandas_series]

# assuming 'some_pandas_series' is a list-like pandas series object
list_of_timedeltas = convert(some_pandas_series)
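For what it's worth, the per-element loop can be avoided entirely with vectorised arithmetic and the .dt accessor; a minimal sketch, assuming the dates sit in a column named col1 as in the second answer (days_ago is just an illustrative column name):
import pandas as pd

df = pd.DataFrame({'col1': pd.to_datetime(['2013-02-16', '2013-01-29', '2013-02-21'])})
# subtract the whole series from today's date in one operation and extract whole days
df['days_ago'] = (pd.Timestamp('now').normalize() - df['col1']).dt.days
print(df)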

get subset dataframe by date

I have the following subset with a starting date (DD/MM/YYYY) and Amount
Start Date Amount
1 01/01/2013 20
2 02/05/2007 10
3 01/05/2004 15
4 01/06/2014 20
5 17/08/2008 21
I'd like to create a subset of this dataframe containing only the rows where the Start Date day is 01:
Start Date Amount
1 01/01/2013 20
3 01/05/2004 15
4 01/06/2014 20
I've tried to loop through the table and use the index, but couldn't find a suitable way to iterate through the dataframe's rows.
Assuming your dates are already datetime, the following should work. If they are strings, you can convert them using to_datetime: df['Start Date'] = pd.to_datetime(df['Start Date']); you may also need to pass dayfirst=True. If you imported the data using read_csv, you could have done this at the point of import: df = pd.read_csv('data.csv', parse_dates=[n], dayfirst=True), where n is the (0-based) column index, so parse_dates=[0] if it is the first column.
One method could be to apply a lambda to the column and use the boolean series it returns as a mask:
In [19]:
df[df['Start Date'].apply(lambda x: x.day == 1)]
Out[19]:
Start Date Amount
index
1 2013-01-01 20
3 2004-05-01 15
4 2014-06-01 20
I'm not sure whether there is a built-in method that doesn't involve setting this column as the index (which would convert it into a time-series index).
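In more recent pandas versions there is a built-in way that avoids both apply and the index: the .dt accessor exposes datetime components directly on a column. A minimal sketch, assuming Start Date has already been parsed as datetime:
import pandas as pd

df = pd.DataFrame({
    'Start Date': pd.to_datetime(
        ['01/01/2013', '02/05/2007', '01/05/2004', '01/06/2014', '17/08/2008'],
        dayfirst=True),
    'Amount': [20, 10, 15, 20, 21]
})
# boolean mask on the day-of-month component of each date
print(df[df['Start Date'].dt.day == 1])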
