Intraday daily return - python

I am quite new to Python and I need your help!
My data structure:
This is 5-minute intraday data from 2001/01/02 to 2019/12/31. As you can see from the data, column 0 contains the date and column 2 contains the stock price.
Each day, such as 2001/01/02, has 79 observations.
First of all, I need to create a daily return as a new column. Normally I work with daily data, where I compute the daily log return as follows:
import numpy as np

def lr(x):
    return np.log(x[1:]) - np.log(x[:-1])
How can I create a new column for the daily return from the 5-minute data?

If you load your data into a pandas.DataFrame, you can use df.groupby() and then apply your lr-function with minimal changes:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx', header=None,
                   names=['Index', 'Date', 'Some_var', 'Stock_price'])
The key thing to do, though, will be to decide how you want to generate your daily values from your 5 minute data. I'm no stock expert, but I'd guess you want to use the last value for each day to represent the stock value. If that's the case, you can use
daily_values = df.groupby('Date')['Stock_price'].agg('last')
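and then apply your lr function to get the returns. Pass the underlying NumPy array rather than the Series itself, because subtracting two slices of a pandas Series aligns them on the index instead of by position:
lr(daily_values.to_numpy())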
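To get the daily return as a new column on the 5-minute data itself, something along these lines might work. This is only a minimal sketch: the function name add_daily_log_return and the output column daily_log_return are made up for illustration, and it assumes the Date column holds one value per calendar day that sorts chronologically, with the last 5-minute price taken as the daily close.

```python
import numpy as np
import pandas as pd


def add_daily_log_return(df, date_col="Date", price_col="Stock_price"):
    """Attach the close-to-close daily log return to every intraday row."""
    # Last 5-minute price of each day. groupby sorts the dates, so this
    # assumes the date values sort chronologically (e.g. datetimes).
    daily_close = df.groupby(date_col)[price_col].last()
    # log(P_t) - log(P_{t-1}); the first day has no previous close, so NaN.
    daily_return = np.log(daily_close).diff()
    out = df.copy()
    # Broadcast each day's return back onto that day's intraday rows.
    out["daily_log_return"] = out[date_col].map(daily_return)
    return out
```

Every row of a given day then carries the same value, and all rows of the first day are NaN because there is no earlier close to compare against.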

Related

Gathering rate of return from yfinance API

Looking at my data, I am trying to create a new column in a separate dataset that gives each ticker and its rate of return. The rate of return should be calculated from the open price of the first observation for a ticker and the close price of the last observation for that same ticker.
Can you try using this function and tell me if it achieves your purpose?
def calculate_return_rate(df):
    df_ordered = df.sort_values(by=["date"])
    df_last_close = df_ordered.groupby(["Ticker"]).agg("last")["close"]
    df_first_open = df_ordered.groupby(["Ticker"]).agg("first")["open"]
    return (df_last_close - df_first_open)/df_first_open
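Since the question asks for a separate dataset with the ticker and the rate of return, a variant that returns a two-column DataFrame may be closer to what is wanted. This is a sketch under the same column-name assumptions as the function above (date, Ticker, open, close); the names return_rate_table and rate_of_return are hypothetical.

```python
import pandas as pd


def return_rate_table(df):
    """One row per ticker: (last close - first open) / first open."""
    ordered = df.sort_values("date")
    grouped = ordered.groupby("Ticker")
    first_open = grouped["open"].first()   # open of the earliest observation
    last_close = grouped["close"].last()   # close of the latest observation
    rate = (last_close - first_open) / first_open
    # Turn the Ticker index back into a regular column.
    return rate.rename("rate_of_return").reset_index()
```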

How to calculate the variance of stocks of the last 36 months (some monthly data are missing)?

I downloaded some stock data from CRSP and need the variance of the stock returns of the last 36 months of that company.
So, basically the variance based on two conditions:
Same PERMCO (company number)
Monthly stock returns of the last 3 years.
However, I excluded penny stocks from my sample (stocks with prices < $2). Hence, sometimes months are missing and, e.g., April's and June's monthly returns end up in adjacent rows.
If I am not mistaken, a rolling function (grouped by PERMCO) would just take the 36 monthly returns above. But when months are missing, the rolling function would actually cover more than 3 years of data (since the last 36 monthly returns would exceed that timeframe).
Usually I work with MS Excel. However, in this case the amount of data is too big and Excel takes forever to calculate anything. That's why I want to tackle the problem with Python.
The sample is organized as follows:
PERMNO date SHRCD PERMCO PRC RET
When I have figured out how to make a proper table in here I will show you a sample of my data.
What I have tried so far:
data["RET"] = data["RET"].replace(["C", "B"], np.nan)
data["date"] = pd.to_datetime(data["date"])
data = data.sort_values(["PERMCO", "date"]).reset_index()
L3Yvariance = data.groupby("PERMCO")["RET"].rolling(36).var().reset_index()
Sometimes there are C and B instead of actual returns; that's why the first line is there.
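You can replace the missing values with the mean value. A filled value contributes 0 to the sum of squared deviations, because the variance is calculated after subtracting the mean. Note, however, that the filled values still count towards the number of observations, so the resulting variance comes out somewhat smaller than the variance of the observed returns alone.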
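An alternative that avoids filling anything is to let the rolling window be defined by time rather than by row count, so that missing months simply leave fewer observations in the window. The following is a rough sketch, not a tested CRSP recipe: the function name rolling_variance_3y and the output column var_3y are invented here, and the window of 1096 days is an assumed approximation of 36 months, since pandas time-based rolling windows need a fixed length.

```python
import numpy as np
import pandas as pd


def rolling_variance_3y(data, window="1096D", min_periods=2):
    """Variance of RET over the trailing ~3 years, per PERMCO."""
    out = data.copy()
    # Non-numeric codes such as "C" or "B" become NaN.
    out["RET"] = pd.to_numeric(out["RET"], errors="coerce")
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values(["PERMCO", "date"]).reset_index(drop=True)
    # Time-based window: only rows whose date lies within `window` of the
    # current row are used, however many months are missing in between.
    rolled = (
        out.set_index("date")
           .groupby("PERMCO")["RET"]
           .rolling(window, min_periods=min_periods)
           .var()
    )
    # The result is ordered by PERMCO, then date, matching `out` row by row.
    out["var_3y"] = rolled.to_numpy()
    return out
```

With min_periods you can decide how many returns a window must contain before a variance is reported at all; 2 is only the bare minimum for a sample variance.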

Python: Shift time series so they all match at a given y value

I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got some simple code to extract the data and plot cumulative deaths against time, and I am trying to add functionality.
My aim is something like the graph linked below, with all countries time-shifted to match at a given point (in this case the 5th death). I want to write a general bit of code that shifts countries to match at the 'n'th death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country' shift by ..." terms.
Where ... is a lookup to find the date for the particular 'country' when there were 'n' deaths, and to interpolate fractional dates where appropriate.
I.e. deaths are currently assigned to 00:00 day/month, but the data can be shifted by a fraction of a day (2/3 of a day in the example below).
datetime cumulative deaths
00:00 15/02 80
00:00 16/02 110
For n = 100, my '...' should give 16:00 15/02.
I'm working on this right now but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially despite copious googling I can't seem to find a simple way of automatically shifting a bunch of timeseries to match at a particular y value, which feels like it should have some built-in functionality, i.e. a Lookup with interpolation.
import pandas as pd

# Live URL (I've downloaded my own CSV and been calling that for code development)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)
# extract relevant columns
data = dataraw.loc[:, ["dateRep", "countriesAndTerritories", "deaths"]]
# convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'], dayfirst=True)
# sort by date
data = data.sort_values(["dateRep"], ascending=True)
# cumulative deaths per country (select the deaths column before cumsum)
data['cumdeaths'] = data.groupby('countriesAndTerritories')['deaths'].cumsum()
# limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x: x['cumdeaths'].max() > 500)
# remove China from data for now as it doesn't match so well with dates
data = data[data['countriesAndTerritories'] != "China"]
# only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can group by country (groupby('countriesAndTerritories') in your data) and use the groupby transform method to add a column that holds, for every row, the date on which its country reached the nth death.
Then you can do a vectorized subtraction of the date column and the new column to get the number of days since that date.
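A minimal sketch of that idea, assuming the column names from the question (countriesAndTerritories, dateRep, cumdeaths). The function names and the new columns date_n and days_since_n are made up for illustration. The second helper is an optional extra for the fractional dates mentioned in the question; it uses np.interp and assumes the cumulative deaths never decrease.

```python
import numpy as np
import pandas as pd


def add_days_since_nth_death(data, n):
    """Add whole days elapsed since each country first reached n deaths."""
    out = data.sort_values(["countriesAndTerritories", "dateRep"]).copy()
    # Keep the date only on rows where the threshold is already reached.
    reached = out["dateRep"].where(out["cumdeaths"] >= n)
    # Earliest such date per country, repeated on every row of that country.
    out["date_n"] = reached.groupby(out["countriesAndTerritories"]).transform("min")
    out["days_since_n"] = (out["dateRep"] - out["date_n"]).dt.days
    return out


def interpolated_crossing(dates, cumdeaths, n):
    """Linearly interpolate the moment one country's count reaches n."""
    origin = dates.iloc[0]
    elapsed_days = ((dates - origin) / pd.Timedelta(days=1)).to_numpy()
    # np.interp clamps to the first/last date if n lies outside the data.
    crossing = np.interp(n, cumdeaths.to_numpy(), elapsed_days)
    return origin + pd.to_timedelta(float(crossing), unit="D")
```

Plotting cumdeaths against days_since_n per country then lines all curves up at day 0. Countries that never reach n get NaT in date_n and NaN in days_since_n.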

Python Correlation Analysis of 2 data sets

I am currently trying to put the following time series into one plot, with adjusted scales, to examine whether they correlate or not.
raw = pd.read_csv('C:/Users/Jens/Documents/Studium/Master/Big Data Analytics/Lecture 5/tr_eikon_eod_data.csv', index_col=0, parse_dates=True)
data = raw[['.SPX', '.VIX']].dropna()
data.tail()
data.plot(subplots=True, figsize=(10,6));
Does anyone have an idea how to do that?
Also, would it be possible to do the same with data from two different data sets? I would like to compare a house price index with a stock index. I have the daily closing prices for the stock index and just one value per quarter for the house prices (10 years), i.e. closing price against house price index.
I don't really know where to start.
Thank you!:)
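For two series at the same frequency, pandas' plot accepts a secondary_y argument (e.g. data.plot(secondary_y='.VIX')), which puts the second series on its own axis in the same plot. For the daily-versus-quarterly case, one possible starting point is to bring both series to quarterly frequency before comparing them. The sketch below assumes both inputs are pandas Series with a DatetimeIndex and takes the last daily close of each quarter, which is just one of several reasonable choices; the function name align_quarterly and the column names stock and house are hypothetical.

```python
import numpy as np
import pandas as pd


def align_quarterly(stock_daily, house_quarterly):
    """Put a daily series and a quarterly series on a common quarterly index."""
    # Last available daily close within each calendar quarter.
    stock_q = stock_daily.groupby(stock_daily.index.to_period("Q")).last()
    house_q = house_quarterly.copy()
    # Label each house-price value by the quarter its date falls in.
    house_q.index = house_q.index.to_period("Q")
    # Keep only quarters present in both series.
    return pd.concat({"stock": stock_q, "house": house_q}, axis=1).dropna()
```

The aligned frame can then be plotted with a secondary axis, or the correlation can be computed with combined["stock"].corr(combined["house"]). With only about 40 quarterly points over 10 years, any correlation figure should be read with caution.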

Getting values from csv file in python, large dataset

I have a CSV file with stock values for 500 companies over 5 years (2013-2017). The columns I have are: date, open, high, low, close, volume and name. I would like to compare these companies to see which 20 of them performed best. I was thinking about just using the mean, but since the stock values at the first date collected (January 2013) differ (some start at 30 USD, others at 130 USD), it's hard to really compare which ones have done best during these 5 years. I would therefore like to use each company's value on the first date as the zero-point. Basically, I want to subtract the close value of the first date from all the later values.
My problem is that, firstly, I have a hard time getting the first date's close value. Somehow I want to write something like "data.loc(data['close']).iloc(0)". But since it's a dataframe, I can't find the value of a row, nor iterate through the dataframe.
Secondly, I'm not sure how to differentiate between the companies. I want to apply the zero-point procedure to each of these 500 companies, so somehow I need to know when to start over.
The code I have now is
import pandas as pd

def main():
    data = pd.read_csv('./all_stocks_5yr.csv', usecols=['date', 'close', 'Name'])
    comp_name = sorted(set(data.Name))
    number_of = len(comp_name)
    comp_mean = []
    for i in comp_name:
        # rows belonging to company i
        frames = data.loc[data['Name'] == i]
        comp_mean.append([i, frames['close'].mean()])
    print(comp_mean)
But this only gives me the mean, without using the zero-point.
Another idea I had was to just compare the closing price of the first value (January 1, 2013) with that of the last value (December 31, 2017) to see how much the stock has increased or decreased. What I'm not sure about here is how to get the close values on these dates for every one of the 500 companies.
Do you have any recommendations for either of the methods?
Thank you in advance
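Both ideas can be expressed with groupby on the Name column, which also takes care of "knowing when to start over" for each company. This is only a sketch under the column names used in the code above (date, close, Name); the function name rank_by_total_return and the new columns close_from_start and return_from_start are made up for illustration. Dividing by the first close instead of only subtracting it is an added assumption, made so that a 30 USD stock and a 130 USD stock become comparable.

```python
import pandas as pd


def rank_by_total_return(data, top=20):
    """Rebase each company's close to its first date and rank by total return."""
    ordered = data.copy()
    ordered["date"] = pd.to_datetime(ordered["date"])
    ordered = ordered.sort_values(["Name", "date"])
    # First close of each company, repeated on every row of that company.
    first_close = ordered.groupby("Name")["close"].transform("first")
    # Zero-point version: price change relative to the first date.
    ordered["close_from_start"] = ordered["close"] - first_close
    # Relative version: fractional change since the first date.
    ordered["return_from_start"] = ordered["close"] / first_close - 1
    # The value on the last date is the total return over the period.
    total = ordered.groupby("Name")["return_from_start"].last()
    return ordered, total.sort_values(ascending=False).head(top)
```

The second return value is a Series indexed by company name, so its index gives the names of the best performers directly.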
