Python Correlation Analysis of 2 data sets

I'm currently trying to put the following two time series into one plot, with adjusted scaling, to examine whether they correlate.
raw = pd.read_csv('C:/Users/Jens/Documents/Studium/Master/Big Data Analytics/Lecture 5/tr_eikon_eod_data.csv', index_col=0, parse_dates=True)
data = raw[['.SPX', '.VIX']].dropna()
data.tail()
data.plot(subplots=True, figsize=(10,6));
Does anyone have an idea how to do that?
Also, would it be possible to do the same with data from two different data sets? I'd like to compare a house price index with a stock index. I have the daily closing prices for the stock index, but only one value per quarter for the house prices (10 years), i.e. closing price against house price index.
I don't really know where to start.
Thank you! :)
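Not part of the original question, but a minimal sketch of one way to do this, assuming the same CSV: plot both series on one figure with a secondary y-axis so each keeps its own scale, and compute their correlation. The last lines only hint at the second part of the question; house_prices is a hypothetical quarterly Series.

import pandas as pd

raw = pd.read_csv('tr_eikon_eod_data.csv', index_col=0, parse_dates=True)  # shortened path
data = raw[['.SPX', '.VIX']].dropna()

# one figure, two y-axes, so each series keeps its own scale
ax = data['.SPX'].plot(figsize=(10, 6), label='.SPX')
data['.VIX'].plot(ax=ax, secondary_y=True, label='.VIX')

# a single correlation coefficient for the two daily series
print(data['.SPX'].corr(data['.VIX']))

# for the quarterly house price index: bring the daily closes down to one
# value per quarter before joining (house_prices is a hypothetical Series
# indexed by quarter-end dates)
quarterly_close = data['.SPX'].resample('Q').last()
# combined = pd.concat([quarterly_close, house_prices], axis=1).dropna()
# print(combined.corr())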

Related

How do I stop Pandas from continuing to put the same data on the same plot?

My first post, so I hope I do this correctly!
This is admittedly an OOP modification of something on DataCamp. I have two objects that contain Pandas dataframes. The first (StockData) holds stock data for every trading day of 2016 for both Amazon and Facebook. The second (BenchmarkData) holds the S&P 500 closing values for every trading day of 2016. For both, I want to calculate the percent change (StockReturns and BenchmarkReturns, respectively) and then plot them. I want both of the StockReturns on the same plot, but the BenchmarkReturns (which is a Series rather than a dataframe, for reasons irrelevant to this part of the code) on a separate plot.
For my function, I've added a flag as input to tell the program whether the object contains a stock dataframe or a benchmark dataframe, and I call the function twice during runtime, once for the stock data and once for the benchmark data. However, no matter what I do, Pandas plots all three on the same plot. How do I separate the benchmark data and get it on its own plot?
def _CalculatePercentChange(self, IsStockData):
    if IsStockData:
        self.__StockReturns = self.__StockData.GetData().pct_change()
        self.__StockReturns.plot(title='Daily Percent Change')
    else:
        self.__BenchmarkReturnsDataFrame = self.__BenchmarkData.GetData().pct_change()
        self.__BenchmarkReturns = self.__BenchmarkReturnsDataFrame['S&P 500'].squeeze()
        self.__BenchmarkReturns.plot(title='Daily Percent Change')
Thanks guys.
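Not from the original post, but a minimal sketch of one way to keep the plots apart, assuming the method lives inside the same class: create a fresh matplotlib Axes on each call and hand it to pandas, so each plot is guaranteed its own figure rather than whatever axes happens to be active.

import matplotlib.pyplot as plt

def _CalculatePercentChange(self, IsStockData):
    fig, ax = plt.subplots()  # a brand-new figure and axes for this call
    if IsStockData:
        self.__StockReturns = self.__StockData.GetData().pct_change()
        self.__StockReturns.plot(ax=ax, title='Daily Percent Change')
    else:
        self.__BenchmarkReturnsDataFrame = self.__BenchmarkData.GetData().pct_change()
        self.__BenchmarkReturns = self.__BenchmarkReturnsDataFrame['S&P 500'].squeeze()
        self.__BenchmarkReturns.plot(ax=ax, title='Daily Percent Change')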

Pandas dataframe: summing cell data from a group of rows, storing in a new column

As part of a treatment for a health-related issue, I need to measure my liquid intake (along with some other parameters), registering the amount of liquid every time I drink. I have a dataframe of several months of such registrations.
I want to sum my daily amount in an additional column (in red, image below).
As you may see, I would like to store it in the first row of the slice returned by df.groupby(df['Date']), for all the days.
I tried the following:
df.groupby(df.Date).first()['Total'] = df.groupby(df.Date)['Drank'].fillna(0).sum()
But that doesn't seem to be the way to do it.
Grateful for any advice.
Thanks
Michael
This uses the fact that False == 0: the first row of each date is the one where Date is not equal to Date.shift(), and merge() attaches the per-date sum to that row only.
import numpy as np
import pandas as pd

## construct a sample data set
d = pd.date_range("1-jan-2021", "1-mar-2021", freq="2H")
A = np.random.randint(20, 300, len(d)).astype(float)
A.ravel()[np.random.choice(A.size, A.size // 2, replace=False)] = np.nan
df = pd.DataFrame({"datetime": d, "Drank": A})
df = (df.assign(Date=df.datetime.dt.date, Time=df.datetime.dt.time)
        .drop(columns=["datetime"])
        .loc[:, ["Date", "Time", "Drank"]])
## construction done

# the first row of each day is the one whose Date differs from the previous row's Date
# merge the per-day Total back onto that row only (False == 0 matches row=0)
df.assign(row=df.Date.eq(df.Date.shift())).merge(
    df.groupby("Date", as_index=False).agg(Total=("Drank", "sum")).assign(row=0),
    on=["Date", "row"], how="left"
).drop(columns="row")

Python: Shift time series so they all match at a given y value

I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got a simple code to extract the data and make plots with cumulative deaths against time, and am trying to add functionality.
My aim is something like the graph linked below, with all countries time-shifted so that they match at a given point (in this case, the 5th death). I want to make a general bit of code to shift countries to match at the nth death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country', shift by ..." terms,
where ... is a lookup that finds the date for the particular country when there were n deaths, interpolating fractional dates where appropriate.
I.e. currently deaths are assigned at 00:00 on each day, but the data may need to be shifted by, say, 2/3 of a day, as below.
datetime       cumulative deaths
00:00 15/02    80
00:00 16/02    110
For n = 100, my '...' should give 16:00 15/02 (two thirds of the way from 80 to 110).
I'm working on this right now, but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially, despite copious googling, I can't seem to find a simple way of automatically shifting a bunch of time series so they match at a particular y value, which feels like it should have some built-in functionality, i.e. a lookup with interpolation.
import pandas as pd

#### live url (I've downloaded my own csv and been calling that for code development)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)
#### extract relevant columns
data = dataraw.loc[:, ["dateRep", "countriesAndTerritories", "deaths"]]
#### convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'], dayfirst=True)
#### sort by date
data = data.sort_values(["dateRep"], ascending=True)
#### cumulative deaths per country
data['cumdeaths'] = data.groupby(['countriesAndTerritories']).cumsum()
#### limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x: x['cumdeaths'].max() > 500)
#### remove China from the data for now as it doesn't match so well with dates
data = data.groupby('countriesAndTerritories').filter(lambda x: (x['countriesAndTerritories'] != "China").any())
#### only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can use groupby('countriesAndTerritories') together with transform to add a column that holds, for every row, the date on which its country hit the nth death.
Then you can do a vectorized subtraction of that new column from the date column to get the number of days since the nth death.
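A minimal sketch of that idea (not from the original answer), assuming the data frame built in the question above and matching on whole days rather than interpolating fractional dates:

N = 5  # match countries at the Nth death

# per row: the date only if the country has already reached N deaths, else NaT
hit = data['dateRep'].where(data['cumdeaths'] >= N)
# broadcast the earliest such date back to every row of the same country
data['nth_death_date'] = (data.assign(hit=hit)
                              .groupby('countriesAndTerritories')['hit']
                              .transform('min'))

# vectorized subtraction: days since the Nth death
data['days_since_nth'] = (data['dateRep'] - data['nth_death_date']).dt.days

# one line per country on the shifted time axis
(data.pivot_table(index='days_since_nth',
                  columns='countriesAndTerritories',
                  values='cumdeaths')
     .plot(figsize=(10, 6), logy=True))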

Intraday daily return

I am quite new to Python and I need your help, guys!
my data structure
This is 5-minute intraday data from 2001/01/02 to 2019/12/31. As you can see from the data, column 0 indicates the date and column 2 indicates the price of the stock.
Each day, such as 2001/01/02, has 79 observations.
First of all, I need to create a daily return as a new column. Normally I was dealing with daily data, and the daily log return was as follows:
def lr(x):
    return np.log(x[1:]) - np.log(x[:-1])
How can I create a new column for the daily return from the 5-minute data?
If you load your data into a pandas.DataFrame, you can use df.groupby() and then apply your lr function with minimal changes:
df = pd.read_excel('path/to/your/file.xlsx', header=None,
names=['Index', 'Date', 'Some_var', 'Stock_price'])
The key thing to do, though, will be to decide how you want to generate your daily values from your 5 minute data. I'm no stock expert, but I'd guess you want to use the last value for each day to represent the stock value. If that's the case, you can use
daily_values = df.groupby('Date')['Stock_price'].agg('last')
and then apply your lr function to get the returns
lr(daily_values)
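Building on the answer above (the column names are the same assumptions), here is one way to push the daily log return back onto every 5-minute row as a new column; using diff() on the logged Series sidesteps the index alignment you would otherwise get from subtracting two slices of a pandas Series:

import numpy as np
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx', header=None,
                   names=['Index', 'Date', 'Some_var', 'Stock_price'])

# last 5-minute price of each day stands in for the daily close
daily_values = df.groupby('Date')['Stock_price'].agg('last')

# daily log return: log(close_t) - log(close_{t-1})
daily_log_returns = np.log(daily_values).diff()

# new column: every 5-minute row of a day carries that day's return
df['daily_return'] = df['Date'].map(daily_log_returns)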

Getting values from csv file in python, large dataset

I have a csv file with stock values for 500 companies over 5 years (2013-2017). The columns I have are: date, open, high, low, close, volume and name. I would like to compare these companies to see which 20 of them have performed best. I was thinking about just using the mean, but since the stock values at the first collected date (Jan 2013) differ (some start at 30 USD, others at 130 USD), it's hard to really compare which ones have done best over these 5 years. I would therefore like to use each company's value on the first date as its zero-point. Basically I want to subtract the close value of the first date from the rest of the collected data.
My problem is that, firstly, I have a hard time getting to the first date's close value. Somehow I want to write something like "data.loc(data['close']).iloc(0)". But since it's a dataframe, I can't find the value of a row, nor iterate through the dataframe.
Secondly, I'm not sure how I can differentiate between the companies. I want to apply the zero-point procedure to each of the 500 companies, so somehow I need to know when to start over.
The code I have now is
import pandas as pd

def main():
    data = pd.read_csv('./all_stocks_5yr.csv', usecols=['date', 'close', 'Name'])
    comp_name = sorted(set(data.Name))
    number_of = len(comp_name)
    comp_mean = []
    for i in comp_name:
        # rows belonging to the current company
        frames = data.loc[data['Name'] == i]
        comp_mean.append([i, frames['close'].mean()])
    print(comp_mean)
But this only gives me the mean, without using the zero-point.
Another idea I had was to just compare the closing price from the first date (January 1, 2013) with the price from the last date (December 31, 2017) to see how much each stock has increased or decreased; what I'm not sure about here is how to reach the close values for those dates for every single one of the 500 companies.
Do you have any recommendations for either of these methods?
Thank you in advance
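Not from the original post, but a minimal sketch of the zero-point idea using groupby().transform('first'), assuming the same all_stocks_5yr.csv columns; the relative-change variant at the end is one hypothetical way to rank the top 20 regardless of starting price.

import pandas as pd

data = pd.read_csv('./all_stocks_5yr.csv', usecols=['date', 'close', 'Name'],
                   parse_dates=['date'])
data = data.sort_values(['Name', 'date'])

# zero-point: subtract each company's first close from all of its closes
first_close = data.groupby('Name')['close'].transform('first')
data['zeroed'] = data['close'] - first_close

# relative change makes a 30 USD stock comparable with a 130 USD one
data['rel_change'] = data['close'] / first_close - 1

# total change over the five years per company, top 20
top20 = data.groupby('Name')['rel_change'].last().sort_values(ascending=False).head(20)
print(top20)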
