I'm running a regression between two stocks (y=bank_matrix['EXO.MI'] and x=bank_matrix['LDO.MI']). My task is to update the slope coefficient every 20 days (my lookback). In short, I want a list of slope coefficients starting from day 20. So I run the regression model called reg.
In the meantime, I create:
A) 3 empty lists: Intercetta=[], Hedge=[], Residuals=[]
B) 1 DataFrame called Regressione, where I want to copy the results of my regression (intercept, slope and residuals) into its columns (['Intercetta','Hedge','Residuals']).
Now the whole code:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

tickers=['EXO.MI','LDO.MI']
end=datetime.date.today()
gap=datetime.timedelta(days=650)
start=end-gap
Bank=pdr.get_data_yahoo(tickers,start=start,end=end)
bank_matrix=Bank['Adj Close']
bank_matrix=bank_matrix.dropna()
exor=bank_matrix['EXO.MI']
leonardo=bank_matrix['LDO.MI']
Regressione=pd.DataFrame(data=np.zeros((len(exor),3)),columns=['Intercetta','Hedge','Residuals'],index=bank_matrix['EXO.MI'].index)
lookback=20
Hedge=[]
Intercetta=[]
Residuals=[]
for i in range(lookback,len(exor)):
    reg=LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],bank_matrix[['EXO.MI']][i-lookback+1:i])
    # Regressione.iloc[Regressione[i,'Hedge']]=reg.coef_[0]
    Hedge.append(reg.coef_[0])
    Intercetta.append(reg.intercept_)
    y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
    Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
Regressione=pd.DataFrame(list(zip(Intercetta,Hedge,Residuals)),columns=['Intercetta','Hedge','Residuals'])
Regressione.set_index(bank_matrix[['EXO.MI']].index[lookback:],inplace=True)
NOW THE FINAL QUESTION: why, in my final dataframe Regressione, is the third column ('Residuals') a horizontal array?
So, first of all, I think these two lines are doing something completely wrong:
y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
You basically run a linear regression for the points 1 to 20, then 2 to 21, 3 to 22, etc. But then you apply each of those regressions to the data from observation 20 onward. So you get the model for, e.g., 5 to 24, use it to predict observations 20 till the end, and take the difference between that prediction and the actuals (note that bank_matrix[['EXO.MI']][lookback:].to_numpy() does not change during your for loop).
I suppose what would make more sense here would be:
y_pred=reg.predict(bank_matrix[['LDO.MI']][i-lookback+1:i])
Residuals.append(bank_matrix[['EXO.MI']][i-lookback+1:i].to_numpy()-y_pred)
That way you would take the in-sample error of the model. Or:
y_pred=reg.predict(bank_matrix[['LDO.MI']][i:])
Residuals.append(bank_matrix[['EXO.MI']][i:].to_numpy()-y_pred)
That way you would apply the prediction based on the current time span to the data going forward.
Now, the first option will produce a list of 19 elements per row, while the second will produce a list of 430 elements in the first row, decreasing by 1 per row, down to 1 in the last row. That is because these are residuals: you have one line, with one slope and one hedge per time span, but every observation within that span produces its own residual. So, depending on how you want to express it, you can reduce each window to a single number by applying a further transformation, for example the sum of squared residuals or the mean residual.
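For example, inside the loop you could replace the two residual lines with something like this (a minimal sketch of the in-sample variant; window_X, window_y and resid are my own helper names, not from the question):
window_X = bank_matrix[['LDO.MI']][i-lookback+1:i]
window_y = bank_matrix[['EXO.MI']][i-lookback+1:i].to_numpy()
resid = window_y - reg.predict(window_X)    # one residual per observation in the window
Residuals.append(resid.mean())              # mean residual: a single scalar per window
# Residuals.append((resid ** 2).sum())      # or: sum of squared residuals
That way each row of Regressione holds one number instead of a whole array.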
Hope this helps...
From the doc:
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
You need to use df.loc, for example, to modify the data in your dataframe.
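A sketch of that approach, reusing the names from the question (this is my illustration of the commented-out line, not the asker's code; it writes each window's slope and intercept straight into the pre-allocated Regressione frame):
for i in range(lookback, len(exor)):
    reg = LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],
                                 bank_matrix[['EXO.MI']][i-lookback+1:i])
    idx = bank_matrix.index[i]
    Regressione.loc[idx, 'Hedge'] = reg.coef_[0][0]          # slope of this window
    Regressione.loc[idx, 'Intercetta'] = reg.intercept_[0]   # intercept of this window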
Here is what I got (a time series) in a pandas dataframe:
[screenshot of the dataframe; dates were converted from timestamps]
My goal is to plot not only the observations, but the whole range of dates. I need to see a horizontal line or a gap when there are no new observations.
Dealing with data that is not observed equidistantly in time is a typical challenge with real-world time series data. Given your problem, this code should work:
from datetime import datetime
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
# sample frame
df = pd.DataFrame({'time': ['2022,7,3,0,1,21', '2022,7,3,0,2,47', '2022,7,3,0,2,47',
                            '2022,7,3,0,5,5', '2022,7,3,0,5,5'],
                   'balance': [12.6, 12.54, 12.494426, 12.482481, 12.449206]})
df['time'] = pd.to_datetime(df['time'], format='%Y,%m,%d,%H,%M,%S')
# aggregate time duplicates by mean
df = df.groupby('time').mean()
df.reset_index(inplace=True)
# pick an equidistant time grid (index 2 is the last unique timestamp after grouping)
df_new = pd.DataFrame({'time': pd.date_range(start=df.loc[0]['time'], end=df.loc[2]['time'], freq='S')})
df = pd.merge(left=df_new, right=df, on='time', how='left')
# fill NaN by padding the last observed value forward
df['balance'].fillna(method='pad', inplace=True)
df.set_index('time', inplace=True)
# plot
_ = df.plot(title='Time Series of Balance')
There are several caveats to this solution.
First, your data has a high temporal resolution (seconds), but there are hours-long gaps between observations. You can either coarsen the timestamps by rounding (e.g. to minutes or hours), or keep the second-by-second resolution and accept that most of your balance values will be filled-in values rather than true observations.
Second, you have different balance values for the same timestamp, which indicates faulty entries or misspecified timestamps. I unified those entries by grouping by timestamp and averaging the balance over the non-unique timestamps.
Third, filled-in gaps and true observations have the same visual representation in the plot (blue dots in the graph). As previously mentioned, commenting out the fillna() line would showcase only true observations, leaving everything in between white.
Finally, the missing values are merely filled in via padding. Look up the different values of the method argument in the documentation in case you want to interpolate linearly, etc.
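For example, replacing the fillna line above with linear interpolation (a minimal variant, using the same df as in the script):
# linear interpolation between true observations instead of padding
df['balance'] = df['balance'].interpolate(method='linear')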
Summary
The problems described above are typical for event-driven time series data. Since you are dealing with a (financial) balance, i.e. a state that is only changed by events (orders), I believe the assumptions made above are reasonable and can easily be adjusted for your or many other use cases.
This helped:
data = data.set_index('time').resample('1M').mean()
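A short sketch of why this works (assuming a frame with 'time' and 'balance' columns as above): months without observations produce NaN after the resample, and matplotlib breaks the line wherever the series is NaN, so the gaps stay visible.
# downsample to a monthly grid; empty months become NaN
data = data.set_index('time').resample('1M').mean()
# matplotlib interrupts the line at NaN values
data['balance'].plot(title='Monthly mean balance')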
Is there a numpy or pandas function that does something like this?
For me, the boundary values are the values farthest from the regression line.
That means:
the farthest point above the line and the farthest point below the line.
If I have the data:
l1 = [0,1,4,3,4,3]
df = pd.DataFrame(l1)
It looks like this:
   0
0  0
1  1
2  4
3  3
4  4
5  3
How do I find the data at index 1 and index 4? I need to detect them from a Python script and remove them. I know how to remove them, but I do not know how to find them.
What I want to do:
First I calculate a linear regression, then I remove the outlier values, and then I recalculate the linear regression once more without the farthest values.
To remove outliers, you can use Series.quantile:
Suppose the following dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2022)
df = pd.DataFrame({'A': np.random.normal(5, 2, size=50)})
df.plot.hist(bins=25)
plt.xlim(0, 10)
plt.show()
Now filter your dataframe:
df1 = df.loc[df['A'].between(*df['A'].quantile([0.25, 0.75]).values)]
df1.plot.hist(bins=10)
plt.xlim(0, 10)
plt.show()
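If you instead want the literal procedure from the question (drop only the single farthest point above and the single farthest point below the regression line, then refit), here is a sketch using np.polyfit with the index as x; the column name 'y' is my own:
import numpy as np
import pandas as pd

l1 = [0, 1, 4, 3, 4, 3]
df = pd.DataFrame(l1, columns=['y'])

x = df.index.to_numpy()
slope, intercept = np.polyfit(x, df['y'], 1)   # first regression
resid = df['y'] - (slope * x + intercept)      # signed distance from the line

i_above = resid.idxmax()   # farthest point above the line
i_below = resid.idxmin()   # farthest point below the line

df2 = df.drop([i_above, i_below])
slope2, intercept2 = np.polyfit(df2.index.to_numpy(), df2['y'], 1)  # refit without them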
I'm working with a dataframe in which each row of one column contains a np.array (in this case representing the mean waveform of brain recordings through time). I want to calculate the Pearson correlation of this column (array by array).
This is my code:
from scipy import stats
import numpy as np

length = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
    Mean.append(df.Mean[i])

Correlation_p = np.zeros((length, length))
P_Value_p = np.zeros((length, length))
for i in range(length):
    for j in range(length):
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more pythonic way to do it, maybe using df.corr(). I tried, but could not work out how to do it.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
The arrays that you would like to correlate seem to be in single cells of the DataFrame, if I am not mistaken. The following brings them into a format where each array occupies its own column.
I made a data example that resembles the format of df.Mean.head():
df = pd.DataFrame({'x':[np.random.randint(0,5,10), np.random.randint(0,5,10), np.random.randint(0,5,10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt this conversion (the tolist and transpose steps) according to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
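Alternatively, assuming each cell of df.Mean holds a 1-D array of the same length, numpy can compute the whole Pearson matrix in one call (a sketch, not the answerer's code):
import numpy as np

# stack the per-row arrays into one 2-D matrix: one row per waveform
M = np.stack(df['Mean'].tolist())

# np.corrcoef treats each row as a variable, giving the full
# len(df) x len(df) Pearson correlation matrix
Correlation_p = np.corrcoef(M)
Note that this reproduces only the correlation matrix; for the p-values you would still need scipy.stats.pearsonr.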
I have a situation with a bunch of data files. Each file has a number of samples per time frame that depends on the system: at time t=1, for instance, one file might have 10 items, another 20, and later times in the same file always have that same number of items. The format is time, x, y, z in columns, loaded into a numpy array. The time values show which frame a row belongs to; taking 10 samples per frame as an example, each frame is a (10,4) block, and with, say, 100 frames in the file I really have a (1000,4) array.

I want to plot the data with time on the x-axis and manipulations of the other data on the y-axis, but I am unsure how to do this with the line-plot methods in matplotlib. Normally, to provide both x and y values, I believe I need a scatter plot, so I'm hoping there's a better way. What I ideally want is to treat each row that shares a time code as a different series (so it is coloured differently), and the row with the same line number in the next frame (time value) gets the same colour, giving those good contiguous lines. We can look at the time column and figure out how many items share a time code; let's call it "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
Seaborn's scatterplot can categorize the data into groups that share the same code (the time code in this case) and assign the same color to each group.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
                 names=["time", "x", "y", "z", "code"])  # use your file
df["time"] = pd.to_datetime(df["time"])  # parse the column as datetimes
df["x"] = df["time"].dt.day  # reduced the data to "date only" for the x column; easier to see on the graph
# just used random numbers for y and z in my data
sns.scatterplot(x="x", y="y", data=df, hue="code")  # hue does the grouping
plt.show()
I used a CSV file here, but you can read your text file as well by adding sep="\t" to the arguments. I also added a code column to the file. If you have it, the code can group the data in the graph, so you don't have to separate anything or build a hierarchical index. If you want to change the colors or the grouping, please see the seaborn website.
Hope this helps.
Alternatively, here is the method I used (Tim's answer is accurate as well). Since the time codes are not date/time information, I modified my own data to add tags as a second column I call "p" (they're polymers).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data=datain, columns=["t", "p", "x", "y", "z"])
ax = sns.scatterplot(x="t", y="x", data=df, hue="p")
plt.show()
And of course the other columns can be plotted similarly if desired.
Pandas docs have this to say about the qcut function:
Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
So I would expect this code to give me 4 bins of 10 values each:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
But instead I get this:
Quartiles:
1st 14
2nd 6
3rd 11
4th 9
dtype: int64
What am I doing wrong here?
The reason this happens is that qcut cannot split ties: records sharing the same value that straddles a quantile boundary must all land in the same bin, so the bins come out unequal. A simple adjustment to your code will produce the desired result:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y.rank(method = 'first'), 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
By ranking the values first with rank(), ties are broken in their order of appearance, so every record gets a unique rank and python has a clear rule for records that would otherwise cut across multiple bins. In this case, I've used method='first' as the argument of the rank() function.
The output I get is as follows:
Quartiles:
1st 10
2nd 10
3rd 10
4th 10
dtype: int64
Looking at the boundaries of the bins highlights the problem stated in the comments:
boundaries = [1, 2, 3.5, 6, 9]
These boundaries are correct. pandas first creates the quantile values (inside qcut), and only afterwards puts the samples into the bins. The run of 2s overlaps the boundary of the first quartile.
The reason for the third value, 3.5, is that the value below the threshold is a 3 and the value above it is a 4: the quantile function of pandas places the boundary halfway between the two neighboring values.
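If you want to reproduce these edges yourself, qcut can return them directly (a quick check, reusing y from the question):
# retbins=True additionally returns the bin edges
_, edges = pd.qcut(y, 4, retbins=True)
print(edges)  # should match the boundaries listed above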
Concluding: a concept like quantiles gets more and more appropriate the larger the number of samples, since more values are then available for fixing the boundaries.