I am trying to use linear regression on data pulled from yfinance to predict future stock prices, but I am running into trouble after transposing my data's shape.
Here I create a normalization function
def normalize_data(df):
    # df on input should contain only one column with the price data (plus dataframe index)
    min = df.min()
    max = df.max()
    x = df
    # time series normalization part
    # y will be a column in a dataframe
    y = (x - min) / (max - min)
    return y
And another function to pull stock prices from yfinance, which calls the normalization function
def closing_price(ticker):
    #Asset = pd.DataFrame(yf.download(ticker, start=Start, end=End)['Adj Close'])
    Asset = pd.DataFrame(yf.download(ticker, start='2022-07-13', end='2022-09-16')['Adj Close'])
    Asset = normalize_data(Asset)
    return Asset.to_numpy()
I then pull 11 different stocks using the function
MRO = closing_price('MRO')
HES = closing_price('HES')
FANG = closing_price('FANG')
DVN = closing_price('DVN')
PXD = closing_price('PXD')
COP = closing_price('COP')
CVX = closing_price('CVX')
APA = closing_price('APA')
EOG = closing_price('EOG')
HAL = closing_price('HAL')
BLK = closing_price('BLK')
This works so far.
But when I try to merge the first 10 numpy arrays together,
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL])[:, :, 0]
X = np.transpose(X)
it gives me the following error for the first line, where I merge the numpy arrays:
<ipython-input-53-a30faf3e4390>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Have you tried passing dtype=object, as suggested by your error message?
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL], dtype=object)[:, :, 0]
Alternatively, what are you trying to do with the data afterwards: run a linear regression? Does it have to be a NumPy array? Working with data is often a lot easier with a pandas.DataFrame, and basically every machine learning library you might want to use, such as sklearn or statsmodels, has pandas support.
To create one big dataset out of these you could try the following:
data = pd.DataFrame()  # empty dataframe that will collect all tickers
# assumes closing_price returns the DataFrame itself, i.e. without the .to_numpy() call
tickers = {'MRO': MRO, 'HES': HES, 'FANG': FANG, 'DVN': DVN, 'PXD': PXD,
           'COP': COP, 'CVX': CVX, 'APA': APA, 'EOG': EOG, 'HAL': HAL, 'BLK': BLK}
for name, df in tickers.items():
    # every column is just labelled "Adj Close", and you can't give multiple columns the same name,
    # so prefix each with its ticker: columns become "MRO_Adj Close", "HES_Adj Close", etc.
    df = df.rename(columns={col: name + "_" + str(col) for col in df.columns})
    data = pd.concat([data, df], axis=1)
Additionally, this neatly avoids the problems that can arise when different stock tickers have or lack different dates in their datasets, as was correctly pointed out by Kevin Choon Liang Yew in the comments above.
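From there, fitting the regression with sklearn is straightforward. A minimal sketch, assuming the "TICKER_Adj Close" column names produced above and treating BLK as the target to predict (both assumptions; adjust to your actual setup):

from sklearn.linear_model import LinearRegression

feature_cols = [c for c in data.columns if not c.startswith('BLK')]  # the ten predictor stocks
target_col = 'BLK_Adj Close'                                         # hypothetical target column name

clean = data.dropna()  # drop dates that are missing for any ticker
model = LinearRegression()
model.fit(clean[feature_cols], clean[target_col])
print(model.coef_, model.intercept_)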
Related
I have a dataset of 6 parameters with 500 values each, and I want to combine two of the datasets to get the road curvature, but I am getting an error. Since I am new to Python, I am not sure whether my logic is correct. Please guide me.
from asammdf import MDF
import pandas as pd
import matplotlib.pyplot as plt

mdf = MDF('./Data.mf4')
c = ['Vhcl.Yaw', 'Vhcl.a', 'Car.Road.tx', 'Car.Road.ty', 'Vhcl.v', 'Car.Width']
m = mdf.to_dataframe(channels=c, raster=0.02)
for i in range(0, 500):
    mm = m.iloc[i].values
    y = pd.concat([mm[2], mm[3]])
    plt.plot(y)
    plt.show()
print(y)
Error:
TypeError: cannot concatenate object of type '<class 'numpy.float64'>'; only Series and DataFrame objs are valid
Starting from your dataframe m
y = m.iloc[:, 1:3]
This will create another dataframe: the : in the first position selects all rows, and 1:3 selects only the second and third channels.
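If the end goal is the road geometry itself, it may be easier to select the two road-coordinate channels by name and plot one against the other. A small sketch, assuming the channel names from your c list:

import matplotlib.pyplot as plt

xy = m[['Car.Road.tx', 'Car.Road.ty']]          # the two road-coordinate channels, selected by name
plt.plot(xy['Car.Road.tx'], xy['Car.Road.ty'])  # plot ty against tx to see the road path
plt.xlabel('Car.Road.tx')
plt.ylabel('Car.Road.ty')
plt.show()

Selecting by name also avoids having to keep track of the column order that iloc relies on.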
I'm trying to apply SMOTE to a dataframe full of sliding windows, shown here:
(screenshot: DataFrame of sliding windows)
I'm using imblearn's SMOTE() function on it. Without any manipulation, I get an error saying that each cell must hold a size-1 array. Applying SMOTE row by row, or exploding the dataframe and applying SMOTE to each window (same index), results in a ValueError because there is only one class in each SMOTE attempt. How do I get around this and SMOTE an entire sliding window, without aggregating the windows or running into dimensional errors, while keeping the dataframe in the shape shown in the first picture?
new_df_labels = X_with_labels.reset_index().apply(pd.Series.explode)
new_df = X_smoted.reset_index().apply(pd.Series.explode)
np.unique(new_df.index)

X_list = pd.DataFrame(columns=X_smoted.columns)
y_list = []
for j in np.unique(new_df.index):
    new_df1 = new_df[new_df.index == j]
    new_df_labels1 = new_df_labels[new_df_labels.index == j]
    X_smoted_1, y_smoted_1 = smot.fit_resample(new_df1, new_df_labels1['Activity'])
    X_list = X_list.append(X_smoted_1)
    y_list.append(y_smoted_1.ravel())
(screenshot: the exploded DataFrame)
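For context, SMOTE expects a plain 2-D numeric matrix (one scalar per cell) plus one label per row, which is why the array-valued cells fail. One workaround sometimes used is to flatten each window into a single feature row before resampling; a minimal sketch with made-up shapes:

import numpy as np
from imblearn.over_sampling import SMOTE

# hypothetical data: 100 windows, each 50 timesteps x 3 channels, one label per window
windows = np.random.rand(100, 50, 3)
labels = np.random.randint(0, 2, size=100)

X_flat = windows.reshape(len(windows), -1)   # (100, 150): one row per window
X_res, y_res = SMOTE().fit_resample(X_flat, labels)
X_res = X_res.reshape(-1, 50, 3)             # back to window shape if needed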
I have two three-dimensional arrays, a and b, with dimensions [time, lat, lon]. I want to correlate the time series of each grid cell, i.e. correlate(a[:,0,0], b[:,0,0]), correlate(a[:,0,1], b[:,0,1]), and so on. I'm aiming for two correlations: one over the entire time series and one restricted by a threshold on array a.
The datasets also include some missing values in the time series, and I read both datasets in with xarray. Correlations and masking are done using numpy.
At the moment I loop over each latitude and longitude, grab the time series, mask them to account for NaNs and the threshold, and correlate them. My code looks like this:
def correlate(A, B, var1, var2, TH):
    name = "corr_" + var1 + "_" + var2 + "_TH_" + str(TH) + ".nc"
    a = xr.open_dataset(A).sel(time=slice('1950-03', '2013-12'))
    b = xr.open_dataset(B).sel(time=slice('1950-03', '2013-12'))
    corr = np.empty([a[var1].shape[1], a[var1].shape[2]], dtype=float)
    corr_TH = np.empty_like(corr)  # note: corr_TH = corr would only alias the same array
    varname_TH = "r_TH_" + str(TH)
    for lt in range(corr.shape[0]):
        for ln in range(corr.shape[1]):
            corr[lt, ln] = np.ma.corrcoef(a[var1][:, lt, ln], b[var2][:, lt, ln], rowvar=True)[0, 1]
            corr_TH[lt, ln] = np.ma.corrcoef(np.ma.masked_greater(a[var1][:, lt, ln], TH), b[var2][:, lt, ln], rowvar=True)[0, 1]
    # save whole correlations
    ds = xr.Dataset({'r': (['lat', 'lon'], corr), varname_TH: (['lat', 'lon'], corr_TH)},
                    coords={'lon': a['lon'], 'lat': a['lat']})
    return ds
This works in general but is super slow. I found the Xarray function array.stack() to flatten the arrays and tried something like:
A_stack = A.var1.stack(z=('lat','lon'))
B_stack = B.var2.stack(z=('lat','lon'))
cov = ((A_stack - A_stack.mean(axis=0))* (B_stack - B_stack.mean(axis=0))).mean(axis=0)
corr = cov / (A_stack.std(axis=0) * B_stack.std(axis=0))
The multi-index 'z' over which the array is stacked is retained through the process; however, the correlation array at the end is empty. I suppose that's because of the NaNs.
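As an aside, newer xarray releases provide xarray.corr, which computes a NaN-aware Pearson correlation along a given dimension; a minimal sketch, reusing the names from the function above and mirroring the np.ma.masked_greater threshold mask:

corr = xr.corr(a[var1], b[var2], dim='time')                           # one value per lat/lon cell, NaNs skipped
corr_TH = xr.corr(a[var1].where(a[var1] <= TH), b[var2], dim='time')   # same, with values above TH masked out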
Does anyone have an idea of how to do this?
Thanks!
Data: Here
Question:
I have several data sheets which I export to Python as dataframes. I want to perform multiplications across these dataframes, generating another dataframe that either has the same dimensions as the input dataframes or an augmented dimension (i.e. a larger index) based on the combination of dataframes used. However, I have stumbled upon some issues for which I could not find a solution. Below is the code.
Code:
#---------------------------------------------------------------------------------------------------
#Load the pandas library
#---------------------------------------------------------------------------------------------------
import numpy as np
import pandas as pd
#---------------------------------------------------------------------------------------------------
#Load the dataframes
#---------------------------------------------------------------------------------------------------
##Supply at the gridcell level (in Pj per year)
biosup = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biosup', skiprows = 5, index_col = 0, usecols = 'A:K')
##Cost at the gridcell level (in MEUR per Pj)
biocost = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biocost', skiprows = 5, index_col = 0, usecols = 'A:K')
##Demand at the gridcell level (in Pj per year)
biodem = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biodem', skiprows = 5, index_col = [0,1], usecols = 'A:L')
##Inter-gridcell distance matrix (in km)
dist = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'distance', skiprows = 5, index_col = 0, usecols = 'A:AE')
#---------------------------------------------------------------------------------------------------
#Definition of model parameter
#---------------------------------------------------------------------------------------------------
##Power parameter for the distance-decay component (gamma)
gamma = pd.DataFrame({'sim1':[1.06],'sim2':[1.59],'sim3':[2.12]})
gamma = gamma.transpose()
gamma.columns = ['val']
##Inter-gridcell distance range for the supply curve determination (dmaxsup in km)
dmaxsup = pd.DataFrame({'dsup1':[390],'dsup2':[770],'dsup3':[1050]})
dmaxsup = dmaxsup.transpose()
dmaxsup.columns = ['dmax']
##Inter-gridcell distance range for the distance-decay (dmaxdem in km)
dmaxdem = pd.DataFrame({'ddem1':[750],'ddem2':[1000]})
dmaxdem = dmaxdem.transpose()
dmaxdem.columns = ['dmax']
#---------------------------------------------------------------------------------------------------
#New parameter calculation
#---------------------------------------------------------------------------------------------------
##The ratio of the inter-gridcell distance and the dmaxdem
dist1 = pd.DataFrame(np.concatenate(dist.values / dmaxdem.values[:, None]), pd.MultiIndex.from_product([dmaxdem.index, dist.index]), dist.columns)
##The decay coefficients
decay = pd.DataFrame(np.concatenate(2 * (1 / (1 + (np.exp(dist1.values)**gamma.values[:, None])))), pd.MultiIndex.from_product([gamma.index, dist1.index]), dist1.columns)
decay1 = pd.DataFrame(np.concatenate(2 * (1 / (1 + (np.exp(dist.values / dmaxdem.values[:, None])**gamma.values[:, None])))), pd.MultiIndex.from_product([dmaxdem.index, gamma.index, dist.index]), dist.columns)
Comments on the code:
1/ The parameter "dist1" represents the division of the "dist" dataframe by each element of the "dmaxdem" dataframe. Think of the values of the "dmaxdem" dataframe as distance scenarios. In other words, this operation computes the ratio for each of the distance values provided.
2/ I try to compute the distance-decay coefficients, i.e. the "decay" dataframe, as defined by the formula inside the brackets. However, I get the following error message:
NotImplementedError: isna is not defined for MultiIndex
which I believe has something to do with the MultiIndex structure of the "dist1" dataframe. I have also tried a direct approach that embeds the previous operation and requires the use of the 3 different dataframes, as illustrated by the code for "decay1". I get the following error:
ValueError: operands could not be broadcast together with shapes (2,30,30) (3,1,1)
Any help would be appreciated.
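As a side note on the second error: the dist/dmaxdem ratio has shape (2, 30, 30) while gamma contributes shape (3, 1, 1), and 2 and 3 cannot be broadcast against each other. Giving each parameter its own axis lets NumPy combine every gamma with every dmax; a sketch of the idea, assuming the 30x30 distance matrix implied by the error message:

# ratio of every distance to every dmaxdem value: shape (1, 2, 30, 30)
ratio = dist.values[None, None, :, :] / dmaxdem.values.reshape(1, 2, 1, 1)
# apply every gamma to every ratio: shape (3, 2, 30, 30)
decay_arr = 2 / (1 + np.exp(ratio) ** gamma.values.reshape(3, 1, 1, 1))
# flatten back into a MultiIndexed dataframe ordered (gamma, dmaxdem, gridcell)
decay1 = pd.DataFrame(decay_arr.reshape(3 * 2 * len(dist.index), len(dist.columns)),
                      index=pd.MultiIndex.from_product([gamma.index, dmaxdem.index, dist.index]),
                      columns=dist.columns)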
Pardon me if I misunderstood you; I am unable to comment before posting an answer.
Well, if they are all the same length and have the same index, you can start off by concatenating them along axis 0. This will create a larger dataframe. Next, you can add the calculated column or columns that you need:
largerdf = pd.concat([df1, df2, df3, dfn], axis=0)
largerdf["calculationcolumn"] = largerdf["columnvalue1"] * largerdf["columnvalue2"]
Or change the operator to whichever one you need.
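To make that concrete, here is a toy sketch of the pattern with made-up frames and column names; the keys argument is optional but keeps track of which source frame each row came from:

df1 = pd.DataFrame({'supply': [1.0, 2.0], 'cost': [10.0, 20.0]}, index=['cellA', 'cellB'])
df2 = pd.DataFrame({'supply': [3.0, 4.0], 'cost': [30.0, 40.0]}, index=['cellA', 'cellB'])

largerdf = pd.concat([df1, df2], axis=0, keys=['sim1', 'sim2'])   # keys add an outer index level
largerdf['total'] = largerdf['supply'] * largerdf['cost']         # row-wise product within the big frame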
So I have an issue with dask DataFrame.append: I generate a lot of derivative features from the main data and append them to the main dataframe. After that, the dask graph for any set of columns is blown up. Here is a small example:
%pylab inline
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dot import dot_graph
df=pd.DataFrame({'x%s'%i:np.random.rand(20) for i in range(5)})
ddf = dd.from_pandas(df, npartitions=2)
dot_graph(ddf['x0'].dask)
Here is the dask graph, as expected:
g=ddf.assign(y=ddf['x0']+ddf['x1'])
dot_graph(g['x0'].dask)
Here, the graph for the same column is blown up with irrelevant computations:
Imagine I have lots and lots of spawned columns, so the computation graph for any particular column includes irrelevant computations for all the other columns. In my case I have len(ddf['someColumn'].dask) > 100000, so this becomes unusable quickly.
So my question is: can this issue be resolved? Are there any existing means to do this? If not, what direction should I look in to implement this?
Thanks!
Rather than continuously assigning new columns to the dask dataframe, you might want to build several dask series and then concat them all together at the end.
So instead of doing this:
df['x'] = df.w + 1
df['y'] = df.x * 10
df['z'] = df.y ** 2
Do this
x = df.w + 1
y = x * 10
z = y ** 2
df = df.assign(x=x, y=y, z=z)
Or this:
dd.concat([df, x, y, z], axis=1)
This may still result in the same number of tasks in your graph, but it will probably result in fewer memory copies.
Alternatively, if all of your transformations are row-wise, then you can construct a pandas function and map it across all partitions:
def f(part):
    part = part.copy()
    part['x'] = part.w + 1
    part['y'] = part.x * 10
    part['z'] = part.y ** 2
    return part

df = df.map_partitions(f)
Also, while a million-node task graph is less than ideal, it should also be OK. I've seen larger graphs run comfortably.