I have a dataset of 6 parameters with 500 values each, and I want to combine two of them to get the road curvature, but I am getting an error. Since I am new to Python, I am not sure whether my logic is correct. Please guide me.
from asammdf import MDF
import pandas as pd
import matplotlib.pyplot as plt

mdf = MDF('./Data.mf4')
c = ['Vhcl.Yaw', 'Vhcl.a', 'Car.Road.tx', 'Car.Road.ty', 'Vhcl.v', 'Car.Width']
m = mdf.to_dataframe(channels=c, raster=0.02)
for i in range(0, 500):
    mm = m.iloc[i].values
    y = pd.concat([mm[2], mm[3]])  # this line raises the TypeError shown below
plt.plot(y)
plt.show()
print(y)
Error:
TypeError: cannot concatenate object of type '<class 'numpy.float64'>'; only Series and DataFrame objs are valid
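Starting from your dataframe m, select both columns at once instead of looping over the rows:
y = m.iloc[:, 2:4]
This creates another dataframe that keeps every row but only the columns at positions 2 and 3 (Car.Road.tx and Car.Road.ty), the same values your loop reads as mm[2] and mm[3]. The end of an iloc slice is exclusive.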
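As a minimal sketch of this column selection, assuming a made-up dataframe in place of the one returned by mdf.to_dataframe (the random values are purely illustrative), the two road channels can be picked by name or by position:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the dataframe returned by mdf.to_dataframe(...)
channels = ['Vhcl.Yaw', 'Vhcl.a', 'Car.Road.tx', 'Car.Road.ty', 'Vhcl.v', 'Car.Width']
rng = np.random.default_rng(0)
m = pd.DataFrame(rng.normal(size=(500, 6)), columns=channels)

# Select both road channels for every row at once, by name ...
road = m[['Car.Road.tx', 'Car.Road.ty']]

# ... or by position, which only works if the column order matches the channel list
road_by_position = m.iloc[:, 2:4]

print(road.shape)                      # (500, 2)
print(road.equals(road_by_position))   # True
```

Selecting by name is usually the safer choice, since it does not depend on the order in which the channels end up in the dataframe.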
I am trying to use linear regression using data pulled from yfinance to predict future stock prices, but I am having trouble using linear regression after transposing my data's shape.
Here I create a normalization function
def normalize_data(df):
    # df on input should contain only one column with the price data (plus dataframe index)
    min = df.min()
    max = df.max()
    x = df
    # time series normalization part
    # y will be a column in a dataframe
    y = (x - min) / (max - min)
    return y
And another function that pulls stock prices from yfinance and calls the normalization function
def closing_price(ticker):
    #Asset = pd.DataFrame(yf.download(ticker, start=Start,end=End)['Adj Close'])
    Asset = pd.DataFrame(yf.download(ticker, start='2022-07-13',end='2022-09-16')['Adj Close'])
    Asset = normalize_data(Asset)
    return Asset.to_numpy()
I then pull 11 different stocks using the function
MRO= closing_price('MRO')
HES= closing_price('HES')
FANG= closing_price('FANG')
DVN= closing_price('DVN')
PXD= closing_price('PXD')
COP= closing_price('COP')
CVX= closing_price('CVX')
APA= closing_price('APA')
EOG= closing_price('EOG')
HAL= closing_price('HAL')
BLK = closing_price('BLK')
This works so far.
But when I try to merge the first 10 NumPy arrays together,
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL])[:, :, 0]
X = np.transpose(X)
the first line, where I merge the arrays, gives me this warning:
<ipython-input-53-a30faf3e4390>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
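The warning appears because the arrays do not all have the same length, most likely because some tickers are missing rows for certain dates. Have you tried passing dtype=object, as the warning message suggests?
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL], dtype=object)
Note that this only silences the warning: the result is a one-dimensional array of arrays, so the [:, :, 0] slice and the transpose will not work until every array has the same length.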
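To see why unequal lengths matter, here is a small sketch with made-up arrays (a, b and c are hypothetical stand-ins for the per-ticker price arrays). It checks the lengths first, shows what dtype=object produces for a ragged input, and stacks only the arrays whose lengths agree:

```python
import numpy as np

# Hypothetical stand-ins for three per-ticker price arrays
a = np.array([0.0, 0.5, 1.0])
b = np.array([0.2, 1.0])          # one observation short
c = np.array([1.0, 0.3, 0.0])

# Checking the lengths up front reveals the ragged input
lengths = {name: len(arr) for name, arr in {'a': a, 'b': b, 'c': c}.items()}
is_ragged = len(set(lengths.values())) > 1

# With dtype=object NumPy stores the arrays as elements of a 1-D array
ragged = np.array([a, b], dtype=object)
print(ragged.shape)               # (2,)

# Arrays of equal length stack into a regular 2-D array, one column each
X = np.column_stack([a, c])
print(X.shape)                    # (3, 2)
```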
Alternatively, what are you trying to do with your data afterwards, run a linear regression? Does the data have to be a NumPy array? Working with data is often a lot easier with pandas.DataFrame, and machine learning libraries such as sklearn and statsmodels accept pandas objects as input.
To create one big dataset out of these, you could try the following. It assumes closing_price returns the normalized DataFrame itself rather than calling to_numpy() on it:
# map each ticker symbol to its DataFrame
frames_by_ticker = {'MRO': MRO, 'HES': HES, 'FANG': FANG, 'DVN': DVN, 'PXD': PXD, 'COP': COP,
                    'CVX': CVX, 'APA': APA, 'EOG': EOG, 'HAL': HAL, 'BLK': BLK}
renamed = []
for ticker, frame in frames_by_ticker.items():
    # every frame's column is labelled "Adj Close", and column names should be unique,
    # so prefix each one with its ticker: "MRO_Adj Close", "HES_Adj Close", etc.
    renamed.append(frame.add_prefix(ticker + "_"))
data = pd.concat(renamed, axis=1)  # rows are aligned on the date index
Additionally, this neatly avoids the problems that arise when some tickers have dates that others lack, as Kevin Choon Liang Yew correctly pointed out in the comments above.
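The date alignment can be illustrated with a short sketch. The ticker names AAA and BBB and their values are invented for the example; the point is only that pd.concat with axis=1 matches rows by index:

```python
import pandas as pd

# Hypothetical normalized closing prices; 'BBB' has no value for 2022-07-14
aaa = pd.Series([0.0, 0.5, 1.0],
                index=pd.to_datetime(['2022-07-13', '2022-07-14', '2022-07-15']),
                name='AAA')
bbb = pd.Series([0.2, 1.0],
                index=pd.to_datetime(['2022-07-13', '2022-07-15']),
                name='BBB')

# axis=1 puts each series in its own column and aligns rows on the date index
data = pd.concat([aaa, bbb], axis=1)

# A date missing from one ticker shows up as NaN instead of shifting the rows
print(data)

# Keep only the dates that every ticker has before fitting a model
complete = data.dropna()
```

Whether to drop or fill the incomplete dates depends on the analysis; dropping them is simply the most conservative option for a regression.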
I am trying to produce a linear regression plot for the following data
import pandas
from pandas import DataFrame
import matplotlib.pyplot
data = pandas.read_csv('cost_revenue_clean.csv')
data.describe()
X = DataFrame(data,columns=['production_budget_usd'])
y = DataFrame(data,columns=['worldwide_gross_usd'])
When I try to plot it
matplotlib.pyplot.scatter(X,y)
matplotlib.pyplot.show()
the plot is completely empty, and when I print the type of each element of X
for element in X:
    print(type(element))
it shows that the type is string. Where am I going wrong?
No need to make new DataFrames for X and y. Try astype(float) if you want them as numeric:
X = data['production_budget_usd'].astype(float)
y = data['worldwide_gross_usd'].astype(float)
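A brief sketch of what the type check in the question actually inspects, using made-up values that were stored as text: iterating over a DataFrame yields its column labels, so the loop prints str regardless of what the column holds. The dtypes attribute is the more direct check.

```python
import pandas as pd

# Made-up data in which the numbers were read in as text
data = pd.DataFrame({'production_budget_usd': ['1000', '2500', '400'],
                     'worldwide_gross_usd': ['3000', '2000', '900']})

# Iterating over a DataFrame yields its column labels, not its values
labels = [element for element in data]
print(labels)

# dtypes reports how the values themselves are stored
print(data.dtypes)

# Convert to numbers before plotting or fitting
X = data['production_budget_usd'].astype(float)
y = data['worldwide_gross_usd'].astype(float)
print(X.dtype, y.dtype)
```

If the real CSV contains currency symbols or thousands separators, astype(float) will raise an error and the text needs cleaning first.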
I am trying to learn sklearn in Python, but the Jupyter notebook gives this error: "ValueError: Expected 2D array, got scalar array instead:
array=750.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
But I have already defined x to be a 2D array using x.values.reshape(-1,1).
You can find the CSV file and a screenshot of the error here -> https://github.com/CaptainRD/CSV-for-StackOverflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
reg.predict(1750)
As you can see in your code, x has two variables, SAT and Rand 1,2,3, which means you need to provide a two-dimensional input with two values per row to the predict method. For example:
reg.predict([[1750, 1]])
which returns:
>>> array([1.88])
You are facing this error because you did not provide the second value (for the Rand 1,2,3 variable). Note, if this variable is not important, you should remove it from your x data.
This model is mapping two inputs (SAT and Rand 1,2,3) to a single output (GPA), and thus requires a list of two elements as input for a valid prediction. I'm guessing the 1750 that you're supplying is meant to be the SAT value, but you also need to provide the Rand 1,2,3 value. Something like [1750, 1] would work.
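The input shape that predict expects can be sketched with synthetic data. The feature values and the relationship between them below are invented and do not come from the CSV in the question:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: two features per sample (think SAT and Rand 1,2,3)
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2100, size=50)
rand = rng.integers(1, 4, size=50)
x = np.column_stack([sat, rand])                    # shape (50, 2)
y = 0.002 * sat + rng.normal(scale=0.1, size=50)    # invented target

reg = LinearRegression().fit(x, y)

# One sample with two features: a 2D input of shape (1, 2)
single = reg.predict([[1750, 1]])

# Several samples at once: shape (n_samples, 2)
several = reg.predict(np.array([[1750, 1], [1900, 3]]))

print(single.shape, several.shape)                  # (1,) (2,)
```

The outer brackets hold the samples and the inner brackets hold the features of each sample, which is why a bare 1750 is rejected.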
I'm working with a dataframe in which one column contains an np.array per row (in this case, the mean waveform of brain recordings over time). I want to calculate the Pearson correlation between every pair of arrays in this column.
This is my code
from scipy import stats  # assuming stats refers to scipy.stats

length = len(df.Mean)
Mean = []
for i in range(length):
    Mean.append(df.Mean[i])
Correlation_p = np.zeros((length, length))
P_Value_p = np.zeros((length, length))
for i in range(length):
    for j in range(length):
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more Pythonic way to do it, maybe using df.corr(). I tried, but I could not work out how.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
If I am not mistaken, each array you would like to correlate sits in a single cell of the DataFrame. The following brings the data into a format where each array occupies its own column.
I made an example dataset that resembles the format of df.Mean.head():
df = pd.DataFrame({'x':[np.random.randint(0,5,10), np.random.randint(0,5,10), np.random.randint(0,5,10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
This requires all arrays to have the same length. The transpose is what puts each array in its own column, which matters because df.corr() correlates columns.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
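As a sketch of the same idea without the intermediate DataFrame, np.corrcoef treats each row of a 2D array as one variable, so stacking the arrays is enough. The random waveforms below are stand-ins for the real recordings:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 'Mean' column: one equal-length array per row
rng = np.random.default_rng(0)
df = pd.DataFrame({'Mean': [rng.normal(size=20) for _ in range(4)]})

# Stack the arrays into a 2D array with one waveform per row
waveforms = np.vstack(df['Mean'].tolist())          # shape (4, 20)

# np.corrcoef correlates rows with each other
corr_np = np.corrcoef(waveforms)                    # shape (4, 4)

# DataFrame.corr correlates columns, hence the transpose
corr_pd = pd.DataFrame(waveforms.T).corr().to_numpy()

print(np.allclose(corr_np, corr_pd))                # True
```

Both routes return only the correlation coefficients. The p-values computed in the question still require scipy.stats.pearsonr for each pair.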
I want to normalize all the numeric values in my dataset.
I have taken my whole dataset into a pandas dataframe.
My code to do this so far:
for column in numeric:  # numeric = df._get_numeric_data()
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
But how do I verify that this is correct?
I tried plotting a histogram of one of the columns before and after normalizing, by adding this piece of code before and after my for loop:
x=df['Below.Primary'] #Below.Primary is one of my column names
plt.hist(x, bins=45)
The blue histogram was before the for loop and the orange, after.
My total code looked like this:
plt.hist(df['Below.Primary'], bins=45)
for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
x = df['Below.Primary']
plt.hist(x, bins=45)
I don't see any reduction in scale. What have I done wrong? If this is not correct, can someone point out the right way to do what I want?
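Try this, where col is the name of one numeric column. Note that StandardScaler standardizes the column to zero mean and unit variance, and that it expects 2D input, hence the double brackets:
scaler = preprocessing.StandardScaler()
df[col] = scaler.fit_transform(df[[col]]).ravel()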
A couple of general things first.
If numeric is a list of column names (which looks to be the case), the for loop is not necessary.
A pandas Series uses an ndarray under the hood, so you can request the ndarray with Series.values instead of calling np.array(). See the pandas Series documentation.
I am assuming you are using preprocessing from sklearn.
I recommend using sklearn.preprocessing.Normalizer for this.
import pandas as pd
from sklearn.preprocessing import Normalizer

### Without the for loop (recommended)
# Note: Normalizer scales each row (sample) to unit norm, not each column
normalizer = Normalizer()
normalized_values = normalizer.fit_transform(df[numeric])
# normalized_values is a 2D array, which is useful
# for many applications
# to convert back to a DataFrame (keeping the original index)
df = pd.DataFrame(normalized_values, columns=numeric, index=df.index)

### With the for loop (not recommended)
for column in numeric:
    # pass the column as a single row so the whole column is scaled to unit norm
    x_array = df[column].values.reshape(1, -1)
    df[column] = normalizer.fit_transform(x_array)[0]
You have to assign normalized_X back to the respective column while iterating.
for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
    df[column] = normalized_X[0]  # normalize returns shape (1, n), so take the single row
x = df['Below.Primary']
plt.hist(x, bins=45)
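On the question of how to verify the result, a numeric check is often more telling than a histogram, because a histogram keeps the same shape after a linear rescaling and only its axis range changes. The sketch below uses made-up values and a hypothetical second column named Primary. It computes two common scalings with plain pandas and NumPy: the unit-norm scaling, which is what preprocessing.normalize([x_array]) does for a single column with its default L2 norm, and min-max scaling to the range 0 to 1.

```python
import numpy as np
import pandas as pd

# Made-up numeric data; 'Primary' is a hypothetical second column
df = pd.DataFrame({'Below.Primary': [3.0, 4.0, 12.0],
                   'Primary': [10.0, 20.0, 30.0]})
numeric = ['Below.Primary', 'Primary']

# Unit-norm scaling per column: divide each column by its Euclidean length
l2_scaled = df[numeric] / np.sqrt((df[numeric] ** 2).sum())

# Min-max scaling per column: map the smallest value to 0 and the largest to 1
minmax_scaled = (df[numeric] - df[numeric].min()) / (df[numeric].max() - df[numeric].min())

# Verification: every unit-norm column has length 1 ...
column_norms = np.sqrt((l2_scaled ** 2).sum())
print(column_norms.tolist())

# ... and every min-max column spans exactly 0 to 1
print(minmax_scaled.min().tolist(), minmax_scaled.max().tolist())
```

Which scaling is appropriate depends on the downstream model, so the choice between them is left open here.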