Convert DF into Numpy Array for calculations - python

I have the data in a dataframe format that I will use for linear regression calculation using user-built function. Here is the code:
from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = df.to_array(x)
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values
However, I am getting the error:
AttributeError Traceback (most recent call last)
<ipython-input-131-272f1b4d26ba> in <module>()
1 import copy
2
----> 3 xw = df.to_array(x)
AttributeError: 'int' object has no attribute 'to_array'
I am not sure where the problem. I need to pass an array of values (x in this case) to the function to execute some matrix operations
The insert function was working in a step by step code development but for some reason is failing here.
I tried:
xw = copy.deepcopy(x)
with no success
Any thoughts?

it is x.as_matrix() not df.to_array(x)
Please refer to pandas document for more detail on as_matrix()
Here is the code that work
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = x.as_matrix()
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values

Related

LinearRegression TypeError

The above screenshot is refereed to as: sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
symbol = row['stock']
perc = row['ChangePercent']
x = np.array(perc).reshape((-1, 1))
y = np.array(mean)
model = LinearRegression().fit(x, y)
print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value(model.coef_) for each stock?
X and y must have same shape, so you need to reshape both x and y to 1 row and 1 column. In this case it is resumed to the following:
np.array(mean).reshape(-1,1) or np.array(mean).reshape(1,1)
Given that you are training 5 classifiers, each one with just one value, is not surprising that the 5 models will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.37 (y).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
"stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
"ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
symbol = row['stock']
perc = row['ChangePercent']
x = np.array(perc).reshape(-1,1)
y = np.array(mean).reshape(-1,1)
model = LinearRegression().fit(x, y)
print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
Which is correct from an algorithmic point of view, but it doesn't make any practical sense given that you're only providing one example to train each model.

TypeError : list indices must be integers or slices, not str when trying to plot [duplicate]

This question already has answers here:
Renaming column names in Pandas
(35 answers)
Closed 1 year ago.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler,normalize
from sklearn.metrics import silhouette_score
newdf = pd.read_csv("D:\DATASETS/CC_GENERAL.csv")
x = newdf.drop('CUST_ID',axis = 1)
x.fillna(method = 'ffill',inplace = True)
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_normalized = normalize(x_scaled)
#CONVERTING THE NUMPY ARRAY INTO A PANDAS DATAFRAME
x_normalized = pd.DataFrame(x_normalized)
#REDUCING THE DIMENTIONALITY OF THE DATA!
pca = PCA(n_components= 2)
x_principal = pca.fit_transform(x_normalized)
x_principal = pd.DataFrame(x_normalized)
x_principal = ['P1','P2']
ac2 = AgglomerativeClustering(n_clusters = 2)
plt.figure(figsize = (6,6))
plt.scatter(x_principal['P1'],x_principal['P2'])
c= ac2.fit_predict((x_principal),cmap = 'rainbow')
plt.show()
and this is my error:
TypeError Traceback (most recent call last)
<ipython-input-61-56f631c43c3e> in <module>
3 #visualizing the cluster
4 plt.figure(figsize = (6,6))
----> 5 plt.scatter(x_principal['P1'],x_principal['P2'])
6 c= ac2.fit_predict((x_principal),cmap = 'rainbow')
7 plt.show()
TypeError: list indices must be integers or slices, not str
If you are trying to update the columns names for x_principal, which is more likely, you should be using x_principal.columns = ['P1, 'P2'], right now you are assigning those values, which is overwriting the data
x_principal is a list containing two strings P1 and P2. So x_principal['P1'] is wrong. You can not index list elements with the element itself.

Converting CSV file data into federated data

I am trying to convert my CSV dataset into a federated data. Please find the code and the error I am getting while I am running my code
code: import collections
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff
np.random.seed(0)
df = pd.read_csv('path to my csv file')
client_id_colname = 'aratio: continuous.'
SHUFFLE_BUFFER = 1000
NUM_EPOCHS = 1
client_ids = df[client_id_colname].unique()
train_client_ids = sample(client_ids.tolist(),500)
test_client_ids = [x for x in client_ids if x not in train_client_ids]
def create_tf_dataset_for_client_fn(client_id):
client_data = df[df[client_id_colname] == client_id]
dataset = tf.data.Dataset.from_tensor_slices(client_data.to_dict('list'))
dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
return dataset
train_data = tff.simulation.ClientData.from_clients_and_fn(
client_ids=train_client_ids,
create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
)
test_data = tff.simulation.ClientData.from_clients_and_fn(
client_ids=test_client_ids,
create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
)
Error: ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-9d85508920a8> in <module>
15 # split client id into train and test clients
16 client_ids = df[client_id_colname].unique()
---> 17 train_client_ids = sample(client_ids.tolist(),500)
18 test_client_ids = [x for x in client_ids if x not in train_client_ids]
19
NameError: name 'sample' is not defined
Python cannot find the sample function. The code will need to import it from somewhere, a few possible options:
random.sample
numpy.random.sample
To use the first, the code would need an import random and the sample line would need to change to:
train_client_ids = random.sample(client_ids.tolist(), 500)
add the following line in the list of your import statements:
from random import sample

Python ValueError. Don't understand error or how to fix

I am following the tutorial here; https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learningnd-deep-learning-techniques-python/#comment-155692
Instead of using the provided dataset I am using one needed for my assignment.
The code used is
#import packages
import pandas as pd
import numpy as np
#to plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline
#setting figure size
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 20,10
#for normalizing data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
#read the file
df = pd.read_csv('C:/Users/Usert/Downloads/stock-20050101-to-20171231/stock-20050101-to-20171231/IBM_2006-01-01_to_2018-01-01.csv')
#print the head
df.head()
#setting index as date
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
df.index = df['Date']
#plot
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close Price history')
#creating dataframe with date and the target variable
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date', 'Close'])
for i in range(0,len(data)):
new_data['Date'][i] = data['Date'][i]
new_data['Close'][i] = data['Close'][i]
#splitting into train and validation
train = new_data[:987]
valid = new_data[987:]
new_data.shape, train.shape, valid.shape
((1235, 2), (987, 2), (248, 2))
train['Date'].min(), train['Date'].max(), valid['Date'].min(), valid['Date'].max()
#make predictions
preds = []
for i in range(0,248):
a = train['Close'][len(train)-248+i:].sum() + sum(preds)
b = a/248
preds.append(b)
#calculate rmse
rms=np.sqrt(np.mean(np.power((np.array(valid['Close'])-preds),2)))
rms
#plot
valid['Predictions'] = 0
valid['Predictions'] = preds
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])
This runs fine until "#Calculate RMSE" when it hits the error.
File "<ipython-input-92-1256d885493e>", line 65, in <module>
rms=np.sqrt(np.mean(np.power((np.array(valid['Close'])-preds),2)))
ValueError: operands could not be broadcast together with shapes (2033,) (248,)
Using "print(valid.shape)" and "print(len(preds))" as requested returns "(604, 3)" and "248".
Any idea how I change the numbers to fit my dataset as each time I change the numbers I create more errors?
Just FYI;
The dataset I am using has 7 columns named "Date, Open, High, Low, Close, Volume and Name" with 3021 rows of data including headers.
Whilst the one in the tutorial has 8 columns being "date, open, high, low, last, close, total_trade_quantity, and turnover" with 1236 rows including headers.

Python Index column doesn't freeze while scrolling to the right

my problem is that I have a Dataframe of 200 rows and 200 columns, while I scroll to the right the index column stay fixed ( I can still see it) as it should be.
However when I select a column or value into the Dataframe (for example to order the values in ascending or descending order), the index column change and becomes the same as the column I selected.
I would like to still see the index column.
I am using Spyder 3.3.0 and Python 3.6
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import operator
# Importing the dataset
dataset = pd.read_csv('1992_2014.csv', index_col =0)
nations_all = dataset.iloc[:, 0].values
nations = [nations_all[0]]
for i in range(0, len(nations_all)):
if nations_all[i] not in nations:
nations.append(nations_all[i])
Year = dataset.iloc[:, 1].values
CO2 = dataset.iloc[:, 8].values
# Creating the Trend Matrix between two nations
trend_matrix = pd.DataFrame(index = nations, columns = nations)
for i in nations:
n = dataset[dataset["Nation"] == i].index.values.astype(int)
for k in nations:
kn = dataset[dataset["Nation"] == k].index.values.astype(int)
div_n = CO2[n[0]]
div_kn = CO2[kn[0]]
CO2_n = (CO2[n]/div_n)
CO2_kn = (CO2[kn]/div_kn)
trend_matrix.loc[i, k] = sum(list(map(abs,list(map(operator.sub, CO2_n, CO2_kn)))))
Thanks!

Categories