NumPy - Getting column and row number to reshape array - python

I am learning neural networks and I am trying to automate some of the processes.
Right now, I have code that randomly splits the dataset, a 284807x31 array. Then I need to separate inputs and outputs: select every column up to (but not including) the last one as inputs, and only the last column as outputs. For some reason I can't figure out how to do this properly, and I am stuck at splitting and separating the set as explained above. Here's my code so far (the part that refers to this specific problem):
train, test, cv = np.vsplit(data[np.random.permutation(data.shape[0])], (6,8))
# Should select entire array except the last column
train_inputs = np.resize(train, len(train[:,1]), -1)
test_inputs = np.resize(test, len(test[:,1]), -1)
cv_inputs = np.resize(cv, len(cv[:,1]), -1)
# Should select **only** the last column.
train_outs = train[:, 30]
test_outs = test[:, 30]
cv_outs = test[:, 30]
The idea is that I'd like the machine to find the column number of the corresponding dataset and do intended resizes. The second part will select only the last column - and I am not sure if that works because the script stops before that. The error is, by the way:
Traceback (most recent call last):
File "src/model.py", line 43, in <module>
train_inputs = np.resize(train, len(train[:,1]), -1)
TypeError: resize() takes exactly 2 arguments (3 given)
PS: Now that I am looking at the documentation, I can see I am very far from the solution but I really can't figure it out. It's the first time I am using NumPy.
Thanks in advance.

Some slicing should help:
# Select the entire array except the last column
train_inputs = train[:, :-1]
test_inputs = test[:, :-1]
cv_inputs = cv[:, :-1]
and:
# Select only the last column
train_outs = train[:, -1]
test_outs = test[:, -1]
cv_outs = cv[:, -1]
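To see the whole flow in one place, here is a minimal sketch on a small synthetic array (10 rows, 4 columns) standing in for the real dataset. The helper name split_inputs_outputs is made up for illustration. Note that the indices passed to np.vsplit are row positions, so (6, 8) yields blocks of 6, 2 and the remaining rows; on the full 284807-row dataset that would leave almost everything in the third block.

```python
import numpy as np

# Synthetic stand-in for the real dataset: 10 rows, 4 columns (last column = output).
rng = np.random.default_rng(0)
data = rng.random((10, 4))

# Shuffle the rows, then split at row indices 6 and 8 -> blocks of 6, 2 and 2 rows.
shuffled = data[rng.permutation(data.shape[0])]
train, test, cv = np.vsplit(shuffled, (6, 8))


def split_inputs_outputs(block):
    """Return (inputs, outputs): every column but the last, and the last column."""
    return block[:, :-1], block[:, -1]


train_inputs, train_outs = split_inputs_outputs(train)
test_inputs, test_outs = split_inputs_outputs(test)
cv_inputs, cv_outs = split_inputs_outputs(cv)

print(train_inputs.shape, train_outs.shape)  # (6, 3) (6,)
```

Because the slices use -1 rather than a hard-coded 30, the same code works for any number of columns.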

Related

Importing CSV data and splitting correctly

I'm trying to create a neural network using NumPy, but I'm having trouble importing and splitting my CSV file. When I run my code, it seems to expect all of my rows, when I only want the 70% that I split off in the code below. My data has 470 rows in total, but I want the 329 rows that make up the 70%.
I'm not sure whether the way I've indexed the data to split it into x/y values has caused this. When I print the shape of 'trainingdata' before splitting it into x/y, it shows 329 rows, which is correct. 'trainingdata' is only ever used in this section of code, so I don't believe the problem comes from a later function.
dataset = pd.read_csv(('data.csv'), names = header_list)
trainingdata = dataset.sample(frac=0.7)
testingdata = dataset.drop(trainingdata.index)
X = trainingdata.iloc[:,1:].to_numpy().astype('float32')
Y = trainingdata.iloc[:,0].to_numpy().reshape(-1, 1).astype('float32')
print(Y.shape)
print(X.shape)
output message = ValueError: Shape of passed values is (329, 1), indices imply (470, 1)
EDIT =
# convert output to
predicted_output = pd.DataFrame(
predicted_output,
columns=["prediction"],
index=df.index)
Sorry for the confusion. Basically, I need to use my 70/30 split data in the code above, but I don't know how to pass a split dataframe like this. The code above is what I use with the pre-split dataframe (a pd.DataFrame), which works fine, but when I try to use 'testingdata' (the data that is split 70/30), it says this won't work because it's a NumPy array.
I'm not sure how to convert my 'trainingdata' array into something that the function above can use. Is there specific syntax for this?
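The ValueError above says that 329 values were passed together with an index of 470 labels, which is what happens when predictions for the 70% split are wrapped with the index of the full dataframe. A minimal sketch of one way around it, assuming the predictions have one row per row of the split frame: reuse that frame's own index. The data and the predicted_output placeholder below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for data.csv: 10 rows, target in column 0.
dataset = pd.DataFrame(np.arange(30, dtype="float32").reshape(10, 3),
                       columns=["y", "a", "b"])

trainingdata = dataset.sample(frac=0.7, random_state=0)
testingdata = dataset.drop(trainingdata.index)

X = trainingdata.iloc[:, 1:].to_numpy()

# Placeholder for a model's output: one prediction per training row.
predicted_output = X.sum(axis=1, keepdims=True)

# Use the index of the frame the predictions came from, not the full dataset's.
predicted_df = pd.DataFrame(predicted_output,
                            columns=["prediction"],
                            index=trainingdata.index)
print(predicted_df.shape)  # (7, 1)
```

The same pattern applies to testingdata: predictions made from testingdata would take index=testingdata.index.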

Tensorflow target column always returns 1

I'm working on a classification problem with Tensorflow and I'm new to this. In the end I want to see two target values (1 and 0). I'm asking because I don't know: is it normal for the whole target column to be 1, as below? Thank you.
df['target'] = np.where(df['Class']== 2, 0, 1)
df = df.drop(columns=['Class'])
Then, when I run the line below, every value in the target column is 1.
print(df.head(50))
Just change the last parameter to the array you are making the comparison on.
This replaces the value 2 with 1 and leaves every other value of df["Class"] unchanged:
df['target'] = np.where(df['Class']== 2, 1, df['Class'])
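For comparison, here is a small sketch of how both np.where calls behave on made-up data. It also shows one possible reason, offered only as a guess, for a target column that is entirely 1: if Class is stored as strings, the comparison with the integer 2 is never true. Another possibility is simply that the first 50 rows all belong to the same class, which df['target'].value_counts() would reveal.

```python
import numpy as np
import pandas as pd

# Made-up data: the real 'Class' values are not shown in the question.
df = pd.DataFrame({"Class": [2, 4, 2, 4, 4]})

# Expression from the question: 0 where Class == 2, otherwise 1.
target_binary = np.where(df["Class"] == 2, 0, 1)

# Expression from the answer: 1 where Class == 2, otherwise the original value.
target_replaced = np.where(df["Class"] == 2, 1, df["Class"])

print(target_binary.tolist())    # [0, 1, 0, 1, 1]
print(target_replaced.tolist())  # [1, 4, 1, 4, 4]

# If the column holds strings, comparing with the integer 2 is False everywhere,
# so every row falls through to the "otherwise" value.
df_str = pd.DataFrame({"Class": pd.Series(["2", "4", "2"], dtype=object)})
target_from_strings = np.where(df_str["Class"] == 2, 0, 1)
print(target_from_strings.tolist())  # [1, 1, 1]
```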

Inputing samples from a dataframe into an ARMA model and getting 'ValueError: setting an array element with a sequence.'

I have an 8 by 7 dataframe ‘selected_parameters’ as follows.
ar_params and ma_params correspond to the estimated parameters of an ARMA model on a time series.
I then randomly select one set of parameters from ar_params and from ma_params:
ar_sample = selected_parameters['ar_params'].sample(1)
ma_sample = selected_parameters['ma_params'].sample(1)
I then modify them as follows so that they can be used to generate time series with an ARMA process, following the explanations at the end of this page:
https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_process.arma_generate_sample.html
ar_sample_array = np.r_[1, -ar_sample]
ma_sample_array = np.r_[1, ma_sample]
y = arma_generate_sample(ar_sample_array, ma_sample_array, nsample=100, scale=0.1)
plt.plot(y)
Everything works well IF the selected set of ar_params and ma_params contains only ONE value.
If the random selection picks a set with two or more values, I receive the following error message:
‘ValueError: setting an array element with a sequence.’
When printing the values of ar_sample_array and ma_sample_array
print(ar_sample_array)
print(ma_sample_array)
I get the following output
[1 array([-1.01, 0.01])]
[1 array([-0.76, 0.03])]
Thank you
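I think the params must be a single flat array, not an array with another array nested inside. Since .sample(1) returns a one-element Series, taking its first element before concatenating should work:
ar_sample_array = np.r_[1, -ar_sample.iloc[0]]
ma_sample_array = np.r_[1, ma_sample.iloc[0]]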
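A minimal sketch of the flattening step, using a made-up two-row frame in place of selected_parameters (the real one is not shown) and leaving out the statsmodels call. The assumption is that each cell of ar_params and ma_params holds a NumPy array of coefficients.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for selected_parameters: each cell holds an array of coefficients.
selected_parameters = pd.DataFrame({
    "ar_params": [np.array([0.5]), np.array([1.01, -0.01])],
    "ma_params": [np.array([0.3]), np.array([-0.76, 0.03])],
})

# .sample(1) returns a one-element Series; .iloc[0] extracts the array stored in it.
ar_sample = selected_parameters["ar_params"].sample(1, random_state=1).iloc[0]
ma_sample = selected_parameters["ma_params"].sample(1, random_state=1).iloc[0]

# np.r_ now concatenates a scalar with a plain 1-D array, giving a flat float array.
ar_sample_array = np.r_[1, -ar_sample]
ma_sample_array = np.r_[1, ma_sample]

print(ar_sample_array.ndim, ar_sample_array.dtype)
```

The resulting arrays are one-dimensional regardless of how many coefficients the sampled row contains, which is the shape arma_generate_sample is given in the single-value case that already works.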

Join one dataset and the result of OneHotEncoder in Pandas

Let's consider the dataset of House prices from this example.
I have the entire dataset stored in the housing variable:
housing.shape
(20640, 10)
I have also applied a OneHotEncoder to one of the features and got housing_cat_1hot, so
housing_cat_1hot.toarray().shape
(20640, 5)
My goal is to join the two variables and store everything in just one dataset.
I have tried the Join with index tutorial, but the problem is that the second matrix doesn't have an index.
How can I do a JOIN between housing and housing_cat_1hot?
>>> left=housing
>>> right=housing_cat_1hot.toarray()
>>> result = left.join(right)
Traceback (most recent call last):
  File "", line 1, in
    result = left.join(right)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5293, in join
    rsuffix=rsuffix, sort=sort)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5323, in _join_compat
    can_concat = all(df.index.is_unique for df in frames)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5323, in
    can_concat = all(df.index.is_unique for df in frames)
AttributeError: 'numpy.ndarray' object has no attribute 'index'
Well, depends on how you created the one-hot vector.
But if it's sorted the same as your original DataFrame, and itself is a DataFrame, you can add the same index before joining:
housing_cat_1hot.index = range(len(housing_cat_1hot))
And if it's not a DataFrame, convert it to one.
This is simple, as long as both objects are sorted the same way.
Edit: If it's not a DataFrame, then:
housing_cat_1hot = pd.DataFrame(housing_cat_1hot)
This already creates the proper index for you.
If you wish to join the two arrays (assuming both housing_cat_1hot and housing are arrays), you can use
housing = np.hstack((housing, housing_cat_1hot))
That said, the best way to one-hot encode a variable is to select that variable within the array and encode it there. It saves you the trouble of joining the two later.
Say the index of the variable you wish to encode in your array is 1:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])
# Note: the categorical_features argument was removed in scikit-learn 0.22;
# on newer versions, sklearn.compose.ColumnTransformer handles the column selection.
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
Thanks to @Elez-Shenhar's answer, I got the following working code:
OneHot=housing_cat_1hot.toarray()
OneHot= pd.DataFrame(OneHot)
result = housing.join(OneHot)
result.shape
(20640, 15)
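The working code above relies on housing having the default 0..n-1 index, which is also what pd.DataFrame assigns to the converted array. If housing had been filtered or shuffled first, the join would align on mismatched labels. A minimal sketch of a more defensive variant, on made-up data with invented column names, passes housing's index explicitly:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for `housing`, with a non-default index.
housing = pd.DataFrame(
    {"median_income": [8.3, 7.2, 5.6],
     "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"]},
    index=[10, 20, 30],
)

# Dense one-hot array standing in for housing_cat_1hot.toarray().
one_hot = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

# Reuse housing's index so that join aligns row by row; the column names are made up.
one_hot_df = pd.DataFrame(one_hot, index=housing.index,
                          columns=["cat_INLAND", "cat_NEAR_BAY"])

result = housing.join(one_hot_df)
print(result.shape)  # (3, 4)
```

This assumes the rows of the one-hot array are in the same order as the rows of housing, as the answers above also require.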

Slicing columns in Python

I am new to Python. I want to slice the columns from index 1 to the end of a matrix and perform some operations on those sliced-out columns. The code is as follows:
import numpy as np
import pandas as pd
train_df = pd.read_csv('train_475_60_W1.csv',header = None)
train = train_df.as_matrix()
y = train[:,0]
X = train[:,1:-1]
The problem is that if I execute "train.shape", it gives me (89512, 61), but when I execute "X.shape", it gives me (89512, 59). I was expecting 60, as I want to run operations on all the columns except the first one. Can anyone please help me solve this?
In the line
X = train[:,1:-1]
you cut off the last column. -1 refers to the last column, and Python includes the beginning but not the end of a slice - so lst[2:6] would give you entries 2,3,4, and 5. Correct it to
X = train[:,1:]
BTW, you can make your code format properly by including four spaces before each line (you can just highlight it and hit Ctrl+K).
The thing to know about slicing, even for a single dimension in normal lists, is that it looks like this:
[start : end]
with start included and end excluded.
You can also use these:
[:x] # from the start up to, but not including, x
[x:] # from x to the end
You can then generalize that to 2D or more, so in your case it would be:
X = train[:,1:] # the first : selects all rows, and 1: selects all columns except the first
You can learn more about these here if you want; it's a good way to practice.
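A quick way to check the difference is to try both slices on a tiny array. This is only an illustration on a made-up 3x5 matrix, not the 89512x61 data from the question:

```python
import numpy as np

# Hypothetical 3x5 matrix: column 0 is the label, columns 1-4 are features.
train = np.arange(15).reshape(3, 5)

y = train[:, 0]           # first column only
X_short = train[:, 1:-1]  # columns 1..3: the end index -1 is excluded
X = train[:, 1:]          # columns 1..4: everything except the first

print(X_short.shape)  # (3, 3)
print(X.shape)        # (3, 4)
```

With 61 columns the same arithmetic gives 59 for 1:-1 and 60 for 1:, matching the shapes reported in the question.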
