I wanted to know the difference between these two lines of code
X_train = training_dataset.iloc[:, 1].values
X_train = training_dataset.iloc[:, 1:2].values
My guess is that the latter is a 2-D numpy array and the former is a 1-D numpy array. For inputs to a neural network, the latter is the proper form for input data. Is there a specific reason for that?
Please help!
Your guess is right. The first has ndim=1 and the second has ndim=2; just check by doing this:
X_train.ndim
The difference is that the first one doesn't have a defined second dimension: its shape is (R,), while the second has shape (R, 1). If you want to see the difference between the shapes, I suggest reading this: Difference between numpy.array shape (R, 1) and (R,)
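One practical reason the two shapes are worth telling apart is broadcasting. The following is only a minimal sketch with made-up values (the names flat, column and grid are illustrative, not from the question):

```python
import numpy as np

flat = np.arange(3)            # shape (3,): a single axis
column = flat.reshape(-1, 1)   # shape (3, 1): three rows, one column

print(flat.ndim, flat.shape)       # 1 (3,)
print(column.ndim, column.shape)   # 2 (3, 1)

# Mixing the two shapes broadcasts to a full grid instead of
# adding element by element.
grid = flat + column
print(grid.shape)                  # (3, 3)
```

So an accidental mix of (R,) and (R, 1) arrays can silently produce an (R, R) result.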
The difference is that iloc returns a Series when a single row or column is selected, but a DataFrame when a row or column range is selected.
Although they both refer to column 1, 1 and 1:2 are different types, with 1 representing an int and 1:2 representing a slice.
With,
X_train = training_dataset.iloc[:, 1].values
You specify a single column, so training_dataset.iloc[:, 1] is a Pandas Series, and .values is a 1D Numpy array.
Vs.,
X_train = training_dataset.iloc[:, 1:2].values
Although it selects only one column, 1:2 is a slice that represents a column range, so training_dataset.iloc[:, 1:2] is a Pandas Dataframe. Thus, .values is a 2D Numpy array.
Test as follows:
Create training_dataset Dataframe
import pandas as pd
data = {'Height':[1, 14, 2, 1, 5], 'Width':[15, 25, 2, 20, 27]}
training_dataset = pd.DataFrame(data)
Using .iloc[:, 1]
print(type(training_dataset.iloc[:, 1]))
print(training_dataset.iloc[:, 1].values)
# Result is:
<class 'pandas.core.series.Series'>
# .values returns a 1D Numpy array
[15 25  2 20 27]
Using iloc[:, 1:2]
print(type(training_dataset.iloc[:, 1:2]))
print(training_dataset.iloc[:, 1:2].values)
# Result is:
<class 'pandas.core.frame.DataFrame'>
# Values is a 2D Numpy array (since values of Pandas Dataframe)
[[15]
 [25]
 [ 2]
 [20]
 [27]]
In both cases, the type of X_train is <class 'numpy.ndarray'>.
I am very new to python and machine learning.
Let's say that I have a 1D np array (with both numbers and NaN) with one column and 1308 rows and want to create two variables:
train_outcome = outcome[0:891, 0]
y_pred = outcome[891:, 0]
I tried this and got the obvious <IndexError: too many indices for the array: array is 1-dimensional, but 2 were indexed>.
I was so desperate that I converted it back to a DF to make the operation. There must be an easier way to achieve this.
If the array has 1 dimension, there is no need for a comma. Here is how I'd do it:
train_outcome = outcome[:891]
y_pred = outcome[891:]
Use np.split for a one-liner:
train_outcome, y_pred = np.split(outcome, [891])
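To check that both approaches give the same pieces, here is a small sketch. The contents of outcome are a hypothetical stand-in (the question does not show the real data); only the length 1308 and the cut at 891 come from the question.

```python
import numpy as np

# Hypothetical stand-in for `outcome`: 1308 values, NaN after row 891.
outcome = np.full(1308, np.nan)
outcome[:891] = np.arange(891) % 2

train_outcome = outcome[:891]      # first 891 entries
y_pred = outcome[891:]             # remaining 417 entries

# np.split with a single cut index returns the same two pieces.
first, second = np.split(outcome, [891])

print(train_outcome.shape, y_pred.shape)   # (891,) (417,)
print(first.shape, second.shape)           # (891,) (417,)
```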
I used x = dataset.iloc[:,1:2].values in my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
dataset = pd.read_csv('Position_Salaries.csv')
x = dataset.iloc[:,1:2].values #look here please
y = dataset.iloc[:,-1].values
from sklearn.svm import SVR
sv_regressor = SVR(kernel='rbf')
But when I used x = dataset.iloc[:,1].values instead, I got an error saying
'expected 2d array and got 1d array instead'
The error is raised at the sv_regressor line, which is why I tagged sklearn.
The difference is that with dataset.iloc[:,1:2] you will get a DataFrame and with dataset.iloc[:,-1] you will get a Series. When you use the attribute values with a DataFrame you get a 2d ndarray and with a Series you get a 1d ndarray. Consider the following example:
   A  B  C
0  0  2  0
1  1  0  0
2  1  2  1
Series:
type(df.iloc[:, -1])
# pandas.core.series.Series
df.iloc[:, -1].values.shape
# (3,)
DataFrame:
type(df.iloc[:, -1:])
# pandas.core.frame.DataFrame
df.iloc[:, -1:].values.shape
# (3, 1)
It's a common trick in machine learning to get a target variable as 2d ndarray in one step.
It's almost the same: dataset.iloc[:,1:2] gives you a 2-d DataFrame (the slice 1:2 selects column 1 only, since the end of the slice is exclusive), while dataset.iloc[:,1] gives you a 1-d pandas Series (column 1).
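The different ways of selecting one column can be compared side by side. This is a rough sketch on made-up data: the real Position_Salaries.csv is not shown in the question, so the column names and values below are assumptions.

```python
import pandas as pd

# Made-up data; the contents of the real CSV file are not known.
dataset = pd.DataFrame({
    "Position": ["Analyst", "Manager", "Partner"],
    "Level": [1, 2, 3],
    "Salary": [45000, 80000, 200000],
})

as_series = dataset.iloc[:, 1].values     # int indexer -> Series -> 1D
as_slice = dataset.iloc[:, 1:2].values    # slice indexer -> DataFrame -> 2D
as_list = dataset.iloc[:, [1]].values     # list indexer -> DataFrame -> 2D
reshaped = dataset.iloc[:, 1].values.reshape(-1, 1)  # 1D reshaped to 2D

print(as_series.shape)   # (3,)
print(as_slice.shape)    # (3, 1)
print(as_list.shape)     # (3, 1)
print(reshaped.shape)    # (3, 1)
```

Any of the three 2D forms should avoid the 'expected 2d array' complaint, since they all have shape (n_rows, 1).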
I'm trying to load MNIST dataset into arrays.
When I use
(X_train, y_train), (X_test, y_test)= mnist.load_data()
I get an array y_test(10000,) but I want it to be in the shape of (10000,1).
What is the difference between array(10000,1) and array(10000,)?
How can I convert the first array to the second array?
Your first Array with shape (10000,) is a 1-Dimensional np.ndarray.
Since the shape attribute of numpy Arrays is a Tuple and a tuple of length 1 needs a trailing comma the shape is (10000,) and not (10000) (which would be an int). So currently your data looks like this:
import numpy as np
a = np.arange(5) # >>> array([0, 1, 2, 3, 4])
print(a.shape) # >>> (5,)
What you want is a 2-Dimensional array with a shape of (10000, 1).
Adding a dimension of length 1 doesn't require any additional data; it is basically an "empty" dimension. To add a dimension to an existing array you can use either np.expand_dims() or np.reshape().
Using np.expand_dims:
import numpy as np
b = np.array(np.arange(5)) # >>> array([0, 1, 2, 3, 4])
b = np.expand_dims(b, axis=1) # >>> array([[0],[1],[2],[3],[4]])
The function was specifically made for the purpose of adding empty dimensions to arrays. The axis keyword specifies which position the newly added dimension will occupy.
Using np.reshape:
import numpy as np
a = np.arange(5)
X_test_reshaped = np.reshape(a, (-1, 1)) # >>> array([[0],[1],[2],[3],[4]])
The (-1, 1) argument specifies what the new shape should look like after the reshape operation. NumPy replaces the -1 internally with whatever length fits the data. The shape is passed positionally here because the keyword name differs between NumPy versions.
Reshape is a more powerful function than expand_dims and can be used in many different ways. You can read more on other uses of it in the numpy docs. numpy.reshape()
An array with a shape of (10,1) is a 2D array with 10 rows and a single column.
An array with a size of (10,) is a 1D array.
To convert (10,1) to (10,), you can simply collapse the column axis. For example, take an array x with x.shape = (10,1). Indexing the single column with x[:, 0] collapses it, so x[:, 0].shape = (10,).
To convert (10,) to (10,1), you can add a dimension by using np.newaxis (after import numpy as np, assuming we are using numpy arrays here). Take an array y, for example, with y.shape = (10,). Using y[:, np.newaxis], you get a new array with the shape (10,1).
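Both directions can be tried out in a few lines. This is a minimal sketch; the variable names are illustrative only.

```python
import numpy as np

y = np.arange(10)                    # shape (10,)

# (10,) -> (10, 1): three equivalent ways to add a trailing axis
col_a = y[:, np.newaxis]
col_b = y.reshape(-1, 1)
col_c = np.expand_dims(y, axis=1)
print(col_a.shape, col_b.shape, col_c.shape)     # (10, 1) (10, 1) (10, 1)

# (10, 1) -> (10,): three ways to drop the trailing axis again
flat_a = col_a[:, 0]
flat_b = col_a.ravel()
flat_c = col_a.squeeze(axis=1)
print(flat_a.shape, flat_b.shape, flat_c.shape)  # (10,) (10,) (10,)
```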
I have a column in a Dataframe where each cell has a (300,) shaped numpy array.
When I extract the values of this column using the .values method, I get a numpy array of shape (N,) where N is the number of rows of the Dataframe. And each element of N is a (300,) array. I would have expected the extracted shape to be (Nx300).
So I would like the shape of the extracted column to be (Nx300). I tried using pd.as_matrix() but this still gets me a numpy array of shape (N,).
Any suggestions?
You can use numpy.concatenate, convert to list and cast to array:
import numpy as np
import pandas as pd
a = np.random.randint(10, size=300)
print(a.shape)
# (300,)
df = pd.DataFrame({'A': [a, a, a]})
arr = np.array(np.concatenate(df.values).tolist())
print(arr.shape)
# (3, 300)
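As a possible alternative (not part of the answer above), np.stack can join the per-row arrays along a new first axis. This sketch assumes every cell holds an array of the same length, as in the question:

```python
import numpy as np
import pandas as pd

a = np.random.randint(10, size=300)
df = pd.DataFrame({"A": [a, a, a]})

# Turn the column into a plain list of (300,) arrays, then stack
# them row by row into one 2D array.
arr = np.stack(df["A"].tolist())
print(arr.shape)   # (3, 300)
```

If the cells had different lengths, np.stack would raise a ValueError instead of producing a rectangular array.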
How does numpy's logical indexing in the plot calls get the datapoints from the "data" variable in the code snippet below? I understand that the first parameter is the x co-ordinate and the second parameter is the y co-ordinate. I am unsure of how it maps to the datapoints from the variable.
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))
# assign each sample to a cluster
idx,_ = vq(data,centroids)
# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
It's all in the shapes:
In [89]: data.shape
Out[89]: (300, 2) # data has 300 rows and 2 columns
In [93]: idx.shape
Out[93]: (300,) # idx is a 1D-array with 300 elements
idx == 0 is a boolean array with the same shape as idx. It is True wherever an element in idx equals 0:
In [97]: (idx==0).shape
Out[97]: (300,)
When you index data with idx==0, you get all rows of data where idx==0 is True:
In [98]: data[idx==0].shape
Out[98]: (178, 2)
When you index using a tuple, data[idx==0, 0], the first axis of data is indexed with the boolean array idx==0, and the second axis of data is indexed with 0:
In [99]: data[idx==0, 0].shape
Out[99]: (178,)
The first axis of data corresponds to rows, and the second axis corresponds to columns. So you get just the first column of data[idx==0]. Since the first column of data holds the x-values, this gives you the x-values of those rows in data where idx==0.
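The same indexing can be followed by hand on a tiny array. This is a sketch with made-up points and hypothetical cluster labels, standing in for the output of vq:

```python
import numpy as np

data = np.array([[0.1, 0.2],
                 [0.9, 0.8],
                 [0.3, 0.1],
                 [0.7, 0.9]])
idx = np.array([0, 1, 0, 1])   # hypothetical cluster label per row

mask = idx == 0                # True for rows 0 and 2
print(data[mask].shape)        # (2, 2): the full rows of cluster 0
print(data[mask, 0])           # [0.1 0.3]: x-values of cluster 0
print(data[mask, 1])           # [0.2 0.1]: y-values of cluster 0
```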