Fit() method, sklearn in Python

I'm new to sklearn. Could someone explain why, in the fit method of linear regression, the predictor (X) is coded like this:
X = df[['highway-mpg']]
and the response variable is coded in this form:
Y = df['price']
I'm a little confused about when to use df with double versus single brackets. Could someone explain? I tried to understand the sklearn documentation for the fit method, but it only confused me more.

Double brackets:
They are used to select multiple columns from a DataFrame, and the result is a DataFrame, which is a 2D structure.
Single brackets:
They are used to select one column from a DataFrame, and the result is a Series, which is a 1D structure.
According to the scikit-learn docs, in the fit method of LinearRegression the shape of X should be (n_samples, n_features), so we use double brackets: even with a single feature, X then has shape (n_samples, 1).
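To make the shapes concrete, here is a minimal sketch with made-up values standing in for the question's two columns:
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'highway-mpg': [27, 27, 26, 30, 22],
                   'price': [13495, 16500, 16500, 13950, 17450]})

X = df[['highway-mpg']]  # DataFrame, shape (5, 1) -> (n_samples, n_features)
Y = df['price']          # Series, shape (5,)      -> (n_samples,)

lm = LinearRegression()
lm.fit(X, Y)             # fit expects a 2D X; a 1D y is fine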

Related

Type of data output from a function

Maybe a silly question, but I am confused: how do programmers know whether a function's output is a DataFrame or a numpy array, and hence which methods can be used on it? For example, here we read a csv file using pd.read_csv, which results in a DataFrame.
import pandas as pd

d0 = pd.read_csv('train.csv') # MNIST train database (https://www.kaggle.com/c/digit-recognizer/data)
l = d0['label']
d = d0.drop('label', axis = 1)
So here d0 is a pandas.core.frame.DataFrame, and d is also a pandas.core.frame.DataFrame.
from sklearn.preprocessing import StandardScaler
standard_data = StandardScaler().fit_transform(d)
print(type(standard_data))
But how come the type of standard_data is <class 'numpy.ndarray'>?
You know by either printing the type as you have done, or by Googling for the docs of the function you are using.
See here for the docs on the fit_transform method, which says:
Returns X_new: ndarray array of shape (n_samples, n_features_new)
As a general rule of thumb, pd is pandas and returns a pandas DataFrame, whilst np and sklearn usually work with numpy arrays.
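For example, the rule of thumb is easy to confirm by checking the type at each step; a minimal sketch with a small made-up frame instead of the MNIST file:
import pandas as pd
from sklearn.preprocessing import StandardScaler

d = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
print(type(d))                                  # <class 'pandas.core.frame.DataFrame'>
print(type(StandardScaler().fit_transform(d)))  # <class 'numpy.ndarray'>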

Row-wise prediction over Pandas dataframe by passing sklearn.predict to df.apply

Assume we have a Pandas dataframe and a scikit-learn model trained (fit) using that dataframe. Is there a way to do row-wise prediction? The use case is to use the predict function to fill in empty values in the dataframe with an sklearn model.
I expected that this would be possible using the pandas apply function (with axis=1), but I keep getting dimensionality errors.
Using Pandas version '0.22.0' and sklearn version '0.19.1'.
Simple example:
import pandas as pd
from sklearn.cluster import KMeans
data = [[x,y,x*y] for x in range(1,10) for y in range(10,15)]
df = pd.DataFrame(data,columns=['input1','input2','output'])
model = KMeans()
model.fit(df[['input1','input2']],df['output'])
df['predictions'] = df[['input1','input2']].apply(model.predict,axis=1)
The resulting dimensionality error:
ValueError: ('Expected 2D array, got 1D array instead:\narray=[ 1.
10.].\nReshape your data either using array.reshape(-1, 1) if your data has
a single feature or array.reshape(1, -1) if it contains a single sample.',
'occurred at index 0')
Running predict on the whole column works fine:
df['predictions'] = model.predict(df[['input1','input2']])
However, I want the flexibility to use this row-wise.
I've tried various approaches to reshape the data first, for example:
import numpy as np

def reshape_predict(row):
    return model.predict(np.reshape(row.values, (1, -1)))

df[['input1','input2']].apply(reshape_predict, axis=1)
This just returns the input with no error, whereas I expect it to return a single column of output values (as an array).
SOLUTION:
Thanks to Yakym for providing a working solution! Trying a few variants based on his suggestion, the easiest was to simply wrap the row values in square brackets (I had tried this previously, but without the 0 index on the prediction, with no luck).
df['predictions'] = df[['input1','input2']].apply(lambda x: model.predict([x])[0],axis=1)
Slightly more verbosely, you can turn each row into a 2D array by adding a new axis to its values. You will then have to access the prediction with index 0:
df["predictions"] = df[["input1", "input2"]].apply(
lambda s: model.predict(s.values[None])[0], axis=1
)

removing NaN from a scipy sparse matrix

I have the following piece of code:
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

input_data = pd.read_csv('file_name.tsv', sep='\t')
data = sparse.csr_matrix(input_data.values)
model = TruncatedSVD(n_components=2)
model.fit(data)
Now, TruncatedSVD does take sparse matrices from scipy, but it does not take NaN. I expected the csr_matrix function to strip NaNs, but it does not, and I can't find a way to strip these NaNs from my scipy matrix.
Is there a good way to do this? I can't find a function within scipy.
I ended up setting the NaNs to zero. This is not the optimal solution, but I don't think there really is a satisfying way to impute the missing values in this instance.
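For reference, a minimal sketch of that workaround, assuming the same file and loading code as in the question:
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

input_data = pd.read_csv('file_name.tsv', sep='\t')
# fillna(0) replaces the NaNs before the sparse matrix is built; explicit
# zeros are not stored in CSR format, so the matrix stays sparse
data = sparse.csr_matrix(input_data.fillna(0).values)
TruncatedSVD(n_components=2).fit(data)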
There are several approaches to this. One, which you chose, is setting the NaNs to zero; another is to set them equal to the mean value of your data (per column, say).
An easy way to address this is with the scikit-learn Imputer:
from sklearn.preprocessing import Imputer

data_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# fit the imputer - suppose the missing data is in the 0th column
data_imputer = data_imputer.fit(data[:, 0])
# transform the data
data[:, 0] = data_imputer.transform(data[:, 0])
Note that this is a very simple example and can be improved a lot; for more info, see the scikit-learn documentation on this issue.

Auto broadcasting in Scipy

I have two np.ndarrays, data with shape (8000, 500) and sample with shape (1, 500).
What I am trying to achieve is to measure various types of metrics between every row in data and sample.
When using sklearn.metrics.pairwise.cosine_distances I was able to take advantage of numpy's broadcasting by executing the following line:
x = cosine_distances(data, sample)
But when I tried to use the same procedure with scipy.spatial.distance.cosine I got the error
ValueError: Input vector should be 1-D.
I guess this is a broadcasting issue and I'm trying to find a way to get around it.
My ultimate goal is to iterate over all of the distances available in scipy.spatial.distance that can accept two vectors and apply them to the data and the sample.
How can I replicate the broadcasting that happens automatically in sklearn in my scipy version of the code?
OK, looking at the docs, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
With (8000, 500) and (1, 500) inputs ((samples, features)), you should get back an (8000, 1) result ((samples1, samples2)).
I wouldn't describe that as broadcasting. It's more like a dot product, which performs some sort of calculation (a norm) over the features (the 500 dimension), reducing them to one value. It's more like np.dot(data, sample.T) in its handling of dimensions.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html documents a function that "Computes the Cosine distance between 1-D arrays", so it behaves more like:
for row in data:
    for s in sample:
        d = cosine(row, s)
or, since sample has only one row:
distances = np.array([cosine(row, sample[0]) for row in data])
In other words, the sklearn version does the pairwise iteration (possibly in compiled code), while the scipy.spatial version just evaluates the distance for one pair.
pairwise.cosine_similarity does
# K(X, Y) = <X, Y> / (||X||*||Y||)
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
That's the dot-like behavior I mentioned earlier, but with the normalization added.
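As for the ultimate goal of sweeping over the metrics in scipy.spatial.distance: cdist performs the pairwise iteration for you for any metric it supports, so a sketch along these lines avoids the Python-level loop (the metric names are just examples):
import numpy as np
from scipy.spatial.distance import cdist

data = np.random.rand(8000, 500)
sample = np.random.rand(1, 500)

# cdist pairs every row of data with every row of sample, giving an
# (8000, 1) result, like sklearn's cosine_distances
for metric in ['cosine', 'euclidean', 'cityblock']:
    distances = cdist(data, sample, metric=metric)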

How to plot text documents in a scatter map?

I'm using scikit-learn to perform text classification, and I'm trying to understand where the points lie with respect to my hyperplane to decide how to proceed. But I can't seem to plot the data that comes from the CountVectorizer() function. I used the following call: pl.scatter(X[:, 0], X[:, 1]) and it gives me the error: ValueError: setting an array element with a sequence.
Any idea how to fix this?
If X is a sparse matrix, you probably need X = X.todense() in order to get access to the data in the correct format. You probably want to check X.shape before doing this though, as if X is very large (but very sparse) it may consume a lot of memory when "densified".
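A minimal sketch of that check-then-convert step, with made-up documents standing in for the real corpus:
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

docs = ["the cat sat", "the dog ran", "the cat ran"]  # hypothetical documents
X = CountVectorizer().fit_transform(docs)             # scipy sparse matrix

print(X.shape)   # check the size before densifying
X = X.toarray()  # plain dense ndarray (toarray avoids the np.matrix that todense returns)

plt.scatter(X[:, 0], X[:, 1])  # first two vocabulary counts
plt.show()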
