I am trying to run T-distributed Stochastic Neighbor Embedding (t-SNE) in Jupyter, but I always get this error:
ValueError: could not convert string to float: '<Null>'
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Reading the data using pandas
df = pd.read_csv(r"E:\Field data\Output\Pixel values7.csv")
# print the first nine rows of df
print(df.head(9))
# save the labels into a variable l.
l = df['label']
# Drop the label feature and store the pixel data in d.
d = df.drop("label", axis = 1)
I get the error after this line:
# Data-preprocessing: Standardizing the data
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(df)
print(standardized_data.shape)
# TSNE
# Picking the top 1000 points as TSNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = l[0:1000]
model = TSNE(n_components = 2, random_state = 0)
# configuring the parameters
# the number of components = 2
# default perplexity = 30
# default learning rate = 200
# default Maximum number of iterations
# for the optimization = 1000
tsne_data = model.fit_transform(data_1000)
# creating a new data frame which
# help us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data = tsne_data,
columns =("Dim_1", "Dim_2", "label"))
# Plotting the result of tsne (needs seaborn; `size` was renamed to `height`)
import seaborn as sn
sn.FacetGrid(tsne_df, hue ="label", height = 6).map(
plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
I found this code online, and I am not an expert in Python, so I would appreciate any help.
Whenever I run this program on my data, I get the ValueError shown above.
If there is any other code for t-SNE that would work, please let me know.
My data look like this
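The error message indicates that at least one cell in the CSV holds the literal text `<Null>`, which `StandardScaler` cannot convert to a number. Below is a minimal sketch of one way to handle it, assuming those cells should be treated as missing values and that dropping the affected rows is acceptable. The inline CSV and its column names (`b1`, `b2`) are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real CSV: two pixel columns and a label,
# with the literal string "<Null>" in one cell.
csv_text = """b1,b2,label
0.10,0.20,a
<Null>,0.40,b
0.50,0.60,a
0.70,0.80,b
"""

# na_values tells pandas to read "<Null>" as a missing value (NaN)
# instead of keeping it as a string.
df = pd.read_csv(io.StringIO(csv_text), na_values=["<Null>"])

# Drop rows with missing values (imputing them is another option).
df = df.dropna().reset_index(drop=True)

labels = df["label"]
features = df.drop("label", axis=1)

# Every remaining feature column is numeric, so scaling no longer raises.
standardized_data = StandardScaler().fit_transform(features)
print(standardized_data.shape)
```

With a real file, the same `na_values` argument can be passed to the existing `pd.read_csv` call. Note that the scaler should be fitted on the feature columns only, not on a frame that still contains the label column.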
I am trying to build a collaborative recommendation system with the code below. I am new to deep learning, and I am stuck on an error when I try to train the model. I want to train the model on a CSV dataset. Can anyone please help me understand what's happening? I would really appreciate it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the surprise packages
from surprise import Dataset
from surprise import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV
# Import train_test_split
from surprise.model_selection import train_test_split
# Read in the prepared dataframe from the user_cleanup notebook
user_df = pd.read_csv('user_clean.csv')
user_df.head()
# Merge the two dataframes on appid
df = user_df.merge(games_df,on='appid')
df = df.drop(columns='name')
df.head()
# Let's take a look at one of the most prominent users in the dataset, user 24469287
df[df['user_id'] == 24469287]
# Let's find this users favorite games using the 1-5 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)])
# Prepare the dataframes for the surprise package
# Dataframe needs to contain 3 columns: user id, item id, and rating
# For the 1-10 scale
rating_10_df = df.filter(['user_id','appid','rating_10'])
rating_10_df = rating_10_df.sort_values(by=['user_id','appid'])
# And the 1-5 scale
rating_5_df = df.filter(['user_id','appid','rating_5'])
rating_5_df = rating_5_df.sort_values(by=['user_id','appid'])
# Confirm dataframe is set up properly (user, item, rating)
rating_10_df.head()
# initialize the reader with 1-10 rating scale
my_reader = Reader(rating_scale=(0,10))
# load the dataframe with the reader
md = Dataset.load_from_df(rating_10_df, my_reader)
%%time
# Set the parameter grid for optimization
param_grid = {
# Number of latent factors. More factors could give better results, but can also lead to overfitting
'n_factors': [50, 100, 150],
# Number of epochs. Number of iterations the algorithm will run
'n_epochs': [10, 20, 50],
# Learning rate. The speed at which algorithm learns. Larger values give faster learning, but smaller values give more accurate learning.
'lr_all': [0.005, 0.1],
'biased': [False] }
# Set GridSearchCV with 5 fold cross-validation using the FunkSVD
GS = GridSearchCV(FunkSVD, param_grid, measures=['rmse','mae','fcp'], cv=5)
# Fit the model to the data
GS.fit(md)
Based on the guide Implementing PCA in Python by Sebastian Raschka, I am building the PCA algorithm from scratch for my research. The class definition is:
import numpy as np
class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing principal components which explain the
    maximum variation of the dataset using fewer components.

    :type n_components: int, optional
    :param n_components: Number of components to consider, if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         dimension of the dataset).

    Attributes
    ==========
    :type covariance_: np.ndarray
    :param covariance_: Covariance Matrix
    :type eig_vals_: np.ndarray
    :param eig_vals_: Calculated Eigen Values
    :type eig_vecs_: np.ndarray
    :param eig_vecs_: Calculated Eigen Vectors
    :type explained_variance_: np.ndarray
    :param explained_variance_: Explained Variance of Each Principal Component
    :type cum_explained_variance_: np.ndarray
    :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components : int = None):
        """Default Constructor for Initialization"""
        self.n_components = n_components

    def fit_transform(self, X : np.ndarray):
        """Fit the PCA algorithm into the Dataset"""
        if not self.n_components:
            self.n_components = min(X.shape)
        self.covariance_ = np.cov(X.T)
        # calculate eigens
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)
        # explained variance
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse = True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)
        # define `W` as `d x k`-dimension
        self.W_ = self.eig_vecs_[:, :self.n_components]
        print(X.shape, self.W_.shape)
        return X.dot(self.W_)
Using the iris dataset as a test case, I run and visualize the PCA as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)
# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()
The output is:
To verify the output, I then used the sklearn library, as follows:
from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components
principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()
I don't understand why the output is oriented differently, with slightly different values. I studied several implementations [1, 2, 3], all of which have the same issue. My questions:
What is different in sklearn that makes the plot different? I've tried a different dataset too and got the same problem.
Is there a way to fix this issue?
I was not able to study the sklearn.decomposition.PCA implementation, as I am new to OOP concepts in Python.
The output in the blog post by Sebastian Raschka also shows a minor variation. Figure below:
When calculating an eigenvector you may change its sign and the solution will also be a valid one.
So any PCA axis can be reversed and the solution will be valid.
Nevertheless, you may wish to impose a positive correlation of a PCA axis with one of the original variables in the dataset, inverting the axis if needed.
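As a rough illustration of the sign ambiguity and of one way to pin the sign down, here is a minimal sketch on made-up data. The helper `orient_axis` is a hypothetical name introduced for this example and is not part of NumPy or sklearn. It flips an axis when its scores correlate negatively with a chosen reference column.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 200 samples, 3 correlated features.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
Xc = X - X.mean(axis=0)
cov = np.cov(Xc.T)

eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
v = eig_vecs[:, -1]                        # first principal axis
lam = eig_vals[-1]

# v and -v are both valid eigenvectors for the same eigenvalue.
assert np.allclose(cov @ v, lam * v)
assert np.allclose(cov @ (-v), lam * (-v))


def orient_axis(axis, Xc, ref_col=0):
    """Flip `axis` if its scores correlate negatively with column `ref_col`."""
    scores = Xc @ axis
    corr = np.corrcoef(scores, Xc[:, ref_col])[0, 1]
    return axis if corr >= 0 else -axis


v_oriented = orient_axis(v, Xc)
corr_after = np.corrcoef(Xc @ v_oriented, Xc[:, 0])[0, 1]
print(corr_after > 0)
```

Which reference column to use is a free choice. The convention only matters for making plots from different implementations comparable.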
The difference in values comes from sklearn's PCA using singular value decomposition (SVD). sklearn also has a function, svd_flip, that flips the signs of the PCs, which explains the flip you see.
More details on the help page:
It uses the LAPACK implementation of the full SVD or a randomized
truncated SVD by the method of Halko et al. 2009, depending on the
shape of the input data and the number of components to extract.
You can read about the relation here
We first run your example dataset:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy.linalg
iris = load_iris()
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
n_components = 4
sPCA = PCA(n_components,svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))
We now perform SVD on your centered matrix:
U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)
svdPCs = pd.DataFrame(U*S)
The results:
sklearnPCs
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505
svdPCs
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505
You can also implement it without the flip. The values will be the same up to sign, and your PCA will still be valid, as noted in the other answer.
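To make the "valid either way" point concrete, here is a small sketch that computes the scores two ways, by eigendecomposition of the covariance matrix and by SVD of the centred data with no flip, and checks that they agree up to the sign of each column. It uses made-up random data rather than iris so that it depends on NumPy only.

```python
import numpy as np

rng = np.random.default_rng(42)
# Made-up data with a clearly different variance in each direction.
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)                  # centre the data first

# Route 1: eigendecomposition of the covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eig_vals)[::-1]       # sort by descending eigenvalue
eig_pcs = Xc @ eig_vecs[:, order]

# Route 2: SVD of the centred data, with no sign flip applied.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_pcs = U * S

# Columns agree up to a possible sign change per component.
same_up_to_sign = np.allclose(np.abs(eig_pcs), np.abs(svd_pcs))
print(same_up_to_sign)
```

Two details matter for the comparison: the data is centred before projecting, and the eigenvectors are reordered by descending eigenvalue, since the order returned by the eigen-solver is not guaranteed to match the order of the singular values.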
I'm trying to build a slightly more flexible kNN input script than the tutorials based on the iris dataset, but I'm having trouble (I think) adding the matching second dimension to the numpy array in #6, and then with the fitting in #11.
File "G:\PROGRAMMERING\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 212, in check_consistent_length
" samples: %r" % [int(l) for l in lengths]) ValueError: Found input variables with inconsistent numbers of samples: [150, 1]
x is (150,5) and y is (150,1). Both have 150 samples, but they differ in the number of fields. Is this the problem, and if so, how do I fix it?
#1. Loading the Pandas libraries as pd
import pandas as pd
import numpy as np
#2. Read data from the file 'custom.csv' placed in your code directory
data = pd.read_csv("custom.csv")
#3. Preview the first 5 lines of the loaded data
print(data.head())
print(type(data))
#4.Test the shape of the data
print(data.shape)
df = pd.DataFrame(data)
print(df)
#5. Convert non-numericals to numericals
print(df.dtypes)
# Any object should be converted to numerical
df['species'] = pd.Categorical(df['species'])
df['species'] = df.species.cat.codes
print("outcome:")
print(df.dtypes)
#6.Convert df to numpy.ndarray
np = df.to_numpy()
print(type(np)) #this should state <class 'numpy.ndarray'>
print(data.shape)
print(np)
x = np.data
y = [df['species']]
print(y)
#K-nearest neighbor (find closest) - search for the K nearest observations in the dataset
#The model calculates the distance to all, and selects the K nearest ones.
#8. Import the class you plan to use
from sklearn.neighbors import (KNeighborsClassifier)
#9. Pick a value for K
k = 2
#10. Instantiate the "estimator" (make an instance of the model)
knn = KNeighborsClassifier(n_neighbors=k)
print(knn)
#11. fit the model with data/model training
knn.fit(x, y)
#12. Predict the response for a new observation
print(knn.predict([[3, 5, 4, 2]]))
This is how I used the scikit-learn KNeighborsClassifier to fit the knn model:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
df = datasets.load_iris()
X = pd.DataFrame(df.data)
y = df.target
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X,y)
print(knn.predict([[6, 3, 5, 2]]))
#prints output class [2]
print(knn.predict([[3, 5, 4, 2]]))
#prints output class [1]
You don't need to convert the DataFrame to a numpy array; you can fit the model directly on the DataFrame. Also, when converting the DataFrame to a numpy array you named the result np, which shadows the numpy module imported at the top with import numpy as np.
The prediction input has 4 columns, so the fifth column, 'species', is the one being predicted. Also, if 'species' is the target, it cannot be passed to the knn as an input feature at the same time. The pop call removes this column from the DataFrame df.
#npdf = df.to_numpy()
df = df.apply(lambda x:pd.Series(x))
y = np.asarray(df['species'])
#removes the target from the sample
df.pop('species')
x = df.to_numpy()
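Putting the pieces together, here is a minimal end-to-end sketch of separating the target from the features so that the shapes line up. It assumes the CSV has four numeric feature columns plus a 'species' column, and uses sklearn's bundled iris data as a stand-in for custom.csv. The column names `f1` to `f4` are made up.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for custom.csv (assumption): four numeric feature columns
# plus a categorical 'species' column.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=["f1", "f2", "f3", "f4"])
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Target: one label per sample, shape (150,), not a list wrapping a Series.
y = df["species"].cat.codes.to_numpy()

# Features: everything except the target, shape (150, 4).
x = df.drop(columns=["species"]).to_numpy()

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(x, y)

# A new observation needs the same four feature columns.
prediction = knn.predict([[3, 5, 4, 2]])
print(x.shape, y.shape, prediction)
```

Wrapping the Series in a list, as in `y = [df['species']]`, produces a single "sample" holding 150 values, which is where the `[150, 1]` in the error message comes from.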
I am using the LassoCV() model for feature selection. It gives me the following warning and does not select any features:
"C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict.
UserWarning)"
The code is given below.
The data is in https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
# dataset URL = https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
dataframe = pd.read_csv('Brewer Friend Beer Recipes.csv', encoding = 'latin')
# Encoding the non numerical columns
def encoding_data(dataframe):
    if(dataframe.dtype == 'object'):
        return LabelEncoder().fit_transform(dataframe.astype(str))
    else:
        return dataframe
# Feature Selection using the selected Target Feature
def feature_selection(raw_dataframe, target_feature_list):
    output_list = []
    # preprocessing Converting Categorical data into Numeric Data
    dataframe = raw_dataframe.apply(encoding_data)
    column_list = dataframe.columns.tolist()
    dataframe = dataframe.dropna()
    for target in target_feature_list:
        target_feature = target
        x = dataframe.drop(columns=[target_feature])
        y = dataframe[target_feature].values
        # Lasso feature selection
        estimator = LassoCV(cv = 3, n_alphas = 1)
        featureselection = SelectFromModel(estimator)
        featureselection.fit(x,y)
        features = featureselection.transform(x)
        feature_list = x.columns[featureselection.get_support()]
        features = ''
        features = ', '.join(feature_list)
        l = (target,features)
        output_list.append(l)
    output_df = pd.DataFrame(output_list,columns = ['Name','Selected Features'])
    print('\nThe Feature Selection is done with the respective target feature(s)')
    return output_df
print(feature_selection(dataframe, ['BrewMethod']))
I get the warning shown above and no features are selected.
Any idea how to fix this?
If no features have been selected you can gradually decrease lambda (or in scikit's case alpha). This will reduce the penalization and probably return some nonzero coefficients.
It is extremely unusual that no coefficients have been selected. You should think about checking correlations in your data. Maybe you have a lot of collinearity.
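A small sketch of the effect of decreasing alpha, on synthetic data rather than the beer recipes file: it computes the smallest alpha at which Lasso zeroes every coefficient, then refits with progressively smaller values and counts the selected features. The dataset, the alpha grid, and the explicit `threshold=1e-5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 10 features, 3 of them informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# Smallest alpha at which Lasso sets every coefficient to zero.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y)

alphas = (10 * alpha_max, alpha_max / 10, alpha_max / 100)
selected = []
for alpha in alphas:
    selector = SelectFromModel(Lasso(alpha=alpha, max_iter=10000),
                               threshold=1e-5)
    selector.fit(X, y)
    count = int(selector.get_support().sum())
    selected.append(count)
    print(f"alpha={alpha:.4g}: {count} feature(s) selected")
```

In the question's code, `n_alphas = 1` leaves LassoCV with a single candidate alpha, so the cross-validation has nothing to choose from. Allowing a proper grid of alphas may be enough on its own, though that is a guess without running it on the actual data.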
I'm running the code below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
train=pd.read_csv('C:\\path_here\\train.csv')
test=pd.read_csv('C:\\path_here\\test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function
ID_col = ['REF_NO']
target_col = ['Status']
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols= list(set(list(fullData.columns)))
other_col=['Type'] #Test and Train Data set identifier
fullData.isnull().any()#Will return the feature with True or False,True means have missing value else False
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
#Create a new variable for each variable having missing value with VariableName_NA
# and flag missing value with 1 and other with 0
for var in num_cat_cols:
    if fullData[var].isnull().any()==True:
        fullData[var+'_NA']=fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())
#Impute categorical missing values with 0
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
#Target variable is also a categorical so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
It seems to run endlessly on this line:
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
I can't get it past that spot. How can I see what's happening in the background? Is there some way to see the work that's being done? Thanks.
One way to check how far the code is getting is to add print statements. For example, you can add this right before the label encoder:
print("Code got before label encoder")
Then add another print statement after that code block. Your console will show exactly where the code gets stuck, so you can debug that specific line.
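The same idea can be wrapped in a small helper so that each stage reports when it starts and how long it took. This is only a sketch: `step` is a hypothetical helper name, and the two `with` blocks contain placeholder work where the real stages (for example the fillna call or the label-encoding loop) would go.

```python
import time
from contextlib import contextmanager


@contextmanager
def step(name):
    """Print when a named step starts and how long it took."""
    print(f"[start] {name}", flush=True)
    started = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - started
        print(f"[done]  {name} ({elapsed:.2f}s)", flush=True)


completed = []

# Placeholder work; in the real script each block would wrap one stage.
with step("impute categorical columns"):
    completed.append("impute")

with step("label encoding"):
    completed.append("encode")
```

If a "[start]" line appears without a matching "[done]" line, that stage is the one still running. Passing `flush=True` makes the message show up immediately rather than sitting in the output buffer.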