Confuse why my KNN code is throwing a ValueError

Confuse why my KNN code is throwing a ValueError - python

I am using sklearn for KNN regressor:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv")#pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
a = theta.get("YearBuilt")
b = theta.get("YrSold")
A = a.to_frame()
B = b.to_frame()
glasses = [A,B]
x = pd.concat(glasses)
#getting target data
y = theta.get("SalePrice")
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y)
I get this error message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could someone please explain this? My data is in the hundred thousands for target and the thousands for input. And there is no blanks in the data.

Before answering the question, Let me refactor the code. You are using a dataframe so you can index single or muliple fields of the dataframe without going through the extra steps you've used:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv") # pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
x = theta[["YearBuilt", "YrSold"]] # index multiple fields
#getting target data
y = theta["SalePrice"] # index single field
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y) # fit KNN
Regarding your error, it indicates that you have some NaN, Inf, large values in your data. You can ensure these doesnt occur by filtering out the NaN and inf values using this:
theta = theta.replace([np.inf, -np.inf], np.nan)
theta.dropna(inplace=True)

Related

T-distributed Stochastic Neighbor Embedding (t-SNE)

I am trying to run T-distributed Stochastic Neighbor Embedding (t-SNE) in Jupyter but always facing a issue with
ValueError: could not convert string to float: '<Null>'
Code:
enter image description here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Reading the data using pandas
df = pd.read_csv("E:\\Field data\Output\\Pixel values7.csv")
# print first five rows of df
print(df.head(9))
# save the labels into a variable l.
l = df['label']
# Drop the label feature and store the pixel data in d.
d = df.drop("label", axis = 1)
I got error after this line
# Data-preprocessing: Standardizing the data
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(df)
print(standardized_data.shape)
# TSNE
# Picking the top 1000 points as TSNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = labels[0:1000]
model = TSNE(n_components = 2, random_state = 0)
# configuring the parameters
# the number of components = 2
# default perplexity = 30
# default learning rate = 200
# default Maximum number of iterations
# for the optimization = 1000
tsne_data = model.fit_transform(data_1000)
# creating a new data frame which
# help us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data = tsne_data,
columns =("Dim_1", "Dim_2", "label"))
# Plotting the result of tsne
sn.FacetGrid(tsne_df, hue ="label", size = 6).map(
plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
I got this link from somewhere, I am not expert in python. I request you to kindly help me out.
I am trying to run this program for my data but always getting a error
ValueError: could not convert string to float: '<Null>'
If there is any other code for T-distributed Stochastic Neighbor Embedding (t-SNE). Please let me know.
My data look like this

Data imputation in Python for Google Analytics data

I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining the integrity of the data as I plan to plot these sets and compare the curves of different sets to each-other over time.
Example
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried to use scikit-learn's KNN-Imputer and Iterative Imputer but I'm either miss-understanding how these imputers are supposed to be used or they're not the correct for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values
da = da.reshape(-1,1)
imputer = IterativeImputer(n_nearest_features = 100, max_iter = 10)
df_imputed = imputer.fit_transform(da)
df_imputed.reshape(1,-1)
df.Views = df_imputed
df
All of the NaN values are calculated to be the exact same number from what I have currently implemented.
Any help would be greatly appreciated.

The problem here was I reshaping the array. My data was just a 1D array of values so I was making it 2D by reshaping the array which was causing all the NaN values to be calculated as the same. When I added an index column and included this as an input to the imputer the values were calculated correctly.I also ended up using a KNN imputer from sklearn instead of the iterative imputer in this instance.

pandas dataframe conditional selecting

I have a Dataframe and im trying to apply some ml algorithms on it.
im using pandas to handle it but im having several problems with it:
as you see in the 3rd cell, i have splitted Y into Ytr and Yts. after this the dataframe losses its column names. I've tried to name the column again but it doesn't work.
in the 4th cell, Im trying to use conditional statement to create a subset of Y in which Y values are 1(it is named ytr1). but it returns an empty dataframe.
any suggestions on the whole code would be really appreciated since im not really experienced with Pandas
note: if you haven't worked with jupyter notebook, #%% just means a new cell.
#%%
from pandas import DataFrame as df
import random
import numpy as np
import pandas as pd
import re
#%%
# Preparing the DataFrame
labels = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\labels.csv', header=None)
ll = labels.loc[:, 0].tolist()
data = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\pima-indians-diabetes2.csv', names=ll)
i = data.columns.values.tolist() # i is the labels of the csv file
i[-1]
#%%
# Spliting the Dataset
X = data.drop(i[-1], axis=1)
Y = data.iloc[:, 8]
Y = Y.to_frame()
Y = pd.DataFrame(Y.values.reshape(-1, 1), columns=i[-1])
tr_idx = data.sample(frac=0.7).index
Xtr = df(X[X.index.isin(tr_idx)])
Xts = df(X[~X.index.isin(tr_idx)])
Ytr = df(Y[X.index.isin(tr_idx)], columns='result')
Yts = df(Y[~X.index.isin(tr_idx)], columns=i[-1])
#%%
# splitting the Classes
ytr1 = Ytr.drop(Ytr[Ytr.iloc[0]!=1].index)
X: all the columns except Labels\classes which are 0 or 1
Y: last column of the csv files that are loaded as labels
Xtr: fraction of X that Im planning to use for training
Xts: fraction of X that Im planning to use for testing

Python running extremely slow one one line of code

I'm running the code below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
train=pd.read_csv('C:\\path_here\\train.csv')
test=pd.read_csv('C:\\path_here\\test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function
ID_col = ['REF_NO']
target_col = ['Status']
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols= list(set(list(fullData.columns)))
other_col=['Type'] #Test and Train Data set identifier
fullData.isnull().any()#Will return the feature with True or False,True means have missing value else False
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
#Create a new variable for each variable having missing value with VariableName_NA
# and flag missing value with 1 and other with 0
for var in num_cat_cols:
if fullData[var].isnull().any()==True:
fullData[var+'_NA']=fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean(),inplace=True)
#Impute categorical missing values with 0
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
#create label encoders for categorical features
for var in cat_cols:
number = LabelEncoder()
fullData[var] = number.fit_transform(fullData[var].astype('str'))
#Target variable is also a categorical so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
It seems to run, endlessly, in this line.
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
I can't get it past that spot. how can I see what's happening in the background? Is there some way to see the work that's being done? Thanks.

One way to check where to code is getting to is to add print statements. For example you can add (right before the label encoder):
print("Code got before label encoder")
And then after that code block add another print statement. You can see in your console exactly where the code is getting stuck and debug that specific line.

Why does this return 'Too Many Indexers'?

My code is:
import pandas as pd
import numpy as np
from sklearn import svm
name = '../CLIWOC/CLIWOC15.csv'
data = pd.read_csv(name)
# Get info into dataframe and drop NaNs
data = pd.concat([data.UTC, data.Lon3, data.Lat3, data.Rain]).dropna(how='any')
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']]
y = data['Rain']
# Partition a test set
Xtest = X[-1]
ytest = y[-1]
X = X[1:-2]
y = y[1:-2]
# Train classifier
classifier = svm.svc(gamma=0.01, C=100.)
classifier.fit(X, y)
classifier.predict(Xtest)
y
Arriving at the 'set target' section, the compiler returns the error 'Too Many Indexers'. I lifted this syntax directly from the documentation, so I'm unsure what could be wrong.
The csv is organized with these headers for columns of data.

Without your data, it is hard to verify. My immediate suspicion, however, is that you need to pass a numpy array instead of a DataFrame.
Try this to extract them:
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']].values
y = data['Rain'].values

Use data.loc[['UTC', 'Lon3', 'Lat3']]. This will also work in iloc method as well.
Do not use like data.loc[:, 0] etc...

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Confuse why my KNN code is throwing a ValueError - python

Related

T-distributed Stochastic Neighbor Embedding (t-SNE)

Data imputation in Python for Google Analytics data

pandas dataframe conditional selecting

Python running extremely slow one one line of code

Why does this return 'Too Many Indexers'?

Categories

Resources