Data Imputation Methods - python

I want to run an evaluation of imputation methods on my own data, rather than on the California Housing data used in the following sklearn example:
https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
I know I can remove the following import
from sklearn.datasets import fetch_california_housing
but I don't know how to load my own data (a *.csv file) for the evaluation, or to what extent the code below needs to be modified.
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
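For reference, here is a minimal sketch of how the loading step could be adapted to a CSV file. The file name my_data.csv and the column name target are placeholders for your own file and label column; the rest of the example should work unchanged as long as X_full and y_full end up as numeric arrays of the same length.
import numpy as np
import pandas as pd

N_SPLITS = 5
rng = np.random.RandomState(0)

# Load your own data instead of fetch_california_housing.
# "my_data.csv" and the "target" column are placeholders for your file and label column.
df = pd.read_csv("my_data.csv")

# Features: every column except the target; labels: the target column.
X_full = df.drop(columns=["target"]).to_numpy()
y_full = df["target"].to_numpy()

n_samples, n_features = X_full.shape
The subsampling lines (X_full = X_full[::10] and y_full = y_full[::10]) only exist to keep the example fast, so they can simply be dropped for your own dataset.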

Related

PLS AttributeError: 'function' object has no attribute 'fit'

(image of the error attached) I am trying to build a collaborative recommendation system with the code below. I am new to deep learning, and I am stuck on this error when I try to train the model. I want to train the model with a CSV dataset. Can anyone please help me understand what's happening? I would really appreciate it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the surprise packages
from surprise import Dataset
from surprise import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV
# Import train_test_split
from surprise.model_selection import train_test_split
# Read in the prepared dataframe from the user_cleanup notebook
user_df = pd.read_csv('user_clean.csv')
user_df.head()
# Merge the two dataframes on appid
# (games_df is assumed to have been prepared elsewhere, alongside user_df)
df = user_df.merge(games_df, on='appid')
df = df.drop('name', axis=1)
df.head()
# Let's take a look at one of the most prominent users in the dataset, user 24469287
df[df['user_id'] == 24469287]
# Let's find this user's favorite games using the 1-5 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)])
# Prepare the dataframes for the surprise package
# Dataframe needs to contain 3 columns: user id, item id, and rating
# For the 1-10 scale
rating_10_df = df.filter(['user_id','appid','rating_10'])
rating_10_df = rating_10_df.sort_values(by=['user_id','appid'])
# And the 1-5 scale
rating_5_df = df.filter(['user_id','appid','rating_5'])
rating_5_df = rating_5_df.sort_values(by=['user_id','appid'])
# Confirm dataframe is set up properly (user, item, rating)
rating_10_df.head()
# initialize the reader with 1-10 rating scale
my_reader = Reader(rating_scale=(0,10))
# load the dataframe with the reader
md = Dataset.load_from_df(rating_10_df, my_reader)
%%time
# Set the parameter grid for optimization
param_grid = {
    # Number of latent factors. More factors can give better results, but can also lead to overfitting
    'n_factors': [50, 100, 150],
    # Number of epochs, i.e. the number of iterations the algorithm will run
    'n_epochs': [10, 20, 50],
    # Learning rate. Larger values give faster learning, smaller values give more accurate learning
    'lr_all': [0.005, 0.1],
    'biased': [False]
}
# Set GridSearchCV with 5 fold cross-validation using the FunkSVD
GS = GridSearchCV(FunkSVD, param_grid, measures=['rmse','mae','fcp'], cv=5)
# Fit the model to the data
GS.fit(md)
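For comparison, here is a minimal, self-contained sketch of the surprise GridSearchCV workflow on a small made-up in-memory ratings DataFrame (the toy data and column names are placeholders). If a stripped-down version like this fits without the error, the problem is more likely in the data preparation than in the search setup.
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import GridSearchCV

# Toy ratings frame with the three columns surprise expects: user, item, rating.
ratings = pd.DataFrame({
    'user_id':   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'appid':     [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
    'rating_10': [7, 9, 5, 8, 6, 9, 4, 7, 8, 6],
})

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(ratings[['user_id', 'appid', 'rating_10']], reader)

param_grid = {'n_factors': [10, 20], 'n_epochs': [5, 10], 'biased': [False]}

# GridSearchCV expects the algorithm *class* (here SVD, i.e. FunkSVD), not an instance.
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=2)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])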

How Do I Create My Own Datasets To Work With Sklearn

I have found a piece of sklearn code that I find relatively straightforward to run on the example Iris dataset, but how do I create my own dataset in the same format?
iris.data - contains the data measurements of three types of flower
iris.target - contains the labels of three types of flower
e.g. rather than analysing the three types of flower in the Iris Dataset, I would like to make my own datasets that follow this format and that I can pass through the code.
example_sports.data - contains the data measurements of three types of sports players
example_sports.target - contains the labels of three types of sport
from sklearn.datasets import load_iris #load inbuilt dataset from sklearn
iris = load_iris() #assign variable name iris to inbuilt dataset
iris.data # contains the four numeric variables, sepal length, sepal width, petal length, petal width
print(iris.data) #printing the measurements of the Iris Dataset below
iris.target # relates to the different species shown as 0,1,2 for the three different
# species of Iris, a categorical variable, basically a label
print(iris.target)
The full code can be found at https://www.youtube.com/watch?v=asW8tp1qiFQ
sklearn datasets are stored in a Bunch, which is basically just a type of dict. The data and targets are just NumPy arrays and can be fed into the fit() method of whichever sklearn estimator you are interested in. But if you want to package your own data as a Bunch, you can do something like the following:
from sklearn.utils import Bunch
import numpy as np
example_sports = Bunch(data = np.array([[1.0,1.2,2.1],[3.0,2.3,1.0],[4.5,3.4,0.5]]), target = np.array([3,2,1]))
print(example_sports.data)
print(example_sports.target)
Naturally, you can read your own custom lists into the data and target entries of the Bunch. Pandas is a good tool if you have the data in Excel/CSV files.
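For example, here is a sketch of building such a Bunch from a CSV file with pandas, assuming a hypothetical sports.csv whose label column holds the class of each player (both names are placeholders):
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

# "sports.csv" and the "label" column are placeholders for your own file and target column.
df = pd.read_csv("sports.csv")

example_sports = Bunch(
    data=df.drop(columns=["label"]).to_numpy(),       # numeric measurements
    target=df["label"].to_numpy(),                    # class labels
    feature_names=df.columns.drop("label").tolist(),  # optional, mirrors iris.feature_names
)

print(example_sports.data[:5])
print(example_sports.target[:5])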
Try using the type() command whenever you are stuck. In this case it shows you that iris is a Bunch object. Then you can search the documentation of that class on the web and understand how to use it.
The following will help you.
import numpy as np
import pandas as pd
from sklearn.utils import Bunch
b = Bunch(a=1, b="text", c=pd.Series(np.arange(5)), d=np.asarray([0, 8, 9]))
b.c

sklearn cross validation : The least populated class in y has only 1 members, which is less than n_splits=10

I'm working on a machine learning project and I'm stuck with this warning when I try to use cross-validation to find out how many neighbours I need to achieve the best accuracy in kNN; here's the warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
The dataset I'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance
The dataset has several attributes, but we'll be using only "G1", "G2", "G3", "studytime", "freetime", "health", "famrel". All the values in those columns are integers.
https://i.stack.imgur.com/sirSl.png (dataset example)
Next, here's my first chunk of code, where I assign the train and test groups:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/gdrive')
import sklearn
data=pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")
#print(data.head())
data = data[["G1", "G2", "G3", "studytime","freetime","health","famrel"]]
print(data)
predict = "G3"
x = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
print(y)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))
That's how I assign x and y. Using len, I can see that x and y both have 649 rows, representing 649 students.
Here's the second chunk of code, where I do the cross-validation:
# CROSS-VALIDATION
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

neighbors = list(range(2, 30))
cv_scores = []
# print(y_train)
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=11, scoring='accuracy')
    cv_scores.append(scores.mean())
plt.plot(cv_scores)
plt.show()
The code is pretty self-explanatory, as you can tell.
The warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
happens in every iteration of the for-loop.
Although the warning happens every time, plt.show() still plots a graph of which number of neighbours gives the best accuracy; I don't know whether the plot, or the readings in cv_scores, are accurate.
My question is:
How can a "class in y" have only 1 member, when len(y) clearly says y has 649 instances, more than enough to be split into 59 groups of 11 members each? By "members", is it referring to instances in my dataset, or to columns/labels in the y group?
I'm not using stratify=y when I do the train/test split; it seems to be the #1 suggested fix for this warning, but it's useless in my case.
I've tried everything I've seen on Google/Stack Overflow and nothing has helped; the dataset seems to be the problem, but I can't understand what's wrong.
I think your main mistake is that you are using KNeighborsClassifier while the variable you want to predict is continuous (G3 - final grade, numeric from 0 to 20, the output target) rather than categorical.
In this case, every single value of y is taken as a different possible class or label. The message you get is saying that in your dataset (in y) there are values that appear only once. For example, the value 3 appears only one time in your dataset. This is not an error, but it indicates that the model won't work correctly or accurately.
So I strongly recommend that you use sklearn.neighbors.KNeighborsRegressor.
This is the kNN used for continuous variables rather than classes. Using this model, you shouldn't have this problem anymore. The predicted value will be the mean of the target values of the nearest neighbors you defined.
With this simple change, your problem should be solved.
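A minimal sketch of how the cross-validation loop could look with the regressor. Note that 'accuracy' is a classification metric, so a regression score such as R² has to be used instead; x_train and y_train are the arrays from the question's code.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

neighbors = list(range(2, 30))
cv_scores = []

for k in neighbors:
    knr = KNeighborsRegressor(n_neighbors=k)
    # 'r2' matches the regressor's default score; 'neg_mean_squared_error' would also work.
    scores = cross_val_score(knr, x_train, y_train, cv=11, scoring='r2')
    cv_scores.append(scores.mean())

plt.plot(neighbors, cv_scores)
plt.xlabel('n_neighbors')
plt.ylabel('mean CV R^2')
plt.show()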

Sklearn Lasso regularization not dropping out random variables?

I've been using SelectFromModel from sklearn to reduce features using LASSO regularization, and I'm finding that even when I set max_features quite low (low enough to negatively impact performance), the random variables are often kept.
I generated an example with fake data to illustrate, but I'm seeing similar behaviour with actual real data and I am trying to understand why.
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from numpy import random
data = datasets.make_classification(n_features = 20, n_informative = 20, n_redundant = 0,n_samples= 1000, random_state = 3)
X = pd.DataFrame(data[0] )
y = data[1]
X['rand_feat1'] = random.randint(100, size=(X.shape[0]))
X['rand_feat2'] = random.randint(100, size=(X.shape[0]))/100
embeded_lr_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver='liblinear', random_state=3),
    max_features=10)
embeded_lr_selector.fit(X, y)
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')
print('Features kept', embeded_lr_feature)
Even though I've set 20 variables to be informative and added 2 completely random ones, in many cases this will keep rand_feat2 when selecting the top 10 or even top 5 features. On a side note, I get different results even with the random state set, and I'm not sure why. But the point is that fairly often a random variable is included as a top 5 feature. I'm seeing similar behaviour with real-world data, where I have to get rid of a huge chunk of the variables before the random feature disappears, which makes me seriously doubt how reliable this is. How do I explain this?
EDIT:
Adding a screenshot along with the sklearn/pandas versions printed. I'm not always getting the random features included, but if I run it a few times they show up. On my real-world dataset at least one is almost always included, even after removing about half the variables.
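On the side note about run-to-run variation: the two random columns are drawn from NumPy's global random state, which is never seeded in the snippet, so they change on every run even though the LogisticRegression random_state is fixed; that is most likely where the variation comes from. Here is a sketch of the same setup with a seeded generator (and string feature names, which newer sklearn versions prefer), which should make the selection repeatable:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Seed a dedicated generator so the "random" features are identical on every run.
rng = np.random.RandomState(0)

data = datasets.make_classification(n_features=20, n_informative=20, n_redundant=0,
                                     n_samples=1000, random_state=3)
X = pd.DataFrame(data[0], columns=[f"feat_{i}" for i in range(20)])
y = data[1]
X['rand_feat1'] = rng.randint(100, size=X.shape[0])
X['rand_feat2'] = rng.randint(100, size=X.shape[0]) / 100

selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver='liblinear', random_state=3),
    max_features=10)
selector.fit(X, y)
print('Features kept:', X.columns[selector.get_support()].tolist())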

Can somebody tell me what the last loop is doing ? [closed]

import os
import tarfile
from six.moves import urllib
import pandas as pd
import hashlib
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
# getting the housing data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
# that function loads the data into a pandas DataFrame object
#need to call the function to get the housing data
fetch_housing_data()
housing = load_housing_data()
housing.head()
# total_bedrooms doesn't match the number of entries; deal with that later
# ocean_proximity holds an object; since it comes from a CSV file it can still contain text
housing.describe()
#describes the output of the housing information
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()
# creates a histogram of the data set; the x axis is the range of housing prices,
# the y axis the number of instances of housing prices in that given range
# income data has been scaled (capped at 15 at the top and 0.5 at the bottom)
# since the housing prices have been capped at 500k, possibly delete that data
# so our model won't learn those bad values, because the true value may not be 500k and the labels could be off
# tail heavy because, for example, at 200K+ just barely a dollar more would push it into the next bucket (left)
import numpy as np
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    # a randomized array with the same length as the input data, so all data is covered
    test_set_size = int(len(data) * test_ratio)
    # multiplying by a ratio to get the size of the test set
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    # taking the test set from the beginning of the shuffled indices
    # taking the rest for training
    return data.iloc[train_indices], data.iloc[test_indices]
# redo the variable since it's outside the cells
housing = load_housing_data()
# creating a category of income prices that is stratified
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
# since now the income has been put into categories
# stratified so the split is representative of the population
split = StratifiedShuffleSplit(n_splits=1,test_size = 0.2,random_state=42)
This is the loop at the end of the code:
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Can someone please explain to me what the last for loop is doing? It is supposed to stratify the data into train and test sets, but I am confused, especially by the loop header: why is the whole DataFrame the first argument, and then the income category column? Is it stratifying with respect to the income categories that were created, and thereby splitting all the other columns of the DataFrame accordingly?
I'm sure you already read: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit.split
So split() takes two arguments:
housing: the training data, of shape (n_samples, n_features).
housing["income_cat"]: the target variable for supervised learning problems. Stratification is done based on these y labels.
and it yields tuples with 2 entries (each an ndarray of indices):
First entry: The training set indices for that split.
Second entry: The testing set indices for that split.
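A tiny self-contained sketch (with made-up data) of what the loop unpacks: since n_splits=1 the generator yields a single (train_indices, test_indices) pair, stratified on the second argument, and the loop body just uses those index arrays to slice the DataFrame.
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame standing in for `housing`; `cat` plays the role of income_cat.
df = pd.DataFrame({
    "value": range(10),
    "cat":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(df, df["cat"]):
    # train_index / test_index are arrays of row positions, chosen so that
    # each value of `cat` keeps its overall proportion in both sets.
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]

print(strat_test_set["cat"].value_counts())  # one row from each category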
