Can somebody tell me what the last loop is doing? [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
import os
import tarfile
from six.moves import urllib
import pandas as pd
import hashlib
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
# getting the housing data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

# that function loads the data into a pandas DataFrame object
# need to call the functions to get the housing data
fetch_housing_data()
housing = load_housing_data()
housing.head()
# total_bedrooms doesn't match the number of entries; deal with that later
# ocean_proximity holds an object dtype; since the data comes from a CSV file it can still contain text
housing.describe()
# shows summary statistics of the housing information
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()
# creates a histogram for each numerical attribute: the x axis is the range of values
# (e.g. housing prices) and the y axis is the number of instances in each bin
# the income data has been scaled, capped at 15 for the upper end and 0.5 for the lower
# since housing prices have been capped at 500k, possibly delete those records
# so our model won't learn those bad values: the true price may be above 500k, so the labels could be off
# many histograms are tail heavy: they extend much farther to the right of the median than to the left
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    # a randomized index array with the same length as the input data, covering every row
    test_set_size = int(len(data) * test_ratio)
    # multiply the length by the ratio to get the size of the test set
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    # take the test set from the beginning of the shuffled indices
    # take the rest for training
    return data.iloc[train_indices], data.iloc[test_indices]
# reload the variable since it's outside the earlier cells
housing = load_housing_data()
# creating an income category attribute so the split can be stratified
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
# the income has now been bucketed into categories
# stratified because a plain random split may not be representative of the population
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
This is the loop at the end of the code:
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Can someone please explain what the last for loop is doing? It's supposed to stratify the data into train and test sets, but I am confused, especially by the loop header: why is the whole DataFrame object the first parameter, followed by the income category column? Is it stratifying with reference to the income categories I created, and thus splitting all the other columns of the DataFrame accordingly?

I'm sure you have already read: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit.split
split takes two arguments:
housing: the training data, of shape (n_samples, n_features).
housing["income_cat"]: the target variable for supervised learning problems; stratification is done based on these y labels, so each fold keeps the same proportion of each income category as the full dataset.
It returns a generator that yields one tuple per split, with two entries (each an ndarray):
First entry: the training set indices for that split.
Second entry: the testing set indices for that split.
So yes: the split is stratified with reference to the income categories, and the resulting index arrays are then used to pull the corresponding rows (all columns) out of the whole DataFrame with .loc.
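A minimal sketch with made-up data may help show why the assignment sits inside a loop: split is a generator yielding one (train_index, test_index) pair per split, and with n_splits=1 the loop body runs exactly once:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Tiny made-up frame: 8 rows, two income categories (not the real housing data)
df = pd.DataFrame({"value": range(8),
                   "income_cat": [1, 1, 1, 1, 2, 2, 2, 2]})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
for train_index, test_index in split.split(df, df["income_cat"]):
    # train_index / test_index are ndarrays of row positions
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]

# Both folds preserve the 50/50 category proportions of the full frame
print(strat_test_set["income_cat"].value_counts())

Note that housing.loc[train_index] works in the question's code only because the DataFrame has its default integer index; with any other index, .iloc would be the safer choice.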

Related

Data Imputation Methods

I want to run an evaluation of imputation methods on my own data, rather than on the California Housing data used on the following sklearn page:
https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
I can remove the following code
from sklearn.datasets import fetch_california_housing
but don't know how to add my data (as a *.csv file) for evaluation and to what extent the code below needs to be modified.
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
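No answer is shown here, but the only piece the example really needs replaced is the loading step. A minimal sketch, assuming a hypothetical file my_data.csv whose target column is named target (adjust both names to your data):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Hypothetical file and column names; substitute your own
df = pd.read_csv("my_data.csv")
y_full = df["target"].to_numpy()                 # prediction target
X_full = df.drop(columns=["target"]).to_numpy()  # all remaining columns as features

n_samples, n_features = X_full.shape

The rest of the sklearn example should run unchanged, since it only relies on X_full and y_full being numeric arrays; the X_full[::10] subsampling lines can simply be dropped if your dataset is already small. Note that the example's imputers expect numeric features, so any categorical columns would need encoding first.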

PLS AttributeError: 'function' object has no attribute 'fit'

[Image of the error.] I am trying to build a collaborative recommendation system with the code below. I am new to deep learning, and I am stuck on this error when I try to train the model. I want to train the model on a CSV data set. Can anyone please help me understand what's happening? I would really appreciate it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the surprise packages
from surprise import Dataset
from surprise import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV
# Import train_test_split
from surprise.model_selection import train_test_split
# Read in the prepared dataframe from the user_cleanup notebook
user_df = pd.read_csv('user_clean.csv')
user_df.head()
# Merge the two dataframes on appid
# (games_df is assumed to have been prepared elsewhere; it is not defined in the code shown)
df = user_df.merge(games_df, on='appid')
df = df.drop('name', axis=1)
df.head()
# Let's take a look at one of the most prominent users in the dataset, user 24469287
df[df['user_id'] == 24469287]
# Let's find this users favorite games using the 1-5 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)])
# Prepare the dataframes for the surprise package
# Dataframe needs to contain 3 columns: user id, item id, and rating
# For the 1-10 scale
rating_10_df = df.filter(['user_id','appid','rating_10'])
rating_10_df = rating_10_df.sort_values(by=['user_id','appid'])
# And the 1-5 scale
rating_5_df = df.filter(['user_id','appid','rating_5'])
rating_5_df = rating_5_df.sort_values(by=['user_id','appid'])
# Confirm dataframe is set up properly (user, item, rating)
rating_10_df.head()
# initialize the reader with the rating scale
# (note: the code says (0, 10) although the data is described as 1-10)
my_reader = Reader(rating_scale=(0, 10))
# load the dataframe with the reader
md = Dataset.load_from_df(rating_10_df, my_reader)
%%time
# Set the parameter grid for optimization
param_grid = {
    # Number of latent factors. More factors can give better results, but can also lead to overfitting
    'n_factors': [50, 100, 150],
    # Number of epochs, i.e. the number of iterations the algorithm will run
    'n_epochs': [10, 20, 50],
    # Learning rate: larger values give faster learning, smaller values give more accurate learning
    'lr_all': [0.005, 0.1],
    'biased': [False]
}
# Set GridSearchCV with 5 fold cross-validation using the FunkSVD
GS = GridSearchCV(FunkSVD, param_grid, measures=['rmse','mae','fcp'], cv=5)
# Fit the model to the data
GS.fit(md)
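No answer is shown here either, but one thing worth checking, given "'function' object has no attribute 'fit'": surprise's GridSearchCV expects the algorithm class itself as its first argument, and the error message suggests a plain function ended up where the estimator should be (for example through a shadowed name). A minimal sketch of the documented pattern, with made-up data and hypothetical column names:

import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import GridSearchCV

# Hypothetical three-column frame, in (user, item, rating) order
ratings = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'appid':     [10, 20, 30, 10, 20, 30, 10, 20, 30],
    'rating_10': [7, 3, 9, 8, 5, 6, 4, 10, 2],
})
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(ratings, reader)

param_grid = {'n_factors': [50, 100], 'n_epochs': [10, 20]}
# Pass the algorithm class (SVD), not SVD() and not some function
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
print(gs.best_params['rmse'])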

How to make a data frame combining different regression results in python?

I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The predictions are arrays of float64, while y_test is a DataFrame.
I wanted to create a table with the results. I tried several approaches (casting to a list, converting types, selecting .values) and did not succeed so far; can anyone help?
My last attempt was like below:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVM})
In this case the DataFrame is created but the values don't appear.
Additionally, I would like to create two new rows with the mean and standard deviation of the results, and these rows should be located at the beginning (the first rows) of the DataFrame.
Thanks
import pandas as pd
import numpy as np

# Toy stand-ins for the real predictions
real = np.array([2] * 10).reshape(-1, 1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)

# pd.DataFrame needs 1-D columns, so flatten the (10, 1) array
real = real.flatten()
comparison = pd.DataFrame({'real': real, 'y_pred_LR': y_pred_LR, 'y_pred_SVR': y_pred_SVR, 'y_pred_RF': y_pred_RF})

# Column-wise mean and standard deviation, transposed into two rows
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean, StD], axis=1).T

# Prepend the two summary rows to the comparison table
result = pd.concat([Mean_StD, comparison], ignore_index=True)
print(result)
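As for the values not appearing in the original attempt: one likely cause (a guess, since the full code is not shown) is index alignment. y_test keeps the shuffled index from the train/test split, while the prediction arrays get a fresh 0..n-1 index, so pandas fills the mismatches with NaN. A sketch of one way around it, reusing the names from the question:

# Flatten y_test to a plain array so pandas ignores its original index
comparison = pd.DataFrame({
    'Real': np.ravel(y_test),
    'LR': y_pred_LR,
    'SVR': y_pred_SVR,
    'RF': y_pred_RF,
})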

How to select best features in Dataframe using the Information Gain measure in scikit-learn [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
I want to identify the 10 best features of a DataFrame using the Information Gain measure (mutual information in scikit-learn) and display them in a table, in ascending order of their Information Gain score.
In this example, features is the DataFrame that contains all the training data of interest, which could tell whether a restaurant will close or not.
# Initialization of data and labels
x = features.copy()  # "x" contains all training data
y = x["closed"]      # "y" contains the labels of the records in "x"

# Drop the class column (closed) from the features
x = x.drop('closed', axis=1)

# this is x.columns (sorry for the mix of French and English)
features_columns = ['moyenne_etoiles', 'ville', 'zone', 'nb_restaurants_zone',
                    'zone_categories_intersection', 'ville_categories_intersection',
                    'nb_restaurant_meme_annee', 'ecart_type_etoiles', 'tendance_etoiles',
                    'nb_avis', 'nb_avis_favorables', 'nb_avis_defavorables',
                    'ratio_avis_favorables', 'ratio_avis_defavorables',
                    'nb_avis_favorables_mention', 'nb_avis_defavorables_mention',
                    'nb_avis_favorables_elites', 'nb_avis_defavorables_elites',
                    'nb_conseils', 'nb_conseils_compliment', 'nb_conseils_elites',
                    'nb_checkin', 'moyenne_checkin', 'annual_std', 'chaine',
                    'nb_heures_ouverture_semaine', 'ouvert_samedi', 'ouvert_dimanche',
                    'ouvert_lundi', 'ouvert_vendredi', 'emporter', 'livraison',
                    'bon_pour_groupes', 'bon_pour_enfants', 'reservation', 'prix',
                    'terrasse']

# normalization
std_scale = preprocessing.StandardScaler().fit(features[features_columns])
normalized_data = std_scale.transform(features[features_columns])
labels = np.array(features['closed'])

# split the data
train_features, test_features, train_labels, test_labels = train_test_split(
    normalized_data, labels, test_size=0.2, random_state=42)
labels_true = ?
labels_pred = ?

# I don't really know how to use this function to achieve what I want
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification

# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(features,
                        columns=['Coefficient'], index=x.columns)
coeff_df.head()
What is the correct syntax using Mutual Info score to achieve this?
The adjusted_mutual_info_score compares ground truth labels with label predictions from a classifier. Both label arrays must have the same shape (n_samples,).
You need Scikit-Learn's mutual_info_classif for what you are trying to achieve. Pass the array of features and the corresponding labels to mutual_info_classif to get back the estimated mutual information between each feature and the target.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification

# Generate a sample data frame
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=2,
                           random_state=0, shuffle=False)
feature_columns = ['A', 'B', 'C', 'D']
features = pd.DataFrame(X, columns=feature_columns)

# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(mutual_info_classif(X, y).reshape(-1, 1),
                        columns=['Coefficient'], index=feature_columns)
Output

features.head(3)
Out[43]:
          A         B         C         D
0 -1.668532 -1.299013  0.799353 -1.559985
1 -2.972883 -1.088783  1.953804 -1.891656
2 -0.596141 -1.370070 -0.105818 -1.213570

# Displaying only the top two features. Adjust the number as required.
coeff_df.sort_values(by='Coefficient', ascending=False)[:2]
Out[44]:
   Coefficient
B     0.523911
D     0.366884
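To match the question exactly (the 10 best features, displayed in ascending order of score), one small variation on the line above:

# Take the 10 highest-scoring features, then order them ascending for display
top10 = coeff_df['Coefficient'].nlargest(10).sort_values().to_frame()
print(top10)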

ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best words for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

train = pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]

documents = []
for i, row in train.iterrows():
    documents.append((row['boilerplate'][1:-1].lower()))

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)

selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of the tf-idf scores per selected word per document, but I have no idea which words were chosen, and methods like "get_feature_names()" are unavailable on the SelectPercentile class.
This is necessary because I need to add these features to a set of numeric features and only then run my training and predictions.
selector.get_support() gets you a boolean array of the columns that fall within the percentile range you specified;
train.columns.values gets you the complete list of column names for the original dataframe;
filtering the latter with the former gives you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it's hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

# Keep the top 2% of features by univariate F-test score
selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)

# Boolean mask of the selected columns, aligned with the original column order
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
Reference:
about get_support
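Since the features in the original question come from a TfidfVectorizer rather than dataframe columns, a sketch of the same idea applied there (reusing vectorizer and selector from the question; get_feature_names_out requires a recent scikit-learn, older versions spell it get_feature_names):

import numpy as np

# Vocabulary terms, in the column order of the tf-idf matrix
feature_names = np.asarray(vectorizer.get_feature_names_out())

# Boolean mask of the columns kept by SelectPercentile
chosen_words = feature_names[selector.get_support()]
print(chosen_words)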
