Creating train and test datasets manually in Python

I am trying to split the dataset into train and test datasets manually, meaning that I don't want to use the scikit-learn package. I want to split them so that if the row index modulo 4 is equal to zero, the row goes into the test dataset, otherwise it goes into the training dataset. I have done it in R like the following:
testidx = which(1:nrow(price_accommodates_bedrooms) %% 4 == 0)
df_train = price_accommodates_bedrooms[-testidx, ]
df_test = price_accommodates_bedrooms[testidx, ]
But I am not sure how to do it in Python because I am new to the language. Thanks in advance.

If you want to do this you can take advantage of the DataFrame index and a boolean mask:
test_df = df[df.index % 4 == 0]    # every row whose index is divisible by 4
train_df = df[df.index % 4 != 0]   # the remaining rows
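Note that this relies on the default RangeIndex (0, 1, 2, ...). If your DataFrame has a non-numeric or shuffled index, a minimal sketch using positional row numbers instead (assuming only that df is a pandas DataFrame):

import numpy as np

# Row positions, independent of whatever the index labels are
positions = np.arange(len(df))
test_df = df.iloc[positions % 4 == 0]    # every 4th row by position
train_df = df.iloc[positions % 4 != 0]   # the remaining rows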

Related

cosine_similarity giving different answer for dataframe and subset of dataframe

I have the following piece of code for my recommendation system, and it gives different outputs in the two scenarios below.
Scenario 1:
a = df[df.index == 5031]
b = df[df.index == 9365]
print(cosine_similarity(a,b)) #0.33
Scenario 2:
cosine_sim = cosine_similarity(df)
print(cosine_sim[5031][9365]) #0.25
I think the output for both scenarios should be the same, and scenario 1 seems more accurate to me given the data.
Can anyone help with this?
You are mixing label-based indexing with location-based (positional) indexing.
In scenario 1 you get the vectors by index label:
# labels 5031 and 9365
a = df[df.index == 5031]
b = df[df.index == 9365]
The matrix returned by sklearn.metrics.pairwise.cosine_similarity knows nothing about the index labels; it is simply indexed by row position. Thus, before you read a value from the matrix, you need to look up the positional index of each label in the dataframe:
idx_a = df.index.get_loc(5031)
idx_b = df.index.get_loc(9365)
cosine_sim[idx_a][idx_b]
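If you prefer to keep working with labels, a small sketch (assuming df is the dataframe from the question) that attaches the index labels to the similarity matrix:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Wrap the similarity matrix in a DataFrame labelled by the original index on both axes
cosine_sim_df = pd.DataFrame(cosine_similarity(df), index=df.index, columns=df.index)
print(cosine_sim_df.loc[5031, 9365])  # now consistent with scenario 1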

Python cannot get data from column in dataframe

I have a dataframe like this sample:
priceUsd,time,date
38492.2698958105979245,1627948800000,2021-08-03T00:00:00.000Z
39573.1543437718690816,1628035200000,2021-08-04T00:00:00.000Z
40090.5174131427618446,1628121600000,2021-08-05T00:00:00.000Z
41356.0360622010701055,1628208000000,2021-08-06T00:00:00.000Z
43535.9969201307711635,1628294400000,2021-08-07T00:00:00.000Z
I want to use the last 10 rows as the test dataset for TensorFlow, and everything from the first row up to the last 10 rows as the train dataset.
train = df.loc[:-10 , ['priceUsd']]
test = df.loc[-10: , ['priceUsd']]
When I run this code it shows this error:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [-10] of type int
How to fix it?
Try this instead:
train = df[['priceUsd']].head(len(df) - 10)
test = df[['priceUsd']].tail(10)
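If you prefer slicing by position while keeping label-based column selection, a minimal sketch using iloc (assuming df is the dataframe above):

# iloc slices by integer position, so negative offsets work even with a DatetimeIndex
train = df.iloc[:-10][['priceUsd']]   # everything except the last 10 rows
test = df.iloc[-10:][['priceUsd']]    # the last 10 rows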

Python Confusion Matrix Length

I'm trying to evaluate the accuracy and performance of several KNN Classifiers.
DataTest["ConfM_K30_ST"] = confusion_matrix(
DataTest["ST_Class"],
DataTest["KNN_K30_ST"]
)
aux = DataTest["ST_Class"]
aux1 = DataTest["KNN_K30_ST"]
When trying to compare the Predicted Result with the Originals I receive the following error:
ValueError: Length of values does not match length of index
DataTest is my DataFrame containing 20% of the Data. The labeled data is, for this example, "ST_Class" and the predicted data is "KNN_K30_ST".
In order to verify what was going on, I assigned these two columns to aux and aux1. They are both Series with shape (3224,).
The only problem I can see is that the indexes are not continuous and don't start at 0 nor end at 3223. To make this easier to follow, see the image below.
Link: https://i.imgur.com/Splhr62.png
The only error I can see is that you are trying to store the confusion matrix as a column in the dataframe. That isn't possible because of the size mismatch: the confusion matrix is a small k x k array (k being the number of classes), while the column has 3224 rows.
Here's a small sample:
df1
   a
0  1
2  1
4  1

df2
   a
1  0
3  1
5  0

# Output from the confusion matrix
confusion_matrix(df1, df2)
array([[0, 0],
       [2, 1]])
As suggested, I was indeed trying to store a confusion matrix in a DataFrame column without realizing it.
My solution was to store it in a dictionary instead.
Thank you all for the quick replies!
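A minimal sketch of that dictionary approach (the column names follow the question; the dictionary keys are arbitrary):

from sklearn.metrics import confusion_matrix

conf_matrices = {}  # one entry per classifier instead of one DataFrame column
conf_matrices["K30_ST"] = confusion_matrix(DataTest["ST_Class"], DataTest["KNN_K30_ST"])
print(conf_matrices["K30_ST"])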

Solving Kaggle's Titanic Machine Learning

I'm trying to solve Kaggle's Titanic competition with Python.
But I get an error when trying to fit my data.
This is my code:
import pandas as pd
from sklearn import linear_model

def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())

    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1

    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2

train = pd.read_csv("train.csv")
clean_data(train)

target = train["Survived"].values
features = train[["Pclass", "Age", "Sex", "SibSp", "Parch"]].values

classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target)  # Here is where the error comes from
And the error is this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me please?
Before you fit the model with features and target, it is best practice to check whether null values are present in any of the features you want to use to build the model. You can use the following to check:
dataframe_name.isnull().any() returns, for each column, True if at least one NaN value is present.
dataframe_name.isnull().sum() returns, for each column, how many NaN values are present.
Once you know which columns contain NaN, you can clean the data accordingly, and the NaN problem will not occur.
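A minimal sketch of those checks on the question's train dataframe, together with one way to drop the offending rows before fitting (the column list is taken from the question; dropping rows is only one option):

print(train.isnull().any())   # True for every column that contains at least one NaN
print(train.isnull().sum())   # number of NaN values per column

# Keep only the rows that have no NaN in the chosen feature columns
cols = ["Pclass", "Age", "Sex", "SibSp", "Parch"]
mask = train[cols].notnull().all(axis=1)
features = train.loc[mask, cols].values
target = train.loc[mask, "Survived"].values
classifier_ = classifier.fit(features, target)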
You should reset the index of your dataframe before running any sklearn code:
df = df.reset_index()
NaN simply represents empty, None, or null values in a dataset. Before applying an ML algorithm to the dataset, you first need to preprocess it; in other words, you need to clean the data. You can use scikit-learn's imputer to handle NaN values.
How to check if the dataset has NaN:
The dataframe's isnull() method returns a Series of True/False values showing which entries are NaN, for example:
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'c', np.nan])
s.isnull()
out: False, False, True, False, True
And s.isnull().sum() returns the count of null values present in the series, in this case 2.
You can apply the same method to a dataframe itself, e.g. df.isnull().
Two techniques I know of to handle NaN:
1. Removing the rows which contain NaN, e.g.
s.dropna() or s.dropna(inplace=True) or df.dropna(how='all')
But this can remove a lot of valuable information from the dataset, so it is mostly avoided.
2. Imputing: replacing the NaN values with the mean/median of the column.
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# strategy can also be 'median' or 'most_frequent'
imputed_data = imputer.fit_transform(training_data_df.values)
print(imputed_data)
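Note that Imputer comes from older scikit-learn releases; in 0.20 and later it was replaced by SimpleImputer, so a roughly equivalent sketch would be:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # strategy can also be 'median' or 'most_frequent'
imputed_data = imputer.fit_transform(training_data_df.values)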
I hope this helps.

Possible ways to do one hot encoding in scikit-learn?

I have a pandas data frame with some categorical columns. Some of these contain non-integer values.
I currently want to apply several machine learning models to this data. With some models, it is necessary to do some normalization to get better results, for example converting categorical variables into dummy/indicator variables. Indeed, pandas has a function called get_dummies for that purpose. However, the result of this function depends on the data, so if I call get_dummies on the training data and then call it again on the test data, the resulting columns can differ, because a categorical column in the test data may contain only a subset of, or a different set of, the possible values seen in the training data.
Therefore, I am looking for other methods to do one-hot encoding.
What are the possible ways to do one-hot encoding in Python (pandas/sklearn)?
Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.
For encoding training data you can use fit_transform which will discover the category labels and create appropriate dummy variables.
label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data you can use the same set of categories using transform.
test_mat = label_binarizer.transform(test_df.Label)
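In more recent scikit-learn versions (0.20 and later), sklearn.preprocessing.OneHotEncoder also works directly on string categories and can ignore categories never seen during fitting; a possible sketch, reusing the Label column from the example above:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')      # unseen test categories become all-zero rows
training_mat = encoder.fit_transform(df[['Label']])   # learns the categories from the training data
test_mat = encoder.transform(test_df[['Label']])      # reuses the training categories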
In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import numpy as np
import pandas as pd

train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)

# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0

# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross validation so that your splits contain the relevant labels.
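If you prefer a one-liner for the alignment step, pandas reindex can achieve the same thing; a small sketch assuming train and test are the dummy-encoded frames from the code above:

# Add any missing columns filled with 0 and reorder to match the training columns
test = test.reindex(columns=train.columns, fill_value=0)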
Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set contains only the three categories "a", "b", "c" as values. If get_dummies is used at this stage, only three features are generated (A_a, A_b, A_c). Ideally, there should also be a feature A_d with all zeros. That can be achieved in the following way:
import pandas as pd

data = pd.DataFrame({"A": ["a", "b", "c"]})
# Declare the full set of categories, including ones absent from the data
data["A"] = data["A"].astype(pd.CategoricalDtype(categories=["a", "b", "c", "d"]))
mod_data = pd.get_dummies(data[["A"]])
print(mod_data)
The output being:
   A_a  A_b  A_c  A_d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
For the text columns, you can try this
from sklearn.feature_extraction.text import CountVectorizer
data = ['he is good','he is bad','he is strong']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data)
For the output:
for i in range(len(data)):
    print(vectors[i, :].toarray())
Output:
[[0 1 1 1 0]]
[[1 0 1 1 0]]
[[0 0 1 1 1]]
