I have a DataFrame in Python and I need to preprocess my data. What is the best method to preprocess it, given that some variables have a huge scale and others don't? The data doesn't show huge deviations either. I tried the preprocessing.scale function and it works, but I'm not at all sure it is the best way to prepare the data for machine learning algorithms.
There are various techniques for data preprocessing; you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow:
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assessing missing values, by computing their percentage per column
Computing the variance and removing variables with near-zero variance
Assessing the inter-variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
import logging
import pandas as pd

logging.basicConfig(level=logging.DEBUG)

data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")

# per-feature variance
variance = data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(inplace=True)
# reordering columns
variance = variance[["feature_names", "variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file + "_variance.csv", sep="|", index=False)

# percentage of missing values per feature
missing_values_percentage = data.isnull().sum() / data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names", "missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file + "_missing_values.csv", sep="|", index=False)

# inter-variable correlation matrix
correlation = data.corr()
correlation.to_csv(data_file + "_correlation.csv", sep="|")
The above generates three files holding, respectively, the variance, the missing-value percentages, and the correlation results.
Refer to this blog article for a hands-on tutorial.
Always split your data into train and test sets to prevent overfitting.
If some of your features have a large scale and others don't, you should standardize the data. Make sure to fit the scaler on the training set only, so that no information leaks from the test set (see the sketch after these tips).
You also have to look for missing data and replace or remove it.
If less than 0.5% of the data in a column is missing you can use dropna; otherwise you have to replace it with something (zero, the mean, the previous value, ...).
You also have to check for outliers, for example with a boxplot.
Outliers are points that differ significantly from the other data in the same group and can affect your predictions in machine learning.
It is also best to check for multicollinearity.
If some features are correlated with each other we have multicollinearity, which can lead to a misleading model.
Some of your columns might also be categorical and should be converted to numerical values before modelling.
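A minimal sketch of the split-then-scale advice above, with toy data and illustrative column names (not from the question):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy data: one large-scale feature, one small-scale feature, plus a label
df = pd.DataFrame({
    "income": [45000, 120000, 38000, 95000, 61000, 72000],
    "age": [23, 45, 31, 52, 39, 28],
    "target": [0, 1, 0, 1, 1, 0],
})

X = df.drop(columns=["target"])
y = df["target"]

# split first, so the test set never influences any preprocessing step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# fit the scaler on the training set only, then apply the same transform to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)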
I have a train and a test dataset. On the train dataset I detected and deleted outliers: if a value's absolute z-score is larger than 5, i.e. it lies more than 5 standard deviations from the mean, it is quite unusual, so I remove that row from the dataset.
import numpy as np
import scipy.stats as stats

# z-score every column, then keep only rows where all |z| < 5
z_scores = train_df.apply(stats.zscore)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 5).all(axis=1)
train_df = train_df[filtered_entries]
Now I want to remove values from the test set using z-scores based on the train set. (I don't want to compute the z-scores from the test dataset itself!) One idea is to store the mean and standard deviation of X from the train data and calculate the z-scores for the test data from them, e.g.
(X_test − μ_train) / σ_train
But I do not have any concrete ideas how to do so. Could someone give me some advice?
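One way to do exactly that, as a sketch: reuse the column means and standard deviations computed on train_df (test_df here is an assumed name for your test DataFrame with the same columns).
# statistics come from the training data only
train_mean = train_df.mean()
train_std = train_df.std(ddof=0)  # ddof=0 matches scipy.stats.zscore

# z-scores for the test rows, computed with the training statistics
test_z_scores = (test_df - train_mean) / train_std

# keep only rows where every column is within 5 standard deviations
keep = (test_z_scores.abs() < 5).all(axis=1)
test_df = test_df[keep]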
I have several datasets with very unevenly distributed values: most values are very low, but a few are very high, as in the histogram screenshot, or even more extreme.
I am actually interested in the differences in the high values.
So what I am looking for is a classification method that sets many break values where there are few data values and large classes where there are many values. Maybe something like a reversed quantile classification.
Do you have a suggestion on which algorithm could help with this task, preferably in Python?
If you are using pandas, couldn't you just select the values above your chosen threshold and analyze the differences separately?
import pandas as pd

df = pd.DataFrame(your_data)  # your_data: whatever structure holds the raw values
df_to_analyze_large_values = df[df.your_Column_of_interest > 100000]
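If you prefer not to hard-code the cut-off, a quantile can serve as the threshold instead (a small variation on the snippet above; the 0.95 quantile is an arbitrary choice):
# derive the threshold from the data itself, e.g. the 95th percentile
threshold = df.your_Column_of_interest.quantile(0.95)
df_to_analyze_large_values = df[df.your_Column_of_interest > threshold]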
I am using Python with Dask to create a logistic regression model, in order to speed up training.
I have x, the feature array (a numpy array), and y, the label vector.
edit:
The numpy arrays are: x_train, an (n, m) array of floats, and y_train, an (n, 1) vector of integer labels for training. Both work fine with sklearn's LogisticRegression.fit.
I tried the following code to create a pandas DataFrame, convert it to a Dask DataFrame, and train on it, as shown here:
import pandas as pd
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd

df = pd.DataFrame(x_train)  # build a pandas DataFrame from the feature array
df["label"] = y_train       # append the labels as an extra column
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd, sd["label"])
But I get this error:
Could not find signature for add_intercept:
I found this issue on GitHub,
explaining that this code should be used instead:
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd

# df is the pandas DataFrame built from x_train, as above
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.values, sd["label"])
But I get this error
ValueError: Multiple constant columns detected!
How can I use dask to train a logistic regression over data originated from a numpy array?
Thanks.
You can bypass the std verification by using
lr = LogisticRegression(solver_kwargs={"normalize": False})
Or you can use @Emptyless's code to get the faulty column_indices
and then remove those columns from your array.
This does not seem like an issue with dask_ml. Looking at the source, the std is calculated using:
mean, std = da.compute(X.mean(axis=0), X.std(axis=0))
This means that for every column in your provided array, dask_ml calculates the standard deviation. If the standard deviation of one of those columns equals zero (np.where(std == 0)), that column has no variation at all.
Including a column with zero variation does not allow any training, so it needs to be removed before training the model (in a data preparation / cleansing step).
You can quickly check which columns have no variation as follows:
import numpy as np

# per-column standard deviation; .compute() materialises the lazy dask result
std = sd.std(axis=0).compute()
column_indices = np.where(std == 0)
print(column_indices)
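To then drop those zero-variance columns before rebuilding the Dask DataFrame, a sketch reusing the x_train array from the question:
# remove the constant columns found above from the original feature array
x_train_clean = np.delete(x_train, column_indices[0], axis=1)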
A little late to the party, but here I go anyway; hope future readers appreciate it. This answer is for the "Multiple constant columns" error.
A Dask DataFrame is split up into many pandas DataFrames, called partitions. If you set npartitions to 1, it should have exactly the same effect as scikit-learn. If you set it to more partitions, the data is split into multiple DataFrames, but I found that this changes the shape of the DataFrames, which in the end resulted in the "Multiple constant columns" error. It can also cause an overflow warning. Unfortunately it is not in my interest to investigate the direct cause of this error; it might simply be because the DataFrame is too large or too small.
A source for partitioning
Below are the errors, for search-engine indexing:
ValueError: Multiple constant columns detected!
RuntimeWarning: overflow encountered in exp return np.exp(A)
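A minimal sketch of that single-partition suggestion, reusing the setup from the question:
import pandas as pd
from dask import dataframe as dd
from dask_ml.linear_model import LogisticRegression

df = pd.DataFrame(x_train)
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=1)  # one partition, so the data is not reshaped
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.values, sd["label"])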
I am looking to predict whether someone is a smoker from several columns of demographic data stored in a CSV file, which also contains their smoker status.
The columns used are:
Gender, Age, Race, ServedInMilitary, CountryofBirth, EducationLevel, MaritalStatus, HouseholdIncome, FamilyIncome, ChildrenInHouse, QuantitiyofAlcohol, PerUnitTime, ShortnessOfBreath, Asthma, Exercise, Smoker, SmokedBefore, AgeStartedSmoking.
All columns have numeric, but not necessarily binary, values. Could someone help me correct my code to take these factors into account when determining smoker status, and then help me measure the accuracy of my classifier?
I have the following code from a similar question: how to Load CSV Data in scikit and using it for Naive Bayes Classification
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

target_names = np.array(['Positives','Negatives'])
# add columns to your data frame
data['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75
data['Type'] = pd.Factor(targets, target_names)
data['Targets'] = targets
# define training and test sets
train = data[data['is_train']==True]
test = data[data['is_train']==False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)
# columns you want to model
features = data.columns[0:7]
# call Gaussian Naive Bayesian class with default parameters
gnb = GaussianNB()
# train model
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
#Predict Output
There seems to be a missing line here for the dataframe, but I will assume you have it. If you don't, then read your data using pandas.read_csv.
Also, your columns seem to have data that is both categorical and numerical. For example, the "SmokedBefore" column is likely 1/0 whereas your "Age" column is likely numbers such as 20 or 30.
This makes a difference, because in "SmokedBefore" the intent is not to say that 1>0. The intent is to say Yes/No. If your model assumes that higher (or lower) is better, then this will cause an issue. Therefore it is categorical and should not be treated like a numerical value. It is simply a tag to indicate whether someone has smoked before.
However, in "Age" the intent is to say that 30 is different than 20 by 10. Therefore, it is numerical and should be treated as such.
To handle this, you will need to transform your categorical features into a set of binary features, which removes any implied ordering. This is called one-hot encoding. Instead of a single 1/0 "SmokedBefore" column, you get columns like "is_1" and "is_0" with corresponding data, and each row has a 1 in exactly one of them.
You can use the OneHotEncoder class provided in sklearn (older scikit-learn versions had a categorical_features argument to pick the columns to encode; in current versions, apply the encoder to the categorical columns via a ColumnTransformer, or simply use pandas.get_dummies).
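A small illustration of that idea with pandas.get_dummies (toy data; the column names follow the question):
import pandas as pd

# toy frame with one categorical and one numeric column
data = pd.DataFrame({
    "SmokedBefore": ["Yes", "No", "Yes"],
    "Age": [20, 30, 45],
})

# one-hot encode only the categorical column; Age stays numeric
data_encoded = pd.get_dummies(data, columns=["SmokedBefore"])
print(data_encoded)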
I have a pandas.DataFrame with a passengers column whose range may vary greatly depending on the function creating the DataFrame.
The other columns are often more or less of constant ranges (they're economy indicators).
segments.head(2)

            passengers       gdp  gdp_per_capita  inflation  unemployment  population
Month
2002-01-01       11688  4461.087       31634.953    150.847        14.418      339.59
2002-02-01        9049  4142.153       29321.702    204.132        14.738      343.32
My most valuable data is the number of passengers, so I do not want to transform it. However, the differences in scale of the other measures, which I want to use as predictors, make it difficult to track the variations (sometimes in the tens of thousands, sometimes in decimals).
How could I standardize the range of all my columns to be consistent with the mean(passengers)?
There are different ways you can approach this problem: you can write and apply a manual transformation function, or you can use a pre-existing one such as sklearn.preprocessing.StandardScaler.
StandardScaler will "standardize features by removing the mean and scaling to unit variance". You can then shift the mean and adjust the variance according to your needs.
However, it looks to me like you are going to build a predictive model on this data. If so, the best approach is to test every hypothesis and keep what works best. My advice (see the sketch after this list):
Remove skew from passengers (if present). Log and log1p are the most common transforms, but depending on your data other transforms might work better; you should test arbitrary functions as well (the inverse, or 1/(X+1), for example) and use the transform that brings the skew closest to 0.
Test both scaled and non-scaled features. If the data is skewed, test both with and without the transform, as above.
If outliers are present, test both with and without them (outliers clipped to borderline values / outliers converted to np.nan). Make a boolean feature column identifying outliers for each feature, and test whether it is valuable information or just noise for the model.
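A sketch of the scale-the-predictors idea, assuming the segments DataFrame from the question (only the predictor columns are transformed; passengers is left untouched):
import numpy as np
from sklearn.preprocessing import StandardScaler

predictors = ["gdp", "gdp_per_capita", "inflation", "unemployment", "population"]

scaled = segments.copy()
scaled[predictors] = StandardScaler().fit_transform(segments[predictors])

# optional: compare the skew of passengers before and after a log1p transform
print(segments["passengers"].skew(), np.log1p(segments["passengers"]).skew())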
Hope that helps,