Pandas Regression Model Replacing Column Values - python

I have a data frame "df" with columns "bedrooms", "bathrooms", "sqft_living", and "sqft_lot".
I want to fill in missing column values with a regression model: each missing value would be predicted from the values of the other columns in the same row.
As an example, the sqft_living value is missing in row 12. To fill it in, the bedrooms, bathrooms, and sqft_lot values in that row would be used to predict the missing value.
Is there any way to do this? Any help is appreciated. Thanks!

import pandas as pd
from sklearn.linear_model import LinearRegression
# setup
dictionary = {'bedrooms': [3,3,2,4,3,4,3,3,3,3,3,2,3,3],
              'bathrooms': [1,2.25,1,3,2,4.5,2.25,1.5,1,2.5,2.5,1,1,1.75],
              'sqft_living': [1180,2570,770,1960,1680,5420,1715,1060,1780,1890,'',1160,'',1370],
              'sqft_lot': [5650,7242,10000,5000,8080,101930,6819,9711,7470,6560,9796,6000,19901,9680]}
df = pd.DataFrame(dictionary)
# setup x and y for training
# drop data with empty row
clean_df = df[df['sqft_living'] != '']
# separate variables into my x and y
x = clean_df.iloc[:, [0,1,3]].values
y = clean_df['sqft_living'].values
# fit my model
lm = LinearRegression()
lm.fit(x, y)
# get the rows I am trying to do my prediction on
predict_x = df[df['sqft_living'] == ''].iloc[:, [0,1,3]].values
# perform my prediction
lm.predict(predict_x)
# I get values 1964.983 for row 10 and 1567.068 for row 12
It should be noted that what you're asking about is imputation. I suggest reading up on other imputation methods, their trade-offs, and when to use them.
Edit: putting the predictions back into the DataFrame:
# Get index of missing data
missing_index = df[df['sqft_living'] == ''].index
# Replace
df.loc[missing_index, 'sqft_living'] = lm.predict(predict_x)
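As an aside, scikit-learn's IterativeImputer (an experimental API) does essentially this for every column at once, regressing each feature on the others. A minimal sketch, assuming the df above with its '' placeholders converted to NaN:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental API)
from sklearn.impute import IterativeImputer
# convert the '' placeholders to NaN and make everything numeric
df_num = df.replace('', np.nan).astype(float)
# each column with missing values is modeled from the remaining columns
imputed = IterativeImputer(random_state=0).fit_transform(df_num)
df_imputed = pd.DataFrame(imputed, columns=df_num.columns)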

Related

pandas dropna giving different results when applied to a dataframe with 2 columns or the columns as independent dataframes

I'm using the following public dataset to practice linear regression:
https://www.kaggle.com/theforcecoder/wind-power-forecasting
I tried to do a least squares regression using numpy's polynomial fitting, and I ran into issues because the columns had NaN values.
Applying dropna to the dataframe from which I extract the columns has no effect. I tried both using inplace=True and defining a new dataframe, but neither works:
LSFitdDf = BearingTempsCorr[['WindSpeed', 'BearingShaftTemperature']]
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]
WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']
print(len(WindSpeed))
print(len(BearingShaftTemperature))
and
LSFitdDf = BearingTempsCorr[['WindSpeed', 'BearingShaftTemperature']].dropna()
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]
WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']
print(len(WindSpeed))
print(len(BearingShaftTemperature))
Both produce the same output (length of both columns=323)
However, applying dropna to the columns themselves does drop rows:
WindSpeed = BearingTempsCorr['WindSpeed'].dropna()
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature'].dropna()
results in lengths=(316, 312)
However, this introduces a new problem: the regression cannot be applied because x and y now have different lengths.
What is going on here?
There is an error in your code:
WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']
You use BearingTempsCorr, but you should use LSFitdDf (where you saved dropna values).
WindSpeed = LSFitdDf['WindSpeed']
BearingShaftTemperature = LSFitdDf['BearingShaftTemperature']
P.S. You also have a redundant line, which just copies LSFitdDf into the same variable:
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]
P.P.S. The clearest way to keep the whole dataset but drop rows with NA values in the desired columns is
BearingTempsCorr.dropna(subset=['WindSpeed', 'BearingShaftTemperature'])
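Putting it all together, a minimal sketch of the fix (assuming BearingTempsCorr has already been loaded from the dataset):
import numpy as np
# drop rows where either column is NA, keeping x and y aligned
clean = BearingTempsCorr.dropna(subset=['WindSpeed', 'BearingShaftTemperature'])
WindSpeed = clean['WindSpeed']
BearingShaftTemperature = clean['BearingShaftTemperature']
# least-squares fit of a degree-1 polynomial
coeffs = np.polyfit(WindSpeed, BearingShaftTemperature, deg=1)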

Taking a proportion of a dataframe based on column values

I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each field in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining fields in a way that is 'idiomatic'. I also do not know how I could take the result from this sampling to form a new dataframe.
You can make a list of all the unique values in the column with list(df['type of use'].unique()) and iterate as below:
for i in list(df['type of use'].unique()):
    print(df[df['type of use'] == i].sample(frac=0.2))
or
i = 0
while i < len(list(df['type of use'].unique())):
    df1 = df[df['type of use'] == list(df['type of use'].unique())[i]].sample(frac=0.2)
    print(df1.head())
    i = i + 1
For storing you can create a dictionary:
dfs = ['df' + str(x) for x in list(df['type of use'].unique())]
dicdf = dict()
i = 0
while i < len(dfs):
    dicdf[dfs[i]] = df[df['type of use'] == list(df['type of use'].unique())[i]].sample(frac=0.2)
    i = i + 1
print(dicdf)
This will print a dictionary of the dataframes.
You can print whichever one you like to see, for example the housing sample: print(dicdf['dfhousing'])
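If you want all the samples combined into a single dataframe rather than a dictionary, one sketch is to collect the per-group samples in a list and concatenate them:
# sample 20% of each 'type of use' group and stack the pieces
sampled_parts = [df[df['type of use'] == v].sample(frac=0.2)
                 for v in df['type of use'].unique()]
sampled_df = pd.concat(sampled_parts)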
Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received to a similar question here. Applying it to your data:
import pandas as pd
import math

percentage_to_flag = 0.2  # I'm assuming you want the same percentage for all 'types of use'?
random_state = 41  # change to get different random values

# First, create a new 'helper' dataframe with the sampled rows of each group:
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=math.ceil(percentage_to_flag * len(x)), random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True)  # may be needed to simplify the multi-index dataframe

# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
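As a side note, pandas 1.1+ has GroupBy.sample, which expresses the same stratified sampling directly; a minimal sketch:
# one 20% sample per 'type of use' group, keeping the original index
df_sample = df.groupby('type of use').sample(frac=0.2, random_state=41)
df['marked'] = df.index.isin(df_sample.index)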

Dataframe.sample - Weights - How to use it?

I have this situation:
I have a probability of 0.1348 stored in a variable called treat_conv.
Now I am trying to create a dataframe from the original dataframe, using this probability to weight a specified column. Is that possible? I am trying to use weights, but with no success. Maybe I am using it wrong?
Follow my code:
weights = np.array(treat_conv)  # creating an array with treat_conv
# creating a new dataframe with the number of rows of treat_group;
# the converted column must have a 0.13 chance of bringing value 1
new_page_converted = df2.sample(n=treat_group.shape[0], weights=df2.converted(weights))
So, the code works if I use n alone. It creates a new dataframe with the correct amount of rows. But I can't get the correct probability to bring a certain amount of value 1 in the converted column.
I hope my explanation is understandable.
Thank you!
You could do something like this
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 100, 1), columns=["SomeValue"])
selected = pd.DataFrame(data=np.random.choice(df["SomeValue"], int(len(df["SomeValue"]) * 0.13), replace=False),
                        columns=["SomeValue"])
selected["Trigger"] = 1
df = df.merge(selected, how="left", on="SomeValue")
df["Trigger"].fillna(0, inplace=True)
"df" is your original DataFrame. Then select random 13% of the values and add a column indicating they've been selected. Finally, merge all back to your original Dataframe.

Solving Kaggle's Titanic Machine Learning

I'm trying to solve Kaggle's Titanic with Python, but I get an error when trying to fit my data.
This is my code:
import pandas as pd
from sklearn import linear_model
def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())
    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1
    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2
train = pd.read_csv("train.csv")
clean_data(train)
target = train["Survived"].values
features = train[["Pclass", "Age","Sex","SibSp", "Parch"]].values
classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target) # Here is where error comes from
And the error is this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me please?
Before you fit the model with features and target, best practice is to check whether null values are present in all the features you want to use in building the model. You can check with the following:
dataframe_name.isnull().any() — this gives the column names and True if at least one NaN value is present.
dataframe_name.isnull().sum() — this gives the column names and the count of NaN values in each.
Once you know which columns are affected, you can clean the data accordingly, and the NaN problem will not arise.
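For example, a quick check on the feature columns used above (assuming the train dataframe from the question):
# count NaN values per feature column before fitting
print(train[["Pclass", "Age", "Sex", "SibSp", "Parch"]].isnull().sum())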
You should reset the index of your dataframe before running any sklearn code:
df = df.reset_index()
NaN simply represents an empty, None, or null value in a dataset. Before applying an ML algorithm to the dataset you first need to preprocess it for streamlined processing; in other words, data cleaning. You can use scikit-learn's imputer module to handle NaN.
How to check if a dataset has NaN:
A Series' isnull() returns a list of True/False values showing whether each entry is NaN, for example:
s = pd.Series(['a', 'b', np.nan, 'c', np.nan])
s.isnull()
out: False, False, True, False, True
And s.isnull().sum() returns the count of null values present in the series, in this case 2.
You can apply the same method to a dataframe itself, e.g. df.isnull().
Two techniques I know to handle NaN:
1. Removing the rows which contain NaN, e.g.
s.dropna() or s.dropna(inplace=True) or df.dropna(how='all')
But this can remove a lot of valuable information from the dataset, so it is mostly avoided.
2. Imputing: replacing the NaN values with the mean/median of the column.
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# strategy can also be 'median' or 'most_frequent'
imputed_data = imputer.fit_transform(training_data_df.values)
print(imputed_data)
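For newer scikit-learn versions (0.22+), where Imputer has been removed, a roughly equivalent sketch uses SimpleImputer (assuming the same training_data_df):
import numpy as np
from sklearn.impute import SimpleImputer
# mean imputation, column by column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imputer.fit_transform(training_data_df.values)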
I hope this would help you.

Iterating over entire columns and storing the result into a list

I would like to know how I could iterate through each column of a dataframe, perform some calculations, and store the results in another dataframe.
df_empty = []
m = daily.ix[:,-1]       # Columns = stocks & Rows = daily returns
stocks = daily.ix[:,:-1]
for col in range(len(stocks.columns)):
    s = daily.ix[:,col]
    covmat = np.cov(s, m)
    beta = covmat[0,1]/covmat[1,1]
    return(beta)
print(beta)
In the above example, I first want to calculate a covariance matrix between "s" (the columns representing the stocks' daily returns, which I want to iterate through one by one) and "m" (the market's daily return, which is my reference column/the last column of my dataframe). Then I want to calculate the beta for each stock/market covariance pair.
I'm not sure why return(beta) gives me a single numerical result for one stock while print(beta) prints the betas for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code, but it returns None, as if it could not append the outcome.
Thank you for your help
The return statement within your for-loop exits the enclosing function the first time it is encountered, which also ends the loop. Moreover, you are not saving the beta values anywhere, because the for-loop itself does not return a value in Python (it only has side effects).
Apart from that, you may choose a more pandas-like approach using apply on the data frame, which iterates over the columns of the data frame, passes each column to a supplied function as its first parameter, and collects the results of the function calls. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np

# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))
# define reference column
cov_column = daily.iloc[:, -1]

# setup computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0,1]/covmat[1,1]

# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)
# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
