How to impute Null values in python for categorical data? - python

I have seen that in R, imputation of categorical data is handled straightforwardly by packages like DMwR and Caret, and there are algorithm options like KNN or CentralImputation. But I do not see any libraries in Python doing the same; FancyImpute performs well only on numeric data.
Is there a way to impute Null values in Python for categorical data?
Edit: Added the top few rows of the data set.
>>> data_set.head()
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond \
0 856 854 0 NaN 3 1Fam TA
1 1262 0 0 NaN 3 1Fam TA
2 920 866 0 NaN 3 1Fam TA
3 961 756 0 NaN 3 1Fam Gd
4 1145 1053 0 NaN 4 1Fam TA
BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street \
0 No 706.0 0.0 ... WD 0 Pave
1 Gd 978.0 0.0 ... WD 0 Pave
2 Mn 486.0 0.0 ... WD 0 Pave
3 No 216.0 0.0 ... WD 0 Pave
4 Av 655.0 0.0 ... WD 0 Pave
TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd \
0 8 856.0 AllPub 0 2003 2003
1 6 1262.0 AllPub 298 1976 1976
2 6 920.0 AllPub 0 2001 2002
3 7 756.0 AllPub 0 1915 1970
4 9 1145.0 AllPub 192 2000 2000
YrSold
0 2008
1 2007
2 2008
3 2006
4 2008
[5 rows x 81 columns]

There are a few ways to deal with missing values. As I understand it, you want to fill NaN values according to a specific rule. Pandas fillna can be used; the code below is an example of how to fill a categorical column's NaN values with that column's most frequent value.
df['Alley'].fillna(value=df['Alley'].value_counts().index[0], inplace=True)
sklearn.preprocessing.Imputer might also be helpful (in newer scikit-learn versions it has been replaced by sklearn.impute.SimpleImputer).
For more information about pandas fillna, see pandas.DataFrame.fillna.
Hope this helps.
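For completeness, a minimal sketch of how the same most-frequent-value rule could be applied to every categorical column at once, assuming the frame is called data_set as above; SimpleImputer (the replacement for the older Imputer class) is used here, with a pandas-only alternative in the comment:
import pandas as pd
from sklearn.impute import SimpleImputer

# Select the object (categorical) columns, e.g. Alley, BldgType, BsmtCond, ...
cat_cols = data_set.select_dtypes(include='object').columns

# Replace NaN in each categorical column with that column's most frequent value
imputer = SimpleImputer(strategy='most_frequent')
data_set[cat_cols] = imputer.fit_transform(data_set[cat_cols])

# Pandas-only equivalent:
# data_set[cat_cols] = data_set[cat_cols].fillna(data_set[cat_cols].mode().iloc[0])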

Related

How to keep number as string when creating dataframe Pandas

I am having some issues converting a multidimensional list into a Pandas dataframe.
The problem is related to the numeric fields: I have some numbers in a non-standard format, as you can see from this table (scraped using tabula.py):
[ Unnamed: 0 0 Stück kg € / kg 0.1 Stück.1 \
0 Region Nord-Ost NaN 64.852 6.269.400 1,60 0.0 37.408
1 Niedersachsen / Bremen NaN 164.424 15.993.570 1,59 0.0 88.625
2 Nordrhein-Westfalen NaN 179.692 17.422.749 1,59 0.0 73.199
3 Hessen / Rheinland-Pfalz NaN 6.322 610.099 1,61 NaN 10.281
4 Baden-Württemberg NaN 21.924 2.135.045 1,62 0.0 22.661
5 Bayern NaN 21.105 2.062.882 1,62 0.0 18.188
6 Deutschland gesamt NaN 458.319 44.493.745 1,59 NaN 250.362
kg.1 € / kg.1
0 3.632.427 1,56
1 8.683.864 1,56
2 7.155.988 1,55
3 1.004.925 1,60
4 2.220.986 1,63
5 1.798.013 1,58
6 24.496.203 1,57 ]
In this case the dot is the thousands separator. When I convert it to a DataFrame, those numbers become floats (I think), and the result is the following.
Unnamed: 0 0 Stück kg € / kg 0.1 \
0 Region Nord-Ost nan 64.852 6.269.400 1,60 0.0
1 Niedersachsen / Bremen nan 164.424 15.993.570 1,59 0.0
2 Nordrhein-Westfalen nan 179.692 17.422.749 1,59 0.0
3 Hessen / Rheinland-Pfalz nan 6.322 610.099 1,61 nan
4 Baden-Württemberg nan 21.924 2.135.045 1,62 0.0
5 Bayern nan 21.105 2.062.882 1,62 0.0
6 Deutschland gesamt nan 458.319 44.493.745 1,59 nan
Stück.1 kg.1 € / kg.1
0 37.408 3.632.427 1,56
1 88.625 8.683.864 1,56
2 73.199 7.155.988 1,55
3 10.280999999999999 1.004.925 1,60
4 22.660999999999998 2.220.986 1,63
5 18.188 1.798.013 1,58
6 250.362 24.496.203 1,57
I would like to treat those numbers as strings, then replace the dots with nothing and convert the numbers to standard integers, but I cannot find a way to do that.
I already tried to set the dtype of the df to string, like this:
df = pd.DataFrame(table[0], dtype=str)
But the problem is still there. Any suggestions?
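A minimal sketch of one way this could be handled, assuming the scraped values are still strings when the frame is built and that the relevant columns are named as in the table above (Stück, kg, Stück.1, kg.1): force the columns to strings, strip the thousands dots, and parse the results as integers.
import pandas as pd

df = pd.DataFrame(table[0], dtype=str)

# Columns that use a dot as the thousands separator (assumed names from the table above)
int_cols = ['Stück', 'kg', 'Stück.1', 'kg.1']

for col in int_cols:
    # Remove the thousands dots and parse as an integer
    df[col] = df[col].str.replace('.', '', regex=False).astype('int64')
If tabula has already parsed the values as floats before they reach the DataFrame, the lost digits cannot be recovered here, so the dtype would need to be forced at extraction time instead.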

Pandas Error: '[nan nan] not found in axis' while dropping a column without labels

I'm trying to drop the first two columns in a dataframe that has NaN for column headers. The dataframe looks like this:
15 NaN NaN NaN Energy Supply Energy Supply Renewable Energy
17 NaN Afghanistan Afghanistan 1 2 3
18 NaN Albania Albania 1 2 3
19 NaN Algeria Algeria 1 2 3
I need to drop the first two columns labeled NaN. I tried df=df.drop(df.columns[[1,2]],axis=1), which returns this error:
KeyError: '[nan nan] not found in axis'
What am I missing?
It's strange that you have NaN as column headers. Try filtering out the columns whose names start with NaN using a regex.
df.filter(regex='^(?!NaN).+', axis=1)
Using your data
print(df)
15 NaN NaN.1 NaN.2 EnergySupply EnergySupply.1 \
0 17 NaN Afghanistan Afghanistan 1 2
1 18 NaN Albania Albania 1 2
2 19 NaN Algeria Algeria 1 2
RenewableEnergy
0 3
1 3
2 3
Solution
print(df.filter(regex='^(?!NaN).+', axis=1))
15 EnergySupply EnergySupply.1 RenewableEnergy
0 17 1 2 3
1 18 1 2 3
2 19 1 2 3
When the NaN columns exist, I had to use a case-insensitive version of the regex from wwnde's answer in order to successfully filter out the columns:
df = df.filter(regex='(?i)^(?!NaN).+', axis=1)
Other suggestions, such as df=df[df.columns.dropna()] and df=df.drop(np.nan, axis=1), did not work, but the above did.
I'm guessing this is related to the painful reality of np.nan == np.nan not evaluating to True, but ultimately it seems like a bug in pandas.
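A minimal sketch of another way this could be approached, assuming the offending labels really are float NaN values rather than the string 'NaN': keep only the columns whose label is not null, which sidesteps both the regex and the NaN equality problem.
import pandas as pd

# Keep only the columns whose label is not NaN
df = df.loc[:, df.columns.notna()]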

Pandas dropna() function not working

I am trying to drop NA values from a pandas dataframe.
I have used dropna() (which should drop all NA rows from the dataframe). Yet, it does not work.
Here is the code:
import pandas as pd
import numpy as np
prison_data = pd.read_csv('https://andrewshinsuke.me/docs/compas-scores-two-years.csv')
That's how you get the data frame. As the following shows, the default read_csv method does indeed convert the NA data points to np.nan.
np.isnan(prison_data.head()['out_custody'][4])
Out[2]: True
Conveniently, the head() of the DF already contains NaN values (in the column out_custody), so printing prison_data.head() you get:
id name first last compas_screening_date sex
0 1 miguel hernandez miguel hernandez 2013-08-14 Male
1 3 kevon dixon kevon dixon 2013-01-27 Male
2 4 ed philo ed philo 2013-04-14 Male
3 5 marcu brown marcu brown 2013-01-13 Male
4 6 bouthy pierrelouis bouthy pierrelouis 2013-03-26 Male
dob age age_cat race ...
0 1947-04-18 69 Greater than 45 Other ...
1 1982-01-22 34 25 - 45 African-American ...
2 1991-05-14 24 Less than 25 African-American ...
3 1993-01-21 23 Less than 25 African-American ...
4 1973-01-22 43 25 - 45 Other ...
v_decile_score v_score_text v_screening_date in_custody out_custody
0 1 Low 2013-08-14 2014-07-07 2014-07-14
1 1 Low 2013-01-27 2013-01-26 2013-02-05
2 3 Low 2013-04-14 2013-06-16 2013-06-16
3 6 Medium 2013-01-13 NaN NaN
4 1 Low 2013-03-26 NaN NaN
priors_count.1 start end event two_year_recid
0 0 0 327 0 0
1 0 9 159 1 1
2 4 0 63 0 1
3 1 0 1174 0 0
4 2 0 1102 0 0
However, running prison_data.dropna() does not change the dataframe in any way.
prison_data.dropna()
np.isnan(prison_data.head()['out_custody'][4])
Out[3]: True
df.dropna() by default returns a new DataFrame without NaN values rather than modifying the original, so you have to assign the result back to the variable:
df = df.dropna()
If you want it to modify the df in place, you have to specify that explicitly:
df.dropna(inplace=True)
In this case it also wasn't working as expected because there was at least one NaN per row, so dropping every row that contains any NaN removes the entire frame.
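If the goal is to drop rows with missing values without emptying the frame, here is a minimal sketch using the standard subset and thresh parameters of dropna, assuming the columns of interest are in_custody and out_custody:
import pandas as pd

prison_data = pd.read_csv('https://andrewshinsuke.me/docs/compas-scores-two-years.csv')

# Drop only the rows that are missing in_custody or out_custody
filtered = prison_data.dropna(subset=['in_custody', 'out_custody'])

# Or: keep rows that have at least 50 non-NaN values, whatever the columns
# filtered = prison_data.dropna(thresh=50)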

Problems with combining columns from dataframes in pandas

I have two dataframes that I'm trying to merge.
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 NaN NaN
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 NaN NaN
5 313 3 NaN NaN
...
df2
code scale R1 R2...
0 121 2 30 20
3 313 2 15 10
...
I need to copy values from df2 into df1 wherever the code and scale columns match.
The result should look like this:
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 30 20
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 15 10
5 313 3 NaN NaN
...
The problem is that there can be a lot of columns like R1 and R2, and I cannot check each one separately. I wanted to use something from this instruction, but nothing gives me the desired result. I'm doing something wrong, but I can't understand what. I really need advice.
What do you want to happen if the two dataframes both have values for R1/R2? If you want to keep df1's values, you could do
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
To keep df2's values instead, just do the fillna the other way round. To combine them in some other way, please clarify the question!
Try this:
pd.concat([df1,df2],axis=0).sort_values(['code','scale']).drop_duplicates(['code','scale'],keep='last')
Out[21]:
code scale R1 R2
0 121 1 80.0 110.0
0 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
3 313 2 15.0 10.0
5 313 3 NaN NaN
This is a good situation for combine_first. It replaces the nulls in the calling dataframe from the passed dataframe.
df1.set_index(['code', 'scale']).combine_first(df2.set_index(['code', 'scale'])).reset_index()
code scale R1 R2
0 121 1 80.0 110.0
1 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
4 313 2 15.0 10.0
5 313 3 NaN NaN
Other solutions
with fillna
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
with add - a bit faster
df1.set_index(['code', 'scale']).add(df2.set_index(['code', 'scale']), fill_value=0)
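One more approach that could work here, sketched under the same code/scale index assumption: DataFrame.update modifies df1 in place, and with overwrite=False it only fills cells that are currently NaN, leaving df1's existing values untouched.
import pandas as pd

df1 = df1.set_index(['code', 'scale'])

# Fill only the NaN cells of df1 with the matching values from df2
df1.update(df2.set_index(['code', 'scale']), overwrite=False)

df1 = df1.reset_index()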

Remove special characters in pandas dataframe

This seems like an inherently simple task, but I am finding it very difficult to remove the '*' from my entire data frame and return the numeric values in each column, including the numbers that did not have '*'. The dataframe includes hundreds more columns and looks like this in short:
Time A1 A2
2.0002546296 1499 1592
2.0006712963 1252 1459
2.0902546296 1731 2223
2.0906828704 1691 1904
2.1742245370 2364 3121
2.1764699074 2096 1942
2.7654050926 *7639* *8196*
2.7658564815 *7088* *7542*
2.9048958333 *8736* *8459*
2.9053125000 *7778* *7704*
2.9807175926 *6612* *6593*
3.0585763889 *8520* *9122*
I have not written it to iterate over every column in df yet, but as far as the first column goes I have come up with this:
df['A1'].str.replace('*','').astype(float)
which yields
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 7639.0
20 7088.0
21 8736.0
22 7778.0
23 6612.0
24 8520.0
Is there a very easy way to just remove the '*' in the dataframe in pandas?
Use replace, which applies to the whole dataframe:
df
Out[14]:
Time A1 A2
0 2.000255 1499 1592
1 2.176470 2096 1942
2 2.765405 *7639* *8196*
3 2.765856 *7088* *7542*
4 2.904896 *8736* *8459*
5 2.905312 *7778* *7704*
6 2.980718 *6612* *6593*
7 3.058576 *8520* *9122*
df=df.replace('\*','',regex=True).astype(float)
df
Out[16]:
Time A1 A2
0 2.000255 1499 1592
1 2.176470 2096 1942
2 2.765405 7639 8196
3 2.765856 7088 7542
4 2.904896 8736 8459
5 2.905312 7778 7704
6 2.980718 6612 6593
7 3.058576 8520 9122
I found this to be a simple approach - use replace to retain only the digits (plus the dot and minus sign).
This removes letters, symbols or anything else not matched by the to_replace pattern.
So, the solution is:
df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'']
df['A1'] = df['A1'].astype(float64)
There is another solution which uses the map and strip functions.
You can see the link below:
Pandas DataFrame: remove unwanted parts from strings in a column.
df =
Time A1 A2
0 2.0 1258 *1364*
1 2.1 *1254* 2002
2 2.2 1520 3364
3 2.3 *300* *10056*
cols = ['A1', 'A2']
for col in cols:
    df[col] = df[col].map(lambda x: str(x).lstrip('*').rstrip('*')).astype(float)
df =
Time A1 A2
0 2.0 1258 1364
1 2.1 1254 2002
2 2.2 1520 3364
3 2.3 300 10056
The parsing procedure is only applied to the desired columns.
I found the answer of CuriousCoder brief and useful, but there must be a ')' instead of ']'. So it should be:
df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'')
df['A1'] = df['A1'].astype(float)
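A minimal sketch of one more variant, under the assumption that every column (including Time) should end up numeric: strip the asterisks across the whole frame, then let pd.to_numeric do the conversion, turning anything unparseable into NaN instead of raising.
import pandas as pd

# Remove the surrounding asterisks everywhere, then convert column by column
df = df.replace(r'\*', '', regex=True)
df = df.apply(pd.to_numeric, errors='coerce')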
