How to do proper imputation in Python / Sklearn

I have the data below. Notice that Age has a NaN. My goal is to impute all columns properly.
+----+-------------+----------+--------+------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
+----+-------------+----------+--------+------+-------+-------+---------+
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 |
| 5 | 6 | 0 | 3 | NaN | 0 | 0 | 8.4583 |
+----+-------------+----------+--------+------+-------+-------+---------+
I have working code that imputes all columns; the results are below, and they look problematic.
+----+-------------+----------+--------+-----------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
+----+-------------+----------+--------+-----------+-------+-------+---------+
| 0 | 1.0 | 0.0 | 3.0 | 22.000000 | 1.0 | 0.0 | 7.2500 |
| 1 | 2.0 | 1.0 | 1.0 | 38.000000 | 1.0 | 0.0 | 71.2833 |
| 2 | 3.0 | 1.0 | 3.0 | 26.000000 | 0.0 | 0.0 | 7.9250 |
| 3 | 4.0 | 1.0 | 1.0 | 35.000000 | 1.0 | 0.0 | 53.1000 |
| 4 | 5.0 | 0.0 | 3.0 | 35.000000 | 0.0 | 0.0 | 8.0500 |
| 5 | 6.0 | 0.0 | 3.0 | 2.909717 | 0.0 | 0.0 | 8.4583 |
+----+-------------+----------+--------+-----------+-------+-------+---------+
My code is below:
import pandas as pd
import numpy as np
#https://www.kaggle.com/shivamp629/traincsv/downloads/traincsv.zip/1
data = pd.read_csv("train.csv")
data2 = data[['PassengerId', 'Survived','Pclass','Age','SibSp','Parch','Fare']].copy()
from sklearn.preprocessing import Imputer
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
data2_im = pd.DataFrame(fill_NaN.fit_transform(data2), columns = data2.columns)
data2_im
It's odd that the imputed age is 2.909717. Is there a proper way to do simple mean imputation? I'm okay with going column by column, but I'm not clear on the syntax/approach. Thanks for any help.

The root of your problem is this line:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
With axis=1 you're averaging across each row, i.e. mixing apples and oranges (PassengerId, Fare, Age, ...).
Try changing it to:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0)  # axis=0: average each column
and you will have the expected behaviour.
strategy='median' could be even better, as it's robust against outliers:
fill_NaN = Imputer(missing_values=np.nan, strategy='median', axis=0)

The problem is that you're using the wrong axis. The correct code is:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0)
Note the axis=0.

Try:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0)
or
data2.fillna(data2.mean())
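Note that the `Imputer` class used above was deprecated and then removed in newer scikit-learn releases (0.22+). Its replacement, `SimpleImputer` from `sklearn.impute`, always imputes column-wise, so the axis pitfall disappears entirely. A minimal sketch on a small frame mimicking the question's data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Two columns from the question's data; Age has one missing value.
data2 = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# SimpleImputer works column by column, so the NaN in Age is filled
# with the mean of the Age column only.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
data2_im = pd.DataFrame(imputer.fit_transform(data2), columns=data2.columns)

print(data2_im["Age"].iloc[5])  # mean of the five known ages: 31.2
```

As with `Imputer`, `strategy="median"` is also available and is more robust to outliers.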

Related

Match and store based on two columns in python dataframe

I'm working on a project that requires me to match two data frames based on two separate columns, X and Y.
e.g.
df1 =
| X | Y | AGE |
|:--- |:---:|----:|
| 2.0 | 1.5 | 25 |
| 1.0 | 0.5 | 29 |
| 1.5 | 0.5 | 21 |
| 2.0 | 2.0 | 32 |
| 0.0 | 1.5 | 19 |
df2 =
| X | Y | AGE |
|:--- |:---:|----:|
| 0.0 | 0.0 | [] |
| 0.0 | 0.5 | [] |
| 0.0 | 1.0 | [] |
| 0.0 | 1.5 | [] |
| 0.0 | 2.0 | [] |
| 0.5 | 0.0 | [] |
| 0.5 | 0.5 | [] |
| 0.5 | 1.0 | [] |
| 0.5 | 1.5 | [] |
| 0.5 | 2.0 | [] |
| 1.0 | 0.0 | [] |
| 1.0 | 0.5 | [] |
| 1.0 | 1.0 | [] |
| 1.0 | 1.5 | [] |
| 1.0 | 2.0 | [] |
| 1.5 | 0.0 | [] |
| 1.5 | 0.5 | [] |
| 1.5 | 1.0 | [] |
| 1.5 | 1.5 | [] |
| 1.5 | 2.0 | [] |
| 2.0 | 0.0 | [] |
| 2.0 | 0.5 | [] |
| 2.0 | 1.0 | [] |
| 2.0 | 1.5 | [] |
| 2.0 | 2.0 | [] |
The goal is to sort through df1, find the row with its matching coordinates in df2, and then store the AGE value from df1 in the AGE list in df2. The expected output would be:
df2 =
| X | Y | AGE |
|:--- |:---:|----:|
| 0.0 | 0.0 | [] |
| 0.0 | 0.5 | [] |
| 0.0 | 1.0 | [] |
| 0.0 | 1.5 |[19] |
| 0.0 | 2.0 | [] |
| 0.5 | 0.0 | [] |
| 0.5 | 0.5 | [] |
| 0.5 | 1.0 | [] |
| 0.5 | 1.5 | [] |
| 0.5 | 2.0 | [] |
| 1.0 | 0.0 | [] |
| 1.0 | 0.5 |[29] |
| 1.0 | 1.0 | [] |
| 1.0 | 1.5 | [] |
| 1.0 | 2.0 | [] |
| 1.5 | 0.0 | [] |
| 1.5 | 0.5 |[21] |
| 1.5 | 1.0 | [] |
| 1.5 | 1.5 | [] |
| 1.5 | 2.0 | [] |
| 2.0 | 0.0 | [] |
| 2.0 | 0.5 | [] |
| 2.0 | 1.0 | [] |
| 2.0 | 1.5 |[25] |
| 2.0 | 2.0 |[32] |
The code I have so far is:
for n in df1:
    if df1["X"].values[n] == df2["X"].values[n]:
        for m in df1:
            if df1["Y"].values[m] == df2["Y"].values[m]:
                df2['AGE'].push(df1['AGE'])
This is a merge operation: take df2 without its AGE column and left-merge df1 onto it on ['X','Y']:
df2 = df2[['X','Y']].merge(df1,on=['X','Y'],how='left')
To store age in lists:
df2 = df2[['X','Y']].merge(df1, on=['X','Y'], how='left')
df2['AGE'] = df2.apply(lambda row: [row['AGE']] if not pd.isnull(row['AGE']) else [], axis=1)
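Put together with the question's sample data, a minimal runnable sketch of the merge-then-wrap approach:

```python
import pandas as pd

df1 = pd.DataFrame({"X": [2.0, 1.0, 1.5, 2.0, 0.0],
                    "Y": [1.5, 0.5, 0.5, 2.0, 1.5],
                    "AGE": [25, 29, 21, 32, 19]})
# The full 0.0-2.0 coordinate grid in 0.5 steps, as in df2 above.
df2 = pd.DataFrame([(x / 2, y / 2) for x in range(5) for y in range(5)],
                   columns=["X", "Y"])

# Left-merge pulls AGE across wherever (X, Y) matches; unmatched rows get NaN.
df2 = df2.merge(df1, on=["X", "Y"], how="left")

# Wrap each hit in a one-element list; NaN becomes an empty list.
df2["AGE"] = df2["AGE"].apply(lambda a: [] if pd.isnull(a) else [int(a)])

print(df2.loc[(df2["X"] == 1.0) & (df2["Y"] == 0.5), "AGE"].iloc[0])  # [29]
```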

Multiplying pandas columns based on multiple conditions

I have a df like this
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 0 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 0 |
| total | | 6 | | 0 |
I want a output like below
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 8 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 2 |
| total | | 6 | | 0 |
The goal is to compute column C by multiplying A and B, but only for the row where count is "yes". However, if the same person appears under both "yes" and "no" (as dia does), the calculation should be done for the "no" row instead.
This is what I've tried so far:
df.C = df.groupby("Host", as_index=False).apply(
    lambda dfx: df.A * df.B if (df['count'] == 'no') else df.A * df.B)
But not able to achieve the goal, any idea how can I achieve the output
import numpy as np

# Set conditions
c1 = df.groupby('people')['count'].transform('nunique').eq(1) & df['count'].eq('yes')
c2 = df.groupby('people')['count'].transform('nunique').gt(1) & df['count'].eq('no')

# Put conditions in a list
c = [c1, c2]

# Make the choices corresponding to the condition list
choice = [df['A'] * df['B'], len(df[df['count'].eq('no')])]

# Apply np.select
df['C'] = np.select(c, choice, 0)
print(df)
count people A B C
0 yes siya 4 2.0 8.0
1 no aish 4 3.0 0.0
2 total NaN 4 0.0 0.0
3 yes dia 6 4.0 0.0
4 no dia 6 2.0 2.0
5 total NaN 6 NaN 0.0
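For reference, a self-contained version of the same `np.select` approach, reconstructing the question's sample data (the "total" rows have no people value, so the groupby simply skips them):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count":  ["yes", "no", "total", "yes", "no", "total"],
    "people": ["siya", "aish", None, "dia", "dia", None],
    "A":      [4, 4, 4, 6, 6, 6],
    "B":      [2, 3, None, 4, 2, None],
})

# A person seen only once gets C on their 'yes' row; a person seen with
# both 'yes' and 'no' (dia) gets C on their 'no' row instead.
seen = df.groupby("people")["count"].transform("nunique")
c1 = seen.eq(1) & df["count"].eq("yes")
c2 = seen.gt(1) & df["count"].eq("no")

df["C"] = np.select([c1, c2],
                    [df["A"] * df["B"], len(df[df["count"].eq("no")])],
                    0)
print(df["C"].tolist())  # [8.0, 0.0, 0.0, 0.0, 2.0, 0.0]
```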

Python Pandas: Return all values from dataframe that match two columns from another dataframe [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have a dataframe that looks like this (10k~ rows). I'll call it Maindf
+---+---------+----------+-------+--------------+
| | Product | Discount | Store | OtherColumns |
+---+---------+----------+-------+--------------+
| 0 | A | 0.5 | Red | |
| 1 | A | 1 | Red | |
| 2 | C | 3 | Green | |
| 3 | Z | 1.5 | Blue | |
| 4 | I | 0 | Red | |
| 5 | D | 0 | Green | |
+---+---------+----------+-------+--------------+
Through code I generate this other dataframe (changes depending on the input data). I'll call it Filterdf
+---+---------+----------+---------+
| | Product | Discount | Counter |
+---+---------+----------+---------+
| 0 | A | 0.5 | 1 |
| 1 | B | 2.0 | 2 |
| 2 | C | 1 | 9 |
| 3 | D | 0 | 7 |
+---+---------+----------+---------+
I am trying to return all values from Maindf that match on columns Product and Discount with Filterdf.
So the expected output would be this
+---+---------+----------+-------+--------------+
| | Product | Discount | Store | OtherColumns |
+---+---------+----------+-------+--------------+
| 0 | A | 0.5 | Red | |
| 1 | D | 0 | Green | |
+---+---------+----------+-------+--------------+
And here is my code line to do it, which is not working out properly.
NewMaindf = Maindf[(Maindf['Product'].isin(Filterdf['Product']) & Maindf['Discount'].isin(Filterdf['Discount']))]
print(NewMaindf)
The output is below. I only want rows of Maindf where both columns match the same row in Filterdf; here (A, 1) slips through because A is in Filterdf['Product'] and 1 is in Filterdf['Discount'], even though the discount 1 actually belongs to product C.
+---+---------+----------+-------+--------------+
| | Product | Discount | Store | OtherColumns |
+---+---------+----------+-------+--------------+
| 0 | A | 0.5 | Red | |
| 1 | A | 1 | Red | |
| 2 | D | 0 | Green | |
+---+---------+----------+-------+--------------+
How could this be achieved?
Thank you and sorry for the poor formatting, first time posting here
import pandas as pd
maindf = {'Product': ['A', 'A','C','Z','I','D'], 'Discount': [0.5,1,3,1.5,0,0],'Store' :['Red','Red','Red','Red','Red','Red']}
Maindf = pd.DataFrame(data=maindf)
print(Maindf)
filterdf = {'Product': ['A', 'B','C','D' ], 'Discount': [0.5, 2.0,1,0]}
Filterdf = pd.DataFrame(data=filterdf)
print(Filterdf)
NewMaindf = Maindf[Maindf[['Product', 'Discount']].astype(str).sum(axis=1).isin(
    Filterdf[['Product', 'Discount']].astype(str).sum(axis=1))]
print(NewMaindf)
Output:
Product Discount Store
0 A 0.5 Red
1 A 1.0 Red
2 C 3.0 Red
3 Z 1.5 Red
4 I 0.0 Red
5 D 0.0 Red
Product Discount
0 A 0.5
1 B 2.0
2 C 1.0
3 D 0.0
Product Discount Store
0 A 0.5 Red
5 D 0.0 Red
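An alternative worth knowing (see the linked Pandas Merging 101): an inner merge on both key columns matches row pairs directly and avoids the string-concatenation step, which can collide in edge cases (e.g. product "A" / discount 10.5 concatenates the same as "A1" / 0.5). A sketch using the question's data:

```python
import pandas as pd

Maindf = pd.DataFrame({"Product": ["A", "A", "C", "Z", "I", "D"],
                       "Discount": [0.5, 1, 3, 1.5, 0, 0],
                       "Store": ["Red", "Red", "Green", "Blue", "Red", "Green"]})
Filterdf = pd.DataFrame({"Product": ["A", "B", "C", "D"],
                         "Discount": [0.5, 2.0, 1, 0]})

# Inner merge keeps only rows whose (Product, Discount) pair appears in both frames.
NewMaindf = Maindf.merge(Filterdf[["Product", "Discount"]],
                         on=["Product", "Discount"], how="inner")
print(NewMaindf)
```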

Adding a row to an existing Pivot table

I have the below pivot table:
Fruit | | apple | orange | banana
Market | # num bracket | | |
X | 100 | 1.2 | 1.0 | NaN
Y | 50 | 2.0 | 3.5 | NaN
Y | 100 | NaN | 3.6 | NaN
Z | 50 | NaN | NaN | 1.6
Z | 100 | NaN | NaN | 1.3
I want to add in the below row at the bottom
Fruit | apple | orange | banana
Price | 3.5 | 1.2 | 2
So the new table looks like the below
Fruit | x | apple | orange | banana
Market | # num bracket | | |
X | 100 | 1.2 | 1.0 | NaN
Y | 50 | 2.0 | 3.5 | NaN
Y | 100 | NaN | 3.6 | NaN
Z | 50 | NaN | NaN | 1.6
Z | 100 | NaN | NaN | 1.3
Price | | 3.5 | 1.2 | 2
Does anyone have a quick and easy recommendation on how to do this?
temp_df = pd.DataFrame(data=[{'Fruit Market': 'Price',
                              'apple': 3.5,
                              'orange': 1.2,
                              'banana': 2}],
                       columns=['Fruit Market', 'x', 'apple', 'orange', 'banana'])
pd.concat([df, temp_df], axis=0, ignore_index=True)
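A minimal runnable sketch of the same concat idea, assuming the pivot has first been reset to a flat frame (the column names here are illustrative, not the question's exact ones):

```python
import pandas as pd

# Flat frame standing in for the reset pivot table.
df = pd.DataFrame({"Market": ["X", "Y"], "num": [100, 50],
                   "apple": [1.2, 2.0], "orange": [1.0, 3.5]})

price = pd.DataFrame([{"Market": "Price", "apple": 3.5, "orange": 1.2}])

# Columns missing from the new row (here: num) are filled with NaN.
out = pd.concat([df, price], ignore_index=True)
print(out.iloc[-1]["apple"])  # 3.5
```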

pandas pivot table to data frame [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a dataframe (df) that looks like this:
+---------+-------+------------+----------+
| subject | pills | date | strength |
+---------+-------+------------+----------+
| 1 | 4 | 10/10/2012 | 250 |
| 1 | 4 | 10/11/2012 | 250 |
| 1 | 2 | 10/12/2012 | 500 |
| 2 | 1 | 1/6/2014 | 1000 |
| 2 | 1 | 1/7/2014 | 250 |
| 2 | 1 | 1/7/2014 | 500 |
| 2 | 3 | 1/8/2014 | 250 |
+---------+-------+------------+----------+
When I use reshape in R, I get what I want:
reshape(df, idvar = c("subject","date"), timevar = 'strength', direction = "wide")
+---------+------------+--------------+--------------+---------------+
| subject | date | strength.250 | strength.500 | strength.1000 |
+---------+------------+--------------+--------------+---------------+
| 1 | 10/10/2012 | 4 | NA | NA |
| 1 | 10/11/2012 | 4 | NA | NA |
| 1 | 10/12/2012 | NA | 2 | NA |
| 2 | 1/6/2014 | NA | NA | 1 |
| 2 | 1/7/2014 | 1 | 1 | NA |
| 2 | 1/8/2014 | 3 | NA | NA |
+---------+------------+--------------+--------------+---------------+
Using pandas:
df.pivot_table(df, index=['subject','date'],columns='strength')
+---------+------------+-------+----+-----+
| | | pills |
+---------+------------+-------+----+-----+
| | strength | 250 | 500| 1000|
+---------+------------+-------+----+-----+
| subject | date | | | |
+---------+------------+-------+----+-----+
| 1 | 10/10/2012 | 4 | NA | NA |
| | 10/11/2012 | 4 | NA | NA |
| | 10/12/2012 | NA | 2 | NA |
+---------+------------+-------+----+-----+
| 2 | 1/6/2014 | NA | NA | 1 |
| | 1/7/2014 | 1 | 1 | NA |
| | 1/8/2014 | 3 | NA | NA |
+---------+------------+-------+----+-----+
How do I get exactly the same output as in R with pandas? I want only a single header row.
After pivoting, convert the dataframe to records and then back to dataframe:
flattened = pd.DataFrame(pivoted.to_records())
# subject date ('pills', 250) ('pills', 500) ('pills', 1000)
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
You can now "repair" the column names, if you want:
flattened.columns = [hdr.replace("('pills', ", "strength.").replace(")", "")
                     for hdr in flattened.columns]
flattened
# subject date strength.250 strength.500 strength.1000
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
It's awkward, but it works.
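An alternative that avoids the string surgery: pass `values='pills'` so the pivot has a single-level column index, then prefix the strength columns and reset the row index. A sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({"subject": [1, 1, 1, 2, 2, 2, 2],
                   "pills":   [4, 4, 2, 1, 1, 1, 3],
                   "date": ["10/10/2012", "10/11/2012", "10/12/2012",
                            "1/6/2014", "1/7/2014", "1/7/2014", "1/8/2014"],
                   "strength": [250, 250, 500, 1000, 250, 500, 250]})

# values='pills' keeps the column index single-level (just the strengths);
# add_prefix renames 250 -> 'strength.250'; reset_index flattens the rows.
wide = (df.pivot_table(index=["subject", "date"], columns="strength",
                       values="pills")
          .add_prefix("strength.")
          .reset_index())
wide.columns.name = None  # drop the leftover 'strength' axis name
print(list(wide.columns))
```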
