Python: How to replace missing values column wise by median

I have a dataframe as follows
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan], 'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]})
I want to replace the missing values, i.e. np.nan, in a generic way. For this I have created the following function:
def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col)>0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum()/len(df_nu[i])>0.10:
                df_final_nu = df_nu.drop([i],axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(),inplace=True)
    return df_final_nu
When I apply this function as follows
df_final = treat_mis_value_nu(df)
I am getting a dataframe as follows
A B C
0 1 1.0 4
1 2 2.0 5
2 3 NaN 6
So it has actually removed column D correctly, but failed to remove column B.
I know there have been past discussions on this topic (here). Am I still missing something?

Use:
df = pd.DataFrame({'A': [1, 2, 3,5,7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan],
'C': [4, 5, 6,8,7], 'D': [4.55, 7.36, np.nan,9,10],
'E':list('abcde')})
print (df)
A B C D E
0 1 1.45 4 4.55 a
1 2 2.33 5 7.36 b
2 3 NaN 6 NaN c
3 5 NaN 8 9.00 d
4 7 NaN 7 10.00 e
def treat_mis_value_nu(df):
    # keep only the numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # keep only the columns that contain NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # columns to drop: share of NaNs above the threshold (mean is the same as sum/len)
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # fill missing values with the column medians, then drop the columns above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)
print (treat_mis_value_nu(df))
A C D E
0 1 4 4.55 a
1 2 5 7.36 b
2 3 6 8.18 c
3 5 8 9.00 d
4 7 7 10.00 e
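The same idea can be sketched with the threshold exposed as a parameter. This is only a minimal sketch, not the answerer's code: the function name fill_or_drop_numeric and the parameter max_nan_ratio are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd


def fill_or_drop_numeric(df, max_nan_ratio=0.30):
    """Drop numeric columns with too many NaNs, median-fill the rest.

    Hypothetical helper: the name and the max_nan_ratio parameter are
    illustrative assumptions, not part of pandas.
    """
    numeric = df.select_dtypes(include="number")
    # Share of missing values per numeric column
    nan_ratio = numeric.isna().mean()
    # Columns whose share of NaNs exceeds the threshold are removed
    cols_to_drop = nan_ratio[nan_ratio > max_nan_ratio].index
    kept = df.drop(columns=cols_to_drop)
    # Fill what is left with each remaining numeric column's own median
    medians = kept.select_dtypes(include="number").median()
    return kept.fillna(medians)


df = pd.DataFrame({
    "A": [1, 2, 3, 5, 7],
    "B": [1.45, 2.33, np.nan, np.nan, np.nan],
    "C": [4, 5, 6, 8, 7],
    "D": [4.55, 7.36, np.nan, 9, 10],
    "E": list("abcde"),
})
result = fill_or_drop_numeric(df)
print(result)
```

Non-numeric columns such as E pass through untouched, because the medians are only computed for numeric columns.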

I would recommend looking at the scikit-learn imputer transformer. I don't think it can drop columns, but it can definitely fill them in a 'generic way', for example by filling in missing values with the median of the relevant column.
You could use it like this (Imputer from sklearn.preprocessing was removed in scikit-learn 0.22; SimpleImputer from sklearn.impute is its replacement):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
# the median strategy only works on numeric columns
num_df = df.select_dtypes(include=['number'])
# fit_transform learns the column medians and fills the gaps in one step
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=num_df.columns, index=num_df.index)
If you have additional transformations you would like to make, you could consider building a transformation Pipeline, or even writing your own transformers for bespoke tasks.

Related

Pandas new dataframe that has sum of columns from another

I'm struggling to figure out how to do a couple of transformation with pandas. I want a new dataframe with the sum of the values from the columns in the original. I also want to be able to merge two of these 'summed' dataframes.
Example #1: Summing the columns
Before:
A B C D
1 4 7 0
2 5 8 1
3 6 9 2
After:
A B C D
6 15 24 3
Right now I'm getting the sums of the columns I'm interested in, storing them in a dictionary, and creating a dataframe from the dictionary. I feel like there is a better way to do this with pandas that I'm not seeing.
Example #2: merging 'summed' dataframes
Before:
A B C D F
6 15 24 3 1
A B C D E
1 2 3 4 2
After:
A B C D E F
7 17 27 7 2 1
First question:
Summing the columns
Use sum then convert Series to DataFrame and transpose
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
'C': [7, 8, 9], 'D': [0, 1, 2]})
df1 = df1.sum().to_frame().T
print(df1)
# Output:
A B C D
0 6 15 24 3
Second question:
Merging 'summed' dataframes
Use combine
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})
out = df1.combine(df2, sum, fill_value=0)
print(out)
# Output:
A B C D E
0 7 17 27 7 2
First part, use DataFrame.sum() to sum the columns then convert Series to dataframe by .to_frame() and finally transpose:
df_sum = df.sum().to_frame().T
Result:
print(df_sum)
A B C D
0 6 15 24 3
Second part, use DataFrame.add() with parameter fill_value, as follows:
df_sum2 = df1.add(df2, fill_value=0)
Result:
print(df_sum2)
A B C D E F
0 7 17 27 7 2.0 1.0
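Both steps can be put together in one short sketch. The frames df_a and df_b and their values are made up for illustration (they are chosen so the totals match the example in the question); only sum, to_frame, T and add are the pandas calls discussed above.

```python
import pandas as pd

# Two hypothetical input frames with partly different columns
df_a = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                     "D": [0, 1, 2], "F": [0, 0, 1]})
df_b = pd.DataFrame({"A": [1], "B": [2], "C": [3], "D": [4], "E": [2]})

# Step 1: collapse each frame into a single row of column sums
sum_a = df_a.sum().to_frame().T
sum_b = df_b.sum().to_frame().T

# Step 2: add the one-row frames; fill_value=0 treats a column that is
# missing on one side as 0 instead of producing NaN
total = sum_a.add(sum_b, fill_value=0)
print(total)
```

Columns that exist in only one of the frames (E and F here) come out as floats, because alignment introduces missing values before they are filled.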

join 2 pandas dataframes with different number of columns

Consider we have 2 dataframes:
df = pd.DataFrame(columns = ['a','b','c']) ##empty
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d)
How can I join them in order the result to be this:
a b c
1 3 NaN
2 4 NaN
Use reindex by columns from df:
df = pd.DataFrame(columns = ['a','b','c'])
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d).reindex(columns=df.columns)
print (df1)
a b c
0 1 3 NaN
1 2 4 NaN
Difference between the solutions: if the columns are not in sorted order, the outputs differ:
#different order
df = pd.DataFrame(columns = ['c','a','b'])
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d)
print (df1.reindex(columns=df.columns))
c a b
0 NaN 1 3
1 NaN 2 4
print (df1.merge(df,how='left'))
a b c
0 1 3 NaN
1 2 4 NaN
How can I join them
If the dataframe already exists somewhere (rather than being created fresh), do:
df1.merge(df,how='left')
a b c
0 1 3 NaN
1 2 4 NaN
Note: merge keeps df1's columns first and appends the columns found only in df, so the result matches df's column order only when that order is already a, b, c. Otherwise use reindex.
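The column-order point can be checked with a small sketch. The variable names template, data and aligned are made up for this example; reindex(columns=...) is the call from the answer above.

```python
import pandas as pd

# Empty frame that only defines the desired column layout
template = pd.DataFrame(columns=["c", "a", "b"])
data = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# reindex follows the template's column order and fills the
# column that data does not have ("c") with NaN
aligned = data.reindex(columns=template.columns)
print(aligned)
```

Because reindex takes the order from the template, it is the safer choice when the target layout is not alphabetical.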

Pandas create a data frame based on two other 'sub' frames

I have two Pandas data frames. df1 has columns ['a','b','c'] and df2 has columns ['a','c','d']. Now, I create a new data frame df3 with columns ['a','b','c','d'].
I want to fill df3 with all the inputs from df1 and df2. For example, if I have x rows in df1, and y rows in df2, then I will have x+y rows in df3.
Which Pandas function fills the new dataframe based on partial columns?
Example data:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[2, 3, 4], 'd':['h', 'j', 'k']})
df2 = pd.DataFrame({'a':[5, 6, 7], 'b':[1, 1, 1], 'c':[2, 2, 2]})
Code:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
pd.concat([df1, df2], sort=True)
Out:
a b c d
0 1 2 NaN h
1 2 3 NaN j
2 3 4 NaN k
0 5 1 2.0 NaN
1 6 1 2.0 NaN
2 7 1 2.0 NaN
Use the append function of the dataframe: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
Note that DataFrame.append was removed in pandas 2.0; on current versions the equivalent is pd.concat:
anotherFrame = pd.concat([df1, df2], ignore_index=True)
Another way is merge: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
df1.merge(df2, how='outer')
How about:
df1 = pd.DataFrame({"a": [1,2], "b": [3,4], "c": [5,6]})
df2 = pd.DataFrame({"a": [7,8], "c": [9,10], "d": [11,12]})
df3 = pd.concat([df1, df2], sort=False)  # DataFrame.append was removed in pandas 2.0
df3
a b c d
0 1 3.0 5 NaN
1 2 4.0 6 NaN
0 7 NaN 9 11.0
1 8 NaN 10 12.0
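For the x+y rows asked about in the question, a minimal sketch with pd.concat might look like this. The sample values are taken from the last answer; ignore_index=True is added here so the result gets a fresh 0..n-1 index instead of repeating each frame's own index.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df2 = pd.DataFrame({"a": [7, 8], "c": [9, 10], "d": [11, 12]})

# Stack the rows of both frames; columns missing on one side become NaN.
# sort=False keeps the columns in order of first appearance.
df3 = pd.concat([df1, df2], ignore_index=True, sort=False)
print(df3)
```

Columns that receive NaN (b and d here) are upcast to float, which is why their values print with a decimal point.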

Dataframe is not updated when columns are passed to function using apply

I have two dataframes like this:
A B
a 1 10
b 2 11
c 3 12
d 4 13
A B
a 11 NaN
b NaN NaN
c NaN 20
d 16 30
They have identical column names and indices. My goal is to replace the NAs in df2 by the values of df1. Currently, I do this like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': range(1, 5), 'B': range(10, 14)}, index=list('abcd'))
df2 = pd.DataFrame({'A': [11, np.nan, np.nan, 16], 'B': [np.nan, np.nan, 20, 30]}, index=list('abcd'))
def repl_na(s, d):
    s[s.isnull().values] = d[s.isnull().values][s.name]
    return s
df2.apply(repl_na, args=(df1, ))
which gives me the desired output:
A B
a 11 10
b 2 11
c 3 20
d 16 30
My question is now how this could be accomplished if the indices of the dataframes are different (column names are still the same, and the columns have the same length). So I would have a df2 like this(df1 is unchanged):
A B
0 11 NaN
1 NaN NaN
2 NaN 20
3 16 30
Then the above code does not work anymore since the indices of the dataframes are different. Could someone tell me how the line
s[s.isnull().values] = d[s.isnull().values][s.name]
has to be modified in order to get the same result as above?
You could temporarily change the index of df1 to match that of df2 and then combine_first with df2:
df2.combine_first(df1.set_index(df2.index))
A B
0 11 10
1 2 11
2 3 20
3 16 30
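As a self-contained sketch of this approach, using the frames from the question with df2 on a default integer index (the variable name filled is made up for this example):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": range(1, 5), "B": range(10, 14)}, index=list("abcd"))
df2 = pd.DataFrame({"A": [11, np.nan, np.nan, 16],
                    "B": [np.nan, np.nan, 20, 30]})  # index 0..3

# set_index relabels df1's rows with df2's index, which lines the two
# frames up by position; combine_first then fills df2's NaNs from df1
filled = df2.combine_first(df1.set_index(df2.index))
print(filled)
```

This relies on both frames having the same number of rows in the same order, as stated in the question. The original df1 is not modified, since set_index returns a new frame.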

Creating dataframe from a dictionary where entries have different lengths

Say I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the length of the array is not the same for all of them.
How can I create a dataframe where each column holds a different entry?
When I try:
pd.DataFrame(my_dict)
I get:
ValueError: arrays must all be the same length
Any way to overcome this? I am happy to have Pandas use NaN to pad those columns for the shorter entries.
In Python 3.x:
import pandas as pd
import numpy as np
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
Out[7]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
In Python 2.x:
replace d.items() with d.iteritems().
Here's a simple way to do that:
In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]:
0 1 2 3
A 1 2 NaN NaN
B 1 2 3 4
In[23]: df.transpose()
Out[23]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
A way of tidying up your syntax, but still do essentially the same thing as these other answers, is below:
>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}
>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })
>>> dict_df
one 2 3
0 1.0 4 8.0
1 2.0 5 NaN
2 3.0 6 NaN
3 NaN 7 NaN
A similar syntax exists for lists, too:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])
>>> list_df
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Another syntax for lists is:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })
>>> list_df
0 1 2
0 1 4.0 6.0
1 2 5.0 NaN
2 3 NaN NaN
You may additionally have to transpose the result and/or change the column data types (float, integer, etc).
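As a sketch of the data type point: wrapping each entry in a pd.Series pads the shorter columns with NaN, which turns integer columns into floats. Assuming the padded columns only hold whole numbers, the nullable Int64 dtype can restore integers. The helper name frame_from_uneven is hypothetical.

```python
import numpy as np
import pandas as pd


def frame_from_uneven(mapping):
    """Build a DataFrame from a dict whose values differ in length.

    Hypothetical helper name; each value is wrapped in a Series so that
    pandas aligns on the index and pads the shorter columns with NaN.
    """
    return pd.DataFrame({key: pd.Series(values) for key, values in mapping.items()})


d = {"A": np.array([1, 2]), "B": np.array([1, 2, 3, 4])}
df = frame_from_uneven(d)

# Column A became float because of the NaN padding; the nullable
# integer dtype keeps whole numbers and shows the gaps as <NA>
ints = df.astype("Int64")
print(ints)
```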
Use pandas.DataFrame and pandas.concat
The following code creates one single-column DataFrame per entry of a dict of uneven arrays in a list comprehension, then joins them side by side with pandas.concat.
This is a way to create a DataFrame from arrays that are not equal in length.
For equal length arrays, use df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
import pandas as pd
import numpy as np
# create the uneven arrays
mu, sigma = 200, 25
np.random.seed(365)
x1 = mu + sigma * np.random.randn(10, 1)
x2 = mu + sigma * np.random.randn(15, 1)
x3 = mu + sigma * np.random.randn(20, 1)
data = {'x1': x1, 'x2': x2, 'x3': x3}
# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
Use pandas.DataFrame and itertools.zip_longest
For iterables of uneven length, zip_longest fills missing values with the fillvalue.
The zip generator needs to be unpacked, because the DataFrame constructor won't unpack it.
from itertools import zip_longest
# zip all the values together
zl = list(zip_longest(*data.values()))
# create dataframe
df = pd.DataFrame(zl, columns=data.keys())
plot
df.plot(marker='o', figsize=[10, 5])
dataframe
x1 x2 x3
0 232.06900 235.92577 173.19476
1 176.94349 209.26802 186.09590
2 194.18474 168.36006 194.36712
3 196.55705 238.79899 218.33316
4 249.25695 167.91326 191.62559
5 215.25377 214.85430 230.95119
6 232.68784 240.30358 196.72593
7 212.43409 201.15896 187.96484
8 188.97014 187.59007 164.78436
9 196.82937 252.67682 196.47132
10 NaN 223.32571 208.43823
11 NaN 209.50658 209.83761
12 NaN 215.27461 249.06087
13 NaN 210.52486 158.65781
14 NaN 193.53504 199.10456
15 NaN NaN 186.19700
16 NaN NaN 223.02479
17 NaN NaN 185.68525
18 NaN NaN 213.41414
19 NaN NaN 271.75376
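The zip_longest variant is easiest to follow with flat lists. The sketch below is an illustration under that assumption, not the code from the answer: the values are made up, and fillvalue=np.nan is passed explicitly so that the padding is NaN rather than the default None.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Hypothetical flat (1-D) inputs of different lengths
data = {"x1": [1.0, 2.0], "x2": [3.0, 4.0, 5.0], "x3": [6.0]}

# zip_longest yields one tuple per row and pads exhausted inputs
rows = list(zip_longest(*data.values(), fillvalue=np.nan))

# Each tuple becomes a row; the dict keys supply the column names
df = pd.DataFrame(rows, columns=list(data.keys()))
print(df)
```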
While this does not directly answer the OP's question, I found it to be an excellent solution for my case, where I had unequal arrays, so I'd like to share it:
from pandas documentation
In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [32]: df = DataFrame(d)
In [33]: df
Out[33]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
You can also use pd.concat along axis=1 with a list of pd.Series objects:
import pandas as pd, numpy as np
d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}
res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)
print(res)
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
Both of the following lines work (here df is the dictionary):
pd.DataFrame.from_dict(df, orient='index').transpose() #A
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in df.items() ])) #B (Better)
But with %timeit in Jupyter, B ran about 4x faster than A in my case, which matters when working with a huge data set (mainly one with a large number of columns/features).
If you don't want it to show NaN and you have two particular lengths, padding each remaining cell with a space also works.
import pandas as pd
long = [6, 4, 7, 3]
short = [5, 6]
# pad the shorter list with spaces until both lists have the same length
for n in range(len(long) - len(short)):
    short.append(' ')
df = pd.DataFrame({'A': long, 'B': short})
# write the result to an Excel file in the working directory (needs the xlsxwriter package)
with pd.ExcelWriter('example1.xlsx', engine='xlsxwriter') as datatoexcel:
    df.to_excel(datatoexcel, sheet_name='Sheet1')
A B
0 6 5
1 4 6
2 7
3 3
If you have more than 2 lengths of entries, it is advisable to make a function which uses a similar method.
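Such a function could look like the sketch below. The name pad_to_longest and its filler parameter are hypothetical, and the default filler here is an empty string rather than the single space used above.

```python
import pandas as pd


def pad_to_longest(columns, filler=""):
    """Pad every list in a dict to the length of the longest one.

    Hypothetical helper: returns a new dict and leaves the input
    lists unchanged.
    """
    longest = max(len(values) for values in columns.values())
    return {
        name: list(values) + [filler] * (longest - len(values))
        for name, values in columns.items()
    }


padded = pad_to_longest({"A": [6, 4, 7, 3], "B": [5, 6], "C": [1]})
df = pd.DataFrame(padded)
print(df)
```

Padding with strings makes the padded columns object-typed, so this suits display or export more than further numeric work.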
