Change column value in pandas df conditionally [duplicate] - python
I have a Pandas Dataframe as below:
    itm  Date                 Amount
67  420  2012-09-30 00:00:00  65211
68  421  2012-09-09 00:00:00  29424
69  421  2012-09-16 00:00:00  29877
70  421  2012-09-23 00:00:00  30990
71  421  2012-09-30 00:00:00  61303
72  485  2012-09-09 00:00:00  71781
73  485  2012-09-16 00:00:00  NaN
74  485  2012-09-23 00:00:00  11072
75  485  2012-09-30 00:00:00  113702
76  489  2012-09-09 00:00:00  64731
77  489  2012-09-16 00:00:00  NaN
When I try to apply a function to the Amount column, I get the following error:
ValueError: cannot convert float NaN to integer
I have tried applying a function using isnan from the math module.
I have tried the pandas .replace method.
I tried the .sparse data attribute from pandas 0.9.
I have also tried an if NaN == NaN statement in a function.
I have also looked at this article How do I replace NA values with zeros in an R dataframe? whilst looking at some other articles.
All the methods I have tried have not worked or do not recognise NaN.
Any Hints or solutions would be appreciated.
I believe DataFrame.fillna() will do this for you.
Link to Docs for a dataframe and for a Series.
Example:
In [7]: df
Out[7]:
0 1
0 NaN NaN
1 -0.494375 0.570994
2 NaN NaN
3 1.876360 -0.229738
4 NaN NaN
In [8]: df.fillna(0)
Out[8]:
0 1
0 0.000000 0.000000
1 -0.494375 0.570994
2 0.000000 0.000000
3 1.876360 -0.229738
4 0.000000 0.000000
To fill the NaNs in only one column, select just that column. In this case I'm using inplace=True to actually change the contents of df.
In [12]: df[1].fillna(0, inplace=True)
Out[12]:
0 0.000000
1 0.570994
2 0.000000
3 -0.229738
4 0.000000
Name: 1
In [13]: df
Out[13]:
0 1
0 NaN 0.000000
1 -0.494375 0.570994
2 NaN 0.000000
3 1.876360 -0.229738
4 NaN 0.000000
EDIT:
To avoid a SettingWithCopyWarning, use the built in column-specific functionality:
df.fillna({1:0}, inplace=True)
It is not guaranteed that the slicing returns a view or a copy. You can do
df['column'] = df['column'].fillna(value)
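As a small sketch of the two column-wise patterns above (the dict form of fillna and plain assignment), here is an example on a made-up frame. The column names a and b are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

# Fill only column "a" by passing a {column: value} dict.
# This returns a new DataFrame and leaves df untouched.
filled_a = df.fillna({"a": 0})

# Same idea through assignment, which avoids calling an
# inplace method on a column selection.
df["b"] = df["b"].fillna(0)
```

The assignment form does not depend on whether the column selection is a view or a copy, which is the concern raised above.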
You could use replace to change NaN to 0:
import pandas as pd
import numpy as np
# for column
df['column'] = df['column'].replace(np.nan, 0)
# for whole dataframe
df = df.replace(np.nan, 0)
# inplace
df.replace(np.nan, 0, inplace=True)
The below code worked for me.
import pandas
df = pandas.read_csv('somefile.txt')
df = df.fillna(0)
I just wanted to provide a bit of an update/special case since it looks like people still come here. If you're using a multi-index or otherwise using an index-slicer the inplace=True option may not be enough to update the slice you've chosen. For example in a 2x2 level multi-index this will not change any values (as of pandas 0.15):
idx = pd.IndexSlice
df.loc[idx[:,mask_1],idx[mask_2,:]].fillna(value=0,inplace=True)
The "problem" is that the chaining breaks the fillna ability to update the original dataframe. I put "problem" in quotes because there are good reasons for the design decisions that led to not interpreting through these chains in certain situations. Also, this is a complex example (though I really ran into it), but the same may apply to fewer levels of indexes depending on how you slice.
The solution is DataFrame.update:
df.update(df.loc[idx[:,mask_1],idx[[mask_2],:]].fillna(value=0))
It's one line, reads reasonably well (sort of) and eliminates any unnecessary messing with intermediate variables or loops while allowing you to apply fillna to any multi-level slice you like!
If anybody can find places this doesn't work please post in the comments, I've been messing with it and looking at the source and it seems to solve at least my multi-index slice problems.
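A minimal sketch of the update pattern, assuming a two-level row index and ordinary columns (simpler than the 2x2 case described above). The index labels and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level row index; labels and values are illustrative.
index = pd.MultiIndex.from_product([["x", "y"], ["one", "two"]])
df = pd.DataFrame(
    {"a": [np.nan, 1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0, 7.0]},
    index=index,
)

idx = pd.IndexSlice

# Fill NaN only in column "a" for rows whose second level is "one".
# fillna here works on the selected slice, not on df itself.
patch = df.loc[idx[:, "one"], ["a"]].fillna(0)

# update() writes the non-NaN values of the patch back into df,
# aligned on index and columns.
df.update(patch)
```

Cells outside the slice, such as the NaN values in column b, are left as they were.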
You can also use a dictionary to fill NaN values in specific columns of the DataFrame, rather than filling the whole DataFrame with a single value.
import pandas as pd
df = pd.read_excel('example.xlsx')
df.fillna({
    'column1': 'Write your values here',
    'column2': 'Write your values here',
    'column3': 'Write your values here',
    'column4': 'Write your values here',
    # ... one entry per remaining column ...
    'column-n': 'Write your values here',
}, inplace=True)
An easy way to fill the missing values:
Filling string columns: replace missing/NaN values with the most frequent value (the mode).
df['string column name'].fillna(df['string column name'].mode().values[0], inplace = True)
Filling numeric columns: replace missing/NaN values with the column mean.
df['numeric column name'].fillna(df['numeric column name'].mean(), inplace = True)
Filling NaN with zero:
df['column name'].fillna(0, inplace = True)
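The same mode/mean idea can be sketched with assignment instead of inplace=True on a column selection, which newer pandas versions may warn about. The frame and its column names (city, score) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one string column and one numeric column.
df = pd.DataFrame({
    "city": ["Oslo", None, "Oslo", "Bergen"],
    "score": [1.0, np.nan, 3.0, 5.0],
})

# String column: fill with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

# Numeric column: fill with the mean of the non-missing values.
df["score"] = df["score"].fillna(df["score"].mean())
```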
To replace NA values in a pandas column:
df['column_name'].fillna(value_to_be_replaced,inplace=True)
If inplace=False, the call returns the modified values instead of updating df (the dataframe).
Replace all nan with 0
df = df.fillna(0)
The Amount column in the table above is meant to hold integers, so the following would be a solution:
df['Amount'] = df.Amount.fillna(0).astype(int)
Similarly, you can fill and cast to other data types such as float, str and so on.
Keeping one consistent dtype in a column also makes it easier to compare values within that column.
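A short sketch of why the cast matters for the original error: a float column with NaN cannot be cast to a plain integer dtype until the NaN is filled. As an alternative that is not mentioned above, pandas also has a nullable "Int64" dtype that keeps the missing entry. The values are taken from the question's Amount column.

```python
import numpy as np
import pandas as pd

# A few values from the question's Amount column, with one NaN.
amount = pd.Series([65211, np.nan, 11072], name="Amount")

# Fill first, then cast: NaN becomes 0 and the dtype becomes integer.
as_int = amount.fillna(0).astype(int)

# Alternative (an assumption about what you may want): keep the gap
# as a missing value by using the nullable integer dtype.
as_nullable = amount.astype("Int64")
```

Which one is appropriate depends on whether 0 is a meaningful amount in your data.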
There have been many contributions already, but since I'm new here, I will still give input.
There are two approaches to replace NaN values with zeros in Pandas DataFrame:
fillna(): fills NA/NaN values with a specified value or method.
replace(): a simple method that replaces values given as a string, regex, list, or dictionary.
Example:
# NaN with zero on all columns
df2 = df.fillna(0)
# inplace=True modifies df directly instead of returning a new DataFrame
df.fillna(0, inplace = True)
# multiple columns approach
df[["Student", "ID"]] = df[["Student", "ID"]].fillna(0)
Finally, the replace() method:
df["Student"] = df["Student"].replace(np.nan, 0)
To replace NaN in different columns with different values:
replacement= {'column_A': 0, 'column_B': -999, 'column_C': -99999}
df.fillna(value=replacement)
This works for me, but no one's mentioned it. Could there be something wrong with it?
df.loc[df['column_name'].isnull(), 'column_name'] = 0
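For what it is worth, a quick sketch of the boolean-mask approach on made-up data suggests it behaves as expected: only the NaN cells of the chosen column are overwritten. The second column, other, is hypothetical and is there to show that it stays untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "other" is only there to show it is left alone.
df = pd.DataFrame({
    "column_name": [1.0, np.nan, 3.0],
    "other": [np.nan, 2.0, np.nan],
})

# Boolean mask of the missing rows in the target column.
mask = df["column_name"].isna()

# A single .loc call selects rows and column together, so the
# assignment is applied to df itself.
df.loc[mask, "column_name"] = 0
```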
There are two main options. For imputing or filling missing values (NaN / np.nan) with numerical replacements in one or more columns, fillna() is sufficient:
df['Amount'].fillna(value=replacement)
(replacement is a placeholder for the fill value.)
From the Documentation:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). (values not
in the dict/Series/DataFrame will not be filled). This value cannot
be a list.
This means the fill value can be a scalar (a number or a string), a dict, a Series, or a DataFrame, but not a list.
For more specialized imputations use SimpleImputer():
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(X=df[['Col-1', 'Col-2']])
If your data is a NumPy array, you can convert it to a pandas DataFrame and then use fillna.
import numpy as np
df=np.array([[1,2,3, np.nan]])
import pandas as pd
df=pd.DataFrame(df)
df.fillna(0)
This will return the following:
0 1 2 3
0 1.0 2.0 3.0 NaN
>>> df.fillna(0)
0 1 2 3
0 1.0 2.0 3.0 0.0
If you want to set a value in a specific column for selected rows (for example, to fill a NaN), you can use loc:
import numpy as np
import pandas as pd
d1 = {"Col1": ['A', 'B', 'C'],
      "fruits": ['Avocado', 'Banana', np.nan]}
d1 = pd.DataFrame(d1)
output:
Col1 fruits
0 A Avocado
1 B Banana
2 C NaN
d1.loc[ d1.Col1=='C', 'fruits' ] = 'Carrot'
output:
Col1 fruits
0 A Avocado
1 B Banana
2 C Carrot
I think it's also worth mentioning and explaining the parameters of fillna(), such as method, axis, limit, etc.
From the documentation we have:
Series.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
Fill NA/NaN values using the specified method.
Parameters
value [scalar, dict, Series, or DataFrame]
    Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None]
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use next valid observation to fill gap.
axis [{0 or ‘index’}]
    Axis along which to fill missing values.
inplace [bool, default False]
    If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit [int, default None]
    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast [dict, default is None]
    A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
OK. Let's start with the method= parameter, which offers forward fill (ffill) and backward fill (bfill).
ffill copies the previous non-missing value forward.
For example:
import pandas as pd
import numpy as np
inp = [{'c1':10, 'c2':np.nan, 'c3':200}, {'c1':np.nan,'c2':110, 'c3':210}, {'c1':12,'c2':np.nan, 'c3':220},{'c1':12,'c2':130, 'c3':np.nan},{'c1':12,'c2':np.nan, 'c3':240}]
df = pd.DataFrame(inp)
c1 c2 c3
0 10.0 NaN 200.0
1 NaN 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 NaN
4 12.0 NaN 240.0
Forward fill:
df.fillna(method="ffill")
c1 c2 c3
0 10.0 NaN 200.0
1 10.0 110.0 210.0
2 12.0 110.0 220.0
3 12.0 130.0 220.0
4 12.0 130.0 240.0
Backward fill:
df.fillna(method="bfill")
c1 c2 c3
0 10.0 110.0 200.0
1 12.0 110.0 210.0
2 12.0 130.0 220.0
3 12.0 130.0 240.0
4 12.0 NaN 240.0
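As far as I know, recent pandas releases (2.1 and later) deprecate the method= keyword of fillna in favour of the dedicated ffill() and bfill() methods. A sketch of the same two fills on the frame defined above, using those methods:

```python
import numpy as np
import pandas as pd

# Same frame as in the example above.
inp = [
    {"c1": 10, "c2": np.nan, "c3": 200},
    {"c1": np.nan, "c2": 110, "c3": 210},
    {"c1": 12, "c2": np.nan, "c3": 220},
    {"c1": 12, "c2": 130, "c3": np.nan},
    {"c1": 12, "c2": np.nan, "c3": 240},
]
df = pd.DataFrame(inp)

# Forward fill: carry the last valid value down each column.
forward = df.ffill()

# Backward fill: pull the next valid value up each column.
backward = df.bfill()
```

A leading NaN survives a forward fill and a trailing NaN survives a backward fill, because there is no neighbouring value to copy from.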
The axis parameter lets us choose the direction of the fill:
Fill directions:
ffill:
Axis = 1
Method = 'ffill'
----------->
direction
df.fillna(method="ffill", axis=1)
c1 c2 c3
0 10.0 10.0 200.0
1 NaN 110.0 210.0
2 12.0 12.0 220.0
3 12.0 130.0 130.0
4 12.0 12.0 240.0
Axis = 0 # by default
Method = 'ffill'
|
| # direction
|
V
e.g: # This is the ffill default
df.fillna(method="ffill", axis=0)
c1 c2 c3
0 10.0 NaN 200.0
1 10.0 110.0 210.0
2 12.0 110.0 220.0
3 12.0 130.0 220.0
4 12.0 130.0 240.0
bfill:
axis= 0
method = 'bfill'
^
|
|
|
df.fillna(method="bfill", axis=0)
c1 c2 c3
0 10.0 110.0 200.0
1 12.0 110.0 210.0
2 12.0 130.0 220.0
3 12.0 130.0 240.0
4 12.0 NaN 240.0
axis = 1
method = 'bfill'
<-----------
df.fillna(method="bfill", axis=1)
c1 c2 c3
0 10.0 200.0 200.0
1 110.0 110.0 210.0
2 12.0 220.0 220.0
3 12.0 130.0 NaN
4 12.0 240.0 240.0
# aliases:
# ffill == pad
# bfill == backfill
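The row-wise direction can be sketched with the ffill()/bfill() methods as well. This assumes the same five-row frame as above, and the results should match the axis=1 tables shown earlier.

```python
import numpy as np
import pandas as pd

# Same frame as in the examples above, written column by column.
df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

# axis=1 fills along each row: left to right for ffill ...
row_forward = df.ffill(axis=1)

# ... and right to left for bfill.
row_backward = df.bfill(axis=1)
```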
limit parameter:
df
c1 c2 c3
0 10.0 NaN 200.0
1 NaN 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 NaN
4 12.0 NaN 240.0
Replace only the first NaN in each column:
df.fillna(value = 'Unavailable', limit=1)
c1 c2 c3
0 10.0 Unavailable 200.0
1 Unavailable 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 Unavailable
4 12.0 NaN 240.0
df.fillna(value = 'Unavailable', limit=2)
c1 c2 c3
0 10.0 Unavailable 200.0
1 Unavailable 110.0 210.0
2 12.0 Unavailable 220.0
3 12.0 130.0 Unavailable
4 12.0 NaN 240.0
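A small check of the limit behaviour, using 0 instead of the string 'Unavailable' so that the columns stay numeric. With limit=1 only the first NaN of each column should be filled.

```python
import numpy as np
import pandas as pd

# Same frame as in the examples above.
df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

# Fill at most one NaN per column, counted from the top.
limited = df.fillna(0, limit=1)

# Number of NaN values still left in each column.
remaining = limited.isna().sum()
```

Column c2 starts with three NaN values, so two of them remain after the call.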
downcast parameter:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 4 non-null float64
1 c2 2 non-null float64
2 c3 4 non-null float64
dtypes: float64(3)
memory usage: 248.0 bytes
df.fillna(method="ffill",downcast='infer').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 5 non-null int64
1 c2 4 non-null float64
2 c3 5 non-null int64
dtypes: float64(1), int64(2)
memory usage: 248.0 bytes
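The downcast keyword appears to be deprecated in recent pandas releases, so an explicit cast after filling may be the more durable way to get the same dtypes. A sketch under that assumption, for the two columns that end up without NaN:

```python
import numpy as np
import pandas as pd

# Same frame as in the examples above.
df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

# Forward fill first; c2 keeps its leading NaN and must stay float.
filled = df.ffill()

# Cast only the columns that no longer contain NaN.
converted = filled.astype({"c1": "int64", "c3": "int64"})
```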
Related
How do I create a dummy variable by comparing columns in different data frames?
I would like to compare one column of a df with another column in a different df. The columns are timestamp and holiday date. I'd like to create a dummy variable that is 1 if the timestamp in df1 matches a date in df2, else 0.
For example, df1:
   timestamp   weight(kg)
0  2016-03-04  4.0
1  2015-02-15  5.0
2  2019-05-04  5.0
3  2018-12-25  29.0
4  2020-01-01  58.0
For example, df2:
   holiday
0  2016-12-25
1  2017-01-01
2  2019-05-01
3  2018-12-26
4  2020-05-26
Ideal output:
   timestamp   weight(kg)  holiday
0  2016-03-04  4.0         0
1  2015-02-15  5.0         0
2  2019-05-04  5.0         0
3  2018-12-25  29.0        1
4  2020-01-01  58.0        1
I have tried writing a function but it is taking very long to calculate:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if x['timestamp'] == y['holiday_dt'] else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df
#df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps there is a simpler way to do so, or the function is not exactly well-written. Any help will be appreciated.
Use Series.isin and convert the boolean mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
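A self-contained sketch of the isin approach. The dates below are made up so that two of them actually match; both columns are assumed to be parsed as datetimes.

```python
import pandas as pd

# Hypothetical dates chosen so that two timestamps are holidays.
df1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-03-04", "2018-12-25", "2020-01-01"]),
})
df2 = pd.DataFrame({
    "holiday": pd.to_datetime(["2018-12-25", "2019-05-01", "2020-01-01"]),
})

# isin gives a boolean mask; astype(int) turns it into a 1/0 dummy.
df1["holiday"] = df1["timestamp"].isin(df2["holiday"]).astype(int)
```

The membership test is vectorised, which is why it tends to be much faster than a row-by-row apply.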
How to drop '_merge' column in Pandas.merge
I am sorting 2 dataframes according to their accuracy. First I merge the 2 dfs on strict conditions with how='outer', indicator=True and save the result to a df called 'perfect'. Then I extract the left_only and right_only rows from the _merge column into two new dfs, and merge these two on simpler conditions with how='outer', indicator=True, saving the new df as 'partial match'.
When I do this I get the error
ValueError: Cannot use name of an existing column for indicator column
because I used indicator=True again, but I need that indicator so I can apply even simpler conditions to the unmatched rows (i.e. left only and right only).
How can I drop that _merge column? Or how can I get rid of this ValueError? _merge does not appear in df.columns, so I am unable to use drop(['_merge']) or del df._merge
Use 'string' for indicator instead of True. See docs:
indicator : bool or str, default False
    If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
Then the second time you merge, use a different 'string' for indicator.
dfA = pd.DataFrame({'key':np.arange(0,10), 'dataA':np.arange(100,110)})
dfB = pd.DataFrame({'key':np.arange(5,15), 'dataB':np.arange(100,110)})
dfA.merge(dfB, on='key', indicator='Ind', how='outer')
Output:
    key  dataA  dataB  Ind
0   0    100.0  NaN    left_only
1   1    101.0  NaN    left_only
2   2    102.0  NaN    left_only
3   3    103.0  NaN    left_only
4   4    104.0  NaN    left_only
5   5    105.0  100.0  both
6   6    106.0  101.0  both
7   7    107.0  102.0  both
8   8    108.0  103.0  both
9   9    109.0  104.0  both
10  10   NaN    105.0  right_only
11  11   NaN    106.0  right_only
12  12   NaN    107.0  right_only
13  13   NaN    108.0  right_only
14  14   NaN    109.0  right_only
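A sketch of the two-pass merge with a differently named indicator column in each pass. The frames, the column names, and the indicator names first_pass and second_pass are all hypothetical.

```python
import pandas as pd

# Hypothetical inputs; only keys 1 and 2 match.
left = pd.DataFrame({"key": [0, 1, 2], "a": [10, 11, 12]})
right = pd.DataFrame({"key": [1, 2, 3], "b": [20, 21, 22]})

# First pass: name the indicator column explicitly.
first = left.merge(right, on="key", how="outer", indicator="first_pass")

# Pull out the unmatched rows, keeping only their original columns.
left_only = first.loc[first["first_pass"] == "left_only", ["key", "a"]]
right_only = first.loc[first["first_pass"] == "right_only", ["key", "b"]]

# Second pass: a different indicator name, so nothing can collide.
second = left_only.merge(
    right_only, on="key", how="outer", indicator="second_pass"
)
```

Selecting only the data columns before the second merge also keeps the first indicator out of its inputs.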
Strange behavior with Pandas median
Consider the following dataframe:
   b      c           d     e     f  g     h
0  6.25   2018-04-01  True  NaN   7  54.0  64.0
1  32.50  2018-04-01  True  NaN   7  54.0  64.0
2  16.75  2018-04-01  True  NaN   7  54.0  64.0
3  29.25  2018-04-01  True  NaN   7  54.0  64.0
4  21.75  2018-04-01  True  NaN   7  54.0  64.0
5  21.75  2018-04-01  True  True  7  54.0  64.0
6  7.75   2018-04-01  True  True  7  54.0  64.0
7  23.25  2018-04-01  True  True  7  54.0  64.0
8  12.25  2018-04-01  True  True  7  54.0  64.0
9  30.50  2018-04-01  True  NaN   7  54.0  64.0
(copy and paste and use df = pd.read_clipboard() to create the dataframe)
Finding the medians initially works with no problem:
df.median()
b    21.75
d     1.00
e     1.00
f     7.00
g    54.00
h    64.00
dtype: float64
However, if a column is dropped and then the median is found, the median for column e disappears:
new_df = df.drop(columns=['b'])
new_df.median()
d     1.0
f     7.0
g    54.0
h    64.0
dtype: float64
This behavior is a little unexpected, and finding the median for column e by itself still works:
new_df['e'].median()
1.0
Using skipna=False does not make a difference:
new_df.median(skipna=False)
d     1.0
f     7.0
g    54.0
h    64.0
dtype: float64
(it does for the original dataframe):
df.median(skipna=False)
b    21.75
d     1.00
e      NaN
f     7.00
g    54.00
h    64.00
dtype: float64
The datatype of column e is object in both df and new_df, and the only difference between the two dataframes is that new_df does not have column b. Adding the column back into new_df does not resolve the issue. This only occurs when the first column b is dropped. It does not occur if column e is a float or integer datatype.
This behavior is present in both pandas==0.22.0 and pandas==0.24.1
There is now an open GitHub issue for anyone to try and solve this!
This appears to be a bug. When we dispatch any df to median, this maps to the internal _reduce function. With numeric_only set to None, it computes the median series by series, ignores failures (for column c, for example, the median computation will fail) and accumulates the results (see _reduce in the pandas source, core/frame.py). So far it is fine.
But while stitching the results together, it checks whether the results are scalars or Series (for median they will of course be scalars). For this check it always uses the first column (see wrap_results in the pandas source, core/apply.py). So if the calculation for the first column failed and was skipped, this check fails and raises an exception. This triggers the fallback within _reduce, which forces the dataframe to numeric only (dropping any columns with NaN) and re-computes the medians.
So in your case, if column c (or any other dtype for which the median computation fails, like text) is the first column, then all columns with NaN will also be dropped from the median results. Setting skipna does not change anything, because the bug lies in how a non-numeric column in first position triggers a forced numeric-only computation.
I do not see any fix other than fixing it in the pandas codebase, or ensuring that the median computation always succeeds for the first column.
Insert Value to a dataframe column
I have a pandas dataframe
   0      1      2    3
0  173.0  147.0  161  162.0
1  NaN    NaN    23   NaN
I just want to add a value to a column, such as
   3
0  161
1  23
2  181
But I can't go with the approach of loc and iloc, because the file can contain columns of any length and I will not know the positions for loc and iloc in advance. Hence I just want to add a value to a column. Thanks in advance.
I believe you need setting with enlargement:
df.loc[len(df.index), 2] = 181
print (df)
   0      1      2      3
0  173.0  147.0  161.0  162.0
1  NaN    NaN    23.0   NaN
2  NaN    NaN    181.0  NaN
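A runnable sketch of setting with enlargement on the frame from the question. It assumes the default RangeIndex, so len(df.index) is the first unused row label.

```python
import numpy as np
import pandas as pd

# The 2x4 frame from the question, with default integer labels.
df = pd.DataFrame([
    [173.0, 147.0, 161, 162.0],
    [np.nan, np.nan, 23, np.nan],
])

# Assigning to a row label that does not exist yet appends a row;
# the columns that are not assigned become NaN.
df.loc[len(df.index), 2] = 181
```

Because the new row has NaN in the other columns, column 2 is also converted from integer to float.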
If that dataframe is your original dataframe, you can add an extra row to it with pandas.concat(). For example:
pandas.concat([your_original_dataframe, pandas.DataFrame([[181]], columns=[2])], axis=0)
This will add 181 at the bottom of column 2.