My data looks like this. I would like to replace marital_status 'Missing' with 'Married' if 'no_of_children' is not NaN.
cust_data_df[['marital_status','no_of_children']]
   marital_status no_of_children
0         Married            NaN
1         Married            NaN
2         Missing              1
3         Missing              2
4          Single            NaN
5          Single            NaN
6         Married            NaN
7          Single            NaN
8         Married            NaN
9         Married            NaN
10         Single            NaN
This is what I tried:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'].replace({'Missing':'Married'},inplace=True)
But this is not doing anything.
Assign the replaced values back to avoid chained assignment:
m = cust_data_df['no_of_children'].notna()
d = {'Missing':'Married'}
cust_data_df.loc[m, 'marital_status'] = cust_data_df.loc[m, 'marital_status'].replace(d)
If you need to set all matched values:
cust_data_df.loc[m, 'marital_status'] = 'Married'
EDIT:
Thanks @Quickbeam2k1 for the explanation:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'] is just a new object with no reference back to the original DataFrame. Replacing values there leaves the original object unchanged.
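As an aside, the same update can be written with Series.mask so that no intermediate slice is involved. This is a sketch of an alternative, not part of the original answer:
fix = cust_data_df['no_of_children'].notna() & cust_data_df['marital_status'].eq('Missing')
cust_data_df['marital_status'] = cust_data_df['marital_status'].mask(fix, 'Married')  # replace only where fix is True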
I want to replace all numerical values in a DataFrame column with NaN
Input
A B C
test foo xyz
hit bar 10
hit fish 90
hit NaN abc
test val 20
test val 90
Desired Output:
A B C
test foo xyz
hit bar NaN
hit fish NaN
hit NaN abc
test val NaN
test val NaN
I tried the following:
db_old.loc[db_old['Current Value'].istype(float), db_old['Current Value']] = np.nan
but it returns:
AttributeError: 'Series' object has no attribute 'istype'
Any suggestions?
Thanks
You can mask numeric values using to_numeric:
df['C'] = df['C'].mask(pd.to_numeric(df['C'], errors='coerce').notna())
df
A B C
0 test foo xyz
1 hit bar NaN
2 hit fish NaN
3 hit NaN abc
4 test val NaN
5 test val NaN
to_numeric is the most general solution and should work regardless of whether you have a column of strings or mixed objects.
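To see why this works, here is the intermediate step on a small hand-made Series (a sketch, not from the original answer): errors='coerce' turns everything non-numeric into NaN, so .notna() on the result flags exactly the numeric entries that mask then blanks out.
pd.to_numeric(pd.Series(['xyz', 10, 90, 'abc']), errors='coerce')
0     NaN
1    10.0
2    90.0
3     NaN
dtype: float64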
If it is a column of strings and you're only trying to retain strings of letters, str.isalpha may suffice:
df['C'] = df['C'].where(df['C'].str.isalpha())
df
A B C
0 test foo xyz
1 hit bar NaN
2 hit fish NaN
3 hit NaN abc
4 test val NaN
5 test val NaN
Note that this keeps only values made up entirely of letters, so strings containing digits, spaces, or punctuation also become NaN.
If you have a column of mixed objects, here is yet another solution using str.match (any str method with a na flag, really) with na=False:
df['C'] = ['xyz', 10, 90, 'abc', 20, 90]
df['C'] = df['C'].where(df['C'].str.match(r'\D+$', na=False))
df
A B C
0 test foo xyz
1 hit bar NaN
2 hit fish NaN
3 hit NaN abc
4 test val NaN
5 test val NaN
I have a dataframe which contains NaNs and 0s in some rows across all columns. I am trying to extract such rows so that I can process them further. Some of these columns are object and some are float. I am trying the code below to extract such rows, but because of the object columns it is not giving me the desired result.
I could work around this by substituting some arbitrary value for NaN and using that in the .isin statement, but that also changes the datatype of my columns, and I would have to convert them back.
Can somebody please help me with a workaround/solution to this?
Thanks.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[np.nan,0,np.nan,1,'abc',np.nan], 'b':[0,np.nan,np.nan,1,np.nan,1]})
df
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
3    1  1.0
4  abc  NaN
5  NaN  1.0
values = [np.nan,0]
df_all_empty = df[df.isin(values).all(1)]
df_all_empty
Expected Output:
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
Actual Output:
     a    b
0  NaN  0.0
Change it to:
df_all_empty = df[(df.isnull()|df.isin([0])).all(1)]
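As a quick check against the df from the question (a sketch, not part of the original answer): isnull() catches the NaNs that isin() misses, and the union of the two masks is reduced row-wise with all().
mask = df.isnull() | df.isin([0])    # True where a cell is NaN or 0
df_all_empty = df[mask.all(axis=1)]  # keep rows where every column matches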
The code below will let you select those rows.
df_sel = df.loc[(df.a.isnull()) |
                (df.b.isnull()) |
                (df.a == 0) |
                (df.b == 0)]
If you want to set column 'a' in those rows to, say, -9999, you can use:
df.loc[(df.a.isnull()) |
       (df.b.isnull()) |
       (df.a == 0) |
       (df.b == 0), 'a'] = -9999
For reference, see the boolean indexing section of the official documentation:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
You can use df.query together with the standard NaN trick: NaN is the only value that does not compare equal to itself.
Write something like this:
df.query("(a!=a or a==0) and (b!=b or b==0)")
And the output is:
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
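If there are more than two columns, the same expression can be built programmatically. A sketch extending the answer, assuming every column should be checked and all column names are valid Python identifiers:
expr = " and ".join(f"({c} != {c} or {c} == 0)" for c in df.columns)
df.query(expr)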
I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options, but these are unfortunately not sufficient to achieve the task. I tried specifying method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this could be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
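The same pattern can also be spelled out column by column, which some readers find easier to follow. A sketch of an equivalent formulation, not from the original answer:
# length of the contiguous NaN run each cell belongs to, computed per column
run_len = df.apply(lambda col: col.isna()
                                  .groupby(col.notna().cumsum())
                                  .transform('sum'))
# forward-fill only the NaNs sitting in runs of length 3 or less
filled = df.where(~(df.isna() & run_len.le(3)), df.ffill())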
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    # forward fill at most 3 consecutive NaNs per run
    filled = df.ffill(limit=3)
    # cells still NaN after the limited ffill (beyond the limit in a long run, or leading NaNs)
    unfilled = nulls & (~filled.notnull())
    # 2.0 for non-null cells, NaN for null cells
    nf = nulls.replace({False: 2.0, True: np.nan})
    # back-fill the "unfilled" marker through each NaN run; runs longer than 3 end up marked 1
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
I have an issue with pandas pivot_table.
Sometimes, the order of the columns specified in the "values" list does not match the order of the columns in the result.
In [11]: p = pivot_table(df, values=["x", "y"], cols=["month"],
                         rows="name", aggfunc=np.sum)
I get the wrong order (y, x) instead of (x, y):
Out[12]:
          y              x
month     1    2    3    1    2    3
name
a         1  NaN    7    2  NaN    8
b         3  NaN    9    4  NaN   10
c       NaN    5  NaN  NaN    6  NaN
Is there something I am not doing right?
According to the pandas documentation, values should take the name of a single column, not an iterable.
values : column to aggregate, optional
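If you do pass a list to values (which newer pandas versions accept), you can restore the desired order afterwards by selecting the top-level column keys explicitly. A sketch, not part of the original answer:
p = p[["x", "y"]]  # reorder the outer level of the MultiIndex columns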
I have a pandas dataframe tdf.
I am extracting a slice based on a boolean condition:
idx = tdf['MYcol1'] == 1
myslice = tdf.loc[idx]  # I want myslice to be a view, not a copy
Now I want to fill the missing values in a column of myslice, and I want this to be reflected in tdf, my original dataframe:
myslice.loc[:,'MYcol2'].fillna(myslice['MYcol2'].mean(), inplace = True)  # 1
myslice.ix[:,'MYcol2'].fillna(myslice['MYcol2'].mean(), inplace = True)   # 2
Both 1 and 2 above throw the warning: A value is trying to be set on a copy of a slice from a DataFrame.
What am I doing wrong?
When you assign it to a new variable, it creates a copy. The things you do after that are irrelevant. Consider this:
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN NaN 0.037006
4 0.767902 NaN NaN
5 -0.805627 NaN NaN
6 1.133080 NaN -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
idx = tdf['A'] > 0
myslice = tdf.loc[idx]
Fill NaNs in myslice:
myslice.loc[:,'B'].fillna(myslice['B'].mean(), inplace = True)
C:\Anaconda3\envs\p3\lib\site-packages\pandas\core\generic.py:3191: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
myslice
Out:
A B C
2 0.616887 0.008628 NaN
4 0.767902 0.008628 NaN
6 1.133080 0.008628 -0.659892
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN NaN 0.037006
4 0.767902 NaN NaN
5 -0.805627 NaN NaN
6 1.133080 NaN -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
It is not reflected in tdf, because:
myslice.is_copy
Out: <weakref at 0x000001CC842FD318; to 'DataFrame' at 0x000001CC8422D6A0>
If you change it to:
tdf.loc[:, 'B'].fillna(tdf.loc[idx, 'B'].mean(), inplace=True)
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN 0.008628 0.037006
4 0.767902 0.008628 NaN
5 -0.805627 0.008628 NaN
6 1.133080 0.008628 -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
then it works. In the last part you can also use myslice['B'].mean(), because you are not updating those values; but the left-hand side of the assignment should be the original DataFrame.
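If the goal is to fill only the rows selected by idx, the usual pattern is to assign back through .loc on the original frame instead of calling fillna with inplace=True on the slice. A sketch, not part of the original answer:
tdf.loc[idx, 'B'] = tdf.loc[idx, 'B'].fillna(tdf.loc[idx, 'B'].mean())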