I have an issue with pandas pivot_table.
Sometimes the order of the columns specified in the "values" list does not match the column order of the result.
In [11]: p = pd.pivot_table(df, values=["x", "y"], columns=["month"],
    ...:                    index="name", aggfunc="sum")
I get the wrong order (y, x) instead of (x, y):
Out[12]:
         y              x
month    1    2    3    1    2    3
name
a        1  NaN    7    2  NaN    8
b        3  NaN    9    4  NaN   10
c      NaN    5  NaN  NaN    6  NaN
Is there something I'm doing wrong?
According to the pandas documentation, values should be the name of a single column, not an iterable:
values : column to aggregate, optional
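If the pivot does come back with the outer column level in the wrong order, one workaround is to reorder that level yourself; a minimal sketch, assuming the result is stored in p as above:

# put the outer level of the MultiIndex columns back in the order
# given in the values list
p = p.reindex(columns=["x", "y"], level=0)

p.sort_index(axis=1, level=0) would also work here, since "x" happens to sort before "y".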
My data looks like this. I would like to replace marital_status 'Missing' with 'Married' if 'no_of_children' is not NaN.
cust_data_df[['marital_status','no_of_children']]
marital_status no_of_children
0 Married NaN
1 Married NaN
2 Missing 1
3 Missing 2
4 Single NaN
5 Single NaN
6 Married NaN
7 Single NaN
8 Married NaN
9 Married NaN
10 Single NaN
This is what I tried:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'].replace({'Missing':'Married'},inplace=True)
But this is not doing anything.
Assign the replaced values back to avoid chained assignment:
m = cust_data_df['no_of_children'].notna()
d = {'Missing':'Married'}
cust_data_df.loc[m, 'marital_status'] = cust_data_df.loc[m, 'marital_status'].replace(d)
If you need to set all matched values directly:
cust_data_df.loc[m, 'marital_status'] = 'Married'
EDIT:
Thanks @Quickbeam2k1 for the explanation:
cust_data_df.loc[cust_data_df['no_of_children'].notna(), 'marital_status'] is just a new object with no reference back to the original DataFrame. Replacing values on it leaves the original object unchanged.
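As a side note, np.where sidesteps the chained-assignment pitfall entirely; a sketch using the same column names:

import numpy as np

# build the whole column in one pass; no view-vs-copy ambiguity
cust_data_df['marital_status'] = np.where(
    cust_data_df['no_of_children'].notna() & (cust_data_df['marital_status'] == 'Missing'),
    'Married',
    cust_data_df['marital_status'])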
This question already has answers here:
How to replace NaNs by preceding or next values in pandas DataFrame?
(10 answers)
Closed 5 years ago.
I have a pd.Series that looks like this:
>>> series
0 This is a foo bar something...
1 NaN
2 NaN
3 foo bar indeed something...
4 NaN
5 NaN
6 foo your bar self...
7 NaN
8 NaN
How do I populate the NaN values with the previous non-NaN value in the series?
I have tried this:
new_column = []
for row in list(series):
    if type(row) == str:
        new_column.append(row)
    else:
        new_column.append(new_column[-1])
series = pd.Series(new_column)
But is there another way to do the same in pandas?
From the docs:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
...
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
So:
series.fillna(method='ffill')
Some explanation:
ffill / pad: Forward fill uses the value from the previous row that isn't NA to populate the NA value. pad is just a verbose alias for ffill.
bfill / backfill: Back fill uses the value from the next row that isn't NA to populate the NA value. backfill is just a verbose alias for bfill.
In code:
>>> import pandas as pd
>>> import numpy as np
>>> np.NaN
nan
>>> series = pd.Series([np.NaN, 'abc', np.NaN, np.NaN, 'def', np.NaN, np.NaN])
>>> series
0 NaN
1 abc
2 NaN
3 NaN
4 def
5 NaN
6 NaN
dtype: object
>>> series.fillna(method='ffill')
0 NaN
1 abc
2 abc
3 abc
4 def
5 def
6 def
dtype: object
>>> series.fillna(method='bfill')
0 abc
1 abc
2 def
3 def
4 def
5 NaN
6 NaN
dtype: object
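Note that in pandas 2.x the method= keyword of fillna is deprecated; the dedicated methods are the drop-in replacements:

series.ffill()   # same as series.fillna(method='ffill')
series.bfill()   # same as series.fillna(method='bfill')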
I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried specifying method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this could be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas... or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
    ...: # label each run of consecutive null/non-null cells per column
    ...: groups = (nulls != nulls.shift()).cumsum()
    ...: # mark cells whose run is no longer than 3
    ...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
    ...: # fill only those cells (ffill leaves non-null cells unchanged)
    ...: df.where(~to_fill, df.ffill())
    ...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    # cells still null after a limit-3 fill, i.e. the tails of runs longer than 3
    unfilled = nulls & (~filled.notnull())
    # 2.0 marks non-null cells, NaN marks null cells
    nf = nulls.replace({False: 2.0, True: np.nan})
    # back-filling resolves each null run to 1 (long run: skip) or 2 (short run: fill)
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
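For reuse, the first approach can be wrapped in a small helper; this is only a sketch, and the function name and limit parameter are mine rather than part of the original answer:

import pandas as pd

def fill_short_gaps(df, limit=3):
    nulls = df.isnull()
    # label each run of consecutive null/non-null cells per column
    groups = (nulls != nulls.shift()).cumsum()
    # size of the run each cell belongs to
    run_len = groups.apply(lambda s: s.groupby(s).transform('size'))
    # fill only NaN cells sitting in a run no longer than `limit`
    to_fill = nulls & (run_len <= limit)
    return df.where(~to_fill, df.ffill())

fill_short_gaps(df) reproduces Out[68] above.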
I am trying to do the following: on a dataframe X, I want to select all rows where X['a']>0 but I want to preserve the dimension of X, so that any other row will appear as containing NaN. Is there a fast way to do it? If one does X[X['a']>0] the dimensions of X are not preserved.
Use double subscript [[]]:
In [42]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[42]:
a
0 1.042971
1 0.978914
2 0.764374
3 -0.338405
4 0.974011
5 -0.995945
6 -1.649612
7 0.965838
8 -0.142608
9 -0.804508
In [48]:
df[df[['a']] > 1]
Out[48]:
a
0 1.042971
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
The key semantic difference here is that double subscripting returns a DataFrame, so the boolean mask is applied to the DataFrame itself rather than to the index.
Note, though, that if you have multiple columns, this will mask all of those columns as NaN.
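If you want to keep every column but blank out whole non-matching rows, which is closer to what the question asks, filtering and then reindexing back to the original index is a shape-preserving alternative; a sketch:

# rows failing the condition come back as all-NaN after reindexing
df[df['a'] > 0].reindex(df.index)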
This question already has answers here:
How do I count the NaN values in a column in pandas DataFrame?
(32 answers)
Closed 7 years ago.
I have a pandas data frame with 83 columns and 4000 rows. I intend to use the data for a logistic regression and therefore want to narrow down my columns to those that have the least amount of missing data.
To do this I was thinking of ranking them based on the frequency of NaN observations. I tried a few things like
econ_balance["BG.GSR.NFSV.GD.ZS"].describe()
econ_balance["BG.GSR.NFSV.GD.ZS"].value_counts
econ_balance["BG.GSR.NFSV.GD.ZS"]["NaN"]
econ_balance["BG.GSR.NFSV.GD.ZS"][NaN]
None of which seem to work. I also tried googling to see if this question has been answered before, but no luck.
Thanks in advance for the help
Josh
If you're looking just to count the NaN values:
In [2]:
df = pd.DataFrame({'a':[0,1,np.NaN,np.NaN,np.NaN],'b':np.NaN, 'c':[np.NaN,1,2,3,np.NaN]})
df
Out[2]:
a b c
0 0 NaN NaN
1 1 NaN 1
2 NaN NaN 2
3 NaN NaN 3
4 NaN NaN NaN
In [6]:
df.isnull().astype(int).sum()
Out[6]:
a 3
b 5
c 2
dtype: int64
EDIT
@CTZhu has pointed out that the type casting is unnecessary:
In [7]:
df.isnull().sum()
Out[7]:
a 3
b 5
c 2
dtype: int64
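To rank the columns by missing data, as the question ultimately wants, you can sort those counts; a sketch:

# columns with the least missing data first
df.isnull().sum().sort_values()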