I have a pd.Series that looks like this:
>>> series
0    This is a foo bar something...
1    NaN
2    NaN
3    foo bar indeed something...
4    NaN
5    NaN
6    foo your bar self...
7    NaN
8    NaN
dtype: object
How do I populate the NaN values with the previous non-NaN value in the series?
I have tried this:
new_column = []
for row in list(series):
    if type(row) == str:
        new_column.append(row)
    else:
        new_column.append(new_column[-1])
series = pd.Series(new_column)
But is there another way to do the same in pandas?
From the docs:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
...
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
So:
series.fillna(method='ffill')
Some explanation:
ffill / pad: Forward fill takes the last non-NA value from a previous row and uses it to populate the NA value. pad is just a verbose alias for ffill.
bfill / backfill: Back fill takes the first non-NA value from a following row and uses it to populate the NA value. backfill is just a verbose alias for bfill.
In code:
>>> import pandas as pd
>>> import numpy as np
>>> np.NaN
nan
>>> series = pd.Series([np.NaN, 'abc', np.NaN, np.NaN, 'def', np.NaN, np.NaN])
>>> series
0 NaN
1 abc
2 NaN
3 NaN
4 def
5 NaN
6 NaN
dtype: object
>>> series.fillna(method='ffill')
0 NaN
1 abc
2 abc
3 abc
4 def
5 def
6 def
dtype: object
>>> series.fillna(method='bfill')
0 abc
1 abc
2 def
3 def
4 def
5 NaN
6 NaN
dtype: object
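Note: in recent pandas releases the method= argument of fillna is deprecated; the dedicated methods do exactly the same thing:
>>> series.ffill()  # equivalent to fillna(method='ffill')
>>> series.bfill()  # equivalent to fillna(method='bfill')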
I have a df that looks like this:
from pandas import DataFrame

headings = ['foo','bar','qui','gon','jin']
table = [[1,1,3,4,5],
         [1,1,4,5,6],
         [2,2,3,4,5],
         [2,2,4,5,6],
        ]
df = DataFrame(columns=headings, data=table)
foo bar qui gon jin
0 1 1 3 4 5
1 1 1 4 5 6
2 2 2 3 4 5
3 2 2 4 5 6
What I want to do is average the values of all columns whenever a certain column has the same value, e.g. average all the rows that share a 'bar' value and then build a dataframe with the result. I tried the following:
newDf = DataFrame([])
for i in df['bar'].loc[1:2]:
    newDf = newDf.append(df[df['foo'] == i].mean(axis=0), ignore_index=True)
And it outputs what I want:
   bar  foo  gon  jin  qui
0  1.0  1.0  4.5  5.5  3.5
1  2.0  2.0  4.5  5.5  3.5
But when I try the same with another column, it does not output what I want:
for i in df['qui'].loc[1:2]:
    newDf = newDf.append(df[df['foo'] == i].mean(axis=0), ignore_index=True)
Produces
   bar  foo  gon  jin  qui
0  NaN  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN  NaN
Can you give me a hand?
Side question: how do I prevent the columns of the new dataframe from being ordered alphabetically? Is it possible to maintain the order of the original dataframe?
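A note on why the second loop fails, plus a sketch of the idiomatic approach (my own addition, assuming the goal is the per-'bar' mean): df['qui'].loc[1:2] yields 4 and 3, values that never appear in the 'foo' column, so each filter df[df['foo'] == i] is empty and its mean is all NaN. A groupby avoids the loop entirely, keeps the original column order, and with as_index=False keeps 'bar' as a regular column:
newDf = df.groupby('bar', as_index=False).mean()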
Hi, I have a pandas dataframe with a column A.
import pandas
import numpy

data = pandas.DataFrame()
data['A'] = [numpy.random.choice([1,2,3,4,5,6]) for i in range(10)]
I want to add a column B conditioned on A (when A = 1 then B = 0; when A > 5 then B = 1), instead of using:
data.loc[data['A']==1,'B']=0
data.loc[data['A']>5, 'B']=1
Here I want to create a function that does this given the condition as a dict: {'A=1': 0, 'A>5': 1}, so that I could call add_column({'A=1': 0, 'A>5': 1}, 'B') to reproduce the code above. I think the tricky part is dealing with the operators; any good ideas?
def add_column(condition_dict, NewColumnName):
    pass
While there may be more efficient ways to do it, one possible way is to use the eval function.
Creating input df:
import pandas as pd
import numpy as np
data = pd.DataFrame()
data['A']= [np.random.choice([1,2,3,4,5,6]) for i in range(10)]
print(data)
Input df:
A
0 4
1 3
2 3
3 1
4 1
5 2
6 3
7 6
8 2
9 1
Now, define a function that iterates through each row of the dataframe and through condition_dict; whenever a row satisfies a condition, the corresponding value is stored in a list, which at the end becomes the new column. If none of the conditions matches, the default value None is used:
def add_column(df, condition_dict, NewColumnName):
    new_values = []
    for index, row in df.iterrows():
        # if none of the conditions matches, keep the default value
        default_value = None
        # iterate through each condition to check if any matches
        for key, value in condition_dict.items():
            expression = 'row.' + key
            if eval(expression):
                default_value = value
        # record this row's value for the new column
        new_values.append(default_value)
    df[NewColumnName] = new_values
Now, to call the function:
add_column(data, {'A==1':0, 'A>5':1}, 'B')
print(data)
Output:
A B
0 4 NaN
1 3 NaN
2 3 NaN
3 1 0.0
4 1 0.0
5 2 NaN
6 3 NaN
7 6 1.0
8 2 NaN
9 1 0.0
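For larger frames, a vectorized variation (my own sketch, not part of the original answer) avoids the row loop: DataFrame.eval evaluates each condition string against the whole frame at once, and numpy.select picks the first matching value per row:
def add_column_vectorized(df, condition_dict, NewColumnName):
    # each condition string, e.g. 'A==1', becomes a boolean Series
    conditions = [df.eval(cond) for cond in condition_dict]
    choices = list(condition_dict.values())
    # rows matching no condition get NaN, like the loop version's None
    df[NewColumnName] = np.select(conditions, choices, default=np.nan)

add_column_vectorized(data, {'A==1': 0, 'A>5': 1}, 'B')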
I have an issue with pandas pivot_table.
Sometimes, the order of the columns specified in the "values" list does not match the output:
In [11]: p = pivot_table(df, values=["x","y"], cols=["month"],
                         rows="name", aggfunc=np.sum)
I get the wrong order (y, x) instead of (x, y):
Out[12]:
y x
month 1 2 3 1 2 3
name
a 1 NaN 7 2 NaN 8
b 3 NaN 9 4 NaN 10
c NaN 5 NaN NaN 6 NaN
Is there something I am not doing right?
According to the pandas documentation, values should take the name of a single column, not an iterable.
values : column to aggregate, optional
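If you do pass a list anyway, one possible workaround (my own sketch, not part of the original answer) is to reorder afterwards: the output columns form a MultiIndex whose outer level holds the values names, and selecting that level in the desired order restores it:
p = p[["x", "y"]]  # reorder the outer column level to (x, y)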
Suppose I have a dataframe as follows,
import pandas as pd

columns = ['A','B','C','D','E','F']
index = ['1','2','3','4','5','6']
df = pd.DataFrame(columns=columns, index=index)
df.loc['1','D'] = 1
df['E'] = 1
df.loc['1','F'] = 1
df.loc['2','A'] = 1
df.loc['3','B'] = 1
df.loc['4','C'] = 1
df.loc['5','A'] = 1
df.loc['5','B'] = 1
df.loc['5','C'] = 1
df.loc['6','D'] = 1
df.loc['6','F'] = 1
df
df
A B C D E F
1 NaN NaN NaN 1 1 1
2 1 NaN NaN NaN 1 NaN
3 NaN 1 NaN NaN 1 NaN
4 NaN NaN 1 NaN 1 NaN
5 1 1 1 NaN 1 NaN
6 NaN NaN NaN 1 1 1
My condition is: I want to remove the columns that only have values in rows where A, B and C (together) have no value, i.e. the columns that are mutually exclusive with A, B, C. Put another way, I am interested in keeping only the columns that have values when A, B or C has a value. In this example the output would be to remove columns D and F. But my real dataframe has 400 columns, so I need a way to check A, B, C against all the other columns.
One way I can think of is:
Remove NA rows from A,B,C
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
then get the NA count of all the columns, compare it with the total number of rows,
df.isnull().sum()
and remove the columns whose counts match.
Is there a better and efficient way to do this?
Thanks
Rather than deleting rows, just select the other ones, those where A, B and C are not all NaN at the same time.
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
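To go from there to the goal of dropping the mutually exclusive columns, a sketch continuing the same idea (not part of the original reply), keeping the subset separate instead of reassigning df: among the kept rows, the offending columns are entirely NaN, so they can be detected and dropped from the original frame:
sub = df[~mask]                          # rows where A, B or C has a value
bad = sub.columns[sub.isnull().all()]    # columns with no values there: D and F
df = df.drop(columns=bad)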
I am trying to do the following: on a dataframe X, I want to select all rows where X['a']>0 but I want to preserve the dimension of X, so that any other row will appear as containing NaN. Is there a fast way to do it? If one does X[X['a']>0] the dimensions of X are not preserved.
Use double subscript [[]]:
In [42]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[42]:
a
0 1.042971
1 0.978914
2 0.764374
3 -0.338405
4 0.974011
5 -0.995945
6 -1.649612
7 0.965838
8 -0.142608
9 -0.804508
In [48]:
df[df[['a']] > 1]
Out[48]:
a
0 1.042971
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
The key semantic difference here is that the double subscript returns a DataFrame, so the comparison masks the df itself rather than producing an index for row selection.
Note, though, that if you have multiple columns it will mask all of them as NaN.
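An equivalent shape-preserving idiom (a standard pandas method, though not the one used above) is where, which keeps the full index and fills the non-matching rows with NaN:
X.where(X['a'] > 0)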