I have a df with an additional column of boolean values based on a conditional statement.
df = pd.DataFrame({'col1': [1,2,3,2.5,5,2]})
df['bool'] = df['col1'] >= 3
The df looks like...
col1 bool
0 1.0 False
1 2.0 False
2 3.0 True
3 2.5 False
4 5.0 True
5 2.0 False
I would like to get the pct_change() of "col1" where "bool" is True, and NaN where it is False. The output should look something like...
col1 pct_change
0 1.0 NaN
1 2.0 NaN
2 3.0 -0.167
3 2.5 NaN
4 5.0 -0.600
5 2.0 NaN
What would be the best way of going about this?
Use numpy.where with df["bool"] as a boolean mask:
df["pct_change"] = np.where(df["bool"], df["col1"].pct_change().shift(-1), np.nan)
print(df)
Output:
   col1   bool  pct_change
0   1.0  False         NaN
1   2.0  False         NaN
2   3.0   True   -0.166667
3   2.5  False         NaN
4   5.0   True   -0.600000
5   2.0  False         NaN
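A pandas-only alternative, if you would rather avoid the NumPy call, is Series.where, which keeps values where the condition holds and emits NaN elsewhere (a sketch over the question's data):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2.5, 5, 2]})
df['bool'] = df['col1'] >= 3

# where() keeps the value when df['bool'] is True and inserts NaN otherwise
df['pct_change'] = df['col1'].pct_change().shift(-1).where(df['bool'])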
I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D'. Expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D from A, B, and C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
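For what it's worth, the NaN seed column isn't strictly required; starting the fillna chain directly from A should give the same result:
df['D'] = df.A.fillna(df.B).fillna(df.C)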
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
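To build the question's column D with it:
>>> df['D'] = coalesce(df.A, df.B, df.C)
>>> df
     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0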
I think you need bfill along axis=1, then select the first column with iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
the same as the following (note that the fillna(method='bfill') spelling is deprecated in recent pandas in favor of calling bfill directly):
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
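Note that df.lookup has since been deprecated and removed in newer pandas releases, so option 2 is the forward-compatible choice. If you would rather stay in pandas, one possible equivalent (a sketch, not part of the original answer) uses stack, which drops NaN cells, and takes the first surviving value per row:
# stack() flattens to a (row, column) Series; first() picks the first
# non-null value within each row group
df.assign(D=df.stack().groupby(level=0).first())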
naive time test: benchmark plots over the given data and over larger data (not reproduced here)
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or chain them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
I want to backfill NaN values, but I only want to backfill one NaN per gap, replacing that value with a specific value (1).
I tried using
df.fillna(value=1,method='bfill',inplace=True,limit=1)
but I get
ValueError: Cannot specify both 'value' and 'method'.
because I cannot use method and value at the same time. If this were possible, I would not be asking this question (pandas could perhaps support this in a future update)
Here is an example:
import pandas as pd
import numpy as np
col1 = [3,2,2,np.nan,np.nan,np.nan,2,6,np.nan,np.nan,np.nan,6]
col2 = [8,2,np.nan,np.nan,6,0,np.nan,5,np.nan,6,6,3]
col3 = [np.nan,np.nan,np.nan,np.nan,6,7,np.nan,1,np.nan,np.nan,3,4]
df = pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3})
print(df)
    col1  col2  col3
0    3.0   8.0   NaN
1    2.0   2.0   NaN
2    2.0   NaN   NaN
3    NaN   NaN   NaN
4    NaN   6.0   6.0
5    NaN   0.0   7.0
6    2.0   NaN   NaN
7    6.0   5.0   1.0
8    NaN   NaN   NaN
9    NaN   6.0   NaN
10   NaN   6.0   3.0
11   6.0   3.0   4.0
here is my desired output:
    col1  col2  col3
0    3.0   8.0   NaN
1    2.0   2.0   NaN
2    2.0   NaN   NaN
3    NaN   1.0   1.0
4    NaN   6.0   6.0
5    1.0   0.0   7.0
6    2.0   1.0   1.0
7    6.0   5.0   1.0
8    NaN   1.0   NaN
9    NaN   6.0   1.0
10   1.0   6.0   3.0
11   6.0   3.0   4.0
I've been at this for hours. Anything is appreciated!
You can bfill with limit=1; it doesn't matter which value gets pulled in. Then you check which cells were filled but are still NaN in your original dataframe. Those positions you fill with 1:
d = df.bfill(limit=1)
mask = df.isna() & d.notna()
df = pd.DataFrame(np.where(mask, 1, df), columns=df.columns)
Output
col1 col2 col3
0 3.0 8.0 NaN
1 2.0 2.0 NaN
2 2.0 NaN NaN
3 NaN 1.0 1.0
4 NaN 6.0 6.0
5 1.0 0.0 7.0
6 2.0 1.0 1.0
7 6.0 5.0 1.0
8 NaN 1.0 NaN
9 NaN 6.0 1.0
10 1.0 6.0 3.0
11 6.0 3.0 4.0
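If you need this more than once, the same idea generalizes into a small helper (bfill_with_value is a hypothetical name, not a pandas API):
import pandas as pd

def bfill_with_value(df, value, limit=1):
    # Cells that are NaN now but would be filled by a backfill with this limit
    mask = df.isna() & df.bfill(limit=limit).notna()
    # Replace exactly those cells with the given value
    return df.mask(mask, value)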
Apparently fillna cannot handle specifying both value and method. Here's an alternative approach: build a mask of the last NaN in each gap (a NaN cell whose successor is valid) and set exactly those cells to 1:
m = df.isna() & df.shift(-1).notna()
pd.DataFrame(np.where(m, 1, df), columns=df.columns)
    col1  col2  col3
0    3.0   8.0   NaN
1    2.0   2.0   NaN
2    2.0   NaN   NaN
3    NaN   1.0   1.0
4    NaN   6.0   6.0
5    1.0   0.0   7.0
6    2.0   1.0   1.0
7    6.0   5.0   1.0
8    NaN   1.0   NaN
9    NaN   6.0   1.0
10   1.0   6.0   3.0
11   6.0   3.0   4.0
I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
    df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ...
    df[list_of_cols[6]].notnull()
]
I can also iterate over the list of columns and build a mask to apply to df, but this looks too tedious. Knowing how powerful pandas is when dealing with NaN, I would expect there to be a much easier way to achieve what I want.
Use the thresh parameter of the dropna() method. Setting thresh=1 keeps any row that has at least one non-null item.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
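Note that this returns only the df[list_of_cols] sub-frame. To keep every column of the original dataframe, select by the surviving index, or build the boolean mask directly; the two lines below should be equivalent:
df.loc[df[list_of_cols].dropna(thresh=1).index]
df[df[list_of_cols].notna().any(axis=1)]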
Starting with this:
data = {'a' : [np.nan,0,0,0,0,0,np.nan,0,0, 0,0,0, 9,9,],
'b' : [np.nan,np.nan,1,1,1,1,1,1,1, 2,2,2, 1,7],
'c' : [np.nan,np.nan,1,1,2,2,3,3,3, 1,1,1, 1,1],
'd' : [np.nan,np.nan,7,9,6,9,7,np.nan,6, 6,7,6, 9,6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are nulls. (Removing row index 0)
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression ~pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise
all() returns True if all of the elements are True (row-wise axis=1)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df = pd.DataFrame({'A': [11, 22, 33, np.nan],
...                    'B': ['x', np.nan, np.nan, 'w'],
...                    'C': ['2016-03-13', np.nan, '2016-03-14', '2016-03-15']})
>>> df
      A    B           C
0  11.0    x  2016-03-13
1  22.0  NaN         NaN
2  33.0  NaN  2016-03-14
3   NaN    w  2016-03-15
isnull mask (with list_of_cols = ['B', 'C']):
>>> pd.isnull(df[list_of_cols])
       B      C
0  False  False
1   True   True
2   True  False
3  False  False
apply all(axis=1) row-wise, then negate with ~:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
      A    B           C
0  11.0    x  2016-03-13
2  33.0  NaN  2016-03-14
3   NaN    w  2016-03-15
I have a dataframe with some columns representing counts for every timestep. I would like to drop these columns automatically, with something like the df.dropna() functionality, say a df.dropcounts().
Here is an example dataframe
array = [[0.0,1.6,2.7,12.0],[1.0,3.5,4.5,13.0],[2.0,6.5,8.6,14.0]]
pd.DataFrame(array)
0 1 2 3
0 0.0 1.6 2.7 12.0
1 1.0 3.5 4.5 13.0
2 2.0 6.5 8.6 14.0
I would like to drop the first and last columns
I believe you need:
val = 1
df = df.loc[:, df.diff().fillna(val).ne(val).any()]
print (df)
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
Explanation:
First compute consecutive differences with DataFrame.diff:
print (df.diff())
0 1 2 3
0 NaN NaN NaN NaN
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Replace NaNs:
print (df.diff().fillna(val))
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Compare for inequality with ne:
print (df.diff().fillna(val).ne(val))
0 1 2 3
0 False False False False
1 False True True False
2 False True True False
And check for at least one True per column with DataFrame.any:
print (df.diff().fillna(val).ne(val).any())
0 False
1 True
2 True
3 False
dtype: bool
Using all
df.loc[:,~df.diff().fillna(1).eq(1).all().values]
Out[295]:
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
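If you want the df.dropcounts() the question wished for, either one-liner wraps naturally into a helper (dropcounts is a hypothetical name, assuming counter columns increase by a constant step of 1):
def dropcounts(df, step=1):
    # Keep only columns whose consecutive differences are not constantly `step`
    return df.loc[:, df.diff().fillna(step).ne(step).any()]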
I would like to forward fill a pandas df with the previous line only when the current line is entirely composed of NaN.
This means that fillna(method='ffill', limit=1) does not work in my case, because it works element-wise while I need a fillna that works row-wise.
Is there a more elegant way to achieve this task than the following instructions?
s = df.count(axis = 1)
for d in df.index[1:]:
if s.loc[d] == 0:
i = s.index.get_loc(d)
df.iloc[i] = df.iloc[i-1]
Input
v1 v2
1 1 2
2 nan 3
3 2 4
4 nan nan
Output
v1 v2
1 1 2
2 nan 3
3 2 4
4 2 4
You can use a condition to filter the rows to which ffill is applied:
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
dtype: bool
print (df[m])
v1 v2
1 1.0 2.0
3 2.0 4.0
4 NaN NaN
df[m] = df[m].ffill()
print (df)
v1 v2
1 1.0 2.0
2 NaN 3.0
3 2.0 4.0
4 2.0 4.0
EDIT:
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 NaN NaN
5 2.0 4.0
6 NaN 3.0
7 NaN NaN
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
5 True
6 False
7 True
dtype: bool
long_str = 'some long helper str'
df[~m] = df[~m].fillna(long_str)
df = df.ffill().replace(long_str, np.nan)
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 4.0 8.0
5 2.0 4.0
6 NaN 3.0
7 NaN 3.0
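For the single-step case a shorter route (a sketch, not part of the original answer) is to replace each all-NaN row with the row directly above it via shift; unlike the ffill trick above, this does not propagate through two consecutive all-NaN rows:
# Mark rows that are entirely NaN and copy in the preceding row
m = df.isna().all(axis=1)
df[m] = df.shift()[m]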