How to conditionally left-shift based on null values in Pandas - python

I have a data frame like
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 2], "B": [np.nan, 10, 3], "C": [5, np.nan, np.nan]})
A B C
0 1 NaN 5
1 NaN 10 NaN
2 2 3 NaN
I want to left shift all the values to occupy the nulls. Desired output:
A B C
0 1 5 NaN
1 10 NaN NaN
2 2 3 NaN
I tried doing this by chaining fillna calls, like df['A'].fillna(df['B'].fillna(df['C'])), but in my actual data there are more than 100 columns. Is there a better way to do this?

Let us do (Python's sorted is stable, so key=pd.isnull pushes the NaNs to the end while keeping the non-null order):
out = df.T.apply(lambda x: sorted(x, key=pd.isnull)).T
Out[41]:
A B C
0 1.0 5.0 NaN
1 10.0 NaN NaN
2 2.0 3.0 NaN

I also figured out another way to do this without the sort:
def shift_null(arr):
    # x == x is False only for NaN, so keep the non-nulls in order and pad with NaN
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

out = df.T.apply(lambda arr: shift_null(arr)).T
This was faster for big dataframes.
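For very large frames, a fully vectorized NumPy version avoids the per-row Python calls entirely. A minimal sketch, assuming all columns are numeric so np.isnan applies (the helper name left_justify is mine):
import numpy as np
import pandas as pd

def left_justify(df):
    a = df.to_numpy(dtype=float)
    mask = ~np.isnan(a)
    # sorting each row's boolean mask and reversing pushes the True slots to the left
    justified = np.sort(mask, axis=1)[:, ::-1]
    out = np.full(a.shape, np.nan)
    # row-major assignment keeps the left-to-right order of the surviving values
    out[justified] = a[mask]
    return pd.DataFrame(out, index=df.index, columns=df.columns)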

Related

Return Column(s) if they Have a certain Percentage of NaN Values (Python)

Looking to return only the columns that have at least 25% NaN values as a new df
I'm thinking either a conditional statement using .loc, .isnull, or count, but I'm not certain what the most efficient method is. Appreciate any and all assistance.
DF:
df1 (columns A, B, C):
A B C
1 1 2 1
2 NaN NaN 3
3 4 NaN 1
4 2 NaN 4
Thinking:
df.loc[df['series'] == nan >= 25% ]
Or something like:
if count(nan) for column(x) in 'series' is >= (.25 * (count(x)))
return loc[x]
Return New Dataframe:
df2:
A B
1 1 2
2 NaN NaN
3 4 NaN
4 2 NaN
Returns A and B, because each of those has at least 25% of its column entries as NaN (missing).
Based on the responses from https://datascience.stackexchange.com/q/12645:
na_count_mask = df.isna().sum(axis=0) >= (len(df) * 0.25)  # columns where at least 25% of rows are NaN
res_df = df.loc[:, na_count_mask]
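Equivalently, isna().mean() gives the NaN fraction per column directly, so the 25% threshold reads as a plain ratio:
res_df = df.loc[:, df.isna().mean() >= 0.25]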

Python Pandas: Search rows with consecutive condition

I have a dataframe like below:
Text Label
a NaN
b NaN
c NaN
1 NaN
2 NaN
b NaN
c NaN
a NaN
b NaN
c NaN
Whenever the pattern "a,b,c" occurs downwards I want to label that part as a string such as 'Check'. Final dataframe should look like this:
Text Label
a Check
b Check
c Check
1 NaN
2 NaN
b NaN
c NaN
a Check
b Check
c Check
What is the best way to do this? Thank you =)
Here's a NumPy-based approach leveraging broadcasting:
import numpy as np
w = df.Text.cumsum().str[-3:].eq('abc') # inefficient for large dfs
m = (w[w].index.values[:,None] + np.arange(-2,1)).ravel()
df.loc[m, 'Label'] = 'Check'
Text Label
0 a Check
1 b Check
2 c Check
3 1 NaN
4 2 NaN
5 b NaN
6 c NaN
7 a Check
8 b Check
9 c Check
For a general solution that works with any pattern, use a rolling window:
arr = df['Text'].to_numpy()
pat = list('abc')
N = len(pat)

def rolling_window(a, window):
    # strided view of every consecutive window of length `window`
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

b = np.all(rolling_window(arr, N) == pat, axis=1)  # True at each match start
c = np.mgrid[0:len(b)][b]                          # indices of the match starts
d = [i for x in c for i in range(x, x + N)]        # expand starts to all matched rows
df.loc[np.in1d(np.arange(len(arr)), d), 'Label'] = 'Check'
print(df)
Text Label
0 a Check
1 b Check
2 c Check
3 1 NaN
4 2 NaN
5 b NaN
6 c NaN
7 a Check
8 b Check
9 c Check
Good old shift and bfill work as well (for a small number of steps):
s = df.Text.eq('c') & df.Text.shift().eq('b') & df.Text.shift(2).eq('a')
df.loc[s, 'Label'] = 'Check'
df['Label'] = df['Label'].bfill(limit=2)
Output:
Text Label
0 a Check
1 b Check
2 c Check
3 1 NaN
4 2 NaN
5 b NaN
6 c NaN
7 a Check
8 b Check
9 c Check
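If the pattern length varies, the shift chain above can be built with functools.reduce instead of being written out by hand. A sketch, with the same caveat about overlapping matches:
from functools import reduce
pat = list('abc')
# a row ends a match when it equals pat[-1], the row above equals pat[-2], and so on
end = reduce(lambda acc, t: acc & df.Text.shift(t[0]).eq(t[1]),
             enumerate(reversed(pat)),
             pd.Series(True, index=df.index))
df.loc[end, 'Label'] = 'Check'
df['Label'] = df['Label'].bfill(limit=len(pat) - 1)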

pandas DataFrame: Replace NaN values with previous value based on a key column

I have a pd.DataFrame that looks like this:
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 NaN 12 NaN NaN NaN
value_01 NaN 7 NaN NaN NaN
value_02 7 4 y NaN NaN
value_02 NaN 5 NaN NaN NaN
value_02 NaN 6 NaN NaN NaN
value_03 19 15 z NaN NaN
So now, based on the key_value:
For columns 'a' and 'c', I want to copy over the last seen value from the same column within the same key_value.
For column 'd', I want to copy the (i - 1)-th cell value of column 'b' into the i-th cell of column 'd'.
Lastly, for column 'e', I want the running sum of the previous cells of column 'b' in the i-th cell of column 'e'.
For every key_value, the columns 'a', 'b' and 'c' have some value in their first row, based on which the following values are copied over or computed for the other columns.
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 1 12 x 10 10
value_01 1 7 x 12 22
value_02 7 4 y NaN NaN
value_02 7 5 y 4 4
value_02 7 6 y 5 9
value_03 19 15 z NaN NaN
My current approach:
size = df.key_value.size
for i in range(size):
    if pd.isna(df.a[i]) and df.key_value[i] == df.key_value[i - 1]:
        df.a[i] = df.a[i - 1]
        df.c[i] = df.c[i - 1]
        df.d[i] = df.b[i - 1]
        df.e[i] = df.e[i] + df.b[i - 1]
For columns like 'a' and 'c' the NaN values are all in the same row indexes.
My approach works but takes very long, since my dataframe has over 50,000 records. I was wondering if there is a different way to do this, since I have multiple columns like 'a' and 'c' whose values need to be copied over based on 'key_value', and some columns whose values are computed from a column like 'b'.
pd.concat with groupby and assign
pd.concat([
    g.ffill().assign(d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())
    for _, g in df.groupby('key_value')
])
key_value a b c d e
0 value_01 1.0 10 x NaN NaN
1 value_01 1.0 12 x 10.0 10.0
2 value_01 1.0 7 x 12.0 22.0
3 value_02 7.0 4 y NaN NaN
4 value_02 7.0 5 y 4.0 4.0
5 value_02 7.0 6 y 5.0 9.0
6 value_03 19.0 15 z NaN NaN
groupby and apply
def h(g):
    return g.ffill().assign(
        d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())

df.groupby('key_value', as_index=False, group_keys=False).apply(h)
You can use groupby + ffill for the group-wise filling. The shift and cumsum also need to be group-wise, so that each key_value restarts from NaN.
In general, note that many common operations have been implemented efficiently in Pandas.
g = df.groupby('key_value')
df['a'] = g['a'].ffill()
df['c'] = g['c'].ffill()
df['d'] = g['b'].shift()
df['e'] = df['d'].groupby(df['key_value']).cumsum()
print(df)
key_value a b c d e
0 value_01 1.0 10 x NaN NaN
1 value_01 1.0 12 x 10.0 10.0
2 value_01 1.0 7 x 12.0 22.0
3 value_02 7.0 4 y NaN NaN
4 value_02 7.0 5 y 4.0 4.0
5 value_02 7.0 6 y 5.0 9.0
6 value_03 19.0 15 z NaN NaN

Lengthening a DataFrame based on stacking columns within it in Pandas

I am looking for a function that achieves the following. It is best shown in an example. Consider:
pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])
which looks like:
x y1 y2
0 1 2 3
1 4 5 NaN
I would like to collapse the y1 and y2 columns, lengthening the DataFrame where necessary, so that the output is:
x y
0 1 2
1 1 3
2 4 5
That is, one row for each combination of either x and y1, or x and y2. I am looking for a function that does this relatively efficiently, as I have multiple y columns and many rows.
You can use stack to get this done, i.e.
pd.DataFrame(df.set_index('x').stack().reset_index(level=0).values, columns=['x', 'y'])
x y
0 1.0 2.0
1 1.0 3.0
2 4.0 5.0
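A variant that skips the round trip through a raw values array (and so keeps x as an integer) is to reset the stacked index directly; a minimal sketch:
out = (df.set_index('x')
         .stack()                # drops NaN by default
         .reset_index(name='y')  # the unnamed level comes back as 'level_1'
         .drop(columns='level_1'))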
Repeat all the items in the first column based on the count of non-null values in each row, then build the final dataframe from the remaining non-null values in the other columns. You can use the DataFrame.count() method to count non-null values and numpy.repeat() to repeat an array according to a count array.
>>> rest = df.loc[:, 'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
...               'y': rest.values[rest.notna()]})
Demo:
>>> df
x y1 y2 y3 y4
0 1 2.0 3.0 NaN 6.0
1 4 5.0 NaN 9.0 3.0
2 10 NaN NaN NaN NaN
3 9 NaN NaN 6.0 NaN
4 7 6.0 NaN NaN NaN
>>> rest = df.loc[:, 'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
...               'y': rest.values[rest.notna()]})
x y
0 1 2.0
1 1 3.0
2 1 6.0
3 4 5.0
4 4 9.0
5 4 3.0
6 9 6.0
7 7 6.0
Here's one based on NumPy, as you were looking for performance -
def gather_columns(df):
    col_mask = [i.startswith('y') for i in df.columns]
    ally_vals = df.iloc[:, col_mask].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x': x_vals, 'y': y_vals})
Sample run -
In [78]: df #(added more cols for variety)
Out[78]:
x y1 y2 y5 y7
0 1 2 3.0 NaN NaN
1 4 5 NaN 6.0 7.0
In [79]: gather_columns(df)
Out[79]:
x y
0 1 2.0
1 1 3.0
2 4 5.0
3 4 6.0
4 4 7.0
If the y columns always start from the second column onwards and run to the end, we can simply slice the dataframe and hence get a further performance boost, like so -
def gather_columns_v2(df):
    ally_vals = df.iloc[:, 1:].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x': x_vals, 'y': y_vals})

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign values by condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default fills with NaN where the condition holds:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to: assign np.nan to every column (:) of the dataframe (df) wherever the condition df.B > 5 holds.
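The same indexing also works if you only want to blank out a subset of columns rather than the whole row; a small variation (the column list here is illustrative):
df.loc[df['B'] > 5, ['A']] = np.nan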
Or using reindex:
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
