Forward fill column with an index-based limit - python

I want to forward fill a column and I want to specify a limit, but I want the limit to be based on the index---not a simple number of rows like limit allows.
For example, say I have the dataframe given by:
df = pd.DataFrame({
'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})
which looks like
In [27]: df
Out[27]:
data group
0 0.0 0
1 1.0 0
2 NaN 0
3 3.0 1
4 NaN 1
5 5.0 0
6 NaN 0
7 NaN 0
8 NaN 1
9 NaN 1
If I group by the group column and forward fill in that group with limit=2, then my resulting data frame will be
In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 3.0
9 1 NaN
What I actually want to do here however is only forward fill onto rows whose indexes are within say 2 from the first index of each group, as opposed to the next 2 rows of each group. For example, if we just look at the groups on the dataframe:
In [36]: for i, group in df.groupby('group'):
    ...:     print(group)
    ...:
data group
0 0.0 0
1 1.0 0
2 NaN 0
5 5.0 0
6 NaN 0
7 NaN 0
data group
3 3.0 1
4 NaN 1
8 NaN 1
9 NaN 1
I would want the second group here to only be forward filled to index 4---not 8 and 9. The first group's NaN values are all within 2 indexes from the last non-NaN values, so they would be filled completely. The resulting dataframe would look like:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 NaN
9 1 NaN
FWIW in my actual use case, my index is a DateTimeIndex (and it is sorted).
I currently have a solution which sort of works, requiring looping through the dataframe filtered on the group indexes, creating a time range for every single event with a non-NaN value based on the index, and then combining those. But this is far too slow to be practical.

import numpy as np
import pandas as pd
df = pd.DataFrame({
'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})
df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
# index data group stop_index mask
# 0 0 0.0 0 2.0 True
# 1 1 1.0 0 3.0 True
# 2 2 1.0 1 4.0 True
# 3 3 3.0 0 5.0 True
# 4 4 1.0 1 4.0 True
# 5 5 22.0 0 7.0 True
# 6 6 NaN 1 4.0 False
# 7 7 5.0 0 9.0 True
# 8 8 NaN 1 4.0 False
# 9 9 NaN 1 4.0 False
# clean up df
df = df[['data', 'group']]
print(df)
yields
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
This copies the index into a column, then makes a second stop_index column which is the index augmented by the size of the (time) window.
df = df.reset_index()
df['stop_index'] = df['index'] + 2
Then it sets stop_index to null wherever data is null:
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
Then it forward-fills stop_index on a per-group basis:
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
Now (at last) we can define the desired mask -- the places where we actually want to forward-fill data:
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
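Since the real index in the question is a DatetimeIndex, the same idea can be sketched without the helper columns. This is only a sketch under assumptions: the function name ffill_within and its parameters are made up for illustration, and it assumes a sorted index with unique labels. The window is a plain number for an integer index and a pd.Timedelta for a DatetimeIndex.

```python
import numpy as np
import pandas as pd


def ffill_within(df, value_col, group_col, window):
    """Forward-fill value_col per group, but only onto rows whose index lies
    within `window` of the last non-null row of the same group.

    Hypothetical helper; assumes a sorted index with unique labels.
    """
    idx = pd.Series(df.index, index=df.index)
    # Index label of the most recent non-null observation in each group.
    last_valid = idx.where(df[value_col].notna()).groupby(df[group_col]).ffill()
    # Unlimited per-group forward fill, cut back to the allowed window below.
    filled = df.groupby(group_col)[value_col].ffill()
    within = (idx - last_valid) <= window
    return filled.where(within)


df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
})
int_result = ffill_within(df, 'data', 'group', window=2)

# Same data on a daily DatetimeIndex, with the window given as a Timedelta.
ts = df.set_index(pd.date_range('2020-01-01', periods=len(df), freq='D'))
ts_result = ffill_within(ts, 'data', 'group', window=pd.Timedelta(days=2))
```

On the question's example both calls fill rows 2, 4, 6 and 7 and leave rows 8 and 9 as NaN, because for those two rows the last valid value of group 1 is more than 2 index steps (or 2 days) back.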

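IIUC, reindexing each group to the full index before forward filling makes limit=2 count index positions rather than rows of the group:
l = []
for i, group in df.groupby('group'):
    idx = group.index
    # reindex to the full index so that limit=2 counts index positions
    l.append(group.reindex(df.index).ffill(limit=2).loc[idx])
pd.concat(l).sort_index()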
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 0.0
3 3.0 1.0
4 3.0 1.0
5 5.0 0.0
6 5.0 0.0
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Testing data
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 NaN 1
5 22 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
Result of my method on the testing data
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 1.0
3 3.0 0.0
4 1.0 1.0
5 22.0 0.0
6 NaN 1.0 # not changed here, since the previous two index positions have no valid value for group 1
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Output with unutbu's method
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 1.0 1 # mismatch here
7 5.0 0
8 NaN 1
9 NaN 1

Related

Append any further columns to the first three columns AND indicate the triple column it comes from

This is a follow-up question to Append any further columns to the first three columns.
I start out with about 120 columns. It is always three columns that belong to each other. Instead of being 120 columns side by side, they should be stacked on top of each other, so we end up with three columns. This has already been solved (see link above).
Sample data:
df = pd.DataFrame({
"1": np.random.randint(900000000, 999999999, size=5),
"2": np.random.choice( ["A","B","C", np.nan], 5),
"3": np.random.choice( [np.nan, 1], 5),
"4": np.random.randint(900000000, 999999999, size=5),
"5": np.random.choice( ["A","B","C", np.nan], 5),
"6": np.random.choice( [np.nan, 1], 5)
})
Working solution for initial question as suggested by Jezrael:
arr = np.arange(len(df.columns))
df.columns = [arr // 3, arr % 3]
df = df.stack(0).sort_index(level=[1, 0]).reset_index(drop=True)
df.columns = ['A','B','C']
This transforms this:
1 2 3 4 5 6
0 960189042 B NaN 991581392 A 1.0
1 977655199 nan 1.0 964195250 A 1.0
2 961771966 A NaN 969007327 B 1.0
3 955308022 C 1.0 973316485 A NaN
4 933277976 A 1.0 976749175 A NaN
to this:
A B C
0 960189042 B NaN
1 977655199 nan 1.0
2 961771966 A NaN
3 955308022 C 1.0
4 933277976 A 1.0
5 991581392 A 1.0
6 964195250 A 1.0
7 969007327 B 1.0
8 973316485 A NaN
9 976749175 A NaN
Follow Up Question:
Now, if I'd need an indicator from which triple each block comes from, how could this be done? So a result could look like:
A B C D
0 960189042 B NaN 0
1 977655199 nan 1.0 0
2 961771966 A NaN 0
3 955308022 C 1.0 0
4 933277976 A 1.0 0
5 991581392 A 1.0 1
6 964195250 A 1.0 1
7 969007327 B 1.0 1
8 973316485 A NaN 1
9 976749175 A NaN 1
These blocks can be of different lengths! So I cannot simply add a counter.
Use reset_index to remove only the first level of the MultiIndex; a second reset_index then converts the remaining level to a column:
arr = np.arange(len(df.columns))
df.columns = [arr // 3, arr % 3]
df = df.stack(0).sort_index(level=[1, 0]).reset_index(level=0, drop=True).reset_index()
df.columns = ['D','A','B','C']
print (df)
D A B C
0 0 960189042 B NaN
1 0 977655199 nan 1.0
2 0 961771966 A NaN
3 0 955308022 C 1.0
4 0 933277976 A 1.0
5 1 991581392 A 1.0
6 1 964195250 A 1.0
7 1 969007327 B 1.0
8 1 973316485 A NaN
9 1 976749175 A NaN
Then, if you need to change the order of the columns:
cols = df.columns[1:].tolist() + df.columns[:1].tolist()
df = df[cols]
print (df)
A B C D
0 960189042 B NaN 0
1 977655199 nan 1.0 0
2 961771966 A NaN 0
3 955308022 C 1.0 0
4 933277976 A 1.0 0
5 991581392 A 1.0 1
6 964195250 A 1.0 1
7 969007327 B 1.0 1
8 973316485 A NaN 1
9 976749175 A NaN 1
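A different way to get the same indicator, sketched here as an alternative rather than as the answer's method, is to slice the frame into triples, label each slice with its block number, and concatenate. The sample frame and the names wide, blocks and long are made up for illustration. The dropna(how="all") call is an assumption about what "blocks of different lengths" means, namely that shorter triples are padded with all-NaN rows.

```python
import numpy as np
import pandas as pd

# Small deterministic stand-in for the 120-column frame; the second triple
# is one row shorter (its last row is entirely NaN).
wide = pd.DataFrame({
    "1": [10, 11, 12], "2": ["A", "B", "C"], "3": [1.0, np.nan, 1.0],
    "4": [20.0, 21.0, np.nan], "5": ["A", "B", np.nan], "6": [np.nan, 1.0, np.nan],
})

blocks = []
for start in range(0, wide.shape[1], 3):
    # Take one triple and drop its padding rows (assumption, see above).
    block = wide.iloc[:, start:start + 3].dropna(how="all")
    block = block.set_axis(["A", "B", "C"], axis=1)
    # D records which triple the rows came from.
    blocks.append(block.assign(D=start // 3))

long = pd.concat(blocks, ignore_index=True)
```

Because D is assigned per slice before concatenation, it stays correct even when the blocks have different lengths.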

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign a value to the rows matching the condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or use DataFrame.mask, which by default replaces the rows where the condition holds with NaN:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.where with the inverted condition (thanks to Bharath shetty):
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to: assign np.nan to every column (:) of the rows of the dataframe (df) where the condition df.B > 5 holds.
Or using reindex
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
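The variants above should all produce the same frame. The following sketch checks that on the question's data; the variable names are illustrative only. The cast to float before the in-place assignment is my own precaution: NaN is a float, so the integer columns have to become float, and some pandas versions may warn when an in-place assignment forces that upcast.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 4, 8], "B": [4, 5, 6, 7]})
cond = df["B"] > 5

# Non-mutating variants: each returns a new frame and leaves df untouched.
by_mask = df.mask(cond)
by_where = df.where(~cond)
by_reindex = df.loc[~cond].reindex(df.index)

# In-place variant, applied to a float copy so no upcast is needed.
by_loc = df.astype(float)
by_loc.loc[cond, :] = np.nan
```

mask, where and reindex return new objects, so they are the safer choice when the original frame is still needed afterwards.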

Python - divide data frame by a list of numbers, zero included

I have a data frame with 10 columns, and I want to divide each column by a different number. How can I divide the data frame by a list of numbers? There are also zeros in the list, and where a column would be divided by zero I want all the numbers in that column to be 1. How can I do this?
Thanks
Given the dataframe df and the list lst as a numpy array:
df = pd.DataFrame(np.random.rand(10, 10))
lst = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
Then we can use a mask to filter. By using a mask, we can use boolean slicing to get at just the columns that have corresponding zero values in lst. We can also easily access the non zeros with ~m and slice.
m = lst == 0
# assign the number 1 to all columns where there is a zero in lst
df.loc[:, m] = 1.0
# do the division for all columns where lst is not zero
df.loc[:, ~m] = df.loc[:, ~m] / lst[~m]
print(df)
0 1 2 3 4 5
0 0.195316 1.0 0.988503 1.0 0.981752 1.0
1 0.136812 1.0 0.887689 1.0 0.346385 1.0
2 0.927454 1.0 0.733464 1.0 0.773818 1.0
3 0.782234 1.0 0.363441 1.0 0.295135 1.0
4 0.751046 1.0 0.442886 1.0 0.700396 1.0
5 0.028402 1.0 0.724199 1.0 0.047674 1.0
6 0.680154 1.0 0.974464 1.0 0.717932 1.0
7 0.636310 1.0 0.191252 1.0 0.777813 1.0
8 0.766330 1.0 0.975292 1.0 0.224856 1.0
9 0.335766 1.0 0.093384 1.0 0.547195 1.0
You can use div and then set the columns whose divisor in L is 0 to 1:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
L = [0,1,2,3,0,3]
s = pd.Series(L, index=df.columns)
df1 = df.div(s)
df1[s.index[s == 0]] = 1
print (df1)
A B C D E F
0 1.0 4.0 3.5 0.333333 1.0 2.333333
1 1.0 5.0 4.0 1.000000 1.0 1.333333
2 1.0 6.0 4.5 1.666667 1.0 1.000000
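The div approach can be wrapped in a small helper that never actually divides by zero. This is a sketch only: divide_columns and its fill parameter are hypothetical names, and it assumes the list of divisors has one entry per column, in column order.

```python
import pandas as pd


def divide_columns(df, divisors, fill=1.0):
    """Divide each column by its divisor; columns with divisor 0 become `fill`.

    Hypothetical helper; `divisors` must have one entry per column.
    """
    s = pd.Series(divisors, index=df.columns)
    # Zero divisors become NaN, so those columns come out as NaN, not inf.
    out = df.div(s.where(s != 0))
    out.loc[:, s == 0] = fill
    return out


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})
out = divide_columns(df, [0, 1, 2, 3, 0, 3])
```

Only the zero-divisor columns are overwritten, so a NaN that was already present in another column of the input stays NaN.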

DataFrameGroupBy diff() on condition

Suppose I have a DataFrame:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
                   'VALUE':[np.nan,1,0,0,5,0,4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
Now let me show what I want, using group 'b' as an example:
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
What I need: within each group, compute diff() between consecutive VALUE values, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1
You can use diff to subtract values after dropping 0 and NaN values:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
                   'VALUE':[np.nan,1,0,0,5,0,4]})
grouped = df.groupby("CATEGORY")
# define diff func: drop 0 and NaN within the group, then take differences
diff = lambda x: x["VALUE"].replace(0, np.nan).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values, before we group the data
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()].copy()
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift internally in the groups and calculate the diff
nonull_df['next_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['next_value']
Lastly and optionally you can copy the data back to the original dataframe
df.loc[nonull_df.index] = nonull_df
df
CATEGORY VALUE next_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 1.0 -1.0
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0
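The same result can also be sketched without apply, by filtering first and then taking a grouped diff on what remains. This is an illustrative variant and not either answerer's exact code; the name valid is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY': ['a', 'b', 'c', 'b', 'b', 'a', 'b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})

# Keep only the values that take part in the diff (neither NaN nor 0).
valid = df.loc[df['VALUE'].notna() & df['VALUE'].ne(0), 'VALUE']

# Grouped diff over the remaining rows; index alignment on assignment
# leaves NaN in every row that was filtered out.
df['DIFF'] = valid.groupby(df.loc[valid.index, 'CATEGORY']).diff()
```

For group 'b' this leaves the values 1, 5, 4 at rows 1, 4, 6, so DIFF is NaN, 4.0 and -1.0 there and NaN everywhere else.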

Implement a counter which resets in python pandas data frame

Hi, I would like to implement a counter which counts the number of successive zero observations in a dataframe (across multiple columns) and resets whenever a non-zero observation is found. I have used a for loop, but it is incredibly slow, and I am sure there must be far more efficient ways. My data and code are below.
Here is a snapshot of df
df.head()
ACL ACT ADH ADR AFE AFH AFT
2013-02-05 NaN NaN NaN NaN NaN NaN NaN
2013-02-12 -0.136861 -0.020406 0.046150 0.000000 -0.005321 NaN 0.058195
2013-02-19 -0.006632 0.041665 0.007365 0.012738 0.040930 NaN -0.037818
2013-02-26 -0.023848 -0.023999 -0.030677 -0.003144 0.050604 NaN -0.047604
2013-03-05 0.009771 -0.024589 -0.021073 -0.039432 0.047315 NaN 0.068727
I first initialise an empty data frame which has the same properties of df (dataframe) above
df1 = pd.DataFrame(index=df.index, columns=df.columns)
df1 = df1.fillna(0)
Then I create my function which iterates over the rows, but this only deals with one column at a time
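def zero_obs(x, y):
    # x: one column of df; y: the matching counter column of df1 (all zeros)
    for i in range(len(x)):
        if x.iloc[i] == 0:
            # extend the current run of zeros; a run starting at row 0 begins at 1
            prev = y.iloc[i - 1] if i > 0 else 0
            y.iloc[i] = prev + 1
        else:
            y.iloc[i] = 0
    return y

for col in df.columns:
    df1[col] = zero_obs(x=df[col], y=df1[col])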
Really appreciate any help!!
The output I expect is as follows:
df1.tail()
BRN AXL TTO AGL ACL
2017-01-03 3 125 0 0 0
2017-01-10 0 126 0 0 0
2017-01-17 1 127 0 0 0
2017-01-24 0 128 0 0 0
2017-01-31 0 129 1 0 0
setup
Consider the dataframe df
df = pd.DataFrame(
np.zeros((10, 2), dtype=int),
columns=list('AB')
)
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1
print(df)
A B
0 1 0
1 0 0
2 0 0
3 0 0
4 1 0
5 0 0
6 0 1
7 0 0
8 1 0
9 0 0
Option 1
pandas apply
def zero_obs(x):
    """`x` is assumed to be a `pd.Series`"""
    csum = x.eq(0).cumsum()
    cpos = csum.where(x.ne(0)).ffill().fillna(0)
    return csum.sub(cpos)
print(df.apply(zero_obs))
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
Option 2
don't use apply
This function works just as well on df
zero_obs(df)
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
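To gain some confidence that the vectorised function matches the loop from the question, including on data with NaN and a date index, the two can be compared on a small frame. This is a sketch: zero_obs_loop and the sample data are made up for illustration, and the astype(int) at the end is my own addition so that the counter comes out as integers.

```python
import numpy as np
import pandas as pd


def zero_obs(x):
    """Vectorised counter of consecutive zeros; works on a Series or DataFrame."""
    csum = x.eq(0).cumsum()
    # Value of the running total at the most recent non-zero row.
    cpos = csum.where(x.ne(0)).ffill().fillna(0)
    return csum.sub(cpos).astype(int)


def zero_obs_loop(col):
    """Plain-Python reference: count zeros in a row, reset on anything else."""
    out, run = [], 0
    for v in col:
        run = run + 1 if v == 0 else 0
        out.append(run)
    return pd.Series(out, index=col.index)


sample = pd.DataFrame(
    {"A": [np.nan, 0.0, 0.0, 0.5, 0.0], "B": [0.0, 0.0, np.nan, 0.0, 0.0]},
    index=pd.date_range("2013-02-05", periods=5, freq="7D"),
)
fast = zero_obs(sample)
slow = sample.apply(zero_obs_loop)
```

In both versions a NaN resets the counter, because NaN is not equal to 0.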
