If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selection by a buffer of 1 and get each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, but also the row immediately before or after any such row.
Edit: I'm hoping to find a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
Edit 2: Here's another example illustrating what I'm going for:
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.
You can use shift with the bitwise OR operator |:
c = df['A'] == 1
df[c|c.shift()|c.shift(-1)]
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
For arbitrary pad sizes, you may try where, interpolate, and notna to create the mask
n = 2
c = df.where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Out[61]:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
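The shift idea above can also be generalized to an arbitrary pad by OR-ing together shifted copies of the mask; here is a minimal sketch (pad_mask is just an illustrative name, not an existing helper):
from functools import reduce

def pad_mask(cond, pad):
    # OR the boolean mask shifted by -pad .. +pad, so any row within
    # `pad` rows of a True value is also selected
    shifted = [cond.shift(k, fill_value=False) for k in range(-pad, pad + 1)]
    return reduce(lambda a, b: a | b, shifted)

c = df['A'] == 1
df[pad_mask(c, 2)]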
Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
df[(df['A'] == 1).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
Here it is in action:
def padsearch(df, column, value, pad):
    return df[(df[column] == value).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan, False).astype(bool)]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted, the command is far less nice, and it's a little clunky to convert the False values to and from NaN. But it still uses only pandas builtins, so it remains fairly quick.
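If the replace round trips feel awkward, roughly the same mask can be built with Series.where, which introduces the NaNs directly (just a sketch of the same idea, not a different algorithm):
pad = 2
m = df['A'] == 1
# where(m) keeps True and turns False into NaN, so ffill/bfill can pad it
padded = m.where(m).ffill(limit=pad).bfill(limit=pad).notna()
df[padded]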
I found another solution, though it's not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2
# determine the set of indices, clamped to the frame's bounds
indices = set(
    filter(
        lambda i: 0 <= i < len(df),
        [
            x + y
            for x in df[df['A'] == 1].index
            for y in range(-pad, pad + 1)
        ],
    )
)
# fetch rows (sorted so they come back in index order)
df.iloc[sorted(indices)]
I am now working on getting a cumulative sum column using pandas. However, this column must contain the cumulative sum only where one column's value is greater than another column's value. Here's an example of my current data:
Index A B C
0 1 20 3
1 10 15 11
2 20 12 25
3 30 18 32
4 40 32 17
5 50 12 4
I want to cumsum() column A where column B is greater than column C; otherwise the value is zero and the running sum restarts. The resulting column D in the original df should look like:
Index A B C D
0 1 20 3 1
1 10 15 11 11
2 20 12 25 0
3 30 18 32 0
4 40 32 17 40
5 50 12 4 90
I appreciate any support in advance.
df = pd.DataFrame({'A': {0: 1, 1: 10, 2: 20, 3: 30, 4: 40, 5: 50},
'B': {0: 20, 1: 15, 2: 12, 3: 18, 4: 32, 5: 12},
'C': {0: 3, 1: 11, 2: 25, 3: 32, 4: 17, 5: 4}})
Make a boolean Series for your condition and identify consecutive groups of True or False:
b_gt_c = df.B > df.C
groups = b_gt_c.ne(b_gt_c.shift()).cumsum()
In [107]: b_gt_c
Out[107]:
0 True
1 True
2 False
3 False
4 True
5 True
dtype: bool
In [108]: groups
Out[108]:
0 1
1 1
2 2
3 2
4 3
5 3
dtype: int32
Group by those groups; multiply the cumsum of each group by the condition; assign the result to the new df column.
gb = df.groupby(groups)
for k, g in gb:
    df.loc[g.index, 'D'] = g['A'].cumsum() * b_gt_c[g.index]
In [109]: df
Out[109]:
A B C D
0 1 20 3 1.0
1 10 15 11 11.0
2 20 12 25 0.0
3 30 18 32 0.0
4 40 32 17 40.0
5 50 12 4 90.0
You could skip the for loop as well:
df['G'] = np.where(df.B.gt(df.C), df.A, np.NaN)
group = df.B.gt(df.C).ne(df.B.gt(df.C).shift()).cumsum()
df['G'] = df.groupby(group).G.cumsum().fillna(0)
The technique for identifying consecutive occurrences of values is from the SO Q&A: Grouping dataframe based on consecutive occurrence of values
There is probably a more elegant solution, but this also works.
We first create two dummy columns, x and x_shift.
df.x retains the values of df.A where df.B > df.C, and is 0 otherwise.
df.x_shift shifts those values down by one row and fills the resulting NaN with 0.
In the last step we conditionally add df.A and df.x_shift, then drop df.x and df.x_shift:
df['x'] = pd.DataFrame(np.where(df.B>df.C, df.A ,0))
df['x_shift'] = df.x.shift(1).fillna(0)
df['D'] = pd.DataFrame(np.where(df.B >df.C, df.A+df.x_shift,0))
df = df.drop(['x', 'x_shift'], axis=1)
While it's a little barbaric, you could convert to numpy arrays and then write a simple loop that goes through the three arrays and compares values.
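A minimal sketch of that idea, assuming the same A/B/C frame as above (conditional_cumsum is just an illustrative name):
import numpy as np

def conditional_cumsum(df):
    # walk the three arrays once: accumulate A while B > C, reset otherwise
    a, b, c = df['A'].to_numpy(), df['B'].to_numpy(), df['C'].to_numpy()
    d = np.zeros(len(df), dtype=a.dtype)
    running = 0
    for i in range(len(df)):
        if b[i] > c[i]:
            running += a[i]
            d[i] = running
        else:
            running = 0  # condition failed: leave 0 and restart the sum
    return d

df['D'] = conditional_cumsum(df)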
I have a dataframe of the following type:
A B
0 1 2
1 4 5
2 7 8
3 10 11
4 13 14
5 16 17
I want to calculate the mean of the first 3 elements of each column, then the next 3 elements, and so on, and store the results in a dataframe.
Desired Output-
A B
0 4 5
1 13 14
Using groupby was one of the approaches I thought of, but I am unable to figure out how to use it in this case.
If you have the default RangeIndex, use integer division and pass the result to groupby:
df = df.groupby(df.index // 3).mean()
print (df)
A B
0 4 5
1 13 14
Detail:
print (df.index // 3)
Int64Index([0, 0, 0, 1, 1, 1], dtype='int64')
A general solution uses an array created from the length of the DataFrame, so it works with any index values:
df = df.groupby(np.arange(len(df)) // 3).mean()
Detail:
print (np.arange(len(df)) // 3)
[0 0 0 1 1 1]
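For instance, with a hypothetical frame that uses a string index (where df.index // 3 would not work), the np.arange version still groups every 3 rows:
df2 = pd.DataFrame({'A': [1, 4, 7, 10, 13, 16],
                    'B': [2, 5, 8, 11, 14, 17]},
                   index=list('abcdef'))
print(df2.groupby(np.arange(len(df2)) // 3).mean())
#       A     B
# 0   4.0   5.0
# 1  13.0  14.0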
I have a pandas dataframe with two columns A,B as below.
I want a vectorized solution for creating a new column C where C[i] = C[i-1] - A[i] + B[i].
df = pd.DataFrame(data={'A': [10, 2, 3, 4, 5, 6], 'B': [0, 1, 2, 3, 4, 5]})
>>> df
A B
0 10 0
1 2 1
2 3 2
3 4 3
4 5 4
5 6 5
Here is a solution using a for loop:
df['C'] = df['A']
for i in range(1, len(df)):
    df['C'][i] = df['C'][i-1] - df['A'][i] + df['B'][i]
>>> df
A B C
0 10 0 10
1 2 1 9
2 3 2 8
3 4 3 7
4 5 4 6
5 6 5 5
... which does the job.
But since loops are slow in comparison to vectorized calculations, I want a vectorized solution for this in pandas:
I tried to use the shift() method like this:
df['C'] = df['C'].shift(1).fillna(df['A']) - df['A'] + df['B']
but it didn't help since the shifted C column isn't updated with the calculation. It keeps its original values:
>>> df['C'].shift(1).fillna(df['A'])
0 10
1 10
2 2
3 3
4 4
5 5
and that produces a wrong result.
This can be vectorized, since delta[i] = C[i] - C[i-1] = -A[i] + B[i]. Get delta from A and B first, then take the cumulative sum of delta and add C[0] (which equals A[0]) to recover the full C, because C[i] = C[0] + delta[1] + ... + delta[i].
Code as follows:
delta = df['B'] - df['A']
delta[0] = 0
df['C'] = df.loc[0, 'A'] + delta.cumsum()
print(df)
A B C
0 10 0 10
1 2 1 9
2 3 2 8
3 4 3 7
4 5 4 6
5 6 5 5
I have a dataframe as listed below:
In []: dff = pd.DataFrame({'A': np.arange(8),
                           'B': list('aabbbbcc'),
                           'C': np.random.randint(100, size=8)})
which I have grouped based on column B:
In []: grouped = dff.groupby('B')
Now, I want to filter dff based on the difference of values in column 'C'. For example, if the difference between any two points within a group in column C is greater than a threshold, remove that row.
If dff is:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
6 6 c 74
7 7 c 3
Then, a threshold of 10 for C will produce a final table like:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
Here group c (the lowercase letter) is removed, as the difference between its two values is greater than 10, while group b keeps all of its rows since each value is within 10 of at least one other value in the group.
I think I'd do the hard work in numpy:
In [11]: a = np.array([2, 3, 14, 15, 54])
In [12]: res = np.abs(a[:, np.newaxis] - a) < 10 # Note: perhaps you want <= 10.
In [13]: np.fill_diagonal(res, False)
In [14]: res.any(0)
Out[14]: array([ True, True, True, True, False], dtype=bool)
You could wrap this in a function:
In [15]: def has_close(a, n=10):
             res = np.abs(a[:, np.newaxis] - a) < n
             np.fill_diagonal(res, False)
             return res.any(0)
In [16]: g = dff.groupby('B', as_index=False)
In [17]: g.apply(lambda x: x[has_close(x.C.values)])
Out[17]:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
5 5 b 56