I'd like to figure out how often a negative value occurs and how long each negative stretch lasts.
Example df:
d = {'value': [1,2,-3,-4,-5,6,7,8,-9,-10], 'period':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
I checked which rows had negative values. df['value'] < 0
I thought I could just iterate through each row, keep a counter for when a negative value occurs, and perhaps move that row to another df, since I would like to save the beginning period and ending period.
What I'm currently trying
def count_negatives(df):
    df_negatives = pd.DataFrame(columns=['start', 'end', 'counter'])
    for index, row in df.iterrows():
        counter = 0
        df_negative_index = 0
        while(row['value'] < 0):
            # if it's the first one, add it to df as start?
            # grab the last one and add it as end
            # constantly overwrite the counter?
            counter += 1
            # add counter to df row
            df_negatives['counter'] = counter
    return df_negatives
Except that gives me an infinite loop, I think. If I replace the while with an if, I'm stuck coming up with a way to keep track of how long the run lasts.
I think it is better to avoid loops:
#compare by <
a = df['value'].lt(0)
#running sum
b = a.cumsum()
#counter only for negative consecutive values
df['counter'] = b-b.mask(a).ffill().fillna(0).astype(int)
print (df)
value period counter
0 1 1 0
1 2 2 0
2 -3 3 1
3 -4 4 2
4 -5 5 3
5 6 6 0
6 7 7 0
7 8 8 0
8 -9 9 1
9 -10 10 2
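To see why the counter resets, it can help to look at the intermediate series; this is just a quick check using the same sample data:
a = df['value'].lt(0)                  # True on negative rows
b = a.cumsum()                         # running count of negatives seen so far
carried = b.mask(a).ffill().fillna(0)  # value of b at the last positive row, carried across each negative run
print(pd.concat({'a': a, 'b': b, 'carried': carried}, axis=1))
# b - carried restarts at 1 on each negative run and stays 0 on the positive rows.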
Or if you don't need to reset the counter:
a = df['value'].lt(0)
#replace values by 0 where mask a is False
df['counter'] = a.cumsum().where(a, 0)
print (df)
value period counter
0 1 1 0
1 2 2 0
2 -3 3 1
3 -4 4 2
4 -5 5 3
5 6 6 0
6 7 7 0
7 8 8 0
8 -9 9 4
9 -10 10 5
If you want the start and end period:
#compare for negative mask
a = df['value'].lt(0)
#inverted mask
b = (~a).cumsum()
#filter only negative rows
c = b[a].reset_index()
#aggregate first and last value per groups
df = (c.groupby('value')['index']
        .agg([('start', 'first'), ('end', 'last')])
        .reset_index(drop=True))
print (df)
start end
0 2 4
1 8 9
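If you then want the actual 'period' values rather than positional indices, one way (a small sketch, assuming the original frame was kept around as a hypothetical df_orig before being overwritten above) is to look the periods up by index:
# df_orig is assumed to be a copy of the original frame that still has the 'period' column
df['start_period'] = df_orig.loc[df['start'], 'period'].to_numpy()
df['end_period'] = df_orig.loc[df['end'], 'period'].to_numpy()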
I would like to save the beginning period and ending period.
If this is your requirement, you can use itertools.groupby. Note also that a period series is not required, as Pandas provides a natural integer index (beginning at 0) if one is not explicitly provided.
from itertools import groupby
from operator import itemgetter
d = {'value': [1,2,-3,-4,-5,6,7,8,-9,-10]}
df = pd.DataFrame(data=d)
ranges = []
for k, g in groupby(enumerate(df['value'][df['value'] < 0].index), lambda x: x[0]-x[1]):
    group = list(map(itemgetter(1), g))
    ranges.append((group[0], group[-1]))
print(ranges)
[(2, 4), (8, 9)]
Then, to convert to a dataframe:
df = pd.DataFrame(ranges, columns=['start', 'end'])
print(df)
start end
0 2 4
1 8 9
Related
I need to check whether the last value of dataframe['position'] is different from 0, and if so count backwards how many previous values match it until they change, storing that count in a variable, all without for loops. Using loc or iloc, for example...
dataframe:
   position
0         1
1         0
2         1   <4
3         1   <3
4         1   <2
5         1   <1
count = 4
I achieved this by a for loop, but I need to avoid it:
count = 1
if data['position'].iloc[-1] != 0:
    for i in data['position']:
        if data['position'].iloc[-count] == data['position'].iloc[-1]:
            count = count + 1
        else:
            break
    if data['position'].iloc[-count] != data['position'].iloc[-1]:
        count = count - 1
You can reverse your Series, convert it to boolean using the target condition (here "not equal 0", with ne), apply cummin to propagate the False upwards, and sum to count the trailing True values:
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
Output: 4
If you have multiple columns:
counts = df.loc[::-1].ne(0).cummin().sum()
Alternative
A slightly faster alternative (~25% faster), relying on the assumptions that you have at least one zero and no duplicated indices, could be to find the last zero and use indexing:
m = df['position'].eq(0)
count = len(df.loc[m[m].index[-1]:])-1
Without the requirement to have at least one zero:
m = df['position'].eq(0)
m = m[m]
count = len(df) if m.empty else len(df.loc[m.index[-1]:])-1
This should do the trick:
((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
This builds a condition ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])) indicating whether the value in each row (counting backwards from the end) is nonzero and equals the last value. Then, the values are coerced into 0 or 1 and the cumulative product is taken, so that the first non-matching zero will break the sequence and all subsequent values will be zero. Then the flags are summed to get the count of these consecutive matched values.
Depending on your data, though, stepping iteratively backwards from the end may be faster. This solution is vectorized, but it requires working with the entire column of data and doing several computations which are the same size as the original series.
Example:
In [12]: data = pd.DataFrame(np.random.randint(0, 3, size=(10, 5)), columns=list('ABCDE'))
...: data
Out[12]:
A B C D E
0 2 0 1 2 0
1 1 0 1 2 1
2 2 1 2 1 0
3 1 0 1 2 2
4 1 1 0 0 2
5 2 2 1 0 2
6 2 1 1 2 2
7 0 0 0 1 0
8 2 2 0 0 1
9 2 0 0 2 1
In [13]: ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
Out[13]:
A 2
B 0
C 0
D 1
E 2
dtype: int64
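For reference, here is a rough sketch of the "step iteratively backwards from the end" idea mentioned above, for a single Series s; it only illustrates the trade-off and is not a drop-in replacement for the vectorized version:
def trailing_run_count(s):
    # Count how many trailing values equal the last one, but only if that last
    # value is non-zero; stop at the first value that differs.
    last = s.iloc[-1]
    if last == 0:
        return 0
    count = 0
    for val in reversed(s.to_list()):
        if val != last:
            break
        count += 1
    return count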
Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A I need to know how many of the next and previous consecutive rows are greater than the current row's value but less than their own value in column B.
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row: next count is 2, since the next values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row: next count is 0, since the next value 2 is not greater than 3.
Third row: next count is 0, since 9 is greater than 2 but not less than its corresponding B value.
The previous count is calculated the same way, looking backwards.
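To make the rule above concrete, here is a deliberately naive loop version of the "next count" (only to pin down the definition, not an efficient solution; the "previous count" is the same loop walking backwards):
def next_count(df, i):
    # Count consecutive following rows whose A is greater than A[i]
    # and less than their own B; stop at the first row that fails.
    count = 0
    for j in range(i + 1, len(df)):
        if df['A'].iloc[i] < df['A'].iloc[j] < df['B'].iloc[j]:
            count += 1
        else:
            break
    return count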
Note: I know how to solve this problem by looping with a list comprehension or with the pandas apply method, but I wouldn't mind a clear and concise apply approach. I was really looking for a more idiomatic pandas approach.
My Solution
Here is my apply solution, which I think is inefficient. Also, as others have said, there may be no vectorized solution for this question, so a more efficient apply solution will also be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need reset_index(); you can access the index with .name.
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0).
I applied different logic to get the desired value via first_false, which speeds things up significantly.
def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))

df2 = df[::-1].copy()  # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
I need to add a new column labelled "counter" to the existing dataframe that will be calculated as shows in the example below:
symbol   percentage   ??? counter ???
A        11           -1
A        2            0
A        5            1
B        2            0
B        1            1
B        3            2
A        2            2
A        9            -1
A        4            0
B        2            3
B        8            -1
B        7            -1
So the data is grouped per "symbol" and the logic for calculating the "counter" is like this:
if the "percentage" is greater than 5, then "counter" is equal to -1
if the "percentage" is less than 5, then we start counter with 0, if the next row for the same symbol is again less than 5, we increase the counter
if the next row "percentage" is again greater than 5, we break the counting and set the "counter" column again to -1
I've tried something like this, but it's not good, since the reset is not working:
df['counter'] = np.where(df['percentage'] > 5, -1, df.groupby('symbol').cumcount())
IIUC, you can use a mask and a custom groupby:
m = df['percentage'].gt(5)
group = m.groupby(df['symbol']).apply(lambda s: s.ne(s.shift()).cumsum())
df['counter'] = (df
    .groupby(['symbol', group])
    .cumcount()
    .mask(m, -1)
)
Output:
symbol percentage counter
0 A 11 -1
1 A 2 0
2 A 5 1
3 B 2 0
4 B 1 1
5 B 3 2
6 A 2 2
7 A 9 -1
8 A 4 0
9 B 2 3
10 B 8 -1
11 B 7 -1
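Depending on your pandas version, the apply above may come back with the group keys prepended to the index; if so, a sketch of the same per-symbol run id built with transform (same logic, but guaranteed to align with the original index) could be:
m = df['percentage'].gt(5)
# one run id per within-symbol run of the >5 flag, aligned with df's index
group = m.groupby(df['symbol']).transform(lambda s: s.ne(s.shift()).cumsum())
df['counter'] = df.groupby(['symbol', group]).cumcount().mask(m, -1)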
I need to sum the value column until I hit a break.
df = pd.DataFrame({'value': [1,2,3,4,5,6,7,8], 'break': [0,0,1,0,0,1,0,0]})
value break
0 1 0
1 2 0
2 3 1
3 4 0
4 5 0
5 6 1
6 7 0
7 8 0
Expected output
value break
0 6 1
1 15 1
I was thinking of a groupby, but I can't seem to get anywhere with it. I don't even need the break column at the end.
You're on the right track, try groupby on reverse cumsum:
(df.groupby(df['break'][::-1].cumsum()[::-1],
            as_index=False, sort=False)
   .sum()
   .query('`break` != 0')  # remove this for full data
)
Output:
value break
0 6 1
1 15 1
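To see why the reversed cumsum forms the right groups, here is the intermediate grouping key for the sample data (a quick check):
key = df['break'][::-1].cumsum()[::-1]
print(key.tolist())    # [2, 2, 2, 1, 1, 1, 0, 0]
# Rows sharing a key run up to and including a break row, so summing per key
# gives 1+2+3 = 6 and 4+5+6 = 15; the trailing 7+8 group has break == 0 and is
# dropped by the query.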
As in the question: I know how to compute it, but is there a better/faster/more elegant way to do this?
cnt is the result.
s = pd.Series( np.random.randint(2, size=10) )
cnt = 0
for n in s:
    if n != 0:
        break
    else:
        cnt += 1
        continue
Use Series.eq to create a boolean mask, then use Series.cummin to take a cumulative minimum over this series, and finally use Series.sum to get the total count:
cnt = s.eq(0).cummin().sum()
Example:
np.random.seed(9)
s = pd.Series(np.random.randint(2, size=10))
# print(s)
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 1
9 1
dtype: int64
cnt = s.eq(0).cummin().sum()
#print(cnt)
3
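To see the mechanics for the series above, the intermediate masks look like this (just a quick sanity check):
checks = pd.concat({'is_zero': s.eq(0), 'leading': s.eq(0).cummin()}, axis=1)
print(checks)
# 'is_zero' flags every zero; cummin turns everything after the first non-zero
# into False, so summing the 'leading' column counts only the initial zeros (3 here).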
I have done this in a DataFrame as it is easier to reproduce, but you can use the vectorized .cumsum to speed up your code, together with .loc to keep the rows where the running total == 0. Then just find the length with len:
import pandas as pd, numpy as np
s = pd.DataFrame(pd.Series(np.random.randint(2, size=10)))
s['t'] = s[0].cumsum()
o = len(s.loc[s['t']==0])
o
If you assign o to a column with s['o'] = o, the output looks like this:
0 t o
0 0 0 2
1 0 0 2
2 1 1 2
3 1 2 2
4 0 2 2
5 1 3 2
6 1 4 2
7 1 5 2
8 1 6 2
9 0 6 2
You can use cumsum() in a mask and then sum() to get the number of initial 0s in the sequence:
s = pd.Series(np.random.randint(2, size=10))
(s.cumsum() == 0).sum()
Note that this method only works if you want to count 0s. If you want to count the initial run of a non-zero value, you can generalize it, i.e.:
(s.sub(s[0]).cumsum() == 0).sum()
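For example, a quick check of the generalized version (it counts the initial run of s[0]; note it relies on the cumulative sum not returning to zero later, which holds for 0/1 data):
s = pd.Series([1, 1, 1, 0, 1])
print((s.sub(s[0]).cumsum() == 0).sum())   # 3 -> length of the initial run of 1s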