Count Positive Consecutive Elements in Dataframe - python

Question
Is there a way to count elements along an axis in a dataframe that conform to a condition?
Background
I am trying to count the consecutive positive values, left to right, along the horizontal axis (axis=1). For example, row zero would result in 0 because the row starts with a negative number, while row one would result in 2, as there are two consecutive positive numbers. Row two would result in 4, and so on.
I've tried looping over it and applying methods, but I am at a loss.
Code
df = pd.DataFrame(np.random.randn(5, 5))
df
          0         1         2         3         4
0 -1.017333 -0.322464  0.635497  0.248172  1.567705
1  0.038626  0.335656 -1.374040  0.273872  1.613521
2  1.655696  1.456255  0.051992  1.559657 -0.256284
3 -0.776232 -0.386942  0.810013 -0.054174  0.696907
4 -0.250789 -0.135062  1.285705 -0.326607 -1.363189
binary = np.where(df < 0, 0, 1)
binary
array([[0, 0, 1, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 1, 1, 0],
       [0, 0, 1, 0, 1],
       [0, 0, 1, 0, 0]])

Here's a similar approach in Pandas
In [792]: df_p = df > 0
In [793]: df_p
Out[793]:
       0      1      2      3      4
0  False  False   True   True   True
1   True   True  False   True   True
2   True   True   True   True  False
3  False  False   True  False   True
4  False  False   True  False  False
In [794]: df_p[0] * (df_p < df_p.shift(1, axis=1)).idxmax(axis=1).astype(int)
Out[794]:
0    0
1    2
2    4
3    0
4    0
dtype: int32
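Note one edge case in the idxmax trick: a row that is positive all the way across has no True-to-False transition, so idxmax falls back to the first column label and the count comes out as 0. A cumprod-based variant (a sketch, not part of the answer above) avoids the special case:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5))

# cumprod along each row stays 1 until the first non-positive value and is
# 0 from then on, so the row sum is the length of the leading positive run.
counts = df.gt(0).cumprod(axis=1).sum(axis=1)
print(counts)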

Here's one approach -
def count_pos_consec_elems(a):
    # Index of the first True -> False transition in each row, plus one,
    # gives the length of the leading run of True values.
    count = (a[:, 1:] < a[:, :-1]).argmax(1) + 1
    # Rows that start with False have no leading run at all.
    count[a[:, 0] < 1] = 0
    # argmax finds no transition in all-True rows; the run spans the row.
    count[a.all(1)] = a.shape[1]
    return count
Sample run -
In [145]: df
Out[145]:
          0         1         2         3         4
0  0.602198 -0.899124 -1.104486 -0.106802 -0.092505
1  0.012199 -1.415231  0.604574 -0.133460 -0.264506
2 -0.878637  1.607330 -0.950801 -0.594610 -0.718909
3  1.200000  1.200000  1.200000  1.200000  1.200000
4  1.434637  0.500000  0.421560 -1.001847 -0.980985
In [146]: binary = df.values > 0
In [147]: count_pos_consec_elems(binary)
Out[147]: array([1, 1, 0, 5, 3])
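To attach the counts back to the dataframe's index, the result can be wrapped in a Series (a small usage sketch, assuming the df and function above):
pd.Series(count_pos_consec_elems(df.values > 0), index=df.index)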

Related

Sum n values of n-lists with the same index in Python

For example, I have 5 lists of 10 elements each, generated with random values simulating coin tosses.
I get my 5 lists of 10 elements in the following way:
import random as rnd
import numpy as np

result = [0, 1]  # 0 is tail, 1 is head
probability = [1/2, 1/2]
N = 10

def list_generator(number):  # this number would be 5 in this case
    for i in range(number):
        n_round = np.array(rnd.choices(result, probability, k=N))
        print(n_round)

list_generator(5)
And for example I would get this
[1 1 0 0 0 1 0 1 1 0]
[0 1 0 0 0 1 1 1 0 1]
[1 1 0 0 1 1 1 0 1 1]
[0 0 0 1 0 0 0 1 0 0]
[0 0 1 1 0 0 0 0 1 1]
How can I sum only the numbers in the same column? That is, I want a list whose first element is the sum of the first column (1+0+1+0+0), whose second element is the sum of the second column (1+1+1+0+0), and so on for all ten coin tosses.
(I need it in a list because I will use this to plot a graph)
I have thought about building a matrix from the arrays and summing the nth column of it, appending each value to the list, but I do not know how to do that; I do not have much experience with arrays.
Have your function return a 2-D NumPy array and then sum along the required axis. Separately, you don't need to pass probability to random.choices, since equal probabilities are the default.
import random
import numpy as np

def list_generator(number):
    return np.array([np.array(random.choices([0,1], k=10)) for i in range(number)])
a = list_generator(5)
>>> a
array([[0, 1, 1, 1, 0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 1, 1, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
       [1, 1, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]])
>>> a.sum(axis=0)
array([3, 4, 3, 2, 3, 5, 4, 3, 2, 1])
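Since the end goal is a list for plotting, the column sums convert directly (a small usage note):
sums = a.sum(axis=0).tolist()  # e.g. [3, 4, 3, 2, 3, 5, 4, 3, 2, 1]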
You can use numpy.random.randint to generate your randomized data (here 10 rounds of 10 tosses; use size=(5, N) for 5 rounds). Then use sum along axis=0 to get the column totals:
import numpy as np
N = 10
data = np.random.randint(2, size=(N, N))
print(data)
print(data.sum(axis=0))
[[1 0 1 1 1 1 0 0 1 1]
 [0 0 1 1 0 0 1 1 1 0]
 [1 1 0 1 1 1 0 0 1 1]
 [1 1 0 0 0 0 1 1 1 1]
 [1 0 0 1 1 1 0 1 1 1]
 [1 0 1 1 0 1 0 1 1 1]
 [0 0 0 1 0 1 0 1 1 0]
 [0 0 0 1 0 1 0 1 0 1]
 [1 0 0 0 1 0 1 0 1 1]
 [1 0 1 1 0 1 0 0 0 1]]
[7 2 4 8 4 7 3 6 8 8]
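As an aside, newer NumPy code usually draws random integers through the Generator API instead of np.random.randint; a minimal sketch of the same idea (the seed is an arbitrary illustrative choice):
import numpy as np

rng = np.random.default_rng(seed=42)    # seeded only for reproducibility
tosses = rng.integers(2, size=(5, 10))  # 5 rounds of 10 coin flips
print(tosses.sum(axis=0).tolist())      # column sums as a plain list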

Numpy / Pandas slicing based on intervals

Trying to figure out a way to slice non-contiguous, non-equal-length ranges of a pandas/NumPy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row holds a start idx and an (inclusive) end idx for one column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different-length slices of x based on the rows of y, along the lines of
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful here, because even if a loop cannot be entirely avoided, the main dataframe x does not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True  # +1 because the ends are inclusive
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
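For example, with y held as a plain array the loop only ever touches the small Boolean mask (a sketch along the lines of the answer above, still assuming inclusive end indices):
import numpy as np

y_arr = y.to_numpy()  # start/end pairs, one row per column of x
mask = np.zeros(x.shape, dtype=bool)
for col, (start, end) in enumerate(y_arr):
    mask[start:end + 1, col] = True  # +1 because the ends are inclusive
x[mask] = 0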
I customized this answer to your problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# Broadcast the row indices against every (start, end) pair; the ends
# are inclusive, hence the >= comparison.
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11
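The broadcasting step is easier to see in isolation; a toy sketch with the same inclusive-end convention (the start/end values mirror y above):
import numpy as np

starts = np.array([0, 2, 1, 0])  # per-column start rows
ends = np.array([3, 2, 2, 0])    # per-column inclusive end rows
r = np.arange(3)                 # row indices of x

# The (4, 1)-shaped columns broadcast against the (3,)-shaped row indices
# into a (4, 3) Boolean matrix, True where start <= row <= end; transpose
# to match x's (3, 4) layout.
mask = ((starts[:, None] <= r) & (ends[:, None] >= r)).T
print(mask)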

Turning a sequence/two-dimensional array into DataFrame columns in Pandas

So I use my previously trained model to predict new data
y_pred_aplikasi = model.predict(X_aplikasi)
y_pred_aplikasi
It returns
array([[7.7066602e-07, 9.9993092e-01, 4.6858725e-07],
       [7.1568817e-02, 4.3571211e-07, 7.3567069e-01],
       [9.8825598e-01, 6.3803792e-03, 4.4066067e-07],
       ...,
       [3.8332163e-15, 1.0000000e+00, 1.4775689e-11],
       [1.8400473e-14, 1.0000000e+00, 6.1960957e-11],
       [7.0748132e-01, 5.9783965e-02, 5.7850748e-02]], dtype=float32)
I want to turn that array into something like this, with the largest value in each row becoming 1 and the rest 0:
A B C
0 1 0
0 0 1
1 0 0
....
1 0 0
0 0 1
1 0 0
How can I achieve this with pandas?
Considering this to be your array:
In [841]: a
Out[841]:
array([[7.7066602e-07, 9.9993092e-01, 4.6858725e-07],
       [7.1568817e-02, 4.3571211e-07, 7.3567069e-01],
       [9.8825598e-01, 6.3803792e-03, 4.4066067e-07],
       [3.8332163e-15, 1.0000000e+00, 1.4775689e-11],
       [1.8400473e-14, 1.0000000e+00, 6.1960957e-11],
       [7.0748132e-01, 5.9783965e-02, 5.7850748e-02]])
Convert above array into dataframe using pd.DataFrame constructor:
In [851]: df = pd.DataFrame(a, columns=['A', 'B', 'C'])
In [852]: df
Out[852]:
              A             B             C
0  7.706660e-07  9.999309e-01  4.685873e-07
1  7.156882e-02  4.357121e-07  7.356707e-01
2  9.882560e-01  6.380379e-03  4.406607e-07
3  3.833216e-15  1.000000e+00  1.477569e-11
4  1.840047e-14  1.000000e+00  6.196096e-11
5  7.074813e-01  5.978397e-02  5.785075e-02
Replace max value with 1, else 0, using df.where and df.max(axis=1):
In [854]: df = df.eq(df.where(df != 0).max(1), axis=0).astype(int)
In [855]: df
Out[855]:
   A  B  C
0  0  1  0
1  0  0  1
2  1  0  0
3  0  1  0
4  0  1  0
5  1  0  0
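Another common route (an alternative sketch, not the method used above) is to take the index of each row's maximum and build the one-hot rows with np.eye:
import numpy as np
import pandas as pd

# np.eye(3) is the 3x3 identity matrix; indexing it with the argmax of each
# row picks out the matching one-hot row.
one_hot = np.eye(a.shape[1], dtype=int)[a.argmax(axis=1)]
df = pd.DataFrame(one_hot, columns=['A', 'B', 'C'])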
Manually looping through each element could work, but I'm not sure how feasible that is for your application. Note that the row maximum has to be captured before the row is modified, since overwriting entries changes what .max() returns:
for i in range(len(y_pred_aplikasi)):
    row_max = y_pred_aplikasi[i].max()  # capture before modifying the row
    for j in range(3):  # or range(len(y_pred_aplikasi[i])) to be more dynamic
        if y_pred_aplikasi[i][j] == row_max:
            y_pred_aplikasi[i][j] = 1
        else:
            y_pred_aplikasi[i][j] = 0

y_pred_aplikasi.astype(int)
Out[5]:
array([[0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       ...,
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0]])

How to remove rows from DataFrame

I have a DataFrame with n rows and an ndarray with n values (-1 for outliers, 1 for inliers). Is there a pythonic way to remove the DataFrame rows whose positions match the elements of the array marked as -1?
You can just do: new_df = old_df[arr == 1].
Example:
df = pd.DataFrame(np.random.randn(5,5))
arr = np.random.choice([1,-1], 5)
>>> df
          0         1         2         3         4
0 -0.238418  0.291475  0.139162 -0.030003 -0.515817
1 -0.162404 -1.272317  0.342051 -0.787938  0.464699
2 -0.965481  0.727143 -0.887149 -0.430592 -2.074865
3  0.699129 -0.242738  1.754805 -0.120637 -1.536973
4  0.228538  0.799445 -0.217787  0.398572 -1.255639
>>> arr
array([ 1, -1, -1, 1, -1])
>>> df[arr == 1]
          0         1         2         3         4
0 -0.238418  0.291475  0.139162 -0.030003 -0.515817
3  0.699129 -0.242738  1.754805 -0.120637 -1.536973
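If the surviving rows should also be renumbered from zero, reset_index chains on directly (a small usage note, not part of the answer above):
new_df = old_df[arr == 1].reset_index(drop=True)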

Pandas: flag consecutive values

I have a pandas series of the form [0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1].
0: indicates economic increase.
1: indicates economic decline.
A recession is signaled by two consecutive declines (1).
The end of the recession is signaled by two consecutive increases (0).
In the above dataset I have two recessions: one beginning at index 3 and ending at index 5, and one beginning at index 8 and ending at index 11.
I am at a loss for how to approach this with pandas. I would like to identify the indices for the start and end of each recession. Any assistance would be appreciated.
Here is my attempt at a solution in plain Python:
np_decline = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
recession_start_flag = 0
recession_end_flag = 0
recession_start = []
recession_end = []
for i in range(len(np_decline) - 1):
    if recession_start_flag == 0 and np_decline[i] == 1 and np_decline[i + 1] == 1:
        recession_start.append(i)
        recession_start_flag = 1
    if recession_start_flag == 1 and np_decline[i] == 0 and np_decline[i + 1] == 0:
        recession_end.append(i - 1)
        recession_start_flag = 0
print(recession_start)
print(recession_end)
Is there a more pandas-centric approach?
The start of a run of 1's satisfies the condition
x_prev = x.shift(1)
x_next = x.shift(-1)
((x_prev != 1) & (x == 1) & (x_next == 1))
That is to say, the value at the start of a run is 1 and the previous value is not 1 and the next value is 1. Similarly, the end of a run satisfies the condition
((x == 1) & (x_next == 0) & (x_next2 == 0))
since the value at the end of a run is 1 and the next two values are 0 (where x_next2 is the series shifted by -2).
We can find indices where these conditions are true using np.flatnonzero:
import numpy as np
import pandas as pd
x = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
x_prev = x.shift(1)
x_next = x.shift(-1)
x_next2 = x.shift(-2)
df = pd.DataFrame(
    dict(start = np.flatnonzero((x_prev != 1) & (x == 1) & (x_next == 1)),
         end = np.flatnonzero((x == 1) & (x_next == 0) & (x_next2 == 0))))
print(df[['start', 'end']])
yields
   start  end
0      3    5
1      8   11
You can use shift:
df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1], columns=['signal'])
df_prev = df.shift(1)['signal']
df_next = df.shift(-1)['signal']
df_next2 = df.shift(-2)['signal']
df.loc[(df_prev != 1) & (df['signal'] == 1) & (df_next == 1), 'start'] = 1
df.loc[(df['signal'] != 0) & (df_next == 0) & (df_next2 == 0), 'end'] = 1
df.fillna(0, inplace=True)
df = df.astype(int)
    signal  start  end
0        0      0    0
1        1      0    0
2        0      0    0
3        1      1    0
4        1      0    0
5        1      0    1
6        0      0    0
7        0      0    0
8        1      1    0
9        1      0    0
10       0      0    0
11       1      0    1
12       0      0    0
13       0      0    0
14       1      0    0
Similar idea using shift, but writing the result as a single Boolean column:
# Boolean indexers for recession start and stops.
rec_start = (df['signal'] == 1) & (df['signal'].shift(-1) == 1)
rec_end = (df['signal'] == 0) & (df['signal'].shift(-1) == 0)
# Mark the recession start/stops as True/False.
df.loc[rec_start, 'recession'] = True
df.loc[rec_end, 'recession'] = False
# Forward fill the recession column with the last known Boolean.
# Fill any NaN's as False (i.e. locations before the first start/stop).
df['recession'] = df['recession'].ffill().fillna(False)
The resulting output:
    signal  recession
0        0      False
1        1      False
2        0      False
3        1       True
4        1       True
5        1       True
6        0      False
7        0      False
8        1       True
9        1       True
10       0       True
11       1       True
12       0      False
13       0      False
14       1      False
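Once the Boolean column exists, the start and end indices can be read off its transitions; a sketch assuming the recession column built above:
rec = df['recession']
# A False -> True transition marks a start; a True -> False transition marks
# the row after an end, so subtract 1 to land on the last recession row.
starts = df.index[rec & ~rec.shift(1, fill_value=False)]
ends = df.index[~rec & rec.shift(1, fill_value=False)] - 1
print(list(starts), list(ends))  # [3, 8] [5, 11]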
Use rolling(2):
s = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
I subtract .5 so the rolling sum is 1 when a recession starts and -1 when it stops.
s2 = s.sub(.5).rolling(2).sum()
Since both 1 and -1 evaluate to True, I can mask the rolling signal down to just the starts and stops, ffill, and then get truth values of where it is positive with gt(0). Note that rolling(2) labels each pair at its second element, so the flags flip one row after the start/end indices identified above.
pd.concat([s, s2.mask(~s2.astype(bool)).ffill().gt(0)], axis=1, keys=['signal', 'isRec'])
You can use scipy.signal.find_peaks for this problem:
import numpy as np
from scipy.signal import find_peaks

np_decline = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
peaks = find_peaks(np_decline, width=2)
recession_start_loc = peaks[1]['left_bases'][0]
Note that left_bases points at the last sample before a run of declines, so the recession itself begins one index to the right.
