Pandas set value if most columns are equal in a dataframe - python

Following up on a question I asked yesterday, Pandas set value if all columns are equal in a dataframe, and starting from @anky_91's solution, I'm working on something similar.
Instead of putting 1 or -1 only when all columns are equal, I want something more flexible.
In fact I want 1 if (for example) more than 70% of the columns are 1, -1 for the inverse condition, and 0 otherwise.
So this is what I've written:
# Instead of using .all I use .sum to count the occurrences of 1 and 0 in each row
m1 = local_df.eq(1).sum(axis=1)
m2 = local_df.eq(0).sum(axis=1)
# Debug print, it works
print(m1)
print(m2)
But I don't know how to change this part:
local_df['enseamble'] = np.select([m1, m2], [1, -1], 0)
m = local_df.drop(local_df.columns.difference(['enseamble']), axis=1)
Here is what I want, in pseudocode:
tot = m1 + m2
if m1 > m2:
    if (m1 * 100) / tot > 70:  # simple percentage calculation
        df['enseamble'] = 1
elif m2 > m1:
    if (m2 * 100) / tot > 70:  # simple percentage calculation
        df['enseamble'] = -1
else:
    df['enseamble'] = 0
Thanks
Edit 1
This is an example of expected output:
            NET_0  NET_1  NET_2  NET_3  NET_4  NET_5  NET_6
date
2009-08-02      0      1      1      1      0      1
2009-08-03      1      0      0      0      1      0
2009-08-04      1      1      1      0      0      0

date        enseamble
2009-08-02          1   # because 1 is more than 70%
2009-08-03         -1   # because 0 is more than 70%
2009-08-04          0   # because 0 and 1 are 50-50

You could obtain the specified output from the following conditions:
thr = 0.7
c1 = (df.eq(1).sum(1) / df.shape[1]).gt(thr)  # rows where the share of 1s exceeds the threshold
c2 = (df.eq(0).sum(1) / df.shape[1]).gt(thr)  # rows where the share of 0s exceeds the threshold
c2.astype(int).mul(-1).add(c1)  # maps (c1, c2) to 1, -1 or 0
Output
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
dtype: int64
Or using np.select:
pd.DataFrame(np.select([c1,c2], [1,-1], 0), index=df.index, columns=['result'])
result
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
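Putting it together as a minimal, self-contained sketch (the 0/1 values, column names and dates here are made up for illustration):
import numpy as np
import pandas as pd

# Hypothetical data, crafted so each of the three outcomes appears
df = pd.DataFrame([[1, 1, 1, 1, 0, 1],
                   [0, 0, 0, 0, 1, 0],
                   [1, 1, 1, 0, 0, 0]],
                  columns=[f'NET_{i}' for i in range(6)],
                  index=pd.to_datetime(['2009-08-02', '2009-08-03', '2009-08-04']))

thr = 0.7
c1 = (df.eq(1).sum(1) / df.shape[1]).gt(thr)  # share of 1s above 70%
c2 = (df.eq(0).sum(1) / df.shape[1]).gt(thr)  # share of 0s above 70%
result = pd.DataFrame(np.select([c1, c2], [1, -1], 0),
                      index=df.index, columns=['result'])
print(result)
#             result
# 2009-08-02       1
# 2009-08-03      -1
# 2009-08-04       0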

Try with the following (m1, m2 and tot are the same as what you have; note the comparison is against 70, since m1 * 100 / tot is a percentage):
cond1 = (m1 > m2) & ((m1 * 100 / tot).gt(70))  # more than 70% of the values are 1
cond2 = (m2 > m1) & ((m2 * 100 / tot).gt(70))  # more than 70% of the values are 0
df['enseamble'] = np.select([cond1, cond2], [1, -1], 0)
m = df.drop(df.columns.difference(['enseamble']), axis=1)
print(m)
enseamble
date
2009-08-02 1
2009-08-03 -1
2009-08-04 0

Related

Would like to vectorize while loop for performance

I am trying to set values for a window of an array based on the current value of another array.
It should ignore values that the window overrides.
I need to be able to change the size of the window for different runs.
This works but it is very slow.
I thought there would be a vectorized solution somewhere.
window_size = 3

def signal(self):
    signal = pd.Series(data=0, index=self.arr.index)
    i = 0
    while i < len(self.arr) - 1:
        s = self.arr.iloc[i]
        if s in [-1, 1]:
            j = i + window_size
            signal.iloc[i:j] = s
            i = i + window_size
        else:
            i += 1
    return signal
arr    = [0 0 0 0 1 0 0 0 0 0 0 -1 -1 0 0 0 0]
signal = [0 0 0 0 1 1 1 0 0 0 0 -1 -1 -1 0 0 0]
You could use the shift function of pd.Series:
arr_series = pd.Series(arr)
arr_series + arr_series.shift(periods=1, fill_value=0) + arr_series.shift(periods=2, fill_value=0)
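Applied to the example above, as a quick sketch. Note that this sums shifted copies, so nonzero signals closer together than the window accumulate rather than being ignored; it matches the loop exactly only when signals are spaced at least window_size apart:
import pandas as pd

arr_series = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0])
out = (arr_series
       + arr_series.shift(periods=1, fill_value=0)
       + arr_series.shift(periods=2, fill_value=0))
print(out.tolist())
# [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, -1, -2, -2, -1, 0, 0]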

Series calculation based on shifted values / recursive algorithm

I have the following:
df['PositionLong'] = 0
df['PositionLong'] = np.where(
    df['Alpha'] == 1,
    1,
    np.where(np.logical_and(df['PositionLong'].shift(1) == 1,
                            df['Bravo'] == 1), 1, 0)
)
This line basically only takes df['Alpha'] into account, not df['PositionLong'].shift(1). It doesn't pick it up, but I don't understand why.
It produces this:
df['Alpha']  df['Bravo']  df['PositionLong']
0            0            0
1            1            1
0            1            0
1            1            1
1            1            1
However, what I wanted the code to do is this:
df['Alpha']  df['Bravo']  df['PositionLong']
0            0            0
1            1            1
0            1            1
1            1            1
1            1            1
I believe the solution is to loop over each row, but this will take very long.
Can you help me please?
You are looking for a recursive calculation, since each PositionLong value depends on the previous one, which in turn depends on Alpha and Bravo.
But numpy.where is a regular function: its arguments are evaluated before the call, so df['PositionLong'].shift(1) is evaluated as a series of all 0 values, since you initialise the column with 0.
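A quick illustration of that eager evaluation (a minimal sketch with made-up data):
import pandas as pd

df = pd.DataFrame({'Alpha': [0, 1, 0, 1, 1], 'Bravo': [0, 1, 1, 1, 1]})
df['PositionLong'] = 0
# By the time np.where runs, this argument has already been computed:
print(df['PositionLong'].shift(1).tolist())
# [nan, 0.0, 0.0, 0.0, 0.0]  -- all zeros, regardless of Alpha/Bravo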
A manual loop need not be expensive. You can use numba to efficiently implement your recursive algorithm:
import numpy as np
from numba import njit

@njit
def rec_algo(alpha, bravo):
    res = np.empty(alpha.shape)
    res[0] = 1 if alpha[0] == 1 else 0
    for i in range(1, len(res)):
        if (alpha[i] == 1) or ((res[i-1] == 1) and bravo[i] == 1):
            res[i] = 1
        else:
            res[i] = 0
    return res

df['PositionLong'] = rec_algo(df['Alpha'].values, df['Bravo'].values).astype(int)
Result:
print(df)
Alpha Bravo PositionLong
0 0 0 0
1 1 1 1
2 0 1 1
3 1 1 1
4 1 1 1

Find longest run of consecutive zeros for each user in dataframe

I'm looking to find the max run of consecutive zeros in a DataFrame, with the result grouped by user. I'm interested in running the RLE on usage.
sample input:
user--day--usage
A-----1------0
A-----2------0
A-----3------1
B-----1------0
B-----2------1
B-----3------0
Desired output
user---longest_run
a - - - - 2
b - - - - 1
mydata <- mydata[order(mydata$user, mydata$day), ]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i])  # run-length encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0  # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage), ]
This works in R, but I want to do the same thing in Python; I'm totally stumped.
First use groupby with size on the user and usage columns, together with a helper Series that labels consecutive runs:
print (df)
user day usage
0 A 1 0
1 A 2 0
2 A 3 1
3 B 1 0
4 B 2 1
5 B 3 0
6 C 1 1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
                longest_run
user val usage
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1
Then filter only the zero rows, take the max per user, and reindex to append the users with no zero runs:
df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
user longest_run
0 A 2
1 B 1
2 C 0
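Note that newer pandas versions have removed the level argument of DataFrame.max; there the equivalent is a groupby on the index level:
df2 = (df1.query('val == 0')
          .groupby(level=0).max()
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())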
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
Name: usage, dtype: int32
Get the max number of consecutive zeros in a series:
def max0(sr):
    # Label each position with the count of nonzero values seen so far;
    # a run of zeros shares the label of the nonzero value before it.
    counts = (sr != 0).cumsum().value_counts()
    # The most frequent label marks the longest run; subtract the leading
    # nonzero unless the run starts at the head of the series (label 0).
    return counts.max() - (0 if counts.idxmax() == 0 else 1)

max0(pd.Series([1, 0, 0, 0, 0, 2, 3]))
4
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1], ['C', 2]],
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A 2
B 1
C 0
dtype: int64
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23

Pandas convert positive number to 1 and negative number to -1

I have a column of positive and negative numbers. How can I convert this column to a new column where positive numbers become 1 and negative numbers become -1?
You need numpy.sign
df['new'] = np.sign(df['col'])
Sample:
df = pd.DataFrame({ 'col':[-1,3,-5,7,1,0]})
df['new'] = np.sign(df['col'])
print (df)
col new
0 -1 -1
1 3 1
2 -5 -1
3 7 1
4 1 1
5 0 0
It's also easy to perform this task with boolean indexing.
For the whole data frame:
df[df < 0] = -1
df[df > 0] = 1
For a specific column (using .loc to avoid chained-assignment issues):
df.loc[df['column_name'] < 0, 'column_name'] = -1
df.loc[df['column_name'] > 0, 'column_name'] = 1
Note that df[df < 0] = -1 and df[df > 0] = 1 define no behaviour for df == 0, so zeros are left unchanged.

Python numpy zeros array being assigned 1 for every value when only one index is updated

The following is my code:
from sklearn import metrics
from sklearn.svm import SVC

amount_features = X.shape[1]
best_features = np.zeros((amount_features,), dtype=int)
best_accuracy = 0
best_accuracy_index = 0

def find_best_features(best_features, best_accuracy):
    for i in range(amount_features):
        trial_features = best_features
        trial_features[i] = 1
        svc = SVC(C=10, gamma=.1)
        svc.fit(X_train[:, trial_features == 1], y_train)
        y_pred = svc.predict(X_test[:, trial_features == 1])
        accuracy = metrics.accuracy_score(y_test, y_pred)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_accuracy_index = i
    print(best_accuracy_index)
    best_features[best_accuracy_index] = 1
    return best_features, best_accuracy

bf, ba = find_best_features(best_features, best_accuracy)
print(bf, ba)
And this is my output:
25
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] 0.865853658537
And my expected output:
25
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0] 0.865853658537
I am trying to update the zeros array at the index that gives the highest accuracy. As you can see, it should be index 25, and I follow that by setting index 25 of my array to 1. However, when I print the array, every index has been updated to 1.
Not sure what the mishap is. Thanks for spending your limited time on Earth to help me.
Change trial_features = best_features to trial_features = numpy.copy(best_features). The reasoning behind the change is already given by @Michael Butscher: the plain assignment binds a second name to the same array, so writes through trial_features also modify best_features.
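A minimal standalone sketch of the aliasing behaviour, with a hypothetical small array:
import numpy as np

a = np.zeros(3, dtype=int)
b = a              # b is another name for the same array, not a copy
b[0] = 1
print(a)           # [1 0 0] -- a changed too

c = np.copy(a)     # an independent copy
c[1] = 1
print(a)           # [1 0 0] -- unchanged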
