I hv the following dataframe:
A B C D E F
100 0 0 0 100 0
0 100 0 0 0 100
-100 0 0 0 100 0
and this code:
cond = [
(df['A'] == 100),
(df['A'] == -100),
(df['B'] == 100),
(df['C'] == 100),
(df['D'] == 100),
(df['E'] == 100),
(df['F'] == 100),
]
choices = ['A','neg_A', 'B', 'C','D', 'E', 'F']
df['result'] = np.select(cond, choices)
For both rows there will be two results but I want only one to be selected. I want the selection to be made with this criteria:
+A = 67%
-A = 68%
B = 70%
C = 75%
D = 66%
E = 54%
F = 98%
Percentage shows accuracy rate so i would want the one with highest percentage to be preferred over the other.
Intended result:
A B C D E F result
100 0 0 0 100 0 A
0 100 0 0 0 100 F
-100 0 0 0 100 0 neg_A
Little help will be appreciated. THANKS!
EDIT:
Some of the columns (like A) may have a mix of 100 and -100. Positive 100 will yield a simple A (see row 1) but a -100 should yield some other name like "neg_A" in the result (see row 3).
Let's sort the columns of dataframe based on the priority values then use .eq + .idxmax on axis=1 to get the column name with first occurrence of 100:
# define a dict with col names and priority values
d = {'A': .67, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}
df['result'] = df[sorted(d, key=lambda x: -d[x])].eq(100).idxmax(axis=1)
A B C D E F result
0 100 0 0 0 100 0 A
1 0 100 0 0 0 100 F
Related
Let's label a dataframe with two columns, A,B, and 100M rows. Starting at the index i, we want to know if the data in column B is trending down or trending up comparing to the data at [i, 'A'].
Here is a loop:
import pandas as pd
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
for i in range (0,5):
j = i
while j in range (i,i+5) and df.at[i,'label'] == 0: #if classfied, no need to continue
if df.at[j,'B']-df.at[i,'A']>= 10:
df.at[i,'label'] = 1 #Label 1 means trending up
if df.at[j,'B']-df.at[i,'A']<= -10:
df.at[i,'label'] = 2 #Label 2 means trending down
j=j+1
[out]
A B label
0 1 1
1 10 2
2 -10 2
3 2 0
5 3 0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping on Dataframe is slow compared to using Pandas methods.
The task can be accomplished using Pandas vectorized methods:
rolling method which does computations in a rolling window
min & max methods which we compute in the rolling window
where method DataFrame where allows us to set values based upon logic
Code
def set_trend(df, threshold = 10, window_size = 2):
'''
Use rolling_window to find max/min values in a window from the current point
rolling window normally looks at backward values
We use technique from https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
to look at forward values
'''
# To have a rolling window on lookahead values in column B
# We reverse values in column B
df['B_rev'] = df["B"].values[::-1]
# Max & Min in B_rev, then reverse order of these max/min
# https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
df['max_'] = df.B_rev.rolling(window_size, min_periods = 0).max().values[::-1]
df['min_'] = df.B_rev.rolling(window_size, min_periods = 0).min().values[::-1]
nrows = df.shape[0] - 1 # adjustment for argmax & armin indexes since rows are in reverse order
# i.e. idx = nrows - x.argmax() give index for max in non-reverse row
df['max_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmax(), raw = True).values[::-1]
df['min_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmin(), raw = True).values[::-1]
# Use np.select to implement label assignment logic
conditions = [
(df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']), # max above & comes first
(df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']), # min below & comes first
df['max_'] - df["A"] >= threshold, # max above threshold but didn't come first
df['min_'] - df["A"] <= -threshold, # min below threshold but didn't come first
]
choices = [
1, # max above & came first
2, # min above & came first
1, # max above threshold
2, # min above threshold
]
df['label'] = np.select(conditions, choices, default = 0)
# Drop scratch computation columns
df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis = 1, inplace = True)
return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
I have the following dataframe:
advblock belthld takuri doji
stock A 100 0 0 0
stock B -100 0 0 0
stock C 0 100 100 0
stock D -100 100 0 -100
stock E 0 0 100 100
I want a new column that would store the name of pattern formed by each stock.
100 = pattern formed, -100 = inverse pattern formed. Note only some of the patterns can have 100 and -100 both, like advblock and doji
Here is how I decided to do it:
cond = [
(df['advblock'] == 100),
(df['advblock'] == -100),
(df['belthld'] == 100),
(df['takuri'] == 100),
(df['doji'] == 100),
(df['doji'] == -100),
]
choices = ['Advance Block', 'Inv Advance Block','Belthold','Takuri','Doji','Inv Doji']
df['result'] = np.select(cond, choices)
This works fine until stock B where only one pattern qualifies per stock. However, in stock C, D, and E more than one pattern appears so I want the most accurate one to show up in results. Here is the accuracy list:
(+)advblock: 68%
(-)advblock: 71%
belthold: 56%
takuri: 70%
(+)doji: 66%
(-)doji: 73%
I would want numpy select to consider that list when showing up results where the one with highest accuracy should be preferred over others.
I could have done something like this but it doesn't give me the liberty to change my column names and account for the inverse patterns:
d = {'advblock': .68, 'belthold': .56, 'takuri': .70, 'doji': .66}
df['result'] = df[sorted(d, key=lambda x: -d[x])].abs.eq(100).idxmax(axis=1)
Final intended result:
advblock belthld takuri doji result
stock A 100 0 0 0 Advance Block
stock B -100 0 0 0 Inv Advance Block
stock C 0 100 100 0 Takuri
stock D -100 100 0 -100 Inv Doji
stock E 0 0 100 100 Takuri
Little help will be appreciated, THANKS!
Let us do melt make the logic more obvious
s = df.reset_index().melt('index')
mappingdf=pd.DataFrame({'variable':['advblock','advblock','belthld','takuri','doji','doji'],
'value':[100,-100,100,100,100,-100],
'choice':['Advance Block', 'Inv Advance Block','Belthold','Takuri','Doji','Inv Doji'],
'acc' : [68,71,56,70,66,73]})
df['result'] = s.merge(mappingdf).sort_values('acc',ascending=False).drop_duplicates('index').set_index('index')['choice'].reindex(df.index).values
df
takuri advblock doji belthld result
stock A 0 100 0 0 Advance Block
stock B 0 -100 0 0 Inv Advance Block
stock C 100 0 0 100 Takuri
stock D 0 -100 -100 100 Inv Doji
stock E 100 0 100 0 Takuri
this is my first question at stackoverflow.
I have two dataframes of different sizes df1(266808 rows) and df2 (201 rows).
df1
and
df2
I want to append the count of each value/number in df1['WS_140m'] to df2['count'] if number falls in a class interval given in df2['Class_interval'].
I have tried
1)
df2['count']=pd.cut(x=df1['WS_140m'], bins=df2['Class_interval'])
2)
df2['count'] = df1['WS_140m'].groupby(df1['Class_interval'])
3)
for anum in df1['WS_140m']:
if anum in df2['Class_interval']:
df2['count'] = df2['count'] + 1
Please guide, if someone knows.
Please try something like:
def in_class_interval(value, interval):
#TODO
def in_class_interval_closure(interval):
return lambda x: in_class_interval(x, interval)
df2['count'] = df2['Class_interval']
.apply(lambda x: df1[in_class_interval_closure(x)(df1['WS_140m'])].size,axis=1)
Define your function in_class_interval(value, interval), which returns boolean.
I guess something like this would do it:
In [330]: df1
Out[330]:
WS_140m
0 5.10
1 5.16
2 5.98
3 5.58
4 4.81
In [445]: df2
Out[445]:
count Class_interval
0 0 NaN
1 0 (0.05,0.15]
2 0 (0.15,0.25]
3 0 (0.25,0.35]
4 0 (3.95,5.15]
In [446]: df2.Class_interval = df2.Class_interval.str.replace(']', ')')
In [451]: from ast import literal_eval
In [449]: for i, v in df2.Class_interval.iteritems():
...: if pd.notnull(v):
...: df2.at[i, 'Class_interval'] = literal_eval(df2.Class_interval[i])
In [342]: df2['falls_in_range'] = df1.WS_140m.between(df2.Class_interval.str[0], df2.Class_interval.str[1])
You can increase the count wherever True comes like below :
In [360]: df2['count'] = df2.loc[df2.index[df2['falls_in_range'] == True].tolist()]['count'] +1
In [361]: df2
Out[361]:
count Class_interval falls_in_range
0 NaN NaN False
1 NaN (0.05, 0.15) False
2 NaN (0.15, 0.25) False
3 NaN (0.25, 0.35) False
4 1.0 (3.95, 5.15) True
starting by another my question I've done yesterday Pandas set value if all columns are equal in a dataframe
Starting by #anky_91 solution I'm working on something similar.
Instead of put 1 or -1 if all columns are equals I want something more flexible.
In fact I want 1 if (for example) the 70% percentage of the columns are 1, -1 for the same but inverse condition and 0 else.
So this is what I've wrote:
# Instead of using .all I use .sum to count the occurence of 1 and 0 for each row
m1 = local_df.eq(1).sum(axis=1)
m2 = local_df.eq(0).sum(axis=1)
# Debug print, it work
print(m1)
print(m2)
But I don't know how to change this part:
local_df['enseamble'] = np.select([m1, m2], [1, -1], 0)
m = local_df.drop(local_df.columns.difference(['enseamble']), axis=1)
I write in pseudo code what I want:
tot = m1 + m2
if m1 > m2
if(m1 * 100) / tot > 0.7 # simple percentage calculus
df['enseamble'] = 1
else if m2 > m1
if(m2 * 100) / tot > 0.7 # simple percentage calculus
df['enseamble'] = -1
else:
df['enseamble'] = 0
Thanks
Edit 1
This is an example of expected output:
NET_0 NET_1 NET_2 NET_3 NET_4 NET_5 NET_6
date
2009-08-02 0 1 1 1 0 1
2009-08-03 1 0 0 0 1 0
2009-08-04 1 1 1 0 0 0
date enseamble
2009-08-02 1 # because 1 is more than 70%
2009-08-03 -1 # because 0 is more than 70%
2009-08-04 0 # because 0 and 1 are 50-50
You could obtain the specified output from the following conditions:
thr = 0.7
c1 = (df.eq(1).sum(1)/df.shape[1]).gt(thr)
c2 = (df.eq(0).sum(1)/df.shape[1]).gt(thr)
c2.astype(int).mul(-1).add(c1)
Output
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
dtype: int64
Or using np.select:
pd.DataFrame(np.select([c1,c2], [1,-1], 0), index=df.index, columns=['result'])
result
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
Try with (m1 , m2 and tot are same as what you have):
cond1=(m1>m2)&((m1 * 100/tot).gt(0.7))
cond2=(m2>m1)&((m2 * 100/tot).gt(0.7))
df['enseamble'] =np.select([cond1,cond2],[1,-1],0)
m =df.drop(df.columns.difference(['enseamble']), axis=1)
print(m)
enseamble
date
2009-08-02 1
2009-08-03 -1
2009-08-04 0
I used the code below to map the 2 values inside S column to 0 but it didn't work. Any suggestion on how to solve this?
N.B : I want to implement an external function inside the map.
df = pd.DataFrame({
'Age': [30,40,50,60,70,80],
'Sex': ['F','M','M','F','M','F'],
'S' : [1,1,2,2,1,2]
})
def app(value):
for n in df['S']:
if n == 1:
return 1
if n == 2:
return 0
df["S"] = df.S.map(app)
Use eq to create a boolean series and conver that boolean series to int with astype:
df['S'] = df['S'].eq(1).astype(int)
OR
df['S'] = (df['S'] == 1).astype(int)
Output:
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
Don't use apply, simply use loc to assign the values:
df.loc[df.S.eq(2), 'S'] = 0
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you need a more performant option, use np.select. This is also more scalable, as you can always add more conditions:
df['S'] = np.select([df.S.eq(2)], [0], 1)
You're close but you need a few corrections. Since you want to use a function, remove the for loop and replace n with value. Additionally, use apply instead of map. Apply operates on the entire column at once. See this answer for how to properly use apply vs applymap vs map
def app(value):
if value == 1:
return 1
elif value == 2:
return 0
df['S'] = df.S.apply(app)
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you only wish to change values equal to 2, you can use pd.DataFrame.loc:
df.loc[df['S'] == 0, 'S'] = 0
pd.Series.apply is not recommend and this is just a thinly veiled, inefficient loop.
You could use .replace as follows:
df["S"] = df["S"].replace([2], 0)
This will replace all of 2 values to 0 in one line
Go with vectorize numpy operation:
df['S'] = np.abs(df['S'] - 2)
and stand yourself out from competitions in interviews and SO answers :)
>>>df = pd.DataFrame({'Age':[30,40,50,60,70,80],'Sex':
['F','M','M','F','M','F'],'S':
[1,1,2,2,1,2]})
>>> def app(value):
return 1 if value == 1 else 0
# or app = lambda value : 1 if value == 1 else 0
>>> df["S"] = df["S"].map(app)
>>> df
Age S Sex
Age S Sex
0 30 1 F
1 40 1 M
2 50 0 M
3 60 0 F
4 70 1 M
5 80 0 F
You can do:
import numpy as np
df['S'] = np.where(df['S'] == 2, 0, df['S'])