a b
0 100 90
1 30 117
2 90 99
3 200 94
I want to create a new df["c"] with the following conditions:
If a > 50 and b is within (a ± 0.5a), then c = a
If a > 50 and b is outside (a ± 0.5a), then c = b
If a <= 50, then c = a
Output should be:
a b c
0 100 90 100
1 30 117 30
2 90 99 90
3 200 94 94
I've tried:
df['c'] = np.where(df.eval("0.5 * a <= b <= 1.5 * a"), df.a, df.b)
But I don't know how to include the last condition (if a <= 50, then c = a) in this expression.
You're almost there; you just need to add an or clause inside your eval string:
np.where(df.eval("(0.5 * a <= b <= 1.5 * a) or (a <= 50)"), df.a, df.b)
# note the added "or (a <= 50)" clause
array([100, 30, 90, 94])
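For completeness, here is a minimal end-to-end sketch of the full assignment (assuming the example frame from the question and the usual numpy/pandas imports):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [100, 30, 90, 200], 'b': [90, 117, 99, 94]})

# keep a when b lies inside a ± 0.5a or when a <= 50, otherwise take b
df['c'] = np.where(df.eval("(0.5 * a <= b <= 1.5 * a) or (a <= 50)"), df.a, df.b)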
I have a table like this
AREA AMOUNT
A 1000
A 10
B 30
B 3000
C 22
D 300
What I want to get is more than 100 in AREA A, more than 100 in AREA B, less than 100 in AREA C, and more than 100 in AREA D. I have many of these kinds of areas to analyse.
So what I want to get is below:
AREA AMOUNT
A 1000
B 3000
C 22
D 300
You can use .isin() for the three areas that need > 100 and == for just the C area, combining the conditions with & and | for and and or. Pay attention to the parentheses here:
df = df[((df['AREA'].isin(['A','B','D'])) & (df['AMOUNT'] > 100)) |
        ((df['AREA'] == 'C') & (df['AMOUNT'] < 100))]
df
Out[1]:
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300
You can also write it this way, by creating a custom function to set up the condition:
import operator
ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
g = lambda x, y, z: (df['AREA'].eq(x)) & (ops[z](df['AMOUNT'], y))
df[g('A', 100, 'gt') | g('B', 100, 'gt') | g('C', 100, 'lt') | g('D', 100, 'gt')]
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300
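Since the question mentions having many areas to analyse, one way to scale the same idea (a sketch, using a hypothetical rules mapping from area to comparison and threshold) is to build the mask from a dict and reduce it with |:
import operator
from functools import reduce

import pandas as pd

df = pd.DataFrame({'AREA': ['A', 'A', 'B', 'B', 'C', 'D'],
                   'AMOUNT': [1000, 10, 30, 3000, 22, 300]})

# hypothetical per-area rules: AREA -> (comparison, threshold)
rules = {'A': (operator.gt, 100),
         'B': (operator.gt, 100),
         'C': (operator.lt, 100),
         'D': (operator.gt, 100)}

mask = reduce(operator.or_,
              (df['AREA'].eq(area) & op(df['AMOUNT'], thr)
               for area, (op, thr) in rules.items()))
df[mask]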
I need to find all the combinations of rows where multiple conditions are met.
I tried to use the powerset recipe from itertools and the answer here by adding multiple conditions but can't seem to get the conditions to work properly.
The code I've come up with is:
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
df_groups = pd.concat(
    [data.reindex(l).assign(Group = n) for n, l in
     enumerate(powerset(data.index))
     if ((data.loc[l, 'Account'] == 'COS').any() & (data.loc[l, 'Amount'].sum() >= 100)
         & (data.loc[l, 'Account'] == 'Rev').any() & (data.loc[l, 'Amount'].sum() >= 150)
         & (data.loc[l, 'Account'] == 'Inv').any() and (data.loc[l, 'Amount'].sum() >= 60))])
What I'm trying to do above is find only those combinations where the following thresholds are met or exceeded:
Account Amount
COS 150
Rev 100
Inv 60
Sample data:
Entity Account Amount Location
A10 Rev 60 A
B01 Rev 90 B
C11 Rev 80 C
B01 COS 90 B
C11 COS 80 C
A10 Inv 60 A
Apologies in advance for the poor question-writing etiquette; it's the first time I haven't been able to find an answer on Stack Overflow and have had to ask a question.
Also, I'm aware that this will get very slow as len(data) increases, so any suggestions on that end are also greatly appreciated.
Let's start by creating the dataframe that the OP mentions in the question:
import pandas as pd

df = pd.DataFrame({'Entity': ['A10', 'B01', 'C11', 'B01', 'C11', 'A10'],
                   'Account': ['Rev', 'Rev', 'Rev', 'COS', 'COS', 'Inv'],
                   'Amount': [60, 90, 80, 90, 80, 60],
                   'Location': ['A', 'B', 'C', 'B', 'C', 'A']})
[Out]:
Entity Account Amount Location
0 A10 Rev 60 A
1 B01 Rev 90 B
2 C11 Rev 80 C
3 B01 COS 90 B
4 C11 COS 80 C
5 A10 Inv 60 A
Then, in order to achieve the OP's goal of filtering based on specific constraints, one can do this with a one-liner combining pandas.concat and pandas.DataFrame.query, as follows:
df_new = pd.concat([df[df['Account'] == 'Rev'].query('Amount <= 100'), df[df['Account'] == 'COS'].query('Amount <= 150'), df[df['Account'] == 'Inv'].query('Amount <= 60')])
[Out]:
Entity Account Amount Location
0 A10 Rev 60 A
1 B01 Rev 90 B
2 C11 Rev 80 C
3 B01 COS 90 B
4 C11 COS 80 C
5 A10 Inv 60 A
As the sample dataframe doesn't give a clear picture of whether it is working or not, let us create a new random dataframe for testing purposes.
import numpy as np
df = pd.DataFrame({'Entity': np.random.choice(['A10', 'B01', 'C11', 'B01', 'C11', 'A10'], 1000),
                   'Account': np.random.choice(['Rev', 'COS', 'Inv'], 1000),
                   'Amount': np.random.randint(0, 1000, 1000),
                   'Location': np.random.choice(['A', 'B', 'C'], 1000)})
[Out]:
Entity Account Amount Location
0 B01 Rev 497 A
1 B01 Rev 52 C
2 B01 Rev 42 C
3 B01 Rev 285 B
4 A10 COS 714 B
5 A10 Rev 288 B
6 B01 Rev 396 B
7 A10 Inv 277 B
8 C11 Inv 435 C
9 C11 COS 228 C
If one runs the one-liner on that newly created dataframe, one gets the following
df_new = pd.concat([df[df['Account'] == 'Rev'].query('Amount <= 100'), df[df['Account'] == 'COS'].query('Amount <= 150'), df[df['Account'] == 'Inv'].query('Amount <= 60')])
[Out]:
Entity Account Amount Location
1 B01 Rev 52 C
2 B01 Rev 42 C
21 B01 Rev 1 A
31 C11 Rev 38 A
47 A10 Rev 83 C
60 B01 Rev 41 C
156 B01 Rev 81 C
197 C11 Rev 61 C
206 C11 Rev 90 A
224 C11 Rev 23 B
which, judging from the sample shown, does satisfy the requirements.
There are additional ways to solve this.
Another example is using pandas.DataFrame.apply and a lambda function as follows
df_new = df[df.apply(lambda x: x['Amount'] <= 100 if x['Account'] == 'Rev' else x['Amount'] <= 150 if x['Account'] == 'COS' else x['Amount'] <= 60, axis=1)]
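A vectorized alternative to the row-wise apply (a sketch that assumes the same per-account limits as above, collected in a hypothetical limits dict) maps each account to its limit and compares once:
import pandas as pd

limits = {'Rev': 100, 'COS': 150, 'Inv': 60}  # hypothetical mapping of Account -> limit
df_new = df[df['Amount'] <= df['Account'].map(limits)]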
I wish to select only the rows that have observations across multiple years. For example, suppose
mlIndx = pd.MultiIndex.from_tuples([('x', 0,),('x',1),('z', 0), ('y', 1),('t', 0),('t', 1)])
df = pd.DataFrame(np.random.randint(0,100,(6,2)), columns = ['a','b'], index=mlIndx)
In [18]: df
Out[18]:
a b
x 0 6 1
1 63 88
z 0 69 54
y 1 27 27
t 0 98 12
1 69 31
My desired output is
Out[19]:
a b
x 0 6 1
1 63 88
t 0 98 12
1 69 31
My current solution is blunt, so something that can scale up more easily would be great. You can assume a sorted index.
df.reset_index(level=0, inplace=True)
df[df.level_0.duplicated() | df.level_0.duplicated(keep='last')]
Out[30]:
level_0 a b
0 x 6 1
1 x 63 88
0 t 98 12
1 t 69 31
You can figure this out with groupby (on the first level of the index) + transform, and then use boolean indexing to filter out those rows:
df[df.groupby(level=0).a.transform('size').gt(1)]
a b
x 0 67 83
1 2 34
t 0 18 87
1 63 20
Details
Output of the groupby -
df.groupby(level=0).a.transform('size')
x 0 2
1 2
z 0 1
y 1 1
t 0 2
1 2
Name: a, dtype: int64
Filtering from here is straightforward, just find those rows with size > 1.
Use the groupby filter
You can pass a function that returns a boolean to it:
df.groupby(level=0).filter(lambda x: len(x) > 1)
a b
x 0 7 33
1 31 43
t 0 71 18
1 68 72
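A closely related alternative, not shown in the answers above (a sketch assuming the same two-level index), marks every first-level value that occurs more than once with Index.duplicated(keep=False):
# keep rows whose first index level appears at least twice
df[df.index.get_level_values(0).duplicated(keep=False)]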
I've spent my fair share of time focused on speed. Not all solutions need to be the fastest. However, since the subject has come up, I'll offer what I think should be a fast solution. It is my intent to keep future readers informed.
Results of Time Test
res.plot(loglog=True)
res.div(res.min(1), 0).T
10 30 100 300 1000 3000
cs 4.425970 4.643234 5.422120 3.768960 3.912819 3.937120
wen 2.617455 4.288538 6.694974 18.489803 57.416648 148.860403
jp 6.644870 21.444406 67.315362 208.024627 569.421257 1525.943062
pir 6.043569 10.358355 26.099766 63.531397 165.032540 404.254033
pir_pd_factorize 1.153351 1.132094 1.141539 1.191434 1.000000 1.000000
pir_np_unique 1.058743 1.000000 1.000000 1.000000 1.021489 1.188738
pir_best_of 1.000000 1.006871 1.030610 1.086425 1.068483 1.025837
Simulation Details
from timeit import timeit

import numpy as np
import pandas as pd

def pir_pd_factorize(df):
    f, u = pd.factorize(df.index.get_level_values(0))
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_np_unique(df):
    u, f = np.unique(df.index.get_level_values(0), return_inverse=True)
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_best_of(df):
    if len(df) > 1000:
        return pir_pd_factorize(df)
    else:
        return pir_np_unique(df)

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000],
    columns='cs wen jp pir pir_pd_factorize pir_np_unique pir_best_of'.split(),
    dtype=float
)

np.random.seed([3, 1415])
for i in res.index:
    d = pd.DataFrame(
        dict(a=range(i)),
        pd.MultiIndex.from_arrays([
            np.random.randint(i // 4 * 3, size=i),
            range(i)
        ])
    )
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)
Just a new way
s=df.a.count(level=0)
df.loc[s[s>1].index.tolist()]
Out[12]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
And if you want to keep using duplicated:
s=df.index.get_level_values(level=0)
df.loc[s[s.duplicated()].tolist()]
Out[18]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
I'm not convinced groupby is necessary:
df = df.sort_index()
df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
# a b
# x 0 16 3
# 1 97 36
# t 0 9 18
# 1 37 30
Some benchmarking:
df = pd.concat([df]*10000).sort_index()
def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
%timeit cs(df) # 19.5ms
%timeit pir(df) # 33.8ms
%timeit wen(df) # 17.0ms
%timeit jp(df) # 22.3ms
In my problem I have two dataframes, mydataframe1 and mydataframe2, as below.
mydataframe1
Out[13]:
Start End Remove
50 60 1
61 105 0
106 150 1
151 160 0
161 180 1
181 200 0
201 400 1
mydataframe2
Out[14]:
Start End
55 100
105 140
151 154
155 185
220 240
From mydataframe2 I would like to remove the rows whose Start-End interval is contained (even partially) in any of the Remove = 1 intervals of mydataframe1. In other words, there should not be any intersection between the intervals of mydataframe2 and any of the Remove = 1 intervals in mydataframe1.
In this case mydataframe2 becomes:
mydataframe2
Out[15]:
Start End
151 154
You could use pd.IntervalIndex for intersections
Get rows to be removed
In [313]: dfr = df1.query('Remove == 1')
Construct an IntervalIndex from the ranges to be removed
In [314]: s1 = pd.IntervalIndex.from_arrays(dfr.Start, dfr.End, 'both')
Construct an IntervalIndex from the ranges to be tested
In [315]: s2 = pd.IntervalIndex.from_arrays(df2.Start, df2.End, 'both')
Select the rows of df2 whose intervals are not in the s1 ranges
In [316]: df2.loc[[x not in s1 for x in s2]]
Out[316]:
Start End
2 151 154
Details
In [320]: df1
Out[320]:
Start End Remove
0 50 60 1
1 61 105 0
2 106 150 1
3 151 160 0
4 161 180 1
5 181 200 0
6 201 400 1
In [321]: df2
Out[321]:
Start End
0 55 100
1 105 140
2 151 154
3 155 185
4 220 240
In [322]: dfr
Out[322]:
Start End Remove
0 50 60 1
2 106 150 1
4 161 180 1
6 201 400 1
IntervalIndex details
In [323]: s1
Out[323]:
IntervalIndex([[50, 60], [106, 150], [161, 180], [201, 400]]
closed='both',
dtype='interval[int64]')
In [324]: s2
Out[324]:
IntervalIndex([[55, 100], [105, 140], [151, 154], [155, 185], [220, 240]]
closed='both',
dtype='interval[int64]')
In [326]: [x not in s1 for x in s2]
Out[326]: [False, False, True, False, False]
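On newer pandas versions, IntervalIndex.overlaps can express the same intersection test directly; a sketch reusing s1 and s2 from above (assumes IntervalIndex.overlaps is available, pandas 0.24+):
import numpy as np

# keep a row of df2 only if its interval overlaps none of the Remove == 1 intervals
mask = np.array([not s1.overlaps(iv).any() for iv in s2])
df2.loc[mask]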
We can use the medial- or length-oriented tree overlap test:
In [143]: d1 = d1.assign(s=d1.Start+d1.End, d=d1.End-d1.Start)
In [144]: d2 = d2.assign(s=d2.Start+d2.End, d=d2.End-d2.Start)
In [145]: d1
Out[145]:
Start End Remove d s
0 50 60 1 10 110
1 61 105 0 44 166
2 106 150 1 44 256
3 151 160 0 9 311
4 161 180 1 19 341
5 181 200 0 19 381
6 201 400 1 199 601
In [146]: d2
Out[146]:
Start End d s
0 55 100 45 155
1 105 140 35 245
2 151 154 3 305
3 155 185 30 340
4 220 240 20 460
now we can check for overlapping intervals and filter:
In [148]: d2[~d2[['s','d']]\
...: .apply(lambda x: ((d1.loc[d1.Remove==1, 's'] - x.s).abs() <
...: d1.loc[d1.Remove==1, 'd'] +x.d).any(),
...: axis=1)]\
...: .drop(['s','d'], 1)
...:
Out[148]:
Start End
2 151 154
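As a quick sanity check of the |s1 - s2| < d1 + d2 overlap test (a worked example, not part of the original answer): for d2's interval [55, 100] against d1's [50, 60], s = 155 vs 110 and d = 45 vs 10, so |155 - 110| = 45 < 45 + 10 = 55 and the overlap is flagged; for [151, 154] against the same interval, |305 - 110| = 195 is not less than 3 + 10 = 13, so no overlap is flagged.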
I think that this should work:
mydataframe2[mydataframe2.Start.isin(mydataframe1[mydataframe1.Remove != 0].Start)]
Breaking it down:
# This filter will remove anything which has Remove not 0
filter_non_remove = mydataframe1.Remove != 0
# This provides a valid Sequence of Start values
valid_starts = mydataframe1[mydataframe1.Remove != 0].Start
# Another filter, that checks whether the Start
# value is in the valid_starts Sequence
is_df2_valid = mydataframe2.Start.isin(valid_starts)
# Final applied filter
output = mydataframe2[is_df2_valid]
You can get all the unique range values from the rows marked Remove = 1, then evaluate whether the Start and End values in mydataframe2 fall within any of those range values. The first part defines all the unique values falling within the Start/End intervals where Remove = 1.
start_end_remove = mydataframe1[mydataframe1['Remove'] == 1][['Start', 'End']].values
remove_ranges = set([])
for x in start_end_remove:
    remove_ranges.update(np.arange(x[0], x[1] + 1))
Next you can evaluate mydataframe2 against the unique set of range values. If the Start/End values of mydataframe2 fall in that range of values, the rows are removed from the dataframe by flagging them in a new column. A function is defined to check whether there is overlap between any of the ranges; that function is then applied to each row in mydataframe2, and the rows where the ranges do overlap are removed.
def evaluate_in_range(x, remove_ranges):
    s = x[0]
    e = x[1]
    eval_range = set(np.arange(s, e + 1))
    if len(eval_range.intersection(remove_ranges)) > 0:
        return 1
    else:
        return 0
mydataframe2['Remove'] = mydataframe2[['Start', 'End']].apply(lambda x: evaluate_in_range(x, remove_ranges), axis=1)
mydataframe2.drop(mydataframe2[mydataframe2['Remove'] == 1].index, inplace=True)
How about this:
mydataframe1['key']=1
mydataframe2['key']=1
df3 = mydataframe2.merge(mydataframe1, on="key")
df3['s_gt_s'] = df3.Start_y > df3.Start_x
df3['s_lt_e'] = df3.Start_y < df3.End_x
df3['e_gt_s'] = df3.End_y > df3.Start_x
df3['e_lt_e'] = df3.End_y < df3.End_x
df3['s_in'] = df3.s_gt_s & df3.s_lt_e
df3['e_in'] = df3.e_gt_s & df3.e_lt_e
df3['overlaps'] = df3.s_in | df3.e_in
my_new_dataframe = df3[df3.overlaps & df3.Remove==1][['End_x','Start_x']].drop_duplicates()
I have a data frame like this that I want to apply the diff function on:
test = pd.DataFrame({ 'Observation' : ['0','1','2',
                                       '3','4','5',
                                       '6','7','8'],
                      'Value' : [30,60,170,-170,-130,-60,-30,10,20]
                    })
Observation Value
0 30
1 60
2 170
3 -170
4 -130
5 -60
6 -30
7 10
8 20
The column 'Value' is in degrees. So, the difference between -170 and 170 should be 20, not -340. In other words, when d2*d1 < 0, instead of d2-d1, I'd like to get 360-(abs(d1)+abs(d2))
Here's what I've tried. But then I don't know how to continue it without using a for loop:
test['Value_diff_1st_attempt'] = test['Value'].diff(1)
test['sign_temp'] = test['Value'].shift()
test['Sign'] = np.sign(test['Value']*test['sign_temp'])
Here's what the result should look like:
Observation Value Delta_Value
0 30 NAN
1 60 30
2 170 110
3 -170 20
4 -130 40
5 -60 70
6 -30 30
7 10 40
8 20 10
Eventually I'd like to get just the magnitude of differences all in positive values. Thanks.
Update: So, the values are derived from the math.atan2 function. The values are in the range 0 < theta < 180 or -180 < theta < 0. The problem arises when we are dealing with a change of direction from 170 (upper left corner) to -170 (lower left corner), for example, where the change is really just 20 degrees. However, when we go from -30 (lower right corner) to 10 (upper right corner), the change is really 40 degrees. I hope I explained it well.
I believe this should work (I took the definition from @JasonD's answer):
test["Value"].rolling(2).apply(lambda x: 180 - abs(abs(x[0] - x[1]) - 180))
Out[45]:
0 NaN
1 30.0
2 110.0
3 20.0
4 40.0
5 70.0
6 30.0
7 40.0
8 10.0
Name: Value, dtype: float64
How it works:
Based on your question, the two angles a and b are between -180 and 180. For 0 < d < 180 I will write d < 180 and for -180 < d < 0 I will write d < 0. There are four possibilities:
a < 180, b < 180 -> the result is simply |a - b|. And since |a - b| cannot be greater than 180, the formula simplifies to a - b if a > b and b - a if b > a.
a < 0, b < 0 -> The same logic applies here. Both are negative and their absolute difference cannot be greater than 180. The result will be |a - b|.
a < 180, b < 0 -> a - b will be greater than 0 for sure. For the cases where |a - b| > 180, we should look at the other angle, and this translates to 360 - |a - b|.
a < 0, b < 180 -> again, similar to the above. If the absolute difference is greater than 180, calculate 360 - the absolute difference.
For the pandas part: rolling(n) creates windows of size n. For n = 2: (row 0, row 1), (row 1, row 2), ... With apply, you apply that formula to every rolling pair, where x[0] is the first element (a) and x[1] is the second element (b).
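For reference, the same wrap-around formula can be applied without rolling by working on the consecutive differences directly; a sketch, assuming the test frame from the question and that the result should land in a Delta_Value column:
# 180 - | |a - b| - 180 |, applied row to row via diff()
test['Delta_Value'] = 180 - (test['Value'].diff().abs() - 180).abs()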