I'm trying to avoid for loops when applying a function row by row to a pandas df. I have looked at many vectorization examples but have not come across anything that works completely. Ultimately I am trying to add a column to the df containing, for each row, the sum of the point values from whichever conditions pass.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points

points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient, but I am not sure how to vectorize my code, since it needs the values of several columns in each row to compute the custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed by replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
UPDATE2:
I updated the function as follows and substantially decreased the run time, roughly a 10x speed increase over df.apply with the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc. for the remaining conditions
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
You can try it in the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like this:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like this:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want.
The boolean condition within the brackets selects the rows where the condition is true, so -= and += are applied only to those rows:
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the series as a numpy array if you want (optional):
point_list = points.values
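The same pattern scales to the 25 conditions from the question. A minimal sketch, assuming the OP's 20-column frame and reusing the positional column pairs from the original point_calc:

points = pd.Series(0, index=df.index)
points.loc[df.iloc[:, 2] >= df.iloc[:, 13]] += 1   # row[2] >= row[13]
points.loc[df.iloc[:, 2] < 0] -= 3                 # row[2] < 0
points.loc[df.iloc[:, 4] >= df.iloc[:, 8]] += 2    # row[4] >= row[8]
points.loc[df.iloc[:, 4] < df.iloc[:, 12]] += 1    # row[4] < row[12]
points.loc[df.iloc[:, 16] == df.iloc[:, 18]] += 4  # row[16] == row[18]
df['points'] = points

Each condition becomes a one-liner that touches only the masked rows, so all 25 checks run without a per-row Python loop.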
Does this solve your problem?
I want to create a dataframe which has 3 columns:
cols = ('ID', 'Y_Start','X_Start')
I got this far with the help of Prune's answer:
stepsminus = -0.0009009009
steps = 0.0009009009

List1 = []  # 35 values
for i in np.arange(48.34, 48.309, stepsminus):
    List1.append(i)

List2 = []  # 100 values
for i in np.arange(16.0108, 16.1, steps):
    List2.append(i)

df = pd.DataFrame(columns=cols)
df['ID'] = list(range(1, 3501))
Now I want to enter the X_Start and Y_Start values accordingly. There are 100 X values that repeat across every row of the grid and 35 Y values that repeat down every column, so from row to row and from column to column the same values recur. I wanted to solve this with 2 for-loops, but THIS is where I am stuck and need some help:
df = pd.DataFrame(columns=cols)
df['ID'] = list(range(0, 3500))

y = -1
for pos_y in range(0, 35):  # 35
    x = 0
    y = y + 1
    for pos_x in range(0, 100):  # 100
        df['Y_Start'].iloc[y] = List_Y[pos_y]
        df['X_Start'].iloc[x] = List_X[pos_x]
        x = x + 1

df.head(102)
Outputs
ID Y_Start X_Start
0 0 48.34 16.0108
1 1 48.339099 16.011701
2 2 48.338198 16.012602
3 3 48.337297 16.013503
4 4 48.336396 16.014404
... ... ... ...
97 97 NaN 16.098187
98 98 NaN 16.099088
99 99 NaN 16.099989
100 100 NaN NaN
101 101 NaN NaN
102 rows × 3 columns
I want something like this:
ID Y_Start X_Start
0 1 48.34 16.0108
1 2 48.34 16.011701
2 3 48.34 16.012602
3 4 48.34 16.013503
4 5 48.34 16.014404
This is much easier than you make it. You're simply counting:
df['ID'] = list(range(1, 3501))
Apply the same range iteration for each of the other two columns. There may also be cases where you'll want to use NumPy's range slicing to generate your list.
Second part of problem, after OP update:
The long-term problem is that you're trying to apply iteration skills you haven't yet developed. Please return to your basic materials on loops and work on those until you learn to think in terms of a loop as a single control concept, rather than a series of disconnected operations.
That said, the central problem here is that, although you want 3500 rows of results from your nested loops, there is no attempt to do anything with an index that runs to 3500 values.
The auxiliary problem is that you've added "shadow" variables x and y, which do nothing except maintain the same values as your loop indices. As given, you should dump those variables and simply use pos_x and pos_y.
Now, for the actual solution. First, we'll repair the loop. For a given DF row k, you have to extract the x and y coordinates from your 2D array. You have already done this in the opposite direction in your original post. Use the well-traveled arithmetic to get those:
for row in range(3500):
    pos_x = row % 100
    pos_y = row // 100
    df['X_Start'].iloc[row] = List_X[pos_x]
    df['Y_Start'].iloc[row] = List_Y[pos_y]
However, I recommend that you do this with a single assignment from a constructed list of 3500 values: just what I recommended in the top part of this post. Replicating elements and replicating an entire list are techniques for you to look up, or simply derive from elementary list operations.
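For instance, a minimal sketch of that single-assignment approach, assuming List_Y holds the 35 Y values and List_X the 100 X values built above:

import numpy as np

df['Y_Start'] = np.repeat(List_Y, 100)  # each Y value repeated 100 times in a row
df['X_Start'] = np.tile(List_X, 35)     # the 100 X values cycled 35 times

Both arrays come out to 3500 entries, matching the ID column, so each assignment fills the whole column at once.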
I've started using Pandas recently and have been stumbling over this issue for a few days. I have a dataframe with interval information that looks a bit like this:
df = pd.DataFrame({'RangeBegin' : [1,3,5,10,12,42,65],
'RangeEnd' : [2,4,7,11,41,54,100],
'Var1' : ['A','A','A','B','B','B','A'],
'Var2' : ['A','A','B','B','B','B','A']})
RangeBegin RangeEnd Var1 Var2
0 1 2 A A
1 3 4 A A
2 5 7 A B
3 10 11 B B
4 12 41 B B
5 42 54 B B
6 65 100 A A
It is sorted by RangeBegin. The idea is to end up with something like this instead:
RangeBegin RangeEnd Var1 Var2
0 1.0 4.0 A A
2 5.0 7.0 A B
3 10.0 54.0 B B
6 65.0 100.0 A A
Where every "duplicate" (matching Var1 and Var2) row with contiguous ranges is aggregated into a single row. I'm thinking of expanding this algorithm to detect and deal with overlaps, but I'd like to get this working properly first.
You see, I've got a solution working by using iterrows to build a new dataframe row-by-row, but it takes far too long on my real dataset and I'd like to use a more vectorized implementation.
I've looked into groupby but can't find a set of keys (or a function to apply to said groups) that would make this work.
Here's my current implementation as it stands:
def test():
    df = pd.DataFrame({'RangeBegin': [1, 3, 5, 10, 12, 42, 65],
                       'RangeEnd': [2, 4, 7, 11, 41, 54, 100],
                       'Var1': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
                       'Var2': ['A', 'A', 'B', 'B', 'B', 'B', 'A']})
    print(df)
    i = 0
    cols = df.columns
    aggData = pd.DataFrame(columns=cols)
    for row in df.iterrows():
        rowIndex, rowData = row
        # if our new dataframe is empty or its last row is not contiguous, append it
        if aggData.empty or not duplicateContiguousRow(cols, rowData, aggData.loc[i]):
            aggData = aggData.append(rowData)
            i = rowIndex
        # otherwise, modify the last row
        else:
            aggData.loc[i, 'RangeEnd'] = rowData['RangeEnd']
    print(aggData)

def duplicateContiguousRow(cols, row, aggDataRow):
    # first bool: are the ranges contiguous?
    contiguousBool = aggDataRow['RangeEnd'] + 1 == row['RangeBegin']
    if not contiguousBool:
        return False
    # second bool: is this row a duplicate (minus range columns)?
    duplicateBool = True
    for col in cols:
        if not duplicateBool:
            break
        elif col not in ['RangeBegin', 'RangeEnd']:
            # NaN != NaN
            duplicateBool = duplicateBool and (row[col] == aggDataRow[col]
                or (row[col] != row[col] and aggDataRow[col] != aggDataRow[col]))
    return duplicateBool
EDIT: This question just got asked while I was writing this one. The answer looks promising.
You can use groupby for this purpose, by first detecting the consecutive segments:
df['block'] = ((df['Var1'].shift(1) != df['Var1']) | (df['Var2'].shift(1) != df['Var2'])).astype(int).cumsum()
df.groupby(['Var1', 'Var2', 'block']).agg({'RangeBegin': 'min', 'RangeEnd': 'max'}).reset_index()
will result in:
Var1 Var2 block RangeBegin RangeEnd
0 A A 1 1 4
1 A A 4 65 100
2 A B 2 5 7
3 B B 3 10 54
You could then sort by block to restore the original order.
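Note that this splits blocks only where Var1 or Var2 change. If you also want the contiguity requirement from the question (RangeEnd + 1 == RangeBegin on the next row), a sketch of folding it into the block condition, under the same column names:

contiguous = df['RangeBegin'] == df['RangeEnd'].shift(1) + 1
same_vars = (df['Var1'] == df['Var1'].shift(1)) & (df['Var2'] == df['Var2'].shift(1))
df['block'] = (~(contiguous & same_vars)).astype(int).cumsum()

A new block starts whenever a row is either not contiguous with the previous one or differs in Var1/Var2; the groupby aggregation then works unchanged.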
I have written code to iterate through a dataset that has a demarcation column. This column consists of a value shared by all equally demarked rows. The code iterates through each demarcated section, with a nested loop iterating through each line to find the nearest neighbor for each row in its respective demarcated block.
import pandas as pd
import numpy as np
Create a df with XYZ and Section demark
p=5
df = pd.DataFrame(np.random.randn(100, 3), columns=list('XYZ'))
df2 = df.sort_values('Z')  # df.sort('Z') in older pandas
df2 = df2.reset_index(drop=True)
df2['Section_demark'] = (df2.index // p).astype('int')
df2.head(15)
X Y Z Section_demark
0 -1.125526 -0.249091 -2.505444 0
1 0.710114 1.357477 -2.195904 0
2 -0.580319 -0.997311 -2.031280 0
3 1.311526 -0.268590 -1.741079 0
4 0.481450 0.448904 -1.546278 0
5 -1.820224 -0.846628 -1.392700 1
6 0.528618 0.418862 -1.388170 1
7 0.360560 -0.309429 -1.319548 1
8 -0.369107 -1.290528 -1.233815 1
9 0.139063 0.045076 -1.209820 1
10 0.049387 1.087300 -1.188375 2
11 0.678247 -1.191882 -1.172214 2
12 -0.976294 -0.752081 -1.092286 2
13 0.875952 0.319304 -1.079185 2
14 0.469730 -0.329548 -1.044178 2
Function for (squared) euclidean distance:
def eucl_d(item_id):
    a = df3.sub(df3.iloc[item_id], axis=1)
    b = np.sum(np.square(a), axis=1)
    return b
Iterate through the section demarks, then through the lines in each Section_demark, finding the nearest neighbor for each:
isolate the row nearest to the top row to create a series, take the index label for that series, and compile a list from it.
Read the list back into df2, creating a new column with the nearest-neighbor index number as its value:
s = 0
elements = []
while s < (len(df2) / p):
    df3 = df2[df2['Section_demark'] == s]
    r = 0
    while r < p:
        df4 = df3.copy()
        df4['dist'] = eucl_d(r)
        df4 = df4.sort_values('dist')  # df4.sort('dist') in older pandas
        ser = df4.iloc[1]
        elements.append(ser.name)
        r = r + 1
    s = s + 1

df2["NNIX"] = elements
df2.head(10)
X1 Y1 Z1 NNIX
0 0.002299 1.284195 -1.604009 1
1 -0.444305 0.346856 -2.396538 0
2 -0.490741 -1.416682 -1.423573 3
3 0.203635 -0.676841 -1.596332 2
4 0.002299 1.284195 -1.604009 1
5 -0.314330 0.036554 -1.153127 6
6 -0.387839 0.129000 -1.235331 5
7 -0.314330 0.036554 -1.153127 6
8 -0.059477 -0.205260 -1.136376 7
9 0.717980 0.130665 -1.040372 8
I would like to replace the last section of iteration with a groupby command, using aggregate or apply to run the eucl_d function, but it eludes me.
I can get df2 grouped by running this:
grouped = df2.groupby('Section_demark')
It's the second step that is giving me trouble.
I was thinking:
grouped.agg(eucl_d(item_id))
But I don't know how to specify the item_id for eucl_d(item_id).
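One possible direction, sketched under the assumption that the goal is each row's nearest neighbor within its own group, excluding the row itself (nn_index is a hypothetical helper name, not part of the original code):

def nn_index(group):
    coords = group[list('XYZ')].to_numpy()
    # pairwise squared euclidean distances within the group
    d = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # exclude each row's distance to itself
    nearest = d.argmin(axis=1)   # positional index of the closest row
    return pd.Series(group.index[nearest], index=group.index)

df2['NNIX'] = df2.groupby('Section_demark', group_keys=False).apply(nn_index)

Because each group's pairwise distances are computed in one NumPy broadcast, the inner while loop disappears entirely.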
I am trying to understand how Pandas DataFrames works to copy information downward, and then reset until the next variables changes... Specifically below, how do I make Share_Amt_To_Buy reset to 0 once my Signal or Signal_Diff switches from 1 to 0?
Using .cumsum() on Share_Amt_To_Buy ends up bringing down the values and accumulating which is not exactly what I would like to do.
My goal is that when Signal changes from 0 to 1, the Share_Amt_To_Buy is calculated and copied until Signal switches back to 0. Then if Signal turns to 1 again, I want Share_Amt_To_Buy to be recalculated based on that point in time.
Hopefully this makes sense - please let me know.
Signal Signal_Diff Share_Amt_To_Buy (Correctly) Share_Amt_To_Buy (Currently)
0 0 0 0
0 0 0 0
0 0 0 0
1 1 100 100
1 0 100 100
1 0 100 100
0 -1 0 100
0 0 0 100
1 1 180 280
1 0 180 280
As you can see, my signals alternate from 0 to 1, and this means the following:
0 = no trade (or position)
1 = trade (with a position)
Signal_Diff is calculated as follows
portfolio['Signal_Diff'] = portfolio['Signal'].diff().fillna(0.0)
The column 'Share_Amt_To_Buy' is calculated when signal changes from 0 to 1. I have used the following as an example to calculate this
initial_cap = 100000.0
portfolio['close'] = my stock's closing prices as a float
portfolio['Share_Amt'] = np.where(portfolio['Signal'] == 1.0, np.round(initial_cap / portfolio['close'] * 0.25 * portfolio['Signal']), 0.0).cumsum()
portfolio['Share_Amt_To_Buy'] = portfolio['Share_Amt'] * portfolio['Signal']
From what I understand, there is no built-in formula module for pandas. You can perform formulas on columns, cells, arrays and generate different arrays or values from them (df[column].count() is an example), and do plenty of work like that, but there is no method for dynamically updating the array itself based on another value in the array (like an Excel formula).
You could always do the procedure iteratively and say:
for index in df.index:
    if df.loc[index, 'Signal_Diff'] == 0:
        df.loc[index, 'Signal_Diff'] = some_value
    elif df.loc[index, 'Signal_Diff'] == 1:
        df.loc[index, 'Signal_Diff'] = some_other_value
Or you could create a custom function via the map tool:
https://stackoverflow.com/a/19226745/4131059
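For illustration, a toy sketch of the map approach (Signal_Label is a hypothetical column name, just to show the shape of the call):

df['Signal_Label'] = df['Signal_Diff'].map(
    lambda v: 'enter' if v == 1 else ('exit' if v == -1 else 'hold'))

map applies the function element-wise to the column, which avoids the explicit index loop.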
EDIT:
Another solution would be to query for all indexes with a value of 1 in the old array and the new array upon some change to the array:
df_old_list = df[df.Signal_Diff == 1].index.tolist()
...
df_new_list = df[df.Signal_Diff == 1].index.tolist()

for x in df_old_list:
    if x in df_new_list:
        df_new_list.remove(x)
Then recalculate for only the indexes in df_new_list.
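A vectorized alternative, sketched with the column names from the question: label each stretch of Signal == 1 with a running trade id, compute the share amount once at the start of each stretch, and broadcast it with transform. This assumes initial_cap and portfolio['close'] as defined above.

trade_id = (portfolio['Signal_Diff'] == 1).cumsum()
start_amt = np.round(initial_cap / portfolio['close'] * 0.25)
portfolio['Share_Amt_To_Buy'] = start_amt.groupby(trade_id).transform('first') * portfolio['Signal']

transform('first') copies the amount from the first row of each trade stretch down the whole stretch, and multiplying by Signal zeroes it out once the position closes.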
Is there a simple way to reference the previous row when iterating through a dataframe?
In the following dataframe I would like column B to change to 1 when A > 1 and remain at 1 until A < -1, when it changes to -1.
In [11]: df
Out[11]:
A B
2000-01-01 -0.182994 0
2000-01-02 1.290203 0
2000-01-03 0.245229 0
2000-01-08 -1.230742 0
2000-01-09 0.534939 0
2000-01-10 1.324027 0
This is what I've tried to do, but clearly you can't just subtract 1 from the index:
for idx, row in df.iterrows():
    if df["A"][idx] < -1:
        df["B"][idx] = -1
    elif df["A"][idx] > 1:
        df["B"][idx] = 1
    else:
        df["B"][idx] = df["B"][idx-1]
I also tried using get_loc but got completely lost, I'm sure I'm missing a very simple solution!
Is this what you are trying to do?
In [38]: df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
In [39]: df['B'] = np.nan
In [40]: df.loc[df.A<-1,'B'] = -1
In [41]: df.loc[df.A>1,'B'] = 1
In [42]: df.ffill()
Out[42]:
A B
0 -1.186808 -1
1 -0.095587 -1
2 -1.921372 -1
3 -0.772836 -1
4 0.016883 -1
5 0.350778 -1
6 0.165055 -1
7 1.101561 1
8 -0.346786 1
9 -0.186263 1
Similar question here: Reference values in the previous row with map or apply.
My impression is that pandas should handle iterations and we shouldn't have to do it on our own... Therefore, I chose to use the DataFrame 'apply' method.
Here is the same answer I posted on the other question linked above...
You can use the dataframe 'apply' function and leverage the otherwise unused 'kwargs' parameter to store the previous row.
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})
new_col = 'c'

def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)

df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
#    a   b   c
# 0  0   0   0
# 1  1  10  11
# 2  2  20  33
This example uses a decorator to store the previous row in a dictionary and then pass it to the function when Pandas calls it on the next row.
Disclaimer 1: The 'prev_row' variable starts off empty for the first row so when using it in the apply function I had to supply a default value to avoid a 'KeyError'.
Disclaimer 2: I am fairly certain this will be slower than a vectorized operation, but I did not do any tests to figure out how much.
Try this. If the first value is neither >= 1 nor < -1, it is set to 0 (or whatever you like):
df["B"] = None
df["B"] = np.where(df['A'] >= 1, 1,df['B'])
df["B"] = np.where(df['A'] < -1, -1,df['B'])
df = df.ffill().fillna(0)
This solves the problem as stated, but the more general way to reference the previous row is to use .shift() (or index arithmetic on the positions).
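A minimal illustration of .shift() on a toy Series, just to show the mechanics:

import pandas as pd

s = pd.Series([10, 20, 30])
print(s.shift(1))  # previous row's values: NaN, 10.0, 20.0

Each row of the shifted Series holds the value from the row above, which is exactly the "previous row" reference the question asked about.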