I am trying to check three consecutive values in a column and, if they are all positive, write a string value into a new column in the third row. My index is a date index.
I want a new column created in my data frame, and I want a loop to check whether three consecutive values are all positive (then return the string 'increasing'), all negative (then return 'decreasing'), or neither (then return 'none'). This new value should go in the new column, in the row that is the last of the three values checked.
I have tried the code below, but whatever variation I use, it does not work.
df['num_change'] = df.num.diff()
result = []
for i in range(len(df)):
    if np.all(df['num_change'].values[i:i+3]) < 0:
        result.loc[[i+3], 'Trend'] = ('decreasing')
    elif np.all(df['num_change'].values[i:i+3]) > 0:
        result.loc[[i+3], 'Trend'] = ('increasing')
    else:
        result.loc[[i+3], 'Trend'] = ('none')
df["new_col"] = result
Unfortunately I am not able to insert an image here; I hope someone is patient enough to help me anyway.
This can be achieved with a custom rolling aggregation, without an (explicit) loop.
First we define the aggregation (it has to return a numeric value):
def trend(s):
    if (s < 0).all():
        return -1
    if (s > 0).all():
        return 1
    return 0
Now apply it and map the result to a label:
df['trend'] = (df['col'].rolling(3, min_periods=1)
                        .apply(trend)
                        .map({1: 'Increasing', -1: 'Decreasing', 0: 'none'})
              )
Output:
col trend
0 1 Increasing
1 2 Increasing
2 3 Increasing
3 -4 none
4 -5 none
5 -6 Decreasing
6 7 none
7 8 none
8 9 Increasing
Note that we set min_periods to 1 here, which has the effect of filling the first two rows based on the sub-series of 1 or 2 elements. If you don't want that, you can drop the min_periods argument.
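For comparison, without min_periods the first two rows come out as NaN instead (a quick sketch, reusing the same df and trend function as above):
df['trend'] = (df['col'].rolling(3)
                        .apply(trend)
                        .map({1: 'Increasing', -1: 'Decreasing', 0: 'none'})
              )
# rows 0 and 1 are now NaN; from row 2 on the result matches the output above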
You could do this as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col' : [1,2,3,-4,-5,-6,7,8,9]})
start = 0
end = 3
result = [None] * 2  # because the trend can only start at the third value
while end <= len(df.col):
    if np.all(df.col[start:end] > 0):
        result.append("Increasing")
    elif np.all(df.col[start:end] < 0):
        result.append("Decreasing")
    else:
        result.append(None)
    start += 1
    end += 1
df["new_col"] = result
In this solution, the while-loop runs as long as the subset of the column has at least 3 values, i.e. while end is less than or equal to the length of df.col. Inside it, three consecutive elements of the column are checked: if all of them are greater than 0, "Increasing" is appended to the result; if all of them are less than 0, "Decreasing" is appended; otherwise None is appended.
The first two elements of the result are None because no window of three values can end there. start and end begin at 0 and 3 respectively and are incremented by 1 after each iteration. The output is shown below:
>>> df
col new_col
0 1 None
1 2 None
2 3 Increasing
3 -4 None
4 -5 None
5 -6 Decreasing
6 7 None
7 8 None
8 9 Increasing
In this Excel analysis, the following condition was used: the Excel formula =IF(AND(B2<1;B3>5);1;0), please refer to the image below.
(https://i.stack.imgur.com/FpPIK.png)
If the compressor-1 first-row value is less than 1 (<1) and the second-row value is greater than 5 (>5), it returns the value '1';
if the condition is not satisfied, it returns the value '0'.
Even if one row satisfies the condition and the other row doesn't, it returns '0'
(for the first output the 1st & 2nd rows, for the second output the 2nd & 3rd rows, and so on for the rest of the rows).
So I have tried, in a Jupyter notebook, to write code that iterates through all rows of one column, comparing them against this condition:
df3['cycle']=0&1
df3.loc[(df3['Kompressor_1_val']<1&5),['cycle']]=0
df3.loc[(df3['Kompressor_1_val']>1&5),['cycle']]=1
df3
But could anyone please help me write the code following the Excel analysis described above?
In the new column, the output should be 0 or 1, based on the condition described in the Excel analysis:
i.e. in the 1st iteration, it should compare the 1st and 2nd rows of the selected column against the condition and output either 1 or 0;
in the 2nd iteration, it should compare the 2nd and 3rd rows of the selected column and output either 1 or 0, and so on for the rest of the rows.
(https://i.stack.imgur.com/dCuMr.png)
You can check the current Compressor 1 row using .lt(...):
df["Compressor 1"].lt(1)
And the next row using .shift(-1) and .gt(...):
df["Compressor 1"].shift(-1).gt(5)
Put them together with & and convert to int:
df["Frequency Cycle Comp 1"] = (df["Compressor 1"].lt(1) & df["Compressor 1"].shift(-1).gt(5)).astype(int)
An example:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(low=-10, high=10, size=(10,)), columns=["Compressor 1"])
df["Frequency Cycle Comp 1"] = (df["Compressor 1"].lt(1) & df["Compressor 1"].shift(-1).gt(5)).astype(int)
print(df)
Compressor 1 Frequency Cycle Comp 1
0 2 0
1 5 0
2 -10 0
3 -7 0
4 -7 0
5 -3 0
6 -1 1
7 9 0
8 8 0
9 -6 0
I have the following Pinescript statements that I'm trying to implement using Python dataframes.
Hlv = float(na)
Hlv := close > sma_high ? 1 : close < sma_low ? -1 : Hlv[1]
Line 1 - Hlv - is basically a variable name, which is a float.
Line 2 - Assigning a new value to the variable via a ternary operator (if-elseif-else). The expression Hlv[1] means the previous value (the value of Hlv, 1 step (row) back).
Now, implementing this in a dataframe with the following columns and data:
Current:
Close SMA_High SMA_Low
10 9 5
5 14 6
13 17 7
Now I want to add another column called Hlv, storing the Hlv value for each row, computed from the condition in line 2 of the Pinescript.
Expected:
Close SMA_High SMA_Low Hlv
10 9 5 1 // close > sma_high = 1
5 14 6 -1 // close < sma_low = -1
13 17 7 -1 // no condition is met here, so the previous value of Hlv is taken, i.e. -1
I am not able to figure out how to generate this new column with values derived from the other columns, nor how to take the previous value of the column.
I went through this answer and could see that we can add values on a condition as below:
df['Hlv'] = pd.NA
df.loc[df.Close>df.SMA_High,'Hlv'] = 1
df.loc[df.Close<df.SMA_Low,'Hlv'] = -1
But I am still not sure how to populate the previous value when no condition is met (the default case).
Thanks in advance.
import numpy as np

df['Hlv'] = np.nan
df.loc[df.Close > df.SMA_High, 'Hlv'] = 1
df.loc[df.Close < df.SMA_Low, 'Hlv'] = -1
df['Hlv'] = df['Hlv'].ffill()  # rows where neither condition matched inherit the previous value
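As a quick end-to-end check, here is the same recipe run on the data from the question's Expected table (a sketch; the printed frame is illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Close': [10, 5, 13],
                   'SMA_High': [9, 14, 17],
                   'SMA_Low': [5, 6, 7]})

df['Hlv'] = np.nan
df.loc[df.Close > df.SMA_High, 'Hlv'] = 1
df.loc[df.Close < df.SMA_Low, 'Hlv'] = -1
df['Hlv'] = df['Hlv'].ffill()
print(df)
#    Close  SMA_High  SMA_Low  Hlv
# 0     10         9        5  1.0
# 1      5        14        6 -1.0
# 2     13        17        7 -1.0
Row 2 meets neither condition, so the forward fill carries the previous -1 down, exactly as Hlv[1] does in Pinescript.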
I have a dataframe with the following columns: X and Y are Cartesian coordinates, and Value is the value of the element at those coordinates. What I want to achieve is to keep only one of every group of points that are close to each other, where coordinates count as close if their distance is lower than some value m. The initial DF looks like this (example):
import pandas as pd

data = {'X': [0, 0, 0, 1, 1, 5, 6, 7, 8], 'Y': [0, 1, 4, 2, 6, 5, 6, 4, 8], 'Value': [6, 7, 4, 5, 6, 5, 6, 4, 8]}
df = pd.DataFrame(data)
X Y Value
0 0 0 6
1 0 1 7
2 0 4 4
3 1 2 5
4 1 6 6
5 5 5 5
6 6 6 6
7 7 4 4
8 8 8 8
The distance is computed with the following function:
import numpy as np

def countDistance(lat1, lon1, lat2, lon2):
    # use basic knowledge about triangles - values are in meters
    # (np.sqrt works on scalars and elementwise on whole Series)
    distance = np.sqrt(pow(lat1-lat2, 2) + pow(lon1-lon2, 2))
    return distance
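For example, countDistance(0, 0, 1, 2) returns sqrt(5) ≈ 2.24, the distance between rows 0 and 3 above.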
Let's say we want m <= 3; the output dataframe would then look like this:
X Y Value
1 0 1 7
4 1 6 6
8 8 8 8
What is to be done:
rows 0, 1, 3 are close; the highest value is in row 1, so keep it and continue
rows 2 and 4 (from the original df) are close; keep row 4
rows 5, 6, 7 are close; keep row 6
the kept row 6 is close to row 8; keep row 8, which has the higher value
So I need to go through the dataframe row by row, check the rest, select the best match and then continue. I can't think of any simple way to achieve this. It can't be a use case for drop_duplicates, since the rows are not duplicates, but looping over the whole DF would be very inefficient. One method I could think of is to loop just once: for each row find the close ones (probably by applying countDistance()), select the best-fitting row and replace the rest with its values, and at the end use drop_duplicates. The other idea is a recursive function that builds a new DF: while the original DF still has rows, take the first one, find the close ones, append the best match to the new DF, remove the first row and all close ones from the original DF, and continue until it is empty; then call the same function on the new DF to remove any remaining close points.
These ideas all seem inefficient; is there a nice and efficient pythonic way to achieve this?
For now, I have created a simple solution with recursion; the code works but is most likely not optimal.
def recModif(self, df):
    # columns = ['', 'X', 'Y', 'Value']
    new_df = df.copy()
    new_df = new_df[new_df['Value'] < 0]  # empty copy with the same columns to work with
    changed = False
    while not df.empty:  # for all the data
        df = df.reset_index(drop=True)  # reset so index 0 is always accessible
        x = df.loc[0, 'X']  # first row x and y
        y = df.loc[0, 'Y']
        df['dist'] = self.countDistance(x, y, df['X'], df['Y'])  # add column with distances
        select = df[df['dist'] < 10]  # two elements closer than 10 meters count as close
        if len(select.index) > 1:  # if there is more than one close element
            changed = True
            select = select.loc[[select['Value'].idxmax()]]  # keep the one with the highest value
        # add the kept row to the new df (DataFrame.append was removed in pandas 2.0)
        new_df = pd.concat([new_df, select.iloc[:, :3]], ignore_index=True)
        df = df[df['dist'] >= 10]  # drop the processed elements
    if changed:
        return self.recModif(new_df)  # recurse to catch possible remaining overlaps
    else:
        return new_df  # return the new df if everything was OK
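As a vectorized building block for either idea, all pairwise distances can be computed in a single call instead of row by row. A minimal sketch, assuming SciPy is available (cdist plays the role of countDistance across every pair at once):
import pandas as pd
from scipy.spatial.distance import cdist

data = {'X': [0, 0, 0, 1, 1, 5, 6, 7, 8],
        'Y': [0, 1, 4, 2, 6, 5, 6, 4, 8],
        'Value': [6, 7, 4, 5, 6, 5, 6, 4, 8]}
df = pd.DataFrame(data)

coords = df[['X', 'Y']].to_numpy()
dists = cdist(coords, coords)  # full pairwise Euclidean distance matrix
close = dists < 3              # close[i, j] is True when rows i and j are within m = 3
Each row's neighbours can then be read straight off the close matrix, which removes the per-row distance recomputation inside the while-loop above.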
I'm trying to avoid for-loops when applying a function on a per-row basis to a pandas df. I have looked at many vectorization examples but have not come across anything that works completely. Ultimately I am trying to add a column containing the summation of the successfully met conditions, with a specified value per condition, for each row.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points

points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient but I am not sure how I can vectorize my code since it requires the values per row for each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed by replacing the df.iterrows section with df.apply:
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
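(The lambda wrapper is redundant here, by the way; df.apply(point_calc, axis=1) is equivalent.)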
UPDATE2:
I updated the function as follows and substantially decreased the run time, with a 10x speed increase over df.apply with the initial function:
def point_calc(arr):
    # arr is the whole 2-D array, so every condition is evaluated for all rows at once
    a1 = np.where(arr[:, 2] >= arr[:, 13], 1, 0)
    a2 = np.where(arr[:, 2] < 0, -3, 0)
    a3 = np.where(arr[:, 4] >= arr[:, 8], 2, 0)
    # etc.
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
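For what it's worth, NumPy documents np.vectorize as essentially a for loop provided for convenience, so it is unlikely to beat the array version above. If the 25 conditions become unwieldy, one way to keep them manageable is a list of (condition, points) pairs; a sketch under the same positional-column assumptions, using the five conditions from the original function:
def point_calc(arr):
    # (condition, points) pairs; the remaining conditions can be appended in the same style
    rules = [
        (arr[:, 2] >= arr[:, 13], 1),
        (arr[:, 2] < 0, -3),
        (arr[:, 4] >= arr[:, 8], 2),
        (arr[:, 4] < arr[:, 12], 1),
        (arr[:, 16] == arr[:, 18], 4),
    ]
    return sum(np.where(cond, pts, 0) for cond, pts in rules)

df['points'] = point_calc(df.to_numpy())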
You can try it in the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like this:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like this:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line as needed. The condition within the brackets selects the rows where the condition is true, so -= and += are only applied to those rows:
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
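Applied to the conditions from the question, the translation might look like this (a sketch assuming a frame with columns 'A' through 'T' as in the question, so that positional index 2 is column 'C', 13 is 'N', 4 is 'E', 8 is 'I', 12 is 'M', 16 is 'Q' and 18 is 'S'; that mapping is an assumption):
# assumes columns 'A'...'T'; the positional-to-letter mapping is an assumption
points = pd.Series(0, index=df.index)
points.loc[df['C'] >= df['N']] += 1
points.loc[df['C'] < 0] -= 3
points.loc[df['E'] >= df['I']] += 2
points.loc[df['E'] < df['M']] += 1
points.loc[df['Q'] == df['S']] += 4
df['points'] = points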
At the end you can extract the values of the series as numpy array if you want (optional):
point_list = points.values
Does this solve your problem?
Is there any way to compare values within the same column of a pandas DataFrame?
The task at hand is something like this:
import pandas as pd
data = pd.DataFrame({"A": [0, -5, 2, 3, -3, -4, -4, -2, -1, 5, 6, 7, 3, -1]})
I need to find the maximum length (in indices) over which consecutive values keep the same sign (equivalently, a check for consecutive equal values, since the sign can be encoded as True/False). The above data should yield 5, because there are 5 consecutive negative integers: [-3, -4, -4, -2, -1].
If possible, I was hoping to avoid using a loop, because the number of data points in the column may well be in the millions.
I've tried using data.A.rolling() and its variants, but can't seem to figure out any way to do this in a vectorized fashion.
Any suggestions?
Here's a NumPy approach that computes the max interval lengths for the positive and negative values -
def max_interval_lens(arr):
    # Store mask of positive values
    pos_mask = arr >= 0
    # Get indices of shifts
    idx = np.r_[0, np.flatnonzero(pos_mask[1:] != pos_mask[:-1]) + 1, arr.size]
    # Return max of intervals
    lens = np.diff(idx)
    s = int(pos_mask[0])
    maxs = [0, 0]
    if len(lens) == 1:
        maxs[1-s] = lens[0]
    else:
        maxs = lens[1-s::2].max(), lens[s::2].max()
    return maxs  # Positive, negative max lens
Sample run -
In [227]: data
Out[227]:
A
0 0
1 -5
2 2
3 3
4 -3
5 -4
6 -4
7 -2
8 -1
9 5
10 6
11 7
12 3
13 -1
In [228]: max_interval_lens(data['A'].values)
Out[228]: (4, 5)
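For comparison, a pandas-only sketch of the same computation, using the common shift/cumsum run-grouping idiom (not part of the answer above):
import pandas as pd

data = pd.DataFrame({"A": [0, -5, 2, 3, -3, -4, -4, -2, -1, 5, 6, 7, 3, -1]})

sign = data['A'].ge(0)                    # True for non-negative, False for negative
block = (sign != sign.shift()).cumsum()   # id of each run of consecutive equal signs
run_lengths = sign.groupby(block).size()  # length of every run
run_signs = sign.groupby(block).first()   # sign of every run
print(run_lengths.groupby(run_signs).max())
# A
# False    5
# True     4
# dtype: int64
This matches the (4, 5) result above: the longest non-negative run is 4 and the longest negative run is 5.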