Pandas recursively drop values from each row if outside a limit - python

I'm working on a dataset that has temperature values from multiple sensors at 5-minute intervals.
The requirement is to:
- calculate the mean from all sensors in each interval
- if any values are more than 3% from the mean (above or below), drop the highest value (i.e. the one furthest from the mean)
- recalculate the mean
- repeat until all remaining values are within 3% of the recalculated mean
This is different from other answers I've found, where the entire row is dropped - I just need to successively drop the highest outlier until all values are within 3%.
I've tried a range of approaches but I'm going in circles. Help!

What you want to do is loop until your condition is met (no more values >3% from mean).
import pandas as pd

values = {
    "col": [90, 85, 80, 70, 95, 100]
}
index_labels = ["A", "B", "C", "D", "E", "F"]
df = pd.DataFrame(values, index=index_labels)

all_within_three_percent = False  # set condition to False by default

while all_within_three_percent == False:  # while the condition is not met, the loop keeps going
    mean = df.col.mean()  # (re)calculate mean
    three_percent_deviation = mean * 0.03  # (re)calculate current deviation threshold
    df['deviation'] = abs(df.col - mean)  # individual row deviations (absolute, so both above and below the mean)
    if sum(df['deviation'] > three_percent_deviation) > 0:  # there are deviations above the threshold
        df = df.drop(df['deviation'].idxmax())  # delete the row with the maximum deviation
    else:
        all_within_three_percent = True  # otherwise the condition is met: we're done, stop the loop
        df = df.drop('deviation', axis=1)  # drop the helper column 'deviation'
returns:
col
E 95
F 100
Note that when the difference is the same, it will remove the first occurrence. After the first iteration (removing 70), the mean is 90, so both 80 and 100 have a difference of 10. It will remove 80, not 100.
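If you would rather break such ties toward the value furthest above the mean, one option (a sketch, not part of the answer above) is to replace the idxmax line inside the loop with a sort that uses the value itself as a tie-breaker:

# assumes the same df with its 'col' and helper 'deviation' columns, inside the while loop
worst = df.sort_values(['deviation', 'col'], ascending=False).index[0]  # ties go to the larger value
df = df.drop(worst)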

Related

Deleting observations from a data frame, according to a bernoulli random variable that is 0 or 1

I have a data frame with 1000 rows. I want to delete 500 observations from a specific column Y, in such a way that the bigger the value of Y, the higher the probability that it will be deleted.
One way to do that is to sort this column in ascending order. For i = 1,...,1000, toss a Bernoulli random variable with a success probability p_i that depends on i. Delete all observations whose Bernoulli random variable is 1.
So first I sort this column:
df_sorted = df.sort_values("mycolumn")
Next, I tried something like this:
p_i = np.linspace(0,1,num=sample_Encoded_sorted.shape[0])
bernoulli = np.random.binomial(1, p_i)
delete_index = bernoulli == 1
delete_index is a boolean vector of True/False values, and the probability of getting True is higher at higher indices. However, I get more than 500 Trues in it.
How do I get exactly 500 Trues in this vector? And how do I use it to delete the corresponding rows of the data frame?
For example, if position i = 1 in delete_index is False, the first row of the data frame won't be deleted; if it's True, it will be deleted.
I don't know why you are trying to limit the number of Trues to exactly 500; because of the random binomial draw it will be near 500 but most of the time it won't be exactly 500. Here is one possible solution, though I don't know how useful it is for your purposes.
import numpy as np

p_i = np.linspace(0, 1, num=1000)

# Loop until the draw contains exactly 500 ones
count = 0
while count != 500:
    bernoulli = np.random.binomial(1, p_i)
    count = np.sum(bernoulli)

# Use the array from np.random.binomial as a boolean mask to slice the sorted df
# (a plain numpy array selects positionally, avoiding index-alignment issues with df_sorted's shuffled index)
df_sorted = df_sorted[bernoulli == 0]
# This returns a new DataFrame with the 500 remaining rows
I hope this helps. I can't comment yet because of my reputation; that's why I put this as an answer and not a comment.
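Another way to get exactly 500 deletions, without retrying the random draw, is to sample the rows to delete without replacement, weighting larger values of Y more heavily. This is only a sketch under the question's setup (1000 rows, a column "mycolumn"), not part of the answer above:

import numpy as np

df_sorted = df.sort_values("mycolumn")
weights = np.arange(1, len(df_sorted) + 1)  # larger weight for larger Y after sorting
to_delete = df_sorted.sample(n=500, weights=weights, random_state=0).index  # rows chosen for deletion
df_kept = df.drop(to_delete)  # exactly 500 rows remain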

Find the average of last 25% from the input range in pandas

I have successfully imported temperature CSV file to Python Pandas DataFrame. I have also found the mean value of specific range:
df.loc[7623:23235, 'Temperature'].mean()
where 'Temperature' is Column title in DataFrame.
I would like to know if it is possible to change this function to find the average of last 25% (or 1/4) from the input range (7623:23235).
Yes, you can use the quantile method to find the value that separates the top 25% of the values in the input range (note that this selects the largest 25% by value, not the last 25% by position) and then use the mean method to calculate their average.
Here's how you can do it:
temps = df.loc[7623:23235, 'Temperature']
quantile = temps.quantile(0.75)
mean = temps[temps >= quantile].mean()
To find the average of the last 25% of the values in a specific range of a column in a Pandas DataFrame, you can use the iloc indexer along with slicing and the mean method.
For example, given a DataFrame df with a column 'Temperature', you can find the average of the last 25% of the values in the range 7623:23235 like this:
import math

# Find the length of the (inclusive) range
length = 23235 - 7623 + 1
# Calculate the number of values to include in the average (the last 25%)
n = math.ceil(length * 0.25)
# Calculate the index of the first value to include in the average
start_index = length - n
# Use iloc to slice the relevant range of values from the 'Temperature' column
# (iloc's end is exclusive, so use 23236 to include position 23235)
# and calculate the mean of those values
mean = df.iloc[7623:23236]['Temperature'].iloc[start_index:].mean()
print(mean)
This code first calculates the length of the range, then calculates the number of values that represent 25% of that range. It then uses the iloc indexer to slice the relevant range of values from the 'Temperature' column and calculates the mean of those values using the mean method.
Note that this code assumes that the indices of the DataFrame are consecutive integers starting from 0. If the indices are not consecutive or do not start at 0, you may need to adjust the code accordingly.
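If your row numbers are actually index labels (as the loc call in the question suggests), a label-based sketch avoids the positional bookkeeping; this assumes 7623 and 23235 are labels on a monotonically increasing index:

import math

subset = df.loc[7623:23235, 'Temperature']   # label-based slice, end inclusive
n = math.ceil(len(subset) * 0.25)            # size of the last 25% of the selected rows
print(subset.iloc[-n:].mean())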

calculate descriptive statistics in pandas dataframe based on a condition cyclically

I have a pandas DataFrame with more than 100 thousand rows. The index represents the time and two columns represent the sensor data and the condition.
When the condition becomes 1, I want to start calculating a score card (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
What I thought of is to iterate through the index and rows of the df and, when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        #start = ?
        #end = ?
        #df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end will be the start for the next cycle. Additionally, I think this iteration is not optimal because it takes quite a long time. I would appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is beginning and second element is end, sketched below:
former = 0
stamp_pairs = []
df = cond_1_df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
for each of these pairs, you can again create a df containing only the subset of rows where stamp_x < timestamp < stamp_x+1:
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_x+1)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the ">" and "<" operators! We can't tell from the question, since you did not show how the timestamps were produced.
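A faster, vectorized alternative (only a sketch, assuming the time is the index and the columns are named 'sensor' and 'condition' as described in the question) is to let every 1 in the condition column start a new group and aggregate per group:

cycle_id = df_b['condition'].cumsum()        # every 1 starts a new cycle
stats = (df_b[cycle_id > 0]                  # ignore rows before the first 1, if any
         .groupby(cycle_id[cycle_id > 0])['sensor']
         .agg(['mean', 'std']))              # one row of mean/std per cycle
print(stats)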

How to find maximum percentage fall in a dataset with Python? (stock market)

I am currently dealing with stock market analysis and trying to find a fast algorithm that enables me to calculate the maximum price drop in a given dataset, and I think it is a good algorithm question to think about. So the input would be the stock prices of a share for a specific time interval, and the output would be the maximum of all price drops.
Here is a visual example, please look at the picture (percentages were estimated by eye):
Stock Price Image
I have roughly represented some price drops and their percentages. Even though the last price drop is the largest in absolute terms, the one with the 60% price drop is the one that I want to find.
Thanks in advance
Solution:
You can do it in linear time by just iterating through the stock values backwards.
You keep track of the smallest element you have seen so far, as the biggest drop will always go to the smallest value that still lies ahead of it. Then you can calculate the relative drop from every point to the smallest element ahead of it and just keep track of the biggest drop you found that way.
Here is an implementation of it in Python. Make sure you understand what I'm doing and why it works, to be sure it fits the problem you had in mind.
def getGreatestDrop(stock):
    """Calculates the greatest relative drop of a stock.

    #param stock: 1-D list containing the values of that stock

    Returns a tuple with the relative drop size, the index of the start and the
    index of the end of that drop.
    """
    min = None                 # The smallest value seen so far
    minIndex = None            # The index of the smallest value seen so far
    greatestDrop = None        # The biggest relative drop seen so far
    greatestDropStart = None   # The index of the drop start
    greatestDropEnd = None     # The index of the drop end

    # Iterating backwards through the array, starting from the last element
    for index in range(len(stock)-1, -1, -1):

        # Update min
        if min is None or stock[index] < min:
            min = stock[index]
            minIndex = index

        # Calculate relative drop
        drop = 1 - min/stock[index]

        # Update greatest drop
        if greatestDrop is None or drop > greatestDrop:
            greatestDrop = drop
            greatestDropStart = index
            greatestDropEnd = minIndex

    # Return values
    return greatestDrop, greatestDropStart, greatestDropEnd
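A quick sanity check on a small hand-made series (the numbers are purely illustrative):

prices = [100, 80, 90, 40, 60, 50]
drop, start, end = getGreatestDrop(prices)
print(drop, start, end)  # 0.6 0 3 -> a 60% drop from 100 (index 0) to 40 (index 3)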
Example:
Here is an example program where I use this function on 6 randomly generated stocks:
#!/usr/bin/env python3

import random
import numpy as np
from matplotlib import pyplot as plt


def generateRandomStock(length, volatility=1, trend=0, scale=100, lowest=500):
    """Generates a random stock price series as a scaled random walk.

    #param length:     number of data points to generate
    #param volatility: size of the random step added at each point
    #param trend:      constant drift added at each step
    #param scale:      factor the raw walk is multiplied by
    #param lowest:     value the minimum of the series is shifted up to

    Returns a numpy array with the generated stock values.
    """
    out = np.ndarray(length)
    value = 0
    for i in range(length):
        value += volatility*(random.random()-0.5) + trend
        out[i] = value
    out *= scale
    out -= out.min()
    out += lowest
    return out


def getGreatestDrop(stock):
    """Calculates the greatest relative drop of a stock.

    #param stock: 1-D list containing the values of that stock

    Returns a tuple with the relative drop size, the index of the start and the
    index of the end of that drop.
    """
    min = None                 # The smallest value seen so far
    minIndex = None            # The index of the smallest value seen so far
    greatestDrop = None        # The biggest relative drop seen so far
    greatestDropStart = None   # The index of the drop start
    greatestDropEnd = None     # The index of the drop end

    # Iterating backwards through the array, starting from the last element
    for index in range(len(stock)-1, -1, -1):

        # Update min
        if min is None or stock[index] < min:
            min = stock[index]
            minIndex = index

        # Calculate relative drop
        drop = 1 - min/stock[index]

        # Update greatest drop
        if greatestDrop is None or drop > greatestDrop:
            greatestDrop = drop
            greatestDropStart = index
            greatestDropEnd = minIndex

    # Return values
    return greatestDrop, greatestDropStart, greatestDropEnd


if __name__ == "__main__":

    # Create subplots
    width = 3
    height = 2
    fig, axs = plt.subplots(width, height)

    # Fix random seed to get the same results every time
    random.seed(42)

    # Draw all plots
    for w in range(width):
        for h in range(height):

            # Generate stocks randomly
            stocks = generateRandomStock(1000)
            axs[w][h].plot(stocks)

            # Calculate greatest drop
            drop, dropStart, dropEnd = getGreatestDrop(stocks)
            axs[w][h].plot([dropStart, dropEnd], [stocks[dropStart], stocks[dropEnd]], color="red")

            # Set title
            axs[w][h].set_title("Greatest Drop is {:.1f}% from {} to {}".format(100*drop, dropStart, dropEnd))

    # Show all results
    plt.show()
Output: (the six generated stock plots, each with its greatest drop drawn in red)

Calculate the percentage of values that meet multiple conditions in DataFrame

I have a DataFrame with information from every single March Madness game since 1985. Now I am trying to calculate the percentage of wins by the higher seed by round. The main DataFrame looks like this:
I thought that the best way to do it would be to create separate functions. The first one deals with the scores: when Score is higher than Score.1, return Team, and when Score.1 is higher than Score, return Team.1, then append those at the end of the function. The next one does the same for the seeds: when Seed.1 is higher than Seed, return Team, and when Seed is higher than Seed.1, return Team.1, then append. Lastly, make a function for when those are equal.
def func1(x):
    if tourney.loc[tourney['Score']] > tourney.loc[tourney['Score.1']]:
        return tourney.loc[tourney['Team']]
    elif tourney.loc[tourney['Score.1']] > tourney.loc[tourney['Score']]:
        return tourney.loc[tourney['Team.1']]

func1(tourney.loc[tourney['Score']])
You can apply a row-wise function by applying a lambda function to the entire dataframe with axis=1. This will give you a True/False column 'low_seed_wins'.
With the new column of True/False you can take the count and the sum (count being the number of games, and sum being the number of lower-seed victories). You can then divide the sum by the count to get the win ratio.
This only works because your lower-seed teams are always on the left. If they are not, it will be a little more complex.
import pandas as pd
df = pd.DataFrame([[1987,3,1,74,68,5],[1987,3,2,87,81,6],[1987,4,1,84,81,2],[1987,4,1,75,79,2]], columns=['Year','Round','Seed','Score','Score.1','Seed.1'])
df['low_seed_wins'] = df.apply(lambda row: row['Score'] > row['Score.1'], axis=1)
df = df.groupby(['Year','Round'])['low_seed_wins'].agg(['count','sum']).reset_index()
df['ratio'] = df['sum'] / df['count']
df.head()
Year Round count sum ratio
0 1987 3 2 2.0 1.0
1 1987 4 2 1.0 0.5
You should be able to calculate this by checking both conditions, for both the first and second team. This returns a boolean Series, the sum of which is the number of cases where it is true. Then just divide by the length of the whole dataframe to get the percentage. Without test data it's hard to check exactly.
(
    ((tourney['Seed'] > tourney['Seed.1']) &
     (tourney['Score'] > tourney['Score.1'])) |
    ((tourney['Seed.1'] > tourney['Seed']) &
     (tourney['Score.1'] > tourney['Score']))
).sum() / len(tourney)
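The question asks for the percentage by round; if that is what you need, one sketch (treating "higher seed" as the larger Seed number, the same convention as the snippet above) is to store the boolean in a column and group by 'Round':

tourney['higher_seed_won'] = (
    ((tourney['Seed'] > tourney['Seed.1']) & (tourney['Score'] > tourney['Score.1'])) |
    ((tourney['Seed.1'] > tourney['Seed']) & (tourney['Score.1'] > tourney['Score']))
)
win_pct_by_round = tourney.groupby('Round')['higher_seed_won'].mean() * 100  # percentage per round
print(win_pct_by_round)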
