Find the average of last 25% from the input range in pandas - python

I have successfully imported a temperature CSV file into a pandas DataFrame. I have also found the mean of a specific range:
df.loc[7623:23235, 'Temperature'].mean()
where 'Temperature' is the column title in the DataFrame.
I would like to know if it is possible to change this expression to find the average of the last 25% (i.e. the last quarter) of the input range (7623:23235).

Yes, you can use the quantile method to find the value that separates the top 25% of the values in the input range (note: this selects the largest quarter by value, not the last quarter by position) and then use the mean method to average the values above that cutoff.
Here's how you can do it:
quantile = df.loc[7623:23235, 'Temperature'].quantile(0.75)
mean = df.loc[7623:23235, 'Temperature'][df.loc[7623:23235, 'Temperature'] >= quantile].mean()
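On a small, made-up Series the two steps look like this (a sketch to show what the quantile cutoff actually selects):

```python
import pandas as pd

# Toy stand-in for the temperature range; this approach keeps the largest
# quarter of the values *by magnitude*, not the last quarter by position
s = pd.Series([10.0, 30.0, 20.0, 40.0])
q = s.quantile(0.75)                  # 75th-percentile cutoff
top_quarter_mean = s[s >= q].mean()   # mean of values at or above the cutoff
```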

To find the average of the last 25% of the values in a specific range of a column in a pandas DataFrame, you can combine label-based selection with positional slicing and the mean method.
For example, given a DataFrame df with a column 'Temperature', you can find the average of the last 25% of the values in the range 7623:23235 like this:
import math

# Select the same inclusive label range used in the question
values = df.loc[7623:23235, 'Temperature']
# Number of values that make up the last 25% of the range
n = math.ceil(len(values) * 0.25)
# Take the last n values positionally and average them
mean = values.iloc[-n:].mean()
print(mean)
This code first selects the range, then calculates the number of values that represent 25% of that range, and finally slices the last n values positionally with iloc and averages them with the mean method.
Note that loc slicing is inclusive of both endpoints and selects by label, so the labels 7623 and 23235 must actually exist in the index; if your index is not a simple integer range, adjust the selection accordingly.

Related

Pandas recursively drop values from each row if outside a limit

I'm working on a dataset that has temperature values from multiple sensors in 5min intervals.
The requirement is to:
- calculate the mean from all sensors in each interval
- if any values are > 3% from the mean (above or below), drop the highest value (i.e. the one furthest from the mean)
- recalculate the mean
- repeat if any remaining values are still more than 3% from the recalculated mean
This is different to other answers I've found where the entire row is dropped - I just need to successively drop the highest outlier until all values are within 3%.
I've tried a range of approaches but I'm going in circles. Help!
What you want to do is loop until your condition is met (no more values >3% from mean).
import pandas as pd

values = {
    "col": [90, 85, 80, 70, 95, 100]
}
index_labels = ["A", "B", "C", "D", "E", "F"]
df = pd.DataFrame(values, index=index_labels)

all_within_three_percent = False  # set condition to False by default
while all_within_three_percent == False:  # while the condition is not met, keep looping
    mean = df.col.mean()  # (re)calculate mean
    three_percent_deviation = mean * 0.03  # (re)calculate the current deviation threshold
    df['deviation'] = abs(df.col - mean)  # absolute row deviations (both above and below the mean)
    if sum(df['deviation'] > three_percent_deviation) > 0:  # deviations above the threshold remain
        df = df.drop(df['deviation'].idxmax())  # drop the row furthest from the mean
    else:
        all_within_three_percent = True  # condition met: stop looping
df = df.drop('deviation', axis=1)  # drop the helper column 'deviation'
returns:
col
E 95
F 100
Note that when the deviation is the same, idxmax removes the first occurrence. After the first iteration (removing 70), the mean is 90, so both 80 and 100 deviate by 10; 80 is dropped, not 100.
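That first-occurrence tie-breaking of idxmax can be checked in isolation on a tiny Series:

```python
import pandas as pd

# Two tied maxima; idxmax reports the label of the first one it meets
s = pd.Series([10, 5, 10], index=['a', 'b', 'c'])
first_max = s.idxmax()  # 'a', not 'c'
```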

Multiply by unique number based on which pandas interval a number falls within

I am trying to take a number and multiply it by a unique number determined by which interval it falls within.
I did a groupby on my pandas DataFrame according to which bins a value fell into:
bins = pd.cut(df['A'], 50)
grouped = df['B'].groupby(bins)
interval_averages = grouped.mean()
A
(0.00548, 0.0209] 0.010970
(0.0209, 0.0357] 0.019546
(0.0357, 0.0504] 0.036205
(0.0504, 0.0651] 0.053656
(0.0651, 0.0798] 0.068580
(0.0798, 0.0946] 0.086754
(0.0946, 0.109] 0.094038
(0.109, 0.124] 0.114710
(0.124, 0.139] 0.136236
(0.139, 0.153] 0.142115
(0.153, 0.168] 0.161752
(0.168, 0.183] 0.185066
(0.183, 0.198] 0.205451
I need to be able to check which interval a number falls into, and then multiply it by the average value of the B column for that interval range.
From the docs I know I can use the in keyword to check if a number is in an interval, but I cannot find how to access the value for a given interval. In addition, I don't want to have to loop through the Series checking if the number is in each interval, that seems quite slow.
Does anybody know how to do this efficiently?
Thanks a lot.
You can store the numbers being tested in an array and use the cut() method with your existing bins to sort the values into their respective intervals. This returns an array holding the bin each number falls into, which tells you the row of the stored means to access (e.g. via iloc), without looping over the Series.
Hopefully this helps a bit

How to get consecutive averages of the column values based on the condition from another column in the same data frame using pandas

I have a large data frame in pandas with two columns, Time and Values. I want to calculate consecutive averages for the column Values based on a condition formed from the column Time.
I want to calculate the average of the first l values in column Values, then the next l values from the same column, and so on until the end of the data frame. The value l is the number of values that go into each average, and it is determined by the time difference in column Time. The starting data frame looks like this:
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, average needs to be taken at every 2 seconds and the number of time values inside that time difference will determine the number of values l for which the average will be calculated.
a1 would be the first average of l values, a2 next, and so on.
Second part of the question is the same calculation of averages, but if the number l is known in advance. I tried this
df['Time'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
If I expect to have the same number of values in every 2 seconds till the end of the data frame, I can use the following loop to calculate value of l:
s = 1
l = 0
while df['Time'][s] - df['Time'][0] <= 2:
    s += 1
    l += 1
Could this be done differently, without the loop?
How can I do this if number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate average values of l elements in another column, for example column Values, and to populate column Averages of data frame df1 with these values.
I tried with the following code
p = 0
df1 = pd.DataFrame(columns=['Time', 'Averages'])
for w in range(0, len(df) - 1, 2):
    df1['Averages'][p] = df['Values'].iloc[w:w + 2].mean()
    p = p + 1
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns Time and Values. I want to determine how many consecutive values from the column Values should be averaged at one point. I do that by determining this number l from the column Time by calculating how many rows are inside the time difference of 2 seconds. When I determined that value, for example 2, then I average first two values from the column Values, and then next 2, and so on till the end of the data frame. At the end, I store this value in the separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Values and then about groups of consecutive rows.
If you want to group consecutive rows and get the mean of the Time and Value per group, the following does that. You really need to show by example what you are actually trying to achieve.
import datetime as dt
import random

import pandas as pd

d = list(pd.date_range(dt.datetime(2020, 7, 1), dt.datetime(2020, 7, 2), freq="15min"))
df = pd.DataFrame({"Time": d,
                   "Value": [round(random.uniform(0, 1), 6) for x in d]})

n = 5
df.assign(grp=df.index // n).groupby("grp").agg({"Time": lambda s: s.mean(), "Value": "mean"})
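If l should instead be driven by the 2-second spacing in Time rather than a fixed row count, one loop-free sketch (assuming, as in the question's example, that Time holds epoch seconds) is to resample on a datetime index:

```python
import pandas as pd

# The question's example frame, with Time as epoch seconds
df = pd.DataFrame({'Time': [1595006371.756430732, 1595006372.502789381,
                            1595006373.784446912, 1595006375.476658051],
                   'Values': [4, 5, 6, 10]})

# Convert to timestamps and average within 2-second buckets
ts = pd.to_datetime(df['Time'], unit='s')
averages = df.set_index(ts)['Values'].resample('2s').mean().dropna()

# Store in a second frame with columns Time and Averages, ready for plotting
df1 = pd.DataFrame({'Time': averages.index, 'Averages': averages.to_numpy()})
```

Note the buckets here are aligned to fixed 2-second boundaries; if windows should start at the first sample instead, pass origin='start' to resample.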

pandas rolling average with a rolling mask / excluding entries

I have a pandas dataframe with a time index like this
import pandas as pd
import numpy as np
idx = pd.date_range(start='2000',end='2001')
df = pd.DataFrame(np.random.normal(size=(len(idx),2)),index=idx)
which looks like this:
0 1
2000-01-01 0.565524 0.355548
2000-01-02 -0.234161 0.888384
I would like to compute a rolling average like
df_avg = df.rolling(60).mean()
but always excluding entries corresponding to (let's say) 10 days before, ± 2 days. In other words, for each date, df_avg should contain the mean (exponential with ewm, or flat) of the previous 60 entries, but excluding entries from t-48 to t-52. I guess I need a kind of rolling mask, but I don't know how to build one. I could also compute two separate averages and obtain the result as their difference, but that looks dirty, and I wonder if there is a better way that generalizes to other non-linear computations...
Many thanks!
You can use apply to customize your function:
# select indexes you want to average over
# select the positions you want to average over
avg_idx = [idx for idx in range(60) if idx not in range(8, 13)]
# do the rolling computation, averaging only over the specified positions
# (raw=True passes each window as a NumPy array, so positional indexing is safe)
df_avg = df.rolling(60).apply(lambda x: x[avg_idx].mean(), raw=True)
The x array passed to apply always has 60 entries, so you can specify positional indexes relative to the window, knowing that the first entry (0) is the oldest value in the window.
I am not entirely sure about your exclusion logic, but you can easily modify my solution for your case.
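To sanity-check the windowed exclusion, here is a small deterministic run on a ramp (a quick sketch; raw=True hands each window to the lambda as a NumPy array, so the positional indexing is unambiguous):

```python
import numpy as np
import pandas as pd

# Ramp 0..99: for the first full window (values 0..59), positions 8..12
# (which hold the values 8..12) are left out of the mean
s = pd.Series(np.arange(100, dtype=float))
avg_idx = [i for i in range(60) if i not in range(8, 13)]
out = s.rolling(60).apply(lambda x: x[avg_idx].mean(), raw=True)

# Hand-computed expectation for the window ending at position 59
expected = (np.arange(60).sum() - np.arange(8, 13).sum()) / 55  # 1720 / 55
```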
Unfortunately, no — the built-in window arguments cannot express such a mask. From the pandas source code:
df.rolling(window, min_periods=None, freq=None, center=False, win_type=None,
on=None, axis=0, closed=None)
window : int, or offset
Size of the moving window. This is the number of observations used for
calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each
window will be a variable sized based on the observations included in
the time-period.
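For completeness, the offset form described in that docstring looks like this on the question's example frame (a sketch; it only changes how the window is sized, it does not enable the exclusion asked about):

```python
import numpy as np
import pandas as pd

idx = pd.date_range(start='2000', end='2001')
df = pd.DataFrame(np.random.normal(size=(len(idx), 2)), index=idx)

# Fixed-size window: always exactly 60 rows, NaN until the window fills
fixed = df.rolling(60).mean()

# Offset window: all rows within the trailing 60 days (variable size,
# min_periods defaults to 1, so no leading NaNs)
variable = df.rolling('60D').mean()
```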

How to decile python pandas dataframe by column value, and then sum each decile?

Say a dataframe has only one numeric column, ordered descending.
What I want is a new dataframe with 10 rows: row 1 is the sum of the smallest 10% of the values, and row 10 is the sum of the largest 10%.
I can calculate this in a non-pythonic way, but I guess there must be a more elegant, pythonic way to achieve it.
Any help?
Thanks!
You can do this with pd.qcut:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(100)})
# pd.qcut(df.A, 10) will bin into deciles
# you can group by these deciles and take the sums in one step like so:
df.groupby(pd.qcut(df.A, 10))['A'].sum()
# A
# (-2.662, -1.209] -16.436286
# (-1.209, -0.866] -10.348697
# (-0.866, -0.612] -7.133950
# (-0.612, -0.323] -4.847695
# (-0.323, -0.129] -2.187459
# (-0.129, 0.0699] -0.678615
# (0.0699, 0.368] 2.007176
# (0.368, 0.795] 5.457153
# (0.795, 1.386] 11.551413
# (1.386, 3.664] 20.575449
pandas.qcut documentation
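As a side note on why qcut (rather than cut) fits here: it bins by equal counts, not equal widths, so each decile sums over the same number of rows. A quick check on made-up data:

```python
import numpy as np
import pandas as pd

# With 100 values, qcut's ten quantile bins hold 10 rows each
df = pd.DataFrame({'A': np.random.randn(100)})
deciles = pd.qcut(df.A, 10)
counts = df.groupby(deciles, observed=False)['A'].count()
```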
