Start calculation if BIN is complete (Python) - python

I have a df with 250.000 rows, sorted by date. Another column called 'wind speed' consists of numbers from 1 to 25 in increments of 0.1, but is unsorted. Now I want to start a calculation with the 'Power' column if I have at least every integer number of 'wind speed' from 1 to 25 once. If I have multiple values for one number I want to built the average.
When I try an if clause, it checks the entire df and doesn't start the calculation when the BIN is complete. Then a new BIN should be searched for.
Does someoen have an idea how to do it?
This is a part of my df:

Related

Easy way to cluster entries in a numpy array with a condition

Background: I have a numpy array of float entries. This is basically a set of observations of something, suppose temperature measured during 24 hours. Imagine that one who records the temperature is not available for the entire day, instead he/she takes few (say 5) readings during an hour and again after few hours, takes reading (say 8 times). All the measurements he/she puts in a single np.array and has handed over to me!
Problem: I have no idea when the readings were taken. So I decide to cluster the observations in the following way: maybe, first recognize local peaks in the array and all entries that are close enough (chosen tolerance, say 1 deg) are grouped together, meaning, I want to split the array into a list of sub-arrays. Note, any entry should belong to exactly one group.
One possible approach: First, sort the array, then split it into sub-arrays with two conditions: (1) Difference between the first and last entries is not more than 1 deg, (2) Difference between the last entry of a sub-array and the first entry of the next sub-array is greater than 1 deg. How can I achieve this fast (numpy way)?

Multiply by unique number based on which pandas interval a number falls within

I am trying to take a number multiply it by a unique number given which interval it falls within.
I did a groupby on my pandas dataframe according to which bins a value fell into
bins = pd.cut(df['A'], 50)
grouped = df['B'].groupby(bins)
interval_averages = grouped.mean()
A
(0.00548, 0.0209] 0.010970
(0.0209, 0.0357] 0.019546
(0.0357, 0.0504] 0.036205
(0.0504, 0.0651] 0.053656
(0.0651, 0.0798] 0.068580
(0.0798, 0.0946] 0.086754
(0.0946, 0.109] 0.094038
(0.109, 0.124] 0.114710
(0.124, 0.139] 0.136236
(0.139, 0.153] 0.142115
(0.153, 0.168] 0.161752
(0.168, 0.183] 0.185066
(0.183, 0.198] 0.205451
I need to be able to check which interval a number falls into, and then multiply it by the average value of the B column for that interval range.
From the docs I know I can use the in keyword to check if a number is in an interval, but I cannot find how to access the value for a given interval. In addition, I don't want to have to loop through the Series checking if the number is in each interval, that seems quite slow.
Does anybody know how to do this efficiently?
Thanks a lot.
You can store the numbers being tested in an array, and use the cut() method with your bins to sort the values into their respective intervals. This will return an array with the bins that each number has fallen into. You can use this array to determine where the value in the dataframe (the mean) that you need to access is located (you will know the correct row) and access the value via iloc.
Hopefully this helps a bit

How to select every 4th row in a pandas dataframe and calculate the rolling average

I have a pandas dataframe that you can see in the screenshot. The dataframe has a time resolution of 15 minutes (it is generation data). I would like to reduce this time resolution to 1 hour meaning that I should take every 4th row and the value in every 4th row should be the anverage values of the last 4 rows (including this one). So it should be a rolling average with non-overlapping horizons.
I tried the following for one column (wind offshore):
df_generation = pd.read_csv("C:/Users/Desktop/Data/generation_data.csv", sep =",")
df_generation_2 = df_generation
df_generation_2['Wind Offshore Average'] = df_generation_2['Wind Offshore'].rolling(4).mean()
But this is not what I really want. As you can see in the screenshot, my code just created a further column with the average of the last 4th entries for every timeslot. Here the rolling average has overlapping horizons. What I want is to have a new dataframe that only has an entry after every hour (after 4 timslots of the original array). Do you have an idea how I can do that? I'd appreciate every comment.
From looking at your Index it looks like the .resample method is what you are looking for (with many examples for specific uses): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
as in
new = df_generation['Wind Offshore'].resample('1H').mean()

How to get consecutive averages of the column values based on the condition from another column in the same data frame using pandas

I have large data frame in pandas which has two columns Time and Values. I want to calculate consecutive averages for values in column Values based on the condition which is formed from the column Time.
I want to calculate average of the first l values in column Values, then next l values from the same column and so on, till the end of the data frame. The value l is the number of values that go into every average and it is determined by the time difference in column Time. Starting data frame looks like this
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, average needs to be taken at every 2 seconds and the number of time values inside that time difference will determine the number of values l for which the average will be calculated.
a1 would be the first average of l values, a2 next, and so on.
Second part of the question is the same calculation of averages, but if the number l is known in advance. I tried this
df['Time'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
If I expect to have the same number of values in every 2 seconds till the end of the data frame, I can use the following loop to calculate value of l:
s=1
l=0
while df['Time'][s] - df['Time'][0] <= 2:
s+=1
l+=1
Could this be done differently, without the loop?
How can I do this if number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate average values of l elements in another column, for example column Values, and to populate column Averages of data frame df1 with these values.
I tried with the following code
p=0
df1=pd.DataFrame(columns=['Time','Averages']
for w in range (0, len(df)-1,2):
df1['Averages'][p]=df['Values'].iloc[w:w+2].mean()
p=p+1
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns Time and Values. I want to determine how many consecutive values from the column Values should be averaged at one point. I do that by determining this number l from the column Time by calculating how many rows are inside the time difference of 2 seconds. When I determined that value, for example 2, then I average first two values from the column Values, and then next 2, and so on till the end of the data frame. At the end, I store this value in the separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Value and then groups of consecutive rows.
If you want to group by consecutive rows and get the mean of the Time and Value this does it for you. You really need to show by example what you are really trying to achieve.
d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))
df = pd.DataFrame({"Time":d,
"Value":[round(random.uniform(0, 1),6) for x in d]})
df
n = 5
df.assign(grp=df.index//5).groupby("grp").agg({"Time":lambda s: s.mean(),"Value":"mean"})

Insert a series of values into pd.dataframe randomly

I have a large dataframe and what I want to do is overwrite X entries of that dataframe with a new value I set. The new entries have to be at a random position, but they have to be in order. Like I have a Column with random numbers, and want to overwrite 20 of them in a row with the new value x.
I tried df.sample(x) and then update the dataframe, but I only get individual entries. But I need the X new entries in a row (consecutively).
Somebody got a solution? I'm quite new to Python and have to get into it for my master thesis.
CLARIFICATION:
My dataframe has 5 columns with almost 60,000 rows, each row for 10 minutes of the year.
One Column is 'output' with electricity production values for that 10 minutes.
For 2 consecutive hours (120 consecutive minutes, hence 12 consecutive rows) of the year I want to lower that production to 60%. I want it to happen at a random time of the year.
Another column is 'status', with information about if the production is reduced or not.
I tried:
df_update = df.sample(12)
df_update.status = 'reduced'
df.update(df_update)
df.loc[('status) == 'reduced', ['production']] *=0.6
which does the trick for the total amount of time (12*10 minutes), but I want 120 consecutive minutes and not separated.
I decided to get a random value and just index the next 12 entries to be 0.6. I think this is what you want.
df = pd.DataFrame({'output':np.random.randn(20),'status':[0]*20})
idx = df.sample(1).index.values[0]
df.loc[idx:idx+11,"output"]=0.6
df.loc[idx:idx+11,"status"]=1

Categories