I have a large dataframe and want to overwrite X entries of it with a new value I set. The new entries have to start at a random position, but they have to be consecutive. For example, I have a column with random numbers and want to overwrite 20 of them in a row with the new value x.
I tried df.sample(x) and then updating the dataframe, but that only gives me individual, scattered entries. I need the X new entries in a row (consecutively).
Does anybody have a solution? I'm quite new to Python and have to get into it for my master's thesis.
CLARIFICATION:
My dataframe has 5 columns with almost 60,000 rows, each row for 10 minutes of the year.
One column is 'output', with the electricity production values for those 10 minutes.
For 2 consecutive hours (120 consecutive minutes, hence 12 consecutive rows) of the year I want to lower that production to 60%. I want it to happen at a random time of the year.
Another column is 'status', which records whether the production is reduced or not.
I tried:
df_update = df.sample(12)
df_update.status = 'reduced'
df.update(df_update)
df.loc[df['status'] == 'reduced', 'output'] *= 0.6
which does the trick for the total amount of time (12*10 minutes), but I want 120 consecutive minutes and not separated.
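I decided to pick a random row and then scale that row and the next 11 (12 rows in total) by 0.6. I think this is what you want. Note that if the random start lands within the last 11 rows, the slice is cut off at the end of the dataframe.
import numpy as np
import pandas as pd
df = pd.DataFrame({'output':np.random.randn(20),'status':[0]*20})
idx = df.sample(1).index.values[0]  # random start row
# .loc slices include both endpoints, so idx:idx+11 covers 12 rows
df.loc[idx:idx+11,"output"] *= 0.6
df.loc[idx:idx+11,"status"] = 1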
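If the reduced window must always be a full 12 rows, one option is to draw the start position only from positions that leave room for the whole window. The following is a minimal sketch under assumptions: the frame size, the seed, the 'normal'/'reduced' labels and the names n_rows, window and start are made up for illustration, and positional indexing (iloc) is used so it does not depend on the index labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded only to make the example repeatable
n_rows, window = 60, 12          # 12 rows of 10 minutes = 2 hours

# Toy stand-in for the real data (assumed column names 'output' and 'status')
df = pd.DataFrame({"output": rng.uniform(10, 100, n_rows), "status": "normal"})
original = df["output"].copy()

# Draw the start so that start + window never runs past the last row
start = int(rng.integers(0, n_rows - window + 1))
rows = slice(start, start + window)   # positional slice, end-exclusive

df.iloc[rows, df.columns.get_loc("output")] *= 0.6
df.iloc[rows, df.columns.get_loc("status")] = "reduced"
```

Because the upper bound of the draw is n_rows - window + 1 (exclusive), the last possible start is exactly the one whose window ends on the final row.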
Related
I have a pandas dataframe that you can see in the screenshot. The dataframe has a time resolution of 15 minutes (it is generation data). I would like to reduce this time resolution to 1 hour, meaning that I should keep every 4th row, and the value in every 4th row should be the average of the last 4 rows (including this one). So it should be a rolling average with non-overlapping horizons.
I tried the following for one column (wind offshore):
df_generation = pd.read_csv("C:/Users/Desktop/Data/generation_data.csv", sep =",")
df_generation_2 = df_generation
df_generation_2['Wind Offshore Average'] = df_generation_2['Wind Offshore'].rolling(4).mean()
But this is not what I really want. As you can see in the screenshot, my code just created a further column with the average of the last 4 entries for every time slot. Here the rolling average has overlapping horizons. What I want is a new dataframe that only has an entry after every hour (after 4 time slots of the original array). Do you have an idea how I can do that? I'd appreciate every comment.
Judging from your index, the .resample method looks like what you need (the documentation has many examples for specific uses): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
It requires a datetime-like index, as in
new = df_generation['Wind Offshore'].resample('1h').mean()  # '1H' in older pandas versions
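To make the difference from rolling(4) concrete, here is a small self-contained sketch. The data is invented (eight 15-minute values 0..7) and the frame is built in memory rather than read from the CSV, so the index is already a DatetimeIndex. With a real file you would first have to parse the timestamp column and set it as the index.

```python
import numpy as np
import pandas as pd

# Two hours of made-up 15-minute data with a DatetimeIndex
idx = pd.date_range("2021-01-01", periods=8, freq="15min")
df_generation = pd.DataFrame({"Wind Offshore": np.arange(8, dtype=float)}, index=idx)

# Non-overlapping hourly means: one output row per hour
hourly = df_generation["Wind Offshore"].resample("1h").mean()
print(hourly)
```

By default each hourly value is labelled with the start of its hour. resample also accepts label and closed arguments if you would rather label each bin by its end.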
I have a data frame with 2 columns.
The 1st column is a timestamp of every minute.
The 2nd column is a number.
All I want to do is to change the 1st column into timestamp of every 30 minutes, and the sum of the 30 numbers within that period from column 2.
The power is recorded for every minute, but I want to sum it up for every 30 minutes.
Using pandas/Series.resample
Series.resample can help you if you set the timestamp as the index; then use series.resample('30min').sum() ('30T' is the older alias for the same frequency).
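A minimal sketch of the resample route, assuming the two columns are called 'timestamp' and 'Power_kw' (the first name is a guess, the second is taken from the manual version) and using 90 minutes of constant dummy data:

```python
import pandas as pd

# 90 one-minute readings of 1.0 kW each (made-up data)
idx = pd.date_range("2021-01-01", periods=90, freq="min")
df = pd.DataFrame({"timestamp": idx, "Power_kw": 1.0})

# Move the timestamp into the index, then sum each 30-minute bin
half_hourly = df.set_index("timestamp")["Power_kw"].resample("30min").sum()
print(half_hourly)
```

Each output row is labelled with the start of its 30-minute bin and holds the sum of the 30 readings in it.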
Manual version
You can use cumsum over the series you want to keep.
Then select only the rows at every 30th position (np.arange(0, len(df), 30)).
Then iterate over the dataframe backward and subtract from row n the cumulative sum found at row n-1, so that each row keeps only the sum of the last 30 minutes. Iterating is not very efficient, but since your dataset is 1M rows and you keep 1 row out of every 30, it should be fast (33,333 iterations).
import numpy as np
df['cumsum'] = df["Power_kw"].cumsum()
df_30_min = df.iloc[np.arange(0, len(df), 30)].copy()
col = df_30_min.columns.get_loc('cumsum')  # position of the cumulative-sum column
# walk backward so the row above still holds its unmodified cumulative sum
for i in range(len(df_30_min), 1, -1):
    df_30_min.iloc[i-1, col] -= df_30_min.iloc[i-2, col]
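If you want to stay index-free but avoid the Python loop, a different technique (not the cumsum approach above) is to label each block of 30 rows by integer division of the row position and group on that label. This is only a sketch: the data is made up (values 1..90) and block is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# 90 made-up one-minute readings: 1.0, 2.0, ..., 90.0
df = pd.DataFrame({"Power_kw": np.arange(1, 91, dtype=float)})

# Rows 0-29 get label 0, rows 30-59 get label 1, and so on
block = np.arange(len(df)) // 30

# One sum per block of 30 consecutive rows
sums = df.groupby(block)["Power_kw"].sum()
print(sums)
```

Unlike the cumsum version, every block here covers 30 full rows starting at row 0, and a shorter trailing block (if the length is not a multiple of 30) is summed as well.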
I have a df with 250,000 rows, sorted by date. Another column called 'wind speed' consists of numbers from 1 to 25 in increments of 0.1, but is unsorted. Now I want to start a calculation with the 'Power' column as soon as I have seen every integer value of 'wind speed' from 1 to 25 at least once. If I have multiple values for one number, I want to build the average.
When I try an if clause, it checks the entire df and doesn't start the calculation when the BIN is complete. After that, a new BIN should be searched for.
Does someone have an idea how to do it?
This is a part of my df:
I have a large data frame in pandas which has two columns, Time and Values. I want to calculate consecutive averages of the values in column Values, based on a condition derived from the column Time.
I want to calculate average of the first l values in column Values, then next l values from the same column and so on, till the end of the data frame. The value l is the number of values that go into every average and it is determined by the time difference in column Time. Starting data frame looks like this
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, average needs to be taken at every 2 seconds and the number of time values inside that time difference will determine the number of values l for which the average will be calculated.
a1 would be the first average of l values, a2 next, and so on.
Second part of the question is the same calculation of averages, but if the number l is known in advance. I tried this
df['Time'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
If I expect to have the same number of values in every 2 seconds till the end of the data frame, I can use the following loop to calculate value of l:
s=1
l=0
while df['Time'][s] - df['Time'][0] <= 2:
    s+=1
    l+=1
Could this be done differently, without the loop?
How can I do this if number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate average values of l elements in another column, for example column Values, and to populate column Averages of data frame df1 with these values.
I tried with the following code
p=0
df1=pd.DataFrame(columns=['Time','Averages'])
for w in range(0, len(df)-1, 2):
    # .loc adds row p if it does not exist yet
    df1.loc[p, 'Averages']=df['Values'].iloc[w:w+2].mean()
    p=p+1
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns Time and Values. I want to determine how many consecutive values from the column Values should be averaged at one point. I do that by determining this number l from the column Time by calculating how many rows are inside the time difference of 2 seconds. When I determined that value, for example 2, then I average first two values from the column Values, and then next 2, and so on till the end of the data frame. At the end, I store this value in the separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Value and then groups of consecutive rows.
If you want to group by consecutive rows and get the mean of the Time and Value this does it for you. You really need to show by example what you are really trying to achieve.
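import datetime as dt
import random
import pandas as pd
d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))
df = pd.DataFrame({"Time":d,
                   "Value":[round(random.uniform(0, 1),6) for x in d]})
n = 5  # number of consecutive rows per group
# integer-divide the positional index to label each block of n rows, then average per block
df.assign(grp=df.index//n).groupby("grp").agg({"Time":lambda s: s.mean(),"Value":"mean"})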
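For the case in the question, where Time holds float seconds and the number of rows per 2-second interval may vary, one possible approach is to derive the group label from the elapsed time instead of the row position. This is a sketch rather than a definitive answer: it reuses the four sample rows from the question, bin_id and window are hypothetical names, and each average is stamped with the first Time value in its bin, which is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": [1595006371.756430732, 1595006372.502789381,
             1595006373.784446912, 1595006375.476658051],
    "Values": [4, 5, 6, 10],
})

# Variable l: label each row by which 2-second window (since the first row) it falls in
window = 2.0
bin_id = ((df["Time"] - df["Time"].iloc[0]) // window).astype(int).rename("bin")
df1 = (df.groupby(bin_id)
         .agg(Time=("Time", "first"), Averages=("Values", "mean"))
         .reset_index(drop=True))

# Known l: label blocks of l consecutive rows by position instead
l = 2
averages_fixed = df.groupby(np.arange(len(df)) // l)["Values"].mean()
```

df1 then has the Time and Averages columns asked for and can be passed to matplotlib directly, for example via df1.plot(x="Time", y="Averages").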
I have a pandas DataFrame with data from an icecream freezer. Several columns describe the different temperatures in the system as well as some other things.
One column, named 'Defrost status', uses boolean values to tell me when the freezer was defrosting to remove excess ice.
Those 'defrosts' are what I am interested in, so I added another column named "around_defrost". This column currently only has NaN values, but I want to change them to 'True' whenever there is a defrost within 30 minutes of that specific row in the dataframe.
The data is recorded every minute, so 30 minutes would mean that the 30 rows before a defrost and the 30 rows after it need to be set to 'True'.
I have tried to do this with iterrows, itertuples and by playing with the indexes as seen in the figure, but no success so far. If anyone has a good idea of how this could be done, I'd really appreciate it!
You need to use dataframe.rolling:
df = df.sort_values("Time") #sort by Time
df['around_defrost'] = df['Defrost status'].rolling(60, center=True, min_periods = 0).apply(
    lambda x: True if True in x else False, raw=True)
EDIT: you may need rolling(61, center=True), since an odd window of 61 covers the row in question AND the 30 rows before and after it.
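A variant of the same rolling idea that avoids the Python lambda is to cast the flags to integers and take a centered rolling maximum. This is a sketch on made-up data (200 one-minute rows with a single defrost at row 100), and it assumes one row per minute with no gaps, as described in the question.

```python
import numpy as np
import pandas as pd

# 200 minutes of dummy data with one defrost at row 100
flags = np.zeros(200, dtype=bool)
flags[100] = True
df = pd.DataFrame({
    "Time": pd.date_range("2021-01-01", periods=200, freq="min"),
    "Defrost status": flags,
})

df = df.sort_values("Time")
# Window of 61 = the row itself plus 30 rows before and 30 after;
# the rolling max is 1 whenever any defrost falls inside that window
df["around_defrost"] = (
    df["Defrost status"].astype(int)
      .rolling(61, center=True, min_periods=1)
      .max()
      .astype(bool)
)
```

The result is a proper boolean column, whereas apply returns floats (1.0 and 0.0).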