I have a data frame with 2 columns.
The 1st column is a timestamp of every minute.
The 2nd column is a number.
All I want to do is change the 1st column into a timestamp of every 30 minutes, and the 2nd column into the sum of the 30 numbers within that period.
Power is shown for every minute, but I want to sum it up for every 30 minutes.
Using pandas/Series.resample
Series.resample can help you if you set the timestamp as the index; then use series.resample('30min').sum() ('30T' is the older alias for the same 30-minute frequency).
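A minimal sketch of the resample approach, assuming the timestamp lives in a column rather than in the index. The column name "timestamp" and the sample values are made up for illustration; "Power_kw" is the column name used further down.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 90 one-minute readings, each equal to 1.0 kW.
idx = pd.date_range("2021-01-01 00:00", periods=90, freq="min")
df = pd.DataFrame({"timestamp": idx, "Power_kw": np.ones(90)})

# resample needs a DatetimeIndex, so move the timestamp into the index first.
power = df.set_index("timestamp")["Power_kw"]

# One row per 30-minute bin, holding the sum of the readings in that bin.
power_30_min = power.resample("30min").sum()
print(power_30_min)
```

Each label in the result is the start of its 30-minute bin, which is the default labelling for this frequency.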
Manual version
You can use cumsum over the series you want to keep.
Then select only the last row of every 30-row block (np.arange(29, len(df), 30)), so that each selected row holds the running total up to the end of a 30-minute period.
Then iterate over the dataframe backward and subtract from row n the cumulative sum found at row n-1, so that each row keeps only the sum of its own 30 minutes. Iterating is not very efficient, but since your dataset is 1M rows and you keep 1 row out of every 30, it should be fast (33,333 iterations).
import numpy as np
df['cumsum'] = df["Power_kw"].cumsum()
df_30_min = df.iloc[np.arange(29, len(df), 30)].copy()
col = df_30_min.columns.get_loc('cumsum')  # position of the running-total column
for i in range(len(df_30_min), 1, -1):
    df_30_min.iloc[i-1, col] -= df_30_min.iloc[i-2, col]
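If the loop turns out to be a bottleneck, the same idea can probably be written without iterating. This is only a sketch on made-up data, assuming the rows are already sorted, there is exactly one row per minute, and each block ends on its 30th row; the "timestamp" column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 90 one-minute readings 0, 1, ..., 89.
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=90, freq="min"),
    "Power_kw": np.arange(90, dtype=float),
})

# Running total, sampled at the last row of each complete 30-row block.
ends = np.arange(29, len(df), 30)
cumsum_at_ends = df["Power_kw"].cumsum().iloc[ends]

# diff() subtracts the previous block's running total. The first block has
# no predecessor, so its NaN is filled with its own running total.
block_sums = cumsum_at_ends.diff().fillna(cumsum_at_ends)

# Cross-check: label each row with its block number and sum per block.
check = df["Power_kw"].groupby(np.arange(len(df)) // 30).sum()
print(block_sums.tolist(), check.tolist())
```

The groupby cross-check is arguably the simpler of the two when timestamps are not needed, since it skips the cumulative sum entirely.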
I have a large data frame in pandas which has two columns, Time and Values. I want to calculate consecutive averages of the values in column Values, based on a condition formed from the column Time.
I want to calculate the average of the first l values in column Values, then of the next l values from the same column, and so on until the end of the data frame. The value l is the number of values that go into every average, and it is determined by the time difference in column Time. The starting data frame looks like this:
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, an average needs to be taken every 2 seconds, and the number of rows that fall inside that time difference determines the number of values l over which the average is calculated.
a1 would be the first average of l values, a2 next, and so on.
The second part of the question is the same calculation of averages, but for the case where the number l is known in advance. I tried this
df['Values'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
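If I expect to have the same number of values in every 2 seconds until the end of the data frame, I can use the following loop to calculate the value of l:
s = 1
l = 0
# count rows until the time elapsed since the first row exceeds 2 seconds
while s < len(df) and df['Time'].iloc[s] - df['Time'].iloc[0] <= 2:
    s += 1
    l += 1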
Could this be done differently, without the loop?
How can I do this if the number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate average values of l elements in another column, for example column Values, and to populate column Averages of data frame df1 with these values.
I tried with the following code
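p = 0
df1 = pd.DataFrame(columns=['Time', 'Averages'])
for w in range(0, len(df) - 1, 2):
    # .loc adds row p to df1; chained indexing (df1['Averages'][p]) is not reliable for this
    df1.loc[p, 'Averages'] = df['Values'].iloc[w:w+2].mean()
    p = p + 1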
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns, Time and Values. I want to determine how many consecutive values from the column Values should be averaged at a time. I do that by determining this number l from the column Time, counting how many rows fall inside a time difference of 2 seconds. Once I have determined that value, for example 2, I average the first two values from the column Values, then the next 2, and so on until the end of the data frame. At the end, I store these averages in a separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Value and then about groups of consecutive rows.
If you want to group consecutive rows and get the mean of Time and Value for each group, the code below does it. You really need to show by example what you are trying to achieve.
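import datetime as dt
import random
import pandas as pd
d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))
df = pd.DataFrame({"Time":d,
                   "Value":[round(random.uniform(0, 1),6) for x in d]})
df
n = 5  # number of consecutive rows per group
# integer-divide the row position by n to label each block of n rows, then average per block
df.assign(grp=df.index//n).groupby("grp").agg({"Time":lambda s: s.mean(),"Value":"mean"})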
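For the time-based part of the question, one possible sketch is to label each row with the 2-second window it falls into and group on that label, which avoids computing l with a loop and also copes with a different number of rows per window. It assumes Time holds epoch seconds as floats, as in the question's sample, and that windows are counted from the first timestamp; the names window, bin_id, Count and fixed_l_means are made up for this example.

```python
import numpy as np
import pandas as pd

# Sample data from the question: Time is in epoch seconds.
df = pd.DataFrame({
    "Time": [1595006371.756430732, 1595006372.502789381,
             1595006373.784446912, 1595006375.476658051],
    "Values": [4, 5, 6, 10],
})

window = 2.0  # averaging interval in seconds

# Label each row with the window it falls into, counted from the first row.
elapsed = df["Time"] - df["Time"].iloc[0]
bin_id = (elapsed // window).astype(int).rename("bin")

# Average within each window; the number of rows per window may differ.
df1 = (
    df.groupby(bin_id)
      .agg(Time=("Time", "first"),
           Averages=("Values", "mean"),
           Count=("Values", "count"))
      .reset_index(drop=True)
)
print(df1)

# If l is known in advance, group on the row position instead.
l = 2
fixed_l_means = df["Values"].groupby(np.arange(len(df)) // l).mean()
```

df1 then has one row per window, with the first timestamp of the window in Time, so it can be passed to matplotlib as df1["Time"] against df1["Averages"].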
I have a dataframe that contains three series called Date, Element,
and Data_Value--their types are string, string, and numpy.int64
respectively. Date has dates in the form of yyyy-mm-dd; Element has
strings that say either TMIN or TMAX, and it denotes whether the
Data_Value is the minimum or maximum temperature of a particular date;
lastly, the Data_Value series just represents the actual temperature.
The date series has multiple duplicates of the same date. E.g. for the
date 2005-01-01, there are 19 entries for the temperature column, the
values start at 28 and go all the way up to 156. I want to create a
new dataframe with the date and the maximum temperature only--I'll
eventually want one for TMIN values too, but I figure that if I can do
one I can figure out the other. I'll post some pseudocode with
explanation below to show what I've tried so far.
So far I have pulled in the csv and assigned it to a variable, df.
Then I sorted the values by Date, Element and Temperature
(Data_Value). After that, I created a variable called tmax that grabs
the necessary dates (I only need the data from 2005-2014) that have
'TMAX' as its Element value. I cast tmax into a new DataFrame, reset
its index to get rid of the useless index data from the first
dataframe, and dropped the 'Element' column since it was redundant at
this point. Now I'm (ultimately) trying to create a list of all the
Temperatures for TMAX so that I can plot it with pyplot. But I can't
figure out for the life of me how to reduce the dataframe to just the
single date and max value for that date. If I could just get that then
I could easily convert the series to a list and plot it.
import pandas as pd

def record_high_and_low_temperatures():
    # read in csv
    df = pd.read_csv('somedata.csv')
    # sort values so they're in a nice order
    df.sort_values(by=['Date', 'Element', 'Data_Value'], inplace=True)
    # grab all entries for TMAX in correct date range
    tmax = df[(df['Element'] == 'TMAX') & (df['Date'].between("2005-01-01", "2014-12-31"))]
    # cast to dataframe
    tmax = pd.DataFrame(tmax, columns=['Date', 'Data_Value'])
    # Remove index column from previous dataframe
    tmax.reset_index(drop=True, inplace=True)
    # this is where I'm stuck, how do I get the max value per unique date?
    max_temp_by_date = tmax.loc[tmax['Data_Value'].idxmax()]
Any and all help is appreciated, let me know if I need to clarify anything.
TL;DR:
Ok...
input dataframe looks like
date | data_value
2005-01-01 28
2005-01-01 33
2005-01-01 33
2005-01-01 44
2005-01-01 56
2005-01-02 0
2005-01-02 12
2005-01-02 30
2005-01-02 28
2005-01-02 22
Expected df should look like:
date | data_value
2005-01-01 79
2005-01-02 90
2005-01-03 88
2005-01-04 44
2005-01-05 63
I just want a dataframe that has each unique date coupled with the highest temperature on that day.
If I understand you correctly, what you want to do, as Grzegorz already suggested in the comments, is to group by date (take all elements of one date) and then take the maximum for that date:
df.groupby('date').max()
This will take all your groups and reduce them to only one row, taking the maximum element of every group. In this case, max() is called the aggregation function of the group. As you mentioned that you will also need the minimum at some point, a nice way to do this (instead of two groupbys) is to do the following:
df.groupby('date').agg(['max', 'min'])
which will pass over all groups once and apply both aggregation functions, max and min, returning two columns for each input column. More on aggregation can be found in the pandas groupby documentation.
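To make the shape of the two-function result concrete, here is a small sketch on the sample rows from the question. The names both, flat, high and low are made up for this example.

```python
import pandas as pd

# The ten sample rows shown in the question.
df = pd.DataFrame({
    "date": ["2005-01-01"] * 5 + ["2005-01-02"] * 5,
    "data_value": [28, 33, 33, 44, 56, 0, 12, 30, 28, 22],
})

# Columns come back as a two-level MultiIndex:
# ("data_value", "max") and ("data_value", "min").
both = df.groupby("date").agg(["max", "min"])
print(both)

# Named aggregation on a single column gives flat column names instead.
flat = df.groupby("date")["data_value"].agg(high="max", low="min")
print(flat)
```

The flat version is usually easier to plot from, since flat["high"] and flat["low"] are ordinary columns.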
Try this:
df.groupby("date")['data_value'].max()
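Applied to the original frame with its Date, Element and Data_Value columns, the whole pipeline might look roughly like this. The rows are invented for illustration, and the sketch assumes Date holds ISO yyyy-mm-dd strings, so that string comparison matches chronological order.

```python
import pandas as pd

# Hypothetical rows using the question's column names.
df = pd.DataFrame({
    "Date": ["2005-01-01", "2005-01-01", "2005-01-01",
             "2005-01-02", "2005-01-02", "2015-01-01"],
    "Element": ["TMAX", "TMAX", "TMIN", "TMAX", "TMIN", "TMAX"],
    "Data_Value": [28, 156, -10, 30, 0, 99],
})

# Keep only 2005-2014.
in_range = df["Date"].between("2005-01-01", "2014-12-31")

# One value per date: highest TMAX reading and lowest TMIN reading.
tmax_by_date = df[in_range & (df["Element"] == "TMAX")].groupby("Date")["Data_Value"].max()
tmin_by_date = df[in_range & (df["Element"] == "TMIN")].groupby("Date")["Data_Value"].min()

# Plain lists, ready to hand to pyplot.
dates = tmax_by_date.index.tolist()
highs = tmax_by_date.tolist()
print(dates, highs)
```

No sorting or reset_index is needed beforehand, because groupby returns the dates in sorted order by default.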
I have a matrix with 6 columns (n x 6) containing the data: year, month, day, hour, minute, use.
I have to make a new matrix containing the aggregated measurements for use, grouped by the value of 'hour', so that all rows recorded within the same hour are combined.
So every time the hour number changes, the code needs to know that a new period starts.
I tried something, but I don't know how to solve this.
Thank you. This is what I tried, plus a test:
def groupby_measurements(data):
    count = -1
    for i in range(9):
        array = np.split(data, np.where(data[i,3] != data[i+1,3])[0][:1])
    return array
print(groupby_measurements(np.array([[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76],
[2006,2,11,10,2,89],
[2006,2,11,10,3,33],
[2006,2,11,14,2,22],
[2006,2,11,14,5,34]])))
For this test case, I expect the intermediate output to be:
np.array([[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76]]),
np.array([[2006,2,11,10,2,89],
[2006,2,11,10,3,33]]),
np.array([[2006,2,11,14,2,22],
[2006,2,11,14,5,34]])
The final output should be:
np.array([2006,2,11,1,0,278]),
np.array([2006,2,11,10,0,122]),
np.array([2006,2,11,14,0,56])
(the sum of use in each of the 3 hour periods)
I would recommend using pandas DataFrames, and then using groupby combined with sum:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array(
[[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76],
[2006,2,11,10,2,89],
[2006,2,11,10,3,33],
[2006,2,11,14,2,22],
[2006,2,11,14,5,34]]),
columns=['year','month','day','hour','minute','use'])
aggregated = data.groupby(['year','month','day','hour'])['use'].sum()
# you can also use .agg and pass which aggregation function you want as a string.
aggregated = data.groupby(['year','month','day','hour'])['use'].agg('sum')
year  month  day  hour
2006  2      11   1       278
                  10      122
                  14       56
aggregated is now a pandas Series. If you want the sums as an array, just do
aggregated.values
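aggregated.values only contains the three sums. To get rows in the shape of the "final output" from the question (year, month, day, hour, 0, summed use), something along these lines should work. It is a sketch that assumes the minute of every aggregated row is simply set to 0; the names hourly and result are made up here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.array([[2006, 2, 11, 1, 1, 55],
              [2006, 2, 11, 1, 11, 79],
              [2006, 2, 11, 1, 32, 2],
              [2006, 2, 11, 1, 41, 66],
              [2006, 2, 11, 1, 51, 76],
              [2006, 2, 11, 10, 2, 89],
              [2006, 2, 11, 10, 3, 33],
              [2006, 2, 11, 14, 2, 22],
              [2006, 2, 11, 14, 5, 34]]),
    columns=["year", "month", "day", "hour", "minute", "use"],
)

# as_index=False keeps the group keys as ordinary columns.
hourly = data.groupby(["year", "month", "day", "hour"], as_index=False)["use"].sum()

# Put a zero "minute" column back in its original position.
hourly.insert(4, "minute", 0)

# One row per hour: year, month, day, hour, 0, summed use.
result = hourly.to_numpy()
print(result)
```

Grouping on year, month and day as well as hour keeps the same hour on different days from being merged into one row.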