Can Python pandas do these tasks?

I have a time series data frame with hourly data for eight years (2013-2020). It covers nine zones, and under each zone there are two columns ("GEN", "LOAD"), as follows:
A ZONE B ZONE ... G ZONE H ZONE I ZONE
date_time GEN LOAD GEN LOAD ... LOAD GEN LOAD GEN LOAD
2013-01-01 00:00:00 725.7 5,859.5 312.2 3,194.7 ... 77.1 706.0 227.1 495.0 861.9
2013-01-01 01:00:00 436.2 450.5 248.0 198.0 ... 865.5 240.7 107.9 640.5 767.3
2013-01-01 02:00:00 464.5 160.2 144.2 068.3 ... 738.7 044.7 32.7 509.3 700.4
2013-01-01 03:00:00 169.9 733.8 268.1 869.5 ... 671.7 649.4 951.3 626.8 652.1
2013-01-01 04:00:00 145.4 553.4 280.2 872.8 ... 761.5 561.0 912.9 552.1 637.3
... ... ... ... ... ... ... ... ... ... ... ...
2020-12-31 19:00:00 450.9 951.7 371.4 516.3 ... 461.7 808.9 471.4 983.7 447.8
2020-12-31 20:00:00 553.0 936.5 848.7 233.9 ... 397.3 978.3 404.3 490.9 233.0
2020-12-31 21:00:00 458.6 735.6 716.8 121.7 ... 385.1 808.0 192.0 131.5 70.1
2020-12-31 22:00:00 515.8 651.6 693.5 142.4 ... 291.4 826.1 16.8 591.9 863.2
2020-12-31 23:00:00 218.6 293.4 448.2 14.2 ... 340.6 435.0 897.4 622.5 768.3
What I want is the following:
1- Detect outliers in each column, defined as values more than three times the standard deviation of that column, and flag them in a new column: "A_gen_outliers" for outliers in the "GEN" column under "A ZONE", "A_load_outliers" for outliers in the "LOAD" column under "A ZONE", and so on. This adds 18 new columns.
2- A new column holding the sum of all "GEN" columns.
3- A new column holding the sum of all "LOAD" columns.
4- For each "GEN" column, a new column (proposed name "A_GEN_div" for zone A) dividing each cell by the maximum of that column within its year: for example 725.7/725.7 = 1 for the first cell, 436.2/725.7 for the second cell, and 218.6/553.0 for the last cell, and likewise for all "GEN" and "LOAD" columns (proposed name "A_Load_div"). This adds another 18 new columns.
In total there are 18*2 + 2 new columns.
Thanks in advance.

I think this might help. Note that this will keep the columns MultiIndex. Your points above seem to imply that you want to flatten your MultiIndex; if that is the case, you might want to look at this question.
1:
df.join(df > (3 * df.std()), rsuffix='_outlier')
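If "outlier" is meant as more than three standard deviations away from the column mean in either direction (rather than simply greater than 3σ), a hedged variant of the same line:
df.join((df - df.mean()).abs() > 3 * df.std(), rsuffix='_outlier')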
2 and 3:
df.groupby(level=-1, axis=1).sum()
Note that it is not clear from your question what the first level of the columns MultiIndex should be for these sums.
4:
maxima = df.resample('1Y').max()                       # yearly maximum of each column, labelled 31 Dec 00:00
maxima.index = maxima.index + pd.DateOffset(hours=23)  # move each label to the last hour of its year
maxima = maxima.reindex(df.index, method='bfill')      # broadcast each year's maximum to every hour of that year
df.join(df.divide(maxima), rsuffix='_div')
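If you do want flat column names instead of the MultiIndex, a minimal sketch (assuming the first level is the zone, e.g. "A ZONE", and the second is "GEN"/"LOAD"; the exact naming here is an illustration, not a fixed convention):
df.columns = [f"{zone.split()[0]}_{kind.capitalize()}" for zone, kind in df.columns]  # ("A ZONE", "GEN") -> "A_Gen"
After flattening, the joins above produce plain names such as "A_Gen_outlier" and "A_Gen_div".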

Related

What is the safest way to iterate through a pandas data frame that has rows for the day broken into minutes?

I have a .csv that I grabbed online here: https://marketplace.spp.org/file-browser-api/download/generation-mix-historical?path=%2FGenMix_2017.csv
The first column is date/time, broken into 5-minute intervals (military time). I need to ensure that the dates are only for 2017, as there is some data at the end of the .csv from 2018. I want to capture all the data, but only in one-hour increments.
For example, in that .csv this would be:
2017-01-01T06:00:00Z to 2017-01-01T06:55:00Z, which is 12 rows.
Only 2017-01-01 starts at 06:00:00; all other days start at 00:00:00.
I was thinking that I might be able to iterate ONLY over the 2017 data in 12-step increments to get the hour blocks of time, and reset once it has run 12*24 times, but I am not sure how to do this.
I am also not sure whether that would be a good idea in terms of future use cases; the times may change or there may be missing data. I am trying to ensure that this won't break if it is used in a few years. It's probably safe to say the company producing this data will continue to produce it the same way, but I guess you never know.
Here is what I have so far:
# reads the CSV returned by the API call into a pandas dataframe
energy_data = pd.read_csv('https://marketplace.spp.org/file-browser-api/download/generation-mix-historical?path=%2FGenMix_2017.csv')
# converts the date/time field into datetime64 format
energy_data['GMT MKT Interval'] = pd.to_datetime(energy_data['GMT MKT Interval'])
# ensures that the entire dataframe can be treated as time series data
energy_data.set_index('GMT MKT Interval', inplace=True)
Use resample with sum:
df = pd.read_csv('GenMix_2017.csv', parse_dates=['GMT MKT Interval'],
                 index_col='GMT MKT Interval')
out = df.resample('H').sum()
Output:
>>> out
Coal Market Coal Self Diesel Fuel Oil Hydro Natural Gas ... Waste Disposal Services Wind Waste Heat Other Average Actual Load
GMT MKT Interval ...
2017-01-01 06:00:00+00:00 34104.7 159041.4 0.0 3220.5 35138.3 ... 113.8 57517.0 0 431.3 303688.602
2017-01-01 07:00:00+00:00 32215.4 156570.6 0.0 3326.3 33545.2 ... 132.9 63397.0 0 422.9 304163.427
2017-01-01 08:00:00+00:00 29604.7 152379.6 0.0 3246.0 33851.4 ... 133.2 64230.5 0 358.1 300871.117
2017-01-01 09:00:00+00:00 28495.9 149474.0 0.0 2973.1 35171.5 ... 131.9 65860.7 0 344.5 298908.514
2017-01-01 10:00:00+00:00 29304.8 146561.1 0.0 3161.2 34315.4 ... 133.8 67882.8 0 340.9 299825.531
... ... ... ... ... ... ... ... ... ... ... ...
2018-01-01 01:00:00+00:00 36071.3 216336.8 55.2 16093.1 93466.6 ... 140.4 75547.5 0 327.6 463542.027
2018-01-01 02:00:00+00:00 35339.9 213596.9 55.2 14378.4 97397.7 ... 114.6 75277.5 0 325.4 459252.079
2018-01-01 03:00:00+00:00 35051.4 217333.2 55.2 12334.3 96351.1 ... 107.3 69376.7 0 328.1 453214.866
2018-01-01 04:00:00+00:00 35220.7 220868.9 53.2 8520.8 98404.2 ... 116.9 60699.7 0 328.5 446139.366
2018-01-01 05:00:00+00:00 35392.1 223590.8 52.2 8980.9 103893.6 ... 131.1 48453.0 0 329.8 439107.888
[8760 rows x 12 columns]
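The question also asks to keep only the 2017 rows (the tail above shows the file spills into 2018); a hedged follow-up, assuming the DatetimeIndex from the code above:
out_2017 = out[out.index.year == 2017]   # drop the 2018 rows after resampling
# or slice before resampling: df = df.loc['2017']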

How to loop through a pandas grouped time series?

I have a dataframe like this:
datetime type d13C ... dayofyear week dmy
1 2018-01-05 15:22:30 air -8.88 ... 5 1 5-1-2018
2 2018-01-05 15:23:30 air -9.08 ... 5 1 5-1-2018
3 2018-01-05 15:24:30 air -10.08 ... 5 1 5-1-2018
4 2018-01-05 15:25:30 air -9.51 ... 5 1 5-1-2018
5 2018-01-05 15:26:30 air -9.61 ... 5 1 5-1-2018
... ... ... ... ... ... ...
341543 2018-12-17 12:42:30 air -9.99 ... 351 51 17-12-2018
341544 2018-12-17 12:43:30 air -9.53 ... 351 51 17-12-2018
341545 2018-12-17 12:44:30 air -9.54 ... 351 51 17-12-2018
341546 2018-12-17 12:45:30 air -9.93 ... 351 51 17-12-2018
341547 2018-12-17 12:46:30 air -9.66 ... 351 51 17-12-2018
Full data here: https://drive.google.com/file/d/1KmOwnpvrG2Edz1AlLyD0CKZlBpaFervM/view?usp=sharing
I'm plotting the d13C column on the Y-axis and the inverse of total_co2 on the X-axis, then fitting a regression line for each day in the data. I then filter out and store the dates I want depending on whether the r^2 value of the regression line is > 0.8, like this:
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import numpy as np
from scipy import stats

df = pd.read_csv('dataset.txt',
                 usecols=['datetime', 'type', 'total_co2', 'd13C', 'day', 'month', 'year',
                          'dayofyear', 'week', 'hour'],
                 dtype={'total_co2': np.float64, 'd13C': np.float64, 'day': str, 'month': str,
                        'year': str, 'week': str, 'hour': str, 'dayofyear': str})

df['dmy'] = df['day'] + '-' + df['month'] + '-' + df['year']  # full date column to make it easier to filter the rows, i.e. each day
# window18 = df[((df['year']=='2018'))]  # selecting just the data from the year 2018

accepted_dates_list = []  # empty list to store the dates we're interested in
for d in df['dmy'].unique():  # pass through each day; .unique() ensures no day is visited twice
    acceptable_date = {}  # dictionary to store the valid dates
    period = df[df.dmy == d]  # define each period from the dmy column
    p = (period['total_co2']) ** -1
    q = period['d13C']
    c, m = polyfit(p, q, 1)  # intercept and gradient of the regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)  # statistical properties of the regression line
    if r_value ** 2 >= 0.8:
        acceptable_date['period'] = d  # populate the dictionary with the accepted dates and corresponding values
        acceptable_date['r-squared'] = r_value ** 2
        acceptable_date['intercept'] = intercept
        accepted_dates_list.append(acceptable_date)  # append the valid entries to the list
    else:
        pass

accepted_dates18 = pd.DataFrame(accepted_dates_list)  # convert the list to a DataFrame
print(accepted_dates18)
But now I want to do the same thing over three-day periods, which I'm trying to select from the dayofyear column (unsure if this is the best way or not). For example, I would want to fit the regression line using all the rows with dayofyear=5, dayofyear=6, dayofyear=7, then the next three days, and so on until the end of the data. There are some days missing, but essentially I just need to do this for every 3 days in the data.
The output dataframe I am then trying to get would list the three-day intervals with r^2 > 0.8, something like this showing the valid date ranges:
Accepted dates
0 23-08-2018 - 25-08-2018
1 26-08-2018 - 28-08-2018
2 31-08-2018 - 02-09-2018
3 15-09-2018 - 17-09-2018
4 24-09-2018 - 26-09-2018
I'm not too sure what to do to iterate over every three days. Any help would go a long way, thanks!
Your code loops through a list of unique dates and filters the dataframe on each iteration.
Pandas implements this with df.groupby(). It can be used to loop over each group, or it can be combined with aggregations, function applications, and transformations; you can read more about it in the user guide. It can form groups according to any of the columns (or set of columns) in df, levels of the index, or any other exogenous list-like with the same length as df (we are grouping rows here, but note it can also group columns). It even has implementations of the most common statistical aggregations such as mean, std, and corr, among many others.
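For example, a quick per-day aggregation using the columns already in your data (a hedged illustration of groupby with an aggregation, not yet the three-day grouping):
df.groupby('dmy')['d13C'].mean()   # one mean d13C value per calendar day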
Now to your problem. You not only want the correlation but the equation, so you do need to loop. And to get three-day groups you can use that dayofyear column with a twist.
Take this data
import io
import pandas as pd

fo = io.StringIO(
'''datetime,d13C
2018-01-05 15:22:30,-8.88
2018-01-05 15:23:30,-9.08
2018-01-06 15:24:30,-10.0
2018-01-06 15:25:30,-9.51
2018-01-07 15:26:30,-9.61
2018-01-07 15:27:30,-9.61
2018-01-08 15:28:30,-9.61
2018-01-08 15:29:30,-9.61
2018-01-09 15:26:30,-9.61
2018-01-09 15:27:30,-9.61
''')
df = pd.read_csv(fo)
df.datetime = pd.to_datetime(df.datetime)
fo.close()
With the code for grouping and looping
first_day = 5
days_to_group = 3
for doy, gdf in df.groupby((df.datetime.dt.dayofyear.sub(first_day) // days_to_group)
                           * days_to_group + first_day):
    print(gdf, '\n')
    print(doy, '\n')
Output
datetime d13C
0 2018-01-05 15:22:30 -8.88
1 2018-01-05 15:23:30 -9.08
2 2018-01-06 15:24:30 -10.00
3 2018-01-06 15:25:30 -9.51
4 2018-01-07 15:26:30 -9.61
5 2018-01-07 15:27:30 -9.61
5
datetime d13C
6 2018-01-08 15:28:30 -9.61
7 2018-01-08 15:29:30 -9.61
8 2018-01-09 15:26:30 -9.61
9 2018-01-09 15:27:30 -9.61
8
Now you can plug your code into this loop and get what you need.
PS
You can also use df.datetime.dt.floor('3d') as the grouper but I am not aware of how to control the first_day, so use it with caution.
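A hedged sketch of plugging the original regression code into that loop (this assumes the full dataset from the question, with its total_co2 column, rather than the small sample above; the output format is an illustration of the ranges the question asks for):
from scipy import stats
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])   # make sure datetime is parsed
first_day = 5
days_to_group = 3

accepted = []
for doy, gdf in df.groupby((df.datetime.dt.dayofyear.sub(first_day) // days_to_group)
                           * days_to_group + first_day):
    p = gdf['total_co2'] ** -1
    q = gdf['d13C']
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)
    if r_value ** 2 >= 0.8:
        accepted.append({'period': f"{gdf.datetime.min():%d-%m-%Y} - {gdf.datetime.max():%d-%m-%Y}",
                         'r-squared': r_value ** 2,
                         'intercept': intercept})

accepted_dates = pd.DataFrame(accepted)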
Here is one approach. As I understand it, the primary goal is to get from current observations (multiple per day) to a 3-day moving average. First, I created a smaller, simpler data set:
import pandas as pd
df = pd.DataFrame({'counter': [*range(100)],
                   'date': pd.date_range('2020-01-01', periods=100, freq='7H')})
df = df.set_index('date')
print(df.head())
counter
date
2020-01-01 00:00:00 0
2020-01-01 07:00:00 1
2020-01-01 14:00:00 2
2020-01-01 21:00:00 3
2020-01-02 04:00:00 4
Second, I re-sampled on a daily basis:
df2 = df['counter'].resample('1D').mean() # <-- called df2
print(df2.head())
date
2020-01-01 1.5
2020-01-02 5.0
2020-01-03 8.5
2020-01-04 12.0
2020-01-05 15.5
Freq: D, Name: counter, dtype: float64
Third, I computed mean value for a 3-day moving window:
print(df2.rolling(3).mean().head())
date
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 5.0
2020-01-04 8.5
2020-01-05 12.0
Freq: D, Name: counter, dtype: float64
Seems like resample().mean() and rolling().mean() would be useful in this case.
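If non-overlapping three-day bins are wanted instead of a moving window (closer to the three-day periods in the question), a hedged variant on the same daily series df2:
print(df2.resample('3D').mean().head())   # fixed, consecutive 3-day bins starting from the first date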

Copy row to another dataframe

I have 2 dataframes whose index type is DatetimeIndex, and I would like to copy one to the other. The dataframes are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df=round(mc.ac.fillna(0)/1000,3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df to a new column in the consumption_year dataframe, but neither of those attempts worked. Looking at the indexes, I see 3 major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8759 and 8760
Do I need to resolve those 3 differences first (making the datetime index of df match consumption_year's) before I can copy one to the other? If so, could you suggest a way to fix those differences?
Those are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
You can merge the two data frames together:
pd.merge(df, consumption_year, left_index=True, right_index=True, how='outer')
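If the aim is simply to add df's values as a new column despite the different years, note that df appears to be a Series (see the "dtype: float64" in its printout), which is why both indexing attempts raised errors. A hedged sketch that copies the values by position and leaves the one extra hour as NaN (df has 8759 rows, consumption_year has 8760):
import pandas as pd

consumption_year['PVProduction'] = pd.Series(df.to_numpy(),
                                             index=consumption_year.index[:len(df)])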

Slicing pandas dataframe by custom months and days -- is there a way to avoid for loops?

The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid for loops (as this is more efficient)? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns]
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
                           df.index.day.isin([sample_day.day])]
                    for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding python for loops under-the-hood) is around 300 times faster. However, this is a slice on just the day -- I am slicing on month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can compare the month and day at the same time by building a single string from both.
The space is needed to differentiate, for example, month 11 / day 2 ("11 2") from month 1 / day 12 ("1 12"); without it, both would become "112" and be regarded as the same.
df.loc[(df.index.month.astype(str) +' '+ df.index.day.astype(str)).isin(sample_days['month'].astype(str)+' '+sample_days['day'].astype(str))]
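The same filter, spelled out with intermediate variables (an equivalent sketch, just for readability):
month_day = df.index.month.astype(str) + ' ' + df.index.day.astype(str)
wanted = sample_days['month'].astype(str) + ' ' + sample_days['day'].astype(str)
sample = df.loc[month_day.isin(wanted)]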
After getting a bit of inspiration from #Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like changing datetime to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and #Ben Pap's answer, and sample 100 days from one year time series for 2020 (8784 hours with the leap day), I get the following solution times:
Original for loop: 0.16s
#Ben Pap's solution, combining month and day into single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.

Python- splitting a csv file into different dataframes according to date or time interval

I have imported a csv file into Python. It has readings at 5-minute intervals over a period of a month, with about 250 readings per 5-minute timestamp. Below is a sample of one row per timestamp. Is there a way to split the csv into different dataframes grouped by date, or even by 5-minute interval, for plotting purposes? As I mentioned, this dataset has 250 readings per 5-minute interval for a month, so I would like to do this without having to hard-code a dataframe for each day or each interval in the set.
df.head()
tmc_code measurement_tstamp ... miles road_order
0 112-05650 2018-05-01 00:00:00 ... 0.427814 768.0
1 112-05650 2018-05-01 00:05:00 ... 0.427814 768.0
2 112-05650 2018-05-01 00:10:00 ... 0.427814 768.0
3 112-05650 2018-05-01 00:15:00 ... 0.427814 768.0
4 112-05650 2018-05-01 00:20:00 ... 0.427814 768.0
What it sounds like to me is that you want a new DataFrame for each date. If that is what you desire, the following code will take your dataframe and make a list of dataframes, each of which only contains data for one date.
df.measurement_tstamp = df.measurement_tstamp.str[:10]
l = df.measurement_tstamp.unique()
data = [df.loc[df['measurement_tstamp']==i] for i in l]
Edit
If you want to do it by 5-minute interval, it's even simpler (just skip the date truncation above, so the full timestamp is kept):
data = [df.loc[df['measurement_tstamp']==i] for i in df.measurement_tstamp.unique()]
That should do it
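An alternative sketch using groupby rather than string slicing (not the answer above, just another option; it assumes measurement_tstamp can be parsed with pd.to_datetime):
import pandas as pd

df['measurement_tstamp'] = pd.to_datetime(df['measurement_tstamp'])

# one DataFrame per calendar date
per_day = {day: g for day, g in df.groupby(df['measurement_tstamp'].dt.date)}

# or one DataFrame per 5-minute timestamp
per_interval = {ts: g for ts, g in df.groupby('measurement_tstamp')}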
