I am trying to use two timestamps from one dataframe to extract (slice) a subset of a different dataframe in which those timestamps appear in the index. Here is what I've tried.
In [152]: wt_filtered.head()
Out[152]:
start_time end_time tot_PR
81 2013-07-13 05:00:00 2013-07-13 07:00:00 0.015
164 2013-10-31 19:00:00 2013-10-31 21:00:00 0.030
234 2013-12-09 16:00:00 2013-12-09 18:00:00 0.015
295 2014-01-11 02:00:00 2014-01-11 07:00:00 0.060
325 2014-02-05 17:00:00 2014-02-05 19:00:00 0.015
And this is my second dataframe:
In [160]: my_df.head()
Out[160]:
ValidLevel ValidVelocity ... GW_Level PR
DateTime ...
2013-06-07 00:00:00 2.07 0.91 ... 444.74 0.000
2013-06-07 01:00:00 2.01 0.46 ... 444.74 0.010
2013-06-07 02:00:00 1.82 0.54 ... 444.74 0.005
2013-06-07 03:00:00 1.98 0.68 ... 444.74 0.005
2013-06-07 04:00:00 1.92 0.59 ... 444.74 0.015
I want to extract data from the second data frame using the start_time and end_time as parameters in the slicing process.
I tried this:
my_df[wt_filtered.start_time[0]:wt_filtered.end_time[0]]
And it did not work. Maybe I am overlooking something, but I cannot find the answer.
I want to get something like this:
DateTime ValidLevel ValidVelocity ... GW_Level PR
2013-07-13 05:00:00 1.24 0.99 ... 445.06 0.005
2013-07-13 06:00:00 1.29 1.51 ... 445.08 0.005
2013-07-13 07:00:00 1.57 1.44 ... 445.11 0.005
A good practice in this situation is to access the dataframe's data with the .loc and .iloc methods: .loc slices the DatetimeIndex by label, while .iloc[0] fetches the first row positionally (plain [0] fails here because the first label of wt_filtered is 81, not 0):
my_df.loc[wt_filtered.start_time.iloc[0]:wt_filtered.end_time.iloc[0],:]
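For instance, to collect the matching slice for every event in wt_filtered rather than just the first row, one could loop over its rows (a sketch, assuming both frames are exactly as shown above):
import pandas as pd

# one sub-frame of my_df per (start_time, end_time) pair in wt_filtered
slices = [my_df.loc[row.start_time:row.end_time]
          for row in wt_filtered.itertuples()]
events = pd.concat(slices)  # all matching hourly rows in one frame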
Hope it is useful.
I have two types of text files which I need to read into a pandas DataFrame. I have a problem with the datetimes and the separator.
File A:
2009,7,1,3,101,13.03,89.33,0.6,287.69,0
2009,7,1,6,102,19.3,55,1,288.67,0
2009,7,1,9,103,22.33,39.67,1,289.6,0
2009,7,1,12,104,21.97,41,1,295.68,0
File B:
2019 9 1 3.00 101 14.02 92.08 2.62 174.77 0.109
2019 9 1 6.00 102 13.79 92.86 2.79 179.29 0.046
2019 9 1 9.00 103 13.81 92.60 2.73 178.94 0.070
2019 9 1 12.00 104 13.31 95.20 2.91 179.38 0.015
fileA.txt has no extra spaces; fileB.txt has one extra space at the beginning of each line. I can read each of them as follows, and the results are correct:
>>> import pandas as pd
>>> from datetime import datetime as dtdt
>>> par3 = lambda x: dtdt.strptime(x, '%Y %m %d %H')
>>> par4 = lambda x: dtdt.strptime(x, '%Y %m %d %H.%M')
>>> df3=pd.read_csv('fileA.txt',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par3, index_col='Date')
>>> df3
4 5 6 7 8 9
Date
2009-07-01 03:00:00 101 13.03 89.33 0.6 287.69 0
2009-07-01 06:00:00 102 19.30 55.00 1.0 288.67 0
2009-07-01 09:00:00 103 22.33 39.67 1.0 289.60 0
2009-07-01 12:00:00 104 21.97 41.00 1.0 295.68 0
>>> dg3=pd.read_csv('fileB.txt',sep='\s+',engine='python',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par4, index_col='Date')
>>> dg3
4 5 6 7 8 9
Date
2019-09-01 03:00:00 101 14.02 92.08 2.62 174.77 0.109
2019-09-01 06:00:00 102 13.79 92.86 2.79 179.29 0.046
2019-09-01 09:00:00 103 13.81 92.60 2.73 178.94 0.070
2019-09-01 12:00:00 104 13.31 95.20 2.91 179.38 0.015
Question: how do I read both of these file types with the same command? The only way I can think of is to first open the file and read the first line to deduce the hour format (column 3) and the separator, but that feels like a non-Pythonic way.
Also, if the hour reading is a float such as 3.75, it would be OK to round it to the nearest integer and just set the minute reading to 0.
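For what it's worth, a minimal sketch of a single reader for both layouts: it treats commas as whitespace (which also swallows fileB's leading space) and rounds a fractional hour as suggested above. read_mixed is an illustrative name, not an existing function:
import pandas as pd

def read_mixed(path):
    rows = []
    with open(path) as fh:
        for line in fh:
            parts = line.replace(',', ' ').split()  # handles both separators
            if not parts:
                continue
            y, m, d, h = parts[:4]
            # round a fractional hour (e.g. 3.75 -> 4) and drop the minutes
            stamp = pd.Timestamp(int(y), int(m), int(d), round(float(h)))
            rows.append([stamp] + [float(v) for v in parts[4:]])
    return pd.DataFrame(rows).set_index(0).rename_axis('Date')

df3 = read_mixed('fileA.txt')
dg3 = read_mixed('fileB.txt')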
I am trying to extract the minimum value for each day in a dataset containing hourly prices. I want to do this for every hour separately, since I later want to add other information to each hour before combining the dataset again (which is why I want to keep the hour in the datetime).
This is my data:
Price_REG1 Price_REG2 ... Price_24_3 Price_24_4
date ...
2020-01-01 00:00:00 30.83 30.83 ... NaN NaN
2020-01-01 01:00:00 28.78 28.78 ... NaN NaN
2020-01-01 02:00:00 28.45 28.45 ... 30.83 30.83
2020-01-01 03:00:00 27.90 27.90 ... 28.78 28.78
2020-01-01 04:00:00 27.52 27.52 ... 28.45 28.45
To extract the minimum I use this command:
df_min_1 = df_hour_1[['Price_REG1', 'Price_REG2', 'Price_REG3',
'Price_REG4']].between_time('00:00', '23:00').resample('d').min()
Which leaves me with this:
Price_REG1 Price_REG2 Price_REG3 Price_REG4
date
2020-01-01 25.07 25.07 25.07 25.07
2020-01-02 12.07 12.07 12.07 12.07
2020-01-03 0.14 0.14 0.14 0.14
2020-01-04 3.83 3.83 3.83 3.83
2020-01-05 25.77 25.77 25.77 25.77
I understand that the resample collapses each timestamp to midnight, but I want to know if there is any way to avoid this, or any other way to achieve the result I am after.
To clarify, this is what I would like to have:
Price_REG1 Price_REG2 Price_REG3 Price_REG4
date
2020-01-01 01:00:00 25.07 25.07 25.07 25.07
2020-01-02 01:00:00 12.07 12.07 12.07 12.07
2020-01-03 01:00:00 0.14 0.14 0.14 0.14
2020-01-04 01:00:00 3.83 3.83 3.83 3.83
2020-01-05 01:00:00 25.77 25.77 25.77 25.77
I did not find a nice solution to this problem, but I managed to get where I wanted with this method:
import datetime

t = datetime.timedelta(hours=1)           # one-hour offset
df_min = df_min.reset_index()
df_min['date'] = df_min['date'] + t       # move midnight stamps to 01:00
df_min.set_index('date', inplace=True)
df_hour_1 = pd.concat([df_hour_1, df_min], axis=1)
That is, I first create a timedelta of 01:00:00, then reset the index to be able to add the timedelta to the date column. In this way I am able to concat df_hour and df_min while still keeping the time, so I can concat all 24 datasets in a later step.
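For what it's worth, the same shift can be written without resetting the index, a sketch assuming df_min still carries the DatetimeIndex produced by the resample:
import pandas as pd

# shift the midnight stamps from resample('d').min() to 01:00,
# then align with the hourly frame on the shared DatetimeIndex
df_min.index = df_min.index + pd.Timedelta(hours=1)
df_hour_1 = pd.concat([df_hour_1, df_min], axis=1)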
So I have two data frames.
Energy: (100 columns)
Affluent Adversity Affluent Comfortable Adversity \
Time
2019-01-01 01:00:00 0.254 0.244 0.155 0.215 0.274
2019-01-01 02:00:00 0.346 0.154 0.083 0.246 0.046
2019-01-01 03:00:00 0.309 0.116 0.085 0.220 0.139
2019-01-01 04:00:00 0.302 0.158 0.083 0.226 0.186
2019-01-01 05:00:00 0.181 0.171 0.096 0.246 0.051
... ... ... ... ... ...
2019-12-31 20:00:00 1.102 0.263 2.157 0.209 2.856
2019-12-31 21:00:00 0.712 0.269 1.409 0.212 0.497
2019-12-31 22:00:00 0.398 0.274 0.073 0.277 0.199
2019-12-31 23:00:00 0.449 0.452 0.072 0.252 0.183
2020-01-01 00:00:00 0.466 0.291 0.110 0.203 0.117
loadshift: (1 column)
Time load_difference
2019-01-01 01:00:00 0.10
2019-01-01 02:00:00 0.10
2019-01-01 03:00:00 0.15
2019-01-01 04:00:00 0.10
2019-01-01 05:00:00 0.10
... ...
2019-12-31 20:00:00 -0.10
2019-12-31 21:00:00 0.10
2019-12-31 22:00:00 0.15
2019-12-31 23:00:00 0.10
2020-01-01 00:00:00 -0.10
All I want to do is add the load difference to the Energy frame, so, for example, the first affluent house at 1 am would change to 0.354. I have been able to use concat to multiply in my other models but am somehow really struggling with this.
Expected output(but for all 8760 hours):
Affluent Adversity Affluent Comfortable Adversity \
Time
2019-01-01 01:00:00 0.354 0.344 0.255 0.315 0.374
2019-01-01 02:00:00 0.446 0.254 0.183 0.346 0.146
2019-01-01 03:00:00 0.459 0.266 0.235 0.370 0.289
2019-01-01 04:00:00 0.402 0.258 0.183 0.326 0.286
2019-01-01 05:00:00 0.281 0.271 0.196 0.346 0.151
I have tried: Energy.add(loadshift, fill_value=0)
but I get
Concatenation operation is not implemented for NumPy arrays, use np.concatenate() instead. Please do not rely on this error; it may not be given on all Python implementations.
also tried:
df_merged = pd.concat([Energy,loadshift], ignore_index=True, sort=False)
df_merged = Energy.append(loadshift)
this prints:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
How do I go about fixing these errors? Thanks.
Try merge and add:
# merge the two frames on the index, which is time in this case
df = loadshift.merge(energy, left_index=True, right_index=True)
# add the load difference to all the columns
new = df[df.columns[1:]].add(df['load_difference'], axis=0)
Affluent Adversity Affluent.1 Comfortable Adversity.1
Time
2019-01-01 01:00:00 0.354 0.344 0.255 0.315 0.374
2019-01-01 02:00:00 0.446 0.254 0.183 0.346 0.146
2019-01-01 03:00:00 0.459 0.266 0.235 0.370 0.289
2019-01-01 04:00:00 0.402 0.258 0.183 0.326 0.286
2019-01-01 05:00:00 0.281 0.271 0.196 0.346 0.151
2019-12-31 20:00:00 1.002 0.163 2.057 0.109 2.756
2019-12-31 21:00:00 0.812 0.369 1.509 0.312 0.597
2019-12-31 22:00:00 0.548 0.424 0.223 0.427 0.349
2019-12-31 23:00:00 0.549 0.552 0.172 0.352 0.283
2020-01-01 00:00:00 0.366 0.191 0.010 0.103 0.017
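As an aside, the original Energy.add(loadshift, fill_value=0) cannot broadcast as intended, because adding two DataFrames aligns on column names and Energy has no load_difference column. Selecting the Series and broadcasting it down the rows should work as well, a sketch assuming both frames share the same Time index:
# axis=0 broadcasts the Series along the index, so every energy
# column receives that hour's load difference
new = Energy.add(loadshift['load_difference'], axis=0)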
I have a pandas dataframe which looks like this:
Concentr 1 Concentr 2 Time
0 25.4 0.48 00:01:00
1 26.5 0.49 00:02:00
2 25.2 0.52 00:03:00
3 23.7 0.49 00:04:00
4 23.8 0.55 00:05:00
5 24.6 0.53 00:06:00
6 26.3 0.57 00:07:00
7 27.1 0.59 00:08:00
8 28.8 0.56 00:09:00
9 23.9 0.54 00:10:00
10 25.6 0.49 00:11:00
11 27.5 0.56 00:12:00
12 26.3 0.55 00:13:00
13 25.3 0.54 00:14:00
and I want to keep the max value of Concentr 1 in every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example I would like to have:
Concentr 1 Concentr 2 Time
0 26.5 0.49 00:02:00
1 28.8 0.56 00:09:00
2 27.5 0.56 00:12:00
My current approach would be: i) create an auxiliary variable with an ID for each 5-minute interval (e.g., 00:00 to 00:05 is interval 1, 00:05 to 00:10 is interval 2, etc.), ii) use the interval variable in a groupby to get the max Concentr 1 per interval, and iii) merge back to the initial df on both the interval variable and Concentr 1, thereby identifying the corresponding time (a sketch of this follows below).
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.
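For reference, that three-step approach might be sketched like this (assuming Time is datetime-like; interval is an illustrative helper column):
df['interval'] = pd.to_datetime(df['Time']).dt.floor('5min')
interval_max = df.groupby('interval')['Concentr 1'].max().reset_index()
# ties within an interval would keep more than one row here
result = df.merge(interval_max, on=['interval', 'Concentr 1'])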
You can do a regular resample / groupby, and use the idxmax method to get the desired row for each group. Then use that to index your original data:
>>> df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
Concentr 1 Concentr 2 Time
1 26.5 0.49 2021-10-09 00:02:00
8 28.8 0.56 2021-10-09 00:09:00
11 27.5 0.56 2021-10-09 00:12:00
This assumes your 'Time' column is datetime-like, which I ensured with pd.to_datetime. You can convert the time column back with strftime. So in full:
df['Time'] = pd.to_datetime(df['Time'])
result = df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
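Note that recent pandas versions deprecate the 'T' frequency alias in favour of 'min', so the same call would read:
result = df.loc[df.resample('5min', on='Time')['Concentr 1'].idxmax()]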
df = df.set_index('Time')
# idxmax returns the label of the row holding each 5-minute bin's maximum
idx = df.resample('5T')['Concentr 1'].idxmax()
df = df.loc[idx]
Then you would probably need to reset_index() if you do not wish Time to be your index.
You can also use this:
Group by every n=5 rows and filter the original df on the index of each group's "Concentr 1" maximum:
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
One year's data is shown as follows:
datetime data
2008-01-01 00:00:00 0.044
2008-01-01 00:30:00 0.031
2008-01-01 01:00:00 -0.25
.....
2008-01-01 23:00:00 0.036
2008-01-01 23:30:00 0.42
2008-01-02 00:00:00 0.078
2008-01-02 00:30:00 0.008
2008-01-02 01:00:00 0.09
2008-01-02 01:30:00 0.054
.....
2008-12-31 22:00:00 0.55
2008-12-31 22:30:00 0.05
2008-12-31 23:00:00 0.08
2008-12-31 23:30:00 0.033
There is one value per half-hour. I want the sum of all values in each day, i.e. to convert it to one row per day (365 rows, or 366 for a leap year such as 2008).
year day sum values
2008 1 *
2008 2 *
...
2008 364 *
2008 365 *
You can use dt.year + dt.dayofyear with groupby and aggregate sum:
df = df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear]).sum()
print (df)
data
datetime datetime
2008 1 -0.175
2 0.230
31 0.456
366 0.713
And if you need a DataFrame, it is possible to convert the index to columns and set the column names with rename_axis + reset_index:
df = (df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear])['data']
        .sum()
        .rename_axis(('year','dayofyear'))
        .reset_index())
print (df)
year dayofyear data
0 2008 1 -0.175
1 2008 2 0.230
2 2008 31 0.456
3 2008 366 0.713
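If datetime is set as the index, a plain daily resample gives the same totals keyed by calendar date instead of (year, dayofyear), a sketch:
# one row per calendar day; sum() totals the 48 half-hourly values
daily = df.set_index('datetime')['data'].resample('D').sum()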