I have a DataFrame whose index is a date and whose single column holds np.arrays of shape 180x360. I want to calculate the weekly mean of the data set. Example of the DataFrame:
vika geop
1990-01-01 06:00:00 [[50995.954225, 50995.954225, 50995.954225, 50...
1990-01-02 06:00:00 [[51083.0576138, 51083.0576138, 51083.0576138,...
1990-01-03 06:00:00 [[51045.6321168, 51045.6321168, 51045.6321168,...
1990-01-04 06:00:00 [[50499.8436192, 50499.8436192, 50499.8436192,...
1990-01-05 06:00:00 [[49823.5114237, 49823.5114237, 49823.5114237,...
1990-01-06 06:00:00 [[50050.5148846, 50050.5148846, 50050.5148846,...
1990-01-07 06:00:00 [[50954.5188533, 50954.5188533, 50954.5188533,...
1990-01-08 06:00:00 [[50995.954225, 50995.954225, 50995.954225, 50...
1990-01-09 06:00:00 [[50628.1596088, 50628.1596088, 50628.1596088,...
What I've tried so far is the simple
df = df.resample('W-MON')
But I get this error:
pandas.core.groupby.DataError: No numeric types to aggregate
I've tried to change the datatype of the column to list, but it still does not work. Any idea of how to do it with resample, or any other method?
You can use Panel to represent 3D data (Panel was deprecated in pandas 0.20 and removed in 0.25, so this needs an older pandas):
import pandas as pd
import numpy as np
index = pd.date_range("2012/01/01", "2012/02/01")
p = pd.Panel(np.random.rand(len(index), 3, 4), items=index)
p.resample("W-MON")
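On pandas versions without Panel, one possible alternative is to stack the per-day arrays into a single 3-D NumPy array and average over the rows that fall in each week. The sketch below is only an illustration under assumptions: the data is a hypothetical stand-in (2x3 arrays instead of 180x360), and the names series, stacked and weekly_mean are made up for the example. Weeks are labelled with to_period("W-MON"), i.e. weeks ending on Monday, to mirror the 'W-MON' rule from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: one small 2-D array per day.
index = pd.date_range("1990-01-01 06:00", periods=9, freq="D")
arrays = [np.full((2, 3), float(i)) for i in range(len(index))]

# Build an object column element by element so each cell holds a whole array.
col = np.empty(len(arrays), dtype=object)
for i, arr in enumerate(arrays):
    col[i] = arr
series = pd.Series(col, index=index, name="geop")

# Stack the cells into one array of shape (n_days, 2, 3).
stacked = np.stack(series.to_list())

# Label each day with the week (ending on Monday) it belongs to.
weeks = series.index.to_period("W-MON")

# Average the daily arrays of each week along the time axis.
weekly_mean = {week: stacked[weeks == week].mean(axis=0) for week in weeks.unique()}

for week, mean_array in weekly_mean.items():
    print(week, mean_array.shape)
```

Each value in weekly_mean keeps the shape of the daily arrays, so the result can be stacked again with np.stack if a single 3-D array is more convenient.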
I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the header being [Date, Time, Value]. I want to reformat the DataFrame so that it is indexed by date and uses the time as the header (i.e. 0:00, 1:00, ... , 23:00). The DataFrame will be filled in with the value.
Here is the DataFrame I currently have.
Essentially, I'd like to move the index to a single day and show the hours through the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns, with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[:, 0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
You could try this.
Split the time string to keep only the hour, then prefix it with 'hr'.
import pandas as pd
df = pd.DataFrame([['2019-01-01', '00:00:00',-127.57],['2019-01-01', '01:00:00',-137.57],['2019-01-02', '00:00:00',-147.57],], columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr'+ str(int(x.split(':')[0])))
print(pd.pivot_table(df, values ='Totals', index=['Date'], columns = 'hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN
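One practical difference between the two answers is how they treat repeated (Date, Time) pairs: pivot refuses to reshape when a pair occurs more than once, while pivot_table aggregates the duplicates (by mean unless another aggfunc is given). The following sketch uses made-up values and hypothetical variable names just to show that behaviour.

```python
import pandas as pd

# Hypothetical data containing a duplicated (Date, Time) pair.
df = pd.DataFrame(
    [
        ["2019-01-01", "00:00:00", 1.0],
        ["2019-01-01", "00:00:00", 3.0],  # same Date and Time as the row above
        ["2019-01-01", "01:00:00", 5.0],
    ],
    columns=["Date", "Time", "Totals"],
)

# pivot cannot decide which of the two values to keep, so it raises ValueError.
try:
    df.pivot(index="Date", columns="Time", values="Totals")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates duplicates; the default aggregation is the mean.
table = pd.pivot_table(df, values="Totals", index="Date", columns="Time")

print(pivot_failed)
print(table)
```

If the data is known to have exactly one value per date and time, pivot is the stricter choice because it surfaces unexpected duplicates instead of silently averaging them.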
I have a dataframe which looks like this:
pressure mean pressure std
2016-03-01 00:00:00 615.686441 0.138287
2016-03-01 01:00:00 615.555000 0.067460
2016-03-01 02:00:00 615.220000 0.262840
2016-03-01 03:00:00 614.993333 0.138841
2016-03-01 04:00:00 615.075000 0.072778
2016-03-01 05:00:00 615.513333 0.162049
................
The first column is the index column.
I want to create a new dataframe with only the rows of 3pm and 3am,
so it will look like this:
pressure mean pressure std
2016-03-01 03:00:00 614.993333 0.138841
2016-03-01 15:00:00 616.613333 0.129493
2016-03-02 03:00:00 615.600000 0.068889
..................
Any ideas?
Thank you!
I couldn't load your data using pd.read_clipboard(), so I'm going to recreate some data:
df = pd.DataFrame(index=pd.date_range('2016-03-01', freq='H', periods=72),
data=np.random.random(size=(72,2)),
columns=['pressure', 'mean'])
Now your dataframe should have a DatetimeIndex. If not, you can use df.index = pd.to_datetime(df.index).
Then it's really easy using boolean indexing (.loc is used here because .ix has been removed from recent pandas versions):
df.loc[(df.index.hour == 3) | (df.index.hour == 15)]
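When more than a couple of hours are needed, the chain of comparisons can be replaced with isin on the hour attribute of the index. This is a small sketch on recreated random data, so the column values are placeholders and the name subset is only illustrative.

```python
import numpy as np
import pandas as pd

# Recreated hourly data covering three days (values are random placeholders).
df = pd.DataFrame(
    np.random.random(size=(72, 2)),
    index=pd.date_range("2016-03-01", periods=72, freq="h"),
    columns=["pressure mean", "pressure std"],
)

# Keep only the rows whose hour of day is 3 or 15.
subset = df[df.index.hour.isin([3, 15])]
print(subset)
```

With three days of hourly data this keeps six rows, two per day.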
I have data in a csv file which appears as:
DateTime Temp
10/1/2016 0:00 20.35491156
10/1/2016 1:00 19.75320845
10/1/2016 4:00 17.62411292
10/1/2016 5:00 18.30190001
10/1/2016 6:00 19.37101638
I am reading this CSV file as:
import numpy as np
import pandas as pd
d2 = pd.Series.from_csv(r'C:\PowerCurve.csv')
d3 = d2.interpolate(method='time')
My goal is to fill the missing hours 2 and 3 by interpolating from nearby values, i.e. every time data is missing it should be interpolated.
However, d3 doesn't show any interpolation.
Edit:
Based on suggestions below my Python 2.7 still errors out. I am trying the following:
import pandas as pd
d2 = pd.Series.from_csv(r'C:\PowerCurve.csv')
d2.set_index('DateTime').resample('H').interpolate()
Error is:
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 2672, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'set_index'
Use resample with the datetime as index, then apply the fill method that fits your need. For instance, to forward-fill (ffill replaces the older pad, which recent pandas versions no longer provide):
df.set_index('DateTime').resample('1H').ffill()
Out[23]:
Temp
DateTime
2016-10-01 00:00:00 20.354912
2016-10-01 01:00:00 19.753208
2016-10-01 02:00:00 19.753208
2016-10-01 03:00:00 19.753208
2016-10-01 04:00:00 17.624113
2016-10-01 05:00:00 18.301900
2016-10-01 06:00:00 19.371016
Use the interpolate method after resampling on an hourly basis:
d2.set_index('DateTime').resample('H').interpolate()
If d2 is already a Series with a DatetimeIndex, then set_index is not needed:
d2.resample('H').interpolate()
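Putting the pieces together, a self-contained sketch of the whole flow might look like the following. The CSV text is inlined with rounded, made-up temperatures so the result is easy to check by eye, and the names csv_text, temps and hourly are assumptions for the example. It reads with pd.read_csv rather than Series.from_csv, which is not available in recent pandas versions.

```python
import io

import pandas as pd

# Made-up sample in the same layout as the question; hours 2 and 3 are missing.
csv_text = """DateTime,Temp
10/1/2016 0:00,20.0
10/1/2016 1:00,19.0
10/1/2016 4:00,16.0
10/1/2016 5:00,18.0
10/1/2016 6:00,19.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse the timestamps explicitly as month/day/year hour:minute.
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%m/%d/%Y %H:%M")

# Index by time, insert the missing hourly rows, then fill them linearly.
temps = df.set_index("DateTime")["Temp"]
hourly = temps.resample("h").interpolate()

print(hourly)
```

The resample step is what creates the rows for 02:00 and 03:00; calling interpolate on the original series alone has nothing to fill, which is consistent with d3 showing no change in the question.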
I have been using the between_time method of TimeSeries in pandas, which returns all values between the specified times, regardless of their date.
But I need to select both date and time, because my timeseries structure
contains multiple dates.
One way of solving this, though quite inflexible, is just iterate over the values and remove those which are not relevant.
Is there a more elegant way of doing this?
You can select the dates that are of interest first, and then use between_time. For example, suppose you have a time series of 72 hours:
import pandas as pd
from numpy.random import randn
rng = pd.date_range('1/1/2013', periods=72, freq='H')
ts = pd.Series(randn(len(rng)), index=rng)
To select the values between 20:00 and 22:00 on the 2nd and 3rd of January you can simply do:
ts['2013-01-02':'2013-01-03'].between_time('20:00', '22:00')
Giving you something like this:
2013-01-02 20:00:00 0.144399
2013-01-02 21:00:00 0.886806
2013-01-02 22:00:00 0.126844
2013-01-03 20:00:00 -0.464741
2013-01-03 21:00:00 1.856746
2013-01-03 22:00:00 -0.286726
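The date slice above covers a contiguous range. If the dates of interest are not adjacent, one option is to build a boolean mask from the normalized index first and apply between_time afterwards. This is a sketch with deterministic placeholder values; wanted_dates, on_wanted_dates and selected are hypothetical names.

```python
import numpy as np
import pandas as pd

# 72 hourly values starting at midnight on 2013-01-01 (placeholder data).
rng = pd.date_range("2013-01-01", periods=72, freq="h")
ts = pd.Series(np.arange(72.0), index=rng)

# Dates to keep; they do not need to be consecutive.
wanted_dates = pd.to_datetime(["2013-01-01", "2013-01-03"])

# normalize() sets every timestamp to midnight so it can be compared by date.
on_wanted_dates = ts.index.normalize().isin(wanted_dates)

# Filter by date first, then by time of day.
selected = ts[on_wanted_dates].between_time("20:00", "22:00")
print(selected)
```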
I have a time series of intraday data that looks like this:
ts =pd.Series(np.random.randn(60),index=pd.date_range('1/1/2000',periods=60, freq='2h'))
I am hoping to transform the data into a DataFrame with one column per date and one row per time of day.
I have tried these,
key = lambda x:x.date()
grouped = ts.groupby(key)
But how do I transform the groups into a DataFrame with dates as columns? Or is there a better way?
import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=60, freq='2h')
ts = pd.Series(np.random.randn(60), index = index)
key = lambda x: x.time()
groups = ts.groupby(key)
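# One column per time of day, then one row per day via a daily mean; transpose so dates become columns.
print(pd.DataFrame({k: g for k, g in groups}).resample('D').mean().T)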
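out (the rows shown are 3 hours apart, so this run appears to have used freq='3h' rather than the '2h' above):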
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05 2000-01-06 \
00:00:00 0.109959 -0.124291 -0.137365 0.054729 -1.305821 -1.928468
03:00:00 1.336467 0.874296 0.153490 -2.410259 0.906950 1.860385
06:00:00 -1.172638 -0.410272 -0.800962 0.568965 -0.270307 -2.046119
09:00:00 -0.707423 1.614732 0.779645 -0.571251 0.839890 0.435928
12:00:00 0.865577 -0.076702 -0.966020 0.589074 0.326276 -2.265566
15:00:00 1.845865 -1.421269 -0.141785 0.433011 -0.063286 0.129706
18:00:00 -0.054569 0.277901 0.383375 -0.546495 -0.644141 -0.207479
21:00:00 1.056536 0.031187 -1.667686 -0.270580 -0.678205 0.750386
2000-01-07 2000-01-08
00:00:00 -0.657398 -0.630487
03:00:00 2.205280 -0.371830
06:00:00 -0.073235 0.208831
09:00:00 1.720097 -0.312353
12:00:00 -0.774391 NaN
15:00:00 0.607250 NaN
18:00:00 1.379823 NaN
21:00:00 0.959811 NaN
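A similar table can also be produced without groupby by splitting the index into separate date and time columns and pivoting on them. The sketch below uses deterministic values instead of random ones so the layout is easy to follow; frame and wide are hypothetical names.

```python
import numpy as np
import pandas as pd

# Same shape as the question's series, but with predictable values 0..59.
index = pd.date_range("1/1/2000", periods=60, freq="2h")
ts = pd.Series(np.arange(60.0), index=index)

# Split each timestamp into its calendar date and its time of day.
frame = pd.DataFrame(
    {"date": ts.index.date, "time": ts.index.time, "value": ts.to_numpy()}
)

# Rows become times of day, columns become dates.
wide = frame.pivot(index="time", columns="date", values="value")
print(wide)
```

With a 2-hour frequency and 60 points this gives 12 rows and 5 columns. Dates that lack a given time would show up as NaN, as in the last column of the output above.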