I have a dataframe that looks roughly like:
01/01/19 02/01/19 03/01/19 04/01/19
hour
1.0 27.08 47.73 54.24 10.0
2.0 26.06 49.53 46.09 22.0
...
24.0 12.0 34.0 22.0 40.0
I'd like to reduce its dimension to a single column with a proper date index concatenating all the columns. Is there a smart pandas way to do it?
Expected result... something like:
01/01/19 00:00:00 27.08
01/01/19 01:00:00 26.06
...
01/01/19 23:00:00 12.00
02/01/19 00:00:00 47.73
02/01/19 01:00:00 49.53
...
02/01/19 23:00:00 34.00
...
You can stack and then fix the index using pd.to_datetime and pd.to_timedelta:
u = df.stack()
u.index = (pd.to_datetime(u.index.get_level_values(1), dayfirst=True)
           + pd.to_timedelta(u.index.get_level_values(0) - 1, unit='h'))
u.sort_index()
2019-01-01 00:00:00 27.08
2019-01-01 01:00:00 26.06
2019-01-01 23:00:00 12.00
2019-01-02 00:00:00 47.73
2019-01-02 01:00:00 49.53
2019-01-02 23:00:00 34.00
2019-01-03 00:00:00 54.24
2019-01-03 01:00:00 46.09
2019-01-03 23:00:00 22.00
2019-01-04 00:00:00 10.00
2019-01-04 01:00:00 22.00
2019-01-04 23:00:00 40.00
dtype: float64
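A self-contained sketch of the same stack-based approach, for checking it on a small frame. The two-hour, two-day toy data and the explicit format="%d/%m/%y" are assumptions made for illustration; the answer above relies on dayfirst=True instead.

```python
import pandas as pd

# Toy frame shaped like the one in the question (two hours, two days only).
df = pd.DataFrame(
    {"01/01/19": [27.08, 26.06], "02/01/19": [47.73, 49.53]},
    index=pd.Index([1.0, 2.0], name="hour"),
)

# stack() moves the date columns into a second index level: (hour, date).
u = df.stack()

# Day-first dates, parsed with an explicit format to avoid ambiguity.
dates = pd.to_datetime(u.index.get_level_values(1), format="%d/%m/%y")
# Hour 1.0 is treated as 00:00, hence the "- 1".
offsets = pd.to_timedelta(u.index.get_level_values(0) - 1, unit="h")

u.index = dates + offsets
u = u.sort_index()  # sort_index returns a new Series, so keep the result
print(u)
```

Note that sort_index does not modify the Series in place, so the result has to be assigned if it is needed later.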
I have following dataframe:
time u10 ... tsn tp
longitude latitude ...
20.0 45.0 2014-01-01 00:00:00 2.595551 ... 272.453827 0.000014
45.0 2014-01-01 01:00:00 2.615493 ... 273.159973 0.000000
45.0 2014-01-01 02:00:00 2.587403 ... 273.122192 0.000000
45.0 2014-01-01 03:00:00 2.528865 ... 273.050903 0.000000
45.0 2014-01-01 04:00:00 2.556740 ... 272.772491 0.000000
I want to replace each value of column u10 with its difference from the previous record, for all values of column time except those where time ends with 00:00:00.
I need following output:
time u10 ... tsn tp
longitude latitude ...
20.0 45.0 2014-01-01 00:00:00 2.595551 ... 272.453827 0.000014
45.0 2014-01-01 01:00:00 0.019942 ... 273.159973 0.000000
45.0 2014-01-01 02:00:00 -0.02809 ... 273.122192 0.000000
45.0 2014-01-01 03:00:00 -0.058538 ... 273.050903 0.000000
I can use df.loc combined with df['time'].shift()-df['time'], but that applies to every record.
How can I make this work with desired output?
P.S. I am looking for vectorized solution.
Use np.where. Where the time is midnight, keep the u10 value; where it isn't, take the difference between consecutive rows.
import numpy as np
from datetime import time
df['u10'] = np.where(df['time'].dt.time == time(0, 0, 0), df['u10'], df['u10'].diff())
time u10
longitude latitude
20.0 45.0 2014-01-01 00:00:00 2.595551
45.0 2014-01-01 01:00:00 0.019942
45.0 2014-01-01 02:00:00 -0.028090
45.0 2014-01-01 03:00:00 -0.058538
45.0 2014-01-01 04:00:00 0.027875
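The same idea can also be sketched without np.where, using Series.where. This is a minimal example on a toy frame that omits the longitude/latitude index; the u10_diff column name and the normalize-based midnight test are choices made here, not part of the original answer.

```python
import pandas as pd

# Toy frame with the u10 values from the question.
df = pd.DataFrame({
    "time": pd.date_range("2014-01-01", periods=5, freq="h"),
    "u10": [2.595551, 2.615493, 2.587403, 2.528865, 2.556740],
})

# A timestamp is at midnight when it equals itself truncated to the day.
is_midnight = df["time"].dt.normalize() == df["time"]

# Keep the row-to-row difference everywhere except at midnight,
# where the original u10 value is kept instead.
df["u10_diff"] = df["u10"].diff().where(~is_midnight, df["u10"])
print(df)
```

Writing to a separate column keeps the original values available for comparison; assign back to u10 if overwriting is intended.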
import numpy as np
import pandas as pd
import xarray as xr
validIdx = np.ones(365*5, dtype= bool)
validIdx[np.random.randint(low=0, high=365*5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
In the above example, the time-series data ds (or df) has 30 randomly chosen missing records, and they are simply absent rather than stored as NaNs. Therefore, the length of the data is 365x5 - 30, not 365x5.
My question is this: how can I expand ds and df so that the 30 missing values appear as NaNs (so the length will be 365x5)? For example, if the value for "2000-12-02" is missing in the example data, then it will look like:
...
2000-12-01 value 1
2000-12-03 value 2
...
while what I want is:
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
Perhaps you can try resample with 1 hour.
The df without NaNs (just after df = ds.to_dataframe()):
>>> df
foo
time
2000-01-01 00:00:00 0
2000-01-01 01:00:00 1
2000-01-01 02:00:00 2
2000-01-01 03:00:00 3
2000-01-01 04:00:00 4
... ...
2000-03-16 20:00:00 1820
2000-03-16 21:00:00 1821
2000-03-16 22:00:00 1822
2000-03-16 23:00:00 1823
2000-03-17 00:00:00 1824
[1795 rows x 1 columns]
The df with NaNs (df_1h):
>>> df_1h = df.resample('1H').mean()
>>> df_1h
foo
time
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 4.0
... ...
2000-03-16 20:00:00 1820.0
2000-03-16 21:00:00 1821.0
2000-03-16 22:00:00 1822.0
2000-03-16 23:00:00 1823.0
2000-03-17 00:00:00 1824.0
[1825 rows x 1 columns]
Rows with NaN:
>>> df_1h[df_1h['foo'].isna()]
foo
time
2000-01-02 10:00:00 NaN
2000-01-04 07:00:00 NaN
2000-01-05 06:00:00 NaN
2000-01-09 02:00:00 NaN
2000-01-13 15:00:00 NaN
2000-01-16 16:00:00 NaN
2000-01-18 21:00:00 NaN
2000-01-21 22:00:00 NaN
2000-01-23 19:00:00 NaN
2000-01-24 01:00:00 NaN
2000-01-24 19:00:00 NaN
2000-01-27 12:00:00 NaN
2000-01-27 16:00:00 NaN
2000-01-29 06:00:00 NaN
2000-02-02 01:00:00 NaN
2000-02-06 13:00:00 NaN
2000-02-09 11:00:00 NaN
2000-02-15 12:00:00 NaN
2000-02-15 15:00:00 NaN
2000-02-21 04:00:00 NaN
2000-02-28 05:00:00 NaN
2000-02-28 06:00:00 NaN
2000-03-01 15:00:00 NaN
2000-03-02 18:00:00 NaN
2000-03-04 18:00:00 NaN
2000-03-05 20:00:00 NaN
2000-03-12 08:00:00 NaN
2000-03-13 20:00:00 NaN
2000-03-16 01:00:00 NaN
The number of NaNs in df_1h:
>>> df_1h.isnull().sum()
foo 30
dtype: int64
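An alternative sketch that avoids aggregating: build the full hourly index and reindex onto it. The ten-row toy frame and the two dropped positions are assumptions made for illustration. Because the range is built from the first and last timestamps that are present, records missing at either end of the series would not be restored this way.

```python
import numpy as np
import pandas as pd

# Ten hourly records, with positions 3 and 7 removed to mimic the gaps.
full = pd.date_range("2000-01-01", periods=10, freq="h")
df = pd.DataFrame({"foo": np.arange(10)}, index=full).drop(full[[3, 7]])
df.index.name = "time"

# Rebuild the complete hourly index between the first and last record,
# then reindex: timestamps absent from df come back as NaN rows.
complete = pd.date_range(df.index.min(), df.index.max(), freq="h", name="time")
filled = df.reindex(complete)
print(filled)
```

Unlike resample('1H').mean(), reindex does not combine rows, so it fails loudly if the original index contains duplicate timestamps.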
I'm not able to create a Pandas Series of every hour (as datetime objects) of a given year without iterating and adding one hour to the previous value, and that's slow. Is there any way to do that in a vectorized way?
My input would be a year and the output should be a Pandas Series of every hour of that year.
You can use pd.date_range with freq='H' which is hourly frequency:
Edited to end at 23:00:00 after a comment by @ALollz.
year = 2019
pd.Series(pd.date_range(start=f'{year}-01-01', end=f'{year}-12-31 23:00:00', freq='H'))
0 2019-01-01 00:00:00
1 2019-01-01 01:00:00
2 2019-01-01 02:00:00
3 2019-01-01 03:00:00
4 2019-01-01 04:00:00
5 2019-01-01 05:00:00
6 2019-01-01 06:00:00
7 2019-01-01 07:00:00
8 2019-01-01 08:00:00
9 2019-01-01 09:00:00
10 2019-01-01 10:00:00
11 2019-01-01 11:00:00
12 2019-01-01 12:00:00
13 2019-01-01 13:00:00
14 2019-01-01 14:00:00
15 2019-01-01 15:00:00
16 2019-01-01 16:00:00
17 2019-01-01 17:00:00
18 2019-01-01 18:00:00
19 2019-01-01 19:00:00
20 2019-01-01 20:00:00
21 2019-01-01 21:00:00
22 2019-01-01 22:00:00
23 2019-01-01 23:00:00
24 2019-01-02 00:00:00
25 2019-01-02 01:00:00
26 2019-01-02 02:00:00
27 2019-01-02 03:00:00
28 2019-01-02 04:00:00
29 2019-01-02 05:00:00
30 2019-01-02 06:00:00
31 2019-01-02 07:00:00
32 2019-01-02 08:00:00
33 2019-01-02 09:00:00
34 2019-01-02 10:00:00
35 2019-01-02 11:00:00
36 2019-01-02 12:00:00
37 2019-01-02 13:00:00
38 2019-01-02 14:00:00
39 2019-01-02 15:00:00
40 2019-01-02 16:00:00
41 2019-01-02 17:00:00
42 2019-01-02 18:00:00
43 2019-01-02 19:00:00
44 2019-01-02 20:00:00
45 2019-01-02 21:00:00
46 2019-01-02 22:00:00
47 2019-01-02 23:00:00
48 2019-01-03 00:00:00
49 2019-01-03 01:00:00
...
8711 2019-12-29 23:00:00
8712 2019-12-30 00:00:00
8713 2019-12-30 01:00:00
8714 2019-12-30 02:00:00
8715 2019-12-30 03:00:00
8716 2019-12-30 04:00:00
8717 2019-12-30 05:00:00
8718 2019-12-30 06:00:00
8719 2019-12-30 07:00:00
8720 2019-12-30 08:00:00
8721 2019-12-30 09:00:00
8722 2019-12-30 10:00:00
8723 2019-12-30 11:00:00
8724 2019-12-30 12:00:00
8725 2019-12-30 13:00:00
8726 2019-12-30 14:00:00
8727 2019-12-30 15:00:00
8728 2019-12-30 16:00:00
8729 2019-12-30 17:00:00
8730 2019-12-30 18:00:00
8731 2019-12-30 19:00:00
8732 2019-12-30 20:00:00
8733 2019-12-30 21:00:00
8734 2019-12-30 22:00:00
8735 2019-12-30 23:00:00
8736 2019-12-31 00:00:00
8737 2019-12-31 01:00:00
8738 2019-12-31 02:00:00
8739 2019-12-31 03:00:00
8740 2019-12-31 04:00:00
8741 2019-12-31 05:00:00
8742 2019-12-31 06:00:00
8743 2019-12-31 07:00:00
8744 2019-12-31 08:00:00
8745 2019-12-31 09:00:00
8746 2019-12-31 10:00:00
8747 2019-12-31 11:00:00
8748 2019-12-31 12:00:00
8749 2019-12-31 13:00:00
8750 2019-12-31 14:00:00
8751 2019-12-31 15:00:00
8752 2019-12-31 16:00:00
8753 2019-12-31 17:00:00
8754 2019-12-31 18:00:00
8755 2019-12-31 19:00:00
8756 2019-12-31 20:00:00
8757 2019-12-31 21:00:00
8758 2019-12-31 22:00:00
8759 2019-12-31 23:00:00
Length: 8760, dtype: datetime64[ns]
Note if your Python version is lower than 3.6 use .format for string formatting:
year = 2019
pd.Series(pd.date_range(start='{}-01-01'.format(year), end='{}-12-31 23:00:00'.format(year), freq='H'))
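A variant that avoids hard-coding the end timestamp is to compute the number of hours in the year and pass periods instead of end. This is a minimal sketch; the hours_of_year helper name is hypothetical.

```python
import pandas as pd

def hours_of_year(year):
    """Return a Series holding every hour of `year` (hypothetical helper)."""
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    # Number of whole hours between the two New Year midnights.
    n_hours = int((end - start) / pd.Timedelta(hours=1))
    return pd.Series(pd.date_range(start=start, periods=n_hours, freq="h"))

hours_2019 = hours_of_year(2019)
hours_2020 = hours_of_year(2020)  # leap year
print(len(hours_2019), len(hours_2020))
```

Computing the length from the two year boundaries handles leap years without special-casing.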
A dataframe contains only a few timestamps per day and I need to select the latest one for each date (not the values, the timestamp itself). The df looks like this:
A B C
2016-12-05 12:00:00+00:00 126.0 15.0 38.54
2016-12-05 16:00:00+00:00 131.0 20.0 42.33
2016-12-14 05:00:00+00:00 129.0 18.0 43.24
2016-12-15 03:00:00+00:00 117.0 22.0 33.70
2016-12-15 04:00:00+00:00 140.0 23.0 34.81
2016-12-16 03:00:00+00:00 120.0 21.0 32.24
2016-12-16 04:00:00+00:00 142.0 22.0 35.20
I managed to achieve what I needed by defining the following function:
def find_last_h(df, column):
    newindex = []
    df2 = df.resample('d').last().dropna()
    for x in df2[column].values:
        newindex.append(df[df[column] == x].index.values[0])
    return pd.DatetimeIndex(newindex)
with which I specify which column's values to use as a filter to get the desired timestamps. The issue here is that with non-unique values this might not work as desired.
Another way that is used is:
grouped = df.groupby([df.index.day,df.index.hour])
grouped.groupby(level=0).last()
and then reconstruct the timestamps but it is even more verbose. What is the smart way?
Use boolean indexing with a mask created by duplicated, with floor to truncate the times to dates:
idx = df.index.floor('D')
df = df[~idx.duplicated(keep='last') | ~idx.duplicated(keep=False)]
print (df)
A B C
2016-12-05 16:00:00 131.0 20.0 42.33
2016-12-14 05:00:00 129.0 18.0 43.24
2016-12-15 04:00:00 140.0 23.0 34.81
2016-12-16 04:00:00 142.0 22.0 35.20
Another solution with reset_index + set_index:
df = df.reset_index().groupby([df.index.date]).last().set_index('index')
print (df)
A B C
index
2016-12-05 16:00:00 131.0 20.0 42.33
2016-12-14 05:00:00 129.0 18.0 43.24
2016-12-15 04:00:00 140.0 23.0 34.81
2016-12-16 04:00:00 142.0 22.0 35.20
resample and groupby on the dates alone both lose the times:
print (df.resample('1D').last().dropna())
A B C
2016-12-05 131.0 20.0 42.33
2016-12-14 129.0 18.0 43.24
2016-12-15 140.0 23.0 34.81
2016-12-16 142.0 22.0 35.20
print (df.groupby([df.index.date]).last())
A B C
2016-12-05 131.0 20.0 42.33
2016-12-14 129.0 18.0 43.24
2016-12-15 140.0 23.0 34.81
2016-12-16 142.0 22.0 35.20
How about
df.resample('24H', kind='period').last().dropna()
?
You can groupby the date and just take the max of each datetime to get the last datetime on each date.
This may look like:
df.groupby(df["datetime"].dt.date)["datetime"].max()
or something like
df.groupby(pd.Grouper(freq='D'))["datetime"].max()
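For the frame in the question, where the timestamps are the index rather than a column named datetime, the same groupby-and-max idea can be sketched as follows. Only column A is reproduced, and the variable names are illustrative.

```python
import pandas as pd

# Index taken from the question's sample (UTC timestamps), column A only.
idx = pd.to_datetime(
    [
        "2016-12-05 12:00", "2016-12-05 16:00", "2016-12-14 05:00",
        "2016-12-15 03:00", "2016-12-15 04:00", "2016-12-16 03:00",
        "2016-12-16 04:00",
    ],
    utc=True,
)
df = pd.DataFrame({"A": [126.0, 131.0, 129.0, 117.0, 140.0, 120.0, 142.0]}, index=idx)

# Turn the index into a Series so it can be grouped like a column,
# group by calendar day, and take the latest timestamp in each group.
stamps = df.index.to_series()
last_stamps = stamps.groupby(stamps.dt.floor("D")).max()

# Optionally pull the full rows that belong to those timestamps.
last_rows = df.loc[pd.DatetimeIndex(last_stamps)]
print(last_stamps)
print(last_rows)
```

Because the grouping works on the timestamps themselves, repeated values in the data columns do not matter.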
I'm trying to use a pandas dataframe/series to store measurements. These are used as inputs for a thermodynamic simulation, so when resampling, the sum of heat (energy) and the values of temperature should stay constant.
import pandas as pd
import numpy as np
start_date = '2015-01-01 '
my_timing = '60T'
my_ts_index = pd.date_range(start_date, periods=6, freq=my_timing, name='Time')
# read from a csv or sql data base
Heat = [120,200,210,0 ,50, 180]
Temperature = [23.2, 19.1, 15.3, 25.2, 20.1, 0.0]
data=np.array([(Heat),
(Temperature)])
columnNames=['Heat','Temperature']
my_ts=pd.DataFrame(np.transpose(data),columns=columnNames,index=my_ts_index)
print(my_ts)
# GUI checks if len(my_ts)%new_timestep==0
# increase frequency
tmp=my_ts.resample('25T').mean()
tmp['Heat']=tmp['Heat'].fillna(0)
tmp['Temperature']=tmp['Temperature'].ffill()
print(tmp)
# reduce frequency
tmp=tmp.resample('60T').agg({'Heat':lambda x:np.sum(x),'Temperature':np.average})
print(tmp)
tmp=my_ts.resample('75T').agg({'Heat':lambda x:np.sum(x),'Temperature':np.average})
print(tmp)
Output:
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 01:00:00 200.0 19.1
2015-01-01 02:00:00 210.0 15.3
2015-01-01 03:00:00 0.0 25.2
2015-01-01 04:00:00 50.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 00:25:00 0.0 23.2
2015-01-01 00:50:00 200.0 19.1
2015-01-01 01:15:00 0.0 19.1
2015-01-01 01:40:00 210.0 15.3
2015-01-01 02:05:00 0.0 15.3
2015-01-01 02:30:00 0.0 15.3
2015-01-01 02:55:00 0.0 25.2
2015-01-01 03:20:00 0.0 25.2
2015-01-01 03:45:00 50.0 20.1
2015-01-01 04:10:00 0.0 20.1
2015-01-01 04:35:00 0.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.833333
2015-01-01 01:00:00 210.0 17.200000
2015-01-01 02:00:00 0.0 18.600000
2015-01-01 03:00:00 50.0 22.650000
2015-01-01 04:00:00 0.0 20.100000
2015-01-01 05:00:00 180.0 0.000000
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.15
2015-01-01 01:15:00 210.0 15.30
2015-01-01 02:30:00 0.0 25.20
2015-01-01 03:45:00 50.0 20.10
2015-01-01 05:00:00 180.0 0.00
Which operations should be used to resample data to a higher frequency and back to the original without losing anything?
(See: Outputs 1 and 3; the conservation properties of energy and temperature still have to hold in Output 2.)
How can I set the mean/average/sum to start at the 2nd value, so that the first stays untouched as an initial value? Or in other words: how can I assign the mean to the next hour rather than the previous one?
(See: Output 4, first value pair)