I have a DataFrame with a MultiIndex of muni and popDate (mock input below). For each muni index I want to do a resampling of the form .resample('A').mean() on the popDate index, so that pandas fills in the missing years; the resulting NaN values should then be replaced by linear interpolation. How do I do that?
Update: Some mock input DataFrame:
import pandas as pd

interData = pd.DataFrame({'muni': ['Q1','Q1','Q1','Q2','Q2','Q2'],
                          'popDate': ['2015','2021','2022','2015','2017','2022'],
                          'population': [5,11,22,15,17,22]})
interData['popDate'] = pd.to_datetime(interData['popDate'])
interData = interData.set_index(['muni','popDate'])
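Printing interData then shows the MultiIndex:
                 population
muni popDate
Q1   2015-01-01           5
     2021-01-01          11
     2022-01-01          22
Q2   2015-01-01          15
     2017-01-01          17
     2022-01-01          22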
It looks like you want a groupby.resample:
interData.groupby(level='muni').resample('A', level='popDate').mean()
Output:
population
muni popDate
Q1 2015-12-31 5.0
2016-12-31 NaN
2017-12-31 NaN
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 11.0
2022-12-31 22.0
Q2 2015-12-31 15.0
2016-12-31 NaN
2017-12-31 17.0
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 NaN
2022-12-31 22.0
If you also need interpolation, combine with interpolate:
out = (interData.groupby(level='muni')
                .apply(lambda g: g.resample('A', level='popDate').mean()
                                  .interpolate(method='time'))
      )
Output:
population
muni popDate
Q1 2015-12-31 5.000000
2016-12-31 6.001825
2017-12-31 7.000912
2018-12-31 8.000000
2019-12-31 8.999088
2020-12-31 10.000912
2021-12-31 11.000000
2022-12-31 22.000000
Q2 2015-12-31 15.000000 # 366 days between 2015-12-31 and 2016-12-31
2016-12-31 16.001368 # 365 days between 2016-12-31 and 2017-12-31
2017-12-31 17.000000
2018-12-31 17.999452
2019-12-31 18.998905
2020-12-31 20.001095
2021-12-31 21.000548
2022-12-31 22.000000
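The fractional values come from method='time', which weights by the actual number of days between the annual stamps (hence the leap-year comments above). If you would rather have evenly spaced steps per missing year, a sketch of the same pipeline with method='linear':
out = (interData.groupby(level='muni')
                .apply(lambda g: g.resample('A', level='popDate').mean()
                                  .interpolate(method='linear'))
      )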
import numpy as np
import pandas as pd
import xarray as xr
validIdx = np.ones(365 * 5, dtype=bool)
validIdx[np.random.randint(low=0, high=365 * 5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
In the above example, the time-series data ds (or df) has 30 randomly chosen missing records, without those appearing as NaNs. Therefore, the length of data is 365x5 - 30, not 365x5.
My question is this: how can I expand ds and df to have the 30 missing values as NaNs (so the length will be 365x5)? For example, if the value at "2000-12-02" were missing, the data would look like:
...
2000-12-01 value 1
2000-12-03 value 2
...
What I want instead is:
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
Perhaps you can try resample with a 1 hour frequency.
The df without NaNs (just after df = ds.to_dataframe()):
>>> df
foo
time
2000-01-01 00:00:00 0
2000-01-01 01:00:00 1
2000-01-01 02:00:00 2
2000-01-01 03:00:00 3
2000-01-01 04:00:00 4
... ...
2000-03-16 20:00:00 1820
2000-03-16 21:00:00 1821
2000-03-16 22:00:00 1822
2000-03-16 23:00:00 1823
2000-03-17 00:00:00 1824
[1795 rows x 1 columns]
The df with NaNs (df_1h):
>>> df_1h = df.resample('1H').mean()
>>> df_1h
foo
time
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 4.0
... ...
2000-03-16 20:00:00 1820.0
2000-03-16 21:00:00 1821.0
2000-03-16 22:00:00 1822.0
2000-03-16 23:00:00 1823.0
2000-03-17 00:00:00 1824.0
[1825 rows x 1 columns]
Rows with NaN:
>>> df_1h[df_1h['foo'].isna()]
foo
time
2000-01-02 10:00:00 NaN
2000-01-04 07:00:00 NaN
2000-01-05 06:00:00 NaN
2000-01-09 02:00:00 NaN
2000-01-13 15:00:00 NaN
2000-01-16 16:00:00 NaN
2000-01-18 21:00:00 NaN
2000-01-21 22:00:00 NaN
2000-01-23 19:00:00 NaN
2000-01-24 01:00:00 NaN
2000-01-24 19:00:00 NaN
2000-01-27 12:00:00 NaN
2000-01-27 16:00:00 NaN
2000-01-29 06:00:00 NaN
2000-02-02 01:00:00 NaN
2000-02-06 13:00:00 NaN
2000-02-09 11:00:00 NaN
2000-02-15 12:00:00 NaN
2000-02-15 15:00:00 NaN
2000-02-21 04:00:00 NaN
2000-02-28 05:00:00 NaN
2000-02-28 06:00:00 NaN
2000-03-01 15:00:00 NaN
2000-03-02 18:00:00 NaN
2000-03-04 18:00:00 NaN
2000-03-05 20:00:00 NaN
2000-03-12 08:00:00 NaN
2000-03-13 20:00:00 NaN
2000-03-16 01:00:00 NaN
The number of NaNs in df_1h:
>>> df_1h.isnull().sum()
foo 30
dtype: int64
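If you prefer not to aggregate, reindexing against the full hourly range gives the same result, and xarray can resample directly as well (a sketch; ds, df and the date range as built above):
full_range = pd.date_range("2000-01-01", freq="H", periods=365 * 5)
df_full = df.reindex(full_range)        # 1825 rows, NaN at the 30 missing stamps

ds_1h = ds.resample(time="1H").mean()   # the xarray-side equivalent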
I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different fill method for each column. For user and total I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'] = resampled['user'].ffill()
resampled['total'] = resampled['total'].ffill()
resampled['value'] = resampled['value'].fillna(0)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do per column. I tried the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output: the sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and is there a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, given that resampled = df.resample('5T').ffill() would forward fill every column (undesired here, as it would also fill the value column). The closest I have come is to resample each column individually and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
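For reference, the per-column loop above can also be written without explicit iteration (a sketch, reusing the same d dictionary):
resampled = pd.concat(
    {k: df[k].resample('5T').apply(v) for k, v in d.items()}, axis=1
)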
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
I am reading many CSV files. Each one contains time series data. For example:
import pandas as pd
csv_a = [['2019-05-25 10:00', 25, 60],
         ['2019-05-25 10:05', 26, 25],
         ['2019-05-25 10:10', 27, 63],
         ['2019-05-25 10:20', 28, 62]]
df_a = pd.DataFrame(csv_a, columns=["Timestamp", "Temperature", "Humidity"])
df_a["Timestamp"] = pd.to_datetime(df_a["Timestamp"])

csv_b = [['2019-05-25 10:05', 1020],
         ['2019-05-25 10:10', 1021],
         ['2019-05-25 10:15', 1019],
         ['2019-05-25 10:45', 1035]]
df_b = pd.DataFrame(csv_b, columns=["Timestamp", "Pressure"])
df_b["Timestamp"] = pd.to_datetime(df_b["Timestamp"])
After creating these DataFrames, we can see:
print(df_a)
Timestamp Temperature Humidity
0 2019-05-25 10:00:00 25 60
1 2019-05-25 10:05:00 26 25
2 2019-05-25 10:10:00 27 63
3 2019-05-25 10:20:00 28 62
print(df_b)
Timestamp Pressure
0 2019-05-25 10:05:00 1020
1 2019-05-25 10:10:00 1021
2 2019-05-25 10:15:00 1019
3 2019-05-25 10:45:00 1035
I want to create a new DataFrame with a regular index, for example:
import datetime as dt
start = dt.datetime(2019,5,25,10,0,0)
end = dt.datetime(2019,5,25,10,20,0)
index = pd.date_range(start, end, freq='5min')
And then start appending each time series as a different column, filling the missing values with NaN and discarding values outside my index.
Desired output:
Temperature Humidity Pressure
Timestamp
2019-05-25 10:00:00 25.0 60.0 NaN
2019-05-25 10:05:00 26.0 25.0 1020.0
2019-05-25 10:10:00 27.0 63.0 1021.0
2019-05-25 10:15:00 NaN NaN 1019.0
2019-05-25 10:20:00 28.0 62.0 NaN
And I also want to do this as efficiently as possible; say I have hundreds of CSVs and long series.
I have been messing with pandas functions like concat and append, but I am not able to obtain what I want.
As I understand it, you already have a custom DatetimeIndex and want to join each time series on that index. Try combine_first and reindex (note that both frames need Timestamp set as their index first). If you have multiple time series to join, use a loop or functools.reduce.
df_out = (df_b.set_index('Timestamp')
              .combine_first(df_a.set_index('Timestamp'))
              .reindex(index))
Out[1063]:
Humidity Pressure Temperature
2019-05-25 10:00:00 60.0 NaN 25.0
2019-05-25 10:05:00 25.0 1020.0 26.0
2019-05-25 10:10:00 63.0 1021.0 27.0
2019-05-25 10:15:00 NaN 1019.0 NaN
2019-05-25 10:20:00 62.0 NaN 28.0
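With hundreds of frames, functools.reduce expresses the same combine_first chain (a sketch, assuming every frame has Timestamp set as its index first):
import functools

frames = [df_a.set_index('Timestamp'), df_b.set_index('Timestamp')]  # ...and the rest
df_out = functools.reduce(lambda acc, nxt: nxt.combine_first(acc), frames)
df_out = df_out.reindex(index)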
If your time series don't all share the same column names, you may try join; just list the other frames inside the brackets []:
df_out = df_a.set_index('Timestamp').join([df_b.set_index('Timestamp')], how='outer').reindex(index)
Out[1068]:
Temperature Humidity Pressure
2019-05-25 10:00:00 25.0 60.0 NaN
2019-05-25 10:05:00 26.0 25.0 1020.0
2019-05-25 10:10:00 27.0 63.0 1021.0
2019-05-25 10:15:00 NaN NaN 1019.0
2019-05-25 10:20:00 28.0 62.0 NaN
Use DataFrame.merge. You can then check with Series.diff to discard rows where there is a time jump greater than the period. (You could also choose another criterion to exclude rows; let me know if you want a different one.)
df2 = (df_a.merge(df_b, on='Timestamp', how='outer')
           .sort_values('Timestamp'))

# gap to the previous row; bfill so the first row inherits its successor's gap
diff = df2['Timestamp'].diff().abs().bfill()
# keep only rows whose gap equals the smallest one (the regular 5 min period)
mask = diff.eq(diff.min())

new_df = df2.loc[mask].set_index('Timestamp')
print(new_df)
# Temperature Humidity Pressure
#Timestamp
#2019-05-25 10:00:00 25.0 60.0 NaN
#2019-05-25 10:05:00 26.0 25.0 1020.0
#2019-05-25 10:10:00 27.0 63.0 1021.0
#2019-05-25 10:15:00 NaN NaN 1019.0
#2019-05-25 10:20:00 28.0 62.0 NaN
You could also select the frequency explicitly and rule out the rows that don't comply:
df2 = (df_a.merge(df_b, on='Timestamp', how='outer')
           .set_index('Timestamp'))

new_df = (df2.reindex(pd.date_range(df2.index.min(), df2.index.max(), freq='5min'))
             # keep rows up to (but not including) the first all-NaN timestamp,
             # which drops everything from 10:25 onwards, including 10:45
             .loc[lambda x: x.isna().all(axis=1).cumsum().eq(0)])
Or simply enter the lower and upper limits statically, as you say in your question.
Did you try pd.merge?
pd.merge(df_a, df_b, how='outer').set_index('Timestamp').sort_index()
output:
Temperature Humidity Pressure
Timestamp
2019-05-25 10:00:00 25.0 60.0 NaN
2019-05-25 10:05:00 26.0 25.0 1020.0
2019-05-25 10:10:00 27.0 63.0 1021.0
2019-05-25 10:15:00 NaN NaN 1019.0
2019-05-25 10:20:00 28.0 62.0 NaN
2019-05-25 10:45:00 NaN NaN 1035.0
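Note that the 10:45 row survives this merge; to also discard timestamps outside the index built in the question, chain a reindex (a sketch):
pd.merge(df_a, df_b, how='outer').set_index('Timestamp').sort_index().reindex(index)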
I have an issue similar to "ValueError: cannot reindex from a duplicate axis". The solution there isn't provided.
I have an Excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although that is not shown in the sample below. I want to reindex the time column at 5 minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at 5 min frequency so that I can interpolate the NaNs later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is a datetime type, then try
ts.asfreq('5T')
or use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
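A minimal sketch of the difference, on synthetic data with one missing stamp:
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0],
              index=pd.to_datetime(['2018-04-01 00:05',
                                    '2018-04-01 00:10',
                                    '2018-04-01 00:20']))
print(s.asfreq('5T'))                  # inserts 00:15 as NaN
print(s.asfreq('5T', method='ffill'))  # 00:15 takes the 00:10 value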
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. In this example three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing entirely.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see the catch above: your date data was converted to datetime, but your time data is left as plain objects. Below, a proper index is created by combining the two with a lambda function (pd.datetime has been removed from modern pandas, so pd.Timestamp.combine is used here instead):
rawidx = rawpd.apply(lambda r: pd.Timestamp.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd database as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a frame ready for interpolation.
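For example, a time-weighted linear fill (a sketch):
filled = targpd.interpolate(method='time')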
I have got it to work. Thank you everyone for your time. Here is the working code:
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
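To finish the interpolation the question was building toward, a sketch continuing this code:
df = df.set_index('Date_Time')
df_interp = df.interpolate(method='time')  # fills the NaN rows added by the resample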
I have a DataFrame like this
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN 8.891460
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN 2.876649
1979-07-31 18:00:00 6.066667 1.438324
I want to obtain a DataFrame like this:
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN NaN
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN NaN
1979-07-31 18:00:00 6.066667 1.438324
What I would do is reindex: dropna() removes every row containing a NaN, and reindexing back to the original index reinstates those rows as all-NaN.
df.dropna().reindex(df.index)
Use mask:
df.mask(df.gauge.isna())
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN NaN
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN NaN
1979-07-31 18:00:00 6.066667 1.438324
Use np.where to add NaN:
import numpy as np
df['satellite'] = np.where(df['gauge'].isnull(), np.nan, df['satellite'])
Second solution: use .loc and isnull:
df.loc[df['gauge'].isnull(), 'satellite'] = np.nan
You can use np.where:
df['satellite'] = np.where(df['gauge'].isna(), np.nan, df['satellite'])
df['gauge'] = np.where(df['satellite'].isna(), np.nan, df['gauge'])
You need to find whether a row has any NaN; .any(axis=1) gives you a per-row boolean mask.
df.loc[df.isna().any(axis=1)] = np.nan
Output:
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN NaN
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN NaN
1979-07-31 18:00:00 6.066667 1.438324