I have 3 resampled pandas dataframes using the same data indexed by datetime.
Each dataframe is resampled using a different timeframe (e.g. 30 min / 60 min / 240 min).
Two of the dataframes have resampled correctly, with their datetimes aligned, because they have an equal number of rows (20). The third dataframe only has 12 rows, because there isn't enough data to create 20 rows when resampling to 240 min.
How can I adjust the 240 min dataframe so its datetimes are aligned with the other 2 dataframes?
For example, every 2nd row in the 30 min dataframe should equal the corresponding row in the 60 min dataframe, and every 4th row in the 60 min dataframe should equal the corresponding row in the 240 min dataframe. But this is not the case, because the 240 min dataframe has resampled the datetimes differently due to there not being enough data to create 20 rows.
If you're just trying to align the different datasets to one index, you can use pd.concat.
import pandas as pd

periods = int(12.5 * 240)  # 3000 minutes of sample data; date_range needs an integer
index = pd.date_range(start='1/1/2018', periods=periods, freq='min')
data = pd.DataFrame(list(range(periods)), index=index)

df1 = data.resample('30min').asfreq()
df2 = data.resample('60min').asfreq()
df3 = data.resample('240min').asfreq()

df4 = pd.concat([df1, df2, df3], axis=1)
print(df4)
Output:
2018-01-01 00:00:00 0 0.0 0.0
2018-01-01 00:30:00 30 NaN NaN
2018-01-01 01:00:00 60 60.0 NaN
2018-01-01 01:30:00 90 NaN NaN
2018-01-01 02:00:00 120 120.0 NaN
... ... ... ...
2018-01-02 23:30:00 2850 NaN NaN
2018-01-03 00:00:00 2880 2880.0 2880.0
2018-01-03 00:30:00 2910 NaN NaN
2018-01-03 01:00:00 2940 2940.0 NaN
2018-01-03 01:30:00 2970 NaN NaN
I have a sensor that measures data every ~60 seconds. There is a little bit of delay between calls, so the data might look like this:
timestamp, value
12:01:45, 100
12:02:50, 90
12:03:55, 87
# 12:04 missing
12:05:00, 91
I only need precision to the minute, not seconds. Since this gathers data all day long, there should be 1440 entries (1440 minutes per day); however, there are some missing timestamps.
I'm loading this into a pd.DataFrame, and I'd like to have 1440 rows no matter what. How could I squeeze in None values for any missing timestamps?
timestamp, value
12:01:45, 100
12:02:50, 90
12:03:55, 87
12:04:00, None # Squeezed in a None value
12:05:00, 91
Additionally, some data is missing for several HOURS, but I'd still like to fill those with None.
Ultimately, I wish to plot the data using matplotlib, with the x-axis ranging between (0, 1440), and the y-axis ranging between (0, 100).
Use Resampler.first with Series.fillna if you need to fill only the values between the first and last timestamps:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.resample('1min', on='timestamp').first()
df['timestamp'] = df['timestamp'].fillna(df.index.to_series())
df = df.reset_index(drop=True)
print(df)
timestamp value
0 2021-09-20 12:01:45 100.0
1 2021-09-20 12:02:50 90.0
2 2021-09-20 12:03:55 87.0
3 2021-09-20 12:04:00 NaN
4 2021-09-20 12:05:00 91.0
If you need all the datetimes for the day, add DataFrame.reindex:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.resample('1min', on='timestamp').first()
# time-only strings resolve to today's date
rng = pd.date_range('00:00:00', '23:59:00', freq='min')
df = df.reindex(rng)
df['timestamp'] = df['timestamp'].fillna(df.index.to_series())
df = df.reset_index(drop=True)
print(df)
timestamp value
0 2021-09-20 00:00:00 NaN
1 2021-09-20 00:01:00 NaN
2 2021-09-20 00:02:00 NaN
3 2021-09-20 00:03:00 NaN
4 2021-09-20 00:04:00 NaN
... ...
1435 2021-09-20 23:55:00 NaN
1436 2021-09-20 23:56:00 NaN
1437 2021-09-20 23:57:00 NaN
1438 2021-09-20 23:58:00 NaN
1439 2021-09-20 23:59:00 NaN
[1440 rows x 2 columns]
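For the plotting goal mentioned in the question, a minimal matplotlib sketch on top of the reindexed frame above (assuming the value column is named value, as in the sample data): after reset_index(drop=True) the row position is exactly the minute of the day, so it can serve directly as the x-axis, and the NaNs simply leave gaps in the line.

import matplotlib.pyplot as plt

# Row position == minute of day (0..1439) after reset_index(drop=True).
plt.plot(df.index, df['value'])
plt.xlim(0, 1440)
plt.ylim(0, 100)
plt.xlabel('Minute of day')
plt.ylabel('Value')
plt.show()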
I have a big weather CSV dataframe containing several hundred thousand rows and many columns. The rows are a time series sampled every 10 minutes over many years. The datetime index consists of year, month, day, hour, minute and second. Unfortunately, there are several thousand missing rows, which contain only NaNs. The goal is to fill them using the values of rows collected at the same time in other years, where those values are not NaN.
I wrote a Python for-loop solution, but it is very time-consuming. I need your help to find a more efficient and faster solution.
The raw dataframe is as follows:
print(df)
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.30
2004-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.40
2004-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.20
2004-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.10
2004-01-01 01:00:00 996.53 -8.51 264.91 -9.31 93.90
... ... ... ... ... ...
2020-12-31 23:20:00 1000.07 -4.05 269.10 -8.13 73.10
2020-12-31 23:30:00 999.93 -3.35 269.81 -8.06 69.71
2020-12-31 23:40:00 999.82 -3.16 270.01 -8.21 67.91
2020-12-31 23:50:00 999.81 -4.23 268.94 -8.53 71.80
2021-01-01 00:00:00 999.82 -4.82 268.36 -8.42 75.70
[820551 rows x 5 columns]
For whatever reason, there are missing rows in the df dataframe. To identify them, it is possible to apply the function below:
findnanrows(df.groupby(pd.Grouper(freq='10min')).mean())
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 NaN NaN NaN NaN NaN
2009-10-08 09:50:00 NaN NaN NaN NaN NaN
2009-10-08 10:00:00 NaN NaN NaN NaN NaN
2013-05-16 09:00:00 NaN NaN NaN NaN NaN
2014-07-30 08:10:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2016-10-28 12:00:00 NaN NaN NaN NaN NaN
2016-10-28 12:10:00 NaN NaN NaN NaN NaN
2016-10-28 12:20:00 NaN NaN NaN NaN NaN
2016-10-28 12:30:00 NaN NaN NaN NaN NaN
2016-10-28 12:40:00 NaN NaN NaN NaN NaN
[5440 rows x 5 columns]
The aim is to fill all these NaN rows. As an example, the first NaN row, which corresponds to the datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of another row collected on the same datetime xxxx-01-01 00:30:00 of another year, like 2005-01-01 00:30:00 or 2006-01-01 00:30:00 and so on, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. It is possible to apply an average over all these other years.
Seeing the values of the row with the datetime index 2005-01-01 00:30:00:
print(df.loc["2005-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2005-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
After filling the row corresponding to the index datetime 2004-01-01 00:30:00 using the values of the row having the index datetime 2005-01-01 00:30:00, the df dataframe will have the following row:
print(df.loc["2004-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
The two functions I created are the following: the first identifies the NaN rows, the second fills them.
def findnanrows(df):
    is_NaN = df.isnull()
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN
def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10min')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        # search later years if we are at the earliest year, earlier years otherwise
        if dnan.year == min(allyears):
            step = 1
        else:
            step = -1
        y = 0
        dnew = dnan
        # walk year by year until we land on a datetime that is not a NaN row
        while dnew in dfnan.index:
            y += 1
            dnew = dnan.replace(year=dnan.year + step * y)
        new_row = pd.DataFrame([fillweatherdata.loc[dnew, :].tolist()],
                               columns=fillweatherdata.columns.tolist(),
                               index=[dnan])
        fillweatherdata = pd.concat([fillweatherdata, new_row], ignore_index=False)
    #fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
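For reference, the loop can be avoided entirely with the averaging variant the question allows ("it is possible to apply an average over all these other years"). A vectorized sketch, assuming the index is a sorted DatetimeIndex; full and filled are illustrative names:

# Materialize the missing timestamps as NaN rows on a regular 10-minute grid.
full = df.asfreq('10min')

# Group by (month, day, time of day) and fill each NaN with the group mean,
# i.e. the average of the same clock time on the same calendar day across
# all years that do have data. Slots with no data in any year stay NaN.
key = [full.index.month, full.index.day, full.index.time]
filled = full.fillna(full.groupby(key).transform('mean'))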
I have a multiindexed dataframe (but with more columns)
2020-12-22 09:47:50 2020-12-23 16:43:45 2020-12-22 15:00
Lines VehicleNumber
102 9405 3 NaN 3
9415 NaN NaN NaN
9416 NaN NaN NaN
Now I want to sort the columns so that the earliest date is the first column and the latest is the last. After that I want to delete the columns that are not between two dates, say 2020-12-22 10:00:00 < date < 2020-12-23 10:00:00. I tried transposing the dataframe, but that does not seem to work with a multiindex.
Expected output:
2020-12-22 15:00 2020-12-23 16:43:45
Lines VehicleNumber
102 9405 3 NaN
9415 NaN NaN
9416 NaN NaN
So first we sort the columns by date and then check whether each falls between the two dates:
2020-12-22 10:00:00 < date < 2020-12-23 10:00:00, hence one column is deleted.
First convert str columns to date time columns:
In [2244]: df.columns = pd.to_datetime(df.columns)
Then, sort df based on datetimes:
In [2246]: df = df.reindex(sorted(df.columns), axis=1)
Suppose you want to keep only the columns that are greater than the following:
In [2251]: x = '2020-12-22 10:00:00'
Use a list comprehension:
In [2257]: m = [i for i in df.columns if i > pd.to_datetime(x)]
In [2258]: df[m]
Out[2258]:
2020-12-22 15:00:00 2020-12-23 16:43:45
Lines VehicleNumber
102 9405.0 3.0 NaN
9415 NaN NaN NaN
9416 NaN NaN NaN
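To apply both bounds from the question rather than only the lower one, the same idea extends to a boolean mask over the converted columns. A sketch, reusing the bounds from the question:

lo = pd.to_datetime('2020-12-22 10:00:00')
hi = pd.to_datetime('2020-12-23 10:00:00')

# Keep only the columns strictly between the two timestamps.
df_between = df.loc[:, (df.columns > lo) & (df.columns < hi)]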
I have a dataframe with a multi-index: "subject" and "datetime".
Each row corresponds to a subject and a datetime, and columns of the dataframe correspond to various measurements.
The range of days differs per subject, and some days can be missing for a given subject (see example). Moreover, a subject can have one or several values for a given day.
I want to resample the dataframe so that:
there is only one row per day per subject (I do not care about time of day),
each column value is the last non-NaN of the day (and NaN if there is no value for that day),
days with no values on any column are not created or kept.
For instance, the following dataframe example:
a b
subject datetime
patient1 2018-01-01 00:00:00 2.0 high
2018-01-01 01:00:00 NaN medium
2018-01-01 02:00:00 6.0 NaN
2018-01-01 03:00:00 NaN NaN
2018-01-02 00:00:00 4.3 low
patient2 2018-01-01 00:00:00 NaN medium
2018-01-01 02:00:00 NaN NaN
2018-01-01 03:00:00 5.0 NaN
2018-01-03 00:00:00 9.0 NaN
2018-01-04 02:00:00 NaN NaN
should return:
a b
subject datetime
patient1 2018-01-01 00:00:00 6.0 medium
2018-01-02 00:00:00 4.3 low
patient2 2018-01-01 00:00:00 5.0 medium
2018-01-03 00:00:00 9.0 NaN
I spent too much time trying to obtain this using resample with the 'pad' option, but I always get errors or a result I don't want. Can anybody help?
Note: here is code to create the example dataframe:
import pandas as pd
import numpy as np

index = pd.MultiIndex.from_product([['patient1', 'patient2'],
                                    pd.date_range('20180101', periods=4, freq='h')])
df = pd.DataFrame({'a': [2, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5],
                   'b': ['high', 'medium', np.nan, np.nan, 'medium', 'low', np.nan, np.nan]},
                  index=index)
df.index.names = ['subject', 'datetime']
df = df.drop(df.index[5])
df.at[('patient2', '2018-01-03 00:00:00'), 'a'] = 9
df.at[('patient2', '2018-01-04 02:00:00'), 'a'] = None
df.at[('patient1', '2018-01-02 00:00:00'), 'a'] = 4.3
df.at[('patient1', '2018-01-02 00:00:00'), 'b'] = 'low'
df = df.sort_index(level=['subject', 'datetime'])
Let's floor the datetimes to daily frequency, then group the dataframe by subject + floored timestamp, aggregate with last, and finally drop the rows that are all NaN:
i = pd.to_datetime(df.index.get_level_values(1)).floor('d')
df1 = df.groupby(['subject', i]).agg('last').dropna(how='all')
a b
subject datetime
patient1 2018-01-01 6.0 medium
2018-01-02 4.3 low
patient2 2018-01-01 5.0 medium
2018-01-03 9.0 NaN
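As a side note, the same grouping can be spelled with pd.Grouper on the datetime level. A sketch; it relies on GroupBy.last skipping NaNs, which is exactly the "last non-NaN of the day" behavior the question asks for:

df1 = (df.groupby(['subject', pd.Grouper(level='datetime', freq='D')])
         .last()
         .dropna(how='all'))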
# drop rows where a and b are both NaN; we don't need them
df = df.reset_index().dropna(subset=["a", "b"], how="all")
# add a day column; we need it to keep the last value per day
df["dt_day"] = df["datetime"].dt.date
# d1 is the result dataframe, to which we add a and b
d1 = (df.copy()
        .drop_duplicates(subset=["subject", "dt_day"])
        .loc[:, ["subject", "datetime"]]
        .reset_index(drop=True))
# add a and b to our result dataframe
for col in ["a", "b"]:
    d1.loc[:, col] = (df.copy()
                        .dropna(subset=[col])
                        .drop_duplicates(subset=["subject", "dt_day"], keep="last")[col]
                        .reset_index(drop=True))
Wall time: 24 ms
# Shubham Sharma's code => Wall time: 2.94 ms
subject datetime a b
0 patient1 2018-01-01 6.0 medium
1 patient1 2018-01-02 4.3 low
2 patient2 2018-01-01 5.0 medium
3 patient2 2018-01-03 9.0 NaN
thanks for your question :)
This should do the job:
from datetime import datetime

def day_agg(series_):
    try:
        return series_.dropna().iloc[-1]
    except IndexError:  # the day has no non-NaN value
        return float("nan")

df = df.reset_index().sort_values("datetime")
df.groupby([df["subject"], df.datetime.map(lambda x: datetime(year=x.year, month=x.month, day=x.day))])\
  .agg({"a": day_agg, "b": day_agg})\
  .dropna(how="all")
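A lighter alternative to the lambda, assuming datetime is a datetime64 column, is df.datetime.dt.floor('D'), which truncates each timestamp to midnight without constructing new datetime objects.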
I have a MultiIndex DataFrame with gappy date values on level 1, like this:
import random
import numpy as np
import pandas as pd

np.random.seed(456)
random.seed(456)  # random.sample below draws from the stdlib RNG, so seed it too
j = [(a, b) for a in ['A', 'B', 'C']
     for b in random.sample(pd.date_range('2018-01-01', periods=100, freq='D').tolist(), 5)]
j.sort()
i = pd.MultiIndex.from_tuples(j, names=['Name', 'Date'])
# np.random.randint(0, 101) replaces the removed np.random.random_integers(0, 100)
df = pd.DataFrame(np.random.randint(0, 101, 15), i, columns=['Vals'])
# print(df):
Vals
Name Date
A 2018-01-01 27
2018-01-08 43
2018-03-26 89
2018-03-29 42
2018-04-01 28
B 2018-01-02 79
2018-01-26 60
2018-02-18 45
2018-03-11 37
2018-03-23 92
C 2018-03-17 39
2018-03-20 81
2018-03-21 11
2018-03-27 77
2018-04-08 69
For each level 0 value, I want to fill in the index level 1 with every calendar date between the min and max date values for that level 0. (This Q&A addresses the scenario of filling in level 1 with the same value set for all level 0 values.)
E.g., for subset = df.loc['A'] I want to insert rows so that subset.index.values == pd.date_range(subset.index.values.min(), subset.index.values.max()).values. I.e., the resulting DataFrame would look like:
Vals
Name Date
A 2018-01-01 27
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 43
2018-01-09 NaN
...
Is there a pandas-idiomatic way to accomplish this?
(The best I can come up with is to inefficiently and iteratively append new DataFrames for each level 0 value, or, similarly, to iteratively construct a list of index values and then pandas.concat them with the original DataFrame.)
You can use asfreq:
df.groupby(level=0).apply(lambda x: x.reset_index(level=0, drop=True).asfreq("D"))
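Here asfreq('D') inserts a NaN row for every missing calendar day between each group's first and last date, which is the per-group fill asked for. An equivalent, slightly shorter spelling, assuming as above that the dates sit on index level 1:

# Same idea: drop the 'Name' level inside each group, then upsample to daily.
df.groupby(level=0).apply(lambda x: x.droplevel(0).asfreq('D'))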