I have a dataframe with a multi-index: "subject" and "datetime".
Each row corresponds to a subject and a datetime, and columns of the dataframe correspond to various measurements.
The range of days differs per subject, and some days can be missing for a given subject (see the example). Moreover, a subject can have one or several values for a given day.
I want to resample the dataframe so that:
there is only one row per day per subject (I do not care about time of day),
each column value is the last non-NaN of the day (and NaN if there is no value for that day),
days with no values on any column are not created or kept.
For instance, the following dataframe example:
a b
subject datetime
patient1 2018-01-01 00:00:00 2.0 high
2018-01-01 01:00:00 NaN medium
2018-01-01 02:00:00 6.0 NaN
2018-01-01 03:00:00 NaN NaN
2018-01-02 00:00:00 4.3 low
patient2 2018-01-01 00:00:00 NaN medium
2018-01-01 02:00:00 NaN NaN
2018-01-01 03:00:00 5.0 NaN
2018-01-03 00:00:00 9.0 NaN
2018-01-04 02:00:00 NaN NaN
should return:
a b
subject datetime
patient1 2018-01-01 00:00:00 6.0 medium
2018-01-02 00:00:00 4.3 low
patient2 2018-01-01 00:00:00 5.0 medium
2018-01-03 00:00:00 9.0 NaN
I have spent too much time trying to obtain this using resample with the 'pad' option, but I either get errors or a result that isn't what I want. Can anybody help?
Note: here is the code to create the example dataframe:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([['patient1', 'patient2'],
                                    pd.date_range('20180101', periods=4, freq='h')])
df = pd.DataFrame({'a': [2, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5],
                   'b': ['high', 'medium', np.nan, np.nan, 'medium', 'low', np.nan, np.nan]},
                  index=index)
df.index.names = ['subject', 'datetime']
df = df.drop(df.index[5])
df.at[('patient2', '2018-01-03 00:00:00'), 'a'] = 9
df.at[('patient2', '2018-01-04 02:00:00'), 'a'] = None
df.at[('patient1', '2018-01-02 00:00:00'), 'a'] = 4.3
df.at[('patient1', '2018-01-02 00:00:00'), 'b'] = 'low'
df = df.sort_index(level=['subject', 'datetime'])
Let's floor the datetimes to daily frequency, then group the dataframe by subject plus the floored timestamp, aggregate with 'last' (which skips NaNs), and finally drop the rows that are all NaN:
i = pd.to_datetime(df.index.get_level_values(1)).floor('d')
df1 = df.groupby(['subject', i]).agg('last').dropna(how='all')
a b
subject datetime
patient1 2018-01-01 6.0 medium
2018-01-02 4.3 low
patient2 2018-01-01 5.0 medium
2018-01-03 9.0 NaN
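An equivalent spelling, if you'd rather not pull the level values out of the index yourself, is pd.Grouper (a sketch; GroupBy.last skips NaNs by default, so the result should match the above):
# Group by the 'subject' level and the 'datetime' level floored to days.
df1 = (df.groupby(['subject', pd.Grouper(level='datetime', freq='D')])
         .last()
         .dropna(how='all'))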
# Drop the rows where a and b are both NaN; we don't need them.
df = df.reset_index().dropna(subset=["a", "b"], how="all")
# Add a day column; we need it to keep the last value of each day.
df["dt_day"] = df["datetime"].dt.date
# d1 is the result dataframe, to which we will add a and b.
d1 = (df.drop_duplicates(subset=["subject", "dt_day"])
        .loc[:, ["subject", "datetime"]]
        .reset_index(drop=True))
# Add a and b to the result dataframe.
for col in ["a", "b"]:
    d1.loc[:, col] = (df.dropna(subset=[col])
                        .drop_duplicates(subset=["subject", "dt_day"], keep="last")[col]
                        .reset_index(drop=True))
Wall time: 24 ms
# Shubham Sharma's code => Wall time: 2.94 ms
subject datetime a b
0 patient1 2018-01-01 6.0 medium
1 patient1 2018-01-02 4.3 low
2 patient2 2018-01-01 5.0 medium
3 patient2 2018-01-03 9.0 NaN
thanks for your question :)
This should do the job:
from datetime import datetime

def day_agg(series_):
    try:
        return series_.dropna().iloc[-1]
    except IndexError:  # every value of the day is NaN
        return float("nan")

df = df.reset_index().sort_values("datetime")
df.groupby([df["subject"], df.datetime.map(lambda x: datetime(year=x.year, month=x.month, day=x.day))])\
    .agg({"a": day_agg, "b": day_agg})\
    .dropna(how="all")
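As a side note, the lambda that truncates each timestamp to its day can be replaced with .dt.normalize() (or .dt.floor('d')); a minimal sketch of the same groupby, assuming df has been reset as above:
# Normalize keeps the datetime dtype but zeroes the time-of-day part.
daily = (df.groupby([df["subject"], df["datetime"].dt.normalize()])
           .agg({"a": day_agg, "b": day_agg})
           .dropna(how="all"))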
Related
I am trying to add some dataframes that contain NaN values. The data frames are indexed by a time series, and in my case a NaN is meaningful: it means that a measurement wasn't done. So if all the data frames I'm adding have a NaN for a given timestamp, I need the result to have a NaN for this timestamp. But if one or more of them have a value for the timestamp, I need the sum of these values.
EDIT: Also, in my case, a 0 is different from a NaN: it means that there was a measurement and it measured 0 activity, unlike a NaN, which means that there was no measurement. So any solution using fillna(0) won't work.
I haven't found a proper way to do this yet. Here is an example of what I want to do:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'value': [0, 1, 1, 1, np.NaN, np.NaN, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df2 = pd.DataFrame({'value': [0, 5, 5, 5, 5, 5, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df1 + df2
What I get:
df1 + df2
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 NaN
2020-01-01 00:50:00 NaN
2020-01-01 01:00:00 NaN
What I would want to have as a result:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
Does anybody know a clean way to do so?
Thank you.
(I'm using Python 3.9.1 and pandas 1.2.4)
You can use add with the fill_value=0 option. fill_value is only used when exactly one of the two operands is missing, so the "all NaN" combinations stay NaN:
df1.add(df2, fill_value=0)
Output:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
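If you have more than two frames to sum with this rule, the same call chains naturally with functools.reduce (a sketch, assuming the frames share a compatible index):
from functools import reduce

dfs = [df1, df2]  # extend with more frames as needed
# fill_value kicks in only when exactly one operand is missing,
# so positions that are NaN in every frame stay NaN.
total = reduce(lambda acc, d: acc.add(d, fill_value=0), dfs)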
I want to aggregate the data of two pandas DataFrames into one, where the column total needs to be filled from the previously existing values. Here is my code:
import pandas as pd
df1 = pd.DataFrame({
'date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-05'],
'day_count': [1, 1, 1, 1],
'total': [1, 2, 3, 4]})
df2 = pd.DataFrame({
'date': ['2020-01-02', '2020-01-03', '2020-01-04'],
'day_count': [2, 2, 2],
'total': [2, 4, 6]})
# set "date" as index and convert to datetime for later resampling
df1.index = df1['date']
df1.index = pd.to_datetime(df1.index)
df2.index = df2['date']
df2.index = pd.to_datetime(df2.index)
Now I need to resample both my dataframes to some frequency, let's say daily, so I would do:
df1 = df1.resample('D').agg({'day_count': 'sum', 'total': 'last'})
df2 = df2.resample('D').agg({'day_count': 'sum', 'total': 'last'})
The DataFrames now look like:
In [20]: df1
Out[20]:
day_count total
date
2020-01-01 1 1.0
2020-01-02 1 2.0
2020-01-03 1 3.0
2020-01-04 0 NaN
2020-01-05 1 4.0
In [22]: df2
Out[22]:
day_count total
date
2020-01-02 2 2
2020-01-03 2 4
2020-01-04 2 6
Now I need to merge both, but notice that total has some NaN values where I need to fill in the previously existing value, so I do:
df1['total'] = df1['total'].fillna(method='ffill').astype(int)
df2['total'] = df2['total'].fillna(method='ffill').astype(int)
Now df1 looks like:
In [25]: df1
Out[25]:
day_count total
date
2020-01-01 1 1
2020-01-02 1 2
2020-01-03 1 3
2020-01-04 0 3
2020-01-05 1 4
So now I have the two dataframes ready to be merged, I think, so I concat them:
final_df = pd.concat([df1, df2]).fillna(method='ffill').groupby(["date"], as_index=True).sum()
In [31]: final_df
Out[31]:
day_count total
date
2020-01-01 1 1
2020-01-02 3 4
2020-01-03 3 7
2020-01-04 2 9
2020-01-05 1 4
I get the correct aggregation for day_count, simply summing what's on the same date in both DFs, but for total I do not get what I expected, which is:
In [31]: final_df
Out[31]:
day_count total
date
2020-01-01 1 1
2020-01-02 3 4
2020-01-03 3 7
2020-01-04 2 9
2020-01-05 1 10 --> this is the value I'm missing
Certainly I am doing something wrong; I feel like maybe there is even a simpler way to do this. Thanks!
Concatenate them horizontally and groupby along columns:
pd.concat([df1,df2], axis=1).ffill().groupby(level=0, axis=1).sum()
That said, you can also bypass the individual fillna calls and the groupby:
# these are not needed
# df1['total'] = df1['total'].fillna(method='ffill').astype(int)
# df2['total'] = df2['total'].fillna(method='ffill').astype(int)
pd.concat([df1,df2],axis=1).ffill().sum(level=0, axis=1)
Output:
day_count total
date
2020-01-01 1.0 1.0
2020-01-02 3.0 4.0
2020-01-03 3.0 7.0
2020-01-04 2.0 9.0
2020-01-05 3.0 10.0
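One caveat: groupby(..., axis=1) and sum(level=..., axis=1) are deprecated (and later removed) in recent pandas versions. A sketch of an equivalent that transposes instead, which should behave the same:
# Transpose so the duplicated column labels become the row index,
# group them at level 0, sum, and transpose back.
out = pd.concat([df1, df2], axis=1).ffill().T.groupby(level=0).sum().T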
I have 3 resampled pandas dataframes using the same data indexed by datetime.
Each dataframe is resampled using a different timeframe (e.g. 30 min / 60 min / 240 min).
Two of the dataframes have resampled correctly, with the datetimes aligned, because they have an equal number of rows (20), but the third dataframe only has 12 rows because there isn't enough data to create 20 rows when resampling to 240 min.
How can I adjust the 240 min dataframe so the datetimes are aligned with the other 2 dataframes?
For example, every 2nd row in the 30 min dataframe equals the corresponding row in the 60 min dataframe, and every 4th row in the 60 min dataframe should equal the corresponding row in the 240 min dataframe, but this is not the case because the 240 min dataframe has resampled the datetimes differently due to there not being enough data to create 20 rows.
If you're just trying to align the different datasets to one index, you can use pd.concat.
import pandas as pd
periods = int(12.5 * 240)  # periods must be an integer
index = pd.date_range(start='1/1/2018', periods=periods, freq="min")
data = pd.DataFrame(list(range(periods)), index=index)
df1 = data.resample('30min').asfreq()
df2 = data.resample('60min').asfreq()
df3 = data.resample('240min').asfreq()
df4 = pd.concat([df1, df2, df3], axis=1)
print(df4)
Output:
2018-01-01 00:00:00 0 0.0 0.0
2018-01-01 00:30:00 30 NaN NaN
2018-01-01 01:00:00 60 60.0 NaN
2018-01-01 01:30:00 90 NaN NaN
2018-01-01 02:00:00 120 120.0 NaN
... ... ... ...
2018-01-02 23:30:00 2850 NaN NaN
2018-01-03 00:00:00 2880 2880.0 2880.0
2018-01-03 00:30:00 2910 NaN NaN
2018-01-03 01:00:00 2940 2940.0 NaN
2018-01-03 01:30:00 2970 NaN NaN
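If, beyond aligning the indexes, you also want the coarser resamples carried forward onto the finest grid (an assumption about the desired behaviour, not something stated in the question), forward-fill the concatenated frame:
# Propagate the last 60min/240min values down to every 30min row.
df5 = pd.concat([df1, df2, df3], axis=1).ffill()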
Generating the data
import pandas as pd
import numpy as np

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
                             columns=['data1', 'data2', 'data3'],
                             index=date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
df
>>>
data1 data2 data3
2018-01-01 00:00:00 1.0 3.0 NaN
2018-01-01 01:00:00 8.0 5.0 8.0
2018-01-01 02:00:00 5.0 NaN 6.0
2018-01-01 03:00:00 4.0 7.0 4.0
2018-01-01 04:00:00 NaN 8.0 NaN
... ... ... ...
2018-01-07 20:00:00 8.0 7.0 NaN
2018-01-07 21:00:00 5.0 4.0 5.0
2018-01-07 22:00:00 NaN 6.0 NaN
2018-01-07 23:00:00 2.0 4.0 3.0
2018-01-08 00:00:00 NaN NaN NaN
I want to select a specific time each day, then set all values in a day equal to the data of that time.
For example, I want to select 1:00:00; then all data of 2018-01-01 will be equal to 2018-01-01 01:00:00, all data of 2018-01-02 will be equal to 2018-01-02 01:00:00, etc.
I know how to select the data at that time:
timestamp = "01:00:00"
df[df.index.strftime("%H:%M:%S") == timestamp]
but I don't know how to set the data of the whole day equal to it.
Thank you for reading.
Check with reindex:
s = df[df.index.strftime("%H:%M:%S") == timestamp]
s.index = s.index.date
df[:] = s.reindex(df.index.date).values
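The reindex trick above broadcasts the single 01:00:00 row of each day onto every timestamp of that day. The same idea spelled out with between_time and normalize (a sketch; it assumes each day really has one row at the chosen time):
# Pick each day's 01:00 row, key it by calendar day, then broadcast
# it back onto the full hourly index.
daily_vals = df.between_time("01:00", "01:00")
daily_vals.index = daily_vals.index.normalize()
result = df.copy()
result.loc[:, :] = daily_vals.reindex(df.index.normalize()).values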
I have a dataframe like this:
time A time B 2017-11 2017-12 2018-01 2018-02
2017-01-24 2020-01-01 NaN NaN NaN NaN
2016-11-28 2020-01-01 NaN 4.0 2.0 2.0
2017-03-18 2017-12-21 NaN NaN NaN NaN
I want to replace NaN with 0 when the column name falls between time A and time B. For example, for the third row, the time range is from 2017-03-18 to 2017-12-21, so for the data in the third row whose column names fall in this range: if a value is NaN, replace it with 0; otherwise leave it as it is. Hope it's clear. Thanks
Maybe not the best solution, but it works.
Here's my test sample:
d = pd.DataFrame([
{"time A": "2017-01-24", "time B": np.nan, "2016-11": np.nan, "2016-12": np.nan, "2017-01": np.nan, "2017-02": np.nan},
{"time A": "2016-11-28", "time B": np.nan, "2016-11": np.nan, "2016-12": 4, "2017-01": 2, "2017-02": 2},
{"time A": "2016-12-18", "time B": "2017-01-01", "2016-11": np.nan, "2016-12": np.nan, "2017-01": np.nan, "2017-02": np.nan},
])
d["time B"].fillna("2020-01-01", inplace=True)
d.set_index(["time A", "time B"], inplace=True)
Initial table:
time A time B 2016-11 2016-12 2017-01 2017-02
2017-01-24 2020-01-01 NaN NaN NaN NaN
2016-11-28 2020-01-01 NaN 4.0 2.0 2.0
2016-12-18 2017-01-01 NaN NaN NaN NaN
It looks like time A is an open date and time B is a close date, or something like that. Thus, for convenience, I've filled the missing time B values with an arbitrary future date, for example '2020-01-01'.
I don't like working with pivot tables, so I've used df.stack() to stack it and formatted the date columns:
d_stack = d.stack(dropna=False).reset_index()
d_stack.columns = ["time A", "time B", "month", "value"]
for col in ["time A", "time B"]:
d_stack[col] = pd.to_datetime(d_stack[col], format="%Y-%m-%d", errors="ignore")
d_stack["month"] = pd.to_datetime(d_stack["month"], format="%Y-%m", errors="ignore")
Now it's more convenient to fill the missing values:
def fill_existing(x):
    if (x["time A"] <= x["month"] <= x["time B"] and
            np.isnan(x["value"])):
        return 0
    else:
        return x["value"]

d_stack["value"] = d_stack.apply(fill_existing, axis=1)
Output:
time A time B month value
0 2017-01-24 2020-01-01 2016-11-01 NaN
1 2017-01-24 2020-01-01 2016-12-01 NaN
2 2017-01-24 2020-01-01 2017-01-01 NaN
3 2017-01-24 2020-01-01 2017-02-01 0.0
Finally, format month back and use pd.pivot_table to return to the initial table format:
d_stack["month"] = d_stack["month"].apply(lambda x: x.strftime("%Y-%m"))
pd.pivot_table(d_stack, columns="month", index=["time A", "time B"],
values="value", aggfunc=np.sum)
Result:
time A time B 2016-12 2017-01 2017-02
2016-11-28 2020-01-01 4.0 2.0 2.0
2016-12-18 2017-01-01 NaN 0.0 NaN
2017-01-24 2020-01-01 NaN NaN 0.0
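For larger frames, the row-wise apply above can be replaced with a vectorized boolean mask on the stacked frame (a sketch reusing the d_stack columns built above, with the dates already parsed):
# Zero out NaNs whose month falls inside the [time A, time B] window.
in_window = (d_stack["time A"] <= d_stack["month"]) & (d_stack["month"] <= d_stack["time B"])
d_stack.loc[in_window & d_stack["value"].isna(), "value"] = 0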
Try this code:
newdf = df[(df.date > start_date) & (df.date < end_date)]
newdf = newdf.fillna(0)
newdf is the dataframe you are looking for. Note that fillna returns a new dataframe, so you have to assign the result back.