I want to resample the following dataframe from weekly to daily frequency and then forward-fill the missing values.
Note: 2018-01-07 and 2018-01-14 are both Sundays.
Date Val
0 2018-01-07 1
1 2018-01-14 2
I tried:
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
offset = pd.offsets.DateOffset(-6)
df.resample('D', loffset=offset).ffill()
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
But I want
Date Val
0 2018-01-01 1
1 2018-01-02 1
2 2018-01-03 1
3 2018-01-04 1
4 2018-01-05 1
5 2018-01-06 1
6 2018-01-07 1
7 2018-01-08 2
8 2018-01-09 2
9 2018-01-10 2
10 2018-01-11 2
11 2018-01-12 2
12 2018-01-13 2
13 2018-01-14 2
What did I do wrong?
You can manually add a new last row, built by subtracting the (negative) offset from the last datetime, so that ffill extends through the final week:
df.loc[df.index[-1] - offset] = df.iloc[-1]
df = df.resample('D', loffset=offset).ffill()
print(df)
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
2018-01-09 2
2018-01-10 2
2018-01-11 2
2018-01-12 2
2018-01-13 2
2018-01-14 2
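Note that loffset was deprecated in pandas 1.1 and removed in pandas 2.0. For newer versions, a minimal sketch of the same idea, assuming you shift the index yourself instead of relying on loffset:
import pandas as pd

df = pd.DataFrame({'Date': ['2018-01-07', '2018-01-14'], 'Val': [1, 2]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Shift each weekly label back 6 days so it marks the start of its week,
# then re-add the original last date so ffill covers the final week
df.index = df.index - pd.Timedelta(days=6)
df.loc[df.index[-1] + pd.Timedelta(days=6)] = df.iloc[-1]
print(df.resample('D').ffill())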
The following is a simplification of the problem at hand.
I have a dataframe containing three columns, the date a state began, the state itself, and a flag field. It looks similar to this:
df = pd.DataFrame(
    {'begin': pd.to_datetime(['2018-01-05', '2018-07-11', '2018-11-14', '2019-02-19']),
     'state': [1, 2, 3, 4],
     'started': [1, 0, 0, 0]}
)
df
begin state started
0 2018-01-05 1 1
1 2018-07-11 2 0
2 2018-11-14 3 0
3 2019-02-19 4 0
I want to resample the dates so that they have a monthly period, which I achieve as follows:
df = df.set_index('begin', drop=False).resample('M').ffill()
df
begin state started
begin
2018-01-31 2018-01-05 1 1
2018-02-28 2018-01-05 1 1
2018-03-31 2018-01-05 1 1
2018-04-30 2018-01-05 1 1
2018-05-31 2018-01-05 1 1
2018-06-30 2018-01-05 1 1
2018-07-31 2018-07-11 2 0
2018-08-31 2018-07-11 2 0
2018-09-30 2018-07-11 2 0
2018-10-31 2018-07-11 2 0
2018-11-30 2018-11-14 3 0
2018-12-31 2018-11-14 3 0
2019-01-31 2018-11-14 3 0
2019-02-28 2019-02-19 4 0
Everything looks ok, except for the flag column (started). I need it to be 1 exactly once, at its first occurrence as in the original dataframe.
The desired output is :
begin state started
begin
2018-01-31 2018-01-05 1 1
2018-02-28 2018-01-05 1 0
2018-03-31 2018-01-05 1 0
2018-04-30 2018-01-05 1 0
2018-05-31 2018-01-05 1 0
2018-06-30 2018-01-05 1 0
2018-07-31 2018-07-11 2 0
2018-08-31 2018-07-11 2 0
2018-09-30 2018-07-11 2 0
2018-10-31 2018-07-11 2 0
2018-11-30 2018-11-14 3 0
2018-12-31 2018-11-14 3 0
2019-01-31 2018-11-14 3 0
2019-02-28 2019-02-19 4 0
Thus, for a given combination of begin and state, if started is 1, it should be 1 only at the first occurrence of that combination.
Is there an efficient way to achieve this?
If the started column contains only 1s and 0s, use DataFrame.duplicated, specifying both columns in a list:
mask = df.duplicated(['begin', 'started'])
It is also possible to rewrite only the 1 values by chaining another mask:
mask = df.duplicated(['begin', 'started']) & df['started'].eq(1)
df.loc[mask, 'started'] = 0
Or, with numpy (import numpy as np):
df['started'] = np.where(mask, 0, df['started'])
print(df)
begin state started
begin
2018-01-31 2018-01-05 1 1
2018-02-28 2018-01-05 1 0
2018-03-31 2018-01-05 1 0
2018-04-30 2018-01-05 1 0
2018-05-31 2018-01-05 1 0
2018-06-30 2018-01-05 1 0
2018-07-31 2018-07-11 2 0
2018-08-31 2018-07-11 2 0
2018-09-30 2018-07-11 2 0
2018-10-31 2018-07-11 2 0
2018-11-30 2018-11-14 3 0
2018-12-31 2018-11-14 3 0
2019-01-31 2018-11-14 3 0
2019-02-28 2019-02-19 4 0
Can you do:
df = df.set_index('begin', drop=False).resample('M').ffill()
df.loc[df['started'].duplicated(keep='first'), 'started'] = 0
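For reference, a minimal end-to-end sketch of the approach above (resample, then zero out the duplicated flags), using the sample frame from the question:
import pandas as pd

df = pd.DataFrame(
    {'begin': pd.to_datetime(['2018-01-05', '2018-07-11', '2018-11-14', '2019-02-19']),
     'state': [1, 2, 3, 4],
     'started': [1, 0, 0, 0]}
)

df = df.set_index('begin', drop=False).resample('M').ffill()
# A row is a duplicate once its (begin, started) pair has already occurred,
# so only the first row of the started=1 group keeps its flag
df.loc[df.duplicated(['begin', 'started']) & df['started'].eq(1), 'started'] = 0
print(df)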
I am working on time series in Python 3 and pandas, and I want to produce a summary of the periods of contiguous missing values, but so far I have only been able to find the indexes of the NaN values ...
Sample data :
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results :
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can I obtain this kind of result with pandas (shift(), cumsum(), groupby()...)?
Thank you for your advice!
Sylvain
Use groupby and agg:
mask = df.Valeurs.isna()
# (~mask).cumsum() increments on every non-NaN row, so each run of
# consecutive NaNs shares a single group label
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
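To see why the grouper works, here is a small sketch of the run-labeling trick on a toy series:
import pandas as pd

s = pd.Series([1.0, None, 2.0, None, None, 5.0])
mask = s.isna()
# Non-NaN rows increment the counter, so every NaN in a run shares one label
print((~mask).cumsum().tolist())  # [1, 1, 2, 2, 2, 3]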
Working on the underlying numpy array:
import numpy as np

a = df.Valeurs.values
# Pad with False so runs touching either end still produce boundaries
m = np.concatenate(([False], np.isnan(a), [False]))
# Positions where the mask flips: even entries are run starts, odd are run ends
idx = np.nonzero(m[1:] != m[:-1])[0]
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': idx[1::2] - idx[::2]})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
If you have the indices where the NaN values occur, you can also use itertools to find the contiguous chunks.
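A sketch of that idea, where nan_pos holds the integer positions of the NaNs in the sample above (obtainable with, e.g., np.flatnonzero(df.Valeurs.isna())):
from itertools import groupby

nan_pos = [1, 3, 4, 11, 12, 13, 16]

# Within a run of consecutive integers, value minus enumeration index is
# constant, which makes it a usable grouping key
runs = [list(g) for _, g in groupby(enumerate(nan_pos), key=lambda t: t[1] - t[0])]
print([(run[0][1], len(run)) for run in runs])  # [(1, 1), (3, 2), (11, 3), (16, 1)]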
I have a dataframe (named df) sorted by identifier, id_number and contract_year_month in order like this so far:
identifier id_number contract_year_month collection_year_month
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15
and I would like to add a column named 'date_difference' that holds contract_year_month minus the previous row's collection_year_month within each identifier and id_number group (e.g. 2018-01-08 minus 2018-01-09),
so that the df would be:
identifier id_number contract_year_month collection_year_month date_difference
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10 -1
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18 10
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15 2
I have already converted contract_year_month and collection_year_month to datetime, and tried a simple shift or iloc, but neither works:
df["date_difference"] = df.groupby(["identifier", "id_number"])["contract_year_month"]
Is there any way to use groupby to get the difference between the current row's value and the previous row's value in another column, separated by the two identifiers? (I've searched for an hour but couldn't find a hint.) I would sincerely appreciate any advice.
Here is one potential way to do this.
First create a boolean mask marking the non-first rows of each group, then use numpy.where and Series.shift to create the column date_difference:
import numpy as np

mask = df.duplicated(['identifier', 'id_number'])
df['date_difference'] = np.where(
    mask,
    (df['contract_year_month'] - df['collection_year_month'].shift(1)).dt.days,
    np.nan)
[output]
identifier id_number contract_year_month collection_year_month date_difference
0 K001 1 2018-01-03 2018-01-09 NaN
1 K001 1 2018-01-08 2018-01-10 -1.0
2 K001 2 2018-01-01 2018-01-05 NaN
3 K001 2 2018-01-15 2018-01-18 10.0
4 K002 4 2018-01-04 2018-01-07 NaN
5 K002 4 2018-01-09 2018-01-15 2.0
Here's one approach using your groupby() (updated based on feedback from @piRSquared):
In []:
(df['contract_year_month']
 - df['collection_year_month']
   .groupby([df['identifier'], df['id_number']])
   .shift()).dt.days
Out[]:
0 NaN
1 -1.0
2 NaN
3 10.0
4 NaN
5 2.0
dtype: float64
You can just assign this to df['date_difference']
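For completeness, the assignment could be written in one step (a sketch equivalent to the expression above):
df['date_difference'] = (
    df['contract_year_month']
    - df.groupby(['identifier', 'id_number'])['collection_year_month'].shift()
).dt.days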
For some reason doing df.resample("M").apply(foo) drops the index name in df. Is this expected behavior?
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(60)}, index=pd.date_range(start="2018-01-01", periods=60))
df.index.name = "dte"
df.head()
# a
#dte
#2018-01-01 0
#2018-01-02 1
#2018-01-03 2
#2018-01-04 3
#2018-01-05 4
def f(x):
    print(x.head())
df.resample("M").apply(f)
#2018-01-01 0
#2018-01-02 1
#2018-01-03 2
#2018-01-04 3
#2018-01-05 4
#Name: a, dtype: int64
Update/clarification:
When I said "drops the name", I meant that the series received by the function doesn't have a name associated with its index.
I suggest using an alternative to resample: groupby with pd.Grouper:
def f(x):
    print(x.head())

df.groupby(pd.Grouper(freq="M")).apply(f)
            a
dte
2018-01-01  0
2018-01-02  1
2018-01-03  2
2018-01-04  3
2018-01-05  4
            a
dte
2018-01-01  0
2018-01-02  1
2018-01-03  2
2018-01-04  3
2018-01-05  4
            a
dte
2018-02-01  31
2018-02-02  32
2018-02-03  33
2018-02-04  34
2018-02-05  35
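A quick sketch to confirm what each path passes to the function, based on the behavior reported above (resample hands f one Series per column with no index name, while the Grouper path hands it DataFrames with the name intact):
def f(x):
    print(type(x).__name__, x.index.name)

df.resample("M").apply(f)                  # Series None
df.groupby(pd.Grouper(freq="M")).apply(f)  # DataFrame dte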
I am trying to sum the values of colA over a date range based on the "date" column, storing this rolling value in the new column "sum_col".
But I am getting the sum of all rows (=100), not just those in the date range.
I can't use rolling or groupby, as my dates (in the real data) are not sequential (some days are missing).
Any idea how to do this? Thanks.
# Create data frame
df = pd.DataFrame()
# Create datetimes and data
df['date'] = pd.date_range('1/1/2018', periods=100, freq='D')
df['colA']= 1
df['colB']= 2
df['colC']= 3
StartDate = df.date - pd.to_timedelta(5, unit='D')
EndDate = df.date
dfx = df
dfx['StartDate'] = StartDate
dfx['EndDate'] = EndDate
dfx['sum_col'] = df[(df['date'] > StartDate) & (df['date'] <= EndDate)].sum()['colA']
dfx.head(50)
I'm not sure whether you want three columns for the sums of colA, colB, colC respectively, or one column which sums all three, but here is an example of how you would sum the values for colA:
dfx['colAsum'] = dfx.apply(
    lambda x: df.loc[(df.date >= x.StartDate) & (df.date <= x.EndDate), 'colA'].sum(),
    axis=1)
e.g. (with periods=10):
date colA colB colC StartDate EndDate colAsum
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 6
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 6
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 6
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 6
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 6
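As a side note, offset-based rolling windows look back over time rather than over a fixed number of rows, so missing days are handled correctly; a sketch that should reproduce colAsum for the same inclusive 6-day window:
# '6D' with the default closed='right' covers (t-6D, t], i.e. [t-5D, t] for daily data
dfx['colAsum'] = df.set_index('date')['colA'].rolling('6D').sum().values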
If I understand correctly:
for i in range(df.shape[0]):
    dfx.loc[i, 'sum_col'] = df.loc[(df['date'] > StartDate[i]) & (df['date'] <= EndDate[i]), 'colA'].sum()
For example, for the window (2018-01-01, 2018-01-06] the sum is 5, since the strict inequality excludes the start date.
date colA colB colC StartDate EndDate sum_col
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1.0
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2.0
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3.0
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4.0
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5.0
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 5.0
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 5.0
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 5.0
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 5.0
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 5.0
10 2018-01-11 1 2 3 2018-01-06 2018-01-11 5.0