pandas subtracting value in another column from previous row - python

I have a dataframe (named df) sorted by identifier, id_number and contract_year_month, in that order:
**identifier id_number contract_year_month collection_year_month**
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15
and would like to add a column named 'date_difference' that contains contract_year_month minus the previous row's collection_year_month within each identifier and id_number group (e.g. 2018-01-08 minus 2018-01-09),
so that the df would be:
**identifier id_number contract_year_month collection_year_month date_difference**
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10 -1
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18 10
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15 2
I already converted the contract_year_month and collection_year_month columns to datetime, and tried a simple shift or iloc, but neither works.
df["date_difference"] = df.groupby(["identifier", "id_number"])["contract_year_month"]
Is there any way to use groupby to get the difference between the current row's value and the previous row's value in another column, within groups defined by the two identifier columns? (I've searched for an hour but couldn't find a hint...) I would sincerely appreciate any advice.
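For reference, a minimal snippet that rebuilds the example frame (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'identifier': ['K001', 'K001', 'K001', 'K001', 'K002', 'K002'],
    'id_number': [1, 1, 2, 2, 4, 4],
    'contract_year_month': pd.to_datetime(
        ['2018-01-03', '2018-01-08', '2018-01-01',
         '2018-01-15', '2018-01-04', '2018-01-09']),
    'collection_year_month': pd.to_datetime(
        ['2018-01-09', '2018-01-10', '2018-01-05',
         '2018-01-18', '2018-01-07', '2018-01-15']),
})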

Here is one potential way to do this.
First create a boolean mask with DataFrame.duplicated (it is True for every row that repeats an (identifier, id_number) pair, i.e. every row that has a previous row in its group), then use numpy.where and Series.shift to create the column date_difference:
import numpy as np

mask = df.duplicated(['identifier', 'id_number'])
df['date_difference'] = np.where(
    mask,
    (df['contract_year_month'] - df['collection_year_month'].shift(1)).dt.days,
    np.nan)
[output]
identifier id_number contract_year_month collection_year_month date_difference
0 K001 1 2018-01-03 2018-01-09 NaN
1 K001 1 2018-01-08 2018-01-10 -1.0
2 K001 2 2018-01-01 2018-01-05 NaN
3 K001 2 2018-01-15 2018-01-18 10.0
4 K002 4 2018-01-04 2018-01-07 NaN
5 K002 4 2018-01-09 2018-01-15 2.0

Here's one approach using your groupby() (updated based on feedback from @piRSquared):
In []:
(df['contract_year_month']
 - df['collection_year_month']
     .groupby([df['identifier'], df['id_number']])
     .shift()).dt.days
Out[]:
0 NaN
1 -1.0
2 NaN
3 10.0
4 NaN
5 2.0
dtype: float64
You can just assign this to df['date_difference'].
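For example:
df['date_difference'] = (
    df['contract_year_month']
    - df['collection_year_month']
        .groupby([df['identifier'], df['id_number']])
        .shift()
).dt.days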

Related

Create a DataFrame in Pandas using an index from an existing TimeSeries and a column from another TimeSeries

I want to create a DataFrame or TimeSeries using the index of an existing TimeSeries and the values from another TimeSeries with different time indices. The TimeSeries look like:
<class 'pandas.core.series.Series'>
DT
2018-01-02 172.3000
2018-01-03 174.5500
2018-01-04 173.4700
2018-01-05 175.3700
2018-01-08 175.6100
2018-01-09 175.0600
2018-01-10 174.3000
2018-01-11 175.4886
2018-01-12 177.3600
2018-01-16 179.3900
2018-01-17 179.2500
2018-01-18 180.1000
...
and
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-09 175.610
2018-01-16 177.360
2018-01-23 180.100
...
I want to use the index from the first TS and fill it with the values at matching indices from the second TS, like:
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-08 NaN
2018-01-09 175.610
2018-01-10 NaN
2018-01-11 NaN
2018-01-12 NaN
2018-01-16 177.360
2018-01-17 NaN
2018-01-18 NaN
...
Thx
IIUC, use Series.reindex:
new_s = s2.reindex(s1.index)
#2018-01-02 NaN
#2018-01-03 NaN
#2018-01-04 NaN
#2018-01-05 NaN
#2018-01-08 NaN
#2018-01-09 175.61
#2018-01-10 NaN
#2018-01-11 NaN
#2018-01-12 NaN
#2018-01-16 177.36
#2018-01-17 NaN
#2018-01-18 NaN
#Name: s2, dtype: float64
Alternatively, convert your Series into DataFrames, then use the following line:
pd.merge(TS1, TS2, left_index=True, right_index=True, how='left').iloc[:, -1]
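For instance, a sketch assuming s1 and s2 are the two Series shown above (Series.to_frame supplies the DataFrame structure that pd.merge expects; the column labels 's1' and 's2' are arbitrary):
merged = pd.merge(s1.to_frame('s1'), s2.to_frame('s2'),
                  left_index=True, right_index=True, how='left')
new_s = merged['s2']  # same values as s2.reindex(s1.index)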

python: populating tuples in tuples over dataframe range

I have 4 portfolios a, b, c, d which can each take the value "no" or "own" over a period of time (code included below to facilitate replication):
from datetime import datetime

import pandas as pd

ano = ('a', 'no', datetime(2018, 1, 1), datetime(2018, 1, 2))
aown = ('a', 'own', datetime(2018, 1, 3), datetime(2018, 1, 4))
bno = ('b', 'no', datetime(2018, 1, 1), datetime(2018, 1, 5))
bown = ('b', 'own', datetime(2018, 1, 6), datetime(2018, 1, 7))
cown = ('c', 'own', datetime(2018, 1, 9), datetime(2018, 1, 10))
down = ('d', 'own', datetime(2018, 1, 9), datetime(2018, 1, 9))
sch = pd.DataFrame([ano, aown, bno, bown, cown, down],
                   columns=['portf', 'base', 'st', 'end'])
Summary of schedule:
portf base st end
0 a no 2018-01-01 2018-01-02
1 a own 2018-01-03 2018-01-04
2 b no 2018-01-01 2018-01-05
3 b own 2018-01-06 2018-01-07
4 c own 2018-01-09 2018-01-10
5 d own 2018-01-09 2018-01-09
What I have tried: create a holding dataframe and fill in values based on the schedule. Unfortunately the first portfolio 'a' gets overwritten:
df = pd.DataFrame(index=pd.date_range(min(sch.st), max(sch.end)),
                  columns=['portf', 'base'])
for row in range(len(sch)):
    df.loc[sch['st'][row]:sch['end'][row], ['portf', 'base']] = \
        sch.loc[row, ['portf', 'base']].values
portf base
2018-01-01 b no
2018-01-02 b no
2018-01-03 b no
2018-01-04 b no
2018-01-05 b no
2018-01-06 b own
2018-01-07 b own
2018-01-08 NaN NaN
2018-01-09 d own
2018-01-10 c own
desired output:
2018-01-01 (('a','no'), ('b','no'))
2018-01-02 (('a','no'), ('b','no'))
2018-01-03 (('a','own'), ('b','no'))
2018-01-04 (('a','own'), ('b','no'))
2018-01-05 ('b','no')
...
I am sure there's an easier way of achieving this but probably this is an example I haven't encountered before. Many thanks in advance!
I would organize the data differently: index is the date, columns are portf, and the values are base.
First we need to reshape the data and resample to daily frequency. Then it's a simple pivot.
df = sch  # the schedule frame built in the question
cols = ['portf', 'base']
s = (df.reset_index()
       .melt(cols + ['index'], value_name='date')
       .set_index('date')
       .groupby(cols + ['index'], group_keys=False)
       .resample('D').ffill()
       .drop(columns=['variable', 'index'])
       .reset_index())
res = s.pivot(index='date', columns='portf')
res = res.resample('D').first()  # recover missing dates in between
Output res
base
portf a b c d
2018-01-01 no no NaN NaN
2018-01-02 no no NaN NaN
2018-01-03 own no NaN NaN
2018-01-04 own no NaN NaN
2018-01-05 NaN no NaN NaN
2018-01-06 NaN own NaN NaN
2018-01-07 NaN own NaN NaN
2018-01-08 NaN NaN NaN NaN
2018-01-09 NaN NaN own own
2018-01-10 NaN NaN own NaN
If you need your other output, we can get there with some less than ideal Series.apply calls. This will be very bad for a large DataFrame; I would seriously consider keeping the above.
s.set_index('date').apply(tuple, axis=1).groupby('date').apply(tuple)
date
2018-01-01 ((a, no), (b, no))
2018-01-02 ((a, no), (b, no))
2018-01-03 ((a, own), (b, no))
2018-01-04 ((a, own), (b, no))
2018-01-05 ((b, no),)
2018-01-06 ((b, own),)
2018-01-07 ((b, own),)
2018-01-09 ((c, own), (d, own))
2018-01-10 ((c, own),)
dtype: object
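If you also want a row for the gap date (2018-01-08 above), a sketch is to reindex the tuple series against the full daily range:
out = s.set_index('date').apply(tuple, axis=1).groupby('date').apply(tuple)
out = out.reindex(pd.date_range(out.index.min(), out.index.max()))  # NaN for 2018-01-08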

Pandas: Resample from weekly to daily with offset

I want to resample the following dataframe from weekly to daily, then ffill the missing values.
Note: 2018-01-07 and 2018-01-14 are Sundays.
Date Val
0 2018-01-07 1
1 2018-01-14 2
I tried:
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
offset = pd.offsets.DateOffset(-6)
df.resample('D', loffset=offset).ffill()
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
But I want
Date Val
0 2018-01-01 1
1 2018-01-02 1
2 2018-01-03 1
3 2018-01-04 1
4 2018-01-05 1
5 2018-01-06 1
6 2018-01-07 1
7 2018-01-08 2
8 2018-01-09 2
9 2018-01-10 2
10 2018-01-11 2
11 2018-01-12 2
12 2018-01-13 2
13 2018-01-14 2
What did I do wrong?
resample only generates rows between the first and last index values, so after the label shift your output stops at 2018-01-08. You can add a new last row manually by subtracting the offset from the last datetime, then resample:
df.loc[df.index[-1] - offset] = df.iloc[-1]
df = df.resample('D', loffset=offset).ffill()
print (df)
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
2018-01-09 2
2018-01-10 2
2018-01-11 2
2018-01-12 2
2018-01-13 2
2018-01-14 2
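Note that loffset was deprecated in pandas 1.1 and removed in 2.0; on newer versions, a sketch of the same idea, starting again from the two-row frame with its DatetimeIndex, is to shift the index yourself after resampling:
df.loc[df.index[-1] + pd.Timedelta(days=6)] = df.iloc[-1]  # extend the last week
out = df.resample('D').ffill()
out.index = out.index - pd.Timedelta(days=6)  # relabel, as loffset did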

Pandas upsampling using groupby and resample

I have grouped timeseries with gaps. I want to fill the gaps, respecting the groupings.
date is unique within each id.
The following works but gives me zeros where I want NaNs:
data.groupby('id').resample('D', on='date').sum() \
    .drop('id', axis=1).reset_index()
Neither of the following works, for some reason:
data.groupby('id').resample('D', on='date').asfreq() \
    .drop('id', axis=1).reset_index()
data.groupby('id').resample('D', on='date').fillna('pad') \
    .drop('id', axis=1).reset_index()
I get the following error:
Upsampling from level= or on= selection is not supported, use .set_index(...) to explicitly set index to datetime-like
I've tried pandas.Grouper with set_index (multi-level index or single), but it doesn't seem to upsample my date column so that I get continuous dates, or it doesn't respect the id column.
Pandas is version 0.23.
Try it yourself:
from datetime import datetime

import pandas as pd

data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'date': [
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10),
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10)],
    'value': [100, 110, 90, 50, 40, 60]})

# Works but gives zeros
data.groupby('id').resample('D', on='date').sum()

# Fails
data.groupby('id').resample('D', on='date').asfreq()
data.groupby('id').resample('D', on='date').fillna('pad')
Create a DatetimeIndex first and remove the on parameter from resample:
print (data.set_index('date').groupby('id').resample('D').asfreq())
                id  value
id date
1  2018-01-01  1.0  100.0
   2018-01-02  NaN    NaN
   2018-01-03  NaN    NaN
   2018-01-04  NaN    NaN
   2018-01-05  1.0  110.0
   2018-01-06  NaN    NaN
   2018-01-07  NaN    NaN
   2018-01-08  NaN    NaN
   2018-01-09  NaN    NaN
   2018-01-10  1.0   90.0
2  2018-01-01  2.0   50.0
   2018-01-02  NaN    NaN
   2018-01-03  NaN    NaN
   2018-01-04  NaN    NaN
   2018-01-05  2.0   40.0
   2018-01-06  NaN    NaN
   2018-01-07  NaN    NaN
   2018-01-08  NaN    NaN
   2018-01-09  NaN    NaN
   2018-01-10  2.0   60.0
print (data.set_index('date').groupby('id').resample('D').fillna('pad'))
#alternatives
#print (data.set_index('date').groupby('id').resample('D').ffill())
#print (data.set_index('date').groupby('id').resample('D').pad())
               id  value
id date
1  2018-01-01   1    100
   2018-01-02   1    100
   2018-01-03   1    100
   2018-01-04   1    100
   2018-01-05   1    110
   2018-01-06   1    110
   2018-01-07   1    110
   2018-01-08   1    110
   2018-01-09   1    110
   2018-01-10   1     90
2  2018-01-01   2     50
   2018-01-02   2     50
   2018-01-03   2     50
   2018-01-04   2     50
   2018-01-05   2     40
   2018-01-06   2     40
   2018-01-07   2     40
   2018-01-08   2     40
   2018-01-09   2     40
   2018-01-10   2     60
EDIT:
If you want to use sum and keep missing values, you need the min_count=1 parameter (see sum):
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
print (data.groupby('id').resample('D', on='date').sum(min_count=1))
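Plugged back into the original one-liner, a sketch (the summed id helper column is dropped, as in the question):
out = (data.groupby('id').resample('D', on='date').sum(min_count=1)
           .drop('id', axis=1).reset_index())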

Pandas: How to sum column values over a date range

I am trying to sum the values of colA over a date range based on the "date" column, and store this rolling value in a new column "sum_col".
But I am getting the sum of all rows (=100), not just those in the date range.
I can't use rolling or groupby, as my dates (in the real data) are not sequential (some days are missing).
Any idea how to do this? Thanks.
import pandas as pd

# Create data frame
df = pd.DataFrame()
# Create datetimes and data
df['date'] = pd.date_range('1/1/2018', periods=100, freq='D')
df['colA'] = 1
df['colB'] = 2
df['colC'] = 3

StartDate = df.date - pd.to_timedelta(5, unit='D')
EndDate = df.date
dfx = df
dfx['StartDate'] = StartDate
dfx['EndDate'] = EndDate
dfx['sum_col'] = df[(df['date'] > StartDate) & (df['date'] <= EndDate)].sum()['colA']
dfx.head(50)
I'm not sure whether you want 3 columns for the sum of colA, colB, colC respectively, or one column which sums all three, but here is an example of how you would sum the values for colA:
dfx['colAsum'] = dfx.apply(lambda x: df.loc[(df.date >= x.StartDate) &
                                            (df.date <= x.EndDate), 'colA'].sum(), axis=1)
e.g. (with periods=10):
date colA colB colC StartDate EndDate colAsum
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 6
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 6
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 6
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 6
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 6
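As an aside, the rolling limitation mentioned in the question applies only to fixed integer windows; offset-based windows tolerate missing days. A sketch that should reproduce colAsum above, since a '6D' window ends at the current row and reaches back five days:
dfx['colAsum'] = df.rolling('6D', on='date')['colA'].sum()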
If what I understand is correct:
for i in range(df.shape[0]):
    dfx.loc[i, 'sum_col'] = df[(df['date'] > StartDate[i]) &
                               (df['date'] <= EndDate[i])].sum()['colA']
For example, in the range (2018-01-01, 2018-01-06] (exclusive start, inclusive end) the sum is 5.
date colA colB colC StartDate EndDate sum_col
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1.0
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2.0
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3.0
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4.0
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5.0
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 5.0
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 5.0
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 5.0
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 5.0
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 5.0
10 2018-01-11 1 2 3 2018-01-06 2018-01-11 5.0
