Pandas upsampling using groupby and resample - python

I have grouped time series with gaps. I want to fill the gaps while respecting the groupings.
date is unique within each id.
The following works, but it gives me zeros where I want NaNs:
data.groupby('id').resample('D', on='date').sum()\
    .drop('id', axis=1).reset_index()
Neither of the following works, for some reason:
data.groupby('id').resample('D', on='date').asfreq()\
    .drop('id', axis=1).reset_index()
data.groupby('id').resample('D', on='date').fillna('pad')\
    .drop('id', axis=1).reset_index()
I get the following error:
Upsampling from level= or on= selection is not supported, use .set_index(...) to explicitly set index to datetime-like
I've tried pandas.Grouper with set_index (using a MultiIndex or a single index), but either it does not upsample my date column to give continuous dates, or it does not respect the id column.
Pandas version is 0.23.
Try it yourself:
import pandas as pd
from datetime import datetime

data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'date': [
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10),
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10)],
    'value': [100, 110, 90, 50, 40, 60]})
# Works but gives zeros
data.groupby('id').resample('D', on='date').sum()
# Fails
data.groupby('id').resample('D', on='date').asfreq()
data.groupby('id').resample('D', on='date').fillna('pad')

Create a DatetimeIndex first and remove the on parameter from resample:
print (data.set_index('date').groupby('id').resample('D').asfreq())
                id
id date
1  2018-01-01  1.0
   2018-01-02  NaN
   2018-01-03  NaN
   2018-01-04  NaN
   2018-01-05  1.0
   2018-01-06  NaN
   2018-01-07  NaN
   2018-01-08  NaN
   2018-01-09  NaN
   2018-01-10  1.0
2  2018-01-01  2.0
   2018-01-02  NaN
   2018-01-03  NaN
   2018-01-04  NaN
   2018-01-05  2.0
   2018-01-06  NaN
   2018-01-07  NaN
   2018-01-08  NaN
   2018-01-09  NaN
   2018-01-10  2.0
print (data.set_index('date').groupby('id').resample('D').fillna('pad'))
#alternatives
#print (data.set_index('date').groupby('id').resample('D').ffill())
#print (data.set_index('date').groupby('id').resample('D').pad())
               id
id date
1  2018-01-01   1
   2018-01-02   1
   2018-01-03   1
   2018-01-04   1
   2018-01-05   1
   2018-01-06   1
   2018-01-07   1
   2018-01-08   1
   2018-01-09   1
   2018-01-10   1
2  2018-01-01   2
   2018-01-02   2
   2018-01-03   2
   2018-01-04   2
   2018-01-05   2
   2018-01-06   2
   2018-01-07   2
   2018-01-08   2
   2018-01-09   2
   2018-01-10   2
EDIT:
If you want to use sum with missing values, you need the min_count=1 parameter - see sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
print (data.groupby('id').resample('D', on='date').sum(min_count=1))
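For reference, a minimal sketch of the full round trip back to a flat frame, combining the asfreq fix above with the drop/reset_index pattern from the question (column names are the ones from the question's sample data):

out = (data.set_index('date')
           .groupby('id')
           .resample('D')
           .asfreq()              # keep NaN for the inserted days
           .drop('id', axis=1)    # the grouping key is already in the index
           .reset_index())
print(out.head(12))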

Related

Pandas Set all value in a day equal to data of a time of that day

Generating the data
import random
import numpy as np
import pandas as pd

random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
                             columns=['data1', 'data2', 'data3'],
                             index=date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
df
>>>
                     data1  data2  data3
2018-01-01 00:00:00    1.0    3.0    NaN
2018-01-01 01:00:00    8.0    5.0    8.0
2018-01-01 02:00:00    5.0    NaN    6.0
2018-01-01 03:00:00    4.0    7.0    4.0
2018-01-01 04:00:00    NaN    8.0    NaN
...                    ...    ...    ...
2018-01-07 20:00:00    8.0    7.0    NaN
2018-01-07 21:00:00    5.0    4.0    5.0
2018-01-07 22:00:00    NaN    6.0    NaN
2018-01-07 23:00:00    2.0    4.0    3.0
2018-01-08 00:00:00    NaN    NaN    NaN
I want to select a specific time each day, then set all values in a day equal to the data at that time.
For example, if I select 1:00:00, then all data of 2018-01-01 will be equal to the row at 2018-01-01 01:00:00, all data of 2018-01-02 will be equal to the row at 2018-01-02 01:00:00, etc.
I know how to select the data of the time:
timestamp = "01:00:00"
df[df.index.strftime("%H:%M:%S") == timestamp]
but I don't know how to set the rest of the day's data equal to it.
Thank you for reading.
Check with reindex
s = df[df.index.strftime("%H:%M:%S") == timestamp]  # one row per day, at 01:00
s.index = s.index.date                              # key those rows by calendar date
df[:] = s.reindex(df.index.date).values             # broadcast each day's row to all of its hours
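An alternative sketch along the same lines (not from the answer above, and assuming a DatetimeIndex as in the question): DataFrame.at_time picks the 01:00 rows and DatetimeIndex.normalize keys them by day, so reindex can broadcast them back. Days with no 01:00 row stay NaN, just as with the reindex approach.

picked = df.at_time('01:00')             # one row per day at 01:00
picked.index = picked.index.normalize()  # key the picked rows by midnight of their day
df.loc[:] = picked.reindex(df.index.normalize()).values  # broadcast to every hour of each day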

python: populating tuples in tuples over dataframe range

I have 4 portfolios a,b,c,d which can take on values either "no" or "own" over a period of time. (code included below to facilitate replication)
from datetime import datetime
import pandas as pd

ano = ('a', 'no', datetime(2018, 1, 1), datetime(2018, 1, 2))
aown = ('a', 'own', datetime(2018, 1, 3), datetime(2018, 1, 4))
bno = ('b', 'no', datetime(2018, 1, 1), datetime(2018, 1, 5))
bown = ('b', 'own', datetime(2018, 1, 6), datetime(2018, 1, 7))
cown = ('c', 'own', datetime(2018, 1, 9), datetime(2018, 1, 10))
down = ('d', 'own', datetime(2018, 1, 9), datetime(2018, 1, 9))
sch = pd.DataFrame([ano, aown, bno, bown, cown, down],
                   columns=['portf', 'base', 'st', 'end'])
Summary of schedule:
  portf base         st        end
0     a   no 2018-01-01 2018-01-02
1     a  own 2018-01-03 2018-01-04
2     b   no 2018-01-01 2018-01-05
3     b  own 2018-01-06 2018-01-07
4     c  own 2018-01-09 2018-01-10
5     d  own 2018-01-09 2018-01-09
What I have tried: creating a holding dataframe and filling in values based on the schedule. Unfortunately, the first portfolio 'a' gets overwritten:
df = pd.DataFrame(index=pd.date_range(min(sch.st), max(sch.end)),
                  columns=['portf', 'base'])
for row in range(len(sch)):
    df.loc[sch['st'][row]:sch['end'][row], ['portf', 'base']] = sch.loc[row, ['portf', 'base']].values
           portf base
2018-01-01     b   no
2018-01-02     b   no
2018-01-03     b   no
2018-01-04     b   no
2018-01-05     b   no
2018-01-06     b  own
2018-01-07     b  own
2018-01-08   NaN  NaN
2018-01-09     d  own
2018-01-10     c  own
desired output:
2018-01-01 (('a','no'), ('b','no'))
2018-01-02 (('a','no'), ('b','no'))
2018-01-03 (('a','own'), ('b','no'))
2018-01-04 (('a','own'), ('b','no'))
2018-01-05 ('b','no')
...
I am sure there is an easier way of achieving this, but it is not an example I have encountered before. Many thanks in advance!
I would organize the data differently: the index is the date, the columns are portf, and the values are base.
First we need to reshape the data and resample to daily rows. Then it's a simple pivot.
cols = ['portf', 'base']
s = (sch.reset_index()
        .melt(cols + ['index'], value_name='date')   # st/end collapse into one 'date' column
        .set_index('date')
        .groupby(cols + ['index'], group_keys=False)
        .resample('D').ffill()                       # expand each schedule row to daily rows
        .drop(columns=['variable', 'index'])
        .reset_index())
res = s.pivot(index='date', columns='portf')
res = res.resample('D').first()  # recover missing dates in between
Output res:
             base
portf           a    b    c    d
2018-01-01     no   no  NaN  NaN
2018-01-02     no   no  NaN  NaN
2018-01-03    own   no  NaN  NaN
2018-01-04    own   no  NaN  NaN
2018-01-05    NaN   no  NaN  NaN
2018-01-06    NaN  own  NaN  NaN
2018-01-07    NaN  own  NaN  NaN
2018-01-08    NaN  NaN  NaN  NaN
2018-01-09    NaN  NaN  own  own
2018-01-10    NaN  NaN  own  NaN
If you need your other output, we can get there with some less than ideal Series.apply calls. This will be very bad for a large DataFrame; I would seriously consider keeping the above.
s.set_index('date').apply(tuple, axis=1).groupby('date').apply(tuple)
date
2018-01-01 ((a, no), (b, no))
2018-01-02 ((a, no), (b, no))
2018-01-03 ((a, own), (b, no))
2018-01-04 ((a, own), (b, no))
2018-01-05 ((b, no),)
2018-01-06 ((b, own),)
2018-01-07 ((b, own),)
2018-01-09 ((c, own), (d, own))
2018-01-10 ((c, own),)
dtype: object
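If the tuple-of-tuples output is really what you want, here is a more direct sketch (my own variant, not part of the answer above): expand each schedule row into one record per day with pd.date_range, then collect the (portf, base) pairs per date. Days not covered by any schedule row (e.g. 2018-01-08) simply do not appear, as in the output above.

rows = [(d, (r.portf, r.base))
        for r in sch.itertuples(index=False)
        for d in pd.date_range(r.st, r.end)]
daily = pd.DataFrame(rows, columns=['date', 'pair'])
print(daily.groupby('date')['pair'].apply(tuple))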

pandas subtracting value in another column from previous row

I have a dataframe (named df), sorted by identifier, id_number and contract_year_month in that order, that looks like this so far:
identifier  id_number  contract_year_month  collection_year_month
K001        1          2018-01-03           2018-01-09
K001        1          2018-01-08           2018-01-10
K001        2          2018-01-01           2018-01-05
K001        2          2018-01-15           2018-01-18
K002        4          2018-01-04           2018-01-07
K002        4          2018-01-09           2018-01-15
I would like to add a column named 'date_difference', which holds contract_year_month minus collection_year_month from the previous row within the same identifier and id_number (e.g. 2018-01-08 minus 2018-01-09),
so that the df would be:
identifier  id_number  contract_year_month  collection_year_month  date_difference
K001        1          2018-01-03           2018-01-09
K001        1          2018-01-08           2018-01-10              -1
K001        2          2018-01-01           2018-01-05
K001        2          2018-01-15           2018-01-18              10
K002        4          2018-01-04           2018-01-07
K002        4          2018-01-09           2018-01-15               2
I have already converted the contract_year_month and collection_year_month columns to datetime, and tried working with a simple shift or with iloc, but neither works.
df["date_difference"] = df.groupby(["identifier", "id_number"])["contract_year_month"]
Is there any way to use groupby to get the difference between the current row's value and the previous row's value in another column, separated by the two identifiers? (I've searched for an hour but couldn't find a hint...) I would sincerely appreciate any advice.
Here is one potential way to do this.
First create a boolean mask, then use numpy.where and Series.shift to create the column date_difference:
mask = df.duplicated(['identifier', 'id_number'])
df['date_difference'] = np.where(
    mask,
    (df['contract_year_month'] - df['collection_year_month'].shift(1)).dt.days,
    np.nan)
[output]
identifier id_number contract_year_month collection_year_month date_difference
0 K001 1 2018-01-03 2018-01-09 NaN
1 K001 1 2018-01-08 2018-01-10 -1.0
2 K001 2 2018-01-01 2018-01-05 NaN
3 K001 2 2018-01-15 2018-01-18 10.0
4 K002 4 2018-01-04 2018-01-07 NaN
5 K002 4 2018-01-09 2018-01-15 2.0
Here's one approach using your groupby() (updated based on feedback from @piRSquared):
In []:
(df['contract_year_month'] -
 df['collection_year_month']
   .groupby([df['identifier'], df['id_number']])
   .shift()).dt.days
Out[]:
0 NaN
1 -1.0
2 NaN
3 10.0
4 NaN
5 2.0
dtype: float64
You can just assign this to df['date_difference']
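For completeness, a minimal assignment sketch (the same expression, written with groupby on the frame, assuming the two date columns are already datetime64 as stated in the question):

df['date_difference'] = (
    df['contract_year_month']
    - df.groupby(['identifier', 'id_number'])['collection_year_month'].shift()
).dt.days  # current contract date minus previous collection date, per group, in days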

Pandas : How to sum column values over a date range

I am trying to sum the values of colA over a date range based on the "date" column, and to store this rolling value in a new column, "sum_col".
But I am getting the sum of all rows (=100), not just those in the date range.
I can't use rolling or groupby, as my dates (in the real data) are not sequential (some days are missing).
Any idea how to do this? Thanks.
import pandas as pd

# Create data frame
df = pd.DataFrame()
# Create datetimes and data
df['date'] = pd.date_range('1/1/2018', periods=100, freq='D')
df['colA'] = 1
df['colB'] = 2
df['colC'] = 3
StartDate = df.date - pd.to_timedelta(5, unit='D')
EndDate = df.date
dfx = df
dfx['StartDate'] = StartDate
dfx['EndDate'] = EndDate
dfx['sum_col'] = df[(df['date'] > StartDate) & (df['date'] <= EndDate)].sum()['colA']
dfx.head(50)
I'm not sure whether you want 3 columns for the sum of colA, colB, colC respectively, or one column which sums all three, but here is an example of how you would sum the values for colA:
dfx['colAsum'] = dfx.apply(lambda x: df.loc[(df.date >= x.StartDate) &
                                            (df.date <= x.EndDate), 'colA'].sum(), axis=1)
e.g. (with periods=10):
date colA colB colC StartDate EndDate colAsum
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 6
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 6
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 6
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 6
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 6
If what I understand is correct:
for i in range(df.shape[0]):
    dfx.loc[i, 'sum_col'] = df[(df['date'] > StartDate[i]) & (df['date'] <= EndDate[i])].sum()['colA']
For example, for 2018-01-06 the window (2018-01-01, 2018-01-06] excludes the start date, so it covers five rows and the sum is 5.
date colA colB colC StartDate EndDate sum_col
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1.0
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2.0
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3.0
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4.0
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5.0
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 5.0
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 5.0
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 5.0
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 5.0
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 5.0
10 2018-01-11 1 2 3 2018-01-06 2018-01-11 5.0
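As a side note, the question rules out rolling because of the missing days, but a time-offset window tolerates gaps as long as the date index is sorted, so this sketch (not from either answer) may be an option. '5D' reproduces the strict (date - 5 days, date] window of the question's condition; '6D' should match the first answer's inclusive bounds for daily data.

# Time-based rolling sum over the last 5 days, aligned back to dfx by position.
dfx['sum_col'] = (df.set_index('date')['colA']
                    .rolling('5D')
                    .sum()
                    .values)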

Select rows where column values are between a given range

How do I find and remove rows from a DataFrame with values in a specific range, for example dates greater than '2017-03-02' and smaller than '2017-03-05'?
import pandas as pd
d_index = pd.date_range('2018-01-01', '2018-01-06')
d_values = pd.date_range('2017-03-01', '2017-03-06')
s = pd.Series(d_values)
s = s.rename('values')
df = pd.DataFrame(s)
df = df.set_index(d_index)
# remove rows with specific values in 'value' column
In the example above, d_values is ordered from the earliest to the latest date, so in this case slicing the dataframe by index could do the job. But I am looking for a solution that also works when d_values contains unordered, random date values. Is there any way to do this in pandas?
Option 1
pd.Series.between seems suited for this task.
df[~df['values'].between('2017-03-02', '2017-03-05', inclusive=False)]
values
2018-01-01 2017-03-01
2018-01-02 2017-03-02
2018-01-05 2017-03-05
2018-01-06 2017-03-06
Details
between identifies all items within the range -
m = df['values'].between('2017-03-02', '2017-03-05', inclusive=False)
m
2018-01-01 False
2018-01-02 False
2018-01-03 True
2018-01-04 True
2018-01-05 False
2018-01-06 False
Freq: D, Name: values, dtype: bool
Use the mask to filter on df -
df = df[~m]
Option 2
Alternatively, with the good ol' comparison operators -
df[~(df['values'].gt('2017-03-02') & df['values'].lt('2017-03-05'))]
values
2018-01-01 2017-03-01
2018-01-02 2017-03-02
2018-01-05 2017-03-05
2018-01-06 2017-03-06
Note that both options work with datetime objects as well as string date columns (in which case, the comparison is lexicographic).
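If you want to avoid relying on string comparison altogether, a small variation of Option 1 (my sketch, not from the answer) converts the bounds with pd.to_datetime first. Note that inclusive=False is the pandas 0.x spelling used above; newer pandas versions spell it inclusive='neither'.

lo, hi = pd.to_datetime(['2017-03-02', '2017-03-05'])   # Timestamp bounds instead of strings
df = df[~df['values'].between(lo, hi, inclusive=False)]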
First, let's shuffle your DF:
In [65]: df = df.sample(frac=1)
In [66]: df
Out[66]:
values
2018-01-03 2017-03-03
2018-01-04 2017-03-04
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
You can use the DataFrame.eval method (thanks @cᴏʟᴅsᴘᴇᴇᴅ for the correction!):
In [70]: df[~df.eval("'2017-03-02' < values < '2017-03-05'")]
Out[70]:
values
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
or DataFrame.query():
In [300]: df.query("not ('2017-03-02' < values < '2017-03-05')")
Out[300]:
values
2018-01-01 2017-03-01
2018-01-06 2017-03-06
2018-01-05 2017-03-05
2018-01-02 2017-03-02
