I have 4 portfolios a, b, c, d, each of which can take the value "no" or "own" over a period of time (code included below to facilitate replication).
from datetime import datetime
import pandas as pd

ano=('a','no',datetime(2018,1,1), datetime(2018,1,2))
aown=('a','own',datetime(2018,1,3), datetime(2018,1,4))
bno=('b','no',datetime(2018,1,1), datetime(2018,1,5))
bown=('b','own',datetime(2018,1,6), datetime(2018,1,7))
cown=('c','own',datetime(2018,1,9), datetime(2018,1,10))
down=('d','own',datetime(2018,1,9), datetime(2018,1,9))
sch=pd.DataFrame([ano,aown,bno,bown,cown,down],columns=['portf','base','st','end'])
Summary of schedule:
portf base st end
0 a no 2018-01-01 2018-01-02
1 a own 2018-01-03 2018-01-04
2 b no 2018-01-01 2018-01-05
3 b own 2018-01-06 2018-01-07
4 c own 2018-01-09 2018-01-10
5 d own 2018-01-09 2018-01-09
What I have tried: create a holding DataFrame and fill in values based on the schedule. Unfortunately the first portfolio 'a' gets overwritten:
df=pd.DataFrame(index=pd.date_range(min(sch.st),max(sch.end)),columns=['portf','base'])
for row in range(len(sch)):
    df.loc[sch['st'][row]:sch['end'][row],['portf','base']]= sch.loc[row,['portf','base']].values
portf base
2018-01-01 b no
2018-01-02 b no
2018-01-03 b no
2018-01-04 b no
2018-01-05 b no
2018-01-06 b own
2018-01-07 b own
2018-01-08 NaN NaN
2018-01-09 d own
2018-01-10 c own
desired output:
2018-01-01 (('a','no'), ('b','no'))
2018-01-02 (('a','no'), ('b','no'))
2018-01-03 (('a','own'), ('b','no'))
2018-01-04 (('a','own'), ('b','no'))
2018-01-05 ('b','no')
...
I am sure there's an easier way of achieving this, but it's probably a case I haven't encountered before. Many thanks in advance!
I would organize the data differently: the date as the index, one column per portf, and base as the values.
First we need to reshape the data and resample to daily frequency. Then it's a simple pivot.
cols = ['portf', 'base']
df = sch  # the schedule DataFrame from the question

s = (df.reset_index()
       .melt(cols+['index'], value_name='date')
       .set_index('date')
       .groupby(cols+['index'], group_keys=False)
       .resample('D').ffill()
       .drop(columns=['variable', 'index'])
       .reset_index())
res = s.pivot(index='date', columns='portf')
res = res.resample('D').first()  # recover the missing dates in between
Output of res:
base
portf a b c d
2018-01-01 no no NaN NaN
2018-01-02 no no NaN NaN
2018-01-03 own no NaN NaN
2018-01-04 own no NaN NaN
2018-01-05 NaN no NaN NaN
2018-01-06 NaN own NaN NaN
2018-01-07 NaN own NaN NaN
2018-01-08 NaN NaN NaN NaN
2018-01-09 NaN NaN own own
2018-01-10 NaN NaN own NaN
If you need your other output, we can get there with some less-than-ideal Series.apply calls. This will be very slow for a large DataFrame; I would seriously consider keeping the above format.
s.set_index('date').apply(tuple, axis=1).groupby('date').apply(tuple)
date
2018-01-01 ((a, no), (b, no))
2018-01-02 ((a, no), (b, no))
2018-01-03 ((a, own), (b, no))
2018-01-04 ((a, own), (b, no))
2018-01-05 ((b, no),)
2018-01-06 ((b, own),)
2018-01-07 ((b, own),)
2018-01-09 ((c, own), (d, own))
2018-01-10 ((c, own),)
dtype: object
Related
I have a pd.DataFrame with a time series as index:
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 133.0
2018-01-02 14:15:00+00:00 134.0 134.0
I am interested in preserving the first non-NaN value of each column; the rest of the elements should be NaN:
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 NaN
2018-01-02 14:15:00+00:00 NaN NaN
Does pandas/numpy have an operation to achieve this in a vectorized way (without writing for loops)?
You can apply Series.first_valid_index per column and mask the other rows with NaN:
import numpy as np

df[df.apply(lambda col: col.index != col.first_valid_index())] = np.nan
print(df)
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 NaN
2018-01-02 14:15:00+00:00 NaN NaN
Using boolean masking:
m1 = df.isna().cummin()         # True for the leading NaNs before each column's first non-NA
m2 = m1.shift(fill_value=True)  # shifted down one row (the True fill also covers a valid first row)
out = df.where(m2 & ~m1)        # keep only the first non-NA of each column
output:
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 NaN
2018-01-02 14:15:00+00:00 NaN NaN
I want to create a DataFrame or Series using the index of one existing time series and the values from another time series with different time indices. The series look like:
<class 'pandas.core.series.Series'>
DT
2018-01-02 172.3000
2018-01-03 174.5500
2018-01-04 173.4700
2018-01-05 175.3700
2018-01-08 175.6100
2018-01-09 175.0600
2018-01-10 174.3000
2018-01-11 175.4886
2018-01-12 177.3600
2018-01-16 179.3900
2018-01-17 179.2500
2018-01-18 180.1000
...
and
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-09 175.610
2018-01-16 177.360
2018-01-23 180.100
...
I want to use the index from the first series and fill it with the values from the second series where the indices match. Like:
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-08 NaN
2018-01-09 175.610
2018-01-10 NaN
2018-01-11 NaN
2018-01-12 NaN
2018-01-16 177.360
2018-01-17 NaN
2018-01-18 NaN
...
Thx
IIUC, use Series.reindex:
new_s = s2.reindex(s1.index)
#2018-01-02 NaN
#2018-01-03 NaN
#2018-01-04 NaN
#2018-01-05 NaN
#2018-01-08 NaN
#2018-01-09 175.61
#2018-01-10 NaN
#2018-01-11 NaN
#2018-01-12 NaN
#2018-01-16 177.36
#2018-01-17 NaN
#2018-01-18 NaN
#Name: s2, dtype: float64
Convert your Series into DataFrames, then use the following line:
pd.merge(TS1,TS2,left_index=True,right_index=True,how='left').iloc[:,-1]
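For example, a minimal sketch of that conversion (assuming pandas is imported as pd, TS1 and TS2 are the two Series shown above, and the column names ts1/ts2 are just for illustration):
ts1_df = TS1.to_frame('ts1')  # turn each Series into a one-column DataFrame
ts2_df = TS2.to_frame('ts2')
merged = pd.merge(ts1_df, ts2_df, left_index=True, right_index=True, how='left')
result = merged['ts2']        # TS2's values aligned to TS1's index, NaN elsewhere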
I have grouped time series with gaps. I want to fill the gaps, respecting the groupings.
date is unique within each id.
The following works but gives me zeros where I want NaNs:
data.groupby('id').resample('D', on='date').sum()\
.drop('id', axis=1).reset_index()
Neither of the following works, for some reason:
data.groupby('id').resample('D', on='date').asfreq()\
.drop('id', axis=1).reset_index()
data.groupby('id').resample('D', on='date').fillna('pad')\
.drop('id', axis=1).reset_index()
I get the following error:
Upsampling from level= or on= selection is not supported, use .set_index(...) to explicitly set index to datetime-like
I've tried using pandas.Grouper with set_index (as a MultiIndex or a single index), but it either does not upsample my date column to give me continuous dates, or it does not respect the id column.
Pandas is version 0.23
Try it yourself:
import pandas as pd
from datetime import datetime

data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'date': [
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10),
        datetime(2018, 1, 1),
        datetime(2018, 1, 5),
        datetime(2018, 1, 10)],
    'value': [100, 110, 90, 50, 40, 60]})
# Works but gives zeros
data.groupby('id').resample('D', on='date').sum()
# Fails
data.groupby('id').resample('D', on='date').asfreq()
data.groupby('id').resample('D', on='date').fillna('pad')
Create a DatetimeIndex and remove the on parameter from resample:
print (data.set_index('date').groupby('id').resample('D').asfreq())
id
id date
1 2018-01-01 1.0
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 1.0
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 NaN
2018-01-09 NaN
2018-01-10 1.0
2 2018-01-01 2.0
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 2.0
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 NaN
2018-01-09 NaN
2018-01-10 2.0
print (data.set_index('date').groupby('id').resample('D').fillna('pad'))
#alternatives
#print (data.set_index('date').groupby('id').resample('D').ffill())
#print (data.set_index('date').groupby('id').resample('D').pad())
id
id date
1 2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 1
2018-01-09 1
2018-01-10 1
2 2018-01-01 2
2018-01-02 2
2018-01-03 2
2018-01-04 2
2018-01-05 2
2018-01-06 2
2018-01-07 2
2018-01-08 2
2018-01-09 2
2018-01-10 2
EDIT:
If you want to use sum with missing values, you need the min_count=1 parameter - see sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
print (data.groupby('id').resample('D', on='date').sum(min_count=1))
I want to up-sample a series from weekly to daily frequency by forward filling the result.
If the last observation of my original series is NaN, I would have expected this value to be replaced by the previous valid value, but instead it remains as NaN.
SETUP
import numpy as np
import pandas as pd
all_dates = pd.date_range(start='2018-01-01', freq='W-WED', periods=4)
ts = pd.Series([1, 2, 3], index=all_dates[:3])
ts[all_dates[3]] = np.nan
ts
Out[16]:
2018-01-03 1.0
2018-01-10 2.0
2018-01-17 3.0
2018-01-24 NaN
Freq: W-WED, dtype: float64
RESULT
ts.resample('B').ffill()
Out[17]:
2018-01-03 1.0
2018-01-04 1.0
2018-01-05 1.0
2018-01-08 1.0
2018-01-09 1.0
2018-01-10 2.0
2018-01-11 2.0
2018-01-12 2.0
2018-01-15 2.0
2018-01-16 2.0
2018-01-17 3.0
2018-01-18 3.0
2018-01-19 3.0
2018-01-22 3.0
2018-01-23 3.0
2018-01-24 NaN
Freq: B, dtype: float64
I was expecting the last value to be 3 as well.
Does anyone have an explanation for this behaviour?
resample() returns a DatetimeIndexResampler.
You need to get back an ordinary pandas Series first.
You can use the asfreq() method to do that before filling the NaNs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html
So, this should work:
ts.resample('B').asfreq().ffill()
The point of resample and ffill is simply to propagate forward from the first day of the week - if the first day of the week is NaN, that's what gets filled forward. For example:
ts.iloc[1] = np.nan
ts.resample('B').ffill()
2018-01-03 1.0
2018-01-04 1.0
2018-01-05 1.0
2018-01-08 1.0
2018-01-09 1.0
2018-01-10 NaN
2018-01-11 NaN
2018-01-12 NaN
2018-01-15 NaN
2018-01-16 NaN
2018-01-17 3.0
2018-01-18 3.0
2018-01-19 3.0
2018-01-22 3.0
2018-01-23 3.0
2018-01-24 NaN
Freq: B, dtype: float64
In most cases, propagating the previous week's data forward would not be desired behaviour. If you'd like to use previous weeks' data where the original (weekly) series has missing values, just ffill the weekly series first, as in the sketch below.
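A minimal sketch of that idea, assuming ts is the weekly series from the SETUP above:
ts.ffill().resample('B').ffill()  # fill the weekly NaN first, then up-sample to business days
# the final value at 2018-01-24 now becomes 3.0 instead of NaN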
I have a dataframe (named df), sorted by identifier, id_number and contract_year_month, that looks like this so far:
**identifier id_number contract_year_month collection_year_month**
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15
and would like to add a column named 'date_difference' that consists of contract_year_month minus the previous row's collection_year_month, within each identifier and id_number (e.g. 2018-01-08 minus 2018-01-09),
so that the df would be:
**identifier id_number contract_year_month collection_year_month date_difference**
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10 -1
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18 10
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15 2
I already converted the contract_year_month and collection_year_month columns to datetime, and tried working with a simple shift function or iloc, but neither works.
df["date_difference"] = df.groupby(["identifier", "id_number"])["contract_year_month"]
Is there any way to use groupby to get the difference between the current row's value and the previous row's value in another column, grouped by the two identifiers? (I've searched for an hour but couldn't find a hint...) I would sincerely appreciate any advice.
Here is one potential way to do this.
First create a boolean mask, then use numpy.where and Series.shift to create the column date_difference:
import numpy as np

mask = df.duplicated(['identifier', 'id_number'])
df['date_difference'] = np.where(
    mask,
    (df['contract_year_month'] - df['collection_year_month'].shift(1)).dt.days,
    np.nan)
[output]
identifier id_number contract_year_month collection_year_month date_difference
0 K001 1 2018-01-03 2018-01-09 NaN
1 K001 1 2018-01-08 2018-01-10 -1.0
2 K001 2 2018-01-01 2018-01-05 NaN
3 K001 2 2018-01-15 2018-01-18 10.0
4 K002 4 2018-01-04 2018-01-07 NaN
5 K002 4 2018-01-09 2018-01-15 2.0
Here's one approach using your groupby() (Updated based on feedback from #piRSquared):
In []:
(df['contract_year_month'] -
 df['collection_year_month']
   .groupby([df['identifier'], df['id_number']])
   .shift()).dt.days
Out[]:
0 NaN
1 -1.0
2 NaN
3 10.0
4 NaN
5 2.0
dtype: float64
You can just assign this to df['date_difference'].
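For example, a quick sketch of that assignment, using the expression above and assuming df is the DataFrame from the question:
df['date_difference'] = (df['contract_year_month'] -
                         df['collection_year_month']
                         .groupby([df['identifier'], df['id_number']])
                         .shift()).dt.days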