How can I create extra columns during resampling a Pandas data frame?

How can I create extra columns during resampling a Pandas data frame? - python

if I have a dataframe like this:
timestamp price
1596267946298 100.0
1596267946299 101.0
1596267946300 102.0
1596267948301 99.0
1596267948302 98.0
1596267949303 99.0
and I want to create the high, low and average during resampling:
I can duplicate the price column into a high and low column and then during resample do the min, max and mean on the appropriate columns.
But I was wondering if there is a way to make this in one pass?
my expected output would be (let's assume resampling at 100ms for this example)
timestamp price min mean max
1596267946298 100.0 100 100.5 101
1596267946299 101.0 100 100.5 101
1596267946300 102.0 98 99.5 102
1596267948301 99.0 98 99.5 102
1596267948302 98.0 98 99.5 102
1596267949303 99.0 98 995. 102

You could something like this-
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
def custom_func(x):
return x[-1], x.min(), x.max(), x.mean()
result = series.resample('3T').apply(custom_func)
print(pd.DataFrame(result.tolist(), columns=['resampled', 'min', 'max', 'mean'], index=result.index))
Before resampling
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
After resampling
resampled min max mean
2000-01-01 00:00:00 2 0 2 1.0
2000-01-01 00:03:00 5 3 5 4.0
2000-01-01 00:06:00 8 6 8 7.0

Related

Change value if it repeats a certain number of times in a month

I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
And I want every month that the value -9999 is repeated more than 175 times those values get changed to NaN.
Imagine that we have this other dataframe with the number of times the value is repeated per month:
date values
0 2013-01 200
1 2013-02 0
2 2013-03 2
3 2013-04 181
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
In this case, the month of January and April passed the stipulated value and that first dataframe should be:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
I imagined creating a list using tolist() that separates the months that the value appears more than 175 times and then creating a condition if df["values"]==-9999 and df["date"] in list_with_months and then change the values.

You can do this using a transform call where you calculate the number of values per month in the same dataframe. Then you create a new column conditionally on this:
import numpy as np
MISSING = -9999
THRESHOLD = 175
# Create a month column
df['month'] = df['date'].dt.to_period('M')
# Count number of MISSING per month and assign to dataframe
df['n_missing'] = (
df.groupby('month')['values']
.transform(lambda d: (d == MISSING).sum())
)
# If value is MISSING and number of missing is above THRESHOLD, replace with NaN, otherwise keep original values
df['new_value'] = np.where(
(df['values'] == MISSING) & (df['n_missing'] > THRESHOLD),
np.nan,
df['values']
)

Downsampling in Pandas DataFrame by dividing observations into ratios

Given a DataFrame having timestamp (ts), I'd like to these by the hour (downsample). Values that were previously indexed by ts should now be divided into ratios based on the number of minutes left in an hour. [note: divide data in ratios for NaN columns while doing resampling]
ts event duration
0 2020-09-09 21:01:00 a 12
1 2020-09-10 00:10:00 a 22
2 2020-09-10 01:31:00 a 130
3 2020-09-10 01:50:00 b 60
4 2020-09-10 01:51:00 b 50
5 2020-09-10 01:59:00 b 26
6 2020-09-10 02:01:00 c 72
7 2020-09-10 02:51:00 b 51
8 2020-09-10 03:01:00 b 63
9 2020-09-10 04:01:00 c 79
def create_dataframe():
df = pd.DataFrame([{'duration':12, 'event':'a', 'ts':'2020-09-09 21:01:00'},
{'duration':22, 'event':'a', 'ts':'2020-09-10 00:10:00'},
{'duration':130, 'event':'a', 'ts':'2020-09-10 01:31:00'},
{'duration':60, 'event':'b', 'ts':'2020-09-10 01:50:00'},
{'duration':50, 'event':'b', 'ts':'2020-09-10 01:51:00'},
{'duration':26, 'event':'b', 'ts':'2020-09-10 01:59:00'},
{'duration':72, 'event':'c', 'ts':'2020-09-10 02:01:00'},
{'duration':51, 'event':'b', 'ts':'2020-09-10 02:51:00'},
{'duration':63, 'event':'b', 'ts':'2020-09-10 03:01:00'},
{'duration':79, 'event':'c', 'ts':'2020-09-10 04:01:00'},
{'duration':179, 'event':'c', 'ts':'2020-09-10 06:05:00'},
])
df.ts = pd.to_datetime(df.ts)
return df
I want to estimate a produced based on the ratio of time spend and produced. This can be compared to how many lines of code have been completed or find how many actual lines per hour?
for example: at "2020-09-10 00:10:00" we have 22. Then during the period from 21:01 - 00:10, we produced based on
59 min of 21:00 hours -> 7 => =ROUND(22/189*59,0)
60 min of 22:00 hours -> 7 => =ROUND(22/189*60,0)
60 min of 23:00 hours -> 7 => =ROUND(22/189*60,0)
10 min of 00:00 hours -> 1 => =ROUND(22/189*10,0)
the result should be something like.
ts event duration
0 2020-09-09 20:00:00 a NaN
1 2020-09-10 21:00:00 a 7
2 2020-09-10 22:00:00 a 7
3 2020-09-10 23:00:00 a 7
4 2020-09-10 00:00:00 a 1
5 2020-09-10 01:00:00 b ..
6 2020-09-10 02:01:00 c ..
Problem with this approach:
It appears to me that, we are having a serious issue with this approach. If you look at the rows[1] -> 2020-09-10 07:00:00, we have 4, we need to divide it between 3 hours. Considering base duration value as 1 (base unit), we however get
def create_dataframe2():
df = pd.DataFrame([{'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 07:00:00'},
{'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 10:00:00'}])
df.ts = pd.to_datetime(df.ts)
return df
Source
duration event c ts
0 4 c event3.5 2020-09-10 07:00:00
1 4 c event3.5 2020-09-10 10:00:00
Expected Output
ts_hourly mins duration
0 2020-09-10 07:00:00 60.0 2
1 2020-09-10 08:00:00 60.0 1
2 2020-09-10 09:00:00 60.0 1
3 2020-09-10 10:00:00 0.0 0

The first step is to add "previous ts" column to the source DataFrame:
df['tsPrev'] = df.ts.shift()
Then set ts column as the index:
df.set_index('ts', inplace=True)
The third step is to create an auxiliary index, composed of the original
index and "full hours":
ind = df.event.resample('H').asfreq().index.union(df.index)
Then create an auxiliary DataFrame, reindexed with the just created index
and "back fill" event column:
df2 = df.reindex(ind)
df2.event = df2.event.bfill()
Define a function to be applied to each group of rows from df2:
def parts(grp):
lstRow = grp.iloc[-1] # Last row from group
if pd.isna(lstRow.tsPrev): # First group
return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
# Other groups
return -pd.Series([0], index=[lstRow.tsPrev]).append(grp.duration)\
.interpolate(method='index').round().diff(-1)[:-1].astype(int)
Then generate the source data for "produced" column in 2 steps:
Generate detailed data:
prodDet = df2.groupby(np.isfinite(df2.duration.values[::-1]).cumsum()[::-1],
sort=False).apply(parts).reset_index(level=0, drop=True)
The source is df2 grouped this way that each group is terminated
with a row with a non-null value in duration column. Then each group
is processed with parts function.
The result is:
2020-09-09 21:00:00 12
2020-09-09 21:01:00 7
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 1
2020-09-10 00:10:00 80
2020-09-10 01:00:00 50
2020-09-10 01:31:00 60
2020-09-10 01:50:00 50
2020-09-10 01:51:00 26
2020-09-10 01:59:00 36
2020-09-10 02:00:00 36
2020-09-10 02:01:00 51
2020-09-10 02:51:00 57
2020-09-10 03:00:00 6
2020-09-10 03:01:00 78
2020-09-10 04:00:00 1
2020-09-10 04:01:00 85
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
dtype: int32
Generate aggregated data, for the time being also as a Series:
prod = prodDet.resample('H').sum().rename('produced')
This time prodDet is resampled (broken down by hours) and the
result is the sum of values.
The result is:
2020-09-09 21:00:00 19
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 81
2020-09-10 01:00:00 222
2020-09-10 02:00:00 144
2020-09-10 03:00:00 84
2020-09-10 04:00:00 86
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
Freq: H, Name: produced, dtype: int32
Let's describe the content of prodDet:
There is no row for 2020-09-09 20:00:00, because no source row is
from this hour (your data start from 21:01:00).
Row 21:00:00 12 comes from the first source row (you forgot about
it writing the expected result).
Rows for 21:01:00, 22:00:00, 23:00:00 and 00:00:00 come from
"partitioning" of row 00:10:00 a 22, just as a part of your
expected result.
Rows with 80 and 50 come from row containing 130, divided
between rows with hours 00:01:00 and 01:00:00.
And so on.
Now we start to assemble the final result.
Join prod (converted to a DataFrame) with event column:
result = prod.to_frame().join(df2.event)
Add tsMin column - the minimal ts in each hour (as you asked
in one of comments):
result['tsMin'] = df.duration.resample('H').apply(lambda grp: grp.index.min())
Change the index into a regular column and set its name to ts
(like in the source DataFrame):
result = result.reset_index().rename(columns={'index': 'ts'})
The final result is:
ts produced event tsMin
0 2020-09-09 21:00:00 19 a 2020-09-09 21:01:00
1 2020-09-09 22:00:00 7 a NaT
2 2020-09-09 23:00:00 7 a NaT
3 2020-09-10 00:00:00 81 a 2020-09-10 00:10:00
4 2020-09-10 01:00:00 222 a 2020-09-10 01:31:00
5 2020-09-10 02:00:00 144 c 2020-09-10 02:01:00
6 2020-09-10 03:00:00 84 b 2020-09-10 03:01:00
7 2020-09-10 04:00:00 86 c 2020-09-10 04:01:00
8 2020-09-10 05:00:00 87 c NaT
9 2020-09-10 06:00:00 7 c 2020-09-10 06:05:00
E.g. the value of 81 for 00:00:00 is a sum of 1 and 80 (the first
part resulting from row with 130), see prodDet above.
Some values in tsMin column are empty, for hours in which there is no
source row.
If you want to totally drop the result from the first row (with
duration == 12), change return pd.Series([lstRow.duration]... to
return pd.Series([0]... (the 4-th row of parts function).
To sum up, my solution is more pandasonic and significantly shorter
than yours (17 rows (my solution) vs. about 70 (yours), excluding comments).

I was not able to find a solution in pandas, so I created a solution with plain python.
Basically, I am iterating over all the values after sorting and sending two datetimes viz start_time and end_time to a function, which does the processing.
def get_ratio_per_hour(start_time: list, end_time: list, data_: int):
# get total hours between the start and end, use this for looping
totalhrs = lambda x: [1 for _ in range(int(x // 3600))
] + [
(x % 3600 / 3600
or 0.1 # added for loop fix afterwards
)]
# check if Start and End are not in same hour
if start_time.hour != end_time.hour:
seconds = (end_time - start_time).total_seconds()
if seconds < 3600:
parts_ = [1] + totalhrs(seconds)
else:
parts_ = totalhrs(seconds)
else:
# parts_ define the loop iterations
parts_ = totalhrs((end_time - start_time).total_seconds())
sum_of_hrs = sum(parts_)
# for Constructing DF
new_hours = []
mins = []
# Clone data
start_time_ = start_time
end_time_ = end_time
for e in range(len(parts_)):
# print(parts_[e])
if sum_of_hrs != 0:
if sum_of_hrs > 1:
if end_time_.hour != start_time_.hour:
# Floor > based on the startTime +1 hour
floor_time = (start_time_ + timedelta(hours=1)).floor('H')
#
new_hours.append(start_time_.floor('H'))
mins.append((floor_time - start_time_).total_seconds() // 60)
sum_of_hrs = sum_of_hrs - 1
start_time_ = floor_time
else:
# Hour is same.
floor_time = (start_time_ + timedelta(hours=1)).floor('H')
new_hours.append(start_time_.floor('H'))
mins.append((floor_time - start_time_).total_seconds() // 60)
sum_of_hrs = sum_of_hrs - 1
start_time_ = floor_time
else:
if end_time_.hour != start_time_.hour:
# Get round off hour
floor_time = (end_time_ + timedelta(hours=1)).floor('H')
new_hours.append(end_time_.floor('H'))
mins.append(60 - ((floor_time - end_time_).total_seconds() // 60)
)
sum_of_hrs = sum_of_hrs - 1
start_time_ = floor_time
else:
# Hour is same.
floor_time = (end_time_ + timedelta(hours=1)).floor('H')
new_hours.append(end_time_.floor('H'))
mins.append((end_time_ - start_time_).total_seconds() // 60)
sum_of_hrs = sum_of_hrs - 1
start_time_ = floor_time
# Get DataFrame Build
df_out = pd.DataFrame()
df_out['hours'] = pd.Series(new_hours)
df_out['mins'] = pd.Series(mins)
df_out['ratios'] = round(data_ / sum(mins) * df_out['mins'])
return df_out
Now, let's run the code for each iteration
time_val=[]
split_f_val=[]
split_field = 'duration'
time_field = 'ts'
# creating DataFrames for intermediate results!
df_final = pd.DataFrame()
df2 = pd.DataFrame()
for ix, row in df.iterrows():
time_val.append(row[str(time_field)])
split_f_val.append(int(row[str(split_field)]))
# Skipping First Element for Processing. Therefore, having minimum two data values
if ix !=0:
# getting Last Two Values
new_time_list=time_val[-2:]
new_data_list=split_f_val[-2:]
# get times to compare
start_time=new_time_list[: -1][0]
end_time=new_time_list[1:][0]
# get latest Data to divide
data_ = new_data_list[1:][0]
# print(start_time)
# print(end_time)
df2 = get_ratio_per_hour(start_time,end_time, data_ )
df_final = pd.concat([df_final
, df2], ignore_index=True)
else:
# Create Empty DataFrame for First Value.
df_final = pd.DataFrame([[np.nan,np.nan,np.nan] ],
columns=['hours', 'mins', 'ratios'])
df_final = pd.concat([df_final
, df2], ignore_index=True)
result = df_final.groupby(['hours'])['ratios'].sum()
Intermediate DataFrame:
hours mins ratios
0
0 2020-09-09 21:00:00 59.0 7.0
1 2020-09-09 22:00:00 60.0 7.0
2 2020-09-09 23:00:00 60.0 7.0
3 2020-09-10 00:00:00 10.0 1.0
0 2020-09-10 00:00:00 50.0 80.0
1 2020-09-10 01:00:00 31.0 50.0
0 2020-09-10 01:00:00 19.0 60.0
0 2020-09-10 01:00:00 1.0 50.0
0 2020-09-10 01:00:00 8.0 26.0
0 2020-09-10 01:00:00 1.0 36.0
1 2020-09-10 02:00:00 1.0 36.0
0 2020-09-10 02:00:00 50.0 51.0
0 2020-09-10 02:00:00 9.0 57.0
1 2020-09-10 03:00:00 1.0 6.0
0 2020-09-10 03:00:00 59.0 78.0
1 2020-09-10 04:00:00 1.0 1.0
0 2020-09-10 04:00:00 59.0 85.0
1 2020-09-10 05:00:00 60.0 87.0
2 2020-09-10 06:00:00 5.0 7.0
Final Output:
hours ratios
2020-09-09 21:00:00 7.0
2020-09-09 22:00:00 7.0
2020-09-09 23:00:00 7.0
2020-09-10 00:00:00 81.0
2020-09-10 01:00:00 222.0
2020-09-10 02:00:00 144.0
2020-09-10 03:00:00 84.0
2020-09-10 04:00:00 86.0
2020-09-10 05:00:00 87.0
2020-09-10 06:00:00 7.0

How to create a rolling time window with offset in Pandas

I want to apply some statistics on records within a time window with an offset. My data looks something like this:
lon lat stat ... speed course head
ts ...
2016-09-30 22:00:33.272 5.41463 53.173161 15 ... 0.0 0.0 511
2016-09-30 22:01:42.879 5.41459 53.173180 15 ... 0.0 0.0 511
2016-09-30 22:02:42.879 5.41461 53.173161 15 ... 0.0 0.0 511
2016-09-30 22:03:44.051 5.41464 53.173168 15 ... 0.0 0.0 511
2016-09-30 22:04:53.013 5.41462 53.173141 15 ... 0.0 0.0 511
[5 rows x 7 columns]
I need the records within time windows of 600 seconds, with steps of 300 seconds. For example, these windows:
start end
2016-09-30 22:00:00.000 2016-09-30 22:10:00.000
2016-09-30 22:05:00.000 2016-09-30 22:15:00.000
2016-09-30 22:10:00.000 2016-09-30 22:20:00.000
I have looked at Pandas rolling to do this. But it seems like it does not have the option to add the offset which I described above. Am I overlooking something, or should I create a custom function for this?

What you want to achieve should be possible by combining DataFrame.resample with DataFrame.shift.
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
df = pd.DataFrame(series)
That will give you a primitive timeseries (example taken from api docs DataFrame.resample).
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Now resample by your step size (see DataFrame.shift).
sampled = df.resample('90s').sum()
This will give you non-overlapping windows of the step size.
2000-01-01 00:00:00 1
2000-01-01 00:01:30 2
2000-01-01 00:03:00 7
2000-01-01 00:04:30 5
2000-01-01 00:06:00 13
2000-01-01 00:07:30 8
Finally, shift the sampled df by one step and sum with the previously created df. Window size being twice the step size helps.
sampled.shift(1, fill_value=0) + sampled
This will yield:
2000-01-01 00:00:00 1
2000-01-01 00:01:30 3
2000-01-01 00:03:00 9
2000-01-01 00:04:30 12
2000-01-01 00:06:00 18
2000-01-01 00:07:30 21
There may be a more elegant solution, but I hope this helps.

Sampling dataframe Considering NaN values+Pandas

I have a data frame like below. I want to do sampling with '3S'
So there are situations where NaN is present. What I was expecting is the data frame should do sampling with '3S' and also if there is any 'NaN' found in between then stop there and start the sampling from that index. I tried using dataframe.apply method to achieve but it looks very complex. Is there any short way to achieve?
df.sample(n=3)
Code to generate Input:
index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
print series
series.iloc[4] = 'NaN'
series.iloc[10] = 'NaN'
I tried to do sampling but after that there is no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should sample based on '3S' also take into account of 'NaN' if present and start the sampling from there where 'NaN' records are found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S

Use:
index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='H')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create mask for compare missing values by Series.isna
Create groups by consecutive values by comparing shifted values with Series.ne (!=)
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get first value of index per groups, add timdelta (for expected output are added 2T) and compare by DatetimeIndex
Last filter by boolean indexing and chained masks by | for bitwise OR

One way would be to Fill the NAs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
And then have the resampling to be done on the series:
(if datetime is your index)
series.resample('30S').asfreq()

best way to fill up gaps by yearly dates in Python dataframe

all, I'm newbie to Python and am stuck with the problem below. I have a DF as:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2013-01-02 93
5 2017-02-01 43
For the yearly gaps, say 2012, 2014, 2015, and 2016, I'd like to fill in the gap using the new year date for each of the missing years, and port_id from previous year. Ideally, I'd like:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2012-01-01 76
5 2013-01-02 93
6 2014-01-01 93
7 2015-01-01 93
8 2016-01-01 93
9 2017-02-01 43
I tried multiple approaches but still no avail. Could some expert shed me some lights on how to make it work out? Thanks much in advance!

You can use set.difference with range to find missing dates and then append a dataframe:
# convert to datetime if not already converted
df['asofdate'] = pd.to_datetime(df['asofdate'])
# calculate missing years
years = df['asofdate'].dt.year
missing = set(range(years.min(), years.max())) - set(years)
# append dataframe, sort and front-fill
df = df.append(pd.DataFrame({'asofdate': pd.to_datetime(list(missing), format='%Y')}))\
.sort_values('asofdate')\
.ffill()
print(df)
asofdate port_id
1 2010-01-01 76.0
2 2010-04-01 43.0
3 2011-02-01 76.0
1 2012-01-01 76.0
4 2013-01-02 93.0
2 2014-01-01 93.0
3 2015-01-01 93.0
0 2016-01-01 93.0
5 2017-02-01 43.0

I would create a helper dataframe, containing all the year start dates, then filter out the ones where the years match what is in df, and finally merge them together:
# First make sure it is proper datetime
df['asofdate'] = pd.to_datetime(df.asofdate)
# Create your temporary dataframe of year start dates
helper = pd.DataFrame({'asofdate':pd.date_range(df.asofdate.min(), df.asofdate.max(), freq='YS')})
# Filter out the rows where the year is already in df
helper = helper[~helper.asofdate.dt.year.isin(df.asofdate.dt.year)]
# Merge back in to df, sort, and forward fill
new_df = df.merge(helper, how='outer').sort_values('asofdate').ffill()
>>> new_df
asofdate port_id
0 2010-01-01 76.0
1 2010-04-01 43.0
2 2011-02-01 76.0
5 2012-01-01 76.0
3 2013-01-02 93.0
6 2014-01-01 93.0
7 2015-01-01 93.0
8 2016-01-01 93.0
4 2017-02-01 43.0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I create extra columns during resampling a Pandas data frame? - python

Related

Change value if it repeats a certain number of times in a month

Downsampling in Pandas DataFrame by dividing observations into ratios

How to create a rolling time window with offset in Pandas

Sampling dataframe Considering NaN values+Pandas

best way to fill up gaps by yearly dates in Python dataframe

Categories

Resources