How to fill periods in columns? - python

There is a dataframe. Its period column contains lists of time spans.
#load data
df = pd.DataFrame(data, columns=['task_id', 'target_start_date', 'target_end_date'])
df['target_start_date'] = pd.to_datetime(df.target_start_date)
df['target_end_date'] = pd.to_datetime(df.target_end_date)
df['period'] = np.nan

#create period column
z = dict()
freq = 'M'
for i in range(0, len(df)):
    l = pd.period_range(df.target_start_date[i], df.target_end_date[i], freq=freq)
    l = l.to_native_types()
    z[i] = l
df.period = z.values()
Output
task_id target_start_date target_end_date period
0 35851 2019-04-01 07:00:00 2019-04-01 07:00:00 [2019-04]
1 35852 2020-02-26 11:30:00 2020-02-26 11:30:00 [2020-02]
2 35854 2019-05-17 07:00:00 2019-06-01 17:30:00 [2019-05, 2019-06]
3 35855 2019-03-20 11:30:00 2019-04-07 15:00:00 [2019-03, 2019-04]
4 35856 2019-04-06 08:00:00 2019-04-26 19:00:00 [2019-04]
Then I add columns which are called time slices.
#create slices
date_min = df.target_start_date.min()
date_max = df.target_end_date.max()
period = pd.period_range(date_min, date_max, freq=freq)
#add columns
for i in period:
    df[str(i)] = np.nan
The result (shown as an image in the original post) is the same dataframe with one all-NaN column per monthly slice.
How can I fill these NaN values with True when the column's period appears in the list in the period column?

Apply a function across the dataframe rows
def fillit(row):
    for i in row.period:
        row[i] = True
    return row

df = df.apply(fillit, axis=1)
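If you prefer to avoid apply, a vectorized sketch along the following lines may also work; it builds the indicator columns with explode and get_dummies. The small df below is a hypothetical stand-in for the question's frame, keeping only its period column.

import pandas as pd

# hypothetical stand-in for the question's frame, keeping only the 'period' column
df = pd.DataFrame({'period': [['2019-04'], ['2020-02'], ['2019-05', '2019-06']]})

# one row per (row, month) pair, then collapse back to one indicator column per month
dummies = pd.get_dummies(df['period'].explode()).groupby(level=0).max()

# keep True where the month occurs in that row's list, NaN elsewhere
df = df.join(dummies.astype(bool).where(lambda d: d))
print(df)

Note that this only creates columns for months that actually occur somewhere in period; if you need every month between date_min and date_max, reindex the dummy columns with the full pd.period_range first.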

My approach was to iterate over rows and column names and compare values:
import numpy as np
import pandas as pd

# handle assignment error
pd.options.mode.chained_assignment = None

# setup test data
data = {'time': [['2019-04'], ['2019-01'], ['2019-03'], ['2019-06', '2019-05']]}
data = pd.DataFrame(data=data)

# create periods
date_min = data.time.min()[0]
date_max = data.time.max()[0]
period = pd.period_range(date_min, date_max, freq='M')
for i in period:
    data[str(i)] = np.nan

# compare and fill data
for index, row in data.iterrows():
    for column in data:
        if data[column].name in row['time']:
            data[column][index] = 'True'
Output:
time 2019-01 2019-02 2019-03 2019-04 2019-05 2019-06
0 [2019-04] NaN NaN NaN True NaN NaN
1 [2019-01] True NaN NaN NaN NaN NaN
2 [2019-03] NaN NaN True NaN NaN NaN
3 [2019-06, 2019-05] NaN NaN NaN NaN True True

Related

python masking each day in dataframe

I have to compute a daily sum on a dataframe, but only if at least 70% of that day's data is not NaN; otherwise the day must not be taken into account. Is there a way to create such a mask? My dataframe covers more than 17 years of hourly data.
my data is something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, there is no way to know whether a given day had missing data; those days will have lower sums and therefore lower my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This is an example that would sum all days (assuming regular data points, 24 per day) with fewer than 50% NaN entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df["data"][df["data"] > 50] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
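For the 70% rule from the question (rather than a fixed count of 12 NaN entries), a sketch along these lines may be closer to what is asked; it reuses the hourly df built in the snippet above, with its DatetimeIndex and single data column:

valid_frac = df["data"].notna().groupby(df.index.date).mean()  # share of non-NaN hours per day
daily_sum = df["data"].groupby(df.index.date).sum()            # NaNs count as 0 in this sum
daily_sum = daily_sum.where(valid_frac >= 0.7)                 # mask days with <70% coverage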
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c
total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))
input_data = [] # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    .gen_converter(
        debug=False
    )  # install black and set to True to see the generated ad-hoc code
)
result = converter(input_data)

Merge two data-frames based on multiple conditions

I am looking to compare two dataframes (df-a and df-b) and search for where a given ID and date from one dataframe (df-b) sits within a date range in the other dataframe (df-a) where the IDs match. I then want to take all the columns in df-a and concat them to df-b where they match, e.g.
If I have a dataframe df-a, in the following format
df-a:
ID Start_Date End_Date A B C D E
0 cd2 2020-06-01 2020-06-24 'a' 'b' 'c' 10 20
1 cd2 2020-06-24 2020-07-21
2 cd56 2020-06-10 2020-07-03
3 cd915 2020-04-28 2020-07-21
4 cd103 2020-04-13 2020-04-24
and df-b in
ID Date
0 cd2 2020-05-12
1 cd2 2020-04-12
2 cd2 2020-06-10
3 cd15 2020-04-28
4 cd193 2020-04-13
I would like an output dataframe (df-c) like so:
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 - - - - - - -
1 cd2 2020-04-12 - - - - - - -
2 cd2 2020-06-10 2020-06-01 2020-06-11 'a' 'b' 'c' 10 20
3 cd15 2020-04-28 - - - - - - -
4 cd193 2020-04-13 - - - - - - -
In a previous post I got a brilliant answer which allowed me to compare the dataframes and drop rows wherever this condition was met, but I am struggling to figure out how to extract the matching information from df-a. Current attempts are below!
df_c = df_b.copy()
ar = []
for i in range(df_c.shape[0]):
    currentID = df_c.stafnum[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.stafnum == currentID]
    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j, :].Leave_Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j, :].Leave_End_Date
        if (startDate <= currentDate <= endDate):
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j, :])
            #df_d = pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j, :]], axis=0)
            #df_fin_2 = df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)
So you want to make a sort of "soft" match. Here's a solution that attempts to vectorize the date range matching. Note that the names are swapped relative to the question: here df_a is the small ID/Date frame and df_b is the frame with the date ranges.
# notice we work with dates as strings; the inequalities only work if dates are in y-m-d format
# otherwise it is safer to parse all date columns first, e.g. df_a.Date = pd.to_datetime(df_a.Date)
# create a groupby object once so we can efficiently filter df_b inside the loop
# good idea if df_b is considerably large and has many different IDs
gdf_b = df_b.groupby('ID')
b_IDs = gdf_b.indices # returns a dictionary with grouped rows {ID: arr(integer-indices)}
matched = [] # so we can collect matched rows from df_b
# iterate over rows with `.itertuples()`, more efficient than iterating range(len(df_a))
for i, ID, date in df_a.itertuples():
    if ID in b_IDs:
        gID = gdf_b.get_group(ID)  # get the filtered df_b
        inrange = gID.Start_Date.le(date) & gID.End_Date.ge(date)
        if any(inrange):
            matched.append(
                gID.loc[inrange.idxmax()]  # get the first row with a date in range
                .values[1:]  # use the array without column indices and slice `ID` out
            )
        else:
            matched.append([np.nan] * (df_b.shape[1] - 1))  # no date in range, fill with NaNs
    else:
        matched.append([np.nan] * (df_b.shape[1] - 1))  # no ID match, fill with NaNs
df_c = df_a.join(pd.DataFrame(matched, columns=df_b.columns[1:]))
print(df_c)
Output
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 NaN NaN NaN NaN NaN NaN NaN
1 cd2 2020-04-12 NaN NaN NaN NaN NaN NaN NaN
2 cd2 2020-06-10 2020-06-01 2020-06-24 a b c 10.0 20.0
3 cd15 2020-04-28 NaN NaN NaN NaN NaN NaN NaN
4 cd193 2020-04-13 NaN NaN NaN NaN NaN NaN NaN
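A merge-then-filter sketch may also be worth considering when both frames fit comfortably in memory. The snippet below is a sketch only: it follows the question's layout (df_a with ID, Start_Date, End_Date, A..E and df_b with ID, Date), assumes the date columns are already parsed with pd.to_datetime, and assumes at most one matching range should be kept per df_b row.

import numpy as np

merged = df_b.reset_index().merge(df_a, on='ID', how='left')
hit = merged['Date'].between(merged['Start_Date'], merged['End_Date'])

# blank out df_a's columns wherever the date falls outside the matched range
merged.loc[~hit, df_a.columns.drop('ID')] = np.nan

# keep one row per original df_b row, preferring an in-range match
df_c = (merged.assign(hit=hit)
              .sort_values(['index', 'hit'], ascending=[True, False])
              .drop_duplicates('index')
              .set_index('index')
              .drop(columns='hit')
              .rename_axis(None))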

How can I count the rows between a date index and a date one month in the future in pandas vectorized to add them as a column?

I have a dataframe (df) with a date index, and I want to achieve the following:
1. Take the Dates column and add one month, e.g. nxt_dt = df.index + np.timedelta64(month=1), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put them into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, the df only contains dates/rows for which there are observations.
I can think of several ways to get this done with loops, but since the data I work with is quite big, I would really like to avoid looping through rows to fill the column. Is there a way to get this done in pandas in a vectorized fashion?
If you are OK with reindexing, this should do the job:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24', '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
df = df.reindex(pd.date_range('2020-01-01','2020-03-04')).fillna(0)
df = df.sort_index(ascending=False)
df['30d'] = df['value'].rolling(30).sum() - 1
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
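If reindexing to a full daily calendar is not desirable (the data spans many years and only observed dates are present), a searchsorted-based sketch like the following may also work; it assumes df is sorted by its DatetimeIndex and counts the observations strictly between each date and the date one calendar month later:

import numpy as np
import pandas as pd

nxt = df.index + pd.DateOffset(months=1)              # each date shifted one month forward
df['30d'] = df.index.searchsorted(nxt) - np.arange(len(df)) - 1  # rows strictly in between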

Shifting Date time index

I am trying to shift my datetime index so that only the last row moves by one day, e.g. 2018-04-09 would show as 2018-04-08. I tried a few ways and got different errors, such as the one below:
df.index[-1] = df.index[-1] + pd.offsets.Day(1)
TypeError: Index does not support mutable operations
Can you kindly advise a suitable way please?
My df looks like this:
FinalPosition
dt
2018-04-03 1.32
2018-04-04 NaN
2018-04-05 NaN
2018-04-06 NaN
2018-04-09 NaN
Use rename if the values of the DatetimeIndex are unique:
df = df.rename({df.index[-1]: df.index[-1] + pd.offsets.Day(1)})
print (df)
FinalPosition
dt
2018-04-03 1.32
2018-04-04 NaN
2018-04-05 NaN
2018-04-06 NaN
2018-04-10 NaN
If the values are possibly not unique, DatetimeIndex.insert works:
df.index = df.index[:-1].insert(len(df), df.index[-1] + pd.offsets.Day(1))
Use .iloc
Ex:
import pandas as pd
df = pd.DataFrame({"datetime": ["2018-04-09"]})
df["datetime"] = pd.to_datetime(df["datetime"])
print df["datetime"].iloc[-1:] - pd.offsets.Day(1)
Output:
0 2018-04-08
Name: datetime, dtype: datetime64[ns]
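For context on the TypeError in the question: an Index is immutable, so you cannot assign to a single position, but you can rebuild it and assign the whole index back. A minimal sketch, assuming replacing the entire index is acceptable:

import pandas as pd

new_index = df.index.to_list()
new_index[-1] = new_index[-1] + pd.offsets.Day(1)  # shift only the last label
df.index = pd.DatetimeIndex(new_index, name=df.index.name)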

how to resample without skipping nan values in pandas

I am trying to get the 10-day aggregate of my data, which has NaN values. The sum over a 10-day window should return NaN if there is a NaN value anywhere in that window.
When I apply the code below, pandas treats NaN as zero and returns the sum of the remaining days.
dateRange = pd.date_range(start_date, periods=len(data), freq='D')
# Creating a data frame so that the timeseries can handle numpy array.
df = pd.DataFrame(data)
base_Series = pd.DataFrame(list(df.values), index=dateRange)
# Converting to aggregated series
agg_series = base_Series.resample('10D', how='sum')
agg_data = agg_series.values
Sample Data:
2011-06-01 46.520536
2011-06-02 8.988311
2011-06-03 0.133823
2011-06-04 0.274521
2011-06-05 1.283360
2011-06-06 2.556313
2011-06-07 0.027461
2011-06-08 0.001584
2011-06-09 0.079193
2011-06-10 2.389549
2011-06-11 NaN
2011-06-12 0.195844
2011-06-13 0.058720
2011-06-14 6.570925
2011-06-15 0.015107
2011-06-16 0.031066
2011-06-17 0.073008
2011-06-18 0.072198
2011-06-19 0.044534
2011-06-20 0.240080
Output:
2011-06-01 62.254651
2011-06-11 7.301481
This uses the numpy sum, which will return nan if a nan is present in the sum:
In [35]: s = Series(randn(100),index=date_range('20130101',periods=100))
In [36]: s.iloc[11] = np.nan
In [37]: s.resample('10D',how=lambda x: x.values.sum())
Out[37]:
2013-01-01 6.910729
2013-01-11 NaN
2013-01-21 -1.592541
2013-01-31 -2.013012
2013-02-10 1.129273
2013-02-20 -2.054807
2013-03-02 4.669622
2013-03-12 3.489225
2013-03-22 0.390786
2013-04-01 -0.005655
dtype: float64
To filter out those days which have any NaNs, I propose that you do
noNaN_days_only = s.groupby(lambda x: x.date).filter(lambda x: ~x.isnull().any())
where s is the Series above.
Just apply an agg function (using any() so that a single NaN in a bin makes the whole bin NaN, as the question asks):
agg_series = base_Series.resample('10D').agg(lambda x: np.nan if np.isnan(x).any() else np.sum(x))
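On current pandas (where the how= keyword is gone), another sketch that keeps the "any NaN poisons its bin" behaviour is to aggregate with skipna=False; the series below is a hypothetical example:

import numpy as np
import pandas as pd

s = pd.Series(np.arange(20.0), index=pd.date_range('2011-06-01', periods=20, freq='D'))
s.iloc[10] = np.nan

# any NaN inside a 10-day bin makes that bin's sum NaN
agg = s.resample('10D').apply(lambda x: x.sum(skipna=False))
print(agg)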
