Create a function to calculate median cost across different years

Create a function to calculate median cost across different years - python

I have a sample dataset which contains id and costs in diff years as the one below:
Id
2015-04
2015-05
2015-06
2015-07
2016-04
2016-05
2016-06
2016-07
2017-04
2017-05
2017-06
2017-07
2018-04
2018-05
2018-06
2018-07
10
58500
58500
58300
57800
57500
57700
57800
57800
57800
57900
58400
59000
59500
59500
59000
58500
11
104600
104600
105700
106100
106300
107300
108000
107600
107800
108300
109200
109600
109300
108700
109000
110700
12
104900
106700
107900
107500
106100
105200
105700
106400
106700
107100
107200
107100
107500
108300
109200
110500
13
50500
49600
48900
48400
48100
48000
47700
47500
47400
47600
47800
47800
47600
47600
48100
48400
14
49800
49900
50300
50800
51100
51200
51200
51400
51600
51900
52400
52600
52300
51800
51100
50900
How can I create a function in Python to find the median cost of each year belonging to their respective id? I want the function to be dynamic in terms of the start and end year so that if new data comes for different years, the code will calculate the changes accordingly. For example, if new data comes for 2019, the end date would automatically be considered as 2019 instead of 2018 and calculate its median respectively.
With the current data sample given above, the result should look something like one below:
Id
2015
2016
2017
2018
10
58400
57750
58150
59250
11
105150
107450
108750
109150
12
107100
105900
107100
108750
13
49250
47850
47700
47850
14
50100
51200
52150
51450

First we split the column names on - and get only the year. Then we groupby over axis=1 based on these years and take the median:
df = df.set_index("Id")
df = df.groupby(df.columns.str.split("-").str[0], axis=1).median().reset_index()
# or get first 4 characters
# df = df.groupby(df.columns.str[:4], axis=1).median().reset_index()
Id 2015 2016 2017 2018
0 10 58400 57750 58150 59250
1 11 105150 107450 108750 109150
2 12 107100 105900 107100 108750
3 13 49250 47850 47700 47850
4 14 50100 51200 52150 51450

Related

Do Group by on one column and then split the group on categorical column's 2 specific values and the finally get the First and last records

After group by on Id column, I would like to split the group again on a categorical column's 2 specific values and the the get the first and last rows as the final output to find the percent drop.
To make the problem easier, I can only filter the dataframe to contain 2 specific categorical value rows. Below is the sample dataframe after filter as explained above to make this easier.
The EncDate in the image above for sample data is different than the code written below.Sample Data code :
import pandas as pd
rng = pd.date_range('2015-02-24', periods=20, freq='M')
df = pd.DataFrame({
'Id': [ '21','21','21','29','29','29','29','29','29','29','29','29','29','67','67','67','67','67','67','67'],
'Score': [21,21,21,29,29,29,29,29,29,29,29,29,29,67,67,67,67,67,67,67],
'Dx': ['F11','F11','F11','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72'],
'EncDate' : rng,
'Treatment': ['Active','Active','Inactive','Inactive','Active','Active','Active','Active ','Inactive','Active','Active','Active ','Inactive','Active','Active','Active ','Inactive','Active','Active','Inactive'],
'ProviderName': ["Doe, Kim","Doe, Kim","Doe, Kim","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha"]
})
I want to group by Id column and then Treatment such a way that a Treatment group should start from Active to Inactive chronologically EncDate. For e.g: For Id 29, the treatment had been started 2 times. Treatment starts when the value is "Active" and that same Treatment ends when a doctor documents "Inactive". For Id 29 and 67, treatment started and ended Twice. I need to mark first active treatment as First and the subsequent Inactive as Last and then Find the Score drop between them.

This should work:
g = df.groupby('Id')['Treatment'].transform(lambda x: (x.eq('Inactive').shift().fillna(0))).cumsum()
ndf = df.loc[g.groupby([df['Id'],g]).transform('count').ne(1)].groupby(['Id',g],as_index=False).nth([0,-1])
ndf.assign(PercentDrop = ndf.groupby(['Id',g])['Score'].pct_change())
Output:
Id Score Dx EncDate Treatment ProviderName PercentDrop
0 21 22 F11 2015-02-28 Active Doe, Kim NaN
2 21 9 F11 2015-04-30 Inactive Doe, Kim -0.307692
4 29 25 F72 2015-06-30 Active Lee, Mei NaN
8 29 8 F72 2015-10-31 Inactive Lee, Mei -0.272727
9 29 28 F72 2015-11-30 Active Lee, Mei NaN
12 29 8 F72 2016-02-29 Inactive Lee, Mei -0.466667
13 67 26 F72 2016-03-31 Active Shah, Neha NaN
16 67 10 F72 2016-06-30 Inactive Shah, Neha -0.375000
17 67 24 F72 2016-07-31 Active Shah, Neha NaN
19 67 7 F72 2016-09-30 Inactive Shah, Neha -0.533333

On running the code it's only showing 4 columns instead of 8...I'm pretty sure the code is correct, why is this happening?

import pandas as pd
patient={'patientno':[2000,2010,2022,2024,2100,2330,2345,2479,2526,2556,2567,2768,2897,2999,3000],
'patientname':['Ramlal Tukkaram','Jethalal Gada','Karen Smith','Phoebe Buffet','Lily Aldrin','Sugmadi Kplese','Chad Broman','Babu Rao','Barney Stinson', 'Leegma Bawles','Ted Bundy','Pediphilee Kyler','Regina George','Mikasa Ackerman','Levi Ackerman'],
'age':[22,45,17,32,32,42,45,42,31,22,35,34,17,19,36],
'roomno':[20,60,48,13,12,69,32,40,21,63,1,54,12,68,14],
'contactdetails':[4934944909,7685948576,5343258732,3846384849,2843839493,3237273888,9808909778,9089786756,7757586867,8878777999,7687677756,8789675758,7766969866,9078787867,6656565658],
'diagnosis':['Dementia','Schizophenia','Intellectual Disability','Hepatitis','Child Birth','Piles','Diarrhoea','Corona','Gonorrhea','Cardiac Arrest','Psychopathy','Freak Accident','Road Accident','Attachment Issues','Depression’ ,’OCD'],
'admitdate':['12.01.2022','13.01.2022','17.01.2022','04.01.2022','17.01.2022','12.01.2022','04.01.2022','15.01.2022','05.01.2022','13.01.2022','08.01.2022','01.01.2022','08.01.2022','10.01.2022','06.01.2022'],
'dischargedate':['18.01.2022','17.01.2022','18.01.2022','09.01.2022','21.01.2022','15.01.2022','08.01.2022','18.01.2022','16.01.2022','17.01.2022','18.01.2022','14.01.2022','15.01.2022','13.01.2022','22.01.2022']}
df= pd.DataFrame(patient)
print(df)
OUTPUT
patientno patientname ... admitdate dischargedate
0 2000 Ramlal Tukkaram ... 12.01.2022 18.01.2022
1 2010 Jethalal Gada ... 13.01.2022 17.01.2022
2 2022 Karen Smith ... 17.01.2022 18.01.2022
3 2024 Phoebe Buffet ... 04.01.2022 09.01.2022
4 2100 Lily Aldrin ... 17.01.2022 21.01.2022
5 2330 Sugmadi Kplese ... 12.01.2022 15.01.2022
6 2345 Chad Broman ... 04.01.2022 08.01.2022
7 2479 Babu Rao ... 15.01.2022 18.01.2022
8 2526 Barney Stinson ... 05.01.2022 16.01.2022
9 2556 Leegma Bawles ... 13.01.2022 17.01.2022
10 2567 Ted Bundy ... 08.01.2022 18.01.2022
11 2768 Pediphilee Kyler ... 01.01.2022 14.01.2022
12 2897 Regina George ... 08.01.2022 15.01.2022
13 2999 Mikasa Ackerman ... 10.01.2022 13.01.2022
14 3000 Levi Ackerman ... 06.01.2022 22.01.2022
[15 rows x 8 columns]

Try to remove the limit on the number of displayed columns with:
pd.options.display.max_columns = None
The dataframe has 8 columns, it's just that not all are shown.

It is due to space ( width) that you see only four columns( the first two and last two) and the rest are represented by the three dots "...". There is nothing wrong except that pandas shows partially due to spacing ( width).

print(df) hides some columns by default.
You could either change the pandas display options mentioned by #user2314737, or you could try df.head() instead.

Retrieving the average of averages in Python DataFrame

I have a mass pandas DataFrame df:
year count
1983 5
1983 4
1983 7
...
2009 8
2009 11
2009 30
and I aim to sample 10 data points per year 100 times and get the mean and standard deviation of count per year. The signs of the count values are determined randomly.
I want to randomly sample 10 data per year, which can be done by:
new_df = pd.DataFrame(columns=['year', 'count'])
ref = df.year.unique()
for i in range(len(ref)):
appended_df = df[df['year'] == ref[i]].sample(n=10)
new_df = pd.concat([new_df,appended_df])
Then, I assign a sign to count randomly (so that by random chance the count could be positive or negative) and rename it to value, which can be done by:
vlist = []
for i in range(len(new_df)):
if randint(0,1) == 0:
vlist.append(new_df.count.iloc[i])
else:
vlist.append(new_df.count.iloc[i] * -1)
new_data['value'] = vlist
Getting a mean and standard deviation per each year is quite simple:
xdf = new_data.groupby("year").agg([np.mean, np.std]).reset_index()
But I can't seem to find an optimal way to try this sampling 100 times per year, store the mean values, and get the mean and standard deviation of those 100 means per year. I could think of using for loop, but it would take too much of a runtime.
Essentially, the output should be in the form of the following (the values are arbitrary here):
year mean_of_100_means total_sd
1983 4.22 0.43
1984 -6.39 1.25
1985 2.01 0.04
...
2007 11.92 3.38
2008 -5.27 1.67
2009 1.85 0.99
Any insights would be appreciated.

Try:
def fn(x):
_100_means = [x.sample(10).mean() for i in range(100)]
return {
"mean_of_100_means": np.mean(_100_means),
"total_sd": np.std(_100_means),
}
print(df.groupby("year")["count"].apply(fn).unstack().reset_index())
EDIT: Changed the computation of means.
Prints:
year mean_of_100_means total_sd
0 1983 48.986 8.330787
1 1984 48.479 10.384896
2 1985 48.957 7.854900
3 1986 50.821 10.303847
4 1987 50.198 9.835832
5 1988 47.497 8.678749
6 1989 46.763 9.197387
7 1990 49.696 8.837589
8 1991 46.979 8.141969
9 1992 48.555 8.603597
10 1993 50.220 8.263946
11 1994 48.735 9.954741
12 1995 49.759 8.532844
13 1996 49.832 8.998654
14 1997 50.306 9.038316
15 1998 49.513 9.024341
16 1999 50.532 9.883166
17 2000 49.195 9.177008
18 2001 50.731 8.309244
19 2002 48.792 9.680028
20 2003 50.251 9.384759
21 2004 50.522 9.269677
22 2005 48.090 8.964458
23 2006 49.529 8.250701
24 2007 47.192 8.682196
25 2008 50.124 9.337356
26 2009 47.988 8.053438
The dataframe was created:
data = []
for y in range(1983, 2010):
for i in np.random.randint(0, 100, size=1000):
data.append({"year": y, "count": i})
df = pd.DataFrame(data)

I think you can use pandas groupby and sample functions together to take 10 samples from each year of your DataFrame. If you put this in a loop, then you can sample it 100 times, and combine the results.
It sounds like you only need the standard deviation of the 100 means (and you don't need the standard deviation of the sample of 10 observations), so you can calculate only the mean in your groupby and sample, then calculate the standard deviation from each of those 100 means when you are creating the total_sd column of your final DataFrame.
import numpy as np
import pandas as pd
np.random.seed(42)
## create a random DataFrame with 100 entries for the years 1980-1999, length 2000
df = pd.DataFrame({
'year':[year for year in list(range(1980, 2000)) for _ in range(100)],
'count':np.random.randint(1,100,size=2000)
})
list_of_means = []
## sample 10 observations from each year, and repeat this process 100 times, storing the mean for each year in a list
for _ in range(100):
df_sample = df.groupby("year").sample(10).groupby("year").mean()
list_of_means.append(df_sample['count'].tolist())
array_of_means = [np.array(x) for x in list_of_means]
result = pd.DataFrame({
'year': df.year.unique(),
'mean_of_100_means': [np.mean(k) for k in zip(*array_of_means)],
'total_sd': [np.std(k) for k in zip(*array_of_means)]
})
This results in:
>>> result
year mean_of_100_means total_sd
0 1980 50.316 8.656948
1 1981 48.274 8.647643
2 1982 47.958 8.598455
3 1983 49.357 7.854620
4 1984 48.977 8.523484
5 1985 49.847 7.114485
6 1986 47.338 8.220143
7 1987 48.106 9.413085
8 1988 53.487 9.237561
9 1989 47.376 9.173845
10 1990 46.141 9.061634
11 1991 46.851 7.647189
12 1992 49.389 7.743318
13 1993 52.207 9.333309
14 1994 47.271 8.177815
15 1995 52.555 8.377355
16 1996 47.606 8.668769
17 1997 52.584 8.200558
18 1998 51.993 8.695232
19 1999 49.054 8.178929

annotate a single line from a multi-line plot with labels from another pandas column matplotlib

i have been looking around and i can find examples for annotating a single line chart by using iterrows for the dataframe. what i am struggling with is
a) selecting the single line in the plot instead of ax.lines (using ax.lines[#]) is clearly not proper and
b) annotating the values for the line with values from a different column
the dataframe dfg is in a format such that (edited to provide a minimal, reproducible example):
week 2016 2017 2018 2019 2020 2021 min max avg WoW Change
1 8188.0 9052.0 7658.0 7846.0 6730.0 6239.0 6730 9052 7893.7
2 7779.0 8378.0 7950.0 7527.0 6552.0 6045.0 6552 8378 7588.0 -194.0
3 7609.0 7810.0 8041.0 8191.0 6432.0 5064.0 6432 8191 7529.4 -981.0
4 8256.0 8290.0 8430.0 7083.0 6660.0 6507.0 6660 8430 7687.0 1443.0
5 7124.0 9372.0 7892.0 7146.0 6615.0 5857.0 6615 9372 7733.7 -650.0
6 7919.0 8491.0 7888.0 6210.0 6978.0 5898.0 6210 8491 7455.3 41.0
7 7802.0 7286.0 7021.0 7522.0 6547.0 4599.0 6547 7802 7218.1 -1299.0
8 8292.0 7589.0 7282.0 5917.0 6217.0 6292.0 5917 8292 7072.3 1693.0
9 8048.0 8150.0 8003.0 7001.0 6238.0 5655.0 6238 8150 7404.0 -637.0
10 7693.0 7405.0 7585.0 6746.0 6412.0 5323.0 6412 7693 7135.1 -332.0
11 8384.0 8307.0 7077.0 6932.0 6539.0 6539 8384 7451.7
12 7748.0 8224.0 8148.0 6540.0 6117.0 6117 8224 7302.6
13 7254.0 7850.0 7898.0 6763.0 6047.0 6047 7898 7108.1
14 7940.0 7878.0 8650.0 6599.0 5874.0 5874 8650 7352.1
15 8187.0 7810.0 7930.0 5992.0 5680.0 5680 8187 7066.6
16 7550.0 8912.0 8469.0 7149.0 4937.0 4937 8912 7266.6
17 7660.0 8264.0 8549.0 7414.0 5302.0 5302 8549 7291.4
18 7655.0 7620.0 7323.0 6693.0 5712.0 5712 7655 6910.0
19 7677.0 8590.0 7601.0 7612.0 5391.0 5391 8590 7264.6
20 7315.0 8294.0 8159.0 6943.0 5197.0 5197 8294 7057.0
21 7839.0 7985.0 7631.0 6862.0 7200.0 6862 7985 7480.6
22 7705.0 8341.0 8346.0 7927.0 6179.0 6179 8346 7574.7
... ... ... ... ... ... ... ... ...
51 8167.0 7993.0 7656.0 6809.0 5564.0 5564 8167 7131.4
52 7183.0 7966.0 7392.0 6352.0 5326.0 5326 7966 6787.3
53 5369.0 5369 5369 5369.0
with the graph plotted by:
fig, ax = plt.subplots(1, figsize=[14,4])
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5 Yr. Range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="grey")
ax.plot(dfg.index, dfg[2021], label="2021", c="coral")
ax.plot(dfg.index, dfg.avg, label="5 Yr. Avg.", c="goldenrod", ls=(0,(1,2)), lw=3).
I would like to label the dfg[2021] line with the values from dfg['WoW Change']. Additionally, if anyone knows how to get the calculate the first value in the WoW column based on the last value from 2020 and the first value from 2021, that would be wonderful! It's currently just dfg['WoW Change'] = dfg[2021].diff()
Thanks!

Figured it out. Zipped the index and two columns up as a tuple. I ended up deciding I only wanted the last value to be shown but using below code:
a = dfg.index.values
b = dfg[2021]
c = dfg['WoW Change']
#zip 3 columns separately
labels = list(zip(dfg.index.values,dfg[2021],dfg['WoW Change']))
#remove tuples with index + 2 nan values
labels_light = [i for i in labels if not any(isinstance(n,float) and math.isnan(n) for n in i)]
#label last point using list accessors
ax.annotate(str("w/w change: " + str("{:,}".format(int(labels_light[-1][2])))+link[1]),xy=(labels_light[-1][0],labels_light[-1][1]))
I'm sure this could have been done much better by someone who knows what they're doing, any feedback is appreciated.

effective backup strategy with storage reuse but without duplicates

I'm trying to figure out how to implement an automatic backup file naming/recycling strategy that keeps older backup files but with decreasing frequency over time. The basic idea is that it would be possible to remove at maximum one file when adding a new one, but I was not successful implementing this from scratch.
That's why I started to try out the Grandfather-Father-Son pattern, but there is not a requirement to stick to this. I started my experiments using a single pool, but I failed more than once, so I started again from this more descriptive approach using four pools, one for each frequency:[1]
import datetime
t = datetime.datetime(2001, 1, 1, 5, 0, 0) # start at 1st of Jan 2001, at 5:00 am
d = datetime.timedelta(days=1)
days = []
weeks = []
months = []
years = []
def pool_it(t):
days.append(t)
if len(days) > 7: # keep not more than seven daily backups
del days[0]
if (t.weekday() == 6):
weeks.append(t)
if len(weeks) > 5: # ...not more than 5 weekly backups
del weeks[0]
if (t.day == 28):
months.append(t)
if len(months) > 12: # ... limit monthly backups
del months[0]
if (t.day == 28 and t.month == 12):
years.append(t)
if len(years) > 10: # ... limit yearly backups...
del years[0]
for i in range(4505):
pool_it(t)
t += d
no = 0
def print_pool(pool, rt):
global no
print("----")
for i in pool:
no += 1
print("{:3} {} {}".format(no, i.strftime("%Y-%m-%d %a"), (i-rt).days))
print_pool(years, t)
print_pool(months,t)
print_pool(weeks,t)
print_pool(days,t)
The output shows that there are duplicates, marked with * and **
----
1 2003-12-28 Sun -3414
2 2004-12-28 Tue -3048
3 2005-12-28 Wed -2683
4 2006-12-28 Thu -2318
5 2007-12-28 Fri -1953
6 2008-12-28 Sun -1587
7 2009-12-28 Mon -1222
8 2010-12-28 Tue -857
9 2011-12-28 Wed -492
10 2012-12-28 Fri -126 *
----
11 2012-05-28 Mon -340
12 2012-06-28 Thu -309
13 2012-07-28 Sat -279
14 2012-08-28 Tue -248
15 2012-09-28 Fri -217
16 2012-10-28 Sun -187
17 2012-11-28 Wed -156
18 2012-12-28 Fri -126 *
19 2013-01-28 Mon -95
20 2013-02-28 Thu -64
21 2013-03-28 Thu -36
22 2013-04-28 Sun -5 **
----
23 2013-03-31 Sun -33
24 2013-04-07 Sun -26
25 2013-04-14 Sun -19
26 2013-04-21 Sun -12
27 2013-04-28 Sun -5 **
----
28 2013-04-26 Fri -7
29 2013-04-27 Sat -6
30 2013-04-28 Sun -5 **
31 2013-04-29 Mon -4
32 2013-04-30 Tue -3
33 2013-05-01 Wed -2
34 2013-05-02 Thu -1
...which is not a big problem. What I'm getting from it is daily backups in the last week, weekly backups for the last month, monthly backups for the last year, and yearly backups for 10 years. The amount of files is always limited to 10+12+5+7=34.
My ideal solution would
create files with human-readable names including timestampes (i.e. xyz-yyyy-mm-dd.bak)
use only one pool (store/remove files within one folder)
recycle targeted, that is, would not delete more than one file a day
(naturally) not contain any duplicates
Do you have a trivial solution at hand or a suggestion where to learn more about it?
[1] I used python as to better understand/communicate my question, but the question is about the algorithm.

As a committer of pyExpireBackups i can point you to the ExpirationRule implementation of my solution (source below and in the github repo)
see https://wiki.bitplan.com/index.php/PyExpireBackups for the doku.
An example run would lead to:
keeping 7 files for dayly backup
keeping 6 files for weekly backup
keeping 8 files for monthly backup
keeping 4 files for yearly backup
expiring 269 files dry run
# 1✅: 0.0 days( 5 GB/ 5 GB)→./sql_backup.2022-04-02.tgz
# 2✅: 3.0 days( 5 GB/ 9 GB)→./sql_backup.2022-03-30.tgz
# 3✅: 4.0 days( 5 GB/ 14 GB)→./sql_backup.2022-03-29.tgz
# 4✅: 5.0 days( 5 GB/ 18 GB)→./sql_backup.2022-03-28.tgz
# 5✅: 7.0 days( 5 GB/ 23 GB)→./sql_backup.2022-03-26.tgz
# 6✅: 9.0 days( 5 GB/ 27 GB)→./sql_backup.2022-03-24.tgz
# 7✅: 11.0 days( 5 GB/ 32 GB)→./sql_backup.2022-03-22.tgz
# 8❌: 15.0 days( 5 GB/ 37 GB)→./sql_backup.2022-03-18.tgz
# 9❌: 17.0 days( 5 GB/ 41 GB)→./sql_backup.2022-03-16.tgz
# 10✅: 18.0 days( 5 GB/ 46 GB)→./sql_backup.2022-03-15.tgz
# 11❌: 19.0 days( 5 GB/ 50 GB)→./sql_backup.2022-03-14.tgz
# 12❌: 20.0 days( 5 GB/ 55 GB)→./sql_backup.2022-03-13.tgz
# 13❌: 22.0 days( 5 GB/ 59 GB)→./sql_backup.2022-03-11.tgz
# 14❌: 23.0 days( 5 GB/ 64 GB)→./sql_backup.2022-03-10.tgz
# 15✅: 35.0 days( 4 GB/ 68 GB)→./sql_backup.2022-02-26.tgz
# 16❌: 37.0 days( 4 GB/ 73 GB)→./sql_backup.2022-02-24.tgz
# 17❌: 39.0 days( 4 GB/ 77 GB)→./sql_backup.2022-02-22.tgz
# 18❌: 40.0 days( 5 GB/ 82 GB)→./sql_backup.2022-02-21.tgz
# 19✅: 43.0 days( 4 GB/ 86 GB)→./sql_backup.2022-02-18.tgz
...
class ExpirationRule():
'''
an expiration rule keeps files at a certain
'''
def __init__(self,name,freq:float,minAmount:int):
'''
constructor
name(str): name of this rule
freq(float): the frequency) in days
minAmount(int): the minimum of files to keep around
'''
self.name=name
self.ruleName=name # will late be changed by a sideEffect in getNextRule e.g. from "week" to "weekly"
self.freq=freq
self.minAmount=minAmount
if minAmount<0:
raise Exception(f"{self.minAmount} {self.name} is invalid - {self.name} must be >=0")
def reset(self,prevFile:BackupFile):
'''
reset my state with the given previous File
Args:
prevFile: BackupFile - the file to anchor my startAge with
'''
self.kept=0
if prevFile is None:
self.startAge=0
else:
self.startAge=prevFile.ageInDays
def apply(self,file:BackupFile,prevFile:BackupFile,debug:bool)->bool:
'''
apply me to the given file taking the previously kept File prevFile (which might be None) into account
Args:
file(BackupFile): the file to apply this rule for
prevFile(BackupFile): the previous file to potentially take into account
debug(bool): if True show debug output
'''
if prevFile is not None:
ageDiff=file.ageInDays - prevFile.ageInDays
keep=ageDiff>=self.freq
else:
ageDiff=file.ageInDays - self.startAge
keep=True
if keep:
self.kept+=1
else:
file.expire=True
if debug:
print(f"Δ {ageDiff}({ageDiff-self.freq}) days for {self.ruleName}({self.freq}) {self.kept}/{self.minAmount}{file}")
return self.kept>=self.minAmount

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create a function to calculate median cost across different years - python

Related

Do Group by on one column and then split the group on categorical column's 2 specific values and the finally get the First and last records

On running the code it's only showing 4 columns instead of 8...I'm pretty sure the code is correct, why is this happening?

Retrieving the average of averages in Python DataFrame

annotate a single line from a multi-line plot with labels from another pandas column matplotlib

effective backup strategy with storage reuse but without duplicates

Categories

Resources