I have a large pandas DataFrame df:
year count
1983 5
1983 4
1983 7
...
2009 8
2009 11
2009 30
and I aim to sample 10 data points per year 100 times and get the mean and standard deviation of count per year. The signs of the count values are determined randomly.
I want to randomly sample 10 data points per year, which can be done by:
new_df = pd.DataFrame(columns=['year', 'count'])
ref = df.year.unique()
for i in range(len(ref)):
    appended_df = df[df['year'] == ref[i]].sample(n=10)
    new_df = pd.concat([new_df, appended_df])
Then, I assign a sign to count randomly (so that by random chance the count could be positive or negative) and rename it to value, which can be done by:
from random import randint

vlist = []
for i in range(len(new_df)):
    if randint(0, 1) == 0:
        vlist.append(new_df['count'].iloc[i])
    else:
        vlist.append(new_df['count'].iloc[i] * -1)
new_df['value'] = vlist
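(A vectorized sketch of the same sign assignment with NumPy, assuming new_df from above, just for reference:)
import numpy as np

signs = np.random.choice([1, -1], size=len(new_df))  # one random +1/-1 per row
new_df['value'] = new_df['count'].to_numpy() * signs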
Getting a mean and standard deviation for each year is quite simple:
xdf = new_df.groupby("year").agg([np.mean, np.std]).reset_index()
But I can't seem to find an optimal way to repeat this sampling 100 times per year, store the mean values, and get the mean and standard deviation of those 100 means per year. I could use a for loop, but it would take too much runtime.
Essentially, the output should be in the form of the following (the values are arbitrary here):
year mean_of_100_means total_sd
1983 4.22 0.43
1984 -6.39 1.25
1985 2.01 0.04
...
2007 11.92 3.38
2008 -5.27 1.67
2009 1.85 0.99
Any insights would be appreciated.
Try:
def fn(x):
    _100_means = [x.sample(10).mean() for i in range(100)]
    return {
        "mean_of_100_means": np.mean(_100_means),
        "total_sd": np.std(_100_means),
    }

print(df.groupby("year")["count"].apply(fn).unstack().reset_index())
EDIT: Changed the computation of means.
Prints:
year mean_of_100_means total_sd
0 1983 48.986 8.330787
1 1984 48.479 10.384896
2 1985 48.957 7.854900
3 1986 50.821 10.303847
4 1987 50.198 9.835832
5 1988 47.497 8.678749
6 1989 46.763 9.197387
7 1990 49.696 8.837589
8 1991 46.979 8.141969
9 1992 48.555 8.603597
10 1993 50.220 8.263946
11 1994 48.735 9.954741
12 1995 49.759 8.532844
13 1996 49.832 8.998654
14 1997 50.306 9.038316
15 1998 49.513 9.024341
16 1999 50.532 9.883166
17 2000 49.195 9.177008
18 2001 50.731 8.309244
19 2002 48.792 9.680028
20 2003 50.251 9.384759
21 2004 50.522 9.269677
22 2005 48.090 8.964458
23 2006 49.529 8.250701
24 2007 47.192 8.682196
25 2008 50.124 9.337356
26 2009 47.988 8.053438
The dataframe was created:
data = []
for y in range(1983, 2010):
    for i in np.random.randint(0, 100, size=1000):
        data.append({"year": y, "count": i})
df = pd.DataFrame(data)
I think you can use pandas groupby and sample functions together to take 10 samples from each year of your DataFrame. If you put this in a loop, then you can sample it 100 times, and combine the results.
It sounds like you only need the standard deviation of the 100 means (and you don't need the standard deviation of the sample of 10 observations), so you can calculate only the mean in your groupby and sample, then calculate the standard deviation from each of those 100 means when you are creating the total_sd column of your final DataFrame.
import numpy as np
import pandas as pd

np.random.seed(42)

## create a random DataFrame with 100 entries for the years 1980-1999, length 2000
df = pd.DataFrame({
    'year': [year for year in list(range(1980, 2000)) for _ in range(100)],
    'count': np.random.randint(1, 100, size=2000)
})

list_of_means = []

## sample 10 observations from each year, and repeat this process 100 times,
## storing the mean for each year in a list
for _ in range(100):
    df_sample = df.groupby("year").sample(10).groupby("year").mean()
    list_of_means.append(df_sample['count'].tolist())

array_of_means = [np.array(x) for x in list_of_means]

result = pd.DataFrame({
    'year': df.year.unique(),
    'mean_of_100_means': [np.mean(k) for k in zip(*array_of_means)],
    'total_sd': [np.std(k) for k in zip(*array_of_means)]
})
This results in:
>>> result
year mean_of_100_means total_sd
0 1980 50.316 8.656948
1 1981 48.274 8.647643
2 1982 47.958 8.598455
3 1983 49.357 7.854620
4 1984 48.977 8.523484
5 1985 49.847 7.114485
6 1986 47.338 8.220143
7 1987 48.106 9.413085
8 1988 53.487 9.237561
9 1989 47.376 9.173845
10 1990 46.141 9.061634
11 1991 46.851 7.647189
12 1992 49.389 7.743318
13 1993 52.207 9.333309
14 1994 47.271 8.177815
15 1995 52.555 8.377355
16 1996 47.606 8.668769
17 1997 52.584 8.200558
18 1998 51.993 8.695232
19 1999 49.054 8.178929
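One compatibility note: groupby(...).sample(...) needs pandas 1.1 or newer. On older versions, a roughly equivalent sketch is:
# equivalent of df.groupby("year").sample(10) on pandas < 1.1
df_sample = (df.groupby("year", group_keys=False)
               .apply(lambda g: g.sample(10))
               .groupby("year")
               .mean())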
I'm scraping a website that has a table of satellite values (https://planet4589.org/space/gcat/data/cat/satcat.html).
Because every entry is only separated by whitespace, I need a way to split the string of data entries into an array.
However, the .split() function does not suit my needs: because some of the data entries contain spaces (e.g. Able 3), I can't just split on every run of whitespace.
It gets trickier, however. In some cases where no data is available, a dash ("-") is used. If two data entries are separated by only a space, and one of them is a dash, I don't want to include it as one entry.
e.g. say we have the two entries "Able 3" and "-", separated only by a single space. In the file, they would appear as "Able 3 -". I want to split this string into the separate data entries, "Able 3" and "-" (as a list, this would be ["Able 3", "-"]).
Another example would be the need to split "data1 -" into ["data1", "-"]
Pretty much, I need to take a string and split it into a list of words separated by whitespace, except that a single space between two words should not split them, unless one of them is a dash.
Also, as you can see the table is massive. I thought about looping through every character, but that would be too slow, and I need to run this thousands of times.
Here is a sample from the beginning of the file:
JCAT Satcat Piece Type Name PLName LDate Parent SDate Primary DDate Status Dest Owner State Manufacturer Bus Motor Mass DryMass TotMass Length Diamete Span Shape ODate Perigee Apogee Inc OpOrbitOQU AltNames
S00001 00001 1957 ALP 1 R2 8K71A M1-10 8K71A M1-10 (M1-1PS) 1957 Oct 4 - 1957 Oct 4 1933 Earth 1957 Dec 1 1000? R - OKB1 SU OKB1 Blok-A - 7790 7790 7800 ? 28.0 2.6 28.0 Cyl 1957 Oct 4 214 938 65.10 LLEO/I -
S00002 00002 1957 ALP 2 P 1-y ISZ PS-1 1957 Oct 4 S00001 1957 Oct 4 1933 Earth 1958 Jan 4? R - OKB1 SU OKB1 PS - 84 84 84 0.6 0.6 2.9 Sphere + Ant 1957 Oct 4 214 938 65.10 LLEO/I -
S00003 00003 1957 BET 1 P A 2-y ISZ PS-2 1957 Nov 3 A00002 1957 Nov 3 0235 Earth 1958 Apr 14 0200? AR - OKB1 SU OKB1 PS - 508 508 8308 ? 2.0 1.0 2.0 Cone 1957 Nov 3 211 1659 65.33 LEO/I -
S00004 00004 1958 ALP P A Explorer 1 Explorer 1 1958 Feb 1 A00004 1958 Feb 1 0355 Earth 1970 Mar 31 1045? AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Feb 1 359 2542 33.18 LEO/I -
S00005 00005 1958 BET 2 P Vanguard I Vanguard Test Satellite 1958 Mar 17 S00016 1958 Mar 17 1224 Earth - O - NRL US NRL NRL 6" - 2 2 2 0.1 0.1 0.1 Sphere 1959 May 23 657 3935 34.25 MEO -
S00006 00006 1958 GAM P A Explorer 3 Explorer 3 1958 Mar 26 A00005 1958 Mar 26 1745 Earth 1958 Jun 28 AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Mar 26 195 2810 33.38 LEO/I -
S00007 00007 1958 DEL 1 R2 8K74A 8K74A 1958 May 15 - 1958 May 15 0705 Earth 1958 Dec 3 R - OKB1 SU OKB1 Blok-A - 7790 7790 7820 ? 28.0 2.6 28.0 Cyl 1958 May 15 214 1860 65.18 LEO/I -
S00008 00008 1958 DEL 2 P 3-y Sovetskiy ISZ D-1 No. 2 1958 May 15 S00007 1958 May 15 0706 Earth 1960 Apr 6 R - OKB1 SU OKB1 Object D - 1327 1327 1327 3.6 1.7 3.6 Cone 1959 May 7 207 1247 65.12 LEO/I -
S00009 00009 1958 EPS P A Explorer 4 Explorer 4 1958 Jul 26 A00009 1958 Jul 26 1507 Earth 1959 Oct 23 AR - ABMA/JPL US JPL Explorer - 12 12 17 0.8 0.1 0.8 Cyl 1959 Apr 24 258 2233 50.40 LEO/I -
S00010 00010 1958 ZET P A SCORE SCORE 1958 Dec 18 A00015 1958 Dec 18 2306 Earth 1959 Jan 21 AR - ARPA/SRDL US SRDL SCORE - 68 68 3718 2.5 ? 1.5 ? 2.5 Cone 1958 Dec 30 159 1187 32.29 LEO/I -
S00011 00011 1959 ALP 1 P Vanguard II Cloud cover satellite 1959 Feb 17 S00012 1959 Feb 17 1605 Earth - O - BSC US NRL NRL 20" - 10 10 10 0.5 0.5 0.5 Sphere 1959 May 15 564 3304 32.88 MEO -
S00012 00012 1959 ALP 2 R3 GRC 33-KS-2800 GRC 33-KS-2800 175-15-21 1959 Feb 17 R02749 1959 Feb 17 1604 Earth - O - BSC US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Apr 28 564 3679 32.88 MEO -
S00013 00013 1959 BET P A Discoverer 1 CORONA Test Vehicle 2 1959 Feb 28 A00017 1959 Feb 28 2156 Earth 1959 Mar 5 AR - ARPA/CIA US LMSD CORONA - 78 ? 78 ? 668 ? 2.0 1.5 2.0 Cone 1959 Feb 28 163? 968? 89.70 LLEO/P -
S00014 00014 1959 GAM P A Discoverer 2 CORONA BIO 1 1959 Apr 13 A00021 1959 Apr 13 2126 Earth 1959 Apr 26 AR - ARPA/CIA US LMSD CORONA - 110 ? 110 ? 788 1.3 1.5 1.3 Frust 1959 Apr 13 239 346 89.90 LLEO/P -
S00015 00015 1959 DEL 1 P Explorer 6 NASA S-2 1959 Aug 7 S00017 1959 Aug 7 1430 Earth 1961 Jul 1 R? - GSFC US TRW Able Probe ARC 420 40 40 42 ? 0.7 0.7 2.2 Sphere + 4 Pan 1959 Sep 8 250 42327 46.95 HEO - Able 3
S00016 00016 1958 BET 1 R3 GRC 33-KS-2800 GRC 33-KS-2800 144-79-22 1958 Mar 17 R02064 1958 Mar 17 1223 Earth - O - NRL US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Sep 30 653 4324 34.28 MEO -
S00017 00017 1959 DEL 2 R3 Altair Altair X-248 1959 Aug 7 A00024 1959 Aug 7 1428 Earth 1961 Jun 30 R? - USAF US ABL Altair - 24 24 24 1.5 0.5 1.5 Cyl 1961 Jan 8 197 40214 47.10 GTO -
S00018 00018 1959 EPS 1 P A Discoverer 5 CORONA C-2 1959 Aug 13 A00028 1959 Aug 13 1906 Earth 1959 Sep 28 AR - ARPA/CIA US LMSD CORONA - 140 140 730 1.3 1.5 1.3 Frust 1959 Aug 14 215 732 80.00 LLEO/I - NRO Mission 9002
A less haphazard approach would be to interpret the headers on the first line as column indicators, and split on those widths.
import sys
import re

def col_widths(s):
    # Shamelessly adapted from https://stackoverflow.com/a/33090071/874188
    cols = re.findall(r'\S+\s+', s)
    return [len(col) for col in cols]

widths = col_widths(next(sys.stdin))
for line in sys.stdin:
    line = line.rstrip('\n')
    fields = []
    for col_max in widths[:-1]:
        fields.append(line[0:col_max].strip())
        line = line[col_max:]
    fields.append(line)
    print(fields)
Demo: https://ideone.com/ASANjn
This seems to provide a better interpretation of e.g. the LDate column, where the dates are sometimes padded with more than one space. The penultimate column preserves the final dash as part of the column value; this seems more consistent with the apparent intent of the author of the original table, though you could separately split it off from that specific column if that's not to your liking.
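If you do want that dash as its own field, a small post-processing sketch on the fields list from the loop above could peel it off:
# optional: treat a standalone trailing "-" in the penultimate column
# as its own field instead of part of that column's value
value = fields[-2]
if value != '-' and value.endswith(' -'):
    fields[-2:-1] = [value[:-1].rstrip(), '-']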
If you don't want to read sys.stdin, just wrap this in with open(filename) as handle: and replace sys.stdin with handle everywhere.
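For example (assuming the table has been saved locally as satcat.txt; the filename is an assumption):
with open("satcat.txt") as handle:
    widths = col_widths(next(handle))
    for line in handle:
        line = line.rstrip('\n')
        fields = []
        for col_max in widths[:-1]:
            fields.append(line[0:col_max].strip())
            line = line[col_max:]
        fields.append(line)
        print(fields)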
One approach is to use pandas.read_fwf(), which reads text files in fixed-width format. The function returns Pandas DataFrames, which are useful for handling large data sets.
As a quick taste, here's what this simple bit of code does:
import pandas as pd
data = pd.read_fwf("data.txt")
print(data.columns) # Prints an index of all columns.
print()
print(data.head(5)) # Prints the top 5 rows.
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate',
# 'Unnamed: 7', 'Parent', 'SDate', 'Unnamed: 10', 'Unnamed: 11',
# 'Primary', 'DDate', 'Unnamed: 14', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'Unnamed: 23', 'DryMass',
# 'Unnamed: 25', 'TotMass', 'Unnamed: 27', 'Length', 'Unnamed: 29',
# 'Diamete', 'Span', 'Unnamed: 32', 'Shape', 'ODate', 'Unnamed: 35',
# 'Perigee', 'Apogee', 'Inc', 'OpOrbitOQU', 'AltNames'],
# dtype='object')
#
# JCAT Satcat Piece Type ... Apogee Inc OpOrbitOQU AltNames
# 0 S00001 1 1957 ALP 1 R2 ... 938 65.10 LLEO/I - NaN
# 1 S00002 2 1957 ALP 2 P ... 938 65.10 LLEO/I - NaN
# 2 S00003 3 1957 BET 1 P A ... 1659 65.33 LEO/I - NaN
# 3 S00004 4 1958 ALP P A ... 2542 33.18 LEO/I - NaN
# 4 S00005 5 1958 BET 2 P ... 3935 34.25 MEO - NaN
You'll note that some of the columns are unnamed. We can solve this by determining the field widths of the file, guiding the read_fwf()'s parsing. We'll achieve this by reading the first line of the file and iterating over it.
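For the loop below, first_line is assumed to hold that header line, e.g.:
with open("data.txt") as f:
    first_line = f.readline().rstrip('\n')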
field_widths = []  # We'll append column widths into this list.
last_i = 0
new_field = False

for i, x in enumerate(first_line):
    if x != ' ' and new_field:
        # Register a new field.
        new_field = False
        field_widths.append(i - last_i)  # Get the field width by subtracting
                                         # the index from the previous field's index.
        last_i = i  # Set the new field index.
    elif not new_field and x == ' ':
        # We've encountered a space.
        new_field = True  # Set True so that the next non-space character
                          # encountered is recognised as a new field.

# Append the last field. Set to a high number,
# so that all strings are eventually read.
field_widths.append(64)
Just a simple for-loop. Nothing fancy.
All that's left is passing the field_widths list through the widths= keyword arg:
data = pd.read_fwf("data.txt", widths=field_widths)
print(data.columns)
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate', 'Parent',
# 'SDate', 'Primary', 'DDate', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'DryMass', 'TotMass', 'Length',
# 'Diamete', 'Span', 'Shape', 'ODate', 'Perigee', 'Apogee', 'Inc',
# 'OpOrbitOQU'],
# dtype='object')
data is a dataframe, but with some work, you can change it to a list of lists or a list of dicts. Or you could also work with the dataframe directly.
So say, you want the first row. Then you could do
datalist = data.values.tolist()
print(datalist[0])
# ['S00001', 1, '1957 ALP 1', 'R2', '8K71A M1-10', '8K71A M1-10 (M1-1PS)', '1957 Oct 4', '-', '1957 Oct 4 1933', 'Earth', '1957 Dec 1 1000?', 'R', '-', 'OKB1', 'SU', 'OKB1', 'Blok-A', '-', '7790', '7790', '7800 ?', '28.0', '2.6', '28.0', 'Cyl', '1957 Oct 4', '214', '938', '65.10', 'LLEO/I -']
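If a list of dicts suits you better, to_dict(orient="records") gives one dict per row keyed by column name:
datadicts = data.to_dict(orient="records")
print(datadicts[0]["JCAT"])   # 'S00001'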
I have a dataframe that looks like below -
Year Salary Amount
0 2019 1200 53
1 2020 3443 455
2 2021 6777 123
3 2019 5466 313
4 2020 4656 545
5 2021 4565 775
6 2019 4654 567
7 2020 7867 657
8 2021 6766 567
Python script to get the dataframe below -
import pandas as pd
import numpy as np

d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567]
})
I want to calculate certain percentile values for all the columns grouped by 'Year'.
Desired output should look like -
I am running the Python script below to calculate the percentile values -
df_percentile = pd.DataFrame()

p_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]
c_list = []
p_values = []

for cols in d.columns[1:]:
    for p in p_list:
        c_list.append(cols + '_' + str(p))
        p_values.append(np.percentile(d[cols], p))

print(len(c_list), len(p_values))

df_percentile['Name'] = pd.Series(c_list)
df_percentile['Value'] = pd.Series(p_values)
print(df_percentile)
Output -
Name Value
0 Salary_0.05 1208.9720
1 Salary_0.1 1217.9440
2 Salary_0.25 1244.8600
3 Salary_0.5 1289.7200
4 Salary_0.75 1334.5800
5 Salary_0.95 1370.4680
6 Salary_0.99 1377.6456
7 Amount_0.05 53.2800
8 Amount_0.1 53.5600
9 Amount_0.25 54.4000
10 Amount_0.5 55.8000
11 Amount_0.75 57.2000
12 Amount_0.95 58.3200
13 Amount_0.99 58.5440
How can I get the output in the required format without having to do extra data manipulation/formatting or in fewer lines of code?
You can try pivot followed by quantile:
(d.pivot(columns='Year')
   .quantile([0.01, 0.05, 0.75, 0.95, 0.99])
   .stack('Year')
)
Output:
Salary Amount
Year
0.01 2019 1269.08 58.20
2020 3467.26 456.80
2021 4609.02 131.88
0.05 2019 1545.40 79.00
2020 3564.30 464.00
2021 4785.10 167.40
0.75 2019 5060.00 440.00
2020 6261.50 601.00
2021 6771.50 671.00
0.95 2019 5384.80 541.60
2020 7545.90 645.80
2021 6775.90 754.20
0.99 2019 5449.76 561.92
2020 7802.78 654.76
2021 6776.78 770.84
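An equivalent sketch with groupby, if you prefer the years as the outer level and the question's full percentile list:
out = d.groupby('Year')[['Salary', 'Amount']].quantile([0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99])
print(out)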
I have the following dataframe:
name Jan Feb Mar Apr May Jun Jul Aug \
0 IBM 156.08 160.01 159.81 165.22 172.25 167.15 164.75 152.77
1 MSFT 45.51 43.08 42.13 43.47 47.53 45.96 45.61 45.51
2 GOOGLE 512.42 537.99 559.72 540.50 535.24 532.92 590.09 636.84
3 APPLE 110.64 125.43 125.97 127.29 128.76 127.81 125.34 113.39
Sep Oct Nov Dec
0 145.36 146.11 137.21 137.96
1 43.56 48.70 53.88 55.40
2 617.93 663.59 735.39 755.35
3 112.80 113.36 118.16 111.73
Which I want to transform into the following:
Month AAPL GOOG IBM
0 Jan 117.160004 534.522445 153.309998
1 Feb 128.460007 558.402511 161.940002
2 Mar 124.430000 548.002468 160.500000
3 Apr 125.150002 537.340027 171.289993
4 May 130.279999 532.109985 169.649994
I've been fiddling around with melt and pivot but have no idea how to get this to work.
Any advice would be appreciated.
Thanks
Let's use set_index, rename_axis, and T for the transpose.
df.set_index('name')\
  .rename_axis(None).T\
  .rename_axis('Month')\
  .reset_index()
Output:
Month IBM MSFT GOOGLE APPLE
0 Jan 156.08 45.51 512.42 110.64
1 Feb 160.01 43.08 537.99 125.43
2 Mar 159.81 42.13 559.72 125.97
3 Apr 165.22 43.47 540.50 127.29
4 May 172.25 47.53 535.24 128.76
5 Jun 167.15 45.96 532.92 127.81
6 Jul 164.75 45.61 590.09 125.34
7 Aug 152.77 45.51 636.84 113.39
Creative Way
pd.DataFrame({'Month': df.columns[1:]}).assign(**{c: v for c, *v in df.values})
Month APPLE GOOGLE IBM MSFT
0 Jan 110.64 512.42 156.08 45.51
1 Feb 125.43 537.99 160.01 43.08
2 Mar 125.97 559.72 159.81 42.13
3 Apr 127.29 540.50 165.22 43.47
4 May 128.76 535.24 172.25 47.53
5 Jun 127.81 532.92 167.15 45.96
6 Jul 125.34 590.09 164.75 45.61
7 Aug 113.39 636.84 152.77 45.51
8 Sep 112.80 617.93 145.36 43.56
9 Oct 113.36 663.59 146.11 48.70
10 Nov 118.16 735.39 137.21 53.88
11 Dec 111.73 755.35 137.96 55.40
set_index basically changes which column acts as the index, so the data is organized by the index you specify; if you set_index to the name column before transposing, this should solve your issue.
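For illustration, here is the intermediate result of set_index('name') plus a plain transpose on the question's df (the stray name label on the columns is what rename_axis(None) removes):
print(df.set_index('name').T.head(3))
# name     IBM   MSFT  GOOGLE   APPLE
# Jan   156.08  45.51  512.42  110.64
# Feb   160.01  43.08  537.99  125.43
# Mar   159.81  42.13  559.72  125.97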
I'm trying to figure out how to implement an automatic backup file naming/recycling strategy that keeps older backup files, but with decreasing frequency over time. The basic idea is that at most one file would be removed when a new one is added, but I was not successful in implementing this from scratch.
That's why I started to try out the Grandfather-Father-Son pattern, but there is not a requirement to stick to this. I started my experiments using a single pool, but I failed more than once, so I started again from this more descriptive approach using four pools, one for each frequency:[1]
import datetime

t = datetime.datetime(2001, 1, 1, 5, 0, 0)  # start at 1st of Jan 2001, at 5:00 am
d = datetime.timedelta(days=1)

days = []
weeks = []
months = []
years = []

def pool_it(t):
    days.append(t)
    if len(days) > 7:         # keep not more than seven daily backups
        del days[0]
    if (t.weekday() == 6):
        weeks.append(t)
        if len(weeks) > 5:    # ...not more than 5 weekly backups
            del weeks[0]
    if (t.day == 28):
        months.append(t)
        if len(months) > 12:  # ... limit monthly backups
            del months[0]
    if (t.day == 28 and t.month == 12):
        years.append(t)
        if len(years) > 10:   # ... limit yearly backups...
            del years[0]

for i in range(4505):
    pool_it(t)
    t += d

no = 0

def print_pool(pool, rt):
    global no
    print("----")
    for i in pool:
        no += 1
        print("{:3} {} {}".format(no, i.strftime("%Y-%m-%d %a"), (i - rt).days))

print_pool(years, t)
print_pool(months, t)
print_pool(weeks, t)
print_pool(days, t)
The output shows that there are duplicates, marked with * and **
----
1 2003-12-28 Sun -3414
2 2004-12-28 Tue -3048
3 2005-12-28 Wed -2683
4 2006-12-28 Thu -2318
5 2007-12-28 Fri -1953
6 2008-12-28 Sun -1587
7 2009-12-28 Mon -1222
8 2010-12-28 Tue -857
9 2011-12-28 Wed -492
10 2012-12-28 Fri -126 *
----
11 2012-05-28 Mon -340
12 2012-06-28 Thu -309
13 2012-07-28 Sat -279
14 2012-08-28 Tue -248
15 2012-09-28 Fri -217
16 2012-10-28 Sun -187
17 2012-11-28 Wed -156
18 2012-12-28 Fri -126 *
19 2013-01-28 Mon -95
20 2013-02-28 Thu -64
21 2013-03-28 Thu -36
22 2013-04-28 Sun -5 **
----
23 2013-03-31 Sun -33
24 2013-04-07 Sun -26
25 2013-04-14 Sun -19
26 2013-04-21 Sun -12
27 2013-04-28 Sun -5 **
----
28 2013-04-26 Fri -7
29 2013-04-27 Sat -6
30 2013-04-28 Sun -5 **
31 2013-04-29 Mon -4
32 2013-04-30 Tue -3
33 2013-05-01 Wed -2
34 2013-05-02 Thu -1
...which is not a big problem. What I'm getting from it is daily backups for the last week, weekly backups for the last month, monthly backups for the last year, and yearly backups for 10 years. The number of files is always limited to 10+12+5+7=34.
My ideal solution would
create files with human-readable names including timestamps (e.g. xyz-yyyy-mm-dd.bak)
use only one pool (store/remove files within one folder)
recycle in a targeted way, that is, not delete more than one file per day
(naturally) not contain any duplicates
Do you have a trivial solution at hand or a suggestion where to learn more about it?
[1] I used Python to better understand/communicate my question, but the question is about the algorithm.
As a committer of pyExpireBackups, I can point you to the ExpirationRule implementation of my solution (source below and in the GitHub repo).
See https://wiki.bitplan.com/index.php/PyExpireBackups for the documentation.
An example run would lead to:
keeping 7 files for dayly backup
keeping 6 files for weekly backup
keeping 8 files for monthly backup
keeping 4 files for yearly backup
expiring 269 files dry run
# 1✅: 0.0 days( 5 GB/ 5 GB)→./sql_backup.2022-04-02.tgz
# 2✅: 3.0 days( 5 GB/ 9 GB)→./sql_backup.2022-03-30.tgz
# 3✅: 4.0 days( 5 GB/ 14 GB)→./sql_backup.2022-03-29.tgz
# 4✅: 5.0 days( 5 GB/ 18 GB)→./sql_backup.2022-03-28.tgz
# 5✅: 7.0 days( 5 GB/ 23 GB)→./sql_backup.2022-03-26.tgz
# 6✅: 9.0 days( 5 GB/ 27 GB)→./sql_backup.2022-03-24.tgz
# 7✅: 11.0 days( 5 GB/ 32 GB)→./sql_backup.2022-03-22.tgz
# 8❌: 15.0 days( 5 GB/ 37 GB)→./sql_backup.2022-03-18.tgz
# 9❌: 17.0 days( 5 GB/ 41 GB)→./sql_backup.2022-03-16.tgz
# 10✅: 18.0 days( 5 GB/ 46 GB)→./sql_backup.2022-03-15.tgz
# 11❌: 19.0 days( 5 GB/ 50 GB)→./sql_backup.2022-03-14.tgz
# 12❌: 20.0 days( 5 GB/ 55 GB)→./sql_backup.2022-03-13.tgz
# 13❌: 22.0 days( 5 GB/ 59 GB)→./sql_backup.2022-03-11.tgz
# 14❌: 23.0 days( 5 GB/ 64 GB)→./sql_backup.2022-03-10.tgz
# 15✅: 35.0 days( 4 GB/ 68 GB)→./sql_backup.2022-02-26.tgz
# 16❌: 37.0 days( 4 GB/ 73 GB)→./sql_backup.2022-02-24.tgz
# 17❌: 39.0 days( 4 GB/ 77 GB)→./sql_backup.2022-02-22.tgz
# 18❌: 40.0 days( 5 GB/ 82 GB)→./sql_backup.2022-02-21.tgz
# 19✅: 43.0 days( 4 GB/ 86 GB)→./sql_backup.2022-02-18.tgz
...
class ExpirationRule():
    '''
    an expiration rule keeps files at a certain frequency
    '''
    def __init__(self,name,freq:float,minAmount:int):
        '''
        constructor

        name(str): name of this rule
        freq(float): the frequency in days
        minAmount(int): the minimum number of files to keep around
        '''
        self.name=name
        self.ruleName=name # will later be changed by a side effect in getNextRule e.g. from "week" to "weekly"
        self.freq=freq
        self.minAmount=minAmount
        if minAmount<0:
            raise Exception(f"{self.minAmount} {self.name} is invalid - {self.name} must be >=0")

    def reset(self,prevFile:BackupFile):
        '''
        reset my state with the given previous file

        Args:
            prevFile: BackupFile - the file to anchor my startAge with
        '''
        self.kept=0
        if prevFile is None:
            self.startAge=0
        else:
            self.startAge=prevFile.ageInDays

    def apply(self,file:BackupFile,prevFile:BackupFile,debug:bool)->bool:
        '''
        apply me to the given file taking the previously kept file prevFile (which might be None) into account

        Args:
            file(BackupFile): the file to apply this rule for
            prevFile(BackupFile): the previous file to potentially take into account
            debug(bool): if True show debug output
        '''
        if prevFile is not None:
            ageDiff=file.ageInDays - prevFile.ageInDays
            keep=ageDiff>=self.freq
        else:
            ageDiff=file.ageInDays - self.startAge
            keep=True
        if keep:
            self.kept+=1
        else:
            file.expire=True
        if debug:
            print(f"Δ {ageDiff}({ageDiff-self.freq}) days for {self.ruleName}({self.freq}) {self.kept}/{self.minAmount}{file}")
        return self.kept>=self.minAmount
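Independent of the library, the core idea of applying such rules to a single pool of files can be sketched as follows (a simplified, hypothetical standalone example, not the pyExpireBackups API; ages are in days, newest first):
from dataclasses import dataclass

@dataclass
class Rule:
    name: str        # e.g. "daily"
    freq: float      # minimum spacing in days between two kept backups
    minAmount: int   # how many backups this rule keeps before handing over

def ages_to_expire(ages_in_days, rules):
    # ages_in_days: backup ages sorted ascending (newest first);
    # returns the ages whose files may be deleted
    expired = []
    rule_index = 0
    kept_for_rule = 0
    last_kept = None
    for age in ages_in_days:
        rule = rules[rule_index]
        keep = last_kept is None or (age - last_kept) >= rule.freq
        if keep:
            last_kept = age
            kept_for_rule += 1
            # once a rule has kept enough files, hand over to the next (coarser) rule
            if kept_for_rule >= rule.minAmount and rule_index < len(rules) - 1:
                rule_index += 1
                kept_for_rule = 0
        else:
            expired.append(age)
    return expired

# e.g. 7 daily, 5 weekly, 12 monthly, 10 yearly backups; everything in between expires
rules = [Rule("daily", 1, 7), Rule("weekly", 7, 5),
         Rule("monthly", 30, 12), Rule("yearly", 365, 10)]
print(ages_to_expire(list(range(0, 4000)), rules))
The real implementation additionally tracks file sizes and switches rules via getNextRule, as hinted at in the comments above; the sketch only captures the thinning logic.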