After a groupby on the Id column, I would like to split each group again on two specific values of a categorical column and then take the first and last rows of each split as the final output, so that I can find the percent drop.
To make the problem easier, I have already filtered the dataframe down to rows with those two specific categorical values. Below is the sample dataframe after that filter.
(The EncDate values in my actual data differ from the dates generated by the sample code below.) Sample data code:
import pandas as pd
rng = pd.date_range('2015-02-24', periods=20, freq='M')
df = pd.DataFrame({
'Id': [ '21','21','21','29','29','29','29','29','29','29','29','29','29','67','67','67','67','67','67','67'],
'Score': [21,21,21,29,29,29,29,29,29,29,29,29,29,67,67,67,67,67,67,67],
'Dx': ['F11','F11','F11','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72','F72'],
'EncDate' : rng,
'Treatment': ['Active','Active','Inactive','Inactive','Active','Active','Active','Active ','Inactive','Active','Active','Active ','Inactive','Active','Active','Active ','Inactive','Active','Active','Inactive'],
'ProviderName': ["Doe, Kim","Doe, Kim","Doe, Kim","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Lee, Mei","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha","Shah, Neha"]
})
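Note that the Treatment list above contains a few 'Active ' entries with a trailing space. If that is not intentional, it may be worth normalizing the column before grouping (a small optional step, not part of the original question):
df['Treatment'] = df['Treatment'].str.strip()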
I want to group by the Id column and then by Treatment in such a way that each Treatment group runs from "Active" to "Inactive", ordered chronologically by EncDate. For example, for Id 29 the treatment was started twice. A treatment starts when the value is "Active" and that same treatment ends when a doctor documents "Inactive". For Ids 29 and 67, treatment started and ended twice. I need to mark the first "Active" row as First and the subsequent "Inactive" row as Last, and then find the Score drop between them.
This should work:
# episode key: a running count of "Inactive" rows, shifted so an "Inactive" row still belongs to the episode it closes
g = df.groupby('Id')['Treatment'].transform(lambda x: (x.eq('Inactive').shift().fillna(0))).cumsum()
# drop single-row episodes, then keep the first and last row of each (Id, episode) group
ndf = df.loc[g.groupby([df['Id'],g]).transform('count').ne(1)].groupby(['Id',g],as_index=False).nth([0,-1])
# percent drop from the first (Active) row to the last (Inactive) row of each episode
ndf.assign(PercentDrop = ndf.groupby(['Id',g])['Score'].pct_change())
Output:
Id Score Dx EncDate Treatment ProviderName PercentDrop
0 21 22 F11 2015-02-28 Active Doe, Kim NaN
2 21 9 F11 2015-04-30 Inactive Doe, Kim -0.307692
4 29 25 F72 2015-06-30 Active Lee, Mei NaN
8 29 8 F72 2015-10-31 Inactive Lee, Mei -0.272727
9 29 28 F72 2015-11-30 Active Lee, Mei NaN
12 29 8 F72 2016-02-29 Inactive Lee, Mei -0.466667
13 67 26 F72 2016-03-31 Active Shah, Neha NaN
16 67 10 F72 2016-06-30 Inactive Shah, Neha -0.375000
17 67 24 F72 2016-07-31 Active Shah, Neha NaN
19 67 7 F72 2016-09-30 Inactive Shah, Neha -0.533333
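For readability, the same episode-numbering idea can also be written out step by step. This is only a sketch using the column names from the question; it numbers episodes within each Id rather than globally, which pairs up the same rows:
# new episode whenever the previous row within an Id was "Inactive"
episode = (df.groupby('Id')['Treatment']
             .transform(lambda s: s.eq('Inactive').shift(fill_value=False).cumsum()))
# drop episodes that consist of a single row, then keep the first and last row of each
sizes = df.groupby(['Id', episode])['Treatment'].transform('size')
pairs = df[sizes.gt(1)].groupby(['Id', episode], as_index=False).nth([0, -1])
# percent drop from the first (Active) row to the last (Inactive) row of each episode
pairs = pairs.assign(PercentDrop=pairs.groupby(['Id', episode])['Score'].pct_change())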
import pandas as pd
patient={'patientno':[2000,2010,2022,2024,2100,2330,2345,2479,2526,2556,2567,2768,2897,2999,3000],
'patientname':['Ramlal Tukkaram','Jethalal Gada','Karen Smith','Phoebe Buffet','Lily Aldrin','Sugmadi Kplese','Chad Broman','Babu Rao','Barney Stinson', 'Leegma Bawles','Ted Bundy','Pediphilee Kyler','Regina George','Mikasa Ackerman','Levi Ackerman'],
'age':[22,45,17,32,32,42,45,42,31,22,35,34,17,19,36],
'roomno':[20,60,48,13,12,69,32,40,21,63,1,54,12,68,14],
'contactdetails':[4934944909,7685948576,5343258732,3846384849,2843839493,3237273888,9808909778,9089786756,7757586867,8878777999,7687677756,8789675758,7766969866,9078787867,6656565658],
'diagnosis':['Dementia','Schizophenia','Intellectual Disability','Hepatitis','Child Birth','Piles','Diarrhoea','Corona','Gonorrhea','Cardiac Arrest','Psychopathy','Freak Accident','Road Accident','Attachment Issues','Depression, OCD'],
'admitdate':['12.01.2022','13.01.2022','17.01.2022','04.01.2022','17.01.2022','12.01.2022','04.01.2022','15.01.2022','05.01.2022','13.01.2022','08.01.2022','01.01.2022','08.01.2022','10.01.2022','06.01.2022'],
'dischargedate':['18.01.2022','17.01.2022','18.01.2022','09.01.2022','21.01.2022','15.01.2022','08.01.2022','18.01.2022','16.01.2022','17.01.2022','18.01.2022','14.01.2022','15.01.2022','13.01.2022','22.01.2022']}
df= pd.DataFrame(patient)
print(df)
OUTPUT
patientno patientname ... admitdate dischargedate
0 2000 Ramlal Tukkaram ... 12.01.2022 18.01.2022
1 2010 Jethalal Gada ... 13.01.2022 17.01.2022
2 2022 Karen Smith ... 17.01.2022 18.01.2022
3 2024 Phoebe Buffet ... 04.01.2022 09.01.2022
4 2100 Lily Aldrin ... 17.01.2022 21.01.2022
5 2330 Sugmadi Kplese ... 12.01.2022 15.01.2022
6 2345 Chad Broman ... 04.01.2022 08.01.2022
7 2479 Babu Rao ... 15.01.2022 18.01.2022
8 2526 Barney Stinson ... 05.01.2022 16.01.2022
9 2556 Leegma Bawles ... 13.01.2022 17.01.2022
10 2567 Ted Bundy ... 08.01.2022 18.01.2022
11 2768 Pediphilee Kyler ... 01.01.2022 14.01.2022
12 2897 Regina George ... 08.01.2022 15.01.2022
13 2999 Mikasa Ackerman ... 10.01.2022 13.01.2022
14 3000 Levi Ackerman ... 06.01.2022 22.01.2022
[15 rows x 8 columns]
Try to remove the limit on the number of displayed columns with:
pd.options.display.max_columns = None
The dataframe has 8 columns, it's just that not all are shown.
It is due to limited display width that you see only four columns (the first two and the last two), with the rest represented by the three dots "...". There is nothing wrong; pandas is simply truncating the output to fit the available width.
print(df) hides some columns by default.
You could either change the pandas display options mentioned by @user2314737, or you could try df.head() instead.
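A minimal sketch combining both suggestions (assuming df is the frame built above):
import pandas as pd

# show every column (and allow a wider line) so print(df) no longer elides columns with "..."
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

print(df)              # now prints all 8 columns
print(df.to_string())  # alternative: render the full frame regardless of the display options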
I have a large pandas DataFrame df:
year count
1983 5
1983 4
1983 7
...
2009 8
2009 11
2009 30
and I aim to sample 10 data points per year 100 times and get the mean and standard deviation of count per year. The signs of the count values are determined randomly.
I want to randomly sample 10 data points per year, which can be done by:
new_df = pd.DataFrame(columns=['year', 'count'])
ref = df.year.unique()
for i in range(len(ref)):
appended_df = df[df['year'] == ref[i]].sample(n=10)
new_df = pd.concat([new_df,appended_df])
Then, I randomly assign a sign to each count (so that by chance it can be positive or negative) and store the result in a value column, which can be done by:
from random import randint

vlist = []
for i in range(len(new_df)):
    if randint(0, 1) == 0:
        vlist.append(new_df['count'].iloc[i])        # use ['count'] since .count is a DataFrame method
    else:
        vlist.append(new_df['count'].iloc[i] * -1)
new_df['value'] = vlist
Getting a mean and standard deviation for each year is quite simple:
xdf = new_df.groupby("year").agg([np.mean, np.std]).reset_index()
But I can't seem to find an optimal way to repeat this sampling 100 times per year, store the mean values, and get the mean and standard deviation of those 100 means per year. I could use a for loop, but it would take too much runtime.
Essentially, the output should be in the form of the following (the values are arbitrary here):
year mean_of_100_means total_sd
1983 4.22 0.43
1984 -6.39 1.25
1985 2.01 0.04
...
2007 11.92 3.38
2008 -5.27 1.67
2009 1.85 0.99
Any insights would be appreciated.
Try:
import numpy as np

def fn(x):
    # draw 10 of this year's counts (without replacement), 100 times, and record each sample mean
    _100_means = [x.sample(10).mean() for i in range(100)]
    return {
        "mean_of_100_means": np.mean(_100_means),
        "total_sd": np.std(_100_means),
    }

print(df.groupby("year")["count"].apply(fn).unstack().reset_index())
EDIT: Changed the computation of means.
Prints:
year mean_of_100_means total_sd
0 1983 48.986 8.330787
1 1984 48.479 10.384896
2 1985 48.957 7.854900
3 1986 50.821 10.303847
4 1987 50.198 9.835832
5 1988 47.497 8.678749
6 1989 46.763 9.197387
7 1990 49.696 8.837589
8 1991 46.979 8.141969
9 1992 48.555 8.603597
10 1993 50.220 8.263946
11 1994 48.735 9.954741
12 1995 49.759 8.532844
13 1996 49.832 8.998654
14 1997 50.306 9.038316
15 1998 49.513 9.024341
16 1999 50.532 9.883166
17 2000 49.195 9.177008
18 2001 50.731 8.309244
19 2002 48.792 9.680028
20 2003 50.251 9.384759
21 2004 50.522 9.269677
22 2005 48.090 8.964458
23 2006 49.529 8.250701
24 2007 47.192 8.682196
25 2008 50.124 9.337356
26 2009 47.988 8.053438
The test dataframe was created with:
data = []
for y in range(1983, 2010):
for i in np.random.randint(0, 100, size=1000):
data.append({"year": y, "count": i})
df = pd.DataFrame(data)
I think you can use pandas groupby and sample functions together to take 10 samples from each year of your DataFrame. If you put this in a loop, then you can sample it 100 times, and combine the results.
It sounds like you only need the standard deviation of the 100 means (and you don't need the standard deviation of the sample of 10 observations), so you can calculate only the mean in your groupby and sample, then calculate the standard deviation from each of those 100 means when you are creating the total_sd column of your final DataFrame.
import numpy as np
import pandas as pd
np.random.seed(42)
## create a random DataFrame with 100 entries for the years 1980-1999, length 2000
df = pd.DataFrame({
'year':[year for year in list(range(1980, 2000)) for _ in range(100)],
'count':np.random.randint(1,100,size=2000)
})
list_of_means = []
## sample 10 observations from each year, and repeat this process 100 times, storing the mean for each year in a list
for _ in range(100):
df_sample = df.groupby("year").sample(10).groupby("year").mean()
list_of_means.append(df_sample['count'].tolist())
array_of_means = [np.array(x) for x in list_of_means]
result = pd.DataFrame({
'year': df.year.unique(),
'mean_of_100_means': [np.mean(k) for k in zip(*array_of_means)],
'total_sd': [np.std(k) for k in zip(*array_of_means)]
})
This results in:
>>> result
year mean_of_100_means total_sd
0 1980 50.316 8.656948
1 1981 48.274 8.647643
2 1982 47.958 8.598455
3 1983 49.357 7.854620
4 1984 48.977 8.523484
5 1985 49.847 7.114485
6 1986 47.338 8.220143
7 1987 48.106 9.413085
8 1988 53.487 9.237561
9 1989 47.376 9.173845
10 1990 46.141 9.061634
11 1991 46.851 7.647189
12 1992 49.389 7.743318
13 1993 52.207 9.333309
14 1994 47.271 8.177815
15 1995 52.555 8.377355
16 1996 47.606 8.668769
17 1997 52.584 8.200558
18 1998 51.993 8.695232
19 1999 49.054 8.178929
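If the Python-level loop ever becomes a bottleneck, the same computation can be expressed without it by drawing all 100 x 10 samples per year in one call and bucketing them afterwards. This is only a sketch, assuming the random df built above and pandas >= 1.1 for groupby.sample; note it draws with replacement, so it is a slightly different sampling scheme than 100 independent draws of 10 without replacement, and .std() uses ddof=1 where np.std uses ddof=0:
reps = 100
sampled = df.groupby("year").sample(n=10 * reps, replace=True)            # 1000 draws per year in one call
sampled = sampled.assign(rep=sampled.groupby("year").cumcount() // 10)    # split the draws into 100 blocks of 10
per_rep = sampled.groupby(["year", "rep"])["count"].mean()                # mean of each block of 10
result = (per_rep.groupby("year")
                 .agg(mean_of_100_means="mean", total_sd="std")
                 .reset_index())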
I have been looking around and I can find examples of annotating a single line chart by using iterrows on the dataframe. What I am struggling with is:
a) selecting a single line in the plot, since indexing into ax.lines (using ax.lines[#]) is clearly not the proper way, and
b) annotating the values of that line with values from a different column.
The dataframe dfg is in the following format (edited to provide a minimal, reproducible example):
week 2016 2017 2018 2019 2020 2021 min max avg WoW Change
1 8188.0 9052.0 7658.0 7846.0 6730.0 6239.0 6730 9052 7893.7
2 7779.0 8378.0 7950.0 7527.0 6552.0 6045.0 6552 8378 7588.0 -194.0
3 7609.0 7810.0 8041.0 8191.0 6432.0 5064.0 6432 8191 7529.4 -981.0
4 8256.0 8290.0 8430.0 7083.0 6660.0 6507.0 6660 8430 7687.0 1443.0
5 7124.0 9372.0 7892.0 7146.0 6615.0 5857.0 6615 9372 7733.7 -650.0
6 7919.0 8491.0 7888.0 6210.0 6978.0 5898.0 6210 8491 7455.3 41.0
7 7802.0 7286.0 7021.0 7522.0 6547.0 4599.0 6547 7802 7218.1 -1299.0
8 8292.0 7589.0 7282.0 5917.0 6217.0 6292.0 5917 8292 7072.3 1693.0
9 8048.0 8150.0 8003.0 7001.0 6238.0 5655.0 6238 8150 7404.0 -637.0
10 7693.0 7405.0 7585.0 6746.0 6412.0 5323.0 6412 7693 7135.1 -332.0
11 8384.0 8307.0 7077.0 6932.0 6539.0 6539 8384 7451.7
12 7748.0 8224.0 8148.0 6540.0 6117.0 6117 8224 7302.6
13 7254.0 7850.0 7898.0 6763.0 6047.0 6047 7898 7108.1
14 7940.0 7878.0 8650.0 6599.0 5874.0 5874 8650 7352.1
15 8187.0 7810.0 7930.0 5992.0 5680.0 5680 8187 7066.6
16 7550.0 8912.0 8469.0 7149.0 4937.0 4937 8912 7266.6
17 7660.0 8264.0 8549.0 7414.0 5302.0 5302 8549 7291.4
18 7655.0 7620.0 7323.0 6693.0 5712.0 5712 7655 6910.0
19 7677.0 8590.0 7601.0 7612.0 5391.0 5391 8590 7264.6
20 7315.0 8294.0 8159.0 6943.0 5197.0 5197 8294 7057.0
21 7839.0 7985.0 7631.0 6862.0 7200.0 6862 7985 7480.6
22 7705.0 8341.0 8346.0 7927.0 6179.0 6179 8346 7574.7
... ... ... ... ... ... ... ... ...
51 8167.0 7993.0 7656.0 6809.0 5564.0 5564 8167 7131.4
52 7183.0 7966.0 7392.0 6352.0 5326.0 5326 7966 6787.3
53 5369.0 5369 5369 5369.0
with the graph plotted by:
fig, ax = plt.subplots(1, figsize=[14,4])
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5 Yr. Range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="grey")
ax.plot(dfg.index, dfg[2021], label="2021", c="coral")
ax.plot(dfg.index, dfg.avg, label="5 Yr. Avg.", c="goldenrod", ls=(0,(1,2)), lw=3)
I would like to label the dfg[2021] line with the values from dfg['WoW Change']. Additionally, if anyone knows how to calculate the first value in the WoW column based on the last value from 2020 and the first value from 2021, that would be wonderful! It's currently just dfg['WoW Change'] = dfg[2021].diff()
Thanks!
Figured it out. I zipped the index and the two columns up as tuples. I ended up deciding I only wanted the last value to be shown, using the code below:
import math

a = dfg.index.values        # (these three assignments are not actually used below)
b = dfg[2021]
c = dfg['WoW Change']

#zip the index and the two columns into (week, value, WoW) tuples
labels = list(zip(dfg.index.values, dfg[2021], dfg['WoW Change']))
#remove tuples that contain NaN values
labels_light = [i for i in labels if not any(isinstance(n, float) and math.isnan(n) for n in i)]
#label the last point using list accessors; `link` comes from elsewhere in my code
ax.annotate(str("w/w change: " + str("{:,}".format(int(labels_light[-1][2]))) + link[1]),
            xy=(labels_light[-1][0], labels_light[-1][1]))
I'm sure this could have been done much better by someone who knows what they're doing, any feedback is appreciated.
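For what it's worth, here is a sketch of how every point on the 2021 line could be labelled with its WoW value, and how the first WoW entry could be seeded from the last 2020 reading. It assumes the week numbers are the index (as in the plotting code above), that pandas is imported as pd, and that the week-53 value belongs to 2020:
# seed the first WoW value from the last 2020 reading and the first 2021 reading
dfg['WoW Change'] = dfg[2021].diff()
dfg.loc[dfg[2021].first_valid_index(), 'WoW Change'] = (
    dfg[2021].dropna().iloc[0] - dfg[2020].dropna().iloc[-1]
)

# label every point on the 2021 line with its week-over-week change
for x, y, wow in zip(dfg.index, dfg[2021], dfg['WoW Change']):
    if pd.notna(y) and pd.notna(wow):
        ax.annotate(f"{wow:+,.0f}", xy=(x, y), xytext=(0, 8),
                    textcoords="offset points", ha="center", fontsize=8)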
I'm trying to figure out how to implement an automatic backup file naming/recycling strategy that keeps older backup files, but with decreasing frequency over time. The basic idea is that it should be possible to remove at most one file whenever a new one is added, but I was not successful implementing this from scratch.
That's why I started to try out the Grandfather-Father-Son pattern, although there is no requirement to stick to it. I started my experiments using a single pool, but I failed more than once, so I started again with this more descriptive approach using four pools, one for each frequency:[1]
import datetime
t = datetime.datetime(2001, 1, 1, 5, 0, 0) # start at 1st of Jan 2001, at 5:00 am
d = datetime.timedelta(days=1)
days = []
weeks = []
months = []
years = []
def pool_it(t):
days.append(t)
if len(days) > 7: # keep not more than seven daily backups
del days[0]
if (t.weekday() == 6):
weeks.append(t)
if len(weeks) > 5: # ...not more than 5 weekly backups
del weeks[0]
if (t.day == 28):
months.append(t)
if len(months) > 12: # ... limit monthly backups
del months[0]
if (t.day == 28 and t.month == 12):
years.append(t)
if len(years) > 10: # ... limit yearly backups...
del years[0]
for i in range(4505):
pool_it(t)
t += d
no = 0
def print_pool(pool, rt):
global no
print("----")
for i in pool:
no += 1
print("{:3} {} {}".format(no, i.strftime("%Y-%m-%d %a"), (i-rt).days))
print_pool(years, t)
print_pool(months,t)
print_pool(weeks,t)
print_pool(days,t)
The output shows that there are duplicates, marked with * and **
----
1 2003-12-28 Sun -3414
2 2004-12-28 Tue -3048
3 2005-12-28 Wed -2683
4 2006-12-28 Thu -2318
5 2007-12-28 Fri -1953
6 2008-12-28 Sun -1587
7 2009-12-28 Mon -1222
8 2010-12-28 Tue -857
9 2011-12-28 Wed -492
10 2012-12-28 Fri -126 *
----
11 2012-05-28 Mon -340
12 2012-06-28 Thu -309
13 2012-07-28 Sat -279
14 2012-08-28 Tue -248
15 2012-09-28 Fri -217
16 2012-10-28 Sun -187
17 2012-11-28 Wed -156
18 2012-12-28 Fri -126 *
19 2013-01-28 Mon -95
20 2013-02-28 Thu -64
21 2013-03-28 Thu -36
22 2013-04-28 Sun -5 **
----
23 2013-03-31 Sun -33
24 2013-04-07 Sun -26
25 2013-04-14 Sun -19
26 2013-04-21 Sun -12
27 2013-04-28 Sun -5 **
----
28 2013-04-26 Fri -7
29 2013-04-27 Sat -6
30 2013-04-28 Sun -5 **
31 2013-04-29 Mon -4
32 2013-04-30 Tue -3
33 2013-05-01 Wed -2
34 2013-05-02 Thu -1
...which is not a big problem. What I'm getting from it is daily backups for the last week, weekly backups for the last month, monthly backups for the last year, and yearly backups for 10 years. The number of files is always limited to 10+12+5+7=34.
My ideal solution would:
create files with human-readable names including timestamps (e.g. xyz-yyyy-mm-dd.bak)
use only one pool (store/remove files within one folder)
recycle files in a targeted way, that is, never delete more than one file a day
(naturally) not contain any duplicates
Do you have a trivial solution at hand or a suggestion where to learn more about it?
[1] I used Python here to better understand/communicate my question, but the question itself is about the algorithm.
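Not a complete answer, but a rough single-pool sketch along the lines of the wish list above: keep all backups in one folder with names like xyz-YYYY-MM-DD.bak, bucket file ages into day/week/month/year slots, and keep only the newest file per slot (the slot widths below are assumptions and can be tuned):
import datetime, pathlib, re

def files_to_keep(folder: pathlib.Path, today: datetime.date):
    keep = {}
    for f in sorted(folder.glob("*.bak")):       # ascending by name == ascending by date
        m = re.search(r"(\d{4})-(\d{2})-(\d{2})", f.name)
        if not m:
            continue
        age = (today - datetime.date(*map(int, m.groups()))).days
        if age < 7:                              # daily slots for the last week
            slot = ("day", age)
        elif age < 35:                           # weekly slots for roughly the last month
            slot = ("week", age // 7)
        elif age < 365:                          # monthly slots for the last year
            slot = ("month", age // 30)
        elif age < 3650:                         # yearly slots for ten years
            slot = ("year", age // 365)
        else:
            continue
        keep[slot] = f                           # later (newer) files overwrite earlier ones per slot
    return set(keep.values())
Everything not in the returned set is a deletion candidate; in normal daily operation roughly one file per day is displaced from its slot, so deletions stay gradual.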
As a committer of pyExpireBackups, I can point you to the ExpirationRule implementation of my solution (source below and in the GitHub repo).
See https://wiki.bitplan.com/index.php/PyExpireBackups for the documentation.
An example run would lead to:
keeping 7 files for dayly backup
keeping 6 files for weekly backup
keeping 8 files for monthly backup
keeping 4 files for yearly backup
expiring 269 files dry run
# 1✅: 0.0 days( 5 GB/ 5 GB)→./sql_backup.2022-04-02.tgz
# 2✅: 3.0 days( 5 GB/ 9 GB)→./sql_backup.2022-03-30.tgz
# 3✅: 4.0 days( 5 GB/ 14 GB)→./sql_backup.2022-03-29.tgz
# 4✅: 5.0 days( 5 GB/ 18 GB)→./sql_backup.2022-03-28.tgz
# 5✅: 7.0 days( 5 GB/ 23 GB)→./sql_backup.2022-03-26.tgz
# 6✅: 9.0 days( 5 GB/ 27 GB)→./sql_backup.2022-03-24.tgz
# 7✅: 11.0 days( 5 GB/ 32 GB)→./sql_backup.2022-03-22.tgz
# 8❌: 15.0 days( 5 GB/ 37 GB)→./sql_backup.2022-03-18.tgz
# 9❌: 17.0 days( 5 GB/ 41 GB)→./sql_backup.2022-03-16.tgz
# 10✅: 18.0 days( 5 GB/ 46 GB)→./sql_backup.2022-03-15.tgz
# 11❌: 19.0 days( 5 GB/ 50 GB)→./sql_backup.2022-03-14.tgz
# 12❌: 20.0 days( 5 GB/ 55 GB)→./sql_backup.2022-03-13.tgz
# 13❌: 22.0 days( 5 GB/ 59 GB)→./sql_backup.2022-03-11.tgz
# 14❌: 23.0 days( 5 GB/ 64 GB)→./sql_backup.2022-03-10.tgz
# 15✅: 35.0 days( 4 GB/ 68 GB)→./sql_backup.2022-02-26.tgz
# 16❌: 37.0 days( 4 GB/ 73 GB)→./sql_backup.2022-02-24.tgz
# 17❌: 39.0 days( 4 GB/ 77 GB)→./sql_backup.2022-02-22.tgz
# 18❌: 40.0 days( 5 GB/ 82 GB)→./sql_backup.2022-02-21.tgz
# 19✅: 43.0 days( 4 GB/ 86 GB)→./sql_backup.2022-02-18.tgz
...
class ExpirationRule():
'''
    an expiration rule keeps files at a certain frequency
'''
def __init__(self,name,freq:float,minAmount:int):
'''
constructor
name(str): name of this rule
        freq(float): the frequency in days
minAmount(int): the minimum of files to keep around
'''
self.name=name
        self.ruleName=name # will later be changed by a sideEffect in getNextRule e.g. from "week" to "weekly"
self.freq=freq
self.minAmount=minAmount
if minAmount<0:
raise Exception(f"{self.minAmount} {self.name} is invalid - {self.name} must be >=0")
def reset(self,prevFile:BackupFile):
'''
reset my state with the given previous File
Args:
prevFile: BackupFile - the file to anchor my startAge with
'''
self.kept=0
if prevFile is None:
self.startAge=0
else:
self.startAge=prevFile.ageInDays
def apply(self,file:BackupFile,prevFile:BackupFile,debug:bool)->bool:
'''
apply me to the given file taking the previously kept File prevFile (which might be None) into account
Args:
file(BackupFile): the file to apply this rule for
prevFile(BackupFile): the previous file to potentially take into account
debug(bool): if True show debug output
'''
if prevFile is not None:
ageDiff=file.ageInDays - prevFile.ageInDays
keep=ageDiff>=self.freq
else:
ageDiff=file.ageInDays - self.startAge
keep=True
if keep:
self.kept+=1
else:
file.expire=True
if debug:
print(f"Δ {ageDiff}({ageDiff-self.freq}) days for {self.ruleName}({self.freq}) {self.kept}/{self.minAmount}{file}")
return self.kept>=self.minAmount
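To see how such a rule could be driven, here is a small hypothetical stand-in and driver (an illustration under my own assumptions, not pyExpireBackups' actual code). Note that BackupFile has to exist before ExpirationRule is defined, because the type annotations above are evaluated eagerly:
from dataclasses import dataclass

@dataclass
class BackupFile:        # hypothetical stand-in for the library's BackupFile
    ageInDays: float
    expire: bool = False

rules = [ExpirationRule("dayly", 1, 7),
         ExpirationRule("weekly", 7, 6),
         ExpirationRule("monthly", 30.5, 8),
         ExpirationRule("yearly", 365, 4)]

# ages in days, newest first (taken from the example run above)
ages = [0, 3, 4, 5, 7, 9, 11, 15, 17, 18, 19, 20, 22, 23, 35, 37, 39, 40, 43]
files = [BackupFile(a) for a in ages]

rule_iter = iter(rules)
rule = next(rule_iter)
rule.reset(None)
prev = None
for f in files:
    done = rule.apply(f, prev, debug=False)
    if not f.expire:
        prev = f                      # only kept files anchor the next age comparison
    if done:                          # the rule has kept its minimum amount; switch to the coarser rule
        rule = next(rule_iter, None)
        if rule is None:
            break
        rule.reset(prev)

print([f.ageInDays for f in files if not f.expire])
# keeps 0, 3, 4, 5, 7, 9, 11, 18, 35 and 43 days - which lines up with the kept files in the log above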