I have a list representing the year which is filled with a sub-list for each day of the year.
year = []
for i in range(0, 52*7):
    day = [i, 0]  # [day number, 0 = empty, 1 = something is planned]
    year.append(day)
I also have a variable list of activities created by a class.
class Activities:
    def __init__(self, name, weeks, weekends):
        self.name = name
        self.weeks = weeks
        self.weekends = weekends

    def __repr__(self):
        return repr((self.name, self.weeks, self.weekends))

def activityMaker(activityList):
    a = []
    for i in range(0, len(activityList)):
        a.append(Activities(activityList[i][0], activityList[i][1], activityList[i][2]))
    a = sorted(a, key=lambda Activities: Activities.weeks)
    activityList = a
    return activityList
As an example:
>>> activityList = [['Tennis', 3, 0], ['Baseball', 4, 0], ['Swimming', 2, 0]]
>>> activities = activityMaker(activityList)
Which returns 'activities', sorted on Activities.weeks:
>>> activities[0].name
'Swimming'
>>> activities[0].weeks
2  # I want to do this activity once every x weeks
>>> activities[0].weekends
0  # 0 = no preference, 1 = not in weekends
Now here's my dilemma: I wish to create an algorithm to fill year with the activities with as much rhythm as possible.
My current approach is not working properly. What I'm doing now is as follows.
for y in range(0, len(year), int(7*activities[0].weeks)):
    year[y][1] = activities[0].name
Now the first activity is planned for every y. If I had two activities, each of which I want planned once a week, I could plan the first on the 0th, 7th, 14th etc., and the second on the 3rd, 10th, 17th etcetera.
The problem with this approach is exemplified when activities[0] and activities[1] have weeks of 2 and 3 respectively. With the previous approach, activities[0] would be planned on the 0th, 14th, 28th etc., which is fine on its own. Between the 0th and 14th, the second activity would ideally be placed on the 7th, meaning its next occurrence would be the 28th. On the 28th, however, the first activity is already planned. This means that there's nothing planned for two weeks and then suddenly two activities on one day. The second activity could be pushed to the 27th or 29th, but that would still mean activities are planned on the 0th, 7th, 14th, 28th and 29th. In other words, there are still 14 days between the 14th and the 28th, and then only 1 day between the 28th and the 29th.
How can I make sure that all activities are planned with as much average time as possible between them?
Your problem is that unless the number of weeks is the same for all activities (so they all have the same rhythm), there will be some weeks with lots of activities and some weeks with none.
What I would suggest instead is this: As you walk through the weeks of the year, simply choose an activity (or two) at random for each week. That way, every week will have a modest amount of activity planned. Here's some example code:
import random

activities = ["Baseball", "Tennis", "Swimming", ... ]
skip_days = 3
year = {}
for y in range(0, 52*7, skip_days):
    year[y] = random.choice(activities)

print(year[0])
>>> "Swimming" (perhaps)
print(year[15])
>>> "Baseball"
print(year.get(17))  # day 17 was never scheduled, so .get returns None
>>> None
If you want more activity, make skip_days smaller. If you want less, make it bigger. If you want a fixed amount of activity in each week, you could do something like
for y in range(0, 52*7, 7):
    year[y] = random.choice(activities)
    year[y+3] = random.choice(activities)
This would plan two days a week.
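One possible tweak, purely as a hedged variation on the snippet above (not something the original answer proposes): use random.sample so that the two picks in a given week are always different activities.

import random

activities = ["Baseball", "Tennis", "Swimming"]
year = {}
for y in range(0, 52*7, 7):
    # sample without replacement, so the two picks in one week always differ
    first, second = random.sample(activities, 2)
    year[y] = first
    year[y+3] = second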
I have data that looks like this:
id Date Time assigned_pat_loc prior_pat_loc Activity
0 45546325 2/7/2011 4:29:38 EIAB^EIAB^6 NaN Admission
1 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Observation
2 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Transfer to 8W
3 45546325 2/7/2011 6:01:44 8W^W858^A 8W^W844^A Bed Movement
4 45546325 2/7/2011 7:20:44 8W^W844^A 8W^W858^A Bed Movement
5 45546325 2/9/2011 18:36:03 8W^W844^A NaN Discharge-Observation
6 45666555 3/8/2011 20:22:36 EIC^EIC^5 NaN Admission
7 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Admission
8 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Transfer to 53
9 45666555 3/9/2011 17:03:38 53^5336^A 53^5314^A Bed Movement
I need to find where multiple patients (identified by the id column) are in the same room at the same time, along with the start and end times of those overlaps, the dates, and the room number (assigned_pat_loc). assigned_pat_loc is the current patient location in the hospital, formatted as “unit^room^bed”.
So far I've done the following:
# Read in CSV file and remove bed number from patient location
patient_data = pd.read_csv('raw_data.csv')
patient_data['assigned_pat_loc'] = patient_data['assigned_pat_loc'].str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)
# Convert Date column to datetime type
patient_data['Date'] = pd.to_datetime(patient_data['Date'])
# Sort dataframe by date
patient_data.sort_values(by=['Date'], inplace = True)
# Identify rows with duplicate room and date assignments, indicating multiple patients shared room
same_room = patient_data.duplicated(subset = ['Date','assigned_pat_loc'])
# Assign duplicates to new dataframe
df_same_rooms = patient_data[same_room]
# Remove duplicate patient ids but keep latest one
no_dups = df_same_rooms.drop_duplicates(subset = ['id'], keep = 'last')
# Group patients in the same rooms at the same times together
df_shuf = pd.concat(group[1] for group in df_same_rooms.groupby(['Date', 'assigned_pat_loc'], sort=False))
And then I'm stuck at this point:
id Date Time assigned_pat_loc prior_pat_loc Activity
599359 42963403 2009-01-01 12:32:25 11M^11MX 4LD^W463^A Transfer
296155 42963484 2009-01-01 16:41:55 11M^11MX EIC^EIC^2 Transfer
1373 42951976 2009-01-01 15:51:09 11M^11MX NaN Discharge
362126 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Transfer
362125 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Admission
... ... ... ... ... ... ...
268266 46381369 2011-09-09 18:57:31 54^54X 11M^1138^A Transfer
16209 46390230 2011-09-09 6:19:06 10M^1028 EIAB^EIAB^5 Admission
659699 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Transfer
659698 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Admission
268179 46391644 2011-09-09 17:48:53 64^6412 EIE^EIE^3 Admission
Here you can see different patients in the same room at the same time, but I don't know how to extract the intervals of overlap between two different rows for the same room and the same times, and then to format the result so that the start time and end time correspond to when the room-sharing between two patients began and ended. Below is the desired output, where r_id is the id of the other patient sharing the same room and length is the number of hours that the room was shared.
As suggested, you can use groupby. One more thing you need to take care of is finding the overlapping time. Ideally you'd use datetime objects, which are easy to work with. However, you used a different format, so we need to convert it first to make the solution easier. Since you did not provide a workable example, I will just write the gist here:
# convert current format to datetime
df['start_datetime'] = pd.to_datetime(df.start_date) + df.start_time.astype('timedelta64[h]')
df['end_datetime'] = pd.to_datetime(df.end_date) + df.end_time.astype('timedelta64[h]')

df = df.sort_values(['start_datetime', 'end_datetime'], ascending=[True, False])
gb = df.groupby('r_id')
for g, g_df in gb:
    g_df['overlap_group'] = (g_df['end_datetime'].cummax().shift() <= g_df['start_datetime']).cumsum()
    print(g_df)
This is a tentative example, and you might need to tweak the datetime conversion and some other minor things, but this is the gist.
The cummax() detects where there is an overlap between the intervals, and cumsum() counts the overlapping groups; since it's a counter, we can use it as a unique identifier.
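To make the mechanics concrete, here is a minimal toy example of that trick, with made-up interval values (not data from the question):

import pandas as pd

toy = pd.DataFrame({
    'start_datetime': pd.to_datetime(['2011-01-01 08:00', '2011-01-01 09:00', '2011-01-01 12:00']),
    'end_datetime':   pd.to_datetime(['2011-01-01 10:00', '2011-01-01 11:00', '2011-01-01 13:00']),
})
toy = toy.sort_values(['start_datetime', 'end_datetime'], ascending=[True, False])

# a new group starts whenever an interval begins after every earlier interval has ended
toy['overlap_group'] = (toy['end_datetime'].cummax().shift() <= toy['start_datetime']).cumsum()
print(toy)
# rows 0 and 1 overlap each other (group 0); row 2 starts later, so it gets group 1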
I used the following threads:
Group rows by overlapping ranges
python/pandas - converting date and hour integers to datetime
Edit
After discussing it with the OP, the idea is to take each patient's df and sort it by the date of the event. The first row gives the start_time and the last row gives the end_time.
Unifying the date and time is not necessary for detecting the start and end times, since sorting by date and then by time gives the same order as sorting by a single combined column. For the overlap detection, however, it does make life easier to have it in one column.
gb_patient = df.groupby('id')
patients_data_list = []
for patient_id, patient_df in gb_patient:
    patient_df = patient_df.sort_values(by=['Date', 'Time'])
    patient_data = {
        "patient_id": patient_id,
        "start_time": patient_df.Date.values[0] + patient_df.Time.values[0],
        "end_time": patient_df.Date.values[-1] + patient_df.Time.values[-1]
    }
    patients_data_list.append(patient_data)

new_df = pd.DataFrame(patients_data_list)
After that they can use the above code for the overlaps.
I have a tricky scenario here. Usually I get 6 values from the query below, basically a price at the start and at the end of each month; the price values are always there.
trade_date price
01-01-2021 120.2
31-01-2021 220.2
01-02-2021 516.2
28-02-2021 751.0
01-03-2021 450.2
31-03-2021 854.9
and I need the sum of the 1st month's starting price + the 1st month's ending price + every subsequent month's ending price,
i.e. 120.2 + 220.2 + 751.0 + 854.9,
but in some cases the last month's data can be missing. How do I handle those scenarios?
monthly_values = Items.objects.filter(trade_date__gte=quarter_start_date,
trade_date__lte=quarter_end_date).values_list('price',
flat=True).order_by('trade_date')
total_sum = monthly_values[0] + monthly_values[1] + monthly_values[3] + monthly_values[5]
Currently I'm getting "list index out of range" from the above because of the missing values.
It has been some time since I last used the Django ORM, but you can do something similar to this:
from datetime import date
monthly_values: list[tuple[date, float]] = Items.objects.filter(trade_date__gte=quarter_start_date,
trade_date__lte=quarter_end_date).values_list('trade_date', 'price').order_by('trade_date')
Then create a function that adds the starting price from our input to a result and afterwards adds all prices where the date is not the first day of the month.
def get_prices(month_and_prices: list[tuple[date, float]]) -> float:
    res = month_and_prices[0][1]
    res += sum([x[1] for x in month_and_prices[1:] if x[0].day > 1])
    return res
This should solve what you are trying to do.
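For instance, a quick sanity check with hard-coded tuples mirroring the sample table in the question (hypothetical data, not pulled through the ORM):

from datetime import date

sample = [
    (date(2021, 1, 1), 120.2), (date(2021, 1, 31), 220.2),
    (date(2021, 2, 1), 516.2), (date(2021, 2, 28), 751.0),
    (date(2021, 3, 1), 450.2), (date(2021, 3, 31), 854.9),
]
print(get_prices(sample))      # 120.2 + 220.2 + 751.0 + 854.9 = 1946.3

# the same call still works if the last month's rows are missing
print(get_prices(sample[:4]))  # 120.2 + 220.2 + 751.0 = 1091.4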
You need to access the row and then the column:
total_sum = 0
for i in [0, 1, 3, 5]:
    total_sum += monthly_values[i][1]
This gives you access "by hand". Asger's answer is automated.
I've set up a simulation example below.
Setup:
I have weekly data, say 6 years of data, with around 1000 stocks each week (some weeks more, some weeks fewer than 1000). I randomly choose 75 stocks at time t0. At t1 some stocks die (with probability p, they go out of fashion) or leave the index (structural reasons such as a merger). I need to simulate the portfolio so that every week I have exactly 75 stocks. Every week some stocks die (between 0 and 75) and I pick new ones that are not among the existing 75. I also check whether a stock leaves for structural reasons. Every week I calculate the returns of the 75 stocks.
Questions: Is there an obvious way to improve the speed? I started with Pandas objects (group, sort), which was too slow. I haven't tried to parallelize the loop. I'm more interested to hear whether I should use numba (but it doesn't have the np.in1d function) or whether there is a faster way to shuffle (I actually only need to shuffle the ones). I've also thought about creating a fixed array with all stock ids using NaN; the problem there is that I need 75 names, so I would still need to filter out these NaNs every week.
Maybe this is too detailed a problem for this forum; I apologize if that's the case.
Code:
from timeit import default_timer
import numpy as np
# Create dataset
n_weeks = 312 # Approximately 6 years of weekly data
n_stocks = np.random.normal(1000, 5, n_weeks).astype(dtype=np.uint16) # Around 1000 stocks every week but not fixed
idx_new_week = np.cumsum(np.hstack((0, n_stocks)))
# We give each stock a stock id
n_obs = n_stocks.sum()
stock_id = np.ones([n_obs], dtype=np.uint16)
for j in range(1, n_weeks+1):
    stock_id[idx_new_week[j-1]:idx_new_week[j]] = np.cumsum(np.ones(n_stocks[j-1]))
stock_rtn = np.random.normal(0, 0.25/np.sqrt(52), n_obs) # Simulated forward (one week ahead) return for each stock
# Simulation part
# Week 0 pick randomly 75 stocks
# Week n >=1 a stock dies for two reasons
# 1) randomness (probability 'p')
# 2) structural event (could be merger, fall out of index).
# We cannot assume that it is always the high stockid which dies for structural reasons (as it looks like here)
# If a stock dies we randomly pick a stock from the "deak" stock dataset (not including the ones which die this week)
n_sim = 100 # I want this to be 1 mill
n_stock_cand = 75 # For this example we pick 75 stocks
p_survial = 0.90
# The weekly periodical returns
pf_rtn = np.zeros([n_weeks, n_sim])
start = default_timer()
for k in range(0, n_sim):
    # Randomly choose n_stock_cand stocks at time zero
    boolean_list = np.array([False] * (n_stocks[0] - n_stock_cand) + [True] * n_stock_cand)
    np.random.shuffle(boolean_list)  # Shuffle the list
    stock_id_this_week = stock_id[idx_new_week[0]:idx_new_week[1]][boolean_list]
    stock_rtn_this_week = stock_rtn[idx_new_week[0]:idx_new_week[1]][boolean_list]
    # This part only simulates the Buzz portfolio names - later we simulate returns from specific holdings of the 75 names
    for j in range(1, n_weeks):
        pf_rtn[j-1, k] = stock_rtn_this_week.mean()
        # Find the number of stocks to keep
        boolean_keep_stocks = np.random.rand(n_stock_cand) < p_survial
        # Next we need to check if a stock is still part of the universe next period
        stock_cand_temp = stock_id[idx_new_week[j-1]:idx_new_week[j]]
        stock_rtn_temp = stock_rtn[idx_new_week[j-1]:idx_new_week[j]]
        boolean_keep_stocks = (boolean_keep_stocks) & (np.in1d(stock_id_this_week, stock_cand_temp, assume_unique=True))
        n_stocks_to_replace = n_stock_cand - boolean_keep_stocks.sum()  # Number of new stocks to pick this week
        if n_stocks_to_replace > 0:
            # We have to pick from stocks which are not already part of the portfolio
            boolean_cand = np.in1d(stock_cand_temp, stock_id_this_week, assume_unique=True, invert=True)
            n_stocks_to_pick_from = boolean_cand.sum()
            boolean_list = np.array([False] * (n_stocks_to_pick_from - n_stocks_to_replace) + [True] * n_stocks_to_replace)
            np.random.shuffle(boolean_list)  # Shuffle the list
            # First avoid picking the same stock twice, then pick from the unique candidate list
            stock_id_new = stock_cand_temp[boolean_cand][boolean_list]  # The new stocks
            stock_rtn_new = stock_rtn_temp[boolean_cand][boolean_list]  # and their returns
            stock_id_this_week = np.hstack((stock_id_this_week[boolean_keep_stocks], stock_id_new))
            stock_rtn_this_week = np.hstack((stock_rtn_this_week[boolean_keep_stocks], stock_rtn_new))
        else:
            # No replacement of stocks / all survive but order might differ
            boolean_cand = np.in1d(stock_cand_temp, stock_id_this_week, assume_unique=True, invert=False)
            stock_id_this_week = stock_cand_temp[boolean_cand]
            stock_rtn_this_week = stock_rtn_temp[boolean_cand]
    # PnL last period
    pf_rtn[n_weeks-1, k] = stock_rtn_this_week.mean()
print(default_timer() - start)
Given two lists:
Issue year of bonds
Maturity year of bond
Something like:
issue_year = [1934, 1932, 1945, 1946, ...]
mature_years = [1967, 1937, 1957, 1998, ...]
With this example, the first bond has issue-year of 1934, and maturity year of 1967, while the second bond has issue-year of 1932 and maturity year of 1937, and so on.
The problem I am trying to solve is to find the year which has the highest number of active bonds.
Here is what I have so far. This finds the year in which all bonds are active.
L1 = [1936, 1934, 1937]
L2 = [1940, 1938, 1940]
ctr = 0
for i in range(len(L1)):
    j = i
    L3 = list(range(L1[i], L2[j]))
    if ctr == 0:
        tempnew = L3
    else:
        tempnew = list(set(L3) & set(tempnew))
    ctr = ctr + 1
Here tempnew is the intersection of the active years of all the bonds. But it might happen that this intersection is empty, for example if bond 1 were active from 1932 through 1945 and bond 2 from 1947 through 1960.
Can someone help?
Here is some code which I believe meets your requirements. It works by scanning through the issue and mature year lists using zip. It then fills out a dict whose keys are all of the active years and whose values are the number of bonds active in that year. Finally it dumps all of the years which have the max number of active bonds:
Code:
def add_active_years(years, issue, mature):
    for year in range(issue, mature+1):
        years[year] = years.get(year, 0) + 1
# go through the lists and calculate the active years
years = {}
for issue, mature in zip(L1, L2):
    add_active_years(years, issue, mature)
# now reverse the years dict into a count dict of lists of years
counts = {}
for year, count in years.items():
    counts[count] = counts.get(count, []) + [year]
# show the result
print(counts[max(counts.keys())])
Sample Data:
L1 = [1936,1934,1937]
L2 = [1940,1938,1940]
Results:
[1937, 1938]
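As a small aside (my own variation, not part of the answer above), the same counting can be written more compactly with collections.Counter:

from collections import Counter

L1 = [1936, 1934, 1937]
L2 = [1940, 1938, 1940]

years = Counter()
for issue, mature in zip(L1, L2):
    years.update(range(issue, mature + 1))   # count each active year once per bond

best = max(years.values())
print(sorted(y for y, c in years.items() if c == best))  # [1937, 1938]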
I have a data set recording different weeks and the new cases of dengue for each week, and I am supposed to calculate the infection rate and recovery rate for each week. The infection rate is calculated by dividing the number of newly infected patients by the susceptible population for that week, while the recovery rate is calculated by dividing the number of newly recovered patients by the infected population for that week. The infection rate is relatively simple, but for the recovery rate I have to take into account that infected patients take exactly 2 weeks to recover, and I'm stuck. Any help would be appreciated.
t_pop = 4*10**6
s_pop = t_pop
i_pop = 0
r_pop = 0
weeks = 0

# Infection Rate
for index, row in data.iterrows():
    new_i = row['New Cases']
    s_pop -= new_i
    weeks += 1
    infection_rate = float(new_i)/float(s_pop)
    print('Week', weeks, ':', infection_rate)
*Note: t_pop refers to the total population, which we assume to be 4 million; s_pop refers to the population at risk of contracting dengue; and i_pop refers to the infected population.
You could create a dictionary to store the data for each week, and then use it to refer back to when you need to calculate the recovery rate. For example:
dengue_dict = {}
dengue_dict["Week 1"] = {"Infection Rate": infection_rate, "Recovery Rate": None}
I use None at first, because there's no recovery rate until at least two weeks have gone by. Later, you can either update the weeks or just add them right away. Here's an example for week 3:
recovery_rate = dengue_dict["Week 1"]["Infection Rate"]/infection_rate
And then update the entry in the dictionary:
dengue_dict["Week 3"]["Recovery Rate"] = recovery_rate