I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe whose date falls just before that of the left dataframe. Probably easiest to show with an example.
df1:
group date teacher
a 1/10/00 1
a 2/27/00 1
b 1/7/00 1
b 4/5/00 1
c 2/9/00 2
c 9/12/00 2
df2:
teacher date hair length
1 1/1/00 4
1 1/5/00 8
1 1/30/00 20
1 3/20/00 100
2 1/1/00 0
2 8/10/00 50
Gives us:
group date teacher hair length
a 1/10/00 1 8
a 2/27/00 1 20
b 1/7/00 1 8
b 4/5/00 1 100
c 2/9/00 2 0
c 9/12/00 2 50
Edit 1:
Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
One way to do this is to create a new column in the left data frame which, for a given row's date, holds the closest earlier date from the right data frame:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
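For instance, the follow-up merge might look like this (just a sketch; like the map above it matches on dates only, so you may also want to bring the teacher column into both steps):
# join the computed asof date on the left against the raw date on the right
merged = pd.merge(df1, df2, left_on='join_date', right_on='date',
                  suffixes=('', '_right'))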
This map-based approach is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right-hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
for i, l_row in df1.iterrows():
    # Consume right-hand rows until the right-hand date passes the left-hand date
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
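For what it's worth, recent pandas versions ship a vectorized form of exactly this sorted "consume" pattern: pd.merge_asof. A sketch against the example frames above, assuming the date columns have been parsed to datetimes first:
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
# For each df1 row, take the last df2 row (same teacher) whose date is <= the df1 date
result = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'),
                       on='date', by='teacher', direction='backward')
This should reproduce the expected output above.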
Seems like the quickest way to do this is using sqlite via pysqldf:
def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                           ON tablea.{group_a}=tableb.{group_b}
                          AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                     ON a.{group_a}=b.{group_b}
                    AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date',
                           base_id=base_id)  # base_id is assumed to be defined in the enclosing scope

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())

    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])
I have many different tables that all have different column names and each refer to an outcome, like glucose, insulin, leptin etc (except keep in mind that the tables are all gigantic and messy with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The code below works, but instead of copy-pasting the final_report[...] = line over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})
ids = [1, 2, 3, 4]
start = [1, 1, 1, 1]
end = [6, 6, 6, 6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) &
                (df["timepoint"] >= start) &
                (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}

for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
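If you'd rather not mutate final_report inside the loop, a roughly equivalent one-liner (a sketch of the same idea, binding df explicitly so each lambda sees the right frame) is:
final_report = final_report.assign(**{
    f"{name}_result": final_report.apply(
        lambda x, df=df: find_result(x['id'], x['start'], x['end'], df), axis=1)
    for name, df in input_dfs.items()
})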
I already asked a similar question (Determining how one date/time range overlaps with the second date/time range?) and was able to piece some more of it together, but I need more help.
I want to be able to check when two date ranges, each with a start and end date/time, overlap. My type2 has about 50 rows while type1 has over 500. I want to be able to take the start and end of type2 and see if it falls within the type1 range. Here is a snip of the data; the dates do change down the list from 2019-04-01 to the following days.
type1 type1_start type1_end
a 2019-04-01T00:43:18.046Z 2019-04-01T00:51:35.013Z
b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z
c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z
d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z
type2 type2_start type2_end
1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z
2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z
3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z
4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z
I have been googling the best way to do this and have read through Determine Whether Two Date Ranges Overlap and understand how it should be done, but I don't know enough about how to call for the variables and make them work.
#Here is what I have, but I am stuck and have no clue where to go from here:
import pandas as pd
from pandas import Timestamp
import numpy as np
from collections import namedtuple
colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
data = pd.read_csv('test.csv', names=colnames, parse_dates=['type1_start', 'type1_end','type2_start', 'type2_end'])
A_start = data['type1_start']
A_end = data['type1_end']
B_start = data['type2_start']
B_end = data['type2_end']
t1 = data['type1']
t2 = data['type2']
r1 = (B_start, B_end)
r2 = (A_start, A_end)
def doesOverlap(r1, r2):
    if B_start > A_start:
        swap(r1, r2)
    if A_start > B_end:
        return false
    return true
It would be nice to have a csv with a result of true or false for the overlap. I was also able to make my data run using Efficiently find overlap of date-time ranges from 2 dataframes, but the results aren't correct. I added a couple of rows that I know should overlap to the data, and it didn't work. I'd need each type2 start/end to be checked against each type1.
Any help would be greatly appreciated.
Here is one way to do it:
import pandas as pd
def overlaps(row):
    # standard interval test: the two ranges overlap unless one ends before the other starts
    return ((row['type1_start'] < row['type2_end'])
            and (row['type2_start'] < row['type1_end']))

colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
df = pd.read_csv('test.csv', names=colnames, parse_dates=[
    'type1_start', 'type1_end', 'type2_start', 'type2_end'])
df['overlap'] = df.apply(overlaps, axis=1)
print('\n', df)
gives:
type1 type1_start type1_end type2 type2_start type2_end overlap
0 type1 type1_start type1_end type2 type2_start type2_end False
1 a 2019-03-01T00:43:18.046Z 2019-04-02T00:51:35.013Z 1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z True
2 b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z 2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z False
3 c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z 3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z False
4 d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z 4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z False
Below df1 contains type1 records and df2 contains type2 records:
df_new = df1.assign(key=1)\
            .merge(df2.assign(key=1), on='key')\
            .assign(has_overlap=lambda x: ~((x.type2_start > x.type1_end) | (x.type2_end < x.type1_start)))
REF: Performant cartesian product (CROSS JOIN) with pandas
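Since the stated goal was a csv with a true/false overlap result for every type2/type1 pair, the cross-joined frame can simply be written out (a sketch, with a hypothetical output filename):
# drop the helper 'key' column and save each type1/type2 pair with its overlap flag
df_new.drop(columns='key').to_csv('overlap_result.csv', index=False)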
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs falling within their range. This makes use of a double for-loop, which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
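With ~6.8 million SNPs the intermediate product of this merge can get large; one way to keep memory in check (a sketch that combines this answer with the per-chromosome loop from the question) is:
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    snp_chr = snp_df.loc[snp_df['chromosome'] == chromosome]
    gene_chr = gene_df.loc[gene_df['chromosome'] == chromosome]
    # merge within a single chromosome, then keep only SNPs inside a gene's range
    merged = snp_chr.merge(gene_chr, on='chromosome', how='inner')
    merged = merged[(merged.BP >= merged.chr_start) & (merged.BP <= merged.chr_stop)]
    all_dfs.append(merged[['SNP', 'feature_id']])
all_genic_snps = pd.concat(all_dfs)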
This is my first post so please be gentle. I have searched across the world wide web looking for a solution but I am yet to find one. The problem I'm trying to solve is as follows:
I have a dataset, comprised of 500,000+ samples, with 6 features per sample.
I have put this dataset in a multiindexed Pandas DataFrame
The first level of my dataFrame is the timeseries index, the second level is the ID. It looks as follows
Time id
2017-03-07 10:06:49.963241984 122.0 -7.024347
136.0 -11.664985
243.0 1.716150
2017-03-07 10:06:50.003462400 122.0 -7.025922
136.0 -11.671526
At every timestamp, a number of objects can be seen, each marked by the label 'id'. For my application, I want to add a temporal dependency by including information from 5 seconds earlier, i.e. in this example from timestamp 10:06:45.
But, importantly, I only want to add this information if the object already existed at that timestamp (so if the id is equal).
I wanted to use the function dataframe.shift, as mentioned here, and I want to do it per level, as indicated by user Unutbu in How do you shift Pandas DataFrame with a multiindex?
My question is as follows:
How do I append extra columns to the original dataframe X with information on what those objects were 5 s ago? I would expect something like the following:
X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)
with the rate being 25 Hz, i.e. we shift 125 "DateTimeIndices", but only if an object with id='...' exists at that timestamp.
EDIT:
The timestamps are not synchronized 100%, so the time gap is not always exactly equal to 0.04. Previously, I used np.argmin(np.abs(time-index)) to find the closest index to the stamp.
For example, in my set, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.
id = 175
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323
At timestamp 2017-03-07 10:36:08.604962560 ..... this object with id=175 has a location_x = 67.165955
id = 175
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640')
new_time = old_time + pd.Timedelta('5 seconds')
# Finding the new value of location
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]
So, finally, at timestep 10:36:08 I want to add the information of timestamp 10:36:03 IF the object already existed at that timestamp.
EDIT2:
After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.
for current_time in X.index.get_level_values(0)[125:]:
    # only do if there are objects at the current time
    if len(X.ix[current_time].index):
        # Calculate past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # Find index in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time - X.index.get_level_values(0)))
        # translate the index back to a label
        past_time = X.index[past_time_index][0]
        # in that timestep, cycle the objects
        for obj_id in X.ix[current_time].index:
            # Try looking for the value box_center.x of obj obj_id 5s ago
            try:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = X.ix[(past_time, obj_id), 'box_center.x']
                X.ix[(current_time, obj_id), 'box_center.y.shifted'] = X.ix[(past_time, obj_id), 'box_center.y']
                X.ix[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.x']
                X.ix[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.y']
            # If the key doesn't exist, the object didn't exist 5 s ago, so the field should be np.nan
            except KeyError:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = np.nan
    print('Timestep {}'.format(current_time))
If this is not enough information, please say so and I can add it :)
Cheers and thanks!
Assuming that you have no gaps in the timestamps, one possible solution might be the following, which creates a new index with shifted timestamps and uses that to get the 5 seconds-ago values for each ID.
offset = 5 * rate
# Create a shallow copy of the multiindex levels for modification
modified_levels = list(X.index.levels)
# Shift them
modified_times = pd.Series(modified_levels[0]).shift(offset)
# Fill NaNs with dummy values to avoid duplicates in the new index
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull()))
modified_levels[0] = modified_times
new_index = X.index.set_levels(modified_levels, inplace=False)
X['x_location_shifted'] = X.loc[new_index, 'x_location'].values
If the timestamps are not 100% regular, then you'll either have to round them to the nearest 1/x second, or use a loop.
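For the rounding option, something along these lines might work (only a sketch, assuming the 25 Hz rate from the question, i.e. a 40 ms grid, and the index layout described there):
# snap the time level of the MultiIndex onto a 40 ms grid so shifted timestamps line up exactly
rounded_times = X.index.get_level_values(0).round('40ms')
X.index = pd.MultiIndex.from_arrays(
    [rounded_times, X.index.get_level_values(1)], names=X.index.names)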
For the loop option, you could use the following.
Data definition
import pandas as pd
import numpy as np
from io import StringIO
df_str = """
timestamp id location
10:00:00.005 1 a
10:00:00.005 2 b
10:00:00.005 3 c
10:00:05.006 2 a
10:00:05.006 3 b
10:00:05.006 4 c"""
df = pd.DataFrame.from_csv(StringIO(df_str), sep='\t').reset_index()
delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1/50, unit='s')
df['location_shifted'] = np.nan
Loop over the different id's
for label_id in set(df['id']):
    df_id = df[df['id'] == label_id].copy()  # copy to make sure we don't overwrite the original data. Might not be necessary
    df_id['time_shift'] = df['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        time_dif = abs(df['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                idx_shift = shifted_locs['time_shift'].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']
Results
timestamp id location location_shifted
0 2017-05-09 10:00:00.005 1 a
1 2017-05-09 10:00:00.005 2 b
2 2017-05-09 10:00:00.005 3 c
3 2017-05-09 10:00:05.006 2 a b
4 2017-05-09 10:00:05.006 3 b c
5 2017-05-09 10:00:05.006 4 c
For any of you arriving here with the same question: I managed to solve it in a (minimally) vectorized way, but it required me to return to a 3D panel.
3 steps:
- Make it into a 3D panel
- Add new columns
- Fill those columns
From a multi-index 2D frame it's possible to convert to a pandas.Panel, where the 2nd index becomes one of the axes of the panel.
After this I have a 3D panel with axes [time, objects, parameters]. Then transpose the panel to have the parameters as items, so that columns can be added to the data panel. So: transpose the panel, add the columns, transpose back.
dp_new = dp.transpose(2,0,1)
dp_new['shifted_box_center_x']=np.nan
dp_new['shifted_box_center_y']=np.nan
dp_new['shifted_relative_velocity_x']=np.nan
dp_new['shifted_relative_velocity_y']=np.nan
# transpose them back to their original form
dp_new = dp_new.transpose(1,2,0)
Now that we have added the new fields, we can get their names by
new_fields = dp_new.minor_axis[-4:]
The objective is to add information from 5 s ago, if that object existed. Therefore, we cycle through the time series starting from the moment that is 5 s in. In my case, at a rate of 25 Hz, this is element 5*rate = 125.
Let's first set the time to start from 5 s into the data panel:
time = dp_new.items[125:]
Then we iterate over an enumerated version of the time. The enumeration starts at 0, which is the index of the data panel at timestep 0; the first timestep handled, however, is the one at time 0 + 5 seconds.
time = dp_new.items[125:]
for iloc, ts in enumerate(time):
    # Print progress
    print('{} out of {}'.format(ts, dp.items[-1]), end="\r", flush=True)

    # Generate new INDEX field, by taking the field ID and dropping the NaN values
    ids = dp_new.loc[ts].id.dropna().values

    # Drop the nan field from the frame
    dp_new[ts].dropna(thresh=5, inplace=True)

    # save the original indices
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values}

    # set the index to field id
    dp_new[ts].set_index(['id'], inplace=True)

    # Check if the vector ids does NOT contain ALL ZEROS
    if np.any(ids):  # Check for all zeros
        df_past = dp_new.iloc[iloc].copy()  # SCREENSHOT AT TS=5s --> ILOC = 0
        df_past.dropna(thresh=5, inplace=True)  # drop the nan rows
        df_past.set_index(['id'], inplace=True)  # set the index to field ID
        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values
This will only fill in fields for objects whose id is in ids.
This code was able to run on a 300,000-element file in about 5 minutes.
Note: I spent quite some time on this, mainly because of how one indexes a panel. At first, I thought calling the 3 dimensions would work, as stated in the pandas help, but it seems that this is not the case.
dp_new[ts, ids, new_fields] = values does NOT work.
I have a dataframe of jobs for different people, with a start and end time for each job. I'd like to count, every four months, how many jobs each person is responsible for. I figured out a way to do it, but I'm sure it's tremendously inefficient (I'm new to pandas). It takes quite a while to compute when I run the code on my complete dataset (hundreds of people and jobs).
Here is what I have so far.
#create a data frame
import pandas as pd
import numpy as np
df = pd.DataFrame({'job': pd.Categorical(['job1', 'job2', 'job3', 'job4']),
                   'person': pd.Categorical(['p1', 'p1', 'p2', 'p2']),
                   'start': ['2015-01-01', '2015-06-01', '2015-01-01', '2016-01-01'],
                   'end': ['2015-07-01', '2015-12-31', '2016-03-01', '2016-12-31']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
Which gives me
I then create a new dataset with
bdate = min(df['start'])
edate = max(df['end'])
dates = pd.date_range(bdate, edate, freq='4MS')
people = sorted(set(list(df['person'])))
df2 = pd.DataFrame(np.zeros((len(dates), len(people))), index=dates, columns=people)
for d in pd.date_range(bdate, edate, freq='MS'):
    for p in people:
        contagem = df[(df['person'] == p) &
                      (df['start'] <= d) &
                      (df['end'] >= d)]
        pos = np.argmin(np.abs(dates - d))
        df2.iloc[pos, df2.columns.get_loc(p)] = len(contagem.index)
df2
And I get
I'm sure there must be a better way of doing this without having to loop through all dates and persons. But how?
This answer assumes that each job-person combination is unique. It creates a series for every row, with the value equal to the job and an index that expands the dates. Then it resamples every 4 months (which is not quarterly, but is what your solution describes) and counts the unique non-NA occurrences.
def make_date_range(x):
    return pd.Series(index=pd.date_range(x.start.values[0], x.end.values[0], freq='M'),
                     data=x.job.values[0])

# Iterate through each job-person combo and make an entry for each month with the job as the value
df1 = df.groupby(['job', 'person']).apply(make_date_range).unstack('person')

# remove outer level from index
df1.index = df1.index.droplevel('job')

# resample each month counting only unique values
df1.resample('4MS').agg(lambda x: len(x[x.notnull()].unique()))
Output
person p1 p2
2015-01-01 1 1
2015-05-01 2 1
2015-09-01 1 1
2016-01-01 0 2
2016-05-01 0 1
2016-09-01 0 1
And here is a long one-line solution that iterates over every row, creates a new dataframe for each, stacks them all together via pd.concat, and then resamples.
pd.concat([pd.DataFrame(index=pd.date_range(tup.start, tup.end, freq='4MS'),
                        data=[[tup.job]],
                        columns=[tup.person]) for tup in df.itertuples()])\
  .resample('4MS').count()
And another one that is faster
df1 = pd.melt(df, id_vars=['job', 'person'], value_name='date').set_index('date')
g = df1.groupby([pd.TimeGrouper('4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)
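Note that pd.TimeGrouper has since been removed from pandas; on recent versions the equivalent (a sketch, not re-run here) is pd.Grouper:
# group the date index into 4-month bins and count distinct jobs per person
g = df1.groupby([pd.Grouper(freq='4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)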