I have a dataframe with id, purchase date, price of purchase and duration in days,
df
id purchased_date price duration
1 2020-01-01 16.50 2
2 2020-01-01 24.00 4
What I'm trying to do is, wherever the duration is greater than 1 day, split that row into one duplicated row per day, divide the price by the number of individual days, and increase the date by 1 day for each day purchased. Effectively giving me this,
df_new
id purchased_date price duration
1 2020-01-01 8.25 1
1 2020-01-02 8.25 1
2 2020-01-01 6.00 1
2 2020-01-02 6.00 1
2 2020-01-03 6.00 1
2 2020-01-04 6.00 1
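(That is, 16.50 / 2 = 8.25 per day for id 1 and 24.00 / 4 = 6.00 per day for id 2.)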
So far
I've managed to duplicate the rows based on the duration using:
df['price'] = df['price']/df['duration']
df = df.loc[df.index.repeat(df.duration)]
and then I've tried using:
df.groupby(['id', 'purchased_date']).purchased_date.apply(lambda n: n + pd.to_timedelta(1, unit='d'))
however, this just gets stuck in an endless loop and I'm not sure how to proceed.
My plan is to put this all in a function but for now I just want to get the process working.
Thank you for any help.
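For reference, here is a minimal snippet to rebuild the sample input shown above, assuming purchased_date is a datetime column (which the answers below rely on when adding timedeltas):

import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'purchased_date': pd.to_datetime(['2020-01-01', '2020-01-01']),
                   'price': [16.50, 24.00],
                   'duration': [2, 4]})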
Use GroupBy.cumcount to create a counter for the repeated rows, pass it to to_timedelta to get day offsets, and add those to the purchased_date column:
df['price'] = df['price']/df['duration']
df = df.loc[df.index.repeat(df.duration)].assign(duration=1)
df['purchased_date'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df = df.reset_index(drop=True)
print (df)
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1
An approach with pandas.date_range and explode:
(df.assign(price=df['price'].div(df['duration']),
purchased_date=df.apply(lambda x: pd.date_range(x['purchased_date'],
periods=x['duration']),
axis=1),
duration=1
)
.explode('purchased_date', ignore_index=True)
)
output:
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1
Here is an easy-to-understand approach:
Assign the average 'price' value
Create a temporary 'end_date' column
Modify 'purchased_date' to hold a list of datetimes
Explode 'purchased_date' to form new rows
Assign 1 to the 'duration' column
Delete the temporary 'end_date' column
Code:
df['price'] = df['price']/df['duration']
df['end_date'] = df.purchased_date + pd.to_timedelta(df.duration.sub(1), unit='d')
df['purchased_date'] = df.apply(lambda x: pd.date_range(start=x['purchased_date'], end=x['end_date']), axis=1)
df = df.explode('purchased_date').reset_index(drop=True)
df = df.assign(duration=1)
del df['end_date']
print (df)
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1
I need to calculate cumulative sums for different columns in a pandas dataframe based on a column playerId and a datetime column. My dataframe looks like this:
eventId playerId goal shot header dateutc
0 0 100 0 1 0 2020-11-08 17:00:00
1 1 100 0 0 1 2020-11-08 17:00:00
2 2 100 1 1 0 2020-11-08 17:00:00
3 3 200 0 1 0 2020-11-08 17:00:00
4 4 100 1 0 1 2020-11-15 17:00:00
5 5 100 1 1 0 2020-11-15 17:00:00
6 6 200 1 1 0 2020-11-15 17:00:00
Now I need to calculate cumulative sums for each player over the current date and all previous dates. My final dataframe will look like this:
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
Hopefully someone can help me :)
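For reference, the sample frame above can be rebuilt with (assuming dateutc is a datetime column):

import pandas as pd

df = pd.DataFrame({'eventId':  [0, 1, 2, 3, 4, 5, 6],
                   'playerId': [100, 100, 100, 200, 100, 100, 200],
                   'goal':     [0, 0, 1, 0, 1, 1, 1],
                   'shot':     [1, 0, 1, 1, 0, 1, 1],
                   'header':   [0, 1, 0, 0, 1, 0, 0],
                   'dateutc':  pd.to_datetime(['2020-11-08 17:00:00'] * 4
                                              + ['2020-11-15 17:00:00'] * 3)})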
First remove eventId so this numeric column is not summed, aggregate with sum, and then apply cumsum:
df1 = (df.drop('eventId',axis=1)
.groupby(['playerId','dateutc'], sort=False)
.sum()
.groupby(level=0, sort=False)
.cumsum()
.reset_index())
print (df1)
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
If you need to specify the columns to process:
df1 = (df.groupby(['playerId','dateutc'], sort=False)[['goal', 'shot', 'header']]
.sum()
.groupby(level=0, sort=False)
.cumsum()
.reset_index())
Try:
out = df.groupby(['playerId', 'dateutc'], sort=False)[['goal', 'shot', 'header']].sum()
out = out.groupby(level='playerId').cumsum().reset_index()
Output:
>>> out
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
I have two dataframes, df1 and df2. I need to construct an output that finds the date in df2 nearest to the date in df1, while simultaneously matching the ID value in both df1 and df2. The desired output df shown below illustrates what I've tried to explain above!
df1:
ID Date
1 2020-01-01
2 2020-01-03
df2:
ID Date
11 2020-01-11
4 2020-02-03
5 2020-04-02
6 2020-01-05
1 2021-01-13
1 2021-03-03
1 2020-01-30
2 2020-03-31
2 2021-04-01
2 2021-02-02
df (Output desired)
ID Date Closest Date
1 2020-01-01 2020-01-30
2 2020-01-03 2020-03-31
Here's one way to achieve it, assuming that the Date columns' dtype is datetime. First,
df3 = df1[df1.ID.isin(df2.ID)].copy()  # .copy() avoids a SettingWithCopyWarning when adding the new column below
will give you
ID Date
0 1 2020-01-01
1 2 2020-01-03
Then
df3['Closest_date'] = df3.apply(lambda row:min(df2[df2.ID.eq(row.ID)].Date,
key=lambda x:abs(x-row.Date)),
axis=1)
gets the min of df2.Date, where
df2[df2.ID.eq(row.ID)].Date selects the rows that have the matching ID, and
key=lambda x: abs(x - row.Date) tells min to compare by distance to row.Date,
which has to be done row by row, hence axis=1
Output:
ID Date Closest_date
0 1 2020-01-01 2020-01-30
1 2 2020-01-03 2020-03-31
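As an alternative sketch (not from the answer above), pd.merge_asof with direction='nearest' can do the same matching, assuming both Date columns are datetime and both frames are sorted by Date; the Closest_Date helper column is introduced here only so df2's matched date survives into the output:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2],
                    'Date': pd.to_datetime(['2020-01-01', '2020-01-03'])})
df2 = pd.DataFrame({'ID': [11, 4, 5, 6, 1, 1, 1, 2, 2, 2],
                    'Date': pd.to_datetime(['2020-01-11', '2020-02-03', '2020-04-02',
                                            '2020-01-05', '2021-01-13', '2021-03-03',
                                            '2020-01-30', '2020-03-31', '2021-04-01',
                                            '2021-02-02'])})

# merge_asof needs both frames sorted by the 'on' key; matching is restricted
# to rows with the same ID via by='ID'.
out = pd.merge_asof(df1.sort_values('Date'),
                    df2.assign(Closest_Date=df2['Date']).sort_values('Date'),
                    on='Date', by='ID', direction='nearest')
print(out)
#    ID       Date Closest_Date
# 0   1 2020-01-01   2020-01-30
# 1   2 2020-01-03   2020-03-31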
I have a dataframe of the following form where each row corresponds to a job run on a machine:
import pandas as pd
df = pd.DataFrame({
'MachineID': [4, 3, 2, 2, 1, 1, 5, 3],
'JobStartDate': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-01', '2020-01-03'],
'JobEndDate': ['2020-01-03', '2020-01-03', '2020-01-04', '2020-01-02', '2020-01-04', '2020-01-05', '2020-01-02', '2020-01-04'],
'IsTypeAJob': [1, 1, 0, 1, 0, 0, 1, 1]
})
df
>>> MachineID JobStartDate JobEndDate IsTypeAJob
0 4 2020-01-01 2020-01-03 1
1 3 2020-01-01 2020-01-03 1
2 2 2020-01-01 2020-01-04 0
3 2 2020-01-01 2020-01-02 1
4 1 2020-01-02 2020-01-04 0
5 1 2020-01-03 2020-01-05 0
6 5 2020-01-01 2020-01-02 1
7 3 2020-01-03 2020-01-04 1
In my data there are two types of jobs that can be run on a machine, either type A or type B. My goal is to count the number of type A and type B jobs per machine per day. Thus the desired result would look something like
MachineID Date TypeAJobs TypeBJobs
0 1 2020-01-02 0 1
1 1 2020-01-03 0 2
2 1 2020-01-04 0 2
3 1 2020-01-05 0 1
4 2 2020-01-01 1 1
5 2 2020-01-02 1 1
6 2 2020-01-03 0 1
7 2 2020-01-04 0 1
8 3 2020-01-01 1 0
9 3 2020-01-02 1 0
10 3 2020-01-03 2 0
11 3 2020-01-04 1 0
12 4 2020-01-01 1 0
13 4 2020-01-02 1 0
14 4 2020-01-03 1 0
15 5 2020-01-01 1 0
16 5 2020-01-02 1 0
I have tried approaches found here and here with a resample() and apply() method, but the computing time is too slow. This has to do with the fact that some date ranges span multiple years in my set, meaning one row can blow up into 2000+ new rows during resampling (my data contains around a million rows to begin with). Thus something like creating a new machine/date row for each date in the range of a certain job is too slow (with the goal of doing a groupby(['MachineID', 'Date']).sum() at the end).
I am currently thinking about a new approach where I begin by grouping by MachineID, then finding the earliest job start date and latest job end date for that machine. Then I could create a date range of days between these two dates (incrementing by day) which I would use to index a new per-machine dataframe. Then for each job for that MachineID I could potentially sum over a range of dates, i.e. in pseudocode:
df['TypeAJobs'][row['JobStartDate']:row['JobEndDate']] += 1 if it is a type A job or
df['TypeBJobs'][row['JobStartDate']:row['JobEndDate']] += 1 otherwise.
This seems like it would avoid creating a bunch of extra rows for each job as now we are creating extra rows for each machine. Furthermore, the addition operations seem like they would be fast since we are adding to an entire slice of a series at once. However, I don't know if something like this (indexing by date) is possible in Pandas. Maybe there is some conversion that can be done first? After doing the above, ideally I would have a number of data frames similar to the desired result but only with one MachineID, then I would concatenate these data frames to get the result.
I would love to hear any suggestions about the feasibility/effectiveness of this approach or another potential algorithm. Thanks so much for reading!
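Indexing by date in this way is possible. A minimal sketch of the idea for a single machine (machine 3 from the sample), assuming the date columns are converted to datetime first; the variable names here are illustrative only:

import pandas as pd

# Machine 3's two jobs from the sample data.
jobs = pd.DataFrame({'JobStartDate': pd.to_datetime(['2020-01-01', '2020-01-03']),
                     'JobEndDate':   pd.to_datetime(['2020-01-03', '2020-01-04']),
                     'IsTypeAJob':   [1, 1]})

# One row per day between this machine's earliest start and latest end.
days = pd.date_range(jobs['JobStartDate'].min(), jobs['JobEndDate'].max(), freq='D')
counts = pd.DataFrame(0, index=days, columns=['TypeAJobs', 'TypeBJobs'])

for row in jobs.itertuples():
    col = 'TypeAJobs' if row.IsTypeAJob else 'TypeBJobs'
    # .loc label slicing on a DatetimeIndex includes both endpoints
    counts.loc[row.JobStartDate:row.JobEndDate, col] += 1

print(counts)  # 2020-01-03 ends up with TypeAJobs == 2, matching the desired output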
IIUC, try using pd.date_range and explode to create 'daily' rows, then groupby dates and IsTypeAJob and rename columns:
df_out = df.assign(JobDates=df.apply(lambda x: pd.date_range(x['JobStartDate'],
x['JobEndDate'], freq='D'),
axis=1))\
.explode('JobDates')
df_out = df_out.groupby([df_out['MachineID'],
df_out['JobDates'].dt.floor('D'),
'IsTypeAJob'])['MachineID'].count()\
.unstack()\
.rename(columns={0:'TypeBJobs', 1:'TypeAJobs'})\
.fillna(0).reset_index()
df_out
Output:
IsTypeAJob MachineID JobDates TypeBJobs TypeAJobs
0 1 2020-01-02 1.0 0.0
1 1 2020-01-03 2.0 0.0
2 1 2020-01-04 2.0 0.0
3 1 2020-01-05 1.0 0.0
4 2 2020-01-01 1.0 1.0
5 2 2020-01-02 1.0 1.0
6 2 2020-01-03 1.0 0.0
7 2 2020-01-04 1.0 0.0
8 3 2020-01-01 0.0 1.0
9 3 2020-01-02 0.0 1.0
10 3 2020-01-03 0.0 2.0
11 3 2020-01-04 0.0 1.0
12 4 2020-01-01 0.0 1.0
13 4 2020-01-02 0.0 1.0
14 4 2020-01-03 0.0 1.0
15 5 2020-01-01 0.0 1.0
16 5 2020-01-02 0.0 1.0
An alternative way to build the per-day rows is to construct a small frame per job and concatenate them (note that this iterates over the rows with iterrows):
pd.concat([pd.DataFrame({'JobDates':pd.date_range(r.JobStartDate, r.JobEndDate, freq='D'),
'MachineID':r.MachineID,
'IsTypeAJob':r.IsTypeAJob}) for i, r in df.iterrows()])
Here is another way to do the job. The idea is similar to using str.get_dummies on both the start and end columns, but done with array broadcasting. Use cumsum to get 1 between start and end, and 0 otherwise. Create a dataframe with the dates as columns and both MachineID and IsTypeAJob as the index. Then do a similar operation to the answer from #Scott Boston to get the expected output shape.
import numpy as np

#get all possible dates
dr = pd.date_range(df['JobStartDate'].min(),
df['JobEndDate'].max()).strftime("%Y-%m-%d").to_numpy()
df_ = (pd.DataFrame(
np.cumsum((df['JobStartDate'].to_numpy()[:, None] == dr).astype(int)
- np.pad(df['JobEndDate'].to_numpy()[:, None]==dr,((0,0),(1,0)),
mode='constant')[:, :-1], # pad is equivalent to shift along columns
axis=1),
index=pd.MultiIndex.from_frame(df[['MachineID', 'IsTypeAJob']]),
columns=dr,)
.sum(level=['MachineID', 'IsTypeAJob']) #equivalent to groupby(['MachineID', 'IsTypeAJob']).sum()
.replace(0, np.nan) #to remove extra dates per original row during the stack
.stack()
.unstack(level='IsTypeAJob', fill_value=0)
.astype(int)
.reset_index()
.rename_axis(columns=None)
.rename(columns={'level_1':'Date', 0:'TypeBJobs', 1:'TypeAJobs'})
)
and you get
MachineID Date TypeBJobs TypeAJobs
0 1 2020-01-02 1 0
1 1 2020-01-03 2 0
2 1 2020-01-04 2 0
3 1 2020-01-05 1 0
4 2 2020-01-01 1 1
5 2 2020-01-02 1 1
6 2 2020-01-03 1 0
7 2 2020-01-04 1 0
8 3 2020-01-01 0 1
9 3 2020-01-02 0 1
10 3 2020-01-03 0 2
11 3 2020-01-04 0 1
12 4 2020-01-01 0 1
13 4 2020-01-02 0 1
14 4 2020-01-03 0 1
15 5 2020-01-01 0 1
16 5 2020-01-02 0 1
I'd like to change my dataframe by adding a time value for every hour of a month
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to do this?
Use a cross join via DataFrame.merge with a new DataFrame that holds all hours of the month, created by date_range:
df1 = pd.DataFrame({'a':1,
'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
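If your pandas version is 1.2 or newer, the helper column a can probably be dropped, since merge supports a cross join directly; a minimal sketch under that assumption:

df1 = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(df1, how='cross')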
I have a dataframe that I am trying to calculate the year-to-date average for my value columns. Below is a sample dataframe.
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
I want to create new columns (values_ytd & values2_ytd) that will average the values from January to the latest period within the same year (April in sample data). I will need to group the data by year & name when calculating the averages. I am looking for an output similar to this.
date name values values2 values2_ytd values_ytd
0 2019-01-01 a 1 1 1 1
1 2019-02-01 a 3 3 2 2
2 2019-03-01 a 2 2 2 2
3 2019-04-01 a 6 2 2 3
I have tried unsuccessfully to use expanding().mean(), but most likely I was doing it wrong. My main dataframe has numerous name categories and many more columns. Here is the code I was attempting to use
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).expanding().mean().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
but am receiving the following error.
NotImplementedError: ops for Expanding for this dtype datetime64[ns] are not implemented
Note: The code below works perfectly when substituting cumsum() for .expanding().mean() to create a year-to-date sum of the values, but I can't figure it out for averages
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).cumsum().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
Any help is greatly appreciated.
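For reference, a minimal reconstruction of the sample frame (called df1 in the attempts above); the dates are left as strings here because the first answer below converts them with pd.to_datetime:

import pandas as pd

df = pd.DataFrame({'date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01'],
                   'name': ['a', 'a', 'a', 'a'],
                   'values': [1, 3, 2, 6],
                   'values2': [1, 3, 2, 2]})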
Try this:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df[['values_ytd', 'values2_ytd']] = df.groupby([df.index.year, 'name'])[['values', 'values2']].expanding().mean().reset_index(level=[0, 1], drop=True)
df
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
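As a quick check, the expanding mean of values for name a is 1, (1 + 3)/2 = 2, (1 + 3 + 2)/3 = 2, (1 + 3 + 2 + 6)/4 = 3, which matches the values_ytd column above.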
Example using multiple names and years:
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
4 2019-01-01 b 1 4
5 2019-02-01 b 3 4
6 2020-01-01 a 1 1
7 2020-02-01 a 3 3
8 2020-03-01 a 2 2
9 2020-04-01 a 6 2
Output:
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
2019-01-01 b 1 4 1.0 4.0
2019-02-01 b 3 4 2.0 4.0
2020-01-01 a 1 1 1.0 1.0
2020-02-01 a 3 3 2.0 2.0
2020-03-01 a 2 2 2.0 2.0
2020-04-01 a 6 2 3.0 2.0
You could also set the date column as the index with df.set_index('date', inplace=True) and then use df.groupby('name').resample('AS').mean()