Pandas concat/merge Dataframes fill missing values with last in column - python

I want to aggregate the data of two pandas DataFrames into one, where the total column needs to be filled with the last existing value in that column. Here is my code:
import pandas as pd
df1 = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-05'],
    'day_count': [1, 1, 1, 1],
    'total': [1, 2, 3, 4]})
df2 = pd.DataFrame({
    'date': ['2020-01-02', '2020-01-03', '2020-01-04'],
    'day_count': [2, 2, 2],
    'total': [2, 4, 6]})
# set "date" as index and convert to datetime for later resampling
df1.index = df1['date']
df1.index = pd.to_datetime(df1.index)
df2.index = df2['date']
df2.index = pd.to_datetime(df2.index)
Now I need to resample both of my dataframes to some frequency, let's say daily, so I would do:
df1 = df1.resample('D').agg({'day_count': 'sum', 'total': 'last'})
df2 = df2.resample('D').agg({'day_count': 'sum', 'total': 'last'})
The DataFrames now look like:
In [20]: df1
Out[20]:
day_count total
date
2020-01-01 1 1.0
2020-01-02 1 2.0
2020-01-03 1 3.0
2020-01-04 0 NaN
2020-01-05 1 4.0
In [22]: df2
Out[22]:
day_count total
date
2020-01-02 2 2
2020-01-03 2 4
2020-01-04 2 6
Now I need to merge both, but notice that total has some NaN values where I need to fill in the previously existing value, so I do:
df1['total'] = df1['total'].fillna(method='ffill').astype(int)
df2['total'] = df2['total'].fillna(method='ffill').astype(int)
Now df1 looks like:
In [25]: df1
Out[25]:
day_count total
date
2020-01-01 1 1
2020-01-02 1 2
2020-01-03 1 3
2020-01-04 0 3
2020-01-05 1 4
So now I have the two dataframes ready to be merged, I think, so I concat them:
final_df = pd.concat([df1, df2]).fillna(method='ffill').groupby(["date"], as_index=True).sum()
In [31]: final_df
Out[31]:
day_count total
date
2020-01-01 1 1
2020-01-02 3 4
2020-01-03 3 7
2020-01-04 2 9
2020-01-05 1 4
I get the correct aggregation for day_count by simply summing what's on the same date for both DataFrames, but for total I do not get what I expected, which is:
In [31]: final_df
Out[31]:
day_count total
date
2020-01-01 1 1
2020-01-02 3 4
2020-01-03 3 7
2020-01-04 2 9
2020-01-05 1 10 --> this is the value I am missing
Certainly I am doing something wrong. I feel like there may be an even simpler way to do this. Thanks!

Concatenate them horizontally and groupby along columns:
pd.concat([df1,df2], axis=1).ffill().groupby(level=0, axis=1).sum()
That said, you can also bypass the individual fillna calls and the groupby:
# these are not needed
# df1['total'] = df1['total'].fillna(method='ffill').astype(int)
# df2['total'] = df2['total'].fillna(method='ffill').astype(int)
pd.concat([df1,df2],axis=1).ffill().sum(level=0, axis=1)
Output:
day_count total
date
2020-01-01 1.0 1.0
2020-01-02 3.0 4.0
2020-01-03 3.0 7.0
2020-01-04 2.0 9.0
2020-01-05 3.0 10.0
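On newer pandas versions, groupby(..., axis=1) and sum(level=..., axis=1) are deprecated or removed. A rough equivalent (my own sketch, assuming df1/df2 are the resampled frames above) is to group on the transpose instead:
combined = pd.concat([df1, df2], axis=1).ffill()
final_df = combined.T.groupby(level=0).sum().T  # group columns by name, then transpose back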

Related

How to create a new metric column based on 1 year lag of a date column?

I would like to create a new column which references the date column minus 1 year and displays the corresponding values:
import pandas as pd
import numpy as np
Input DF
df = pd.DataFrame({'consumption': [0, 1, 3, 5],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02'),
                            pd.to_datetime('2018-04-01'),
                            pd.to_datetime('2018-04-02')]})
>>> df
   consumption       date
0            0 2017-04-01
1            1 2017-04-02
2            3 2018-04-01
3            5 2018-04-02
Expected DF
df = pd.DataFrame({'consumption': [0, 1, 3, 5],
                   'prev_year_consumption': [np.nan, np.nan, 0, 1],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02'),
                            pd.to_datetime('2018-04-01'),
                            pd.to_datetime('2018-04-02')]})
>>> df
   consumption  prev_year_consumption       date
0            0                    NaN 2017-04-01
1            1                    NaN 2017-04-02
2            3                    0.0 2018-04-01
3            5                    1.0 2018-04-02
So prev_year_consumption simply holds the values from the consumption column for the row whose date is exactly 1 year earlier.
In SQL I would probably do something like:
SELECT df_past.consumption AS prev_year_consumption, df_current.consumption
FROM df AS df_current
LEFT JOIN df AS df_past ON year(df_current.date) = year(df_past.date) + 1
Appreciate any hints
The notation in pandas is similar. We are still doing a self merge however we need to specify that the right_on (or left_on) has a DateOffset of 1 year:
new_df = df.merge(
    df,
    left_on='date',
    right_on=df['date'] + pd.offsets.DateOffset(years=1),
    how='left'
)
new_df:
date consumption_x date_x consumption_y date_y
0 2017-04-01 0 2017-04-01 NaN NaT
1 2017-04-02 1 2017-04-02 NaN NaT
2 2018-04-01 3 2018-04-01 0.0 2017-04-01
3 2018-04-02 5 2018-04-02 1.0 2017-04-02
We can further drop and rename columns to get exact output:
new_df = df.merge(
    df,
    left_on='date',
    right_on=df['date'] + pd.offsets.DateOffset(years=1),
    how='left'
).drop(columns=['date_x', 'date_y']).rename(columns={
    'consumption_y': 'prev_year_consumption'
})
new_df:
date consumption_x prev_year_consumption
0 2017-04-01 0 NaN
1 2017-04-02 1 NaN
2 2018-04-01 3 0.0
3 2018-04-02 5 1.0
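As a side note, if the lookup is always an exact one-year shift and the dates are unique, a map-based sketch (my own alternative under those assumptions, not part of the answer above) avoids the merge suffixes entirely:
# build a lookup keyed by date + 1 year, then map the current dates against it
lookup = pd.Series(df['consumption'].values, index=df['date'] + pd.DateOffset(years=1))
df['prev_year_consumption'] = df['date'].map(lookup)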

How do I create a dummy variable by comparing columns in different data frames?

I would like to compare one column of a df with a column in a different df. The columns are timestamp and holiday date. I'd like to create a dummy variable which is 1 if the timestamp in df1 matches one of the dates in df2, else 0.
For example, df1:
timestamp weight(kg)
0 2016-03-04 4.0
1 2015-02-15 5.0
2 2019-05-04 5.0
3 2018-12-25 29.0
4 2020-01-01 58.0
For example, df2:
holiday
0 2016-12-25
1 2017-01-01
2 2019-05-01
3 2018-12-26
4 2020-05-26
Ideal output:
timestamp weight(kg) holiday
0 2016-03-04 4.0 0
1 2015-02-15 5.0 0
2 2019-05-04 5.0 0
3 2018-12-25 29.0 1
4 2020-01-01 58.0 1
I have tried writing a function, but it takes very long to compute:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if
                         x['timestamp'] == y['holiday_dt']
                         else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df

# df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps, there is a simpler way to do so or the function is not exactly well-written. Any help will be appreciated.
Use Series.isin and convert the boolean mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
import numpy as np
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
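One caveat worth noting (an assumption about the data, not stated in the question): isin compares exact values, so if df1['timestamp'] holds datetimes while df2['holiday'] holds strings or date objects, normalize both sides first, for example:
# coerce both columns to datetime before the membership test
df1['holiday'] = pd.to_datetime(df1['timestamp']).isin(pd.to_datetime(df2['holiday'])).astype(int)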

python dataframe change index type and remove duplicates

I have a dataframe that looks like this:
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-01-01 00:00:00 0
2020-02-01 00:00:00 0
2020-03-01 00:00:00 0
2020-04-01 00:00:00 0
I want to remove the time from the index and combine rows where the dates are the same. The end result would look like:
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-06-01 0
2020-07-01 0
2020-08-01 7
etc, etc
Change the index data type and filter with .duplicated:
df.index = pd.to_datetime(df.index)
df = df[~df.index.duplicated(keep='first')]
df
Out[1]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
If you want to sum them together rather than get rid of the duplicates, then use:
df.index = pd.to_datetime(df.index)
df = df.sum(level=0)
df
Out[2]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
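Note that DataFrame.sum(level=...) has since been removed in newer pandas; an equivalent sketch is a groupby on the index level:
df.index = pd.to_datetime(df.index)
df = df.groupby(level=0).sum()  # same result as df.sum(level=0) on older pandas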
If the index content is in string format, you can simply slice:
df.reset_index(inplace=True)  # assume the column name is "date"
df["date"] = df["date"].str[:10]  # keep only the YYYY-MM-DD part
df.set_index("date", inplace=True)
If it is already in datetime format:
df.reset_index(inplace=True)
df['date'] = pd.to_datetime(df['date']).dt.date
df.set_index("date", inplace=True)
Given this data (reflecting your own) with the string dates and int data in columns (not as index):
dates = ['2020-01-01', '2020-02-01', '2020-05-01', '2020-08-01',
'2020-01-01 00:00:00', '2020-02-01 00:00:00', '2020-03-01 00:00:00',
'2020-04-01 00:00:00']
data = [10,5,2,7,0,0,0,0]
df = pd.DataFrame({'dates':dates, 'data':data})
You can do the following:
df['dates'] = pd.to_datetime(df['dates']).dt.date #convert to datetime and get the date
df = df.groupby('dates').sum().sort_index() # groupby and sort index
Giving:
data
dates
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-08-01 7
You can replace .sum() with your favorite aggregation method. Also, if you want to impute the missing dates (as in your expected output), you can do:
months = pd.date_range(min(df.index), max(df.index), freq='MS').date
df = df.reindex(months).fillna(0)
Giving:
data
dates
2020-01-01 10.0
2020-02-01 5.0
2020-03-01 0.0
2020-04-01 0.0
2020-05-01 2.0
2020-06-01 0.0
2020-07-01 0.0
2020-08-01 7.0
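Alternatively, converting the index back to a DatetimeIndex lets asfreq do the month-start reindexing in one step (a sketch, assuming every existing date falls on the first of a month):
df.index = pd.to_datetime(df.index)   # DatetimeIndex of month starts
df = df.asfreq('MS', fill_value=0)    # insert the missing months, filled with 0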

Python Pandas: Trying to speed-up a per row per date in date_range operation

I have a dataframe of the following form where each row corresponds to a job run on a machine:
import pandas as pd
df = pd.DataFrame({
'MachineID': [4, 3, 2, 2, 1, 1, 5, 3],
'JobStartDate': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-01', '2020-01-03'],
'JobEndDate': ['2020-01-03', '2020-01-03', '2020-01-04', '2020-01-02', '2020-01-04', '2020-01-05', '2020-01-02', '2020-01-04'],
'IsTypeAJob': [1, 1, 0, 1, 0, 0, 1, 1]
})
df
>>> MachineID JobStartDate JobEndDate IsTypeAJob
0 4 2020-01-01 2020-01-03 1
1 3 2020-01-01 2020-01-03 1
2 2 2020-01-01 2020-01-04 0
3 2 2020-01-01 2020-01-02 1
4 1 2020-01-02 2020-01-04 0
5 1 2020-01-03 2020-01-05 0
6 5 2020-01-01 2020-01-02 1
7 3 2020-01-03 2020-01-04 1
In my data there are two types of jobs that can be run on a machine, either type A or type B. My goal is to count the number of type A and type B jobs per machine per day. Thus the desired result would look something like
MachineID Date TypeAJobs TypeBJobs
0 1 2020-01-02 0 1
1 1 2020-01-03 0 2
2 1 2020-01-04 0 2
3 1 2020-01-05 0 1
4 2 2020-01-01 1 1
5 2 2020-01-02 1 1
6 2 2020-01-03 0 1
7 2 2020-01-04 0 1
8 3 2020-01-01 1 0
9 3 2020-01-02 1 0
10 3 2020-01-03 2 0
11 3 2020-01-04 1 0
12 4 2020-01-01 1 0
13 4 2020-01-02 1 0
14 4 2020-01-03 1 0
15 5 2020-01-01 1 0
16 5 2020-01-02 1 0
I have tried approaches found here and here with a resample() and apply() method, but the computing time is too slow. This has to do with the fact that some date ranges span multiple years in my data, meaning one row can blow up into 2000+ new rows during resampling (my data contains around a million rows to begin with). Thus something like creating a new machine/date row for each date in the range of a certain job is too slow (with the goal of doing a groupby(['MachineID', 'Date']).sum() at the end).
I am currently thinking about a new approach where I begin by grouping by MachineID then finding the earliest job start date and latest job end date for that machine. Then I could create a date range of days between these two dates (incrementing by day) which I would use to index a new per machine data frame. Then for each job for that MachineID I could potentially sum over a range of dates, ie in pseudocode:
df['TypeAJobs'][row['JobStartDate']:row['JobEndDate']] += 1 if it is a type A job or
df['TypeBJobs'][row['JobStartDate']:row['JobEndDate']] += 1 otherwise.
This seems like it would avoid creating a bunch of extra rows for each job as now we are creating extra rows for each machine. Furthermore, the addition operations seem like they would be fast since we are adding to an entire slice of a series at once. However, I don't know if something like this (indexing by date) is possible in Pandas. Maybe there is some conversion that can be done first? After doing the above, ideally I would have a number of data frames similar to the desired result but only with one MachineID, then I would concatenate these data frames to get the result.
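For concreteness, here is a rough, untested sketch of the idea for a single machine (the names are just placeholders):
machine_jobs = df[df['MachineID'] == 1]
days = pd.date_range(machine_jobs['JobStartDate'].min(),
                     machine_jobs['JobEndDate'].max(), freq='D')
counts = pd.DataFrame(0, index=days, columns=['TypeAJobs', 'TypeBJobs'])
for _, job in machine_jobs.iterrows():
    col = 'TypeAJobs' if job['IsTypeAJob'] == 1 else 'TypeBJobs'
    # add 1 to every day in the job's date range (inclusive slice on the DatetimeIndex)
    counts.loc[job['JobStartDate']:job['JobEndDate'], col] += 1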
I would love to hear any suggestions about the feasibility/effectiveness of this approach or another potential algorithm. Thanks so much for reading!
IIUC, try using pd.date_range and explode to create 'daily' rows, then groupby dates and IsTypeAJob and rename columns:
df_out = df.assign(JobDates=df.apply(lambda x: pd.date_range(x['JobStartDate'],
                                                             x['JobEndDate'], freq='D'),
                                     axis=1))\
           .explode('JobDates')
df_out = df_out.groupby([df_out['MachineID'],
                         df_out['JobDates'].dt.floor('D'),
                         'IsTypeAJob'])['MachineID'].count()\
               .unstack()\
               .rename(columns={0: 'TypeBJobs', 1: 'TypeAJobs'})\
               .fillna(0).reset_index()
df_out
Output:
IsTypeAJob MachineID JobDates TypeBJobs TypeAJobs
0 1 2020-01-02 1.0 0.0
1 1 2020-01-03 2.0 0.0
2 1 2020-01-04 2.0 0.0
3 1 2020-01-05 1.0 0.0
4 2 2020-01-01 1.0 1.0
5 2 2020-01-02 1.0 1.0
6 2 2020-01-03 1.0 0.0
7 2 2020-01-04 1.0 0.0
8 3 2020-01-01 0.0 1.0
9 3 2020-01-02 0.0 1.0
10 3 2020-01-03 0.0 2.0
11 3 2020-01-04 0.0 1.0
12 4 2020-01-01 0.0 1.0
13 4 2020-01-02 0.0 1.0
14 4 2020-01-03 0.0 1.0
15 5 2020-01-01 0.0 1.0
16 5 2020-01-02 0.0 1.0
An alternative way to build the same per-day rows without explode:
pd.concat([pd.DataFrame({'JobDates': pd.date_range(r.JobStartDate, r.JobEndDate, freq='D'),
                         'MachineID': r.MachineID,
                         'IsTypeAJob': r.IsTypeAJob}) for i, r in df.iterrows()])
Here is another way to do the job. The idea is similar to using str.get_dummies on both the start and end columns, but done with array broadcasting: use cumsum to get 1 between start and end and 0 otherwise. Create a dataframe with the dates as columns and both MachineID and IsTypeAJob as the index, then do a similar operation to the answer from @Scott Boston to get the expected output shape.
import numpy as np

# get all possible dates
dr = pd.date_range(df['JobStartDate'].min(),
                   df['JobEndDate'].max()).strftime("%Y-%m-%d").to_numpy()
df_ = (pd.DataFrame(
           np.cumsum((df['JobStartDate'].to_numpy()[:, None] == dr).astype(int)
                     - np.pad(df['JobEndDate'].to_numpy()[:, None] == dr, ((0, 0), (1, 0)),
                              mode='constant')[:, :-1],  # pad is equivalent to a shift along columns
                     axis=1),
           index=pd.MultiIndex.from_frame(df[['MachineID', 'IsTypeAJob']]),
           columns=dr,)
       .sum(level=['MachineID', 'IsTypeAJob'])  # equivalent to groupby(['MachineID', 'IsTypeAJob']).sum()
       .replace(0, np.nan)  # to remove extra dates per original row during the stack
       .stack()
       .unstack(level='IsTypeAJob', fill_value=0)
       .astype(int)
       .reset_index()
       .rename_axis(columns=None)
       .rename(columns={'level_1': 'Date', 0: 'TypeBJobs', 1: 'TypeAJobs'})
)
and you get
MachineID Date TypeBJobs TypeAJobs
0 1 2020-01-02 1 0
1 1 2020-01-03 2 0
2 1 2020-01-04 2 0
3 1 2020-01-05 1 0
4 2 2020-01-01 1 1
5 2 2020-01-02 1 1
6 2 2020-01-03 1 0
7 2 2020-01-04 1 0
8 3 2020-01-01 0 1
9 3 2020-01-02 0 1
10 3 2020-01-03 0 2
11 3 2020-01-04 0 1
12 4 2020-01-01 0 1
13 4 2020-01-02 0 1
14 4 2020-01-03 0 1
15 5 2020-01-01 0 1
16 5 2020-01-02 0 1
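To make the cumsum trick concrete, here is a tiny standalone illustration (my own toy example, not part of the answer): a 1 at the start date minus a 1 the day after the end date, cumulatively summed, marks every active day:
import numpy as np
import pandas as pd
dr = pd.date_range('2020-01-01', '2020-01-05').strftime('%Y-%m-%d').to_numpy()
start = (np.array(['2020-01-02'])[:, None] == dr).astype(int)  # job starts on 01-02
end = (np.array(['2020-01-04'])[:, None] == dr).astype(int)    # job ends on 01-04
active = np.cumsum(start - np.pad(end, ((0, 0), (1, 0)), mode='constant')[:, :-1], axis=1)
print(active)  # [[0 1 1 1 0]] -> active on 01-02, 01-03 and 01-04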

Cumulative aggregate of unique string values

Here is what I have:
import pandas as pd
df = pd.DataFrame()
df['date'] = ['2020-01-01', '2020-01-01','2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03']
df['value'] = ['A', 'A', 'A', 'A', 'B', 'A', 'C']
df
date value
0 2020-01-01 A
1 2020-01-01 A
2 2020-01-01 A
3 2020-01-02 A
4 2020-01-02 B
5 2020-01-03 A
6 2020-01-03 C
I want to aggregate unique values over time like this:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 3
I am NOT looking for this as an answer:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 2
I need the 2020-01-03 to be 3 because there are three unique values (A,B,C).
We can aggregate each date's values to a list, take the cumulative sum of the lists, then map each to a set and take its length:
s = df.groupby('date').value.agg(list).cumsum().map(set).map(len)
s
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
Name: value, dtype: int64
Let's use pd.crosstab instead:
(pd.crosstab(df['date'], df['value']) !=0).cummax().sum(axis=1)
Output:
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
dtype: int64
Details:
First, reshape the dataframe so that 'date' is on the rows and the values are spread across the columns. Then check for non-zero cells and use cummax down each column to keep track of every "value" seen so far; finally, sum across each row to count how many distinct values have been seen at any point in time.
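Roughly, the intermediate stages look like this (a sketch of what each step produces on the sample data):
ct = pd.crosstab(df['date'], df['value'])  # rows: dates, columns: counts of A/B/C per date
seen = (ct != 0).cummax()                  # True once a value has appeared on or before that date
result = seen.sum(axis=1)                  # number of distinct values seen so far -> 1, 2, 3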
Another idea: np.cumsum over the mask of first-unique values, then .groupby the date (which in this case I have set as the index) and take either the maximum or the last value.
import numpy as np
(np.cumsum((~(df.set_index('date')).duplicated('value')))).groupby(level=0).max()
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
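Another compact option (my own sketch; it assumes every date introduces at least one previously unseen value, as in the sample, otherwise reindex on all dates and forward-fill afterwards):
out = df.drop_duplicates('value').groupby('date').size().cumsum()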
