I have two dataframes.
Dataframe 1:
Date_DF1    Event       Event1           Event2
2021-01-01  NaN         PandemicHoliday  NaN
2021-02-01  NaN         PandemicHoliday  NaN
2021-03-01  NaN         PandemicHoliday  NaN
2021-04-02  SpecialDay  NaN              NaN
2021-14-02  SpecialDay  PandemicHoliday  NaN
The first dataframe comes from a .csv file that includes all holidays between 2017 and 2021. The Date column is in datetime format. If there is more than one holiday on the same day, the holiday names are written across the Event, Event1 and Event2 columns. These columns contain the values SpecialDay, PandemicHoliday and NationalHoliday (3 types of holiday).
Dataframe 2:
Date_DF2    OrderTotal  OrderID
2021-01-01        68.5    31002
2021-01-01        56.5    31003
2021-01-01        98.5    31004
2021-01-02        78.5    31005
The second dataframe contains the daily orders. Its Date column is also in datetime format.
Not all dates in df2 exist in df1.
I want to add the Event, Event1 and Event2 columns from the first table to the second table. The second table contains more than one row for the same date. Each holiday type should be added to the second table as a column. How can I do this in Python? The result table should look like this:
Date        OrderTotal  OrderID  SpecialDay  PandemicHoliday  NationalHoliday
2021-01-01        68.5    31002           0                1                0
2021-01-01        56.5    31003           0                1                0
2021-01-01        98.5    31004           0                1                0
2021-01-02        78.5    31005           1                0                0
You can one-hot-encode df1 with pd.get_dummies, then merge:
df2.merge(
    pd.get_dummies(df1.set_index('Date_DF1').stack())
      .groupby(level=0).sum(),  # .sum(level=0) was deprecated and removed in pandas 2.0
    left_on='Date_DF2',
    right_index=True,
    how='left').fillna(0)
Output:
Date_DF2 OrderTotal OrderID PandemicHoliday SpecialDay
0 2021-01-01 68.5 31002 1 0
1 2021-01-01 56.5 31003 1 0
2 2021-01-01 98.5 31004 1 0
3 2021-01-02 78.5 31005 0 1
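Note that pd.get_dummies only creates columns for values that actually occur in the data, which is why NationalHoliday is missing from the output above. If you want all three holiday columns regardless of the sample, you can reindex the dummies first; a sketch (holiday_cols is just a name chosen here):
holiday_cols = ['SpecialDay', 'PandemicHoliday', 'NationalHoliday']
dummies = (pd.get_dummies(df1.set_index('Date_DF1').stack())
             .groupby(level=0).sum()
             .reindex(columns=holiday_cols, fill_value=0))
res = df2.merge(dummies, left_on='Date_DF2', right_index=True, how='left').fillna(0)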
I have a pandas dataframe with a date column and an id column. For each line, I would like to return the number of occurrences of that line's id in the 14 days up to and including that line's date. For the data below, that means I would like to return "1, 2, 1, 2, 3, 4, 1". How can I do this? Performance is important, since the dataframe has a length of around 200,000 rows. Thanks!
date        id
2021-01-01   1
2021-01-04   1
2021-01-05   2
2021-01-06   2
2021-01-07   1
2021-01-08   1
2021-01-28   1
Assuming the input is sorted by date, you can use a GroupBy.rolling approach:
# only required if date is not datetime type
df['date'] = pd.to_datetime(df['date'])
(df.assign(count=1)
   .set_index('date')
   .groupby('id')
   .rolling('14d')['count'].sum()
   .sort_index(level='date').reset_index()  # optional if order is not important
)
Output:
id date count
0 1 2021-01-01 1.0
1 1 2021-01-04 2.0
2 2 2021-01-05 1.0
3 2 2021-01-06 2.0
4 1 2021-01-07 3.0
5 1 2021-01-08 4.0
6 1 2021-01-28 1.0
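Note that the time-based rolling('14d') window requires the dates to be monotonically increasing within each group; if your frame is not already ordered by date, sort it first, e.g.:
df = df.sort_values('date')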
I am not sure whether this is the best idea or not, but the code below is what I have come up with:
from datetime import timedelta

df["date"] = pd.to_datetime(df["date"])

newColumn = []
for index, row in df.iterrows():
    endDate = row["date"]
    startDate = endDate - timedelta(days=14)
    rowId = row["id"]  # renamed so the builtin id() is not shadowed
    mask = (df["date"] >= startDate) & (df["date"] <= endDate) & (df["id"] == rowId)
    summation = df[mask]["id"].count()
    newColumn.append(summation)

df["check_column"] = newColumn
df
Output

                 date  id  check_column
0 2021-01-01 00:00:00   1             1
1 2021-01-04 00:00:00   1             2
2 2021-01-05 00:00:00   2             1
3 2021-01-06 00:00:00   2             2
4 2021-01-07 00:00:00   1             3
5 2021-01-08 00:00:00   1             4
6 2021-01-28 00:00:00   1             1
Explanation
In this approach, I have used iterrows to loop over the dataframe's rows, and timedelta to subtract 14 days from each row's date. Note that this filters the entire dataframe once per row, so it scales quadratically with the number of rows; for the ~200,000 rows mentioned in the question, the GroupBy.rolling approach above should be considerably faster.
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
...
2022-02-15 21:00:00 8
I would like to convert it into a day-per-row, hour-per-column format. The hours become the columns in this case, and the Value column fills the cell values.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but with no success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table(values='Value', index=df['Date'].dt.date, columns=df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis.
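If you also want the column labels formatted as 10:00 rather than 10:00:00, as in the desired output, one option (assuming the columns are the datetime.time objects produced by .dt.time above) is:
out.columns = [t.strftime('%H:%M') for t in out.columns]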
I have two dataframes, df1 and df2. I need to construct an output that, for each row of df1, finds the nearest date in df2 while simultaneously matching the ID value in both df1 and df2. The desired output df shown below illustrates what I have tried to explain above!
df1:
ID Date
1 2020-01-01
2 2020-01-03
df2:
ID Date
11 2020-01-11
4 2020-02-03
5 2020-04-02
6 2020-01-05
1 2021-01-13
1 2021-03-03
1 2020-01-30
2 2020-03-31
2 2021-04-01
2 2021-02-02
df (Output desired)
ID Date Closest Date
1 2020-01-01 2020-01-30
2 2020-01-03 2020-03-31
Here's one way to achieve it – assuming that the Date columns' dtype is datetime: First,
df3 = df1[df1.ID.isin(df2.ID)].copy()  # .copy() avoids a SettingWithCopyWarning on the assignment below
will give you
ID Date
0 1 2020-01-01
1 2 2020-01-03
Then
df3['Closest_date'] = df3.apply(lambda row: min(df2[df2.ID.eq(row.ID)].Date,
                                                key=lambda x: abs(x - row.Date)),
                                axis=1)
gets the min of df2.Date, where
df2[df2.ID.eq(row.ID)].Date selects the rows that have the matching ID, and
key=lambda x: abs(x - row.Date) tells min to compare by distance from row.Date,
which has to be done row by row, hence axis=1.
Output:
ID Date Closest_date
0 1 2020-01-01 2020-01-30
1 2 2020-01-03 2020-03-31
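As a vectorized alternative, under the same datetime assumption, pd.merge_asof with by='ID' and direction='nearest' performs this nearest-date match directly; a sketch (Closest_Date is a helper copy of df2's Date so the matched date survives the merge, and merge_asof requires both frames sorted by the on key):
out = pd.merge_asof(
    df1.sort_values('Date'),
    df2.assign(Closest_Date=df2['Date']).sort_values('Date'),
    on='Date',
    by='ID',
    direction='nearest')
Unlike the isin filter above, IDs of df1 that have no match in df2 are kept with NaT in Closest_Date rather than dropped.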
I have two dataframes:
daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01",end="2021-04-29")})
pc21 = pd.DataFrame({'Date': ["2021-01-21", "2021-03-11", "2021-04-22"]})
pc21['Date'] = pd.to_datetime(pc21['Date'])
What I want to do is the following: for every date in pc21 that also appears in daily, I want a new column that equals 1 for the 8 days starting on that date (inclusive) and 0 otherwise.
This is an example of a desired output:
# 2021-01-21 is in both dataframes, so I want a new column in 'daily' that looks like this:
Date newcol
.
.
.
2021-01-20 0
2021-01-21 1
2021-01-22 1
2021-01-23 1
2021-01-24 1
2021-01-25 1
2021-01-26 1
2021-01-27 1
2021-01-28 1
2021-01-29 0
.
.
.
Can anyone help me achieve this?
Thanks!
You can try the following approach:
res = (daily
       .merge(pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
                         for d in pc21["Date"]]),
              how="left", indicator=True)
       .replace({"both": 1, "left_only": 0})
       .rename(columns={"_merge": "newcol"}))
Result:
In [15]: res
Out[15]:
Date newcol
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 0
4 2021-01-05 0
.. ... ...
114 2021-04-25 1
115 2021-04-26 1
116 2021-04-27 1
117 2021-04-28 1
118 2021-04-29 1
[119 rows x 2 columns]
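One caveat: the _merge column created by indicator=True is categorical, so depending on your pandas version newcol may still be categorical after the replace; you can cast it explicitly, e.g.:
res["newcol"] = res["newcol"].astype(int)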
daily['value'] = 0
pc21['value'] = 1
daily = (pd.merge(daily, pc21, on='Date', how='left')
           .rename(columns={'value_y': 'value'})
           .drop(columns='value_x')  # drop('value_x', 1) relies on a positional axis removed in pandas 2.0
           .ffill(limit=7)           # fillna(method="ffill") is deprecated in recent pandas
           .fillna(0))
pc21 = pc21.drop(columns='value')    # drop the helper column from pc21 again
Output Subset
daily.query('value == 1')
Date value
20 2021-01-21 1.0
21 2021-01-22 1.0
22 2021-01-23 1.0
23 2021-01-24 1.0
24 2021-01-25 1.0
25 2021-01-26 1.0
26 2021-01-27 1.0
27 2021-01-28 1.0
69 2021-03-11 1.0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].fillna(method="ffill", limit=7).fillna(0)
We generate the new column first: if the Date of daily is in the Date of pc21, put a 1, otherwise put a NaN.
Then forward fill that column, but with a limit of 7, so that we get 8 consecutive 1s.
Lastly, fill the remaining NaNs with 0.
(You can put an astype(int) at the end to have integers; see the sketch below.)
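Putting the last two steps together with that cast, for example:
daily["new_col"] = daily["new_col"].ffill(limit=7).fillna(0).astype(int)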
I have one Pandas dataframe like below. I used pd.to_datetime(df['date']).dt.normalize() to get the date2 column to show just the date and ignore the time, but I wasn't sure how to have it be just YYYY-MM-DD format.
date2 count compound_mean
0 2021-01-01 00:00:00+00:00 18 0.188411
1 2021-01-02 00:00:00+00:00 9 0.470400
2 2021-01-03 00:00:00+00:00 10 0.008190
3 2021-01-04 00:00:00+00:00 58 0.187510
4 2021-01-05 00:00:00+00:00 150 0.176173
Another dataframe with the following format.
Date Average
2021-01-04 18.200001
2021-01-05 22.080000
2021-01-06 22.250000
2021-01-07 22.260000
2021-01-08 21.629999
I want to have the Average column show up in the first dataframe by matching the dates and then forward-filling any blank values. From 01-01 to 01-03 there will be nothing to forward-fill, so I guess those will end up being zero. I'm having trouble finding the right Pandas functions to do this and am looking for some guidance. Thank you.
I assume your first dataframe to be df1 and second dataframe to be df2.
Firstly, you need to create a Date column in df1 from its date2 column, so that it matches the Date column of df2.
df1['Date'] = pd.to_datetime(df1['date2']).dt.date
You can then remove the date2 column of df1:
df1.drop(columns="date2", inplace=True)
You also need to change the type of the Date column of df2 so that it matches the type of df1's Date column:
df2['Date'] = pd.to_datetime(df2['Date']).dt.date
Then make a new dataframe that contains the columns of both dataframes, merged on the Date column.
main_df = pd.merge(df1,df2,on="Date", how="left")
df1['Average'] = main_df['Average']
df1 = df1[['Date', 'count', 'compound_mean', 'Average']]
You can then forward-fill the null values, and fill the leading (first 3) null values with 0 (fillna(method='ffill') is deprecated in recent pandas, so use ffill):
df1.ffill(inplace=True)
df1.fillna(0, inplace=True)
Your first dataframe will then look the way you wanted.
Try the following:
>>> df.index = pd.to_datetime(df.date2).dt.date
# If df.date2 is already datetime, use ^ df.index = df.date2.dt.date
>>> df2['Date'] = pd.to_datetime(df2['Date'])
# If df2['Date'] is already datetime, ^ this above line is not needed
>>> df.join(df2.set_index('Date')).fillna(0)
date2 count compound_mean Average
date2
2021-01-01 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
2021-01-02 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2021-01-03 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
2021-01-04 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
2021-01-05 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
You can perform a merge operation as follows:
# Make the dates the same UTC datetime format in both tables
df1['date2'] = pd.to_datetime(df1['date2'], utc=True)
df2['Date'] = pd.to_datetime(df2['Date'], utc=True)
# Rename the df1 column so that we can merge both dataframes on 'Date'
df1.rename(columns={'date2': 'Date'}, inplace=True)
# Merge operation
res = pd.merge(df1, df2, on='Date', how='left').fillna(0)
Output:
Date count compound_mean Average
0 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
1 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
3 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
4 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
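Note that this fills every missing Average with 0; since the question asks for forward-filling, you would likely want to ffill first and only fill the leading gaps with 0, e.g.:
res['Average'] = res['Average'].ffill().fillna(0)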