I have one Pandas dataframe like below. I used pd.to_datetime(df['date']).dt.normalize() to get the date2 column to show just the date and ignore the time. Wasn't sure how to have it be just YYYY-MM-DD format.
date2 count compound_mean
0 2021-01-01 00:00:00+00:00 18 0.188411
1 2021-01-02 00:00:00+00:00 9 0.470400
2 2021-01-03 00:00:00+00:00 10 0.008190
3 2021-01-04 00:00:00+00:00 58 0.187510
4 2021-01-05 00:00:00+00:00 150 0.176173
Another dataframe with the following format.
Date Average
2021-01-04 18.200001
2021-01-05 22.080000
2021-01-06 22.250000
2021-01-07 22.260000
2021-01-08 21.629999
I want to have the Average column show up in the first dataframe by matching the dates and then forward-filling any blank values. From 01-01 to 01-03 there will be nothing to forward fill, so I guess it will end up being zero. I'm having trouble finding the right Pandas functions to do this, looking for some guidance. Thank you.
I assume your first dataframe to be df1 and second dataframe to be df2.
Firstly, you need to change the name of the date2 column of df1 to Date so that it matches with your Date column of df2.
df1['Date'] = pd.to_datetime(df1['date2']).dt.date
You can then remove the date2 column of df1 as
df1.drop("date2",inplace=True, axis=1)
You also need to change the column type of Date of df2 so that it matches with type of df1's Date column
df2['Date'] = pd.to_datetime(df2['Date']).dt.date
Then make a new dataframe which will contain both dataframe columns based on Date column.
main_df = pd.merge(df1,df2,on="Date", how="left")
df1['Average'] = main_df['Average']
df1 = pd.DataFrame(df1, columns = ['Date', 'count','compound_mean','Average'])
You can then fill the null values by ffill and also the first 3 null values by 0
df1.fillna(method='ffill', inplace=True)
df1.fillna(0, inplace=True)
Your first dataframe will look what you wanted
Try the following:
>>> df.index = pd.to_datetime(df.date2).dt.date
# If df.date2 is already datetime, use ^ df.index = df.date2.dt.date
>>> df2['Date'] = pd.to_datetime(df2['Date'])
# If df2['Date'] is already datetime, ^ this above line is not needed
>>> df.join(df2.set_index('Date')).fillna(0)
date2 count compound_mean Average
date2
2021-01-01 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
2021-01-02 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2021-01-03 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
2021-01-04 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
2021-01-05 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
You can perform merge operation as follows:
#Making date of same UTC format from both tables
df1['date2'] = pd.to_datetime(df1['date2'],utc = True)
df2['Date'] = pd.to_datetime(df2['Date'],utc = True)
#Renaming df1 column so that we can map 'Date' from both dataframes
df1.rename(columns={'date2': 'Date'},inplace=True)
#Merge operation
res = pd.merge(df1,df2,on='Date',how='left').fillna(0)
Output:
Date count compound_mean Average
0 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
1 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
3 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
4 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
Related
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a day by row and hour by column format. The hours are the columns in this case. and the value column is now filled as cell values.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot table but no success
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can change None by any string to rename axis.
I have given two dataframes.
Dataframe 1:
Date_DF1
Event
Event2
Event3
2021-01-01
Nan
PandemicHoliday
NaN
2021-02-01
Nan
PandemicHoliday
NaN
2021-03-01
Nan
PandemicHoliday
NaN
2021-04-02
SpecialDay
NaN
NaN
2021-14-02
SpecialDay
PandemicHoliday
NaN
The first dataframe is a .csv file that includes all holidays between 2017-2021 years. Date column is datetime format. If there is more than one holiday on the same day, the name of the holiday is written in all of the Event, Event1 and Event2 columns. Event, Event1 and Event2 columns include SpecialDay, PandemicHoliday, NationalHoliday values (3 types of holiday).
Dataframe 2:
Date_DF2
OrderTotal
OrderID
2021-01-01
68.5
31002
2021-01-01
56.5
31003
2021-01-01
98.5
31004
2021-01-02
78.5
31005
The second dataframe contains the daily order frequency. Date columns is datetime format.
Not all dates in df2 exist in df1.
I want to add the Event, Event1 and Event2 columns in the first table to the second table. The second table contains more than one column from the same date. Each holiday will be added to the second table as a column. How can I do this in python? Result table will look like this:
Date
OrderTotal
OrderID
SpecialDay
PandemicHoliday
NationalHoliday
2021-01-01
68.5
31002
0
1
0
2021-01-01
68.5
31003
0
1
0
2021-01-01
68.5
31004
0
1
0
2021-01-02
78.5
31005
1
0
0
You can one-hot-encode df1 with pd.get_dummies, then merge:
df2.merge(
pd.get_dummies(df1.set_index('Date_DF1').stack()).sum(level=0),
left_on='Date_DF2',
right_index=True,
how='left').fillna(0)
Output:
Date_DF2 OrderTotal OrderID PandemicHoliday SpecialDay
0 2021-01-01 68.5 31002 1 0
1 2021-01-01 56.5 31003 1 0
2 2021-01-01 98.5 31004 1 0
3 2021-01-02 78.5 31005 0 1
Hello guys i want to aggregate(sum) to by week. Lets notice that the column date is per day.
Does anyone knwos how to do?
Thank you
kindly,
df = pd.DataFrame({'date':['2021-01-01','2021-01-02',
'2021-01-03','2021-01-04','2021-01-05',
'2021-01-06','2021-01-07','2021-01-08',
'2021-01-09'],
'revenue':[5,3,2,
10,12,2,
1,0,6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04. 31
Use DataFrame.resample with aggregate sum, but is necessary change default close to left with label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-Mon', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
i have a dataframe that looks like this
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-01-01 00:00:00 0
2020-02-01 00:00:00 0
2020-03-01 00:00:00 0
2020-04-01 00:00:00 0
I want to remove the time of the index and combine where the dates may be the same the end result will look like
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-06-01 0
2020-07-01 0
2020-08-01 7
etc, etc
change the index data type and filter with .duplicated:
df.index = pd.to_datetime(df.index)
df = df[~df.index.duplicated(keep='first')]
df
Out[1]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
If you want to sum them together rather than get rid of the duplicate, then use:
df.index = pd.to_datetime(df.index)
df = df.sum(level=0)
df
Out[2]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
if the index content is in string format u can simply slice
df.reset_index(inplace=True)#consider column name to be date
df["date"]=df["date"].str[:11]#till time index
df.set_index( "date",inplace=True)
if it is in date format:
df.reset_index(inplace=True)
df['date'] = pd.to_datetime(df['date']).dt.date
df.set_index( "date",inplace=True)
Given this data (reflecting your own) with the string dates and int data in columns (not as index):
dates = ['2020-01-01', '2020-02-01', '2020-05-01', '2020-08-01',
'2020-01-01 00:00:00', '2020-02-01 00:00:00', '2020-03-01 00:00:00',
'2020-04-01 00:00:00']
data = [10,5,2,7,0,0,0,0]
df = pd.DataFrame({'dates':dates, 'data':data})
You can do the following:
df['dates'] = pd.to_datetime(df['dates']).dt.date #convert to datetime and get the date
df = df.groupby('dates').sum().sort_index() # groupby and sort index
Giving:
data
dates
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-08-01 7
You can replace .sum() with your favorite aggregation method. Also, if you want to impute the missing dates (as in your expected output), you can do:
months = pd.date_range(min(df.index), max(df.index), freq='MS').date
df = df.reindex(months).fillna(0)
Giving:
data
dates
2020-01-01 10.0
2020-02-01 5.0
2020-03-01 0.0
2020-04-01 0.0
2020-05-01 2.0
2020-06-01 0.0
2020-07-01 0.0
2020-08-01 7.0
I have a dataframe with a datetime64[ns] object which has the format, so there I have data per hourly base:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also in a datetime64[ns] column with the format, so there I only have a dayformat:
Dates
2020-02-28
2020-02-29
....
How can I delete all days in the first dataframe df, which occure in the second dataframe Dates? So that df is:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor for set times to 0, so possible filter by Series.isin with inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For flag column convert mask to integers by Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
Putting you aded comment into consideration
string of the Dates in df1
c="|".join(df1.Dates.values)
c
Coerce Datum to datetime
df['Datum']=pd.to_datetime(df['Datum'])
df.dtypes
Extract Datum as Dates ,dtype string
df.set_index(df['Datum'],inplace=True)
df['Dates']=df.index.date.astype(str)
Boolean select date ins in both
m=df.Dates.str.contains(c)
m
Mark inclusive dates as 0 and exclusive as 1
df['drop']=np.where(m,0,1)
df
Drop unwanted rows
df.reset_index(drop=True).drop(columns=['Dates'])
Outcome