I have two dataframes, df1 and df2. I need to construct an output that finds the nearest date in df2 to each date in df1, while simultaneously matching the ID value in both df1 and df2. The desired output df below illustrates what I am trying to explain.
df1:
ID Date
1 2020-01-01
2 2020-01-03
df2:
ID Date
11 2020-01-11
4 2020-02-03
5 2020-04-02
6 2020-01-05
1 2021-01-13
1 2021-03-03
1 2020-01-30
2 2020-03-31
2 2021-04-01
2 2021-02-02
df (Output desired)
ID Date Closest Date
1 2020-01-01 2020-01-30
2 2020-01-03 2020-03-31
Here's one way to achieve it, assuming that the Date columns' dtype is datetime. First,
df3 = df1[df1.ID.isin(df2.ID)].copy()  # .copy() avoids SettingWithCopyWarning on the assignment below
will give you
ID Date
0 1 2020-01-01
1 2 2020-01-03
Then
df3['Closest_date'] = df3.apply(
    lambda row: min(df2[df2.ID.eq(row.ID)].Date,
                    key=lambda x: abs(x - row.Date)),
    axis=1)
gets the min of df2.Date, where
df2[df2.ID.eq(row.ID)].Date selects the rows with a matching ID, and
key=lambda x: abs(x - row.Date) tells min to compare by distance from row.Date.
This has to be done row by row, hence axis=1.
Output:
ID Date Closest_date
0 1 2020-01-01 2020-01-30
1 2 2020-01-03 2020-03-31
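For larger frames, a vectorized alternative worth knowing is pd.merge_asof with by and direction='nearest'. A minimal sketch, assuming the Date columns are already datetime (out is just an illustrative name):
import pandas as pd

# Both frames must be sorted by their "on" keys; by='ID' restricts the match
# to rows sharing an ID, direction='nearest' picks the closest date either way.
out = pd.merge_asof(
    df1.sort_values('Date'),
    df2.rename(columns={'Date': 'Closest_date'}).sort_values('Closest_date'),
    left_on='Date', right_on='Closest_date',
    by='ID', direction='nearest')
Unlike the apply version, this scales roughly linearly after the sorts, and IDs present only in df1 come back with NaT rather than being dropped.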
Related
I have a data frame that looks like this:
date score type
2020-01-01 1 a
2020-04-01 0 a
2020-01-01 3 a
2020-04-01 2 a
2020-11-01 3 b
2019-12-01 4 b
2020-01-01 4 b
If I want to rescale the score column to the range 60 to 100 for each type and each date, I can easily do as follows:
df['score_rescaled'] = df.groupby(['date', 'type'])['score'].apply(lambda x: (40*(x-min(x)))/(max(x)-min(x)) + 60)
However, I would like to rescale the score column from 60 to 100 for each type and each date using not only the values for that single date, but the values for every date up to and including it:
so, for instance, for date 2020-01-01 I want to rescale the values at 2020-01-01 using all the scores from 2020-01-01 and 2019-12-01;
for date 2020-04-01 I want to rescale the scores at 2020-04-01 using the scores from 2019-12-01, 2020-01-01 and 2020-04-01; and so on.
How can I do this?
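One possible sketch, not from the original thread: compute per-(type, date) extremes, turn them into running extremes within each type, merge them back, and rescale. Note that a group where max equals min produces a zero denominator.
import pandas as pd

df = df.sort_values('date')

# per-(type, date) extremes, then running extremes over all dates so far
stats = df.groupby(['type', 'date'])['score'].agg(['min', 'max'])
stats['min'] = stats.groupby(level='type')['min'].cummin()
stats['max'] = stats.groupby(level='type')['max'].cummax()

# attach the expanding extremes back to each row and rescale to [60, 100]
df = df.merge(stats, left_on=['type', 'date'], right_index=True)
df['score_rescaled'] = 40 * (df['score'] - df['min']) / (df['max'] - df['min']) + 60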
I have been given two dataframes.
Dataframe 1:
Date_DF1    Event       Event2           Event3
2021-01-01  NaN         PandemicHoliday  NaN
2021-02-01  NaN         PandemicHoliday  NaN
2021-03-01  NaN         PandemicHoliday  NaN
2021-04-02  SpecialDay  NaN              NaN
2021-02-14  SpecialDay  PandemicHoliday  NaN
The first dataframe comes from a .csv file that includes all holidays between 2017 and 2021. The Date column is in datetime format. If there is more than one holiday on the same day, a holiday name is written in each of the Event, Event2 and Event3 columns. These columns contain the values SpecialDay, PandemicHoliday and NationalHoliday (3 types of holiday).
Dataframe 2:
Date_DF2    OrderTotal  OrderID
2021-01-01  68.5        31002
2021-01-01  56.5        31003
2021-01-01  98.5        31004
2021-01-02  78.5        31005
The second dataframe contains the daily order frequency. Its Date column is also in datetime format.
Not all dates in df2 exist in df1.
I want to add the Event, Event2 and Event3 information from the first table to the second table. The second table can contain more than one row for the same date. Each holiday type should be added to the second table as a column. How can I do this in Python? The result table should look like this:
Date        OrderTotal  OrderID  SpecialDay  PandemicHoliday  NationalHoliday
2021-01-01  68.5        31002    0           1                0
2021-01-01  56.5        31003    0           1                0
2021-01-01  98.5        31004    0           1                0
2021-01-02  78.5        31005    1           0                0
You can one-hot-encode df1 with pd.get_dummies, then merge:
df2.merge(
    pd.get_dummies(df1.set_index('Date_DF1').stack())
      .groupby(level=0).sum(),
    left_on='Date_DF2',
    right_index=True,
    how='left').fillna(0)
Output:
Date_DF2 OrderTotal OrderID PandemicHoliday SpecialDay
0 2021-01-01 68.5 31002 1 0
1 2021-01-01 56.5 31003 1 0
2 2021-01-01 98.5 31004 1 0
3 2021-01-02 78.5 31005 0 1
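Note the sample df1 contains no NationalHoliday, so that column is absent above. A hedged tweak to guarantee all three columns (the column list is taken from the question) and integer dtype:
holiday_cols = ['SpecialDay', 'PandemicHoliday', 'NationalHoliday']

# one-hot-encode, collapse per date, and force all three holiday columns
dummies = (pd.get_dummies(df1.set_index('Date_DF1').stack())
             .groupby(level=0).sum()
             .reindex(columns=holiday_cols, fill_value=0))

out = df2.merge(dummies, left_on='Date_DF2', right_index=True, how='left')
out[holiday_cols] = out[holiday_cols].fillna(0).astype(int)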
I have a dataframe that looks like this:
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-01-01 00:00:00 0
2020-02-01 00:00:00 0
2020-03-01 00:00:00 0
2020-04-01 00:00:00 0
I want to remove the time from the index and combine rows where the dates are the same. The end result will look like:
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-06-01 0
2020-07-01 0
2020-08-01 7
etc, etc
Change the index data type and filter with .duplicated:
df.index = pd.to_datetime(df.index)
df = df[~df.index.duplicated(keep='first')]
df
Out[1]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
If you want to sum them together rather than get rid of the duplicates, then use:
df.index = pd.to_datetime(df.index)
df = df.groupby(level=0, sort=False).sum()
df
Out[2]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
If the index content is in string format, you can simply slice:
df.reset_index(inplace=True)  # assume the column name is 'date'
df["date"] = df["date"].str[:10]  # 'YYYY-MM-DD' is 10 characters, dropping the time part
df.set_index("date", inplace=True)
If it is in date format:
df.reset_index(inplace=True)
df['date'] = pd.to_datetime(df['date']).dt.date
df.set_index("date", inplace=True)
Given this data (reflecting your own) with the string dates and int data in columns (not as index):
dates = ['2020-01-01', '2020-02-01', '2020-05-01', '2020-08-01',
'2020-01-01 00:00:00', '2020-02-01 00:00:00', '2020-03-01 00:00:00',
'2020-04-01 00:00:00']
data = [10,5,2,7,0,0,0,0]
df = pd.DataFrame({'dates':dates, 'data':data})
You can do the following:
df['dates'] = pd.to_datetime(df['dates']).dt.date #convert to datetime and get the date
df = df.groupby('dates').sum().sort_index() # groupby and sort index
Giving:
data
dates
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-08-01 7
You can replace .sum() with your favorite aggregation method. Also, if you want to impute the missing dates (as in your expected output), you can do:
months = pd.date_range(min(df.index), max(df.index), freq='MS').date
df = df.reindex(months).fillna(0)
Giving:
data
dates
2020-01-01 10.0
2020-02-01 5.0
2020-03-01 0.0
2020-04-01 0.0
2020-05-01 2.0
2020-06-01 0.0
2020-07-01 0.0
2020-08-01 7.0
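An alternative sketch, starting again from the raw frame defined above: once the index is datetime, resample can do both the aggregation and the month-filling in one pass (freq='MS' assumes month-start dates, as in the sample; out is just an illustrative name):
df['dates'] = pd.to_datetime(df['dates'])

# resample sums duplicate dates and inserts the missing months as 0
out = df.set_index('dates').resample('MS').sum()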
Here is what I have:
import pandas as pd
df = pd.DataFrame()
df['date'] = ['2020-01-01', '2020-01-01','2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03']
df['value'] = ['A', 'A', 'A', 'A', 'B', 'A', 'C']
df
date value
0 2020-01-01 A
1 2020-01-01 A
2 2020-01-01 A
3 2020-01-02 A
4 2020-01-02 B
5 2020-01-03 A
6 2020-01-03 C
I want to aggregate unique values over time like this:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 3
I am NOT looking for this as an answer:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 2
I need the 2020-01-03 to be 3 because there are three unique values (A,B,C).
We can aggregate the values to lists, take a cumulative sum (which concatenates the lists), then count the distinct elements:
s = df.groupby('date').value.agg(list).cumsum().map(set).map(len)
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
Name: value, dtype: int64
Let's use pd.crosstab instead:
(pd.crosstab(df['date'], df['value']) !=0).cummax().sum(axis=1)
Output:
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
dtype: int64
Details:
First, reshape the dataframe so that 'date' forms the rows and the distinct values form the columns. Then check for non-zero cells, use cummax down each column to keep track of every value seen so far, and finally sum across rows to count how many distinct values have been seen at each point in time.
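To make the intermediate step concrete, this is the boolean table that cummax and sum operate on:
seen = (pd.crosstab(df['date'], df['value']) != 0).cummax()
print(seen)
# value           A      B      C
# date
# 2020-01-01   True  False  False
# 2020-01-02   True   True  False
# 2020-01-03   True   True   True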
The idea: np.cumsum the first-occurrence flags of the unique values, then .groupby the date (which here has been set as the index) and take the maximum (or last) value.
import numpy as np
np.cumsum(~df.set_index('date').duplicated('value')).groupby(level=0).max()
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
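Another compact sketch, assuming the frame is sorted by date: flag the first overall occurrence of each value, count the new values per date, and take a running total (out is just an illustrative name):
# ~duplicated() is True only the first time a value appears anywhere
out = (~df['value'].duplicated()).groupby(df['date']).sum().cumsum()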
I have a dataframe with a datetime64[ns] column containing data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also in a datetime64[ns] column, but containing only dates (no time component):
Dates
2020-02-28
2020-02-29
....
How can I delete from the first dataframe df all days which occur in the second dataframe Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 00:00, so it is possible to filter by Series.isin with an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
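Equivalently, Series.dt.normalize sets the time component to midnight, so it can stand in for floor('d') here:
# normalize() keeps the dtype datetime64[ns] but zeroes out the time part
df = df[~df['Datum'].dt.normalize().isin(df1['Dates'])]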
Taking your added comment into consideration:
Build a regex string of the Dates in df1:
c = "|".join(df1.Dates.astype(str))
c
Coerce Datum to datetime:
df['Datum'] = pd.to_datetime(df['Datum'])
df.dtypes
Extract Datum as Dates, dtype string:
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean-select the dates that appear in both:
m = df.Dates.str.contains(c)
m
Mark the closing dates as 0 and the dates to keep as 1:
import numpy as np
df['drop'] = np.where(m, 0, 1)
df
Keep only the wanted rows and drop the helper columns:
df = df[df['drop'].eq(1)].reset_index(drop=True).drop(columns=['Dates', 'drop'])
Outcome:
                Datum  Values
0 2020-01-01 00:00:00       1
1 2020-01-01 01:00:00      10
2 2020-03-01 00:00:00       4