I have the two DataFrames below:
lst=[['2021-01-01','A'],['2021-01-01','B'],['2021-02-01','A'],['2021-02-01','B'],['2021-03-01','A'],['2021-03-01','B']]
df1=pd.DataFrame(lst,columns=['Date','Pf'])
lst=[['2021-02-01','A','New']]
df22=pd.DataFrame(lst,columns=['Date','Pf','Status'])
I would like to merge them in order to obtain the df below:
lst=[['2021-01-01','A','NaN'],['2021-01-01','B','NaN'],['2021-02-01','A','New'],['2021-02-01','B','NaN'],['2021-03-01','A','New'],['2021-03-01','B','NaN']]
df3=pd.DataFrame(lst,columns=['Date','Pf','Status'])
For the period 2021-02-01 alone one could simply apply a merge. However, I would like the status "New" to carry forward: as soon as a Pf appears in df22, every row of df1 with the same Pf and a date equal to or greater than 2021-02-01 should get that status.
Do you have any idea how I could solve this question?
Thank you for your help
Use merge_asof with the default direction='backward':
df1['Date'] = pd.to_datetime(df1['Date'])
df22['Date'] = pd.to_datetime(df22['Date'])
df = pd.merge_asof(df1, df22, on='Date', by='Pf')
print (df)
Date Pf Status
0 2021-01-01 A NaN
1 2021-01-01 B NaN
2 2021-02-01 A New
3 2021-02-01 B NaN
4 2021-03-01 A New
5 2021-03-01 B NaN
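One caveat worth adding (a usage note, not part of the original answer): merge_asof requires both inputs to be sorted by the on key, so sort first if the Date columns are not already ordered (in this example they are). A minimal sketch reusing the names from the question:
# merge_asof requires both inputs sorted by the 'on' key
df1 = df1.sort_values('Date')
df22 = df22.sort_values('Date')
# direction='backward' (the default): each df1 row picks up the last df22 row
# with the same Pf and a Date less than or equal to its own
df = pd.merge_asof(df1, df22, on='Date', by='Pf', direction='backward')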
I have sales data like this as a DataFrame; the date columns have the pandas datetime64 dtype:
Shop ID   Special Offer Start   Special Offer End
A         '2022-01-01'          '2022-01-03'
B         '2022-01-09'          '2022-01-11'
etc.
I want to transform the data into a binary format that shows the date in one column and the special offer information as 0 or 1.
The resulting table should look like this:
Shop ID   Date           Special Offer?
A         '2022-01-01'   1
A         '2022-01-02'   1
A         '2022-01-03'   1
B         '2022-01-09'   1
B         '2022-01-10'   1
B         '2022-01-11'   1
I wrote a function which iterates over every row and creates a DataFrame containing a pandas date range and the special offer information; these DataFrames are then concatenated. As you can imagine, the code runs very slowly.
I was thinking of appending a "Special Offer?" column to the sales DataFrame and then joining it to a DataFrame containing all dates. Afterwards I could fill the NaN values with fillna (or drop them with dropna). But I couldn't find a function in pandas that lets me join on conditions.
See example below:
Shop ID   Special Offer Start   Special Offer End   Special Offer?
A         '2022-01-01'          '2022-01-03'        1
B         '2022-01-09'          '2022-01-11'        1
joined with the following (the join condition being: Date between Special Offer Start and Special Offer End):
Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'
creates:
Shop ID   Date           Special Offer?
A         '2022-01-01'   1
A         '2022-01-02'   1
A         '2022-01-03'   1
A         '2022-01-04'   NaN
A         '2022-01-05'   NaN
A         '2022-01-06'   NaN
A         '2022-01-07'   NaN
A         '2022-01-08'   NaN
A         '2022-01-09'   NaN
A         '2022-01-10'   NaN
A         '2022-01-11'   NaN
B         '2022-01-01'   NaN
B         '2022-01-02'   NaN
B         '2022-01-03'   NaN
B         '2022-01-04'   NaN
B         '2022-01-05'   NaN
B         '2022-01-06'   NaN
B         '2022-01-07'   NaN
B         '2022-01-08'   NaN
B         '2022-01-09'   1
B         '2022-01-10'   1
B         '2022-01-11'   1
EDIT: here is the code I've written:
new_list = []
for i, row in sales_df.iterrows():
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"],
                                    end=row["Special Offer End"]), columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)
result = pd.concat(new_list).reset_index(drop=True)
Update: The Shop ID column is missing.
You can use date_range to expand the dates:
# Setup minimal reproducible example
data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
{'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)
# Not mandatory if the columns already have datetime dtype
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])
# create full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')
date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1).explode('Offer')
.pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
.reindex(dti, fill_value=0).unstack().rename('Special Offer?').reset_index())
>>> out
Shop ID Date Special Offer?
0 A 2022-01-01 1
1 A 2022-01-02 1
2 A 2022-01-03 1
3 A 2022-01-04 0
4 A 2022-01-05 0
5 A 2022-01-06 0
6 A 2022-01-07 0
7 A 2022-01-08 0
8 A 2022-01-09 0
9 A 2022-01-10 0
10 A 2022-01-11 0
11 B 2022-01-01 0
12 B 2022-01-02 0
13 B 2022-01-03 0
14 B 2022-01-04 0
15 B 2022-01-05 0
16 B 2022-01-06 0
17 B 2022-01-07 0
18 B 2022-01-08 0
19 B 2022-01-09 1
20 B 2022-01-10 1
21 B 2022-01-11 1
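The conditional join described in the question can also be emulated with a cross merge followed by a boolean filter. This is only a rough sketch (the all_dates frame here is an assumption standing in for the date table above, and the cross product can be memory-heavy for large inputs):
import pandas as pd

# Hypothetical sketch: pair every shop with every date, then flag dates
# that fall inside the shop's offer window. Reuses df from the setup above.
all_dates = pd.DataFrame({'Date': pd.date_range('2022-01-01', '2022-01-11')})

out2 = df.merge(all_dates, how='cross')
out2['Special Offer?'] = ((out2['Date'] >= out2['Special Offer Start'])
                          & (out2['Date'] <= out2['Special Offer End'])).astype(int)
out2 = out2[['Shop ID', 'Date', 'Special Offer?']]
Here the off-window rows come out as 0 rather than NaN; filtering on out2['Special Offer?'] == 1 reproduces the compact form asked for.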
My DataFrame looks like this:
column1 column2
2020-11-01 1
2020-12-01 2
2021-01-01 3
NaT 4
NaT 5
NaT 6
Output should be like this:
column1 column2
2020-11-01 1
2020-12-01 2
2021-01-01 3
2021-02-01 4
2021-03-01 5
2021-04-01 6
I can't figure out how to create the next dates (with only the month and year changing) based on the last existing date in the df. Is there any pythonic way to do this? Thanks for any help!
Regards
Tomasz
This is how I would do it. You could probably tidy this up into more of a one-liner, but this will help illustrate the process a little more.
# convert to datetime
df['column1'] = pd.to_datetime(df['column1'], format='%Y-%m-%d')
# create a group for each missing section by forward-filling the last known date
df['temp'] = df['column1'].ffill()
# count the row within this group
df['temp2'] = df.groupby(['temp']).cumcount()
# add that many months to the last known date
df['column1'] = [x + pd.DateOffset(months=y) for x, y in zip(df['temp'], df['temp2'])]
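If the helper columns are not needed afterwards, they can be dropped (a small optional addition, not part of the original answer):
# remove the intermediate helper columns
df = df.drop(columns=['temp', 'temp2'])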
pandas has built-in support for time series data:
pd.date_range("2020-11-1", freq=pd.tseries.offsets.DateOffset(months=1), periods=10)
will give
DatetimeIndex(['2020-11-01', '2020-12-01', '2021-01-01', '2021-02-01',
'2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
'2021-07-01', '2021-08-01'],
dtype='datetime64[ns]', freq='<DateOffset: months=1>')
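Applied to the frame in the question, one rough way to use this (assuming the existing dates are consecutive month starts and the NaT rows come after them) is to regenerate the whole column from the first valid date:
import pandas as pd

# rebuild column1 as a monthly sequence starting from the first valid date
df['column1'] = pd.to_datetime(df['column1'])
start = df['column1'].dropna().iloc[0]
df['column1'] = pd.date_range(start,
                              freq=pd.tseries.offsets.DateOffset(months=1),
                              periods=len(df))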
I have a dataframe (df) with a date index, and I want to achieve the following:
1. Take the Dates index and add one month, e.g. nxt_dt = df.index + pd.DateOffset(months=1), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put the count into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, the df only contains dates/rows for which there are observations.
I can think of several ways to do this with loops, but since the data I work with is quite big, I would really like to avoid looping through rows to fill them.
Is there a way in pandas to get this done vectorized?
If you are OK with reindexing, this should do the job:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24', '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
df = df.reindex(pd.date_range('2020-01-01','2020-03-04')).fillna(0)
df = df.sort_index(ascending=False)
df['30d'] = df['value'].rolling(30).sum() - 1
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
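An alternative that avoids the reindex is to work on the sorted date index directly with searchsorted: for each date, locate the first entry that is >= that date plus one month and take the difference in positions. A hedged sketch, assuming df carries the sorted 'Dates' DatetimeIndex from the question (whether the current row itself should be counted is an assumption):
import numpy as np
import pandas as pd

curr_dt = df.index                          # sorted DatetimeIndex
nxt_dt = curr_dt + pd.DateOffset(months=1)  # one month later

# position of the first entry >= nxt_dt, minus the current position;
# subtract 1 if the current row should not be counted
pos_next = df.index.searchsorted(nxt_dt, side='left')
df['30d'] = pos_next - np.arange(len(df))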
I have this Dataframe:
[DataFrame image]
I applied df.groupby('site') to group the data by this feature.
grouped = Datos.groupby('site')
After grouping, I want to complete the "date" column day by day for all records.
The procedure I think I should follow is:
1. Generate a complete sequence between the start and end dates (step completed):
for site in grouped:
    dates = ['2018-01-01', '2020-01-17']
    startDate = datetime.datetime.strptime(dates[0], "%Y-%m-%d")  # parse first date
    endDate = datetime.datetime.strptime(dates[-1], "%Y-%m-%d")   # parse last date
    days = (endDate - startDate).days  # how many days between?
    allDates = {datetime.datetime.strftime(startDate + datetime.timedelta(days=k), "%Y-%m-%d"): 0
                for k in range(days + 1)}
2. Compare this sequence with the 'date' column of each group ('site') and add the dates that are not already present in 'date'.
3. Write a function or loop that updates the 'date' column with the new dates and also fills the missing values with 0.
(grouped.apply(add_days))
So far I have only managed to complete step 1, so I ask for your help to complete steps 2 and 3.
I would very much appreciate your help.
Regards
I had to do quite the same thing for a project. Maybe it's not the best solution for you, but it can help (and I hope it saves you the headache I had).
Here is how I managed it, with help from https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html:
df_DateRange = pd.DataFrame()
df_1 = pd.DataFrame()
grouped = pd.DataFrame()

# 1. Create a DataFrame with all days (your step 2):
dates_list = ['2019-12-31', '2020-01-05']
df_DateRange['date'] = pd.date_range(start=dates_list[0], end=dates_list[-1], freq='1D')
df_DateRange['date'] = df_DateRange['date'].dt.strftime('%Y-%m-%d')
df_DateRange.set_index(['date'], inplace=True)

# Set the index of your Datos DataFrame:
Datos.set_index(['date'], inplace=True)

# Join both DataFrames:
df_1 = df_DateRange.join(Datos)

# 2. Replace the NaN:
df_1['site'].fillna("", inplace=True)
df_1['value'].fillna(0, inplace=True)
df_1['value2'].fillna(0, inplace=True)

# 3. Do the calculation:
grouped = df_1.groupby('site').sum()
df_DateRange:
date
0 2019-12-31
1 2020-01-01
2 2020-01-02
3 2020-01-03
4 2020-01-04
5 2020-01-05
Datos:
date site value value2
0 2020-01-01 site1 1 -1
1 2020-01-01 site2 2 -2
2 2020-01-02 site1 10 -10
3 2020-01-02 site2 20 -20
df_1:
site value value2
date
2019-12-31 0.0 0.0
2020-01-01 site1 1.0 -1.0
2020-01-01 site2 2.0 -2.0
2020-01-02 site1 10.0 -10.0
2020-01-02 site2 20.0 -20.0
2020-01-03 0.0 0.0
2020-01-04 0.0 0.0
2020-01-05 0.0 0.0
grouped:
value value2
site
0.0 0.0
site1 11.0 -11.0
site2 22.0 -22.0
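A variant of the same idea that keeps one row per (site, day) instead of blanking the site column is to reindex every group against the full date range. A rough sketch, reusing the Datos frame and the date bounds from the example above (and assuming 'date' is still an ordinary column):
import pandas as pd

# one row per (site, date); days without observations are filled with 0
Datos['date'] = pd.to_datetime(Datos['date'])
full_range = pd.date_range('2019-12-31', '2020-01-05', freq='D')

completed = (Datos.set_index('date')
                  .groupby('site')[['value', 'value2']]
                  .apply(lambda g: g.reindex(full_range, fill_value=0)))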
I have a dataframe with IDs and timestamps as a MultiIndex. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
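For reference, applied to the original MultiIndexed df from the setup above this keeps both index levels intact, so the output should look roughly like this:
>>> df.groupby("IDs").tail(1)
                 value
IDs timestamp
0   2010-11-30       2
1   2010-01-01     400
2   2000-01-01      11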