I have 2 DataFrames that currently look like this:
raw_data = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15','2017-03-16','2017-03-17']}
import pandas as pd
df1 = pd.DataFrame(raw_data,columns=['SeriesDate'])
df1['SeriesDate'] = pd.to_datetime(df1['SeriesDate'])
print(df1)
SeriesDate
0 2017-03-10
1 2017-03-13
2 2017-03-14
3 2017-03-15
4 2017-03-16
5 2017-03-17
raw_data2 = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15','2017-03-16'],'NewSeriesDate':['2017-03-11','2017-03-12','2017-03-13','2017-03-14','2017-03-14']}
df2 = pd.DataFrame(raw_data2,columns=['SeriesDate','NewSeriesDate'])
df2['SeriesDate'] = pd.to_datetime(df2['SeriesDate'])
print(df2)
SeriesDate NewSeriesDate
0 2017-03-10 2017-03-11
1 2017-03-13 2017-03-12
2 2017-03-14 2017-03-13
3 2017-03-15 2017-03-14
4 2017-03-16 2017-03-14
1) I would like to join the DataFrames so that, for every 'SeriesDate' in df1 before 15th March, the 'NewSeriesDate' value is taken from df2.
2) For any 'SeriesDate' in df1 after 15th March, or for any 'SeriesDate' that is not in df2, the 'NewSeriesDate' should be calculated as follows:
from pandas.tseries.offsets import BDay
df1['NewSeriesDate'] = df1['SeriesDate'] - BDay(1)
As an example, my final DataFrame in this scenario would look like this:
raw_data3 = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15','2017-03-16','2017-03-17'],'NewSeriesDate':['2017-03-11','2017-03-12','2017-03-13','2017-03-14','2017-03-15','2017-03-16']}
finaldf = pd.DataFrame(raw_data3,columns=['SeriesDate','NewSeriesDate'])
finaldf['SeriesDate'] = pd.to_datetime(finaldf['SeriesDate'])
print(finaldf)
SeriesDate NewSeriesDate
0 2017-03-10 2017-03-11
1 2017-03-13 2017-03-12
2 2017-03-14 2017-03-13
3 2017-03-15 2017-03-14
4 2017-03-16 2017-03-15
5 2017-03-17 2017-03-16
I am new to pandas and not sure how to apply a conditional merge. Can anyone provide some insight, please?
Try this out. It could probably be a little cleaner, but it does the trick. You didn't specify what happens if the date is exactly March 15th, so I assumed it should be treated like the earlier dates and taken from df2. I renamed the NewSeriesDate column to NewSeries, but you get the idea:
import pandas as pd
from pandas.tseries.offsets import BDay
import numpy as np
df1 = pd.DataFrame({
'SeriesDate':pd.to_datetime(['3/10/17','3/13/17','3/14/17','3/15/17','3/16/17','3/17/17']),
})
df1['NewSeries'] = np.nan
df2 = pd.DataFrame({
'SeriesDate':pd.to_datetime(['3/10/17','3/13/17','3/14/17','3/15/17','3/16/17']),
'NewSeries':pd.to_datetime(['3/11/17','3/12/17','3/13/17','3/14/17','3/14/17'])
})
d = pd.to_datetime('3/15/17')
df1.loc[df1['SeriesDate'] <= d] = df1.loc[df1['SeriesDate'] <= d].set_index('SeriesDate') \
.combine_first(df2.loc[df2['SeriesDate'] <= d].set_index('SeriesDate')).reset_index()
df1.loc[df1['SeriesDate'] > d, 'NewSeries'] = df1['SeriesDate'] - BDay(1)
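Another way to express the same rule is a left merge followed by a conditional fill. This is only a sketch under a few assumptions: it keeps the question's column name NewSeriesDate, it treats 15th March itself as "not before the cutoff" (so that row gets the business-day fallback), and the names cutoff, merged, fallback and use_df2 are made up for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

df1 = pd.DataFrame({
    "SeriesDate": pd.to_datetime(
        ["2017-03-10", "2017-03-13", "2017-03-14",
         "2017-03-15", "2017-03-16", "2017-03-17"]
    )
})
df2 = pd.DataFrame({
    "SeriesDate": pd.to_datetime(
        ["2017-03-10", "2017-03-13", "2017-03-14", "2017-03-15", "2017-03-16"]
    ),
    "NewSeriesDate": pd.to_datetime(
        ["2017-03-11", "2017-03-12", "2017-03-13", "2017-03-14", "2017-03-14"]
    ),
})

# Assumption: dates strictly before the cutoff use df2, everything else falls back
cutoff = pd.Timestamp("2017-03-15")

# Left merge keeps every row of df1; dates missing from df2 get NaT
merged = df1.merge(df2, on="SeriesDate", how="left")

# Fallback value: the previous business day
fallback = merged["SeriesDate"] - BDay(1)

# Keep the df2 value only when the date is before the cutoff and df2 had a match
use_df2 = (merged["SeriesDate"] < cutoff) & merged["NewSeriesDate"].notna()
merged["NewSeriesDate"] = merged["NewSeriesDate"].where(use_df2, fallback)

print(merged)
```

With the sample data this reproduces the NewSeriesDate column of finaldf from the question. Changing `<` to `<=` in use_df2 would take the value for 15th March from df2 instead.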
Related
I have the following pandas DataFrame df with a single string column, mydate:
import pandas as pd
df = {'mydate': ['01JAN2009','20FEB2013','13MAR2010','01APR2012', '20MAY2013', '18JUN2018', '10JUL2002', '30AUG2000', '15SEP2001', '30OCT1999',
'04NOV2020', '23DEC1995']}
df = pd.DataFrame(df, columns = ['mydate'])
I need to convert mydate into date type and store it in a new column mydate2.
You could try this:
import pandas as pd
df = {'mydate': ['01JAN2009','20FEB2013','13MAR2010','01APR2012', '20MAY2013', '18JUN2018', '10JUL2002', '30AUG2000', '15SEP2001', '30OCT1999',
'04NOV2020', '23DEC1995']}
df = pd.DataFrame(df, columns = ['mydate'])
df['mydate2']=pd.to_datetime(df['mydate'])
print(df)
Output:
mydate mydate2
0 01JAN2009 2009-01-01
1 20FEB2013 2013-02-20
2 13MAR2010 2010-03-13
3 01APR2012 2012-04-01
4 20MAY2013 2013-05-20
5 18JUN2018 2018-06-18
6 10JUL2002 2002-07-10
7 30AUG2000 2000-08-30
8 15SEP2001 2001-09-15
9 30OCT1999 1999-10-30
10 04NOV2020 2020-11-04
11 23DEC1995 1995-12-23
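If you would rather not rely on pandas guessing the layout, you can spell the format out. A small sketch, assuming every value follows the day / abbreviated English month / four-digit year pattern shown above (matching of %b depends on the locale's month names); only three of the sample values are used here:

```python
import pandas as pd

df = pd.DataFrame({"mydate": ["01JAN2009", "20FEB2013", "23DEC1995"]})

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
df["mydate2"] = pd.to_datetime(df["mydate"], format="%d%b%Y")

print(df)
```

With an explicit format, a value that does not match raises an error instead of being parsed some other way.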
I have the following pandas dataframe. The dates are with time:
from pandas.tseries.holiday import USFederalHolidayCalendar
import pandas as pd
df = pd.DataFrame([[6,0,"2016-01-02 01:00:00",0.0],
[7,0,"2016-07-04 02:00:00",0.0]])
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2014-01-01', end='2018-12-31')
I want to add a new boolean column that is True if the date is a holiday and False otherwise.
I tried df["hd"] = df[2].isin(holidays), but it doesn't work because of the time component.
Use Series.dt.floor or Series.dt.normalize to remove the times:
df[2] = pd.to_datetime(df[2])
df["hd"] = df[2].dt.floor('d').isin(holidays)
#alternative
df["hd"] = df[2].dt.normalize().isin(holidays)
print (df)
0 1 2 3 hd
0 6 0 2016-01-02 01:00:00 0.0 False
1 7 0 2016-07-04 02:00:00 0.0 True
My dataset has dates in the European format, and I'm struggling to convert them correctly with pd.to_datetime: whenever the day is 12 or lower, the month and day get switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Pass an explicit format:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
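To see the effect, here is a small sketch with made-up sample strings (not the asker's file) in which the day is sometimes 12 or lower, which is exactly where day and month can be confused:

```python
import pandas as pd

dates = pd.Series(["01/03/2018", "06/08/2018", "31/03/2018"])

# The explicit format fixes the order as day/month/year for every row
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(parsed.dt.day.tolist())    # [1, 6, 31]
print(parsed.dt.month.tolist())  # [3, 8, 3]
```

A string that does not match the format raises an error by default rather than being silently reinterpreted.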
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
I have a DataFrame that looks like this:
raw_data = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15'],'Test':['1','2','3','4']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['SeriesDate','Test'])
df['SeriesDate'] = pd.to_datetime(df['SeriesDate'])
I want to subtract the number field from the date field to arrive at a date. However, when I do this:
df['TestDate'] = df['SeriesDate'] - pd.Series((dd) for dd in df['Test'])
I get the following TypeError:
TypeError: incompatible type [object] for a datetime/timedelta operation
Any idea how I can work around this?
You can use to_timedelta:
df['TestDate'] = df['SeriesDate'] - pd.to_timedelta(df['Test'].astype(int), unit="d")
print(df)
SeriesDate Test TestDate
0 2017-03-10 1 2017-03-09
1 2017-03-13 2 2017-03-11
2 2017-03-14 3 2017-03-11
3 2017-03-15 4 2017-03-11
I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and the x days after that date.
That means the resulting DataFrame will contain more rows (specifically n(1 + 2*x), where n is the original number of dates in col1).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
You can do it this way, though I'm not sure it's the best or fastest approach:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
x + pd.Timedelta(days=N))
)
)
.stack()
.drop_duplicates()
.reset_index(level=[0,1], drop=True)
.to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a DataFrame with a datetime.date column and stacks a second Series underneath it, containing the original dates shifted by a timedelta.
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
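To generalise this to x days on either side of each date, one option is to build the offsets once and add them to every date. A minimal sketch, assuming col1 already holds datetimes; x, offsets and expanded are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": pd.to_datetime(["2015-02-02", "2015-04-05", "2016-07-02"])})
x = 1

# Offsets of -x, ..., 0, ..., +x days
offsets = pd.to_timedelta(np.arange(-x, x + 1), unit="D")

# One row per (date, offset) pair; drop duplicates in case windows overlap
expanded = (
    pd.DataFrame({"col1": [d + o for d in df["col1"] for o in offsets]})
    .drop_duplicates()
    .sort_values("col1")
    .reset_index(drop=True)
)

print(expanded)
```

For the three sample dates and x = 1 this gives 9 rows, matching the n(1 + 2*x) count from the question.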