Difference of a date and int column in Pandas - python

I have a DataFrame that looks like this:
raw_data = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15'],'Test':['1','2','3','4']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['SeriesDate','Test'])
df['SeriesDate'] = pd.to_datetime(df['SeriesDate'])
I want to subtract the number field from the date field to arrive at a date, however when I do this:
df['TestDate'] = df['SeriesDate'] - pd.Series((dd) for dd in df['Test'])
I get the following TypeError:
TypeError: incompatible type [object] for a datetime/timedelta operation
Any idea how I can workaround this?

You can use to_timedelta:
df['TestDate'] = df['SeriesDate'] - pd.to_timedelta(df['Test'].astype(int), unit="d")
print(df)
SeriesDate Test TestDate
0 2017-03-10 1 2017-03-09
1 2017-03-13 2 2017-03-11
2 2017-03-14 3 2017-03-11
3 2017-03-15 4 2017-03-11
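Putting the answer together, a minimal self-contained sketch using the same data as above. The key point is that 'Test' holds strings, so it must be cast to int before it can become a timedelta:

```python
import pandas as pd

# Rebuild the example frame; 'Test' starts out as strings.
raw_data = {'SeriesDate': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'Test': ['1', '2', '3', '4']}
df = pd.DataFrame(raw_data, columns=['SeriesDate', 'Test'])
df['SeriesDate'] = pd.to_datetime(df['SeriesDate'])

# Cast to int, convert to a timedelta in days, then subtract.
df['TestDate'] = df['SeriesDate'] - pd.to_timedelta(df['Test'].astype(int), unit='D')
```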

Related

Type errors when creating incremental date column in pandas

I have data [read_data] as:
month
0
1
2
I have a code to create column date as:
start = 201907 #This is YYYYMM
start_dt = pd.to_datetime(start, format='%Y%m')
read_data['date'] = read_data['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
Basically I am trying to create an incremental date column starting with date-1 at month 0, but it gives me this error:
TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'Timestamp'
Trying to get the result data as:
month date
0 1/6/2019
1 1/7/2019
2 1/8/2019
Any ideas what the problem could be?
Move the addition inside the lambda, so each DateOffset is added to the Timestamp one row at a time:
df['date'] = df['month'].apply(lambda x: pd.DateOffset(months=x-1)+start_dt)
df
Out[69]:
month date
0 0 2019-06-01
1 1 2019-07-01
2 2 2019-08-01
You can do this:
In [349]: df['date'] = df.month.apply(lambda x: start_dt.date() + pd.DateOffset(months=x-1))
In [350]: df
Out[350]:
month date
0 0 2019-06-01
1 1 2019-07-01
2 2 2019-08-01
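Both answers move the DateOffset addition inside the per-row function. A minimal runnable sketch of that idea, reconstructing read_data from the sample (str() is used here to be explicit about parsing the YYYYMM integer):

```python
import pandas as pd

# Reconstruct the sample data: a 'month' counter starting at 0.
read_data = pd.DataFrame({'month': [0, 1, 2]})
start = 201907  # YYYYMM
start_dt = pd.to_datetime(str(start), format='%Y%m')

# Add the offset to the Timestamp inside the lambda; building a whole array
# of DateOffset objects and then adding a Timestamp is what raised the TypeError.
read_data['date'] = read_data['month'].apply(lambda x: start_dt + pd.DateOffset(months=x - 1))
```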

Comparison between datetime64[ns] and date

I have DataFrame with values looks like this
Date Value
1 2020-04-12 A
2 2020-05-12 B
3 2020-07-12 C
4 2020-10-12 D
5 2020-11-12 E
and I need to create new DataFrame only with dates from today (7.12) to future (in this example only rows 3, 4 and 5).
I use this code:
df1= df[df["Date"] >= date.today()]
but it gives me TypeError: Invalid comparison between dtype=datetime64[ns] and date
What am I doing wrong? Thank you!
Use the .dt.date accessor on the df['Date'] column; then you are comparing dates with dates. So:
df1 = df.loc[df['Date'].dt.date >= date.today()]
This will give you:
Date Value
3 2020-12-07 C
4 2020-12-10 D
5 2020-12-11 E
Also make sure that your date format is actually correct, for example by printing df['Date'].dt.month to see that it gives all 12's. If not, your date string is not converted correctly. In that case, use df['Date'] = pd.to_datetime(df['Date'], format="%Y-%d-%m") to convert the Date column to the correct datetime format after creating the DataFrame.
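A reproducible sketch of that comparison, with a fixed day standing in for date.today() so the filter result is stable (dates here are made up for illustration):

```python
import pandas as pd
from datetime import date

df = pd.DataFrame({'Date': pd.to_datetime(['2020-12-04', '2020-12-05', '2020-12-07',
                                           '2020-12-10', '2020-12-11']),
                   'Value': list('ABCDE')})

# .dt.date yields python date objects, which compare cleanly with a date;
# in real code you would use date.today() here instead of a fixed date.
today = date(2020, 12, 7)
df1 = df.loc[df['Date'].dt.date >= today]
```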
Could you please try the following. This assumes that your dates are in YYYY-DD-MM format; if they are in another format, change the format string passed to strftime accordingly.
import pandas as pd
today = pd.Timestamp.today().strftime("%Y-%d-%m")  # pd.datetime was removed in pandas 2.0
df.loc[df['Date'] >= today]
Sample run of solution above: Let's say we have following test DataFrame.
Date Value
1 2020-04-12 A
2 2020-05-12 B
3 2020-07-12 C
4 2020-11-12 D
5 2020-12-12 E
Now when we run the solution above we will get following output:
Date Value
3 2020-07-12 C
4 2020-11-12 D
5 2020-12-12 E

parse multiple date format pandas

I've got stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in a format %Y-%m
What I tried was
def parse(date):
    if len(date) <= 5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date'] = parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; dates like 2001-12-25 came back as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps.
Assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps:
First we parse the regular datetimes into a new series called s:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our series; these correspond to your dates that are missing a day.
We then reapply the same method with the other format and fill in the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
Then we re-assign the result to your dataframe.
You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard("\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically converts the day to 1, since only year and month was supplied.
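The same regex idea without read_clipboard, sketched over an explicit frame built from the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Dates': ['2001-12-25', '2002-9-27', '2001-2-24',
                             '2001-5-3', '200510', '20078']})

# Pull out a 4-digit year and a 1-2 digit month, with an optional dash between.
pattern = r"(?P<Year>\d{4})-?(?P<Month>\d{1,2})"
parts = df['Dates'].str.extract(pattern)
df['Dates'] = pd.to_datetime(parts['Year'] + '-' + parts['Month'], format='%Y-%m')
```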

Find annual average of pandas dataframe with date column

id vi dates f_id
0 5532714 0.549501 2015-07-07 ff_22
1 5532715 0.540969 2015-07-08 ff_22
2 5532716 0.531477 2015-07-09 ff_22
3 5532717 0.521029 2015-07-10 ff_22
4 5532718 0.509694 2015-07-11 ff_22
In the dataframe above, I want to find average yearly value for each year. This does not work:
df.groupby(df.dates.year)['vi'].transform(mean)
I get this error: *** AttributeError: 'Series' object has no attribute 'year'
How to fix this?
Let's make sure that dates is datetime dtype, then use the .dt accessor as .dt.year:
df['dates'] = pd.to_datetime(df.dates)
df.groupby(df.dates.dt.year)['vi'].transform('mean')
Output:
0 0.530534
1 0.530534
2 0.530534
3 0.530534
4 0.530534
Name: vi, dtype: float64
Updating and completing @piRsquared's example below for recent versions of pandas (e.g. v1.1.0), using the Grouper function instead of TimeGrouper, which was deprecated:
import pandas as pd
import numpy as np
tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)
df.groupby(pd.Grouper(freq='1Y')).mean()
You can also use pd.TimeGrouper with the frequency 'A'.
Consider the dataframe df consisting of four years of daily data
tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)
df.groupby(pd.TimeGrouper('A')).mean()
vi
dates
2010-12-31 0.465121
2011-12-31 0.511640
2012-12-31 0.491363
2013-12-31 0.516614
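To make the transform-versus-aggregate distinction concrete, a small sketch on made-up data: transform broadcasts each year's mean back to every row, while a plain mean() returns one row per year.

```python
import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime(['2015-07-07', '2015-07-08', '2016-01-01']),
                   'vi': [0.5, 0.7, 0.9]})

# One value per original row (that row's year mean) ...
per_row = df.groupby(df['dates'].dt.year)['vi'].transform('mean')
# ... versus one value per year.
per_year = df.groupby(df['dates'].dt.year)['vi'].mean()
```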

Conditional merging of Pandas DataFrames in Python

I have 2 DataFrames that currently looks like this:
raw_data = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15','2017-03-16','2017-03-17']}
import pandas as pd
df1 = pd.DataFrame(raw_data,columns=['SeriesDate'])
df1['SeriesDate'] = pd.to_datetime(df1['SeriesDate'])
print(df1)
SeriesDate
0 2017-03-10
1 2017-03-13
2 2017-03-14
3 2017-03-15
4 2017-03-16
5 2017-03-17
raw_data2 = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15','2017-03-16'],'NewSeriesDate':['2017-03-11','2017-03-12','2017-03-13','2017-03-14','2017-03-14']}
df2 = pd.DataFrame(raw_data2,columns=['SeriesDate','NewSeriesDate'])
df2['SeriesDate'] = pd.to_datetime(df2['SeriesDate'])
print(df2)
SeriesDate NewSeriesDate
0 2017-03-10 2017-03-11
1 2017-03-13 2017-03-12
2 2017-03-14 2017-03-13
3 2017-03-15 2017-03-14
4 2017-03-16 2017-03-14
1) I would like to join the dataframes in such a manner that for all 'SeriesDate' in df1 before 15th March, the 'NewSeriesDate' values should be taken from df2.
2) For any 'SeriesDate' in df1 after 15th March or for any 'SeriesDate' that are not in df2, the 'NewSeriesDate' should be calculated as follows:
from pandas.tseries.offsets import BDay
df1['NewSeriesDate'] = df1['SeriesDate'] - BDay(1)
As an example, my final DataFrame in this scenario would look like this:
raw_data3 = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15','2017-03-16','2017-03-17'],'NewSeriesDate':['2017-03-11','2017-03-12','2017-03-13','2017-03-14','2017-03-15','2017-03-16']}
finaldf = pd.DataFrame(raw_data3,columns=['SeriesDate','NewSeriesDate'])
finaldf['SeriesDate'] = pd.to_datetime(finaldf['SeriesDate'])
print(finaldf)
SeriesDate NewSeriesDate
0 2017-03-10 2017-03-11
1 2017-03-13 2017-03-12
2 2017-03-14 2017-03-13
3 2017-03-15 2017-03-14
4 2017-03-16 2017-03-15
5 2017-03-17 2017-03-16
I am new to Pandas so not sure how to apply conditional merge, can anyone provide some insight please?
Try this out. It could probably be a little cleaner, but does the trick. You didn't specify what happens if the date is exactly March 15th, so I made an assumption. I may have switched out some header names, but you get the idea:
import pandas as pd
from pandas.tseries.offsets import BDay
import numpy as np
df1 = pd.DataFrame({
    'SeriesDate': pd.to_datetime(['3/10/17', '3/13/17', '3/14/17', '3/15/17', '3/16/17', '3/17/17']),
})
df1['NewSeries'] = np.nan
df2 = pd.DataFrame({
    'SeriesDate': pd.to_datetime(['3/10/17', '3/13/17', '3/14/17', '3/15/17', '3/16/17']),
    'NewSeries': pd.to_datetime(['3/11/17', '3/12/17', '3/13/17', '3/14/17', '3/14/17'])
})
d = pd.to_datetime('3/15/17')
df1.loc[df1['SeriesDate'] <= d] = df1.loc[df1['SeriesDate'] <= d].set_index('SeriesDate') \
    .combine_first(df2.loc[df2['SeriesDate'] <= d].set_index('SeriesDate')).reset_index()
df1.loc[df1['SeriesDate'] > d, 'NewSeries'] = df1['SeriesDate'] - BDay(1)
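An alternative sketch of the same logic (not the answer above): a left merge pulls NewSeriesDate wherever df2 has a match, then Series.where falls back to SeriesDate - BDay(1) after the cutoff or where no match exists. Column names follow the question's final example.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

df1 = pd.DataFrame({'SeriesDate': pd.to_datetime(
    ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15', '2017-03-16', '2017-03-17'])})
df2 = pd.DataFrame({
    'SeriesDate': pd.to_datetime(['2017-03-10', '2017-03-13', '2017-03-14',
                                  '2017-03-15', '2017-03-16']),
    'NewSeriesDate': pd.to_datetime(['2017-03-11', '2017-03-12', '2017-03-13',
                                     '2017-03-14', '2017-03-14'])})

cutoff = pd.Timestamp('2017-03-15')
out = df1.merge(df2, on='SeriesDate', how='left')  # NaT where df2 has no match
fallback = out['SeriesDate'] - BDay(1)             # previous business day
# Keep the merged value only on/before the cutoff and where a match was found;
# otherwise take the business-day fallback.
out['NewSeriesDate'] = out['NewSeriesDate'].where(
    (out['SeriesDate'] <= cutoff) & out['NewSeriesDate'].notna(), fallback)
```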
