My data set contains multiple columns of sales-related data. I have ORDEREDDATE and SHIPPINGDAYS in the DataFrame. I want to add a new column named DELIVEREDDATE in the dataset.
Current DataFrame
ORDEREDDATE SHIPPINGDAYS
2018-5-13 6
2017-8-24 4
2018-6-1 2
Expected output
ORDEREDDATE SHIPPINGDAYS DELIVEREDDATE
2018-5-13 6 2018-5-19
2017-8-24 4 2017-8-28
2018-6-1 2 2018-6-3
Types
ORDEREDDATE object
SHIPPINGDAYS object
Attempt to solve
df1['DELIVERYDATE'] = (datetime.datetime.strptime(df1['ORDEREDDATE'].astype(str), '%Y-%m-%d') + datetime.timedelta(df1['SHIPPINGDAYS'].astype(str).astype(int))
Here's one way to do it:
# make sure the columns have the correct types
df['ORDEREDDATE'] = pd.to_datetime(df['ORDEREDDATE'])
df['SHIPPINGDAYS'] = df['SHIPPINGDAYS'].astype(int)
df['DELIVEREDDATE'] = (df
    .apply(lambda x: x['ORDEREDDATE'] + pd.Timedelta(days=x['SHIPPINGDAYS']),
           axis=1))
ORDEREDDATE SHIPPINGDAYS DELIVEREDDATE
0 2018-05-13 6 2018-05-19
1 2017-08-24 4 2017-08-28
2 2018-06-01 2 2018-06-03
First, convert the column to datetime:
df['ORDEREDDATE'] = pd.to_datetime(df['ORDEREDDATE'])
Then define the new column, turning the values from SHIPPINGDAYS into timedelta objects (cast the column to int first if it is stored as strings). That way you can add them to the dates and get the desired output:
df['DELIVEREDDATE'] = df['ORDEREDDATE'] + df['SHIPPINGDAYS'].apply(lambda x: pd.Timedelta(x,unit='D'))
Output:
ORDEREDDATE SHIPPINGDAYS DELIVEREDDATE
0 2018-05-13 6 2018-05-19
1 2017-08-24 4 2017-08-28
2 2018-06-01 2 2018-06-03
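The same result can also be computed without apply. This is a minimal sketch, assuming SHIPPINGDAYS holds whole numbers stored as strings (as the object dtype in the question suggests); the sample frame is rebuilt by hand from the question:

```python
import pandas as pd

# Sample data copied from the question; both columns start out as strings (object dtype).
df = pd.DataFrame({
    "ORDEREDDATE": ["2018-5-13", "2017-8-24", "2018-6-1"],
    "SHIPPINGDAYS": ["6", "4", "2"],
})

# Convert the types once, then add the two columns as whole Series.
df["ORDEREDDATE"] = pd.to_datetime(df["ORDEREDDATE"])
df["SHIPPINGDAYS"] = df["SHIPPINGDAYS"].astype(int)
df["DELIVEREDDATE"] = df["ORDEREDDATE"] + pd.to_timedelta(df["SHIPPINGDAYS"], unit="D")

print(df)
```

pd.to_timedelta converts the whole column in one call, which tends to be faster than a row-wise lambda on large frames.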
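Be explicit about the unit you are adding!
Initialize the timedelta with the named days argument. If you don't name it, the default depends on the class: datetime.timedelta treats its first positional argument as days, while pd.Timedelta treats a bare integer as nanoseconds.
Also, you end up with a datetime object, so you need to format it the way you want after the calculation is done.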
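A quick check of how a bare number is interpreted (a minimal sketch; the variable names are made up for illustration). As far as I know, datetime.timedelta takes days as its first positional argument and pd.Timedelta reads a bare integer as nanoseconds, so naming the unit is the safe habit:

```python
import datetime
import pandas as pd

std_positional = datetime.timedelta(6)   # first positional argument is days
pd_positional = pd.Timedelta(6)          # bare integer, read as nanoseconds
pd_named = pd.Timedelta(days=6)          # unit named explicitly

# Each comparison shows which unit the constructor actually used.
print(std_positional == datetime.timedelta(days=6))
print(pd_positional == pd.Timedelta(6, unit="ns"))
print(pd_named == pd.Timedelta(6, unit="D"))
```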
Related
I am trying to convert a column of date strings to datetime. In the original dataframe the data type is a string and the dataset has shape = (28000000, 26). Importantly, the format of the date is MMYYYY only. Here's a data sample:
DATE
Out[3] 0 081972
1 051967
2 101964
3 041975
4 071976
I tried:
df['DATE'].apply(pd.to_datetime(format='%m%Y'))
and
pd.to_datetime(df['DATE'],format='%m%Y')
I got Runtime Error both times
Then
df['DATE'].apply(pd.to_datetime)
it worked for the other not shown columns(with DDMMYYYY format), but generated future dates with df['DATE'] because it reads the dates as MMDDYY instead of MMYYYY.
DATE
0 1972-08-19
1 2067-05-19
2 2064-10-19
3 1975-04-19
4 1976-07-19
Expected output:
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
If this question is a duplicate please direct me to the original one, I wasn't able to find any suitable answer.
Thank you all in advance for your help
First, if an error is raised, some values evidently do not match the format. You can find them with the errors='coerce' parameter and Series.isna, because values that do not match are returned as missing values (NaT):
print (df)
DATE
0 81972
1 51967
2 101964
3 41975
4 171976 <-changed data
print (pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce'))
0 1972-08-01
1 1967-05-01
2 1964-10-01
3 1975-04-01
4 NaT
Name: DATE, dtype: datetime64[ns]
print (df[pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').isna()])
DATE
4 171976
Solution with output from the changed data, converting to datetimes and then to month periods with Series.dt.to_period:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('M')
print (df)
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 NaT
Solution with original data:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('M')
print (df)
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
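If the column was read as integers, the leading zero of the month is lost (the printed 81972 above looks like that case). A minimal sketch that restores it before parsing, assuming every value is meant to be six characters in MMYYYY form; the sample frame is made up to mirror the question:

```python
import pandas as pd

# Hypothetical sample: MMYYYY values read as integers, so "081972" became 81972.
df = pd.DataFrame({"DATE": [81972, 51967, 101964, 41975, 71976]})

# Pad back to six characters so %m always sees two digits.
padded = df["DATE"].astype(str).str.zfill(6)

# Parse as month + year, then keep only the month period.
df["DATE"] = pd.to_datetime(padded, format="%m%Y", errors="coerce").dt.to_period("M")

print(df)
```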
I would have done:
df['date_formatted'] = pd.to_datetime(
    dict(
        year=df['DATE'].str[2:],
        month=df['DATE'].str[:2],
        day=1
    )
)
Maybe this helps. Works for your sample data.
I've got stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in a format %Y-%m
What I tried was
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; values in a format like 2001-12-25 came out as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps,
assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps:
first we parse the regular dates into a new series called s
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our series; these correspond to the dates that are missing a day.
We then apply the same to_datetime call with the other format, and use the result to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
then we re-assign to your dataframe.
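Since the question asks for the dates in %Y-%m form, the parsed column can be turned back into strings at the end. A minimal sketch of the two-pass parse plus that last step; the column name date_ym is made up for illustration:

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({"date": ["2001-12-25", "2002-9-27", "2001-2-24",
                            "2001-5-3", "200510", "20078"]})

# Pass 1: full dates; anything that does not match becomes NaT.
s = pd.to_datetime(df["date"], errors="coerce", format="%Y-%m-%d")

# Pass 2: year + month only, used to fill the gaps left by pass 1.
s = s.fillna(pd.to_datetime(df["date"], errors="coerce", format="%Y%m"))

# Render as year-month strings (hypothetical column name).
df["date_ym"] = s.dt.strftime("%Y-%m")

print(df)
```

Keeping s as a datetime column and formatting only at the end leaves the door open for date arithmetic later.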
You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard(r"\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only year and month were supplied.
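If only the year-month text is needed, the extracted groups can also be joined directly without going through datetime. A minimal sketch, assuming every value starts with a four-digit year followed by an optional dash and a one- or two-digit month; the column name year_month is made up:

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({"Dates": ["2001-12-25", "2002-9-27", "2001-2-24",
                             "2001-5-3", "200510", "20078"]})

# Named groups become the column names of the extracted frame.
parts = df["Dates"].str.extract(r"(?P<Year>\d{4})-?(?P<Month>\d{1,2})")

# Zero-pad the month and join the pieces into a YYYY-MM string.
df["year_month"] = parts["Year"] + "-" + parts["Month"].str.zfill(2)

print(df)
```

This skips validation, so a value like 200513 would pass through as 2005-13; parsing with to_datetime catches that kind of error.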
I have a dataframe (df) that looks like:
DATES
0 NaT
1 01/08/2003
2 NaT
3 NaT
4 04/08/2003
5 NaT
6 30/06/2003
7 01/03/2004
8 18/05/2003
9 NaT
10 NaT
11 31/10/2003
12 NaT
13 NaT
I am struggling to find out how I transform the data-frame to remove the NaT values so the final output looks like
DATES
0
1 01/08/2003
2
3
4 04/08/2003
5
6 30/06/2003
7 01/03/2004
8 18/05/2003
9
10
11 31/10/2003
12
13
I have tried :
df["DATES"].fillna("", inplace = True)
but with no success.
For information, the column is in a datetime format set with
df["DATES"] = pd.to_datetime(df["DATES"],errors='coerce').dt.strftime('%d/%m/%Y')
What can I do to resolve this?
The problem is that the NaT values are strings, so you need:
df["DATES"] = df["DATES"].replace('NaT', '')
df.fillna() works on NaN values (numpy.nan). Your "NaT" values are probably strings, so you can do the following.
If you want to use fillna():
df["DATES"] = df["DATES"].replace("NaT", np.nan)
df = df.fillna("")
Otherwise, you can just replace them with your desired string:
df["DATES"] = df["DATES"].replace("NaT", "")
Convert the column to object and then use Series.where:
df['DATES'] = df['DATES'].astype(object).where(df['DATES'].notnull(), np.nan)
Or replace np.nan with whatever value you want.
Your conversion to datetime did not work properly on the NaTs.
You can check this before calling fillna by printing df['DATES'][0]: you will see the string 'NaT' rather than a real NaT value.
Instead, use (for example): df['DATES'] = df['DATES'].apply(pd.Timestamp)
This example worked for me as is, but notice that it's not datetime but rather pd.Timestamp (it's another time format, but it's an easy one to use). You do not need to specify your time format with this, your current format is understood by pd.Timestamp.
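Another option is to blank out the missing entries based on the datetime column itself, which works whether strftime renders them as the string 'NaT' or as NaN. A minimal sketch with a shortened, hand-built sample; the names raw, dates and formatted are made up:

```python
import pandas as pd

# Shortened sample: missing entries mixed with day-first date strings.
raw = pd.Series([None, "01/08/2003", None, "30/06/2003"])

# Keep a real datetime column so missing values are proper NaT.
dates = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Format for display, then put an empty string wherever the date is missing.
formatted = dates.dt.strftime("%d/%m/%Y")
formatted = formatted.where(dates.notna(), "")

print(formatted)
```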
I have a column in a df which contains datetime strings,
inv_date
24/01/2008
15/06/2007 14:55:22
08/06/2007 18:26:12
15/08/2007 14:53:25
15/02/2008
07/03/2007
13/08/2007
I used pd.to_datetime with format %d%m%Y for converting the strings into datetime values;
pd.to_datetime(df.inv_date, errors='coerce', format='%d%m%Y')
I got
inv_date
24/01/2008
0
0
0
15/02/2008
07/03/2007
13/08/2007
the format is inferred from inv_date as the most common datetime format; I am wondering how to not convert 15/06/2007 14:55:22, 08/06/2007 18:26:12, 15/08/2007 14:53:25 to 0s, but 15/06/2007, 08/06/2007, 15/08/2007.
Use the regular pd.to_datetime call then use .dt.date:
>>> pd.to_datetime(df.inv_date).dt.date
0 2008-01-24
1 2007-06-15
2 2007-08-06
3 2007-08-15
4 2008-02-15
5 2007-07-03
6 2007-08-13
Name: inv_date, dtype: object
>>>
Or, as @ChrisA mentioned, you can also slice off the time part first. Pandas already infers the format here, so I skipped the format argument:
>>> pd.to_datetime(df.inv_date.str[:10], errors='coerce')
0 2008-01-24
1 2007-06-15
2 2007-08-06
3 2007-08-15
4 2008-02-15
5 2007-07-03
6 2007-08-13
Name: inv_date, dtype: object
>>>
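Note that the dates in the question are day-first, and without a hint pandas may read 08/06/2007 as August 6. A minimal sketch that slices off the time part and passes an explicit day-first format, assuming every value starts with a ten-character dd/mm/yyyy date:

```python
import pandas as pd

# Sample values copied from the question.
inv_date = pd.Series([
    "24/01/2008", "15/06/2007 14:55:22", "08/06/2007 18:26:12",
    "15/08/2007 14:53:25", "15/02/2008", "07/03/2007", "13/08/2007",
])

# Keep only the first ten characters (the date), then parse day-first.
dates = pd.to_datetime(inv_date.str[:10], format="%d/%m/%Y", errors="coerce")

print(dates)
```

With the explicit format, 08/06/2007 is read as 8 June 2007 and 07/03/2007 as 7 March 2007.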
You can also try this:
df = pd.read_csv('myfile.csv', parse_dates=['inv_date'], dayfirst=True)
df['inv_date'].dt.strftime('%d/%m/%Y')
0 24/01/2008
1 15/06/2007
2 08/06/2007
3 15/08/2007
4 15/02/2008
5 07/03/2007
6 13/08/2007
Hope this will help too.
Want to calculate the difference of days between pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with logical solution.
Please help me with the code. Actually I am new to python and there are lot of syntactical errors happening while applying any function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 False
1 True
2 False
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
You can try the following:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime:
>>> df['col1'] = pd.to_datetime(df['col1'])
Set the current datetime so you can compute the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
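To get whole days rather than a timedelta with a time-of-day remainder, the current time can be truncated to midnight before subtracting. A minimal sketch; days_since is a hypothetical helper, and its today parameter exists only so the example can be run with a fixed reference date:

```python
import pandas as pd

def days_since(dates, today=None):
    """Return whole days elapsed between each date and today (hypothetical helper)."""
    if today is None:
        # Truncate to midnight so the result is a whole number of days.
        today = pd.Timestamp.now().normalize()
    return (today - pd.to_datetime(dates)).dt.days

s = pd.Series(["2013-02-16", "2013-01-29", "2013-02-21"])

# Fixed reference date for a reproducible result; omit it to use the real current date.
elapsed = days_since(s, today=pd.Timestamp("2013-03-01"))
print(elapsed)
```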
With numpy you can solve it as described in difference-two-dates-days-weeks-months-years-pandas-python-2. Bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# for days use 'D', for weeks use 'W'; 'M' (months) and 'Y' (years) are not fixed-length units and may be rejected by newer pandas versions
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
If you want days as int rather than float, use:
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
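The difference between the two divisions shows up when the gap is not a whole number of days. A minimal sketch with made-up column names (start, end) and made-up timestamps:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first gap is 2.5 days, the second exactly 20 days.
df = pd.DataFrame({
    "start": pd.to_datetime(["2013-02-16 00:00", "2013-03-01 00:00"]),
    "end": pd.to_datetime(["2013-02-18 12:00", "2013-03-21 00:00"]),
})

delta = df["end"] - df["start"]

# True division keeps the fractional part of a day.
df["diff_days_float"] = delta / np.timedelta64(1, "D")

# Floor division drops it and yields integers.
df["diff_days_int"] = delta // np.timedelta64(1, "D")

print(df)
```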
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before, but this suggests that your pandas date series (a list-like object) is iterable. As far as I can tell, once the series has been converted with to_datetime, each element is a pandas Timestamp, which has a to_pydatetime method.
Assuming my assumptions are correct, the following function takes such a series and returns a list of timedeltas (a timedelta object represents the difference between two datetime objects).
from datetime import datetime
def convert(pandas_series):
    # get the current date
    now = datetime.now()
    # Use a list comprehension and each element's to_pydatetime method to calculate timedeltas.
    return [now - pandas_element.to_pydatetime() for pandas_element in pandas_series]
# assuming 'some_pandas_series' is a list-like pandas series object
list_of_timedeltas = convert(some_pandas_series)