Converting days to years in Pandas DataFrame - python

I am trying to find the difference between two dates in a Pandas DataFrame. This is my code:
raw['CALCULATED_AGE'] = ((raw.COMMENCEMENT_DATE - raw.DATE_OF_BIRTH))
This gives me the following output:
[image: Pandas output column]
I just want to convert the days to years. Is there an easy way to do this?
Thank you so much

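You can use relativedelta from dateutil. It works on a single pair of dates, not on whole columns, so adapt it to your case by applying it row by row:
from dateutil.relativedelta import relativedelta
# dt1 and dt2 stand for one COMMENCEMENT_DATE / DATE_OF_BIRTH pair
rdelta = relativedelta(dt1, dt2).years
Full code example.
Create the data: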
import pandas as pd
from dateutil.relativedelta import relativedelta
raw = pd.DataFrame({'COMMENCEMENT_DATE': ['3/10/2000', '3/11/2000', '3/12/2000'],
                    'DATE_OF_BIRTH': ['3/10/1990', '3/11/1991', '3/12/1990']})
raw['COMMENCEMENT_DATE'] = pd.to_datetime(raw['COMMENCEMENT_DATE'])
raw['DATE_OF_BIRTH'] = pd.to_datetime(raw['DATE_OF_BIRTH'])
Calculate:
raw['CALCULATED_AGE'] = raw.apply(lambda x: relativedelta(x.COMMENCEMENT_DATE, x.DATE_OF_BIRTH).years, axis=1)
Output:
  COMMENCEMENT_DATE DATE_OF_BIRTH  CALCULATED_AGE
0        2000-03-10    1990-03-10              10
1        2000-03-11    1991-03-11               9
2        2000-03-12    1990-03-12              10
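To see what relativedelta is doing for each row, here is a minimal sketch on plain date objects. The helper name completed_years is made up for this illustration; it only wraps the relativedelta call used above.

```python
from datetime import date
from dateutil.relativedelta import relativedelta


def completed_years(start: date, end: date) -> int:
    """Whole years elapsed between start and end (hypothetical helper)."""
    return relativedelta(end, start).years


# Exactly on the tenth birthday, ten full years have passed.
on_birthday = completed_years(date(1990, 3, 10), date(2000, 3, 10))
# One day earlier, only nine full years have passed.
day_before = completed_years(date(1990, 3, 10), date(2000, 3, 9))
print(on_birthday, day_before)
```

Because the result is calendar-aware, it does not drift with leap years the way a division by a fixed number of days can.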
EDIT
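Another solution, which also works for months:
import numpy as np
# 'Y' is an average year length, so results near a birthday can be off by one.
# Recent pandas releases may reject the ambiguous 'Y' and 'M' units with a ValueError.
raw['CALCULATED_AGE'] = (raw.COMMENCEMENT_DATE - raw.DATE_OF_BIRTH)/np.timedelta64(1, 'Y')
raw['CALCULATED_AGE'] = raw['CALCULATED_AGE'].astype(int)
If you want the result in months, just change 'Y' to 'M'.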
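If the 'Y' unit is not available in your environment, a vectorized alternative can be sketched as follows. This is not the answerer's code: the sample dates are invented, 365.25 is an assumed average year length, and the exact variant simply subtracts one when the birthday has not yet been reached in the final year.

```python
import pandas as pd

raw = pd.DataFrame({
    "COMMENCEMENT_DATE": pd.to_datetime(["2000-03-10", "2000-03-11", "2000-03-12"]),
    "DATE_OF_BIRTH": pd.to_datetime(["1990-03-10", "1991-03-12", "1990-03-12"]),
})
start, end = raw["DATE_OF_BIRTH"], raw["COMMENCEMENT_DATE"]

# Approximate age: elapsed days divided by an assumed average year length.
raw["APPROX_AGE"] = (end - start).dt.days / 365.25

# Exact completed years: year difference, minus one if the birthday
# has not yet occurred in the commencement year.
before_birthday = (end.dt.month < start.dt.month) | (
    (end.dt.month == start.dt.month) & (end.dt.day < start.dt.day)
)
raw["CALCULATED_AGE"] = end.dt.year - start.dt.year - before_birthday.astype(int)
print(raw)
```

The second row is one day short of a birthday, which is the case where the approximate and exact versions are most likely to disagree after truncation.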

Related

Pandas dt accessor returns wrong day and month

My CSV data looks like this -
Date Time
1/12/2019 12:04AM
1/12/2019 12:09AM
1/12/2019 12:14AM
and so on
And I am trying to read this file using pandas in the following way -
import pandas as pd
import numpy as np
data = pd.read_csv('D 2019.csv',parse_dates=[['Date','Time']])
print(data['Date_Time'].dt.month)
When I try to access the year through the dt accessor the year prints out fine as 2019.
But when I try to print the day or the month, it is completely incorrect. The month starts off as 1 and ends up as 12, when the right value should be 12 all the time.
The day starts off as 12 and ends up at 31, when it should start at 1 and end at 31. The file has a total of 8867 entries. Where am I going wrong?
The default format is MM/DD, while yours is DD/MM.
The simplest solution is to set the dayfirst parameter of read_csv:
dayfirst : DD/MM format dates, international and European format (default False)
data = pd.read_csv('D 2019.csv', parse_dates=[['Date', 'Time']], dayfirst=True)
#                                                                -------------
>>> data['Date_Time'].dt.month
# 0    12
# 1    12
# 2    12
# Name: Date_Time, dtype: int64
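The ambiguity is easy to reproduce without the CSV file. A small sketch, using one of the date strings from the question plus an invented end-of-month value:

```python
import pandas as pd

text = "1/12/2019"

# Default behaviour: the first number is read as the month (12 January).
month_first = pd.to_datetime(text)
# With dayfirst=True the first number is read as the day (1 December).
day_first = pd.to_datetime(text, dayfirst=True)

# The same flag works on a whole column of strings.
dates = pd.to_datetime(pd.Series(["1/12/2019", "31/12/2019"]), dayfirst=True)
print(month_first.month, day_first.month)
print(dates.dt.month.tolist(), dates.dt.day.tolist())
```

Note that dayfirst is a hint for ambiguous strings; when the layout is known in advance, an explicit format string is the stricter choice.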
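Alternatively, pass an explicit format to pd.to_datetime (note %I rather than %H, since the times are 12-hour values with AM/PM):
df = pd.read_csv('D 2019.csv')
# Without parse_dates, Date and Time stay separate columns, so join them before parsing
df["Date_Time"] = pd.to_datetime(df["Date"] + " " + df["Time"], format='%d/%m/%Y %I:%M%p')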
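One detail worth checking in any format string for 12-hour times: in Python's strptime, the %p directive only affects the parsed hour when the hour itself is read with %I. The sketch below uses the standard library and a sample value from the question to show the difference.

```python
from datetime import datetime

stamp = "1/12/2019 12:04AM"

# %I reads a 12-hour clock value, so %p can turn 12 AM into hour 0.
twelve_hour = datetime.strptime(stamp, "%d/%m/%Y %I:%M%p")
# %H reads a 24-hour clock value; the AM/PM marker is matched but ignored.
twenty_four_hour = datetime.strptime(stamp, "%d/%m/%Y %H:%M%p")

print(twelve_hour.hour, twenty_four_hour.hour)
```

With %H the first rows of the file would come out as shortly after noon instead of shortly after midnight, without any error being raised.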
You need to check the data types of your dataframe and convert the "Date" column to datetime (dayfirst=True because your dates are D/M/Y):
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
After that, you can access the day, month, or year using:
dt.day
dt.month
dt.year
Note: Make sure you know the format of the date (D/M/Y or M/D/Y) and tell pandas about it.
Full Code
import pandas as pd
import numpy as np
data = pd.read_csv('D 2019.csv')
data["Date"] = pd.to_datetime(data["Date"], dayfirst=True)  # dates in the file are D/M/Y
print(data["Date"].dt.day)
print(data["Date"].dt.month)
print(data["Date"].dt.year)

Am I doing something wrong with the loops?

I am using Python to do some data cleaning. I've used the datetime module to split a date-time value and tried to create another column with just the time.
My script runs, but every row ends up with the value from the last row of the data frame.
Here is the code:
import datetime
i = 0
for index, row in df.iterrows():
    date = datetime.datetime.strptime(df.iloc[i, 0], "%Y-%m-%dT%H:%M:%SZ")
    df['minutes'] = date.minute
    i = i + 1
This is the dataframe:
[image: output]
df['minutes'] = date.minute reassigns the entire 'minutes' column with the scalar value date.minute from the last iteration.
You don't need a loop here, which is true in 99% of cases when using pandas.
You can use vectorized assignment instead. Just replace 'source_column_name' with the name of the column that holds the source data.
df['minutes'] = pd.to_datetime(df['source_column_name'], format='%Y-%m-%dT%H:%M:%SZ').dt.minute
Most likely you won't even need to specify format, as pd.to_datetime is fairly smart about inferring it.
Quick example:
df = pd.DataFrame({'a': ['2020.1.13', '2019.1.13']})
df['year'] = pd.to_datetime(df['a']).dt.year
print(df)
outputs
           a  year
0  2020.1.13  2020
1  2019.1.13  2019
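Applied to timestamps shaped like the ones in the question, the same idea might look like this. The column name timestamp and the sample values are invented for the sketch, and utc=True is used here instead of an explicit format so that the trailing Z is read as UTC.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [
        "2020-01-13T10:15:30Z",
        "2020-01-13T11:45:00Z",
        "2020-01-14T00:05:59Z",
    ]
})

# Parse the whole column once, then pull out components with the .dt accessor.
parsed = pd.to_datetime(df["timestamp"], utc=True)
df["minutes"] = parsed.dt.minute   # one value per row, not one scalar for all rows
df["time"] = parsed.dt.time        # datetime.time objects
print(df)
```

Each row gets its own minute value because the right-hand side is a Series aligned with the frame, whereas the loop assigned a single scalar to the entire column on every iteration.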
Seems like you're trying to get the time column from the datetime which is in string format. That's what I understood from your post.
Could you give this a shot?
from datetime import datetime
import pandas as pd
def get_time(date_cell):
    dt = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ")
    return datetime.strftime(dt, "%H:%M:%SZ")
df['time'] = df['date_time'].apply(get_time)
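For larger frames, the same time strings can probably be produced without a per-row Python function. A sketch under the same assumptions as the answer above (a date_time column of ISO strings ending in Z; the sample values are invented):

```python
import pandas as pd

df = pd.DataFrame({"date_time": ["2020-01-13T10:15:30Z", "2020-01-14T00:05:59Z"]})

# Parse once, then format the whole column with the vectorized .dt.strftime.
parsed = pd.to_datetime(df["date_time"], utc=True)
df["time"] = parsed.dt.strftime("%H:%M:%SZ")
print(df)
```

The result is a column of strings, the same as with apply(get_time); use parsed.dt.time instead if time objects are preferred.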

Converting Series to Pandas DateTime [duplicate]

Given D = ["10Aug49","21Jan45","15Sep47","13Jun52"], I need to convert this into pandas dates and make sure that the century is 1900, not 2000. So far I have this code, which converts and prints the dates, but the century comes out as 2000 and I want 1900.
import pandas as pd
from datetime import datetime
Dae = pd.Series(["10Aug49","21Jan45","15Sep47","13Jun52"])
x = []
for i in Dae:
    x = datetime.strptime(i, "%d%b%y")
    print(x)
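I find it better to correct the year first, then convert to datetime:
# identify the year as the last group of digits and prepend 19
# (regex=True is needed on pandas 2.0+, where str.replace matches literally by default)
corrected_dates = Dae.str.replace(r'(\d+)$', r'19\1', regex=True)
# convert to datetime
pd.to_datetime(corrected_dates, format='%d%b%Y')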
Output:
0   1949-08-10
1   1945-01-21
2   1947-09-15
3   1952-06-13
dtype: datetime64[ns]
from datetime import date
import pandas as pd
from datetime import datetime
Dae = pd.Series(["10Aug49","21Jan45","15Sep47","13Jun52"])
new_list = []
for i in Dae:
    i = datetime.strptime(i, "%d%b%y").date()
    # a date in the future must belong to the previous century
    if date.today() <= i:
        i = i.replace(year=i.year - 100)
    new_list.append(i)
print(new_list)
[datetime.date(1949, 8, 10), datetime.date(1945, 1, 21), datetime.date(1947, 9, 15), datetime.date(1952, 6, 13)]
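The same correction can also be sketched without a Python loop. This version assumes every date belongs to the 1900s, so it uses a fixed cutoff of 2000-01-01 (an assumption of this sketch) rather than comparing with today's date, and it assumes an English locale for the %b month abbreviations.

```python
import pandas as pd

dae = pd.Series(["10Aug49", "21Jan45", "15Sep47", "13Jun52"])

# %y maps two-digit years 00-68 to 2000-2068, so these all land in the 2000s.
parsed = pd.to_datetime(dae, format="%d%b%y")

# Assumed rule: anything parsed as 2000 or later is moved back one century.
cutoff = pd.Timestamp("2000-01-01")
shifted = parsed - pd.DateOffset(years=100)
fixed = parsed.mask(parsed >= cutoff, shifted)
print(fixed)
```

If some values could legitimately fall in the 2000s, the cutoff would need to be chosen differently, for example as the current date as in the loop above.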

How to change year value in numpy datetime64?

I have a pandas DataFrame with dtype=numpy.datetime64.
In the data I want to change
'2011-11-14T00:00:00.000000000'
to:
'2010-11-14T00:00:00.000000000'
or to some other year. The timedelta is not known, only the year number to assign.
This displays the year as an int:
Dates_profit.iloc[50][stock].astype('datetime64[Y]').astype(int)+1970
but I can't assign a value this way.
Does anyone know how to assign a year to a numpy.datetime64?
Since you're using a DataFrame, consider using pandas.Timestamp.replace:
In [1]: import pandas as pd
In [2]: dates = pd.DatetimeIndex([f'200{i}-0{i+1}-0{i+1}' for i in range(5)])
In [3]: df = pd.DataFrame({'Date': dates})
In [4]: df
Out[4]:
        Date
0 2000-01-01
1 2001-02-02
2 2002-03-03
3 2003-04-04
4 2004-05-05
In [5]: df.loc[:, 'Date'] = df['Date'].apply(lambda x: x.replace(year=1999))
In [6]: df
Out[6]:
        Date
0 1999-01-01
1 1999-02-02
2 1999-03-03
3 1999-04-04
4 1999-05-05
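For date-only columns, a loop-free variant is to rebuild the dates from their components. This is a sketch rather than part of the answer above: the sample frame and the Date_1999 column name are invented, any time-of-day part is dropped, and errors="coerce" turns impossible dates such as 29 February in a non-leap year into NaT.

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2000-01-01", "2001-02-02", "2004-02-29"])})

# pd.to_datetime can assemble dates from columns named year, month and day.
parts = pd.DataFrame({
    "year": 1999,
    "month": df["Date"].dt.month,
    "day": df["Date"].dt.day,
})
df["Date_1999"] = pd.to_datetime(parts, errors="coerce")
print(df)
```

The last row shows the leap-day case: 1999 has no 29 February, so the value becomes NaT instead of raising.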
numpy.datetime64 objects are hard to work with. To update a value, it is normally easier to convert the date to a standard Python datetime object, do the change and then convert it back to a numpy.datetime64 value again:
import numpy as np
from datetime import datetime
dt64 = np.datetime64('2011-11-14T00:00:00.000000000')
# convert to a POSIX timestamp in seconds
# (no trailing 'Z': NumPy has deprecated timezone suffixes in datetime64 strings)
ts = (dt64 - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
# standard utctime from timestamp
dt = datetime.utcfromtimestamp(ts)
# get the new updated year
dt = dt.replace(year=2010)
# convert back to numpy.datetime64:
dt64 = np.datetime64(dt)
There might be simpler ways, but this works, at least.
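One possibly simpler route, assuming pandas is available (the question mentions a DataFrame), is to go through pandas.Timestamp, which has a replace method and converts to and from numpy.datetime64 directly. A minimal sketch:

```python
import numpy as np
import pandas as pd

dt64 = np.datetime64("2011-11-14T00:00:00.000000000")

# Timestamp accepts a datetime64, supports replace(), and converts back.
new_dt64 = pd.Timestamp(dt64).replace(year=2010).to_datetime64()
print(new_dt64)
```

As with datetime.replace, this raises a ValueError if the result would be 29 February in a non-leap year.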
This vectorised solution gives the same result as iterating with pandas and x.replace(year=n), but on large arrays it is at least 10x faster.
Keep in mind that the year the datetime64 values are moved to should be a leap year. With the Python datetime library, datetime(2012,2,29).replace(year=2011) crashes. Here, the function 'replace_year' will simply move 2012-02-29 to 2011-03-01.
I'm using numpy v 1.13.1.
import numpy as np
import pandas as pd
def replace_year(x, year):
    """ Year must be a leap year for this to work """
    # Add number of days x is from JAN-01 to year-01-01
    x_year = np.datetime64(str(year)+'-01-01') + (x - x.astype('M8[Y]'))
    # Due to leap years calculate offset of 1 day for those days in non-leap year
    # (np.int was removed in NumPy 1.24; the builtin int is used instead)
    yr_mn = x.astype('M8[Y]') + np.timedelta64(59,'D')
    leap_day_offset = (yr_mn.astype('M8[M]') - yr_mn.astype('M8[Y]') - 1).astype(int)
    # However, due to days in non-leap years prior March-01,
    # correct for previous step by removing an extra day
    non_leap_yr_beforeMarch1 = (x.astype('M8[D]') - x.astype('M8[Y]')).astype(int) < 59
    non_leap_yr_beforeMarch1 = np.logical_and(non_leap_yr_beforeMarch1, leap_day_offset).astype(int)
    day_offset = np.datetime64('1970') - (leap_day_offset - non_leap_yr_beforeMarch1).astype('M8[D]')
    # Finally, apply the day offset
    x_year = x_year - day_offset
    return x_year
x = np.arange('2012-01-01', '2014-01-01', dtype='datetime64[h]')
x_datetime = pd.to_datetime(x)
x_year = replace_year(x, 1992)
x_datetime = x_datetime.map(lambda x: x.replace(year=1992))
print(x)
print(x_year)
print(x_datetime)
print(np.all(x_datetime.values == x_year))
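The leap-day failure mentioned above is easy to reproduce with the standard library. The sketch below uses a made-up helper, replace_year_safe, that falls back to 1 March when the target year has no 29 February; that fallback is one possible policy, not the only reasonable one.

```python
from datetime import datetime


def replace_year_safe(dt: datetime, year: int) -> datetime:
    """Replace the year, moving Feb 29 to Mar 1 if the target year is not a leap year."""
    try:
        return dt.replace(year=year)
    except ValueError:
        # Only Feb 29 can fail here, because every other day exists in every year.
        return dt.replace(year=year, month=3, day=1)


leap_day = datetime(2012, 2, 29)
print(replace_year_safe(leap_day, 2011))  # falls back to 1 March
print(replace_year_safe(leap_day, 2016))  # 2016 is a leap year, so the day is kept
```

Clamping to 28 February instead would be an equally defensible choice, depending on what the dates represent.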

How do I extract the day of week from mm/dd/yyyy in a python dataframe

I have a non-index column in a python dataframe with a date like 02/03/2017. I would like to extract the day of the week and make it a separate column.
Actually, there is a solution using pandas.
import pandas as pd
your_df = pd.DataFrame(data={'Date': ['31/1/2018', '1/1/2018', '31/12/2018',
                                      '28/2/2016', '3/3/2035']})
your_df['Date'] = pd.to_datetime(your_df['Date'], format="%d/%m/%Y")
your_df['Day of week (int)'] = your_df['Date'].dt.weekday
your_df['Day of week (str)'] = your_df['Date'].dt.day_name()
print(your_df)
More info here: Create a day-of-week column in a Pandas dataframe using Python
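Since the question's dates are in mm/dd/yyyy form, the same approach needs a different format string. A short sketch, with the question's example date plus one invented value; the new column names are placeholders:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["02/03/2017", "12/25/2017"]})

# %m/%d/%Y reads the first number as the month: 02/03/2017 is 3 February 2017.
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["weekday_number"] = parsed.dt.weekday   # Monday = 0, Sunday = 6
df["weekday_name"] = parsed.dt.day_name()
print(df)
```

Read as dd/mm/yyyy, the first value would be 2 March 2017 instead, which falls on a different weekday, so the format string matters here.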
Notes as per my other (less elegant) answer...
You might also be interested in the arrow module since it offers quite a few features and advantages. Here I demonstrate its ability to provide weekday names in two forms for one locale, and in one form for a non-English locale.
>>> import arrow
>>> theDate = arrow.get('02/03/2017', 'DD/MM/YYYY')
>>> theDate
<Arrow [2017-03-02T00:00:00+00:00]>
>>> theDate.weekday()
3
>>> theDate.format('ddd', locale='en_GB')
'Thu'
>>> theDate.format('dddd', locale='en_GB')
'Thursday'
>>> theDate.format('dddd', locale='fr_FR')
'jeudi'
First you need to convert the date to a datetime object:
import datetime
date = datetime.datetime.strptime("02/03/2017", "%d/%m/%Y")
print(date.weekday())
See https://docs.python.org/2/library/datetime.html#module-datetime
The solution I've found is a two-step process, as I haven't been able to find a way to get weekday() to work on a pandas Series.
import pandas as pd
your_df = pd.DataFrame(data={'Date': ['31/1/2018', '1/1/2018', '31/12/2018',
                                      '28/2/2016', '3/3/2035']})
dt_series = pd.to_datetime(your_df['Date'], format="%d/%m/%Y")
dow = []
for dt in range(0, len(your_df)):
    dow.append(dt_series[dt].weekday())
your_df.insert(1, 'Day of week', dow)
print(your_df)
The output should look like this:
         Date  Day of week
0   31/1/2018            2
1    1/1/2018            0
2  31/12/2018            0
3   28/2/2016            6
4    3/3/2035            5
Notes:
I'm using dd/mm/yyyy format. You will need to change the format argument for to_datetime() if your dates are in U.S. or other formats.
Python weekdays: Monday = 0, Sunday = 6.
