I'm not getting the right answer for this task: the year 2061 is clearly wrong. I need to convert every year that is < 70 to 19XX instead of 20XX.
My data frame's date column: 2061-01-01, 2061-01-02, 2061-01-03...
Required answer: 1961-01-01, 1961-01-02, 1961-01-03...
My answer: 1983-05-06 19:59:05.224192, 1983-05-07 19:59:05.224192, 1983-05-08 19:59:05.224192...
My code (the dataframe is named data):
for i in pd.DatetimeIndex(data['DATE']).year:
    if i<2000:
        data['DATE']=data.DATE+pd.offsets.DateOffset(years=100)
Check it out:
from datetime import datetime , timedelta
data['DATE'] = data.apply(lambda row: datetime.strptime(row['DATE'], "%Y-%m-%d") - timedelta(days=100*365+25), axis = 1)
data will result in:
DATE
0 1961-01-01
1 1961-01-02
2 1961-01-03
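A vectorized variant, as a minimal sketch: parse the column once, then move only the rows whose year lies beyond a cutoff back by 100 calendar years with pd.DateOffset. The cutoff_year value and the extra 2005 row are assumptions for illustration; pick whatever cutoff matches your data.

```python
import pandas as pd

# Sample data; the last row is an assumed "already correct" date.
data = pd.DataFrame(
    {"DATE": ["2061-01-01", "2061-01-02", "2061-01-03", "2005-06-15"]}
)

dates = pd.to_datetime(data["DATE"])

# Assumption: any year after this cutoff is a two-digit year that was
# expanded to 20XX but really belongs to 19XX.
cutoff_year = 2030

# Move every date back 100 calendar years, then keep the shifted value
# only where the original year is beyond the cutoff.
shifted = dates - pd.DateOffset(years=100)
data["DATE"] = dates.where(dates.dt.year <= cutoff_year, shifted)

print(data)
```

Because DateOffset(years=100) works in calendar years, the result does not depend on how many leap days fall inside the interval, unlike subtracting a fixed number of days.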
Related
I have a data frame with a column Campaign whose values follow the format campaign name (start date - end date). I need to create 3 new columns by extracting the start and end dates:
start_date, end_date, days_between_start_and_end_date.
The issue is that the Campaign column values are not in a fixed format. For values like the one below, my code block works well.
1. Season1 hero (18.02. -24.03.2021)
In my code snippet I extract the start date and end date from the Campaign column. As you can see, the start date doesn't have a year, so I add the year by checking the month value.
import pandas as pd
import re
import datetime
# read csv file
df = pd.read_csv("report.csv")
# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d+\.\d+)\.\s*-\s*(\d+\.\d+\.\d+)')
df['start_date'] = dates[0]
df['end_date'] = dates[1]
# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')
# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m")) + "." + str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m")) + "." + str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
But I have many other column values where my regex fails and I get NaN values, for example:
1. Sales is on (30.12.21-12.01.2022)
2. Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3. M SALE (19.04 - 04.05.2022)
4. NEW SALE (29.12.2022-11.01.2023)
5. Year End (18.12. - 12.01.2023)
6. XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)
In all the examples above, the date format is different.
Expected output:
start date   end date
2021-12-30   2022-01-12
2023-03-24   2023-03-30
2022-04-19   2022-05-04
2022-12-29   2023-01-11
2022-12-18   2023-01-12
2021-11-18   2021-12-08
Can someone please help me here?
Since the datetimes in the data don't have a fixed format (some are dd.mm.yy, some are dd.mm.YYYY), it might be better if we apply a custom parser function that uses try-except. We can certainly do two conversions using pd.to_datetime and choose values using np.where etc. but it might not save any time given we need to do a lot of string manipulations beforehand.
To append the missing year for some rows we would need several pandas string methods (str.count(), str.cat(), etc.), and since those are not optimized, it's probably better to use Python string methods in a loop instead.
Also, iterrows() is very slow; a plain Python loop over the extracted values is much faster.
Since each element has to be parsed individually anyway, we can use datetime.strptime from the standard library to perform the conversions.
from datetime import datetime
def datetime_parser(date, end_date=None):
    # remove space around dates
    date = date.strip()
    # if the start date doesn't have year, append it from the end date
    dmy = date.split('.')
    if end_date and len(dmy) == 2:
        date = f"{date}.{end_date.rsplit('.', 1)[1]}"
    elif end_date and not dmy[-1]:
        edmy = end_date.split('.')
        if int(dmy[1]) > int(edmy[1]):
            date = f"{date}{int(edmy[-1])-1}"
        else:
            date = f"{date}{edmy[-1]}"
    try:
        # try 'dd.mm.YYYY' format (e.g. 29.12.2022) first
        return datetime.strptime(date, '%d.%m.%Y')
    except ValueError:
        # try 'dd.mm.yy' format (e.g. 30.12.21) if the above doesn't work out
        return datetime.strptime(date, '%d.%m.%y')
# extract dates into 2 columns (tentatively start and end dates)
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
# parse the dates
df[['start_date', 'end_date']] = [[datetime_parser(start, end), datetime_parser(end)] for start, end in splits]
# find difference
df['days_between_start_and_end_date'] = df['end_date'] - df['start_date']
I would do a basic regex with extract and then perform slicing:
ser = df["Campaign"].str.extract(r"\((.*)\)", expand=False)
end_date = ser.str.strip().str[-10:]
# or ser.str.strip().str.rsplit("-").str[-1]
start_date = ser.str.strip().str.split(r"\s*-\s*").str[0]
NB: You can assign the Series start_date and end_date to create your two new columns.
Output:
start_date, end_date
(1.0      30.12.21   # <- start_date
 2.0         24.03
 3.0         19.04
 4.0    29.12.2022
 Name: Campaign, dtype: object,
 1.0    12.01.2022   # <- end_date
 2.0    30.03.2023
 3.0    04.05.2022
 4.0    11.01.2023
 Name: Campaign, dtype: object)
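For comparison, the formats listed in the question can also be covered by one regular expression with an optional start year. This is only a sketch under stated assumptions: the names PATTERN, full_year and parse_campaign are made up for illustration, and two-digit years are assumed to belong to 20XX.

```python
import re
from datetime import date

# day.month[.][year] - day.month.year inside parentheses; the start year
# (and the dot before it) is optional.
PATTERN = re.compile(
    r"\((\d{1,2})\.(\d{1,2})\.?(\d{2,4})?\s*-\s*(\d{1,2})\.(\d{1,2})\.(\d{2,4})\)"
)

def full_year(text):
    # Assumption: a two-digit year such as "21" means 2021.
    year = int(text)
    return year + 2000 if year < 100 else year

def parse_campaign(name):
    """Return (start, end) as dates, or None if no date range is found."""
    match = PATTERN.search(name)
    if match is None:
        return None
    sd, sm, sy, ed, em, ey = match.groups()
    end = date(full_year(ey), int(em), int(ed))
    if sy is not None:
        start_year = full_year(sy)
    else:
        # No start year given: if the start month is later than the end
        # month, the campaign started in the previous year.
        start_year = end.year - 1 if int(sm) > int(em) else end.year
    return date(start_year, int(sm), int(sd)), end

campaigns = [
    "Sales is on (30.12.21-12.01.2022)",
    "Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)",
    "M SALE (19.04 - 04.05.2022)",
    "NEW SALE (29.12.2022-11.01.2023)",
    "Year End (18.12. - 12.01.2023)",
    "XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)",
]
parsed = [parse_campaign(c) for c in campaigns]
for (start, end), name in zip(parsed, campaigns):
    print(start, end, (end - start).days, name)
```

On a dataframe, the same function could be applied with df["Campaign"].map(parse_campaign); rows without a recognizable date range come back as None instead of raising.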
I have a dataframe in Python with several columns, one of which is date of birth of persons. The data type of the date of birth column is object. I would like to get the age of persons as an integer number.
For example: date of birth = 23.6.2005 gives (as today is 1.5.2021) age = 15 (years)
The ages are to be returned in a column of the dataframe.
It may work for you
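import pandas as pd
# parse the day-first strings (e.g. 23.6.2005), then compare against today at midnight
dob = pd.to_datetime(df["DOB"], format="%d.%m.%Y")
today = pd.Timestamp.today().normalize()
df["age"] = ((today - dob).dt.days // 365.25).astype(int)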
You can use datetime.date.today() to get the current date, and subtract the column from that, and then divide by a timedelta of one year for a reasonably accurate measurement.
import datetime
import pandas as pd
p = pd.DataFrame({'birthdate': [datetime.date(1969,5,21), datetime.date(1996, 8, 15), datetime.date(1981, 4, 30)]})
# birthdate
# 0 1969-05-21
# 1 1996-08-15
# 2 1981-04-30
p['age'] = (datetime.date.today() - p['birthdate']) // datetime.timedelta(days=365.25)
# birthdate age
# 0 1969-05-21 51
# 1 1996-08-15 24
# 2 1981-04-30 40
You can get the date difference by subtracting your date of birth column (converted from 'object' to datetime format) from pd.Timestamp.now(). Then floor-divide by pd.Timedelta(days=365.25), the length of an average year. (Older pandas versions also accepted np.timedelta64(1, 'Y') as the divisor, but recent versions reject year units because they do not represent a fixed duration.)
df['age'] = (pd.Timestamp.now() - pd.to_datetime(df['date of birth'], dayfirst=True)) // pd.Timedelta(days=365.25)
Rounding down to an integer age is achieved automatically by the floor division //.
Demo
import pandas as pd
df = pd.DataFrame({'date of birth': ['23.6.2005', '22.4.1995', '12.12.2002']})
df['age'] = (pd.Timestamp.now() - pd.to_datetime(df['date of birth'], dayfirst=True)) // pd.Timedelta(days=365.25)
print(df)
  date of birth  age
0     23.6.2005   15
1     22.4.1995   26
2    12.12.2002   18
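Dividing by an average year length is an approximation and can be off by one close to a birthday. A minimal sketch of an exact calendar comparison follows; the helper name age_in_years and the fixed today value (taken from the example in the question) are assumptions for illustration.

```python
import pandas as pd

def age_in_years(dob_strings, today):
    """Completed years between each date of birth and `today`."""
    dob = pd.to_datetime(dob_strings, format="%d.%m.%Y")
    # True where this year's birthday has already happened (or is today).
    had_birthday = (dob.dt.month < today.month) | (
        (dob.dt.month == today.month) & (dob.dt.day <= today.day)
    )
    # Subtract one year for everyone whose birthday is still ahead.
    return today.year - dob.dt.year - (~had_birthday).astype(int)

df = pd.DataFrame({"date of birth": ["23.6.2005", "22.4.1995", "12.12.2002"]})
today = pd.Timestamp("2021-05-01")  # fixed so the result is reproducible
df["age"] = age_in_years(df["date of birth"], today)
print(df)
```

In real use, today would be pd.Timestamp.today(); it is pinned here only so the ages match the question's example.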
I have a dataframe in the following format:
> buyer_id purch_id timestamp
> buyer_2 purch_2 1330767282
> buyer_3 purch_3 1330771685
> buyer_3 purch_4 1330778269
> buyer_4 purch_5 1330780256
> buyer_5 purch_6 1330813517
How can I convert the timestamp column (in the dataframe) into datetime and then extract only the time of the event into a new column?
Thanks!
Assuming 'timestamp' is Unix time (seconds since the epoch), you can pass it to pd.to_datetime with the right unit ('s') and then take the time part:
df['time'] = pd.to_datetime(df['timestamp'], unit='s').dt.time
df
Out[9]:
buyer_id purch_id timestamp time
0 buyer_2 purch_2 1330767282 09:34:42
1 buyer_3 purch_3 1330771685 10:48:05
2 buyer_3 purch_4 1330778269 12:37:49
3 buyer_4 purch_5 1330780256 13:10:56
4 buyer_5 purch_6 1330813517 22:25:17
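Note that the times above are in UTC, because Unix timestamps count seconds from 1970-01-01 UTC. If local wall-clock times are wanted, a sketch along these lines may help; the "Europe/Berlin" zone and the column names time_utc and time_local are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "buyer_id": ["buyer_2", "buyer_3"],
    "timestamp": [1330767282, 1330771685],
})

# Parse the seconds-since-epoch values as timezone-aware UTC datetimes.
utc = pd.to_datetime(df["timestamp"], unit="s", utc=True)
df["time_utc"] = utc.dt.time

# Assumption: the events should be shown in Berlin local time.
df["time_local"] = utc.dt.tz_convert("Europe/Berlin").dt.time

print(df)
```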
I am attempting to find records in my dataframe that are 30 days old or older. I pretty much have everything working but I need to correct the format of the Age column. Most everything in the program is stuff I found on stack overflow, but I can't figure out how to change the format of the delta that is returned.
import pandas as pd
import datetime as dt
file_name = '/Aging_SRs.xls'
sheet = 'All'
df = pd.read_excel(io=file_name, sheet_name=sheet)
df.rename(columns={'SR Create Date': 'Create_Date', 'SR Number': 'SR'}, inplace=True)
tday = dt.date.today()
tdelta = dt.timedelta(days=30)
aged = tday - tdelta
df = df.loc[df.Create_Date <= aged, :]
# Sets the SR as the index.
df = df.set_index('SR', drop = True)
# Created the Age column.
df.insert(2, 'Age', 0)
# Calculates the days between the Create Date and Today.
df['Age'] = df['Create_Date'].subtract(tday)
The calculation in the last line above gives me the result, but it looks like -197 days +09:39:12 and I need it to just be a positive number 197. I have also tried to search using the python, pandas, and datetime keywords.
df.rename(columns={'Create_Date': 'SR Create Date'}, inplace=True)
writer = pd.ExcelWriter('output_test.xlsx')
df.to_excel(writer)
writer.save()
I can't see your example data, but IIUC and you're just trying to get the absolute value of the number of days of a timedelta, this should work:
df['Age'] = abs(df['Create_Date'].subtract(tday).dt.days)
Explanation:
Given a dataframe with a timedelta column:
>>> df
delta
0 26523 days 01:57:59
1 -1601 days +01:57:59
You can extract just the number of days as an int using dt.days:
>>> df['delta'].dt.days
0 26523
1 -1601
Name: delta, dtype: int64
Then, all you need to do is wrap that in a call to abs to get the absolute value of that int:
>>> abs(df.delta.dt.days)
0 26523
1 1601
Name: delta, dtype: int64
Here is what I worked out for basically the same issue.
# create timestamp for today, normalize to 00:00:00
today = pd.to_datetime('today').normalize()
# match timezone with datetimes in df so subtraction works
today = today.tz_localize(df['posted'].dt.tz)
# create 'age' column for days old
df['age'] = (today - df['posted']).dt.days
Pretty much the same as the answer above, but without the call to abs().
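Putting the pieces from this thread together, a minimal self-contained sketch of the "age in days, keep records 30 days old or older" task might look like this. The sample rows and the fixed today value are assumptions so that the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({
    "SR": [101, 102, 103],
    "Create_Date": pd.to_datetime(
        ["2024-01-01 09:30", "2024-03-10 14:00", "2024-03-30 08:15"]
    ),
})

# Fixed "today" for the example; in real use: pd.Timestamp.today().normalize()
today = pd.Timestamp("2024-04-01")

# Subtract the older date from the newer one so the result is positive,
# and drop the time of day first so only whole days are counted.
df["Age"] = (today - df["Create_Date"].dt.normalize()).dt.days

# Keep only records that are at least 30 days old.
aged = df[df["Age"] >= 30].set_index("SR")
print(aged)
```

Subtracting in the order today - Create_Date makes the call to abs() unnecessary, and normalizing both sides avoids results such as "-197 days +09:39:12".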
I am essentially trying to take the data in the Date column of my dataframe and subtract it from today's date in order to get the timedelta (which I will store in a new column). The issue I am running into is that if a Date value is formatted incorrectly or is not a date at all, it either causes my program to crash or, when I try to handle the error, messes up the other rows' data. Here is my code:
def add_delta_to_dataframe():
    df = create_messages_dataframe()
    date = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors="ignore")
    now = datetime.datetime.today()
    try:
        delta = ((date - now).dt.days) + 1
        df['Delta'] = delta
    except TypeError:
        pass
    return df
I have also tried to iterate through:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors="ignore")
now = datetime.datetime.today()
for index, row in df.iterrows():
    try:
        delta = ((row['Date'] - now).days) + 1
        df['Delta'] = delta
    except TypeError:
        continue
But no luck here either. Any ideas on doing this would be greatly appreciated. I either get an error if I don't catch it, or the output leaves all Delta values as NaN. My expected output would have the Delta value for the rows with a correctly formatted date, and NaN for the others.
IIUC, you can leverage the errors='coerce' argument of pd.to_datetime, which will set unformattable strings to NaT. Take the following df for an example:
df = pd.DataFrame({'date':['1999-01-01', 'xyz', '2000-05-05']})
>>> df
date
0 1999-01-01
1 xyz
2 2000-05-05
You can create your timedelta-like column using:
df['my_timedelta'] = pd.to_datetime('today') - pd.to_datetime(df['date'], errors='coerce')
Which results in:
>>> df
date my_timedelta
0 1999-01-01 7066 days
1 xyz NaT
2 2000-05-05 6576 days
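Applied to the function from the question, the same idea could look roughly like the sketch below. The name add_delta, the sample rows and the fixed now value are assumptions for illustration; the '%m/%d/%Y' format is the one used in the question.

```python
import pandas as pd

def add_delta(df, now):
    """Add a 'Delta' column: whole days from `now` to each valid date."""
    out = df.copy()
    # Anything that does not match the format becomes NaT instead of raising.
    dates = pd.to_datetime(out["Date"], format="%m/%d/%Y", errors="coerce")
    # NaT propagates through the subtraction and ends up as NaN in Delta.
    out["Delta"] = (dates - now.normalize()).dt.days
    return out

df = pd.DataFrame({"Date": ["04/11/2024", "not a date", "13/45/2024", "03/31/2024"]})
result = add_delta(df, pd.Timestamp("2024-04-01"))
print(result)
```

Because `now` is normalized to midnight here, the + 1 adjustment from the question is not needed; rows with bad dates simply carry NaN without affecting the others.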