I am attempting to find records in my dataframe that are 30 days old or older. I have almost everything working, but I need to correct the format of the Age column. Most of the program is pieced together from things I found on Stack Overflow, but I can't figure out how to change the format of the delta that is returned.
import pandas as pd
import datetime as dt
file_name = '/Aging_SRs.xls'
sheet = 'All'
df = pd.read_excel(io=file_name, sheet_name=sheet)
df.rename(columns={'SR Create Date': 'Create_Date', 'SR Number': 'SR'}, inplace=True)
tday = dt.date.today()
tdelta = dt.timedelta(days=30)
aged = tday - tdelta
df = df.loc[df.Create_Date <= aged, :]
# Sets the SR as the index.
df = df.set_index('SR', drop = True)
# Created the Age column.
df.insert(2, 'Age', 0)
# Calculates the days between the Create Date and Today.
df['Age'] = df['Create_Date'].subtract(tday)
The calculation in the last line above gives me the result, but it looks like -197 days +09:39:12 and I need it to just be a positive number 197. I have also tried to search using the python, pandas, and datetime keywords.
df.rename(columns={'Create_Date': 'SR Create Date'}, inplace=True)
writer = pd.ExcelWriter('output_test.xlsx')
df.to_excel(writer)
writer.save()
I can't see your example data, but IIUC and you're just trying to get the absolute value of the number of days of a timedelta, this should work:
df['Age'] = abs(df['Create_Date'].subtract(tday).dt.days)
Explanation:
Given a dataframe with a timedelta column:
>>> df
delta
0 26523 days 01:57:59
1 -1601 days +01:57:59
You can extract just the number of days as an int using dt.days:
>>> df['delta'].dt.days
0 26523
1 -1601
Name: delta, dtype: int64
Then, all you need to do is wrap that in a call to abs to get the absolute value of that int:
>>> abs(df.delta.dt.days)
0 26523
1 1601
Name: delta, dtype: int64
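Putting this together for the question's use case, here is a minimal sketch. The sample rows and the fixed reference date are made up for illustration. Subtracting the creation date from today, rather than the other way round, gives positive ages directly, so abs is not needed in this variant.

```python
import pandas as pd

# Hypothetical sample data standing in for the spreadsheet in the question.
df = pd.DataFrame({
    "SR": [101, 102, 103],
    "Create_Date": pd.to_datetime(["2024-01-01", "2024-05-20", "2024-06-10"]),
})

# A fixed reference date keeps the example reproducible; in real use this
# would be something like pd.Timestamp.today().normalize().
today = pd.Timestamp("2024-06-15")

# today - Create_Date is a timedelta column; .dt.days keeps the whole days.
df["Age"] = (today - df["Create_Date"]).dt.days

# Keep only the records that are 30 days old or older.
aged = df[df["Age"] >= 30]
print(aged)
```

Normalizing the reference timestamp to midnight avoids the time-of-day remainder that shows up as "+09:39:12" in the question.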
Here is what I worked out for basically the same issue.
# create timestamp for today, normalize to 00:00:00
today = pd.to_datetime('today').normalize()
# match timezone with datetimes in df so subtraction works
today = today.tz_localize(df['posted'].dt.tz)
# create 'age' column for days old
df['age'] = (today - df['posted']).dt.days
Pretty much the same as the answer above, but without the call to abs().
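For a reproducible version of the timezone-aware case, here is a small sketch. The 'posted' values and the fixed reference date are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical timezone-aware timestamps (UTC) standing in for df['posted'].
posted = pd.Series(pd.to_datetime(["2024-06-01 12:00", "2024-06-10 08:30"], utc=True))

# Fixed, tz-naive reference date so the result does not depend on when this runs.
today = pd.Timestamp("2024-06-15")

# Give the reference date the same timezone as the column before subtracting.
today_utc = today.tz_localize(posted.dt.tz)

# Whole days elapsed; any partial day is dropped by .dt.days.
age = (today_utc - posted).dt.days
print(age.tolist())
```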
Related
I have a dataframe that contains arrival dates for vessels, and I want Python to determine the current year and month and remove all entries that are prior to the current month and year.
I have a column with the date itself in the format '%d/%b/%Y', and separate columns for month and year if needed.
For instance, if today is 01/01/2022, I'd like to remove everything from Dec/2021 and earlier.
Using pandas periods and boolean indexing:
# set up example
df = pd.DataFrame({'date': ['01/01/2022', '08/02/2022', '09/03/2022'], 'other_col': list('ABC')})
# find dates equal or greater to this month
keep = (pd.to_datetime(df['date'], dayfirst=False)
.dt.to_period('M')
.ge(pd.Timestamp('today').to_period('M'))
)
# filter
out = df[keep]
Output:
date other_col
1 08/02/2022 B
2 09/03/2022 C
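Since the output above depends on the day the code is run, here is a sketch of the same period comparison with a pinned reference date and the '%d/%b/%Y' format from the question. The sample rows, the column names, and the reference date are assumptions, and '%b' expects English month abbreviations under the default locale.

```python
import pandas as pd

# Hypothetical rows in the '%d/%b/%Y' format described in the question.
df = pd.DataFrame({
    "date": ["15/Dec/2021", "01/Jan/2022", "20/Feb/2022"],
    "vessel": list("ABC"),
})

# Pinned "today" so the example is reproducible; use pd.Timestamp('today') in real code.
reference = pd.Timestamp("2022-01-01")

# Convert each date to a monthly period and compare it with the reference month.
parsed = pd.to_datetime(df["date"], format="%d/%b/%Y")
keep = parsed.dt.to_period("M") >= reference.to_period("M")

out = df[keep]
print(out)
```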
from datetime import datetime
import pandas as pd
df = ...
# assuming your date column is named 'date'
t = datetime.utcnow()
# compare against the first day of the current month, so every row from this month is kept
df = df[pd.to_datetime(df.date) >= datetime(t.year, t.month, 1)]
Let us consider this example dataframe:
import pandas as pd
import datetime
data = [['nao victoria', '21/Feb/2012'], ['argo', '6/Jun/2022'], ['kon tiki', '23/Aug/2022']]
df = pd.DataFrame(data, columns=['Vessel', 'Date'])
You can convert your dates to datetimes, by using pandas' to_datetime method; for instance, you may save the output into a new Series (column):
df['Datetime']=pd.to_datetime(df['Date'], format='%d/%b/%Y')
You end up with the following dataframe:
Vessel Date Datetime
0 nao victoria 21/Feb/2012 2012-02-21
1 argo 6/Jun/2022 2022-06-06
2 kon tiki 23/Aug/2022 2022-08-23
You can then reject rows containing datetime values that are smaller than today's date, defined using datetime's now method:
df = df[df.Datetime > datetime.datetime.now()]
This returns:
Vessel Date Datetime
2 kon tiki 23/Aug/2022 2022-08-23
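Note that comparing against now() drops everything before the current moment, including earlier days of the current month. If the goal is to keep the whole current month, as the question asks, one option is to compare against the first day of the month. This is only a sketch: the pinned 'now' value is an assumption made so the result is reproducible.

```python
import pandas as pd

data = [["nao victoria", "21/Feb/2012"], ["argo", "6/Jun/2022"], ["kon tiki", "23/Aug/2022"]]
df = pd.DataFrame(data, columns=["Vessel", "Date"])
df["Datetime"] = pd.to_datetime(df["Date"], format="%d/%b/%Y")

# Assumed current time, pinned for the example; use pd.Timestamp.now() in real code.
now = pd.Timestamp("2022-06-15 10:30")

# Midnight on the first day of the current month.
month_start = now.normalize().replace(day=1)

# Keep rows from the current month onwards.
current = df[df["Datetime"] >= month_start]
print(current)
```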
This is what my dataframe looks like:
datetime open high low close
2006-01-02 4566.95 4601.35 4542.00 4556.25
2006-01-03 4531.45 4605.45 4531.45 4600.25
2006-01-04 4619.55 4707.60 4616.05 4694.14
.
.
.
Need to calculate the Monthly Returns in %
Formula: (Month Closing Price - Month Open Price) / Month Open Price
I can't seem to get the open price and closing price of a month, because in my df most months don't have an entry for the 1st of the month, so I'm having trouble calculating it.
Any help would be very much appreciated!
You need to use the groupby and agg functions to get the first and last value of each column in each month:
import pandas as pd
df = pd.read_csv("dt.txt")
df["datetime"] = pd.to_datetime(df["datetime"])
df.set_index("datetime", inplace=True)
resultDf = df.groupby([df.index.year, df.index.month]).agg(["first", "last"])
resultDf["new_column"] = (resultDf[("close", "last")] - resultDf[("open", "first")])/resultDf[("open", "first")]
resultDf.index.rename(["year", "month"], inplace=True)
resultDf.reset_index(inplace=True)
resultDf
The code above results in a dataframe with MultiIndex columns. So, if you want to get, for example, the rows for the year 2010, you can do something like:
resultDf[resultDf["year"] == 2010]
You can create a custom grouper as follows:
import pandas as pd
import numpy as np
from io import StringIO
csvfile = StringIO(
"""datetime\topen\thigh\tlow\tclose
2006-01-02\t4566.95\t4601.35\t4542.00\t4556.25
2006-01-03\t4531.45\t4605.45\t4531.45\t4600.25
2006-01-04\t4619.55\t4707.60\t4616.05\t4694.14""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df.datetime = pd.to_datetime(df.datetime, format = "%Y-%m-%d")
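# group rows by calendar month (pandas 2.2 and later prefer freq='ME' for month end)
dg = df.groupby(pd.Grouper(key='datetime', freq='M'))
Each group in dg then holds one month of rows, and since the datetime column has been converted to pandas datetimes, we can use ordinary arithmetic and comparisons on it: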
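def monthly_return(datetime, close_value, open_value):
    # positions (not labels) of the earliest and latest rows in the group
    index_start = np.argmin(datetime)
    index_end = np.argmax(datetime)
    # use .iloc because argmin/argmax return positions within the group
    return (close_value.iloc[index_end] - open_value.iloc[index_start]) / open_value.iloc[index_start]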
dg.apply(lambda x : monthly_return(x.datetime, x.close, x.open))
Out[97]:
datetime
2006-01-31 0.02785
Freq: M, dtype: float64
Of course, a purely functional approach is possible instead of using the monthly_return function.
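One way to do this without a helper function is named aggregation on a monthly period key. This is a sketch under two assumptions: the rows are already sorted by date (so "first" and "last" really are the month's first and last trading days), and the February rows below are invented to show a second month.

```python
import pandas as pd

# First three rows come from the question; the February rows are made up.
df = pd.DataFrame(
    {
        "open": [4566.95, 4531.45, 4619.55, 4700.00, 4710.00],
        "close": [4556.25, 4600.25, 4694.14, 4720.00, 4650.00],
    },
    index=pd.to_datetime(
        ["2006-01-02", "2006-01-03", "2006-01-04", "2006-02-01", "2006-02-02"]
    ),
)

# Group by calendar month; take the first open and the last close in each month.
monthly = df.groupby(df.index.to_period("M")).agg(
    open=("open", "first"),
    close=("close", "last"),
)

# (month close - month open) / month open, expressed in percent.
monthly["return_pct"] = (monthly["close"] - monthly["open"]) / monthly["open"] * 100
print(monthly)
```

Grouping on a period key also sidesteps the choice between the 'M' and 'ME' frequency aliases, which differs between pandas versions.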
I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I understand, if you have a start date and time and an end date and time, you can use Python's datetime module.
Something like this would work:
from datetime import datetime
# variable = datetime(year, month, day, hour, minute, second)
start = datetime(2017,5,8,18,56,40)
end = datetime(2019,6,27,12,30,58)
print( end - start ) # prints the difference between the two datetimes as a timedelta
Hope this answer helps you.
Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
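To illustrate why the format string matters, here is a small sketch. The helper name parse_month_year and the sample strings are made up for the example; only strptime and its format codes come from the standard library.

```python
from datetime import datetime

def parse_month_year(text):
    # '%m-%Y' matches strings such as '01-2021' (month, dash, four-digit year).
    return datetime.strptime(text, "%m-%Y")

# With '%d' alone, strptime consumes '01' as the day and then fails on the rest.
try:
    datetime.strptime("01-2021", "%d")
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# With a matching format, subtraction gives a timedelta with a .days attribute.
diff = parse_month_year("07-2010") - parse_month_year("04-2009")
print(diff.days)
```

Because the strings carry no day, each one is parsed as the first of the month, so the difference is measured between month starts.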
Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
Example:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
For the dummy example, that gives:
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
Regarding the difference in months, there are several suggested ways to calculate it. I'd go with @piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
(df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64
I am essentially trying to take the data in the Date column of my dataframe and subtract today's date from it in order to get a timedelta (which I will store in a new column). The issue I am running into is that if a Date value is formatted incorrectly or is not a date at all, it either crashes my program or, when I try to handle the error, messes up the other rows' data. Here is my code:
def add_delta_to_dataframe():
    df = create_messages_dataframe()
    date = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors="ignore")
    now = datetime.datetime.today()
    try:
        delta = ((date - now).dt.days) + 1
        df['Delta'] = delta
    except TypeError:
        pass
    return df
I have also tried to iterate through:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors="ignore")
now = datetime.datetime.today()
for index, row in df.iterrows():
    try:
        delta = ((row['Date'] - now).days) + 1
        df['Delta'] = delta
    except TypeError:
        continue
But no luck here either. Any ideas on doing this would be greatly appreciated. I either get an error if I don't catch it, or the output leaves all Delta values as NaN. My expected output would have the Delta value filled in for rows with a correctly formatted date, and NaN for the others.
IIUC, you can leverage the errors='coerce' argument of pd.to_datetime, which will set unparseable strings to NaT. Take the following df as an example:
df = pd.DataFrame({'date':['1999-01-01', 'xyz', '2000-05-05']})
>>> df
date
0 1999-01-01
1 xyz
2 2000-05-05
You can create your timedelta-like column using:
df['my_timedelta'] = pd.to_datetime('today') - pd.to_datetime(df['date'], errors='coerce')
Which results in:
>>> df
date my_timedelta
0 1999-01-01 7066 days
1 xyz NaT
2 2000-05-05 6576 days
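Applied to the question's setup, this could look like the sketch below. The sample values and the fixed 'now' are assumptions for illustration; the '%m/%d/%Y' format and the 'Date' and 'Delta' column names come from the question.

```python
import pandas as pd

# Hypothetical messages data: two valid dates, one non-date, one wrong format.
df = pd.DataFrame({"Date": ["06/20/2024", "not a date", "2024-06-25", "07/01/2024"]})

# Fixed reference date for reproducibility; use pd.Timestamp.today().normalize() in real code.
now = pd.Timestamp("2024-06-15")

# Anything that does not match the format becomes NaT instead of raising.
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y", errors="coerce")

# NaT propagates through the subtraction, so bad rows end up as NaN in Delta.
df["Delta"] = (parsed - now).dt.days
print(df)
```

No try/except or row-by-row loop is needed, because the invalid rows never raise and the valid rows are unaffected by them.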
I have a dataframe that contains the columns company_id, seniority, join_date and quit_date. I am trying to extract the number of days between join date and quit date. However, I get NaNs.
If I drop off all the columns in the dataframe except for quit date and join date and run the same code again, I get what I expect. However with all the columns, I get NaNs.
Here's my code:
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['days'] = df['days'].astype(str)
df1 = pd.DataFrame(df.days.str.split(' ').tolist(), columns = ['days', 'unwanted', 'stamp'])
df['numberdays'] = df1['days']
This is what I get:
days numberdays
585 days 00:00:00 NaN
340 days 00:00:00 NaN
I want 585 from the 'days' column in the 'numberdays' column. Similarly for every such row.
Can someone help me with this?
Thank you!
Instead of converting to string, extract the number of days from the timedelta value using the dt accessor.
import pandas as pd
df = pd.DataFrame({'join_date': ['2014-03-24', '2013-04-29', '2014-10-13'],
'quit_date':['2015-10-30', '2014-04-04', '']})
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['number_of_days'] = df['days'].dt.days
@Mohammad Yusuf Ghazi points out that when working with datetime data rather than timedelta data, the accessor is dt.day (the day of the month), not dt.days.
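The difference between the two accessors can be seen in a short sketch. It uses the first two rows of the example above; the variable names are made up.

```python
import pandas as pd

join_date = pd.to_datetime(pd.Series(["2014-03-24", "2013-04-29"]))
quit_date = pd.to_datetime(pd.Series(["2015-10-30", "2014-04-04"]))

# Subtracting two datetime columns gives a timedelta column.
tenure = quit_date - join_date

# On datetime data, .dt.day is the day of the month.
day_of_month = quit_date.dt.day

# On timedelta data, .dt.days is the number of whole days in the interval.
whole_days = tenure.dt.days

print(day_of_month.tolist())
print(whole_days.tolist())
```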