I have a pandas dataframe:
id age
001 1 hour
002 2 hours
003 2 days
004 4 days
Age refers to how long the item has been in the database. What I would like to do is print the date on which each item was added to the database.
So if the age column contains the string "hour" or "hours", I want to print the current date; otherwise, I want to subtract the number of days from the current date.
The desired output should look like this:
id age insertion_date
001 1 hour 2018-09-18
002 2 hours 2018-09-18
003 2 days 2018-09-16
004 4 days 2018-09-14
I am using Python 2.7 and so far this is what I have achieved.
import pandas as pd
from datetime import date, timedelta
for index, row in df.iterrows():
    age = row["age"]
    if "days" in age:
        # Remove days and convert data type of age column
        df["age"] = df["age"].astype("str").str.replace('[^\d\.]', '')
        # deduct current date by number of days
        df["insertion_date"] = df["age"].astype("int64").apply(lambda x: date.today() - timedelta(x))
    else:
        # print current date
        df["insertion_date"] = date.today()
The output from the code above looks like this:
id age insertion_date
001 1 2018-09-17
002 2 2018-09-16
003 2 2018-09-16
004 4 2018-09-14
The issue with this code is that the current date is not written to the insertion_date column, even when the age column contains "hour" or "hours".
I would appreciate it if someone could point out where I went wrong, so that I can get the desired output: the current date in insertion_date when age contains "hour" or "hours", and otherwise the current date minus the number of days in age.
You can floor the current timestamp to midnight with Timestamp.floor, then subtract the timedeltas created by to_timedelta and floored to whole days with Series.dt.floor:
df['new'] = pd.Timestamp.today().floor('D') - pd.to_timedelta(df['age']).dt.floor('D')
print (df)
id age new
0 1 1 hour 2018-09-18
1 2 2 hours 2018-09-18
2 3 2 days 2018-09-16
3 4 4 days 2018-09-14
print (df['new'].dtypes)
datetime64[ns]
Let's do a little timedeltarithmetic:
df['insertion_date'] = (
pd.to_datetime('today') - pd.to_timedelta(df.age).dt.floor('D')).dt.date
df
id age insertion_date
0 1 1 hour 2018-09-18
1 2 2 hours 2018-09-18
2 3 2 days 2018-09-16
3 4 4 days 2018-09-14
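For reference, the loop in the question assigns to the entire insertion_date column in both branches rather than to the current row, so once a "days" row is reached, every row has its number subtracted as days. Below is a minimal sketch of the vectorized approach from the answers above. It assumes a fixed reference date (the hypothetical variable today) instead of the real current date, so that the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame({"id": ["001", "002", "003", "004"],
                   "age": ["1 hour", "2 hours", "2 days", "4 days"]})

# Assumption: a fixed "today" for reproducibility; use pd.Timestamp.today().floor("D") in practice
today = pd.Timestamp("2018-09-18")

# Ages shorter than one day floor to 0 days, so those rows keep today's date
whole_days = pd.to_timedelta(df["age"]).dt.floor("D")
df["insertion_date"] = (today - whole_days).dt.date
print(df)
```

Because the whole column is computed in one expression, no row-by-row branching on "hour" versus "days" is needed.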
I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a year_week column, but my year starts with the week of 2021-06-28, which is the week that contains the first day of July.
I tried:
from datetime import datetime, timedelta
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date'] - timedelta(days=datetime(2021, 6, 24).timetuple()
                                          .tm_yday)).dt.isocalendar().week
I adjusted the number of days in the timedelta until 2021-06-28 got a value of 1.
But then I ran into problems with dates before my start date and with dates more than one year after it:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the result is only valid for the one-year period starting at 2021-06-28.
date year_week
2021-03-12 38 # LY38
2021-03-17 39 # LY39
2021-06-28 1 # correct
...
2022-05-21 47 # correct
2022-08-17 8 # NY8
Is there a way around this? Since I am aggregating the data by year week, the earlier and later dates give me incorrect results. I would like either negative week numbers for the days before 2021-06-28, or a label such as LY38 to denote week 38 of the previous year; likewise, either week numbers above 52, or a label such as NY8 to denote week 8 of the next year.
Here is one way; I added two dates that are more than a year away from the start date. Shift each date back by the day of year of your reference date and take isocalendar() of the result. Then compare the resulting ISO year with the year of your reference date to tell the scenarios apart, and use np.select to build the matching label prefix.
import numpy as np
import pandas as pd
# dummy dataframe
df = pd.DataFrame(
    {'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
              '2022-05-21', '2022-08-17', '2023-08-17']
    }
)
# define start date
d = pd.to_datetime('2021-6-24')
# subtract the start date's day of year from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)
    ).dt.isocalendar()
# get the difference in years
m = (s['year'].astype('int32') - d.year)
# all conditions for the result, depending on the year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY','NY',(m+1).astype(str)+'LY', '+'+(m-1).astype(str)+'NY']
# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
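If plain numbers are preferred over LY/NY labels (the question also mentions negative values and values above 52), a different and simpler technique is to count whole weeks from the start date. This is only a sketch and not the method used in the answer above; the column name week_offset is made up for illustration, and the convention that the week starting 2021-06-28 is week 1 is an assumption taken from the question:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2021-03-12", "2021-06-28", "2022-05-21", "2022-08-17"]})
df["date"] = pd.to_datetime(df["date"])

# First day of the custom year (taken from the question)
start = pd.Timestamp("2021-06-28")

# Whole weeks elapsed since the start date, shifted so that the first week is 1.
# Floor division keeps earlier dates at 0 or below instead of wrapping around.
df["week_offset"] = (df["date"] - start).dt.days // 7 + 1
print(df)
```

Under this convention 2022-05-21 still gets 47, 2022-08-17 gets 60 (that is, 52 + 8), and dates before the start date get values of 0 or below, so aggregating by this column never merges weeks from different years.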
I think pandas period_range can be of some help:
n_weeks = 52  # number of weeks you want
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n_weeks))
I would like to get the number of days before the end of the month, from a string column representing a date.
I have the following pandas dataframe :
df = pd.DataFrame({'date':['2019-11-22','2019-11-08','2019-11-30']})
df
date
0 2019-11-22
1 2019-11-08
2 2019-11-30
I would like the following output :
df
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
The pd.tseries.offsets.MonthEnd offset with rollforward seemed like a good fit, but I can't figure out how to use it to transform a whole column.
Subtract the day of the month, extracted with Series.dt.day, from the number of days in the month, given by Series.dt.daysinmonth:
df['date'] = pd.to_datetime(df['date'])
df['days_end_month'] = df['date'].dt.daysinmonth - df['date'].dt.day
Or add offsets.MonthEnd(0) to roll each date forward to its month end, subtract the original date, and convert the resulting timedeltas to days with Series.dt.days:
df['days_end_month'] = (df['date'] + pd.offsets.MonthEnd(0) - df['date']).dt.days
print (df)
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
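As a self-contained sketch, the two approaches above can be run side by side on the sample data to confirm that they agree. The intermediate names by_length and by_offset are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2019-11-22", "2019-11-08", "2019-11-30"]})
df["date"] = pd.to_datetime(df["date"])

# Approach 1: length of the month minus the day of the month
by_length = df["date"].dt.days_in_month - df["date"].dt.day

# Approach 2: roll forward to the month end (MonthEnd(0) leaves a date
# that is already a month end unchanged), then take the difference in days
by_offset = (df["date"] + pd.offsets.MonthEnd(0) - df["date"]).dt.days

df["days_end_month"] = by_length
print(df)
```

The first approach only does integer arithmetic on two datetime components, which is usually the simpler choice when just the day count is needed.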
I'm starting from a dataframe that has a start date and an end date, for instance:
ID START END A
0 2014-04-09 2014-04-15 5
1 2018-06-05 2018-07-01 8
2 2018-06-05 2018-07-01 7
And I'm trying to find, for each week, how many elements were started but not ended at that point.
For instance, in the DF above:
Week-Monday N
2014-04-07 1
2014-04-14 1
2014-04-21 0
...
2018-06-04 2
...
Something like the below doesn't quite work, since it only resamples on end date:
df = df.resample("W-Mon", on="END").sum()
I don't know how to integrate both conditions: that the occurrences be after the start date, yet before the end date.
You can start from here:
import pandas as pd
df = pd.DataFrame({'ID':[0,1,2],
'START':['2014-04-09', '2018-06-05', '2018-06-05'],
'END':['2014-04-15', '2018-07-01', '2018-07-01'],
'A':[5,8,7]})
1- Find the week number for each START and each END, and find the Week-Monday (the Monday of the START week).
import datetime, time
from datetime import timedelta
df.loc[:,'startWeek'] = df.START.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').isocalendar()[1])
df.loc[:,'endWeek'] = df.END.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').isocalendar()[1])
df.loc[:, 'Week-Monday'] = df.START.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d')- timedelta(days=datetime.datetime.strptime(x,'%Y-%m-%d').weekday()))
2- Check whether the two week numbers are the same; if they are, the item ended during the week in which it started.
def endedNotSameWeek(row):
    if row['startWeek']!=row['endWeek']:
        return 1
    return 0
df.loc[:,'NotSameWeek'] = df.apply(endedNotSameWeek, axis=1)
print(df)
Output:
ID START END A startWeek endWeek Week-Monday NotSameWeek
0 0 2014-04-09 2014-04-15 5 15 16 2014-04-07 1
1 1 2018-06-05 2018-07-01 8 23 26 2018-06-04 1
2 2 2018-06-05 2018-07-01 7 23 26 2018-06-04 1
3- Group by Week-Monday to get the number of cases that did not end during the same week.
df.groupby('Week-Monday')['NotSameWeek'].sum().reset_index(name='N')
Week-Monday N
0 2014-04-07 1
1 2018-06-04 2
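The steps above only report the week in which each item started. To also cover the weeks in between, as in the expected output of the question, one possible sketch is to list every Monday-based week that each item touches and count them. This is a different technique from the answer above, and it assumes that an item counts as open in any week between its START week and its END week, inclusive (which matches the 2014-04-14 row of the expected output). The names open_weeks, all_weeks and n_open are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ID": [0, 1, 2],
                   "START": ["2014-04-09", "2018-06-05", "2018-06-05"],
                   "END": ["2014-04-15", "2018-07-01", "2018-07-01"],
                   "A": [5, 8, 7]})
start = pd.to_datetime(df["START"])
end = pd.to_datetime(df["END"])

# Monday of the week containing each date (weekday() is 0 for Monday)
start_monday = start - pd.to_timedelta(start.dt.weekday, unit="D")
end_monday = end - pd.to_timedelta(end.dt.weekday, unit="D")

# One entry per (item, week) pair in which the item is open
open_weeks = pd.DatetimeIndex(
    [week
     for s, e in zip(start_monday, end_monday)
     for week in pd.date_range(s, e, freq="7D")]
)

# Count items per week; weeks without any open item get 0
all_weeks = pd.date_range(start_monday.min(), end_monday.max(), freq="7D")
n_open = pd.Series(open_weeks).value_counts().reindex(all_weeks, fill_value=0)
print(n_open.head())
```

Listing one entry per item and week is simple but grows with the length of each interval, so for very long intervals a cumulative count of starts minus ends per week may be preferable.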
I was wondering how we can categorize a timestamp column in a DataFrame into a Day / Night column based on the time of day.
I am trying to do so, but I am unable to create a new column with the same number of entries.
d_call["time"] = d_call["timestamp"].apply(lambda x: x.time())
d_call["time"].head(1)
0 17:10:52
Name: time, dtype: object
def day_night(name):
    for i in name:
        if i.hour > 17:
            return "night"
        else:
            return "day"
day_night(d_call["time"])
'day'
d_call["Day / Night"]= d_call["time"].apply(lambda x: day_night(x))
I want a value for every row of the column, but I am getting one only for the first index.
You can extract the hour from each timestamp and assign the category based on it; you can also use other conditions to cover a different range of hours.
Consider this df:
0 2018-06-18 15:05:52.246
1 2018-05-24 21:44:07.903
2 2018-06-06 21:00:19.635
3 2018-05-24 21:44:37.883
4 2018-05-30 11:19:36.546
5 2018-05-25 11:16:07.969
6 2018-05-24 21:43:35.077
7 2018-06-07 18:39:00.258
Name: modified_at, dtype: datetime64[ns]
df['day/night'] = df.modified_at.apply(lambda x: 'night' if x.hour > 19 else 'day')
Out:
0 day
1 night
2 night
3 night
4 day
5 day
6 night
7 day
Name: modified_at, dtype: object
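For context, calling day_night on the whole column in the question returns as soon as the first element has been checked, which is why only one value comes back. A vectorized sketch of the same idea is shown below; it reuses the answer's threshold of hour > 19 (an arbitrary cut-off, and one that labels early-morning hours as "day") and uses np.where instead of apply:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"modified_at": pd.to_datetime([
    "2018-06-18 15:05:52.246", "2018-05-24 21:44:07.903",
    "2018-06-06 21:00:19.635", "2018-05-24 21:44:37.883",
    "2018-05-30 11:19:36.546", "2018-05-25 11:16:07.969",
    "2018-05-24 21:43:35.077", "2018-06-07 18:39:00.258",
])})

# .dt.hour yields the hour for every row at once; np.where picks a label per row
df["day/night"] = np.where(df["modified_at"].dt.hour > 19, "night", "day")
print(df)
```

The result has one label per row, so it can be assigned directly as a new column.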
I want to convert dates into quarters. I've used:
x['quarter'] = x['date'].dt.quarter
date quarter
0 2013-1-1 1
But the quarter numbering starts over at 1 for the next year:
date quarter
366 2014-1-1 1
Instead of 1, I want the quarter (expected result) to be 5:
date quarter
366 2014-1-1 5
...
date quarter
731 2015-1-1 9
You can use a simple arithmetic operation: add four quarters for every year that has passed since the starting year.
starting_year = 2013
df['quarter'] = df.year.dt.quarter + (df.year.dt.year - starting_year)*4
year quarter
0 2013-01-01 1
0 2014-01-01 5
0 2015-01-01 9
0 2016-01-01 13
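A self-contained sketch of the same formula, including two mid-year dates to show that it also works away from January. The sample dates and the column name date are chosen for illustration; starting_year is the variable from the answer above:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2013-01-01", "2013-11-30", "2014-01-01", "2014-08-15", "2015-01-01",
])})

starting_year = 2013

# Quarter within the year (1 to 4) plus 4 for every full year since starting_year
df["quarter"] = df["date"].dt.quarter + (df["date"].dt.year - starting_year) * 4
print(df)
```

Deriving starting_year from the data, for example with df["date"].dt.year.min(), would avoid hard-coding it.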