Pandas - math operation on H:M:S string format - python

I have some time data where, for each month, I need to subtract the smallest value (first row) from the largest value (last row). The HOURS column is in string (object) format though, and I don't know how to convert it properly and then get it back into the current format. The end result needs to be displayed as H:M:S. The data looks as follows:
MACHINE HOURS MONTH
M400 54:56:00 December
M400 61:54:52 December
M400 75:38:52 December
M400 89:21:09 December
M400 13:44:00 November
M400 27:28:00 November
M400 41:12:00 November
The end result I'm looking for is:
MACHINE HOURS MONTH
M400 34:25:09 December
M400 27:28:00 November
What is the fastest way to convert this (I'm assuming to datetime format), do the math, then reverse back?

One way to achieve this is shown below, although there might be a more efficient way.
def convert(h):
    # convert an "H:M:S" string to total seconds
    h = list(map(int, h.split(':')))
    return h[0]*3600 + h[1]*60 + h[2]

def convert_back(t):
    # convert total seconds back to an "H:MM:SS" string (zero-padded minutes and seconds)
    m, s = divmod(t, 60)
    h, m = divmod(m, 60)
    return f"{h}:{m:02d}:{s:02d}"

# column names match the sample data (HOURS, MONTH)
df['time'] = df['HOURS'].apply(convert)
final = (df.groupby('MONTH')['time'].max() - df.groupby('MONTH')['time'].min()).apply(convert_back)
df_final = df.groupby('MONTH').max()
df_final['HOURS'] = final
df_final is what you are looking for.
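As an alternative sketch, pandas can parse H:M:S strings directly with pd.to_timedelta, including hour counts above 24, which avoids the manual conversion to seconds. The helper name format_hms and the variable names below are illustrative assumptions, not part of the original question; the data is the sample from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "MACHINE": ["M400"] * 7,
    "HOURS": ["54:56:00", "61:54:52", "75:38:52", "89:21:09",
              "13:44:00", "27:28:00", "41:12:00"],
    "MONTH": ["December"] * 4 + ["November"] * 3,
})

def format_hms(td):
    # Hypothetical helper: render a Timedelta as H:MM:SS with uncapped hours
    total = int(td.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

# Parse the strings into timedeltas, then take max - min per machine and month
elapsed = pd.to_timedelta(df["HOURS"])
grouped = elapsed.groupby([df["MACHINE"], df["MONTH"]])
span = grouped.max() - grouped.min()

result = span.apply(format_hms).rename("HOURS").reset_index()
print(result)
```

Grouping on both MACHINE and MONTH keeps the machine column in the output without relying on max() over strings.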

Related

Python function to get or count the number of days between two years in a pandas data frame

I need a function that fills the 'days' column of a dataframe with the running day count between a start date of 1st Jan 1995 and an end date of 31st Dec 2019, taking leap years into account.
For example:
1st Jan 1995 - Day 1
1st Feb 1995 - Day 32
2nd Feb 1995 - Day 33...
And so on all the way to 31st Dec 2019.
This is the function I created initially but it doesn't work.
prices is the name of the data frame and 'days' is the column where the number of days is to reflect.
def date_difference(self):
    for i in range(prices.shape[0] - 1):
        prices['days'][i+1] = (prices['days'][i+1] - prices['days'][i])
Convert types
First of all, make sure that the days column is the proper type. Use df.days.dtype and it should be datetime64. If you get object type that means you have a string containing a date and you need to convert the type using
df.days = pd.to_datetime(df.days)
Calculate difference
df['days_diff'] = (df.days - pd.Timestamp('1995-01-01')).dt.days
Also, I would recommend renaming the column to date, because it contains dates. You can then assign the day counts to a separate column called days. This keeps the code clearer and easier to maintain.
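A minimal sketch of the two steps above, assuming a small hypothetical prices frame with a date column. The + 1 is an assumption added to match the question's numbering, where 1st Jan 1995 is Day 1 rather than Day 0.

```python
import pandas as pd

# Hypothetical sample; the real frame would hold the full date range
prices = pd.DataFrame({"date": ["1995-01-01", "1995-02-01", "1995-02-02", "2019-12-31"]})

# Step 1: make sure the column is datetime64, not object
prices["date"] = pd.to_datetime(prices["date"])

# Step 2: elapsed days since the start date, shifted so the start date is day 1
start = pd.Timestamp("1995-01-01")
prices["days"] = (prices["date"] - start).dt.days + 1

print(prices)
```

Leap years are handled by the datetime subtraction itself, so no special casing is needed.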
I finally got it to work by doing this:
def date_difference(last_day):
    last_day = pd.to_datetime(last_day, dayfirst = True)
    first_day = pd.to_datetime("01/01/1995", dayfirst = True)
    diff = last_day - first_day
    # 1 Jan 1995 counts as day 1; drop the + 1 for a zero-based offset
    return diff.days + 1

prices['days'] = prices['days'].apply(date_difference)

Converting Julian to calendar date using pandas

I am trying to convert Julian codes to calendar dates in pandas using :
pd.to_datetime(43390, unit = 'D', origin = 'Julian')
This is giving me ValueError: origin Julian cannot be converted to a Timestamp
You need to set origin = 'julian'
pd.to_datetime(43390, unit = 'D', origin = 'julian')
but this number (43390) throws
OutOfBoundsDatetime: 43390 is Out of Bounds for origin='julian'
because the bounds are from 2333836 to 2547339
(Timestamp('1677-09-21 12:00:00') to Timestamp('2262-04-11 12:00:00'))
Method 1 - using 'julian' as the origin didn't work for this value.
Method 2 - treat the number as an Excel serial date, so every value is counted in days from Excel's default start date. This worked for me:
pd.to_datetime(43390, unit = 'D', origin=pd.Timestamp("1899-12-30"))
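The same idea applied to a whole column, as a sketch. It assumes the numbers really are Excel serial dates; the series name serials is made up for illustration.

```python
import pandas as pd

# Hypothetical column of Excel serial dates
serials = pd.Series([43101, 43390, 43831])

# Count days from 1899-12-30, the origin that lines up with Excel serials
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")

print(dates.dt.strftime("%Y-%m-%d").tolist())
# → ['2018-01-01', '2018-10-17', '2020-01-01']
```

Note that Excel treats 1900 as a leap year, so this origin is only reliable for serials from March 1900 onward.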
Below code works only for 6 digit julian value. It also handles the calendar date for leap and non-leap years.
A Julian date is considered as "CYYDDD". Where C represents century, YY represents Year and DDD represents total days which are then further defined in Days and Months.
import pandas as pd
from datetime import datetime, timedelta

jul_date = '120075'
add_days = int(jul_date[3:6])
# century digit C maps to 19 + C, so '1' followed by '20' gives the year 2020
year = str(19 + int(jul_date[0:1])) + jul_date[1:3]
cal_date = pd.to_datetime(datetime.strptime(year + '-01-01', '%Y-%m-%d')) - timedelta(1) + pd.DateOffset(days=add_days)
print(cal_date.strftime('%Y-%m-%d'))
output: 2020-03-15
without timedelta(1): 2020-03-16
Here datetime.strptime function is being used to cast date type from string to date.
%Y represents year in 4 digit (1980)
%m & %d represents month and day in digits.
strftime('%Y-%m-%d') is used to remove timestamp from the date.
timedelta(1): subtracts one day from the date. The year is concatenated with '01-01', which is already day 1 of the year, so adding DDD days without this correction would overshoot by one day.

Converting inconsistently formatted string dates to datetime in pandas

I have a pandas dataframe in which the date information is a string with the month and year:
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
Note that the month is usually written as the 3-letter abbreviation, but is sometimes written as the full month name for June and July.
I would like to convert this into a datetime format which assumes each date is on the first of the month:
date = [06-01-2017, 07-01-2017, 08-01-2018, 11-01-2019]
Edit to provide more information:
Two main issues I wasn't sure how to handle:
Month is not in a consistent format. I tried to solve this by taking just the first three characters of the string.
Year is given as the last two digits only, and I was struggling to specify that it belongs to the 2000s without the code getting very messy.
I have tried a dozen different things that didn't work, most recent attempt is below:
df['date'] = pd.to_datetime(dict(year = df['Record Month'].astype(str).str[-2:], month = df['Record Month'].astype(str).str[0:3], day=1))
This has the error "Unable to parse string "JUN" at position 0".
If you are not sure of the many spellings that can show up then a dictionary mapping would not work. Perhaps your best chance is to split and slice so you normalize into year and month columns and then build the date.
If date is a list as in your example.
date = [d.split() for d in date]
df = pd.DataFrame([[m[:3].lower(), '20' + y] for m, y in date],
# df = pd.DataFrame([[s.split()[0][:3].lower(), '20' + s.split()[1]] for s in date],
                  columns=['month', 'year'])
Then pass a mapper to series.replace as in
df.month = df.month.replace({'jan': 1, 'feb': 2 ...})
Then parse the dates from its components
# first cap the date to the first day of the month
df['day'] = 1
df = pd.to_datetime(df)
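Put together, the split, map, and assemble steps might look like the sketch below. Building the mapper from calendar.month_abbr is my assumption rather than part of the answer, and it only yields English abbreviations under an English (or C) locale. The names month_numbers and parts are illustrative.

```python
import calendar
import pandas as pd

date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]

# Map 'jan'..'dec' to 1..12; month_abbr[0] is an empty string and is skipped
month_numbers = {abbr.lower(): i for i, abbr in enumerate(calendar.month_abbr) if abbr}

# Normalize each entry to a 3-letter lowercase month and a 4-digit year
parts = pd.DataFrame(
    [[m[:3].lower(), 2000 + int(y)] for m, y in (d.split() for d in date)],
    columns=["month", "year"],
)
parts["month"] = parts["month"].map(month_numbers)
parts["day"] = 1  # pin every date to the first of the month

# to_datetime assembles dates from the year/month/day columns
dates = pd.to_datetime(parts)
print(dates)
```

An unrecognized month spelling maps to NaN here, which makes to_datetime fail loudly instead of producing a wrong date.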
You were close with using pandas.to_datetime(). Instead of using a dictionary though, you could just reformat the date strings to a more standard format. If you convert each date string into MMMYY format (pretty similar to what you were doing) you can pass the strftime format "%b%y" to to_datetime() and it will convert the strings into dates.
import pandas as pd
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
df = pd.DataFrame(date, columns=["Record Month"])
df['date'] = pd.to_datetime(df["Record Month"].str[:3] + df["Record Month"].str[-2:], format='%b%y')
print(df)
Produces the following result:
  Record Month date
0 JUN 17 2017-06-01
1 JULY 17 2017-07-01
2 AUG 18 2018-08-01
3 NOV 19 2019-11-01

Get number of days in a specific month that are in a date range

Haven't been able to find an answer to this problem. Basically what I'm trying to do is this:
Take a daterange, for example October 10th to November 25th. What is the best algorithm for determining how many of the days in the daterange are in October and how many are in November.
Something like this:
def daysInMonthFromDaterange(daterange, month):
# do stuff
return days
I know that this is pretty easy to implement, I'm just wondering if there's a very good or efficient algorithm.
Thanks
Borrowing the algorithm from the answer to "How do I divide a date range into months in Python?", this might work. The inputs here are date strings parsed with strptime; if you already have datetime objects, use them directly:
import datetime
begin = '2018-10-10'
end = '2018-11-25'
dt_start = datetime.datetime.strptime(begin, '%Y-%m-%d')
dt_end = datetime.datetime.strptime(end, '%Y-%m-%d')
one_day = datetime.timedelta(1)
start_dates = [dt_start]
end_dates = []
today = dt_start
# stop before dt_end so a range ending on the last day of a month
# does not open an extra, empty segment for the following month
while today < dt_end:
    tomorrow = today + one_day
    if tomorrow.month != today.month:
        start_dates.append(tomorrow)
        end_dates.append(today)
    today = tomorrow
end_dates.append(dt_end)
out_fmt = '%d %B %Y'
for start, end in zip(start_dates, end_dates):
    # .days excludes the end date itself; add 1 to count both endpoints
    diff = (end - start).days
    print('{} to {}: {} days'.format(start.strftime(out_fmt), end.strftime(out_fmt), diff))
result:
10 October 2018 to 31 October 2018: 21 days
01 November 2018 to 25 November 2018: 24 days
The problem as stated may not have a unique answer. For example what should you get from daysInMonthFromDaterange('Feb 15 - Mar 15', 'February')? That will depend on the year!
But if you substitute actual days, I would suggest converting from dates to integer days, using the first of the month to the first of the next month as your definition of a month. This is now reduced to intersecting intervals of integers, which is much easier.
The assumption that the first of the month always happened deals with months of different lengths, variable length months, and even correctly handles the traditional placement of the switch from the Julian calendar to the Gregorian. See cal 1752 for that. (It will not handle that switch for all locations though. Should you be dealing with a library that does Romanian dates in 1919, you could have a problem...)
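The interval-intersection idea can be sketched with date ordinals as below. The function name and its signature are hypothetical, and I am assuming both endpoints of the range count as days in the range.

```python
from datetime import date

def days_in_month_from_range(start, end, year, month):
    # A month is the half-open interval [first of month, first of next month)
    month_start = date(year, month, 1).toordinal()
    if month == 12:
        month_end = date(year + 1, 1, 1).toordinal()
    else:
        month_end = date(year, month + 1, 1).toordinal()

    # The date range as a half-open interval; + 1 makes `end` inclusive
    lo = max(start.toordinal(), month_start)
    hi = min(end.toordinal() + 1, month_end)

    # Empty intersections come out negative, so clamp to zero
    return max(0, hi - lo)

print(days_in_month_from_range(date(2018, 10, 10), date(2018, 11, 25), 2018, 10))  # → 22
print(days_in_month_from_range(date(2018, 10, 10), date(2018, 11, 25), 2018, 11))  # → 25
```

This runs in constant time regardless of how long the range is, and passing the year explicitly resolves the February ambiguity mentioned above.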
You can use the datetime module:
from datetime import datetime
start = datetime(2018,10,10)
end = datetime(2018,11,25)
print((end - start).days)
Something like this would work:
from datetime import date

def daysInMonthFromDaterange(date1, date2, month):
    # every ordinal day in [date1, date2) whose year and month match `month`
    return [x for x in range(date1.toordinal(), date2.toordinal()) if date.fromordinal(x).year == month.year and date.fromordinal(x).month == month.month]

print(len(daysInMonthFromDaterange(date(2018, 10, 10), date(2018, 11, 25), date(2018, 10, 1))))
This just loops through all the days between date1 and date2, and returns it as part of a list if it matches the year and month of the third argument.

How can I subtract two dates in Python?

So basically what I want to do is subtract the date of birth from today's date in order to get a person's age. I've successfully done this, but I can only get it to show the person's age in days.
dateofbirth = 19981128
dateofbirth = list(str(dateofbirth))
now = datetime.date.today()
yr = dateofbirth[:4]
yr = ''.join(map(str, yr))
month = dateofbirth[4:6]
month = ''.join(map(str, month))
day = dateofbirth[6:8]
day = ''.join(map(str, day))
birth = datetime.date(int(yr), int(month), int(day))
age = now - birth
print(age)
In this case, age comes out as days. Is there any way to get it as xx years, xx months and xx days?
You can use strptime:
>>> import datetime
>>> datetime.datetime.strptime('19981128', '%Y%m%d')
datetime.datetime(1998, 11, 28, 0, 0)
>>> datetime.datetime.now() - datetime.datetime.strptime('19981128', '%Y%m%d')
datetime.timedelta(5823, 81486, 986088)
>>> print (datetime.datetime.now() - datetime.datetime.strptime('19981128', '%Y%m%d'))
5823 days, 22:38:18.039365
The result of subtracting two dates in Python is a timedelta object, which just represents a duration. It doesn't "remember" when it starts, and so it can't tell you how many months have elapsed.
Consider that the period from 1st January to 1st March is "two months", and the period from 1st March to 29th April is "1 month and 28 days", but in a non-leap year they're both the same duration, 59 days. Actually, daylight savings, but let's not make this any more complicated than it needs to be to make the point ;-)
There may be a third-party library that helps you, but as far as standard Python libraries are concerned, AFAIK you'll have to roll your sleeves up and do it yourself by finding the differences of the day/month/year components of the two dates in turn. Of course, the month and day differences might be negative numbers so you'll have to deal with those cases. Recall how you were taught to do subtraction in school, and be very careful when carrying numbers from the month column to the days column, to use the correct number of days for the relevant month.
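One way to do the bookkeeping with only the standard library is sketched below. Rather than borrowing between columns, it counts whole calendar months forward from the birth date and then measures the leftover days, which sidesteps negative intermediate values. The names add_months and age_parts are hypothetical, and clamping the 29th to 31st to the end of shorter months is a convention I am assuming, not the only reasonable one.

```python
import calendar
from datetime import date

def add_months(d, n):
    # Shift d by n calendar months, clamping the day to the target month's length
    year, month_index = divmod(d.year * 12 + (d.month - 1) + n, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def age_parts(birth, today):
    # Largest number of whole months that still lands on or before `today`
    months = (today.year - birth.year) * 12 + (today.month - birth.month)
    if add_months(birth, months) > today:
        months -= 1
    anchor = add_months(birth, months)
    years, months = divmod(months, 12)
    return years, months, (today - anchor).days

# The two 59-day periods from the explanation above, in a non-leap year
print(age_parts(date(2019, 1, 1), date(2019, 3, 1)))   # → (0, 2, 0)
print(age_parts(date(2019, 3, 1), date(2019, 4, 29)))  # → (0, 1, 28)
```

For the date of birth in the question, age_parts(date(1998, 11, 28), date.today()) returns the years, months and days as a tuple.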
