I've a dataframe named mainData_Frame with column 'Clearance Time' in the following format.
Clearance Time
2 days 22:43:00
1 days 12:32:23
5 days 23:44:13
.
.
.
.
I need to convert all the timedelta values in the 'Clearance Time' column into numeric hours. This is the code I tried:
import pandas as pd
import datetime
def convert_to_hours(time_string):
    time = datetime.datetime.strptime(time_string, '%d days %H:%M:%S')
    delta = datetime.timedelta(days=time.day, hours=time.hour, minutes=time.minute, seconds=time.second)
    return delta.total_seconds()/3600
timeData = pd.Series(mainData_Frame['Clearance Time'])
num_Hours = timeData.apply(convert_to_hours)
print(num_Hours)
But I got the following error:
TypeError: strptime() argument 1 must be str, not Timedelta
I'm a beginner in Python and need this for a project, so please help me sort it out.
No need for a function here; use pandas built-ins:
ser = pd.to_timedelta(mainData_Frame["Clearance Time"])
timeData = ser.dt.total_seconds().div(3600)
Output :
print(timeData)
0 70.716667
1 36.539722
2 143.736944
Name: Clearance Time, dtype: float64
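The TypeError in the question suggests the column already holds Timedelta values rather than strings, which is why strptime rejects them. As a minimal sketch of an equivalent conversion, dividing the series by a one-hour Timedelta gives the same numbers; the small series below is a stand-in for mainData_Frame['Clearance Time'], built only from the three sample rows shown in the question.

```python
import pandas as pd

# Stand-in for mainData_Frame['Clearance Time'] (sample rows from the question)
clearance = pd.Series(
    pd.to_timedelta(["2 days 22:43:00", "1 days 12:32:23", "5 days 23:44:13"]),
    name="Clearance Time",
)

# Dividing a timedelta series by a one-hour Timedelta yields float hours
hours = clearance / pd.Timedelta(hours=1)
print(hours.round(6))
```

Either form works; the division just avoids the intermediate seconds step.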
Inconsistent date formats
As shown in the photo above, the check-in and check-out dates are inconsistent. Whenever I try to convert the entire series to datetime using df['Check-in date'] = pd.to_datetime(df['Check-in date'], errors='coerce') and df['Check-out date'] = pd.to_datetime(df['Check-out date'], errors='coerce'), the days and months get mixed up. I don't really know what to do now. I also tried splitting the days, months and years and rearranging them, but still had no luck.
My goal here is to get the total number of nights each guest stayed, but because of the inconsistency I end up with negative totals.
I'd appreciate any help here. Thanks!
You can try different formats with strptime and return a DateTime object if any of them works.
from datetime import datetime
import pandas as pd
def try_different_formats(value):
    only_date_format = "%d/%m/%Y"
    date_and_time_format = "%Y-%m-%d %H:%M:%S"
    try:
        return datetime.strptime(value, only_date_format)
    except ValueError:
        pass
    try:
        return datetime.strptime(value, date_and_time_format)
    except ValueError:
        return pd.NaT
In your example:
df = pd.DataFrame({'Check-in date': ['19/02/2022','2022-02-12 00:00:00']})
Check-in date
0 19/02/2022
1 2022-02-12 00:00:00
The apply method runs this function on every value of the Check-in date
column. The result is a column of datetime values.
df['Check-in date'].apply(try_different_formats)
0 2022-02-19
1 2022-02-12
Name: Check-in date, dtype: datetime64[ns]
For a more pandas-specific solution you can check out this answer.
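A pandas-only variant of the same idea might look like the sketch below: parse the column once per known format with errors='coerce', then fill the gaps of the first pass with the second. The helper name parse_mixed and the check-out values are made up for illustration, and the sketch assumes only the two formats seen in the example occur.

```python
import pandas as pd

def parse_mixed(series):
    """Parse strings that are either DD/MM/YYYY or YYYY-MM-DD HH:MM:SS."""
    # Values that do not match the given format become NaT
    day_first = pd.to_datetime(series, format="%d/%m/%Y", errors="coerce")
    iso_style = pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Keep the first successful parse for each row
    return day_first.fillna(iso_style)

# Hypothetical sample data in the two formats from the example
df = pd.DataFrame({
    "Check-in date": ["19/02/2022", "2022-02-12 00:00:00"],
    "Check-out date": ["2022-02-21 00:00:00", "15/02/2022"],
})

check_in = parse_mixed(df["Check-in date"])
check_out = parse_mixed(df["Check-out date"])

# Whole nights between check-in and check-out
nights = (check_out - check_in).dt.days
print(nights)
```

Because every format is stated explicitly, days and months are never guessed, which is what produced the negative stays.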
My CSV data looks like this -
Date Time
1/12/2019 12:04AM
1/12/2019 12:09AM
1/12/2019 12:14AM
and so on
And I am trying to read this file using pandas in the following way -
import pandas as pd
import numpy as np
data = pd.read_csv('D 2019.csv',parse_dates=[['Date','Time']])
print(data['Date_Time'].dt.month)
When I try to access the year through the dt accessor the year prints out fine as 2019.
But when I try to print the day or the month it is completely incorrect. In the case of month it starts off as 1 and ends up as 12 when the right value should be 12 all the time.
With the day it starts off as 12 and ends up at 31 when it should start at 1 and end in 31. The file has total of 8867 entries. Where am I going wrong ?
The default format is MM/DD, while yours is DD/MM.
The simplest solution is to set the dayfirst parameter of read_csv:
dayfirst : DD/MM format dates, international and European format (default False)
data = pd.read_csv('D 2019.csv', parse_dates=[['Date', 'Time']], dayfirst=True)
# -------------
>>> data['Date_Time'].dt.month
# 0 12
# 1 12
# 2 12
# Name: Date_Time, dtype: int64
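The effect of dayfirst can be seen on a single value. This is a small sketch using pd.to_datetime rather than read_csv, with the date string taken from the question:

```python
import pandas as pd

# Without dayfirst, an ambiguous date is read as month/day/year
month_first = pd.to_datetime("1/12/2019")
# With dayfirst=True, the same string is read as day/month/year
day_first = pd.to_datetime("1/12/2019", dayfirst=True)

print(month_first.month, month_first.day)
print(day_first.month, day_first.day)
```

Dates such as 13/12/2019 cannot be month-first at all, so without dayfirst the parser switches interpretation partway through the file, which matches the drifting month values described in the question.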
Try passing the format argument to pd.to_datetime. Use %I (12-hour clock) together with %p, since %H ignores the AM/PM marker:
df = pd.read_csv('D 2019.csv')
df["Date_Time"] = pd.to_datetime(df["Date"] + " " + df["Time"], format='%d/%m/%Y %I:%M%p')
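As a self-contained sketch of the explicit-format approach, the snippet below reads a small in-memory CSV instead of 'D 2019.csv'. The comma-separated layout and the third row are assumptions made for illustration. It joins the Date and Time columns and uses %I with %p so that 12:04AM is parsed as four minutes past midnight.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the real delimiter may differ
csv_text = "Date,Time\n1/12/2019,12:04AM\n1/12/2019,12:09AM\n31/12/2019,11:55PM\n"
df = pd.read_csv(io.StringIO(csv_text))

# Combine the two text columns and parse with an explicit day-first format
df["Date_Time"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%d/%m/%Y %I:%M%p"
)

print(df["Date_Time"].dt.month.tolist())
print(df["Date_Time"].dt.hour.tolist())
```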
You need to check the data type of your dataframe and convert the column "Date" into datetime
df["Date"] = pd.to_datetime(df["Date"])
Afterwards, you can access the day, month, or year using:
dt.day
dt.month
dt.year
Note: Make sure you know the format of the date (D/M/Y or M/D/Y); for D/M/Y data, pass dayfirst=True.
Full Code
import pandas as pd
import numpy as np
data = pd.read_csv('D 2019.csv')
data["Date"] = pd.to_datetime(data["Date"], dayfirst=True)  # dates in this file are D/M/Y
print(data["Date"].dt.day)
print(data["Date"].dt.month)
print(data["Date"].dt.year)
I have to calculate mean() of time column, but this column type is string, how can I do it?
id time
1 1h:2m
2 1h:58m
3 35m
4 2h
...
You can use regex to extract hours and minutes. To calculate the mean time in minutes:
h = df['time'].str.extract(r'(\d{1,2})h').fillna(0).astype(int)
m = df['time'].str.extract(r'(\d{1,2})m').fillna(0).astype(int)
(h * 60 + m).mean()
Result:
0 83.75
dtype: float64
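For a scalar result rather than a one-row frame, a variant of the same extraction could pass expand=False so that each extract returns a Series. The sketch below rebuilds the sample frame from the question and also converts the mean back into a Timedelta; filling missing parts with the string "0" is a choice made here so the column stays text until astype(int).

```python
import pandas as pd

df = pd.DataFrame({"time": ["1h:2m", "1h:58m", "35m", "2h"]})

# expand=False returns a Series; missing parts are filled with "0"
hours = df["time"].str.extract(r"(\d+)h", expand=False).fillna("0").astype(int)
minutes = df["time"].str.extract(r"(\d+)m", expand=False).fillna("0").astype(int)

total_minutes = hours * 60 + minutes
mean_minutes = total_minutes.mean()

# Convert the mean number of minutes into a Timedelta for display
mean_duration = pd.to_timedelta(mean_minutes, unit="m")
print(mean_minutes, mean_duration)
```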
It's largely inspired by "How to construct a timedelta object from a simple string", but you can do it as below:
import re
import datetime
import pandas as pd
def convertToSecond(time_str):
    # Optional hours, minutes and seconds parts, e.g. '1h:2m' or '35m'
    regex = re.compile(r'((?P<hours>\d+?)h)?:*((?P<minutes>\d+?)m)?:*((?P<seconds>\d+?)s)?')
    parts = regex.match(time_str)
    if not parts:
        return
    parts = parts.groupdict()
    time_params = {}
    for (name, param) in parts.items():
        if param:
            time_params[name] = int(param)
    return datetime.timedelta(**time_params).total_seconds()
df = pd.DataFrame({
    'time': ['1h:2m', '1h:58m', '35m', '2h'],})
df['inSecond'] = df['time'].apply(convertToSecond)
mean_inSecond = df['inSecond'].mean()
print(f"Mean of Time Column: {datetime.timedelta(seconds=mean_inSecond)}")
Result:
Mean of Time Column: 1:23:45
Another possibility is to convert your string column into timedelta (since they don't seem to be times but rather durations?).
Since your strings are not all formatted equally, you unfortunately cannot use pandas' to_timedelta function. However, parser from dateutil has a fuzzy option that you can use to convert your column to datetime. If you subtract midnight today from that, you get the value as a timedelta.
import pandas as pd
from dateutil import parser
from datetime import date
from datetime import datetime
df = pd.DataFrame([[1,'1h:2m'],[2,'1h:58m'],[3,'35m'],[4,'2h']],columns=['id','time'])
today = date.today()
midnight = datetime.combine(today, datetime.min.time())
df['time'] = df['time'].apply(lambda x: (parser.parse(x, fuzzy=True)) - midnight)
This will convert your dataframe like this (print(df)):
id time
0 1 01:02:00
1 2 01:58:00
2 3 00:35:00
3 4 02:00:00
from which you can calculate the mean using print(df['time'].mean()):
0 days 01:23:45
Full example: https://ideone.com/Aze9mR
I have a DataFrame that is indexed by date and has daily data.
As described, I wish to group and aggregate this data by calendar month start minus 2 business days. My idea is to use groupby and MonthBegin with a 2-day BDay offset to do this.
When I try to run the code
import pandas as pd
import pandas.tseries.offsets as of
days = of.MonthBegin() - of.BDay(2)
g = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
I get an error
TypeError: Argument 'other' has incorrect type (expected datetime.datetime, got BusinessDay)
Perhaps I need to use the rollback method on MonthBegin but when I try
days = of.MonthBegin()
days.rollback(of.BDay(2))
g_df = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
TypeError: Cannot convert input [<2 * BusinessDays>] of type to Timestamp
Does anyone have any ideas how to correctly use the offsets to groupby MonthBegin - 2BDay ?
It is hard to tell what you want to achieve without any of your data, but here is how you could do it:
df = pd.DataFrame({"dates": ["2018-01-02", "2018-01-03", "2018-02-02", "2018-01-04"],
"vals": [10, 20, 10, 5]})
df.groupby((pd.to_datetime(df.dates) - of.MonthBegin() - of.BDay(2)).dt.month).vals.sum()
Output:
dates
1 10
12 35
Name: vals, dtype: int64
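Grouping on .dt.month alone merges the same month from different years. If that matters, one possible variant is to group on a year-month period of the shifted dates instead. This is a sketch on the same sample data, and it is only one reading of what "MonthBegin minus 2 business days" should mean.

```python
import pandas as pd
from pandas.tseries.offsets import BDay, MonthBegin

df = pd.DataFrame({
    "dates": pd.to_datetime(["2018-01-02", "2018-01-03", "2018-02-02", "2018-01-04"]),
    "vals": [10, 20, 10, 5],
})

# Roll each date back to a month start, then back two business days
shifted = df["dates"] - MonthBegin() - BDay(2)

# Group on the year-month of the shifted date so years stay separate
result = df.groupby(shifted.dt.to_period("M"))["vals"].sum()
print(result)
```

Here the three January dates land in 2017-12 and the February date lands in 2018-01, so the sums match the output above while keeping the year in the label.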
I am trying to find the day difference between today, and dates in my dataframe.
Below is my conversion of dates in my dataframe
df['Date']=pd.to_datetime(df['Date'])
Below is my code to get today
today1=dt.datetime.today().strftime('%Y-%m-%d')
today1=pd.to_datetime(today1)
Both are converted to pandas.to_datetime, but when I do subtraction, the below error came out.
ValueError: Cannot add integral value to Timestamp without offset.
Can someone help to advise? Thanks!
This is a simple example how you can do this:
import pandas as pd
import datetime as dt
First, you have to get today.
today1=dt.datetime.today().strftime('%Y-%m-%d')
today1=pd.to_datetime(today1)
Then, you can construct the data frame:
df = pd.DataFrame({'Date': pd.to_datetime('2016-11-24 11:03:10.050000'), 'today1': today1 }, index = [0])
In this example I just have 2 columns, each with one value.
Next, you should check the data types:
print(df.dtypes)
Date datetime64[ns]
today1 datetime64[ns]
If both data types are datetime64[ns], you can then subtract df.Date from df.today1.
print(df.today1 - df.Date)
The output:
0 19 days 12:56:49.950000
dtype: timedelta64[ns]
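If only whole days are needed, the .dt.days accessor on the resulting timedelta column gives integers. The sketch below uses made-up dates and a fixed reference date so the result is reproducible; in practice the reference would be something like pd.Timestamp.today().normalize().

```python
import pandas as pd

# Hypothetical sample dates
df = pd.DataFrame({"Date": ["2016-11-24", "2016-12-01"]})
df["Date"] = pd.to_datetime(df["Date"])

# Fixed reference date for a reproducible example;
# replace with pd.Timestamp.today().normalize() for "today at midnight"
reference = pd.Timestamp("2016-12-14")

# Subtracting gives timedelta64 values; .dt.days extracts whole days
df["days_since"] = (reference - df["Date"]).dt.days
print(df)
```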