Here is a sample with date format:
import pandas as pd

data = pd.DataFrame({'Quarter': ['Q1_01', 'Q2_01', 'Q3_01', 'Q4_01', 'Q1_02', 'Q2_02'],
                     'Sale': [10, 20, 30, 40, 50, 60]})
print(data)
# Quarter Sale
#0 Q1_01 10
#1 Q2_01 20
#2 Q3_01 30
#3 Q4_01 40
#4 Q1_02 50
#5 Q2_02 60
print(data.dtypes)
# Quarter object
# Sale int64
I would like to convert the Quarter column into a pandas datetime format like 'Jan-2001' or '01-2001' that can be used in fbProphet for time series analysis.
I tried using strptime but got TypeError: strptime() argument 1 must be str, not Series:
from datetime import datetime
data['Quarter'] = datetime.strptime(data['Quarter'], 'Q%q_%y')
What is the cause of the error? Is there a better solution?
The error occurs because datetime.strptime expects a single string, not a whole Series (and %q is not a supported strptime directive anyway). It helps to know the format to_datetime accepts for quarterly periods (it is along the lines of YYYY-QX), so we start with str.replace, then to_datetime, and finally strftime:
u = data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True)
pd.to_datetime(u).dt.strftime('%b-%Y')
0 Jan-2001
1 Apr-2001
2 Jul-2001
3 Oct-2001
4 Jan-2002
5 Apr-2002
Name: Quarter, dtype: object
The month represents the start of its respective quarter.
If the dates can range across the 90s and the 2000s, then let's try something different:
df = pd.DataFrame({'Quarter':['Q1_98','Q2_99', 'Q3_01', 'Q4_01', 'Q1_02','Q2_02']})
dt = pd.to_datetime(df.Quarter.str.replace(r'(Q\d)_(\d+)', r'\2-\1', regex=True))
(dt.where(dt <= pd.to_datetime('today'), dt - pd.DateOffset(years=100))
.dt.strftime('%b-%Y'))
0 Jan-1998
1 Apr-1999
2 Jul-2001
3 Oct-2001
4 Jan-2002
5 Apr-2002
Name: Quarter, dtype: object
pd.to_datetime auto-parses "98" as "2098", so we apply a small fix: subtract 100 years from any parsed date that falls after today's date.
This hack will stop working in a few decades. Ye pandas gods, have mercy on my soul :-)
Another option is parsing to PeriodIndex:
(pd.PeriodIndex(data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True), freq='Q')
   .strftime('%b-%Y'))
# Index(['Mar-2001', 'Jun-2001', 'Sep-2001',
# 'Dec-2001', 'Mar-2002', 'Jun-2002'], dtype='object')
Here, the months printed out are at the ends of their respective quarters. You decide what to use.
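Since the question mentions fbProphet: Prophet expects an actual datetime column (conventionally named 'ds') rather than a formatted string, so converting the parsed periods to timestamps may be more useful. A minimal sketch, assuming the two-digit years all fall in the 2000s and that Sale is the series to forecast (periods and prophet_df are just illustrative names):
periods = pd.PeriodIndex(data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True), freq='Q')
# Prophet wants a 'ds' (datetime) column and a 'y' (value) column
prophet_df = pd.DataFrame({'ds': periods.to_timestamp(),  # start of each quarter
                           'y': data.Sale})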
Related
I have a pandas Series which consists of strings like '169:21:5:24' and '54:9:19:29', which stand for 169 days 21 hours 5 minutes 24 seconds and 54 days 9 hours 19 minutes 29 seconds, respectively.
I want to convert them to datetime objects (preferably) or just integers of seconds.
The first try was
pd.to_datetime(series1, format = '%d:%H:%M:%S')
which failed with an error message
time data '169:21:5:24' does not match format '%d:%H:%M:%S' (match)
The second try
pd.to_datetime(series1)
also failed with
expected hh:mm:ss format
The first try seems to work if all the 'days' values are less than 30 or 31, but my data includes 150 days, 250 days, etc., and there is no month value.
Finally,
temp_list1 = [[int(subitem) for subitem in item.split(":")] for item in series1]
temp_list2 = [item[0] * 24 * 3600 + item[1] * 3600 + item[2] * 60 + item[3] for item in temp_list1]
successfully converted the Series into a list of seconds, but this is lengthy.
I wonder if there is a Pandas.Series.dt or datetime methods that can deal with such type of data.
I want to convert them to datetime objects (preferably) or just integers of seconds.
It seems to me that you are rather looking for a timedelta, because it's unclear what the year should be. You could do that, for example, like this (with ser being your series):
ser = pd.Series(["169:21:5:24", "54:9:19:29"])
timedeltas = ser.str.split(":", n=1, expand=True).assign(
    td=lambda df: pd.to_timedelta(df[0].astype("int"), unit="D") + pd.to_timedelta(df[1])
)["td"]
seconds = timedeltas.dt.total_seconds().astype("int")
datetimes = pd.Timestamp("2022") + timedeltas # year has to be provided
Result:
timedeltas:
0 169 days 21:05:24
1 54 days 09:19:29
Name: td, dtype: timedelta64[ns]
seconds:
0 14677524
1 4699169
Name: td, dtype: int64
datetimes:
0 2022-06-19 21:05:24
1 2022-02-24 09:19:29
Name: td, dtype: datetime64[ns]
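An alternative sketch that skips the split/assign step: rewrite only the first colon as " days " so that pandas' timedelta parser can handle the whole string directly (assuming it accepts the non-zero-padded fields; otherwise zero-pad first):
ser = pd.Series(["169:21:5:24", "54:9:19:29"])
# "169:21:5:24" -> "169 days 21:5:24", which pd.to_timedelta can parse in one go
td = pd.to_timedelta(ser.str.replace(":", " days ", n=1, regex=False))
seconds = td.dt.total_seconds().astype("int64")
# 0    14677524
# 1     4699169
# dtype: int64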
pandas.to_datetime uses (and points to) the strftime() and strptime() format codes from the Python datetime docs, which state:
%d - Day of the month as a zero-padded decimal number.
...
%j - Day of the year as a zero-padded decimal number.
So, you're using the wrong directive (correct one is %j):
>>> import pandas as pd
>>>
>>> pd.to_datetime("169:21:5:24", format="%j:%H:%M:%S")
Timestamp('1900-06-18 21:05:24')
As seen, the reference year is 1900 (as noted in the datetime docs). If you want to use the current year, a bit of extra processing is required:
>>> import datetime
>>>
>>> cur_year_str = "{:04d}:".format(datetime.datetime.today().year)
>>> cur_year_str
'2023:'
>>>
>>> pd.to_datetime(cur_year_str + "169:21:5:24", format="%Y:%j:%H:%M:%S")
Timestamp('2023-06-18 21:05:24')
>>>
>>> # Quick leap year test
>>> pd.to_datetime("2020:169:21:5:24", format="%Y:%j:%H:%M:%S")
Timestamp('2020-06-17 21:05:24')
All in all:
>>> series = pd.Series(("169:21:5:24", "54:9:19:29"))
>>> pd.to_datetime(cur_year_str + series, format="%Y:%j:%H:%M:%S")
0 2023-06-18 21:05:24
1 2023-02-23 09:19:29
dtype: datetime64[ns]
Hello,
I am trying to extract a date and time column from my Excel data. I get the column as a DataFrame with float values, and after using pandas.to_datetime I get a different date than the actual date in Excel. For example, in Excel the starting date is 01.01.1901 00:00:00, but in Python I get 1971-01-03 00:00:00.000000.
How can I solve this problem?
I need the final output as total seconds in a DataFrame, with the first cell starting at 0 seconds and each following cell stepping onward in seconds (the time difference between cells is 15 min).
Thank you.
Your input is fractional days, so there's actually no need to convert to datetime if you want the duration in seconds relative to the first entry. Subtract that from the rest of the column and multiply by the number of seconds in a day:
import pandas as pd
df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
df["totalseconds"] = (df["Datum/Zeit"] - df["Datum/Zeit"].iloc[0]) * 86400
df["totalseconds"]
0 0.0000
1 900.0288
2 1799.9712
Name: totalseconds, dtype: float64
If you have to use datetime, you'll need to convert to timedelta (duration) to do the same, e.g. like this:
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="D")
# df["datetime"]
# 0 1971-01-03 00:00:00.000000
# 1 1971-01-03 00:15:00.028800
# 2 1971-01-03 00:29:59.971200
# Name: datetime, dtype: datetime64[ns]
# subtraction of datetime from datetime gives timedelta, which has total_seconds:
df["totalseconds"] = (df["datetime"] - df["datetime"].iloc[0]).dt.total_seconds()
# df["totalseconds"]
# 0 0.0000
# 1 900.0288
# 2 1799.9712
# Name: totalseconds, dtype: float64
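If the actual calendar dates shown in Excel are needed rather than relative seconds: the 1971 dates appear because unit="D" counts from the Unix epoch (1970-01-01), while Excel's default 1900 date system counts from 1899-12-30. A sketch, assuming the workbook uses that default date system (excel_datetime is just an illustrative column name):
df["excel_datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="D", origin="1899-12-30")
# df["excel_datetime"]
# 0   1901-01-01 00:00:00.000000
# 1   1901-01-01 00:15:00.028800
# 2   1901-01-01 00:29:59.971200
# Name: excel_datetime, dtype: datetime64[ns]
which matches the 01.01.1901 start date mentioned in the question.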
Currently my script is subtracting my current time with the times that i have in a Dataframe column called "Creation", generating a new column with the days of the difference. I get the difference days with this code:
df['Creation'] = pandas.to_datetime(df["Creation"], dayfirst=True)
# Generates new column with the days.
df['Difference'] = pandas.to_datetime('now') - df['Creation']
What I want now is for it to give me the days as it does already, but without counting Saturdays and Sundays. How can I do that?
You can make use of numpy's busday_count. Example:
import pandas as pd
import numpy as np
# some dummy data
df = pd.DataFrame({'Creation': ['2021-03-29', '2021-03-30']})
# make sure we have datetime
df['Creation'] = pd.to_datetime(df['Creation'])
# set now to a fixed date
now = pd.Timestamp('2021-04-05')
# difference in business days, excluding weekends
# need to cast to datetime64[D] dtype so that np.busday_count works
df['busday_diff'] = np.busday_count(df['Creation'].values.astype('datetime64[D]'),
                                    np.repeat(now, df['Creation'].size).astype('datetime64[D]'))
df['busday_diff']  # no holidays were defined, so only weekends are excluded:
0 5
1 4
Name: busday_diff, dtype: int64
If you need the output to be of dtype timedelta, you can easily cast to that via
df['busday_diff'] = pd.to_timedelta(df['busday_diff'], unit='d')
df['busday_diff']
0 5 days
1 4 days
Name: busday_diff, dtype: timedelta64[ns]
Note: np.busday_count also allows you to set a custom weekmask (to exclude days other than Saturday and Sunday) or a list of holidays; see the np.busday_count documentation.
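For example, a sketch with an illustrative weekmask and holiday (both are placeholders, not part of the original question):
# Mon-Thu count as business days; 2021-04-01 is additionally treated as a holiday
np.busday_count(df['Creation'].values.astype('datetime64[D]'),
                np.repeat(now, df['Creation'].size).astype('datetime64[D]'),
                weekmask='1111000',
                holidays=['2021-04-01'])
# array([3, 2])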
Related: Calculate difference between two dates excluding weekends in python?, how to use (np.busday_count) with pandas.core.series.Series
My dataframe has a column which measures time difference in the format HH:MM:SS.000
The DataFrame is created from an Excel file, and the column which stores the time difference is of object dtype. However, some entries have a negative time difference; the negative sign doesn't matter to me and needs to be removed from the time, as it breaks a filter condition I have.
Note: I only have the negative time difference there because of the issue I'm currently having.
I've tried a few functions but I get errors, as some of the time difference data is just 00:00:00, some is 00:00:02.65, and some is 00:00:02.111.
Firstly, how would I ensure that all data in this column is in the form 00:00:00.000? And then how would I remove the '-' from some of the data?
Here's a sample of the time diff column. I can't transform this column into datetime as some of the entries don't have 3 digits after the decimal. Is there a way to iterate through the column and add a 0 if the length of the value isn't equal to 12 characters?
00:00:02.97
00:00:03:145
00:00:00
00:00:12:56
28 days 03:05:23.439
It looks like you need to clean your input before you can parse to timedelta, e.g. with the following function:
import pandas as pd

def clean_td_string(s):
    # if there are more than two colons, the last one separates the fractional
    # seconds, so replace it with a dot (e.g. '00:00:03:145' -> '00:00:03.145')
    if s.count(':') > 2:
        return '.'.join(s.rsplit(':', 1))
    return s
Applied to a df's column, this looks like
df = pd.DataFrame({'Time Diff': ['00:00:02.97', '00:00:03:145', '00:00:00', '00:00:12:56', '28 days 03:05:23.439']})
df['Time Diff'] = pd.to_timedelta(df['Time Diff'].apply(clean_td_string))
# df['Time Diff']
# 0 0 days 00:00:02.970000
# 1 0 days 00:00:03.145000
# 2 0 days 00:00:00
# 3 0 days 00:00:12.560000
# 4 28 days 03:05:23.439000
# Name: Time Diff, dtype: timedelta64[ns]
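As for the leading '-' mentioned in the question: since the sign doesn't matter, one option (a sketch) is to strip it from the strings before parsing, i.e. in the conversion above use
df['Time Diff'] = pd.to_timedelta(df['Time Diff'].str.lstrip('-').apply(clean_td_string))
so that an entry like '-00:00:02.97' is parsed as 0 days 00:00:02.970000.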
I have a DataFrame that is indexed by date and has daily data.
As described, I wish to group and aggregate this data by calendar month start minus 2 business days. My idea is to use groupby and MonthBegin with a 2-day BDay offset to do this.
When I try to run the code
import pandas as pd
import pandas.tseries.offsets as of
days = of.MonthBegin() - of.BDay(2)
g = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
I get an error:
TypeError: Argument 'other' has incorrect type (expected datetime.datetime, got BusinessDay)
Perhaps I need to use the rollback method on MonthBegin but when I try
days = of.MonthBegin()
days.rollback(of.BDay(2))
g_df = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
TypeError: Cannot convert input [<2 * BusinessDays>] of type to Timestamp
Does anyone have any ideas how to correctly use the offsets to group by MonthBegin - 2 BDay?
It is hard to tell what you want to achieve without sample data of yours, but here is how you could do it:
df = pd.DataFrame({"dates": ["2018-01-02", "2018-01-03", "2018-02-02", "2018-01-04"],
"vals": [10, 20, 10, 5]})
df.groupby((pd.to_datetime(df.dates) - of.MonthBegin() - of.BDay(2)).dt.month).vals.sum()
Output:
dates
1 10
12 35
Name: vals, dtype: int64
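Note that grouping by .dt.month alone collapses the same month from different years. If the data spans several years, here is a sketch that keeps the year as well, assuming the same imports and frame as above:
# shift each date back by MonthBegin + 2 business days, then group by year-month
shifted = pd.to_datetime(df.dates) - of.MonthBegin() - of.BDay(2)
df.groupby(shifted.dt.to_period("M")).vals.sum()
# 2017-12    35
# 2018-01    10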