Currently my script subtracts the times I have in a DataFrame column called "Creation" from the current time, generating a new column with the difference in days. I get the difference in days with this code:
df['Creation'] = pandas.to_datetime(df["Creation"], dayfirst=True)
# Generates new column with the days.
df['Difference'] = pandas.to_datetime('now') - df['Creation']
What I want now is for it to give me the days like it does, but without counting Saturdays and Sundays. How can I do that?
You can make use of numpy's busday_count. Ex:
import pandas as pd
import numpy as np
# some dummy data
df = pd.DataFrame({'Creation': ['2021-03-29', '2021-03-30']})
# make sure we have datetime
df['Creation'] = pd.to_datetime(df['Creation'])
# set now to a fixed date
now = pd.Timestamp('2021-04-05')
# difference in business days, excluding weekends
# need to cast to datetime64[D] dtype so that np.busday_count works
df['busday_diff'] = np.busday_count(df['Creation'].values.astype('datetime64[D]'),
                                    np.repeat(now, df['Creation'].size).astype('datetime64[D]'))
df['busday_diff'] # since I didn't define holidays, a potential Easter holiday is included in the count:
0 5
1 4
Name: busday_diff, dtype: int64
If you need the output to be of dtype timedelta, you can easily cast to that via
df['busday_diff'] = pd.to_timedelta(df['busday_diff'], unit='d')
df['busday_diff']
0 5 days
1 4 days
Name: busday_diff, dtype: timedelta64[ns]
Note: np.busday_count also allows you to set a custom weekmask (to exclude days other than Saturday and Sunday) or a list of holidays. See the numpy docs for busday_count.
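For example, a small sketch of both options (dates chosen arbitrarily; Good Friday 2021 fell on 2021-04-02):
import numpy as np
# count Mon-Sat as business days instead of Mon-Fri (weekmask is Mon-first)
np.busday_count('2021-03-29', '2021-04-05', weekmask='1111110')  # 6
# exclude Good Friday 2021 via a holidays list
np.busday_count('2021-03-29', '2021-04-05', holidays=['2021-04-02'])  # 4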
Related: Calculate difference between two dates excluding weekends in python?, how to use (np.busday_count) with pandas.core.series.Series
Hello,
I am trying to extract a date and time column from my Excel data. I get the column as a DataFrame with float values, and after using pandas.to_datetime I get a different date than the actual date in Excel. For example, in Excel the starting date is 01.01.1901 00:00:00, but in Python I get 1971-01-03 00:00:00.000000.
How can I solve this problem?
I need the final output as total seconds in a DataFrame: the first cell starting at 0 seconds and each following cell holding the elapsed seconds (the time difference between cells is 15 min).
Thank you.
Your input is fractional days, so there's actually no need to convert to datetime if you want the duration in seconds relative to the first entry. Subtract that from the rest of the column and multiply by the number of seconds in a day:
import pandas as pd
df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
df["totalseconds"] = (df["Datum/Zeit"] - df["Datum/Zeit"].iloc[0]) * 86400
df["totalseconds"]
0 0.0000
1 900.0288
2 1799.9712
Name: totalseconds, dtype: float64
If you have to use datetime, you'll need to convert to timedelta (duration) to do the same, e.g. like
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d")
# df["datetime"]
# 0 1971-01-03 00:00:00.000000
# 1 1971-01-03 00:15:00.028800
# 2 1971-01-03 00:29:59.971200
# Name: datetime, dtype: datetime64[ns]
# subtraction of datetime from datetime gives timedelta, which has total_seconds:
df["totalseconds"] = (df["datetime"] - df["datetime"].iloc[0]).dt.total_seconds()
# df["totalseconds"]
# 0 0.0000
# 1 900.0288
# 2 1799.9712
# Name: totalseconds, dtype: float64
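If you also need actual calendar dates from those fractional days, note that unit="d" counts from the Unix epoch (1970-01-01), not from Excel's day zero. A sketch, assuming the floats are Excel 1900-date-system serial day numbers (origin 1899-12-30 accounts for Excel's leap-year quirk):
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d", origin="1899-12-30")
# df["datetime"]
# 0   1901-01-01 00:00:00.000000
# 1   1901-01-01 00:15:00.028800
# 2   1901-01-01 00:29:59.971200
# Name: datetime, dtype: datetime64[ns]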
I work with a variety of instruments, and one is particularly troublesome in that the exported data is in XLS or XLSX format with multiple pages and multiple columns. I only want some pages and some columns; I have already managed to read this into pandas.
I want to convert the time (see below) into a decimal number of hours. This would be relative to an initial time at the top of the column (in the timestamp data), so a timedelta is probably the more correct value, in hours. I am only concerned with this column. How do I convert an entire column of data from one format to another?
date/time (absolute time), timestamp format YYYY-MM-DD HH:MM:SS
I have found quite a few answers, but they don't seem to apply to this particular case, mostly focusing on individual cells or manually entered small data sets. My thousands of data files each have as many as 500,000 lines, so something more automated is preferred. There is no upper limit to the number of hours.
What might be part of the same question (someone asked me): this is already in a Pandas dataframe, so should it be converted before or after being read in?
This might seem an amateur-ish question, and it is. I've avoided code writing for years; now I have to learn to data-wrangle for my job, and it's frustrating, so go easy on me.
Going about it the usual way, trying to adapt most of the solutions I found to a column, I get errors.
This is the code which works:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime # not used
import time # not used
import numpy as np # Not used
loc1 = r"path\file.xls"
pd.read_excel(loc1)
filename=Path(loc1).stem
str_1=filename
df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)
***I NEED A CODE TO CONVERT DATESTAMPS TO HOURS (decimal) most likely a form of timedelta***
df.plot(x='Relative Time(h:min:s.ms)',y='Voltage(V)', color='blue')
plt.xlabel("relative time") # This is a specific value
plt.ylabel("voltage (V)")
plt.title(str_1) # filename is used in each sample as a graph title
plt.show()
[Image of relevant information (already described above)]
You should provide a minimal reproducible example, to help understand what exactly are the issues you are facing.
Setup
Reading between the lines, here is a setup that hopefully exemplifies the kind of data you have:
import pandas as pd

vals = pd.Series([
'2019-10-21 17:22:06', # absolute date
'2019-10-21 23:22:06.236', # absolute date, with milliseconds
'2019-10-21 12:00:00.236145', # absolute date, with microseconds
'5:10:10', # timedelta
'40:10:10.123', # timedelta, with milliseconds
'345:10:10.123456', # timedelta, with microseconds
])
Solution
Now, we can use two great tools that Pandas offers to quickly convert string series into Timestamps (pd.to_datetime) and Timedelta (pd.to_timedelta), for absolute date-times and durations, respectively.
In both cases, we use errors='coerce' to convert what is convertible, and leave the rest to NaN.
origin = pd.Timestamp('2019-01-01 00:00:00') # origin for absolute dates
a = pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce') - origin
b = pd.to_timedelta(vals, errors='coerce')
tdelta = a.where(~a.isna(), b)
hours = tdelta.dt.total_seconds() / 3600
With the above:
>>> hours
0 7049.368333
1 7055.368399
2 7044.000066
3 5.169444
4 40.169479
5 345.169479
dtype: float64
Explanation
Let's examine some of the pieces above. a handles absolute date-times. Before subtraction of origin to obtain a Timedelta, it is still a Series of Timestamps:
>>> pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce')
0 2019-10-21 17:22:06.000000
1 2019-10-21 23:22:06.236000
2 2019-10-21 12:00:00.236145
3 NaT
4 NaT
5 NaT
dtype: datetime64[ns]
b handles values that are already expressed as durations:
>>> b
0 NaT
1 NaT
2 NaT
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
tdelta is the merge of the non-NaN values of a and b:
>>> tdelta
0 293 days 17:22:06
1 293 days 23:22:06.236000
2 293 days 12:00:00.236145
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
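As a side note, the mask-based merge a.where(~a.isna(), b) can equivalently be written with fillna, which fills a's NaT entries from b by index alignment:
tdelta = a.fillna(b)  # same result as a.where(~a.isna(), b)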
Of course, you can change your origin to be any particular date of reference.
Addendum
After clarifying comments, it seems that the main issue is how to adapt the solution above (or any similar existing example) to their specific problem.
Using the names seen in the images of the edited question, I would suggest:
# (...)
# df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)
# note: if df['Absolute Time'] is still of dtypes str, then do this:
# (adapt format as needed; hard to be sure from the image)
df['Absolute Time'] = pd.to_datetime(
df['Absolute Time'],
format='%m.%d.%Y %H:%M:%S.%f',
errors='coerce')
# origin of time; this may have to be taken over multiple sheets
# if all experiments share an absolute origin
origin = df['Absolute Time'].min()
df['Time in hours'] = (df['Absolute Time'] - origin).dt.total_seconds() / 3600
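Alternatively, if you would rather work from the 'Relative Time(h:min:s.ms)' column that the plot uses, and assuming its strings look like '0:00:01.5' (a format pd.to_timedelta parses directly), a minimal sketch:
df['Time in hours'] = (pd.to_timedelta(df['Relative Time(h:min:s.ms)'], errors='coerce')
                       .dt.total_seconds() / 3600)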
My dataframe has a column which measures a time difference in the format HH:MM:SS.000.
The dataframe is formed from an Excel file, and the column which stores the time difference is an object. However, some entries have a negative time difference; the negative sign doesn't matter to me and needs to be removed from the time, as it breaks a condition I filter on.
Note: I only have the negative time difference there because of the issue I'm currently having.
I've tried several functions, but I get errors because some of the time-difference data is just 00:00:00, some is 00:00:02.65, and some is 00:00:02.111.
Firstly, how would I ensure that all data in this column follows the format 00:00:00.000? And then, how would I remove the '-' from some of the data?
Here's a sample of the time diff column. I can't transform this column into datetime as some of the entries don't have 3 digits after the decimal. Is there a way to iterate through the column and add a 0 if the length of the value isn't equal to 12 digits?
00:00:02.97
00:00:03:145
00:00:00
00:00:12:56
28 days 03:05:23.439
It looks like you need to clean your input before you can parse to timedelta, e.g. with the following function:
import pandas as pd
def clean_td_string(s):
if s.count(':') > 2:
return '.'.join(s.rsplit(':', 1))
return s
Applied to a df's column, this looks like
df = pd.DataFrame({'Time Diff': ['00:00:02.97', '00:00:03:145', '00:00:00', '00:00:12:56', '28 days 03:05:23.439']})
df['Time Diff'] = pd.to_timedelta(df['Time Diff'].apply(clean_td_string))
# df['Time Diff']
# 0 0 days 00:00:02.970000
# 1 0 days 00:00:03.145000
# 2 0 days 00:00:00
# 3 0 days 00:00:12.560000
# 4 28 days 03:05:23.439000
# Name: Time Diff, dtype: timedelta64[ns]
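If the column is large, a vectorized alternative to apply is a regex replace that only rewrites strings with a third colon before the fractional part; a sketch under the same assumptions about the input (plus negatives written like '-00:00:02.97', which the question mentions):
# drop a leading minus sign, since the sign is not needed per the question
s = df['Time Diff'].str.lstrip('-')
# turn the last colon into a dot, but only for hh:mm:ss:fff-shaped strings
s = s.str.replace(r'^(\d{2}:\d{2}:\d{2}):(\d+)$', r'\1.\2', regex=True)
df['Time Diff'] = pd.to_timedelta(s)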
Here is a sample with the date format:
import pandas as pd

data = pd.DataFrame({'Quarter':['Q1_01','Q2_01', 'Q3_01', 'Q4_01', 'Q1_02','Q2_02']
, 'Sale' :[10, 20, 30, 40, 50, 60]})
print(data)
# Quarter Sale
#0 Q1_01 10
#1 Q2_01 20
#2 Q3_01 30
#3 Q4_01 40
#4 Q1_02 50
#5 Q2_02 60
print(data.dtypes)
# Quarter object
# Sale int64
Would like to convert Quarter column into Pandas datetime format like
'Jan-2001' or '01-2001' that can be used in fbProphet for time series analysis.
Tried using strptime but got an error TypeError: strptime() argument 1 must be str, not Series
from datetime import datetime
data['Quarter'] = datetime.strptime(data['Quarter'], 'Q%q_%y')
What is the cause of the error? Is there a better solution?
Knowing the format in which to_datetime can parse quarters is helpful (it is along the lines of YYYY-QX), so we start with replace, then to_datetime, and finally strftime:
u = data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True)
pd.to_datetime(u).dt.strftime('%b-%Y')
0 Jan-2001
1 Apr-2001
2 Jul-2001
3 Oct-2001
4 Jan-2002
5 Apr-2002
Name: Quarter, dtype: object
The month represents the start of its respective quarter.
If the dates can range across the 90s and the 2000s, then let's try something different:
df = pd.DataFrame({'Quarter':['Q1_98','Q2_99', 'Q3_01', 'Q4_01', 'Q1_02','Q2_02']})
dt = pd.to_datetime(df.Quarter.str.replace(r'(Q\d)_(\d+)', r'\2-\1', regex=True))
(dt.where(dt <= pd.to_datetime('today'), dt - pd.DateOffset(years=100))
.dt.strftime('%b-%Y'))
0 Jan-1998
1 Apr-1999
2 Jul-2001
3 Oct-2001
4 Jan-2002
5 Apr-2002
Name: Quarter, dtype: object
pd.to_datetime auto-parses "98" as "2098", so we do a little fix to subtract 100 years from dates later than "today's date".
This hack will stop working in a few decades. Ye pandas gods, have mercy on my soul :-)
Another option is parsing to PeriodIndex:
(pd.PeriodIndex(data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True), freq='Q')
.strftime('%b-%Y'))
# Index(['Mar-2001', 'Jun-2001', 'Sep-2001',
# 'Dec-2001', 'Mar-2002', 'Jun-2002'], dtype='object')
Here, the months printed out are at the ends of their respective quarters. You decide what to use.
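If you prefer quarter-start months with the PeriodIndex route, to_timestamp defaults to the start of each period:
(pd.PeriodIndex(data.Quarter.str.replace(r'(Q\d)_(\d+)', r'20\2-\1', regex=True), freq='Q')
   .to_timestamp()
   .strftime('%b-%Y'))
# Index(['Jan-2001', 'Apr-2001', 'Jul-2001',
#        'Oct-2001', 'Jan-2002', 'Apr-2002'], dtype='object')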
I have a dataframe which looks like the one below:
Year  Birthday  OnsetDate
5               2018/1/1
5               2018/2/2
Now I subtract the Year column from the OnsetDate column:
df['Birthday'] = df['OnsetDate'] - pd.to_timedelta(df['Year'], unit='Y')
but the outcome in the Birthday column is mixed with a time, just like below:
Birthday
2013/12/31 18:54:00
2013/1/30 18:54:00
The outcome is just dummy data; what I'm focused on is that the time component makes the date inaccurate after the operation. What is the solution to avoid the time being generated, so that I can get accurate dates?
Second question: I merge the above dataframe into another dataframe with
new.update(df)
and the Birthday column of the 'new' dataframe became like this:
Birthday
1164394440000000000
1165949640000000000
So what caused this, and what is the solution?
First question: you should know that pd.to_timedelta does not give a whole calendar year. If you print it, you can see that 1 year = 365 days 05:49:12.
print(pd.to_timedelta(1, unit='Y'))
365 days 05:49:12
If you want to avoid the time being generated, you can use DateOffset.
from pandas.tseries.offsets import DateOffset
df['Year'] = df['Year'].apply(lambda x: DateOffset(years=x))
df['Birthday'] = df['OnsetDate'] - df['Year']
Year OnsetDate Birthday
0 <DateOffset: years=5> 2018-01-01 2013-01-01
1 <DateOffset: years=5> 2018-02-02 2013-02-02
As for the second question, it is caused by the dtype of the column; you can use pd.to_datetime to solve it.
new['Birthday'] = pd.to_datetime(new['Birthday'])
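That call works because the big integers are nanoseconds since the Unix epoch, which is to_datetime's default unit for integers. A small sketch with the values from the question (output dates computed from those nanosecond values):
import pandas as pd

s = pd.Series([1164394440000000000, 1165949640000000000])
pd.to_datetime(s)  # default unit for integers is nanoseconds
# 0   2006-11-24 18:54:00
# 1   2006-12-12 18:54:00
# dtype: datetime64[ns]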