This is a follow-up to Calculating new column value in dataframe based on next rows column value.
The solution in the previous question worked for a column holding hh:mm:ss values as a string.
I tried applying (no pun intended) the same logic to calculate the 1-second difference on a column of pandas Timestamps:
# df.start_time is now of type <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# in yyyy-mm-dd hh:mm:ss format
s = pd.to_timedelta(df.start_time).shift(-1).sub(pd.offsets.Second(1))
df = df.assign( end_time=s.add(pd.Timestamp('now').normalize()).dt.time.astype(str) )
By mistake, in one round of coding I changed the line where the series is assigned as a column of the df to:
df = df.assign( end_time=s.add(pd.Timestamp('now').normalize()))
The results were... interesting. The end_time is in the correct format, but the date portion...
start_time            end_time
2021-03-30 16:58:13   2072-06-28 03:17:30.192227
2021-03-30 17:00:00   2072-06-28 03:17:32.192227
I expected an end_time 1 second less than the next row's start_time. As you can see, that is not the case: the end_time is roughly 51 years in the future!
Can someone please explain how/why this happened? There is no explicit call to pd.offsets.DateOffset(years=50) anywhere.
The solution to this was easy, and staring me in the face.
The offending code:
s = pd.to_timedelta(df.start_time).shift(-1).sub(pd.offsets.Second(1))
The correct way to create an end_time from a timestamp-type series/column:
s = pd.to_datetime(df.start_time).shift(-1).sub(pd.offsets.Second(1))
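As for the how/why (my understanding, with a minimal sketch): a pandas Timestamp is stored internally as nanoseconds since the Unix epoch (1970-01-01), and pd.to_timedelta reinterprets that raw integer as a duration. A 2021 date is roughly 51 years' worth of nanoseconds, so adding that "duration" to today's date lands in 2072:
import pandas as pd
ts = pd.Timestamp('2021-03-30 16:58:13')
td = pd.to_timedelta(ts.value, unit='ns')  # ts.value is nanoseconds since 1970-01-01
print(td)                               # 18716 days 16:58:13, roughly 51 years
print(pd.Timestamp('2021-03-30') + td)  # lands in mid-2072, as in the output above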
I work with a variety of instruments, and one is particularly troublesome in that the exported data is in XLS or XLSX format with multiple pages and multiple columns. I only want some pages and some columns; I have already achieved reading this into pandas.
I want to convert time (see below) into a decimal number of hours. This would be measured from an initial time (in the timestamp data) at the top of the column, so a timedelta in hours is probably the more correct value. I am only concerned with this column. How do I convert an entire column of data from one format to another?
date/time (absolute time), timestamped in YYYY-MM-DD HH:MM:SS format
I have found quite a few answers, but they don't seem to apply to this particular case, mostly focusing on individual cells or manually entered small data sets. My thousands of data files each have as many as 500,000 lines, so something more automated is preferred. There is no upper limit to the number of hours.
What might be part of the same question (someone asked me): given that this ends up in a pandas dataframe, should it be converted before or after being read in?
This might seem an amateurish question, and it is. I've avoided code writing for years; now I have to learn to data-wrangle for my job, and it's frustrating, so go easy on me.
Going about it the usual way, by trying to adapt most of the solutions I found to a column, I get errors.
This is the code which works:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime  # not used
import time  # not used
import numpy as np  # not used
loc1 = r"path\file.xls"
str_1 = Path(loc1).stem  # the filename, used later as the graph title
df = pd.concat(pd.read_excel(loc1, sheet_name=[3, 4, 5, 6, 7, 8, 9]), ignore_index=False)
***I NEED CODE HERE TO CONVERT DATESTAMPS TO HOURS (decimal), most likely a form of timedelta***
df.plot(x='Relative Time(h:min:s.ms)',y='Voltage(V)', color='blue')
plt.xlabel("relative time") # This is a specific value
plt.ylabel("voltage (V)")
plt.title(str_1) # filename is used in each sample as a graph title
plt.show()
[Image of relevant information (already described above)]
You should provide a minimal reproducible example, to help understand exactly what issues you are facing.
Setup
Reading between the lines, here is a setup that hopefully exemplifies the kind of data you have:
vals = pd.Series([
    '2019-10-21 17:22:06',         # absolute date
    '2019-10-21 23:22:06.236',     # absolute date, with milliseconds
    '2019-10-21 12:00:00.236145',  # absolute date, with microseconds
    '5:10:10',                     # timedelta
    '40:10:10.123',                # timedelta, with milliseconds
    '345:10:10.123456',            # timedelta, with microseconds
])
Solution
Now, we can use two great tools that pandas offers to quickly convert string series into Timestamps (pd.to_datetime) and Timedeltas (pd.to_timedelta), for absolute date-times and durations, respectively.
In both cases, we use errors='coerce' to convert what is convertible and leave the rest as NaT.
origin = pd.Timestamp('2019-01-01 00:00:00') # origin for absolute dates
a = pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce') - origin
b = pd.to_timedelta(vals, errors='coerce')
tdelta = a.where(~a.isna(), b)
hours = tdelta.dt.total_seconds() / 3600
With the above:
>>> hours
0 7049.368333
1 7055.368399
2 7044.000066
3 5.169444
4 40.169479
5 345.169479
dtype: float64
Explanation
Let's examine some of the pieces above. a handles absolute date-times. Before origin is subtracted to obtain a Timedelta, it is still a Series of Timestamps:
>>> pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce')
0 2019-10-21 17:22:06.000000
1 2019-10-21 23:22:06.236000
2 2019-10-21 12:00:00.236145
3 NaT
4 NaT
5 NaT
dtype: datetime64[ns]
b handles values that are already expressed as durations:
>>> b
0 NaT
1 NaT
2 NaT
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
tdelta is the merge of the non-NaT values of a and b:
>>> tdelta
0 293 days 17:22:06
1 293 days 23:22:06.236000
2 293 days 12:00:00.236145
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
Of course, you can change your origin to be any particular date of reference.
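For instance, to measure everything from the day the data was recorded instead (a hypothetical choice of reference date):
origin = pd.Timestamp('2019-10-21 00:00:00')  # hours would then count from this date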
Addendum
After clarifying comments, it seems that the main issue is how to adapt the solution above (or any similar existing example) to their specific problem.
Using the names seen in the images of the edited question, I would suggest:
# (...)
# df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)
# note: if df['Absolute Time'] is still of dtype str, then do this:
# (adapt format as needed; hard to be sure from the image)
df['Absolute Time'] = pd.to_datetime(
    df['Absolute Time'],
    format='%m.%d.%Y %H:%M:%S.%f',
    errors='coerce')
# origin of time; this may have to be taken over multiple sheets
# if all experiments share an absolute origin
origin = df['Absolute Time'].min()
df['Time in hours'] = (df['Absolute Time'] - origin).dt.total_seconds() / 3600
I am having an issue with converting the epoch time format 1585542406929 into a 2020-09-14-style format with hours, minutes, and seconds.
I tried running this, but it gives me an error:
from datetime import datetime
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
datetime.utcfromtimestamp(df2.timestamp_ms).strftime('%Y-%m-%d %H:%M:%S')
error: cannot convert the series to <class 'int'>
What am I not understanding about this datetime function? Is there a better function that I should be using?
edit: I should mention that timestamp_ms is the column from my dataframe, called df.
Thanks to @chepner for helping me understand the format that this is in.
A quick solution is the following:
# make a new column holding the Unix epoch start date, as @ForceBru mentioned
start_date = '1970-01-01'
df3['helper'] = pd.to_datetime(start_date)
# convert your column of epoch milliseconds to days
df3['timestamp_ms'] = df3['timestamp_ms'].apply(lambda x: ((x / 1000) / 60) / 60 / 24)
# turn the day counts into timedeltas
df3['time_added'] = pd.to_timedelta(df3['timestamp_ms'], 'd')
# add the two columns together
df3['actual_time'] = df3['helper'] + df3['time_added']
Note that you might have to subtract some time from the resulting timestamp. For instance, I had sent my message at 10:40 am Central Time (midwest USA), but the timestamp was putting it at 3:40 pm, because epoch timestamps are expressed in UTC.
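For what it's worth, pandas can also do the whole conversion in one step and make that timezone shift explicit. A minimal sketch, assuming the values are milliseconds since the epoch ('America/Chicago' is my assumed zone for US Central):
import pandas as pd
df3 = pd.DataFrame({'timestamp_ms': [1585542406929]})
# unit='ms' interprets the integers as milliseconds since the Unix epoch (UTC)
df3['actual_time'] = pd.to_datetime(df3['timestamp_ms'], unit='ms', utc=True)
# convert from UTC to local wall-clock time to remove the apparent offset
df3['local_time'] = df3['actual_time'].dt.tz_convert('America/Chicago')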
I have a dataframe ('df') and the first column is a timestamp. I successfully converted that timestamp from that milliseconds since Unix epoch thing to a date like this "2020-02-18 13:00:00" (which is 1:00 pm on February 18th, 2020) with the following code:
df['Time'] = pd.to_datetime(df['Time'], unit='ms')
I'm trying to subset to all of the rows from 2020-02-17, but this code:
df_1day = df[(df['Time'] == '2020-02-17')]
only returns the row at exactly midnight (2020-02-17 00:00:00).
I'm sorry if the answer is somewhere else in this site, or the internet in general, but TIA for any help.
Not sure of the protocol when answering my own question, but I'm making this edit to include the lines of code that solved my issue, even though I'm pretty sure there's an easier way of doing this:
## Create new column with 'Time' as a string
df['Day'] = df['Time'].astype(str)
## Take only the first 10 characters of the string (which would be date only)
df['Day'] = df['Day'].str[:10]
## Create dataframe subset based on values in the new column
df_1day = df[(df['Day'] == '2020-02-17')]
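There is indeed an easier way: the .dt accessor can compare the date part directly, skipping the string round trip. A sketch, assuming df['Time'] is already datetime64:
# normalize() floors each timestamp to midnight, so equality selects the whole day
df_1day = df[df['Time'].dt.normalize() == '2020-02-17']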
I have a dataframe which looks like this:
Year  Birthday  OnsetDate
5               2018/1/1
5               2018/2/2
Now I subtract the Year column from the OnsetDate column:
df['Birthday'] = df['OnsetDate'] - pd.to_timedelta(df['Year'], unit='Y')
but the outcome in the Birthday column is mixed with a time component, like below:
Birthday
2013/12/31 18:54:00
2013/1/30 18:54:00
The outcome above is just dummy data; my concern is that the time component makes the dates inaccurate after the operation. What is the solution to avoid the time being generated, so that I get accurate dates?
Second question: I merge the above dataframe into another dataframe with
new.update(df)
and the Birthday column of the 'new' dataframe became like this:
Birthday
1164394440000000000
1165949640000000000
So what actually caused this, and what is the solution?
For the first question: you should know that pd.to_timedelta does not use a whole calendar year; one 'Y' unit is the mean Gregorian year of 365.2425 days. If you print it, you can see 1 year = 365 days 05:49:12.
print(pd.to_timedelta(1, unit='Y'))
365 days 05:49:12
If you want to avoid the time being generated, you can use DateOffset.
from pandas.tseries.offsets import DateOffset
df['Year'] = df['Year'].apply(lambda x: DateOffset(years=x))
df['Birthday'] = df['OnsetDate'] - df['Year']
Year OnsetDate Birthday
0 <DateOffset: years=5> 2018-01-01 2013-01-01
1 <DateOffset: years=5> 2018-02-02 2013-02-02
As for the second question, it is caused by the dtype of the column; you can use pd.to_datetime to solve it.
new['Birthday'] = pd.to_datetime(new['Birthday'])
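As a quick sanity check: those large integers are nanoseconds since the Unix epoch, so converting one back gives an ordinary timestamp:
import pandas as pd
print(pd.to_datetime(1164394440000000000))  # Timestamp('2006-11-24 18:54:00')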
This seems like it would be fairly straightforward, but after nearly an entire day I have not found the solution. I've loaded my dataframe with read_csv and easily parsed, combined, and indexed a date and a time column into one column, but now I want to reshape and perform calculations based on hour and minute groupings, similar to what you can do in an Excel pivot.
I know how to resample to hour or minute, but that maintains the date portion associated with each hour/minute, whereas I want to aggregate the data set ONLY by hour and minute, similar to grouping in Excel pivots and selecting "hour" and "minute" but not selecting anything else.
Any help would be greatly appreciated.
Can't you do, where df is your DataFrame:
times = pd.to_datetime(df.timestamp_col)
df.groupby([times.dt.hour, times.dt.minute]).value_col.sum()
Wes' code didn't work for me. But the DatetimeIndex function (docs) did:
times = pd.DatetimeIndex(data.datetime_col)
grouped = data.groupby([times.hour, times.minute])
The DatetimeIndex object is a representation of times in pandas. The first line creates an array of the datetimes. The second line uses this array to get the hour and minute data for all of the rows, allowing the data to be grouped (docs) by these values.
Came across this when I was searching for this type of groupby. Wes' code above didn't work for me; not sure if it's because of changes in pandas over time.
In pandas 0.16.2, what I did in the end was:
grp = data.groupby(by=[data.datetime_col.map(lambda x : (x.hour, x.minute))])
grp.count()
You'd have (hour, minute) tuples as the grouped index. If you want a multi-index instead:
grp = data.groupby(by=[data.datetime_col.map(lambda x: x.hour),
                       data.datetime_col.map(lambda x: x.minute)])
Here is an alternative to Wes' and Nix's answers above, with just one line of code. Assuming your column is already a datetime column, you don't need to get the hour and minute attributes separately:
df.groupby(df.timestamp_col.dt.time).value_col.sum()
This might be a little late, but I found quite a good solution for anyone who has the same problem.
I have a df like this:
datetime value
2022-06-28 13:28:08 15
2022-06-28 13:28:09 30
... ...
2022-06-28 14:29:11 20
2022-06-28 14:29:12 10
I want to convert those timestamps, which are in intervals of a second, to timestamps in intervals of a minute, summing the value column in the process.
There is a neat way of doing it:
df['datetime'] = pd.to_datetime(df['datetime'])  # if not already a datetime object
grouped = df.groupby(pd.Grouper(key='datetime', axis=0, freq='T')).sum()
print(grouped.head())
Result:
datetime value
2022-06-28 13:28:00 45
... ...
2022-06-28 14:29:00 30
freq='T' stands for minutes. You could also group by hours or days; these strings are called offset aliases.
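For example, switching the same groupby to hourly buckets only means changing the alias (a quick sketch):
hourly = df.groupby(pd.Grouper(key='datetime', freq='H')).sum()  # 'H' = hours, 'D' = days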