I have a pandas dataframe with a column of datetimes. I want to add 12 hours to any datetime that is not equal to 8am but is still in the morning. For example:
Datetime A
2022-01-01 08:00:00 10
2022-01-01 09:00:00 10
2022-01-01 12:00:00 10
2022-01-01 24:00:00 10
Should become:
Datetime A
2022-01-01 08:00:00 10
2022-01-01 21:00:00 10
2022-01-01 12:00:00 10
2022-01-01 24:00:00 10
I can do this by looping through the dataframe one element at a time and doing this conditional check. However, the dataset I am working with is large. Is it possible to do this without looping through the whole dataset, by filtering on this condition instead? So far I have not managed to find a way!
Here is one way to do it. You can use .dt.hour and datetime.timedelta to solve this problem:
import datetime
import pandas as pd
data = """2022-01-01 08:00:00 10
2022-01-01 09:00:00 10
2022-01-01 12:00:00 10
2022-01-01 23:00:00 10"""
# Split each row on its last space: timestamp on the left, value on the right
data = [f.rsplit(" ", 1) for f in data.split("\n")]
df = pd.DataFrame(data=data, columns=['Datetime', 'A'])
df['Datetime'] = pd.to_datetime(df['Datetime'])
# Morning rows (hour before 12) whose hour is not 8; 12:00 itself stays unchanged
mask = (df['Datetime'].dt.hour != 8) & (df['Datetime'].dt.hour < 12)
df.loc[mask, "Datetime"] += datetime.timedelta(hours=12)
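Note that a mask built on .dt.hour treats every timestamp within the 8 o'clock hour (08:30, say) as "8am". If only exactly 08:00:00 should be exempt, one possible variant compares the time elapsed since midnight instead. This is only a sketch under that assumption: the 08:30 row and the name time_of_day are illustrative and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2022-01-01 08:00:00",
        "2022-01-01 08:30:00",  # hypothetical row, not in the original data
        "2022-01-01 09:00:00",
        "2022-01-01 12:00:00",
    ]),
    "A": [10, 10, 10, 10],
})

# Time elapsed since midnight, as a timedelta Series
time_of_day = df["Datetime"] - df["Datetime"].dt.normalize()

# Morning rows (strictly before noon) that are not exactly 08:00:00
mask = (time_of_day < pd.Timedelta(hours=12)) & (time_of_day != pd.Timedelta(hours=8))
df.loc[mask, "Datetime"] += pd.Timedelta(hours=12)
```

With this mask the 08:30 row is shifted to 20:30, whereas the hour-based mask would leave it alone.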
I have a dataframe like this:
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a format with one row per day and one column per time of day. The times become the column headers, and the Value column fills the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but without success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis instead.
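Putting the pieces together, here is a minimal self-contained sketch. It assumes the column headers should read 10:00 rather than 10:00:00, as in the desired output of the question, so it builds them with dt.strftime; the names "day" and "time" given to the grouping keys are illustrative only, and the four sample rows are the ones visible in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-01-01 10:00:00", "2022-01-01 10:30:00",
        "2022-01-01 11:00:00", "2022-02-15 21:00:00",
    ]),
    "Value": [7, 5, 3, 8],
})

out = df.pivot_table(
    values="Value",
    index=df["Date"].dt.date.rename("day"),                   # one row per day
    columns=df["Date"].dt.strftime("%H:%M").rename("time"),   # one column per HH:MM
    fill_value=0,                                             # missing slots become 0
).rename_axis(index=None, columns=None)                       # drop both axis labels

print(out)
```

Since pivot_table aggregates with the mean by default, duplicate timestamps within a day would be averaged rather than raising an error.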
I have a Pandas dataframe that looks like this :
# date
--- -------------------
0 2022-01-01 08:00:00
1 2022-01-01 08:01:00
2 2022-01-01 08:52:00
My goal is to add a new column that contains a datetime object with the value of the next hour. I looked at the documentation of the ceil function, and it works pretty well in most cases.
Issue
The problem concerns hours that are perfectly round (like the one at #0) :
df["next"] = (df["date"]).dt.ceil("H")
# date next
--- ------------------- -------------------
0 2022-01-01 08:00:00 2022-01-01 08:00:00 <--- wrong, expected 09:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00 <--- correct
2 2022-01-01 08:52:00 2022-01-01 09:00:00 <--- correct
Sub-optimal solution
I have come up with the following workaround, but I find it really clumsy :
def nextHour(current):
    return pd.date_range(start=current, periods=2, freq="H")[1]

df["next"] = (df["date"]).apply(lambda x: nextHour(x))
I have around 1-2 million rows in my dataset and I find this solution extremely slow compared to the native dt.ceil(). Is there a better way of doing it ?
This is how ceil works: a value that is already exactly on the hour is left unchanged, so it won't jump to the next hour.
What you want seems more like a floor + 1h using pandas.Timedelta:
df['next'] = df['date'].dt.floor('H')+pd.Timedelta('1h')
output:
date next
0 2022-01-01 08:00:00 2022-01-01 09:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00
2 2022-01-01 08:52:00 2022-01-01 09:00:00
Difference in boundary behavior between floor and ceil:
date ceil floor
0 2022-01-01 08:00:00 2022-01-01 08:00:00 2022-01-01 08:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00 2022-01-01 08:00:00
2 2022-01-01 08:52:00 2022-01-01 09:00:00 2022-01-01 08:00:00
3 2022-01-01 09:00:00 2022-01-01 09:00:00 2022-01-01 09:00:00
4 2022-01-01 09:01:00 2022-01-01 10:00:00 2022-01-01 09:00:00
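The comparison above can be reproduced with a short self-contained sketch. It uses the lowercase "h" alias on the assumption that a recent pandas is installed, since newer releases warn about the uppercase "H"; the extra "ceil" column is there only for comparison.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2022-01-01 08:00:00",
    "2022-01-01 08:01:00",
    "2022-01-01 08:52:00",
])})

# Always move to the start of the following hour, even for exact hours
df["next"] = df["date"].dt.floor("h") + pd.Timedelta(hours=1)

# For comparison: ceil leaves exact hours unchanged
df["ceil"] = df["date"].dt.ceil("h")

print(df)
```

Both operations are vectorized, so they should scale to millions of rows far better than an apply with a Python function.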
So I am looking for a way to fill an empty dataframe column with hourly values between two dates.
for example between
StartDate = 2019:01:01 00:00:00
to
EndDate = 2019:02:01 00:00:00
I would want a column that has
2019:01:01 00:00:00, 2019:01:01 01:00:00, ..., 2019:02:01 00:00:00
in Y:M:D H:M:S format.
I am not sure what the most efficient way of doing this is. Is there a way to do it with pandas, or would I have to use a for loop that steps through the range by a given timedelta?
Use date_range with DataFrame constructor:
StartDate = '2019-01-01 00:00:00'
EndDate = '2019-02-01 00:00:00'
df = pd.DataFrame({'dates':pd.date_range(StartDate, EndDate, freq='H')})
If the dates are in a custom format, first convert them to datetimes:
StartDate = '2019:01:01 00:00:00'
EndDate = '2019:02:01 00:00:00'
StartDate = pd.to_datetime(StartDate, format='%Y:%m:%d %H:%M:%S')
EndDate = pd.to_datetime(EndDate, format='%Y:%m:%d %H:%M:%S')
df = pd.DataFrame({'dates':pd.date_range(StartDate, EndDate, freq='H')})
print (df.head(10))
dates
0 2019-01-01 00:00:00
1 2019-01-01 01:00:00
2 2019-01-01 02:00:00
3 2019-01-01 03:00:00
4 2019-01-01 04:00:00
5 2019-01-01 05:00:00
6 2019-01-01 06:00:00
7 2019-01-01 07:00:00
8 2019-01-01 08:00:00
9 2019-01-01 09:00:00
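If the column should also be available as text in the Y:M:D H:M:S layout from the question, one option is to format the datetimes with dt.strftime after building the range. This is a sketch: the column name dates_str is hypothetical, and the lowercase "h" alias assumes a recent pandas version.

```python
import pandas as pd

fmt = "%Y:%m:%d %H:%M:%S"
start = pd.to_datetime("2019:01:01 00:00:00", format=fmt)
end = pd.to_datetime("2019:02:01 00:00:00", format=fmt)

# Hourly timestamps, both endpoints included
df = pd.DataFrame({"dates": pd.date_range(start, end, freq="h")})

# Hypothetical extra column holding the same values as formatted strings
df["dates_str"] = df["dates"].dt.strftime(fmt)

print(df.head())
```

Keeping the datetime column alongside the strings is usually worthwhile, because the string column can no longer be used for date arithmetic.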
I have a dataframe df as below:
Datetime Value
2020-03-01 08:00:00 10
2020-03-01 10:00:00 12
2020-03-01 12:00:00 15
2020-03-02 09:00:00 1
2020-03-02 10:00:00 3
2020-03-02 13:00:00 8
2020-03-03 10:00:00 20
2020-03-03 12:00:00 25
2020-03-03 14:00:00 15
I would like to calculate the difference between the value at the first time of each date and the value at the last time of each date (ignoring the values at other times within a date), so the result will be:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
I have been doing this using a for loop, but it is slow (as expected) when I have larger data. Any help will be appreciated.
One solution would be to make sure the data is sorted by time, group by the date, and then take the first and last value in each day. This works because pandas preserves the order of the rows within each group during groupby.
df = df.sort_values(by='Datetime').groupby(df['Datetime'].dt.date).agg({'Value': ['first', 'last']})
df['Value_Difference'] = df['Value']['last'] - df['Value']['first']
df = df.drop('Value', axis=1).reset_index()
Result:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
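The same idea can be written so that the result has flat column names instead of a two-level header. The following is a sketch using the sample data from the question; the intermediate name daily is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2020-03-01 08:00:00", "2020-03-01 10:00:00", "2020-03-01 12:00:00",
        "2020-03-02 09:00:00", "2020-03-02 10:00:00", "2020-03-02 13:00:00",
        "2020-03-03 10:00:00", "2020-03-03 12:00:00", "2020-03-03 14:00:00",
    ]),
    "Value": [10, 12, 15, 1, 3, 8, 20, 25, 15],
})

# First and last value per calendar day, after sorting by time
daily = (
    df.sort_values("Datetime")
      .groupby(df["Datetime"].dt.date)["Value"]
      .agg(["first", "last"])
)

result = (daily["last"] - daily["first"]).rename("Value_Difference").reset_index()
print(result)
```

Selecting the Value column before agg is what keeps the columns flat, so no drop step is needed afterwards.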
Shaido's method works, but the groupby might be slow on very large datasets.
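Another possible way is to convert the dates to integers, take their difference to find where the day changes, and grab only the first and last values of each day without a loop.
import numpy as np
# Assumes 'Datetime' is the sorted DatetimeIndex and 'Value' is the only column
idx = df.index
# Positions of the last row of each day (except the final day)
loc = np.diff(idx.strftime('%Y%m%d').astype(int).values).nonzero()[0]
loc1 = np.append(0, loc + 1)         # first row of each day
loc2 = np.append(loc, len(idx) - 1)  # last row of each day
res = df.values[loc2] - df.values[loc1]
df = pd.DataFrame(index=idx.date[loc1], data=res, columns=['values'])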
I have a .csv file with some data. There is only one column in this file, which contains timestamps. I need to organize that data into bins of 30 minutes. This is what my data looks like:
Timestamp
04/01/2019 11:03
05/01/2019 16:30
06/01/2019 13:19
08/01/2019 13:53
09/01/2019 13:43
So in this case, the last two data points would be grouped together in the bin that includes all the data from 13:30 to 14:00.
This is what I have already tried
df = pd.read_csv('book.csv')
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df.groupby(pd.Grouper(key='Timestamp',
freq='30min')).count().dropna()
I am getting around 7000 rows showing all hours for all days with the count next to them, like this:
2019-09-01 03:00:00 0
2019-09-01 03:30:00 0
2019-09-01 04:00:00 0
...
I want to create bins for only the hours that I have in my dataset. I want to see something like this:
Time Count
11:00:00 1
13:00:00 1
13:30:00 2 (we have two data points in this interval)
16:30:00 1
Thanks in advance!
Use groupby.size:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.Timestamp.dt.floor('30min').dt.time.to_frame()\
.groupby('Timestamp').size()\
.reset_index(name='Count')
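Or, as suggested by jpp, use value_counts (note that it orders the result by count rather than by time unless you sort it afterwards):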
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().reset_index(name='Count')
print(df)
Timestamp Count
0 11:00:00 1
1 13:00:00 1
2 13:30:00 2
3 16:30:00 1
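For reference, here is a self-contained sketch of the value_counts route that returns the bins in time order. It assumes the timestamps are day-first (DD/MM/YYYY), which does not affect the time-of-day bins; the sample values are inlined instead of being read from book.csv, and the column names Time and Count follow the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({"Timestamp": [
    "04/01/2019 11:03",
    "05/01/2019 16:30",
    "06/01/2019 13:19",
    "08/01/2019 13:53",
    "09/01/2019 13:43",
]})
# Assumption: the dates are day-first
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d/%m/%Y %H:%M")

counts = (
    df["Timestamp"].dt.floor("30min").dt.time  # start of each 30-minute bin, date dropped
      .value_counts()                          # only bins that actually occur
      .sort_index()                            # order by time instead of by count
      .rename_axis("Time")
      .reset_index(name="Count")
)
print(counts)
```

Because the date is dropped before counting, timestamps from different days that fall in the same half hour end up in the same bin.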