I am working with a pandas Series that contains date/time strings of the form:
"2020-04-01 09:29:21", "2020-04-01 09:53:17", "2020-04-13 09:55:55", ...
The format is "yyyy-mm-dd HH:MM:SS".
I am only interested in the hour and minute components, and I am looking for a way to divide the data into 30-minute buckets and count the values in each bucket.
An example of my end result:
Range          count
9:00-9:30      7
9:30-10:00     25
10:00-10:30    35
...
You need to resample first and then group by the time. Let us create a Series and set the index to a DatetimeIndex, otherwise resample won't work:
import numpy as np
import pandas as pd

# random data
np.random.seed(0)
serie = pd.Series(
    np.random.choice(pd.date_range('2020-01-01', freq='7T22S', periods=10000), 1000)
)
serie.index = serie
Do a resample and then do a groupby:
res = serie.resample('30T').count()
results = res.groupby(res.index.time).sum()
# Change the index to match the format
results.index = results.index.astype(str) + ' - ' + \
                np.roll(results.index.astype(str), -1)
results.head()
# 00:00:00 - 00:30:00 19
# 00:30:00 - 01:00:00 25
# 01:00:00 - 01:30:00 19
# 01:30:00 - 02:00:00 28
# 02:00:00 - 02:30:00 22
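If your data is a Series of strings as in the question, a minimal sketch of the same idea (assuming the strings parse with the stated format) is to floor each parsed timestamp to its 30-minute bucket and count by time of day:
import pandas as pd

# hypothetical data in the question's string format
s = pd.Series(['2020-04-01 09:29:21', '2020-04-01 09:53:17', '2020-04-13 09:55:55'])

# parse, floor to the start of each 30-minute bucket, keep only the time of day
ts = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S')
buckets = ts.dt.floor('30T').dt.time

buckets.value_counts().sort_index()
# 09:00:00    1
# 09:30:00    2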
I have a pandas dataframe with time periods in the second column. Each period represents 30 minutes, and they go all the way up to 48 periods (24 hours). Is there some way to convert the integers representing the periods into a time format and concatenate it with the date column for a full datetime? E.g. 1 becomes 00:30, 2 becomes 01:00, 3 becomes 01:30, and so on.
You can cast the DATE column to datetime and add a timedelta of 30 minutes multiplied by PERIOD.
import pandas as pd

df = pd.DataFrame({'DATE': ['2015-01-03', '2015-01-03', '2015-01-03'],
                   'PERIOD': [1, 2, 3]})
df['DATETIME'] = pd.to_datetime(df['DATE']) + df['PERIOD'] * pd.Timedelta(30, unit='min')
# df
# DATE PERIOD DATETIME
# 0 2015-01-03 1 2015-01-03 00:30:00
# 1 2015-01-03 2 2015-01-03 01:00:00
# 2 2015-01-03 3 2015-01-03 01:30:00
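An equivalent variation, if you prefer building the whole offset column with pd.to_timedelta (not from the original answer, just an alternative sketch):
df['DATETIME'] = pd.to_datetime(df['DATE']) + pd.to_timedelta(df['PERIOD'] * 30, unit='min')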
I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the date column, then use pd.date_range to generate the dates, with an offset frequency of one year and the number of periods equal to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
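Since all the dates fall on January 1, pandas' year-start frequency alias would work as well (a minor variation on the above; 'YS' is the year-start alias):
dates = pd.date_range(strt + offs, freq='YS', periods=k).strftime('%Y-%m-%d').tolist()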
Here you go:
import pandas as pd

# this is your k
k = 5

# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)

# Extracting the year component
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()

# Creating a new DF and populating it with k years
# (DataFrame.append was removed in pandas 2.0, so collect the rows first)
rows = [str(year1 + i) + '-01-01' for i in range(1, k + 1)]
years_df = pd.DataFrame({'dates': rows})
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
I have a timeseries data for a full year for every minute.
timestamp day hour min rainfall_rate
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 1
2010-01-01 00:02:00 1 0 2 2
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 5
... ...
2010-12-31 23:55:00 365 23 55 3
2010-12-31 23:56:00 365 23 56 9
2010-12-31 23:57:00 365 23 57 32
2010-12-31 23:58:00 365 23 58 12
2010-12-31 23:59:00 365 23 59 22
I used sampled_df = rainfall_df.groupby(pd.Grouper(freq="M")).resample('D').sum() to group the data by month and calculate the daily sum of rainfall_rate.
How do I plot the monthly data against the timestamp for every month, and how do I index rainfall_rate? I want the daily rainfall_rate data for every month. Also, is the grouping correct? Suppose I want to plot timestamp vs rainfall_rate for the month of January. How do I do that?
I am new to pandas.
To generate a plot from the resulting resampled data, simply call DataFrame.plot. However, since you have a MultiIndex with two timestamp levels (month and day), call DataFrame.reset_index to drop the redundant month level. For plotting a specific month, run boolean indexing on the day index:
import matplotlib.pyplot as plt
...
# RESET INDEX AND FILTER COLUMNS
sampled_df = (sampled_df.reindex(['rainfall_rate'], axis='columns')
                        .reset_index(level=0, drop=True))
### ALL MONTHS
sampled_df.plot(kind='line')
### ONLY JANUARY
sampled_df[sampled_df.index.month == 1].plot(kind='line')
To demonstrate with random, seeded data:
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(22820)
rainfall_df = pd.DataFrame({'timestamp': pd.date_range('2010-01-01 00:00',
                                                       '2010-12-31 23:59',
                                                       freq='min'),
                            'rainfall_rate': np.random.normal(1, 2, 525600)})
Resampling
sampled_df = (rainfall_df.set_index('timestamp')
                         .groupby(pd.Grouper(freq="M"))
                         .resample('D')
                         .sum())
sampled_df.tail(10)
# rainfall_rate
# timestamp
# 2010-12-22 1454.287302
# 2010-12-23 1367.539650
# 2010-12-24 1460.319823
# 2010-12-25 1464.392407
# 2010-12-26 1338.139227
# 2010-12-27 1454.540103
# 2010-12-28 1553.949133
# 2010-12-29 1301.670684
# 2010-12-30 1536.173442
# 2010-12-31 1333.492614
Plots
sampled_df = sampled_df.reset_index(level=0, drop=True)
### ALL MONTHS
sampled_df.plot(kind='line')
### ONLY JANUARY
sampled_df[sampled_df.index.month == 1].plot(kind='line')
I have a column in a dataframe which contains non-continuous dates. I need to group these dates by a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried the following, but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key='l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have a dataframe sorted by l_date, you can create a continuous dummy date (dum_date) column and group by a 2D frequency on it.
df = df.sort_values(by='l_date')
# one dummy day per row, starting today
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key='dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group n consecutive rows could be:
n = 2  # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]
df.groupby(pd.Grouper(key='grouping'))
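As a quick check of the second approach on the sample dates from the question (a sketch, aggregating each group of n rows):
import pandas as pd

df = pd.DataFrame({'l_date': pd.to_datetime(
    ['2015-04-18', '2015-04-20', '2015-04-20', '2015-04-21',
     '2015-04-27', '2015-04-30', '2015-05-07', '2015-05-08'])})

n = 2
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]

df.groupby('grouping')['l_date'].agg(['min', 'max', 'count'])
#                 min        max  count
# grouping
# 1        2015-04-18 2015-04-20      2
# 2        2015-04-20 2015-04-21      2
# 3        2015-04-27 2015-04-30      2
# 4        2015-05-07 2015-05-08      2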
I have a dataset of samples covering multiple days, all with a timestamp.
I want to select rows within a specific time window. E.g. all rows that were generated between 1pm and 3 pm every day.
This is a sample of my data in a pandas dataframe:
22 22 2018-04-12T20:14:23Z 2018-04-12T21:14:23Z 0 6370.1
23 23 2018-04-12T21:14:23Z 2018-04-12T21:14:23Z 0 6368.8
24 24 2018-04-12T22:14:22Z 2018-04-13T01:14:23Z 0 6367.4
25 25 2018-04-12T23:14:22Z 2018-04-13T01:14:23Z 0 6365.8
26 26 2018-04-13T00:14:22Z 2018-04-13T01:14:23Z 0 6364.4
27 27 2018-04-13T01:14:22Z 2018-04-13T01:14:23Z 0 6362.7
28 28 2018-04-13T02:14:22Z 2018-04-13T05:14:22Z 0 6361.0
29 29 2018-04-13T03:14:22Z 2018-04-13T05:14:22Z 0 6359.3
.. ... ... ... ... ...
562 562 2018-05-05T08:13:21Z 2018-05-05T09:13:21Z 0 6300.9
563 563 2018-05-05T09:13:21Z 2018-05-05T09:13:21Z 0 6300.7
564 564 2018-05-05T10:13:14Z 2018-05-05T13:13:14Z 0 6300.2
565 565 2018-05-05T11:13:14Z 2018-05-05T13:13:14Z 0 6299.9
566 566 2018-05-05T12:13:14Z 2018-05-05T13:13:14Z 0 6299.6
How do I achieve that? I need to ignore the date and just evaluate the time component. I could traverse the dataframe in a loop and evaluate the datetime that way, but there must be a simpler way to do it.
I converted the messageDate, which was read as a string, to a datetime by
df["messageDate"]=pd.to_datetime(df["messageDate"])
But after that I got stuck on how to filter on time only.
Any input appreciated.
datetime columns have a DatetimeProperties object (the .dt accessor), from which you can extract components such as the hour and filter on them:
import pandas as pd

df = pd.DataFrame(
    [
        '2018-04-12T12:00:00Z', '2018-04-12T14:00:00Z', '2018-04-12T20:00:00Z',
        '2018-04-13T12:00:00Z', '2018-04-13T14:00:00Z', '2018-04-13T20:00:00Z'
    ],
    columns=['messageDate']
)
df["messageDate"] = pd.to_datetime(df["messageDate"])
df
#           messageDate
# 0 2018-04-12 12:00:00
# 1 2018-04-12 14:00:00
# 2 2018-04-12 20:00:00
# 3 2018-04-13 12:00:00
# 4 2018-04-13 14:00:00
# 5 2018-04-13 20:00:00
time_mask = (df['messageDate'].dt.hour >= 13) & \
(df['messageDate'].dt.hour <= 15)
df[time_mask]
# messageDate
# 1 2018-04-12 14:00:00
# 4 2018-04-13 14:00:00
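If you make the datetime column the index, pandas also provides DataFrame.between_time, which does this kind of clock-time filtering directly, keeping rows whose time of day falls between the two bounds (a sketch on the frame above):
df.set_index('messageDate').between_time('13:00', '15:00')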
I hope the code is self explanatory. You can always ask questions.
import pandas as pd
# Prepping data for example
dates = pd.date_range('1/1/2018', periods=7, freq='H')
data = {'A' : range(7)}
df = pd.DataFrame(index = dates, data = data)
print(df)
# A
# 2018-01-01 00:00:00 0
# 2018-01-01 01:00:00 1
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
# 2018-01-01 05:00:00 5
# 2018-01-01 06:00:00 6
# Creating a mask to filter the values we wish to keep or not.
# Here, we use df.index because the index is our datetime.
# If the datetime is a column, you can always say df['column_name']
mask = (df.index > '2018-1-1 01:00:00') & (df.index < '2018-1-1 05:00:00')
print(mask)
# [False False True True True False False]
df_with_good_dates = df.loc[mask]
print(df_with_good_dates)
# A
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
df = df[(df["messageDate"].apply(lambda x: x.hour) >= 13) &
        (df["messageDate"].apply(lambda x: x.hour) < 15)]
You can use x.minute and x.second similarly.
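For a finer-grained window (say 13:30 to 15:00), you could compare full datetime.time values instead of the hour alone (a sketch, assuming messageDate is already parsed as datetime):
import datetime

# keep rows whose time of day falls in [13:30, 15:00]
mask = df['messageDate'].dt.time.between(datetime.time(13, 30), datetime.time(15, 0))
df = df[mask]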
Try this after ensuring messageDate is indeed in datetime format, as you have done:
df.set_index('messageDate', inplace=True)
choseInd = [ind for ind in df.index if (ind.hour >= 13) & (ind.hour <= 15)]
df_select = df.loc[choseInd]
You can do the same even without making the datetime column the index, as the answer with apply/lambda shows; it just makes your dataframe 'better looking' if the datetime is your index rather than a numerical one.