Grouping by weekly breakdown of datetimes - python

I have a pandas data frame with a column that represents dates as:
Name: ts_placed, Length: 13631, dtype: datetime64[ns]
It looks like this:
0 2014-10-18 16:53:00
1 2014-10-27 11:57:00
2 2014-10-27 11:57:00
3 2014-10-08 16:35:00
4 2014-10-24 16:36:00
5 2014-11-06 15:34:00
6 2014-11-11 10:30:00
....
I know how to group it in general using the function:
grouped = data.groupby('ts_placed')
What I want to do is to use the same function but to group the rows by week.

Pass
pd.DatetimeIndex(df.date).week
as the argument to groupby. This is the ordinal week in the year; see DatetimeIndex for other definitions of week.

You can also use TimeGrouper (replaced by pd.Grouper in newer pandas):
df.set_index(your_date).groupby(pd.TimeGrouper('W')).size()
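Both suggestions above rely on APIs that later pandas versions removed: the .week attribute and pd.TimeGrouper are gone. A sketch of the modern equivalents, assuming a ts_placed column like the one shown:

```python
import pandas as pd

df = pd.DataFrame({
    "ts_placed": pd.to_datetime([
        "2014-10-18 16:53:00", "2014-10-27 11:57:00",
        "2014-10-27 11:57:00", "2014-11-06 15:34:00",
    ]),
})

# ISO week number per row (replacement for the removed .week attribute)
by_week = df.groupby(df["ts_placed"].dt.isocalendar().week).size()

# Calendar-week bins via pd.Grouper (the modern spelling of pd.TimeGrouper)
by_bin = df.set_index("ts_placed").groupby(pd.Grouper(freq="W")).size()
```

Note the two definitions differ: isocalendar().week labels groups with an integer week-of-year, while pd.Grouper(freq="W") bins by actual week-ending dates.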

Related

Convert 5-minute data points to 30-minute intervals by averaging the data points

I have a Dataset like this:
Timestamp Index Var1
19/03/2015 05:55:00 1 3
19/03/2015 06:00:00 2 4
19/03/2015 06:05:00 3 6
19/03/2015 06:10:00 4 5
19/03/2015 06:15:00 5 7
19/03/2015 06:20:00 6 7
19/03/2015 06:25:00 7 4
The data points were collected at 5-minute intervals. Convert 5-minute data points to 30-minute intervals by averaging Var1. For example, the first data point for the 30-minute intervals will be the average of the 1st data point to the 6th data point (row 1 – 6) from the provided dataset of 5-minute intervals.
I tried using
df.groupby(pd.Grouper(key='Timestamp', freq='30min')).mean()
To start from the first timestamp instead of aligning to hours, you just need to specify origin='start'. (I found that in the docs on Grouper.)
Also, averaging the Index column doesn't really make sense. It seems like you want to select only the Var1 column.*
df.groupby(
pd.Grouper(key='Timestamp', freq='30min', origin='start')
)['Var1'].mean()
Output:
Timestamp
2015-03-19 05:55:00 5.333333
2015-03-19 06:25:00 4.000000
Freq: 30T, Name: Var1, dtype: float64
* Or you could just as easily do something else with the Index column, for example, keep the first value from each group:
...
).agg({'Index': 'first', 'Var1': 'mean'})
Index Var1
Timestamp
2015-03-19 05:55:00 1 5.333333
2015-03-19 06:25:00 7 4.000000
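The same binning can also be written with resample, which is essentially a groupby with a time Grouper built in; a sketch assuming the columns shown above:

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.date_range("2015-03-19 05:55:00", periods=7, freq="5min"),
    "Index": range(1, 8),
    "Var1": [3, 4, 6, 5, 7, 7, 4],
})

# on="Timestamp" avoids setting the index first; origin="start" anchors
# the 30-minute bins at the first timestamp instead of the hour boundary
out = df.resample("30min", on="Timestamp", origin="start")["Var1"].mean()
```

The first bin averages rows 1-6 (5.33...) and the second holds only the last row (4.0), matching the groupby version.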

Frequency of events in a week

My data has trips with datetime info, user id for each trip and trip type (single, round, pseudo).
Here's a data sample (pandas dataframe), named All_Data:
HoraDTRetirada idpass type
2016-02-17 15:36:00 39579449489 'single'
2016-02-18 19:13:00 39579449489 'single'
2016-02-26 09:20:00 72986744521 'pseudo'
2016-02-27 12:11:00 72986744521 'round'
2016-02-27 14:55:00 11533148958 'pseudo'
2016-02-28 12:27:00 72986744521 'round'
2016-02-28 16:32:00 72986744521 'round'
I would like to count the number of times each category repeats in a "week of year" by user.
For example, if the event happens on a monday and the next event happens on a thursday for a same user, that makes two events on the same week; however, if one event happens on a saturday and the next event happens on the following monday, they happened in different weeks.
The output I am looking for would be in a form like this:
idpass weekofyear type frequency
39579449489 1 'single' 2
72986744521 2 'round' 3
72986744521 2 'pseudo' 1
11533148958 2 'pseudo' 1
Edit: this older question approaches a similar problem, but I don't know how to do it with pandas.
import pandas as pd
data = {"HoraDTRetirada": ["2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-12-31 09:20:00", "2016-02-28 12:11:00",
"2016-02-28 14:55:00", "2016-02-29 12:27:00", "2016-02-29 16:32:00"],
"idpass": ["39579449489", "39579449489", "72986744521", "72986744521", "11533148958", "72986744521",
"72986744521"],
"type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"]}
df = pd.DataFrame.from_dict(data)
print(df)
df["HoraDTRetirada"] = pd.to_datetime(df['HoraDTRetirada'])
df["week"] = df['HoraDTRetirada'].dt.strftime('%U')
k = df.groupby(["idpass", "week", "type"],as_index=False).count()
print(k)
Output:
HoraDTRetirada idpass type
0 2016-02-17 15:36:00 39579449489 single
1 2016-02-18 19:13:00 39579449489 single
2 2016-12-31 09:20:00 72986744521 pseudo
3 2016-02-28 12:11:00 72986744521 round
4 2016-02-28 14:55:00 11533148958 pseudo
5 2016-02-29 12:27:00 72986744521 round
6 2016-02-29 16:32:00 72986744521 round
idpass week type HoraDTRetirada
0 11533148958 09 pseudo 1
1 39579449489 07 single 2
2 72986744521 09 round 3
3 72986744521 52 pseudo 1
This is how I got what I was looking for:
Step 1 from suggested answers was skipped because timestamps were already in pandas datetime form.
Step 2: create column for week of year:
df['week'] = df['HoraDTRetirada'].dt.strftime('%U')
Step 3: group by user id, type and week, and count values with size()
df.groupby(['idpass','type','week']).size()
My suggestion would be to do this:
Make sure your timestamp is a pandas datetime, and add a frequency column:
df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])
df['freq'] = 1
Group it and count
res = df.groupby(['idpass', 'type', pd.Grouper(key='HoraDTRetirada', freq='1W')]).count().reset_index()
Convert time to week of a year
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
The final result looks like this:
EDIT:
You are right: in your case step 3 should come before step 2. If you do that, the groupby changes, so step 2 finally becomes:
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
and step 3 :
res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()
It's a bit different because the "Hora" variable is not a time anymore, but just an int representing a week.
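Note that strftime('%U') and Timestamp.week count weeks differently (%U starts weeks on Sunday, .week is ISO), and in recent pandas .dt.week has been removed in favour of .dt.isocalendar().week. A sketch of the whole pipeline with that accessor, assuming a sample like the one above:

```python
import pandas as pd

df = pd.DataFrame({
    "HoraDTRetirada": pd.to_datetime([
        "2016-02-17 15:36:00", "2016-02-18 19:13:00",
        "2016-02-28 12:11:00", "2016-02-29 12:27:00",
    ]),
    "idpass": ["39579449489", "39579449489", "72986744521", "72986744521"],
    "type": ["single", "single", "round", "round"],
})

# ISO week number, then count events per user / week / type
df["week"] = df["HoraDTRetirada"].dt.isocalendar().week
freq = (df.groupby(["idpass", "week", "type"])
          .size()
          .reset_index(name="frequency"))
```

Under ISO numbering, 2016-02-28 (a Sunday) and 2016-02-29 (a Monday) land in different weeks, which matches the behaviour the question asks for.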

How to build bar plot with group by day/hour interval?

I have this dataset:
name date
0 ramos-vinolas-sao-paulo-2017-final 2017-03-05 22:50:00
1 sao-paulo-2017-doubles-final-sa-dutra-silva 2017-03-05 19:29:00
2 querrey-acapulco-2017-trophy 2017-03-05 06:08:00
3 soares-murray-acapulco-2017-doubles-final 2017-03-05 02:48:00
4 cuevas-sao-paulo-2017-saturday 2017-03-04 21:54:00
5 dubai-2017-doubles-final-rojer-tecau2 2017-03-04 18:23:00
I'd like to build bar plot with amount of news by day/hour. Something like
count date
4 2017-03-05
2 2017-03-04
I think you need dt.date with value_counts, then plot.bar for plotting:
#if necessary convert to datetime
df['date'] = pd.to_datetime(df.date)
print (df.date.dt.date.value_counts())
2017-03-05 4
2017-03-04 2
Name: date, dtype: int64
df.date.dt.date.value_counts().plot.bar()
A simple approach is to use the pandas function hist():
df["date"].hist()

Pandas - convert strings to time without date

I've read loads of SO answers but can't find a clear solution.
I have this data in a df called day1 which represents hours:
1 10:53
2 12:17
3 14:46
4 16:36
5 18:39
6 20:31
7 22:28
Name: time, dtype: object
I want to convert it into a time format. But when I do this:
day1.time = pd.to_datetime(day1.time, format='H%:M%')
The result includes today's date:
1 2015-09-03 10:53:00
2 2015-09-03 12:17:00
3 2015-09-03 14:46:00
4 2015-09-03 16:36:00
5 2015-09-03 18:39:00
6 2015-09-03 20:31:00
7 2015-09-03 22:28:00
Name: time, dtype: datetime64[ns]
It seems the format argument isn't working - how do I get the time as shown here without the date?
Update
The following formats the time correctly, but somehow the column is still an object type. Why doesn't it convert to datetime64?
day1['time'] = pd.to_datetime(day1['time'], format='%H:%M').dt.time
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: time, dtype: object
After performing the conversion you can use the datetime accessor dt to access just the hour or time component:
In [51]:
df['hour'] = pd.to_datetime(df['time'], format='%H:%M').dt.hour
df
Out[51]:
time hour
index
1 10:53 10
2 12:17 12
3 14:46 14
4 16:36 16
5 18:39 18
6 20:31 20
7 22:28 22
Also, your format string H%:M% is malformed; it's likely to raise a ValueError: ':' is a bad directive in format 'H%:M%'
Regarding your last comment the dtype is datetime.time not datetime:
In [53]:
df['time'].iloc[0]
Out[53]:
datetime.time(10, 53)
You can use to_timedelta
pd.to_timedelta(df+':00')
Out[353]:
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: Time, dtype: timedelta64[ns]
I recently also struggled with this problem. My method is close to EdChum's method and the result is the same as YOBEN_S's answer.
Just like EdChum illustrated, using dt.hour or dt.time will give you a datetime.time object, which is probably only good for display. I can barely do any comparison or calculation on these objects. So if you need any further comparison or calculation operations on the result columns, it's better to avoid such data formats.
My method just subtracts the date from the to_datetime result:
c = pd.Series(['10:23', '12:17', '14:46'])
pd.to_datetime(c, format='%H:%M') - pd.to_datetime(c, format='%H:%M').dt.normalize()
The result is
0 10:23:00
1 12:17:00
2 14:46:00
dtype: timedelta64[ns]
dt.normalize() basically sets all time components to 00:00:00; the result displays only the date but keeps the datetime64 format, which makes the subtraction possible.
My answer is by no means better than the other two. I just want to provide a different approach and hope it helps.
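To illustrate the point about calculations: timedelta64 values support comparisons and arithmetic directly, which datetime.time objects do not. A small sketch:

```python
import pandas as pd

times = pd.Series(["10:23", "12:17", "14:46"])
td = pd.to_timedelta(times + ":00")

# Comparisons and arithmetic work directly on timedelta64 values
afternoon = td[td > pd.Timedelta(hours=12)]   # entries after noon
span = td.max() - td.min()                    # elapsed time, 04:23:00
```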

how to import dates in python

I have a pandas dataframe containg a column (dtype: object) in which dates are expressed as:
0 2014-11-07 14:08:00
1 2014-10-18 16:53:00
2 2014-10-27 11:57:00
3 2014-10-27 11:57:00
4 2014-10-08 16:35:00
5 2014-10-24 16:36:00
6 2014-11-06 15:34:00
7 2014-11-11 10:30:00
8 2014-10-31 13:20:00
9 2014-11-07 13:15:00
10 2014-09-20 14:36:00
11 2014-11-07 17:21:00
12 2014-09-23 08:53:00
13 2014-11-05 09:37:00
14 2014-10-26 18:48:00
...
Name: ts_placed, Length: 13655, dtype: object
What I want to do is read the column as dates and then split the dataset by week.
What I tried to do is:
data["ts_placed"] = pd.to_datetime(data.ts_placed)
data.sort('ts_placed')
It did not work
TypeError: unorderable types: str() > datetime.datetime()
Does anybody know a way to import dates in pythons when these are expressed as objects?
Thank you very much
Use Series.dt methods.
For the date, you can use Series.dt.date:
data['Date Column'] = data['Date Column'].dt.date
For the week, you can use Series.dt.weekofyear (deprecated in newer pandas in favour of Series.dt.isocalendar().week):
data['Week'] = data['Date Column'].dt.weekofyear
Then you would create new data based on week:
weekdata = data[data['Week'] == week_number]
The sort should also work now.
It looks like to_datetime fails here when applied to the whole Series; converting element by element works:
data['ts_placed'] = [pd.to_datetime(strD) for strD in data.ts_placed]
data.sort_values('ts_placed')
UPDATE: I wanted my accepted answer to match the solution figured out in the comments. If the Series version, pd.to_datetime(data.ts_placed), is run, it will not convert the column to datetime objects unless all of the strings can be converted. The element-wise version above converts those that can be converted. In either case you should check whether all values have been converted.
With the Series version you could check using:
data.ts_placed = pd.to_datetime(data.ts_placed)
if not isinstance(data.ts_placed[0], pd.Timestamp):
    print('Dates not converted correctly')
Using the element-wise version as above:
if sum(not isinstance(strD, datetime.datetime) for strD in data.ts_placed) > 0:
    print('Dates not converted correctly')
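In modern pandas, to_datetime does accept a Series directly; a simpler way to surface unparseable entries is errors='coerce' plus an isna check. A sketch:

```python
import pandas as pd

ts = pd.Series(["2014-11-07 14:08:00", "2014-10-18 16:53:00", "not a date"])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(ts, errors="coerce")
bad = parsed.isna()  # boolean mask of rows that failed to convert
```

After this, sorting with data.sort_values('ts_placed') works, since every value is either a Timestamp or NaT.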
