I have up to three different timestamps for each day in dataframe. In a new column called 'Category' I want to give them a number from 1 to 3 based on time of the timestamp. Almost like a partition by with rank in sql.
Something like: for each day check the time of run and assign a rank based on if it was the first run, the second or the third (if there is a third run).
This dataframe has about half a million rows. For a few years, 2-3 runs every day. And it has data for on hourly resolution.
Any suggestion how to do this most efficiently?
Example of how it is supposed to look like:
Timestamp
Category
2020-01-17 08:18:00
1
2020-01-17 11:57:00
2
2020-01-17 15:35:00
3
2020-01-18 09:00:00
1
2020-01-18 12:00:00
2
2020-01-18 17:00:00
3
Use groupby() and .cumcount()
df['timestamp'] = pd.to_datetime(df['timestamp'], format = '%Y/%m/%d %H:%M')
df['category'] = df.groupby([df['timestamp'].dt.to_period('d')]).cumcount().add(1)
df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp')).cumcount().add(1)
Output:
>>> df
Timestamp Category
0 2020-01-17 08:18:00 1
1 2020-01-17 11:57:00 2
2 2020-01-17 15:35:00 3
3 2020-01-18 09:00:00 1
4 2020-01-18 12:00:00 2
5 2020-01-18 17:00:00 3
UPDATE: Try this:
df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp'))['Timestamp'].diff().ne(pd.Timedelta(0)).cumsum()
Related
I have a pandas series s, I would like to extract the Monday before the third Friday:
with the help of the answer in following link, I can get a resample of third friday, I am still not sure how to get the Monday just before it.
pandas resample to specific weekday in month
from pandas.tseries.offsets import WeekOfMonth
s.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()
Any help is welcome
Many thanks
For each source date, compute your "wanted" date in 3 steps:
Shift back to the first day of the current month.
Shift forward to Friday in third week.
Shift back 4 days (from Friday to Monday).
For a Series containing dates, the code to do it is:
s.dt.to_period('M').dt.to_timestamp() + pd.offsets.WeekOfMonth(week=2, weekday=4)\
- pd.Timedelta('4D')
To test this code I created the source Series as:
s = (pd.date_range('2020-01-01', '2020-12-31', freq='MS') + pd.Timedelta('1D')).to_series()
It contains the second day of each month, both as the index and value.
When you run the above code, you will get:
2020-01-02 2020-01-13
2020-02-02 2020-02-17
2020-03-02 2020-03-16
2020-04-02 2020-04-13
2020-05-02 2020-05-11
2020-06-02 2020-06-15
2020-07-02 2020-07-13
2020-08-02 2020-08-17
2020-09-02 2020-09-14
2020-10-02 2020-10-12
2020-11-02 2020-11-16
2020-12-02 2020-12-14
dtype: datetime64[ns]
The left column contains the original index (source date) and the right
column - the "wanted" date.
Note that third Monday formula (as proposed in one of comments) is wrong.
E.g. third Monday in January is 2020-01-20, whereas the correct date is 2020-01-13.
Edit
If you have a DataFrame, something like:
Date Amount
0 2020-01-02 10
1 2020-01-12 10
2 2020-01-13 2
3 2020-01-20 2
4 2020-02-16 2
5 2020-02-17 12
6 2020-03-15 12
7 2020-03-16 3
8 2020-03-31 3
and you want something like resample but each "period" should start
on a Monday before the third Friday in each month, and e.g. compute
a sum for each period, you can:
Define the following function:
def dateShift(d):
d += pd.Timedelta(4, 'D')
d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
return d - pd.Timedelta(4, 'D')
i.e.:
Add 4 days (e.g. move 2020-01-13 (Monday) to 2020-01-17 (Friday).
Roll back (in the above case (on offset) this date will not be moved).
Subtract 4 days.
Run:
df.groupby(df.Date.apply(dateShift)).sum()
The result is:
Amount
Date
2019-12-16 20
2020-01-13 6
2020-02-17 24
2020-03-16 6
E. g. two values of 10 for 2020-01-02 and 2020-01-12 are assigned
to period starting on 2019-12-16 (the "wanted" date for December 2019).
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00, and then go from one hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can i make it to start at 00:00:00??
Thanks for the help!
You can create an hour column from Hora_Retiro column.
df['hour'] = df['Hora_Retiro'].dt.hour
And then groupby on the basis of hour
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, as in this case there
would be printed also the date part.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
round each element in this column to hour,
then group (just by this rounded value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
lambda tt: tt.round('H'))).count_uses.count()
However I advise you to make up your mind, what do you want to count:
rows or values in count_uses column.
In the second case replace count function with sum.
I have a dataframe df as below:
Datetime Value
2020-03-01 08:00:00 10
2020-03-01 10:00:00 12
2020-03-01 12:00:00 15
2020-03-02 09:00:00 1
2020-03-02 10:00:00 3
2020-03-02 13:00:00 8
2020-03-03 10:00:00 20
2020-03-03 12:00:00 25
2020-03-03 14:00:00 15
I would like to calculate the difference between the value on the first time of each date and the last time of each date (ignoring the value of other time within a date), so the result will be:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
I have been doing this using a for loop, but it is slow (as expected) when I have larger data. Any help will be appreciated.
One solution would be to make sure the data is sorted by time, group by the data and then take the first and last value in each day. This works since pandas will preserve the order during groupby, see e.g. here.
df = df.sort_values(by='Datetime').groupby(df['Datetime'].dt.date).agg({'Value': ['first', 'last']})
df['Value_Difference'] = df['Value']['last'] - df['Value']['first']
df = df.drop('Value', axis=1).reset_index()
Result:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
Shaido's method works, but might be slow due to the groupby on very large sets
Another possible way is to take a difference from dates converted to int and only grab the values necessary without a loop.
idx = df.index
loc = np.diff(idx.strftime('%Y%m%d').astype(int).values).nonzero()[0]
loc1 = np.append(0,loc)
loc2 = np.append(loc,len(idx)-1)
res = df.values[loc2]-df.values[loc1]
df = pd.DataFrame(index=idx.date[loc1],values=res,columns=['values'])
I have a .csv file with some data. There is only one column of in this file, which includes timestamps. I need to organize that data into bins of 30 minutes. This is what my data looks like:
Timestamp
04/01/2019 11:03
05/01/2019 16:30
06/01/2019 13:19
08/01/2019 13:53
09/01/2019 13:43
So in this case, the last two data points would be grouped together in the bin that includes all the data from 13:30 to 14:00.
This is what I have already tried
df = pd.read_csv('book.csv')
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df.groupby(pd.Grouper(key='Timestamp',
freq='30min')).count().dropna()
I am getting around 7000 rows showing all hours for all days with the count next to them, like this:
2019-09-01 03:00:00 0
2019-09-01 03:30:00 0
2019-09-01 04:00:00 0
...
I want to create bins for only the hours that I have in my dataset. I want to see something like this:
Time Count
11:00:00 1
13:00:00 1
13:30:00 2 (we have two data points in this interval)
16:30:00 1
Thanks in advance!
Use groupby.size as:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.Timestamp.dt.floor('30min').dt.time.to_frame()\
.groupby('Timestamp').size()\
.reset_index(name='Count')
Or as per suggestion by jpp:
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().reset_index(name='Count')
print(df)
Timestamp Count
0 11:00:00 1
1 13:00:00 1
2 13:30:00 2
3 16:30:00 1
I am selecting several csv file in a folder. Each file has a "Time" Column.
I would like to plot an additional column called time duration which substract the time of each row with the first row and this for each file
What should I add in my code?
strong textoutput = pd.DataFrame()
for name in list_files_log:
with folder.get_download_stream(name) as f:
try:
tmp = pd.read_csv(f)
tmp["sn"] = get_sn(name)
tmp["filename"]= os.path.basename(name)
output = output.append(tmp)
except:
pass
If your Time column would look like this:
Time
0 2015-02-04 02:10:00
1 2016-03-05 03:30:00
2 2017-04-06 04:40:00
3 2018-05-07 05:50:00
You could create Duration column using:
df['Duration'] = df['Time'] - df['Time'][0]
And you'd get:
Time Duration
0 2015-02-04 02:10:00 0 days 00:00:00
1 2016-03-05 03:30:00 395 days 01:20:00
2 2017-04-06 04:40:00 792 days 02:30:00
3 2018-05-07 05:50:00 1188 days 03:40:00