Subtracting rows in different files - python

I am selecting several CSV files in a folder. Each file has a "Time" column.
I would like to add an extra column called time duration, which subtracts the time in the first row from the time in each row, and do this for each file.
What should I add to my code?
output = pd.DataFrame()
for name in list_files_log:
    with folder.get_download_stream(name) as f:
        try:
            tmp = pd.read_csv(f)
            tmp["sn"] = get_sn(name)
            tmp["filename"] = os.path.basename(name)
            output = output.append(tmp)
        except:
            pass

If your Time column looks like this:
Time
0 2015-02-04 02:10:00
1 2016-03-05 03:30:00
2 2017-04-06 04:40:00
3 2018-05-07 05:50:00
You could create Duration column using:
df['Duration'] = df['Time'] - df['Time'].iloc[0]
And you'd get:
Time Duration
0 2015-02-04 02:10:00 0 days 00:00:00
1 2016-03-05 03:30:00 395 days 01:20:00
2 2017-04-06 04:40:00 792 days 02:30:00
3 2018-05-07 05:50:00 1188 days 03:40:00
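The question asks for the duration per file rather than over the whole concatenated frame. A minimal sketch, assuming the combined DataFrame keeps the filename column added in the question's loop and that Time parses with pd.to_datetime (the sample rows below are made up for illustration):

```python
import pandas as pd

# Illustrative stand-in for the concatenated `output` frame from the question.
output = pd.DataFrame({
    "filename": ["a.csv", "a.csv", "b.csv", "b.csv"],
    "Time": [
        "2015-02-04 02:10:00",
        "2015-02-04 03:30:00",
        "2016-03-05 01:00:00",
        "2016-03-06 01:00:00",
    ],
})
output["Time"] = pd.to_datetime(output["Time"])

# First Time value of each file, broadcast back to every row of that file.
first_time = output.groupby("filename")["Time"].transform("first")
output["Duration"] = output["Time"] - first_time
print(output)
```

The same subtraction could instead be done inside the loop, on tmp, before it is appended.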

Related

Number timestamps based on time of timestamp

I have up to three different timestamps for each day in a DataFrame. In a new column called 'Category' I want to number them from 1 to 3 based on the time of the timestamp, much like PARTITION BY with RANK in SQL.
Something like: for each day, check the time of the run and assign a rank depending on whether it was the first, the second, or the third run (if there is a third run).
This DataFrame has about half a million rows: a few years of data with 2-3 runs every day, at hourly resolution.
Any suggestions on how to do this most efficiently?
Example of how it is supposed to look:
Timestamp            Category
2020-01-17 08:18:00  1
2020-01-17 11:57:00  2
2020-01-17 15:35:00  3
2020-01-18 09:00:00  1
2020-01-18 12:00:00  2
2020-01-18 17:00:00  3
Use groupby() and .cumcount():
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Category'] = df.groupby(df['Timestamp'].dt.to_period('D')).cumcount().add(1)
Or, equivalently, with pd.Grouper:
df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp')).cumcount().add(1)
Output:
>>> df
Timestamp Category
0 2020-01-17 08:18:00 1
1 2020-01-17 11:57:00 2
2 2020-01-17 15:35:00 3
3 2020-01-18 09:00:00 1
4 2020-01-18 12:00:00 2
5 2020-01-18 17:00:00 3
UPDATE: Try this:
df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp'))['Timestamp'].diff().ne(pd.Timedelta(0)).cumsum()
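If each run timestamp is repeated on many rows (as the hourly-resolution data in the question suggests), a dense rank within each day is one way to number the runs so that the count restarts every day. A minimal sketch with invented sample rows; it does not depend on the rows being sorted:

```python
import pandas as pd

# Illustrative data: each run timestamp may appear on several rows.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2020-01-17 08:18:00",
        "2020-01-17 08:18:00",
        "2020-01-17 11:57:00",
        "2020-01-18 09:00:00",
        "2020-01-18 09:00:00",
        "2020-01-18 12:00:00",
    ]),
})

# Group by calendar day, then rank the distinct timestamps inside each day.
day = df["Timestamp"].dt.normalize()
df["Category"] = (
    df.groupby(day)["Timestamp"].rank(method="dense").astype(int)
)
print(df)
```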

How to use pandas Grouper to get sum of values within each hour

I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key="Hora_Retiro", freq='H')).count()
The Hora_Retiro column is of type timedelta64[ns].
The code gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then continue in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
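You can create an hour column from the Hora_Retiro column. For a timedelta64[ns] column the hour is available as .dt.components.hours (.dt.hour exists only on datetime columns).
df['hour'] = df['Hora_Retiro'].dt.components.hours
Then group by hour: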
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of
Timedelta type. It cannot be datetime, because in that case
the date part would be printed as well.
Indeed, your code creates groups whose edges start at the minute / second
taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding would move times such as 00:45 into the 01:00 group),
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('h')).count_uses.count()
However, I advise you to decide what you want to count:
rows or values in the count_uses column.
In the second case, replace the count function with sum.
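For reference, a small self-contained sketch of grouping a timedelta column by full hours. The four sample rows are invented, and sum is used on the assumption that the goal is the total number of uses per hour:

```python
import pandas as pd

# Illustrative data shaped like the question's table.
hora_pico = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta(
        ["00:00:18", "00:45:34", "01:02:27", "23:59:56"]
    ),
    "count_uses": [1, 1, 1, 1],
})

# Floor each timedelta to the start of its hour, then sum per hour.
hour_start = hora_pico["Hora_Retiro"].dt.floor("h")
per_hour = hora_pico.groupby(hour_start)["count_uses"].sum()
print(per_hour)
```

Hours with no rows do not appear in the result; reindexing against a full range of 24 hourly timedeltas would be one way to add them with a count of zero.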

Group data into bins of 30 minutes

I have a .csv file with some data. There is only one column in this file, which contains timestamps. I need to organize that data into bins of 30 minutes. This is what my data looks like:
Timestamp
04/01/2019 11:03
05/01/2019 16:30
06/01/2019 13:19
08/01/2019 13:53
09/01/2019 13:43
So in this case, the last two data points would be grouped together in the bin that includes all the data from 13:30 to 14:00.
This is what I have already tried:
df = pd.read_csv('book.csv')
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df.groupby(pd.Grouper(key='Timestamp', freq='30min')).count().dropna()
I am getting around 7000 rows showing all hours for all days with the count next to them, like this:
2019-09-01 03:00:00 0
2019-09-01 03:30:00 0
2019-09-01 04:00:00 0
...
I want to create bins for only the hours that I have in my dataset. I want to see something like this:
Time Count
11:00:00 1
13:00:00 1
13:30:00 2 (we have two data points in this interval)
16:30:00 1
Thanks in advance!
Use groupby.size as:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.Timestamp.dt.floor('30min').dt.time.to_frame()\
.groupby('Timestamp').size()\
.reset_index(name='Count')
Or as per suggestion by jpp:
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().reset_index(name='Count')
print(df)
Timestamp Count
0 11:00:00 1
1 13:00:00 1
2 13:30:00 2
3 16:30:00 1

How to find duration between two time difference in python dataframe

I have raw data like this and want to find the difference between the two times in minutes. The data is in a DataFrame.
Source:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
I need output like this:
duration
540mint
798mint
162mint
1140mint
420mint
Your expected output seems to be incorrect. That aside, we can use base R's difftime:
transform(
    df,
    duration = difftime(
        strptime(end.time, format = "%H:%M:%S"),
        strptime(start.time, format = "%H:%M:%S"),
        units = "mins"))
# start.time end.time duration
#0 08:30:00 17:30:00 540 mins
#1 11:00:00 17:30:00 390 mins
#2 08:00:00 21:30:00 810 mins
#3 19:30:00 22:00:00 150 mins
#4 19:00:00 00:00:00 -1140 mins
#5 08:30:00 15:30:00 420 mins
or as a difftime vector
with(df, difftime(
    strptime(end.time, format = "%H:%M:%S"),
    strptime(start.time, format = "%H:%M:%S"),
    units = "mins"))
#Time differences in mins
#[1] 540 390 810 150 -1140 420
Sample data
df <- read.table(text =
" 'start time' 'end time'
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00", header = T, row.names = 1)
import pandas as pd
df = pd.DataFrame({'start time':['08:30:00','11:00:00','08:00:00','19:30:00','19:00:00','08:30:00'],'end time':['17:30:00','17:30:00','21:30:00','22:00:00','00:00:00','15:30:00']},columns=['start time','end time'])
df
Out[355]:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
(pd.to_datetime(df['end time']) - pd.to_datetime(df['start time'])).dt.seconds/60
Out[356]:
0 540.0
1 390.0
2 810.0
3 150.0
4 300.0
5 420.0
dtype: float64
Yes, definitely datetime is what you need here. Specifically, the strptime function, which parses a string into a time object.
from datetime import datetime
s1 = '10:33:26'
s2 = '11:15:49' # for example
FMT = '%H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
That gets you a timedelta object that contains the difference between the two times. You can do whatever you want with that, e.g. converting it to seconds or adding it to another datetime.
This will return a negative result if the end time is earlier than the start time, for example s1 = 12:00:00 and s2 = 05:00:00. If you want the code to assume the interval crosses midnight in this case (i.e. it should assume the end time is never earlier than the start time), you can add the following lines to the above code:
if tdelta.days < 0:
    tdelta = timedelta(days=0,
                       seconds=tdelta.seconds, microseconds=tdelta.microseconds)
(of course you need to include from datetime import timedelta somewhere). Thanks to J.F. Sebastian for pointing out this use case.
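For the DataFrame in the question, the same midnight adjustment can be applied to whole columns at once. A sketch, assuming the times are HH:MM:SS strings and that an end time earlier than the start time means the interval crosses midnight (the duration_min column name is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "start time": ["08:30:00", "11:00:00", "08:00:00",
                   "19:30:00", "19:00:00", "08:30:00"],
    "end time": ["17:30:00", "17:30:00", "21:30:00",
                 "22:00:00", "00:00:00", "15:30:00"],
})

# Parse both columns as time-of-day offsets and subtract.
delta = pd.to_timedelta(df["end time"]) - pd.to_timedelta(df["start time"])

# Negative differences are assumed to cross midnight: add one day.
delta = delta.where(delta >= pd.Timedelta(0), delta + pd.Timedelta(days=1))

df["duration_min"] = delta.dt.total_seconds() / 60
print(df)
```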

How can I split a DataFrame every x rows?

I have a DataFrame in the following format:
Date Open High Low Close
0 2015-06-19 20:00:00 1201.60 1202.84 1201.55 1202.13
1 2015-06-19 21:00:00 1202.13 1202.50 1200.84 1200.88
2 2015-06-19 22:00:00 1200.88 1201.55 1200.61 1201.06
3 2015-06-19 23:00:00 1201.06 1201.26 1200.02 1200.57
4 2015-06-22 01:00:00 1200.57 1201.48 1197.04 1198.94
5 2015-06-22 02:00:00 1198.94 1199.79 1198.49 1199.34
6 2015-06-22 03:00:00 1199.34 1200.05 1198.64 1199.74
7 2015-06-22 04:00:00 1199.74 1200.34 1199.14 1199.66
I am trying to split this DataFrame by date, and then split each date into 4-hour blocks. Here is how I select the rows for one date:
i = 0
this_date = df["Date"][i:i+1].values[0].split(" ")[0]
today = df[df["Date"].apply(lambda x: x.split(" ")[0]) == this_date]
Now I need to split the today DataFrame every 4 hours. The last chunk will have only 3 rows, as the day ends at 23:00.
How can I do this? Is there an easy way, or do I need to loop over the DataFrame and do it manually?
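One possible approach is to number the rows within each date and integer-divide that counter by the chunk size, so that grouping by (date, block) yields the pieces directly. A minimal sketch, assuming hourly rows so that 4 rows correspond to 4 hours; the sample frame is invented so that one day has 7 rows and therefore ends with a chunk of 3:

```python
import pandas as pd

# Illustrative data: 7 hourly rows on one day, 4 on another.
dates = (
    [f"2015-06-19 {h:02d}:00:00" for h in range(17, 24)]
    + [f"2015-06-22 {h:02d}:00:00" for h in range(1, 5)]
)
df = pd.DataFrame({"Date": dates, "Close": range(len(dates))})
df["Date"] = pd.to_datetime(df["Date"])

rows_per_chunk = 4  # hypothetical parameter: the "x" in the title

# Calendar day of each row, and the row's chunk number within that day.
day = df["Date"].dt.normalize()
block = df.groupby(day).cumcount() // rows_per_chunk

# One DataFrame per (day, block) pair, in chronological order.
chunks = [chunk for _, chunk in df.groupby([day, block])]
for chunk in chunks:
    print(chunk, end="\n\n")
```

If the rows are not evenly spaced, grouping by df["Date"].dt.floor("4h") instead would split on clock time rather than on row count.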
