Duplicate line based on start and end date - python

I could not find a way to do the following.
My data looks as follows:
Time (CET) Start Duration(min) End
2015-02-01 00:00 2015-02-01 00:00 2 2015-02-01 00:02
What I want is for every line (that contains entries; many do not) to be duplicated based on the duration or the end date, like this:
Time (CET) Start Duration(min) End
2015-02-01 00:00 2015-02-01 00:00 2 2015-02-01 00:02
2015-02-01 00:01 2015-02-01 00:00 2 2015-02-01 00:02
2015-02-01 00:02 2015-02-01 00:00 2 2015-02-01 00:02
In the final dataframe the Start and End columns are no longer necessary. I thought about using shift, but I was not sure whether it is the right way or how to use its freq argument. Any ideas how to do that?
The Time columns are in datetime format and Time (CET) is the index.
Thanks a ton!

You can repeat rows with Index.repeat and loc, then add timedeltas (created from cumcount with to_timedelta) to the Time (CET) column:
print (df)
Time (CET) Start Duration(min) End
0 2015-02-01 00:00 2015-02-01 00:00 2 2015-02-01 00:02
1 2015-02-02 00:00 2015-02-02 00:00 3 2015-02-02 00:03
# convert columns to datetimes
c = ['Time (CET)','Start','End']
df[c] = df[c].apply(pd.to_datetime)
# repeat each row Duration+1 times, so both the start and the end minute appear
df = df.loc[df.index.repeat(df['Duration(min)'] + 1)]
# add 0, 1, 2, ... minutes within each original row
df['Time (CET)'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='m')
df = df.reset_index(drop=True).drop(['Start','End'], axis=1)
print (df)
Time (CET) Duration(min)
0 2015-02-01 00:00:00 2
1 2015-02-01 00:01:00 2
2 2015-02-01 00:02:00 2
3 2015-02-02 00:00:00 3
4 2015-02-02 00:01:00 3
5 2015-02-02 00:02:00 3
6 2015-02-02 00:03:00 3
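An alternative sketch of the same idea (my suggestion, not part of the answer above; assumes pandas 0.25+ for explode): build the per-minute timestamps with pd.date_range and explode them.
import pandas as pd

df = pd.DataFrame({'Time (CET)': pd.to_datetime(['2015-02-01 00:00', '2015-02-02 00:00']),
                   'Start': pd.to_datetime(['2015-02-01 00:00', '2015-02-02 00:00']),
                   'Duration(min)': [2, 3],
                   'End': pd.to_datetime(['2015-02-01 00:02', '2015-02-02 00:03'])})
# one list of minute timestamps per row, then one output row per timestamp
df['Time (CET)'] = [list(pd.date_range(s, e, freq='min')) for s, e in zip(df['Start'], df['End'])]
df = (df.explode('Time (CET)')
        .drop(['Start', 'End'], axis=1)
        .reset_index(drop=True))
print(df)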

How to use pandas Grouper to get sum of values within each hour

I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
I want to group all values within each hour so I can see the total number of uses per hour (00:00:00 - 23:00:00).
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key="Hora_Retiro", freq='H')).count()
The Hora_Retiro column is of timedelta64[ns] type.
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column:
df['hour'] = df['Hora_Retiro'].dt.hour
And then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
which gives (for a small example dataset, not the question's data):
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
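Note that hours which never occur in the data are simply absent from this result. A small sketch (my addition, applied before the formatting step) that forces the full 0-23 range:
gpby_df = (df.groupby('hour')['count_uses'].sum()
             .reindex(range(24), fill_value=0)   # insert missing hours with 0 uses
             .rename_axis('hour')
             .reset_index())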
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, since in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute/second taken from the first row.
To group by full hours, floor each element in this column to the hour, then group by this floored value. (Flooring rather than rounding is what you want here: rounding would send 00:59:59 into the 01:00:00 bucket.)
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, decide what you actually want to count: rows, or the values in the count_uses column.
In the second case, replace count with sum.
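A minimal, self-contained sketch of that grouping, using made-up timedelta data in place of the real Hora_Retiro column:
import pandas as pd

hora_pico = pd.DataFrame({'Hora_Retiro': pd.to_timedelta(['00:00:18', '00:59:59',
                                                          '01:00:02', '23:59:56']),
                          'count_uses': [1, 1, 1, 1]})
print(hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.sum())
# Hora_Retiro
# 0 days 00:00:00    2
# 0 days 01:00:00    1
# 0 days 23:00:00    1
# Name: count_uses, dtype: int64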

Copying column from one data frame to another based on matching of combination of two columns

I have got two dataframes (i.e. df1 and df2).
df1 contains date and time columns. The time column contains a time series at 30-minute intervals:
df1:
date time
0 2015-04-01 00:00:00
1 2015-04-01 00:30:00
2 2015-04-01 01:00:00
3 2015-04-01 01:30:00
4 2015-04-01 02:00:00
df2 contains date, start-time, end-time, value:
df2
INCIDENT_DATE INTERRUPTION_TIME RESTORE_TIME WASTED_MINUTES
0 2015-04-01 00:32 01:15 1056.0
1 2015-04-01 01:20 02:30 3234.0
2 2015-04-01 01:22 03:30 3712.0
3 2015-04-01 01:30 03:15 3045.0
Now I want to copy the WASTED_MINUTES column from df2 to df1 when the date columns of both data frames match and the INTERRUPTION_TIME of df2 falls within the corresponding 30-minute interval of df1. The output should look like:
df1:
date time Wasted_columns
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
I tried the merge command (on the date column), but it didn't produce the desired result, because I am not sure how to check whether a time falls within a 30-minute interval. Could anyone guide me on how to fix this?
Convert time to timedelta and assign it back to df1. Convert INTERRUPTION_TIME to timedelta, floor it to the 30-minute interval, and assign it to s. Group df2 by INCIDENT_DATE and s, and take the sum of WASTED_MINUTES. Finally, join the result of the groupby back to df1:
df1['time'] = pd.to_timedelta(df1['time'].astype(str))  # cast to str before calling to_timedelta
s = pd.to_timedelta(df2.INTERRUPTION_TIME + ':00').dt.floor('30Min')
df_final = df1.join(df2.groupby(['INCIDENT_DATE', s]).WASTED_MINUTES.sum(),
                    on=['date', 'time'])
Out[631]:
date time WASTED_MINUTES
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
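For reference, a self-contained version of the approach above (a sketch rebuilding the sample frames from the question; the dtypes are assumptions based on how they print):
import pandas as pd

df1 = pd.DataFrame({'date': ['2015-04-01'] * 5,
                    'time': ['00:00:00', '00:30:00', '01:00:00', '01:30:00', '02:00:00']})
df2 = pd.DataFrame({'INCIDENT_DATE': ['2015-04-01'] * 4,
                    'INTERRUPTION_TIME': ['00:32', '01:20', '01:22', '01:30'],
                    'WASTED_MINUTES': [1056.0, 3234.0, 3712.0, 3045.0]})
df1['time'] = pd.to_timedelta(df1['time'])
# floor 00:32 -> 00:30, 01:20 and 01:22 -> 01:00, 01:30 -> 01:30
s = pd.to_timedelta(df2.INTERRUPTION_TIME + ':00').dt.floor('30Min')
df_final = df1.join(df2.groupby(['INCIDENT_DATE', s]).WASTED_MINUTES.sum(),
                    on=['date', 'time'])
print(df_final)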
You can do this:
df1['time'] = pd.to_datetime(df1['time'])
df1['Wasted_columns'] = df1.apply(
    lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME']) >= x['time']) &
                      (pd.to_datetime(df2['INTERRUPTION_TIME']) < x['time'] + pd.Timedelta(minutes=30)),
                      'WASTED_MINUTES'].sum(), axis=1)
df1['time'] = df1['time'].dt.time
If you convert the 'time' column inside the lambda function itself, it is just one statement, as below:
df1['Wasted_columns'] = df1.apply(
    lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME']) >= pd.to_datetime(x['time'])) &
                      (pd.to_datetime(df2['INTERRUPTION_TIME']) < pd.to_datetime(x['time']) + pd.Timedelta(minutes=30)),
                      'WASTED_MINUTES'].sum(), axis=1)
Output
date time Wasted_columns
0 2015-04-01 00:00:00 0.0
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 0.0
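One note on this approach: it re-parses df2['INTERRUPTION_TIME'] once per row of df1, so it scales as O(len(df1) * len(df2)); for large frames the groupby/join approach above is faster. It does, however, naturally yield 0.0 instead of NaN for empty intervals, since the sum of an empty selection is 0.0.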
The idea:
+ Convert to datetime
+ Floor to the 30-minute interval
+ Merge
from datetime import datetime, timedelta

def floor_dt(dt, delta):
    # floor a datetime to the previous multiple of delta
    return dt - (dt - datetime.min) % delta

# Convert
df1['dt'] = (df1['date'] + ' ' + df1['time']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M:%S'])
df2['dt'] = (df2['INCIDENT_DATE'] + ' ' + df2['INTERRUPTION_TIME']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M'])
# Floor
df2['dt'] = df2['dt'].apply(floor_dt, args=[timedelta(minutes=30)])
# Merge
final = df1.merge(df2.loc[:, ['dt', 'WASTED_MINUTES']], on='dt', how='left')
Also, if multiple incidents happen in the same 30-minute timeframe, you would want to group df2 by the floored dt column first to sum up WASTED_MINUTES, then merge.

Convert int64 to datetime with format '%H:%M'

I have the following DataFrame:
df_h00 = df.copy()
tt = df_h00.set_index('username').post_time_data.str.extractall(r'totalCount\":([^,}]*)')
tt['index'] = tt.index
tt[['user', 'hour']] = pd.DataFrame(tt['index'].values.tolist(),
                                    index=tt.index)
tt = tt.drop(['index'], axis=1)
tt.columns = ['totalCount', 'user', 'hours']
tt.head()
totalCount user hours
username match
lowi 0 15 lowi 0
1 11 lowi 1
2 2 lowi 2
3 0 lowi 3
4 0 lowi 4
I want to convert the column tt['hours'], which is non-null int64, to datetime with format "%H:%M".
I've tried the following code:
tthour = tt['hours']
tthour = pd.to_datetime(tthour, format='%H', errors='coerce')
tthour = tthour.to_frame()
tthour.head()
hours
username match
lowi 0 1900-01-01 00:00:00
1 1900-01-01 01:00:00
2 1900-01-01 02:00:00
3 1900-01-01 03:00:00
4 1900-01-01 04:00:00
However, I only want "%H:%M". So the expected output would be like this:
hours
username match
lowi 0 00:00
1 01:00
2 02:00
3 03:00
4 04:00
Datetimes in your expected format do not exist in Python.
The closest to what you need are timedeltas created by to_timedelta with Series.str.zfill, or plain strings:
import numpy as np
import pandas as pd

tt = pd.DataFrame({'hours': np.arange(5)})
tt['td'] = pd.to_timedelta(tt['hours'].astype(str).str.zfill(2) + ':00:00', errors='coerce')
tt['str'] = tt['hours'].astype(str).str.zfill(2) + ':00'
print (tt)
hours td str
0 0 00:00:00 00:00
1 1 01:00:00 01:00
2 2 02:00:00 02:00
3 3 03:00:00 03:00
4 4 04:00:00 04:00
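If plain "HH:MM" text is all that is needed, another sketch (my suggestion, not part of the answer above) is to format the parsed datetimes with Series.dt.strftime; note the result is a string column, not a datetime:
import pandas as pd

tt = pd.DataFrame({'hours': range(5)})
tt['hhmm'] = pd.to_datetime(tt['hours'].astype(str), format='%H').dt.strftime('%H:%M')
print(tt['hhmm'].tolist())
# ['00:00', '01:00', '02:00', '03:00', '04:00']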

Subtracting rows in different files

I am selecting several csv files in a folder. Each file has a "Time" column.
I would like to plot an additional column called duration, which subtracts the time of the first row from the time of each row, for each file.
What should I add to my code?
output = pd.DataFrame()
for name in list_files_log:
    with folder.get_download_stream(name) as f:
        try:
            tmp = pd.read_csv(f)
            tmp["sn"] = get_sn(name)
            tmp["filename"] = os.path.basename(name)
            output = output.append(tmp)
        except:
            pass
If your Time column looked like this:
Time
0 2015-02-04 02:10:00
1 2016-03-05 03:30:00
2 2017-04-06 04:40:00
3 2018-05-07 05:50:00
You could create a Duration column using:
df['Duration'] = df['Time'] - df['Time'][0]
And you'd get:
Time Duration
0 2015-02-04 02:10:00 0 days 00:00:00
1 2016-03-05 03:30:00 395 days 01:20:00
2 2017-04-06 04:40:00 792 days 02:30:00
3 2018-05-07 05:50:00 1188 days 03:40:00
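Since the loop above concatenates every file into one frame, the per-file version needs the offset computed within each file. A hedged sketch using the 'filename' column the loop adds (the sample frame here stands in for the concatenated output):
import pandas as pd

output = pd.DataFrame({'Time': pd.to_datetime(['2015-02-04 02:10', '2015-02-04 02:15',
                                               '2015-03-05 03:30', '2015-03-05 03:50']),
                       'filename': ['a.csv', 'a.csv', 'b.csv', 'b.csv']})
# subtract each file's first timestamp from every row of that file
output['Duration'] = output['Time'] - output.groupby('filename')['Time'].transform('first')
print(output)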

Split dataframe to several dataframes

I have the following data:
Date X
...
2014-12-30 23:00:00 2
2014-12-30 23:15:00 0
2014-12-30 23:30:00 1
2014-12-30 23:45:00 1
2014-12-31 00:00:00 22
...
2015-01-01 00:00:00 0
2015-01-02 00:00:00 2
2015-01-03 00:00:00 2
2015-01-04 00:00:00 2
2015-01-04 00:00:00 2
2015-01-05 00:00:00 2
...
I want to split this time series (dataframe) into many time series (dataframes). I would like to have one time series for all Mondays, one for all Tuesdays, one for all Wednesdays, etc.
How can I do that with pandas?
You can create a dictionary of DataFrames with groupby and weekday_name:
dfs = dict(tuple(df.groupby(df['Date'].dt.weekday_name)))
# select by day
print (dfs['Friday'])
Date X
6 2015-01-02 2
print (dfs['Thursday'])
Date X
5 2015-01-01 0
Detail:
print (df['Date'].dt.weekday_name)
0 Tuesday
1 Tuesday
2 Tuesday
3 Tuesday
4 Wednesday
5 Thursday
6 Friday
7 Saturday
8 Sunday
9 Sunday
10 Monday
Name: Date, dtype: object
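Note that Series.dt.weekday_name was removed in pandas 1.0; on newer versions the same dictionary can be built with Series.dt.day_name(), which returns the same day names:
dfs = dict(tuple(df.groupby(df['Date'].dt.day_name())))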
