Deleting rows between multiple sets of time stamps - python

I have a DataFrame that has time stamps in the form yyyy-mm-dd hh:mm:ss. I'm trying to delete data between two given time stamps. At the moment I can delete the data in one range of time stamps, but I'm having trouble extending this to multiple ranges.
For example, I can delete a range of rows (e.g. 2015-03-01 00:20:00 to 2015-08-01 01:10:00); however, I'm not sure how to go about deleting another range alongside it. The code that does that is shown below.
index_list = df.timestamp[(df.timestamp >= "2015-07-01 00:00:00") & (df.timestamp <= "2015-12-30 23:50:00")].index.tolist()
df.drop(index_list, inplace=True)
The DataFrame extends over 3 years and has every day in the 3 years included.
I'm trying to delete all the rows from months July to December (2015-07-01 00:00:00 to 2015-12-30 23:50:00) for all 3 years.
I was thinking that I could create a helper column that gets the month from the Date column and then drop rows based on the values in that helper column.
I would greatly appreciate any advice. Thanks!
Edit:
I've added a small summarised version of the DataFrame. This is what the initial DataFrame looks like.
df
Date                   v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-04-01 00:30:00 65.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-07-01 01:00:00 74.0
2015-08-01 01:10:00 54.0
2015-09-01 01:20:00 86.0
2015-10-01 01:30:00 91.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
To get something like this
df
Date                   v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
Where the time stamps "2015-07-01 00:20:00 to 2015-10-01 00:30:00" and "2015-07-01 01:00:00 to 2015-10-01 01:30:00" are removed. Sorry if my formatting isn't up to standard.

If your timestamp column uses the correct dtype, you can just do:
df.loc[df.timestamp.dt.month.isin([1, 2, 3, 5, 6, 11, 12])]
This should filter out the months not inside the list.
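Note that .loc returns a new frame rather than modifying df in place, so assign the result back. A minimal runnable sketch, with made-up sample data:
import pandas as pd

# Hypothetical sample; make sure the column really is datetime64 first
df = pd.DataFrame({'timestamp': pd.to_datetime(['2015-01-01', '2015-04-01',
                                                '2015-07-01', '2015-11-01']),
                   'v': [30.0, 65.0, 74.0, 65.0]})
df = df.loc[df.timestamp.dt.month.isin([1, 2, 3, 5, 6, 11, 12])]
print(df)  # the April and July rows are gone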

As you hinted, data manipulation is always easier when you use the right data types. To support time stamps, pandas has the Timestamp type. You can do this as follows:
df['Date'] = pd.to_datetime(df['Date']) # No date format needs to be specified,
# "YYYY-MM-DD HH:MM:SS" is the standard
Then, removing all entries in the months of July to December for all years is straightforward:
df = df[df['Date'].dt.month < 7] # Keep only months less than July
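If the window to drop doesn't start at the beginning of the year, the same idea works with between, which is inclusive on both ends. A sketch with made-up data, assuming the goal is to drop July through December in every year:
import pandas as pd

df = pd.DataFrame({'Date': ['2015-06-01 00:50:00', '2015-07-01 01:00:00',
                            '2016-08-01 01:10:00'],
                   'v': [22.0, 74.0, 54.0]})
df['Date'] = pd.to_datetime(df['Date'])
df = df[~df['Date'].dt.month.between(7, 12)]  # keep January..June of every year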

Related

replace dataframe values based on another dataframe

I have a pandas dataframe which is structured as follows:
timestamp y
0 2020-01-01 00:00:00 336.0
1 2020-01-01 00:15:00 544.0
2 2020-01-01 00:30:00 736.0
3 2020-01-01 00:45:00 924.0
4 2020-01-01 01:00:00 1260.0
...
The timestamp column is a datetime data type, and I have another dataframe with the following structure:
y
timestamp
00:00:00 625.076923
00:15:00 628.461538
00:30:00 557.692308
00:45:00 501.692308
01:00:00 494.615385
...
In this case, the time is the pandas datetime index.
Now what I want to do is replace the values in the first dataframe wherever the time of day matches the second dataframe.
IIUC, your first dataframe df1's timestamp column is of datetime type, and your second dataframe (df2) has a datetime-like index as well, but holding only the time and not the date. Then you can do:
df1['y'] = df1['timestamp'].dt.time.map(df2['y'])
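A minimal runnable sketch of this, with toy frames shaped like the question's (the sample values are assumptions):
import pandas as pd

df1 = pd.DataFrame({'timestamp': pd.to_datetime(['2020-01-01 00:00:00',
                                                 '2020-01-01 00:15:00']),
                    'y': [336.0, 544.0]})
# df2 is indexed by time-of-day only, as in the question
df2 = pd.DataFrame({'y': [625.076923, 628.461538]},
                   index=pd.to_datetime(['00:00:00', '00:15:00']).time)

# .map looks each row's time-of-day up in df2's index
df1['y'] = df1['timestamp'].dt.time.map(df2['y'])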
I wouldn't be surprised if there is a better way, but you can accomplish this by getting the tables into a state where they can be merged on the time. Assuming your dataframes are df and df2:
df['time'] = df['timestamp'].dt.time
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp']).dt.time
df_combined = pd.merge(df, df2, left_on='time', right_on='timestamp')
df_combined
timestamp_x y_x time timestamp_y y_y
0 2020-01-01 00:00:00 336.0 00:00:00 00:00:00 625.076923
1 2020-01-01 00:15:00 544.0 00:15:00 00:15:00 628.461538
2 2020-01-01 00:30:00 736.0 00:30:00 00:30:00 557.692308
3 2020-01-01 00:45:00 924.0 00:45:00 00:45:00 501.692308
4 2020-01-01 01:00:00 1260.0 01:00:00 01:00:00 494.615385
# This clearly has more than you need, so just keep what you want and rename things back.
df_combined = df_combined[['timestamp_x','y_y']]
df_combined = df_combined.rename(columns={'timestamp_x':'timestamp','y_y':'y'})
New answer I like way better: actually using .map()
Still need to get df2 to have the time column to match on.
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp']).dt.time
df['y'] = df['timestamp'].dt.time.map(dict(zip(df2['timestamp'], df2['y'])))
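As a design note, the .map route avoids materialising the merged frame at all: the lookup table is built once from df2 and each row is a plain dictionary lookup, so it should scale better on large frames.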

Sum of Timestamps

In my dataframe I have a timestamp column formatted as 2021-11-18 00:58:22.705.
I wish to create a column that displays the time elapsed from the initial time (the first timestamp) to each row.
There are 2 ways in which I can think of doing this but I don't seem to know how to make it happen.
Method 1:
Subtract each time stamp from the one in the row above:
df["difference"]= df["timestamp"].diff()
Now that this time difference has been calculated, I would like to create another column that cumulatively sums the differences, keeping the running total from the deltas above (the elapsed time from the start of the process).
Method 2:
I guess another way would be to calculate the difference between each row's timestamp and the initial time stamp (the first one), but I do not know how I would do that.
Thanks in advance.
I have not completely understood the type of difference needed, so I am adding both, which I think are reasonable:
import pandas as pd
times = pd.date_range('2022-05-23', periods=20, freq='30min')
df = pd.DataFrame({'Timestamp': times})
# minutes elapsed since the first timestamp
df['difference_in_min'] = (df.Timestamp - df.Timestamp.min()).dt.total_seconds() / 60
df['cumulative_dif_in_min'] = df.difference_in_min.cumsum()
print(df)
Timestamp difference_in_min cumulative_dif_in_min
0 2022-05-23 00:00:00 0.0 0.0
1 2022-05-23 00:30:00 30.0 30.0
2 2022-05-23 01:00:00 60.0 90.0
3 2022-05-23 01:30:00 90.0 180.0
4 2022-05-23 02:00:00 120.0 300.0
5 2022-05-23 02:30:00 150.0 450.0
6 2022-05-23 03:00:00 180.0 630.0
7 2022-05-23 03:30:00 210.0 840.0
8 2022-05-23 04:00:00 240.0 1080.0
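For reference, Method 1 from the question arrives at the same elapsed-from-start column: the per-row differences telescope when cumulatively summed. A sketch, reusing the df built above:
# diff() gives the step between consecutive rows; its cumulative sum
# telescopes back to (t_i - t_0), i.e. difference_in_min
df['elapsed_via_diff'] = (df.Timestamp.diff().fillna(pd.Timedelta(0))
                          .cumsum().dt.total_seconds() / 60)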

How to use pandas Grouper to get sum of values within each hour

I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key='Hora_Retiro', freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since it is of timedelta64[ns] dtype, take the hours component:
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then groupby on the basis of hour
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'].astype(str).str.zfill(2), format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
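One caveat: if some hour of the day never occurs in the data, it will simply be absent from this result (as in the sample output above). If all 24 rows are wanted, reindexing fills the gaps; a sketch:
# Guarantee one row per hour 0..23, with 0 where no uses occurred
gpby_df = df.groupby('hour')['count_uses'].sum().reindex(range(24), fill_value=0).reset_index()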
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime; in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours":
floor each element in this column down to a whole hour (rounding would push e.g. 00:58:47 into the 01:00:00 bucket),
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
    lambda tt: tt.floor('H'))).count_uses.count()
However, I advise you to decide what you want to count: rows, or the total of the values in the count_uses column.
In the second case, replace the count function with sum.
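For what it's worth, timedelta columns also support vectorised flooring, so the per-element apply can likely be avoided. A sketch under the same Timedelta assumption:
# Floor the timedeltas to whole hours without a per-element apply
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()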

How to recover lost DateTime rows in a multiindexed dataframe and then reshape it into a 2D dataframe without multiindexing (Pandas)?

I have a dataframe with multiindexing and a lot of rows. The indices are 'item' and 'TimeStamp'
Each of the items has a different number of elements, as some of the values were NaN and were erased from the dataset. I would like to regenerate the lost rows and obtain the new dataframe described below.
Ideally I would like to:
create a new dataframe with the full DateTime index with a step of 10 minutes. Its size would be (full DateTimeIndex x number of items)
each column would contain data for a separate item and the rows where the data is missing would be NaN. The column names would refer to item numbers ('I01', 'I02'... etc.)
This way I would remove the multiindexing and be able to perform quicker operations on a 2D df.
The df I have is as follows:
value
item TimeStamp
I01 2011-09-20 00:00:00 -11.280400
2011-09-20 00:10:00 -11.945430
2011-09-20 00:20:00 -11.962580
2011-09-20 00:30:00 -12.074700
2011-09-20 00:40:00 -11.923750
...
I07 2014-05-31 23:20:00 985.375427
2014-05-31 23:30:00 951.776611
2014-05-31 23:40:00 822.368286
2014-05-31 23:50:00 879.974792
2014-06-01 00:00:00 587.804321
[nevermind how many rows x 1 columns]
I will be really grateful for any help with this. I am quite new to Python.
You can resolve this issue by ensuring that the same timestamps are used for each item group. This can be done by pivoting, filling in dummies and, finally, unpivoting your data.
Note that, in the example below, the timestamp 2011-09-20 00:20:00 is missing for item I02. Our goal is to retrieve this timestamp and align the timestamps of all items in our dataset.
data = [['I01', '2011-09-20 00:00:00', 10], ['I01', '2011-09-20 00:10:00', 20],
        ['I01', '2011-09-20 00:20:00', 20], ['I02', '2011-09-20 00:00:00', 30],
        ['I02', '2011-09-20 00:10:00', 40]]
df = pd.DataFrame(data, columns=['Item', 'Timestamp', 'Value'])
  Item            Timestamp  Value
0  I01  2011-09-20 00:00:00     10
1  I01  2011-09-20 00:10:00     20
2  I01  2011-09-20 00:20:00     20
3  I02  2011-09-20 00:00:00     30
4  I02  2011-09-20 00:10:00     40
To achieve this, we pivot the table with the timestamps as our columns, as follows:
df = df.pivot_table(index=['Item'], columns='Timestamp')
                         Value
Timestamp  2011-09-20 00:00:00 2011-09-20 00:10:00 2011-09-20 00:20:00
Item
I01                       10.0                20.0                20.0
I02                       30.0                40.0                 NaN
# Note that a NaN value appears for 'I02' at timestamp '2011-09-20 00:20:00'
Now, we fill in the NaNs with a placeholder (later replaced with a real value, e.g. the float 0.). This prevents the rows from disappearing once we unpivot.
Finally, we unpivot (i.e. melt) the table to retrieve the old data structure:
df = df.fillna('dummy')
df.columns = df.columns.droplevel(0)  # flatten the ('Value', Timestamp) column MultiIndex so melt sees plain labels
df = df.reset_index()
df = df.melt(id_vars='Item')
df = df.replace('dummy', 0.)
Item Timestamp value
0 I01 2011-09-20 00:00:00 10.0
1 I02 2011-09-20 00:00:00 30.0
2 I01 2011-09-20 00:10:00 20.0
3 I02 2011-09-20 00:10:00 40.0
4 I01 2011-09-20 00:20:00 20.0
5 I02 2011-09-20 00:20:00 0.0
Now, however, the same timestamps are used for each item group.
Edit: I also found a one-liner for this:
df.set_index(['Timestamp', 'Item']).unstack().fillna('dummy').stack().replace('dummy', 0.)
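Since the question also asks for a wide 2D frame with items as columns and a complete 10-minute DatetimeIndex, here is a sketch building on the long-format df constructed at the start of this answer (the '10min' frequency is taken from the question's description):
# One column per item, complete 10-minute index, NaN where a reading is missing
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
wide = df.set_index(['Timestamp', 'Item'])['Value'].unstack('Item')
wide = wide.reindex(pd.date_range(wide.index.min(), wide.index.max(), freq='10min'))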

How to find an index with logic in pandas using Python?

This is my data:
time id w
0 2018-03-01 00:00:00 39.0 1176.000000
1 2018-03-01 00:15:00 39.0 NaN
2 2018-03-01 00:30:00 39.0 NaN
3 2018-03-01 00:45:00 39.0 NaN
4 2018-03-01 01:00:00 39.0 NaN
5 2018-03-01 01:15:00 39.0 NaN
6 2018-03-01 01:30:00 39.0 NaN
7 2018-03-01 01:45:00 39.0 1033.461538
8 2018-03-01 02:00:00 39.0 1081.066667
9 2018-03-01 02:15:00 39.0 1067.909091
10 2018-03-01 02:30:00 39.0 NaN
11 2018-03-01 02:45:00 39.0 1051.866667
12 2018-03-01 03:00:00 39.0 1127.000000
13 2018-03-01 03:15:00 39.0 1047.466667
14 2018-03-01 03:30:00 39.0 1037.533333
I want to get index: 10.
I need to know where the time series is not continuous, because I need to fill in the value there. For each row, I want to know whether w is NaN while the rows immediately before and after it are not NaN; if so, I need its index so that I can add a value for it.
My data is very large, so I need a fast way.
I really need your help. Many thanks.
Not sure if I understood you correctly. If you want the indices of the time column where the step is more than 15 minutes, you will get more than one index, and you can do it like this:
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
df['Delta'] = df['time'].diff()  # difference to the previous row
print(df.index[df['Delta'] != pd.Timedelta('15min')].tolist())
And the output is:
[4561, 4723, 5154, 5220, 5293, 5437, 5484]
Edit
Again, if I understood you right, just use this:
df.index[df['w'].isnull() & df['w'].shift(1).notnull() & df['w'].shift(-1).notnull()].tolist()
Output:
[10]
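Note that this shift-based neighbour check is fully vectorised, so it should stay fast even on a very large frame, which is what the question asks for.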
This should work pretty fast:
import numpy as np
index = np.array([4561,4723,4724,4725,4726,5154,5220,5221,5222,5223,5224,5293,5437,5484,5485,5486,5487])
continuous = np.diff(index) == 1
not_continuous = np.where(~continuous[1:] & ~continuous[:-1])[0] + 1  # check both sides; +1 because you lose one index in the diff operation
index[not_continuous]
array([5154, 5293, 5437])
It doesn't handle the first value, but that case is quite ambiguous since there is no preceding value to check against. It's up to you to add this extra check if it matters to you; the same goes for the last value, potentially.
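If the endpoints matter, one way (a sketch, not the only one) is to pad the gap test so that the first and last elements are treated as having a gap on the outside:
gap = np.diff(index) != 1                        # True where neighbours are not consecutive
isolated = np.r_[True, gap] & np.r_[gap, True]   # a gap on both sides
print(index[isolated])                           # [4561 5154 5293 5437]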
