How to find an index with logic in pandas? - python

This is my data:
                   time    id            w
0   2018-03-01 00:00:00  39.0  1176.000000
1   2018-03-01 00:15:00  39.0          NaN
2   2018-03-01 00:30:00  39.0          NaN
3   2018-03-01 00:45:00  39.0          NaN
4   2018-03-01 01:00:00  39.0          NaN
5   2018-03-01 01:15:00  39.0          NaN
6   2018-03-01 01:30:00  39.0          NaN
7   2018-03-01 01:45:00  39.0  1033.461538
8   2018-03-01 02:00:00  39.0  1081.066667
9   2018-03-01 02:15:00  39.0  1067.909091
10  2018-03-01 02:30:00  39.0          NaN
11  2018-03-01 02:45:00  39.0  1051.866667
12  2018-03-01 03:00:00  39.0  1127.000000
13  2018-03-01 03:15:00  39.0  1047.466667
14  2018-03-01 03:30:00  39.0  1037.533333
I want to get index 10, because I need to know where the time series is not continuous so that I can fill in the missing value.
In other words, for each row I want to know whether 'w' is NaN while the values immediately before and after it are not NaN; if so, I need that row's index so I can fill in a value for it.
My data is very large, so I need a fast way. Many thanks.

Not sure if I understood you correctly. If you want the indices of the time column where the gap is more than 15 minutes, you will get more than one index, and you can do it like this:
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
df['Delta'] = df['time'].diff()  # time difference to the previous row
print(df.index[df['Delta'] != pd.Timedelta(minutes=15)].tolist())
And the output is:
[4561, 4723, 5154, 5220, 5293, 5437, 5484]
Edit
Again, if I understood you right, just use this:
df.index[(pd.isnull(df['w'])) & (pd.notnull(df['w'].shift(1))) & (pd.notnull(df['w'].shift(-1)))].tolist()
Output:
[10]
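Once you have those indices, a minimal sketch of the "add the value" step; the fill rule (mean of the two valid neighbours) is my assumption, not something stated in the question:
isolated = df.index[df['w'].isna() & df['w'].shift(1).notna() & df['w'].shift(-1).notna()]
df.loc[isolated, 'w'] = ((df['w'].shift(1) + df['w'].shift(-1)) / 2)[isolated]  # assumed fill rule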

This should work pretty fast:
import numpy as np
index = np.array([4561,4723,4724,4725,4726,5154,5220,5221,5222,5223,5224,5293,5437,5484,5485,5486,5487])
continuous = np.diff(index) == 1
not_continuous = np.where(~continuous[1:] & ~continuous[:-1])[0] + 1  # check on both 'sides'; +1 because you 'lose' one index in the diff operation
index[not_continuous]
array([5154, 5293, 5437])
It doesn't handle the first value well, but this is quite ambiguous since you don't have a preceding value to check against. It is up to you to add this extra check if it matters to you; the same goes for the last value, potentially.
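If the endpoints do matter, one possible extension (my addition, not part of the answer above) is to pad the continuity mask with False on both ends, so the first and last values are treated as non-continuous on their outer side:
padded = np.concatenate(([False], continuous, [False]))  # pad so endpoints are checked too
isolated = np.where(~padded[1:] & ~padded[:-1])[0]
index[isolated]  # array([4561, 5154, 5293, 5437]) on the sample above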

Related

Sum of Timestamps

In my dataframe I have a timestamp column formatted as 2021-11-18 00:58:22.705.
I wish to create a column that displays the time elapsed from each row to the initial time (the first timestamp).
There are 2 ways in which I can think of doing this, but I don't seem to know how to make either happen.
Method 1:
Subtract each timestamp from the one in the row above:
df["difference"] = df["timestamp"].diff()
Now that this time difference has been calculated, I would like to create another column that sums each time difference, keeping the running total of the deltas above (elapsed time from the start of the process).
Method 2:
I guess another way would be to calculate the difference between each row's timestamp and the initial timestamp (the first one), but I do not know how I would do that.
Thanks in advance.
I have not completely understood which type of difference is needed, so I am adding both, which I think are reasonable:
import pandas as pd

times = pd.date_range('2022-05-23', periods=20, freq='30min')
df = pd.DataFrame({'Timestamp': times})
# elapsed minutes from the first timestamp
df['difference_in_min'] = (df.Timestamp - df.Timestamp.min()).dt.total_seconds() / 60
# running total of those elapsed times
df['cumulative_dif_in_min'] = df.difference_in_min.cumsum()
print(df)
Timestamp difference_in_min cumulative_dif_in_min
0 2022-05-23 00:00:00 0.0 0.0
1 2022-05-23 00:30:00 30.0 30.0
2 2022-05-23 01:00:00 60.0 90.0
3 2022-05-23 01:30:00 90.0 180.0
4 2022-05-23 02:00:00 120.0 300.0
5 2022-05-23 02:30:00 150.0 450.0
6 2022-05-23 03:00:00 180.0 630.0
7 2022-05-23 03:30:00 210.0 840.0
8 2022-05-23 04:00:00 240.0 1080.0
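For Method 1 specifically, a minimal sketch using the same df as above: the running sum of the consecutive diffs reproduces the elapsed time from the first row, so both methods agree.
df['difference'] = df['Timestamp'].diff()                          # NaT for the first row
df['elapsed'] = df['difference'].cumsum().fillna(pd.Timedelta(0))  # equals Timestamp - Timestamp.min()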

Sum of dataframes: treating NaN as 0 when summed with other values, but returning NaN where all summed elements are NaN

I am trying to add some dataframes that contain NaN values. The dataframes are indexed by a time series, and in my case a NaN is meaningful: it means that a measurement wasn't taken. So if all the dataframes I'm adding have a NaN for a given timestamp, I need the result to have a NaN for that timestamp. But if one or more of them have a value for the timestamp, I need the sum of those values.
EDIT: Also, in my case, a 0 is different from a NaN: it means that there was a measurement and it measured 0 activity, as opposed to a NaN meaning that there was no measurement. So any solution using fillna(0) won't work.
I haven't found a proper way to do this yet. Here is an example of what I want to do:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'value': [0, 1, 1, 1, np.NaN, np.NaN, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df2 = pd.DataFrame({'value': [0, 5, 5, 5, 5, 5, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df1 + df2
What I get:
df1 + df2
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 NaN
2020-01-01 00:50:00 NaN
2020-01-01 01:00:00 NaN
What I would want to have as a result:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
Does anybody know a clean way to do so?
Thank you.
(I'm using Python 3.9.1 and pandas 1.2.4)
You can use add with the fill_value=0 option. This will maintain the "all NaN" combinations as NaN:
df1.add(df2, fill_value=0)
output:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
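If you need the same behaviour across more than two frames, a hedged generalization (my sketch, not from the answer above): concatenate the frames and use sum with min_count=1, which keeps NaN only where a timestamp has no valid value in any frame.
result = pd.concat([df1, df2]).groupby(level=0).sum(min_count=1)  # add more frames to the list as needed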

How to extract rows in Pandas based on different criteria over two datetime columns

My objective is to calculate the number of firemen working simultaneously during the same period of time.
I'm trying to extract rows from a dataframe with conditions on two columns, but it doesn't work as expected.
Let me explain.
Here are my data first (this is the list of firemen interventions), with the number of firemen and the start and end of each intervention.
    ID  Nombre d'agents (Engins)  Date Début Sortie Engin  Date Fin Sortie Engin
194683                       3.0      2018-03-01 19:12:00    2018-03-01 19:54:00
194684                       3.0      2018-03-01 19:20:00    2018-03-01 20:09:00
194685                       3.0      2018-03-01 19:33:00    2018-03-01 20:16:00
194686                       3.0      2018-03-01 19:50:00    2018-03-01 23:01:00
194687                       3.0      2018-03-01 19:53:00    2018-03-01 20:20:00
194688                       3.0      2018-03-01 19:54:00    2018-03-01 20:55:00
194689                       3.0      2018-03-01 19:56:00    2018-03-01 21:20:00
194690                       6.0      2018-03-01 20:03:00    2018-03-01 22:10:00
194691                       3.0      2018-03-01 20:09:00    2018-03-01 20:54:00
Here is what I'm trying to achieve:
Between 2018-03-01 19:20:00 and 2018-03-01 19:54:00, 15 firemen were working at the same time, for a cumulated 1h34 (3 from the first row, 19:20->19:54 (34 mn); 3 from the second row, 19:20->19:54 (34 mn); 3 from the third row, 19:33->19:54 (21 mn); 3 from the fourth row, 19:50->19:54 (4 mn); and 3 from the fifth row, 19:53->19:54 (1 mn)).
I first combined all datetimes (start and end) chronologically in one dataframe, in order to have every timeslot and the timedelta between rows.
data = [df["Date Début Sortie Engin"]]
headers = ["Moment"]
df3 = pd.concat(data, axis=1, keys=headers)
data = [df["Date Fin Sortie Engin"]]
df4 = pd.concat(data, axis=1, keys=headers)
df3 = df3.append(df4)
df3 = df3.sort_values(by="Moment", ascending=True)
Moment
2018-03-01 19:12:00
2018-03-01 19:20:00
2018-03-01 19:33:00
2018-03-01 19:50:00
2018-03-01 19:54:00
I then compare two consecutive rows of this new dataframe against my initial data to find out how many interventions include this timeframe, summing the number of firemen and counting the number of simultaneous interventions:
import numpy as np

def calc_effectif(start, end):
    mask = (df['Date Début Sortie Engin'] >= start) & (df['Date Fin Sortie Engin'] <= end)
    agents = df["Nombre d'agents (Engins)"].loc[mask]
    return agents.sum(), agents.count()

df3["effectif"], df3["evenement"] = np.vectorize(calc_effectif)(df3["Moment"], df3["Moment"].shift(-1))
The mask doesn't seem to be the right way to do that. I've looked into pandas between_time and other functions, but they work on the index only, so I'm a bit stuck for now.
Any tip on how I can make progress on this?
So, I solved my problem by changing the condition in my function:
mask = (df['Date Début Sortie Engin'] <= start) & (end <= df['Date Fin Sortie Engin'])
I end up with something like this:
Id  Moment            duree     effectif  evenement
 0  01/03/2018 19:12  00:08:00       3.0          1
 1  01/03/2018 19:20  00:13:00       6.0          2
 2  01/03/2018 19:33  00:17:00       9.0          3
 3  01/03/2018 19:50  00:03:00      12.0          4
 4  01/03/2018 19:53  00:01:00      15.0          5
 5  01/03/2018 19:54  00:00:00      18.0          6
 0  01/03/2018 19:54  00:02:00      15.0          5
 6  01/03/2018 19:56  00:07:00      18.0          6
 7  01/03/2018 20:03  00:06:00      24.0          7
 1  01/03/2018 20:09  00:00:00      27.0          8
 8  01/03/2018 20:09  00:07:00      24.0          7
 2  01/03/2018 20:16  00:04:00      21.0          6
 4  01/03/2018 20:20  00:34:00      18.0          5
 8  01/03/2018 20:54  00:01:00      15.0          4
 5  01/03/2018 20:55  00:25:00      12.0          3
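As an aside, a hedged alternative sketch (my suggestion, not the OP's method): a sweep line that adds the crew size at each start time, subtracts it at each end time, and cumulates, giving the number of firemen on duty after each moment without a per-interval function call.
starts = df.set_index('Date Début Sortie Engin')["Nombre d'agents (Engins)"]
ends = -df.set_index('Date Fin Sortie Engin')["Nombre d'agents (Engins)"]
deltas = pd.concat([starts, ends]).sort_index()
on_duty = deltas.groupby(level=0).sum().cumsum()  # firemen on duty after each event time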

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis", but no solution is provided there.
I have an Excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although this is not shown in the sample below. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at 5-minute frequency so that I can interpolate the NaNs later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below. However, you will have to separate the 'Date' & 'Time' columns again later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is a datetime type, then try
ts.asfreq('5T')
or use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
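Put together, a minimal sketch of the asfreq approach (assuming the Excel file and column names from the question above):
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts = ts.set_index('Time')
gaps_as_nan = ts.asfreq('5T')                  # inserts the missing 5-minute rows as NaN
carried_fwd = ts.asfreq('5T', method='ffill')  # or carry the previous row forward instead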
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example, three observations are read in as NaN, and the rows for 01:15 and 01:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see above the trick in all this: your date data was converted to datetime, but your time data is just an object column. Below, a proper index is created by use of a lambda function.
import datetime
rawidx = rawpd.apply(lambda r: datetime.datetime.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd database as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a frame ready for interpolation.
I have got it to work. Thank you, everyone, for your time. I am providing the working code.
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
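From there, the interpolation itself could look like this (a sketch; it assumes the numeric column names from the sample and that time-based linear interpolation is acceptable):
num_cols = ['Temp', 'Hum', 'Dewpnt', 'WindSpd']
df = df.set_index('Date_Time')
df[num_cols] = df[num_cols].interpolate(method='time')  # assumed interpolation rule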

Deleting rows between multiple sets of time stamps

I have a DataFrame that has time stamps in the form of (yyyy-mm-dd hh:mm:ss). I'm trying to delete data between two different time stamps. At the moment I can delete the data between 1 range of time stamps but I have trouble extending this to multiple time stamps.
For example, with the DataFrame I can delete a range of rows (e.g. 2015-03-01 00:20:00 to 2015-08-01 01:10:00); however, I'm not sure how to go about deleting another range alongside it. The code that does that is shown below.
index_list = df.timestamp[(df.timestamp >= "2015-07-01 00:00:00") & (df.timestamp <= "2015-12-30 23:50:00")].index.tolist()
df.drop(index_list, inplace=True)
The DataFrame extends over 3 years and has every day in the 3 years included.
I'm trying to delete all the rows from months July to December (2015-07-01 00:00:00 to 2015-12-30 23:50:00) for all 3 years.
I was thinking of creating a helper column that extracts the month from the Date column and then dropping rows based on that helper column.
I would greatly appreciate any advice. Thanks!
Edit:
I've added in a small summarised version of the DataFrame. This is what the initial DataFrame looks like:
Date                    v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-04-01 00:30:00 65.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-07-01 01:00:00 74.0
2015-08-01 01:10:00 54.0
2015-09-01 01:20:00 86.0
2015-10-01 01:30:00 91.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
To get something like this:
Date                    v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
Here the timestamps "2015-07-01 00:20:00 to 2015-10-01 00:30:00" and "2015-07-01 01:00:00 to 2015-10-01 01:30:00" are removed. Sorry if my formatting isn't up to standard.
If your timestamp column uses the correct dtype, you can just do:
df.loc[df.timestamp.dt.month.isin([1, 2, 3, 5, 6, 11, 12])]
This should filter out the months not inside the list.
As you hinted, data manipulation is always easier when you use the right data types. To support time stamps, pandas has the Timestamp type. You can do this as follows:
df['Date'] = pd.to_datetime(df['Date'])  # no date format needs to be specified: "YYYY-MM-DD HH:MM:SS" is the standard
Then, removing all entries in the months of July to December for all years is straightforward:
df = df[df['Date'].dt.month < 7] # Keep only months less than July
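If you really do need to drop several explicit timestamp ranges rather than whole months, a hedged sketch (the endpoints below are just the ones quoted in the question): OR the boolean masks together and keep the complement.
ranges = [("2015-07-01 00:20:00", "2015-10-01 00:30:00"),
          ("2015-07-01 01:00:00", "2015-10-01 01:30:00")]
mask = pd.Series(False, index=df.index)
for start, end in ranges:
    mask |= (df['Date'] >= start) & (df['Date'] <= end)
df = df[~mask]  # keep only rows outside every range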
