Finding Last n Groups of Rows in Dataframe - python

I have a large dataframe (1m+ rows) that contains test data. A snapshot of "Events" was taken at various times and up to three rows were added to the dataframe per snapshot. Eg, in the extract below the first snapshot for Event At223 was taken at 18/03/2016 18:10:45, the second at 21/03/2016 10:14:28, etc.
I want to filter the dataframe so that it returns only the last n snapshots per Ref. Refs are unique whereas Events may be duplicated.
I'm new to Pandas and have tried various combinations of sort_values, groupby and tail, but cannot get the desired result. E.g.:
df = df.sort_values(['Ref', 'Time']).groupby(['Time', 'Ref', 'TestId']).tail(3)
Can anyone suggest how to do it? In the desired result example below n = 3, so it shows the last three snapshots per Ref.
Extract:
Time                 Ref          Event  EndTime           TestId   TestNames  Result
18/03/2016 18:10:45  1.123717985  At223  01/04/2016 16:00  28212    One
18/03/2016 18:10:45  1.123717985  At223  01/04/2016 16:00  466299   Two
18/03/2016 18:10:45  1.123717985  At223  01/04/2016 16:00  58805    Three
21/03/2016 10:14:28  1.123717985  At223  01/04/2016 16:00  28212    One
21/03/2016 10:14:28  1.123717985  At223  01/04/2016 16:00  466299   Two        4
21/03/2016 10:14:28  1.123717985  At223  01/04/2016 16:00  58805    Three
21/03/2016 12:44:34  1.123717985  At223  01/04/2016 16:00  28212    One
21/03/2016 12:44:34  1.123717985  At223  01/04/2016 16:00  466299   Two        4.5
21/03/2016 12:44:34  1.123717985  At223  01/04/2016 16:00  58805    Three
21/03/2016 13:05:16  1.123717985  At223  01/04/2016 16:00  28212    One
21/03/2016 13:05:16  1.123717985  At223  01/04/2016 16:00  466299   Two        4.5
21/03/2016 13:05:16  1.123717985  At223  01/04/2016 16:00  58805    Three
21/03/2016 13:14:22  1.123717985  At223  01/04/2016 16:00  28212    One
21/03/2016 13:14:22  1.123717985  At223  01/04/2016 16:00  466299   Two        4.5
21/03/2016 13:14:22  1.123717985  At223  01/04/2016 16:00  58805    Three
01/04/2016 10:37:43  1.123717985  At223  01/04/2016 16:00  28212    One
01/04/2016 10:37:43  1.123717985  At223  01/04/2016 16:00  466299   Two        4.5
01/04/2016 10:37:43  1.123717985  At223  01/04/2016 16:00  58805    Three
18/03/2016 18:12:12  1.123719512  Br12   03/04/2016 16:00  28214    Eight      7
18/03/2016 18:12:12  1.123719512  Br12   03/04/2016 16:00  1212772  Nine       1.58
18/03/2016 18:12:12  1.123719512  Br12   03/04/2016 16:00  58805    Ten        4.4
21/03/2016 13:03:48  1.123719512  Br12   03/04/2016 16:00  28214    Eight      7.2
21/03/2016 13:03:48  1.123719512  Br12   03/04/2016 16:00  1212772  Nine       1.58
21/03/2016 13:03:48  1.123719512  Br12   03/04/2016 16:00  58805    Ten        4.4
21/03/2016 13:19:15  1.123719512  Br12   03/04/2016 16:00  28214    Eight      7.2
21/03/2016 13:19:15  1.123719512  Br12   03/04/2016 16:00  1212772  Nine       1.58
21/03/2016 13:19:15  1.123719512  Br12   03/04/2016 16:00  58805    Ten        4.5
01/04/2016 12:48:13  1.123719512  Br12   03/04/2016 16:00  28214    Eight      7.2
01/04/2016 12:48:13  1.123719512  Br12   03/04/2016 16:00  1212772  Nine       1.59
01/04/2016 12:48:13  1.123719512  Br12   03/04/2016 16:00  58805    Ten        4.5
Desired result:
Time                 Ref          Event  EndTime           TestId   TestNames  Result
21/03/2016 13:05:16  1.123717985  At223  01/04/2016 16:00  28212    One
21/03/2016 13:05:16  1.123717985  At223  01/04/2016 16:00  466299   Two        4.5
21/03/2016 13:05:16  1.123717985  At223  01/04/2016 16:00  58805    Three
21/03/2016 13:14:22  1.123717985  At223  01/04/2016 16:00  28212    One
21/03/2016 13:14:22  1.123717985  At223  01/04/2016 16:00  466299   Two        4.5
21/03/2016 13:14:22  1.123717985  At223  01/04/2016 16:00  58805    Three
01/04/2016 10:37:43  1.123717985  At223  01/04/2016 16:00  28212    One
01/04/2016 10:37:43  1.123717985  At223  01/04/2016 16:00  466299   Two        4.5
01/04/2016 10:37:43  1.123717985  At223  01/04/2016 16:00  58805    Three
21/03/2016 13:03:48  1.123719512  Br12   03/04/2016 16:00  28214    Eight      7.2
21/03/2016 13:03:48  1.123719512  Br12   03/04/2016 16:00  1212772  Nine       1.58
21/03/2016 13:03:48  1.123719512  Br12   03/04/2016 16:00  58805    Ten        4.4
21/03/2016 13:19:15  1.123719512  Br12   03/04/2016 16:00  28214    Eight      7.2
21/03/2016 13:19:15  1.123719512  Br12   03/04/2016 16:00  1212772  Nine       1.58
21/03/2016 13:19:15  1.123719512  Br12   03/04/2016 16:00  58805    Ten        4.5
01/04/2016 12:48:13  1.123719512  Br12   03/04/2016 16:00  28214    Eight      7.2
01/04/2016 12:48:13  1.123719512  Br12   03/04/2016 16:00  1212772  Nine       1.59
01/04/2016 12:48:13  1.123719512  Br12   03/04/2016 16:00  58805    Ten        4.5

You could loop through each unique Ref (the question's grouping key; Events may be duplicated), grab the last n snapshot times for it, and concatenate the results together. Note the drop_duplicates: a snapshot can span up to three rows, so take the distinct times before nlargest. This assumes Time is a datetime (or otherwise numeric) column, since nlargest does not work on plain strings:
n = 3
ref_dfs = []
for ref in df['Ref'].unique():
    sub_df = df.loc[df['Ref'] == ref]
    # distinct times first: one snapshot spans several rows
    max_times = sub_df['Time'].drop_duplicates().nlargest(n=n)
    ref_dfs.append(sub_df.loc[sub_df['Time'].isin(max_times)])
result = pd.concat(ref_dfs)
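The same loop pattern can be sanity-checked on a tiny invented frame (integer snapshot "times", n = 2):

```python
import pandas as pd

# Invented miniature frame: two Refs, three snapshot times each,
# one row per (Ref, Time) here for brevity.
df = pd.DataFrame({"Ref":  ["A", "A", "A", "B", "B", "B"],
                   "Time": [1, 2, 3, 1, 2, 3]})

n = 2
ref_dfs = []
for ref in df["Ref"].unique():
    sub_df = df.loc[df["Ref"] == ref]
    # distinct times first, in case one snapshot spans several rows
    max_times = sub_df["Time"].drop_duplicates().nlargest(n=n)
    ref_dfs.append(sub_df.loc[sub_df["Time"].isin(max_times)])
result = pd.concat(ref_dfs)
```

For each Ref this keeps only the rows belonging to its two latest times.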

Factorize the date and keep the n highest values (assuming your dataframe is already sorted by Time):
# Number of snapshots you want to keep
n = 3
# Create a boolean mask
m = (df.assign(val=pd.factorize(df['Time'])[0])
       .groupby('Ref')['val']
       .transform(lambda x: x.max() - x < n))
out = df[m]
Output:
>>> out
Time Ref Event EndTime TestId TestNames Result
9 21/03/2016 13:05:16 1.123718 At223 01/04/2016 16:00 28212 One NaN
10 21/03/2016 13:05:16 1.123718 At223 01/04/2016 16:00 466299 Two 4.50
11 21/03/2016 13:05:16 1.123718 At223 01/04/2016 16:00 58805 Three NaN
12 21/03/2016 13:14:22 1.123718 At223 01/04/2016 16:00 28212 One NaN
13 21/03/2016 13:14:22 1.123718 At223 01/04/2016 16:00 466299 Two 4.50
14 21/03/2016 13:14:22 1.123718 At223 01/04/2016 16:00 58805 Three NaN
15 01/04/2016 10:37:43 1.123718 At223 01/04/2016 16:00 28212 One NaN
16 01/04/2016 10:37:43 1.123718 At223 01/04/2016 16:00 466299 Two 4.50
17 01/04/2016 10:37:43 1.123718 At223 01/04/2016 16:00 58805 Three NaN
21 21/03/2016 13:03:48 1.123720 Br12 03/04/2016 16:00 28214 Eight 7.20
22 21/03/2016 13:03:48 1.123720 Br12 03/04/2016 16:00 1212772 Nine 1.58
23 21/03/2016 13:03:48 1.123720 Br12 03/04/2016 16:00 58805 Ten 4.40
24 21/03/2016 13:19:15 1.123720 Br12 03/04/2016 16:00 28214 Eight 7.20
25 21/03/2016 13:19:15 1.123720 Br12 03/04/2016 16:00 1212772 Nine 1.58
26 21/03/2016 13:19:15 1.123720 Br12 03/04/2016 16:00 58805 Ten 4.50
27 01/04/2016 12:48:13 1.123720 Br12 03/04/2016 16:00 28214 Eight 7.20
28 01/04/2016 12:48:13 1.123720 Br12 03/04/2016 16:00 1212772 Nine 1.59
29 01/04/2016 12:48:13 1.123720 Br12 03/04/2016 16:00 58805 Ten 4.50

You can use apply after groupby, with a helper function that only ever handles a dataframe holding a single Ref:
def extract_n_last_times(n: int):
    def extract_last_times(group: pd.DataFrame):
        times = group.sort_values("Time")["Time"].unique()
        return group[group["Time"].isin(times[-n:])]
    return extract_last_times

df.sort_values(["Ref", "Time"]).groupby("Ref", group_keys=False).apply(extract_n_last_times(n=3))
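A self-contained check of this approach on a toy frame (the values are invented; n = 2 here, and Ref "A" has duplicate rows for its first snapshot):

```python
import pandas as pd

def extract_n_last_times(n: int):
    def extract_last_times(group: pd.DataFrame):
        # distinct snapshot times for this Ref, oldest to newest
        times = group.sort_values("Time")["Time"].unique()
        return group[group["Time"].isin(times[-n:])]
    return extract_last_times

# Invented frame: duplicate rows within a snapshot are handled correctly.
df = pd.DataFrame({"Ref":  ["A", "A", "A", "A", "B", "B", "B", "B"],
                   "Time": [1, 1, 2, 3, 1, 2, 3, 4]})

out = (df.sort_values(["Ref", "Time"])
         .groupby("Ref", group_keys=False)
         .apply(extract_n_last_times(n=2)))
```

Each Ref keeps only the rows from its two latest distinct times.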

Related

combining 2 dataframes and make it create new lines mathematically

So I created two different DataFrame tables and integrated them into a tkinter GUI.
The first table looks like this:
Entry  Start             Finish            Total Time (Hour)  Status       Reason for Stoppage
1      23.05.2020 07:30  23.05.2020 08:30  01:00              MANUFACTURE
2      23.05.2020 08:30  23.05.2020 12:00  03:30              MANUFACTURE
3      23.05.2020 12:00  23.05.2020 13:00  01:00              STOPPAGE     MALFUNCTION
4      23.05.2020 13:00  23.05.2020 13:45  00:45              MANUFACTURE
5      23.05.2020 13:45  23.05.2020 17:30  03:45              MANUFACTURE
And the second table looks like this:
Start  Finish  Reason for Stoppage
10:00  10:15   Coffee Break
12:00  12:30   Lunch Break
15:00  15:15   Coffee Break
The main task is combining these tables and creating a third one. While doing that, the rows have to be arranged by time, so the program has to create new rows itself and show every start/finish time in the table. But I just can't do it by combining or merging them.
The third table has to look like this:
Entry  Start             Finish            Total Time (Hour)  Status       Reason for Stoppage
1      23.05.2020 07:30  23.05.2020 08:30  01:00              MANUFACTURE
2      23.05.2020 08:30  23.05.2020 10:00  01:30              MANUFACTURE
3      23.05.2020 10:00  23.05.2020 10:15  00:15              STOPPAGE     Coffee Break
4      23.05.2020 10:15  23.05.2020 12:00  01:45              MANUFACTURE
5      23.05.2020 12:00  23.05.2020 12:30  00:30              STOPPAGE     Lunch Break
6      23.05.2020 12:30  23.05.2020 13:00  00:30              MANUFACTURE
7      23.05.2020 13:00  23.05.2020 13:45  00:45              STOPPAGE     MALFUNCTION
8      23.05.2020 13:45  23.05.2020 15:00  01:15              MANUFACTURE
9      23.05.2020 15:00  23.05.2020 15:15  00:15              STOPPAGE     Coffee Break
10     23.05.2020 15:15  23.05.2020 17:30  02:15              MANUFACTURE
I hope I explained the problem clearly. Thanks in advance.
from tkinter import *
import tkinter as tk
from tkinter import ttk
from pandastable import Table
import pandas as pd
import numpy as np

root = tk.Tk()
root.title("Çalışma Ve Mola Saatleri")
root.geometry("1800x1600")

work = {"Entry": ["1", "2", "3", "4", "5"],
        "Start": ["23.05.2020 07:30", "23.05.2020 08:30", "23.05.2020 12:00",
                  "23.05.2020 13:00", "23.05.2020 13:45"],
        "Finish": ["23.05.2020 08:30", "23.05.2020 12:00", "23.05.2020 13:00",
                   "23.05.2020 13:45", "23.05.2020 17:30"],
        "Total Time (Hour)": ["01:00", "03:30", "01:00", "00:45", "03:45"],
        "Status": ["MANUFACTURE", "MANUFACTURE", "STOPPAGE", "MANUFACTURE", "MANUFACTURE"],
        "Reason For Stoppage": [" ", " ", "MALFUNCTION", " ", " "]}
graph1 = pd.DataFrame(work)
frame = tk.Frame(root)
frame.place(width=200)
frame.pack(anchor=W, padx=100, pady=50, ipadx=120, ipady=30)
pt = Table(frame, dataframe=graph1)
pt.show()

Break = {"Start": ["10:00", "12:00", "15:00"],
         "Finish": ["10:15", "12:30", "15:15"],
         "Reason For Stoppage": ["Coffee Break", "Lunch Break", "Coffee Break"]}
graph2 = pd.DataFrame(Break)
frame2 = tk.Frame(root)
frame2.place(width=100, height=50)
frame2.pack(anchor=NE, padx=150, ipadx=20, ipady=10)
pt2 = Table(frame2, dataframe=graph2)
pt2.show()

graph3 = pd.concat([graph1, graph2])
frame3 = tk.Frame(root)
frame3.place()
frame3.pack(anchor=SW, padx=100, ipadx=120, ipady=500)
pt3 = Table(frame3, dataframe=graph3)
pt3.show()

root.mainloop()
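One possible approach, independent of the GUI code: represent both tables as lists of tuples and split every MANUFACTURE interval around the breaks. This is only a sketch under my own assumptions (the function name and tuple layout are invented), and note that the desired output above also re-times the MALFUNCTION block, which this simple rule does not attempt — existing STOPPAGE rows are passed through unchanged:

```python
from datetime import datetime

def split_by_breaks(work, breaks):
    """Split MANUFACTURE rows wherever a break interval overlaps them.

    work:   list of (start, finish, status, reason) datetime tuples
    breaks: list of (start, finish, reason) datetime tuples
    """
    out = []
    for w_start, w_end, status, reason in work:
        if status != "MANUFACTURE":
            out.append((w_start, w_end, status, reason))  # keep stoppages as-is
            continue
        cur = w_start
        for b_start, b_end, b_reason in sorted(breaks):
            if b_end <= cur or b_start >= w_end:
                continue  # break does not touch the remaining work interval
            if b_start > cur:
                out.append((cur, b_start, "MANUFACTURE", ""))
            out.append((max(b_start, cur), min(b_end, w_end), "STOPPAGE", b_reason))
            cur = min(b_end, w_end)
        if cur < w_end:
            out.append((cur, w_end, "MANUFACTURE", ""))
    return out

D = lambda h, m: datetime(2020, 5, 23, h, m)
work = [(D(7, 30), D(8, 30), "MANUFACTURE", ""),
        (D(8, 30), D(12, 0), "MANUFACTURE", ""),
        (D(12, 0), D(13, 0), "STOPPAGE", "MALFUNCTION"),
        (D(13, 0), D(13, 45), "MANUFACTURE", ""),
        (D(13, 45), D(17, 30), "MANUFACTURE", "")]
breaks = [(D(10, 0), D(10, 15), "Coffee Break"),
          (D(12, 0), D(12, 30), "Lunch Break"),
          (D(15, 0), D(15, 15), "Coffee Break")]

rows = split_by_breaks(work, breaks)
```

The resulting rows could then be numbered, given durations, and fed into pd.DataFrame for pandastable.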

How to select only those data whose date in another dataframe is higher or equal to one dataframe?

I have two dataframes.
Dataframe 1: customer details
Dataframe 2: document RFI details
An RFI can be raised before the customer gets approved, and also after approval, based on the document expiry date.
I want to fetch from the second dataframe only those RFI details that were raised while the customer was in restricted mode.
Customer Details
id user_id date1 initial_status date2 final_status
1 1234 10-12-2021 00:12 created 15-12-2021 10:12 approved
2 1234 15-12-2021 10:12 approved 01-01-2022 00:35 restricted
3 1234 01-01-2022 00:35 restricted 02-02-2022 05:35 approved
4 9879 16-08-2021 15:45 created 21-09-2021 15:45 approved
5 9879 21-09-2021 15:45 approved 24-10-2021 15:45 restricted
6 9879 24-10-2021 15:45 restricted 19-11-2022 07:34 approved
7 9879 19-11-2022 07:34 approved 07-01-2022 15:45 restricted
rfi_details
id user_id rfi_raised_date customer_response action_taken status
1 1234 10-12-2021 05:12 12-12-2021 15:12 14-12-2021 10:11 rejected
2 1234 14-12-2021 10:12 15-12-2021 09:12 15-12-2021 10:12 approved
3 1234 01-01-2022 00:35 18-01-2022 10:35 18-01-2022 10:40 rejected
4 1234 18-01-2022 10:40 25-01-2022 12:35 25-01-2022 12:40 rejected
5 1234 25-01-2022 12:40 02-02-2022 05:50 02-02-2022 05:35 approved
6 9879 16-08-2021 15:45 21-09-2021 10:35 21-09-2021 15:45 approved
7 9879 24-10-2021 15:45 24-10-2021 20:45 24-10-2021 20:50 rejected
8 9879 24-10-2021 20:50 25-10-2021 05:50 25-10-2021 06:00 rejected
9 9879 25-10-2021 06:00 19-11-2022 07:30 19-11-2022 07:34 approved
10 9879 07-01-2022 15:45 10-01-2022 10:45 10-01-2022 10:55 rejected
11 9879 10-01-2022 10:55
Output
# This will contain only those RFI details which were generated while the customer was in restricted mode in dataframe 1
id user_id rfi_raised_date customer_response action status
1 1234 01-01-2022 00:35 18-01-2022 10:35 18-01-2022 10:40 rejected
2 1234 18-01-2022 10:40 25-01-2022 12:35 25-01-2022 12:40 rejected
3 1234 25-01-2022 12:40 02-02-2022 05:50 02-02-2022 05:35 approved
4 9879 24-10-2021 15:45 24-10-2021 20:45 24-10-2021 20:50 rejected
5 9879 24-10-2021 20:50 25-10-2021 05:50 25-10-2021 06:00 rejected
6 9879 25-10-2021 06:00 19-11-2022 07:30 19-11-2022 07:34 approved
7 9879 07-01-2022 15:45 10-01-2022 10:45 10-01-2022 10:55 rejected
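One way to express this, sketched on a toy pair of frames (the column names follow the question, the data are invented): keep the customer rows whose initial_status is restricted, join them to the RFIs on user_id, and keep the RFIs whose rfi_raised_date falls inside a restricted window:

```python
import pandas as pd

# Toy versions of the two frames; only the column names follow the question.
customers = pd.DataFrame({
    "user_id": [1, 1],
    "date1": pd.to_datetime(["2022-01-01 00:00", "2022-02-01 00:00"]),
    "initial_status": ["restricted", "approved"],
    "date2": pd.to_datetime(["2022-02-01 00:00", "2022-03-01 00:00"]),
})
rfi = pd.DataFrame({
    "user_id": [1, 1, 1],
    "rfi_raised_date": pd.to_datetime(
        ["2021-12-15 09:00", "2022-01-10 09:00", "2022-02-10 09:00"]),
})

# Restricted windows only, then a join on user_id and a between() filter.
restricted = customers.loc[customers["initial_status"] == "restricted",
                           ["user_id", "date1", "date2"]]
merged = rfi.merge(restricted, on="user_id")
inside = merged["rfi_raised_date"].between(merged["date1"], merged["date2"])
result = merged.loc[inside, rfi.columns]
```

Only the RFI raised inside the restricted window survives; a customer with several restricted windows simply contributes several join rows.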

Formatting pandas dataframe printing

I have the following pandas dataframe, which was converted to a string with to_string().
It was printed like this:
S T Q U X A D
02:36 06:00 06:00 06:00 06:30 09:46 07:56
02:37 06:10 06:15 06:15 06:40 09:48 08:00
12:00 11:00 12:00 12:00 07:43 12:00 18:03
13:15 13:00 13:15 13:15 07:50 13:15 18:08
14:00 14:00 14:00 14:00 14:00 19:00
15:15 15:00 14:15 15:15 15:15 19:05
16:15 16:00 15:15 16:15 16:15 20:15
17:15 17:00 17:15 17:15 17:15 20:17
18:15 21:22 21:19 19:55 18:15 20:18
19:15 21:24 21:21 19:58 19:15 20:19
The gaps are due to empty values in the dataframe. I would like to keep the column alignment, perhaps by replacing the empty values with tabs. I would also like to center-align the header line.
This wasn't printed in a terminal; it was sent over Telegram with a requests post command. I think, though, that it is just a print-formatting problem, independent of Telegram and the requests library.
The desired output would be like this:
S T Q U X A D
02:36 06:00 06:00 06:00 06:30 09:46 07:56
02:37 06:10 06:15 06:15 06:40 09:48 08:00
12:00 11:00 12:00 12:00 07:43 12:00 18:03
13:15 13:00 13:15 13:15 07:50 13:15 18:08
14:00 14:00 14:00 14:00 14:00 19:00
15:15 15:00 14:15 15:15 15:15 19:05
16:15 16:00 15:15 16:15 16:15 20:15
17:15 17:00 17:15 17:15 17:15 20:17
18:15 21:22 21:19 19:55 18:15 20:18
19:15 21:24 21:21 19:58 19:15 20:19
You can use the dataframe's style.set_properties to set some of these options, like:
df.style.set_properties(**{'text-align': 'center'})
Note that Styler produces HTML, so this applies when rendering to HTML rather than to the plain-text output of to_string(). Read more here:
https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.set_properties.html
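Since to_string() is what produces the text, its own parameters may be the more direct fix: na_rep replaces NaN with a placeholder padded to the column width, so alignment is kept, and justify centres the header row. A small sketch with made-up times:

```python
import numpy as np
import pandas as pd

# Made-up subset of the timetable, with NaN for the empty cells.
df = pd.DataFrame({"S": ["02:36", "12:00", np.nan],
                   "T": ["06:00", np.nan, "14:00"]})

# na_rep pads the empty cells to the column width, keeping alignment;
# justify centres the header row of the plain-text output.
text = df.to_string(index=False, na_rep="", justify="center")
print(text)
```

The resulting string can be posted to Telegram as-is (ideally inside a fixed-width/monospace block so the alignment survives).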

pandas merge/rearrange/sum single dataframe

I have following dataframe:
latitude longitude d1 d2 ar merge_time
0 15 10.0 12/1/1981 0:00 12/4/1981 3:00 2.317681391 1981-12-04 04:00:00
1 15 10.1 12/1/1981 0:00 12/1/1981 3:00 2.293604127 1981-12-01 04:00:00
2 15 10.2 12/1/1981 0:00 12/1/1981 2:00 2.264552161 1981-12-01 03:00:00
3 15 10.3 12/1/1981 0:00 12/4/1981 2:00 2.278556423 1981-12-04 03:00:00
4 15 10.1 12/1/1981 4:00 12/1/1981 22:00 2.168275766 1981-12-01 23:00:00
5 15 10.2 12/1/1981 3:00 12/1/1981 21:00 2.114636628 1981-12-01 22:00:00
6 15 10.4 12/1/1981 0:00 12/2/1981 17:00 1.384415903 1981-12-02 18:00:00
7 15 10.1 12/2/1981 8:00 12/2/1981 11:00 2.293604127 1981-12-01 12:00:00
I want to group and rearrange the above dataframe (summing the values of column ar) based on the following criteria:
1. Values latitude and longitude are equal, and
2. Values d2 and merge_time are equal within the groups from 1.
Here is desired output:
latitude longitude d1 d2 ar
15 10 12/1/1981 0:00 12/4/1981 3:00 2.317681391
15 10.1 12/1/1981 0:00 12/1/1981 22:00 4.461879893
15 10.2 12/1/1981 0:00 12/1/1981 21:00 4.379188789
15 10.3 12/1/1981 0:00 12/4/1981 2:00 2.278556423
15 10.4 12/1/1981 0:00 12/2/1981 17:00 1.384415903
15 10.1 12/2/1981 8:00 12/2/1981 11:00 2.293604127
How can I achieve this?
Any help is appreciated.
After you expressed your requirements in the comments, the approach is:
group by location (longitude & latitude)
find rows within this grouping that are contiguous in time
group and aggregate these contiguous sections
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""   latitude  longitude  d1  d2  ar  merge_time
0  15  10.0  12/1/1981 0:00  12/4/1981 3:00  2.317681391  1981-12-04 04:00:00
1  15  10.1  12/1/1981 0:00  12/1/1981 3:00  2.293604127  1981-12-01 04:00:00
2  15  10.2  12/1/1981 0:00  12/1/1981 2:00  2.264552161  1981-12-01 03:00:00
3  15  10.3  12/1/1981 0:00  12/4/1981 2:00  2.278556423  1981-12-04 03:00:00
4  15  10.1  12/1/1981 4:00  12/1/1981 22:00  2.168275766  1981-12-01 23:00:00
5  15  10.2  12/1/1981 3:00  12/1/1981 21:00  2.114636628  1981-12-01 22:00:00
6  15  10.4  12/1/1981 0:00  12/2/1981 17:00  1.384415903  1981-12-02 18:00:00
7  15  10.1  12/2/1981 8:00  12/2/1981 11:00  2.293604127  1981-12-01 12:00:00"""),
                 sep=r"\s\s+", engine="python")
df = df.assign(**{c: pd.to_datetime(df[c]) for c in ["d1", "d2", "merge_time"]})

df.groupby(["latitude", "longitude"]).apply(
    lambda d: d.groupby(
        (d["d1"] != (d["d2"].shift() + pd.Timedelta("1H"))).cumsum(), as_index=False
    ).agg({"d1": "min", "d2": "max", "ar": "sum"})
).droplevel(2, 0).reset_index()
output
   latitude  longitude                  d1                  d2       ar
0        15       10.0 1981-12-01 00:00:00 1981-12-04 03:00:00  2.31768
1        15       10.1 1981-12-01 00:00:00 1981-12-01 22:00:00  4.46188
2        15       10.1 1981-12-02 08:00:00 1981-12-02 11:00:00  2.29360
3        15       10.2 1981-12-01 00:00:00 1981-12-01 21:00:00  4.37919
4        15       10.3 1981-12-01 00:00:00 1981-12-04 02:00:00  2.27856
5        15       10.4 1981-12-01 00:00:00 1981-12-02 17:00:00  1.38442
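The heart of this answer is the contiguity flag: a row continues the previous block when its d1 is exactly one hour after the previous row's d2, and cumsum over the "new run starts here" test turns those flags into run labels. Isolated on three toy rows:

```python
import pandas as pd

# Three toy intervals: the second starts exactly one hour after the first
# ends, the third does not.
d = pd.DataFrame({
    "d1": pd.to_datetime(["1981-12-01 00:00", "1981-12-01 04:00", "1981-12-02 08:00"]),
    "d2": pd.to_datetime(["1981-12-01 03:00", "1981-12-01 22:00", "1981-12-02 11:00"]),
})

# True where a new contiguous run starts; cumsum turns that into run labels.
new_run = d["d1"] != (d["d2"].shift() + pd.Timedelta("1h"))
labels = new_run.cumsum()
```

The first two rows share label 1 (one contiguous run) and the third gets label 2, which is exactly what the inner groupby then aggregates over.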

Using one dataframe output to find matching rows in another dataframe

I would like to use some daily data in one dataframe as a qualifier to run some code on another dataframe. Both dataframes contain ['Date', 'Time', 'Ticker', 'Open', 'High', 'Low', 'Close']. One dataframe has only daily information; the other contains 5-minute bars of the same fields. Here are some examples.
print(df)
Date Time Ticker Open High Low Close
0 01/02/18 3:00 PM ES 2687.00 2696.00 2681.75 2695.75
1 01/03/18 3:00 PM ES 2697.25 2714.25 2697.00 2712.50
2 01/04/18 3:00 PM ES 2719.25 2729.00 2718.25 2724.00
3 01/05/18 3:00 PM ES 2732.25 2743.00 2726.50 2741.25
4 01/08/18 3:00 PM ES 2740.25 2748.50 2737.00 2746.50
5 01/09/18 3:00 PM ES 2751.00 2760.00 2748.00 2753.00
6 01/10/18 3:00 PM ES 2744.00 2751.75 2736.50 2748.75
7 01/11/18 3:00 PM ES 2754.25 2768.50 2752.75 2768.00
8 01/12/18 3:00 PM ES 2771.25 2788.75 2770.00 2786.50
9 01/15/18 3:00 PM ES 2793.75 2796.00 2792.50 2794.50
print(df_tick)
Date Time Ticker Open High Low Close
0 01/02/18 8:45 AM ES 2687.00 2687.25 2681.75 2685.75
1 01/02/18 9:00 AM ES 2686.00 2687.75 2683.50 2687.50
2 01/02/18 9:15 AM ES 2687.50 2690.50 2687.25 2689.25
3 01/02/18 9:30 AM ES 2689.50 2692.00 2689.25 2692.00
4 01/02/18 9:45 AM ES 2692.00 2692.25 2687.25 2690.00
5 01/02/18 10:00 AM ES 2690.00 2691.00 2689.75 2690.75
6 01/02/18 10:15 AM ES 2690.50 2691.25 2690.25 2691.00
7 01/02/18 10:30 AM ES 2691.00 2692.00 2689.00 2689.50
8 01/02/18 10:45 AM ES 2689.50 2689.75 2687.75 2688.25
9 01/02/18 11:00 AM ES 2688.25 2689.50 2687.75 2689.25
10 01/02/18 11:15 AM ES 2689.25 2690.75 2689.25 2690.00
11 01/02/18 11:30 AM ES 2690.00 2690.75 2689.25 2690.00
12 01/02/18 11:45 AM ES 2690.25 2690.50 2688.50 2688.75
13 01/02/18 12:00 PM ES 2689.00 2689.25 2688.50 2689.25
14 01/02/18 12:15 PM ES 2689.25 2691.00 2689.00 2690.50
15 01/02/18 12:30 PM ES 2690.75 2691.00 2689.75 2690.50
16 01/02/18 12:45 PM ES 2690.75 2691.25 2690.25 2691.00
17 01/02/18 1:00 PM ES 2691.25 2691.25 2689.50 2690.75
18 01/02/18 1:15 PM ES 2690.50 2691.50 2690.25 2690.50
19 01/02/18 1:30 PM ES 2690.50 2691.00 2689.75 2690.75
20 01/02/18 1:45 PM ES 2690.75 2691.50 2690.25 2690.75
21 01/02/18 2:00 PM ES 2690.75 2691.25 2690.75 2691.00
22 01/02/18 2:15 PM ES 2691.25 2691.75 2690.50 2691.50
23 01/02/18 2:30 PM ES 2691.50 2693.00 2691.50 2692.75
24 01/02/18 2:45 PM ES 2693.00 2693.75 2691.00 2693.75
25 01/02/18 3:00 PM ES 2693.75 2696.00 2693.25 2695.75
26 01/03/18 8:45 AM ES 2697.25 2702.25 2697.00 2700.75
27 01/03/18 9:00 AM ES 2701.00 2703.75 2700.50 2703.25
28 01/03/18 9:15 AM ES 2703.25 2706.00 2703.00 2705.00
29 01/03/18 9:30 AM ES 2705.00 2707.25 2704.00 2706.50
Code for calculating the gap percentage:
# Calculating Gap Percentage
df['Gap %'] = (df['Open'].sub(df['Close'].shift())
                 .div(df['Close'] - 1).fillna(0)) * 100
I have the code for the df to find the percentage change from Close-Open, and would like to use this information as a qualifier to run some code on the df_tick.
For example if df['Gap %'] > .02, then I want to use that date in df_tick and ignore (or drop) the rest of the information.
#drop rows not meeting certain percentage
df.drop(df[df['Gap %'] < .2].index, inplace=True)
print(df)
Date Time Ticker Open High Low Close Gap Gap %
2 01/04/18 3:00 PM ES 2719.25 2729.0 2718.25 2724.00 6.75 0.247888
3 01/05/18 3:00 PM ES 2732.25 2743.0 2726.50 2741.25 8.25 0.301067
9 01/15/18 3:00 PM ES 2793.75 2796.0 2792.50 2794.50 7.25 0.259531
Now I'd like to use df['Date'] to find the matching dates in df_tick['Date'] for some code I've already written. I tried to just drop all the data where the dates aren't the same, but received an error.
#drop rows in df_tick not matching dates in df
df_tick.drop(df_tick[df_tick['Date'] != df['Date']].index, inplace=True)
ValueError: Can only compare identically-labeled Series objects
You may be able to reset the index of both dataframes and get away with what you are trying to do, but I would try this:
df_tick = df_tick[df_tick.Date.isin(df.Date.unique())]
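On a toy pair of frames (values invented for illustration), the isin filter keeps exactly the 5-minute rows whose date survived the gap filter:

```python
import pandas as pd

# Invented miniature frames: df holds the dates that passed the Gap % filter,
# df_tick the intraday rows.
df = pd.DataFrame({"Date": ["01/04/18", "01/05/18"]})
df_tick = pd.DataFrame({"Date": ["01/02/18", "01/04/18", "01/04/18", "01/05/18"],
                        "Close": [2687.0, 2719.0, 2724.0, 2741.0]})

# Keep only intraday rows whose Date appears in the filtered daily frame.
df_tick = df_tick[df_tick.Date.isin(df.Date.unique())]
```

Unlike the element-wise `!=` comparison, isin does not require the two Series to share an index, which is what caused the "identically-labeled Series" error.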
