So I have this dataframe (shown below), and I am trying to join it with itself by copying it into another df.
Join conditions:
Same PERSONID and Badge_ID
But different SITE_ID1
Timedelta between the two rows should be less than 48 hrs.
Expected output:
PERSONID Badge_ID Reader_ID1_x SITE_ID1_x EVENT_TS1_x Reader_ID1_y SITE_ID1_y EVENT_TS1_y
2553-AMAGID 4229 141 99 2/1/2016 3:26 145 97 2/1/2016 3:29
2553-AMAGID 4229 248 99 2/1/2016 3:26 145 97 2/1/2016 3:29
2553-AMAGID 4229 145 97 2/1/2016 3:29 251 99 2/1/2016 3:29
2553-AMAGID 4229 145 97 2/1/2016 3:29 291 99 2/1/2016 3:29
Here is what I tried:
I made a copy of df, intending to filter each df with the condition below and then join them back together. But the condition below doesn't work :(
I tried these filters in SQL before reading into the df, but that's too slow for 600k+ rows, even with indexes.
df1 = df1[(df1['Badge_ID']==df2['Badge_ID']) and (df1['SITE_ID1']!=df2['SITE_ID1']) and ((df1['EVENT_TS1']-df2['EVENT_TS1'])<=datetime.timedelta(hours=event_time_diff))]
PERSONID Badge_ID Reader_ID1 SITE_ID1 EVENT_TS1
2553-AMAGID 4229 141 99 2/1/2016 3:26:10 AM
2553-AMAGID 4229 248 99 2/1/2016 3:26:10 AM
2553-AMAGID 4229 145 97 2/1/2016 3:29:56 AM
2553-AMAGID 4229 251 99 2/1/2016 3:29:56 AM
2553-AMAGID 4229 291 99 2/1/2016 3:29:56 AM
2557-AMAGID 4219 144 99 2/1/2016 2:36:30 AM
2557-AMAGID 4219 144 99 2/1/2016 2:40:00 AM
2557-AMAGID 4219 250 99 2/1/2016 2:40:00 AM
2557-AMAGID 4219 290 99 2/1/2016 2:40:00 AM
2557-AMAGID 4219 144 97 2/1/2016 4:02:06 AM
2557-AMAGID 4219 250 99 2/1/2016 4:02:06 AM
2557-AMAGID 4219 290 99 2/1/2016 4:02:06 AM
2557-AMAGID 4219 250 97 2/2/2016 1:36:30 AM
2557-AMAGID 4219 290 99 2/3/2016 2:38:30 AM
2559-AMAGID 4227 141 99 2/1/2016 4:33:24 AM
2559-AMAGID 4227 248 99 2/1/2016 4:33:24 AM
2560-AMAGID 4226 141 99 2/1/2016 4:10:56 AM
2560-AMAGID 4226 248 99 2/1/2016 4:10:56 AM
2560-AMAGID 4226 145 99 2/1/2016 4:33:52 AM
2560-AMAGID 4226 251 99 2/1/2016 4:33:52 AM
2560-AMAGID 4226 291 99 2/1/2016 4:33:52 AM
2570-AMAGID 4261 141 99 2/1/2016 4:27:02 AM
2570-AMAGID 4261 248 99 2/1/2016 4:27:02 AM
2986-AMAGID 4658 145 99 2/1/2016 3:14:54 AM
2986-AMAGID 4658 251 99 2/1/2016 3:14:54 AM
2986-AMAGID 4658 291 99 2/1/2016 3:14:54 AM
2986-AMAGID 4658 144 99 2/1/2016 3:26:30 AM
2986-AMAGID 4658 250 99 2/1/2016 3:26:30 AM
2986-AMAGID 4658 290 99 2/1/2016 3:26:30 AM
4133-AMAGID 6263 142 99 2/1/2016 2:44:08 AM
4133-AMAGID 6263 249 99 2/1/2016 2:44:08 AM
4133-AMAGID 6263 141 34 2/1/2016 2:44:20 AM
4133-AMAGID 6263 248 34 2/1/2016 2:44:20 AM
4414-AMAGID 6684 145 99 2/1/2016 3:08:06 AM
4414-AMAGID 6684 251 99 2/1/2016 3:08:06 AM
4414-AMAGID 6684 291 99 2/1/2016 3:08:06 AM
4414-AMAGID 6684 145 22 2/1/2016 3:19:12 AM
4414-AMAGID 6684 251 22 2/1/2016 3:19:12 AM
4414-AMAGID 6684 291 22 2/1/2016 3:19:12 AM
4414-AMAGID 6684 145 99 2/1/2016 4:14:28 AM
4414-AMAGID 6684 251 99 2/1/2016 4:14:28 AM
4414-AMAGID 6684 291 99 2/1/2016 4:14:28 AM
4484-AMAGID 6837 142 99 2/1/2016 2:51:14 AM
4484-AMAGID 6837 249 99 2/1/2016 2:51:14 AM
4484-AMAGID 6837 141 99 2/1/2016 2:51:26 AM
4484-AMAGID 6837 248 99 2/1/2016 2:51:26 AM
4484-AMAGID 6837 141 99 2/1/2016 3:05:12 AM
4484-AMAGID 6837 248 99 2/1/2016 3:05:12 AM
4484-AMAGID 6837 141 99 2/1/2016 3:08:58 AM
4484-AMAGID 6837 248 99 2/1/2016 3:08:58 AM
Try the following:
# Load the data into the first dataframe
df1 = pd.DataFrame(data)
# Copy the data into another dataframe
df2 = pd.DataFrame(data)
# Rename column names of second dataframe
df2.rename(index=str, columns={'Reader_ID1': 'Reader_ID1_x', 'SITE_ID1': 'SITE_ID1_x', 'EVENT_TS1': 'EVENT_TS1_x'}, inplace=True)
# Merge the dataframes into another dataframe based on PERSONID and Badge_ID
df3 = pd.merge(df1, df2, how='outer', on=['PERSONID', 'Badge_ID'])
# Use df.loc[] to filter for the rows you want
df3.loc[(df3.Reader_ID1 < df3.Reader_ID1_x) & (df3.SITE_ID1 != df3.SITE_ID1_x) & (pd.to_datetime(df3['EVENT_TS1']) - pd.to_datetime(df3['EVENT_TS1_x'])<=datetime.timedelta(hours=event_time_diff))]
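Two small caveats worth noting: datetime must be imported and event_time_diff defined (48 per the question), and without an abs() the time condition also admits arbitrarily large negative gaps. A hedged adjustment:
import datetime

event_time_diff = 48  # hours, per the question

mask = (
    (df3.Reader_ID1 < df3.Reader_ID1_x)        # avoid mirror/self pairs
    & (df3.SITE_ID1 != df3.SITE_ID1_x)         # different sites
    & ((pd.to_datetime(df3['EVENT_TS1']) - pd.to_datetime(df3['EVENT_TS1_x'])).abs()
       <= datetime.timedelta(hours=event_time_diff))  # within 48 hours either way
)
result = df3.loc[mask]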
I have a dataframe that looks like this:
Path_Version commitdates Year-Month API Age api_spec_id
168 NaN 2018-10-19 2018-10 39 521
169 NaN 2018-10-19 2018-10 39 521
170 NaN 2018-10-12 2018-10 39 521
171 NaN 2018-10-12 2018-10 39 521
172 NaN 2018-10-12 2018-10 39 521
173 NaN 2018-10-11 2018-10 39 521
174 NaN 2018-10-11 2018-10 39 521
175 NaN 2018-10-11 2018-10 39 521
176 NaN 2018-10-11 2018-10 39 521
177 NaN 2018-10-11 2018-10 39 521
178 NaN 2018-09-26 2018-09 39 521
179 NaN 2018-09-25 2018-09 39 521
I want to calculate the days elapsed from the first commitdate till the last, after sorting the commit dates first, so something like this:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 25
169 NaN 2018-10-19 2018-10 39 521 25
170 NaN 2018-10-12 2018-10 39 521 18
171 NaN 2018-10-12 2018-10 39 521 18
172 NaN 2018-10-12 2018-10 39 521 18
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
I first tried sorting the commitdates by api_spec_id, since it is unique for every API, and then calculating the diff:
final_api['commitdates'] = final_api.groupby('api_spec_id')['commitdates'].apply(lambda x: x.sort_values())
final_api['diff'] = final_api.groupby('api_spec_id')['commitdates'].diff() / np.timedelta64(1, 'D')
final_api['diff'] = final_api['diff'].fillna(0)
It just returns zero for the entire column. I don't want to group them; I only want to calculate the difference based on the sorted commitdates, starting from the first commitdate till the last in the entire dataset, in days.
Any idea how I can achieve this?
Use pandas.to_datetime, sub, min and dt.days:
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.min()).dt.days
If you need to group per API:
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.groupby(df['api_spec_id']).transform('min')).dt.days
Output:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 24
169 NaN 2018-10-19 2018-10 39 521 24
170 NaN 2018-10-12 2018-10 39 521 17
171 NaN 2018-10-12 2018-10 39 521 17
172 NaN 2018-10-12 2018-10 39 521 17
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
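For reference, a self-contained version of the first snippet (the data here is a trimmed stand-in for the sample above):
import pandas as pd

df = pd.DataFrame({
    'commitdates': ['2018-10-19', '2018-10-12', '2018-10-11', '2018-09-26', '2018-09-25'],
    'api_spec_id': [521, 521, 521, 521, 521],
})

# Days elapsed from the earliest commit date in the whole dataset
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.min()).dt.days
print(df)  # Days_difference: 24, 17, 16, 1, 0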
In an x,y coordinate plane, each (x,y) point has a corresponding z value, but some of them are missing. Can someone help interpolate the missing z data?
x = [161 177 193 209 225 241 257 273 289 305 321 337 353 145 161 177 193 209 225 241 257 273 289 305 321 337 353 369 145 161 177 193 209 225 241 257 273 289 305 321 337 353 369 145 161 177 193 209 225 241 257 273 289 305 321 337 353 369 385 369 353 337 321 305 289 273 257 241 225 209 193 177 161 145 129 97 113 129 145]
y = [55 55 55 55 55 55 55 55 55 55 55 55 55 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 65 65 65 65 65 65 65 65 65 65 65 65 65 65 65 74 74 74 74 74 74 74 74 74 74 74 74 74 74 74 74 74 115 115 115 115]
z = [0.635 0.559 0.506 nan 0.597 nan 0.644 0.66 0.644 0.642 nan 0.545 nan nan nan 0.432 0.45 nan 0.517 0.521 nan 0.547 0.528 0.52 0.505 0.446 0.51 0.547 0.734 0.045 0.227 0 0.164 0.41 0.431 0.343 0.351 0.405 0.43 0.023 0.391 0.246 0.437 1.005 0.889 0.926 0.895 0.992 1.008 0.921 0.944 0.959 0.96 1.019 1.033 1.009 0.991 0.952 1.008 0.994 0.93 1.003 0.96 0.92 0.886 0.919 0.922 0.923 0.91 1.006 1.006 0.91 0.893 0.89 1 0.618 0.654 0.647 0.664]
The contour map (image not included here) shows that some z values are missing. The missing z values could be computed with a built-in MATLAB method, such as using the nearest 4 points of (x,y), or the nearest 9 points. Thank you all for the help.
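No MATLAB answer is shown here, but for reference, a rough Python sketch of the same idea using scipy.interpolate.griddata (the arrays below are truncated stand-ins for the full data above; method='nearest' corresponds to nearest-point filling):
import numpy as np
from scipy.interpolate import griddata

# Truncated stand-ins for the full x, y, z arrays above
x = np.array([161, 177, 193, 209, 161, 177, 193, 209], dtype=float)
y = np.array([55, 55, 55, 55, 57, 57, 57, 57], dtype=float)
z = np.array([0.635, 0.559, np.nan, 0.597, 0.432, np.nan, 0.45, 0.517])

known = ~np.isnan(z)
z_filled = z.copy()
# Estimate each missing z from the known (x, y) -> z points;
# 'linear' or 'cubic' are alternatives to 'nearest'
z_filled[~known] = griddata((x[known], y[known]), z[known],
                            (x[~known], y[~known]), method='nearest')
print(z_filled)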
Is there a way to change this data frame:
40 4.5 95
41 1.76 95
112 0.17/0.43 >95/>95
to this using pandas:
40 4.5 95
41 1.76 95
112 0.17 95
112 0.43 95
This is the pandas dataframe:
a b
19 560 80
40 4.5 95
41 1.76 95
112 0.17/0.43 >95/>95
154 7.2/1 >95/>95
... ... ...
2991 55 95
2992 33 95
3887 6.1 87.7
3893 3.9 70.3
3908 100 40
216 rows × 2 columns
I would use explode:
df = df.apply(lambda x: x.astype(str).str.split('/').explode(ignore_index=True))
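This splits both columns on '/' in parallel, so the row counts stay aligned. The expected output also drops the '>' prefix from b; if that's wanted too, a small follow-up (assuming the columns are named a and b as in the sample):
import pandas as pd

df = pd.DataFrame({'a': ['4.5', '1.76', '0.17/0.43'],
                   'b': ['95', '95', '>95/>95']})

# Split every column on '/' and explode the parts into separate rows
df = df.apply(lambda x: x.astype(str).str.split('/').explode(ignore_index=True))
df['b'] = df['b'].str.lstrip('>')  # '>95' -> '95', matching the expected output
print(df)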
I have a dataframe like this
key a0 p0 a1 p1 a2 p2 d1 d2 prot
2136933 GLN 35 GLN 176 GLN 39 4 137 2CPK
2136934 GLN 35 GLN 176 GLN 39 4 137 3TNQ
2136933 GLN 35 GLN 176 GLN 39 4 137 5O5M
2136961 GLN 35 GLN 177 GLN 39 4 138 1ATP
2136962 GLN 39 GLN 177 GLN 181 138 4 1ATP
2136960 GLN 35 GLN 177 GLN 39 4 138 1L3R
2136962 GLN 39 GLN 177 GLN 181 138 4 1L3R
2136910 GLN 39 GLN 177 GLN 35 4 138 2CPK
2136993 GLN 39 GLN 177 GLN 181 138 4 2CPK
2136961 GLN 35 GLN 177 GLN 39 4 138 3TNQ
2136961 GLN 35 GLN 177 GLN 39 4 138 4XW5
2136961 GLN 35 GLN 177 GLN 39 4 138 5O5M
2136849 GLN 39 GLN 181 GLN 35 4 142 1ATP
I want to retain only rows that form a pair in which one row's d1 equals the other's d2 and vice versa, and the two keys should be within a +/-10 range of each other. The expected result should be the following:
key a0 p0 a1 p1 a2 p2 d1 d2 prot
2136961 GLN 35 GLN 177 GLN 39 4 138 1ATP
2136962 GLN 39 GLN 177 GLN 181 138 4 1ATP
2136960 GLN 35 GLN 177 GLN 39 4 138 1L3R
2136962 GLN 39 GLN 177 GLN 181 138 4 1L3R
2136961 GLN 35 GLN 177 GLN 39 4 138 3TNQ
2136961 GLN 35 GLN 177 GLN 39 4 138 4XW5
2136961 GLN 35 GLN 177 GLN 39 4 138 5O5M
Merge the dataframe with itself, comparing one pair of columns (d1 against d2), then check the other conditions:
# Self-merge: pair each row with rows whose d2 equals its d1
t = df.merge(df, how='inner', left_on='d1', right_on='d2', suffixes=['', '_y'])
# Keep pairs where the reverse match (d2 == d1) also holds and the keys are
# within +/-10, then drop the right-hand columns and duplicate rows
t[(t['d2']==t['d1_y']) & (t['key']-t['key_y']).abs().lt(10)] \
    .drop(columns=t.columns[t.columns.str.endswith('_y')]).drop_duplicates()
key a0 p0 a1 p1 a2 p2 d1 d2 prot
9 2136961 GLN 35 GLN 177 GLN 39 4 138 1ATP
12 2136960 GLN 35 GLN 177 GLN 39 4 138 1L3R
18 2136961 GLN 35 GLN 177 GLN 39 4 138 3TNQ
21 2136961 GLN 35 GLN 177 GLN 39 4 138 4XW5
24 2136961 GLN 35 GLN 177 GLN 39 4 138 5O5M
30 2136962 GLN 39 GLN 177 GLN 181 138 4 1ATP
36 2136962 GLN 39 GLN 177 GLN 181 138 4 1L3R
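Note that the trailing drop_duplicates() matters: a row can satisfy the reciprocal condition against several partner rows (key 2136961 for 1ATP, for example, matches the 2136962 rows of both 1ATP and 1L3R), so the same left-hand row can survive the filter more than once.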
import turtle
t=turtle.Turtle()
wn=turtle.Screen()
wn.setworldcoordinates(-300,-300,300,300)
directions = {  # dictionary of directions given in file.
    'up': turtle.up,
    'down': turtle.down
}
with open('dino.txt', 'r') as dino:
    for line in dino:
        line, pixel = line.split()  # split line into two different directions.
        if line in directions:  # runs if within directions.
            directions[line](pixel)
        else:
            raise()  # raises error if not within directions.
I have this file titled "dino.txt" that has directions within it that are supposed to trace out a dinosaur in Python turtle graphics. However, I am having much trouble implementing code that reads the file and traces out the image in turtle graphics. The code I have written above opens the turtle graphics window but does not trace out anything. I was hoping someone here could help me out or point out how exactly to implement a turtle graphics design in Python by reading a text file. Thanks for any help/feedback.
Here are the contents of the file "dino.txt":
UP
-218 185
DOWN
-240 189
-246 188
-248 183
-246 178
-244 175
-240 170
-235 166
-229 163
-220 158
-208 156
-203 153
-194 148
-187 141
-179 133
-171 119
-166 106
-163 87
-161 66
-162 52
-164 44
-167 28
-171 6
-172 -15
-171 -30
-165 -46
-156 -60
-152 -67
-152 -68
UP
-134 -61
DOWN
-145 -66
-152 -78
-152 -94
-157 -109
-157 -118
-151 -128
-146 -135
-146 -136
UP
-97 -134
DOWN
-98 -138
-97 -143
-96 -157
-96 -169
-98 -183
-104 -194
-110 -203
-114 -211
-117 -220
-120 -233
-122 -243
-123 -247
-157 -248
-157 -240
-154 -234
-154 -230
-153 -229
-149 -226
-146 -223
-145 -219
-143 -214
-142 -210
-141 -203
-139 -199
-136 -192
-132 -184
-130 -179
-132 -171
-133 -162
-134 -153
-138 -145
-143 -137
-143 -132
-142 -124
-138 -112
-134 -104
-132 -102
UP
-97 -155
DOWN
-92 -151
-91 -147
-89 -142
-89 -135
-90 -129
-90 -128
UP
-94 -170
DOWN
-83 -171
-68 -174
-47 -177
-30 -172
-15 -171
-11 -170
UP
12 -96
DOWN
9 -109
9 -127
7 -140
5 -157
9 -164
22 -176
37 -204
40 -209
49 -220
55 -229
57 -235
57 -238
50 -239
49 -241
51 -248
53 -249
63 -245
70 -243
57 -249
62 -250
71 -250
75 -250
81 -250
86 -248
86 -242
84 -232
85 -226
81 -221
77 -211
73 -205
67 -196
62 -187
58 -180
51 -171
47 -164
46 -153
50 -141
53 -130
54 -124
57 -112
56 -102
55 -98
UP
48 -164
DOWN
54 -158
60 -146
64 -136
64 -131
UP
5 -152
DOWN
1 -150
-4 -145
-8 -138
-14 -128
-19 -119
-17 -124
UP
21 -177
DOWN
14 -176
7 -174
-6 -174
-14 -170
-19 -166
-20 -164
UP
-8 -173
DOWN
-8 -180
-5 -189
-4 -201
-2 -211
-1 -220
-2 -231
-5 -238
-8 -241
-9 -244
-7 -249
6 -247
9 -248
16 -247
21 -246
24 -241
27 -234
27 -226
27 -219
27 -209
27 -202
28 -193
28 -188
28 -184
UP
-60 -177
DOWN
-59 -186
-57 -199
-56 -211
-59 -225
-61 -233
-65 -243
-66 -245
-73 -246
-81 -246
-84 -246
-91 -245
-91 -244
-88 -231
-87 -225
-85 -218
-85 -211
-85 -203
-85 -193
-88 -185
-89 -180
-91 -175
-92 -172
-93 -170
UP
-154 -93
DOWN
-157 -87
-162 -74
-168 -66
-172 -57
-175 -49
-178 -38
-178 -26
-178 -12
-177 4
-175 17
-172 27
-168 36
-161 48
-161 50
UP
-217 178
DOWN
-217 178
-217 177
-215 176
-214 175
-220 177
-223 178
-223 178
-222 178
UP
-248 185
DOWN
-245 184
-240 182
-237 181
-234 179
-231 177
-229 176
-228 175
-226 174
-224 173
-223 173
-220 172
-217 172
-216 171
-214 170
-214 169
UP
-218 186
DOWN
-195 173
-183 165
-175 159
-164 151
-158 145
-152 139
-145 128
-143 122
-139 112
-138 105
-134 95
-131 88
-129 78
-126 67
-125 62
-125 54
-124 44
-125 38
-126 30
-125 27
-125 8
-126 5
-125 -9
-122 -15
-115 -25
-109 -32
-103 -39
-95 -42
-84 -45
-72 -47
-56 -48
-41 -47
-31 -46
-18 -45
-1 -44
9 -43
34 -45
50 -52
67 -61
83 -68
95 -80
112 -97
142 -115
180 -132
200 -146
227 -159
259 -175
289 -185
317 -189
349 -190
375 -191
385 -192
382 -196
366 -199
352 -204
343 -204
330 -205
315 -209
296 -212
276 -214
252 -208
237 -202
218 -197
202 -193
184 -187
164 -179
147 -173
128 -168
116 -164
102 -160
88 -158
78 -159
69 -162
57 -164
56 -165
51 -165
UP
68 -144
DOWN
83 -143
96 -141
109 -139
119 -146
141 -150
161 -155
181 -163
195 -169
208 -179
223 -187
241 -191
247 -193
249 -194
UP
-6 -141
DOWN
-15 -146
-29 -150
-42 -154
-51 -153
-60 -152
-60 -152
UP
-90 -134
DOWN
-85 -131
-79 -128
-78 -123
-80 -115
-82 -106
-80 -101
-76 -101
UP
-81 -132
DOWN
-76 -130
-71 -126
-72 -124
UP
43 -118
DOWN
44 -125
47 -135
41 -156
37 -160
40 -166
47 -171
47 -171
UP
-106 -153
DOWN
-107 -167
-106 -178
-109 -192
-114 -198
-116 -201
This logic is wrong:
line, pixel = line.split()
If the data is as shown in the edited version of your question, consider:
data = line.rstrip().split()
You only have two cases: either the length of list data is 1, meaning you have a direction like UP or DOWN in data[0], or the length is 2, meaning you have the arguments for a goto() (after you first convert the two strings to integers). That's it.
If the data is more free-form, as shown in your original post, then you need to process the data one token at a time. Pull off one data item and convert it via int() under a try clause. If that succeeds, read the next data item as an int and do a goto() using both; otherwise, handle the current data item as a direction in the except clause, since it's clearly not an int.
Errors to check for include: numbers that don't correctly convert to integers, and a direction not found in the directions dictionary.
Other things to consider: change your directions keys to uppercase to match the data in the file, or explicitly control the case yourself; if you use setworldcoordinates() this way, you may skew the aspect ratio of the image -- make your window square to begin with (whatever size) using wn.setup(size, size); and your maximum virtual coordinate of 300 falls short of your data, which reaches 385.
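Putting those pieces together, a minimal sketch of the two-case approach described above (assuming dino.txt is exactly as shown):
import turtle

t = turtle.Turtle()
wn = turtle.Screen()
wn.setup(600, 600)  # square window, so the aspect ratio isn't skewed
wn.setworldcoordinates(-400, -400, 400, 400)  # data reaches 385, so 300 is too small

directions = {
    'UP': t.up,     # pen-control methods take no arguments
    'DOWN': t.down,
}

with open('dino.txt', 'r') as dino:
    for line in dino:
        data = line.rstrip().split()
        if len(data) == 1:        # a direction such as UP or DOWN
            directions[data[0]]()
        elif len(data) == 2:      # an x, y coordinate pair
            t.goto(int(data[0]), int(data[1]))

wn.mainloop()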