Using pandas to identify nearest objects - python

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience with them and thought it would be a good learning opportunity. I was able to complete the assignment using the traditional loops I know from conventional programming, and it ran fine over thousands of rows, but it ground my laptop to a halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
   id  start  end
0  C1    200  215
1  C2    110  125
2  C3    240  255
...

trucks
   id  start  end
0  T1    115  175
1  T2    200  260
2  T3    280  340
3  T4     25   85
...
In the two dataframes above, the start and end columns represent arbitrary positions on the road, where start is the back edge of the vehicle and end is the front edge.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)
I emphasized "up to" because if another car sits closer to the nearest truck in back or in front, that relationship is ignored. With the sample data above, we can observe the following:
C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output is the cars dataframe with the following new columns appended:
data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)
The final cars dataframe should look like this:
   id  start  end  back_id  back_start  back_end  back_distance  across_id  across_start  across_end  front_id  front_start  front_end  front_distance
0  C1    200  215       T1         115       175             25         T2           200         260
1  C2    110  125       T4          25        85             25                                              T1          115        175             -10
2  C3    240  255                                                       T2           200         260        T3          280        340              25
Is pandas even the right tool for this task? If there is a better-suited tool that can efficiently cross-reference and append columns based on calculations across millions of rows, I am all ears.

With pandas you can use merge_asof. Here is one way; it may not be the most efficient with millions of rows:
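For reference, here is a minimal construction of the two sample frames from the question (same columns and values), so the snippet below can be run as-is:

import pandas as pd

cars = pd.DataFrame({'id': ['C1', 'C2', 'C3'],
                     'start': [200, 110, 240],
                     'end': [215, 125, 255]})

trucks = pd.DataFrame({'id': ['T1', 'T2', 'T3', 'T4'],
                       'start': [115, 200, 280, 25],
                       'end': [175, 260, 340, 85]})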
#first sort values (merge_asof requires sorted keys)
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

#create back condition
df_back = (pd.merge_asof(trucks.rename(columns={col: f'back_{col}'
                                                for col in trucks.columns}),
                         cars.assign(back_end=lambda x: x['end']),
                         on='back_end', direction='forward')
             .query('end > back_end')
             .assign(back_distance=lambda x: x['start'] - x['back_end']))
#create across condition: note that cars is the first of the two dataframes here
df_across = (pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                           trucks.rename(columns={col: f'across_{col}'
                                                  for col in trucks.columns}),
                           on='across_start', direction='backward')
               .query('end <= across_end'))
#create front condition
df_front = (pd.merge_asof(trucks.rename(columns={col: f'front_{col}'
                                                 for col in trucks.columns}),
                          cars.assign(front_start=lambda x: x['start']),
                          on='front_start', direction='backward')
              .query('start < front_start')
              .assign(front_distance=lambda x: x['front_start'] - x['end']))

# merge all back to cars
df_f = (cars.merge(df_back, how='left')
            .merge(df_across, how='left')
            .merge(df_front, how='left'))
and you get
print(df_f)

   id  start  end back_id  back_start  back_end  back_distance  across_start  \
0  C2    110  125      T4        25.0      85.0           25.0           NaN
1  C1    200  215      T1       115.0     175.0           25.0         200.0
2  C3    240  255     NaN         NaN       NaN            NaN         240.0

  across_id  across_end front_id  front_start  front_end  front_distance
0       NaN         NaN       T1        115.0      175.0           -10.0
1        T2       260.0      NaN          NaN        NaN             NaN
2        T2       260.0       T3        280.0      340.0            25.0
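If you want the rows back in id order and only the columns from the expected output, a small follow-up (assuming the df_f built above):

# restore the original car order and the question's column order
df_f = df_f.sort_values('id').reset_index(drop=True)
df_f = df_f[['id', 'start', 'end',
             'back_id', 'back_start', 'back_end', 'back_distance',
             'across_id', 'across_start', 'across_end',
             'front_id', 'front_start', 'front_end', 'front_distance']]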

Related

Process and return data from a group of a group

I have a pandas dataframe with 2 categorical and 2 numeric variables.
ID  Trimester  State  Tax  rate
45  T1         NY     20   0.25
23  T3         FL     34   0.3
35  T2         TX     45   0.6
I would like to get a new table of the form:
ID  Trimester  State  Tax  rate  Tax_per_state_per_trimester
45  T1         NY     20   0.25  H
23  T3         FL     34   0.3   L
35  T2         TX     45   0.6   M
where the new variable 'Tax_per_state_per_trimester' is a categorical variable representing the tertiles of the corresponding subgroup, where L = first tertile, M = second tertile, H = last tertile.
I understand I can do a double grouping with:
df.groupby(['State', 'Trimester'])
but I don't know how to go from there.
I guess apply or transform with the quantile function should prove useful, but how?
Can you take a look and see if this gives you the results you want?
df = pd.read_excel('Tax.xlsx')

def mx(tri, state):
    # max Tax within the (Trimester, State) subgroup of the current row
    return df[(df['Trimester'].eq(tri)) & (df['State'].eq(state))] \
        .groupby(['Trimester', 'State'])['Tax'].apply(max).iloc[0]

for i, v in df.iterrows():
    t = v['Tax'] / mx(v['Trimester'], v['State'])
    df.loc[i, 'Tax_per_state_per_trimester'] = 'L' if t < 1/3 else 'M' if t < 2/3 else 'H'
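If the row-by-row loop gets slow on a bigger table, here is a vectorized sketch of the same ratio-to-group-max logic (assuming the column names shown above; note this mirrors the loop's thirds-of-the-max binning rather than true quantile tertiles):

# broadcast each (State, Trimester) group's max Tax back to every row
gmax = df.groupby(['State', 'Trimester'])['Tax'].transform('max')

# bin the ratio into thirds: L = bottom third, M = middle, H = top
ratio = df['Tax'] / gmax
df['Tax_per_state_per_trimester'] = pd.cut(
    ratio, bins=[0, 1/3, 2/3, 1], labels=['L', 'M', 'H'], include_lowest=True)

For quantile-based tertiles per subgroup, df.groupby(['State', 'Trimester'])['Tax'].rank(pct=True) would be the direction to explore instead of the ratio.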

Merge rows with index +-1 of the current row

I have quite an interesting question: I am trying to merge rows that are too close to each other. Obviously "too close" depends on what you want it to be, but what I want is to merge rows that are within +-1 row of another. I have this dataframe:
index Händelse Time Fuel level (%) Km driven (km) Difference (%)
61 Bränslenivåökning vid stillastående 20210601 100 1325217 73
124 Bränslenivåökning vid stillastående 20210601 93 1325708 63
125 Position 20210601 97 1325708 4
126 Position 20210601 100 1325720 3
176 Bränslenivåökning vid stillastående 20210602 100 1326038 46
234 Bränslenivåökning vid stillastående 20210603 90 1326528 56
235 Position 20210603 96 1326528 6
236 Position 20210603 100 1326540 4
301 Bränslenivåökning vid stillastående 20210603 100 1327019 77
360 Position 20210603 42 1327510 9
361 Bränslenivåökning vid stillastående 20210603 92 1327510 50
362 Position 20210604 100 1327513 8
436 Bränslenivåökning vid stillastående 20210604 100 1328013 72
499 Bränslenivåökning vid stillastående 20210606 87 1328504 57
500 Position 20210606 98 1328506 11
501 Position 20210606 100 1328516 2
...
As you can see from the index, there are multiple occurrences where rows are followed by another one with a very small time difference (the data is gathered at a 10-minute interval, which is not visible in the Time column but can be inferred from the index; for example rows 124, 125 and 126 are close to each other). Because of the small time difference, I would like to sum the "Difference" column for these rows, but not "Km driven", "Fuel level" or "Time". In conclusion, taking 124, 125 and 126 as an example, I would like the output to be:
index Händelse Time Fuel level (%) Km driven (km) Difference (%)
126 Bränslenivåökning vid stillastående 20210601 100 (from 126) 1325710 (126) 70 (124, 125, 126)
To quickly explain what is happening in the data: there are different time stamps at which a change in the fuel tank takes place, which lets the analyst assume a refueling process is taking place. However, sometimes these refueling processes take longer than my time interval, so one refueling is noted as 3 different positive changes in the fuel tank (like rows 124, 125, 126). Also, I can't change the time interval.
Hopefully, this was enough information. Thank you in advance!
CURRENT CODE
from tkinter import Tk  # for Python 3.x
from tkinter.filedialog import askopenfilename
import pandas as pd

Tk().withdraw()
filepathname1 = askopenfilename()
filepathname2 = askopenfilename()
print("You have chosen to mix", filepathname1, "and", filepathname2)
pd.set_option("display.max_rows", None, "display.max_columns", 10)

df1 = pd.read_excel(
    filepathname1, "CWA107 Event", na_values=["NA"], skiprows=1, usecols="A, B, D, E, F"
)
df2 = pd.read_excel(
    filepathname2,
    na_values=["NA"],
    skiprows=1,
    usecols=["Tankad mängd diesel", "Unnamed: 3"],
)

# Difference between consecutive fuel levels
df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()

# Renames the time column so that the frames match
df2.rename(columns={"Unnamed: 3": "Tid"}, inplace=True)

# Drop NaN
df2.dropna(inplace=True)
df1.dropna(inplace=True)

# Filters out the rows with a difference smaller than 2
# (.copy() avoids SettingWithCopyWarning on the assignments below)
df1filt = df1[df1["Difference (%)"] >= 2].copy()
print(len(df1filt))

# Converts the time column to year, month and date only
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
print(df1filt)

df1filt.reset_index(level=0, inplace=True)
filepathname3 = askopenfilename()
df1filt.to_excel(filepathname3, index=False)
input()
So I solved this problem by creating a new column based on the difference in the row column (formerly the index column in the dataframe above). If the difference to the previous row is more than 1, the value is set to 0. With either 1 or 0 in the 'Match' column I know which rows to merge and which not to.
If the value is 0, the actual amount refueled is the value of the current row (it is not merged with other rows) and the row is marked as summed. If the value is between 1 and 4, the values are added up. This repeats until a row with the value 0 is hit, which marks the previous run as "SUMMED".
Here is the code (feel free to make changes):
df1filt["Match"] = df1filt["row"]
df1filt["Match"] = df1filt.loc[:, "row"].diff()
df1filt['Match'].values[df1filt['Match'].values > 1] = 0
ROWRANGE = len(df1filt)+1
thevalue = 0
for currentrow in range(ROWRANGE-1):
if df1filt.loc[currentrow, 'Match'] == 0.0:
df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
df1filt.loc[currentrow-1,'Match'] = "SUMMED"
thevalue = df1filt.loc[currentrow, 'Difference (%)']
if df1filt.loc[currentrow, 'Match'] >= 1.0 and df1filt.loc[currentrow, 'Match'] <= 4:
thevalue += df1filt.loc[currentrow, 'Difference (%)']
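As a vectorized alternative sketch (assuming the 'row' column and the column names used in the code above; adjust the headers to your actual file), runs of consecutive row numbers can be grouped with a cumulative sum and aggregated in one pass:

# a new group starts wherever the row number is not exactly previous + 1
groups = df1filt["row"].diff().gt(1).cumsum()

merged = df1filt.groupby(groups).agg({
    "Tid": "last",               # keep the last timestamp of the run
    "Bränslenivå (%)": "last",   # fuel level after the run
    "Difference (%)": "sum",     # sum the refueled percentage across the run
})

Any other columns you want to keep unchanged can be added to the dictionary with "last" as well.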

How do I calculate an average of a range from a series within a dataframe?

I'm new to Python and to data manipulation.
I have a dataframe
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you can observe above, some of the lifespans are given as a range like 14--16. The datatype of ['Lifespan'] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers, i.e. 15. I do not want any ranges, just the average as a single number. How do I do this?
Using split and expand=True
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
                   'Lifespan': [12, '14--15']})

df['Lifespan'] = (df['Lifespan']
                  .astype(str).str.split('--', expand=True)
                  .astype(float).mean(axis=1))

df
#   Breed  Lifespan
# 0  Dog1      12.0
# 1  Dog2      14.5
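To see why mean(axis=1) works here, look at the intermediate frame that expand=True produces: one column per split part, with None filling rows that had no range (illustrated on a fresh toy Series):

raw = pd.Series([12, '14--15']).astype(str)
raw.str.split('--', expand=True)
#     0     1
# 0  12  None
# 1  14    15
# astype(float) turns None into NaN, and mean(axis=1) skips NaN,
# so single values pass through unchanged (as floats).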

pandas group by multiple columns and remove rows based on multiple conditions

I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a csv dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are both less than 10, remove the second line.
For example, from the first two lines I want to delete the second line, similarly from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr group by, lag and lead functions. However, I am not sure how to combine different functions in python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
Not sure what function I should call within transform, or how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the conditions through the apply function:
#df.groupby(['imagename', 'brandname'], group_keys=False).apply(
#    lambda x: x.iloc[range(0, len(x), 2)] if x['xdiff'].lt(10).any() else x)
df.groupby(['imagename', 'brandname'], group_keys=False).apply(
    lambda x: x.iloc[range(0, len(x), 2)]
    if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any())
    else x)
Out:
imagename locationName brandname x y w h xdiff ydiff
2 95-20180407-215120-235505-00050.jpg Shirt DHFL 3 450 94 45 2 -41
5 95-20180407-215120-235505-00050.jpg Shirt DHFL 446 349 99 90 279 30
7 95-20180407-215120-235505-00050.jpg Shirt GOIBIBO 559 212 70 106 104 -130
0 95-20180407-215120-235505-00050.jpg Shirt SAMSUNG 0 490 177 82 0 0
4 95-20180407-215120-235505-00050.jpg DUGOUT VIVO 167 319 36 38 162 -132
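Since you mention dplyr's lag and lead, here is an alternative sketch that stays closer to that idiom (not the approach above): flag a row for removal when it is not the first row of its (imagename, brandname) group and both diffs are below 10:

# rows after the first within each (imagename, brandname) group
follower = df.duplicated(subset=['imagename', 'brandname'])

# drop followers whose xdiff and ydiff are both below 10
out = df[~(follower & df['xdiff'].lt(10) & df['ydiff'].lt(10))]

On the sample above this appears to keep exactly the five rows of the expected output; note it drops every non-first small-diff row of a group, not just the second.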

Pandas iterating in range. Same number twice?

I've written this code and my output is not quite as expected. It seems that the for loop runs through the first iteration twice, then misses the second and jumps straight to the third. I cannot see where I have gone wrong, so could someone point out the error? Thank you!
Code below:
i = 0
df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
df_Entry = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
df_Entry.rename(index={1: 'T' + str(df_z['Turn Number'][i])}, inplace=True)

for i in range(len(df_z)):
    df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
    df_Entry2 = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
    df_Entry2.rename(index={1: 'T' + str(df_z['Turn Number'][i])}, inplace=True)
    df_Entry = pd.concat([df_Entry, df_Entry2])
df_z is an excel document with data like this:
Turn Number Entry Exit
0 1 321 441
1 2 893 1033
2 3 1071 1184
3 4 1234 1352
4 5 2354 2454
5 6 2464 2554
6 7 2574 2689
7 8 2955 3120 ... and so on
Then df1 is a massive DataFrame with 30 columns and 10's of thousands of rows (hence the mean and std).
My Output should be:
tLap
mean std
BCornerEntry
T1 6.845490 0.591227
T2 14.515195 0.541967
T3 19.598690 0.319181
T4 21.555500 0.246757
T5 34.980000 0.518170
T6 37.245000 0.209284
T7 40.220541 0.322800 ... and so on
However I get this:
tLap
mean std
BCornerEntry
T1 6.845490 0.591227
T1 6.845490 0.591227
T3 19.598690 0.319181
T4 21.555500 0.246757
T5 34.980000 0.518170
T6 37.245000 0.209284
T7 40.220541 0.322800 ... and so on
T2 is still T1 and the numbers are the same. What have I done wrong? Any help would be greatly appreciated!
Instead of range(len(df_z)), try using:
for i in range(1, len(df_z)):
...
as range starts at 0, and the i = 0 case has already been handled before the for loop (which is why it appears twice).
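Alternatively, a refactor sketch that removes the duplicated pre-loop block entirely (same variable names as in the question): collect each per-turn frame in a list and concatenate once at the end:

import numpy as np
import pandas as pd

frames = []
for i in range(len(df_z)):
    df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
    entry = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
    entry.rename(index={1: 'T' + str(df_z['Turn Number'][i])}, inplace=True)
    frames.append(entry)

df_Entry = pd.concat(frames)  # one concat instead of growing the frame each pass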
