Create a new column based on several lookup tables in Python Pandas

I have a large pandas dataframe (df_orig) and several lookup tables (also dataframes) that correspond to each of the segments in df_orig.
Here's a small subset of df_orig:
segment score1 score2
B3 0 700
B1 0 120
B1 400 950
B1 100 220
B1 200 320
B1 650 340
B5 300 400
B5 0 320
B1 0 240
B1 100 360
B1 940 700
B3 100 340
And here's a lookup table in its entirety for segment B5 called thresholds_b5 (there is a lookup table for each segment in the large dataset):
score1 score2
990 220
980 280
970 200
960 260
950 260
940 200
930 240
920 220
910 220
900 220
850 120
800 220
750 220
700 120
650 200
600 220
550 220
500 240
400 240
300 260
200 300
100 320
0 400
I want to create a new column in my large dataset that is analogous to this SQL logic:
case
    when segment = 'B5' then
        case
            when score1 = 990 and score2 >= 220 then 1
            when score1 = 980 and score2 >= 280 then 1
            .
            .
            .
            else 0
        end
    when segment = 'B1' then
        .
        .
        .
    else 0
end as indicator
I was able to get the correct output using a loop based on the solution to this question:
df_b5 = df_orig[df_orig.loc[:, 'segment'] == 'B5'].copy()
for i in range(len(thresholds_b5)):
    value1 = thresholds_b5.iloc[i, 0]
    value2 = thresholds_b5.iloc[i, 1]
    df_b5.loc[(df_b5['score1'] == value1) & (df_b5['score2'] >= value2), 'indicator'] = 1
However, I'd need another loop to run this for each segment and then append all of the resultant dataframes back together, which is a bit messy. Furthermore, while I only have three segments (B1,B3,B5) for now, I'm going to have 20+ segments in the future.
Is there a way to do this more succinctly and preferably without loops? I've been warned that loops over dataframes tend to be slow and given the size of my dataset I think speed will matter.
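For reference, a minimal sketch of the segment-by-segment loop described above might look like this (the dict of per-segment lookup tables and the names thresholds_b1/thresholds_b3 are assumptions for illustration, not from the original post):
import pandas as pd

# hypothetical dict mapping each segment name to its lookup table
lookups = {'B1': thresholds_b1, 'B3': thresholds_b3, 'B5': thresholds_b5}

pieces = []
for seg, thresholds in lookups.items():
    part = df_orig[df_orig['segment'] == seg].copy()
    part['indicator'] = 0
    for _, row in thresholds.iterrows():
        hit = (part['score1'] == row['score1']) & (part['score2'] >= row['score2'])
        part.loc[hit, 'indicator'] = 1
    pieces.append(part)

df_with_indicator = pd.concat(pieces)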

If you are OK with sorting the DataFrames ahead of time, then you can replace your loop example with the asof join added in pandas 0.19:
import pandas as pd

# select the segment
df_b5 = df_orig.query('segment == "B5"').copy()
# sort both frames by the asof key ahead of time
df_b5 = df_b5.sort_values('score2')
thresholds_b5 = thresholds_b5.sort_values('score2')
# set the default indicator to 1
thresholds_b5['indicator'] = 1
# join the tables
df = pd.merge_asof(df_b5, thresholds_b5, on='score2', by='score1')
# fill missing indicators with 0
df['indicator'] = df['indicator'].fillna(0).astype('int64')
This is what I got:
segment score1 score2 indicator
0 B5 0 320 0
1 B5 300 400 1
If you need the original order, then save the index in a new column of df_orig and then resort the final DataFrame by that.
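A minimal sketch of that, reusing the question's df_orig and thresholds_b5 (the helper column name orig_order is illustrative):
import pandas as pd

# remember the original row order in a helper column before sorting
df_b5 = df_orig.query('segment == "B5"').reset_index().rename(columns={'index': 'orig_order'})

df_b5 = df_b5.sort_values('score2')
thr_b5 = thresholds_b5.sort_values('score2').assign(indicator=1)

df = pd.merge_asof(df_b5, thr_b5, on='score2', by='score1')
df['indicator'] = df['indicator'].fillna(0).astype('int64')

# restore the original order and drop the helper column
df = df.sort_values('orig_order').drop(columns='orig_order')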
pandas 0.19.2 added support for multiple columns in the by parameter, so you could concat all of your threshold tables with the segment column set for each one and then invoke:
pd.merge_asof(df_orig, thresholds, on='score2', by=['segment', 'score1'])
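One possible way to build that combined thresholds table, sketched under the assumption that the per-segment tables are named thresholds_b1, thresholds_b3 and thresholds_b5:
import pandas as pd

# tag each per-segment lookup table with its segment, stack them, and mark matches with 1
thresholds = pd.concat([
    thresholds_b1.assign(segment='B1'),
    thresholds_b3.assign(segment='B3'),
    thresholds_b5.assign(segment='B5'),
]).assign(indicator=1)

# merge_asof needs both frames sorted by the asof key
df_sorted = df_orig.sort_values('score2')
thresholds = thresholds.sort_values('score2')

result = pd.merge_asof(df_sorted, thresholds, on='score2', by=['segment', 'score1'])
result['indicator'] = result['indicator'].fillna(0).astype('int64')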

Related

extract the remaining data from a dataset after filtering some data in Python

Suppose I have a dataframe like this:
Length Width Height
100 150 130
120 140 150
200 170 120
250 190 180
270 200 195
Now I want to filter the data on three columns: Length between 100-200, Width between 130-170, and Height between 120-150. If I extract the data like this, it will give me a dataframe like:
Length Width Height
100 150 130
120 140 150
200 170 120
and the remaining data from the dataframe would be:
Length Width Height
250 190 180
270 200 195
Now, I want to get both the filtered data and the remaining data from the dataset in separate variables. How can I do that in Python?
IIUC, use a boolean mask combining multiple between with &, and index difference:
mask = (df['Length'].between(100, 200)
        & df['Width'].between(130, 170)
        & df['Height'].between(120, 150))
df1 = df[mask]
df2 = df.loc[df.index.difference(df1.index)]
# or
# df2 = df[~mask]
output:
# df1
Length Width Height
0 100 150 130
1 120 140 150
2 200 170 120
# df2
Length Width Height
3 250 190 180
4 270 200 195

Calculating Distances and Filtering and Summing Values Pandas

I currently have data which contains a location name, latitude, longitude and a number value associated with each location. The final goal would be to get a dataframe that has, for each location, the sum of the values of the locations within specific distance ranges. A sample dataframe is below:
IDVALUE,Latitude,Longitude,NumberValue
ID1,44.968046,-94.420307,1
ID2,44.933208,-94.421310,10
ID3,33.755787,-116.359998,15
ID4,33.844843,-116.54911,207
ID5,44.92057,-93.44786,133
ID6,44.240309,-91.493619,52
ID7,44.968041,-94.419696,39
ID8,44.333304,-89.132027,694
ID9,33.755783,-116.360066,245
ID10,33.844847,-116.549069,188
ID11,44.920474,-93.447851,3856
ID12,44.240304,-91.493768,189
First, I managed to get the distances between each pair of locations using the haversine metric. Using the code below, I converted the lat/longs to radians and then created a matrix where the diagonal is set to infinity.
import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric  # from sklearn.metrics in newer scikit-learn versions

df_latlongs['LATITUDE'] = np.radians(df_latlongs['LATITUDE'])
df_latlongs['LONGITUDE'] = np.radians(df_latlongs['LONGITUDE'])
dist = DistanceMetric.get_metric('haversine')
latlong_df = pd.DataFrame(dist.pairwise(df_latlongs[['LATITUDE', 'LONGITUDE']].to_numpy()) * 6373,
                          columns=df_latlongs.IDVALUE.unique(), index=df_latlongs.IDVALUE.unique())
np.fill_diagonal(latlong_df.values, np.inf)
This distance matrix is then in kilometres. What I'm struggling with next is to be able to filter the distances of each of the locations and get the total number of values within a range and link this to the original dataframe.
Below is the code I have used to filter the distance matrix to get all of the locations within 500 meters:
latlong_df_rows = latlong_df[latlong_df < 0.5]
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=0)
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=1)
My attempt was to then get, for each location, a list of the locations within this range, using the code below:
within_range_df = latlong_df_rows.apply(lambda row: row[row < 0.05].index.tolist(), axis=1)
within_range_df = within_range_df.to_frame()
within_range_df = within_range_df.dropna(how='all', axis=0)
within_range_df = within_range_df.dropna(how='all', axis=1)
From here I was going to try to get the NumberValue from the original dataframe by looping through each list of IDs to obtain the number for each of those locations, and then sum them (a rough sketch of this loop idea appears after the desired table below). The final dataframe would ideally look like the following:
IDVALUE,<500m,500-1000m,>1000m
ID1,x1,y1,z1
ID2,x2,y2,z2
ID3,x3,y3,z3
ID4,x4,y4,z4
ID5,x5,y5,z5
ID6,x6,y6,z6
ID7,x7,y7,z7
ID8,x8,y8,z8
ID9,x9,y9,z9
ID10,x10,y10,z10
ID11,x11,y11,z11
ID12,x12,y12,z12
Where x, y and z are the total number values of the nearby locations in the different distance bands. I know this is probably really weird and overcomplicated, so any tips to change the question, or anything else that is needed, I'll be happy to provide. Cheers
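For reference, a rough sketch of the loop/lookup idea described above, reusing the latlong_df distance matrix and df_latlongs from the question's code (purely illustrative, not a tested solution):
# map each IDVALUE to its NumberValue
value_lookup = df_latlongs.set_index('IDVALUE')['NumberValue']

# for each location, sum the values of all other locations within 0.5 km
within_500m_sum = latlong_df.apply(
    lambda row: value_lookup[row.index[row < 0.5]].sum(), axis=1
)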
I would define a helper function, making use of BallTree, e.g.
from sklearn.neighbors import BallTree
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv')
We use query_radius() to get the IDs, and a list comprehension to get the values and sum them:
locations_radians = np.radians(df[["Latitude","Longitude"]].values)
tree = BallTree(locations_radians, leaf_size=12, metric='haversine')
def summed_numbervalue_for_radius(radius_in_m=100):
    distance_in_meters = radius_in_m
    earth_radius = 6371000
    radius = distance_in_meters / earth_radius
    ids_within_radius = tree.query_radius(locations_radians, r=radius, count_only=False)
    values_as_array = np.array(df.NumberValue)
    summed_values = [values_as_array[ix].sum() for ix in ids_within_radius]
    return np.array(summed_values)
With the helper function you can do, for instance:
df = df.assign( sum_100=summed_numbervalue_for_radius(100))
df = df.assign( sum_500=summed_numbervalue_for_radius(500))
df = df.assign( sum_1000=summed_numbervalue_for_radius(1000))
df = df.assign( sum_1000_to_5000=summed_numbervalue_for_radius(5000)-summed_numbervalue_for_radius(1000))
Will give you
IDVALUE Latitude Longitude NumberValue sum_100 sum_500 sum_1000 \
0 ID1 44.968046 -94.420307 1 40 40 40
1 ID2 44.933208 -94.421310 10 10 10 10
2 ID3 33.755787 -116.359998 15 260 260 260
3 ID4 33.844843 -116.549110 207 395 395 395
4 ID5 44.920570 -93.447860 133 3989 3989 3989
5 ID6 44.240309 -91.493619 52 241 241 241
6 ID7 44.968041 -94.419696 39 40 40 40
7 ID8 44.333304 -89.132027 694 694 694 694
8 ID9 33.755783 -116.360066 245 260 260 260
9 ID10 33.844847 -116.549069 188 395 395 395
10 ID11 44.920474 -93.447851 3856 3989 3989 3989
11 ID12 44.240304 -91.493768 189 241 241 241
sum_1000_to_5000
0 10
1 40
2 0
3 0
4 0
5 0
6 10
7 0
8 0
9 0
10 0
11 0
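If you want the exact bands from the desired table (<500m, 500-1000m, >1000m), a possible follow-up using the same helper (the new column names are illustrative):
total = df['NumberValue'].sum()

df = df.assign(
    under_500m=summed_numbervalue_for_radius(500),
    band_500_to_1000m=summed_numbervalue_for_radius(1000) - summed_numbervalue_for_radius(500),
    over_1000m=total - summed_numbervalue_for_radius(1000),
)
Note that summed_numbervalue_for_radius includes the location's own NumberValue, so under_500m counts it while the other two bands do not.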

How can I group multiple columns and sum the last one?

I have this problem which I've been trying to solve:
I want the code to take this DataFrame and group multiple columns based on the most frequent number and sum the values on the last column. For example:
import pandas as pd

df = pd.DataFrame({'A': [1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
                   'B': [380, 380, 270, 270, 270, 45, 45, 45, 55],
                   'C': [380, 380, 270, 270, 270, 88, 88, 88, 88],
                   'D': [45, 32, 67, 89, 51, 90, 90, 90, 90]})
df
A B C D
0 1000 380 380 45
1 1000 380 380 32
2 1000 270 270 67
3 1000 270 270 89
4 1000 270 270 51
5 200 45 88 90
6 200 45 88 90
7 500 45 88 90
8 500 55 88 90
I would like the code to show the result below:
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Notice that the most frequent value in the first rows is 1000, so I group by column 'A' and get the sum 284 in column 'D'. However, in the last rows the most frequent number, 88, is not in column 'A' but in column 'C'. There I want to sum the values in column 'D' by grouping on column 'C' to get 360. I am not sure if I made myself clear.
I tried df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('sum'), but it does not produce the desired result shown above.
Is there any pandas-style way of resolving this? Thanks in advance!
Code
import numpy as np

def get_count_sum(col, func):
    return df.groupby(col).D.transform(func)

ga = get_count_sum('A', 'count')
gb = get_count_sum('B', 'count')
gc = get_count_sum('C', 'count')

conditions = [
    (ga > gb) & (ga > gc),
    (gb > ga) & (gb > gc),
    (gc > ga) & (gc > gb),
]
choices = [get_count_sum('A', 'sum'),
           get_count_sum('B', 'sum'),
           get_count_sum('C', 'sum')]

df['D'] = np.select(conditions, choices)
df
Output
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Explanation
Since we need to group by whichever of columns 'A', 'B' or 'C' holds the most frequently repeated number, we first compute the per-column group sizes and store the groupby output in ga, gb and gc for columns A, B and C respectively.
conditions checks, row by row, which column has the most frequent number.
According to the conditions, choices supplies the corresponding group sums.
np.select works like if-elif-else: for each row it returns the choice belonging to the first condition that is true.
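As a tiny illustration of that if-elif-else behaviour (toy data, unrelated to the question):
import numpy as np

x = np.array([5, 15, 25])
conditions = [x < 10, x < 20]
choices = ['small', 'medium']

# the first matching condition wins; default is used when none match
print(np.select(conditions, choices, default='large'))
# ['small' 'medium' 'large']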

Using pandas to identify nearest objects

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience with them and thought it would be a good learning experience. I was able to complete the assignment using the plain loops I know from conventional programming, and it ran okay over thousands of rows, but it brought my laptop to a screeching halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
id start end
0 C1 200 215
1 C2 110 125
2 C3 240 255
...
trucks
id start end
0 T1 115 175
1 T2 200 260
2 T3 280 340
3 T4 25 85
...
In the two dataframes above, the start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)
I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. For the sample data above, we can observe the following:
C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output needs to be appended to the cars dataframe along with the following new columns:
data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)
The final cars dataframe should look like this:
id start end back_id back_start back_end back_distance across_id across_start across_end front_id front_start front_end front_distance
0 C1 200 215 T1 115 175 25 T2 200 260
1 C2 110 125 T4 25 85 25 T1 115 175 -10
2 C3 240 255 T2 200 260 T3 280 340 25
Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.
With pandas you can use merge_asof; here is one way, though it may not be efficient with millions of rows:
# first sort values
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

# create back condition
df_back = pd.merge_asof(trucks.rename(columns={col: f'back_{col}' for col in trucks.columns}),
                        cars.assign(back_end=lambda x: x['end']),
                        on='back_end', direction='forward')\
            .query('end > back_end')\
            .assign(back_distance=lambda x: x['start'] - x['back_end'])

# create across condition: note that cars is the first of the 2 dataframes here
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col: f'across_{col}' for col in trucks.columns}),
                          on='across_start', direction='backward')\
              .query('end <= across_end')

# create front condition
df_front = pd.merge_asof(trucks.rename(columns={col: f'front_{col}' for col in trucks.columns}),
                         cars.assign(front_start=lambda x: x['start']),
                         on='front_start', direction='backward')\
             .query('start < front_start')\
             .assign(front_distance=lambda x: x['front_start'] - x['end'])

# merge all back to cars
df_f = cars.merge(df_back, how='left')\
           .merge(df_across, how='left')\
           .merge(df_front, how='left')
and you get
print (df_f)
id start end back_id back_start back_end back_distance across_start \
0 C2 110 125 T4 25.0 85.0 25.0 NaN
1 C1 200 215 T1 115.0 175.0 25.0 200.0
2 C3 240 255 NaN NaN NaN NaN 240.0
across_id across_end front_id front_start front_end front_distance
0 NaN NaN T1 115.0 175.0 -10.0
1 T2 260.0 NaN NaN NaN NaN
2 T2 260.0 T3 280.0 340.0 25.0

pandas group by multiple columns and remove rows based on multiple conditions

I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a csv dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are less than 10, remove the second line.
For example, from the first two data lines I want to delete the second line; similarly, from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr's group_by, lag and lead functions. However, I am not sure how to combine different functions in Python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
Not sure what function should I call within transform and how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the conditions through the apply function:
#df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if x['xdiff'].lt(10).any() else x)
df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any()) else x)
Out:
imagename locationName brandname x y w h xdiff ydiff
2 95-20180407-215120-235505-00050.jpg Shirt DHFL 3 450 94 45 2 -41
5 95-20180407-215120-235505-00050.jpg Shirt DHFL 446 349 99 90 279 30
7 95-20180407-215120-235505-00050.jpg Shirt GOIBIBO 559 212 70 106 104 -130
0 95-20180407-215120-235505-00050.jpg Shirt SAMSUNG 0 490 177 82 0 0
4 95-20180407-215120-235505-00050.jpg DUGOUT VIVO 167 319 36 38 162 -132
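For comparison, a sketch closer to the dplyr lag/lead idea mentioned in the question: keep a row unless both diffs are under 10 and it is not the first row of its (imagename, brandname) group. This is an assumption about the intended rule, not part of the answer above, but it matches the expected output listed in the question:
small_step = df['xdiff'].lt(10) & df['ydiff'].lt(10)
not_first_in_group = df.groupby(['imagename', 'brandname']).cumcount().gt(0)

filtered = df[~(small_step & not_first_in_group)]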
