I have two Pandas DataFrames that look like this. Trying to join the two data sets on 'Name','Longitude', and 'Latitude' but using a fuzzy/approximate match. Is there a way to join these together using a combination of the 'Name' strings being, for example, at least an 80% match and the 'Latitude' and 'Longitude' columns being the nearest value or within like 0.001 of each other? I tried using pd.merge_asof but couldn't figure out how to make it work. Thank you for the help!
import pandas as pd
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
merge_asof won't work here since it can only merge on a single numeric column, such as datetimelike, integer, or float (see doc).
Here you can compute the (Euclidean) distance between the coordinates of df1 and df2 and pick the best match:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
# Replacing 'Latitude' and 'Longitude' columns with a 'Coord' Tuple
df1['Coord'] = df1[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df1.drop(columns=['Latitude', 'Longitude'], inplace=True)
df2['Coord'] = df2[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df2.drop(columns=['Latitude', 'Longitude'], inplace=True)
# Creating a distance matrix between df1['Coord'] and df2['Coord']
distances_df1_df2 = cdist(df1['Coord'].to_list(), df2['Coord'].to_list())
# Creating df1['Price'] column from df2 and the distance matrix
for i in df1.index:
    # you can replace the following lines with a loop over distances_df1_df2[i]
    # and reject names that are too far from each other
    min_dist = np.amin(distances_df1_df2[i])
    if min_dist > 0.001:
        continue
    closest_match = np.argmin(distances_df1_df2[i])
    # df1.loc[i, 'df2_Name'] = df2.loc[closest_match, 'Name']  # keep track of the merged row
    df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']
print(df1)
Output:
Name Rating Coord Price
0 Game Time Bar 4.5 (42.3734, -71.1204) $$
1 Sports Grill 4.6 (42.3739, -71.1214) $
2 Sports Grill 2 4.3 (42.3839, -71.1315) $$
Edit: your requirement on 'Name' ("at least an 80% match") isn't well defined on its own, since string similarity can be scored in several ways. Take a look at fuzzywuzzy to get a sense of how string distances can be measured.
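If you do want to enforce that name threshold on top of the coordinate match, here is a minimal sketch of how the loop above could be extended, assuming the fuzzywuzzy package is installed and using its fuzz.ratio score (0-100) with an 80 cut-off:
from fuzzywuzzy import fuzz

for i in df1.index:
    min_dist = np.amin(distances_df1_df2[i])
    if min_dist > 0.001:
        continue
    closest_match = np.argmin(distances_df1_df2[i])
    # only keep the pair if the names are at least an 80% match
    if fuzz.ratio(df1.loc[i, 'Name'], df2.loc[closest_match, 'Name']) < 80:
        continue
    df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']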
I'm looking for a way of comparing partial numeric values between columns from different dataframes. These columns are filled with something like social security numbers (they can't and won't repeat), so something like a dynamic isin() would be ideal.
These are representations of very large dataframes that I import from csv files.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
print(df1)
print(df2)
df2['Id_number_length']= df2['Id_number'].str.len()
df2.groupby('Id_number_length').count()
count_list = df2.groupby('Id_number_length')[['Id_number_length']].count()
print('count_list:\n', count_list)
df1['S_number'] = pd.to_numeric(df1['S_number'], downcast='integer')
df2['Id_number'] = pd.to_numeric(df2['Id_number'], downcast='integer')
inner_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='inner')
print('MATCH!:\n', inner_join)
outer_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='outer', indicator=True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)
print('UNMATCHED:\n', anti_join)
What I need to get is something like the following as a result of the inner join (or whatever method works):
df3 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"],
                    "Id_number": ["027160", "60078", "342964", "763", "1544", "5303", "973637", "14452", "9930", "4205"]})
print('MATCH!:\n', df3)
I thought that something like this (very crude) pseudocode would work, using count_list to strip parts of the numbers of df1 so they fully match df2 instead of partially matching (notice that in df2 the missing or added digits are always at the beginning or the end):
for i in count_list:
    if i == 6:
        try inner join
        except empty output
    elif i == 5:
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[1:]
            inner join with df2
        except empty output
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[:-1]
            inner join with df2
    elif i == 4:
        same as above...
But the lengths in count_list are variable, so this for loop is an inefficient way to do it.
Any help with this will be very appreciated, I've been stuck with this for days. Thanks in advance.
You can 'explode' each line of df1 into up to 45 lines. For example, SSN 123456789 can be mapped to [1,2,3...9,12,23,34,45..89,...12345678,23456789,123456789] (every contiguous substring). While this looks wasteful, from an algorithmic standpoint it is O(1) per row and therefore O(N) in total.
Using this new column as the key, a simple merge can then combine the two DataFrames, which is usually O(N log N).
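A minimal sketch of that idea using the df1/df2 from the question (note this direction only catches Id_numbers that are contiguous substrings of an S_number; padded entries such as "0271600" would need the same explode applied to df2 as well):
import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})

def substrings(s):
    # all contiguous substrings of s: n*(n+1)/2 of them, i.e. 45 for a 9-digit number
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

# explode each S_number into one row per substring, then merge on that key
exploded = df1.assign(key=df1["S_number"].apply(substrings)).explode("key")
matches = exploded.merge(df2, left_on="key", right_on="Id_number")[["S_number", "Id_number"]]
print(matches)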
Here is an example of what I would do. I hope I've understood correctly. Feel free to ask if it's not clear.
import pandas as pd
import joblib
from joblib import Parallel,delayed
# Building the base
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
# Initiate empty list for indexes
IDX = []
# Wrapping the comparison in a function so it can be parallelized if the database is big
def func(x, y):
    if all(c in df2.Id_number[y] for c in df1.S_number[x]):
        return (x, y)
# Using all available processors
number_of_cpu = joblib.cpu_count()
# Preparing the delayed function calls to be parallelized
delayed_funcs = (delayed(func)(x, y) for x in range(len(df1)) for y in range(len(df2)))
# Running it with processes rather than threads
parallel_pool = Parallel(n_jobs=number_of_cpu, prefer="processes")
# Filling the IDX list
IDX.append(parallel_pool(delayed_funcs))
# Dropping the None results
IDX = list(filter(None, IDX[0]))
# Making df3 with the tuples of indexes
df3 = pd.DataFrame(IDX)
# Making it readable
df3['df1'] = df1.S_number[df3[0]].to_list()
df3['df2'] = df2.Id_number[df3[1]].to_list()
df3
Output: df3 holds the matched index pairs together with the corresponding S_number and Id_number values.
How to map the closest values from two dataframes:
I have two dataframes in the format below and I'm looking to map values based on o_lat, o_long from data1 and near_lat, near_lon from data2:
data1 ={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024]}
Here lat, long are the coordinates of the destination, d is the distance between origin and destination, and o_lat, o_long are the coordinates of the origin.
data2={'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'lat':[-37.8185,-37.8126,-37.8099],
'lon':[144.9695,144.9470,144.9952]}
I want to produce another column in data1 which holds the nearest_warehouse, in the following format, based on the closest value:
result={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024],
'nearest_warehouse':['Bakers','Thompson','Nickolson']}
I've tried the following code:
lat_diff = []
long_diff = []
min_distance = []
for i in range(0, 3):
    lat_diff.append(float(warehouse.near_lat[i]) - lat_long_d.o_lat[0])
for j in range(0, 3):
    long_diff.append(float(warehouse.near_lon[j]) - lat_long_d.o_long[0])
min_distance = [min(lat_diff), min(long_diff)]
min_distance
This gives the following result, which is the minimum difference in latitude and the minimum difference in longitude for o_lat=-37.8095 and o_long=145.0000:
[-0.00897867136701791, -0.05300973586690816]
I feel this approach is not viable for mapping the closest values over a large dataset.
I'm looking for a better approach in this regard.
From the first dataframe, you can go through each row with lambda x: and, against all rows of the second dataframe, compute the absolute difference in latitude plus the absolute difference in longitude. This gives a simple distance measure from each origin to every warehouse.
What you are then interested in is the index, i.e. the position of the minimum of that sum for each row, which you can find with idxmin(). In dataframe 1 this returns the index number, which you can merge against the index of dataframe 2 to pull in the closest warehouse:
setup:
data1 = pd.DataFrame({'lat': [-0.659901, -0.659786, -0.659821], 'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050], 'o_long':[145.0000,145.0077,145.0024]})
data2= pd.DataFrame({'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'lat':[-37.818595, -37.812673, -37.809996], 'lon':[144.969551, 144.947069, 144.995232],
'near_lat':[-37.8185,-37.8126,-37.8099], 'near_lon':[144.9695,144.9470,144.9952]})
code:
data1['key'] = data1.apply(lambda x: ((x['o_lat'] - data2['near_lat']).abs()
+ (x['o_long'] - data2['near_lon']).abs()).idxmin(), axis=1)
data1 = pd.merge(data1, data2[['nearest_warehouse']], how='left', left_on='key', right_index=True).drop('key', axis=1)
data1
Out[1]:
lat long d o_lat o_long nearest_warehouse
0 -0.659901 2.530561 0.4202 -37.8095 145.0000 Bakers
1 -0.659786 2.530797 1.0957 -37.8030 145.0077 Bakers
2 -0.659821 2.530587 0.6309 -37.8050 145.0024 Bakers
This result looks accurate if you append the two dataframes into one and do a basic scatter plot. As you can see, the Bakers warehouse is the closest one to all three origin points (the graph IS to scale thanks to the last line of code):
import matplotlib.pyplot as plt
data1 = pd.DataFrame({'o_lat':[-37.8095,-37.8030,-37.8050], 'o_long':[145.0000,145.0077,145.0024],
'nearest_warehouse': ['0','1','2']})
data2= pd.DataFrame({'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'o_lat':[-37.8185,-37.8126,-37.8099], 'o_long':[144.9695,144.9470,144.9952]})
df = pd.concat([data1, data2])  # DataFrame.append was removed in pandas 2.0
y = df['o_lat'].to_list()
z = df['o_long'].to_list()
n = df['nearest_warehouse'].to_list()
fig, ax = plt.subplots()
ax.scatter(z, y)
for i, txt in enumerate(n):
    ax.annotate(txt, (z[i], y[i]))
plt.gca().set_aspect('equal', adjustable='box')
I have two parquet files, which I load with spark.read. These two dataframes have the same column named key, so I join them with:
df = df.join(df2, on=['key'], how='inner')
df's columns are ["key", "Duration", "Distance"] and df2's are ["key", "department id"]. In the end I want to print Duration, max(Distance), and department id, grouped by department id. What I have done so far is:
df.join(df.groupBy('departmentid').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
but I think it is too slow. Is there a faster way to achieve my goal?
Thanks in advance.
EDIT: sample (first 2 lines of each file)
df:
369367789289,2015-03-27 18:29:39,2015-03-27 19:08:28,-73.975051879882813,40.760562896728516,-73.847900390625,40.732685089111328,34.8
369367789290,2015-03-27 18:29:40,2015-03-27 18:38:35,-73.988876342773438,40.77423095703125,-73.985160827636719,40.763439178466797,11.16
df1:
369367789289,1
369367789290,2
Each column is separated by ","; the first column in both files is my key, then I have timestamps, longitudes and latitudes. In the second file I have only the key and the department id.
To create Distance I am using a function called formater. This is how I get my distance and duration:
df = df.filter("_c3!=0 and _c4!=0 and _c5!=0 and _c6!=0")
df = df.withColumn("_c0", df["_c0"].cast(LongType()))
df = df.withColumn("_c1", df["_c1"].cast(TimestampType()))
df = df.withColumn("_c2", df["_c2"].cast(TimestampType()))
df = df.withColumn("_c3", df["_c3"].cast(DoubleType()))
df = df.withColumn("_c4", df["_c4"].cast(DoubleType()))
df = df.withColumn("_c5", df["_c5"].cast(DoubleType()))
df = df.withColumn("_c6", df["_c6"].cast(DoubleType()))
df = df.withColumn('Distance', formater(df._c3,df._c5,df._c4,df._c6))
df = df.withColumn('Duration', F.unix_timestamp(df._c2) -F.unix_timestamp(df._c1))
and then, as I showed above:
df = df.join(vendors, on=['key'], how='inner')
df.registerTempTable("taxi")
df.join(df.groupBy('vendor').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
The output must be
Distance Duration department id
grouped by department id, getting only the row with max(Distance).
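One common way to get exactly one row per group with the maximum Distance is a window function rather than joining back onto the grouped maximum. A minimal sketch, assuming the joined df has Duration, Distance and a column here called "departmentid" (adjust the name to whatever the department id column is called after your join):
from pyspark.sql import functions as F, Window

# rank rows within each department by Distance, largest first, and keep the top row
w = Window.partitionBy("departmentid").orderBy(F.col("Distance").desc())

(df.withColumn("rn", F.row_number().over(w))
   .filter(F.col("rn") == 1)
   .select("Distance", "Duration", "departmentid")
   .show())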
I have a dataframe named SD_Apartments that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of apartment names, and their coordinates.
I have another dataframe named SD_Coffee that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of coffee shop names, and their coordinates.
I want to add another variable to SD_Apartments called coffee_count that would hold the number of coffee shop locations listed in my SD_Coffee dataframe that are within x (for example, 300) meters of each apartment listed in SD_Apartments.
Here is a setup of the code I'm working with:
import pandas as pd
import geopy.distance
from geopy.distance import geodesic
data = [['Insomnia', 32.784782, -117.129130], ['Starbucks', 32.827521, -117.139966], ['Dunkin', 32.778519, -117.154720]]
data1 = [['DreamAPT', 32.822090, -117.184200], ['OKAPT', 32.748081, -117.130691], ['BadAPT', 32.786886, -117.097536]]
SD_Coffee = pd.DataFrame(data, columns = ['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(data1, columns = ['name', 'latitude', 'longitude'])
Here is the code I'm attempting to use to accomplish my goal:
def geodesic_pd(df1, df2_row):
return [(geodesic([tuple(x) for x in row.values], [tuple(x) for x in df2_row.values]).m for row in df1)]
SD_Apartments['coffee_count'] = pd.Series([(sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row) < 300) for row in SD_Apartments[['latitude', 'longitude']])])
If you run it and print SD_Apartments, you will see that SD_Apartments looks like:
name ... coffee_count
0 DreamAPT ... <generator object <genexpr> at 0x000002E178849...
1 OKAPT ... NaN
2 BadAPT ... NaN
This will probably help you:
import pandas as pd
df = pd.DataFrame({'geodesic': [1, 10, 8, 11, 20,2,2],'apartment': list('aaceeee')})
df.nsmallest(3, 'geodesic')
Another way of doing this is by using k-nearest neighbors with the geodesic distance:
SKLearn-KNN
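For reference, here is a minimal sketch of that route. scikit-learn has no true geodesic metric, but its BallTree supports the haversine (great-circle) metric, which is usually close enough at a 300 m radius. It assumes the SD_Coffee and SD_Apartments frames from the question, with coordinates converted to radians:
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000  # mean Earth radius used to convert metres to radians

# BallTree with metric='haversine' expects [lat, lon] in radians
shop_coords = np.radians(SD_Coffee[['latitude', 'longitude']].to_numpy())
apt_coords = np.radians(SD_Apartments[['latitude', 'longitude']].to_numpy())

tree = BallTree(shop_coords, metric='haversine')
# count_only=True returns, for each apartment, the number of shops within the radius
SD_Apartments['coffee_count'] = tree.query_radius(apt_coords, r=300 / EARTH_RADIUS_M, count_only=True)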
Assuming you are using pandas dataframes, you should be able to use something like this unless you have very large arrays -
import numpy as np
from geopy.distance import geodesic

def geodesic_pd(df1, df2_row):
    # geodesic distance in metres from every row of df1 to the single row df2_row
    dist = []
    for _, row in df1.iterrows():
        dist.append(geodesic(tuple(row.values), tuple(df2_row.values)).m)
    return np.array(dist)
SD_Apartments['coffee_count'] = SD_Apartments.apply(lambda row: sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row[['latitude', 'longitude']]) < 300), axis =1)
The geodesic_pd function extends the geodesic calculation from individual coordinate tuples to a whole dataframe, and the next statement calculates the number of coffee shops within 300 meters and stores it in a new column.
If you have large arrays, then you should combine this with KNN so that the exact calculation is only performed over a subset of candidate points, as sketched below.
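A minimal sketch of that combination, assuming the same frames as above: a BallTree with the haversine metric returns candidate shops within a slightly padded radius, and the exact geodesic from the function above is only computed for those candidates:
import numpy as np
from sklearn.neighbors import BallTree
from geopy.distance import geodesic

EARTH_RADIUS_M = 6371000

shop_coords = SD_Coffee[['latitude', 'longitude']].to_numpy()
apt_coords = SD_Apartments[['latitude', 'longitude']].to_numpy()

# cheap spherical pre-filter with a padded radius (310 m) to avoid missing borderline shops
tree = BallTree(np.radians(shop_coords), metric='haversine')
candidates = tree.query_radius(np.radians(apt_coords), r=310 / EARTH_RADIUS_M)

# exact geodesic check only on the candidates returned for each apartment
SD_Apartments['coffee_count'] = [
    sum(geodesic(tuple(apt), tuple(shop_coords[j])).m < 300 for j in idx)
    for apt, idx in zip(apt_coords, candidates)
]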