Count the number of locations within a certain distance - python

I have a dataframe named SD_Apartments that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of apartment names, and their coordinates.
I have another dataframe named SD_Coffee that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of coffee shop names, and their coordinates.
I want to add another variable to SD_Apartments called coffee_count that would hold the number of coffee shop locations listed in my SD_Coffee dataframe that are within x (for example, 300) meters of each apartment listed in SD_Apartments.
Here is a setup of the code I'm working with:
import pandas as pd
import geopy.distance
from geopy.distance import geodesic
data = [['Insomnia', 32.784782, -117.129130], ['Starbucks', 32.827521, -117.139966], ['Dunkin', 32.778519, -117.154720]]
data1 = [['DreamAPT', 32.822090, -117.184200], ['OKAPT', 32.748081, -117.130691], ['BadAPT', 32.786886, -117.097536]]
SD_Coffee = pd.DataFrame(data, columns = ['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(data1, columns = ['name', 'latitude', 'longitude'])
Here is the code I'm attempting to use to accomplish my goal:
def geodesic_pd(df1, df2_row):
    return [(geodesic([tuple(x) for x in row.values], [tuple(x) for x in df2_row.values]).m for row in df1)]

SD_Apartments['coffee_count'] = pd.Series([(sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row) < 300) for row in SD_Apartments[['latitude', 'longitude']])])
If you run it and print SD_Apartments, you will see that SD_Apartments looks like:
       name  ...                                       coffee_count
0  DreamAPT  ...  <generator object <genexpr> at 0x000002E178849...
1     OKAPT  ...                                                NaN
2    BadAPT  ...                                                NaN

This will probably help you:
import pandas as pd
df = pd.DataFrame({'geodesic': [1, 10, 8, 11, 20, 2, 2], 'apartment': list('aaceeee')})
df.nsmallest(3, 'geodesic')
Another way of doing this is to use k-nearest neighbors with the geodesic distance:
SKLearn-KNN

Assuming you are using pandas dataframes, you should be able to use something like this unless you have very large arrays -
import numpy as np
def geodesic_pd(df1, df2_row):
    dist = []
    for _, row in df1.iterrows():
        dist.append(geodesic(tuple(row.values), tuple(df2_row.values)).m)
    return np.array(dist)

SD_Apartments['coffee_count'] = SD_Apartments.apply(
    lambda row: sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row[['latitude', 'longitude']]) < 300),
    axis=1)
The geodesic_pd function extends the geodesic calculation from individual tuples to a whole dataframe, and the next statement counts the coffee shops within 300 meters of each apartment and stores the counts in a new column.
If you have large arrays, you should combine this with KNN so the distance calculation is only performed over a subset of candidate points; a minimal sketch of that idea follows.
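For reference, here is a sketch of that idea using scikit-learn's BallTree with the haversine metric (my choice of index; the answer above only names KNN). Haversine treats the Earth as a sphere, so counts right at the 300 m boundary can differ slightly from geodesic ones:
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000  # mean Earth radius; haversine distances come back on a unit sphere

# BallTree with metric='haversine' expects [lat, lon] in radians
coffee_rad = np.radians(SD_Coffee[['latitude', 'longitude']].to_numpy())
apt_rad = np.radians(SD_Apartments[['latitude', 'longitude']].to_numpy())

tree = BallTree(coffee_rad, metric='haversine')
# count_only=True returns just the number of neighbors within the radius
SD_Apartments['coffee_count'] = tree.query_radius(
    apt_rad, r=300 / EARTH_RADIUS_M, count_only=True)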

Related

Join Pandas DataFrames on Fuzzy/Approximate Matches for Multiple Columns

I have two Pandas DataFrames that look like this. I'm trying to join the two data sets on 'Name', 'Longitude', and 'Latitude', but using a fuzzy/approximate match. Is there a way to join them using a combination of the 'Name' strings being, for example, at least an 80% match, and the 'Latitude' and 'Longitude' columns being the nearest value or within, say, 0.001 of each other? I tried using pd.merge_asof but couldn't figure out how to make it work. Thank you for the help!
import pandas as pd
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
merge_asof won't work here, since it can only merge on a single numeric column, such as datetimelike, integer, or float (see doc).
Here you can instead compute the (euclidean) distance between the coordinates of df1 and df2 and pick up the best match:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
# Replacing 'Latitude' and 'Longitude' columns with a 'Coord' Tuple
df1['Coord'] = df1[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df1.drop(columns=['Latitude', 'Longitude'], inplace=True)
df2['Coord'] = df2[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df2.drop(columns=['Latitude', 'Longitude'], inplace=True)
# Creating a distance matrix between df1['Coord'] and df2['Coord']
distances_df1_df2 = cdist(df1['Coord'].to_list(), df2['Coord'].to_list())
# Creating df1['Price'] column from df2 and the distance matrix
for i in df1.index:
    # you can replace the following lines with a loop over distances_df1_df2[i]
    # and reject names that are too far from each other
    min_dist = np.amin(distances_df1_df2[i])
    if min_dist > 0.001:
        continue
    closest_match = np.argmin(distances_df1_df2[i])
    # df1.loc[i, 'df2_Name'] = df2.loc[closest_match, 'Name']  # keep track of the merged row
    df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']

print(df1)
Output:
             Name  Rating                Coord Price
0   Game Time Bar     4.5  (42.3734, -71.1204)    $$
1    Sports Grill     4.6  (42.3739, -71.1214)     $
2  Sports Grill 2     4.3  (42.3839, -71.1315)    $$
Edit: your requirement on 'Name' ("at least an 80% match") isn't something a merge can express directly. Take a look at fuzzywuzzy to get a sense of how string distances can be measured.
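As a sketch of how that name check could be bolted onto the loop above (assuming the fuzzywuzzy package; fuzz.ratio returns an integer similarity from 0 to 100, so "at least an 80% match" becomes >= 80):
from fuzzywuzzy import fuzz

for i in df1.index:
    min_dist = np.amin(distances_df1_df2[i])
    if min_dist > 0.001:
        continue
    closest_match = np.argmin(distances_df1_df2[i])
    # only merge when the names are also at least an 80% match
    if fuzz.ratio(df1.loc[i, 'Name'], df2.loc[closest_match, 'Name']) >= 80:
        df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']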

How to get nearest match in csv file python

I want to get the nearest match in my big .csv file in Python. My (shortened) .csv file is:
0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0
1,4,5,0,40,1727,0,128,6,29216,62281,22,123196295,3338477204,5,0,26,513,30738,0
0,4,5,0,116,24108,0,64,6,23178,39635,22,21452647,3233437508,8,0,4126,644,61163,0
0,4,5,0,724,32046,0,64,6,14632,38655,22,1452688218,1828171762,8,0,4126,343,31853,0
0,4,5,0,76,26502,0,128,6,4405,50266,22,1776918274,3172205875,5,0,4126,512,9381,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
1,4,5,0,52,939,0,128,6,29992,62206,22,1466629610,0,8,0,44,64240,43460,0
0,4,5,16,76,10076,0,64,6,37199,22,50268,4016221794,718292575,5,0,4126,501,310,0
0,4,5,0,40,26722,0,128,6,4221,50270,22,38340335,3852724687,5,0,26,510,36549,0
0,4,5,0,76,26631,0,128,6,4276,50266,22,1776920362,3172222235,5,0,4126,511,61692,0
0,4,5,16,148,38558,0,64,6,8680,22,37221,2019795091,3598991383,8,0,4126,501,9098,0
0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
0,4,5,16,212,38490,0,64,6,8684,22,37221,2019776067,3598991175,8,0,4126,501,56063,0
0,4,5,0,60,0,0,64,6,47342,22,44751,2722242689,3606442876,10,0,4426,65160,29042,0
0,4,5,16,76,10234,0,64,6,37041,22,50266,3172220319,1776919498,5,0,4126,501,49854,0
1,4,5,0,1016,1737,0,128,6,28230,62273,22,3387237183,3449598142,5,0,4126,513,49536,0
1,4,5,0,40,20630,0,64,6,26697,22,62288,4040909519,95375909,5,0,26,501,36104,0
0,4,5,16,180,22591,0,64,6,24615,22,39635,3233437764,21452775,8,0,4126,501,28548,0
0,4,5,0,52,31654,0,64,6,15696,47873,22,3476257438,205382502,8,0,26,368,59804,0
1,4,5,0,320,20922,0,64,6,26125,22,62195,2187234888,2519273239,5,0,4126,501,52263,0
0,4,5,0,1132,22526,0,64,6,23744,22,39635,3233417124,21450447,8,0,4126,509,12391,0
1,4,5,0,52,0,0,64,6,47315,22,62282,3209938138,2722777338,8,0,4426,64240,36683,0
0,4,5,0,52,3091,0,64,6,44259,22,38655,1828172842,1452688914,8,0,26,504,7425,0
0,4,5,16,132,10184,0,64,6,37035,22,50266,3172212167,1776918310,5,0,4126,501,44260,0
0,4,5,16,256,10167,0,64,6,36928,22,50266,3172210503,1776918310,5,0,4126,501,19165,0
1,4,5,0,120,2043,0,128,6,28820,62294,22,644393448,2960970388,5,0,4126,512,36939,0
0,4,5,16,196,38575,0,64,6,8615,22,37221,2019796627,3598991543,8,0,4126,501,29587,0
0,4,5,16,148,22599,0,64,6,24639,22,39635,3233438532,21452967,8,0,4126,501,41316,0
1,4,5,0,88,1733,0,128,6,29162,62267,22,872073945,3114048214,5,0,4126,508,23918,0
I have made a program, but it isn't finished and I don't know how to complete it. Do I have to use another program?
with open("<dir>", "r") as file:
    file = file.readlines()
len_ = len(file)
string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0"  # the row for which I want to find the nearest data in the .csv file
list_ = []
for i in range(1, len_):
    item = str(file[i])
    item2 = item[2:]
    list_.append(item2)
for item in list_:
    # algorithm: look from left to right along the row and find the row
    # with the most sequential matches to the search data
It seems you are handling a machine learning problem: you have a dataset and a point, and you want that point's nearest neighbor. I assume you want the point of the dataset that has the shortest euclidean distance (in 19 dimensions) to the given point.
I would use the pandas and scikit-learn packages with the NearestNeighbors algorithm.
Import the packages:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
Load file.csv as a Pandas DataFrame (with generic column names):
df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))
Since you want the first column of values as results, move it to a Pandas Series called "first_column" and drop it from the "df" dataframe:
first_column = df[0]
df.drop(columns=[0], inplace=True)
What you called "string", I call "y", and set it as a NumPy array:
y = np.array([[4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0]])
Now let's fit the NearestNeighbors model:
nnb = NearestNeighbors(n_neighbors=1).fit(df)
and now compute which point in the dataset is closest to the given point y:
distances, indices = nnb.kneighbors(y, n_neighbors=1)
print(indices)
[[13]]
So the nearest point has index 13 in the dataframe. Let's print the value of first_column at that index:
print(first_column.loc[13])
0
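One caveat, offered as a sketch rather than as part of the original answer: the 19 columns span wildly different ranges (single digits next to ten-digit values), so the euclidean distance is dominated by the largest columns. Standardizing the features first may give a more meaningful neighbor:
from sklearn.preprocessing import StandardScaler

# scale each column to zero mean and unit variance before fitting
scaler = StandardScaler().fit(df)
nnb_scaled = NearestNeighbors(n_neighbors=1).fit(scaler.transform(df))
distances, indices = nnb_scaled.kneighbors(scaler.transform(y), n_neighbors=1)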

Mapping nearest values from two pandas dataframes (latitude and longitude)

How can I map the closest values from two dataframes?
I have two dataframes in the format below and am looking to map values based on o_lat, o_long from data1 and near_lat, near_lon from data2:
data1 = {'lat': [-0.659901, -0.659786, -0.659821],
         'long': [2.530561, 2.530797, 2.530587],
         'd': [0.4202, 1.0957, 0.6309],
         'o_lat': [-37.8095, -37.8030, -37.8050],
         'o_long': [145.0000, 145.0077, 145.0024]}
where lat, long are the coordinates of the destination, d is the distance between origin and destination, and o_lat, o_long are the coordinates of the origin.
data2 = {'nearest_warehouse': ['Nickolson', 'Thompson', 'Bakers'],
         'lat': [-37.8185, -37.8126, -37.8099],
         'lon': [144.9695, 144.9470, 144.9952]}
I want to produce another column in data1 which holds the nearest_warehouse, in the following format, based on the closest value:
result = {'lat': [-0.659901, -0.659786, -0.659821],
          'long': [2.530561, 2.530797, 2.530587],
          'd': [0.4202, 1.0957, 0.6309],
          'o_lat': [-37.8095, -37.8030, -37.8050],
          'o_long': [145.0000, 145.0077, 145.0024],
          'nearest_warehouse': ['Bakers', 'Thompson', 'Nickolson']}
I've tried the following code:
lat_diff = []
long_diff = []
min_distance = []
for i in range(0, 3):
    lat_diff.append(float(warehouse.near_lat[i]) - lat_long_d.o_lat[0])
for j in range(0, 3):
    long_diff.append(float(warehouse.near_lon[j]) - lat_long_d.o_long[0])
min_distance = [min(lat_diff), min(long_diff)]
min_distance
This gives the following result, which is the minimum difference in latitude and longitude for o_lat=-37.8095 and o_long=145.0000:
[-0.00897867136701791, -0.05300973586690816]
I feel this approach is not viable for mapping the closest values over a large dataset.
I'm looking for a better approach in this regard.
For each row of the first dataframe, you can use lambda x: to compare against all rows of the second dataframe, adding the absolute difference in latitude to the absolute difference in longitude. This gives you a simple distance measure to every warehouse.
What you are interested in is the index, i.e. the position, of the minimum absolute difference for each row, which you can find with idxmin(). Storing that index as a key in dataframe 1 lets you merge against the index of dataframe 2 to pull in the closest warehouse:
setup:
data1 = pd.DataFrame({'lat': [-0.659901, -0.659786, -0.659821],
                      'long': [2.530561, 2.530797, 2.530587],
                      'd': [0.4202, 1.0957, 0.6309],
                      'o_lat': [-37.8095, -37.8030, -37.8050],
                      'o_long': [145.0000, 145.0077, 145.0024]})
data2 = pd.DataFrame({'nearest_warehouse': ['Nickolson', 'Thompson', 'Bakers'],
                      'lat': [-37.818595, -37.812673, -37.809996],
                      'lon': [144.969551, 144.947069, 144.995232],
                      'near_lat': [-37.8185, -37.8126, -37.8099],
                      'near_lon': [144.9695, 144.9470, 144.9952]})
code:
data1['key'] = data1.apply(lambda x: ((x['o_lat'] - data2['near_lat']).abs()
                                      + (x['o_long'] - data2['near_lon']).abs()).idxmin(), axis=1)
data1 = pd.merge(data1, data2[['nearest_warehouse']], how='left',
                 left_on='key', right_index=True).drop('key', axis=1)
data1
data1
Out[1]:
        lat      long       d    o_lat   o_long nearest_warehouse
0 -0.659901  2.530561  0.4202 -37.8095 145.0000            Bakers
1 -0.659786  2.530797  1.0957 -37.8030 145.0077            Bakers
2 -0.659821  2.530587  0.6309 -37.8050 145.0024            Bakers
This result looks accurate if you append the two dataframes into one and do a basic scatterplot. As you can see, the Bakers warehouse sits right next to all three origin points compared to the other warehouses (the graph is to scale, thanks to the last line of code):
import matplotlib.pyplot as plt

data1 = pd.DataFrame({'o_lat': [-37.8095, -37.8030, -37.8050],
                      'o_long': [145.0000, 145.0077, 145.0024],
                      'nearest_warehouse': ['0', '1', '2']})
data2 = pd.DataFrame({'nearest_warehouse': ['Nickolson', 'Thompson', 'Bakers'],
                      'o_lat': [-37.8185, -37.8126, -37.8099],
                      'o_long': [144.9695, 144.9470, 144.9952]})
df = data1.append(data2)
y = df['o_lat'].to_list()
z = df['o_long'].to_list()
n = df['nearest_warehouse'].to_list()
fig, ax = plt.subplots()
ax.scatter(z, y)
for i, txt in enumerate(n):
    ax.annotate(txt, (z[i], y[i]))
plt.gca().set_aspect('equal', adjustable='box')

Appending multiple elements to a dataframe

I have a function that extracts a number of variables from Zillow. I used a lambda function to append the returned values to a dataframe. I am wondering if there is a faster way to return all the variables and append them to the dataframe, instead of one at a time.
Here is my code:
from xml.dom.minidom import parse, parseString
import xml.dom.minidom
import requests
import sys
import pandas as pd
import numpy as np

l_zwsid = ''
df = pd.read_csv('data.csv')

def getElementValue(p_dom, p_element):
    if len(p_dom.getElementsByTagName(p_element)) > 0:
        l_value = p_dom.getElementsByTagName(p_element)[0]
        return(l_value.firstChild.data)
    else:
        l_value = 'NaN'
        return(l_value)

def getData(l_zwsid, a_addr, a_zip):
    try:
        l_url = 'http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=' + l_zwsid + '&address=' + a_addr + '&citystatezip=' + a_zip
        xml = requests.get(l_url)
        dom = parseString(xml.text)
        responses = dom.getElementsByTagName('response')
        zpid = getElementValue(dom, 'zpid')
        usecode = getElementValue(dom, 'useCode')
        taxyear = getElementValue(dom, 'taxAssessmentYear')
        tax = getElementValue(dom, 'taxAssessment')
        yearbuilt = getElementValue(dom, 'yearBuilt')
        sqft = getElementValue(dom, 'finishedSqFt')
        lotsize = getElementValue(dom, 'lotSizeSqFt')
        bathrooms = getElementValue(dom, 'bathrooms')
        bedrooms = getElementValue(dom, 'bedrooms')
        totalrooms = getElementValue(dom, 'totalRooms')
        lastSale = getElementValue(dom, 'lastSoldDate')
        lastPrice = getElementValue(dom, 'lastSoldPrice')
        latitude = getElementValue(dom, 'latitude')
        longitude = getElementValue(dom, 'longitude')
        for response in responses:
            addresses = response.getElementsByTagName('address')
            for addr in addresses:
                street = getElementValue(addr, 'street')
                zipcode = getElementValue(addr, 'zipcode')
            zestimates = response.getElementsByTagName('zestimate')
            for zest in zestimates:
                amt = getElementValue(zest, 'amount')
                lastupdate = getElementValue(zest, 'last-updated')
                valranges = zest.getElementsByTagName('valuationRange')
                for val in valranges:
                    low = getElementValue(val, 'low')
                    high = getElementValue(val, 'high')
        return longitude, latitude
    except AttributeError:
        return None

df['Longtitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis=1)
df['Latitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis=1)
This currently does not work because the new columns will contain both the longitude and latitude.
Your getData function returns a tuple, which is why both columns have both lat and lon. One workaround could be to parameterise this function as follows:
def getData(l_zwsid, a_addr, a_zip, axis='lat'):
    valid = ['lat', 'lon']
    if axis not in valid:
        raise ValueError(f'axis must be one of {valid}')
    ...
    if axis == 'lat':
        return latitude
    else:
        return longitude
This won't improve efficiency, though; it will make things even slower, since every row now needs two API calls. Your main overhead comes from making an API call for every row in the DataFrame, so you are constrained by network performance.
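If you want to keep a single API call per row, one option is to apply once and let pandas expand the returned tuple into two columns. A sketch, assuming getData returns a (longitude, latitude) tuple for every row (rows where it returns None would need handling first):
# one call per row; result_type='expand' turns each returned 2-tuple into two columns
df[['Longtitude', 'Latitude']] = df.apply(
    lambda row: getData(l_zwsid, row['Street'], row['Zip']),
    axis=1, result_type='expand')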
You can make your getData function return a string containing comma-separated values of all the elements.
Append this csv string as an ALL_TEXT column in the dataframe df.
Then split the column ALL_TEXT into multiple columns (Lat, long, zipcode, street, etc.):
def split_into_columns(text):
    required_columns = ['Latitude', 'Longtitude', 'Zipcode']
    columns_value_list = text['ALL_TEXT'].split(',')
    for i in range(len(required_columns)):
        text[required_columns[i]] = columns_value_list[i]
    return text

df = pd.DataFrame([['11.49, 12.56, 9823A'], ['14.02, 15.29, 9674B']], columns=['ALL_TEXT'])
updated_df = df.apply(split_into_columns, axis=1)
df

              ALL_TEXT
0  11.49, 12.56, 9823A
1  14.02, 15.29, 9674B

updated_df

              ALL_TEXT Latitude Longtitude Zipcode
0  11.49, 12.56, 9823A    11.49      12.56   9823A
1  14.02, 15.29, 9674B    14.02      15.29   9674B
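For what it's worth, the same split can be done without the custom function using pandas' vectorized string methods; a sketch against the demo frame above:
# split the comma-separated string into three columns in one shot
updated_df = df.copy()
updated_df[['Latitude', 'Longtitude', 'Zipcode']] = df['ALL_TEXT'].str.split(', ', expand=True)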

Associating units with Pandas DataFrame

I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame
obs = pd.read_csv(StringIO(x.strip()), sep=",\s*")
print(obs)
which produces
      longitude       latitude
0  degrees_east  degrees_north
1      -142.842          -1.82
2       -25.389          39.87
3       -37.704         27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?
Allowing pandas to read the second line as data screws up the dtype for the columns: instead of a float dtype, the presence of strings makes the dtype of the columns object, and the underlying objects, even the numbers, are strings. This breaks all numerical operations:
In [8]: obs['latitude'] + obs['longitude']
Out[8]:
0    degrees_northdegrees_east
1                -1.82-142.842
2                 39.87-25.389
3                27.114-37.704

In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv skip the second line.
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
content = StringIO(x.strip())

def read_csv(content):
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    obs = pd.read_table(content, sep=",\s*", header=None)
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs

obs = read_csv(content)
print(obs)
print(obs)
# longitude (degrees_east) latitude (degrees_north)
# 0 -142.842 -1.820
# 1 -25.389 39.870
# 2 -37.704 27.114
print(obs.dtypes)
# longitude (degrees_east) float64
# latitude (degrees_north) float64
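One payoff of embedding the units in the column names, sketched here with matplotlib (an assumption on my part; the original answer stops at the dtypes): plot labels come for free.
import matplotlib.pyplot as plt

# the unit-bearing column names become the axis labels automatically
ax = obs.plot(kind='scatter',
              x='longitude (degrees_east)', y='latitude (degrees_north)')
plt.show()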
