I have a pandas dataframe with the following relevant features:
location (city), lat, long.
I want to add the following columns to this dataframe, as efficiently as possible:
Distance to closest hospital
Distance to closest train station
etc.
My dataframe looks like this:
Column1 postal_code city_name type_of_property price number_of_rooms house_area fully_equipped_kitchen open_fire terrace garden surface_of_the_land number_of_facades swimming_pool state_of_the_building lattitude longitude province region
13380 13380 1785 Brussegem 1 235000 1 71 1 0 1 0 0 4 0 good 4.265887 50.927771 Brabant flamand Flandre
21135 21135 8630 Bulskamp 1 545000 3 191 1 0 0 0 0 2 0 as new 2.643449 51.044461 Flandre-Occidentale Flandre
5002 5002 4287 Lincent 0 97500 3 90 0 0 0 0 260 4 0 to renovate 5.031817 50.711613 Liège Wallonie
9544 9544 8400 Oostende 0 130000 3 119 1 0 0 1 71 2 0 to renovate 2.920327 51.230318 Flandre-Occidentale Flandre
38923 38923 4830 Limbourg 0 149000 2 140 1 0 0 0 15 2 0 to be done up 5.940299 50.612276 Liège Wallonie
I found this Python package, which allows me to find places nearby:
https://github.com/slimkrazy/python-google-places
So I created two functions.
One to calculate the distance between two points (geodesic distance):

import geopy
import geopy.distance
from googleplaces import GooglePlaces, types, lang, ranking

def geodesicdistance(start, end):
    return print(geopy.distance.geodesic(start, end).km)
And a function to get places near a start point (hospitals, train stations, etc.); it all depends on the type parameter:

def getplace(start, location, type):
    YOUR_API_KEY = ''
    google_places = GooglePlaces(YOUR_API_KEY)
    query_result = google_places.nearby_search(
        lat_lng=start,
        location=location,
        radius=500,
        types=[type],
        rankby=ranking.DISTANCE
    )
    return (query_result.places[0].geo_location['lng'], query_result.places[0].geo_location['lat'])
The type is an enum with the following relevant values for me:
TYPE_BUS_STATION, TYPE_HOSPITAL, TYPE_AIRPORT, TYPE_BAKERY, TYPE_CITY_HALL, TYPE_FIRE_STATION, TYPE_PHARMACY, TYPE_POLICE, TYPE_SUBWAY_STATION, TYPE_TRAIN_STATION, TYPE_UNIVERSITY
That is 11 types.
Basically I want to create 11 new dynamic columns as:
Distance to closest bus station, distance to closest hospital, distance to closest airport, etc.
The part I am missing, and I don't have a clue how to do, is how to iterate over the pandas dataframe and, for each row, call the functions 11 times and store each result as a new value in the corresponding column.
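A minimal sketch of that missing piece. The helper below is a hypothetical stand-in for the two functions above (so the pattern runs without an API key): apply() with axis=1 visits one row at a time, and returning a pd.Series turns each place type into its own column.

```python
import pandas as pd

# Hypothetical stand-in for geodesicdistance(start, getplace(...)):
# it just returns a dummy number so the dataframe pattern can run offline.
def distance_to_closest(lat, lng, place_type):
    return 0.0

place_types = ['hospital', 'train_station']  # would be the 11 types

df = pd.DataFrame({'lattitude': [50.92, 51.04], 'longitude': [4.26, 2.64]})

# apply() calls the lambda once per row; the returned Series is expanded
# into one new column per place type.
new_cols = df.apply(
    lambda row: pd.Series(
        {f'dist_{t}': distance_to_closest(row['lattitude'], row['longitude'], t)
         for t in place_types}),
    axis=1)
df = pd.concat([df, new_cols], axis=1)
```

With the real functions in place, the lambda body would call the Places API 11 times per row, so expect the run time to scale as rows × types.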
Update:
I tried the code below but still got an error (I replaced location with city_name).
import geopy
import geopy.distance
from googleplaces import GooglePlaces, types, lang, ranking

# Corrected geodesicdistance so it returns the result rather than None
def geodesicdistance(start, end):
    result = geopy.distance.geodesic(start, end).km
    # print(result)  # uncomment to print
    return result

# Refactored getplace to getplaces to provide distances to items in types
def getplaces(start, location, types):
    '''
    Get distance to the nearest item of each type
    '''
    YOUR_API_KEY = ''
    google_places = GooglePlaces(YOUR_API_KEY)
    # Compute the closest distance for each type
    start_lat_lng = (start['lat'], start['lng'])
    distances = []
    for type_ in types:
        # Find closest place for type_
        query_result = google_places.nearby_search(
            lat_lng=start,      # a dict with keys lat, lng (default None)
            location=location,  # a human-readable location, e.g. 'London, England' (default None)
            radius=500,
            types=[type_],
            rankby=ranking.DISTANCE
        )
        # 0th element has the lat/lng of the closest place since we ranked by distance
        end_lat_lng = (query_result.places[0].geo_location['lat'],
                       query_result.places[0].geo_location['lng'])
        # Add distance to the closest place of this type
        distances.append(geodesicdistance(start_lat_lng, end_lat_lng))
    return distances
def min_dist(lattitude, longitude, location, types):
    '''
    Get distances for the places closest to latitude/longitude
    latitude/longitude - position to find nearby items from
    location - human-readable name of the position
    types - list of place types to find
    '''
    start = {'lat': lattitude,
             'lng': longitude}
    return getplaces(start, location, types)
# List of 11 types used from googleplaces.types
types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT,
              types.TYPE_BAKERY, types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION,
              types.TYPE_PHARMACY, types.TYPE_POLICE, types.TYPE_SUBWAY_STATION,
              types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]

# Create a dataframe whose columns have the distances to the closest item by type
closest_df = df.apply(lambda row: pd.Series(min_dist(row['lattitude'],
                                                     row['longitude'],
                                                     row['city_name'],
                                                     types_list),
                                            index=types_list),
                      axis='columns',
                      result_type='expand')

# Concatenate the two dataframes column-wise
final_df = pd.concat([df, closest_df], axis=1)
but I have this error:
IndexError Traceback (most recent call last)
/home/azureuser/cloudfiles/code/Users/levm38/LightGBM/Tests/featureengineering.ipynb Cell 2 in <cell line: 63>()
58 types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT, types.TYPE_BAKERY,
59 types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION, types.TYPE_PHARMACY,
60 types.TYPE_POLICE,types.TYPE_SUBWAY_STATION, types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]
62 # Create Dataframe whose columns have the distances to closest item by type
---> 63 closest_df = df.apply(lambda row: pd.Series(min_dist(row['lattitude'],
64 row['longitude'],
65 row['city_name'],
66 types_list),
67 index = types_list),
68 axis = 'columns',
69 result_type ='expand')
71 # Concatenate the two dfs columnwise
72 final_df = pd.concat([df, closest_df], axis = 1)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/frame.py:8845, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
8834 from pandas.core.apply import frame_apply
8836 op = frame_apply(
8837 self,
8838 func=func,
(...)
8843 kwargs=kwargs,
8844 )
...
---> 36 end_lat_lng = query_result.places[0].geo_location['lat'], query_result.places[0].geo_location['lng']
38 # Add distance to closest to list by type
39 distances.append(geodesicdistance(start_lat_lng , end_lat_lng ))
IndexError: list index out of range
Update 2:
The code runs but I always get NaNs. Even if I increase the radius from 500 m to 5,000 m, every column is NaN.
The only changes I made were replacing location with city_name and hardcoding + ",Belgium" onto the location parameter.
See dataset here:
https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv
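One sanity check worth running before blaming the radius: Belgian latitudes fall roughly between 49.5 and 51.5°N and longitudes between 2.5 and 6.4°E, yet the sample rows above show lattitude ≈ 4.27 and longitude ≈ 50.93. If the two columns are swapped, every nearby_search is centred on a point nowhere near Belgium, which would plausibly return zero places (hence NaN) at any radius. A quick range check, shown here on the sample rows above:

```python
import pandas as pd

# The sample rows from the dataframe shown earlier
df = pd.DataFrame({'lattitude': [4.265887, 2.643449, 5.031817],
                   'longitude': [50.927771, 51.044461, 50.711613]})

# Belgium: latitude ~49.5-51.5 N, longitude ~2.5-6.4 E
swapped = (df['lattitude'].between(2.5, 6.4).all()
           and df['longitude'].between(49.5, 51.5).all())
if swapped:
    # rename() maps old labels to new ones in a single pass, so this swaps them
    df = df.rename(columns={'lattitude': 'longitude', 'longitude': 'lattitude'})
```

If the check fires on the full CSV as well, the fix is the rename above rather than a bigger radius.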
Updated Code:
import geopy
import geopy.distance
import numpy as np
import pandas as pd
from googleplaces import GooglePlaces, types, lang, ranking

# Corrected geodesicdistance so it returns the result rather than None
def geodesicdistance(start, end):
    result = geopy.distance.geodesic(start, end).km
    # print(result)  # uncomment to print
    return result

# Refactored getplace to getplaces to provide distances to items in types
def getplaces(start, location, types):
    '''
    Get distance to the nearest item of each type
    '''
    YOUR_API_KEY = ''
    google_places = GooglePlaces(YOUR_API_KEY)
    # Compute the closest distance for each type
    start_lat_lng = (start['lat'], start['lng'])
    distances = []
    for type_ in types:
        # Find closest place for type_
        query_result = google_places.nearby_search(
            lat_lng=start,                   # a dict with keys lat, lng (default None)
            location=location + ",Belgium",  # a human-readable location (default None)
            radius=5000,
            types=[type_],
            rankby=ranking.DISTANCE
        )
        # 9/1/2022 -- added try/except block to handle 0 responses
        try:
            # 0th element has the lat/lng of the closest place since we ranked by distance
            end_lat_lng = (query_result.places[0].geo_location['lat'],
                           query_result.places[0].geo_location['lng'])
            # Add distance to the closest place of this type
            distances.append(geodesicdistance(start_lat_lng, end_lat_lng))
        except IndexError:
            distances.append(np.nan)  # did not return a value
    return distances

def min_dist(latitude, longitude, location, types):
    '''
    Get distances for the places closest to latitude/longitude
    latitude/longitude - position to find nearby items from
    location - human-readable name of the position
    types - list of place types to find
    '''
    start = {'lat': latitude,  # spelling correction 9/1/2022
             'lng': longitude}
    return getplaces(start, location, types)

dfsmall = df.sample(10)

types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT,
              types.TYPE_BAKERY, types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION,
              types.TYPE_PHARMACY, types.TYPE_POLICE, types.TYPE_SUBWAY_STATION,
              types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]

# Create a dataframe whose columns have the distances to the closest item by type
closest_df = dfsmall.apply(lambda row: pd.Series(min_dist(row['lattitude'],
                                                          row['longitude'],
                                                          row['city_name'],
                                                          types_list),
                                                 index=types_list),
                           axis='columns',
                           result_type='expand')

# Concatenate the two dataframes column-wise
final_df = pd.concat([dfsmall, closest_df], axis=1)
print('Final df')
display(final_df)
Use pandas.DataFrame.apply with result_type='expand' to create the distance columns by type.
An example of the pattern is in Apply pandas function to column to create multiple new columns?
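The linked pattern in miniature: when the applied function returns a list and result_type='expand' is set, each list element becomes its own column.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# Each returned list becomes one row of the expanded frame,
# one column per element.
expanded = df.apply(lambda row: [row['x'] * 10, row['x'] * 100],
                    axis=1, result_type='expand')
expanded.columns = ['times10', 'times100']
out = pd.concat([df, expanded], axis=1)
```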
Code
import geopy
import geopy.distance
import numpy as np
import pandas as pd
from io import StringIO  # needed for the example usage below
from googleplaces import GooglePlaces, types, lang, ranking

# Corrected geodesicdistance so it returns the result rather than None
def geodesicdistance(start, end):
    result = geopy.distance.geodesic(start, end).km
    # print(result)  # uncomment to print
    return result

# Refactored getplace to getplaces to provide distances to items in types
def getplaces(start, location, types):
    '''
    Get distance to the nearest item of each type
    '''
    YOUR_API_KEY = 'YOUR API KEY'
    google_places = GooglePlaces(YOUR_API_KEY)
    # Compute the closest distance for each type
    start_lat_lng = (start['lat'], start['lng'])
    distances = []
    for type_ in types:
        # Find closest place for type_
        query_result = google_places.nearby_search(
            lat_lng=start,      # a dict with keys lat, lng (default None)
            location=location,  # a human-readable location, e.g. 'London, England' (default None)
            radius=500,
            types=[type_],
            rankby=ranking.DISTANCE
        )
        # 9/1/2022 -- added try/except block to handle 0 responses
        try:
            # 0th element has the lat/lng of the closest place since we ranked by distance
            end_lat_lng = (query_result.places[0].geo_location['lat'],
                           query_result.places[0].geo_location['lng'])
            # Add distance to the closest place of this type
            distances.append(geodesicdistance(start_lat_lng, end_lat_lng))
        except IndexError:
            distances.append(np.nan)  # did not return a value
    return distances

def min_dist(latitude, longitude, location, types):
    '''
    Get distances for the places closest to latitude/longitude
    latitude/longitude - position to find nearby items from
    location - human-readable name of the position
    types - list of place types to find
    '''
    start = {'lat': latitude,  # spelling correction 9/1/2022
             'lng': longitude}
    return getplaces(start, location, types)
Example Usage
# Create Dataframe using central areas of three cities in the USA
s = '''location latitude longitude
NYC 40.754101 -73.992081
Chicago 41.882702 -87.619392
Atlanta 33.753746 -84.386330'''
df = pd.read_csv(StringIO(s), sep = '\t')
print('Initial df')
display(df)
# List of 11 types used from googleplaces.types
types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT,
              types.TYPE_BAKERY, types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION,
              types.TYPE_PHARMACY, types.TYPE_POLICE, types.TYPE_SUBWAY_STATION,
              types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]

# Create a dataframe whose columns have the distances to the closest item by type
closest_df = df.apply(lambda row: pd.Series(min_dist(row['latitude'],
                                                     row['longitude'],
                                                     row['location'],
                                                     types_list),
                                            index=types_list),
                      axis='columns',
                      result_type='expand')

# Concatenate the two dataframes column-wise
final_df = pd.concat([df, closest_df], axis=1)
print('Final df')
display(final_df)
Output
Initial df
location latitude longitude
0 NYC 40.754101 -73.992081
1 Chicago 41.882702 -87.619392
2 Atlanta 33.753746 -84.386330
final_df
location latitude longitude bus_station hospital airport bakery city_hall fire_station pharmacy police subway_station train_station university
0 NYC 40.754101 -73.992081 0.239516 0.141911 0.196990 0.033483 NaN 0.181210 0.106216 0.248619 0.229708 0.407709 0.035780
1 Chicago 41.882702 -87.619392 0.288502 0.442175 0.957081 0.327077 1.024382 0.467242 0.249753 0.648701 0.565269 0.478530 0.424755
2 Atlanta 33.753746 -84.386330 0.374402 0.097586 0.424375 0.232727 0.548718 0.549474 0.286725 0.250114 0.366779 0.386960 0.029468
Related
I have a data frame that consists of Store IDs and their lat/lon.
I want to iterate through that dataframe and, for each store ID, find key places nearby using the Google API.
For example, input:
Store-ID LAT LON
1 1.222 2.222
2 2.334 4.555
3 5.433 7.2343
Output should be (in a new dataframe):
Store-ID Places found ......
1 school
1 cafe
1 cinema
1 bookstore
.
.
.
2 toy store
2 KFC
and so on ..........
So far I have tried the following, but I fail to create a new data frame that matches the output format mentioned above:
import time
import math
import googlemaps

map_client = googlemaps.Client(API_KEY)
theta = 0
search_string = ['school', 'store', 'bank', 'restaurant', 'clothing_store',
                 'health', 'mosque', 'hospital', 'car_dealer', 'finance',
                 'electronics_store', 'university', 'home_goods_store']
distance = km_to_meters(0.1)
sc = []
business_list = []

for i, row in POI_copy.iterrows():
    print(i, row[0], row[1], row[2])
    for string in search_string:
        coordinates = (row[1], row[2])
        response = map_client.places_nearby(
            location=coordinates,
            keyword=string,
            radius=distance
        )
        business_list.extend(response.get('results'))
        print(f'we have found a place and we are appending for store code {row[0]}')
        sc.append(str(row[0]))

        next_page_token = response.get('next_page_token')
        while next_page_token:
            time.sleep(2)
            response = map_client.places_nearby(
                location=coordinates,
                keyword=string,
                radius=distance,
                page_token=next_page_token)
            business_list.extend(response.get('results'))
            next_page_token = response.get('next_page_token')
            print(f'we have found a place and we are appending for store code {row[0]}')
            sc.append(str(row[0]))
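To get the long Store-ID / place layout shown in the desired output, one option is to collect (store id, place) tuples while looping and build the frame in one go. This is a sketch with mocked-up responses; 'name' and 'rating' are fields the Places API does return per result, but check an actual response for the exact keys you need.

```python
import pandas as pd

# Mocked API results: in the real loop each dict below would come from
# response.get('results') for one store.
collected = [
    (1, {'name': 'school', 'rating': 4.1}),
    (1, {'name': 'cafe', 'rating': 4.5}),
    (2, {'name': 'toy store', 'rating': 3.9}),
]

# One output row per (store, place) pair gives the long layout directly
rows = [(store_id, place['name'], place.get('rating'))
        for store_id, place in collected]
places_df = pd.DataFrame(rows, columns=['Store-ID', 'Place found', 'rating'])
```

Building the frame once from a list of tuples is also much faster than appending to a DataFrame inside the loop.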
I would also like to know whether we can extract more info for each place (rating, timings, busiest hour of the day, busiest day of the week) and append it to the dataframe.
I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.
My code is:
import pandas as pd
from fuzzywuzzy import fuzz, process

### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")

### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")

### Variable for accuracy scoring
accuracy = 0

for index, value in df["address"].iteritems():
    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.tolist()[0]
    ### If each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )
Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will output an excel file with the correct values and a background colour indicating which values were wrong and changed.
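As a quick way to check whether fuzzywuzzy itself is the bottleneck, the same accuracy loop can be sketched with the stdlib's difflib. The scores are similar to, but not identical with, fuzz.ratio, so treat this as a comparison baseline rather than a drop-in replacement; the addresses here are made-up stand-ins for the two columns.

```python
import difflib

addresses = ['1 Central Ave', '86 Nw 66th St #8673']       # stand-in for df['address']
correct = ['1 Central Avenue', '86 Nw 66th Street #8673']  # stand-in for ref_df['correct_address']

accuracy = 0.0
for addr in addresses:
    # stdlib equivalent of process.extractOne: best match from the reference list
    best = difflib.get_close_matches(addr, correct, n=1, cutoff=0.0)[0]
    # ratio() is already on a 0-1 scale, so no division by 100 is needed
    accuracy += difflib.SequenceMatcher(None, addr, best).ratio()
```

If the stdlib loop is dramatically faster on your 15,000 rows, the fix is likely installing fuzzywuzzy's C speedup (python-Levenshtein) rather than restructuring the loop.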
I don't believe fuzzywuzzy has a method that returns the index, value, and match ratio in one tuple; just the value and the ratio.
Hopefully the code below (with links to dummy data) helps show what is possible. I used street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is nowhere near as big.
You can pull the csv text from the links in the comments, run it, and see what could work on your larger sample.
For five addresses in the reference frame and 100 contacts in the other its execution timings are:
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms
The below code should be quicker than .iteritems() etc.
Code:
# %%time
import difflib
import pandas as pd
from fuzzywuzzy import fuzz, process

# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')

# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')

# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)

# given the current row of ref_df (via apply) and a series (df['address']),
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=60)
    if o != None:
        return o[1]

# creating two lists from the address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())

# via fuzzywuzzy matching, using match_addresses() above,
# build a dictionary of addresses where there is a match;
# the keys are the addresses from ref_df and the values are from df (i.e., the 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)

# create a new frame from the fuzzywuzzy address-match dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])

# add fuzzywuzzy scoring to the original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)

# merge the fuzzywuzzy address-match frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()

# add difflib scoring as a point of comparison
compare_df['difflib_score'] = compare_df.apply(
    lambda x: difflib.SequenceMatcher(None, x['ref_address'], x['matched_address']).ratio(), axis=1)

# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',
                         'fuzzywuzzy_score', 'difflib_score']]

# see what we've got
print(compare_df)
# remember: correct_address and ref_address are copies
# so just pick one to compare to matched_address
correct_address ref_address matched_address \
0 86 Nw 66th Street #8673 86 Nw 66th Street #8673 86 Nw 66th St #8673
1 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230
2 6649 N Blue Gum St 6649 N Blue Gum St 6649 N Blue Gum St
3 59 n Groesbeck Hwy 59 n Groesbeck Hwy 59 N Groesbeck Hwy
4 1 Central Avenue 1 Central Avenue 1 Central Ave
fuzzywuzzy_score difflib_score
0 90 0.904762
1 100 1.000000
2 100 1.000000
3 100 0.944444
4 90 0.896552
I'll try to explain what I'm currently working with:
I have two dataframes: one for Gas Station A (165 stations), and other for Gas Station B (257 stations). They both share the same format:
id Coor
1 (a1,b1)
2 (a2,b2)
Coor has tuples with the location coordinates. What I want to do is to add 3 columns to Dataframe A with nearest Competitor #1, #2 and #3 (from Gas Station B).
Currently I manage to get every distance from A to B (42,405 distance measures), but in a list format:

distances = []
for (u, v) in gasA['coor']:
    for (w, x) in gasB['coor']:
        distances.append(sp.distance.euclidean((u, v), (w, x)))

This gives me the values I need, but I still need to match them with the IDs from Gas Station A and get the top 3. I suspect working with lists is not the best approach here. Do you have any suggestions?
Edit: as suggested, first 5 rows are:
in GasA:
id coor
60712 (-333525363206695,-705191013427772)
60512 (-333539879388388, -705394161580837)
60085 (-333545609177068, -703168832659184)
60110 (-333601677229216, -705167284798638)
60078 (-333608898397271, -707213099595404)
in GasB:
id coor
70174 (-333427160000000,-705459060000000)
70223 (-333523030000000, -706705470000000)
70383 (-333549270000000, -705320990000000)
70162 (-333556960000000, -705384750000000)
70289 (-333565850000000, -705104360000000)
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
Creating the data:
A = pd.DataFrame({'id':['60712','60512','60085', '60110','60078'], 'coor':[ (-333525363206695,-705191013427772),\
(-333539879388388, -705394161580837),\
(-333545609177068, -703168832659184),\
(-333601677229216, -705167284798638),\
(-333608898397271, -707213099595404)]})
B = pd.DataFrame({'id':['70174','70223','70383', '70162','70289'], 'coor':[ (-333427160000000,-705459060000000),\
(-333523030000000, -706705470000000),\
(-333549270000000, -705320990000000),\
(-333556960000000, -705384750000000),\
(-333565850000000, -705104360000000)]})
Calculating the distances:
res = euclidean_distances(list(A.coor), list(B.coor))
Selecting top 3 closest stations from B and appending to a column in A:
d = []
for i, id_ in enumerate(A.index):
    distances = np.argsort(res[i])[0:3]  # select top 3
    distances = B.iloc[distances]['id'].values
    d.append(distances)
A = A.assign(dist=d)
Edit: result of running with the example:
coor id dist
0 (-333525363206695, -705191013427772) 60712 [70223, 70174, 70162]
1 (-333539879388388, -705394161580837) 60512 [70223, 70289, 70174]
2 (-333545609177068, -703168832659184) 60085 [70223, 70174, 70162]
3 (-333601677229216, -705167284798638) 60110 [70223, 70174, 70162]
4 (-333608898397271, -707213099595404) 60078 [70289, 70383, 70162]
Define a function that calculates the distances from a row of A to all of B and returns the indices of B with the three smallest distances. Note that the start point must be passed to euclidean() as a single tuple (args=((u, v),)); passing args=[u, v] would hand v to euclidean's third parameter (the weights) instead.

def get_nearest_three(row):
    (u, v) = row['Coor']
    dist_list = gasB.Coor.apply(sp.distance.euclidean, args=((u, v),))
    # want the indices of the 3 rows of B with the smallest distances
    return list(np.argsort(dist_list))[0:3]

gasA['dists'] = gasA.apply(get_nearest_three, axis=1)
You can do something like this:

a = np.array(list(gasA.coor))
b = np.array(list(gasB.coor))
dist_matrix = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))

We get the numpy arrays of coordinates for both frames, then broadcast a against b so every pairwise combination is formed, and take the Euclidean distance; dist_matrix[i, j] is the distance from the i-th station in A to the j-th station in B.
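Another option worth mentioning (not used elsewhere in this thread): scipy's cKDTree answers "k nearest neighbours" queries directly, which avoids materialising the full distance matrix. A small sketch with made-up planar coordinates standing in for the two coor columns:

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up coordinates standing in for gasA/gasB 'coor' tuples
a = np.array([(0.0, 0.0), (5.0, 5.0)])
b = np.array([(1.0, 0.0), (4.0, 5.0), (9.0, 9.0), (0.0, 2.0)])

tree = cKDTree(b)
dists, idx = tree.query(a, k=3)  # 3 nearest rows of b for each row of a
# idx[i] holds the positions in b of the 3 closest stations to a[i],
# already sorted from nearest to farthest; dists holds the matching distances
```

For 165 × 257 points any approach is fast, but the tree scales much better if either set grows.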
Consider a cross join (matching every row of one dataset to every row of the other), which should be manageable with your small sets (165 × 257), then calculate the distance. Then rank by distance and filter for the top 3.

cj_df = pd.merge(gasA.assign(key=1), gasB.assign(key=1),
                 on="key", suffixes=['_A', '_B'])
cj_df['distance'] = cj_df.apply(lambda row: sp.distance.euclidean(row['Coor_A'],
                                                                  row['Coor_B']),
                                axis=1)

# RANK BY DISTANCE
cj_df['rank'] = cj_df.groupby('id_A')['distance'].rank()

# FILTER FOR TOP 3
top3_df = cj_df[cj_df['rank'] <= 3].sort_values(['id_A', 'rank'])
The first dataframe df1 contains ids and their corresponding coordinate pairs. For each coordinate pair in df1, I have to loop through the second dataframe to find the one with the least distance. I tried taking the individual coordinates and finding the distance between them, but it does not work as expected; I believe the coordinates have to be treated as a pair when computing the distance. Does Python offer a method to achieve this?
For eg: df1
Id Co1 Co2
334 30.371353 -95.384010
337 39.497448 -119.789623
df2
Id Co1 Co2
339 40.914585 -73.892456
441 34.760395 -77.999260
dfloc3 = [[38.991512, -77.441536],
          [40.89869, -72.37637],
          [40.936115, -72.31452],
          [30.371353, -95.38401],
          [39.84819, -75.37162],
          [36.929306, -76.20035],
          [40.682342, -73.979645]]

dfloc4 = [[40.914585, -73.892456],
          [41.741543, -71.406334],
          [50.154522, -96.88806],
          [39.743565, -121.795761],
          [30.027597, -89.91014],
          [36.51881, -82.560844],
          [30.449587, -84.23629],
          [42.920475, -85.8208]]
Given you can get your points into lists like so:

df1 = [[30.371353, -95.384010], [39.497448, -119.789623]]
df2 = [[40.914585, -73.892456], [34.760395, -77.999260]]

Import math, then create a function to make finding the distance easier:

import math

def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

Then simply traverse your list, saving the closest points:

for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    print("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))
Outputs:
Point: [30.371353, -95.38401] is closest to [34.760395, -77.99926]
Point: [39.497448, -119.789623] is closest to [34.760395, -77.99926]
The code below creates a new column in df1 showing the Id of the nearest point in df2. (I can't tell from the question if this is what you want.) I'm assuming the coordinates are in a Euclidean space, i.e., that the distance between points is given by the Pythagorean Theorem. If not, you could easily use some other calculation instead of dist_squared.
import pandas as pd

df1 = pd.DataFrame(dict(Id=[334, 337], Co1=[30.371353, 39.497448], Co2=[-95.384010, -119.789623]))
df2 = pd.DataFrame(dict(Id=[339, 441], Co1=[40.914585, 34.760395], Co2=[-73.892456, -77.999260]))

def nearest(row, df):
    # calculate the squared euclidean distance from the given row to all rows of df
    dist_squared = (row.Co1 - df.Co1) ** 2 + (row.Co2 - df.Co2) ** 2
    # find the closest row of df
    smallest_idx = dist_squared.argmin()
    # return the Id for the closest row of df
    return df.loc[smallest_idx, 'Id']

near = df1.apply(nearest, args=(df2,), axis=1)
df1['nearest'] = near
I'm basically running some code as follows: I retrieve pairs of stocks (laid out as Row 1 - Stock 1,2; Row 2 - Stock 1,2 and so on, where Stocks 1 and 2 differ in each row) from a CSV file, then pull data from Yahoo for these "pairs" of stocks. I calculate the returns of the stocks and check whether the distance (difference in returns) between a pair breaches some threshold; if so, I return 1. However, I'm getting the following error, which I am unable to resolve:
PricePort(tickers)
27 for ticker in tickers:
28 #print ticker
---> 29 x = pd.read_csv('http://chart.yahoo.com/table.csv?s=ttt'.replace('ttt',ticker),usecols=[0,6],index_col=0)
30 x.columns=[ticker]
31 final=pd.merge(final,x,left_index=True,right_index=True)
TypeError: expected a character buffer object
The code is as follows:
from datetime import datetime
import pytz
import csv
import pandas as pd
import pandas.io.data as web
import numpy as np

# Retrieves pairs of stocks (laid out as Row 1 - Stock 1,2; Row 2 - Stock 1,2
# and so on, where Stocks 1 and 2 differ in each row) from the CSV file
def Dataretriever():
    Pairs = []
    f1 = open('C:\Users\Pythoncode\Pairs.csv')  # Enter the location of the file
    csvdata = csv.reader(f1)
    for row in csvdata:  # reading tickers from the csv file
        Pairs.append(row)
    return Pairs

tickers = Dataretriever()  # Obtaining the data

# Taking in data from Yahoo associated with these "Pairs" of stocks
def PricePort(tickers):
    """
    Returns historical adjusted prices of a portfolio of stocks.
    tickers=pairs
    """
    final = pd.read_csv('http://chart.yahoo.com/table.csv?s=^GSPC', usecols=[0, 6], index_col=0)
    final.columns = ['^GSPC']
    for ticker in tickers:
        #print ticker
        x = pd.read_csv('http://chart.yahoo.com/table.csv?s=ttt'.replace('ttt', ticker), usecols=[0, 6], index_col=0)
        x.columns = [ticker]
        final = pd.merge(final, x, left_index=True, right_index=True)
    return final

# Calculating returns of the stocks
def Returns(tickers):
    l = []
    begdate = (2014, 1, 1)
    enddate = (2014, 6, 1)
    p = PricePort(tickers)
    ret = (p.close[1:] - p.close[:-1]) / p.close[1:]
    l.append(ret)
    return l

# Basically a class to see if the distance (difference in returns) between a
# pair of stocks breaches some threshold
class ThresholdClass():
    # constructor
    def __init__(self, Pairs):
        self.Pairs = Pairs

    # Calculating the distance (difference in returns) between a pair of stocks
    def Distancefunc(self, tickers):
        k = 0
        l = Returns(tickers)
        summation = [[0 for x in range(k)] for x in range(k)]  # 2d matrix for the squared distance
        for i in range(k):
            for j in range(i + 1, k):  # it will be an upper triangular matrix
                for p in range(len(self.PricePort(tickers)) - 1):
                    summation[i][j] = summation[i][j] + (l[i][p] - l[j][p])**2  # calculating distance
        for i in range(k):  # setting the lower half of the matrix to 1 (if we see 1 in the answer we will set a higher limit, but typically the squared distance is less than 1)
            for j in range(i + 1):
                sum[i][j] = 1
        return sum

    # This function is used in determining the threshold distance
    def MeanofPairs(self, tickers):
        sum = self.Distancefunc(tickers)
        mean = np.mean(sum)
        return mean

    # This function is used in determining the threshold distance
    def StandardDeviation(self, tickers):
        sum = self.Distancefunc(tickers)
        standard_dev = np.std(sum)
        return standard_dev

    def ThresholdandnewsChecker(self, tickers):
        threshold = self.MeanofPairs(tickers) + 2 * self.StandardDeviation(tickers)
        if (self.Distancefunc(tickers) > threshold):
            return 1

Threshold_Class = ThresholdClass(tickers)
Threshold_Class.ThresholdandnewsChecker(tickers, 1)
The trouble is that Dataretriever() returns a list of lists, not strings. When you iterate over tickers, the name ticker is bound to a list (one CSV row), not a string.
The str.replace method expects both arguments to be strings; the following raises the error because the second argument is a list:
'http://chart.yahoo.com/table.csv?s=ttt'.replace('ttt', ticker)
The subsequent line x.columns = [ticker] will cause similar problems: there, ticker needs to be a hashable object (like a string or an integer), but lists are not hashable.
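A minimal sketch of the fix (with made-up tickers): add an inner loop over the strings inside each CSV row, so replace() receives a string.

```python
tickers = [['AAPL', 'MSFT'], ['GOOG', 'IBM']]  # the shape Dataretriever() returns

urls = []
for pair in tickers:      # each row from the CSV is a list of two tickers
    for ticker in pair:   # now ticker is a string, so replace() works
        urls.append('http://chart.yahoo.com/table.csv?s=ttt'.replace('ttt', ticker))
```

The same inner loop fixes x.columns = [ticker], since each column label is then a string.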