I'm using the ArcGIS API (ArcMap 10.5.1), and I'm trying to get the drive time between two addresses. I can get the drive time between two points, but I don't know how to iterate over multiple points. I have one hundred addresses. I keep getting
AttributeError: 'NoneType' object has no attribute '_tools'
This is the pandas dataframe I'm working with. It has two indexed columns: column 1 is the origin address and column 2 is the destination address. If possible, I would love to add a new column with the drive time.
df2
Address_1 Address_2
0 1600 Pennsylvania Ave NW, Washington, DC 20500 2 15th St NW, Washington
1 400 Broad St, Seattle, WA 98109 325 5th Ave N, Seattle
This is the link where I grabbed the code from
https://developers.arcgis.com/python/guide/performing-route-analyses/
I tried hacking this code. Specifically the code below.
def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# empty list - will be used to store calculated distances
list = [0]

# Loop through each row in the data frame using pairwise
for (i1, row1), (i2, row2) in pairwise(df.iterrows()):
https://medium.com/how-to-use-google-distance-matrix-api-in-python/how-to-use-google-distance-matrix-api-in-python-ef9cd895303c
I looked up what NoneType means, so I tried printing things out to see whether anything would print, and that works fine. I mostly use R and don't use Python much.
for (i, j) in pairwise(df2.iterrows()):
    print(i)
    print(j)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy
from datetime import datetime
from IPython.display import HTML
import json
from arcgis.gis import GIS
import arcgis.network as network
import arcgis.geocoding as geocoding
from itertools import tee
user_name = 'username'
password = 'password'
my_gis = GIS('https://www.arcgis.com', user_name, password)
route_service_url = my_gis.properties.helperServices.route.url
route_service = network.RouteLayer(route_service_url, gis=my_gis)
for (i, j) in pairwise(df2.iterrows()):
    stop1_geocoded = geocoding.geocode(i)
    stop2_geocoded = geocoding.geocode(j)
    stops = '{0},{1}; {2},{3}'.format(stop1_geocoded[0]['attributes']['X'],
                                      stop1_geocoded[0]['attributes']['Y'],
                                      stop2_geocoded[0]['attributes']['X'],
                                      stop2_geocoded[0]['attributes']['Y'])
    route_layer = network.RouteLayer(route_service_url, gis=my_gis)
    result = route_layer.solve(stops=stops, return_directions=False, return_routes=True,
                               output_lines='esriNAOutputLineNone', return_barriers=False,
                               return_polygon_barriers=False, return_polyline_barriers=False)
    travel_time = result['routes']['features'][0]['attributes']['Total_TravelTime']
    print("Total travel time is {0:.2f} min".format(travel_time))
The expected output is a list of drive times. I tried appending everything to a dataframe, and that would be ideal. So the ideal output would be 3 columns: address 1, address 2, and drive time. The code does work with one address pair at a time (instead of i, j it's just two addresses as strings, with no for statement).
example:
   Address_1                                       Address_2                 drive_time
0  1600 Pennsylvania Ave NW, Washington, DC 20500  2 15th St NW, Washington  7 minutes
1  400 Broad St, Seattle, WA 98109                 325 5th Ave N, Seattle    3 minutes
Your use of the pairwise function is unnecessary. Just wrap the ArcGIS code in a function that returns the time, and that way you can map the values onto your dataframe as a new column.
Also make sure that you import the time library, which is not noted in the ArcGIS documentation but is needed to run this.
def getTime(row):
    try:
        stop1_geocoded = geocoding.geocode(row.df_column_1)  # e.g. row.Address_1 for the frame above
        stop2_geocoded = geocoding.geocode(row.df_column_2)  # e.g. row.Address_2
        stops = '{0},{1}; {2},{3}'.format(stop1_geocoded[0]['attributes']['X'],
                                          stop1_geocoded[0]['attributes']['Y'],
                                          stop2_geocoded[0]['attributes']['X'],
                                          stop2_geocoded[0]['attributes']['Y'])
        route_layer = network.RouteLayer(route_service_url, gis=my_gis)
        result = route_layer.solve(stops=stops, return_directions=False, return_routes=True,
                                   output_lines='esriNAOutputLineNone', return_barriers=False,
                                   return_polygon_barriers=False, return_polyline_barriers=False)
        travel_time = result['routes']['features'][0]['attributes']['Total_TravelTime']
        time = "Total travel time is {0:.2f} min".format(travel_time)
        return time
    except RuntimeError:
        return

streets['travel_time'] = streets.apply(getTime, axis=1)  # 'streets' is this answer's frame; use your own (df2)
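If you want the numbers in the frame from the question directly, the same function only needs the column names swapped in and the result applied to df2. A minimal sketch under those assumptions (it returns the raw minutes as a number rather than a formatted string, and reuses the route_service layer created earlier):

def get_drive_time(row):
    try:
        stop1 = geocoding.geocode(row.Address_1)
        stop2 = geocoding.geocode(row.Address_2)
        stops = '{0},{1}; {2},{3}'.format(stop1[0]['attributes']['X'],
                                          stop1[0]['attributes']['Y'],
                                          stop2[0]['attributes']['X'],
                                          stop2[0]['attributes']['Y'])
        result = route_service.solve(stops=stops, return_directions=False, return_routes=True,
                                     output_lines='esriNAOutputLineNone', return_barriers=False,
                                     return_polygon_barriers=False, return_polyline_barriers=False)
        # Total_TravelTime is in minutes; return it as a float so it can be sorted or averaged later
        return result['routes']['features'][0]['attributes']['Total_TravelTime']
    except RuntimeError:
        return None

df2['drive_time'] = df2.apply(get_drive_time, axis=1)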
Related
I'm trying to filter out bogus locations from a column in a data frame. The column is filled with locations taken from tweets, and some of them aren't real; I am trying to separate those from the valid locations. Below is the code I have. However, the output is not right: it only ever returns France. I'm hoping someone can identify what I'm doing wrong, or suggest another approach. Let me know if I didn't explain it well enough. Also, I assign variables both outside and inside the function for testing purposes.
import pandas as pd
cn_csv = pd.read_csv("~/Downloads/cntry_list.csv") #this is just a list of every country along with respective alpha 2 and alpha 3 codes, see the link below to download csv
country_names = cn_csv['country']
results = pd.read_csv("~/Downloads/results.csv") #this is a dataframe with multiple columns, one being "source location" See edit below that displays data in "Source Location" column
src_locs = results["Source Location"]
locs_to_list = list(src_locs)
new_list = [entry.split(', ') for entry in locs_to_list]
def country_name_check(input_country_list):
    cn_csv = pd.read_csv("~/Downloads/cntrylst.csv")
    country_names = cn_csv['country']
    results = pd.read_csv("~/Downloads/results.csv")
    src_locs = results["Source Location"]
    locs_to_list = list(src_locs)
    new_list = [entry.split(', ') for entry in locs_to_list]
    valid_names = []
    tobe_checked = []
    for i in new_list:
        if i in country_names.values:
            valid_names.append(i)
        else:
            tobe_checked.append(i)
    return valid_names, tobe_checked

print(country_name_check(src_locs))
EDIT 1: Adding the link for the cntry_list.csv file. I downloaded the csv of the table data. https://worldpopulationreview.com/country-rankings/country-codes
Since I am unable to share a file on here, here is the "Source Location" column data:
Source Location
She/her
South Carolina, USA
Torino
England, UK
trying to get by
Bemidiji, MN
St. Paul, MN
Stockport, England
Liverpool, England
EH7
DLR - LAX - PDX - SEA - GEG
Barcelona
Curitiba
kent
Paris, France
Moon
Denver, CO
France
If your goal is to find and list country names, both valid and not, you may filter the initial results DataFrame:
# make list from unique values of Source Location that match values from country_names
valid_names = list(results[results['Source Location']
.isin(country_names)]['Source Location']
.unique())
# with ~ select unique values that don't match country_names values
tobe_checked = list(results[~results['Source Location']
.isin(country_names)]['Source Location']
.unique())
Trying that simpler approach should also resolve your unwanted result of only France being returned. However, the problem in your original code may come from how cntrylst is read outside of the function, as indicated by ScottC.
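As a quick sanity check on the filtering above, with the sample Source Location values listed earlier only the exact country-name match should land in valid_names (an illustration, assuming the country list uses full country names such as 'France'):

print(valid_names)    # -> ['France']
print(tobe_checked)   # -> ['She/her', 'South Carolina, USA', 'Torino', 'England, UK', ...]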
I have a pandas DataFrame called "orders" with approx. 100k entries containing address data (zip, city, country). For each entry, I would like to calculate the distance to a specific predefined address.
So far, I'm looping over the dataframe rows with a for-loop and using geopy to 1. get latitude and longitude values for each entry and 2. calculate the distance to my predefined address.
Although this works, it takes an awful lot of time (over 15 hours at an average of 2 iterations/second), and I assume that I haven't found the most efficient way yet. I did quite a lot of research and tried out different things like vectorization, but these alternatives did not seem to speed up the process (maybe because I didn't implement them correctly, as I'm not a very experienced Python user).
This is my code so far:
def get_geographic_information():
    latitude = destination_geocode.latitude
    longitude = destination_geocode.longitude
    destination_coordinates = (latitude, longitude)
    distance = round(geopy.distance.distance(starting_point_coordinates, destination_coordinates).km, 2)
    return latitude, longitude, distance
import geopy
from geopy.geocoders import Nominatim
import geopy.distance
from tqdm import tqdm  # needed for the progress bar in the loop below
orders["Latitude"] = ""
orders["Longitude"] = ""
orders["Distance"] = ""
geolocator = Nominatim(user_agent="Project01")
starting_point = "my_address"
starting_point_geocode = geolocator.geocode(starting_point, timeout=10000)
starting_point_coordinates = (starting_point_geocode.latitude, starting_point_geocode.longitude)
for index in tqdm(range(len(orders))):
    destination_zip = orders.loc[index, "ZIP"]
    destination_city = orders.loc[index, "City"]
    destination_country = orders.loc[index, "Country"]
    destination = destination_zip + " " + destination_city + " " + destination_country
    destination_geocode = geolocator.geocode(destination, timeout=15000)
    if destination_geocode is not None:
        geographic_information = get_geographic_information()
        orders.loc[index, "Latitude"] = geographic_information[0]
        orders.loc[index, "Longitude"] = geographic_information[1]
        orders.loc[index, "Distance"] = geographic_information[2]
    else:
        orders.loc[index, "Latitude"] = "-"
        orders.loc[index, "Longitude"] = "-"
        orders.loc[index, "Distance"] = "-"
From my previous research, I learned that the for-loop might be the problem, but I haven't managed to replace it yet. As this is my first question here, I'd appreciate any constructive feedback. Thanks in advance!
The speed of your script is likely limited by using Nominatim. They throttle the speed to 1 request per second as per this link:
https://operations.osmfoundation.org/policies/nominatim/
The only way to speed this script up would be to find a different service that allows bulk requests. Geopy has a list of geocoding services that it currently supports. Your best bet would be to look through this list and see if you find a service that handles bulk requests (e.g. GoogleV3). That would either allow you to make requests in batches or use a distributed process to speed things up.
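As a rough illustration of that swap, here is a minimal sketch using geopy's GoogleV3 geocoder in place of Nominatim; the API key is a placeholder, the column names come from the question, and it still issues one request per row, just without the 1 request/second cap:

import geopy.distance
from geopy.geocoders import GoogleV3

geolocator = GoogleV3(api_key="YOUR_API_KEY")  # placeholder key - an assumption

def distance_km(row):
    # same "ZIP City Country" query string as in the question
    query = str(row["ZIP"]) + " " + str(row["City"]) + " " + str(row["Country"])
    location = geolocator.geocode(query, timeout=30)
    if location is None:
        return None
    return round(geopy.distance.distance(starting_point_coordinates,
                                         (location.latitude, location.longitude)).km, 2)

orders["Distance"] = orders.apply(distance_km, axis=1)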
I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.
My code is:
import pandas as pd
from fuzzywuzzy import fuzz, process
### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")
#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")
### Variable for accuracy scoring
accuracy = 0
for index, value in df["address"].iteritems():
    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.tolist()[0]
    ### if each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )
Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will output an Excel file with the correct values and a background colour indicating which values were wrong and changed.
I don't believe fuzzywuzzy has a method that allows you to pull the index, value and ratio into one tuple - just the value and ratio of the match.
Hopefully the below code (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is nowhere near as big.
You can pull the csv text from the links in the comments and run it and see what could work on your larger sample.
For five addresses in the reference frame and 100 contacts in the other its execution timings are:
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms
The below code should be quicker than .iteritems() etc.
Code:
# %%time
import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib
# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')
# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)

# given current row of ref_df (via apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=60)
    if o is not None:
        return o[1]
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())
# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)
# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)
# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()
# add difflib scoring for a bit of interest.
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x : difflib.SequenceMatcher\
(None, x['ref_address'], x['matched_address']).ratio(),axis=1)
# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',\
'fuzzywuzzy_score', 'difflib_score']]
# see what we've got
print(compare_df)
# remember: correct_address and ref_address are copies
# so just pick one to compare to matched_address
correct_address ref_address matched_address \
0 86 Nw 66th Street #8673 86 Nw 66th Street #8673 86 Nw 66th St #8673
1 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230
2 6649 N Blue Gum St 6649 N Blue Gum St 6649 N Blue Gum St
3 59 n Groesbeck Hwy 59 n Groesbeck Hwy 59 N Groesbeck Hwy
4 1 Central Avenue 1 Central Avenue 1 Central Ave
fuzzywuzzy_score difflib_score
0 90 0.904762
1 100 1.000000
2 100 1.000000
3 100 0.944444
4 90 0.896552
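If you still want the single accuracy figure from your original loop, it can be read off compare_df afterwards. This is just a sketch under the assumption that averaging each reference address's best-match score (scaled to a 0-1 ratio) is the aggregation you want:

# mean best-match score across the reference addresses, scaled to 0-1
accuracy = (compare_df['fuzzywuzzy_score'] / 100).mean()
print('mean match accuracy: {0:.2f}'.format(accuracy))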
First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows the 'city_number', 'city' (which is a name), and the number of cafes in the same city. So it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I have tried to use groupby but the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']]*5, [['bean_dream', 'boston', '3456']]*4, \
[['coffee_today', 'jersey', '7643']]*3, [['coffee_today', 'DC', '8902']]*3, \
[['starbucks', 'nowwhere', '2674']]*2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
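If you only need the exact three columns asked for ('city_number', 'city', 'number_of_cafe'), another option is a single groupby over both identifying columns. This is a minimal sketch assuming cafe_df_merged has one row per cafe:

number_of_cafes = (
    cafe_df_merged[['city_number', 'city']]
    .groupby(['city_number', 'city'])      # keep both identifying columns
    .size()                                # count rows (cafes) per city
    .reset_index(name='number_of_cafe')    # back to a plain 3-column frame
)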
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
My task is to create a friendship matrix (user-user matrix), whose values are 1 if the users are friends and 0 if not. My .csv file has 1.5 million rows, so I created the following little csv to test my algorithm:
user_id friends
Elena Peter, John
Peter Elena, John
John Elena, Peter, Chris
Chris John
For this little csv, my code works well:
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import sparse
sns.set(style="darkgrid")
user_filepath = 'H:\\YelpData\\test.csv' # this is my little test file
df = pd.read_csv(user_filepath, usecols=['user_id','friends'])
def Convert_String_To_List(string):
    if string != "None":
        li = list(string.split(", "))
    else:
        li = []
    return li

friend_map = {}
for i in range(len(df)):  # storing friendships in map
    friend_map[df['user_id'][i]] = Convert_String_To_List(df['friends'][i])
users = sorted(friend_map.keys())
user_indices = dict(zip(users, range(len(users))))  # giving indices for users

# and now the sparsity matrix:
row_ind = []  # row indices, where the value is 1
col_ind = []  # col indices, where the value is 1
data = []     # value 1
for user in users:
    for friend in friend_map[user]:
        row_ind.append(user_indices[user])
        col_ind.append(user_indices[friend])
for i in range(len(row_ind)):
    data.append(1)
mat_coo = sparse.coo_matrix((data, (row_ind, col_ind)))
friend_matrix = mat_coo.toarray()  # this friendship matrix is good for the little csv file
But when I try this code on my large (1.5 million row) csv, I get a memory error when storing the friendships in the map (in the for loop).
Is there any solution for this?
I think you are approaching this the wrong way; you should use pandas and vectorized operations as much as possible to handle the large data you have.
This is a complete pandas approach depending on your data.
import pandas as pd

# df1 is assumed to be the question's frame indexed by user_id, e.g. df.set_index('user_id')
_series = df1.friends.apply(lambda x: pd.Series(x.split(', '))).unstack().dropna()
data = pd.Series(_series.values, index=_series.index.droplevel(0))
pd.get_dummies(data).groupby('user_id').sum()
Output
Chris Elena John Peter
user_id
Chris 0 0 1 0
Elena 0 0 1 1
John 1 1 0 1
Peter 0 1 1 0
BTW, this can be further optimized; by using pandas you avoid memory-expensive for loops, and you can use chunksize to chunk your data for further optimization.
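A rough sketch of the chunksize idea (the chunk size is an arbitrary assumption, and process_chunk stands in for whatever per-chunk work you do):

import pandas as pd

def process_chunk(chunk):
    # e.g. split the friends strings for just this slice of rows
    return chunk['friends'].str.split(', ')

results = []
for chunk in pd.read_csv(user_filepath, usecols=['user_id', 'friends'], chunksize=100000):
    results.append(process_chunk(chunk))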
I think you should not store the strings repeatedly. You need to make a list of names and store the index of each name, not the name itself. This part of the code:
friend_map[df['user_id'][i]] = Convert_String_To_List(df['friends'][i])
can be changed. If you have a list of users,
users = [....]  # read from csv
friend_list = Convert_String_To_List(df['friends'][i])
friend_list_idxs = Get_Idx_of_Friends(users, friend_list)  # look up indices in the users list
friend_map[df['user_id'][i]] = friend_list_idxs
This way, you will not need to store the same string repeatedly.
Let's say you have 10 million friend relationships; you will then only need to store about 10 MB of memory.
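For completeness, here is one possible way to flesh that out, reusing Convert_String_To_List from the question. Get_Idx_of_Friends is the hypothetical helper named above, sketched here with a name-to-index dict instead of the raw users list so each lookup stays O(1):

def Get_Idx_of_Friends(user_indices, friend_list):
    # map friend names to integer indices; names not in the lookup are skipped
    return [user_indices[name] for name in friend_list if name in user_indices]

users = list(df['user_id'])                          # read from the csv as before
user_indices = dict(zip(users, range(len(users))))   # name -> integer index
friend_map = {}
for i in range(len(df)):
    friend_list = Convert_String_To_List(df['friends'][i])
    friend_map[user_indices[df['user_id'][i]]] = Get_Idx_of_Friends(user_indices, friend_list)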