I have a dataframe that has two columns, Hospital name and Address, and I want to iterate through each address to find the latitude and longitude. My code seems to be taking the first row in the dataframe and I can't seem to select the address to find the coordinates.
import pandas
from geopy.geocoders import Nominatim

geolocator = Nominatim()
for index, item in df.iterrows():
    location = geolocator.geocode(item)
    df["Latitude"].append(location.latitude)
    df["Longitude"].append(location.longitude)
Here is the code I used to scrape the website. Copy and run this and you'll have the data set.
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np
r = requests.get("https://www.privatehealth.co.uk/hospitals-and-clinics/orthopaedic-surgery/?offset=300")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all = soup.find_all(["div"], {"class": "col-9"})
names = []
for item in all:
    d = {}
    d["Hospital Name"] = item.find(["h3"], {"class": "mb6"}).text.replace("\n", "")
    d["Address"] = item.find(["p"], {"class": "mb6"}).text.replace("\n", "")
    names.append(d)
df = pandas.DataFrame(names)
df = df[['Hospital Name', 'Address']]
df
Currently the data looks like (one hospital example):
Hospital Name |Address
Fulwood Hospital|Preston, PR2 9SZ
The final output that I'm trying to achieve looks like this:
Hospital Name |Address | Latitude | Longitude
Fulwood Hospital|Preston, PR2 9SZ|53.7589938|-2.7051618
Seems like there are a few issues here. Using data from the URL you provided:
df.head()
Hospital Name Address
0 Fortius Clinic City London, EC4N 7BE
1 Pinehill Hospital - Ramsay Health Care UK Hitchin, SG4 9QZ
2 Spire Montefiore Hospital Hove, BN3 1RD
3 Chelsea & Westminster Hospital London, SW10 9NH
4 Nuffield Health Tunbridge Wells Hospital Tunbridge Wells, TN2 4UL
(1) If your data frame column names really are Hospital name and Address, then you need to use item.Address in the call to geocode().
Just using item will give you both Hospital name and Address.
for index, item in df.iterrows():
    print(f"index: {index}")
    print(f"item: {item}")
    print(f"item.Address only: {item.Address}")
# Output:
index: 0
item: Hospital Name Fortius Clinic City
Address London, EC4N 7BE
Name: 0, dtype: object
item.Address only: London, EC4N 7BE
...
(2) You noted that your data frame only has two columns. If that's true, you'll get a KeyError when you try to perform operations on df["Latitude"] and df["Longitude"], because they don't exist.
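If you want to stay with iterrows(), one fix (a sketch, assuming the scraped df from above and the geolocator already created) is to collect the coordinates into lists and create both columns at the end:
lats, lons = [], []
for index, item in df.iterrows():
    location = geolocator.geocode(item.Address)
    # geocode() returns None for addresses it can't resolve
    lats.append(location.latitude if location else None)
    lons.append(location.longitude if location else None)
df["Latitude"] = lats
df["Longitude"] = lons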
(3) Using apply() on the Address column might be clearer than iterrows().
Note that this is a stylistic point, and debatable. (The first two points are actual errors.)
For example, using the provided URL:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
tmp = df.head().copy()
latlon = tmp.Address.apply(lambda addr: geolocator.geocode(addr))
tmp["Latitude"] = [x.latitude for x in latlon]
tmp["Longitude"] = [x.longitude for x in latlon]
Output:
Hospital Name Address \
0 Fortius Clinic City London, EC4N 7BE
1 Pinehill Hospital - Ramsay Health Care UK Hitchin, SG4 9QZ
2 Spire Montefiore Hospital Hove, BN3 1RD
3 Chelsea & Westminster Hospital London, SW10 9NH
4 Nuffield Health Tunbridge Wells Hospital Tunbridge Wells, TN2 4UL
Latitude Longitude
0 51.507322 -0.127647
1 51.946413 -0.279165
2 50.840871 -0.180561
3 51.507322 -0.127647
4 51.131528 0.278068
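One practical note: Nominatim's usage policy limits request rates, and newer geopy versions require an explicit user_agent. A sketch using geopy's built-in RateLimiter (the agent name here is arbitrary):
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="hospital-geocoder")  # any identifying string
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # throttle requests
latlon = tmp.Address.apply(geocode)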
Related
How to categorize and create new columns from full addresses?
The address string is comma-separated and uses these keywords:
* district
* city
* borough
* village
* municipality
* county
The source full address format can look like this:
col1;col2;fulladdress
data1_1;data2_1;Some district, Cityname, Some county
data1_2;data2_2;Another village, Another municipality, Another county
data1_3;data2_3;Third city, Third county
data1_4;data2_4;Forth borough, Forth municipality, Forth county
There is one peculiarity with one city in particular. This city is called "Clause", and sometimes it is written out as "Clause city" and sometimes it is just "Clause" in the full address string.
For example:
Clause city, Third county
Seventh district, Clause, Seventh municipality
So, I want to keep only one format, "Clause city", to avoid duplicate output: if "Clause" appears alone in the full address string, it should be renamed to "Clause city" in the CSV.
The source data file is called data.csv and exported.csv for the categorized version.
All I have is this:
import pandas as pd
df = pd.read_csv('data.csv', delimiter=';')
address = df.fulladdress.str.split(',', expand=True)
district = df[df.fulladdress.str.match('district')]
city = df[df.fulladdress.str.match('city')]
borough = df[df.fulladdress.str.match('borough')]
village = df[df.fulladdress.str.match('village')]
municipality = df[df.fulladdress.str.match('municipality')]
county = df[df.fulladdress.str.match('county')]
df.to_csv('exported.csv', sep=';', encoding='utf8', index=True)
print ("Done!")
If there is only one problem city, I think you could use replace?
raw csv data:
Some district, Cityname, Some county
Another village, Another municipality, Another county
Third city, Third county
Forth borough, Forth municipality, Forth county
Clause city, Third county
Seventh district, Clause, Seventh municipality
replace solution:
import pandas as pd

df = pd.read_csv('clause.csv', names=['district', 'city', 'country'], sep=',')
# need to strip whitespace for replace to work
df = df.apply(lambda x: x.str.strip())
df.replace('Clause', 'Clause city', inplace=True)
output:
district city country
0 Some district Cityname Some county
1 Another village Another municipality Another county
2 Third city Third county NaN
3 Forth borough Forth municipality Forth county
4 Clause city Third county NaN
5 Seventh district Clause city Seventh municipality
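To write the categorized version back out to exported.csv as described in the question, something like this should do:
df.to_csv('exported.csv', sep=';', encoding='utf8', index=False)  # index=True if you want row numbers, as in your original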
I want to retrieve the tables on the following website and store them in a pandas dataframe: https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs
However, the third table on the page returns an empty dataframe with all the table's data stored in tuples as the column headers:
Empty DataFrame
Columns: [(Service Providers, State of Colorado), (Cuban - Haitian Program, $0), (Refugee Preventive Health Program, $150,000.00), (Refugee School Impact, $450,000), (Services to Older Refugees Program, $0), (Targeted Assistance - Discretionary, $0), (Total FY, $600,000)]
Index: []
Is there a way to "flatten" the tuple headers into header + values, then append this to a dataframe made up of all four tables? My code is below -- it has worked on other similar pages but keeps breaking because of this table's formatting. Thanks!
funds_df = pd.DataFrame()
url = 'https://www.acf.hhs.gov/programs/orr/resource/ffy-2011-12-state-of-colorado-orr-funded-programs'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
year = url.split('ffy-')[1].split('-orr')[0]
tables = page.content
df_list = pd.read_html(tables)
for df in df_list:
    df['URL'] = url
    df['YEAR'] = year
    funds_df = funds_df.append(df)
For this site, there's no need for BeautifulSoup or requests: pandas.read_html creates a list of DataFrames, one for each <table> at the URL.
import pandas as pd
url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'
# read the url
dfl = pd.read_html(url)
# see each dataframe in the list; there are 4 in this case
for i, d in enumerate(dfl):
    print(i)
    display(d)  # display works in Jupyter; otherwise use print
    print('\n')
dfl[0]
Service Providers Cash and Medical Assistance* Refugee Social Services Program Targeted Assistance Program TOTAL
0 State of Colorado $7,140,000 $1,896,854 $503,424 $9,540,278
dfl[1]
WF-CMA 2 RSS TAG-F CMA Mandatory 3 TOTAL
0 $3,309,953 $1,896,854 $503,424 $7,140,000 $9,540,278
dfl[2]
Service Providers Refugee School Impact Targeted Assistance - Discretionary Services to Older Refugees Program Refugee Preventive Health Program Cuban - Haitian Program Total
0 State of Colorado $430,000 $0 $100,000 $150,000 $0 $680,000
dfl[3]
Volag Affiliate Name Projected ORR MG Funding Director
0 CWS Ecumenical Refugee & Immigration Services $127,600 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
1 ECDC ECDC African Community Center $308,000 Jennifer Guddiche 5250 Leetsdale Drive Denver, CO 80246 303-399-4500
2 EMM Ecumenical Refugee Services $191,400 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
3 LIRS Lutheran Family Services Rocky Mountains $121,000 Floyd Preston 132 E Las Animas Colorado Springs, CO 80903 719-314-0223
4 LIRS Lutheran Family Services Rocky Mountains $365,200 James Horan 1600 Downing Street, Suite 600 Denver, CO 80218 303-980-5400
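If a malformed table still comes back "empty" with its data fused into tuple column headers, as in the question, one way to flatten it (a sketch, not part of the original answer) is to demote the headers back into a data row:
bad = dfl[2]
if all(isinstance(c, tuple) for c in bad.columns):
    headers, values = zip(*bad.columns)  # each column is a (header, value) pair
    dfl[2] = pd.DataFrame([values], columns=list(headers))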
I have a long string of text that I would like to convert to a dataframe to analyze. Please see below for a sample of the data. I would like the columns to be "Facility", "Street", "City", "Phone", and "Store Hours".
string = "AlaskaUSCG Base Ketchikan 1300 Stedman Street Ketchikan, AK (907) 228-0250 Mon-Fri 7:30am-5pm | Sat 10am-4pm | Closed Sunday USCG Base Kodiak Albatros Avenue, Building 26 (2nd Floor) Kodiak, AK (907) 487-5773 USCG Base Kodiak Albatros Avenue, Building 26 (1st Floor) Kodiak, AK (907) 487-5773 Mon-Fri: 7am-9pm | Sat: 9am-9pm |"
I have used StringIO to convert it to a dataframe but it converts it into a dataframe with 0 rows and 1000 columns. Instead I would like the columns I mentioned above and rows for each store.
I expect it to look like this with the data populated as rows:
Facility Street City Phone
Alaska USCG Base Ketchikan 1300 Stedman Street Ketchikan, AK (907) 228 0250
You may use simple web-scraping techniques, such as bs4 and requests.
import requests
import bs4

r = requests.get(URL)  # URL = the page the string was scraped from
b = bs4.BeautifulSoup(r.text, 'html.parser')
addresses = []
for val in b.find_all(name='p'):
    s = list(val.stripped_strings)
    # skip hours-only paragraphs and re-join the remaining fragments
    if s and not s[0].startswith('HOURS'):
        addresses.append(' '.join(s[:-1]))
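To get from the scraped strings to the tabular layout you describe, a rough sketch (the phone pattern is an assumption about the data, and the remaining column split is left as an exercise):
import re
import pandas as pd

rows = []
for a in addresses:
    m = re.search(r'\(\d{3}\)\s*\d{3}-\d{4}', a)  # US-style phone number
    rows.append({'Facility': a, 'Phone': m.group(0) if m else None})
df = pd.DataFrame(rows)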
I have the following dataframe:
d = {'Postcode': ['M3A', 'M4A', 'M5A', 'M6A', 'M9A', 'M1B'],
     'Borough': ['North York', 'Downtown Toronto', 'Etobicoke',
                 'Scarborough', 'East York', 'York'],
     'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Harbourfront', 'Regent Park',
                       'Lawrence Heights', 'Lawrence Manor']}
post_df = pd.DataFrame(data=d)
Which yields something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A Downtown Toronto Victoria Village
2 M5A Etobicoke Harbourfront
3 M6A Scarborough Regent Park
4 M9A East York Lawrence Heights
5 M1B York Lawrence Manor
I want to get all the latitudes and longitudes for each postal code.
I figured out this code to do so:
import geocoder

# initialize your variable to None
lat_lng_coords = None
# loop until you get the coordinates
while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code_from_df))
    lat_lng_coords = g.latlng
latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
Now my question is: using the previous code, how can I get the latitude and longitude for each postal code and add them to two new columns in this existing df, called 'Latitude' and 'Longitude', using a single loop instead of searching for each postal code's coordinates one by one?
Thank you very much in advance.
You can use df.apply. Something like:
post_df['Latitude'], post_df['Longitude'] = zip(*post_df['Postcode'].apply(get_geocoder))
Where get_geocoder can be defined as mentioned by @Ankur.
Hi, you need to define your geocoder function and loop it over your df. The function takes one Postcode value at a time, fetches the coordinates from geocoder, and the loop stores them in two new columns, Latitude and Longitude. See below:
import geocoder

def get_geocoder(postal_code_from_df):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code_from_df))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude

for i in range(len(post_df)):
    # use .loc for assignment; chained indexing like post_df['Latitude'][i] may silently fail
    post_df.loc[i, 'Latitude'], post_df.loc[i, 'Longitude'] = get_geocoder(post_df.iloc[i]['Postcode'])
This should work for you.
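A loop-free variant of the same idea (a sketch; note that geocoder.google requires a valid API key nowadays, and pandas is assumed imported as pd):
post_df[['Latitude', 'Longitude']] = post_df['Postcode'].apply(
    lambda pc: pd.Series(get_geocoder(pc)))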
I have one pandas dataframe composed of the names of the world's cities as well as the countries those cities belong to,
city.head(3)
city country
0 Qal eh-ye Now Afghanistan
1 Chaghcharan Afghanistan
2 Lashkar Gah Afghanistan
and another data frame consisting of addresses of the world's universities, which is shown below:
df.head(3)
university
0 Inst Huizhou, Huihzhou 516001, Guangdong, Peop...
1 Guangxi Acad Sci, Nanning 530004, Guangxi, Peo...
2 Shenzhen VisuCA Key Lab SIAT, Shenzhen, People...
The locations of cities' names are irregularly distributed across rows. I would like to match the city names with the addresses of world's universities. That is, I would like to know which city each university is located in. Hopefully, the city name matched is shown in the same row as the address of each university.
I've tried the following, and it doesn't work because the locations of cities are irregular across the rows.
df['university'].str.split(',').str[0]
I would suggest using apply:
city_list = city['city'].tolist()

def match_city(row):
    for c in city_list:
        if c in row['university']:
            return c
    return None

df['city'] = df.apply(match_city, axis=1)
I assume the university address data is clean enough. If you want more advanced matching, you can adjust the match_city function.
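For example, one possible adjustment (my assumption of what "more advanced" could mean) is to match on word boundaries, so a short city name doesn't fire inside a longer word:
import re

def match_city(row):
    for c in city_list:
        if re.search(r'\b' + re.escape(c) + r'\b', row['university']):
            return c
    return None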
In order to deal with the inconsistent structure of your strings, a good solution is to use regular expressions. I mocked up some data based on your description and created a function to capture the city from the strings.
In my solution I used numpy to output NaN values when there wasn't a match, but you could easily just make it a blank string. I also included a test case where the input was blank in order to display the NaN result.
import re
import numpy as np
import pandas as pd

data = ["Inst Huizhou, Huihzhou 516001, Guangdong, People's Republic of China",
        "Guangxi Acad Sci, Nanning 530004, Guangxi, People's Republic of China",
        "Shenzhen VisuCA Key Lab SIAT, Shenzhen, People's Republic of China",
        "New York University, New York, New York 10012, United States of America",
        ""]
df = pd.DataFrame(data, columns=['university'])

def extract_city(row):
    # capture the text between the first and second commas
    match = re.match(r'^[^,]*,([^,]*),', row)
    if match:
        city = re.sub(r'\d+', '', match.group(1)).strip()  # strip postal codes
    else:
        city = np.nan
    return city

df.university.apply(extract_city)
Here's the output:
0 Huihzhou
1 Nanning
2 Shenzhen
3 New York
4 NaN
Name: university, dtype: object
My suggestion: after some pre-processing that reduces each address to city-level information (it doesn't need to be exact, but try your best, e.g. by removing numbers), merge the dataframes based on text similarity.
You may consider text similarity measures like Levenshtein distance or Jaro-Winkler, which are commonly used to match words.
Here is example for text similarity:
class DLDistance:
    def __init__(self, s1):
        self.s1 = s1
        self.d = {}
        self.lenstr1 = len(self.s1)
        for i in range(-1, self.lenstr1 + 1):
            self.d[(i, -1)] = i + 1

    def distance(self, s2):
        lenstr2 = len(s2)
        for j in range(-1, lenstr2 + 1):
            self.d[(-1, j)] = j + 1
        for i in range(self.lenstr1):
            for j in range(lenstr2):
                if self.s1[i] == s2[j]:
                    cost = 0
                else:
                    cost = 1
                self.d[(i, j)] = min(
                    self.d[(i - 1, j)] + 1,         # deletion
                    self.d[(i, j - 1)] + 1,         # insertion
                    self.d[(i - 1, j - 1)] + cost,  # substitution
                )
                if i and j and self.s1[i] == s2[j - 1] and self.s1[i - 1] == s2[j]:
                    self.d[(i, j)] = min(self.d[(i, j)], self.d[(i - 2, j - 2)] + cost)  # transposition
        return self.d[(self.lenstr1 - 1, lenstr2 - 1)]

if __name__ == '__main__':
    base = u'abs'
    cmpstrs = [u'abs', u'sdfbasz', u'asdf', u'hfghfg']
    dl = DLDistance(base)
    for s in cmpstrs:
        print("damerau_levenshtein")
        print(dl.distance(s))
Note, though, that this has high computational complexity, since it calculates N*M distance measures, where N is the number of rows in the first dataframe and M the number of rows in the second. (To reduce the cost, you can truncate the comparison set by only comparing rows that share the same first character.)
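A minimal sketch of that truncation idea (assuming the city dataframe from the question): bucket the known city names by first letter, then only run the distance measure against the matching bucket:
from collections import defaultdict

buckets = defaultdict(list)
for name in city['city']:
    buckets[name[0].lower()].append(name)

def candidate_cities(token):
    # only compare against cities sharing the token's first letter
    return buckets.get(token[0].lower(), []) if token else []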
levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
jaro-winkler: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
I think one simple idea would be to create a mapping from any word (or sequence of words) in an address to the full address that word is part of, on the assumption that one of those address words is the city. In a second step we match this against the set of known cities you have, and anything that is not a known city gets discarded.
A mapping from each single word to the address is as simple as:
def address_to_dict(address):
    return {word: address for word in address.split(",")}
And we can easily extend this to include the set of bi-grams, tri-grams, and so on, so that names encoded in several words are also collected. See a discussion here: Elegant N-gram Generation in Python
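A possible n-gram extension of address_to_dict (a sketch; the comma tokenization and join separator are assumptions):
def address_to_dict(address, max_n=3):
    tokens = [t.strip() for t in address.split(",")]
    # map every 1-gram, 2-gram, ... up to max_n-gram onto the full address
    return {", ".join(tokens[i:i + n]): address
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}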
We can then apply this to every address we have to obtain one grand mapping from any word to the full address:
word_to_address_mapping = pd.DataFrame(df.university.apply(address_to_dict).tolist()).stack()
word_to_address_mapping = pd.DataFrame(word_to_address_mapping, columns=["address"])
word_to_address_mapping.index = word_to_address_mapping.index.droplevel(level=0)
word_to_address_mapping
This yields something like this:
All you have to do then is join this with the actual city list you have: this will automatically discard any entry in word_to_address_mapping which is not a known city, and provide a mapping between university addresses and their cities.
# the outer join here should ensure that several universities in the
# same city do not overwrite each other
pd.merge(left=word_to_address_mapping, right=city,
         left_index=True, right_on="city",
         how="outer")
The function below prevents partial matches. Country information is also considered while matching cities. To use this function, the university dataframe needs to be split into lists, so that every piece of an address becomes its own string.
In [22]: def get_city(univ_name_split):
   ....:     # find the country from the university address
   ....:     country = None
   ....:     for name in univ_name_split:
   ....:         if name in city['country'].values:
   ....:             country = name
   ....:             break
   ....:     if country:
   ....:         cities = city[city.country == country].city.values
   ....:     else:
   ....:         cities = city['city'].values
   ....:     # find the city from the university address
   ....:     for name in univ_name_split:
   ....:         if name in cities:
   ....:             return name
   ....:     return None
   ....:
In [1]: import pandas as pd
In [2]: city = pd.read_csv('city.csv')
In [3]: df = pd.read_csv('university.csv')
In [4]: # splitting university name and address
In [5]: df_split = df['university'].str.split(',')
In [6]: df_split = df_split.apply(lambda x:[i.strip() for i in x])
In [10]: df
Out[10]:
university
0 Kongu Engineering College, Perundurai, Erode, ...
1 Anna University - Guindy, Chennai, India
2 Birla Institute of Technology and Science, Pil...
In [11]: df_split
Out[11]:
0 [Kongu Engineering College, Perundurai, Erode,...
1 [Anna University - Guindy, Chennai, India]
2 [Birla Institute of Technology and Science, Pi...
Name: university, dtype: object
In [12]: city
Out[12]:
city country
0 Bangalore India
1 Chennai India
2 Coimbatore India
3 Delhi India
4 Erode India
# This function is a shorter version of the one above
In [14]: def get_city(univ_name_split):
   ....:     for name in univ_name_split:
   ....:         if name in city['city'].values:
   ....:             return name
   ....:     return None
   ....:
In [15]: df['city'] = df_split.apply(get_city)
In [16]: df
Out[16]:
university city
0 Kongu Engineering College, Perundurai, Erode, ... Erode
1 Anna University - Guindy, Chennai, India Chennai
2 Birla Institute of Technology and Science, Pil... None
I've created a tiny library for my projects, especially for fuzzy joins. It might not be the fastest solution, but it may help; feel free to use it.
Link to my GitHub repo