Identify US county from latitude and longitude using Python

I am using the code below to identify the US county. The data is taken from Yelp, which provides lat/lon coordinates.
id  latitude   longitude
1   40.017544  -105.283348
2   45.588906  -122.593331
import pandas
df = pandas.read_json("/Users/yelp/yelp_academic_dataset_business.json", lines=True, encoding='utf-8')
# Identify county
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="http")
df['county'] = geolocator.reverse(df['latitude'],df['longitude'])
The error was "TypeError: reverse() takes 2 positional arguments but 3 were given".

Nominatim.reverse takes a single coordinate pair; the issue is that you are passing it two pandas DataFrame columns as separate positional arguments (hence the "3 were given": self plus the two columns). df['latitude'] here refers to the entire column in your data, not just one value, and since geopy is independent of pandas, it doesn't support processing an entire column and instead just sees that the input isn't a valid coordinate.
Instead, try looping through the rows:
county = []
for row in range(len(df)):
    county.append(geolocator.reverse((df['latitude'][row], df['longitude'][row])))
(Note the double parentheses: the latitude and longitude are passed together as a single tuple.)
Then, insert the column into the dataframe:
df.insert(index, 'county', county, True)
(index is the column position you want; the boolean at the end, allow_duplicates, indicates that a duplicate column name is allowed.)
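For a larger table, the same loop can be wrapped in a helper and driven by DataFrame.apply. A minimal sketch, assuming the reverse result carries a raw['address'] dict with a 'county' key (which Nominatim includes by default); lookup_county is a hypothetical name, and the function takes the reverse callable as a parameter so the parsing can be seen in isolation:

```python
import pandas as pd

df = pd.DataFrame({'latitude': [40.017544, 45.588906],
                   'longitude': [-105.283348, -122.593331]})

def lookup_county(lat, lon, reverse):
    # reverse is expected to behave like geolocator.reverse: it takes one
    # (lat, lon) tuple and returns a location with a .raw['address'] dict
    location = reverse((lat, lon))
    return location.raw['address'].get('county')

# Wiring it up (geolocator as in the snippet above):
# df['county'] = df.apply(
#     lambda row: lookup_county(row['latitude'], row['longitude'],
#                               geolocator.reverse),
#     axis=1)
```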


Pandas apply method: get an index label from a pivot_table

I am using a well-known dataset as an example: the most popular baby names given to newborns in New York City, broken down by ethnicity. It is available at this address: "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv"
I have been using Pandas for a few months, and still have issues with the pivot_table.
I wanted to know, for each year, the most popular first name, and I did this (it works):
import pandas as pd
url = "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv"
df = pd.read_csv(url)
pt = df.pivot_table(index="Child's First Name", columns="Year of Birth", values="Gender", aggfunc="count", fill_value=0, margins=True)
The variable pt gives me the first names as rows and the years as columns; the values are the number of times each first name was given in a given year.
Now, I want to do the opposite: from a value over one year, find the corresponding index (therefore the first name).
For example, I want to know which first names were given 4 times in 2015.
To do this I do:
condition = pt[2015] == 4
result = pt[condition]
print(result)
Now I want to use an apply function that returns, for each row, the first name in question.
I did this, but it doesn't work:
pt["First Name First Letter"] = pt.apply(lambda x: x.index[0], axis=1)
I definitely want to use the apply function, because I feel there is always something that works differently when dealing with a pivot_table...
Who could help me, please?
Here is the incorrect result I get.
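The filtered index already holds the first names, and inside apply(axis=1) each row arrives as a Series whose .name is the row's index label, whereas x.index[0] gives the first column label (a year). A small sketch on a toy table of the same shape as the question's (not the real NYC data):

```python
import pandas as pd

# Toy pivot table: first names as rows, years as columns, counts as values
raw = pd.DataFrame({"Child's First Name": ['Emma', 'Emma', 'Liam', 'Noah'],
                    'Year of Birth': [2015, 2015, 2015, 2016],
                    'Gender': ['F', 'F', 'M', 'M']})
pt = raw.pivot_table(index="Child's First Name", columns='Year of Birth',
                     values='Gender', aggfunc='count', fill_value=0)

# Names given exactly twice in 2015: just read the filtered index
names = pt[pt[2015] == 2].index.tolist()   # ['Emma']

# Per-row first name via apply: use x.name, not x.index[0]
labels = pt.apply(lambda x: x.name, axis=1)
```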

Is there a way to speed up querying latitude-longitude of postcodes on a large dataframe using pgeocode?

I have a dataframe of around 100k rows with postcodes and country codes. I would like to get the latitude and longitude of each location and save them in two new columns. I have working code on a sample of the dataframe (e.g. 100 rows), but running it on the whole dataframe takes very long (>1 hour). I am new to Python and suspect there should be a faster way, in terms of:
For a given postcode and country_code, I query twice: once for latitude and once for longitude. I am fairly sure I should not do that (i.e. I could make a single query per row and fill both columns from it).
The way I define the functions get_lat(pcode, country) and get_long(pcode, country) and apply them to the dataframe is not efficient.
An example of my code is below.
import pgeocode
import numpy as np
import pandas as pd
# Sample data
df = pd.DataFrame({'postcode': ['3011', '3083', '3071', '2660', '9308', '9999'],
                   'country_code': ['NL', 'NL', 'NL', 'BE', 'BE', 'DE']})

# There are blank postcodes and postcodes for which pgeocode cannot return a
# value (e.g. the last row in the sample dataframe), so I am using try-except:

# function to get latitude
def get_lat(pcode, country):
    try:
        nomi = pgeocode.Nominatim(country)
        return nomi.query_postal_code(pcode).latitude
    except:
        return np.NaN

# function to get longitude
def get_long(pcode, country):
    try:
        nomi = pgeocode.Nominatim(country)
        return nomi.query_postal_code(pcode).longitude
    except:
        return np.NaN

# Find and create latitude-longitude columns based on postcode (ex: 5625)
# and country of postcode (ex: NL)
df['latitude'] = np.vectorize(get_lat)(df['postcode'], df['country_code'])
df['longitude'] = np.vectorize(get_long)(df['postcode'], df['country_code'])
As an alternative solution, I downloaded the txt files from this website: http://download.geonames.org/export/zip/
After downloading the files, it is simply a matter of importing the txt file and joining. It is much faster but static, i.e. you work with a snapshot of the postcode database from an earlier point in time.
Another advantage is that you can check the files and inspect the format of the postcodes. While using pgeocode, it is harder to keep track of the accepted postcode format and understand why queries return null.
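The join described above can be sketched with a toy lookup table standing in for the downloaded GeoNames file (column names follow the GeoNames readme; the sample coordinates are made up for illustration):

```python
import pandas as pd

# Toy stand-in for a GeoNames postcode snapshot
geonames = pd.DataFrame({
    'country_code': ['NL', 'BE'],
    'postal_code':  ['3011', '2660'],
    'latitude':     [51.92, 51.19],
    'longitude':    [4.48, 4.43],
})

df = pd.DataFrame({'postcode': ['3011', '2660', '9999'],
                   'country_code': ['NL', 'BE', 'DE']})

# One left merge replaces all per-row queries; unknown postcodes become NaN
df = df.merge(geonames,
              left_on=['country_code', 'postcode'],
              right_on=['country_code', 'postal_code'],
              how='left').drop(columns='postal_code')
```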

How to perform an Excel INDEX MATCH equivalent in Python

I have a question about how to perform the equivalent of returning a value with Excel's INDEX MATCH functions in Python.
As an Excel user doing data analytics and manipulation on large datasets, I have moved to Python for efficiency. What I am attempting to do is populate the cells of a column in a pandas dataframe based on the value returned from a dictionary.
In an attempt to do this I have used the following code:
# imported csv DataFrames
crew_data = pd.read_csv(r'C:\file_path\crew_data.csv')
export_template = pd.read_csv(r'C:\file_path\export_template.csv')

# contract number dictionary
contract = {'Northern': '046-2019',
            'Southern': '048-2015D'}

# function that attempts to perform an INDEX MATCH equivalent
def contract_num():
    for x, y in enumerate(crew_data.loc[:, 'Region']):
        if y in contract.keys():
            num = contract[y]
        else:
            print('ERROR')
    return num

# for loop which prepares then exports the load data
for i, r in enumerate(export_template):
    export_template.loc[:, 'Contract'] = contract_num()

export_template.to_csv(r'C:\file_path\export_files\UPLOADER.csv')
print(export_template)
To summarise what the code is intended to do is as follows:
The for loop contained in the contract_num function begins by iterating over the Region column in the crew_data DataFrame
if the value y from the DataFrame matches a key in the contract dictionary (note: the Region column only contains two values, 'Southern' and 'Northern'), it returns the corresponding value from the contract dictionary
The for loop which prepares then exports the load data calls on the contract_num() function to populate the Contract column in the export_template DataFrame
Please note that there are 116 additional columns which are populated in this loop which have been excluded from the code above to save space.
When the code is executed it runs without errors, however, when the function is called in the second for loop it only ever returns the single value 048-2015D instead of the value corresponding to each row's Region.
As mentioned previously this would have typically been carried out in Excel using INDEX MATCH, however doing so is not as efficient as using a script such as that above.
Being a beginner, I suspect the example code may appear convoluted and unnecessary and could be replaced by a more concise method.
If anyone could provide a solution or guidance that would be greatly appreciated.
df = pd.DataFrame({'Region': ['Northern', 'Northern', 'Northern',
                              'Northern', 'Southern', 'Southern',
                              'Northern', 'Eastern']})

contract = {'Northern': '046-2019',
            'Southern': '048-2015D'}

# similar to INDEX MATCH
df['Contract'] = df.Region.map(contract)
out:
Region Contract
0 Northern 046-2019
1 Northern 046-2019
2 Northern 046-2019
3 Northern 046-2019
4 Southern 048-2015D
5 Southern 048-2015D
6 Northern 046-2019
7 Eastern NaN
You can print an error if a Contract value has no match:
if df.Contract.isna().any():
    print("ERROR")
or make an assertion:
assert not df.Contract.isna().any(), "found empty contract field"
and the output in this case:
AssertionError: found empty contract field
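If you would rather have an explicit placeholder than NaN for unmatched regions, map composes with fillna (the 'UNKNOWN' label here is just an illustrative choice, not anything from the question):

```python
import pandas as pd

df = pd.DataFrame({'Region': ['Northern', 'Southern', 'Eastern']})
contract = {'Northern': '046-2019', 'Southern': '048-2015D'}

# Unmatched regions get a visible placeholder instead of NaN
df['Contract'] = df['Region'].map(contract).fillna('UNKNOWN')
```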

Looping through Pandas column to derive values to be added to a different column

I have code that looks up addresses and finds their area names and cities using Geopy (GoogleV3). Currently it only works on a static address; I want to loop through an entire pandas column of addresses and write each row's city and area name into two new columns.
from geopy.geocoders import GoogleV3
address = '1238 Davie St, Vancouver, BC'
geocoder = GoogleV3(api_key='xyzabc')
location = geocoder.geocode(address, language='en')
address_components = location.raw['address_components']
counties = [addr['long_name'] for addr in address_components if 'neighborhood' in addr['types']]
localities = [addr['long_name'] for addr in address_components if 'locality' in addr['types']]
This works well, but when I assign df['address_original'] to the address variable, I only get the result for the first row.
Would I need to build a loop for this, or are there other ways?
The output I am getting at the moment is
[] [u'Clearwater']
which is the result for the first address in the df['address_original'] column. I would like to get the results for all the addresses in the column
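One way to extend this to the whole column is to split the parsing into its own function and call the geocoder once per row via apply. A sketch of the parsing half, assuming you already have location.raw['address_components'] from geocoder.geocode (the name parse_components is made up for illustration):

```python
def parse_components(address_components):
    # Pull the neighborhood and city ('locality') long_names out of a
    # GoogleV3 result's raw['address_components'] list
    hoods = [c['long_name'] for c in address_components
             if 'neighborhood' in c['types']]
    cities = [c['long_name'] for c in address_components
              if 'locality' in c['types']]
    return (hoods[0] if hoods else None, cities[0] if cities else None)

# Wiring (geocoder as in the question; one network call per row):
# df[['area_name', 'city']] = df['address_original'].apply(
#     lambda a: pd.Series(parse_components(
#         geocoder.geocode(a, language='en').raw['address_components'])))
```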

Get Latitude/Longitude Python Pandas

I'm learning python and am currently trying to parse out the longitude and latitude from a "Location" column and assign them to the 'lat' and 'lon' columns. I currently have the following code:
def getlatlong(cell):
    dd['lat'] = cell.split('\n')[2].split(',')[0][1:]
    dd['lon'] = cell.split('\n')[2].split(',')[1][1:-1]

dd['Location'] = dd['Location'].apply(getlatlong)
dd.head()
The splitting portion of the code works. The problem is that this code copies the lat and lon from the last cell in the dataframe to all of the 'lat' and 'lon' rows. I want it to split the current row it is iterating through, assign the 'lat' and 'lon' values for that row, and then do the same on every subsequent row.
I get that assigning dd['lat'] to the split value assigns it to the whole column, but I don't know how to assign to just the row currently being iterated over.
Data sample upon request:
Index,Location
0,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"
1,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67931141, -121.7765988)"
2,"138 14TH ST\nOAKLAND, CA 94612\n(37.80140803, -122.26369831)"
3,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968061, -122.19690846)"
4,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968557, -122.19692165)"
Please see my approach below. It is based on creating a DataFrame with lat and lon columns and then adding it to the existing dataframe.
def getlatlong(x):
    return pd.Series([x.split('\n')[2].split(',')[0][1:],
                      x.split('\n')[2].split(',')[1][1:-1]],
                     index=['lat', 'lon'])

df = pd.concat((df, df.Location.apply(getlatlong)), axis=1)
This shows another technique you can use to get the answer, but isn't the exact code you need. If you add sample data I can tailor it.
Using Pandas's build in str methods you can save yourself some headache as follows:
temp_df = df['Location'].str.split().apply(pd.Series)
The above splits the Location column on whitespace, and then turns the split values into columns. You can then assign just the latitude and longitude columns to the original df.
df[['Latitude', 'Longitude']] = temp_df[[<selection1>, <selection2>]]
str.split() also has an expand parameter so that you can write .str.split("char", expand=True) to spread out the columns without the apply.
Update
Given your example, this works for your specific case:
df = pd.DataFrame({"Location": ["1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"]})
df[["Latitude", "Longitude"]] = (df['Location']
                                 .str.split('\n')
                                 .apply(pd.Series)[2]          # column 2 holds the "(lat, long)" string
                                 .str[1:-1]                    # strip the ()
                                 .str.split(",", expand=True)  # expand latitude and longitude into two columns
                                 .astype(float))               # make sure both are floats
Out:
Location Latitude Longitude
0 1554 FIRST ST\nLIVERMORE, CA 94550\n(37.679306... 37.679306 -121.776586
Update #2
Abhishek Mishra's answer below is faster (it takes only 55% of the time, since it goes through the data fewer times). Worth noting that its output columns contain strings, so you may want to convert the values back to floats.
for ind, row in dd.iterrows():
    dd['lat'].loc[ind] = dd['Location'].loc[ind].split(',')[0][1:]
    dd['lon'].loc[ind] = dd['Location'].loc[ind].split(',')[1][1:-1]
PS: iterrows() is slow.
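As another vectorised option (not from the original answers), a single str.extract with a regular expression pulls both numbers out in one pass over the column, using the sample data from the question:

```python
import pandas as pd

dd = pd.DataFrame({'Location': [
    '1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)',
    '138 14TH ST\nOAKLAND, CA 94612\n(37.80140803, -122.26369831)',
]})

# Capture the two signed decimals inside the trailing parentheses
dd[['lat', 'lon']] = (dd['Location']
                      .str.extract(r'\((-?[\d.]+),\s*(-?[\d.]+)\)')
                      .astype(float))
```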
