Pandas and Multiprocessing - python

I am using an FCC API to convert lat/long coordinates into block group codes:
import pandas as pd
import numpy as np
import urllib.request
import time
import json
# getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
new_list = []
def block(x):
    for index, row in x.iterrows():
        # request url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        # load json output in to a form python can understand
        a1 = json.loads(a)
        # append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
#call the function with latlong as the argument.
block(latlong)
#print the list, note: it is important that function appends to the list
print(new_list)
gives this output:
['360610031001021', '060372074001033', '170318391001104', '482011000003087',
'421010005001010', '040131141001032', '480291101002041', '060730053003011',
'481130204003064', '060855010004004', '484530011001092', '180973910003057',
'120310010001023', '060750201001001', '390490040001005', '371190001005000',
'484391233002071', '261635172001069', '481410029001001', '471570042001018']
The problem with this script is that I can only call the API once per row. It takes about 5 minutes per thousand rows to run, which is not acceptable for the 1,000,000+ entries I plan to process with it.
I want to use multiprocessing to parallelize this function and reduce its run time. I have looked through the multiprocessing documentation, but have not been able to figure out how to run the function and append the output to an empty list in parallel.
Just for reference: I am using Python 3.6.
Any guidance would be great!

You do not have to implement the parallelism yourself. There are libraries better suited to this than urllib, e.g. requests [0] and some spin-offs [1], which use either threads or futures. You will need to measure for yourself which one is fastest.
Because it has few dependencies, I like requests-futures best. Here is my implementation of your code using ten threads. The library also supports processes, if you find that they work better in your case:
import pandas as pd
import numpy as np
import urllib
import time
import json
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
#getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
def block(x):
    requests = []
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    for index, row in x.iterrows():
        # request url and read the output
        url = getup+row['lat']+getup1+row['long']+getup2
        requests.append(session.get(url))
    new_list = []
    for request in requests:
        # load json output in to a form python can understand
        a1 = json.loads(request.result().content)
        # append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
    return new_list
#call the function with latlong as the argument.
new_list = block(latlong)
#print the list, note: it is important that function appends to the list
print(new_list)
[0] http://docs.python-requests.org/en/master/
[1] https://github.com/kennethreitz/grequests
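For comparison, the same fan-out can be sketched with only the standard library's concurrent.futures. This is a minimal sketch under stated assumptions, not the answer's method: lookup_blocks, its fetch parameter, and the fake_fetch stub are hypothetical names introduced here so the example runs offline, and the response shape (Block, then FIPS) is taken from the question's code.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# URL pieces copied from the question.
GETUP = 'http://data.fcc.gov/api/block/find?format=json&latitude='
GETUP1 = '&longitude='
GETUP2 = '&showall=false'


def http_fetch(url):
    """Default fetcher: return the raw response body for one URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def lookup_blocks(coords, fetch=http_fetch, max_workers=10):
    """Return one FIPS code per (lat, long) pair, in input order.

    `fetch` is a hypothetical injection point so the function can be
    exercised without network access.
    """
    urls = [GETUP + lat + GETUP1 + lon + GETUP2 for lat, lon in coords]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of `urls`,
        # regardless of which request finishes first.
        bodies = list(executor.map(fetch, urls))
    return [json.loads(body)['Block']['FIPS'] for body in bodies]


def fake_fetch(url):
    """Offline stand-in that echoes the latitude back as the 'FIPS' value."""
    lat = url.split('latitude=')[1].split('&')[0]
    return json.dumps({'Block': {'FIPS': lat}}).encode()


coords = [('40.7127837', '-74.0059413'), ('34.0522342', '-118.2436849')]
result = lookup_blocks(coords, fetch=fake_fetch)
print(result)
```

Calling lookup_blocks(coords) without the fetch argument would use the real HTTP fetcher. Threads are a reasonable fit here because the work is waiting on the network rather than computing.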

Related

What is the most efficient way to increase speed of nested loops in Python?

I understand that Vectorization or Parallel Programming is the way to go. But what if my program doesn't fit in those use case scenarios, like let's say NumPy doesn't work for a particular problem.
For demonstration purposes, I have written a simple program here:
import pandas as pd
import requests
from bs4 import BeautifulSoup
def extract_data(location_name, date):
    ex_data = []
    url = f'http://www.example.com/search.php?locationid={location_name}&date={date}'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    for tr in soup.find('table', class_='classTable').find_all('tr', attrs={'class': None}):
        text = [td.text for td in tr.find_all('td')]
        ex_data.append(text)
    return ex_data
# list of all dates in a year
dates = pd.read_csv('dates.csv')
# list of a number of locations
location_list = pd.read_csv('locations.csv')
master_data = []
for indexLoc, rowLoc in location_list.iterrows():
    data = []
    for index, row in dates.iterrows():
        _date = row['Date']
        location = rowLoc['Location']
        rows = extract_data(location, _date)
        data = data + rows
    master_data = master_data + data
master_df = pd.DataFrame(master_data)
print(master_df)
The program loads a list of dates and a list of locations into separate DataFrames, then loops through each location and, in a nested loop, through each date to call a function. The function requests a URL built from those parameters, extracts some information from a table using BeautifulSoup, and returns it. The program then stores the returned values in a list and the loop continues.
Now, say there are 100 locations and 365 dates: the program runs through 365 dates for each location, so the function executes 100*365 = 36,500 times. That many calls, on top of the temporary storage needed for the list returned on each iteration, is nowhere near an efficient solution.
Using NumPy may change the date into a datetime value rather than keeping it as a string (at least, that's what happened in my case). Using multiprocessing/multithreading might break the order in which the final master_data should appear if a request in the function takes too long to complete. For instance, the data for Feb 17,...,20 would be added to the list before Feb 16 because the request for Feb 16 took too long. I understand that it can be sorted afterwards, but what if the data is such that it can't be sorted?
What would be a simple, lightweight solution for these nested loops that gives the best execution speed? I would also like to know why it would be the best option, with an example if you can provide one.
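On the ordering concern, one point that can be checked with a small experiment: ThreadPoolExecutor.map from the standard library returns results in the order the tasks were submitted, not the order they finish. The sketch below assumes nothing about the real site; slow_extract is a hypothetical stand-in for extract_data in which the earliest date is deliberately the slowest.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def slow_extract(task):
    """Hypothetical stand-in for extract_data: earlier dates take longer."""
    location, date, delay = task
    time.sleep(delay)
    return [[location, date]]


locations = ['A', 'B']
dates = ['Feb 16', 'Feb 17', 'Feb 18']
delays = {'Feb 16': 0.05, 'Feb 17': 0.01, 'Feb 18': 0.0}

# One task per (location, date) pair, in the same order as the nested loops.
tasks = [(loc, d, delays[d]) for loc, d in product(locations, dates)]

with ThreadPoolExecutor(max_workers=6) as executor:
    # map() returns results in task order, not completion order.
    chunks = list(executor.map(slow_extract, tasks))

# Flatten the per-task lists into one list of rows.
master_data = [row for chunk in chunks for row in chunk]
print(master_data)
```

Even though the Feb 16 tasks finish last, their rows still come first in master_data, so no sorting step is needed with this approach.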

Read a lot of data using pool.map ('error("'i' format requires -2147483648 <= number <= 2147483647")')

I'm reading data from databases. I need to read from several servers (nodes) simultaneously, so I want to use pool.map.
I'm trying to do it this way:
import pathos.pools as pp
import pandas as pd
import urllib.parse
class DataProvider():
    def __init__(self, hosts):
        self.hosts_read = hosts
    def read_data(self, host_index):
        '''
        Read data from current node
        '''
        limit = 1000000
        host = self.hosts_read[host_index]
        query = f"select FIELD1 from table_name limit {limit}"
        url = urllib.parse.urlencode({'query': query})
        df = pd.io.parsers.read_csv(f'http://{host}:8123/?{url}',
                                    sep="\t", names=['FIELD1'], low_memory=False)
        return df
    def pool_read(self, num_workers):
        '''
        Read from data using Pool of workers.
        Return list of lists - every list is a data from current worker.
        '''
        pool = pp.ProcessPool(num_workers)
        result = pool.map(self.read_data, range(len(self.hosts_read)))
        return result
if __name__ == '__main__':
    provider = DataProvider(hosts=['server01.com', 'server02.com'])
    data = provider.pool_read(num_workers=n_cpu)
It works perfectly while the limit is small (below 4 million), and crashes when it is bigger:
multiprocess.pool.MaybeEncodingError: Error sending result:
'[my_pandas_dataframe]'. Reason: 'error("'i' format requires
-2147483648 <= number <= 2147483647")'
I found some answers about this: it happens because a worker cannot return a piece of data bigger than 2 GB from the pool. For example: SO link. But they offer no ideas or solutions for how to proceed when I need to load bigger parts!
P.S. I use the pathos module, but that is not important here: the multiprocessing module gives the same error.
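One possible way around the size limit is to keep each returned piece small: split every host's read into several (offset, limit) chunks and map over the chunks instead of over whole hosts. The sketch below only builds the task list and query text. chunk_ranges, build_tasks, and build_query are hypothetical helpers, and it assumes the server accepts a "limit ... offset ..." clause and returns rows in a stable order, neither of which is confirmed by the question.

```python
def chunk_ranges(total_rows, chunk_size):
    """Split total_rows into (offset, limit) pairs of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(offset, min(chunk_size, total_rows - offset))
            for offset in range(0, total_rows, chunk_size)]


def build_tasks(hosts, total_rows, chunk_size):
    """One (host, offset, limit) task per chunk per host."""
    return [(host, offset, limit)
            for host in hosts
            for offset, limit in chunk_ranges(total_rows, chunk_size)]


def build_query(offset, limit):
    """Query text for one chunk; assumes the server accepts LIMIT ... OFFSET."""
    return f"select FIELD1 from table_name limit {limit} offset {offset}"


tasks = build_tasks(['server01.com', 'server02.com'], 10, 4)
print(tasks)
print(build_query(4, 4))
```

Each task could then be passed to pool.map in place of a bare host index, with the worker returning one modest DataFrame per chunk and the parent concatenating them.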

Most efficient way to update Dataframe with JSON array from WebService

I have a code column which I would like to pass to a web service, and then update two fields in the dataframe (dfMRD1['Cache_Ticker'] and dfMRD1['Cache_Product']) with two values from the returned JSON (RbcSecurityDescription and RbcSecurityType1). I have achieved this by iteration, but I'd like to know if there is a more efficient way to do it.
# http://postgre01:5002/bond/912828XU9
import requests
url = 'http://postgre01:5002/bond/'
def fastquery(code):
    response = requests.get(url + code)
    return response.json()
Here is the sample return call:
Here is the update of dfMRD1['Cache_Ticker'] and dfMRD1['Cache_Product']:
dfMRD1 = df[['code']].drop_duplicates()
dfMRD1['Cache_Ticker'] = ""
dfMRD1['Cache_Product'] = ""
for index, row in dfMRD1.iterrows():
    result = fastquery(row['code'])
    # write through the DataFrame; assigning to `row` may only change a copy
    dfMRD1.at[index, 'Cache_Ticker'] = result['RbcSecurityDescription']
    dfMRD1.at[index, 'Cache_Product'] = result['RbcSecurityType1']
display(dfMRD1.head(5))
Would it be best to just return the JSON array, unnest it, and dump all of its fields into another df which I can join with dfMRD1? What is the best way to achieve this?
The most time-consuming part of your code is likely to be in making synchronous requests. Instead, you could leverage requests-futures to make asynchronous requests, construct the columns as lists of results and assign back to the DF. We have nothing to test with but the approach would look like this:
from requests_futures.sessions import FuturesSession
session = FuturesSession(max_workers=10)
dfMRD1 = df[['code']].drop_duplicates()
codes = dfMRD1['code'].tolist()  # Take out of DF as a flat list of strings
url = 'http://postgre01:5002/bond/'
fire_requests = [session.get(url + code) for code in codes]  # Async requests
responses = [item.result().json() for item in fire_requests]  # Wait for each result and parse its JSON
dfMRD1['Cache_Ticker'] = [result['RbcSecurityDescription']
                          for result in responses]
dfMRD1['Cache_Product'] = [result['RbcSecurityType1']
                           for result in responses]
Depending on the size of the DF, you may get a lot of data in memory. If that becomes an issue, you'll want a background callback trimming your JSON responses as they come back.
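The trimming step mentioned above could look roughly like this. It is a minimal sketch: trim_payload and KEEP_FIELDS are hypothetical names, and the sample dictionary is invented to match only the two field names given in the question.

```python
KEEP_FIELDS = ('RbcSecurityDescription', 'RbcSecurityType1')


def trim_payload(payload, keep=KEEP_FIELDS):
    """Keep only the wanted keys; missing keys become None."""
    return {key: payload.get(key) for key in keep}


# Hypothetical response body, shaped like the fields named in the question.
sample = {
    'RbcSecurityDescription': 'EXAMPLE DESCRIPTION',
    'RbcSecurityType1': 'EXAMPLE TYPE',
    'SomeOtherField': 'x' * 1000,
}
trimmed = trim_payload(sample)
print(trimmed)
```

Because requests accepts a hooks={'response': ...} argument on each call, a function like this could be applied to response.json() inside such a hook, so that only the trimmed dictionary is kept for each response.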

How can I make this function more efficient/ run it parallel?

I am trying to convert 33,000 ZIP codes into coordinates using the geopy package. I was hoping there was a way to parallelize this method because it is consuming quite a bit of resources.
from geopy.geocoders import ArcGIS
import pandas as pd
import time
geolocator = ArcGIS()
df1 = pd.DataFrame(0.0, index=list(range(0,len(df))), columns=list(['lat','lon']))
df = pd.concat([df,df1], axis=1)
for index in range(0,len(df)):
    row = df['zipcode'].loc[index]
    print(index)
    # time.sleep(1)
    # I put this function in just in case it would give me a timeout error.
    myzip = geolocator.geocode(row)
    try:
        df.loc[index, 'lat'] = myzip.latitude
        df.loc[index, 'lon'] = myzip.longitude
    except:
        continue
geopy.geocoders.ArcGIS.geocode queries a web server. Sending 33,000 queries alone will probably get you IP banned, so I wouldn't suggest sending them in parallel.
You're looking up almost every single ZIP code in the US. The US Census Bureau has a 1MB CSV file that contains this information for 33,144 ZIP codes: https://www.census.gov/geo/maps-data/data/gazetteer2017.html.
You can process it all in a fraction of a second:
zip_df = pd.read_csv('2017_Gaz_zcta_national.zip', sep='\t')
zip_df.rename(columns=str.strip, inplace=True)
One thing to watch out for is that the last column's name isn't properly parsed by Pandas and contains a lot of trailing whitespace. You have to strip the column names before use.
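To illustrate the header-stripping point without pandas, here is a minimal standard-library sketch that turns such a file into a ZIP-to-coordinates lookup. The column names GEOID, INTPTLAT, and INTPTLONG and the inline sample rows are assumptions made for the example, not taken from the answer; load_zip_coords is a hypothetical helper.

```python
import csv
import io

# Illustrative sample in the assumed layout: tab-separated, with the last
# header name padded by trailing spaces. Column names are assumptions.
SAMPLE = (
    "GEOID\tINTPTLAT\tINTPTLONG        \n"
    "00601\t18.180555\t-66.749961      \n"
    "00602\t18.361945\t-67.175597      \n"
)


def load_zip_coords(lines):
    """Map each ZIP code string to a (latitude, longitude) tuple."""
    reader = csv.reader(lines, delimiter='\t')
    # Strip padding from the header names before using them as keys.
    header = [name.strip() for name in next(reader)]
    coords = {}
    for fields in reader:
        record = dict(zip(header, (field.strip() for field in fields)))
        coords[record['GEOID']] = (float(record['INTPTLAT']),
                                   float(record['INTPTLONG']))
    return coords


zip_coords = load_zip_coords(io.StringIO(SAMPLE))
print(zip_coords['00601'])
```

A local lookup like this needs no network requests at all, which is the reason the answer prefers it over geocoding each ZIP code.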
Here is one way to do it, using multiprocessing.Pool:
from multiprocessing import Pool
import time
def get_longlat(x):
    index, row = x
    print(index)
    time.sleep(1)
    myzip = geolocator.geocode(row['zipcode'])
    try:
        return myzip.latitude, myzip.longitude
    except:
        return None, None
p = Pool()
df[['lat', 'long']] = p.map(get_longlat, df.iterrows())
More generally, using DataFrame.iterrows (which yields an (index, row) tuple for each item) is likely slightly more efficient than the index-based method you use above.
EDIT: after reading the other answer, you should be aware of rate limiting. You could use a fixed number of processes in the Pool along with a time.sleep delay to mitigate this to some extent, however.
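The idea of spacing requests out can be made a little more precise than a fixed sleep in each worker. Below is a minimal sketch of a shared throttle; Throttle is a hypothetical class, and it coordinates threads within one process only, so separate processes in a Pool would each need their own limit.

```python
import threading
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds across threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = clock()

    def wait(self):
        """Block until this caller's reserved time slot arrives."""
        with self._lock:
            now = self._clock()
            # Reserve the next free slot, then release the lock before sleeping.
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


throttle = Throttle(min_interval=0.02)
start = time.monotonic()
for _ in range(5):
    throttle.wait()  # a real caller would issue one request after each wait
elapsed = time.monotonic() - start
print(f"5 calls took {elapsed:.3f}s")
```

Reserving the slot under the lock and sleeping outside it lets several threads queue up for consecutive slots without blocking each other while they wait.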

Increasing Requests/Second Python Google Maps Geocoder

I am having difficulty increasing the number of requests I can make per second with Google Maps Geocoder. I am using a paid account (at $.50/1000 requests), so according to the Google Geocoder API I should be able to make up to 50 requests per second.
I have a list of 15k addresses for which I am trying to get GPS coordinates. I am storing them in a Pandas DataFrame and looping over them. To make sure this wasn't due to slow looping, I tested how fast it loops over all 15k, and it only took 1.5 seconds. But I was only able to make fewer than 1 request per second. I realized this might be due to my slow internet connection, so I fired up a Windows Google Cloud VM with a much faster connection. I was able to speed up the requests to about 1.5 requests/second, but that is still far slower than theoretically possible.
I thought this might be due to using the Python library Geocoder, so I tried making the request directly using Python requests, but this didn't speed things up either.
Does this have something to do with the fact that I'm not using a server? I would think this wouldn't matter since I'm using a Google Cloud VM. Also, I know this doesn't have to do with multithreading, since it can already iterate through the loop using 1 core with extreme speed. Thanks in advance for any thoughts.
import geocoder
import pandas as pd
import time
import requests
startTime = time.time()
#Read File Name with all transactions up to October 4th
input_filename = "C:/Users/username/Downloads/transaction-export 10-04-2017.csv"
df = pd.read_csv(input_filename, header=0, error_bad_lines=False)
#Only look at customer addresses
df = df['Customer Address']
#Drop duplicates and NAs
df = df.drop_duplicates(keep='first')
df = df.dropna()
#convert dataframe to string
addresses = df.tolist()
#Google Api Key
api_key = 'my_api_key'
#create empty array
address_gps = []
#google api address
url = 'https://maps.googleapis.com/maps/api/geocode/json'
#For each address return its geocoded latlng coordinates
for i, val in enumerate(addresses):
    ''' Direct way to make call without geocoder
    params = {'sensor': 'false', 'address': address, 'key': api_key}
    r = requests.get(url, params=params)
    results = r.json()['results']
    location = results[0]['geometry']['location']
    print location['lat'], location['lng']
    num_address = num_address+1;
    '''
    endTime = time.time()
    g = geocoder.google(val, key=api_key, exactly_one=True)
    print("Address,", val, "Number,", i, "Total,", len(addresses), "Time,", endTime-startTime)
    if g.ok:
        address_gps.append(g.latlng)
        print(g.latlng)
    else:
        address_gps.append(0)
        print("Error")
    #save every 100 iterations
    if i%100==0:
        # save as csv
        df1 = pd.DataFrame({'Address GPS': address_gps})
        df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
# save as csv
df1 = pd.DataFrame({'Address GPS': address_gps})
df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
One way to increase the speed of this is to maintain the requests session with Google, rather than creating a new session with every request. This is suggested in the geocoder documentation.
Your modified code will then be:
import requests
#Google Api Key
api_key = 'my_api_key'
#create empty array
address_gps = []
#google api address
url = 'https://maps.googleapis.com/maps/api/geocode/json'
#For each address return its geocoded latlng coordinates
with requests.Session() as session:
    for i, val in enumerate(addresses):
        ''' Direct way to make call without geocoder
        params = {'sensor': 'false', 'address': address, 'key': api_key}
        r = requests.get(url, params=params)
        results = r.json()['results']
        location = results[0]['geometry']['location']
        print location['lat'], location['lng']
        num_address = num_address+1;
        '''
        endTime = time.time()
        g = geocoder.google(val, key=api_key, exactly_one=True, session=session)
        print("Address,", val, "Number,", i, "Total,", len(addresses), "Time,", endTime-startTime)
        if g.ok:
            address_gps.append(g.latlng)
            print(g.latlng)
        else:
            address_gps.append(0)
            print("Error")
        #save every 100 iterations
        if i%100==0:
            # save as csv
            df1 = pd.DataFrame({'Address GPS': address_gps})
            df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
# save as csv
df1 = pd.DataFrame({'Address GPS': address_gps})
df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
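Reusing the session removes connection setup from each call, but a loop that waits for one reply before sending the next request is still bounded by round-trip time. As a rough back-of-the-envelope sketch, the number of requests that would need to be in flight at once can be estimated from the figures in the question. workers_needed is a hypothetical helper, and it assumes throughput scales linearly with concurrency, which a real service may not allow.

```python
import math


def workers_needed(target_rate, sequential_rate):
    """Concurrent requests needed to reach target_rate, assuming each
    worker sustains sequential_rate requests per second on its own."""
    if target_rate <= 0 or sequential_rate <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(target_rate / sequential_rate)


# Figures from the question: about 1.5 requests/second observed,
# 50 requests/second allowed.
needed = workers_needed(50, 1.5)
print(needed)
```

Under those assumptions, roughly 34 concurrent requests would be needed to approach the 50 requests/second limit, which suggests the single-threaded loop, not the loop speed itself, is what caps the rate.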
