I have a code column which I would like to pass to a web service, updating two fields in the dataframe (dfMRD1['Cache_Ticker'] and dfMRD1['Cache_Product']) with two values from the returned JSON (RbcSecurityDescription and RbcSecurityType1). I have achieved this by iteration, but I'd like to know if there is a more efficient way to do it.
# http://postgre01:5002/bond/912828XU9
import requests
url = 'http://postgre01:5002/bond/'
def fastquery(code):
    response = requests.get(url + code)
    return response.json()
Here is the sample return call:
Here is the update of dfMRD1['Cache_Ticker'] and dfMRD1['Cache_Product']:
dfMRD1 = df[['code']].drop_duplicates()
dfMRD1['Cache_Ticker'] = ""
dfMRD1['Cache_Product'] = ""
for index, row in dfMRD1.iterrows():
    result = fastquery(row['code'])
    # write back with .at; assigning to the iterrows row only modifies a copy
    dfMRD1.at[index, 'Cache_Ticker'] = result['RbcSecurityDescription']
    dfMRD1.at[index, 'Cache_Product'] = result['RbcSecurityType1']
display(dfMRD1.head(5))
Would it be best to just return the JSON array, unnest it, and dump all of its fields into another df which I can then join with dfMRD1? What is the best way to achieve this?
The most time-consuming part of your code is likely to be in making synchronous requests. Instead, you could leverage requests-futures to make asynchronous requests, construct the columns as lists of results and assign back to the DF. We have nothing to test with but the approach would look like this:
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=10)

codes = df['code'].drop_duplicates().tolist()  # Take the codes out of the DF
url = 'http://postgre01:5002/bond/'

fire_requests = [session.get(url + code) for code in codes]   # Async requests
responses = [item.result().json() for item in fire_requests]  # Grab the JSON results

dfMRD1['Cache_Ticker'] = [result['RbcSecurityDescription']
                          for result in responses]
dfMRD1['Cache_Product'] = [result['RbcSecurityType1']
                           for result in responses]
Depending on the size of the DF, you may get a lot of data in memory. If that becomes an issue, you'll want a background callback trimming your JSON responses as they come back.
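For example, here is a minimal sketch of that trimming step using a standard requests response hook, which requests-futures runs in the worker thread (older requests-futures versions exposed this as a background_callback argument instead); the field names are the ones from your sample JSON:
def trim_response(resp, *args, **kwargs):
    # Keep only the two fields we need instead of the whole JSON body
    payload = resp.json()
    resp.data = {'RbcSecurityDescription': payload['RbcSecurityDescription'],
                 'RbcSecurityType1': payload['RbcSecurityType1']}

fire_requests = [session.get(url + code, hooks={'response': trim_response})
                 for code in codes]
trimmed = [f.result().data for f in fire_requests]  # small dicts, not full responses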
First of all, thanks for this community and all the advice we can find here, it's really appreciated!
This is my first venture into parallel processing. I have been looking into Dask on my own, but I am having trouble actually coding it... to be honest I am really lost.
In one of my projects, I want to hit URLs and retrieve observation data (meteorological stations) from XML files.
For each URL, I run several steps: retrieve the data from the URL, parse the XML into a dataframe, apply a filter and store the data in a MySQL database.
So I need to loop this process over thousands of URLs (stations)...
I wrote sequential code, and it takes 300s to finish the computation, which is really too long and not efficient.
As we are applying the same process to each station, I think I can speed up all the computations, but I don't know where to start. I used delayed from Dask, but I don't think it's the best approach.
This is my code so far:
First I have some functions.
import wget
import pandas as pd
import xml.etree.ElementTree as ETree

def xml_to_dataframe(ood_xml):
    tmp_file = wget.download(ood_xml)
    prstree = ETree.parse(tmp_file)
    root = prstree.getroot()
    ################ Section to retrieve data for one station and apply parameters
    all_obs = []
    for obs in root.iter('observations'):
        ood_observation = []
        for n, param in enumerate(list_parameters):
            # list_parameters and variable_to_check are defined elsewhere in my code
            x = obs.find(variable_to_check).text
            ood_observation.append(x)
        all_obs.append(ood_observation)
    return pd.DataFrame(all_obs, columns=list_parameters)
def filter_criteria(df, threshold, criteria):
    if criteria in df.columns:
        result = []
        for index, row in df.iterrows():
            if pd.to_numeric(row[criteria], errors='coerce') >= threshold:
                result.append(index)
        return result
    else:
        #print(criteria + ' parameter does not exist for this station !!! ')
        return []
def get_and_filter_data(filename, criteria, threshold):
    try:
        xmlToDf = xml_to_dataframe(filename)
        final_df = xmlToDf.loc[filter_criteria(xmlToDf, threshold, criteria)]
        # some MySQL connection and instructions....
    except:
        pass
and then the main code I want to parallelise:
criteria = 'temperature'
threshold = 22
filenames = ['url1.html', 'url2.html', 'url3.html']
for file in filenames:
    get_and_filter_data(file, criteria, threshold)
Do you have any advice on how to do it?
Many thanks for your help!
Guillaume
Not 100% sure this is what you are after, but one way is via delayed:
from dask import delayed, compute
delayeds = [delayed(get_and_filter_data)(file,criteria,threshold) for file in filenames]
results = compute(delayeds)
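Since the work is I/O-bound (downloads plus MySQL writes), it may also help to pick the scheduler and worker count explicitly. A minimal sketch, where the worker count of 16 is just an assumption to tune:
from dask import delayed, compute

delayeds = [delayed(get_and_filter_data)(file, criteria, threshold)
            for file in filenames]
# Threads are usually enough for I/O-bound tasks; 16 workers is a guess to tune.
results = compute(*delayeds, scheduler='threads', num_workers=16)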
I have a pandas dataframe with strings that I'm using to query an API and return the results.
I'm trying to call the API using a function and .apply and then save the results from the api call into a csv file. The problem is that I'm trying to do 10000+ requests and my kernel/notebook crashes. Basically I'm trying to do a big operation and I'm guessing I'm running out of memory. So I'm trying to think of a way I can do these api calls and save the results and not have it all crash. My version with .apply works with a small amount of data but not once it gets larger.
So my notebook code currently looks something like this.
df = pd.read_csv('bigstringlist.csv')
df = df.loc[0:3000]
My function looks something like this.
import urllib.parse
import requests
from time import sleep

def api_fetch_func(address):
    sleep(.2)  # crude throttle between calls
    API_PRIVATE = 'awewaefawefawef'
    encoded = urllib.parse.quote(address)
    query = ('https://apitocall' + str(encoded) +
             '.json?limit=1&key=' +
             API_PRIVATE)
    response = requests.get(query)
    # Retry until the body parses as JSON
    while True:
        try:
            jsonResponse = response.json()
            break
        except ValueError:
            response = requests.get(query)
    try:
        return jsonResponse['results']
    except KeyError:
        return None
Then I'm calling the function like so
df['response_col'] = df['string_col'].apply(api_fetch_func)
Something tells me that .apply isn't the right thing to do here. Would it be better if I just pushed the API responses into an array or another dataframe?
Should I just use .iterrows to loop over the list of strings and call the function? Something tells me .apply tries to jam too much into memory, and that's why this doesn't work.
So I was going to try
results = []
for index, row in df.iterrows():
    # call API
    # push results to array
Or is there another way to do this?
If it's a memory issue, what I'd do is write the API-calling function as a generator with the yield statement. Then you can loop through that generator and save smaller dataframes to CSV files rather than holding everything in memory in one go.
results = []
for idx, response in api_fetch_generator():
    results.append(response)
    if (idx % 500 == 0) and idx != 0:
        # Save the current batch, using idx to control the file name
        df = pd.DataFrame({'response_col': results})
        df.to_csv(f"response_batch_{idx // 500}.csv")
        results = []  # drop the saved batch from memory before continuing
# Combine the CSVs after everything is saved.
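The generator itself isn't shown above; a minimal sketch of what api_fetch_generator could look like, assuming the api_fetch_func and df['string_col'] from your question:
def api_fetch_generator():
    # Yield one (index, response) pair at a time instead of building a full column
    for idx, value in enumerate(df['string_col']):
        yield idx, api_fetch_func(value)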
I am trying to convert 33000 zipcodes into coordinates using geocoder package. I was hoping there was a way to parallelize this method because it is consuming quite a bit of resources.
from geopy.geocoders import ArcGIS
import pandas as pd
import time
geolocator = ArcGIS()
df1 = pd.DataFrame(0.0, index=list(range(0,len(df))), columns=list(['lat','lon']))
df = pd.concat([df,df1], axis=1)
for index in range(0,len(df)):
    row = df['zipcode'].loc[index]
    print index
    # time.sleep(1)
    # I put this function in just in case it would give me a timeout error.
    myzip = geolocator.geocode(row)
    try:
        df['lat'].loc[index] = myzip.latitude
        df['lon'].loc[index] = myzip.longitude
    except:
        continue
geopy.geocoders.ArcGIS.geocode queries a web server. Sending 33,000 queries alone will probably get you IP banned, so I wouldn't suggest sending them in parallel.
You're looking up almost every single ZIP code in the US. The US Census Bureau has a 1MB CSV file that contains this information for 33,144 ZIP codes: https://www.census.gov/geo/maps-data/data/gazetteer2017.html.
You can process it all in a fraction of a second:
zip_df = pd.read_csv('2017_Gaz_zcta_national.zip', sep='\t')
zip_df.rename(columns=str.strip, inplace=True)
One thing to watch out for is that the last column's name isn't properly parsed by Pandas and contains a lot of trailing whitespace. You have to strip the column names before use.
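From there, a hedged sketch of joining the coordinates onto your dataframe; GEOID, INTPTLAT and INTPTLONG are the column names the 2017 gazetteer file uses, but verify them against the file you download:
# Merge the gazetteer coordinates onto the question's df by ZIP code.
# Assumes df['zipcode'] holds 5-digit ZIP codes as integers, matching GEOID.
df = df.merge(zip_df[['GEOID', 'INTPTLAT', 'INTPTLONG']],
              how='left', left_on='zipcode', right_on='GEOID')
df = df.rename(columns={'INTPTLAT': 'lat', 'INTPTLONG': 'lon'})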
Here would be one way to do it, using multiprocessing.Pool
from multiprocessing import Pool
def get_longlat(x):
    index, row = x
    print index
    time.sleep(1)
    myzip = geolocator.geocode(row['zipcode'])
    try:
        return myzip.latitude, myzip.longitude
    except:
        return None, None
p = Pool()
df[['lat', 'long']] = p.map(get_longlat, df.iterrows())
More generally, using DataFrame.iterrows (for which each item iterated over is an (index, row) tuple) is likely slightly more efficient than the index-based method you use above.
EDIT: after reading the other answer, you should be aware of rate limiting; you could use a fixed number of processes in the Pool along with a time.sleep delay to mitigate this to some extent, however.
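For instance, a minimal sketch of that rate-limiting suggestion (the worker count of 4 is just an assumption to tune):
# Cap concurrency so at most 4 requests are in flight; get_longlat already
# sleeps 1 second per call, which further throttles the request rate.
p = Pool(processes=4)
df[['lat', 'long']] = p.map(get_longlat, df.iterrows())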
I am having difficulty increasing the amount of requests I can make per second with Google Maps Geocoder. I am using a paid account (at $.50/1000 requests), so according to the Google Geocoder API I should be able to make up to 50 requests per second.
I have a list of 15k addresses which I am trying to get GPS coordinates for. I am storing them in a Pandas dataframe and looping over them. To make sure this wasn't due to slow looping, I tested how fast it loops over all 15k, and it only took 1.5 seconds. But I was only able to make less than 1 request per second. I realized this might be due to my slow internet connection, so I fired up a Windows Google Cloud VM with obviously fast internet. I was able to speed up the requests to about 1.5 requests/second, but still way slower than theoretically possible.
I thought this might be due to using the Python library geocoder, so I tried making the requests directly with Python requests, but this didn't speed things up either.
Does this have something to do with the fact that I'm not using a server? I would think this wouldn't matter since I'm using a Google Cloud VM. Also, I know this doesn't have to do with multithreading, since it can already iterate through the loop using 1 core with extreme speed. Thanks in advance for any thoughts.
import geocoder
import pandas as pd
import time
import requests
startTime = time.time()
#Read File Name with all transactions up to October 4th
input_filename = "C:/Users/username/Downloads/transaction-export 10-04-2017.csv"
df = pd.read_csv(input_filename, header=0, error_bad_lines=False)
#Only look at customer addresses
df = df['Customer Address']
#Drop duplicates and NAs
df = df.drop_duplicates(keep='first')
df = df.dropna()
#convert dataframe to string
addresses = df.tolist()
#Google Api Key
api_key = 'my_api_key'
#create empty array
address_gps = []
#google api address
url = 'https://maps.googleapis.com/maps/api/geocode/json'
#For each address return its geocoded latlng coordinates
for int, val in enumerate(addresses):
    ''' Direct way to make call without geocoder
    params = {'sensor': 'false', 'address': address, 'key': api_key}
    r = requests.get(url, params=params)
    results = r.json()['results']
    location = results[0]['geometry']['location']
    print location['lat'], location['lng']
    num_address = num_address+1;
    '''
    endTime = time.time()
    g = geocoder.google(val, key=api_key, exactly_one=True)
    print "Address,", (val), "Number,", int, "Total,", len(addresses), "Time,", endTime-startTime
    if g.ok:
        address_gps.append(g.latlng)
        print g.latlng
    else:
        address_gps.append(0)
        print("Error")
    #save every 100 iterations
    if int%100==0:
        # save as csv
        df1 = pd.DataFrame({'Address GPS': address_gps})
        df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
# save as csv
df1 = pd.DataFrame({'Address GPS': address_gps})
df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
One way to increase the speed of this is to maintain the requests session with Google, rather than creating a new session with every request. This is suggested in the geocoder documentation.
Your modified code will then be:
import requests
#Google Api Key
api_key = 'my_api_key'
#create empty array
address_gps = []
#google api address
url = 'https://maps.googleapis.com/maps/api/geocode/json'
#For each address return its geocoded latlng coordinates
with requests.Session() as session:
    for int, val in enumerate(addresses):
        ''' Direct way to make call without geocoder
        params = {'sensor': 'false', 'address': address, 'key': api_key}
        r = requests.get(url, params=params)
        results = r.json()['results']
        location = results[0]['geometry']['location']
        print location['lat'], location['lng']
        num_address = num_address+1;
        '''
        endTime = time.time()
        g = geocoder.google(val, key=api_key, exactly_one=True, session=session)
        print "Address,", (val), "Number,", int, "Total,", len(addresses), "Time,", endTime-startTime
        if g.ok:
            address_gps.append(g.latlng)
            print g.latlng
        else:
            address_gps.append(0)
            print("Error")
        #save every 100 iterations
        if int%100==0:
            # save as csv
            df1 = pd.DataFrame({'Address GPS': address_gps})
            df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
# save as csv
df1 = pd.DataFrame({'Address GPS': address_gps})
df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
With this code
import pandas as pd
import requests
link = "http://sp.kaola.com/api/category/goods?pageNo=1&pageSize=20&search=%7B%0A%20%20%22sortType%22%20%3A%20%7B%0A%20%20%20%20%22type%22%20%3A%200%0A%20%20%7D%2C%0A%20%20%22isNavigation%22%20%3A%20%220%22%2C%0A%20%20%22filterTypeList%22%20%3A%20%5B%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%22id%22%20%3A%20%5B%0A%20%20%20%20%20%20%20%204055%0A%20%20%20%20%20%20%5D%2C%0A%20%20%20%20%20%20%22type%22%20%3A%201%2C%0A%20%20%20%20%20%20%22category%22%20%3A%20%7B%0A%20%20%20%20%20%20%20%20%22parentCategoryId%22%20%3A%200%2C%0A%20%20%20%20%20%20%20%20%22categoryId%22%20%3A%204055%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%5D%2C%0A%20%20%22source%22%20%3A%201%2C%0A%20%20%22noStoreCount%22%20%3A%200%2C%0A%20%20%22isActivity%22%20%3A%200%2C%0A%20%20%22storeCount%22%20%3A%2060%0A%7D"
df = requests.get(link).json()
print df
I can get the response for the URL I'm querying.
But how can I get the data when the URL's pageNo GET argument becomes 3, 4 and so on?
I want to get all of the responses from all the pages in one request. Is this possible?
On each page I can get 20 responses. How can I get all of them?
Update:
I use this method to clean the JSON:
from pandas.io.json import json_normalize
df1 = df['body']
df_final = json_normalize(df1['result'],'goodsList')
How can I get all the responses into only one dataframe?
Getting all the responses in one request doesn't seem possible. This is something you cannot control; only the person who made the website can change it.
But what you can do is loop through the pages of the search result and add them together. I notice the response has a hasMore field which tells you whether there are more search results. That gives something like this:
import requests
link = "http://sp.kaola.com/api/category/goods?pageSize=20&search=%7B%0A%20%20%22sortType%22%20%3A%20%7B%0A%20%20%20%20%22type%22%20%3A%200%0A%20%20%7D%2C%0A%20%20%22isNavigation%22%20%3A%20%220%22%2C%0A%20%20%22filterTypeList%22%20%3A%20%5B%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%22id%22%20%3A%20%5B%0A%20%20%20%20%20%20%20%204055%0A%20%20%20%20%20%20%5D%2C%0A%20%20%20%20%20%20%22type%22%20%3A%201%2C%0A%20%20%20%20%20%20%22category%22%20%3A%20%7B%0A%20%20%20%20%20%20%20%20%22parentCategoryId%22%20%3A%200%2C%0A%20%20%20%20%20%20%20%20%22categoryId%22%20%3A%204055%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%5D%2C%0A%20%20%22source%22%20%3A%201%2C%0A%20%20%22noStoreCount%22%20%3A%200%2C%0A%20%20%22isActivity%22%20%3A%200%2C%0A%20%20%22storeCount%22%20%3A%2060%0A%7D"
max_pages = 100
data = {}
for page_no in range(1, max_pages + 1):
    try:
        req = requests.get(link + "&pageNo=" + str(page_no))
    except requests.ConnectionError:
        break  # Stop the loop if the url was not found.
    df = req.json()
    if df["body"]["result"]["hasMore"] == 0:
        break  # Page says it has no more results
    # Here, add whatever data you want to save from df to data
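To end up with a single dataframe, one option (a hedged sketch, reusing the json_normalize call from your update and assuming the goodsList record path holds the rows you want) is to collect each page and concatenate at the end:
import requests
import pandas as pd
from pandas.io.json import json_normalize

pages = []  # one dataframe per page
for page_no in range(1, 101):
    body = requests.get(link + "&pageNo=" + str(page_no)).json()["body"]
    pages.append(json_normalize(body["result"], "goodsList"))
    if body["result"]["hasMore"] == 0:
        break  # no more pages to fetch

df_final = pd.concat(pages, ignore_index=True)
print(df_final.shape)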