I'm using Python to do some data cleaning/task automation, but am having a hard time reading in data through an API with multiple conditions. My code is as follows:
url = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?descriptor='Social Distancing' or descriptor='Face Covering Violation' or descriptor='Business not in compliance'"
r = requests.get(url)
x = r.json()
df = pd.DataFrame(x)
When I pull it, it only provides me with data where the descriptor is 'Social Distancing'. Any tips on how to change this so that it filters for all of the needed data?
Make three requests and merge their responses:
def get_data(descriptor):
    url = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?descriptor='{0}'".format(descriptor)
    r = requests.get(url)
    return pd.DataFrame(r.json())

df = pd.concat([
    get_data('Social Distancing'),
    get_data('Face Covering Violation'),
    get_data('Business not in compliance')
])
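Alternatively, NYC Open Data is served by Socrata, and Socrata endpoints generally accept SoQL query parameters, so you may be able to combine the three conditions in a single request with a $where clause. A minimal sketch, assuming this dataset accepts $where with an in(...) expression (the $limit value is just an illustrative cap, not something from your code):

import requests
import pandas as pd

base_url = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
params = {
    # Assumption: the endpoint supports SoQL's $where and in(...) syntax
    "$where": "descriptor in('Social Distancing', 'Face Covering Violation', 'Business not in compliance')",
    "$limit": 50000,  # raise the default row cap if you expect more rows
}
r = requests.get(base_url, params=params)
r.raise_for_status()
df = pd.DataFrame(r.json())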
My script loops through a GET request and concatenates the responses into a pandas DataFrame to export to Excel. Everything works fine until the loop has run 5 times, and then the site gives a 403 error. Somehow the site knows once I have requested 50k rows and returns the 403. Is there a way around this that anyone can share with me, please? step is a variable at the end of the URL string that tells the API how many rows to bring back; I can only do 10k at a time or it lags so much it won't work. SKIP is another variable in the URL string that skips forward a set number of rows. The script is also super slow, so if anyone can give hints on how to make it faster, that would be much appreciated. Thanks.
from selenium import webdriver
import time
import json
import pandas as pd
import requests
driver = webdriver.Chrome()
executor_url = driver.command_executor._url
session_id = driver.session_id
#put the url/website you are trying to scrape from here > this should be the url you go to when you login
driver.get(r"http://10.131.178.162:9090/xGLinear/login.html")
#waits 60 secs to give you time to login manually
time.sleep(60)
#this will copy all the cookies and login info you need from chrome and now you can start using requests
cookies = driver.get_cookies()
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
res = s.get(r"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$skip=0&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top=10000")
data = json.loads(res.text)
TotalR=data['totalRows']
SKIP=10000
skip1=10000
total_count= int(TotalR/skip1)
step=10000
Count=0
df = pd.DataFrame()
try:
    while Count < total_count:
        res1 = s.get(f"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$skip={SKIP}&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$skip=0&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top={step}")
        data1 = json.loads(res1.text)
        for d in data1['data']:
            dict_new = pd.DataFrame(d)
            df = pd.concat([df, dict_new])
        SKIP += 10000
        Count += 1
except:
    print(res1.status_code)
final=pd.DataFrame(data['data'])
final1=pd.DataFrame(final)
final2= pd.concat([df,final1])
final2.to_excel(r'C:\Users\c\Desktop\xg.xlsx',index= False)
There is no way to work around this; you have simply reached your request limit.
A solution would be to look at the documentation and find out how often this count resets.
Then you'll be able to add a wait in order to keep the right rhythm and get rid of the 403 error code.
import time

try:
    cpt = 0
    while Count < total_count:
        res1 = s.get(f"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$skip={SKIP}&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$skip=0&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top={step}")
        data1 = json.loads(res1.text)
        for d in data1['data']:
            dict_new = pd.DataFrame(d)
            df = pd.concat([df, dict_new])
        SKIP += 10000
        Count += 1
        cpt += 1
        if cpt == 5:
            cpt = 0
            time.sleep(x)  # x is how many seconds you'll need to wait
except:
    print(res1.status_code)
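On the speed question: calling pd.concat inside the loop copies the whole DataFrame on every iteration, which gets slower as it grows. A common pattern is to collect one DataFrame per page in a list and concatenate once at the end. A minimal sketch of that change, assuming each page's data1['data'] can be loaded directly into a DataFrame:

frames = []  # one DataFrame per page instead of growing df inside the loop
while Count < total_count:
    # Same paginated request as above; the f-string is re-evaluated each pass so $skip advances
    res1 = s.get(f"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$skip={SKIP}&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$skip=0&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top={step}")
    data1 = json.loads(res1.text)
    frames.append(pd.DataFrame(data1['data']))
    SKIP += 10000
    Count += 1
df = pd.concat(frames, ignore_index=True)  # single concatenation at the end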
I have a code column which I would like to pass to a web service, updating two fields in the dataframe (dfMRD1['Cache_Ticker'] and dfMRD1['Cache_Product']) with two values from the returned JSON (RbcSecurityDescription and RbcSecurityType1). I have achieved this by iteration, but I'd like to know if there is a more efficient way to do it.
# http://postgre01:5002/bond/912828XU9
import requests
url = 'http://postgre01:5002/bond/'
def fastquery(code):
    response = requests.get(url + code)
    return response.json()
Here is the sample return call (the returned JSON includes, among other fields, RbcSecurityDescription and RbcSecurityType1):
Here is the update of dfMRD1['Cache_Ticker'] and dfMRD1['Cache_Product']:
dfMRD1 = df[['code']].drop_duplicates()
dfMRD1['Cache_Ticker'] = ""
dfMRD1['Cache_Product'] = ""
for index, row in dfMRD1.iterrows():
    result = fastquery(row['code'])
    row['Cache_Ticker'] = result['RbcSecurityDescription']
    row['Cache_Product'] = result['RbcSecurityType1']
display(dfMRD1.head(5))
Would it be best to just return the JSON array, unnest it and dump all of its fields into another df which I can then join with dfMRD1? What is the best way to achieve this?
The most time-consuming part of your code is likely the synchronous requests themselves. Instead, you could leverage requests-futures to make asynchronous requests, build the columns as lists of results and assign them back to the DF. We have nothing to test with, but the approach would look like this:
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=10)
codes = df['code'].drop_duplicates().tolist()  # Take the codes out of the DF
url = 'http://postgre01:5002/bond/'
fire_requests = [session.get(url + code) for code in codes]  # Async requests
responses = [item.result().json() for item in fire_requests]  # Block on and parse the results
dfMRD1['Cache_Ticker'] = [result['RbcSecurityDescription']
                          for result in responses]
dfMRD1['Cache_Product'] = [result['RbcSecurityType1']
                           for result in responses]
Depending on the size of the DF, you may get a lot of data in memory. If that becomes an issue, you'll want a background callback trimming your JSON responses as they come back.
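A minimal sketch of that idea, reusing url and codes from above and using a plain requests response hook (which FuturesSession inherits and runs in the worker thread); the two field names come from your JSON, while the trim_response function and the trimmed attribute are illustrative:

from requests_futures.sessions import FuturesSession

def trim_response(resp, *args, **kwargs):
    # Runs as each response comes back; keep only the two fields we need
    payload = resp.json()
    resp.trimmed = {
        'RbcSecurityDescription': payload['RbcSecurityDescription'],
        'RbcSecurityType1': payload['RbcSecurityType1'],
    }

session = FuturesSession(max_workers=10)
fire_requests = [session.get(url + code, hooks={'response': trim_response})
                 for code in codes]
trimmed = [f.result().trimmed for f in fire_requests]
dfMRD1['Cache_Ticker'] = [t['RbcSecurityDescription'] for t in trimmed]
dfMRD1['Cache_Product'] = [t['RbcSecurityType1'] for t in trimmed]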
I am using an FCC API to convert lat/long coordinates into block group codes:
import pandas as pd
import numpy as np
import urllib.request
import time
import json
# getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
new_list = []
def block(x):
    for index, row in x.iterrows():
        #request url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        #load json output in to a form python can understand
        a1 = json.loads(a)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
#call the function with latlong as the argument.
block(latlong)
#print the list, note: it is important that function appends to the list
print(new_list)
gives this output:
['360610031001021', '060372074001033', '170318391001104', '482011000003087',
'421010005001010', '040131141001032', '480291101002041', '060730053003011',
'481130204003064', '060855010004004', '484530011001092', '180973910003057',
'120310010001023', '060750201001001', '390490040001005', '371190001005000',
'484391233002071', '261635172001069', '481410029001001', '471570042001018']
The problem with this script is that I can only call the API once per row. It takes about 5 minutes per thousand rows to run, which is not acceptable for the 1,000,000+ entries I am planning to use this script with.
I want to use multiprocessing to parallelize this function and decrease its running time. I have tried looking into the multiprocessing documentation, but have not been able to figure out how to run the function and append the output to an empty list in parallel.
Just for reference: I am using python 3.6
Any guidance would be great!
You do not have to implement the parallelism yourself; there are libraries better suited to this than urllib, e.g. requests [0] and some spin-offs [1] which use either threads or futures. You will need to check for yourself which one is the fastest.
Because of its small number of dependencies I like requests-futures best; here is my implementation of your code using ten threads. The library even supports processes if you believe, or figure out, that they are somehow better in your case:
import pandas as pd
import numpy as np
import urllib
import time
import json
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
#getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
def block(x):
    requests = []
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    for index, row in x.iterrows():
        #request url and read the output
        url = getup + row['lat'] + getup1 + row['long'] + getup2
        requests.append(session.get(url))
    new_list = []
    for request in requests:
        #load json output in to a form python can understand
        a1 = json.loads(request.result().content)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
    return new_list
#call the function with latlong as the argument.
new_list = block(latlong)
#print the list returned by the function
print(new_list)
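If you run this against the 1,000,000+ rows you mention, you may not want to fire every request up front. A minimal sketch that feeds block() the DataFrame in slices so only a bounded number of futures is in flight per batch (block_chunked and the chunk size are illustrative, not part of the library):

def block_chunked(x, chunk_size=10000):
    # Process the DataFrame in slices, reusing block() from above for each slice
    results = []
    for start in range(0, len(x), chunk_size):
        results.extend(block(x.iloc[start:start + chunk_size]))
    return results

new_list = block_chunked(latlong)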
[0] http://docs.python-requests.org/en/master/
[1] https://github.com/kennethreitz/grequests
With this code
import pandas as pd
import requests
link = "http://sp.kaola.com/api/category/goods?pageNo=1&pageSize=20&search=%7B%0A%20%20%22sortType%22%20%3A%20%7B%0A%20%20%20%20%22type%22%20%3A%200%0A%20%20%7D%2C%0A%20%20%22isNavigation%22%20%3A%20%220%22%2C%0A%20%20%22filterTypeList%22%20%3A%20%5B%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%22id%22%20%3A%20%5B%0A%20%20%20%20%20%20%20%204055%0A%20%20%20%20%20%20%5D%2C%0A%20%20%20%20%20%20%22type%22%20%3A%201%2C%0A%20%20%20%20%20%20%22category%22%20%3A%20%7B%0A%20%20%20%20%20%20%20%20%22parentCategoryId%22%20%3A%200%2C%0A%20%20%20%20%20%20%20%20%22categoryId%22%20%3A%204055%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%5D%2C%0A%20%20%22source%22%20%3A%201%2C%0A%20%20%22noStoreCount%22%20%3A%200%2C%0A%20%20%22isActivity%22%20%3A%200%2C%0A%20%20%22storeCount%22%20%3A%2060%0A%7D"
df = requests.get(link).json()
print(df)
I can get the response for the URL I'm querying.
But how can I get the data when the URL's GET argument becomes pageNo=3, 4 and so on?
I want to get all of the responses from all the pages in one request. Is this possible?
On each page I can get 20 responses. How can I get all of them?
Update:
I use this method to clean the JSON:
from pandas.io.json import json_normalize
df1 = df['body']
df_final = json_normalize(df1['result'],'goodsList')
How can I get all of the responses into only one dataframe?
Getting all the responses in a single request doesn't seem possible. This is something you cannot control; only the person who made the website can change it.
But what you can do is loop through the pages of the search result and add them together. I notice you have a hasMore variable which tells you whether there are more search results. That gives something like this:
import requests
link = "http://sp.kaola.com/api/category/goods?pageSize=20&search=%7B%0A%20%20%22sortType%22%20%3A%20%7B%0A%20%20%20%20%22type%22%20%3A%200%0A%20%20%7D%2C%0A%20%20%22isNavigation%22%20%3A%20%220%22%2C%0A%20%20%22filterTypeList%22%20%3A%20%5B%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%22id%22%20%3A%20%5B%0A%20%20%20%20%20%20%20%204055%0A%20%20%20%20%20%20%5D%2C%0A%20%20%20%20%20%20%22type%22%20%3A%201%2C%0A%20%20%20%20%20%20%22category%22%20%3A%20%7B%0A%20%20%20%20%20%20%20%20%22parentCategoryId%22%20%3A%200%2C%0A%20%20%20%20%20%20%20%20%22categoryId%22%20%3A%204055%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%5D%2C%0A%20%20%22source%22%20%3A%201%2C%0A%20%20%22noStoreCount%22%20%3A%200%2C%0A%20%20%22isActivity%22%20%3A%200%2C%0A%20%20%22storeCount%22%20%3A%2060%0A%7D"
max_pages = 100
data = {}
for page_no in range(max_pages):
    try:
        req = requests.get(link + "&pageNo=" + str(page_no))
    except requests.ConnectionError:
        break  # Stop loop if the url was not found.
    df = req.json()
    if df["body"]["result"]["hasMore"] == 0:
        break  # Page says it has no more results
    # Here, add whatever data you want to save from df to data
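To answer the follow-up about getting everything into one dataframe: a minimal sketch, reusing link and max_pages from above and assuming every page's goodsList normalizes the same way as in your json_normalize call, is to collect each page's frame in a list and concatenate once at the end:

import requests
import pandas as pd
from pandas.io.json import json_normalize

frames = []
for page_no in range(1, max_pages + 1):  # the API's pages appear to start at pageNo=1
    try:
        req = requests.get(link + "&pageNo=" + str(page_no))
    except requests.ConnectionError:
        break
    page = req.json()
    frames.append(json_normalize(page['body']['result'], 'goodsList'))
    if page['body']['result']['hasMore'] == 0:
        break

df_final = pd.concat(frames, ignore_index=True)  # every page's goodsList in a single DataFrame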