I am using Python to extract data from a Solr API, like so:
import requests
user = 'my_username'
password = 'my password'
url = 'my_url'
print ("Accessing API..")
req = requests.get(url = url, auth=(user, password))
print ("Accessed!")
out = req.json()
#print(out)
However, it looks like for some of the API URLs the output is fairly "large" (many of the columns are lists of dictionaries), and so not all rows are returned, which I need.
From looking around, it looks like I should use pagination to bring in results in specified increments. Something like this:
url = 'url?start=0&rows=1000'
Then,
url = 'url?start=1000&rows=1000'
and so on, until no results are returned.
The way I am thinking about it is to write a loop and append the result to the output on every iteration. However, I am not sure how to do that.
Would someone be able to help please?
Thank you in advance!
Did you look at the output? In my experience, a Solr response usually includes a 'numFound' field in its result. On an (old) Solr instance I have locally, a random query gives me this result:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "*:*",
      "indent": "true",
      "start": "0",
      "rows": "10",
      "wt": "json",
      "_": "1509460751164"
    }
  },
  "response": {
    "numFound": 7023,
    "start": 0,
    "docs": [.. 10 docs]
  }
}
While working out this code example, I realized you don't really need numFound. Solr will just return an empty list for docs if there are no further results, which makes the loop easier to write.
import requests

user = 'my_username'
password = 'my password'

# Starting values
start = 0
rows = 1000  # static, but easier to manipulate if it's a variable
base_url = 'my_url?rows={0}&start={1}'

url = base_url.format(rows, start)
req = requests.get(url=url, auth=(user, password))
out = req.json()
total_found = out.get('response', {}).get('numFound', 0)

# Bump start by 1000, so the next request fetches the following 1000
start += rows
results = out.get('response', {}).get('docs', [])
all_results = results

# results will be an empty list once no more results are found
while results:
    # Rebuild the url based on the current start
    url = base_url.format(rows, start)
    req = requests.get(url=url, auth=(user, password))
    out = req.json()
    results = out.get('response', {}).get('docs', [])
    all_results += results
    start += rows

# all_results now contains the 'docs' of every request
print(all_results)
Mind you, those docs will be dict-like, so more parsing will be needed.
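For instance, here is a minimal sketch of flattening those docs with pandas; the 'id' and 'title' field names are just placeholders for whatever your Solr schema actually contains:
import pandas as pd

# Flatten the list of doc dicts into a DataFrame; nested fields
# (lists of dicts) stay as object columns you can normalize further.
df = pd.json_normalize(all_results)

# Or pull out only the fields you care about (field names are hypothetical).
subset = [{'id': d.get('id'), 'title': d.get('title')} for d in all_results]
print(df.shape, len(subset))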
Related
I'm using the API to get all the data, so I need to paginate, i.e. loop through the pages until I have everything I want. Is there any way to do this automatically?
Alternatively, should I somehow use a while loop to get all this data? What is the best way? Any thoughts?
import json
import os
import requests
from http import HTTPStatus

client_id = ""
client_secret = ""

os.environ["DX_GATEWAY"] = "http://api.com"
os.environ["DX_CLIENT_ID"] = client_id
os.environ["DX_CLIENT_SECRET"] = client_secret

dx_request = requests.Request()

path = "/path/to/api"
params = {
    "Type": "abc",
    "Id": "def",
    "limit": 999,
    "Category": "abc"
}
params_str = "&".join([f"{k}={v}" for k, v in params.items()])
url = "?".join([path, params_str])

vulns = dx_request.get(  # also tried dx_request.args.get(
    url,
    version=1,
)
if vulns.status_code != int(HTTPStatus.OK):
    raise RuntimeError("API call did not return expected response: " + str(vulns))

response_data = vulns.json()
print(json.dumps(response_data))
Why not handle the pagination with a small loop, using the Requests library that you already have loaded in the code?
e.g.
import requests

url = 'https://api.example.com/items'
params = {'limit': 100, 'offset': 0}  # initial parameters for the first page of results

while True:
    response = requests.get(url, params=params)
    data = response.json()
    items = data['items']
    # do something with items here...
    if len(items) < 100:
        break  # fewer than 100 items returned, so we've reached the last page
    params['offset'] += 100  # increment offset to retrieve the next page of results
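If you want to reuse this, the same pattern can be wrapped in a generator so callers simply iterate over items. This is only a sketch against the same hypothetical https://api.example.com/items endpoint and 'items' key:
import requests

def iter_items(url, page_size=100):
    """Yield items one page at a time until a short page signals the end."""
    params = {'limit': page_size, 'offset': 0}
    while True:
        response = requests.get(url, params=params)
        response.raise_for_status()
        items = response.json()['items']
        yield from items
        if len(items) < page_size:
            break  # last page reached
        params['offset'] += page_size

# usage
for item in iter_items('https://api.example.com/items'):
    print(item)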
Having some trouble with this. I know that I have to use the offset value to pull more than 100 records; here is what I currently have:
import requests

AIRTABLE_BASE_ID = 'airtablebaseid'
AIRTABLE_API_KEY = 'airtableapikey'
AIRTABLE_TABLE_NAME = 'airtablename'

endpoint = f'https://api.airtable.com/v0/{AIRTABLE_BASE_ID}/{AIRTABLE_TABLE_NAME}?filterByFormula=AND(NOT(%7BSent+to+Payroll%7D+%3D+%22Sent%22)%2CNOT(%7BSent+to+Payroll%7D+%3D+%22Skipped%22))'

def get_airtable():
    headers = {
        "Authorization": f"Bearer {AIRTABLE_API_KEY}"
    }
    response = requests.get(endpoint, headers=headers)
    return response

recordList = []
recordIDs = []
recordTimeStamp = []

response = get_airtable()
data = response.json()

for record in data['records']:
    recordList.append(record['fields'])
    recordIDs.append(record['id'])
    recordTimeStamp.append(record['createdTime'])
    print(record)
Airtable can't give you more than 100 records per request; it works, as you said, with pagination and an offset.
The first request you make without the offset parameter returns a payload with an offset field; you have to pass that offset to your next request to get the next batch of records, and so on...
response = get_airtable()
data = response.json()
OFFSET = data['offset']  # not sure if "offset" is at the root of the response

endpoint = f'https://api.airtable.com/v0/{AIRTABLE_BASE_ID}/{AIRTABLE_TABLE_NAME}?offset={OFFSET}&filterByFormula=AND(NOT(%7BSent+to+Payroll%7D+%3D+%22Sent%22)%2CNOT(%7BSent+to+Payroll%7D+%3D+%22Skipped%22))'
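A minimal sketch of looping until Airtable stops returning an offset. This assumes the offset key sits at the top level of the response and passes it as a query parameter; the filter formula below is just the decoded version of the one in the question, and requests re-encodes it:
import requests

headers = {"Authorization": f"Bearer {AIRTABLE_API_KEY}"}
base_url = f'https://api.airtable.com/v0/{AIRTABLE_BASE_ID}/{AIRTABLE_TABLE_NAME}'
params = {
    'filterByFormula': 'AND(NOT({Sent to Payroll} = "Sent"),NOT({Sent to Payroll} = "Skipped"))',
}

all_records = []
while True:
    response = requests.get(base_url, headers=headers, params=params)
    data = response.json()
    all_records.extend(data.get('records', []))
    offset = data.get('offset')
    if not offset:
        break  # no offset in the payload means this was the last page
    params['offset'] = offset  # pass the token back to get the next page

print(len(all_records))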
I'm brand new to Python and APIs as well.
I'm trying to use an endpoint we have at work.
We have an API we are using a lot, and we also have a UI. But through the UI we can only extract 10,000 records at a time.
There is no such limit on the API.
I have found a small piece of code, but I need to add a nextPageToken.
My code looks like this:
import requests

login_url = 'https://api.ubsend.io/v1/auth/login'
username = 'xxxxx'
password = 'xxxxx'
omitClaims = "true"

session = requests.Session()
session.headers['Accept'] = "application/json; charset=UTF-8"

response = session.post(
    login_url,
    json={'username': username, 'password': password},
    headers={'VERSION': '3'},
)
response.raise_for_status()

response_data = response.json()
print(response_data)
This gives me the AccessToken.
Then I call:
getevents = 'https://api.ubsend.io/v1/reporting/shipments?'

data = {'client_id': 13490, 'created_after': '2020-05-01T00:00', 'created_before': '2021-05-02T00:00'}
req = requests.models.PreparedRequest()  # used only to build the query string onto the URL
req.prepare_url(getevents, data)

events = requests.get(req.url, headers={'Authorization': 'Bearer ' + response_data['accessToken'],
                                        'Content-Type': 'application/json'})
events.json()
Which returns:
'nextPageToken': 'NjA4ZDc3YzNkMjBjODgyYjBhMWVkMTVkLDE2MTk4ODM5NzA3MDE='}
So I want to loop my script until nextPageToken is blank...
Any thoughts?
Edit: thanks for the update. I think this might be the solution we're looking for. You might have to do some poking around to figure out exactly what the name of the page_token URL parameter should be.
has_next = True
getevents = 'https://api.ubsend.io/v1/reporting/shipments?'
token = None
req = requests.models.PreparedRequest()  # used to build the query string onto the URL

while has_next:
    data = {'client_id': 13490, 'created_after': '2020-05-01T00:00', 'created_before': '2021-05-02T00:00'}
    if token:
        # I don't know the proper name for this URL parameter.
        data['page_token'] = token
    req.prepare_url(getevents, data)
    events = requests.get(req.url, headers={'Authorization': 'Bearer ' + response_data['accessToken'],
                                            'Content-Type': 'application/json'})
    token = events.json().get('nextPageToken')
    if not token:
        has_next = False
I made a slight typo. It should be events.json().get('nextPageToken') I believe.
Let me know if this works.
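If you also want to keep the records from each page (the loop above only tracks the token), a sketch like this accumulates them. The 'data' key holding the shipments is an assumption, so adjust it to whatever the response actually contains, and the 'page_token' parameter name is still a guess:
import requests

getevents = 'https://api.ubsend.io/v1/reporting/shipments?'
headers = {'Authorization': 'Bearer ' + response_data['accessToken'],
           'Content-Type': 'application/json'}

all_shipments = []
token = None
while True:
    params = {'client_id': 13490,
              'created_after': '2020-05-01T00:00',
              'created_before': '2021-05-02T00:00'}
    if token:
        params['page_token'] = token  # parameter name is a guess; check the API docs
    resp = requests.get(getevents, params=params, headers=headers)
    resp.raise_for_status()
    payload = resp.json()
    all_shipments.extend(payload.get('data', []))  # 'data' key is an assumption
    token = payload.get('nextPageToken')
    if not token:
        break  # a blank/missing token means the last page was reached

print(len(all_shipments))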
With help from Stack Overflow, I was able to come up with the scraper. The code returns a list of part numbers and their corresponding prices:
part1 price1
part2 price2
...
...
partn pricen
However, the website seems to only allow 200 results per request; when I raise the limit above 200 I get the error: raise JSONDecodeError("Expecting value", s, err.value) from None JSONDecodeError: Expecting value.
I just want to know if there's a way to avoid this error. If not, I can raise start (currently 0) by 200 each time, but since I can easily have 100k+ items that won't be very efficient. Is there a way I can loop over the limit and start parameters?
Please see the code below, any help appreciated!
import requests
# import pprint  # to format data on screen: pprint.pprint()
import pandas as pd

# --- functions ---

def get_data(query):
    """Get data from server"""
    payload = {
        # "facets": [{
        #     "name": "OEM",
        #     "value": "GE%20Healthcare"
        # }],
        "facets": [],
        "facilityId": 38451,
        "id_ins": "a2a3d332-73a7-4194-ad87-fe7412388916",
        "limit": 200,
        "query": query,
        "referer": "/catalog/Service",
        "start": 0,
        # "urlParams": [{
        #     "name": "OEM",
        #     "value": "GE Healthcare"
        # }],
        "urlParams": []
    }

    r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
    data = r.json()
    return data
all_queries = ['GE Healthcare']

for query in all_queries:
    #print('\n--- QUERY:', query, '---\n')
    data = get_data(query)

    Part_Num = []
    Vendor_Item_Num = []
    price = []

    for item in data['products']:
        if not item['options']:
            Part_Num.append([])
            Vendor_Item_Num.append([])
            price.append([])
        else:
            all_prices = [option['price'] for option in item['options']]
            all_vendor = [option['price'] for option in item['options']]
            all_part_num = item['partNumber']

            Part_Num.append(all_part_num)
            Vendor_Item_Num.append(all_vendor)
            price.append(all_prices)

    list_of_dataframes = [pd.DataFrame(Part_Num), pd.DataFrame(price)]
    pd.concat(list_of_dataframes, axis=1).to_csv(r'C:\Users\212677036\Documents\output7.csv')
You should always check the status_code to confirm that your request was successful. This API returns HTTP 500 when limit is greater than 200. Many APIs limit requests per second and maximum request size so they can maintain a reliable service; you need to study the API's documentation.
The json() method will fail if the HTTP request was not successful.
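For example, a minimal guard around the existing request (same endpoint and payload as the code above) that raises on a bad status before attempting to parse JSON:
import requests

r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx instead of a confusing JSONDecodeError later
data = r.json()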
You can get the data in batches. In the sample code below I stop early because I don't want to sit in the loop for 500+ iterations. You could also consider threading so the requests aren't purely sequential.
import requests
import pandas as pd

query = 'GE Healthcare'
payload = {
    "facets": [],
    "facilityId": 38451,
    "id_ins": "a2a3d332-73a7-4194-ad87-fe7412388916",
    "limit": 200,
    "query": query,
    "referer": "/catalog/Service",
    "start": 0,
    "urlParams": []
}

r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
if r.status_code == 200:
    js = r.json()
    df = pd.json_normalize(js["products"])

    while len(df) < js["totalResults"] and len(df) < 2000:
        payload["start"] += 200
        r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
        if r.status_code == 200:
            df = pd.concat([df, pd.json_normalize(r.json()["products"])])
        else:
            break

    print(f"want: {js['totalResults']} got: {len(df)}")
    df
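Since threading was mentioned, here is a rough sketch of fetching pages in parallel with concurrent.futures once the first response tells you totalResults. This assumes the API tolerates a few concurrent requests, which you should verify before hammering it:
import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

SEARCH_URL = 'https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search'
BASE_PAYLOAD = {
    "facets": [],
    "facilityId": 38451,
    "id_ins": "a2a3d332-73a7-4194-ad87-fe7412388916",
    "limit": 200,
    "query": 'GE Healthcare',
    "referer": "/catalog/Service",
    "start": 0,
    "urlParams": []
}

def fetch_page(start):
    """Fetch one 200-row page; return its products (or [] on a bad status)."""
    r = requests.post(SEARCH_URL, json=dict(BASE_PAYLOAD, start=start))
    if r.status_code != 200:
        return []
    return r.json().get("products", [])

# The first request tells us how many results exist in total.
first = requests.post(SEARCH_URL, json=BASE_PAYLOAD)
first.raise_for_status()
js = first.json()
total = min(js["totalResults"], 2000)  # same cap as the sequential example

# Fetch the remaining pages with a few worker threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch_page, range(200, total, 200)))

frames = [pd.json_normalize(js["products"])] + [pd.json_normalize(p) for p in pages if p]
df = pd.concat(frames, ignore_index=True)
print(f"want: {total} got: {len(df)}")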
I have the below code that I use to insert into a table
data = {"name": name, "product": product}
headers = {"Content-Type": "application/json"}
data = {
"name": data["name"],
"type": data["product"]}
response = requests.post(url=API_ENDPOINT, headers=headers, data=json.dumps(data))
resp_dump = json.dumps(response.json())
resp_load = json.loads(resp_dump)
The above code works well. However, I am trying to see how I can add an idle time after every 10 inserts.
Well, you can always use the time module:
from time import sleep

i = 0
while True:
    # insert into the table here
    i += 1
    if i % 10 == 0:
        sleep(10)  # pause after every 10 inserts
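Putting it together with the insert code from the question, here is a sketch that assumes you have an iterable of records to insert; the records list and API_ENDPOINT are placeholders for your own data and endpoint:
import json
import requests
from time import sleep

# records and API_ENDPOINT are placeholders; substitute your own.
records = [{"name": "a", "product": "x"}, {"name": "b", "product": "y"}]
API_ENDPOINT = "https://example.com/api/table"
headers = {"Content-Type": "application/json"}

for i, rec in enumerate(records, start=1):
    data = {"name": rec["name"], "type": rec["product"]}
    response = requests.post(url=API_ENDPOINT, headers=headers, data=json.dumps(data))
    response.raise_for_status()
    if i % 10 == 0:
        sleep(10)  # idle for 10 seconds after every 10 inserts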