Python client to a paginated HTTP API - python

I'm using requests to query a paginated API.
That answer already seems pretty un-pythonic to me, but my case is even worse because the API doesn't have a last-page parameter, only a last-URL parameter, so I have to do:
url = 'http://my_url'
result = request(url)
data = result['datum']
while not result['last_url'] == url:
    url = result['next_page']
    result = request(url)
    data += result['datum']
This doesn't seem pythonic; is there a more elegant way?

I would write a generator function which yields pages:
def pages(url):
    page = {'last_url': None,
            'next_page': url}
    while page['last_url'] != url:
        url = page['next_page']
        page = request(url)
        yield page
Then you can use it like so:
data = ''.join([page['datum'] for page in pages(url)])
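For completeness, here is a minimal runnable sketch of the same pattern written against the requests library; the request() helper and the 'datum' / 'next_page' / 'last_url' keys are taken from the question and are assumptions about this particular API:
import requests

def request(url):
    # hypothetical helper matching the question: fetch one page and return its JSON body
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.json()

def pages(url):
    # yield every page, stopping once the page we fetched reports itself as the last URL
    page = {'last_url': None, 'next_page': url}
    while page['last_url'] != url:
        url = page['next_page']
        page = request(url)
        yield page

data = ''.join(page['datum'] for page in pages('http://my_url'))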

Related

API request search issue - how do I search through all pages

I am trying to search through multiple pages via an API GET request, but I am unsure how to search past the first page of results.
The first GET request downloads the first page of results, but to get to the next page you need to GET the new URL listed at the bottom of the first page.
def getaccount_data():
    view_code = input("Enter Account: ")
    global token
    header = {"X-Fid-Identity": token}
    url = BASE_URL + '/data/v1/entity/account'
    accountdata = requests.get(url, headers=header, verify=False)
    newaccountdata = accountdata.json()
    for data in newaccountdata['results']:
        if data['fields']['view_code'] == view_code:
            print("Account is set up")
        else:
            url = newaccountdata['moreUri']
            print(url)

getaccount_data()
Is there any way to search all the pages by updating the URL to get to the next page?
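One way is to keep GETting until moreUri is no longer present. Below is a minimal sketch under the assumptions that moreUri holds the path of the next page relative to BASE_URL and is missing (or empty) on the last page; BASE_URL, token and the 'results' / 'fields' / 'view_code' keys are taken from the question:
import requests

def getaccount_data():
    view_code = input("Enter Account: ")
    header = {"X-Fid-Identity": token}
    url = BASE_URL + '/data/v1/entity/account'
    while url:
        page = requests.get(url, headers=header, verify=False).json()
        for data in page['results']:
            if data['fields']['view_code'] == view_code:
                print("Account is set up")
                return
        # assumption: 'moreUri' is missing or empty once there are no further pages
        more = page.get('moreUri')
        url = BASE_URL + more if more else None
    print("Account not found on any page")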

scrapy.Request doesn't enter the download middleware, it returns Request instead of response

I'm using scrapy.Spider to scrape, and I want to use a request inside my callback function (which is registered in start_requests), but that request doesn't work: it should return a response, but it only returns a Request.
I followed the debug breakpoint and found that in class Request(object_ref) the request only finished its initialization; it never went into request = next(slot.start_requests) as expected to actually start requesting, and thus only returns a Request.
Here is my code in brief:
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/" + str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
The Request works fine so far, and the following is:
def parse_mashable(self, response):
    item = Item()
    json2parse = response.text
    json_response = json.loads(json2parse)
    d = json_response['dataFeed']  # a list containing dicts, in which there is a url for the detailed article
    for data in d:
        item_url = data['url']  # the url for the detailed article
        item_response = self.get_response_mashable(item_url)
        # here I want to parse the item_response to get the detail
        item['content'] = item_response.xpath("//body").get
        yield item

def get_response_mashable(self, url):
    response = scrapy.Request(url)
    # using self.parser. I've also defined my own parser and yield an item
    # but the problem is it never got to the callback
    return response  # tried yield also but failed
This is where the Request doesn't work. The URL is in allowed_domains, and it's not a duplicate URL. I'm guessing it's because of Scrapy's asynchronous Request mechanism, but how could that affect the request in self.parse_mashable, when by then the Request in start_requests has already finished?
I managed to do the second request with python requests-html, but I still couldn't figure out why.
So could anyone help point out where I'm going wrong? Thanks in advance!
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it.
What you should be doing instead is passing the data you've scraped from the original page to the new callback using the request's meta dict.
For details, check Passing additional data to callback functions.
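As a rough sketch of that approach, adapted from the code above (parse_article is a hypothetical callback name, and the Item 'content' field and the dataFeed / url keys come from the question; treat this as an illustration rather than a drop-in fix):
def parse_mashable(self, response):
    json_response = json.loads(response.text)
    for data in json_response['dataFeed']:
        item = Item()
        # hand the partially built item to the next callback via the request's meta dict
        yield scrapy.Request(data['url'],
                             callback=self.parse_article,
                             meta={'item': item})

def parse_article(self, response):
    # pick the item back up and finish filling it in
    item = response.meta['item']
    item['content'] = response.xpath('//body').get()
    yield item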

Python post request for USPTO site scraping

I'm trying to scrape data from the http://portal.uspto.gov/EmployeeSearch/ website.
I open the site in a browser, click the Search button inside the Search by Organisation part of the site, and look at the request being sent to the server.
When I post the same request using the python requests library in my program, I don't get the result page I'm expecting; I get the same Search page, with no employee data on it.
I've tried all variants; nothing seems to work.
My question is: what URL should I use in my request, and do I need to specify headers (tried that too, copying the headers shown in Firefox developer tools for the request) or something else?
Below is the code that sends the request:
import requests
from bs4 import BeautifulSoup

def scrape_employees():
    URL = 'http://portal.uspto.gov/EmployeeSearch/searchEm.do;jsessionid=98BC24BA630AA0AEB87F8109E2F95638.prod_portaljboss4_jvm1?action=displayResultPageByOrgShortNm&currentPage=1'
    response = requests.post(URL)
    site_data = response.content
    soup = BeautifulSoup(site_data, "html.parser")
    print(soup.prettify())

if __name__ == '__main__':
    scrape_employees()
All the data you need is in a form tag:
action is the URL you POST to on the server.
input elements hold the data you need to post to the server, as {name: value} pairs.
import requests, bs4, urllib.parse, re

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_form(soup):
    form = soup.find(name='form', action=re.compile(r'OrgShortNm'))
    return form

def get_action(form, base_url):
    action = form['action']
    # action is a relative url, convert it to an absolute url
    abs_action = urllib.parse.urljoin(base_url, action)
    return abs_action

def get_form_data(form, org_code):
    data = {}
    for inp in form('input'):
        # if the value is empty, put the org_code in this field
        data[inp['name']] = inp['value'] or org_code
    return data

if __name__ == '__main__':
    url = 'http://portal.uspto.gov/EmployeeSearch/'
    soup = make_soup(url)
    form = get_form(soup)
    action = get_action(form, url)
    data = get_form_data(form, '1634')
    # make a POST request to the action url using the form data
    r = requests.post(action, data=data)
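If the POST succeeds, r.text should contain the results page, which can be parsed the same way as the search page. A small follow-up sketch; the table markup of the results page is an assumption, so adjust the selector to whatever the real page uses:
result_soup = bs4.BeautifulSoup(r.text, 'lxml')
for row in result_soup.find_all('tr'):
    # assumption: the results are laid out in table rows; adjust for the real markup
    print(row.get_text(strip=True))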

Fetch all pages using python request

I am using an API to fetch orders from a website. The problem is that it only fetches 20 orders at a time. I figured out I need to use a pagination iterator but don't know how to use it. How can I fetch all the orders at once?
My code:
def search_orders(self):
    headers = {'Authorization': 'Bearer %s' % self.token, 'Content-Type': 'application/json'}
    url = "https://api.flipkart.net/sellers/orders/search"
    filter = {"filter": {"states": ["APPROVED", "PACKED"]}}
    return requests.post(url, data=json.dumps(filter), headers=headers)
Here is a link to the documentation.
You need to do what the documentation suggests -
The first call to the Search API returns a finite number of results based on the pageSize value. Calling the URL returned in the nextPageURL field of the response gets the subsequent pages of the search result.
nextPageUrl - String - A GET call on this URL fetches the next page results. Not present for the last page
(Emphasis mine)
You can use response.json() to get the JSON of the response. Then you can check the hasMore flag to see if there are more results; if so, use requests.get() to fetch the next page, and keep doing this until hasMore is false. Example -
def search_orders(self):
    headers = {'Authorization': 'Bearer %s' % self.token, 'Content-Type': 'application/json'}
    url = "https://api.flipkart.net/sellers/orders/search"
    filter = {"filter": {"states": ["APPROVED", "PACKED"]}}
    s = requests.Session()
    response = s.post(url, data=json.dumps(filter), headers=headers)
    orderList = []
    resp_json = response.json()
    orderList.append(resp_json["orderItems"])
    while resp_json.get('hasMore') == True:
        response = s.get('https://api.flipkart.net/sellers{0}'.format(resp_json['nextPageUrl']))
        resp_json = response.json()
        orderList.append(resp_json["orderItems"])
    return orderList
The above code should return the complete list of orders.
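One caveat: since orderItems is itself a list, appending each page gives you a list of lists. If you want a single flat list of orders, you can flatten it afterwards, for example:
import itertools

flat_orders = list(itertools.chain.from_iterable(orderList))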

Checking a url for a 404 error scrapy

I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number present in the url (e.g. "http://www.website.com/page/1")
I would like to use a for loop in scrapy to increment the current guess at the page and stop when it reaches a 404. I know the response that is returned from the request contains this information, but I'm not sure how to automatically get a response from a request.
Any ideas on how to do this?
Currently my code is something along the lines of:
def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while stillExists:
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404:  # This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
You can do something like this:
from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    # print(fullURL)
    try:
        req = urllib2.Request(fullURL)
        resp = urllib2.urlopen(req)
        if resp.getcode() == 404:
            # Do whatever you want if 404 is found
            print("404 Found!")
        else:
            # Do your normal stuff here if page is found.
            print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except:
        print("Could not connect to URL: {0}".format(fullURL))
This iterates through the range and attempts to connect to each URL via urllib2. I don't know scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.
Note that most sites using this type of URL format are normally running a CMS that automatically redirects non-existent pages to a custom 404 Not Found page, which will still show up with an HTTP status code of 200. In that case, the best way to detect a page that loads but is really just the custom 404 page is to do some screen scraping and look for anything that would not appear during a "normal" page return, such as text that says "Page not found" or something similarly unique to the custom 404 page.
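A rough sketch of that kind of content check, reusing fullURL and urllib2 from the snippet above; the marker strings are placeholders for whatever text is unique to the site's custom 404 page:
def looks_like_soft_404(html):
    # placeholder markers; replace with text that only appears on the custom 404 page
    markers = ("Page not found", "page you requested could not be found")
    return any(marker in html for marker in markers)

resp = urllib2.urlopen(fullURL)
if resp.getcode() == 200 and looks_like_soft_404(resp.read()):
    print("Soft 404: custom not-found page returned with status 200")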
You need to yield/return the request in order to check the status; creating a Request object does not actually send it.
class MySpider(BaseSpider):
    name = 'website.com'
    baseUrl = "http://website.com/page/"

    def start_requests(self):
        yield Request(self.baseUrl + '0')

    def parse(self, response):
        if response.status != 404:
            page = response.meta.get('page', 0) + 1
            return Request('%s%s' % (self.baseUrl, page), meta=dict(page=page))
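One caveat worth adding: as far as I know, Scrapy's HttpError middleware filters out non-2xx responses before they reach the callback, so for the 404 check above to ever fire the spider probably also needs to whitelist that status, e.g.:
class MySpider(BaseSpider):
    # allow 404 responses through to parse() instead of having them dropped by HttpErrorMiddleware
    handle_httpstatus_list = [404]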
