Fetch data from API inside Scrapy - python

I am working on a project that is divided into two parts:
Retrieve a specific page
Once the ID of this page is extracted,
Send requests to an API to obtain additional information on this page
For the second point, and to follow Scrapy's asynchronous philosophy, where should such code be placed? (I hesitate between the spider and a pipeline.)
Do I have to use additional libraries like asyncio and aiohttp to achieve this asynchronously? (I love aiohttp, so using it is not a problem.)
Thank you.

Since you're doing this to fetch additional information about an item, I'd just yield a request from the parsing method, passing the already scraped information in the meta attribute.
You can see an example of this at https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
This can also be done in a pipeline (either using scrapy's engine API, or a different library, e.g. treq).
I do however think that doing it "the normal way" from the spider makes more sense in this instance.
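For illustration, a minimal sketch of that approach (untested; the spider name, URLs, selectors, and item fields are all hypothetical):

import json
import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"  # hypothetical spider name
    start_urls = ["https://example.com/some-page"]  # hypothetical page

    def parse(self, response):
        # First step: scrape the page and extract its ID
        page_id = response.css("[data-page-id]::attr(data-page-id)").get()  # hypothetical selector
        api_url = "https://example.com/api/pages/%s" % page_id  # hypothetical API endpoint
        # Second step: request the API, carrying the scraped data along in meta
        yield scrapy.Request(
            api_url,
            callback=self.parse_api,
            meta={"page_id": page_id, "title": response.css("title::text").get()},
        )

    def parse_api(self, response):
        # Combine the API payload with the data scraped earlier
        yield {
            "page_id": response.meta["page_id"],
            "title": response.meta["title"],
            "api_data": json.loads(response.text),  # assumes the API returns JSON
        }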

I recently had the same problem (again) and found an elegant solution using Twisted's inlineCallbacks decorator (twisted.internet.defer.inlineCallbacks).
# -*- coding: utf-8 -*-
import scrapy
import re
from twisted.internet.defer import inlineCallbacks
from sherlock import utils, items, regex

class PagesSpider(scrapy.spiders.SitemapSpider):
    name = 'pages'
    allowed_domains = ['thing.com']
    sitemap_follow = [r'sitemap_page']

    def __init__(self, site=None, *args, **kwargs):
        super(PagesSpider, self).__init__(*args, **kwargs)

    @inlineCallbacks
    def parse(self, response):
        # things
        request = scrapy.Request("https://google.com")
        response = yield self.crawler.engine.download(request, self)
        # Twisted executes the request and resumes the generator here with the response
        print(response.text)

Related

Does web scraping have patterns?

I have not done much web scraping so far. I am using Python and BeautifulSoup4 to scrape the Hacker News page.
I was just wondering if there are patterns I should keep in mind before scraping. Right now the code looks very ugly and feels like a hack.
Code:
import requests
from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand  # import added; this is a Django management command


class Command(BaseCommand):
    page = {}
    td_count = 2
    data_count = 0

    def handle(self, *args, **options):
        for i in range(1, 4):
            self.page_no = i
            self.parse()
        print self.page[1]

    def get_result(self):
        return requests.get('https://news.ycombinator.com/news?p=%s' % self.page_no)

    def parse(self):
        soup = BeautifulSoup(self.get_result().text, 'html.parser')
        for x in soup.find_all('table')[2].find_all('tr'):
            self.data_count += 1
            self.page[self.data_count] = {'other_data': None, 'url': ''}
            if self.td_count % 3 == 0:
                try:
                    subtext = x.find_all('td', 'subtext')[0]
                    self.page[self.data_count - 1]['other_data'] = subtext
                except IndexError:
                    pass
            title = x.find_all('td', 'title')
            if title:
                try:
                    self.page[self.data_count]['url'] = title[1].a
                    print title[1].a
                except IndexError:
                    print 'Done page %s' % self.page_no
            self.td_count += 1
I actually treat scrapable data as part of my domain (business) data, which allows me to use Domain-Driven Design to structure the problem:
Entities and Value Objects
I use entities and value objects to store the extracted information in my programming language's data structures, so I can work with it cleanly.
Repository Pattern
I use the repository pattern to delegate the job of gathering data to a separate class. The repository class is given a site, then fetches the data and pre-builds the entities if needed.
Transformer/Presenter Pattern
After fetching the data from the repository, I pass the HTML to a presenter class. The presenter class has the duty of creating my business entities/value objects from the given HTML string.
Service Layer
If there is more processing than what is described above, I make a service class that wraps the problem: it calls the repository, gives the fetched data to the presenter, the presenter builds the entities, and the result can then be used by another service, for example to be stored in a SQL database.
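As a rough Python illustration of that structure (a sketch only; the entity, repository, presenter, and service classes, the URL, and the CSS selector are all hypothetical):

import requests
from bs4 import BeautifulSoup


class Story(object):
    """Entity: a scraped story with only the fields the business cares about."""
    def __init__(self, title, url):
        self.title = title
        self.url = url


class HackerNewsRepository(object):
    """Repository: knows how to fetch the raw HTML for a given page."""
    def fetch_page(self, page_no):
        return requests.get('https://news.ycombinator.com/news?p=%s' % page_no).text


class StoryPresenter(object):
    """Presenter: builds entities from the raw HTML string."""
    def present(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        # hypothetical selector; adjust to the page's real markup
        return [Story(a.get_text(), a.get('href')) for a in soup.select('td.title a')]


class StoryService(object):
    """Service: wires the repository and the presenter together."""
    def __init__(self, repository, presenter):
        self.repository = repository
        self.presenter = presenter

    def stories_for_page(self, page_no):
        html = self.repository.fetch_page(page_no)
        return self.presenter.present(html)


# Usage: the result could then be handed to another service, e.g. for storage.
service = StoryService(HackerNewsRepository(), StoryPresenter())
for story in service.stories_for_page(1):
    print(story.title, story.url)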
If you are familiar with PHP, I have written a small app in Laravel which fetches the Alexa rank of a given website every 15 minutes and notifies the subscribers of that website by email.
Github repository : Alexa Watcher
Folder of Repository classes
Command line application layer class which calls the service
The Service class which is also a presenter that builds needed entities.
The Service class which pushes detected changes to subscriber emails.

Scrapy Deploy Doesn't Match Debug Result

I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, where there are some category-list links that are used to build the second wave of links.
The second round of links are usually the first page of each category. The different pages inside a category follow the same regular-expression pattern, wholesale/something/something/request or wholesale/pagenumber, and I want to follow those patterns to keep crawling while storing the raw HTML in my item object.
I tested these two steps separately by using scrapy parse and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
I can see it built the outlinks successfully. Then I tested the built outlink:
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
It seems the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link the two steps together by using the depth argument, I saw that it crawled the outlinks but no items were generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
# imports added for completeness; MyprojectItem is assumed to live in the project's items module
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
from myproject.items import MyprojectItem


class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=('/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=('/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!
I was assuming that the new Request objects I built would be run against the rules and then parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, I found that the callback is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even if the URLs I built match the second rule, the responses won't be passed to parse_pricing unless the callback is set explicitly. Hope this is helpful to other people.

python django soaplib response with classmodel issue

I run a soap server in django.
Is it possible to create a soap method that returns a soaplib classmodel instance without <{method name}Response><{method name}Result> tags?
For example, here is a part of my soap server code:
# -*- coding: cp1254 -*-
from soaplib.core.service import rpc, DefinitionBase, soap
from soaplib.core.model.primitive import String, Integer, Boolean
from soaplib.core.model.clazz import Array, ClassModel
from soaplib.core import Application
from soaplib.core.server.wsgi import Application as WSGIApplication
from soaplib.core.model.binary import Attachment


class documentResponse(ClassModel):
    __namespace__ = ""
    msg = String
    hash = String


class MyService(DefinitionBase):
    __service_interface__ = "MyService"
    __port_types__ = ["MyServicePortType"]

    @soap(String, Attachment, String, _returns=documentResponse, _faults=(MyServiceFaultMessage,), _port_type="MyServicePortType")
    def sendDocument(self, fileName, binaryData, hash):
        binaryData.file_name = fileName
        binaryData.save_to_file()
        resp = documentResponse()
        resp.msg = "Saved"
        resp.hash = hash
        return resp
and it responds like this:
<senv:Body>
  <tns:sendDocumentResponse>
    <tns:sendDocumentResult>
      <hash>14a95636ddcf022fa2593c69af1a02f6</hash>
      <msg>Saved</msg>
    </tns:sendDocumentResult>
  </tns:sendDocumentResponse>
</senv:Body>
But I need a response like this:
<senv:Body>
  <ns3:documentResponse>
    <hash>A694EFB083E81568A66B96FC90EEBACE</hash>
    <msg>Saved</msg>
  </ns3:documentResponse>
</senv:Body>
What kind of configuration should I make in order to get the second response I mentioned above?
Thanks in advance.
I haven't used Python's SoapLib yet, but had the same problem while using .NET soap libs. Just for reference, in .NET this is done using the following decorator:
[SoapDocumentMethod(ParameterStyle=SoapParameterStyle.Bare)]
I've looked in the soaplib source, but it seems it doesn't have a similar decorator. The closest thing I've found is the _style property. As seen from the code https://github.com/soaplib/soaplib/blob/master/src/soaplib/core/service.py#L124 - when using
@soap(..., _style='document')
it doesn't append the %sResult tag, but I haven't tested this. Just try it and see if it works the way you want.
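If it does, the decorator from the question might be adjusted roughly like this (untested; the signature is copied from the question above, only _style='document' is added):

@soap(String, Attachment, String,
      _returns=documentResponse,
      _faults=(MyServiceFaultMessage,),
      _port_type="MyServicePortType",
      _style='document')  # the untested suggestion from above
def sendDocument(self, fileName, binaryData, hash):
    ...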
If it doesn't work, but you still want to get this kind of response, look at Spyne:
http://spyne.io/docs/2.10/reference/decorator.html
It is a fork of soaplib (I think) and has the _soap_body_style='bare' option, which I believe is what you want.

Retrieve data from an Eventlet GreenPile object, possibly iterator related

I am currently modifying a simple monitoring script I made some time ago that basically:
Build a list of dictionaries containing, amongst other things
A website URL
The time it took to respond (set as None by default)
The data it sent back (set as None by default)
Query (GET) each URL from the list and fill the 'time' and 'data' fields with the relevant data.
Store the results in a database.
The script used to work fine, but as the list of URLs to monitor grew, the time it takes to complete all the queries became way too long for me.
My solution is to modify the script to fetch the URLs in a concurrent way. To do that I chose to use eventlet, since this example from the documentation does almost exactly what I want.
The catch is that since my list of URLs contains dictionaries, I can't use pool.imap() to iterate through my list (as far as I know).
The Eventlet documentation has another similar example* that uses a GreenPile object to spawn jobs; it seems I can use that to launch my URL-fetching function, but I can't seem to retrieve the result of the thread.
Here is my test code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import eventlet
from eventlet.green import urllib2

urls = [{'url': 'http://www.google.com/intl/en_ALL/images/logo.gif', 'data': None},
        {'url': 'http://www.google.com', 'data': None}]

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
pile = eventlet.GreenPile(pool)

for url in urls:
    pile.spawn(fetch, url['url'])  # can I get the return of the function here?

# or
for url in urls:
    url['data'] = ???  # How do I get my data back?

# Eventlet's documentation way
data = "\n".join(pile)
As far as I understand, pile is an iterable, so I can iterate through it, but I can't access its content via an index. Is this correct?
So, how can I directly fill my urls list (is that even possible)? Another solution could be to build one "flat" list of URLs and another list containing the URL, response time, and data, then use pool.imap() on the first list and fill the second one with the results, but I'd rather keep my list of dictionaries.
*I can't post more than 3 links with this account, please see the "Design patterns - Dispatch patterns" page from the eventlet documentation.
You can iterate over the GreenPile, but you need to return something from your fetch so you have more than just the response. I modified the example so that fetch returns a tuple of the URL and the response body.
The urls variable is now a dict mapping URLs (strings) to data (None or string). Iterating over a GreenPile continues until there are no more tasks, and the iteration should be done in the same thread that calls spawn.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import eventlet
from eventlet.green import urllib2

# Change to map urls to the data found at them
urls = {'http://www.google.com/intl/en_ALL/images/logo.gif': None,
        'http://www.google.com': None}

def fetch(url):
    # return the url and the response
    return (url, urllib2.urlopen(url).read())

pool = eventlet.GreenPool()
pile = eventlet.GreenPile(pool)

for url in urls.iterkeys():
    pile.spawn(fetch, url)  # can I get the return of the function here? - No

for url, response in pile:
    # stick it back into the dict
    urls[url] = response

for k, v in urls.iteritems():
    print '%s - %d bytes' % (k, len(v))

Captchas in Scrapy

I'm working on a Scrapy app, where I'm trying to login to a site with a form that uses a captcha (It's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.
My question is how can I restart the spider, to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline, and displayed to the user. I'm unclear how I can resume the spider's progress, and pass the solved captcha and same session to the spider, as I believe the spider has to return the item (e.g. quit) before the ImagesPipeline goes to work.
I've looked through the docs and examples, but I haven't found any that make it clear how to make this happen.
This is how you might get it to work inside the spider.
self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()
Once you get the request, pause the engine, display the image, read the info from the user, and resume the crawl by submitting a POST request for login.
I'd be interested to know if the approach works for your case.
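For illustration, a rough sketch of that pause/solve/resume flow inside a spider callback (untested; the form field names and the callback are hypothetical, and FormRequest is assumed to be imported from scrapy.http):

def parse_captcha_page(self, response):
    # Pause the engine while a human solves the captcha
    self.crawler.engine.pause()
    # The captcha image is shown to the user elsewhere (e.g. via the ImagesPipeline);
    # here we simply read the solution from stdin
    captcha_text = raw_input("Enter the captcha text > ")
    self.crawler.engine.unpause()

    # Resume the crawl by submitting the login form with the solved captcha
    return FormRequest.from_response(
        response,
        formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha_text},  # hypothetical field names
        callback=self.after_login,  # hypothetical callback
    )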
I would not create an Item and use the ImagesPipeline.
import urllib
import os
import subprocess
...

def start_requests(self):
    request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
    return [request]

def fill_login_form(self, response):
    x = HtmlXPathSelector(response)
    img_src = x.select("//img/@src").extract()

    # delete the previous captcha file and use urllib to write the new one to disk
    os.remove("c:\captcha.jpg")
    urllib.urlretrieve(img_src[0], "c:\captcha.jpg")

    # I use a program here to show the jpg (actually send it somewhere)
    captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe")

    # OR just get the input from the user from stdin
    captcha = raw_input("put captcha in manually>")

    # this function performs the request and calls process_home_page with
    # the response (this way you can chain pages from start_requests() to parse())
    return [FormRequest.from_response(response, formnumber=0, formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha}, callback=self.process_home_page)]

def process_home_page(self, response):
    # check if you logged in etc. etc.
    ...
What I do here is use urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility that solves the captcha). The whole Scrapy infrastructure is not used in this "hack", because solving a captcha like this is always a hack.
The whole calling-an-external-subprocess thing could have been done more nicely, but this works.
On some sites it's not possible to save the captcha image; you have to open the page in a browser, call a screen-capture utility, and crop an exact location to "cut out" the captcha. Now that is screen scraping.
