Integrating Django Rest Framework and Scrapy - python

Both Scrapy and Django Frameworks are standalone best framework of Python to build crawler and web applications with less code, Though still whenever You want to create a spider you always have to generate new code file and have to write same piece of code(though with some variation.) I was trying to integrate both. But stuck at a place where i need to send the status 200_OK that spider run successfully, and at the same time spider keep running and when it finish off it save data to database.
Though i know the API are already available with scrapyd. But i Wanted to make it more versatile. That lets you create crawler without writing multiple file. I thought The Crawlrunner https://docs.scrapy.org/en/latest/topics/practices.html would help in this,therefor try this thing also
t Easiest way to run scrapy crawler so it doesn't block the script
but it give me error that the builtins.ValueError: signal only works in main thread
Even though I get the response back from the Rest Framework. But Crawler failed to run due to this error does that mean i need to switch to main thread?
I am doing this with a simple piece of code
spider = GeneralSpider(pk)
runner = CrawlerRunner()
d = runner.crawl(GeneralSpider, pk)
d.addBoth(lambda _: reactor.stop())
reactor.run()

I ran scrapy spider in django view, and sharing my code.
settings_file_path = "scraping.settings" # Scrapy Project Setting
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
settings = get_project_settings()
runner = CrawlerRunner(settings)
path = "/path/to/sample.py"
path = url.replace('.py', '')
path = url.replace('/', '.')
file_path = ".SampleSpider".format(path)
SampleSpider = locate(file_path)
d = runner.crawl(SampleSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
I hope it's helpful.

Related

Scrapy Doesn't output json when called from another script

I've written a scrapy crawler to fetch me some sweet sweet data and it works. I'm very impressed with myself for the achievement. I even created a jupyter notebook to process the data from the Json file I created.
But I've creatyed the program so that people at work can use it and getting them to navigate to a folder and use command lines isn't going to work so I wanted to make something that I can call on and then process afterwards. But for some reason Scrapy just isnt playing ball. I've found a few bits of help but once the crawl has been completed the json output I've requested doesn't appear. But when i command line it, it shows up.
def parse(self, response):
resp_dict = json.loads(response.body)
f = open(file_name, 'w')
json.dump(resp_dict, f, indent=4)
f.close()
this is the bit that works, sometimes. I just don't understand why it wont give me an output when called from a different script. I've also tried to add this in but I think i'm putting it in the wrong place.
settings = get_project_settings()
settings.set('FEED_FORMAT', 'json')
settings.set('FEED_URI', 'result.json')
I can successfully call the Scrapy Spider, i can see the terminal showing me what's going on. But I just can't get the json output. Literally tearing my hair out now.
process = CrawlerProcess(settings={
"FEEDS": {
"items.json": {"format": "json"},
}
})
process.crawl(HoggleSpider)
process.start()

Why my simple pyramid application cannot work with pexpect?

I have a very simple pyramid application which serves a simple static page. Let's say its name is mypyramid and uses port 9999.
If I launch mypyramid in another linux console manually, then I can use the following code to print out the html string.
if __name__ == "__main__":
import urllib2
print 'trying to download url'
response = urllib2.urlopen('http://localhost:9999/index.html')
html = response.read()
print html
But I want to launch mypyramid in an application automatically.
So in my another application, I used pexpect to launch mypyramid, and then try to get the html string from http://localhost:9999/index.html.
def _start_mypyramid():
p = pexpect.spawn(command='./mypyramid')
return p
if __name__ == "__main__":
p = _start_mypyramid()
print p
print 'mypyramid started'
import urllib2
print 'trying to download url'
response = urllib2.urlopen('http://localhost:9999/index.html')
html = response.read()
print html
It seems mypyramid has been successfully launched using pexpect, as I can see the print of the process and mypyramid started has been reached.
However, the application is just hanging after trying to download url, and I can't get anything.
What is the solution? I mean I thought pexpect would create another process. If that's true, then why it is stopping the retrieval of the html?
My guess would be that the child returned by pexpect.spawn needs to communicate.
It attempts to write but nobody reads, so the app stops. (I am only guessing though).
If you do not have any reason to use pexpect (which you probably don't if you do not communicate with the child process), why wouldn't you just go for a standard module subprocess?

Scrapy parser. A bit sophisticated

I need to parse all articles from one site. There is 1000+ shops in this site.
To get any one article I need a id_shop in cookies. I do that by Requests module
To get all 1000+ id_shops I need to parse Ajax forms.
Then I run 1000+ spiders for each shop this way:
def setup_crawler(domain):
spider = MySpider(domain=domain)
settings = get_project_settings()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
So I have .py script which do all that steps and I run it by python MySpider.py. Everything works.
The broplem is: I can't run my spider simultaneously with another one's. I'm following that rule(listed here http://doc.scrapy.org/en/latest/topics/practices.html):
for domain in ['scrapinghub.com', 'insophia.com']:
setup_crawler(domain)
log.start()
reactor.run()
Instead of setup_crawler() I use MySpider.run().
I got that MySpider waits anothers.
I have two quastions:
1. How to run MySpider simultaneously with another one's?
2. I want to parse id_shops from ajax and run 1000+ spiders for each id_shop in one spider. Is it possible?

Captchas in Scrapy

I'm working on a Scrapy app, where I'm trying to login to a site with a form that uses a captcha (It's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.
My question is how can I restart the spider, to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline, and displayed to the user. I'm unclear how I can resume the spider's progress, and pass the solved captcha and same session to the spider, as I believe the spider has to return the item (e.g. quit) before the ImagesPipeline goes to work.
I've looked through the docs and examples but I haven't found any ones that make it clear how to make this happen.
This is how you might get it to work inside the spider.
self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()
Once you get the request, pause the engine, display the image, read the info from the user& resume the crawl by submitting a POST request for login.
I'd be interested to know if the approach works for your case.
I would not create an Item and use the ImagePipeline.
import urllib
import os
import subprocess
...
def start_requests(self):
request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
return [request]
def fill_login_form(self,response):
x = HtmlXPathSelector(response)
img_src = x.select("//img/#src").extract()
#delete the captcha file and use urllib to write it to disk
os.remove("c:\captcha.jpg")
urllib.urlretrieve(img_src[0], "c:\captcha.jpg")
# I use an program here to show the jpg (actually send it somewhere)
captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe")
# OR just get the input from the user from stdin
captcha = raw_input("put captcha in manually>")
# this function performs the request and calls the process_home_page with
# the response (this way you can chain pages from start_requests() to parse()
return [FormRequest.from_response(response,formnumber=0,formdata={'user':'xxx','pass':'xxx','captcha':captcha},callback=self.process_home_page)]
def process_home_page(self, response):
# check if you logged in etc. etc.
...
What I do here is that I import urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.checoutput (to call an external command line utility to solve the captcha). The whole Scrapy infrastructure is not used in this "hack", because solving a captcha like this is always a hack.
That whole calling external subprocess thing could have been one nicer, but this works.
On some sites it's not possible to save the captcha image and you have to call the page in a browser and call a screen_capture utility and crop on an exact location to "cut out" the captcha. Now that is screenscraping.

Using Python/Selenium/Best Tool For The Job to get URI of image requests generated through JavaScript?

I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.
I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.
My intuition is that it might be possible using python + qtwebkit, but perhaps there is a better way.
To clarify: I might have this (overly simplified code).
<script>
suffix = magicNumberFunctionIDontHaveAccessTo();
url = "http://foobar.com/function?parameter=" + suffix
img = document.createElement('img'); img.src=url; document.all.body.appendChild(img);
</script>
Then once the page is loaded, I can go figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().
Any help would be muchly appreciated!
Thank you.
The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. By using Rhino, it can evaluate JavaScript and likely be used to extract that URL out.
That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instant be started with an option of captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.
Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
I would pick any one of the many http proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them) e.g. appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.
Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL with the proxy in question set as your proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and record them wherever you wish; turns down the proxy and Firefox processes.
I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".
Use Firebug Firefox plugin. It will show you all requests in real time and you can even debug the JS in your Browser or run it step-by-step.
Ultimately, I did it in python, using Selenium-RC. This solution requires the python files for selenium-rc, and you need to start the java server ("java -jar selenium-server.jar")
from selenium import selenium
import unittest
import lxml.html
class TestMyDomain(unittest.TestCase):
def setUp(self):
self.selenium = selenium("localhost", \
4444, "*firefox", "http://www.MyDomain.com")
self.selenium.start()
def test_mydomain(self):
htmldoc = open('site-list.html').read()
url_list = [link for (element, attribute,link,pos) in lxml.html.iterlinks(htmldoc)]
for url in url_list:
try:
sel = self.selenium
sel.open(url)
sel.select_window("null")
js_code = '''
myDomainWindow = this.browserbot.getUserWindow();
for(obj in myDomainWindow) {
/* This code grabs the OMNITURE tracking pixel img */
if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {
var ret = myDomainWindow[obj].src;
}
}
ret;
'''
omniture_url = sel.get_eval(js_code) #parse&process this however you want
except Exception, e:
print 'We ran into an error: %s' % (e,)
self.assertEqual("expectedValue", observedValue)
def tearDown(self):
self.selenium.stop()
if __name__ == "__main__":
unittest.main()
Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?
If it is loaded in your page, then this may be a dirty hack (substitute document.body for whatever element is considered):
var ac = document.body.appendChild;
var sources = [];
document.body.appendChild = function(child) {
if (/^img$/i.test(child.tagName)) {
sources.push(child.getAttribute('src'));
}
ac(child);
}

Categories