I have been searching for a way to automate my Scrapy runs and write the output to an Excel-readable file (CSV). So far, the only way I have found is the tedious, manual command:
scrapy crawl myscript -o myscript.csv -t csv
I want to be able to format each of these into a more organized "row" layout. Furthermore, is there any way I can automate the scraper? Ideally, I want the code to run once per day, and I want to be notified when there has been an update regarding my scrape, an update being a relevant new post.
My spider is working, and here is the code:
import scrapy
from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem


class Spider(XMLFeedSpider):
    name = "Test"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=GOOGL',)
    itertag = 'item'

    def parse_node(self, response, node):
        item = {}
        item['title'] = node.xpath('title/text()').extract_first()
        # pubDate is a direct child of <item> in the RSS feed, not of <link>
        item['pubDate'] = node.xpath('pubDate/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first()
        return item
I am aware that to further export/organize my scraper output, I have to edit the pipeline settings (at least according to most of the articles I have read).
Below is my pipelines.py code:
class YahooscrapePipeline(object):
    def process_item(self, item, spider):
        return item
How can I set it up so that I can just run the spider and it will automatically write the output file?
Update: I am using Scrapinghub's API, which uses the shub module to host my spider. It is very convenient and easy to use.
Scrapy itself does not handle periodic execution or scheduling; it is completely out of Scrapy's scope. I'm afraid the answer will not be as simple as you'd like, but it is what's needed.
What you CAN do is:
Use celerybeat to allow scheduling based on a crontab schedule. Running Celery tasks periodically (without Django) and http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html should get you started.
The other thing I suggest is that you host your spider in scrapyd. That will buy you log retention and a nice JSON API to use when you get more advanced :).
The Stack Overflow link gives you sample code for running Celery without Django (as a lot of examples assume Django). Remember to run the beat scheduler and not the task directly, as pointed out in the link.
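A minimal sketch of the celerybeat approach, assuming a Redis broker and a task that simply shells out to the scrapy command line (the broker URL, spider name, output file, and schedule are all placeholders):

# tasks.py - hedged celerybeat sketch; broker URL, names, and schedule are assumptions
import subprocess

from celery import Celery
from celery.schedules import crontab

app = Celery('scraper', broker='redis://localhost:6379/0')

@app.task
def crawl():
    # shell out to the scrapy CLI exactly as you would run it by hand
    subprocess.check_call(['scrapy', 'crawl', 'Test', '-o', 'output.csv', '-t', 'csv'])

app.conf.beat_schedule = {
    'daily-crawl': {
        'task': 'tasks.crawl',
        'schedule': crontab(hour=6, minute=0),  # once a day at 06:00
    },
}

Start the worker with the beat scheduler enabled, e.g. celery -A tasks worker -B, rather than calling the task directly.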
As to your question about organizing the output of your scraper: you mention that you are familiar with exporters, so the thing to do is use the CSV exporter and register the fields to export in your settings. The order in which they appear in your settings is the order in which they will be written to the CSV file. A hedged example of the settings is below.
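For instance, something like this in settings.py should give you both a fixed column order and automatic CSV output on every run (the output path is just an example; FEED_URI/FEED_FORMAT are the older-style feed settings):

# settings.py - hedged sketch; the output path is an example
FEED_EXPORT_FIELDS = ['title', 'pubDate', 'link', 'description']  # CSV column order

# write the feed automatically on every run instead of passing -o/-t on the CLI
FEED_FORMAT = 'csv'
FEED_URI = 'output/%(name)s-%(time)s.csv'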
If I misunderstood that part of the question and you mean vertical rather than horizontal alignment of your items, and you don't have many fields, a quick hack is to append \n (newline) or \t (tab) to the item values in your spider. I could give an example, but it is such a hacky thing to do that I'll spare you.
As to scheduling a spider:
As others have mentioned, there is Scrapyd, which I use together with scrapymon. But be warned: at the moment Scrapyd has some compatibility issues, so do force yourself to create a virtual environment for your Scrapyd projects.
There's a sizeable learning curve to getting Scrapyd set up the way you want it.
Using Django with Celery is by far the best solution once your scraping gets serious, but it comes with a much steeper learning curve: now you have to deal with server administration, which is even more of a pain when it's not a local server, plus the cost of custom integration or of altering a web-based GUI. If you don't want to mess with all that, do what I did for a long time: use Scrapinghub. Get acquainted with their API; you can use curl or the Python modules they provide, and cron-schedule your spiders right from your PC as you see fit. The scraping is done remotely, so you keep your own resources free. A hedged example of triggering a run through their HTTP API is below.
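The sketch below uses requests against Scrapinghub's run.json endpoint and could be called from a cron job; the project id, spider name, and API key are placeholders, and the endpoint may differ between API versions:

# run_spider.py - hedged sketch; all identifiers below are placeholders
import requests

API_KEY = 'your-scrapinghub-api-key'

resp = requests.post(
    'https://app.scrapinghub.com/api/run.json',
    auth=(API_KEY, ''),  # the API key is the username, with an empty password
    data={'project': '12345', 'spider': 'Test'},
)
resp.raise_for_status()
print(resp.json())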
Related
Locust is a great and simple load-testing tool. By default it only tracks response times and content length, from which it can deduce RPS, etc. Is there any way to track custom statistics in Locust as well?
In my case, the site I'm testing returns a couple of stats via headers, for example a count of SQL queries within a request. It would be very helpful to track some of these statistics alongside the standard response times.
I don't see any way to do that in Locust, however. Is there a simple way of doing it?
The only customization I could find in the docs is setting URL names on a request.
Manually storing some of the stats is not that straightforward either, since Locust is distributed, so I would like to avoid doing anything custom.
Edit:
There is an example of how custom stats can be passed around, but those do not show up in the UI and require a custom export. Is there any way to add additional data in Locust so that it gets logged both in the UI and in the data export?
Maybe something like:
class MyTaskSet(TaskSet):
    @task
    def my_task(self):
        response = self.client.get("/foo")
        # self.record() is the API I wish existed -- it is not part of Locust
        self.record(foo=response.headers.get('x-foo'))
As far as I know, there is no simple way of visualizing custom data in Locust. However, by looking at https://github.com/locustio/locust/blob/master/locust/main.py#L370, you could replace Locust's main run function and inject some custom logic into https://github.com/locustio/locust/blob/master/locust/web.py. This seems like low-hanging fruit for the Locust devs to make this part of the code more adjustable out of the box, so I'd suggest opening an issue on their GitHub.
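As a workaround, here is a hedged sketch for older Locust versions, where locust.web.app is a plain Flask app: aggregate the header values yourself and expose them on an extra route next to the built-in UI. The /custom-stats route and the custom_stats dict are my own additions, not part of Locust's API; the numbers will not appear in the standard UI or CSV export, and in distributed mode each process only sees its own counts:

# locustfile.py - hedged sketch for older Locust (0.x)
import json
from collections import defaultdict

from locust import HttpLocust, TaskSet, task, web

custom_stats = defaultdict(int)

@web.app.route('/custom-stats')
def custom_stats_view():
    # serve the aggregated values as JSON alongside the built-in web UI
    return json.dumps(custom_stats)

class MyTaskSet(TaskSet):
    @task
    def my_task(self):
        response = self.client.get('/foo')
        # accumulate a per-request stat the server returns in a header
        custom_stats['foo'] += int(response.headers.get('x-foo', 0))

class MyLocust(HttpLocust):
    task_set = MyTaskSet
    min_wait = 1000
    max_wait = 2000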
I'm learning Django, and to practice I'm currently developing a clone of YTS, a movie torrents repository*.
As of right now, I have scraped all the movies on the website and have them in a single db table called Movie with all the basic information about each movie (I'm planning on adding one more table for Genre).
Every few days YTS posts new movies, and I want my clone to add them to the database automatically. I'm currently stuck deciding how to do this:
I was planning on comparing the id of the last movie in my db against the last movie in the YTS db every time a user enters the website, but that would mean making a request to YTS every time my page loads; it would also mean some very slow code would have to run inside my index() view.
Another strategy would be to query the last time my db was updated (i.e. when new entries were introduced) and, if it is older than, say, a day, request new movies from YTS. The problem is that I can't find any method to query the time of the last db update. Does such a method even exist?
I could also set up a cron job to update the information, but I'm having trouble making changes from a separate Python script (I import django.db and such, but the interpreter refuses to execute Django db instructions).
So, all in all, what's the best strategy to update my database from a third-party service/website without bothering the user with loading times? How do you set up such updates so they are non-intrusive to the user? How do you generally do it?
* I know a torrents website borders on the illegal, and I do not intend, in any way, to make my project available to the public.
I think you should definitely choose the third alternative; a cron job to update the database regularly seems the best option.
You don't need a separate Python function; you can schedule a task with Celery, which can be easily integrated with Django using django-celery.
The simplest way would be to write a custom management command and run it periodically from a cron job.
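A minimal sketch of such a command, assuming a Movie model with yts_id/title/year fields; the app name, model fields, and the YTS endpoint and response shape are assumptions:

# myapp/management/commands/update_movies.py - hedged sketch
import requests
from django.core.management.base import BaseCommand

from myapp.models import Movie

class Command(BaseCommand):
    help = "Fetch recently listed movies from YTS and add any we don't have yet"

    def handle(self, *args, **options):
        resp = requests.get('https://yts.mx/api/v2/list_movies.json', params={'limit': 50})
        resp.raise_for_status()
        for data in resp.json()['data']['movies']:
            movie, created = Movie.objects.get_or_create(
                yts_id=data['id'],
                defaults={'title': data['title'], 'year': data['year']},
            )
            if created:
                self.stdout.write('added %s' % movie.title)

A crontab entry that runs python manage.py update_movies once a day then keeps the table current without touching the request/response cycle.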
I want to create my own service for the scrapyd API, which should return a little more information about a running crawler. I got stuck at the very beginning: where should I place the module that will contain that service? If we look at the default scrapyd.conf, it has a section called services:
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
so these are the absolute import paths to each service in the scrapyd package, which lives in the dist-packages folder. Is there any way to place my own module containing a service somewhere other than the dist-packages folder?
Update:
I realize the question may be unclear. Scrapy is a framework for parsing data from websites. I have a simple Django site from which I can start/stop crawlers for a specific region, etc. (http://54.186.79.236, it's in Russian). Manipulating the crawlers happens through the scrapyd API. By default it only has a small set of APIs for starting/stopping/listing crawlers and their logs; they are listed in the docs: http://scrapyd.readthedocs.org/en/latest/api.html
So that was a little intro; now to the question. I want to extend the existing API to retrieve more info from a running crawler and render it on the website mentioned above. For this I need to inherit from the existing scrapyd.webservice.WsResource and write a service. That part works if I place the service module in one of the sys.path paths. But I want to keep the module containing this service in the Scrapy project folder (for aesthetic reasons). If I keep it there, scrapyd (predictably) complains 'No module named' on launch.
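For reference, a hedged sketch of what such a service can look like; it leans on scrapyd internals (the launcher's processes dict) that may differ between versions, and the module and class names are mine:

# webservice_ext.py - hedged sketch of a custom scrapyd service
from scrapyd.webservice import WsResource

class RunningStats(WsResource):

    def render_GET(self, txrequest):
        # scrapyd keeps the currently running crawls on the launcher
        running = [
            {
                'project': p.project,
                'spider': p.spider,
                'job': p.job,
                'start_time': str(getattr(p, 'start_time', '')),
            }
            for p in self.root.launcher.processes.values()
        ]
        return {'status': 'ok', 'running': running}

It then needs to be registered under [services] in scrapyd.conf (e.g. runningstats.json = webservice_ext.RunningStats), and the module has to be importable on scrapyd's sys.path, which is exactly the part the question is about.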
So, I solved my problem according to this.
I have a large number of automatically generated HTML files that I would like to push to my Plone website with a script. I currently generate the files, log into Plone, click edit on each individual page, and copy and paste the HTML into the editor. I'd like to automate this. It would be nice to retain Plone's versioning, have an auto-generated comment for the edit, and have the edit come from a specific user.
I've read about and tried WebDAV with little luck getting it to work consistently, and I know there is a way to connect to Plone via FTP, but I haven't tried it. I'm not sure whether these are the methods I need.
My Google searches aren't leading me to anything useful. Any ideas on where to start looking for a solution to this? Or any tips on implementing it?
You can script anything in Plone via the following methods:
Through-the-web via API calls (e.g. XML-RPC, wsapi, etc.)
The bin/instance run script provided by plone.recipe.zope2instance (See charm for an example of this).
You can also use a migration framework like:
collective.transmogrifier
which allows you to write migration code, and trigger it via GenericSetup or Browser view. Additionally, there are applications written on top of Transmogrifier aimed roughly at what you are describing, the most popular of which is:
funnelweb
I would recommend that you consider using or writing a Transmogrifier "blueprint(s)" to do your import, and execute the pipeline with a tool that makes that easy:
mr.migrator
You can find blueprints by searching PyPI for "transmogrify". One popular set of blueprints is:
quintagroup.transmogrifier
One of the main attractions to the Transmogrifier approach, aside from getting the job done, is the ability to share useful blueprints with others.
I think Transmogrifier is the best tool for this job, but it will definitely be a programming task no matter how you do it. It is used for many migration jobs, such as migrating from Drupal.
There's an add-on, wsapi4plone.core, which pumazi at WebLion started, that provides web services for portals which you can then hook into. You can create, modify, and delete content via XML-RPC calls. The only caveat is that it doesn't yet work with Collections (criteria, specifically).
project: http://pypi.python.org/pypi/wsapi4plone.core
docs: http://packages.python.org/wsapi4plone.core/
You can also do it programmatically by hooking into the ZODB via Python (zopepy or some other method).
These should get you started:
http://plone.org/documentation/kb/manipulating-plone-objects-programmatically/reading-and-writing-field-values - this should give you an understanding of accessors and mutators (getters and setters); in your case you will most likely be working with obj.Text (getter) and obj.setText (setter). A hedged sketch of such a script follows after these links.
https://weblion.psu.edu/trac/weblion/wiki/AutomatingObjectCreation - lots of examples (slightly outdated but still relevant)
http://plone.org/documentation/faq/upload-images-files
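For the bin/instance run route mentioned above, here is a hedged sketch; the site id "Plone", the page path, the user, and the file name are placeholders, and it is run as bin/instance run update_page.py:

# update_page.py - hedged sketch for "bin/instance run update_page.py"
import transaction
from AccessControl.SecurityManagement import newSecurityManager
from Testing.makerequest import makerequest

app = makerequest(app)  # `app` (the Zope root) is injected by bin/instance run

# act as a real user so the edit is attributed and security checks pass
admin = app.acl_users.getUser('admin').__of__(app.acl_users)
newSecurityManager(None, admin)

page = app.Plone.restrictedTraverse('my-folder/my-page')
with open('generated/my-page.html') as f:
    page.setText(f.read(), mimetype='text/html')
page.reindexObject()
transaction.commit()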
Try enabling WebDAV or FTP in Plone; then you can access Plone via WebDAV or FTP clients and push the HTML files. Plone (Zope) will recognize the HTML files as Pages.
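If you go the WebDAV route, here is a hedged sketch of pushing a single file over HTTP PUT; the host, port, folder path, and credentials are placeholders, and the exact behaviour depends on how WebDAV is enabled in your Zope/Plone instance:

# push_page.py - hedged sketch; URL and credentials are placeholders
import requests

with open('generated/my-page.html') as f:
    html = f.read()

resp = requests.put(
    'http://localhost:8080/Plone/my-folder/my-page.html',
    data=html.encode('utf-8'),
    headers={'Content-Type': 'text/html'},
    auth=('admin', 'secret'),
)
resp.raise_for_status()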
I am really new to Python; I have just played around with the Scrapy framework, which is used to crawl websites and extract data.
My question is: how do I pass parameters to a Python script that is hosted somewhere online?
E.g. I make the following request: mysite.net/rest/index.py
Now I want to pass some parameters, similar to PHP's *.php?id=...
Yes, that would work, although you would need to write handlers for extracting the URL parameters in index.py. Try the cgi module in Python for this; a minimal sketch follows.
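A minimal CGI sketch, assuming the server is configured to execute .py files as CGI scripts (the parameter name matches the ?id=... example above):

#!/usr/bin/env python
# rest/index.py - minimal CGI sketch
import cgi

form = cgi.FieldStorage()          # parses the query string, e.g. index.py?id=42
item_id = form.getfirst('id', '')  # empty string if the parameter is missing

print('Content-Type: text/plain')
print('')
print('You requested id=%s' % item_id)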
Please note that there are several robust Python-based web frameworks available (e.g. Django, Pylons) which automatically parse your URL and build a dictionary of all its parameters, and they do much more on top of that, like session management and user authentication. I would highly recommend using one of them for faster turnaround and fewer maintenance hassles.
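For example, in Django the query-string parameters arrive already parsed in request.GET (the view name here is illustrative):

# views.py - minimal Django sketch; the view name is illustrative
from django.http import HttpResponse

def index(request):
    item_id = request.GET.get('id', '')  # e.g. /rest/?id=42
    return HttpResponse('You requested id=%s' % item_id)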