The website I am trying to crawl has the following structure:
there are various modules (for which I generate links without issues) - let's call them "module_urls"
each module page has a random number of links to various pages with videos (let's call them "lesson_urls")
each page has one video
The idea is to print links to all videos.
I have successfully managed to, separately: (1) generate the module_urls, (2) scrape the links to lesson_urls, and (3) scrape the videos. However, I am struggling with creating the appropriate loop to make it all work together.
The following script correctly generates module_urls, but, contrary to my expectations, the request to crawl each url (and then to crawl each sub-url) is never fulfilled. I am sure that this comes from my pure ignorance of the topic - this is the first time I am trying to use Scrapy.
Thank you very much for your help!
video_links = []

def after_login(self, response):
    module_urls = self.generate_links()
    for module_url in module_urls:
        print("This is one module URL: %s" % module_url)
        Request(module_url, self.get_lesson_urls)
    print(self.video_links)

def get_lesson_urls(self, response):
    print("Entered get_lesson_urls")
    urls = response.xpath('//*[starts-with(@id, "post")]//li/a/@href').extract()
    for lesson_url in urls:
        Request(lesson_url, self.get_video_link)

def get_video_link(self, response):
    video_address = response.xpath('//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]/@src').extract_first()
    self.video_links.append(video_address)
I believe you will need to yield your request objects
video_links = []

def after_login(self, response):
    module_urls = self.generate_links()
    for module_url in module_urls:
        print("This is one module URL: %s" % module_url)
        yield Request(module_url, self.get_lesson_urls)

def get_lesson_urls(self, response):
    print("Entered get_lesson_urls")
    urls = response.xpath('//*[starts-with(@id, "post")]//li/a/@href').extract()
    for lesson_url in urls:
        yield Request(lesson_url, self.get_video_link)

def get_video_link(self, response):
    video_address = response.xpath('//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]/@src').extract_first()
    # A callback must yield requests, items or dicts, not a bare string
    yield {'video_link': video_address}
Edit:
Rather than printing, if you yield the URLs you want, you can output them to JSON (and other formats) using:
scrapy crawl myspider -o data.json
You can do further parsing with Scrapy's Item Pipeline: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
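For example, a minimal, illustrative pipeline (the class and field names are hypothetical) that drops items without a video link could look like this:

# pipelines.py - minimal illustrative pipeline; names are hypothetical
from scrapy.exceptions import DropItem

class VideoLinkPipeline:
    def process_item(self, item, spider):
        # Drop items whose video link could not be extracted
        if not item.get('video_link'):
            raise DropItem("Missing video link")
        return item

It would then be enabled in settings.py with ITEM_PIPELINES = {'myproject.pipelines.VideoLinkPipeline': 300} (the module path is an assumption about your project layout).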
Related
I have a problem that is stopping me from making progress on my project.
I'll try to explain it as clearly as I can, but I am relatively new to scraping.
I want to scrape articles from Website A.
Website A doesn't have the articles' content in its HTML; it only links to articles on other websites (let's say Website B and Website C)
I have created a Spider that extracts links from Website A and yields them in JSON format.
I want to take the extracted links from Website A and scrape the articles from Websites B and C.
Now I want to create separate Spiders for Website B and Website C (to use them later for scraping those websites directly, not only through Website A) and somehow pass the scraped data from Website A to them as arguments. The "somehow" part is what I need your help with.
Thank you :)
EDIT
Answering jqc: since I posted my question I have made some progress. This is my code so far.
class QuotesSpider(scrapy.Spider):
    name = 'Website A Spider'
    start_urls = ['start_url']

    def parse(self, response):
        self.logger.info('###### Link Parser ######')
        important_news = response.xpath('//div[contains(@class, "importantNews")]//div[contains(@class, "items")]/a')
        for news in important_news:
            yield {
                'link': news.xpath('./@href').get(),
                'title': news.xpath('.//span[contains(@class, "title")]/text()').get()
            }
            article_url = news.xpath('./@href').get()
            self.logger.info('FOLLOWING URL OF THE ARTICLE')
            if 'Website B' in article_url:
                yield response.follow(article_url, callback=self.parse_Website_B)
            else:
                pass

    def parse_Website_B(self, response):
        yield {
            'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
        }
Don't worry about unfinished parsing, that's the least concerning part :)
Right now I am creating separate methods to parse particular websites, but I don't know if that is the optimal way.
I would like to see the URL you are trying to crawl. Then I can make some tests and try to decipher your question.
I can give you some hints, I am not sure if I understand you.
If you want to scrape the URLs extracted from A, you can do it directly in:
def parse_Website_B(self, response):
    yield {
        'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
    }
You just have to yield the links; I would try start_requests. Have a look at the documentation here.
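For instance, a rough sketch of how a Website B spider could pick up the links scraped from Website A (assuming spider A was run with -o website_a_links.json; the file name, spider name and XPath are assumptions):

import json
import scrapy

class WebsiteBSpider(scrapy.Spider):
    name = 'website_b'

    def start_requests(self):
        # Load the links previously exported by the Website A spider
        with open('website_a_links.json', encoding='utf8') as f:
            links = json.load(f)
        for entry in links:
            yield scrapy.Request(entry['link'], callback=self.parse)

    def parse(self, response):
        yield {
            'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
        }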
If you provide the URL, we can try otherwise.
cheers
I think in your case it will be much easier to create a list of URLs as a global variable in the spider file, and then use it as a list for requests.
Something like this:
import json

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy import signals

URL_LIST = []
DATA_LIST = []

def store_url(*args, **kwargs):
    URL_LIST.append(kwargs['item'])

def store_data(*args, **kwargs):
    DATA_LIST.append(kwargs['item'])

class QuotesSpiderWebsiteA(scrapy.Spider):
    # Your code
    pass

class QuotesSpiderWebsiteB(scrapy.Spider):
    # etc...
    pass

if __name__ == '__main__':
    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        crawler1 = runner.create_crawler(QuotesSpiderWebsiteA)
        crawler2 = runner.create_crawler(QuotesSpiderWebsiteB)
        crawler1.signals.connect(store_url, signals.item_scraped)
        crawler2.signals.connect(store_data, signals.item_scraped)
        yield runner.crawl(crawler1)
        yield runner.crawl(crawler2)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the crawling is finished

    with open('output.json', 'w', encoding='utf8') as f:
        json.dump(DATA_LIST, f)
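A note on the design: because both crawlers run in the same process, one after the other, the second spider can read the module-level URL_LIST filled by the first one, for example in start_requests. A rough, illustrative sketch (the 'link' field name is an assumption about what the first spider yields):

class QuotesSpiderWebsiteB(scrapy.Spider):
    name = 'website_b'

    def start_requests(self):
        # URL_LIST was populated by the item_scraped signal of the first crawler
        for item in URL_LIST:
            yield scrapy.Request(item['link'], callback=self.parse)

    def parse(self, response):
        yield {'title': response.xpath('//h1/text()').get()}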
All the websites I want to parse are in the same domain but all look very different and contain different information I need.
My start_url is a page with a list containing all links I need. So in the parse() method I yield a request for each of these links and in parse_item_page I extract the first part of the information I need - which worked completely fine.
My problem is: I thought I could just do the same thing one more time and call parse_entry for each link on my item_page. But I have tried many different versions of this and I just can't get it to work. The URLs are correct, but Scrapy just doesn't seem to want to call a third parse function; nothing in there ever gets executed.
How can I get scrapy to use parse_entry, or pass all these links to a new spider?
This is a simplified, shorter version of my spider class:
def parse(self, response, **kwargs):
    for href in response.xpath("//listItem/@href"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_item_page)

def parse_item_page(self, response):
    for sel in response.xpath("//div"):
        item = items.FirstItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        for href in response.xpath("//entry/@href"):
            yield response.follow(href.extract(), callback=self.parse_entry)
        yield item

def parse_entry(self, response):
    for sel in response.xpath("//textBlock"):
        item = items.SecondItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        yield item
I am a newbie and I've written a script in Python Scrapy to get information recursively.
First, it scrapes the links of cities (which contain tour information), then it follows each city and reaches its page. Next, it gets the needed tour information for each city before moving to the next page, and so on. Pagination runs on JavaScript without a visible link.
The command I used to get the result along with a csv output is:
scrapy crawl pratice -o practice.csv -t csv
The expected result is csv file:
title, city, price, tour_url
t1, c1, p1, url_1
t2, c2, p2, url_2
...
The problem is that the csv file is empty. The crawl stops at parse_page and callback=self.parse_item never fires. I don't know how to fix it; maybe my workflow is invalid or my code has issues. Thanks for your help.
name = 'practice'
start_urls = ['https://www.klook.com/vi/search?query=VI%E1%BB%86T%20NAM%20&type=country',]

def parse(self, response):  # Extract cities from country
    hxs = HtmlXPathSelector(response)
    urls = hxs.select("//div[@class='swiper-wrapper cityData']/a/@href").extract()
    for url in urls:
        url = urllib.parse.urljoin(response.url, url)
        self.log('Found city url: %s' % url)
        yield response.follow(url, callback=self.parse_page)  # Link to city

def parse_page(self, response):  # Move to next page
    url_ = response.request.url
    yield response.follow(url_, callback=self.parse_item)
    # I will use selenium to move to the next page because the next button
    # runs on javascript without a fixed url.

def parse_item(self, response):  # Extract tours
    for block in response.xpath("//div[@class='m_justify_list m_radius_box act_card act_card_lg a_sd_move j_activity_item js-item ']"):
        article = {}
        article['title'] = block.xpath('.//h3[@class="title"]/text()').extract()
        article['city'] = response.xpath(".//div[@class='g_v_c_mid t_mid']/h1/text()").extract()  # fixed
        article['price'] = re.sub(" +", "", block.xpath(".//span[@class='latest_price']/b/text()").extract_first()).strip()
        article['tour_url'] = 'www.klook.com' + block.xpath(".//a/@href").extract_first()
        yield article
hxs = HtmlXPathSelector(response)  # response is already a Selector; use `response.xpath` directly
url = urllib.parse.urljoin(response.url, url)
use as:
url = response.urljoin(url)
Yes, it will stop, as it is a duplicate request to the previous URL; you need to add dont_filter=True.
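For example, a minimal illustration of that change in parse_page:

def parse_page(self, response):  # Move to next page
    url_ = response.request.url
    # dont_filter=True tells the scheduler not to drop this request as a duplicate
    yield response.follow(url_, callback=self.parse_item, dont_filter=True)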
Instead of using Selenium, figure out what request the website performs using JavaScript (watch the Network tab of the developer tools of your browser while you navigate) and reproduce a similar request.
The website uses JSON requests underneath to fetch the items, which is much easier to parse than the HTML.
Also, if you are not familiar with Scrapy’s asynchronous nature, you are likely to get unexpected issues while using it in combination with Selenium.
Solutions like Splash or Selenium are only meant to be used as a last resort, when everything else fails.
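As a very rough sketch of that approach (the endpoint URL and the response fields below are hypothetical; the real ones have to be copied from the Network tab):

import json
import scrapy

class PracticeApiSpider(scrapy.Spider):
    name = 'practice_api'
    # Hypothetical JSON endpoint spotted in the browser's Network tab
    start_urls = ['https://www.klook.com/some/search/api?query=vietnam&page=1']

    def parse(self, response):
        data = json.loads(response.text)
        # Field names are illustrative; inspect the real payload first
        for activity in data.get('activities', []):
            yield {
                'title': activity.get('title'),
                'price': activity.get('price'),
                'tour_url': response.urljoin(activity.get('url', '')),
            }
        # Pagination can often be done by incrementing a page parameter
        # in the same endpoint instead of clicking a JavaScript button.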
I have the following code for a web crawler in Python 3:
import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))
    return return_links

def recursive_search(links):
    for i in links:
        links.append(get_links(i))
    recursive_search(links)

recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, the links might be in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
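For reference, these are the kinds of settings the broad crawls guide suggests tuning (the values below are illustrative, not prescriptive):

# settings.py - illustrative values in the spirit of the broad crawls guide
CONCURRENT_REQUESTS = 100          # raise global concurrency
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads, mainly for DNS resolution
LOG_LEVEL = 'INFO'                 # less logging overhead
COOKIES_ENABLED = False            # broad crawls rarely need cookies
RETRY_ENABLED = False              # don't retry failed pages
DOWNLOAD_TIMEOUT = 15              # give up on slow sites quickly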
To recreate the behaviour you need in Scrapy, you must:
set your start URL in your spider;
write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider variable.
An untested example (that can be, of course, refined):
import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self):
        super().__init__()
        self.links = []

    def parse(self, response):
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
I am getting confused about how to design the architecture of the crawler.
I have a search page where I have:
pagination: next page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    items = []
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['profile_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for request, and one for yield.
In the crawl spider, I don't need to use the last yield, so everything was working fine, but in BaseSpider I have to follow links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
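For what it's worth, the return/yield mix itself can be avoided: a single callback can be a generator that yields both the per-product requests and the pagination request. A rough sketch of that shape, keeping the names from the question and simplifying the selectors:

def parse_page(self, response):
    for site in response.xpath("//ol[@id='result-set']/li")[:2]:
        item = MyProduct()
        item['product'] = site.xpath('h2/a').xpath('string()').get()
        item['product_link'] = site.xpath('dd[2]').xpath('string()').get()
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            # yield instead of return, so the loop keeps going and the
            # pagination request below is still emitted
            yield request

    # the next-page request is yielded from the same generator
    nextlink = response.xpath("//a[text()='Next']/@href").get()
    if nextlink:
        yield Request(urljoin(response.url, nextlink), callback=self.parse_page)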