I am relatively new to scrapy and have been getting a lot of exceptions...
Here is what I am trying to do:
There are 4 nested links that I want to grab data from.
Let's say there are 5 fields that I want to scrape in total (a sketch of the item definition follows the list). These fields are
Industry=scrapy.Field()
Company=scrapy.Field()
Contact_First_name=scrapy.Field()
Contact_Last_name=scrapy.Field()
Website=scrapy.Field()
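For reference, a minimal items.py sketch these fields would live in (the class name PschamberItem is taken from the spider code further down):

import scrapy

class PschamberItem(scrapy.Item):
    Industry = scrapy.Field()
    Company = scrapy.Field()
    Contact_First_name = scrapy.Field()
    Contact_Last_name = scrapy.Field()
    Website = scrapy.Field()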
Now to begin crawling I would first have to get the Industry.
The Industry xpath also contains the link to individual listings of companies that belong to their Industry segments.
Next I want to follow the link from the Industry xpath. That page does not contain any data I want to scrape, but it does contain href links to individual companies that each have their own basic info page.
Using the href link from the listings page, I arrive at a page that contains the information for one company. There I want to scrape the Company, Address, and Website.
There is another href link that I need to follow to get to Contact_First_Name and Contact_Last_Name.
Using that href link, I arrive at another page that contains the Contact_First_Name and Contact_Last_Name.
After crawling all of these pages, I should have items that look somewhat like this:
Industry     Company   Website     Contact_First_Name   Contact_Last_Name
Finance      JPMC      JP.com      Jamie                Dimon
Finance      BOA       BOA.com     Bryan                Moynihan
Technology   ADSK      ADSK.com    Carl                 Bass
EDITED
Here is the code that is now working. Anzel's recommendations really helped, but I realized the allowed_domains value in my subclass was wrong, which stopped the nested links from being followed. Once I changed it, it works.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    allowed_domains = ["cm.pschamber.com"]
    start_urls = ["http://cm.pschamber.com/list/"]

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//*[@id="mn-ql"]/ul/li/a'):
            # xpath() and xpath().extract() return a list;
            # extract()[0] returns the first element of that list
            item['Industry'] = sel.xpath('text()').extract()
            # scrapy.Request takes a single url string, not a list of hrefs,
            # and it is the Request object (not the item) that gets yielded here
            yield scrapy.Request(sel.xpath('@href').extract()[0], callback=self.parse_2, meta={'item': item})

    def parse_2(self, response):
        for sel in response.xpath('.//*[@id="mn-members"]/div/div/div/div/div/a'):
            # the partially filled item is available through response.meta
            item = response.meta['item']
            item['Company'] = sel.xpath('text()').extract()
            # again, yield the Request object
            yield scrapy.Request(sel.xpath('@href').extract()[0], callback=self.parse_3, meta={'item': item})

    def parse_3(self, response):
        item = response.meta['item']
        item['Website'] = response.xpath('.//*[@id="mn-memberinfo-block-website"]/a/@href').extract()
        # finally, return the completed item
        return item
There are quite a few mistakes in your code, which is why it isn't running as you expected. Please see the brief sample below showing how to get the items you need and how to pass meta to the other callbacks. I am not copying your XPath expressions; I just grabbed the most straightforward ones from the site, and you can apply your own.
I will try to comment as clearly as possible to point out where you went wrong.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    # start from this: since your domain is a sub-domain on its own,
    # you need to change it to this, without the http://
    allowed_domains = ["cm.pschamber.com"]
    start_urls = (
        'http://cm.pschamber.com/list/',
    )

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//div[@id="mn-ql"]//a'):
            # xpath() and xpath().extract() will return a list;
            # extract()[0] will return the first element in the list
            item['industry'] = sel.xpath('text()').extract()[0]
            # another mistake you made here:
            # you're trying to call scrapy.Request(LIST of hrefs), which will fail,
            # because scrapy.Request only takes a url string, not a list.
            # Another big mistake is that you're trying to yield the item,
            # whereas you should yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_2,
                meta={'item': item}
            )

    # another mistake: your callback function DOESN'T take item as an argument
    def parse_2(self, response):
        for sel in response.xpath('//div[@class="mn-title"]//a'):
            # you can access your response meta like this
            item = response.meta['item']
            item['company'] = sel.xpath('text()').extract()[0]
            # again, yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_3,
                meta={'item': item}
            )

    def parse_3(self, response):
        item = response.meta['item']
        item['website'] = response.xpath('//a[@class="mn-print-url"]/text()').extract()
        # OK, finally, assuming you're done, just return the item object
        return item
Hope this is self-explanatory and that you now understand the basics of Scrapy. You should read the Scrapy documentation thoroughly, and soon you will learn another method: setting rules to follow links that match certain patterns. Of course, once you get the basics right, you will understand it.
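For reference, here is a minimal sketch of that rule-based approach using CrawlSpider and LinkExtractor; the allow patterns below are assumptions for illustration, not the site's real URL structure.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PschamberRuleSpider(CrawlSpider):
    name = "pschamber_rules"
    allowed_domains = ["cm.pschamber.com"]
    start_urls = ["http://cm.pschamber.com/list/"]

    rules = (
        # follow industry listing pages without extracting anything
        Rule(LinkExtractor(allow=r"/list/")),
        # parse individual member pages (the URL pattern is an assumption)
        Rule(LinkExtractor(allow=r"/member/"), callback="parse_member"),
    )

    def parse_member(self, response):
        # extract whatever fields you need here
        yield {
            'Website': response.xpath('//a[@class="mn-print-url"]/text()').extract_first(),
        }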
Although everyone's journey differs, I strongly recommend you keep reading and practicing until you're confident in what you're doing before crawling an actual website. Also, there are rules about which web content may be scraped, and the content you scrape may be subject to copyright.
Keep this in mind or you may find yourself in big trouble in the future. Anyway, good luck, and I hope this answer helps you resolve the problem!
Related
All the websites I want to parse are in the same domain but all look very different and contain different information I need.
My start_url is a page with a list containing all the links I need. So in the parse() method I yield a request for each of these links, and in parse_item_page I extract the first part of the information I need, which works completely fine.
My problem is: I thought I could just do the same thing a second time and call parse_entry for each link on my item page. But I have tried so many different versions of this and I just can't get it to work. The URLs are correct, but Scrapy just doesn't seem to want to call a third parse function; nothing in there ever gets executed.
How can I get Scrapy to use parse_entry, or pass all these links to a new spider?
This is a simplified, shorter version of my spider class:
def parse(self, response, **kwargs):
    for href in response.xpath("//listItem/@href"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_item_page)

def parse_item_page(self, response):
    for sel in response.xpath("//div"):
        item = items.FirstItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        for href in response.xpath("//entry/@href"):
            yield response.follow(href.extract(), callback=self.parse_entry)
        yield item

def parse_entry(self, response):
    for sel in response.xpath("//textBlock"):
        item = items.SecondItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        yield item
I hope that you're all well. I am trying to learn Python through web-scraping. My project at the moment is to scrape data from a games store. I am initially wanting to follow a product link and print the response from each link. There are 60 game links on the page that I wish for Scrapy to follow. Below is the code.
import scrapy

class GameSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['365games.co.uk']
    start_urls = ['https://www.365games.co.uk/3ds-games/']

    def parse(self, response):
        all_games = response.xpath('//*[@id="product_grid"]')
        for game in all_games:
            game_url = game.xpath('.//h3/a/@href').extract_first()
            yield scrapy.Request(game_url, callback=self.parse_game)

    def parse_game(self, response):
        print(response.status)
When I run this code, Scrapy goes through the first link, prints the response, and then stops. When I change the code to .extract() I get the following:
TypeError: Request url must be str or unicode, got list
The same applies with .get()/.getall(): .get() only returns the first result and .getall() gives the above error.
Any help would be greatly appreciated, but please be gentle I am trying to learn.
Thanks in advance and best regards,
Gav
The error is saying that you are passing a list instead of a string to scrapy.Request. This tells us that game_url is actually a list when you want a string. You are very close to the right thing here, but I believe your problem is that you are looping in the wrong place. Your first XPath returns just a single item rather than a list of items; it is within this element that you want to find your game_urls, leading to:
def parse(self, response):
    product_grid = response.xpath('//*[@id="product_grid"]')
    for game_url in product_grid.xpath('.//h3/a/@href').getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
You could also combine your XPath queries directly into:
def parse(self, response):
    all_games = response.xpath('//*[@id="product_grid"]//h3/a/@href')
    for game_url in all_games.getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
In this case you could also use follow instead of creating a new Request directly. You can even pass a selector rather than a string to follow, so you don't need getall(), and it knows how to deal with <a> elements, so you don't need the @href either!
def parse(self, response):
    for game_url in response.xpath('//*[@id="product_grid"]//h3/a'):
        yield response.follow(game_url, callback=self.parse_game)
Can anyone please help me extract rider details from the BlaBlaCar URL, or give me some ideas for web crawling?
The task: extract the first 5000 ride details from the BlaBlaCar website.
I am new to web crawling and Python, so could anyone kindly give me a hint on how to approach the task?
First, you should always think about where your scraping starting point is.
In this case https://www.blablacar.in/search-car-sharing looks pretty good, as there are links to the most popular routes.
Here is the pipeline you may want to follow:
Declare a spider.
Set USER_AGENT (in settings.py) to something custom so you don't get 403 responses.
Set DOWNLOAD_DELAY to something like 0.5 or so to avoid being banned (you may need an even larger value); a settings sketch follows this list.
Add the starting point to the spider: start_urls = ['https://www.blablacar.in/search-car-sharing']
Add a parse method that will yield requests to route pages.
Add a parse_route method that will yield information about the rides and follow the pagination.
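A minimal settings.py sketch for the two settings above (the user agent string is just an example, not a required value):

# settings.py
# a custom user agent helps avoid 403 responses
USER_AGENT = 'Mozilla/5.0 (compatible; ride-research-bot)'
# throttle requests so the site does not ban the crawler; increase if needed
DOWNLOAD_DELAY = 0.5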
Here is how the parse method may look:
def parse(self, response):
    for a_tag in response.css('.search-empty__meeting-points a'):
        yield response.follow(a_tag, self.parse_route)
And here is a parse_route example that parses the name and date of each ride:
def parse_route(self, response):
    for trip in response.css('.trip-search-results li'):
        item = {}
        item['name'] = trip.css('.ProfileCard-info--name::text').extract_first().strip()
        item['date'] = trip.css('.description .time::attr(content)').extract_first()
        yield item

    for a_tag in response.css('.pagination .next:not(.disabled) a'):
        yield response.follow(a_tag, self.parse_route)
Hope this gives you an intuition on how to address the task.
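For completeness, here is a sketch of the spider class these two methods would sit in (the class name, spider name, and allowed_domains value are assumptions):

import scrapy

class BlaBlaCarSpider(scrapy.Spider):
    name = 'blablacar_rides'
    allowed_domains = ['blablacar.in']
    start_urls = ['https://www.blablacar.in/search-car-sharing']

    # the parse() and parse_route() methods shown above go here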
It's my first experience with web scraping and I'm not sure if I'm doing it well or not. The thing is, I want to crawl and scrape data at the same time:
Get all the links that I'm gonna scrape
Store them into MongoDB
Visit them one by one to scrape their content
# Crawling: get all links to be scraped later on
class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start for start in xrange(0, 10000, 20)]

    def parse(self, response):
        # loop for all pages
        next_page = Selector(response).xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract()
        if not not next_page:
            yield Request("https://" + next_page[0], callback=self.parse)

        # loop for all links in a single page
        links = Selector(response).xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)
        for item in items:
            yield item

# Scraping: get all the stored links from MongoDB and scrape them????
What exactly is your use case? Are you primarily interested in the links, or in the content of the pages they lead to? I.e. is there any reason to first store the links in MongoDB and scrape the pages later? If you really need to store the links in MongoDB, it's best to use an item pipeline to store the items; the Scrapy item pipeline documentation even includes an example of writing items to MongoDB. If you need something more sophisticated, look at the scrapy-mongodb package.
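A minimal sketch of such a pipeline, loosely following the pattern from the Scrapy docs (the settings keys, database name, and collection name are assumptions):

import pymongo

class MongoLinkPipeline:
    collection_name = 'links'  # assumed collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are assumed settings keys
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scrapy_links'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

The pipeline would then be enabled through the ITEM_PIPELINES setting.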
Other than that, here are a few comments on the code you posted (a revised parse method that incorporates them follows the list):
Instead of Selector(response).xpath(...) use just response.xpath(...).
If you need only the first extracted element from selector, use extract_first() instead of using extract() and indexing.
Don't use if not not next_page:, use if next_page:.
The second loop over items is not needed; just yield item inside the loop over links.
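A sketch of the parse method with those points applied (the XPath expressions are copied from the question and not re-verified):

def parse(self, response):
    # follow the pagination
    next_page = response.xpath(
        '//li[@class="active"]/following-sibling::li[1]/a/@href').extract_first()
    if next_page:
        yield Request("https://" + next_page, callback=self.parse)

    # yield one Link item per listing on the page
    links = response.xpath('//div[@class="row-fluid job-details pointer"]'
                           '/div[@class="bloc-right"]/div[@class="row-fluid"]')
    for link in links:
        item = Link()
        item['url'] = response.urljoin(link.xpath('a/@href').extract_first())
        yield item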
I have a question on how to do this in Scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield items. So far so good, everything works great.
But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).
And I'm not sure how to organize the code to achieve that, since the two links (the listings link and the particular item link) are followed differently, with callbacks called at different times, but they have to be correlated in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths='//div[@class="pagination"]'), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)

        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, which gives you listings of items, and, very importantly, to add follow=True to that rule.
And in parse_item() you have to yield a request instead of an item, but only after you load the item. The request is for the item detail URL, and you have to send the loaded item along with that request to its callback. You do your work with the response there, and that is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it that way (although I'm sure it's possible), but my solution would be to use urllib and BeautifulSoup to load the second page manually, extract the information myself, and save it as part of the item. Yes, it's more hassle than Scrapy's normal parsing, but it should get the job done with the least trouble.
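A rough sketch of that manual-fetch idea (the helper name is made up for illustration; note that blocking network calls inside a Scrapy callback will stall the crawler, so the Request/meta approach above is generally preferable):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_url_contents(url):
    # download the detail page and pull out its text content
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text(strip=True)

# inside parse_item(), after loading the item:
# item['url_contents'] = fetch_url_contents(detail_url)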