I am trying to scrape data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements and I am also unable to scrape the 'href' attributes of the circulars for some reason. I am assuming that my CSS selectors are wrong, but I am unable to figure out what exactly I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The website I am scraping data from is: https://www.imsnsit.org/imsnsit/notifications.php
My code is:
import scrapy
from ..items import CircularItem

class CircularSpider(scrapy.Spider):
    name = "circular"
    start_urls = [
        "https://www.imsnsit.org/imsnsit/notifications.php"
    ]

    def parse(self, response):
        items = CircularItem()
        all = response.css('tr~ tr+ tr font')
        for x in all:
            cirName = x.css('a font::text').extract()
            cirLink = x.css('.list-data-focus a').attrib['href'].extract()
            date = x.css('tr~ tr+ tr td::text').extract()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items
I modified your parse callback function and changed the CSS selectors into XPath. Also, try to learn XPath selectors: they are very powerful and easy to use.
Generally, it is a bad idea to copy CSS or XPath expressions from automatic selector tools, because in some cases they give you incorrect results or just a single element without a general path.
First of all I select all tr elements. If you look carefully, some of the tr rows are blank and only used as separators. You can filter them out by trying to select the date: if it is None, you can just skip the row. Finally, you can select cirName and cirLink.
Also, the markup of the given website is not good and it is really hard to write proper selectors; the elements don't have many attributes like class or id. This is the solution I came up with, and I know it is not perfect.
def parse(self, response):
    items = CircularItem()
    all = response.xpath('//tr')  # select all table rows
    for x in all:
        date = x.xpath('.//td/font[@size="3"]/text()').get()  # filter them by date
        if not date:
            continue
        cirName = x.xpath('.//a/font/text()').get()
        cirLink = x.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
        items["Name"] = cirName
        items["href"] = cirLink
        items["Date"] = date
        yield items
I am scraping a website using Scrapy where I want to extract a few details of a product, such as price, product description, and features. I want to know how to select each of these elements using CSS selectors or XPath selectors and store them in XML or JSON format.
I have written the following code skeleton. Please guide me on what I should do from here.
# -*- coding: utf-8 -*-
import scrapy
import time

class QuotesSpider(scrapy.Spider):
    name = 'myquotes'
    start_urls = [
        'https://www.amazon.com/international-sales-offers/b/ref=gbps_ftr_m-9_2862_dlt_LD?node=15529609011&gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CEXPIRED%252CSOLDOUT%252CUPCOMING,sortOrder:BY_SCORE,MARKETING_ID:ship_export,enforcedCategories:15684181,dealTypes:LIGHTNING_DEAL&pf_rd_p=9b8adb89-8774-4860-8b6e-e7cefc1c2862&pf_rd_s=merchandised-search-9&pf_rd_t=101&pf_rd_i=15529609011&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=AA0VVPMWMQM1MF4XQZKR&ie=UTF8'
    ]

    def parse(self, response):
        all_div_quotes = response.css('a-section a-spacing-none tallCellView gridColumn2 singleCell')
        for quotes in all_div_quotes:
            title1 = all_div_quotes.css('.dealPriceText::text').extract()
            title2 = all_div_quotes.css('.a-declarative::text').extract()
            title3 = all_div_quotes.css('#shipSoldInfo::text').extract()
            yield {
                'price': title1,
                'details1': title2,
                'details2': title3
            }
I am running the code using the command:
scrapy crawl myquotes -o myfile.json
to save it inside a JSON file. The problem with this code is that it is not returning the title, product price, and product description as intended. If someone could help me with how to scrape the product name, price, and description of an Amazon page, it would be of great help.
An easier way to check and verify CSS selectors is to use the Scrapy shell.
In your case, these are the selectors you can use, along with the code:
Name: response.css("#productTitle::text").get()
Price: Price was not available in my country so couldn't test it.
Description: response.css("#productDescription p::text").getall()
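For instance, you can open the product page in the Scrapy shell and try these selectors interactively (the URL below is just a placeholder for whatever product page you are scraping):

scrapy shell "https://www.amazon.com/<product-page-url>"
>>> response.css("#productTitle::text").get()
>>> response.css("#productDescription p::text").getall()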
Best of luck.
The normal method to solve an error like this is to start at the top. I think your very first CSS selector is too detailed. Using the Selector Gadget, the general CSS selector is
.dealDetailContainer
Yield the whole response without a for loop and check the output to confirm that you're getting some kind of response.
For individual products, when I scraped a different Amazon link, the CSS selector for the product name is
#productTitle::text (the # is part of the selector here, not a comment)
Basically, you're going wrong with the CSS selectors. Use the CSS Selector Gadget, and before using the command to output to JSON, do a normal crawl first.
Generally, what you could do is:
Name: response.css("#productTitle::text").extract()
Description: response.css("#productDescription p::text").extract()
With this you should be good to go.
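Putting that together, a rough sketch of the corrected loop would look like the following (it reuses the .dealDetailContainer container suggested above and the selectors from your code; Amazon's class names change often, so treat them as placeholders and verify them in the shell first):

def parse(self, response):
    # iterate over one container per deal instead of querying the whole page
    for deal in response.css('.dealDetailContainer'):
        yield {
            'price': deal.css('.dealPriceText::text').get(),
            'details1': deal.css('.a-declarative::text').get(),
            'details2': deal.css('#shipSoldInfo::text').get(),
        }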
CSS selectors are more stable, so they are usually a better bet than XPath and consequently the way to go.
I am working on a project and it involves scraping data from a website using Scrapy.
Earlier we were using Selenium but now we have to use Scrapy.
I don't have any knowledge on Scrapy but learning it right now.
One of the challenges is to scrape the data from a website; the data is structured in tables, and though there are links to download such data, that is not working in my case.
Below is the structure of the tables:
[screenshot of the HTML structure]
All my data is under tbody, with each record in its own tr.
The pseudo-code I have written so far is:
def parse_products(self, response):
    rows = response.xpath('//*[@id="records_table"]/tbody/')
    for i in rows:
        item = table_item()
        item['company'] = i.xpath('td[1]//text()').extract_first()
        item['naic'] = i.xpath('td[2]//text()').extract_first()
        yield item
Am I accessing the table body correctly with the XPath? I am not sure whether the XPath I specified is correct.
It would be better to write it like this:
def parse_products(self, response):
    for row in response.css('table#records_table tr'):
        item = table_item()
        item['company'] = row.xpath('.//td[1]/text()').get()
        item['naic'] = row.xpath('.//td[2]/text()').get()
        yield item
Here you iterate over the rows of the table and then take the data from the cells.
I have recently started to use python and scrapy.
I have been trying to use scrapy to start at either a movie or actor wiki page, save the name and cast or filmography and traverse through the links in the cast or filmography sections to other actor/movie wiki pages.
However, I have no idea how rules work (edit: ok, this was a bit of hyperbole) and the wiki links are extremely nested. I saw that you can limit by xpath and give id or class but most of the links I want don't seem to have a class or id. I also wasn't sure if xpath also includes the other siblings and children.
Therefore I would like to know what rules to use to limit the non-relevant links and only go to cast and filmography links.
Edit: Clearly, I should have explained my question better. It's not that I don't understand xpaths and rules at all (that was a bit of hyperbole since I was getting frustrated), but I'm clearly not completely clear on how they work. Firstly, let me show what I had so far and then clarify where I am having trouble.
import logging
from bs4 import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor, re
from scrapy.exceptions import CloseSpider
from Assignment2_0.items import Assignment20Item

logging.basicConfig(filename='spider.log', level=logging.DEBUG)

class WikisoupSpiderSpider(CrawlSpider):
    name = 'wikisoup_spider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Keira_Knightley']

    rules = (
        Rule(LinkExtractor(restrict_css='table.wikitable')),
        Rule(LinkExtractor(allow=('(/wiki/)',), ),
             callback='parse_crawl', follow=True))

    actor_counter = 0
    actor_max = 250
    movie_counter = 0
    movie_max = 125

    def parse_crawl(self, response):
        items = []
        soup = BeautifulSoup(response.text, 'lxml')
        item = Assignment20Item()
        occupations = ['Actress', 'Actor']
        logging.debug(soup.title)
        tempoccu = soup.find('td', class_='role')
        logging.warning('tempoccu only works for pages of people')
        tempdir = soup.find('th', text='Directed by')
        logging.warning('tempdir only works for pages of movies')
        if (tempdir is not None) and self.movie_counter < self.movie_max:
            logging.info('Found movie and do not have enough yet')
            item['moviename'] = soup.h1.text
            logging.debug('name is ' + item['moviename'])
            finder = soup.find('th', text='Box office')
            gross = finder.next_sibling.next_sibling.text
            gross_float = re.findall(r"[-+]?\d*\.\d+|\d+", gross)
            item['netgross'] = float(gross_float[0])
            logging.debug('Net gross is ' + gross_float[0])
            finder = soup.find('div', text='Release date')
            date = finder.parent.next_sibling.next_sibling.contents[1].contents[1].contents[1].get_text(" ")
            date = date.replace(u'\xa0', u' ')
            item['releasedate'] = date
            logging.debug('released on ' + item['releasedate'])
            item['type'] = 'movie'
            items.append(item)
        elif (tempoccu is not None) and (any(occu in tempoccu for occu in occupations)) and self.actor_counter < self.actor_max:
            logging.info('Found actor and do not have enough yet')
            item['name'] = soup.h1.text
            logging.debug('name is ' + item['name'])
            temp = soup.find('span', class_='noprint ForceAgeToShow').text
            age = re.findall('\d+', temp)
            item['age'] = int(age[0])
            logging.debug('age is ' + age[0])
            filmo = []
            finder = soup.find('span', id='Filmography')
            for x in finder.parent.next_sibling.next_sibling.find_all('i'):
                filmo.append(x.text)
            item['filmography'] = filmo
            logging.debug('has done ' + filmo[0])
            item['type'] = 'actor'
            items.append(item)
        elif (self.movie_counter == self.movie_max and self.actor_counter == self.actor_max):
            logging.info('Found enough data')
            raise CloseSpider(reason='finished')
        else:
            logging.info('irrelevant data')
            pass
        return items
Now, my understanding of the rules in my code is that they should allow all wiki links but take links only from table tags and their children. That is clearly not what is happening, since the spider very quickly crawled away from movies.
I'm clear on what to do when each element has an identifier like an id or class, but when inspecting the page, the links are buried in multiple nests of id-less tags that don't all seem to follow a single pattern (I would use a regular xpath, but different pages have different paths to the filmography, and it didn't seem like finding the path to the table under h2=Filmography would include all links in the tables below it). Therefore I wanted to know more about how I could get Scrapy to use only the Filmography links (in actor pages, anyway).
I apologize if this was an obvious thing; I started using both Python and Scrapy/XPath/CSS only 48 hours ago.
Firstly, you need to know where to look, I mean, which tags you have to filter, so you have to inspect the corresponding HTML code of your page. Regarding libraries, I would use:
import requests
to make the connections, and
from bs4 import BeautifulSoup as bs
to parse.
Example:
bs = bs('file with html code', "html.parser")
instantiates the object, and
select_tags = bs('select')
looks for the tags you want to filter.
Then you should wrap your list and add some condition like this:
for i in self.select:
    print(i.get('class'), type(i.get('class')))
    if type(i.get('class')) is list and '... name you look for ...' in i.get('class'):
In this case you can filter the select tags you want by their 'class' attribute.
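Put together, the same idea as a small self-contained sketch (the URL and the 'wikitable' class are taken from your spider; the tag and class you actually filter on would be whatever you find when inspecting the page):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://en.wikipedia.org/wiki/Keira_Knightley').text
soup = BeautifulSoup(html, 'html.parser')

# keep only the tables whose class list contains the name you look for
for tag in soup.find_all('table'):
    classes = tag.get('class')
    if isinstance(classes, list) and 'wikitable' in classes:
        print(tag.get('class'))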
If I understand correctly what you want, you will probably need to combine your two rules into one, using both allow and restrict_xpaths/restrict_css.
So, something like:
rules = [
    Rule(LinkExtractor(allow=['/wiki/'], restrict_xpaths=['xpath']),
         callback='parse_crawl',
         follow=True)
]
Scraping wikipedia is usually pretty complicated, especially if trying to access very specific data.
There are a few problems I see with this particular example:
The data lacks structure - it's just a bunch of text in sequence, meaning your xpaths are going to be pretty complicated. For example, to select the 3 tables you want, you might need to use:
//table[preceding-sibling::h2[1][contains(., "Filmography")]]
You only want to follow links from the Title column (the second one); however, due to the way HTML tables are defined, this might not always be the second td of a row.
This means you'll probably need some additional logic, either in your xpath or in your code.
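For example, one way to plug that xpath into your link extractor (just a sketch; it does not solve the second-column filtering just mentioned) would be:

rules = [
    Rule(LinkExtractor(allow=['/wiki/'],
                       # only extract links from the tables that follow the Filmography heading
                       restrict_xpaths=['//table[preceding-sibling::h2[1][contains(., "Filmography")]]']),
         callback='parse_crawl',
         follow=True)
]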
IMO the biggest problem is the lack of consistency. For example, take a look at https://en.wikipedia.org/wiki/Gerard_Butler#Filmography: there are no tables there, just a list and a link to another article. Basically, you get no guarantees about naming, positioning, layout, or display of the information.
Those notes might get you started, but getting this information is going to be a big task.
My recommendation and personal choice would be to obtain the data you want from a more specialized source, instead of trying to scrape a website as generalized as wikipedia.
It is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem, but I just can't see what is wrong with this code.
My goal is to scrape all of the opera shows from the given website. The data for every show is inside one div with the class "row-fluid row-performance ". I am trying to iterate over them to retrieve it, but it doesn't work: I get the content of the first div in every iteration (I am getting the same show 19 times, instead of different items).
import scrapy
from ..items import ShowItem

class OperaSpider(scrapy.Spider):
    name = "opera"
    allowed_domains = ["http://www.opera.krakow.pl"]
    start_urls = [
        "http://www.opera.krakow.pl/pl/repertuar/na-afiszu/listopad"
    ]

    def parse(self, response):
        divs = response.xpath('//div[@class="row-fluid row-performance "]')
        for div in divs:
            item = ShowItem()
            item['title'] = div.xpath('//h2[@class="item-title"]/a/text()').extract()
            item['time'] = div.xpath('//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract()
            item['date'] = div.xpath('//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract()
            yield item
Try changing the xpaths inside the for loop so that they start with .//, that is, put a dot in front of the double slash so each xpath is relative to the current div. You can also try using extract_first() instead of extract() and see if that gives you better results.
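In other words, something along these lines (a sketch of the corrected loop, keeping your ShowItem fields):

def parse(self, response):
    divs = response.xpath('//div[@class="row-fluid row-performance "]')
    for div in divs:
        item = ShowItem()
        # the leading dot makes each xpath relative to the current div
        item['title'] = div.xpath('.//h2[@class="item-title"]/a/text()').extract_first()
        item['time'] = div.xpath('.//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        item['date'] = div.xpath('.//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        yield item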
I'm having trouble understanding which part of the XPath to select when trying to scrape certain elements of a website. In this case, I am trying to scrape all the websites that are linked in this article (for example, this piece of the markup):
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider works but it doesn't scrape anything!
My code is below:
import scrapy
from scrapy.selector import Selector
from nymag.items import nymagItem

class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['http://wwww.nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        # I'm pretty sure the below line is the issue
        links = Selector(response).xpath('//*[@id="primary"]/main/article/div/span')
        for link in links:
            item = nymagItem()
            # This might also be wrong - am trying to extract the href section
            item['link'] = question.xpath('a/@href').extract()
            yield item
There is an easier way. Get all the a elements having data-track and href attributes:
In [1]: for link in response.xpath("//div[@id = 'primary']/main/article//a[@data-track and @href]"):
   ...:     print link.xpath("@href").extract()[0]
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
...