Scraping table data using Scrapy (python) - python

I am working on a project that involves scraping data from a website using Scrapy.
Earlier we were using Selenium, but now we have to use Scrapy.
I have no prior knowledge of Scrapy and am learning it right now.
One of the challenges is to scrape data that is structured in tables. The site has links to download the data, but they do not work in my case.
Below is the structure of the tables:
[image: html structure]
All my data is under tbody, and each record is in its own tr.
The pseudo code which I have written so far is:
def parse_products(self, response):
    rows=response.xpath('//*[@id="records_table"]/tbody/')
    for i in rows:
        item = table_item()
        item['company'] = i.xpath('td[1]//text()').extract_first()
        item['naic'] = i.xpath('td[2]//text()').extract_first()
        yield item
Am I accessing the table body correctly with this XPath?
I am not sure whether the expression I specified is correct.

The XPath in the question ends with a slash, which is not a valid expression, and it stops at tbody instead of selecting the rows. It is better to write:
def parse_products(self, response):
    for row in response.css('table#records_table tr'):
        item = table_item()
        item['company'] = row.xpath('.//td[1]/text()').get()
        item['naic'] = row.xpath('.//td[2]/text()').get()
        yield item
Here you iterate over the rows of the table and then read the cells of each row.
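The row-by-row pattern can be tried without a live site. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, so it runs without Scrapy installed. The sample markup, the company names and the extract_rows helper are made up for illustration; in a spider the same relative paths would be passed to row.xpath(...).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup; a real page would come from the response.
SAMPLE = """
<html><body>
<table id="records_table">
  <thead><tr><th>Company</th><th>NAIC</th></tr></thead>
  <tbody>
    <tr><td>Acme Insurance</td><td>12345</td></tr>
    <tr><td>Example Mutual</td><td>67890</td></tr>
  </tbody>
</table>
</body></html>
"""


def extract_rows(markup):
    """Return one dict per data row of the table with id 'records_table'."""
    root = ET.fromstring(markup)
    items = []
    # Select every tr under the table, whether or not a tbody wraps it.
    for row in root.findall(".//table[@id='records_table']//tr"):
        cells = row.findall("td")
        if len(cells) < 2:
            # Header rows contain th cells only, so skip them.
            continue
        items.append({"company": cells[0].text, "naic": cells[1].text})
    return items


for item in extract_rows(SAMPLE):
    print(item)
```

Skipping rows that have no td cells matters because a selector that matches every tr also matches the header row, which would otherwise produce an item with empty fields.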

Related

Scrapy spider not scraping the data correctly

I am trying to scrape data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements, and I am also unable to scrape the 'href' attributes of the circulars for some reason. I am assuming that my CSS selectors are wrong, but I am unable to figure out what exactly I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The Website I am scraping data from is : https://www.imsnsit.org/imsnsit/notifications.php
My code is :
import scrapy
from ..items import CircularItem
class CircularSpider(scrapy.Spider):
    name = "circular"
    start_urls = [
        "https://www.imsnsit.org/imsnsit/notifications.php"
    ]
    def parse(self, response):
        items = CircularItem()
        all = response.css('tr~ tr+ tr font')
        for x in all:
            cirName = x.css('a font::text').extract()
            cirLink = x.css('.list-data-focus a').attrib['href'].extract()
            date = x.css('tr~ tr+ tr td::text').extract()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items
I modified your parse callback and changed the CSS selectors into XPath. XPath selectors are worth learning: they are very powerful and easy to use.
Generally, it is a bad idea to copy CSS or XPath expressions from automatic selector tools, because in some cases they give incorrect results, or a path to just one element instead of a general path.
First I select all tr elements. If you look carefully, some of them are blank rows used as separators. You can filter those out by trying to select the date: if it is None, skip the row. Finally, select cirName and cirLink.
Also, the markup of this website is not good, and it is really hard to write proper selectors because the elements have few attributes such as class or id. This is the solution I came up with; I know it is not perfect.
def parse(self, response):
    rows = response.xpath('//tr')  # select all table rows
    for row in rows:
        # rows without a date are blank separators, so skip them
        date = row.xpath('.//td/font[@size="3"]/text()').get()
        if not date:
            continue
        cirName = row.xpath('.//a/font/text()').get()
        cirLink = row.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
        items = CircularItem()  # create a new item for every row
        items["Name"] = cirName
        items["href"] = cirLink
        items["Date"] = date
        yield items
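The skip-blank-rows idea can be checked offline. This is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy selectors; the markup is a simplified, made-up imitation of the notice table (not the site's real HTML), and parse_circulars is a hypothetical helper. A new dict is built for each row so that earlier results are not overwritten.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup: data rows alternate with an empty separator row.
SAMPLE = """
<table>
  <tr><td><font size="3">01-02-2020</font></td>
      <td><a title="NOTICES / CIRCULARS" href="/docs/a.pdf"><font>Exam schedule</font></a></td></tr>
  <tr><td></td></tr>
  <tr><td><font size="3">03-02-2020</font></td>
      <td><a title="NOTICES / CIRCULARS" href="/docs/b.pdf"><font>Holiday notice</font></a></td></tr>
</table>
"""


def parse_circulars(markup):
    """Yield one new dict per row that carries a date; skip separator rows."""
    root = ET.fromstring(markup)
    for row in root.findall(".//tr"):
        date = row.findtext("td/font[@size='3']")
        if not date:
            continue  # separator row: nothing to extract
        link = row.find(".//a[@title='NOTICES / CIRCULARS']")
        yield {
            "Date": date,
            "Name": link.findtext("font") if link is not None else None,
            "href": link.get("href") if link is not None else None,
        }


circulars = list(parse_circulars(SAMPLE))
for circular in circulars:
    print(circular)
```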

Function returning null when trying to access table data

I am trying to scrape a website for some data using Scrapy. I found the table using CSS, but it's returning only the thead data.
I tried using XPath too, but that didn't help either. Actually, the page source doesn't have a tbody tag, which is why the function returns null.
I am trying to scrape this website
def parse(self, response):
    table = response.css('div.iw_component div.mobile-collapse div.fund-component div#exposureTabs div.component-tabs-panel div.table-chart-container div.fund-component table#tabsSectorDataTable')
    print(table.extract())
I want to access data in the selected table which is present in tbody tag.
The data you're looking for is loaded dynamically using JavaScript, which is why Scrapy can't find it in the table. You can use Scrapy-Splash to render the page, or parse the data yourself from the inline script:
import json
def parse(self, response):
    # the table rows are embedded in a script as a JavaScript array
    table_json = response.xpath('//script[contains(., "var tabsSectorDataTable =")]/text()').re_first(r'var tabsSectorDataTable =(.+?\]);')
    table = json.loads(table_json)
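The same extraction can be sketched with only the standard library, which makes the regular expression easy to test on its own. The script text, the field names and the extract_js_array helper below are made up for illustration, and the approach only works when the JavaScript array literal also happens to be valid JSON.

```python
import json
import re

# Made-up stand-in for the text of the page's inline script element.
SCRIPT_TEXT = """
var fundName = "Example Fund";
var tabsSectorDataTable =[{"name": "Financials", "value": "21.5"}, {"name": "Energy", "value": "7.2"}];
var other = 1;
"""


def extract_js_array(script_text, var_name):
    """Return the array assigned to `var_name` as Python data, or None if absent."""
    # Capture from the opening bracket up to the first "];" after it.
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\]);"
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


rows = extract_js_array(SCRIPT_TEXT, "tabsSectorDataTable")
for row in rows:
    print(row["name"], row["value"])
```

Returning None when the variable is missing lets the caller tell a changed page layout apart from an empty table.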

How to follow a hyper-reference in scrapy if the href attribute contains a hash symbol

In my web-scraping project I have to scrape the football matches data from https://www.national-football-teams.com/country/67/2018/France.html
In order to navigate to matches data from the above url I have to follow a hyper-reference that has a hash in the url:
Matchesevent
The standard Scrapy mechanism for following links:
href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
href = response.urljoin(href)
produces a link that does not lead to the matches data:
https://www.national-football-teams.com/matches.html
I would appreciate any help. Since I am new to web scraping and to anything related to web development, specific advice and/or a minimal working example would be highly appreciated.
For completeness, here is the full code of my Scrapy spider:
import scrapy
class NationalFootballTeams(scrapy.Spider):
    name = "nft"
    start_urls = ['https://www.national-football-teams.com/continent/1/Europe.html']
    def parse(self, response):
        for country in response.xpath("//div[@class='row country-teams']/div[1]/ul/li/a"):
            cntry = country.xpath("text()").extract_first().strip()
            if cntry == 'France':
                href = country.xpath("@href").extract_first()
                yield response.follow(href, self.parse_country)
    def parse_country(self, response):
        href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
        href = response.urljoin(href)
        print(href)
        yield scrapy.Request(url=href, callback=self.parse_matches)
    def parse_matches(self, response):
        print(response.xpath("//tr[@class='win']").extract())
When you click that link, no new page or even new data is loaded: the data is already in the HTML, but hidden. Clicking the link calls some JavaScript that hides the current tab and shows the new one. So to get to the data you don't need to follow any link; just use a different XPath query. The match data is under the XPath //div[@id='matches'].
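A small standard-library sketch of why no second request is needed: a fragment-only link resolves to the same document, so the hidden tab can be queried directly. The href value "#matches" is an assumption about the link's form, the markup is made up and simplified, and xml.etree.ElementTree stands in for Scrapy selectors so the example runs without Scrapy.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag, urljoin

PAGE_URL = "https://www.national-football-teams.com/country/67/2018/France.html"
HREF = "#matches"  # assumed form of the tab link; the real attribute may differ

# Joining a fragment-only href keeps the same document and appends the fragment.
resolved = urljoin(PAGE_URL, HREF)
document_url, fragment = urldefrag(resolved)
print(document_url == PAGE_URL, fragment)

# Made-up, simplified markup: both tabs are in the page, one of them hidden.
SAMPLE = """
<html><body>
<div id="overview"><p>Overview tab</p></div>
<div id="matches" style="display:none">
  <table>
    <tr class="win"><td>Team A</td><td>2:0</td></tr>
    <tr class="loss"><td>Team B</td><td>0:1</td></tr>
  </table>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
# Query the hidden tab directly instead of following the link.
wins = root.findall(".//div[@id='matches']//tr[@class='win']")
win_cells = [[td.text for td in row.findall("td")] for row in wins]
print(win_cells)
```

In the spider, this corresponds to running the //div[@id='matches'] query inside parse_country on the response that is already available.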

Confusion using Xpath when scraping websites with Scrapy

I'm having trouble understanding which part of the XPath to select when trying to scrape certain elements of a website. In this case, I am trying to scrape all the websites that are linked in this article, for example this section of the markup:
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider runs, but it doesn't scrape anything!
My code is below:
import scrapy
from scrapy.selector import Selector
from nymag.items import nymagItem
class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['http://wwww.nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]
    def parse(self, response):
        #I'm pretty sure the below line is the issue
        links = Selector(response).xpath(//*[@id="primary"]/main/article/div/span)
        for link in links:
            item = nymagItem()
            #This might also be wrong - am trying to extract the href section
            item['link'] = question.xpath('a/@href').extract()
            yield item
There is an easier way. Get all the a elements having data-track and href attributes:
In [1]: for link in response.xpath("//div[@id = 'primary']/main/article//a[@data-track and @href]"):
   ...:     print(link.xpath("@href").extract()[0])
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
...
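The hrefs printed above come in several shapes: protocol-relative links, absolute links, and non-web schemes such as mailto. If only links to other websites are wanted, one possible post-processing step is sketched below with the standard library. The external_web_links helper is hypothetical, and the last two sample hrefs are shortened stand-ins for the long share links in the output.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"

# The first two hrefs are copied from the output; the last two are shortened.
HREFS = [
    "//nymag.com/tags/healthcare/",
    "http://twitter.com/sueonthetown",
    "whatsapp://send?text=example",
    "mailto:?subject=example",
]


def external_web_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep http(s) links to other hosts."""
    page_host = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        # urljoin fills in the scheme of protocol-relative links like //host/path.
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc != page_host:
            kept.append(absolute)
    return kept


links = external_web_links(PAGE_URL, HREFS)
print(links)
```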

Python/Scrapy: Extracting xpath from Ebay's dynamic description field

I'm working on a project that ingests data from eBay listings and stores it for analysis. I'm using Scrapy to scrape/crawl specific categories, but I'm running into issues when trying to extract the text within an item's "description" field. Each listing seems to have a unique layout for the description, so I can't generalize an XPath for that field of my Scrapy item.
For example, one advertisement may have a layout like this, while another may be formatted like this. How can I go about extracting the text within each description tab? I can successfully extract other fields, as their XPaths are universal in eBay advertisements. Here's the method I'm referring to:
def parse_item(self, response):
    item = EbayItem()
    item['url'] = response.url
    item['title'] = response.xpath('//*[@id="itemTitle"]/text()').extract()
    item['description']= response.xpath( #THISISWHEREIMLOST).extract()
    print ":("
