Python/Scrapy: Extracting XPath from eBay's dynamic description field

I'm working on a project that ingests data from eBay listings and stores it for analysis. I'm using Scrapy to scrape/crawl specific categories, but I'm running into issues when trying to extract the text within an item's "description" field. Each listing seems to have a unique layout for the description, so I can't generalize an XPath for that field of my Scrapy item.
For example, one advertisement may have one layout while another is formatted completely differently. How can I go about extracting the text within each description tab? I can successfully extract other fields, as their XPaths are universal across eBay advertisements. Here's the method I'm referring to:
    def parse_item(self, response):
        item = EbayItem()
        item['url'] = response.url
        item['title'] = response.xpath('//*[@id="itemTitle"]/text()').extract()
        item['description'] = response.xpath(...).extract()  # THISISWHEREIMLOST
        print(":(")

Related

Links inside links in a Scrapy LinkExtractor in Python

I have a Scrapy crawler that extracts product data from a website.
The website is: https://www.softsurroundings.com
The main page has categories, and inside the categories there are products. A product link looks like "https://www.softsurroundings.com/p/audun-jumpsuit/", which I got from the first item in the category "https://www.softsurroundings.com/clothing/jumpsuits/".
Now, if in the code below I don't use the link extractor and give the link of a specific product directly, "parse_start_url" works fine and gives me the data I want to scrape.
But when I give only "https://www.softsurroundings.com" in start_urls and use the link extractor as shown in my code below, the LinkExtractor gets the links of the categories but doesn't scrape the items inside the categories.
My existing code is:
    name = 'Soft'
    allowed_domains = ["softsurroundings.com"]
    start_urls = [
        "https://www.softsurroundings.com/"
    ]
    den_subdirectory = ['/orderstatus/', '/faq/', '/folder/', '/sitemap/', '/catalogrequest/',
                        '/soft-surroundings-gift-card/', '/cart/', '/stores/',
                        '/emailus/', '/myaccount/', '/home-wellness/', '/new-bedding-home/']
    rules = [
        Rule(LinkExtractor(deny=den_subdirectory), callback='parse_start_url')
    ]
Now I need to modify the code above so that the link extractor goes into each category and scrapes all the products that exist in that category.
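One likely cause, for what it's worth: in Scrapy, a Rule that specifies a callback defaults to follow=False, so the crawler stops at the first layer of matched pages and never descends from a category page into its products. A minimal sketch of the fix, assuming the rest of the spider stays as above and reusing the existing den_subdirectory list:

    rules = [
        # With a callback set, follow defaults to False; follow=True tells the
        # spider to also extract links from the matched pages, so it descends
        # from category pages into the individual product pages.
        Rule(LinkExtractor(deny=den_subdirectory), callback='parse_start_url',
             follow=True)
    ]

parse_start_url will then be called for category pages and product pages alike, so it may need a guard that only yields items when product fields are actually present.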

Scrapy spider not scraping the data correctly

I am trying to scrape the data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements, and I am also unable to scrape the 'href' attributes of the circulars for some reason. I am assuming that my CSS selectors are wrong, but I am unable to figure out what exactly I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The Website I am scraping data from is : https://www.imsnsit.org/imsnsit/notifications.php
My code is :
    import scrapy
    from ..items import CircularItem

    class CircularSpider(scrapy.Spider):
        name = "circular"
        start_urls = [
            "https://www.imsnsit.org/imsnsit/notifications.php"
        ]

        def parse(self, response):
            items = CircularItem()
            all = response.css('tr~ tr+ tr font')
            for x in all:
                cirName = x.css('a font::text').extract()
                cirLink = x.css('.list-data-focus a').attrib['href'].extract()
                date = x.css('tr~ tr+ tr td::text').extract()
                items["Name"] = cirName
                items["href"] = cirLink
                items["Date"] = date
                yield items
I modified your parse callback and changed the CSS selectors into XPath. Also, try to learn XPath selectors; they are very powerful and easy to use.
Generally, it is a bad idea to copy CSS or XPath expressions from automatic selector tools, because in some cases they give you incorrect results, or just one element without a generalizable path.
First of all, I select all tr elements. If you look carefully, some of the tr rows are just blank separators. You can filter them out by trying to select the date; if it is None, you can skip the row. Finally, you can select cirName and cirLink.
Also, the markup of this website is not good, and it is really hard to write proper selectors; the elements don't have many attributes like class or id. That's the solution I came up with; I know it is not perfect.
    def parse(self, response):
        items = CircularItem()
        all = response.xpath('//tr')  # select all table rows
        for x in all:
            date = x.xpath('.//td/font[@size="3"]/text()').get()  # filter rows by date
            if not date:
                continue
            cirName = x.xpath('.//a/font/text()').get()
            cirLink = x.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items

Can't scrape reviews from multiple pages, and only the text before a line break is scraped

Thank you in advance for your time. I really appreciate that.
I am trying to scrape product reviews, ratings and other info from Amazon. Below is the code for it. The issues I am getting are:
The first page has 10 reviews, and in the crawled data all reviews are from those 10 customers only: 10 lines of review data, then a blank line, then the same 10 again, and so on, 196 lines in total.
Also, if any review has 'ENTER' used in it by the customer for spacing, then only the text before that line break ends up in the review.
Link to scrape - https://www.amazon.com/product-reviews/B01DFKC2SO/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=
My Code:
    import scrapy

    class ReviewspiderSpider(scrapy.Spider):
        name = 'reviewspider'
        allowed_domains = ["www.amazon.com"]
        start_urls = [
            'https://www.amazon.com/product-reviews/B01DFKC2SO/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=']

        def parse(self, response):
            for review in response.xpath("//div[@id='cm_cr-review_list']/div"):
                yield {
                    'Name': review.xpath('.//span[@class="a-profile-name"]/text()').get(),
                    'Title': review.xpath('.//a[@data-hook="review-title"]/span/text()').get(),
                    'Rating': review.xpath('.//span[@class="a-icon-alt"]/text()').get(),
                    'Review': review.xpath('.//span[@data-hook="review-body"]/span/text()').get()
                }
            next_page = response.xpath("//a[text()='Next page']").get()
            if next_page:
                yield response.follow(url=next_page, callback=self.parse)
Output: (screenshot omitted)
You have forgotten to select the href:
    next_page = response.xpath("//a[text()='Next page']/@href").get()
You were joining the whole tag to response.url, not the href.
As for the part of the question about why text goes missing: the text is not removed, you are just not getting all of it, only the first part. text() returns plain text nodes, and a child tag such as <br> splits the text into several nodes.
There are two options for fixing it.
The first is to use the string() function in the XPath for the span selector:
    review.xpath('string(.//span[@data-hook="review-body"]/span)').get()
But I would not recommend it, as it only drops the tags inside the selector, so the parts of the text are joined without any separator between them (e.g. "I have two.I have so many..." with nothing between "." and "I").
I would suggest using the getall method to get all of the plain text nodes of the tag and then simply joining them with whatever separator you see fit:
    '\n'.join(review.xpath('.//span[@data-hook="review-body"]/span/text()').getall())
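Putting both fixes together, the full callback might look like this; it is a sketch built from the selectors already in the question, not a tested spider:

    def parse(self, response):
        for review in response.xpath("//div[@id='cm_cr-review_list']/div"):
            yield {
                'Name': review.xpath('.//span[@class="a-profile-name"]/text()').get(),
                'Title': review.xpath('.//a[@data-hook="review-title"]/span/text()').get(),
                'Rating': review.xpath('.//span[@class="a-icon-alt"]/text()').get(),
                # getall() + join keeps the text that <br> tags would otherwise split off
                'Review': '\n'.join(review.xpath(
                    './/span[@data-hook="review-body"]/span/text()').getall()),
            }
        # select the href, not the whole <a> tag, so response.follow gets a URL
        next_page = response.xpath("//a[text()='Next page']/@href").get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)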

Scraping table data using Scrapy (python)

I am working on a project that involves scraping data from a website using Scrapy.
Earlier we were using Selenium, but now we have to use Scrapy.
I don't have any knowledge of Scrapy but am learning it right now.
One of the challenges is to scrape data that is structured in tables; there are links to download that data, but they don't work in my case.
Below is the structure of the tables:
(screenshot of the HTML structure omitted)
All my data is under tbody, each row in a tr.
The pseudocode which I have written so far is:
    def parse_products(self, response):
        rows = response.xpath('//*[@id="records_table"]/tbody/')
        for i in rows:
            item = table_item()
            item['company'] = i.xpath('td[1]//text()').extract_first()
            item['naic'] = i.xpath('td[2]//text()').extract_first()
            yield item
Am I accessing the table body correctly with that XPath? I'm not sure whether the XPath I specified is correct.
Better to write it like this:
    def parse_products(self, response):
        for row in response.css('table#records_table tr'):
            item = table_item()
            item['company'] = row.xpath('.//td[1]/text()').get()
            item['naic'] = row.xpath('.//td[2]/text()').get()
            yield item
Here you iterate over the rows of the table and then take the data from the cells. Note that a trailing slash, as in tbody/, is not valid XPath, and tbody is often inserted by the browser's developer tools even when it is absent from the raw HTML that Scrapy receives, so it is safer to select the tr rows directly.

How to follow a hyper-reference in Scrapy if the href attribute contains a hash symbol

In my web-scraping project I have to scrape the football match data from https://www.national-football-teams.com/country/67/2018/France.html
In order to navigate to the match data from the above URL, I have to follow a hyper-reference whose href contains a hash:
Matches (the anchor's markup was lost in formatting; its href contains a hash fragment)
The standard Scrapy mechanism of following links:
    href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
    href = response.urljoin(href)
will produce a link that does not lead to the match data:
https://www.national-football-teams.com/matches.html
I would appreciate any help. Since I am new to web scraping and anything that has to do with web development, more specific advice and/or a minimal working example would be highly appreciated.
For completeness, here is the complete code of my Scrapy spider:
    import scrapy

    class NationalFootballTeams(scrapy.Spider):
        name = "nft"
        start_urls = ['https://www.national-football-teams.com/continent/1/Europe.html']

        def parse(self, response):
            for country in response.xpath("//div[@class='row country-teams']/div[1]/ul/li/a"):
                cntry = country.xpath("text()").extract_first().strip()
                if cntry == 'France':
                    href = country.xpath("@href").extract_first()
                    yield response.follow(href, self.parse_country)

        def parse_country(self, response):
            href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
            href = response.urljoin(href)
            print(href)
            yield scrapy.Request(url=href, callback=self.parse_matches)

        def parse_matches(self, response):
            print(response.xpath("//tr[@class='win']").extract())
When clicking that link, no new page or even new data is loaded; the data is already in the HTML, but hidden. Clicking the link calls some JavaScript that hides the current tab and shows the new one. So to get to the data, you don't need to follow any link at all; just use a different XPath query. The match data is under //div[@id='matches'].
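In other words, parse_country can read the match rows straight off the page it already has, and parse_matches becomes unnecessary. A sketch along those lines, reusing the row selector from the question:

    def parse_country(self, response):
        # The matches tab is hidden markup on the same page, so query it
        # directly instead of following the hash link.
        for row in response.xpath("//div[@id='matches']//tr[@class='win']"):
            print(row.extract())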
