Scrapy XPath returning empty array

I am using Scrapy to scrape information from a website in Python, and I am still getting used to using XPaths to find information.
I want to return a list of the average ratings for all of this artist's albums from this page:
https://rateyourmusic.com/artist/kanye_west
To find the node for the albums I used //div[@id="disco_type_s"],
and I tried searching its children for divs with the class disco_avg_rating using div[@class="disco_avg_rating"]/text().
Here is my function:
def parse_dir_contents(self, response):
    item = rateyourmusicalbums()  # ignore this
    for i in response.xpath('//div[@id="disco_type_s"]'):
        item['average rating'] = i.xpath('div[@class="disco_avg_rating"]/text()').extract()
        yield item
Everything I try to get this list causes problems. Usually it's more straightforward, but this time I have to differentiate between albums and singles etc., so I am having trouble.
I'd appreciate your help; I am fairly new to web scraping.

response.xpath('//div[@id="disco_type_s"]') finds only one tag (this is what usually happens when matching by id in an XPath, since ids are unique). To get a list of selectors you should use something like:
response.xpath('//div[@id="disco_type_s"]/div[@class="disco_release"]')
which will match multiple tags, so you can iterate over those,
then get the average rating with './div[@class="disco_avg_rating"]/text()'.
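Putting that together, a minimal sketch of the loop (reusing the rateyourmusicalbums item class from the question):
def parse_dir_contents(self, response):
    # One selector per release row inside the albums section
    for release in response.xpath('//div[@id="disco_type_s"]/div[@class="disco_release"]'):
        item = rateyourmusicalbums()
        # the leading './' keeps the query relative to the current release
        item['average rating'] = release.xpath('./div[@class="disco_avg_rating"]/text()').extract_first()
        yield item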

The following should work.
def parse_dir_contents(self, response):
    for i in response.xpath('//*[@class="disco_release"]/div[3]'):
        item = rateyourmusicalbums()
        item['average rating'] = i.xpath('text()').extract()
        yield item

Related

Python/Selenium web scraping: how to find the hidden href value of links?

Scraping links should be a simple feat, usually just grabbing the href value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of each item's a tag cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href or onclick attributes, and I'm wondering if this is even possible. I noticed that I couldn't right-click and open the link in a new tab either.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity (image omitted).
Reverse-engineering the JavaScript that takes you to the promotion pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    base = "https://sunteccity.com.sg/promotions/"
    happening_id = str(item["HappeningID"])
    print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
You are using the wrong locator; it brings in a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code will be:
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the .collections-page .thumb-img a locator, so that your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))

How to address alternative XPath selector if the original one does not exist?

I'm using Scrapy (https://scrapy.org/) to crawl nearly 300 websites and save the title and date in a JSON file. On most websites the title is the first H1, but the date is tricky. Right now I'm using this XPath selector:
item['date'] = response.xpath("//time/text()").get()
But the websites format the date in different ways: sometimes as a span, sometimes as a normal paragraph, sometimes as a time element, and others using an a tag.
Question: How can I implement something like an if-then-else structure for the item, to tell the spider to look for different elements if the first does not exist?
You can simply use Item Loaders and pick the first non-empty value:
l.add_xpath('date', '//first/xpath')
l.add_xpath('date', '//second/xpath')
l.add_xpath('date', '//third/xpath')
And in items.py:
date = scrapy.Field(output_processor=TakeFirst())
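Putting the pieces together, a minimal sketch of how this could look (the fallback XPaths and the ArticleItem name are placeholders, not from the original answer):
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst  # itemloaders.processors in newer Scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())
    date = scrapy.Field(output_processor=TakeFirst())

# in your spider's callback:
def parse(self, response):
    l = ItemLoader(item=ArticleItem(), response=response)
    l.add_xpath('title', '//h1/text()')
    # Each add_xpath appends whatever matches; TakeFirst() then keeps
    # the first non-empty value, so list the selectors by preference.
    l.add_xpath('date', '//time/text()')
    l.add_xpath('date', '//span[@class="date"]/text()')  # hypothetical fallback
    l.add_xpath('date', '//p[@class="published"]/text()')  # hypothetical fallback
    yield l.load_item()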

scrapy returning an empty object

I am using CSS selectors and continually get a response with empty values. Here is the code:
import scrapy

class WebSpider(scrapy.Spider):
    name = 'activities'
    start_urls = [
        'http://capetown.travel/events/'
    ]

    def parse(self, response):
        all_div_activities = response.css("div.tribe-events-content")  # gdlr-core-pbf-column gdlr-core-column-60 gdlr-core-column-first
        title = all_div_activities.css("h2.tribe-events-list-event-title::text").extract()  # gdlr-core-text-box-item-content
        price = all_div_activities.css(".span.ticket-cost::text").extract()
        details = all_div_activities.css(".p::text").extract()
        yield {
            'title': title,
            'price': price,
            'details': details
        }
In your code you're selecting all events, but that output is a list, and you can't extract the title etc. from a list the way you are trying to.
This is why you're not getting the data you want. You need a for loop to iterate over each event on the page, in your case looping over all_div_activities.
Code for Script
def parse(self, response):
    all_div_activities = response.css('div.tribe-events-event-content')
    for a in all_div_activities:
        title = a.css('a.tribe-event-url::text').get()
        if a.css('span.ticket-cost::text'):
            price = a.css('span.ticket-cost::text').get()
        else:
            price = 'No price'
        details = a.css('div[class*="tribe-events-list-event-description"] > p::text').get()
        yield {
            'title': title.strip(),
            'price': price,
            'details': details
        }
Notes
Using an if statement for price because some events had no price at all, so supplying a placeholder value is a good idea.
Using strip() on the title when yielding the dictionary because the title had spaces and \n attached.
Advice
As a minor point, Scrapy suggests using the get() and getall() methods rather than extract_first() and extract(). With extract() it's not always possible to know whether the output is going to be a list or not; in this case the output I got was a list. This is why the Scrapy docs suggest using get() instead; it's also a bit more compact. With get() you will always get a string, which also meant I could strip newlines and spaces from the title, as you can see in the code above.
Another tip: if the class attribute is quite long, use a *= selector, as long as the partial attribute you select provides a unique match for the data you want. See here for a bit more detail.
Using items instead of yielding a dictionary may be better in the long run, as you can set default values for data that appears in some events on the page but not in others. You have to do this through a pipeline (if you don't understand this yet, don't worry). See the docs for items and here for a bit more on items.
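As a rough sketch of that last point (EventItem and DefaultValuesPipeline are placeholder names, not from the original answer):
import scrapy

class EventItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    details = scrapy.Field()

# pipelines.py -- supply a default for events without a price;
# remember to enable it via ITEM_PIPELINES in settings.py
class DefaultValuesPipeline:
    def process_item(self, item, spider):
        item.setdefault('price', 'No price')
        return item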
Here is mine. Hope it will help you.
for item in response.css('div.tribe-events-event-content'):
    print(item.css('a.tribe-event-url::text').get())
    print(item.css('span.ticket-cost::text').get())
    print(item.css('p::text').get())
Thanks.
Here are some steps to get your code fixed:
A period before a name selects by the element's class name, NOT the HTML tag itself, so change .span.ticket-cost::text to span.ticket-cost::text.
Likewise, change .p::text to p::text.
You are evidently trying to get a string, so use the get() method instead of the extract() method, which returns a list.
Make sure to use > when the desired text is inside a child element of the element you've selected.
Finally here is a CSS Selector Reference https://www.w3schools.com/cssref/css_selectors.asp
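Applying those steps to the original parse method gives roughly this sketch (otherwise keeping the question's selectors unchanged):
def parse(self, response):
    all_div_activities = response.css("div.tribe-events-content")
    yield {
        'title': all_div_activities.css("h2.tribe-events-list-event-title::text").get(),
        # span and p are tag names, so no leading dot
        'price': all_div_activities.css("span.ticket-cost::text").get(),
        'details': all_div_activities.css("p::text").get(),
    }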

Two scrapy item loaders in a single parse method? how to merge them?

I'm parsing a page, but I'd like to divide it into sections; the page has info about multiple meetings. Some of the info is common to all meetings, but not everything, so I made one item loader for the general info and one for the specific info. However, I'd like this parser to return all the info pertaining to a meeting (i.e. the general and the specific). Here's the parse method of my code:
def parse(self, response):
    general_loader = ItemLoader(item=ProductItem(), response=response)
    general_loader.default_input_processor = MapCompose(unicode.strip)
    general_loader.default_output_processor = Join(" & ")
    for field, xpath in self.general_item_fields.iteritems():
        general_loader.add_xpath(field, xpath)
    for meeting in response.xpath(self.meeting_xpath):
        specific_loader = ItemLoader(item=ProductItem(), response=meeting)
        specific_loader.default_input_processor = MapCompose(unicode.strip)
        specific_loader.default_output_processor = Join(" & ")
        for field, xpath in self.specific_item_fields.iteritems():
            specific_loader.add_xpath(field, xpath)
        yield general_loader.load_item().update(specific_loader.load_item())
The variables specific_item_fields and general_item_fields are dictionaries mapping each meeting attribute to its XPath.
So what I'm trying to do here is use meeting as the response for a second ItemLoader that I called specific_loader. And since general_loader.load_item() seems to return a dictionary, I tried updating or merging it with the specific_loader.load_item() dictionary.
Here's where I'm stuck:
The update method is not working on the result of load_item, and I can't seem to merge these two dictionaries.
Apparently I can't use a response.xpath() element (I'm using meeting here) as the loader's response?
Finally, there must be a better way to implement this. I've tried nested loaders and they seem very promising, but meeting changes: it cycles through the response.xpath(self.meeting_xpath) list, so how could I use nested loaders?
Thanks in advance for any pointers or advice, I'm kinda lost :)
I don't think there is a way to actually merge two loaders in Scrapy, but you can merge the dictionaries created from them. Note that dict.update() returns None, which is why yielding general_loader.load_item().update(...) directly doesn't work:
...
general_item = general_loader.load_item()
specific_item = specific_loader.load_item()
general_item.update(specific_item)
yield general_item
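On the second point, ItemLoader also accepts a selector argument, so you can build the specific loader directly from each meeting node instead of a response. A sketch reusing the question's names (and assuming the XPaths in specific_item_fields are relative, e.g. starting with './'):
general_item = general_loader.load_item()
for meeting in response.xpath(self.meeting_xpath):
    # selector= makes the loader evaluate XPaths relative to this meeting node
    specific_loader = ItemLoader(item=ProductItem(), selector=meeting)
    for field, xpath in self.specific_item_fields.iteritems():
        specific_loader.add_xpath(field, xpath)
    merged = dict(general_item)
    merged.update(specific_loader.load_item())
    yield merged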

Scrapy indexing in order

I'm currently creating a custom web crawler with Scrapy and trying to index the fetched content with Elasticsearch.
It works fine so far, but I'm only capable of adding content to the search index in the order the crawler filters HTML tags.
So for example with
sel.xpath("//div[#class='article']/h2//text()").extract()
I can get the content of all h2 tags inside a div with the class "article", so far so good. The next elements that get into the index are from all the h3 tags, naturally:
sel.xpath("//div[@class='article']/h3//text()").extract()
But the problem is that this messes up the entire order of the text on a site, since all headlines get indexed first and only then do their child nodes get a chance, which is kind of fatal for a search index.
Does anyone have a tip on how to properly get all the content from a page in the right order? (It doesn't have to be XPath, just with Scrapy.)
I guess you could solve the issue with something like this; the XPath union operator | returns the matched nodes in document order, so the extracted text stays in page order:
# Select multiple target nodes at once
sel_raw = '|'.join([
    "//div[@class='article']/h2",
    "//div[@class='article']/h3",
    # Whatever else you want to select here
])
for sel in sel.xpath(sel_raw):
    # Extract the texts for later use
    texts = sel.xpath('self::*//text()').extract()
    if sel.xpath('self::h2'):
        # An h2 element. Do something with texts
        pass
    elif sel.xpath('self::h3'):
        # An h3 element. Do something with texts
        pass
