scrapy returning an empty object - python

I am using CSS selectors and continually get a response with empty values. Here is the code.
import scrapy

class WebSpider(scrapy.Spider):
    name = 'activities'
    start_urls = [
        'http://capetown.travel/events/'
    ]

    def parse(self, response):
        all_div_activities = response.css("div.tribe-events-content")  #gdlr-core-pbf-column gdlr-core-column-60 gdlr-core-column-first
        title = all_div_activities.css("h2.tribe-events-list-event-title::text").extract()  #gdlr-core-text-box-item-content
        price = all_div_activities.css(".span.ticket-cost::text").extract()
        details = all_div_activities.css(".p::text").extract()
        yield {
            'title': title,
            'price': price,
            'details': details
        }

In your code you're selecting all events, but that selector returns a list of selectors, and you can't pull the title etc. out of it with extract() the way you're trying to.
This is why you're not getting the data you want. You need a for loop to iterate over each event on the page, in your case looping over all_div_activities.
Code for Script
def parse(self, response):
    all_div_activities = response.css('div.tribe-events-event-content')
    for a in all_div_activities:
        title = a.css('a.tribe-event-url::text').get()
        if a.css('span.ticket-cost::text'):
            price = a.css('span.ticket-cost::text').get()
        else:
            price = 'No price'
        details = a.css('div[class*="tribe-events-list-event-description"] > p::text').get()
        yield {
            'title': title.strip(),
            'price': price,
            'details': details
        }
Notes
Using an if statement for price because some events had no price at all, so supplying a placeholder value is a good idea.
Using strip() on title when yielding the dictionary because the title had whitespace and \n attached.
Advice
As a minor point, Scrapy suggests using the get() and getall() methods rather than extract_first() and extract(). With extract() it's not always possible to know whether the output is going to be a list or not; in this case the output I got was a list. This is why the Scrapy docs suggest using get() instead, and it's also a bit more compact. With get() you will always get a string (or None), which is what let me strip the newlines and spaces from the title in the code above.
Another tip: if the class attribute is quite long, use a *= selector, as long as the partial attribute you select gives a unique match for the data you want. See here for a bit more detail.
Using Items instead of yielding a dictionary may be better in the long run, as you can set default values for fields that some events on the page have and others don't. You have to do this through a pipeline (again, if you don't understand this then don't worry; there's a minimal sketch below). See the docs for Items and here for a bit more on Items.
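For illustration, here is a minimal sketch of that Item-plus-pipeline idea; the EventItem and DefaultValuesPipeline names are hypothetical, not from the original question:

# items.py
import scrapy

class EventItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    details = scrapy.Field()

# pipelines.py -- enable it in settings.py, e.g.
# ITEM_PIPELINES = {'myproject.pipelines.DefaultValuesPipeline': 300}
class DefaultValuesPipeline:
    def process_item(self, item, spider):
        # scrapy.Item behaves like a dict, so setdefault fills in
        # a value only when the spider didn't set one
        item.setdefault('price', 'No price')
        item.setdefault('details', '')
        return item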

Here is mine. Hope it helps you.
for item in response.css('div.tribe-events-event-content'):
    print(item.css('a.tribe-event-url::text').get())
    print(item.css('span.ticket-cost::text').get())
    print(item.css('p::text').get())
Thanks.

Here are some steps to get your code fixed:
A period before a name selects by the element's class name, NOT the HTML tag itself. So change .span.ticket-cost::text to span.ticket-cost::text,
and likewise .p::text to p::text.
Since you are obviously trying to get a string, use the get() method instead of the extract() method, which returns a list.
Make sure to use > when the desired text is inside a child element of the element you've selected.
A corrected version of your selectors is sketched below. Finally, here is a CSS Selector Reference: https://www.w3schools.com/cssref/css_selectors.asp
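Applying those fixes to the question's own code gives roughly the following; the class names are taken from the question, so verify them against the live page:

def parse(self, response):
    all_div_activities = response.css("div.tribe-events-content")
    # no leading period: span and p are tag names, not classes
    title = all_div_activities.css("h2.tribe-events-list-event-title::text").get()
    price = all_div_activities.css("span.ticket-cost::text").get()
    details = all_div_activities.css("p::text").get()
    yield {
        'title': title,
        'price': price,
        'details': details
    }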

Related

How to address alternative XPath selector if the original one does not exist?

I'm using scrapy (https://scrapy.org/) to crawl a lot of websites (nearly 300) and save each page's title and date in a JSON file. The title is, on most websites, the first H1. But the date is tricky. Currently I'm using this XPath selector:
item['date'] = response.xpath("//time/text()").get()
But the websites format the date in different ways: sometimes as a span, sometimes as a normal paragraph, sometimes as a time element, and others use an a tag.
Question: How can I implement something like an if-then-else structure for the item, to tell the spider to look for different elements if the first one does not exist?
You can simply use Item Loaders and pick the first non-empty value:
l.add_xpath('date', '//first/xpath')
l.add_xpath('date', '//second/xpath')
l.add_xpath('date', '//third/xpath')
And in items.py:
date = scrapy.Field(output_processor=TakeFirst())
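A minimal, self-contained sketch of that approach; the ArticleItem name and the fallback XPaths are illustrative assumptions, not from the question:

# items.py
import scrapy
from scrapy.loader.processors import TakeFirst  # itemloaders.processors.TakeFirst in newer Scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())
    date = scrapy.Field(output_processor=TakeFirst())

# in the spider
from scrapy.loader import ItemLoader

def parse(self, response):
    l = ItemLoader(item=ArticleItem(), response=response)
    l.add_xpath('title', '//h1/text()')
    # Values are collected in order; TakeFirst keeps the first non-empty
    # one, so the later XPaths act as fallbacks for the earlier ones.
    l.add_xpath('date', '//time/text()')
    l.add_xpath('date', '//span[@class="date"]/text()')
    l.add_xpath('date', '//p[@class="date"]/text()')
    yield l.load_item()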

Two scrapy item loaders in a single parse method? how to merge them?

I'm parsing a page, but I'd like to divide it into sections. The page has info about multiple meetings; some of the info is common to all meetings, but not all of it. So I made an item loader for the general info and one for the specific info. However, I'd like this parser to return all the info pertaining to a meeting (i.e. the general and the specific). Here's the parse method of my code:
def parse(self, response):
    general_loader = ItemLoader(item=ProductItem(), response=response)
    general_loader.default_input_processor = MapCompose(unicode.strip)
    general_loader.default_output_processor = Join(" & ")
    for field, xpath in self.general_item_fields.iteritems():
        general_loader.add_xpath(field, xpath)
    for meeting in response.xpath(self.meeting_xpath):
        specific_loader = ItemLoader(item=ProductItem(), response=meeting)
        specific_loader.default_input_processor = MapCompose(unicode.strip)
        specific_loader.default_output_processor = Join(" & ")
        for field, xpath in self.specific_item_fields.iteritems():
            specific_loader.add_xpath(field, xpath)
        yield general_loader.load_item().update(specific_loader.load_item())
The variables specific_item_fields and general_item_fields are dictionaries mapping each attribute of the meeting to its XPath.
So what I'm trying to do here is use meeting as the response for a second ItemLoader that I called specific_loader. And since general_loader.load_item() seems to return a dictionary, I tried updating or merging it with the specific_loader.load_item() dictionary.
Here's where I'm stuck:
The update method is not working on load_item and I can't seem to merge these two things.
Apparently I can't use a response.xpath() element (I'm using meeting here) as the loader's response?
Finally, there must be a better way to implement this. I've tried nested loaders and they seem very promising, but meeting changes: it cycles through the response.xpath(self.meeting_xpath) list, so how could I use nested loaders?
Thanks in advance for any pointers or advice, I'm kinda lost :)
I don't think there is a way to actually merge two loaders in Scrapy, but you can merge the items created from them. Note that update() returns None, which is why yielding general_loader.load_item().update(...) directly yields None instead of an item:
...
general_item = general_loader.load_item()
specific_item = specific_loader.load_item()
general_item.update(specific_item)
yield general_item

Python. Scrapy Xpath returning empty array

I am using scrapy to scrape information from a website in Python, and I am only just getting used to using XPaths to find information.
I want to return a list of the average ratings of all this artist's albums from this page:
https://rateyourmusic.com/artist/kanye_west
To find the node for the albums I used //div[@id="disco_type_s"],
and I tried searching its children for divs with the class disco_avg_rating using div[@class="disco_avg_rating"]/text().
Here is my function:
def parse_dir_contents(self, response):
    item = rateyourmusicalbums()  # ignore this
    for i in response.xpath('//div[@id="disco_type_s"]'):
        item['average rating'] = i.xpath('div[@class="disco_avg_rating"]/text()').extract()
        yield item
Everything I try to get this list causes problems. Usually it's more straightforward, but this time I have to differentiate between albums and singles etc., so I am having trouble.
I appreciate your help; I am fairly new to web scraping.
response.xpath('//div[@id="disco_type_s"]') finds only one tag (this is what usually happens when matching by id in an XPath: ids are unique). To get a list of selectors you should use something like:
response.xpath('//div[@id="disco_type_s"]/div[@class="disco_release"]') which will match multiple tags, so you can iterate over them,
and then get the average rating with './div[@class="disco_avg_rating"]/text()'. A sketch of this is below.
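As a rough sketch of that answer in a parse method (the div class names come from the answer above; verify them against the actual page):

def parse_dir_contents(self, response):
    # iterate over the individual release rows, not the single container div
    for release in response.xpath('//div[@id="disco_type_s"]/div[@class="disco_release"]'):
        yield {
            'average_rating': release.xpath('./div[@class="disco_avg_rating"]/text()').get(),
        }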
Alternatively, the following should work (instantiating the item inside the loop so each yield gets its own):
def parse_dir_contents(self, response):
    for i in response.xpath('//*[@class="disco_release"]/div[3]'):
        item = rateyourmusicalbums()
        item['average rating'] = i.xpath('text()').extract()
        yield item

Extract h1 text from div class with scrapy or selenium

I am using Python along with Scrapy and Selenium. I want to extract the text from the h1 tag which is inside a div class.
For example:
<div class = "example">
    <h1>
        This is an example
    </h1>
</div>
This is the code I tried:
for single_event in range(1, length_of_alllinks):
    source_link.append(alllinks[single_event])
    driver.get(alllinks[single_event])
    s = Selector(response)
    temp = s.xpath('//div[@class="example"]//@h1').extract()
    print temp
    title.append(temp)
    print title
Each and every time, whatever method I tried, I got an empty list.
Now, I want to extract "This is an example", i.e. the h1 text, and store it or append it to a list, i.e. in my example title.
Like:
temp = ['This is an example']
Try the following to extract the intended text:
s.xpath('//div[@class="example"]/h1/text()').extract()
For one, it seems that in your HTML the class attribute of the div is "example", but in your code you're looking for other class values; at least for XPath queries, keep in mind that you search by exact attribute value. You can use something like:
s.xpath('//div[contains(@class, "example")]')
to find an element that has the "example" class but may have additional classes. I'm not sure if this is a mistake or your actual code. In addition, the spaces in your HTML around the '=' sign of the class attribute may not be helping some parsers either.
Second, your query used in s.xpath seems wrong. Try something like this:
temp = s.xpath('//div[@class="example"]/h1').extract()
It's not clear from your code what s is, so I'm assuming the extract() method does what you think it does. Maybe a cleaner code sample would help us help you.
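Since s is built from an undefined response in the question's code, here is one hedged way to construct it from the Selenium page source instead (reusing the question's variable names):

from scrapy.selector import Selector

driver.get(alllinks[single_event])
# build the Selector from the rendered page, not an undefined response
s = Selector(text=driver.page_source)
temp = s.xpath('//div[contains(@class, "example")]/h1/text()').extract()
title.append(temp)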

Using Rules and Requests in Scrapy throws exception TypeError: __init__() got multiple values for keyword argument 'callback'

I am a beginner in scrapy, and I have been trying to do the following workflow: start from page A, which is a search-results page containing links to full articles whose URLs end in a digit. My intention is to grab each link from each search-results page, access the links, and scrape the full article.
I iterate over each page collecting links with the following rule:
rules = (Rule(SgmlLinkExtractor(allow=(r'\d+',)), callback='parse_short_story',follow=True),)
This ensures that the last digit of the search page URL iterates to the next one after I am done collecting the links and scraping the full articles of the current page.
The parse_short_story method merely uses a select to filter the relevant portion of the HTML page, and afterwards loops over the remaining portion to acquire the links of the full stories and pass them on as requests:
for short_story in short_stories:
    item = DmozItem()
    full_story_link = short_story.select(".//h2/a/@href").extract()
    if full_story_link:
        yield Request(full_story_link, self.parse_full_story, callback='self.parse_full_story', errback=lambda _: item, meta=dict(item=item),)
    items.append(item)
return items
From my understanding of the Scrapy tutorial, I need to return the items at the end of the parser methods, so that the rule properly appends them to a final list of items which I can dump to a JSON file or something else when running from the console. Notice the portion below with both Request and return calls, which crashes. I can't figure out how to use both the Request and the returned items.
The method parse_full_story gets the response parameter like parse_short_story does, and recovers the item I sent as a parameter with
item = response.meta.get('item')
After properly setting the information I want on my item item, I use return item.
In summary,
My expectation was that the rule would take care of moving along the search pages containing the links to the full articles using the parse_short_story callback, while for each link on each page, parse_full_story would access the full article at that link, scrape what I wanted, add it to the item item, and exit, hopefully scanning all full articles in the end.
Apparently my understanding is wrong and I get the error:
yield Request(full_story_link, self.parse_full_story, callback='self.parse_full_story', errback=lambda _: item, meta=dict(item=item),)
exceptions.TypeError: __init__() got multiple values for keyword argument 'callback'
You can find the full runnable code here. As it runs, you will see that it keeps throwing the exception. If a direct fix is feasible, and/or you can give a short explanation of what is wrong here, I would appreciate it, since similar problems kept leading me to Django-related questions on the web.
Put only one callback parameter and use self.parse_full_story (Request() expects a callable; see here).
The "callback name string" version is only for Rules (see here).
Use
yield Request(full_story_link,
              self.parse_full_story,
              errback=lambda _: item,
              meta=dict(item=item),
              )
instead of
yield Request(full_story_link,
              self.parse_full_story, callback='self.parse_full_story',
              errback=lambda _: item,
              meta=dict(item=item),
              )
