Scraping some Facebook data but not all? Scrapy/Splash/Python

I have a spider that looks like this:
import scrapy
from scrapy_splash import SplashRequest


class BarkbotSpider(scrapy.Spider):
    name = 'barkbot'
    start_urls = [
        'http://www.facebook.com/pg/TheBarkFL/events/?ref=page_internal/'
    ]
    custom_settings = {
        'FEED_URI': 'output/barkoutput.json'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
            )

    def parse(self, response):
        for href in response.css("div#upcoming_events_card a::attr(href)").extract():
            yield response.follow(href, self.parse_concert)

    def parse_concert(self, response):
        concert = {
            "headliner": response.xpath(
                "//h1[@id='seo_h1_tag']/text()"
            ).extract_first(),
            "venue": "The Bark",
            "venue_address": "507 All Saints St.",
            "venue_website": "https://www.facebook.com/TheBarkFL",
            "date_time": response.xpath(
                "//li[@id='event_time_info']//text()"
            ).extract(),
            "notes": response.xpath(
                "//div[@data-testid='event-permalink-details']/span/text()"
            ).extract()
        }
        if concert['headliner']:
            yield concert
I run the spider and it finishes successfully, but the "notes" and "date_time" keys only ever return empty lists. I'm especially confused about the notes one, as that seems fairly straightforward unless xpath can't use data-testid as an attribute. I am, however, getting the headliner key successfully scraped, so I'm obviously connecting to each page.
I'm new to scraping JavaScript-generated content, and thus to Splash, but I've managed to get one other spider working successfully, just not on Facebook. What gives?

"unless xpath can't use data-testid as an attribute"
No, that's not it; I just checked with Scrapy 1.5.1 and your xpath matched a sample document fine. It even matched the other data-testid attributes in that document, so I am pretty sure you've hit a race condition: event-permalink-details does not appear in the raw HTML; it's loaded by an XHR call to their graphql endpoint. That may be fine in your case because of Splash, but if the selector isn't matching, it is running before that XHR has resolved. I don't know enough about Splash to help troubleshoot that situation.
I don't know the answer to your date_time question, but I bet what you actually want is .xpath('//li[@id="event_time_info"]//@content'), because that contains 2019-01-03T17:30:00-08:00 to 2019-01-03T20:30:00-08:00, which seems much nicer than the blob of strings the unqualified text() matches.
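One common way to deal with that race condition is to make Splash wait before returning the HTML, so the graphql XHR has time to resolve. Below is a minimal sketch of that idea using Splash's execute endpoint and a small Lua script; the spider name, the 5-second wait, and the script itself are my own assumptions to adapt, not something from the original spider.

import scrapy
from scrapy_splash import SplashRequest

# Lua script for Splash's 'execute' endpoint: load the page, wait a bit,
# then return the rendered HTML so the XHR-loaded markup is present.
WAIT_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    return splash:html()
end
"""


class BarkbotWaitSpider(scrapy.Spider):
    name = 'barkbot_wait'
    start_urls = ['http://www.facebook.com/pg/TheBarkFL/events/?ref=page_internal/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                # 5 seconds is a guess; tune it (or poll for the element in
                # Lua) until the event-permalink-details content shows up.
                args={'lua_source': WAIT_SCRIPT, 'wait': 5},
            )

    def parse(self, response):
        # Same selector as the original spider; it should now match once the
        # XHR content has been rendered into the HTML Splash returns.
        yield {
            'notes': response.xpath(
                "//div[@data-testid='event-permalink-details']/span/text()"
            ).extract(),
        }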

Related

How to: Get a Python Scrapy to run a simple xpath retrieval

I'm very new to Python and am trying to build a script that will eventually extract page titles from specified URLs to a .csv in the format I specify.
I have managed to get the spider to work in CMD using:
response.xpath("/html/head/title/text()").get()
So the xpath must be right.
Unfortunately, when I run the file my spider is in, it never seems to work properly. I think the issue is in the final block of code; unfortunately, all the guides I follow seem to use CSS. I feel more comfortable with xpath because you can simply copy/paste it from Dev Tools.
import scrapy


class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com",
        "http://www.example.com/blog"]

    def parse(self, response):
        for title in response.xpath("/html/head/title/text()"):
            yield {
                "title": sel.xpath("Title a::text").extract_first()
            }
I expected that to give me the page titles of the above URLs.
First of all, the second URL in your self.start_urls is invalid and returns 404, so you will end up with only one title extracted.
Second, you need to read more about selectors: you extracted the title in your shell test, but got confused when using it in your spider.
Scrapy will call the parse method for each URL in self.start_urls, so you don't need to iterate through titles; you only have one per page.
You can also access the <title> tag directly by using // at the beginning of your xpath expression; see this text copied from W3Schools:
/ Selects from the root node
// Selects nodes in the document from the current node that match the selection, no matter where they are
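As a quick illustration of that difference (a throwaway snippet, not part of the original answer), both expressions select the same node here because <title> only appears in one place:

from scrapy.selector import Selector

sel = Selector(text="<html><head><title>Example</title></head><body></body></html>")
print(sel.xpath("/html/head/title/text()").get())  # 'Example' (path from the root)
print(sel.xpath("//title/text()").get())           # 'Example' (matched anywhere)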
Here is the fixed code:
import scrapy


class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com"
    ]

    def parse(self, response):
        yield {
            "title": response.xpath('//title/text()').extract_first()
        }

Capture Google Search Term and ResultStats with Scrapy

I have built a very simple scraper using Scrapy. For the output table, I would like to show the Google News search term as well as the Google resultstats value.
The information I would like to capture is shown in the source of the Google page as
<input class="gsfi" value="Elon Musk">
and
<div id="resultStats">About 52,300 results</div>
I have already tried to include both via ('input.value::text') and ('id.resultstats::text'), but neither worked. Does anyone have an idea how to solve this?
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = {
                'search_title': quote.css('input.value::text').extract(),
                'results': quote.css('id.resultstats::text').extract(),
            }
            yield item
The page renders differently when you access it with Scrapy.
The search field becomes:
response.css('input#sbhost::attr(value)').get()
The results count is:
response.css('#resultStats::text').get()
Also, there is no quote class on that page.
You can test this in the scrapy shell:
scrapy shell -s ROBOTSTXT_OBEY=False "https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws"
And then run those 2 commands.
[EDIT]
If your goal is to get one item for each URL, then you can do this:
def parse(self, response):
    item = {
        'search_title': response.css('input#sbhost::attr(value)').get(),
        'results': response.css('#resultStats::text').get(),
    }
    yield item
If your goal is to extract every result on the page, then you need something different.
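For that case, a rough sketch might look like the following; the div.g, h3, and a selectors are placeholders of mine (Google's markup changes often and differs between the rendered and non-rendered page), so verify each of them in scrapy shell before relying on them:

def parse(self, response):
    search_title = response.css('input#sbhost::attr(value)').get()
    results_count = response.css('#resultStats::text').get()
    # 'div.g' is a guess for the per-result container; adjust it after
    # inspecting the page that Scrapy actually receives.
    for result in response.css('div.g'):
        yield {
            'search_title': search_title,
            'results': results_count,
            'title': result.css('h3::text').get(),
            'link': result.css('a::attr(href)').get(),
        }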

Scrapy scraping content that is visible sometimes but not others

I am scraping some info off of zappos.com, specifically a part of the details page that displays what customers that view the current item have also viewed.
This is one such item listing:
https://www.zappos.com/p/chaco-marshall-tartan-rust/product/8982802/color/725500
The thing is, I discovered that the section I am scraping appears right away on some items, but on others it only appears after I have refreshed the page two or three times.
I am using scrapy to scrape and splash to render.
import scrapy
import re
from scrapy_splash import SplashRequest


class Scrapys(scrapy.Spider):
    name = "sqs"
    start_urls = ["https://www.zappos.com", "https://www.zappos.com/marty/men-shoes/CK_XAcABAuICAgEY.zso"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},
                                )

    def parse(self, response):
        links = response.css("div._1Mgpu")
        for link in links:
            url = 'https://www.zappos.com' + link.css("a::attr(href)").extract_first()
            yield SplashRequest(url, callback=self.parse_attr,
                                endpoint='render.html',
                                args={'wait': 10},
                                )

    def parse_attr(self, response):
        alsoviewimg = response.css("div._18jp0 div._3Olkk div.QDcUX div.slider div.slider-frame ul.slider-list li.slider-slide a img").extract()
The alsoviewimg is one of the elements I am pulling from the "Customers Who Viewed this Item Also Viewed" section. I have tested pulling this and other elements in the scrapy shell, with Splash rendering the dynamic content, and it pulled the content fine; in the spider, however, it rarely, if ever, gets any hits.
Is there something I can set so that it loads the page a couple of times to get the content? Or is there something else that I am missing?
You should check if the element you're looking for exists. If it doesn't, load the page again.
I'd also look into why refreshing the page requires multiple attempts; you might be able to solve the problem without this ad-hoc multiple-refresh approach.
The question "Scrapy How to check if certain class exists in a given element" explains how to check whether a class exists.
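A minimal sketch of that check-and-reload idea, meant as a drop-in replacement for parse_attr in the spider above (it assumes the same SplashRequest import and selector; the retry cap of 3 and the 'retries' meta key are my own additions):

    MAX_RETRIES = 3  # hypothetical cap so a broken page can't loop forever

    def parse_attr(self, response):
        alsoviewimg = response.css(
            "div._18jp0 div._3Olkk div.QDcUX div.slider div.slider-frame "
            "ul.slider-list li.slider-slide a img").extract()
        if not alsoviewimg:
            retries = response.meta.get('retries', 0)
            if retries < self.MAX_RETRIES:
                # Re-render the same URL; dont_filter bypasses the dupe
                # filter so Scrapy doesn't drop the repeated request.
                yield SplashRequest(
                    response.url,
                    callback=self.parse_attr,
                    endpoint='render.html',
                    args={'wait': 10},
                    meta={'retries': retries + 1},
                    dont_filter=True,
                )
            return
        yield {'alsoviewimg': alsoviewimg}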

attribute error during recursive scraping with scrapy

I have a scrapy spider that works well as long as I give it a page that contains the links to the pages it should scrape.
Now, instead of giving it all the categories, I want to give it the page that contains links to all categories.
I thought I could simply add another parse function to achieve this,
but the console output gives me an attribute error:
"AttributeError: 'zaubersonder' object has no attribute 'parsedetails'"
This tells me that some attribute reference is not working correctly.
I am new to object orientation, but I thought scrapy calls parse, which calls parse_level2, which in turn calls parse_details, and this should work fine.
Below is my effort so far.
import scrapy


class zaubersonder(scrapy.Spider):
    name = 'zaubersonder'
    allowed_domains = ['abc.de']
    start_urls = ['http://www.abc.de/index.php/rgergegregre.html']

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to categories
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_level2)

    def parse_level2(self, response):
        urls2 = response.css('a.ulSubMenu::attr(href)').extract()  # links to entries
        for url2 in urls2:
            url2 = response.urljoin(url2)
            yield scrapy.Request(url=url2, callback=self.parse_details)

    def parse_details(self, response):  # extract entries
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.block').extract() + response.css('div.ce_text.last.block').extract(),
        }
Edit: fixed the code above in case someone searches for it.
There is a typo in the code. The callback in parse_level2 is self.parsedetails, but the function is named parse_details.
Just change the yield in parse_level2 to:
yield scrapy.Request(url=url2,callback=self.parse_details)
...and it should work better.

Scrapy Linkextractor duplicating(?)

I have the crawler implemented as below.
It works and goes through the pages allowed by the link extractor.
Basically, what I am trying to do is extract information from different places on the page:
- href and text() under the class 'news' (if it exists)
- image url under the class 'think block' (if it exists)
I have three problems with my scrapy:
1) duplicating linkextractor
It seems to process the same page more than once. (I checked the export file and found that the same ~.img appeared many times, which is hardly possible.)
And the fact is, for every page on the website there are hyperlinks at the bottom that let users navigate to the topic they are interested in, while my objective is to extract information from the topic's page (which lists several passages' titles under the same topic) and the images found within a passage's page (you can reach the passage's page by clicking on the passage's title found on the topic page).
I suspect link extractor would loop the same page over again in this case.
(Maybe this can be solved with DEPTH_LIMIT?)
2) Improving parse_item
I think parse_item is not very efficient. How could I improve it? I need to extract information from different places on the page (and, of course, only extract it if it exists). Besides, it looks like parse_item only produces HKejImage items and never HKejItem (again, I checked against the output file). How should I tackle this?
3) I need the spider to be able to read Chinese.
I am crawling a site in HK, so it is essential to be able to read Chinese.
The site:
http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82
As long as it belongs to 'dailynews', that's the thing I want.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items


class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]
    login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
    start_urls = 'http://www.hkej.com/dailynews'
    rules = (
        Rule(LinkExtractor(allow=('dailynews', ), unique=True), callback='parse_item', follow=True),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # name column
    def login(self, response):
        return FormRequest.from_response(response,
                                         formdata={'name': 'users', 'password': 'my password'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "username" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            return Request(url=self.start_urls)
        else:
            self.log("\n\n\nYou are not logged in.\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens

    def parse_item(self, response):
        hxs = Selector(response)
        news = hxs.xpath("//div[@class='news']")
        images = hxs.xpath('//p')
        for image in images:
            allimages = items.HKejImage()
            allimages['image'] = image.xpath('a/img[not(@data-original)]/@src').extract()
            yield allimages
        for new in news:
            allnews = items.HKejItem()
            allnews['news_title'] = new.xpath('h2/@text()').extract()
            allnews['news_url'] = new.xpath('h2/@href').extract()
            yield allnews
Thank you very much and I would appreciate any help!
First, to set settings, put them in the settings.py file, or specify the custom_settings attribute on the spider, like:
custom_settings = {
    'DEPTH_LIMIT': 3,
}
Then, you have to make sure the spider is actually reaching the parse_item method (which I think it isn't; I haven't tested it yet). Also, you can't specify both the callback and follow parameters on a rule, because they don't work together.
First, remove the follow on your rule, or add another rule to control which links to follow and which links to return as items.
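A rough sketch of that two-rule approach, using the same imports as the spider above (the allow/deny patterns are placeholders I based on the article URL in the question, not something verified against the site):

rules = (
    # Follow listing/topic pages under 'dailynews' without turning them into items
    Rule(LinkExtractor(allow=('dailynews', ), deny=('article', )), follow=True),
    # Treat individual article pages as items; don't follow further from them
    Rule(LinkExtractor(allow=('dailynews/headline/article', )), callback='parse_item'),
)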
Second, in your parse_item method you are using incorrect xpath. To get all the images, maybe you could use something like:
images=hxs.xpath('//img')
and then to get the image url:
allimages['image'] = image.xpath('./@src').extract()
For the news, it looks like this could work:
allnews['news_title']=new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/#href').extract()
Now, as I understand your problem, this isn't a LinkExtractor duplication error, just poorly specified rules. Also make sure you use valid xpath, even though your question didn't indicate that you knew your xpath needed correction.
