Why is scrapy crawling a different facebook page? - python

This is a Scrapy spider. This spider is supposed to collect the names of all div nodes with the class attribute "_5d-5", essentially making a list of all people with name x from location y.
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class fb_spider(scrapy.Spider):
    name = "fb"
    allowed_domains = ["facebook.com"]
    start_urls = [
        "https://www.facebook.com/search/people/?q=jaslyn%20california"]

    def parse(self, response):
        x = response.xpath('//div[@class="_5d-5"]').extract()
        with open("asdf.txt", 'wb') as f:
            f.write(u"".join(x).encode("UTF-8"))
But Scrapy crawls a web page different from the one specified. I got this on the command prompt:
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
but the URL I specified is:
https://www.facebook.com/search/people/?q=jaslyn%20california

Scraping is not allowed on Facebook: https://www.facebook.com/apps/site_scraping_tos_terms.php
If you want to get data from Facebook, you have to use their Graph API. For example, this would be the API to search for users: https://developers.facebook.com/docs/graph-api/using-graph-api#search
It is not as powerful as the Graph Search on facebook.com though.

Facebook is redirecting the request to the new url.

It seems as though you are missing some headers in your request.
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
As you can see, the referer is None. I would advise you to add some headers manually, namely the referer.
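As a minimal sketch of what that could look like with this spider (the header values below are placeholders, not taken from the original post):

import scrapy

class fb_spider(scrapy.Spider):
    name = "fb"
    allowed_domains = ["facebook.com"]

    def start_requests(self):
        # Placeholder header values for illustration; use values that match
        # a real browser session.
        headers = {
            "Referer": "https://www.facebook.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        }
        yield scrapy.Request(
            "https://www.facebook.com/search/people/?q=jaslyn%20california",
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        x = response.xpath('//div[@class="_5d-5"]').extract()
        with open("asdf.txt", "wb") as f:
            f.write(u"".join(x).encode("UTF-8"))

Whether Facebook still redirects /search/people/ to /search/top/ once a Referer is set is something you would have to verify; logged-out requests are often redirected regardless of headers.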

Related

DEBUG: Crawled (404)

This is my code:
# -*- coding: utf-8 -*-
import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)
And I have set a user agent in settings.py.
Then I get an error:
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)
So how can I eliminate this error?
Maybe your IP is banned by the website; you may also need to add some cookies to crawl the data you need.
The HTTP status code 404 is received because Scrapy checks /robots.txt by default. In your case this file does not exist, so a 404 is received, but that has no impact. In case you want to avoid checking robots.txt you can set ROBOTSTXT_OBEY = False in settings.py.
Then the website is accessed successfully (HTTP status code 200). No content is printed because, based on your XPath selection, nothing is selected. You have to fix your XPath selection.
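For example (a minimal settings.py sketch, assuming the default project layout):

# settings.py
# Skip the /robots.txt request entirely so the harmless 404 disappears.
ROBOTSTXT_OBEY = False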
If you want to test different XPath or CSS selections in order to figure out how to get your desired content, you might want to use the interactive Scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"
You can find an example of a scrapy shell session in the official Scrapy documentation here.
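As a rough sketch of such a session (the selectors are only examples to adapt, not ones known to match this page):

scrapy shell "http://money.finance.sina.com.cn/mkt/"
>>> # try the original selection, then broaden it to see what the page really contains
>>> response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').getall()
>>> response.css('a::attr(href)').getall()[:5]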

Getting DEBUG: Crawled (504) error with Lua script While loop timing out for Scrapy-Splash

I am very new to coding and am struggling with a web scraper I am trying to build. I am using a Lua script so that my Scrapy request waits for any web element to appear after the JavaScript on the website has loaded (I don't care which element; I just need the initial page loader to finish so I can access the HTML elements). The particular website I am trying to access is https://www.ladbrokes.com.au/sports/basketball/usa/nba, which has a JS initial loader page before any of the elements on the website are loaded.
My current code is this:
class Ladbrokes(scrapy.Spider):
    name = 'Ladbrokes'
    allowed_domains = ['ladbrokes.com.au']
    start_urls = ['https://www.ladbrokes.com.au/sports']

    def parse(self, response):
        sports_link = select_ladbrokes(response)
        for link in sports_link:
            url = response.urljoin(link)
            yield SplashRequest(url=url, callback=self.ladbrokes_all_comps, endpoint='execute',
                                args={'lua_source': lua_script})

    def ladbrokes_all_comps(self, response):
        comps = response.xpath('//*[@id="accordion_4e099d27-0f11-4c6e-848e-965fff7ad995"]/div[2]/div[2]/div[1]/div[2]/div[1]/div/div[1]/text()').extract()
lua_script = '''
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('#page-content-left > div > div') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end '''
When I call my spider I end up getting these errors:
2019-11-25 16:41:30 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://www.ladbrokes.com.au/sports/nrl via http://0.0.0.0:8050/execute> (referer: None)
2019-11-25 16:41:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://www.ladbrokes.com.au/sports/nrl>: HTTP status code is not handled or not allowed
It seems it is timing out on the Lua script's while loop, but I am not sure if it is because I am selecting the web element incorrectly.
I also tried putting in a long splash wait argument in the SplashRequest function, but it seemed the initial page loader never finished loading. Any help on this would be great!
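One thing worth trying (a sketch only, not taken from the original thread): bound the polling loop and pass Splash a longer render timeout, so the request either returns the HTML or fails with a clear result instead of Splash aborting the render with a 504.

from scrapy_splash import SplashRequest

# Bounded version of the wait loop: give up after roughly 15 seconds instead
# of looping until Splash's own timeout kills the render.
lua_script = '''
function main(splash)
    assert(splash:go(splash.args.url))
    local tries = 0
    while not splash:select('#page-content-left > div > div') and tries < 150 do
        splash:wait(0.1)
        tries = tries + 1
    end
    return {html = splash:html()}
end
'''

# The request can also pass a larger Splash-side timeout (capped by the
# Splash instance's --max-timeout option), e.g. in parse():
#   yield SplashRequest(url=url, callback=self.ladbrokes_all_comps,
#                       endpoint='execute',
#                       args={'lua_source': lua_script, 'timeout': 90})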

How can I convert relative paths to absolute paths with my scrapy CrawlSpider?

I am new to Scrapy and I am currently trying to write a CrawlSpider that will crawl a forum on the Tor darknet. Currently my CrawlSpider code is:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['http://answerstedhctbek.onion/questions']
    allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion']
    rules = (
        Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'),
        Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath')
    )

    def makeAbsolutePath(links):
        for i in range(links):
            links[i] = links[i].replace("../","")
        return links
Because the forum uses relative paths, I have tried to create a custom process_links to remove the "../"; however, when I run my code I am still receiving:
2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes> (referer: http://answerstedhctbek.onion/questions)
As you can see, I am still getting 400 errors due to the bad path. Why isn't my code removing the "../" from the links?
Thanks!
The problem might be that makeAbsolutePath is not part of the spider class. The documentation states:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used)
You did not use self in makeAbsolutePath, so I assume it is not an indentation error. makeAbsolutePath also has some other errors. If we correct the code to this state:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['file:///home/user/testscrapy/test.html']
    allowed_domains = []
    rules = (
        Rule(LinkExtractor(allow=(r'.*')), follow=True, process_links='makeAbsolutePath'),
    )

    def makeAbsolutePath(self, links):
        print(links)
        for i in range(links):
            links[i] = links[i].replace("../","")
        return links
it will yield this error:
TypeError: 'list' object cannot be interpreted as an integer
This is because no call to len() was used inside range(), and range() can only operate on integers: it wants a number and will give you the range from 0 to that number minus 1.
After fixing this issue, it will give the error:
AttributeError: 'Link' object has no attribute 'replace'
This is because, unlike you thought, links is not a list of strings containing the contents of href="" attributes. Instead, it is a list of Link objects.
I'd recommend you output the contents of links inside makeAbsolutePath and see if you have to do anything at all. In my opinion, Scrapy should already stop resolving .. operators once it reaches the domain level, so your links should point to http://answerstedhctbek.onion/<number>/<title>, even though the site uses the .. operator without an actual folder level (as the URL is /questions and not /questions/).
Somehow like this:
def makeAbsolutePath(self, links):
    for i in range(len(links)):
        print(links[i].url)
    return []
(Returning an empty list here gives you the advantage that the spider will stop and you can check the console output)
If you then find out, the URLs are actually wrong, you can perform some work on them through the url attribute:
links[i].url = 'http://example.com'
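Putting those pieces together, a sketch of what the process_links method could look like if the URLs really do need fixing (the "/../" replacement is just an assumption to adapt after inspecting the printed URLs):

def makeAbsolutePath(self, links):
    # links is a list of scrapy.link.Link objects, so rewrite the .url
    # attribute of each one rather than the objects themselves.
    for link in links:
        link.url = link.url.replace("/../", "/")
    return links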

Scrapy https tutorial

Hi everyone!
I'm new to the Scrapy framework, and I need to parse wisemapping.com.
At first, I read the official Scrapy tutorial and tried to get access to one of the "wisemaps", but got errors:
[scrapy.core.engine] DEBUG: Crawled (404) <GET https://app.wisemapping.com/robots.txt> (referer: None)
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying
<GET https://app.wisemapping.com/c/maps/576786/public> (failed 3 times): 500 Internal Server Error
[scrapy.core.engine] DEBUG: Crawled (500) <GET https://app.wisemapping.com/c/maps/576786/public> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://app.wisemapping.com/c/maps/576786/public>: HTTP status code is not handled or not allowed
Please give me advice on solving the problems with the following code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://app.wisemapping.com/c/maps/576786/public',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'wisemape.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Navigating to https://app.wisemapping.com/c/maps/576786/public gives the error
"Outch!!. This map is not available anymore.
You do not have enough right access to see this map. This map has been changed to private or deleted."
Does this map exist? If so, try making it public.
If you know for a fact that the map you're trying to access exists, verify that the URL you're trying to access is the correct one.
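If you want to inspect the body of the 500 response itself while debugging, one option (a sketch, not part of the answer above) is to let that status code through Scrapy's HttpError middleware:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Let 500 responses reach parse() instead of being dropped by the
    # HttpError spider middleware, so the body can be saved and inspected.
    handle_httpstatus_list = [500]

    def start_requests(self):
        yield scrapy.Request('https://app.wisemapping.com/c/maps/576786/public',
                             callback=self.parse)

    def parse(self, response):
        with open('wisemape.html', 'wb') as f:
            f.write(response.body)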

Scrapy doesn't do any scraping after login on salesforce.com based site

I'm new to Scrapy and have an issue logging in to a salesforce.com based site. I use the loginform package to populate Scrapy's FormRequest. When run, it does a GET of the login page and a successful POST of the FormRequest login as expected. But then the spider stops, and no page gets scraped.
[...]
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testdomain.secure.force.com/jSites_Home> (referer: None)
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://login.salesforce.com/> (referer: https://testdomain.secure.force.com/jSites_Home)
2017-06-25 14:02:29 [scrapy.core.engine] INFO: Closing spider (finished)
[...]
The (slightly redacted) script:
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Spider, CrawlSpider, Rule
from urllib.parse import urljoin
from loginform import fill_login_form
from harvest.items import HarvestItem

class TestSpider(Spider):
    name = 'test'
    allowed_domains = ['testdomain.secure.force.com', 'login.salesforce.com']
    login_url = 'https://testdomain.secure.force.com/jSites_Home'
    login_user = 'someuser'
    login_password = 'p4ssw0rd'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method, callback=self.get_assignments)

    def get_assignments(self, response):
        assignment_selector = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href')
        return Request(urljoin(response.url, assignment_selector.extract()), callback=self.parse_item)

    def parse_item(self, response):
        items = HarvestItem()
        items['startdatum'] = response.xpath('(//*/table[@class="detailList"])[2]/tbody/tr[1]/td[1]/span/text()')\
            .extract()
        return items
When I check the body of the FormRequest, it looks like a legit POST to the page 'login.salesforce.com'. If I log in manually, I notice several redirects. However, when I force a parse by adding a callback='parse' to the FormRequest, still nothing happens.
Am I right in thinking the login went OK, looking at the 200 response?
I don't see any redirects in the scrapy output. Could it be that scrapy doesn't handle the redirects properly, causing the script to not do any scraping?
Any ideas on getting the script to scrape the final redirected page after login?
Thanks
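One way to see what actually happened to the login POST (a debugging sketch only, assuming Scrapy's default RedirectMiddleware is enabled, which records any followed redirects under the redirect_urls meta key):

def get_assignments(self, response):
    # Log where the POST finally landed and which redirects Scrapy followed;
    # landing back on login.salesforce.com with an empty chain usually means
    # the credentials or form fields were not accepted.
    self.logger.info("Landed on: %s", response.url)
    self.logger.info("Redirect chain: %s",
                     response.request.meta.get("redirect_urls", []))
    self.logger.info("Body starts with: %r", response.body[:200])

If the chain is empty and the body still shows the login form, the problem is the login itself rather than Scrapy dropping redirects.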
