BeautifulSoup/Scrapy: BeautifulSoup HTML differs from source HTML viewed in Firefox - python

I'm new to Python, BeautifulSoup, and Scrapy, so I'm not 100% sure how to describe the problem I'm having.
I'd like to scrape the URL provided by the 'next' button on each result page; it sits inline next to the image links 'tiff' and 'jpeg'.
The issue is that the 'next' (and, on subsequent pages, the 'previous') links don't seem to be present at the URL I provide to Scrapy. When I asked a friend to check the URL, she told me she didn't see the links. I confirmed this by printing the BeautifulSoup object associated with the tag id 'description':
description = soup.find('div', {'id': 'description'})
Because I generate this page from a search on the LOC website, I'm thinking I must need to pass something to my spider to indicate the search parameters. I tried the solution suggested here, changing the referer, but it still doesn't work:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid'
}
I get the following output in the logs when I run my spider, confirming the referer has been updated:
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.loc.gov/robots.txt> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.loc.gov/pictures/resource/fsa.8a07028/?co=fsa> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
If someone could help, I'd really appreciate it.

AFAICT, that site uses a session to store your search history server-side.
The search is initiated from a URL like yours.
But when you visit the image URLs afterwards, your session is active (via your cookies), and the site renders the next/back links. If no session is found, it doesn't (but you can still see the page). You can prove this by deleting your cookies after the initial search and watching the links disappear when you refresh...
You'll need to tell Scrapy to first go to the search URL and then spider the results, making sure the cookie middleware is enabled.
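A minimal sketch of that flow, assuming Scrapy's default settings (the cookie middleware is on unless you disabled it) and guessing at the page's markup for the selectors:

import scrapy

class LocSpider(scrapy.Spider):
    # Sketch only: the XPath selectors below are assumptions about the
    # page markup, not taken from the site; adjust as needed.
    name = 'loc_fsa'

    # Start at the search page so the server creates the session that
    # the next/previous links depend on.
    start_urls = [
        'https://www.loc.gov/pictures/collection/fsa/search/'
        '?co=fsa&q=1935&st=grid',
    ]

    def parse(self, response):
        # Follow each result; the session cookie set by the search
        # request is sent automatically on these follow-up requests.
        for href in response.xpath(
                '//a[contains(@href, "/pictures/resource/")]/@href').getall():
            yield response.follow(href, callback=self.parse_resource)

    def parse_resource(self, response):
        # With an active session the page should now contain the links.
        next_url = response.xpath(
            '//div[@id="description"]//a[contains(., "Next")]/@href').get()
        if next_url:
            yield response.follow(next_url, callback=self.parse_resource)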

Related

Response.url and referer url scrapy

2020-11-09 12:13:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/books/adventure/book1/index.html> (referer: example.com/books/adventure/index.html)
If anyone is familiar with Scrapy, you know that https://example.com/books/adventure/book1/index.html is called response.url. However, I want to get the referer link, example.com/books/adventure/index.html; does anyone know what it's called?
You need to set the referer in your header.
Ideally it is created by you, i.e. you already have it and don't need to get it from the response.
e.g.
headers = {'Referer': 'example.com/books/adventure/index.html'}
Hope that helps?
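Putting that together, a minimal sketch using the question's example URLs; whatever you set on the request can later be read back via response.request.headers.get('Referer'):

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://example.com/books/adventure/index.html']

    def parse(self, response):
        # Create the Referer yourself when building the request...
        yield scrapy.Request(
            'https://example.com/books/adventure/book1/index.html',
            headers={'Referer': response.url},
            callback=self.parse_book,
        )

    def parse_book(self, response):
        # ...then read it back from the request that produced this
        # response (Scrapy stores header values as bytes).
        referer = response.request.headers.get('Referer')
        self.logger.info('Referer was %s', referer)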

555 HTTP Protocol when trying to scrape with Scrapy

I am trying to scrape a website using Scrapy. It fetches the robots.txt file with status code 200, but I get this in the terminal when sending a request:
[scrapy.core.engine] DEBUG: Crawled (555) <POST https://www.<the_link>.ca/<part_of_link>/UpdateQuery> (referer: None)
b'{"d":{"Message":"SESSION_TIMEOUT","Result":null,"Succeeded":false}}'
I have tried looking up 555 on the internet, but could not find much explanation of it.
I got some information from https://httptoolkit.tech/blog/new-http-status-code-555 but nothing very clear.
I have tried it in Postman. The first time I send the request I get the 555 error, and the second time I send it I get status code 200.
But how do I tackle this in my Scrapy script?
I would be glad if someone could help me get around this.
Thank you.
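No answer is recorded here, but one hedged idea that mirrors the Postman behaviour (first request 555, second 200): let Scrapy's RetryMiddleware re-send the request, so a session cookie set by the failed first attempt can be reused. Entirely untested against this site, and the URL below is a placeholder for the elided one:

import scrapy

class QuerySpider(scrapy.Spider):
    name = 'query'

    # Untested sketch: treat 555 as retryable so RetryMiddleware
    # re-sends the POST; if the first attempt sets a session cookie,
    # the retry may succeed like the second Postman request did.
    custom_settings = {
        'RETRY_ENABLED': True,
        'RETRY_HTTP_CODES': [555, 500, 502, 503, 504, 408, 429],
        'RETRY_TIMES': 2,
    }

    def start_requests(self):
        # Placeholder URL: build your POST to the real UpdateQuery
        # endpoint here, as before (it is elided in the question).
        yield scrapy.Request('https://example.invalid/UpdateQuery',
                             method='POST', callback=self.parse)

    def parse(self, response):
        self.logger.info('Got %s from %s', response.status, response.url)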

Why is scrapy crawling a different facebook page?

This is a Scrapy spider. The spider is supposed to collect the names of all div nodes with the class attribute "_5d-5", essentially making a list of all people with x name from y location.
import scrapy

class fb_spider(scrapy.Spider):
    name = "fb"
    allowed_domains = ["facebook.com"]
    start_urls = [
        "https://www.facebook.com/search/people/?q=jaslyn%20california"]

    def parse(self, response):
        # Select every div with class "_5d-5" and dump the markup to a
        # file ('@class' in the XPath, with .extract() on the selector).
        x = response.xpath('//div[@class="_5d-5"]').extract()
        with open("asdf.txt", 'wb') as f:
            f.write(u"".join(x).encode("UTF-8"))
But Scrapy crawls a web page different from the one specified. I got this on the command prompt:
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
but the URL I specified is:
https://www.facebook.com/search/people/?q=jaslyn%20california
Scraping is not allowed on Facebook: https://www.facebook.com/apps/site_scraping_tos_terms.php
If you want to get data from Facebook, you have to use their Graph API. For example, this would be the API to search for users: https://developers.facebook.com/docs/graph-api/using-graph-api#search
It is not as powerful as the Graph Search on facebook.com though.
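For illustration, a rough sketch of calling that search endpoint with the requests library; the token is a placeholder you would need to obtain from Facebook, and the endpoint details follow the linked docs and may have changed since:

import requests

ACCESS_TOKEN = '...'  # placeholder: obtain a real token from Facebook

# Rough sketch of the user-search endpoint described in the linked
# Graph API docs; parameters are assumptions based on that page.
resp = requests.get(
    'https://graph.facebook.com/search',
    params={'q': 'jaslyn', 'type': 'user', 'access_token': ACCESS_TOKEN},
)
print(resp.json())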
Facebook is redirecting the request to the new URL.
It seems as though you are missing some headers in your request.
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
As you can see, the referer is None. I would advise you to add some headers manually, namely the referer.
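A hedged sketch of that advice: send the initial request with hand-set headers. The Referer and User-Agent values here are assumptions, and whether they actually stop the redirect is untested:

import scrapy

class fb_headers_spider(scrapy.Spider):
    name = 'fb_headers'
    allowed_domains = ['facebook.com']

    def start_requests(self):
        # Set the headers, including a Referer, by hand on the
        # initial request (values below are guesses).
        yield scrapy.Request(
            'https://www.facebook.com/search/people/?q=jaslyn%20california',
            headers={
                'Referer': 'https://www.facebook.com/',
                'User-Agent': 'Mozilla/5.0',
            },
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Landed on %s', response.url)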

Crawling redirected url in scrapy

I am working in Scrapy.
I am fetching a site which consists of a list of URLs.
So I requested the main URL in start_urls and got all the href tags (links to fetch data) in a list. I then requested each URL in the list to fetch the data, but some of the URLs redirect, like below:
Redirecting (301) to <GET example.com/sch/mobile-68745.php> from <GET example.com/sch/mobile-8974.php>
I came to know that Scrapy ignores redirected links, but I want to catch the redirected URL and scrape it just like a URL with status 200.
Is there any way to catch that redirected URL and scrape the data from it? I mean, do we need to disable the redirect middleware? Or do we need to use some meta key in the Request? Can you provide an example of that?
I’ve got no experience with Scrapy, but apparently you can define middlewares that change the way Scrapy works when resolving content.
There is the RedirectMiddleware that supports and handles redirects out of the box, so all you’d need to do is to enable it:
DOWNLOADER_MIDDLEWARES = {
    # In newer Scrapy versions the path is
    # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware'.
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 123,
}
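And for the "meta key in the Request" part of the question: with that middleware active (it is enabled by default in current Scrapy), 301s are followed automatically, your callback receives the final response, and the original URLs are recorded in response.meta['redirect_urls']. A sketch, with example.com standing in for the real site:

import scrapy

class redirect_spider(scrapy.Spider):
    name = 'redirects'
    start_urls = ['https://example.com/sch/']

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # RedirectMiddleware has already followed any 301s by the time
        # this runs; the chain of original URLs is kept in meta.
        chain = response.meta.get('redirect_urls', [])
        self.logger.info('Fetched %s (redirected from %s)',
                         response.url, chain)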

Problem logging into Facebook with Scrapy

(I have asked this question on the Scrapy google-group without luck.)
I am trying to log into Facebook using Scrapy. I tried the following in the interactive shell:
I set the headers and created a request as follows:
header_vals = {
    'Accept-Language': ['en'],
    'Content-Type': ['application/x-www-form-urlencoded'],
    'Accept-Encoding': ['gzip,deflate'],
    'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
    'User-Agent': ['Mozilla/5.0 Gecko/20070219 Firefox/2.0.0.2'],
}
login_request = Request('https://www.facebook.com/login.php', headers=header_vals)
fetch(login_request)
I get redirected:
2011-08-11 13:54:54+0530 [default] DEBUG: Redirecting (meta refresh) to <GET https://www.facebook.com/login.php?_fb_noscript=1> from <GET https://www.facebook.com/login.php>
...
[s] request <GET https://www.facebook.com/login.php>
[s] response <200 https://www.facebook.com/login.php?_fb_noscript=1>
I guess it shouldn't be redirected there if I am supplying the right headers?
I still attempt to go ahead and supply login details using the FormRequest as follows:
new_request = FormRequest.from_response(
    response,
    formname='login_form',
    formdata={'email': '...@email.com', 'pass': 'password'},
    headers=header_vals,
)
new_request.meta['download_timeout'] = 180
new_request.meta['redirect_ttl'] = 30
fetch(new_request) results in:
2011-08-11 14:05:45+0530 [default] DEBUG: Redirecting (meta refresh) to <GET https://www.facebook.com/login.php?login_attempt=1&_fb_noscript=1> from <POST https://www.facebook.com/login.php?login_attempt=1>
...
[s] response <200 https://www.facebook.com/login.php?login_attempt=1&_fb_noscript=1>
...
What am I missing here? Thanks for any suggestions and help.
I'll add that I've also tried this with a BaseSpider to see if this was a result of the cookies not being passed along in the shell, but it doesn't work there either.
I was able to use Mechanize to log on successfully. Can I take advantage of this to somehow pass cookies on to Scrapy?
Notice the "meta refresh" text in those redirect messages. Facebook has a noscript tag that automatically redirects clients without JavaScript to "/login.php?_fb_noscript=1". The problem is that you're posting to "/login.php" instead, so you always get redirected by the meta refresh.
Even if you get past this problem, it's against Facebook's robots.txt, so you shouldn't really be doing this.
Why don't you just use the Facebook Graph API?
