EDIT: This is only scraping the XML links from the crawled pages, not actually following the links to the XML pages themselves, which have a slightly different URL. I need to adjust the rules so I can crawl those pages as well. I'll post a solution once I've got it working.
http://www.digitalhumanities.org/dhq/vol/12/4/000401/000401.xml
vs.
http://www.digitalhumanities.org/dhq/vol/12/4/000401.xml
This script scrapes all the links and should save all the files, but only the HTML files are actually saved. The XML files are added to the CSV but not downloaded. This is a relatively simple script, so what is the difference between the two?
Here's the error I'm getting.
2022-12-22 12:33:30 [scrapy.pipelines.files] WARNING: File (code: 302): Error downloading file from <GET http://www.digitalhumanities.org/dhq/vol/12/4/000407.xml> referred in <None>
2022-12-22 12:33:30 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.digitalhumanities.org/dhq/vol/12/4/000408.xml> (referer: None)
2022-12-22 12:33:30 [scrapy.pipelines.files] WARNING: File (code: 302): Error downloading file from <GET http://www.digitalhumanities.org/dhq/vol/12/4/000408.xml> referred in <None>
An example of the output: https://pastebin.com/pDHvYTxF
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class DhqSpider(CrawlSpider):
    name = 'dhqfiles'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    rules = (
        Rule(LinkExtractor(allow='index.html')),
        Rule(LinkExtractor(allow='vol'), callback='parse_article'),
    )

    def parse_article(self, response):
        article = {
            'xmllink': response.urljoin(response.xpath('(//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href)[1]').get()),
        }
        yield {'file_urls': [article['xmllink']]}
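One possible fix for the 302 warnings, as a sketch (not verified against this site): the files pipeline refuses redirected downloads by default, and the short .xml URLs redirect to the longer ones, so allowing media redirects in settings.py may be enough for the pipeline to fetch the files. The FILES_STORE path below is just a placeholder.

# settings.py (sketch)
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloaded_xml'   # placeholder download directory
MEDIA_ALLOW_REDIRECTS = True     # let FilesPipeline follow the 302 instead of logging a warning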
This is my code:
# -*- coding: utf-8 -*-
import scrapy
class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)
And I have set a user-agent in settings.py.
Then I get an error:
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)
So how can I eliminate this error?
Maybe your IP is banned by the website; you may also need to add some cookies to crawl the data you need.
The HTTP status code 404 is received because Scrapy checks /robots.txt by default. In your case this file does not exist, so a 404 is received, but that does not have any impact. In case you want to avoid checking robots.txt you can set ROBOTSTXT_OBEY = False in settings.py.
Then the website is accessed successfully (HTTP status code 200). No content is printed because, based on your XPath selection, nothing is selected. You have to fix your XPath selection.
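For example, the relevant lines in settings.py could look like this (just a sketch; the user-agent string is a placeholder):

# settings.py
ROBOTSTXT_OBEY = False   # do not request /robots.txt before crawling
USER_AGENT = 'Mozilla/5.0 (compatible; my-crawler)'   # placeholder user-agent string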
If you want to test different XPath or CSS selections in order to figure out how to get your desired content, you might want to use the interactive scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"
You can find an example of a scrapy shell session in the official Scrapy documentation here.
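A session could look roughly like this (output shortened; the selector is the one from the question):

$ scrapy shell "http://money.finance.sina.com.cn/mkt/"
...
>>> response.status
200
>>> response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').getall()
[]
>>> view(response)   # opens the downloaded HTML in a browser so you can inspect the real markup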
I am new to Scrapy and I am currently trying to write a CrawlSpider that will crawl a forum on the Tor darknet. Currently my CrawlSpider code is:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['http://answerstedhctbek.onion/questions']
    allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion']

    rules = (
        Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'),
        Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath')
    )

    def makeAbsolutePath(links):
        for i in range(links):
            links[i] = links[i].replace("../","")
        return links
Because the forum uses relative paths, I have tried to create a custom process_links to remove the "../". However, when I run my code I am still receiving:
2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes> (referer: http://answerstedhctbek.onion/questions)
As you can see, I am still getting 400 errors due to the bad path. Why isn't my code removing the "../" from the links?
Thanks!
The problem might be that makeAbsolutePath is not part of the spider class. The documentation states:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used)
You did not use self in makeAbsolutePath, so I assume it is not an indentation error. makeAbsolutePath also has some other errors. If we correct the code to this state:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['file:///home/user/testscrapy/test.html']
    allowed_domains = []

    rules = (
        Rule(LinkExtractor(allow=(r'.*')), follow=True, process_links='makeAbsolutePath'),
    )

    def makeAbsolutePath(self, links):
        print(links)
        for i in range(links):
            links[i] = links[i].replace("../","")
        return links
it will yield this error:
TypeError: 'list' object cannot be interpreted as an integer
This is because no call to len() was used in the call to range(), and range() can only operate on integers: it wants a number and will give you the range from 0 to that number minus 1.
After fixing this issue, it will give the error:
AttributeError: 'Link' object has no attribute 'replace'
This is because, unlike you thought, links is not a list of strings containing the contents of the href="" attributes. Instead, it is a list of Link objects.
I'd recommend you output the contents of links inside makeAbsolutePath and see, if you have to do anything at all. In my opinion, scrapy should already stop resolving .. operators once it reaches the domain level, so your links should point to http://answerstedhctbek.onion/<number>/<title>, even though the site uses .. operator without an actual folder level (as the URL is /questions and not /questions/).
Something like this:
def makeAbsolutePath(self, links):
    for i in range(len(links)):
        print(links[i].url)
    return []
(Returning an empty list here gives you the advantage that the spider will stop and you can check the console output)
If you then find out, the URLs are actually wrong, you can perform some work on them through the url attribute:
links[i].url = 'http://example.com'
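Putting this together, a corrected version of the method could look like this (a sketch; only needed if the printed URLs really do still contain "../"):

def makeAbsolutePath(self, links):
    # links is a list of scrapy Link objects, so work on their .url attribute
    for link in links:
        link.url = link.url.replace("../", "")
    return links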
Hi everyone!
I'm new to the Scrapy framework and I need to parse wisemapping.com.
At first, I read the official Scrapy tutorial and tried to get access to one of the "wisemaps", but got errors:
[scrapy.core.engine] DEBUG: Crawled (404) <GET https://app.wisemapping.com/robots.txt> (referer: None)
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying
<GET https://app.wisemapping.com/c/maps/576786/public> (failed 3 times): 500 Internal Server Error
[scrapy.core.engine] DEBUG: Crawled (500) <GET https://app.wisemapping.com/c/maps/576786/public> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://app.wisemapping.com/c/maps/576786/public>: HTTP status code is not handled or not allowed
Please give me advice on how to solve the problems with the following code:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://app.wisemapping.com/c/maps/576786/public',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'wisemape.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Navigating to https://app.wisemapping.com/c/maps/576786/public gives the error
"Outch!!. This map is not available anymore.
You do not have enough right access to see this map. This map has been changed to private or deleted."
Does this map exist? If so, try making it public.
If you know for a fact that the map you're trying to access exists, verify that the URL you're trying to access is the correct one.
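If you want your spider to receive the 500 response anyway (for example, to save the error page and inspect it), one option (sketched here, not required for the fix above) is to whitelist that status code on the spider:

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # responses with these status codes are passed to the callback instead of being ignored
    handle_httpstatus_list = [500]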
My problem is fairly simple, but I seem to be unable to spot the bug after all these hours I've spent on it. I want to write a simple CrawlSpider that crawls the edition.cnn.com site and saves the HTML files. I've noticed that the structure of the site is something like:
edition.cnn.com/yyyy/mm/dd/any_category/article_name/index.html
This is the code for my spider:
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import *
class BrickSetSpider(CrawlSpider):
    name = 'brick_spider'
    start_urls = ['http://edition.cnn.com/']
    max_num = 30

    rules = (
        Rule(LinkExtractor(allow='/2016\/\d\d\/\d\d\/\w*\/.*\/.*'), callback="save_file", follow=True),
    )

    def save_file(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
When I run this with "scrapy crawl brick_spider" I get only one HTML file, named .html, which I guess should be my starting URL, and nothing else. The spider finishes without any errors. One thing that got my attention, though, is this output on the console:
2016-12-23 17:09:52 [scrapy] DEBUG: Crawled (200) <GET http://edition.cnn.com/robots.txt> (referer: None) ['cached']
2016-12-23 17:09:52 [scrapy] DEBUG: Crawled (200) <GET http://edition.cnn.com/> (referer: None) ['cached']
Perhaps there is something wrong with my rule? I've checked it on regexr.com with a sample link from the CNN site and my regular expression is fine.
Any help would be appreciated, thanks in advance.
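One way to check whether the rule itself matches anything is to run the link extractor by hand in scrapy shell (a sketch; the regex is the one from the spider above):

# inside: scrapy shell "http://edition.cnn.com/"
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow='/2016\/\d\d\/\d\d\/\w*\/.*\/.*')
links = le.extract_links(response)
print(len(links))   # 0 would mean the rule never matches any link on the homepage
print(links[:5])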
This is a Scrapy spider. The spider is supposed to collect the names of all div nodes with class attribute "_5d-5", essentially making a list of all people with x name from y location.
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
class fb_spider(scrapy.Spider):
    name = "fb"
    allowed_domains = ["facebook.com"]
    start_urls = [
        "https://www.facebook.com/search/people/?q=jaslyn%20california"]

    def parse(self, response):
        x = response.xpath('//div[@class="_5d-5"]').extract()
        with open("asdf.txt", 'wb') as f:
            f.write(u"".join(x).encode("UTF-8"))
But Scrapy crawls a web page different from the one specified. I got this on the command prompt:
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
but the URL I specified is:
https://www.facebook.com/search/people/?q=jaslyn%20california
Scraping is not allowed on Facebook: https://www.facebook.com/apps/site_scraping_tos_terms.php
If you want to get data from Facebook, you have to use their Graph API. For example, this would be the API to search for users: https://developers.facebook.com/docs/graph-api/using-graph-api#search
It is not as powerful as the Graph Search on facebook.com though.
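As a rough sketch of what such a call could look like (the access token and API version are placeholders, and user search needs a valid token with the right permissions):

import requests

ACCESS_TOKEN = "<ACCESS_TOKEN>"   # placeholder; create one in the Facebook developer tools
params = {"q": "jaslyn", "type": "user", "access_token": ACCESS_TOKEN}
resp = requests.get("https://graph.facebook.com/v2.7/search", params=params)
print(resp.json())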
Facebook is redirecting the request to the new URL. It seems as though you are missing some headers in your request.
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
As you can see, the referer is None, so I would advise you to add some headers manually, namely the referer.
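A sketch of how you could attach a Referer (and a browser-like User-Agent) to the initial request; the header values are placeholders:

import scrapy

class fb_spider(scrapy.Spider):
    name = "fb"
    start_urls = ["https://www.facebook.com/search/people/?q=jaslyn%20california"]

    def start_requests(self):
        headers = {
            "Referer": "https://www.facebook.com/",                 # placeholder referer
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64)",   # placeholder user-agent
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):
        self.logger.info("Landed on %s", response.url)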