Getting Scrapy to follow specific subdomains and save the .html - python

My problem is fairly simple, but after all the hours I've spent on it I still can't see the bug. I want to write a simple CrawlSpider that crawls edition.cnn.com and saves the HTML files. I've noticed that the site's article URLs follow a structure like:
edition.cnn.com/yyyy/mm/dd/any_category/article_name/index.html
This is the code for my spider:
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import *


class BrickSetSpider(CrawlSpider):
    name = 'brick_spider'
    start_urls = ['http://edition.cnn.com/']
    max_num = 30

    rules = (
        Rule(LinkExtractor(allow='/2016\/\d\d\/\d\d\/\w*\/.*\/.*'), callback="save_file", follow=True),
    )

    def save_file(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
When I run this with "scrapy crawl brick_spider" I get only a single HTML file named ".html", which I guess comes from my starting URL, and nothing else. The spider finishes without any errors. One thing that got my attention, though, is this output on the console:
2016-12-23 17:09:52 [scrapy] DEBUG: Crawled (200) <GET http://edition.cnn.com/robots.txt> (referer: None) ['cached']
2016-12-23 17:09:52 [scrapy] DEBUG: Crawled (200) <GET http://edition.cnn.com/> (referer: None) ['cached']
Perhaps there is something wrong with my rule? I've checked it on regexr.com with a sample link from the CNN site and my regular expression seems fine.
Any help would be appreciated, thanks in advance.
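For anyone comparing against a current Scrapy install: the scrapy.contrib import paths above are from older releases and now live elsewhere. Below is a minimal sketch of the same spider using the modern import paths and a raw-string pattern that matches any year (the question's version pinned 2016); the filename logic is kept from the question and the pattern has not been verified against CNN's current markup:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BrickSetSpider(CrawlSpider):
    name = 'brick_spider'
    allowed_domains = ['edition.cnn.com']
    start_urls = ['http://edition.cnn.com/']

    rules = (
        # Raw string, so the date pattern needs no escaped slashes.
        Rule(LinkExtractor(allow=r'/\d{4}/\d{2}/\d{2}/\w+/.+'),
             callback='save_file', follow=True),
    )

    def save_file(self, response):
        # URLs like .../article_name/index.html are saved as "article_name.html"
        filename = response.url.split('/')[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)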

Related

XML vs HTML? Scrapy downloads html files but not xml files?

EDIT: This is only scraping the XML links from the crawled pages, not actually following the links to the XML pages themselves, which have a slightly different URL. I need to adjust the rules so I can crawl those pages as well. I'll post a solution once I've got it working.
http://www.digitalhumanities.org/dhq/vol/12/4/000401/000401.xml
vs.
http://www.digitalhumanities.org/dhq/vol/12/4/000401.xml
This script scrapes all the links and should save all the files, but only the HTML files are actually saved. The XML files are added to the CSV but not downloaded. This is a relatively simple script, so what is the difference between the two file types?
Here's the error I'm getting.
2022-12-22 12:33:30 [scrapy.pipelines.files] WARNING: File (code: 302): Error downloading file from <GET http://www.digitalhumanities.org/dhq/vol/12/4/000407.xml> referred in <None>
2022-12-22 12:33:30 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.digitalhumanities.org/dhq/vol/12/4/000408.xml> (referer: None)
2022-12-22 12:33:30 [scrapy.pipelines.files] WARNING: File (code: 302): Error downloading file from <GET http://www.digitalhumanities.org/dhq/vol/12/4/000408.xml> referred in <None>
And example of the output: https://pastebin.com/pDHvYTxF
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DhqSpider(CrawlSpider):
    name = 'dhqfiles'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    rules = (
        Rule(LinkExtractor(allow='index.html')),
        Rule(LinkExtractor(allow='vol'), callback='parse_article'),
    )

    def parse_article(self, response):
        article = {
            'xmllink': response.urljoin(response.xpath('(//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href)[1]').get()),
        }
        yield {'file_urls': [article['xmllink']]}
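One thing worth checking, given the 302 warnings above: Scrapy's files pipeline treats redirected downloads as failures by default. If the extracted XML URL is simply being redirected to the canonical one, allowing redirects for media downloads may be enough. A settings sketch, assuming the standard FilesPipeline is what consumes file_urls (the FILES_STORE path is a placeholder):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'downloaded_xml'  # placeholder directory

# By default the files pipeline reports a 302 as "Error downloading file";
# this lets it follow the redirect instead.
MEDIA_ALLOW_REDIRECTS = True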

Scrapy crawls OK, but link extraction does not work as intended (extracts only the 1st character of the whole URL)

I'm trying to learn Scrapy with Python. I've used this website, which might be a bit out of date, but I've managed to get the links and URLs as intended. ALMOST.
import scrapy, time
import random
#from random import randint
#from time import sleep

USER_AGENT = "Mozilla/5.0"


class OscarsSpider(scrapy.Spider):
    name = "oscars5"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

    def parse(self, response):
        for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"):  #).extract(): Once you extract it, it becomes a string so the library can no longer process it - so dont extarct it/ - https://stackoverflow.com/questions/57417774/attributeerror-str-object-has-no-attribute-xpath
            url = response.urljoin(href)
            print(url)
            time.sleep(random.random())  #time.sleep(0.1) #### https://stackoverflow.com/questions/4054254/how-to-add-random-delays-between-the-queries-sent-to-google-to-avoid-getting-blo #### https://stackoverflow.com/questions/30030659/in-python-what-is-the-difference-between-random-uniform-and-random-random
            req = scrapy.Request(url, callback=self.parse_titles)
            time.sleep(random.random())  #sleep(randint(10,100))
            ##req.meta['proxy'] = "http://yourproxy.com:178" #https://checkerproxy.net/archive/2021-03-10 (from ; https://stackoverflow.com/questions/30330034/scrapy-error-error-downloading-could-not-open-connect-tunnel)
            yield req

    def parse_titles(self, response):
        for sel in response.css('html').extract():
            data = {}
            data['title'] = response.css(r"h1[id='firstHeading'] i::text").extract()
            data['director'] = response.css(r"tr:contains('Directed by') a[href*='/wiki/']::text").extract()
            data['starring'] = response.css(r"tr:contains('Starring') a[href*='/wiki/']::text").extract()
            data['releasedate'] = response.css(r"tr:contains('Release date') li::text").extract()
            data['runtime'] = response.css(r"tr:contains('Running time') td::text").extract()
            yield data
The problem I have is that the scraper retrieves only the 1st character of the href links, which I can't wrap my head around. I can't understand why, or how to fix it.
Snippet of output when I run the spider in CMD:
2021-03-12 20:09:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners> (referer: None)
https://en.wikipedia.org/wiki/t
https://en.wikipedia.org/wiki/r
https://en.wikipedia.org/wiki/[
2021-03-12 20:09:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/T> from <GET https://en.wikipedia.org/wiki/t>
https://en.wikipedia.org/wiki/s
https://en.wikipedia.org/wiki/t
2021-03-12 20:10:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://en.wikipedia.org/wiki/t> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-03-12 20:10:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/T> (referer: https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners)
https://en.wikipedia.org/wiki/y
It does crawl, and I'm sure it finds those links (or at least the starting points of the links we need) and appends them, but it doesn't get the entire title or link. Hence I'm only scraping incorrect, non-existent pages! The output files are perfectly formatted but contain no data apart from empty strings.
For example, https://en.wikipedia.org/wiki/t in the spider output above should be https://en.wikipedia.org/wiki/The_Artist_(film), and https://en.wikipedia.org/wiki/r should be https://en.wikipedia.org/wiki/rain_man(film), etc.
In the scrapy shell,
response.css("h1[id='firstHeading'] i::text").extract()
returns []
confirming my fears. It's the selector.
How can I fix it? It's not working as it should, or as it was claimed to. If anyone could help I would be very grateful.
for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"):
This is just doing for x in "abcde", which iterates over each letter in the string, which is why you get t, r, [, s, ...
Is this really what you intended? The parentheses sort of suggest that you intended this to be a function call. As a plain string, it makes no sense.
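In other words, the selector string needs to be passed to response.css() and evaluated before looping. A sketch of how that loop might look (same selector as in the question; whether it actually matches anything depends on Wikipedia's current table markup):

def parse(self, response):
    # Evaluate the CSS selector against the response, then iterate
    # over the extracted href strings rather than over a literal string.
    hrefs = response.css(
        r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"
    ).extract()
    for href in hrefs:
        url = response.urljoin(href)
        yield scrapy.Request(url, callback=self.parse_titles)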

Scrapy getting all pages hrefs from an array of startUrls

The problem I have is the following: I am trying to scrape a website that has multiple categories of products, and each category has several pages with 24 products each. I am able to get all the starting URLs, and by scraping every page I am able to get the URLs (endpoints, which I then turn into full URLs) of all pages.
I should say that not every category has product pages, and not every starting URL is a category, so it might not have the structure I am looking for. But most of them do.
My intent is: from all pages of all categories I want to extract the href of every product displayed on the page. The code I have been using is the following:
import scrapy


class MySpider(scrapy.spiders.CrawlSpider):
    name = 'myProj'

    with open('resultt.txt', 'r') as f:
        endurls = f.read()
    f.close()
    endurls = endurls.split(sep=' ')
    endurls = ['https://www.someurl.com' + url for url in endurls]
    start_urls = endurls

    def parse(self, response):
        with open('allpages.txt', 'a') as f:
            pages_in_category = response.xpath('//option/@value').getall()
            length = len(pages_in_category)
            pages_in_category = ['https://www.someurl.com' + page for page in pages_in_category]
            if length == 0:
                f.write(str(response.url))
            else:
                for page in pages_in_category:
                    f.write(page)
        f.close()
Through the scrapy shell I am able to make it work, though not iteratively. The command I then run in the terminal is
scrapy runspider ScrapyCarr.py -s USER_AGENT='my-cool-project (http://example.com)'
since I have not initialized a proper Scrapy project structure (I don't need one; it is a simple project for uni and I do not care much about the structure). Unfortunately the file in which I am trying to append my product URLs remains empty, even though when I run the same steps through the scrapy shell I see it working.
The output I am currently getting is the following
2020-10-15 12:51:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/fish/typefish/N-4minn0/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-i50owa/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-1l0cnr6/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-18isujc/c> (referer: None)
The problem was that I was subclassing MySpider from scrapy.spiders.CrawlSpider. CrawlSpider uses the parse method internally for its own rule handling, so overriding parse breaks it; the code works when subclassing scrapy.Spider instead.
SOLVED
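A minimal sketch of that fix, keeping everything else from the question as-is (file names such as resultt.txt and allpages.txt are the question's own; the added newlines are only so the output file is readable):

import scrapy


class MySpider(scrapy.Spider):  # plain Spider, so overriding parse() is safe
    name = 'myProj'

    with open('resultt.txt', 'r') as f:
        endurls = f.read()
    start_urls = ['https://www.someurl.com' + url for url in endurls.split(sep=' ')]

    def parse(self, response):
        pages_in_category = response.xpath('//option/@value').getall()
        with open('allpages.txt', 'a') as f:
            if not pages_in_category:
                f.write(str(response.url) + '\n')
            else:
                for page in pages_in_category:
                    f.write('https://www.someurl.com' + page + '\n')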

DEBUG: Crawled (404)

This is my code:
# -*- coding: utf-8 -*-
import scrapy


class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)
And I have set a user agent in settings.py.
Then I get an error:
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)
So how can I eliminate this error?
Maybe your IP has been banned by the website; you may also need to add some cookies to crawl the data you need.
The HTTP status code 404 is received because Scrapy checks /robots.txt by default. In your case this file does not exist, so a 404 is returned, but that has no impact. If you want to avoid checking robots.txt, you can set ROBOTSTXT_OBEY = False in settings.py.
The website itself is then accessed successfully (HTTP status code 200). No content is printed because your XPath selection matches nothing; you have to fix your XPath selection.
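For reference, that settings.py change would look like this (only do it if you have explicitly decided not to honour the site's robots.txt):

# settings.py
ROBOTSTXT_OBEY = False  # skip the /robots.txt request entirely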
If you want to test different xpath- or css-selections in order to figure how to get your desired content, you might want to use the interactive scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"
You can find an example of a scrapy shell session in the official Scrapy documentation here.
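Inside the shell, a quick way to check whether the selector from the question matches anything (an empty result points at the selector itself, or at content that is only rendered by JavaScript):

# inside `scrapy shell "http://money.finance.sina.com.cn/mkt/"`
response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()  # the selector from the question
response.xpath('//title/text()').get()  # sanity check that the page body arrived at all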

Scrapy https tutorial

Hello everyone!
I'm new to the Scrapy framework, and I need to parse wisemapping.com.
First I read the official Scrapy tutorial and tried to get access to one of the "wisemaps", but got these errors:
[scrapy.core.engine] DEBUG: Crawled (404) <GET https://app.wisemapping.com/robots.txt> (referer: None)
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying
<GET https://app.wisemapping.com/c/maps/576786/public> (failed 3 times): 500 Internal Server Error
[scrapy.core.engine] DEBUG: Crawled (500) <GET https://app.wisemapping.com/c/maps/576786/public> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://app.wisemapping.com/c/maps/576786/public>: HTTP status code is not handled or not allowed
Please give me advice on solving the problems with the following code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://app.wisemapping.com/c/maps/576786/public',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'wisemape.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Navigating to https://app.wisemapping.com/c/maps/576786/public gives the error
"Outch!!. This map is not available anymore.
You do not have enough right access to see this map. This map has been changed to private or deleted."
Does this map still exist? If so, try making it public.
If you know for a fact that the map you're trying to access exists, verify that the URL you're using is the correct one.
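Separately, the "Ignoring response <500 ...>" line in the log means HttpErrorMiddleware drops non-2xx responses before they ever reach parse(). If you want to save the error page body anyway to inspect what the server returns, handle_httpstatus_list (a standard Scrapy spider attribute) lets those responses through; a sketch:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Let 500 responses reach parse() instead of being ignored
    # by HttpErrorMiddleware, so the body can be inspected.
    handle_httpstatus_list = [500]

    def start_requests(self):
        yield scrapy.Request(
            'https://app.wisemapping.com/c/maps/576786/public',
            callback=self.parse,
        )

    def parse(self, response):
        self.log('Got status %s' % response.status)
        with open('wisemap_response.html', 'wb') as f:
            f.write(response.body)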
