I can test my XPaths on an HTML body string by running the two lines under (1) below.
But what if I have a local file myfile.html whose content is exactly that body string? How would I run some standalone code, outside of a spider? I am looking for something roughly similar to the lines under (2) below.
I am aware that scrapy shell myfile.html tests XPaths. My intention is to run some Python code on the response, so scrapy shell falls short (or is at least tedious).
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
# (1)
body = '<html><body><span>Hello</span></body></html>'
print(Selector(text=body).xpath('//span/text()').extract())
# (2)
response = HtmlResponse(url='file:///tmp/myfile.html')
print(Selector(response=response).xpath('//span/text()').extract())
You can have a look at how Scrapy's tests for link extractors are implemented.
They use a get_testdata() helper to read a file as bytes,
and then instantiate an HtmlResponse with a fake URL and the body set to the bytes read by get_testdata():
...
body = get_testdata('link_extractor', 'sgml_linkextractor.html')
self.response = HtmlResponse(url='http://example.com/index', body=body)
...
Depending on the encoding of your local HTML files, you may need to pass a non-UTF-8 encoding argument to HtmlResponse.
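Applied to the original question, a minimal standalone sketch could look like this (assuming /tmp/myfile.html is UTF-8 encoded; the URL is only a placeholder, nothing is actually fetched):
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Read the local file as bytes and wrap it in an HtmlResponse with a fake URL.
with open('/tmp/myfile.html', 'rb') as f:
    body = f.read()

response = HtmlResponse(url='http://example.com/index', body=body, encoding='utf-8')
print(Selector(response=response).xpath('//span/text()').extract())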
I'm sending a POST request to an API using scrapy.FormRequest and receiving a TextResponse object back. The body of this object looks like so:
{
"Refiners": ...
"Results": ...
}
I am only interested in the Results portion of the response as it contains HTML that I would like to parse.
As such, I am trying to create a new TextResponse object containing only the Results portion in the body, so that I can use the response.css method on it.
I tried the following and it yielded an empty response body. Any thoughts on why and how to fix this?
new_response = scrapy.http.TextResponse(response.json()["Results"])
You can use the HtmlResponse class; you need to provide the body and encoding arguments in the constructor. (Your attempt produced an empty body because the first positional argument of TextResponse is the URL, not the body.)
from scrapy.http import HtmlResponse
new_response = HtmlResponse(url="some_url", body=response.json()["Results"], encoding="utf-8")
You can then use new_response.css(...) to select elements.
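For example (the selector below is a hypothetical placeholder; adjust it to whatever markup the Results HTML actually contains):
# Hypothetical selector; the real classes depend on the HTML inside "Results".
titles = new_response.css("div.result-title::text").extract()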
I'm new to Python.
I'm trying to parse data from a website using BeautifulSoup; I have successfully used BeautifulSoup before. However, for this particular website the data returned has spaces between every character and lots of ">" characters as well.
The weird thing is that if I copy the page source to my local Apache instance and request the local copy, the output is perfect. I should mention the differences between my local copy and the website:
my local does not use https
my local does not require authentication, whereas the website requires Active Directory auth and I am using requests_ntlm
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup
r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)
It looks like the local server returns content encoded as UTF-8 while the main website uses UTF-16. This suggests the main website is not configured correctly. However, it's possible to work around the issue in code.
The requests library defaults the response encoding to UTF-8; (I believe) this is based on the response headers. The response also has an apparent_encoding property, which reads the body and detects the correct encoding using chardet. However, apparent_encoding is not used unless you explicitly assign it.
Therefore, by setting r.encoding = r.apparent_encoding, the text should decode correctly in both environments.
Code should look something like:
r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
r.raise_for_status()              # Always check for server errors before consuming the data.
r.encoding = r.apparent_encoding  # Override the default encoding with the detected one.
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())            # Should match print(content) (minus indentation)
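To confirm the mismatch described above, a quick check on the same response object is to compare the two values:
print(r.encoding)           # what requests inferred from the headers
print(r.apparent_encoding)  # what chardet detects from the body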
I want to crawl some information from a webpage using Python and Scrapy, but when I try to do it the output of my item is empty...
First of all I've started a new project with Scrapy. Then I've written the following in the items.py file:
import scrapy

class KakerlakeItem(scrapy.Item):
    info = scrapy.Field()
    pass
Next, I've created a new file in the spider's folder with the following code:
import scrapy
from kakerlake.items import KakerlakeItem

class Kakerlakespider(scrapy.Spider):
    name = 'Coco'
    allowed_domains = ['http://www.goeuro.es/']
    start_urls = ['http://www.goeuro.es/search/NTYzY2U2Njk4YzA1ZDoyNzE2OTU4ODM=']

    def parse(self, response):
        item = KakerlakeItem()
        item['info'] = response.xpath('//span[@class="inline-b height-100"]/text()').extract()
        # yield item
        return item
I expect that by running scrapy crawl Coco -o data.json in the console I will get what I want, but instead I obtain a JSON file containing {'info': []}, that is, an empty item.
I've tried a lot of things and I don't know why it doesn't work correctly...
Your XPath is invalid for the page as downloaded, since there isn't a single element with class "inline-b" or "height-100". This page is heavily modified via JavaScript, so what you see in a browser will not be representative of what Scrapy receives.
XPath results:
>>> response.xpath('//span[contains(@class, "inline-b")]')
[]
>>> response.xpath('//span[contains(@class, "height-100")]')
[]
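One quick way to see what Scrapy actually receives is to open the page in scrapy shell and inspect the raw response:
$ scrapy shell 'http://www.goeuro.es/search/NTYzY2U2Njk4YzA1ZDoyNzE2OTU4ODM='
>>> 'inline-b' in response.text   # most likely False for a JavaScript-rendered page
>>> view(response)                # opens the downloaded HTML in your browser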
Also, you can remove the pass in KakerlakeItem(scrapy.Item); it is unnecessary once info = scrapy.Field() is defined.
I have defined two spiders which do the following:
Spider A:
Visits the home page.
Extracts all the links from the page and stores them in a text file.
This is necessary since the home page has a More Results button which produces further links to different products.
Spider B:
Opens the text file.
Crawls the individual pages and saves the information.
I am trying to combine the two and make a crawl-spider.
The URL structure of the home page is similar to:
http://www.example.com
The URL structure of the individual pages is similar to:
http://www.example.com/Home/Detail?id=some-random-number
The text file contains the list of such URLs which are to be scraped by the second spider.
My question:
How do I combine the two spiders so as to make a single spider which does the complete scraping?
From the Scrapy documentation:
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So what you actually need to do is, in the parse method (where you extract the links), yield a new request for each link, like:
yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")
self.make_requests_from_url() is already implemented in Spider.
An example of this:
from scrapy import Spider

class MySpider(Spider):
    name = "my_spider"

    def parse(self, response):
        try:
            # Follow the extracted user link and also emit an item.
            user_name = response.xpath('//*[@id="ft"]/a/@href').extract()[0]
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name=user_name)  # MyItem is assumed to be defined in your items module
        except Exception:
            pass  # silently skip pages where the XPath does not match
You can handle the other requests using a different parsing function. Do this by returning a Request object and specifying the callback explicitly (self.make_requests_from_url uses the parse callback by default):
Request(url=url, callback=self.parse_user_page)
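Putting it together for the original question, a minimal sketch of a single combined spider might look like this (the link XPath and the yielded fields are assumptions; adjust them to the real markup):
import scrapy

class CombinedSpider(scrapy.Spider):
    name = "combined"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # Step 1 (was Spider A): collect the detail-page links from the home page.
        for href in response.xpath('//a[contains(@href, "/Home/Detail")]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # Step 2 (was Spider B): scrape each individual product page.
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").extract_first(),
        }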
In my Scrapy spider I just want the HTML response from a custom URL inside a variable.
Suppose I have the URL
url = "http://www.example.com"
Now I want to get the HTML of that page for parsing:
pageHtml = scrapy.get(url)
I want something like this
page = urllib2.urlopen('http://yahoo.com').read()
The only problem is that I can't use the above line in my crawler, because my session is already authenticated by Scrapy, so I can't use any other library to fetch the HTML of that page.
I don't want the response in any callback, but simply straight inside a variable.
Basically, you just need to add the relevant imports for the code in that question to work. You'll also need to define a link variable, which is used but not defined in that example code.
import httplib
from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
bs = BaseSpider('some')
# etc
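A rough sketch of the overall idea (here using Python 3's urllib instead of the Python 2 httplib imported above; link is a placeholder you must supply): fetch the page synchronously, then wrap the bytes in a response object so the usual selectors work on it. Note that, like the httplib approach, this makes a fresh request outside Scrapy's authenticated session.
from urllib.request import urlopen
from scrapy.http import HtmlResponse

link = 'http://www.example.com/some/page'  # placeholder: the custom URL you want to fetch

# Fetch the page synchronously, outside Scrapy's request/callback cycle...
raw = urlopen(link).read()

# ...then wrap the raw bytes in an HtmlResponse so XPath/CSS selectors work on it.
page = HtmlResponse(url=link, body=raw, encoding='utf-8')
print(page.xpath('//title/text()').extract())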