I'm sending a POST request to an API using scrapy.FormRequest and receiving a TextResponse object back. The body of this object looks like so:
{
"Refiners": ...
"Results": ...
}
I am only interested in the Results portion of the response as it contains HTML that I would like to parse.
As such, I am trying to create a new TextResponse object containing only the Results portion in the body, so that I am able to use the response.css method on it.
I tried the following and it yielded an empty response body. Any thoughts on why and how to fix this?
new_response = scrapy.http.TextResponse(response.json()["Results"])
You can use the HtmlResponse class; you need to provide the body and encoding arguments in the constructor.
from scrapy.http import HtmlResponse
new_response = HtmlResponse(url="some_url", body=response.json()["Results"], encoding="utf-8")
You can then use new_response.css(...) to select elements.
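For example, you could then run selectors on new_response as usual. The div.result and a selectors below are hypothetical, since the actual markup inside Results isn't shown:
# Hypothetical selectors: assumes the Results HTML contains <div class="result"> blocks with links.
for result in new_response.css("div.result"):
    title = result.css("a::text").get()
    link = result.css("a::attr(href)").get()
    print(title, link)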
I am reading Web Scraping with Python, 2nd Ed., and wanted to use the Scrapy module to crawl information from a webpage.
I found the following in the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html
callback (callable) – the function that will be called with the
response of this request (once it’s downloaded) as its first
parameter. For more information see Passing additional data to
callback functions below. If a Request doesn’t specify a callback, the
spider’s parse() method will be used. Note that if exceptions are
raised during processing, errback is called instead.
My understanding is that:
we pass in a url and get a resp back, like we do with the requests module:
resp = requests.get(url)
then we pass resp in for data parsing:
parse(resp)
The problem is:
I didn't see where resp is passed in.
Why do we need to put the self keyword before parse in the argument list?
The self keyword is never used in the parse function, so why bother putting it as the first parameter?
Can we extract the url from the response parameter like this: url = response.url, or should it be url = self.url?
import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
It seems like you are missing a few concepts related to Python classes and OOP. It would be a good idea to read the Python docs, or at the very least this question.
Here is how Scrapy works: you instantiate a request object and yield it to the Scrapy scheduler.
yield scrapy.Request(url=url) #or use return like you did
Scrapy will handle the request, download the HTML, and pass everything it got back for that request to a callback function. If you didn't set a callback function in your request (like in my example above), it will call a default method called parse.
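For instance, here is a minimal sketch of setting an explicit callback (parse_article is a made-up method name, used only for illustration):
    # Inside the spider class: yield a request whose callback is a method you name yourself.
    def start_requests(self):
        urls = ['https://en.wikipedia.org/wiki/Monty_Python']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_article)

    def parse_article(self, response):
        # Scrapy calls this once the page is downloaded, passing the response in.
        print(response.url)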
parse is a method (a.k.a. a function) of your object. You wrote it in your code above, and even if you hadn't, it would still be there, since your class inherits all methods from its parent class:
class ArticleSpider(scrapy.Spider):  # <<<<<<<< here
    name = 'article'
So, a TL;DR of your questions:
1 - You didn't see it because it happens in the parent class.
2 - You need to use self. so Python knows you are referencing a method of the spider instance.
3 - The self parameter is the instance itself, and it is used by Python.
4 - response is an independent object that your parse method receives as an argument, so you can access its attributes like response.url or response.headers.
You can find more information about self here: https://docs.python.org/3/tutorial/classes.html
About this question:
can we extract URL from response parameter like this: url = response.url or should be url = self.url
you should use response.url to get the URL of the page you are currently crawling/parsing.
So I am just trying to receive the data from this JSON. I am able to use POST and GET on any link except the link I am currently trying to read; it needs [PUT]. So I wanted to know whether I am calling this URL correctly via urllib, or am I missing something?
Request
{"DataType":"Word","Params":["1234"], "ID":"22"}
Response {
JSON DATA IN HERE
}
I feel like I am doing the PUT method call wrong since it is wrapped around Request{}.
import urllib.request, json
from pprint import pprint

header = {"DataType": "Word", "Params": ["1234"], "ID": "22"}
req = urllib.request.Request(url="website/api/json.service", headers=header,
                             method='PUT')
with urllib.request.urlopen(req) as url:
    data = json.loads(url.read().decode())
    pprint(data)
I am able to print the JSON data as long as it's anything but PUT. As soon as I hit a site that needs PUT with the JSON template shown above, I get an Internal Error 500, so I assumed it was my header.
Thank you in advance!
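For what it's worth, here is a minimal sketch of sending that JSON as the request body rather than as headers. The URL is a placeholder, and whether the API expects exactly this payload is an assumption:
import json
import urllib.request

payload = {"DataType": "Word", "Params": ["1234"], "ID": "22"}
req = urllib.request.Request(
    url="https://example.com/api/json.service",   # placeholder URL
    data=json.dumps(payload).encode("utf-8"),     # JSON goes in the body, not the headers
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode("utf-8"))
print(data)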
I can test my xpaths on an HTML body string by running the two lines under (1) below.
But what if I have a local file myfile.html whose content is exactly the body string. How would I run some standalone code, outside of a spider? I am seeking something that is roughly similar to the lines under (2) below.
I am aware that scrapy shell myfile.html tests xpaths. My intention is to run some Python code on the response, and hence scrapy shell is lacking (or else tedious).
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
# (1)
body = '<html><body><span>Hello</span></body></html>'
print(Selector(text=body).xpath('//span/text()').extract())
# (2)
response = HtmlResponse(url='file:///tmp/myfile.html')
print(Selector(response=response).xpath('//span/text()').extract())
You can have a look at how Scrapy's tests for link extractors are implemented.
They use a get_testdata() helper to read a file as bytes,
and then instantiate a Response with a fake URL and the body set to the bytes read with get_testdata():
...
body = get_testdata('link_extractor', 'sgml_linkextractor.html')
self.response = HtmlResponse(url='http://example.com/index', body=body)
...
Depending on the encoding used for the local HTML files, you may need to pass a non-UTF-8 encoding argument to Response.
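Putting that together for a local file, a standalone sketch (the path, URL, and encoding are just examples) could look like this:
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Read the local file as bytes; adjust the path and encoding to your file.
with open('/tmp/myfile.html', 'rb') as f:
    body = f.read()

response = HtmlResponse(url='http://example.com/index', body=body, encoding='utf-8')
print(Selector(response=response).xpath('//span/text()').extract())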
I was wondering how I can make a GET request to a specific URL with two query parameters. These query parameters contain two id numbers.
So far I have:
import json, requests
url = 'http://'
requests.post(url)
But they gave me the query parameters first_id=### and last_id=###, and I don't know how to include them.
To make a GET request you need the get() method; for the parameters, use the params argument:
response = requests.get(url, params={'first_id': 1, 'last_id': 2})
If the response is of a JSON content type, you can use the json() shortcut method to get it loaded into a Python object for you:
data = response.json()
print(data)
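As a quick sanity check (with a placeholder URL), you can print response.url to see the query string that requests built for you:
import requests

url = 'http://example.com/items'  # placeholder URL
response = requests.get(url, params={'first_id': 1, 'last_id': 2})
print(response.url)     # -> http://example.com/items?first_id=1&last_id=2
data = response.json()  # only works if the endpoint actually returns JSON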
In my Scrapy spider I just want the HTML response from a custom URL inside a variable.
Suppose I have the url
url = "http://www.example.com"
Now I want to get the HTML of that page for parsing:
pageHtml = scrapy.get(url)
I want something like this
page = urllib2.urlopen('http://yahoo.com').read()
The only problem with using the above line in my crawler is that my session is already authenticated by Scrapy, so I can't use any other function to get the HTML of that page.
I don't want the response in a callback, just straight inside a variable.
Basically, you just need to add the relevant imports for the code in that question to work. You'll also need to add a link variable which is used but not defined in that example code.
import httplib
from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
bs = BaseSpider('some')
# etc
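If the goal is just to get the HTML into a variable and run Scrapy selectors on it, a rough sketch along those lines might look like the following. This uses Python 3 imports rather than the httplib/BaseSpider ones above, uses a placeholder URL, and is not the elided code from that answer:
import urllib.request
from scrapy.http import TextResponse

link = 'http://example.com'   # placeholder; set this to the page you want
html = urllib.request.urlopen(link).read()
page = TextResponse(url=link, body=html, encoding='utf-8')
print(page.css('title::text').extract_first())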