I would like to benefit from the awesome features of the RoboBrowser on a HTML string that contains some forms.
Usually Robobrowser is used like that:
url = "whatever.com"
browser = RoboBrowser(history=True)
browser.open(url)
thatForm = browser.get_form("thatForm")
thatForm["thisField"].value = "some value"
browser.submit_form(thatForm)
I would like to use the html content of string to do the same, I was expecting something like below to work:
content = "<html>...</html>"
browser = RoboBrowser(history=True)
browser.open(content)
However, this does not work, because the open method expects the string to be a URL, not HTML content. Is there anything that can be done, any workaround, so that I can pass an HTML content string somewhere and have RoboBrowser parse it?
Alright, I found a solution. It is not super elegant, but it works. It all revolves around the _update_state function, which is called internally by RoboBrowser when opening a URL:
def open(self, url, method='get', **kwargs):
    """Open a URL.
    :param str url: URL to open
    :param str method: Optional method; defaults to `'get'`
    :param kwargs: Keyword arguments to `Session::request`
    """
    response = self.session.request(method, url, **self._build_send_args(**kwargs))
    self._update_state(response)
The solution is therefore to simply create a fake response carrying the html we want to be parsed:
import requests

fake_response = requests.Response()
# _content is what response.content returns; RoboBrowser hands it to BeautifulSoup for parsing
fake_response._content = the_html_we_want_Robobrowser_to_parse
browser = RoboBrowser()
browser._update_state(fake_response)
my_form = browser.get_form("myform")
browser.submit_form(my_form)
And voila :)
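For convenience, the workaround can be wrapped in a small helper. This is only a sketch: browser_from_html is a made-up name, and it assumes the HTML string actually contains the form you want to work with.

import requests
from robobrowser import RoboBrowser

def browser_from_html(html):
    """Return a RoboBrowser whose state is parsed from a raw HTML string."""
    fake_response = requests.Response()
    fake_response._content = html  # the string (or bytes) RoboBrowser will parse
    browser = RoboBrowser(history=True)
    browser._update_state(fake_response)
    return browser

content = "<html>...</html>"
browser = browser_from_html(content)
my_form = browser.get_form("myform")  # assumes a form with this id exists in the HTML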
I am reading Web Scraping with Python 2nd Ed, and wanted to use Scrapy module to crawl information from webpage.
I got the following information from the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html
callback (callable) – the function that will be called with the
response of this request (once it’s downloaded) as its first
parameter. For more information see Passing additional data to
callback functions below. If a Request doesn’t specify a callback, the
spider’s parse() method will be used. Note that if exceptions are
raised during processing, errback is called instead.
My understanding is that it works roughly like this:
pass in the url and get resp, like we do in the requests module
resp = requests.get(url)
then pass resp in for data parsing
parse(resp)
The problems are:
1. I didn't see where resp is passed in.
2. Why do we need to put the self keyword before parse in the argument list?
3. The self keyword is never used inside the parse function, so why bother putting it as the first parameter?
4. Can we extract the url from the response parameter like this: url = response.url, or should it be url = self.url?
class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
Seems like you are missing a few concepts related to Python classes and OOP. It would be a good idea to read through the Python docs, or at the very least this question.
Here is how Scrapy works: you instantiate a request object and yield it to the Scrapy Scheduler.
yield scrapy.Request(url=url)  # or use return like you did
Scrapy will handle the request, download the html, and return everything it got back for that request to a callback function. If you didn't set a callback function in your request (like in my example above), it will call a default method called parse.
parse is a method (a.k.a. a function) of your object. You wrote it in your code above, and even if you hadn't, it would still be there, since your class inherits all methods from its parent class:
class ArticleSpider(scrapy.Spider):  # <<<<<<<< here
    name = 'article'
So a TL;DR of your questions:
1 - You didn't see it because it happens in the parent class.
2 - You need to use self. so Python knows you are referencing a method of the spider instance.
3 - The self parameter is the instance itself, and it is passed in by Python.
4 - response is an independent object that your parse method receives as an argument, so you can access its attributes like response.url or response.headers.
You can find more information about self here: https://docs.python.org/3/tutorial/classes.html
About this question:
can we extract URL from response parameter like this: url = response.url or should be url = self.url
You should use response.url to get the URL of the page you are currently crawling/parsing.
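To make the flow concrete, here is a minimal sketch (my own, not the book's code); the spider name and parse_article are made up, just to show that the default parse method is nothing special:

import scrapy

class MiniSpider(scrapy.Spider):
    name = 'mini'

    def start_requests(self):
        # Scrapy's scheduler takes this request; you never call the callback yourself
        yield scrapy.Request(
            url='https://en.wikipedia.org/wiki/Monty_Python',
            callback=self.parse_article,  # omit this and Scrapy falls back to self.parse
        )

    def parse_article(self, response):
        # Scrapy invokes this as self.parse_article(response):
        # 'self' is the spider instance, 'response' is the downloaded page
        self.logger.info('Crawled %s', response.url)
        yield {'url': response.url, 'title': response.css('h1::text').get()}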
Write a function named "variable_get" that takes a string as a parameter representing part of a path of a url and returns the response of an HTTPS GET request to the url "http://python.org/(input)" as a string where (input) is the input parameter of this function.
import urllib2

def variable_get(input1):
    content = urllib2.urlopen('http://python.org/input1').read()
    return content
My question is: how do I correctly insert the input parameter into the URL of an HTTP GET request?
In Python 3.6 and later, you can use f-strings:
from urllib import request

def variable_get(input1):
    url = f'http://python.org/{input1}'
    print(url)
    content = request.urlopen(url).read()
    return content

variable_get('somePath')
should print 'http://python.org/somePath'
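If the path fragment can contain spaces or other special characters, it may also be worth percent-encoding it. A small sketch (my addition, not part of the original answer) using urllib.parse.quote:

from urllib import parse, request

def variable_get(input1):
    # quote() percent-encodes characters that are not safe in a URL path
    url = f'http://python.org/{parse.quote(input1)}'
    return request.urlopen(url).read().decode('utf-8')

variable_get('some path')  # requests http://python.org/some%20path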
After some discussion of my problem in Unable to print links using beautifulsoup while automating through selenium,
I realized that the main problem is in the URL, which the request is not able to extract. The URL of the page is actually https://society6.com/discover, but I am using selenium to log into my account, so the URL becomes https://society6.com/society?show=2
However, I can't use the second URL with requests since it's showing an error. How do I scrape information from a URL like this?
You need to log in first!
To do that, you can use a requests.Session together with the bs4.BeautifulSoup library (BeautifulSoup is used to pull the CSRF token out of the login page).
Here is an implementation that I have used:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://society6.com/"

def log_in_and_get_session():
    """
    Get the session object with login details
    :return: requests.Session
    """
    ss = requests.Session()
    ss.verify = False  # optional, for sites with certificate problems
    text = ss.get(f"{BASE_URL}login").text
    # the first <input> on the login page holds the CSRF token
    csrf_token = BeautifulSoup(text, "html.parser").input["value"]
    data = {"username": "your_username", "password": "your_password", "csrfmiddlewaretoken": csrf_token}
    results = ss.post("{}login".format(BASE_URL), data=data)
    if results.ok:
        print("Login success", results.status_code)
        return ss
    else:
        print("Can't login", results.status_code)
Using the `post` method to log in...
Hope this helps you!
Edit
Added the beginning of the function.
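As a usage sketch (my addition, assuming the login above succeeds), the returned session can then be used to fetch the page that was failing without authentication:

ss = log_in_and_get_session()
if ss is not None:
    # the session carries the login cookies, so authenticated pages are now reachable
    resp = ss.get("https://society6.com/society?show=2")
    soup = BeautifulSoup(resp.text, "html.parser")
    print(resp.status_code, soup.title)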
I am trying to build a web crawler that crawls all the links on the page and adds them to a file.
My Python code contains a method that does the following:
Opens a given web page (the urllib2 module is used)
Checks if the HTTP header Content-Type contains text/html
Converts the raw HTML response into readable text and stores it in the html_string variable.
It then creates an instance of the Link_Finder class, which takes the attributes base url (Spider_url) and page url (page_url). Link_Finder is defined in another module, link_finder.py.
html_string is then fed to the class using the feed function.
The Link_Finder class is explained in detail below.
def gather_links(page_url):  # page_url is relative url
    html_string = ''
    try:
        req = urllib2.urlopen(page_url)
        head = urllib2.Request(page_url)
        if 'text/html' in head.get_header('Content-Type'):
            html_bytes = req.read()
            html_string = html_bytes.decode("utf-8")
        finder = LinkFinder(Spider.base_url, page_url)
        finder.feed(html_string)
    except Exception as e:
        print "Exception " + str(e)
        return set()
    return finder.page_links()
The link_finder.py module uses the standard Python HTMLParser and urlparse modules. The Link_Finder class inherits from HTMLParser and overrides the handle_starttag function to get all the a tags with an href attribute and add the URLs to a set (self.links).
from HTMLParser import HTMLParser
import urlparse

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):  # page_url is relative url
        super(LinkFinder, self).__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):  # Override default handler methods
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    url = urlparse.urljoin(self.base_url, value)  # Get exact url
                    self.links.add(url)

    def error(self, message):
        pass

    def page_links(self):  # return set of links
        return self.links
I am getting an exception:
argument of type 'NoneType' is not iterable
I think the problem is in the way I used the urllib2 Request to check the header content.
I am a bit new to this, so some explanation would be good.
I'd have used BeautifulSoup instead of HTMLParser, like so:
soup = BeautifulSoup(pageContent)
links = soup.find_all('a')
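As a rough sketch of how that suggestion could slot into gather_links (my code, not the answerer's): the NoneType error most likely comes from calling get_header on a freshly built Request object, which only knows the headers you set yourself, so Content-Type has to be read from the response instead. Using requests and BeautifulSoup in Python 3, with a signature loosely matching the question:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def gather_links(base_url, page_url):
    try:
        resp = requests.get(page_url)
        # Content-Type is a *response* header, so read it from resp.headers
        if 'text/html' not in resp.headers.get('Content-Type', ''):
            return set()
        soup = BeautifulSoup(resp.text, 'html.parser')
        # collect every href, resolved against the base url
        return {urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)}
    except Exception as e:
        print("Exception " + str(e))
        return set()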
I am a bit confused about using Request, urlopen and JSONDecoder().decode().
Currently I have:
hdr = {'User-agent' : 'anything'} # header, User-agent header describes my web browser
I am assuming that the server uses this to determine which browsers are acceptable? Not sure
my url is:
url = 'http://www.reddit.com/r/aww.json'
I set a req variable
req = Request(url,hdr) #request to access the url with header
json = urlopen(req).read() # read json page
I tried using urlopen in the terminal and I get this error:
TypeError: must be string or buffer, not dict # Does this have to do with my header?
data = JSONDecoder().decode(json) # translate json data so I can parse through it with regular python functions?
I'm not really sure why I get the TypeError
If you look at the documentation of Request, you can see that the constructor signature is actually Request(url, data=None, headers={}, …). So the second parameter, the one after the URL, is the data you are sending with the request. But if you want to set the headers instead, you will have to specify the headers parameter.
You can do this in two different ways. Either you pass None as the data parameter:
Request(url, None, hdr)
But, well, this requires you to pass the data parameter explicitly, and you have to make sure you pass its default value so as not to cause any unwanted effects. So instead, you can tell Python explicitly to pass the headers parameter, without specifying data:
Request(url, headers=hdr)
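Putting it together, a minimal sketch (assuming Python 3, where Request and urlopen live in urllib.request, and using json.loads, which does the same job as JSONDecoder().decode):

import json
from urllib.request import Request, urlopen

hdr = {'User-agent': 'anything'}
url = 'http://www.reddit.com/r/aww.json'

req = Request(url, headers=hdr)             # headers passed by keyword, data stays None
raw = urlopen(req).read().decode('utf-8')   # urlopen returns bytes, decode to str
data = json.loads(raw)                      # equivalent to JSONDecoder().decode(raw)
print(type(data))                           # a plain dict you can walk with normal Python code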