Parsing HTML page in web crawler - python

I am trying to build a web crawler that crawls all the links on a page and adds them to a file.
My Python code contains a method that does the following:
Opens a given web page (the urllib2 module is used).
Checks whether the HTTP Content-Type header contains text/html.
Decodes the raw HTML response into a string and stores it in the html_string variable.
Creates an instance of the Link_Finder class, which takes the base URL (Spider.base_url) and the page URL (page_url) as attributes. Link_Finder is defined in another module, link_finder.py.
html_string is then fed to the class using the feed function.
The Link_Finder class is explained in detail below.
def gather_links(page_url):  # page_url is relative url
    html_string = ''
    try:
        req = urllib2.urlopen(page_url)
        head = urllib2.Request(page_url)
        if 'text/html' in head.get_header('Content-Type'):
            html_bytes = req.read()
            html_string = html_bytes.decode("utf-8")
        finder = LinkFinder(Spider.base_url, page_url)
        finder.feed(html_string)
    except Exception as e:
        print "Exception " + str(e)
        return set()
    return finder.page_links()
The link_finder.py module uses the standard Python HTMLParser and urlparse modules. The Link_Finder class inherits from HTMLParser and overrides the handle_starttag method to find all the a tags with an href attribute and add the URLs to a set (self.links).
from HTMLParser import HTMLParser
import urlparse

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):  # page_url is relative url
        super(LinkFinder, self).__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):  # Override default handler methods
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    url = urlparse.urljoin(self.base_url, value)  # Get exact url
                    self.links.add(url)

    def error(self, message):
        pass

    def page_links(self):  # return set of links
        return self.links
I am getting an exception:
argument of type 'NoneType' is not iterable
I think the problem is in the way I used the urllib2 Request to check the header content.
I am a bit new to this, so some explanation would be good.

I'd have used BeautifulSoup instead of HTMLParser like so -
soup = BeautifulSoup(pageContent)
links = soup.find_all('a')
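For what it's worth, here is a rough, untested sketch (assuming Python 2 with urllib2 and bs4 available, and with the base URL passed in explicitly) of how that could replace LinkFinder inside gather_links. It also sidesteps the NoneType error: the Content-Type header is read from the response that urlopen returns, rather than from a freshly constructed urllib2.Request that was never sent, whose get_header simply returns None.

import urllib2
import urlparse
from bs4 import BeautifulSoup

def gather_links(base_url, page_url):
    # One request is enough; the response object already carries the headers.
    response = urllib2.urlopen(page_url)
    content_type = response.info().getheader('Content-Type') or ''
    if 'text/html' not in content_type:
        return set()
    soup = BeautifulSoup(response.read(), 'html.parser')
    # Resolve every href against the base URL, mirroring what LinkFinder did.
    return set(urlparse.urljoin(base_url, a['href'])
               for a in soup.find_all('a', href=True))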

Related

Using BeautifulSoup for web scraping: ValueError obtained

I have the following error:
ValueError at /scrape/
dictionary update sequence element #0 has length 1; 2 is required
Request Method: GET
Request URL: http://localhost:8000/scrape/
Django Version: 2.2.17
Exception Type: ValueError
Exception Value:
dictionary update sequence element #0 has length 1; 2 is required
Exception Location: c:\Users\toshiba\Desktop\newsscraper\news\views.py in scrape, line 32
Python Executable: C:\Users\toshiba\AppData\Local\Programs\Python\Python37\python.exe
Python Version: 3.7.2
Python Path:
['c:\\Users\\toshiba\\Desktop\\newsscraper',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\python37.zip',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\DLLs',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\win32',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\win32\\lib',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\Pythonwin']
I am trying to scrape news articles from guardian.ng, but it seems to keep giving me an error.
This is my views.py file:
# import all necessary modules
import requests
from django.shortcuts import render, redirect
from bs4 import BeautifulSoup as BSoup
from news.models import Headline

requests.packages.urllib3.disable_warnings()

# new function news_list()
def news_list(request):
    headlines = Headline.objects.all()[::-1]
    context = {
        'object_list': headlines,
    }
    return render(request, "news/home.html", context)

# the view function scrape()
def scrape(request):
    session = requests.Session()
    session.headers = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
    # This HTTP header tells the server information about the client. We pose as a Google bot,
    # so when our client requests anything from the server, the request appears to come from Googlebot.
    url = "https://www.guardian.ng/"
    content = session.get(url, verify=False).content
    # We create a soup object from the HTML page, passing "html.parser" as the parser.
    soup = BSoup(content, "html.parser")
    # News holds the <div> elements of a particular class, selected through webpage inspection.
    News = soup.find_all('div', {"class": "headline"})
    for article in News:
        main = dict(article.find_all('a'))  # we can iterate over soup objects.
        # The main variable holds the anchor tag that links to the original webpage.
        # Since the <div>s returned only have one <a> tag, we get most of our work done here.
        # The <a> tag contains the title and href of the original link.
        link = main['href']
        image_src = str(main.find('img')['srcset']).split(" ")[-4]  # srcset contains various sizes of the image
        title = main['title']
        # saving the data to the database
        new_headline = Headline()
        new_headline.title = title
        new_headline.url = link
        new_headline.image = image_src
        new_headline.save()
    return redirect("../")
What can I do to solve this error?
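The error comes from dict(article.find_all('a')): find_all returns a list of Tag objects, and dict() then tries to unpack each tag as a (key, value) pair, which is what raises "dictionary update sequence element #0 has length 1; 2 is required". A minimal, untested sketch of the loop, assuming (as the comments state) that each headline <div> really does contain a single <a> tag:

for article in News:
    # Take the single <a> tag directly instead of passing a list of tags to dict().
    main = article.find('a')
    if main is None:
        continue
    link = main['href']
    title = main['title']
    image_src = str(main.find('img')['srcset']).split(" ")[-4]
    new_headline = Headline()
    new_headline.title = title
    new_headline.url = link
    new_headline.image = image_src
    new_headline.save()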

How to understand callback function in scrapy.Request?

I am reading Web Scraping with Python, 2nd Ed., and wanted to use the Scrapy module to crawl information from a webpage.
I got the following information from the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html
callback (callable) – the function that will be called with the
response of this request (once it’s downloaded) as its first
parameter. For more information see Passing additional data to
callback functions below. If a Request doesn’t specify a callback, the
spider’s parse() method will be used. Note that if exceptions are
raised during processing, errback is called instead.
My understanding is that:
pass in the url and get resp back, like we do with the requests module
resp = requests.get(url)
pass resp in for data parsing
parse(resp)
The problems are:
I didn't see where resp is passed in.
Why do we need to put the self keyword before parse in the argument list?
The self keyword is never used in the parse function, so why bother putting it as the first parameter?
Can we extract the url from the response parameter like this: url = response.url, or should it be url = self.url?
class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
It seems like you are missing a few concepts related to Python classes and OOP. It would be a good idea to read the Python docs, or at the very least this question.
Here is how Scrapy works: you instantiate a request object and yield it to the Scrapy scheduler.
yield scrapy.Request(url=url)  # or use return like you did
Scrapy will handle the request, download the HTML and pass everything it got back from that request to a callback function. If you didn't set a callback function in your request (like in my example above), it will call a default method called parse.
parse is a method (a.k.a. function) of your object. You wrote it in your code above, and even if you hadn't, it would still be there, since your class inherits all methods from its parent class:
class ArticleSpider(scrapy.Spider): # <<<<<<<< here
    name = 'article'
So, a TL;DR of your questions:
1 - You didn't see it because it happens in the parent class.
2 - You need to use self so Python knows you are referencing a method of the spider instance.
3 - The self parameter is the instance itself, and it is used by Python.
4 - response is an independent object that your parse method receives as an argument, so you can access its attributes like response.url or response.headers.
You can find information about self here: https://docs.python.org/3/tutorial/classes.html
About this question:
can we extract URL from response parameter like this: url = response.url or should be url = self.url
you should use response.url to get the URL of the page you are currently crawling/parsing.
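To make points 1-4 concrete, here is a slightly reworked, untested sketch of the spider from the question. It uses start_urls so that the start_requests method inherited from scrapy.Spider builds the requests and, since no callback is given, wires them to the default parse callback:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article'
    # The inherited start_requests() turns these URLs into Requests and,
    # because no callback is specified, uses self.parse for each of them.
    start_urls = ['https://en.wikipedia.org/wiki/Functional_programming']

    def parse(self, response):
        # Scrapy invokes this as spider.parse(response): the framework supplies
        # both the instance (self) and the downloaded page (response).
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
        }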

Can RoboBrowser open a html string?

I would like to benefit from the awesome features of RoboBrowser on an HTML string that contains some forms.
Usually RoboBrowser is used like this:
url = "whatever.com"
browser = RoboBrowser(history=True)
browser.open(url)
thatForm = browser.get_form("thatForm")
thatForm["thisField"].value = "some value"
browser.submit(thatForm)
I would like to do the same with the HTML content of a string; I was expecting something like the following to work:
content = "<html>...</html>"
browser = RoboBrowser(history=True)
browser.open(content)
However, this does not work, because the open method expects the string to be a URL, not HTML content. Is there anything that can be done, any workaround, so that I can pass an HTML string somewhere and have RoboBrowser parse it?
Alright, I found a solution. It is not super elegant, but it works. Basically, it all revolves around the _update_state function, which RoboBrowser calls internally at some point when opening a URL:
def open(self, url, method='get', **kwargs):
    """Open a URL.

    :param str url: URL to open
    :param str method: Optional method; defaults to `'get'`
    :param kwargs: Keyword arguments to `Session::request`
    """
    response = self.session.request(method, url, **self._build_send_args(**kwargs))
    self._update_state(response)
The solution is therefore to simply create a fake response carrying the HTML we want to be parsed:
import requests
from robobrowser import RoboBrowser

fake_response = requests.Response()
# _content normally holds bytes, so encode the string first on Python 3
fake_response._content = the_html_we_want_Robobrowser_to_parse
browser = RoboBrowser()
browser._update_state(fake_response)
my_form = browser.get_form("myform")
browser.submit_form(my_form)
And voila :)

Accessing URLs from a list in Python

I'm trying to search a HTML document for links to articles, store them into a list and then use that list to search each one individually for their titles.
It's possibly not a direct answer for the OP, but it should not be considered off-topic: you should not parse the web page for the HTML data.
HTML web pages are not optimized to answer a lot of requests, especially requests that do not come from browsers. A lot of generated traffic can overload servers and effectively trigger a DDoS.
So first try to find an available API for the site you are interested in, and only if nothing relevant is found, fall back to parsing the web content, caching your requests so you do not overload the target resource.
At first look, The Guardian has an Open API with documentation on how to use it.
Using that API you can work with the site content in a simple manner, making the requests you are interested in much easier and getting answers without any parsing.
For example, the API output for a search by the tag "technology":
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
import json
import sys

def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

url = "http://content.guardianapis.com/search?q=technology&api-key=test"
try:
    content = urlopen(url).read().decode("utf-8")
    json_data = json.loads(content)
    if "response" in json_data and "results" in json_data["response"]:
        for item in json_data["response"]["results"]:
            safeprint(item["webTitle"])
except URLError as e:
    if isinstance(e, HTTPError):
        print("Error appeared: " + str(e))
    else:
        raise e
That way you can walk through all the publications in depth without any problem.
Just use Beautiful Soup to parse the HTML and find the title tag in each page:
read = [urllib.urlopen(link).read() for link in article_links]
data = [BeautifulSoup(i).find('title').getText() for i in read]
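Fleshed out a little, under the assumption of Python 2 (where urllib.urlopen exists) and with article_links standing in as a placeholder for whatever list of article URLs you collected, that approach might look like this:

import urllib
from bs4 import BeautifulSoup

# Placeholder: the list of article URLs gathered from the original page.
article_links = ['http://example.com/article-1', 'http://example.com/article-2']

# Download each page, then pull the text out of its <title> tag.
read = [urllib.urlopen(link).read() for link in article_links]
data = [BeautifulSoup(page, 'html.parser').find('title').getText() for page in read]

for title in data:
    print(title)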

How can i get the parsed html in scrapy from hardcoded url

In my Scrapy spider I just want the HTML response from a custom URL inside a variable.
Suppose I have the URL
url = "http://www.example.com"
Now I want to get the HTML of that page for parsing:
pageHtml = scrapy.get(url)
I want something like this
page = urllib2.urlopen('http://yahoo.com').read()
The only problem is that I can't use the above line in my crawler, because my session is already authenticated by Scrapy, so I can't use any other function to get the HTML of that page.
I don't want the response in any callback, but simply straight inside the variable.
Basically, you just need to add the relevant imports for the code in that question to work. You'll also need to add a link variable which is used but not defined in that example code.
import httplib
from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
bs = BaseSpider('some')
# etc
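As a rough, untested sketch of how those pieces can fit together (the URL and the link variable below are placeholders), one way is to fetch the page yourself with httplib and wrap the body in a TextResponse, so the usual Scrapy response API is available outside of any callback:

import httplib
from scrapy.http import TextResponse

link = "http://www.example.com"  # placeholder URL

# Fetch the page outside of Scrapy's normal request/callback flow.
conn = httplib.HTTPConnection("www.example.com")
conn.request("GET", "/")
raw = conn.getresponse().read()

# Wrap the raw HTML in a TextResponse so it behaves like a normal Scrapy response.
pageHtml = TextResponse(url=link, body=raw, encoding='utf-8')
print pageHtml.body  # the raw HTML, straight in a variable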
