How to gather specific information using urllib2 in python

How to gather specific information using urllib2 in python - python

As of now I have created a basic program in python 2.7 using urllib2 and re that gathers the html code of a website and prints it out for you as well as indexing a keyword. I would like to create a much more complex and dynamic program which could gather data from websites such as sports or stock statistics and aggregate them into lists which could then be used in analysis in something such as an excel document etc. I'm not asking for someone to literally write the code. I simply need help understanding more of how I should approach the code: whether I require extra libraries, etc. Here is the current code. It is very simplistic as of now.:
import urllib2
import re
y = 0
while(y == 0):
x = str(raw_input("[[[Enter URL]]]"))
keyword = str(raw_input("[[[Enter Keyword]]]"))
wait = 0
try:
req = urllib2.Request(x)
response = urllib2.urlopen(req)
page_content = response.read()
idall = [m.start() for m in re.finditer(keyword,page_content)]
wait = raw_input("")
print(idall)
wait = raw_input("")
print(page_content)
except urllib2.HTTPError as e:
print e.reason

You can use requests to deal with interaction with website. Here is link for it. http://docs.python-requests.org/en/latest/
And then you can use beautifulsoup to handle the html content. Here is link for it.http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
They're more ease of use than urllib2 and re.
Hope it helps.

Related

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.

This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
Use a library capable of executing JavaScript and giving you the resulting HTML
Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, then we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation here) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)

So the way that the site works is that it sends you the site with no word in the span box, and edits it in later through JavaScript; that's why you get a span box with nothing inside.
However, since you're trying to get the word I'd definitely suggest you use a different method to getting the word, rather than scraping the word off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in response.
You're using Python 2 but in Python 3 (for example, so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.

Scrape title by only downloading relevant part of webpage

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving just the title of a webpage in python, but all of the ones I've found download the entire page before retrieving the title, which seems highly inefficient as most often the title is contained within the first few lines of HTML.
Is it possible to download only the parts of the webpage until the title has been found?
I've tried the following, but page.readline() downloads the entire page.
import urllib2
print("Looking up {}".format(link))
hdr = {'User-Agent': 'Mozilla/5.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(link, headers=hdr)
page = urllib2.urlopen(req, timeout=10)
content = ''
while '</title>' not in content:
content = content + page.readline()
-- Edit --
Note that my current solution makes use of BeautifulSoup constrained to only process the title so the only place I can optimize is likely to not read in the entire page.
title_selector = SoupStrainer('title')
soup = BeautifulSoup(page, "lxml", parse_only=title_selector)
title = soup.title.string.strip()
-- Edit 2 --
I've found that BeautifulSoup itself splits the content into multiple strings in the self.current_data
variable (see this function in bs4), but I'm unsure how to modify the code to basically stop reading all remaining content after the title has been found. One issue could be that redirects should still work.
-- Edit 3 --
So here's an example. I have a link www.xyz.com/abc and I have to follow this through any redirects (almost all of my links use a bit.ly kind of link shortening). I'm interested in both the title and domain that occurs after any redirections.
-- Edit 4 --
Thanks a lot for all of your assistance! The answer by Kul-Tigin works very well and has been accepted. I'll keep the bounty until it runs out though to see if a better answer comes up (as shown by e.g. a time measurement comparison).
-- Edit 5 --
For anyone interested: I've timed the accepted answer to be roughly twice as fast as my existing solution using BeautifulSoup4.

You can defer downloading the entire response body by enabling stream mode of requests.
Requests 2.14.2 documentation - Advanced Usage
By default, when you make a request, the body of the response is
downloaded immediately. You can override this behaviour and defer
downloading the response body until you access the Response.content
attribute with the stream parameter:
...
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close.
This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)
So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since the redirects will be handled by the library you'll be ready to go.
Here's an error-prone code tested with Python 2.7.10 and 3.6.0:
try:
from HTMLParser import HTMLParser
except ImportError:
from html.parser import HTMLParser
import requests, re
from contextlib import closing
CHUNKSIZE = 1024
retitle = re.compile("<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://example.com/abc", stream=True)) as res:
for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
buffer = "".join([buffer, chunk])
match = retitle.search(buffer)
if match:
print(htmlp.unescape(match.group(1)))
break

Question: ... the only place I can optimize is likely to not read in the entire page.
This does not read the entire page.
Note: Unicode .decode() will raise Exception if you cut a Unicode sequence in the middle. Using .decode(errors='ignore') remove those sequences.
For instance:
import re
try:
# PY3
from urllib import request
except:
import urllib2 as request
for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
f = request.urlopen(url)
re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)',re.DOTALL)
Found = False
data = ''
while True:
b_data = f.read(4096)
if not b_data: break
data += b_data.decode(errors='ignore')
match = re_obj.match(data)
if match:
Found = True
title = match.groups()[1]
print('title={}'.format(title))
break
f.close()
Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform
Tested with Python: 3.4.2 and 2.7.9

You're scraping webpages using standard REST requests and I'm not aware of any request that only returns the title, so I don't think it's possible.
I know this doesn't necessarily help get the title only, but I usually use BeautifulSoup for any web scraping. It's much easier. Here's an example.
Code:
import requests
from bs4 import BeautifulSoup
urls = ["http://www.google.com", "http://www.msn.com"]
for url in urls:
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
print "Title with tags: %s" % soup.title
print "Title: %s" % soup.title.text
print
Output:
Title with tags: <title>Google</title>
Title: Google
Title with tags: <title>MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos</title>
Title: MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos

the kind of thing you want i don't think can be done, since the way the web is set up, you get the response for a request before anything is parsed. there isn't usually a streaming "if encounter <title> then stop giving me data" flag. if there is id love to see it, but there is something that may be able to help you. keep in mind, not all sites respect this. so some sites will force you to download the entire page source before you can act on it. but a lot of them will allow you to specify a range header. so in a requests example:
import requests
targeturl = "http://www.urbandictionary.com/define.php?term=Blarg&page=2"
rangeheader = {"Range": "bytes=0-150"}
response = requests.get(targeturl, headers=rangeheader)
response.text
and you get
'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#'
now of course here's the problems with this
what if you specify a range that is too short to get the title of the page?
whats a good range to aim for? (combination of speed and assurance of accuracy)
what happens if the page doesn't respect Range? (most of the time you just get the whole response you would have without it.)
i don't know if this might help you? i hope so. but i've done similar things to only get file headers for download checking.
EDIT4:
so i thought of another kind of hacky thing that might help. nearly every page has a 404 page not found page. we might be able to use this to our advantage. instead of requesting the regular page. request something like this.
http://www.urbandictionary.com/nothing.php
the general page will have tons of information, links, data. but the 404 page is nothing more than a message, and (in this case) a video. and usually there is no video. just some text.
but you also notice that the title still appears here. so perhaps we can just request something we know does not exist on any page like.
X5ijsuUJSoisjHJFk948.php
and get a 404 for each page. that way you only download a very small and minimalistic page. nothing more. which will significantly reduce the amount of information you download. thus increasing speed and efficiency.
heres the problem with this method: you need to check somehow if the page does not supply its own version of the 404. most pages have it because it looks good with the site. and its standard practice to include one. but not all of them do. make sure to handle this case.
but i think that could be something worth trying out. over the course of thousands of sites, it would save many ms of download time for each html.
EDIT5:
so as we talked about, since you are interested in urls that redirect. we might make use of an http head reqeust. which wont get the site content. just the headers. so in this case:
response = requests.head('http://myshortenedurl.com/5b2su2')
replace my shortenedurl with tunyurl to follow along.
>>>response
<Response [301]>
nice so we know this redirects to something.
>>>response.headers['Location']
'http://stackoverflow.com'
now we know where the url redirects to without actually following it or downloading any page source. now we can apply any of the other techniques previously discussed.
Heres an example, using requests and lxml modules and using the 404 page idea. (be aware, i have to replace bit.ly with bit'ly so stack overflow doesnt get mad.)
#!/usr/bin/python3
import requests
from lxml.html import fromstring
links = ['http://bit'ly/MW2qgH',
'http://bit'ly/1x0885j',
'http://bit'ly/IFHzvO',
'http://bit'ly/1PwR9xM']
for link in links:
response = '<Response [301]>'
redirect = ''
while response == '<Response [301]>':
response = requests.head(link)
try:
redirect = response.headers['Location']
except Exception as e:
pass
fakepage = redirect + 'X5ijsuUJSoisjHJFk948.php'
scrapetarget = requests.get(fakepage)
tree = fromstring(scrapetarget.text)
print(tree.findtext('.//title'))
so here we get the 404 pages, and it will follow any number of redirects. now heres the output from this:
Urban Dictionary error
Page Not Found - Stack Overflow
Error 404 (Not Found)!!1
Kijiji: Page Not Found
so as you can see we did indeed get out titles. but we see some problems with the method. namely some titles add things, and some just dont have a good title at all. and thats the issue with that method. we could however try the range method too. benefits of that would be the title would be correct, but sometimes we might miss it, and sometimes we have to download the whole pagesource to get it. increasing required time.
Also credit to alecxe for this part of my quick and dirty script
tree = fromstring(scrapetarget.text)
print(tree.findtext('.//title'))
for an example with the range method. in the loop for link in links: change the code after the try catch statement to this:
rangeheader = {"Range": "bytes=0-500"}
scrapetargetsection = requests.get(redirect, headers=rangeheader)
tree = fromstring(scrapetargetsection.text)
print(tree.findtext('.//title'))
output is:
None
Stack Overflow
Google
Kijiji: Free Classifieds in...
here we see urban dictionary has no title or ive missed it in the bytes returned. in any of these methods there are tradeoffs. the only way to get close to total accuracy would be to download the entire source for each page i think.

using urllib you can set the Range header to request a certain range of bytes, but there are some consequences:
it depends on the server to honor the request
you assume that data you're looking for is within desired range (however you can make another request using different range header to get next bytes - i.e. download first 300 bytes and get another 300 only if you can't find title within first result - 2 requests of 300 bytes are still much cheaper than whole document)
(edit) - to avoid situations when title tag splits between two ranged requests, make your ranges overlapped, see 'range_header_overlapped' function in my example code
import urllib
req = urllib.request.Request('http://www.python.org/')
req.headers['Range']='bytes=%s-%s' % (0, 300)
f = urllib.request.urlopen(req)
just to verify if server accepted our range:
content_range=f.headers.get('Content-Range')
print(content_range)

my code also solves cases when title tag is splitted between chunks.
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue May 30 04:21:26 2017
====================
#author: s
"""
import requests
from string import lower
from html.parser import HTMLParser
#proxies = { 'http': 'http://127.0.0.1:8080' }
urls = ['http://opencvexamples.blogspot.com/p/learning-opencv-functions-step-by-step.html',
'http://www.robindavid.fr/opencv-tutorial/chapter2-filters-and-arithmetic.html',
'http://blog.iank.org/playing-capitals-with-opencv-and-python.html',
'http://docs.opencv.org/3.2.0/df/d9d/tutorial_py_colorspaces.html',
'http://scikit-image.org/docs/dev/api/skimage.exposure.html',
'http://apprize.info/programming/opencv/8.html',
'http://opencvexamples.blogspot.com/2013/09/find-contour.html',
'http://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html',
'https://github.com/ArunJayan/OpenCV-Python/blob/master/resize.py']
class TitleParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
self.match = False
self.title = ''
def handle_starttag(self, tag, attributes):
self.match = True if tag == 'title' else False
def handle_data(self, data):
if self.match:
self.title = data
self.match = False
def valid_content( url, proxies=None ):
valid = [ 'text/html; charset=utf-8',
'text/html',
'application/xhtml+xml',
'application/xhtml',
'application/xml',
'text/xml' ]
r = requests.head(url, proxies=proxies)
our_type = lower(r.headers.get('Content-Type'))
if not our_type in valid:
print('unknown content-type: {} at URL:{}'.format(our_type, url))
return False
return our_type in valid
def range_header_overlapped( chunksize, seg_num=0, overlap=50 ):
"""
generate overlapping ranges
(to solve cases when title tag splits between them)
seg_num: segment number we want, 0 based
overlap: number of overlaping bytes, defaults to 50
"""
start = chunksize * seg_num
end = chunksize * (seg_num + 1)
if seg_num:
overlap = overlap * seg_num
start -= overlap
end -= overlap
return {'Range': 'bytes={}-{}'.format( start, end )}
def get_title_from_url(url, proxies=None, chunksize=300, max_chunks=5):
if not valid_content(url, proxies=proxies):
return False
current_chunk = 0
myparser = TitleParser()
while current_chunk <= max_chunks:
headers = range_header_overlapped( chunksize, current_chunk )
headers['Accept-Encoding'] = 'deflate'
# quick fix, as my locally hosted Apache/2.4.25 kept raising
# ContentDecodingError when using "Content-Encoding: gzip"
# ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.',
# error('Error -3 while decompressing: incorrect header check',))
r = requests.get( url, headers=headers, proxies=proxies )
myparser.feed(r.content)
if myparser.title:
return myparser.title
current_chunk += 1
print('title tag not found within {} chunks ({}b each) at {}'.format(current_chunk-1, chunksize, url))
return False

Scraper yields no results

I am very new to python (this is my first Python project, in fact) and I am having a bit of trouble writing this web scraper. I used a tutorial to figure this out, but the code is yielding no results. I would really appreciate some help.
from lxml import html
import requests
page = requests.get('http://openbook.sfgov.org/openbooks/cgi-bin/cognosisapi.dll?b_action=cognosViewer&ui.action=run&ui.object=/content/folder%5B%40name%3D%27Reports%27%5D/report%5B%40name%3D%27Budget%27%5D&ui.name=20Budget&run.outputFormat=&run.prompt=false')
tree = html.fromstring(page.content)
#This will find the table headers:
categories = tree.xpath('//*[#id="rt_NS_"]/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[2]/td/div/div/table/tbody/tr/td[2]/table/tbody/tr[2]/td[1]')
# This will find the budgets
category_budget = tree.xpath('//*[#id="rt_NS_"]/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[2]/td/div/div/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]/span[1]')
print 'Cateogries: ', categories
print 'Budget: ', category_budget

Looks like contents of table id="rt_NS_" are being generated by JavaScript.
In that case requests won't help you.
page = requests.get('http://openbook.sfgov.org/openbooks/cgi-bin/cognosisapi.dll?b_action=cognosViewer&ui.action=run&ui.object=/content/folder%5B%40name%3D%27Reports%27%5D/report%5B%40name%3D%27Budget%27%5D&ui.name=20Budget&run.outputFormat=&run.prompt=false')
ctx = page.content
if "id=\"rt_NS_\"" in ctx:
print "Found!"
else:
print "Not Found!"
Not Found!
You'll need to use other approach. Selenium with python could be an option.

Accessing URLs from a list in Python

I'm trying to search a HTML document for links to articles, store them into a list and then use that list to search each one individually for their titles.

It's possibly not direct answer for OP, but should not considered as off-topic: You should not parse web page for the html data.
html web pages are not optimized to answer for a lot of requests, especially requests not from browsers. A lot of generated traffic could made servers overloaded and trigger DDOS.
So you need try to found any available API for interesting your site and only if nothing relative found, use parsing of web content with using cache of request to not overload target resource.
At the first look, The Guardian have Open API with documentation how to use.
Using that API you could work with site content in simply manner, making you interesting requests more easier and answers available without parsing.
For example, search by tag "technology" api output:
from urllib.request import urlopen, URLError, HTTPError
import json
import sys
def safeprint(s):
try:
print(s)
except UnicodeEncodeError:
if sys.version_info >= (3,):
print(s.encode('utf8').decode(sys.stdout.encoding))
else:
print(s.encode('utf8'))
url = "http://content.guardianapis.com/search?q=technology&api-key=test"
try:
content = urlopen(url).read().decode("utf-8")
json_data = json.loads(content)
if "response" in json_data and "results" in json_data["response"]:
for item in json_data["response"]["results"]:
safeprint(item["webTitle"])
except URLError as e:
if isinstance(e, HTTPError):
print("Error appear: " + str(e))
else:
raise e
Using that way you could walk through all publications in deep without any problem.

Just use Beautiful Soup to parse the HTML and find the title tag in each page:
read = [urllib.urlopen(link).read() for link in article_links]
data = [BeautifulSoup(i).find('title').getText() for i in read]

Python to Save Web Pages

This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script to go one the site and download the (complete) web page for each ID in a simple form ID_whatever_the_default_save_name_is in a specific folder.
Can I run a simple python script to do this for me? I can do it by hand, it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.

Mechanize is a great package for crawling the web with python. A simple example for your issue would be:
import mechanize
br = mechanize.Browser()
response = br.open("www.xyz.com/somestuff/ID")
print response
This simply grabs your url and prints the response from the server.

This can be done simply in python using the urllib module. Here is a simple example in Python 3:
import urllib.request
url = 'www.xyz.com/somestuff/ID'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.readall()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html

Do you want just the html code for the website? If so, just create a url variable with the host site and add the page number as you go. I'll do this for an example with http://www.notalwaysright.com
import urllib.request
url = "http://www.notalwaysright.com/page/"
for x in range(1, 71):
newurl = url + x
response = urllib.request.urlopen(newurl)
with open("Page/" + x, "a") as p:
p.writelines(reponse.read())

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to gather specific information using urllib2 in python - python

Related

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

Scrape title by only downloading relevant part of webpage

Scraper yields no results

Accessing URLs from a list in Python

Python to Save Web Pages

Categories

Resources