So I have a crawler that uses something like this:
#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text
This works pretty well. However, if a media file such as an .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs, and when I run it on Linux it just prints:
Killed
Is there a more efficient way to catch these non-HTML pages?
You can send a HEAD request and check the content type. Proceed only if it is text/html:
r = requests.head(url)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url).text
else:
    print("non html page")
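If you want to make just a single request, pass stream=True so that the body is downloaded only when you access r.text:
r = requests.get(url, stream=True)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    r.close()  # release the connection without downloading the body
    print("non html page")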
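The Content-Type header can carry parameters such as a charset, and it can be missing altogether, so it may help to normalize it before comparing. Here is a minimal sketch of that check, assuming Python 3. The helper name is_html is made up for illustration and is not part of requests.

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if not content_type:
        # Header missing: treat the resource as non-HTML to stay on the safe side.
        return False
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


print(is_html("text/html; charset=UTF-8"))
print(is_html("video/mp4"))
```

With requests, the value to pass in would be r.headers.get("content-type").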
I'm looking for some library or libraries in Python to:
a) log in a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks
You can use the widely used requests module (more than 35k stars on GitHub) together with BeautifulSoup. The former transparently handles session cookies, redirections, encodings, compression and more. The latter finds elements in the HTML code and has an easy-to-remember syntax, e.g. [] for attributes of HTML tags.
Below is a complete example in Python 3.5.2 for a web site that you can scrape without a JavaScript engine (otherwise you can use Selenium). It sequentially downloads the links that have download in their URL.
import shutil
import sys
import requests
from bs4 import BeautifulSoup

""" Requirements: beautifulsoup4, requests """

SCHEMA_DOMAIN = 'https://exmaple.com'
URL = SCHEMA_DOMAIN + '/house.php/'  # this is the log-in URL
# These are the name attributes of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
        'login[login]',
        'login[password]']

client = requests.session()
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
        KEYS[1]: 'my_username',
        KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
                      data=data,
                      headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
# href=True skips anchors without an href, which would raise a KeyError below
generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a', href=True)
             if 'download' in tag['href'])
for url, name in generator:
    with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
        if request.status_code == 200:
            with open(name, 'wb') as output:
                request.raw.decode_content = True
                shutil.copyfileobj(request.raw, output)
        else:
            print('status code was {} for {}'.format(request.status_code,
                                                     name),
                  file=sys.stderr)
You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response webpage, which are media links (.mp3, .mp4, .jpg, etc) in your case.
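The filtering step could look roughly like this, assuming Python 3. The extension list and the names MEDIA_LINK and media_links are made up for illustration, so adjust them to your site.

```python
import re

# Hypothetical set of extensions; an optional query string may follow.
MEDIA_LINK = re.compile(r"\.(mp3|mp4|jpg)(\?.*)?$", re.IGNORECASE)


def media_links(links):
    """Keep only the links that end in one of the media extensions."""
    # link.get('href') yields None for anchors without an href, so skip those.
    return [link for link in links if link and MEDIA_LINK.search(link)]


sample = ["/files/song.mp3", "/about.html", None, "/img/Photo.JPG?size=large"]
print(media_links(sample))
```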
Finally, use the requests module to stream the media files so that they don't take up too much memory:
response = requests.get(url, stream=True)  # url here is the media URL
with open(target_path, "wb") as handle:
    for chunk in response.iter_content(chunk_size=512):
        if chunk:  # filter out keep-alive new chunks
            handle.write(chunk)
When the stream parameter of get() is set to True, the content is not downloaded into RAM immediately. Instead, the response behaves like an iterable that you can consume in chunks of size chunk_size in the loop right after the get() call. Each chunk is written to disk before the next one is fetched, so the whole file is never held in RAM.
You will have to put this last chunk of code in a loop if you want to download media of every link in the links list.
You will probably have to end up making some changes to this code to make it work as I haven't tested it for your use case myself, but hopefully this gives a blueprint to work off of.
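To make the chunked write easier to reuse in such a loop, it could be wrapped in a small function. This is only a sketch, assuming Python 3. The name save_chunks is hypothetical, and the stream here is a stand-in generator rather than a real response.

```python
import os
import tempfile


def save_chunks(chunks, target_path):
    """Write an iterable of byte chunks to target_path, one chunk at a time."""
    written = 0
    with open(target_path, "wb") as handle:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                handle.write(chunk)
                written += len(chunk)
    return written


# Stand-in for response.iter_content(chunk_size=512): any iterable of bytes works.
fake_stream = (b"x" * 512 for _ in range(4))
path = os.path.join(tempfile.mkdtemp(), "media.bin")
print(save_chunks(fake_stream, path))
```

With requests, you would pass response.iter_content(chunk_size=512) from a response obtained with stream=True, once per entry in links.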
Sorry for the rookie question. I was wondering if there is an efficient URL opener class in Python that handles redirects. I'm currently using a simple urllib.urlopen(), but it's not working. This is an example:
http://thetechshowdown.com/Redirect4.php
For this url, the class I'm using does not follow the redirection to:
http://www.bhphotovideo.com/
and only shows:
"You are being automatically redirected to B&H.
Page Stuck? Click Here ."
Thanks in advance.
Use the requests module; it follows redirects by default.
But a page can also be redirected by JavaScript, and no module will follow that kind of redirect.
Turn off JavaScript in your browser and go to http://thetechshowdown.com/Redirect4.php to see whether it still redirects you to another page.
I checked this page: it uses a JavaScript redirect and an HTML redirect (a <meta> tag with http-equiv="refresh"). Neither is a normal redirect sent by the server, so no module will follow it. You have to read the page, find the URL in the code and connect to that URL.
import requests
import lxml, lxml.html
# started page
r = requests.get('http://thetechshowdown.com/Redirect4.php')
#print r.url
#print r.history
#print r.text
# first redirection
html = lxml.html.fromstring(r.text)
refresh = html.cssselect('meta[http-equiv="refresh"]')
if refresh:
    print 'refresh:', refresh[0].attrib['content']
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print 'url:', url
    r = requests.get(url)
#print r.text
# second redirection
html = lxml.html.fromstring(r.text)
refresh = html.cssselect('meta[http-equiv="refresh"]')
if refresh:
    print 'refresh:', refresh[0].attrib['content']
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print 'url:', url
    r = requests.get(url)
# final page
print r.text
That happens because of soft redirects. urllib does not follow them because it does not recognize them as redirects. In fact, an HTTP response code 200 (OK) is issued, and the redirection happens in browsers as a side effect of the page content.
The first page has an HTTP response code 200, but contains the following:
<meta http-equiv="refresh" content="1; url=http://fave.co/1idiTuz">
which instructs the browser to follow the link. The second resource issues an HTTP response code 301 or 302 (redirect) to another resource, where a second soft redirect takes place, this time with JavaScript:
<script type="text/javascript">
setTimeout(function () {window.location.replace(\'http://bhphotovideo.com\');}, 2.75 * 1000);
</script>
<noscript>
<meta http-equiv="refresh" content="2.75;http://bhphotovideo.com" />
</noscript>
Unfortunately, you will have to extract the URLs to follow by hand. However, it's not difficult. Here is the code:
from lxml.html import parse
from urllib import urlopen
from contextlib import closing

def follow(url):
    """Follow both true and soft redirects."""
    while True:
        with closing(urlopen(url)) as stream:
            # content attribute of the meta refresh tag, if the page has one
            refresh = parse(stream).xpath("//meta[@http-equiv = 'refresh']/@content")
            if refresh:
                url = refresh[0].split(";")[1].strip().replace("url=", "")
            else:
                return stream.geturl()

print follow("http://thetechshowdown.com/Redirect4.php")
I will leave the error handling to you :) Also note that this might result in an endless loop if the target page contains a <meta> refresh tag too. That is not your case, but you could add some checks to prevent it: stop after n redirects, see if the page is redirecting to itself, whichever you think is better.
You will probably need to install the lxml library.
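The two safeguards mentioned above (stop after n redirects, detect a page that redirects back to one already seen) could be sketched like this, assuming Python 3. The names follow_limited and next_url are hypothetical. next_url stands for whatever function extracts the redirect target from a page, and here it is replaced by a toy lookup table so that the sketch runs without network access.

```python
def follow_limited(url, next_url, max_redirects=10):
    """Follow soft redirects until none remain, a loop is seen, or the limit is hit.

    next_url is a caller-supplied function that returns the redirect target
    for a page URL, or None when the page does not redirect.
    """
    seen = {url}
    # One extra iteration so the page reached by the last allowed hop is checked.
    for _ in range(max_redirects + 1):
        target = next_url(url)
        if target is None:
            return url
        if target in seen:
            raise RuntimeError("redirect loop detected at " + target)
        seen.add(target)
        url = target
    raise RuntimeError("more than %d redirects" % max_redirects)


# Toy redirect table standing in for real pages.
pages = {"http://a.example/": "http://b.example/", "http://b.example/": None}
print(follow_limited("http://a.example/", pages.get))
```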
The meta refresh redirection urls from html could look like any of these:
Relative urls:
<meta http-equiv="refresh" content="0; url=legal_notices_en.htm#disclaimer">
With quotes inside quotes:
<meta http-equiv="refresh" content="0; url='legal_notices_en.htm#disclaimer'">
Uppercase letters in the content of the tag:
<meta http-equiv="refresh" content="0; URL=legal_notices_en.htm#disclaimer">
Summary:
Use lxml.html to parse the HTML,
Use a lower() and two split()s to get the URL part,
Strip any wrapping quotes and spaces,
Get the absolute URL,
Cache the results in a local file with shelve (useful if you have lots of URLs to test).
Usage:
print get_redirections('https://www.google.com')
Returns something like:
{'final': u'https://www.google.be/?gfe_rd=fd&ei=FDDASaSADFASd', 'history': [<Response [302]>]}
Code:
from urlparse import urljoin, urlparse
import urllib, shelve, lxml, requests
from lxml import html

def get_redirections(initial_url, url_id = None):
    if not url_id:
        url_id = initial_url
    documents_checked = shelve.open('tested_urls.log')
    if url_id in documents_checked:
        print 'cached'
        output = documents_checked[url_id]
    else:
        print 'not cached'
        redirecting = True
        history = []
        try:
            current_url = initial_url
            while redirecting:
                r = requests.get(current_url)
                final = r.url
                history += r.history
                status = {'final':final,'history':history}
                html = lxml.html.fromstring(r.text.encode('utf8'))
                refresh = html.cssselect('meta[http-equiv="refresh"]')
                if refresh:
                    refresh_content = refresh[0].attrib['content']
                    current_url = refresh_content.lower().split('url=')[1].split(';')[0]
                    before_stripping = ''
                    after_stripping = current_url
                    while before_stripping != after_stripping:
                        before_stripping = after_stripping
                        after_stripping = before_stripping.strip('"').strip("'").strip()
                    current_url = urljoin(final, after_stripping)
                    history += [current_url]
                else:
                    redirecting = False
        except requests.exceptions.RequestException as e:
            status = {'final':str(e),'history':[],'error':e}
        documents_checked[url_id] = status
        output = status
    documents_checked.close()
    return output

url = 'http://google.com'
print get_redirections(url)
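The parsing step on its own can be tried without any network access. The sketch below assumes Python 3 and is a variation on the code above, not the same method: it looks for "url=" case-insensitively but leaves the URL itself untouched, so case-sensitive paths survive. The function name refresh_target and the base URL are made up for illustration.

```python
from urllib.parse import urljoin


def refresh_target(content, base_url):
    """Extract the absolute redirect URL from a meta refresh content value."""
    # Find "url=" case-insensitively without lowercasing the URL itself.
    position = content.lower().find("url=")
    if position == -1:
        return None
    target = content[position + len("url="):].strip()
    # Strip optional wrapping quotes and any spaces inside them.
    target = target.strip("'\"").strip()
    return urljoin(base_url, target)


base = "http://example.com/docs/index.htm"
for content in ("0; url=legal_notices_en.htm#disclaimer",
                "0; url='legal_notices_en.htm#disclaimer'",
                "0; URL=legal_notices_en.htm#disclaimer"):
    print(refresh_target(content, base))
```

All three variants listed earlier resolve to the same absolute URL under the example base.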
Please tell me why these two similar pieces of code give different results.
The first one (yandex.ru) returns the requested page, while the other one returns the main page of the site (moyareklama.ru).
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
data = {"id":"201623465"}
url = base + urllib.urlencode(data)
print url
page = urllib.urlopen(url).read()
f = open ("1.html", "w")
f.write(page)
f.close()
print page
##base = "http://yandex.ru/yandsearch?"
##data = (("text","python"),("lr","192"))
##url = base + urllib.urlencode(data)
##print url
##page = urllib.urlopen(url).read()
##f = open ("1.html", "w")
##f.write(page)
##f.close()
##print page
Most likely the reason you get something different with urllib.urlopen and your browser is because your browser can be redirected with javascript and meta/refresh tags as well as standard HTTP 301/302 responses. I'm pretty sure the urllib module will only be redirected by HTTP 301/302 responses.
I am trying to parse webpages using urllib2, BeautifulSoup and Python 2.7.
The problem lies upstream: each time I try to retrieve a new webpage, I get the one I already retrieved. However, pages are different in my webbrowser: see page 1 and page 2. Is there something wrong with the loop over page numbers?
Here is a code sample:
def main(page_number_max):
    import urllib2 as ul
    from BeautifulSoup import BeautifulSoup as bs
    base_url = 'http://www.senscritique.com/clement/collection/#page='
    for page_number in range(1, 1+page_number_max):
        url = base_url + str(page_number) + '/'
        html = ul.urlopen(url)
        bt = bs(html)
        for item in bt.findAll('div', 'c_listing-products-content xl'):
            item_name = item.findAll('h2', 'c_heading c_heading-5 c_bold')
            print str(item_name[0].contents[1]).split('\t')[11]
        print('End of page ' + str(page_number) + '\n')

if __name__ == '__main__':
    page_number_max = 2
    main(page_number_max)
When you send an HTTP request to a server, everything after the "#" character is left out. The part after "#" is only available to the browser.
If you open the developer tools in Chrome (or Firebug in Firefox), you will see that every time you change the page on senscritique.com a request is sent to the server. That's where the data you are looking for comes from.
I'm not going into details about what exactly to send in order to retrieve data from this page, because I think it would not be consistent with their TOS.
"#" introduces the fragment (anchor), which is used to identify and jump to a specific part of the document. The browser handles it, so when you send the request the whole web page is loaded and the part after "#" is ignored.
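You can see the split for yourself with the standard library. This is a small sketch, assuming Python 3 (in Python 2 the same functions live in the urlparse module).

```python
from urllib.parse import urldefrag, urlsplit

url = "http://www.senscritique.com/clement/collection/#page=2/"

parts = urlsplit(url)
print(parts.path)      # the path that is sent to the server
print(parts.fragment)  # the part that stays in the browser

# urldefrag separates the requestable URL from the fragment.
requested, fragment = urldefrag(url)
print(requested)
```

Whatever page number follows the "#", the requested URL stays the same, which is why the loop keeps getting the same page.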
How can I list files and folders if I only have an IP-address?
With urllib and others, I am only able to display the content of the index.html file. But what if I want to see which files are in the root as well?
I am looking for an example that shows how to implement username and password if needed. (Most of the time index.html is public, but sometimes the other files are not).
Use requests to get page content and BeautifulSoup to parse the result.
For example if we search for all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:
from bs4 import BeautifulSoup
import requests
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
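def listFD(url, ext=''):
    page = requests.get(url).text
    print page  # debug output: the raw listing page
    soup = BeautifulSoup(page, 'html.parser')
    # url already ends with '/', so the href is appended directly;
    # href=True skips anchors that have no href attribute
    return [url + node.get('href') for node in soup.find_all('a', href=True) if node.get('href').endswith(ext)]

for file in listFD(url, ext):
    print file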
You cannot get the directory listing directly via HTTP, as another answer says. It's the HTTP server that "decides" what to give you. Some will give you an HTML page displaying links to all the files inside a "directory", some will give you some page (index.html), and some will not even interpret the "directory" as one.
For example, you might have a link to "http://localhost/user-login/": This does not mean that there is a directory called user-login in the document root of the server. The server interprets that as a "link" to some page.
Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "ip address" you want to access would do the job), or set up an HTTP server on that machine that provides for each path (http://192.168.2.100/directory) a list of files in it (in whatever format) and parse that through Python.
If the server provides an "Index of /bla/bla" kind of page (as Apache servers do for directory listings), you could parse the HTML output to find out the names of files and directories. If not (e.g. a custom index.html, or whatever the server decides to give you), then you're out of luck :( and you can't do it.
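Parsing such a listing could look roughly like this, assuming Python 3 and only the standard library. The sample HTML is made up to resemble an auto-generated index page, and real servers format theirs differently. LinkCollector and split_listing are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_listing(html_text):
    """Return (directories, files) found in an auto-generated index page."""
    collector = LinkCollector()
    collector.feed(html_text)
    # Skip sort links ("?C=N;O=D") and parent or absolute links ("../", "/").
    entries = [h for h in collector.hrefs
               if not h.startswith(("?", "/", ".."))]
    directories = [h for h in entries if h.endswith("/")]
    files = [h for h in entries if not h.endswith("/")]
    return directories, files


# Made-up sample resembling an "Index of" page.
sample = """<html><body><h1>Index of /pub</h1>
<a href="?C=N;O=D">Name</a> <a href="/">Parent Directory</a>
<a href="images/">images/</a> <a href="notes.txt">notes.txt</a>
</body></html>"""
print(split_listing(sample))
```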
Zety provides a nice compact solution. I would add to his example by making the requests component more robust and functional:
import requests
from bs4 import BeautifulSoup

def get_url_paths(url, ext='', params=None):
    response = requests.get(url, params=params)
    # raise an HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # url must end with '/' so that the href can be appended directly
    parent = [url + node.get('href') for node in soup.find_all('a', href=True) if node.get('href').endswith(ext)]
    return parent

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)
HTTP does not work with "files" and "directories". Pick a different protocol.
You can use the following script to get the names of all files in the directories and sub-directories of an HTTP server. A file writer can be used to download them.
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup

def read_url(url):
    url = url.replace(" ", "%20")
    req = Request(url)
    a = urlopen(req).read()
    soup = BeautifulSoup(a, 'html.parser')
    x = soup.find_all('a')
    for i in x:
        file_name = i.extract().get_text()
        url_new = url + file_name
        url_new = url_new.replace(" ", "%20")
        # recurse into sub-directories, skipping "../" style entries
        if file_name and file_name[-1] == '/' and file_name[0] != '.':
            read_url(url_new)
        print(url_new)

# urlopen needs a full URL including the scheme
read_url("http://www.example.com/")