Download a HTML page locally (+CSS, +images) - python

I'd like to be able to download a HTML page (let's say this actual question!):
f = urllib2.urlopen('https://stackoverflow.com/questions/33914277')
content = f.read() # soup = BeautifulSoup(content) could be useful?
g = open("mypage.html", 'w')
g.write(content)
g.close()
such that it is displayed the same way locally than online. Currently here is the (bad) result:
(source: gget.it)
Thus, one need to download CSS, and modify the HTML itself such that it points to this local CSS file... and the same for images, etc.
How to do this? (I think there should be simpler than this answer, that doesn't handle CSS, but how? Library?)

Since css and image files fall under CORS policy, from your local html you still can refer to them while they are in the cloud. The problem is unresolved URIs. In the html head section you have smth. like this:
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="stylesheet" type="text/css" href="/assets/8943fcf6/select.css" />
<link href="/css/media.css" rel="stylesheet" type="text/css">
<script type="text/javascript" src="/assets/jquery.yii.js"></script>
<script type="text/javascript" src="/assets/select.js"></script>
</head>
Obviously /css/media.css implies base address, ex. http://example.com. To resolve it for local file you need to make http://example.com/css/media.css as href value in your local copy of html. So now you should parse and add the base into the local code:
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="stylesheet" type="text/css" href="http://example.com/assets/select.css" />
<link href="http://example.com/css/media.css" rel="stylesheet" type="text/css">
<script type="text/javascript" src="http://example.com/assets/jquery.yii.js"></script>
<script type="text/javascript" src="http://example.com/assets/select.js"></script>
</head>
Use any means for that (js, php...)
Update
Since a local file also contains images' references throughout the body section you'll need to resolve them too.

Related

how to copy all the code of a URL with python

I want to copy all the code of an URL (http://modelseed.org/biochem/reactions/rxn00001) using Python 3.6, but I can only copy part of the code, and I don't know why.
So far, I tried with "requests" module
import requests
page = requests.get("http://modelseed.org/biochem/reactions/rxn00001")
print(page.content)
and "urllib"
import urllib.request
site = urllib.request.urlopen("http://modelseed.org/biochem/reactions/rxn00001")
print(site.read())
The part of the code with information of the "Reaction Details", like "Name", "ID" and "Abbreviation" are missing, but they are visible if I inspect the code on the developer bar of Chrome.
The code I'm able to download using the two codes above is:
<!DOCTYPE html>
<html lang="en" ng-app="ModelSEED">
<head>
<base href="/"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport">
<meta content="The ModelSEED is a resource for the reconstruction, exploration, comparison, and analysis of metabolic models." name="description"/>
<link href="/img/ModelSEED-favicon.png?v=2.0" rel="shortcut icon"/>
<meta content="nconrad" name="author"/>
<title>
ModelSEED
</title>
<link href="components/angular-material/angular-material.css" rel="stylesheet"/>
<link href="components/bootstrap/dist/css/bootstrap.min.css" rel="stylesheet"/>
<!-- to be removed -->
<link href="components/font-awesome/css/font-awesome.min.css" rel="stylesheet"/>
<link href="icomoon/style.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="http://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
<link href="build/style.css" rel="stylesheet"/>
<!--<script src="https://cdn.socket.io/socket.io-1.3.7.js"></script>-->
<script src="build/site.js">
</script>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</meta>
</head>
<body>
<div style="height: 100%;" ui-view="">
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-67412611-1', 'auto');
ga('send', 'pageview');
</script>
</body>
</html>
Anyone has any hint why the code between < div style="height: 100%;" ui-view="" > and (just after < body > and before < script >) is not downloaded?
Thank you.
It's being inserted by a javascript script, therefore, either requests nor urllib would find it, you would need to use a browser for this, you should try with selenium or PhantomJS
something like:
from selenium import webdriver
driver = webdriver.Chrome('./chromedriver')
driver.get(url)
driver.page_source
Try getting this url instead: https://www.patricbrc.org/api/model_reaction/?http_accept=application/json&eq(id,rxn00001)

Replace the node of Beautiful Soup with string in python

I have to download and save the webpages with a given URL. I have downloaded the page as well as the required js and css files. But the problem is to change the src and href values of those tags in the html source file as well to make it work.
my html source is :
<link REL="shortcut icon" href="/commd/favicon.ico">
<script src="/commd/jquery.min.js"></script>
<script src="/commd/jquery-ui.min.js"></script>
<script src="/commd/slimScroll.min.js"></script>
<script src="/commd/ajaxstuff.js"></script>
<script src="/commd/jquery.nivo.slider.pack.js"></script>FCT0505
<script src="/commd/jquery.nivo.slider.pack.js"></script>
<link rel="stylesheet" type="text/css" href="/fonts/stylesheet.cssFCT0505"/>
<link rel="stylesheet" type="text/css" href="/commd/stylesheet.css"/>
<!--[if gte IE 6]>
<link rel="stylesheet" type="text/css" href="/commd/stylesheetIE.css" />
<![endif]-->
<link rel="stylesheet" type="text/css" href="/commd/accordion.css"/>
<link rel="stylesheet" href="/commd/nivo.css" type="text/css" media="screen" />
<link rel="stylesheet" href="/commd/nivo-slider.css" type="text/css" media="screen" />
I have found out all the links of css and js files as well as downloaded them using :
scriptsurl = soup3.find_all("script")
os.chdir(foldername)
for l in scriptsurl:
if l.get("src") is not None:
print(l.get("src"))
script="http://www.iitkgp.ac.in"+l.get("src")
print(script)
file=l.get("src").split("/")[-1]
l.get("src").replaceWith('./foldername/'+file)
print(file)
urllib.request.urlretrieve(script,file)
linksurl=soup3.find_all("link")
for l in linksurl:
if l.get("href") is not None:
print(l.get("href"))
css="http://www.iitkgp.ac.in"+l.get("href")
file=l.get("href").split("/")[-1]
print(css)
print(file)
if(os.path.exists(file)):
urllib.request.urlretrieve(css,file.split(".")[0]+"(1)."+file.split(".")[-1])
else:
urllib.request.urlretrieve(css,file)
os.chdir("..")
Can anyone suggest me the method to change(local machine path) the the src/href texts during these loop executions only which will be great help.
This is my first task of crawling.
Reading from the documentation:
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:
So writing something like:
l["src"] = os.path.join(os.getcwd(),foldername, file)
instead of
l.get("src").replaceWith('./foldername/'+file)
I believe will do the trick

Certain javascript libs not loading over https

After setting up https on a site, some the javascript libraries are not loading while others are. In this case, the select2 lib is not loading. Why would this be?
Head extract
<head>
<link rel="stylesheet" href="https://yui.yahooapis.com/pure/0.6.0/pure-min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
<script src="https://code.jquery.com/ui/1.11.4/jquery-ui.js"></script>
<link rel="stylesheet" href="https://code.jquery.com/ui/1.11.4/themes/cupertino/jquery-ui.css">
<link href="https://cdnjs.cloudflare.com/ajax/libs/select2/4.0.0/css/select2.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/select2/4.0.0/js/select2.min.js"></script>
<link rel="stylesheet" type="text/css" href="https://d1r6do663ilw4i.cloudfront.net/static/sweetalerts/sweetalert.css">
<script src="https://d1r6do663ilw4i.cloudfront.net/static/sweetalerts/sweetalert.min.js"></script>
First: Make sure the file actually exists at that URL. (Try different browsers, command line tools, ...)
Second: Make sure your ad-blocker/browser-plugins aren't blocking the request.

Genshi auto load css/js need exclude specific file

I'm making a bootstrap theme for Trac installation. This is my first time using Genshi so please be patient :)
So I've following:
<head py:match="head" py:attrs="select('#*')">
${select('*|comment()|text()')}
<link rel="stylesheet" type="text/css" href="${chrome.htdocs_location}css/bootstrap.min.css" />
<link rel="stylesheet" type="text/css" href="${chrome.htdocs_location}css/style.css" />
</head>
This loads my custom css, but JS/css that trac needs to use.
So result is this:
<link rel="help" href="/pixelperfect/wiki/TracGuide" />
<link rel="start" href="/pixelperfect/wiki" />
<link rel="stylesheet" href="/pixelperfect/chrome/common/css/trac.css" type="text/css" />
<link rel="stylesheet" href="/pixelperfect/chrome/common/css/wiki.css" type="text/css" />
<link rel="stylesheet" type="text/css" href="/pixelperfect/chrome/common/css/bootstrap.min.css" />
<link rel="stylesheet" type="text/css" href="/pixelperfect/chrome/common/css/style.css" />
All is good, except that I would like to exclude trac.css out of there completely.
So my question is twofold:
1. How does genshi know what to load? Where is the manfest of all css/js files that it displays.
2. Is it genshi or python doing this?
Any help and relevant reading appreciated! :)
Thanks!
On 1:
The information on CSS files is accumulated in the 'links' dictionary of a request's Chrome property (req.chrome['links']), for JS files it is the 'scripts' dictionary. See add_link and add_script functions from trac.web.chrome respectively.
The default style sheet is added to the Chrome object directly. See the add_stylesheet call in trac.web.chrome.Chrome.prepare_request() method.
On 2:
Its part of the Request object, that is processed by Genshi. Preparation is done in Python anyway, but it is in the Trac Python scripts domain rather than in Genshi Python scripts.

How can I find a file name in a block of text using python?

I have gotten the HTML of a webpage using Python, and I now want to find all of the .CSS files that are linked to in the header. I tried partitioning, as shown below, but I got the error "IndexError: string index out of range" upon running it and save each as its own variable (I know how to do this part).
sytle = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c =1
I do no think that this is the right way to approach this, so would love some advice. Many thanks in advance. Here is a section of the kind of text I am needing to extract .CSS file(s) from.
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
You should use regular expression for this. Try the following:
/href="(.*\.css[^"]*)/g
EDIT
import re
matches = re.findall('href="(.*\.css[^"]*)', html)
print(matches)
My answer is along the same lines as Jon Clements' answer, but I tested mine and added a drop of explanation.
You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach is guaranteed to return the full href attribute of all <link rel="stylesheet"> tags and no others.
from lxml import html
def extract_stylesheets(page_content):
doc = html.fromstring(page_content) # Parse
return doc.xpath('//head/link[#rel="stylesheet"]/#href') # Search
There is no need to check the filenames, since the results of the xpath search are already known to be stylesheet links, and there's no guarantee that the filenames will have a .css extension anyway. The simple regex will catch only a very specific form, but the general html parser solution will also do the right thing in cases such as this, where the regex would fail miserably:
<link REL="stylesheet" hREf =
'/stylesheets/print?1342791421'
media="print"
><!-- link href="/css/stylesheet.css" -->
It could also be easily extended to select only stylesheets for a particular media.
For what it's worth (using lxml.html) as a parsing lib.
untested
import lxml.html
from urlparse import urlparse
sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""
import lxml.html
page = lxml.html.fromstring(html)
link_hrefs = (p.path for p in map(urlparse, page.xpath('//head/link/#href')))
for href in link_hrefs:
if href.rsplit(href, 1)[-1].lower() == 'css': # implement smarter error handling here
pass # do whatever

Categories