I have to download and save webpages from given URLs. I have downloaded a page as well as the required JS and CSS files. But the problem is that I also have to change the src and href values of those tags in the HTML source file so the saved copy works locally.
My HTML source is:
<link REL="shortcut icon" href="/commd/favicon.ico">
<script src="/commd/jquery.min.js"></script>
<script src="/commd/jquery-ui.min.js"></script>
<script src="/commd/slimScroll.min.js"></script>
<script src="/commd/ajaxstuff.js"></script>
<script src="/commd/jquery.nivo.slider.pack.js"></script>
<link rel="stylesheet" type="text/css" href="/fonts/stylesheet.css"/>
<link rel="stylesheet" type="text/css" href="/commd/stylesheet.css"/>
<!--[if gte IE 6]>
<link rel="stylesheet" type="text/css" href="/commd/stylesheetIE.css" />
<![endif]-->
<link rel="stylesheet" type="text/css" href="/commd/accordion.css"/>
<link rel="stylesheet" href="/commd/nivo.css" type="text/css" media="screen" />
<link rel="stylesheet" href="/commd/nivo-slider.css" type="text/css" media="screen" />
I have found all the links to the CSS and JS files and downloaded them using:
scriptsurl = soup3.find_all("script")
os.chdir(foldername)
for l in scriptsurl:
    if l.get("src") is not None:
        print(l.get("src"))
        script = "http://www.iitkgp.ac.in" + l.get("src")
        print(script)
        file = l.get("src").split("/")[-1]
        l.get("src").replaceWith('./foldername/' + file)
        print(file)
        urllib.request.urlretrieve(script, file)
linksurl = soup3.find_all("link")
for l in linksurl:
    if l.get("href") is not None:
        print(l.get("href"))
        css = "http://www.iitkgp.ac.in" + l.get("href")
        file = l.get("href").split("/")[-1]
        print(css)
        print(file)
        if os.path.exists(file):
            urllib.request.urlretrieve(css, file.split(".")[0] + "(1)." + file.split(".")[-1])
        else:
            urllib.request.urlretrieve(css, file)
os.chdir("..")
Can anyone suggest a way to change the src/href values to local machine paths during these loop executions? That would be a great help.
This is my first crawling task.
Reading from the documentation:
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:
So writing something like:
l["src"] = os.path.join(os.getcwd(),foldername, file)
instead of
l.get("src").replaceWith('./foldername/'+file)
I believe will do the trick
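To see the dictionary-style assignment in action inside a loop like the one in the question, here is a minimal self-contained sketch (the folder name is made up and the HTML is trimmed down; the network download via urlretrieve is omitted so the snippet runs offline):

```python
from bs4 import BeautifulSoup

html = '''<script src="/commd/jquery.min.js"></script>
<link rel="stylesheet" type="text/css" href="/commd/stylesheet.css"/>'''

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script"):
    if tag.get("src") is not None:
        file = tag["src"].split("/")[-1]
        tag["src"] = "./foldername/" + file  # dictionary-style assignment rewrites the attribute in place
for tag in soup.find_all("link"):
    if tag.get("href") is not None:
        file = tag["href"].split("/")[-1]
        tag["href"] = "./foldername/" + file

print(soup)
```

Writing `str(soup)` back out to the saved HTML file then gives you a page whose tags point at the local copies.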
I am trying to create simple HTML reports using Panel and Plotly and push them to an app, but the app rejects the reports because of some external links.
The following minimal code reproduces these unwanted links.
import panel as pn
tabs = pn.Column()
file_name = "my_file.html"
tabs.save(file_name)
This produces an HTML file containing these links:
<title>Panel</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/notyf@3/notyf.min.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.1/css/all.min.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/dataframe.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/json.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/alerts.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/loading.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/markdown.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/card.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/debugger.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/widgets.css" type="text/css" />
<style>
How can I remove them?
Why would I need them?
I have looked at the Panel documentation but didn't find any relevant parameter that might help me remove these...
I would consider using BeautifulSoup for this task.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the saved page source
for m in soup.find_all('link'):
    m.unwrap()  # modern name for replaceWithChildren(); <link> has no children, so the tag is simply removed
print(soup)
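If only the external links need to go (for instance to make the report self-contained while keeping local stylesheets), a variant of the same idea drops just the tags whose href points at another host. The sample URLs below are made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''<head>
<link rel="stylesheet" href="https://cdn.example.com/x.css"/>
<link rel="stylesheet" href="local.css"/>
</head>'''

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("link", href=True):
    if link["href"].startswith(("http://", "https://")):
        link.decompose()  # remove only the external stylesheet links

print(soup)
```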
Hi, I am trying to parse HTML code.
I am attaching a few lines of it:
<link rel="stylesheet" href="assets/css/fontawesome-min.css">
<link rel="stylesheet" href="assets/css/bootstrap.min.css">
<link rel="stylesheet" href="assets/css/xsIcon.css">
When I load this into BeautifulSoup and write it out again, it reorders the attributes alphabetically, as in the code below:
<link href="assets/css/fontawesome-min.css" rel="stylesheet"/>
<link href="assets/css/bootstrap.min.css" rel="stylesheet"/>
<link href="assets/css/xsIcon.css" rel="stylesheet"/>
You can see the difference: initially rel came before href, but after just loading and writing the file again, the order of the attributes changes.
Is there any way to prevent this from happening?
Thanks
From the documentation, you can use a custom HTMLFormatter:
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
txt = '''<link rel="stylesheet" href="assets/css/fontawesome-min.css">
<link rel="stylesheet" href="assets/css/bootstrap.min.css">
<link rel="stylesheet" href="assets/css/xsIcon.css">'''
class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v
soup = BeautifulSoup(txt, 'html.parser')
#before HTMLFormatter
print( soup )
print('-' * 80)
#after HTMLFormatter
print( soup.encode(formatter=UnsortedAttributes()).decode('utf-8') )
Prints:
<link href="assets/css/fontawesome-min.css" rel="stylesheet"/>
<link href="assets/css/bootstrap.min.css" rel="stylesheet"/>
<link href="assets/css/xsIcon.css" rel="stylesheet"/>
--------------------------------------------------------------------------------
<link rel="stylesheet" href="assets/css/fontawesome-min.css"/>
<link rel="stylesheet" href="assets/css/bootstrap.min.css"/>
<link rel="stylesheet" href="assets/css/xsIcon.css"/>
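The same formatter can be passed when serializing back out, for example before writing the file to disk. A minimal self-contained round trip (repeating the formatter class so the snippet runs on its own):

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v  # keep source order instead of sorting

soup = BeautifulSoup('<link rel="stylesheet" href="a.css">', 'html.parser')
out = soup.decode(formatter=UnsortedAttributes())
print(out)
```

`out` can then be written with a plain `open(...).write(out)`, and the attribute order survives the round trip.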
After setting up HTTPS on a site, some of the JavaScript libraries are not loading while others are. In this case, the select2 library is not loading. Why would this be?
Head extract
<head>
<link rel="stylesheet" href="https://yui.yahooapis.com/pure/0.6.0/pure-min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
<script src="https://code.jquery.com/ui/1.11.4/jquery-ui.js"></script>
<link rel="stylesheet" href="https://code.jquery.com/ui/1.11.4/themes/cupertino/jquery-ui.css">
<link href="https://cdnjs.cloudflare.com/ajax/libs/select2/4.0.0/css/select2.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/select2/4.0.0/js/select2.min.js"></script>
<link rel="stylesheet" type="text/css" href="https://d1r6do663ilw4i.cloudfront.net/static/sweetalerts/sweetalert.css">
<script src="https://d1r6do663ilw4i.cloudfront.net/static/sweetalerts/sweetalert.min.js"></script>
First: Make sure the file actually exists at that URL. (Try different browsers, command line tools, ...)
Second: Make sure your ad-blocker/browser-plugins aren't blocking the request.
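For the first check, the URL can also be probed programmatically. A minimal sketch using urllib (the timeout value is arbitrary):

```python
from urllib.error import URLError
from urllib.request import urlopen

def resource_loads(url):
    """Return True if the resource at `url` can actually be fetched."""
    try:
        with urlopen(url, timeout=10) as resp:
            resp.read(1)  # force at least one byte to be retrieved
            return True
    except (URLError, ValueError):
        return False
```

If this returns True for the http:// form of a URL but False for the https:// form, the CDN likely doesn't serve the file over TLS, and the browser silently blocks it as mixed content.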
I'm making a bootstrap theme for Trac installation. This is my first time using Genshi so please be patient :)
So I have the following:
<head py:match="head" py:attrs="select('@*')">
${select('*|comment()|text()')}
<link rel="stylesheet" type="text/css" href="${chrome.htdocs_location}css/bootstrap.min.css" />
<link rel="stylesheet" type="text/css" href="${chrome.htdocs_location}css/style.css" />
</head>
This loads my custom CSS, but also the JS/CSS that Trac needs to use.
So result is this:
<link rel="help" href="/pixelperfect/wiki/TracGuide" />
<link rel="start" href="/pixelperfect/wiki" />
<link rel="stylesheet" href="/pixelperfect/chrome/common/css/trac.css" type="text/css" />
<link rel="stylesheet" href="/pixelperfect/chrome/common/css/wiki.css" type="text/css" />
<link rel="stylesheet" type="text/css" href="/pixelperfect/chrome/common/css/bootstrap.min.css" />
<link rel="stylesheet" type="text/css" href="/pixelperfect/chrome/common/css/style.css" />
All is good, except that I would like to exclude trac.css out of there completely.
So my question is twofold:
1. How does genshi know what to load? Where is the manifest of all CSS/JS files that it displays?
2. Is it genshi or python doing this?
Any help and relevant reading appreciated! :)
Thanks!
On 1:
The information on CSS files is accumulated in the 'links' dictionary of a request's Chrome property (req.chrome['links']), for JS files it is the 'scripts' dictionary. See add_link and add_script functions from trac.web.chrome respectively.
The default style sheet is added to the Chrome object directly. See the add_stylesheet call in trac.web.chrome.Chrome.prepare_request() method.
On 2:
It's part of the Request object that is processed by Genshi. The preparation is done in Python either way, but it happens in Trac's Python code rather than in Genshi's.
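If the goal is just to drop trac.css before rendering, one option is to filter it out of that accumulated data. Below is a sketch against a stand-in dict shaped like req.chrome['links']; the exact structure should be verified against your Trac version before relying on it:

```python
# Stand-in for the data accumulated on req.chrome['links'] by add_stylesheet;
# the real structure should be checked against your Trac version.
chrome = {
    'links': {
        'stylesheet': [
            {'href': '/pixelperfect/chrome/common/css/trac.css'},
            {'href': '/pixelperfect/chrome/common/css/wiki.css'},
        ]
    }
}

# Drop trac.css before the template is rendered.
chrome['links']['stylesheet'] = [
    link for link in chrome['links']['stylesheet']
    if not link['href'].endswith('trac.css')
]

print([link['href'] for link in chrome['links']['stylesheet']])
```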
I have gotten the HTML of a webpage using Python, and I now want to find all of the .css files that are linked to in the header and save each as its own variable (I know how to do that part). I tried partitioning, as shown below, but got the error "IndexError: string index out of range" upon running it.
sytle = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c =1
I do not think that this is the right way to approach this, so I would love some advice. Many thanks in advance. Here is a section of the kind of text I need to extract the .css file(s) from.
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
You should use a regular expression for this. Try the following:
/href="(.*\.css[^"]*)/g
EDIT
import re
matches = re.findall(r'href="(.*\.css[^"]*)', html)
print(matches)
My answer is along the same lines as Jon Clements' answer, but I tested mine and added a drop of explanation.
You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach is guaranteed to return the full href attribute of all <link rel="stylesheet"> tags and no others.
from lxml import html
def extract_stylesheets(page_content):
    doc = html.fromstring(page_content)  # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href')  # Search
There is no need to check the filenames, since the results of the xpath search are already known to be stylesheet links, and there's no guarantee that the filenames will have a .css extension anyway. The simple regex will catch only a very specific form, but the general html parser solution will also do the right thing in cases such as this, where the regex would fail miserably:
<link REL="stylesheet" hREf =
'/stylesheets/print?1342791421'
media="print"
><!-- link href="/css/stylesheet.css" -->
It could also be easily extended to select only stylesheets for a particular media.
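To illustrate, the function can be run against exactly that messy fragment. The function is repeated here so the snippet is self-contained:

```python
from lxml import html

def extract_stylesheets(page_content):
    doc = html.fromstring(page_content)  # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href')  # Search

page = '''<html><head>
<link REL="stylesheet" hREf =
 '/stylesheets/print?1342791421'
 media="print"
><!-- link href="/css/stylesheet.css" -->
</head><body></body></html>'''

# The parser lowercases REL/hREf, and the commented-out link is ignored.
print(extract_stylesheets(page))  # ['/stylesheets/print?1342791421']
```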
For what it's worth, here is an approach using lxml.html as the parsing lib (untested):
import lxml.html
from urllib.parse import urlparse

sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""

page = lxml.html.fromstring(sample_html)
# urlparse strips the ?cache-buster query so the extension check works
link_hrefs = (p.path for p in map(urlparse, page.xpath('//head/link/@href')))
for href in link_hrefs:
    if href.rsplit('.', 1)[-1].lower() == 'css':  # implement smarter error handling here
        pass  # do whatever