Hi, I am trying to parse HTML code.
I am attaching a few lines of the HTML:
<link rel="stylesheet" href="assets/css/fontawesome-min.css">
<link rel="stylesheet" href="assets/css/bootstrap.min.css">
<link rel="stylesheet" href="assets/css/xsIcon.css">
When I load this into BeautifulSoup, it reorders the attributes alphabetically, as shown below:
<link href="assets/css/fontawesome-min.css" rel="stylesheet"/>
<link href="assets/css/bootstrap.min.css" rel="stylesheet"/>
<link href="assets/css/xsIcon.css" rel="stylesheet"/>
You can see the difference: initially rel came before href, but after just loading the file and writing it out again, the order of the attributes changes.
Is there any way to prevent this from happening?
Thanks
From the documentation, you can use a custom HTMLFormatter:
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
txt = '''<link rel="stylesheet" href="assets/css/fontawesome-min.css">
<link rel="stylesheet" href="assets/css/bootstrap.min.css">
<link rel="stylesheet" href="assets/css/xsIcon.css">'''
class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        # Yield attributes in their original order instead of sorting them.
        for k, v in tag.attrs.items():
            yield k, v
soup = BeautifulSoup(txt, 'html.parser')
#before HTMLFormatter
print( soup )
print('-' * 80)
#after HTMLFormatter
print( soup.encode(formatter=UnsortedAttributes()).decode('utf-8') )
Prints:
<link href="assets/css/fontawesome-min.css" rel="stylesheet"/>
<link href="assets/css/bootstrap.min.css" rel="stylesheet"/>
<link href="assets/css/xsIcon.css" rel="stylesheet"/>
--------------------------------------------------------------------------------
<link rel="stylesheet" href="assets/css/fontawesome-min.css"/>
<link rel="stylesheet" href="assets/css/bootstrap.min.css"/>
<link rel="stylesheet" href="assets/css/xsIcon.css"/>
Related
I am trying to create simple HTML reports using Panel and Plotly and push them in an app, however the app refuses the reports because of some external links.
The following simple code shows you these unwanted links.
import panel as pn
tabs = pn.Column()
file_name = "my_file.html"
tabs.save(file_name)
Produces a HTML file containing these links
<title>Panel</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/notyf@3/notyf.min.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.1/css/all.min.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/dataframe.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/json.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/alerts.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/loading.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/markdown.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/card.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/debugger.css" type="text/css" />
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/widgets.css" type="text/css" />
<style>
How can I remove them?
Why would I need them?
I have looked at the Panel documentation but didn't find any relevant parameters that might help me remove these.
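I would consider using BeautifulSoup for this task.
from bs4 import BeautifulSoup
# html is the content of the saved file, read into a string
soup = BeautifulSoup(html, 'html.parser')
for m in soup.find_all('link'):
    m.decompose()  # remove the <link> tag from the tree
print(soup)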
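If only the CDN references should go, the same idea can be narrowed to links with an absolute URL. This is a minimal sketch, not Panel's own mechanism: the `strip_external_links` helper and the sample markup are hypothetical, and removing stylesheets may change how the report renders.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the file Panel wrote; only the <link> tags matter here.
html = """<html><head><title>Panel</title>
<link rel="stylesheet" href="https://unpkg.com/@holoviz/panel@0.13.1/dist/css/card.css" type="text/css" />
<link rel="stylesheet" href="local.css" type="text/css" />
</head><body></body></html>"""


def strip_external_links(markup):
    """Return markup with every <link> whose href is an absolute http(s) URL removed."""
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all("link", href=True):
        if tag["href"].startswith(("http://", "https://")):
            tag.decompose()  # delete the tag from the tree
    return str(soup)


cleaned = strip_external_links(html)
print(cleaned)
```

Relative links such as `local.css` are left untouched, so locally bundled styles keep working.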
I have to download and save web pages from a given URL. I have downloaded the page as well as the required JS and CSS files. The problem is that I also need to change the src and href values of those tags in the HTML source so that the saved page works.
My HTML source is:
<link REL="shortcut icon" href="/commd/favicon.ico">
<script src="/commd/jquery.min.js"></script>
<script src="/commd/jquery-ui.min.js"></script>
<script src="/commd/slimScroll.min.js"></script>
<script src="/commd/ajaxstuff.js"></script>
<script src="/commd/jquery.nivo.slider.pack.js"></script>FCT0505
<script src="/commd/jquery.nivo.slider.pack.js"></script>
<link rel="stylesheet" type="text/css" href="/fonts/stylesheet.cssFCT0505"/>
<link rel="stylesheet" type="text/css" href="/commd/stylesheet.css"/>
<!--[if gte IE 6]>
<link rel="stylesheet" type="text/css" href="/commd/stylesheetIE.css" />
<![endif]-->
<link rel="stylesheet" type="text/css" href="/commd/accordion.css"/>
<link rel="stylesheet" href="/commd/nivo.css" type="text/css" media="screen" />
<link rel="stylesheet" href="/commd/nivo-slider.css" type="text/css" media="screen" />
I have found all the links to CSS and JS files and downloaded them using:
scriptsurl = soup3.find_all("script")
os.chdir(foldername)
for l in scriptsurl:
    if l.get("src") is not None:
        print(l.get("src"))
        script="http://www.iitkgp.ac.in"+l.get("src")
        print(script)
        file=l.get("src").split("/")[-1]
        l.get("src").replaceWith('./foldername/'+file)
        print(file)
        urllib.request.urlretrieve(script,file)
linksurl=soup3.find_all("link")
for l in linksurl:
    if l.get("href") is not None:
        print(l.get("href"))
        css="http://www.iitkgp.ac.in"+l.get("href")
        file=l.get("href").split("/")[-1]
        print(css)
        print(file)
        if(os.path.exists(file)):
            urllib.request.urlretrieve(css,file.split(".")[0]+"(1)."+file.split(".")[-1])
        else:
            urllib.request.urlretrieve(css,file)
os.chdir("..")
Can anyone suggest a way to change the src/href values to local machine paths during these loops? That would be a great help.
This is my first crawling task.
Reading from the documentation:
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:
So writing something like:
l["src"] = os.path.join(os.getcwd(),foldername, file)
instead of
l.get("src").replaceWith('./foldername/'+file)
I believe will do the trick
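Here is a small self-contained sketch of that dictionary-style assignment applied to both tag types. The `localize` helper, the sample markup, and the choice of a relative `./foldername/` path (rather than the absolute path built with os.getcwd() above) are assumptions for illustration only.

```python
import posixpath

from bs4 import BeautifulSoup

# Hypothetical two-tag excerpt in the style of the question's HTML.
html = ('<script src="/commd/jquery.min.js"></script>'
        '<link rel="stylesheet" href="/commd/nivo.css"/>')


def localize(markup, folder):
    """Point every script src and link href at a file inside `folder`."""
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all(["script", "link"]):
        attr = "src" if tag.name == "script" else "href"
        if tag.get(attr) is not None:
            filename = tag[attr].split("/")[-1]
            # Assigning to tag[attr] rewrites the attribute in the tree.
            tag[attr] = posixpath.join(".", folder, filename)
    return soup


soup = localize(html, "foldername")
print(soup)
```

The modified tree still has to be written back to disk, for example with `str(soup)`, for the saved page to use the new paths.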
I have a Flask app that I am using to serve a D3 visualization on Heroku. I have the visualization working here: http://wsankey.github.io/dew_mvp/ and the Heroku app here: https://dew.herokuapp.com/. The problem is that I have static json data that is not being picked up by my javascript--there's not even an error being thrown. The map data is 'us.json' and my data is 'dewmvp1.json.'
Here is the relevant piece of the D3 javascript file:
$(document).ready(function() {
    //Reading map file and data
    queue()
        .defer(d3.json, "/static/us.json")
        .defer(d3.json, "/static/dewmvpv1.json")
        .await(ready);
And my app.py file where I thought I might route the data:
from flask import Flask, render_template, url_for
app = Flask(__name__)
#Main DEW page
@app.route('/')
def home():
    return render_template("index.html")
#About page
@app.route('/about')
def about():
    return render_template("about.html")
#Sending the us.json file
@app.route('/usmap/')
def usmap():
    return "<a href=%s>file</a>" % url_for('static', filename='us.json')
#Sending our data
@app.route('/data/')
def data():
    return "<a href=%s>file</a>" % url_for('static', filename='dewmvp1.json')
if __name__ == '__main__':
    app.run(debug=True)
And the relevant pieces of the index.html where my script.js is being loaded (to give you an idea of how I'm loading it):
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link href="http://d3js.org/queue.v1.min.js" type="text/javascript">
<link href="{{url_for('static', filename='topojson.v1.min.js')}}" type="text/javascript">
<link href="{{url_for('static', filename='underscore-1.6.0.min.js')}}" type="text/javascript">
<link href="{{url_for('static', filename='jquery-1.10.2.min.js')}}" type="text/javascript">
<link href="{{url_for('static', filename='jquery-ui-1.10.4.js')}}" type="text/javascript">
<link href='https://fonts.googleapis.com/css?family=Arvo:400,700|PT+Sans+Caption' rel='stylesheet' type='text/css'>
<link rel="stylesheet" media="screen" href="{{url_for('static', filename='jquery-ui-1.10.4.css')}}">
<link rel="stylesheet" media="screen" href="{{url_for('static', filename='grid.css')}}">
<link rel="stylesheet" media="screen" href="{{url_for('static', filename='layout.css')}}">
<link rel="stylesheet" media="screen" href="{{url_for('static', filename='map.css')}}">
<link rel="stylesheet" media="screen" href="{{url_for('static', filename='normalize.css')}}">
<link rel="stylesheet" media="screen" href="{{url_for('static', filename='elements.css')}}">
<link rel="stylesheet" media="screen" href="{{url_for('static', filename='typography.css')}}">
<link href='https://fonts.googleapis.com/css?family=Raleway' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Merriweather' rel='stylesheet' type='text/css'>
<link href="{{url_for('static', filename='d3.min.js')}}" type="text/javascript">
Thank you!
What was happening here was that I was using <link> in my index.html when I should have been using <script>. The JavaScript files were being linked but not executed. @user2569951 was definitely right to say that I needed the {{ url_for() }} syntax to access the data, but the real issue was more basic. Specifically, where I had:
<link href="{{url_for('static', filename='d3.min.js')}}" type="text/javascript">
I needed instead:
<script src="{{url_for('static', filename='d3.min.js')}}"></script>
That was such a head banger for me but I certainly learned the difference between the tags. This fundamental knowledge is required when managing these files.
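A quick way to spot this mistake in a template is to scan for <link> tags that declare a JavaScript type. The following is only a sketch using Python's standard html.parser; the `MisusedLinkFinder` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class MisusedLinkFinder(HTMLParser):
    """Collect href values of <link> tags that declare a JavaScript type."""

    def __init__(self):
        super().__init__()
        self.suspects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("type") == "text/javascript":
            self.suspects.append(attrs.get("href"))


# Hypothetical head section mixing a misused <link> with correct tags.
sample = """<head>
<link href="/static/d3.min.js" type="text/javascript">
<link rel="stylesheet" href="/static/map.css">
<script src="/static/jquery.js"></script>
</head>"""

finder = MisusedLinkFinder()
finder.feed(sample)
print(finder.suspects)
```

Each reported href is a script that the browser would fetch as a link resource but never execute.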
In your JavaScript, you need to reference the map files the following way for them to be accessible (this only works where the script is rendered as a Jinja template, for example inline in index.html):
$(document).ready(function() {
    //Reading map file and data
    queue()
        .defer(d3.json, "{{url_for('static', filename='us.json')}}")
        .defer(d3.json, "{{url_for('static', filename='dewmvpv1.json')}}")
        .await(ready)
I have gotten the HTML of a webpage using Python, and I now want to find all of the .css files that are linked to in the header and save each as its own variable (I know how to do this part). I tried partitioning, as shown below, but I got the error "IndexError: string index out of range" upon running it.
style = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c =1
I do not think that this is the right way to approach this, so I would love some advice. Many thanks in advance. Here is a section of the kind of text I need to extract .css file(s) from.
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
You should use regular expression for this. Try the following:
/href="(.*\.css[^"]*)/g
EDIT
import re
matches = re.findall(r'href="(.*\.css[^"]*)', html)
print(matches)
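As a hedged variation on the pattern above: the greedy `.*` can run past a closing quote when another `.css` appears later on the same line, so this sketch restricts the match to characters inside the quotes. The `CSS_HREF` name and the trimmed sample are illustrative assumptions, and it still only handles double-quoted, lowercase href attributes.

```python
import re

# Three of the <link> lines from the question, used as sample input.
sample = '''<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />'''

# [^"]* cannot cross a double quote, so each match stays inside one href value.
CSS_HREF = re.compile(r'href="([^"]*\.css[^"]*)"')

css_links = CSS_HREF.findall(sample)
print(css_links)
```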
My answer is along the same lines as Jon Clements' answer, but I tested mine and added a drop of explanation.
You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach is guaranteed to return the full href attribute of all <link rel="stylesheet"> tags and no others.
from lxml import html
def extract_stylesheets(page_content):
    doc = html.fromstring(page_content) # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href') # Search
There is no need to check the filenames, since the results of the xpath search are already known to be stylesheet links, and there's no guarantee that the filenames will have a .css extension anyway. The simple regex will catch only a very specific form, but the general html parser solution will also do the right thing in cases such as this, where the regex would fail miserably:
<link REL="stylesheet" hREf =
'/stylesheets/print?1342791421'
media="print"
><!-- link href="/css/stylesheet.css" -->
It could also be easily extended to select only stylesheets for a particular media.
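To illustrate the media filter without requiring lxml, here is a sketch that swaps in Python's standard html.parser (not the lxml/XPath approach described above). The `StylesheetCollector` class and `stylesheets` helper are hypothetical names, and the second <link> in the sample was added for the demonstration.

```python
from html.parser import HTMLParser


class StylesheetCollector(HTMLParser):
    """Collect hrefs of <link rel="stylesheet"> tags, optionally for one medium."""

    def __init__(self, media=None):
        super().__init__()
        self.media = media
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "stylesheet" not in rel or not attrs.get("href"):
            return
        media = [m.strip().lower() for m in (attrs.get("media") or "").split(",")]
        if self.media is None or self.media in media:
            self.hrefs.append(attrs["href"])


tricky = """<link REL="stylesheet" hREf =
'/stylesheets/print?1342791421'
media="print"
><!-- link href="/css/stylesheet.css" -->
<link rel="stylesheet" href="/stylesheets/master.css" media="screen, projection">"""


def stylesheets(markup, media=None):
    collector = StylesheetCollector(media)
    collector.feed(markup)
    collector.close()
    return collector.hrefs


print(stylesheets(tricky))
print(stylesheets(tricky, media="print"))
```

The commented-out link is skipped because the parser treats it as a comment rather than a tag.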
For what it's worth, here is a version using lxml.html as the parsing library (untested):
import lxml.html
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""
page = lxml.html.fromstring(sample_html)
# Keep only the path part of each href, dropping any query string
link_hrefs = (p.path for p in map(urlparse, page.xpath('//head/link/@href')))
for href in link_hrefs:
    if href.rsplit('.', 1)[-1].lower() == 'css': # implement smarter error handling here
        pass # do whatever
I want to find all stylesheet definitions in an XHTML file with lxml.etree.findall. This could be as simple as
elems = tree.findall('link[@rel="stylesheet"]') + tree.findall('style')
But the problem with CSS style definitions is that the order matters, e.g.
<link rel="stylesheet" type="text/css" href="/media/css/first.css" />
<style>body:{font-size: 10px;}</style>
<link rel="stylesheet" type="text/css" href="/media/css/second.css" />
If the contents of the style tag are applied after the rules in the two link tags, the result may be completely different from the one where the rules are applied in order of definition.
So, how would I do a lookup that includes both link[@rel="stylesheet"] and style?
This is possible using XPath:
data = """<link rel="stylesheet" type="text/css" href="/media/css/first.css" />
<style>body:{font-size: 10px;}</style>
<link rel="stylesheet" type="text/css" href="/media/css/second.css" />
"""
from lxml import etree
h = etree.HTML(data)
h.xpath('//link[@rel="stylesheet"]|//style')
[<Element link at 97a007c>,
<Element style at 97a002c>,
<Element link at 97a0054>]
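The same document-order lookup can be sketched with the standard library, swapping in xml.etree.ElementTree for lxml. This assumes well-formed XHTML without a namespace; the `style_elements` helper, the wrapping <head> element, and the extra rel="alternate" link are hypothetical additions for the demonstration.

```python
import xml.etree.ElementTree as ET

xhtml = """<head>
<link rel="stylesheet" type="text/css" href="/media/css/first.css" />
<style>body {font-size: 10px;}</style>
<link rel="stylesheet" type="text/css" href="/media/css/second.css" />
<link rel="alternate" href="/feed.rss" />
</head>"""

root = ET.fromstring(xhtml)


def style_elements(root):
    """Return <style> and <link rel="stylesheet"> elements in document order."""
    found = []
    for elem in root.iter():  # iter() walks the tree in document order
        if elem.tag == "style":
            found.append(elem)
        elif elem.tag == "link" and elem.get("rel") == "stylesheet":
            found.append(elem)
    return found


ordered = [(e.tag, e.get("href")) for e in style_elements(root)]
print(ordered)
```

With namespaced XHTML the tag names would carry a `{http://www.w3.org/1999/xhtml}` prefix, so the comparisons would need adjusting.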