This is part of my html code:
<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />
I need to find the hrefs of all the stylesheets.
I tried a regular expression like:
<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>
The full code is:
import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />'''
real_viraz = '''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I|re.DOTALL)
print r
But the problem is that rel='stylesheet' and href='...' can appear in any order inside <link ...>, and almost anything else can sit between them.
Please help me find the right regular expression. Thanks.
Somehow, your name looks like the power automation tool Sikuli :)
If you are trying to parse HTML/XML-based text in Python, BeautifulSoup (see its documentation) is an extremely powerful library to help you with that. Otherwise you are indeed reinventing the wheel (an interesting story from Randy Sargent).
from bs4 import BeautifulSoup

# in case you need to fetch the page first:
#import urllib2
#url = "http://selenium-python.readthedocs.org/en/latest/"
#text = urllib2.urlopen(url).read()

text = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" /><link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' /><link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />"""
soup = BeautifulSoup(text)
links = soup.find_all("link", {"rel": "stylesheet"})
for link in links:
    try:
        print link['href']
    except KeyError:
        # skip any stylesheet link without an href attribute
        pass
the output is:
catalog/view/theme/default/stylesheet/stylesheet.css
http://1
http://2
Learn BeautifulSoup well and you will be ready to parse anything in HTML or XML.
(You might also want to put Selenium, Scrapy into your toolbox in the future.)
Short answer: Don't use regular expressions to parse (X)HTML, use a (X)HTML parser.
In Python, this would be lxml. You can parse the HTML with lxml's HTML parser, use an XPath query to get all the link elements, and collect their href attributes:
from lxml import etree

parser = etree.HTMLParser()
doc = etree.parse(open('sample.html'), parser)
links = doc.xpath("//head/link[@rel='stylesheet']")
hrefs = [l.attrib['href'] for l in links]
print hrefs
Output:
['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']
I'm amazed by the many developers here on Stack Exchange who insist on using outside modules over the re module for obtaining data and parsing strings, HTML, and CSS. Nothing works more efficiently or faster than re.
These two lines not only grab the CSS stylesheet path, but also grab several paths if there is more than one stylesheet, and place them into a Python list ready for processing and/or for a urllib request.
import re

# t holds the HTML source as a string
a = re.findall('link rel="stylesheet" href=".*?"', t)
a = str(a)
Also, a note for those unaware of what most developers know as the HTML comment markers:
<!-- stuff here -->
These allow re to process and grab data at will from HTML or CSS, and/or to remove chunks of pesky JavaScript for testing browser capabilities in a single pass, as shown below.
txt = re.sub('<script>', '<!--', txt)
txt = re.sub('</script>', '-->', txt)
txt = re.sub('<!--.*?-->', '', txt, flags=re.DOTALL)
Python retains the regular expression behavior of native C, so use it, people. That's what it's for, and nothing is as slow as BeautifulSoup and HTMLParser.
Use the re module to grab your data from HTML tags as well as CSS, or from anything a string can contain. If you have a problem with a variable not being of type string, make it a string with a single tiny line of code:
var = str(var)
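For the attribute-order problem in the original question, a plain regex can still cope if you split the job into two steps: first match each <link ...> tag, then pull rel and href out of each tag separately. A minimal sketch (not hardened against every malformed document):
import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />'''

hrefs = []
for tag in re.findall(r'<link\b[^>]*>', body, re.I):
    # keep only tags that declare rel=stylesheet, wherever that attribute sits
    if re.search(r'''rel\s*=\s*["']stylesheet["']''', tag, re.I):
        m = re.search(r'''href\s*=\s*["'](.*?)["']''', tag, re.I)
        if m:
            hrefs.append(m.group(1))
print hrefs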
I am new to using BeautifulSoup.
I have a line in an HTML file that is stored locally.
<LINK rel="stylesheet" type="text/css" href="report.css" >
I wish to remove that line, but I don't know what approach to use to find the line and remove it.
I can find the line using: old_text = soup.find("link", {"href": "report.css"})
But I can't work out how to remove it and save the file again.
You could use .decompose() to get rid of the tag:
soup.find("link", {"href": "report.css"}).decompose()
or
soup.select_one('link[href^="report."]').decompose()
and convert the BeautifulSoup object back to a string to save it:
str(soup)
Example
from bs4 import BeautifulSoup
html = '''
<some tag>some content</some tag>
<LINK rel="stylesheet" type="text/css" href="report.css" >
<some tag>some content</some tag>
'''
soup = BeautifulSoup(html, "html.parser")
soup.select_one('link[href^="report."]').decompose()
print(str(soup))
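To actually write the modified markup back to the local file, something along these lines should work (report.html is a hypothetical filename here):
# write the modified soup back to disk ('report.html' stands in for your file)
with open('report.html', 'w') as f:
    f.write(str(soup))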
When trying to parse an email sent using MS Outlook, I want to be able to strip the annoying Microsoft XML tags that it has added. One such example is the o:p tag. When trying to use Python's BeautifulSoup to parse an email as HTML, it can't seem to find these specialty tags.
For example:
from bs4 import BeautifulSoup
textToParse = """
<html>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "html5lib")
body = soup.find('body')
for otag in body.find_all('o'):
    print(otag)
for otag in body.find_all('o:p'):
    print(otag)
This outputs no text to the console, but if I switch the find_all call to search for p, it outputs the p node as expected.
How come these custom tags do not seem to work?
It's a namespace issue. Apparently, BeautifulSoup does not consider custom namespaces valid when the document is parsed with "html5lib".
You can work around this with a regular expression, which, strangely, does work correctly:
import re

print(soup.find_all(re.compile('o:p')))
>>> [<o:p>This should go</o:p>]
but the "proper" solution is to change the parser to "lxml-xml" and introducing o: as a valid namespace.
from bs4 import BeautifulSoup
textToParse = """
<html xmlns:o='dummy_url'>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "lxml-xml")
body = soup.find('body')
print ('this should find nothing')
for otag in body.find_all('o'):
    print(otag)
print ('this should find o:p')
for otag in body.find_all('o:p'):
    print(otag)
>>>
this should find nothing
this should find o:p
<o:p>This should go</o:p>
I am currently writing a Python parser to automatically extract some information from a website. I am using mechanize to browse the website, and I obtain the following HTML code:
<html>
<head>
<title>
XXXXX
</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8; no-cache;" />
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<link rel="stylesheet" href="/rr/style_other.css" type="text/css" />
</head>
<frameset cols="*,370" border="1">
<frame src="affiche_cp.php?uid=yyyyyyy&type=entree" name="cdrg" />
<frame src="affiche_bp.php?uid=yyyyyyy&type=entree" name="cdrd" />
</frameset>
</html>
I want to access both frames:
in cdrd I must fill some forms and submit
in cdrg I will obtain the result of the submission
How can I do this?
Personally, I do not use BeautifulSoup for parsing HTML; I use PyQuery instead, which is similar, but I prefer its CSS selector syntax over XPath. I also use Requests to make HTTP requests.
That alone is enough to scrape data and submit requests. It can do what you want. I understand this probably isn't the answer you're looking for, but it might very well be useful to you.
Scraping Frames with PyQuery
import requests
import pyquery
response = requests.get('http://example.com')
dom = pyquery.PyQuery(response.text)
frames = dom('frame')
frame_one = frames[0]
frame_two = frames[1]
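The frame elements themselves are just pointers to other documents; to work with their contents you would request each frame's src separately. A rough sketch (example.com and the relative-URL handling are assumptions):
import urlparse

# follow a frame by requesting the document its src attribute points to
frame_url = urlparse.urljoin('http://example.com', frame_one.get('src'))
frame_response = requests.get(frame_url)
frame_dom = pyquery.PyQuery(frame_response.text)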
Making HTTP Requests
import requests
response = requests.post('http://example.com/signup', data={
    'username': 'someuser',
    'password': 'secret',
})
response_text = response.text
data is a dictionary with the POST data to submit to the forms. You should use Chrome's network inspector, Fiddler, or Burp Suite to monitor requests. While monitoring, manually submit both forms, then inspect the HTTP requests and recreate them using Requests.
Hope that helps a little. I work in this field, so if you require any more information feel free to hit me up.
The solution to my issue was to load the frame containing the form and fill it in directly. Then I load the other frame, read it, and obtain the results associated with the form submission.
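Since frames are just separate documents, that approach can be sketched with mechanize roughly like this (the URLs and the form field name are hypothetical, modeled on the markup above):
import mechanize

br = mechanize.Browser()

# open the frame that holds the form directly, instead of the frameset page
br.open('http://example.com/affiche_bp.php?uid=yyyyyyy&type=entree')
br.select_form(nr=0)          # assumes the form is the first one in the frame
br['some_field'] = 'value'    # hypothetical field name
br.submit()

# then open the other frame and read the result of the submission
response = br.open('http://example.com/affiche_cp.php?uid=yyyyyyy&type=entree')
print response.read()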
Consider this HTML:
<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>
I am using lxml (Python) and XPath, trying to extract both the content of the title tag and of the link tag.
The code is:
import urllib
from lxml import etree

page = urllib.urlopen(url).read()
x = etree.HTML(page)
titles = x.xpath('//item/title/text()')
links = x.xpath('//item/link/text()')
But links comes back as an empty list. However, the following does return a link element:
links=x.xpath('//item/link') #returns <Element link at 0xb6b0ae0c>
Can anyone suggest how to extract the urls from the link tag?
You are using the wrong parser for the job; you don't have HTML, you have XML.
A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty.
Use the etree.parse() function to parse your URL stream (no separate .read() call needed):
response = urllib.urlopen(url)
tree = etree.parse(response)
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')
You could also use etree.fromstring(page) but leaving the reading to the parser is easier.
When the content is parsed as HTML by etree, the <link> tag gets closed immediately, so no text value is present for the link tag.
Demo:
>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>
According to HTML, this is not a valid use of the tag.
The link tag structure in HTML is like:
<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>
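As an aside, if you do stay with the HTML parser, the URL text is not actually lost: as the demo above shows, it ends up as the .tail of the self-closed <link/> element, so it can still be recovered:
from lxml import etree

content = """<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

x = etree.HTML(content)
# the text after the self-closed <link/> becomes the element's tail
links = [l.tail.strip() for l in x.xpath('//item/link') if l.tail]
print links  # ['www.linktoawebsite.com']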
I need a small script in Python that reads a custom section of a web page.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page  # this prints the whole page source with HTML tags,
                # but I need to read only the section from <head> to </head>
# for example, the http://target.com source is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>
How do I read this custom section?
To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard or proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a head start, you have the_page, which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse HTML/XML pretty easily, and will try to help out when documents are broken or invalid.
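For example, a minimal sketch of how it copes with deliberately broken markup (the unclosed tags are intentional):
from BeautifulSoup import BeautifulSoup

# deliberately broken markup: <head>, <body>, and <p> are never closed
broken = '<html><head><title>Oops</title><body><p>unclosed paragraph'
soup = BeautifulSoup(broken)
print soup.prettify()  # the missing closing tags are filled in for us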