I have an HTML file and I want to loop through the content, remove all the attributes from the tags, and only display the tags.
For example:
<div class="content"></div>
<div id="content"></div>
<p> test</p>
<h1>tt</h1>
the output should be:
<div></div>
<div></div>
<p> </p>
<h1></h1>
At the moment I can display all tags with all the attributes, but I only want to display the tags without the attributes.
import re

with open('myfile.html') as file:
    readtext = file.read()

tags = re.findall(r'<[^>]+>', readtext)
for data in tags:
    print(data)
I think the easiest way to do this is to parse the HTML, e.g. with BeautifulSoup. Here is an answer that shows how to solve your problem using that: https://stackoverflow.com/a/9045719/5251061
Also, take a look at this gist: https://gist.github.com/revotu/21d52bd20a073546983985ba3bf55deb
Basically, after parsing your file you can do something like this:
from bs4 import BeautifulSoup
# remove all attributes
def _remove_all_attrs(soup):
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup
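For completeness, a minimal sketch of how you might apply that idea to your file and print only the bare tags (the file name is taken from your snippet; the html.parser choice is an assumption):
from bs4 import BeautifulSoup

with open('myfile.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for tag in soup.find_all(True):
    tag.attrs = {}                       # drop every attribute
    print(f"<{tag.name}></{tag.name}>")  # display just the bare tag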
BeautifulSoup's get_text() function only records the textual information of an HTML webpage. However, I want my program to return the href link of an <a> tag in parentheses directly after it returns the actual text.
In other words, using get_text() will just return "17.602" on the following HTML:
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
However, I want my program to return "17.602 (17.602.html#FAR_17_602)". How would I go about doing this?
EDIT: What if you need to print text from other tags, such as:
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
In other words, how would you compose a program that would print
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
You can format the output using f-strings.
Access the tag's text using .text, and then access the href attribute.
from bs4 import BeautifulSoup
html = """
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.text} ({a_tag['href']})")
Output:
17.602 (17.602.html#FAR_17_602)
Edit: You can use .next_sibling and .previous_sibling
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
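For the edited question, a minimal sketch that parses the <p> example and reuses the same pattern (the html string below just reproduces your edited markup):
from bs4 import BeautifulSoup

html = """
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
which prints the combined line shown above.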
I have an html document that was saved from MS Word, and now it has some tags related to MS Word. I don't need to keep any backwards compatibility with it, I just need to extract the contents from that file. The problem is that the Word-specific tags are not removed so easily.
I have this code:
from bs4 import BeautifulSoup, NavigableString
def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replaceWith(s)
    return soup
It removes the unneeded tags, but some are left even after using this method.
For example, look at this:
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">Some text -
some content<o:p></o:p></SPAN></P>
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">some text2 -
647894654<o:p></o:p></SPAN></P>
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">some text3 -
some content blabla<o:p></o:p></SPAN></P>
This is how it looks inside the html document. When I use the method like this:
invalid_tags = ['span']
stripped = strip_tags(html_file, invalid_tags)
print stripped
It prints like this:
<p class="MsoNormal">Some text -
some content<html><body><o:p></o:p></body></html></p>
<p class="MsoNormal">some text2 -
647894654<html><body><o:p></o:p></body></html></p>
<p class="MsoNormal">some text3 -
some content blabla<html><body><o:p></o:p></body></html></p>
As you can see, for some reason html and body tags appeared there even though they do not exist in the html. If I use invalid_tags = ['span', 'o:p'], it removes the <o:p></o:p> tags, but if I add html or body to the list, nothing happens and they are still kept there.
P.S. I can remove the html tags if I directly change where the method looks for tags, for example by adding soup = soup.body before findAll is used. But even after this, the body tags are left hanging in those specific paragraphs.
You can try this:
from bs4 import BeautifulSoup

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)
    for t in invalid_tags:
        tag = soup.find_all(t)
        if tag:
            for item in tag:
                item.unwrap()
    return str(soup)
Then you just need to strip the html and body tags.
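For example (a minimal sketch; the file name is hypothetical and the tag list is an assumption based on your markup), you can unwrap the Word-specific tags together with html and body in one pass:
from bs4 import BeautifulSoup

def strip_tags(html, invalid_tags):
    # "html.parser" also avoids re-adding the <html>/<body> wrappers in the first place
    soup = BeautifulSoup(html, "html.parser")
    for t in invalid_tags:
        for item in soup.find_all(t):
            item.unwrap()
    return str(soup)

html_file = open('word_export.html').read()
print(strip_tags(html_file, ['span', 'o:p', 'html', 'body']))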
I am trying to convert a structure like this (some nested xml/html)
<div>a comment
<div>an answer</div>
<div>an answer
<div>a reply</div>
...
</div>
...
</div>
...
Clarification: it can be formatted like <div>a comment<div>an answer</div></div> or in any other way (not prettified, etc.)
(which has multiple nodes at different depths)
to a corresponding list structure with parent <ul> tags (i.e. an ordinary html list)
<ul>
<li>1
<ul>
<li>2</li>
...
</ul>
</li>
...
</ul>
I tried to use BeautifulSoup like this:
from bs4 import BeautifulSoup as BS

bs = BS(source_xml)
for i in bs.find_all('div'):
    i.name = 'li'
    # but it only replaces div tags with li tags, I still need to add the ul tags
I can iterate through the nesting levels like this, but I still can't figure out how to separate a group of tags located on the same level so that I can add the ul tag around them:
for i in bs.find_all('div', recursive=False):
    # how to wrap the following iterated items in a 'ul' tag?
    for j in i.find_all('div', recursive=False):
        ...
How can one add the <ul> tags in the right places? (I don't care about pretty printing etc., I just need a valid html structure with ul and li tags, thanks.)
Depending on the way the HTML is formatted, you can just search for an opening tag with no closing tag (this would be the beginning of a ul), an opening and closing tag together (this would be an li), or just a closing tag (this would be the end of a ul). Something similar to the code below. To make this more robust you could use BeautifulSoup's NavigableString.
x = """<div>a comment
<div>an answer</div>
<div>an answer
<div>a reply</div>
</div>
</div>"""
xs = x.split("\n")
for tag in xs:
if "<div" in tag and "</div" in tag:
soup = BeautifulSoup(tag)
html = "{}\n{}".format(html, "<li>{}</li>".format(soup.text))
elif "<div" in tag:
html = "{}\n{}".format(html, "<ul>\n<li>{}</li>".format(tag[tag.find(">") + 1:]))
elif "</div" in tag:
html = "{}\n{}".format(html, "</ul>")
I have some html documents and I want to extract a very particular piece of text from them.
Now, this text is always located as
<div class = "fix">text </div>
Now, sometimes what happens is... there are other opening divs as well... something like:
<div class = "fix"> part of text <div something> other text </div> some more text </div>
Now, I want to extract all the text corresponding to the
<div class = "fix"> </div> markup.
How do I do this?
I would use the BeautifulSoup library. It's pretty much built for this: as long as your data is valid html, it should find exactly what you're looking for. It has reasonably good documentation, and it's extremely straightforward, even for beginners. If your file is on the web somewhere where you can't access the html directly, grab the html with urllib.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.find({"class":"fix"})
If there is more than one matching item, use find_all instead. This should give you what you're looking for (roughly).
Edit: Fixed example (class is a keyword, so you can't use the usual attr="blah" form).
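To actually pull out the text (including text inside any nested tags), you can then call get_text() on the result. A small sketch using the markup from the question:
from bs4 import BeautifulSoup

html_doc = '<div class = "fix"> part of text <div something> other text </div> some more text </div>'
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find("div", {"class": "fix"})
print(div.get_text())  # prints the div's text plus the text of anything nested inside it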
Here's a really simple solution that uses a non-greedy regex to remove all html tags:
import re
s = "<div class = \"fix\"> part of text <div something> other text </div> some more text </div>"
s_text = re.sub(r'<.*?>', '', s)
The values are then:
print(s)
<div class = "fix"> part of text <div something> other text </div> some more text </div>
print(s_text)
part of text other text some more text
OK. I have a massive HTML file, and I only want the text that occurs between the tags
<center><span style="font-size: 144%;"></span></center>
and
<dl> <dd><i></i></dd> </dl>
I am using Python 2.6 and BeautifulSoup, but I have no idea where to begin. I'm assuming it's not difficult?
Try something like:
soup = BeautifulSoup.BeautifulSoup(YOUR_HTML)
texts = soup.findAll(text=True)
print texts
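That dumps every text node in the document, though. If you only want the text between those two markers, one approach (shown here as a Python 3 / bs4 sketch, assuming the wanted content sits at the same level as the <center> block; the file name is hypothetical) is to walk the siblings that follow the <center> and stop at the <dl>:
from bs4 import BeautifulSoup, Tag

with open("page.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

start = soup.find("center")  # the <center><span ...> marker
collected = []
for sibling in start.next_siblings:
    if isinstance(sibling, Tag) and sibling.name == "dl":
        break  # reached the <dl><dd><i> marker
    text = sibling.get_text() if isinstance(sibling, Tag) else str(sibling)
    collected.append(text)

print(" ".join(collected).strip())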