beautifulsoup find id given text - python

Here is my html:
<div id="div:{c4b05d3c-dc70-409c-b28b-9cdb1157d346}{35}" style="position:absolute;left:624px;top:595px;width:624px">
<p id="p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145}" style="margin-top:5.5pt;margin-bottom:5.5pt">
{blah} data123
</p>
</div>
I want to find and return p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145} by looking for the text {blah}. How do I do this?

You can try something like this, using the re module to match the text with a regular expression:
import re
soup.find('p', text = re.compile('blah'))['id']
# u'p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145}'
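If BeautifulSoup is not available, the same lookup can be sketched with the standard library's html.parser. This is only an illustration, not the answer's method: the class name IdByTextFinder and its attributes are made up for this example, and it assumes the text sits directly inside the element whose id you want.

```python
from html.parser import HTMLParser


class IdByTextFinder(HTMLParser):
    """Record the id of the innermost open element whose text contains a needle."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.open_tags = []   # stack of (tag, attrs dict) for currently open elements
        self.found_id = None

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; this tolerates unclosed tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Only the first match is kept.
        if self.found_id is None and self.needle in data and self.open_tags:
            self.found_id = self.open_tags[-1][1].get("id")


html = """<div id="div:{c4b05d3c-dc70-409c-b28b-9cdb1157d346}{35}" style="position:absolute;left:624px;top:595px;width:624px">
<p id="p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145}" style="margin-top:5.5pt;margin-bottom:5.5pt">
{blah} data123
</p>
</div>"""

finder = IdByTextFinder("{blah}")
finder.feed(html)
finder.close()
print(finder.found_id)
```

found_id stays None when nothing matches, so check for that before using the result.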

Related

Remove all attributes from html file

I have an HTML file and I want to loop through the content, remove all the attributes in the tags, and display only the tags.
For example:
<div class="content"></div>
<div id="content"></div>
<p> test</p>
<h1>tt</h1>
the output should be:
<div></div>
<div></div>
<p> </p>
<h1></h1>
At the moment I can display all tags with all the attributes, but I only want to display the tags without the attributes.
import re

# Read the whole file, then collect every tag (attributes included).
with open('myfile.html') as file:
    readtext = file.read()
tags = re.findall(r'<[^>]+>', readtext)
for data in tags:
    print(data)
I think the easiest way to do this is to parse the HTML, e.g. with BeautifulSoup. Here is an answer that shows how to solve your problem using that: https://stackoverflow.com/a/9045719/5251061
Also, take a look at this gist: https://gist.github.com/revotu/21d52bd20a073546983985ba3bf55deb
Basically, after parsing your file you can do something like this:
from bs4 import BeautifulSoup

# remove all attributes
def _remove_all_attrs(soup):
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup
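For the narrower goal in the question (displaying only the bare tags), here is a rough sketch that uses only the standard library's html.parser rather than BeautifulSoup. The class name BareTagCollector and the sample string are assumptions made for this illustration; the sample is modelled on the question's input with the div tags properly closed.

```python
from html.parser import HTMLParser


class BareTagCollector(HTMLParser):
    """Collect every start and end tag, dropping all attributes and text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose: only the tag name is kept.
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


sample = (
    '<div class="content"></div>\n'
    '<div id="content"></div>\n'
    '<p> test</p>\n'
    '<h1>tt</h1>'
)

collector = BareTagCollector()
collector.feed(sample)
collector.close()
print("".join(collector.tags))
```

Unlike the BeautifulSoup function, this discards the text content too, so it fits only when the tag skeleton is all you need.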

Select element based on text inside Beautiful Soup

I scraped a website and I want to find an element based on the text written in it. Let's say the following is a sample of the site's markup:
code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""")
I want a way to get the p element whose text is Some Information. How can I select such an element?
Just use the text parameter:
code.find_all("p", text="Some Information")
If you need only the first element, then use find instead of find_all.
You could use text to search for all tags matching the string:
import bs4

code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
# Calling the soup object is shorthand for find_all(); each match is a text node.
for elem in code(text='Some Information'):
    print(elem.parent)

Xpath. How to select all text between two tags?

Here is the HTML source code
<div class="text">
<a name="dst100030"></a>
<pre id="p73" class="P">
<span class="blk">│Лабораторные методы исследования │</span>
</pre>
<pre id="p74" class="P">
<span class="blk">├────────────┬───────────────────────────┬─────────────────┬──────────────┤</span></pre>
<a name="dst100031"></a>
I need to get all the text between the <a name="dst100030"> and <a name="dst100031"> tags. Here's what I tried:
response.xpath('//pre//text()[preceding-sibling::a[@name="dst100030"] and following-sibling::a[@name="dst100031"]]')
But it returns an empty list. Where am I wrong?
<a> is a sibling of <pre>, not the text(). You can use preceding::a instead (and similarly for following).
A solution to what you have asked, using re:
Note: As others have mentioned in the comments, this may not be the best solution; you are better off using a proper parser.
import re
source_code = '<div class="text"><a name="dst100030"></a><pre id="p73" class="P"><span class="blk">│Лабораторные методы исследования│</span></pre><pre id="p74" class="P"><span class="blk">├────────────┬───────────────────────────┬─────────────────┬──────────────┤</span></pre><a name="dst100031"></a>'
# Non-greedy match of everything between the two anchors; DOTALL lets . span newlines.
text = re.findall(r'<a name="dst100030"></a>(.*?)<a name="dst100031">', source_code, re.DOTALL)
print(text)
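The regex above returns the markup between the anchors, tags included. As a hedged alternative, the sketch below collects only the text nodes between the two anchors using the standard library's html.parser. The class name BetweenAnchors is invented for this example, and the second box-drawing line is shortened for readability.

```python
from html.parser import HTMLParser


class BetweenAnchors(HTMLParser):
    """Collect text that appears after <a name=start> and before <a name=end>."""

    def __init__(self, start_name, end_name):
        super().__init__()
        self.start_name = start_name
        self.end_name = end_name
        self.collecting = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        name = dict(attrs).get("name")
        if name == self.start_name:
            self.collecting = True
        elif name == self.end_name:
            self.collecting = False

    def handle_data(self, data):
        # Skip whitespace-only nodes between tags.
        if self.collecting and data.strip():
            self.chunks.append(data.strip())


# Abbreviated version of the question's markup (second table row shortened).
snippet = (
    '<div class="text"><a name="dst100030"></a>'
    '<pre id="p73" class="P"><span class="blk">│Лабораторные методы исследования │</span></pre>'
    '<pre id="p74" class="P"><span class="blk">├────┬────┤</span></pre>'
    '<a name="dst100031"></a></div>'
)

parser = BetweenAnchors("dst100030", "dst100031")
parser.feed(snippet)
parser.close()
print(parser.chunks)
```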

extracting text from noisy string.. python

I have some HTML documents and I want to extract a very particular piece of text from them.
Now, this text is always located as
<div class = "fix">text </div>
Now, sometimes what happens is... there are other opening divs as well...something like:
<div class = "fix"> part of text <div something> other text </div> some more text </div>
I want to extract all the text inside the
<div class = "fix"> </div> markup.
How do I do this?
I would use the BeautifulSoup library. It is built for this kind of task: as long as your data is valid HTML, it should find exactly what you're looking for. It has reasonably good documentation and is straightforward, even for beginners. If your file is on the web and you can't access the HTML directly, fetch it with urllib.
from bs4 import BeautifulSoup
# html_doc is assumed to hold the HTML source as a string
soup = BeautifulSoup(html_doc, "html.parser")
soup.find("div", {"class": "fix"})
If there is more than one matching element, use find_all instead. This should give you roughly what you're looking for.
Edit: Fixed example (class is a Python keyword, so you can't use the usual attr="blah" form).
Here's a really simple solution that uses a non-greedy regex to remove all HTML tags:
import re
s = "<div class = \"fix\"> part of text <div something> other text </div> some more text </div>"
s_text = re.sub(r'<.*?>', '', s)
The values are then:
print(s)
<div class = "fix"> part of text <div something> other text </div> some more text </div>
print(s_text)
part of text other text some more text
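The regex above strips tags from the whole string, so it only works when the string is already just the div you want. A minimal sketch of picking out that div inside a larger document, using the standard library's html.parser and a nesting counter, might look like this. The class name FixDivText and the surrounding "before"/"after" text are assumptions added for illustration.

```python
from html.parser import HTMLParser


class FixDivText(HTMLParser):
    """Collect all text inside <div class="fix">, including nested divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the target div; 0 means outside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1                       # nested div inside the target
        elif dict(attrs).get("class") == "fix":
            self.depth = 1                        # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


s = ('before <div class = "fix"> part of text <div something> other text </div>'
     ' some more text </div> after')

parser = FixDivText()
parser.feed(s)
parser.close()

# Collapse runs of whitespace left behind where tags were removed.
text = " ".join("".join(parser.parts).split())
print(text)
```

The counter is what keeps the inner closing div from ending the capture early, which is the case the question was worried about.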

Extract URLs from specific tags in python

all.
I have a huge HTML file which contains tags like these:
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
I need to extract all the urls from this page in python.
In a loop:
Find occurrences of <h3 class="r"> one by one.
Extract the URL.
http://xrayoptics.by.ru/database/misc/goog2text.py I need to rewrite this script to extract all the links found on Google.
How can I achieve that?
Thanks.
from bs4 import BeautifulSoup
html = """<html>
...
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l
onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
text</a>
</h3>
...
<h3>Don't find me!</h3>
<h3 class="r"><a>Don't find me!</a></h3>
<h3 class="r"><a class="l">Don't error on missing href!</a></h3>
...
</html>
"""
soup = BeautifulSoup(html, "html.parser")
# Only <h3 class="r"> elements, and inside them only <a class="l"> with an href.
for h3 in soup.find_all("h3", {"class": "r"}):
    for a in h3.find_all("a", {"class": "l", "href": True}):
        print(a["href"])
I'd use XPath; see here for a question about which Python package would be appropriate.
You can use a regular expression (regex) for that.
This regex will match any URL beginning with http, up to the closing quote ("):
http([^\"]+)
And this is how it's done in Python:
import re
myRegEx = re.compile("http([^\"]+)")
myResults = myRegEx.search('<source>')
Replace '<source>' with the variable storing the source code you want to search for URLs.
myResults.start() and myResults.end() then return the starting and ending positions of the first URL found; search() stops at the first match and returns None if there is none. Use myResults.group() to get the string that matched the regex.
If anything isn't clear yet, just ask.
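Since the question asks for all URLs rather than the first one, a small sketch with findall and finditer may help. The sample page_source string and the example.com link are made up for illustration, and the pattern is a slight variation on the one above that captures the whole quoted URL.

```python
import re

# Hypothetical page source; only the first link comes from the question.
page_source = '''<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l>text</a>
</h3>
<h3 class="r"><a href="https://example.com/page">other</a></h3>'''

# Capture everything between the quotes when it starts with http:// or https://.
url_pattern = re.compile(r'"(https?://[^"]+)"')

# findall returns the captured group for every match, in document order.
urls = url_pattern.findall(page_source)
print(urls)

# finditer yields match objects when the positions are needed as well.
for match in url_pattern.finditer(page_source):
    print(match.start(1), match.end(1), match.group(1))
```

Note that this matches any quoted URL in the page, not only those under <h3 class="r">, so a parser-based approach is still the safer choice for that restriction.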
