Extracting text from a noisy string in Python

I have some HTML documents and I want to extract a particular piece of text from each.
The text is always located inside
<div class = "fix">text </div>
Sometimes there are other divs nested inside it, something like:
<div class = "fix"> part of text <div something> other text </div> some more text </div>
I want to extract all the text inside the
<div class = "fix"> </div> markup.
How do I do this?

I would use the BeautifulSoup library. It is built for this: as long as your data is valid HTML, it should find exactly what you're looking for. It has reasonably good documentation and is extremely straightforward, even for beginners. If your file is on the web somewhere and you don't have the HTML directly, fetch it with urllib.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
# Pass the tag name first, then a dict of attributes to match
soup.find("div", {"class": "fix"})
If more than one element matches, use find_all instead. This should give you what you're looking for (roughly).
Edit: Fixed example (class is a Python keyword, so you can't use the usual attr="blah" form; pass a dict, or use class_="fix").

Here's a really simple solution that uses a non-greedy regex to remove all HTML tags:
import re
s = "<div class = \"fix\"> part of text <div something> other text </div> some more text </div>"
s_text = re.sub(r'<.*?>', '', s)
The values are then:
print(s)
<div class = "fix"> part of text <div something> other text </div> some more text </div>
print(s_text)
part of text other text some more text
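The regex above strips tags from the whole string, so in a full document it would also keep text that lies outside the target div. As a rough sketch using only the standard library, the parser below tracks div nesting depth and keeps text only while inside <div class="fix">. The class name FixDivText and the sample html_doc are made up for illustration, and the sketch assumes the div tags are properly balanced.

```python
from html.parser import HTMLParser


class FixDivText(HTMLParser):
    """Collect text found inside <div class="fix">, including nested divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the target div (0 = outside)
        self.chunks = []  # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1   # a nested div inside the target
        elif "fix" in classes:
            self.depth = 1    # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical sample: the question's snippet surrounded by unrelated markup.
html_doc = ('<p>noise</p><div class = "fix"> part of text <div something> other text </div>'
            ' some more text </div><div>more noise</div>')
parser = FixDivText()
parser.feed(html_doc)
parser.close()
# Collapse runs of whitespace to single spaces.
fix_text = " ".join("".join(parser.chunks).split())
print(fix_text)
```

Counting depth is what lets the inner </div> pass without ending the capture; a regex has no reliable way to pair nested tags.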

Related

Remove all the <u> and <a> tags from within all <div> tags of a class using BeautifulSoup or re

I am trying to remove <u> and <a> tags from all the div tags that have the class "sf-item" in an HTML source, because they break up the text when scraping from a web URL.
(For this demo I have assigned a sample HTML string to BeautifulSoup, but ideally the source would be a web URL.)
So far I have tried re with the line below, but I am not sure how to specify a condition so that it removes only the <u>...</u> substrings inside div tags of class sf-item:
data = re.sub('<u.*?u>', '', data)
I also tried removing all <u> and <a> tags from the entire source with the lines below, but somehow it doesn't work. I am unsure how to target only the <u> and <a> tags inside div tags with class sf-item.
for tag in soup.find_all('u'):
    tag.replaceWith('')
I would appreciate any help achieving this.
Below is the sample Python code that runs:
from re import sub
from bs4 import BeautifulSoup
import re
data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u> here </u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u>Link</u>. he would be able to race to the finish line ahead of
<u>place</u>, he just kept going.
</div>
"""
# data = re.sub('<u.*?u>', '', data) ## This works for this particular string but I cannot use on a web url
# It would solve if I can somehow specify to remove <u> and <a> only within DIV of class sf-item
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all('u'):
    tag.replaceWith('')
fResult = []
rMessage=soup.findAll("div",{'class':"sf-item"})
for result in rMessage:
    fResult.append(sub("“|.”","","".join(result.contents[0:1]).strip()))
fResult = list(filter(None, fResult))
print(fResult)
Output that I get from above code is
['The rabbit got to the halfway point at', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at']
But I need the output as below -
['The rabbit got to the halfway point at here However, it couldnt see the turtle.', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at Link. he would be able to race to the finish line ahead of place, he just kept going.']
BeautifulSoup has a builtin method for getting the visible text from a tag (i.e. the text that would be displayed when rendered in a browser). Running the following code, I get your expected output:
from re import sub
from bs4 import BeautifulSoup
import re
data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u> here </u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u>Link</u>. he would be able to race to the finish line ahead of
<u>place</u>, he just kept going.
</div>
"""
soup = BeautifulSoup(data, "html.parser")
rMessage=soup.findAll("div",{'class':"sf-item"})
fResult = []
for result in rMessage:
    # Replace newlines with a space so words on adjacent lines do not run together
    fResult.append(result.text.replace('\n', ' '))
That will give you the proper output, but with some extra spaces. If you want to reduce them all to single spaces, you can run fResult through this:
fResult = [re.sub(' +', ' ', result) for result in fResult]
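A more compact variant, as a sketch rather than the answerer's exact method: normalise whitespace with split/join and skip matches that have no visible text. The second step matters because matching on the class "sf-item" also matches the "sf-item sf-icon" div, which contains no text. The variable name results is made up for this example.

```python
from bs4 import BeautifulSoup

data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u> here </u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u>Link</u>. he would be able to race to the finish line ahead of
<u>place</u>, he just kept going.
</div>
"""

soup = BeautifulSoup(data, "html.parser")
results = []
for div in soup.find_all("div", class_="sf-item"):
    # get_text() concatenates all text nodes; split/join then collapses
    # every run of whitespace (newlines included) to a single space.
    text = " ".join(div.get_text().split())
    if text:  # skip divs with no visible text, such as the icon div
        results.append(text)
print(results)
```

Because get_text() ignores the tags themselves, there is no need to remove the <u> or <a> elements first.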

Xpath. How to select all text between two tags?

Here is the HTML source code
<div class="text">
<a name="dst100030"></a>
<pre id="p73" class="P">
<span class="blk">│Лабораторные методы исследования │</span>
</pre>
<pre id="p74" class="P">
<span class="blk">├────────────┬───────────────────────────┬─────────────────┬──────────────┤</span></pre>
<a name="dst100031"></a>
I need to get all the text between the <a name="dst100030"> and <a name="dst100031"> tags. Here's what I tried:
response.xpath('//pre//text()[preceding-sibling::a[@name="dst100030"] and following-sibling::a[@name="dst100031"]]')
But it returns an empty list. Where did I go wrong?
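<a> is a sibling of <pre>, not of the text() nodes inside it, so preceding-sibling and following-sibling never match. Use the preceding::a and following::a axes instead:
//pre//text()[preceding::a[@name="dst100030"] and following::a[@name="dst100031"]]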
A solution to what you have asked using re:
Note: As others have mentioned in the comments this may not be the best solution - you are better to use a proper parser.
import re
source_code = '<div class="text"><a name="dst100030"></a><pre id="p73" class="P"><span class="blk">│Лабораторные методы исследования│</span></pre><pre id="p74" class="P"><span class="blk">├────────────┬───────────────────────────┬─────────────────┬──────────────┤</span></pre><a name="dst100031"></a>'
# Capture everything between the two anchors (non-greedy; DOTALL lets "." span newlines)
text = re.findall(r'<a name="dst100030"></a>(.*?)<a name="dst100031">', source_code, re.DOTALL)
print(text)
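The captured fragment still contains markup. As a sketch under the same caveat (a proper parser is more robust), the remaining tags can be stripped and entities decoded to leave only the text. The source_code below is a shortened, hypothetical stand-in for the page in the question.

```python
import re
from html import unescape

# Shortened, hypothetical stand-in for the page source in the question.
source_code = (
    '<div class="text">\n'
    '<a name="dst100030"></a>\n'
    '<pre id="p73" class="P"><span class="blk">first &amp; line</span></pre>\n'
    '<pre id="p74" class="P"><span class="blk">second line</span></pre>\n'
    '<a name="dst100031"></a>\n'
    '</div>'
)

between = re.search(
    r'<a name="dst100030"></a>(.*?)<a name="dst100031">',
    source_code,
    re.DOTALL,
)
fragment = between.group(1) if between else ""
# Drop the remaining tags, decode entities, and keep non-empty lines.
plain = unescape(re.sub(r"<[^>]+>", "", fragment))
lines = [line.strip() for line in plain.splitlines() if line.strip()]
print(lines)
```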

beautifulsoup find id given text

Here is my html:
<div id="div:{c4b05d3c-dc70-409c-b28b-9cdb1157d346}{35}" style="position:absolute;left:624px;top:595px;width:624px">
<p id="p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145}" style="margin-top:5.5pt;margin-bottom:5.5pt">
{blah} data123
</p>
</div>
I want to find and return p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145} by looking for the text {blah}, how do I do this?
You can try something like this, using the re module to match the text with a regex:
import re
# string= is the current name of the older text= argument
soup.find('p', string=re.compile('blah'))['id']
# 'p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145}'
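For completeness, a self-contained sketch of the same lookup. The names html_doc, match and p_id are made up here, and re.escape is used so that the braces in {blah} are matched literally. Note that filtering a tag by string only matches when the tag's sole child is a single text node, as it is in this snippet.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<div id="div:{c4b05d3c-dc70-409c-b28b-9cdb1157d346}{35}" style="position:absolute;left:624px;top:595px;width:624px">
<p id="p:{c9c23667-929c-4ee2-be44-edc002db83b8}{145}" style="margin-top:5.5pt;margin-bottom:5.5pt">
{blah} data123
</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# re.escape makes the braces match literally instead of acting as regex syntax.
match = soup.find("p", string=re.compile(re.escape("{blah}")))
# find returns None when nothing matches, so guard before indexing.
p_id = match["id"] if match is not None else None
print(p_id)
```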

Extract URLs from specific tags in python

I have a huge HTML file which contains tags like these:
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
I need to extract all the URLs from this page in Python.
In a loop:
Find occurrences of <h3 class="r"> one by one.
Extract the URL.
I need to rewrite this script to extract all the links found on Google: http://xrayoptics.by.ru/database/misc/goog2text.py
How can I achieve that?
Thanks.
from bs4 import BeautifulSoup
html = """<html>
...
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l
onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
text</a>
</h3>
...
<h3>Don't find me!</h3>
<h3 class="r"><a>Don't find me!</a></h3>
<h3 class="r"><a class="l">Don't error on missing href!</a></h3>
...
</html>
"""
soup = BeautifulSoup(html, "html.parser")
# Only <h3 class="r"> headings, and inside them only <a class="l"> links that have an href
for h3 in soup.find_all("h3", {"class": "r"}):
    for a in h3.find_all("a", {"class": "l", "href": True}):
        print(a["href"])
I'd use XPath; see the related question about which Python package is appropriate for it.
You can use a regular expression (regex) for that.
This regex will catch any URL beginning with http, up to the closing quote ("):
http([^\"]+)
And this is how it's done in Python:
import re
myRegEx = re.compile("http([^\"]+)")
myResults = myRegEx.search('<source>')
Replace '<source>' with the variable storing the source code you want to search for URLs.
myResults.start() and myResults.end() now contain the start and end positions of the first matching URL. Use myResults.group() to get the string that matched the regex. Note that search() returns only the first match (or None if there is none); use myRegEx.finditer() to iterate over every match.
If anything isn't clear yet, just ask.
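To collect every URL rather than only the first, a minimal sketch with findall is shown below. The page_source sample and the example.com link are made up, and the pattern assumes URLs are wrapped in double quotes; unlike the BeautifulSoup answer, it does not restrict matches to links inside <h3 class="r">.

```python
import re

# Hypothetical sample standing in for the page source.
page_source = '''
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l>text</a>
</h3>
<h3 class="r"><a href="https://example.com/page" class=l>other</a></h3>
'''

# The capture group holds the URL itself, without the surrounding quotes.
url_pattern = re.compile(r'"(https?://[^"]+)"')
urls = url_pattern.findall(page_source)
print(urls)
```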

I am not able to parse using Beautiful Soup

<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want to get the description after scraping it:
My Name is Alis I am a python learner...
I tried a lot of things but could not figure out the best way. Can you give me a general solution for this?
from bs4 import BeautifulSoup
soup = BeautifulSoup("Your sample html here", "html.parser")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
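As one of those other ways, here is a sketch that anchors on the "My Description" label instead of positional indexes, so it keeps working if divs are added or reordered. It assumes the description is the first non-empty text node after that <b> tag; the names html_doc, label and description are made up for this example.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# Locate the label, then walk its following siblings until real text appears.
label = soup.find("b", string="My Description")
description = next(
    (str(node).strip() for node in label.next_siblings
     if isinstance(node, NavigableString) and node.strip()),
    None,  # fallback when no text follows the label
)
print(description)
```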
Have you tried reading the examples provided in the documentation? The quick start is located here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find the div with class "class-a", you would load your HTML via
from bs4 import BeautifulSoup
soup = BeautifulSoup("My html here", "html.parser")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember you can do most of this in the Python console, using dir() and help() to walk through what you're trying to do. It might make life easier to try IPython or IDLE, which have very friendly consoles for beginners.
