Matching partial ids in BeautifulSoup - python

I'm using BeautifulSoup. I have to find any reference to the <div> tags with id like: post-#.
For example:
<div id="post-45">...</div>
<div id="post-334">...</div>
I have tried:
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
print soupHandler.findAll('div', id='post-*')
How can I filter this?

You can pass a function to findAll:
>>> print soupHandler.findAll('div', id=lambda x: x and x.startswith('post-'))
[<div id="post-45">...</div>, <div id="post-334">...</div>]
Or a regular expression:
>>> print soupHandler.findAll('div', id=re.compile('^post-'))
[<div id="post-45">...</div>, <div id="post-334">...</div>]
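With bs4, the same prefix match can also be written as a CSS attribute selector. This is a minimal sketch, assuming bs4 is installed; the third div and the names soup and ids are added only for illustration:

```python
from bs4 import BeautifulSoup

html = ('<div id="post-45">...</div> <div id="post-334">...</div> '
        '<div id="sidebar">...</div>')
soup = BeautifulSoup(html, "html.parser")

# [id^="post-"] selects elements whose id attribute starts with "post-"
matches = soup.select('div[id^="post-"]')
ids = [div["id"] for div in matches]
print(ids)
```

The selector only checks the prefix, so it behaves like the '^post-' regular expression rather than a digits-only match.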

Since the question asks to match "post-" followed by a number, it's better to be precise:
import re
[...]
soupHandler.findAll('div', id=re.compile(r"^post-\d+"))
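To see what the stricter pattern buys you, here is a small sketch using plain re (BeautifulSoup applies a compiled pattern with its search() method). The candidate ids are made up for illustration, and the trailing $ is my own assumption that nothing should follow the digits:

```python
import re

loose = re.compile(r"^post-")        # anything that starts with "post-"
strict = re.compile(r"^post-\d+$")   # "post-" followed only by digits

# Hypothetical id values, not taken from the question
candidates = ["post-45", "post-334", "post-footer", "repost-7"]

loose_hits = [c for c in candidates if loose.search(c)]
strict_hits = [c for c in candidates if strict.search(c)]

print(loose_hits)
print(strict_hits)
```

The loose pattern also picks up "post-footer", while the strict one keeps only the numbered ids.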

soupHandler.findAll('div', id=re.compile(r"^post-\d+$"))
looks right to me.

This works for me:
from bs4 import BeautifulSoup
import re
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
for match in soupHandler.find_all('div', id=re.compile("post-")):
    print(match.get('id'))
>>>
post-45
post-334

Related

Using BeautifulSoup, how to select a tag without its children?

The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.
With clear you remove the tag's content.
To leave the soup unaltered, you can either make a copy with copy or use a DIY approach. Here is an example with copy:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: the DIY approach, without copy.
Use the Tag class:
from bs4 import BeautifulSoup, Tag
...
div_only = Tag(name='div', attrs=div.attrs)
Or use strings:
div_only = '<div {}></div>'.format(' '.join(map(lambda p: f'{p[0]}="{p[1]}"', div.attrs.items())))
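A variant of the Tag idea, sketched with soup.new_tag instead of calling the Tag constructor directly. The attributes go in through attrs because the attribute here is literally called name, which would otherwise clash with the tag-name argument. The parser choice ("html.parser") is an assumption made so the snippet does not depend on lxml:

```python
from bs4 import BeautifulSoup

html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# Build a fresh, childless tag; dict(...) copies the attributes so the
# new tag does not share the original's attrs dictionary.
div_only = soup.new_tag(div.name, attrs=dict(div.attrs))

print(div_only)
# The original soup still contains the span
print(soup.find("span") is not None)
```

The new tag is not attached to the tree, so the soup itself stays as it was.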
@cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
    tag.clear()
    return tag
divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]

How to select a particular div class when there is another similar one?

from bs4 import BeautifulSoup as bs
x = ' <div class="data dturd"><h3>Seaspiracy</h3></div> <div class="data"><h3>SeaspiracyX</h3></div>'
soup = bs(x,"lxml")
print(soup.find('div',class_='data'))
I am trying to select the second div, the one whose class is just 'data'. But the above code always finds the div with class 'data dturd'.
How can I solve this?
You could use a CSS selector with .select_one and the :nth-child(2) pseudoselector:
>>> soup.select_one(".data:nth-child(2)")
<div class="data"><h3>SeaspiracyX</h3></div>
You can use :not to exclude the extra class that is present on the first div:
from bs4 import BeautifulSoup as bs
x = ' <div class="data dturd"><h3>Seaspiracy</h3></div> <div class="data"><h3>SeaspiracyX</h3></div>'
soup = bs(x,"lxml")
print(soup.select_one('.data:not(.dturd)').text)
I definitely agree with @ggorlen's approach. Here is a clumsier approach that somewhat works:
from bs4 import BeautifulSoup as bs
x = ' <div class="data dturd"><h3>Seaspiracy</h3></div> <div class="data"><h3>SeaspiracyX</h3></div>'
soup = bs(x,"lxml")
print(soup.find_all('div',class_='data')[1])
And another solution:
soup.find(lambda tag: tag.name == "div" and tag['class'] == ["data"])
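The reason find picks the first div is that class is a multi-valued attribute, so class_='data' matches any tag that has "data" among its classes. A minimal sketch of that behaviour and of an exact comparison, assuming bs4 with the built-in "html.parser" (the names first and exact are only for illustration):

```python
from bs4 import BeautifulSoup

x = ('<div class="data dturd"><h3>Seaspiracy</h3></div> '
     '<div class="data"><h3>SeaspiracyX</h3></div>')
soup = BeautifulSoup(x, "html.parser")

# class_='data' matches on any single class, so the "data dturd" div wins
first = soup.find("div", class_="data")
print(first["class"])

# Compare the whole class list instead; .get() avoids a KeyError on
# divs that have no class attribute at all.
exact = soup.find(lambda tag: tag.name == "div" and tag.get("class") == ["data"])
print(exact.h3.text)
```

Using tag.get("class") rather than tag['class'] keeps the lambda safe on pages where some divs carry no class.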

URLs separation with bs4 and Python

I am scraping a site for a bunch of links. The links are in a single HTML div tag, separated by <br /> tags as line breaks, but when I try to get all the URLs from that div they come back as a single string.
I am unable to separate them into a list. My code is as follows.
With the code below I'm scraping all the links:
links = soup.find('div', id='dle-content').find('div', class_='full').find(
    'div', class_='full-news').find('div', class_='quote').text
Following is html from site:
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
Output which I get from above code:
https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/
Output which I want:
[
"https://example.com/asd.html",
"https://example.net/abc",
"https://example.org/v/kjg/"
]
Try this:
from bs4 import BeautifulSoup
sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""
soup = BeautifulSoup(sample, "html.parser").find_all("div", class_="quote")
print([i.getText().split() for i in soup])
Output:
[['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']]
You could fix it with a string manipulation:
new_output = ' http'.join(output.split('http')).split()
Split the string on whitespace. This assumes output still contains the newlines that .text keeps between the URLs; the fully concatenated form shown in the question has nothing to split on:
output = 'https://example.com/asd.html\nhttps://example.net/abc\nhttps://example.org/v/kjg/\n\n'
new_output = output.split()
Output:
print(new_output)
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']
Another way to achieve the desired output:
from bs4 import BeautifulSoup
html = """
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
"""
soup = BeautifulSoup(html,"html.parser")
print([i.strip() for i in soup.find("div",class_="quote").strings if i!='\n'])
Output:
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']
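A slightly shorter take on the same idea is the stripped_strings generator, which yields each text node with surrounding whitespace removed and skips whitespace-only nodes. This is a sketch, assuming bs4 with "html.parser"; the HTML comments are not included in the strings:

```python
from bs4 import BeautifulSoup

html = """
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each URL is its own text node between the <br> tags
urls = list(soup.find("div", class_="quote").stripped_strings)
print(urls)
```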

Python splitting the HTML

So I have an HTML markup and I'd like to access a tag with a specific class inside a tag with a specific id. For example:
<tr id="one">
<span class="x">X</span>
.
.
.
.
</tr>
How do I get the content of the tag with the class "x" inside the tag with an id of "one"?
I'm not used to working with lxml.xpath, so I tend to use BeautifulSoup. Here is a solution with BeautifulSoup:
>>> HTML = """<tr id="one">
... <span class="x">X</span>
... <span class="ax">X</span>
... <span class="xa">X</span>
... </tr>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(HTML)
>>> tr = soup.find('tr', {'id':'one'})
>>> span = tr.find('span', {'class':'x'})
>>> span
<span class="x">X</span>
>>> span.text
u'X'
You need something called "xpath".
from lxml import html
tree = html.fromstring(my_string)
x = tree.xpath('//*[@id="one"]/span[@class="x"]/text()')
print(x[0])  # X
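The same lookup (a tag with class "x" inside the tag with id "one") can also be written as a CSS selector in bs4. A minimal sketch, assuming bs4 with "html.parser", reusing the sample markup from the BeautifulSoup answer:

```python
from bs4 import BeautifulSoup

HTML = """<tr id="one">
<span class="x">X</span>
<span class="ax">X</span>
<span class="xa">X</span>
</tr>"""
soup = BeautifulSoup(HTML, "html.parser")

# "#one span.x": a span with class "x" somewhere inside the element with id "one"
span = soup.select_one("#one span.x")
print(span)
print(span.text)
```

The .x part matches on the class token, so the spans with class "ax" and "xa" are not selected.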

Python and BeautifulSoup, not finding 'a'

Here's a piece of HTML code (from delicious):
<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a>
<span class="saverem">
<em class="bookmark-actions">
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&jump=%2Fdux&key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&original_user=dux&copyuser=dux&copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
</em>
</span>
</h4>
I'm trying to find all the links where class="inlinesave action". Here's the code:
sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)
But it doesn't find anything!
Any thoughts?
Thanks
If you want to look for an anchor with exactly those two classes, you'd have to use a regexp, I think:
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})
Keep in mind that this regexp won't work if the ordering of the class names is reversed (class="action inlinesave").
The following statement should work for all cases (even though it looks ugly imo.):
soup.findAll('a',
    attrs={'class':
        re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
    })
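With the newer bs4 package, an order-independent match on both classes can be sketched with a CSS selector instead of a regexp. The sample markup and hrefs below are made up for illustration and are not from the delicious page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: both class orders, plus one link with only one class
html = """
<a class="inlinesave action" href="/a">SAVE</a>
<a class="action inlinesave" href="/b">SAVE</a>
<a class="inlinesave" href="/c">SAVE</a>
"""
soup = BeautifulSoup(html, "html.parser")

# a.inlinesave.action requires both classes, in any order
tags = soup.select("a.inlinesave.action")
hrefs = [a["href"] for a in tags]
print(hrefs)
```

Note that this also matches anchors that carry additional classes beyond these two.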
Using Python string methods:
html=open("file").read()
for item in html.split("<strong>"):
    if "class" in item and "inlinesave action" in item:
        url_with_junk = item.split('href="')[1]
        m = url_with_junk.index('">')
        print(url_with_junk[:m])
Maybe that issue is fixed in version 3.1.0; I could parse yours:
>>> html="""<h4>
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony
... <span class="saverem">
... <em class="bookmark-actions">
... <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Gen
... </em>
... </span>
... </h4>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'})
>>> print len(tags)
1
>>> tags
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%
>>>
I have also tried with BeautifulSoup 2.1.1; it does not work at all.
You might make some forward progress using pyparsing:
from pyparsing import makeHTMLTags, withAttribute
htmlsrc="""<h4>... etc."""
atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class","inlinesave action")))
for result in atag.searchString(htmlsrc):
    print(result.href)
Gives (long result output snipped at '...'):
/save?url=http%3A%2F%2Fimfy.us%2F&title=Genera...+anonymous+links
