Matching partial ids in BeautifulSoup

Matching partial ids in BeautifulSoup - python

I'm using BeautifulSoup. I have to find any reference to the <div> tags with id like: post-#.
For example:
<div id="post-45">...</div>
<div id="post-334">...</div>
I have tried:
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
print soupHandler.findAll('div', id='post-*')
How can I filter this?

You can pass a function to findAll:
>>> print soupHandler.findAll('div', id=lambda x: x and x.startswith('post-'))
[<div id="post-45">...</div>, <div id="post-334">...</div>]
Or a regular expression:
>>> print soupHandler.findAll('div', id=re.compile('^post-'))
[<div id="post-45">...</div>, <div id="post-334">...</div>]

Since he is asking to match "post-#somenumber#", it's better to precise with
import re
[...]
soupHandler.findAll('div', id=re.compile("^post-\d+"))

soupHandler.findAll('div', id=re.compile("^post-$"))
looks right to me.

This works for me:
from bs4 import BeautifulSoup
import re
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
for match in soupHandler.find_all('div', id=re.compile("post-")):
print match.get('id')
>>>
post-45
post-334

Related

Using BeautifulSoup, how to select a tag without its children?

The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.

With clear you remove the tag's content.
Without altering the soup you can either do an hardcopy with copy or use a DIY approach. Here an example with the copy
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: the DIY approach: without copy
use the Tag class
from bs4 import BeautifulSoup, Tag
...
div_only = Tag(name='div', attrs=div.attrs)
use strings
div_only = '<div {}></div>'.format(' '.join(map(lambda p: f'{p[0]}="{p[1]}"', div.attrs.items())))

#cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
tag.clear()
return tag
divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]

How to soup particular div class when there is another with similar one?

from bs4 import BeautifulSoup as bs
x = ' <div class="data dturd"><h3>Seaspiracy</h3></div> <div class="data"><h3>SeaspiracyX</h3></div>'
soup = bs(x,"lxml")
print(soup.find('div',class_='data'))
I am trying to soup the second div with class 'data' in it. But the above code is always finding the div with class 'data dturd'.
How can I solve it ?

You could use a CSS selector with .select_one and the :nth-child(2) pseudoselector:
>>> soup.select_one(".data:nth-child(2)")
<div class="data"><h3>SeaspiracyX</h3></div>

You can use :not to filter out for the other class present in the first of the children
from bs4 import BeautifulSoup as bs
x = ' <div class="data dturd"><h3>Seaspiracy</h3></div> <div class="data"><h3>SeaspiracyX</h3></div>'
soup = bs(x,"lxml")
print(soup.select_one('.data:not(.dturd)').text)

I definetly agree with #ggorlen's approch. Here is a more clumpsy approach, that somewhat works.
from bs4 import BeautifulSoup as bs
x = ' <div class="data dturd"><h3>Seaspiracy</h3></div> <div class="data"><h3>SeaspiracyX</h3></div>'
soup = bs(x,"lxml")
print(soup.find_all('div',class_='data')[1])

And another solution:
soup.find(lambda tag: tag.name == "div" and tag['class'] == ["data"])

URLs separation with bs4 and Python

I am scraping a site for a bunch of links and those links are in single HTML div tag with <br /> tag to line break, but when I try to get all URLs from that div it just coming in a single string.
I am unable to separate then in list. My code is as follows:
with below code I'm scraping all links:
links = soup.find('div', id='dle-content').find('div', class_='full').find(
'div', class_='full-news').find('div', class_='quote').text
Following is html from site:
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
Output which I get from above code:
https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/
Output which I want:
[
"https://example.com/asd.html",
"https://example.net/abc",
"https://example.org/v/kjg/"
]

Try this:
from bs4 import BeautifulSoup
sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""
soup = BeautifulSoup(sample, "html.parser").find_all("div", class_="quote")
print([i.getText().split() for i in soup])
Output:
[['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']]

You could fix it with a string manipulation:
new_output = ' http'.join(output.split('http')).split()

Split the string, and then use list comprehension to bring it together:
output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'
split_output = output.split()
new_output = [x for x in split_output if x != '']
Output:
print(new_output)
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']

Another way to achieve the desired output:
from bs4 import BeautifulSoup
html = """
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
"""
soup = BeautifulSoup(html,"html.parser")
print([i.strip() for i in soup.find("div",class_="quote").strings if i!='\n'])
Output:
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']

Python splitting the HTML

So I have an HTML markup and I'd like to access a tag with a specific class inside a tag with a specific id. For example:
<tr id="one">
<span class="x">X</span>
.
.
.
.
</tr>
How do I get the content of the tag with the class "x" inside the tag with an id of "one"?

I'm not used to work with lxml.xpath, so I always tend to use BeautifulSoup. Here is a solution with BeautifulSoup:
>>> HTML = """<tr id="one">
... <span class="x">X</span>
... <span class="ax">X</span>
... <span class="xa">X</span>
... </tr>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(HTML)
>>> tr = soup.find('tr', {'id':'one'})
>>> span = tr.find('span', {'class':'x'})
>>> span
<span class="x">X</span>
>>> span.text
u'X'

You need something called "xpath".
from lxml import html
tree = html.fromstring(my_string)
x = tree.xpath('//*[#id="one"]/span[#class="x"]/text()')
print x[0] # X

Python and BeautifulSoup, not finding 'a'

Here's a piece of HTML code (from delicious):
<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a>
<span class="saverem">
<em class="bookmark-actions">
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&jump=%2Fdux&key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&original_user=dux&copyuser=dux&copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
</em>
</span>
</h4>
I'm trying to find all the links where class="inlinesave action". Here's the code:
sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)
But it doesn't find anything!
Any thoughts?
Thanks

If you want to look for an anchor with exactly those two classes you'd, have to use a regexp, I think:
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})
Keep in mind that this regexp won't work if the ordering of the class names is reversed (class="action inlinesave").
The following statement should work for all cases (even though it looks ugly imo.):
soup.findAll('a',
attrs={'class':
re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
})

Python string methods
html=open("file").read()
for item in html.split("<strong>"):
if "class" in item and "inlinesave action" in item:
url_with_junk = item.split('href="')[1]
m = url_with_junk.index('">')
print url_with_junk[:m]

May be that issue is fixed in verion 3.1.0, I could parse yours,
>>> html="""<h4>
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony
... <span class="saverem">
... <em class="bookmark-actions">
... <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Gen
... </em>
... </span>
... </h4>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'})
>>> print len(tags)
1
>>> tags
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%
>>>
I have tried with BeautifulSoup 2.1.1 also, its does not work at all.

You might make some forward progress using pyparsing:
from pyparsing import makeHTMLTags, withAttribute
htmlsrc="""<h4>... etc."""
atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class","inlinesave action")))
for result in atag.searchString(htmlsrc):
print result.href
Gives (long result output snipped at '...'):
/save?url=http%3A%2F%2Fimfy.us%2F&title=Genera...+anonymous+links

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Matching partial ids in BeautifulSoup - python

Since he is asking to match "post-#somenumber#", it's better to precise with import re [...] soupHandler.findAll('div', id=re.compile("^post-\d+"))

soupHandler.findAll('div', id=re.compile("^post-$")) looks right to me.

This works for me: from bs4 import BeautifulSoup import re html = '<div id="post-45">...</div> <div id="post-334">...</div>' soupHandler = BeautifulSoup(html) for match in soupHandler.find_all('div', id=re.compile("post-")): print match.get('id') >>> post-45 post-334

Related

Using BeautifulSoup, how to select a tag without its children?

How to soup particular div class when there is another with similar one?

URLs separation with bs4 and Python

Python splitting the HTML

Python and BeautifulSoup, not finding 'a'

Categories

Resources