For example, I have code like this:
<div class="container">
<div class="blablabla1">
<div class="blablabla2">
<div class="blablabla3">
<span class="hello">Hello</span>
</div>
</div>
</div>
</div>
How can I get the <span> value or the <span> class value?
Should I first find all the containers?
Your question is not entirely clear, but in general you can access the class and the text with a CSS selector like this:
from bs4 import BeautifulSoup
html = '''
<div class="container">
<div class="blablabla1">
<div class="blablabla2">
<div class="blablabla3">
<span class="hello">Hello</span>
</div>
</div>
</div>
</div>
'''
soup = BeautifulSoup(html, "lxml")
spanText = soup.select_one('div.container span').text       # "Hello"
spanClass = soup.select_one('div.container span')['class']  # ['hello'] (class is multi-valued, so a list is returned)
Once you have obtained the soup, you can also use the find_all() method to find all <span> tags with the hello class:
all_hello_spans = soup.find_all("span", class_="hello")
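For example, reusing the soup built above, a small sketch that prints each match could look like this:
# iterate over every <span class="hello"> that was found
for span in all_hello_spans:
    print(span.text, span['class'])  # -> Hello ['hello']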
Related
What would be the best way to get the text of the items with class="field__label" and class="field__item" in the following code?
Taking into consideration that there are other tags with the same classes outside the div class="fieldset-wrapper", I need only the ones inside this tag.
HTML Example:
<div class="fieldset-wrapper">
<div class="field field--name-field-adresse-strasse-nr field--type-string field--label-inline clearfix">
<div class="field__label">TEXT</div>
<div class="field__item">TEXT</div>
</div>
<div class="field field--name-field-adresse-plz-ort field--type-string field--label-inline clearfix">
<div class="field__label">TEXT</div>
<div class="field__item">TEXT</div>
</div>
<div class="field field--name-field-adressen-bundesland field--type-entity-reference field--label-inline clearfix">
<div class="field__label">TEXT</div>
<div class="field__item">TEXT</div>
</div>
</div>
You can use CSS selectors to ensure that your target elements are descendants of the div class="fieldset-wrapper" element:
for item in soup.select('div.fieldset-wrapper div.field__item, div.fieldset-wrapper div.field__label'):
    print(item.text)
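For a self-contained sketch, assuming the HTML above is stored in a variable named html (the variable name is just for illustration):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for item in soup.select('div.fieldset-wrapper div.field__item, div.fieldset-wrapper div.field__label'):
    # prints the TEXT of every label/item that sits inside the fieldset-wrapper
    print(item.text)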
I have the content below and I am trying to understand how to extract the <p> tag copy using Beautiful Soup (I am open to other methods). As you can see the <p> tags are not both nested inside the same <div>. I gave it a shot with the following method but that only seems to work when both <p> tags are within the same container.
<div class="top-panel">
<div class="inside-panel-0">
<h1 class="h1-title">Some Title</h1>
</div>
<div class="inside-panel-0">
<div class="inside-panel-1">
<p> I want to extract this copy</p>
</div>
<div class="inside-panel-1">
<p>I want to extract this copy</p>
</div>
</div>
</div>
IIUC try
from bs4 import BeautifulSoup
html = """<div class="top-panel">
<div class="inside-panel-0">
<h1 class="h1-title">Some Title</h1>
</div>
<div class="inside-panel-0">
<div class="inside-panel-1">
<p> I want to extract this copy</p>
</div>
<div class="inside-panel-1">
<p>I want to extract this copy</p>
</div>
</div>
</div>"""
soup = BeautifulSoup(html, 'lxml')
# select all the p tags whose parent div has the class inside-panel-1
soup.select('div.inside-panel-1 > p')
[<p> I want to extract this copy</p>, <p>I want to extract this copy</p>]
If you want just the text then try
p_tags = soup.select('div.inside-panel-1 > p')
[elm.text for elm in p_tags]
# -> [' I want to extract this copy', 'I want to extract this copy']
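If you prefer find_all over CSS selectors, a rough equivalent (reusing the soup from above) is to find the inside-panel-1 divs first and then take the <p> inside each one:
p_tags = [div.find('p') for div in soup.find_all('div', class_='inside-panel-1')]
[elm.text for elm in p_tags]
# -> [' I want to extract this copy', 'I want to extract this copy']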
As the p tags are inside div class="inside-panel-1", we can easily grab them with a CSS selector as follows:
from bs4 import BeautifulSoup
html = """
<div class="top-panel">
<div class="inside-panel-0">
<h1 class="h1-title">
Some Title
</h1>
</div>
<div class="inside-panel-0">
<div class="inside-panel-1">
<p>
I want to extract this copy
</p>
</div>
<div class="inside-panel-1">
<p>
I want to extract this copy
</p>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())
p_tags = soup.select('div.top-panel div[class="inside-panel-1"]')
for p_tag in p_tags:
    print(p_tag.get_text(strip=True))
Output:
I want to extract this copy
I want to extract this copy
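Note that the selector above matches the inside-panel-1 divs and takes their text; if you would rather match the <p> elements themselves, a small variation of the same idea is:
# target the <p> tags directly instead of their parent divs
for p_tag in soup.select('div.top-panel div.inside-panel-1 > p'):
    print(p_tag.get_text(strip=True))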
I want to extract the first href of the first link in each footer using Beautiful Soup, as shown in my HTML below:
<body>
    <div>
        some html
        <footer>
            <div>...</div>
            <div>
                <span></span>
                <span></span>
                <a href="alpha">...</a>   #<<<<<<------- i want this
            </div>
        </footer>
    </div>
    <div>
        some html
        <footer>
            <div>...</div>
            <div>
                <span></span>
                <span></span>
                <a href="gamma">...</a>   #<<<<<<------- i want this too
            </div>
        </footer>
    </div>
    ....
</body>
I want to return an array of the wanted links; in this case it would be [alpha, gamma].
Here is my Python code:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
content = soup.find_all('footer div a',attrs={'href' : True})
for a in content:
    print("Found the URL:", a['href'])
The code is not giving what I want. If I search for just 'a', it returns all the links on the page; I tried many things with no results, and even a CSS selector didn't work for me.
Try this:
from bs4 import BeautifulSoup
sample_html = """
<div>
    some html
    <footer>
        <div>...</div>
        <div>
            <span></span>
            <span></span>
            <a href="link1 i want">...</a>
        </div>
    </footer>
</div>
<div>
    some html
    <footer>
        <div>...</div>
        <div>
            <span></span>
            <span></span>
            <a href="link2 i want">...</a>
        </div>
    </footer>
</div>
"""
links_you_want = [
    footer.find("a")["href"] for footer in
    BeautifulSoup(sample_html, "html.parser").find_all("footer")
]
print(links_you_want)
Output:
['link1 i want', 'link2 i want']
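As a side note, find_all() expects a tag name (or a list, function, etc.), not a CSS selector string, which is why 'footer div a' matched nothing. If you would rather use CSS selectors, a rough equivalent of the above is:
soup = BeautifulSoup(sample_html, "html.parser")
# grab only the first <a href=...> inside each footer
links_you_want = [footer.select_one("a[href]")["href"] for footer in soup.select("footer")]
print(links_you_want)  # same output as above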
I want to change the position of the closing part of a tag by removing it from one place and placing it somewhere else. I tried to use BeautifulSoup, but the functions seem to work on whole tags; I don't know how to move just the closing part of a tag, like </div>, without destroying the preceding part of the tag.
How can I change the position of the closing part of a tag?
Example:
html = """
<html>
<body>
<div>
<div class="A">
<h1 id="H1">H1</h1>
</div>
<div>
<div class="B">
</div>
</div> < ----- remove from here
<div class="b1">
<div class="c">
</div>
</div>
< ----- place here
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
One of my ideas is to cut the section
<div class="b1">
<div class="c">
</div>
</div>
and place it after <div class="B"> using the insert_after function, but I don't know how to move the whole section in one move.
By moving that </div> further down, you are in effect moving the b1 div inside the div that follows the A div (the one wrapping the B div). So you could copy the b1 div, append it to that other div, and then delete the original. This could be done as follows:
from bs4 import BeautifulSoup
import copy
html = """
<html>
<body>
<div>
<div class="A">
<h1 id="H1">H1</h1>
</div>
<div>
<div class="B">
</div>
</div>
<div class="b1">
<div class="c">
</div>
</div>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# the plain <div> that follows the A div (the one wrapping the B div)
div_append = soup.find('div', class_='A').find_next('div')
div_b1 = soup.find('div', class_='b1')
# append a copy of b1 inside that div, then remove the original b1
div_append.append(copy.copy(div_b1))
div_b1.extract()
print(soup.prettify())
This would result in the following HTML:
<html>
 <body>
  <div>
   <div class="A">
    <h1 id="H1">
     H1
    </h1>
   </div>
   <div>
    <div class="B">
    </div>
    <div class="b1">
     <div class="c">
     </div>
    </div>
   </div>
  </div>
 </body>
</html>
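As a side note, in current versions of BeautifulSoup, appending a tag that already lives in the tree moves it, so the copy/extract pair can likely be replaced by a single call:
div_append.append(div_b1)  # relocates the b1 div; no copy or extract needed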
I want to be able to wrap a div based on its id. For example, given the following HTML:
<body>
    <div id="info">
        <div id="a1">
        </div>
        <div id="a2">
            <div id="description">
            </div>
            <div id="links">
                link
            </div>
        </div>
    </div>
</body>
I want to write a Python function that takes a document, an id, and a selector, and wraps the element with the given id in a div with the given class or id selector. For example, let's say that the HTML above is in a variable doc:
wrap(doc,'#a2','#wrapped')
will return the following HTML:
<body>
    <div id="info">
        <div id="a1">
        </div>
        <div id="wrapped">
            <div id="a2">
                <div id="description">
                </div>
                <div id="links">
                    link
                </div>
            </div>
        </div>
    </div>
</body>
I looked at some XML parsers and Python's HTMLParser, but I have not found anything that lets me not only get everything inside a specific tag but also append strings and easily edit the document. If one does not exist, what would be a good approach to this?
from BeautifulSoup import BeautifulSoup

# div1 is to be wrapped with div2
def wrap(doc, div1_id, div2_id):
    pool = BeautifulSoup(doc)
    for div in pool.findAll('div', attrs={'id': div1_id}):
        div.replaceWith('<div id="' + div2_id + '">' + div.prettify() + '</div>')
    return pool.prettify()

wrap(doc, 'a2', 'wrapped')
I recommend BeautifulSoup; although it adds a dependency, it brings a lot of convenience. The following code achieves the goal of the wrap:
from bs4 import BeautifulSoup
data = '''<body>
<div id="info">
    <div id="a1">
    </div>
    <div id="a2">
        <div id="description">
        </div>
        <div id="links">
            link
        </div>
    </div>
</div>
</body>'''
soup = BeautifulSoup(data)
div = soup.find('div', attrs={'id': 'a2'})
div.wrap(soup.new_tag('div', id='wrapper'))
Then, if we print soup.prettify(), we can see the result:
<html>
 <body>
  <div id="info">
   <div id="a1">
   </div>
   <div id="wrapper">
    <div id="a2">
     <div id="description">
     </div>
     <div id="links">
      link
     </div>
    </div>
   </div>
  </div>
 </body>
</html>
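To turn this into the generic helper the question asks for, a minimal sketch might look as follows (stripping a leading '#' from the selector arguments is an assumption made here for illustration):
from bs4 import BeautifulSoup

def wrap(doc, target_id, wrapper_id):
    # wrap the element with the given id in a new <div> carrying the wrapper id
    soup = BeautifulSoup(doc, 'html.parser')
    target = soup.find(id=target_id.lstrip('#'))
    if target is not None:
        target.wrap(soup.new_tag('div', id=wrapper_id.lstrip('#')))
    return str(soup)

# assuming doc holds the HTML from the question
print(wrap(doc, '#a2', '#wrapped'))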