I want to be able to wrap a div based on its id. For example, given the following HTML:
<body>
<div id="info">
<div id="a1">
</div>
<div id="a2">
<div id="description">
</div>
<div id="links">
link
</div>
</div>
</div>
</body>
I want to write a Python function that takes a document, an id, and a selector, and wraps the element with the given id in a div carrying the given class or id selector. For example, let's say the HTML above is in a variable doc. Then
wrap(doc,'#a2','#wrapped')
will return the following HTML:
<body>
<div id="info">
<div id="a1">
</div>
<div id="wrapped">
<div id="a2">
<div id="description">
</div>
<div id="links">
link
</div>
</div>
</div>
</div>
</body>
I looked at some XML parsers and Python HTMLParser, but I have not found anything that gives me the capability to not only get everything inside a specific tag, but then be able to append strings and easily edit the document. If one does not exist, what would be a good approach to this?
# BeautifulSoup 3 import (Python 2 only)
from BeautifulSoup import BeautifulSoup

# div1 is to be wrapped with div2
def wrap(doc, div1_id, div2_id):
    pool = BeautifulSoup(doc)
    for div in pool.findAll('div', attrs={'id': div1_id}):
        # replace the matched div with itself wrapped in a new div
        div.replaceWith('<div id="' + div2_id + '">' + div.prettify() + '</div>')
    return pool.prettify()

wrap(doc, 'a2', 'wrapped')
I recommend BeautifulSoup; it adds a dependency but also brings a lot of convenience. The following code achieves the goal of the wrap:
from bs4 import BeautifulSoup
data = '''<body>
<div id="info">
<div id="a1">
</div>
<div id="a2">
<div id="description">
</div>
<div id="links">
link
</div>
</div>
</div>
</body>'''
soup = BeautifulSoup(data)
div = soup.find('div', attrs={'id': 'a2'})
div.wrap(soup.new_tag('div', id='wrapper'))
Printing soup.prettify() then shows the result:
<html>
<body>
<div id="info">
<div id="a1">
</div>
<div id="wrapper">
<div id="a2">
<div id="description">
</div>
<div id="links">
<a href="http://example.com">
link
</a>
</div>
</div>
</div>
</div>
</body>
</html>
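For the wrap(doc, id, selector) signature from the question, a minimal sketch could look like the following. It assumes bs4 with the built-in html.parser, treats a leading # as an id and a leading . as a class, and returns the document unchanged when nothing matches. The parameter names are illustrative, not part of any library.

```python
from bs4 import BeautifulSoup


def wrap(doc, target, wrapper):
    """Wrap the first element matching the CSS selector `target` in a new div.

    `wrapper` is assumed to be '#some-id' or '.some-class' (hypothetical convention).
    """
    soup = BeautifulSoup(doc, "html.parser")
    element = soup.select_one(target)
    if element is None:
        # Nothing to wrap: hand the input back untouched.
        return doc
    if wrapper.startswith("#"):
        attrs = {"id": wrapper[1:]}
    else:
        attrs = {"class": wrapper.lstrip(".")}
    # Tag.wrap() puts the element inside the new tag, in place.
    element.wrap(soup.new_tag("div", attrs=attrs))
    return str(soup)
```

Calling wrap(doc, '#a2', '#wrapped') on the HTML from the question should then return the document with the a2 div nested inside a new div whose id is wrapped.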
I have a massive HTML file in which I need to find a piece of text for every .jpg image. The process I want to perform is:
search for the image's name referenced in an href.
if found, look ahead for the first instance of a regex.
Here is a part of the file. There are many, many entries like this. I need to grab that date.
<div class="_2ph_ _a6-p">
<div>
<div class="_2pin">
<div>
<div>
<div>
<div class="_a7nf">
<div class="_a7ng">
<div>
<a href="folder/image.jpg" target="_blank">
<img class="_a6_o _3-96" src="folder/image.jpg"/>
</a>
<div>
Mobile uploads
</div>
<div class="_3-95">
Test Test Test
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="_2pin">
<div>
Test Test Test
</div>
</div>
</div>
</div>
<div class="_3-94 _a6-o">
<a href="https://www.example.com;s=518" target="_blank">
<div class="_a72d">
Jun 25, 2011 12:10:54pm
</div>
</a>
</div>
</div>
<div class="_2ph_ _a6-p">
<div>
<div class="_2pin">
<div>
<div>
<div>
<div class="_a7nf">
<div class="_a7ng">
<div>
<a href="folder/image2.jpg" target="_blank">
<img class="_a6_o _3-96" src="folder/image2.jpg"/>
</a>
<div>
Mobile uploads
</div>
<div class="_3-95">
Test Test Test Test
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="_2pin">
<div>
Test Test Test Test
</div>
</div>
</div>
</div>
<div class="_3-94 _a6-o">
<a href="https://www.example.com;s=518" target="_blank">
<div class="_b28q">
Feb 10, 2012 1:10:54am
</div>
</a>
</div>
</div>
<div class="_3-95 _a6-g">
<div class="_2pin">Testing </div>
<div class="_3-95 _a6-p">
<div>
<div><div>
<div><div>
<div><div>
<div><div>
<div>
<div>
<a href="folder/image3.jpg" target="_blank">
<img class="_a6_o _3-96" src="folder/image.jpg"/>
</a>
<div></div>
</div>
</div>
</div>
</div>
<div class="_b28q">
Feb 10, 2012 1:10:54am
</div>
</div>
I already figured out a regex that works for the date:
rx_date = re.compile(r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)')
I need to find Jun 25, 2011 12:10:54pm for the reference of image.jpg and Feb 10, 2012 1:10:54am for the reference of image2.jpg. How can I accomplish that?
I messed around with Beautiful Soup, but all I can do with it is gather parts of the file. I could not figure out how to tell Beautiful Soup to look ahead. I tried using .parent.parent.parent.parent.parent.parent.parent.parent.child but that didn't work. Note that every div class name is random, so I can't use that as a reference.
EDIT:
I added one little monkey wrench to the logic. Sometimes the date is not in an a tag but in a div by itself. HTML example updated.
You can use the bs4 API: find each <a> tag that contains an <img>, and for the date find the next <a> tag that contains a <div>:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser") # html_doc contains the snippet from your question
for img in soup.select("a > img"):
    src = img["src"]
    date = img.find_next(lambda tag: tag.name == "a" and tag.div).text.strip()
    print(f"{src=} {date=}")
Prints:
src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'
EDIT: With updated input:
import re
rx_date = re.compile(
r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
)
for img in soup.select("a > img"):
    src = img["src"]
    date = img.find_next(
        lambda tag: tag.name == "div"
        and rx_date.search(tag.find(text=True, recursive=False) or "")
    ).text.strip()
    print(f"{src=} {date=}")
Prints:
src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'
src='folder/image.jpg' date='Feb 10, 2012 1:10:54am'
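Since the question asks for the image name as referenced in the href, a variation that keys on the <a> tag instead of the <img> might look like this. It is only a sketch: dates_by_image is a hypothetical helper name, it assumes bs4 with html.parser, and it searches forward from each .jpg link for the first text node that matches the date regex, whatever tag that text sits in.

```python
import re
from bs4 import BeautifulSoup

RX_DATE = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}"
    r"\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
)


def dates_by_image(html_doc):
    """Map each .jpg href to the first date string that follows it."""
    soup = BeautifulSoup(html_doc, "html.parser")
    result = {}
    # a[href$=".jpg"] selects links whose href ends in .jpg
    for link in soup.select('a[href$=".jpg"]'):
        # find_next(string=...) walks forward through the document's text nodes
        text_node = link.find_next(string=RX_DATE)
        match = RX_DATE.search(text_node) if text_node else None
        result[link["href"]] = match.group(0) if match else None
    return result
```

Because the lookup is on text nodes rather than on a specific tag, it does not depend on the randomized class names or on whether the date is wrapped in an <a>.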
What would be the best way to get the text of the items with class="field__label" and class="field__item" in the following code?
Note that there are other tags with the same classes outside the div class="fieldset-wrapper"; I only need the ones inside this tag.
HTML Example:
<div class="fieldset-wrapper">
<div class="field field--name-field-adresse-strasse-nr field--type-string field--label-inline clearfix">
<div class="field__label">TEXT</div>
<div class="field__item">TEXT</div>
</div>
<div class="field field--name-field-adresse-plz-ort field--type-string field--label-inline clearfix">
<div class="field__label">TEXT</div>
<div class="field__item">TEXT</div>
</div>
<div class="field field--name-field-adressen-bundesland field--type-entity-reference field--label-inline clearfix">
<div class="field__label">TEXT</div>
<div class="field__item">TEXT</div>
</div>
</div>
You can use CSS selectors to ensure that your target elements are descendants of the div class="fieldset-wrapper" element:
for item in soup.select('div.fieldset-wrapper div.field__item, div.fieldset-wrapper div.field__label'):
    print(item.text)
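If each label needs to stay paired with its item, one option is to iterate over the enclosing field divs instead. The sketch below assumes bs4 with html.parser and that every field div inside the wrapper carries the plain class field, as in the example; fields_in_wrapper is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def fields_in_wrapper(html):
    """Return (label, item) text pairs found inside div.fieldset-wrapper."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # The descendant combinator restricts matches to fields inside the wrapper.
    for field in soup.select("div.fieldset-wrapper div.field"):
        label = field.select_one("div.field__label")
        item = field.select_one("div.field__item")
        if label is not None and item is not None:
            pairs.append((label.get_text(strip=True), item.get_text(strip=True)))
    return pairs
```

Fields with the same classes outside the wrapper are skipped because the selector only matches descendants of div.fieldset-wrapper.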
For example I have code like this
<div class="container">
<div class="blablabla1">
<div class="blablabla2">
<div class="blablabla3">
<span class="hello">Hello</span>
</div>
</div>
</div>
</div>
How can I get <span> value or <span> class value?
Should I firstly find all containers?
Your question is not that clear, but in general you can access the class and text with a CSS selector like this:
from bs4 import BeautifulSoup
html = '''
<div class="container">
<div class="blablabla1">
<div class="blablabla2">
<div class="blablabla3">
<span class="hello">Hello</span>
</div>
</div>
</div>
</div>
'''
soup = BeautifulSoup(html, "lxml")
spanText = soup.select_one('div.container span').text
spanClass = soup.select_one('div.container span')['class']
Once you have obtained the soup, you can use the find_all() method to find all <span> tags with the hello class:
all_hello_spans = soup.find_all("span", class_="hello")
Requirement:
/html/body/div[3]/div[4]/div/div[7]/div/div/div/div/p/b - Contains word "TITLE"
/html/body/div[3]/div[4]/div/div[8]/div/div/div/div/p - Contains "This is my description"
Actual HTML:
<div class="secadvheading section">
<div class="section-custom">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p class="mt-15"><b>TITLE</b></p>
</div>
</div>
</div>
</div>
</div>
<div class="paragraphText parbase section">
<div class="section-custom ">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<p>This is my desciption</p>
</div>
</div>
</div>
</div>
Question:
How do I get the text of the paragraph that follows the "TITLE" div?
I tried
driver.find_element_by_xpath("//*[contains(text(),'TITLE')]/following-sibling::p")
but that didn't work. I may have multiple "TITLE" elements on the same page. How can I gracefully find each TITLE div and get its description?
You need to leave TITLE's node first: go up to the ancestor node, then use following-sibling. Try this:
//b[text()='TITLE']/ancestor::div[@class='secadvheading section']/following-sibling::div[@class='paragraphText parbase section']//p
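To check the ancestor / following-sibling idea outside a browser, and to handle several TITLE blocks on one page, the expression can be tried with lxml. This is a sketch under assumptions: the markup is a shortened stand-in for the real page, the second block is invented for illustration, and following-sibling::div[1] assumes the description div comes directly after each heading div.

```python
from lxml import html as lxml_html

# Shortened, hypothetical stand-in for the page from the question
page = """<html><body>
<div class="secadvheading section"><div class="row"><p class="mt-15"><b>TITLE</b></p></div></div>
<div class="paragraphText parbase section"><div class="row"><p>This is my description</p></div></div>
<div class="secadvheading section"><div class="row"><p class="mt-15"><b>TITLE</b></p></div></div>
<div class="paragraphText parbase section"><div class="row"><p>Second description</p></div></div>
</body></html>"""

tree = lxml_html.fromstring(page)
descriptions = []
for title in tree.xpath("//b[normalize-space()='TITLE']"):
    # Climb to the heading div, then take the first div that follows it.
    paragraphs = title.xpath(
        "ancestor::div[contains(@class, 'secadvheading')]"
        "/following-sibling::div[1]//p"
    )
    descriptions.append(paragraphs[0].text_content().strip() if paragraphs else None)

print(descriptions)
```

The same relative expression should carry over to Selenium by locating every TITLE element first and then running the second XPath from each one.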
I want to change the position of the closing part of a tag by removing it from one place and putting it in another. I tried to use BeautifulSoup, but its functions seem to work on whole tags. I don't know how to move just one part of a tag, like </div>, without destroying the preceding part of the tag.
how to change a position of a closing part of a tag
Example:
html = """
<html>
<body>
<div>
<div class="A">
<h1 id="H1">H1</h1>
</div>
<div>
<div class="B">
</div>
</div> < ----- remove from here
<div class="b1">
<div class="c">
</div>
</div>
< ----- place here
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
One of my ideas is to cut the section
<div class="b1">
<div class="c">
</div>
</div>
and place after <div class="B"> using the function insert_after but I don't know how to move the whole section in one move.
Moving that </div> further down in effect moves the b1 div inside the div that follows the A div. So you could copy the b1 div, append the copy to that div, and then delete the original. This could be done as follows:
from bs4 import BeautifulSoup
import copy
html = """
<html>
<body>
<div>
<div class="A">
<h1 id="H1">H1</h1>
</div>
<div>
<div class="B">
</div>
</div>
<div class="b1">
<div class="c">
</div>
</div>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
div_append = soup.find('div', class_='A').find_next('div')
div_b1 = soup.find('div', class_='b1')
div_append.append(copy.copy(div_b1))
div_b1.extract()
print(soup.prettify())
This would result in the following HTML:
<html>
<body>
<div>
<div class="A">
<h1 id="H1">
H1
</h1>
</div>
<div>
<div class="B">
</div>
<div class="b1">
<div class="c">
</div>
</div>
</div>
</div>
</body>
</html>
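The copy is not strictly needed: appending a tag that is already in the tree detaches it from its old position first, so a single append() moves the whole section in one step. A minimal sketch, assuming bs4 with html.parser and a compacted version of the same markup:

```python
from bs4 import BeautifulSoup

html = (
    '<div><div class="A"><h1 id="H1">H1</h1></div>'
    '<div><div class="B"></div></div>'
    '<div class="b1"><div class="c"></div></div></div>'
)
soup = BeautifulSoup(html, "html.parser")

# The unnamed div that directly follows the A div is the new parent.
target = soup.find("div", class_="A").find_next_sibling("div")
b1 = soup.find("div", class_="b1")

# append() on an element that already has a parent moves it rather than copying it.
target.append(b1)

result = str(soup)
print(result)
```

After this, the b1 div appears only once, as the last child of the div that holds B.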