Retain the children elements in beautifulSoup python? - python

I have markup as:
<p>Sample text. Click Here</p>
I want to replace the <p> tag with <span> without changing the children or text of the tag

Your task can be done using replaceWith (spelled replace_with in Beautiful Soup 4). You create the element you want to use as the replacement, move the children of the original tag into it, and then pass it as the argument to replaceWith. The documentation for replaceWith is pretty clear on how to do this.
In any case, you could read this question: How to change tag name with BeautifulSoup?
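A minimal sketch of both routes, assuming Beautiful Soup 4 and the built-in html.parser. The <a> child in the sample markup is hypothetical, standing in for whatever the <p> really contains, and the two function names are made up for illustration. The first function renames the tag in place (the approach from the linked question); the second builds a new <span> and swaps it in with replace_with.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> child is an assumption for illustration.
MARKUP = '<p>Sample text. <a href="#">Click Here</a></p>'


def rename_in_place(markup):
    """Change the tag name directly; children and text are left untouched."""
    soup = BeautifulSoup(markup, "html.parser")
    soup.p.name = "span"
    return str(soup)


def swap_with_new_tag(markup):
    """Build a new <span>, move the children over, then replace the <p>."""
    soup = BeautifulSoup(markup, "html.parser")
    old = soup.p
    new = soup.new_tag("span")
    # Iterate over a copy: append() moves each node out of the old tag.
    for child in list(old.contents):
        new.append(child)
    old.replace_with(new)
    return str(soup)


renamed = rename_in_place(MARKUP)
swapped = swap_with_new_tag(MARKUP)
print(renamed)
print(swapped)
```

Renaming in place is the shorter route when the tag's attributes should be kept as well; the replace_with route starts from a fresh tag with no attributes.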

Related

Is there a specific way of retrieving only the required information from an HTML tree? Example included

I am using Python 3.8 and BeautifulSoup 4 to parse a website. The section I want to read is here:
<h1 class="pr-new-br">
Rotring
<span> 0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368 </span>
</h1>
I find this from the website using this code and get the text from it (soup is the variable for the BeautifulSoup object from the website):
product_name_text = soup.select("h1.pr-new-br")[0].get_text()
However, this of course returns all of the text. I want to separate the text between the <a href> and the text between <span>.
How can I do this? How can I specifically look for a tag or a link in an href?
Thank you very much in advance. I am pretty new in the field, sorry if this is very basic.
The get_text method takes a separator argument that is inserted between the text of different elements. Passing strip=True also removes the surrounding whitespace and newlines from each piece.
As an example:
product_name_text = soup.select("h1.pr-new-br")[0].get_text('|', strip=True)
# You will get -> Rotring|0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368
# Then you can split on the same symbol to get a list of the different elements' texts
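The separator idea can be sketched end to end as follows, assuming Beautiful Soup 4 with the built-in html.parser and the markup exactly as posted in the question. The variable names are illustrative. It also shows two alternatives: stripped_strings, which yields the same pieces without picking a separator character, and addressing the <span> directly.

```python
from bs4 import BeautifulSoup

html = """<h1 class="pr-new-br">
Rotring
<span> 0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368 </span>
</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select("h1.pr-new-br")[0]

# Join the text pieces with a separator, then split on the same character.
parts = h1.get_text("|", strip=True).split("|")

# Alternative: iterate over the whitespace-stripped strings directly.
same_parts = list(h1.stripped_strings)

# Alternative: read only the <span> child.
span_text = h1.span.get_text(strip=True)

print(parts)
print(span_text)
```

The split approach breaks if the separator character can occur inside the text itself, which is one reason to prefer stripped_strings when that is a possibility.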

Empty list as output from scrapy response object

I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that a space in a CSS selector is the descendant combinator, so "div.home-hero-blurb no-select" looks for a <no-select> element nested inside the div. To match one element that carries both classes, chain the class names with dots: "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you select only the div, ::text returns the text nodes directly inside the div, but not those of its children. Since there is also a strong element as a child of p, I would suggest a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return the text of all of the div's descendants.
It's relevant to point out that the ::text pseudo-element is an extension to standard CSS selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()

Python: how to identify block-level HTML elements that contain text?

I have raw HTML files from which I have removed the script tags.
I want to identify the block elements in the DOM (like <h1> <p> <div> etc, not <a> <em> <b> etc) and enclose them in <div> tags.
Is there an easy way to do this in Python?
Is there a Python library that can identify block elements?
Thanks
UPDATE
Actually, I want to extract information from the HTML document. I have to identify the blocks which contain text. For each text element, I have to find its closest parent element that is displayed as a block. After that, for each block, I will extract features such as the size and position of the block.
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say 'data' is the raw content of a site; then you could:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')  # for raw bytes you may also need from_encoding="utf-8" or similar
Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:
def walker(soup):
    # Text nodes have name None and no children to descend into
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
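For the "closest block parent of each text element" part of the question, one possible sketch with Beautiful Soup 4 follows. Beautiful Soup has no built-in notion of block versus inline elements, so BLOCK_TAGS is a hand-picked, incomplete assumption, and closest_block is a hypothetical helper name; the sample markup is invented as well. CSS can also change how an element is displayed, which this tag-name check does not see.

```python
from bs4 import BeautifulSoup

# Assumption: a hand-picked subset of tags that browsers display as blocks.
BLOCK_TAGS = {
    "div", "p", "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "table", "section", "article", "blockquote", "pre",
}


def closest_block(text_node):
    """Return the nearest ancestor whose tag name is in BLOCK_TAGS, or None."""
    for parent in text_node.parents:
        if parent.name in BLOCK_TAGS:
            return parent
    return None


# Hypothetical input document.
data = (
    "<div><p>Hello <b>bold</b> world</p>"
    "<h1>Title <a href='#'>link</a></h1></div>"
)
soup = BeautifulSoup(data, "html.parser")

blocks = []
seen_ids = set()  # compare by identity; Tag equality compares content
for node in soup.find_all(string=True):
    if not node.strip():
        continue  # skip whitespace-only text nodes
    block = closest_block(node)
    if block is not None and id(block) not in seen_ids:
        seen_ids.add(id(block))
        blocks.append(block)

block_names = [block.name for block in blocks]
print(block_names)
```

Each entry in blocks is a regular Tag, so it can be wrapped in a new <div> or inspected further from there.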

Xpath: how to get the text of <a> tag inside a <p> tag

I have the following issue when trying to get information from some website using scrapy.
I'm trying to get all the text inside <p> tag, but my problem is that in some cases inside those tags there is not just text, but sometimes also an <a> tag, and my code stops collecting the text when it reaches that tag.
This is my Xpath expression, it's working properly when there aren't tags contained inside:
description = descriptionpath.xpath("span[@itemprop='description']/p/text()").extract()
Posting Pawel Miech's comment as an answer as it appears his comment has helped many of us thus far and contains the right answer:
Tack //text() on the end of the xpath to specify that text should be recursively extracted.
So your xpath would appear like this:
span[@itemprop='description']/p//text()
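The effect of the extra slash can be seen in a small sketch. It uses lxml directly rather than a Scrapy response, so xpath() returns a plain list and there is no .extract() call; the markup and variable names are hypothetical.

```python
from lxml import html

# Hypothetical markup: a <p> that mixes plain text with an <a> tag.
markup = (
    "<div>"
    '<span itemprop="description">'
    '<p>Plain text, then <a href="/x">a link</a> and more text.</p>'
    "</span>"
    "</div>"
)
root = html.fromstring(markup)

# /text(): only text nodes that are direct children of <p>.
direct = root.xpath("span[@itemprop='description']/p/text()")

# //text(): text nodes at any depth below <p>, including inside <a>.
recursive = root.xpath("span[@itemprop='description']/p//text()")

description = "".join(recursive)
print(direct)
print(description)
```

Joining the recursive result gives the paragraph as one string, which is usually what is wanted for a description field.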

Beautiful Soup - Find identified tag in the original text

I need to manipulate certain text in an HTML document after I have identified the text in the original document. Let's say I have this HTML code
<div id="identifier">
<a href="link" id="linkid">
</a>
</div>
I want to delete the id attribute in the <a> tag. I can identify a particular tag using BeautifulSoup, but because it changes the formatting of the original document I can't search/replace the string either. I don't want to just write output of BeautifulSoup, instead I want to identify <a href="link" id="linkid"> tag in the original document and replace with just <a href="link">. Any idea how to proceed?
Answering a few questions raised:
This is a huge existing codebase that needs some updating, so it's not just a single search/replace kind of job.
The original formatting is important because the organization follows certain coding standards for formatting code, which I want to retain. Also, BS introduces extra tags for the sake of completeness like for and so on.
Which version of beautifulsoup are you using?
You can edit html nodes like dictionaries in bs4
From documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#changing-tag-names-and-attributes
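from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest" id="verybold">Extremely bold</b>', 'html.parser')
tag = soup.b
del tag['class']
del tag['id']  # raises KeyError if the tag has no such attribute
# tag is now <b>Extremely bold</b>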
Also, you seem to have a problem with the way Beautiful Soup outputs the modified HTML code.
If you want to pretty-print your document, or use custom formatting, you can do it easily:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output
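Applied to the markup from the question, the attribute deletion could look like the sketch below. It assumes Beautiful Soup 4 with the built-in html.parser, which for a simple fragment like this one keeps the whitespace between tags and does not wrap the result in extra tags. Whether the output of a large real-world file matches the original byte for byte is not guaranteed, so treat this as a starting point rather than a formatting-preserving rewrite.

```python
from bs4 import BeautifulSoup

# Markup from the question, with its original line breaks.
html = '<div id="identifier">\n<a href="link" id="linkid">\n</a>\n</div>'

soup = BeautifulSoup(html, "html.parser")

# Locate the link by its id, then drop only that attribute.
link = soup.find("a", id="linkid")
del link["id"]

result = str(soup)
print(result)
```

Using str(soup) instead of prettify() avoids the re-indentation that prettify() applies.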
