Extract subclass from class using beautifulsoup - python

I'm working with the following HTML snippet from a page on Goodreads using Python 3.6.3:
<div class="quoteText">
“Don't cry because it's over, smile because it happened.”
<br/> ―
<a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>, <div class="quoteText">
I used BeautifulSoup to scrape the HTML and isolate just the "quoteText" class seen in the snippet above. Now, I want to save the quote and author name as separate strings. I was able to get the author name using
(quote_tag.find(class_="quoteText")).text
I'm not sure how to do the same for the quote. I'm guessing I need a way to remove the subclass from my output and tried using the extract method.
quote.extract(class_="authorOrTitle")
but I got an error saying extract got an unexpected keyword argument 'class_'
Is there any other way to do what I'm trying to do?
This is my first time posting on here so I apologize if the post doesn't meet particular specificity/formatting/other standards.

PageElement.extract() removes a tag or string from the tree and returns the tag or string that was extracted. It takes no arguments, so find the tag you want to remove first and then call extract() on it:
from bs4 import BeautifulSoup
a='''<div class="quoteText">
“Don't cry because it's over, smile because it happened.”
<br/> -
<a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>, <div class="quoteText">'''
s=BeautifulSoup(a,'lxml')
s.find(class_="authorOrTitle").extract()
print(s.text)
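To end up with the quote and the author as two separate strings, something along these lines should work. This is only a sketch: it uses the built-in html.parser instead of lxml, and it assumes the quote is always followed by the "―" separator as in the snippet from the question.

```python
from bs4 import BeautifulSoup

html = '''<div class="quoteText">
“Don't cry because it's over, smile because it happened.”
<br/> ―
<a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
quote_tag = soup.find(class_="quoteText")

# extract() takes no arguments: locate the author link first, then detach it.
author_tag = quote_tag.find(class_="authorOrTitle").extract()
author = author_tag.get_text(strip=True)

# What remains is the quote plus the dash separator; trim the separator off.
quote = quote_tag.get_text(strip=True).rstrip("―").strip()

print(author)
print(quote)
```

Because extract() returns the removed tag, the author name can be read from the return value and the quote from what is left in the div.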

Related

Is there a specific way of retrieving only the required information from an HTML tree? Example included

I am using Python 3.8 and BeautifulSoup 4 to parse a website. The section I want to read is here:
<h1 class="pr-new-br">
Rotring
<span> 0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368 </span>
</h1>
I find this from the website using this code and get the text from it (soup is the variable for the BeautifulSoup object from the website):
product_name_text = soup.select("h1.pr_new_br")[0].get_text()
However, this of course returns all of the text. I want to separate the text between the <a href> and the text between <span>.
How can I do this? How can I specifically search for a tag or a link in an href?
Thank you very much in advance, I am pretty new in the field, sorry if this is very basic.
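The get_text method takes a separator argument that is placed between the text of the different elements, and strip=True trims the whitespace around each piece. Also note that the class in your HTML is pr-new-br (with hyphens), so the selector has to be h1.pr-new-br, not h1.pr_new_br.
As an example:
product_name_text = soup.select("h1.pr-new-br")[0].get_text('|', strip=True)
# You will get -> Rotring|0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368
# Then you can split on the same symbol to get a list of each element's text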
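Here is a small self-contained sketch of the separator approach, using the HTML from the question and the built-in html.parser. It assumes the h1 contains exactly two pieces of text (the brand and the span). If the text itself could contain the "|" character, iterating over stripped_strings avoids the split altogether.

```python
from bs4 import BeautifulSoup

html = '''<h1 class="pr-new-br">
Rotring
<span> 0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368 </span>
</h1>'''

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.pr-new-br")

# Join each element's stripped text with "|", then split it back apart.
brand, description = h1.get_text("|", strip=True).split("|")

# Alternative without a separator character: one entry per non-empty string.
pieces = list(h1.stripped_strings)

# The span can also be read directly.
span_text = h1.span.get_text(strip=True)

print(brand)
print(description)
```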

How do I replace an entire div with a string of valid HTML in Python?

I'm trying to populate a lot of templated html documents with html strings contained in a json. For example, my html might look like:
<div class="replace_this_div">
<div>
<p>this text</p>
<p>should be replaced</p>
</div>
</div>
The replacement is in string form and would look something like:
"<p>My replacement code might have standard paragraphs, links, or other html elements such as lists.</p>"
Afterwards, it should simply look like this:
<div class="replace_this_div">
"<p>My replacement code might have standard paragraphs, links, or other html elements such as lists.</p>"
</div>
I've messed around a bit in BeautifulSoup trying to accomplish this. The problem I'm having is that even though I simply want to replace everything inside the designated div, I can't figure out how to do so using my string which is already formatted as html (especially with how beautifulsoup uses tags).
Does anybody have any insight on how to do this? Thanks!
You can use clear() to remove the contents of the tag. Then create a BeautifulSoup object out of your string by calling the constructor, and add it inside the original tag using append().
from bs4 import BeautifulSoup
html="""
<div class="replace_this_div">
<div>
<p>this text</p>
<p>should be replaced</p>
</div>
</div>
"""
new_content=u'<p>My replacement code might have standard paragraphs, links, or other html elements such as lists.</p>'
soup=BeautifulSoup(html,'html.parser')
outer_div=soup.find('div',attrs={"class":"replace_this_div"})
outer_div.clear()
outer_div.append(BeautifulSoup(new_content,'html.parser'))
print(soup.prettify())
Output
<div class="replace_this_div">
 <p>
  My replacement code might have standard paragraphs, links, or other html elements such as lists.
 </p>
</div>

Error accessing the class having hyphen(-) separated names in html file using BeautifulSoup

I am trying to scrape the data of popular English movies on Hotstar.
I downloaded the HTML source code and I am doing this:
from bs4 import BeautifulSoup as soup
page_soup = soup(open('hotstar.html'),'html.parser')
containers = page_soup.findAll("div",{"class":"col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope"})
container = containers[0]
# To get video link
container.div.hs-cards-directive.article.a
I am getting an error at this point:
NameError: name 'cards' is not defined
These are the first few lines of the html file:
<div bindonce="" class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope" ng-repeat="slides in gridcardData">
<hs-cards-directive cdata="slides" class="ng-isolate-scope" renderingdone="shownCard()">
<article class="card show-card" ng-class="{'live-sport-card':isLiveSportCard, 'card-active':btnRemoveShow,'tounament-tray-card':record.isTournament}" ng-click="cardeventhandler({cardrecord:record})" ng-init="init()" pdata="record" removecard="removecard" watched="watched">
<a href="http://www.hotstar.com/movies/step-up-revolution/1770016594" ng-href="/movies/step-up-revolution/1770016594" restrict-anchor="">
Please help me out!
I am using Python 3.6.3 on Windows.
As (loosely) explained in the Going down section of the docs, the tag.descendant syntax is just a convenient shortcut for tag.find('descendant').
That shortcut can't be used in cases where you have tags whose names aren't valid Python identifiers (see footnote 1). (Also in cases where you have tags whose names collide with methods of BS4 itself, like a <find> tag.)
Python identifiers can only have letters, digits, and underscores, not hyphens. So, when you write this:
container.div.hs-cards-directive.article.a
… Python parses it like this mathematical expression:
container.div.hs - cards - directive.article.a
BeautifulSoup's div node has no descendant named hs, but that's fine; it just returns None. But then Python has to look up cards as an ordinary variable in order to subtract it from that None, and since there is no variable named cards, you get a NameError.
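If you want to see this for yourself, the standard library's ast module shows how the expression is parsed without running it. This is just an illustration and doesn't involve BeautifulSoup at all.

```python
import ast

expr = "container.div.hs-cards-directive.article.a"
top = ast.parse(expr, mode="eval").body

# The outermost node is a subtraction, not an attribute lookup.
print(type(top).__name__, type(top.op).__name__)

# Subtraction is left-associative:
# (container.div.hs - cards) - directive.article.a
inner = top.left
print(inner.left.attr)   # last attribute of the first operand
print(inner.right.id)    # the bare variable name Python will look up
print(top.right.attr)    # last attribute of the final operand
```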
Anyway, the only solution in this case is to not use the shortcut and call find explicitly:
container.div.find('hs-cards-directive').article.a
Or, if it makes sense for your use case, you can just skip down to article, because the shortcut finds any descendants, not just direct children:
container.div.article.a
But I don't think that's appropriate in your case; you want articles only under specific child nodes, not all possible articles, right?
1. Technically, it is actually possible to use the shortcut, it's just not a shortcut anymore. If you understand what getattr(container.div, 'hs-cards-directive').article.a means, then you can write that and it will work… but obviously find is going to be more readable and easier to understand.
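A minimal sketch of both options, on a trimmed-down version of the HTML from the question. The closing tags and the link text are assumptions, since the original snippet is cut off, and html.parser is used so nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Trimmed-down markup; closing tags and link text are assumed.
html = '''<div class="col-xs-6 ng-scope">
<hs-cards-directive cdata="slides" class="ng-isolate-scope">
<article class="card show-card">
<a href="http://www.hotstar.com/movies/step-up-revolution/1770016594">Step Up Revolution</a>
</article>
</hs-cards-directive>
</div>'''

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="ng-scope")

# Explicit find() works for any tag name, hyphens included.
link = container.find("hs-cards-directive").article.a

# getattr() goes through the attribute shortcut and reaches the same tag.
same_link = getattr(container, "hs-cards-directive").article.a

video_url = link["href"]
print(video_url)
```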

Not able to extract data using scrapy with class names containing spaces and hyphens

I am new to scrapy and I have to extract text from a tag with multiple class names, where the class names contain spaces and hyphens.
Example:
<div class="info">
<span class="price sale">text1</span>
<span class="title ng-binding">some text</span>
</div>
When I use the code:
response.xpath("//span[contains(@class,'price sale')]/text()").extract()
I am able to get text1 but when I use:
response.xpath("//span[contains(@class,'title ng-binding')]/text()").extract()
I get an empty list. Why is this happening and how to handle this?
The expression you're looking for is:
//span[contains(@class, 'title') and contains(@class, 'ng-binding')]
I highly suggest XPath visualizer, which can help you debug xpath expressions easily. It can be found here:
http://xpathvisualizer.codeplex.com/
Or with CSS try
response.css("span.title.ng-binding")
There is also a chance that the element with ng-binding is loaded via JavaScript/Ajax and is therefore not included in the initial server response.
In a CSS selector every class name is prefixed with ".", so you can replace the spaces between the class names with "." when using response.css().
In your case you can try:
response.css("span.title.ng-binding::text").extract()
This code should return the text you are looking for.
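To see why matching the whole class string is fragile, here is a small sketch. It uses lxml directly as a stand-in for response.xpath() so it runs without a crawl (that substitution is my assumption, not something from the question), and the class order on the second span is deliberately swapped to show the failure.

```python
from lxml import html as lxml_html

# Same structure as the question, but with the two class names swapped
# on the second span (a hypothetical variation).
doc = lxml_html.fromstring('''<div class="info">
<span class="price sale">text1</span>
<span class="ng-binding title">some text</span>
</div>''')

# Substring match on the whole attribute: depends on order and spacing.
substring = doc.xpath("//span[contains(@class, 'title ng-binding')]/text()")

# One contains() per class name: order no longer matters.
both = doc.xpath(
    "//span[contains(@class, 'title') and contains(@class, 'ng-binding')]/text()"
)

# Stricter token match, so 'title' does not also match e.g. 'subtitle'.
token = doc.xpath(
    "//span[contains(concat(' ', normalize-space(@class), ' '), ' title ')]/text()"
)

print(substring, both, token)
```

The CSS form span.title.ng-binding has the same order-independent behaviour as the second expression.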

Beautiful Soup - Find identified tag in the original text

I need to manipulate certain text in an HTML document after I have identified the text in the original document. Let's say I have this HTML code
<div id="identifier">
<a href="link" id="linkid">
</a>
</div>
I want to delete the id attribute in the <a> tag. I can identify a particular tag using BeautifulSoup, but because it changes the formatting of the original document I can't search/replace the string either. I don't want to just write output of BeautifulSoup, instead I want to identify <a href="link" id="linkid"> tag in the original document and replace with just <a href="link">. Any idea how to proceed?
Answering a few questions raised:
This is a huge existing codebase that needs some updating, so it's not just a single search/replace kind of job.
The original formatting is important because the organization follows certain coding standards for formatting code, which I want to retain. Also, BS introduces extra tags for the sake of completeness, and so on.
Which version of beautifulsoup are you using?
You can edit html nodes like dictionaries in bs4
From documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#changing-tag-names-and-attributes
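from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest" id="verybold">Extremely bold</b>', 'html.parser')
tag = soup.b
# Attributes behave like dictionary keys, so del removes them
del tag['class']
del tag['id']
print(tag)  # <b>Extremely bold</b>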
Also, you seem to have a problem with the way Beautiful Soup outputs the modified HTML code.
If you want to pretty-print your document, or use custom formatting, you can do that easily:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output
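Applied to the HTML from the question, a sketch could look like the following. It assumes the built-in html.parser, which does not wrap fragments in extra <html>/<body> tags, and it writes the result with str() rather than prettify() so the existing line breaks are kept. I would not rely on the output being byte-for-byte identical for arbitrary documents (attribute quoting and entities can be normalized), so compare against the original before overwriting files.

```python
from bs4 import BeautifulSoup

original = '''<div id="identifier">
<a href="link" id="linkid">
</a>
</div>'''

soup = BeautifulSoup(original, "html.parser")

# Remove only the id attribute from the matching <a> tag.
del soup.find("a", id="linkid")["id"]

# str() serializes without re-indenting, unlike prettify().
result = str(soup)
print(result)
```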
