I am using Python and BeautifulSoup to extract the data, but for some reason it doesn't see the text value.
HTML page has:
<div ng-if="!order.hidePrices" style="white-space: nowrap;" class="ng-binding ng-scope">$ 1,599.00</div>
Python code:
for price in prices:
    price_value = price.find('div', {"class":"ng-binding", "class":"ng-scope"})
    print(price_value)
The Python output is missing the text value:
<div class="ng-binding ng-scope" ng-if="!order.hidePrices" style="white-space: nowrap;"></div>
The HTML file has no other class with that name. What am I doing wrong?
There are several issues here:
Your locator does not seem to be unique.
price_value here is an element object, not a string. To retrieve its text you should apply .text to it.
You are using BS, not Selenium, here.
See if this works:
for price in prices:
    # match the full class attribute; a dict with "class" twice keeps only the last value
    price_value = price.find('div', class_='ng-binding ng-scope')
    print(price_value.get_text())
If you have a running driver instance, then do this:
print(driver.find_element_by_css_selector("div.ng-binding.ng-scope").text)
Selenium's class-name locator does not support compound class names (class values containing spaces), so use a CSS selector instead.
Update:
for div in soup.select('div.ng-binding.ng-scope'):
    print(div.text)
You can also do:
print(soup.find_all('div', class_='ng-binding ng-scope'))
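As a rough check of the points above, here is a minimal sketch run against the single div from the question. The variable names are only for illustration. If the div still comes back empty on the real page, the text may not be present in the HTML you fetched at all (Angular pages often fill it in client-side). That is an assumption; the question does not confirm it.

```python
from bs4 import BeautifulSoup

html = ('<div ng-if="!order.hidePrices" style="white-space: nowrap;" '
        'class="ng-binding ng-scope">$ 1,599.00</div>')
soup = BeautifulSoup(html, "html.parser")

# A dict literal keeps only the last value for a repeated key,
# so the filter in the question only ever checked for "ng-scope".
attrs = {"class": "ng-binding", "class": "ng-scope"}
print(attrs)

# A CSS selector requires both classes, regardless of their order.
prices = [div.get_text(strip=True) for div in soup.select("div.ng-binding.ng-scope")]
print(prices)
```
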
Update 1:
See what the official docs say.
html.parser Advantages
Batteries included
Decent speed
Lenient (As of Python 2.7.3 and 3.2.)
html.parser Disadvantages
Not as fast as lxml, less lenient than html5lib.
Read the Python docs for more detail:
An HTMLParser instance is fed HTML data and calls handler methods when
start tags, end tags, text, comments, and other markup elements are
encountered. The user should subclass HTMLParser and override its
methods to implement the desired behavior.
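To make that description concrete, here is a minimal sketch of such a subclass, using only the standard library. The class name DivTextCollector and its attributes are invented for this example, and it only tracks nested div tags, so treat it as an illustration and not as a general-purpose extractor.

```python
from html.parser import HTMLParser

class DivTextCollector(HTMLParser):
    """Collect the text of every <div> that carries all of the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.depth = 0      # > 0 while we are inside a matching <div>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.depth:                    # a div nested inside a match
            self.depth += 1
        elif self.wanted <= classes:      # a new matching div starts here
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

html = ('<div ng-if="!order.hidePrices" style="white-space: nowrap;" '
        'class="ng-binding ng-scope">$ 1,599.00</div>')
parser = DivTextCollector(["ng-binding", "ng-scope"])
parser.feed(html)
parser.close()
price_texts = [text.strip() for text in parser.texts]
print(price_texts)
```
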
Related
With the HTML below, I have been unable to use BeautifulSoup's soup.select to obtain only the first child of each <div class="wrap-25PNPwRV"> (i.e. -11.94M and 2.30M) as a list.
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
These are just two examples from the HTML I'm attempting to scrape. They sit inside a table generated dynamically by JavaScript; the page has many other div elements, and the table contains many more divs with class "wrap-25PNPwRV".
I currently have the code below, which scrapes all the contents of each div with class "wrap-25PNPwRV":
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output :
['-11.94M', '2.30M']
I tried following this guide, https://www.crummy.com/software/BeautifulSoup/bs4/doc/, but have been unable to apply it to the code above.
Please note: if soup.select cannot do this, I am happy to use an alternative, provided it generates the same list format/output.
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
I'd suggest not using the .wrap-25PNPwRV class. It looks randomly generated and will almost certainly change in the future.
Instead, select the <div> element whose next sibling has a class starting with "change". For example:
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
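For comparison, here is a small self-contained sketch that runs both selectors from the answers against the two sample blocks in the question. The variable names by_position and by_sibling are only for illustration, and it assumes a bs4 version that ships with CSS selector support (soupsieve).

```python
from bs4 import BeautifulSoup

html = """
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# First <div> among its siblings, anywhere under a wrap element.
by_position = [d.get_text(strip=True)
               for d in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]

# Any <div> immediately followed by an element whose class starts with "change".
by_sibling = [d.get_text(strip=True)
              for d in soup.select('div:has(+ [class^="change"])')]

print(by_position)
print(by_sibling)
```

The second form does not depend on the generated wrap class name, which is the point the answer makes about robustness.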
I am trying to scrape the data of popular English movies on Hotstar.
I downloaded the HTML source code and I am doing this:
from bs4 import BeautifulSoup as soup
page_soup = soup(open('hotstar.html'),'html.parser')
containers = page_soup.findAll("div",{"class":"col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope"})
container = containers[0]
# To get video link
container.div.hs-cards-directive.article.a
I am getting an error at this point:
NameError: name 'cards' is not defined
These are the first few lines of the html file:
<div bindonce="" class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope" ng-repeat="slides in gridcardData">
<hs-cards-directive cdata="slides" class="ng-isolate-scope" renderingdone="shownCard()">
<article class="card show-card" ng-class="{'live-sport-card':isLiveSportCard, 'card-active':btnRemoveShow,'tounament-tray-card':record.isTournament}" ng-click="cardeventhandler({cardrecord:record})" ng-init="init()" pdata="record" removecard="removecard" watched="watched">
<a href="http://www.hotstar.com/movies/step-up-revolution/1770016594" ng-href="/movies/step-up-revolution/1770016594" restrict-anchor="">
Please help me out!
I am using Python 3.6.3 on Windows.
As (loosely) explained in the Going down section of the docs, the tag.descendant syntax is just a convenient shortcut for tag.find('descendant').
That shortcut can't be used in cases where you have tags whose names aren't valid Python identifiers (see note 1 below). (Also in cases where you have tags whose names collide with methods of BS4 itself, like a <find> tag.)
Python identifiers can only have letters, digits, and underscores, not hyphens. So, when you write this:
container.div.hs-cards-directive.article.a
… Python parses it like this mathematical expression:
container.div.hs - cards - directive.article.a
BeautifulSoup's div node has no descendant named hs, but that's fine; it just returns None. But then you try to subtract cards from that None, and you get a NameError.
Anyway, the only solution in this case is to not use the shortcut and call find explicitly:
container.div.find('hs-cards-directive').article.a
Or, if it makes sense for your use case, you can just skip down to article, because the shortcut finds any descendants, not just direct children:
container.div.article.a
But I don't think that's appropriate in your case; you want articles only under specific child nodes, not all possible articles, right?
1. Technically, it is actually possible to use the shortcut, it's just not a shortcut anymore. If you understand what getattr(container.div, 'hs-cards-directive').article.a means, then you can write that and it will work… but obviously find is going to be more readable and easier to understand.
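To see the parse for yourself, the standard-library ast module can show how Python reads that line without running it. This is only a sketch to illustrate the explanation above; the variable names are arbitrary and nothing here touches BeautifulSoup.

```python
import ast

expr = "container.div.hs-cards-directive.article.a"
tree = ast.parse(expr, mode="eval").body

# The top-level node is a subtraction, not an attribute lookup.
print(type(tree).__name__, type(tree.op).__name__)

# Its left operand is itself a subtraction: (container.div.hs) - cards
inner = tree.left
print(inner.left.attr, inner.right.id)

# Its right operand is the attribute chain directive.article.a
print(tree.right.value.value.id, tree.right.value.attr, tree.right.attr)
```

Because cards is parsed as a plain variable name, evaluating the expression fails with NameError as soon as Python tries to look it up.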
I want to get a value inside a certain div from an HTML page:
<div class="well credit">
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
</div>
I've done it with regular expressions (re.search()), but it takes too long to find the div since it's a huge HTML file.
Is there a way to do this faster, but with no external libraries?
Thanks
I would use BeautifulSoup!
To get everything with a <div> tag, just do:
soup = BeautifulSoup(html, 'html.parser')  # parse the HTML into a searchable tree
soup.find_all('div')
To get the value inside the span, you could do:
soup.find('span').get_text()
There are tons of different methods of getting the information you need.
Good luck, hope this helps!
Python has only one HTML parser in the standard library and it's pretty low-level, so you'll have to install some sort of HTML parsing library if you want to work with HTML.
lxml is by far the fastest:
import lxml.html
root = lxml.html.parse(handle)  # handle: a filename or an open file object
price = root.xpath('//div[@class="well credit"]//span/text()')[0]
If you want it to be even faster, use root.iter and stop parsing the HTML once you hit the right element.
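If external libraries really are off the table, the same stop-early idea can be sketched with the standard library's html.parser: feed the page in chunks and abandon parsing as soon as the span is closed. The names CreditFinder, Found and find_credit, the 64-character chunk size and the filler markup are all made up for this example. It also assumes the first span inside the "well credit" div is the one you want.

```python
from html.parser import HTMLParser

class Found(Exception):
    """Raised to abandon parsing as soon as the value is known."""

class CreditFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_credit_div = False
        self.in_span = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "div" and {"well", "credit"} <= classes:
            self.in_credit_div = True
        elif tag == "span" and self.in_credit_div:
            self.in_span = True

    def handle_data(self, data):
        if self.in_span:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.in_span:
            # Stop here: the rest of the document is never parsed.
            raise Found("".join(self.parts).strip())

def find_credit(chunks):
    parser = CreditFinder()
    try:
        for chunk in chunks:
            parser.feed(chunk)
    except Found as hit:
        return hit.args[0]
    return None

page = """<div class="well credit">
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
</div>""" + "<p>filler</p>" * 1000

chunks = (page[i:i + 64] for i in range(0, len(page), 64))
credit = find_credit(chunks)
print(credit)
```

With a real file you would read fixed-size blocks from the open file object instead of slicing a string; whether this beats a regular expression on your data is something to measure, not assume.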
Scrapy might also be a solution for this. Please read http://doc.scrapy.org/en/latest/topics/selectors.html
x = sel.xpath('//div[@class="span2"]')
for i in x:
    print(i.extract())
Output:
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
I am trying to scrape all the data within a div as follows. However, the quotes are throwing me off.
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
I am trying to start it with something along the lines of
addressStart = page.find("<div id="address">")
but the quotes within the div are messing me up. Does anybody know how I can fix this?
To answer your specific question, you need to escape the quotes, or use a different type of quote on the string itself:
addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
But don't do that. If you are trying to "parse" HTML, let a third-party library do that. Try Beautiful Soup. You get a nice object back which you can use to traverse or search. You can grab attributes, values, etc... without having to worry about the complexities of parsing HTML or XML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')
for address in soup.find_all('div', id='address'):  # returns a list; use find if you just want the first
    for info in address.find_all('div', class_='info'):  # for attribute class, use class_ instead since class is a reserved word
        print(info.string)
I need to manipulate certain text in an HTML document after I have identified the text in the original document. Let's say I have this HTML code
<div id="identifier">
<a href="link" id="linkid">
</a>
</div>
I want to delete the id attribute in the <a> tag. I can identify a particular tag using BeautifulSoup, but because it changes the formatting of the original document I can't search/replace the string either. I don't want to just write output of BeautifulSoup, instead I want to identify <a href="link" id="linkid"> tag in the original document and replace with just <a href="link">. Any idea how to proceed?
Answering a few questions raised:
This is a huge existing codebase that needs some updating, so it's not just a single search/replace kind of job.
The original formatting is important because the organization follows certain coding standards for formatting code, which I want to retain. Also, BS introduces extra tags for the sake of completeness like for and so on.
Which version of BeautifulSoup are you using?
You can edit HTML nodes like dictionaries in bs4.
From the documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#changing-tag-names-and-attributes
soup = BeautifulSoup('<b class="boldest" id="1">Extremely bold</b>', 'html.parser')
tag = soup.b
del tag['class']
del tag['id']
# tag is now <b>Extremely bold</b>
Also, you seem to have a problem with the way Beautiful Soup outputs the modified HTML code.
If you want to pretty-print your document, or use custom formatting, you can do it easily:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output
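Applied to the markup from the question, the dictionary-style deletion could look like the sketch below. It assumes the built-in html.parser backend, which does not wrap a fragment in extra html or body tags; the input is written on one line here for brevity. Serialising the tree can still normalise details of the original formatting, so this shows the attribute edit only and does not guarantee byte-for-byte preservation.

```python
from bs4 import BeautifulSoup

html = '<div id="identifier"><a href="link" id="linkid"></a></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", id="linkid")
del link["id"]          # attributes behave like dictionary keys

result = str(soup)
print(result)
```
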