I want to get a value inside a certain div from an HTML page:
<div class="well credit">
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
</div>
I've done it with regular expressions (re.search()), but it takes too long to find the div since the HTML is huge.
Is there a way to do this faster but with no external libraries?
Thanks
I would use BeautifulSoup!
To get everything with a <div> tag, just do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)  # make soup that is parse-able by bs
soup.findAll('div')
To get the value inside the span you could do:
soup.find('span').get_text()
There are tons of different methods for getting the information you need.
Good Luck hope this helps!
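Putting it together for the HTML in the question, a minimal runnable sketch (html.parser is just one parser choice; lxml or html5lib would also work):

```python
from bs4 import BeautifulSoup

# the snippet from the question
html = '''
<div class="well credit">
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# strip=True trims the surrounding whitespace and newlines
price = soup.find('span').get_text(strip=True)
print(price)  # $ 5.402
```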
Python has only one HTML parser in the standard library and it's pretty low-level, so you'll have to install some sort of HTML parsing library if you want to work with HTML.
lxml is by far the fastest:
import lxml.html
root = lxml.html.parse(handle)
price = root.xpath('//div[@class="well credit"]//span/text()')[0]
If you want it to be even faster, use root.iter and stop parsing the HTML once you hit the right element.
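A sketch of that early-exit idea; the HTML string here is a stand-in for the question's document, and the class check is one way to spot the target element:

```python
import lxml.html

html = '''<html><body>
<div class="well credit">
<div class="span2"><h3><span>$ 5.402</span></h3></div>
</div>
</body></html>'''

root = lxml.html.fromstring(html)

# Walk the tree and bail out at the first match instead of
# collecting every result with xpath()
price = None
for div in root.iter('div'):
    if div.get('class') == 'well credit':
        span = div.find('.//span')
        if span is not None:
            price = span.text.strip()
        break

print(price)  # $ 5.402
```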
Scrapy might also be a solution for this. Please read http://doc.scrapy.org/en/latest/topics/selectors.html
from scrapy.selector import Selector

sel = Selector(text=html)
x = sel.xpath('//div[@class="span2"]')
for i in x:
    print(i.extract())
Output:
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
In the HTML elements below, I have been unsuccessful using BeautifulSoup's soup.select to obtain only the first child after div class="wrap-25PNPwRV" (i.e. −11.94M and 2.30M) in list format.
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
The above is just two examples from the HTML I'm attempting to scrape; the source code lies within a dynamically generated JavaScript table, and there are many more div attributes on the page and many more div class "wrap-25PNPwRV" elements inside that table.
I currently have the code below, which lets me scrape all the contents within div class="wrap-25PNPwRV":
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output :
['-11.94M', '2.30M']
I tried following this guide https://www.crummy.com/software/BeautifulSoup/bs4/doc/ but have been unsuccessful to implement it to my above code.
Please note, if soup.select cannot do the above, I am happy to use an alternative provided it generates the same list format/output.
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
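A quick check against the snippet from the question (note the values use the Unicode minus sign U+2212, not an ASCII hyphen):

```python
from bs4 import BeautifulSoup

html = '''
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# nth-of-type(1) keeps only the first <div> child of each wrapper
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
print(data_list)  # ['−11.94M', '2.30M']
```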
I'd suggest not using the .wrap-25PNPwRV class. It seems random and will almost certainly change in the future.
Instead, select the <div> element which has other element with class="change..." as sibling. For example
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
I am using Python and BeautifulSoup to extract the data, but for some reason it doesn't see the text value.
HTML page has:
<div ng-if="!order.hidePrices" style="white-space: nowrap;" class="ng-binding ng-scope">$ 1,599.00</div>
Python code:
for price in prices:
    price_value = price.find('div', {"class":"ng-binding", "class":"ng-scope"})
    print(price_value)
And python output is missing text value:
<div class="ng-binding ng-scope" ng-if="!order.hidePrices" style="white-space: nowrap;"></div>
The HTML file has no other class with that name. What am I doing wrong?
There are several issues here:
Your locator does not seem to be unique.
price_value here is a web element. To retrieve its text you should apply .text on it.
You are using BS here, not Selenium.
See if this will work
for price in prices:
    # a dict can't hold two "class" keys; the second silently replaces the first,
    # so pass the full class string instead
    price_value = price.find('div', class_='ng-binding ng-scope')
    print(price_value.get_text())
If you have a running driver instance, the do this :
print(driver.find_element_by_css_selector("div.ng-binding.ng-scope").text)
Class names with spaces are not supported in Selenium, so use a CSS selector instead.
Update:
for div in soup.select('div.ng-binding.ng-scope'):
    print(div.text)
also you can do :
print(soup.find_all('div', class_='ng-binding ng-scope'))
Update 1:
See what the official docs say:
html.parser Advantages
Batteries included
Decent speed
Lenient (As of Python 2.7.3 and 3.2.)
html.parser Disadvantages
Not as fast as lxml, less lenient than html5lib.
Read the docs here for more detail:
An HTMLParser instance is fed HTML data and calls handler methods when
start tags, end tags, text, comments, and other markup elements are
encountered. The user should subclass HTMLParser and override its
methods to implement the desired behavior.
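To connect this back to the original "no external libraries" question: a small HTMLParser subclass can pull the span text using only the standard library. This is a sketch; the class names mirror the first snippet in this thread:

```python
from html.parser import HTMLParser

class SpanPriceParser(HTMLParser):
    """Collect the text inside the first <span> of the target div."""
    def __init__(self):
        super().__init__()
        self.in_target_div = False
        self.in_span = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag == 'div' and ('class', 'well credit') in attrs:
            self.in_target_div = True
        elif tag == 'span' and self.in_target_div and self.result is None:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        if self.in_span and data.strip():
            self.result = data.strip()

parser = SpanPriceParser()
parser.feed('<div class="well credit"><div class="span2">'
            '<h3><span>$ 5.402</span></h3></div></div>')
print(parser.result)  # $ 5.402
```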
I need to find an input field after a label. The id and name are dynamic and change on every login.
HTML:
<label for="EveryTimeDifferent">LabelText</label>
<div>
<div>
<input name="EveryTimeDifferent" id="EveryTimeDifferent">
</div>
</div>
Python:
driver.find_element_by_xpath("//label[text()='LabelText']//following::input[1]")
Well, first off you'll need to find something that stays the same each session on the given website. I assume something in the label stays the same in the given example. You could use BeautifulSoup and find the label with bs4's HTML-parsing functions (you'll need to create a local string representation of the website's HTML for the BeautifulSoup() instance to be created).
That'd be something like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML_SRC_STR, 'html.parser')
soup.find('your_label_tag', ...)
The docs of bs4 will explain it way better: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
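If the label text is the stable part, here is a hedged sketch with bs4 using the question's HTML; find_next walks forward in document order, so it reaches the <input> even though it is not a sibling of the label:

```python
from bs4 import BeautifulSoup

html = '''
<label for="EveryTimeDifferent">LabelText</label>
<div>
<div>
<input name="EveryTimeDifferent" id="EveryTimeDifferent">
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
label = soup.find('label', string='LabelText')
input_el = label.find_next('input')  # first <input> after the label in document order
print(input_el['name'])  # EveryTimeDifferent
```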
I am trying to scrape all the data within a div as follows. However, the quotes are throwing me off.
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
I am trying to start it with something along the lines of
addressStart = page.find("<div id="address">")
but the quotes within the div are messing me up. Does anybody know how I can fix this?
To answer your specific question, you need to escape the quotes, or use a different type of quote on the string itself:
addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
But don't do that. If you are trying to "parse" HTML, let a third-party library do that. Try Beautiful Soup. You get a nice object back which you can use to traverse or search. You can grab attributes, values, etc... without having to worry about the complexities of parsing HTML or XML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page)
for address in soup.find_all('div', id='address'):  # returns a list; use find if you just want the first
    for info in address.find_all('div', class_='info'):  # class is a reserved word, so use class_ for the class attribute
        print(info.string)
On using this regular expression in Python:
pathstring = '<span class="titletext">(.*)</span>'
pathFinderTitle = re.compile(pathstring)
My output is:
Govt has nothing to do with former CAG official RP Singh:
Sibal</span></a></h2></div><div class="esc-lead-article-source-wrapper">
<table class="al-attribution single-line-height" cellspacing="0" cellpadding="0">
<tbody><tr><td class="al-attribution-cell source-cell">
<span class='al-attribution-source'>Times of India</span></td>
<td class="al-attribution-cell timestamp-cell">
<span class='dash-separator'> - </span>
<span class='al-attribution-timestamp'>46 minutes ago
The text match should have stopped at the first "</span>".
Please suggest what's wrong here.
.* is a greedy match of any characters; it is going to consume as many characters as possible. Instead, use the non-greedy version .*?, as in
pathstring = '<span class="titletext">(.*?)</span>'
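A quick demonstration of the difference, assuming a document with two titletext spans:

```python
import re

html = ('<span class="titletext">Title one</span>'
        '<span class="titletext">Title two</span>')

# greedy: .* runs to the LAST </span> in the string
greedy = re.search('<span class="titletext">(.*)</span>', html).group(1)
# non-greedy: .*? stops at the FIRST </span>
lazy = re.search('<span class="titletext">(.*?)</span>', html).group(1)

print(greedy)  # Title one</span><span class="titletext">Title two
print(lazy)    # Title one
```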
I would suggest using pyquery instead of going mad with regular expressions... It's based on lxml and makes HTML parsing as easy as using jQuery.
Something like this is everything you need:
from pyquery import PyQuery

doc = PyQuery(html)
doc('span.titletext').text()
You could also use beautifulsoup, but the result is always the same: don't use regular expressions for parsing HTML, there are tools out there for making your life easier.
.* will match </span> so it keeps on going until the last one.
The best answer is: Don't parse html with regular expressions. Use the lxml library (or something similar).
from lxml import html
html_string = '<blah>'
tree = html.fromstring(html_string)
titles = tree.xpath("//span[@class='titletext']")
for title in titles:
    print(title.text)
Using a proper xml/html parser will save you massive amounts of time and trouble. If you roll your own parser, you'll have to cater for malformed tags, comments, and myriad other things. Don't reinvent the wheel.
You could also just as easily use BeautifulSoup which is great for doing this kind of thing.
# using BeautifulSoup4; install with "pip install beautifulsoup4"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
result = soup.find('span', 'titletext')
And then result would hold the <span> with class titletext as you're looking for.