Python: getting the part after div class

Here is what I have
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
I would like to scrape "75500", but have no clue how to do that.
When I use
soup.findAll('div',{"class":"investor-item"})
it does not capture what I want. Do you have any suggestion?

There are a number of ways you could capture this. Your command worked for me. Though since you have a Euro sign in there, you may want to make sure your script is using the right encoding. Also, remember that find_all will return a list, not just the first matching item.
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
mytag = soup.find('div', {"class": "investor-item"})
mytag2 = soup.find('div', class_="investor-item")
mytag3 = soup.find_all('div', class_="investor-item")[0]
mytag4 = soup.findAll('div', class_="investor-item")[0]
mytag5 = soup.findAll('div',{"class":"investor-item"})[0]
print(mytag['usrid'])  # prints 75500
print(mytag2['usrid']) # also prints 75500
print(mytag3['usrid']) # also prints 75500
print(mytag4['usrid']) # also prints 75500
print(mytag5['usrid']) # also prints 75500
EDIT: Here are some more details on the 5 different examples I gave.
The typical naming convention for Python functions is all lowercase with underscores, whereas some other languages use camel case. So although find_all() is the "official" spelling in BeautifulSoup for Python, and findAll is what you'd see in BeautifulSoup for other languages, the Python version accepts findAll too.
As mentioned, find_all returns a list whereas find returns the first match, so doing a find_all and taking the first element ([0]) gives the same result.
Finally, {"class": "investor-item"} is an example of the general way you can specify attributes beyond just the HTML tag name itself: you pass the additional constraints in a dictionary like this. But since class is such a common attribute to look for, BeautifulSoup gives you the option of skipping the dictionary and typing class_= followed by a string of the class name you're looking for. The trailing underscore is there so Python doesn't confuse it with class, the Python keyword for defining a class.
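As a quick sketch of that last point (my own illustration, not part of the original answer): the attrs dictionary isn't limited to class; you can match on any attribute, including the custom usrid from the question:

```python
from bs4 import BeautifulSoup

html = '<div class="investor-item" usrid="75500"><div class="number">10,000€</div></div>'
soup = BeautifulSoup(html, 'html.parser')

# Search by an arbitrary attribute via the attrs dictionary
tag = soup.find('div', attrs={'usrid': '75500'})
print(tag['usrid'])   # 75500
print(tag['class'])   # ['investor-item'] -- class is multi-valued, so it comes back as a list
```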

Related

Selenium Missing text value

I am using python & beautifulsoup to extract the data, but for some reason it doesn't see the text value.
HTML page has:
<div ng-if="!order.hidePrices" style="white-space: nowrap;" class="ng-binding ng-scope">$ 1,599.00</div>
Python code:
for price in prices:
    price_value = price.find('div', {"class":"ng-binding", "class":"ng-scope"})
    print(price_value)
And python output is missing text value:
<div class="ng-binding ng-scope" ng-if="!order.hidePrices" style="white-space: nowrap;"></div>
The HTML file has no other class with that name. Where am I going wrong?
There are several issues here:
Your locator does not appear to be unique.
price_value here is an element. To retrieve its text you should apply .get_text() to it.
You are using BeautifulSoup here, not Selenium.
See if this works:
for price in prices:
    price_value = price.find('div', {"class": "ng-binding ng-scope"})
    print(price_value.get_text())
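One pitfall worth flagging here (my note, not from the original answer): the question's locator dictionary repeats the "class" key, and a Python dict literal silently keeps only the last value for a duplicated key, so BeautifulSoup only ever sees ng-scope:

```python
# Duplicate keys in a dict literal: the last value silently wins
locator = {"class": "ng-binding", "class": "ng-scope"}
print(locator)  # {'class': 'ng-scope'}

# To require both classes, pass the full attribute string instead,
# e.g. find('div', class_='ng-binding ng-scope'),
# or use a CSS selector: select('div.ng-binding.ng-scope')
```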
If you have a running driver instance, then do this:
print(driver.find_element_by_css_selector("div.ng-binding.ng-scope").text)
Class names containing spaces are not supported in Selenium's by-class-name locator, so use a CSS selector instead.
update :
for div in soup.select('div.ng-binding.ng-scope'):
    print(div.text)
You can also do:
print(soup.find_all('div', class_='ng-binding ng-scope'))
Update 1:
Here is what the official docs say.
html.parser Advantages
Batteries included
Decent speed
Lenient (As of Python 2.7.3 and 3.2.)
html.parser Disadvantages
Not as fast as lxml, less lenient than html5lib.
Read here for more detailed docs
An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The user should subclass HTMLParser and override its methods to implement the desired behavior.
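To make that quoted description concrete, here is a minimal subclassing sketch (my own illustration, reusing the price snippet from the question above):

```python
from html.parser import HTMLParser

# Subclass HTMLParser and override handler methods to react to markup events
class DivTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_div = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.in_div = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_div = False

    def handle_data(self, data):
        # Collect non-whitespace text found inside a <div>
        if self.in_div and data.strip():
            self.texts.append(data.strip())

parser = DivTextParser()
parser.feed('<div class="ng-binding ng-scope">$ 1,599.00</div>')
print(parser.texts)  # ['$ 1,599.00']
```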

Error accessing the class having hyphen(-) separated names in html file using BeautifulSoup

I am trying to scrape the data of popular english movies on Hotstar
I downloaded the html source code and I am doing this:
from bs4 import BeautifulSoup as soup
page_soup = soup(open('hotstar.html'),'html.parser')
containers = page_soup.findAll("div",{"class":"col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope"})
container = containers[0]
# To get video link
container.div.hs-cards-directive.article.a
I am getting an error at this point:
NameError: name 'cards' is not defined
These are the first few lines of the html file:
<div bindonce="" class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope" ng-repeat="slides in gridcardData">
<hs-cards-directive cdata="slides" class="ng-isolate-scope" renderingdone="shownCard()">
<article class="card show-card" ng-class="{'live-sport-card':isLiveSportCard, 'card-active':btnRemoveShow,'tounament-tray-card':record.isTournament}" ng-click="cardeventhandler({cardrecord:record})" ng-init="init()" pdata="record" removecard="removecard" watched="watched">
<a href="http://www.hotstar.com/movies/step-up-revolution/1770016594" ng-href="/movies/step-up-revolution/1770016594" restrict-anchor="">
Please help me out!
I am using Python 3.6.3 on Windows.
As (loosely) explained in the Going down section of the docs, the tag.descendant syntax is just a convenient shortcut for tag.find('descendant').
That shortcut can't be used in cases where you have tags whose names aren't valid Python identifiers.1 (Also in cases where you have tags whose names collide with methods of BS4 itself, like a <find> tag.)
Python identifiers can only have letters, digits, and underscores, not hyphens. So, when you write this:
container.div.hs-cards-directive.article.a
… python parses it like this mathematical expression:
container.div.hs - cards - directive.article.a
BeautifulSoup's div node has no descendant named hs, but that's fine; it just returns None. But then you try to subtract cards from that None, and you get a NameError.
Anyway, the only solution in this case is to not use the shortcut and call find explicitly:
container.div.find('hs-cards-directive').article.a
Or, if it makes sense for your use case, you can just skip down to article, because the shortcut finds any descendants, not just direct children:
container.div.article.a
But I don't think that's appropriate in your case; you want articles only under specific child nodes, not all possible articles, right?
1. Technically, it is actually possible to use the shortcut, it's just not a shortcut anymore. If you understand what getattr(container.div, 'hs-cards-directive').article.a means, then you can write that and it will work… but obviously find is going to be more readable and easier to understand.
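To make the two spellings concrete, here is a minimal runnable sketch (hypothetical markup of mine, mirroring the question's structure):

```python
from bs4 import BeautifulSoup

html = ('<div><hs-cards-directive>'
        '<article><a href="/movies/step-up-revolution/1770016594">link</a></article>'
        '</hs-cards-directive></div>')
soup = BeautifulSoup(html, 'html.parser')

# Explicit find() call -- the readable option
explicit = soup.div.find('hs-cards-directive').article.a

# getattr() spelling from the footnote -- works, but harder to read
via_getattr = getattr(soup.div, 'hs-cards-directive').article.a

print(explicit['href'])
print(via_getattr['href'])
```

Both lines print the same href, since both expressions navigate to the same tag in the parse tree.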

How to apply multiple arguments within find_all function?

Is it possible to use multiple arguments in a single soup.find_all() call to find a particular item within certain elements? What I want to do is easy with the soup.select() option. To be more specific, look at the example below:
from bs4 import BeautifulSoup
html_element='''
<div class="browse-movie-bottom">
<a>Logan Lucky</a>
<div class="browse-movie-year">2017</div>
<div class="browse-movie-tags">
<a>Logan Lucky 720p</a>
<a>Logan Lucky 1080p</a>
</div>
</div>
'''
soup = BeautifulSoup(html_element,"lxml")
for item in soup.find_all(class_='browse-movie-bottom')[0].find_all(class_='browse-movie-tags')[0].find_all("a"):
    # for item in soup.select(".browse-movie-bottom .browse-movie-tags a"):
    print(item.text)
On the one hand, I parsed the movie tags using soup.select(), which, as you know, lets all the arguments fit together in a single call.
On the other hand, I did the same using soup.find_all(), which required three different calls. The results are the same, though.
My question is: is it possible to create an expression using soup.find_all() that includes multiple arguments in a single call, as I did with soup.select()? Something like below:
This is a faulty one, but it gives you an idea of the type of expression I'm after:
soup.find_all({'class':'browse-movie-bottom'},{'class':'browse-movie-tags'},"a")[0]
However, the valid search results were:
Logan Lucky 720p
Logan Lucky 1080p
You should use a CSS selector:
>>> soup.select(".browse-movie-bottom .browse-movie-tags a")
[<a>Logan Lucky 720p</a>, <a>Logan Lucky 1080p</a>]
>>> [item.text for item in soup.select(".browse-movie-bottom .browse-movie-tags a")]
[u'Logan Lucky 720p', u'Logan Lucky 1080p']
More info: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Unless there's something you cannot do with CSS selectors (because not all of them are implemented), you should use select. Otherwise, use the more tedious find_all.
An example of a CSS selector that was not implemented at the time: nth-child. See: selecting second child in beautiful soup with soup.select?
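One more option the answers here don't mention (my addition, a sketch rather than anything from the original thread): find_all() also accepts a function, which lets you express the whole nested condition in a single call:

```python
from bs4 import BeautifulSoup

html_element = '''
<div class="browse-movie-bottom">
<a>Logan Lucky</a>
<div class="browse-movie-year">2017</div>
<div class="browse-movie-tags">
<a>Logan Lucky 720p</a>
<a>Logan Lucky 1080p</a>
</div>
</div>
'''
soup = BeautifulSoup(html_element, 'html.parser')

def tag_link(tag):
    # True for <a> tags that live inside a div.browse-movie-tags
    return (tag.name == 'a'
            and tag.find_parent('div', class_='browse-movie-tags') is not None)

results = [item.text for item in soup.find_all(tag_link)]
print(results)  # ['Logan Lucky 720p', 'Logan Lucky 1080p']
```

Note that the outer "Logan Lucky" link is excluded, because it has no div.browse-movie-tags ancestor.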
You could pass a list to the class_ attribute; that way you save one call to find_all, but you then need to call it again with "a". And you can always just call soup("a"), using the shortcut for find_all:
from bs4 import BeautifulSoup
html_element='''
<div class="browse-movie-bottom">
<a>Logan Lucky</a>
<div class="browse-movie-year">2017</div>
<div class="browse-movie-tags">
<a>Logan Lucky 720p</a>
<a>Logan Lucky 1080p</a>
</div>
</div>
'''
soup = BeautifulSoup(html_element,"lxml")
for item in soup.findAll(class_=['browse-movie-bottom', 'browse-movie-tags'])[1]('a'):
    # for item in soup.select(".browse-movie-bottom .browse-movie-tags a"):
    print(item.text)

How to use the find_all method from BS4 to scrape certain strings

<li class="sre" data-tn-component="asdf-search-result" id="85e08291696a3726" itemscope="" itemtype="http://schema.org/puppies">
<div class="sre-entry">
<div class="sre-side-bar">
</div>
<div class="sre-content">
<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')" style="cursor: pointer;" target="_blank">
I need to grab the string '/r/85e08291696a3726?sp=0', which occurs throughout the page. I'm not sure how to use the soup.find_all method to do this. The strings that I need always occur inside the window.open(...) call of an onclick attribute.
This is what I was thinking (below), but obviously I am getting the parameters wrong. How would I format the find_all method to return the '/r/85e08291696a3726?sp=0' strings throughout the page?
for divsec in soup.find_all('div', class_='clickable_asdf_card'):
    print('got links')
    x = x + 1
I read the documentation for bs4 and I was thinking about using find_all('clickable_asdf_card') to find all occurrences of the string I need but then what? Is there a way to adjust the parameters to return the string I need?
Use BeautifulSoup's built-in regular expression search to find and extract the desired substring from an onclick attribute value:
import re
pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
for item in soup.find_all(onclick=pattern):
    print(pattern.search(item["onclick"]).group(1))
If there is just a single element you want to find, use find() instead of find_all().
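Here is a self-contained version of that approach, assembled by me from the question's snippet for illustration:

```python
import re
from bs4 import BeautifulSoup

html = ('<div class="clickable_asdf_card" '
        'onclick="window.open(\'/r/85e08291696a3726?sp=0\', \'_blank\')" '
        'style="cursor: pointer;" target="_blank">card</div>')
soup = BeautifulSoup(html, 'html.parser')

# find_all accepts a compiled regex as an attribute filter; the capture
# group then pulls the URL out of the matched onclick value
pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
links = [pattern.search(tag['onclick']).group(1)
         for tag in soup.find_all(onclick=pattern)]
print(links)  # ['/r/85e08291696a3726?sp=0']
```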

Quotes Messing Up Python Scraper

I am trying to scrape all the data within a div as follows. However, the quotes are throwing me off.
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
I am trying to start it with something along the lines of
addressStart = page.find("<div id="address">")
but the quotes within the div are messing me up. Does anybody know how I can fix this?
To answer your specific question, you need to escape the quotes, or use a different type of quote on the string itself:
addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
But don't do that. If you are trying to "parse" HTML, let a third-party library do that. Try Beautiful Soup. You get a nice object back which you can use to traverse or search. You can grab attributes, values, etc... without having to worry about the complexities of parsing HTML or XML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')
for address in soup.find_all('div', id='address'):  # returns a list; use find if you just want the first
    for info in address.find_all('div', class_='info'):  # for the class attribute use class_, since class is a reserved word
        print(info.string)
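A self-contained version of that snippet (my assembly, using the question's HTML):

```python
from bs4 import BeautifulSoup

page = '''
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
'''
soup = BeautifulSoup(page, 'html.parser')

infos = []
for address in soup.find_all('div', id='address'):
    for info in address.find_all('div', class_='info'):
        infos.append(info.string)
print(infos)  # ['14955 Shady Grove Rd.', 'Rockville, MD 20850', 'Suite: 300']
```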
