Is it possible to use multiple arguments in a single soup.find_all() call to find a particular item within certain elements? What I'm after can easily be achieved with soup.select(). To be more specific, look at the example below:
from bs4 import BeautifulSoup
html_element='''
<div class="browse-movie-bottom">
Logan Lucky
<div class="browse-movie-year">2017</div>
<div class="browse-movie-tags">
<a>Logan Lucky 720p</a>
<a>Logan Lucky 1080p</a>
</div>
</div>
'''
soup = BeautifulSoup(html_element,"lxml")
for item in soup.find_all(class_='browse-movie-bottom')[0].find_all(class_='browse-movie-tags')[0].find_all("a"):
# for item in soup.select(".browse-movie-bottom .browse-movie-tags a"):
    print(item.text)
On the one hand, I parsed the movie tags using soup.select(), which, as you know, lets all the arguments fit together in a single call. On the other hand, I did the same using soup.find_all(), which required three separate calls chained together. The results are the same, though.
My question is: is it possible to build an expression using soup.find_all() that includes multiple arguments in a single call, as I did with soup.select()? The following is faulty, but it should give you an idea of the type of expression I'm after:
soup.find_all({'class':'browse-movie-bottom'},{'class':'browse-movie-tags'},"a")[0]
However, the expected results are:
Logan Lucky 720p
Logan Lucky 1080p
You should use a CSS selector:
>>> soup.select(".browse-movie-bottom .browse-movie-tags a")
[<a>Logan Lucky 720p</a>, <a>Logan Lucky 1080p</a>]
>>> [item.text for item in soup.select(".browse-movie-bottom .browse-movie-tags a")]
[u'Logan Lucky 720p', u'Logan Lucky 1080p']
More info: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Unless there's something you cannot do with CSS selectors (not all of them are implemented), you should use select. Otherwise, use the more tedious find_all.
An example of a CSS selector that is not implemented: :nth-child.
selecting second child in beautiful soup with soup.select?
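Where a selector such as :nth-child isn't available, you can fall back on find_all and index into the list it returns. A minimal sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>first</li><li>second</li><li>third</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# equivalent of li:nth-child(2): find_all returns a list, so index it
second = soup.find("ul").find_all("li")[1]
print(second.text)
```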
You could pass a list to the class_ attribute; this way you save one call to find_all, but you still need to call it again with "a". And you can always call soup("a") as a shortcut for soup.find_all("a"):
from bs4 import BeautifulSoup
html_element='''
<div class="browse-movie-bottom">
Logan Lucky
<div class="browse-movie-year">2017</div>
<div class="browse-movie-tags">
<a>Logan Lucky 720p</a>
<a>Logan Lucky 1080p</a>
</div>
</div>
'''
soup = BeautifulSoup(html_element,"lxml")
for item in soup.findAll(class_=['browse-movie-bottom', 'browse-movie-tags'])[1]('a'):
# for item in soup.select(".browse-movie-bottom .browse-movie-tags a"):
    print(item.text)
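The soup("a") shortcut works because calling a BeautifulSoup object (or any tag) is forwarded to find_all. A quick sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<p><a>one</a> <a>two</a></p>"
soup = BeautifulSoup(html, "html.parser")

# calling the soup object is shorthand for calling find_all on it
assert soup("a") == soup.find_all("a")
texts = [a.text for a in soup("a")]
print(texts)
```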
Here is what I have
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
I would like to scrape "75500", but have no clue how to do that.
When I use
soup.findAll('div',{"class":"investor-item"})
it does not capture what I want. Do you have any suggestion?
There are a number of ways you could capture this. Your command worked for me. Though since you have a Euro sign in there, you may want to make sure your script is using the right encoding. Also, remember that find_all will return a list, not just the first matching item.
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
mytag = soup.find('div', {"class": "investor-item"})
mytag2 = soup.find('div', class_="investor-item")
mytag3 = soup.find_all('div', class_="investor-item")[0]
mytag4 = soup.findAll('div', class_="investor-item")[0]
mytag5 = soup.findAll('div',{"class":"investor-item"})[0]
print(mytag['usrid']) # Returns 75500
print(mytag2['usrid']) # Also returns 75500
print(mytag3['usrid']) # Also returns 75500
print(mytag4['usrid']) # Also returns 75500
print(mytag5['usrid']) # Also returns 75500
EDIT: Here are some more details on the 5 different examples I gave.
The typical naming convention for Python functions is all lowercase with underscores, whereas some other languages use camel case. So although find_all() is the "official" name in current BeautifulSoup, findAll is kept as a legacy alias from older versions, so Python accepts both.
As mentioned, find_all returns a list whereas find returns the first match, so doing a find_all and taking the first element ([0]) gives the same result.
Finally, {"class": "investor-item"} is an example of the general way you can specify attributes beyond just the HTML tag name itself: you pass the additional attributes in a dictionary. But since class is such a common attribute to search by, BeautifulSoup also lets you skip the dictionary and write class_= followed by the class name you're looking for. The trailing underscore is there so Python doesn't confuse it with class, the keyword for defining a class.
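The same dictionary form also covers attributes that have no keyword shortcut, such as the usrid attribute from the question. A small sketch (fragment trimmed from the question's HTML):

```python
from bs4 import BeautifulSoup

html = '<div class="investor-item" usrid="75500">10,000</div>'
soup = BeautifulSoup(html, "html.parser")

# the dict form and the class_ shortcut find the same tag
assert soup.find("div", {"class": "investor-item"}) == soup.find("div", class_="investor-item")

# non-class attributes can be matched through attrs=
tag = soup.find("div", attrs={"usrid": "75500"})
print(tag["usrid"])
```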
I'm working with the following HTML snippet from a page on Goodreads using Python 3.6.3:
<div class="quoteText">
“Don't cry because it's over, smile because it happened.”
<br/> ―
<a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>, <div class="quoteText">
I used BeautifulSoup to scrape the HTML and isolate just the "quoteText" class seen in the snippet above. Now, I want to save the quote and author name as separate strings. I was able to get the author name using
(quote_tag.find(class_="quoteText")).text
I'm not sure how to do the same for the quote. I'm guessing I need a way to remove the subclass from my output and tried using the extract method.
quote.extract(class_="authorOrTitle")
but I got an error saying extract got an unexpected keyword argument 'class_'
Is there any other way to do what I'm trying to do?
This is my first time posting on here so I apologize if the post doesn't meet particular specificity/formatting/other standards.
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
from bs4 import BeautifulSoup
a='''<div class="quoteText">
“Don't cry because it's over, smile because it happened.”
<br/> -
<a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>, <div class="quoteText">'''
s=BeautifulSoup(a,'lxml')
s.find(class_="authorOrTitle").extract()
print(s.text)
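Since extract() returns the tag it removed, you can keep the author from the return value and then read what's left of the div for the quote. A sketch (html.parser assumed; the stripping of the stray dash is my own addition):

```python
from bs4 import BeautifulSoup

html = '''<div class="quoteText">
“Don't cry because it's over, smile because it happened.”
<br/> ―
<a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="quoteText")

# extract() returns the tag it removed, so grab the author from it
author = div.find(class_="authorOrTitle").extract().text.strip()

# what remains in the div is the quote plus a stray dash
quote = div.text.strip().rstrip("―").strip()
print(author)
print(quote)
```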
I am trying to scrape the data of popular english movies on Hotstar
I downloaded the html source code and I am doing this:
from bs4 import BeautifulSoup as soup
page_soup = soup(open('hotstar.html'),'html.parser')
containers = page_soup.findAll("div",{"class":"col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope"})
container = containers[0]
# To get video link
container.div.hs-cards-directive.article.a
I am getting an error at this point:
NameError: name 'cards' is not defined
These are the first few lines of the html file:
<div bindonce="" class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope" ng-repeat="slides in gridcardData">
<hs-cards-directive cdata="slides" class="ng-isolate-scope" renderingdone="shownCard()">
<article class="card show-card" ng-class="{'live-sport-card':isLiveSportCard, 'card-active':btnRemoveShow,'tounament-tray-card':record.isTournament}" ng-click="cardeventhandler({cardrecord:record})" ng-init="init()" pdata="record" removecard="removecard" watched="watched">
<a href="http://www.hotstar.com/movies/step-up-revolution/1770016594" ng-href="/movies/step-up-revolution/1770016594" restrict-anchor="">
Please help me out!
I am using Python 3.6.3 on Windows.
As (loosely) explained in the Going down section of the docs, the tag.descendant syntax is just a convenient shortcut for tag.find('descendant').
That shortcut can't be used in cases where you have tags whose names aren't valid Python identifiers.1 (Also in cases where you have tags whose names collide with methods of BS4 itself, like a <find> tag.)
Python identifiers can only have letters, digits, and underscores, not hyphens. So, when you write this:
container.div.hs-cards-directive.article.a
… python parses it like this mathematical expression:
container.div.hs - cards - directive.article.a
BeautifulSoup's div node has no descendant named hs, but that's fine; it just returns None. But then you try to subtract cards from that None, and you get a NameError.
Anyway, the only solution in this case is to not use the shortcut and call find explicitly:
container.div.find('hs-cards-directive').article.a
Or, if it makes sense for your use case, you can just skip down to article, because the shortcut finds any descendants, not just direct children:
container.div.article.a
But I don't think that's appropriate in your case; you want articles only under specific child nodes, not all possible articles, right?
1. Technically, it is actually possible to use the shortcut, it's just not a shortcut anymore. If you understand what getattr(container.div, 'hs-cards-directive').article.a means, then you can write that and it will work… but obviously find is going to be more readable and easier to understand.
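To see the difference concretely, here is a minimal sketch with a made-up fragment containing a hyphenated tag name:

```python
from bs4 import BeautifulSoup

# minimal made-up fragment with a hyphenated tag name
html = '<div><hs-cards-directive><article><a href="/x">link</a></article></hs-cards-directive></div>'
soup = BeautifulSoup(html, "html.parser")

# dotted access can't spell the hyphen, so call find() explicitly
a1 = soup.div.find("hs-cards-directive").article.a

# getattr is the same attribute lookup written out by hand
a2 = getattr(soup.div, "hs-cards-directive").article.a

assert a1 is a2
print(a1["href"])
```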
<li class="sre" data-tn-component="asdf-search-result" id="85e08291696a3726" itemscope="" itemtype="http://schema.org/puppies">
<div class="sre-entry">
<div class="sre-side-bar">
</div>
<div class="sre-content">
<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')" style="cursor: pointer;" target="_blank">
I need to grab the string '/r/85e08291696a3726?sp=0', which occurs throughout the page. I'm not sure how to use the soup.find_all method to do this. The strings that I need always occur next to window.open('.
This is what I was thinking (below) but obviously I am getting the parameters wrong. How would I format the find_all method to return the '/r/85e08291696a3726?sp=0' strings throughout the page?
for divsec in soup.find_all('div', class_='clickable_asdf_card'):
    print('got links')
    x = x + 1
I read the documentation for bs4 and I was thinking about using find_all('clickable_asdf_card') to find all occurrences of the string I need but then what? Is there a way to adjust the parameters to return the string I need?
Use BeautifulSoup's built-in regular expression search to find and extract the desired substring from an onclick attribute value:
import re

pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
for item in soup.find_all(onclick=pattern):
    print(pattern.search(item["onclick"]).group(1))
If there is just a single element you want to find, use find() instead of find_all().
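Combined with the snippet from the question, a self-contained sketch looks like this (html.parser assumed):

```python
import re
from bs4 import BeautifulSoup

html = '''<div class="clickable_asdf_card"
    onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')"></div>'''
soup = BeautifulSoup(html, "html.parser")

# match the onclick attribute against the regex and pull out group 1
pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
links = [pattern.search(item["onclick"]).group(1)
         for item in soup.find_all(onclick=pattern)]
print(links)
```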
I am trying to scrape all the data within a div as follows. However, the quotes are throwing me off.
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
I am trying to start it with something along the lines of
addressStart = page.find("<div id="address">")
but the quotes within the div are messing me up. Does anybody know how I can fix this?
To answer your specific question, you need to escape the quotes, or use a different type of quote on the string itself:
addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
But don't do that. If you are trying to "parse" HTML, let a third-party library do that. Try Beautiful Soup. You get a nice object back which you can use to traverse or search. You can grab attributes, values, etc... without having to worry about the complexities of parsing HTML or XML:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')
for address in soup.find_all('div', id='address'):  # returns a list; use find if you just want the first
    for info in address.find_all('div', class_='info'):  # for the class attribute, use class_ since class is a reserved word
        print(info.string)
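Run against the HTML from the question, a self-contained sketch (html.parser assumed):

```python
from bs4 import BeautifulSoup

page = '''<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>'''

soup = BeautifulSoup(page, 'html.parser')
address = soup.find('div', id='address')  # find returns the first match
infos = [info.string for info in address.find_all('div', class_='info')]
for line in infos:
    print(line)
```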