How to use the find_all method from BS4 to scrape certain strings - Python

<li class="sre" data-tn-component="asdf-search-result" id="85e08291696a3726" itemscope="" itemtype="http://schema.org/puppies">
<div class="sre-entry">
<div class="sre-side-bar">
</div>
<div class="sre-content">
<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')" style="cursor: pointer;" target="_blank">
I need to grab the string '/r/85e08291696a3726?sp=0', which occurs throughout the page. I'm not sure how to use the soup.find_all method to do this. The strings I need always appear inside the onclick attribute's window.open(...) call.
This is what I was thinking (below) but obviously I am getting the parameters wrong. How would I format the find_all method to return the '/r/85e08291696a3726?sp=0' strings throughout the page?
x = 0
for divsec in soup.find_all('div', class_='clickable_asdf_card'):
    print('got links')
    x = x + 1
I read the documentation for bs4 and I was thinking about using find_all('clickable_asdf_card') to find all occurrences of the string I need but then what? Is there a way to adjust the parameters to return the string I need?

Use BeautifulSoup's built-in regular expression search to find and extract the desired substring from an onclick attribute value:
import re

pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
for item in soup.find_all(onclick=pattern):
    print(pattern.search(item["onclick"]).group(1))
If there is just a single element you want to find, use find() instead of find_all().
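To make the idea concrete, here is a minimal, self-contained sketch; the HTML fragment and the id are taken from the question above:

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')" style="cursor: pointer;">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")

# find_all accepts a compiled regex as an attribute filter: it keeps every
# tag whose onclick value matches the pattern
links = [pattern.search(tag["onclick"]).group(1)
         for tag in soup.find_all(onclick=pattern)]
# links == ['/r/85e08291696a3726?sp=0']
```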

Related

Python getting the part after div class

Here is what I have
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
I would like to scrape "75500", but have no clue how to do that.
When I use
soup.findAll('div',{"class":"investor-item"})
it does not capture what I want. Do you have any suggestion?
There are a number of ways you could capture this. Your command worked for me. Though since you have a Euro sign in there, you may want to make sure your script is using the right encoding. Also, remember that find_all will return a list, not just the first matching item.
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
mytag = soup.find('div', {"class": "investor-item"})
mytag2 = soup.find('div', class_="investor-item")
mytag3 = soup.find_all('div', class_="investor-item")[0]
mytag4 = soup.findAll('div', class_="investor-item")[0]
mytag5 = soup.findAll('div',{"class":"investor-item"})[0]
print(mytag['usrid']) # Returns 75500
print(mytag2['usrid']) # Also returns 75500
print(mytag3['usrid']) # Also returns 75500
print(mytag4['usrid']) # Also returns 75500
print(mytag5['usrid']) # Also returns 75500
EDIT: Here are some more details on the 5 different examples I gave.
The typical naming convention for Python functions is all lowercase with underscores, whereas some other languages use camel case. So although find_all() is the "official" name in the Python version of BeautifulSoup, and findAll is the camel-case style you'd see in BeautifulSoup ports for other languages (and in the older BeautifulSoup 3), the Python library accepts it too.
As mentioned, find_all returns a list whereas find returns the first match, so doing a find_all and taking the first element ([0]) gives the same result.
Finally, {"class": "investor-item"} is an example of the general way you can specify attributes beyond just the HTML tag name itself: you pass the additional attributes in a dictionary like this. But since class is such a common attribute to look for, BeautifulSoup gives you the option of skipping the dictionary and typing class_= followed by the class name you're looking for. The trailing underscore is there so Python doesn't confuse it with class, which is a reserved keyword used to define classes.
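The dictionary form works for any attribute, not just class. For instance, you could match the custom usrid attribute from the question's snippet directly (a sketch reusing that HTML):

```python
from bs4 import BeautifulSoup

html = '<div class="investor-item" usrid="75500"><div class="number">10,000€</div></div>'
soup = BeautifulSoup(html, 'html.parser')

# attrs= lets you filter on arbitrary attributes, including non-standard
# ones like usrid that have no keyword shortcut
tag = soup.find('div', attrs={'usrid': '75500'})
# tag['usrid'] == '75500'
```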

How to apply multiple arguments within find_all function?

Is it possible to use multiple arguments in a single soup.find_all() call to find a particular item within certain elements? The process I want to know about can easily be applied if I go for the soup.select() option. To be more specific, look at the example below:
from bs4 import BeautifulSoup
html_element='''
<div class="browse-movie-bottom">
    <a href="#">Logan Lucky</a>
    <div class="browse-movie-year">2017</div>
    <div class="browse-movie-tags">
        <a href="#">Logan Lucky 720p</a>
        <a href="#">Logan Lucky 1080p</a>
    </div>
</div>
'''
soup = BeautifulSoup(html_element,"lxml")
for item in soup.find_all(class_='browse-movie-bottom')[0].find_all(class_='browse-movie-tags')[0].find_all("a"):
    # for item in soup.select(".browse-movie-bottom .browse-movie-tags a"):
    print(item.text)
On the one hand, I parsed the movie tags using soup.select(), where all the arguments fit together inside a single pair of parentheses.
On the other hand, I did the same using soup.find_all(), which required three separate calls with three separate pairs of parentheses. The results are the same, though.
My question is: is it possible to build an expression with soup.find_all() that includes multiple arguments in a single call, as I did with soup.select()?
This one is faulty, but it gives an idea of the type of expression I'm after:
soup.find_all({'class':'browse-movie-bottom'},{'class':'browse-movie-tags'},"a")[0]
However, the valid search results were:
Logan Lucky 720p
Logan Lucky 1080p
You should use a CSS selector:
>>> soup.select(".browse-movie-bottom .browse-movie-tags a")
[<a href="#">Logan Lucky 720p</a>, <a href="#">Logan Lucky 1080p</a>]
>>> [item.text for item in soup.select(".browse-movie-bottom .browse-movie-tags a")]
[u'Logan Lucky 720p', u'Logan Lucky 1080p']
More info: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Unless there's something you cannot do with CSS selectors (because not all of them are implemented), you should use select. Otherwise, use the more tedious find_all.
Example of non-implemented CSS-selector: n-th child.
selecting second child in beautiful soup with soup.select?
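For what it's worth, newer BeautifulSoup releases (4.7+, which use the soupsieve backend) do support :nth-child and related pseudo-classes. On an older version, you can get the same effect from find_all by indexing, since it returns matches in document order. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = '<ul><li>first</li><li>second</li><li>third</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns matches in document order, so [1] is the second <li>,
# i.e. the equivalent of the li:nth-child(2) selector
second = soup.find_all('li')[1]
# second.text == 'second'
```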
You could pass a list to the class_ attribute; this way you save one call to find_all, but you still need to call it again for the "a" tags. Also, you can always write soup("a") as a shortcut for soup.find_all("a"):
from bs4 import BeautifulSoup
html_element='''
<div class="browse-movie-bottom">
    <a href="#">Logan Lucky</a>
    <div class="browse-movie-year">2017</div>
    <div class="browse-movie-tags">
        <a href="#">Logan Lucky 720p</a>
        <a href="#">Logan Lucky 1080p</a>
    </div>
</div>
'''
soup = BeautifulSoup(html_element,"lxml")
for item in soup.findAll(class_=['browse-movie-bottom', 'browse-movie-tags'])[1]('a'):
    # for item in soup.select(".browse-movie-bottom .browse-movie-tags a"):
    print(item.text)

Not able to extract data using scrapy with class names containing spaces and hyphens

I am new to scrapy and I have to extract text from a tag with multiple class names, where the class names contain spaces and hyphens.
Example:
<div class="info">
<span class="price sale">text1</span>
<span class="title ng-binding">some text</span>
</div>
When I use the code:
response.xpath("//span[contains(@class, 'price sale')]/text()").extract()
I am able to get text1, but when I use:
response.xpath("//span[contains(@class, 'title ng-binding')]/text()").extract()
I get an empty list. Why is this happening and how to handle this?
The expression you're looking for is:
//span[contains(@class, 'title') and contains(@class, 'ng-binding')]
I highly suggest XPath Visualizer, which can help you debug XPath expressions easily. It can be found here:
http://xpathvisualizer.codeplex.com/
Or with CSS try
response.css("span.title.ng-binding")
Or there is a chance that the element with ng-binding is loaded via JavaScript/Ajax and is therefore not included in the initial server response.
You can replace the spaces with "." in your code when using response.css().
In your case you can try:
response.css("span.title.ng-binding::text").extract()
This code should return the text you are looking for.
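To see why the single contains() call is brittle: contains(@class, 'title ng-binding') matches a literal substring of the class attribute, so it breaks if the page serves the classes in a different order or with different whitespace. A sketch using lxml to illustrate (same XPath semantics as Scrapy's selectors; the reversed-class markup is made up for the demonstration):

```python
from lxml import html

# Same element as in the question, but with the classes reversed
doc = html.fromstring('<div><span class="ng-binding title">some text</span></div>')

# The substring match finds nothing, because 'title ng-binding' does not
# appear literally inside 'ng-binding title'...
broken = doc.xpath("//span[contains(@class, 'title ng-binding')]/text()")

# ...while checking each class separately still works
robust = doc.xpath(
    "//span[contains(@class, 'title') and contains(@class, 'ng-binding')]/text()"
)
# broken == [], robust == ['some text']
```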

Quotes Messing Up Python Scraper

I am trying to scrape all the data within a div as follows. However, the quotes are throwing me off.
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
I am trying to start it with something along the lines of
addressStart = page.find("<div id="address">")
but the quotes within the div are messing me up. Does anybody know how I can fix this?
To answer your specific question, you need to escape the quotes, or use a different type of quote on the string itself:
addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
But don't do that. If you are trying to "parse" HTML, let a third-party library do that. Try Beautiful Soup. You get a nice object back which you can use to traverse or search. You can grab attributes, values, etc... without having to worry about the complexities of parsing HTML or XML:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')
for address in soup.find_all('div', id='address'):  # returns a list; use find if you just want the first
    for info in address.find_all('div', class_='info'):  # class is a reserved word, so the keyword is class_
        print(info.string)

PyQuery: Get only text of element, not text of child elements

I have the following HTML:
<h1 class="price">
<span class="strike">$325.00</span>$295.00
</h1>
I'd like to get the $295 out. However, if I simply use PyQuery as follows:
price = pq('h1').text()
I get both prices.
Extracting only direct child text for an element in jQuery looks reasonably complicated - is there a way to do it at all in PyQuery?
Currently I'm extracting the first price separately, then using replace to remove it from the text, which is a bit fiddly.
Thanks for your help.
I don't think there is a clean way to do that. At least I've found this solution:
>>> print doc('h1').html(doc('h1')('span').outerHtml())
<h1 class="price"><span class="strike">$325.00</span></h1>
You can use .text() instead of .outerHtml() if you don't want to keep the span tag.
Removing the first one is much easier:
>>> print doc('h1').remove('span')
<h1 class="price">
$295.00
</h1>
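For comparison, BeautifulSoup can pull only the direct text of an element with recursive=False, which restricts the search to immediate children and so skips the nested span (a sketch using the question's markup):

```python
from bs4 import BeautifulSoup

html = '<h1 class="price"><span class="strike">$325.00</span>$295.00</h1>'
soup = BeautifulSoup(html, 'html.parser')

# string=True collects text nodes; recursive=False keeps only those that
# are direct children of <h1>, excluding the text inside <span>
direct = ''.join(soup.h1.find_all(string=True, recursive=False)).strip()
# direct == '$295.00'
```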
