Web scraping - Get text from a class with BeautifulSoup and Python? - python

I want to scrape the text ("Showing 650 results") from a website.
The result I am looking for is:
Result : Showing 650 results
The following is the HTML code:
<div class="jobs-search-results__count-sort pt3">
<div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4">
Showing 650 results
</div>
</div>
Python code:
response = requests.get(index_url)
soup = BeautifulSoup(response.text, 'html.parser')
text = {}
link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
for div in soup.find_all('div', attrs={"class": link}):
    text[div.text]
text
So far it looks like my code is not working.

You don't need soup.find_all if you're looking for only one element; soup.find works just as well.
You can use tag.string, tag.contents, or tag.text to access the inner text:
div = soup.find('div', {"class" : link})
text = div.string
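To make the suggestion above concrete, here is a minimal sketch that parses the HTML fragment from the question directly instead of fetching a page. Matching on the single class results-count-string and stripping whitespace with get_text(strip=True) are my own choices, not part of the original answer.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (outer div closed for completeness).
html = """
<div class="jobs-search-results__count-sort pt3">
<div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4">
Showing 650 results
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag that has this class among its classes,
# so the full class string does not need to be repeated.
div = soup.find("div", class_="results-count-string")

# Guard against a missing element, then strip the surrounding newlines.
result = div.get_text(strip=True) if div is not None else None
print("Result :", result)
```

If find returns None on the live page, the element was not present in the HTML that requests received, so it is worth checking response.text first.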

Old: from BeautifulSoup import BeautifulSoup
"Development on the 3.x series of Beautiful Soup ended in 2011, and the series will be discontinued on January 1, 2021, one year after the Python 2 sunsetting date."
New: from bs4 import BeautifulSoup
"Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree."

Related

Scraping <span> text</span> with BeautifulSoup and urllib

I want to scrape 2015 from below HTML:
I use the below code but am only able to scrape "Annee"
soup.find('span', {'class':'optionLabel'}).get_text()
Can someone please help?
I am a new learner.
Simply find the next span, which holds the text you want to scrape:
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()
Or use a CSS selector with the adjacent sibling combinator:
soup.select_one('span.optionLabel + span').get_text()
Example
html = '''
<span class="optionLabel"><button>Année</button></span> :
<span>2015</span>'''
from bs4 import BeautifulSoup
# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('span', {'class':'optionLabel'}).find_next('span').get_text())
Output
2015

What tag to parse BeautifulSoup to retrieve this number

I am trying to identify the tag in this HTML so that I can parse it with Beautiful Soup and scrape just one number. However, I cannot work out which tag to use to obtain just this number.
The HTML code looks like this:
<div style="font-size:88px; color:#345C99;position:relative;top:56px;left:calc(6% - 46px)">6</div>
I am trying to obtain the 6 in this element >6<
I think font-size:88px; is distinctive enough to select the divs you want, so
soup.select('div[style~="font-size:88px;"]')
should get you all the matching divs on the page.
You can do it like this:
from bs4 import BeautifulSoup
s = '''<div style="font-size:88px; color:#345C99;position:relative;top:56px;left:calc(6% - 46px)">6</div>'''
soup = BeautifulSoup(s, 'lxml')
d = soup.find('div')
print(d.text)
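The two answers above can be combined into one runnable sketch. It uses the substring attribute selector (*=) rather than the word-match selector (~=), which is my substitution so the match does not depend on the exact spacing or the trailing semicolon in the style value. The second div is a hypothetical element added only to show that it is skipped.

```python
from bs4 import BeautifulSoup

# First div is from the question; the second is a made-up decoy.
html = (
    '<div style="font-size:88px; color:#345C99;position:relative;'
    'top:56px;left:calc(6% - 46px)">6</div>'
    '<div style="font-size:12px">ignored</div>'
)

soup = BeautifulSoup(html, "html.parser")

# [style*="..."] matches any div whose style attribute contains the substring.
big_divs = soup.select('div[style*="font-size:88px"]')

# Convert the text of each match to an integer.
numbers = [int(d.get_text(strip=True)) for d in big_divs]
print(numbers)
```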

How is an href value accessed from an HTML div element using Python and Beautiful Soup?

How does one access a link from HTML divs?
Here is the HTML I am trying to scrape; I want to get the href value:
<div class="item-info-wrap">
<div class="news-feed_item-meta icon-font-before icon-espnplus-before"> <span class="timestamp">5d</span><span class="author">Field Yates</span> </div>
<h1> <a name="&lpos=nfl:feed:5:news" href="/nfl/insider/story/_/id/31949666/six-preseason-nfl-trades-teams-make-imagining-deals-nick-foles-xavien-howard-more" class=" realStory" data-sport="nfl" data-mptype="story">
Six NFL trades we'd love to see in August: Here's where Foles could help, but it's not the Colts</a></h1>
<p>Nick Foles is running the third team in Chicago. Xavien Howard wants out of Miami. Let's project six logical deals.</p></div>
Here is the code I have been trying to use to access the href value:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.espn.com/nfl/team/_/name/phi/philadelphia-eagles').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('div', class_='item-info-wrap'):
    headline = article.h1.a.text
    print(headline)
    summary = article.p.text
    print(summary)
    try:
        link_src = article.h1.a.href # Having difficulty getting href value
        print(link_src)
        link = f'https://espn.com/{link_src}'
    except Exception as e:
        link = None
    print(link)
The output I am getting is https://espn.com/None for every ESPN article. Appreciate any help and feedback!
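If you change the line that assigns link_src to the code below, it should work. Attributes are read with dictionary-style indexing; article.h1.a.href instead looks for a child tag named href and returns None.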
link_src = article.h1.a["href"]
FYI https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
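As a small sketch of the fix above, the snippet below parses a trimmed copy of the HTML from the question rather than fetching the live page. Using urljoin to build the absolute URL and https://www.espn.com/ as the base are my assumptions; the question simply prepended a string.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="item-info-wrap">
<h1><a href="/nfl/insider/story/_/id/31949666/six-preseason-nfl-trades-teams-make-imagining-deals-nick-foles-xavien-howard-more" class=" realStory">
Six NFL trades we'd love to see in August</a></h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
base_url = "https://www.espn.com/"  # assumed base for the relative links

links = []
for article in soup.find_all("div", class_="item-info-wrap"):
    # href=True only matches <a> tags that actually carry an href attribute.
    anchor = article.find("a", href=True)
    # urljoin resolves the leading slash without producing "//".
    links.append(urljoin(base_url, anchor["href"]) if anchor else None)

print(links)
```

Using anchor.get("href") instead of anchor["href"] returns None rather than raising KeyError when the attribute is missing.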

Scraping div with a data- attribute using Python and BeautifulSoup

I have to scrape a web page using BeautifulSoup in Python. I want to extract the complete div that has the relevant information, which looks like the one below:
<div data-v-24a74549="" class="row row-mg-mod term-row">
I wrote soup.find('div',{'class':'row row-mg-mod term-row'}).
But it returns nothing. I guess it has something to do with this data-v value.
Can someone tell me the exact syntax for scraping this type of data?
Give this a try:
from bs4 import BeautifulSoup
content = """
<div data-v-24a74549="" class="row row-mg-mod term-row">"""
soup = BeautifulSoup(content,'html.parser')
for div in soup.find_all("div", {"class" : "row"}):
    print(div)
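Building on the answer above, here is a hedged sketch of two CSS-selector alternatives: one that requires all three classes in any order, and one that targets the data-v attribute itself. The div text and the second div are hypothetical, added only so there is something to distinguish.

```python
from bs4 import BeautifulSoup

# First div is from the question (with made-up text); the second is a decoy.
content = """
<div data-v-24a74549="" class="row row-mg-mod term-row">term text</div>
<div class="row">other row</div>
"""

soup = BeautifulSoup(content, "html.parser")

# Chained class selectors require every listed class, in any order.
by_classes = soup.select_one("div.row.row-mg-mod.term-row")

# An attribute selector without a value matches on the attribute's presence.
by_attr = soup.select_one("div[data-v-24a74549]")

print(by_classes.get_text())
print(by_attr.get_text())
```

If both lookups return None against the real page, the div is probably not in the HTML that was downloaded, since requests does not run the page's JavaScript.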

Improving a python snippet

I'm working on a python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So, I just need to get everything from the first href except the number ('webpage-category/page/'), and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is, generating this list is a waste, since I just need the first href. I think a Generator would be the answer but I couldn't pull this off. Maybe you guys could help me to make this code more concise?
What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print('/'.join(link.split('/')[:-1]))
prints:
webpage-category/page
Just FYI, regarding the code you've provided: you can use next() with a generator expression instead of a list comprehension:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
UPD (using the website link provided):
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url), 'html.parser')
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))
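For reference, the next() idea from this answer can be sketched end to end on the pagination snippet from the question. The default of None passed to next(), the CSS selector, and the re.sub call that strips the trailing page number are my own additions, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

html = """<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Generator expression: hrefs are produced one at a time, on demand.
hrefs = (a["href"] for a in soup.select("div.pagination a[href]")
         if "pageSub" not in a["href"])

# next() takes only the first item; None avoids StopIteration when empty.
first = next(hrefs, None)

# Strip the trailing digits to keep the base URL, including the final slash.
base = re.sub(r"\d+$", "", first) if first else None
print(base)
```

Unlike the join-on-isdigit approach in the question, this only removes digits at the end of the href, so digits elsewhere in the path would be preserved.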
