I've been trying to work with BeautifulSoup because I want to scrape a webpage (https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1). So far I've scraped some elements successfully, but now I want to scrape a movie description and I've been struggling. The description is situated like this in the HTML:
<div class="lister-item mode-advanced">
<div class="lister-item-content>
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>
I want to scrape the second paragraph, which seemed easy to do, but everything I tried gave me None as output. I've been digging around to find an answer. In another Stack Overflow post I found that
find('p:nth-of-type(1)')
or
find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')
could do the trick, but it still gives me None as output.
Below you can find a piece of my code. It's a bit low-grade because I'm just trying out stuff to learn:
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description
The above code gives me this output:
$ python scrape.py
Logan
(2017)
8.1
None
I would like to learn the correct method of selecting html tags because it will be useful to know for future projects.
The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. You can then use the list's index to get the element you need. The index starts at 0, so [1] gives the second item.
Change the first_description line to this:
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
Full code:
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
#first_description = first_description.text
print first_description
Output:
Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.
Read the Documentation to learn the correct method of selecting html tags.
Also consider moving to python 3.
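For reference, a minimal Python 3 sketch of the same flow (same page and selectors as above; essentially only the print syntax changes):

import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
html_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
first_movie = html_soup.find_all('div', class_='lister-item mode-advanced')[0]
print(first_movie.h3.a.text)  # title
print(first_movie.h3.find('span', class_='lister-item-year text-muted unbold').text)  # year
print(float(first_movie.strong.text))  # rating
print(first_movie.find_all('p', class_='text-muted')[1].text.strip())  # description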
Just playing around with .next_sibling I was able to get it. There's probably a more elegant way, though. At least it might give you a start/some direction:
from bs4 import BeautifulSoup
html = '''<div class="lister-item mode-advanced">
<div class="lister-item-content>
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
first_p = soup.find('p', {'class': 'muted-text'})
second_p = first_p.next_sibling.next_sibling.text.strip()
print(second_p)
Output:
paragraph I need
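There is indeed a tidier route under the same assumptions (the sample HTML above): collect both paragraphs at once and index the second one.

paragraphs = soup.find_all('p', {'class': 'muted-text'})
print(paragraphs[1].text.strip())  # paragraph I need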
BeautifulSoup 4.7.1 supports :nth-child() and other CSS4 selectors:
first_description = soup.select_one('.lister-item-content p:nth-child(4)')
# or
#first_description = soup.select_one('.lister-item-content p:nth-of-type(2)')
print(first_description.text.strip())
Related
I am trying to create a script to scrape price data from Udemy courses.
I'm struggling with navigating the HTML tree because the element I'm looking for is nested inside multiple divs.
What I tried:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
print(parent_div.find_all("span"))
and even:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span span span")
Here’s the URL: https://www.udemy.com/course/the-complete-web-development-bootcamp/
I tried searching all the spans in the HTML, and the specific span I'm searching for doesn't appear, maybe because it's nested inside a div?
I would appreciate a little guidance!
The price is loaded by JavaScript, so it is not possible to scrape it with BeautifulSoup alone.
The data is loaded from an API Endpoint which takes in the course-id of the course.
Course-id of this course: 1565838
You can get the info directly from that endpoint like this:
import requests
course_id = '1565838'
url= f'https://www.udemy.com/api-2.0/course-landing-components/{course_id}/me/?components=price_text'
response = requests.get(url)
x = response.json()
print(x['price_text']['data']['pricing_result']['price'])
Output:
{'amount': 455.0, 'currency': 'INR', 'price_string': '₹455', 'currency_symbol': '₹'}
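If you'd rather not hard-code the course id, it can usually be recovered from the landing page itself. A hypothetical sketch; the data-clp-course-id attribute name is an assumption on my part, so inspect the page source to confirm it:

import requests
from bs4 import BeautifulSoup

def get_course_id(course_url):
    # assumption: the landing page exposes the course id in a data attribute
    html = requests.get(course_url).text
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find(attrs={'data-clp-course-id': True})
    return tag['data-clp-course-id'] if tag else None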
I tried your first approach several times and it works more or less for me, although it has returned a different number of span elements on different attempts (10 is the usual number, but I have seen as few as 1 on one occasion):
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
spans = parent_div.find_all("span")
print(len(spans))
for span in spans:
    print(span)
Prints:
10
<span data-checked="checked" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--4" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Subscribe</span>
<span>Try it free for 7 days</span>
<span class="udlite-text-xs purchase-section-container--cta-subscript--349MH">$29.99 per month after trial</span>
<span class="purchase-section-container--line--159eG"></span>
<span data-checked="" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--6" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Buy Course</span>
<span class="money-back">30-Day Money-Back Guarantee</span>
As far as your second approach goes, your main div does not have that many nested span elements, so it is bound to fail. Try just one span element:
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span")
print(title)
Prints:
<span class="money-back">30-Day Money-Back Guarantee</span>
I'm trying to scrape the BBC Sounds website for **all of the** 'currently playing' images. I'm not bothered about which size to use; 400w might be a good choice.
Below is a relevant excerpt from the HTML and my current Python script. A variation on this works brilliantly for the 'now playing' text, but I haven't been able to get it to work for the image URLs, which is what I'm after. I think it's probably because a) there are so many image URLs to choose from, and b) there's whitespace, which no doubt the parser doesn't like. Please bear in mind the HTML below is repeated about 10 times, once for each of the channels; I've included just one as an example. Thank you!
import requests
from bs4 import BeautifulSoup
url = "https://www.bbc.co.uk/sounds"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("div", {"class": "sc-o-responsive-image__img sc-u-circle"})
print g_data[0].text
print g_data[1].text
print g_data[2].text
print g_data[3].text
print g_data[4].text
print g_data[5].text
print g_data[6].text
print g_data[7].text
print g_data[8].text
print g_data[9].text
<div class="gel-layout__item sc-o-island">
<div class="sc-c-network-item__image sc-o-island" aria-hidden="true">
<div class="sc-c-rsimage sc-o-responsive-image sc-o-responsive-image--1by1 sc-u-circle">
<img alt="" class="sc-o-responsive-image__img sc-u-circle"
src="https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg" srcSet="https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
https://ichef.bbci.co.uk/images/ic/192x192/p07fzzgr.jpg 192w,
https://ichef.bbci.co.uk/images/ic/224x224/p07fzzgr.jpg 224w,
https://ichef.bbci.co.uk/images/ic/288x288/p07fzzgr.jpg 288w,
https://ichef.bbci.co.uk/images/ic/368x368/p07fzzgr.jpg 368w,
https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
https://ichef.bbci.co.uk/images/ic/448x448/p07fzzgr.jpg 448w,
https://ichef.bbci.co.uk/images/ic/496x496/p07fzzgr.jpg 496w,
https://ichef.bbci.co.uk/images/ic/512x512/p07fzzgr.jpg 512w,
https://ichef.bbci.co.uk/images/ic/576x576/p07fzzgr.jpg 576w,
https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w"
sizes="(max-width: 400px) 34vw,(max-width: 600px) 25vw,17vw"/>
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.bbc.co.uk/sounds")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("img", {'class': 'sc-o-responsive-image__img sc-u-circle'}):
    print(item.get("src"))
Output:
https://ichef.bbci.co.uk/images/ic/400x400/p05mpj80.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07dg040.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07zml97.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p0428n3t.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p01lyv4b.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06yphh0.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p05v4t1c.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06z9zzc.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06x0hxb.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06n253f.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p060m6jj.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07l4fjw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03710d6.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p078qrgm.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03crmyc.jpg
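Since you mentioned 400w: src already points at the 400x400 variant here, but if you want to pick a specific width out of srcset explicitly, here is a small sketch, assuming the comma-separated "url width" format shown in your excerpt:

for img in soup.findAll("img", {'class': 'sc-o-responsive-image__img sc-u-circle'}):
    # srcset entries look like "https://... 400w"; split into (url, width) pairs
    candidates = [c.split() for c in img.get("srcset", "").split(",") if c.strip()]
    url_400 = next((c[0] for c in candidates if len(c) == 2 and c[1] == "400w"), img.get("src"))
    print(url_400)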
<span class="mn-person-info__name Sans-17px-black-85%-semibold-dense">
Marina Silon
</span>
I do this to try to find everything on the page that has this span class:
spans = soup.find_all('span', {'class' : 'mn-person-info__name Sans-17px-black-85%\-semibold-dense'})
However, it does not find anything at all. Am I escaping the - correctly, and is there an easier way of doing this?
Simple test code, everything is OK:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<span class="mn-person-info__name Sans-17px-black-85%-semibold-dense">Marina Silon</span>', 'lxml')
spans = soup.find_all('span', {'class' : 'mn-person-info__name Sans-17px-black-85%-semibold-dense'})
print spans
# print result
[<span class="mn-person-info__name Sans-17px-black-85%-semibold-dense">Marina Silon</span>]
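For what it's worth, class is a multi-valued attribute in BeautifulSoup, so matching on just one of the class tokens also works and sidesteps the escaping question entirely:

spans = soup.find_all('span', class_='mn-person-info__name')  # matches any element whose class list contains this token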
I'm a newbie and I've just started with BeautifulSoup and Python development. I want to get a result as plain text, without any HTML tags or other elements that are not text.
I did this with Python:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
html_content = urllib2.urlopen("http://www.demo.com/index.php")
soup = BeautifulSoup(html_content, "lxml")
# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")
# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")
# PRINT RESULT
print count_comment_final
print count_read_final
My HTML looks like this:
<div class="box">
<span class="sidebar-comment__label">Comments</span>
<meta itemprop="interactionCount" content="Comments:115">
</div>
<div class="box">
<span class="sidebar-read__label js-read">Read</span>
<meta itemprop="interactionCount" content="Read:10">
</div>
and I get this:
<meta content="Comments:115" itemprop="interactionCount"/>
<meta content="Read:10" itemprop="interactionCount"/>
I would like to get this:
You've 115 comments
You've 10 read
Firstly, is it possible ?
Secondly, is my code good?
Thirdly, could you help me? ;-)
count_comment_final and count_read_final are tags, as is clear from the output. You need to extract the value of the content attribute of the two tags, which is done with count_comment_final['content']; this gives Comments:115, and you can strip off the Comments: prefix using split(':').
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
html_content = urllib2.urlopen("http://www.demo.com/index.php")
soup = BeautifulSoup(html_content, "lxml")
# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")
# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")
# PRINT RESULT
print count_comment_final['content'].split(':')[1]
print count_read_final['content'].split(':')[1]
count_comment_final and count_read_final are tag elements.
You can use:
count_comment_final.get('content')
This will give an output like this:
'Comments:115'
So you can get the comments count like this:
count_comment_final.get('content').split(':')[1]
The same applies to count_read_final:
count_read_final.get('content').split(':')[1]
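And if you want the exact phrasing from your question, a small sketch building on the above (Python 2 print syntax, to match your script):

comments = count_comment_final.get('content').split(':')[1]
reads = count_read_final.get('content').split(':')[1]
print "You've %s comments" % comments
print "You've %s read" % reads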
I am trying to extract Company Name, address, and zipcode from www.quicktransportsolutions.com. I have written the following code to crawl the site and return the information I need.
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('div', {'class': 'well well-sm'}):
            title = link.string
            print(link)

trade_spider(1)
After running the code, I see the information that I want, but I am confused about how to get it to print without all of the non-pertinent information.
Above the
print(link)
I thought that I could have link.string pull the Company Names, but that failed. Any suggestions?
Output:
div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">227 E 2ND</span>
<br>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br>
Trucks: 2 Drivers: 2<br>
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795 <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b>
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports">
Everyone,
Thanks for the help so far... I'm trying to add an extra function to my little crawler. I have written the following code:
def Crawl_State_Page(max_pages):
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    while i <= len(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
        for link in table.find_all(href=True):
            print link['href']
Output:
abbeville.php
adamsville.php
addison.php
adger.php
akron.php
alabaster.php
alberta.php
albertville.php
alexander-city.php
alexandria.php
aliceville.php
alpine.php
... # goes all the way to Z; I cut the output short for spacing
What I'm trying to accomplish here is to pull all of the hrefs with the city.php and write them to a file. But right now, I am stuck in an infinite loop where it keeps cycling through the URL. Any tips on how to increment it? My end goal is to create another function that feeds back into my trade_spider with www.site.com/state/city.php and then loops through all 50 states... Something to the effect of:
while i < len(states, cities):
    url = "http://www.quicktransportsolutions.com/carrier/" + states + "/" + cities[i]
And then this would loop into my trade_spider function, pulling all of the information that I needed.
But, before I get to that part, I need a bit of help getting out of my infinite loop. Any suggestions? Or foreseeable issues that I am going to run into?
I tried to create a crawler that would cycle through every link on the page and then, if it found content that trade_spider could crawl, write it to a file... However, that was a bit outside my skill set for now. So I'm trying this method.
I would rely on the itemprop attributes of the different tags for each company. They are conveniently set for the name, url, address, etc.:
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        for company in soup.find_all('div', {'class': 'well well-sm'}):
            link = company.find('a', itemprop='url').get('href').strip()
            name = company.find('span', itemprop='name').text.strip()
            address = company.find('span', itemprop='address').text.strip()
            print name, link, address
            print "----"

trade_spider(1)
Prints:
2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND
Adrian, MO 64720
----
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453
Adrian, MO 64720
----
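As for the infinite loop in Crawl_State_Page: the while condition never changes, so the body repeats forever. Since that listing page only needs to be fetched once, the simplest fix is to drop the while and write the links out in a single pass. A sketch, assuming the same table markup as in your question (the output filename is just an example):

def crawl_state_page(outfile='alabama_cities.txt'):
    # fetch the state listing once; no loop is needed for a single page
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    soup = BeautifulSoup(requests.get(url).content)
    table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
    with open(outfile, 'w') as f:
        for link in table.find_all('a', href=True):
            f.write(link['href'] + '\n')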