Background
I am trying to understand how specific data can be extracted from a website using the beautifulsoup4 and urllib libraries.
How would I get the price of a DVD from a page, given that:
The div is <div class="productPrice" data-component="productPrice">
The p is <p class="productPrice_price" data-product-price="price">£9.99 </p>
Code so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://www.zavvi.com/dvd/rampage-includes-digital-download/11729469.html")
bsObj = BeautifulSoup(html.read(), features='html.parser')
all_divs = bsObj.find_all('div', {'class':'productPrice'}) # 1. get all divs
What is the remaining process of finding the price?
Website (https://www.zavvi.com/dvd/rampage-includes-digital-download/11729469.html)
You're almost there; just one more step. Loop through the divs, find the <p> tag with class="productPrice_price" inside each one, and grab its text:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://www.zavvi.com/dvd/rampage-includes-digital-download/11729469.html")
bsObj = BeautifulSoup(html.read(), features='html.parser')
all_divs = bsObj.find_all('div', {'class':'productPrice'}) # 1. get all divs
for ele in all_divs:
    # 2. inside each div, find the price paragraph and read its text
    price = ele.find('p', {'class':'productPrice_price'}).text
    print(price)
Output:
£9.99
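A minimal offline sketch of the same lookup, assuming only the markup quoted in the question. `SAMPLE_HTML` and `extract_price` are hypothetical names introduced for illustration, not part of the original answer. The `None` check is there because a missing `<p>` would otherwise raise an `AttributeError` on `.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the markup quoted in the question.
SAMPLE_HTML = """
<div class="productPrice" data-component="productPrice">
  <p class="productPrice_price" data-product-price="price">£9.99 </p>
</div>
"""

def extract_price(html):
    """Return the text of the first price paragraph, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # One CSS selector replaces the find_all + find combination.
    tag = soup.select_one("div.productPrice p.productPrice_price")
    # select_one returns None when nothing matches, so guard before reading text.
    return tag.get_text(strip=True) if tag is not None else None

price = extract_price(SAMPLE_HTML)
print(price)
```

Passing `strip=True` removes the trailing space that is present in the original markup.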
Related
I wrote the following code to extract a price from a web page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.teleborsa.it/azioni/intesa-sanpaolo-isp-it0000072618-SVQwMDAwMDcyNjE4"
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
prize = soup.select('.h-price')
print(prize)
Output:
<span class="h-price fc0" id="ctl00_phContents_ctlHeader_lblPrice">1,384</span>
I want to extract the value 1,384.
Try this in the browser console (it is JavaScript, not Python):
document.getElementById("ctl00_phContents_ctlHeader_lblPrice").innerText
Or, if the elements are dynamic, you can iterate over each element and read its innerText.
In Python, you can use the .text property to get the desired text.
For example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.teleborsa.it/azioni/intesa-sanpaolo-isp-it0000072618-SVQwMDAwMDcyNjE4"
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
prize = soup.select_one('.h-price') # <- change to .select_one() to get only one element
print(prize.text) # <- use the .text property to get text of the tag
Prints:
1,384
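If the extracted text needs to become a number, a small conversion step helps. The sketch below assumes the page uses the Italian convention, where "," is the decimal separator and "." groups thousands; that is an assumption about the site, not something stated in the thread. `parse_decimal_comma` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# The tag printed in the question, used here as offline sample input.
SAMPLE_HTML = (
    '<span class="h-price fc0" '
    'id="ctl00_phContents_ctlHeader_lblPrice">1,384</span>'
)

def parse_decimal_comma(text):
    """Convert a string such as '1.234,56' to a float.

    Assumption: '.' groups thousands and ',' marks the decimals.
    """
    return float(text.strip().replace(".", "").replace(",", "."))

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tag = soup.select_one(".h-price")
value = parse_decimal_comma(tag.text)
print(value)
```

If the site instead used "," as a thousands separator, the replacement logic would have to be reversed.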
Hello everyone, I'm new to BeautifulSoup. I'm trying to write a function that extracts second-level URLs from a given website.
For example, given the URL https://edition.cnn.com/, my function should return
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
First, I tried this code to retrieve all links starting with the URL string:
from bs4 import BeautifulSoup as bs4
import requests
import lxml
import re
def getLinks(url):
    response = requests.get(url)
    data = response.text
    soup = bs4(data, 'lxml')
    links = []
    for link in soup.find_all('a', href=re.compile(str(url))):
        links.append(link.get('href'))
    return links
But the actual output gives me all the links, including links to articles, which is not what I'm looking for. Is there a method I can use to get what I want, using regular expressions or something else?
The links are inside a <nav> tag, so the CSS selector nav a[href] selects only the links inside <nav>:
import requests
from bs4 import BeautifulSoup
url = 'https://edition.cnn.com'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('nav a[href]'):
    # skip deeper paths and in-page anchors
    if a['href'].count('/') > 1 or '#' in a['href']:
        continue
    print(url + a['href'])
Prints:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
https://edition.cnn.com/sport
https://edition.cnn.com/videos
https://edition.cnn.com/world
https://edition.cnn.com/africa
https://edition.cnn.com/americas
https://edition.cnn.com/asia
https://edition.cnn.com/australia
https://edition.cnn.com/china
https://edition.cnn.com/europe
https://edition.cnn.com/india
https://edition.cnn.com/middle-east
https://edition.cnn.com/uk
...and so on.
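The printed list contains repeats (for example /world appears twice). A hedged sketch of one way to remove them while keeping the original order, run against a small made-up `SAMPLE_HTML` rather than the live site. `section_links` is a hypothetical helper, and `urljoin` is swapped in for the string concatenation used above.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical navigation markup; the real page structure may differ.
SAMPLE_HTML = """
<nav>
  <a href="/world">World</a>
  <a href="/politics">Politics</a>
  <a href="/world">World</a>
  <a href="/world/africa">Africa</a>
  <a href="#top">Top</a>
</nav>
<a href="/2020/01/01/some-article">Article</a>
"""

def section_links(html, base_url):
    """Collect unique single-segment links found inside <nav>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("nav a[href]"):
        href = a["href"]
        # Same filter as above: skip deeper paths and in-page anchors.
        if href.count("/") > 1 or "#" in href:
            continue
        full = urljoin(base_url, href)
        if full not in links:  # keep first occurrence only
            links.append(full)
    return links

links = section_links(SAMPLE_HTML, "https://edition.cnn.com")
print(links)
```

The article link is ignored because it sits outside `<nav>`, and the duplicate /world entry is dropped.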
I am trying to scrape three things from Zalora:
1. item brand
2. item name
3. item price (old)
Below is my initial attempt:
from bs4 import BeautifulSoup
import requests
def make_soup(url):
    html = requests.get(url)
    bsObj = BeautifulSoup(html.text, 'html.parser')
    return bsObj
soup = make_soup('https://www.zalora.com.hk/men/clothing/shirt/?gender=men&dir=desc&sort=popularity&category_id=31&enable_visual_sort=1')
itemBrand = soup.find("span",{"class":"b-catalogList__itmBrand fsm txtDark uc js-catalogProductTitle"})
itemName = soup.find("em",{"class":"b-catalogList__itmTitle fss"})
itemPrice = soup.find("span",{"class":"b-catalogList__itmPrice old"})
print(itemBrand, itemName, itemPrice)
Output:
None None None
Then I investigated further:
productsCatalog = soup.find("ul",{"id":"productsCatalog"})
print(productsCatalog)
Output:
<ul class="b-catalogList__wrapper clearfix" id="productsCatalog">
This is the weird thing that puzzles me: there should be many tags within the ul tag (the three things I need are within those hidden tags). Why are they not showing up?
As a matter of fact, everything I try to scrape with BeautifulSoup inside the ul tag returns None.
Since this content is rendered by JavaScript, you can't access it using the requests module alone. You should use Selenium to automate a browser and then use BeautifulSoup to parse the rendered HTML.
This is how you do it using Selenium with chromedriver:
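from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
chrome_driver = "path\\to\\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object (executable_path= was deprecated and later removed)
driver = webdriver.Chrome(service=Service(chrome_driver))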
target = 'https://www.zalora.com.hk/men/clothing/shirt/?gender=men&dir=desc&sort=popularity&category_id=31&enable_visual_sort=1'
driver.get(target)
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find("span",{"class":"b-catalogList__itmBrand fsm txtDark uc js-catalogProductTitle"}).get_text().strip())
print(soup.find("span", {'class': 'b-catalogList__itmPrice old'}).get_text().strip())
print(soup.find("em",{"class":"b-catalogList__itmTitle fss"}).get_text().strip())
Output:
JAXON
HK$ 149.00
EMBROIDERY SHORT SLEEVE SHIRT
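Before reaching for a browser, it can help to confirm that the container really is empty in the HTML the server sends. A minimal sketch, using the `<ul>` printed in the question as offline input; `has_child_tags` is a hypothetical helper name, and an empty container is only a hint (not proof) that JavaScript fills it in later.

```python
from bs4 import BeautifulSoup

# The container as it appeared in the question's output, closed for validity.
STATIC_HTML = '<ul class="b-catalogList__wrapper clearfix" id="productsCatalog"></ul>'

def has_child_tags(html, element_id):
    """Return True if the element with this id contains at least one tag."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=element_id)
    if container is None:
        return False
    # find(True) matches any tag; None means there are no descendant tags.
    return container.find(True) is not None

print(has_child_tags(STATIC_HTML, "productsCatalog"))
```

When this returns False for the HTML fetched with requests but the items are visible in the browser, a rendered source such as Selenium's `driver.page_source` is the likely next step.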
I'm trying to learn web scraping with BeautifulSoup and Python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/, but I can't figure out how to isolate the text. The HTML for what I want is shown below; the output I want is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
In my code I've tried soup.find(text="Cinematography") and a mixture of different things, like trying to find the item or call get_text on the a and p tags, but ...
I would use a regex to search the soup object for a link whose href contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS attribute substring selector with select_one (find does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
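Another option is to start from the heading text, which is what the question originally attempted. The sketch below assumes the HTML snippet quoted in the question and nothing more about the real page; `crew_for_heading` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# The snippet quoted in the question, used as offline sample input.
SAMPLE_HTML = """
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
"""

def crew_for_heading(html, heading):
    """Return the text of the sluglist block that follows a heading."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the <span> whose text is exactly the heading.
    span = soup.find("span", string=heading)
    if span is None:
        return None
    # find_next walks forward in document order from the span.
    block = span.find_next("div", class_="text-sluglist")
    return block.get_text(strip=True) if block is not None else None

name = crew_for_heading(SAMPLE_HTML, "Cinematography")
print(name)
```

This approach does not rely on the link's href, so it may also work for other headings that share the same layout.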
The body tag contains a <span> tag, and there are many other divs inside that span. I want to go deeper, but when I try this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.body.span
print (result)
The result was just this:
<span id="react-root"></span>
How can I reach the divs inside the span tag?
Can the <span> tag be parsed at all? If so, why am I not able to parse it?
By using this:
result = soup.body.span.contents
The output was:
[]
As discussed in the comments, urlopen(url) returns a file-like object, which means you need to read from it to get its contents.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data.read(), 'html.parser')
result = soup.body.span
print (result)
The code I used for my Python 2.7 setup:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.instagram.com/artfido/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data.read(), 'lxml')
result = soup.body.span
print result
EDIT
For future reference, if you want something simpler for handling the URL, there is a package called requests. In this case it is similar, but I find it easier to understand.
from bs4 import BeautifulSoup
import requests
url = 'https://www.instagram.com/artfido/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
result = soup.body.span
print(result)
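A small offline check of what the parser sees, assuming the served HTML really contains an empty span, as the output in the question suggests. `SAMPLE` is a made-up stand-in for the downloaded page. BeautifulSoup accepts either a string/bytes value or an open file-like object, so both forms are shown.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: the span has no children.
SAMPLE = b'<html><body><span id="react-root"></span></body></html>'

# Parse once from a file-like object and once from the raw bytes.
from_file = BeautifulSoup(io.BytesIO(SAMPLE), "html.parser")
from_bytes = BeautifulSoup(SAMPLE, "html.parser")

print(from_file.body.span)
print(from_bytes.body.span.contents)
```

If the span is empty in the raw HTML, the divs seen in the browser were presumably added afterwards by JavaScript, and no change to how the response is read will make them appear.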