How can I scrape elements inside the divs that I'm scraping?


I am having trouble printing a value stored inside a div.
This is the tag that I want to scrape:
<div class="page-box house-lst-page-box" comp-module="page" page-url="/ershoufang/miyun/pg{page}" page-data='{"totalPage":73,"curPage":1}'>
I want my code to print the integer stored in totalPage, which is 73.
Thanks in advance!

Try:
import json
from bs4 import BeautifulSoup
html_doc = """<div class="page-box house-lst-page-box" comp-module="page" page-url="/ershoufang/miyun/pg{page}" page-data='{"totalPage":73,"curPage":1}'><a class="on" href="/ershoufang/miyun/" data-page="1">1</a>23<span>...</span>73下一页</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
data = soup.select_one("div[page-data]")["page-data"]
data = json.loads(data)
print("Total page:", data["totalPage"])
Prints:
Total page: 73
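If the goal is to visit every results page, the same two attributes can drive the loop. The sketch below rests on an assumption about the site: it treats the {page} placeholder in page-url as the 1-based page number. build_page_urls is a hypothetical helper name, not part of BeautifulSoup.

```python
import json
from bs4 import BeautifulSoup

# Sample markup modeled on the tag in the question. The attribute value is
# wrapped in single quotes so the JSON's double quotes do not end it early.
html_doc = """<div class="page-box house-lst-page-box" comp-module="page" page-url="/ershoufang/miyun/pg{page}" page-data='{"totalPage":73,"curPage":1}'></div>"""


def build_page_urls(html):
    """Return one relative URL per results page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.select_one("div[page-data]")
    if div is None:
        return []  # pagination block not found
    total = json.loads(div["page-data"])["totalPage"]
    template = div["page-url"]
    # Assumption: "{page}" stands for the 1-based page number.
    return [template.replace("{page}", str(n)) for n in range(1, total + 1)]


urls = build_page_urls(html_doc)
print(len(urls), urls[0], urls[-1])
```

The returned paths are relative, so they would still need to be joined with the site's base URL before being requested.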

Related

How to extract 'div' value by using BeautifulSoup?

I want to get 8.9 from the following HTML tag using BeautifulSoup.
<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">
import requests
from bs4 import BeautifulSoup
import pandas as pd
website = 'https://www.laptopsdirect.co.uk/ct/laptops-and-netbooks/laptops?fts=laptops'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'lxml')
results = soup.find_all('div', class_='OfferBox')
name = results[0].find('a', class_='offerboxtitle').get_text()
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('')
print(review_rating)
I tried:
review_rating = results[0].find('div.rating-value')
None
review_rating = results[0].find('div')['rating-value']
KeyError: 'rating-value'
I'm not familiar with BeautifulSoup yet, so I couldn't get it to work.
How can I get 8.9?
Thanks

You can use the .get method to retrieve attribute values, as follows:
from bs4 import BeautifulSoup
html = '''<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find("div").get("rating-value"))
Output:
8.9
Keep in mind that .get returns a str ("8.9"), not a number.
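The difference from indexing with square brackets matters when the attribute may be absent: tag["rating-value"] raises KeyError, while tag.get("rating-value") returns None. A minimal sketch, where parse_rating is a hypothetical helper that also converts the string to a float:

```python
from bs4 import BeautifulSoup


def parse_rating(tag):
    """Return the rating as a float, or None if the attribute is missing."""
    raw = tag.get("rating-value")  # None instead of KeyError when absent
    return float(raw) if raw is not None else None


with_rating = BeautifulSoup('<div rating-value="8.9"></div>', "html.parser").div
without_rating = BeautifulSoup('<div class="OfferBox"></div>', "html.parser").div

print(parse_rating(with_rating))
print(parse_rating(without_rating))
```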

You are looking for the data in the wrong tag. The HTML you posted shows the rating inside a <div>, but in the soup it is an attribute of a tag called <star-rating>.
Just extract the value from that tag:
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('star-rating').get('rating-value')
print(review_rating)
8.9
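When it is unclear which tag carries the attribute, a CSS attribute selector finds it regardless of the tag name. The markup below is a made-up stand-in for one OfferBox (the real page structure is an assumption), used only to show the selector:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the rating sits on a custom <star-rating> tag.
html = """
<div class="OfferBox">
  <a class="offerboxtitle">Example laptop</a>
  <star-rating rating-value="8.9" ratings-count="23"></star-rating>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", class_="OfferBox")

# "[rating-value]" matches any descendant tag that has the attribute.
rated = box.select_one("[rating-value]")
print(rated.name, rated["rating-value"])
```

Printing rated.name is a quick way to discover which tag actually holds the value in the parsed soup.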

Get html text with Beautiful Soup

I'm trying to get the number from inside a div:
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
I need the 122.7 number, but I can't get it. I have tried:
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").string
But the div has more than one child, so I get None.
Is there a way to iterate over the children and get the string from each child?

Use .getText().
For example:
from bs4 import BeautifulSoup
sample_html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").getText()
print(strings)
Output:
122.78
Or call next() on .strings to get only the 122.7:
soup = BeautifulSoup(sample_html, "html.parser")
strings = next(soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").strings)
print(strings)
Output:
122.7
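To see why .string returns None here, and to answer the question about the children directly, it can help to list the div's text nodes. A small sketch using the same sample markup:

```python
from bs4 import BeautifulSoup

html = '<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div.js-symbol-last")

# .string is None when a tag has more than one child; this div has two.
print(div.string)

# .strings walks every text node below the tag, in document order.
parts = list(div.strings)
print(parts)

# Restrict the search to direct children to skip text inside the <span>.
first_text = div.find(string=True, recursive=False)
print(first_text)
```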

To get only the first text, search for the tag and read its .next_element attribute.
from bs4 import BeautifulSoup
html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
print(
    soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").next_element
)
Output:
122.7
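One caveat, shown here with a made-up variation of the markup: .next_element returns whatever was parsed immediately after the opening tag, so it yields a tag rather than text if the div starts with a child element.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

text_first = BeautifulSoup("<div>122.7<span>8</span></div>", "html.parser").div
tag_first = BeautifulSoup("<div><span>8</span>122.7</div>", "html.parser").div

# Text comes first, so next_element is the text node.
print(type(text_first.next_element).__name__, text_first.next_element)

# A tag comes first, so next_element is the <span> tag itself.
print(type(tag_first.next_element).__name__, tag_first.next_element)
```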

You could use Selenium to find the element and then use BS4 to parse it.
An example would be:
import bs4 as B
import selenium.webdriver as WD
from selenium.webdriver.common.by import By
driver = WD.Chrome()
driver.get("yourpageurl")  # placeholder: load the page before locating anything
# Recent Selenium 4 releases removed find_element_by_xpath; use find_element(By.XPATH, ...)
objXpath = driver.find_element(By.XPATH, "yourelementxpath")
objHtml = objXpath.get_attribute("outerHTML")
soup = B.BeautifulSoup(objHtml, 'html.parser')
text = soup.get_text()
driver.quit()
This code should work.
Disclaimer: I haven't worked with Selenium and bs4 in a while, so you might have to tweak it a little.

python nested Tags (beautiful Soup)

I used Beautiful Soup in Python to get data from a specific website.
The element I scrape contains two prices, and I only want the one quoted in grams (g).
This is the HTML code:
<div class="promoPrice margBottom7">16,000
L.L./200g<br/><span class="kiloPrice">79,999
L.L./Kg</span></div>
I use this code:
p_price = product.findAll("div", {"class": "promoPrice margBottom7"})[0].text
My result was:
16,000 L.L./200g 79,999 L.L./Kg
but I want to have only:
16,000 L.L./200g

You will need to first decompose the span inside the div element:
from bs4 import BeautifulSoup
h = """
<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>
"""
soup = BeautifulSoup(h, "html.parser")
element = soup.find("div", {'class': 'promoPrice'})
element.span.decompose()
print(element.text)
#16,000 L.L./200g

Try using soup.select_one('div.promoPrice').contents[0]
from bs4 import BeautifulSoup
html = """<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>"""
soup = BeautifulSoup(html, features='html.parser')
# value = soup.select('div.promoPrice > span') # for 79,999 L.L./Kg
value = soup.select_one('div.promoPrice').contents[0]
print(value)
Prints
16,000 L.L./200g
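A non-destructive alternative that does not depend on the text being the first child can be sketched with a hypothetical helper, own_text. It collects only the text nodes that are direct children of the div and leaves the nested span in the tree:

```python
from bs4 import BeautifulSoup, NavigableString

html = """<div class="promoPrice margBottom7">16,000
L.L./200g<br/><span class="kiloPrice">79,999
L.L./Kg</span></div>"""


def own_text(tag):
    """Join the tag's direct text nodes, ignoring text inside child tags."""
    pieces = [child for child in tag.children if isinstance(child, NavigableString)]
    # split()/join collapses the line breaks and repeated spaces.
    return " ".join("".join(pieces).split())


soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div.promoPrice")
print(own_text(div))
print(div.span is not None)  # the span is still in the tree
```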

python crawling beautifulsoup how to crawl several pages?

Please help.
I want to get all the company names from each page; there are 12 pages.
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/2
Only the number at the end of the URL changes from page to page.
Here is my code so far.
Can I get just the title (company name) from all 12 pages?
Thank you in advance.
from bs4 import BeautifulSoup
import requests
maximum = 0
page = 1
URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1'
response = requests.get(URL)
source = response.text
soup = BeautifulSoup(source, 'html.parser')
whole_source = ""
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/' + str(page_number)
    response = requests.get(URL)
    whole_source = whole_source + response.text
soup = BeautifulSoup(whole_source, 'html.parser')
find_company = soup.select("#content > div.wrap_analysis_data > div.public_con_box.public_list_wrap > ul > li:nth-child(13) > div > strong")
for company in find_company:
    print(company.text)

So, you want to remove all the headers and get only the string of the company name?
You can use soup.findAll to find the list of companies, each in a format like this:
<strong class="company"><span>중소기업진흥공단</span></strong>
Then use .find on each result to get the <span> tag:
<span>중소기업진흥공단</span>
After that, use the .contents attribute to get the string from the <span> tag:
'중소기업진흥공단'
Finally, write a loop that does the same for each page, and append the results from every page to a single list called company_list.
Here's the code:
from bs4 import BeautifulSoup
import requests
maximum = 12
company_list = [] # List for result storing
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(page_number)
    response = requests.get(URL)
    print(page_number)
    whole_source = response.text
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.findAll('strong', attrs={'class': 'company'}):  # Finding all company names in the page
        company_list.append(entry.find('span').contents[0])  # Extracting name from the result
The company_list will give you all the company names you want
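To check the extraction logic without hitting the site, the parsing step can be split into its own function and tried on a small sample. This is only a sketch: extract_companies and page_url are hypothetical names, and the sample HTML is modeled on the <strong class="company"> snippet above rather than copied from the real page.

```python
from bs4 import BeautifulSoup

BASE = "http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}"


def page_url(page_number):
    """Build the URL of one listing page."""
    return BASE.format(page_number)


def extract_companies(html):
    """Return the company names found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [s.get_text(strip=True) for s in soup.find_all("strong", class_="company")]


# Made-up sample standing in for a downloaded page.
sample = """
<ul>
  <li><strong class="company"><span>중소기업진흥공단</span></strong></li>
  <li><strong class="company"><span>Example Corp</span></strong></li>
</ul>
"""
print(page_url(2))
print(extract_companies(sample))
```

In the real loop, the text of each response for page_url(n), with n from 1 to 12, would be passed to extract_companies and the results appended to one list.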

I figured it out eventually. Thank you for your answer though!
Here is my final code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
company_list=[]
for n in range(12):
    url = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(n+1)
    webpage = urlopen(url)
    source = BeautifulSoup(webpage,'html.parser',from_encoding='utf-8')
    companys = source.findAll('strong',{'class':'company'})
    for company in companys:
        company_list.append(company.get_text().strip().replace('\n','').replace('\t','').replace('\r',''))
file = open('company_name1.txt','w',encoding='utf-8')
for company in company_list:
    file.write(company+'\n')
file.close()
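The chain of replace() calls and the manual close() can be shortened. As a sketch (the input names and the temporary file location are illustrative, not from the code above), " ".join(text.split()) collapses any run of whitespace, and a with block closes the file even if writing fails. Note one difference: this keeps a single space where the replace() chain would remove the whitespace entirely.

```python
import os
import tempfile

raw_names = ["\n\t중소기업진흥공단\r\n", "  Example\tCorp  "]  # made-up input

# split() with no arguments splits on any whitespace run, including \n, \t, \r.
company_list = [" ".join(name.split()) for name in raw_names]

out_path = os.path.join(tempfile.mkdtemp(), "company_name1.txt")
with open(out_path, "w", encoding="utf-8") as f:
    for company in company_list:
        f.write(company + "\n")

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```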

Scraping multiple pages with python

I'm trying to scrape a multi-page website with Beautiful Soup. The code only partially works: it returns just the last page instead of all pages. How can I fix this?
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
aziende = [
'35-azienda-4_planets', '36-azienda-archivio_23', '24-azienda-bm', '16-azienda-brolese_virginio', '39-azienda-castellani', '19-azienda-centro_ottico_bisa', '25-azienda-comel_optik', '37-azienda-de_lorenzo_occhiali', '15-azienda-delta_laser', '34-azienda-dem', '21-azienda-erizzo', '3-azienda-evo', '27-azienda-farben_occhialeria', '32-azienda-gio__eyewear', '7-azienda-gipizeta', '42-azienda-h8', '20-azienda-idea_91', '5-azienda-lem', '41-azienda-lasertec', '22-azienda-le_thi_thu_thu', '28-azienda-m1', '1-azienda-mati_', '38-azienda-metal_dream', '30-azienda-mictu', '23-azienda-nete', '10-azienda-new_italian_design_eyewear', '31-azienda-okki_lux', '9-azienda-ottica_pra_floriani', '12-azienda-pao', '40-azienda-palladio_occhiali', '29-azienda-plastil_due', '17-azienda-punti_di_vista', '14-azienda-quemme', '4-azienda-red_line', '43-azienda-revert', '33-azienda-sm', '6-azienda-scussel', '8-azienda-sistem', '18-azienda-stile_italiano', '26-azienda-tecnodanta', '11-azienda-toffoli_costantino', '13-azienda-tri_color', '2-azienda-zago'
]
for azienda in aziende:
    page_link = 'http://www.occhialeriabellunotreviso.it/' + azienda
    page = urllib.request.urlopen(page_link)  # query the website and return the HTML to the variable 'page'
    soup = BeautifulSoup(page, 'html.parser')  # parse the HTML using Beautiful Soup and store it in the variable `soup`
# Take out the <h2> holding the name and get its value
name_box = soup.find('h2')
name = name_box.text.strip()  # strip() removes leading and trailing whitespace
print(name)

Just put the final lines of code that are outside the for-loop inside the for-loop so they are run for every page.
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
aziende = [
'35-azienda-4_planets', '36-azienda-archivio_23', '24-azienda-bm', '16-azienda-brolese_virginio', '39-azienda-castellani', '19-azienda-centro_ottico_bisa', '25-azienda-comel_optik', '37-azienda-de_lorenzo_occhiali', '15-azienda-delta_laser', '34-azienda-dem', '21-azienda-erizzo', '3-azienda-evo', '27-azienda-farben_occhialeria', '32-azienda-gio__eyewear', '7-azienda-gipizeta', '42-azienda-h8', '20-azienda-idea_91', '5-azienda-lem', '41-azienda-lasertec', '22-azienda-le_thi_thu_thu', '28-azienda-m1', '1-azienda-mati_', '38-azienda-metal_dream', '30-azienda-mictu', '23-azienda-nete', '10-azienda-new_italian_design_eyewear', '31-azienda-okki_lux', '9-azienda-ottica_pra_floriani', '12-azienda-pao', '40-azienda-palladio_occhiali', '29-azienda-plastil_due', '17-azienda-punti_di_vista', '14-azienda-quemme', '4-azienda-red_line', '43-azienda-revert', '33-azienda-sm', '6-azienda-scussel', '8-azienda-sistem', '18-azienda-stile_italiano', '26-azienda-tecnodanta', '11-azienda-toffoli_costantino', '13-azienda-tri_color', '2-azienda-zago'
]
for azienda in aziende:
    page_link = 'http://www.occhialeriabellunotreviso.it/' + azienda
    page = urllib.request.urlopen(page_link)  # query the website and return the HTML to the variable 'page'
    soup = BeautifulSoup(page, 'html.parser')  # parse the HTML using Beautiful Soup and store it in the variable `soup`
    # Take out the <h2> holding the name and get its value
    name_box = soup.find('h2')
    name = name_box.text.strip()  # strip() removes leading and trailing whitespace
    print(name)
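Why only the last page was printed can be reproduced without any network access: once a for-loop finishes, variables assigned inside it keep the value from the final iteration. A toy illustration with made-up page names:

```python
pages = ["page-a", "page-b", "page-c"]  # stand-ins for the real URLs

# Version 1: the processing step sits after the loop, so it runs once.
outside = []
for page in pages:
    current = page.upper()
outside.append(current)

# Version 2: the processing step is indented into the loop body.
inside = []
for page in pages:
    current = page.upper()
    inside.append(current)

print(outside)
print(inside)
```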

There is nothing wrong with what you have tried, except for the indentation. An alternative approach, using a generator, could look like this:
import urllib.request
from bs4 import BeautifulSoup
link = 'http://www.occhialeriabellunotreviso.it/{}'
aziende = (
'35-azienda-4_planets', '36-azienda-archivio_23', '24-azienda-bm', '16-azienda-brolese_virginio', '39-azienda-castellani', '19-azienda-centro_ottico_bisa', '25-azienda-comel_optik', '37-azienda-de_lorenzo_occhiali', '15-azienda-delta_laser', '34-azienda-dem', '21-azienda-erizzo', '3-azienda-evo', '27-azienda-farben_occhialeria', '32-azienda-gio__eyewear', '7-azienda-gipizeta', '42-azienda-h8', '20-azienda-idea_91', '5-azienda-lem', '41-azienda-lasertec', '22-azienda-le_thi_thu_thu', '28-azienda-m1', '1-azienda-mati_', '38-azienda-metal_dream', '30-azienda-mictu', '23-azienda-nete', '10-azienda-new_italian_design_eyewear', '31-azienda-okki_lux', '9-azienda-ottica_pra_floriani', '12-azienda-pao', '40-azienda-palladio_occhiali', '29-azienda-plastil_due', '17-azienda-punti_di_vista', '14-azienda-quemme', '4-azienda-red_line', '43-azienda-revert', '33-azienda-sm', '6-azienda-scussel', '8-azienda-sistem', '18-azienda-stile_italiano', '26-azienda-tecnodanta', '11-azienda-toffoli_costantino', '13-azienda-tri_color', '2-azienda-zago'
)
def get_item(url):
    for azienda in aziende:
        page = urllib.request.urlopen(url.format(azienda))
        soup = BeautifulSoup(page, 'html.parser')
        name_box = soup.find('h2').get_text(strip=True)
        yield name_box
if __name__ == '__main__':
    for item in get_item(link):
        print(item)
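With this many pages, a single page without an <h2> would raise AttributeError and stop the whole run, because soup.find returns None when nothing matches. A hedged sketch of a more tolerant parsing step; extract_name is a hypothetical helper, and the sample strings stand in for downloaded pages:

```python
from bs4 import BeautifulSoup


def extract_name(html):
    """Return the text of the first <h2>, or None if the page has none."""
    heading = BeautifulSoup(html, "html.parser").find("h2")
    return heading.get_text(strip=True) if heading is not None else None


pages = {
    "2-azienda-zago": "<html><body><h2> Zago </h2></body></html>",  # made-up HTML
    "broken-page": "<html><body><p>Not found</p></body></html>",  # made-up HTML
}
for slug, html in pages.items():
    name = extract_name(html)
    print(slug, "->", name if name is not None else "no <h2> found")
```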
