Web scraping in Python: copying a specific part of the HTML for each webpage

I am working on a web scraper using requests-html and Beautiful Soup (I am new to this). For one webpage (https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html) I am trying to scrape one part, which I will then replicate for other products. The HTML looks like:
<span class="js-enhanced-ecommerce-data hidden" data-product-title="Illamasqua Expressionist Artistry Palette" data-product-id="12024086" data-product-category="" data-product-is-master-product-id="false" data-product-master-product-id="12024086" data-product-brand="Illamasqua" data-product-price="£39.00" data-product-position="1">
</span>
I want to select data-product-brand="Illamasqua", specifically the value Illamasqua. I am not sure how to grab this using requests-html or BeautifulSoup. I tried:
r.html.find("span.data-product-brand", first=True)
But this was unsuccessful. Any help would be appreciated.

Because you tagged beautifulsoup, here's a solution using that package:
from bs4 import BeautifulSoup
import requests
page = requests.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
soup = BeautifulSoup(page.content, "html.parser")
# Several elements carry this class and hold the brand 'Illamasqua' (three in this case),
# which is what I think you want in the end; loop through them to read the brand from each
for span in soup.find_all(class_="js-enhanced-ecommerce-data hidden"):
    print(span.get('data-product-brand'))
# if it's always going to be the first match, you can just do this
print(soup.find(class_="js-enhanced-ecommerce-data hidden").get('data-product-brand'))

You can get the element(s) that carry a given data attribute directly with an attribute selector:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
span = r.html.find('[data-product-brand]', first=True)
# attrs is a dict of the element's attributes
print(span.attrs['data-product-brand'])
The selector matches 3 elements; I guess you need just the first.
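For comparison, BeautifulSoup can also match on the attribute itself rather than on the class. The following is a minimal offline sketch, assuming the span markup shown in the question (shortened and pasted in as a string, so no request is made):

```python
from bs4 import BeautifulSoup

# Markup copied (and shortened) from the question; a real page would come
# from requests.get(...).content instead.
html = '''<span class="js-enhanced-ecommerce-data hidden"
      data-product-title="Illamasqua Expressionist Artistry Palette"
      data-product-id="12024086"
      data-product-brand="Illamasqua"
      data-product-price="£39.00"></span>'''

soup = BeautifulSoup(html, "html.parser")

# attrs={name: True} matches any element that has the attribute, whatever its value
span = soup.find("span", attrs={"data-product-brand": True})

# find() returns None when nothing matches, so guard before indexing
brand = span["data-product-brand"] if span is not None else None
print(brand)
```

Matching on the attribute avoids depending on the exact class string, which may differ between product pages.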

Related

Python - Extracting info from website using BeautifulSoup

I am new to BeautifulSoup, and I'm trying to extract data from the following website.
https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx
I am trying to extract the availability of the hospital beds information (along with the detailed breakup) after choosing a particular district and also with the 'With available bed only' option selected.
Should I choose the table, the td, the tbody, or the div class for this instance?
My current code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
locations= soup.find('div', {'class': 'col-lg-12 col-md-12 col-sm-12'})
print(locations)
This prints only blank output.
I have also tried selecting tbody and table, but still could not work it out.
Any help would be greatly appreciated!
EDIT: Trying to find a certain element returns []. The code -
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
location = soup.find_all('h5')
print(location)
It is probably a dynamic website, which means that bs4 does not retrieve what you see in the browser, because the page loads or updates its content after the initial HTML load.
For such dynamic webpages you should use Selenium and combine it with bs4.
https://selenium-python.readthedocs.io/index.html
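A rough sketch of how Selenium and bs4 might be combined is given here. It assumes Chrome and a matching driver are available, and that waiting for an h5 element (the tag searched for in the question's EDIT) is a reasonable signal that the content has loaded; both are assumptions. The function names fetch_rendered_html and extract_headings are hypothetical helpers, and the sample markup at the end is made up so the parsing part can be tried without a browser.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_tag="h5", timeout=10):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported here so extract_headings() can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one <wait_tag> element exists, or raise TimeoutException
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, wait_tag))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()


def extract_headings(html, tag="h5"):
    """Return the stripped text of every <tag> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(tag)]


# Offline check with made-up markup standing in for the rendered page
sample = "<div><h5> Hospital A </h5><h5>Hospital B</h5></div>"
print(extract_headings(sample))
```

With a browser available, the two helpers would be chained as extract_headings(fetch_rendered_html(url)). Choosing a district or ticking the 'With available bed only' option would need additional Selenium interaction with the page's form controls, which is not shown here.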

How to scrape JavaScript page with Python

I'm trying to scrape patentsview.org but I'm having an issue. When I try to scrape this page, it doesn't work well. The site uses JavaScript to get data from its database. I tried to get the data using the requests-html package, but I didn't quite understand how.
Here's what I tried:
# Import
import re
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
# Set requests
r = session.get('https://datatool.patentsview.org/#search/assignee&asn=1|Samsung')
r.html.render()
# Set BS and print
soup = BeautifulSoup(r.html.html, "lxml")
tags = soup.find_all("div", class_='summary')
print(tags)
This code gives me this result:
# Result
[<div class="summary"></div>]
But I want the div with its content filled in, as the browser shows it.
This is the right div, but I can't see its content with my code. How can I get the div's content? I hope you understand what I mean.
Use the browser dev tools (Chrome: F12 - Network - XHR) and find the HTTP GET that returns the data (as JSON) you are looking for.
HTTP GET https://webapi.patentsview.org/api/assignees/query?q={%22_and%22:[{%22_or%22:[{%22_and%22:[{%22_contains%22:{%22assignee_first_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_last_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_organization%22:%22Samsung%22}}]}]}]}&f=[%22assignee_id%22,%22assignee_first_name%22,%22assignee_last_name%22,%22assignee_organization%22,%22assignee_lastknown_country%22,%22assignee_lastknown_state%22,%22assignee_lastknown_city%22,%22assignee_lastknown_location_id%22,%22assignee_total_num_patents%22,%22assignee_first_seen_date%22,%22assignee_last_seen_date%22,%22patent_id%22]&o={%22per_page%22:50,%22matched_subentities_only%22:true,%22sort_by_subentity_counts%22:%22patent_id%22,%22page%22:1}&s=[{%22patent_id%22:%22desc%22},{%22assignee_total_num_patents%22:%22desc%22},{%22assignee_organization%22:%22asc%22},{%22assignee_last_name%22:%22asc%22},{%22assignee_first_name%22:%22asc%22}]
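The long URL above is easier to handle when the JSON parameters are built in Python and then URL-encoded. The sketch below mirrors the q and o parameters of that URL, requests only a subset of its f fields, and omits the s (sort) parameter; build_assignee_query_url is a hypothetical helper name. Whether the endpoint still responds, and what the response looks like, is not shown in the thread, so the sketch only builds the URL.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

API_URL = "https://webapi.patentsview.org/api/assignees/query"

# The three name fields searched in the URL above
NAME_FIELDS = ["assignee_first_name", "assignee_last_name", "assignee_organization"]


def build_assignee_query_url(name, page=1, per_page=50):
    """Build the GET URL for an assignee search, mirroring the URL in the answer."""
    # q: the name may appear in any one of the name fields
    q = {"_and": [{"_or": [
        {"_and": [{"_contains": {field: name}}]} for field in NAME_FIELDS
    ]}]}
    # f: fields to return (a subset of those requested in the original URL)
    f = ["assignee_id", "assignee_organization", "assignee_total_num_patents"]
    # o: paging options, copied from the original URL
    o = {
        "per_page": per_page,
        "matched_subentities_only": True,
        "sort_by_subentity_counts": "patent_id",
        "page": page,
    }
    params = {"q": json.dumps(q), "f": json.dumps(f), "o": json.dumps(o)}
    return API_URL + "?" + urlencode(params)


url = build_assignee_query_url("Samsung")
# Decode the query string again to confirm the JSON survived the encoding
decoded = parse_qs(urlparse(url).query)
print(json.loads(decoded["o"][0]))
```

If the endpoint is reachable, requests.get(url).json() should return the data directly, with no need to render the JavaScript page.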

Website scraping with Python, how do I know where to reference in the html?

I'm a complete beginner who has only built basic Python projects. Right now I'm building a scraper in Python with bs4 to help me read success stories off of a website. These success stories are all in a table, so I thought I would find an html tag that said table and would encompass the entire table.
However, it is all just <div and <span class, and when I use soup.find("div") or ("span") it returns only the single word "div" or "span". This is what I have so far, and I know it isn't right or set up correctly but I'm too inexperienced to know why yet.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
req = Request('https://www.calix.com/about-calix/success-stories.html', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "lxml")
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
print('div')
I have watched several tutorials on how to use bs4 and I have successfully scraped basic websites, but all I can do for this one is get ALL of the html, not the chunks I need (just the success stories).
You are printing the literal string 'div'. soup.find() does not change soup; it returns the element it found, so store that return value and print it instead.
You should have a look at the bs4 documentation.
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
Here you're calling soup.find() but you're not saving the results into a variable, so the results are lost.
print('div')
And here you're printing the literal string div. I don't think that's what you intended.
Try something like this:
div = soup.find("div", {"id": "..."})
print(div)
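To go one step further and pull out only the pieces inside the container, the saved result can be searched again. This is a minimal sketch on made-up markup: the id "stories" and the class "story" are placeholders, not the real page's names, which have to be read off the page with the browser's inspector.

```python
from bs4 import BeautifulSoup

# Made-up markup; the real page's ids and classes will differ
html = """
<div id="stories">
  <div class="story"><span>First story</span></div>
  <div class="story"><span>Second story</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Save the result of find() so it can be used afterwards
container = soup.find("div", {"id": "stories"})

if container is None:
    # find() returns None when nothing matches
    titles = []
else:
    # Search only inside the container, not the whole document
    titles = [
        story.get_text(strip=True)
        for story in container.find_all("div", class_="story")
    ]
print(titles)
```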

How to find all Elements of a specific Type with the new Requests-HTML library

I want to find all specific fields in an HTML document. In Beautiful Soup everything works with this code:
soup = BeautifulSoup(html_text, 'html.parser')
urls_previous = soup.find_all('h2', {'class': 'b_algo'})
But how can I do the same search with the requests-html library? Or can it only find a single element in an HTML document? I couldn't find how to do it in the docs or examples.
https://html.python-requests.org/
Example:
<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>
How can I find all elements of a specific type with the requests-html library?
With requests-html:
from requests_html import HTML
doc = """<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>"""
# load HTML from a string
html = HTML(html=doc)
# find() returns a list of all matching elements unless first=True is passed
x = html.find('h2')
print(x)

BeautifulSoup (bs4) does not find all tags

I'm using Python 3.5 and bs4
The following code will not retrieve all the tables from the specified website. The page has 14 tables but the return value of the code is 2. I have no idea what's going on. I manually inspected the HTML and can't find a reason as to why it's not working. There doesn't seem to be anything special about each table.
import bs4
import requests
link = "http://www.pro-football-reference.com/players/B/BradTo00.htm"
htmlPage = requests.get(link)
soup = bs4.BeautifulSoup(htmlPage.content, 'html.parser')
all_tables = soup.findAll('table')
print(len(all_tables))
What's going on?
EDIT: I should clarify. If I inspect the soup variable, it contains all of the tables that I expected to see. How am I not able to extract those tables from soup with the findAll method?
This page is rendered by JavaScript; if you disable JavaScript in your browser, you will notice that the page only has two tables.
I recommend using Selenium for this situation.
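One pattern that would fit the EDIT in the question (the tables are present in the downloaded HTML, yet findAll misses them) is that the extra tables are shipped inside HTML comments, which a script unwraps in the browser. Whether this page does that is an assumption the thread does not confirm. The sketch below shows, on made-up markup, how such commented-out tables could be parsed without a browser.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: one normal table and one wrapped in an HTML comment
html = """
<table id="visible"><tr><td>1</td></tr></table>
<div><!--
<table id="hidden"><tr><td>2</td></tr></table>
--></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Tables that are part of the normal document tree
tables = soup.find_all("table")

# Comments are stored as strings, so their contents must be parsed separately
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    tables.extend(BeautifulSoup(comment, "html.parser").find_all("table"))

table_ids = [table.get("id") for table in tables]
print(table_ids)
```

If the missing tables are instead fetched by a separate request after load, this approach finds nothing and a browser-driven tool such as Selenium is the fallback.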
