Scrape table from HTML with beautifulsoup [duplicate] - python

This question already has answers here:
scrape html generated by javascript with python
(5 answers)
Closed 4 years ago.
I am trying to scrape data from a website with Python 3. The website contains player data for a champion-based multiplayer FPS named "Paladins". I want to get a player's per-champion stats as shown on the website. The problem I'm facing is that when I inspect the page with Chrome, I get this code, which contains a "table" tag and is clean enough that I could scrape it easily:
INSPECTED CODE (my gist link)
But when I make the soup object, I get different code, and when I went over to the page source, it was the same as the soup: there was no table tag in the page source. (You may check the page source for a better understanding.)
Now how may I scrape champion-wise data from the website?
I am using requests and beautifulsoup for python3
import requests as req
import bs4
res = req.get('http://paladins.guru/profile/pc/Encrytedbro/champions')
soup = bs4.BeautifulSoup(res.text, "lxml")
soup.select('#root')
It is giving me this result: HERE'S A LINK TO MY GIST
And I have no idea how to get the data out of there.

I think you're getting the error because you are using the wrong parser; you should use html.parser instead of lxml:
soup = bs4.BeautifulSoup(res.text, "html.parser")
Hope this solves the problem.

Related

Cannot find the text I want to scrape in the Page Source

Simple question. Why is it that when I inspect element I see the data I want embedded within the JS tags - but when I go directly to Page Source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'  # requests needs the scheme (https://)
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay uses JavaScript to load content into the page. A workaround for this problem would be to use something like Playwright or Selenium. I personally prefer the first option. It uses a Chromium browser to actually get the page contents, and hence runs the JavaScript in the process.
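A minimal sketch of the Playwright approach, assuming Playwright and its Chromium build are installed (`pip install playwright`, then `playwright install chromium`). The function name `fetch_rendered_html` and the choice to wait for network idle are illustrative assumptions, not something the answer specifies.

```python
def fetch_rendered_html(url: str) -> str:
    """Load `url` in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily so this file can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        try:
            page = browser.new_page()
            # Assumption: waiting for the network to go idle is enough for the
            # content of interest to have been inserted into the page.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


# Example usage (needs network access and a browser, so it is left commented out):
# from bs4 import BeautifulSoup
# html = fetch_rendered_html("https://www.ebay.com/itm/272037717929")
# soup = BeautifulSoup(html, "html.parser")
# print(soup.prettify())
```

The returned string can be handed to BeautifulSoup exactly as `r.content` was in the question. If the text still does not appear, it may live in a separate frame or request, which this sketch does not handle.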

Trying to use Beautiful Soup to scrape data from website, but it only returns empty lists from nested Divs

I am trying to get data from the Overwatch League Schedule website using Beautiful Soup. However, despite all the documentation saying that bs4 can find nested divs if I have their class, it only returns an empty list.
Here is the URL: https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1
Here is what I am trying to get:
bs = BeautifulSoup(req.text, "html.parser")
matches = bs.find_all("div", class_="schedule-boardstyles__ContainerCards-j4x5cc-8 jcvNlt")
I want to eventually loop through those divs and scrape the match data from them. However, it's not working and only returns []. Is there something I'm doing wrong?
When a page is loaded, it often runs some scripts to fill in the information.
BeautifulSoup is only a parser and cannot render a page.
You will need something like Selenium to render the page before using BeautifulSoup to find the elements.
It isn't working because requests gets the HTML before the page is fully loaded. I don't think there is a way to make it wait. You could try doing it with Selenium.
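requests never runs the page's scripts, so there is nothing for it to wait on; a browser driver can wait instead. The following is a minimal sketch, assuming Selenium 4 with a local Chrome install. The function name `fetch_after_wait`, the 10 second timeout, and the partial-class selector in the usage comment are hypothetical choices for illustration.

```python
def fetch_after_wait(url: str, css_selector: str, timeout: float = 10.0) -> str:
    """Return the page source once an element matching `css_selector` exists."""
    # Imported lazily so this file loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the scripts have inserted at least one matching element;
        # raises TimeoutException if nothing shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


# Example usage (needs a browser and network access, so it is left commented out).
# The selector matches on part of the class name, on the assumption that the
# generated suffix ("-j4x5cc-8 jcvNlt") may change between site builds:
# from bs4 import BeautifulSoup
# selector = "div[class*='ContainerCards']"
# html = fetch_after_wait(
#     "https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1",
#     selector,
# )
# matches = BeautifulSoup(html, "html.parser").select(selector)
```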

Python/ Beautiful Soup Data Displaying Issue [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
I am trying to pull some data from a website. When I check the data that I pulled with BeautifulSoup (using print(soup) in the code below), it does not look right. It is different from what I see with view-source:URL. I am unable to find the fields that I am looking for.
Could you please help me to find a solution?
Website: https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html
Basically, I am trying to get the price of this product. I used the same code structure on other websites and it worked properly, but it is not working on Wayfair.
The second thing I have not found a solution for is the last line of my code (StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000). Instead of the name of the product, is there a way to get only the price, like $389.99?
Thanks in advance!
This is my code:
html = requests.get('https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html')
soup=BeautifulSoup(html.text,"html.parser")
print(soup)
inps=soup.find("div",class_="SFPrice").find_all("input")
for inp in inps:
    print(inp.get("StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000"))
Try with:
soup.findAll('div', {'class': 'SFPrice'})[0].getText()
Or in a more simple way:
inps=soup.findAll('div', {'class': 'SFPrice'})[0]
inps.getText()
Both return the price for that specific product.
Your example site is a client-side rendered page, and the HTML originally fetched doesn't include the elements you are searching for (like the div with class "SFPrice").
Check out this question to learn how to scrape JavaScript-rendered pages with BeautifulSoup in combination with Selenium and PhantomJS, dryscrape, or other options.
Or you could also look at this guide.
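To illustrate the point about client-side rendering, here is a small offline sketch. Both HTML strings are made-up stand-ins (not Wayfair's real markup): one for what requests would receive, one for the DOM after a browser has run the scripts. The helper name `price_text` is likewise hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the HTML that requests receives: an empty shell plus a script.
RAW_HTML = """
<html><body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

# Made-up stand-in for the DOM after the browser has executed the script.
RENDERED_HTML = """
<html><body>
  <div id="root"><div class="SFPrice">$389.99</div></div>
</body></html>
"""


def price_text(html):
    """Return the text of the first div with class SFPrice, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_="SFPrice")
    # find() returns None when nothing matches, so check before using the result.
    return node.get_text(strip=True) if node is not None else None


print(price_text(RAW_HTML))       # None
print(price_text(RENDERED_HTML))  # $389.99
```

BeautifulSoup parses both strings equally well; the difference is only in what was fetched, which is why the rendering step has to happen before parsing.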

How can I *webscrape* if the source HTML doesn't contain the actual number? [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
Hi, I'm a total newbie to the computer programming world, so I might ask stupid questions.
I'm trying to build a web scraping tool using Python to scrape some statistics from the Korean Statistical Office (KOSIS). This is what I did, and it keeps returning an error saying "'NoneType' object has no attribute 'find'":
import csv
import requests
from bs4 import BeautifulSoup
url = "https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1K31002&conn_path=I2"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
data_rows = soup.find("table", attrs = {"id" : "mainTable"}).find("tbody").find_all("tr")
print(data_rows.get_text())
I googled my problem and found out that the DOM in the browser is different from the actual HTML source. So I went to the view-source page (view-source:https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1K31002&conn_path=I2) and, since I don't know anything about HTML, I ran it through codebeautify and found out that the source code doesn't contain any of the numbers that I'm seeing. Can anyone explain what's happening? Thanks!
I recommend using Puppeteer for web scraping (it uses Google Chrome behind the scenes), because many web pages use JavaScript to manipulate the DOM after the HTML page loads. Therefore, the original DOM is not the same as the DOM once the page is fully loaded.
Here is a link that I found: https://rexben.medium.com/introduction-to-web-scraping-with-puppeteer-1465b89fcf0b
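The "'NoneType' object has no attribute 'find'" error comes from chaining `.find()` calls when the first one found nothing. A minimal offline sketch of a guarded version follows; the two HTML strings are hand-written samples (not the real KOSIS markup) and the helper name `table_rows` is hypothetical.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_id):
    """Return the table's rows as lists of cell strings, or [] if the table is missing."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": table_id})
    if table is None:
        # Chaining .find("tbody") here is what raises the NoneType error.
        return []
    body = table.find("tbody")
    if body is None:
        body = table  # some tables have no explicit <tbody>
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in body.find_all("tr")
    ]


# Hand-written sample: the kind of empty shell a script-driven page may return.
EMPTY_SHELL = '<html><body><div id="content"></div></body></html>'

# Hand-written sample: the table as it might look after rendering.
WITH_TABLE = """
<table id="mainTable"><tbody>
  <tr><th>Year</th><th>Value</th></tr>
  <tr><td>2020</td><td>1.5</td></tr>
</tbody></table>
"""

print(table_rows(EMPTY_SHELL, "mainTable"))  # []
print(table_rows(WITH_TABLE, "mainTable"))   # [['Year', 'Value'], ['2020', '1.5']]
```

Note that `find_all` returns a list-like result, so `get_text()` is called on each cell rather than on the whole result. The guard only avoids the crash; getting the actual numbers still requires rendered HTML or the underlying data request.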

Scraping youtube to get dynamically loaded content

I'm trying to scrape YouTube, but most of the time it just gives back an empty result.
In this code snippet I'm trying to get the list of video titles on the page. But when I run it, I just get an empty result back. Not even one title shows up in the result.
I've searched around and some results point out that it's due to the website loading content dynamically with JavaScript. Is this the case? How do I go about solving this? Is it possible to do without Selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)
Is it possible to do without Selenium?
Services often have APIs that allow easier automation than scraping their sites. YouTube has an API, and there are official client libraries for various languages; for Python there is google-api-python-client. You need an API key to use it. To get running, I suggest following the YouTube Data API quickstart. Note that you can ignore the OAuth 2.0 parts as long as you only need access to public data.
I totally agree with @Daweo, and that's the right way to get data from a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, your code should be changed as follows:
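A minimal sketch of the API route, assuming google-api-python-client is installed and you have an API key for the YouTube Data API v3. The function names, the `max_results` default, and the sample response are illustrative assumptions; the channel ID has to be looked up separately and is not filled in here.

```python
def extract_titles(search_response):
    """Pull video titles out of a YouTube Data API v3 search.list response dict."""
    return [item["snippet"]["title"] for item in search_response.get("items", [])]


def list_channel_video_titles(api_key, channel_id, max_results=25):
    """Ask the API for a channel's most recent videos and return their titles."""
    # Imported lazily so this file loads without google-api-python-client installed.
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.search().list(
        part="snippet",
        channelId=channel_id,
        type="video",
        order="date",
        maxResults=max_results,  # the API accepts values from 0 to 50
    ).execute()
    return extract_titles(response)


# Hand-written sample in the shape of a search.list response (not real data).
SAMPLE = {
    "items": [
        {"snippet": {"title": "First video"}},
        {"snippet": {"title": "Second video"}},
    ]
}
print(extract_titles(SAMPLE))  # ['First video', 'Second video']
```

Keeping the response parsing in its own function makes it possible to check it against a sample dict without spending API quota.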
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
titles = [i.text for i in soup.findAll('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.
