I have been trying to use BS4 to scrape this web page, but I cannot find the data I want (the player names in the table, e.g. "Claiborne, Morris").
When I use:
soup = BeautifulSoup(r.content, "html.parser")
PlayerName = soup.find_all("table")
print(PlayerName)
None of the players' names are even in the output; it only shows a different table.
When I use:
soup = BeautifulSoup(r.content, 'html.parser')
texts = soup.findAll(text=True)
print(texts)
I can see them.
Any advice on how to dig in and get player names?
The table you're looking for is dynamically filled by JavaScript when the page is rendered. When you retrieve the page using e.g. requests, it only retrieves the original, unmodified page. This means that some elements that you see in your browser will be missing.
The reason you can find the player names with your second snippet of code is that they are contained in the page's JavaScript source, as JSON. However, you won't be able to retrieve them with BeautifulSoup, as it won't parse the JavaScript.
The best option is to use something like Selenium, which mimics a browser as closely as possible and will execute JavaScript code, thereby rendering the same page content as you would see in your own browser.
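Something along these lines usually works (a minimal sketch, assuming Selenium 4 with a Chrome driver available on your PATH; the URL is a placeholder, since the original page wasn't linked):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com/player-stats")  # placeholder for your page
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all("table"):
    print(table.get_text(" ", strip=True))

Once the page has been rendered, driver.page_source contains the same DOM you see in your browser's inspector, so your original find_all("table") approach should then pick up the player table.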
Related
Simple question: why is it that when I inspect an element I see the data I want embedded within the script tags, but when I go directly to the page source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay is using JavaScript to load content into the page. A workaround for this problem is to use something like Playwright or Selenium. I personally prefer the first option: it uses a Chromium browser to actually get the page contents, and therefore runs the JavaScript in the process.
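For example (a rough sketch using Playwright's sync API, assuming you have run pip install playwright and playwright install chromium):

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.ebay.com/itm/272037717929")
    html = page.content()  # the HTML after scripts have run
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

If the description text still doesn't show up, it may be loaded inside an iframe, which would need to be fetched or queried separately.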
I am using Beautiful Soup to try to get data from the Overwatch League schedule website. However, despite all the documentation saying that bs4 is capable of finding nested divs if I have their class, it only returns an empty list.
here is the url: https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1
here is what I am trying to get:
bs = BeautifulSoup(req.text, "html.parser")
matches = bs.find_all("div", class_="schedule-boardstyles__ContainerCards-j4x5cc-8 jcvNlt")
My goal is to eventually loop through those divs and scrape the match data from them. However, it's not working and only returns []. Is there something I'm doing wrong?
When a page is loaded, it often runs scripts to fill in the information. BeautifulSoup is only a parser and cannot render a page, so you will need something like Selenium to render the page before using BeautifulSoup to find the elements.
It isn't working because requests gets the HTML before the page is fully loaded, and there is no way to make requests wait. You could try doing it with Selenium.
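A hedged sketch of the Selenium route (assuming Selenium 4; the class name is copied from the question and, being an auto-generated styled-components class, may change whenever the site is rebuilt):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1")

# Wait until the match cards have actually been rendered
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, "div.schedule-boardstyles__ContainerCards-j4x5cc-8")
    )
)

bs = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Matching on a single class is more robust than the exact two-class string
matches = bs.select("div.schedule-boardstyles__ContainerCards-j4x5cc-8")
print(len(matches))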
I am trying to use the requests library + beautiful soup to pull information on the antenna points from the map shown on this website.
http://www.sites.bipt.be/
My original plan was to iterate through site numbers, pull out the lat/long information that appears on the left-hand panel when a point is clicked, and display that data in Arc. So far I have accessed the element where that information is located when the point is clicked ( id = selectedsite ... ). But the element turns up empty in Python, seemingly because nothing has been clicked?
This is my first time web scraping and I have limited HTML knowledge, so if there is another approach that would be better, or any pointers you could offer, that would be greatly appreciated :)
import requests
from bs4 import BeautifulSoup
#api_key = 'MY_API_KEY'
# Fetch the page and pretty-print the HTML
source = requests.get("http://www.sites.bipt.be/").text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
# Element (id="selectedsite") where information on the selected site is located - turns up empty
div = soup.find(id='selectedsite')
print(div.prettify())
This webpage loads its data with JavaScript, and you cannot use BeautifulSoup to click on an element in a webpage; it only parses the HTML it is given. You have to use Selenium.
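A very rough sketch of what that could look like. Every selector here except selectedsite (which comes from the question) is a hypothetical placeholder; the real marker elements are drawn by JavaScript and have to be identified with the browser's dev tools:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.sites.bipt.be/")
wait = WebDriverWait(driver, 15)

# Hypothetical selector for a map marker -- inspect the page for the real one
marker = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".map-marker")))
marker.click()

# After the click, the left-hand panel should be populated
panel = wait.until(EC.presence_of_element_located((By.ID, "selectedsite")))
print(panel.text)
driver.quit()

It is also often worth checking the browser's network tab first: map pages like this frequently fetch their point data from a JSON endpoint that you can call directly with requests, which is far easier than driving a browser.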
I am looking to extract all links from webpages. The process I had previously been using was to extract the "href" attribute, e.g.:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, "lxml")
for a in soup.find_all("a"):
    print(a["href"])
Some links, however, have an onclick attribute instead of href, e.g.:
...</span>
and other links in the menu bars are constructed with JavaScript's window.open calls.
I could probably write code that identifies the ways links are included without the href attribute, but is there an easier/more standard way to extract all links from an HTML page?
Followup:
I am specifically interested in ways to extract links which are not part of the standard "href" attribute in the "a" tag (those can easily be extracted); e.g. I want to extract links which are included via window.open() or other JavaScript, or other ways in which links appear on a page. Relatedly, since most links on sites are relative, looking for text on the page that starts with http is not going to capture them all.
The only way I can think of to grab everything is to convert the entire soup result into a string and pull out everything containing http with a regex:
import re

soup = str(soup)
links = re.findall(r'(http.*?)"', soup)
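A slightly more structured sketch that combines the approaches: href values, URLs inside onclick handlers, and absolute URLs anywhere in the markup. The regexes are deliberately loose and will still miss relative paths built by JavaScript at runtime; html_text is the page source from the question's code:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "lxml")
links = set()

# Ordinary anchors
for a in soup.find_all("a", href=True):
    links.add(a["href"])

# Tags whose onclick calls window.open('...')
for tag in soup.find_all(onclick=True):
    links.update(re.findall(r"window\.open\(['\"]([^'\"]+)['\"]", tag["onclick"]))

# Absolute URLs anywhere in the raw markup, inline scripts included
links.update(re.findall(r'https?://[^\s"\'<>]+', str(soup)))

print(len(links))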
I am doing a CA and I have to parse the page using Beautiful Soup, which I did with this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen(url)   # download the page
res1 = r.read()    # put the content into a variable
soup = BeautifulSoup(res1, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
but then I have to print how many different pages have been crawled.
Does anybody have a tip for me?
Thank you very much
As #cricket_007 mentioned in the comments, your current code 'crawls' (i.e. retrieves) only one page.
If you need to print how many links you found in the document, you can just do:
print(len(soup.find_all('a')))
Note that soup.find_all('a') returns a list of the matching tags, so its len() gives you the number of links.
If you really need to crawl the website (e.g. retrieve a page, get all links from that page, follow each of those links, retrieve the pages they refer to, and so on), I'd suggest using RoboBrowser instead of "pure" BeautifulSoup.
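If you'd rather stay with requests + BeautifulSoup, a minimal crawl sketch looks like this: follow links breadth-first within one domain, keep a visited set, and the number of different pages crawled is just len(visited). The start URL and the page cap are placeholders:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque

start_url = "https://example.com/"  # placeholder
domain = urlparse(start_url).netloc

visited = set()
queue = deque([start_url])
while queue and len(visited) < 50:  # cap to keep the example polite
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    soup = BeautifulSoup(r.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == domain:
            queue.append(link)

print(len(visited), "different pages crawled")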