I am looking to extract all links from webpages. The process I had previously been using was to extract the "href" attribute, e.g.:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "lxml")
for a in soup.find_all("a", href=True):  # href=True skips anchors without the attribute
    print(a["href"])
Some links, however, have an onclick attribute instead of href (e.g. a <span> element whose onclick handler navigates somewhere), and other links in the menu bars are constructed with JavaScript window.open() calls.
I could probably write code that identifies the cases that do not use the href attribute, but is there an easier or more standard way to extract all links from an HTML page?
Followup:
I am specifically interested in ways to extract links that are not part of the standard "href" attribute of the "a" tag, since those are easy to extract. For example, I want to extract links that are included via window.open() or other JavaScript, or any other way links end up on a page. Relatedly, since most links on sites are relative, searching the page text for strings that start with http will not capture them all.
The only way I can think of to grab everything is to convert the entire soup into a string and pull out everything starting with http using a regex:
import re

links = re.findall(r'(http.*?)"', str(soup))
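A slightly tighter sketch along the same lines: keep the parsed tree and scan only the places JavaScript links tend to live. The window.open/location.href pattern below is an illustration, not exhaustive, and relative URLs are kept as-is:

import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "lxml")
links = set()

# Ordinary anchors first.
for a in soup.find_all("a", href=True):
    links.add(a["href"])

# Quoted strings passed to window.open() or assigned to location.href,
# inside onclick handlers and inline <script> tags.
js_url = re.compile(r"""(?:window\.open\s*\(|location\.href\s*=)\s*["']([^"']+)["']""")

for tag in soup.find_all(onclick=True):
    links.update(js_url.findall(tag["onclick"]))
for script in soup.find_all("script"):
    links.update(js_url.findall(script.get_text()))

print(links)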
Related
Simple question: why is it that when I use Inspect Element I see the data I want embedded within the script tags, but when I go directly to View Page Source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'  # requests needs the scheme; a bare 'www...' URL raises MissingSchema
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay uses JavaScript to load content into the page. A workaround for this problem would be to use something like Playwright or Selenium. I personally prefer the first option: it drives a Chromium browser to actually get the page contents, and hence runs the JavaScript in the process.
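For example, a minimal Playwright sketch (after pip install playwright and playwright install chromium); the "networkidle" wait is a heuristic, not a guarantee that every script has finished:

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.ebay.com/itm/272037717929")
    # Wait until network activity settles so the page's scripts have a chance to run.
    page.wait_for_load_state("networkidle")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # the listing description should now be present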
I'm running into a problem while scraping a heavy website like Facebook or Twitter, with a lot of HTML tags, using the Beautiful Soup and Requests libraries.
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://twitter.com/elonmusk').text
soup = BeautifulSoup(html_text, 'lxml')
elon_tweet = soup.find_all('span', class_='css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0')
print(elon_tweet)
[Screenshots in the original post show the tweet and its corresponding span element.]
When the code is executed, this returns an empty list.
I'm new to web scraping, so a detailed explanation would be welcome.
The problem is that Twitter loads its content dynamically. This means that when you make a request, the page first returns only its initial HTML (type "view-source:https://twitter.com/elonmusk" into your browser's address bar to see it).
Later, after the page has loaded, the JavaScript is executed and adds the full content of the page.
With requests from Python you can only scrape the content available in that view-source output, and as you can see, the element you're trying to scrape is not there.
To scrape this element you will need to use Selenium, which allows you to drive a browser directly from Python and therefore wait the few extra seconds needed for the whole content to load. You can find a good guide on this here: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-2/
Also, if you don't want all this trouble, you can instead use an API that performs JavaScript rendering.
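For illustration, a minimal Selenium sketch (Selenium 4 manages the browser driver itself; the "article span" selector is an assumption about Twitter's rendered markup and may need adjusting):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://twitter.com/elonmusk")

# Wait up to 15 seconds for tweet text to be injected into the DOM.
spans = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article span"))
)
for span in spans:
    print(span.text)

driver.quit()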
I am using Beautiful Soup to try to get data from the Overwatch League schedule website. However, despite all the documentation saying that bs4 is capable of finding nested divs if I have their class, it only returns an empty list.
here is the url: https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1
here is what I am trying to get:
import requests
from bs4 import BeautifulSoup

req = requests.get("https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1")
bs = BeautifulSoup(req.text, "html.parser")
matches = bs.find_all("div", class_="schedule-boardstyles__ContainerCards-j4x5cc-8 jcvNlt")
I want to eventually loop through those divs and scrape the match data from them. However, it's not working and only returns []. Is there something I'm doing wrong?
When a page is loaded, it often runs scripts to fill in the information.
Beautiful Soup is only a parser and cannot render a page.
You will need something like Selenium to render the page before using Beautiful Soup to find the elements.
It isn't working because requests gets the HTML before the page is fully loaded, and I don't think there is a way to make requests wait. You could try doing it with Selenium.
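A minimal sketch of that hand-off, letting Selenium render the page before Beautiful Soup parses it (the long generated class name comes from the question and is likely to change over time):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1")
time.sleep(5)  # crude wait for the scripts to run; an explicit WebDriverWait is more robust

bs = BeautifulSoup(driver.page_source, "html.parser")
matches = bs.find_all("div", class_="schedule-boardstyles__ContainerCards-j4x5cc-8 jcvNlt")
driver.quit()

print(len(matches))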
I have been trying to use BS4 to scrape this web page, but I cannot find the data I want (the player names in the table, i.e., "Claiborne, Morris").
When I use:
soup = BeautifulSoup(r.content, "html.parser")
PlayerName = soup.find_all("table")
print(PlayerName)
None of the player names are even in the output; it only shows a different table.
When I use:
soup = BeautifulSoup(r.content, 'html.parser')
texts = soup.find_all(string=True)
print(texts)
I can see them.
Any advice on how to dig in and get player names?
The table you're looking for is dynamically filled by JavaScript when the page is rendered. When you retrieve the page using e.g. requests, it only retrieves the original, unmodified page. This means that some elements that you see in your browser will be missing.
The fact that you can find the player names in your second snippet of code is because they are contained in the page's JavaScript source as JSON. However, you won't be able to retrieve them with BeautifulSoup, as it doesn't parse JavaScript.
The best option is to use something like Selenium, which mimics a browser as closely as possible and will execute JavaScript code, thereby rendering the same page content as you would see in your own browser.
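If you'd rather not run a browser, another option is to dig the embedded JSON out of the script source directly. A rough sketch, where the URL, the playerData variable name, and the regex are all hypothetical stand-ins for whatever the actual page embeds:

import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://example.com/team/roster"  # hypothetical: the roster page from the question
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

for script in soup.find_all("script"):
    # Hypothetical pattern: a JSON object assigned to a variable in the script source.
    m = re.search(r"var\s+playerData\s*=\s*(\{.*?\});", script.get_text(), re.DOTALL)
    if m:
        players = json.loads(m.group(1))
        print(players)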
I am doing a CA and I have to parse the page using Beautiful Soup, which I did with this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen(url)  # download the page
res1 = str(r.read())  # put the content into a variable
soup = BeautifulSoup(res1, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
But then I have to print how many different pages have been crawled.
Does anybody have a tip for me?
Thank you very much.
As #cricket_007 mentioned in the comments, your current code 'crawls' (i.e. retrieves) only one page.
If you need to print how many links you found in the document, you can just do:
print(len(soup.find_all('a')))
Note that soup.find_all('a') returns a list of the matching tags, so its len gives you the number of links.
If you really need to crawl the website (i.e. retrieve a page, get all links from it, follow each of those links, retrieve the pages they point to, and so on), I'd suggest using RoboBrowser instead of "pure" BeautifulSoup.
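If you do want to see the underlying loop, a small sketch with just requests and Beautiful Soup shows the idea (RoboBrowser wraps something similar); the page limit and same-domain check are assumptions you would tune:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=20):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to download
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])  # resolve relative links against the current page
            if urlparse(link).netloc == urlparse(start_url).netloc:  # stay on the same site
                queue.append(link)
    return seen

pages = crawl("https://example.com")
print(len(pages), "different pages crawled")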