I am doing a CA and I have to parse a page using Beautiful Soup, which I did with this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen(url)  # download the page
res1 = str(r.read())  # put the content into a variable
soup = BeautifulSoup(res1, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
but then I have to print how many different pages have been crawled.
Does anybody have a tip for me?
Thank you very much
As @cricket_007 mentioned in the comments, your current code 'crawls' (i.e. retrieves) only one page.
If you need to print how many links you found in the document, you can just do:
print(len(soup.find_all('a')))
Note that soup.find_all('a') returns a list of the matching tags, so its len gives you the number of links.
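Since the question asks how many different pages there are, deduplicating the hrefs with a set may be closer to what you want. A small self-contained sketch (the HTML and links are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a href="/a">one</a> <a href="/b">two</a> <a href="/a">one again</a>'
soup = BeautifulSoup(html, "html.parser")

# Collect unique, non-empty href values
unique_links = {link.get("href") for link in soup.find_all("a") if link.get("href")}
print(len(unique_links))  # 2 distinct targets, even though there are 3 <a> tags
```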
If you really need to crawl the website (e.g. retrieve a page, get all links from it, follow each of those links, retrieve the pages they refer to, and so on), I'd suggest using RoboBrowser instead of "pure" BeautifulSoup.
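If a full crawl is really needed, a minimal breadth-first crawler can also be written with BeautifulSoup alone. This is only a sketch: the fetch function is injected so it can be demonstrated without the network (in real use it would wrap urlopen or requests.get), and the toy site below is made up.

```python
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl; returns the set of distinct URLs visited.
    fetch(url) must return the page's HTML as a string."""
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        soup = BeautifulSoup(fetch(url), "html.parser")
        for link in soup.find_all("a"):
            href = link.get("href")
            if not href:
                continue
            absolute = urljoin(url, href)  # resolve relative links against the current page
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Toy in-memory "site" standing in for real HTTP fetching
pages = {
    "http://example.com/": '<a href="/a">a</a><a href="/b">b</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": '<a href="/a">a</a>',
}
visited = crawl("http://example.com/", lambda u: pages.get(u, ""))
print(len(visited))  # 3 different pages crawled
```

Here the crawler visits all three toy pages, so it prints 3; a real crawler would also need politeness (delays, robots.txt) and a same-domain check.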
Related
I'm trying to scrape YouTube, but most of the time it just gives back an empty result.
In this code snippet I'm trying to get the list of video titles on the page. But when I run it, I just get an empty result back; not even one title shows up.
I've searched around, and some answers point out that this is due to the website loading content dynamically with JavaScript. Is that the case? How do I go about solving it? Is it possible to do without selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)
Is it possible to do without selenium?
Often services have APIs which allow easier automation than scraping their sites. YouTube has an API, and there are official client libraries for various languages; for Python there is google-api-python-client. You will need a key to use it. To get up and running, I suggest following the YouTube Data API quickstart; note that you can ignore the OAuth 2.0 parts as long as you only need access to public data.
I totally agree with @Daweo, and that's the right way to scrape a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, your code should be changed as follows:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
titles = [i.text for i in soup.findAll('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.
I am looking to extract all links from webpages. The process I had previously been using was to extract the "href" attribute, e.g.:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, "lxml")
for a in soup.findAll("a"):
    print(a["href"])
Some links, however, have an onclick attribute instead of using href,
e.g.:
...</span>
and other links in the menu bars are constructed with JavaScript window.open calls.
I could probably write code that identifies the links that don't use the href attribute, but is there an easier/more standard way to extract all links from an HTML page?
Followup:
I am specifically interested in ways to extract links which are not part of the standard "href" attribute of the "a" tag (those can easily be extracted); e.g. I want to extract links that are included via window.open(), other JavaScript, or any other way links end up on a page. Relatedly, since most links on sites are relative, looking for text on the page that starts with http is not going to capture them all.
The only way I can think of to grab everything is to convert the entire soup result into a string and pull out everything containing http with a regex:
soup = str(soup)
links = re.findall(r'(http.*?)"', soup)
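A middle ground between pure regex and attribute-by-attribute code is to scan every tag for href/src attributes and additionally run a URL regex over onclick handlers and inline scripts. A sketch; the attribute list and the regex are illustrative, not exhaustive:

```python
import re
from bs4 import BeautifulSoup

# Matches quoted absolute (http/https) or root-relative (/...) URLs
URL_RE = re.compile(r"""['"]((?:https?://|/)[^'"]+)['"]""")

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for tag in soup.find_all(True):  # every tag in the document
        for attr in ("href", "src"):
            if tag.get(attr):
                links.add(tag[attr])
        onclick = tag.get("onclick")
        if onclick:  # e.g. window.open('...') handlers
            links.update(URL_RE.findall(onclick))
    for script in soup.find_all("script"):  # URLs buried in inline JavaScript
        links.update(URL_RE.findall(script.get_text()))
    return links

html = '''<a href="/page1">one</a>
<span onclick="window.open('/page2')">two</span>
<script>var next = "https://example.com/page3";</script>'''
print(sorted(extract_links(html)))  # ['/page1', '/page2', 'https://example.com/page3']
```

This still misses URLs built by concatenation at runtime; only a real JavaScript engine (e.g. Selenium) catches those.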
There is one thing I cannot seem to grasp: how can I make BeautifulSoup parse every page by navigating with the "Next page" link, up until the last page, and stop parsing when no "Next page" is found? On a site like this:
I tried looking for the Next button's element name and used 'find' to locate it, but I do not know how to make it repeat until all pages are scraped.
Thank you
Beautiful Soup will only give you the tools; how to go about navigating pages is something you need to work out, in a flow-diagram sense.
Taking the page you mentioned and clicking through a few of its pages, it seems that when we are on page 1, nothing is shown in the URL:
htt...ru/moskva/transport
and we see in the source of the page:
<div class="pagination-pages clearfix">
<span class="pagination-page pagination-page_current">1</span>
<a class="pagination-page" href="/moskva/transport?p=2">2</a>
Let's check what happens when we go to page 2:
ht...ru/moskva/transport?p=2
<div class="pagination-pages clearfix">
<a class="pagination-page" href="/moskva/transport">1</a>
<span class="pagination-page pagination-page_current">2</span>
<a class="pagination-page" href="/moskva/transport?p=3">3</a>
Perfect, now we have the layout. One more thing to know before we make our soup: what happens when we go to a page past the last available one, which at the time of this writing was 40161?
ht...ru/moskva/transport?p=40161
We change this to:
ht...ru/moskva/transport?p=40162
The page seems to go back to page 1 automatically. Great!
So now we have everything we need to make our soup loop.
Instead of clicking next each time, just construct the URL; you know the elements required:
url = ht...ru/moskva/$searchterm?p=$pagenum
I'm assuming transport is the search term? I don't know, I can't read Russian. But you get the idea: construct the URL, then make a requests call:
request = requests.get(url)
mysoup = bs4.BeautifulSoup(request.text, 'html.parser')
And now you can wrap that whole thing in a while loop, and each time (except the first) check:
mysoup.select('.pagination-page_current')[0].text == '1'
This says: each time we get the page, find the currently selected page by using the class pagination-page_current. select returns a list, so we take the first element with [0], get its text with .text, and see if it equals '1'.
This should be true in only two cases: on the first page you request, and once you go past the last page. So you can use it to start and stop the script, or however you want.
This should be everything you need to do it properly. :)
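Putting the pieces above together, the loop might look like the sketch below. The page fetcher is injected so the logic can be shown without hitting the real site (in practice it would be requests.get(url).text), and it relies on the observation above that going past the last page wraps back to page 1. The fake pages are made up for illustration.

```python
import bs4

def scrape_all_pages(base_url, fetch):
    """Yield one soup per page, stopping when the site wraps back to page 1.
    fetch(url) returns the page's HTML as a string."""
    pagenum = 1
    while True:
        soup = bs4.BeautifulSoup(fetch(f"{base_url}?p={pagenum}"), "html.parser")
        current = soup.select(".pagination-page_current")[0].text
        if pagenum > 1 and current == "1":  # wrapped around: we went past the last page
            break
        yield soup
        pagenum += 1

# Stand-in for the real site: three pages, then a wrap-around to page 1
def fake_fetch(url):
    p = int(url.split("=")[1])
    shown = p if p <= 3 else 1
    return f'<div><span class="pagination-page_current">{shown}</span></div>'

pages = list(scrape_all_pages("http://example.com/search", fake_fetch))
print(len(pages))  # 3
```

One assumption here is that page 1 also accepts the ?p=1 form of the URL; if not, the first request would need the bare URL.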
BeautifulSoup by itself does not load pages. You need to use something like requests: fetch the URL you want to follow, then pass its content to another BS4 soup.
import requests
from bs4 import BeautifulSoup

# Fetch your url
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # You can now scrape the new page
I have been trying to use BS4 to scrape this web page. I cannot find the data I want (the player names in the table, e.g. "Claiborne, Morris").
When I use:
soup = BeautifulSoup(r.content, "html.parser")
PlayerName = soup.find_all("table")
print (PlayerName)
None of the player names are even in the output; it only shows a different table.
When I use:
soup = BeautifulSoup(r.content, 'html.parser')
texts = soup.findAll(text=True)
print(texts)
I can see them.
Any advice on how to dig in and get player names?
The table you're looking for is dynamically filled by JavaScript when the page is rendered. When you retrieve the page using e.g. requests, it only retrieves the original, unmodified page. This means that some elements that you see in your browser will be missing.
The fact that you can find the player names with your second snippet of code is because they are contained in the page's JavaScript source, as JSON. However, you won't be able to retrieve them with BeautifulSoup, as it won't parse the JavaScript.
The best option is to use something like Selenium, which mimics a browser as closely as possible and will execute JavaScript code, thereby rendering the same page content as you would see in your own browser.
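If installing Selenium is not an option, a lighter (if more brittle) alternative can be to pull that embedded JSON out of the script text and parse it directly. A sketch with a toy page; the variable name playerData and the HTML are made up, and the real page would need its own pattern:

```python
import json
import re
from bs4 import BeautifulSoup

html = """<html><body>
<script>var playerData = [{"name": "Claiborne, Morris"}, {"name": "Smith, Alex"}];</script>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
names = []
for script in soup.find_all("script"):
    # Look for the JSON array assigned inside the script text
    match = re.search(r"playerData\s*=\s*(\[.*?\]);", script.get_text(), re.DOTALL)
    if match:
        names = [player["name"] for player in json.loads(match.group(1))]
print(names)  # ['Claiborne, Morris', 'Smith, Alex']
```

This breaks whenever the site renames the variable or changes its structure, which is why Selenium (or an official API, where one exists) is usually the more robust choice.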
So I am trying to learn scraping and was wondering how to get multiple webpages of info. I was using it on http://www.cfbstats.com/2014/player/index.html . I want to retrieve all the teams, then go into each team's link (which shows the roster), then retrieve each player's info and, within their personal link, their stats.
What I have so far is:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

r = requests.get("http://www.cfbstats.com/2014/player/index.html")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
for link in links:
    college = link.text
    collegeurl = link.get("href")  # was link.get("http"), which always returns None
    if not collegeurl:
        continue
    c = requests.get(urljoin(r.url, collegeurl))  # resolve relative links
    campbells = BeautifulSoup(c.content, "html.parser")
Then I am lost from there. I know I have to do a nested for loop in there, but I don't want certain links such as terms and conditions or social networks.
I'm just trying to get the player info and then their stats, which are linked from their names.
You have to somehow filter the links and limit your for loop to the ones that correspond to teams. Then you need to do the same to get the links to players. Using Chrome's "Developer tools" (or your browser's equivalent), I suggest that you right-click and inspect one of the links of interest, then try to find something that distinguishes it from the links that are not of interest. For instance, you'll find the following about the CFBstats page:
All team links are inside <div class="conference">. Furthermore, they all contain the substring "/team/" in the href. So you can select links contained in such a div, or filter the ones with such a substring, or both.
On team pages, player links are in <td class="player-name">.
These two should suffice. If not, you get the gist. Web crawling is an experimental science...
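The filtering described above might look like the following sketch; the HTML fragment is a made-up imitation of the structure described (a div with class "conference", hrefs containing "/team/"):

```python
from bs4 import BeautifulSoup

html = """
<div class="conference">
  <a href="/2014/team/8/index.html">Air Force</a>
  <a href="/2014/team/9/index.html">Akron</a>
</div>
<a href="/terms.html">Terms and Conditions</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only links inside <div class="conference"> whose href contains "/team/"
team_links = [a["href"] for a in soup.select("div.conference a")
              if "/team/" in a["href"]]
print(team_links)  # ['/2014/team/8/index.html', '/2014/team/9/index.html']
```

The same pattern (select by container class, then filter by substring) would then be repeated on each team page with td.player-name.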
I'm not familiar with BeautifulSoup, but you can certainly use regular expressions to retrieve the data you want.