Fetch all links from a webpage - python

I am trying to get all the links from this website
My code is:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
url_meva = 'https://www.recetasgratis.net'
uClient = uReq(url_meva)
pag_html = uClient.read()
uClient.close()
pag_soup = soup(pag_html, "html.parser")
containers = pag_soup.findAll("a",{"class":"titulo titulo--bloque"})
If I type len(containers) the result is 43, but it should be approximately 25,000.
Why do I only get those 43 and not the rest?
The idea is to get the links of the recipes.
I know the website has the same structure for the recipes.
Thanks

What you get when you read the content of the URL https://www.recetasgratis.net is the raw page source, the same text shown at view-source:https://www.recetasgratis.net/, which has exactly 43 instances of the class titulo titulo--bloque. You'll need to figure out the functions behind the dynamic loading of the webpage and use them to your advantage to fetch the list of all links. Good luck with that.

Your current implementation only scrapes the current home page.
To grab all 25k recipe links, you will have to repeat this action for every page of their catalog, from https://www.recetasgratis.net/busqueda/pag/1 to page 574.
You can do this by building the URL in a for loop and, for each page, running
pag_soup.findAll("a",{"class":"titulo titulo--bloque"}) on that page's parsed HTML.
At that point you should have all the links and can begin grabbing the data from each recipe page; that implementation is up to you.
I suggest using some sort of flat-file data store to keep track of the URLs collected. Keeping everything in memory is not recommended, since a single exception would break your entire 500+ page run and force you to start over.
Also, if this is not your website, please consider the legal implications of what you are doing.
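The page-by-page approach described above could be sketched roughly as follows. This is only a sketch under stated assumptions: it uses the standard library's HTMLParser instead of BeautifulSoup to stay dependency-free, it assumes the catalog pages use the same titulo titulo--bloque class as the home page, and the names extract_links, page_urls, crawl, and recipe_links.txt are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Catalog URL pattern and page count taken from the answer above.
BASE = "https://www.recetasgratis.net/busqueda/pag/{}"
# Class from the question; assumed to be the same on the catalog pages.
TARGET_CLASS = "titulo titulo--bloque"


class RecipeLinkParser(HTMLParser):
    """Collects href values of <a> tags whose class attribute equals TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == TARGET_CLASS and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_links(html):
    """Return the recipe links found in one page of HTML."""
    parser = RecipeLinkParser()
    parser.feed(html)
    return parser.links


def page_urls(first=1, last=574):
    """Build the list of catalog page URLs, inclusive of both ends."""
    return [BASE.format(n) for n in range(first, last + 1)]


def crawl(out_path="recipe_links.txt", first=1, last=574):
    """Append each page's links to a flat file, so a failure keeps earlier pages."""
    for url in page_urls(first, last):
        with urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        with open(out_path, "a", encoding="utf-8") as out:
            for link in extract_links(html):
                out.write(link + "\n")
```

Calling crawl(first=1, last=3) would be a reasonable small trial before running all pages. Because the file is opened in append mode for every page, a crash on page 300 still leaves the links from pages 1 to 299 on disk, and the run can be resumed with crawl(first=300).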

Related

Is there a way to scrape URLs without scraping links?

Basically, I'm trying to download some images from a site which has been "down for maintenance" for over a year. Attempting to visit any page on the website redirects to the forums. However, visiting image URLs directly will still take you to the specified image.
There's a particular post that I know contains more images than I've been able to brute force by guessing possible file names. I thought, rather than typing every combination of characters possible ad infinitum, I could program a web scraper to do it for me.
However, I'm brand new to Python, and I've been unable to overcome the Javascript redirects. While I've been able to use requests & beautifulsoup to scrape the page it redirects to for 'href', without circumventing the JS I cannot pull from the news article which has links.
import requests
from bs4 import BeautifulSoup
url = 'https://www.he-man.org/news_article.php?id=6136'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
I've added allow_redirects=False to every field to no avail. I've tried searching for 'jpg' instead of 'href'. I'm currently attempting to install Selenium, though it's fighting me, and I expect I'll just be getting 3xx errors anyway.
The images from the news article all begin with the same 62 characters (https://www.he-man.org/assets/images/home_news/justinedantzer_), so I've also thought maybe there's a way to just infinite-monkeys-on-a-keyboard scrape the rest of it? Or the type of file (.jpg)? I'm open to suggestions here, I really have no idea what direction to come at this thing from now that my first six plans have failed. I've seen a few mentions of scraping spiders, but at this point I've sunk a lot of time into dead ends. Are they worth looking into? What would you recommend?

Load entire html page in python

I need to store in a str variable an entire html page.
I'm doing this:
import requests
from bs4 import BeautifulSoup
url = my_url
response = requests.get(url)
page = str(BeautifulSoup(response.content, 'html.parser'))
This works, but the page at my_url is not "complete". It is a website where new content loads as you scroll to the end, and I need the whole page, not only the part that is visible at first.
Is there a way to load the entire page and then store it?
I also tried to load the page manually and then looking at the source code, but the final part of the page is still not visible.
Alternatively, all I want from my_url page are all the links inside it, and all of them are like:
my_url/something/first-post
my_url/something/second-post
Is there a way to find all the links in another way? So, all the possible url that starts with "my_url/something/"
Thanks in advance
I think you should use Selenium and scroll down with it to get the entire page.
As far as I know, requests can't handle dynamically loaded pages.
For the alternative option, you can find the <a> tags via find_all:
links = soup.find_all('a')
To keep only the links whose href starts with a given prefix, you can use the following:
result = [link['href'] for link in links if link.get('href', '').startswith('my_url/something/')]
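One thing the prefix filter above does not cover is relative links such as /something/first-post. A minimal sketch that resolves each href against the page URL before filtering might look like this. It uses the standard library's HTMLParser rather than BeautifulSoup, and the names HrefCollector and links_with_prefix, as well as the example.com URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_with_prefix(html, base_url, prefix):
    """Return absolute link URLs from html that start with prefix."""
    collector = HrefCollector()
    collector.feed(html)
    # urljoin turns relative hrefs into absolute URLs and leaves absolute ones alone.
    absolute = (urljoin(base_url, href) for href in collector.hrefs)
    return [url for url in absolute if url.startswith(prefix)]
```

For example, links_with_prefix(page, "https://example.com/", "https://example.com/something/") would keep both /something/first-post and a fully qualified https://example.com/something/second-post, while dropping unrelated links.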

Scraping youtube to get dynamically loaded content

I'm trying to scrape YouTube, but most of the time it just gives back an empty result.
In this code snippet I'm trying to get the list of video titles on the page, but when I run it I get an empty result back. Not even one title shows up.
I've searched around and some results point out that it's due to the website loading content dynamically with javascript. Is this the case? How do I go about solving this? Is it possible to do without selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)
Is it possible to do without selenium?
Services often have APIs that allow easier automation than scraping. YouTube has an API, and there are official client libraries for various languages; for Python there is google-api-python-client. You will need a key to use it. To get started, I suggest following the YouTube Data API quickstart. Note that you can ignore the OAuth 2.0 parts as long as you only need access to public data.
I totally agree with @Daweo; that's the right way to get data from a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, your code should be changed as follows:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
titles = [i.text for i in soup.findAll('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.
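For the API route suggested above, here is a rough sketch that calls the YouTube Data API v3 search endpoint directly over HTTP instead of going through google-api-python-client. It assumes you already have an API key and the channel's ID (the /user/schafer5 URL in the question is a user name, not a channel ID); the function names search_url, titles_from_response, and fetch_titles are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def search_url(api_key, channel_id, max_results=25):
    """Build a search request URL for the most recent videos of one channel."""
    params = {
        "part": "snippet",
        "channelId": channel_id,    # assumed to be known; not the user name
        "type": "video",
        "order": "date",
        "maxResults": max_results,  # the API caps this at 50 per request
        "key": api_key,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def titles_from_response(payload):
    """Pull the video titles out of a decoded search response."""
    return [item["snippet"]["title"] for item in payload.get("items", [])]


def fetch_titles(api_key, channel_id):
    """Perform the request and return the titles (needs network and a valid key)."""
    with urlopen(search_url(api_key, channel_id)) as response:
        return titles_from_response(json.load(response))
```

This returns a single page of results only; fetching more would mean following the nextPageToken field of the response, which is left out here to keep the sketch short.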

How to download all the comments from a news article using Python?

I have to admit that I don't know much HTML. I am trying to extract all the comments from an online news article using Python. I tried BeautifulSoup, but it seems the comments are not in the HTML source code, although they do show up in inspect element. For instance, you can check here: http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments
My code is below, and I am stuck.
import urllib.request as urllib2
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
I want to do this
name_box = soup.find('p', attrs={'class': 'comment-body comment-text'})
but this info is not there in the source-code.
Any suggestion, how to move forward?
I have not attempted things like this, but my guess is that if you want to get it directly from the "page source" you'll need something like Selenium to actually navigate the page, since the page is dynamic.
Alternatively, if you're only interested in the comments, you can use dailymail.co.uk's API to retrieve them.
Note the items in the query string: "max=1000", "&order", etc. You may also need to use the variable "offset" alongside max to find all the comments if the API has a limit on the maximum "max" value.
I do not know where the API is documented, but you can see it by watching the network requests your browser makes while you browse the page.
You can get the comment data for that page in JSON format from http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/5100519?max=1000&order=desc&rcCache=shout. It appears that every article has something like "5101863" in its URL; you can swap in those numbers for each new story that you want comments about.
Thank you FredMan. I did not know about this API. It seems we only need to give the article ID and we can get the comments for the article. This was the solution I was looking for.
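A small sketch of how the article ID could be pulled from an article URL and plugged into the comments endpoint mentioned above. The endpoint and the max, order, and rcCache parameters come from the answer; the offset parameter is only the answer's guess and may not be supported, and the function names are hypothetical. The shape of the returned JSON is not assumed here.

```python
import re
from urllib.parse import urlencode

# Endpoint pattern taken from the answer above.
COMMENTS_ENDPOINT = "http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/{}"


def article_id_from_url(article_url):
    """Extract the numeric ID from the 'article-<digits>' part of an article URL."""
    match = re.search(r"/article-(\d+)/", article_url)
    return match.group(1) if match else None


def comments_url(article_id, max_items=1000, offset=0):
    """Build the comments URL for one article."""
    params = {"max": max_items, "order": "desc", "rcCache": "shout"}
    if offset:
        # Assumption: 'offset' is the parameter name guessed in the answer.
        params["offset"] = offset
    return COMMENTS_ENDPOINT.format(article_id) + "?" + urlencode(params)
```

With the article from the question, article_id_from_url gives "5100519", and comments_url("5100519") reproduces the URL quoted in the answer; the result can then be fetched with urllib.request.urlopen and decoded with json.load.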

Scraping data from multiple links within a site

I would like to use scraperwiki and python to build a scraper that will scrape large amounts of information off of different sites. I am wondering if it is possible to point to a single URL and then scrape the data off of each of the links within that site.
For example: A site would contain information about different projects, each within its own individual link. I don't need a list of those links but the actual data contained within them.
The scraper would be looking for the same attributes on each of the links.
Does anyone know how or if I could go about doing this?
Thanks!
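Check out BeautifulSoup with urllib2 (urllib.request in Python 3).
http://www.crummy.com/software/BeautifulSoup/
A (very) rough example link scraper would look like this:
from urllib.request import urlopen  # urllib2 in Python 2
from bs4 import BeautifulSoup
url = "https://example.com"  # placeholder: the page that lists the links
c = urlopen(url)
contents = c.read()
c.close()
soup = BeautifulSoup(contents, "html.parser")
# Collect every <a> tag on the page
links = soup.find_all("a")
Then just write a for loop that repeats this for each of those links and you're set!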
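The "for loop over the links" step might be sketched like this. It is a rough sketch, not a finished scraper: it uses the standard library's HTMLParser instead of BeautifulSoup, the names LinkParser, download, and scrape_linked_pages are made up, and extract stands for whatever function pulls the attributes you want out of one page's HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def download(url):
    """Fetch a URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_linked_pages(index_url, extract, fetch=download):
    """Visit every link on index_url once and apply extract to each page's HTML."""
    parser = LinkParser()
    parser.feed(fetch(index_url))
    results = {}
    for href in parser.hrefs:
        page_url = urljoin(index_url, href)  # resolve relative links
        if page_url not in results:          # skip links that appear twice
            results[page_url] = extract(fetch(page_url))
    return results
```

Passing fetch as a parameter makes it possible to try the loop against canned HTML before pointing it at a real site. A real run would also want to restrict the links to the same site and pause between requests.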
