Scraping youtube to get dynamically loaded content - python

I'm trying to scrape YouTube, but most of the time it just gives back an empty result.
In this code snippet I'm trying to get the list of video titles on the page, but when I run it I get an empty result back. Not even one title shows up.
I've searched around, and some results point out that this is due to the website loading content dynamically with JavaScript. Is this the case? How do I go about solving it? Is it possible to do without Selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)

Is it possible to do without selenium?
Services often provide APIs that make automation easier than scraping the site. YouTube has one, with official client libraries for various languages; for Python there is google-api-python-client. You need an API key to use it. To get started, I suggest following the YouTube Data API quickstart. Note that you can ignore the OAuth 2.0 parts as long as you only need access to public data.
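As a rough illustration of the API route, here is a minimal sketch using google-api-python-client. It assumes you already have an API key and that the channel can still be looked up by its legacy username; the function names are hypothetical, and only the first page of results is fetched.

```python
def titles_from_response(response):
    """Pull video titles out of a playlistItems.list response dict."""
    return [item["snippet"]["title"] for item in response.get("items", [])]


def fetch_upload_titles(api_key, username, max_results=50):
    """Return titles of a channel's most recent uploads (first page only)."""
    # Imported lazily so the helper above works without the client installed
    # (pip install google-api-python-client).
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=api_key)

    # Step 1: find the channel's "uploads" playlist from its legacy username.
    channels = youtube.channels().list(
        part="contentDetails", forUsername=username
    ).execute()
    if not channels.get("items"):
        return []
    uploads_id = channels["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

    # Step 2: list the videos in that playlist and keep their titles.
    playlist = youtube.playlistItems().list(
        part="snippet", playlistId=uploads_id, maxResults=max_results
    ).execute()
    return titles_from_response(playlist)
```

Calling fetch_upload_titles(your_key, "schafer5") should return a plain list of title strings; no OAuth flow is involved because only public data is read.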

I totally agree with @Daweo; that's the right way to get data from a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, change your code as follows:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
titles = [i.text for i in soup.findAll('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.

Related

Cannot find the text I want to scrape in the Page Source

Simple question. Why is it that when I inspect element I see the data I want embedded within the JS tags - but when I go directly to Page Source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'  # requests needs the scheme in the URL
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay uses JavaScript to load content into the page. A workaround for this problem would be to use something like Playwright or Selenium. I personally prefer the first option. It uses a Chromium browser to actually fetch the page contents, and so runs the JavaScript in the process.
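A minimal sketch of the Playwright route, assuming the playwright package and its Chromium build are installed. The function names are hypothetical, and whether the listing description ends up in the returned HTML depends on how the page embeds it (for example, content inside a separate frame is not part of the main document).

```python
def fetch_rendered_html(url, timeout_ms=30000):
    """Load url in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily; needs `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle; it is a
            # heuristic, not a guarantee that every script has finished.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def contains_phrase(html, phrase):
    """Case-insensitive search for phrase, ignoring differences in whitespace."""
    def normalize(text):
        return " ".join(text.split()).lower()

    return normalize(phrase) in normalize(html)
```

The returned HTML can be passed to BeautifulSoup exactly like r.content in the question, and contains_phrase gives a quick check for whether the text you want is present at all.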

Problem while scraping twitter using beautiful soup

I have a problem scraping a heavy website like Facebook or Twitter, which has a lot of HTML tags, using Beautiful Soup and the requests library.
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://twitter.com/elonmusk').text
soup = BeautifulSoup(html_text, 'lxml')
elon_tweet = soup.find_all('span', class_='css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0')
print(elon_tweet)
[Screenshots: the tweet and its corresponding span; the full span]
When the code is executed, it returns an empty list.
I'm new to web scraping, so a detailed explanation would be welcome.
The problem is that Twitter loads its content dynamically. When you make a request, the server first returns only the initial HTML, which you can see by entering "view-source:https://twitter.com/elonmusk" in your browser's address bar.
Later, after the page has loaded, the JavaScript runs and adds the full content of the page.
With requests from Python you can only scrape the content available at "view-source:https://twitter.com/elonmusk", and as you can see, the element you're trying to scrape is not there.
To scrape this element you will need to use Selenium, which lets you drive a browser directly from Python and therefore wait the few extra seconds needed for the whole content to load. You can find a good guide on this here: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-2/
Alternatively, if you don't want all this trouble, you can use an API that supports JavaScript rendering.
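The "wait until the content is there" step can be sketched with Selenium's explicit waits. This assumes Selenium 4 with a local Chrome install; the function names are hypothetical, and the site may still require a login or block automated browsers.

```python
def class_to_selector(tag, class_attr):
    """Turn a tag and a space-separated class string into a CSS selector."""
    return tag + "".join("." + name for name in class_attr.split())


def fetch_after_wait(url, css_selector, timeout_s=15):
    """Open url in Chrome, wait for css_selector to appear, return the HTML."""
    # Imported lazily; needs `pip install selenium` and a local Chrome.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # optional: run without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element exists in the DOM; raises TimeoutException
        # if it has not appeared after timeout_s seconds.
        WebDriverWait(driver, timeout_s).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
```

An explicit wait on a selector is usually more reliable than sleeping for a fixed number of seconds. The returned page_source can then be handed to BeautifulSoup as in the question.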

Trying to use Beautiful Soup to scrape data from website, but it only returns empty lists from nested Divs

I am using Beautiful Soup to try to get data from the Overwatch League schedule website. However, although all the documentation says that bs4 can find nested divs if I have their class, it only returns an empty list.
here is the url: https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1
here is what I am trying to get:
bs = BeautifulSoup(req.text, "html.parser")
matches = bs.find_all("div", class_="schedule-boardstyles__ContainerCards-j4x5cc-8 jcvNlt")
The goal is to eventually loop through the divs in that and scrape the match data from them. However, it's not working and only returns []. Is there something I'm doing wrong?
When a page loads, it often runs scripts to fill in the information.
BeautifulSoup is only a parser and cannot render a page.
You will need something like Selenium to render the page before using BeautifulSoup to find the elements.
It isn't working because requests receives only the initial HTML and does not run the page's scripts, and I don't think there is a way to make it wait. You could try doing it with Selenium.
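To see why a parser alone comes back empty, here is a small self-contained demonstration. The HTML string is a made-up stand-in for what a server might send before any script runs, and the standard-library parser is used in place of BeautifulSoup so the snippet runs without extra installs; the class name "match-card" is hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML: the card only exists as text inside a script.
INITIAL_HTML = """
<html><body>
  <div id="root"></div>
  <script>
    document.getElementById("root").innerHTML =
      '<div class="match-card">Team A vs Team B</div>';
  </script>
</body></html>
"""


class DivFinder(HTMLParser):
    """Count <div> start tags that carry a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.target_class in classes:
            self.count += 1


def count_divs(html, target_class):
    """Parse html (without running scripts) and count matching divs."""
    finder = DivFinder(target_class)
    finder.feed(html)
    finder.close()
    return finder.count


# The script body is treated as plain text, so the card is never seen as a tag.
print(count_divs(INITIAL_HTML, "match-card"))
```

A browser would execute the script and create the div; a parser only reads the markup it was given, which is why the rendering step has to happen first.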

Scrape data from website with frames or flexbox using python requests and BeautifulSoup

I've been trying to figure this out but with no luck. I found a thread (How to scrape data from flexbox element/container with Python and Beautiful Soup) that I thought would help but I can't seem to make any headway.
The site I'm trying to scrape is...http://www.northwest.williams.com/NWP_Portal/. In particular I want to get the data from the tab/frame of 'Storage Levels' but for the life of me I can't seem to navigate to the right spot to get the data. I've tried various iterations of the code below with no success. I've changed 'lxml' to 'html.parser', looked for tables, looked for 'tr' etc but the code always returns empty. I've also tried looking at the network info but when I click on any of the tabs (System Status, PAL/System Balancing etc) I don't see any change in network activity. I'm sure it's something simple that I'm overlooking but I just can't put my finger on it.
from bs4 import BeautifulSoup as soup
import requests
url = 'http://www.northwest.williams.com/NWP_Portal/'
r = requests.get(url)
html = soup(r.content,'lxml')
page = html.findAll('div',{'class':'dailyOperations-panels'})
How can I 'navigate' to the 'Storage Levels' frame/tab? What is the html that I'm actually looking for? Can I do this with just requests and beautiful soup? I'm not opposed to using Selenium but I haven't used it before and would prefer to just use requests and BeautifulSoup if possible.
Thanks in advance!
Hey, so what I notice is that you are trying to get "dailyOperations-panels" from a div, which won't work.

Scraping Google Patents with requests only returns style and scripts tags

I'm trying to scrape Google Patents using the following code.
import requests
from bs4 import BeautifulSoup
url = 'https://patents.google.com/?q=usb'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly
But when I try to inspect the document, using
print(soup.prettify())
I cannot get anything other than this https://pastebin.com/Xu81LdfE .
I checked the requests status and it is returning 200. Where am I going wrong?
The results on that page come from a different URL:
https://patents.google.com/xhr/query?url=q%3Dusb&exp=
So instead of using BeautifulSoup, you could call r.json() on the response from that URL and find what you want in the dictionary it creates.
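A minimal sketch of that idea, assuming the endpoint still behaves as it did when this was written. It is not a documented API, so the URL format and the layout of the returned JSON may change; the function names are hypothetical.

```python
from urllib.parse import urlencode

XHR_ENDPOINT = "https://patents.google.com/xhr/query"


def build_query_url(search_terms):
    """Build the XHR URL for a search such as 'usb'."""
    # The inner query string ("q=usb") is itself URL-encoded and passed
    # as the value of the `url` parameter.
    params = {"url": urlencode({"q": search_terms}), "exp": ""}
    return f"{XHR_ENDPOINT}?{urlencode(params)}"


def fetch_results(search_terms):
    """Request the endpoint and return the decoded JSON."""
    import requests  # third-party; imported lazily (pip install requests)

    response = requests.get(build_query_url(search_terms), timeout=30)
    response.raise_for_status()
    return response.json()
```

Since the structure of the response is not described here, a reasonable first step is to look at the top-level keys with list(fetch_results("usb")) and drill down from there.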
The data is not in the HTML, but loaded with JavaScript.
Therefore, beautifulsoup cannot scrape it.
Consider using the official APIs, as other usage likely violates the Google terms of service, and they will likely block you then.
