Web scraping a YouTube page - Python

I'm trying to get the title of YouTube videos given a link,
but I'm unable to access the element that holds the title. I'm using bs4 to parse the HTML.
I noticed I'm unable to access any element that is within the 'ytd-app' tag on the YouTube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can I get the video title? Is there something I'm doing wrong, or did YouTube intentionally create a tag like this to prevent web scraping?

The class you are using is rendered through JavaScript, and the contents are dynamic, so it is very difficult to find that data using bs4.
What you can do is search the soup manually for a tag that is present in the raw HTML.
You can also try pytube.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
print(soup.find("title").get_text())
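The text of the `<title>` tag usually ends with a " - YouTube" suffix, so you may want to strip it. A minimal sketch of that cleanup, run against a small made-up sample string instead of a live request:

```python
from bs4 import BeautifulSoup

# A made-up stand-in for the HTML returned by requests.get(...).text
sample = "<html><head><title>Some Video Title - YouTube</title></head></html>"

soup = BeautifulSoup(sample, "html.parser")
raw_title = soup.find("title").get_text()

# The page title is usually "<video title> - YouTube"; drop the suffix
video_title = raw_title.removesuffix(" - YouTube")
print(video_title)
```

Note that str.removesuffix() needs Python 3.9+; on older versions, slice the string instead.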

Related

How to get an image tag from a dynamic web page using BeautifulSoup?

Hi, I am trying to get the images on a webpage using requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup as BS
data = requests.get(url, headers=headers).content
soup = BS(data, "html.parser")
for imgtag in soup.find_all("img", class_="slider-img"):
    print(imgtag["src"])
The problem is that while I am getting the webpage in data, it does not contain the image tags. Yet when I go to the webpage in my web browser, the div tag is populated with multiple <img class="slider-img"> tags.
I am new to this, so I do not understand what is going on with that web page. Thanks in advance for the help.
PS - the web page is using Fotorama Slider, and the src attributes contain CDN links, if that matters.
The image tags are created dynamically by JavaScript. You only need the uuids to construct the image URLs, and they are stored within the page:
import re
import requests
from ast import literal_eval
url = "https://fotorama.io/"
img_url = "https://ucarecdn.com/{uuid}/-/stretch/off/-/resize/760x/"
html_doc = requests.get(url).text
uuids = re.search(r"uuids: (\[.*?\])", html_doc, flags=re.S).group(1)
uuids = literal_eval(uuids)
for uuid in uuids:
    print(img_url.format(uuid=uuid))
Prints:
https://ucarecdn.com/05e7ff61-c1d5-4d96-ae79-c381956cca2e/-/stretch/off/-/resize/760x/
https://ucarecdn.com/cd8dfa25-2bc5-4546-995a-f3fd23809e1d/-/stretch/off/-/resize/760x/
https://ucarecdn.com/382a5139-6712-4418-b25e-cc8ba69ab07f/-/stretch/off/-/resize/760x/
https://ucarecdn.com/3ed25902-4a51-4628-a057-1e55fbca7856/-/stretch/off/-/resize/760x/
https://ucarecdn.com/5b0b329d-050e-4143-bc92-7f40cdde46f5/-/stretch/off/-/resize/760x/
https://ucarecdn.com/464f96db-6ae3-4875-ac6a-cbede40c4a51/-/stretch/off/-/resize/760x/
https://ucarecdn.com/4facbe78-b4e8-4b7d-8fb0-d3659f46f1b4/-/stretch/off/-/resize/760x/
https://ucarecdn.com/379c6c28-f726-48a3-b59e-1248e1e30443/-/stretch/off/-/resize/760x/
https://ucarecdn.com/631479df-27a8-4047-ae59-63f9167001f2/-/stretch/off/-/resize/760x/
https://ucarecdn.com/8e1e4402-84f0-4d78-b7d8-c48ec437b5af/-/stretch/off/-/resize/760x/
https://ucarecdn.com/f55e6755-198a-408d-8e82-a50370527aed/-/stretch/off/-/resize/760x/
https://ucarecdn.com/5264c896-cf01-4ad9-9216-114c20a388cc/-/stretch/off/-/resize/760x/
https://ucarecdn.com/c6284eae-9be4-4811-b45b-17a5b6e99ad2/-/stretch/off/-/resize/760x/
https://ucarecdn.com/40ff508f-01e5-4417-bee0-20633efc6147/-/stretch/off/-/resize/760x/
https://ucarecdn.com/eaaee377-f1b5-49d7-a7db-d7a1f86b2805/-/stretch/off/-/resize/760x/
https://ucarecdn.com/584c29c8-b521-48ee-8104-6656d4faac97/-/stretch/off/-/resize/760x/
https://ucarecdn.com/798aa641-01fe-4ed2-886b-bac818c5fdfc/-/stretch/off/-/resize/760x/
https://ucarecdn.com/f82be8f5-d517-4642-8fe1-8987b4e530d0/-/stretch/off/-/resize/760x/
https://ucarecdn.com/23b818d0-07c3-40de-a070-c999c1323ff3/-/stretch/off/-/resize/760x/
https://ucarecdn.com/7ca0e7f6-90eb-4254-82ea-58c77e74f6a0/-/stretch/off/-/resize/760x/
https://ucarecdn.com/42dc8c54-2315-453f-9b40-07e332b8ee39/-/stretch/off/-/resize/760x/
https://ucarecdn.com/8e62227c-5acb-4603-abb9-ac0643b7b478/-/stretch/off/-/resize/760x/
https://ucarecdn.com/80713821-5d54-4819-810a-19991502ca56/-/stretch/off/-/resize/760x/
https://ucarecdn.com/35ce83fa-eac1-4326-83e9-e445450b35ce/-/stretch/off/-/resize/760x/
https://ucarecdn.com/3df9ac37-4e86-49e5-9095-28679ab37718/-/stretch/off/-/resize/760x/
https://ucarecdn.com/9e7211c0-b73b-4b1d-8b47-4b1700f9a80f/-/stretch/off/-/resize/760x/
https://ucarecdn.com/1cc3c44b-e4a9-4e37-96cf-afafeb3eb748/-/stretch/off/-/resize/760x/
https://ucarecdn.com/ab52465c-b3d8-4bf6-986a-a4bf815dfaed/-/stretch/off/-/resize/760x/
https://ucarecdn.com/69e43c1d-9fac-4278-bec5-52291c1b1c2b/-/stretch/off/-/resize/760x/
https://ucarecdn.com/0627c11f-522d-48b9-9f17-9ea05b769aaa/-/stretch/off/-/resize/760x/
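The key trick in the answer above is pulling a JavaScript array literal out of the page source with a regex and parsing it with ast.literal_eval. A self-contained sketch of the same idea on a made-up script fragment:

```python
import re
from ast import literal_eval

# Made-up fragment mimicking the inline JavaScript in the page source
js = """
fotorama({
    uuids: ['05e7ff61-c1d5-4d96-ae79-c381956cca2e',
            'cd8dfa25-2bc5-4546-995a-f3fd23809e1d'],
    stretch: 'off'
});
"""

# re.S lets "." match newlines, so the array can span several lines
uuids = literal_eval(re.search(r"uuids: (\[.*?\])", js, flags=re.S).group(1))
print(uuids)
```

literal_eval works here because a JavaScript array of single-quoted strings happens to also be a valid Python literal; for anything more complex, a JSON-aware approach is safer.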

Exporting Linkedin Learning Video source to .txt file

I am very new to Python, and to programming in general.
I am trying to scrape a LinkedIn Learning web page to locate the full file path for the video on the page.
Ideally, I would like the script to be able to accept a course url and cycle through each video in the course, to pull the video file path from each video page within the course.
From reviewing the source, I found the area I am interested in is as follows:
<div data-vjs-player ...> ... </div>
Within this div, there is a video element. Within this element, is a src callout which contains the video link I am looking for, example as follows:
<video id="vjs_video_3_html5_api" class="vjs-tech" preload="auto" poster="https://media-exp1.licdn.com/dms/image/C4E0DAQEEM3rME8wwFw/learning-public-crop_675_1200/0?e=1595858400&v=beta&t=V5KkqHuGqUTliAMbL7oUBXeEWcrfBDdi4QrZbyGyAWE" src="https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_HD720/614299_00_02_XR15_exfiles.mp4?0pnG4-hMq6_WSlXmJvkGQa6ubLk5EIuE8SG-D0jd9RJOztR5jY8wmlBcsWjHLzBK22z6DydJXGoV8njYeJ_A-dMb6BIZrtkZdUq20t2tD6hxhdNKeWVvik7aOfN3Oyv78_wqePFK1rGmujQnzbCYudW9r0Oyl54EcFQhUqUFnGpkVqHLgQ_Gndo"></video>
I attempted to utilize the following code as a basis, following a BeautifulSoup tutorial to parse the website link for the src callout:
from bs4 import BeautifulSoup
from lxml import html
import requests
URL = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html5lib')
results = soup.find(id="vjs_video_3_html5_api")
print(results.prettify())
However, it is at this point that I have come to a standstill, as I do not understand where to go from here, despite researching this to the best of my ability.
I really would appreciate any help or guidance that you may be able to provide on this.
Thank you all in advance.
When you look at the source HTML (for example, print(soup)), you will see that the class of the <video> tag is different.
You can extract the video url with this example:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = json.loads(soup.video['data-sources'])
print(data[0]['src'])
Prints:
https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_SD/614299_00_03_XR30_aboutpy.mp4?jNwDi0oWUMSPUqh0j6w7yy2IDyBgoGZEeY9Tj2TKVmZmpSMisIoXxG9K1BbRELSP_pM9ySZOFiOq6TzNFvxhEWoGujEGQYT7TfRhuXGwJyGffd5uWTdYBCoc65J-YJuvdg7xijnaDwVjFuUKSAJZxqvYyq8f5nOZrE0Mgckk-1XANfovQ8E
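To see why this works without a browser: the data-sources attribute is just a JSON array embedded in the server-side HTML. A self-contained sketch on a trimmed, made-up version of the tag (the real attribute value is far longer):

```python
import json
from bs4 import BeautifulSoup

# Trimmed, made-up stand-in for the real <video> tag on the page
sample = """<video class="vjs-tech" data-sources='[{"src": "https://files3.lynda.com/secure/example.mp4"}]'></video>"""

soup = BeautifulSoup(sample, "html.parser")

# The attribute value is a JSON array of source objects
data = json.loads(soup.video["data-sources"])
print(data[0]["src"])
```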

Unable to scrape a class

I am currently trying to make a web scraper using python. The objective I have is for my web scraper to find the name and the price of a stock. Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://finance.yahoo.com/quote/MA?p=MA&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, "html.parser")
stock_name = soup.find({ "class" : "D(ib) Fz(18px)"})
print(stock_name)
but when I run it I get this:
C:\Users\baribal\Desktop>py web_scraper.py
None
Thank you in advance!
Your request just gives you the raw HTML of the webpage. The elements you are trying to retrieve are React components that are rendered in the browser after the HTML source is loaded.
You need to use a headless browser such as Selenium instead.

Can't scoop out a twitter link from a webpage

I've created a script in Python to get the link to a player's Twitter account. The problem is that the Twitter link is within an iframe. I can parse it using Selenium; however, I would like to know if there is any alternative to parse the link using the requests module, making use of a script tag or something similar.
website link
If you scroll that site, you can see the Twitter link located in the right-side area of the page.
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://247sports.com/Player/JT-Tuimoloau-46048440/"
def get_links(link):
    res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    twitter = soup.select_one("a.customisable-highlight").get('href')
    print(twitter)

if __name__ == '__main__':
    get_links(link)
I don't know how to actually get the iframe, but maybe there is another way for you to fetch the Twitter name (and create a link to this Twitter account afterwards).
It seems like the information you need is hidden in a div tag with class="tweets-comp". If you extract the value of the attribute data-username, you should end up with the name of the Twitter account:
import requests
from bs4 import BeautifulSoup
link = "https://247sports.com/Player/JT-Tuimoloau-46048440/"
res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"html.parser")
div = soup.find('div', {'class':'tweets-comp'})
print(div['data-username'])
# JT_tuimoloau
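A self-contained version of the same idea, run on a made-up snippet instead of the live page, with the handle turned into a profile link:

```python
from bs4 import BeautifulSoup

# Made-up snippet mirroring the structure described above
sample = '<div class="tweets-comp" data-username="JT_tuimoloau"></div>'

soup = BeautifulSoup(sample, "html.parser")
handle = soup.find("div", {"class": "tweets-comp"})["data-username"]

# Build a profile URL from the extracted handle
print("https://twitter.com/" + handle)
```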

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?
I have this search engine with a list of different URLs.
I want to get only the HTML tag containing a video embed URL, and get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video, or the object, or the exact link of the video?
I need it to put in my iframe. I will integrate the Python with my PHP, so after getting the link of the video and outputting it with Python, I will echo it in my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in urllib.request module like this:
import urllib.request
from bs4 import BeautifulSoup as BS

url = 'https://archive.org/details/20070519_detroit2'
# open and read the page
page = urllib.request.urlopen(url)
html = page.read()
# create a BeautifulSoup parse-able "soup"
soup = BS(html, "html.parser")
# get the src attribute from the video tag
video = soup.find("video").get("src")
Also, with the site you are using, I noticed that to get the embed link you can just change 'details' in the link to 'embed', so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do this for multiple URLs without having to parse each page, do something like this:
url = 'https://archive.org/details/20070519_detroit2'
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print(embed)
EDIT
To get the embed link for the other links you provided in your edit, you need to look through the HTML of the page you are parsing until you find the link, then get the tag it is in, then the attribute.
For
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
the iframe because the link is in the iframe tag, and the .get("src") because the link is the src attribute.
You can try the next one yourself, since you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL; BeautifulSoup is used to parse an HTML page. Retrieve the page first:
import urllib.request

data = urllib.request.urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html, "html.parser")
video = soup.find('video')
src = video['src']
Here is a one-liner to get all the downloadable MP4 files on that page, in case you need it:
import bs4
import urllib.request

url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib.request.urlopen(url), "html.parser")
# tag.get('href', '') avoids a KeyError on <a> tags without an href attribute
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print(links)
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; put them together with the domain and you get the absolute path.
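Joining the relative links to the domain can be done with the standard library's urljoin; a short sketch using one of the paths above:

```python
from urllib.parse import urljoin

base = "https://archive.org"
relative = "/download/20070519_detroit2/20070519_detroit_jungleearth.mp4"

# urljoin resolves a root-relative path against the site's base URL
absolute = urljoin(base, relative)
print(absolute)
```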
