I am very new to Python and to programming in general.
I am trying to scrape a LinkedIn Learning web page to locate the full file path for the video on the page.
Ideally, I would like the script to accept a course URL and cycle through each video in the course, pulling the video file path from each video page.
From reviewing the source, I found the area I am interested in is as follows:
<div data-vjs-player ...> ... </div>
Within this div there is a video element, and within that element is a src attribute which contains the video link I am looking for, for example:
<video id="vjs_video_3_html5_api" class="vjs-tech" preload="auto" poster="https://media-exp1.licdn.com/dms/image/C4E0DAQEEM3rME8wwFw/learning-public-crop_675_1200/0?e=1595858400&v=beta&t=V5KkqHuGqUTliAMbL7oUBXeEWcrfBDdi4QrZbyGyAWE" src="https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_HD720/614299_00_02_XR15_exfiles.mp4?0pnG4-hMq6_WSlXmJvkGQa6ubLk5EIuE8SG-D0jd9RJOztR5jY8wmlBcsWjHLzBK22z6DydJXGoV8njYeJ_A-dMb6BIZrtkZdUq20t2tD6hxhdNKeWVvik7aOfN3Oyv78_wqePFK1rGmujQnzbCYudW9r0Oyl54EcFQhUqUFnGpkVqHLgQ_Gndo"></video>
I attempted to use the following code as a basis, following a BeautifulSoup tutorial, to parse the page for the src attribute:
from bs4 import BeautifulSoup
import requests

URL = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html5lib')
# Look up the <video> element by the id seen in the browser's dev tools
results = soup.find(id="vjs_video_3_html5_api")
print(results.prettify())
However, at this point I have come to a standstill, as I do not understand where to go from here, despite researching this to the best of my ability.
I really would appreciate any help or guidance that you may be able to provide on this.
Thank you all in advance.
When you look at the source HTML (for example with print(soup)), you will see that the class of the <video> tag is different.
You can extract the video URL with this example:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# The video sources are stored as JSON in the data-sources attribute of the <video> tag
data = json.loads(soup.video['data-sources'])
print(data[0]['src'])
Prints:
https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_SD/614299_00_03_XR30_aboutpy.mp4?jNwDi0oWUMSPUqh0j6w7yy2IDyBgoGZEeY9Tj2TKVmZmpSMisIoXxG9K1BbRELSP_pM9ySZOFiOq6TzNFvxhEWoGujEGQYT7TfRhuXGwJyGffd5uWTdYBCoc65J-YJuvdg7xijnaDwVjFuUKSAJZxqvYyq8f5nOZrE0Mgckk-1XANfovQ8E
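If you also want to save the file, the extracted src can be passed straight to requests. A minimal sketch, assuming the signed link from the output above is still valid and reachable without extra authentication (if the course requires a login you would need to send your session cookies too); the filename video.mp4 is just an example:

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
video_url = json.loads(soup.video['data-sources'])[0]['src']

# Stream the file to disk in chunks so a large video is not held in memory.
# Assumes the signed URL works without authentication.
with requests.get(video_url, stream=True) as r:
    r.raise_for_status()
    with open('video.mp4', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)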
I'm trying to get the title of YouTube videos given a link.
But I'm unable to access the element that holds the title. I'm using bs4 to parse the HTML.
I noticed I'm unable to access any element that is within the 'ytd-app' tag on the YouTube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can I get the video title? Is there something I'm doing wrong, or has YouTube intentionally created a tag like this to prevent web scraping?
The class you are using is rendered through JavaScript and the contents are dynamic, so it is very difficult to find that data using bs4.
What you can do instead is inspect the soup manually and find a tag that is present in the static HTML.
You can also try pytube (a sketch follows the snippet below).
import bs4
import requests

listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
# The <title> tag is part of the static HTML, so it is available without JavaScript
print(soup.find("title").get_text())
I want to open a website to download a resume from it, but the following code tries to resolve a local absolute path instead of just the URL:
import webbrowser
from bs4 import BeautifulSoup
soup = BeautifulSoup(webbrowser.open('www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0'), "lxml")
This generates the following error:
gvfs-open: /home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-
Pandit/dee64d1418e20069?sp=0:
error opening location: Error when getting information for file
'/home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-
Pandit/dee64d1418e20069?sp=0': No such file or directory
Clearly it is prepending my home directory and treating the URL as a local file path, which does not exist. What am I doing wrong here? Thanks in advance.
I suppose you are confusing the usage of Beautiful Soup and webbrowser together. webbrowser is not needed to access the page; webbrowser.open just opens the URL in a browser and returns a boolean, not the page content.
From the documentation:
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
Adapting the tutorial example to your task, this prints the resume:
from bs4 import BeautifulSoup
import requests

url = "www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"
r = requests.get("http://" + url)  # requests needs the scheme, so prepend it
data = r.text
soup = BeautifulSoup(data, "html.parser")
print(soup.find("div", {"id": "resume"}))
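If you want to keep the resume rather than just print it, a small follow-up sketch; the id="resume" selector is the one used above, and the output filename resume.html is just an example:

from bs4 import BeautifulSoup
import requests

url = "www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"
r = requests.get("http://" + url)
soup = BeautifulSoup(r.text, "html.parser")

resume = soup.find("div", {"id": "resume"})
if resume is not None:
    # Save the resume markup to a local file instead of printing it
    with open("resume.html", "w", encoding="utf-8") as f:
        f.write(str(resume))
else:
    print("No element with id='resume' found; the page may have changed or require a login.")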
I want to build a small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + Python 3).
The files are on this website (Estonian): Download Page "Laadi alla" = "Download"
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re

url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
# Collect every <a> whose href looks like an .mp3 link
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken HTML and bs4 / the parser is not able to find anything else.
I've tried different parsers, with no change in the result.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest you request the API link, which returns all 200 .mp3 links.
Please follow these steps:
Request the API link, not the HTML page link.
Check the response; it's JSON, so extract the fields you need.
Help your family, any time :)
Solution
import requests, json

myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)

all_mp3 = {}
# Each listing item holds one or more podcasts with their download URLs
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']
all_mp3
all_mp3 is what you need: a dictionary with the download URLs as keys and the mp3 names as values.
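To actually download the episodes from that dictionary, a follow-up sketch, assuming the DownloadUrl links are direct .mp3 files that need no authentication (deriving the filenames from the URLs is an assumption, not something the API guarantees):

import json
import os
import requests

myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
abc = json.loads(requests.get(myurl).text)

# Rebuild the url -> episode name mapping from the answer above
all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']

for mp3_url, header in all_mp3.items():
    # Derive a file name from the URL; fall back to the episode header
    filename = os.path.basename(mp3_url.split('?')[0]) or header + '.mp3'
    print('Downloading', filename)
    with requests.get(mp3_url, stream=True) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)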
I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some other suggestions I found, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
import urllib
import lxml.html
connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
Perhaps it would be useful for you to make use of modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
It generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
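If you need absolute URLs rather than the relative paths shown above, urllib.parse.urljoin can be combined with the same loop. A small sketch; the base URL is the page being scraped:

import bs4
import requests
from urllib.parse import urljoin

base = 'https://yeezysupply.com/pages/all'
res = requests.get(base)
soup = bs4.BeautifulSoup(res.text, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href')          # .get() avoids a KeyError on anchors without href
    if href:
        print(urljoin(base, href))   # e.g. /cart becomes https://yeezysupply.com/cart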
There are no links in the page source; they are inserted using JavaScript after the page is loaded in the browser.
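If the links you want really are inserted by JavaScript, one common workaround is to let a real browser render the page first, for example with Selenium. A rough sketch, assuming Selenium 4 and a local Chrome install (recent Selenium versions download a matching driver automatically):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get('https://yeezysupply.com/pages/all')
    # The anchors only exist after the page's JavaScript has run;
    # a slower page may also need an explicit wait here.
    for a in driver.find_elements(By.TAG_NAME, 'a'):
        href = a.get_attribute('href')
        if href:
            print(href)
finally:
    driver.quit()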
For a project I have to scrape data from different websites, and I'm having a problem with one of them.
When I look at the source code, the things I want are in a table, so it seems easy to scrape. But when I run my script, that part of the source doesn't show up.
Here is my code. I tried different things. At first there weren't any headers; then I added some, but it made no difference.
# import libraries
from bs4 import BeautifulSoup
import requests

# specify the url
quote_page = 'http://www.airpl.org/Pollens/pollinariums-sentinelles'

# query the website, sending a User-Agent header with the request
response = requests.get(quote_page, headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)

# parse the html using beautiful soup and save it to a file
soup = BeautifulSoup(response.text, 'html.parser')
with open('allergene.txt', 'w', encoding='utf-8') as f:
    f.write(str(soup))
What I'm looking for on the website is the content after "Herbacée", whose HTML looks like:
<p class="level1">
    <img src="/static/img/state-0.png" alt="pas d'émission" class="state">
    Herbacee
</p>
Do you have any idea what's wrong?
Thanks for your help and happy new year guys :)
This page uses JavaScript to render the table; the real page that contains the table is:
http://www.alertepollens.org/gardens/garden/1/state/
You can find this URL in Chrome DevTools >> Network.
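A rough sketch of that approach, assuming the alertepollens.org URL above returns plain HTML containing the table; the p.level1 selector is taken from the snippet in the question, so adjust it if the rendered markup differs:

import requests
from bs4 import BeautifulSoup

# The URL found in the Network tab; it serves the table that the main page loads via JavaScript
data_url = 'http://www.alertepollens.org/gardens/garden/1/state/'

response = requests.get(data_url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

# Print each entry, e.g. the lines after "Herbacée"
for entry in soup.find_all('p', class_='level1'):
    print(entry.get_text(strip=True))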