Getting a video URL using Python scripting - python

I am working with Beautiful Soup to extract URLs. I get every href on the page, but I only want specific URLs.
Here is my code:
import requests
from bs4 import BeautifulSoup
page=requests.get("https://www.youtube.com/results?search_query=cooking")
soup = BeautifulSoup(page.content ,'html.parser')
for a_tag in soup.findAll("a"):
    if a_tag.has_attr("href"):
        print(a_tag['href'])
But I only want this:
watch?v=nTe_44ao82w
/watch?v=nTe_44ao82w

A more minimal version of the first answer:
import requests
from bs4 import BeautifulSoup
page=requests.get("https://www.youtube.com/results?search_query=cooking")
soup = BeautifulSoup(page.content ,'html.parser')
for a_tag in soup.findAll("a"):
    # .get() avoids a KeyError for <a> tags that have no href attribute
    if 'watch' in a_tag.get('href', ''):
        print(a_tag['href'])
This checks whether the href attribute contains the string watch.
Hope this helps!

There doesn't really seem to be a good way to differentiate those a tags other than by the URL itself (they don't have any unique classes or anything) so I would probably just check if the URL contains "watch":
...
for a_tag in soup.findAll("a"):
    if a_tag.has_attr("href") and "watch" in a_tag["href"]:
        print(a_tag['href'])
Outputs
/watch?v=cbxe1ANrfDo
/watch?v=nTe_44ao82w
/watch?v=v1wIThmCams
/watch?v=FTociictyyE
/watch?v=dw2QHkAtB_Y
/watch?v=ej9UHVwlQqk
/watch?v=KGAj8IhnR3c
/watch?v=G8A73R_gZdM
/watch?v=XPQW_2YOmjY
/watch?v=J0pS2lhH0Vc
/watch?v=5aU5qrbCsF4
/watch?v=kvAJ_mc9NXs
/watch?v=kKiYVLIk_9s
/watch?v=G2jYIGdmC6I
/watch?v=jMW5ZDQviOA
/watch?v=iTmcGy9CWhE
/watch?v=66Ck_5SePZg
/watch?v=lyD9t3uhHio
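
A substring test also matches any other link that happens to contain "watch". As a minimal sketch of a slightly stricter variant, a CSS attribute selector can match only hrefs that start with /watch. The SAMPLE_HTML string below is a made-up stand-in for the downloaded page, not real YouTube markup, and the live results page may no longer contain these links in its static HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded results page.
SAMPLE_HTML = """
<a href="/">Home</a>
<a href="/watch?v=nTe_44ao82w">Video one</a>
<a>No href here</a>
<a href="/results?search_query=watch+repair">A search link containing the word watch</a>
<a href="/watch?v=cbxe1ANrfDo">Video two</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# a[href^="/watch"] selects <a> tags whose href starts with /watch,
# so tags without an href and unrelated links are skipped.
video_links = [a["href"] for a in soup.select('a[href^="/watch"]')]
print(video_links)
```

The selector skips the search link, which a plain "watch" in href check would have printed.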

Related

How can I get href in this a tag?

I want to get the href (https://www.dcard.tw/forum/popular)
from https://www.dcard.tw/.
This href is in the tag <a href="/forum/popular">.
My code:
from bs4 import BeautifulSoup
import requests
url = "https://www.dcard.tw/"
soup = BeautifulSoup(requests.get(url).text,'lxml')
for link in soup.find_all('a'):
    print(link.get('href'))
You can try the examples from this thread; it has several working options:
How can I get href links from HTML using Python?
Regards,
Alex
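
Since the page stores the link as the relative path /forum/popular, one way to get the full URL is to join each href with the base URL. This is only a sketch: SAMPLE_HTML is a hypothetical snippet rather than the real dcard.tw markup, and the live site may build its links with JavaScript.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.dcard.tw/"

# Hypothetical snippet standing in for the downloaded page.
SAMPLE_HTML = """
<a href="/forum/popular">Popular</a>
<a href="https://example.com/x">External</a>
<a name="top">Anchor without href</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute;
# urljoin resolves relative paths and leaves absolute URLs unchanged.
absolute_links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]
print(absolute_links)
```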

Web scraping IMDB with Python's Beautiful Soup

I am trying to parse this page "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1", but I can't find the href that I need (href="/title/tt0068112/episodes?ref_=tt_eps_sm").
I tried with this code:
import requests
from bs4 import BeautifulSoup
url="https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
page=requests.get(url)
soup=BeautifulSoup(page.content,"html.parser")
for a in soup.find_all('a'):
    print(a['href'])
What's wrong with this? I also tried to check "manually" with print(soup.prettify()) but it seems that that link is hidden or something like that.
You can get the page HTML with requests; the href is in there, so there is no need for special APIs. I tried this and it worked:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1")
soup = BeautifulSoup(page.content, "html.parser")
scooby_link = ""
for item in soup.findAll("a", href="/title/tt0068112/episodes?ref_=tt_eps_sm"):
    print(item["href"])
    scooby_link = "https://www.imdb.com" + "/title/tt0068112/episodes?ref_=tt_eps_sm"
print(scooby_link)
I'm assuming you also wanted to save the link to a variable for further scraping so I did that as well. 🙂
To get the Episodes link, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one("a:-soup-contains(Episodes)")["href"])
Prints:
/title/tt0068112/episodes?ref_=tt_eps_sm
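
For reference, here is the same selector run against a small inline snippet, so it can be tried without hitting IMDb. It is a sketch under a few assumptions: SAMPLE_HTML is made up, the :-soup-contains pseudo-class needs a reasonably recent soupsieve (2.1 or later, as far as I know), and select_one returns None when nothing matches, so the result is checked before indexing.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up snippet; the real IMDb page is much larger and may change.
SAMPLE_HTML = """
<a href="/title/tt0068112/fullcredits">Cast</a>
<a href="/title/tt0068112/episodes?ref_=tt_eps_sm">Episodes <span>25</span></a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match the first <a> whose text contains the word "Episodes".
link = soup.select_one('a:-soup-contains("Episodes")')

# Guard against None, then turn the relative href into a full URL.
episodes_url = urljoin("https://www.imdb.com", link["href"]) if link else None
print(episodes_url)
```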

BeautifulSoup class searching, no results

I'm using BeautifulSoup to parse the HTML of this site and extract the URLs of the results. But when I call find_all, I get an empty list as output. I manually checked the HTML that I download from the site, and it contains the appropriate class.
If somebody could point out where I make a mistake or show a better solution I would be grateful!
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', class_ = 'search-item photo')
I've also tried the code below to find all links on the site and then filter out what I need, but in this case I only get the parent tag. If an 'a' tag is nested inside another 'a' tag, it is skipped, although from the documentation I expected it to be included in the output.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('a')
I found an answer to a similar question ("BeautifulSoup can't find class that exists on webpage?"), but in my case I can see the HTML that I want to find in my console when I use print(soup.prettify()).
The problem you are facing is linked to the way you are parsing page.content.
Replace:
soup = BeautifulSoup(page.content, 'html.parser')
with:
soup = BeautifulSoup(page.content, 'lxml')
Hope this helps.
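
Note that lxml is a separate install. A minimal sketch of trying it first and falling back to the built-in parser could look like the following; make_soup is a hypothetical helper and SAMPLE_HTML is a made-up snippet with invented paths, not the real awf.edu.pl markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up snippet; the class names are taken from the question.
SAMPLE_HTML = """
<div class="search-item photo"><a href="/pracownik/a">A</a></div>
<div class="search-item photo"><a href="/pracownik/b">B</a></div>
<div class="search-item">No photo</div>
"""


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup(SAMPLE_HTML)

# A class_ string with a space matches the exact class attribute value.
results = soup.find_all("div", class_="search-item photo")
links = [div.a["href"] for div in results]
print(parser_used, links)
```

Printing parser_used makes it easy to compare what each parser finds on the real page.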

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class='ff_line' id='gi_344258949_1'> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
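# attrs must be a dict; find() returns a single tag, or None if nothing matches
div = soup.find('div', attrs={'class': 'seq gbff'})
if div is not None:
    for each in div.children:
        print(each)
print(soup.find_all('span', attrs={'class': 'ff_line'}))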
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
Using DevTools in Chrome/Firefox, I found this URL, which returns all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this URL in the HTML, because different pages will use different arguments in the URL. Alternatively, compare a few URLs and work out the pattern so you can generate the URL yourself.
EDIT: if you change retmode=html to retmode=xml in the URL, you get the data as XML. With retmode=text you get plain text without HTML tags. retmode=json doesn't work.
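
Switching retmode can be done without editing the string by hand. The sketch below only rewrites the query string of the URL quoted above and makes no request; with_retmode is a hypothetical helper name. Note that urlencode percent-encodes the $ in log$, which the server should decode back to the same key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The viewer URL found with DevTools, as quoted in the answer.
VIEWER_URL = (
    "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
    "?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0"
    "&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000"
)


def with_retmode(url, retmode):
    """Hypothetical helper: return url with its retmode query parameter replaced."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["retmode"] = retmode
    return urlunsplit(parts._replace(query=urlencode(params)))


text_url = with_retmode(VIEWER_URL, "text")
print(text_url)
```

The resulting URL can then be passed to requests.get(text_url).text to fetch the sequence as plain text.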

beautiful soup HREF not found when capitalized

Hi, I have something along the lines of:
from BeautifulSoup import BeautifulSoup as bs
import urllib2
url = 'http://www.blah.com'
soup = bs(urllib2.urlopen(url))
for link in soup.findAll('a', href=True):
    print link
So the problem is that the website uses both href and HREF (capitalized) for its links. This script only pulls the lowercase href. How would I modify the code to also get the links with HREF?
Thanks
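
For comparison, here is a sketch with the newer bs4 package on Python 3, which is not what the snippet above uses (that is BeautifulSoup 3 on Python 2, whose behaviour I have not checked). Python's built-in html.parser lowercases attribute names while parsing, so href=True matches both spellings. SAMPLE_HTML is a made-up snippet.

```python
from bs4 import BeautifulSoup

# Made-up snippet mixing lowercase and uppercase attribute names.
SAMPLE_HTML = """
<a href="/lower">lowercase attribute</a>
<a HREF="/UPPER">uppercase attribute</a>
<a name="x">no link</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute names are normalised to lowercase by the parser;
# attribute values keep their original case.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```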
