I'm a complete beginner with web scraping and programming in Python. The answer might already be somewhere on the forum, but I'm so new that I don't really know what to look for. So I hope you can help me:
Last week I completed a three-day course in web scraping with Python, and at the moment I'm trying to brush up on what I've learned so far.
I'm trying to scrape a specific link from a website, so that later on I can create a loop that extracts all the other links. But I can't seem to extract any link, even though they are visible in the HTML code.
Here is the website (Danish)
Here is the link I'm trying to extract
The link I'm trying to extract is located in this HTML code:
<a class="nav-action-arrow-underlined" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/" aria-label="Læs mere om Regionen tilbød ikke">Læs mere</a>
Here is the Python code that I've tried so far:
import requests
from bs4 import BeautifulSoup

url = "https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a")
len(a_tags)
# there are 34
I've then tried going through all the "a" tags from 0-33 without finding the link.
If I print a_tags[26], I get this code:
<a aria-current="page" class="nav-action is-current" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/">Afgørelser fra Styrelsen for Patientklager</a>
Which is somewhere at the top of the website. But the next tag, a_tags[27], is a code at the bottom of the site:
<a class="footer-linkedin" href="https://www.linkedin.com/company/styrelsen-for-patientklager/" rel="noopener" target="_blank" title="https://www.linkedin.com/company/styrelsen-for-patientklager/"><span class="sr-only">Linkedin profil</span></a>
Can anyone help me by telling me how to access the specific part of the HTML code that contains the link?
When I find out how to pull out the link, my plan is to write the following:
path = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"
full_url = f"https://stpk.dk{path}"
print(full_url)
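As a side note on that last step, urllib.parse.urljoin from the standard library is a slightly safer way to combine the base URL and the path than an f-string, since it normalizes the slash between them:

```python
from urllib.parse import urljoin

base_url = "https://stpk.dk"
path = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"

# urljoin handles the slash between base and path for you.
full_url = urljoin(base_url, path)
print(full_url)
```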
You will not find what you are looking for, because requests does not render websites the way a browser does - but no worries, there is an alternative.
The content is dynamically loaded via an API, so you should call that directly, and you will get JSON that contains the displayed information.
To find such information, take a closer look at your browser's developer tools and check the tab for XHR requests. It may take a minute to read and follow the topic:
https://developer.mozilla.org/en-US/docs/Glossary/XHR_(XMLHttpRequest)
Simply iterate over the items, extract the url value and prepend the base_url.
Check and manipulate the following parameters to your needs:
containerKey: a76f4a50-6106-4128-bc09-a1da7695902b
query:
year:
category:
legalTheme:
specialty:
profession:
treatmentPlace:
critiqueType:
take: 200
skip: 0
Example
import requests

url = 'https://stpk.dk/api/verdicts/settlements/?containerKey=a76f4a50-6106-4128-bc09-a1da7695902b&query=&year=&category=&legalTheme=&specialty=&profession=&treatmentPlace=&critiqueType=&take=200&skip=0'
base_url = 'https://stpk.dk'

for e in requests.get(url).json()['items']:
    print(base_url + e['url'])
Output
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp107/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp106/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp105/
...
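If there are more results than one request can hold, the skip parameter can be used for paging. Here is a sketch of building those request URLs with urllib.parse.urlencode (the parameter names are taken from the answer above; whether the API accepts larger skip values, and what its maximum take is, are assumptions you would need to verify):

```python
from urllib.parse import urlencode

base = 'https://stpk.dk/api/verdicts/settlements/'
params = {
    'containerKey': 'a76f4a50-6106-4128-bc09-a1da7695902b',
    'query': '', 'year': '', 'category': '', 'legalTheme': '',
    'specialty': '', 'profession': '', 'treatmentPlace': '',
    'critiqueType': '', 'take': 200, 'skip': 0,
}

# Page through the results by advancing skip in steps of take.
urls = []
for page in range(3):
    params['skip'] = page * params['take']
    urls.append(f"{base}?{urlencode(params)}")

for u in urls:
    print(u)
```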
I am new to scraping and coding as well. So far I am able to scrape data with Beautiful Soup using the code below:
sub_soup = BeautifulSoup(sub_page, 'html.parser')
content = sub_soup.find('div',class_='detail-view-content')
print(content)
This works correctly when the tag and class are in this format:
<div class="masthead-card masthead-hover">
But it fails when the format includes an _ngcontent attribute:
<span _ngcontent-ixr-c5="" class="btn-trailer-text">
or
<div _ngcontent-wak-c4="" class="col-md-6">
An example screenshot of the _ngcontent webpage I am trying to scrape is below:
Everything I tried results in blank output or None. What am I missing?
BeautifulSoup only sees the HTML that requests downloads, and on that page the content is rendered afterwards by JavaScript (the _ngcontent attributes come from Angular), so it is missing from the raw source. You should use the Selenium library with ChromeDriver to let the page render first, then parse driver.page_source with BeautifulSoup.
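One thing worth knowing: the _ngcontent attribute itself does not break BeautifulSoup. On static HTML the same kind of tag is found normally, which shows the real problem is the not-yet-rendered content, not your selector. A quick check:

```python
from bs4 import BeautifulSoup

# A tag copied from the question: to the parser, _ngcontent-wak-c4 is
# just an ordinary attribute and does not affect matching by class.
html = '<div _ngcontent-wak-c4="" class="col-md-6">hello</div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="col-md-6")
print(tag.text)
```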
Recently I tackled an unusual element that's not trivial to scrape. Could you please suggest how to retrieve the href?
I am scraping some Tripadvisor restaurants with Python Scrapy and need to retrieve the Google Maps link (the href attribute) from the location and contacts section.
The webpage for example (link)
The code of the element:
<a data-encoded-url="S0k3X2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfeVBw" class="_2wKz--mA _27M8V6YV" target="_blank" href="https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
I have tried the following XPath, but got None as a response every time, or couldn't get data from the href attribute, as if it doesn't exist.
response.xpath("//a[contains(@class, '_2wKz--mA _27M8V6YV')]").getall()
The output:
['<a data-encoded-url="Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z" class="_2wKz--mA _27M8V6YV" target="_blank"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>',
'Website']
Use the data-encoded-url that you already got and decode it using Base64. Example:
>>> import base64
>>> base64.b64decode("Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z").decode("utf-8")
'gzK_https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421_2Ms'
You can then remove the gzK_ prefix and _2Ms suffix and you will have your URL.
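A small helper that combines both steps (the fixed 4-character prefix and suffix lengths are an assumption based on this one example; verify them against other links):

```python
import base64

def decode_map_url(encoded):
    """Decode a data-encoded-url value into the plain URL.

    Assumes a 4-character random prefix and a 4-character random
    suffix around the real URL, as in the example above.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    return decoded[4:-4]

encoded = ("Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRy"
           "PVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUy"
           "LjQyODgxOCwxMy4xODI0MjFfMk1z")
print(decode_map_url(encoded))
```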
You can use a more specific XPath query, like "//a[contains(@class, 'foobar')]/@href", to retrieve a specific attribute of the element.
I checked out say:
https://www.calix.com/search-results.html?searchKeyword=C7
And if I inspect element on the first link I get this:
<a class="title viewDoc" href="https://www.calix.com/content/dam/calix/mycalix-misc/ed-svcs/learning_paths/C7_lp.pdf" data-preview="/session/4e14b237-f19b-47dd-9bb5-d34cc4c4ce01/" data-preview-count="1" target="_blank"><i class="fa fa-file-pdf-o grn"></i><b>C7</b> Learning Path</a>
I coded:
import requests, bs4

res = requests.get('https://www.calix.com/search-results.html?searchKeyword=C7',
                   headers={'User-Agent': 'test'})
print(res)
# res.raise_for_status()
bs_obj = bs4.BeautifulSoup(res.text, "html.parser")
elems = bs_obj.findAll('a', attrs={"class": "title viewDoc"})
print(elems)
And there was [] as output (empty list).
So, I thought about actually looking through the "view-source" for the page.
view-source:https://www.calix.com/search-results.html?searchKeyword=C7
If you search through the view-source, you will not find the code from the "inspect element" view I mentioned earlier.
There is no <a class="title viewDoc"> anywhere in the view-source of the page.
That is probably why my code isn't returning anything.
Then I went to www.nba.com and inspected a link:
<a class="content_list--item clearfix" href="/article/2018/07/07/demarcus-cousins-discusses-stacked-golden-state-warriors-roster"><h5 class="content_list--title">Cousins on Warriors' potential: 'Scary'</h5><time class="content_list--time">in 5 hours</time></a>
The content of "inspect" for this link was in the "view-source" of the page.
And, obviously my code was working for this page.
I have seen a few other examples of the first issue.
I'm just curious why the HTML differs between the two pages, or am I missing something?
I am currently going through the Web Scraping section of AutomateTheBoringStuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.
I inspected the html content of a page where 'Explanation' is the translated word:
<span id="result_box" class="short_text" lang="en">
<span class>Explanation</span>
</span>
Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:
soup.select('span[id="result_box"] > span')
soup.select('span span')
I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.
Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4, but I think I am using it more or less correctly, because the selector
soup.select('span[id="result_box"]')
gets me the outer span element:
[<span class="short_text" id="result_box"></span>]
(Not sure why the lang="en" part is missing, but I am fairly certain I have located the correct element regardless.)
Here is the complete code:
import bs4, requests
url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)
EDIT: If I save the Google Translate page as an offline HTML file and then make a soup object out of that file, there is no problem locating the element.
import bs4
file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)
The result_box element is the correct one, but your code only works when you save the page from your browser, because that saved copy includes the dynamically generated content; with requests you get only the raw source, without any dynamically generated content. The translation is generated by an AJAX call to the URL below:
"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"
For your request, it returns:
[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
So you will either have to mimic that request, passing all the necessary parameters, or use something that supports dynamic content, like Selenium.
Simply try this:
translation = soup.select('#result_box span')[0].text
print(translation)
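For reference, the selector plus .text does work once the rendered HTML is actually present (for example in the saved offline page). Here is a quick check against the static snippet from the question:

```python
import bs4

# The rendered HTML from the question, as saved from the browser.
html = '''<span id="result_box" class="short_text" lang="en">
<span class>Explanation</span>
</span>'''
soup = bs4.BeautifulSoup(html, "html.parser")
translation = soup.select('#result_box span')[0].text
print(translation)
```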
You can try this different approach:
if filename.endswith(extension_file):
    with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
        soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
        for title in soup.findAll('title'):
            recursively_translate(title)
For the complete code, please see here:
https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html
or here:
https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html