Extracting particular text - python

I am trying to extract all links to videos on a particular WordPress website. Each page has only one video.
Inside each page crawled, there is the following code:
<p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
<iframe id="" voo-auto-adj="true" name="vooplayerframe" style="max-width:100%" allowtransparency="true" allowfullscreen="true" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1" frameborder="0" scrolling="no" width="660" height="410" >
</iframe></p>
I would like to extract the src value of this iframe.
Google Chrome Inspector tells me that this can be addressed as:
XPath: //*[@id="post-255"]/div/p/iframe
Selector: #post-255 > div > p > iframe
But each webpage I am crawling has a different "post" number. They are quite random, hence I cannot easily use the aforementioned selectors.

If there is a dynamic part inside the id attribute, you can address it by partial-matching:
[id^=post] > div > p > iframe
where ^= means "starts with".
XPath alternative:
//*[starts-with(@id, "post")]/div/p/iframe
See also if you can avoid checking for div and p intermediate elements altogether and do:
[id^=post] iframe
//*[starts-with(@id, "post")]//iframe
You may additionally check for the iframe name as well:
[id^=post] iframe[name=vooplayerframe]
//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]
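If the crawler fetches raw HTML without a browser, the same idea (an element whose id starts with "post", and somewhere below it an iframe named vooplayerframe) can be sketched with the standard library alone. This is a stand-in for the selectors above, not a CSS or XPath engine. The names PostIframeParser and extract_video_sources are hypothetical, and the <article> wrapper in the sample is an assumption, since the question only shows that the container id looks like post-255.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PostIframeParser(HTMLParser):
    """Collect src values of vooplayer iframes nested in an element whose id starts with "post"."""

    def __init__(self):
        super().__init__()
        self.container_tag = None  # tag name of the enclosing "post..." element
        self.depth = 0             # nesting depth of same-named tags inside it
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.container_tag is None:
            # Rough equivalent of [id^=post]: the id attribute starts with "post"
            if (attrs.get("id") or "").startswith("post"):
                self.container_tag = tag
                self.depth = 1
            return
        if tag == self.container_tag:
            self.depth += 1
        # Rough equivalent of iframe[name=vooplayerframe]
        if tag == "iframe" and attrs.get("name") == "vooplayerframe" and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == self.container_tag:
            self.depth -= 1
            if self.depth == 0:
                self.container_tag = None


def extract_video_sources(html, base_url=None):
    """Return the matching iframe src values, made absolute if base_url is given."""
    parser = PostIframeParser()
    parser.feed(html)
    parser.close()
    if base_url is None:
        return parser.sources
    return [urljoin(base_url, src) for src in parser.sources]


# Sample modelled on the markup in the question; the <article> wrapper is assumed.
SAMPLE_HTML = """
<article id="post-255"><div><p>
<iframe name="vooplayerframe" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1"></iframe>
</p></div></article>
"""

print(extract_video_sources(SAMPLE_HTML))
```

Passing the page URL as base_url turns the protocol-relative src (starting with //) into an absolute link.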

Related

Get text from div using Selenium and Python

Situation
I'm using Selenium and Python to extract info from a page
Here is the div I want to extract from:
I want to extract the "Registre-se" and the "Login" text.
My code
from selenium import webdriver

url = 'https://www.bet365.com/#/AVR/B146/R^1'
driver = webdriver.Chrome()
driver.get(url)

# Print the text of every element carrying the "Join" class
elements = driver.find_elements_by_class_name('hm-MainHeaderRHSLoggedOutNarrow_Join')
for e in elements:
    print(e.text)

# Same for the "Login" class
elements = driver.find_elements_by_class_name('hm-MainHeaderRHSLoggedOutNarrow_Login')
for e in elements:
    print(e.text)
Problem
My code doesn't produce any output.
HTML
<div class="hm-MainHeaderRHSLoggedOutNarrow_Join ">Registre-se</div>
<div class="hm-MainHeaderRHSLoggedOutNarrow_Login " style="">Login</div>
Looking at this HTML
<div class="hm-MainHeaderRHSLoggedOutNarrow_Join ">Registre-se</div>
<div class="hm-MainHeaderRHSLoggedOutNarrow_Login " style="">Login</div>
your code looks fine to me, except that you are using find_elements where each class matches a single web element.
Judging by this comment:
The class name "hm-MainHeaderRHSLoggedOutMed_Login " only appear in
the inspect of the website, but not in the page source. What it's
supposed to do now?
This suggests that the element is inside either an iframe or a shadow root, because page_source does not include the contents of iframes.
Check whether it is inside an iframe. If it is, you have to switch to that iframe first, and then the code you already have should work.
Switch to it like this:
driver.switch_to.frame(driver.find_element_by_xpath('xpath here'))
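When it is unclear which frame holds the element, one option is to try the top-level document first and then each iframe in turn. The sketch below assumes a Selenium WebDriver instance is passed in. find_in_frames is a hypothetical helper, not part of Selenium, and the locator strategy is given as the plain strings that Selenium's By constants stand for ("class name", "tag name"). It looks only one level deep and does not handle shadow roots.

```python
def find_in_frames(driver, by, value):
    """Return matching elements from the top document or the first iframe that has any.

    The driver is left switched into the frame where the match was found,
    so the returned elements stay usable.
    """
    driver.switch_to.default_content()
    found = driver.find_elements(by, value)
    if found:
        return found

    # Frame elements belong to the top document, so they stay valid
    # each time we switch back to it.
    frames = driver.find_elements("tag name", "iframe")
    for frame in frames:
        driver.switch_to.frame(frame)
        found = driver.find_elements(by, value)
        if found:
            return found
        driver.switch_to.default_content()
    return []
```

With the page from the question this might be called as find_in_frames(driver, "class name", "hm-MainHeaderRHSLoggedOutNarrow_Login"). An empty result would point to a shadow root, a nested frame, or content that has not been rendered yet.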

Cannot find href on a page

I am trying to find the url for the trailer video from this page. https://www.binged.com/streaming-premiere-dates/black-monday/.
I checked the various properties of the div with class="wordkeeper-video", but I cannot find it. Can someone help?
Go ahead and play it. Then something like this appears. The link is in the src attribute:
<iframe frameborder="0" allowfullscreen="" allow="autoplay" src="https://www.youtube.com/embed/pzxGR6Q-7Mc?rel=0&showinfo=0&autoplay=1"></iframe>
PS: It is in div class="wordkeeper-video"
The video URL is not present there initially.
You first need to click the play button (actually the image); after that, the URL appears on the iframe inside that div.
The iframe is .wordkeeper-video iframe
So you have to locate that iframe element and read its src attribute. There is no need to switch into the frame for that.
The full URL isn't there but all you need to build it is.
<div class="wordkeeper-video " data-type="youtube" data-embed="pzxGR6Q-7Mc" ...>
The data-embed attribute has what you need.
The URL is
https://www.youtube.com/watch?v=pzxGR6Q-7Mc
where pzxGR6Q-7Mc is the data-embed value.
You can get this by using
data_embed = driver.find_element_by_css_selector(".wordkeeper-video").get_attribute("data-embed")
video_url = "https://www.youtube.com/watch?v=" + data_embed
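If you instead read the iframe's src after clicking play, the embed URL can be turned into the same watch URL. A minimal sketch, assuming the src always has the /embed/<id> shape shown in this thread; embed_to_watch_url is a hypothetical helper name.

```python
from urllib.parse import urlparse


def embed_to_watch_url(embed_src):
    """Convert a YouTube /embed/<id> URL into the matching watch URL."""
    # The video id is the last path segment; the query string is dropped.
    video_id = urlparse(embed_src).path.rstrip("/").rsplit("/", 1)[-1]
    return "https://www.youtube.com/watch?v=" + video_id


print(embed_to_watch_url(
    "https://www.youtube.com/embed/pzxGR6Q-7Mc?rel=0&showinfo=0&autoplay=1"
))
```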

python selenium find element by tag inside a div (has no name, class, id or text)

I'm trying to get an iframe's src using Selenium, but I can't find how to get the iframe inside a div (the div has an id). Here is what I have:
<div role="tabpanel" class="tab-pane active" id="video_box" style="background: #000; color: #fff;">
<iframe width="560" height="315" frameborder="0" src="https://embedsito.com/v/x431gb5q0rq76j-" scrolling="no" allowfullscreen=""></iframe>
</div>
As you can see, I can get the div by id using driver.find_element_by_id("video_box"), but what I need is the src from inside the div, and it has no class name, no id, no name and no text (well, it actually has text from the src but it changes every time). Is there any way of getting the iframe from inside the div and extract the src from it?
I've tried the find_by_css_selector('iframe') method, with no results.
I've also tried the find_element_by_tag_name('iframe') method, but it raises an exception:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: iframe
Have you tried to go into the developer tools (inspect element) and get the XPath of the iframe by right-clicking on it?
Then you could use the find_element_by_xpath() method with that XPath.
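A NoSuchElementException on markup that is visible in the inspector often means the lookup ran before the element was rendered, or that the element lives in another frame. Assuming the first cause, the sketch below polls for an iframe nested in #video_box and returns its src. wait_for_iframe_src is a hypothetical helper; Selenium's own WebDriverWait is the usual tool for this, and plain polling is used here only to keep the example free of imports beyond the standard library. The "css selector" string is the value behind By.CSS_SELECTOR.

```python
import time


def wait_for_iframe_src(driver, container_id="video_box", timeout=10.0, interval=0.5):
    """Poll until an iframe appears inside the container, then return its src.

    Returns None if no iframe shows up before the timeout expires.
    """
    selector = f"#{container_id} iframe"
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list instead of raising when nothing matches
        frames = driver.find_elements("css selector", selector)
        if frames:
            return frames[0].get_attribute("src")
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

If this still returns None, the iframe may itself sit inside another frame, in which case switching frames first would be the next thing to try.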

How do I get a unique css selector of a BeautifulSoup object?

I want to get a unique css selector path for an element in an HTML document.
I'm using BeautifulSoup but cannot figure out how to really get a unique css selector like you would using Chrome dev tool.
Say, you're trying to get a unique css selector of an element in Google's page - specifically, Gmail button on the top right. Using Chrome's dev tool, you can easily use 'copy selector' and get: #gbw > div > div > div.gb_9d.gb_i.gb_yg.gb_pg > div:nth-child(1) > a
I'm trying to do the same thing without knowing any prior information about the website structure. i.e, from a single BeautifulSoup element.
I tried to get a unique selector by listing all parents of the element but that is NOT unique. I need something more like class & id to really make it unique
How do I pull this off?
(Of course it works in very simple cases like the one I gave you below but css selector path only using the tag names is always in danger of not being unique.)
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
html_bs = BeautifulSoup(html, 'html.parser')
gmail = html_bs.find_all(text='Gmail')[0]
print(type(gmail)) # >> bs4.element.NavigableString
# How do I get gmail's unique css path?
# I tried to use parents but this is generally not unique.
# div > div > nobr > a might NOT be unique in some cases.
for parent in gmail.parents:
    print(parent.name)
# >>
'''
a
nobr
div
div
body
html
[document]
'''
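One way to make such a path unique is to add a position to every step and to stop at the nearest ancestor that has an id. The sketch below is an assumption about what would be acceptable here, not a reproduction of Chrome's "copy selector" algorithm. css_path is a hypothetical helper that relies only on the BeautifulSoup Tag attributes name and parent and the methods get() and find_previous_siblings().

```python
def css_path(element):
    """Build a CSS selector for a BeautifulSoup Tag by walking up its parents.

    Each step is tag:nth-of-type(n); the walk stops early at the first
    ancestor with an id, because ids are meant to be unique in a document.
    """
    parts = []
    node = element
    # The BeautifulSoup root object reports its name as "[document]"
    while node is not None and node.name != "[document]":
        node_id = node.get("id")
        if node_id:
            parts.append(f"#{node_id}")
            break
        # Position among siblings that share the same tag name (1-based)
        position = len(node.find_previous_siblings(node.name)) + 1
        parts.append(f"{node.name}:nth-of-type({position})")
        node = node.parent
    return " > ".join(reversed(parts))
```

Since gmail in the question is a NavigableString rather than a Tag, the call would be css_path(gmail.parent). Ids containing characters that are special in CSS would need escaping, which this sketch does not attempt.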

How do I find HTML Elements not in the page source using Selenium?

So I'm trying to find this <ul> tag I found using inspect element on chrome:
<ul class = "jobs-search-results__list artdeco-list" itemtype="http://schema.org/ItemList"></ul>
This is what I tried in Python:
ul = driver.find_element_by_class_name("jobs-search-results__list artdeco-list")
Which should return the <ul> tag.
Instead I get this error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:{"method":"class","selector":"jobs-search-results__list artdeco-list"}
I get the same error whether I use a tag/xpath/absolutepath selector.
Then I found out that this element is not in the HTML page source, so Selenium can't find it.
HTML Source (pastebin)
How do I go about finding this element if it's not in the page source?
The class of the ul element you are trying to get changes when the site is accessed through Selenium. Use this XPath instead:
//ul[contains(@class,'jobs-search__results')]
Now you can find the ul element with
ul = driver.find_element_by_xpath("//ul[contains(@class,'jobs-search__results')]")
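Separately, the lookup in the question passes two class names at once to find_element_by_class_name. That locator is meant for a single class name, so a value containing a space is not a reliable way to match an element that carries both classes. A CSS selector that chains the classes is one alternative. The helper below is hypothetical and only builds the selector string; it does not escape characters that are special in CSS.

```python
def class_attr_to_css(tag, class_attr):
    """Turn a tag name and a class attribute value into a chained CSS selector."""
    # split() with no argument also discards leading and trailing whitespace
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


print(class_attr_to_css("ul", "jobs-search-results__list artdeco-list"))
```

The resulting string can be used with a CSS selector lookup. It only helps if the element really has those classes at the time of the lookup, which the answer above suggests is not the case under Selenium.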
