Python - Selenium - webscrape xmlns table

<html xmlns="http://www.w3.org/1999/xhtml">
<head>_</head>
<body>
<form name="Main Form" method="post" action="HTMLReport.aspx?ReportName=...">
<div id="Whole">
<div id="ReportHolder">
<table xmlns:msxsl="urn:schemas-microsoft-com:xslt" width="100%">
<tbody>
<tr>
<td>_</td>
<td>LIVE</td>
and the data I need is here between <td> </td>
Now my code so far is:
import time
from selenium import webdriver
chromeOps = webdriver.ChromeOptions()
chromeOps.binary_location = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
chromeOps.add_argument("--enable-internal-flash")
browser = webdriver.Chrome("C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe", port=4445, chrome_options=chromeOps)
time.sleep(3)
browser.get('website')
elem=browser.find_element_by_id('MainForm')
el=elem.find_element_by_xpath('//*[@id="ReportHolder"]')
The last two lines of code are really just me testing how far down the path I can go before the XPath breaks down. Trying to XPath to any content beyond this point gives a NoSuchElementException.
Can anyone explain to me how I draw data from within the table please?
My current thinking is that perhaps I have to pass "something" into an XML tree API and access it through that, although I don't know how I would capture it.
If anyone can give me that next step it would be greatly appreciated, feeling a bit like I'm holding a candle in a dark room at the moment.

It's very simple. It's a timing issue.
Solution: Place a time.sleep(5) before the xpath request.
browser.get('http://www.mmgt.co.uk/HTMLReport.aspx?ReportName=Fleet%20Day%20Summary%20Report&ReportType=7&CategoryID=4923&Startdate='+strDate+'&email=false')
time.sleep(5)
ex=browser.find_element_by_xpath('//*[@id="ReportHolder"]/table/tbody/tr/td')
The XPath is requesting a reference to dynamic content. The table is dynamic content, and it takes longer for that content to load than it takes the Python program to reach the line:
ex=browser.find_element_by_xpath('//*[@id="ReportHolder"]/table/tbody/tr')
from its previous line of:
browser.get('http://www.mmgt.co.uk/HTMLReport.aspx?ReportName=Fleet%20Day%20Summary%20Report&ReportType=7&CategoryID=4923&Startdate='+strDate+'&email=false')
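The fixed sleep works, but it waits the full five seconds even when the table loads sooner. The idea Selenium exposes as WebDriverWait can be sketched as a generic polling loop; `poll_until` below is a hypothetical helper written for illustration, not a Selenium API:

```python
import time

def poll_until(condition, timeout=5.0, interval=0.25):
    """Poll condition() until it returns a truthy value, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1f s" % timeout)

# Simulated page state; in real use the browser flips this as it renders.
page = {'table_loaded': False}
page['table_loaded'] = True

print(poll_until(lambda: page['table_loaded']))  # True
```

With Selenium itself, the equivalent is `WebDriverWait(browser, 5).until(...)` with an expected condition, which returns as soon as the element appears instead of always sleeping.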

Related

Unable to pull default text from input element with Selenium

I'm trying to get the 11/30/2022 date from the SOA Handled Date/Time field from this site pictured here. It's not a public site, so I can't simply post the link. The text is in an input field that's filled in by default when you open the page, and it has the following HTML.
<td>
<input type="text" name="soa_h_date" id="soa_h_date" class="readOnly disableInput" readonly="readonly">
</td>
I've tried everything and I'm just not able to pull the text no matter what I do. I've tried the following:
driver.find_element_by_xpath('//input[@id="soa_h_date"]').text
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("value")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("placeholder")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("textarea")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("innerText")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("outerText")
Never mind, I figured it out. I had to use a JavaScript executor to pull the text, with the following code.
element = driver.find_element_by_xpath('//input[@id="soa_h_date"]')
date = driver.execute_script("return arguments[0].value", element)

can't locate element in iframe selenium

I'm trying to switch to a frame on a web page to access a video inside that frame, but an error always occurs saying the element is not found. I've tried many elements, all with the same error.
This is the code I used to switch to the frame and get the video URL:
WebDriverWait(browser,10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//*[@id='innerframe']/iframe")))
browser.find_element_by_xpath("//video[@id='mediaplayer']/soucre").text
Here is the HTML of the page:
<div id="innerframe"><iframe src="https://ops.cielo24.com/hitman/work/load_task/554097cdbe314edb9ad5d62edf5396ed/tasks/2547efb19fc5430c9f335fe165a46df3?active_task_uuid=44686eab1ad8448d97e5e74e484575ab" width="100%" height="100%" frameborder="0"></iframe>
video html:
<div id="mediacontent">
<video height="305" width="480" id="mediaplayer"><source src="https://c24cdn.co/restricted/sliced-media/790f319ee29c46e585c5ee585ed31580.mp4?Expires=1584401901&GoogleAccessId=microservice-writer%40coresystem-171219.iam.gserviceaccount.com&Signature=QtSAPQc5GMxPx9qAI8WnCurouFagNgRE2rto1B3af%2BrUhemeqoFnJZWmfQfQ2SGXKAhc5pXL68GhLINlshZ4yGEvy7SDMEr1l44Z%2FA9bFL3Xvlsii9MfZpkXaCeXT%2FKrMZZvH%2BpbiR%2BpgQjgqLysP68fODMsQ3zub9FCx8zD2Yw5bQZg12rzQWdlEcU5VHGktTSDAjpReWHIrmca63X6jQAYru5TQi12sy18UwSlpdrF1qFgXlTOEMKwB2iPHbLRPxxpFF%2FhOkYVrCcIi6OmJOXvy6arBZY9%2FYBP2vjIpDQ3UODyH8uFrEFdWbqVTHAe0G0pKly4NK1K30dKrSGYJw%3D%3D" type="video/mp4"><source src="https://c24cdn.co/restricted/sliced-media/59b2c60d2e764a25bd4a8e2d6f15cb31.webm?Expires=1584401901&GoogleAccessId=microservice-writer%40coresystem-171219.iam.gserviceaccount.com&Signature=JGbxZYS0u2rI2gY%2BjXThKj9KkIMBDfLvW9XEImWdtfzMFNpUBBm33B7wM3XYD01JLKcMD%2BlqfWf%2FqzMFAgW2zQH07NvGKzdkYFIgwxgCUQha8ws%2FLqoJyLMiz8UeXr5Smqqjr%2FiFrLLc6HmCnYfP8g7Y%2BJ%2FJoQuHmVeZjJIKxz957SZEOQ8QIQqtbIusK%2B0uqQzvyyW4vStDF7RvjZwp44b1H0pqzsby2bjCYspacgv9JM712Z72sZdercFFczC5BR%2FxT0jXFxYn6XiRhfE0HO1e24qFiR1A%2B78Ems3A3ZdQylaVDZ4UfVX13iofy2l0LWdXMjEynLxSz7cNPGtDpg%3D%3D" type="video/webm"></video>
</div>
The XPath you are using is incorrect: there is a typo in it (you have used soucre instead of source), and the structure you have used to reach the element is also incorrect.
So, try to use the below code, it should work fine.
WebDriverWait(browser, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//div[@id='innerframe']/iframe")))
browser.find_element_by_xpath("//div[@id='mediacontent']//video").text

Python - XPath issue while scraping the IMDb Website

I am trying to scrape movies on IMDb using Python, and I can get data about all the important aspects except the actors' names.
Here is a sample URL that I am working on:
https://www.imdb.com/title/tt0106464/
Using the "Inspect" browser functionality I found the XPath that relates to all actors names, but when it comes to run the code on Python, it looks like the XPath is not valid (does not return anything).
Here is a simple version of the code I am using:
import requests
from lxml import html
movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5
IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)
I tried to change the XPath many times trying to make it more generic and then more specific, but it still does not return anything
Don't blindly accept the markup structure you see using inspect element.
Browsers are very lenient and will try to fix any markup issues in the source.
That being said, if you check the page using view source, you can see that the table you're trying to scrape has no <tbody>; those are inserted by the browser.
So if you remove it from here:
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text() -> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
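The difference can be checked without hitting IMDb at all; a minimal sketch with lxml on a snippet shaped like the served (tbody-less) source — the rows and names here are simplified placeholders:

```python
from lxml import html

# Snippet mirroring the served markup: note there is no <tbody>.
snippet = """
<table class="cast_list">
  <tr>
    <td class="primary_photo"><a href="/name/nm0000418/">photo</a></td>
    <td><a href="/name/nm0000418/">Danny Glover</a></td>
  </tr>
</table>
"""
doc = html.fromstring(snippet)

# With //tbody// the query matches nothing, since no <tbody> exists here.
with_tbody = doc.xpath('//table[@class="cast_list"]//tbody//tr//td'
                       '[not(contains(@class,"primary_photo"))]//a/text()')
# Dropping //tbody// finds the name.
without_tbody = doc.xpath('//table[@class="cast_list"]//tr//td'
                          '[not(contains(@class,"primary_photo"))]//a/text()')
print(with_tbody)     # []
print(without_tbody)  # ['Danny Glover']
```

Unlike a browser, lxml's HTML parser does not inject an implied <tbody>, which is exactly why the inspect-element XPath fails against the raw source.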
From looking at the HTML, start with a simple XPath like //td[@class="primary_photo"]
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ@@._V1_UY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td>
PYTHON:
for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)

Opening and Saving a Web Page in Internet Explorer through Python

I have written a python script to repeatedly hit a site using selenium and scrape a table of interest. I am not super well versed in web development. I did look through plenty of articles on urllib, requests, etc but am using selenium because this site uses a single sign on type authentication and it is well above my skill level to try and replicate that login through python (I think...). I can deal with the longer processing time and extra effort of selenium as long as I get the data in the end. Once I automate the clicks with selenium to get there I save the page source and parse it with beautifulsoup.
The page HTML looks something like this (I have obfuscated it a bit as it is a corporate website):
<TABLE><TBODY>
<TR>
<TD><B>abc<B></B></B></TD></TR>
<TR>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH></TR>
<TR>
<TD title="Source - abc " class=tableData><B>1234</B> </TD>
<TD title="Source - abc " class=tableData><B>1234</B> </TD>
</TBODY></TABLE>
In the browser it is rendered as a table where the numbers like 1234 are hyperlinks to another page. I just want to store the numbers in the tables. Weirdly, it seems like sometimes when I run code like:
info_1 = soup.findAll('table')[0]
data_rows = info_1.findAll('tr')
data_1 = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))]
sometimes it returns the table information, and sometimes it returns an empty list. It seems like the JavaScript is the problem.
My question: is there a way to extend the Python code above to pull the info out of the tags without actually running the JavaScript or doing anything similarly complicated? I'm hoping it is easy, since the number is right there in the HTML, but I mostly copied that code from another answer and I'm not comfortable extending it, so I'm just asking. Alternatively, is there something simple I could do through Python/Selenium so that when I use:
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
I can be sure that I always get the right table and not an empty list?
Thanks in advance.
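For the static-extraction half of the question, the parsing step can be demonstrated on the HTML shown above; a minimal sketch with BeautifulSoup (the tags are the obfuscated ones from the question, and this does not address the JavaScript timing issue):

```python
from bs4 import BeautifulSoup

# Static copy of the (obfuscated) table from the question.
page = """
<TABLE><TBODY>
<TR><TD><B>abc</B></TD></TR>
<TR>
<TD title="Source - abc" class="tableData"><B>1234</B> </TD>
<TD title="Source - abc" class="tableData"><B>5678</B> </TD>
</TR>
</TBODY></TABLE>
"""
soup = BeautifulSoup(page, 'html.parser')
info_1 = soup.find_all('table')[0]
data_rows = info_1.find_all('tr')
# Strip whitespace so the trailing space after </B> is dropped.
data_1 = [[td.get_text(strip=True) for td in row.find_all('td')]
          for row in data_rows]
print(data_1)  # [['abc'], ['1234', '5678']]
```

If this sometimes comes back empty in the real script, the usual cause is grabbing `page_source` before the JavaScript has rendered the table, so an explicit wait in Selenium before saving the source is the part to add.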

Python 3.5 + Selenium scrape. Is there any way to select <a></a> tags?

So I'm very new to Python and Selenium. I'm writing a scraper to take some balances and download a txt file. So far I've managed to grab the account balances, but downloading the txt files has proven to be a difficult task.
This is a sample of the HTML:
<td>
<div id="expoDato_msdd" class="dd noImprimible" style="width: 135px">
<div id="expoDato_title123" class="ddTitle">
<span id="expoDato_arrow" class="arrow" style="background-position: 0pt 0pt"></span>
<span id="expoDato_titletext" class="textTitle">Exportar Datos</span>
</div>
<div id="expoDato_child" class="ddChild" style="width: 133px; z-index: 50">
<a class="enabled" href="/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do;jsessionid=9817239879882871987129837882222R?tipoExportacion=txt">txt</a>
<a class="enabled" href="/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do;jsessionid=9817239879882871987129837882222R?tipoExportacion=pdf">PDF</a>
<a class="enabled" href="/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do;jsessionid=9817239879882871987129837882222R?tipoExportacion=excel">Excel</a>
<a class="modal" href="#info_formatos">Información Formatos</a>
</div>
</div>
I need to click on the first "a" with class="enabled", but I just can't manage to get there by XPath, class, or anything else. Here is the last thing I tried:
#Descarga de Archivos
ddmenu2 = driver.find_element_by_id("expoDato_child")
ddmenu2.find_element_by_css_selector("txt").click()
This is more of the stuff I've already tried:
#TXT = driver.select
#TXT.send_keys(Keys.RETURN)
#ddmenu2 = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[2]")
#Descarga = ddmenu2.find_element_by_visible_text("txt")
#Descarga.send_keys(Keys.RETURN)
I would appreciate your help.
PS: English is not my native language, so I'm sorry for any confusion.
EDIT:
This was the approach that worked; I'll try your other suggestions to make the code neater. Also, it only works if the mouse pointer is over the browser window (it doesn't matter where):
ddmenu2a = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[1]").click()
ddmenu2b = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[2]")
ddmenu2c = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[2]/a[1]").click()
Pretty much brute force, but I'm getting to like Python scripting.
Or simply use CSS to match on the href:
driver.find_element_by_css_selector("div#expoDato_child a.enabled[href*='txt']")
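The selector can be sanity-checked against the static HTML from the question without a browser; a sketch using BeautifulSoup's CSS select (assuming bs4 is installed; the hrefs below are shortened placeholders):

```python
from bs4 import BeautifulSoup

page = """
<div id="expoDato_child" class="ddChild">
  <a class="enabled" href="/exportarDatos.do?tipoExportacion=txt">txt</a>
  <a class="enabled" href="/exportarDatos.do?tipoExportacion=pdf">PDF</a>
  <a class="modal" href="#info_formatos">Informacion Formatos</a>
</div>
"""
soup = BeautifulSoup(page, 'html.parser')
# Same idea as the Selenium answer: an "enabled" anchor inside the
# dropdown whose href contains the substring "txt".
match = soup.select_one("div#expoDato_child a.enabled[href*='txt']")
print(match.get_text())  # txt
```

The `[href*='txt']` attribute-substring selector is standard CSS, so the same string works in `find_element_by_css_selector`.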
You can get all anchor elements like this:
a_list = driver.find_elements_by_tag_name('a')
this will return a list of elements. you can click on each element:
for a in a_list:
a.click()
driver.back()
or try xpath for each anchor element:
a1 = driver.find_element_by_xpath('//a[@class="enabled"][1]')
a2 = driver.find_element_by_xpath('//a[@class="enabled"][2]')
a3 = driver.find_element_by_xpath('//a[@class="enabled"][3]')
Please let me know if this was helpful
You can reach the elements directly by XPath via their text:
driver.find_element_by_xpath("//*[@id='expoDato_child' and contains(., 'txt')]").click()
driver.find_element_by_xpath("//*[@id='expoDato_child' and contains(., 'PDF')]").click()
...
If there is a public link for the page in question that would be helpful.
However, generally, I can think of two methods for this:
If you can discover the direct link, you can extract the link text and use Python's urllib to download the file directly.
or
Use Selenium's click function and have it click the link on the page.
A quick search turned up the following:
downloading-file-using-selenium