I'm trying to get the 11/30/2022 date from the SOA Handled Date/Time field from this site pictured here. It's not a public site, so I can't simply post the link. The text is in an input field that's filled in by default when you open the page, and it has the following HTML.
<td>
<input type="text" name="soa_h_date" id="soa_h_date" class="readOnly disableInput" readonly="readonly">
</td>
I've tried everything and I'm just not able to pull the text no matter what I do. I've tried the following:
driver.find_element_by_xpath('//input[@id="soa_h_date"]').text
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("value")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("placeholder")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("textarea")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("innerText")
driver.find_element_by_xpath('//input[@id="soa_h_date"]').get_attribute("outerText")
Never mind, I figured it out. I had to use a JavaScript executor to pull the text, with the following code:
element = driver.find_element_by_xpath('//input[@id="soa_h_date"]')
date = driver.execute_script("return arguments[0].value", element)
I am designing a web scraping script in Python. I am using the beautifulsoup module and have almost succeeded, but some of my requirements are still unsatisfied.
When extracting the content that would be displayed to the user in a browser, beautifulsoup does not return some text, such as the "placeholder" attribute value of an input element. I wrote the code below to demonstrate this behavior.
Python code:
import requests
from bs4 import BeautifulSoup as bs4
web_page = requests.get("http://localhost/1.html", allow_redirects=True)
web_view = bs4(web_page.text, "html.parser")
print(web_view.text)
HTML code of http://localhost/1.html is
<html>
<title>Test Website</title>
<body>
<p>Hello World</p>
<form>
<input placeholder="Username"/>
<input placeholder="Password" type="password"/>
</form>
</body>
</html>
The output of the above Python code is:
Test Website
Hello World
I expected the words "Username" and "Password" to also appear in the Python output, because they are also displayed to the user in the browser.
My requirement is not limited to the "placeholder" attribute of the "input" element. I also need the text that would be displayed to the user in the browser when some exception happens. For example, if an image placed in an "img" tag of any HTML page is missing, the user will see the text provided in the "alt" attribute of the "img" tag, like this.
HTML code for this page:
<html>
<title>Test Website</title>
<body>
<p>Hello World</p>
<form>
<input placeholder="Username"/>
<input placeholder="Password" type="password"/>
<br><br><br>
<img src="2.img" alt="Image missing">
</form>
</body>
</html>
I know that "2.img" is missing; that is intentional.
My overall question is:
I need to see all the web page content that is displayed to the user in the browser, including exception cases like a missing image. Currently, beautifulsoup extracts only the text content of a DOM element; it does not extract text that lives in the element's attributes but could still be displayed to the user. I need those attribute values as well.
If this information can be extracted with beautifulsoup, I am happy to learn how. If it's not possible, I would like to know all the HTML attributes (as a list) that fall into this category, so that I can write code to search for those attributes across all the tags on an HTML page.
If a complete list of attributes is not possible, I ask everyone to provide the attribute names of any tags you know that fall under the above use case, so that I can prepare a list that may be partially complete.
Edited:
In short:
What are all the attributes of any HTML tag whose values might be displayed to the user in the browser? You and I both know the "placeholder" attribute value (of the input tag) is displayed to the user, and the "alt" attribute value of the img tag is displayed if the image is missing. Besides placeholder and alt, what other such attributes are out there?
Regarding your first question, you can't expect the .text attribute to give you attributes of specific tags. You need to use .attrs['<attr_name>'] (see the docs) to get the desired output:
input_tags = web_view.find('form').find_all('input')
placeholders = [each.attrs['placeholder'] for each in input_tags]
# -> ['Username', 'Password']
As for the second question, you can find all img tags and print their alt attributes if that's what you are looking for:
imgs = web_view.find_all('img')
alt_attrs = [each.attrs['alt'] for each in imgs]
# -> ['Image missing']
To get every attribute of a certain tag, call .attrs:
input_tags = web_view.find('form').find_all('input')
attributes = [each.attrs for each in input_tags]
# -> [{'placeholder': 'Username'}, {'placeholder': 'Password', 'type': 'password'}]
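To generalize this to the broader question, one approach (a sketch; which attributes count as "displayed" is an assumption you will need to extend) is to walk every tag and collect a chosen set of attributes:

```python
from bs4 import BeautifulSoup

# Attributes whose values a browser may render as visible text.
# This set is an assumption; extend it as you find more cases.
VISIBLE_ATTRS = ("placeholder", "alt", "title")

html = '''<form>
<input placeholder="Username"/>
<input placeholder="Password" type="password"/>
<img src="2.img" alt="Image missing">
</form>'''

soup = BeautifulSoup(html, "html.parser")
visible = [tag.attrs[attr]
           for tag in soup.find_all(True)   # True matches every tag
           for attr in VISIBLE_ATTRS
           if attr in tag.attrs]
print(visible)  # -> ['Username', 'Password', 'Image missing']
```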
I am trying to input an amount in a text field which contains a default string.
HTML of that field:
<div class="InputGroup">
<span class="InputGroup-context">$</span>
<input autocomplete="off" class="Input InputGroup-input" id="amount" name="amount" type="text" maxlength="12" value="0.00">
</div>
When I try to input text into the field, instead of replacing the default text, it is appended to it.
I have tried amount.clear() (amount is the name I gave the element), but after running that and sending the keys it throws the exception below:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
As you will see from my current code, I am also trying to double-click the default text, but that doesn't help.
This is my code at the moment:
ActionChains(driver).move_to_element(amount).click(amount).click(amount).perform()
amount.send_keys(Keys.BACKSPACE)
amount.send_keys('100')
Which results in an input field of 0.00100 when I'm expecting 100.
Some JS-based web apps may not clear input fields properly.
You can try using the Actions class to click and fully clear a field. For example:
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys

elem = driver.find_element_by_id("amount")
actions = ActionChains(driver)
actions.move_to_element(elem).click().perform()
# one BACKSPACE per character of the default "0.00"
ActionChains(driver) \
    .send_keys(Keys.BACKSPACE) \
    .send_keys(Keys.BACKSPACE) \
    .send_keys(Keys.BACKSPACE) \
    .send_keys(Keys.BACKSPACE) \
    .perform()
I am trying to scrape movies on IMDb using Python, and I can get data about all the important aspects except the actors' names.
Here is a sample URL that I am working on:
https://www.imdb.com/title/tt0106464/
Using the browser's "Inspect" functionality I found the XPath that matches all the actors' names, but when I run the code in Python, it looks like the XPath is not valid (it does not return anything).
Here is a simple version of the code I am using:
import requests
from lxml import html
movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5
IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)
I tried changing the XPath many times, making it more generic and then more specific, but it still does not return anything.
Don't blindly accept the markup structure you see using inspect element.
Browsers are very lenient and will try to fix any markup issues in the source.
With that being said, if you check the source using view source, you can see that the table you're trying to scrape has no <tbody>; those are inserted by the browser.
So if you remove it here,
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text() -> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
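A minimal demonstration of the difference, using a cut-down stand-in for the cast-list markup (the live IMDb page may differ):

```python
from lxml import html

# A stripped-down version of the cast table as it appears in
# "view source": note there is no <tbody> in the raw HTML.
raw = '''<table class="cast_list">
<tr><td class="primary_photo"><a href="#"><img alt="Danny Glover"></a></td>
<td><a href="/name/nm0000418/">Danny Glover</a></td></tr>
</table>'''

doc = html.fromstring(raw)

# With //tbody the query matches nothing, because lxml (unlike a browser)
# does not insert the implied <tbody> element:
with_tbody = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
without_tbody = doc.xpath('//table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(with_tbody)     # -> []
print(without_tbody)  # -> ['Danny Glover']
```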
From looking at the HTML, start with a simple XPath like //td[@class="primary_photo"]
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ@@._V1_UY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td>
PYTHON:
for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)
I have written a Python script to repeatedly hit a site using selenium and scrape a table of interest. I am not super well versed in web development. I did look through plenty of articles on urllib, requests, etc., but am using selenium because this site uses a single sign-on type of authentication, and it is well above my skill level to try to replicate that login through Python (I think...). I can deal with the longer processing time and extra effort of selenium as long as I get the data in the end. Once I automate the clicks with selenium to get there, I save the page source and parse it with beautifulsoup.
The page HTML looks something like this (I have obfuscated it a bit as it is a corporate website):
<TABLE><TBODY>
<TR>
<TD><B>abc<B></B></B></TD></TR>
<TR>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH></TR>
<TR>
<TD title="Source - abc " class=tableData><B>1234</B> </TD>
<TD title="Source - abc " class=tableData><B>1234</B> </TD>
</TBODY></TABLE>
In the browser it is rendered as a table where the numbers like 1234 are hyperlinks to another page. I just want to store the numbers in the tables. Weirdly, when I run code like:
info_1 = soup.findAll('table')[0]
data_rows = info_1.findAll('tr')
data_1 = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))]
Sometimes it returns the table information but sometimes it returns an empty list. It seems like the Javascript is the problem.
My question: is there a way to extend the Python code above to pull the info out of the tags without actually running the JavaScript or doing anything similarly complicated? I'm hoping it is easy, since the numbers are right there in the HTML, but I mostly copied that code from another help request and I'm not comfortable extending it. Alternatively, is there something simple I could do through Python/selenium so that when I use:
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
I can be sure that I always get the right table and not an empty list?
Thanks in advance.
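As a sanity check (a sketch built from the obfuscated markup above), the BeautifulSoup side of this pipeline does extract the numbers whenever the HTML actually contains them, which supports the theory that timing, not parsing, is the problem:

```python
from bs4 import BeautifulSoup

# Static copy of the (obfuscated) table markup from the question.
html = """<TABLE><TBODY>
<TR><TD><B>abc</B></TD></TR>
<TR><TD title="Source - abc" class="tableData"><B>1234</B></TD>
<TD title="Source - abc" class="tableData"><B>1234</B></TD></TR>
</TBODY></TABLE>"""

soup = BeautifulSoup(html, "html.parser")
info_1 = soup.findAll('table')[0]
data_rows = info_1.findAll('tr')
data_1 = [[td.getText() for td in row.findAll('td')] for row in data_rows]
print(data_1)  # -> [['abc'], ['1234', '1234']]
```

If this succeeds on a static copy but intermittently fails on browser.page_source, the source was captured before the JavaScript finished inserting the table.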
<html xmlns="http://www.w3.org/1999/xhtml">
<head>_</head>
<body>
<form name="Main Form" method="post" action="HTMLReport.aspx?ReportName=...">
<div id="Whole">
<div id="ReportHolder">
<table xmlns:msxsl="urn:schemas-microsoft-com:xslt" width="100%">
<tbody>
<tr>
<td>_</td>
<td>LIVE</td>
The data I need is here, between <td> and </td>.
Now my code so far is:
import time
from selenium import webdriver
chromeOps=webdriver.ChromeOptions()
chromeOps.binary_location = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
chromeOps.add_argument("--enable-internal-flash")
browser = webdriver.Chrome("C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe", port=4445, chrome_options=chromeOps)
time.sleep(3)
browser.get('website')
elem=browser.find_element_by_id('MainForm')
el=elem.find_element_by_xpath('//*[#id="ReportHolder"]')
The last two lines of code are really just me testing how far I can go before the XPath breaks down. Trying to XPath to any content beyond this point gives a NoSuchElementException.
Can anyone explain how I can draw data from within the table, please?
My current thinking is that perhaps I have to pass "something" into an XML tree API and access it through that, although I don't know how I would capture it.
If anyone can give me that next step it would be greatly appreciated, feeling a bit like I'm holding a candle in a dark room at the moment.
It's very simple. It's a timing issue.
Solution: Place a time.sleep(5) before the xpath request.
browser.get('http://www.mmgt.co.uk/HTMLReport.aspx?ReportName=Fleet%20Day%20Summary%20Report&ReportType=7&CategoryID=4923&Startdate='+strDate+'&email=false')
time.sleep(5)
ex=browser.find_element_by_xpath('//*[#id="ReportHolder"]/table/tbody/tr/td')
The XPath is requesting a reference to dynamic content.
The table is dynamic content, and it takes longer for that content to load than it does for the Python program to reach the line:
ex=browser.find_element_by_xpath('//*[#id="ReportHolder"]/table/tbody/tr')
from its previous line of:
browser.get('http://www.mmgt.co.uk/HTMLReport.aspx?ReportName=Fleet%20Day%20Summary%20Report&ReportType=7&CategoryID=4923&Startdate='+strDate+'&email=false')