How do I write an XPath for the following element?
<input class="t2" style="background-color:#008000;" title="Jump to Detailed Analysis" type="button" value="Analyze" onclick="javascript:popAnalyze("1622662"," SP0001622662","CS3_pro2_axeda6","5336293761");">
The highlighted value is held in a variable (st_name). The highlighted and red-circled values change dynamically.
I can't work out how to write an XPath for this.
import xlrd
path = r'C:\Users\tmou\PycharmProjects\Python\WebScraping\Book2.xlsx'
workbook = xlrd.open_workbook(path)
sheet = workbook.sheet_by_index(0)
for c in range(sheet.ncols):
    for r in range(sheet.nrows):
        st = sheet.cell_value(r, c)
        try:
            if st == float(st):
                st_string = int(st)
                # Attempts so far:
                #variable = 1622662
                #new_string = "javascript:popAnalyze("" + str(st_string) + "","SP0001622662","CS3_pro2_axeda6","5336293761");"
                #driver.find_element_by_xpath("//input[@class='t2']/@onclick='" + st_string + "'").click()
                #driver.find_element_by_xpath("//input[@value='Analyze' and contains(@onclick='" + st_string + "']")
                #driver.find_element_by_xpath("//a[@title='" + st_string + "']")
        except ValueError:
            pass  # the cell holds text that is not a number
HTML:
<input class="t2" style="background-color:#008000;" title="Jump to Detailed Analysis" type="button" value="Analyze" onclick="javascript:popAnalyze("1622662","SP0001622662","CS3_pro2_axeda6","5336293761");">
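One detail in the loop above is worth spelling out: xlrd returns numeric cells as floats, so an ID such as 1622662 arrives as 1622662.0 and has to become the string '1622662' before it can be concatenated into an XPath. A minimal sketch of that conversion, where cell_to_id is a hypothetical helper name and not part of xlrd:

```python
def cell_to_id(value):
    """Return a whole-number cell as a digit string, or None for anything else.

    Hypothetical helper: it only assumes that numeric spreadsheet cells
    arrive as floats (for example 1622662.0).
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None  # text or empty cell
    if not number.is_integer():
        return None  # fractional value, so not an ID
    return str(int(number))


print(cell_to_id(1622662.0))
```

Returning a string rather than an int means the result can be joined to the rest of the XPath without a further str() call.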
If the value you are looking for is the one in the onclick attribute, then the following XPath expression should work:
string(//input[@class='t2']/@onclick)
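Edit 1
Can you try the following? Note that matches() requires XPath 2.0 or later, so it will not work in engines limited to XPath 1.0, which includes the browsers' native XPath that Selenium uses:
//input[(@class='t2' and matches(@onclick,'1622662'))]
Or, written with two chained predicates:
//input[@class='t2'][matches(@onclick, '1622662')]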
There are several ways to locate the element without using the onclick value at all, so it does not matter that the value is dynamic:
//input[@title='Jump to Detailed Analysis']
or
//input[@value='Analyze']
or
//input[@value='Analyze' and @title='Jump to Detailed Analysis']
EDIT 1:
You can use a variable as shown below:
variable = "Analyze"
xpath = "//input[@value='" + variable + "']"
EDIT 2:
variable = 1622662
new_string = 'javascript:popAnalyze("' + str(variable) + '","SP0001622662","CS3_pro2_axeda6","5336293761");'
EDIT 3:
variable = 1622662
xpath = "//input[@value='Analyze' and contains(@onclick,'" + str(variable) + "')]"
if driver.find_elements_by_xpath(xpath):
    driver.find_element_by_xpath(xpath).click()
In the above code, variable holds your dynamic value.
The xpath variable holds a dynamic XPath built from the value of variable.
if driver.find_elements_by_xpath(xpath): checks whether at least one element matching the XPath exists.
If one exists, it is clicked.
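The string concatenation from EDIT 3 can be wrapped in a small function so the same locator is built for every value read from the spreadsheet. This is only a sketch: analyze_button_xpath is a hypothetical name, and the check for a single quote is an added precaution, because a quote in the value would end the XPath string literal early.

```python
def analyze_button_xpath(value):
    """Build the EDIT 3 locator for one dynamic onclick value."""
    text = str(value)
    if "'" in text:
        raise ValueError("value must not contain a single quote")
    return "//input[@value='Analyze' and contains(@onclick,'" + text + "')]"


xpath = analyze_button_xpath(1622662)
print(xpath)
```

With a live driver the result would be passed to the locator call. Note that Selenium 4 removed the find_element_by_xpath helpers in favour of driver.find_elements(By.XPATH, xpath).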
Use one of the following XPath expressions:
//input[@value='Analyze' and contains(@onclick,'"+st_name+"')]
OR
//input[@title='Jump to Detailed Analysis' and contains(@onclick,'"+st_name+"')]
Final code:
driver.find_element_by_xpath("//input[@title='Jump to Detailed Analysis' and contains(@onclick,'"+st_name+"')]")
I am working on a project where I need to scrape data from a graph that shows one day of data at a time. For example, to get all the data for 2017 I have to enter a new date in the datepicker 365 times. The problem is that, although my XPath is very specific, the script finds far too many web elements, many of which do not even satisfy the conditions in the XPath. This only starts some way into the loop, and with every iteration the script finds more and more web elements.
The code that I am using:
import time
import pandas as pd
from selenium.webdriver.common.by import By
# web is an already initialised Selenium WebDriver
Date_vec = pd.date_range(start="2017-01-01",end="2021-2-28")
DatePicker = web.find_element_by_xpath('/html/body/form/table/tbody/tr/td/table/tbody/tr/td[2]/div/div[1]/div[2]/div[2]/div[2]/div/div[2]/div/table/tbody/tr[2]/td/table/tbody/tr/td[2]/span/input')
month_prev = 0
year_prev = 0
for i in Date_vec:
    DatePicker = web.find_element_by_xpath('/html/body/form/table/tbody/tr/td/table/tbody/tr/td[2]/div/div[1]/div[2]/div[2]/div[2]/div/div[2]/div/table/tbody/tr[2]/td/table/tbody/tr/td[2]/span/input')
    DatePicker.click()
    if i.year != year_prev:
        # Year_button = web.find_element(By.XPATH,"//span[@onclick = 'basicDatePicker.ehYearSelectorClick(this)']")
        Year_button = web.find_elements(By.XPATH,".//span[@onclick = 'basicDatePicker.ehYearSelectorClick(this)']")
        Year_button[-1].click()
        Year_choice = web.find_elements(By.XPATH,"//a[normalize-space(text()) ='"+ str(i.year) + "']")
        Year_choice[-1].click()
    elif i.month != month_prev:
        Month_button = web.find_elements(By.XPATH,"//span[@onclick = 'basicDatePicker.ehMonthSelectorClick(this)']")
        Month_button[-1].click()
        Month_choice = web.find_elements_by_class_name('bdpMonthItem')
        Month_choice[i.month-1].click()
    Day_button = web.find_elements(By.XPATH,"//a[normalize-space(text()) ='"+ str(i.day) + "' and contains(@class, 'bdpDay')]")
    Day_button[-1].click()
    time.sleep(3)
    month_prev = i.month
    year_prev = i.year
For example, the problem arises at the line below:
Day_button = web.find_elements(By.XPATH,"//a[normalize-space(text()) ='"+ str(i.day) + "' and contains(@class, 'bdpDay')]")
This line returns 4 elements, 2 of which have no text. I checked this with the following line:
test1 = [i.text for i in Day_button]
So my question is: why does this line return 4 elements, two of which have no text, when the XPath explicitly requires the current day as text? Any help is appreciated.
edit: for clarity, I added a snip from the datepicker in question:
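One possible explanation, which the snippet alone cannot confirm: each click on the date picker may leave additional calendar markup in the DOM. The XPath is evaluated against the DOM regardless of visibility, while Selenium's .text returns an empty string for elements that are not displayed. Under that assumption, the matches could be narrowed to the visible ones before clicking. In this sketch visible_only is a hypothetical helper that relies only on WebElement.is_displayed():

```python
def visible_only(elements):
    """Keep only the elements that Selenium reports as displayed."""
    return [element for element in elements if element.is_displayed()]


# Intended use inside the loop (web and the XPath come from the question):
# Day_button = visible_only(web.find_elements(By.XPATH, day_xpath))
# Day_button[-1].click()
```

If the filtered list still holds more than one element, the extra matches are probably visible days from adjacent months, which would need a further class condition in the XPath.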
I am trying to locate the elements that contain the contact number on each site. I have a routine that extracts the contact number using regexes for the known formats, and I use the following snippet to get the element:
contact_elem = browser.find_elements_by_xpath("//*[contains(text(), '" + phone_num + "')]")
Considering the example of https://www.cssfirm.com/, the contact number appears in 2 locations, the top header and the bottom footer
The element texts accompanying the contact number are as follows :
<h3>CALL US TODAY AT (855) 910-7824</h3> - Footer
<span>Call Us<br>Today</span> (855) 910-7824 - Header
The extracted phone number matches perfectly while printing it out. For some reason, the element from the header part is not being detected.
I tried by searching for elements and even by deleting the footer element from the browser before executing the rest of the code.
What could be the reason for it to go undetected?
P.S.: Below is the amateurish, uncorrected code. Efficiency edits and suggestions are welcome. The same code has been tested with various sites and works fine.
import re
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
# browser is an already initialised Selenium WebDriver
url = 'http://www.cssfirm.com/'
browser.get(url)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
s = BeautifulSoup(parsed, 'html.parser')
s = s.decode('utf-8')
phoneNumberRegex = r'(\s*(?:\+?(\d{1,4}))?[-. (]*(\d{1,})[-. )]*(\d{3}|[A-Z0-9]+)[-. \/]*(\d{4}|[A-Z0-9]+)[-. \/]?(\d{4}|[A-Z0-9]+)?(?: *x(\d+))?\s*)'
custom_re = [r'([0-9]{4,4} )([0-9]{3,3} )([0-9]{4,4})',
             r'([0-9]{3,3} )([0-9]{4,4} )([0-9]{4,4})',
             r'(\+[0-9]{2,2}-)([0-9]{4,4}-)([0-9]{4,4}-)(0)',
             r'(\([0-9]{3,3}\) )([0-9]{3,3}-)([0-9]{4,4})',
             r'(\+[0-9]{2,2} )(\(0\)[0-9]{4,4} )([0-9]{4,6})',
             r'([0-9]{5,5} )([0-9]{6,6})',
             r'(\+[0-9]{2,2}\(0\))([0-9]{4,4} )([0-9]{4,4})',
             r'(\+[0-9]{2,2} )([0-9]{3,3} )([0-9]{4,4} )([0-9]{3,3})',
             r'([0-9]{3,3}-)([0-9]{3,3}-)([0-9]{4,4})']
phones = []
phones = re.findall(phoneNumberRegex, s)
phone_num_list = ()
phone_num = ''
matched = 0
for phoneHeader in phones:
    #phoneHeader = phoneHeader.decode('utf-8')
    for ph_cnd in phoneHeader:
        for pttrn in custom_re:
            phones = re.findall(pttrn,ph_cnd)
            if(phones):
                phone_num_list = phones
                for x in phone_num_list:
                    phone_num = ''.join(x)
                    try:
                        contact_elem = browser.find_element_by_xpath("//*[contains(text(), '" + phone_num + "')]")
                        phone_num_txt = contact_elem.text
                        if(phone_num_txt):
                            matched = 1
                            break
                    except NoSuchElementException:
                        pass
            if(matched == 1):
                break
        if(matched == 1):
            break
    if(matched == 1):
        break
print("Phone number :",phone_num)  # <-- Perfect output
contact_elem  # <-- empty for the header, or just the footer element
EDIT
Code updated. Forgot an important piece. Moreover, there is sleep time given in between to give time for the page to load. Considering it trivial, I haven't included them for a quick read.
I found a temporary solution by searching for the partial link text, as the number also comes on the link.
contact_elem2 = browser.find_element_by_partial_link_text(phone_num)
However, this does not answer the generic question as to why that text was ignored within the element.
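A likely cause, given the header markup shown above: in XPath 1.0, contains(text(), ...) converts the text() node-set to the string value of its first node only. In the header the number follows the <span>, so if the parent element has any text node before that <span> (even one holding only whitespace), the number sits in a later text node and is never compared. Testing each text node separately avoids this. A sketch of such a locator, where xpath_for_text is a hypothetical helper name:

```python
def xpath_for_text(needle):
    """Match any element with at least one text node containing needle."""
    if "'" in needle:
        raise ValueError("needle must not contain a single quote")
    # text()[contains(., ...)] applies the test to every text node in turn
    return "//*[text()[contains(., '" + needle + "')]]"


print(xpath_for_text("(855) 910-7824"))
```

The footer <h3> has a single text node, which would explain why the original expression finds it while missing the header.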
I am reading in an HTML document and want to store the HTML nested within a div tag of a certain name while preserving its structure (the spacing). The goal is to convert an HTML document into components for React. I am struggling with how to store the structure of the nested HTML and how to locate the correct closing tag for the div that marks everything nested within it as a React component (div class='rc-componentname' is the opening tag). Any help would be much appreciated. Thanks!
Edit: I assume regex is the best way to go about this. I haven't used regex before, so if that is correct and someone could point me in the right direction for the expression to use in this context, that would be great.
import os
components = []
class react_template():
    def __init__(self, component_name): # add nested html as second element
        self.Import = "import React, { Component } from 'react';"
        self.Class = "class " + component_name + ' extends Component {'
        self.Render = "render() {"
        self.Return = "return "
        self.Export = "export default " + component_name + ";"
def react(component):
    r = react_template(component)
    if not os.path.exists('components'): # create components folder
        os.mkdir('components')
    os.chdir('components')
    if not os.path.exists(component): # create folder for component
        os.mkdir(component)
    os.chdir(component)
    with open(component + '.js', 'wb') as f: # create js component file
        for j_key, j_code in r.__dict__.items():
            f.write(j_code.encode('utf-8') + '\n'.encode('utf-8'))
    f.close()
def process_html():
    with open('file.html', 'r') as f:
        for line in f:
            if 'rc-' in line:
                char_soup = list(line)
                for index, char in enumerate(char_soup):
                    if char == 'r' and char_soup[index+1] == 'c' and char_soup[index+2] == '-':
                        sliced_soup = char_soup[int(index+3):]
                        c_slice_index = sliced_soup.index("\'")
                        component = "".join(sliced_soup[:c_slice_index])
                        components.append(component)
                        innerHTML(sliced_soup)
                        # react(component)
def innerHTML(sliced_soup): # work in progress
    first_closing = sliced_soup.index(">")
    sliced_soup = "".join(sliced_soup[first_closing:]).split(" ")
def generate_components(components):
    for c in components:
        react(c)
if __name__ == "__main__":
    process_html()
I see you've used the word soup in your code... maybe you've already tried and disliked BeautifulSoup? If you haven't tried it, I'd recommend you look at BeautifulSoup instead of attempting to parse HTML with regex. Although regex would be sufficient for a single tag or even a handful of tags, markup languages are deceptively simple. BeautifulSoup is a fine library and can make things easier for dealing with markup.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
This will allow you to treat the entirety of your html as a single object and enable you to:
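from bs4 import BeautifulSoup
# html_text is a placeholder for the contents of your HTML file
soup = BeautifulSoup(html_text, 'html.parser')
# create a list of specific elements as objects
soup.find_all('div')
# find a specific element by id
soup.find(id="custom-header")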
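To make the "find the matching closing tag" part concrete, here is a sketch of the same parser-based idea. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party installs. It counts nested <div> tags to find the close that belongs to the rc- div and keeps the inner markup, spacing included. ComponentExtractor and extract_components are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser


class ComponentExtractor(HTMLParser):
    """Collect the inner HTML of every <div> whose class starts with 'rc-'."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.components = {}  # component name -> inner HTML
        self._name = None     # component being captured, if any
        self._depth = 0       # open <div> tags inside the capture
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if self._name is not None:
            self._parts.append(self.get_starttag_text())
            if tag == "div":
                self._depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            names = [c[3:] for c in classes if c.startswith("rc-")]
            if names:
                self._name, self._depth, self._parts = names[0], 1, []

    def handle_startendtag(self, tag, attrs):
        if self._name is not None:  # self-closing tag such as <br/>
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._name is None:
            return
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:  # matching close of the rc- div
                self.components[self._name] = "".join(self._parts)
                self._name = None
                return
        self._parts.append("</" + tag + ">")

    def handle_data(self, data):
        if self._name is not None:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._name is not None:
            self._parts.append("&" + name + ";")

    def handle_charref(self, name):
        if self._name is not None:
            self._parts.append("&#" + name + ";")


def extract_components(html_text):
    """Return a dict mapping component names to their inner HTML."""
    parser = ComponentExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.components


sample = "<body><div class='rc-header'>\n  <div><p>Hi &amp; bye</p></div>\n</div><p>x</p></body>"
print(extract_components(sample))
```

The sketch has known limits: HTML comments inside a component are dropped, and an rc- div nested inside another rc- div is kept as part of the outer component rather than extracted on its own.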
I wrote some code that grabs the numbers I need from this website, but I don't know what to do next.
It grabs the numbers from the table at the bottom. The ones under calving ease, birth weight, weaning weight, yearling weight, milk and total maternal.
#!/usr/bin/python
import urllib2
from bs4 import BeautifulSoup
import pyperclip
def getPageData(url):
    if not ('abri.une.edu.au' in url):
        return -1
    webpage = urllib2.urlopen(url).read()
    soup = BeautifulSoup(webpage, "html.parser")
    # This finds the epd tree and saves it as a searchable list
    pedTreeTable = soup.find('table', {'class':'TablesEBVBox'})
    # This puts all of the epds into a list.
    # it looks for anything in pedTreeTable with an td tag.
    pageData = pedTreeTable.findAll('td')
    pageData.pop(7)
    return pageData
def createPedigree(animalPageData):
    ''' make animalPageData much more useful. Strip the text out and put it in a dict.'''
    animals = []
    for animal in animalPageData:
        animals.append(animal.text)
    prettyPedigree = {
        'calving_ease' : animals[18],
        'birth_weight' : animals[19],
        'wean_weight' : animals[20],
        'year_weight' : animals[21],
        'milk' : animals[22],
        'total_mat' : animals[23]
    }
    for animalKey in prettyPedigree:
        if animalKey != 'year_weight' and animalKey != 'dam':
            prettyPedigree[animalKey] = stripRegNumber(prettyPedigree[animalKey])
    return prettyPedigree
def stripRegNumber(animal):
    '''returns the animal with its registration number stripped'''
    lAnimal = animal.split()
    strippedAnimal = ""
    for word in lAnimal:
        if not word.isdigit():
            strippedAnimal += word + " "
    return strippedAnimal
def prettify(pedigree):
    ''' Takes the pedigree and prints it out in a usable format '''
    s = ''
    pedString = ""
    # this is also ugly, but it was the only way I found to format with a variable
    cFormat = '{{:^{}}}'
    rFormat = '{{:>{}}}'
    #row 1 of string
    s += rFormat.format(len(pedigree['calving_ease'])).format(
        pedigree['calving_ease']) + '\n'
    #row 2 of string
    s += rFormat.format(len(pedigree['birth_weight'])).format(
        pedigree['birth_weight']) + '\n'
    #row 3 of string
    s += rFormat.format(len(pedigree['wean_weight'])).format(
        pedigree['wean_weight']) + '\n'
    #row 4 of string
    s += rFormat.format(len(pedigree['year_weight'])).format(
        pedigree['year_weight']) + '\n'
    #row 5 of string
    s += rFormat.format(len(pedigree['milk'])).format(
        pedigree['milk']) + '\n'
    #row 6 of string
    s += rFormat.format(len(pedigree['total_mat'])).format(
        pedigree['total_mat']) + '\n'
    return s
if __name__ == '__main__':
    while True:
        url = raw_input('Input a url you want to use to make life easier: \n')
        pageData = getPageData(url)
        s = prettify(createPedigree(pageData))
        pyperclip.copy(s)
        if len(s) > 0:
            print 'the easy string has been copied to your clipboard'
I've just been using this code for easy copying and pasting. All I have to do is insert the URL, and it saves the numbers to my clipboard.
Now I want to use this code on my website; I want to be able to insert a URL in my HTML code, and it displays these numbers on my page in a table.
My questions are as follows:
How do I use the python code on the website?
How do I insert collected data into a table with HTML?
It sounds like you would want to use something like Django. Although the learning curve is a bit steep, it is worth it and it (of course) supports python.
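Whichever framework ends up serving the page, the second question (getting the collected values into an HTML table) comes down to building markup from the dictionary that createPedigree returns. Here is a framework-independent sketch in Python 3 using only the standard library. render_table is a hypothetical helper and the sample values are made up; in Django this job would normally be done by a template instead.

```python
from html import escape


def render_table(pedigree):
    """Return an HTML table with one row per key/value pair."""
    rows = []
    for key, value in pedigree.items():
        # escape() stops scraped text from being read as markup
        rows.append("<tr><th>{}</th><td>{}</td></tr>".format(
            escape(str(key)), escape(str(value))))
    return "<table>\n" + "\n".join(rows) + "\n</table>"


sample = {"calving_ease": "+2.1", "birth_weight": "+3.4"}  # made-up numbers
print(render_table(sample))
```

A view function would call getPageData and createPedigree for the submitted URL and place the returned string in the response.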
I am analyzing StackOverflow's dump file "Posts.Small.xml" using pySpark. I want to separate 'code block' from 'text' in a Row. A typical parsed row looks like:
['[u"<p>I want to use a track-bar to change a form\'s opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I try to build it, I get this error:</p>
<blockquote>
<p>Cannot implicitly convert type \'decimal\' to \'double\'.
</p>
</blockquote>
<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.',
'", u\'This code has worked fine for me in VB.NET in the past.',
'\', u"</p>
When setting a form\'s opacity should I use a decimal or double?"]']
I've tried "itertools" and some python functions but couldn't get the result.
My initial code to extract the above row is:
postsXml = textFile.filter(lambda line: not line.startswith("<?xml version="))
postsRDD = postsXml.map(............)
tokensentRDD = postsRDD.map(lambda x:(x[0], nltk.sent_tokenize(x[3])))
new = tokensentRDD.map(lambda x: x[1]).take(1)
a = ''.join(map(str,new))
b = a.replace("&lt;", "<")
final = b.replace("&gt;", ">")
nltk.sent_tokenize(final)
Any ideas are appreciated!
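A side note on the two chained replace calls: they handle only the angle-bracket entities. If the dump escapes other characters as well (an assumption; for example &amp;amp; or &amp;quot;), the standard library's html.unescape decodes all of them in one pass. A Python 3 sketch with a made-up sample string:

```python
from html import unescape

# Made-up sample in the escaped form an XML attribute would carry
escaped = "&lt;p&gt;This is my code:&lt;/p&gt; &lt;code&gt;a &amp;&amp; b&lt;/code&gt;"
decoded = unescape(escaped)
print(decoded)
```

Decoding before tokenizing means that any later tag-based splitting sees real <code> tags rather than entity text.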
You can extract the code contents by using XPath (the lxml library will help) and then extract the text content selecting everything else, for example:
import lxml.etree
data = '''<p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p> <pre><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans;</code></pre>
<p>When I try to build it, I get this error:</p>
<p>Cannot implicitly convert type 'decimal' to 'double'.</p>
<p>I tried making <code>trans</code> a <code>double</code>.</p>'''
html = lxml.etree.HTML(data)
code_blocks = html.xpath('//code/text()')
text_blocks = html.xpath('//*[not(descendant-or-self::code)]/text()')
The easiest way will probably be to apply a regex to the text, matching the tags '<code>' and '</code>'. That would enable you to find the code blocks. You don't say what you would do with them afterwards, though. So ...
from itertools import zip_longest
sample_paras = [
"""<p>I want to use a track-bar to change a form\'s opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I try to build it, I get this error:</p>
<blockquote>
<p>Cannot implicitly convert type \'decimal\' to \'double\'. </p>
</blockquote>
<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.""",
"""This code has worked fine for me in VB.NET in the past.""",
"""</p>
When setting a form\'s opacity should I use a decimal or double?""",
]
single_block = " ".join(sample_paras)
import re
separate_code = re.split(r"</?code>", single_block)
text_blocks, code_blocks = zip(*zip_longest(*[iter(separate_code)] * 2))
print("Text:\n")
for t in text_blocks:
    print("--")
    print(t)
print("\n\nCode:\n")
for t in code_blocks:
    print("--")
    print(t)
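The pairing above relies on the split pieces alternating strictly between text and code. A variation, offered as a sketch rather than a replacement, uses a capturing group so that re.split itself returns the code at the odd positions. split_code_and_text is a hypothetical name, and like any regex approach it assumes that <code> tags are never nested:

```python
import re


def split_code_and_text(markup):
    """Return (text_blocks, code_blocks) from markup containing <code> tags."""
    # With a capturing group, re.split keeps the captured code in the result:
    # [text, code, text, code, ..., text]
    parts = re.split(r"<code>(.*?)</code>", markup, flags=re.DOTALL)
    return parts[0::2], parts[1::2]


sample = "<p>a</p><pre><code>x = 1</code></pre><p>use <code>y</code> now</p>"
text_blocks, code_blocks = split_code_and_text(sample)
print(code_blocks)
```

Because the result always starts and ends with a text piece, no padding is needed when the number of pieces is odd.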