Looping through xpath variables - python

How can I increment the index inside an XPath expression in a loop in Python, for a Selenium WebDriver script?
search_result1 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])").text
search_result2 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])").text
search_result3 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])").text

Why don't you build a list to store the search results, like this?
search_results = []
for i in range(1, 11):  # assuming 10 results per page; set your own range
    result = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])" % (i, i)).text
    search_results.append(result)
This sample code builds a list of 10 result values. You can adapt it to write your own; it is just a matter of automating the task.
So:
search_results[0] gives you the first search result
search_results[1] gives you the second search result
...
search_results[9] gives you the 10th search result
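
The string-building step can be tried on its own, without a browser. This is only a sketch: result_xpath and CITE are hypothetical names introduced here, and the expression is the one from the question with the index substituted in.

```python
# The repeated sub-expression from the question, kept in one place.
CITE = "(//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)"


def result_xpath(position):
    """Build the question's XPath for the result at a 1-based position."""
    return "//a[not({cite}[{n}])]|({cite}[{n}])".format(cite=CITE, n=position)


# XPath positions start at 1, so the range starts at 1 as well.
xpaths = [result_xpath(i) for i in range(1, 4)]
for xpath in xpaths:
    print(xpath)
```

Each string in xpaths could then be passed to sel.find_element_by_xpath(...) inside the loop.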

@Alok Singh Mahor, I don't like hardcoding ranges. I think a better approach is to iterate over the list of WebElements:
search_results = []
result_elements = sel.find_elements_by_xpath("//not/indexed/xpath/for/any/search/result")
for element in result_elements:
    search_result = element.text
    search_results.append(search_result)

Related

Problems while trying to extract part of id

I tried to gather all of the ids from a site and extract the numbers from them.
They look like this on the site:
<div class="market_listing_row number_490159191836499" id="number_490159191836499">
<div class="market_listing_row number_490159191836499" id="number_490159191836499">
<div class="market_listing_row number_490159170836499" id="number_490159170836499">
I located all of them using the XPath below and, to be sure, printed the length of the list (and, while testing, every element in it, but I deleted that part of the code), so I know for sure that it works and collects all 50 different elements from the site.
elements = driver.find_elements_by_xpath('//*[starts-with(@id, "number_") and not(contains(@id, "_name")) ]')
print("List 2 length is:", len(elements))
But when I try to build a list of the numbers without the "number_" prefix, I run into a problem. The variable id that I fill with get_attribute("id") is just one id (number_490159170836499, for example) repeated 22 times (22 is the length of that id, so it must have something to do with that). list_of_ids works as intended and I get 490159170836499 as the result, but it holds only one element (I guess because only that one number is there). This is the code I used:
for x in elements:
    id = x.get_attribute("id")
    list_of_ids = re.findall("\d+", id)
I also used this code to print all of the ids on the site, so I know for sure that the elements list contains all of them and that get_attribute works.
for ii in elements:
    print(ii.get_attribute("id"))
To be clear I did import re
Another option: append each match to a list inside the loop.
import re
ids = []
for x in elements:
    element_id = x.get_attribute("id")
    ids.append(re.search(r"\d+", element_id)[0])
print(ids)
You can use the split method as well:
for x in elements:
    element_id = x.get_attribute("id")
    a = element_id.split("_")[1]
    print(a)
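
To see why the original loop ends up with a single value, it may help to run the same logic on plain strings. In the question, id and list_of_ids are reassigned on every iteration, so only the last element survives, and len(id) counts the characters of one string, not the number of ids. A minimal sketch, assuming the ids have already been read into a list (sample_ids is a stand-in built from the ids shown in the question):

```python
import re

# Stand-in for [x.get_attribute("id") for x in elements].
sample_ids = ["number_490159191836499", "number_490159170836499"]

numbers = []
for element_id in sample_ids:
    match = re.search(r"\d+", element_id)
    if match:  # skip ids that contain no digits
        numbers.append(match.group(0))

print(numbers)
print(len(sample_ids[0]))  # characters in one id string, not a count of ids
```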

Python when checking if scraped element exists in list

I keep getting an error when using an if/else statement in Python. I want my script to check whether an index exists and, if it does, run one piece of code; if not, run another. I get the error ValueError: 'Named Administrator' is not in list
import requests
from bs4 import BeautifulSoup
url_3 = 'https://www.brightscope.com/form-5500/basic-info/107299/Orthopedic-Institute-Of-Pennsylvania/15801790/Orthopedic-Institute-Of-Pennsylvania-401k-Profit-Sharing-Plan/'
page = requests.get(url_3)
soup = BeautifulSoup(page.text, 'html.parser')
divs = [e.get_text() for e in soup.findAll('span')]
if divs.index('Named Administrator'):
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'
Rather than calling index, which raises ValueError when the item is missing, do a membership (__contains__) test:
if 'Named Administrator' in divs:
and move forward only if Named Administrator actually exists in the divs list, so you won't get the ValueError.
Another consideration is that a membership test on a list has O(N) time complexity, so if you are doing this for a large list, you should probably use a set instead:
{e.get_text() for e in soup.findAll('span')}
but as sets are unordered, you won't be able to use indexing.
So either think of an approach that also works on sets, i.e. one that does not need to get the next value by indexing,
or use a set for the membership test and a list for getting the next value. The cost here might be higher or lower depending on your actual context, and you can only find that out by profiling:
divs_list = [e.get_text() for e in soup.findAll('span')]
divs_set = set(divs_list)
if 'Named Administrator' in divs_set:
    index = divs_list.index('Named Administrator')
    contact = divs_list[index + 1]
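
The membership check can be tried without fetching the page. The sketch below uses a hypothetical helper, value_after, and invented sample strings rather than real scraped data; it also guards against the label being the last item, in which case index + 1 would be out of range.

```python
def value_after(items, label, default="-"):
    """Return the item that follows label in items, or default if there is none."""
    if label in items:  # membership test, so index() cannot raise ValueError
        position = items.index(label)
        if position + 1 < len(items):
            return items[position + 1]
    return default


# Invented span texts standing in for the scraped list.
spans = ["Plan Name", "Named Administrator", "Example Contact"]

print(value_after(spans, "Named Administrator"))
print(value_after(spans, "Missing Label"))
```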

Python Stop a For-Loop at a special number?

I want to stop my for loop at a certain point. I know about range(), but that doesn't help me because I am iterating over a list. Either it doesn't work with range() or I just don't know how.
I store this variable globally:
productAmount = 4
This is my function. Everything else works fine. I had to delete some of the code; hopefully you can still follow it.
def amazonChecker(keyword):
    driver = webdriver.Chrome('./driver/chromedriver.exe')
    driver.get(url)
    titels = driver.find_elements_by_tag_name('h2')
    for titel in titels:
        counter =+ 1
        if counter < productAmount:
            print(titel.text)
    sleep(5)
    driver.close
Best regards
KaanDev
driver.find_elements_by_tag_name() returns a list of WebElements. You can use list slicing to make a copy of this list containing only the subset of items you specify.
For example, to print the text of only the first 4 h2 elements:
titles = driver.find_elements_by_tag_name('h2')
for title in titles[:4]:
    print(title.text)
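
The same idea works with the limit held in a variable, and the sketch below also shows two alternatives: itertools.islice, which avoids copying the list, and enumerate with break, which is closer to the counter in the question. The strings are placeholders standing in for the element texts, and product_amount plays the role of productAmount.

```python
from itertools import islice

product_amount = 4
# Placeholder strings standing in for title.text values.
titles = ["title 1", "title 2", "title 3", "title 4", "title 5", "title 6"]

# 1. Slice: copies the first product_amount items.
first_by_slice = titles[:product_amount]

# 2. islice: stops after product_amount items without copying the list first.
first_by_islice = list(islice(titles, product_amount))

# 3. Explicit counter with break.
stopped_early = []
for count, title in enumerate(titles, start=1):
    if count > product_amount:
        break
    stopped_early.append(title)

print(first_by_slice)
```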

How to extract multiple elements from XPath Selenium (Python)

I wrote this XPath:
alo1 = driver.find_element(By.XPATH, "//div[@class='txt-block']/span/a/span").text
print(alo1)
The problem is that I only get the first element, but there are 3 or 4 elements matching the same XPath, and I want them all.
The number of elements varies from 0 to 4 from page to page.
How can I do it?
One other thing: do you think it is possible to write a different XPath? I'm trying to get the names of the films' producers.
EDIT:
I have a second difficulty. I'm writing this result to an Excel sheet, but it needs to be in one line to be written there, or else only the last one is written. How can this be done?
wb = xlwt.Workbook()
ws = wb.add_sheet("A Test Sheet")
driver = webdriver.Chrome()
driver.get('http://www.imdb.com/title/tt4854442/?ref_=wl_li_tt')
labels = driver.find_elements_by_xpath("//div[@class='txt-block']/span/a/span")
for label in labels:
    print (label.text)
    ws.write(x-1,1,label.text)
wb.save("sinopses.xls")
The website for reference: http://www.imdb.com/title/tt4854442/?ref_=wl_li_tt
You can get them all at once, and then get the text of each element:
alos = driver.find_elements(By.XPATH, "//div[@class='txt-block']/span/a/span")
for alo in alos:
    print(alo.text)
For the first question:
find_element always returns only one result; even if the locator matches more than one element, it automatically takes the first one.
If the locator matches more than one element and you want all of them, use find_elements.
For the second question:
labels = driver.find_elements_by_xpath("//div[@class='txt-block']/span/a/span")
result = ''
for label in labels:
    result += label.text
print (result)
ws.write(x-1,1,result)
wb.save("sinopses.xls")
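
Concatenating with += runs the names together with nothing between them. A small variation, sketched here on plain strings, is to join the texts with a separator before writing the single cell. The names are placeholders rather than scraped values, and the ", " separator is an assumption about the desired output.

```python
# Placeholder texts standing in for [label.text for label in labels].
label_texts = ["Producer A", "Producer B", "Producer C"]

# One string for one cell, with a readable separator between names.
result = ", ".join(label_texts)
print(result)

# A page with no matching elements yields an empty string, not an error.
empty_result = ", ".join([])
```

The resulting string can then be passed to ws.write(...) in place of result above.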

python xpath some but not all columns of a table

Unfortunately I'm a beginner with XPath and not completely sure how it works. For a project of mine I'm looking for a way to parse 5 columns of a 9-column table. Here is what I have working so far:
url="".join(["http://www.basketball-reference.com/leagues/NBA_2011_games.html"])
#getting the columns 4-7
page=requests.get(url)
tree=html.fromstring(page.content)
# the //text() is because some of the entries are inside <a></a>s
data = tree.xpath('//table[@id="games"]/tbody/tr/td[position()>3 and position()<8]//text()')
My workaround idea is to build another list that holds only the first column and then combine the two in an extra step; however, that seems inelegant and unnecessary.
For the XPath, I have tried this so far:
//table[@id="games"]/tbody/tr/td[position() = 1]/text() | //table[@id="games"]/tbody/tr/td[position()>3 and position()<8]//text()
Somehow that doesn't include the first column (the date) either. According to w3schools, | is the operator that combines two XPath expressions.
Here is my complete code right now. For now, the data is put into two lists.
I hope I didn't do anything too stupid. Thank you for your help.
from lxml import html
import requests
url="".join(["http://www.basketball-reference.com/leagues/NBA_1952_games.html"])
page=requests.get(url)
tree=html.fromstring(page.content)
reg_data = tree.xpath('//table[@id="games"]/tbody/tr/td[position() = 1]/text() | //table[@id="games"]/tbody/tr/td[position()>3 and position()<8]//text()')
po_data = tree.xpath('//table[@id="games_playoffs"]/tbody/tr/td[position() = 1]/text() | //table[@id="games_playoffs"]/tbody/tr/td[position()>3 and position()<8]//text()')
n=int(len(reg_data)/5)
if int(year) == 2016:
    for i in range(0,len(reg_data)):
        if len(reg_data[i])>3 and len(reg_data[i+1])>3:
            n = int((i)/5)
            break
games=[]
for i in range(0,n):
    games.append([])
    for j in range(0,5):
        games[i].append(reg_data[5*i+j])
po_games=[]
m=int(len(po_data)/5)
if year != 2016:
    for i in range(0,m):
        po_games.append([])
        for j in range(0,5):
            po_games[i].append(po_data[5*i+j])
print(games)
print(po_games)
It looks like a lot of the data is wrapped in link (a) tags so that when you are asking for text node children, you aren't finding any because you need to go one level deeper.
Instead of
/text()
do
//text()
The two slashes mean "select text() nodes that are descendants at ANY level".
You can also combine the entire expression into
//table[@id="games"]/tbody/tr/td[position() = 1 or (position()>3 and position()<8)]//text()
instead of having two expressions.
We can even shorten it further to
//table[@id="games"]//td[position() = 1 or (position()>3 and position()<8)]//text()
but there is a risk with this expression: it will pick up td elements that occur anywhere in the table (provided they are in the 1st, 4th, 5th, 6th, or 7th column), not just in rows in the body. For your target page this will work, however.
Note also that an expression like [position()=1] is not necessary. You can shorten it to [1]. You only need the position function if you need the position of a node other than the context node, or need to write a more complex selection like we have when needing more than just one specific index.
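
The combined expression can be checked offline against a small inline table. This is a sketch that assumes lxml is installed; the HTML is an invented one-row sample with the same nine-column shape, not data from the real page.

```python
from lxml import html

# Invented sample: 9 columns, with columns 4 and 6 wrapped in <a> tags.
SAMPLE = """
<table id="games">
  <tbody>
    <tr>
      <td>2010-10-26</td>
      <td>col 2</td>
      <td>col 3</td>
      <td><a href="#">Team A</a></td>
      <td>80</td>
      <td><a href="#">Team B</a></td>
      <td>88</td>
      <td>col 8</td>
      <td>col 9</td>
    </tr>
  </tbody>
</table>
"""

tree = html.fromstring(SAMPLE)

# Column 1 plus columns 4-7; //text() also reaches text nested inside <a>.
expression = (
    '//table[@id="games"]/tbody/tr/'
    'td[position() = 1 or (position() > 3 and position() < 8)]//text()'
)
row = [str(text) for text in tree.xpath(expression)]
print(row)
```

Using /text() instead of //text() in the same expression would skip the two team names, because their text sits one level deeper inside the a elements.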
