Is there any way to select an element through HTML with Selenium in Python?

I am making a crawling app with Selenium and Python, and I am stuck.

[screenshot of the page, with the relevant text underlined]

As the picture shows, I can select the text (underlined), but what I need is the numbers next to the text. Looking at the page with F12 in Chrome:

[screenshot of the HTML in Chrome DevTools, numbers circled in red]

the numbers (red circle) have a class name, but those class names are all the same, so there is no indicator I can use to select the numbers through Selenium (as far as I know). So I tried to find a way to select an element through its HTML with Selenium, but I couldn't find any. Is there any way to do this?
If I am looking for something that does not exist, I am very sorry. I only know Python and Selenium, so if I cannot handle this, please let me know.
--- edit
I think I gave a bad explanation. What I need is to find the text first, then collect the numbers next to it (both of them). But there are tons of text entries; I only screenshotted a little bit. I can locate the texts by their specific ids (lots of them), but how can I get the numbers that sit next to each text? This is my question; sorry for the bad explanation. And if BeautifulSoup can handle this, please let me know. Thanks for your help.
Special thanks to Christine; her code solved my problem.

You can use an XPath index to select the first td element. Given the screenshot, you can select the first td, containing 2,167, as such:
cell = driver.find_element_by_xpath("//tr[td/a[text()='TEXT']]/td[@class='txt-r'][1]")
print(cell.text)
You should replace TEXT with the characters you underlined in your screenshot -- I do not have this keyboard so I cannot type the text for you.
The above XPath will query all table rows, pick the row containing your desired text, then query the table cells with class txt-r within that row. Because both td elements have class txt-r, you only want to pick one of them, using the index indicated by [1]. The [1] will pick the first td, with text 2,167.
Full sample as requested by the user:
# first get all the text elements on the page
all_text_elements = driver.find_elements_by_xpath("//a[contains(@class, 'link-resource')]")
# iterate over the text elements and print both numbers that sit next to each text
for text_element in all_text_elements:
    # get the text from the web element
    text = text_element.text
    # find the first number next to it (2,167 from the sample HTML)
    first_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][1]")
    print(first_number.text)
    # find the 2nd number (0 from the sample HTML)
    second_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][2]")
    print(second_number.text)

Related

Find_all function returning multiple strings that are separated but not delimited by a character

Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page is separated by two indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags both have the class="pagination" identifier, but when I scrape based on that criterion, I get two separate strings that are not a delimited list separated by a comma.
Does anyone know how to get the strings as a list? or individually? I really only care about the second.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')
for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on '/n':
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string, and the word (current) might be in a different place in the first string.
THANK YOU!!!!
Edit:
Cleaned old, commented-out code from the source, and realized I did not show the result of trying to call the second element despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File "C:\.....", line 22, in <module>
    tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or something else to actually do something with the value...
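For reference, here is a minimal sketch of one way to get the second pagination bar as a proper Python list, building on the same setup as above; the class name pagination comes from the question, while the assumption that each entry sits in its own li tag is mine:
from bs4 import BeautifulSoup
import requests

source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')

# find_all returns a list of tags, so the page-number bar at the bottom is element [1]
letter_bar, page_bar = soup.find_all('ul', class_='pagination')[:2]

# collect the text of each individual <li> instead of splitting one text blob
page_numbers = [li.get_text(strip=True) for li in page_bar.find_all('li')]
print(page_numbers)  # e.g. ['«', '1', '2', '3', '»']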

Python Selenium only getting first row when iterating over table

I am trying to extract the most recent headlines from the following news site:
http://news.sina.com.cn/hotnews/
#save ids of relevant buttons that need to be clicked on the site
buttons_ids = ['Tab21' , 'Tab22', 'Tab32']
#save ids of relevant subsections
con_ids = ['Con11']
#start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)
for button_id in buttons_ids:
    button = driver.find_element_by_id(button_id)
    ActionChains(driver).move_to_element(button).perform()
Then I iterate through each section that I am interested in and, within each section, through all the headlines, which are rows in an HTML table. However, on every iteration it returns the first element:
for con_id in con_ids:
    for news_id in range(2,10):
        print(news_id)
        headline = driver.find_element_by_xpath("//div[@id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]")
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
I also tried the following approach by essentially saving the table as a list and then iterating through the rows:
for con_id in con_ids:
    table = driver.find_elements_by_xpath("//div[@id='"+con_id+"']/table/tbody/tr")
    for headline in table:
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
In the second case I get exactly the number of headlines in the section, so it apparently picks up the number of rows correctly. However, it still returns only the first row on every iteration. Where am I going wrong? I know a similar question has been asked here: "Selenium Python iterate over a table of rows it is stopping at the first row", but I am still unable to figure it out.
In XPath, queries that begin with // will search relative to the document root; so even though you're calling find_element_by_xpath() on the correct container element, you're breaking out of that scope, thereby performing the same global search and yielding the same result every time.
To constrain your query to descendants of the current element, begin your query with .//, e.g.,:
text = headline.find_element_by_xpath(".//td[2]/a")
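Applied to the second loop from the question, the fix is just the leading dot; a sketch using the ids and column positions from the question:
for con_id in con_ids:
    rows = driver.find_elements_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr")
    for row in rows:
        # .// scopes the search to this row instead of the whole document
        link = row.find_element_by_xpath(".//td[2]/a")
        print(link.get_attribute("innerText"))
        print(link.get_attribute("href"))
        com_no = row.find_element_by_xpath(".//td[3]/a")
        print(com_no.get_attribute("innerText"))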
Try this:
for con_id in con_ids:
    for news_id in range(2,10):
        print(news_id)
        print("(//div[@id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
        headline = driver.find_element_by_xpath("(//div[@id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
        value = headline.find_element_by_xpath(".//td[2]/a")
        print(value.get_attribute("innerText").encode('utf-8'))
I am able to get the headlines with the above code.
I was able to solve it by specifying the entire XPath in one go like this:
headline = driver.find_element_by_xpath("(//*[@id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]/td[2]/a)")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))
rather than splitting it into two parts.
My only explanation for why it only prints the first row repeatedly is that there is some weird JavaScript at work that doesn't let you iterate properly when splitting the request.
Or my first version had a syntax error, which I am not aware of.
If anyone has a better explanation, I'd be glad to hear it!

Scrapy: how can I extract text with hyperlink together

I'm crawling a professor's webpage.
Under her research description, there are two hyperlinks, "TEDx UCL" and "here".
I use an XPath like '//div[@class="group"]//p/text()' to get the first 3 paragraphs, and '//div[@class="group"]/text()' to get the last paragraph, which comes with some newlines, but those can be cleaned easily.
The problem is that the last paragraph contains only text; the hyperlinks are lost. Though I can extract them separately, it is tedious to put them back in their corresponding positions.
How can I get all the text and keep the hyperlinks?
You can use html2text:
import html2text

# grab the div's raw HTML (not just its text nodes) so the anchors survive
sample = response.xpath('//div[@class="group"]').get()
converter = html2text.HTML2Text()
converter.ignore_links = False  # keep the hyperlinks in the converted text
print(converter.handle(sample))
Try this:
'//div[@class="group"]/p//text()[normalize-space(.)]'
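And a short sketch of how that selector might be used inside a Scrapy callback; the div class is from the question, the callback scaffolding is just illustration:
def parse(self, response):
    # every non-empty text node under the <p> elements, link text included
    parts = response.xpath('//div[@class="group"]/p//text()[normalize-space(.)]').getall()
    yield {'description': ' '.join(part.strip() for part in parts)}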

How to Collect the line with Selenium Python

I want to know how I can collect the line, or the mailto link, with Selenium and Python. The emails contain an @ sign on the contact page. I tried the following code, but it works on some pages and not on others:
//*[contains(text(),"@")]
The email formats differ: sometimes it is <p>Email: name@domain.com</p> or <span>Email: name@domain.com</span>, or just name@domain.com.
Is there any way to collect them all with one statement?
Thanks
Here is the XPath you are looking for, my friend:
//*[contains(text(),"@")]|//*[contains(@href,"@")]
You could create a collection of the link text values on the page that contain @ and then iterate through it to do the formatting. You are going to have to format spans like the one that has Email: name@domain.com anyway.
Use find_elements_by_partial_link_text to make the collection.
I think you need two XPaths: a first XPath for finding elements that contain the text "Email:", and a second for elements whose href attribute contains "mailto:".
//*[contains(text(),"Email:")]|//*[contains(@href,"mailto:")]

Python + Selenium - web scraping and counting occurrences of certain text data in HTML

Please help. I am trying to fetch data from a website and then count the occurrences of certain text. Unfortunately, I cannot provide the actual website, but the basics are this.
The web page is loaded and I am presented with a list of values, which are located in the table (the code below reflects this). The page would look something like this.
Header
Table 1
A00001
A00002
A00003
A00004
......
A00500
Each of the above rows (A00001-A00500) represents a link that I need to click on. Furthermore, each of the links leads to a unique page that I need to extract information from.
I am using Selenium to fetch the information and store it as variable data, as you can see in the code below. Here's my problem, though: the number of links/rows that I will need to click on depends on the timeframe that my user selects in the GUI. As you can see from my code, a time frame from 5/1/2011 to 5/30/2011 produces a list of 184 different links that I need to click on.
from selenium import selenium
import unittest, time, re

class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*chrome", "https://www.example.com")
        self.selenium.start()

    def test_untitled(self):
        sel = self.selenium
        sel.open("https://www.example.com")
        sel.click("link=Reports")
        sel.wait_for_page_to_load("50000")
        sel.click("link=Cases")
        sel.wait_for_page_to_load("50000")
        sel.remove_selection("office", "label=")
        sel.add_selection("office", "label=San Diego")
        sel.remove_selection("chapter", "label=")
        sel.add_selection("chapter", "label=9")
        sel.add_selection("chapter", "label=11")
        sel.type("StartDate", "5/1/2011")
        sel.type("EndDate", "5/30/2011")
        sel.click("button1")
        sel.wait_for_page_to_load("30000")
        case_1 = sel.get_table("//div[@id='cmecfMainContent']/center[2]/table.1.0")
        case_2 = sel.get_table("//div[@id='cmecfMainContent']/center[2]/table.2.0")
        case_3 = sel.get_table("//div[@id='cmecfMainContent']/center[2]/table.184.0")

    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()
I am confused about 2 things.
1) What is the best way to get selenium to click on ALL of the links on the page without knowing the number of links ahead of time? The only way I know how to do this is to have the user select the number of links in a GUI, which would be assigned to a variable, which could then be included in the following method:
number_of_links = input("How many links are on the page? ")
sel.get_table("//div[#id='cmecfMainContent']/center[2]/number_of_links")
2) I am also confused about how to count the occurrences of certain data that appear on the pages that the links lead to.
i.e.
A00001 leads to a page that contains the table value "Apples"
A00002 leads to a page that contains the table value "Oranges"
A00003 leads to a page that contains the table value "Apples"
I know selenium can store these as variables, but I'm unsure as to whether or not I can save these as a sequence type, with each additional occurrence being appended to the original list (or added to a dictionary), which could then be counted with the len() function.
Thanks for your help
I'm not familiar with the Python API, so sorry for that, but in Java I know there is a function to get the number of occurrences of an XPath. So you could write an XPath selector to find the elements you want, then get the number of occurrences of that path.
Then, to click each one, you can suffix your XPath with an index selector like [1]; so if your XPath was //somexpath/something, use //somexpath/something[1] to get the first match.
Hope that helps
Here's an example: I wrote a crappy API in Java to be able to do jQuery-like operations on collections of XPath matches. My constructor matches the XPath, gets the count, then creates a list of all matches so I can do things like .clickAll():
public SelquerySelector(String selector, Selenium selenium) {
    super("xpath=(" + selector + ")[" + 1 + "]", selenium);
    this.xpath = selector;
    this.selenium = selenium;
    // find out how many elements match
    this.length = selenium.getXpathCount(this.xpath).intValue();
    // make an array of selected elements
    for (int i = 2; i <= this.length; i++) {
        elements.add(new SelquerySelectedElement("xpath=(" + this.xpath + ")[" + i + "]", this.selenium));
    }
}
Here's the whole code in case you want to see it:
http://paste.zcd.me/Show.h7m1?id=8002
So I guess, to answer your question (without knowing how XPath matches the table), you can probably query something like //div[@id='cmecfMainContent']/center[2]/table, get the number of matches to find the total number of links, then loop over them. If you can't do that with XPath, keep assuming there is another link until you get an exception:
cases = []
for i in range(1, xpathmatchcount):
    cases.append(sel.get_table("//div[@id='cmecfMainContent']/center[2]/table." + str(i) + ".0"))
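For the counting half of the question, a minimal sketch under the assumption that the value from each detail page has been collected into a plain Python list (the variable names are illustrative, not from the original code):
from collections import Counter

# one value collected from each detail page that was visited
values = ["Apples", "Oranges", "Apples"]

counts = Counter(values)   # Counter({'Apples': 2, 'Oranges': 1})
print(counts["Apples"])    # 2
print(len(values))         # total number of pages visited: 3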
