I am trying to extract the most recent headlines from the following news site:
http://news.sina.com.cn/hotnews/
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

# save ids of relevant buttons that need to be clicked on the site
buttons_ids = ['Tab21', 'Tab22', 'Tab32']
# save ids of relevant subsections
con_ids = ['Con11']
# start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)
for button_id in buttons_ids:
    button = driver.find_element_by_id(button_id)
    ActionChains(driver).move_to_element(button).perform()
Then I iterate through each section I am interested in and, within each section, through all the headlines, which are rows in an HTML table. However, on every iteration it returns the first element:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        headline = driver.find_element_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]")
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
I also tried the following approach, essentially saving the table as a list and then iterating through the rows:
for con_id in con_ids:
    table = driver.find_elements_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr")
    for headline in table:
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
In the second case I get exactly the number of headlines in the section, so it apparently picks up the correct number of rows. However, it still returns only the first row on every iteration. Where am I going wrong? I know a similar question has been asked here: Selenium Python iterate over a table of rows it is stopping at the first row, but I am still unable to figure out where I am going wrong.
In XPath, queries that begin with // will search relative to the document root; so even though you're calling find_element_by_xpath() on the correct container element, you're breaking out of that scope, thereby performing the same global search and yielding the same result every time.
To constrain your query to descendants of the current element, begin your query with .//, for example:
text = headline.find_element_by_xpath(".//td[2]/a")
Try this:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        print("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        headline = driver.find_element_by_xpath("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        value = headline.find_element_by_xpath(".//td[2]/a")
        print(value.get_attribute("innerText").encode('utf-8'))
I am able to get the headlines with the above code. The parenthesized expression (//div[...]/tr)[n] indexes into the document-wide list of matches, and the inner query starts with .//, so each iteration picks a different row.
I was able to solve it by specifying the entire XPath in one go like this:
headline = driver.find_element_by_xpath("(//*[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]/td[2]/a)")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))
rather than splitting it into two parts.
My only explanation for why it repeatedly printed the first row is that there is some weird JavaScript at work that doesn't let you iterate properly when splitting the request, or that my first version had a syntax error I am not aware of. If anyone has a better explanation, I'd be glad to hear it!
Background: I am pretty new to Python and decided to practice by writing a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would let me pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page is indexed in two ways: one set of links to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). Both of these tags have the class="pagination" identifier, but when I scrape based on that criterion, I get two separate strings that are not delimited lists separated by commas.
Does anyone know how to get the strings as a list, or individually? I really only care about the second one.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')
for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seem to form a list. I have found many ways to get the first string but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string and the word (current) might be in a different place in the first list.
THANK YOU!!!!
Edit:
Cleaned old, commented-out code from the source, and realized I did not show the result of trying to call the second element despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File, "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or something similar to actually do something with the value...
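As an aside, the split above is on '/n' (a forward slash) rather than the newline escape '\n', which is why the string never splits. For grabbing just the page numbers, here is a minimal sketch, assuming the page still renders two ul elements with class pagination and that the entries are anchor tags:
from bs4 import BeautifulSoup
import requests

source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
soup = BeautifulSoup(requests.get(source).text, 'lxml')

# find_all returns the pagination lists in document order;
# the second one holds the page numbers at the bottom of the page
page_list = soup.find_all('ul', class_='pagination')[1]
page_numbers = [a.text for a in page_list.find_all('a')]
print(page_numbers)  # something like ['«', '1', '2', '3', '»']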
While crawling the website, the text I want to pull has no class name, id, or style that would let me separate it from the rest of the page, so the selector path I used with soup.select doesn't work for repeated operations. As an example, I want to pull the data from a table like the one below (shown as a screenshot in the original post), but I don't know how to do it.
Just a guess: if you can get the table, and you know the row, you can do the following. Use findAll to get all the rows in a list and use slice syntax to access your element:
row = your_table_result.findAll('tr')[5::6]  # every sixth row, starting at index 5
EDITED AFTER QUESTION UPDATE
You can solve your problem in different ways, but first grab the table:
table = soup.find("table", {"class": "auflistung"})
Way #1 - You know the row, where information is stored
(be aware that the structure of the table can change or differ from page to page)
rows = table.findAll('td')
name = rows[0].text.strip()
position = rows[6].text.strip()
Way #2 - You know heading of information
(works great because there is only one column)
name = table.find("th", text="Anavatandaki isim:").find_next_sibling("td").text.strip()
position = table.find("th", text="Mevki:").find_next_sibling("td").text.strip()
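A self-contained illustration of Way #2, using a minimal stand-in table (the headings are taken from the snippets above; the real page's markup may differ):
from bs4 import BeautifulSoup

html = """
<table class="auflistung">
  <tr><th>Anavatandaki isim:</th><td>Some Name</td></tr>
  <tr><th>Mevki:</th><td>Forward</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "auflistung"})
# find the heading cell, then hop to the td that follows it
name = table.find("th", text="Anavatandaki isim:").find_next_sibling("td").text.strip()
position = table.find("th", text="Mevki:").find_next_sibling("td").text.strip()
print(name, position)  # Some Name Forward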
I am making a crawling app with Selenium and Python, and I am stuck.
[screenshot: the page, with the selectable text underlined]
As in the picture, I can select the text (underlined), but what I need is the numbers next to the text. However, in Chrome's F12 developer tools:
[screenshot: the DevTools markup, with the numbers circled in red]
The numbers (red circle) have a class name, but those class names are all the same. There is no indicator I can use to select the numbers through Selenium (as far as I know), so I tried to find some way to select the element through the HTML with Selenium, but I couldn't find any. Is there any way to do this?
If I am looking for something that does not exist, I am very sorry. I only know Python and Selenium, so if this cannot be handled with them, please let me know.
---edit
I think I gave a bad explanation. What I need is to find the text first, then collect the numbers (both of them). There is a ton of text; I only screenshotted a little bit. I can locate the texts by their specific ids (there are lots of them), but how can I get the numbers that sit next to the text? That is my question; sorry for the bad explanation. And if BeautifulSoup can handle this, please let me know. Thanks for your help.
Special thanks to Christine; her code solved my problem.
You can use an XPath index to select the first td element. Given the screenshot, you can select the first td containing 2,167 as such:
cell = driver.find_element_by_xpath("//tr[td/a[text()='TEXT']]/td[#class='txt-r'][1]")
print(cell.text)
You should replace TEXT with the characters you underlined in your screenshot -- I do not have this keyboard so I cannot type the text for you.
The above XPath will query on all table rows, pick the row with your desired text, then query on table cells with class txt-r within a row. Because the two td elements both have class txt-r, you only want to pick one of them, using an index indicated by [1]. The [1] will pick the first td, with text 2,167.
Full sample as requested by the user:
# first get all text elements on the page
all_text_elements = driver.find_elements_by_xpath("//a[contains(@class, 'link-resource')]")
# iterate text elements and print both numbers that are next to the text
for text_element in all_text_elements:
    # get the text from the web element
    text = text_element.text
    # find the first number next to it (2,167 from the sample HTML)
    first_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][1]")
    print(first_number.text)
    # find the 2nd number (0 from the sample HTML)
    second_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][2]")
    print(second_number.text)
So I am using Scrapy to scrape the books off of a website.
I have the crawler working and it crawls fine, but when it comes to extracting data from the HTML using select() with XPath, it is not quite working out right. Since it is a book website, I have almost 131 books on each page, and their XPaths come out like this.
For example, getting the titles of the books:
1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span
The li[] index increases with each book. I am not sure how to put this into a loop so that it catches all the titles. I have to do this for the images and author names too, but I think it will be similar. I just need to get this initial one done.
Thanks for your help in advance.
There are different ways to get this. The best way to select multiple nodes is on the basis of ids or classes, e.g.:
sel.xpath("//div[@id='id']")
You can also select by position, like this:
for i in range(1, upto_num_of_divs + 1):  # XPath indices are 1-based
    div = sel.xpath("//div[%d]" % i)
Or you can select a range of nodes in a single query with position():
divs = sel.xpath("//div[position() >= 1 and position() < %d]" % upto_num_of_divs)
Here is an example of how you can parse your example HTML:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    book_el = li.select('a/span/text()')
Often enough you can do something like //div[@class="final-price"]//span to get the list of all the spans in one XPath. The exact expression depends on your HTML; this is just to give you an idea.
Otherwise the code above should do the trick.
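A slightly fuller sketch of that loop; the sub-paths for the image and author below are assumptions, since the real ones depend on the page's markup:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    # relative paths (no leading //) keep each query scoped to this li
    title = li.select('a/span/text()').extract()
    image = li.select('a/img/@src').extract()  # assumed location of the cover image
    author = li.select('.//span[@class="author"]/text()').extract()  # assumed author node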
Please help. I am trying to fetch data from a website and then count the occurrences of certain text. Unfortunately, I cannot provide the actual website, but the basics are this: the page loads and I am presented with a list of values, which are located in a table (the code below reflects this). The page looks something like this:
Header
Table 1
A00001
A00002
A00003
A00004
......
A00500
Each of the above rows (A00001-A00500) represents a link that I need to click on. Furthermore, each link leads to a unique page that I need to extract information from.
I am using Selenium to fetch the information and store it in variables, as you can see in the code below. Here is my problem, though: the number of links/rows that I will need to click on depends on the timeframe that my user selects in the GUI. As you can see from my code, a time frame from 5/1/2011 to 5/30/2011 produces a list of 184 different links that I need to click on.
from selenium import selenium
import unittest, time, re

class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*chrome", "https://www.example.com")
        self.selenium.start()

    def test_untitled(self):
        sel = self.selenium
        sel.open("https://www.example.com")
        sel.click("link=Reports")
        sel.wait_for_page_to_load("50000")
        sel.click("link=Cases")
        sel.wait_for_page_to_load("50000")
        sel.remove_selection("office", "label=")
        sel.add_selection("office", "label=San Diego")
        sel.remove_selection("chapter", "label=")
        sel.add_selection("chapter", "label=9")
        sel.add_selection("chapter", "label=11")
        sel.type("StartDate", "5/1/2011")
        sel.type("EndDate", "5/30/2011")
        sel.click("button1")
        sel.wait_for_page_to_load("30000")
        case1 = sel.get_table("//div[@id='cmecfMainContent']/center[2]/table.1.0")
        case2 = sel.get_table("//div[@id='cmecfMainContent']/center[2]/table.2.0")
        # ... up to:
        case184 = sel.get_table("//div[@id='cmecfMainContent']/center[2]/table.184.0")

    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()
I am confused about 2 things.
1) What is the best way to get Selenium to click on ALL of the links on the page without knowing the number of links ahead of time? The only way I know of is to have the user enter the number of links in a GUI, assign it to a variable, and then include it in the following method:
number_of_links = input("How many links are on the page? ")
sel.get_table("//div[#id='cmecfMainContent']/center[2]/number_of_links")
2) I am also confused about how to count the occurrences of certain data that appear on the pages that the links lead to.
i.e.
A00001 leads to a page that contains the table value "Apples"
A00002 leads to a page that contains the table value "Oranges"
A00003 leads to a page that contains the table value "Apples"
I know Selenium can store these as variables, but I am unsure whether I can save them in a sequence type, with each additional occurrence appended to the original list (or added to a dictionary), which could then be counted with the len() function.
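The counting part is plain Python once the values are collected; a minimal sketch, assuming the table values have already been gathered into a list while clicking through the links:
# e.g. values collected while visiting each linked page
values = ["Apples", "Oranges", "Apples"]

counts = {}
for v in values:
    counts[v] = counts.get(v, 0) + 1
print(counts)       # {'Apples': 2, 'Oranges': 1}
print(len(values))  # total number of pages visited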
Thanks for your help
I'm not familiar with the Python API, so sorry for that, but in Java I know there is a way to get the number of occurrences of an XPath match. So you could write an XPath selector to find the elements you want, then get the number of occurrences of that path.
Then, to click each one, you can affix your XPath with an element selector like [1]: if your XPath was //somexpath/something, use (//somexpath/something)[1] to get the first.
Hope that helps
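In Python's Selenium RC client, the same idea might look like the following; a hedged sketch, assuming the case links sit inside the tables in the results div (sel is the started selenium() instance from the question):
link_xpath = "//div[@id='cmecfMainContent']/center[2]//table//a"
count = int(sel.get_xpath_count(link_xpath))

for i in range(1, count + 1):
    # a parenthesized index picks the i-th match of the whole node-set
    sel.click("xpath=(%s)[%d]" % (link_xpath, i))
    sel.wait_for_page_to_load("30000")
    # ... extract whatever you need from the case page here ...
    sel.go_back()
    sel.wait_for_page_to_load("30000")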
Here's an example: I wrote a crappy API in Java to be able to do jQuery-like operations on collections of XPath matches. My constructor matches the XPath, gets the count, then creates a list of all matches so I can do things like .clickAll():
public SelquerySelector(String selector, Selenium selenium) {
    super("xpath=(" + selector + ")[" + 1 + "]", selenium);
    this.xpath = selector;
    this.selenium = selenium;
    // find out how many elements match
    this.length = selenium.getXpathCount(this.xpath).intValue();
    // make an array of selected elements
    for (int i = 2; i <= this.length; i++) {
        elements.add(new SelquerySelectedElement("xpath=(" + this.xpath + ")[" + i + "]", this.selenium));
    }
}
Heres the whole code in case you want to see it:
http://paste.zcd.me/Show.h7m1?id=8002
So I guess, to answer your question (without knowing how the XPath matches your tables), you can probably do something like //div[@id='cmecfMainContent']/center[2]/table and get the number of matches, which gives you the total number of links, and then for-loop over them. If you can't do that with XPath, keep assuming there is another link until you get an exception:
cases = []
for i in range(1, xpathmatchcount + 1):
    cases.append(sel.get_table("//div[@id='cmecfMainContent']/center[2]/table." + str(i) + ".0"))