How to loop through retrieve div-id - python

I try to retrieve posts - title, url and image from a website using selenium. I have get to the first level, but on second level the post starts from (div id="abcde-11-0") to (div id="abcde-11-20"), in this case how should I get into second level before I can reach the (a herf) to retrieve the data?
Here are my code:
# I get to the first level by using xpath
Container=chromedrv.find_elements_by_xpath("//*[#id="MainContent"]/div[2]/div/div/div[6]/div[2]")

I'm wondering if I understood your question correctly, but please use this code.
first_level_rows =chromedrv.find_elements_by_xpath('//*[#id="MainContent"]/div[2]/div/div/div[6]/div[2]')
for first_level_row in first_level_rows:
second_level_rows = first_level_row.find_elements_by_xpath('//*[contains(#id, "abcde-11-")]')
find_elements_by_xpath('//*[contains(#id, "abcde-11-")]') => This part extracts all tags contains "abcde-11-" in id.

Related

Extracting element IDs of required forms in selenium/python

I'm crawling this page (https://boards.greenhouse.io/reddit/jobs/4330383) using Selenium in Python and am looping through all of the required fields using:
required = driver.find_elements_by_css_selector("[aria-required=true]").
The problem is that I can't view each element's id. The command required[0].id (which is the same as driver.find_element_by_id("first_name").id returns a long string of alphanumeric characters and hyphens – even though the id is first_name in the HTML. Can someone explain to be why the id is being changed from first_name to this string? And how can I view the actual id that I'm expecting?
Additionally, what would be the simplest way to reference the associated label mentioned right before it in the HTML (i.e. "First Name " in this case)? The goal is to loop through the required list and be able to tell what each of these forms is actually asking the user for.
Thanks in advance! Any advice or alternatives are welcome, as well.
Your code is almost good. All you need to do is use the .get_attribute() method to get your id's:
required = driver.find_elements_by_css_selector("[aria-required=true]")
for r in required:
print(r.get_attribute("id"))
driver.find_element_by_id("first_name") returns a web element object.
In order to get a web element attribute value like href or id - get_attribute() method should be applied on the web element object.
So, you need to change your code to be
driver.find_element_by_id("first_name").get_attribute("id")
This will give you the id attribute value of that element
I'm going to answer my second question ("How to reference the element's associated label?") since I just figured it out using the find_element_by_xpath() method in conjunction with the .get_attribute("id") solutions mentioned in the previous answers:
ele_id = driver.find_element_by_id("first_name").get_attribute("id")
label_text = driver.find_element_by_xpath('//label[#for="{}"]'.format(ele_id)).text

xpath result from scrapy don't show the same result from a html page

I'm having some issues in crawling this website search:
https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg
I'm trying to extract these elements from de SimplyHired search jobs for Data Engineer in US:
But when I try using xpath locator to any of them using selector module I'm getting different results and in different order.
Also the output for all of them isn't matching (The index corresponding to xpath job name is not the same index for ther location in xpath location for example).
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').content
sel=Selector(text=response)
#job name
sel.xpath('//main[#id="job-list"]/div/article[contains(#class,"SerpJob")]/div/div[#class="jobposting-title-container"]/h2/a/text()').extract()
#company
sel.xpath('//main[#id="job-list"]/div/article/div/h3[#class="jobposting-subtitle"]/span[#class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
#location
sel.xpath('//main[#id="job-list"]//div/article/div/h3[#class="jobposting-subtitle"]/span[#class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
#salary estimates
sel.xpath('//main[#id="job-list"]//div/article/div/div[#class="SerpJob-metaInfo"]//div[#class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
I'm not quite sure whether you're trying to use Scrapy or requests. Looks like you're wanting to use requests but with xpath selectors.
For websites like this, it's best to look at each individual job advert as a 'card'. You want to loop over each card with the XPATH selectors that you need to get the data you want.
Code Example
card = sel.xpath('//div[#class="SerpJob-jobCard card"]')
for a in card:
title = a.xpath('.//a[#class="card-link"]/text()').get()
company = a.xpath('.//span[#class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
salary = a.xpath('.//span[#class="jobposting-salary"]/text()').get()
location = a.xpath('.//span[#class="jobposting-location"]/text()').get()
Explanation
You want to search each card with relative XPATH selectors. The .// searches within the chunk of HTML downstream of the card variable.
Always use get() instead of extract(). get() is used to get one value and returns a string always, here that's what we want when we're looping over each card. extract() extracts all values if there are multiple and if there's only one value for the XPATH selector it puts it into a list which is often not what you want. The ambiguity of extract() is not ideal, if you want multiple values to use getall(), this is explicit and will only give you multiple values.
Additional Information
If you're finding you're not getting the correct data in the right format, always look to see if javascript content is being added to the website. Turn off your browsers javascript to refresh the page. On this particular site, none of the data you require is loaded by javascript, this makes it much easier to scrape.

Python Selenium only getting first row when iterating over table

I am trying to extract the most recent headlines from the following news site:
http://news.sina.com.cn/hotnews/
#save ids of relevant buttons that need to be clicked on the site
buttons_ids = ['Tab21' , 'Tab22', 'Tab32']
#save ids of relevant subsections
con_ids = ['Con11']
#start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)
for button_id in buttons_ids:
button = driver.find_element_by_id(button_id)
ActionChains(driver).move_to_element(button).perform()
Then I iterate through each section that I am interested in and within each section through all the headlines which are rows in an HTML table. However, on every iteration, it returns the first element
for con_id in con_ids:
for news_id in range(2,10):
print(news_id)
headline = driver.find_element_by_xpath("//div[#id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]")
text = headline.find_element_by_xpath("//td[2]/a")
print(text.get_attribute("innerText"))
print(text.get_attribute("href"))
com_no = comment.find_element_by_xpath("//td[3]/a")
print(com_no.get_attribute("innerText"))
I also tried the following approach by essentially saving the table as a list and then iterating through the rows:
for con_id in con_ids:
table = driver.find_elements_by_xpath("//div[#id='"+con_id+"']/table/tbody/tr")
for headline in table:
text = headline.find_element_by_xpath("//td[2]/a")
print(text.get_attribute("innerText"))
print(text.get_attribute("href"))
com_no = comment.find_element_by_xpath("//td[3]/a")
print(com_no.get_attribute("innerText"))
In the second case I get exactly the number of headlines in the section, so it apparently correctly picks up the number of rows. However, it is still only returning the first row on all iterations. Where am I going wrong? I know a similar question has been asked here: Selenium Python iterate over a table of rows it is stopping at the first row but I am still unable to figure out where I am going wrong.
In XPath, queries that begin with // will search relative to the document root; so even though you're calling find_element_by_xpath() on the correct container element, you're breaking out of that scope, thereby performing the same global search and yielding the same result every time.
To constrain your query to descendants of the current element, begin your query with .//, e.g.,:
text = headline.find_element_by_xpath(".//td[2]/a")
try this:
for con_id in con_ids:
for news_id in range(2,10):
print(news_id)
print("(//div[#id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
headline = driver.find_element_by_xpath("(//div[#id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
value = headline.find_element_by_xpath(".//td[2]/a")
print(value.get_attribute("innerText").encode('utf-8'))
I am able to get the headlines with above code
I was able to solve it by specifying the entire XPath in one go like this:
headline = driver.find_element_by_xpath("(//*[#id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]/td[2]/a)")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))
rather than splitting it into two parts.
My only explanation for why it only prints the first row repeatedly is that there is some weird Javascript at work that doesn't let you iterate properly when splitting the request.
Or my first version had a syntax error, which I am not aware of.
If anyone has a better explanation, I'd be glad to hear it!

XPATH for Scrapy

So i am using SCRAPY to scrape off the books of a website.
I have the crawler working and it crawls fine, but when it comes to cleaning the HTML using the select in XPATH it is kinda not working out right. Now since it is a book website, i have almost 131 books on each page and their XPATH comes to be likes this
For example getting the title of the books -
1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span
The DIV[] number increases with the book. I am not sure how to get this into a loop, so that it catches all the titles. I have to do this for Images and Author names too, but i think it will be similar. Just need to get this initial one done.
Thanks for your help in advance.
There are different ways to get this
Best to select multiple nodes is, selecting on the basis of ids or class.
e.g:
sel.xpath("//div[#id='id']")
You can select like this
for i in range(0, upto_num_of_divs):
list = sel.xpath("//div[%s]" %i)
You can select like this
for i in range(0, upto_num_of_divs):
list = sel.xpath("//div[position > =1 and position() < upto_num_of_divs])
Here is an example how you can parse your example html:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
book_el = li.select('a/span/text()')
Often enough you can do something like //div[#class="final-price"]//span to get the list of all the spans in one xpath. The exact expression depends on your html, this is just to give you an idea.
Otherwise the code above should do the trick.

Amazon Web Service ItemSearch DetailPageURL's with Associate IDs?

DetailPageURL's returned by ItemSearch seem to include an incorrect ID/tag rather than the associate ID I requested the search with.
I'm getting:
http://www.amazon.co.uk/gp/product/1590595009?SubscriptionId=XXX&tag=foo-12&linkCode=as2&camp=1634&creative=19450&creativeASIN=1590595009
When I expect:
http://www.amazon.co.uk/gp/product/1590595009?SubscriptionId=XXX&tag=wwwmydomain-12&linkCode=as2&camp=1634&creative=19450&creativeASIN=1590595009
How do I get the correct tag? (Note that SO rewrites the above links to their own Associate ID if you click either of the above)
I'm using Python and PyAWS 0.3.0, although I think the problem is with my request, rather than with the API wrapper.
(As an aside, The Amazon Associates Link Checker (U.K. store)/U.S. store is invaluable in testing these links)
Simple error in the end..... I was including the tag in the initial search:
for searchResult in
ecs.ItemSearch(item,
SearchIndex=index,
AssociateTag='wwwmydomain-12')
But not in the secondary loop that steps through each result getting more details:
for item in
ecs.ItemSearch(searchResult.ASIN,
ResponseGroup='Medium'):
should be:
for item in
ecs.ItemSearch(searchResult.ASIN,
ResponseGroup='Medium',
AssociateTag='wwwodbodycom-21'):
The tag is needed in both - it seems it's not carried over.

Categories