How can I get the XPath for the product rank on Amazon? - python

When I search for something on Amazon (in this example: "Jeans"), I get an overview of products. I want to scrape the sequence of the product rank.
To make this clearer, I attached a picture. I want to get the numbers back (1, 2, 3, 4, etc.).
Is this doable? I was hoping for an XPath, but I couldn't find anything relevant in the HTML.
Sorry, this is my first question; hopefully everything makes sense. I am using Python in combination with Scrapy for this task.
EDIT:
I think it is also possible to count the div elements. Does anyone have experience with that? [See picture 2.]

For Amazon I use this XPath:
xpath_results = "//h5/a"
It targets the "main" text of the products. If you know how to use XPath (via an XML tree, e.g. with lxml or BeautifulSoup), you will get a list back; iterate over it and you will know the order.
Your question was about XPath, so there you go.
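Since the question mentions Scrapy, here is a minimal sketch of how the ordering could be derived inside a spider. The //h5/a selector is the one from above; the spider name and search URL are placeholder assumptions, and the rank is simply each match's position in the result list:

import scrapy

class AmazonRankSpider(scrapy.Spider):
    name = "amazon_rank"
    start_urls = ["https://www.amazon.com/s?k=jeans"]  # placeholder search URL

    def parse(self, response):
        # //h5/a matches the main product link of each result (see above);
        # enumerate() yields the on-page order as the rank
        for rank, link in enumerate(response.xpath("//h5/a"), start=1):
            yield {
                "rank": rank,
                "title": link.xpath("normalize-space(.)").get(),
                "url": response.urljoin(link.xpath("@href").get(default="")),
            }

Note that such a selector is fragile: Amazon changes its markup regularly, so verify it against the current page before relying on it.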

Related

Python Mammoth Strange <a> elements within HTML headings

I just found the Mammoth Python package a couple of days ago and it's a great tool which really creates clean HTML code from a Word doc. It's nearly perfect. There is just one artifact I don't understand: the heading elements (h1-h6) it creates from the Word headings contain several <a> elements with strange TOC ids. It looks like this:
<h1><a id="_Toc48228035"></a><a id="_Toc48288791"></a><a id="_Toc48303673"></a><a id="_Toc48306159"></a><a id="_Toc48308644"></a><a id="_Toc48311128"></a><a id="_Toc48313611"></a>Arteriosklerose</h1>
Does anybody know how to get rid of these?
Thanks in advance
Cheers,
Peter
This is just a guess, but I hope it helps:
TOC most probably stands for "Table of Contents". When you want to jump to an element on a page (like a certain chapter), you give that element an ID and append #ID to the URL; that way the browser scrolls directly to that point.
I guess the document uses a table of contents with links in it, and when you inspect those links you will find something like <a href="#_Toc48228035">Arteriosklerose</a>.
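If the goal is simply to remove those empty bookmark anchors from Mammoth's output, one option is a post-processing pass over the generated HTML. Here is a minimal sketch with BeautifulSoup, assuming the anchors are empty <a> tags whose id starts with "_Toc" as in the example above (the filename is a placeholder):

import mammoth
from bs4 import BeautifulSoup

with open("document.docx", "rb") as docx_file:  # placeholder filename
    html = mammoth.convert_to_html(docx_file).value

soup = BeautifulSoup(html, "html.parser")
# Drop empty <a> bookmark anchors whose id starts with "_Toc"
for anchor in soup.find_all("a", id=lambda v: v and v.startswith("_Toc")):
    if not anchor.get_text(strip=True):
        anchor.decompose()

clean_html = str(soup)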

How can I fetch the number in a b tag through selenium-python?

I'm trying to get the number in all the <b> tags on this website. I want every single "qid" (question id), so I think I have to use qids = driver.find_elements_by_tag_name("b"), and based on other questions I've found I also need to implement a for loop and then print(qid.get_attribute("text")) for each element. But my code can't even seem to find elements with the <b> tag, since I keep getting a NoSuchElementException. The appearance of the website leads me to believe the content I'm looking for is inside an iframe, but I'm not sure whether that affects the functionality of my code.
Here's a screencap of the website for reference
The HTML isn't of much use because the tag is its only defining trait:
<b>13570etc...</b>
Any help is much appreciated.
You could try searching by XPath:
driver.find_elements_by_xpath("//b")
where // means "find all matching elements regardless of where they are in the document/current scope". Check out the XPath syntax here and play around with a few different options.
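Also note that find_elements (plural) never raises NoSuchElementException; it simply returns an empty list, so an empty result is another hint that you are searching in the wrong frame. If the content really sits inside an iframe, you need to switch into it first. A minimal sketch (the URL and the frame locator are assumptions; adjust them to the actual page):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder for the actual site

# Assumption: the content sits in the first iframe on the page
driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))

for element in driver.find_elements_by_xpath("//b"):
    print(element.text)  # or element.get_attribute("textContent")

driver.switch_to.default_content()  # switch back out of the iframe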

Extracting links from website with selenium bs4 and python

Okay, so: the heading might make it seem like this question has already been asked, but I had no luck finding an answer for it.
I need help with making a link-extracting program in Python.
Actually, it works: it finds all <a> elements on a webpage, takes their href="" values and puts them in an array, then exports them to a CSV file, which is what I want.
But there is one thing I can't get hold of.
The website is dynamic, so I am using the Selenium webdriver to get the JavaScript-rendered results.
The code for the program is pretty simple. I open a website with the webdriver and then get its content. Then I get all the links with
results = driver.find_elements_by_tag_name('a')
Then I loop through the results with a for loop and get each href with
result.get_attribute("href")
I store the results in an array and then print them out.
But the problem is that I can't get the name (the link text) of the links, e.g.
This leads to Google
Is there any way to get the 'This leads to Google' string?
I need it for every link that is stored in the array.
Thank you for your time
UPDATE!!!!!
It seems it only gets the names of dynamic links; I just noticed this, and it is really strange: for hard-coded items it returns an empty string, while for a dynamically generated link it returns its name.
Okay. So. The answer is that instead of using .text you should use get_attribute("textContent"). It works better than get_attribute("innerHTML").
Thanks KunduK for this answer. You saved my day :)
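For what it's worth, .text only returns text that is currently rendered, while the textContent DOM property contains the text nodes regardless of visibility, which would explain why it also works for the hard-coded links. Put together, a minimal sketch of the whole loop could look like this (the URL and CSV filename are placeholders):

import csv
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

rows = []
for result in driver.find_elements_by_tag_name("a"):
    href = result.get_attribute("href")
    name = result.get_attribute("textContent").strip()  # works for static links too
    rows.append((name, href))

with open("links.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

driver.quit()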

Scraping text values using Selenium with Python

For each vendor in an ERP system (total # of vendors = 800+), I am collecting its data and exporting this information as a pdf file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function, gather_vendors, is responsible for scraping and does this by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from:
Rows #2 and #3 both have string values (the confidential info is crossed out). But #3 returns null, and I don't understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames, but that did not work. I tried to scrape from edit mode, and that didn't work either. I was curious whether anyone has encountered a similar situation. It seems that no matter what I do, I can't scrape certain values… I'd appreciate any advice or insight into how I should proceed.
Thank you.
Why not try
find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and then use li.text to retrieve their text values? It's hard to tell what your actual output is beyond you saying it "returns null".
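As a quick sketch, that suggestion would look like this (assuming driver is already on the vendor's EFT Manager section):

panel = driver.find_element_by_class_name("panelList")
for li in panel.find_elements_by_tag_name("li"):
    print(repr(li.text))  # repr() makes empty strings visible in the output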
Try using visibility_of_element_located instead of presence_of_element_located.
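For context: presence_of_element_located only waits until the element exists in the DOM, while visibility_of_element_located also waits until it is actually rendered, which matters when the text is filled in after the page loads. A minimal sketch, borrowing the panelList locator from the previous answer as an assumption:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# Wait until the list is actually visible, not merely present in the DOM
panel = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "panelList")))
for li in panel.find_elements_by_tag_name("li"):
    print(li.text)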
Try to get textContent with JavaScript for the element; see: Given a (python) selenium WebElement can I get the innerText?
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via IDs (not XPath).
That allowed me to extract text from elements I couldn't extract from before.
You should change the way you locate the elements on the web page to IDs, since all the elements have a distinct ID provided. If you want to use XPaths, you should try a JavaScript function to find them.
E.g.
//span[text()='Bank Name']

How to get attribute using Selenium python without using .get_attribute()

This is the first question I've posted, so do let me know if I should make it clearer. Furthermore, I've only just started out with Python, so I hope I can phrase the question with the correct terms.
Basically, I have created a customizable web scraper that relies on the user's knowledge of CSS selectors. Users first go to the website they want to scrape, jot down the CSS selectors ("AA") of their desired elements and enter them in an Excel file; the Python script then reads these inputs, passes each one through browser.find_elements_by_css_selector("AA") and gets the relevant text through .text.encode('utf-8').
However, I noticed that sometimes there might be important information in an attribute value that should be scraped. I've looked around, and the suggestion is always to use .get_attribute().
1) Is there an alternative way to get attribute values using just browser.find_elements_by_css_selector("AA"), without browser.find_elements_by_css_selector("AA").get_attribute("BB")? Otherwise,
2) Is it possible for users to enter some value for "BB" in browser.find_elements_by_css_selector("AA").get_attribute("BB") such that only browser.find_elements_by_css_selector("AA") will run?
Yes, there is an alternative for retrieving text and attribute values without using the get_attribute() method. I am not sure whether it can be achieved through CSS, but through XPath it is possible, because XPath can address text and attribute nodes directly. A couple of examples:
//h3[@class="lvtitle"]/a/text()
/*/book[1]/title/@lang
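One caveat: Selenium's find_elements_by_xpath only returns element nodes, so XPaths ending in text() or an attribute step are usually evaluated with an HTML/XML parser instead. A minimal sketch with lxml on the page source, assuming browser is the WebDriver instance from the question and reusing the example XPaths above:

from lxml import html

tree = html.fromstring(browser.page_source)
titles = tree.xpath('//h3[@class="lvtitle"]/a/text()')  # returns text nodes as strings
langs = tree.xpath('/*/book[1]/title/@lang')            # returns attribute values as strings
print(titles, langs)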
