The website is marinetraffic.com.
An example of the search results is below: the first result returned is not appropriate, but the 2nd and 4th results are.
What identifies these results is within div class jss90 and jss89 respectively, shown below.
However, using something like the following returns nothing:
browser.find_elements(By.XPATH, "//div[contains(@class, 'jss90')]")
The aim in this example is to find search results that match ATLANTICA between the jss90 tags and contain Bulk Carrier between the jss89 tags, append each match to a list, then .click() the first one in the list.
If content is dynamically generated on the client side, then you might need to add some delay to allow the page enough time to load the elements you want to scrape.
Take a look here for multiple methods to introduce delays in your selenium scraper:
https://www.browserstack.com/guide/selenium-wait-for-page-to-load
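For example, a minimal sketch combining an explicit wait with the matching described above; the jss90/jss89 class names come from the question, but the assumption that the jss89 div is a following sibling of the jss90 div is mine and may need adjusting to the real markup:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the result name divs (jss90) to be rendered
WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class, 'jss90')]"))
)

matches = []
for name in browser.find_elements(By.XPATH, "//div[contains(@class, 'jss90')]"):
    if "ATLANTICA" in name.text:
        # Assumption: the vessel-type div (jss89) is a sibling of the name div
        vessel_type = name.find_element(
            By.XPATH, "following-sibling::div[contains(@class, 'jss89')]")
        if "Bulk Carrier" in vessel_type.text:
            matches.append(name)

if matches:
    matches[0].click()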
Hi, I am attempting to scrape multiple pages using Selenium in Python. I want to extract all elements that fall within a span class element; basically, I would like to get the span class elements and then extract the link within each. For each page this can be achieved by using the XPath; however, the XPath changes for each object and for each page. Here is an example of what the web elements look like:
Essentially, I would like to extract these elements, as this is consistent across all the pages I will be scraping. So my idea is to get these elements and then get the href attributes for them. I have tried to get all the elements on the page using this code:
driver.find_elements_by_xpath("//span[@class='Text__StyledText-jknly0-0 cCEhaW']")
However, this has not worked and it returns nothing. I also do not want to use the inner class because it varies by page as well, so the only real hook, if I want to automate the scraping without getting too messy, is the element I mention. Is there any way to extract the links for these span class elements on the page?
Try this XPath:
//span[contains(@class,'Text__StyledText')]//a[contains(@class,'Anchor__StyledAnchor')]
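For example, to collect every matching link into a list (a sketch using the same pre-Selenium-4 API as the question):

links = []
for anchor in driver.find_elements_by_xpath(
        "//span[contains(@class,'Text__StyledText')]//a[contains(@class,'Anchor__StyledAnchor')]"):
    links.append(anchor.get_attribute("href"))  # href of each matched anchor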
To actually grab that element, we use the following:
driver.find_elements_by_css_selector("span.Text__StyledText-jknly0-0.cCEhaW")
I am trying to write a Selenium test, but the issue is that the page is generated with PrimeFaces, so the element IDs change randomly from time to time. Relying on anything other than IDs does not seem very reliable. Is there anything I can do?
Not having meaningful stable IDs is not a problem, as there are always alternative ways to locate elements on a page. Just to name a few options:
partial id matches with XPath or CSS, e.g.:
# contains
driver.find_element_by_css_selector("span[id*=customer]")
driver.find_element_by_xpath("//span[contains(#id, 'customer')]")
# starts with
driver.find_element_by_css_selector("span[id^=customer]")
driver.find_element_by_xpath("//span[starts-with(#id, 'customer')]")
# ends with
driver.find_element_by_css_selector("span[id$=customer]")
classes which carry some valuable information about the data ("data-oriented locators"):
driver.find_element_by_css_selector(".price")
driver.find_element_by_class_name("price")
going sideways from a label:
# <label>Price</label><span id="65123safg12">10.00</span>
driver.find_element_by_xpath("//label[.='Price']/following-sibling::span")
links by link text or partial link text:
driver.find_element_by_link_text("Information")
driver.find_element_by_partial_link_text("more")
And you can, of course, get creative and combine them (a combined-locator sketch follows the links below). There are more:
Locating Elements
There is also this relevant thread which goes over best practices when choosing a method to locate an element on a page:
What makes a good selenium locator?
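As an illustration of combining strategies, here is a hypothetical sketch (the "Total" label and the price class prefix are made up for the example) that mixes the label-sibling approach with a partial class match:

# hypothetical: the <span> following a "Total" label whose class starts with "price"
driver.find_element_by_xpath(
    "//label[.='Total']/following-sibling::span[starts-with(@class, 'price')]")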
For each vendor in an ERP system (total # of vendors = 800+), I am collecting its data and exporting this information as a pdf file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function, gather_vendors, is responsible for scraping and does this by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from:
Rows #2 and #3 both have string values (confidential info is crossed out), but #3 returns null. I don't understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames, but that did not work. I tried to scrape from edit mode, and that didn't work either. I was curious whether anyone has encountered a similar situation. It seems as though no matter what I do I can't scrape certain values… I'd appreciate any advice or insight into how I should proceed.
Thank you.
Why not try using
find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and then use li.text to retrieve their text values? It's hard to tell what your actual output is beyond you saying "returns null".
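For example (the panelList class is taken from the suggestion above):

items = driver.find_element_by_class_name("panelList").find_elements_by_tag_name("li")
for li in items:
    print(li.text)  # an empty string here often means the element is not visible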
Try to use visibility_of_element_located instead of presence_of_element_located
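For example, a minimal sketch (the element ID is borrowed from the neighboring answer as a placeholder):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# visibility_of_element_located requires the element to be displayed
# (non-zero size, not hidden), not merely present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "txtTemp_creditor_agent_bic"))
)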
Try to get the textContent of the element with JavaScript, as in Given a (python) selenium WebElement can I get the innerText?:
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via IDs (not XPath).
That allowed me to extract text from elements I couldn't extract from before.
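In other words, something like this (the ID is the one from the earlier snippet):

element = driver.find_element_by_id("txtTemp_creditor_agent_bic")
print(element.text)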
You should change the way you extract the elements on the web page to IDs, since all the fields have distinct IDs provided. If you want to use XPath, then you should try finding the elements via a JavaScript function.
E.g.:
//span[text()='Bank Name']
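If "the JavaScript function" refers to evaluating an XPath in the browser itself, a sketch using the standard DOM document.evaluate API could look like this:

element = driver.execute_script("""
    return document.evaluate(
        "//span[text()='Bank Name']",
        document, null,
        XPathResult.FIRST_ORDERED_NODE_TYPE, null
    ).singleNodeValue;
""")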
I'm trying to scrape the site ketabejam.ir.
I'm using Python 3.4.1, and for parsing I use lxml 3.4.1.
By the way, I parsed the page with the lxml.html.fromstring method.
When I load the document in my interpreter and run the following query to get the number of pages, so I can handle pagination:
s = doc.xpath("//*[@id='page']")
surprisingly I get this result:
>>> len(s) == 2
True
I got the address of the element from Firebug's minimal XPath; when I choose the normal XPath, the query runs smoothly.
Is it a bug, or am I doing something wrong?
You can work around this in general by always doing something like:
s = doc.xpath("(//*[#id='page'])[1]")
...if you know you really just want the first node that matches, and can safely ignore any subsequent ones (which seems like a safe bet in this case).
Looking at the page source for the page you linked, there are exactly two elements with that id on the page: most probably one at the top of the table and the other at the bottom of the table.
Firebug's "copy minimal XPath" option works based on the id of the element. It is only available for elements that have an id attribute, and it creates an XPath in the format
//*[@id="elementID"]
Which is what you are getting.
Ideally, in every HTML page there should be only one element with a particular id; that is, ids should be unique across the page. It seems Firebug's minimal XPath depends on that.
In your context, I think both elements return the same link, so you can use either to continue your scraping. Or, as you indicated, you can use the normal XPath for that.
I'm trying to gather some data from a site using Python, Selenium and XPath. There are multiple data points I want, and they are all in this structure:
/tr[1]/td
/tr[2]/td
/tr[3]/td
/tr[4]/td
I do not know how many <tr>'s there are, so I am trying to search in a way that just gives me all results (hopefully in a list). How do I do that?
Here is my actual code, but it is only giving me individual results. I'm new to web scraping and unsure whether the issue is with my XPath (am I not doing wildcards correctly?) or whether it's related to my get_attribute call: if it's getting innerHTML, is it only getting it for the single entry?
data = driver.find_element_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr[5]/td').get_attribute("innerHTML")
print data
You should give find_elements_by_xpath a try.
I think, without seeing your full HTML, that this would work:
data = driver.find_elements_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
for element in data:
    print element.get_attribute("innerHTML")