Hi, I am attempting to scrape multiple pages using Selenium in Python. I am interested in extracting all elements that fall within a given span class: basically, I would like to get the span elements and then extract the link within each one. For each page this is possible using the full XPath, but that XPath changes for each object and for each page. Here is an example of what the web elements look like:
Essentially, I would like to extract these elements, since the class is consistent across all the pages I will be scraping. So my idea is to get these elements and then get the href attribute for each. I have tried to get all the elements on the page using this code:
driver.find_elements_by_xpath("//span[@class='Text__StyledText-jknly0-0 cCEhaW']")
However, this has not worked and returns nothing. I also do not want to use the inner class because it varies by page as well, so the only reliable hook for automating the scraping without getting too messy is the element I mentioned. Is there any way to extract the links for these span class elements on the page?
Try this XPath:
//span[contains(@class,'Text__StyledText')]//a[contains(@class,'Anchor__StyledAnchor')]
To actually grab those elements, we use the following:
driver.find_elements_by_css_selector("span.Text__StyledText-jknly0-0.cCEhaW")
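Putting the two together, here is a minimal sketch (assuming Selenium 3-style helper methods and a placeholder URL) that collects the links, matching on the stable class prefixes rather than the per-page suffixes:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/page-to-scrape")  # placeholder URL

# Match the stable class prefixes rather than the per-page suffixes
anchors = driver.find_elements_by_xpath(
    "//span[contains(@class,'Text__StyledText')]"
    "//a[contains(@class,'Anchor__StyledAnchor')]"
)
links = [a.get_attribute("href") for a in anchors]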
The website is marinetraffic.com.
An example of the search results is below; the first search result returned is not appropriate, but the 2nd and 4th results are.
What identifies these results is within the div classes jss90 and jss89 respectively, shown below.
Using something like the below returns nothing, however:
browser.find_elements(By.XPATH, "//div[contains(@class, 'jss90')]")
The aim in this example is to find search results that match ATLANTICA between the jss90 tags and contain Bulk Carrier between the jss89 tags, append each match to a list, then .click() the first one in the list.
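A purely illustrative sketch of that flow follows; the actual DOM around the jss90/jss89 divs was not shown, so the container element and the ancestor relationship here are assumptions:

matches = []
for name_div in browser.find_elements(By.XPATH, "//div[contains(@class, 'jss90')]"):
    if "ATLANTICA" not in name_div.text:
        continue
    # Assumed: each result row is the nearest ancestor div with a jss class
    row = name_div.find_element(By.XPATH, "./ancestor::div[contains(@class, 'jss')][1]")
    if "Bulk Carrier" in row.find_element(By.XPATH, ".//div[contains(@class, 'jss89')]").text:
        matches.append(row)
if matches:
    matches[0].click()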
If content is dynamically generated on the client side, then you might need to add some delay to allow the page enough time to load the elements you want to scrape.
Take a look here for multiple methods to introduce delays in your Selenium scraper:
https://www.browserstack.com/guide/selenium-wait-for-page-to-load
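For example, a minimal sketch using an explicit wait, reusing the XPath from the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one matching div to be present
wait = WebDriverWait(browser, 10)
results = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class, 'jss90')]"))
)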
I am trying to scrape some data from a web page using Selenium. I have successfully got Selenium working headlessly on a Raspberry Pi; I can connect to the web page I am trying to scrape, return the title of the page, and return the URL I am connected to.
I have been looking at examples in tutorials on how to scrape data and they all go something like this:
titles_element = browser.find_elements_by_xpath("//a[@class='text-bold']")
However, every piece of data in the web page I am trying to scrape has the same class name. Here is an example of the first bit of data I'm trying to scrape, where I want the value of wins, which is 4:
Data 1
And a second example of the data I'm trying to scrape, which in this case is kills, with a value of 559:
Data 2
Both numbers I am trying to scrape share the same class name, so I can't simply scrape by class.
What is the best way of scraping this data?
titles_element = browser.find_elements_by_xpath(...)
I think you can do something like this for Data 1 (inside the parentheses):
/div/span[#title="Wins"]/following-sibling::span[#class="value"]/text()
and similarly for Data 2:
/div/span[#title="Kills"]/following-sibling::span[#class="value"]/text()
I used the following as references:
XPath: how to select elements that are related to other on the same level
and tested your code to see the XPath result with this:
XPath Tester / Evaluator
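As a hedged sketch of using these expressions from Selenium itself: WebDriver cannot return bare text() nodes, so drop the trailing /text() and read .text on the matched element instead (// is used here so the expressions are not anchored to the document root):

wins = browser.find_element_by_xpath(
    '//div/span[@title="Wins"]/following-sibling::span[@class="value"]'
).text
kills = browser.find_element_by_xpath(
    '//div/span[@title="Kills"]/following-sibling::span[@class="value"]'
).text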
You could use CSS attribute = value selectors to target the preceding sibling by its title attribute, then use the adjacent sibling combinator (+) to move to the adjacent sibling and grab the desired values:
find_element_by_css_selector('[title=Kills] + .value').text
find_element_by_css_selector('[title=Wins] + .value').text
Currently I use Python with Selenium for scraping purposes. There are many ways in Selenium to scrape data, and I used to use CSS selectors.
But then I realised that tag names are the only thing that is always present on a website.
For example,
Not every website uses classes or IDs. Take Wikipedia as an example: it mostly uses bare tags, like <h1> or <a>, without any class or id on them.
That is where the limitation of scraping by tag name comes in: it scrapes every element under that tag.
For example: if I want to scrape table contents that appear under a <p> tag, then it scrapes the table contents as well as all the descriptions, which are not needed.
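To illustrate, a minimal sketch (hypothetical page structure) of scoping a tag-name search to a parent element so that only its descendants are returned:

# Unscoped: returns every <td> on the page
all_cells = driver.find_elements_by_tag_name("td")

# Scoped: first find the table of interest, then search only inside it
table = driver.find_element_by_tag_name("table")  # hypothetical: assumes one table
cells = table.find_elements_by_tag_name("td")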
My question is: is it possible to scrape only the required elements under a tag, without copying every other element under that tag?
For instance, if I want to scrape content from, say, Amazon, it would select only the product names under <h1> tags, not all the other headings under <h1> tags that are not product names.
If you know of any other method/locator to use, even one other than tag names, you can tell me as well. The only condition is that it must be present on every website, or at least most websites.
Any help would be appreciated 😊...
I'm using the scrapy shell to grab all of the links in the subcategories section of this site: https://www.dmoz.org/Computers/Programming/Languages/Python/.
There's probably a more efficient XPath, but the one I came up with was:
//div[#id="subcategories-div"]/section/div/div/a/#href
As far as I can tell from the page source, there is only one div element with an [@id="subcategories-div"] attribute, so from there I narrow down until I find the link's href. This works when I search for this XPath in Chrome.
But when I run
response.xpath('//div[@id="subcategories-div"]/section/div/div/a/@href').extract()
in scrapy, it gives me back the links I'm looking for, but then for some reason, it also returns links from //*[@id="doc"]/section[8]/div/div[2]/a
Why is this happening, since nowhere in this path is there a div element with an [@id="subcategories-div"] attribute?
I can't seem to find any id with the name doc on the page you are trying to scrape. You might not have set a starting point for your response.xpath. Do you get the same result if you change it to something like:
response.xpath('//*[@id="subcategories-div"]/section/div/div/a/@href').extract()
I'm trying to gather some data from a site using Python, Selenium and XPath. There are multiple data points I want, and they are all in this structure:
/tr[1]/td
/tr[2]/td
/tr[3]/td
/tr[4]/td
I do not know how many <tr>'s there are, so I am trying to search in a way that just gives me all results (hopefully in a list). How do I do that?
Here is my actual code, but it is only giving me individual results. I'm new to web scraping and unsure whether the issue is with my XPath (not doing wildcards correctly) or whether it's related to my get_attribute call (if it's getting innerHTML, is it only getting it for the single entry?):
data = driver.find_element_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr[5]/td').get_attribute("innerHTML")
print data
You should give find_elements_by_xpath a try.
I think, without seeing your full HTML, that this would work:
data = driver.find_elements_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
for element in data:
    print element.get_attribute("innerHTML")
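Note that in Selenium 4 the find_element(s)_by_* helpers were removed, so on current versions the equivalent sketch would be:

from selenium.webdriver.common.by import By

data = driver.find_elements(By.XPATH, '//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
for element in data:
    print(element.get_attribute("innerHTML"))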