Xpath wildcard in Selenium to capture multiple instances of results - python

I'm trying to gather some data from a site using Python, Selenium and XPath. There are multiple data points I want, and they all follow this structure:
/tr[1]/td
/tr[2]/td
/tr[3]/td
/tr[4]/td
I do not know how many <tr> elements there are, so I am trying to search in a way that gives me all the results (ideally as a list). How do I do that?
Here is my actual code, but it only gives me individual results. I'm new to web scraping and unsure whether the issue is with my XPath (not using wildcards correctly) or with my get_attribute call (if it gets innerHTML, does it only get it for a single entry?).
data = driver.find_element_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr[5]/td').get_attribute("innerHTML")
print(data)

You should give find_elements_by_xpath a try.
I think, without seeing your full HTML, that this would work:
data = driver.find_elements_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
for element in data:
    print(element.get_attribute("innerHTML"))
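To see why dropping the row index returns every row, here is a minimal sketch that runs the same kind of path against a made-up table with the standard library's xml.etree.ElementTree instead of a live browser. The sample markup and cell values are assumptions for illustration only; the real page structure is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real page; only the nesting matters here.
SAMPLE = """
<div id="a-stockFinancials_tabs">
  <table>
    <tbody>
      <tr><td>10.5</td></tr>
      <tr><td>11.2</td></tr>
      <tr><td>9.8</td></tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)

# With an index on tr, only the cell in that one row matches.
one_text = [td.text for td in root.findall("./table/tbody/tr[2]/td")]

# Without the index, tr matches every row, so all cells come back as a list.
all_text = [td.text for td in root.findall("./table/tbody/tr/td")]

print(one_text)
print(all_text)
```

The same idea applies in Selenium: the plural find_elements call returns a list, and the path without tr[5] lets that list contain one entry per row.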

Related

How to webscrape the correct element from a stat tracking website (cod.tracker.gg) using Python

On this specific page (or any 'matches' page) there are names you can select to view individual statistics for a match. How do I grab the 'kills' stat for example using webscraping?
In most of the tutorials I use, web scraping seems simple. However, when inspecting this site, specifically the 'kills' item, you see something like
<span data-v-71c3e2a1 title="Kills" class="name">
Question 1.) What is the 'data-v-71c3e2a1'? I've never seen anything like this in my html,css, or webscraping tutorials. It appears in different variations all over the site.
Question 2.) More importantly, how do I grab the number of kills in this section? I've tried using scrapy and grabbing by xpath:
scrapy shell https://cod.tracker.gg/warzone/match/1424533688251708994?handle=PatrickPM
response.xpath("//*[@id="app"]/div[3]/div[2]/div/main/div[3]/div[2]/div[2]/div[6]/div[2]/div[3]/div[2]/div[1]/div/div[1]/span[2]").get()
but this raises a syntax error
response.xpath("//*[@id="app"]
SyntaxError: invalid syntax
Grabbing by response.css("").get() is also difficult. Should I be using selenium? Or just regular requests/bs4? Nothing I do can grab it.
Thank you.
Does this return the data you need?
import requests
endpoint = "https://api.tracker.gg/api/v1/warzone/matches/1424533688251708994"
r = requests.get(endpoint, params={"handle": "PatrickPM"})
data = r.json()["data"]
In any case, I suggest using an API if one is available. It's much easier than using BeautifulSoup or Selenium.
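As a side note on the SyntaxError in the question: the XPath string is wrapped in double quotes and also contains double quotes around app, so Python ends the string early. Using a different quote style on the outside avoids this. The sketch below shows both working quote combinations on a tiny made-up document, using the standard library's xml.etree.ElementTree rather than Scrapy; the markup and the value 7 are assumptions, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup loosely modelled on the snippet in the question.
SAMPLE = (
    '<html><body><div id="app">'
    '<span title="Kills">Kills</span><span class="value">7</span>'
    '</div></body></html>'
)
root = ET.fromstring(SAMPLE)

# Single quotes outside, double quotes inside the XPath.
app = root.find('.//*[@id="app"]')

# Double quotes outside, single quotes inside works equally well.
value = app.find("./span[@class='value']").text

print(app.tag, value)
```

The same quoting rule applies to response.xpath(...) in the Scrapy shell.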

Getting all elements in a page with specific span class python selenium

Hi, I am attempting to scrape multiple pages using Selenium in Python. I am interested in extracting all elements that fall within a particular span class: I would like to get the span elements with that class, then extract the link inside each one. For a single page I can do this with the full XPath, but the XPath changes for each object and for each page. Here is an example of what the web elements look like:
Essentially, I would like to extract the elements whose class is consistent across all the pages I will be scraping. So my idea is to get these elements and then get their href values. I have tried to get all the elements on the page using this code:
driver.find_elements_by_xpath("//span[@class='Text__StyledText-jknly0-0 cCEhaW']")
However, this has not worked and it returns nothing. I also do not want to use the inner class because it varies by page as well, so the only reliable element to use, if I want to automate the scraping without it getting too messy, is the one I mention. Is there any way to extract the links for these span elements on the page?
Try this XPath:
//span[contains(@class,'Text__StyledText')]//a[contains(@class,'Anchor__StyledAnchor')]
To actually grab that element, we use the following:
driver.find_elements_by_css_selector("span.Text__StyledText-jknly0-0.cCEhaW")
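The point of contains(@class, ...) is to match on the stable part of the class name and ignore the generated suffix. As a rough illustration of that partial-match idea without a browser, here is a sketch using the standard library's html.parser on a made-up snippet. This is a stand-in technique, not Selenium or XPath; the StyledLinkParser class, the sample markup, and the href values are all hypothetical.

```python
from html.parser import HTMLParser


class StyledLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside matching <span> tags."""

    def __init__(self, class_fragment):
        super().__init__()
        self.class_fragment = class_fragment
        self.span_stack = []  # one bool per open <span>: does its class match?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            # Substring test, similar in spirit to contains(@class, '...').
            self.span_stack.append(self.class_fragment in (attrs.get("class") or ""))
        elif tag == "a" and any(self.span_stack) and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()


# Hypothetical page fragment; the class suffixes differ between the spans.
SAMPLE = (
    '<div>'
    '<span class="Text__StyledText-jknly0-0 cCEhaW"><a href="/item/1">One</a></span>'
    '<span class="Other"><a href="/skip">Skip</a></span>'
    '<span class="Text__StyledText-abc123-0 zzTop"><a href="/item/2">Two</a></span>'
    '</div>'
)

parser = StyledLinkParser("Text__StyledText")
parser.feed(SAMPLE)
links = parser.links
print(links)
```

With Selenium, the equivalent would be to call get_attribute('href') on each element returned for the contains(@class, ...) XPath above.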

Scraping text values using Selenium with Python

For each vendor in an ERP system (total # of vendors = 800+), I am collecting its data and exporting this information as a pdf file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function, gather_vendors, is responsible for scraping and does this by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from:
For #2 and #3, both have string values (crossed out confidential info). But, #3 returns null. I don’t understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames but that did not work. I tried to scrape from edit mode and that didn’t work as well. I was curious if anyone ever encountered a similar situation. It seems as though no matter what I do I can’t scrape certain values… I’d appreciate any advice or insight into how I should proceed.
Thank you.
Why not try to use
find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and then use li.text to retrieve their text values? It's hard to tell what your actual output is, besides you saying it "returns null".
Try to use visibility_of_element_located instead of presence_of_element_located.
Try to get textContent with JavaScript for the element (see "Given a (python) selenium WebElement can I get the innerText?"):
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via ID's (not xpath).
That allowed me to extract text from elements I couldn't extract from before.
You should switch to locating the elements on the web page by ID, since each of them has a different id. If you want to use XPaths, then you could try the text() function to find them.
E.g.
//span[text()='Bank Name']

Selenium WebDriver Very Slow to Append WebElement Data to List

I'm trying to store webelement content to a python list. While it works, it's taking ~15min to process ~2,000 rows.
# Grab webelements via xpath
rowt = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th[@class='listing-title']")
rowl = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/td[@class='listing-location']")
rowli = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th/a")
title = []
location = []
link = []
# Add webElement strings to lists
print('Compiling list...')
[title.append(i.text) for i in rowt]
[location.append(i.text) for i in rowl]
[link.append(i.get_attribute('href')) for i in rowli]
Is there a faster way to do this?
Your solution parses the table three times: once for the titles, once for the locations, and once for the links.
Try parsing the table just once. Have a selector for the row, then loop through the rows, and for each row extract the 3 elements using a relative path. For the link, it would look like this:
link.append(row.find_element_by_xpath("./th/a").get_attribute('href'))
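The row-relative pattern described above can be sketched on a static snippet with the standard library's xml.etree.ElementTree. The sample table, job titles, and hrefs are made up, and this only illustrates the lookup structure (one row selector, then paths starting with ./ inside each row); it says nothing about timing in Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical table using the class names from the question.
SAMPLE = """
<table>
  <tbody class="table-body">
    <tr>
      <th class="listing-title"><a href="/jobs/1">Engineer</a></th>
      <td class="listing-location">Berlin</td>
    </tr>
    <tr>
      <th class="listing-title"><a href="/jobs/2">Analyst</a></th>
      <td class="listing-location">Oslo</td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(SAMPLE)
title, location, link = [], [], []

# Select each row once, then resolve every field relative to that row.
for row in root.findall("./tbody[@class='table-body']/tr"):
    anchor = row.find("./th[@class='listing-title']/a")
    title.append(anchor.text)
    location.append(row.find("./td[@class='listing-location']").text)
    link.append(anchor.get("href"))

print(title, location, link)
```

A side benefit of this layout is that the three lists stay aligned row by row even if a page later gains rows with a different shape.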
Suggestions (apologies if it’s not helpful):
I think pandas can load HTML tables directly. If your intent is to scrape a table, then libraries like bs4 might also come in handy.
You can store the entire HTML and then parse it using regex, because all the data you are extracting will be enclosed in a fixed set of HTML tags.
Depending on what you're trying to do, if the server that is presenting the page has an API, it would likely be significantly faster for you to use that to retrieve the data, rather than scraping the content from the page.
You could use the browser tools to see what the different requests are being sent to the server, and perhaps the data is being returned in a JSON form that you can easily retrieve your data from.
This, of course, assumes that you're interested in the data, not in verifying the content of the page directly.
I guess the slowest one is [location.append(i.text) for i in rowl].
When you call i.text, Selenium needs to determine what will be displayed in that element, so it needs more time to process.
You can use a workaround i.get_attribute('innerText') instead.
[location.append(i.get_attribute('innerText')) for i in rowl]
However, I can't guarantee that the result will be the same. (It should be the same as or similar to .text.)
I've tested this on my machine with ~2000 rows: i.text took 80 sec, while i.get_attribute('innerText') took 28 sec.
Using bs4 would definitely help.
Even if you may have to find elements again using bs4, it was still faster to use bs4.
I'd like to suggest you try bs4.
I.e., code like this would work:
import bs4
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elements = soup.find_all(...)
for element in elements:
    # Do some job using the attribute you are after
    value = element['target attribute']

Parsing HTML with XPath, Python and Scrapy

I am writing a Scrapy program to extract the data.
This is the url, and I want to scrape 20111028013117 (code) information. I have taken XPath from FireFox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
While I am trying to execute this
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print("temp_list:" + str(temp_list))
except:
    print("error")
It returns an empty list. I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy; even though I have handled issues well on other projects, this one seems a bit difficult.
Your XPath doesn't work because of tbody. Remove it and check whether you get the result you want.
You can read this in scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to
tables. Scrapy, on the other hand, does not modify the original page
HTML, so you won’t be able to extract any data if you use <tbody> in
your XPath expressions.
I see that the element you are hunting for is inside a <table>.
Firefox adds a tbody tag to every table, even if it does not exist in the source HTML.
That might be the reason your XPath query works in the browser but fails in Scrapy.
As suggested, use other anchors in your xpath query.
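The tbody effect is easy to reproduce on a small scale. The sketch below uses the standard library's xml.etree.ElementTree on a made-up one-row table with no <tbody> in the source, which is how a server might send it. The markup is an assumption for illustration; only the code value comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw HTML as served: the table has no <tbody> element.
RAW = (
    "<html><body><table>"
    "<tr><td>Code</td><td>20111028013117</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(RAW)

# A path copied from the browser includes tbody and matches nothing here.
with_tbody = root.findall("./body/table/tbody/tr/td")

# The same path without tbody matches the cells in the raw source.
without_tbody = [td.text for td in root.findall("./body/table/tr/td")]

print(with_tbody)
print(without_tbody)
```

An empty list from the first query is the same symptom as the empty temp_list in the question.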
You can extract data with more ease using more robust XPaths instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This will match the <font> tag containing "Code", then go to the td tag above it and select the next td -> font, which contains the code you are looking for.
Have you tried removing a few node tags at the end of the query, and re-running until you get a result? Do this several times until you get something, then add items back in cautiously until the query is rectified.
Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.
