I am trying to scrape some data from a web page using Selenium. I have successfully got Selenium working headlessly on a Raspberry Pi: I can connect to the page I want to scrape, return its title, and return the URL I am connected to.
I have been looking at examples in tutorials on how to scrape data and they all go something like this:
titles_element = browser.find_elements_by_xpath("//a[@class='text-bold']")
However, every piece of data on the webpage I am trying to scrape has the same class name. As a first example, I'm trying to get the value of wins, which is 4. The markup for Data 1 looks roughly like this:

<div>
    <span title="Wins">Wins</span>
    <span class="value">4</span>
</div>
And a second example of the data I'm trying to scrape, in this case kills, where the value is 559. The markup for Data 2 looks roughly like this:

<div>
    <span title="Kills">Kills</span>
    <span class="value">559</span>
</div>
Both numbers I am trying to scrape share the same class name, so I can't simply scrape by class.
What is the best way of scraping this data?
titles_element = browser.find_elements_by_xpath(...)
I think you can do something like this for Data 1 (this is the expression that goes inside the parentheses):
//div/span[@title="Wins"]/following-sibling::span[@class="value"]/text()
and similarly for Data 2:
//div/span[@title="Kills"]/following-sibling::span[@class="value"]/text()
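In the Selenium script itself, a minimal sketch might look like this (assuming the rough markup above; note that find_element_by_xpath must return an element, so the trailing /text() is dropped and the text is read via .text in Python):

wins = browser.find_element_by_xpath('//div/span[@title="Wins"]/following-sibling::span[@class="value"]')
print(wins.text)  # expected: 4

kills = browser.find_element_by_xpath('//div/span[@title="Kills"]/following-sibling::span[@class="value"]')
print(kills.text)  # expected: 559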
I used the following as references:
XPath: how to select elements that are related to other on the same level
and tested your code to see the XPath result with this:
XPath Tester / Evaluator
You could use CSS attribute = value selectors to target the preceding sibling by its title attribute, then use the adjacent sibling combinator (+) to move to the adjacent sibling and grab the desired values:
browser.find_element_by_css_selector('[title=Kills] + .value').text
browser.find_element_by_css_selector('[title=Wins] + .value').text
Hi, I am attempting to scrape multiple pages using Selenium in Python. I am interested in extracting all elements that fall within a span class element; basically, I would like to get the span class elements and then extract the link within each. For each page it is possible to achieve this by using the XPath; however, the XPath changes for each object and for each page. Here is an example of what the web elements look like:
Essentially, I would like to extract those elements, since they are consistent across all the pages I will be scraping, and then get the href attributes from them. I have tried to get all the elements on the page using this code:
driver.find_elements_by_xpath("//span[#class='Text__StyledText-jknly0-0 cCEhaW']")
However, this has not worked and it returns nothing. I also do not want to use the inner class because it varies by page as well, so the only real element to use, if I want to automate the scraping without getting too messy, is the one I mention. Is there any way to extract the links from these span class elements on the page?
Try this XPath:
//span[contains(@class,'Text__StyledText')]//a[contains(@class,'Anchor__StyledAnchor')]
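A short sketch of how the links might then be collected (a hedged example; driver is the existing webdriver, and the Anchor__StyledAnchor class is taken from the XPath above):

anchors = driver.find_elements_by_xpath(
    "//span[contains(@class,'Text__StyledText')]//a[contains(@class,'Anchor__StyledAnchor')]")
# pull the href attribute out of each matched anchor
links = [a.get_attribute('href') for a in anchors]
print(links)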
To actually grab those elements, we use the following:
driver.find_elements_by_css_selector("span.Text__StyledText-jknly0-0.cCEhaW")
I am trying to scrape the data contained in a table on https://www.bop.gov/coronavirus/. However, when one first visits the page, the table is hidden behind a link (https://www.bop.gov/coronavirus/#) that leads to the same page but expands the hidden table. I cannot find this link within the webpage's source code, or use Selenium to click it, in order to expand the table and scrape its data. How can I go about accessing the data in this table using Python?
The endpoint from which the data is loaded on the page is available under the network tab of the developer tools. The data you need is loaded from
https://www.bop.gov/coronavirus/json/final.json
You might also want to take a look at
https://www.bop.gov/coronavirus/data/locations.json
as the first link only contains the short codes for the names.
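A minimal sketch of fetching both feeds with requests (the URLs are from the answer above; the exact JSON structure is not shown here, so inspect the responses before relying on any particular keys):

import requests

# raw table data, keyed by short facility codes
totals = requests.get('https://www.bop.gov/coronavirus/json/final.json').json()
# mapping data that includes the full names behind those short codes
locations = requests.get('https://www.bop.gov/coronavirus/data/locations.json').json()

print(totals)
print(locations)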
The table data is readily available under the div with id="totals_breakdown".
You can directly call the page_source and parse the data for that element with BeautifulSoup without needing to "show" the element.
If you MUST show the element for some reason, you simply have to remove the class closed from the div with id="totals_breakdown".
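Here is a hedged sketch of both routes (assuming driver has already loaded https://www.bop.gov/coronavirus/):

from bs4 import BeautifulSoup

# route 1: parse the hidden table straight out of the page source
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find(id='totals_breakdown')
print(table.get_text())

# route 2: actually reveal the element by stripping the 'closed' class
driver.execute_script(
    "document.getElementById('totals_breakdown').classList.remove('closed');")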
Currently I use Python with Selenium for scraping. There are many ways in Selenium to locate data, and I have been using CSS selectors.
But then I realised that tag names are the only thing that is always present on websites.
For example, not every website uses classes or IDs; take Wikipedia as an example: it mostly uses plain tags, like <h1> and <a>, without any class or id on them.
That is where the limitation of scraping by tag name comes in: it scrapes every element under the given tag.
For example: if I want to scrape table contents which sit under a <p> tag, then it scrapes the table contents as well as all the descriptions, which are not needed (see the sketch below).
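A hedged illustration of the problem on an arbitrary page (driver is an existing webdriver; nothing here is specific to one site):

# find_elements_by_tag_name matches EVERY <p> on the page, so the
# table-related text comes back mixed with unrelated descriptions
paragraphs = driver.find_elements_by_tag_name('p')
for p in paragraphs:
    print(p.text)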
My question is: is it possible to scrape only the required elements under a tag, without copying every other element under that same tag?
For example, if I want to scrape content from, say, Amazon, it should select only the product names inside <h1> tags, not all the other <h1> headings which are not product names.
If you know of any other method/locator to use, even something other than tag names, you can tell me, but the condition is that it must be present on every website, or at least most websites.
Any help would be appreciated 😊...
I am new to Scrapy, and I am trying to extract Google News results from the link given below:
https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966
"cholera" key word was provided that shows small blocks of various news associated with cholera key world further I try this with scrapy to extract the each block that contents individual news.
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
response.css(".ts._JGs._KHs._oGs._KGs._jHs::text").extract()
where .ts._JGs._KHs._oGs._KGs._jHs::text represents the div class="ts _JGs _KHs _oGs _KGs _jHs" for each block of news.
But it returns nothing.
After struggling, I found a way to scrape the desired data with a very simple trick:
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
and the CSS selector for class="g" can be used to extract each desired block like this:
response.css(".g").extract()
which returns a list of all the individual news blocks; these can then be accessed by list index like this:
response.css(".g").extract()[0]
or
response.css(".g").extract()[1]
In the Scrapy shell, use view(response) and you will see in a web browser what fetch() retrieved.
Google uses JavaScript to display data, but it can also send a page which doesn't use JavaScript. However, the page without JavaScript usually has different tags and classes.
You can also turn off JavaScript in your browser and then open Google to see those tags.
Try this:
response.css('#search td ::text').extract()
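For completeness, a hedged Scrapy shell session combining the two tips above (the #search td selector assumes the no-JavaScript version of the results page):

fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
view(response)  # opens the fetched (non-JavaScript) page in a browser
response.css('#search td ::text').extract()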
I'm trying to gather some data from a site using Python, Selenium and XPath. There are multiple datapoints I want, and they are all in this structure:
/tr[1]/td
/tr[2]/td
/tr[3]/td
/tr[4]/td
I do not know how many <tr>'s there are, so I am trying to search in a way that just gives me all results (hopefully in a list). How do I do that?
Here is my actual code, but it is only giving me individual results. I'm new to web scraping and unsure whether the issue is with my XPath (not doing wildcards correctly) or with my get_attribute call: if it's getting innerHTML, is it only getting it for a single entry?
data = driver.find_element_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr[5]/td').get_attribute("innerHTML")
print(data)
You should give find_elements_by_xpath a try.
I think, without seeing your full HTML, that this would work:
data = driver.find_elements_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
for element in data:
    print(element.get_attribute("innerHTML"))
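If only the visible cell text is needed, a hedged variant of the same loop reads .text instead of innerHTML:

cells = driver.find_elements_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
# .text gives the rendered text of each cell rather than its raw HTML
values = [cell.text for cell in cells]
print(values)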