I'm currently in the process of researching scraping and I've been following a tutorial on YouTube. The tutorial uses Scrapy, and I've managed to scrape data from the website shown in the tutorial. However, now I've tried scraping another website with no success.
From my understanding, the problem is with the XPath I'm using. I've tried several XPath testing/generator websites, with no success.
This is the HTML in question:
<div class="price" currentmouseover="94">
<del currentmouseover="96">
<span class="woocommerce-Price-amount amount" currentmouseover="90"><span class="woocommerce-Price-currencySymbol">€</span>3.60</span>
</del>
<ins><span class="woocommerce-Price-amount amount" currentmouseover="123"><span class="woocommerce-Price-currencySymbol" currentmouseover="92">€</span>3.09</span></ins></div>
I'm currently using the following code:
def parse(self, response):
    for title in response.xpath("//div[@class='Price']"):
        yield {
            'title_text': title.xpath(".//span[@class='woocommerce-Price-amount amount']/text()").extract_first()
        }
I've also tried using //span[@class='woocommerce-Price-amount amount'].
I want my output to be '3.09', but instead I'm getting null when I export it to a JSON file. Can someone point me in the right direction?
Thanks in advance.
Update 1:
I've managed to fix the problem with Jack Fleeting's answer. Since I've had trouble understanding XPath, I've been trying different websites to get a better sense of how it works. Unfortunately, I'm stuck on another example.
<div class="add-product"><strong><small>€3.11</small> €3.09</strong></div>
I'm using the following snippet:
l.add_xpath('price', ".//div[@class='add-product']/strong[1]")
My expectation is to output 3.09; however, I'm getting both numbers. I tried using a min() function, since I want to output the actual (discounted) price of the item, but XPath 1.0 does not support it.
Try this XPath expression and see if it works:
//div[@class='price']/ins/span
Note that price is lower case, as in your HTML.
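In a spider that might look something like this (a sketch; note the added /text() to pull just the number, and get() is the newer spelling of extract_first()):
def parse(self, response):
    for product in response.xpath("//div[@class='price']"):
        yield {
            # text() on the outer span skips the nested currency-symbol
            # span, so this should yield '3.09' rather than the full '€3.09'
            'title_text': product.xpath("./ins/span/text()").get(),
        }
For the add-product case in your update, the same idea should apply: .//div[@class='add-product']/strong/text() selects only strong's own text nodes, skipping the <small> element, so it should give you just the discounted €3.09 (with some surrounding whitespace to strip).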
I am having issues extracting data from the table below.
https://tirewheelguide.com/sizes/perodua/myvi/2019/
I want to extract the sizes; in this example it would be the 175/65 SR14:
<a style="text-decoration: underline;" href="https://tirewheelguide.com/tires/s/175-65-14/">175/65 SR14 </a>
Using the scrapy shell,
response.xpath('/html/body/div[2]/table[1]/tbody/tr[1]/td[1]/a[1]/text()').get()
yields nothing.
Do you know what I am doing wrong?
There is a problem with your XPath. Instead of this:
response.xpath('/html/body/div[2]/table[1]/tbody/tr[1]/td[1]/a[1]/text()').get()
use this:
response.xpath('//table[1]//td//a/text()').get()
Some websites don't build their tables properly, which is why my XPath doesn't use the full html/body/div path. There was also a problem with tr: the website creates multiple tr elements in the same row, and that breaks your absolute path. If you use the XPath I posted, it will work fine.
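You can verify this quickly in the scrapy shell (a sketch; the extracted text may carry trailing whitespace you'll want to strip):
scrapy shell https://tirewheelguide.com/sizes/perodua/myvi/2019/
response.xpath('//table[1]//td//a/text()').get()     # first size, should be '175/65 SR14 '
response.xpath('//table[1]//td//a/text()').getall()  # every size link in the first table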
On this specific page (or any 'matches' page) there are names you can select to view individual statistics for a match. How do I grab the 'kills' stat, for example, using web scraping?
In most of the tutorials I've used, the web scraping seems simple. However, when inspecting this site, specifically the 'kills' item, you see something like
<span data-v-71c3e2a1 title="Kills" class="name">
Question 1.) What is the 'data-v-71c3e2a1'? I've never seen anything like this in my HTML, CSS, or web-scraping tutorials. It appears in different variations all over the site.
Question 2.) More importantly, how do I grab the number of kills in this section? I've tried using scrapy and grabbing by XPath:
scrapy shell https://cod.tracker.gg/warzone/match/1424533688251708994?handle=PatrickPM
response.xpath("//*[#id="app"]/div[3]/div[2]/div/main/div[3]/div[2]/div[2]/div[6]/div[2]/div[3]/div[2]/div[1]/div/div[1]/span[2]").get()
but this raises a syntax error
response.xpath("//*[#id="app"]
SyntaxError: invalid syntax
Grabbing by response.css("").get() is also difficult. Should I be using selenium? Or just regular requests/bs4? Nothing I do can grab it.
Thank you.
Does this return the data you need?
import requests

# The page's stats appear to come from this API endpoint
endpoint = "https://api.tracker.gg/api/v1/warzone/matches/1424533688251708994"
r = requests.get(endpoint, params={"handle": "PatrickPM"})
data = r.json()["data"]
In any case, I suggest using an API if one is available. It's much easier than using BeautifulSoup or Selenium.
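Two side notes. The SyntaxError in your question comes from the unescaped double quotes around app inside a double-quoted string; wrapping the whole expression in single quotes avoids that particular error. And since the layout of the JSON isn't shown here, a small sketch for finding where the kills stat lives, continuing from the snippet above:
import json

# Pretty-print the structure and search it for the stat you want;
# the key names under 'data' are not documented here, so inspect first.
print(json.dumps(data, indent=2))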
I have been trying to access the inspect-element data from a certain website (the regular page source won't work for this). At first I tried rendering the JavaScript for the site. I've tried using selenium, pyppeteer, webbot, phantomjs, and requests_html + BeautifulSoup. None of these worked. Would it be possible to simply copy-paste this data using Python?
The data I need is from https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6 and looks like this:
<nav class="feature-list">
<span style="" id="ember683" class="flex-horizontal feature-list-item ember-view">
(all the spans in this particular nav)
Python and Selenium beginner here. I'm trying to scrape the titles of the sections of a Udemy class. I've tried using find_elements_by_class_name and others, but for some reason they only bring back partial data.
page I'm scraping: https://www.udemy.com/selenium-webdriver-with-python3/
1) I want to get the title of the sections. They are the bold titles.
2) I want to get the title of the subsections.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.udemy.com/selenium-webdriver-with-python3/'
driver.get(url)
main_titles = driver.find_elements_by_class_name("lecture-title-text")
sub_titles = driver.find_elements_by_class_name("title")
Problem
1) Using main_titles, I got the length to be only 10. It only goes from Introduction to Modules; Working With Files and the sections after it don't come out, even though the class names are exactly the same. Not sure why. Modules / Working With Files is basically the cutoff point, and the elements look different in the inspector from that point on. They all have the same span class tag, but only part of the list is being returned:
<span class="lecture-title-text">
[Screenshot: element inspection between the Modules title and the Working With Files title]
At this point the webscrape breaks down. Not sure why.
2) Using sub_titles, I got the length to be 58 items, but when I print them out, I only get the top two:
Introduction
How to reach me anytime and ask questions? *** MUST WATCH ***
After this, it's all blank lines. Not sure why it's only pulling the top two and not the rest, when all the tags have:
<div class='title'>
Maybe I could try using BeautifulSoup, but currently I'm trying to get better at Selenium. Is there dynamic content throwing off the Selenium scrape, or am I not scraping it the proper way?
Thank you guys for the input. Sorry for the long post; I wanted to make sure I described the problem correctly.
The reason you're only getting the first 10 sections is that only the first ten are shown by default. You might be logged in in your browser, so when you go to check, it shows every section; but for me and your scraper it only shows the first 10. You'll need to click that .section-container--more-sections button before looking for the titles.
As for the weird case of the titles not being scraped properly: when an element is hidden, its text attribute will always be empty, which is why it only works for the first section. I'd try using WebElement.get_attribute('textContent') to scrape the text.
OK, I've gone through the suggestions in the comments and solved it. I'm writing it up here in case anyone wants to see the solution in the future.
1) Using the suggestions, I added a command to click on the '24 more sections' button to expand the tab and then scraped it, which worked perfectly!
driver.find_element_by_class_name("js-load-more").click()
titles = driver.find_elements_by_class_name("lecture-title-text")
for each in titles:
    print(each.text)
This pulled all 34 section titles.
2) Using Matt's suggestion, I found the WebElement and used get_attribute('textContent') to pull out the text data. There were a bunch of spaces, so I used strip() to get just the strings.
sub_titles = driver.find_elements_by_class_name("title")
for each in sub_titles:
    print(each.get_attribute('textContent').strip())
This pulled all 210 subsection titles!
I'm looking for a bit of guidance here. I'm completely new to Python, BS, Selenium, etc so please go easy on me.
My ISP doesn't provide alerts for my internet usage and I wanted to create my own monitoring for this. I managed to scrape the page I need using selenium and BeautifulSoup but now I'm a bit stuck. I have a container that has the following HTML code in it:
[<div class="usage_circle"> <center>
<div data-perc="69" data-transitiongoal="0.69" data-usage="150.29">
</div> </center></div>]
I'd like to extract the data-usage value of 150.29. I've tried using the findAll function (previously used to get the above HTML), but it doesn't work in this case.
Could anyone guide me as to what I need to do to get this number into a variable?
Thank you in advance.
In BeautifulSoup you can find all tags with a "data-usage" attribute like this:
e = soup.findAll(attrs={"data-usage" : True})
And then getting the value of the attribute is easy. For your first match it will be:
e[0]["data-usage"]
This kind of problem is a great use case for the interactive Python prompt. Save some html, open the file, and mess around with the object until you can find a solution.
https://docs.python.org/3/tutorial/interpreter.html#interactive-mode
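For example, something like this at the prompt (a sketch; it assumes you've saved the page's HTML to a local file first):
from bs4 import BeautifulSoup

# 'usage_page.html' is a hypothetical saved copy of the scraped page
with open("usage_page.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# now experiment: soup.find_all(attrs={"data-usage": True}), soup.div, ...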
The tutorial really does provide everything you need:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
In Selenium, you can usually find things using the webdriver and not have to invoke BeautifulSoup directly at all.
In your case, you could grab that block by class name and/or get the data-usage field via a tag name (it has been a while, so don't quote me on the exact function declarations).
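Something along these lines, for instance (a sketch, not tested against your page, and the URL is a placeholder since the real one isn't shown):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/your-usage-page")  # hypothetical; your ISP's page

# select the inner div by its data-usage attribute and read the value
elem = driver.find_element_by_css_selector("div.usage_circle div[data-usage]")
print(elem.get_attribute("data-usage"))  # e.g. '150.29'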
If you want to use BeautifulSoup for whatever reason, this example will work:
from bs4 import BeautifulSoup
html_doc = """[<div class="usage_circle"> <center>
<div data-perc="69" data-transitiongoal="0.69" data-usage="150.29">
</div> </center></div>]"""
soup = BeautifulSoup(html_doc, 'lxml')
soup.div.center.div["data-usage"]
The more important lesson is how to find your way through that tree, though. Get something like Jupyter if you want something prettier than a plain interactive console, but I intentionally used only code copied and minimally altered from the quickstart.