I am trying to get data (the title) from this page. My code doesn't work. What am I doing wrong?
scrapy shell https://www.indiegogo.com/projects/functional-footwear-run-pain-free#/
response.css('.t-h3--sansSerif::text').getall()
I think the problem may be that the element is dynamically added through JS, and that could be why Scrapy is not able to extract it. Maybe you should try using Selenium.
Here is the Selenium code to get the element:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

titles = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#main .is-12-touch+ .is-12-touch"))
)
for title in titles:
    t = title.text
    print("t = ", t)
Always check the source of the page with view-source. Looking at the source, it does not contain the element you are looking for; instead, it is dynamically created with JavaScript.
You can use Selenium to scrape such sites, but Selenium comes with its caveats: it is synchronous.
And since you are using Scrapy, a better option is the scrapy-splash package. Splash renders the JavaScript and returns a fully rendered HTML page, which you can then easily scrape with XPath or CSS selectors. Remember, you need to run the Splash server in a Docker container and use it like a proxy server to render the JavaScript.
docker pull scrapinghub/splash
docker run -d -p 8050:8050 --memory=1.5G --restart=always scrapinghub/splash --maxrss 1500 --max-timeout 3600 --slots 10
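You also need to wire scrapy-splash into your project's settings.py. A minimal sketch based on the scrapy-splash README (the URL assumes the docker run command above; double-check against the current docs):

# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'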
Here's a link to the documentation. https://splash.readthedocs.io/en/stable/
Your script would look something like this. Instead of scrapy.Request, you can make requests like:
from scrapy_splash import SplashRequest
yield SplashRequest(url=url, callback=self.parse, meta={})
And then you are good to go.
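Putting it together for the page in the question, a rough spider sketch (the class and spider name are illustrative, and the CSS selector is the one from the question; the page may have changed since):

import scrapy
from scrapy_splash import SplashRequest

class IndiegogoTitleSpider(scrapy.Spider):
    name = 'indiegogo_title'  # hypothetical name

    def start_requests(self):
        url = 'https://www.indiegogo.com/projects/functional-footwear-run-pain-free#/'
        # Give Splash a moment to render the JS before returning the HTML.
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # Same selector as the scrapy shell attempt above.
        yield {'titles': response.css('.t-h3--sansSerif::text').getall()}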
Related
I am trying to scrape the link of a high-res image from this link, but the high-res version of the image can only be inspected after clicking on the mid-sized link on the page, i.e. after clicking "Click here to enlarge the image" (on the page, it's in Turkish).
Then I can inspect it with Chrome's "Developer Tools" and get the xpath/css selector. Everything is fine up to this point.
However, you know that on a JS page, you can't just type response.xpath("//blah/blah/@src") and get some data. I installed Splash (with Docker pull) and configured my Scrapy settings.py file etc. to make it work (this YouTube link helped; no need to visit the link unless you want to learn how to do it)... and it worked on other JS webpages!
It's just that I cannot get past this "Click here to enlarge the image!" step and get the response. It gives me a null response.
This is my code:
import scrapy
#import json
from scrapy_splash import SplashRequest

class TryMe(scrapy.Spider):
    name = 'try_me'
    allowed_domains = ['arabam.com']

    def start_requests(self):
        start_urls = ["https://www.arabam.com/ilan/sahibinden-satilik-hyundai-accent/bayramda-arabasiz-kalmaa/17753653",
                      ]
        for url in start_urls:
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})
            # yield SplashRequest(url=url, callback=self.parse)  # this works too

    def parse(self, response):
        ## I can get this one's link successfully since it's not between js codes:
        #IMG_LINKS = response.xpath('//*[@id="js-hook-for-ing-credit"]/div/div/a/img/@src').get()
        ## but this one just doesn't work:
        IMG_LINKS = response.xpath("/html/body/div[7]/div/div[1]/div[1]/div/img/@src").get()
        print(IMG_LINKS)  # prints null :(
        yield {"img_links": IMG_LINKS}  # gives the items: img_links:null
Shell command which I'm using:
scrapy crawl try_me -O random_filename.jl
Xpath of the link I'm trying to scrape:
/html/body/div[7]/div/div[1]/div[1]/div/img
I actually can see the link I want on the Network tab of my Developer Tools window when I click to enlarge it but I don't know how to scrape that link from that tab.
Possible Solution: I will also try to get the whole garbled body of my response, i.e. response.text, and apply a regular expression to it (e.g. something that starts with https://... and ends with .jpg). That will definitely be looking for a needle in a haystack, but it sounds quite practical as well.
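Something like this is what I have in mind (just a sketch; the pattern is a guess and would need tightening to the actual URL format):

import re

# Find anything in the raw response body that looks like a .jpg URL.
img_links = re.findall(r'https://[^\s"\']+\.jpg', response.text)
print(img_links)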
Thanks!
As far as I understand, you want to find the main image link. I checked out the page; it is inside one of the meta elements:
<meta itemprop="image" content="https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg">
Which you can get with
>>> response.css('meta[itemprop=image]::attr(content)').get()
'https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg'
You don't need to use Splash for this. When I check the website with Splash, arabam.com gives a permission denied error, so I recommend not using Splash for this website.
For a better solution covering all the images, you can parse the JavaScript: the images array is loaded with JS right there in the page source.
To reach that JavaScript, try:
response.css('script::text').getall()[14]
This will give you the whole JavaScript string containing the images array. You can parse it with a library like js2xml.
Check out how you can use it here: https://github.com/scrapinghub/js2xml. A rough sketch of that approach is below. If you still have questions, you can ask. Good luck!
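Just a sketch (the script index is the one from above and may shift if the page layout changes):

import js2xml

# The script that contains the images array (index taken from above).
script = response.css('script::text').getall()[14]

parsed = js2xml.parse(script)
# jsonlike.getall pulls out JSON-like structures (dicts/lists) embedded in the JS.
data = js2xml.jsonlike.getall(parsed)
print(data)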
I am scraping pages like this one:
site to scrape
I am using Python with Selenium and connecting through ProxyCrawler. One of the things I need to do is follow all the links that say For details, click here and grab the text there. The links look like this:
<a href='javascript:void(0)' onclick=javascript:submitLink('TIDFT/AE/VI/IS/ID100201','KQ','KQ')>For details, click here</a>
As you can see, each link's URL gets constructed by a function called submitLink. The function is not defined in the page source; rather it is called from an external .js file referenced in the head. I tried injecting the file into the DOM to make the function run but failed so far. For more details, see my question here.
So I'm trying instead to click each link to make the script run. However, this doesn't work with ProxyCrawler. If I connect directly, the links work fine but obviously that exposes my scraper.
Here is the minimal working code:
from selenium import webdriver
from urllib import parse

apikey = MY_KEY
scrapeurl = 'https://www.timaticweb.com/cgi-bin/tim_website_client.cgi?SpecData=1&VISA=&page=both&NA=' + \
            'ZW' + '&DE=' + 'AE' + '&user=KQ&subuser=KQ'
selenurl = 'https://api.proxycrawl.com/?token=' + apikey + '&url=' + parse.quote(scrapeurl)

DRIVER_PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(selenurl)
#driver.get(scrapeurl)

link = driver.find_element_by_xpath(".//a[contains(@onclick, 'submitLink')]")
link.click()
The above works if I use scrapeurl. It doesn't work with selenurl. Is there a way to use ProxyCrawler and still be able to click on those links?
I am new to webscraping and would like to scrape the links from the site below using scrapy:
https://shop.coles.com.au/a/national/everything/search/bread?pageNumber=1
I created the XPath below to scrape the links. When I test it by inspecting the page and pressing Ctrl+F, I get 51 matches, which equals the number of products, so it seems to be correct:
//span[@class="product-name"]/../../@href
However, when I go into scrapy shell with the link and apply the command:
response.xpath('//span[@class="product-name"]/../../@href').extract()
with or without a User-Agent, I just get an empty list.
When I run the shell I get a 429 error, which indicates I have made too many requests. But as far as I am aware I have only made 1 request.
In addition I have also set up a spider for this where I set CONCURRENT_REQUESTS = 1 and also get a 429 error.
Does anyone know why my xpath doesn't work on this site?
Thanks
Edit
Below is the spider code:
import scrapy

class ColesSpider(scrapy.Spider):
    name = 'coles'
    allowed_domains = ['shop.coles.com.au']
    start_urls = ['https://shop.coles.com.au/a/national/everything/search/bread/']

    def parse(self, response):
        prod_urls = response.xpath('//span[@class="product-name"]/../../@href').extract()
        for prod_url in prod_urls:
            yield {"Product_URL": prod_url}
I've had a quick look around the website, and it seems like the site is invoking a cookie challenge as well as checking your IP address.
I think it may be worth trying scrapy-splash to render the page and get through the JS cookie challenge if you're set on using Scrapy.
Strangely, I managed to get a 200 status code with headers, params and cookies using the requests package, but I couldn't get Scrapy with the same headers and cookies to recreate that response.
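For what it's worth, the requests call was shaped roughly like this (all header, cookie and param values here are placeholders, not the ones that actually worked; copy real values from your browser's dev tools):

import requests

url = "https://shop.coles.com.au/a/national/everything/search/bread"
headers = {
    # Placeholder values; copy the real ones from a browser session.
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml",
}
cookies = {
    # Whatever cookies the JS challenge sets in a real browser session.
    "example_cookie": "value",
}
params = {"pageNumber": "1"}

resp = requests.get(url, headers=headers, cookies=cookies, params=params)
print(resp.status_code)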
The issue I'm having is that I want to grab the related links from this page: http://support.apple.com/kb/TS1538
If I Inspect Element in Chrome or Safari I can see the <div id="outer_related_articles"> and all the articles listed. If I attempt to grab it with BeautifulSoup it will grab the page and everything except the related articles.
Here's what I have so far:
import urllib2
from bs4 import BeautifulSoup
url = "http://support.apple.com/kb/TS1538"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
print soup
This section is loaded using Javascript. Disable your browser's Javascript to see how BeautifulSoup "sees" the page.
From here you have two options:
Use a headless browser that will execute the Javascript. See this question about it: Headless Browser for Python (Javascript support REQUIRED!)
Try to figure out how the Apple site loads the content and simulate it - it probably does an AJAX call to some address.
After some digging it seems it does a request to this address (http://km.support.apple.com/kb/index?page=kmdata&requestid=2&query=iOS%3A%20Device%20not%20recognized%20in%20iTunes%20for%20Windows&locale=en_US&src=support_site.related_articles.TS1538&excludeids=TS1538&callback=KmLoader.receiveSuccess) and uses JSONP to load the results, with KmLoader.receiveSuccess being the name of the receiving function. Use Firebug or Chrome dev tools to inspect the page in more detail.
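If you go the simulate-the-AJAX route, a rough sketch (the endpoint is the one above and may stop working at any time; the JSONP wrapper is stripped by hand):

import json
import urllib2

url = ("http://km.support.apple.com/kb/index?page=kmdata&requestid=2"
       "&query=iOS%3A%20Device%20not%20recognized%20in%20iTunes%20for%20Windows"
       "&locale=en_US&src=support_site.related_articles.TS1538"
       "&excludeids=TS1538&callback=KmLoader.receiveSuccess")
raw = urllib2.urlopen(url).read()
# Strip the JSONP wrapper KmLoader.receiveSuccess( ... ) to get plain JSON.
payload = raw[raw.index('(') + 1:raw.rindex(')')]
data = json.loads(payload)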
I ran into a similar problem: HTML content that is created dynamically may not be captured by BeautifulSoup. A very basic solution is to wait a few seconds before capturing the content, or to use Selenium instead, which can wait for an element and then proceed. For the former, this worked for me:
import time
# .... your initial bs4 code here
time.sleep(5) #5 seconds, it worked with 1 second too
html_source = browser.page_source
# .... do whatever you want to do with bs4
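And for the latter (an explicit wait), a sketch along these lines, using the id from the question and the browser object from the snippet above:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the related-articles container to appear,
# then hand the rendered HTML to BeautifulSoup.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "outer_related_articles"))
)
soup = BeautifulSoup(browser.page_source)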
I'm scraping my site which uses a Google custom search iframe. I am using Selenium to switch into the iframe, and output the data. I am using BeautifulSoup to parse the data, etc.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import html5lib
driver = webdriver.Firefox()
driver.get('http://myurl.com')
driver.execute_script()
time.sleep(4)
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_default_content()
driver.switch_to_frame(iframe)
output = driver.page_source
soup = BeautifulSoup(output, "html5lib")
print soup
I am successfully getting into the iframe and getting 'some' of the data. At the very top of the data output, it talks about Javascript being enabled, and the page being reloaded, etc. The part of the page I'm looking for isn't there (from when I look at the source via developer tools). So, obviously some of it isn't loading.
So, my question - how do you get Selenium to load ALL page javascripts? Is it done automatically?
I see a lot of posts on SO about running an individual function, etc... but nothing about running all of the JS on the page.
Any help is appreciated.
Ahh, so it was in the tag that featured the "Javascript must be enabled" text.
I just posted a question on how to switch within the nested iframe here:
Python Selenium Switch into an iframe within an iframe
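For reference, the switching itself looks roughly like this (the locators are assumptions; this uses the same old-style switch_to_frame API as the snippet above):

outer = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(outer)
# A second iframe nested inside the first one (per the linked question).
inner = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(inner)
html = driver.page_source
driver.switch_to_default_content()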