Scraping images in a dynamic, JavaScript webpage using Scrapy and Splash - python

I am trying to scrape the link to a hi-res image from this link, but the hi-res version of the image can only be inspected after clicking the mid-sized image on the page, i.e. after clicking "Click here to enlarge the image" (on the page itself the text is in Turkish).
Then I can inspect it with Chrome's Developer Tools and get the XPath/CSS selector. Everything is fine up to this point.
However, as you know, on a JS page you can't simply call response.xpath("//blah/blah/@src") and get the data. I installed Splash (with docker pull) and configured my Scrapy settings.py etc. to make it work (a YouTube tutorial helped; no need to visit it unless you want to learn how to do it)... and it worked on other JS webpages!
But I just cannot get past this "Click here to enlarge the image!" step and get the response. It gives me a null response.
This is my code:
import scrapy
# import json
from scrapy_splash import SplashRequest

class TryMe(scrapy.Spider):
    name = 'try_me'
    allowed_domains = ['arabam.com']

    def start_requests(self):
        start_urls = ["https://www.arabam.com/ilan/sahibinden-satilik-hyundai-accent/bayramda-arabasiz-kalmaa/17753653",
                      ]
        for url in start_urls:
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})
            # yield SplashRequest(url=url, callback=self.parse)  # this works too

    def parse(self, response):
        ## I can get this one's link successfully since it's not between js codes:
        # IMG_LINKS = response.xpath('//*[@id="js-hook-for-ing-credit"]/div/div/a/img/@src').get()
        ## but this one just doesn't work:
        IMG_LINKS = response.xpath("/html/body/div[7]/div/div[1]/div[1]/div/img/@src").get()
        print(IMG_LINKS)  # prints null :(
        yield {"img_links": IMG_LINKS}  # gives the items: img_links:null
Shell command which I'm using:
scrapy crawl try_me -O random_filename.jl
Xpath of the link I'm trying to scrape:
/html/body/div[7]/div/div[1]/div[1]/div/img
I actually can see the link I want in the Network tab of the Developer Tools window when I click to enlarge the image, but I don't know how to scrape that link from that tab.
Possible solution: I will also try taking the whole garbled body of my response, i.e. response.text, and applying a regular expression to it (e.g. anything that starts with https://... and ends with .jpg). That will definitely be looking for a needle in a haystack, but it sounds quite practical as well.
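A rough sketch of that regex idea (the pattern below is only an assumed shape for the image URLs):

import re

# naive pattern: any https URL ending in .jpg (assumed format of the hi-res links)
img_candidates = re.findall(r'https://[^\s"\']+\.jpg', response.text)
print(img_candidates)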
Thanks!

As far as I understand, you want to find the main image link. I checked out the page; it is inside one of the meta elements:
<meta itemprop="image" content="https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg">
Which you can get with
>>> response.css('meta[itemprop=image]::attr(content)').get()
'https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg'
You don't need to use Splash for this. When I check the website with Splash, arabam.com gives a permission denied error, so I recommend not using Splash for this website.
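A minimal sketch of how that fits into the spider's parse method with a plain scrapy.Request (no Splash), assuming the meta tag is present in the static HTML as shown above:

def parse(self, response):
    # the hi-res image URL sits in the itemprop="image" meta tag
    img_link = response.css('meta[itemprop=image]::attr(content)').get()
    yield {"img_links": img_link}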
For a better solution for all the images, you can parse the JavaScript: the images array is loaded with JS right there in the page source.
To reach that JavaScript, try:
response.css('script::text').getall()[14]
This will give you the whole JavaScript string containing the images array. You can parse it with a library like js2xml.
Check out how you can use it here: https://github.com/scrapinghub/js2xml. If you still have questions, you can ask. Good luck!
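For example, a rough sketch (assuming js2xml is installed and that script index [14] still holds the images array; js2xml turns the JS into an XML tree whose string literals can be queried with XPath):

import js2xml

def extract_image_urls(response):
    # index 14 is taken from the answer above and may shift if the page changes
    jscode = response.css('script::text').getall()[14]
    parsed = js2xml.parse(jscode)
    # keep every JS string literal that looks like an image URL
    return [s for s in parsed.xpath('//string/text()') if s.endswith('.jpg')]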

Related

Scrapy: extracting data(css-selector)

I am trying to get data(title) from this page. My code doesn't work. What am I doing wrong?
scrapy shell https://www.indiegogo.com/projects/functional-footwear-run-pain-free#/
response.css('.t-h3--sansSerif::text').getall()
I think the problem may be that the element is dynamically added through JS, and that could be the reason Scrapy is not able to extract it. Maybe you should try using Selenium.
Here is the Selenium code to get the element:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.indiegogo.com/projects/functional-footwear-run-pain-free#/")
titles = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#main .is-12-touch+ .is-12-touch"))
)
for title in titles:
    t = title.text
    print("t = ", t)
Always check the source of the page via view-source. Looking at the source, it does not contain the element you are looking for; instead it is created dynamically with JavaScript.
You can use Selenium to scrape such sites, but Selenium comes with its caveats. It is synchronous.
And since you are using Scrapy, a better option is the scrapy-splash package. Splash renders the JavaScript and returns a fully rendered HTML page, which you can easily scrape with XPath or CSS selectors. Remember, you need to run the Splash server in a Docker container and use it like a proxy server to render the JavaScript.
docker pull scrapinghub/splash
docker run -d -p 8050:8050 --memory=1.5G --restart=always scrapinghub/splash --maxrss 1500 --max-timeout 3600 --slots 10
Here's a link to the documentation: https://splash.readthedocs.io/en/stable/
Your script would look something like this. Instead of scrapy.Request, you can make requests like:
from scrapy_splash import SplashRequest
yield SplashRequest(url=url, callback=self.parse, meta={})
And then you are good to go.
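For reference, the settings.py wiring for scrapy-splash looks roughly like this (a sketch following the scrapy-splash README; adjust SPLASH_URL to wherever your container listens):

# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'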

Scrapy Load More Issue - CSS Selector

I am attempting to scrape a website which has a "Show More" link at the bottom of the page that leads to more data to scrape. Here is a link to the website page: https://untappd.com/v/total-wine-more/47792. Here is my full code:
class Untap(scrapy.Spider):
    name = "Untappd"
    allowed_domains = ["untappd.com"]
    start_urls = [
        'https://untappd.com/v/total-wine-more/47792'  # URL: Major liquor store chain with Towson location.
    ]

    def parse(self, response):
        for beer_details in response.css('div.beer-details'):
            yield {
                'name': beer_details.css('h5 a::text').getall(),        # Name of Beer
                'type': beer_details.css('h5 em::text').getall(),       # Style of Beer
                'ABVIBUs': beer_details.css('h6 span::text').getall(),  # ABV and IBU of Beer
                'Brewery': beer_details.css('h6 span a::text').getall() # Brewery that produced Beer
            }

        load_more = response.css('a.yellow button more show-more-section track-click::attr(href)').get()
        if load_more is not None:
            load_more = response.urljoin(load_more)
            yield scrapy.Request(load_more, callback=self.parse)
I've attempted to use the "load_more" block at the bottom to keep loading more data for scraping, but none of the selectors I've tried against the website's HTML have worked.
Here is the HTML from the website.
Show More Beers
I want the spider to scrape what is shown on the website, then click the link and continue scraping the page. Any help would be greatly appreciated.
Short answer:
curl 'https://untappd.com/venue/more_menu/47792/15?section_id=140248357' -H 'x-requested-with: XMLHttpRequest'
Clicking on that button executes JavaScript, so you'd need Selenium to automate that, but fortunately, you won't have to :).
Using Developer Tools, you can see that when you click that button it requests data following the pattern shown, with the offset after /47792/ increasing by 15 each time, so the first time:
https://untappd.com/venue/more_menu/47792/15?section_id=140248357
the second time:
https://untappd.com/venue/more_menu/47792/30?section_id=140248357
then:
https://untappd.com/venue/more_menu/47792/45?section_id=140248357
and so on.
But if you try to get it directly from the browser it gets no content, because they are expecting the 'x-requested-with: XMLHttpRequest' header, indicating it is an AJAX request.
Thus you have the URL pattern and the required header you need for coding your scraper.
The rest is to parse each response. :)
PS: the section_id parameter may change (mine is different from yours), but you already have the data-section-id="140248357" attribute in the button's HTML.
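A rough sketch of how that could look as a Scrapy spider (the section_id, the page size of 15, and the stop condition are assumptions; take the real values from the button's HTML):

import scrapy

class UntappdMenu(scrapy.Spider):
    name = "untappd_menu"
    venue_id = 47792            # from the venue URL
    section_id = 140248357      # from data-section-id on the button (may differ)

    def start_requests(self):
        yield self.menu_request(offset=15)

    def menu_request(self, offset):
        url = (f"https://untappd.com/venue/more_menu/{self.venue_id}/{offset}"
               f"?section_id={self.section_id}")
        return scrapy.Request(url,
                              headers={"x-requested-with": "XMLHttpRequest"},
                              callback=self.parse_menu,
                              cb_kwargs={"offset": offset})

    def parse_menu(self, response, offset):
        beers = response.css("div.beer-details")
        for beer_details in beers:
            yield {"name": beer_details.css("h5 a::text").getall()}
        # keep paging while the endpoint still returns beer entries (assumption)
        if beers:
            yield self.menu_request(offset + 15)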

Python - Scrapy - Navigating through a website

I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I'm logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" does not open the website any more; it only does if I put it after parse or parse2). My understanding is that I chain multiple parse functions one after another to navigate through the website.
Datacamp says I always need to start with a "start requests" function, which is what I tried, but in the YouTube videos etc. I saw that most spiders start directly with parse functions. Using "Inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all links in the header of the website and then open each of those links in order to see the page it refers to.
To achieve this, first identify what you need to scrape, mark CSS or XPath selectors for those links, and put them in a rule. A rule can be given a callback (e.g. a parse_item method) to process the pages it extracts, or left without one to simply follow the links. I am attaching a dummy example of creating rules, and you can map it to your case (a fuller sketch follows the snippet):
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)
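A minimal, self-contained sketch of how such rules plug into a CrawlSpider (the selectors and domain are placeholders, not taken from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HeaderCrawler(CrawlSpider):
    name = "header_crawler"
    start_urls = ["https://example.com"]  # placeholder

    rules = (
        # follow every link found in the site header (placeholder selector)
        Rule(LinkExtractor(restrict_css=["nav.header"])),
        # parse the pages those header links lead to
        Rule(LinkExtractor(restrict_css=["div.content a"]), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}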

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a script in Python using Scrapy in order to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it is mainly made with JavaScript and AJAX requests. The whole body of the page is put inside a <form> that allows changing the page using a submit button. The URL doesn't change (and it's a .aspx).
I have successfully managed to scrape all the data I need from page one, then change page by clicking on this input button using this code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method scrapes the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need a loop that clicks on each container, displays the data, and scrapes it, and only after that goes to the next page to repeat the same process.
The thing is, I can't work out how to make "the script" click on the container using Selenium (while staying logged in, otherwise I cannot reach this page) and then have Scrapy scrape the data once the XHR request has been made.
I did a lot of research on the internet but could not find a solution to try.
Thanks!
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code in order to get the AJAX response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not exactly the same as the one I see for the request in Chrome Dev Tools. I've not taken all the form data into account yet (~10 of the 25 fields); could it be that it needs all the form data, even the fields that don't change with the id?
Thanks!
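If it turns out the full form state is needed, a rough sketch of one way to send it (the method and any field names beyond those in the question are hypothetical; FormRequest.from_response already copies the form's existing input values, and formdata only overrides the ones that vary per container):

def request_details(self, response, details_id):
    # from_response pre-fills all <input> values found in the form;
    # only the fields that change per clicked container are overridden here
    return scrapy.FormRequest.from_response(
        response,
        formdata={
            '__ASYNCPOST': 'true',
            'DetailsId': str(details_id),   # varies per container
        },
        callback=self.parse_item)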

Python selenium webdriver not consistently selecting element even though it's there

I'm developing a web scraper to collect the src link from a source tag in an html file and add it to a list.
The site has a video nested under a load of divs, but all of the pages eventually come to:
<video type="video/mp4" poster="someimagelink" preload="metadata" crossorigin="anonymous">
<source type="video/mp4" src="somemp4link">
</video>
My current method is logging into the site, going to the page with the links to the video pages, going to each video page one by one and trying to find the source tag and adding it to the list.
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()

# A bunch of log in and get list of video page links, which works fine

soup = BeautifulSoup(browser.page_source)
for i in range(3):
    browser.get(soup('a', {'class', 'subject__item'})[i]['href'])
    vsoup = BeautifulSoup(browser.page_source)
    print(vsoup('source'))
    browser.get('pageWithVideoPages')
# This doesn't add to a list, it just goes to the video page,
# tries to find the source tag and prints it out.
# Then it goes back to the original page and starts the loop again.
What happens however is I get this:
[<source src="themp4link" type="video/mp4"></source>]
[]
[]
[]
So the first one works fine, then all the rest just return blank lists... as if there were no source tag, but manually checking the inspector reveals that there is a source tag there.
Repeating this, I now get:
[<source src="http://themp4link" type="video/mp4"></source>]
[]
[<source src="http://themp4link" type="video/mp4"></source>]
The site needs JavaScript enabled to load the content (which is why I'm using webdriver to do this)... could it be something to do with that?
Any help is much appreciated!
You probably need to wait for the web element you are looking for. You should explore using WebDriverWait.
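For instance, a minimal sketch of that idea applied inside the loop above (the selector and the 10-second timeout are assumptions):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the <source> tag is actually in the DOM before reading page_source
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "video > source"))
)
vsoup = BeautifulSoup(browser.page_source)
print(vsoup('source'))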
