Python Crawler: deal with "load more" button

I am learning to write web crawlers in Python and want to know how to deal with the "load more" button on the following page:
https://www.photo.net/search/#//Sort-View-Count/All-Categories/All-Time/Page-1
(I am trying to crawl all the pictures.)
My current code uses BeautifulSoup:
from urllib.request import build_opener, HTTPCookieProcessor
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup

url = 'https://www.photo.net/search/#//Sort-View-Count/All-Categories/All-Time/Page-1'
cj = CookieJar()
opener = build_opener(HTTPCookieProcessor(cj))
try:
    p = opener.open(url)
    soup = BeautifulSoup(p, 'html.parser')
except Exception as e:
    print(str(e))

You can handle this with the Selenium module for Python, which drives a real browser and can click the button for you.
1) Download ChromeDriver
2) Install Selenium via pip
Here is an example of how to use it:
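from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

browser = webdriver.Chrome('Path to chrome driver')
browser.get('https://www.photo.net/search/#//Sort-View-Count/All-Categories/All-Time/Page-1')
while True:
    try:
        # Wait up to 10 seconds for the "Load More" link to become clickable.
        button = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Load More')))
    except TimeoutException:
        # No button appeared in time, so assume everything has been loaded.
        break
    button.click()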
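
One way to keep a click loop like the one above bounded is to cap the number of clicks. The sketch below separates the loop from Selenium so it can be tried without a browser. The names find_button, FakeButton, and max_clicks are hypothetical and are not part of Selenium. In real use, find_button would wrap the WebDriverWait call and return None when it raises TimeoutException.

```python
def click_until_exhausted(find_button, max_clicks=50):
    """Click the button returned by find_button() until it stops appearing.

    find_button is a hypothetical callable that returns a clickable object,
    or None once no button can be found. max_clicks caps the loop so a page
    that always shows the button cannot keep the crawler running forever.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        button.click()
        clicks += 1
    return clicks


class FakeButton:
    """Stand-in for a Selenium element, used only for the demo below."""

    def click(self):
        pass


# Demo: the fake page offers the button three times, then it disappears.
remaining = [FakeButton(), FakeButton(), FakeButton()]
demo_clicks = click_until_exhausted(lambda: remaining.pop() if remaining else None)
print(demo_clicks)
```

The cap is a design choice, not a requirement: some pages keep showing the button indefinitely, and a limit lets the crawl stop in that case.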

Related

How to find window/iframe from Chrome DevTools

I'm trying to web scrape using Selenium, Python and Beautiful Soup. I am scraping this page, but I want to scrape information off the pop-up window that appears when you click on the 'i' (information) icons in the corner of each product. My code is as follows:
import requests
from bs4 import BeautifulSoup
import time
import selenium
import math
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import chromedriver_binary
import re
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(ChromeDriverManager().install())
r = requests.get('https://dmarket.com/csgo-skins/product-card/ak-47-redline/field-tested')
driver.get('https://dmarket.com/csgo-skins/product-card/ak-47-redline/field-tested')
html_getter = BeautifulSoup(r.text, "html.parser")
data = html_getter.findAll(attrs={"class":"c-asset__priceNumber"})
dataskin = html_getter.findAll(attrs={"class" : "c-asset__exterior"})
time.sleep(2)
driver.find_element_by_id("onesignal-slidedown-cancel-button").click()
time.sleep(2)
driver.find_element_by_class_name("c-dialogHeader__close").click()
time.sleep(30)
driver.find_element_by_class_name("c-asset__action--info").click()
time.sleep(30)
price_element = driver.switch_to.active_element
print("<<<<<TEXT>>>>>")
print(price_element.text)
print("<<<<<END>>>>>")
driver.close()
However, when I run this, the only text that prints is "close". If you inspect the information pop-up, it should print the price, the data from the chart, etc. How can I get it to print this info? Specifically, I want the amount sold on the most recent day and the price listed on the chart for the most recent day (both seem to be accessible in Chrome DevTools). I don't think I'm looking at the wrong frame, since I switch to the active element, so I'm not sure how to fix this!

web scraping w/age verification

Hello I want to web scrape data from a site with an age verification pop-up using python 3.x and beautifulsoup. I can't get to the underlying text and images without clicking "yes" for "are you over 21". Thanks for any support.
EDIT: Thanks, with some help from a comment I see that I can use the cookies but am not sure how to manage/store/call cookies with the requests package.
So with some help from another user I am using selenium package so that it will work also in case it's a graphical overlay (I think?). Having trouble getting it to work with the gecko driver but will keep trying! Thanks for all the advice again, everyone.
EDIT 3: OK I have made progress and I can get the browser window to open, using the gecko driver!~ Unfortunately it doesn't like that link specification so I'm posting again. The link to click "yes" on the age verification is buried on that page as something called mlink...
EDIT 4: Made some progress, updated code is below. I managed to find the element in the XML code, now I just need to manage to click the link.
#
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver') # Optional argument, if not specified will search path.
driver.get('https://www.shopharborside.com/oakland/#/shop/412');
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_class_name('hhc_modal-body').click(Yes)
#wait.1.second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print(soup.prettify())
Edit new: Stuck again, here is the current code. I seem to have isolated the element "mBtnYes" but I get an error when running the code :
ElementClickInterceptedException: Message: Element is not clickable at point (625,278.5500030517578) because another element obscures it
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver') # Optional argument, if not specified will search path.
driver.get('https://www.shopharborside.com/oakland/#/shop/412');
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_id('myBtnYes').click()
#wait.1.second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print(soup.prettify())

If your aim is to click through the verification, use Selenium.
P.S. Install selenium and get geckodriver (Firefox) or chromedriver (Chrome).
# Mossein~King (hi, I'm here to help)
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

# Headless mode: this will save you a bunch of research time (trust me).
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
# For a graphical browser instead (you need geckodriver for Firefox):
# driver = webdriver.Firefox()
url = 'your-url'
driver.get(url)
# Find the link to click.
# Example: if the link is <a class='MosseinKing'>
driver.find_element_by_xpath("//a[@class='MosseinKing']").click()
# Wait one second in case of transitions.
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource, 'html.parser')
# You can now enjoy the soup.
print(soup.prettify())

Implicit wait in Selenium with Python 2.7 not working

I am scraping public LinkedIn data for specific people.
Below is the code inside the while loop. I used time.sleep() for the first 400 profile URLs and it worked, but it no longer does: my Firefox browser now crashes. I am fairly sure the problem comes from time.sleep(), so I tried replacing it with implicitly_wait() and WebDriverWait. However, none of these attempts worked.
Here is the code inside the while loop, using time.sleep(), that worked for around 400 URLs:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
browser = webdriver.Firefox()
browser.get("https://www.linkedin.com/uas/login")
time.sleep(4)
username = browser.find_element_by_id("session_key-login")
password = browser.find_element_by_id("session_password-login")
username.send_keys("yourmail")
password.send_keys("yourpassword")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
time.sleep(4)
browser.get("the profile link I wanna scrap")
html = browser.page_source
soup = BeautifulSoup(html,"html.parser")
formation = soup.find_all('div', {'class': "education"})
nom = soup.find_all('span', {'class': "full-name"})
for a in nom:
    for b in formation:
        print(a.text,b.text)
time.sleep(4)
browser.close()
I tried to replace time.sleep() with implicitly_wait(), but it does not work: the browser does not wait at all.
I also tried this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("the profile url I wanna scrap")
delay = 30 # seconds
try:
    WebDriverWait(browser, delay).until(EC.presence_of_element_located(browser.find_element_by('education')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
But it is still not working.
Do you have any idea how to solve this?
If I could make the browser wait without using time.sleep() (which makes my browser crash) and without any conditions, that would be amazing!
Another question: if I use Chrome instead of Firefox, do I have a chance of overcoming the problem?
Thanks for your answers,
Raphaël

With WebDriverWait, the browser waits, but Firefox crashes again. Here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("https://www.linkedin.com/uas/login")
username = browser.find_element_by_id("session_key-login")
password = browser.find_element_by_id("session_password-login")
username.send_keys("mail")
password.send_keys("password")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
try:
    element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "content")))
finally:
    browser.get("profil linkedin to scrap")
html = browser.page_source
soup = BeautifulSoup(html,"html.parser")
formation = soup.find_all('div', {'class': "education"})
nom = soup.find_all('span', {'class': "full-name"})
for a in nom:
    for b in formation:
        print(a.text,b.text)
browser.close()
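
A note on why implicitly_wait() appears not to wait: it does not pause the script. It only sets how long Selenium keeps retrying element lookups such as find_element_by_id() before giving up. WebDriverWait, by contrast, polls a condition until it returns something truthy or a timeout expires, and presence_of_element_located expects a locator tuple such as (By.CLASS_NAME, "education") rather than an element. The sketch below mimics that polling idea in plain Python so it runs without a browser. wait_until and its parameters are hypothetical names for illustration, not Selenium's API.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have
    passed. `poll` is the pause between attempts, in seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.2f seconds" % timeout)
        time.sleep(poll)


# Demo: a condition that only becomes true on the third check,
# standing in for "the element has appeared on the page".
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return calls["n"] >= 3


demo_result = wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
print(demo_result, calls["n"])
```

Unlike a fixed time.sleep(4), this returns as soon as the condition holds and fails loudly when it never does.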

Selenium PhantomJS driver: Unable to load dynamic HTML

I am scraping a URL like this one, which loads data via Ajax. With Firefox I can scrape the HTML, but with PhantomJS it returns:
<html><head></head><body></body></html>
Code is below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import selenium.webdriver.support.ui as ui
import sys
import os
from time import sleep
driver = None
url = 'https://sports.bovada.lv/live-betting/event/2391243'
driver = webdriver.PhantomJS('/Setups/phantomjs-1.9.8-macosx/bin/phantomjs')
driver.set_window_size(1128, 768) # optional
driver.get(url)
wait = ui.WebDriverWait(driver, 3000)
sleep(40)
#wait.until(EC.staleness_of(driver.find_element_by_id("coupon")), 'visible')
html = driver.page_source
#userElement = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "coupon")))
print(html)
Update
Ok this is happening with every URL regardless of ajax or non-Ajax

Bypass Referral Denied error in selenium using python

I was making a script to download images from comic.naver.com and I'm nearly done with it, but I can't seem to save the images.
I successfully grabbed the image links via urllib and BeautifulSoup, but it seems they've introduced hotlink blocking, and I can't save the images to my system via urllib or Selenium.
Update: I tried changing the user agent to see if that was causing the problem... still the same.
Any fix or solution?
My code right now:
import requests
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Chrome/15.0.87"
)
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver = webdriver.PhantomJS(desired_capabilities=dcap)
soup = BeautifulSoup(urllib.urlopen(url).read())
scripts = soup.findAll('img', alt='comic content')
for links in scripts:
    Imagelinks = links['src']
    filename = Imagelinks.split('_')[-1]
    print 'Downloading Image : '+filename
    driver.get(Imagelinks)
    driver.save_screenshot(filename)
driver.close()
Following 'MAI's' reply, I tried what I could with selenium, and got what I wanted. It's solved now. My code :
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome()
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver.get(url)
elem = driver.find_elements_by_xpath("//div[@class='wt_viewer']//img[@alt='comic content']")
for links in elem:
    print links.get_attribute('src')
driver.quit()
But when I try to take screenshots of these, I get an error saying that the "element is not attached to the page". How am I supposed to solve that?

(Note: Apologies, I'm not able to comment, so I have to make this an answer.)
To answer your original question, I've just been able to download an image in cURL from Naver Webtoons (the English site) by adding a Referer: http://www.webtoons.com header like so:
curl -H "Referer: http://www.webtoons.com" [link to image] > img.jpg
I haven't tried, but you'll probably want to use http://comic.naver.com instead. To do this with urllib, create a Request object with the header required:
req = urllib.request.Request(url, headers={"Referer": "http://comic.naver.com"})
with urllib.request.urlopen(req) as response, open("image.jpg", "wb") as outfile:
    shutil.copyfileobj(response, outfile)
The shutil.copyfileobj(src, dest) call saves the file (the snippet needs import shutil and import urllib.request). So instead of taking screenshots, you can simply get a list of all the images to download, then make a request for each one using the Referer header.
Edit: I have a working script on GitHub which only requires urllib and BeautifulSoup.
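
As a rough sketch of the "one request per image" idea, assuming the Referer value suggested above is what the site checks: plan_downloads, run_downloads, and the image_NNN.jpg naming scheme are hypothetical names chosen for illustration, and the example.com URLs are placeholders. Only the planning step runs in the demo, since the download step needs network access.

```python
import os
import shutil
import urllib.request

# Assumption: the Referer value suggested in the answer above.
REFERER = "http://comic.naver.com"


def plan_downloads(image_urls, dest_dir="."):
    """Pair each image URL with a Request carrying the Referer header and a local path."""
    plan = []
    for index, image_url in enumerate(image_urls):
        request = urllib.request.Request(image_url, headers={"Referer": REFERER})
        # Number the files so the images keep their page order (hypothetical naming scheme).
        path = os.path.join(dest_dir, "image_{:03d}.jpg".format(index))
        plan.append((request, path))
    return plan


def run_downloads(plan):
    """Fetch every planned request and stream it to disk (performs network I/O)."""
    for request, path in plan:
        with urllib.request.urlopen(request) as response, open(path, "wb") as outfile:
            shutil.copyfileobj(response, outfile)


# Demo: build the plan only; run_downloads(demo_plan) would do the actual fetching.
demo_plan = plan_downloads(
    ["http://example.com/a.jpg", "http://example.com/b.jpg"], dest_dir="out"
)
for request, path in demo_plan:
    print(request.full_url, request.get_header("Referer"), path)
```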

I took a quick look at the website with Chrome DevTools.
I would suggest downloading the images directly instead of taking screenshots. Selenium WebDriver should run the page's JavaScript in the PhantomJS headless browser, so you should get the images loaded by JavaScript at the following path.
The path I get from eyeballing the HTML is
html body #wrap #container #content div #comic_view_area div img
The image tags at the last level have IDs like content_image_N, with N counting from 0. So you can also get a specific picture by using img#content_image_0, for example.
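
To illustrate picking images by those IDs, here is a small sketch that uses only the standard library's HTMLParser on an HTML string (for example, driver.page_source). ContentImageParser and content_image_urls are hypothetical names, and the sample markup is made up to match the ID pattern described above, not copied from the site.

```python
import re
from html.parser import HTMLParser


class ContentImageParser(HTMLParser):
    """Collect the src of every <img> whose id looks like content_image_N."""

    def __init__(self):
        super().__init__()
        self.images = {}  # maps N to the image URL

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        match = re.fullmatch(r"content_image_(\d+)", attr.get("id") or "")
        if match and attr.get("src"):
            self.images[int(match.group(1))] = attr["src"]


def content_image_urls(html):
    """Return the matching image URLs ordered by their N suffix."""
    parser = ContentImageParser()
    parser.feed(html)
    return [parser.images[n] for n in sorted(parser.images)]


# Made-up sample markup following the ID pattern described above.
sample_html = """
<div id="comic_view_area"><div>
<img id="content_image_1" src="http://example.com/b.jpg" alt="comic content">
<img id="content_image_0" src="http://example.com/a.jpg" alt="comic content">
<img id="banner" src="http://example.com/ad.jpg">
</div></div>
"""
demo_urls = content_image_urls(sample_html)
print(demo_urls)
```

With BeautifulSoup installed, a CSS selector such as img#content_image_0 passed to soup.select() would be a shorter way to fetch a single image.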
