Eaten by the net? Crawler survival under this website using python - python

I am making web-crawler to get information from http://www.caam.org.cn/hyzc, but it showed me HTTP Error 302, and I cannot fix it.
https://imgur.com/a/W0cykim
The picture gives you a rough idea about the special layout of this website in that when you are browsing it, it will pop out a window, telling you that the website is accelerating, for the reason that there are so many people online, and then direct you to that website. As a result, when I use web-crawler, all I get is the information on this window, but nothing on this website. I think this is a good way for the website keeper to get rid of our web crawlers. So I want to ask for your help to get useful information from this website
At first, I used requests of python for my web crawler, and I only got information on that window, the results are shown here: https://imgur.com/a/GLcpdZn
And then I forbad website redirect, I got HTTP Error 303, shown:
https://imgur.com/a/6YtaVOt
This is the latest code I used:
python
import requests
def getpage(url):
try:
r= requests.get(url, headers={'User-Agent':'Mozilla/5.0'}, timeout=10)
r.raise_for_status()
r.encoding = r.apparent_encoding
return r.text
except:
return "try again"
url = "http://www.caam.org.cn/hyzc"
print(getpage(url))
The expected outcome of this question is to get useful information from the website http://www.caam.org.cn/hyzc. We may need to deal with the window popped out.

Looks like this website have some kind of protection against crawlers using requests, the page is not entirely loaded when you send a get request.
You can try to emulate a browser using selenium:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.caam.org.cn/hyzc')
print(driver.page_source)
driver.close()
driver.page_source will contain the page source.
You can learn how to setup selenium webdriver here.

I added something to delay the closure of my web crawl and this worked. So I want to share my lines in case you meet similar problem in the future:
python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get('http://www.caam.org.cn')
body = driver.find_element_by_tag_name("body")
wait = WebDriverWait(driver, 5, poll_frequency=0.05)
wait.until(EC.staleness_of(body))
print(driver.page_source)
driver.close()

Related

Selenium webpage not loading properly

I am trying to web scrape university ranking infomation from USNews site. And the problem is when I use selenium to open the webpage, the 'Load More Button' is not working properly. (I think I successfully click it but in the Chrome window opened by webdriver, when I scroll down to the button, is says that 'We're sorry, there was a problem loading the next page of search results'.
I am new to web scrawler and I did a lot of research on this, there are several similar questions but none of those answers helped. I really need some help. Here is my code:
driver_path = 'xxx' (chromedriver path)
driver = webdriver.Chrome(executable_path=driver_path)
url2 = 'https://www.usnews.com/education/best-global-universities/rankings'
wait = WebDriverWait(driver, 30)
driver.get(url2)
driver.maximize_window()
count = 1
while True:
try:
print(1)
# driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
wait.until(EC.visibility_of_element_located((By.XPATH,"//*[#id='rankings']/div[3]/button")))
print(2)
show_more = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[#id='rankings']/div[3]/button")))
ActionChains(browser).move_to_element(show_more).click().perform()
print(3)
# driver.find_element(By.XPATH,"//*[#id='rankings']/div[3]/button").click()
# print(4)
# wait.until(EC.visibility_of_element_located((By.XPATH,"//*[#id='rankings']/div[3]/button")))
# print(5)
count += 1
time.sleep(2)
if count >=2:
break
Even though I did not write code to close the ad, but I don't think the ad is the problem since when I manually close it and then click the button, it is still not working. Is it the problem with the website?
import requests
import os
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
It is clear that there is an anti scraping control in the specific site. It is always recommended to consult the robots.txt file beforehand and check whether scraping is possible on a certain site or not.
In general, this site blocks just the IP (try to go to other pages afterwards, you will see that you will get a 403 error).
In general, however, the approach you used does not seem wrong to me. You can try contacting the site directly to see if the problem can be solved in some other way.

No output while scraping Google search page

I am trying to scrape from Google search results the blue highlighted portion as shown below:
When I use inspect element, it shows: span class="YhemCb". I have tried using various soup.find and soup.find_all commands, but everything I have tried has no
output so far. What command should I use to scrape this part?
Google uses javascript to display most of its web elements, so using something like requests and BeautifulSoup is unfortunately not enough.
Instead, use selenium! It essentially allows you to control a browser using code.
First, you will need to navigate to the google page you wish to scrape
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
Then, you have to wait until the review page loads in the browser.
This is done using WebDriverWait: you have to specify an element that needs to appear on the page. The [data-attrid="kc:/local:one line summary"] span css selector allows me to select the review info about the hotel.
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the rating
print(review_element.get_attribute('innerHTML'))
Here's the full code in case you want to play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
# print the rating
print(review_element.get_attribute('innerHTML'))
Note Google is notoriously defensive against anyone who is trying to scrape them. On first few attempts you might be successful, but eventually you will have to deal with Google Captcha.
To work around that, I would suggest using the search engine scraper, something like the quickstart guide to get you started!
Disclaimer: I work at Oxylabs.io

Selenium get() redirects to another url

I'm trying to navigate to the following page and extract the html https://www.automobile.it/annunci?b=data&d=DESC, but everytime i call the get() method it looks like the website redirects me to the another page, always the same one which is https://www.automobile.it/torrenova?radius=100&b=data&d=DESC.
here's the simple code i'm running:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=ex_path)
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")
html=driver.page_source
if i do the same thing using the request module i don't get redirected
import requests
html=requests.get("https://www.automobile.it/annunci?b=data&d=DESC")
i don't understand why it's behaving like this, any ideas?
Use driver.delete_all_cookies()
from selenium import webdriver
driver = webdriver.Chrome(executable_path=ex_path)
driver.delete_all_cookies()
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")
html=driver.page_source
PS: be also warned: Page_source will not get you the completed DOM as rendered.
Well you can clear browser cache by using the below code :
I am assuming that you are using chrome.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path=ex_path)
driver.get('chrome://settings/clearBrowserData')
driver.find_element_by_xpath('//settings-ui').send_keys(Keys.ENTER)
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")

Use Selenium to download multiple URLs as PDFs

I am trying to have Selenium download the URLs of a webpage as PDFs on Safari. So far, I have been able to open the URL, but I can't get Safari to download it. All the solutions I found so far were either for another browser, or they didn't work. Ideally I would like it to download all links of one page and then move on the next page.
At first I thought that clicking on each hyperlink and then downloading it was the way to go. But that would require switching windows each time, so then I tried to find a way to download it without having to click on it, but nothing worked.
I am quite new at programming so I am sure that I am missing something.
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pdfkit
browser = webdriver.Safari()
browser.get(a_base_url)
username = browser.find_element_by_name("tb_LoginName")
password = browser.find_element_by_name("tb_Password")
submit = browser.find_element_by_id("btn_Login")
username.send_keys(username)
password.send_keys(password)
submit.click()
element=WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[#id="maincolumn"]/div/table/tbody/tr[2]/td[9]/a[2]'))).click()
browser.switch_to_window(browser.window_handles[0])
url=browser.current_url
I would go for the following approach:
Get href attribute of the link you want to download via WebElement.get_attribute() function
Use urllib or requests library to retrieve the URL from step 1 without using the browser
Most probably you will also need to get the browser Cookies via WebDriver.get_cookies function and add them to Cookie header for your download request

Access all href-links in a deep-class hierarchy

I am trying to access all href-links from a website, the search-results to be precise. My first intention is to get all the links, and then to look further on it. The problem is --> I get some links from the website, but not the links of the search-results. Here is one version of my code.
from selenium import webdriver
from htmldom import htmldom
dom = htmldom.HtmlDom("myWebsite")
dom = dom.createDom()
p_links = dom.find("a")
for link in p_links:
print("URL: " +link.attr("href"))
Here is screen of the HTML of that particular website. In the screen, I marked the href-link I try to access in the future. I am open for any help given, be it in Selenium, htmldom, b4soup, etc.
The data you are after, is loaded with AJAX requests. So, you can't scrape them directly after getting the page source. But, the AJAX request is sent to this URL:
https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20
which returns the data in JSON format. You can use requests module to scrape this data.
import requests
BASE_URL = 'https://open.nrw/dataset/'
r = requests.get('https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20')
data = r.json()
for item in data['response']['docs']:
print(BASE_URL + item['name'])
Output:
https://open.nrw/dataset/mags-90-10-dezilsverhaeltnis-der-aequivalenzeinkommen-1512029759099
https://open.nrw/dataset/alkis-nutzungsarten-pro-baublock-wuppertal-w
https://open.nrw/dataset/allgemein-bildende-schulen-am-1510-nach-schulformen-schulen-schueler-und-lehrerbestand-w
https://open.nrw/dataset/altersgruppen-in-meerbusch-gesamt-meerb
https://open.nrw/dataset/amtliche-stadtkarte-wuppertal-raster-w
https://open.nrw/dataset/mais-anteil-abhaengig-erwerbstaetiger-mit-geringfuegiger-beschaeftigung-1477312040433
https://open.nrw/dataset/mags-anteil-der-stillen-reserve-nach-geschlecht-und-altersgruppen-1512033735012
https://open.nrw/dataset/mags-anteil-der-vermoegenslosen-in-nrw-nach-beruflicher-stellung-1512032087083
https://open.nrw/dataset/anzahl-kinderspielplatze-meerb
https://open.nrw/dataset/anzahl-der-sitzungen-von-rat-und-ausschussen-meerb
https://open.nrw/dataset/anzahl-medizinischer-anwendungen-den-oeffentlichen-baedern-duesseldorfs-seit-2006-d
https://open.nrw/dataset/arbeitslose-den-wohnquartieren-duesseldorf-d
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-rechtskreisen-des-sgb-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-stadtteilen-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sgb-ii-rechtskreis-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sozialversicherungspflichtige-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/verkehrszentrale-arbeitsstellen-in-nordrhein-westfalen-1476688294843
https://open.nrw/dataset/mags-arbeitsvolumen-nach-wirtschaftssektoren-1512025235377
https://open.nrw/dataset/mais-armutsrisikoquoten-nach-geschlecht-und-migrationsstatus-der-personen-1477313317038
As you can see, this returned the first 20 URLs. When you first load the page only 20 items are present. But, if you scroll down, more are loaded. To get more items, you can change the Query String Parameter in the URL. The URL ends with rows=20. You can change this number to get the desired number of results.
Results appear after the initial page load due to the AJAX request.
I managed to get the links with Selenium, however I had to wait for .ckantitle a elements to be loaded (these are the links you want to get).
I should mention that the webdriver will wait for a page to load by
default. It does not wait for loading inside frames or for ajax
requests. It means when you use .get('url'), your browser will wait
until the page is completely loaded and then go to the next command in
the code. But when you are posting an ajax request, webdriver does not
wait and it's your responsibility to wait an appropriate amount of
time for the page or a part of page to load; so there is a module
named expected_conditions.
Code:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = 'https://open.nrw/suche'
html = None
browser = webdriver.Chrome()
browser.get(url)
delay = 3 # seconds
try:
WebDriverWait(browser, delay).until(
EC.presence_of_element_located((By.CSS_SELECTOR, '.ckantitle a'))
)
html = browser.page_source
except TimeoutException:
print('Loading took too much time!')
finally:
browser.quit()
if html:
soup = BeautifulSoup(html, 'lxml')
links = soup.select('.ckantitle a')
for link in links:
print(urljoin(url, link['href']))
You need to install selenium:
pip install selenium
and get a driver here.

Categories