I am trying to download the HTML of a page that is loaded through a JavaScript action when you click a link in the browser. I can download the first page because it has a general URL:
http://www.locationary.com/stats/hotzone.jsp?hz=1
But there are links along the bottom of the page that are numbers (1 to 10). So if you click on one, it goes to, for example, page 2:
http://www.locationary.com/stats/hotzone.jsp?ACTION_TOKEN=hotzone_jsp$JspView$NumericAction&inPageNumber=2
When I put that URL into my program and try to download the HTML, it gives me the HTML of a different page on the website, which I think is the home page.
How can I get the HTML of a page like this, which is loaded through JavaScript and has no specific URL of its own?
Thanks.
Code:
import urllib
import urllib2
import cookielib
import re

URL = ''

def load(url):
    # First login request: establishes a session and gets a Set-Cookie header back
    data = urllib.urlencode({"inUserName": "email", "inUserPass": "password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie', 'site_version=REGULAR'))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction").read()

    # Pull the session id out of the response headers (assumes it is the sixth header)
    h = response.info().headers
    jsid = re.findall(r'Set-Cookie: (.*);', str(h[5]))

    # Second login request: resends the credentials with the session cookie attached
    data = urllib.urlencode({"inUserName": "email", "inUserPass": "password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie', 'site_version=REGULAR; ' + str(jsid[0])))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)

    # Finally fetch the target page with the authenticated opener
    page = opener.open(url).read()
    print page

load(URL)
The Selenium WebDriver from the Selenium tool suite uses standard browsers to retrieve the HTML (its main goal is test automation for web applications), so it is well suited for scraping JavaScript-rich applications. It has nice Python bindings.
I tend to use Selenium to grab the page source after all the Ajax stuff has fired and parse it with something like BeautifulSoup (BeautifulSoup copes well with malformed HTML).
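For example, a minimal sketch of that workflow (locating the pagination link by its link text "2" is an assumption; check the actual page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://www.locationary.com/stats/hotzone.jsp?hz=1")

# Click the "2" pagination link; locating it by link text is an assumption
driver.find_element(By.LINK_TEXT, "2").click()

# After the JavaScript action fires, grab the fully rendered HTML
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())
driver.quit()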
Related
After clicking on the button 11.331 Treffer located at the top right corner within the filter of this webpage, I can see the result displayed on that page. I've created a script using the requests module to fetch the ID numbers of different properties from that page.
However, when I run the script, I get json.decoder.JSONDecodeError. If I copy the cookies from dev tools directly and paste them within the headers, I get the results accordingly.
I don't wish to copy cookies from dev tools every time I run the script, so I used Selenium to collect cookies from the landing page and supply them within headers to get the desired result, but I still get the same error.
This is what I'm trying:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

start_url = 'https://www.immobilienscout24.de/'
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?pagenumber=1'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?enteredFrom=one_step_search',
    'accept': 'application/json; charset=utf-8',
    'x-requested-with': 'XMLHttpRequest'
}

def get_cookies():
    with webdriver.Chrome() as driver:
        driver.get(start_url)
        time.sleep(10)
        cookiejar = {c['name']: c['value'] for c in driver.get_cookies()}
    return cookiejar

cookies = get_cookies()
cookie_string = "; ".join([f"{item}={val}" for item, val in cookies.items()])

with requests.Session() as s:
    s.headers.update(headers)
    s.headers['cookie'] = cookie_string
    res = s.get(link)
    container = res.json()['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
    for item in container:
        try:
            project_id = item['#id']
        except KeyError:
            project_id = ""
        print(project_id)
How can I scrape property ids from that webpage using the requests module?
EDIT:
The existence of the following portion within the cookies is crucial; without it, the script leads to the error I mentioned. However, Selenium failed to include that portion within its cookies.
reese84=3:/qdGO9he7ld4/8a35vlw8g==:+/xBfAtVPRKHBSJgzngTQw1ywoViUvmVKLws+f8Y6edDgM+3s0Xzo17NvfgPrx9Z/suRy7hee5xcEgo85V3LdGsIop9/29g1ib1JQ0pO3UHWrtn81MseS6G8KE6AF4SrWZ2t8eTr1SEogUmCkB1HNSqXT88sAZaEi+XSzUyAGqikVjEcLX9TeI+KN37QNr9Sl+oTaOPchSgS/IowPj83zvT471Ewabg8CAc6q8I9AJ8Zb9FfLqePweCM+QFKIw+ZUp5GR4TXxZVcWdipbIEAyv3kj2x9Xs1K1k+8aXmy9VES6rFvW1xOsAjLmXbg6REPBye+QcAgPUh/x79mBWktcWC/uQ5L2W2dBLBS4eM2+bpEBw5EHMfjq9bk9hnmmZuxPGALLKASeXBt5lUUwx7x+wtGcjyvB9ZSE6gI2VxFLYqncYmhKqoNzgwQY8wRThaEraiJF/039/vVMa2G3S38iwniiOGHsOxq6VTdnWJGgvJqUmpWfXzz6XQXWL2xcykAoj7LMqHF2tC0DQyInUmZ3T7zjPBV7mEMgZkDn0z272E=:qQHyFe1/pp8/BS4RHAtxftttcOYJH4oqG1mW0+aNXF4=;
I think another part of your problem is that the link is not JSON: it's an HTML document. Part of the HTML document contains JavaScript that sets a JS variable to a JSON object. You can't get that with res.json().
In theory, you could use selenium to go to the link and grab the contents of the IS24.resultList variable by executing javascript like this:
import json  # needed to parse the variable's contents

driver.get(link)
time.sleep(10)
result_list = json.loads(driver.execute_script("return window.IS24.resultList"))
In practice, I think they're really serious about blocking bots and I suspect convincing them you're not a bot might take more than spoofing a cookie. When I visit via Selenium I don't even get the recaptcha option that I get when visiting through a regular browser session with incognito mode.
I'm trying to parse articles from https://financialpost.com/; an example link is provided below. To parse this, I need to log in to their website.
I do successfully post my credentials; however, it still does not parse the entire webpage, just the beginning.
How do I crawl everything?
import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

link = 'https://financialpost.com/sign-in/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload = {i['email']: i.get('value', '') for i in soup.select('input[email]')}
    payload['email'] = 'email#email.com'
    payload['password'] = 'my_password'
    s.post(link, data=payload)

url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'
content_url = Request(url)
article_content = urlopen(content_url).read()
article_soup = BeautifulSoup(article_content, 'html.parser')
article_table = article_soup.findAll('section', attrs={'class': 'article-content__content-group'})
for x in article_table:
    print(x.find('p').text)
Using just requests
It's a bit complicated using just requests, but possible. You would first have to authenticate to get an authentication token, then request the article with that token so the site knows you are authenticated and will display the full article. To find out which API endpoints are used to authenticate and load the website content, you can use something like Chrome dev tools or Fiddler (they record all HTTP requests, so you can manually find the interesting ones).
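A rough sketch of that flow with requests only (the endpoint path and field names below are made up for illustration; you would need to find the real ones in dev tools):

import requests

with requests.Session() as s:
    # Hypothetical auth endpoint and payload; find the real ones in dev tools
    res = s.post('https://financialpost.com/api/auth/login',
                 json={'email': 'email#email.com', 'password': 'my_password'})
    token = res.json().get('token')  # hypothetical response field
    # Send the token with the article request so the site returns the full text
    s.headers['Authorization'] = f'Bearer {token}'
    article = s.get('https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass')
    print(article.text[:500])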
Using just selenium
The easier way would be to just use Selenium. It drives a real browser from code, so you can open the login page, authenticate, and request the article, and the site will think you are a human.
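A minimal sketch of that approach (the form-field locators are assumptions; inspect the actual sign-in page):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://financialpost.com/sign-in/')

# The field locators below are assumptions; check the real form in dev tools
driver.find_element(By.NAME, 'email').send_keys('email#email.com')
driver.find_element(By.NAME, 'password').send_keys('my_password')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()
time.sleep(5)  # crude wait for the login to complete

driver.get('https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass')
soup = BeautifulSoup(driver.page_source, 'html.parser')
for section in soup.find_all('section', attrs={'class': 'article-content__content-group'}):
    print(section.find('p').text)
driver.quit()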
url = "https://www.usnews.com"
page = requests.get(url, timeout = 5)
soup = BeautifulSoup(page.content,"html.parser")
requests from usnews.com is not working properly. The code either runs forever or times out after five seconds, as instructed. I have tried other websites, which work perfectly fine (wikipedia.org, google.com).
They are using a special protection against web scrapers like you. Whenever you go to a website, your web browser sends a special piece of data called a User-Agent. It tells the website what type of browser you are using and whether you are on a phone or computer. By default, the requests module sends its own generic User-Agent (something like python-requests/2.25.1), which many sites recognize and block.
You can set your own User-Agent pretty easily. Using your website as an example:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
url = "https://www.usnews.com"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content,"html.parser")
This code tells the website that we are an actual person and not a bot.
You can learn more about User Agents here (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent).
You should try something like Selenium. The code below is similar to Selenium but a bit more user friendly:
from requests_html import HTMLSession
#from fake_useragent import UserAgent

# create the session
#ua = UserAgent()
session = HTMLSession()

# define our URL
url = "https://www.usnews.com"

# use the session to get the data
r = session.get(url)

# render the page; up the number on scrolldown to page down multiple times on a page
r.html.render(sleep=1, timeout=30, keep_page=True, scrolldown=1)

# the rendered HTML lives in r.html.html (r.text would be the unrendered source)
print(r.html.html)
This code mimics a real browser and should bypass the bot detection.
I have looked through the previous answers, but none seemed to be applicable. I am building an open source Quizlet scraper to extract all links from a class (e.g. https://quizlet.com/class/3675834/). In this case, the tag is a and the class is "UILink". But when I use the following code, the returned list does not contain the element I am looking for. Is it because of the JavaScript issue described here?
I tried to use the previous method of importing folder as written here but it does not contain the urls.
How can I scrape these urls?
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}
url = 'https://quizlet.com/class/8536895/'
response = requests.get(url, verify=False, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
b = soup.find_all("a", class_="UILink")
You wouldn't be able to directly scrape dynamic webpages using just requests. What you see in the browser is the fully rendered page, which the browser builds by executing JavaScript.
In order to scrape data from these kinds of webpages, you can follow either of the approaches below.
Use requests-html instead of requests
pip install requests-html
scraper.py
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
url = 'https://quizlet.com/class/8536895/'
response = session.get(url)
response.html.render() # render the webpage
# access html page source with html.html
soup = BeautifulSoup(response.html.html, 'html.parser')
b = soup.find_all("a", class_="UILink")
print(len(b))
Note: this uses a headless browser (Chromium) under the hood to render the page, so it can time out or be a little slow at times.
Use selenium webdriver
Use driver.get(url) to get the page and pass the page source to Beautiful Soup with driver.page_source.
Note: run this in headless mode as well; there may be some latency at times.
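A short sketch of that approach, assuming Chrome:

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get('https://quizlet.com/class/8536895/')
# Pass the rendered page source to Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html.parser')
b = soup.find_all("a", class_="UILink")
print(len(b))
driver.quit()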
I want to scrape an Amazon deals page with Python and Beautiful Soup, but when I run the code I don't get any results, whereas when I try the code on any other Amazon page I get results.
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011&ref_=nav_navm_intl_deal_btn'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
           'referer': 'https://www.amazon.com/'
           }
s = requests.session()
s.headers.update(headers)
r = s.get(url)
soup = BeautifulSoup(r.content, "lxml")
for x in soup.find_all('span', {'class': 'a-declarative'}):
    print(x.text + "\n")
When you visit that page in your browser, the page makes additional requests to get more information, it then updates the first page with that information. In your case, the url https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011&ref_=nav_navm_intl_deal_btn is just a template, and when loaded it makes additional requests to get the deal information to populate the template.
Amazon is a popular site and people have made many web scrapers for it. Check this one out. If it doesn't do what you need, just google "github amazon scraper" and you will get many options.
If you still want to code a scraper yourself, start reading up on Selenium. It is a Python package that drives a real web browser, allowing you to load a web page and all its additional requests before scraping.
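A minimal sketch of what that could look like (the span class comes from your own code; in practice you may need explicit waits or different selectors):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011&ref_=nav_navm_intl_deal_btn')
time.sleep(10)  # crude wait for the page's additional requests to finish

soup = BeautifulSoup(driver.page_source, 'lxml')
for x in soup.find_all('span', {'class': 'a-declarative'}):
    print(x.text + "\n")
driver.quit()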