Incomplete HTML content using Python requests.get

I am trying to get the HTML content of a URL using requests.get in Python, but I am getting an incomplete response.
import requests
from lxml import html

url = "https://www.expedia.com/Hotel-Search?destination=Maldives&latLong=3.480528%2C73.192127&regionId=109&startDate=04%2F20%2F2018&endDate=04%2F21%2F2018&rooms=1&_xpid=11905%7C1&adults=2"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
print(response.content)
Can anyone suggest the changes needed to get the complete response?
NB: using Selenium I am able to get the complete response, but that is not the recommended way.

If you need content generated dynamically by JavaScript and you don't want to use Selenium, you can try the requests-html library, which supports JavaScript rendering:
from requests_html import HTMLSession

session = HTMLSession()
url = "https://www.expedia.com/Hotel-Search?destination=Maldives&latLong=3.480528%2C73.192127&regionId=109&startDate=04%2F20%2F2018&endDate=04%2F21%2F2018&rooms=1&_xpid=11905%7C1&adults=2"
r = session.get(url)
r.html.render()  # executes the page's JavaScript
print(r.html.html)  # the rendered HTML, not just the raw response body
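Note that the first call to render() downloads a Chromium build (requests-html uses pyppeteer under the hood), so it can take a while the first time; subsequent calls reuse the downloaded browser.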


Can't scrape listing links from a webpage using the requests module

I'm trying to scrape the different listings for the search Oxford, Oxfordshire from this webpage using the requests module. This is how the input box looks before I click the search button.
I've defined an accurate selector to locate the listings, but the script fails to grab any data.
import requests
from pprint import pprint
from bs4 import BeautifulSoup

link = 'https://www.zoopla.co.uk/search/'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,bn;q=0.8',
    'Referer': 'https://www.zoopla.co.uk/for-sale/',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}
params = {
    'view_type': 'list',
    'section': 'for-sale',
    'q': 'Oxford, Oxfordshire',
    'geo_autocomplete_identifier': 'oxford',
    'search_source': 'home'
}
res = requests.get(link, params=params, headers=headers)
soup = BeautifulSoup(res.text, "html5lib")
for item in soup.select("[id^='listing'] a[href^='/for-sale/details/']:has(h2[data-testid='listing-title'])"):
    print(item.get("href"))
EDIT:
If I try something like the following, the script seems to work flawlessly. The main problem is that I had to use hardcoded cookies within the headers, and they expire within a few minutes.
import json
from pprint import pprint
from bs4 import BeautifulSoup
import cloudscraper

base = 'https://www.zoopla.co.uk{}'
link = 'https://www.zoopla.co.uk/for-sale/'
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'cookie': 'ajs_anonymous_id=caa7072ed7f64911a51dda2b525a3ca3; zooplapsid=cafe5156dd8f4cdda14e748c9270f623; base_device_id=68f7b6b7-27b8-429e-af66-366a4b64bac4; g_state={"i_p":1675619616576,"i_l":2}; zid=31173482e60549da9ccc1632e52a264c; zooplasid=31173482e60549da9ccc1632e52a264c; base_session_start_page=https://www.zoopla.co.uk/; base_request=https://www.zoopla.co.uk/; base_session_id=2315eaf2-6d59-4075-aeaa-6288af3efef7; base_session_count=8; forced_features={}; forced_experiments={}; active_session=anon; _gid=GA1.3.821027055.1675853830; __cf_bm=6.bEGFdT2vYz3G3iO7swuTFwSfhyzA0DvGoCjB6KvVg-1675853990-0-AQqWHydhL+/hqq8KRqOpCKDNtd6E96qjLgyOF77S8f7DpqCbMFoxAycD8ahQd7FOShSq0oHD//gpDj095eQPdtccDyZ0qu6GvxiSpjNP0+D7sblJP1e3Mlmxw5YroG3O4OuJHgBco3zThrx2SRyVDfx7M1zNlwi/1OVfww/u2wfb5DCW+gGz1b18zEvpNRszYQ==; cookie_consents={"schemaVersion":4,"content":{"brand":1,"consents":[{"apiVersion":1,"stored":false,"date":"Wed, 08 Feb 2023 10:59:02 GMT","categories":[{"id":1,"consentGiven":true},{"id":3,"consentGiven":false},{"id":4,"consentGiven":false}]}]}}; _ga=GA1.3.1980576228.1675275335; _ga_HMGEC3FKSZ=GS1.1.1675853830.7.1.1675853977.0.0.0'
}
params = {
    'q': 'Oxford, Oxfordshire',
    'search_source': 'home',
    'pn': 1
}
scraper = cloudscraper.create_scraper()
res = scraper.get(url, params=params, headers=headers)
print(res.status_code)
soup = BeautifulSoup(res.text, "lxml")
container = soup.select_one("script[id='__NEXT_DATA__']").contents[0]
items = json.loads(container)['props']['pageProps']['initialProps']['regularListingsFormatted']
for item in items:
    print(item['address'], base.format(item['listingUris']['detail']))
How can I get content from that site without using hardcoded cookies within the headers?
The following code example works smoothly without the headers and params parameters. The website's data isn't dynamic, meaning you can grab the required data from the static HTML DOM; the main hindrance is that the site uses Cloudflare protection. To get around that restriction you can use either cloudscraper instead of the requests module, or Selenium. Here I use cloudscraper, and it works fine.
Script:
import pandas as pd
from bs4 import BeautifulSoup
import cloudscraper

scraper = cloudscraper.create_scraper()
kw = ['Oxford', 'Oxfordshire']
data = []
for k in kw:
    for page in range(1, 3):
        url = f"https://www.zoopla.co.uk/for-sale/property/oxford/?search_source=home&q={k}&pn={page}"
        res = scraper.get(url)
        #print(res)
        soup = BeautifulSoup(res.content, "html.parser")
        for card in soup.select('[data-testid="regular-listings"] [id^="listing"]'):
            link = "https://www.zoopla.co.uk" + card.a.get("href")
            print(link)
            #data.append({'link': link})

# df = pd.DataFrame(data)
# print(df)
Output:
https://www.zoopla.co.uk/for-sale/details/63903233/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898182/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898168/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898177/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63897930/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63897571/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896910/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896858/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896815/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63893187/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/47501048/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63891727/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63890876/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63889459/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63888298/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63887586/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63887525/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/59469692/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63882084/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63878480/?search_identifier=cbe92a4f0868061e26dff87f97442c6a
https://www.zoopla.co.uk/for-sale/details/63877980/?search_identifier=cbe92a4f0868061e26dff87f97442c6a
... so on
You could just set the browser type (the User-Agent header) and read the contents with a simple urllib request:
import urllib.request

# URL for 'Oxford, Oxfordshire'
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/?q=Oxford%2C%20Oxfordshire&search_source=home'
result = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urllib.request.urlopen(result).read()
print(webpage)
This also works just fine. The only thing is that you will have to write a couple of lines of code to extract exactly what you want from each listing yourself, or make the class field dynamic if necessary.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.zoopla.co.uk/for-sale/property/oxford/?q=Oxford%2C%20Oxfordshire&search_source=home'
result = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urllib.request.urlopen(result).read()
soup = BeautifulSoup(webpage, "html.parser")
webpage_listings = soup.find_all("div", class_="f0xnzq0")
if webpage_listings:
    for item in webpage_listings:
        print(item)
else:
    print("Empty list")

JS site does not return data when scraping

I'm using the script I always use to scrape data from the web, but I'm not having any success.
I would like to get the data from the table on the website:
https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx
I'm using the following code for scraping:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
bs = BeautifulSoup(html, 'lxml')
print(bs)
Currently I only receive the site's JS and not the data from the table itself.
Do an HTTP POST to https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx/PopulaComboEmpresas
This will return the data of the table as JSON.
In the browser, press F12 --> Network --> Fetch/XHR to see more details, such as the HTTP headers and the POST body.
You can do that easily using only requests, since the endpoint answers a POST with a JSON response.
Here is the working code:
import requests
import json
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

body = {'tipoEmpresa': '0'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'x-dtpc': '33$511511524_409h2vHHVRBIAIGILPJNCRGRCECUBIACWCBUEE-0e37',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/json'
}
url = 'https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx/PopulaComboEmpresas'
r = requests.post(url, data=json.dumps(body), headers=headers, verify=False)
res = r.json()['d']
print(res)
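The 'd' key is the standard wrapper ASP.NET page methods put around the JSON they return, which is why the table data sits under r.json()['d']. The hardcoded x-dtpc value is a Dynatrace monitoring header; the request may well work without it, so treat it as optional and copy a fresh one from the Network tab if the server rejects the call.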

Wayfair product prices

I'm not sure if there's an API for this but I'm trying to scrape the price of certain products from Wayfair.
I wrote some Python code using Beautiful Soup and requests, but I'm getting back HTML that says "Our systems have detected unusual traffic from your computer network".
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}

def fetch_wayfair_price(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)

fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data in after the initial page, so just getting the HTML from the URL you provided probably won't be all you need. That being said, I was able to get the request to work.
Start by using a session from the requests library; this will track cookies and session state. I also added upgrade-insecure-requests: '1' to the headers so that you can avoid some of the issues that HTTPS introduces to scraping. The request now returns the same response as the browser request does, but the information you want is probably loaded in subsequent requests the browser/JS makes.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

def fetch_wayfair_price(url):
    with requests.Session() as session:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        print(soup)

fetch_wayfair_price(
    'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: the session that the requests library creates really should persist outside of the fetch_wayfair_price method; I have contained it in the method to match your example.
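For illustration, a minimal sketch of that refactor, with the session created once at module level so cookies persist across calls (passing the session in explicitly is my own variation, not part of the original example):

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

# Created once; every call through it shares cookies and connection pooling
session = requests.Session()
session.headers.update(headers)

def fetch_wayfair_price(session, url):
    response = session.get(url)
    return BeautifulSoup(response.text, 'html.parser')

soup = fetch_wayfair_price(session, 'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
print(soup.title)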

Not able to replicate AJAX using Python Requests

I am trying to replicate ajax request from a web page (https://droughtmonitor.unl.edu/Data/DataTables.aspx). AJAX is initiated when we select values from dropdowns.
I am making the following request with Python, but I am not able to see the same response as in the Network tab of the browser.
import bs4
import requests

ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')
headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
req_data = {'area': '00064', 'statstype': '1'}
resp = ses.post(url, data=req_data, headers=headers_dict)
soup = bs4.BeautifulSoup(resp.content, 'lxml')
print(soup)
You need to add several things to your request to get an answer from the server.
You need to convert your dict to JSON so that it is passed as a string and not as a dict.
You also need to declare the type of the request data by setting the request header Content-Type: application/json; charset=utf-8.
With those changes I was able to request the correct data.
import json

import bs4
import requests

ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')
headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                'Content-Type': 'application/json; charset=utf-8'}
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
req_data = json.dumps({'area': '00037', 'statstype': '1'})
resp = ses.post(url, data=req_data, headers=headers_dict)
soup = bs4.BeautifulSoup(resp.content, 'lxml')
print(soup)
Quite a tricky problem I must say.
From the requests documentation:
Instead of encoding the dict yourself, you can also pass it directly
using the json parameter (added in version 2.4.2) and it will be
encoded automatically:
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
Then, to get the output, call r.json() and you will get the data you are looking for.
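Applied to this endpoint, a minimal sketch (same URL and body as the answer above; requests sets the Content-Type: application/json header for you when json= is used):

import requests

ses = requests.Session()
# Hit the page first so the session picks up the site's cookies
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
# json= serializes the body and sets the Content-Type header automatically
resp = ses.post(url, json={'area': '00037', 'statstype': '1'})
print(resp.json())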

How to get the commented section of the HTML code using requests?

I am trying to get the number of followers of a Facebook page, i.e. https://web.facebook.com/marlenaband. I am using the Python requests library. When I view the page source in the browser, the text "142 people follow this" appears to be in a commented section of the page. But I am not seeing it in the response text using requests and BeautifulSoup. Would someone please help me figure out how to get this? Thanks
Here is the code I am using:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://web.facebook.com/marlenaband'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36',
}
res = requests.get(url, headers=headers)
print(res.content)
I actually got it using requests by modifying the headers to this:
headers = {
    'accept-language': 'en-US,en;q=0.8',
}
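Once the response actually contains the commented-out markup, you still have to pull it out of the comment nodes, since BeautifulSoup does not treat their contents as regular tags. A minimal sketch using BeautifulSoup's Comment type (matching on the substring "follow this" is an assumption about the markup; adjust as needed):

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://web.facebook.com/marlenaband'
headers = {'accept-language': 'en-US,en;q=0.8'}
res = requests.get(url, headers=headers)

soup = BeautifulSoup(res.content, 'html.parser')
# HTML comments are text nodes of type Comment; re-parse each one as HTML
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(comment, 'html.parser')
    for node in inner.find_all(string=lambda t: 'follow this' in t):
        print(node.strip())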
