I am quite puzzled by this code:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
PATH = 'My_path_to/chromedriver'
for i in range(3):
    driver = webdriver.Chrome(PATH)
    url = 'https://www.daraz.pk/laptops/?page=' + str(i + 1)
    print(url)
    driver.get(url)
    time.sleep(5)
    driver.quit()
Even though the correct link appears in the navigation bar each time, the browser always shows page 1, yet when I click the printed links manually, the display is correct... Is this a JavaScript trick to stop web scraping? Is it possible to overcome it?
You don't need to use selenium to scrape daraz.pk.
Use requests; the page source contains a JSON object that you can parse with json and let pandas export to a CSV file.
The following code does the job for one page.
You may want to unpack the dictionary so that all the information appears at the first level in the CSV file.
import requests
from bs4 import *
import json
import pandas as pd
pd.DataFrame((json.loads([x for x in BeautifulSoup(requests.get('https://www.daraz.pk/laptops/?page=1').content, 'html.parser').findAll('script') if 'window.pageData' in x.text][0].text.split('window.pageData=')[1]))['mods']['listItems']).to_csv('page1.csv', index=False)
This should make more sense:
import requests
from bs4 import *
import json
import pandas as pd
page = requests.get('https://www.daraz.pk/laptops/?page=1')
soup = BeautifulSoup(page.content, 'html.parser')
json_data = [x for x in soup.findAll('script') if 'window.pageData' in x.text][0].text.split('window.pageData=')[1]
json_object = json.loads(json_data)
listed_items_dict = json_object['mods']['listItems']
dataframe = pd.DataFrame(listed_items_dict)
dataframe.to_csv('page1.csv', index=False)
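If you also want to unpack the nested dictionaries onto the first level of the CSV (as mentioned above), pandas.json_normalize is one option. This is only a sketch reusing listed_items_dict from the snippet above; the resulting column names depend on what the listItems entries actually contain:
import pandas as pd
# json_normalize flattens nested dictionaries into dotted column names,
# e.g. {'price': {'value': 1}} becomes a 'price.value' column.
flat = pd.json_normalize(listed_items_dict)
flat.to_csv('page1_flat.csv', index=False)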
Based on the comment, use this function. It uses requests, sends a cookie with the request, and returns a dictionary of listed items.
def get_page_ajax(n):
    headers = {
        'authority': 'www.daraz.pk',
        'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36',
        'content-type': 'application/json',
        'accept': '*/*',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.daraz.pk/laptops/?page=1',
        'accept-language': 'en-US,en;q=0.9',
        'cookie': 'cna=OZ/WFbQKLVwCAbKZLti/7Hsq; _scid=7861a3f6-8fc7-4051-8f21-7ff77b830e4f; lzd_cid=84c13174-852a-4f95-ed5d-a8dcdd6767bb; t_uid=84c13174-852a-4f95-ed5d-a8dcdd6767bb; lzd_sid=1ba989d0707604107e91e9001996d1fd; _tb_token_=5e88e18541b9f; hng=PK|en-PK|PKR|586; userLanguageML=en-PK; t_fv=1619138165830; _gcl_au=1.1.1640796331.1619138166; _ga=GA1.2.213815672.1619138167; _gid=GA1.2.1272622655.1619138167; __auc=3682a279178fc27bf6267d1df0c; _bl_uid=h1kzbndntF4lIg2hX9vq1m1oLdU0; _sctr=1|1619125200000; xlly_s=1; t_sid=oBnSCDmQiPyrQNhDz6NRhBUUjfRvyUHF; utm_channel=NA; daraz-marketing-tracker=hide; _m_h5_tk=a8d8092d608b9889dec7cefccdc3b351_1619183643275; _m_h5_tk_enc=068447fd0471b8537f47a430f7fe4128; __asc=f4f31b38178fe5456d962e14446; _gat_UA-31709783-1=1; JSESSIONID=7A1B772A77330A06B2D19937BDE5775B; tfstk=ctmcBmfEqqzb16zoRnZjhTB12gpRZYSatcoS4cfIen_xLumPiBQP80ByEScmXN1..; l=eBM3wODljUtubHaFBO5whurza77OUIOf1sPzaNbMiInca1uO6wI2zNCCg036JdtjQtf0uetzd4S4yReM7rzU-xNbmbKe6QuI2ov6-; isg=BMvLGwrYLxDvEE44amWSgStYWm-1YN_ieR8y9T3J-YoyXOi-xTPLM_26NkSy_Dfa',
    }
    params = (
        ('ajax', 'true'),
        ('page', str(n)),
    )
    response = requests.get('https://www.daraz.pk/laptops/', headers=headers, params=params)
    json_object = json.loads(response.content.decode())
    listed_items_dict = json_object['mods']['listItems']
    return listed_items_dict
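A minimal usage sketch (assuming the cookie above is still valid; normally you would copy a fresh one from your own browser session):
import pandas as pd

# Collect the first three pages and write everything to one CSV file.
all_items = []
for n in range(1, 4):
    all_items.extend(get_page_ajax(n))
pd.DataFrame(all_items).to_csv('laptops.csv', index=False)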
I've tried your code, but I still don't know what the problem is. The pages are displayed correctly, but I've noticed that there is no difference between the elements shown on the first and second pages, while the third page shows different elements. I don't know why, but I visited the URL manually and got the same result.
Related
I'm trying to scrape the different listings for the search Oxford, Oxfordshire from this webpage using the requests module. This is how the input box looks before I click the search button.
I've defined an accurate selector to locate the listings, but the script fails to grab any data.
import requests
from pprint import pprint
from bs4 import BeautifulSoup
link = 'https://www.zoopla.co.uk/search/'
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,bn;q=0.8',
'Referer': 'https://www.zoopla.co.uk/for-sale/',
'X-Requested-With': 'XMLHttpRequest',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}
params = {
'view_type': 'list',
'section': 'for-sale',
'q': 'Oxford, Oxfordshire',
'geo_autocomplete_identifier': 'oxford',
'search_source': 'home'
}
res = requests.get(link,params=params,headers=headers)
soup = BeautifulSoup(res.text,"html5lib")
for item in soup.select("[id^='listing'] a[href^='/for-sale/details/']:has(h2[data-testid='listing-title'])"):
    print(item.get("href"))
EDIT:
If I try something like the following, the script seems to work flawlessly. The main problem is that I had to use hardcoded cookies within the headers, and they expire within a few minutes.
import json
from pprint import pprint
from bs4 import BeautifulSoup
import cloudscraper
base = 'https://www.zoopla.co.uk{}'
link = 'https://www.zoopla.co.uk/for-sale/'
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
'cookie': 'ajs_anonymous_id=caa7072ed7f64911a51dda2b525a3ca3; zooplapsid=cafe5156dd8f4cdda14e748c9270f623; base_device_id=68f7b6b7-27b8-429e-af66-366a4b64bac4; g_state={"i_p":1675619616576,"i_l":2}; zid=31173482e60549da9ccc1632e52a264c; zooplasid=31173482e60549da9ccc1632e52a264c; base_session_start_page=https://www.zoopla.co.uk/; base_request=https://www.zoopla.co.uk/; base_session_id=2315eaf2-6d59-4075-aeaa-6288af3efef7; base_session_count=8; forced_features={}; forced_experiments={}; active_session=anon; _gid=GA1.3.821027055.1675853830; __cf_bm=6.bEGFdT2vYz3G3iO7swuTFwSfhyzA0DvGoCjB6KvVg-1675853990-0-AQqWHydhL+/hqq8KRqOpCKDNtd6E96qjLgyOF77S8f7DpqCbMFoxAycD8ahQd7FOShSq0oHD//gpDj095eQPdtccDyZ0qu6GvxiSpjNP0+D7sblJP1e3Mlmxw5YroG3O4OuJHgBco3zThrx2SRyVDfx7M1zNlwi/1OVfww/u2wfb5DCW+gGz1b18zEvpNRszYQ==; cookie_consents={"schemaVersion":4,"content":{"brand":1,"consents":[{"apiVersion":1,"stored":false,"date":"Wed, 08 Feb 2023 10:59:02 GMT","categories":[{"id":1,"consentGiven":true},{"id":3,"consentGiven":false},{"id":4,"consentGiven":false}]}]}}; _ga=GA1.3.1980576228.1675275335; _ga_HMGEC3FKSZ=GS1.1.1675853830.7.1.1675853977.0.0.0'
}
params = {
'q': 'Oxford, Oxfordshire',
'search_source': 'home',
'pn': 1
}
scraper = cloudscraper.create_scraper()
res = scraper.get(url,params=params,headers=headers)
print(res.status_code)
soup = BeautifulSoup(res.text,"lxml")
container = soup.select_one("script[id='__NEXT_DATA__']").contents[0]
items = json.loads(container)['props']['pageProps']['initialProps']['regularListingsFormatted']
for item in items:
    print(item['address'], base.format(item['listingUris']['detail']))
How can I get content from that site without using hardcoded cookies within the headers?
The following code example works smoothly without adding headers or params. The website's data isn't dynamic, meaning you can grab the required data from the static HTML DOM, but the main hindrance is that the site uses Cloudflare protection. To get around that restriction you can use either cloudscraper instead of the requests module, or selenium. Here I use cloudscraper and it works fine.
Script:
import pandas as pd
from bs4 import BeautifulSoup
import cloudscraper
scraper = cloudscraper.create_scraper()
kw = ['Oxford', 'Oxfordshire']
data = []
for k in kw:
    for page in range(1, 3):
        url = f"https://www.zoopla.co.uk/for-sale/property/oxford/?search_source=home&q={k}&pn={page}"
        # use a separate name for the response so it doesn't shadow the loop variable
        res = scraper.get(url)
        #print(res)
        soup = BeautifulSoup(res.content, "html.parser")
        for card in soup.select('[data-testid="regular-listings"] [id^="listing"]'):
            link = "https://www.zoopla.co.uk" + card.a.get("href")
            print(link)
            #data.append({'link': link})
# df = pd.DataFrame(data)
# print(df)
Output:
https://www.zoopla.co.uk/for-sale/details/63903233/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898182/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898168/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898177/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63897930/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63897571/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896910/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896858/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896815/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63893187/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/47501048/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63891727/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63890876/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63889459/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63888298/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63887586/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63887525/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/59469692/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63882084/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63878480/?search_identifier=cbe92a4f0868061e26dff87f97442c6a
https://www.zoopla.co.uk/for-sale/details/63877980/?search_identifier=cbe92a4f0868061e26dff87f97442c6a
... so on
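If you prefer the structured data you were already pulling from the __NEXT_DATA__ script tag in your EDIT, that approach also seems to work with cloudscraper and no cookie header. This is only a sketch reusing the key path from your EDIT, which may change whenever the site updates its front end:
import json
from bs4 import BeautifulSoup
import cloudscraper

scraper = cloudscraper.create_scraper()
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/?search_source=home&q=Oxford%2C%20Oxfordshire&pn=1'
res = scraper.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

# The listing data is embedded as JSON inside the __NEXT_DATA__ script tag.
container = soup.select_one("script[id='__NEXT_DATA__']").contents[0]
items = json.loads(container)['props']['pageProps']['initialProps']['regularListingsFormatted']
for item in items:
    print(item['address'], 'https://www.zoopla.co.uk' + item['listingUris']['detail'])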
You could just set the browser type (the User-Agent header) and read the contents with a simple request:
import urllib.request

# Url for 'Oxford, Oxfordshire'
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/?q=Oxford%2C%20Oxfordshire&search_source=home'
result = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urllib.request.urlopen(result).read()
print(webpage)
This also works just fine. The only thing is that you will have to write a couple of lines of code yourself to extract exactly what you want from each listing, or make the class field dynamic if necessary.
import urllib.request
from bs4 import BeautifulSoup
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/?q=Oxford%2C%20Oxfordshire&search_source=home'
result = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urllib.request.urlopen(result).read()
soup = BeautifulSoup(webpage, "html.parser")
webpage_listings = soup.find_all("div", class_="f0xnzq0")
if webpage_listings:
    for item in webpage_listings:
        print(item)
else:
    print("Empty list")
I'm trying to write a script to scrape some data off a Zillow page (https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/). Obviously I'm just trying to gather data from every listing. However, I cannot grab the data from every listing: it only finds 9 instances of the class I'm searching for ('list-card-addr'), even though I've checked the HTML of the listings it does not find and the class exists. Does anyone have any idea why this is? Here's my simple code:
from bs4 import BeautifulSoup
import requests
req_headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.8',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
url="https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/"
response = requests.get(url, headers=req_headers)
data = response.text
soup = BeautifulSoup(data,'html.parser')
address = soup.find_all(class_='list-card-addr')
print(len(address))
The data is stored within an HTML comment. You can regex it out easily as a string defining a JavaScript object, which you can then handle with json.
import requests, re, json
r = requests.get('https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22savedSearchEnrollmentId%22%3A%22X1-SSc26hnbsm6l2u1000000000_8e08t%22%2C%22mapBounds%22%3A%7B%22west%22%3A-77.65840518457031%2C%22east%22%3A-77.11870181542969%2C%22south%22%3A38.13250414385234%2C%22north%22%3A38.444339281260426%7D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22mostrecentchange%22%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D',
headers = {'User-Agent':'Mozilla/5.0'})
data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
print(data['cat1'])
print(data['cat1']['searchList'].keys())
Within this are details on pagination and the next URL, if applicable, so you can get all the results; you have only asked for page 1 here.
For example, to print the addresses:
for i in data['cat1']['searchResults']['listResults']:
    print(i['address'])
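To figure out how many pages there are before requesting more, you can look inside data['cat1']['searchList']; the exact key names below (totalPages, totalResultCount) are assumptions, so confirm them against the keys() output printed by the snippet above:
search_list = data['cat1']['searchList']
print(search_list.keys())  # confirm the actual key names first

# 'totalPages' and 'totalResultCount' are assumed names; adjust to what keys() shows.
print('total pages:', search_list.get('totalPages'))
print('total results:', search_list.get('totalResultCount'))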
I am attempting to scrape job titles from here.
Using BeautifulSoup I can scrape job titles from the first page, but I am not able to scrape job titles from the remaining pages. Using the developer tools > Network tab, I understood that the content type is JSON.
import requests
import json
from bs4 import BeautifulSoup
from os import link
import pandas as pd
s = requests.Session()
headers = {
'Connection': 'keep-alive',
'sec-ch-ua': '^\\^',
'Accept': '*/*',
'X-Requested-With': 'XMLHttpRequest',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
'Content-Type': 'application/json; charset=utf-8',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://jobs.epicor.com/search-jobs',
'Accept-Language': 'en-US,en;q=0.9',
}
url = 'https://jobs.epicor.com/search-jobs/results?ActiveFacetID=0&CurrentPage=2&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&ShowRadius=False&IsPagination=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=1&SearchType=5&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf='
response = s.get(url, headers=headers).json()
data=json.dumps(response)
#print(data)
d2=json.loads(data)
for x in d2.keys():
    print(x)
### From the above JSON results, how do I extract "jobtitle"?
The issue is that the JSON data in the above result contains HTML tags. In this case, how can I scrape the job titles from the JSON data?
Would really appreciate any help on this.
I am unfortunately currently limited to using only requests or another popular python library.
Thanks in advance.
If the job titles are all that you need from your response text:
from bs4 import BeautifulSoup
# your code here
soup = BeautifulSoup(response["results"], "html.parser")
for item in soup.findAll("span", {"class": "jobtitle"}):
    print(item.text)
To navigate over the pages, hover your mouse cursor over the Prev or Next buttons there and you will see the URL to request data from.
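A sketch of what paging could look like, simply changing the CurrentPage value in the query string from the question and reusing its s session and headers; the page count of 5 is a placeholder you would read from the first response or from the site's pagination controls:
from bs4 import BeautifulSoup

base_url = ('https://jobs.epicor.com/search-jobs/results?ActiveFacetID=0&CurrentPage={}'
            '&RecordsPerPage=15&Distance=50&RadiusUnitType=0&SearchResultsModuleName=Search+Results'
            '&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=1&SearchType=5')

for page in range(1, 6):  # placeholder page count; adjust to the real number of pages
    response = s.get(base_url.format(page), headers=headers).json()
    soup = BeautifulSoup(response["results"], "html.parser")
    for item in soup.findAll("span", {"class": "jobtitle"}):
        print(item.text)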
I'm trying to read specific information (name, price, etc.) from an Amazon webpage.
For that I'm using "BeautifulSoup" and "requests", as suggested in most tutorials. My code can load the page and find the item I'm looking for, but it fails to actually get it. I checked the webpage and the item definitely exists.
Here is my code:
#import time
import requests
#import urllib.request
from bs4 import BeautifulSoup
URL = ('https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it')
# user agent = browser information (get via google search "my user agent")
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
page = requests.get(URL, headers=headers)# webpage
soup = BeautifulSoup(page.content, 'html.parser')# webpage as html
title = soup.find(id="productTitle")
print(title)
title is always None, so calling get_text() on it causes an error.
Can anybody tell me what's wrong?
Found a way to get past the captcha.
The request needs to contain a better header.
Example:
import datetime
import requests
KEY = "YOUR_KEY_HERE"
date = datetime.datetime.now().strftime("%Y%m%d")
BASE_REQUEST = ('https://www.amazon.de/Philips-Haartrockner-ThermoProtect-Technologie-HP8230/dp/B00BCQIIMS?pf_rd_r=T1T8Z7QTQTGYM8F7KRN5&pf_rd_p=c832d309-197e-4c59-8cad-735a8deab917&pd_rd_r=20c6ed33-d548-47d7-a262-c53afe32df96&pd_rd_w=63hR3&pd_rd_wg=TYwZH&ref_=pd_gw_crs_zg_bs_84230031')
headers = {
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': 'https://www.amazon.com/',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
# Note: this payload is defined but not used in the request below.
payload = {
"api-key": KEY,
"begin_date": date,
"end_date": date,
"q": "Donald Trump"
}
r = requests.get(BASE_REQUEST, headers=headers)
print(r.status_code)
if r.status_code == 200:
    print('success')
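Tying this back to the original question: once the request goes through with these headers, the productTitle lookup from the question should work again. A small sketch (assuming the productTitle id is still present on the page):
from bs4 import BeautifulSoup

if r.status_code == 200:
    soup = BeautifulSoup(r.content, 'html.parser')
    title = soup.find(id="productTitle")
    if title is not None:
        print(title.get_text(strip=True))
    else:
        print("productTitle not found - the response may still be a captcha page")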
For information on status codes, just google "HTTP status codes".
Hope this helps anyone with similar problems
Cheers!
Your code is 100% correct, but I tried it and checked the value of page.content: it contains a captcha. It looks like Amazon doesn't want you to scrape their site.
You can read about your case here: https://www.reddit.com/r/learnpython/comments/bf21fn/how_to_prevent_captcha_while_scraping_amazon/.
But I also recommend reading Amazon's Terms and Conditions (https://www.amazon.com/gp/help/customer/display.html/ref=hp_551434_conditions) to be sure you can legally scrape it.
I previously wrote a program to analyze stock info, and to get historical data I used NASDAQ. For example, in the past, if I wanted to pull a year's worth of price quotes for CMG, all I needed to do was make a request to the following link, h_url = https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30, to download a CSV of the historical quotes. However, now when I make the request my connection times out and I cannot get any response. If I just enter the URL into a web browser, it still downloads the file just fine. Some example code is below:
import os
import requests as rq
from bs4 import BeautifulSoup as bs

h_url = 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30'
page_response = rq.get(h_url, timeout=30)
page = bs(page_response.content, 'html.parser')
dwnld_fl = os.path.join(os.path.dirname(__file__), 'Temp_data', 'hist_CMG_data.txt')
fl = open(dwnld_fl, 'w')
fl.write(page.text)
fl.close()
Can someone please let me know if this works for them, or if there is something I should do differently to get it to work again? This is only an example, not the actual code, so if I accidentally made a simple syntax error you can assume the actual file is correct, since it has worked without issue in the past.
You are missing the headers, and you are making the request to an invalid URL (the file it downloads in a browser is empty); note that the from/to dates are swapped in the working URL below.
import requests
from bs4 import BeautifulSoup as bs
h_url= 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2019-06-30/2020-06-30'
headers = {
'authority': 'www.nasdaq.com',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
page_response = requests.get(h_url, timeout=30, allow_redirects=True, headers=headers)
with open("dump.txt", "w") as out:
out.write(str(page_response.content))
This will result in writing a byte string to a the file "dump.txt" of the data received. You do not need to use BeautifulSoup to parse HTML, as the response is a text file, not HTML.
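Since the endpoint returns CSV data directly, here is a small follow-up sketch for reading it straight into pandas (pandas is an assumption here, not something the original code used):
import pandas as pd
from io import StringIO

# Decode the byte string and hand the CSV text straight to pandas.
csv_text = page_response.content.decode('utf-8')
quotes = pd.read_csv(StringIO(csv_text))
print(quotes.head())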