Trouble collecting different property ids from a webpage using the requests module - python

After clicking on the button 11.331 Treffer located at the top right corner within the filter of this webpage, I can see the result displayed on that page. I've created a script using the requests module to fetch the ID numbers of different properties from that page.
However, when I run the script, I get json.decoder.JSONDecodeError. If I copy the cookies from dev tools directly and paste them within the headers, I get the results accordingly.
I don't wish to copy cookies from dev tools every time I run the script, so I used Selenium to collect cookies from the landing page and supply them within headers to get the desired result, but I still get the same error.
I'm trying like:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
start_url = 'https://www.immobilienscout24.de/'
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?pagenumber=1'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?enteredFrom=one_step_search',
'accept': 'application/json; charset=utf-8',
'x-requested-with': 'XMLHttpRequest'
}
def get_cookies():
with webdriver.Chrome() as driver:
driver.get(start_url)
time.sleep(10)
cookiejar = {c['name']:c['value'] for c in driver.get_cookies()}
return cookiejar
cookies = get_cookies()
cookie_string = "; ".join([f"{item}={val}" for item,val in cookies.items()])
with requests.Session() as s:
s.headers.update(headers)
s.headers['cookie'] = cookie_string
res = s.get(link)
container = res.json()['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
for item in container:
try:
project_id = item['#id']
except KeyError: project_id = ""
print(project_id)
How can I scrape property ids from that webpage using the requests module?
EDIT:
The existence of the following portion within cookies is crucial, without which the script probably leads to that error I mentioned. However, selenium failed to include that portion within cookies.
reese84=3:/qdGO9he7ld4/8a35vlw8g==:+/xBfAtVPRKHBSJgzngTQw1ywoViUvmVKLws+f8Y6edDgM+3s0Xzo17NvfgPrx9Z/suRy7hee5xcEgo85V3LdGsIop9/29g1ib1JQ0pO3UHWrtn81MseS6G8KE6AF4SrWZ2t8eTr1SEogUmCkB1HNSqXT88sAZaEi+XSzUyAGqikVjEcLX9TeI+KN37QNr9Sl+oTaOPchSgS/IowPj83zvT471Ewabg8CAc6q8I9AJ8Zb9FfLqePweCM+QFKIw+ZUp5GR4TXxZVcWdipbIEAyv3kj2x9Xs1K1k+8aXmy9VES6rFvW1xOsAjLmXbg6REPBye+QcAgPUh/x79mBWktcWC/uQ5L2W2dBLBS4eM2+bpEBw5EHMfjq9bk9hnmmZuxPGALLKASeXBt5lUUwx7x+wtGcjyvB9ZSE6gI2VxFLYqncYmhKqoNzgwQY8wRThaEraiJF/039/vVMa2G3S38iwniiOGHsOxq6VTdnWJGgvJqUmpWfXzz6XQXWL2xcykAoj7LMqHF2tC0DQyInUmZ3T7zjPBV7mEMgZkDn0z272E=:qQHyFe1/pp8/BS4RHAtxftttcOYJH4oqG1mW0+aNXF4=;

I think another part of your problem is that the link is not json. It's an html document. Part of the html document does contains javascript that sets a js variable to a json object. You can't get that with res.json()
In theory, you could use selenium to go to the link and grab the contents of the IS24.resultList variable by executing javascript like this:
driver.get(link)
time.sleep(10)
result_list = json.loads( driver.execute_script("return window.IS24.resultList"))
In practice, I think they're really serious about blocking bots and I suspect convincing them you're not a bot might take more than spoofing a cookie. When I visit via Selenium I don't even get the recaptcha option that I get when visiting through a regular browser session with incognito mode.

Related

Can't scrape information from a static webpage using requests module

I'm trying to fetch product title and it's description from a webpage using requests module. The title and description appear to be static as they both are present in page source. However, I failed to grab them using following attempt. The script throws AttributeError at this moment.
import requests
from bs4 import BeautifulSoup
link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
with requests.Session() as s:
s.headers.update(headers)
res = s.get(link)
soup = BeautifulSoup(res.text,"lxml")
product_title = soup.select_one("h1[itemProp='name']").text
product_desc = soup.select_one("#product-page-selling-statement").text
print(product_title,product_desc)
How can I scrape title and description from above pages using requests module?
The page is dynamic. go after the data from the api source:
import requests
import pandas as pd
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object
When testing requests like these you should output the response to see what you're getting back. Best to use something like Postman (I think VSCode has a similar function to it now) to set up URLs, headers, methods, and parameters, and to also see the full response with headers. When you have everything working right, just convert it to python code. Postman even has some 'export to code' functions for common languages.
Anyways...
I tried your request on Postman and got this response:
Requests done from python vs a browser are the same thing. If the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is comparing the difference between your request and the request done by the browser:
So one or more of the headers included by the browser gets a good response from the server, but just using User-Agent is not enough.
I would try to identify which headers, but unfortunately, Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to sending an obvious handmade request. I think it's my IP that's blocked since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
Best of luck!

Problem with web scraping - JavaSript on website is disabled

Hello,
I've been playing with discord bots (in Python) for a while now and I've come across a problem with scraping information on some websites that protect themselves from data collection by disabling javascript on their side so you can't get to their data.
I have already looked at many websites recommending changing in headers among other things, but it has not helped.
The next step was to use selenium, which returns me this information.
We're sorry but Hive-Engine Explorer doesn't work properly without JavaScript enabled. Please enable it to continue.
Code:
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://he.dtools.dev/richlist/BEE")
htmlSource = driver.page_source
print(htmlSource)
I also checked how it looks like on the browser side itself and as we can see after entering the page there is no way to see the html file
Image from website
My question is, is it possible to bypass such security measures? Unfortunately I wanted to download the information from the API but it is not possible in this case.
You don't need to run Selenium to get this data, the site uses a backend api to deliver the data which you can replicate easily in python:
import requests
import pandas as pd
import time
import json
token = 'BEE'
limit = 100
id_ = int(time.time())
headers = {
'accept':'application/json, text/plain, */*',
'content-type':'application/json',
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.3'
}
url = 'https://api.hive-engine.com/rpc/contracts'
payload = {"jsonrpc":"2.0","id":id_,"method":"find","params":{"contract":"tokens","table":"balances","query":{"symbol":token},"offset":0,"limit":limit}}
resp = requests.post(url,headers=headers,data=json.dumps(payload)).json()
df= pd.DataFrame(resp['result'])
df.to_csv('HiveData.csv',index=False)
print('Saved to HiveData.csv')

getting no response from a url?

Getting no response from a url by using requests.get on the other hand if I past the url in Firefox then it's responding. The provided url is a link of a json file. I don't know what's happening? here is my code
from urllib.request import urlopen,Request
import requests
import pprint
import json
import pandas as pd
url = "https://www.nseindia.com/api/option-chain-equities?symbol=ACC"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
response = requests.get(url, headers=headers)
print(response.status_code)
##data_json = json.loads(response.read())
df = pd.read_json(response)
pprint.pprint(df['records'][1])
This website protects itself from bots. There are so many ways to detect bots, some of them are:
requests rate
disabled javascript
empty cookies
not using mouse to click buttons
etc.
To enable javascript and cookies, you can use selenium.
The website you want to scrape has powerful bot detection methods. I couldn't access the link that you have shared. But when I first tried website main page and after that your link, It shows json file.
But this is not easy to make a bot for. I tried selenium and clicked the website button by moving the mouse, but it detected that I'm a bot. So we can conclude that the website uses cookies. You need to generate fake cookies to access the webpage.

Can't download a file from a website using Python

I'm a webscraping newbie and I'm having issues being able to use all of the .get methods imaginable to download some excel files from a website. I have been able to easily parse the HTML to get the URLs for every link on the page, but I'm not experienced enough to understand why on earth I cannot download the file (cookies, sessions, etc., no idea).
Here is the website:
https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9
If you scroll down you'll find the 5 excel file links, none of which I've been able to download. (just search for id="AutoDownload"
When I try to use the requests .get method, and save the file using
import requests
requests.Session()
res = requests.get(url).content
with open(filename) as f:
f.write(res.content)
I get an error that res is a bytes object and when I view res as a variable, the output is:
b'<html><head><title>Request Rejected</title></head><body>The requested URL was rejected.
Please consult with your administrator.<br><br>Your support ID is: 11190392837244519859</body></html>
Been trying for a while now, would really appreciate any help. Thanks a lot.
So I finally came up with a solution using only requests and the standard Python HTML parser.
From what I found, the Request rejected error is generally difficult to trace back to a precise cause. In that case, it was due to the absence of a user agent in the HTTP request.
import requests
from html.parser import HTMLParser
# Custom parser to retrieve the links
link_urls = []
class AutoDownloadLinksHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
if(tag == 'a' and [attr for attr in attrs if attr == ('id', 'AutoDownload')]):
href = [attr[1] for attr in attrs if attr[0] == 'href'][0]
link_urls.append(href)
# Get the links to the files
url = 'https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
links_page = requests.get(url, headers=headers)
AutoDownloadLinksHTMLParser().feed(links_page.content.decode('utf-8'))
# Download the files
host = 'https://mlcu.org.eg'
for i, link_url in enumerate(link_urls):
file_content = requests.get(host + link_urls[i], headers = headers).content
with open('file' + str(i) + '.xls', 'wb+') as f:
f.write(file_content)
In order to download the files, you need to set the "User-Agent" field in the header of your python request. This can be done by passing a dict to the get function:
file = session.get(url,headers=my_headers)
Apparently, this host does not respond to requests that come from python which have the following User-Agent:
'User-Agent': 'python-requests/2.24.0'
With this in mind, if you pass another value for that field in the header of your request, for example one from Firefox (see below), the host thinks the request comes from a Firefox user and will respond with the actual file.
Here is the full version of the code:
import requests
my_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*',
'Connection': 'keep-alive'
}
session = requests.session()
file = session.get(url, headers=my_headers)
with open(filename, 'wb') as f:
f.write(file.content)
The latest Firefox user agent worked for me but you can find many more possible values for that field here.
If you're not experienced enough to set all correct parameters manually in the HTTP requests so as to avoid the "Request rejected" error you have (for my part, I wouldn't be able to), I would advise you to use a higher level approach such as Selenium.
Selenium can automate actions performed by a browser installed on your computer, such as downloading files (thus it is used to automate tests on web apps as well as to do web scraping). The idea is that the HTTP request generated by the browser would be better than the one you can write by hand.
Here is a tutorial to do what you try to do using Selenium.

Scraping Data from website with a login page

I am trying to login to my university website using python and the requests library using the following code, nonetheless I am not able to.
import requests
payloads = {"User_ID": <username>,
"Password": <passwrord>,
"option": "credential",
"Log in":"Log in"
}
with requests.Session() as session:
session.post('', data=payloads)
get = session.get("")
print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to login you will need to to post all the informations requested by the <input> tag. In your case you will have also to provide the hidden inputs. You can do this by scraping for these values and then post them. You might also need to post some headers to simulate a browser behaviour.
from lxml import html
import requests
s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[#type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers = headers)
Hope this works
In order to login to a website with python, you will have to use a more involved method than the request library because you will have to simulate the browser in your code and have it make requests to login to the school's website servers. The reason for this is that you need the school's server to think that it is getting the request from the browser, then it should return you the contents of the resulting page, and then you have to have those contents rendered so that you can scrape it. Luckily, a great way to do this is with the selenium module in python.
I would recommend googling around to learn more about selenium. This blog post is a good example of using selenium to log into a web page with detailed explanations of what each line of code is doing. This SO answer on using selenium to login to a website is also good as an entry point into doing this.

Categories