Problem with web scraping - JavaScript on website is disabled - Python

Hello,
I've been playing with Discord bots (in Python) for a while now, and I've come across a problem with scraping information from some websites that protect themselves from data collection: their content is only rendered through JavaScript, so without it you can't get to their data.
I have already looked at many websites that recommend changing headers, among other things, but that has not helped.
The next step was to use Selenium, which returns this message:
We're sorry but Hive-Engine Explorer doesn't work properly without JavaScript enabled. Please enable it to continue.
Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://he.dtools.dev/richlist/BEE")
htmlSource = driver.page_source
print(htmlSource)
I also checked what the page looks like in the browser itself, and with JavaScript turned off there is no way to see the rendered HTML.
My question is: is it possible to bypass such protection? I wanted to download the information from an API instead, but that is not possible in this case.

You don't need to run Selenium to get this data: the site uses a backend API to deliver the data, which you can replicate easily in Python:
import requests
import pandas as pd
import time
import json

token = 'BEE'
limit = 100
id_ = int(time.time())  # the JSON-RPC id can be anything; a timestamp works

headers = {
    'accept': 'application/json, text/plain, */*',
    'content-type': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.3'
}

url = 'https://api.hive-engine.com/rpc/contracts'
payload = {"jsonrpc": "2.0", "id": id_, "method": "find",
           "params": {"contract": "tokens", "table": "balances",
                      "query": {"symbol": token},
                      "offset": 0, "limit": limit}}

resp = requests.post(url, headers=headers, data=json.dumps(payload)).json()
df = pd.DataFrame(resp['result'])
df.to_csv('HiveData.csv', index=False)
print('Saved to HiveData.csv')
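If you need the whole richlist rather than just the first limit rows, the same endpoint can be paged with the offset field. Here is a minimal sketch, assuming the API keeps accepting larger offsets until the table is exhausted (the 1000-row page size is an assumption worth verifying):

# Hypothetical pagination sketch: bump `offset` until the API returns no rows.
import time
import requests

url = 'https://api.hive-engine.com/rpc/contracts'
rows, offset, page_size = [], 0, 1000  # page_size cap is an assumption

while True:
    payload = {"jsonrpc": "2.0", "id": int(time.time()), "method": "find",
               "params": {"contract": "tokens", "table": "balances",
                          "query": {"symbol": "BEE"},
                          "offset": offset, "limit": page_size}}
    result = requests.post(url, json=payload).json()['result']
    if not result:  # no more rows: stop paging
        break
    rows.extend(result)
    offset += page_size

print(f'Fetched {len(rows)} balances')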

Related

Trouble collecting different property ids from a webpage using the requests module

After clicking the button 11.331 Treffer ("11,331 hits" in German) at the top right corner of the filter on this webpage, I can see the results displayed on that page. I've created a script using the requests module to fetch the ID numbers of different properties from that page.
However, when I run the script, I get json.decoder.JSONDecodeError. If I copy the cookies from dev tools directly and paste them into the headers, I get the results as expected.
I don't want to copy cookies from dev tools every time I run the script, so I used Selenium to collect cookies from the landing page and supply them in the headers, but I still get the same error.
Here is what I'm trying:
import time
import requests
from selenium import webdriver

start_url = 'https://www.immobilienscout24.de/'
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?pagenumber=1'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?enteredFrom=one_step_search',
    'accept': 'application/json; charset=utf-8',
    'x-requested-with': 'XMLHttpRequest'
}

def get_cookies():
    # Visit the landing page in a real browser and harvest its cookies.
    with webdriver.Chrome() as driver:
        driver.get(start_url)
        time.sleep(10)
        cookiejar = {c['name']: c['value'] for c in driver.get_cookies()}
    return cookiejar

cookies = get_cookies()
cookie_string = "; ".join([f"{item}={val}" for item, val in cookies.items()])

with requests.Session() as s:
    s.headers.update(headers)
    s.headers['cookie'] = cookie_string
    res = s.get(link)
    container = res.json()['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
    for item in container:
        try:
            project_id = item['@id']
        except KeyError:
            project_id = ""
        print(project_id)
How can I scrape property ids from that webpage using the requests module?
EDIT:
The presence of the following portion in the cookies is crucial; without it, the script leads to the error I mentioned. However, Selenium failed to include that portion in the cookies it collected.
reese84=3:/qdGO9he7ld4/8a35vlw8g==:+/xBfAtVPRKHBSJgzngTQw1ywoViUvmVKLws+f8Y6edDgM+3s0Xzo17NvfgPrx9Z/suRy7hee5xcEgo85V3LdGsIop9/29g1ib1JQ0pO3UHWrtn81MseS6G8KE6AF4SrWZ2t8eTr1SEogUmCkB1HNSqXT88sAZaEi+XSzUyAGqikVjEcLX9TeI+KN37QNr9Sl+oTaOPchSgS/IowPj83zvT471Ewabg8CAc6q8I9AJ8Zb9FfLqePweCM+QFKIw+ZUp5GR4TXxZVcWdipbIEAyv3kj2x9Xs1K1k+8aXmy9VES6rFvW1xOsAjLmXbg6REPBye+QcAgPUh/x79mBWktcWC/uQ5L2W2dBLBS4eM2+bpEBw5EHMfjq9bk9hnmmZuxPGALLKASeXBt5lUUwx7x+wtGcjyvB9ZSE6gI2VxFLYqncYmhKqoNzgwQY8wRThaEraiJF/039/vVMa2G3S38iwniiOGHsOxq6VTdnWJGgvJqUmpWfXzz6XQXWL2xcykAoj7LMqHF2tC0DQyInUmZ3T7zjPBV7mEMgZkDn0z272E=:qQHyFe1/pp8/BS4RHAtxftttcOYJH4oqG1mW0+aNXF4=;
I think another part of your problem is that the link is not JSON; it's an HTML document. Part of the HTML document contains JavaScript that sets a JS variable to a JSON object. You can't get that with res.json().
In theory, you could use Selenium to go to the link and grab the contents of the IS24.resultList variable by executing JavaScript like this:
import json  # not imported in your original script

driver.get(link)
time.sleep(10)
# Stringify in the browser so json.loads receives a JSON string rather than
# an object that Selenium has already converted to a dict.
result_list = json.loads(driver.execute_script("return JSON.stringify(window.IS24.resultList)"))
In practice, I think they're really serious about blocking bots, and I suspect convincing them you're not a bot might take more than spoofing a cookie. When I visit via Selenium, I don't even get the reCAPTCHA option that I get when visiting through a regular browser session in incognito mode.

Getting no response from a URL?

I'm getting no response from a URL using requests.get; on the other hand, if I paste the URL into Firefox, it responds. The provided URL is a link to a JSON file. I don't know what's happening. Here is my code:
import requests
import pprint
import pandas as pd

url = "https://www.nseindia.com/api/option-chain-equities?symbol=ACC"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}

response = requests.get(url, headers=headers)
print(response.status_code)
df = pd.read_json(response.text)  # parse the JSON text, not the Response object
pprint.pprint(df['records'][1])
This website protects itself from bots. There are many ways to detect bots; some of them are:
request rate
disabled JavaScript
empty cookies
not using the mouse to click buttons
etc.
To enable JavaScript and cookies, you can use Selenium.
The website you want to scrape has powerful bot-detection methods. I couldn't access the link you shared directly, but when I first visited the website's main page and then opened your link, it showed the JSON file.
Still, this is not an easy site to write a bot for. I tried Selenium and clicked the website's button by moving the mouse, but it still detected that I'm a bot. So we can conclude that the website checks cookies: you need to carry the cookies set by the main page to access the endpoint, as sketched below.
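A common workaround for this particular site is to warm up a requests.Session on the main page first, so the session carries the cookies the API expects. Here is a minimal sketch; the pattern is an assumption based on the observation above and may break if the site hardens its checks:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/70.0.3538.77 Safari/537.36"}

with requests.Session() as s:
    s.headers.update(headers)
    # Hit the main page first; it sets the cookies the API endpoint checks.
    s.get("https://www.nseindia.com")
    r = s.get("https://www.nseindia.com/api/option-chain-equities?symbol=ACC")
    print(r.status_code)
    data = r.json()
    print(list(data))  # top-level keys of the option-chain JSON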

How to avoid bot detection and scrape a website using Python?

My Problem:
I want to scrape the following website: https://www.coches.net/segunda-mano/.
But every time I open it with Python Selenium, I get the message that they have detected me as a bot.
How can I bypass this detection?
First I tried simple code with Selenium:
from selenium import webdriver

browser = webdriver.Chrome('C:/Python38/chromedriver.exe')
URL = 'https://www.coches.net/segunda-mano/'
browser.get(URL)
Then I tried it with requests, but that doesn't work either.
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}  # the header key needs the hyphen
URL = 'https://www.coches.net/segunda-mano/'
r = requests.get(URL, headers=headers)
print(r.status_code)
In this case I get status code 403, which means that access to the URL is forbidden.
I don't know how to get into this webpage without being blocked. I would be very grateful for your help. Thanks in advance.
Selenium is fairly easily detected, especially by all major anti-bot providers (Cloudflare, Akamai, etc).
Why?
Selenium, and most other major webdrivers set a browser variable (that websites can access) called navigator.webdriver to true. You can check this yourself by heading to your Google Chrome console and running console.log(navigator.webdriver). If you're on a normal browser, it will be false.
The User-Agent: virtually all devices send what is called a "user agent" string, which identifies the device accessing the website. Selenium's User-Agent looks something like this: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36. Did you catch that? HeadlessChrome is included, which is another route of detection.
These are just two of the many ways a Selenium browser can be detected. I would highly recommend reading up on this and this as well.
And lastly, if you want an easy, drop-in solution to bypass detection that implements almost all of the concepts we've talked about, I'd suggest using undetected-chromedriver. This is an open-source project that tries its best to keep your Selenium chromedriver looking human; a minimal usage sketch follows.
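As a rough sketch of how that looks in practice (after pip install undetected-chromedriver; treat this as a minimal example, not a guaranteed bypass):

# undetected-chromedriver patches chromedriver so telltale flags such as
# navigator.webdriver are not exposed to the page.
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://www.coches.net/segunda-mano/')
print(driver.title)  # should render the listing page instead of a block page
driver.quit()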
I think your problem is not bot detection. You can't use just requests to get the results from that page, because it makes XHR requests behind the scenes. So you would normally need Selenium, Splash, etc., but that seems not to work in this case.
However, if you dig into the page a bit, you can find which URL is requested behind the scenes to display the results. I did that research and found this endpoint (https://ms-mt--api-web.spain.advgo.net/search); it returns JSON, which will ease your work in terms of parsing. Using Chrome dev tools, I copied the cURL request and mapped it to Python requests, obtaining this code:
import json
import requests

headers = {
    'authority': 'ms-mt--api-web.spain.advgo.net',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/json, text/plain, */*',
    'x-adevinta-channel': 'web-desktop',
    'x-schibsted-tenant': 'coches',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'content-type': 'application/json;charset=UTF-8',
    'origin': 'https://www.coches.net',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.coches.net/',
    'accept-language': 'en-US,en;q=0.9,es;q=0.8',
}

data = '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"relevance"},"filters":{"categories":{"category1Ids":[2500]},"offerTypeIds":[0,2,3,4,5],"isFinanced":false,"price":{"from":null,"to":null},"year":{"from":null,"to":null},"km":{"from":null,"to":null},"provinceIds":[],"fuelTypeIds":[],"bodyTypeIds":[],"doors":[],"seats":[],"transmissionTypeId":0,"hp":{"from":null,"to":null},"sellerTypeId":0,"hasWarranty":null,"isCertified":false,"luggageCapacity":{"from":null,"to":null},"contractId":0}}'

while True:
    response = requests.post('https://ms-mt--api-web.spain.advgo.net/search', headers=headers, data=data).json()
    # you should parse items here.
    print(response)
    if not response["items"]:
        break
    data_dict = json.loads(data)
    data_dict["pagination"]["page"] = data_dict["pagination"]["page"] + 1  # get the next page
    data = json.dumps(data_dict)
There are probably a lot of headers and body fields that are unnecessary; you can test and trim to improve it.
Rotating proxies can be useful when scraping a large amount of data:
options = Options()
options.add_argument('--proxy-server=#ip:#port')
Then initialize the Chrome driver with the options object; a fuller rotation sketch follows.
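A minimal rotation sketch, assuming you have your own pool of proxies (the addresses below are placeholders):

import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Placeholder proxy pool; substitute real ip:port pairs.
proxies = ['203.0.113.1:8080', '203.0.113.2:8080', '203.0.113.3:8080']

def make_driver():
    options = Options()
    # Pick a different proxy for each new browser session.
    options.add_argument(f'--proxy-server={random.choice(proxies)}')
    return webdriver.Chrome(options=options)

driver = make_driver()
driver.get('https://www.coches.net/segunda-mano/')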

Scraping data from tables dependent on interactive map

Another scraping question. I'm trying to scrape data from the following website:
https://www.flightradar24.com/data/airlines/kl-klm/routes
However, the data I want only shows up after you click on one of the airports, in the form of a table under the map. From this table, I want to extract the number that indicates the frequency of daily flights to each airport. E.g., if you click on Paris Charles de Gaulle and inspect the country Netherlands in the table, the row above shows td rowspan="6", which in this case indicates that KLM has 6 flights a day to Paris.
I'm assuming that I would need to use a browser session like Selenium or something similar, so I started with the following code, but I'm not sure where to go from here as I'm not able to locate the airport dots in the source code.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.binary_location = 'C:/Users/C55480/AppData/Local/Google/Chrome SxS/Application/chrome.exe'
driver = webdriver.Chrome(executable_path='C:/Users/C55480/.spyder-py3/going_headless/chromedriver.exe', chrome_options=chrome_options)

airlines = ['kl-klm', 'dy-nax', 'lh-dlh']
for a in airlines:
    url = 'https://www.flightradar24.com/data/airlines/' + a + '/routes'
    page = driver.get(url)
Is there a way to make Selenium click on each dot and scrape the number of daily flights for every airport, and then from this find the total number of daily flights to each country?
Instead of using Selenium, try to get the required data with direct HTTP requests:
import requests
import json
s = requests.session()
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0"}
r = s.get("https://www.flightradar24.com/data/airlines/kl-klm/routes", headers=headers)
Data for each airport can be found in a script node that looks like:
<script>var arrRoutes=[{"airport1":{"country":"Denmark","iata":"AAL","icao":"EKYT","lat":57.092781,"lon":9.849164,"name":"Aalborg Airport"}...]</script>
To get the JSON out of the arrRoutes variable:
my_json = json.loads(r.text.split("arrRoutes=")[-1].split(", arrDates=")[0])
You need to get the abbreviation (the value of the "iata" key) for each airport:
abbs_list = []
for route in my_json:
    if route["airport1"]["country"] == "Netherlands":
        abbs_list.append(route["airport2"]["iata"])
The output of print(abbs_list) should look like ['AAL', 'ABZ', ...]
Now we can request data for each airport:
url = "https://www.flightradar24.com/data/airlines/kl-klm/routes?get-airport-arr-dep={}"
for abbr in abbs_list:
cookie = r.cookies.get_dict()
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0",
"Content-Type": "application/json",
"x-fetch": "true"}
response = s.get(url.format(abbr), cookies=cookie, headers=headers).json()
print(abbr, ": ", response["arrivals"]["Netherlands"]["number"]["flights"])
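If you then want the total number of daily flights per destination country, which is what the question ultimately asks for, you can accumulate the same figures in a Counter. This sketch reuses s, url, cookie, headers, my_json, and abbs_list from the snippets above, and assumes the JSON keeps the shape shown there:

from collections import Counter

daily_flights = Counter()
for abbr in abbs_list:
    response = s.get(url.format(abbr), cookies=cookie, headers=headers).json()
    # Map the destination airport back to its country via the arrRoutes data.
    country = next(route["airport2"]["country"] for route in my_json
                   if route["airport2"]["iata"] == abbr)
    daily_flights[country] += response["arrivals"]["Netherlands"]["number"]["flights"]

for country, flights in daily_flights.most_common():
    print(country, flights)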
The map isn't represented through HTML/CSS, so I do not think it is possible to interact with it through Selenium natively.
However, I stumbled upon the Sikuli API, which enables image recognition to interact with maps (like the one on the page you linked), captchas, and so on. You could crop that marker and have Sikuli recognize it and click on it. See http://www.assertselenium.com/maven/sikuliwebdriver-2/ for a small example of how to use it.
The data in the tables, however, can easily be selected using XPath and parsed with a tool like Selenium. It seems that Sikuli is usable only from Java, so you would have to use Selenium with Java too.

Scraping Data from website with a login page

I am trying to log in to my university website using Python and the requests library with the following code; nonetheless, I am not able to.
import requests

payloads = {"User_ID": <username>,
            "Password": <password>,
            "option": "credential",
            "Log in": "Log in"}

with requests.Session() as session:
    session.post('', data=payloads)  # login URL omitted in the question
    get = session.get("")            # target URL omitted in the question
    print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to log in, you will need to post all the information requested by the <input> tags. In your case, you will also have to provide the hidden inputs. You can do this by scraping those values and then posting them along with the credentials. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests

s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"

# Fetch the login page and scrape the hidden <input> fields.
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}

# Add the credentials to the form data.
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers=headers)
Hope this works
In order to log in to a website with Python, you will need a more involved method than the requests library alone, because you have to simulate a browser in your code and have it make the login requests. The school's server needs to think it is getting the request from a real browser; it then returns the contents of the resulting page, and those contents have to be rendered before you can scrape them. A great way to do all of this in Python is with the selenium module.
I would recommend googling around to learn more about Selenium. This blog post is a good example of using Selenium to log into a web page, with detailed explanations of what each line of code is doing. This SO answer on using Selenium to log into a website is also a good entry point.
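As a hedged illustration of that approach, here is a minimal Selenium login sketch; the URL and field names below are placeholders based on the question's form data, so you would replace them with the real ones from your university's login page:

# Hypothetical login flow; URL and element names are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.university.edu/login')  # placeholder URL

# Fill the credential fields and submit the enclosing form.
driver.find_element(By.NAME, 'User_ID').send_keys('<username>')
driver.find_element(By.NAME, 'Password').send_keys('<password>')
driver.find_element(By.NAME, 'Password').submit()

# The browser session is now authenticated; hand the rendered page to a parser.
print(driver.page_source)
driver.quit()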
