Is there a faster way to check the availability of numerous websites - python

Good day, everyone.
This code checks the availability of a website, but it loads the whole page, so if I have a list of 100 websites, it will be slow.
My question is: is there any way to do it faster?
import requests

user_agent = {'accept': '*/*', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
session = requests.Session()
response = session.get("http://google.com", headers=user_agent, timeout=5)
if response.status_code == 200:
    print("Checked & available")
else:
    print("Not available")
Thanks!
Any help will be appreciated.

You can use this:
import urllib.request
print(urllib.request.urlopen("http://www.google.com").getcode())
#output
>>> 200
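One caveat: urlopen raises urllib.error.HTTPError for non-2xx responses and urllib.error.URLError when the host is unreachable, so for an availability check across many sites you will want to catch those. A small sketch of such a wrapper (the check_site name and the 5-second timeout are just illustrative choices):

import urllib.request
import urllib.error

def check_site(url, timeout=5):
    # Returns the HTTP status code, or None if the site is unreachable.
    try:
        return urllib.request.urlopen(url, timeout=timeout).getcode()
    except urllib.error.HTTPError as e:
        return e.code   # the server answered, but with an error status
    except urllib.error.URLError:
        return None     # DNS failure, refused connection, timeout, ...

print(check_site("http://www.google.com"))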

This code is checking availability of the website, but it's loading the whole page
To avoid loading the whole page, you can issue HEAD requests instead of GET, so you only check the status. See Getting HEAD content with Python Requests.
Another way to make it faster is to issue multiple requests concurrently, using threads or asyncio (https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html).
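Combining both suggestions, here is a minimal sketch that checks a list of URLs with HEAD requests from a thread pool (the example URL list, the worker count, and the check_url helper are illustrative, not part of the original code):

import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["http://google.com", "http://example.com"]  # your 100 sites go here
headers = {'user-agent': 'Mozilla/5.0'}

def check_url(url):
    # HEAD only fetches the status line and headers, not the body.
    try:
        response = requests.head(url, headers=headers, timeout=5, allow_redirects=True)
        return url, response.status_code == 200
    except requests.RequestException:
        return url, False

with ThreadPoolExecutor(max_workers=20) as pool:
    for url, available in pool.map(check_url, urls):
        print(url, "available" if available else "not available")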

Related

scraping yell with python requests gives 403 error

I have this code
from requests.sessions import Session

url = "https://www.yell.com/s/launderettes-birmingham.html"
s = Session()
headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
}
r = s.get(url, headers=headers)
print(r.status_code)
but I get a 403 output instead of 200.
I can scrape this data with Selenium, but is there a way to scrape it with requests?
If you modify your code like so:
print(r.text)
print(r.status_code)
you will see that the reason you are getting the 403 error code is that yell uses a Cloudflare browser check.
Since that check relies on JavaScript, there is no way to reliably get past it with the requests module alone.
Since you mentioned you are going to use Selenium, make sure to use the undetected-chromedriver package.
Also, be sure to rotate your IP to avoid getting it blocked.
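A minimal sketch of that Selenium setup, assuming the undetected-chromedriver package (pip install undetected-chromedriver) is what you end up using:

import undetected_chromedriver as uc

# Patched ChromeDriver that avoids the most common bot-detection fingerprints.
driver = uc.Chrome()
driver.get("https://www.yell.com/s/launderettes-birmingham.html")
print(driver.page_source[:500])  # should now be the real page, not the Cloudflare challenge
driver.quit()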

Python trying to send request with requests library but nothing happened?

Like the title says, I'm trying to send a request to a URL using requests with headers, but when I try to print the status code, nothing is printed in the terminal. I checked my internet connection and switched connections to test, but nothing changed.
Here's my code:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import ReadTimeout

link = "https://www.exampleurl.com"
header = {
    "accept-language": "tr,en;q=0.9,en-GB;q=0.8,en-US;q=0.7",
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.36'
}
r = requests.get(link)
print(r.status_code)
When I execute this, nothing appears, and I don't know why. If someone can help me, I will be very glad.
You can use requests.head(link), like below:
r = requests.head(link)
print(r.status_code)
I got the same problem; the get() never returns.
Since you have created a header variable, I thought about using it:
r = requests.get(link, headers=header)
Now I get status 200 returned.
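If the server simply never answers requests that lack the expected headers, the call can hang rather than fail; adding a timeout makes that visible. A small sketch combining the header fix with a timeout (the 10-second value is an arbitrary example):

import requests

link = "https://www.exampleurl.com"
header = {
    "accept-language": "tr,en;q=0.9,en-GB;q=0.8,en-US;q=0.7",
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.36'
}

try:
    # Pass the headers and bound the wait so a stalled request raises instead of hanging.
    r = requests.get(link, headers=header, timeout=10)
    print(r.status_code)
except requests.exceptions.Timeout:
    print("The request timed out")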

KeyError: 'data' error while parsing Subreddit JSON

I am trying to read the titles of the latest posts from the r/chrome subreddit using Python.
But when I execute the Python file, I get a KeyError: 'data' error.
Here's my code:
import json, requests

def getReddit():
    redditLatest = requests.get('https://www.reddit.com/r/chrome/new/.json').json()
    print(redditLatest['data']['children'][0]['data']['title'])

getReddit()
Please help with a solution.
As Mikhail Beliansky already mentioned, debug your response.
import requests
redditLatest = requests.get('https://www.reddit.com/r/chrome/new/.json').json()
print(redditLatest)
# {'message': 'Too Many Requests', 'error': 429}
You can see that Reddit recognizes that you are not a "normal" client/browser, especially because requests sends a user-agent like "python-requests/2.25.1" by default.
You can add a common browser user-agent to your request. If you don't make too many requests, this may work for you:
redditLatest = requests.get(
    'https://www.reddit.com/r/chrome/new/.json',
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
).json()
print(redditLatest)
# {'kind': 'Listing', 'data': {...}}
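With that header in place, the original getReddit function should work again; a short end-to-end sketch (same endpoint, the user-agent string is just an example):

import requests

def getReddit():
    redditLatest = requests.get(
        'https://www.reddit.com/r/chrome/new/.json',
        headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
    ).json()
    # Print the title of the newest post in r/chrome.
    print(redditLatest['data']['children'][0]['data']['title'])

getReddit()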

Why does the post request only work the first time?

I'm trying to make a web scraper with Python. I made it with Selenium, but it is really slow. Then I saw that I could speed up the project because a button on the page makes a POST request.
import requests
from bs4 import BeautifulSoup

url = "http://vidtome.host/tnoz00am9j8p"
myobj = {
    'op': 'download1',
    'code': 'tnoz00am9j8p',
    'hash': 'the hash',
    'imhuman': 'Proceed to video'
}
x = requests.post(url, data=myobj)
print(x.text)
That's the code, and it works, but only the first time.
When I ran it the first time, it showed no error and printed out the page with the right changes, but when I ran it later it gave no error yet printed out the page with no changes, as if it had done nothing.
How can that be possible?
Requests are faster, but you cannot extract dynamically rendered content. However, this is probably not the issue here.
The problem is that you no longer have access to the website.
If it is a basic human-checking system, you could try adding a user agent to your request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.68',
}
r = requests.get(url, headers=headers)
If this does not work, I would recommend looking into the data that you are passing. The site may be validating it, and it may contain expired values (the hash, for example).
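One way to guard against an expired value is to fetch the page first and read the current form fields before posting. This is only a sketch, under the assumption that the page exposes the hash as a hidden <input name="hash"> element (the field name is taken from the data dict above, not verified against the site):

import requests
from bs4 import BeautifulSoup

url = "http://vidtome.host/tnoz00am9j8p"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.68'}

# Load the page and pull the current value of the hidden "hash" input, if present.
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, "html.parser")
hash_input = soup.find("input", {"name": "hash"})

myobj = {
    'op': 'download1',
    'code': 'tnoz00am9j8p',
    'hash': hash_input["value"] if hash_input else '',
    'imhuman': 'Proceed to video'
}
x = requests.post(url, data=myobj, headers=headers)
print(x.text)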

Why is Selenium unable to find elements on some sites?

I am using the Python bindings for Selenium to capture comments on a Chinese website.
The website is https://v.douyu.com/show/kDe0W2q5bB2MA4Bz
I want to find this span element. In Chinese, it is called "弹幕列表" (the bullet-comment list).
I tried the absolute path like:
driver.find_elements_by_xpath('/body/demand-video-app/main/div[2]/demand-video-helper//div/div[1]/a[3]/span')
But it returns NoSuchElementException. I just thought that maybe this site has a protection mechanism. However, I don't know much about Selenium and would like to ask for help. Thanks in advance.
I guess you are using Selenium because requests can't capture the value directly.
If that's not what you want to do, feel free to skip my answer.
You are doing requests.get(url='https://v.douyu.com/show/kDe0W2q5bB2MA4Bz'),
but you need to find the API URL that actually serves the data, in the Network tab of the F12 developer tools.
In fact, the source of the data is
https://v.douyu.com/wgapi/vod/center/getBarrageListByPage + parameters
↓
https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset=-1
Although I can't help you solve the Selenium problem, I would use the following method to get the data:
import requests

url = 'https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset=-1'
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
res = requests.get(url=url, headers=headers).json()
print(res)
for i in res['data']['list']:
    print(i)
Get all the data:
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset=-1'
while True:
    res = requests.get(url=url, headers=headers).json()
    next_json = res['data']['pre']
    if next_json == -1:
        break
    for i in res['data']['list']:
        print(i)
    url = f'https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset={next_json}'
