I'm trying to test whether a simple list of URLs exists. The code works when I'm just testing one URL, but when I try to add an array of URLs, it breaks.
Any idea what I'm doing wrong?
Single URL Code
import httplib

c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
Broken Array Code
import httplib

Urls = ['www.google.ie', 'www.msn.com', 'www.fakeniallweb.com', 'www.wikipedia.org', 'www.galwaydxc.com', 'www.foxnews.com', 'www.blizzard.com', 'www.youtube.com']
for x in Urls:
    c = httplib.HTTPConnection(x)
    c.request("HEAD", '')
    if c.getresponse().status == 200:
        print('web site exists')
    else:
        print('web site ' + x + ' un-reachable')

# To prevent the console window from closing
input()
The problem is not that you loop over an array; it is that one of your URLs (www.fakeniallweb.com) fails differently from the others.
Because its DNS name cannot be resolved, the HEAD request raises an exception before you ever get a status code, so you need an additional check beyond testing for response code 200.
Maybe you could do something like this:
from socket import gaierror

try:
    c = httplib.HTTPConnection(x)
    c.request("HEAD", '')
    if c.getresponse().status == 200:
        print('web site exists')
    else:
        print('website does not exist')
except gaierror:
    print('Error resolving DNS')
Honestly, I suspect you will find other cases where a website returns different status codes: a site might return something in the 3xx range for a redirect, or a 403 if you are not allowed to access it. That does not mean the website does not exist.
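For example, here is a sketch that counts anything below 400 (success or redirect) as "the site exists", reusing the httplib setup from above; the cutoff at 400 is just an illustration:

from socket import gaierror
import httplib

def site_exists(host):
    try:
        c = httplib.HTTPConnection(host)
        c.request("HEAD", '')
        status = c.getresponse().status
        # 2xx and 3xx responses both mean something answered at that host
        return status < 400
    except gaierror:
        # the hostname did not resolve at all
        return False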
Hope this helps you on your way!
@Dries De Rydt
Thanks for your help, it was an unresolved DNS error causing it to crash out. I ended up using socket.gethostbyname from Lib/socket.py.
Solution
import socket

Urls = ['www.google.ie', 'www.msn.com', 'www.fakeniallweb.com', 'www.wikipedia.org', 'www.galwaydxc.com', 'www.foxnews.com', 'www.blizzard.com', 'www.youtube.com']
for x in Urls:
    try:
        url = socket.gethostbyname(x)
        print x + ' was reachable'
    except socket.gaierror, err:
        print "cannot resolve hostname: ", x, err

# To prevent the console window from closing
input()
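For completeness, the DNS check can also be combined with the original HEAD request, so hosts that resolve but do not answer on HTTP are reported as well. A sketch in the same Python 2 style; the shortened URL list is illustrative:

import socket
import httplib

Urls = ['www.google.ie', 'www.fakeniallweb.com', 'www.wikipedia.org']

for x in Urls:
    try:
        socket.gethostbyname(x)  # raises socket.gaierror if DNS fails
        c = httplib.HTTPConnection(x)
        c.request("HEAD", '')
        if c.getresponse().status == 200:
            print x + ' exists'
        else:
            print x + ' responded with a non-200 status'
    except socket.gaierror:
        print 'cannot resolve hostname: ' + x
    except socket.error:
        print 'could not connect to ' + x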
Thanks for all the help.
Related
Hey, so I created some code that detects whether a website exists or not. This is the code I use:
import requests

exist = []
url = []
for b in websearch:  # websearch is the list of URLs to check
    try:
        request = requests.get(b)
        if request.status_code == 200:
            exist.append(b)
            print('Exist')
    except:
        print('Not Exist')
But there's another barrier I have to get through, and that is non-existent users from my.forms.app.
This is one of the example sites.
Since on my.forms.app you can put any random name in the link and it will still redirect to a page, is there a way to differentiate whether the user exists or not?
Thanks for the help!
EDIT:
I tried to make it so that a 200 means the user exists and a 204 means the user does not exist. Is there a way to make it print that the user does not exist when the response is 204?
try:
    request = requests.get('https://api.forms.app/user/infobyname/minecraft123')
    if request.status_code == 200:
        print('Exist')
    elif request.status_code == 204:
        print('User does not exist')
except:
    print('Not Exist')
2nd EDIT:
I have found the solution and it's rather simple: I just needed to not use try and except in my code.
exist2 = []
for c in web:  # web is the list of user URLs to check
    request = requests.get(c)
    if request.status_code == 204:
        print('user does not exist')
    elif request.status_code == 200:
        exist2.append(c)
        print('user exist')
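If you want to keep the network error handling as well, without letting it swallow the status check, you can wrap the try around just the request. A sketch; web is assumed to be your list of user URLs:

import requests

web = ['https://api.forms.app/user/infobyname/minecraft']  # illustrative list
exist2 = []

for c in web:
    try:
        request = requests.get(c, timeout=10)
    except requests.RequestException:
        print('request failed for ' + c)
        continue
    if request.status_code == 204:
        print('user does not exist')
    elif request.status_code == 200:
        exist2.append(c)
        print('user exist')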
You can use your browser's network inspector to see what's happening in cases that aren't as clear-cut.
https://my.forms.app/SOMETHINGRANDOMHERE makes a request to https://api.forms.app/user/infobyname/SOMETHINGRANDOMHERE, which returns an HTTP 204 No Content response if the user doesn't exist.
For a valid username, e.g. https://api.forms.app/user/infobyname/minecraft returns 200 and some JSON describing the user.
>>> requests.get("https://api.forms.app/user/infobyname/SOMETHINGRANDOMHERE").status_code
204 # not found
>>> requests.get("https://api.forms.app/user/infobyname/minecraft").status_code
200 # found
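Based on that, a small helper is straightforward. A sketch; the function name is mine, and it assumes the endpoint keeps returning 204 for unknown names:

import requests

def forms_app_user_exists(name):
    # 200 -> user exists, 204 No Content -> user does not exist
    resp = requests.get('https://api.forms.app/user/infobyname/' + name, timeout=10)
    return resp.status_code == 200

print(forms_app_user_exists('minecraft'))            # True
print(forms_app_user_exists('SOMETHINGRANDOMHERE'))  # False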
I am trying to send a request to Instagram via instagrapi after logging in:
def parse_one_user(client: InstClient, user: str):
    try:
        print('one user parse started!')
        data = []
        print('user: ' + user)
        userid = client.get_user_id(user.strip())
        print(user + ' userid: ' + str(userid))
        bio = client.get_bio(userid)
        print(bio)
        follows = client.get_follows(userid)
        print(follows)
        followers = client.get_followers(userid)
        print(followers)
        posts = client.get_posts(userid)
        print(posts)
        data.append(bio)
        data.append(follows)
        data.append(followers)
        data.append(posts)
        return data
    except Exception as e:
        print(e)
        return False
I have this problem:
Status 200: JSONDecodeError in public_request (url=https://www.instagram.com/miami.autorent/?__a=1) >>> for (;;);{"__ar":1,"error":1357004,"errorSummary":"Sorry, something went wrong","errorDescription":"Please try closing and re-opening your browser window.","payload":null,"hsrp":{"hblp":{"consistency":{"rev":1005710968}}},"lid":"7110886975666527111"}
Does it mean that instagrapi works wrong?
This isn't an error, since it has a status code of 200. It's just a warning from Instagram directly.
Even if the message shows up in your terminal, it won't stop the script.
I have the same messages in my log files, but they don't interfere with the code.
I'm scraping some average salary data to make infographics from a list of jobs. If the job can be found, like "programmer", I get a code 200 and the page I land on matches the URL in the script.
import requests
job_url: str = "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"
job_response = requests.get(job_url, timeout=10)
print(job_response)
If it fails, like below for "Youtuber", I want to display an error message to the user, but I still get a code 200. Trying this manually, their site redirects me to a page like "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Youtuber-Salary-by-State?ind=null".
null_url: str = "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Youtuber-Salary-by-State"
null_response = requests.get(null_url, timeout=10)
How can I figure out in code whether the query is redirecting to an empty page? Do I need to use another library?
You can disable redirection and check the response:
null_url = "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Youtuber-Salary-by-State"
null_response = requests.get(null_url, timeout=10, allow_redirects=False)

if null_response.status_code == 301:
    print("Not found")

if "Moved Permanently" in null_response.text:
    print("Not found")

# .next holds the queued redirect request when allow_redirects=False
if "ind=null" in null_response.next.url:
    print("Not found")
Or with redirections:
null_url = "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Youtuber-Salary-by-State"
null_response = requests.get(null_url, timeout=10)

if "ind=null" in null_response.url:
    print("Not found")

if null_response.history[0].status_code == 301:
    print("Not found")
I googled a lot but I still have no solution.
So I have a parser def:
def parse_page(url):
    req = requests.get(url, headers=headers(), proxies=dict(http='socks4://' + get_proxy()), timeout=5)
(code was just for example)
Sometimes the proxy is dead or another error can happen (timeout, error 500), but I need to make this request anyway and keep retrying until it succeeds.
So how can I do that?
I tried the retrying lib but had no success.
Thank you!
How about:
import time
import requests

req = None
while not req:
    try:
        req = requests.get(url, headers=headers(), proxies=dict(http='socks4://' + get_proxy()))
    except:
        time.sleep(5)
As soon as you get a successful req, the loop exits. Note that a requests Response is truthy only when its status code is below 400, so an error response (like a 500) keeps the loop retrying as well, which is what you want here.
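If you would rather not loop forever, a capped variant with a pause between attempts looks like this (max_tries and the 5-second delay are illustrative):

import time
import requests

def fetch_with_retries(url, max_tries=10):
    for _ in range(max_tries):
        try:
            resp = requests.get(url, timeout=5)
            if resp.ok:  # any status below 400
                return resp
        except requests.RequestException:
            pass  # dead proxy, timeout, connection error, ...
        time.sleep(5)  # wait before the next attempt
    return None  # give up after max_tries failures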
while parse_page(url, urls[url]) == False:
    print('Something happened... Trying again...')
else:
    print(url + ' is saved... Keep going...')

Just have to switch the while condition to False and that's it...
I will leave it here in case somebody googles it.
I test URLs provided by people with urlfetch to catch wrong links.
result = urlfetch.fetch(url)
When I provide a URL such as «http://qwerty.uiop», the log says there was a «DNSLookupFailedError», but this code wouldn't catch it:
try:
    result = urlfetch.fetch(url)
except urlfetch.DNSLookupFailedError:
    self.error(400)
    self.response.out.write(
        'Sorry, there was a problem with URL "' + url + '"')
I also tried "except urlfetch.Error:" and "except urlfetch.DownloadError:"
What am I doing wrong, and is there another way to accomplish what I'm trying to do?
In the local development environment and in production I actually see a different exception: DownloadError. Catching that worked fine for me.
try:
    result = urlfetch.fetch('http://qwerty.uiop')
except urlfetch.DownloadError:
    self.response.write('Oops!')
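Since the exact exception apparently differs between environments, you can catch both in one handler; a sketch based on the two snippets above:

try:
    result = urlfetch.fetch('http://qwerty.uiop')
except (urlfetch.DNSLookupFailedError, urlfetch.DownloadError):
    self.response.write('Oops!')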