python: check if list items are in a string

link = 'http://dedegood.com'
wrongdomain = ['google','facebook','twitter']
if any(link.find(i) for i in wrongdomain):
    print 'pass this url'
else:
    print 'good'
I want to check if link contains any of the words in wrongdomain.
Why does this always print 'pass this url'?
link has no google, facebook, or twitter in it.
If I try a single check like link.find('google'), it returns -1. So what's the problem?
Please help me check my logic. Thank you.

bool(-1) is True in Python, so the -1 that find returns for a miss still counts as truthy inside any(). Instead of find, you can just do:
if any(domain in link for domain in wrongdomain):
Just remember that this will also match against the rest of the URL, not just the domain.
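For reference, here is the corrected snippet in full (a minimal sketch of the fix described above):
link = 'http://dedegood.com'
wrongdomain = ['google', 'facebook', 'twitter']
# link.find(i) returns -1 when i is absent, and bool(-1) is True,
# so any(...) over find results was always truthy.
# Substring membership with `in` avoids the -1 sentinel entirely.
if any(domain in link for domain in wrongdomain):
    print('pass this url')
else:
    print('good')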

Your method will not work correctly for a URL like http://dedegood.com/google. So you can use something like:
link = 'http://dedegood.com'
wrongdomain = ['google','facebook','twitter']
a = link.split("//")
b = a[1].split(".")
if any(domain in b[0] for domain in wrongdomain):
    print('pass this url')
else:
    print('good')
Since you just want to check the URL's host, this checks only the site name instead of the whole link, so a URL like http://dedegood.com/google will not be a problem.

Do you want to know whether the URL's domain is in wrongdomain or not? I would suggest doing this instead:
import tldextract

link = 'http://dedegood.com'
wrongdomain = ['google','facebook','twitter']
parsed = tldextract.extract(link)
if parsed.domain in wrongdomain:
    print 'pass this url'
else:
    print 'good'
You could check out tldextract, a library designed to get domain from a url.
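To illustrate what tldextract returns (this example follows its documentation; the exact output may vary by version):
import tldextract

result = tldextract.extract('http://forums.news.cnn.com/')
print(result.subdomain)  # forums.news
print(result.domain)     # cnn
print(result.suffix)     # com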


I need to replace everything after https:// and before .com using python

What I'm trying to do is have it replace all URLs in an HTML file.
This is what I have done, but I realized it also deletes everything after the site name.
s = 'https://12345678.com/'
site_link = "google"
print(s[:8] + site_link)
It would return as https://google
I have made a code sample.
In it, link_template is a template for a link, and ***** marks where your site_name will go. It might look a bit confusing at first, but if you run it you'll understand.
# change this to change your URL
link_template = 'https://*****.com/'
# a site name, from your example
site_name = 'google'
# this is your completed link
site_link = site_name.join(link_template.split('*****'))
# prints the result
print(site_link)
Additionally, you can make a function for it:
def name_to_link(link_template, site_name):
    return site_name.join(link_template.split('*****'))
And then you can use the function like this:
link = name_to_link('https://translate.*****.com/','google')
print(link)
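As a side note, the same substitution can be written with str.replace, which arguably reads more directly (a minimal sketch of the same idea):
def name_to_link(link_template, site_name):
    # replace the ***** placeholder directly instead of split/join
    return link_template.replace('*****', site_name)

print(name_to_link('https://translate.*****.com/', 'google'))  # https://translate.google.com/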

How to make my bot skip over urls that don't exist

Hey guys, I was wondering if there is a way to make my bot skip invalid URLs after one try and continue with the for loop, but continue doesn't seem to work.
def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        conn = requests.head("https://" + tag)
        conn2 = requests.head("http://" + tag)
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            continue
stripped_results is an array of an unknown number of domains and subdomains, which is why I have the 'https://' part, and to be honest I'm not even sure whether my if statement is effective or not.
Any help would be greatly appreciated; I don't want to get rate limited by Discord anymore from sending so many invalid domains through. :(
This is easy. To check the validity of a URL there exists a Python library, namely validators. This library can be used to check whether a URL is valid (well formed). Let's take it step by step.
Firstly, here is the documentation link for validators:
https://validators.readthedocs.io/en/latest/
How do you validate a link using validators?
It is simple: the module gives a truthy result if the URL is valid and a falsy failure object if it is not.
You can validate a link using this syntax:
validators.url('Add your URL variable here')
Remember that this behaves like a boolean, so code for it that way.
I won't implement it in your code, as I want you to try it yourself first; I'm happy to help if you get stuck.
Thank you! :)
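A minimal sketch of that idea (validators.url returns a truthy value for a well-formed URL and a falsy failure object otherwise):
import validators

for tag in ['example.com', 'not a url']:
    url = 'https://' + tag
    if validators.url(url):
        print(url, 'looks like a valid URL')
    else:
        print(url, 'is not a valid URL')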
Try this?
def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        conn = requests.head("https://" + tag)
        conn2 = requests.head("http://" + tag)
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            pass  # Do the thing here
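Note that neither version will skip a domain that cannot be reached at all: requests.head raises an exception instead of returning a status code when the connection fails, which is likely why continue never seemed to run. A minimal sketch of catching that case (the timeout value is an arbitrary choice):
import requests

def check_valid(stripped_results):
    valid = []
    for tag in stripped_results:
        try:
            conn = requests.head("https://" + tag, timeout=5)
        except requests.RequestException:
            # DNS failure, refused connection, timeout, etc. -- skip this tag
            continue
        if conn.status_code == 200:
            valid.append(tag)
    return valid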

How to print only a specific link in Python

I'm still a newbie in Python, but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it always follows the same pattern, with only the stream id (represented as X) changing:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use regex for this problem.
Explanation: in the pattern, .*? lazily matches everything up to the literal .m3u8, and the trailing \b requires a word boundary right after it. Note the dot should be escaped as \. so it matches a literal dot rather than any character.
For e.g.:
import re

link = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
p = re.findall(r'.*?\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
There are a few ways to go about this; one that springs to mind, which others have touched upon, is using regex with findall, which returns a list of matched URLs from our url_list.
Another option could also be BeautifulSoup but without more information regarding the html structure it may not be the best tool here.
Using Regex
from re import findall
from requests import get

def check_link(response):
    # note the escaped dot so '.m3u8' is matched literally
    result = findall(
        r'.*?\.m3u8\b',
        str(response.content),
    )
    return result

def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(link_found, url))

if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8" then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string returns a list of strings where each item is a different part of the original (the first part being "https:", etc.). The last of these (index [-1]) is the file name you want.
This will extract all URLs from the webpage and keep only those which contain your required keyword ".m3u8":
import requests
import re

def get_desired_url(data):
    urls = []
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls

channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1.text)
Try this; I think it will be robust:
import re

# pull the href value out of each <a ...> tag that links to a .m3u8 file
links = re.findall(r'<[ ]*a[ ]+[^>]*href[ ]*=[ ]*"(https?://[^"]+\.m3u8)"[^>]*>',
                   str(channel2.content))
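If the links sit in ordinary anchor tags, a BeautifulSoup version may be less brittle than hand-rolled regex (a sketch assuming the bs4 package and that the page exposes the links as <a href=...> tags):
import requests
from bs4 import BeautifulSoup

response = requests.get('https://website.tv/user/111111')
soup = BeautifulSoup(response.text, 'html.parser')

# keep only hrefs ending with .m3u8
m3u8_links = [a['href'] for a in soup.find_all('a', href=True)
              if a['href'].endswith('.m3u8')]
print(m3u8_links)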

google custom search api return is different from google.com

I am using the Google API via Python and it works, but the results I get from the API are totally different from google.com. The top results given by Custom Search are Google Calendar, Google Earth, and patents. I wonder if there is a way to get the same results from the Custom Search API. Thank you.
def googleAPICall(self, userInput):
    try:
        userInput = urllib.quote(userInput)
        for i in range(0, 1):
            index = i * 10 + 1
            url = ('https://www.googleapis.com/customsearch/v1?'
                   'key=%s'
                   '&cx=%s'
                   '&alt=json'
                   '&num=10'
                   '&start=%d'
                   '&q=%s') % (self.KEY, self.CX, index, userInput)
            print (url)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            returnResults = simplejson.load(response)
            webs = returnResults['items']
            for web in webs:
                self.result.append(web["link"])
    except:
        print ("search error")
        self.result.append("http://en.wikipedia.org/wiki/Climate_change")
    return self.result
There is a 'search outside of google' checkbox in the dashboard; you will get the same results after you check it. It took me a while to find it. The default setting only returns search results from inside Google's own websites.
After some searching, the answer is: it is impossible to get exactly the same results as google.com.
Google clearly states it:
https://support.google.com/customsearch/answer/141877?hl=en
Hope that this is the definitive answer.
Just to add to galaxyan's answer: you can still do that by changing "Sites to search" from "Search only included sites" to "Search the entire web".
I think you need to experiment with the four parameters cr, gl, hl, and lr (country restriction, user geolocation, interface language, and language restriction).
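For instance, a sketch of appending those parameters to the request URL built above (the key, cx, and parameter values here are hypothetical placeholders):
import urllib

KEY, CX = 'YOUR_KEY', 'YOUR_CX'  # placeholders
query = urllib.quote('climate change')

# cr/gl/hl/lr bias results toward a locale, which is one of the bigger
# differences between the API defaults and google.com
url = ('https://www.googleapis.com/customsearch/v1?'
       'key=%s&cx=%s&q=%s'
       '&cr=countryUS&gl=us&hl=en&lr=lang_en') % (KEY, CX, query)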

Fetch a particular part of the url in python

I am using Python and trying to fetch a particular part of the URL, as below:
from urlparse import urlparse as ue
url = "https://www.google.co.in"
img_url = ue(url).hostname
Result
www.google.co.in
case1:
Actually I will have a number of URLs (stored in a list or somewhere else), so what I want is to find the domain name in the URL as above and fetch the part after www. and before .co.in, that is, the string that starts after the first dot and ends before the second dot, which gives only google in the present scenario.
So suppose the given URL is www.gmail.com: I should fetch only gmail. Whatever the URL, the code should fetch the part between the first and second dots.
case2:
Also, some URLs may be given directly, like domain.com or stackoverflow.com, without www; in those cases it should fetch only domain and stackoverflow.
Finally, my intention is to fetch the main name from the URL: gmail, stackoverflow, google, and so on.
Generally, if I have one URL I can use list slicing to fetch the string, but I will have a number of URLs, so I need to fetch the wanted part dynamically as described above.
Can anyone please let me know how to satisfy the above requirement?
Why can't you just do this:
from urlparse import urlparse as ue

urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []
for url in urls:
    decoded = ue(url).hostname
    if decoded.startswith('www.'):
        decoded = ".".join(decoded.split('.')[1:])
    parsed.append(decoded.split('.')[0])
#parsed is now your parsed list of hostnames
Also, you might want to change the if statement in the for loop, because some domains might start with other things that you would want to get rid of.
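For example, a sketch that strips a configurable set of prefixes (the prefix tuple here is a made-up assumption; extend it as needed):
from urlparse import urlparse as ue

PREFIXES = ('www.', 'm.', 'mobile.')  # hypothetical prefixes to strip

def main_name(url):
    host = ue(url).hostname
    for prefix in PREFIXES:
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    return host.split('.')[0]

print(main_name('https://m.google.co.in'))  # google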
What about using a set of predefined top-level domains?
import re
from urlparse import urlparse

#Fake top level domains... EG: co.uk, co.in, co.cc
TOPLEVEL = [".co.[a-zA-Z]+", ".fake.[a-zA-Z]+"]

def TLD(rgx, host, max=4): #4 = co.name
    match = re.findall("(%s)" % rgx, host, re.IGNORECASE)
    if match:
        if len(match[0].split(".")[1]) <= max:
            return match[0]
    else:
        return False

parsed = []
urls = ["http://www.mywebsite.xxx.asd.com", "http://www.dd.test.fake.uk/asd"]
for url in urls:
    o = urlparse(url)
    h = o.hostname
    for j in range(len(TOPLEVEL)):
        TL = TLD(TOPLEVEL[j], h)
        if TL:
            name = h.replace(TL, "").split(".")[-1]
            parsed.append(name)
            break
        elif (j + 1 == len(TOPLEVEL)):
            parsed.append(h.split(".")[-2])
            break

print parsed
It's a bit hacky, and maybe cryptic for some, but it does the trick, and nothing more has to be done :)
Here is my solution; at the end, domains holds the list of domains you expected.
import urlparse

urls = [
    'https://www.google.com',
    'http://stackoverflow.com',
    'http://www.google.co.in',
    'http://domain.com',
]
hostnames = [urlparse.urlparse(url).hostname for url in urls]
hostparts = [hostname.split('.') for hostname in hostnames]
domains = [p[0] == 'www' and p[1] or p[0] for p in hostparts]
print domains # ==> ['google', 'stackoverflow', 'google', 'domain']
Discussion
First, we extract the host names from the list of URLs using urlparse.urlparse(). The hostnames list looks like this:
[ 'www.google.com', 'stackoverflow.com', ... ]
In the next line, we break each host into parts, using the dot as the separator. Each item in the hostparts looks like this:
[ ['www', 'google', 'com'], ['stackoverflow', 'com'], ... ]
The interesting work is in the next line. This line says: "if the first part before the dot is www, then the domain is the second part (p[1]); otherwise, the domain is the first part (p[0])." The domains list looks like this:
[ 'google', 'stackoverflow', 'google', 'domain' ]
My code does not know how to handle login.gmail.com.hk; I hope someone else can solve this problem, as I am late for bed. Update: take a look at tldextract by John Kurkowski, which should do what you want.
