How to do a general search - python

I'm trying to make a simple browser using PyQt5 (by following a tutorial). It's mostly working except for one tiny problem:
def navigate_to_url(self):
    q = QUrl(self.urlbar.text())
    print(type(q))
    if q.scheme() == "":
        q.setScheme("http")
    self.tabs.currentWidget().setUrl(q)
Whenever I type something in the address bar it looks it up, but it first adds 'http://'. If I type something like 'cats' I want it to work like a normal browser, i.e. bring me links that are associated with cats.
However, because 'http://' is added, it instead gives me a NAME_NOT_RESOLVED error.
Is there any way to fix this?

You could try checking whether the input is just a normal word, and only add the http:// when it isn't. For example, I've got a txt document with a LOT of English words that you can use to check if it's a normal word, like so:
if re.findall(r'\b' + re.escape(word1) + r'\b', contents, re.MULTILINE):
Assign word1 to your word and contents to the dictionary.
Here is another example, loading the dictionary:
import re

with open('dictionary.txt') as fh:
    contents = fh.read()
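Putting the two fragments together, a minimal sketch; dictionary.txt is an assumed word-list file, and word1 stands in for the address-bar text:
import re

# load the word list once; dictionary.txt is an assumed file of English words
with open('dictionary.txt') as fh:
    contents = fh.read()

word1 = "cats"  # hypothetical address-bar input
if re.findall(r'\b' + re.escape(word1) + r'\b', contents, re.MULTILINE):
    print("plain word: send it to a search engine")
else:
    print("not a known word: treat it as an address and add http://")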

Consider setting the url to a search-engine query explicitly when the input does (or doesn't) match some criteria.
At its most basic, you could use urllib.parse.urlparse for this, though it may not be an exact fit for all addresses, as it expects a scheme prefix, which most people don't bother with, letting the browser add the http(s) implicitly
>>> import urllib.parse
>>> urllib.parse.urlparse("https://example.com") # full example
ParseResult(scheme='https', netloc='example.com', path='', params='', query='', fragment='')
>>> urllib.parse.urlparse("cats") # search works
ParseResult(scheme='', netloc='', path='cats', params='', query='', fragment='')
>>> urllib.parse.urlparse("example.com") # fails for missing scheme
ParseResult(scheme='', netloc='', path='example.com', params='', query='', fragment='')
A quick test for an intended URL without a scheme is to check whether the parsed path contains a dot, which hints that the address is really a netloc.
Alternatively, you could require some prefix before searches (perhaps a space, or a keyword like d or s).
You may also need to URL-encode your string (exchanging spaces for +, ? for %3F, etc.), which can also be done by urllib.parse's urllib.parse.quote_plus
>>> urllib.parse.quote_plus("What does a url-encoded cat query look like?")
'What+does+a+url-encoded+cat+query+look+like%3F'
See the Duck Duck Go search parameters for more query options.
All together
import urllib.parse

url_search_template = "https://duckduckgo.com/?q={}"
keyword_search = "d "

def probably_a_search(s):
    # check for the prefix first to prevent matches against a search like 3.1415
    if s.startswith(keyword_search):
        return True, s[len(keyword_search):]  # slice off the search prefix
    parsed_url = urllib.parse.urlparse(s)
    if parsed_url.scheme or parsed_url.netloc:
        return False, s
    if "." in parsed_url.path:
        return False, s
    return True, s

text = self.urlbar.text()
is_search, text = probably_a_search(text)
if is_search:
    text = url_search_template.format(urllib.parse.quote_plus(text.strip()))
q = QUrl(text)
To get a more accurate test against the TLD (rather than the simple presence of a dot), a 3rd-party library like https://pypi.org/project/tld/ may work better for you
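As an illustration, a sketch of that approach; it assumes the tld package's get_tld helper, whose fail_silently and fix_protocol options let it cope with scheme-less input:
# pip install tld
from tld import get_tld

def has_real_tld(s):
    # get_tld returns None (falsy) when no registered TLD is found,
    # e.g. for a bare search term like "cats"
    return get_tld(s, fail_silently=True, fix_protocol=True) is not None

print(has_real_tld("example.com"))  # True
print(has_real_tld("cats"))         # False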

Related

Delete only a specific query from an URL

So I have the following URL: https://foo.bar?query1=value1&query2=value2&query3=value3
I'd need a function that can strip just query2 for example, so that the result would be:
https://foo.bar?query1=value1&query3=value3
I think maybe urllib.parse or furl can do this in an easy and clean way?
You should use urllib.parse, as it's designed exactly for these purposes. I'm not sure why anyone would reinvent the wheel here.
Basically 3 steps:
Use urlparse to parse the url into its component parts
Use parse_qs to parse the query-string part of that, keeping blanks intact (if relevant)
Remove the unwanted query2, re-encode the query string, and rebuild the url
From the docs:
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment. Each tuple item is a
string, possibly empty.
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse

url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
url_bits = list(urlparse(url))
print(url_bits)
query_string = parse_qs(url_bits[4], keep_blank_values=True)
print(query_string)
del query_string['query2']
url_bits[4] = urlencode(query_string, doseq=True)
new_url = urlunparse(url_bits)
print(new_url)
# >>> ['https', 'foo.bar', '', '', 'query1=value1&query2=value2&query3=value3', '']
# >>> {'query1': ['value1'], 'query2': ['value2'], 'query3': ['value3']}
# >>> https://foo.bar?query1=value1&query3=value3
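If you need this more than once, the same three steps wrap naturally into a small helper; remove_query_param is my own name for it, not a standard-library function:
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse

def remove_query_param(url, key):
    # parse, drop the key from the query dict, then rebuild the url
    url_bits = list(urlparse(url))
    query_string = parse_qs(url_bits[4], keep_blank_values=True)
    query_string.pop(key, None)  # no KeyError if the key is absent
    url_bits[4] = urlencode(query_string, doseq=True)
    return urlunparse(url_bits)

print(remove_query_param("https://foo.bar?query1=value1&query2=value2&query3=value3", "query2"))
# https://foo.bar?query1=value1&query3=value3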
If you want to remove it by position:
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
findindex1 = url.find("&")
findindex2 = url.find("&", findindex1 + 1)
url = url[0:findindex1] + url[findindex2:len(url)]
If you want to remove it by name:
url = "https://foo.bar?query1=value1&query3=value3&query2=value2"
findindex1 = url.find("query2")
findindex2 = url.find("&", findindex1 + 1)
if findindex2 == -1:
    url = url[0:findindex1 - 1]
else:
    url = url[0:findindex1 - 1] + url[findindex2:len(url)]
Hi, you could try it with regular expressions:
re.sub("ThePatternOfTheURL", "ThePatternYouWantToHave", "TheInput")
So it could look something like this:
import re

pattern = r"(https\:\/\/)([a-zA-Z.?0-9=]+)([&]query2=value2)([&][a-zA-Z0-9=]+)"
# filters out the third group, the one with query2
filter = r"\1\2\4"
yourUrl = "https://foo.bar?query1=value1&query2=value2&query3=value3"
newURL = re.sub(pattern, filter, yourUrl)
I think this should work for you
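For what it's worth, here is a regex that isn't tied to this exact url shape; strip_param is a hypothetical helper that removes one named parameter along with its separator:
import re

def strip_param(url, key):
    # remove "key=value" plus one adjoining '?' or '&' separator
    pattern = r'([?&])' + re.escape(key) + r'=[^&]*(&|$)'
    url = re.sub(pattern,
                 lambda m: m.group(1) if m.group(2) == '&' else '',
                 url)
    return url.rstrip('?&')

print(strip_param("https://foo.bar?query1=value1&query2=value2&query3=value3", "query2"))
# https://foo.bar?query1=value1&query3=value3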

Extracting unique URLs in Python

I would like to extract the unique url items from my list in order to move on with a web scraping project. Although I have a huge list of URLs on my side, I will present a minimal scenario here to explain the main issue. Assume that my list is like this:
url_list = ["https://www.ox.ac.uk/",
            "http://www.ox.ac.uk/",
            "https://www.ox.ac.uk",
            "http://www.ox.ac.uk",
            "https://www.ox.ac.uk/index.php",
            "https://www.ox.ac.uk/index.html",
            "http://www.ox.ac.uk/index.php",
            "http://www.ox.ac.uk/index.html",
            "www.ox.ac.uk/",
            "ox.ac.uk",
            "https://www.ox.ac.uk/research"]

def ExtractUniqueUrls(urls):
    pass

ExtractUniqueUrls(url_list)
For this minimal scenario, I am expecting only two unique urls, "https://www.ox.ac.uk" and "https://www.ox.ac.uk/research". Although the url elements have some differences such as "http" vs "https", a trailing "/" or not, index.php, and index.html, they all point to exactly the same web page. There might be some other possibilities I have missed (please mention them if you catch any). Anyway, what is a proper and efficient way to handle this issue using Python 3?
I am not looking for a hard-coded solution that focuses on each case individually. For instance, I do not want to manually check whether the url has a "/" at the end or not. Possibly there is a much better solution with another package such as urllib? For that reason, I looked at urllib.parse, but I could not come up with a proper solution so far.
Thanks
Edit: I added one more example to the end of my list to explain this better. Otherwise, you might assume that I am looking for the root url, but that is not the case at all.
Covering only the cases you've revealed:
url_list = ["https://www.ox.ac.uk/",
            "http://www.ox.ac.uk/",
            "https://www.ox.ac.uk",
            "http://www.ox.ac.uk",
            "https://www.ox.ac.uk/index.php",
            "https://www.ox.ac.uk/index.html",
            "http://www.ox.ac.uk/index.php",
            "http://www.ox.ac.uk/index.html",
            "www.ox.ac.uk/",
            "ox.ac.uk",
            "ox.ac.uk/research",
            "ox.ac.uk/index.php?12"]

def url_strip_gen(source: list):
    replace_dict = {".php": "", ".html": "", "http://": "", "https://": ""}
    for url in source:
        for key, val in replace_dict.items():
            url = url.replace(key, val, 1)
        url = url.rstrip('/')
        yield url[4:] if url.startswith("www.") else url

print(set(url_strip_gen(url_list)))
{'ox.ac.uk/index?12', 'ox.ac.uk/index', 'ox.ac.uk/research', 'ox.ac.uk'}
This won't cover the case where a url contains .html as part of its name, like www.htmlsomething; that can be compensated with urlparse, since it stores the path and the host separately, like below:
>>> import pprint
>>> from urllib.parse import urlparse
>>> a = urlparse("http://ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='http', netloc='ox.ac.uk', path='/index.php', params='', query='12', fragment='')
However, without a scheme:
>>> a = urlparse("ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='', netloc='', path='ox.ac.uk/index.php', params='', query='12', fragment='')
the whole host goes into the path attribute.
To compensate for this, we either need to remove the scheme from every url and then add one to all of them, or check whether each url starts with a scheme and add one only if it is missing. The former is easier to implement.
replace_dict = {"http://": "", "https://": ""}

for url in source:
    # Unify scheme to HTTP
    for key, val in replace_dict.items():
        url = url.replace(key, val, 1)
    url = "http://" + (url[4:] if url.startswith("www.") else url)
    parsed = urlparse(url)
With this you are guaranteed separate control of each section of your url via urlparse. However, as you have not specified which parts should count towards a url being unique, I'll leave that task to you.
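To sketch one possible choice (scheme ignored, www. stripped, index pages and trailing slashes dropped, matching the question's expectations), with normalize being a hypothetical helper:
from urllib.parse import urlparse

def normalize(url):
    # assume http:// when no scheme is present so urlparse fills netloc
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    p = urlparse(url)
    host = p.netloc[4:] if p.netloc.startswith("www.") else p.netloc
    path = p.path.rstrip("/")
    if path.rsplit("/", 1)[-1] in ("index.php", "index.html"):
        path = path.rsplit("/", 1)[0]
    return host + path

unique = {normalize(u) for u in url_list}
print(unique)  # {'ox.ac.uk', 'ox.ac.uk/research'} for the question's list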
Here's a quick and dirty attempt:
def extract_unique_urls(url_list):
    unique_urls = []
    for url in url_list:
        # Removing the 'https://' etc. part
        if url.find('//') > -1:
            url = url.split('//')[1]
        # Removing the 'www.' part
        url = url.replace('www.', '')
        # Removing trailing '/'
        url = url.rstrip('/')
        # If not a root url then inspect the last part of the url
        if url.find('/') > -1:
            # Extracting the last part
            last_part = url.split('/')[-1]
            # Deciding whether to keep the last part (no if '.' in it)
            if last_part.find('.') > -1:
                # If not keeping: remove the last part and get rid of the
                # trailing '/'
                url = '/'.join(url.split('/')[:-1]).rstrip('/')
        # Append if not already in list
        if url not in unique_urls:
            unique_urls.append(url)
    # Sorting for the fun of it
    return sorted(unique_urls)
I'm sure it doesn't cover all possible cases, but maybe you can extend it if needed. I'm also not sure whether you wanted to keep the 'http(s)://' parts; if yes, just add them to the results.
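For reference, running it against the question's url_list gives the two expected entries (minus the stripped scheme):
print(extract_unique_urls(url_list))
# ['ox.ac.uk', 'ox.ac.uk/research']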

How to validate LinkedIn public profile url regular expression in python

I just want to validate a linkedin public profile url. I tried something like below:
import re

a = "https://in.linkedin.com/afadasdf"
p = re.compile(r'(http(s?)://|[a-zA-Z0-9\-]+\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
p.match(a)
The above is working fine, but when I give the url https://www.linkedin.com it does not match. Can anyone help me validate both cases?
It is the OR-ing between the http(s) and the www. which has given you the problem above. You could change them to * (i.e. 0 or more):
import re

a = "https://www.linkedin.com/afadasdf"
p = re.compile(r'((http(s?)://)*([a-zA-Z0-9\-])*\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
print(p.match(a))
Although you might want to restrict it to www rather than any numbers or letters? So maybe:
p = re.compile(r'((http(s?)://)*([www])*\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
This pattern may help.
^((http|https):\/\/)?+(www.linkedin.com\/)+[a-z]+(\/)+[a-zA-Z0-9-]{5,30}+$
I have tested it and it works fine for me.
Instead of matching the url with a regex you could use the urllib module:
In [1]: import urllib.parse
In [2]: u = "https://in.linkedin.com/afadasdf"
In [3]: urllib.parse.urlparse(u)
Out[3]: ParseResult(scheme='https', netloc='in.linkedin.com', path='/afadasdf', params='', query='', fragment='')
Now you can check for the netloc and path property.
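A minimal sketch of such a check; the is_linkedin_url name and the exact acceptance rules are my own choices, not an official validation:
from urllib.parse import urlparse

def is_linkedin_url(url):
    p = urlparse(url)
    # accept linkedin.com itself and any subdomain such as in. or www.
    return (p.scheme in ("http", "https")
            and (p.netloc == "linkedin.com" or p.netloc.endswith(".linkedin.com")))

print(is_linkedin_url("https://in.linkedin.com/afadasdf"))  # True
print(is_linkedin_url("https://www.linkedin.com"))          # True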

regular expression for filtrating a url with query strings / parameters in python

I have code which loops through a list of urls to do some operations, but each of the entered urls must contain a query string. I want to check first that a url is correct and in fact contains a query string. I searched, and most of the regular expressions I found only check for the url itself; the closest solution I found uses urlparse, like this:
#!/usr/local/bin/python2.7
from urlparse import urlparse

line = "http://www.compileonlinecom/execute_python_online.php?q="
o = urlparse(line)
print o
# ParseResult(scheme='http', netloc='www.compileonlinecom', path='/execute_python_online.php', params='', query='q=', fragment='')
if o.scheme == 'http' and o.query != '':
    print "yes , that is a url with query string "
else:
    print "No match!!"
But I wonder if it could be done with a more solid regex.
You can try to validate it on the question mark, as every url with parameters should have a question mark in it.
Example:
sites = ['site.com/index.php?id=1', "xyz.com/sf.php?df=22", "dfd.com/sdgfdg.php?ereg=1", "normalsite.com"]

for site in sites:
    if "?" in site:
        print site
Result:
site.com/index.php?id=1
xyz.com/sf.php?df=22
dfd.com/sdgfdg.php?ereg=1
You can see that the site without parameters has not been printed.
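If you still want a regex rather than the ? check, here is a sketch in Python 3 syntax; the pattern requires a scheme, a host, and a non-empty query part, and the example urls are illustrative:
import re

# scheme, host, optional path, then '?' followed by at least one character
QUERY_URL = re.compile(r'^https?://[^\s/?#]+[^\s?#]*\?\S+$')

for u in ["http://www.compileonline.com/execute_python_online.php?q=",
          "http://example.com/page.php"]:
    print(u, bool(QUERY_URL.match(u)))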

Fetch a particular part of the url in python

I am using python and trying to fetch a particular part of the url as below
from urlparse import urlparse as ue
url = "https://www.google.co.in"
img_url = ue(url).hostname
Result
www.google.co.in
case 1:
Actually I will have a number of urls (stored in a list or somewhere else), so what I want is to find the domain name in the url as above and fetch the part after www. and before .co.in, that is, the string that starts after the first dot and ends before the second dot, which gives only google in the present scenario.
So suppose the given url is www.gmail.com, I should fetch only gmail from it; whatever the url given, the code should fetch the part between the first dot and the second dot.
case 2:
Also, some urls may be given directly like domain.com or stackoverflow.com, without www in the url; in those cases it should fetch only stackoverflow and domain.
Finally, my intention is to fetch the main name from the url: gmail, stackoverflow, google, and so on.
Generally, if I have one url I can use list slicing to fetch the string, but I will have a number of urls, so I need to fetch the wanted part dynamically as described above.
Can anyone please let me know how to achieve this?
Why can't you just do this:
from urlparse import urlparse as ue

urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []

for url in urls:
    decoded = ue(url).hostname
    if decoded.startswith('www.'):
        decoded = ".".join(decoded.split('.')[1:])
    parsed.append(decoded.split('.')[0])
# parsed is now your parsed list of hostnames
Also, you might want to change the if statement in the for loop, because some domains might start with other things that you would want to get rid of.
What about using a set of predefined top-level domains?
import re
from urlparse import urlparse

# Fake top-level domains... e.g. co.uk, co.in, co.cc
TOPLEVEL = [".co.[a-zA-Z]+", ".fake.[a-zA-Z]+"]

def TLD(rgx, host, max=4):  # 4 = co.name
    match = re.findall("(%s)" % rgx, host, re.IGNORECASE)
    if match:
        if len(match[0].split(".")[1]) <= max:
            return match[0]
    else:
        return False

parsed = []
urls = ["http://www.mywebsite.xxx.asd.com", "http://www.dd.test.fake.uk/asd"]

for url in urls:
    o = urlparse(url)
    h = o.hostname
    for j in range(len(TOPLEVEL)):
        TL = TLD(TOPLEVEL[j], h)
        if TL:
            name = h.replace(TL, "").split(".")[-1]
            parsed.append(name)
            break
        elif j + 1 == len(TOPLEVEL):
            parsed.append(h.split(".")[-2])
            break

print parsed
It's a bit hacky, and maybe cryptic for some, but it does the trick, and nothing more has to be done :)
Here is my solution; at the end, domains holds the list of domains you expected.
import urlparse

urls = [
    'https://www.google.com',
    'http://stackoverflow.com',
    'http://www.google.co.in',
    'http://domain.com',
]

hostnames = [urlparse.urlparse(url).hostname for url in urls]
hostparts = [hostname.split('.') for hostname in hostnames]
domains = [p[0] == 'www' and p[1] or p[0] for p in hostparts]
print domains  # ==> ['google', 'stackoverflow', 'google', 'domain']
Discussion
First, we extract the host names from the list of URLs using urlparse.urlparse(). The hostnames list looks like this:
[ 'www.google.com', 'stackoverflow.com', ... ]
In the next line, we break each host into parts, using the dot as the separator. Each item in the hostparts looks like this:
[ ['www', 'google', 'com'], ['stackoverflow', 'com'], ... ]
The interesting work is in the next line. This line says: if the first part before the dot is www, then the domain is the second part (p[1]); otherwise, the domain is the first part (p[0]). The domains list looks like this:
[ 'google', 'stackoverflow', 'google', 'domain' ]
My code does not know how to handle login.gmail.com.hk. I hope someone else can solve this problem, as I am late for bed. Update: take a look at tldextract by John Kurkowski, which should do what you want.
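A sketch of what tldextract gives you; it splits a host into subdomain, domain, and suffix using the public suffix list, which handles the login.gmail.com.hk case:
# pip install tldextract
import tldextract

for url in ["https://www.google.co.in", "http://stackoverflow.com",
            "login.gmail.com.hk", "domain.com"]:
    print(tldextract.extract(url).domain)
# google
# stackoverflow
# gmail
# domain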
