I am working on a scraper that goes through HTML trying to extract Tor domains. However, I am having trouble coming up with a piece of code to match them.
Tor domains are typically in the format of:
http://sitegoeshere.onion
or
https://sitegoeshere.onion
I just want to match URLs contained within a page, in the format http://sitetexthere.onion or https://sitehereitis.onion. They sit within a bunch of text that may not be URLs; the code should pull out just the URLs.
I am sure there is an easy or good regex that will do this, but I have not been able to find one. If anyone is able to link one or quickly spin one up, that would be much appreciated. Many thanks.
import requests

# Route traffic through the local Tor SOCKS proxy (socks5h resolves DNS via the proxy).
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'
r = session.get('http://facebookcorewwwi.onion')
print(r.text)
The regex.match will return None if the URL isn't matched.
import re
regex = re.compile(r"^https?\:\/\/[\w\-\.]+\.onion")
url = 'https://sitegoes-here.onion'
if regex.match(url):
    print('Valid Tor Domain!')
else:
    print('Invalid Tor Domain!')
For optional http(s):
regex = re.compile(r"^(?:https?\:\/\/)?[\w\-\.]+\.onion")
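The patterns above are anchored with ^, so they validate a single string. To pull URLs out of a larger block of text, as the question asks, one option is to drop the anchor and use findall. This is a minimal sketch; the names ONION_URL, find_onion_urls and the sample text are made up for illustration, and the pattern only captures the scheme and host, not any path that follows.

```python
import re

# No ^ anchor, so the pattern can match anywhere inside a larger text.
ONION_URL = re.compile(r"https?://[\w.-]+\.onion\b")

def find_onion_urls(text):
    """Return every http(s) .onion URL found in text, in order of appearance."""
    return ONION_URL.findall(text)

# Hypothetical page fragment mixing .onion links with other text.
sample = (
    'Visit <a href="http://sitegoeshere.onion">this</a> or '
    'https://sitegoes-here.onion, but not http://example.com.'
)
print(find_onion_urls(sample))
# ['http://sitegoeshere.onion', 'https://sitegoes-here.onion']
```

For a scraped page you would pass r.text instead of the sample string.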
Regex syntax is mostly standard across languages, so I would recommend this pattern:
'\.onion$'
The backslash escapes the dot, and the '$' character means the end of the string. Since all URLs start with 'http(s)://', there's no need to include it in the pattern. Use it with re.search, since re.match only matches at the start of the string.
Assuming these are taken from href attributes, you could try an attribute = value CSS selector with the $= (ends with) operator:
from bs4 import BeautifulSoup as bs
import requests
resp = requests.get("https://en.wikipedia.org/wiki/Tor_(anonymity_network)") #example url. Replace with yours.
soup = bs(resp.text,'lxml')
links = [item['href'] for item in soup.select('[href$=".onion"]')]
I need to extract a value from a URL in which the positions of the parameters change very often.
For example:
Here is the URL with three different positions of the request token:
01:-https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA
02:-https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success
03:-https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success
From these URLs I need only the value of the request token, the alphanumeric string that comes after the '=', like this: '43CbEWSxdqztXNRpb2zmypCr081eF92d'.
To split the URL I'm using this code:
request_token = driver.current_url.split('=')[1].split('&action')[0]
But it gives me an error when the request token is not in the expected position.
Can anyone please give me a solution for this URL splitting in a single line of Python? It would be a great help.
Note: I'm using driver.current_url because I'm working with Selenium.
You can use the urllib.parse module to parse URLs properly.
>>> from urllib.parse import urlparse, parse_qs
>>> url = "?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success"
>>> query = parse_qs(urlparse(url).query)
>>> query['request_token']
['43CbEWSxdqztXNRpb2zmypCr081eF92d']
>>> query['request_token'][0]
'43CbEWSxdqztXNRpb2zmypCr081eF92d'
This handles the actual structure of the URLs and doesn't depend on the position of the parameter or other special cases you'd have to handle in a regex.
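As a quick check that the parameter's position does not matter, the same two calls can be wrapped in a small helper and run over the three URLs from the question. This is only a sketch; get_request_token is a hypothetical name, and returning None for a missing token is an assumption about what the caller wants.

```python
from urllib.parse import urlparse, parse_qs

def get_request_token(url):
    """Return the request_token query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("request_token")
    return values[0] if values else None

urls = [
    "https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA",
    "https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success",
    "https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success",
]
for url in urls:
    print(get_request_token(url))
```

With Selenium, the argument would be driver.current_url.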
Assuming you have the URLs as strings then you could use a regular expression to isolate the request tokens.
import re
urls = ['https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA',
        'https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success',
        'https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success']
for url in urls:
    m = re.match('.*request_token=(.*?)(?:&|$)', url)
    if m:
        print(m.group(1))
I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
1. Use a library capable of executing JavaScript and giving you the resulting HTML
2. Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page makes, we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making that request directly with a library like requests (recommended in the standard library documentation) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
The site sends you the page with nothing in the span and fills the word in later through JavaScript; that's why you get an empty span.
However, since you just want the word, I'd suggest a different method: rather than scraping it off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in the response.
You're using Python 2 but in Python 3 (for example, so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
I'm trying to scrape data from a website by following all of its links. The site is badly built: in certain parts of the pages the links contain whitespace before and after the URL, so Scrapy follows them and the web server responds with 301 redirects, creating loops.
I tried to filter the URLs of the links, but I could not get it to work: the result always contains empty spaces or the + symbol.
Part of code
def cleanurl(link_text):
    print "original: ", link_text
    print "filter: ", link_text.strip("\s+\t\r\n '\"")
    return link_text.strip("\s+\t\r\n '\"")
    #return " ".join(link_text.strip("\t\r\n '\""))
    #return link_text.replace("\s", "").replace("\t","").replace("\r","").replace("\n","").replace("'","").replace("\"","")
rules = (
    Rule(LinkExtractor(allow=(), deny=(), process_value=cleanurl)),
)
Web code
<a href=
" ?on_sale=1
"
class="selectBox">ON SALE
</a>
Output cleanurl
original: http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1
filter: http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1
I tried regular expressions and other approaches, but I cannot sanitize the URL reliably: it works in some cases and not in others, and the %20 (whitespace) gets changed to +.
Thanks !
You mention "%20" and "+" being part of the URLs, which makes me suspect that these URLs are URL-encoded.
So before stripping any whitespace, you need to URL-decode them:
Using Python 3:
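import urllib.parse
def cleanurl(link_text):
    print("original: ", link_text)
    # Decode %20 and similar escapes first so they become real whitespace.
    link_text = urllib.parse.unquote(link_text)
    # str.strip takes a set of characters, not a regex, so list them literally.
    cleaned = link_text.strip(" \t\r\n'\"")
    print("filter: ", cleaned)
    return cleaned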
If still using Python 2.7, you need to replace the unquote line:
link_text = urllib.unquote(link_text)
I have already solved it: I added the following code to clean the URL, and now it is working properly. I hope this helps someone else who has the same problem.
def cleanurl(link_text):
    return ''.join(link_text.split())
Thanks everybody !
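The one-liner above works because split() with no arguments splits on any run of whitespace, including whitespace in the middle of the string, whereas strip() only trims the two ends. A small sketch of that behaviour follows; raw_href is a hypothetical value modelled on the markup and output shown in the question.

```python
def cleanurl(link_text):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so joining the pieces drops all of it.
    return "".join(link_text.split())

# Hypothetical href value with whitespace at both ends and in the middle.
raw_href = "\n   http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1\n  "
print(cleanurl(raw_href))
# http://www.portshop.com/computers-networking-c_11257/?on_sale=1
```

Note that this also removes any whitespace that legitimately belongs inside a URL, which is acceptable here because unencoded spaces are not valid in URLs anyway.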
I am writing a simple script that checks whether a website appears on the first page of Google results for a given keyword.
This is the function that parses a URL and returns the host name:
def parse_url(url):
    url = urlparse(url)
    hostname = url.netloc
    return hostname
Starting from a list of tags selected by:
linkElems = soup.select('.r a') #in google first page the resulting urls have class r
I wrote this:
for link in linkElems:
    l = link.get("href")[7:]
    url = parse_url(l)
    if "www.example.com" == url:
        pass  # do stuff (e.g. store in a list)
In the second line of this loop I have to start from index 7, because all href values start with '/url?q='.
I am learning Python, so I am wondering whether there is a better way to do this, or simply an alternative one (maybe with a regex, the replace method, or the urlparse library).
You can use the Python lxml module for this; it is also often considerably faster than BeautifulSoup.
It can be done something like this:
import requests
from lxml import html
blah_url = "https://www.google.co.in/search?q=blah&oq=blah&aqs=chrome..69i57j0l5.1677j0j4&sourceid=chrome&ie=UTF-8"
r = requests.get(blah_url).content
root = html.fromstring(r)
print(root.xpath('//h3[@class="r"]/a/@href')[0].replace('/url?q=', ''))
print([url.replace('/url?q=', '') for url in root.xpath('//h3[@class="r"]/a/@href')])
This will result in :
http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA
['http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA', 'http://www.dictionary.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggZMAE&usg=AFQjCNE1UVR3krIQHfEuIzHOeL0ZvB5TFQ', 'http://www.dictionary.com/browse/blah-blah-blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggeMAI&usg=AFQjCNFw8eiSqTzOm65PQGIFEoAz0yMUOA', 'https://en.wikipedia.org/wiki/Blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggjMAM&usg=AFQjCNFxEB8mEjEy6H3YFOaF4ZR1n3iusg', 'https://www.merriam-webster.com/dictionary/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggpMAQ&usg=AFQjCNHYXX53LmMF-DOzo67S-XPzlg5eCQ', 'https://en.oxforddictionaries.com/definition/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgguMAU&usg=AFQjCNGlgcUx-BpZe0Hb-39XvmNua2n8UA', 'https://en.wiktionary.org/wiki/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggzMAY&usg=AFQjCNGc9VmmyQls_rOBOR_lMUnt1j3Flg', 'http://dictionary.cambridge.org/dictionary/english/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg5MAc&usg=AFQjCNHJgZR1c6VY_WgFa6Rm-XNbdFJGmA', 'http://www.thesaurus.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg-MAg&usg=AFQjCNEtnpmKxVJqUR7P1ss4VHnt34f4Kg', 'https://www.youtube.com/watch%3Fv%3D3taEuL4EHAg&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQtwIIRTAJ&usg=AFQjCNFnKlMFxHoYAIkl1MCrc_OXjgiClg']
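As the output shows, removing the '/url?q=' prefix still leaves the tracking parameters (&sa=..., &ved=...) and percent-escapes attached. An alternative that avoids slicing by index is to treat the href as a URL and read its q parameter. This is a sketch using the Python 3 standard library; target_hostname is a hypothetical helper and the sample href is made up.

```python
from urllib.parse import urlparse, parse_qs

def target_hostname(href):
    """Return the host name of the URL carried in the q parameter of href."""
    q_values = parse_qs(urlparse(href).query).get("q")
    if not q_values:
        return None  # href has no q parameter
    return urlparse(q_values[0]).netloc

# Hypothetical href in the '/url?q=...' shape described in the question.
href = "/url?q=http://www.example.com/page&sa=U&ved=abc"
print(target_hostname(href))
# www.example.com
```

parse_qs also decodes percent-escapes such as %3F, so the value of q is the plain target URL.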
I've been working on a script and I thought I would ask for help. I want to go through a series of web pages and check whether each one is valid. The next step is to check each page for specific content; if the page holds that content, its URL goes into a list.
import urllib2

National = []
Local = []
Sports = []
Culture = []

BASE_URL = "http://readingeagle.com/section.aspx?id="

def getPage():
    # I would like to set up an iteration to check the entry ids from 1-100.
    # If the term is found on the page, place the URL in the list.
    for i in range(0, 100, 1):
        url = BASE_URL + str(i)
        try:
            response = urllib2.urlopen(url).read()
        except urllib2.URLError:
            continue  # skip ids that do not resolve to a valid page
        if "national" in response:
            National.append(url)
    return National

if __name__ == "__main__":
    namesPage = getPage()
    print (namesPage)
Here's my answer to the question of how to validate a given web site.
python check html valid
For checking the content of the page, the tools range from basic string methods and regex to more sophisticated tools like lxml or BeautifulSoup.
matchingSites = []
matchingSites.append(url) #Since you asked. :-p
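Putting the two steps together with the basic string-method option, a minimal sketch might look like the following. It is written for Python 3, and find_matching_urls, fake_fetch and the page texts are all made up for illustration; the download step is passed in as a function so the example runs without network access. In a real script, fetch would wrap something like urllib or requests.

```python
def find_matching_urls(urls, keyword, fetch):
    """Return the URLs whose page text contains keyword (case-insensitive).

    fetch is any callable that takes a URL and returns the page text,
    or raises an exception when the page cannot be retrieved.
    """
    matching_sites = []
    for url in urls:
        try:
            text = fetch(url)
        except Exception:
            continue  # treat unreachable pages as non-matching
        if keyword.lower() in text.lower():
            matching_sites.append(url)
    return matching_sites

# Stand-in for a real HTTP call; the page contents are invented.
fake_pages = {
    "http://readingeagle.com/section.aspx?id=1": "National news headlines",
    "http://readingeagle.com/section.aspx?id=2": "Local sports scores",
}

def fake_fetch(url):
    return fake_pages[url]  # raises KeyError for unknown URLs

urls = ["http://readingeagle.com/section.aspx?id=%d" % i for i in range(1, 4)]
print(find_matching_urls(urls, "national", fake_fetch))
# ['http://readingeagle.com/section.aspx?id=1']
```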