Scrapy sanitize url links - python

I'm trying to get data from a web page by following all of its links. The site is badly built: the links in certain parts of the pages contain spaces before and after the URL, so Scrapy follows them and the web server answers with 301 redirects, creating loops.
I tried to filter the URLs of the links, but it's impossible: they always come back with empty spaces or a + symbol.
Part of the code:
def cleanurl(link_text):
    print "original: ", link_text
    print "filter: ", link_text.strip("\s+\t\r\n '\"")
    return link_text.strip("\s+\t\r\n '\"")
    #return " ".join(link_text.strip("\t\r\n '\""))
    #return link_text.replace("\s", "").replace("\t","").replace("\r","").replace("\n","").replace("'","").replace("\"","")
rules = (
    Rule(LinkExtractor(allow=(), deny=(), process_value=cleanurl)),
)
Web code
<a href=
" ?on_sale=1
"
class="selectBox">ON SALE
</a>
Output of cleanurl:
original: http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1
filter: http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1
I tried to use regular expressions and other approaches, but I cannot sanitize the URL; in some cases it works and in others it doesn't, changing the %20 (white spaces) to +.
Thanks !

You mention "%20" and "+" being part of the URLs, which is why I suspect these URLs are URL-encoded.
So before stripping any whitespace from them, you need to URL-decode them:
Using Python 3:
import urllib.parse

def cleanurl(link_text):
    print("original: ", link_text)
    link_text = urllib.parse.unquote(link_text)
    print("filter: ", link_text.strip(" \t\r\n'\""))
    return link_text.strip(" \t\r\n'\"")
If you're still using Python 2.7, you need to replace the unquote line (and import urllib instead of urllib.parse):
link_text = urllib.unquote(link_text)
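Note that unquote leaves + signs untouched; if some of the links encode spaces as + (as the question suggests), urllib.parse.unquote_plus decodes both forms. A quick check:
from urllib.parse import unquote, unquote_plus

print(unquote("a%20b+c"))       # -> a b+c  ('+' is kept as-is)
print(unquote_plus("a%20b+c"))  # -> a b c  ('+' decoded to a space as well)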

I have already solved it: I added the following code to clean the URL, and now it is working properly. I hope it can help someone else who has the same problem as me.
def cleanurl(link_text):
    return ''.join(link_text.split())
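For example, with the problematic link from the question, splitting on whitespace and re-joining removes every space, tab and newline:
print(cleanurl(" http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1 \r\n"))
# -> http://www.portshop.com/computers-networking-c_11257/?on_sale=1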
Thanks everybody !

Related

Regular Expression In Order To Identify Tor Domains

I am working on a scraper that goes through HTML code trying to scrape Tor domains. However, I am having trouble coming up with a piece of code to match Tor domains.
Tor domains are typically in the format of:
http://sitegoeshere.onion
or
https://sitegoeshere.onion
I just want to match URLs that would be contained within a page, in the format http://sitetexthere.onion or https://sitehereitis.onion. This is within a bunch of text that may not be URLs; it should just pull out the URLs.
I am sure there is an easy or good piece of regex that'll do this, but I have not been able to find one. If anyone is able to link one or quickly spin one up, that'd be much appreciated. Many thanks.
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'
r = session.get('http://facebookcorewwwi.onion')
print(r.text)
regex.match will return None if the URL doesn't match.
import re

regex = re.compile(r"^https?://[\w\-.]+\.onion")
url = 'https://sitegoes-here.onion'
if regex.match(url):
    print('Valid Tor Domain!')
else:
    print('Invalid Tor Domain!')
For optional http(s):
regex = re.compile(r"^(?:https?://)?[\w\-.]+\.onion")
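If you need to pull every such URL out of a larger blob of text rather than validate a single string, re.findall with essentially the same pattern should work (a sketch, assuming each URL ends at the .onion TLD):
import re

text = "some text http://sitetexthere.onion more words https://sitehereitis.onion end"
print(re.findall(r"https?://[\w\-.]+\.onion", text))
# -> ['http://sitetexthere.onion', 'https://sitehereitis.onion']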
Regex patterns are mostly standard, so I would recommend this pattern:
r'\.onion$'
The backslash escapes the dot, and the '$' character anchors the end of the string. Since all the URLs start with 'http(s)://', there's no need to include that in the pattern.
Assuming these are taken from href attributes, you could try an attribute = value CSS selector with the $ (ends with) operator:
from bs4 import BeautifulSoup as bs
import requests
resp = requests.get("https://en.wikipedia.org/wiki/Tor_(anonymity_network)") #example url. Replace with yours.
soup = bs(resp.text,'lxml')
links = [item['href'] for item in soup.select('[href$=".onion"]')]

Replace email id by HTML Tag to make a hyperlink inside the text

I have a text as follows
For further details, please contact abc.helpdesk@xyz.com
I want to replace the email ID mentioned in the above text with <a href="abc.helpdesk@xyz.com">abc.helpdesk@xyz.com</a>, so that the email ID becomes a clickable object when the text is presented on a web page.
So far I have tried the following:
import re

text = 'For further details, please contact abc.helpdesk@xyz.com'
email_pat = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
email_str = ' '.join(email_pat)  # converts the list to a string
text_rep = text.replace(email_str, '<a href="email_str">email_str</a>')
The above code replaces the email string, but instead of creating a hyperlink it actually produces the following:
For further details, please contact <a href="email_str">email_str</a>
Is there any way to tackle this?
Edit
When I use the above solution in Flask, on the frontend I get the desired result (i.e. the email ID becomes clickable, URLs become clickable). But when I click on it, I am redirected to localhost:5002 instead of Outlook opening. localhost:5002 is where my Flask app is hosted.
It is not working for the URLs either. I am using the following code to make the URL string clickable:
text = text.replace(url, f'<a href="{url_link}">{url}</a>')
The above code makes the URL string clickable, but upon clicking it I am redirected to localhost:5002.
Is there any change I need to make in the app.run(host=5002) call?
You can use re.sub with a lambda:
import re

s = 'For further details, please contact abc.helpdesk@xyz.com'
new_s = re.sub(r'[\w.]+@[\w.]+', lambda x: f'<a href="{x.group()}">{x.group()}</a>', s)
Output:
'For further details, please contact <a href="abc.helpdesk@xyz.com">abc.helpdesk@xyz.com</a>'
Your actual problem is the line:
text_rep = text.replace(email_str, '<a href="email_str">email_str</a>')
That does exactly what you say it does, but what you want is this:
text_rep = text.replace(email_str, f'<a href="{email_str}">{email_str}</a>')
Instead of replacing the mail address with a literal string that contains email_str, this formats the string so that it has the mail address in it. This assumes you run Python 3; for Python 2 it would be more like:
text_rep = text.replace(email_str, '<a href="{email_str}">{email_str}</a>'.format(email_str=email_str))
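Regarding the edit about being redirected to localhost:5002: an href without a scheme is treated as a relative URL on the current host, which is why the click stays on your Flask app. For the browser to hand the address to the default mail client (Outlook in your case), the link needs the mailto: scheme, for example:
text_rep = text.replace(email_str, f'<a href="mailto:{email_str}">{email_str}</a>')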
However, note that your regex to match mail addresses makes some assumptions; a better version can be found here: Are email addresses allowed to contain non-alphanumeric characters?
Also, your code assumes there will be only one mail address in the source text string, as you're joining the results. A better solution replaces each individual mail address with its own replacement:
import re

input_text = 'For further details, contact admin@mywebsite.org or the webmaster webmaster123@hotmail.com'
output_text = re.sub(
    r'(?sm)(([^<>()[\].,;:\s@"]+(\.[^<>()[\].,;:\s@"]+)*)|(".+"))@(([^<>()[\].,;:\s@"]+\.)+[^<>()[\].,;:\s@"]{2,})',
    r'<a href="\g<0>">\g<0></a>', input_text)
print(output_text)
Note that this needs nothing but a basic re.sub.

python - parsing an url

I am writing a simple script that checks whether a website is present on the first page of Google results for a given keyword.
Now, this is the function that parses a URL and returns the host name:
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse

def parse_url(url):
    url = urlparse(url)
    hostname = url.netloc
    return hostname
and starting from a list of tags selected by:
linkElems = soup.select('.r a') #in google first page the resulting urls have class r
I wrote this:
for link in linkElems:
    l = link.get("href")[7:]
    url = parse_url(l)
    if url == "www.example.com":
        pass  # do stuff (e.g. store in a list, etc.)
In the second line of that loop I have to slice from index 7, because all the href values start with '/url?q='.
I am learning Python, so I am wondering whether there is a better way to do this, or simply an alternative one (maybe with a regex, the replace method, or the urlparse library).
You can use the Python lxml module for this, which is also an order of magnitude faster than BeautifulSoup.
It can be done something like this:
import requests
from lxml import html

blah_url = "https://www.google.co.in/search?q=blah&oq=blah&aqs=chrome..69i57j0l5.1677j0j4&sourceid=chrome&ie=UTF-8"
r = requests.get(blah_url).content
root = html.fromstring(r)
print(root.xpath('//h3[@class="r"]/a/@href')[0].replace('/url?q=', ''))
print([url.replace('/url?q=', '') for url in root.xpath('//h3[@class="r"]/a/@href')])
This will result in :
http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA
['http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA', 'http://www.dictionary.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggZMAE&usg=AFQjCNE1UVR3krIQHfEuIzHOeL0ZvB5TFQ', 'http://www.dictionary.com/browse/blah-blah-blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggeMAI&usg=AFQjCNFw8eiSqTzOm65PQGIFEoAz0yMUOA', 'https://en.wikipedia.org/wiki/Blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggjMAM&usg=AFQjCNFxEB8mEjEy6H3YFOaF4ZR1n3iusg', 'https://www.merriam-webster.com/dictionary/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggpMAQ&usg=AFQjCNHYXX53LmMF-DOzo67S-XPzlg5eCQ', 'https://en.oxforddictionaries.com/definition/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgguMAU&usg=AFQjCNGlgcUx-BpZe0Hb-39XvmNua2n8UA', 'https://en.wiktionary.org/wiki/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggzMAY&usg=AFQjCNGc9VmmyQls_rOBOR_lMUnt1j3Flg', 'http://dictionary.cambridge.org/dictionary/english/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg5MAc&usg=AFQjCNHJgZR1c6VY_WgFa6Rm-XNbdFJGmA', 'http://www.thesaurus.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg-MAg&usg=AFQjCNEtnpmKxVJqUR7P1ss4VHnt34f4Kg', 'https://www.youtube.com/watch%3Fv%3D3taEuL4EHAg&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQtwIIRTAJ&usg=AFQjCNFnKlMFxHoYAIkl1MCrc_OXjgiClg']
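If you also want to strip the tracking parameters Google appends (&sa=..., &ved=..., &usg=...), an alternative sketch using only the standard library is to parse the /url?q= redirect instead of slicing at a fixed index (the href below is made up, but has the same shape as those in the question):
from urllib.parse import urlparse, parse_qs

href = '/url?q=http://www.example.com/page&sa=U&ved=0ahUKEwi'
real_url = parse_qs(urlparse(href).query)['q'][0]
print(real_url)                   # -> http://www.example.com/page
print(urlparse(real_url).netloc)  # -> www.example.com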

How to join a string to a URL in Python?

I am trying to join a string onto a URL, but the problem is that since the string contains spaces, the trailing part does not get recognized as part of the URL.
Here is an example:
import urllib
import urllib2
website = "http://example.php?id=1 order by 1--"
request = urllib2.Request(website)
response = urllib2.urlopen(request)
html = response.read()
The "order by 1--" part is not recognized as part of the URL.
You should use urllib.quote or urllib.urlencode instead:
website = "http://example.com/?id=" + urllib.quote("1 order by 1--")
or
website = "http://example.com/?" + urllib.urlencode({"id": "1 order by 1--"})
And about the query you're trying to achieve: I think you're forgetting a ; to end the first query.
Of course not. Spaces are invalid in a query string and should be replaced by +:
http://example.com/?1+2+3
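For reference, urllib2 is Python 2 only; a rough Python 3 sketch of the same approach (reusing the question's placeholder URL) would be:
from urllib.parse import urlencode
from urllib.request import urlopen

website = "http://example.com/?" + urlencode({"id": "1 order by 1--"})
html = urlopen(website).read()  # spaces are sent as '+' in the query string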

Getting the number of like counts from a facebook album

I was trying to develop a Python script for my friend which would take a link to a public album and count the number of likes and comments on every photo with the requests module. This is the code of my script:
import re
import requests

def get_page(url):
    r = requests.get(url)
    content = r.text.encode('utf-8', 'ignore')
    return content

if __name__ == "__main__":
    url = 'https://www.facebook.com/media/set/?set=a.460132914032627.102894.316378325074754&type=1'
    content = get_page(url)
    content = content.replace("\n", '')
    chehara = "(\d+) likes and (\d+) comments"
    cpattern = re.compile(chehara)
    result = re.findall(cpattern, content)
    for jinish in result:
        print "likes " + jinish[0] + " comments " + jinish[1]
But the problem is that it only parses the likes and comments for the first 28 photos and no more. What is the problem? Can somebody please help?
[Edit: the requests module just loads the web page, i.e. the variable content contains the full HTML source of the Facebook page of the linked album]
Use the Facebook Graph API.
For albums it is documented here:
https://developers.facebook.com/docs/reference/api/album/
Use the limit attribute for testing, since it's rather slow:
http://graph.facebook.com/460132914032627/photos/?limit=10
EDIT
I just realized that the like_count is not part of the JSON; you may have to use FQL for that.
If you want to see the next page, you need to add the after attribute to your request, as in this URL:
https://graph.facebook.com/albumID/photos?fields=likes.summary(true),comments.summary(true)&after=XXXXXX&access_token=XXXXXX
You could take a look at this JavaScript project for reference.
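Putting the two answers together, a rough sketch of paging through the album could look like this (the album ID is the one from the question; the access token is a placeholder, and the exact response shape should be verified against the current Graph API docs):
import requests

url = "https://graph.facebook.com/460132914032627/photos"
params = {
    "fields": "likes.summary(true),comments.summary(true)",
    "access_token": "XXXXXX",  # placeholder, as in the URL above
}
while url:
    data = requests.get(url, params=params).json()
    for photo in data.get("data", []):
        likes = photo.get("likes", {}).get("summary", {}).get("total_count", 0)
        comments = photo.get("comments", {}).get("summary", {}).get("total_count", 0)
        print("likes", likes, "comments", comments)
    url = data.get("paging", {}).get("next")  # follow pagination past the first batch of photos
    params = {}  # the 'next' URL already carries all query parameters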
