Sanitize user input for inclusion in href attribute? - python

I would like to accept a user-supplied URL and display it in the href attribute of a link tag, i.e.
<a href="user_input">My link name</a>
But I would like to make sure that it doesn't contain any malicious content, such as injected script tags and the like. What is the best approach to sanitizing the user_input part?
From what I can tell:
django.utils.html.escape would escape &'s which is bad.
django.utils.http.urlquote and django.utils.http.urlquote_plus would escape the : part of http://, among other things, which seems bad.
Perhaps the best approach is urlquote_plus with some safe characters specified?

You can use the template tag: safe.
Let's say that your post context variable is:
user_input = some_valid_url
Grab user_input, and add the html to make it a link and reinsert it when saving the post. So the saved post is:
link_text = <a href=user_input>Link</a>
And then use safe in your html template:
{{ link_text|safe }}
Here is the documentation link for safe template tags:
https://docs.djangoproject.com/en/dev/ref/templates/builtins/#safe

I was overthinking the problem. It turns out that using django.utils.html.escape is fine: it results in HTML whose href attributes might have &amp; in them instead of &, but the browser handles this fine.
I thought I needed to find a way to keep a bare & in there, as URLs don't have &amp; in them.
My final code is:
from django.utils.safestring import mark_safe
from django.utils.html import escape
....
output = '<li><a href="%s">%s</a></li>' \
    % (escape(entry['url']), escape(self.link_display(entry)))
return mark_safe(output)
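One caveat: escaping keeps the value from breaking out of the attribute, but it does not stop a javascript: URL from being used as the link target. Below is a minimal sketch that also restricts the scheme, using only the standard library (html.escape stands in for django.utils.html.escape). The helper name safe_link and the set of allowed schemes are assumptions for illustration, not part of the original code.

```python
from html import escape
from urllib.parse import urlsplit

# Assumption: only ordinary web links should be clickable.
ALLOWED_SCHEMES = {"http", "https"}

def safe_link(url, label):
    """Return an <a> tag for url, or only the escaped label if the URL is rejected."""
    url = url.strip()
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:
        # urlsplit raises ValueError on some malformed URLs; treat them as unsafe.
        scheme = ""
    if scheme not in ALLOWED_SCHEMES:
        return escape(label)
    # escape() turns &, <, > and both quote characters into entities,
    # so the value cannot terminate the href attribute early.
    return '<a href="%s">%s</a>' % (escape(url), escape(label))

print(safe_link("http://example.com/?a=1&b=2", "My link name"))
print(safe_link("javascript:alert(1)", "My link name"))
```

Note that this sketch also rejects relative URLs, since they have no scheme; whether that is acceptable depends on the application.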

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, and so the string argument can't find it?? An example of a section on the page that I AM able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile using "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking. Just remove the regular expression part, take the text and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want, like so:
soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression matches any text that has sentinel in it, preceded by at least one character. Be careful: you will have to match some characters like spaces, which is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break
links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.

Change ALL html tags to symbol using python

What I want to do is change every tag (whether it's <a href=> or <title> or </title> or </div>... etc.) to a symbol.
I tried using beautiful soup but it only finds tags that I define...
I found some code in the HTMLparser.py
tagfind = re.compile('([a-zA-Z][^\t\n\r\f />\x00]*)(?:\s|/(?!>))*')
I believe this is what I'm looking for; I just don't know how to use it properly.
Also I figured I could use the:
handle_starttag(self, tag, attrs):
But I don't want to define the tag, I just want the script to find every single tag and change it to something...
Is this possible?
Thank you for all of your help!!
A much more reliable way is to recursively visit each tag. I just changed the name in the example below, but you can do whatever you want once you have the tag:
from bs4 import BeautifulSoup, element
def visit(s):
    if isinstance(s, element.Tag):
        has_children = s.find_all()
        if has_children:
            s.name = "foobar"
            for child in s:
                visit(child)
        else:
            s.name = "foobar"
To use it:
soup = BeautifulSoup(...)
visit(soup)
Then any changes will be reflected in the soup.
BeautifulSoup isn't a good idea here - that's designed for parsing HTML, not editing it.
Also, that regex doesn't seem like a very good one (it only matches the content inside a tag rather than the whole tag itself), so I found a different one that would be better suited to your purposes:
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
This regex will match anything like the following:
<h1>
</h1>
<img src="foo.com/image.png">
We can use this for replacing all tags by using re.sub. This finds all matches for a certain regex and replaces them with something else. Here's how you'd use it for what you want to do:
import re
html_regex = r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
html = "<h1>Foo</h1>"
print(re.sub(html_regex, "#", html))
This would print:
#Foo#
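Since the question mentions HTMLParser and handle_starttag, here is a hedged sketch of that route using the standard library's html.parser. It does not require naming any tag in advance: every start, end, and self-closing tag triggers a handler, and each handler emits the symbol. The class name TagReplacer and the helper replace_tags are made up for this example.

```python
from html.parser import HTMLParser

class TagReplacer(HTMLParser):
    """Rebuild a document with every tag swapped for a fixed symbol."""

    def __init__(self, symbol="#"):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.symbol = symbol
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.symbol)

    def handle_endtag(self, tag):
        self.parts.append(self.symbol)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> become a single symbol.
        self.parts.append(self.symbol)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

def replace_tags(html, symbol="#"):
    parser = TagReplacer(symbol)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

print(replace_tags("<h1>Foo</h1>"))
```

In this sketch, comments and doctype declarations are silently dropped because their handlers are not overridden.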

Search a value in HTML content with Django ORM that don't consider HTML tags

I have a description field in the database in which the system saves HTML code,
and I have a search system that works with Q:
Post.objects.filter(Q(name__icontains=keyword) | Q(description__icontains=keyword))
It works fine, but the problem is that when a user searches for, say, '<strong>' or 'strong', it returns the rows that have the '<strong>' tag in them, whereas it shouldn't consider the HTML tags.
So how can I search for a value in HTML content with the Django ORM without considering HTML tags?
I'd probably add a second field called stripped_description and use django's striptags filter to strip out html tags, and have django search on that field. It should still find the row you need to recall the actual description field containing the HTML code, should you need to display that as a result, but that's the only means I've used to "ignore" html tags.
You can, and probably should, look into a proper search function using haystack. My favorite search engine to use with it is whoosh (pip install whoosh) if you are not doing hardcore search functions. You can define your content to be indexed like this:
{{ object.title }}
{{ object.description|striptags }}
It's fairly easy to set up, and once you have done it, setting it up for the next project takes minutes.
I think this is a good approach:
from django.db.models import Q
from django.utils.html import strip_tags
rows = Post.objects.filter(Q(name__icontains=keyword) | Q(description__icontains=keyword))
# Keep only rows whose tag-stripped text still contains the keyword.
# A QuerySet does not support item deletion, so build a new list instead.
kw = keyword.lower()
rows = [r for r in rows
        if kw in strip_tags(r.name).lower() or kw in strip_tags(r.description).lower()]
return render(request, 'posts.html', {'rows': rows})
Fetch the data from the db with filter.
Strip the tags from the results and then filter them again.

With flask/jinja, what is a viable way to safely render a link inside a user generated block of text?

Think twitter where you paste a link next to some plain text and when your tweet is rendered, that url is now a clickable link.
Do I:
replace jinja's autoescape with my own by scanning the text for html tags and replacing them with the html entity code
use a regular expression to detect a url contained in the text and replace it within an a href=
what would this expression look like to detect any # of .tld's, http/https, www/any subdomain?
and render this all as |safe in the template?
Or is there a python/flask/jinja 'feature' that can better handle this kind of thing?
Jinja has a filter built-in called urlize that should do exactly what you want.
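To illustrate the general idea the question asks about (escape everything, then wrap detected URLs in anchors), here is a rough standard-library sketch. It is not Jinja's urlize implementation, which handles more cases; the function name linkify, the URL pattern, and the rel attribute are all assumptions made for this example.

```python
import re
from html import escape

# Assumption: only explicit http:// or https:// URLs are turned into links.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def linkify(text):
    """Escape text for HTML and wrap http(s) URLs in anchor tags."""
    out = []
    pos = 0
    for m in URL_RE.finditer(text):
        # Plain text before the URL is escaped as-is.
        out.append(escape(text[pos:m.start()]))
        url = escape(m.group(0))
        out.append('<a href="%s" rel="nofollow">%s</a>' % (url, url))
        pos = m.end()
    # Escape whatever follows the last URL.
    out.append(escape(text[pos:]))
    return "".join(out)

print(linkify("see https://example.com/?a=1&b=2 <script>"))
```

The URLs are located in the raw text and each piece is escaped separately, so user-supplied markup never reaches the output unescaped. Trailing punctuation directly after a URL would be swallowed into the link by this simple pattern, which is one reason to prefer the built-in filter.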

How can I match the url in html

<form action='/[0-9]+' method="POST">
<input type="submit" value="delete question" name="delete">
</form>
The above is the HTML template I am using for my App Engine project. Besides that, I created a web request handler class to handle this request: ('/[0-9]+', QuestionViewer). It is supposed to catch any URL made of digits. However, it turns out that after I click on the delete button above, my page is directed to some URL like main/[0-9]. I don't know if I can use a regex in the Django template, or is there a way that my QuestionViewer class can catch URLs made of digits? The URL associated with the HTML page is dynamic: the part after the /, like /13, changes accordingly, and I can't use something that only works for page 13 but not for /14 and so on. I hope I've made it clear. Any help? Thanks a lot.
That doesn't really make sense. You want to submit your form to a regex rule? What would it match against?
No, the form needs to submit to a specific url. Right now, it's trying to submit to /[0-9]+
If I understand what you are saying, and you want to submit from a url such as /13/ to your QuestionViewer at /[0-9]+, simply submit without the action attribute or set it to "" to post to the current url.
Note that if you want to use the digits captured by your regex, you need to surround the relevant part in parentheses, such as '/([0-9]+)/$', QuestionViewer, or use a named group, /(?P<id>[0-9]+)/$, to pass an argument named id, equal to the matched text, to QuestionViewer.
http://docs.djangoproject.com/en/1.0/topics/http/urls/#how-django-processes-a-request
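As a small illustration of the named group described above, here is the same pattern exercised with Python's re module directly, outside any framework. The helper name question_id is hypothetical; in a real project the URL dispatcher does this matching and passes the captured value to the handler.

```python
import re

# The named-group rule from above, written as a plain regex.
QUESTION_URL = re.compile(r"^/(?P<id>[0-9]+)/$")

def question_id(path):
    """Return the captured id as a string, or None if the path does not match."""
    m = QUESTION_URL.match(path)
    return m.group("id") if m else None

print(question_id("/13/"))
print(question_id("/14/"))
print(question_id("/[0-9]+"))
```

The last call returns None: the literal text /[0-9]+ is not itself a URL that the rule matches, which is why a form cannot submit to it.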
The value of the action attribute must be a valid URL.
I think what you want is to generate an actual number for the action url; a number that is the number of the question that you want to delete. For example:
<form action="/1234" method="POST">
You will need to change your code to make sure you do this.
