How to securely escape a string to a filename? [duplicate] - python

I am looking for the best way to "slugify" a string (see what a "slug" is), and my current solution is based on this recipe.
I have changed it a little bit to:
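import re
import unicodedata
s = 'String to slugify'
slug = unicodedata.normalize('NFKD', s)
# encode() returns bytes on Python 3, so decode back to str before using re
slug = slug.encode('ascii', 'ignore').decode('ascii').lower()
slug = re.sub(r'[^a-z0-9]+', '-', slug).strip('-')
slug = re.sub(r'[-]+', '-', slug)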
Anyone see any problems with this code? It is working fine, but maybe I am missing something or you know a better way?

There is a python package named python-slugify, which does a pretty good job of slugifying:
pip install python-slugify
Works like this:
from slugify import slugify
txt = "This is a test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")
txt = "This -- is a ## test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")
txt = 'C\'est déjà l\'été.'
r = slugify(txt)
self.assertEquals(r, "cest-deja-lete")
txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
self.assertEquals(r, "nin-hao-wo-shi-zhong-guo-ren")
txt = 'Компьютер'
r = slugify(txt)
self.assertEquals(r, "kompiuter")
txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
self.assertEquals(r, "jaja-lol-mememeoo-a")
See More examples
This package does a bit more than what you posted (take a look at the source, it's just one file). The project is still active: it had been updated 2 days before I originally answered, and more than nine years later (last checked 2022-03-30) it still gets updated.
Careful: there is a second package around, named slugify. If you have both installed, you might run into problems, as they use the same name for import. The one just named slugify did not pass all of my quick checks: "Ich heiße" became "ich-heie" (it should be "ich-heisse"), so be sure to pick the right one when using pip or easy_install.

Install unidecode from here for Unicode support:
pip install unidecode
# -*- coding: utf-8 -*-
import re
import unidecode
def slugify(text):
    # Transliterate to ASCII first, then collapse runs of non-word characters into '-'
    text = unidecode.unidecode(text).lower()
    return re.sub(r'[\W_]+', '-', text)
text = u"My custom хелло ворлд"
print(slugify(text))
# my-custom-khello-vorld

There is a Python package named awesome-slugify:
pip install awesome-slugify
Works like this:
from slugify import slugify
slugify('one kožušček') # one-kozuscek
awesome-slugify github page

It works well in Django, so I don't see why it wouldn't be a good general purpose slugify function.
Are you having any problems with it?

The problem is with the normalization line:
slug = unicodedata.normalize('NFKD', s)
This is Unicode normalization, which does not decompose many characters into ASCII equivalents. Characters without a decomposition are then simply dropped by the ASCII encoding step that follows. For example, the non-ASCII characters are stripped from the following strings:
Mørdag -> mrdag
Æther -> ther
A better way to do it is to use the unidecode module that tries to transliterate strings to ascii. So if you replace the above line with:
import unidecode
slug = unidecode.unidecode(s)
You get better results for the above strings and for many Greek and Russian characters too:
Mørdag -> mordag
Æther -> aether
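To see the stripping described above without installing anything, here is a minimal sketch using only the standard library. The helper name ascii_fold is made up for this illustration; it just repeats the NFKD + ASCII-encode steps from the question.

```python
import unicodedata

def ascii_fold(text):
    """NFKD-normalize, then drop every character that is not plain ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Accented letters decompose into a base letter plus a combining mark,
# so the base letter survives the ASCII step.
print(ascii_fold('déjà'))

# 'ø' and 'Æ' have no Unicode decomposition, so they are dropped entirely.
print(ascii_fold('Mørdag'))
print(ascii_fold('Æther'))
```

This is why a transliteration library gives nicer slugs for such input: it maps characters to ASCII lookalikes instead of relying on Unicode decompositions.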

def slugify(value):
    """
    Converts to lowercase, removes non-word characters (alphanumerics and
    underscores) and converts spaces to hyphens. Also strips leading and
    trailing whitespace.
    """
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub('[^\w\s-]', '', value).strip().lower()
    return mark_safe(re.sub('[-\s]+', '-', value))
slugify = allow_lazy(slugify, six.text_type)
This is the slugify function present in django.utils.text.
This should suffice for your requirement.
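If you don't want to pull in Django just for this, the same steps can be sketched with the standard library alone. This is an adaptation, not Django's code: the name slugify_plain is hypothetical, and the Django-specific mark_safe and allow_lazy wrappers are simply left out.

```python
import re
import unicodedata

def slugify_plain(value):
    """Standalone sketch of the steps in the Django function above."""
    # Fold accented characters to ASCII where a decomposition exists.
    value = unicodedata.normalize('NFKD', value)
    value = value.encode('ascii', 'ignore').decode('ascii')
    # Drop everything that is not a word character, whitespace or hyphen.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)

print(slugify_plain("  Hello, World!  "))
print(slugify_plain("C'est déjà l'été."))
```

Note that underscores count as word characters here, so they are kept as they are.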

Unidecode is good; however, be careful: unidecode is GPL. If this license doesn't fit then use this one

A couple of options on GitHub:
https://github.com/dimka665/awesome-slugify
https://github.com/un33k/python-slugify
https://github.com/mozilla/unicode-slugify
Each supports slightly different parameters for its API, so you'll need to look through to figure out what you prefer.
In particular, pay attention to the different options they provide for dealing with non-ASCII characters. Pydanny wrote a very helpful blog post illustrating some of the unicode handling differences in these slugify'ing libraries: http://www.pydanny.com/awesome-slugify-human-readable-url-slugs-from-any-string.html This blog post is slightly outdated because Mozilla's unicode-slugify is no longer Django-specific.
Also note that currently awesome-slugify is GPLv3, though there's an open issue where the author says they'd prefer to release as MIT/BSD, just not sure of the legality: https://github.com/dimka665/awesome-slugify/issues/24

You might consider changing the last line to
slug=re.sub(r'--+',r'-',slug)
since the pattern [-]+ is no different than -+, and you don't really care about matching just one hyphen, only two or more.
But, of course, this is quite minor.
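A quick check of the point above, as a small sketch (the sample strings are made up): both patterns collapse hyphen runs identically. It also shows that the earlier [^a-z0-9]+ substitution from the question already turns any run of non-alphanumerics, hyphens included, into one hyphen.

```python
import re

sample = 'a---b--c-d'

# Both patterns collapse runs of hyphens into a single hyphen.
collapsed_class = re.sub(r'[-]+', '-', sample)
collapsed_plain = re.sub(r'--+', '-', sample)
print(collapsed_class, collapsed_plain)

# The earlier substitution from the question already replaces each whole run
# of non-alphanumeric characters with one hyphen.
already_clean = re.sub(r'[^a-z0-9]+', '-', 'a -- b---c')
print(already_clean)
```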

Another option is boltons.strutils.slugify. Boltons has quite a few other useful functions as well, and is distributed under a BSD license.

Going by your example, a quick way to do that could be:
s = 'String to slugify'
slug = s.replace(" ", "-").lower()

Another nice way of creating it could be this form:
import re
st = 'String to slugify'
slug = re.sub(r'\W+', '-', st).strip('-').lower()

Related

How to split unicode strings character by character in python?

My website supports a number of Indian languages. The user can change the language dynamically. When user inputs some string value, I have to split the string value into its individual characters. So, I'm looking for a way to write a common function that will work for English and a select set of Indian languages. I have searched across sites, however, there appears to be no common way to handle this requirement. There are language-specific implementations (for example Open-Tamil package for Tamil implements get_letters) but I could not find a common way to split or iterate through the characters in a unicode string taking the graphemes into consideration.
One of the many methods that I've tried:
name = u'தமிழ்'
print(name)
for i in list(name):
    print(i)
#expected output
தமிழ்
த
மி
ழ்
#actual output
தமிழ்
த
ம
ி
ழ
்
#Here is another example using another Indian language
name = u'हिंदी'
print(name)
for i in list(name):
    print(i)
#expected output
हिंदी
हिं
दी
#actual output
हिंदी
ह
ि
ं
द
ी
The way to solve this is to group all "L" category characters with their subsequent "M" category characters:
>>> regex.findall(ur'\p{L}\p{M}*', name)
[u'\u0ba4', u'\u0bae\u0bbf', u'\u0bb4\u0bcd']
>>> for c in regex.findall(ur'\p{L}\p{M}*', name):
...     print c
...
த
மி
ழ்
(This uses the third-party regex module, not the standard-library re, which does not support \p{...}.)
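If installing a third-party module is not an option, the same "letter plus following marks" grouping can be approximated with unicodedata from the standard library. This is only a sketch: the function name split_clusters is made up, it attaches marks to whatever character precedes them, and it does not implement full grapheme-cluster rules (conjuncts and emoji sequences are not handled).

```python
import unicodedata

def split_clusters(text):
    """Group each character with the combining marks (category M*) after it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith('M'):
            # Combining mark: attach it to the previous cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters

for word in ['\u0ba4\u0bae\u0bbf\u0bb4\u0bcd', '\u0939\u093f\u0902\u0926\u0940']:
    print(split_clusters(word))
```

The two escaped strings are the Tamil and Hindi words from the question, written as code points so the example does not depend on the file encoding.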
To get "user-perceived" characters whatever the language, use \X (eXtended grapheme cluster) regular expression:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex # $ pip install regex
for text in [u'தமிழ்', u'हिंदी']:
    print("\n".join(regex.findall(r'\X', text, regex.U)))
Output
த
மி
ழ்
हिं
दी
uniseg works really well for this, and the docs are OK. The other answer to this question works for international Unicode characters, but falls flat if users enter Emoji. The solution below will work:
>>> emoji = u'😀😃😄😁'
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for c in list(grapheme_clusters(emoji)):
...     print c
...
😀
😃
😄
😁
This is from pip install uniseg==0.7.1.

Parsing a message with various special characters and splitting into a list (re and regex) Python 2.7

I am trying to parse a message that receives the following delimiters (Without quotes):
Delimiter1: "###" - Followed by a message
Delimiter2: "!!!" - A signal
Delimiter3: "---" - Followed by a message
Delimiter4: "###" - Followed by a message
Delimiter5: "$$$" - Followed by a message
I have so far:
import re
mystring = '###useradd---userfirstadded###userremoved!!!$$$message'
result = re.split('\\#\#\#|\\!\!\!|\\---|\\#\#\#|\\$\$\$',mystring)
print result
My result so far:
['', 'useradd', 'userfirstadded', 'userremoved', '', 'message']
I want as a result printed to console:
['###useradd','---userfirstadded','###userremoved','!!!','$$$message']
Is this possible using re.split or do I need to use re.find or something a lot better? I have been playing with the re.split delimiters as you can see but maybe you guys have a lot more experience using this functionality within python.
EDITED Solution #1 Using re (From @thefourtheye):
Here is the code:
import re
mystring = '###useradd---userfirstadd%ed###this is my username#!!!$$$hey whats up how are you??###useradd$$$This is my email #gmail.com!!!'
result = re.findall(r'!!!|(?:#|-|#|\$){3}[\w ^]+', mystring)
print result
The result printed is as follows:
['###useradd', '---userfirstadd', '###this is my username', '!!!', '$$$hey whats up how are you', '###useradd', '$$$This is my email ', '!!!']
EDITED New specifications:
Everything works as specified above using the answer that @thefourtheye suggested. It would be even better if a message could contain one or two of the delimiter characters without being split: a user who types an email address or a dollar amount in a message will use a character that also appears in a delimiter (for example $). If this isn't possible, I can always require a space before and after each delimiter, or possibly use a different separator for delimiter characters that appear inside a message. What are your suggestions?
Summary: I would like to accept all characters until hitting exactly the delimiter pattern (i.e. ###). Otherwise every possible character should be accepted, including delimiter characters that do not form the full pattern (i.e. a partial delimiter would not split the string). Is this possible?
EDITED Solution #2 Using regex (From @hwnd):
The regex module does not ship with Python 2.7, if that is what you are using. You need to download and install this package. These are the explicit directions I took so you can do the same.
Go to https://pypi.python.org/pypi/regex and at the bottom of the page there are download links. Click on regex-2015.03.18-cp27-none-win32.whl for Windows operating systems that are running Python 2.7 (Otherwise try other ones until a successful install works for you).
Browse to the download directory of the .whl file that you just downloaded. Shift+Right Click anywhere in that directory, click on "Open command window here", then type "pip install regex-2015.03.18-cp27-none-win32.whl". It should say "Successfully installed!"
You will now be able to use regex!
Here is the code:
import regex
mystring = '###useradd---userfirstadd%ed###this is my username#!!!$$$hey whats up how are you??###useradd$$$This is my email #gmail.com!!!'
result = filter(None, regex.split(r'(?V1)(!!!)|\s*(?=(?:#|\$|#|-){3})', mystring))
print result
The result printed is as follows:
['###useradd', '---userfirstadd%ed', '###this is my username#', '!!!', '$$$hey whats up how are you??', '###useradd', '$$$This is my email #gmail.com', '!!!']
Edit: Since you want to retain all the characters between your pattern delimiters, you can do this using the regex module, splitting on "!!!" and using lookahead for other zero-width matches.
>>> import regex
>>> s = '###useradd---userfirstadd%ed###this is my username#!!!$$$hey whats up how are you??###useradd$$$This is my email #gmail.com!!!'
>>> filter(None, regex.split(r'(?V1)(!!!)|\s*(?=(?:#|\$|#|-){3})', s))
['###useradd', '---userfirstadd%ed', '###this is my username#', '!!!', '$$$hey whats up how are you??', '###useradd', '$$$This is my email #gmail.com', '!!!']
Use this regexp; it will provide 5 matching groups:
(#{3}[a-z]+)(-{3}[a-z]+)(#{3}[a-z]+)(!{3})(\${3}[a-z]+)
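For the "split only on the exact three-character delimiter" requirement, here is a sketch that stays within the standard re module. It assumes the delimiter set is exactly ###, ---, $$$ and !!! as they appear in the examples above; the names DELIMS, TOKEN and split_message are made up for this illustration. Each token is a delimiter followed by every character that does not start another full delimiter.

```python
import re

# Delimiters that are followed by a message; '!!!' is a signal on its own.
DELIMS = r'###|---|\$\$\$'

# A token is either the bare signal, or a delimiter followed by any run of
# characters that does not begin another complete delimiter.
TOKEN = re.compile(r'!!!|(?:%s)(?:(?!%s|!!!).)*' % (DELIMS, DELIMS), re.DOTALL)

def split_message(message):
    """Return the delimiter-prefixed pieces of message, in order."""
    return TOKEN.findall(message)

print(split_message('###useradd---userfirstadded###userremoved!!!$$$message'))
```

A lone # or $ inside a message is kept, because the lookahead only stops at a complete delimiter. Any text before the first delimiter is silently skipped by findall, so handle that separately if it can occur.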

RegEx in Python for WikiMarkup

I'm trying to create a re in python that will match this pattern in order to parse MediaWiki Markup:
<ref>*Any_Character_Could_Be_Here</ref>
But I'm totally lost when it comes to regex. Can someone help me, or point me to a tutorial or resource that might be of some help? Thanks!
Assuming that svick is correct that MediaWiki Markup is not valid xml (or html), then you could use re in this circumstance (although I will certainly defer to better solutions):
>>> import re
>>> test_string = '''<ref>*Any_Character_Could_Be_Here</ref>
<ref>other characters could be here</ref>'''
>>> re.findall(r'<ref>.*?</ref>', test_string)
['<ref>*Any_Character_Could_Be_Here</ref>', '<ref>other characters could be here</ref>'] # a list of matching strings
In any case, you will want to familiarize yourself with the re module (whether or not you use a regex to solve this particular problem).
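As a small variation on the findall call above, a capturing group returns only the text between the tags, and re.DOTALL lets a reference span several lines. This is a sketch with a made-up sample string; it only covers plain <ref>...</ref> pairs, not tags with attributes such as <ref name="...">.

```python
import re

# Non-greedy capture of everything between a plain <ref> and the next </ref>.
REF = re.compile(r'<ref>(.*?)</ref>', re.DOTALL)

markup = ("Water is wet.<ref>*Some\nmulti-line source</ref> "
          "Sky is blue.<ref>other source</ref>")

references = REF.findall(markup)
print(references)
```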
srhoades28, this will match your pattern.
if re.search(r"<ref>\*[^<]*</ref>", subject):
    # Successful match
    pass
else:
    # Match attempt failed
    pass
Note that from your post, it is assumed that the * after <ref> always occurs, and that the only variable part is the blue text, in your example "Any_Character_Could_Be_Here".
If this is not the case let me know and I will tweak the expression.

python regular expression with utf8 issue

I have a file that contains many lines of plain UTF-8 text, such as the one below (it's Chinese, by the way).
PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08
The file itself was saved in UTF-8 format. The file name is xx.txt.
Here is my Python code; the environment is Python 2.7:
#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()
The problem is that I get no results.
I want to get the decimal string from 交易金额:0.01元, which here is 0.01.
Why doesn't this code work? Can anyone explain it to me? I have no clue whatsoever.
There are several issues with your code. First, you should use re.compile(ur'<unicode string>'). It is also nice to add the re.UNICODE flag (not sure if it is really needed here though). Next, you will still not get a match, since \d+ only handles a run of digits, not decimals; you should use \d+\.?\d+ instead (a number, possibly a dot, and a number). Example code:
#coding: utf-8
text = u"PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08"
import re
pattern = re.compile(ur'交易金额:(\d+\.?\d+)元', re.UNICODE)
print pattern.search(text).group(1)
You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
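For reference, here is how the same idea might look on Python 3, where str patterns are Unicode by default and no ur'' prefix exists. This is a sketch, not a drop-in replacement: the names AMOUNT and extract_amount are made up, the sample line is shortened, and the pattern uses an optional fractional part so that whole amounts such as 5 also match (the \d+\.?\d+ form above needs at least two digits).

```python
import re

# Digits with an optional fractional part, between the label and the unit.
AMOUNT = re.compile(r'交易金额:(\d+(?:\.\d+)?)元')

def extract_amount(line):
    """Return the amount as a string, or None if the line has no amount."""
    found = AMOUNT.search(line)  # search(), not match(): the label is mid-line
    return found.group(1) if found else None

sample = '订单号:W12022910079166 交易金额:0.01元 交易状态:true'
print(extract_amount(sample))
```

When reading from the file, opening it with open('xx.txt', encoding='utf-8') yields str lines directly, so no explicit decode call is needed.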
If you use utf-8, you can use flags=re.LOCALE
#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+\.?\d+)元', flags=re.LOCALE)
for line in open('xx.txt'):
    # search() rather than match(): the amount is not at the start of the line
    match = pattern.search(line)
For more details, see re.LOCALE. There is no need to convert utf-8 to unicode.

How do I unescape HTML entities in a string in Python 3.1? [duplicate]

I have looked all around and only found solutions for python 2.6 and earlier, NOTHING on how to do this in python 3.X. (I only have access to Win7 box.)
I HAVE to be able to do this in 3.1 and preferably without external libraries. Currently, I have httplib2 installed and access to command-prompt curl (that's how I'm getting the source code for pages). Unfortunately, curl does not decode html entities, as far as I know, I couldn't find a command to decode it in the documentation.
YES, I've tried to get Beautiful Soup to work, MANY TIMES without success in 3.X. If you could provide EXPLICIT instructions on how to get it to work in python 3 in MS Windows environment, I would be very grateful.
So, to be clear, I need to turn strings like this: Suzy &amp; John into a string like this: "Suzy & John".
You could use the function html.unescape:
In Python3.4+ (thanks to J.F. Sebastian for the update):
import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'
html.unescape('&quot;')
# '"'
In Python3.3 or older:
import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')
In Python2:
import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library, and is portable between Python 2.x and Python 3.x.
>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
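To compare the two standard-library options side by side, here is a short sketch for Python 3.4 or newer (html.unescape does not exist in 3.1, so this does not cover the asker's exact version). The sample strings are made up. By default saxutils.unescape only handles &amp;, &lt; and &gt;, and takes further entities through its second argument.

```python
import html
from xml.sax import saxutils

# html.unescape knows the full set of named HTML entities.
plain = html.unescape('Suzy &amp; John')
quoted = html.unescape('&quot;Suzy&quot;')

# saxutils.unescape leaves &quot; alone unless told how to map it.
untouched = saxutils.unescape('&quot;Suzy&quot;')
mapped = saxutils.unescape('&quot;Suzy&quot;', {'&quot;': '"'})

print(plain, quoted, untouched, mapped)
```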
Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotations. The only thing that I found that did was this function:
import re
from htmlentitydefs import name2codepoint as n2cp
def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                return match.group()
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]
Which I got from this page.
Python 3.x has html.entities too
In my case I have an HTML string escaped with the AS3 escape function. After an hour of googling I hadn't found anything useful, so I wrote this recursive function to serve my needs. Here it is:
def unescape(string):
    index = string.find("%")
    if index == -1:
        return string
    else:
        #if it is escaped unicode character do different decoding
        if string[index+1:index+2] == 'u':
            replace_with = ("\\"+string[index+1:index+6]).decode('unicode_escape')
            string = string.replace(string[index:index+6],replace_with)
        else:
            replace_with = string[index+1:index+3].decode('hex')
            string = string.replace(string[index:index+3],replace_with)
        return unescape(string)
Edit-1 Added functionality to handle unicode characters.
I am not sure if this is a built in library or not but it looks like what you need and supports 3.1.
From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape
xml.sax.saxutils.unescape(data, entities={})
Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
