I want to replace whitespace with underscore in a string to create nice URLs. So that for example:
"This should be connected"
Should become
"This_should_be_connected"
I am using Python with Django. Can this be solved using regular expressions?
You don't need regular expressions. Python has a built-in string method that does what you need:
mystring.replace(" ", "_")
Replacing spaces is fine, but I might suggest going a little further to handle other URL-hostile characters like question marks, apostrophes, exclamation points, etc.
Also note that the general consensus among SEO experts is that dashes are preferred to underscores in URLs.
import re
def urlify(s):
# Remove all non-word characters (everything except numbers and letters)
s = re.sub(r"[^\w\s]", '', s)
# Replace all runs of whitespace with a single dash
s = re.sub(r"\s+", '-', s)
return s
# Prints: I-cant-get-no-satisfaction"
print(urlify("I can't get no satisfaction!"))
This takes into account blank characters other than space and I think it's faster than using re module:
url = "_".join( title.split() )
Django has a 'slugify' function which does this, as well as other URL-friendly optimisations. It's hidden away in the defaultfilters module.
>>> from django.template.defaultfilters import slugify
>>> slugify("This should be connected")
this-should-be-connected
This isn't exactly the output you asked for, but IMO it's better for use in URLs.
Using the re module:
import re
re.sub('\s+', '_', "This should be connected") # This_should_be_connected
re.sub('\s+', '_', 'And so\tshould this') # And_so_should_this
Unless you have multiple spaces or other whitespace possibilities as above, you may just wish to use string.replace as others have suggested.
use string's replace method:
"this should be connected".replace(" ", "_")
"this_should_be_disconnected".replace("_", " ")
Python has a built in method on strings called replace which is used as so:
string.replace(old, new)
So you would use:
string.replace(" ", "_")
I had this problem a while ago and I wrote code to replace characters in a string. I have to start remembering to check the python documentation because they've got built in functions for everything.
Surprisingly this library not mentioned yet
python package named python-slugify, which does a pretty good job of slugifying:
pip install python-slugify
Works like this:
from slugify import slugify
txt = "This is a test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")
txt = "This -- is a ## test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")
txt = 'C\'est déjà l\'été.'
r = slugify(txt)
self.assertEquals(r, "cest-deja-lete")
txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
self.assertEquals(r, "nin-hao-wo-shi-zhong-guo-ren")
txt = 'Компьютер'
r = slugify(txt)
self.assertEquals(r, "kompiuter")
txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
self.assertEquals(r, "jaja-lol-mememeoo-a")
You can try this instead:
mystring.replace(r' ','-')
I'm using the following piece of code for my friendly urls:
from unicodedata import normalize
from re import sub
def slugify(title):
name = normalize('NFKD', title).encode('ascii', 'ignore').replace(' ', '-').lower()
#remove `other` characters
name = sub('[^a-zA-Z0-9_-]', '', name)
#nomalize dashes
name = sub('-+', '-', name)
return name
It works fine with unicode characters as well.
mystring.replace (" ", "_")
if you assign this value to any variable, it will work
s = mystring.replace (" ", "_")
by default mystring wont have this
OP is using python, but in javascript (something to be careful of since the syntaxes are similar.
// only replaces the first instance of ' ' with '_'
"one two three".replace(' ', '_');
=> "one_two three"
// replaces all instances of ' ' with '_'
"one two three".replace(/\s/g, '_');
=> "one_two_three"
x = re.sub("\s", "_", txt)
perl -e 'map { $on=$_; s/ /_/; rename($on, $_) or warn $!; } <*>;'
Match et replace space > underscore of all files in current directory
Related
I want to find a pattern and replace it with another
Suppose i have:
"Name":"hello"
And want to do this
Name= "hello"
Using python regex
The string could be anything inside double quotes so i need to find pattern "": "" and replace it with =""
This expression,
^"\s*([^"]+?)\s*"\s*:\s*"?([^"]+)"?$
has two capturing groups:
([^"]+?)
for collecting our desired data. Then, we would simply re.sub.
In this demo, the expression is explained, if you might be interested.
Test
import re
result = re.sub('^"\s*([^"]+?)\s*"\s*:\s*"?([^"]+)"?$', '\\1= "\\2"', '" Name ":" hello "')
print(result)
Why not use this regex:
import re
s = '"Name":"hello"'
print(re.sub('"(.*)":"(.*)"', '\\1= \"\\2\"', s))
Output:
Name= "hello"
Explanation here.
For strings containing more than one of those kind of strings, you would need to add some python code to it:
import re
s = '"Name":"hello", "Name2":"hello2"'
print(re.sub('"(.*?)":"(.*?)"', '\\1= \"\\2\"', s))
Output:
Name= "hello", Name2= "hello2"
Using pure Python, this is as simple as:
s = '"Name":"hello"'
print(s.replace(':', '= ').replace('"', '', 2))
# Name= "hello"
Im trying to remove multiple white-spaces in a string. I've read about regular expressions in python langauge and i've tried to make it match all white-sapces in the string, but no success. The return msg part returns empty:
CODE
import re
def correct(string):
msg = ""
fmatch = re.match(r'\s', string, re.I|re.L)
if fmatch:
msg = fmatch.group
return msg
print correct("This is very funny and cool.Indeed!")
To accomplish this task, you can instead replace consecutive whitespaces with a single space character, for example, using re.sub.
Example:
import re
def correct(string):
fmatch = re.sub(r'\s+', ' ', string)
return fmatch
print correct("This is very funny and cool.Indeed!")
The output will be:
This is very funny and cool.Indeed!
re.match matches only at the beginning of the string. You need to use re.search instead.
Maybe this code helps you?
import re
def correct(string):
return " ".join(re.split(' *', string))
One line no direct import
ss= "This is very funny and cool.Indeed!"
ss.replace(" ", " ")
#ss.replace(" ", " "*2)
#'This is very funny and cool.Indeed!'
Or, as the question states:
ss= "This is very funny and cool.Indeed!"
ss.replace(" ", "")
#'Thisisveryfunnyandcool.Indeed!'
I have many fill-in-the-blank sentences in strings,
e.g. "6d) We took no [pains] to hide it ."
How can I efficiently parse this string (in Python) to be
"We took no to hide it"?
I also would like to be able to store the word in brackets (e.g. "pains") in a list for use later. I think the regex module could be better than Python string operations like split().
This will give you all the words inside the brackets.
import re
s="6d) We took no [pains] to hide it ."
matches = re.findall('\[(.*?)\]', s)
Then you can run this to remove all bracketed words.
re.sub('\[(.*?)\]', '', s)
just for fun (to do the gather and substitution in one iteration)
matches = []
def subber(m):
matches.append(m.groups()[0])
return ""
new_text = re.sub("\[(.*?)\]",subber,s)
print new_text
print matches
import re
s = 'this is [test] string'
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
Output
'test'
For your example you could use this regex:
(.*\))(.+)\[(.+)\](.+)
You will get four groups that you can use to create your resulting string and save the 3. group for later use:
6d)
We took no
pains
to hide it .
I used .+ here because I don't know if your strings always look like your example. You can change the .+ to alphanumeric or sth. more special to your case.
import re
s = '6d) We took no [pains] to hide it .'
m = re.search(r"(.*\))(.+)\[(.+)\](.+)", s)
print(m.group(2) + m.group(4)) # "We took no to hide it ."
print(m.group(3)) # pains
import re
m = re.search(".*\) (.*)\[.*\] (.*)","6d) We took no [pains] to hide it .")
if m:
g = m.groups()
print g[0] + g[1]
Output :
We took no to hide it .
I want to remove all special characters from email such as '#', '.' and replace them with 'underscore'
there are some functions for it in python 'unidecode' but it does not full fill my requirement . can anyone suggest me some way so that I can find the above mention characters in a string and replace them with 'underscore'.
Thanks.
Why not use .replace() ?
eg.
a='testemail#email.com'
a.replace('#','_')
'testemail_email.com'
and to edit multiple you can probably do something like this
a='testemail#email.com'
replace=['#','.']
for i in replace:
a=a.replace(i,'_')
Take this as a guide:
import re
a = re.sub(u'[#]', '"', a)
SYNTAX:
re.sub(pattern, repl, string, max=0)
Great example from Python Cookbook 2nd edition
import string
def translator(frm='', to='', delete='', keep=None):
if len(to) == 1:
to = to * len(frm)
trans = string.maketrans(frm, to)
if keep is not None:
allchars = string.maketrans('', '')
delete = allchars.translate(allchars, keep.translate(allchars, delete))
def translate(s):
return s.translate(trans, delete)
return translate
remove_cruft = translator(frm="#-._", to="~")
print remove_cruft("me-and_you#gmail.com")
output:
me~and~you~gmail~com
A great string util to put in your toolkit.
All credit to the book
I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then using a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, Optional, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
for line in f:
try:
result = parser.parseString(line)
codes = [c[1] for c in result[1:-1]]
# Do something with teh codez...
except ParseException exc:
# Oh noes: string doesn't match!
continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']