Is there a simpler alternative (e.g. a single function call) to the following match-then-replace example?
>>> import re
>>>
>>> line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>>
>>> match = re.match(r'^file://(.*)$', line)
>>> if match and match.group(1):
...     substitution = re.sub(r'%20', r' ', match.group(1))
...
>>> substitution
'/windows-d/academic discipline/study objects/areas/formal systems/math'
Thanks.
I'm going to dodge your regex question and suggest you use something else for this:
>>> line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> import urllib
>>> urllib.unquote(line)
'file:///windows-d/academic discipline/study objects/areas/formal systems/math'
Then just strip off the file:// with a slice or str.replace if necessary.
%20 (space) is not the only escaped character possible here, so it's better to use the right tool for the job than have your regex solution break later when there is another character needing un-escaping.
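On Python 3 the same function lives in urllib.parse rather than at the top level of urllib. A minimal sketch of the decode-then-strip idea, assuming Python 3 and assuming that parsing the URL with urlparse is an acceptable way to drop the file:// part (file_url_to_path is a hypothetical helper name, not something from the question):

```python
from urllib.parse import unquote, urlparse


def file_url_to_path(url):
    """Return the percent-decoded path of a file:// URL (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError('not a file URL: %r' % (url,))
    # unquote decodes every %XX escape, not only %20
    return unquote(parsed.path)


line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
path = file_url_to_path(line)
print(path)  # /windows-d/academic discipline/study objects/areas/formal systems/math
```

Because unquote handles all percent-escapes, an input such as 'file:///a%2Bb' comes back as '/a+b' without any extra rules.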
You could try this simple Python code:
>>> import re
>>> line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> m = re.sub(r'%20|file://', r' ', line).strip()
>>> m
'/windows-d/academic discipline/study objects/areas/formal systems/math'
re.sub(r'%20|file://', r' ', line) replaces every occurrence of %20 or file:// with a space. The strip() call then removes the leading space left behind by file://, along with any other leading or trailing whitespace.
>>> import re
>>> s = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> re.sub(r'^file://(.*)$', lambda m: m.group(1).replace('%20', ' '), s)
'/windows-d/academic discipline/study objects/areas/formal systems/math'
>>> s = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> s.replace('file://', '').replace('%20', ' ')
'/windows-d/academic discipline/study objects/areas/formal systems/math'
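One caveat with chained str.replace is that it removes file:// wherever it occurs, not only at the start. A small sketch of an anchored variant, assuming only a leading file:// should be dropped (strip_file_scheme is a hypothetical helper name, and the sample string is made up for illustration):

```python
def strip_file_scheme(url, prefix='file://'):
    """Remove the prefix only when it is at the very start (hypothetical helper)."""
    if url.startswith(prefix):
        return url[len(prefix):]
    return url


# Made-up input in which 'file://' also appears later in the path
s = 'file:///backup/file://notes%20old'

anchored = strip_file_scheme(s).replace('%20', ' ')
unanchored = s.replace('file://', '').replace('%20', ' ')

print(anchored)    # /backup/file://notes old
print(unanchored)  # /backup/notes old
```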
I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.
I have a string that needs to be split by either a ';' or ', '.
That is, it has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched.
Example string:
"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
should be split into a list containing the following:
('b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]' , 'mesitylene [000108-67-8]', 'polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]')
Luckily, Python has this built-in :)
import re
re.split('; |, ', string_to_split)
Update: Following your comment:
>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']
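Applied to the example string from the question, the '; |, ' pattern gives the three requested pieces. This is only a sketch to show the result; string_to_split is the placeholder name already used above, and the string is copied from the question:

```python
import re

string_to_split = (
    "b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], "
    "mesitylene [000108-67-8]; "
    "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
)

# Split on "semicolon + space" or "comma + space"; bare commas are untouched
parts = re.split('; |, ', string_to_split)
for part in parts:
    print(part)
```

The commas inside "1,2-dihydro-2,2,4-" have no trailing space, so they do not trigger a split.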
Do a str.replace('; ', ', ') and then a str.split(', ')
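The replace-then-split suggestion could look like this; the sample text is made up for illustration:

```python
text = 'alpha, beta; gamma,delta'

# Normalise '; ' to ', ' first, then split on the single remaining delimiter
parts = text.replace('; ', ', ').split(', ')
print(parts)  # ['alpha', 'beta', 'gamma,delta']
```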
Here's a safe way for any iterable of delimiters, using regular expressions:
>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join(map(re.escape, delimiters))
>>> regex_pattern
'a|\\.\\.\\.|\\(c\\)'
>>> re.split(regex_pattern, example)
['st', 'ckoverflow ', ' is ', 'wesome', " isn't it?"]
re.escape lets you build the pattern automatically, with any regex metacharacters in the delimiters escaped.
Here's this solution as a function for your copy-pasting pleasure:
def split(delimiters, string, maxsplit=0):
    import re
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string, maxsplit=maxsplit)
If you're going to split often using the same delimiters, compile the regular expression once with re.compile and call the compiled pattern's split method (RegexObject.split).
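The compile-once advice can be sketched as follows. make_splitter is a hypothetical helper name; sorting the delimiters longest-first is an extra precaution added here (not part of the answer above) so that '; ' wins over ';' when one delimiter is a prefix of another:

```python
import re


def make_splitter(delimiters):
    """Compile the delimiters once and return the bound split method (hypothetical helper)."""
    # Longest first: regex alternation takes the first alternative that matches
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, ordered)))
    return pattern.split


split_fields = make_splitter([';', '; ', ', '])
fields = split_fields('a; b;c, d')
print(fields)  # ['a', 'b', 'c', 'd']
```

Without the sort, ';' would match before '; ' and the second field would come back as ' b' with a leading space.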
If you'd like to leave the original delimiters in the string, you can change the regex to use a lookbehind assertion instead (splitting on an empty match like this requires Python 3.7 or later):
>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
>>> regex_pattern
'(?<=a)|(?<=\\.\\.\\.)|(?<=\\(c\\))'
>>> re.split(regex_pattern, example)
['sta', 'ckoverflow (c)', ' is a', 'wesome...', " isn't it?"]
(Replace ?<= with ?= to attach the delimiters to the right-hand side instead of the left.)
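For completeness, here is the lookahead variant written out with the same delimiters and example string. It is a sketch that assumes Python 3.7 or later, since earlier versions do not split on empty matches:

```python
import re

delimiters = "a", "...", "(c)"
example = "stackoverflow (c) is awesome... isn't it?"

# (?=...) matches the empty string just before each delimiter,
# so every delimiter stays at the start of the following piece
lookahead_pattern = '|'.join('(?={})'.format(re.escape(d)) for d in delimiters)
parts = re.split(lookahead_pattern, example)
print(parts)  # ['st', 'ackoverflow ', '(c) is ', 'awesome', "... isn't it?"]
```

Since nothing is consumed by the split, ''.join(parts) gives back the original string.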
In response to Jonathan's answer above, this only seems to work for certain delimiters. For example:
>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']
>>> b='1999-05-03 10:37:00'
>>> re.split('- :', b)
['1999-05-03 10:37:00']
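The pattern '- :' matches only the literal three-character sequence hyphen, space, colon, which never occurs in b. Putting the delimiters in square brackets creates a character class that matches any one of them: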
>>> re.split('[- :]', b)
['1999', '05', '03', '10', '37', '00']
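The difference between the three spellings can be checked side by side. This sketch reuses the timestamp from above; the variable names are made up:

```python
import re

b = '1999-05-03 10:37:00'

literal = re.split(r'- :', b)          # the exact sequence "- :" (never present)
by_class = re.split(r'[- :]', b)       # any ONE of '-', ' ', ':'
by_alternation = re.split(r'-| |:', b) # same thing written with |

print(literal)         # ['1999-05-03 10:37:00']
print(by_class)        # ['1999', '05', '03', '10', '37', '00']
print(by_alternation)  # ['1999', '05', '03', '10', '37', '00']
```

Inside a character class the hyphen is placed first (or last) so that it is read as a literal character rather than as a range.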
This is what the regex looks like:
import re
# "semicolon or (a comma followed by a space)"
pattern = re.compile(r";|, ")
# "(semicolon or a comma) followed by a space"
pattern = re.compile(r"[;,] ")
print(pattern.split(text))
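The two readings of the requirement behave differently on a semicolon that is not followed by a space. A short sketch with a made-up input shows the difference:

```python
import re

text = 'a;b, c; d'

# "semicolon or (a comma followed by a space)"
loose = re.compile(r';|, ').split(text)

# "(semicolon or a comma) followed by a space"
strict = re.compile(r'[;,] ').split(text)

print(loose)   # ['a', 'b', 'c', ' d']
print(strict)  # ['a;b', 'c', 'd']
```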
I want to remove the [' characters from the start and the '] characters from the end of a string.
This is my text:
"['45453656565']"
I need to have this text:
"45453656565"
I've tried to use str.replace
text = text.replace("['","");
but it does not work.
You need to strip your text by passing the unwanted characters to the str.strip() method:
>>> s = "['45453656565']"
>>>
>>> s.strip("[']")
'45453656565'
Or, if you want to convert it to an integer, you can simply pass the stripped result to the int function:
>>> try:
...     val = int(s.strip("[']"))
... except ValueError:
...     print("Invalid string")
...
>>> val
45453656565
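Note that str.strip treats its argument as a set of characters, not as a prefix or suffix, so it keeps removing any of them from both ends. If only the exact [' and '] wrappers should go, a slice-based sketch like the following may be safer (remove_wrapper is a hypothetical helper name, and the second input is made up):

```python
def remove_wrapper(text, prefix="['", suffix="']"):
    """Drop one exact prefix and suffix if both are present (hypothetical helper)."""
    if (len(text) >= len(prefix) + len(suffix)
            and text.startswith(prefix) and text.endswith(suffix)):
        return text[len(prefix):len(text) - len(suffix)]
    return text


s = "['45453656565']"
t = "['[x]']"  # made-up input whose content also starts and ends with brackets

print(s.strip("[']"))     # 45453656565
print(remove_wrapper(s))  # 45453656565
print(t.strip("[']"))     # x      (strip also ate the inner brackets)
print(remove_wrapper(t))  # [x]
```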
Using re.sub:
>>> my_str = "['45453656565']"
>>> import re
>>> re.sub(r"['\]\[]","",my_str)
'45453656565'
You could loop over the characters, keeping only the digits:
>>> number_array = "['34325235235']"
>>> int(''.join(c for c in number_array if c.isdigit()))
34325235235
This solution works for both "['34325235235']" and '["34325235235"]', and for any other combination of digits and surrounding characters.
You can also use the re module and a regular expression to get it:
>>> import re
>>> theString = "['34325235235']"
>>> int(re.sub(r'\D', '', theString))  # Optionally parse to int
34325235235
Instead of hacking your data by stripping brackets, you should edit the script that created it to print out just the numbers. E.g., instead of lazily doing
output.write(str(mylist))
you can write
for elt in mylist:
    output.write(elt + "\n")
Then when you read your data back in, it'll contain the numbers (as strings) without any quotes, commas or brackets.
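If the producing script cannot be changed and the text really is the str() of a Python list, one option is to parse it back instead of stripping characters. This is a sketch under that assumption, using ast.literal_eval from the standard library:

```python
import ast

raw = "['45453656565']"

# literal_eval safely parses Python literals (lists, strings, numbers, ...)
items = ast.literal_eval(raw)
numbers = [int(item) for item in items]

print(items)    # ['45453656565']
print(numbers)  # [45453656565]
```

This also copes with lists that hold more than one element, which character stripping does not.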
I have a string in Python:
text = '(b)'
I want to extract the 'b'. I could strip the first and last characters of the string, but I won't do that because the text may be '(a)', '(iii)', 'i)', '(1' or '(2)'. Sometimes there are no parentheses at all, but the string will always contain an alphanumeric value, and that value is what I want to retrieve.
This has to be accomplished in one line of code, or in a block of code that returns just the value, because it will be used repeatedly in many situations.
What is the best way to do that in Python?
I don't think a regex is needed here. You can just strip off any parentheses with str.strip:
>>> text = '(b)'
>>> text.strip('()')
'b'
>>> text = '(iii)'
>>> text.strip('()')
'iii'
>>> text = 'i)'
>>> text.strip('()')
'i'
>>> text = '(1'
>>> text.strip('()')
'1'
>>> text = '(2)'
>>> text.strip('()')
'2'
>>> text = 'a'
>>> text.strip('()')
'a'
>>>
Regarding @MikeMcKerns' comment, a more robust solution would be to pass string.punctuation to str.strip:
>>> from string import punctuation
>>> punctuation # Just to demonstrate
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>
>>> text = '*(ab2**)'
>>> text.strip(punctuation)
'ab2'
>>>
You could do this with Python's re module:
>>> import re
>>> text = '(5a)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'5a'
>>> text = '*(ab2**)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'ab2'
Not fancy, but this is pretty generic
>>> import string
>>> ''.join(i for i in text if i in string.ascii_letters+'0123456789')
This works for all sorts of combinations of parentheses, including in the middle of the string, and also when other non-alphanumeric characters (aside from the parentheses) are present.
re.match(r'\(?([a-zA-Z0-9]+)', text).group(1)
For the example inputs you provided, it would be:
>>> import re
>>> a=['(a)', '(iii)', 'i)', '(1' , '(2)']
>>> [ re.match(r'\(?([a-zA-Z0-9]+)', text).group(1) for text in a ]
['a', 'iii', 'i', '1', '2']
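Since the value is needed repeatedly, the regex can be wrapped in a small function that also copes with input containing no alphanumeric characters, where calling .group() directly on the match would raise AttributeError. first_alnum is a hypothetical name, and returning a default on no match is an assumption about the desired behaviour:

```python
import re

_ALNUM = re.compile(r'[0-9A-Za-z]+')


def first_alnum(text, default=None):
    """Return the first run of ASCII letters/digits, or default if there is none."""
    match = _ALNUM.search(text)
    return match.group(0) if match else default


labels = ['(a)', '(iii)', 'i)', '(1', '(2)', 'b', '()']
values = [first_alnum(label) for label in labels]
print(values)  # ['a', 'iii', 'i', '1', '2', 'b', None]
```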
I need to dump some HTTP data as a string from an HTTP packet that I have in string format. I am trying to use the regular expression below to match 'data:' and everything after it, but it's not working. I am new to regex and Python.
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.match(pat, string)
>>> print(res)
None
re.match matches only at the beginning of the string. Use re.search to match at any position. (See search() vs. match())
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> res
<_sre.SRE_Match object at 0x0000000002838100>
>>> res.group()
'p'
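To match everything after data:, replace \w with .* and remove the /b sequences. (The word-boundary escape is \b, not /b, so /b matches a literal slash followed by b.)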
>>> import re
>>> pat = re.compile(r'(?:data:).*$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> print(res.group())
data: ndknfdjoj pop
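A variation on the corrected pattern, offered as a sketch: a \b in front of data keeps words such as "metadata:" from matching, and re.DOTALL lets .* run across newlines, on the assumption that the payload may span several lines. extract_data and DATA_RE are hypothetical names:

```python
import re

# \b: 'data' must start at a word boundary; DOTALL: '.' also matches newlines
DATA_RE = re.compile(r'\bdata:.*', re.DOTALL)


def extract_data(packet):
    """Return 'data:' and everything after it, or None if it is absent."""
    match = DATA_RE.search(packet)
    return match.group(0) if match else None


print(extract_data(" dnfhndkn data: ndknfdjoj pop"))  # data: ndknfdjoj pop
print(extract_data("metadata: x"))                    # None
```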
No need for a regular expression here. You can just slice the string:
>>> string
' dnfhndkn data: ndknfdjoj pop'
>>> string.index('data')
10
>>> string[string.index('data'):]
'data: ndknfdjoj pop'
str.index('data') returns the point in the string where the substring data is found. The slice from this position to the end string[10:] gives you the part of the string you are interested in.
By the way, string is a potentially problematic variable name if you are planning on using the string module at any point...
You can just do:
string.split("data:")[1]
assuming "data:" appears only once in each string
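When that assumption might not hold, str.partition is a close relative of split that avoids an IndexError if the marker is missing and keeps everything after the first occurrence. A sketch, with after_marker as a hypothetical helper name:

```python
def after_marker(text, marker="data:"):
    """Return the text after the first marker, or None if the marker is absent."""
    head, sep, tail = text.partition(marker)
    return tail if sep else None


print(repr(after_marker(" dnfhndkn data: ndknfdjoj pop")))  # ' ndknfdjoj pop'
print(after_marker("no marker here"))                       # None
```

Unlike the regex and slicing answers, this returns only what follows "data:", not the marker itself.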
I need to replace the space between two numbers with a comma:
15.30 396.90 => 15.30,396.90
In PHP this is used:
'/(?<=\d)\s+(?=\d)/', ','
How to do it in Python?
There are several ways to do it (sorry, Zen of Python). Which one to use depends on your input:
>>> s = "15.30 396.90"
>>> ",".join(s.split())
'15.30,396.90'
>>> s.replace(" ", ",")
'15.30,396.90'
or, using re, for example, this way:
>>> import re
>>> re.sub(r"(\d+)\s+(\d+)", r"\1,\2", s)
'15.30,396.90'
You can use the same regex with the re module in Python:
import re
s = '15.30 396.90'
s = re.sub(r'(?<=\d)\s+(?=\d)', ',', s)
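The lookaround version and the capture-group version agree on the two-number example but differ once there are three or more numbers in a row, because the groups consume the digits they match. A short sketch with a made-up row illustrates this (join_numbers is a hypothetical helper name):

```python
import re


def join_numbers(text):
    """Replace whitespace that sits between two digits with a comma."""
    return re.sub(r'(?<=\d)\s+(?=\d)', ',', text)


row = '1 2 3'
lookaround = join_numbers(row)
grouped = re.sub(r'(\d+)\s+(\d+)', r'\1,\2', row)

print(lookaround)  # 1,2,3
print(grouped)     # 1,2 3   (the '2' was consumed by the first match)
```

Whitespace next to non-digits is left alone by the lookaround pattern, so '15.30 kg 396.90' comes back unchanged.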