Most pythonic way to delete text between two delimiters - python

I'm trying to remove wiki formatting from some text so it can be parsed.
What is the most pythonic way to remove two delimiters ('[[' and ']]') and all the text between them? The given string will contain multiple occurrences of delimiter pairs.

Regular expressions are a good match for your problem.
>>> import re
>>> input_str = 'foo [[bar]] baz [[etc.]]'
To remove the whole [[...]], which I think is what you are asking about:
>>> re.sub(r'\[\[.*?\]\]', '', input_str)
'foo  baz '
To keep the contents of the [[...]] and drop only the delimiters:
>>> re.sub(r'\[\[(.*?)\]\]', r'\1', input_str)
'foo bar baz etc.'
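The two substitutions above can be wrapped in one helper. This is only a minimal sketch, not part of the original answer: the names strip_wiki_links and keep_contents are made up for illustration, and re.DOTALL is added on the assumption that a [[...]] pair might span a line break.

```python
import re

# Non-greedy match so each [[...]] pair is handled separately.
# DOTALL lets a pair span a newline (an assumption about the input).
_WIKI_LINK = re.compile(r'\[\[(.*?)\]\]', re.DOTALL)

def strip_wiki_links(text, keep_contents=False):
    """Remove [[...]] pairs, optionally keeping the text between them."""
    replacement = r'\1' if keep_contents else ''
    return _WIKI_LINK.sub(replacement, text)

input_str = 'foo [[bar]] baz [[etc.]]'
print(repr(strip_wiki_links(input_str)))
print(repr(strip_wiki_links(input_str, keep_contents=True)))
```

Removing a whole pair leaves the spaces on both sides of it in place, so you may want to collapse whitespace afterwards, for example with ' '.join(result.split()).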

Related

how to replace all hyphenated words

I am trying to replace all the hyphenated words in a string with their separated versions. I am able to detect hyphenated words, but I could not replace them with separated versions. How can I do that?
Here is an example with sample code:
import re
text = "one-hundered-and-three- some text foo-bar some--text"
re.findall(r'\w+(?:-\w+)+',text)
# returns: ['one-hundered-and-three', 'foo-bar']
# I want to modify text as follows:
# text_new = "one hundered and three- some text foo bar some--text"
Use re.sub() with a positive lookbehind and lookahead, so that only a hyphen with a word character on both sides is replaced:
import re
text = "one-hundered-and-three- some text foo-bar some--text"
print(re.sub(r'(?<=\w)-(?=\w)', ' ', text))
# one hundered and three- some text foo bar some--text
You can use a really simple pattern:
\b-\b
\b Word boundary.
- Hyphen.
\b Word boundary.
Regex demo here.
Python demo:
import re
text = "one-hundered-and-three- some text foo-bar some--text"
print(re.sub(r'\b-\b', ' ', text))
Prints:
one hundered and three- some text foo bar some--text
You could use re.sub() with a function for the repl argument:
In [12]: re.sub(r'\w+(?:-\w+)+', lambda match: match.group(0).replace('-', ' '), text)
Out[12]: 'one hundered and three- some text foo bar some--text'
I wrote it as a one-liner here, but I think it would be clearer if the lambda were moved out into a named function.
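For what it's worth, here is roughly what the named-function version could look like. The names HYPHENATED_WORD and split_hyphens are just illustrative choices, not anything from the answer above.

```python
import re

# Same pattern as the one-liner: a word followed by one or more "-word" parts.
HYPHENATED_WORD = re.compile(r'\w+(?:-\w+)+')

def split_hyphens(match):
    """Return the matched hyphenated word with its hyphens turned into spaces."""
    return match.group(0).replace('-', ' ')

text = "one-hundered-and-three- some text foo-bar some--text"
text_new = HYPHENATED_WORD.sub(split_hyphens, text)
print(text_new)
```

The trailing hyphen in "three-" and the double hyphen in "some--text" are left alone because neither is part of a match.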

Python regex: Replace individual characters in a match

I'm trying to substitute every individual character found in a regex match but I can't seem to make it work.
I have a string containing parenthesized expressions that have to be replaced.
For example, foo bar (baz) should become foo bar (***)
Here is what I came up with: re.sub(r"(\(.*?).(.*?\))", r"\1*\2", "foo bar (baz)") Unfortunately, I can't seem to apply the substitution to every character between the parentheses. Is there any way to make this work?
How about something like this?
>>> import re
>>> s = 'foo bar (baz)'
>>> re.sub(r'(?<=\().*?(?=\))', lambda m: '*'*len(m.group()), s)
'foo bar (***)'
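If the string can contain several parenthesized parts, a variant that captures the contents instead of using lookarounds may be easier to read. This is a sketch under the assumption that parentheses are not nested; mask_parenthesized and mask_char are hypothetical names.

```python
import re

def mask_parenthesized(text, mask_char='*'):
    """Replace every character inside (...) with mask_char."""
    def _mask(match):
        # group(1) is the text between the parentheses.
        return '(' + mask_char * len(match.group(1)) + ')'
    # [^()]* keeps each match inside a single pair of parentheses.
    return re.sub(r'\(([^()]*)\)', _mask, text)

print(mask_parenthesized('foo bar (baz)'))
print(mask_parenthesized('a (bc) d (efg)'))
```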

Split a string of a specific pattern into three parts

I am given a string which is of this pattern:
[blah blah blah] [more blah] some text
I want to split the string into three parts: blah blah blah, more blah and some text.
A crude way to do it is to use mystr.split('] ') and then remove the leading [ from the first two elements. Is there a better, more performant way? (I need to do this for thousands of strings very quickly.)
You can use a regular expression to extract the text, if you know that it will be in that form. For efficiency, you can precompile the regex and then repeatedly use it when matching.
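import re

# Precompile once; the raw string keeps the backslashes literal.
prog = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')
for mystr in string_list:
    result = prog.match(mystr)
    groups = result.groups()  # (first, second, rest)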
If you'd like an explanation on the regex itself, you can get one using this tool.
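One thing to watch for: prog.match() returns None for a string that is not in the expected form, and calling .groups() on that raises an AttributeError. A small wrapper can guard against it. This is just a sketch; split_three and BRACKETED are illustrative names, and returning None for malformed input is an assumption about what you would want.

```python
import re

# Same pattern as above, compiled once at module level.
BRACKETED = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')

def split_three(line):
    """Return (first, second, rest) or None if the line does not fit the pattern."""
    match = BRACKETED.match(line)
    if match is None:
        return None
    return match.groups()

print(split_three('[blah blah blah] [more blah] some text'))
print(split_three('not in the expected form'))
```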
You can use a regular expression to split where you want to leave out characters:
>>> import re
>>> s = '[...] [...] ...'
>>> re.split(r'\[|\] *\[?', s)[1:]
['...', '...', '...']

Python regex, searching for prefixes inside a target string

I need to find a list of prefixes of words inside a target string (I would like to have the list of matching indexes in the target string handled as an array).
I think using regex should be the cleanest way.
Given that I am looking for the pattern "foo", I would like to retrieve in the target string words like "foo", "Foo", "fooing", "Fooing"
Given that I am looking for the pattern "foo bar", I would like to retrieve in the target string patterns like "foo bar", "Foo bar", "foo Bar", "foo baring" (they are still all handled as prefixes, am I right?)
At the moment, after running it in different scenarios, my Python code still does not work.
I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix).
I am assuming I have to use something like ^[fF] to be case insensitive with the first letter of my prefix.
I am assuming I should use something like ".*" to let the regexp behave like a prefix.
I am assuming I should use something like prefix1|prefix2|prefix3 to combine many different prefixes with a logical OR in the pattern to search.
The following source code does not work because I am wrongly setting the txt_pattern.
import re
# ' ' ' ' ' ' '
txt_str = "edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo";
txt_pattern = ''#???
out_obj = re.match(txt_pattern,txt_str)
if out_obj:
    print("match!")
else:
    print("No match!")
What am I missing?
How should I set the txt_pattern?
Can you please suggest a good tutorial with minimal working examples? At the moment the standard tutorials from the first page of a Google search are very long and detailed, and not so simple to understand.
Thanks!
Regex is the wrong approach. First parse your string into a list of strings with one word per item. Then use a list comprehension with a filter. The split method on strings is a good way to get the list of words, then you can simply do [item for item in wordlist if item.startswith("foo")]
People spend ages hacking up inefficient code using convoluted regexes when all they need is a few string methods like split, partition, startswith and some pythonic list comprehensions or generators.
Regexes have their uses but simple string parsing is not one of them.
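A minimal sketch of that string-method approach follows. The function name words_with_prefix is made up, and lowercasing both sides is an assumption to cover the "foo"/"Foo" case from the question. Note that it only handles single-word prefixes and returns the words, not their indexes in the target string.

```python
def words_with_prefix(text, prefix):
    """Return the whitespace-separated words that start with prefix, ignoring case."""
    prefix = prefix.lower()
    return [word for word in text.split() if word.lower().startswith(prefix)]

txt_str = ("edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng "
           "sxk foo baring sh foo Bar djw Foo")
print(words_with_prefix(txt_str, "foo"))
```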
You wrote: "I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix)."
No, the ^ is an anchor that only matches the start of the string. You can use \b instead, meaning a word boundary (but remember to escape the backslash inside a string literal, or use a raw string literal).
You will also have to use re.search instead of re.match because the latter only checks the start of the string, whereas the former searches for matches anywhere in the string.
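To make the difference concrete, here is a short sketch using a shortened version of the string from the question. The pattern \b[Ff]oo\w* is only one possible choice, and the variable names are illustrative.

```python
import re

txt_str = "edb foooooo jkds Fooooooo kj fooing"
pattern = re.compile(r'\b[Ff]oo\w*')

# match() only tries at position 0, where the text is "edb", so it fails.
print(pattern.match(txt_str))

# search() scans forward and returns the first match anywhere in the string.
first = pattern.search(txt_str)
print(first.group(0), first.start())

# finditer() yields every match, which gives the list of start indexes.
starts = [m.start() for m in pattern.finditer(txt_str)]
print(starts)
```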
>>> import re
>>> s = 'Foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo'
>>> regex = r'(?i)((foo)(\w+)?)'
>>> compiled = re.compile(regex)
>>> compiled.findall(s)
[('Foooooo', 'Foo', 'oooo'), ('Fooooooo', 'Foo', 'ooooo'), ('fooing', 'foo', 'ing'), ('Fooing', 'Foo', 'ing'), ('foo', 'foo', ''), ('foo', 'foo', ''), ('Foo', 'Foo', '')]
(?i) -> case insensitive (Python 3.11 and later reject an inline flag that is not at the start of the pattern)
(foo) -> group 2 matches foo
(\w+)? -> group 3 matches any word characters that follow
>>> print([i[0] for i in compiled.findall(s)])
['Foooooo', 'Fooooooo', 'fooing', 'Fooing', 'foo', 'foo', 'Foo']
Try using this tool to test some stuff: http://www.pythonregex.com/
Use this reference: docs.python.org/howto/regex.html
I would use something like this for your regex:
\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*
Inside of the non-capturing group you should separate each prefix with a |. I also placed each prefix inside of its own capturing group so you can tell which prefix a given string matched, for example:
for match in re.finditer(r'\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*', txt_str):
    # Find the first capturing group that took part in this match.
    n = 1
    while not match.group(n):
        n += 1
    print("Prefix %d matched '%s'" % (n, match.group(0)))
Output:
Prefix 2 matched 'foooooo'
Prefix 2 matched 'Fooooooo'
Prefix 2 matched 'fooing'
Prefix 2 matched 'Fooing'
Prefix 1 matched 'foo baring'
Prefix 1 matched 'foo Bar'
Prefix 2 matched 'Foo'
Make sure you put longer prefixes first. If you were to put the foo prefix before the foo bar prefix, you would only match 'foo' in 'foo bar'.
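If the list of prefixes is not fixed, the alternation can be built in code with the longest prefixes first. The sketch below is an illustration, not the pattern from this answer: build_prefix_pattern is a made-up name, and re.IGNORECASE ignores case for the whole prefix rather than only the first letter of each word, which is a simplification.

```python
import re

def build_prefix_pattern(prefixes):
    """Compile a pattern matching any of the prefixes at the start of a word."""
    # Longest first, so "foo bar" is tried before "foo".
    ordered = sorted(prefixes, key=len, reverse=True)
    # re.escape protects any regex metacharacters inside a prefix.
    alternation = '|'.join(re.escape(p) for p in ordered)
    return re.compile(r'\b(?:%s)\w*' % alternation, re.IGNORECASE)

txt_str = ("edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng "
           "sxk foo baring sh foo Bar djw Foo")
pattern = build_prefix_pattern(["foo", "foo bar"])
for match in pattern.finditer(txt_str):
    print(match.start(), match.group(0))
```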

Python regular expression to match either a quoted or unquoted string

I am trying to write a regular expression in Python that will match either a quoted string with spaces or an unquoted string without spaces. For example given the string term:foo the result would be foo and given the string term:"foo bar" the result would be foo bar. So far I've come up with the following regular expression:
r = re.compile(r'''term:([^ "]+)|term:"([^"]+)"''')
The problem is that the match can come in either group(1) or group(2) so I have to do something like this:
m = r.match(search_string)
term = m.group(1) or m.group(2)
Is there a way I can do this all in one step?
Avoid grouping, and instead use lookahead/lookbehind assertions to eliminate the parts that are not needed:
s = 'term:foo term:"foo bar" term:bar foo term:"foo term:'
re.findall(r'(?<=term:)[^" ]+|(?<=term:")[^"]+(?=")', s)
Gives:
['foo', 'foo bar', 'bar']
It doesn't seem that you really want re.match here. Your regex is almost right, but you're grouping too much. How about this?
>>> s
('xyz term:abc 123 foo', 'foo term:"abc 123 "foo')
>>> re.findall(r'term:([^ "]+|"[^"]+")', '\n'.join(s))
['abc', '"abc 123 "']
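Note that with this pattern a quoted term comes back with its quotes still attached. If that is not wanted, one option is to strip them afterwards. This is a sketch only: extract_terms and TERM_RE are hypothetical names, and the strip is a small post-processing step rather than a pure single-regex solution.

```python
import re

# One capturing group: either an unquoted run or a whole quoted string.
TERM_RE = re.compile(r'term:([^ "]+|"[^"]+")')

def extract_terms(text):
    """Return every term value in text, with surrounding double quotes removed."""
    return [raw.strip('"') for raw in TERM_RE.findall(text)]

print(extract_terms('term:foo'))
print(extract_terms('xyz term:"foo bar" 123'))
```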
