I have the following sample text:
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\item Item 2
\end{itemize}
\end{document}'''
What I'm trying to accomplish is to create a regular expression that will extract the following lines from mystr:
['This is introduction paragraph','This is non-introduction paragraph',' This is sample section paragraph\n \begin{itemize}\n\item Item 1\n\item Item 2\n\end{itemize}']
If for any reason you need to use a regex, for instance because the splitting string is more involved than just "a", the re module has a split function too:
import re
str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"
print(re.split(r'\s?\ba\b\s?',str_))
# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
EDIT: expanded answer with the new information you provided...
After your edit, which describes the problem better and includes text that looks like LaTeX, I think you need to extract the lines that do not start with a \ (those are the LaTeX commands). In other words, you need the lines that contain only text. Try the following, still using regular expressions:
import re
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\end{document}'''
pattern = r"^[^\\]*\n"
matches = re.findall(pattern, mystr, flags=re.M)
print(matches)
# ['This is introduction paragraph\n', 'This is non-introduction paragraph\n', 'This is sample section paragraph\n']
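The pattern above keeps only lines without a backslash, so it would drop the itemize block that the question also lists in its third expected result. A minimal sketch of one alternative, assuming every paragraph sits between a \section{...} line and either the next \section or \end{document} (the helper name section_bodies is made up for illustration):

```python
import re

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\item Item 2
\end{itemize}
\end{document}'''

def section_bodies(text):
    """Return the text that follows each section heading."""
    # Lazily capture everything after the heading line, stopping at the
    # newline that precedes the next heading or the end of the document.
    pattern = r"\\section\{[^}]*\}\n(.*?)\n(?=\\section\{|\\end\{document\})"
    return re.findall(pattern, text, flags=re.S)

bodies = section_bodies(mystr)
for body in bodies:
    print(repr(body))
```

With re.S the dot also matches newlines, which is what lets the third result span the whole itemize environment.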
You can use the split method from str:
my_string = "a quick brown fox jumps over a lazy dog than a quick elephant"
word = "a "
my_string.split(word)
Results in:
['', 'quick brown fox jumps over ', 'lazy dog than ', 'quick elephant']
Note: Don't use str as a variable name, as it shadows Python's built-in str type.
I want Python to remove only some punctuation from a string, let's say I want to remove all the punctuation except '#'
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Here the output is
The quick brown fox like totally jumped man
But what I want is something like this
The quick brown fox like totally jumped #man
Is there a way to selectively remove punctuation from a text leaving out the punctuation that we want in the text intact?
string.punctuation contains all the punctuation characters. Remove # from it, then replace each remaining punctuation character in the text with ''.
>>> import re
>>> import string
>>> a = string.punctuation.replace('#','')
>>> re.sub(r'[{}]'.format(a),'','The quick brown fox, like, totally jumped, #man!')
'The quick brown fox like totally jumped #man'
Just remove the character you don't want to touch from the replacement string:
import string
remove = dict.fromkeys(map(ord, '\n' + string.punctuation.replace('#','')))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Also note that I changed '\n ' to '\n', as the former will remove spaces from your string.
Result:
The quick brown fox like totally jumped #man
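The same idea can be wrapped in a small helper so the set of characters to keep becomes a parameter. This is only a sketch for Python 3, where str.maketrans is available; the function name strip_punctuation and its keep parameter are illustrative, not part of any library:

```python
import string

def strip_punctuation(text, keep="#"):
    """Remove ASCII punctuation from text, except the characters in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # With three arguments, str.maketrans maps every character of the
    # third argument to None, so translate() deletes it.
    table = str.maketrans("", "", to_remove)
    return text.translate(table)

sample = 'The quick brown fox, like, totally jumped, #man!'
print(strip_punctuation(sample))             # The quick brown fox like totally jumped #man
print(strip_punctuation(sample, keep="#!"))  # The quick brown fox like totally jumped #man!
```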
I am streaming plain text records via MapReduce and need to check each plain text record for 2 or more consecutive punctuation symbols. The 12 symbols I need to check for are: -/\()!"+,'&..
I have tried translating this punctuation list into an array like this:
punctuation = [r'-', r'/', r'\\', r'\(', r'\)', r'!', r'"', r'\+', r',', r"'", r'&', r'\.']
I can find individual characters with nested for loops, for example:
import re

for t in test_cases:
    print(t)
    for p in punctuation:
        print(p)
        if re.search(p, t):
            print('found a match! {} {}'.format(p, t))
        else:
            print('no match')
However, the single backslash character is not found when I test this and I don't know how to get only results that are 2 or more consecutive occurrences in a row. I've read that I need to use the + symbol, but don't know the correct syntax to use this.
Here are some test cases:
The quick '''brown fox
The &&quick brown fox
The quick\brown fox
The quick\\brown fox
The -quick brown// fox
The quick--brown fox
The (quick brown) fox,,,
The quick ++brown fox
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox
The quick,, brown fox
The quick brown fox...
The quick-brown fox
The ((quick brown fox
The quick brown)) fox
The quick brown fox!!!
The 'quick' brown fox
Which, when translated into a Python list, looks like this:
test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
How do I use Python regex to identify and report all matches where the punctuation symbol appears 2 or more times in a row?
The punctuation characters can be put into a character class in square brackets.
Then it depends on whether a run of two or more punctuation characters may consist of any of the punctuation characters or must repeat the same character.
In the first case, curly braces can be appended to specify the minimum (2) and maximum number of repetitions. The maximum is unbounded here, so it is left empty:
[...]{2,} # min. 2 or more
If only repetitions of the same character need to be found, the first matched punctuation character is put into a group, and a back reference to that group (= the same character) follows one or more times:
([...])\1+
The back reference \1 refers to the first group in the expression. Groups are numbered from left to right by their opening parentheses. In the code below the whole run is wrapped in an additional outer group, so the character group is number 2 and the back reference becomes \2.
The next issue is escaping. There are escaping rules for Python strings, and additional escaping is needed in the regular expression. The character class does not require much escaping, but the backslash must be doubled. Thus the following example quadruples the backslash: one doubling because of the string, the second because of the regular expression.
Raw strings r'...' are useful for patterns, but here both the single and double quotation marks are needed inside the pattern.
>>> import re
>>> test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
        match = pattern_same_punctuation.search(t)
        if match:
            print("{:24} => {}".format(t, match.group(1)))
        else:
            print(t)
The quick '''brown fox => '''
The &&quick brown fox => &&
The quick\brown fox
The quick\\brown fox => \\
The -quick brown// fox => //
The quick--brown fox => --
The (quick brown) fox,,, => ,,,
The quick ++brown fox => ++
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox => ""
The quick,, brown fox => ,,
The quick brown fox... => ...
The quick-brown fox
The ((quick brown fox => ((
The quick brown)) fox => ))
The quick brown fox!!! => !!!
The 'quick' brown fox
>>>
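To make the difference between the two patterns above concrete, here is a small sketch that runs both on one string. It assumes a raw triple-quoted string for the patterns, which is a variation on the answer's code rather than what the answer itself uses; it allows both quote characters to appear unescaped, so the backslash is only doubled once.

```python
import re

# A raw triple-quoted string lets both ' and " appear unescaped, so the
# backslash needs only the regex-level doubling (\\ matches one backslash).
any_punct = re.compile(r'''[-/\\()!"+,&'.]{2,}''')
same_punct = re.compile(r'''([-/\\()!"+,&'.])\1+''')

text = 'The (quick brown) fox,,, said "no!"'

# Any mix of the listed symbols, two or more in a row.
print(any_punct.findall(text))                          # [',,,', '!"']

# Only the same symbol repeated; finditer is used because findall would
# return just the captured single character.
print([m.group() for m in same_punct.finditer(text)])   # [',,,']
```

Because there is no outer group here, the back reference is \1 rather than \2.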
You can use {2} in a regular expression to match two consecutive occurrences of a character class:
>>> regex = re.compile(r'[-/()!"+,\'&]{2}')
>>> [s for s in test_cases if regex.search(s)]
["The quick '''brown fox",
'The &&quick brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!']
What about a regular expression? That could also help to find 2 or more consecutive punctuation symbols.
A regex like /([\\\-\/\(\)!"+,'&]{2,})/g
{2,} stands for two or more
/g is the global flag of the regex-literal notation (as in JavaScript): don't stop on the first match
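Python's re module has no global flag; re.findall and re.finditer already return every non-overlapping match. A minimal sketch of the Python equivalent, where the helper name punctuation_runs is hypothetical and the period has been added to the character class as an assumption:

```python
import re

runs = re.compile(r'''[-/\\()!"+,&'.]{2,}''')

def punctuation_runs(text):
    """Return (start_index, run) for every run of 2+ punctuation symbols."""
    return [(m.start(), m.group()) for m in runs.finditer(text)]

print(punctuation_runs('The -quick brown// fox'))   # [(16, '//')]
print(punctuation_runs('The ""quick"" brown fox'))  # [(4, '""'), (11, '""')]
print(punctuation_runs('The quick-brown fox'))      # []
```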
Thanks to @Heiko Oberdiek, here is the exact code I am using that solves the problem (I added . to the punctuation list):
import re
punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
x = 1
for t in test_cases:
    match = punctuation.search(t)
    if match:
        print("{0:2} {1:24} => {2}".format(x, t, match.group(1)))
        x += 1
This accurately covers all of my test cases:
1 The quick '''brown fox => '''
2 The &&quick brown fox => &&
3 The quick\\brown fox => \\
4 The -quick brown// fox => //
5 The quick--brown fox => --
6 The (quick brown) fox,,, => ,,,
7 The quick ++brown fox => ++
8 The ""quick"" brown fox => ""
9 The quick,, brown fox => ,,
10 The quick brown fox... => ...
11 The ((quick brown fox => ((
12 The quick brown)) fox => ))
13 The quick brown fox!!! => !!!
I am very new to Python and I am having trouble matching optional strings when there can be any number of strings between the groups. Here is an example of what I am looking for:
'The quick brown fox jumps over the lazy dog'
I want the word following 'brown' and if the word 'lazy' is present I want the word following it as well, i.e:
'The quick brown fox jumps over the lazy dog' --> ('fox', 'dog')
'The quick brown fox' --> ('fox', '')
'The quick brown fox dfjdnjcnjdn vvvv lazy mouse' --> ('fox', 'mouse')
'The quick brown fox lazy dog' --> ('fox', 'dog')
Here is what I tried, but it is not working
re.findall(r'brown (\S+)(.*?)(lazy )?(\S+)?', str)
What am I doing wrong, and how can I fix this?
You could use the following to get the words you're looking for:
brown (\S+)(?:.*lazy (\S+))?
Which would give a list of tuples, with the empty string if lazy is not present.
>>> import re
>>> s = """The quick brown fox jumps over the lazy dog
... The quick brown fox
... The quick brown fox dfjdnjcnjdn vvvv lazy mouse
... The quick brown fox lazy dog"""
>>> re.findall(r'brown (\S+)(?:.*lazy (\S+))?', s)
[('fox', 'dog'), ('fox', ''), ('fox', 'mouse'), ('fox', 'dog')]
>>>
(?: ... ) is used to make groups that won't get captured, so what's inside won't necessarily get into the tuple/list with re.findall unless it is itself within capture group(s).
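For a single sentence, the same pattern can be used with re.search. The sketch below (the helper name words_after is made up for illustration) shows that only the two capturing groups are reported, while the non-capturing (?: ... ) wrapper contributes nothing of its own:

```python
import re

PATTERN = re.compile(r'brown (\S+)(?:.*lazy (\S+))?')

def words_after(sentence):
    """Return (word after 'brown', word after 'lazy' or '') for one sentence."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    # An optional group that did not take part in the match is None;
    # map it to '' to mirror what re.findall reports.
    return tuple(g or '' for g in match.groups())

print(words_after('The quick brown fox jumps over the lazy dog'))  # ('fox', 'dog')
print(words_after('The quick brown fox'))                          # ('fox', '')
print(words_after('A quick red fox'))                              # None
```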
You can use a pattern such as:
(?:brown|lazy)\s(\S+)
Below is a breakdown of what it matches:
(?:brown|lazy) # The words 'brown' or 'lazy'
\s # A whitespace character
(\S+) # One or more non-whitespace characters
And here is a demonstration:
>>> import re
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox jumps over the lazy dog')
['fox', 'dog']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox')
['fox']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox dfjdnjcnjdn vvvv lazy mouse')
['fox', 'mouse']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox lazy dog')
['fox', 'dog']
>>>
I'm trying to replace spaces with hyphens one at a time in each possible position in Python. For example, "the man said hi" should produce a list of all the possible hyphen positions, including multiple hyphens:
the-man said hi
the man-said hi
the man said-hi
the-man said-hi
the-man-said hi
the man-said-hi
the-man-said-hi
The length of the strings varies in number of spaces, so it can't be a fix for just 3 spaces. I've been experimenting with re.search and re.sub in a while loop, but haven't found a nice way yet.
Use itertools.product() to produce all space-and-dash combinations, then recombine your string with those:
from itertools import product
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in product(' -', repeat=len(words) - 1):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
The last line zips the words together with the dashes and spaces (adding in an empty string at the end to make up the pairs), then flattens that and joins them into a single string.
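To see that zip-and-flatten step in isolation, here is a short sketch that walks through it for one hand-picked separator tuple (the variable names pairs and flat are only for illustration):

```python
words = ['the', 'man', 'said', 'hi']
combo = ('-', ' ', '-')   # one tuple that product(' -', repeat=3) would yield

# Pad the separators with '' so that every word has a partner.
pairs = list(zip(words, combo + ('',)))
print(pairs)   # [('the', '-'), ('man', ' '), ('said', '-'), ('hi', '')]

# Flatten the pairs and join everything into one string.
flat = [piece for pair in pairs for piece in pair]
print(''.join(flat))   # the-man said-hi
```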
Demo:
>>> for combo in dashed_combos('the man said hi'):
...     print(combo)
...
the man said hi
the man said-hi
the man-said hi
the man-said-hi
the-man said hi
the-man said-hi
the-man-said hi
the-man-said-hi
You can always skip the first iteration of that loop (with only spaces) with itertools.islice():
from itertools import product, islice
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in islice(product(' -', repeat=len(words) - 1), 1, None):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
All this is extremely memory efficient; you can easily handle inputs with hundreds of words, provided you don't try and store all possible combinations in memory at once.
Slightly longer demo:
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 10):
...     print(combo)
...
the quick brown fox jumped over the lazy-dog
the quick brown fox jumped over the-lazy dog
the quick brown fox jumped over the-lazy-dog
the quick brown fox jumped over-the lazy dog
the quick brown fox jumped over-the lazy-dog
the quick brown fox jumped over-the-lazy dog
the quick brown fox jumped over-the-lazy-dog
the quick brown fox jumped-over the lazy dog
the quick brown fox jumped-over the lazy-dog
the quick brown fox jumped-over the-lazy dog
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 200, 210):
...     print(combo)
...
the-quick-brown fox jumped-over the lazy-dog
the-quick-brown fox jumped-over the-lazy dog
the-quick-brown fox jumped-over the-lazy-dog
the-quick-brown fox jumped-over-the lazy dog
the-quick-brown fox jumped-over-the lazy-dog
the-quick-brown fox jumped-over-the-lazy dog
the-quick-brown fox jumped-over-the-lazy-dog
the-quick-brown fox-jumped over the lazy dog
the-quick-brown fox-jumped over the lazy-dog
the-quick-brown fox-jumped over the-lazy dog
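The memory-efficiency point can be checked with a small sketch. It reuses the first version of dashed_combos (the one that keeps the all-spaces result) and counts the results without storing them; each gap between words is either a space or a dash, so a sentence with n gaps has 2 ** n results.

```python
from itertools import islice, product

def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in product(' -', repeat=len(words) - 1):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)

sentence = 'the quick brown fox jumped over the lazy dog'
gaps = len(sentence.split()) - 1

# Count the results one at a time; nothing is kept in memory.
total = sum(1 for _ in dashed_combos(sentence))
print(total, 2 ** gaps)   # 256 256

# Creating the generator does no work up front; islice pulls only
# the items that are actually requested.
first_three = list(islice(dashed_combos(sentence), 3))
print(first_three)
```

The count doubles with every extra word, so for long inputs it is iterating over every result that becomes the limit, not memory.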