Getting pattern matched from match object - python

I'm working with Python regex and I'm trying to get the pattern that matched from a match object, not the matched text itself.
I have some patterns to replace and I'm doing this:
import re
patterns = {
    r'^[ |\n]+': '',
    r'[ |\n]+$': '',
    r'[ |\n]+': ' '
}
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join(patterns.keys()),
              lambda match: patterns[ match.group(0) ],
              text)
But this is the wrong solution because match.group(0) returns the matched text, so it will never equal any key of the patterns dict.
I tried match.pattern but got an exception, and I tried match.re, but that gives the whole compiled re object, whose pattern for this problem is '^[ |\n]+|[ |\n]+$|[ |\n]+'.
EDIT: based on Barmar's solution I got this:
import re
patterns = [
(r'^[ |\n]+', ''),
(r'[ |\n]+$', ''),
(r'[ |\n]+', ' ')
]
def getreplacement(match):
    for i, group in enumerate(match.groups()):
        if group:
            return patterns[ i ][ 1 ]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[ 0 ] + ')' for p in patterns), getreplacement, text)
print(text)
But this is still not a way to always get the pattern from a match group.

I don't think there's a way to find out directly which alternative matched.
Use a list instead of a dictionary, and put each pattern in a capture group. Then you can see which capture group matched, and use that as the index to get the corresponding replacement.
Note that this won't work if there are any capture groups in the patterns. If groups are needed, make sure they're non-capturing.
import re
patterns = [
(r'^[ |\n]+', ''),
(r'[ |\n]+$', ''),
(r'[ |\n]+', ' ')
]
def getreplacement(match):
    # match.re.groups is the number of capture groups in the combined pattern
    for i in range(1, match.re.groups + 1):
        if match.group(i) is not None:
            return patterns[i-1][1]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[0] + ')' for p in patterns), getreplacement, text)
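A shorter variant of the same idea, as a sketch: when every alternative is wrapped in exactly one capture group and the patterns contain no groups of their own, match.lastindex is the number of the group that matched, so no loop is needed. The names combined and replacement_for are made up for this example, and the pipe has been dropped from the character class on the assumption that it was never meant to be matched.

```python
import re

patterns = [
    (r'^[ \n]+', ''),   # leading whitespace
    (r'[ \n]+$', ''),   # trailing whitespace
    (r'[ \n]+', ' '),   # any other run of spaces/newlines
]

# One capture group per alternative, in the same order as the list.
combined = re.compile('|'.join('(' + p + ')' for p, _ in patterns))

def replacement_for(match):
    # lastindex is the 1-based index of the last group that took part
    # in the match; with one group per alternative that is the
    # alternative that matched.
    return patterns[match.lastindex - 1][1]

text = ' Hello there, I\n need your help here plase :) '
result = combined.sub(replacement_for, text)
print(repr(result))
```

This relies on the patterns having no capture groups of their own, the same restriction noted above.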

If I got it right, you want to strip leading and trailing spaces and reduce the ones in the middle to just one.
First, your code likely has a bug: [ |\n] will match a space ( ), a pipe (|), or a new line. You probably don't want to match a pipe, but you might want to match all whitespace characters, like tabs (\t), for example.
Second, styling: keep your lines under 80 chars and no spaces around indices in brackets.
Third, removing the leading and trailing spaces is simply done with str.strip. The only thing remaining to replace now is sequences of two or more whitespaces, which is easily matched with \s{2,} (\s = "whitespace", {2,} = "two or more").
Here is a modification of your code:
import re
patterns = [
    (r"^[ |\n]+", ""),
    (r"[ |\n]+$", ""),
    (r"[ |\n]+", " "),
]
def get_replacement(m: re.Match) -> str:
    return next(
        patterns[i][1]
        for i, group in enumerate(m.groups())
        if group is not None
    )
text = (
    "\n"
    " \t Hello there, I\n need your help here plase :) \t \n"
    " \t Hello there, I\n need your help here plase :) \t "
    "\n"
)
result1 = re.sub(
    "|".join(f"({p})" for p, _ in patterns),
    get_replacement,
    text,
)
result2 = re.sub(r"[ \n]{2,}", " ", text.strip())
result3 = re.sub(r"\s{2,}", " ", text.strip())
print(repr(result1))
print(repr(result2))
print(repr(result3))
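One thing to watch, shown in this small sketch with made-up sample text: \s{2,} leaves a lone newline or tab untouched, whereas the third pattern in the question replaces every run, including runs of length one. If that behaviour is wanted, \s+ may be the closer match.

```python
import re

text = "  Hello there,\nI need\t\tyour help  "

# Two or more whitespace characters: the single "\n" survives.
two_or_more = re.sub(r"\s{2,}", " ", text.strip())

# One or more whitespace characters: every run becomes one space.
one_or_more = re.sub(r"\s+", " ", text.strip())

print(repr(two_or_more))
print(repr(one_or_more))
```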

Related

How to merge/join consecutive lines

How do I merge every single batch of consecutive lines in a .txt file?
Example:
Turn this:
User#0001
Hello
Whats Up
User#0002
Hi
...
into this:
User#0001 Hello Whats Up
User#0002 Hi
...
I want to merge all of the lines because when I've tried doing this:
pattern = r'([a-zA-Z]+#[0-9]+.)(.+?(?:^$|\Z))'
data = {
    'name': [],
    'message': []
}
with open('chat.txt', 'rt') as file:
    for message in file.readlines():
        match = re.findall(pattern, message, flags=re.M|re.S)
        print(match)
        if match:
            name, message = match[0]
            data['name'].append(name)
            data['message'].append(message)
I got this when printing 'match':
[('User#0001', '\n')]
[]
[]
[]
[('User#0002', '\n')]
...
And when manually editing some of the lines to be User#0001 message then it does return the correct output.
I would phrase your requirement using re.sub:
inp = """User#0001
Hello
Whats Up

User#0002
Hi"""
output = re.sub(r'(?<!\n)\n(?=\S)', ' ', inp)
print(output)
This prints:
User#0001 Hello Whats Up

User#0002 Hi
The regex used here says to match:
(?<!\n) assert that newline does not precede
\n match a single newline
(?=\S) assert that non whitespace follows
The (?<!\n) ensures that we do not remove the newline on the line before a text block begins. The (?=\S) ensures that we do not remove the final newline in a text block.
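The empty results in the question come from running the pattern on one line at a time, so it never sees a whole block. A minimal sketch of applying the substitution to the full file contents and then filling the name/message dict follows. io.StringIO stands in for open('chat.txt'), and the blank line between blocks is an assumption about the file layout.

```python
import io
import re

# Stand-in for open('chat.txt', 'rt'); the sample content is assumed.
fake_file = io.StringIO("User#0001\nHello\nWhats Up\n\nUser#0002\nHi\n")

content = fake_file.read()  # read everything, not line by line
merged = re.sub(r'(?<!\n)\n(?=\S)', ' ', content)

data = {'name': [], 'message': []}
for line in merged.splitlines():
    if not line:
        continue  # blank separator between blocks
    # The first space separates the user name from the joined message.
    name, _, message = line.partition(' ')
    data['name'].append(name)
    data['message'].append(message)

print(data)
```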
Another solution (regex demo):
import re
s = """\
User#0001
Hello
Whats Up
User#0002
Hi"""
pat = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)
out = [(user, messages.splitlines()) for user, messages in pat.findall(s)]
print(out)
Prints:
[('User#0001', ['Hello', 'Whats Up']), ('User#0002', ['Hi'])]
If you want to join the messages to one line:
for user, messages in out:
    print(user, " ".join(messages))
Prints:
User#0001 Hello Whats Up
User#0002 Hi
First, I suspect you want this in order to record a chat history.
In that case I would say that you do not need a dictionary.
I propose a list where each element is [user, message].
Second, complexity brings difficulties and bugs. Do you really need regex?
What's wrong with this simple solution:
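t = [
    "User#0001\n",
    "Hello\n",
    "Whats Up\n",
    "\n",
    "\n",
    "User#0002\n",
    "Hi\n",
    "...\n",
]
data = []
for line in t:
    line = line.strip()  # remove surrounding spaces and the trailing \n
    if not line:
        continue  # skip blank lines between blocks
    if line.startswith("User#"):
        data.append([line, ""])  # start a new [user, message] entry
    elif data:
        # append to the current user's message, separated by one space
        data[-1][1] = (data[-1][1] + " " + line).strip()
for user, message in data:
    print(user, message)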
For the format of the given example, if you want to keep the same number of newlines, you can use a pattern with 3 capture groups. Its first two groups are:
^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)
The pattern matches:
^ Start of the string (or of each line when re.M is used)
([a-zA-Z]+#[0-9]+) Capture group 1
( Capture group 2
(?: Non capture group
\n Match a newline
(?![a-zA-Z]+#[0-9]) Negative lookahead, assert not 1+ chars a-zA-Z to the right followed by # and a digit
.+ Match 1+ chars (In your pattern you used ^$ to stop when there is an empty string, but you can also make sure to match 1 or more characters)
)* Close the non capture group and optionally repeat it to also allow 0 occurrences
) Close group 2
The full pattern used in the code appends a third group, (\n*), which captures the newlines that follow each block.
Regex demo
import re
s = """User#0001
Hello
Whats Up
User#0002
Hi
User#0003"""
pattern = r"^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)"
result = []
for (u, m, n) in re.findall(pattern, s, re.M):
    result.append(f"{' '.join([u] + m.split())}{n}")
print("".join(result))
Output
User#0001 Hello Whats Up
User#0002 Hi
User#0003

Find substring inside parentheses and single quotes

I would like to get the string that is within the parentheses and within the ''.
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
result = re.findall(r'\((.*?) *\)', a_string)[0][1:-1]
print(result)
This block works. But isn't there a more elegant way?
It's quite inefficient to use re.findall() for retrieving just first match, better to use re.search() which will return after first match found. Also you can simplify your regular expression to \('(.+?)'\).
Code:
import re
...
result = re.search(r"\('(.+?)'\)", a_string).group(1)
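Note that re.search() returns None when nothing matches, so calling .group(1) directly would raise AttributeError on such input. A small sketch of a guarded version follows; the helper name quoted_in_parens is made up for illustration.

```python
import re

def quoted_in_parens(s):
    """Return the text between (' and '), or None if there is no match."""
    m = re.search(r"\('(.+?)'\)", s)
    return m.group(1) if m else None

print(quoted_in_parens("_TableStyle('Table Grid') id: 219287880"))
print(quoted_in_parens("no parentheses here"))
```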
If you have only one pair of parentheses in your line, you can do it without regex:
a_string = "_TableStyle('Table Grid') id: 219287880"
openParenthesis = a_string.index("('") + 2
closeParenthesis = a_string.index("')", openParenthesis)
# 'Table Grid'
a_string[openParenthesis:closeParenthesis]
Yes. There's another way to achieve this. Here it is,
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
result = re.findall(r'\(\'(.*?) *\'\)', a_string)[0]
print(result)
This pattern also matches the single quotes, so it extracts only the text between (' and ').
Edit:
We can also write the code like this,
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
result = re.search(r'\(\'(.*?) *\'\)', a_string).group(1)
print(result)
Use double quotes around the literal if you do not want to escape single quotes. Triple quotes are even nicer. Getting part of a regex match is simple with capturing groups:
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
match = re.search(r"""\('([^']*)'\)""", a_string)
if match:
    print(match.group(1))
See Python proof.
EXPLANATION
\(        a literal '('
'         a literal single quote
(         group and capture to \1:
  [^']*   any character except a single quote (0 or more times, matching as many as possible)
)         end of \1
'         a literal single quote
\)        a literal ')'

Regex to map character to a function of that character

I want a function using regex that will map certain punctuation characters (these: ., ,, (, ), ;, :) to a function of that character. Specifically, it will put a space on either side.
For example, it would map the string "Hello, this is a test string." to "Hello , this is a test string . "
This is what I have right now:
import re
def add_spaces_to_punctuation(input_text):
    text = re.sub('[.]', ' . ', input_text)
    text = re.sub('[,]', ' , ', text)
    text = re.sub('[:]', ' : ', text)
    text = re.sub('[;]', ' ; ', text)
    text = re.sub('[(]', ' ( ', text)
    text = re.sub('[)]', ' ) ', text)
    return text
This works as intended, but is pretty unwieldy/hard to read. If I had a way to map each punctuation characters to a function of that character in a single regex line it would improve it significantly. Is there a way to do this with regex? Sorry if its something obvious, I'm pretty new to regex and don't know what this kind of thing would be called.
You could try the following:
import re
recmp = re.compile(r'([.,:;()])')  # capture group so \1 refers to the matched character
def add_spaces_to_punctuation(input_text):
    text = recmp.sub(r' \1 ', input_text)
    return text
Also, with performance in mind and according to this answer, precompiling the pattern should be faster if you need to run it often.
Try this:
text = re.sub(r"(\.|\,|\(|\)|\;|\:)", lambda x: f' {x.group(1)} ', text)
It uses a capture group to capture whatever the character may be and then map that to the same character with a space on either side using a lambda expression (see https://www.w3schools.com/python/python_lambda.asp for more info).
The part of the regex in parentheses (in this case the whole thing) is what gets captured and x.group(1) will give you the captured character.
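A capture group is not strictly required. As a sketch, \g<0> in the replacement template stands for the entire match, so a plain character class is enough. The name PUNCT is made up for this example.

```python
import re

PUNCT = re.compile(r'[.,:;()]')

def add_spaces_to_punctuation(input_text):
    # \g<0> in the replacement template refers to the whole match.
    return PUNCT.sub(r' \g<0> ', input_text)

result = add_spaces_to_punctuation("Hello, this is a test string.")
print(repr(result))
```

Where the input already has a space after a punctuation mark, the result contains two spaces in a row, just as with the original chain of re.sub calls.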

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt','r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
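A small sketch of where the two approaches diverge, assuming the case meant is a quoted word followed directly by punctuation. The sample string is made up.

```python
import re

s = "say 'this'. now"

# split/strip: the closing quote is not at the end of the word
# (the period is), so strip("'") leaves it in place.
by_strip = ' '.join(w.strip("'") for w in s.split())

# Two-pass regex: the closing quote is followed by a non-letter,
# so the outer substitution removes it.
by_regex = re.sub(r"'([^A-Za-z])", r"\1",
                  re.sub(r"([^A-Za-z])'", r"\1", s))

print(by_strip)
print(by_regex)
```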
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\'  match any quote not preceded by a word character
# |          or
# \'(?!\w)   match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it is both preceded and followed by a letter, digit or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it is both preceded and followed by a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    return any(x.isalnum() for x in text)
def stripit(text, ofwhat):
    # strip any of the unwanted characters from both ends of the word
    return text.strip(ofwhat)
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    lines = text.splitlines()
    lines = [" ".join(stripit(word, notwanted) for word in line.split() if istext(word)) for line in lines]
    return "\n".join(lines)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line into a single space, removes leading and trailing whitespace, and drops words that contain no alphanumeric character.

Getting a list with new line characters

I have a string as below:
s = 'hello\n this is\n a forum\n'
And I am using a regular expression to get each single/double occurrence of each character. I want to generate a list as li = [ 'h','e','ll','o','\n','t','h'....]
I used
pattern = re.compile(r'(.)\1?' , re.IGNORECASE)
newList = [m.group() for m in pattern.finditer(s)]
print(newList)
But this gave me newList= [ 'h','e','ll','o','t','h'....] , here I could not get the '\n' new line character stored. How should I change my pattern to get the '\n' in my list?
Use re.DOTALL flag like this
pattern = re.compile(r'(.)\1?' , re.IGNORECASE | re.DOTALL)
Quoting from the docs,
Make the '.' special character match any character at all, including a
newline; without this flag, '.' will match anything except a newline.
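The same flags can also be written inline at the start of the pattern. A short sketch with the string from the question shows that the newlines now end up in the list; the variable name tokens is made up here.

```python
import re

s = 'hello\n this is\n a forum\n'

# (?is) at the start of the pattern is the inline spelling of
# re.IGNORECASE | re.DOTALL.
pattern = re.compile(r'(?is)(.)\1?')
tokens = [m.group() for m in pattern.finditer(s)]

print(tokens)
```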
I would have preferred itertools.groupby to do this
from itertools import groupby
print(["".join(grp) for char, grp in groupby(s)])
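The two approaches are not interchangeable, as this sketch with a made-up input shows: the regex takes at most two equal characters at a time and ignores case, while groupby groups a run of any length and compares characters exactly.

```python
import re
from itertools import groupby

s = 'aaaBb'

# Regex: pairs at most, case-insensitive backreference.
by_regex = [m.group()
            for m in re.finditer(r'(.)\1?', s, re.IGNORECASE | re.DOTALL)]

# groupby: whole runs of identical characters, case-sensitive.
by_groupby = [''.join(grp) for _, grp in groupby(s)]

print(by_regex)
print(by_groupby)
```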
