Regular expression to replace brackets that embrace whole entry - python

I have a tab-delimited txt file.
1 (hi7 there) my
2 (hi7)there he3
I want to remove the brackets only when they enclose the whole entry (I am not sure "entry" is the right word for it; I mean one tab-separated field).
So the output should be
1 hi7 there my
2 (hi7)there he3
I know I can easily find how to remove all brackets. But I couldn't find how to remove brackets only when they embrace the whole entry.
Can I do it simply by using notepad++ or python, whichever is faster?

This expr seems to handle all possibilities correctly:
(?m) # multiline mode
(^|\t) # start of line or field
\( # (
([^\t]+?) # anything but a tab
\) # )
(?= # followed by...
$|\t # end of line or field
)
replace with \1\2.
Example:
import re
rx = r'(?m)(^|\t)\(([^\t]+?)\)(?=$|\t)'
# fields are separated by tabs, written here as \t escapes
txt = """
1\t(hi7 (the)re)\t(my)
2\t(hi7)there\the3
(22)\t(hi7)there\the3
(22)\t(hi7there)\t(he3)
"""
print(re.sub(rx, r'\1\2', txt))
Result:
1 hi7 (the)re my
2 (hi7)there he3
22 (hi7)there he3
22 hi7there he3
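The commented layout of the expression above maps closely onto Python's re.VERBOSE mode. The following is only a sketch of that idea: the flags are passed as arguments instead of the inline (?m), and the sample text is an assumption with tabs written as \t escapes.

```python
import re

# Same pattern as above, kept in its commented form via re.VERBOSE.
rx = re.compile(r"""
    (^|\t)        # start of line or field
    \(            # opening bracket
    ([^\t]+?)     # anything but a tab
    \)            # closing bracket
    (?=$|\t)      # followed by end of line or field
""", re.MULTILINE | re.VERBOSE)

# Assumed sample data, tab-delimited.
txt = "1\t(hi7 (the)re)\t(my)\n2\t(hi7)there\the3\n"
result = rx.sub(r"\1\2", txt)
print(result)
```

Only brackets that wrap a whole field are dropped; the inner (the) and the leading (hi7) in (hi7)there stay as they are.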

I think this should work:
f = open("file.txt")
for line in f:
    # fields are tab-delimited, so split on tabs
    l = line.rstrip("\n").split("\t")
    out = []
    for word in l:
        # strip the brackets only when they wrap the whole field
        if word.startswith("(") and word.endswith(")"):
            out.append(word[1:-1])
        else:
            out.append(word)
    print("\t".join(out))
f.close()
To overwrite the file in place:
import fileinput
for line in fileinput.FileInput("file.txt", inplace=1):
    l = line.rstrip("\n").split("\t")
    s = ""
    for sent in l:
        if sent.startswith("(") and sent.endswith(")"):
            s += sent[1:-1] + "\t"
        else:
            s += sent + "\t"
    print(s[:-1])

You can use the tab character \t in a Python regular expression, so you can match like this:
>>> import re
>>> re.match(r'^\([^\t]+\)\t.*$', '(hi7 there)\tmy')
<_sre.SRE_Match object at 0x02573950>
>>> re.match(r'^\([^\t]+\)\t.*$', '(hi7)there\tmy')
>>>
Once you know how to match the string, it is easy to remove the brackets only when the line matches.
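One way to turn that match into a removal is to test each tab-separated field on its own. This is a minimal sketch and not the answerer's code; the helper name strip_enclosing_parens and the sample lines are made up for illustration.

```python
import re

# A field counts as "wrapped" when the brackets span the whole field.
FIELD_IN_PARENS = re.compile(r"\(([^\t]+)\)")

def strip_enclosing_parens(line):
    """Drop brackets from every tab-separated field they fully enclose."""
    cleaned = []
    for field in line.split("\t"):
        m = FIELD_IN_PARENS.fullmatch(field)
        cleaned.append(m.group(1) if m else field)
    return "\t".join(cleaned)

print(repr(strip_enclosing_parens("1\t(hi7 there)\tmy")))
print(repr(strip_enclosing_parens("2\t(hi7)there\the3")))
```

Because fullmatch has to consume the entire field, (hi7)there is left alone while (hi7 there) loses its brackets.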

If they are really tab delimited, you can replace
\t\(([^\t]*)\)\t
\t # a tab
\( # an opening parenthesis
( # open the capturing group
[^\t]* # anything but a tab
)
\)
\t
with
\t\1\t
The idea is to capture the text inside the relevant brackets, and to use it in the replacement with the backreference \1.
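In Python the same replacement could look like the sketch below (the sample lines are assumptions). Note that the pattern needs a tab on both sides, so a bracketed field at the very start or end of a line is not touched.

```python
import re

# Pattern and replacement exactly as described above.
FIELD = re.compile(r"\t\(([^\t]*)\)\t")

middle = "1\t(hi7 there)\tmy"   # bracketed field between two tabs
last = "1\tmy\t(hi7 there)"     # bracketed field at the end of the line

middle_out = FIELD.sub(r"\t\1\t", middle)
last_out = FIELD.sub(r"\t\1\t", last)
print(repr(middle_out))
print(repr(last_out))
```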


How to merge/join consecutive lines

How do I merge every single batch of consecutive lines in a .txt file?
Example:
Turn this:
User#0001
Hello
Whats Up
User#0002
Hi
...
into this:
User#0001 Hello Whats Up
User#0002 Hi
...
I want to merge all of the lines because when I've tried doing this:
pattern = r'([a-zA-Z]+#[0-9]+.)(.+?(?:^$|\Z))'
data = {
    'name': [],
    'message': []
}
with open('chat.txt', 'rt') as file:
    for message in file.readlines():
        match = re.findall(pattern, message, flags=re.M|re.S)
        print(match)
        if match:
            name, message = match[0]
            data['name'].append(name)
            data['message'].append(message)
I got this when printing 'match':
[('User#0001', '\n')]
[]
[]
[]
[('User#0002', '\n')]
...
And when manually editing some of the lines to be User#0001 message then it does return the correct output.
I would phrase your requirement using re.sub:
import re

# blocks are separated by a blank line
inp = """User#0001
Hello
Whats Up

User#0002
Hi"""
output = re.sub(r'(?<!\n)\n(?=\S)', ' ', inp)
print(output)
This prints:
User#0001 Hello Whats Up

User#0002 Hi
The regex used here says to match:
(?<!\n) assert that newline does not precede
\n match a single newline
(?=\S) assert that non whitespace follows
The (?<!\n) ensures that we do not remove the newline on the line before a text block begins. The (?=\S) ensures that we do not remove the final newline in a text block.
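Building on that substitution, the name/message dictionary from the question could be filled as in the sketch below. It assumes blocks are separated by one blank line and that the user name contains no space; the sample text is made up.

```python
import re

# Assumed sample: two blocks separated by a blank line.
chat = "User#0001\nHello\nWhats Up\n\nUser#0002\nHi\n"

# Join the lines inside each block; the blank separator line survives.
merged = re.sub(r"(?<!\n)\n(?=\S)", " ", chat)

data = {"name": [], "message": []}
for block in merged.split("\n\n"):
    block = block.strip()
    if not block:
        continue
    # The first space separates the user name from the joined message.
    name, _, message = block.partition(" ")
    data["name"].append(name)
    data["message"].append(message)

print(data)
```

The key difference from the loop in the question is that the regex sees the whole text at once instead of one line at a time.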
Another solution (regex demo):
import re
s = """\
User#0001
Hello
Whats Up
User#0002
Hi"""
pat = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)
out = [(user, messages.splitlines()) for user, messages in pat.findall(s)]
print(out)
Prints:
[('User#0001', ['Hello', 'Whats Up']), ('User#0002', ['Hi'])]
If you want to join the messages to one line:
for user, messages in out:
    print(user, " ".join(messages))
Prints:
User#0001 Hello Whats Up
User#0002 Hi
First, I suspect that your need is for historical recording.
Then I would say that you do not need a dictionary.
I propose a list where each element would be (user,message).
Second, complexity brings difficulties and bugs. Do you really need regex?
What's wrong with this simple solution:
t = [
    "User#0001\n",
    "Hello\n",
    "Whats Up\n",
    "\n",
    "\n",
    "User#0002\n",
    "Hi\n",
    "...\n",
]
data = []
for line in t:
    line = line.strip()  # remove spaces and \n
    if line.startswith("User#"):
        data.append([line, ""])
    else:
        data[-1][1] += ' ' + line
for msg in data:
    print(msg[0], msg[1] if len(msg) > 1 else "")
For the format of the given example, if you want to keep the same amount of newlines, you can use a pattern with 3 capture groups.
^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)
The pattern matches:
^ Start of string (start of a line when re.M is used)
([a-zA-Z]+#[0-9]+) Capture group 1, match 1+ chars a-zA-Z, then # and 1+ digits
( Capture group 2
(?: Non capture group
\n Match a newline
(?![a-zA-Z]+#[0-9]) Negative lookahead, assert not 1+ chars a-zA-Z to the right followed by # and a digit
.+ Match 1+ chars (In your pattern you used ^$ to stop when there is an empty string, but you can also make sure to match 1 or more characters)
)* Close the non capture group and optionally repeat it to also allow 0 occurrences
) Close group 2
(\n*) Capture group 3, match optional trailing newlines
Regex demo
import re
s = """User#0001
Hello
Whats Up
User#0002
Hi
User#0003"""
pattern = r"^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)"
result = []
for (u, m, n) in re.findall(pattern, s, re.M):
    result.append(f"{' '.join([u] + m.split())}{n}")
print("".join(result))
Output
User#0001 Hello Whats Up
User#0002 Hi
User#0003

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re

with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
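Combined with the punctuation removal from the question, the split-and-strip idea might look like the sketch below. The function name normalize is made up for illustration, and the filter for empty words is an addition: once the punctuation is gone, a quote can end up standing alone.

```python
import re

def normalize(line):
    """Lowercase, drop punctuation, and keep quotes only inside words."""
    # Replace everything except lowercase letters, spaces and quotes.
    cleaned = re.sub(r"[^a-z ']+", " ", line.lower())
    # Strip quotes from both ends of every word.
    words = (w.strip("'") for w in cleaned.split())
    # A word that was only a quote is now empty, so skip it.
    return " ".join(w for w in words if w)

print(normalize("Here is some stuff. 'Now there are quotes.' Now there's not."))
```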
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
Probably more efficient to do it #TigerhawkT3's way. Though they produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it is both preceded and followed by a letter, digit or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it is both preceded and followed by a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # a word counts as text if it has at least one alphanumeric character
    ok = 0
    for x in text: ok += x.isalnum()
    return ok > 0
def stripit(text, ofwhat):
    for x in ofwhat: text = text.strip(x)
    return text
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this will kill all whitespaces too and transform them to space or remove them completely.

How to match through new line in regular expression in python?

This is my code. I want to ignore whatever that is within ~~. Even if it contains new lines, white spaces. So that I can ignore the comments.
for letter in code:
    tok += letter  # adding each character to the token.
    if not is_str and (tok == " " or tok == "\n"):
        # ignoring whitespaces and new line if it's not a string.
        tok = ""  # resetting the token.
        # Always always always remember. It's not lexer's job to generate errors
        # It's the work of parser. One thing should only do one thing.
    elif re.search(r'Enter', tok):
        tokens.append("ENTER")
        tok = ""
    elif re.search(r'~(.*?|\n*?)~', tok):
        # to ignore the comments written within ~this~
        tok = ""
If no other ~ is allowed within ~ strings, you can use:
r'~[^~]*~'
Here [^~]* matches any character except ~, including newlines, so the match can span several lines.
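As a rough sketch of how this could be used, the comments can be removed from the source text in one pass before it is tokenized. The function name strip_comments and the sample input are assumptions, not part of the original lexer.

```python
import re

# ~...~ comment with no nested ~; the negated class also matches newlines.
COMMENT = re.compile(r"~[^~]*~")

def strip_comments(code):
    """Remove every ~comment~, including ones that span several lines."""
    return COMMENT.sub("", code)

source = "Enter ~ a comment\nthat spans two lines ~ Enter"
stripped = strip_comments(source)
print(repr(stripped))
```

No flag is needed here because a negated character class is not affected by re.DOTALL.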
You can use the re.DOTALL flag:
Make the '.' special character match any character at all, including a
newline; without this flag, '.' will match anything except a newline.
pattern = re.compile(r'~(.*?)~', re.DOTALL)
Trial:
>>> import re
>>> s = '''~dksdjs
... sdjs~'''
>>> pattern = re.compile(r'~(.*?)~', re.DOTALL)
>>> pattern.search(s)
<_sre.SRE_Match object; span=(0, 13), match='~dksdjs\nsdjs~'>
# ^

Trouble joining a list of multiple strings back to a list of a single string in Python

I am trying to write a Python program to check if the phrases in a file occur in a document. My program works fine until it hits a phrase such as "happy(+) feet". I think the error is related to the "(+)" in the phrase; however, I am not sure how to revise my regular expression to make it work.
This is my code:
import re
handle = open('document.txt', 'r')
text = handle.read()
lst = list()
with open('phrases.txt', 'r') as phrases:
    for phrase in phrases:
        phrase = phrase.rstrip()
        if len(phrase) > 0 and phrase not in lst:
            lst.append(phrase)
counts = {}
for each_phrase in lst:
    word = each_phrase.rsplit()
    pattern = re.compile(r'%s' % r'\s+'.join(word), re.IGNORECASE)
    counts[each_phrase] = len(pattern.findall(text))
for key, value in counts.items():
    if value > 0:
        print(key, ',', value)
handle.close()
phrases.close()
You need to use re.escape when declaring a word:
word = map(re.escape, each_phrase.rsplit())
And perhaps, change \s+ to \s* to make the space optional:
pattern = re.compile(r'%s' % r'\s*'.join(word), re.IGNORECASE)
The round brackets ( and ), as well as the + symbol, are special regex characters. Outside a character class they must be escaped in order to match them literally.
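A small self-contained sketch of the escaped version follows. The helper name count_phrase and the sample text are made up for illustration, and it keeps \s+ between words as in the question.

```python
import re

def count_phrase(phrase, text):
    """Count case-insensitive occurrences of a phrase, treating it literally."""
    # Escape each word so characters such as ( ) + lose their regex meaning.
    parts = [re.escape(word) for word in phrase.split()]
    # Allow any run of whitespace between the words.
    pattern = re.compile(r"\s+".join(parts), re.IGNORECASE)
    return len(pattern.findall(text))

document = "Happy(+)   feet and happy(+) feet"
print(count_phrase("happy(+) feet", document))
```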
Sample IDEONE demo

string convert with python re

I get a string line:
>>> line = " abc\n def\n\n ghi\n jkl"
>>> print(line)
abc
def
ghi
jkl
and I want to convert it to "abcdef\n\n ghijkl", like:
>>> print(" abcdef\n\n ghijkl")
abcdef
ghijkl
I tried the Python re module and wrote something like this:
re.sub('(?P<word1>[^\n\s])\n\s*(?P<word2>[^\n\s])', '\g<word1>\g<word2>', line)
but I get this:
>>> re.sub('(?P<word1>[^\n\s])\n\s*(?P<word2>[^\n\s])', '\g<word1>\g<word2>', line)
Out: ' abcdefghijkl'
It seems to me that the \n\s* part is also matching \n\n. Can anyone point out where I went wrong?
\s matches space, \t, \n and (depending on your regex engine) a few other whitespace characters.
So if you only want to replace single linebreaks + spaces/tabs, you can use this:
newline = re.sub(r"(?<!\n)\n[ \t]*(?!\n)", "", line)
Explanation:
(?<!\n) # Assert that the previous character isn't a newline
\n # Match a newline
[ \t]* # Match any number of spaces/tabs
(?!\n) # Assert that the next character isn't a newline
In Python:
>>> line = " abc\n def\n\n ghi\n jkl"
>>> newline = re.sub(r"(?<!\n)\n[ \t]*(?!\n)", "", line)
>>> print(newline)
abcdef
ghijkl
Try this,
line = " abc\n def\n\n ghi\n jkl"
print(re.sub(r'\n(?!\n)\s*', '', line))
It gives,
abcdef
ghijkl
It says, "Replace a new line, followed by a space that is NOT a new line with nothing."
UPDATE: Here's a better version
>>> re.sub(r'([^\n])\n(?!\n)\s*', r'\1', line)
' abcdef\n\n ghijkl'
It gives exactly what you said in the first post.
You could simplify the regexp if you used \S, which matches any non-whitespace character:
>>> import re
>>> line = " abc\n def\n\n ghi\n jkl"
>>> print(re.sub(r'(\S+)\n\s*(\S+)', r'\1\2', line))
abcdef
ghijkl
However, the reason why your own regexp is not working is because your <word1> and <word2> groups are only matching a single character (i.e. they're not using +). So with that simple correction, your regexp will produce the correct output:
>>> print(re.sub(r'(?P<word1>[^\n\s]+)\n\s*(?P<word2>[^\n\s]+)', r'\g<word1>\g<word2>', line))
abcdef
ghijkl
