How do I merge every single batch of consecutive lines in a .txt file?
Example:
Turn this:
User#0001
Hello
Whats Up
User#0002
Hi
...
into this:
User#0001 Hello Whats Up
User#0002 Hi
...
I want to merge all of the lines because when I've tried doing this:
pattern = r'([a-zA-Z]+#[0-9]+.)(.+?(?:^$|\Z))'
data = {
    'name': [],
    'message': []
}
with open('chat.txt', 'rt') as file:
    for message in file.readlines():
        match = re.findall(pattern, message, flags=re.M|re.S)
        print(match)
        if match:
            name, message = match[0]
            data['name'].append(name)
            data['message'].append(message)
I got this when printing 'match':
[('User#0001', '\n')]
[]
[]
[]
[('User#0002', '\n')]
...
And when manually editing some of the lines to be User#0001 message then it does return the correct output.
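I would phrase your requirement using re.sub, assuming the message blocks are separated by a blank line (as the ^$ in your pattern suggests):
import re
inp = """User#0001
Hello
Whats Up

User#0002
Hi"""
output = re.sub(r'(?<!\n)\n(?=\S)', ' ', inp)
print(output)
This prints:
User#0001 Hello Whats Up

User#0002 Hi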
The regex used here says to match:
(?<!\n) assert that newline does not precede
\n match a single newline
(?=\S) assert that non whitespace follows
The (?<!\n) ensures that we do not remove the newline on the line before a text block begins. The (?=\S) ensures that we do not remove the final newline in a text block.
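A minimal sketch of applying this substitution to a file. The important difference from the code in the question is that the file is read as one string with read_text() instead of line by line, because a regex applied to a single line can never see the lines that follow it. The helper name merge_blocks and the temporary chat.txt are assumptions made for illustration, and the sample assumes blocks are separated by a blank line.

```python
import re
import tempfile
from pathlib import Path


def merge_blocks(text: str) -> str:
    """Join the lines of each blank-line-separated block with spaces."""
    return re.sub(r'(?<!\n)\n(?=\S)', ' ', text)


# Hypothetical sample data written to a temporary file for the demo.
sample = "User#0001\nHello\nWhats Up\n\nUser#0002\nHi\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chat.txt"
    path.write_text(sample)
    # Read the whole file at once so the regex can look across lines.
    merged = merge_blocks(path.read_text())

print(merged)
```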
Another solution:
import re
s = """\
User#0001
Hello
Whats Up
User#0002
Hi"""
pat = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)
out = [(user, messages.splitlines()) for user, messages in pat.findall(s)]
print(out)
Prints:
[('User#0001', ['Hello', 'Whats Up']), ('User#0002', ['Hi'])]
If you want to join the messages to one line:
for user, messages in out:
    print(user, " ".join(messages))
Prints:
User#0001 Hello Whats Up
User#0002 Hi
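If the goal is the name/message dictionary from the question, the same pattern can fill it directly. This is only a sketch: the function name parse_chat and the sample text are assumptions, and the whole chat is expected as a single string rather than line by line.

```python
import re

# Same pattern as above: a user tag, then everything up to the next tag or the end.
pat = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)


def parse_chat(text):
    """Return {'name': [...], 'message': [...]} with one entry per user block."""
    data = {"name": [], "message": []}
    for user, messages in pat.findall(text):
        data["name"].append(user)
        # split() without arguments collapses newlines and repeated spaces
        data["message"].append(" ".join(messages.split()))
    return data


chat = "User#0001\nHello\nWhats Up\n\nUser#0002\nHi\n"  # hypothetical sample
data = parse_chat(chat)
print(data)
```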
First, I suspect you need this to record a chat history, in which case you do not need a dictionary.
I propose a list where each element is [user, message].
Second, complexity brings difficulties and bugs. Do you really need a regex?
What's wrong with this simple solution:
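t = [
    "User#0001\n",
    "Hello\n",
    "Whats Up\n",
    "\n",
    "\n",
    "User#0002\n",
    "Hi\n",
    "...\n",
]
data = []
for line in t:
    line = line.strip()  # remove surrounding spaces and the trailing \n
    if not line:
        continue  # skip blank lines between blocks
    if line.startswith("User#"):
        data.append([line, ""])  # start a new [user, message] entry
    elif data:
        # append to the current user's message, separated by one space
        data[-1][1] = (data[-1][1] + " " + line).strip()
for user, message in data:
    print(user, message)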
For the format of the given example, if you want to keep the same amount of newlines, you can use a pattern with 3 capture groups.
^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)
The pattern matches:
^ Start of a line (the pattern is used with re.M)
([a-zA-Z]+#[0-9]+) Capture group 1, match 1+ chars a-zA-Z, then # and 1+ digits
( Capture group 2
(?: Non capture group
\n Match a newline
(?![a-zA-Z]+#[0-9]) Negative lookahead, assert not 1+ chars a-zA-Z to the right followed by # and a digit
.+ Match 1+ chars (In your pattern you used ^$ to stop when there is an empty string, but you can also make sure to match 1 or more characters)
)* Close the non capture group and optionally repeat it to also allow 0 occurrences
) Close group 2
(\n*) Capture group 3, match optional newlines so they can be kept in the result
import re
s = """User#0001
Hello
Whats Up
User#0002
Hi
User#0003"""
pattern = r"^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)"
result = []
for (u, m, n) in re.findall(pattern, s, re.M):
    result.append(f"{' '.join([u] + m.split())}{n}")
print("".join(result))
Output
User#0001 Hello Whats Up
User#0002 Hi
User#0003
I'm trying to write a function that enforces capitalization on certain words, and adds "'s" to certain words if they are followed by " s". For example, it should take grace s and transform that to Grace's.
r"(\b)(grace)( (s|S))?\b": possessive_name,
{...}
def possessive_name(match: Match) -> str:
    result = match.group(2).title()
    result = result.replace(" ", "'")
    return result # type: ignore
I'm correctly "titlizing" it but can't figure out how to reference the optional ( (s|S)) group so that the ( 's) can be added if it's needed, and I'd like to avoid adding an additional regex... Is this possible?
*edited names for clarity
Yes, like this.
import re
test_str = "This is grace s apple."
def fix_names(match):
    name, s = match.groups()
    name = name.title()
    if s:
        name = f"{name}'s"
    return name
p = re.compile(r"\b(grace)(\s[sS])?\b")
print(p.sub(fix_names, test_str))
A more general version that handles any word followed by a lone s:
import re
lines = (
    'a grace s apple',
    'the apple is grace s',
    'take alice s and steve s',
)
for line in lines:
    result = re.sub(r'(\w+)\s+s($|\s)', lambda m: m.group(1).title()+"'s"+m.group(2), line, flags=re.I|re.S)
    print(result)
You'll get:
a Grace's apple
the apple is Grace's
take Alice's and Steve's
You could capture 1+ word characters in group 1, followed by a space and either s or S using a character class.
In the replacement, call .title() on group 1 and append 's.
(?<!\S)(\w+) [sS](?!\S)
Explanation
(?<!\S) Left whitespace boundary
(\w+) Capture group 1, match 1+ word chars
 [sS] Match a space and either s or S
(?!\S) Right whitespace boundary
Code example
import re
test_str = "grace s"
regex = r"(?<!\S)(\w+) [sS](?!\S)"
result = re.sub(regex, lambda match: match.group(1).title()+"'s", test_str)
print(result)
Output
Grace's
If you want to match grace specifically, you could use an optional group. If you want to match more words, you could use an alternation such as (?:grace|anotherword).
(?<!\S)(grace)(?: ([sS]))?\b
Example code
import re
strings = [
    "grace s",
    "Her name is grace."
]
pattern = r"(?<!\S)(grace)(?: ([sS]))?\b"
regex = re.compile(pattern)
for s in strings:
    print(
        regex.sub(
            lambda m: "{}{}".format(m.group(1).title(), "'s" if m.group(2) else ''),
            s,
        )
    )
Output
Grace's
Her name is Grace.
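The alternation mentioned above can be built from a list of names. The following is a sketch under assumptions: the list names and the helper fix are made up for illustration, and re.escape is used so that a name containing regex metacharacters stays literal.

```python
import re

names = ["grace", "alice"]  # hypothetical list of names to capitalize

# Build (?:grace|alice)-style alternation, captured as group 1.
alternation = "|".join(re.escape(n) for n in names)
pattern = re.compile(rf"(?<!\S)({alternation})(?: ([sS]))?\b")


def fix(m):
    # Group 2 is None when the optional " s" part did not match.
    return m.group(1).title() + ("'s" if m.group(2) else "")


result = pattern.sub(fix, "grace s apple and alice.")
print(result)
```

Because the " s" part is optional and followed by \b, a name followed by a longer word starting with s (for example "grace sings") is only capitalized, not turned into a possessive.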
I'm working with Python regex and I'm trying to get the pattern that matched from a match object, not the matched text itself.
I have some patterns to replace and I'm doing this:
import re
patterns = {
    r'^[ |\n]+': '',
    r'[ |\n]+$': '',
    r'[ |\n]+': ' '
}
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join(patterns.keys()),
              lambda match: patterns[ match.group(0) ],
              text)
But this is a wrong solution because match.group(0) returns the matched text, so it will never be equal to any key of the patterns dict.
I tried match.pattern but got an exception, and I tried match.re, but that gives the whole compiled pattern object, whose pattern for this problem is '^[ |\n]+|[ |\n]+$|[ |\n]+'.
EDIT: based on Barmar solution I got this:
import re
patterns = [
    (r'^[ |\n]+', ''),
    (r'[ |\n]+$', ''),
    (r'[ |\n]+', ' ')
]
def getreplacement(match):
    for i, group in enumerate(match.groups()):
        if group:
            return patterns[ i ][ 1 ]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[ 0 ] + ')' for p in patterns), getreplacement, text)
print(text)
But this is still not a way to always get the pattern from a match group.
I don't think there's a way to find out directly which alternative matched.
Use a list instead of a dictionary, and put each pattern in a capture group. Then you can see which capture group matched, and use that as the index to get the corresponding replacement.
Note that this won't work if there are any capture groups in the patterns. If groups are needed, make sure they're non-capturing.
import re
patterns = [
    (r'^[ |\n]+', ''),
    (r'[ |\n]+$', ''),
    (r'[ |\n]+', ' ')
]
def getreplacement(match):
    # match.groups is a method; the number of groups is match.re.groups
    for i in range(1, match.re.groups + 1):
        if match.group(i) is not None:
            return patterns[i-1][1]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[0] + ')' for p in patterns), getreplacement, text)
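A variation on the same idea is to give each alternative a named group and read match.lastgroup, which holds the name of the last group that matched. This is a sketch only: the rule names lead, trail and inner are invented for illustration, the character class omits the literal |, and, as with numbered groups, it assumes the individual patterns contain no capture groups of their own.

```python
import re

# (name, pattern, replacement); the names are hypothetical labels.
rules = [
    ("lead", r"^[ \n]+", ""),
    ("trail", r"[ \n]+$", ""),
    ("inner", r"[ \n]+", " "),
]

# Wrap each pattern in a named group and join them into one alternation.
combined = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat, _ in rules))
replacements = {name: repl for name, _, repl in rules}


def replace(m):
    # lastgroup is the name of the alternative that produced this match.
    return replacements[m.lastgroup]


text = "  Hello there,   I\n need help  "
result = combined.sub(replace, text)
print(repr(result))
```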
If I got it right, you want to strip leading and trailing spaces and reduce the ones in the middle to just one.
First, your code likely has a bug: [ |\n] will match a space ( ), a pipe (|), or a new line. You probably don't want to match a pipe, but you might want to match all whitespace characters, like tabs (\t), for example.
Second, styling: keep your lines under 80 chars and use no spaces around indices in brackets.
Third, removing the leading and trailing spaces is simply done with str.strip. The only thing remaining to replace now is sequences of two or more whitespaces, which is easily matched with \s{2,} (\s = "whitespace", {2,} = "two or more"). Note that a single newline or tab in the middle is left untouched by \s{2,}; use \s+ if those should become a space too.
Here is a modification of your code:
import re
patterns = [
    (r"^[ |\n]+", ""),
    (r"[ |\n]+$", ""),
    (r"[ |\n]+", " "),
]
def get_replacement(m: re.Match) -> str:
    return next(
        patterns[i][1]
        for i, group in enumerate(m.groups())
        if group is not None
    )
text = (
    "\n"
    " \t Hello there, I\n need your help here plase :) \t \n"
    " \t Hello there, I\n need your help here plase :) \t "
    "\n"
)
result1 = re.sub(
    "|".join(f"({p})" for p, _ in patterns),
    get_replacement,
    text,
)
result2 = re.sub(r"[ \n]{2,}", " ", text.strip())
result3 = re.sub(r"\s{2,}", " ", text.strip())
print(repr(result1))
print(repr(result2))
print(repr(result3))
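If the only goal is to trim the ends and collapse every run of whitespace into one space, a regex-free sketch is also possible: str.split() with no arguments splits on any whitespace run and discards leading and trailing whitespace. The sample string below is an assumption for illustration.

```python
text = " \t Hello there,   I\n need your help here plase :) \t \n"

# split() with no separator splits on runs of any whitespace and drops
# empty strings at both ends, so joining with one space normalizes the text.
normalized = " ".join(text.split())
print(repr(normalized))
```

Unlike \s{2,}, this also turns a single newline or tab between words into a space.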
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
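Combined with the punctuation removal from the question, the split/strip idea might look like the sketch below. The function name clean_line is an assumption; the extra filter drops tokens that consist only of quotes, which appear when a quote is left standing alone after the punctuation around it has been replaced by spaces.

```python
import re


def clean_line(line):
    """Lowercase, drop punctuation, and keep only apostrophes inside words."""
    # Replace everything except lowercase letters, spaces and quotes by a space.
    line = re.sub(r"[^a-z ']+", " ", line.lower())
    # Strip quotes from both ends of each word and drop words that become empty.
    words = (w.strip("'") for w in line.split())
    return " ".join(w for w in words if w)


cleaned = clean_line("Here is some stuff. 'Now there are quotes.' Now there's not.")
print(cleaned)
```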
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two approaches produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
Here, ' is matched unless it is both preceded and followed by a letter, digit or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
Here, ' is matched unless it is both preceded and followed by a letter (an ASCII letter in the first pattern, any Unicode letter in the second).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext (text):
    ok = 0
    for x in text: ok += x.isalnum()
    return ok>0
def stripit (text, ofwhat):
    for x in ofwhat: text = text.strip(x)
    return text
def purge (text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line into a single space and drops leading and trailing whitespace.
I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is to skip the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
Simply put, if a line starts with a ':', I'm trying to drop that ':'.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I can't see my mistake. Could anyone please tell me where I am going wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
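If the rule is simply "drop a colon that starts a line", a shorter sketch is to anchor on the line start with re.MULTILINE and keep any indentation in a group. The sample string is an assumption modelled on the question; it treats a line as starting with ':' even when spaces or tabs come first.

```python
import re

st = "emp:firstinfo\n  :secondinfo\n  thirdinfo\n"  # hypothetical sample

# ^ matches at every line start in MULTILINE mode; group 1 keeps the indentation
# and the colon that follows it is dropped.
result = re.sub(r"^([ \t]*):", r"\1", st, flags=re.MULTILINE)
print(result)
```

The colon in emp:firstinfo is untouched because it is not the first non-blank character on its line.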
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
            " :secondinfo\\n\n"
            " thirdinfo")
subst = "\\1\\2\\3"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)
# import the regex library
import re
# remove every lowercase ASCII letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)
I have a tab-delimited txt file.
1 (hi7 there) my
2 (hi7)there he3
I want to remove brackets only when the brackets enclose the whole entry (I am not sure if "entry" is the right word to describe it).
So the output should be
1 hi7 there my
2 (hi7)there he3
I know I can easily find how to remove all brackets. But I couldn't find how to remove brackets only when they embrace the whole entry.
Can I do it simply by using notepad++ or python, whichever is faster?
This expr seems to handle all possibilities correctly:
(?m) # multiline mode
(^|\t) # start of line or field
\( # (
([^\t]+?) # anything but a tab
\) # )
(?= # followed by...
$|\t # end of line or field
)
replace with \1\2.
Example:
import re
rx = r'(?m)(^|\t)\(([^\t]+?)\)(?=$|\t)'
# fields are separated by tab characters, written here as \t
txt = """
1\t(hi7 (the)re)\t(my)
2\t(hi7)there\the3
(22)\t(hi7)there\the3
(22)\t(hi7there)\t(he3)
"""
print(re.sub(rx, r'\1\2', txt))
Result:
1 hi7 (the)re my
2 (hi7)there he3
22 (hi7)there he3
22 hi7there he3
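I think this should work:
with open("file.txt") as f:
    for line in f:
        # the file is tab-delimited, so split each line into fields on tabs
        fields = line.rstrip("\n").split("\t")
        cleaned = []
        for field in fields:
            if field.startswith("(") and field.endswith(")"):
                cleaned.append(field[1:-1])  # drop the enclosing brackets
            else:
                cleaned.append(field)
        print("\t".join(cleaned))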
To overwrite the file in place:
import fileinput
for line in fileinput.input("file.txt", inplace=True):
    fields = line.rstrip("\n").split("\t")
    cleaned = [
        f[1:-1] if f.startswith("(") and f.endswith(")") else f
        for f in fields
    ]
    # with inplace=True, print() writes back into the file
    print("\t".join(cleaned))
You can use the tab character \t in Python regular expressions, so you can match like this:
>>> import re
>>> re.match(r'^\([^\t]+\)\t.*$', '(hi7 there)\tmy')
<re.Match object; span=(0, 14), match='(hi7 there)\tmy'>
>>> re.match(r'^\([^\t]+\)\t.*$', '(hi7)there\tmy')
>>>
Once you know how to match your string, it is easy to remove the brackets only if the line matches.
If they are really tab delimited, you can replace
\t\(([^\t]*)\)\t
\t # a tab
\( # an opening parenthesis
( # open the capturing group
[^\t]* # anything but a tab
) # close the capturing group
\) # a closing parenthesis
\t # a tab
with
\t\1\t
The idea is to capture the text inside the relevant brackets, and to use it in the replacement with the backreference \1.
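Because the pattern above needs a tab on both sides, it skips the first and last field of a line, and two adjacent bracketed fields compete for the tab between them. A sketch of the same idea with lookarounds, which check the field boundaries without consuming them, is shown here; the function name strip_field_brackets is an assumption.

```python
import re

# (?<![^\t\n]) : preceded by a tab, a newline, or the start of the text
# (?![^\t\n])  : followed by a tab, a newline, or the end of the text
field_in_brackets = re.compile(r"(?<![^\t\n])\(([^\t\n]*)\)(?![^\t\n])")


def strip_field_brackets(text):
    """Remove brackets only when they enclose a whole tab-delimited field."""
    return field_in_brackets.sub(r"\1", text)


print(strip_field_brackets("1\t(hi7 there)\tmy"))
print(strip_field_brackets("2\t(hi7)there\the3"))
print(strip_field_brackets("(22)\t(hi7there)\t(he3)"))
```

Like the original pattern, this only looks at the first and last character of a field, so a field such as (a)b(c) would also lose its outer brackets.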