Split with regex but with first character of delimiter

I have a regex like this: "[a-z|A-Z|0-9]: " that matches one alphanumeric character, a colon, and a space. How can I split the string while keeping the alphanumeric character in the first part of the result? I cannot change the regex because in some cases the string has a special character before the colon and space.
Example:
line = re.split("[a-z|A-Z|0-9]: ", "A: ") # Result: ['A', '']
line = re.split("[a-z|A-Z|0-9]: ", ":: )5: ") # Result: [':: )5', '']
line = re.split("[a-z|A-Z|0-9]: ", "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
Update:
Actually, my problem is splitting from a review file. Suppose I have a file that every line has this pattern: [title]: [review]. I want to get the title and review, but some of the titles have a special character before a colon and space, and I don't want to match them. However, it seems that the character before a colon and space that I want to match apparently is an alphanumeric one.

You could split using a negative lookbehind with a single colon or use a character class [:)] where you can specify which characters should not occur directly to the left.
(?<!:):[ ]
In parts
(?<!:) Negative lookbehind, assert what is on the left is not a colon
:[ ] Match a colon followed by a space (Added square brackets only for clarity)
For example
import re
pattern = r"(?<!:): "
line = re.split(pattern, "A: ") # Result: ['A', '']
print(line)
line = re.split(pattern, ":: )5: ") # Result: [':: )5', '']
print(line)
line = re.split(pattern, "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
print(line)
Output
['A', '']
[':: )5', '']
['Delicious :)', 'I want to eat this again']
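For the review-file case from the update, a minimal sketch might apply the same pattern to each line. It assumes every line has the form [title]: [review]. The helper name parse_review is made up for illustration, and maxsplit=1 is an added assumption so that a colon and space inside the review body does not produce a third part.

```python
import re

# Same pattern as above: ": " not directly preceded by another colon.
SEPARATOR = re.compile(r"(?<!:): ")

def parse_review(line):
    """Split one '[title]: [review]' line at the first qualifying ': '.

    parse_review is a hypothetical helper, not part of the original answer.
    """
    parts = SEPARATOR.split(line.rstrip("\n"), maxsplit=1)
    if len(parts) != 2:
        return None  # no separator found on this line
    title, review = parts
    return title, review

print(parse_review("Delicious :): I want to eat this again"))
# ('Delicious :)', 'I want to eat this again')
print(parse_review("Great: fast service: would return"))
# ('Great', 'fast service: would return')
```

Lines without a separator return None here, so the caller can decide whether to skip or report them.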

Solution
First of all, as your examples show, you need to match characters other than a-zA-Z0-9, so we can simply use . which matches any character except a newline.
So I think the expression you're looking for might be this one:
(.*?):(?!.*:) (.*)
You can use it like so:
import re
pattern = r"(.*?):(?!.*:) (.*)"
matcher = re.compile(pattern)
txt1 = "A: "
txt2 = ":: )5: "
txt3 = "Delicious :): I want to eat this again"
result1 = matcher.search(txt1).groups() # ('A', '')
result2 = matcher.search(txt2).groups() # (':: )5', '')
result3 = matcher.search(txt3).groups() # ('Delicious :)', 'I want to eat this again')
Explanation
We use capture groups (the parentheses) to put the different parts of the string into separate groups. search() then finds the match, and groups() returns the captured parts as a tuple.
The (?!.*:) part is called a "negative lookahead". It makes sure we split at the last colon on the line, i.e. a colon that has no other colon after it.
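One consequence worth checking against your data: because the lookahead selects the last colon on the line, a review body that itself contains a colon followed by a space moves the split point. A small sketch, using a made-up sample line:

```python
import re

matcher = re.compile(r"(.*?):(?!.*:) (.*)")

# Made-up sample: the review body contains ": " as well.
line = "Great: I love it: really"
title, review = matcher.search(line).groups()
print(title)   # Great: I love it
print(review)  # really
```

If reviews may contain colons but titles do not, splitting at the first qualifying separator may be the safer choice.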
Edit
BTW, if, as you mentioned, you have many lines each containing a review, you can use this snippet to get all of the reviews separated by title and body at once:
import re
pattern = r"(.*?):(?!.*:) (.*)\n?"
matcher = re.compile(pattern)
# Each line keeps the space after the final colon; the pattern needs it to match.
reviews = (
    "A: \n"
    ":: )5: \n"
    "Delicious :): I want to eat this again\n"
)
parsed_reviews = matcher.findall(reviews) # [('A', ''), (':: )5', ''), ('Delicious :)', 'I want to eat this again')]

Related

Extract words from sentence that are containing substring

I want to extract the full phrase (one or more words) that contains a specific substring. The substring can consist of one or more words, and it can start or end in the middle of a word in test_string, but the desired output is the full phrase/word from test_string. For example:
test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
substring1 = 'he text th'
substring2 = 'amp'
if substring1 in test_string:
    print("substring1 found")
if substring2 in test_string:
    print("substring2 found")
My desired output is:
[the text that]
[example, amplifier, lamp]
FYI
Substring can be at the beginning of the word, middle or end...it does not matter.

If you want something robust, I would do something like this:
re.findall(r"((?:\w+)?" + re.escape(substring2) + r"(?:\w+)?)", test_string)
Because of re.escape, the substring can contain any characters, including regex metacharacters.
Explanation of the regex:
'(?:\w+)' Non-capturing group matching one or more word characters
'?' zero or one occurrence of that group
I have put this at the beginning and at the end of your substring, as the missing part of the word can be before or after it.
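As a minimal sketch of the idea above, the pattern can be wrapped in a small function and run against both substrings from the question. The function name words_containing is made up for illustration, and \w* is used as an equivalent shorthand for (?:\w+)?.

```python
import re

def words_containing(substring, text):
    """Return each match of the substring extended to whole words.

    words_containing is a hypothetical helper name used for illustration.
    """
    pattern = r"\w*" + re.escape(substring) + r"\w*"
    return re.findall(pattern, text)

test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
print(words_containing('he text th', test_string))  # ['the text that']
print(words_containing('amp', test_string))         # ['example', 'amplifier', 'lamp']
```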
To answer the latest comment about how to get the punctuation as well, I would do something like this using string.punctuation:
import re
import string
pattern = r"(?:[" + r"\w" + re.escape(string.punctuation) + r"]+)?"
re.findall("(" + pattern + re.escape(substring2) + pattern + ")",
           test_string)
This also matches any punctuation attached to the beginning or end of the word, like: [I love you.., I love you!!, I love you!?, ?I love you!, ...]

This is a job for regex. You could do:
import re
substring2 = 'amp'
test_string = 'this is an example of the text that I have'
# \w* (rather than \w+) also matches when the substring starts or ends a word
print("matches for substring 1:", re.findall(r"(\w*he text th\w*)", test_string))
print("matches for substring 2:", re.findall(r"(\w*amp\w*)", test_string))
Output:
matches for substring 1: ['the text that']
matches for substring 2: ['example']

How do I remove a string that starts with '#' and ends with a blank character by using regular expressions in Python?

So I have this text:
"#Natalija What a wonderful day, isn't it #Kristina123 ?"
I tried to remove these two substrings that start with the character '#' by using the re.sub function, but it didn't work.
How do I remove the substrings that start with this character?

Try this regex, which lazily matches from # up to and including the next space:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
t = re.sub('#.*? ', '', text)
print(t)
OUTPUT :
What a wonderful day, isn't it ?

This should work.
# matches the character #
\w+ matches one or more word characters, as many as possible, so the match stops at the first non-word character (here, the blank)
Code:
import re
regex = r"#\w+"
subst = "XXX"
test_str = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
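# count and flags are passed by keyword; passing them positionally is deprecated since Python 3.13
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)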
output:
XXX What a wonderful day, isn't it XXX ?

It's possible to do it with re.sub(). It would be something like this:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
output = re.sub(r'#[a-zA-Z0-9]+\s', '', text)
print(output) # Output: What a wonderful day, isn't it ?
# matches the # character
[a-zA-Z0-9] matches one alphanumeric character (uppercase or lowercase)
"+" means "one or more" (otherwise it would match only one of those characters)
\s matches a single whitespace character
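A variation on the patterns above, as a sketch rather than a definitive answer: using \s* instead of \s also removes a tag that sits at the very end of the text, where no whitespace follows it. The function name remove_tags and the second sample string are made up for illustration.

```python
import re

def remove_tags(text):
    # "#" followed by word characters, plus any whitespace after the tag
    # (\s* rather than \s, so a tag at the very end of the text is removed too).
    return re.sub(r"#\w+\s*", "", text)

print(remove_tags("#Natalija What a wonderful day, isn't it #Kristina123 ?"))
# What a wonderful day, isn't it ?
print(repr(remove_tags("Nice day #Kristina123")))
# 'Nice day '  (the space before the tag is kept)
```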

Alternatively, this can also be done without using the module re. You can first split the sentence into words. Then remove the words containing the # character and finally join the words into a new sentence.
if __name__ == '__main__':
    original_text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
    individual_words = original_text.split(' ')
    words_without_tags = [word for word in individual_words if '#' not in word]
    new_sentence = ' '.join(words_without_tags)
    print(new_sentence)

I think this would work for you. The pattern #\w+?\s matches expressions that start with #, continue with one or more word characters, and end with a single whitespace character (which is required, not optional).
import re
string = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
pattern = r'#\w+?\s'
replaced = re.sub(pattern, '', string)
print(replaced)

What is the best method of processing optional group in Python Regex?

I'm trying to write a function that enforces capitalization on certain words, and adds "'s" to certain words if they are followed by " s". For example, it should take grace s and transform that to Grace's.
r"(\b)(grace)( (s|S))?\b": possessive_name,
{...}
def possessive_name(match: Match) -> str:
    result = match.group(2).title()
    result = result.replace(" ", "'")
    return result # type: ignore
I'm correctly "titlizing" it but can't figure out how to reference the optional ( (s|S)) group so that the ( 's) can be added if it's needed, and I'd like to avoid adding an additional regex... Is this possible?
*edited names for clarity

Yes, like this.
import re
test_str = "This is grace s apple."
def fix_names(match):
    name, s = match.groups()
    name = name.title()
    if s:
        name = f"{name}'s"
    return name
p = re.compile(r"\b(grace)(\s[sS])?\b")
print(p.sub(fix_names, test_str))

import re
lines = (
    'a grace s apple',
    'the apple is grace s',
    'take alice s and steve s',
)
for line in lines:
    result = re.sub(r'(\w+)\s+s($|\s)', lambda m: m.group(1).title()+"'s"+m.group(2), line, flags=re.I|re.S)
    print(result)
You'll get:
a Grace's apple
the apple is Grace's
take Alice's and Steve's

You could capture 1+ word characters in group 1, followed by a space and either s or S using a character class.
In the replacement, call .title() on group 1 and append 's
(?<!\S)(\w+) [sS](?!\S)
Explanation
(?<!\S) Left whitespace boundary
(\w+) Capture group 1, match 1+ word chars
 [sS] Match a space and either s or S
(?!\S) Right whitespace boundary
Code example
import re
test_str = "grace s"
regex = r"(?<!\S)(\w+) [sS](?!\S)"
result = re.sub(regex, lambda match: match.group(1).title()+"'s", test_str)
print(result)
Output
Grace's
If you want to match grace specifically, you could use an optional group. If you want to match more words, you could use an alternation (?:grace|anotherword)
(?<!\S)(grace)(?: ([sS]))?\b
Example code
import re
strings = [
    "grace s",
    "Her name is grace."
]
pattern = r"(?<!\S)(grace)(?: ([sS]))?\b"
regex = re.compile(pattern)
for s in strings:
    print(
        regex.sub(
            # group(2) is None when the optional " s" part did not match
            lambda m: "{}{}".format(m.group(1).title(), "'s" if m.group(2) else ''),
            s)
    )
Output
Grace's
Her name is Grace.
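The alternation mentioned above could be built from a list of names, roughly like this. It is a sketch under the assumption that the names are known up front; the names list, the sample sentence and the function name fix_name are made up for illustration.

```python
import re

# Hypothetical list of names; re.escape guards against regex metacharacters.
names = ["grace", "alice", "steve"]
alternation = "|".join(re.escape(name) for name in names)
pattern = re.compile(r"(?<!\S)(" + alternation + r")(?: ([sS]))?\b")

def fix_name(match):
    # group(2) is None when the optional " s" part did not match
    suffix = "'s" if match.group(2) else ""
    return match.group(1).title() + suffix

print(pattern.sub(fix_name, "take alice s and steve s to grace"))
# take Alice's and Steve's to Grace
```

The optional group is only consumed when the s stands alone, so a following word that merely starts with s (for example "grace street") still yields just the capitalized name.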

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        # replace every run of characters other than a-z, space and ' with a space
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!

Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not

Using regular expressions, you could first remove the quotes that don't follow a letter, then remove the quotes that don't precede a letter (thus keeping only the ones that both follow and precede a letter):
import re
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two approaches produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.

Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
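Putting the pieces together for the line from the question, a minimal sketch could lowercase the text, turn punctuation into spaces, and then drop quotes that are not between two letters. The function name normalize is made up for illustration, and collapsing repeated spaces at the end is an added assumption about the desired output.

```python
import re

def normalize(line):
    """Lowercase, strip punctuation, and keep quotes only inside words.

    normalize is a hypothetical helper, not part of the original answers.
    """
    line = line.lower()
    # Replace everything except a-z, space and ' with a space.
    line = re.sub(r"[^a-z' ]+", " ", line)
    # Remove quotes that are not both preceded and followed by a letter.
    line = re.sub(r"(?<![a-z])'|'(?![a-z])", "", line)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(line.split())

print(normalize("Here is some stuff. 'Now there are quotes.' Now there's not."))
# here is some stuff now there are quotes now there's not
```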

If a word is a sequence of 1+ letters, digits and underscores, i.e. something that \w+ can match, you may use
re.sub(r"(?!\b'\b)'", "", text)
Here, ' is matched, and removed, unless it is both preceded and followed by a letter, digit or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
Here, ' is only kept if it is both preceded and followed by a letter (ASCII or any Unicode letter); every other ' is removed.
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école

Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    return any(x.isalnum() for x in text)
def stripit(text, ofwhat):
    # strip each unwanted character from both ends, one character at a time
    for x in ofwhat:
        text = text.strip(x)
    return text
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    lines = text.splitlines()
    lines = [" ".join(stripit(word, notwanted) for word in line.split() if istext(word)) for line in lines]
    return "\n".join(lines)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line into a single space and drops words that contain no alphanumeric characters.

Replace substrings with items from list

Basically, I have a string that has multiple double-whitespaces like this:
"Some text\s\sWhy is there no punctuation\s\s"
I also have a list of punctuation marks that should replace the double-whitespaces, so that the output would be this:
puncts = ['.', '?']
# applying some function
# output:
>>> "Some text. Why is there no punctuation?"
I have tried re.sub(' +', puncts[i], text) but my problem here is that I don't know how to properly iterate through the list and replace the 1st double-whitespace with the 1st element in puncts, the 2nd double-whitespace with the 2nd element in puncts and so on.

If we're still using re.sub(), here's one possible solution that follows this basic pattern:
Get the next punctuation character.
Replace only the first remaining double whitespace in text with that character.
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    # replace the first whitespace of the first remaining pair only
    text = re.sub(r'\s(?=\s)', i, text, count=1)
The call to re.sub() returns a new string. The pattern says "find a whitespace character that is followed by another whitespace character, but only replace the first of the two with a punctuation character." The count of 1 makes it so that we only replace the first instance of the double whitespace, and not all of them (the default behavior).
If the positive lookahead (the part of the regex that we want to match but not replace) confuses you, you can also do without it:
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    # replace the whole pair with the punctuation mark plus one space
    text = re.sub(r'\s\s', i + " ", text, count=1)
This yields the same output.
There will be a leftover whitespace at the end of the sentence, but if you're picky about that, a simple text.rstrip() should take care of it.
Further explanation
Your first try using the regex ' +' doesn't work because it matches every run of one or more spaces, including the single spaces between words, and without a count re.sub() replaces all of those matches with the same punctuation character. The solutions above match exactly two whitespace characters and replace only one occurrence per punctuation mark.
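Another way to avoid the explicit loop, sketched here under the assumption that the marks should be used in order: re.sub() accepts a function as the replacement, so each match can pull the next mark from an iterator. The function name fill_punctuation is made up for illustration.

```python
import re

def fill_punctuation(text, puncts):
    """Replace each double whitespace with the next mark from puncts plus a space.

    fill_punctuation is a hypothetical helper; if there are more gaps than
    marks, the extra gaps are replaced by a single space.
    """
    marks = iter(puncts)
    return re.sub(r"\s\s", lambda m: next(marks, "") + " ", text).rstrip()

print(fill_punctuation("Some text  Why is there no punctuation  ", ['.', '?']))
# Some text. Why is there no punctuation?
```

The trailing rstrip() removes the space that would otherwise follow the last mark.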

You can do it simply using the replace method!
text = "Some text  Why is there no punctuation  "
puncts = ['.', '?']
for i in puncts:
    # notice the 1 here: only the first remaining double space is replaced
    # (use i + " " instead of i if you want a space after each mark)
    text = text.replace("  ", i, 1)
print(text)
Output: Some text.Why is there no punctuation?

You can use re.split() to break the string into substrings between the double spaces and intersperse the punctuation marks using join:
import re
string = "Some text  Why is there no punctuation  "
iPunct = iter([". ","? "])
result = "".join(x+next(iPunct,"") for x in re.split(r"\s\s",string))
print(result)
# Some text. Why is there no punctuation?
