Python regular expression to remove repeated words

I am very new to Python.
I want to change a sentence if it contains repeated words.
Correct:
Ex. "this just so so so nice" --> "this just so nice"
Ex. "this is just is is" --> "this is just is"
Right now I am using this regex, but it also changes repeated letters.
Ex. "My friend and i is happy" --> "My friend and is happy" (it removes the "i" and the space) ERROR
text = re.sub(r'(\w+)\1', r'\1', text) #remove duplicated words in row
How can I make the same change, but have it check whole words instead of letters?

text = re.sub(r'\b(\w+)( \1\b)+', r'\1', text) #remove duplicated words in row
The \b matches the empty string, but only at the beginning or end of a word.
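To see why the word boundaries matter, here is a minimal sketch that runs the pattern from the question next to the corrected one. The sample sentences come from the question; the names `letter_level` and `word_level` are only illustrative.

```python
import re

# Pattern from the question: matches any repeated run of word characters,
# so it also collapses doubled letters inside a word.
letter_level = re.compile(r'(\w+)\1')

# Pattern from the answer: a whole word followed by one or more
# space-separated repeats of that same word.
word_level = re.compile(r'\b(\w+)( \1\b)+')

sentence = "this just so so so nice"
deduped = word_level.sub(r'\1', sentence)
print(deduped)  # this just so nice

# The word-level pattern leaves doubled letters inside a word alone,
# while the letter-level pattern turns "happy" into "hapy".
print(word_level.sub(r'\1', "My friend and i is happy"))
print(letter_level.sub(r'\1', "My friend and i is happy"))
```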

Non-regex solution using itertools.groupby:
>>> strs = "this is just is is"
>>> from itertools import groupby
>>> " ".join([k for k,v in groupby(strs.split())])
'this is just is'
>>> strs = "this just so so so nice"
>>> " ".join([k for k,v in groupby(strs.split())])
'this just so nice'
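If repeats that differ only in case should also be collapsed, one possible variation is to group on a lowercased key and keep the first spelling of each run. This is a sketch under that assumption; `dedupe_words` is a hypothetical helper name, and like the version above it normalises all whitespace to single spaces.

```python
from itertools import groupby

def dedupe_words(sentence):
    """Collapse runs of the same word, ignoring case, keeping the first spelling."""
    words = sentence.split()
    # groupby yields one group per run of consecutive equal keys;
    # next(group) picks the first word of each run as originally written.
    return " ".join(next(group) for _, group in groupby(words, key=str.lower))

print(dedupe_words("Hello hello world world"))  # Hello world
```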

\b: Matches a word boundary
\w: Matches any word character
\1: Backreference to the text captured by the first group, (\w+)
import re
def Remove_Duplicates(Test_string):
    Pattern = r"\b(\w+)(?:\W\1\b)+"
    return re.sub(Pattern, r"\1", Test_string, flags=re.IGNORECASE)
Test_string1 = "Good bye bye world world"
Test_string2 = "Ram went went to to his home"
Test_string3 = "Hello hello world world"
print(Remove_Duplicates(Test_string1))
print(Remove_Duplicates(Test_string2))
print(Remove_Duplicates(Test_string3))
Result:
Good bye world
Ram went to his home
Hello world

Related

How do I remove a string that starts with '#' and ends with a blank character by using regular expressions in Python?

So I have this text:
"#Natalija What a wonderful day, isn't it #Kristina123 ?"
I tried to remove these two substrings that start with the character '#' by using the re.sub function, but it didn't work.
How do I remove the substrings that start with this character?
Try this regex:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
t = re.sub('#.*? ', '', text)
print(t)
OUTPUT :
What a wonderful day, isn't it ?
This should work.
# matches the character #
\w+ matches one or more word characters, as many as possible, so it stops at the blank character
Code:
import re
regex = r"#\w+"
subst = "XXX"
test_str = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
# count and flags are passed by keyword; re.MULTILINE has no effect on this pattern
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)
output:
XXX What a wonderful day, isn't it XXX ?
It's possible to do it with re.sub(), it would be something like this:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
output = re.sub(r'#[a-zA-Z0-9]+\s', '', text)
print(output) # Output: What a wonderful day, isn't it ?
# matches the # character
[a-zA-Z0-9] matches alphanumerical (uppercase and lowercase)
"+" means "one or more" (otherwise it would match only one of those characters)
\s matches whitespaces
Alternatively, this can also be done without using the module re. You can first split the sentence into words. Then remove the words containing the # character and finally join the words into a new sentence.
if __name__ == '__main__':
    original_text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
    individual_words = original_text.split(' ')
    words_without_tags = [word for word in individual_words if '#' not in word]
    new_sentence = ' '.join(words_without_tags)
    print(new_sentence)
I think this will work for you. The pattern #\w+?\s matches expressions that start with #, continue with one or more word characters, and end with a single whitespace character (the whitespace is required, not optional).
import re
string = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
pattern = r'#\w+?\s'
replaced = re.sub(pattern, '', string)
print(replaced)
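The patterns in these answers all require a whitespace character after the tag, so a tag at the very end of the string would be left in place. A minimal sketch of one way around that, assuming trailing tags should be removed too, is to make the whitespace optional with \s*. The helper name `strip_tags` is hypothetical.

```python
import re

def strip_tags(text):
    """Remove '#word' tokens together with any whitespace that follows them."""
    # \s* (zero or more) lets the pattern also match a tag at the end of the text.
    return re.sub(r'#\w+\s*', '', text)

print(strip_tags("#Natalija What a wonderful day, isn't it #Kristina123 ?"))
print(strip_tags("Nice weather #sunny"))
```

Note that the space before a trailing tag is kept, so the second call returns the text with one trailing space; call .rstrip() on the result if that matters.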

Split with regex but with first character of delimiter

I have a regex like this: "[a-z|A-Z|0-9]: " that will match one alphanumeric character, a colon, and a space. I wonder how to split the string while keeping the alphanumeric character in the first part of the result. I cannot change the regex because in some cases the string has a special character before the colon and space.
Example (desired results):
line = re.split("[a-z|A-Z|0-9]: ", "A: ") # Result: ['A', '']
line = re.split("[a-z|A-Z|0-9]: ", ":: )5: ") # Result: [':: )5', '']
line = re.split("[a-z|A-Z|0-9]: ", "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
Update:
Actually, my problem is splitting from a review file. Suppose I have a file that every line has this pattern: [title]: [review]. I want to get the title and review, but some of the titles have a special character before a colon and space, and I don't want to match them. However, it seems that the character before a colon and space that I want to match apparently is an alphanumeric one.
You could split using a negative lookbehind that rejects a single colon, or use a character class such as [:)] inside the lookbehind to specify which characters should not occur directly to the left.
(?<!:):[ ]
In parts
(?<!:) Negative lookbehind, assert what is on the left is not a colon
:[ ] Match a colon followed by a space (Added square brackets only for clarity)
For example
import re
pattern = r"(?<!:): "
line = re.split(pattern, "A: ") # Result: ['A', '']
print(line)
line = re.split(pattern, ":: )5: ") # Result: [':: )5', '']
print(line)
line = re.split(pattern, "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
print(line)
Output
['A', '']
[':: )5', '']
['Delicious :)', 'I want to eat this again']
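For the [title]: [review] use case, a review body may itself contain a colon followed by a space, in which case a plain split returns more than two parts. A minimal sketch, assuming only the first qualifying separator should count, is to pass maxsplit=1. The names `SEPARATOR` and `split_review` are hypothetical.

```python
import re

# Same lookbehind pattern as above: a colon not preceded by another colon,
# followed by a space.
SEPARATOR = re.compile(r"(?<!:): ")

def split_review(line):
    """Split '[title]: [review]' at the first qualifying separator only."""
    parts = SEPARATOR.split(line, maxsplit=1)
    if len(parts) != 2:
        return None  # no separator found on this line
    return parts[0], parts[1]

print(split_review("Delicious :): Note: I want to eat this again"))
print(split_review("no separator here"))
```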
Solution
First of all, as you show in your examples, you need to match characters other than a-zA-Z0-9, so we should just use the . matcher, which matches any character except a newline.
So I think the expression you're looking for might be this one:
(.*?):(?!.*:) (.*)
You can use it like so:
import re
pattern = r"(.*?):(?!.*:) (.*)"
matcher = re.compile(pattern)
txt1 = "A: "
txt2 = ":: )5: "
txt3 = "Delicious :): I want to eat this again"
result1 = matcher.search(txt1).groups() # ('A', '')
result2 = matcher.search(txt2).groups() # (':: )5', '')
result3 = matcher.search(txt3).groups() # ('Delicious :)', 'I want to eat this again')
Explanation
We use capture groups (the parentheses) to get the different parts in the string into different groups, search then finds these groups and outputs them in the tuple.
The (?!.*:) part is called a "negative lookahead". It asserts that no further : follows on the line, so the title is captured up to the last : we find.
Edit
BTW, if, as you mentioned, you have many lines each containing a review, you can use this snippet to get all of the reviews separated by title and body at once:
import re
pattern = r"(.*?):(?!.*:) (.*)\n?"
matcher = re.compile(pattern)
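# The trailing spaces are written explicitly because the pattern requires a space after the colon.
reviews = (
    "A: \n"
    ":: )5: \n"
    "Delicious :): I want to eat this again\n"
)
parsed_reviews = matcher.findall(reviews) # [('A', ''), (':: )5', ''), ('Delicious :)', 'I want to eat this again')]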

Concatenating split quotes

I have a large file (f) with a lot of dialogue. I need a regex that will concatenate the split quotes (i.e. "Hello," Josh said enthusiastically, "I have a question!"), but not delete the middle portion. So, for this example, the output would be, "Hello, I have a question!" and then "Josh said enthusiastically" would be retained somewhere. I think I am on the right track, but haven't found something that works for these specifications. Here is the code I have already tried out:
for line in f:
    re.findall(r'"(.*?)"', line)
    output_file.write(line)
and
split = re.compile(r'''
(,\")
(.*?)
(,)
( )
(")''', re.VERBOSE)
for line in f:
    m = split_quote.match(split)
    if m:
        output_file.write(m.group(1) + m.group(5))
Thank you for any help!
How about something like this?
/(".+?)"(.+?),\s+?"(.+?[.?!]+")/g
Then replace the capture groups in this order:
$1 $3$2.
like so:
m.group(1) + " " + m.group(3) + m.group(2) + "."
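As a rough sketch, the same substitution can be written with re.sub in Python by dropping the JavaScript-style /.../g delimiters and using \1, \2, \3 in the replacement. The names `SPLIT_QUOTE` and `join_split_quote` are hypothetical, and the pattern is assumed to be applied to one line of dialogue at a time.

```python
import re

# The pattern from the answer, without the /.../g delimiters.
SPLIT_QUOTE = re.compile(r'(".+?)"(.+?),\s+?"(.+?[.?!]+")')

def join_split_quote(line):
    """Join the two quoted halves and move the middle part after them."""
    # \1 = first quoted half, \3 = second quoted half, \2 = the middle part.
    return SPLIT_QUOTE.sub(r'\1 \3\2.', line)

line = '"Hello," Josh said enthusiastically, "I have a question!"'
print(join_split_quote(line))
```

Lines that do not contain a split quote are returned unchanged, because re.sub only replaces where the pattern matches.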
Example:
"Hello," Josh said enthusiastically, "I have a question!"
to
"Hello, I have a question!" Josh said enthusiastically.
Explanation:
http://bsite.cc/inoD/Screen%20Shot%202017-01-18%20at%206.01.22%20PM.png
First part matches a ", and then any characters until it sees another quote.
"Hello,"| Josh said enthusiastically, "I have a question!"
Second part matches text in the middle of the quotes, until it reaches a comma (also matches whitespace after comma and the first quote)
"Hello," Josh said enthusiastically, | "I have a question!"
Third group matches until the next quote
"Hello," Josh said enthusiastically, "I have a question!"
Try this regex:
(?<=\")([^\s].*?[^\s])(?=\")|(?<=\")\s(.*?)\s(?=\")
The regex above matches the two strings Hello, and I have a question! in group 1, so you can print them together. The same regex matches the portion Josh said enthusiastically, in group 2, which is handy in case you decide to use it later.
Check out demo: https://regex101.com/r/m7nqnu/1
Here is working Python code:
import re
text = '''"Hello," Josh said enthusiastically, "I have a question!"'''
pattern = r"(?<=\")([^\s].*?[^\s])(?=\")|(?<=\")\s(.*?)\s(?=\")"
print('Group 1:')
for m in re.finditer(pattern, text):
    if m.group(1) is not None:
        print(m.group(1))
print('Group 2:')
for m in re.finditer(pattern, text):
    if m.group(2) is not None:
        print(m.group(2))
Output:
Group 1:
Hello,
I have a question!
Group 2:
Josh said enthusiastically,
As long as there are no quotes within quotes, and all quotes properly match, and the phrase always consists of two quoted parts with an unquoted part in the middle:
parts = [x.strip() for x in re.findall(r'"([^"]+)', text)]
print(parts[0] + " " + parts[2])
# Hello, I have a question!
print(parts[1])
# Josh said enthusiastically,

Extract text after specific character

I need to extract the word after the #
How can I do that? What I am trying:
text="Hello there #bob !"
user=text[text.find("#")+1:]
print(user)
output:
bob !
But the correct output should be:
bob
A regex solution for fun:
>>> import re
>>> re.findall(r'#(\w+)', '#Hello there #bob #!')
['Hello', 'bob']
>>> re.findall(r'#(\w+)', 'Hello there bob !')
[]
>>> (re.findall(r'#(\w+)', 'Hello there #bob !') or [None])[0]
'bob'
>>> print((re.findall(r'#(\w+)', 'Hello there bob !') or [None])[0])
None
The regex above picks up runs of one or more word characters (letters, digits, or underscores) following a '#' character, stopping at the first non-word character.
Here's a regex solution to match one or more non-whitespace characters if you want to capture a broader range of substrings:
>>> re.findall(r'#(\S+)', '#Hello there #bob #!')
['Hello', 'bob', '!']
Note that when the above regex encounters a string like #xyz#abc it will capture xyz#abc in one result instead of xyz and abc separately. To fix that, you can use the negated \s character class while also negating # characters:
>>> re.findall(r'#([^\s#]+)', '#xyz#abc some other stuff')
['xyz', 'abc']
And here's a regex solution to match one or more alphabet characters only in case you don't want any numbers or anything else:
>>> re.findall(r'#([A-Za-z]+)', '#Hello there #bobv2.0 #!')
['Hello', 'bobv']
So you want the word starting after # up to a whitespace?
user=text[text.find("#")+1:].split()[0]
print(user)
bob
EDIT: as @bgstech notes, in cases where the string does not contain a "#", check for it first:
if "#" in text:
    user=text[text.find("#")+1:].split()[0]
else:
    user="something_else_appropriate"
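The check and the extraction can also be combined with re.search, which returns None when nothing matches. A minimal sketch, assuming the wanted word consists of word characters only; `first_tag` and its `default` parameter are hypothetical names.

```python
import re

def first_tag(text, default=None):
    """Return the word following the first '#word', or `default` if there is none."""
    match = re.search(r'#(\w+)', text)
    return match.group(1) if match else default

print(first_tag("Hello there #bob !"))  # bob
print(first_tag("Hello there bob !"))   # None
```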

Matching a string Python

So basically I want a program that will only work if the user types something like "I am sick" or "I am too cool" but will not work if they make a typo like "pi am cool".
Here's what I have so far:
import re
text = input("text here: ")
if re.search("i am", text) is not None:
    print("correct")
So basically, I need help with this: if someone types "Pi am cool", it is currently treated as correct. I do not want that; the text has to contain exactly "i am". Since I am creating an AI bot for a school project, a sentence like "man, I am so cool" should be picked up and print correct, but "Man, TI am so cool", with a typo, should not print correct.
Use \b word boundary anchors:
if re.search(r"\bi am\b", text) is not None:
\b matches points in the text that go from a non-word character to a word character, and vice versa, so a space followed by a letter, or a letter followed by a space.
Because \b in a regular Python string is interpreted as a backspace character, you need to either use a raw string literal (r'...') or escape the backslash ("\\bi am\\b").
You may also want to add the re.IGNORECASE flag (re.I for short) to your search to find both lower and uppercase text:
if re.search(r"\bi am\b", text, re.IGNORECASE) is not None:
Demo:
>>> re.search(r"\bi am\b", 'I am so cool!', re.I).group()
'I am'
>>> re.search(r"\bi am\b", 'WII am so cool!', re.I) is None
True
>>> re.search(r"\bi am\b", 'I ammity so cool!', re.I) is None
True
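Put together, the check could look like the sketch below, using the sentences from the question. The names `I_AM` and `mentions_i_am` are hypothetical.

```python
import re

# re.IGNORECASE makes "i am" also match "I am" and "I AM".
I_AM = re.compile(r"\bi am\b", re.IGNORECASE)

def mentions_i_am(text):
    """Return True only if 'i am' appears as two whole words."""
    return I_AM.search(text) is not None

for sentence in ("man, I am so cool", "Man, TI am so cool", "Pi am cool"):
    print(sentence, "->", "correct" if mentions_i_am(sentence) else "not correct")
```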
