I have a string like below:
"i'm just returning from work. *oeee* all and we can go into some detail *oo*. what is it that happened as far as you're aware *aouu*"
with some junk characters like above (highlighted with '*' marks). All I could observe was that the junk characters come as a bunch of vowels strung together. Now, I need to remove any word that has a space before and after it, consists only of vowels (like oeee, aouu, etc.) and has a length of 2 or more. How do I achieve this in Python?
Currently, I have built a tuple of replacement pairs like ((" oeee "," "),(" aouu "," ")) and send it through a for loop with replace. But if the word is 'oeeee', I need to add a new item to the tuple. There must be a better way.
P.S: there will be no '*' in the actual text. I just put it here to highlight.
You need to use re.sub to do a regex replacement in python. You should use this regex:
\b[aeiou]{2,}\b
which will match a sequence of 2 or more vowels in a word by themselves. We use \b to match the boundaries of the word so it will match at the beginning and end of the string (in your string, aouu) as well as words adjacent to punctuation (in your string, oo). If your text may include uppercase vowels too, use the re.I flag to ignore case:
import re
text = "i'm just returning from work. oeee all and we can go into some detail oo. what is it that happened as far as you're aware aouu"
print(re.sub(r'\b[aeiou]{2,}\b', '', text, flags=re.I))
Output
i'm just returning from work. all and we can go into some detail . what is it that happened as far as you're aware
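If the spaces left behind by the removed words matter, a small variant of the snippet above can consume the whitespace in front of each vowel-only word as well. This is only a sketch: the helper name strip_vowel_junk is made up for illustration, and the pattern still treats any genuine all-vowel word (for example "oui") as junk.

```python
import re

# Hypothetical helper; \s* also eats the whitespace in front of the junk word.
VOWEL_JUNK = re.compile(r'\s*\b[aeiou]{2,}\b', flags=re.I)

def strip_vowel_junk(text):
    """Remove standalone words made of 2+ vowels, plus the spaces before them."""
    return VOWEL_JUNK.sub('', text)

text = ("i'm just returning from work. oeee all and we can go into some "
        "detail oo. what is it that happened as far as you're aware aouu")
print(strip_vowel_junk(text))
```

Because the whitespace is removed together with the word, "detail oo." becomes "detail." rather than "detail .".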
Related
I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.
Given a list of words, I want to print all the words that do not have special characters.
I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.
In my case given a string like: "should print nŌt thìs"
The output should contain should and print but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) gives you the words with special characters - I just want to invert that.
For this particular case, you can simply specify the regular alphabet range in your search:
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']
Of course you can add digits or anything else you want to match as well.
As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:
r"\b(?!\w*[À-ǚ])\w*"
This:
Checks for a word boundary \b, like a space or the start of the input string.
Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
Finally, if it makes it past the lookahead, it matches any word characters.
Demo. Note in regex101.com you must specify Python flavor for \b to work properly with special characters.
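To try the lookahead version with re.findall, here is a minimal sketch. It swaps the trailing \w* for \w+, which is my own tweak rather than part of the pattern above: with \w* the pattern can also match an empty string at a word boundary, so findall may return empty strings alongside the words.

```python
import re

a = "should print nŌt thìs"

# \w+ instead of \w* so the pattern cannot match an empty string at a boundary.
plain_words = re.findall(r"\b(?!\w*[À-ǚ])\w+", a)
print(plain_words)  # ['should', 'print']
```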
There is a third option as well:
r"\b[^À-ǚ\s]*\b"
The middle part [^À-ǚ\s]* means match any character other than special characters or whitespace an unlimited number of times.
I know this is not a regex, but just a completely different idea you may not have had besides using regexes. I suppose it would be also much slower but I think it works:
>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']\
if any(['WITH' in ud.name(letter) for letter in word])]
['Cá', 'Lá']
Or negate the whole test (not any(...)) to get the words without such letters instead.
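A sketch of the reversed filter, applied to the string from the question. The helper name has_accent is hypothetical, and the 'WITH' test is only a heuristic on Unicode character names; it does not cover the apostrophe that the original regex also treats as special.

```python
import unicodedata as ud

def has_accent(word):
    # ud.name(ch, '') returns '' instead of raising for unnamed characters.
    return any('WITH' in ud.name(ch, '') for ch in word)

words = "should print nŌt thìs".split()
plain = [w for w in words if not has_accent(w)]
print(plain)  # ['should', 'print']
```

Note that the negation has to wrap the whole any(...) call: testing 'WITH' not in per letter inside any would keep 'Cá', because its first letter has no accent.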
I'm trying to design a regex pattern that removes words less than 4 characters long. The catch is, any special characters (primarily: !##$%^&*().,;? ) attached to the word e.g. "age?" will not meet the condition for removal, so "hi!!", "you?", "hello boy!" should all be retained from the input string. To illustrate:
string1='my name is jen!'
I tried the regex,
regex1=re.compile(r'\b.{,3}\s')
and coupled it with re.sub:
string2=re.sub(regex1,' ',string1)
and this works great except, 1. I have to sub the pattern with a space, and this sometimes has to be removed manually, and 2. It doesn't work if the 3 character or fewer 'word' is at the end of string.
string1='my name is jen'
re.sub(regex1,' ',string1)
out: ' name jen'
Is there a better regex that can be used? Should I instead try to retain 'words' that are 4 characters or more?
You can use the following regex:
\b\w{1,3}(?=\s|$)\s*
in your python code:
$ cat words3.py
import re
string1='my name is jen!'
print(re.sub(r'\b\w{1,3}(?=\s|$)\s*','',string1))
output:
name jen!
demo: https://regex101.com/r/ZEzYtM/3/
Note: I have only taken into account punctuation and special characters attached to the end of the word.
If you want to avoid the removal of words like !ab then use:
(?<=\s)\b\w{1,3}(?=\s|$)\s*
demo: https://regex101.com/r/ZEzYtM/4
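One thing to watch with the (?<=\s) version: it needs a whitespace character in front of the word, so a short word at the very start of the string (like "my" in the example) is not removed. A possible alternative, offered as a sketch and not as part of the answer above, uses negative lookarounds on \S, which also hold at the start and end of the string. The name drop_short_words is made up for illustration.

```python
import re

# (?<!\S) = "not preceded by a non-space", true at the start of the string too.
# (?!\S)  = "not followed by a non-space", true at the end of the string too.
SHORT_WORD = re.compile(r'(?<!\S)\w{1,3}(?!\S)\s*')

def drop_short_words(s):
    return SHORT_WORD.sub('', s)

print(drop_short_words('my name is jen!'))       # name jen!
print(drop_short_words('!ab is ok hello boy!'))  # !ab hello boy!
```

A trailing space can remain when the last word is removed; calling .strip() on the result takes care of that.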
I've a group of strings like following:
a phrase containing spaces
A sentence contains spaces as well, but end by period.
I'd like to find a regular expression that matches the spaces (like [ \t\f]) in the 2nd line, which ends with '.'.
I've looked around and found no solution. So I come here for help.
I am using Python, but do not mind knowing a PCRE solution even if it's not possible in Python.
I came up with a regex, but it could not exclude the first line.
my regex
Here is a regex pattern which, if applied repeatedly to every line, should be able to match spaces in that line, assuming the line ends with period:
\s+(?=.*\.$)
Demo
Here is my attempt at a Python script. I don't print the space when a match is found, because we can't see it. Instead, I print something visible:
import re
# "line" rather than "input", to avoid shadowing the built-in input()
line = 'A sentence contains spaces as well, but end by period.'
spaces = re.findall(r'\s+(?=.*\.$)', line)
for space in spaces:
    print('found a space')
found a space (printed 9 times)
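If both lines live in one multi-line string instead of being fed to the regex one at a time, a sketch along the following lines may work. Two changes are my own assumptions: [ \t\f] replaces \s so that the newline between the lines is never counted as a space, and re.M makes $ match at the end of every line.

```python
import re

text = ("a phrase containing spaces\n"
        "A sentence contains spaces as well, but end by period.")

# [ \t\f] instead of \s so the newline between the lines is never matched;
# re.M lets $ match at the end of every line, not only at the end of the text.
pattern = re.compile(r'[ \t\f]+(?=.*\.$)', flags=re.M)

matches = list(pattern.finditer(text))
print(len(matches))  # 9
```

Since . does not match a newline by default, the lookahead cannot reach into the next line, so the spaces of the first line are skipped.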
I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.
For instance for an original string:
##%%.Hol$a.A.$%
I would like to get .Hol$a.A. as the result: the punctuation is removed from the beginning and end, but not from the middle of the word.
Another example could be for the string:
##%%...&Hol$a.A....$%
In this case the returned string should be ..&Hol$a.A.... because we do not care if the allowed characters are repeated.
The idea is to remove all punctuation (except the dot) just at the beginning and end of the word. A word is defined as \w and/or a .
A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the '
How to accomplish the goal using Regex?
Use this simple and easily adaptable regex:
[\w.].*[\w.]
It will match exactly your desired result, nothing more.
[\w.] matches any alphanumeric character and the dot
.* matches any character (except newline normally)
[\w.] matches any alphanumeric character and the dot
To change the delimiters, simply change the set of allowed characters inside the [] brackets.
Check this regex out on regex101.com
import re
data = '##%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.
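Note that [\w.].*[\w.] needs at least two allowed characters, so re.search returns None for something like '#a#'. An alternative sketch, swapping in re.sub with anchors instead of the search above, removes the unwanted runs at either end and leaves everything else alone. The function name strip_edges is hypothetical.

```python
import re

def strip_edges(s):
    # Remove runs of characters that are neither \w nor '.' at either end only.
    return re.sub(r'^[^\w.]+|[^\w.]+$', '', s)

print(strip_edges('##%%.Hol$a.A.$%'))  # .Hol$a.A.
print(strip_edges('"Barnes&Nobles"'))  # Barnes&Nobles
print(strip_edges('#a#'))              # a
```

As with the pattern above, changing the allowed characters only means editing the set inside the brackets.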
Depending on what you mean by stripping the punctuation, you can adapt the following code:
import re
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "##%%.Hol$a.A.$%")
mystr = res.group(1)
This will strip everything before the first dot and after the last dot in the string.
Warning: re.search returns None if the string doesn't match, so check the result before calling group.
Might be a bit messy title, but the question is simple.
I got this in Python:
string = "start;some;text;goes;here;end"
The start; and end; words are always at the same position in the string.
I want the second word which is some in this case. This is what I did:
import re
string = "start;some;text;goes;here;end"
word = re.findall("start;.+?;", string)
In this example, there might be a few things to modify to make it more appropriate, but in my actual code, this is the best way.
However, the string I get back is start;some;, where the search characters themselves are included in the output. I could index both ; characters and extract the middle part, but there has to be a way to get only the actual word, without the extra junk?
No need for regex in my opinion, but all you need is a capture group here.
word = re.findall("start;(.+?);", string)
Another improvement I'd like to suggest is not using the dot (.). Rather, be more specific: what you are looking for is simply anything other than ;, the delimiter.
So I'd do this:
word = re.findall("start;([^;]+);", string)
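Keep in mind that re.findall returns a list (here ['some']), not a bare string. A short sketch of two ways to get the word itself, one with re.search and one without regex at all, assuming the fields are always separated by ;:

```python
import re

string = "start;some;text;goes;here;end"

# Capture group: search() returns a match object, group(1) is the captured part.
match = re.search(r"start;([^;]+);", string)
word = match.group(1) if match else None
print(word)  # some

# Without regex: split on the delimiter and take the second field.
fields = string.split(";")
print(fields[1])  # some
```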