I don't do much text searching and have not been able to find the regex that matches all words starting with T and ending with T in a text file where each word is on its own line. I have tried a number of suggestions from searches; the following finds all words that start with T and contain another T later. However, I also want the LAST letter to be T, irrespective of how many T's occur in between. Apologies if this is actually trivial, but after every combination I could find I still have no result. I am unsure why r'^T.*T$' doesn't work.
import re

with open('/Users/../words.txt') as f:
    passage = f.read()
words = re.findall(r'T.+T', passage)
print(words)
I'd use this expression:
re.findall(r"\bT\w*?T\b", s)
It uses a word boundary (\b) on each side.
It uses any number of \w to avoid matching spaces in between.
It uses "non-greedy" mode (maybe not that useful here, since the word boundary already does the job).
Use word boundary anchor \b and non-whitespace character \S:
words = re.findall(r'\bT\S+T\b', passage)
This will also match words such as Trust-TesT, Tough&FasT, etc.
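As for why r'^T.*T$' returns nothing: by default ^ and $ anchor only at the start and end of the whole string, and f.read() returns the file as one multi-line string. The sketch below illustrates the difference that re.MULTILINE makes. The passage value is a made-up sample standing in for the contents of words.txt, not the asker's data.

```python
import re

# Hypothetical sample standing in for words.txt (one word per line).
passage = "TEST\nTOAST\nTREE\nSTART\nTT\nTRUST-TEST\n"

# Default mode: ^ and $ refer to the whole string, so nothing matches.
no_flag = re.findall(r'^T.*T$', passage)

# re.MULTILINE lets ^ and $ match at the start and end of every line.
multiline = re.findall(r'^T.*T$', passage, flags=re.MULTILINE)

# The word-boundary pattern from the answers needs no flag, but it
# splits hyphenated entries because "-" is not a word character.
boundary = re.findall(r'\bT\w*T\b', passage)

print(no_flag)    # []
print(multiline)  # ['TEST', 'TOAST', 'TT', 'TRUST-TEST']
print(boundary)   # ['TEST', 'TOAST', 'TT', 'TRUST', 'TEST']
```

Which variant fits depends on whether a line such as TRUST-TEST should count as one word or two.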
I have a string like below:
"i'm just returning from work. *oeee* all and we can go into some detail *oo*. what is it that happened as far as you're aware *aouu*"
with some junk characters like the ones above (highlighted with '*' marks). All I could observe was that the junk characters come as a bunch of vowels strung together. Now I need to remove any word that has a space before and after it, contains only vowels (like oeee, aouu, etc.), and has a length of 2 or more. How do I achieve this in Python?
Currently, I have built a tuple of replacement pairs like ((" oeee "," "),(" aouu "," ")) and pass it through a for loop with replace. But if the word is 'oeeee', I need to add a new item to the tuple. There must be a better way.
P.S: there will be no '*' in the actual text. I just put it here to highlight.
You need to use re.sub to do a regex replacement in python. You should use this regex:
\b[aeiou]{2,}\b
which will match a sequence of 2 or more vowels in a word by themselves. We use \b to match the boundaries of the word so it will match at the beginning and end of the string (in your string, aouu) as well as words adjacent to punctuation (in your string, oo). If your text may include uppercase vowels too, use the re.I flag to ignore case:
import re
text = "i'm just returning from work. oeee all and we can go into some detail oo. what is it that happened as far as you're aware aouu"
print(re.sub(r'\b[aeiou]{2,}\b', '', text, flags=re.I))
Output
i'm just returning from work. all and we can go into some detail . what is it that happened as far as you're aware
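The substitution above removes the vowel runs but leaves the space that preceded each one, so double spaces and a stray space before punctuation can remain. One possible variation, offered as a sketch rather than as the answer's method, also consumes the whitespace in front of each run. The \s* prefix is an addition of mine, not part of the original pattern.

```python
import re

text = ("i'm just returning from work. oeee all and we can go into some detail oo. "
        "what is it that happened as far as you're aware aouu")

# \s* swallows the whitespace before each all-vowel word (an added assumption);
# flags=re.I makes the vowel class match uppercase vowels as well.
cleaned = re.sub(r'\s*\b[aeiou]{2,}\b', '', text, flags=re.I)

print(cleaned)
```

Note that either pattern would also delete a genuine word made only of vowels, such as "eau", so a whitelist may be needed for real text.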
I have been fixing some mistakes across all of my procedures. Now I need to find the last "end;" in a procedure and replace it with other text.
I wrote something like: (\s.*)(end|END)(.*(;).*)
But it does not work correctly: it also replaces some words in the middle of the text. I am using the re library from Python.
You can use
result = re.sub(r'(?si)(.*)\bend\b', r'\g<1>some other word', text)
The regex matches
(?si) - an inline re.DOTALL (s) and re.IGNORECASE (i) modifier
(.*) - Group 1: any zero or more chars as many as possible
\bend\b - the whole word end.
The \g<1>some other word replacement is the Group 1 value (I used \g<1> since it will be helpful if your some other word starts with a digit) plus your word.
NOTE: if your some other word can contain literal backslashes, do not forget to double them.
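A small sketch of that substitution in action. The procedure text and the replacement word finish are invented for illustration; only the pattern comes from the answer.

```python
import re

# Hypothetical procedure body with a nested "end;" before the final one.
text = (
    "procedure demo;\n"
    "begin\n"
    "  if ok then\n"
    "  begin\n"
    "    run;\n"
    "  end;\n"
    "end;\n"
)

# The greedy (.*) runs to the end of the text and backtracks to the LAST
# whole word "end"; \g<1> puts everything before it back unchanged.
result = re.sub(r'(?si)(.*)\bend\b', r'\g<1>finish', text)

print(result)
```

Only the final end is replaced; the nested one inside the if block is left as it was.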
I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that contain Tom Thumb.
You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (including or excluding line break characters, depending on the engine) followed by Thumb.
Tom(?!\s+Thumb) is what you are looking for.
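The two main variants behave differently, which a short comparison may make clearer. This is a sketch using Python's re module and invented sample lines; the question did not say which regex flavor was in use.

```python
import re

# Hypothetical sample lines.
lines = [
    "Tom Thumb went home",
    "Tom Sawyer went home",
    "Tom met Thumb",
    "Tomcat",
]

# Rejects Tom only when Thumb follows immediately after whitespace.
directly_after = [s for s in lines if re.search(r'Tom(?!\s+Thumb)', s)]

# Rejects a whole word Tom when a whole word Thumb appears anywhere later.
anywhere_after = [s for s in lines if re.search(r'\bTom\b(?!.*\bThumb\b)', s)]

print(directly_after)  # ['Tom Sawyer went home', 'Tom met Thumb', 'Tomcat']
print(anywhere_after)  # ['Tom Sawyer went home']
```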
I am a beginner and have spent considerable amount of time on this. I was partially able to solve it.
Problem: I want to ignore all words that contain either the or The. E.g. atheist, others, The, the will be excluded. However, hottie shouldn't be excluded, because the doesn't occur inside the word as a contiguous unit.
I am using Python's re engine.
Here's my regex:
\b - Start at word boundary
(?! - Negative lookahead to avoid starting with the or The
[t|T]he - the and The
)
\w+ - Other letters are fine
(?<! - Negative look behind
[t|T]he - the or The shouldn't occur before \w+
)
\b - Word boundary
Expected output for a given input:
Input: Atheist Others Their Hello the The bathe hottie tahaie theater
Expected Output: Hello hottie tahaie
As one can see in regex101, I am able to exclude most of the words except words like atheist--i.e. cases when the or The appear inside words. I searched for this on SO and found some threads such as How to exclude specific string using regex in Python?, but they don't seem to be directly related to what I am trying to do.
Any help will be greatly appreciated.
Please note that I am interested in solving this problem only using regex. I am not looking for solutions using python's string manipulation.
The approach is simpler than your original regular expression:
\b(?!\w*[tT]he)\w+\b
We match a word, but make sure that there is no the within the word by using a "padded" negative lookahead. Your original approach only disallowed the at the front or the back of the word, as it allowed for no padding after/before the word boundary. Note also that | is a literal character inside a character class, so [tT] is what you want rather than [t|T].
(?![tT]he) only checks at the current position, while (?!\w*[tT]he) allows the check to extend from the current position, because the \w* can be used as filler.
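Running both patterns over the input from the question shows the effect of the padding. This is a sketch in Python's re; the variable names are my own.

```python
import re

text = "Atheist Others Their Hello the The bathe hottie tahaie theater"

# The asker's pattern: only checks the start and the end of each word.
edges_only = re.findall(r'\b(?![t|T]he)\w+(?<![t|T]he)\b', text)

# The padded lookahead: \w* lets the check reach "the" anywhere in the word.
padded = re.findall(r'\b(?!\w*[tT]he)\w+\b', text)

print(edges_only)  # ['Atheist', 'Others', 'Hello', 'hottie', 'tahaie']
print(padded)      # ['Hello', 'hottie', 'tahaie']
```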
I have solved a lot of questions by reading your posts, but now I'm stuck on the following.
My problem is that I can't get an exact match of a given word in my txt file.
I wrote the following:
import re

for word in listtweet:
    # print(word)
    pattern = re.compile(r'\b%s\b' % word)
    with open('testsentiwords_fullTotal_clean1712.txt', 'r') as f:
        for n, line in enumerate(f):
            if pattern.search(line):
                print('found word:', word, 'in line', line)
My output is partly correct:
found word dirty in line '-0.458333333333', 'dirty'
But i also get:
found word dirty in line '-0.5', 'dirty-minded'
found word dirty in line '-0.625', 'dirty-faced'
I only want to get the exact match and nothing more!
Any help, please?
Try this pattern:
pattern=re.compile(r'[^-a-zA-Z]%s[^-a-zA-Z]' %(word))
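The problem with your pattern is that \b treats '-' as a non-word character, so it finds a word boundary between dirty and the '-' that follows it. Note that each character class here has to match one actual character, so the word must have some character on both sides of it.
If you need numbers in your word, you can add 0-9 to this pattern.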
pattern=re.compile(r'[^-a-zA-Z0-9]%s[^-a-zA-Z0-9]' %(word))
If the print output you provide shows the actual lines in the file (where the word you're looking for is always enclosed in single quotes), I think that your re pattern should look like
p = re.compile(r"'%s'" % target_word)
so results would be something like:
>>> p = re.compile(r"'%s'" % "dirty")
>>> p.search("'12345', 'dirty'")
<_sre.SRE_Match object at 0x631b10>
>>> p.search("'12345', 'dirty-faced'")
>>>
Firstly, switch from using \b for word boundaries to [^-a-zA-Z], since \b sees a boundary next to -. Secondly, if you have long lines, consider using the in keyword first:
if word in line and pattern.search(line):
That way Python can do a fast substring check for the word before invoking the regular expression engine. This should speed things up for large files where most lines do not match at all.
Thirdly, fix your code sample: printing line prints the line contents, whereas printing n (or better yet str(n), to convert it to a string) prints the line number.
Fourth, consider using grep instead:
grep -nwf needles_on_separate_lines haystack.txt
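This does whole-word matching for every needle and is far faster than Python. Be aware, though, that grep's -w also treats - as a non-word character, so dirty would still match in dirty-minded.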
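The first three suggestions can be combined into one short sketch. The function name find_exact and the sample lines are hypothetical, and re.escape is my addition so that words containing regex metacharacters are matched literally.

```python
import re

def find_exact(word, lines):
    """Return (line_number, line) pairs where word is not glued to letters or '-'."""
    pattern = re.compile(r'[^-a-zA-Z]%s[^-a-zA-Z]' % re.escape(word))
    hits = []
    for n, line in enumerate(lines, start=1):
        # Cheap substring test first; the regex only runs on candidate lines.
        if word in line and pattern.search(line):
            hits.append((n, line))
    return hits

# Hypothetical lines in the format shown in the question's output.
sample_lines = [
    "'-0.458333333333', 'dirty'",
    "'-0.5', 'dirty-minded'",
    "'-0.625', 'dirty-faced'",
    "'0.25', 'clean'",
]

hits = find_exact("dirty", sample_lines)
for n, line in hits:
    print("found word: dirty on line", n, ":", line)
```

To scan a real file, the open file object can be passed in place of sample_lines.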
Your problem is that \b matches at word boundaries. These are defined as "a position between an alphanumeric character and a non-alphanumeric character".
So \bdirty\b will match dirty in the string This is dirty! but not in dirtying your clothes. So far so good, but since - is also a non-alphanumeric character, \b will also trigger in dirty-minded as you observed.
What you therefore need to do is to think about what characters you do not want to allow as word-separators. If it's only the dash, you could add another pair of assertions to exclude those:
r"(?<!-)\b%s\b(?!-)" % word
If you want to add more characters to exclude as valid word boundaries, for example the apostrophe, use a character class:
r"(?<!['-])\b%s\b(?!['-])" % word
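A quick check of both assertions against lines shaped like the asker's data. This is a sketch with invented sample lines and an added re.escape call. One thing it suggests: because the file wraps each word in single quotes, the apostrophe variant would reject the very entry being searched for, so the dash-only form looks like the better fit for this file.

```python
import re

word = "dirty"

# Dash-only exclusion, and the variant that also excludes apostrophes.
dash_only = re.compile(r"(?<!-)\b%s\b(?!-)" % re.escape(word))
dash_or_quote = re.compile(r"(?<!['-])\b%s\b(?!['-])" % re.escape(word))

# Hypothetical sample lines; the first three mimic the question's output.
sample_lines = [
    "'-0.458333333333', 'dirty'",
    "'-0.5', 'dirty-minded'",
    "'-0.625', 'dirty-faced'",
    "this is dirty!",
]

dash_hits = [line for line in sample_lines if dash_only.search(line)]
quote_hits = [line for line in sample_lines if dash_or_quote.search(line)]

print(dash_hits)   # ["'-0.458333333333', 'dirty'", 'this is dirty!']
print(quote_hits)  # ['this is dirty!']
```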