Find a match word and printing letters by their sides - python

I am trying to find a word on a string, match it with a query word, and then print them with some of their neighboring letters, like this:
input = aaxxYYxxaa
match = YY
requested_output = xxYYxx
So far I have tried with the Regex module, but I cannot go beyond the ‘match’ part:
import re
teststring = "aaxxYYxxaa"
word = re.findall (r"YY", teststring)
print(word)
output = YY
What could I do here to print the letters on each side of the ‘YY’ word?
Thank you.

It looks as if you want to match any 0 to 2 chars before and after the YY value. Add .{0,2} on both sides of the pattern:
re.findall(r".{0,2}YY.{0,2}", teststring)
See the regex demo and a Python demo:
import re
teststring = "aaxxYYxxaa"
word = re.findall (r".{0,2}YY.{0,2}", teststring)
print(word) # => ['xxYYxx']

You would write your regex in a way that matches arbitrary characters before and after your known search term.
. matches any character
{m,n} repeats at least m times and at most n times
So to match xxYYxx you would write .{2,2}YY.{2,2} (which is the same as .{2}YY.{2}).
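As a minimal sketch of the two quantifier choices above (the helper name with_context and its parameters are made up for illustration, they are not part of the re module): {0,n} still matches when fewer than n characters are available next to the word, while {n} demands exactly n on each side.

```python
import re

def with_context(text, word, n, exact=False):
    """Return each occurrence of `word` with up to (or exactly) n chars per side."""
    # Build either ".{n}" or ".{0,n}"; braces are doubled so they stay literal.
    quant = f"{{{n}}}" if exact else f"{{0,{n}}}"
    # re.escape keeps any regex metacharacters in `word` literal.
    pattern = f".{quant}{re.escape(word)}.{quant}"
    return re.findall(pattern, text)

print(with_context("aaxxYYxxaa", "YY", 2))        # ['xxYYxx']
print(with_context("xYYx", "YY", 2))              # ['xYYx']
print(with_context("xYYx", "YY", 2, exact=True))  # []
```

Note that . does not match a newline unless re.DOTALL is passed, so the context stops at line breaks in this sketch.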

Related

Python regex take characters around word

I would like to include 5 characters before and after a specific word is matched in my regex query. Those words are in a list and I iterate over it.
See example below, this is what I tried:
import re
text = "This is an example of quality and this is true."
words = ['example', 'quality']
words_around = []
for word in words:
    neighbors = re.findall(fr'(.{0,5}{word}.{0,5})', str(text))
    words_around.append(neighbors)
print(words_around)
The output is empty. I would expect an array containing ['s an example of q', 'e of quality and ']
You can use the PyPI regex module here, which allows lookbehind patterns of variable (even unbounded) length:
import regex
import pandas as pd
words = ['example', 'quality']
df = pd.DataFrame({'col':[
"This is an example of quality and this is true.",
"No matches."
]})
rx = regex.compile(fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))')
def extract_regex(s):
    return ["".join(x) for x in rx.findall(s)]
df['col2'] = df['col'].apply(extract_regex)
Output:
>>> df
                                               col                                    col2
0  This is an example of quality and this is true.  [s an example of q, e of quality and ]
1                                      No matches.                                      []
Both the pattern and how it is used are of importance.
The fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))' part defines the regex pattern. This is a raw f-string literal: the f prefix makes it possible to use variables inside the string literal, but it also requires doubling all literal braces inside it. Given the current words list, the pattern looks like (?<=(.{0,5}))(example|quality)(?=(.{0,5})), see its demo online. It captures 0-5 chars before the word inside a positive lookbehind, then captures the word itself, and then captures the next 0-5 chars inside a positive lookahead (lookarounds are used so that overlapping matches are still found).
The ["".join(x) for x in rx.findall(s)] part joins the groups of each match into a single string, and returns a list of matches as a result.
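To see why the question's original pattern returned nothing, here is a small sketch using only the standard re module. It prints what the f-string actually produces with single versus doubled braces. The variable names broken and fixed are just for illustration, and this version searches one word at a time, so unlike the lookaround pattern it does not handle overlapping matches across several words.

```python
import re

text = "This is an example of quality and this is true."
word = "example"

# Inside an f-string, {0,5} is evaluated as the Python tuple (0, 5).
broken = fr'.{0,5}{word}.{0,5}'
# Doubled braces are emitted as literal { and }, giving a real quantifier.
fixed = fr'.{{0,5}}{word}.{{0,5}}'

print(broken)                    # .(0, 5)example.(0, 5)
print(fixed)                     # .{0,5}example.{0,5}
print(re.findall(broken, text))  # []
print(re.findall(fixed, text))   # ['s an example of q']
```

If the words may contain regex metacharacters, wrapping each one in re.escape before interpolating it is a sensible precaution.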

Python Regex - get words around match

I want to get the words before and after my match. I could use string.split(' ') - but as I already use regex, isn't there a much better way using only regex?
Using a match object, I can get the exact location. However, this location is character indexed.
import re
myString = "this. is 12my90\nExample string"
pattern = re.compile(r"(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
m = pattern.search(myString)
print("Hit: "+m.group())
print("Indix range: "+str(m.span()))
print("Words around match: "+myString[m.start()-1:m.end()+1]) # should be +/-1 in _words_, not characters
Output:
Hit: 12my90
Indix range: (9, 15)
Words around match: 12my90
For getting the matching word and the word before, I tried:
pattern = re.compile(r"(\b(w+)\b)\s(\b12(\w+)90\b)",re.IGNORECASE |
re.UNICODE)
Which yields no matches.
In the second pattern the backslash is missing: you have to write \w+ instead of w+.
Apart from that, there is a newline in your example, which you can match with another \s after the hit.
Your pattern with 3 capturing groups might look like
(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)
Regex demo
You could use the capturing groups to get the values
print("Words around match: " + m.group(1) + " " + m.group(3))
The newline character is missing from your pattern:
regx = r"(\w+)\s12(\w+)90\n(\w+)"
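Putting the suggestions above together, a possible sketch uses \s+ for both gaps (so it does not matter whether the separator is a space or a newline) and named groups for readability. The group names before, hit and after are my own choice, not something the question used.

```python
import re

my_string = "this. is 12my90\nExample string"

# \s+ covers the space before the hit as well as the newline after it.
pattern = re.compile(
    r"(?P<before>\w+)\s+(?P<hit>12\w+90)\s+(?P<after>\w+)",
    re.IGNORECASE,
)

m = pattern.search(my_string)
if m:
    print("Hit:", m.group("hit"))                                   # Hit: 12my90
    print("Words around match:", m.group("before"), m.group("after"))  # is Example
```

This only finds a hit that has a word on both sides; a hit at the very start or end of the string would need the surrounding groups to be made optional.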

repeated pattern in regex

I am trying to catch a repeated pattern in my string. The subpattern starts with the beginning of a word or ":" and ends with ":" or the end of a word. I tried findall and search in combination with multiple matching ((subpattern)__(subpattern))+ but was not able to work out what is wrong:
cc = "GT__abc23_1231:TF__XYZ451"
import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)
Expected output:
GT, abc23_1231, TF, XYZ451
I saw a bunch of questions like this, but it did not help.
It seems you can use
(?:[^_:]|(?<!_)_(?!_))+
See the regex demo
Pattern details:
(?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
[^_:] - any character but _ and :
(?<!_)_(?!_) - a single _ not enclosed with other _s
Python demo with re based solution:
import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']
If the first character is never a : or _, you may use an unrolled regex like:
r'[^_:]+(?:_(?!_)[^_:]*)*'
It won't match values that start with a single _ though (so the first, non-unrolled regex is safer).
Use the lowest common denominator of "starts and ends with a : or a word boundary", that is, the word boundary (your substrings are composed of word characters):
>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[('GT', 'abc23_1231'), ('TF', 'XYZ451')]
Testing whether there are : characters around the substrings is unnecessary.
(Note: no need to add a \b after \w+, since the quantifier is greedy, the word-boundary becomes implicit.)
[EDIT]
According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you don't need regex at all:
>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
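The question asked for a flat result (GT, abc23_1231, TF, XYZ451) rather than pairs. A minimal sketch of two ways to get there, assuming the separators are always exactly ":" and "__" (the variable names flat and same are illustrative):

```python
import re

cc = "GT__abc23_1231:TF__XYZ451"

# Split on ":" first, then on "__", and flatten in one comprehension.
flat = [part for chunk in cc.split(":") for part in chunk.split("__")]

# Equivalent single call: split on either separator.
same = re.split(r":|__", cc)

print(", ".join(flat))  # GT, abc23_1231, TF, XYZ451
print(same == flat)     # True
```

A single underscore, as in abc23_1231, is left alone by both variants because only the double underscore is treated as a separator.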

Regex pattern to extract substring

mystring = "q1)whatq2)whenq3)where"
want something like ["q1)what", "q2)when", "q3)where"]
My approach is to find the q\d+\) pattern then move till I find this pattern again and stop. But I'm not able to stop.
I did req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?
You could try the code below, which uses the re.findall function:
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
q\d+\) matches a q followed by one or more digits and then a ) symbol.
(?:(?!q\d+).)* matches any character, zero or more times, as long as that character does not start a q\d+ sequence (this is what the negative lookahead checks before each character is consumed).
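For comparison, here is a sketch of an alternative formulation that is not from the answer above: a lazy .*? that stops as soon as the next q\d+\) marker (or the end of the string) is ahead. On the sample input it returns the same list as the lookahead-per-character pattern.

```python
import re

s = "q1)whatq2)whenq3)where"

# Pattern from the answer: consume a char only if it does not start q\d+.
tempered = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)

# Alternative sketch: expand lazily until the next marker or end of string.
lazy = re.findall(r'q\d+\).*?(?=q\d+\)|$)', s)

print(tempered)          # ['q1)what', 'q2)when', 'q3)where']
print(lazy == tempered)  # True
```

Both variants rely on . not crossing newlines; pass re.DOTALL if the text between markers can span several lines.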

Regex divide with upper-case

I would like to replace strings like 'HDMWhoSomeThing' to 'HDM Who Some Thing' with regex.
So I would like to extract words which start with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in fact the first letter of the word Who, and should not be included in the word HDM.
What is the correct regex to achieve this goal? I have tried many regex' similar to [A-Z][a-z]+ but without success. The [A-Z][a-z]+ gives me 'Who Some Thing' - without 'HDM' of course.
Any ideas?
Thanks,
Rukki
#! /usr/bin/env python
import re
from collections import deque
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while len(chunks):
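    buf = chunks.popleft()
    if len(buf) == 0:
        # re.split yields empty strings between adjacent matches; skip them
        continue
    if re.match(r'^[A-Z]$', buf) and len(chunks):
        # a lone capital is the first letter of the next chunk, so rejoin them
        buf += chunks.popleft()
    result.append(buf)
print(' '.join(result))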
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit with re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print(' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX')))
Output:
HDM Who Some MONKEY Thing X
Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
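In Python, the replacement approach just described could be sketched as follows. The function name space_out is made up for illustration, and Python writes the group reference as \1 rather than $1.

```python
import re

def space_out(s):
    # Put a space before every capital that is not followed by another capital,
    # then drop the leading space that appears when the string starts with one.
    return re.sub(r'([A-Z])(?![A-Z])', r' \1', s).strip()

print(space_out('HDMWhoSomeThing'))          # HDM Who Some Thing
print(space_out('HDMWhoSomeThing').split())  # ['HDM', 'Who', 'Some', 'Thing']
```

One caveat of this pattern: a capital at the very end of the string also satisfies the lookahead, so input that ends in an all-caps run (for example 'SomeHDM') is not split the way the question wants.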
One-liner:
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using regexp
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))
So 'words' in this case are:
Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
One uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
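A quick way to try that pattern is with re.findall. This is only a sketch, and the helper name split_caps is an assumption for the example.

```python
import re

# All-caps run not followed by a lowercase letter, or one capital plus lowercase.
CAPS_WORDS = re.compile(r'[A-Z]+(?![a-z])|[A-Z][a-z]*')

def split_caps(s):
    return ' '.join(CAPS_WORDS.findall(s))

print(split_caps('HDMWhoSomeThing'))           # HDM Who Some Thing
print(split_caps('SomeHDMWhoThing'))           # Some HDM Who Thing
print(split_caps('HDMWhoSomeMONKEYThingXYZ'))  # HDM Who Some MONKEY Thing XYZ
```

The lookahead makes the greedy [A-Z]+ give back its last capital whenever that capital starts a normal word, which is how HDMW becomes HDM followed by Who.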
Maybe '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re
def find_stuff(text):  # parameter renamed from str to avoid shadowing the built-in
    p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    m = p.findall(text)
    result = ''
    for x in m:
        result += x + ' '
    print(result)
find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing
