I would like to include 5 characters before and after a specific word is matched in my regex query. Those words are in a list and I iterate over it.
See example below, this is what I tried:
import re
text = "This is an example of quality and this is true."
words = ['example', 'quality']
words_around = []
for word in words:
    neighbors = re.findall(fr'(.{0,5}{word}.{0,5})', str(text))
    words_around.append(neighbors)
print(words_around)
The output is empty. I would expect an array containing ['s an example of q', 'e of quality and ']
You can use the PyPI regex module here, which allows lookbehind patterns of variable length:
import regex
import pandas as pd
words = ['example', 'quality']
df = pd.DataFrame({'col':[
"This is an example of quality and this is true.",
"No matches."
]})
rx = regex.compile(fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))')
def extract_regex(s):
    return ["".join(x) for x in rx.findall(s)]
df['col2'] = df['col'].apply(extract_regex)
Output:
>>> df
col col2
0 This is an example of quality and this is true. [s an example of q, e of quality and ]
1 No matches. []
Both the pattern and how it is used are of importance.
The fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))' part defines the regex pattern. This is a "raw" f-string literal: the f prefix makes it possible to use variables inside the string literal, but it also requires doubling all literal braces inside it. The pattern, given the current words list, looks like (?<=(.{0,5}))(example|quality)(?=(.{0,5})); see its demo online. It captures 0-5 chars before the words inside a positive lookbehind, then captures the words, and then captures the next 0-5 chars in a positive lookahead (lookarounds are used to make sure any overlapping matches are found).
The ["".join(x) for x in rx.findall(s)] part joins the groups of each match into a single string, and returns a list of matches as a result.
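To make the brace doubling concrete, here is a minimal sketch showing what the question's undoubled pattern expands to, followed by one possible standard-library alternative that slices the text around each match instead of using lookarounds. The helper name context and its width parameter are hypothetical and not part of the answer above.

```python
import re

text = "This is an example of quality and this is true."
words = ["example", "quality"]

# Inside an f-string, {0,5} is evaluated as the tuple (0, 5);
# doubling the braces keeps it as a regex quantifier.
broken = fr"(.{0,5}{words[0]}.{0,5})"
fixed = fr"(.{{0,5}}{words[0]}.{{0,5}})"
print(broken)  # (.(0, 5)example.(0, 5))
print(fixed)   # (.{0,5}example.{0,5})

# Alternative sketch with the standard re module: find each word,
# then slice up to `width` characters on both sides of the match.
rx = re.compile("|".join(map(re.escape, words)))

def context(s, width=5):
    return [s[max(0, m.start() - width): m.end() + width]
            for m in rx.finditer(s)]

print(context(text))  # ['s an example of q', 'e of quality and ']
```

Because each slice is taken independently, overlapping contexts are returned just as with the lookaround pattern.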
I am trying to match words separated with the + character as input from a user in python and check if each of the words is in a predetermined list. I am having trouble creating a regular expression to match these words (words are comprised of more than one A-z characters). For example, an input string foo should match as well as foo+bar and foo+bar+baz with each of the words (not +'s) being captured.
So far, I have tried a few regular expressions but the closest I have got is this:
/^([A-z+]+)\+([A-z+]+)$/
However, this only matches the case in which there are two words separated with a +, I need there to be one or more words. My method above would have worked if I could somehow repeat the second group (\+([A-z+]+)) zero or more times. So hence my question is: How can I repeat a capturing group zero or more times?
If there is a better way to do what I am doing, please let me know.
You could write the pattern as:
(?i)[A-Z]+(?:\+[A-Z]+)*$
Explanation
(?i) Inline modifier for case insensitive
[A-Z]+ Match 1+ chars A-Z
(?:\+[A-Z]+)* Optionally repeat matching + and again 1+ chars A-Z
$ End of string
See a regex101 demo for the matches:
For example
import re
predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
pattern = r"(?i)[A-Z]+(?:\+[A-Z]+)*$"
for s in strings:
    m = re.match(pattern, s)
    if m:
        words = m.group().split("+")
        intersect = bool(set(words) & set(predeterminedList))
        fmt = ','.join(predeterminedList)
        if intersect:
            print(f"'{s}' contains at least one of '{fmt}'")
        else:
            print(f"'{s}' contains none of '{fmt}'")
Another option could be to create a dynamic pattern listing the alternatives:
(?i)^(?:[A-Z]+\+)*(?:foo|bar)(?:\+[A-Z]+)*$
Example
import re
predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
pattern = rf"(?i)^(?:[A-Z]+\+)*(?:{'|'.join(predeterminedList)})(?:\+[A-Z]+)*$"
for s in strings:
    m = re.match(pattern, s)
    fmt = ','.join(predeterminedList)
    if m:
        print(f"'{s}' contains at least one of '{fmt}'")
    else:
        print(f"'{s}' contains none of '{fmt}'")
Both will output:
'foo' contains at least one of 'foo,bar'
'foo+bar' contains at least one of 'foo,bar'
'foo+bar+baz' contains at least one of 'foo,bar'
'test+abc' contains none of 'foo,bar'
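On the original question of repeating a capturing group: in Python's re module a repeated group keeps only the text of its last iteration, which is why the examples above validate the whole string first and then split on +. A small sketch of that behavior (the pattern here is written only for illustration):

```python
import re

# The second group sits inside a repeated non-capturing group,
# so only its final iteration is retained.
m = re.fullmatch(r"([A-Za-z]+)(?:\+([A-Za-z]+))*", "foo+bar+baz")
print(m.groups())  # ('foo', 'baz')

# Splitting the validated match recovers every word.
words = m.group().split("+")
print(words)  # ['foo', 'bar', 'baz']
```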
I would recommend slightly different approach using lookarounds:
Pattern: (?<=^|\+)(?=foo|baz)[^+]+
Pattern explanation:
(?<=^|\+) - positive lookbehind - assert that the preceding text is either ^ (beginning of string) or + (our 'word delimiter').
(?=foo|baz) - positive lookahead - assert that the following text matches one of the words (from the predefined list)
[^+]+ - match one or more characters other than +
Regex demo
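A note for Python users: the standard re module requires lookbehinds to have a fixed width, so the pattern as written is assumed to target an engine that accepts alternatives of different widths there. The sketch below shows one possible rewrite for re that moves the alternation outside the lookbehind; the variable names are illustrative only.

```python
import re

s = "foo+bar+baz"

# The lookbehind mixes a zero-width ^ with a one-character \+,
# which the standard re module does not accept.
try:
    re.compile(r"(?<=^|\+)(?=foo|baz)[^+]+")
    compiles_in_re = True
except re.error:
    compiles_in_re = False
print(compiles_in_re)

# Same idea for re: either start of string, or preceded by a plus sign.
rx = re.compile(r"(?:^|(?<=\+))(?=foo|baz)[^+]+")
print(rx.findall(s))  # ['foo', 'baz']
```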
I have a long string of the following form:
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI..."
It is a concatenation of random strings interspersed with runs of consecutive F letters:
ASOGH
FFFFFFFFFFFFFFFFFFF
GFIOSG
FFFFFFFF
URHDHREEK
FFFFFF
IIIEI
The number of consecutive F letters is not fixed, but there will be more than 5,
and let's assume five F letters will not appear consecutively in the random strings.
I want to extract only random strings to get the following list:
random_strings = ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
I imagine there is a simple regex expression that would solve this task:
random_strings = joined_string.split('WHAT_TO_TYPE_HERE?')
Question: how to code a regex pattern for multiple identical characters?
I would use re.split for this task in the following way
import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.split('F{5,}',joined_string)
print(parts)
output
['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
F{5,} denotes 5 or more F
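One edge case worth keeping in mind: if the input starts or ends with a run of F letters, re.split returns empty strings at those positions. The sketch below filters them out; the helper name split_on_f_runs and its min_run parameter are hypothetical.

```python
import re

def split_on_f_runs(s, min_run=5):
    # Build F{5,} (or another minimum) and drop empty pieces that
    # appear when the string begins or ends with a separator run.
    parts = re.split(rf"F{{{min_run},}}", s)
    return [p for p in parts if p]

sample = "F" * 6 + "ABC" + "F" * 6 + "DE" + "F" * 6
print(re.split(r"F{5,}", sample))  # ['', 'ABC', 'DE', '']
print(split_on_f_runs(sample))     # ['ABC', 'DE']
```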
You can split on F{5,} and wrap it in a capture group so that the separator text is also part of the result:
import re
s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
print( re.split(r'(F{5,})', s) )
Output:
['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
I would use a regex find all approach here:
import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.findall(r'F{2,}|(?:[A-EG-Z]|F(?!F))+', joined_string)
print(parts)
This prints:
['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
The regex pattern here can be explained as:
F{2,} match any group of 2 or more consecutive F's (first)
| OR, that failing
(?:
[A-EG-Z] match any non F character
| OR
F(?!F) match a single F (not followed by an F)
)+ all of these, one or more times
I have a string; if the alphabetical part of a word is more than 3 letters long, I want to store it in a list. For the example below, I need to store "hour" and "lalal" in a list.
I wrote a regex pattern for alpha-digit and digit alpha sequences like below.
import re
regex = ["([a-zA-Z])-([0-9])*","([0-9])*-([a-zA-Z])"]
tring = 'f-16 is 1-hour, lalal-54'
d = []
for r in regex:
    m = re.search(r, tring)
    d.append(m.group(0))
print(d)
But this obviously gives me all the alphanumeric patterns which are being stored too. So, I thought I could extend this to count the letters in each pattern and store it differently too. Is that possible?
Edit: Another example would be
tring = 'I will be there in 1-hour'
and the output for this should be ['hour']
So you want to capture alphabetic text only if it is either preceded or followed by a number and a hyphen. You can use this regex, which uses alternation to cover both cases,
([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})
Explanation:
([a-zA-Z]{4,}) - Captures alphabetic text of length four or more and stores it in group1
-\d+ - Ensures it is followed by a hyphen and one or more digits
| - Alternation as there are two cases
\d+- - Matches one or more digits and a hyphen
([a-zA-Z]{4,}) - Captures alphabetic text of length four or more and stores it in group2
Demo
Check this python code,
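import re
s = 'f-16 is 1-hour, lalal-54 I will be there in 1-hours'
d = []
for m in re.finditer(r'([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})', s):
    # Only one of the two groups is set, depending on which alternative matched
    if m.group(1):
        d.append(m.group(1))
    elif m.group(2):
        d.append(m.group(2))
print(d)
# Simpler variant: any run of four or more letters, without checking for the digits and hyphen
s = 'f-16 is 1-hour, lalal-54'
arr = re.findall(r'[a-zA-Z]{4,}', s)
print(arr)
Prints,
['hour', 'lalal', 'hours']
['hour', 'lalal']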
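The same alternation can also be collected in a single expression. This is only a sketch of a more compact form: re.findall returns one (group1, group2) tuple per match, with an empty string for the group that did not participate, so `left or right` picks the captured word.

```python
import re

pattern = re.compile(r'([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})')
s = 'f-16 is 1-hour, lalal-54 I will be there in 1-hours'

# Each tuple has exactly one non-empty element.
d = [left or right for left, right in pattern.findall(s)]
print(d)  # ['hour', 'lalal', 'hours']
```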
I am trying to catch a repeated pattern in my string. The subpattern starts at the beginning of a word or ":" and ends with ":" or the end of a word. I tried findall and search in combination with multiple matching ((subpattern)__(subpattern))+ but was not able to figure out what is wrong:
cc = "GT__abc23_1231:TF__XYZ451"
import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)
Expected output:
GT, abc23_1231, TF, XYZ451
I saw a bunch of questions like this, but it did not help.
It seems you can use
(?:[^_:]|(?<!_)_(?!_))+
See the regex demo
Pattern details:
(?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
[^_:] - any character but _ and :
(?<!_)_(?!_) - a single _ not enclosed with other _s
Python demo with re based solution:
import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']
If the first character is never a : or _, you may use an unrolled regex like:
r'[^_:]+(?:_(?!_)[^_:]*)*'
It won't match values that start with a single _ though (so the first, non-unrolled regex is safer).
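The difference between the two patterns only shows up on values that begin with a single underscore. A short sketch with a made-up input string (the sample "_x__y" is an assumption for illustration, not from the question):

```python
import re

general = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
unrolled = re.compile(r'[^_:]+(?:_(?!_)[^_:]*)*')

s = "GT__abc23_1231:TF__XYZ451"
print(general.findall(s))   # ['GT', 'abc23_1231', 'TF', 'XYZ451']
print(unrolled.findall(s))  # ['GT', 'abc23_1231', 'TF', 'XYZ451']

# A value starting with a single underscore loses it in the unrolled form.
edge = "_x__y"
print(general.findall(edge))   # ['_x', 'y']
print(unrolled.findall(edge))  # ['x', 'y']
```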
Use the smallest common denominator in "starts and ends with a : or a word-boundary", that is the word-boundary (your substrings are composed with word characters):
>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[('GT', 'abc23_1231'), ('TF', 'XYZ451')]
Testing if there are : around is useless.
(Note: no need to add a \b after \w+, since the quantifier is greedy, the word-boundary becomes implicit.)
[EDIT]
According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you don't need regex at all:
>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
I will explain a little, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, which covers your requirement: a ',' with non-numeric characters around it.
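The lookaround behavior described above can be checked with a short sketch. It reuses the q(?=u) example and the split pattern from this answer; the sample strings are chosen only for illustration.

```python
import re

# The lookahead checks for a following "u" without consuming it,
# so only the first "q" matches and the match is just "q".
q_matches = re.findall(r"q(?=u)", "quit qatar")
print(q_matches)  # ['q']

pattern = r"(?<=\D),\s*|\s*,(?=\D)"

# Commas between digits are left alone; the others are split points.
first = re.split(pattern, "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP")
second = re.split(pattern, "Lead,Paints/Pigments,Zinc")
print(first)   # ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
print(second)  # ['Lead', 'Paints/Pigments', 'Zinc']
```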
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertion
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']