I am trying to match a word in a string only if a certain word does not occur within the first 10 characters of the string.
This is my approach:
import re
string_array = ["foo bar\nbaz qux", "baz qux foo bar", "baz qux"]
for string in string_array:
    a = re.search("(?s)^(?!.{0,10}bar).*(qux).*$", string)
    print(a)
I tried this in regex101, but it would still match the entire string even when it contains bar. I also tried a negative lookbehind, but then the first part would have to be a fixed length, which is not possible in my case. What am I doing wrong?
I'd expect to get a match in all except for the first string.
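The negative lookahead (?!.{0,10}bar) makes the match fail for any string in which bar starts within 10 characters of the beginning. A lookahead is zero-width: it consumes nothing, so matching still begins at the first character, not at the 11th. In Python the pattern from the question therefore already returns None for the first string and a match for the other two.
To require bar near the start rather than forbid it, use a positive lookahead (?=...) instead: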
import re
string_array = ["foo bar\nbaz qux", "baz qux foo bar", "baz qux"]
for string in string_array:
    a = re.search("^(?=.{0,10}bar).*?(qux)", string, re.DOTALL)
    print(a)
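For comparison, here is a minimal sketch that runs both lookaheads over the same three strings. The names without_bar and with_bar are mine, not from the posts above. Note that .{0,10}bar lets bar start at any offset from 0 to 10; if bar has to lie entirely inside the first 10 characters, .{0,7}bar would be the tighter bound (that is an assumption about the intent).

```python
import re

strings = ["foo bar\nbaz qux", "baz qux foo bar", "baz qux"]

# Negative lookahead: reject strings where "bar" starts within offsets 0..10.
without_bar = re.compile(r"^(?!.{0,10}bar).*(qux)", re.DOTALL)
# Positive lookahead: require "bar" to start within offsets 0..10.
with_bar = re.compile(r"^(?=.{0,10}bar).*?(qux)", re.DOTALL)

for s in strings:
    # Both lookaheads are zero-width, so matching always starts at offset 0.
    print(repr(s), bool(without_bar.search(s)), bool(with_bar.search(s)))
```

Only the negative version gives the result the question asks for: no match for the first string, a match for the other two.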
Instead of a complex regex-only solution, you can look for every occurrence of bar and check its position using match.start() and match.end():
import re
arrayIndex = 0
count = 0  # running total of matches across all strings
string_array = ["foo bar\nbaz qux", "baz qux foo bar", "baz qux"]
for string in string_array:
    for match in re.finditer(r'bar', string):
        count += 1
        print("array index", arrayIndex, "match", count, match.group(), "start index", match.start(), "End index", match.end())
    arrayIndex += 1
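Building on the position check, a small sketch of how the filter itself might look. The helper name occurs_near_start and its limit parameter are hypothetical, and treating "within the first 10 characters" as "the word ends at or before offset 10" is my assumption.

```python
import re

def occurs_near_start(text, word="bar", limit=10):
    """Return True if `word` ends within the first `limit` characters of `text`."""
    return any(m.end() <= limit for m in re.finditer(re.escape(word), text))

string_array = ["foo bar\nbaz qux", "baz qux foo bar", "baz qux"]

# Keep strings that contain "qux" and do not have "bar" near the start.
kept = [s for s in string_array if "qux" in s and not occurs_near_start(s)]
print(kept)  # ['baz qux foo bar', 'baz qux']
```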
I have a data frame that consists of multiple rows that contain different variations of a string that is separated by commas. Rather than constantly writing variations of this code such as df.replace('Word,', ''), I am looking for a simpler way to replace variations in strings for python. I have heard about regex yet am having a difficult time understanding it.
One such example that I am looking into is df.column.str.replace('Word,?', '') which would replace all variations of "Word" regardless of comma position. However, I am unsure as to how this works. Any help in understanding replacing using regex would be greatly appreciated. Thank you in advance.
Example:
'Word, foo, bar'
'Word'
'foo, bar, Word'
'foo, Word, bar'
Desired Output:
'foo, bar'
''
'foo, bar'
'foo, bar'
df.replace(to_replace='Word, |(, )?Word', value='', regex=True)
This way the .replace() method does the required work.
to_replace is the regular expression to search for; it must be given as a string.
'Word, ' matches "Word" followed by a comma and a space, which covers every position except the end of the string, where it appears as ", Word". The space is part of the pattern so that no leading space is left behind in 'Word, foo, bar'.
To match that trailing form, the "|" (or) adds a second alternative, "(, )?Word". Here ? matches 0 or 1 occurrence of ", " (a comma and one space), so this alternative covers both ", Word" at the end and a string that is just "Word".
value='' is the replacement text.
regex=True tells pandas to treat the to_replace parameter as a regular expression.
You can do it as below
Input
import pandas as pd

df = pd.DataFrame([[1, 'Word, foo, bar'],
                   [2, 'Word'],
                   [3, 'foo, bar, Word'],
                   [4, 'foo, Word, bar']], columns=['id', 'text'])
id text
1 Word, foo, bar
2 Word
3 foo, bar, Word
4 foo, Word, bar
Code to remove the text 'Word' together with its adjacent comma and space, if any
df['text'] = df['text'].replace(r'Word(,\s)|(,\s)?Word', '', regex=True)
What is happening in the code
Word : matches the literal text 'Word'
(,\s) : matches a comma followed by a whitespace character (\s); in the first alternative it is required, so 'Word, ' is removed as a unit
(,\s)? : the ? makes the comma and space optional, so the second alternative matches ', Word' when they precede the word and just 'Word' when they do not. So ? is pretty important here.
| : matches either of the two expressions (here it is needed for rows 3 and 4, where a comma and space precede 'Word')
Output
id text
1 foo, bar
2
3 foo, bar
4 foo, bar
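To check the pattern without a DataFrame, the same expression can be run through the standard-library re.sub, which uses the same regex syntax. This is only a sketch: the rows list stands in for the text column, and the capture group around the first ",\s" is dropped because it is not needed for the replacement.

```python
import re

rows = ["Word, foo, bar", "Word", "foo, bar, Word", "foo, Word, bar"]

# First alternative: "Word" plus the comma and space that follow it.
# Second alternative: "Word" plus an optional comma and space before it.
pattern = re.compile(r"Word,\s|(,\s)?Word")

cleaned = [pattern.sub("", row) for row in rows]
print(cleaned)  # ['foo, bar', '', 'foo, bar', 'foo, bar']
```

The order of the alternatives matters for 'Word, foo, bar': trying the "Word, " form first removes the trailing separator along with the word.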
In python, suppose I want to search the string
"123"
for occurrences of the pattern
"abc|1.*|def|.23" .
I would currently do this as follows:
import re
re.match ("abc|1.*|def|.23", "123") .
The above returns a match object from which I can retrieve the starting and ending indices of the match in the string, which in this case would be 0 and 3.
My question is: How can I retrieve the particular word(s) in the regular expression which matched with
"123" ?
In other words: I would like to get "1.*" and ".23". Is this possible?
Given that your pattern always uses a common separator (in this case "|"), you can try:
import re
pattern = "abc|1.*|def|.23"
matches = [s for s in pattern.split("|") if re.match(s, "123")]
print(matches)
output:
['1.*', '.23']
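One caveat with this approach: re.match only anchors at the start, so an alternative counts as matching even if it covers just a prefix of the string. A hedged variant (the function name and the whole flag are mine) lets you switch to re.fullmatch when the whole string must be covered. Splitting on "|" also assumes no alternative contains a "|" of its own, for example inside a group or character class.

```python
import re

def matching_alternatives(pattern, text, whole=False):
    """List the '|'-separated alternatives of `pattern` that match `text`."""
    matcher = re.fullmatch if whole else re.match
    return [alt for alt in pattern.split("|") if matcher(alt, text)]

print(matching_alternatives("abc|1.*|def|.23", "1234"))              # ['1.*', '.23']
print(matching_alternatives("abc|1.*|def|.23", "1234", whole=True))  # ['1.*']
```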
Another approach would be to create one capture group for each token in the alternation:
import re
s = 'def'
rgx = r'\b(?:(abc)|(1.*)|(def)|(.23))\b'
m = re.match(rgx, s)
print(m.group(0)) #=> def
print(m.group(1)) #=> None
print(m.group(2)) #=> None
print(m.group(3)) #=> def
print(m.group(4)) #=> None
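This example shows that the match is 'def' and that it was matched by the 3rd capture group, (def). Note that only the first alternative that succeeds is reported, not every alternative that could match.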
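Rather than probing each group by number, the match object can report which group matched. The sketch below wraps each alternative in a named group and reads Match.lastgroup; the altN naming scheme and the helper first_alternative are my own, and it assumes the alternatives contain no capture groups themselves. Like the numbered version, it finds only the first alternative that succeeds.

```python
import re

alternatives = ["abc", "1.*", "def", ".23"]

# Builds (?P<alt0>abc)|(?P<alt1>1.*)|(?P<alt2>def)|(?P<alt3>.23)
combined = re.compile(
    "|".join(f"(?P<alt{i}>{alt})" for i, alt in enumerate(alternatives))
)

def first_alternative(text):
    """Return the alternative that produced the match, or None."""
    m = combined.match(text)
    if m is None:
        return None
    # lastgroup is the name of the group that matched, e.g. "alt1".
    return alternatives[int(m.lastgroup[len("alt"):])]

print(first_alternative("123"))  # 1.*
print(first_alternative("def"))  # def
print(first_alternative("x23"))  # .23
print(first_alternative("zzz"))  # None
```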
Python code
so I have a string that has multiple patterns like
s1 = "foo, bar"
s1 = "x, y"
s2 = "hello, hi"
s3 = "bar, foo."
I'm wondering how I can get the strings that are separated by a comma (insert random text here).
So from this example, I want to get strings ["foo","bar"] and ["x","y"] when I'm looking for "s1", and "hello" & "hi" when I look for s2, etc.
Thanks!
EDIT:
Let's assume using .split(',') is impractical due to a large number of commas outside this specific pattern I listed
The question was edited, but for the original string:
"s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
You could use a pattern to match the specific part and then use re.split to split on a comma and optional space.
\bs1: ?(\w+(?:, ?\w+)*)
Explanation
\bs1: ? Match s1: and optional space
( Capture group 1
\w+(?:, ?\w+)* Match 1+ word chars, optionally repeat comma, optional space and 1+ word chars
) Close group 1
Example code (python 3)
import re
s = "s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
def findByPrefix(prefix, s):
    pattern = rf"\b{re.escape(prefix)}: ?(\w+(?:, ?\w+)*)"
    res = []
    for m in re.findall(pattern, s):
        res.append(re.split(", ?", m))
    return res
print(findByPrefix("s1", s))
Output
[['foo', 'bar'], ['x', 'y']]
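If all prefixes are wanted at once, the same pattern can capture the prefix as well and collect the results in a dictionary. This is a sketch under the assumption that every prefix is a run of word characters followed by a colon; group_by_prefix is a hypothetical helper, not part of the answer above.

```python
import re
from collections import defaultdict

def group_by_prefix(text):
    """Map each 'prefix:' label to the comma-separated lists that follow it."""
    groups = defaultdict(list)
    # Group 1 is the prefix, group 2 the comma-separated run of words.
    for prefix, items in re.findall(r"\b(\w+): ?(\w+(?:, ?\w+)*)", text):
        groups[prefix].append(re.split(r", ?", items))
    return dict(groups)

s = "s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
print(group_by_prefix(s))
```

For the sample string, the "s1" key holds both lists, [['foo', 'bar'], ['x', 'y']], which is what findByPrefix("s1", s) returns.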
You can use:
my_string.split(',')
It should return a list of every element.
Here is how you can use the re module to split a string by a given delimiter:
import re
re.split(", ", my_string)
How can I put in brackets / parenthesis some words following another word in python?
For 2 words it looks like:
>>> p=re.compile(r"foo\s(\w+)\s(\w+)")
>>> p.sub( r"[\1] [\2]", "foo bar baz")
'[bar] [baz]'
I want for undefined number of words. I came up with this, but it doesn't seem to work.
>>> p=re.compile(r"foo(\s(\w+))*")
>>> p.sub( r"[\2] [\2] [\2]", "foo bar baz bax")
'[bax] [bax] [bax]'
The desired result in this case would be
'[bar] [baz] [bax]'
You may use a solution like
import re
p = re.compile(r"(foo\s+)([\w\s]+)")
r = re.compile(r"\w+")
s = "foo bar baz"
print( p.sub( lambda x: "{}{}".format(x.group(1), r.sub(r"[\g<0>]", x.group(2))), s) )
The first (foo\s+)([\w\s]+) pattern matches and captures foo followed with 1+ whitespaces into Group 1 and then captures 1+ word and whitespace chars into Group 2.
Then, inside the re.sub, the replacement argument is a lambda expression where all 1+ word chunks are wrapped with square brackets using the second simple \w+ regex (that is done to ensure the same amount of whitespaces between the words, else, it can be done without a regex).
Note that [\g<0>] replacement pattern inserts [, the whole match value (\g<0>) and then ].
I suggest the following simple solution:
import re
s = "foo bar baz bu bi porte"
p = re.compile(r"foo\s([\w\s]+)")
m = p.match(s)
# Here: m.group(1) is "bar baz bu bi porte"
# m.group(1).split() is ['bar', 'baz', 'bu', 'bi', 'porte']
print(' '.join([f'[{i}]' for i in m.group(1).split()])) # for Python 3.6+ (due to f-strings)
# [bar] [baz] [bu] [bi] [porte]
print(' '.join(['[' + i + ']' for i in m.group(1).split()])) # for other Python versions
# [bar] [baz] [bu] [bi] [porte]
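The attempt in the question fails because a repeated capture group such as (\s(\w+))* keeps only its last iteration, so \2 is always the final word. A possible way to combine the two answers is to capture the whole run of words once and build the replacement in a function. The helper name bracket_words_after is hypothetical, and dropping the keyword from the output follows the desired result stated in the question.

```python
import re

def bracket_words_after(keyword, text):
    """Replace `keyword` and the words after it with those words in brackets."""
    # Group 1 captures the entire run of whitespace-separated words.
    pattern = re.compile(rf"\b{re.escape(keyword)}((?:\s+\w+)+)")

    def wrap(match):
        return " ".join(f"[{word}]" for word in match.group(1).split())

    return pattern.sub(wrap, text)

print(bracket_words_after("foo", "foo bar baz bax"))  # [bar] [baz] [bax]
```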
Suppose there is a series of strings. Important items are enclosed in quotes, but other items are enclosed in escaped quotes. How can you return only the important items?
Example where both are returned:
import re
testString = 'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = '"([^\\\"]*)"'
print(re.findall(pattern, testString))
Result prints
['one', 'two']
How can I get python's re to only print
['one']
You can use negative lookbehinds to ensure there's no backslash before the quote:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'(?<!\\)"([^"]*)(?<!\\)"'
#           ^^^^^^^        ^^^^^^^
print(re.findall(pattern, testString))
In an ordinary Python string literal, \" is just an escaped quote, so \"two\" is stored as "two" and the backslashes never reach the regex. Use a raw string, in which \" keeps its backslash:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'"(\w*)"'
print(re.findall(pattern, testString))
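The difference between the two literals can be seen directly. The sketch below applies the lookbehind pattern from the first answer to a shortened test sentence of my own, written once as a plain literal and once as a raw literal.

```python
import re

plain = 'is a test "one" but not \"two\" here'   # \" collapses to a bare quote
raw = r'is a test "one" but not \"two\" here'    # backslashes are kept

# A quote counts only if it is not preceded by a backslash.
unescaped = re.compile(r'(?<!\\)"([^"]*)(?<!\\)"')

print(len(raw) - len(plain))     # 2 (the two surviving backslashes)
print(unescaped.findall(plain))  # ['one', 'two']
print(unescaped.findall(raw))    # ['one']
```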