Extract Sub string from String using Regex - python

I need to extract substrings from a string using a regex.
For example, here is my sample data:
Hello, "How" are "you" What "are" you "doing?"
From this data, I need to extract only the second and fourth occurrences of double-quoted text.
The expected output is: you doing?
I tried the regex below, but it does not extract what I need:
"(.*?)"

We can use re.findall and then slice the result to get the second and fourth matches:
import re
string = 'Hello, "How" are "you" What "are" you "doing?"'
result = re.findall('".+?"', string)[1::2]
print(result)
Here, the regex matches one or more characters enclosed in double quote marks, but matches as few as possible (a non-greedy match); otherwise we would end up with a single match, "How" are "you" What "are" you "doing?".
Output:
['"you"', '"doing?"']
If you want to combine them without the quote marks, you can use str.strip along with str.join:
print(' '.join(string.strip('"') for string in result))
Output:
you doing?
An alternative method would be to just split on ":
result = string.split('"')[1::2][1::2]
print(result)
Output:
['you', 'doing?']
This works because, if you separate the string by double quote marks, then the output will be as follows:
Everything before the first double quote
Everything after the first double quote and before the second
Everything after the second double quote and before the third
...
This means that we can take every second element, starting at index 1 (the odd indices), to get the ones that are in quotes. We can then slice the result again to get the 2nd and 4th quoted items.
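The double slice can be checked step by step. This is a minimal sketch using the sample string from the question; the names parts, quoted, and wanted are illustrative only:

```python
# Sample string from the question.
text = 'Hello, "How" are "you" What "are" you "doing?"'

# Splitting on the quote mark alternates between unquoted and quoted pieces.
parts = text.split('"')

# Odd indices (1, 3, 5, ...) hold the text that was inside quotes.
quoted = parts[1::2]

# Slicing again keeps the 2nd and 4th quoted items.
wanted = quoted[1::2]

print(parts)
print(quoted)
print(wanted)
```

Printing the intermediate lists makes it easy to see which slice to adjust if different occurrences are needed.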

Regex only solution. May not be 100% accurate since it matches every second occurrence rather than just the 2nd and 4th, but it works for the example.
"[^"]+"[^"]+("[^"]+")
Demonstration in JS:
var str = 'Hello, "How" are "you" What "are" you "doing?"';
var regex = /"[^"]+"[^"]+("[^"]+")/g;
var match = regex.exec(str);
while (match != null) {
    // matched text: match[0]
    // match start: match.index
    // capturing group n: match[n]
    console.log(match[1]);
    match = regex.exec(str);
}
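Since the question is about Python, the same idea might be sketched with re.findall. This is an adaptation rather than the answer's exact pattern: the capturing group is moved inside the quote marks so the results come back without them, and the name second_items is illustrative.

```python
import re

text = 'Hello, "How" are "you" What "are" you "doing?"'

# Skip one quoted item and the text after it, then capture the next quoted item.
pattern = re.compile(r'"[^"]+"[^"]+"([^"]+)"')

# findall returns the captured group for each non-overlapping match.
second_items = pattern.findall(text)
print(' '.join(second_items))
```

As with the JS version, this picks every second quoted item, not strictly the 2nd and 4th.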

We can use re.findall to extract all quoted terms, then build a string from every second entry of the resulting list (the 2nd, 4th, and so on):
import re
text = "Hello, \"How\" are \"you\" What \"are\" you \"doing?\""
matches = re.findall(r'"([^"]+)"', text)
matches = matches[1::2]
output = " ".join(matches)
print(output)
Output:
you doing?

Related

RegEx: 2 double quote enclose string search in python

I tried the following on this string:
text = '"Some Text","Some Text","18.3",""I Love You, Dad"","","","Some Text"'
result = re.findall(r'""[^"]+""', text)
This returns the following list:
['""I Love You, Dad""', '"",""']
But I only want the first item of the list. How can I change the regex so that the second item is not matched? Here "I Love You, Dad" is variable: any string can be enclosed in two double quotes.
The condition is: the string is enclosed in two double quotes.
You can use
re.findall(r'(?<![^,])""([A-Za-z].*?)""(?![^,])', text)
Details:
(?<![^,]) - a left comma boundary (start of string or a comma required immediately to the left of the current location)
"" - two double quotes
([A-Za-z].*?) - Group 1: an ASCII letter (use [^\W\d_] to match any Unicode letter) and then any zero or more chars other than line break chars, as few as possible
"" - two double quotes
(?![^,]) - a right comma boundary (end of string or a comma required immediately to the right of the current location)
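The pattern can be tried against the sample text from the question. This is a small sketch; only the variable names pattern and result are added here:

```python
import re

text = '"Some Text","Some Text","18.3",""I Love You, Dad"","","","Some Text"'

# Two double quotes at a comma boundary, a letter, then as little as possible
# up to two double quotes that are again followed by a comma or the end.
pattern = re.compile(r'(?<![^,])""([A-Za-z].*?)""(?![^,])')

result = pattern.findall(text)
print(result)
```

The empty "","" fields are skipped because Group 1 must start with a letter.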
re.findall() finds all non-overlapping matches of a pattern.
re.search() returns None if the pattern does not match, or a match object (re.Match in Python 3) describing the matching part of the string. It stops after the first match:
import re
text = '"Some Text","Some Text","18.3",""I Love You, Dad"","","","Some Text"'
result = re.search(r'""[^"]+""', text)
if result is not None:
    print(result.group(0))

Ignore an optional word if present in a string - regular expression in python

I'm trying to match a string with regular expression using Python, but ignore an optional word if it's present.
For example, I have the following lines:
First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]
I'm looking to capture everything before [Ignore This Part]. Notice I also want to exclude the whitespace before [Ignore This Part]. Therefore my results should look like this:
First string
Second string
Third string (1)
I have tried the following regular expression with no luck, because it still captures [Ignore This Part]:
.+(?:\s\[.+\])?
Any assistance would be appreciated.
I'm using Python 3.8 on Windows 10.
Edit: The examples are meant to be processed one line at a time.
Use [^[\n] instead of . so the match cannot run into a square bracket or across a newline:
^[^[\n]+(?:\s\[.+\])?
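Since the lines are processed one at a time, the character-class idea might be applied as sketched below. This is an illustration under assumptions, not the answer's exact regex: it uses only the leading character class (with the bracket escaped) and strips trailing whitespace afterwards, because the class also consumes the space before the bracket. The helper name before_bracket is hypothetical.

```python
import re

lines = [
    "First string",
    "Second string [Ignore This Part]",
    "Third string (1) [Ignore This Part]",
]

# Everything up to the first "[" (or the end of the line).
prefix = re.compile(r'^[^\[\n]+')

def before_bracket(line):
    """Return the text before the first '[', without trailing whitespace."""
    m = prefix.match(line)
    # The character class also consumes the space before "[", so strip it.
    return m.group(0).rstrip() if m else ""

cleaned = [before_bracket(line) for line in lines]
print(cleaned)
```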
Perhaps you can remove the part that you don't want to match:
[^\S\n]*\[[^][\n]*]$
Explanation
[^\S\n]* Match optional spaces
\[[^][\n]*] Match from [....]
$ End of string
Example
import re
pattern = r"[^\S\n]*\[[^][\n]*]$"
s = ("First string\n"
     "Second string [Ignore This Part]\n"
     "Third string (1) [Ignore This Part]")
# Pass flags by keyword; the fourth positional argument of re.sub is count.
result = re.sub(pattern, "", s, flags=re.M)
if result:
    print(result)
Output
First string
Second string
Third string (1)
If you don't want to be left with an empty string, you can assert a non whitespace char to the left:
(?<=\S)[^\S\n]*\[[^][\n]*]$
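The difference between the two patterns shows up on a line that consists only of the bracketed part. A minimal sketch, with the brackets inside the character class escaped for readability (an equivalent spelling); the names plain, guarded, only_bracket, and mixed are illustrative:

```python
import re

# Same patterns as above, with the brackets inside the class escaped.
plain = re.compile(r'[^\S\n]*\[[^\]\[\n]*\]$', re.M)
guarded = re.compile(r'(?<=\S)[^\S\n]*\[[^\]\[\n]*\]$', re.M)

only_bracket = "[Ignore This Part]"
mixed = "Second string [Ignore This Part]"

# Without the lookbehind, the whole line is removed.
print(repr(plain.sub("", only_bracket)))
# With the lookbehind, a bracket-only line is left untouched.
print(repr(guarded.sub("", only_bracket)))
# A normal line still loses its bracketed suffix.
print(repr(guarded.sub("", mixed)))
```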
With your shown samples, please try following code, written and tested in Python3.
import re
var="""First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]"""
[x for x in list(map(lambda x:x.strip(),re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var))) if x]
Output will be as follows, in form of list which could be accessed as per requirement.
['First string', 'Second string', 'Third string (1)']
Here is a detailed explanation of the Python 3 code above:
First, re.split is called with the regex (.*?)(?:$|\s\[[^]]*\]) in multiline mode ((?m)): re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var). Because the regex contains a capturing group, the captured text is included in the resulting list.
Each element is then passed to a lambda that calls strip() to remove surrounding whitespace, including newlines.
map applies the lambda to every element, and list() turns the result into a list.
Finally, the list comprehension drops the empty strings, leaving only the parts the OP asked for.
You may use this regex:
^.+?(?=$|\s*\[[^]]*]$)
If you want better performing regex then I suggest:
^\S+(?:\s+\S+)*?(?=$|\s*\[[^]]*]$)
RegEx Details:
^: Start
.+?: Match 1+ of any characters (lazy match)
(?=: Start lookahead
$: End
|: OR
\s*: Match 0 or more whitespaces
\[[^]]*]: Match [...] text
$: End
): Close lookahead
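In Python, the first regex could be used with the multiline flag so that ^ and $ apply per line. A minimal sketch on the sample lines from the question, with the closing brackets escaped (an equivalent spelling); the names pattern and matches are illustrative:

```python
import re

s = ("First string\n"
     "Second string [Ignore This Part]\n"
     "Third string (1) [Ignore This Part]")

# Lazy match from the line start, stopping where the lookahead sees either
# the end of the line or optional whitespace plus a [...] suffix ending the line.
pattern = re.compile(r'^.+?(?=$|\s*\[[^\]]*\]$)', re.M)

matches = pattern.findall(s)
print(matches)
```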

Match a piece of text from the beginning up to the first occurrence of multicharacter substring

I want a regex search to end when it reaches ". ", but not when it reaches "."; I'm aware of using [^...] to exclude single characters, and have been using this to stop my search when it reaches a certain character. This does not work with strings though, as [^. ] stops when it reaches either character. Say I've got the code
import re
def main():
    my_string = "The value of the float is 2.5. The int's value is 2.\n"
    re.search("[^.]*", my_string)
main()
Which gives a match object with the string
"The value of the float is 2"
How can I change this so that it only stops after the string ". "?
Bonus question, is there any way to tell regex to stop whenever it reaches one of multiple strings? Using the above code as an example, if I wanted the search to end when it found the string ". " or the string ".\n", how would I go about it? Thanks!
To match from the start of a string up to the first . followed by whitespace, use
^(.*?)\.\s
If you only want to allow a space or a newline after the dot, use either of the following (the character class is best when every alternative is a single character; use alternation when some alternatives are multi-character):
^(.*?)\.(?: |\n)
^(.*?)\.[ \n]
Details
^ - start of a string
(.*?) - Capturing group 1: any 0+ chars other than linebreak chars, as few as possible
\. - a literal . char
\s - a whitespace char
(?: |\n) / [ \n] - a non-capturing group matching either a space or (|) a newline.
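The two spellings can be compared directly on the string from the question. A small sketch; the names with_alternation and with_char_class are illustrative:

```python
import re

my_string = "The value of the float is 2.5. The int's value is 2.\n"

# Alternation: needed when a terminator is longer than one character.
with_alternation = re.search(r'^(.*?)\.(?: |\n)', my_string)

# Character class: enough when every terminator is a single character.
with_char_class = re.search(r'^(.*?)\.[ \n]', my_string)

print(with_alternation.group(1))
print(with_char_class.group(1))
```

Both searches stop at the first dot that is followed by a space or a newline, so the dot inside 2.5 is skipped.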
Python demo:
import re
my_string = "The value of the float is 2.5. The int's value is 2.\n"
m = re.search(r"^(.*?)\.\s", my_string) # Try to find a match
if m: # If there is a match
    print(m.group(1)) # Show Group 1 value
NOTE: If there can be line breaks in the input, pass the re.S (re.DOTALL) flag:
m = re.search(r"^(.*?)\.\s", my_string, re.DOTALL)
Besides the classic approach explained by Wiktor, splitting may also be an interesting solution in this case.
>>> my_string
"The value of the float is 2.5. The int's value is 2.\n"
>>> re.split(r'\. |\.\n', my_string)
['The value of the float is 2.5', "The int's value is 2", '']
If you want to include the periods at the end of each sentence, you can do something like this:
['{}.'.format(sentence) for sentence in re.split(r'\. |\.\n', my_string) if sentence]
To handle extra whitespace between the sentences:
>>> str2 = "The value of the float is 2.5. The int's value is 2.\n\n "
>>> ['{}.'.format(sentence)
...  for sentence in re.split(r'\. \s*|\.\n\s*', str2)
...  if sentence
... ]
['The value of the float is 2.5.', "The int's value is 2."]

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since ^ only acts as negation inside brackets [], where it applies to single characters, this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
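The same callback technique might be generalized to an arbitrary list of words. This is a sketch under assumptions, not part of the answer: the function name mask_except and its parameters are hypothetical, keep_words is assumed to be non-empty, and the filler group uses .*? instead of .+? so that a kept word at the very start of the text is also preserved.

```python
import re

def mask_except(text, keep_words, mask="_"):
    """Replace every character with `mask`, except occurrences of keep_words.

    Assumes keep_words is a non-empty list of literal strings.
    """
    # Escape the words so regex metacharacters in them are taken literally.
    alternatives = "|".join(re.escape(word) for word in keep_words)
    pattern = re.compile(r"(.*?)(" + alternatives + r"|$)", re.S)

    def replace(match):
        # Group 1 is the filler to mask; group 2 is a kept word or empty at the end.
        filler, word = match.groups()
        return mask * len(filler) + word

    return pattern.sub(replace, text)

masked = mask_except("I am going home now, thank you.", ["going", "you"])
print(masked)
```

Like the original, this matches the words as plain substrings, so a kept word inside a longer word is preserved as well.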
Here is a single-regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that most regex engines (most of them traditional NFA implementations) process strings character by character, so to exclude whole words you have to describe the allowed text in terms of words. Here a negative lookahead rejects the two words, and word boundaries (\b) delimit each chunk to be matched. Because re.sub replaces each match with the replacement string, you cannot simply pass '_': a part like I am is matched as three chunks (I, ' ', am) and would be replaced by only three underscores. Instead, pass a function as the second argument of re.sub and multiply '_' by the length of the matched text.

python regular expression match comma

In the following string, how do I match the words including the commas?
--
process_str = "Marry,had ,a,alittle,lamb"
import re
re.findall(r".*",process_str)
['Marry,had ,a,alittle,lamb', '']
--
process_str="192.168.1.43,Marry,had ,a,alittle,lamb11"
import re
ip_addr = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",l)
re.findall(ip_addr,process_str1)
How do I find the words after the IP address, excluding only the first comma?
That is, the expected output is Marry,had ,a,alittle,lamb11
Also, in the second example above, how do I check whether the string ends with a digit?
In the second example, you just need to capture (using ()) everything that follows the ip:
import re
s = "192.168.1.43,Marry,had ,a,alittle,lamb11"
text = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3},(.*)", s)[0]
# text now holds the string Marry,had ,a,alittle,lamb11
To find out if the string ends with a digit, you can use the following:
re.match(r".*\d$", process_str)
That is, you match the entire string (.*), and then backtrack to test if the last character (using $, which matches the end of the string) is a digit.
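The digit check could be wrapped in a small helper. A minimal sketch: the function name ends_with_digit is hypothetical, and re.search is used so the leading .* is not needed.

```python
import re

def ends_with_digit(text):
    """Return True if text ends with a digit."""
    # $ matches at the end of the string (and also just before a trailing newline).
    return re.search(r"\d$", text) is not None

print(ends_with_digit("192.168.1.43,Marry,had ,a,alittle,lamb11"))
print(ends_with_digit("Marry,had ,a,alittle,lamb"))
```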
This is how I understand "find the words including the commas":
>>> re.findall(r"\w+,*", process_str)
['Marry,', 'had', 'a,', 'alittle,', 'lamb']
Ending with a digit:
"[0-9]+$"
Hmm. The examples are not quite clear, but it seems that in example #2 you want to match only text, commas, and space characters, and ignore digits? How about this:
re.findall(r'(?i)([a-z, ]+)', process_str)
I didn't quite understand "if the string is ending with a digit". Does that mean you ONLY want to match 'Marry...' IF it ends with a digit? Then that would look like this:
re.findall(r'(?i)([a-z, ]+)\d+', process_str)
