RegEx: 2 double quote enclose string search in python - python

I tried the following:
text = '"Some Text","Some Text","18.3",""I Love You, Dad"","","","Some Text"'
result = re.findall(r'""[^"]+""', text)
This returns the following list:
['""I Love You, Dad""', '"",""']
But I only want the first item of the list. How can I change the regex so that the second item is not matched? Here "I Love You, Dad" is variable: any string can be enclosed in two double quotes.
The condition is: the string is enclosed in two double quotes.

You can use
re.findall(r'(?<![^,])""([A-Za-z].*?)""(?![^,])', text)
See the regex demo. Details:
(?<![^,]) - a left comma boundary (start of string or a comma required immediately to the left of the current location)
"" - two double quotes
([A-Za-z].*?) - Group 1: an ASCII letter (use [^\W\d_] to match any Unicode letter) and then any zero or more chars other than line break chars as few as possible
"" - two double quotes
(?![^,]) - a right comma boundary (end of string or a comma required immediately to the right of the current location)
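As a quick check of the pattern described above, here is a minimal sketch that runs it against the sample string from the question. The variable names `pattern` and `matches` are only illustrative.

```python
import re

text = '"Some Text","Some Text","18.3",""I Love You, Dad"","","","Some Text"'

# Two double quotes, a letter, then as few chars as possible up to the next
# two double quotes; both ends must touch a comma or a string boundary.
pattern = r'(?<![^,])""([A-Za-z].*?)""(?![^,])'

matches = re.findall(pattern, text)
print(matches)  # ['I Love You, Dad']
```

Because the pattern has a single capturing group, re.findall returns only the Group 1 text, without the surrounding quotes. The empty fields are skipped because Group 1 must start with a letter.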

The re.findall() method finds all non-overlapping matches of a pattern.
The re.search() method returns None if the pattern doesn’t match, or a re.Match object that contains information about the matching part of the string. It stops after the first match:
import re
text = '"Some Text","Some Text","18.3",""I Love You, Dad"","","","Some Text"'
result = re.search(r'""[^"]+""', text)
if result is not None:
    print(result.group(0))

Related

Match a piece of text from the beginning up to the first occurrence of multicharacter substring

I want a regex search to end when it reaches ". ", but not when it reaches "."; I'm aware of using [^...] to exclude single characters, and have been using this to stop my search when it reaches a certain character. This does not work with strings though, as [^. ] stops when it reaches either character. Say I've got the code
import re
def main():
    my_string = "The value of the float is 2.5. The int's value is 2.\n"
    re.search("[^.]*", my_string)
main()
Which gives a match object with the string
"The value of the float is 2"
How can I change this so that it only stops after the string ". "?
Bonus question, is there any way to tell regex to stop whenever it reaches one of multiple strings? Using the above code as an example, if I wanted the search to end when it found the string ". " or the string ".\n", how would I go about it? Thanks!
To match from the start of a string up to the first . followed by whitespace, use
^(.*?)\.\s
If you only want to allow a space or a newline after the dot, use either of the following (the character class is best when every alternative is a single character; use alternation when some alternatives are multicharacter):
^(.*?)\.(?: |\n)
^(.*?)\.[ \n]
See the regex demo.
Details
^ - start of a string
(.*?) - Capturing group 1: any 0+ chars other than linebreak chars, as few as possible
\. - a literal . char
\s - a whitespace char
(?: |\n) / [ \n] - a non-capturing group with alternation (|) or a character class; either one matches a space or a newline.
Python demo:
import re
my_string = "The value of the float is 2.5. The int's value is 2.\n"
m = re.search(r"^(.*?)\.\s", my_string) # Try to find a match
if m: # If there is a match
    print(m.group(1)) # Show Group 1 value
NOTE: If there can be line breaks in the input, pass the re.S (re.DOTALL) flag:
m = re.search(r"^(.*?)\.\s", my_string, re.DOTALL)
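For the bonus question (stopping at any one of several multicharacter strings), the alternation can be built from a list. This is only a sketch: the helper name `text_before_any` is hypothetical, and it assumes the terminators should be treated as literal text, which is why each one is passed through re.escape.

```python
import re

def text_before_any(text, terminators):
    """Return the text from the start up to the first terminator, or None."""
    # Escape each terminator so that '.' is matched literally.
    alternation = "|".join(re.escape(t) for t in terminators)
    # re.DOTALL lets the lazy .*? run across line breaks if needed.
    m = re.search(r"^(.*?)(?:%s)" % alternation, text, re.DOTALL)
    return m.group(1) if m else None

my_string = "The value of the float is 2.5. The int's value is 2.\n"
print(text_before_any(my_string, [". ", ".\n"]))  # The value of the float is 2.5
```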
Besides the classic approach explained by Wiktor, splitting may also be an interesting solution in this case.
>>> my_string
"The value of the float is 2.5. The int's value is 2.\n"
>>> re.split(r'\. |\.\n', my_string)
['The value of the float is 2.5', "The int's value is 2", '']
If you want to include periods at the end of the sentence, you can do something like this:
['{}.'.format(sentence) for sentence in re.split(r'\. |\.\n', my_string) if sentence]
To handle extra whitespace between the sentences:
>>> str2 = "The value of the float is 2.5. The int's value is 2.\n\n "
>>> ['{}.'.format(sentence)
...  for sentence in re.split(r'\. \s*|\.\n\s*', str2)
...  if sentence]
['The value of the float is 2.5.', "The int's value is 2."]

Extract Sub string from String using Regex

I need to extract substrings from a string using a regex.
For example, here is my sample data:
Hello, "How" are "you" What "are" you "doing?"
From this sample data, I need to extract only the second and fourth occurrences of double-quoted text.
My expected result is: you doing?
I tried the regex below, but it does not extract what I need.
"(.*?)"
We can use re.findall and then slice the result to get the second and fourth matches:
import re
string = 'Hello, "How" are "you" What "are" you "doing?"'
result = re.findall('".+?"', string)[1::2]
print(result)
Here, the regex matches any number of characters contained within double quote marks, but tries to match as few as possible (a non-greedy match), otherwise we would end up with one single match, "How" are "you" What "are" you "doing?".
Output:
['"you"', '"doing?"']
If you want to combine them without the quote marks, you can use str.strip along with str.join:
print(' '.join(string.strip('"') for string in result))
Output:
you doing?
An alternative method would be to just split on ":
result = string.split('"')[1::2][1::2]
print(result)
Output:
['you', 'doing?']
This works because, if you separate the string by double quote marks, then the output will be as follows:
Everything before the first double quote
Everything after the first double quote and before the second
Everything after the second double quote and before the third
...
This means that we can take every second element (indices 1, 3, 5, ...) to get the ones that are in quotes. We can then slice the result the same way again to get the 2nd and 4th quoted values.
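The intermediate steps of the double slice can be printed to make this concrete. A small sketch, assuming the quotes in the input are balanced; the names `pieces` and `quoted` are illustrative only.

```python
string = 'Hello, "How" are "you" What "are" you "doing?"'

# Splitting on the quote character alternates between unquoted and quoted text.
pieces = string.split('"')
print(pieces)
# ['Hello, ', 'How', ' are ', 'you', ' What ', 'are', ' you ', 'doing?', '']

# Odd indices hold the text that was inside quotes.
quoted = pieces[1::2]
print(quoted)  # ['How', 'you', 'are', 'doing?']

# Slicing again keeps the 2nd and 4th quoted values.
print(quoted[1::2])  # ['you', 'doing?']
```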
Regex-only solution. It may not be exactly what was asked for, since it matches every second quoted occurrence rather than just the 2nd and 4th, but it works for the example.
"[^"]+"[^"]+("[^"]+")
Demonstration in JS:
var str = 'Hello, "How" are "you" What "are" you "doing?"';
var regex = /"[^"]+"[^"]+("[^"]+")/g;
var match = regex.exec(str);
while (match != null) {
    // matched text: match[0]
    // match start: match.index
    // capturing group n: match[n]
    console.log(match[1]);
    match = regex.exec(str);
}
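Since the question is about Python, the same regex can also be tried with re.findall. This is a sketch of an equivalent, not part of the original answer; the variable name `second_of_each_pair` is illustrative.

```python
import re

string = 'Hello, "How" are "you" What "are" you "doing?"'

# Each match consumes two quoted items and captures only the second one.
second_of_each_pair = re.findall(r'"[^"]+"[^"]+("[^"]+")', string)
print(second_of_each_pair)  # ['"you"', '"doing?"']
```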
We can use re.findall to extract all quoted terms and then build a string from every second entry of the resulting list:
import re
text = "Hello, \"How\" are \"you\" What \"are\" you \"doing?\""
matches = re.findall(r'"([^"]+)"', text)
matches = matches[1::2]
output = " ".join(matches)
print(output)
Output:
you doing?

First lookahead then look for closest matching capture group behind the lookahead match. RegEx in Python

I have a text made up of line-separated strings. Lines starting with '%' are titles, and lines starting with '>' contain the text I want to search for my query in. If my query is found, I want to return the nearest title above it. Here is the expression I tried myself:
import re
query = "ABCDE"
full_text = "%EFGHI\r>XXXXX\r>XXXXX\r%IWANT\r>XXXXX\r>ABCDE"
re.search("%(.*?)\r(?=>.*{})".format(query), full_text).group(1)
I want this code block to return the string:
> 'IWANT'
As this is the closest title above the query. However, it returns:
> 'EFGHI'
I guess it makes sense, since 'EFGHI' is the first element matching the search pattern. Is there a way to first lookahead for my query and then look back for the nearest title?
I suggest matching a title only when every line between it and the line containing the ABCDE value starts with \r not followed by %, so that no other title comes in between:
r"%([^\r]*)(?=(?:\r(?!%)[^\r]*)*\r>[^\r]*{})".format(query)
See the Python demo
Pattern details:
% - a % char
([^\r]*) - Group 1: zero or more chars other than CR chars
(?=(?:\r(?!%)[^\r]*)*\r>[^\r]*ABCDE) - a positive lookahead that, immediately to the right of the current location, must match the following sequence of patterns:
(?:\r(?!%)[^\r]*)* - 0 or more repetitions of CR not followed with % and then followed with zero or more chars other than CR chars
\r> - a CR char and >
[^\r]* - zero or more chars other than CR chars
ABCDE - a literal char sequence
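The pattern can be wrapped in a small function to try it on the sample text. This is a sketch under two assumptions: the function name `nearest_title` is hypothetical, and the query is meant as literal text, so it is passed through re.escape before being inserted into the pattern.

```python
import re

def nearest_title(full_text, query):
    """Return the closest %title above a >line containing query, or None."""
    pattern = r"%([^\r]*)(?=(?:\r(?!%)[^\r]*)*\r>[^\r]*{})".format(re.escape(query))
    m = re.search(pattern, full_text)
    return m.group(1) if m else None

full_text = "%EFGHI\r>XXXXX\r>XXXXX\r%IWANT\r>XXXXX\r>ABCDE"
print(nearest_title(full_text, "ABCDE"))  # IWANT
```

The lookahead can only walk over lines that do not start with %, so a title matches only if the query appears in its own block of > lines.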

Regex find string after key inside qoutes [duplicate]

Input:
blalasdl8ujd "key":"value", blblabla asdw
"alo":"ebobo",blabla"www":"zzzz"
or
blalasdl8ujd key [any_chars_here] "value", blabla asdw
"alo":"ebobo", bla"www":"zzzz"
I'm trying to extract the value, knowing only the key and that the value is enclosed in ".
The regex key.*"(.*?)" returns the last match enclosed in " ("zzzz").
I need to fix it so that it returns the first one.
https://regex101.com/r/CDfhBT/1
Code
See regex in use here
"key"\s*:\s*"([^"]*)"
To match the possibility of escaped double quotes you can use the following regex:
See regex in use here
"key"\s*:\s*"((?:(?<!\\)\\(?:\\{2})*"|[^"])*)"
This method ensures that an odd number of backslashes \ precedes the double quotation character " such that \", \\\", \\\\\", etc. are valid, but \\", \\\\", \\\\\\" are not valid (this would simply output a backslash character, thus the double quotation character " preceded by an even number of backslashes would simply result in a string termination).
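The odd/even backslash behavior can be checked with two small inputs. A minimal sketch; the sample strings `escaped` and `even` are made up for illustration and are not from the question.

```python
import re

pattern = re.compile(r'"key"\s*:\s*"((?:(?<!\\)\\(?:\\{2})*"|[^"])*)"')

# One backslash before the inner quote: the quote is escaped and stays in the value.
escaped = r'x "key":"va\"lue", "other":"z"'
print(pattern.search(escaped).group(1))  # va\"lue

# Two backslashes before the quote: the quote terminates the value.
even = r'x "key":"ab\\", "other":"z"'
print(pattern.search(even).group(1))  # ab\\
```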
Matching both strings
If you're looking to match your second string as well, you can use either of the following regexes:
\bkey\b(?:"\s*:\s*|.*?)"([^"]*)"
\bkey\b(?:"\s*:\s*|.*?)"((?:(?<!\\)\\(?:\\{2})*"|[^"])*)"
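A short sketch of the first of these regexes applied to both input forms from the question (each joined onto one line); the names `s1`, `s2`, and `results` are illustrative.

```python
import re

s1 = 'blalasdl8ujd "key":"value", blblabla asdw "alo":"ebobo",blabla"www":"zzzz"'
s2 = 'blalasdl8ujd key [any_chars_here] "value", blabla asdw "alo":"ebobo", bla"www":"zzzz"'

pattern = re.compile(r'\bkey\b(?:"\s*:\s*|.*?)"([^"]*)"')

results = []
for s in (s1, s2):
    m = pattern.search(s)
    results.append(m.group(1) if m else None)

print(results)  # ['value', 'value']
```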
Usage
See code in use here
import re
s = 'blahblah "key":"value","TargetCRS": "Target","TargetCRScode": "vertical Code","zzz": "aaaa" sadzxc "sss"'
r = re.compile(r'''"key"\s*:\s*"([^"]*)"''')
match = r.search(s)
if match:
    print(match.group(1))
Results
Input
blahblah "key":"value","TargetCRS": "Target","TargetCRScode": "vertical Code","zzz": "aaaa" sadzxc "sss"
blalasdl8ujd key [any_chars_here] "value", blabla asdw "alo":"ebobo", bla"www":"zzzz"
Output
String 1
Match: "key":"value"
Capture group 1: value
String 2 (when using one of the methods under Matching both strings)
Match: key [any_chars_here] "value"
Capture group 1: value
Explanation
"key" Match this literally
\s* Match any number of whitespace characters
: Match the colon character literally
\s* Match any number of whitespace characters
" Match the double quotation character literally
([^"]*) Capture any character not present in the set (any character except the double quotation character ") any number of times into capture group 1
" Match the double quotation character literally
Matching both strings
\b Assert position as a word boundary
key Match this literally
\b Assert position as a word boundary
(?:"\s*:\s*|.*?) Match either of the following
"\s*:\s*
" Match this literally
\s* Match any number of whitespace characters
: Match this literally
\s* Match any number of whitespace characters
.*? Match any character any number of times, but as few as possible
" Match this literally
([^"]*) Capture any number of any character except " into capture group 1
" Match this literally
You can use the non-greedy quantifier .*? between the key and the value group:
key.*?"(.*?)"
Demo here.
Update
You might wonder why it captures the colon, :. It captures that because this is the next thing between quotes. So you can add optional quotes around key like this:
("?)key\1.*?"(.*?)"
Another demo here.
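The difference between the two patterns shows up when both are run on the quoted and unquoted inputs. A minimal sketch; the names `lazy` and `optional_quotes` are illustrative.

```python
import re

s1 = 'blalasdl8ujd "key":"value", blblabla asdw "alo":"ebobo",blabla"www":"zzzz"'
s2 = 'blalasdl8ujd key [any_chars_here] "value", blabla asdw "alo":"ebobo", bla"www":"zzzz"'

# Without handling the key's own closing quote, the colon is captured for s1.
lazy = re.compile(r'key.*?"(.*?)"')
print(lazy.search(s1).group(1))  # :
print(lazy.search(s2).group(1))  # value

# Group 1 is the optional quote around the key; the value is in Group 2.
optional_quotes = re.compile(r'("?)key\1.*?"(.*?)"')
print(optional_quotes.search(s1).group(2))  # value
print(optional_quotes.search(s2).group(2))  # value
```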
Check this:
.*(\"key\":\"(\w*)\")
Using the group 2:
https://regex101.com/r/66ikH3/2
There's probably a somewhat more pythonic way to do this, but:
s1 = 'blalasdl8ujd "key":"value", blblabla asdw "alo":"ebobo",blabla"www":"zzzz"'
s2 = 'blalasdl8ujd key [any_chars_here] "value", blabla asdw "alo":"ebobo", bla"www":"zzzz"'
def getValue(string, keyName = 'key'):
    """Find next quoted value after a key that may or may not be quoted"""
    startKey = string.find(keyName)
    # if key is quoted, adjust value search range to exclude its closing quote
    endKey = string.find('"',startKey) if string[startKey-1]=='"' else startKey + len(keyName)
    startValue = string.find('"',endKey+1)+1
    return string[startValue:string.find('"',startValue+1)]
getValue(s1) #'value'
getValue(s2) #'value'
I was inspired by the elegance of this answer, but handling the quoted and unquoted cases makes it more than a 1-liner.
You can use a comprehension such as:
next(y[1][1:-1] for y in [[l for l in x.split(':')]
     for x in s1.split(',')] if 'key' in y[0]) # returns 'value' w/o quotes
But that won't handle the case of s2.

How can I find a match and update it with RegEx?

I have a string as
a = "hello i am stackoverflow.com user +-"
Now I want to escape the special characters in the string, except for quotation marks and whitespace. So my expected output is:
a = "hello i am stackoverflow\.com user \+\-"
What I did so far is find all the special characters in a string except whitespace and double quote using
re.findall(r'[^\w" ]',a)
Now that I have found all the required special characters, I want to update the string. I tried re.sub, but it replaces the special characters instead of escaping them. Is there any way I can do this?
Use re.escape.
>>> a = "hello i am stackoverflow.com user +-"
>>> print(re.sub(r'\\(?=[\s"])', r'', re.escape(a)))
hello i am stackoverflow\.com user \+\-
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
r'\\(?=[\s"])' matches every backslash that appears immediately before a whitespace character or a double quote. Replacing the matched backslashes with an empty string gives you the desired output.
OR
>>> a = 'hello i am stackoverflow.com user "+-'
>>> print(re.sub(r'((?![\s"])\W)', r'\\\1', a))
hello i am stackoverflow\.com user "\+\-
((?![\s"])\W) captures every non-word character that is not whitespace or a double quote. Replacing each matched character with a backslash followed by the contents of Group 1 gives you the desired output.
It seems like you could use backreferences with re.sub to achieve your desired output:
import re
a = "hello i am stackoverflow.com user +-"
print(re.sub(r'([^\w" ])', r'\\\1', a)) # hello i am stackoverflow\.com user \+\-
The replacement pattern r'\\\1' is just \\, which means a literal backslash, followed by \1, which means capture group 1: the text captured by the parentheses in the first argument.
In other words, it will escape everything except:
alphanumeric characters
underscore
double quotes
space
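One way to sanity-check the result is to use the escaped string as a pattern and confirm that it matches the original text literally. This is only a sketch; the variable name `escaped` is illustrative.

```python
import re

a = "hello i am stackoverflow.com user +-"

# Prefix every character that is not a word char, double quote, or space
# with a backslash.
escaped = re.sub(r'([^\w" ])', r'\\\1', a)
print(escaped)  # hello i am stackoverflow\.com user \+\-

# Used as a pattern, the escaped string matches the original text exactly.
print(re.fullmatch(escaped, a) is not None)  # True
```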
