Searching multiple sub strings with special character as marker [duplicate] - python

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I have a string like :
myStr = "abcd123[ 45][12] cd [67]"
I want to fetch all the sub-strings between '[' and ']' markers.
I am using findall to fetch them, but all I get is everything between the first '[' and the last ']'.
print(re.findall(r'\[(.+)\]', myStr))
What am I doing wrong here?

This will probably be marked as a duplicate, but the simple fix here is to make your dot lazy:
print(re.findall(r'\[(.+?)\]', myStr))
[' 45', '12', '67']
Here .+? means consume everything until hitting the first, or nearest, closing square bracket. Your current pattern consumes everything up to the very last closing square bracket.
Another pattern that gives the same result here is \[([^\]]+)\]:
print(re.findall(r'\[([^\]]+)\]', myStr))

The .+ is greedy and selects as much as it can, including other [] characters.
You have two options: Make the selector non-greedy by using .+? which selects the least number of characters possible, or explicitly exclude [] from your match by using [^\[\]]+ instead of .+.
(Both of these options are about equally good in this case. Though the "non-greedy" option is preferable if your ending delimiter is a longer string instead of a single character, since the longer string is more difficult to exclude.)
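To make the difference between the three patterns concrete, here is a minimal sketch run against the string from the question. The `<<` and `>>` delimiters in the last part are hypothetical, chosen only to illustrate the multi-character case mentioned above.

```python
import re

my_str = "abcd123[ 45][12] cd [67]"

# Greedy: .+ runs to the last ']' and returns a single oversized match.
greedy = re.findall(r'\[(.+)\]', my_str)

# Lazy: .+? stops at the nearest ']'.
lazy = re.findall(r'\[(.+?)\]', my_str)

# Negated class: brackets are excluded from the match itself.
negated = re.findall(r'\[([^\[\]]+)\]', my_str)

print(greedy)   # [' 45][12] cd [67']
print(lazy)     # [' 45', '12', '67']
print(negated)  # [' 45', '12', '67']

# Hypothetical multi-character delimiters: the lazy form stays simple here,
# whereas a negated class cannot easily exclude the two-character '>>'.
tagged = re.findall(r'<<(.+?)>>', "<<a>> and <<b>>")
print(tagged)   # ['a', 'b']
```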

Related

How to find all every element between text Python [duplicate]

This question already has answers here:
Find string between two substrings [duplicate]
(20 answers)
Closed last year.
I'd like to know how to find characters in between texts in python. What I mean is that you have for example:
cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
and I want to save everything between something: and that's it to another string.
So the result would be:
soultion_to_cool_string = ' not cool 8+8 '
You can use str.find()
start = "something:"
end = "that's it"
cool_string[cool_string.find(start) + len(start):cool_string.find(end)]
If you need to remove the surrounding whitespace, call str.strip() on the result.
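One caveat with the slicing approach is that str.find() returns -1 when a marker is missing, which silently produces a wrong slice. A minimal sketch of a more defensive version follows; the helper name `between` is hypothetical and not part of the original answer.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = text.find(start)
    if begin == -1:
        return None  # start marker not present
    begin += len(start)
    # Search for the end marker only after the start marker.
    stop = text.find(end, begin)
    if stop == -1:
        return None  # end marker not present
    return text[begin:stop]


cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
result = between(cool_string, "something:", "that's it")
print(repr(result))          # ' not cool 8+8 '
print(repr(result.strip()))  # 'not cool 8+8'
```

Passing `begin` as the second argument to find() also guards against an end marker that happens to appear before the start marker.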
You should look into regular expressions, which can do this job: https://docs.python.org/3/howto/regex.html
For your question, we first need the lookahead and lookbehind assertions.
The lookahead:
Asserts that what immediately follows the current position in the string is foo.
Syntax: (?=foo)
The lookbehind:
Asserts that what immediately precedes the current position in the string is foo.
Syntax: (?<=foo)
We need to look behind for something: and lookahead for that's it
import re
regex = r"(?<=something:).*?(?=that's it)"  # .*? lazily matches any characters except line terminators
re.findall(regex, cool_string)
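As a quick check of the lookaround pattern, and as a sketch of an alternative that some readers may find easier to follow, the example below also uses a plain capturing group with re.escape() applied to the markers. The variable names are illustrative only.

```python
import re

cool_string = "I am a very cool string here is something: not cool 8+8 that's it"

# Lookbehind/lookahead: the markers are asserted but not part of the match.
lookaround = re.findall(r"(?<=something:).*?(?=that's it)", cool_string)
print(lookaround)  # [' not cool 8+8 ']

# Alternative: match the markers and capture only the middle part.
# re.escape() protects against regex metacharacters inside the markers.
start, end = "something:", "that's it"
pattern = re.escape(start) + r"(.*?)" + re.escape(end)
captured = re.findall(pattern, cool_string)
print(captured)    # [' not cool 8+8 ']
```

The capturing-group form also works when the start marker has variable length, which a Python lookbehind does not allow.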

Regex search fail when input has line breaks [duplicate]

This question already has an answer here:
Why is Python Regex Wildcard only matching newLine
(1 answer)
Closed 1 year ago.
The following regular expression is not returning any match:
import re
regex = '.*match.*fail.*'
pattern = re.compile(regex)
text = '\ntestmatch\ntestfail'
match = pattern.search(text)
I managed to solve the problem by changing text to repr(text) or setting text as a raw string with r'\ntestmatch\ntestfail', but I'm not sure if these are the best approaches. What is the best way to solve this problem?
Using repr or a raw string on the target string is a bad idea!
Doing that turns each newline character into a literal backslash followed by n.
This is likely to cause unexpected behavior on other test cases.
The real problem is that . matches any character EXCEPT newline.
If you want to match everything, replace . with [\s\S].
This means "whitespace or not whitespace" = "anything".
Other complementary pairs such as [\w\W] also work,
and a character class like this tends to be more efficient than adding an alternative such as (.|\n) just for the newline.
One more thing: it is good practice to use a raw string for the pattern (not for the match target).
This removes the need to double every backslash that has a special meaning in normal Python strings.
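A minimal sketch of the behaviour described above, using the text from the question. It also shows the re.DOTALL flag, which is not mentioned in this answer but is the standard way in Python's re module to let . match newlines.

```python
import re

text = '\ntestmatch\ntestfail'

# Without help, . cannot cross the newline between "match" and "fail".
plain = re.search(r'.*match.*fail.*', text)
print(plain)  # None

# Option 1: re.DOTALL makes every . match newlines as well.
dotall = re.search(r'.*match.*fail.*', text, re.DOTALL)
print(repr(dotall.group(0)))      # '\ntestmatch\ntestfail'

# Option 2: use [\s\S] only where a newline may occur.
char_class = re.search(r'.*match[\s\S]*fail.*', text)
print(repr(char_class.group(0)))  # 'testmatch\ntestfail'
```

The second option leaves the outer .* unchanged, so the match still starts and ends within a single line.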
You could add it as an alternative, but make sure you escape the \ in the regex string, so the regex engine actually gets \n and not an actual newline.
Something like this:
regex = '.*match(.|\\n)*fail.*'
This matches everything from the last \n before match, then any mix of other characters and \n up to testfail. You can adapt this as needed, but the idea is the same: put what you want into a group, and then use | as an or.
On the left is what this regex pattern matched from your example.
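For reference, a short sketch of what the alternation pattern matches on the text from the question. It also shows that the doubled backslash in a normal string and a single backslash in a raw string produce the same pattern.

```python
import re

text = '\ntestmatch\ntestfail'

# These two spellings are the same pattern string.
escaped = '.*match(.|\\n)*fail.*'
raw = r'.*match(.|\n)*fail.*'
print(escaped == raw)  # True

# The group (.|\n)* can step over the newline between the two lines.
m = re.search(raw, text)
print(repr(m.group(0)))  # 'testmatch\ntestfail'
```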

Is there any way to account for all delimiters in a string in Python? [duplicate]

This question already has answers here:
Python - How to split a string by non alpha characters
(8 answers)
Closed 2 years ago.
I'm trying to create a word count for a book (.txt file) and I'm trying to split each line into its separate words using this:
temp = re.split('[; |, |\*|\n| |\|:|.|’|"|&|#|$|(|)|]|//|'']', line)
However, this isn't working because every time I run the program, I have to add another delimiter to the list. This time I have to add '-' and '%'. I remember doing something similar in Java where I could specify a 'range' of delimiters and when I tried the same thing here, it didn't seem to work.
Is there any better way to do this and make sure I just get the word and nothing else?
I think you're looking for \W, the set of all non-word characters, i.e. not a letter, digit, or underscore.
For example:
temp = re.split(r'\W+', line)
By the way, characters inside a regex character set are mostly literal. Yours boils down to this:
[; |,*\n:.’"&#$()]/']
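A small sketch of the \W approach applied to a word count. The sample line is made up for illustration. Note that re.split() can leave an empty string at the edges when the line starts or ends with a delimiter, so collecting the words directly with re.findall(r'\w+', ...) is often more convenient.

```python
import re
from collections import Counter

line = "Hello, world! It's 100% re-usable."  # hypothetical sample line

# Splitting on runs of non-word characters leaves a trailing empty string
# here, because the line ends with a delimiter.
split_words = re.split(r'\W+', line)
print(split_words)  # ['Hello', 'world', 'It', 's', '100', 're', 'usable', '']

# Matching the words themselves avoids the empty entries.
found_words = re.findall(r'\w+', line)
print(found_words)  # ['Hello', 'world', 'It', 's', '100', 're', 'usable']

# Case-insensitive word count for this line.
counts = Counter(word.lower() for word in found_words)
print(counts['hello'])  # 1
```

Both variants treat the apostrophe and the hyphen as delimiters, so "It's" and "re-usable" are split into two words each. Whether that is desirable depends on the book being counted.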

How can I find multiple of the same format in Python? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
For a little idea of what the project is, I'm trying to make a markup language that compiles to HTML/CSS. I plan on formatting links like this: #(link mask)[(link url)], and I want to find all occurrences of this and get both the link mask and the link url.
I tried using this code for it:
re.search("#(.*)\[(.*)\]", string)
But it started at the beginning of the first instance, and ended at the end of the last instance of a link. Any ideas how I can have it find all of them, in a list or something?
The default behavior of a regular expression is "greedy matching". This means each .* will match as many characters as it can.
You want them to instead match the minimal possible number of characters. To do that, change each .* into a .*?. The final question mark will make the pattern match the minimal number of characters. Because you anchor your pattern to a ] character, it will still match/consume the whole link correctly.
* is greedy: it matches as many characters as it can, e.g. up to the last right parenthesis in your document. (After all, . means "any character", and ) is "any character" as much as any other character.)
You need the non-greedy version of *, which is *?. (Actually, you should probably use +?, as I don't think zero-length matches would be very useful.)
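Since the question asks for all occurrences, here is a minimal sketch that combines the non-greedy quantifier with re.findall(). It assumes the link syntax is literally `#mask[url]` (that is, the parentheses in the question are placeholders), and the sample text is made up for illustration.

```python
import re

# Hypothetical markup sample with two links.
text = "See #Python[https://python.org] and #Docs[https://docs.python.org/3/]."

# Greedy search: one match spanning from the first '#' to the last ']'.
greedy = re.search(r"#(.*)\[(.*)\]", text)
print(greedy.group(1))  # Python[https://python.org] and #Docs

# Non-greedy findall: one (mask, url) tuple per link.
links = re.findall(r"#(.+?)\[(.+?)\]", text)
print(links)
# [('Python', 'https://python.org'), ('Docs', 'https://docs.python.org/3/')]
```

re.finditer() with the same pattern would give match objects instead of tuples, which is useful if the positions of the links are needed for the HTML replacement step.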

Regex not working to get string between 2 strings. Python 27 [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
From this URL view-source:https://www.amazon.com/dp/073532753X?smid=A3P5ROKL5A1OLE
I want to get string between var iframeContent = and obj.onloadCallback = onloadCallback;
I have this regex iframeContent(.*?)obj.onloadCallback = onloadCallback;
But it does not work. I am not good at regex so please pardon my lack of knowledge.
I even tried iframeContent(.*?)obj.onloadCallback but it does not work.
It looks like you just want that giant encoded string. I believe yours is failing for two reasons. First, you're not running in DOTALL mode, which means your . won't match across multiple lines. Second, your regex is failing because of catastrophic backtracking, which can happen when you have a very long variable-length match that matches the same characters as the ones following it.
This should get what you want:
m = re.search(r'var iframeContent = \"([^"]+)\"', html_source)
print(m.group(1))
The regex is just looking for any characters except double quotes [^"] in between two double quotes. Because the variable length match and the match immediately after it don't match any of the same characters, you don't run into the catastrophic backtracking issue.
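The effect of the flags can be checked on a small stand-in for the page source. The `html_source` value below is a made-up two-line snippet, not the real Amazon page, and serves only to show how . behaves across a line break.

```python
import re

# Hypothetical stand-in for the page source: the two markers are on
# different lines.
html_source = (
    'var iframeContent = "%3Cdiv%3Ehello%3C%2Fdiv%3E";\n'
    'obj.onloadCallback = onloadCallback;\n'
)

pattern = r'iframeContent(.*?)obj\.onloadCallback'

# Default mode: . does not match the newline, so there is no match.
no_flag = re.search(pattern, html_source)
print(no_flag)  # None

# re.M (MULTILINE) only changes ^ and $, so it does not help either.
multiline = re.search(pattern, html_source, re.M)
print(multiline)  # None

# re.S (DOTALL) lets . cross the line break.
dotall = re.search(pattern, html_source, re.S)
print(repr(dotall.group(1)))  # ' = "%3Cdiv%3Ehello%3C%2Fdiv%3E";\n'

# The negated-class pattern from the answer needs no flag at all.
quoted = re.search(r'var iframeContent = "([^"]+)"', html_source)
print(quoted.group(1))  # %3Cdiv%3Ehello%3C%2Fdiv%3E
```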
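I suspect that the input string spans multiple lines. Try adding re.S (re.DOTALL) to the search call (i.e. re.findall('someString', text_Holder, re.S)). Note that re.M only changes how ^ and $ match and does not let . cross line breaks.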
You could try this regex too
(?<=iframeContent =)(.*)(?=obj.onloadCallback = onloadCallback)
You can test it on this site.
It is very important that you use DOTALL mode, which means that you will have single-line
