How to find everything between two pieces of text in Python [duplicate] - python

This question already has answers here:
Find string between two substrings [duplicate]
(20 answers)
Closed last year.
I'd like to know how to find the characters between two pieces of text in Python. What I mean is that you have, for example:
cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
and I want to save everything between something: and that's it to another string.
So the result would be:
solution_to_cool_string = ' not cool 8+8 '

You can use str.find()
start = "something:"
end = "that's it"
cool_string[cool_string.find(start) + len(start):cool_string.find(end)]
If you need to remove the surrounding whitespace, use str.strip().
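One caveat with the slice above: str.find() returns -1 when a marker is missing, and the slice then silently returns the wrong text. A minimal sketch of a guarded version follows. The helper name between is hypothetical (not part of the answer), and it assumes you want None when either marker is absent.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None          # start marker not present
    i += len(start)          # skip past the start marker itself
    j = text.find(end, i)    # only look for `end` after the start marker
    if j == -1:
        return None          # end marker not present
    return text[i:j]


cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
result = between(cool_string, "something:", "that's it")
print(repr(result))          # ' not cool 8+8 '
print(repr(result.strip()))  # 'not cool 8+8'
```

Searching for end only from position i onwards also avoids picking up an end marker that appears before the start marker.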

You should look into regular expressions; they will do the job. See https://docs.python.org/3/howto/regex.html
For your question, we first need the lookahead and lookbehind assertions.
The lookahead:
Asserts that what immediately follows the current position in the string is foo.
Syntax: (?=foo)
The lookbehind:
Asserts that what immediately precedes the current position in the string is foo.
Syntax: (?<=foo)
We need to look behind for something: and look ahead for that's it:
import re
regex = r"(?<=something:).*?(?=that's it)"  # .*? lazily matches any characters except line terminators
re.findall(regex, cool_string)  # [' not cool 8+8 ']
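If the markers come from variables rather than being typed into the pattern, a capturing group may be easier to work with than lookarounds. This is only a sketch under a few assumptions: the names start, end and middle are illustrative, re.escape() is used so the markers are treated as literal text, and re.DOTALL is added in case the text between them spans several lines.

```python
import re

cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
start, end = "something:", "that's it"

# Escape the markers so any regex metacharacters in them are matched literally.
pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)

match = pattern.search(cool_string)
middle = match.group(1) if match else None  # None when the markers are not found
print(repr(middle))  # ' not cool 8+8 '
```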

Related

How does this regex pattern for removing punctuation work? [duplicate]

This question already has answers here:
Carets in Regular Expressions
(2 answers)
Closed 11 months ago.
I'm currently learning a bit of regex in Python in an online course and I'm struggling to understand a particular expression. I've been searching the Python re docs and I'm not sure why I'm getting the non-punctuation elements back rather than the punctuation.
The code is:
import re
test_phrase = "This is a sentence, with! unnecessary: punctuation."
punc_remove = re.findall(r'[^,!:]+', test_phrase)
punc_remove
OUTPUT: ['This is a sentence',' with',' unnecessary',' punctuation.']
I think I understand what each character does, i.e. [] is a character set and ^ means starts with. So anything starting with ,!: will be returned? (Or at least that's how I'm probably mistakenly interpreting it.) And the + will return one or more of the pattern. But why is the output not something like:
OUTPUT: [', with','! unnecessary',': punctuation.']
Any explanation really appreciated!
Inside a character class, a ^ does not mean 'starts with': it means 'not'. So the regex matches sequences of one or more characters that are not , ! or :.
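To make the two meanings of ^ concrete, here is a small sketch that uses the phrase from the question. The variable names kept, punct, first_word and cleaned are illustrative, and the re.sub() line assumes the actual goal is to delete those three punctuation characters.

```python
import re

test_phrase = "This is a sentence, with! unnecessary: punctuation."

# Inside [...], a leading ^ negates the set: runs of characters that are NOT , ! or :
kept = re.findall(r"[^,!:]+", test_phrase)
print(kept)        # ['This is a sentence', ' with', ' unnecessary', ' punctuation.']

# Without the ^, the same set matches the punctuation itself.
punct = re.findall(r"[,!:]+", test_phrase)
print(punct)       # [',', '!', ':']

# Outside a character class, ^ anchors the match at the start of the string.
first_word = re.findall(r"^\w+", test_phrase)
print(first_word)  # ['This']

# To actually remove the punctuation, substitute it with an empty string.
cleaned = re.sub(r"[,!:]", "", test_phrase)
print(cleaned)     # This is a sentence with unnecessary punctuation.
```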

Regex search fail when input has line breaks [duplicate]

This question already has an answer here:
Why is Python Regex Wildcard only matching newLine
(1 answer)
Closed 1 year ago.
The following regular expression is not returning any match:
import re
regex = '.*match.*fail.*'
pattern = re.compile(regex)
text = '\ntestmatch\ntestfail'
match = pattern.search(text)
I managed to solve the problem by changing text to repr(text) or setting text as a raw string with r'\ntestmatch\ntestfail', but I'm not sure if these are the best approaches. What is the best way to solve this problem?
Using repr or a raw string on the target string is a bad idea!
Doing that turns each newline character into the two literal characters \ and n.
This is likely to cause unexpected behavior on other test cases.
The real problem is that . matches any character EXCEPT newline.
If you want to match everything, replace . with [\s\S].
This means "whitespace or not whitespace" = "anything".
Other complementary pairs such as [\w\W] also work,
and a character class is more efficient than adding an alternative just for newline.
One more thing: it is good practice to use a raw string for the pattern string (not the match target).
This removes the need to escape backslashes, which have special meaning in normal Python strings.
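A quick sketch of that fix on the text from the question. The re.DOTALL variant is an addition not mentioned in the answer: it is the standard flag that makes . match newlines as well, and it should give the same result here.

```python
import re

text = "\ntestmatch\ntestfail"

# The original pattern fails because . stops at the newline between the words.
no_match = re.search(r".*match.*fail.*", text)
print(no_match)          # None

# [\s\S] matches any character at all, newlines included.
m1 = re.search(r"match[\s\S]*fail", text)
print(repr(m1.group()))  # 'match\ntestfail'

# Alternative (assumption, not from the answer): let . match newlines via re.DOTALL.
m2 = re.search(r"match.*fail", text, re.DOTALL)
print(repr(m2.group()))  # 'match\ntestfail'
```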
You could add the newline as an alternative, but make sure you escape the backslash (\\) in the regex string, so the regex engine actually receives \n and not an actual newline.
Something like this:
regex = '.*match(.|\\n)*fail.*'
This matches everything from the last \n before match, then any mix of characters and \n until the fail in testfail. You can adapt this as needed, but the idea is the same: put the alternatives into a group, and use | as an or.
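For reference, this is roughly what the alternation approach matches on the text from the question. Two details are substitutions rather than part of the answer: the pattern is written as a raw string instead of doubling the backslash, and the group is non-capturing, (?:...), so that it does not change what findall would return.

```python
import re

text = "\ntestmatch\ntestfail"

# (?:.|\n)* accepts either a non-newline character or a newline, repeatedly.
pattern = re.compile(r".*match(?:.|\n)*fail.*")

m = pattern.search(text)
print(m.span())          # (1, 19)
print(repr(m.group()))   # 'testmatch\ntestfail'
```

The match starts at index 1 because the leading .* cannot cross the first newline.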

Searching multiple sub strings with special character as marker [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I have a string like :
myStr = "abcd123[ 45][12] cd [67]"
I want to fetch all the sub-strings between '[' and ']' markers.
I am using findall to fetch them, but all I get is everything between the first '[' and the last ']'.
print(re.findall(r'\[(.+)\]', myStr))
What am I doing wrong here?
This will probably be marked as duplicate, but the simple fix here would be to just make your dot lazy:
print(re.findall(r'\[(.+?)\]', myStr))
[' 45', '12', '67']
Here .+? means consume everything until hitting the first, or nearest, closing square bracket. Your current pattern is consuming everything until the very last closing square bracket.
Another pattern that gives the same result here is \[([^\]]+)\]:
print(re.findall(r'\[([^\]]+)\]', myStr))
The .+ is greedy and selects as much as it can, including other [] characters.
You have two options: Make the selector non-greedy by using .+? which selects the least number of characters possible, or explicitly exclude [] from your match by using [^\[\]]+ instead of .+.
(Both of these options are about equally good in this case. Though the "non-greedy" option is preferable if your ending delimiter is a longer string instead of a single character, since the longer string is more difficult to exclude.)
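Putting the three variants side by side on the string from the question may help. This is just an illustrative sketch and the variable names greedy, lazy and negated are made up for the comparison.

```python
import re

myStr = "abcd123[ 45][12] cd [67]"

# Greedy: .+ runs to the LAST closing bracket, swallowing the inner ones.
greedy = re.findall(r"\[(.+)\]", myStr)
print(greedy)   # [' 45][12] cd [67']

# Lazy: .+? stops at the nearest closing bracket.
lazy = re.findall(r"\[(.+?)\]", myStr)
print(lazy)     # [' 45', '12', '67']

# Negated class: the content may contain anything except square brackets.
negated = re.findall(r"\[([^\[\]]+)\]", myStr)
print(negated)  # [' 45', '12', '67']
```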

Python: Overlapping regex search [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 4 years ago.
So if I create a program in python (3.7) that looks like this:
import re
regx = re.compile("test")
print(regx.findall("testest"))
and run it, then I will get:
['test']
Even though there are two instances of "test" it's only showing me one which I think is because a letter from the first "test" is being used in the second "test". How can I make a program that will give me ["test", "test"] as a result instead?
You will want to use a capturing group inside a lookahead, (?=(regex_here)):
import re
regx = re.compile("(?=(test))")
print(regx.findall("testest"))
['test', 'test']
A regex match consumes the characters it matches, and findall only returns non-overlapping matches. Once consumed, a character is not examined again, so overlapping occurrences are not found.
To do this you need a feature of Python regular expressions called a lookahead assertion. You look for instances of the character t where it is followed by est. The lookahead does not consume any of the string.
import re
regx = re.compile('t(?=est)')
print([m.start() for m in regx.finditer('testest')])
[0, 3]
More details on this page: https://docs.python.org/3/howto/regex.html
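The two ideas above (a capturing group inside a lookahead, plus finditer for positions) can be combined. The function name find_overlapping is hypothetical, and the sketch assumes the pattern passed in is already a valid regex.

```python
import re


def find_overlapping(pattern, text):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # The lookahead itself consumes nothing, so the scan advances one character
    # at a time; the inner group captures what would have matched at that spot.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(text)]


print(find_overlapping("test", "testest"))  # [(0, 'test'), (3, 'test')]
print(find_overlapping("aa", "aaaa"))       # [(0, 'aa'), (1, 'aa'), (2, 'aa')]
```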

Using Python to search for keywords in a PDF [duplicate]

This question already has answers here:
Searching text in a PDF using Python? [duplicate]
(11 answers)
Closed 8 years ago.
I'm searching for keywords in a pdf file so I'm trying to search for /AA or /Acroform like the following:
import re
l = "/Acroform "
s = "/Acroform is what I'm looking for"
if re.search(r"\b" + l.rstrip() + r"\b", s):
    print("yes")
Why don't I get "yes"? I want the "/" to be part of the keyword I'm looking for, if it exists.
Can anyone help me with this?
\b only matches in between a \w (word) and a \W (non-word) character, or vice versa, or when a \w character is at the edge of a string (start or end).
Your string starts with a / forward slash, a non-word character, so \W. \b will never match between the start of a string and /. Don't use \b here; use an explicit negative look-behind for a word character:
re.search(r'(?<!\w){}\b'.format(re.escape(l)), s)
The (?<!...) syntax defines a negative look-behind; like \b it matches a position in the string. Here it'll only match if the preceding character (if there is any) is not a word character.
I used string formatting instead of concatenation here, and used re.escape() to make sure that any regular expression meta characters in the string you are searching for are properly escaped.
Demo:
>>> import re
>>> l = "/Acroform "
>>> s = "/Acroform is what I'm looking for"
>>> if re.search(r'(?<!\w){}\b'.format(re.escape(l)), s):
...     print('Found')
...
Found
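If keywords may also end in a non-word character, the same idea can be applied on both sides. The following is a sketch only: contains_keyword is a hypothetical helper, it strips the keyword first (as the question did with rstrip()), and it swaps the trailing \b for a negative lookahead (?!\w), which is a variation on the answer rather than the answer's own pattern.

```python
import re


def contains_keyword(keyword, text):
    """True if `keyword` occurs in `text` without a word character on either side."""
    keyword = keyword.strip()
    # re.escape() makes sure characters such as / or . are matched literally.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text) is not None


s = "/Acroform is what I'm looking for"
print(contains_keyword("/Acroform ", s))  # True
print(contains_keyword("/AA", s))         # False
print(contains_keyword("/Acro", s))       # False, because "f" follows directly

# For comparison, the original \b-based pattern finds nothing:
print(re.search(r"\b/Acroform\b", s))     # None
```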
