regex, how to exlude search in match - python

Might be a bit messy title, but the question is simple.
I got this in Python:
string = "start;some;text;goes;here;end"
the start; and end; word is always at the same position in the string.
I want the second word which is some in this case. This is what I did:
import re
string = "start;some;text;goes;here;end"
word = re.findall("start;.+?;" string)
In this example, there might be a few things to modify to make it more appropriate, but in my actual code, this is the best way.
However, the string I get back is start;some;, where the search characters themselves is included in the output. I could index both ;, and extract the middle part, but there have to be a way to only get the actual word, and not the extra junk too?

No need for regex in my opinion, but all you need is a capture group here.
word = re.findall("start;(.+?);", string)
Another improvement I'd like to suggest is not using .. Rather be more specific, and what you are looking for is simply anything else than ;, the delimiter.
So I'd do this:
word = re.findall("start;([^;]+);", string)

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regex: (?:...), (941){1} and the make regex literal character \ like this \(941\) with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to be done using regex. Though it might be useful for others too or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.

Remove continuous occurrence of vowels together in a string using Python

I have a string like below:
"i'm just returning from work. *oeee* all and we can go into some detail *oo*. what is it that happened as far as you're aware *aouu*"
with some junk characters like above (highlighted with '*' marks). All I could observe was that junk characters come as bunch of vowels knit together. Now, I need to remove any word that has space before and after and has only vowels in it (like oeee, aouu, etc...) and length of 2 or more. How do I achieve this in python?
Currently, I built a tuple to include replacement words like ((" oeee "," "),(" aouu "," ")) and sending it through a for loop with replace. But if the word is 'oeeee', I need a add a new item into the tuple. There must be a better way.
P.S: there will be no '*' in the actual text. I just put it here to highlight.
You need to use re.sub to do a regex replacement in python. You should use this regex:
\b[aeiou]{2,}\b
which will match a sequence of 2 or more vowels in a word by themselves. We use \b to match the boundaries of the word so it will match at the beginning and end of the string (in your string, aouu) as well as words adjacent to punctuation (in your string, oo). If your text may include uppercase vowels too, use the re.I flag to ignore case:
import re
text = "i'm just returning from work. oeee all and we can go into some detail oo. what is it that happened as far as you're aware aouu"
print(re.sub(r'\b[aeiou]{2,}\b', '', text, 0, re.I))
Output
i'm just returning from work. all and we can go into some detail . what is it that happened as far as you're aware

parsing string with specific name in python

i have string like this
<name:john student male age=23 subject=\computer\sience_{20092973}>
i am confused ":","="
i want to parsing this string!
so i want to split to list like this
name:john
job:student
sex:male
age:23
subject:{20092973}
parsing string with specific name(name, job, sex.. etc) in python
i already searching... but i can't find.. sorry..
how can i this?
thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match("<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has not parentheses around it, so it doesn't save what it matches, and it matches at one or more ("+") whitespace characters ("\s"). This suffices for most of what we want. At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, needs to be escaped to be normal in the string, so we escape it with \, and in the end, that means "[^{]*" matches zero or more ("*") characters that aren't "{" ("[^{]"). Since "*" and "+" are "greedy", and will try to match as much as they can and still have the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "({\S+})" does.

Should I reuse a compiled regex?

this is a quick question:
How would I specify a regex which can be used several times with multiple match strings? I might not have worded that right, but I will try to show some code.
I have this regex:
regex = compile(r'(?=(%s))')
In a for loop, I will try and match the string I have to one I specify for the regex so that at each iteration, I can change the string being matched and it will try to match it.
So is this possible, can I do something like
regex.findall(myStringString, myMatchString)
in the code or would I have to recompile the regex in order for it to match a new string?
More clarification:
I want to do this:
re.findall('(?=(%s))' %myMatchString, mySearchString)
but because myMatchString will be changing at each iteration of the loop, I want to do it like this so I can match the new string:
regex = re.compile(r'(?=(%s))')
regex.findall( myMatchString, mySearchString)
Thanks for reading
well, if I understand what you say, all you want to write is :
def match_on_list_of_strings(list_of_strings):
regex = compile(r'(?=(%s))')
for string in list_of_strings:
yield regex.findall(string)
That will apply your match on the strings as many times there are strings in the list of strings, while your regex been compiled only once.
Aaaah... but you don't need a regex for that:
def match_on_list_of_strings(bigstring, list_of_strings):
for string in list_of_strings:
if string in bigstring:
yield string
or if you really want to use a re:
def match_on_list_of_strings(bigstring, list_of_strings):
for string in list_of_strings:
if re.match('.*'+string+'.*', bigstring):
yield string
And then to answer your question, no you can't compile the destination string into a regex, but only the contrary. When you compile a regex, what you do is transform the actual regexp into an internal representation of the automaton. You might want to read courses on NFA and regexps
The point of re.compile is to explicitly declare you're going to re-use the same pattern again and again - and hopefully avoid any compilation that may be required.
As what you're doing is not necessarily re-using the same pattern, then you're better off letting the re system cache patterns (it caches n many - but can't remember exactly how many), and just use re.findall(...)/whatever your regex afresh each time.

Substituting a regex only when it doesn't match another regex (Python)

Long story short, I have two regex patterns. One pattern matches things that I want to replace, and the other pattern matches a special case of those patterns that should not be replace. For a simple example, imagine that the first one is "\{.*\}" and the second one is "\{\{.*\}\}". Then "{this}" should be replaced, but "{{this}}" should not. Is there an easy way to take a string and say "substitute all instances of the first string with "hello" so long as it is not matching the second string"?
In other words, is there a way to make a regex that is "matches the first string but not the second" easily without modifying the first string? I know that I could modify my first regex by hand to never match instances of the second, but as the first regex gets more complex, that gets very difficult.
Using negative look-ahead/behind assertion
pattern = re.compile( "(?<!\{)\{(?!\{).*?(?<!\})\}(?!\})" )
pattern.sub( "hello", input_string )
Negative look-ahead/behind assertion allows you to compare against more of the string, but is not considered as using up part of the string for the match. There is also a normal look ahead/behind assertion which will cause the string to match only if the string IS followed/preceded by the given pattern.
That's a bit confusing looking, here it is in pieces:
"(?<!\{)" #Not preceded by a {
"\{" #A {
"(?!\{)" #Not followed by a {
".*?" #Any character(s) (non-greedy)
"(?<!\})" #Not preceded by a } (in reference to the next character)
"\}" #A }
"(?!\})" #Not followed by a }
So, we're looking for a { without any other {'s around it, followed by some characters, followed by a } without any other }'s around it.
By using negative look-ahead/behind assertion, we condense it down to a single regular expression which will successfully match only single {}'s anywhere in the string.
Also, note that * is a greedy operator. It will match as much as it possibly can. If you use "\{.*\}" and there is more than one {} block in the text, everything between will be taken with it.
"This is some example text {block1} more text, watch me disappear {block2} even more text"
becomes
"This is some example text hello even more text"
instead of
"This is some example text hello more text, watch me disappear hello even more text"
To get the proper output we need to make it non-greedy by appending a ?.
The python docs do a good job of presenting the re library, but the only way to really learn is to experiment.
You can give replace a function (reference)
But make sure the first regex contain the second one. This is just an example:
regex1 = re.compile('\{.*\}')
regex2 = re.compile('\{\{.*\}\}')
def replace(match):
match = match.group(0)
if regex2.match(match):
return match
return 'replacement'
regex1.sub(replace, data)
You could replace all the {} instances with your replacement string (which would include the {{}} ones), and then replace the {{}} ones with a back-reference to itself (overwriting the first replace with the original data) -- then only the {} instances would have changed.
It strikes me as sub-optimal to match against two different regexes when what you're looking for is really one pattern. To illustrate:
import re
foo = "{{this}}"
bar = "{that}"
re.match("\{[^\{].*[^\}]\}", foo) # gives you nothing
re.match("\{[^\{].*[^\}]\}", bar) # gives you a match object
So it's really your regex which could be a bit more precise.

Categories