python - make my regex less greedy?

I am looking for some Python regex which I can use to extract a string from a file which has a single line of text (which is actually JavaScript code).
An example of what I'm looking for is to extract the variable name that a substring is being taken from:
So if the line of text I was parsing is:
"var foo = bar.substr(baz % qux, morestuffhere"
I want my match to be bar. I'm using the following, which matches after the equals sign and before the modulo operator:
pat = r"\s?\=\s?(.*?)\.substr\(\s?baz\s?\%\s?"
This works great if the string of interest is on a new line, however when part of a longer string it fails. See here for a failed example:
I think the issue is being less greedy with my regex? Although not sure. Pointers appreciated.

As revo says in the comments, you need to be more specific about what your capture group is allowed to match.
.*? matches any character (except a newline)
\S*? matches only non-whitespace characters
\w*? matches only word characters
So you can try this:
\s?\=\s?(\w*?)\.substr\(\s?baz\s?\%\s?
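To see why the narrower class helps, here is a minimal sketch comparing the two patterns. The input line is a made-up example that embeds the statement in a longer string; it is not taken from the question's failing case.

```python
import re

# Hypothetical input: the target statement sits inside a longer line.
line = "var a = 1; var foo = bar.substr(baz % qux, morestuffhere"

loose = r"\s?\=\s?(.*?)\.substr\(\s?baz\s?\%\s?"
strict = r"\s?\=\s?(\w*?)\.substr\(\s?baz\s?\%\s?"

# The lazy .*? still starts at the FIRST '=' and grows until .substr( appears.
loose_match = re.search(loose, line).group(1)
# \w*? cannot cross ';' or spaces, so the first '=' is rejected.
strict_match = re.search(strict, line).group(1)

print(repr(loose_match))   # '1; var foo = bar'
print(repr(strict_match))  # 'bar'
```

Laziness only limits how far a match extends to the right; it does not move the starting point, which is why restricting the character class is what fixes this case.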

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regex: (?:...), (941){1} and the make regex literal character \ like this \(941\) with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to be done using regex. Though it might be useful for others too or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
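As a small illustration of why the escaping matters, the sketch below runs the same search with and without re.escape() on the string from the question.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped, the parentheses form a capturing group around the digits.
unescaped = re.findall(num, s)
# Escaped, the parentheses are matched as literal characters.
escaped = re.findall(re.escape(num), s)

print(unescaped)  # ['941', '941']
print(escaped)    # ['(941)', '(941)']
```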

Regex: Stop when it finds the first occurrence of a character [duplicate]

I am looking for a pattern that matches everything until the first occurrence of a specific character, say a ";" - a semicolon.
I wrote this:
/^(.*);/
But it actually matches everything (including the semicolon) until the last occurrence of a semicolon.
You need
/^[^;]*/
The [^;] is a character class, it matches everything but a semicolon.
^ (start of line anchor) is added to the beginning of the regex so only the first match on each line is captured. This may or may not be required, depending on whether possible subsequent matches are desired.
To cite the perlre manpage:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
This should work in most regex dialects.
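A quick way to check the difference is to run both patterns in Python (used here only as a convenient regex engine; the sample text is invented):

```python
import re

text = "first; second; third"

# Greedy .* runs to the end and backtracks to the LAST semicolon.
greedy = re.match(r"^(.*);", text).group(1)
# The negated class can never step over a semicolon.
negated = re.match(r"^[^;]*", text).group(0)

print(greedy)   # first; second
print(negated)  # first
```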
Would;
/^(.*?);/
work?
The ? is a lazy operator, so the regex grabs as little as possible before matching the ;.
/^[^;]*/
The [^;] says match anything except a semicolon. The square brackets are a set matching operator, it's essentially, match any character in this set of characters, the ^ at the start makes it an inverse match, so match anything not in this set.
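The two approaches are not fully interchangeable. A minimal sketch in Python, with made-up input strings, shows where they differ: the lazy version requires a semicolon to be present, while the negated class does not.

```python
import re

with_semicolon = "alpha;beta;gamma"
without_semicolon = "alpha beta gamma"

lazy = re.compile(r"^(.*?);")
negated = re.compile(r"^[^;]*")

# With a semicolon present, both stop before the first one.
lazy_hit = lazy.match(with_semicolon).group(1)
negated_hit = negated.match(with_semicolon).group(0)

# Without a semicolon, the lazy pattern finds nothing,
# while the negated class matches the whole string.
lazy_miss = lazy.match(without_semicolon)
negated_all = negated.match(without_semicolon).group(0)

print(lazy_hit, negated_hit)   # alpha alpha
print(lazy_miss, negated_all)  # None alpha beta gamma
```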
None of the proposed answers worked for me (e.g. in Notepad++).
But
^.*?(?=\;)
did.
Try /[^;]*/
Google regex character classes for details.
sample text:
"this is a test sentence; to prove this regex; that is g;iven below"
If for example we have the sample text above, the regex /(.*?\;)/ will give you everything until the first occurrence of a semicolon (;), including the semicolon: "this is a test sentence;"
Try /[^;]*/
That's a negating character class.
This was very helpful for me as I was trying to figure out how to match all the characters in an xml tag including attributes. I was running into the "matches everything to the end" problem with:
/<simpleChoice.*>/
but was able to resolve the issue with:
/<simpleChoice[^>]*>/
after reading this post. Thanks all.
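The same effect can be reproduced in a short Python sketch. The attribute names below are invented for illustration, and the bounded pattern assumes no attribute value contains a literal ">".

```python
import re

# Hypothetical element; the attributes are made up for this example.
xml = '<simpleChoice identifier="A" fixed="true">Yes</simpleChoice>'

# Greedy .* runs on to the last '>' in the string.
greedy = re.search(r"<simpleChoice.*>", xml).group(0)
# [^>]* stops at the first '>', i.e. the end of the opening tag.
bounded = re.search(r"<simpleChoice[^>]*>", xml).group(0)

print(greedy)   # <simpleChoice identifier="A" fixed="true">Yes</simpleChoice>
print(bounded)  # <simpleChoice identifier="A" fixed="true">
```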
this is not a regex solution, but something simple enough for your problem description. Just split your string and get the first item from your array.
$str = "match everything until first ; blah ; blah end ";
$s = explode(";",$str,2);
print $s[0];
output
$ php test.php
match everything until first
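For readers working in Python rather than PHP, a rough equivalent of the split-based approach might look like this (a sketch, not part of the original answer):

```python
text = "match everything until first ; blah ; blah end "

# Split at most once and keep the part before the first ';'.
head = text.split(";", 1)[0]

# partition() does the same and also reports whether a ';' was found.
before, sep, after = text.partition(";")

print(repr(head))    # 'match everything until first '
print(repr(before))  # 'match everything until first '
print(repr(sep))     # ';'
```

Both forms return the whole string unchanged when no semicolon is present; with partition(), sep is then the empty string.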
This will match up to the first occurrence only in each string and will ignore subsequent occurrences.
/^([^;]*);*/
"/^([^\/]*)\/$/" worked for me, to get only top "folders" from an array like:
a/ <- this
a/b/
c/ <- this
c/d/
/d/e/
f/ <- this
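Translated into Python as a sketch (the forward slash needs no escaping there), the same pattern picks out only the top-level entries from the list above:

```python
import re

paths = ["a/", "a/b/", "c/", "c/d/", "/d/e/", "f/"]

# One run of non-slash characters, then a single slash at the very end.
top_level = re.compile(r"^([^/]*)/$")

tops = []
for path in paths:
    m = top_level.match(path)
    if m:
        tops.append(m.group(1))

print(tops)  # ['a', 'c', 'f']
```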
Really kinda sad that no one has given you the correct answer....
In regex, ? makes it non greedy. By default regex will match as much as it can (greedy)
Simply add a ? and it will be non-greedy and match as little as possible!
Good luck, hope that helps.
This works for getting the content from the beginning of a line till the first word,
/^.*?([^\s]+)/gm
I faced a similar problem including all the characters until the first comma after the word entity_id. The solution that worked was this in Bigquery:
SELECT regexp_extract(line_items,r'entity_id*[^,]*')

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In Python regex, the syntax used to refer to a numbered captured group (sometimes called a "backreference") is "\n" (where "n" is a digit referring to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
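A minimal sketch of the substitution on its own, leaving out eval() and the calculator's self.equation (the sample equation is invented), shows that both replacement strings give the same result because group 1 only ever captures an empty string:

```python
import re

pattern = r"((?<=^)|(?<=[^\.\d]))0+(\d+)"
equation = "02+03*100+0.05"  # made-up input for illustration

# Group 1 is a zero-width lookbehind, so \1 inserts nothing.
with_both = re.sub(pattern, r"\1\2", equation)
only_second = re.sub(pattern, r"\2", equation)

print(with_both)    # 2+3*100+0.05
print(only_second)  # 2+3*100+0.05
```

Zeros inside 100 and around the decimal point are left alone because the lookbehind rejects positions preceded by a digit or a period.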
Further reading: https://www.regular-expressions.info/brackets.html
First, repl is the string that each match is replaced with.
To understand \1\2 you need to know what capture groups are.
Check this video out for basics of Group capturing.
Here, the parentheses () in your regex split every match into numbered groups: 1, 2, and so on.
In Python's re.sub, \1 and \2 (or \g<1> and \g<2>) refer to them; some other regex flavors use $1 and $2.
In this case, each match (the leading zeros plus the digits after them) is replaced with group 2, which holds the digits that follow the leading zeros.
Note: \1 is not necessary; the substitution works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub(r'\d', ' ', s)
'awd   frr cr  '
>>>
Explanation:
Here '\d' matches any single digit, so each digit is replaced with repl (in this case ' ').

Preserve key:value values in text while regex replacing non-word characters in keys (Notepad++)

Trying without luck in Notepad++ to replace any non-word characters \W with underscore _ from a block of multi-line text, with exception to (and right of) a colon : (which doesn't occur on every line- something of space-delineated hierarchy, terminating in a key-value pair). A python solution could be of use as well, as I'm trying to do other things with it once reformatted. Example:
This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
This_100_is_what_I_d_like: See?
Indentation_isn_t_necessary
_to_maintain_but_would_be_nice: :)<-preserved!
I_m_Mr_Conformist_over_here: |Whereas, I'm like whatever's clever.|
If_you_can_help: Thanks 100.1%!
I admit that I'm answering an off-topic question; I just liked the problem. Press Ctrl+H, enable Regular Expressions in N++, then search for:
(:[^\r\n]*|^\s+)|\W(?<![\r\n])
And replace with:
(?1\1:_)
The regex has two main parts. The first side of the outer alternation matches either everything from the first colon to the end of the line, or the leading whitespace of a line (indentation). The second side matches any non-word character other than a carriage return \r or newline \n (these are excluded by the negative lookbehind) so that line breaks are preserved. The replacement string is a conditional: if the first capturing group matched, the match is replaced with itself; otherwise it is replaced with _.
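Since the question mentions that a Python solution would also help, here is a sketch of the same idea using Python's re module. Python has no (?1...:...) conditional replacement, so a replacement function (keep_or_underscore, a name made up for this example) stands in for it; the sample text is shortened from the question.

```python
import re

# Same alternation as above; MULTILINE makes ^ match at every line start.
pattern = re.compile(r"(:[^\r\n]*|^\s+)|\W(?<![\r\n])", re.MULTILINE)

def keep_or_underscore(match):
    # Group 1 holds text to preserve: indentation, or the colon and the rest.
    if match.group(1) is not None:
        return match.group(1)
    return "_"

text = "  If you can help: Thanks 100.1%!\nNo colon here\n"
result = pattern.sub(keep_or_underscore, text)

print(result)
```

Each non-word character becomes its own underscore, so a run such as ", " yields two underscores, just as the Notepad++ version does.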
Seeing a better description of what you're trying to do, I don't think you'll be able to do it from inside Notepad++ using a single regular expression. However, you could write a Python script that goes through your document, one line at a time, and sanitizes anything to the left of a colon (if one exists).
Here's a quick and dirty example (untested). This assumes doc is an open file pointer to the file you want to sanitize
import re

sanitized_lines = []
for line in doc:
    # DOTALL lets the last group keep the line's trailing newline.
    line_match = re.match(r"^(\s*)([^:\n]*)(.*)", line, re.DOTALL)
    indentation = line_match.group(1)
    left_of_colon = line_match.group(2)
    remainder = line_match.group(3)
    # Replace non-word characters only in the part before the colon.
    left_of_colon = re.sub(r"\W", "_", left_of_colon)
    sanitized_lines.append("".join((indentation, left_of_colon, remainder)))
sanitized_doc = "".join(sanitized_lines)
print(sanitized_doc)
You may try this python script,
ss="""This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
If you can help: Thanks 100.1%!"""
import re
splitcapture=re.compile(r'(?m)^([^:\n]+)(:[^\n]*|)$')
subregx=re.compile(r'\W+')
print(splitcapture.sub(lambda m: subregx.sub('_', m.group(1))+m.group(2), ss))
Here I first match each line and capture two parts separately: the part that contains no ':' character is captured in group 1, and the optional part that starts with ':' and runs to the end of the line is captured in group 2. The replacement is then applied only to the group 1 string, and the result is joined with group 2.
And output is
This_100_isn_t_what_I_want
Yet_it_s_what_I_ve_got_currently: D#rnit :(
If_you_can_help: Thanks 100.1%!

Regex End of Line and Specific Characters

So I'm writing a Python program that reads lines of serial data, and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a regular expression to filter out the extra garbage that the serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alpha numeric pieces that differentiate each string code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this, however. My first error is that there are some characters at the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However, when I run my code through the debugger I notice that after running through the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in Python do you pass only specific characters of a string as an argument? In this case I suppose it would be everything up to length - 3?
Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)
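Putting the three points together, a slightly more defensive sketch might look like the following. It assumes the trailing characters are a real carriage return and newline (not a literal backslash followed by "r"), swaps re.match for re.fullmatch so that extra characters after the code are rejected, and uses extract_code as a made-up helper name.

```python
import re

# No surrounding slashes: Python patterns are plain strings.
CODE = re.compile(r"T12F8B0A22[A-Z0-9]{2}F8")

def extract_code(raw_line):
    """Return the line code in raw_line, or None if it does not match."""
    # rstrip() drops the trailing "\r\n" before matching.
    m = CODE.fullmatch(raw_line.rstrip())
    return m.group(0) if m else None

print(extract_code("T12F8B0A2200F8\r\n"))  # T12F8B0A2200F8
print(extract_code("garbage\r\n"))         # None
```

Checking for None avoids the AttributeError that m.group(0) would raise on a line that does not match.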
There are several ways to get rid of the "\r", but first a little analysis of your code:
1. The special character for the end of the string is just '$', not '$/', in Python.
2. re.sub substitutes the matched pattern with a string ('' in your case), which replaces the string you want to keep with an empty string and leaves you with only the \\r.
Possible solutions:
1. Use a simple replace:
regexString = regexString.replace('\\r', '')
2. If you want to stick to regex, the approach is the same:
pattern = '\\\\r'
cleaned = re.sub(pattern, '', regexString)
3. If you want to access the different groups, use re.search:
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)', regexString)
match.group(1) # gives you the T12...
match.group(2) # gives you the \\r
Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
    print(m.group(0))
T12F8B0A2212F8
