I'd like to match the urls like this:
input:
x = "https://play.google.com/store/apps/details?id=com.alibaba.aliexpresshd&hl=en"
get_id(x)
output:
com.alibaba.aliexpresshd
What is the best way to do it with re in python?
def get_id(toParse):
    return re.search('id=(WHAT TO WRITE HERE?)', toParse).groups()[0]
I found only the case with exactly one dot.
You could try:
r'\?id=([a-zA-Z\.]+)'
For your regex, like so:
import re
def get_id(toParse):
    regex = r'\?id=([a-zA-Z\.]+)'
    # findall returns a list with the text of the capture group; take the first hit
    x = re.findall(regex, toParse)[0]
    return x
Regex -
By adding r before the pattern, we make it a raw string, so backslashes reach the regex engine unchanged and we don't have to double each one, which is better explained here.
? holds special meaning for the regex system, so to match a question mark, we precede it by a backslash like \?
id= matches the id= part of the extraction
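([a-zA-Z\.]+) is the first capture group, group(1), of the regex, and it matches the id in the URL (group(0) would be the whole match, including ?id=). Because the pattern has exactly one capture group, re.findall returns the text of that group for each match.
Note - I have used re.findall for this because it returns a list whose element at index 0 is the extracted text, which is why [0] gives the desired result.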
I recommend you take a look at rexegg.com for a full list of regex syntax.
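To make the group numbering concrete, here is a minimal sketch that compares re.search and re.findall on the URL from the question. The variable names are only illustrative, and the character class is the one from this answer, so ids containing digits or underscores would need a wider class such as [\w.]+.

```python
import re

url = "https://play.google.com/store/apps/details?id=com.alibaba.aliexpresshd&hl=en"
pattern = r'\?id=([a-zA-Z.]+)'

match = re.search(pattern, url)
full_match = match.group(0)   # whole match, including the leading "?id="
first_group = match.group(1)  # only the text captured by the parentheses

# With exactly one capture group, findall returns a list of that group's text.
found = re.findall(pattern, url)

print(full_match)
print(first_group)
print(found)
```

Here group(1) and found[0] hold the same text, while group(0) also contains the literal ?id= prefix.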
Actually, you do not need to put anything "special" there.
Since you know that the bundle id is between id= and &, you can just capture whatever is inside and have your result in capture group like this:
id=(.+)&
So the code would look like this:
def get_id(toParse):
    return re.search('id=(.+)&', toParse).groups()[0]
Note: most regex engines reserve index 0 for the full match, so the first capture group is usually index 1. In Python, match.group(0) is the full match, while match.groups() returns only the captured groups, so groups()[0] above is already the first group.
See demo here
This regex should easily get what you want: it captures everything between id= and either the following parameter (.*? being ungreedy) or the end of the string.
id=(.*?)(&|$)
If you only need the id itself, it will be in the first group.
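Putting the answers above together, a possible get_id could look like the sketch below. The [?&] prefix and the [^&#]+ class are assumptions of this sketch, not part of the original answers; they are there so that a parameter such as uid= is not picked up and so that the id may be the last parameter. The second function, get_id_parsed, is a hypothetical name for a regex-free alternative based on the standard library's urllib.parse.

```python
import re
from urllib.parse import urlparse, parse_qs

def get_id(url):
    """Return the value of the id query parameter, or None if it is absent."""
    # [?&] anchors on a parameter boundary; [^&#]+ stops at the next
    # parameter or at a fragment, so no trailing & is required.
    match = re.search(r'[?&]id=([^&#]+)', url)
    return match.group(1) if match else None

def get_id_parsed(url):
    """Same idea without a regex, using the standard-library URL parser."""
    values = parse_qs(urlparse(url).query).get('id')
    return values[0] if values else None

x = "https://play.google.com/store/apps/details?id=com.alibaba.aliexpresshd&hl=en"
print(get_id(x))
print(get_id_parsed(x))
```

Returning None instead of raising when there is no id is a design choice of the sketch; the original one-liner would raise an AttributeError in that case.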
Related
Let's say we have this
string:"Code:1,Some text some other text {fdf: more text, attr=important "
I want a regex that can find attr, extract important and 1, and put them in a dict.
I tried this one:
(?<=testcaseid_)[^_]+_[^_]+
but it still captures everything before it.
I'm not sure I understand correctly, but if you want to get everything starting from "1" up to the word after attr=, you can also use a regex like this:
r"1.*?attr=\w+"
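If the goal is a dict rather than one long match, named groups may be a better fit. The sketch below assumes the number always follows Code: and that the wanted value is the word after attr=; the key names code and attr and the function name extract are made up for illustration.

```python
import re

text = "Code:1,Some text some other text {fdf: more text, attr=important "

def extract(text):
    """Return the code number and the attr value as a dict, or {} if not found."""
    # .*? lazily skips everything between the code and the first attr=
    match = re.search(r"Code:(?P<code>\d+).*?attr=(?P<attr>\w+)", text)
    return match.groupdict() if match else {}

result = extract(text)
print(result)
```

groupdict() maps each group name to the text it captured, so no separate dict construction is needed.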
I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. This is called a positive lookbehind.
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself, we have to escape it; hence, \\
.* matches any character, zero or more times.
? specifies that the match is not greedy (so we are implicitly assuming here that \feature\ only shows up once after \App\).
The pattern in general also assumes that there are no \ characters between \App\ and \feature\.
The full code would be something like:
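import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# start and end each begin and end with one backslash; the extra \ on each
# side in the raw f-string turns those into regex-escaped \\ pairs
pattern = rf"(?<=\{start}\).*?(?=\{end}\)"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module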
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
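A slightly more general variant of the lookaround approach is sketched below. It uses re.escape so that the delimiters can contain any regex metacharacters, not just backslashes. The helper name between is hypothetical.

```python
import re

def between(text, start, end):
    """Return the shortest text found between start and end, or None."""
    # re.escape makes the delimiters literal, so backslashes need no manual doubling
    pattern = f"(?<={re.escape(start)})(.*?)(?={re.escape(end)})"
    match = re.search(pattern, text)
    return match.group(1) if match else None

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'
module = between(path, '\\App\\', '\\feature\\')
print(module)
```

Because the delimiters are escaped at run time, the same helper works for other separators, for example between("a[b]c", "[", "]").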
We can also do that with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# slice from just after the first start marker up to the last end marker
print(path[path.find(start) + len(start):path.rfind(end)])
output
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a capture group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
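As a runnable illustration of the group approach, the sketch below writes the pattern as a Python raw string, where \\ is the regex for one literal backslash (the four-backslash form is what a non-raw string literal would need). The variable names are assumptions for the example.

```python
import re

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# Non-capturing groups around the delimiters, one capture group for the name.
pattern = r'(?:App\\)([A-Za-z0-9]*)(?:\\feature)'

match = re.search(pattern, path)
module = match.group(1) if match else None
print(module)
```

Only group(1) is read here; group(0) would also include the App\ and \feature delimiters.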
In my code I only want to fetch a variable name from a C file when it is used in an if condition.
The following is the regex code snippet:
fieldMatch = re.findall(itemFieldList[i]+"=", codeline, re.IGNORECASE);
Here I can find the variable itemFieldList[i] in the file.
But when I try to add the if as shown below, nothing is extracted, even though the variable exists in an if condition in the C code.
fieldMatch = re.findall(("^(\w+)if+[(](\w+)("+itemFieldList[i]+")="), codeline, re.IGNORECASE|re.MULTILINE);
Can anyone suggest how to create a regex that handles this scenario?
Sample Input :
IF(WORK.env_flow_ind=="R")
OR
IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")
here itemFieldList[i] = WORK.env_flow_ind
I don't have enough reputation to make this a comment, which it should be, and I can't say that I fully understand the question. But to point out a few things:
If it's about adding variables to your regex, then you should be using string templates to make it more understandable for us and your future self.
"^{}".format(variable)
Doing that will allow you to create a dynamic regex that searches for what you want.
Secondly, I don't think that is your problem. I think that your regex is malformed. I don't know what exactly you are trying to search for, but I recommend reading the Python regex documentation and testing your regex on a resource like regex101 to make sure that you're capturing what you intend to. From what I can see, you are a bit confused about groups. When you put parentheses around a pattern you are identifying it as a group. You were on the right track trying to exclude the parentheses in your search by surrounding them with square brackets, but it's simpler and cleaner to escape them.
if you are trying to capture this statement:
if(someCondition == fries)
and you want to extract the keyword fries the valid syntax for that pattern is:
(?=if\((?:[\w=\s])+(fries)\))
Since you want this to be dynamic you would replace the string fries with your string template, and you'll get code that ends up something like this:
p = re.compile(r"(?=if\((?:[\w=\s])+({})\))".format(re.escape(search)), re.IGNORECASE)
p.findall(string)
Regex101 does a better job of breaking down my regex than I ever will:
Link cuz i have no rep
You can build the regex pattern as:
pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"
This will look for the word “if” in lowercase, then optionally any spaces, then an opening parenthesis, then optionally any characters, and then your search term, its beginning and its end at word boundaries.
So if variablename is "WORK.env_flow_ind", then re.findall(pattern, textfile) will find a match in each of the following lines:
if(blabla & WORK.env_flow_ind == "a")
if (WORK.env_flow_ind == "b")
if(WORK.env_flow_ind == "b")
if( WORK.env_flow_ind == "b")
and these won't match:
if (WORK.env_bla == "c")
if (WORK.env_flow_ind2 == "d")
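The pattern above can be tried against the sample input from the question with a small sketch like the following. Since the sample uses an uppercase IF, the sketch adds re.IGNORECASE, which is an assumption on top of this answer; the helper name uses_in_if is hypothetical.

```python
import re

def uses_in_if(line, variablename):
    """Return True if variablename appears inside an if(...) on this line."""
    pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"
    # IGNORECASE so that both "if" and "IF" are accepted
    return re.search(pattern, line, re.IGNORECASE) is not None

lines = [
    'IF(WORK.env_flow_ind=="R")',
    'IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")',
    'if (WORK.env_bla == "c")',
    'if (WORK.env_flow_ind2 == "d")',
]
results = [uses_in_if(line, "WORK.env_flow_ind") for line in lines]
print(results)
```

The trailing \b is what rejects the last line: the 2 after env_flow_ind is a word character, so there is no word boundary at that position.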
I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work but I have the feeling it is not very efficient and also if the string has the word "group" in it, that will fail as well...
What I am looking for is the output should find anything after the colon (:). I also thought I can do something like using group 2 as my match... but the problem with that is, if there are spaces in the name then I won't be able to get the correct name.
(SecurityGroup):(\w{1,})
Why not just do
security_string.split(':')[1]
To grab the second part of the String after the colon?
You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<= # positive lookbehind. Matches things preceded by the following group
SecurityGroup: # pattern you want your matches preceded by
) # end positive lookbehind
( # start matching group
.* # any number of characters
) # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
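For comparison, here is a short sketch that runs the capture-group pattern and a plain string method over the three sample strings. It assumes the surrounding quotes in the question are not part of the data, and it uses str.partition rather than split so that a colon inside the name would be kept.

```python
import re

samples = [
    "SecurityGroup:Pub HDP SG",
    "SecurityGroup:Group-Name",
    "SecurityGroup:TestName",
]

pattern = re.compile(r"SecurityGroup:(.*)")

# group(1) holds everything after the prefix, spaces and hyphens included
by_regex = [pattern.search(s).group(1) for s in samples]

# partition splits on the first colon only and returns (before, sep, after)
by_partition = [s.partition(":")[2] for s in samples]

print(by_regex)
print(by_partition)
```

Both approaches give the same names for these inputs; the regex additionally checks that the prefix really is SecurityGroup.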
Maybe this:
([^:"]+[^\s](?="))
Regex live here.
In the following regex r"\g<NAME>\w+", I would like to detect that a group named NAME must be used for the replacement corresponding to a match.
Which regex matches the wrong use of \g<...>?
For example, the following code finds any unescaped group references.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"\g<NAME>\w+\\g<ESCAPED>"):
    print(m.group(1))
But there is a last problem to solve. How can I manage cases like \g<WRONGUSE\> and \g\<WRONGUSE>?
As far as I am aware, the only restriction on named capture groups is that you can't put metacharacters in there, such as . \, etc...
Have you come across some kind of problem with named capture groups?
The regex you used, r"illegal|(\g<NAME>\w+)", is a problem because \g<NAME> is the syntax for referring to a group in a replacement string; it does not declare a group. If you want to make a named capture group, the syntax is (?P<NAME>regex)
Like this:
>>> import re
>>> string = "sup bro"
>>> re.sub(r"(?P<greeting>sup) bro", r"\g<greeting> mate", string)
'sup mate'
If you wanted to do some kind of analysis on the actual regex string in use, I don't think there is anything inside the re module which can do this natively.
You would need to run another match on the regex string itself: put the regex into a string variable and then match something like \(\?P<(.*?)> against it, which would give you the named capture group's name.
I hope that is what you are asking for... Let me know.
So, what you want is to get the string of the group name, right?
Maybe you can get it by doing this:
>>> regex = re.compile(r"illegal|(?P<group_name>\w+)")
>>> regex.groupindex
{'group_name': 1}
As you see, groupindex returns a dictionary mapping the group names and their position in the regex. Having that, it is easy to retrieve the string:
>>> # A list of the group names in your regex:
... regex.groupindex.keys()
['group_name']
>>> # The string of your group name:
... regex.groupindex.keys()[0]
'group_name'
Don't know if that is what you were looking for...
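The session above shows Python 2 output. In Python 3, groupindex is a read-only mapping and keys() is a view that cannot be indexed, so a conversion is needed first. A minimal sketch, with illustrative variable names:

```python
import re

regex = re.compile(r"illegal|(?P<group_name>\w+)")

# Convert the read-only mapping before indexing or comparing it.
name_to_index = dict(regex.groupindex)
group_names = list(regex.groupindex)

print(name_to_index)
print(group_names[0])
```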
Use a negative lookahead?
\\g(?!<\w+>)
This searches for any \g not followed by <…>, thus a "wrong use".
Thanks to all the comments, I have this solution.
# Good uses.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print(m.group(1))
# Bad uses.
p = re.compile(r"(?:[^\\])\\g(?!<\w+>)")
if p.search(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print("Wrong use !")
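One limitation of (?:[^\\]) is that it needs a character before the backslash, so a \g<...> at the very start of the string is never examined. A possible variant, sketched below, swaps in a negative lookbehind (?<!\\), which also succeeds at position 0. The names GOOD, BAD, good_names and has_wrong_use are hypothetical, and the sketch still treats any \g preceded by a backslash as escaped, so an odd run of three or more backslashes is not handled.

```python
import re

# (?<!\\) means "not preceded by a backslash" and also holds at the string start.
GOOD = re.compile(r"(?<!\\)\\g<(\w+)>")
BAD = re.compile(r"(?<!\\)\\g(?!<\w+>)")

def good_names(template):
    """Return the names used in well-formed, unescaped \\g<name> references."""
    return GOOD.findall(template)

def has_wrong_use(template):
    """Return True if an unescaped \\g is not followed by <name>."""
    return BAD.search(template) is not None

print(good_names(r"\g<NAME>\w+\\g<ESCAPED>"))
print(has_wrong_use(r"\g<WRONGUSE\>"))
print(has_wrong_use(r"\g\<WRONGUSE>"))
print(has_wrong_use(r"\g<NAME>"))
```

With the lookbehind, the leading \g<NAME> in the first example is found, while the escaped \\g<ESCAPED> is skipped.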