Regex python Match after and before a specific string - python

Let's say we have this
string:"Code:1,Some text some other text {fdf: more text, attr=important "
I want a regex that I can use with findall to locate attr, extract important and 1, and put them in a dict.
I tried this one:
(?<=testcaseid_)[^_]+_[^_]+
but it still captures everything that comes before.

I'm not sure I understand correctly, but if you want everything from "1" up to the word that follows attr=, you can use a regex like this:
r"1.*?attr=\w+"
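To get the two values into a dict rather than one long match, capturing groups can be used. The following is a minimal sketch, assuming the number always follows "Code:" and the wanted word always follows "attr="; the group names code and attr and the helper extract are illustrative, not from the question.

```python
import re

text = 'Code:1,Some text some other text {fdf: more text, attr=important '

# Capture the digits after "Code:" and the word after "attr=".
# The lazy .*? stops at the first "attr=" that follows the code.
pattern = re.compile(r"Code:(?P<code>\d+).*?attr=(?P<attr>\w+)")

def extract(s):
    """Return one dict per Code/attr pair found in s."""
    return [m.groupdict() for m in pattern.finditer(s)]

print(extract(text))
```

finditer is used instead of findall here because each match object can produce a dict directly through groupdict().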

Related

How do I remove a part of an URL using regex in Python?

I have a list of URLs that look like this:
'https://www.superpopgadget.com/collections/best-sellers/products/sushi-roll-bazooka?Ffbclid=IwAR3WfVizYJF0RCP2AsSoulLjJK2_OUwQZ0Y1eep_b3Einm1XNJbcF_K3wYI'
I want to trim each one to just:
'https://www.superpopgadget.com/collections/best-sellers/products/sushi-roll-bazooka'
Not sure if there is any other more efficient method but this might work fine:
(.+)\?(.+)
It matches in the first group everything before the character ? and the second group is everything after it. What you need is the first group.
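A short sketch of both options in Python follows. The helper names first_group and strip_query are made up for illustration, and the re.sub variant is an alternative to the answer's pattern rather than part of it.

```python
import re

url = ('https://www.superpopgadget.com/collections/best-sellers/products/'
       'sushi-roll-bazooka?Ffbclid=IwAR3WfVizYJF0RCP2AsSoulLjJK2_OUwQZ0Y1eep_b3Einm1XNJbcF_K3wYI')

def first_group(u):
    # The answer's pattern: group 1 is the text before the "?".
    m = re.match(r"(.+)\?(.+)", u)
    return m.group(1) if m else u

def strip_query(u):
    # Alternative: delete the first "?" and everything after it.
    # URLs without a "?" are returned unchanged.
    return re.sub(r"\?.*$", "", u)

print(first_group(url))
print(strip_query(url))
```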

Removing markup links in text

I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this:
[the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.
Here is another example:
[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)
I'd like to keep: the podcast list.
How can I do this with Python's re library? What is the appropriate regex?
I have created an initial attempt at your requested regex:
(?<=\[.+\])\(.+\)
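The first part, (?<=...), is a lookbehind: it requires the bracketed text to come before the match but does not include it in the match. Note that Python's built-in re module only accepts fixed-width lookbehinds, so a lookbehind containing .+ raises an error there; this pattern needs an engine that allows variable-width lookbehind, such as the third-party regex module. You can use the pattern with the sub method.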
You can extend the above regex to look for only things that have weblinks in the brackets, like so:
(?<=\[.+\])\(https?:\/\/.+\)
The problem with this is that it fails if the link does not start with http or https.
After this you will still need to remove the square brackets; simply removing all square brackets may work fine for you.
Edit 1:
Valentino pointed out that substitute accepts capturing groups, which lets you capture the text and substitute the text back in using the following regex:
\[(.+)\]\(.+\)
You can then substitute the first captured group (in the square brackets) back in using:
re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)
If you want to look at the regex in more detail (if you're new to regex or want to learn what the symbols mean), I would recommend an online regex interpreter. These tools explain what each symbol does, which makes a pattern much easier to read, especially when it has as many escaped symbols as this one.
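One caveat with the greedy .+ in the pattern above is that a line containing two links can be swallowed as a single match. The sketch below is a variation, not the original answer's pattern: it uses negated character classes so that each link is matched on its own. The names LINK and strip_links are illustrative.

```python
import re

# [^\]]+ cannot run past the closing "]", and [^)]+ cannot run past ")".
# Assumption: link text contains no "]" and URLs contain no ")".
LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")

def strip_links(text):
    """Replace each [text](url) with just text."""
    return LINK.sub(r"\1", text)

sample = ("See [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) "
          "and [the text you read](https://website.com/to/go/to).")

print(strip_links(sample))
```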

How do I create a regex with regular variable and some fixed text in Python?

I want to find a variable name in a C file, but only where it is used inside an if condition.
The following regex snippet:
fieldMatch = re.findall(itemFieldList[i]+"=", codeline, re.IGNORECASE);
finds the variable itemFieldList[i] in the file.
But when I add the if part, as shown below, nothing is extracted, even though the variable appears in an if condition in the C code.
fieldMatch = re.findall(("^(\w+)if+[(](\w+)("+itemFieldList[i]+")="), codeline, re.IGNORECASE|re.MULTILINE);
Can anyone suggest how to build a regex for this scenario?
Sample Input :
IF(WORK.env_flow_ind=="R")
OR
IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")
here itemFieldList[i] = WORK.env_flow_ind
I don't have enough reputation to post this as a comment, which it should be, and I can't say that I fully understand the question. But to point out a few things:
If the question is about adding variables to your regex, then you should use string formatting to make the pattern more readable for us and your future self.
"^{}".format(variable)
Doing that lets you build a dynamic regex that searches for what you want.
Secondly, I don't think that is your problem. I think that your regex is malformed. I don't know exactly what you are trying to search for, but I recommend reading the Python regex documentation and testing your regex on a resource like regex101 to make sure that you're capturing what you intend to. From what I can see, you are a bit confused about groups. When you put parentheses around a pattern, you are marking it as a group. You were on the right track trying to match a literal parenthesis by surrounding it with square brackets, but it's simpler and cleaner to escape it.
if you are trying to capture this statement:
if(someCondition == fries)
and you want to extract the keyword fries the valid syntax for that pattern is:
(?=if\((?:[\w=\s])+(fries)\))
Since you want this to be dynamic you would replace the string fries with your string template, and you'll get code that ends up something like this:
p = re.compile(r"(?=if\((?:[\w=\s])+({})\))".format(re.escape(search)), re.IGNORECASE)
p.findall(string)
You can build the regex pattern as:
pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"
This will look for the word “if” in lowercase, then optionally any spaces, then an opening parenthesis, then optionally any characters, and then your search term, its beginning and its end at word boundaries.
So if variablename is "WORK.env_flow_ind", then re.findall(pattern, textfile) will find a match in each of the following lines:
if(blabla & WORK.env_flow_ind == "a")
if (WORK.env_flow_ind == "b")
if(WORK.env_flow_ind == "b")
if( WORK.env_flow_ind == "b")
and these won't match:
if (WORK.env_bla == "c")
if (WORK.env_flow_ind2 == "d")
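A runnable sketch of this approach against the question's sample input follows. It assumes re.IGNORECASE is wanted, because the sample writes IF in upper case while the pattern spells it in lower case; the helper name build_pattern is hypothetical.

```python
import re

def build_pattern(variable_name):
    # re.escape makes the "." in the name match a literal dot only.
    return re.compile(
        r"\bif\b\s*\(.*?\b" + re.escape(variable_name) + r"\b",
        re.IGNORECASE,  # assumption: the C-like source may write IF or if
    )

pattern = build_pattern("WORK.env_flow_ind")

lines = [
    'IF(WORK.env_flow_ind=="R")',
    'IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")',
    'if (WORK.env_bla == "c")',
    'if (WORK.env_flow_ind2 == "d")',
]

# Collect only the lines where the variable appears inside an if(...).
matching = [line for line in lines if pattern.search(line)]
print(matching)
```

The trailing \b is what rejects WORK.env_flow_ind2, since no word boundary exists between "d" and "2".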

How to match the bundle id for android app?

I'd like to match the urls like this:
input:
x = "https://play.google.com/store/apps/details?id=com.alibaba.aliexpresshd&hl=en"
get_id(x)
output:
com.alibaba.aliexpresshd
What is the best way to do it with re in python?
def get_id(toParse):
    return re.search('id=(WHAT TO WRITE HERE?)', toParse).groups()[0]
I found only the case with exactly one dot.
You could try:
r'\?id=([a-zA-Z\.]+)'
as your regex, like so:
def get_id(toParse):
    regex = r'\?id=([a-zA-Z\.]+)'
    x = re.findall(regex, toParse)[0]
    return x
Regex -
By adding r before the string literal, we make it a raw string, so backslashes in the pattern do not have to be doubled.
? has a special meaning in regex, so to match a literal question mark we precede it with a backslash: \?
id= matches the id= part of the URL
([a-zA-Z\.]+) is the first capturing group, which matches the id in the URL. Because the pattern contains exactly one group, re.findall returns a list of that group's captures, so [0] returns the desired text.
Note - I have used re.findall for this because it returns a list whose element at index 0 is the extracted text.
I recommend you take a look at rexegg.com for a full list of regex syntax.
Actually, you do not need to put anything "special" there.
Since you know that the bundle id is between id= and &, you can just capture whatever is inside and have your result in capture group like this:
id=(.+)&
So the code would look like this:
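def get_id(toParse):
    return re.search('id=(.+)&', toParse).groups()[0]
Note: in Python, .groups() returns only the capturing groups, so index 0 is the first group here. With .group() you would use .group(1), because .group(0) is the full match. Also keep in mind that .+ is greedy, so if the URL contains several & characters the capture runs up to the last one.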
This regex should easily get what you want, it gets everything between id= and either the following parameter (.*? being ungreedy), or the end of the string.
id=(.*?)(&|$)
If you only need the id itself, it will be in the first group.
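The sketch below puts these ideas into the question's get_id function. It uses a negated character class, [^&#]+, which is a variation on the patterns above rather than any one answer's regex, and it returns None instead of raising when there is no id parameter. The second function, get_id_parsed, is a hypothetical name for a regex-free alternative built on the standard library's urllib.parse.

```python
import re
from urllib.parse import urlsplit, parse_qs

def get_id(url):
    """Return the value of the id query parameter, or None if absent."""
    # [?&] ensures "id" is a whole parameter name, not the tail of another one.
    m = re.search(r"[?&]id=([^&#]+)", url)
    return m.group(1) if m else None

def get_id_parsed(url):
    """Same lookup using the URL parser instead of a regex."""
    values = parse_qs(urlsplit(url).query).get("id")
    return values[0] if values else None

x = "https://play.google.com/store/apps/details?id=com.alibaba.aliexpresshd&hl=en"
print(get_id(x))
print(get_id_parsed(x))
```

Unlike [a-zA-Z\.]+, the negated class also accepts digits and underscores, which can appear in package names.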

Extracting parenthesis with a specific format with Python

I am fairly new to Python, so I apologize if this is quite a novice question, but I am trying to extract text in parentheses that has a specific format from a raw text file.
I have tried this with regular expressions, but please let me know if there is a better method.
To show what I want to do by example:
s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
From this string I want a result something like:
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
The regular expression I have tried so far is
"(\(.+[,] [0-9]{4}\))"
in conjunction with re.findall(), however this only gives me the result:
['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']
So, as you may have guessed, I am trying to extract the bibliographic references from a .txt file. But I don't want to extract anything that happens to be in parentheses that is not a bibliographic reference.
Again, I apologize if this is a novice question, or if there is a question like this out there already. I have searched, but no luck as yet.
Use [^()] instead of . so that the match cannot run across a parenthesis. This also makes sure there is no nested ().
>>> re.findall(r"(\([^()]+[,] [0-9]{4}\))", s)
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
Assuming that you will have no nested brackets, you could use something like this: (\([^()]+?, [0-9]{4}\)). It matches an opening parenthesis, one or more non-parenthesis characters, a comma, a space, four digits, and a closing parenthesis.
I would suggest something like \(\w+,\s+[0-9]{4}\). A couple changes from your original:
Match word characters (letters/numbers/underscores) instead of any character in the source name.
Match one or more space characters after the comma, instead of limiting yourself to a single literal space.
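The two suggested styles behave differently on multi-word author names. The sketch below compares them; the second test string, multi, is an invented example and not from the question.

```python
import re

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
multi = "As shown (Smith and Jones, 2001) and noted (see above)."  # invented sample

# Any run of non-parenthesis characters before ", YYYY".
general = re.compile(r"\([^()]+,\s+[0-9]{4}\)")
# A single word (letters, digits, underscores) before ", YYYY".
strict = re.compile(r"\(\w+,\s+[0-9]{4}\)")

print(general.findall(s))      # ['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
print(strict.findall(s))       # ['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
print(general.findall(multi))  # ['(Smith and Jones, 2001)']
print(strict.findall(multi))   # []
```

Which one fits depends on the reference style in the text file: \w+ is stricter but misses names containing spaces, while [^()]+ accepts anything that ends in a comma and a four-digit year.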
