Regex for escaping path separator in url - python

I have a URL pattern: "somepath/email/". I don't want to write a regex that matches an email address; instead, I want anything that isn't a path separator to match the email part.
Please suggest a regex for this. I am using Python and the URL is for a Django application, so a library function would also be helpful, but I would prefer a regex.

The regex [^/\\]+ is a negated character class with a + quantifier; it matches one or more characters that are not / or \.
Code sample (note the raw string, which lets \\ reach the regex engine intact):
import re
match = re.search(r"[^/\\]+", subject)
if match:
    result = match.group()
else:
    result = ""
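For the URL in the question, here is a minimal sketch of how the character class could sit inside a full pattern. The group name `email`, the helper `extract_email_segment`, and the sample paths are assumptions made for illustration; wiring the pattern into a Django URL configuration is not shown.

```python
import re

# Raw string so that \\ reaches the regex engine as an escaped backslash.
URL_PATTERN = re.compile(r"^somepath/(?P<email>[^/\\]+)/$")

def extract_email_segment(path):
    """Return the segment after 'somepath/', or None if the path does not match."""
    match = URL_PATTERN.match(path)
    return match.group("email") if match else None

print(extract_email_segment("somepath/user@example.com/"))  # user@example.com
print(extract_email_segment("somepath/a/b/"))               # None
```

Because the class excludes both separators, a path with an extra segment does not match at all.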

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it is beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself we have to escape it; hence \\
.* matches any character, zero or more times.
? makes the preceding .* non-greedy, so the match stops at the first \feature\ that follows \App\.
Because . also matches \, the result would span several path components if other directories sat between \App\ and \feature\; the pattern therefore assumes that exactly one component lies between them.
The full code would be something like:
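import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'  # avoid shadowing the built-in str
start = '\\App\\'
end = '\\feature\\'
# re.escape turns each literal backslash into \\ for the regex engine
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module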
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
We can also do this with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
print(path[path.find(start) + len(start):path.rfind(end)])
Output:
Module
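One caveat with this approach is that find and rfind return -1 when the marker is missing, which silently produces a wrong slice. A minimal sketch of a guarded variant follows; the helper name `between` is hypothetical, and it looks for the first end marker after the start marker rather than the last one in the string.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)  # search only after the start marker
    if stop == -1:
        return None
    return text[begin:stop]

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
print(between(path, '\\App\\', '\\feature\\'))   # Module
print(between(path, '\\Nope\\', '\\feature\\'))  # None
```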
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
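The parentheses ( ) define a capturing group, whose contents you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Written as an ordinary (non-raw) Python string literal, each \\\\ reaches the regex engine as \\ and matches a single backslash. Try the expression here: https://regex101.com/r/24mkLO/1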
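As a minimal sketch of the capturing-group approach in Python, using raw strings so that each \\ in the pattern matches one backslash in the path (the variable names are illustrative, and the non-capturing groups are dropped because plain text outside the parentheses is not captured anyway):

```python
import re

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# Only the parenthesised part is returned by group(1).
match = re.search(r"App\\([A-Za-z0-9]*)\\feature", path)
module = match.group(1) if match else None
print(module)  # Module
```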

Python regex works online but not in code

This regex pattern: ^.+\'(.+.png) works in online editors but not in Python code. I saw other posts about the same issue and tried applying their suggestions:
Adding an extra escape slash
Prepending start quote with r
The regex should match from the single quote up to and including .png.
For example:
With this string Executing identify -format %k 'testcases/art/test_files/video_images/single-[snk1=640x480p59.9]-[src1=720x480i59.9].png'
I want: testcases/art/test_files/video_images/single-[snk1=640x480p59.9]-[src1=720x480i59.9].png
I tried (not in chronological order):
result = re.findall("^.+\\'(.+\\.png)", self.stream.getvalue()) # I also tried prepending all of these with r
result = re.findall("^.+\'(.+.png)", self.stream.getvalue())
result = re.findall("^.+'(.+.png)", self.stream.getvalue())
result = re.findall("^.+'(.+.png)", str(self.stream.getvalue()))
result = re.findall("\^.+'(.+.png)\", self.stream.getvalue())
Edit: I also tried using re.match() and re.search()
Update:
The source of the string is probably responsible: it comes from cStringIO.StringO.getvalue(), which is the self.stream.getvalue() call in the code. This is code I did not write. How can I use a regex on this?
You need to cast the output of self.stream.getvalue() to a string and also throw away the ^.+ part of the pattern as re.findall searches for all matches anywhere inside the input string.
Use
results = re.findall(r"'([^']+\.png)", str(self.stream.getvalue()))
Also, mind escaping dots that are literal . chars in the pattern.
Pattern details
' - a single quote
([^']+\.png) - Capturing group 1:
[^']+ - 1+ chars other than '
\.png - .png substring.
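A quick check of the pattern against the sample line from the question (the variable name `log_line` is illustrative, and the string is hard-coded here instead of being read from self.stream):

```python
import re

log_line = ("Executing identify -format %k "
            "'testcases/art/test_files/video_images/"
            "single-[snk1=640x480p59.9]-[src1=720x480i59.9].png'")

# findall returns the contents of capturing group 1 for every match.
results = re.findall(r"'([^']+\.png)", log_line)
print(results)
```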

using regular exp. in python

If I enter an email address of the form username#companyname.com and just want to search for '#', why isn't this code working?
```
emailAddress=raw_input()
pat = '#'
match = re.match(pat2,emailAddress)
print match.group()
```
Assuming the pat/pat2 issue is just a typo, you want to use re.search instead of re.match. search searches the whole string for a match while match only searches for matches beginning at the beginning of the string.
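The difference can be seen in a short sketch (Python 3 syntax, with the address hard-coded instead of read from input):

```python
import re

email_address = "username#companyname.com"

# match only tries at position 0, where the character is 'u', so it fails.
print(re.match("#", email_address))           # None

# search scans the whole string and finds the '#'.
found = re.search("#", email_address)
print(found.group(), found.start())           # # 8
```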

How to apply Condition in regex

Hello, I am a newbie currently trying to learn regex by experimenting with various patterns. I tried to create a regex pattern for this URL but failed. It's an Amazon pagination link.
http://www.amazon.in/s/lp_6563520031_pg_2?rh=n%3A5866078031%2Cn%3A%215866079031%2Cn%3A6563520031&page=2s&ie=UTF8&qid=1446802571
Or
http://www.amazon.in/Tena-Wet-Wipe-Pulls-White/dp/B001O1G242/ref=sr_1_46?s=industrial&ie=UTF8&qid=1446802608&sr=1-46
I just want to check the URL for these two things:
If the url has dp directory or product directory
If the url has query string page having any digit
I tried to create the regex pattern but failed. I want that if the first thing is not there the regex pattern should match the second (or vice versa).
Here's the regex pattern I made:
.*\/(dp|product)\/ | .*page
Here is my regex101 link: https://regex101.com/r/zD2gP5/1#python
Since you just want to check if a string contains some pattern, you can use
\/(?:dp|product)\/|[&?]page=
See regex demo
In Python, just check with re.search:
import re
p = re.compile(r'/(?:dp|product)/|[&?]page=')
test_str = "http://w...content-available-to-author-only...n.in/s/lp_6563520031_pg_2?rh=n%3A5866078031%2Cn%3A%215866079031%2Cn%3A6563520031&page=2s&ie=UTF8&qid=14468025716"
if p.search(test_str):
    print("Found!")
Also, in Python regex patterns, there is no need to escape / slashes.
The regex matches two alternative subpatterns (\/(?:dp|product)\/ and [&?]page=):
/ - a forward slash
(?:dp|product) - either dp or product (without storing the capture inside the capture buffer since it is a non-capturing group)
/ - a slash
| - or...
[&?] - either a & or ? (we check the start of a query string parameter)
page= - literal sequence of symbols page=.
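A small sketch that runs the pattern over both URLs from the question. The helper name `is_product_or_page` and the third URL are assumptions added as a negative example.

```python
import re

PRODUCT_OR_PAGE = re.compile(r"/(?:dp|product)/|[&?]page=")

def is_product_or_page(url):
    """True if the URL has a /dp/ or /product/ directory or a page= parameter."""
    return PRODUCT_OR_PAGE.search(url) is not None

pagination_url = ("http://www.amazon.in/s/lp_6563520031_pg_2?rh=n%3A5866078031"
                  "%2Cn%3A%215866079031%2Cn%3A6563520031&page=2s&ie=UTF8&qid=1446802571")
product_url = ("http://www.amazon.in/Tena-Wet-Wipe-Pulls-White/dp/B001O1G242/"
               "ref=sr_1_46?s=industrial&ie=UTF8&qid=1446802608&sr=1-46")
other_url = "http://www.amazon.in/gp/help/customer/display.html"  # hypothetical

for url in (pagination_url, product_url, other_url):
    print(is_product_or_page(url))
```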
\/(dp|product)\/|page=(?=[^&]*\d)[^&]+
This would be my idea; please test it and let me know if you have any questions about it.
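A sketch of how this variant behaves; the example.com URLs are made up for illustration. The lookahead requires a digit somewhere in the value of page before the next &. Since page= is not anchored to & or ?, a parameter whose name merely ends in page also matches.

```python
import re

pattern = re.compile(r"/(dp|product)/|page=(?=[^&]*\d)[^&]+")

with_digit = bool(pattern.search("http://example.com/s?page=2s&ie=UTF8"))
without_digit = bool(pattern.search("http://example.com/s?page=next&ie=UTF8"))
suffix_name = bool(pattern.search("http://example.com/s?homepage=1"))

print(with_digit)     # True: the value "2s" contains a digit
print(without_digit)  # False: "next" has no digit before the next &
print(suffix_name)    # True: "homepage=1" contains "page=1"
```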

How to extract slug from URL with regular expression in Python?

I'm struggling with Python's re. I don't know how to solve the following problem in a clean way.
I want to extract part of a URL.
What I tried so far:
url = 'http://www.example.com/this-2-me-4/123456-subj'
m = re.search('/[0-9]+-', url)
m = m.group(0).rstrip('-')
m = m.lstrip('/')
This leaves me with the desired output 123456, but I feel this is not the proper way to extract the slug.
How can I solve this quicker and cleaner?
Use a capturing group by putting parentheses around the part of the regex that you want to capture (...). You can get the contents of a capturing group by passing in its number as an argument to m.group():
>>> m = re.search('/([0-9]+)-', url)
>>> m.group(1)
'123456'
From the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
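A minimal sketch of the same idea with a named group and a guard for URLs that do not match. The helper `extract_id`, the group name `id`, and the second URL are assumptions for illustration.

```python
import re

def extract_id(url):
    """Return the digits between a '/' and the following '-', or None."""
    match = re.search(r'/(?P<id>[0-9]+)-', url)
    # re.search returns None when nothing matches, so check before calling group.
    return match.group('id') if match else None

print(extract_id('http://www.example.com/this-2-me-4/123456-subj'))  # 123456
print(extract_id('http://www.example.com/no-id-here'))               # None
```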
You may want to use urllib.parse combined with a capturing group for mildly cleaner code.
import urllib.parse, re
url = 'http://www.example.com/this-2-me-4/123456-subj'
parsed = urllib.parse.urlparse(url)
path = parsed.path
slug = re.search(r'/([\d]+)-', path).group(1)
print(slug)
Result:
123456
In Python 2, use urlparse instead of urllib.parse.
If you want to find all the slugs in a URL, you can use this code:
from slugify import slugify
url = "https://www.allrecipes.com/recipe/79300/real-poutine?search=random/some-name/".split("/")
for i in url:
    i = i.split("?")[0] if "?" in i else i
    if "-" in i and slugify(i) == i:
        print(i)
This prints:
real-poutine
some-name
