use dynamic int variable inside regex pattern python - python

I'm new to Python, so apologies if this has already been asked.
I found similar questions, but they didn't help me. My requirement is to read a file and print all the URLs in it. Inside a for loop I used the regex pattern [^https://][\w\W]*, and it worked fine. But I want to know whether I can dynamically pass the length of the part of the line that comes after https:// and use it as the number of occurrences instead of *.
I tried [^https://][\w\W]{var}, where var=len(line)-len('https://').
These are some other patterns I tried:
pattern = '[^https://][\w\W]{'+str(int(var))+'}'
pattern = r'[^https://][\w\W]{{}}'.format(var)
pattern = r'[^https://][\w\W]{%s}'%var

I might be misunderstanding your question, but if you know that the url is always starting with https:// then that would be the first eight characters. Then you can get the length after finding the urls:
# Example of list containing urls - you should fill that with your for loop
list_urls = ['https://stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://stackoverflow.com']
for url in list_urls:
    print(url[8:])
Out
stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
stackoverflow.com
Instead of a for loop you could find all urls using re.findall
import re
url_pattern = r"((https://)([\w-]+\.)+[\w-]+([\w%/~+#.-]*))"
# text refers to your document, which should be read before this
urls = re.findall(url_pattern, text)
# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))
# print the urls
print(unique_urls)

In your pattern you use [^https://], which is a negated character class: it matches any single character except the ones listed.
One option is to make use of literal string interpolation. Assuming your links do not contain spaces, you could use \S instead of [\w\W] as the latter variant will match any character including spaces and newlines.
\bhttps://\S{{{var}}}(?!\S)
The assertion (?!\S) at the end is a whitespace boundary to prevent partial matches and the word boundary \b will prevent http being part of a larger word.
For example
import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"
var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"
print(re.findall(pattern, lines))
Output
['https://www.test.com', 'https://thisisatestt']
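To see why the [^https://] prefix from the question does not do what was intended, and why the .format attempt produced no counted quantifier, here is a minimal sketch. The sample strings and the value 12 are made up for illustration.

```python
import re

# A negated character class matches ONE character that is not in the set
# {h, t, p, s, :, /}; it does not mean "not the text https://".
negated = re.compile(r"[^https://]")

print(negated.match("https://www.test.com"))        # None: "h" is in the excluded set
print(negated.match("www.test.com").group())        # w

var = 12  # hypothetical length of the part after https://

# In str.format and f-strings, {{ and }} are escaped braces, so {{}} is a literal {}.
broken = r"\S{{}}".format(var)
# Three braces on each side: one escaped pair plus one replacement field.
working = r"\S{{{}}}".format(var)
fstring = rf"https://\S{{{var}}}"

print(broken)   # \S{}
print(working)  # \S{12}
print(fstring)  # https://\S{12}
```

The string concatenation and %s variants from the question already build a valid {12} quantifier; the part that needs changing is the character class in front of it.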

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use them as part of the pattern itself, we have to escape it; hence, \\
.* matches any character, zero or more times.
? specifies that the match is not greedy (so we are implicitly assuming here that \feature\ only shows up once after \App\).
The pattern in general also assumes that there is exactly one path component between \App\ and \feature\; since . also matches \, any extra components in between would be included in the match.
The full code would be something like:
import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# re.escape escapes the backslashes so that they are matched literally
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
We can also do that with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
print(path[path.find(start) + len(start):path.rfind(end)])
Output
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a capture group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Note that the pattern is written here as it would appear in a regular (non-raw) Python string literal; in a raw string each \\\\ becomes \\. Try the expression here: https://regex101.com/r/24mkLO/1
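As a minimal sketch of the capture-group approach in Python, assuming the path from the question and a raw-string version of the pattern (the non-capturing groups are dropped because they are not needed here):

```python
import re

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'

# In a raw string, \\ is the regex for one literal backslash.
pattern = r"App\\([A-Za-z0-9]*)\\feature"

match = re.search(pattern, path)
if match:
    # group(1) is the text captured by the first pair of parentheses
    print(match.group(1))  # Module
```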

Make a list return each of its elements as individual strings to be placed in a regular expression

I am facing a challenge in Python where I have a list that contains multiple strings. I want to use a Regex (findall) to search for any occurrence of each of the list's elements in a text file.
import re
name_list = ['friend', 'boy', 'man']
example_string = "friend"
file= open('file.txt', 'r')
lines= file.read()
Then comes the re.findall expression. I configured it such that it finds any occurrence in the text file where a desired string is found between a number in parentheses (\d) and a period. It works perfectly when I place a string variable inside the regular expression, as seen below.
find = re.findall(r"([^(\d)]*?"+example_string+r"[^.]*)", lines)
However, I want to be able to replace example_string with some sort of mechanism that returns each of the elements in name_list as individual strings to be placed and searched for in the regular expression. The lists I work with can get much larger than the list in this example, so please keep that in mind.
As a beginner, I tried simply replacing the string in re.findall with the list I have, only to quickly realize that that would result in an error. The solution to this must allow me to use re.findall in the aforementioned manner, so most of the challenge lies in manipulating the list so that it can produce each of its elements as individual strings to be placed within re.findall.
Thank you for your insights.
for name in name_list:
    find = re.findall(r"([^(\d)]*?"+name+r"[^.]*)", lines)
    # ... do stuff with the results
This iterates through each item in name_list and runs the same regex as before.
The pattern that you use, ([^(\d)]*?[^.]*), is not correct for this match. You wrote:
"I configured it such that it finds any occurrence in the text file where a desired string is found between a number in parentheses (\d) and a period."
The problem is the construct [^(\d)], which is a negated character class matching any single character except (, ) and digits.
The next negated character class [^.]* matches any char except a dot, but the final dot is not matched.
A pattern to find everything between a number in parentheses and a dot can use a capture group, whose content is what re.findall returns.
\(\d+\)([^.]*(?:friend|boy|man)[^.]*)\.
For example, if the content of file.txt is:
this is (10) with friend and a text.
Example code, assembling the words in a non capture group using .join(name_list)
import re
name_list = ['friend', 'boy', 'man']
pattern = rf"\(\d+\)([^.]*(?:{'|'.join(name_list)})[^.]*)\."
file = open('file.txt', 'r')
lines = file.read()
print(re.findall(pattern, lines))
Output
[' with friend and a text']
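If the list can contain arbitrary strings, it may be safer to escape each entry before joining, so that regex metacharacters in a name are matched literally. A minimal sketch, where the 'c++' entry and the sample text are made up for illustration:

```python
import re

# 'c++' is a hypothetical entry; unescaped, it would be an invalid regex fragment.
name_list = ['friend', 'boy', 'man', 'c++']

# Escape every name, then join them into one alternation.
alternation = '|'.join(re.escape(name) for name in name_list)
pattern = re.compile(rf"\(\d+\)([^.]*(?:{alternation})[^.]*)\.")

sample = "this is (10) with friend and a text. (2) learning c++ today."
print(pattern.findall(sample))
# [' with friend and a text', ' learning c++ today']
```

Compiling the pattern once also avoids rebuilding it for every file when the list is large.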

Excluding a string containing character regex [duplicate]

Currently I am trying to get proper URLs from a string containing both proper and improper URLs using Regular Expressions. Result of the code should give a list of the proper URLs from the input string. The problem is I cannot get rid of the "http://example{.com", because all I came up with is getting to the "{" character and getting "http://example" in results.
The code I am checking is below:
import re
text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
print(re.findall('http[s]?[://](?:[a-zA-Z0-9$-_#.&+])+', text))
So is there a good way to get all the matches but excluding matches containing bad characters (like "{")?
It's difficult to know exactly what you need but this should help. It's hard to parse URLs with regular expressions. But Python comes with a URL parser. It looks like they are space separated so you could do something like this
from urllib.parse import urlparse
text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
for token in text.split():
    result = urlparse(token)
    if result.scheme in {'http', 'https'} \
            and result.netloc \
            and all(c == '.' or c.isalpha() for c in result.netloc):
        print(token)
Split the text into a list of strings with text.split(), then try to parse each item with urlparse(token). Print the token if the scheme is http or https, the domain (a.k.a. netloc) is non-empty, and all of its characters are letters or dots.
In your example, an URL ends with a white space, so you can use a lookahead to find the next space (or the end of the string). To do that, you can use: (?=\s|$).
Your RegEx can be fixed as follows:
print(re.findall(r'http[s]?[:/](?:[a-zA-Z0-9$-_#.&+])+(?=\s|$)', text))
Note: don't forget to use a raw string (prefixed with an "r").
You can also improve your RegEx, for instance:
import re
text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
URL_REGEX = r"(?:https://|http://|ftp://|file://|mailto:)[-\w+&@#/%=~_|?!:,.;]+[-\w+&@#/%=~_|](?=\s|$)"
print(re.findall(URL_REGEX, text))
You get:
['http://example.com', 'http://example.hgg.com/da.php?=id42']
For a more robust RegEx, you can take a look at this question: “What is the best regular expression to check if a string is a valid URL?”
One answer points to this RegEx for Python:
URL_REGEX = re.compile(
    r'(?:http|ftp)s?://' # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' # domain...
    r'localhost|' # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' # ...or ipv6
    r'(?::\d+)?' # optional port
    r'(?:/?|[/?]\S+)', re.IGNORECASE)
It works like a charm!

Unable to populate a list using re.findall()

I'm working on a simple personal project that's required I learn to use regular expressions. I have successfully used findall() once before in my program:
def getStats():
    playername = input("Enter your OSRS name: ")
    try:
        with urllib.request.urlopen("https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=" + playername) as response:
            page = str(response.read())
            player.levels = re.findall(r',(\d\d),', page)
This worked fine and populated the list exactly as I wanted. I'm now trying to do something similar with a text file.
The text file contains a string, followed by a lot of digits, and then another string followed by a lot of digits, etc. I just want to populate a list with the text and ignore the digits, but I get no matches (the list is empty):
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r',(\D\D),', q)
            print(questList)
Pythex link: https://pythex.org/?regex=%5CD%5CD&test_string=Desert%20Treasure%2C0%2C0%2C0%2C12%0AContact!%2C0%2C0%2C11%2C0%2C0%2C0%2C5%0ACook%27s%20Assistant%2C0%2C0%2C0%2C0%0AHorror%20from%20the%20Deep%2C0%2C0%2C13&ignorecase=0&multiline=0&dotall=0&verbose=0
I've gotten some help with the pattern and edited accordingly, but the list is still printing empty
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r'^(\D+),', q)
Your pattern is incorrect. Firstly, in the demo you linked, the website is not very well designed and shows adjacent matches as one single match. \D\D matches exactly 2 non-digit characters. Also, you didn't include the commas you have in your pattern in the code. Anyway, here is a correct pattern:
^(\D+),
It matches the start of the line, then at least one non-digit character, then a comma. The first group contains the string you want to match.
Demo: https://regex101.com/r/pViF0h/2
In code:
import re
text = '''Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13'''
print(re.findall(r'^(\D+),', text, re.M))
# ['Desert Treasure', 'Contact!', "Cook's Assistant", 'Horror from the Deep']
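The re.M (re.MULTILINE) flag matters here: without it, ^ only matches at the very start of the string. A small sketch using two of the sample lines shows the difference:

```python
import re

text = "Desert Treasure,0,0,0,12\nContact!,0,0,11,0,0,0,5"

# Without the flag, ^ anchors only at position 0 of the whole string.
without_flag = re.findall(r'^(\D+),', text)
# With re.MULTILINE, ^ also anchors right after every newline.
with_flag = re.findall(r'^(\D+),', text, re.MULTILINE)

print(without_flag)  # ['Desert Treasure']
print(with_flag)     # ['Desert Treasure', 'Contact!']
```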
If the first entry is what you want no matter what, you can also use:
^(.+?),
Also, for these files, it is usually a much better idea to read it as a CSV and extract what you need that way.
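A minimal sketch of the CSV approach, assuming the file has the comma-separated layout shown in the sample. An in-memory io.StringIO stands in for open("quests.txt", newline="") so the example is self-contained:

```python
import csv
import io

# Stand-in for the real file; the rows are taken from the sample above.
sample = io.StringIO(
    "Desert Treasure,0,0,0,12\n"
    "Contact!,0,0,11,0,0,0,5\n"
    "Cook's Assistant,0,0,0,0\n"
)

# Each row is a list of fields; the quest name is the first field.
# "if row" skips any blank lines.
quest_names = [row[0] for row in csv.reader(sample) if row]
print(quest_names)  # ['Desert Treasure', 'Contact!', "Cook's Assistant"]
```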
Your TypeError solution is correct.
Without knowing what that webpage looks like, I do see one problem. In your working example you use ',(\d\d),', but in the problematic one you use ,(\D\D),. \d matches any digit character, while \D matches any non-digit.

How to set regex for website url pattern

The url pattern is
http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500
This website has similar urls. The unique identifier is -p- for this url.
The url pattern always has -p- before word which is at end of url.
I used the following regex
(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z
It matched, but it also matches many other URLs on this website.
For example, the regex should match the URL above, but it shouldn't match
http://www.hepsiburada.com/bilgisayarlar-c-2147483646
Since you are using a re.match you really need to match the string from the beginning. However, the main problem is that your -p- is inside a character class, and is thus treated as separate symbols that can be matched. Same is with the \w+ - it is considered as \w and + separately.
So, use a sequence:
(.*)hepsiburada\.com/([\w.-]+)(-p-\w+)$
Or
^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$
Note that you most probably don't even need the capture groups, so the (...) parentheses can be removed from the pattern.
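To check the difference in Python, here is a small sketch that runs the original pattern and a group-free version of the fixed pattern against the two URLs from the question. The variable names are only for illustration.

```python
import re

product_url = ("http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-"
               "uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500")
category_url = "http://www.hepsiburada.com/bilgisayarlar-c-2147483646"

# Original: [\-p\-\w+] is a character class, so it matches any ONE of -, p, word char or +.
original = re.compile(r"(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z")
# Fixed: -p-\w+ is a sequence, so the literal text "-p-" must be present.
fixed = re.compile(r"^https?://(?:www\.)?hepsiburada\.com/[\w.-]+-p-\w+$")

for url in (product_url, category_url):
    print(bool(original.match(url)), bool(fixed.match(url)))
# True True
# True False
```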
