How to get the rightmost match with a regular expression? - python

I think this is a common problem. But I didn't find a satisfactory answer elsewhere.
Suppose I extract some links from a website. The links are like the following:
http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html
I want to use regular expression to convert them to their real links:
http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html
However, I can't do that because of regex greediness.
'http://.*$' just matches the whole line. Then I tried 'http://.*?$', but it didn't work either, and neither did re.findall. So is there any other way to do this?
Yes, I can do it with str.split or str.index, but I'm still curious whether there is a regex solution for this.

You don't need to use regex: you can use str.split() to split each link on //, pick up the last part, and prepend http:// to it:
>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://' + link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
And with regex you just need to replace all the characters between two // with an empty string, but since you need to keep the first //, use a positive look-behind:
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']

Use this pattern:
^(.*?[^/])(?=\/[^/]).*?([^/]+)$
and replace with $1/$2.
After reading the comment below, use this pattern instead to capture what you want:
(http://(?:[^h]|h(?!ttp:))*)$
or this pattern:
(http://(?:(?!http:).)*)$
or this pattern:
http://.*?(?=http://)
and replace with nothing.
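For completeness, here is a minimal sketch of how the third pattern could be applied from Python; the outer capturing group is dropped so that re.findall returns whole matches, and re.MULTILINE is assumed so that $ anchors at the end of every line (s is the sample block of links from the question):
import re

s = """http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html"""

# (?:(?!http:).)* consumes characters only while they do not start another "http:",
# which forces the match onto the last http:// of each line.
print(re.findall(r'http://(?:(?!http:).)*$', s, re.MULTILINE))
# ['http://example1.com/123.html', 'http://example1.com/456.html',
#  'http://example2.com/789.html', 'http://example3.com/xxx.html']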

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking whether that (941) is followed by \n, then a word, then another \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parentheses in (941), the regex treats 941 as a group. The regex in step 3 may be wrong, which I can fix later, but first I need help finding the second (941) so that I can apply step 3 to it.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping with the backslash like \(941\), with no useful results. Maybe I am using them wrong.
I just wanted to know whether this can be done using regex alone; it might be useful for others or future viewers too.
Thanks!
Assuming:
You want to avoid matching only digits;
You want to match a substring made of word characters (thus possibly including digits).
Try escaping the variable and using it in the regular expression via an f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)  # escape ( and ) so they match literally instead of grouping
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]  # first word after (941) that is not digits-only
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
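For reference, a quick sketch of what re.escape() actually produces here; the escaped form is what gets interpolated into the pattern above (nothing beyond the standard library is used):
import re

num = '(941)'
print(re.escape(num))   # \(941\)  -- the parentheses now match literally instead of grouping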

Matching data between padding

I'm trying to match some strings in a binary file and the strings appear to be padded. As an example, the word PROGRAM could be in the binary like this:
%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M
In that example, the word PROGRAM is there but it is split up and it's between random data, so I'm trying to use regex to find it.
Currently, this is what I came up with, but I don't think it is very effective:
(?<=P)(.*?)(?=R)(.*?)(?=O)(.*?)(?=G)(.*?)(?=R)(.*?)(?=A)(.*?)(?=M)
If you want to get PROGRAM from the string, one option might be to use re.sub with a negated character class to remove all that you don't want.
[^A-Z]+
For example:
import re
test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
pattern = r'[^A-Z]+'
print(re.sub(pattern, '', test_str))
Result
PROGRAM
This should work for you and is more efficient than your current solution:
P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M
Explanation:
P[^R]+ - match P, then one or more characters other than R
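A small sketch of that pattern in use, with the sample string from the question; re.search returning a match object is enough to tell that the padded word is present:
import re

test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
m = re.search(r'P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M', test_str)
print(bool(m))   # True -> the letters P, R, O, G, R, A, M occur in order with padding between them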
I'm not quite sure what the desired output might be; I'm guessing this expression,
(?=.*?P.*?R.*?O.*?G.*?R.*?A.*?M).*?(P).*?(R).*?(O).*?(G).*?(R).*?(A).*?(M)
might be a start.
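If that is the direction wanted, here is a minimal sketch of how the captured letters could be joined back together (the variable names are just for illustration):
import re

test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
pattern = r'(?=.*?P.*?R.*?O.*?G.*?R.*?A.*?M).*?(P).*?(R).*?(O).*?(G).*?(R).*?(A).*?(M)'
m = re.search(pattern, test_str)
if m:
    print(''.join(m.groups()))   # PROGRAM -- the seven captured letters joined back together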

Python re.findall returning links with unwanted string afterwards

I have a python script using BeautifulSoup to scrape. This is my code:
re.findall('stream:\/\/.+', link)
Which is designed to find links like:
stream://987cds9c8ujru56236te2ys28u99u2s
But it also returns strings like this:
stream://987cds9c8ujru56236te2ys28u99u2s [SD] Spanish - (9.15am)
i.e. with spaces and extra stuff which I don't want. How can I express the re.findall so it only returns the first part of the link?
(Thanks in advance)
You can use a non-greedy match (adding ? to the pattern) with a word boundary character '\b':
>>> re.findall(r'stream:\/\/.+?\b', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']
Or if you want to match only word characters you can simply use '\w+':
>>> re.findall(r'stream:\/\/\w+', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']
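Put together with the sample line from the question, a minimal self-contained sketch (link here is just that one sample string; the forward slashes do not need escaping):
import re

link = 'stream://987cds9c8ujru56236te2ys28u99u2s [SD] Spanish - (9.15am)'

# \w+ stops at the first non-word character, so the trailing " [SD] ..." part is dropped.
print(re.findall(r'stream://\w+', link))
# ['stream://987cds9c8ujru56236te2ys28u99u2s']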

Python regular expression expansion

I am really bad with regular expressions, and I am stuck on generating all the possible combinations for a regular expression.
When the regular expression is abc-defghi00[1-24,2]-[1-20,23].walmart.com, it should generate all its possible combinations.
The text before the brackets can be anything, and the pattern inside the brackets is optional.
Need all the python experts to help me with the code.
Sample output
Here is the expected output:
abc-defghi001-1.walmart.com
.........
abc-defghi001-20.walmart.com
abc-defghi001-23.walmart.com
..............
abc-defghi002-1.walmart.com
Repeat this from 1-24 and 2.
Regex tried
([a-z]+)(-)([a-z]+)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(-)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(.*)
Let's say we would like to match against abc-defghi001-1.walmart.com. Now, if we write the following regex, it does the job:
s = 'abc-defghi001-1.walmart.com'
re.match ('.*[1-24]-[1-20|23]\.walmart\.com',s)
and the output:
<_sre.SRE_Match object at 0x029462C0>
So, it's found. If you want to match up to 27 in the first bracket, you simply replace it with [1-24|27], or if you want to match 0 to 29, you replace it with [1-29]. And of course, you have to write import re before all the above commands.
Edit1: As far as I understand, you want to generate all instances of a regular expression and store them in a list.
Use the exrex python library to do so; you can find further information in its documentation. Then, you have to limit the regex you use.
import re
s = 'abc-defghi001-1.walmart.com'
obj = re.match(r'^\w{3}-\w{6}00(1|2)-([1-20]|23)\.walmart\.com$', s)
print(obj.group())
The above regex will match the template you're looking for, I hope!
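For the enumeration part mentioned in the edit above, a minimal sketch with exrex might look like this, assuming the package is installed (pip install exrex); the pattern is a reduced, hypothetical rewrite of the question's, because [1-24] inside a character class only means the characters 1, 2 and 4, not the numbers 1 to 24:
import exrex  # third-party library that can enumerate the strings a regex matches

# Reduced rewrite of the question's pattern using real alternation.
pattern = r'abc-defghi00(1|2)-(1|2|23)\.walmart\.com'
print(list(exrex.generate(pattern)))
# e.g. ['abc-defghi001-1.walmart.com', 'abc-defghi001-2.walmart.com', ...]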

python regex suffix matching

For a typical set of word suffixes (ize, fy, ly, able, etc.), I want to know whether a given word ends with any of them, and subsequently remove it. I know this can be done iteratively with word.endswith('ize'), for example, but I believe there is a neater regex way of doing it. I tried a positive lookahead with an ending marker $, but for some reason it didn't work:
pat='(?=ate|ize|ify|able)$'
word='terrorize'
re.findall(pat,word)
Little-known fact: endswith accepts a tuple of possibilities:
if word.endswith(('ate', 'ize', 'ify', 'able')):
    # ...
Unfortunately, it doesn't indicate which string was found, so it doesn't help with removing the suffix.
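If plain string methods are acceptable anyway, here is a minimal sketch of a loop that both tests and strips; strip_suffix is just an illustrative helper, not something from the answer above:
suffixes = ('ate', 'ize', 'ify', 'able')

def strip_suffix(word):
    # Check each suffix in turn and slice it off when it matches.
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(strip_suffix('terrorize'))   # terror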
What you are looking for is actually (?:...), a non-capturing group.
Check this out:
re.sub(r"(?:ate|ize|ify|able)$", "", "terrorize")
Have a look at a regex reference site.
There are tons of useful regex skills there. Hope you enjoy it.
BTW, the Python library documentation itself is a neat & wonderful tutorial.
I do help() a lot :)
A lookahead is a zero-width assertion: like ^ and $, it anchors the match to a specific location but is not itself part of the match.
You want to match these suffixes, but at the end of a word, so use the word-edge anchor \b instead:
r'(ate|ize|ify|able)\b'
then use re.sub() to replace those:
re.sub(r'(ate|ize|ify|able)\b', '', word)
which works just fine:
>>> word='terrorize'
>>> re.sub(r'(ate|ize|ify|able)\b', '', word)
'terror'
You need to adjust the parentheses; just change pat from:
(?=ate|ize|ify|able)$
to:
(?=(ate|ize|ify|able)$)
If you need remove the suffixes later, you could use the pattern:
^(.*)(?=(ate|ize|ify|able)$)
Test in REPL:
>>> pat = '^(.*)(?=(ate|ize|ify|able)$)'
>>> word = 'terrorize'
>>> re.findall(pat, word)
[('terror', 'ize')]
If it's word-by-word matching, you can simply remove the lookahead wrapper; the $ anchor is sufficient.
