re.sub replacing too much text - python

I have a set of links like:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&emc=rss']
I'm trying to iterate over them to remove everything that comes after html. So I have:
cleanitems = []
for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))
Which returns:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']
Confused as to why it's including html in the capture group. Thanks for any help.

html is part of the matched text too, not just the (...) group; re.sub() replaces the entire match, not only what the group captured.
Include the literal html text in the replacement:
cleanitems.append(re.sub(r'html(.*)', 'html', item))
or, alternatively, capture that part in a group instead:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
You may want to consider using a non-greedy match, and a $ end-of-string anchor to prevent cutting off a URL that contains html in the path more than once, and including the . dot to make sure you are really only matching the .html extension:
cleanitems.append(re.sub(r'\.html.*?$', r'.html', item))
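Put together, a minimal sketch of that anchored version, run over two of the sample links from the question:

```python
import re

links = [
    'http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
]

# Replace ".html" plus everything after it with just ".html"
cleanitems = [re.sub(r'\.html.*?$', '.html', item) for item in links]
print(cleanitems)
```

Both results now end in .html with the trailing query string and stray markup removed.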
However, if your goal is to remove the query string from a URL, consider parsing the URL using urllib.parse.urlparse(), and re-building it without the query string or fragment identifiers:
from urllib.parse import urlparse
cleanitems.append(urlparse(item)._replace(query='', fragment='').geturl())
This won't remove the erroneous HTML chunks, however; if you are parsing these URLs from an HTML document, consider using a real HTML parser rather than regex.
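For example, a sketch of the urlparse approach on one of the sample links (Python 3; it keeps the path intact and only drops the query string and fragment):

```python
from urllib.parse import urlparse

item = 'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss'

# _replace() works because urlparse() returns a named tuple
clean = urlparse(item)._replace(query='', fragment='').geturl()
print(clean)
# http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html
```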

Just a complement to Martijn's answer.
You could also use a lookbehind assertion to only match the text following html:
cleanitems.append(re.sub(r'(?<=html).*', '', item))
or use a replacement string to keep the initial part:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
But as already said by Martijn, you'd better use the urllib module to correctly parse URLs.
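A minimal sketch of the lookbehind version on one of the sample links from the question:

```python
import re

item = 'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>'

# (?<=html) asserts that "html" precedes the match without consuming it,
# so only the trailing junk is replaced
clean = re.sub(r'(?<=html).*', '', item)
print(clean)
# http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html
```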


use dynamic int variable inside regex pattern python

I'm in my initial days of learning Python, sorry if this question has already been asked.
I'm writing here because the existing answers didn't help me. My requirement is to read a file and print all the URLs inside it. Inside a for loop, the regex pattern I used is [^https://][\w\W]*, and it worked fine. But I want to know whether I can dynamically pass the length of the line after https:// and match that exact number of occurrences instead of using *.
I had tried [^https://][\w\W]{var}} where var=len(line)-len(https://)
These are some other patterns I had tried like
pattern = '[^https://][\w\W]{'+str(int(var))+'}'
pattern = r'[^https://][\w\W]{{}}'.format(var)
pattern = r'[^https://][\w\W]{%s}'%var
I might be misunderstanding your question, but if you know that the URL always starts with https://, then that prefix is the first eight characters, and you can slice it off after finding the URLs:
# Example of list containing urls - you should fill that with your for loop
list_urls = ['https://stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://stackoverflow.com']
for url in list_urls:
    print(url[8:])
Out
stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
stackoverflow.com
Instead of a for loop you could find all urls using re.findall
import re
url_pattern = r"((https:\/\/)([\w-]+\.)+[\w-]+[.+]+([\w%\/~\+#]*))"
# text refers to your document, which should be read before this
urls = re.findall(url_pattern, text)
# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))
# print the urls
print(unique_urls)
In your pattern you use [^https://], which is a negated character class: [^...] matches any single character except those listed.
One option is to make use of literal string interpolation. Assuming your links do not contain spaces, you could use \S instead of [\w\W] as the latter variant will match any character including spaces and newlines.
\bhttps://\S{{{var}}}(?!\S)
The assertion (?!\S) at the end is a whitespace boundary to prevent partial matches and the word boundary \b will prevent http being part of a larger word.
For example
import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"
var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"
print(re.findall(pattern, lines))
Output
['https://www.test.com', 'https://thisisatestt']

A way of stripping everything before and after ' or " (including the quotes themselves)

I am trying to find a means to strip everything before or after either single or double quotes, including the quotes themselves.
For example:
<script src = "https://example.com/file.js"></script>
Result:
https://example.com/file.js
Or:
url = 'https://example.com/service/api'
Result:
https://example.com/service/api
I have tried using .strip and .replace, as well as the re library, but I am grasping in the dark here.
Using an HTML parsing library is no good here, because we don't know in advance in which language the code is. We are searching through lines of text looking for URL's to then send the URL itself to another API. This could be in text files, yaml, json, java, c#, python, ruby, etc.
Rather than trying to remove everything prior and after the target string, you can think of it as extracting the target string and not its surrounding context.
Extract the quoted string using regex match groups:
import re
string = '<script src = "https://example.com/file.js"></script>'
match = re.search("(\".+?\"|'.+?')", string)
target = match.group(1).strip("\"'")
target is equal to https://example.com/file.js.
The regex in re.search() matches either "somestring" or 'somestring'. The contents of the group between parentheses can be extracted using match.group(1). We then remove the quotes on either side using strip().
You might want to use something like
if match:
    target = match.group(1).strip("\"'")
because match will be None if the regex doesn't match anything.
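Building on that, a sketch that extracts every quoted string across lines of arbitrary source text with re.findall (the sample lines here are invented for illustration):

```python
import re

# Hypothetical lines from files in different languages
lines = [
    '<script src = "https://example.com/file.js"></script>',
    "url = 'https://example.com/service/api'",
]

quoted = []
for line in lines:
    # Find every double- or single-quoted substring, then strip the quotes
    for match in re.findall(r"(\".+?\"|'.+?')", line):
        quoted.append(match.strip("\"'"))

print(quoted)
# ['https://example.com/file.js', 'https://example.com/service/api']
```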

Python re.findall returning links with unwanted string afterwards

I have a python script using BeautifulSoup to scrape. This is my code:
re.findall('stream:\/\/.+', link)
Which is designed to find links like:
stream://987cds9c8ujru56236te2ys28u99u2s
But it also returns strings like this:
stream://987cds9c8ujru56236te2ys28u99u2s [SD] Spanish - (9.15am)
i.e. with spaces and extra stuff which I don't want. How can I express the
re.findall
So it only returns the link first part?
(Thanks in advance)
You can use a non-greedy match (adding ? to the pattern) with a word boundary character '\b':
>>> re.findall(r'stream:\/\/.+?\b', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']
Or if you want to match only word characters you can simply use '\w+':
>>> re.findall(r'stream:\/\/\w+', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']

Python Regex for URL doesn't work

I'm using python and trying to use a regex to see whether there is a url within my string. I've tried multiple different regexes but they always come out with 'None', even if the string is clearly a website.
Example:
>>> print re.search(r'/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i','www.google.com')
None
Any help would be appreciated!
What about, as in the similarly titled question Python Regex for URL doesn't work, switching to something like:
r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
For a detailed survey of many, many regexes validating URLs, see https://mathiasbynens.be/demo/url-regex ...
If you want to check if a string is an URL you can use:
print re.search(r'(^(https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+)$)','www.google.com', re.I)
If you want to verify if a string contains a URL, you only need to remove the ^ and $ patterns:
print re.search(r'((https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+))','www.google.com', re.I)
Remember: re.I is for case-insensitive matching, the '^' matches beginning of line and $ matches end of line.
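For instance, a quick runnable check of the whole-string version (Python 3 syntax; the sample strings are my own):

```python
import re

pattern = r'(^(https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+)$)'

# The anchors ^ and $ require the whole string to be a URL
print(bool(re.search(pattern, 'www.google.com', re.I)))  # True
print(bool(re.search(pattern, 'not a website', re.I)))   # False
```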
The grammar for a valid URL has been explained in the linked Wiki article. Based on that, this regex can match a string if it contains a valid URL.
^((?:https?|ftp):\/{2}[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
And in case if you want to keep the scheme part of the URL optional.
^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
Output
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','www.google.com').group()
'www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','http://www.google.com').group()
'http://www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','https://www.google.com').group()
'https://www.google.com'
I've used the following regex in order to verify that the inserted string is a URL:
r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:#\-_=#])*'

regular expressions in python, matching words outside of html tags

I am trying to match a phrase using regular expressions, so long as none of the words in that phrase appear within an html tag.
For this example, I am using the following url:
url = "http://www.sidley.com/people/results.aspx?lastname=B"
The regexp that I am using is:
regexp = "Babb(?!([^<]+)?>).+?Jonathan(?!([^<]+)?>).+?C(?!([^<]+)?>)"
page = urllib2.urlopen(url).read()
re.findall(regexp, page, re.DOTALL)
With that regexp, I get the following output:
[('', '', '')]
When I change the regexp to (*note the outer parens):
regexp = "(Babb(?!([^<]+)?>).+?Jonathan(?!([^<]+)?>).+?C(?!([^<]+)?>))"
page = urllib2.urlopen(url).read()
re.findall(regexp, page, re.DOTALL)
I get:
[('Babb, Jonathan C', '', '', '')]
I am confused as to why this is.
1) Why am I getting these empty strings as matches?
2) Why for the first regexp, do I not get the actual match?
and finally,
How do I fix this?
Thanks in advance for your help.
The reason you are getting empty strings is that re.findall returns a tuple of all capturing groups when a pattern contains more than one group, and the ([^<]+)? groups inside your lookaheads capture nothing useful. If you don't want that information, remove some of your parentheses, or turn the extraneous pairs into non-capturing groups (?:...).
The final code that I would use (for the whole process) would be
import re
import urllib2
url = 'http://www.sidley.com/people/results.aspx?lastname=B'
regexp = 'Babb(?!<+?>).+?Jonathan(?!<+?>).+?C(?!<+?>)'
page = urllib2.urlopen(url).read()
re.findall(regexp, page, re.DOTALL)
A breakdown of the regexp:
We select for the first word. Babb
We don't want to match any HTML tags, so we use a must-not-match anti-group. (?!)
Within this, we place a regexp that selects for an HTML tag (not quite sure why it is this particular expression that works, rather than .+?>). <+?>
We select for at least one more character, non-greedily. .+?
We repeat this process for each of the other words (Jonathan and C).
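As an illustration of the lookahead idea from the original question, (?![^<]*>) fails whenever the next > comes before any <, i.e. when the word sits inside a tag (the sample strings below are invented for this sketch):

```python
import re

pattern = r'Babb(?![^<]*>).+?Jonathan(?![^<]*>).+?C(?![^<]*)'

inside_tag = '<a title="Babb, Jonathan C">profile</a>'   # words inside a tag attribute
in_content = '<a href="/people/1">Babb, Jonathan C</a>'  # words in tag content

# Inside a tag, the lookahead after "Babb" sees `, Jonathan C">` and fails
print(re.findall(r'Babb(?![^<]*>).+?Jonathan(?![^<]*>).+?C(?![^<]*>)', inside_tag, re.DOTALL))
# []

# Between tags, the next ">" is never reached before a "<", so the match succeeds
print(re.findall(r'Babb(?![^<]*>).+?Jonathan(?![^<]*>).+?C(?![^<]*>)', in_content, re.DOTALL))
# ['Babb, Jonathan C']
```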
