Unable to parse a link from some content - python

I'm trying to parse a link out of some content using regex. I've already had success, but I had to use the replace() function and rely on the literal this, as a marker. The thing is, this may not always be present in the content, so I'm looking for a way to get the same output without those two things.
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = re.findall(r'this,\s*([^)]*)',content.strip())[0].replace("'","")
print(link)
Output:
http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf
How can I get the link using pure regex?

You may extract all the characters between single quotes that appear after this, and any whitespace:
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = ''
m = re.search(r"this,\s*'([^']*)'", content)
if m:
    link = m.group(1)
print(link)
# => http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf
See the Python demo and the regex demo.
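If the this, marker may be absent (the concern raised in the question), you could anchor on the quoted URL itself instead. A minimal sketch of that variant:
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
# Match the first single-quoted http(s) URL, wherever it appears
m = re.search(r"'(https?://[^']+)'", content)
if m:
    print(m.group(1))
# => http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf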

Related

How to extract markdown links with a regex?

I currently have Python code that parses markdown text in order to extract the content inside the square brackets of a markdown link, along with the hyperlink.
import re
# Extract []() style links
link_name = "[^]]+"
link_url = "http[s]?://[^)]+"
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
# url will be https://wiki.com/atopic_(subtopic
This will fail to grab the proper link because it matches up to the first bracket, rather than the last one.
How can I make the regex match all the way up to the final bracket?
For those types of urls, you'd need a recursive approach which only the newer regex module supports:
import regex as re
data = """
It's very easy to make some words **bold** and other words *italic* with Markdown.
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""
pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')
for match in pattern.finditer(data):
    description, _, url = match.groups()
    print(f"{description}: {url}")
This yields
link to Google!: http://google.com
a link: https://www.wiki.com/atopic_(subtopic)
See a demo on regex101.com.
This cryptic little beauty boils down to
\[([^][]+)\] # capture anything between "[" and "]" into group 1
(\( # open group 2 and match "("
((?:[^()]+|(?2))+) # match anything not "(" nor ")" or recurse group 2
# capture the content into group 3 (the url)
\)) # match ")" and close group 2
NOTE: The problem with this approach is that it fails for e.g. urls like
[some nasty description](https://google.com/()
# ^^^
which are surely totally valid in Markdown. If you're likely to encounter any such urls, use a proper parser instead.
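A quick check of that failure mode:
import regex as re
pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')
# The unbalanced "(" at the end defeats the balanced-parentheses recursion:
print(pattern.search('[some nasty description](https://google.com/()'))  # None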
I think you need to distinguish between what makes a valid link in markdown, and (optionally) what is a valid url.
Valid links in markdown can, for example, also be relative paths, and urls may or may not have the 'http(s)' or the 'www' part.
Your code would already work by simply using link_url = "http[s]?://.+" or even link_url = ".*". It would solve the problem of urls ending with brackets, and would simply mean that you rely on the markdown structure []() to find links.
Validating urls is an entirely different discussion: How do you validate a URL with a regular expression in Python?
Example code fix:
import re
# Extract []() style links
link_name = r"[^\[]+"
link_url = "http[s]?://.+"
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
# url will be https://wiki.com/atopic_(subtopic)
Note that I also adjusted link_name, to prevent problems with a single '[' somewhere in the markdown text.
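A caveat with the greedy link_url, assuming two links can appear on one line: the .+ runs past the first closing parenthesis, so the match swallows everything up to the last one. A quick sketch with made-up links:
import re
markup_regex = r'\[([^\[]+)]\(\s*(http[s]?://.+)\s*\)'
line = '[a](https://x.example/1) and [b](https://y.example/2)'
print(re.findall(markup_regex, line))
# [('a', 'https://x.example/1) and [b](https://y.example/2')]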

Extracting [0-9_]+ from a URL

I've put together the following regular expression to extract image ID's from a URL:
import re
from pyquery import PyQuery as pq

def parsePostFromPermalink(session, permalink):
    ''' Parse the post details from the full story page '''
    r = session.get('https://m.facebook.com{0}'.format(permalink))
    dom = pq(r.content)
    # Parse the images, extract the IDs, and construct the large image URL
    images = []
    for img in dom('a img[src*="jpg"]').items():
        if img.attr('src'):
            m = re.match(r'/([0-9_]+)n\.jpg/', img.attr('src'))
            images.append(m)
    return images
URL example:
https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224
I want this bit:
13645330_275977022775421_8826465145232985957
I've tested it on regex101 and it works: https://regex101.com/r/eS6eS7/2
img.attr('src') contains the correct URL and is not empty; I tested this. When I try to use m.group(0), I get an exception that group is not a function, because m is None.
Am I doing something wrong?
Two problems:
- the enclosing /.../ are not a part of Python regex syntax
- you should use search instead of match
Working example:
>>> url = "https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224"
>>> re.search(r'([0-9_]+)n\.jpg', url).group(0)
'13645330_275977022775421_8826465145232985957_n.jpg'
If you want just the number part, use this (group(1), and note the additional _):
>>> re.search(r'([0-9_]+)_n\.jpg', url).group(1)
'13645330_275977022775421_8826465145232985957'
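Since calling .group() on a failed match is exactly what raised the exception, it is also worth guarding the loop. A small self-contained sketch (the second src is a hypothetical non-matching URL):
import re
srcs = [
    "https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg",
    "https://example.net/not-a-photo.jpg",
]
ids = []
for src in srcs:
    m = re.search(r'([0-9_]+)_n\.jpg', src)
    if m:  # search() returns None when there is no match
        ids.append(m.group(1))
print(ids)  # ['13645330_275977022775421_8826465145232985957']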
This is the correct python code from Regex101. (There's a code generator on the left). Notice the lack of slashes on the outside of the regex...
import re
p = re.compile(r'([\d_]+)n\.jpg')
test_str = u"https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/c3.0.103.105/p110x80/13700209_937389626383181_6033441713767984695_n.jpg?efg=eyJpIjoiYiJ9&oh=a0b90ec153211eaf08a6b7c4cc42fb3b&oe=581E2EB8"
re.findall(p, test_str)
I'm not sure how you got m as None, but you might need to compile the pattern and use that to match first. Otherwise, try fixing the expression first.
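For what it's worth, the None stems from match() anchoring at the start of the string, while search() scans forward; compiling alone doesn't change that. A minimal check:
import re
url = ("https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/"
       "13645330_275977022775421_8826465145232985957_n.jpg")
p = re.compile(r'([0-9_]+)_n\.jpg')
print(p.match(url))            # None: match() only matches at position 0
print(p.search(url).group(1))  # 13645330_275977022775421_8826465145232985957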

Get only URL from string - Python

I am scraping a page with Python and BeautifulSoup library.
I have to extract only the URL from this string. It actually sits in the href attribute of an a tag. I have scraped the string, but cannot seem to find a way to pull the URL out of it:
javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
You can write a straightforward regex to extract the URL.
>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'
The regex in question here is
'(.*?)'
Which reads "find a single-quote, followed by whatever (and capture the whatever), followed by another single quote, and do so non-greedily because of the ? operator". This extracts the arguments of window.open; then, just pick the first one to get the URL.
You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
I did it this way.
terms = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
terms.split("('")[1].split("','")[0]
outputs
/Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Instead of a regex, you could just partition it twice (e.g., on '):
s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
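One nicety of partition over split here: it degrades gracefully when the expected quote is missing, instead of raising an IndexError. A quick check:
s = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
print(s.partition("'")[2].partition("'")[0])  # /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
print("no quotes here".partition("'")[2].partition("'")[0])  # empty string, no exception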
Here's a quick and ugly answer
href.split("'")[1]

How can I make a regular expression to extract all anchor tags or links from a string?

I've seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I'm using Python):
>>> import re
>>> content = '''
... <a href="http://www.google.com">http://www.google.com</a> Some other text.
... And even more text! http://stackoverflow.com
... '''
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it's going to be messy. Nevertheless, a 90% solution might resemble:
r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
map(''.join, re.findall(pattern, content))
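For instance, joining each 2-tuple flattens the two alternatives back into single strings (the sample content is made up for illustration; in Python 3, wrap map in list() or use a comprehension to see the result):
import re
pattern = r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
content = 'See <a href="http://example.com">my site</a> and http://stackoverflow.com.'
# One group per alternative: anchor text or bare URL; the other group is empty
print([''.join(t) for t in re.findall(pattern, content)])
# ['my site', 'http://stackoverflow.com']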
If you want the src attribute of the anchor instead of the link text, the pattern gets even messier:
r'<a\s[^>]*src=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Alternatively, you can just let the second half of the pattern pick up the src attribute, which also alleviates the need for the string join:
r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
Once you have this much in place, you can replace any found links with something that doesn't look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it's particularly prone to excess greediness.)
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
Usually you should never parse HTML with regular expressions, since HTML isn't a regular language. Here it seems you only want to get all the http links, whether they are in an A element or in plain text. How about getting them all and then removing the duplicates?
Try something like
set(re.findall("(http:\/\/.*?)[\"' <]", content))
and see if it serves your purpose.
Writing a regex pattern that matches all valid urls is tricky business.
If all you're looking for is to detect simple http/https URLs within an arbitrary string, I could offer you this solution:
>>> import re
>>> content = '<a href="http://www.google.com">http://www.google.com</a> Some other text. And even more text! http://stackoverflow.com'
>>> re.findall(r"https?://[\w\-.~/?:#\[\]@!$&'()*+,;=]+", content)
['http://www.google.com', 'http://www.google.com', 'http://stackoverflow.com']
That looks for strings that start with http:// or https:// followed by one or more valid chars.
To avoid duplicate entries, use set():
>>> list(set(re.findall(r"https?://[\w\-.~/?:#\[\]@!$&'()*+,;=]+", content)))
['http://www.google.com', 'http://stackoverflow.com']
You should not use regular expressions to extract things from HTML. You should use an HTML parser.
If you also want to extract things from the text of the page then you should do that separately.
Here's how you would do it with lxml:
# -*- coding: utf8 -*-
import lxml.html as lh
import re
html = """
<a href="http://is.gd/test">is.gd/test</a>http://www.google.com Some other text.
And even more text! http://stackoverflow.com
here's a url bit.ly/test
"""
tree = lh.fromstring(html)
urls = set([])
for a in tree.xpath('//a'):
    urls.add(a.text)
for text in tree.xpath('//text()'):
    for url in re.findall(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', text):
        urls.add(url[0])
print urls
Result:
set(['http://www.google.com', 'bit.ly/test', 'http://stackoverflow.com', 'is.gd/test'])
URL matching regex from here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
No, it will not be able to parse a string like this. Regexes are capable of simple matching, and you can't handle parsing a grammar as complicated as HTML with just one or two of them.

Match "without this"

I need to remove all <p></p> tags that are the only <p> inside a <td>.
But how can it be done?
import re
text = """
<td><p>111</p></td>
<td><p>111</p><p>222</p></td>
"""
text = re.sub(r'<td><p>(??no</p>inside??)</p></td>', r'<td>\1</td>', text)
How can I express the "no </p> inside" part?
I would use minidom. I stole the following snippet from here, which you should be able to modify to work for you:
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
Thanks @Ivo Bosticky
While using regexps with HTML is bad, matching a string that does not contain a given pattern is an interesting question in itself.
Let's assume that we want to match a string beginning with an a and ending with a z and take out whatever is in between only when string bar is not found inside.
Here's my take: "a((?:(?<!ba)r|[^r])+)z"
It basically says: find a, then find either an r which is not preceded by ba, or something different than r (repeat at least once), then find a z. So, a bar cannot sneak into the capture group.
Note that this approach uses a 'negative lookbehind' pattern and only works with lookbehind patterns of fixed length (like ba).
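A quick demonstration with Python's re (the sample strings are made up):
import re
pattern = re.compile(r"a((?:(?<!ba)r|[^r])+)z")
print(pattern.search("a car z").group(1))  # ' car ' -- an r not preceded by ba is fine
print(pattern.search("a bar z"))           # None -- a bar inside blocks the match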
I would definitely recommend using BeautifulSoup for this. It's a python HTML/XML parser.
http://www.crummy.com/software/BeautifulSoup/
Not quite sure why you want to remove the P tags which don't have closing tags.
However, if this is an attempt to clean code, an advantage of BeautifulSoup is that it can clean HTML for you:
from BeautifulSoup import BeautifulSoup
html = """
<td><p>111</td>
<td><p>111<p>222</p></td>
"""
soup = BeautifulSoup(html)
print soup.prettify()
This doesn't get rid of your unmatched tags, but it fixes the missing ones.
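With a modern bs4 install, one way to do what the question literally asks, unwrapping a <p> when it is the only one in its <td>, might look like this sketch (not from the original answer):
from bs4 import BeautifulSoup
html = "<td><p>111</p></td><td><p>111</p><p>222</p></td>"
soup = BeautifulSoup(html, "html.parser")
for td in soup.find_all("td"):
    ps = td.find_all("p")
    if len(ps) == 1:
        ps[0].unwrap()  # drop the tag, keep its contents
print(soup)
# <td>111</td><td><p>111</p><p>222</p></td>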
