I currently have Python code that parses markdown text to extract the content inside the square brackets of a markdown link, along with the hyperlink.
import re

# Extract []() style links
link_name = "[^]]+"
link_url = "http[s]?://[^)]+"
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
    # url will be https://www.wiki.com/atopic_(subtopic
This fails to grab the proper link because it matches up to the first closing parenthesis rather than the last one.
How can I make the regex match all the way to the final parenthesis?
For those kinds of URLs you'd need a recursive approach, which only the newer regex module supports:
import regex as re
data = """
It's very easy to make some words **bold** and other words *italic* with Markdown.
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""
pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')
for match in pattern.finditer(data):
    description, _, url = match.groups()
    print(f"{description}: {url}")
This yields
link to Google!: http://google.com
a link: https://www.wiki.com/atopic_(subtopic)
See a demo on regex101.com.
This cryptic little beauty boils down to
\[([^][]+)\] # capture anything between "[" and "]" into group 1
(\( # open group 2 and match "("
((?:[^()]+|(?2))+) # match anything not "(" nor ")" or recurse group 2
# capture the content into group 3 (the url)
\)) # match ")" and close group 2
NOTE: The problem with this approach is that it fails for URLs with unbalanced parentheses, e.g.
[some nasty description](https://google.com/()
# ^^^
which are surely totally valid in Markdown. If you're likely to encounter any such URLs, use a proper parser instead.
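If you'd rather avoid recursive regexes entirely, a small hand-rolled scanner is another option. This is a minimal sketch (not a full Markdown parser, and it still assumes the parentheses in the URL are balanced) that finds the matching closing parenthesis by tracking nesting depth:

```python
import re

def extract_links(text):
    """Find [name](url) pairs, balancing nested parentheses in the url."""
    links = []
    for m in re.finditer(r'\[([^][]+)\]\(', text):
        start = m.end()          # position just after the opening "("
        depth = 1
        for i in range(start, len(text)):
            if text[i] == '(':
                depth += 1
            elif text[i] == ')':
                depth -= 1
                if depth == 0:   # found the matching closing parenthesis
                    links.append((m.group(1), text[start:i]))
                    break
    return links

print(extract_links('[a link](https://www.wiki.com/atopic_(subtopic))'))
# [('a link', 'https://www.wiki.com/atopic_(subtopic)')]
```

For an unbalanced URL like https://google.com/( the scanner simply finds no match instead of truncating the link, which may or may not be what you want.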
I think you need to distinguish between what makes a valid link in markdown, and (optionally) what is a valid url.
Valid links in markdown can, for example, also be relative paths, and urls may or may not have the 'http(s)' or the 'www' part.
Your code would already work by simply using link_url = "http[s]?://.+" or even link_url = ".*". That would solve the problem of URLs ending with brackets, and would simply mean that you rely on the markdown structure []() to find links.
Validating urls is an entirely different discussion: How do you validate a URL with a regular expression in Python?
Example code fix:
import re

# Extract []() style links
link_name = r"[^\[]+"
link_url = "http[s]?://.+"
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
    # url will be https://www.wiki.com/atopic_(subtopic)
Note that I also adjusted link_name, to prevent problems with a single '[' somewhere in the markdown text.
I'm trying to match the following URL by its query string from an HTML page in Python, but I haven't been able to solve it. I'm a newbie in Python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL when it contains &user_id=[any_digit_from_0_to_99]& and print that URL on the screen.
A URL without &user_id=[any_digit_from_0_to_99]& won't be matched.
Here's my horrid, incomplete regex:
https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has many things wrong with it, but it somehow manages to match the above URL up to the " double quote.
Complete code would look like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It shows the " at the end of the URL, and I know this isn't good regex code; I'd like a better version of it.
A few remarks can be made on your regex:
/ is not a special re character; there's no need to escape it
Was the 30-character limit on the domain done on purpose? Otherwise, you can just select as many characters as you want with .*
Do you know that the string you're working with contains a valid URL? If not, there are some things you can do, like ensuring the domain is at least 4 characters long, contains a period which is not the last character, etc.
The [0-9][0-9] part requires exactly two digits, so it misses single-digit IDs like 9, and it also matches strings like 04, which is not how a number between 0 and 99 is normally written
Taking this into account, you can design this simpler regex:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=49&', without the " at the end. If you want the full URL, then you can look for the /> symbol:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&.*/>")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()[:-2]
print(result)
Note the [:-2], which is used to remove the /> symbol. In that case, this code will print http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838".
Note also that these regexes use the wildcard .. Depending on whether you are sure that the strings you're working with contain only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters. You may want to look at the \w special sequence with the ASCII flag of the re module.
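As a sketch of that last remark, here is one way to replace the wildcard with explicit character classes and pass the ASCII flag. The character classes below are illustrative, chosen to cover this particular URL's syntax, not a complete list of legal URL characters:

```python
import re

html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'

# re.ASCII restricts \w to [a-zA-Z0-9_]; the extra class members
# (. / ? = & -) cover the URL punctuation used in this example
reg = re.compile(r"https?://[\w./?=&-]+&user_id=[1-9][0-9]?&[\w=&]*", re.ASCII)

m = reg.search(html)
print(m.group())
# http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E
```

Because the class excludes " and >, the match stops cleanly at the end of the URL without needing the [:-2] slice.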
I'm currently trying to parse markdown in django/python and linkify hashtags. There are some simple solutions to this:
for tag in re.findall(r"(#[\d\w\.]+)", markdown):
    text_tag = tag.replace('#', '')
    markdown = markdown.replace(
        tag,
        f"[{tag}](/blog/?q=%23{text_tag})")
This works well enough, but it converts everything with a # into a link. E.g.:
https://example.com/xyz/#section-on-page gets linkified, and a hashtag also gets linkified if it is inside a link itself.
Internal links are also broken, as links like [Link](#not-a-hashtag) get linkified.
Here's a comprehensive case:
#hello This is an #example of some text with #hash-tags - http://www.example.com/#not-hashtag but dont want the link
#hello #goodbye #summer
#helloagain
#goodbye
This is #cool, yes it is #radaf! I like this #tool.
[Link](#not-a-hashtag)
[Link](https://example/#also-not)
Use
def regex_replace(m):
    if m.group(1):
        return fr"[{m.group(1)}](/blog/?q=%23{m.group(2)})"
    return m.group()
regex = r'''<[^>]*>|\[[^][]*]\([^()]*\)|https?://[^\s"'<>]*|(#(\w+(?:\.\w+)*))'''
markdown = re.sub(regex, regex_replace, markdown)
See Python code.
The <[^>]*>|\[[^][]*]\([^()]*\)|https?://[^\s"'<>]*|(#(\w+(?:\.\w+)*)) regex matches an HTML tag, a markdown link, or a URL, or else matches a hashtag, capturing it as a whole (Group 1) and its part after # (Group 2). When Group 1 participates in the match, the replacement is the linkified hashtag; otherwise the matched string is returned unmodified.
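Put together, a runnable sketch of the approach on a line adapted from the sample input above (the /blog/?q= target is the one from the question):

```python
import re

def regex_replace(m):
    if m.group(1):                      # a bare hashtag was captured
        return fr"[{m.group(1)}](/blog/?q=%23{m.group(2)})"
    return m.group()                    # tag / markdown link / URL: leave untouched

regex = r'''<[^>]*>|\[[^][]*]\([^()]*\)|https?://[^\s"'<>]*|(#(\w+(?:\.\w+)*))'''

markdown = "This is #cool - http://www.example.com/#not-hashtag [Link](#not-a-hashtag)"
print(re.sub(regex, regex_replace, markdown))
# This is [#cool](/blog/?q=%23cool) - http://www.example.com/#not-hashtag [Link](#not-a-hashtag)
```

Only the bare #cool is rewritten; the URL fragment and the existing markdown link are left alone because their alternatives match first without capturing Group 1.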
I'm trying to parse a link out of some content using regex. I've already got it working, but I had to use the replace() function and the literal this, as an anchor, and they may not always be present. So I'm seeking a solution that gets the same output without those two things.
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = re.findall(r'this,\s*([^)]*)',content.strip())[0].replace("'","")
print(link)
Output:
http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf
How can I get the link using pure regex?
You may extract all the chars between single quotes that appear after this, and any whitespace:
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = ''
m = re.search(r"this,\s*'([^']*)'", content)
if m:
    link = m.group(1)
print(link)
# => http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf
See the Python demo and the regex demo.
Most of the posts I have found here rely on HTML tags to locate the URLs in a text file, but not all text files necessarily contain HTML tags. I am looking for a solution that works in both situations. I am using the following regex to obtain the URLs:
'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
The problem is it also grabs unnecessary trailing characters such as '>'.
Here is my code:
import re

def extractURLs(fileContent):
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent.lower())
    print(urls)
    return urls

myFile = open("emailBody.txt")
fileContent = myFile.read()
URLs = extractURLs(fileContent)
The example of output is as below:
http://saiconference.com/ficc2018/submit
http://52.21.30.170/sendy/unsubscribe/qhiz2s763l892rkps763chacs52ieqkagf8rbueme9n763jv6da/hs1ph7xt5nvdimnwwfioya/qg0qteh7cllbw8j6amo892ca>
https://www.youtube.com/watch?v=gvwyoqnztpy>
http://saiconference.com/ficc
http://saiconference.com/ficc>
http://saiconference.com/ficc2018/submit>
As you can see there are some characters (such as '>') that are causing problems. What am I doing wrong?
Quick solution, assuming '>' is the only stray character that appears at the end: url.rstrip('>').
rstrip removes all trailing occurrences of the character from a single string, so you will have to iterate through the list and strip each URL.
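For example, a quick sketch over a few of the URLs shown above:

```python
urls = [
    "http://saiconference.com/ficc2018/submit",
    "https://www.youtube.com/watch?v=gvwyoqnztpy>",
    "http://saiconference.com/ficc>",
]
# strip any trailing '>' characters from every URL in the list
clean_urls = [u.rstrip('>') for u in urls]
print(clean_urls)
```

URLs that don't end in '>' pass through unchanged, since rstrip only removes characters actually present at the end.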
Edit: Just got a PC with python, so giving a regex answer, after testing.
import re
def extractURLs(fileContent):
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent.lower())
    cleanUrls = []
    for url in urls:
        lastChar = url[-1]  # get the last character
        # if the last character is not (^ means not) a letter, a digit,
        # or a '/' (some websites may end with that; add your own exceptions),
        # then strip it
        if re.match(r'[^a-zA-Z0-9/]', lastChar):
            cleanUrls.append(url[:-1])  # strip the trailing character
        else:
            cleanUrls.append(url)  # otherwise, append the URL unchanged
    print(cleanUrls)
    return cleanUrls
URLs = extractURLs("http://saiconference.com/ficc2018/submit>")
But if it's just one character, it is simpler to use .rstrip().
I've put together the following regular expression to extract image ID's from a URL:
''' Parse the post details from the full story page '''
def parsePostFromPermalink(session, permalink):
    r = session.get('https://m.facebook.com{0}'.format(permalink))
    dom = pq(r.content)

    # Parse the images, extract the IDs, and construct the large image URL
    images = []
    for img in dom('a img[src*="jpg"]').items():
        if img.attr('src'):
            m = re.match(r'/([0-9_]+)n\.jpg/', img.attr('src'))
            images.append(m)
    return images
URL example:
https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224
I want this bit:
13645330_275977022775421_8826465145232985957
I've tested it on regex101 and it works: https://regex101.com/r/eS6eS7/2
img.attr('src') contains the correct URL and is not empty; I tested this. When I try to use m.group(0), I get an exception that group is not a function, because m is None.
Am I doing something wrong?
Two problems:
those enclosing /.../ are not a part of Python regex syntax
you should use search instead of match
Working example:
>>> url = "https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224"
>>> re.search(r'([0-9_]+)n\.jpg', url).group(0)
'13645330_275977022775421_8826465145232985957_n.jpg'
If you want just the number part, use this (group(1), and note the additional _):
>>> re.search(r'([0-9_]+)_n\.jpg', url).group(1)
'13645330_275977022775421_8826465145232985957'
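To see why the original re.match call returned None, a quick contrast with re.search on the same URL:

```python
import re

url = ("https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/"
       "13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9")

# re.match anchors at position 0; the string starts with "https", not digits
print(re.match(r'([0-9_]+)_n\.jpg', url))
# None

# re.search scans the whole string until the pattern fits
print(re.search(r'([0-9_]+)_n\.jpg', url).group(1))
# 13645330_275977022775421_8826465145232985957
```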
This is the correct Python code from regex101 (there's a code generator on the left). Notice the lack of slashes on the outside of the regex:
import re
p = re.compile(r'([\d_]+)n\.jpg')
test_str = u"https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/c3.0.103.105/p110x80/13700209_937389626383181_6033441713767984695_n.jpg?efg=eyJpIjoiYiJ9&oh=a0b90ec153211eaf08a6b7c4cc42fb3b&oe=581E2EB8"
re.findall(p, test_str)
I'm not sure how you got m as None, but you might need to compile the pattern and use that to match first. Otherwise, try fixing the expression first.