Creating a linked hashtag in markdown - python

I'm currently trying to parse markdown in Django/Python and linkify hashtags. There are some simple solutions to this:
import re

for tag in re.findall(r"(#[\d\w\.]+)", markdown):
    text_tag = tag.replace('#', '')
    markdown = markdown.replace(
        tag,
        f"[{tag}](/blog/?q=%23{text_tag})")
This works well enough, but it converts everything containing a # into a link. E.g. https://example.com/xyz/#section-on-page gets linkified, and so does a hashtag that already sits inside a link.
Internal links are also broken, as [Link](#not-a-hashtag) gets linkified.
Here's a comprehensive case:
#hello This is an #example of some text with #hash-tags - http://www.example.com/#not-hashtag but dont want the link
#hello #goodbye #summer
#helloagain
#goodbye
This is #cool, yes it is #radaf! I like this #tool.
[Link](#not-a-hashtag)
[Link](https://example/#also-not)

Use:
import re

def regex_replace(m):
    if m.group(1):
        return fr"[{m.group(1)}](/blog/?q=%23{m.group(2)})"
    return m.group()

regex = r'''<[^>]*>|\[[^][]*]\([^()]*\)|https?://[^\s"'<>]*|(#(\w+(?:\.\w+)*))'''
markdown = re.sub(regex, regex_replace, markdown)
See Python code.
The <[^>]*>|\[[^][]*]\([^()]*\)|https?://[^\s"'<>]*|(#(\w+(?:\.\w+)*)) regex matches an HTML tag, a markdown link or a URL without capturing anything, or matches a hashtag, capturing the whole hashtag as Group 1 and the part after # as Group 2. When Group 1 participates in the match, the replacement is the linkified hashtag; otherwise, the matched string is returned unmodified, so tags, existing links and URLs are left untouched.
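A minimal sketch of this in action, run over a line modeled on the question's test case (the sample string below is just an illustration):
import re

def regex_replace(m):
    # Group 1 is only set when the hashtag alternative matched
    if m.group(1):
        return fr"[{m.group(1)}](/blog/?q=%23{m.group(2)})"
    return m.group()

regex = r'''<[^>]*>|\[[^][]*]\([^()]*\)|https?://[^\s"'<>]*|(#(\w+(?:\.\w+)*))'''
sample = "This is #cool - http://www.example.com/#not-hashtag and [Link](#not-a-hashtag)"
print(re.sub(regex, regex_replace, sample))
# This is [#cool](/blog/?q=%23cool) - http://www.example.com/#not-hashtag and [Link](#not-a-hashtag)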

Related

How to extract markdown links with a regex?

I currently have Python code that parses markdown text in order to extract the content inside the square brackets of a markdown link, along with the hyperlink.
import re

# Extract []() style links
link_name = "[^]]+"
link_url = "http[s]?://[^)]+"
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
# url will be https://www.wiki.com/atopic_(subtopic
This will fail to grab the proper link because it matches up to the first closing parenthesis, rather than the last one.
How can I make the regex match up to the final parenthesis?
For those types of URLs, you'd need a recursive approach, which only the newer regex module supports:
import regex as re

data = """
It's very easy to make some words **bold** and other words *italic* with Markdown.
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""

pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')
for match in pattern.finditer(data):
    description, _, url = match.groups()
    print(f"{description}: {url}")
This yields
link to Google!: http://google.com
a link: https://www.wiki.com/atopic_(subtopic)
See a demo on regex101.com.
This cryptic little beauty boils down to
\[([^][]+)\]         # capture anything between "[" and "]" into group 1
(\(                  # open group 2 and match "("
((?:[^()]+|(?2))+)   # match anything that is not "(" or ")", or recurse into group 2;
                     # capture the content into group 3 (the url)
\))                  # match ")" and close group 2
NOTE: The problem with this approach is that it fails for URLs like
[some nasty description](https://google.com/()
#                                           ^^^
which are perfectly valid in Markdown. If you expect to encounter any such URLs, use a proper parser instead.
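For illustration, here is one possible parser-based sketch (it assumes the third-party Python-Markdown package, installed with pip install markdown; any markdown parser would do): render the markdown to HTML, then collect the anchors with the standard library's HTMLParser:
from html.parser import HTMLParser
import markdown  # third-party: pip install markdown

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # every <a href="..."> in the rendered HTML is a markdown link
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

collector = LinkCollector()
collector.feed(markdown.markdown("You can even [link to Google!](http://google.com)"))
print(collector.links)  # ['http://google.com']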
I think you need to distinguish between what makes a valid link in markdown and (optionally) what makes a valid URL.
Valid links in markdown can, for example, also be relative paths, and URLs may or may not have the 'http(s)' or the 'www' part.
Your code would already work by simply using link_url = "http[s]?://.+" or even link_url = ".*". That would solve the problem of URLs ending with brackets, and would simply mean that you rely on the markdown structure []() to find links.
Validating URLs is an entirely different discussion: How do you validate a URL with a regular expression in Python?
Example code fix:
import re

# Extract []() style links
link_name = r"[^\[]+"
link_url = "http[s]?://.+"
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
# url will be https://www.wiki.com/atopic_(subtopic)
Note that I also adjusted link_name to prevent problems with a single '[' somewhere in the markdown text.

Unable to parse a link from some content

I'm trying to parse a link out of some content using regex. I've already had success, but I had to use the replace() function and the word this as an anchor. The thing is, this may not always be present, so I'm looking for a way to get the same output without relying on those two things.
import re
content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = re.findall(r"this,\s*([^)]*)", content.strip())[0].replace("'", "")
print(link)
Output:
http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf
How can I get the link using pure regex?
You may extract all the characters between single quotes that appear after this, and any whitespace:
import re

content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf')
"""
link = ''
m = re.search(r"this,\s*'([^']*)'", content)
if m:
    link = m.group(1)
print(link)
# => http://www.stirwen.be/medias/documents/20181002_carte_octobre-novembre_2018_FR.pdf
See the Python demo
Also, see the regex demo.
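If the content can contain several such calls, a findall variant of the same pattern (a small sketch; the two PDF URLs below are made-up examples) collects every quoted URL at once:
import re

content = """
widgetEvCall('handlers.onMenuClicked', event, this, 'http://example.com/a.pdf')
widgetEvCall('handlers.onMenuClicked', event, this, 'http://example.com/b.pdf')
"""
links = re.findall(r"this,\s*'([^']*)'", content)
print(links)
# => ['http://example.com/a.pdf', 'http://example.com/b.pdf']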

How to find and replace a special URL pattern (Markdown syntax to HTML) with the re module in Python

I have a string, and I want to search it for a special pattern containing a URL and its name, and then change its format:
Input string:
'This is my [site](http://example.com/url) you can watch it.'
Output string:
'This is my <a href="http://example.com/url">site</a> you can watch it.'
The string may contain several URLs, and I need to change the format of every one. The site name is Unicode and can contain characters in any language.
What pattern should be used, and how can I do it?
This should help
import re

A = 'This is my [site](http://example.com/url) you can watch it.'
site = re.compile(r"\[(.*)\]").search(A).group(1)
url = re.compile(r"\((.*)\)").search(A).group(1)
print(A.replace("[{0}]".format(site), "").replace("({0})".format(url), '<a href="{0}">{1}</a>'.format(url, site)))
Output:
This is my <a href="http://example.com/url">site</a> you can watch it.
Update, as requested in the comments:
s = 'my [site](site.com) is about programing (python language)'
site, url = s[s.find("[")+1:s.find(")")].split("](")
print(s.replace("[{0}]".format(site), "").replace("({0})".format(url), '<a href="{0}">{1}</a>'.format(url, site)))
Output:
my <a href="site.com">site</a> is about programing (python language)
I'm not a markdown expert, but if this is indeed markdown that you're trying to replace, and not your own syntax, you should use an appropriate parser. Note that, if you paste your string directly into stackoverflow - which also uses markdown - it will be transformed into a link, so it would clearly be valid markdown.
If it is indeed your own format, however, try the following to transform
'This is my [site](http://example.com/url) you can watch it.'
into
'This is my <a href="http://example.com/url">site</a> you can watch it.'
using the following match:
\[(.*?)\]\((.*?)\)
and the following replacement string:
r'<a href="\2">\1</a>'
In python, re.sub(match, replace, stringThatYouWantToReplaceStuffIn) should do the trick. Don't forget to assign the return value of re.sub to whatever variable should contain the new string.
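Putting that together, a minimal runnable version (a sketch using the match and replacement above) looks like this:
import re

s = 'This is my [site](http://example.com/url) you can watch it.'
result = re.sub(r'\[(.*?)\]\((.*?)\)', r'<a href="\2">\1</a>', s)
print(result)
# This is my <a href="http://example.com/url">site</a> you can watch it.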

How to perform a tag-agnostic text string search in an HTML file?

I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.
This also means that all reported character positions are off because LT doesn't "see" the tags.
For example, if I check the following HTML fragment:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
LanguageTool will treat it as a plain text sentence:
This is kind of a stupid question.
and returns the following message:
<error category="Grammar" categoryid="GRAMMAR" context=" This is kind of a stupid question. " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>
(In this particular example, LT has flagged "kind of a.")
Since the search string might be wrapped in tags and might occur multiple times, I can't do a simple index search.
What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which might be off by 10-30% depending on the number of tags, as well as the words before and after the flagged word(s).)
I.e. I'd need to do a search that ignores all tags, but includes them in the character position count.
In this particular example, I'd have to locate "kind of a" and find the location of the letter k in:
kin<b>d</b> o<i>f</i> a
This may not be the speediest way to go, but pyparsing will recognize HTML tags in most forms. The following code inverts the typical scan, creating a scanner that will match any single character, and then configuring the scanner to skip over HTML open and close tags, as well as common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens and the starting and ending locations of each match, so it is easy to build a list that maps every character outside of a tag to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below:
from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity

test = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"

non_tag_text = Word(printables+' ', exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)
# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
orig_loc = char_locs[untagged.find(search_str)][1]
# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')
"""
Should look like this:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
^
"""
The --xmlfilter option is deprecated because of issues like this. The proper solution is to remove the tags yourself, but keep the positions so you have a mapping to correct the results that come back from LT. When using LT from Java, this is supported by AnnotatedText, but the algorithm should be simple enough to port. (Full disclosure: I'm the maintainer of LT.)
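A rough Python sketch of that mapping idea (my own illustration, not LT's actual AnnotatedText code): strip the tags while recording, for every plain-text character, its index in the original HTML, then translate LT's plain-text offsets back through that table:
import re

def strip_tags_with_map(html):
    # positions[i] is the index in the original html of plain[i]
    plain, positions = [], []
    last = 0
    for m in re.finditer(r"<[^>]+>", html):
        for j in range(last, m.start()):
            plain.append(html[j])
            positions.append(j)
        last = m.end()
    for j in range(last, len(html)):
        plain.append(html[j])
        positions.append(j)
    return "".join(plain), positions

html = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
text, positions = strip_tags_with_map(html)
offset = text.find("kind of a")  # the offset LT reports against the plain text
print(positions[offset])         # 11, the index of 'k' in the original HTML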

How can I pull a link out of a body of text in Python?

I am creating a program that, given a link, tells what time a YouTube video links to. I can already do what I want when I have only the link, but I want to know how to extract the link from a body of text.
For example if the input is:
"This is filler to test the program, https://www.youtube.com/watch?feature=player_embedded&v=DkW5CSZ_VII#t=407 that is the link I want to pull out."
How can I simply get:
"https://www.youtube.com/watch?feature=player_embedded&v=DkW5CSZ_VII#t=407"
You can use a regular expression for this:
import re

s = "This is filler to test the program, https://www.youtube.com/watch?feature=player_embedded&v=DkW5CSZ_VII#t=407 that is the link I want to pull out."
url = re.search(r"(http.+youtube\.com.+#t=\d+)", s).groups()[0]
But once you're using re, you can just go straight to extracting the time (moving the capture group to the \d+ at the end; you can also ditch the http.+ at the start):
time = re.search(r"youtube\.com.+#t=(\d+)", s).groups()[0]
Note that this regular expression won't play well with multiple links in the same block of text, which may be a problem. You can test regular expressions easily online using e.g. regex101.
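If the text can contain more than one link, swapping the whitespace-crossing .+ for \S+ and using re.findall returns every timestamp (a sketch; the second URL below is a made-up example):
import re

s = ("First https://www.youtube.com/watch?v=DkW5CSZ_VII#t=407 and "
     "second https://www.youtube.com/watch?v=abc123xyz#t=42 link.")
times = re.findall(r"youtube\.com\S+#t=(\d+)", s)
print(times)  # ['407', '42']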
