How to check for key-value metadata in Markdown - Python

I need to check if my input, formatted using markdown, has key-value pair metadata at the beginning, and then insert text after the whole metadata block.
I look for a : in the first line and if found, split the input string at the first newline and add my stuff.
Now, if markdown_content.splitlines()[0].find(':') >= 0: obviously fails when there's no metadata at the beginning, but something else containing a : instead.
Examples
Input with metadata:
page title: fancypagetitle
something else: another value
# Heading
Text
Input without metadata, but with a :
This is a [link](http://www.stackoverflow.com)
# Heading
Text
My question is: how do I check whether a metadata block is present and, in case it is, add something between the metadata and the remaining Markdown?
Definition of metadata
The keywords are case-insensitive and may consist of letters, numbers, underscores and dashes and must end with a colon. The values consist of anything following the colon on the line and may even be blank.
If a line is indented by 4 or more spaces, that line is assumed to be an additional line of the value for the previous keyword. A keyword may have as many lines as desired.
The first blank line ends all meta-data for the document. Therefore, the first line of a document must not be blank. All meta-data is stripped from the document prior to any further processing by Markdown.
Source: https://pythonhosted.org/Markdown/extensions/meta_data.html

Have you considered looking at the source code for the meta data extension to see how it's done?
The regex used is:
META_RE = re.compile(r'^[ ]{0,3}(?P<key>[A-Za-z0-9_-]+):\s*(?P<value>.*)')
Of course there is also the regex for secondary lines:
META_MORE_RE = re.compile(r'^[ ]{4,}(?P<value>.*)')
If you note, those regular expressions are much more specific than yours and are much less likely to produce a false positive. The extension then splits the document into lines, loops through each line comparing it against those regexes, and breaks out of the loop on the first line that does not match (which may or may not be a blank line).
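To illustrate, here is a minimal sketch of that loop (the function name and the inserted text are mine; unlike the real extension it does not collect the key/value pairs, it only finds where the block ends):
import re

META_RE = re.compile(r'^[ ]{0,3}(?P<key>[A-Za-z0-9_-]+):\s*(?P<value>.*)')
META_MORE_RE = re.compile(r'^[ ]{4,}(?P<value>.*)')

def insert_after_metadata(markdown_content, extra):
    lines = markdown_content.splitlines()
    end = 0  # number of leading lines that belong to the meta-data block
    for i, line in enumerate(lines):
        if META_RE.match(line) or (end > 0 and META_MORE_RE.match(line)):
            end = i + 1
        else:
            break  # the first non-matching line (blank or otherwise) ends the block
    if end == 0:
        return markdown_content  # no meta-data block at the top
    return '\n'.join(lines[:end] + [extra] + lines[end:])
Run against your first example, this inserts the extra text right after the 'something else: another value' line; your second example comes back unchanged, because the link line does not match META_RE.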
If you notice in that code, there is a new feature that has been added which will be available in the next release. Support is being added for optional YAML-style delimiters. If you are comfortable using the latest (unreleased) development code, you could wrap your meta-data in YAML delimiters, which might make it a little easier to find the end of the meta-data.
For example, your example document above would then look like this (note that I used the optional end-specific delimiter (...), which marks the end more clearly):
---
page title: fancypagetitle
something else: another value
...
# Heading
Text
That said, you would still need to be careful that you didn't get a false match (a <hr> for example). I suppose either way you would really need to re-implement everything that is in the meta data extension for your own needs. Of course, it is open source, so you can as long as you honor the license.
Sorry, but I can't give you a timeline on when the next release will happen for sure.
Oh, and it may also help to look at the description of this feature provided by MultiMarkdown which inspired the feature in Python-Markdown. That might give you a clearer picture of what might comprise meta-data.

Related

Modify the data found between two recurring patterns in a multi-line string

I have a multi-line string of around 10,000-40,000 characters (the length changes depending on the data returned by an API). In this string there are a number of tables (they are part of the string, but formatted in a way that makes them look like tables). The tables always follow a repeating pattern. The pattern looks like this:
==============================================================================
*THE HEADINGS/COLUMN NAMES IN THE TABLE*
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
I'm trying to display the contents as HTML on a locally hosted webpage, and I want the headings of the tables displayed in a specific way (think color, font size). For that I'm using Python's re module to identify the pattern, but I'm failing to do so due to inexperience with it. To modify the part that I need modified, I'm using the piece of code below:
re.sub(r'\={78}.*\-{78}',some_replacement_string, complete_multi_line_string)
But the above piece of code is not giving me the output I require, since it is not matching the pattern properly (I'm sure the mistake is in the pattern I'm asking re.sub to match).
However:
re.sub(r'\-{78}',some_replacement_string, complete_multi_line_string)
is working, as it returns the string with the replacement, but the slight problem here is that there are multiple occurrences of ------------------------------------------------------------------------------ in the string that I do not want modified. Please help me out here. If it is helpful, the output that I want is something like:
==============================================================================
<span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
Also, please note that there are newlines (\n) after the ==============================================================================s, the <span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>s and the ------------------------------------------------------------------------------s, if that is helpful in getting to the solution.
The code snippet I'm currently trying to debug, if helpful:
result = re.sub(r'\={78}.*\-{78}', replacement, multi_line_string)
l = result.count('</span>')
print(l)
PS: There are 78 = and 78 - in all the occurrences.
You should try using the following version:
re.sub(r'(={78})\n(.*?)\n(-{78})', r'\1\n<span>\2</span>\n\3', complete_multi_line_string, flags=re.S)
The changes I made here include:
Matching on the lazy dot .*? instead of the greedy dot .*, to ensure that we don't match across header sections
Matching with the re.S flag, so that .*? can match across newlines
Reinserting the newlines around the heading in the replacement (\1\n<span>\2</span>\n\3), since the pattern consumes them
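A quick self-contained check of that substitution (the sample string is mine, built with 78 = and - characters as in your data):
import re

table = "=" * 78 + "\n*THE HEADINGS*\n" + "-" * 78 + "\nTHE DATA"
result = re.sub(r'(={78})\n(.*?)\n(-{78})',
                r'\1\n<span>\2</span>\n\3',
                table, flags=re.S)
print(result)                   # the heading line is now wrapped in <span>...</span>
print(result.count('</span>'))  # 1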

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is a ML classifier, I'm trying to improve it and have found that when I stripped URLS from the text given to it, some of the URLS have been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect but it works, however I just thought that I might try to preemptively remove URLS from twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy? This will get rid of the string of characters that occurs after a link... specifically pictures, which is a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called a positive lookbehind, and it works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID.
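For example, checking it directly in Python:
import re

url = "https://twitter.com/username/status/ID"
print(re.search(r"(?<=https://twitter\.com/username/).*", url).group())
# prints: status/ID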
In the original pattern I escaped the slashes / with backslashes; some regex tools require that, but Python itself does not, which is why the runnable version above omits them. I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind does is specify some mandatory text before the current position of your cursor; in other words, it puts the cursor right after the expression you feed it (if said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This is a complication, because lookbehinds in Python's re module must have a fixed length.
So you can instead just skip the fixed prefix https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
In short, this just splits the string at the first slash character and keeps everything after it.
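Putting the two steps together in Python (the sample tweet is mine; I also used \S* rather than .* so the match stops at the first space, since the URL sits inside other text):
import re

tweet = "check this out https://twitter.com/someuser/status/1234567890"
match = re.search(r"(?<=https://twitter\.com/)\S*", tweet)
if match:
    rest = match.group()          # 'someuser/status/1234567890'
    print(rest.split("/", 1)[1])  # 'status/1234567890'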
As a sidenote, blanks/spaces aren't allowed in URLs; when needed, they are usually encoded as %20 or +. In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?

What format should the file for ConfigParser be?

I'm setting up my credentials for the library: https://pypi.python.org/pypi/python-amazon-product-api/
The code for the relevant ConfigParser usage is in the project's source.
I'm wondering, what format should the config file variables be? Should strings be inserted inside quotes? Should there be spaces between the variable name and the equal sign?
How does this look?
[Credentials]
access_key=xxxxxxxxxxxxxxxxxxxxx
secret_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
associate_tag=xxxxxxxxxxxx
As taken from the documentation:
The configuration file consists of sections, led by a [section] header and followed by name: value entries, with continuations in the style of RFC 822 (see section 3.1.1, “LONG HEADER FIELDS”); name=value is also accepted. Note that leading whitespace is removed from values. The optional values can contain format strings which refer to other values in the same section, or values in a special DEFAULT section. Additional defaults can be provided on initialization and retrieval. Lines beginning with '#' or ';' are ignored and may be used to provide comments.
Configuration files may include comments, prefixed by specific characters (# and ;). Comments may appear on their own in an otherwise empty line, or may be entered in lines holding values or section names. In the latter case, they need to be preceded by a whitespace character to be recognized as a comment. (For backwards compatibility, only ; starts an inline comment, while # does not.)
On top of the core functionality, SafeConfigParser supports interpolation. This means values can contain format strings which refer to other values in the same section, or values in a special DEFAULT section. Additional defaults can be provided on initialization.
For example:
[My Section]
foodir: %(dir)s/whatever
dir=frob
long: this value continues
    in the next line
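Reading that section back shows what interpolation and continuation lines produce (a sketch assuming Python 3's configparser, whose default interpolation matches SafeConfigParser's behaviour; the file name is mine):
from configparser import ConfigParser

config = ConfigParser()
config.read("example.cfg")
print(config.get("My Section", "foodir"))  # 'frob/whatever' - %(dir)s was interpolated
print(config.get("My Section", "long"))    # 'this value continues\nin the next line'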
You are pretty free to write whatever you want in the settings file.
In your particular case you just need to copy and paste your keys and tag, and ConfigParser should do the rest: values are not quoted (everything after the = or :, minus leading whitespace, is the value), and spaces around the equals sign are fine.
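Applied to your [Credentials] section, a minimal read-back sketch (the file name is mine - use whatever path the library expects):
from configparser import ConfigParser

config = ConfigParser()
config.read("config.cfg")
access_key = config.get("Credentials", "access_key")
secret_key = config.get("Credentials", "secret_key")
associate_tag = config.get("Credentials", "associate_tag")
# the values come back exactly as written after the '=', unquoted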

pyparsing capturing groups of arbitrary text with given headers as nested lists

I have a text file that looks similar to:
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
I am trying to construct a grammar with pyparsing that would result in the following list structure when asking for the parsed results as a list (i.e. the following should be printed when iterating through the parsed.asList() elements):
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala']]]
['some other header:',[['as before could be anything'],['hey isnt this fun']]]
The header names are all known beforehand, and individual headers may or may not appear. If they do appear, there is always at least one line of content.
The problem is that I am having trouble getting the parser to recognise where 'section header 1:' ends and 'some other header:' begins. I end up with a parsed.asList() looking like:
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala'],['some other header:'],['as before could be anything'],['hey isnt this fun']]]
(i.e. section header 1: gets seen correctly, but everything following it gets added to section header 1, including further header lines etc.)
I've tried various things, played with leaveWhitespace() and LineEnd() in various ways, but I can't figure it out.
The base parser I am hacking about with is (contrived example - in reality this is part of a class definition, etc.):
header_1_line = Literal('section header 1:')
text_line = Group(OneOrMore(Word(printables)))
header_1_block = Group(header_1_line + Group(OneOrMore(text_line)))
header_2_line = Literal('some other header:')
header_2_block = Group(header_2_line + Group(OneOrMore(text_line)))
overall_structure = ZeroOrMore(header_1_block | header_2_block)
and is being called with
parsed=overall_structure.parseFile()
Cheers, Matt.
Matt -
Welcome to pyparsing! You have fallen into one of the most common pitfalls in working with pyparsing, and that is that people are smarter than computers. When you look at your input text, you can easily see which text can be headers and which text can't be. Unfortunately, pyparsing is not so intuitive, so you have to tell it explicitly what can and can't be text.
When you look at your sample text, you are not accepting just any line of text as possible text within a section header. How do you know that 'some other header:' is not valid as text? Because you know that that string matches one of the known header strings. But in your current code, you have told pyparsing that any collection of Word(printables) is valid text, even if that collection is a valid section header.
To fix this, you have to add some explicit lookahead to your parser. Pyparsing offers two constructs, NotAny and FollowedBy. NotAny can be abbreviated using the '~' operator, so we can write this pseudocode expression for text:
text = ~any_section_header + everything_up_to_the_end_of_the_line
Here is a complete parser using negative lookahead to make sure you read each section, breaking on section headings:
from pyparsing import (ParserElement, LineEnd, Literal, restOfLine,
                       ZeroOrMore, Group, StringEnd)

test = """
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""

ParserElement.setDefaultWhitespaceChars(" \t")
NL = LineEnd().suppress()
END = StringEnd()

header_1 = Literal('section header 1:')
header_2 = Literal('some other header:')
any_header = (header_1 | header_2)

# text isn't just anything! don't accept a header line, and stop at the end of the input string
text = Group(~any_header + ~END + restOfLine)

overall_structure = ZeroOrMore(Group(any_header + Group(ZeroOrMore(text))))
overall_structure.ignore(NL)

from pprint import pprint
pprint(overall_structure.parseString(test).asList())
In my first attempt, I forgot to also look for the end of string, so my restOfLine expression looped forever. By adding a second lookahead for the string end, my program terminates successfully. Exercise left for you: instead of enumerating all possible headers, define a header line as any line that ends with a ':'.
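A possible sketch of that exercise (my own untested take, not part of the answer above): treat any line whose content ends with a ':' as a header, so that new headers need no code changes:
from pyparsing import Regex

# replaces the (header_1 | header_2) alternation above;
# matches a whole line only if its last character is ':'
any_header = Regex(r"[^\n]+:(?=\n|$)")
The lookahead (?=\n|$) makes sure the ':' is the last character on the line; the trade-off is that any stray content line ending in ':' will now start a new section.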
Good luck with your pyparsing efforts,
-- Paul

regex regarding symbols in urls

I want to replace consecutive symbols with just one, such as;
this is a dog???
to
this is a dog?
I'm using
text = re.sub(r"([^\s\w])(\s*\1)+", r"\1", text)
However, I notice that this might also replace symbols in URLs that happen to be in my text, like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional; if omitted, the URL needs to start with www or ftp)
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
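For what it's worth, both limitations appear to be gone in recent Python: since 3.5, re.sub() substitutes the empty string for groups that did not participate in the match, and current versions also accept the (?:\s*(?P=rpt))+ repetition. A quick check (sample text mine):
import re

pattern = r"""(?ix)  # case-insensitive, verbose
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
    |
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+"""

text = "this is a dog??? see http://example.com/this--is-a-page.html !!"
print(re.sub(pattern, r"\g<URL>\g<rpt>", text))
# -> this is a dog? see http://example.com/this--is-a-page.html !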
