I have a Python script that generates some HTML. It does so using the Python markdown library. I'd like to stick the original Markdown text in a comment at the end of the HTML, where it will occasionally be useful for debugging purposes. I've tried just plunking the Markdown text after the end of the HTML, and it doesn't work for me (Firefox). So the way I imagine this working is that I run Markdown and then simply append the Markdown source, marked as a comment, after the HTML. However, HTML is apparently somewhat finicky about what it will allow in comments. The site htmlhelp.com gives the following advice after some discussion:
For this reason, use the following simple rule to compose valid and accepted [portable] comments:
An HTML comment begins with "<!--", ends with "-->" and does not contain "--" or ">" anywhere in the comment.
(source)
So it looks like I need to do some escaping or something to get my bunch of markdown text into a form that HTML will accept as a comment. Is there an existing tool that will help me do this?
According to the w3:
Comments consist of the following parts, in exactly the following order:
- the comment start delimiter "<!--"
- text
- the comment end delimiter "-->"
The text part of comments has the following restrictions:
1. must not start with a ">" character
2. must not start with the string "->"
3. must not contain the string "--"
4. must not end with a "-" character
These are very simple rules. You could regex-enforce them, but they are so simple you don't even need that!
3 of the 4 conditions can be met with concatenation, and the other one with a simple replace(). All in all, it's a one-liner:
def html_comment(text):
    return '<!-- ' + text.replace('--', '- - ') + ' -->'
Note the spaces.
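A quick way to convince yourself that the one-liner satisfies all four restrictions is to check them directly. This is only a sketch: body_is_valid and the sample strings are hypothetical helpers written for this illustration, not part of any library.

```python
def html_comment(text):
    # Same one-liner as above: break up every "--" and pad with spaces.
    return '<!-- ' + text.replace('--', '- - ') + ' -->'


def body_is_valid(body):
    """Check the four restrictions on comment text quoted above (hypothetical helper)."""
    return not (
        body.startswith('>')
        or body.startswith('->')
        or '--' in body
        or body.endswith('-')
    )


# Hypothetical inputs that would each break a naive '<!--' + text + '-->'.
samples = ['plain', '-- dashes --', '---', '> quote', '-> arrow', 'ends with -', '']
results = {}
for sample in samples:
    comment = html_comment(sample)
    body = comment[len('<!--'):-len('-->')]
    results[sample] = body_is_valid(body)

print(results)
```

Keep in mind that the replacement is lossy: once "--" has become "- - ", the original Markdown cannot be told apart from text that really contained "- - ".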
Can't you just .replace it? Ultimately, you could replace those characters with anything, but substituting escape codes probably won't make your comment any more readable than substituting nothing.
commented = '<!-- %s -->' % markdown_text.replace('--', '').replace('>', '')
I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1"
Essentially, I need it to pull whatever sentence is between <br> tags.
I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences, or single words.
This, however, needs to pull whatever arbitrary string is between <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: Ended up going with a different approach altogether: simply splitting the HTML into a list separated by <br> and pulling [3], which made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.
You need to use the DOTALL flag as there are newlines in the expression that you need to match. I would use
re.findall('<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> on that page. You may want to use the more specific:
re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
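To see what the DOTALL flag changes, here is a small sketch that runs the more specific pattern with and without re.S. The html string is a hypothetical stand-in for the downloaded page (the real markup may differ), so nothing is fetched from the network.

```python
import re

# Hypothetical stand-in for the page source; the real markup may differ.
html = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n"
    "<br><hr><br><br>footer</body></html>"
)

pattern = '<hr><br>(.*?)<br><hr>'

# Without re.S the dot cannot cross the newlines inside the quote.
no_flag = re.findall(pattern, html)

# With re.S (DOTALL) the dot also matches newlines.
with_flag = re.findall(pattern, html, re.S)

# Collapse the newlines and tabs inside the captured quote.
quote = " ".join(with_flag[0].split())
print(no_flag)
print(quote)
```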
from urllib import urlopen
import re
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if (len(output) > 0):
    print(output)
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace all those inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so you would maintain the original line breaks visually.
All jokes on that page follow the same model with nothing ambiguous, so you can use this:
output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag because there's no dot.
This is uh, 7 years later, but for future reference:
Use the BeautifulSoup library for this kind of task, as suggested by Floris in the comments.
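BeautifulSoup is a third-party package, so as a dependency-free illustration of the same parser-based idea, here is a sketch using the standard library's html.parser instead (a swapped-in technique, not BeautifulSoup's API). The class name TextAfterBr and the page string are hypothetical, and the real page markup may differ.

```python
from html.parser import HTMLParser


class TextAfterBr(HTMLParser):
    """Collect non-empty text chunks that directly follow a <br> tag."""

    def __init__(self):
        super().__init__()
        self.after_br = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Remember whether the most recent start tag was <br>.
        self.after_br = (tag == "br")

    def handle_data(self, data):
        # Collapse newlines and tabs, and skip whitespace-only text.
        text = " ".join(data.split())
        if self.after_br and text:
            self.chunks.append(text)


# Hypothetical stand-in for the downloaded page.
page = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n"
    "<br><hr></body></html>"
)

parser = TextAfterBr()
parser.feed(page)
parser.close()
print(parser.chunks)
```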
I read this thread about extracting URLs from a string: https://stackoverflow.com/a/840014/326905
Really nice. I got all URLs from an XML document containing http://www.blabla.com with
>>> s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']
But I can't figure out how to customize the regex to omit the double quote at the end of the URL.
First I thought that this was the clue
re.findall(r'(https?://\S+\")', s)
or this
re.findall(r'(https?://\S+\Z")', s)
but it isn't.
Can somebody help me out and tell me how to omit the double quote at the end?
By the way, the question mark after the "s" of https means the "s" may or may not occur. Am I right?
>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:
re.findall(r'(https?://[^\s"]+)', s)
This still says "one or more characters that are not whitespace," but adds the restriction that double quotes are excluded too. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
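Run against the sample string from the question (joined onto one line here), the modified character class behaves as follows:

```python
import re

# Sample input from the question, written on a single line.
s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'

# [^\s"]+ stops at the first whitespace character or double quote.
urls = re.findall(r'(https?://[^\s"]+)', s)
print(urls)
```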
You want the double quotes to appear as a look-ahead:
re.findall(r'(https?://\S+)(?=\")', s)
This way they won't appear as part of the match. Also, yes the ? means the character is optional.
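One caveat worth knowing: because \S+ is greedy and a double quote is itself a non-whitespace character, the look-ahead version only stops at the last quote before the next whitespace. The sketch below compares it with the character-class pattern on a hypothetical tag that has no space after the closing quote.

```python
import re

# Hypothetical input: no whitespace between the URL's closing quote and the next attribute.
tight = '<a href="http://a.com/x"class="y">'

# Greedy \S+ runs to the end of the non-whitespace run, then backtracks to the LAST quote.
with_lookahead = re.findall(r'(https?://\S+)(?=")', tight)

# The character class stops at the FIRST quote.
with_char_class = re.findall(r'(https?://[^\s"]+)', tight)

print(with_lookahead)
print(with_char_class)
```

On well-spaced markup such as the sample in the question, both patterns return the same URLs.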
See example here: http://regexr.com?347nk
I used to extract URLs from text through this piece of code:
url_rgx = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case
text = text.lower()
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missed
urls = ['http://%s'%url[0] if not url[0].startswith('http') else url[0] for url in matches]
print urls
It works great!
Thanks. I just read this https://stackoverflow.com/a/13057368/326905
and checked out this which is also working.
re.findall(r'"(https?://\S+)"', urls)
I have a text file that looks similar to this:
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
I am trying to construct a grammar with pyparsing that would produce the following list structure when asking for the parsed results as a list (i.e., the following should be printed when iterating through the parsed.asList() elements):
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala']]]
['some other header:',[['as before could be anything'],['hey isnt this fun']]]
The header names are all known beforehand, and individual headers may or may not appear. If they do appear, there is always at least one line of content.
The problem is that I am having trouble getting the parser to recognise where 'section header 1:' ends and 'some other header:' begins. I end up with a parsed.asList() looking like:
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala'],['some other header'],['as before could be anything'],['hey isnt this fun']]]
(i.e., section header 1: gets seen correctly, but everything following it gets added to section header 1, including further header lines, etc.)
I've tried various things and played with leaveWhitespace() and LineEnd() in various ways, but I can't figure it out.
The base parser I am hacking about with is (contrived example - in reality this is a class definition etc..).
header_1_line=Literal('section header 1:')
text_line=Group(OneOrMore(Word(printables)))
header_1_block=Group(header_1_line+Group(OneOrMore(text_line)))
header_2_line=Literal('some other header:')
header_2_block=Group(header_2_line+Group(OneOrMore(text_line)))
overall_structure=ZeroOrMore(header_1_block|header_2_block)
and is being called with
parsed=overall_structure.parseFile()
Cheers, Matt.
Matt -
Welcome to pyparsing! You have fallen into one of the most common pitfalls in working with pyparsing, and that is that people are smarter than computers. When you look at your input text, you can easily see which text can be headers and which text can't be. Unfortunately, pyparsing is not so intuitive, so you have to tell it explicitly what can and can't be text.
When you look at your sample text, you are not accepting just any line of text as possible text within a section header. How do you know that 'some other header:' is not valid as text? Because you know that that string matches one of the known header strings. But in your current code, you have told pyparsing that any collection of Word(printables) is valid text, even if that collection is a valid section header.
To fix this, you have to add some explicit lookahead to your parser. Pyparsing offers two constructs, NotAny and FollowedBy. NotAny can be abbreviated using the '~' operator, so we can write this pseudocode expression for text:
text = ~any_section_header + everything_up_to_the_end_of_the_line
Here is a complete parser using negative lookahead to make sure you read each section, breaking on section headings:
from pyparsing import ParserElement, LineEnd, Literal, restOfLine, ZeroOrMore, Group, StringEnd
test = """
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""
ParserElement.defaultWhitespaceChars=(" \t")
NL = LineEnd().suppress()
END = StringEnd()
header_1=Literal('section header 1:')
header_2=Literal('some other header:')
any_header = (header_1 | header_2)
# text isn't just anything! don't accept header line, and stop at the end of the input string
text=Group(~any_header + ~END + restOfLine)
overall_structure = ZeroOrMore(Group(any_header +
Group(ZeroOrMore(text))))
overall_structure.ignore(NL)
from pprint import pprint
print(overall_structure.parseString(test).asList())
In my first attempt, I forgot to also look for the end of string, so my restOfLine expression looped forever. By adding a second lookahead for the string end, my program terminates successfully. Exercise left for you: instead of enumerating all possible headers, define a header line as any line that ends with a ':'.
Good luck with your pyparsing efforts,
-- Paul
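The rule at the heart of this answer, "a text line is any line that is not a known header", can also be seen in isolation without pyparsing. The following is a plain-Python sketch of that rule only, not pyparsing's API; group_sections and HEADERS are hypothetical names chosen for this illustration.

```python
# Known header lines, as in the question (assumed to be known beforehand).
HEADERS = ("section header 1:", "some other header:")


def group_sections(text, headers=HEADERS):
    """Group each non-header line under the most recent header line."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line in headers:
            # A header starts a new section with an empty body.
            sections.append([line, []])
        elif sections:
            # Anything that is not a header is text for the current section.
            sections[-1][1].append([line])
    return sections


test = """
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""

parsed = group_sections(test)
for section in parsed:
    print(section)
```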
Currently I am working with Scrapy, a web crawling framework based on Python. The data is extracted from HTML using XPath. (I am new to Python.) To wrap the data, Scrapy uses items, e.g.
item = MyItem()
item['id'] = obj.select('div[@class="id"]').extract()
When the id is printed like print item['id'] I get following output
[u'12346']
My problem is that this output is not always in the same form. Sometimes I get an output like
"[u""someText""]"
This happens only with text, but there is nothing special about the text compared to other text that is handled correctly, just like the ID.
Does anyone know what the quotation marks mean? As I said, the someText was crawled like all other text data, e.g. from
<a>someText</a>
Any ideas?
Edit:
My spider crawls all pages of a blog. Here is the exact output
[u'41039'];[u'title']
[u'40942'];"[u""title""]"
...
Extracted with
item['title'] = site.select('div[@class="header"]/h2/a/@title').extract()
I noticed that it is always the same blog posts that have these quotation marks, so they don't appear randomly. But there is nothing special about the text. E.g. this title produces quotation marks:
<a title="Xtra Pac Telekom web'n'walk Stick Basic für 9,95" href="someURL">
Xtra Pac Telekom web'n'walk Stick Basic für 9,95</a>
So my first thought was that this is because of some special characters, but there aren't any.
This happens only when the items are written to CSV; when I print them in cmd there are no quotation marks.
Any ideas?
python can use both single ' and double " quotes as quotation marks. when it prints something out it chooses single quotes normally, but will switch to double quotes if the text it is printing contains single quotes (to avoid having to escape the quote in the string):
so normally, it is printing [u'....'] but sometimes you have text that contains a ' character and then it prints [u"...."].
then there is an extra complication writing to csv. if a string is written to csv that contains just a ' then it is written as it is. so [u'....'] is written as [u'....'].
but if it contains double quotes then (1) everything is put inside double quotes and (2) any double quotes are repeated twice. so [u"..."] is written as "[u""...""]". if you read the csv data back with a csv library then this will be detected and removed, so it will not cause any problems.
so it's a combination of the text containing a single quote (making python use double quotes) and the csv quoting rules (which apply to double quotes, but not single quotes).
if this is a problem the csv library has various options to change the behaviour - http://docs.python.org/library/csv.html
the wikipedia page explains the quoting rules in more detail - the behaviour here is shown by the example with "Super, ""luxurious"" truck"
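The whole chain can be reproduced with the standard csv module. This is a sketch under a few assumptions: it runs on Python 3, where list reprs carry no u prefix, the titles are made-up stand-ins, and the ';' delimiter is taken from the output shown in the question.

```python
import csv
import io

# str() of a one-element list, mimicking what ends up in each CSV field.
# The second title contains apostrophes, so Python's repr switches to double quotes.
rows = [
    [str(['41039']), str(['plain title'])],
    [str(['40942']), str(["web'n'walk Stick"])],
]

buf = io.StringIO()
csv.writer(buf, delimiter=';').writerows(rows)
written = buf.getvalue()
print(written)

# Reading the data back with the csv module undoes the quoting.
read_back = list(csv.reader(io.StringIO(written, newline=''), delimiter=';'))
print(read_back == rows)
```

Only the field whose text contains a double quote gets wrapped and doubled; the fields with single quotes are written unchanged.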
I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
<h3 class="price">
$12.99
</h3>
}
I'm trying to make it remove any extra tabs\spaces\new lines so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+? which works except it matches within the html tags, so h3 becomes h3class and I am unable to figure out how to make it ignore anything inside the tags.
Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree, modify the text segments within the tree in place, and generate your output again using that library.
Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
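Applied with re.sub to the template from the question, the pattern gives the desired single line. The template string below is a reconstruction that assumes tab indentation and no trailing whitespace at the ends of lines.

```python
import re

# Reconstruction of the sample template; the tab indentation is an assumption.
template = (
    '[$$price$$]\n'
    '{\n'
    '\t<h3 class="price">\n'
    '\t\t$12.99\n'
    '\t</h3>\n'
    '}'
)

# Remove each newline together with the indentation that follows it.
compact = re.sub(r'\r?\n[ \t]*', '', template)
print(compact)
```

Because the pattern only touches whitespace that follows a newline, the space inside the h3 tag is left alone.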
Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the Text nodes only. Sounds overkill but that's the safest.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then, you know that any text has to be between > and < except at the very beginning and end (if you just get a snapshot of the page, otherwise there is the HTML tag minimum.)
So... You still need two regexes: find the text with '>[^<]+<', then apply the other regex as you mentioned.
The other way, is to have an or with something like this (not tested!):
'(<[^>]*>)|([\r\n\f ]+)'
This will either find a tag or spaces. When you find a tag, do not replace, if you don't find a tag, replace with an empty string.
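A minimal sketch of that "tag or whitespace" alternation, using a replacement function with re.sub. Two assumptions are labeled here and in the code: a tab (\t) has been added to the whitespace class, which the pattern above does not include, and squeeze is a hypothetical helper name.

```python
import re

# Group 1 matches a whole tag; group 2 matches a run of whitespace.
# Assumption: \t is added to the whitespace class from the pattern above.
TAG_OR_SPACE = re.compile(r'(<[^>]*>)|([\r\n\f\t ]+)')


def squeeze(text):
    """Keep tags untouched and delete whitespace found outside of them."""
    # m.group(1) is None when whitespace matched, so the match is replaced by ''.
    return TAG_OR_SPACE.sub(lambda m: m.group(1) or '', text)


# Reconstruction of the sample template; the tab indentation is an assumption.
template = '[$$price$$]\n{\n\t<h3 class="price">\n\t\t$12.99\n\t</h3>\n}'
print(squeeze(template))
```

Note that this also deletes spaces inside ordinary text, so squeeze('<p>a b</p>') gives '<p>ab</p>'; whether that is acceptable depends on the templates.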