I'm creating a simple comment-like application and need to convert plain URLs into links, image links into images, and YouTube/Vimeo/etc. links into flash objects. E.g.:
http://foo.bar to <a href="http://foo.bar">http://foo.bar</a>
http://foo.bar/image.gif to <img src="http://foo.bar/image.gif"/>
etc.
Of course I can write all of that myself, but it's such an obvious piece of code that somebody has surely already written it (maybe even with splitting text into paragraphs). I've been googling for some time but couldn't find anything complete, just a few snippets. Does such a filter (or something like it) exist?
Thanks!
PS. There is urlize, but it covers only the first case.
Write a custom filter to handle all the necessary cases. Look at the source code for urlize to get started. You'll also need the urlize function from django.utils.html.
In your filter, first test for the first case and call urlize on that. Handle the second case and any other cases you may have.
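For instance, a minimal sketch of such a filter (the filter name, the extension list, and the word-by-word splitting are my own assumptions; splitting flattens whitespace, so paragraph handling would need more work):

import re

from django import template
from django.utils.html import urlize
from django.utils.safestring import mark_safe

register = template.Library()

# Hypothetical name and extension list; extend as needed.
IMAGE_RE = re.compile(r'^https?://\S+\.(?:gif|jpe?g|png)$', re.IGNORECASE)

@register.filter
def render_links(value):
    # Walk the text word by word so urlize never sees the <img> markup.
    parts = []
    for word in value.split():
        if IMAGE_RE.match(word):
            parts.append('<img src="%s"/>' % word)
        else:
            parts.append(urlize(word))
    # Note: this assumes the input is already escaped or trusted.
    return mark_safe(' '.join(parts))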
I'm working on a project to do some static analysis of Python code. We're hoping to encode certain conventions that go beyond questions of style or detecting code duplication. I'm not sure this question is specific enough, but I'm going to post it anyway.
A few of the ideas that I have involve being able to build a certain understanding of how the various parts of source code work so we can impose these checks. For example, in part of our application that's exposing a REST API, I'd like to validate something like the fact that if a route is defined as a GET, then arguments to the API are passed as URL arguments rather than in the request body.
I'm able to get something like that to work by pulling all the routes, which are pretty nicely structured, and there are guarantees of consistency given the route has to be created as a route object. But once I know that, say, a given route is a GET, figuring out how the handler function uses arguments requires some degree of interpretation of the function source code.
Naïvely, something like inspect.getsourcelines will allow me to get the source code, but on further examination that's not the best solution because I immediately have to build interpreter-like features, such as figuring out whether a line is a comment, and then do something like use regular expressions to hunt down places where state is moved from the request context to a local variable.
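For illustration, the naive version looks roughly like this (the request.json / request.form attribute names are hypothetical stand-ins for wherever the framework exposes the body):

import inspect
import re

def reads_request_body(handler):
    # Naive approach: fetch the handler's raw source and pattern-match it.
    lines, _ = inspect.getsourcelines(handler)
    for line in lines:
        stripped = line.strip()
        # Already reimplementing interpreter features: "what is a comment?"
        if stripped.startswith('#'):
            continue
        # Hunt for state moving from the request context into a local variable.
        if re.search(r'=\s*request\.(json|form|data)\b', stripped):
            return True
    return False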
Looking at tools like PyLint, they seem mostly focused on high-level "universals" of static analysis, and (at least on superficial inspection) don't have obvious ways of extracting this sort of understanding at a lower level.
Is there a more systematic way to get this representation of the source code, either with something in the standard library or with another tool? Or is the only way to do this writing a mini-interpreter that serves my purposes?
I have the following HTML, and I need to remove the script tags and any script-related attributes in it. By script-related attributes I mean any attribute that starts with on.
<body>
<script src="...">
</script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">
<script type="text/javascript" language="javascript">
//<![CDATA[
function CreateFixedHeaders() {}//]]>
</script>
<script>
var ClientReportfb64a4706a3749c484169e...
</script>
</body>
My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. Seeing that BeautifulSoup is off the table I can see two options for doing this. The first option I see is splitting the strings and parsing based on index. This seems like a bad solution to me.
The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).
Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.
So for removing the attributes I have:
import re

script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)
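Run against the div from the sample above, that strips both handlers (a quick sketch):

import re

div = ('<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()"'
       ' id="oReportDiv" style="overflow:auto;WIDTH:100%">')
print(re.sub(r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"', "", div))
# -> <div id="oReportDiv" style="overflow:auto;WIDTH:100%">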
As I've said before, I personally think the above is a perfectly acceptable use of regular expressions with HTML. But I would still like to get some opinions on the usage.
Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:
<script(.*)</script>
The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.
I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.
So I need help to go against my nature and not be evil. I want to be evil and use RegEx, so somebody please show me the light and guide me to the promised land of non-Regular Expressions.
Thanks
Update
It looks like I wasn't very clear about what my question actually is; I apologize for that. My question is: how can I parse the HTML using pure Python, without regular expressions?
<script(.*)</script>
As for the above code example, it's wrong. I know it is wrong; I was using it as an example of a starting point.
I hope this clears up my question some
Update 2
I just wanted to add a few more notes about what I am doing.
I am crawling a web site to get the data I need.
Once we have the page that contains the data we need it is saved to the database.
Then the saved web page is displayed to the user.
The issue I am trying to solve happens here: the application throws a script error when you attempt to interact with the page, which forces the user to click through a confirmation box. The application is not a web browser but uses the web browser DLL in Windows (I cannot remember the name at the moment).
The error in question only happens in this one page for this one web site.
Update 3
After adding the update I realized I was overthinking the problem; I was looking for a more generic solution, but in this case that isn't what is needed.
The page is dynamically generated, but the script tags stay static. With that in mind, the solution becomes much simpler: I no longer need to treat the page as HTML, just as a static string.
So the solution I'm looking at is:
import re

def strip_script_tags(page_source: str) -> str:
    # Strip inline event-handler attributes (onresize, onscroll, ...).
    pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
    result = re.sub(pattern, "", page_source)
    # Strip <script>...</script> blocks; [\s\S] also matches newlines,
    # and the lazy +? removes each block separately.
    pattern2 = re.compile(r'<script[\s\S]+?/script>')
    result = re.sub(pattern2, "", result)
    return result
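A quick check with a cut-down version of the sample:

page_source = '<body><script>var x;</script><div onscroll="f()">hi</div></body>'
print(strip_script_tags(page_source))
# -> <body><div>hi</div></body>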
I would have liked to avoid regular expressions; however, since I'm limited to the standard library, regular expressions seem like the best solution in this case. Which means #skamazin's answer is correct.
As for removing all the attributes that start with on, you can try this.
It uses the regex:
\s?on\w+="[^"]+"\s?
And substitutes with the empty string (deletion). So in Python it should be:
import re

pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
subst = ""
result = re.sub(pattern, subst, page_source)  # page_source is the HTML string
If you are trying to match anything between the script tags try:
<script[\s\S]+?/script>
The problem with your regex is that the dot (.) doesn't match the newline character. Using a complemented set like [\s\S] will match every character, newlines included. And make sure to use the ? in [\s\S]+? so that the match is lazy instead of greedy.
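A quick demonstration of both points:

import re

html = "<script>a</script>keep<script>b</script>"

# Greedy: one match from the first <script to the last /script>.
print(re.sub(r'<script[\s\S]+/script>', "", html))   # -> ""
# Lazy: each block is matched and removed separately.
print(re.sub(r'<script[\s\S]+?/script>', "", html))  # -> "keep"

# And the original problem: . does not cross newlines without re.DOTALL.
multiline = "<script>\nvar x;\n</script>"
print(re.sub(r'<script.*/script>', "", multiline))   # unchanged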
I am putting together a website where the content is maintained as reStructuredText that is then converted into HTML. I need more control than e.g. rst2html.py gives, so I am using my own Python script that uses things like
docutils.core.publish_parts(source, writer_name='html')
to create the html.
publish_parts() gives me useful parts like the title, body, etc. However, it seems I must look elsewhere to get the values of rst fields like
:Authors:
:version:
etc. For this, I have been using publish_doctree() as in
doctree = core.publish_doctree(source).asdom()
and then going through this recursively using getElementsByTagName() as in
doctree.getElementsByTagName('authors')
doctree.getElementsByTagName('version')
etc.
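Put together, the extraction currently looks roughly like this (the helper names are mine):

from docutils import core

def node_text(node):
    # Recursively collect the text beneath a minidom node.
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)

def get_field(source, name):
    doctree = core.publish_doctree(source).asdom()
    nodes = doctree.getElementsByTagName(name)
    return node_text(nodes[0]) if nodes else None

# e.g. get_field(source, 'version'), get_field(source, 'authors')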
Using publish_doctree() to extract fields does the job, and that's good, but it does seem more convoluted than using e.g. publish_parts().
My question is simply whether this is the recommended way of extracting these rst fields, or whether there is a more direct, less convoluted way. If not, that is fine, but I thought I would ask in case I am missing something.
I am using http://code.google.com/p/feedparser/ to write a simple news integrator.
But I want plain text (with <p> tags) and no URLs or images (i.e. no <a> or <img> tags).
Here are two methods to do that:
1. Edit the source code at http://code.google.com/p/feedparser/source/browse/branches/f8dy/feedparser/feedparser.py:
class _HTMLSanitizer(_BaseHTMLProcessor):
    acceptable_elements = [....]
Simply remove the a and img tags from that list.
2. Patch it at runtime:
import feedparser

# list.remove() mutates the list in place and returns None,
# so the result must not be assigned back.
feedparser._HTMLSanitizer.acceptable_elements.remove('a')
feedparser._HTMLSanitizer.acceptable_elements.remove('img')
That is, before using feedparser, first remove the two tags.
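A minimal sketch of the second method in use (the feed URL is hypothetical):

import feedparser

feedparser._HTMLSanitizer.acceptable_elements.remove('a')
feedparser._HTMLSanitizer.acceptable_elements.remove('img')

feed = feedparser.parse('http://example.com/feed.xml')
for entry in feed.entries:
    print(entry.summary)  # <p> survives; <a> and <img> are stripped on parse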
Which method is better?
Are there any other good methods?
Thanks a lot!
Usually, quicker is better, and that can be measured with Python's timeit module. But in your case, I'd prefer not to alter the source code and would stick with the second option; it helps maintainability.
Other options include writing a custom parser (use a C extension for maximum speed) or just letting your site's templating engine (Django, maybe?) strip those tags. Actually, I've changed my mind: the last solution seems the best all-around.
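If you do want to measure, a minimal timeit sketch (the file name is hypothetical):

import timeit

elapsed = timeit.timeit(
    "feedparser.parse(raw)",
    setup="import feedparser; raw = open('feed.xml').read()",
    number=100,
)
print(elapsed)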
I've just begun learning Python and I've run into a small problem.
I need to parse a text file, more specifically an HTML file (but its syntax is so weird: divs after divs after divs, the result of Google's 'View as HTML' for a certain PDF; I can't seem to extract the text because it has a messy table done in MS Word).
Anyway, I chose a rather low-level approach because I just need the data asap, and since I'm beginning to learn Python, I figured learning the basics would do me some good too.
I've got everything done except for a small part in which I need to retrieve a set of integers from a set of divs. Here's an example:
<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>
Now, the numbers I want to retrieve are all the ones inside <nobr></nobr> (in this case, '588') and, since it's quite a messy file, I have to make sure that what I am getting is correct. To do so, the number inside <nobr></nobr> must be preceded by "left:1020", "left:1024" or "left:1028". This is because of the automatic conversion, and the best choice would be to get all the numbers preceded by left:102[0-9], in my opinion.
To do so, I was trying to use:
for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index])
out = o.group(1)
But so far, no such luck... How can I get those numbers?
Thanks in advance,
J.
Don't use regular expressions to parse HTML. BeautifulSoup will make light work of this.
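For instance, a sketch assuming BeautifulSoup 4 is available:

import re

from bs4 import BeautifulSoup

html = '<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>'
soup = BeautifulSoup(html, 'html.parser')

numbers = []
# Passing a compiled regex filters the style attribute with re.search.
for div in soup.find_all('div', style=re.compile(r'left:102[0-9]$')):
    nobr = div.find('nobr')
    if nobr:
        numbers.append(nobr.get_text())
print(numbers)  # -> ['588']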
As for your specific problem, it might be that you are missing a colon at the end of the first line:
for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index]):
out = o.group(1)
If this isn't the problem, please post the error you are getting and what you expect the output to be.