I'm attempting to customize the parser/lexer for the httpdomain Sphinx doc extension. I installed it in a local directory and added that directory to sys.path, as mentioned in the docs.
When I then build the Sphinx docs, http blocks get highlighted properly.
Now I need to make a small change to the extension. I started by inserting a deliberate syntax error into the httpdomain.py file inside the extension's directory, which correctly caused the build to fail with an invalid-syntax error, so the file is definitely being read.
Next, I changed the lexer by replacing the HTTP token with HTTTP (one extra T). The idea is to see whether entries in the doc containing HTTTP now get highlighted instead of HTTP.
The issue is that nothing I do seems to change the output: HTTP continues to get colored, and HTTTP is ignored.
Here is the section of the lexer that contains my change:
tokens = {
    'root': [
        (r'(GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS|TRACE)( +)([^ ]+)( +)'
         r'(HTTTPS?)(/)(1\.[013])(\r?\n|$)',
         bygroups(Name.Function, Text, Name.Namespace, Text,
                  Keyword.Reserved, Operator, Number, Text),
         'headers'),
        (r'(HTTTPS?)(/)(1\.[013])( +)(\d{3})( +)([^\r\n]+)(\r?\n|$)',
         bygroups(Keyword.Reserved, Operator, Number, Text, Number,
                  Text, Name.Exception, Text),
         'headers'),
        (r'([^\s:]+)( *)(:)( *)([^\r\n]+)(\r?\n|$)', header_callback),
        (r'([\t ]+)([^\r\n]+)(\r?\n|$)', continuous_header_callback),
        (r'\r?\n', Text, 'content')
    ],
    'headers': [
        (r'([^\s:]+)( *)(:)( *)([^\r\n]+)(\r?\n|$)', header_callback),
        (r'([\t ]+)([^\r\n]+)(\r?\n|$)', continuous_header_callback),
        (r'\r?\n', Text, 'content')
    ],
    'content': [
        (r'.+', content_callback)
    ]
}
Note that "HTTP" is changed to "HTTTP", so I'd expect entries in the doc containing HTTTP now to be colored, but nothing changed.
I made changes to the doc text and saw they were updated in the page, so no cache issues there.
I also deleted a folder Python created called __pycache__, no changes to the result. I also trying commenting out all tokens in the lexer, no changes. If I insert invalid syntax, then it fails. If the syntax is correct, it seems like it uses the original code without my changes.
Is there any other cache I should be clearing up?
I'm totally new to Python so I'm a bit lost here.
PS: this HTTTP thing is just a test. I'll make the real changes once I get this to work.
It turns out the code changes were being seen, but my lexer was never used, because Pygments already registers a built-in lexer under the name http. So I replaced it with mine, app.add_lexer('http', HTTPLexer()), and I started seeing my changes affect the generated docs.
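For reference, a minimal sketch of that registration for conf.py (or the extension's setup hook), assuming the patched module is importable as httpdomain from the directory added to sys.path; adjust the import to match your layout:

from httpdomain import HTTPLexer  # import path is an assumption

def setup(app):
    # Registering under the alias 'http' shadows Pygments' built-in lexer
    # of the same name. Note that newer Sphinx versions (4.0+) expect the
    # class rather than an instance, i.e. app.add_lexer('http', HTTPLexer).
    app.add_lexer('http', HTTPLexer())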
Try removing the *.pyc files and rebuilding your docs.
I have the following code; it changes the string/filepath by replacing the frame number at the end of the filename, together with the file extension, with "#.exr".
I'm doing it this way because the name can be typed in all kinds of ways, for example:
r_frame.003.exr (but also)
r_12_frame.03.exr
etc.
import fileseq
import re

# build the render-sequence pattern
selected_file = 'H:/test/r_frame1.exr'
without_extension = selected_file.replace(".exr", "")
my_regex_pattern = r"\d+\b"  # digits at the end of a "word", i.e. the frame number
sequence_name_with_replaced_number = re.sub(my_regex_pattern, "#.exr", without_extension)
mijn_sequences = fileseq.findSequencesOnDisk(sequence_name_with_replaced_number)
If I print the "sequence_name_with_replaced_number" value, the console shows:
'H:/test/r_frame#.exr'
When I use that variable inside the function like this:
mijn_sequences = fileseq.findSequencesOnDisk(sequence_name_with_replaced_number)
then it does not work.
But when I manually replace that last line with:
mijn_sequences = fileseq.findSequencesOnDisk('H:/test/r_frame#.exr')
then it works fine. (It seems to be the same value/string.)
But this is not a viable option; the whole point of the code is to have the computer do this for thousands of frames.
Does anybody have any idea what might be the cause?
After this I will do a simple for loop going through all the files in that sequence (see the sketch below). The reason for this workflow is to delete the numbers before the .exr extension and replace them with # signs, while ignoring any numbers that are not at the end of the filename, hence the regex above. Again, the "sequence_name_with_replaced_number" variable looks fine in the console: it prints 'H:/test/r_frame#.exr', which is exactly what I need it to be.
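For reference, a minimal sketch of that loop, under the assumption that fileseq's FileSequence objects iterate over per-frame file paths (check your fileseq version's docs):

# hypothetical follow-up: visit every frame of every sequence found on disk
for seq in mijn_sequences:
    for frame_path in seq:  # e.g. 'H:/test/r_frame1.exr', 'H:/test/r_frame2.exr', ...
        print(frame_path)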
I fixed it. The problem as stated was real: every time I cut and pasted the variable's value from the console and treated it as manual input, it worked.
Then I did a len() of both values, and there was a difference of 2! What happened? The console wraps a printed string in quote marks ('') automatically.
But my generated variable had those quote marks baked in as extra characters. I fixed it by adding cleaned_sequence = sequence_name_with_replaced_number[1:-1]. So the 'H:/test/r_frame#.exr' the console showed me was not the same as the 'H:/test/r_frame#.exr' I typed manually: in my variable the quotes were literal characters, whereas for the manual string the console merely displays them.
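A minimal sketch of that debugging step, with the quoted value reproduced as an assumption; str.strip is a slightly more robust alternative to slicing off the first and last characters:

bad_value = "'H:/test/r_frame#.exr'"    # quote marks baked into the string itself
good_value = 'H:/test/r_frame#.exr'     # what I typed manually
print(len(bad_value), len(good_value))  # 22 vs 20: the difference of 2
cleaned_sequence = bad_value.strip("'\"")  # drop stray quotes from both ends
print(cleaned_sequence == good_value)   # True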
I'm currently trying to create a small Python program using SolrClient to index some files.
My need is to index some file content and then add some attributes to enrich the document.
I used the post command-line tool to index the files. Then I use a Python program to try to enrich the documents, something like this:
from SolrClient import SolrClient
import json

solr = SolrClient('http://localhost:8983/solr')  # base URL is an assumption
doc = solr.get('collection', id)
doc['new_attribute'] = 'value'
solr.index_json('collection', json.dumps([doc]))
solr.commit('collection', openSearcher=True)
The problem is that I have the feeling we lost the indexed file content. If I run a query with a word present in one of the doc's attributes, I find the doc.
If I run a query with a word that appears only in the file, it does not work (it does work when I index only the file with post, without my update attempt).
I'm not sure I understand how to update the doc while keeping the index created by the post command.
I hope I'm clear enough; maybe I misunderstood the way it works...
Thanks a lot
If I understand correctly, you want to modify an existing record. You should be able to do something like this without using a solr.get:
doc = [{'id': 'value', 'new_attribute': {'set': 'value'}}]
solr.index_json('collection', json.dumps(doc))
See also:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
Note that atomic updates like this rebuild the document from its stored fields, so any field that is indexed but not stored (such as extracted file content) is lost when the document is updated; that would explain the behaviour you're seeing.
It worked for me this way; it may be useful for someone:
from SolrClient import SolrClient
import json

solrConect = SolrClient("http://xx.xx.xxx.xxx:8983/solr/")
doc = [{'id': 'my_id', 'count_related_like': {'set': 10}}]
solrConect.index_json("my_collection", json.dumps(doc))
solrConect.commit("my_collection", softCommit=True)
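For what it's worth, softCommit=True makes the update visible to searches quickly without flushing index segments to disk, while the openSearcher=True hard commit used earlier also makes changes visible but pays the cost of a full flush.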
Trying with curl did not change anything, so I did it differently, and now it works. Instead of adding the file with the post command and trying to modify the document afterwards, I read the file into a string and index it in a "content" field. That means every document is added in one shot.
The content field is defined as not stored, so I just index it.
It works fine and suits my needs. It's also simpler, since it removes many attributes set by the post command that I don't need.
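A minimal sketch of that one-shot approach, assuming SolrClient as above; the file path, field names, and collection name are placeholders:

from SolrClient import SolrClient
import json

solr = SolrClient('http://localhost:8983/solr')  # base URL is an assumption
with open('myfile.txt', encoding='utf-8') as f:
    doc = {'id': 'my_id',
           'content': f.read(),        # the whole file body, indexed in one shot
           'new_attribute': 'value'}   # enrichment added at the same time
solr.index_json('collection', json.dumps([doc]))
solr.commit('collection', openSearcher=True)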
If I find some time, I'll try the partial update again and update this post.
Thanks
Rémi
I'm using the ABPY library (library here) for Python, but I think it's written for an older version; I'm using Python 3.3.
I fixed some print errors, but that's about as much as I know; I'm really new to programming.
I want to fetch a webpage, filter the advertising out of it, and then print it again.
EDIT: after Sg'te'gmuj told me how to convert from Python 2.x to 3.x, this is my new code:
#!/usr/local/bin/python3.1
import cgitb; cgitb.enable()
import urllib.request

response = urllib.request.build_opener()
response.addheaders = [('User-agent', 'Mozilla/5.0')]
response = urllib.request.urlopen("http://www.youtube.com")
html = response.read()

from abpy import Filter
with open("easylist.txt") as f:
    ABPFilter = Filter(file('easylist.txt'))
    ABPFilter.match(html)

print("Content-type: text/html")
print()
print(html)
Now it is displaying a blank page
Just took a peek at the library: it seems that the file "easylist.txt" does not exist; you need to create it and populate it with the appropriate filters (in whatever format ABP specifies).
Additionally, it appears the Filter constructor takes a file object; try something like this instead:
with open("easylist.txt") as f:
ABPFilter = Filter(f)
I can't say this is wholly accurate, though, since I have no experience with the library; but looking at its code, I'd suspect either of the two is the problem, if not both.
Addendum #1
Looking at the code more in depth, I have to agree that even if the fix I supplied works, you're going to have more problems: the library is written in Python 2.x, as you suggested, while you're using 3.x. I'd suggest using Python's 2to3 tool to convert typical Python 2 code to Python 3 code (it's not foolproof, though). The command line would be:
2to3 -w abpy.py
That will convert it from Python 2.x to 3.x code, and re-write the source file.
Addendum #2
The code should pass the file object, i.e. the "f" variable, as shown above (I've modified my snippet to reflect that; I wasn't paying attention and had left the old file() call in the argument).
You also need to pass a URI to the match function:
ABPFilter.match(URI)
You'll need to modify the code to collect those items into an array (I'm assuming, at least); I'm playing with it now to see. At present I'm getting a rule error (not a Python error, but merely error handling used by abpy.py, which is good, because it suggests this is the right train of thought).
The code for the Filter.match function is as follows (after running the 2to3 script):
def match(self, url, elementtype=None):
    tokens = RE_TOK.split(url)
    print(tokens)
    for tok in tokens:
        if len(tok) > 2:
            if tok in self.index:
                for rule in self.index[tok]:
                    if rule.match(url, elementtype=elementtype):
                        print(str(rule))
What this means is that you're, at present, at a point where you need to program the functionality yourself; it appears this module only reports the matching rule. However, that is still useful.
It also means you're going to have to modify this function to take the HTML in place of the "url" parameter, regex the HTML (this may be rather intensive) for a list of URIs, and then run each one through the match loop; see the sketch below. Where you go from there to actually filter out the nodes, I'm not sure; but there is a list of filter types, so I'm assuming there is a typical procedure ABP follows to remove the nodes (possibly, in some cases, merely by removing the given URI from the HTML?).
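A minimal sketch of that idea, reusing Filter(f) and match() exactly as shown above; the regex for pulling URIs out of the HTML is a rough assumption, not ABP's actual parsing strategy:

import re
from abpy import Filter

with open("easylist.txt") as f:
    adblock_filter = Filter(f)

# html is the bytes returned by response.read() in the question's code
html_text = html.decode("utf-8", errors="replace")
for uri in re.findall(r'(?:href|src)=["\']([^"\']+)["\']', html_text):
    adblock_filter.match(uri)  # prints any rule that matches this URI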
References
http://docs.python.org/3.3/library/2to3.html
I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits: 23303.
from Bio import Entrez
Entrez.email = "A.N.Other#example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what URL is being generated by Biopython, as it'll help me work out how I have to structure the search term when I want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting around combined search terms), but currently the generated URL is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file and add a print statement inside the _open function.
Update: recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)
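Putting both points together, a minimal sketch (assuming a Biopython version recent enough to expose handle.url):

from Bio import Entrez

Entrez.email = "A.N.Other@example.com"  # always tell NCBI who you are
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]',
                        datetype='pdat', mindate='2012', maxdate='2012')
print(handle.url)  # inspect the exact query string Biopython generated
record = Entrez.read(handle)
print(record["Count"])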
I am trying to get Vim to syntax highlight any file that ends with extension .Rtex in the following way:
All top level text is highlighted as TeX
Exception: any text enclosed in \begin{python}...\end{python} is highlighted as Python
I am able to achieve each of these criteria individually, but unable to achieve both simultaneously. I think that somehow the TeX highlighting overrides my Python-highlighted regions, or prevents them from taking effect, and I am stuck trying to figure out how.
First step: edit .vimrc to give files with extension .Rtex the filetype rtex:
au BufRead *.Rtex setf rtex
Second step: create ~/.vim/syntax/rtex.vim. It is the contents of this file that will determine how to highlight .Rtex files.
Third step: enable general top-level TeX highlighting, by making rtex.vim look like this:
runtime! syntax/tex.vim
If I now open a .Rtex file, the entire file is highlighted as TeX, including any text within \begin{python}...\end{python}, as expected.
Fourth step: follow the instructions in Vim's :help syn-include to include python highlighting and apply it to all regions delimited by \begin{python} and \end{python}. My rtex.vim file now looks like this:
runtime! syntax/tex.vim
unlet! b:current_syntax
syntax include @Python syntax/python.vim
syntax region pythonCode start="\\begin{python}" end="\\end{python}" contains=@Python
The unlet! b:current_syntax command is meant to force the python.vim syntax file to execute even though an existing syntax (TeX) is already active.
Problem: If I now open a .Rtex file, the entire file is still highlighted only as TeX. The \begin{python}...\end{python} region seems to have no effect.
Experiment: If I remove or comment out the runtime! command, I do get python highlighting, within the \begin{python}...\end{python} regions, exactly as desired, but of course no TeX highlighting in the remainder of the document. I therefore conclude that the TeX highlighting is somehow responsible for preventing the python regions from taking effect.
Can a Master of Vim offer me any suggestions? I am currently stumped. I have looked at several pages and stackoverflow questions that seem relevant, but none of them have so far led to a solution:
http://vim.wikia.com/wiki/Different_syntax_highlighting_within_regions_of_a_file
Embedded syntax highligting in Vim
VIM syntax highlighting of html nested in yaml
After some more study of the manual, and much more trial and error, I have finally answered my own question (a simultaneously embarrassing and sublime accomplishment), which I now preserve here for posterity.
Basically, I think the problem was that the Python highlighting wouldn't take effect because the pythonCode region was itself contained in a region or highlight group defined by tex.vim, so it wasn't top-level. The solution is to also include (rather than just runtime) tex.vim, giving it a cluster name like @TeX, and then add containedin=@TeX to my Python region definition. So syntax/rtex.vim now looks like this:
" reset b:current_syntax so tex.vim will actually load
let b:current_syntax = ''
unlet b:current_syntax
runtime! syntax/tex.vim
" reset again before each :syntax include, for the same reason
let b:current_syntax = ''
unlet b:current_syntax
syntax include @TeX syntax/tex.vim
let b:current_syntax = ''
unlet b:current_syntax
syntax include @Python syntax/python.vim
syntax region pythonCode matchgroup=Snip start="\\begin{python}" end="\\end{python}" containedin=@TeX contains=@Python
hi link Snip SpecialComment
let b:current_syntax = 'rtex'
And this works! I'm not sure whether all of those unlet b:current_syntax commands are strictly necessary. I also gave the Python region delimiters a matchgroup (Snip) so that they can be highlighted themselves (with the SpecialComment color), rather than left plain, which is apparently what happens by default, since they are no longer considered part of the TeX.
Now it is a trivial thing to add more highlight regions for different languages (e.g., \begin{Scode}...\end{Scode}), which is great if you're getting into literate programming -- the original motivation for my question.
Edit: [screenshot: Python and R code embedded in a TeX document, each block highlighted in its own language]
I don't know if it helps, but a hack I use with my Rnw files that use both tex and rnoweb features is as follows:
au BufEnter *.Rnw set filetype=tex | set filetype=rnoweb
Would an adapted version work in your case?