What's the search engine used in the new Python documentation?

Is it built-in in Sphinx?

It looks like Sphinx contains its own search engine for the English language. See http://sphinx.pocoo.org/_static/searchtools.js and searchindex.js/.json (the Sphinx docs index is about 36 KB, the Python docs index 857 KB, and the Grok docs index 37 KB).
The index is precomputed when the docs are generated.
When you search, a static page is loaded, then _static/searchtools.js extracts the search terms from the query string, normalizes them (case, stemming, etc.), and looks them up in searchindex.js once it has loaded.
The first search attempt takes a rather long time; subsequent searches are much faster because the index is cached in your browser.
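As a rough, purely illustrative sketch of that flow (this is not Sphinx's actual code, which lives in searchtools.js; the index layout and the stemmer below are made-up stand-ins), the client-side lookup amounts to something like:

    # Hypothetical sketch of the Sphinx client-side search flow.
    # The real index is generated at build time and shipped as searchindex.js.
    SEARCH_INDEX = {
        "search": ["library/re", "howto/regex"],   # terms are stored already stemmed
        "engin": ["howto/regex"],
    }

    def stem(word):
        # Stand-in for the Porter stemmer that Sphinx bundles.
        return word.lower().rstrip("es")

    def search(query):
        terms = [stem(w) for w in query.split()]
        hits = [set(SEARCH_INDEX.get(t, [])) for t in terms]
        return set.intersection(*hits) if hits else set()

    print(search("Search Engines"))   # -> {'howto/regex'}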

The Sphinx search engine is built in JavaScript. It uses jQuery and a (sometimes very big) JavaScript file containing the search terms.

Yes. Sphinx itself is not built in, however; the search widget is part of Sphinx. What context did you mean by "built-in"?
It is mentioned on the page itself: http://docs.python.org/about.html
http://sphinx.pocoo.org/

Related

Advanced filtering in the Stack Exchange Python API

Does the Stack Exchange Python API provide advanced filtering support?
For example:
Return all the questions under the tags python and javascript with more than 50 upvotes.
Return all the questions that have some substring matched in "title" or in "content".
Include/Exclude filters on different properties.
A reference to the official documentation would be really appreciated.
See the official API docs: the API does not support complex queries directly very well, but the /search/advanced route does expose much of the power of the web site's search feature (a small requests-based sketch follows the worked examples below).
So:
"Return all the questions under tag python and javascript with more than 50 upvotes."
Use the /search/advanced route.
Pass python;javascript in the tagged parameter.
Pass score:50 in the q parameter.
Live example.
In that library, the equivalent call should be something like:
.fetch('search/advanced', tagged='python;javascript', q='score:50')
For that particular query, this would probably also work:
.fetch('questions', tagged='python;javascript', min='50', sort='votes')
"Return all the questions that have some substring matched in "title" or in "content"."
Put the word in the q parameter. For example:
/search/advanced?q=flask score:50&tagged=javascript
Compare this to the use of the title parameter, which uses AND logic:
/search/advanced?q=score:50&title=flask&tagged=javascript
"Include/Exclude filters on different properties."
That is rather vague. If you mean that you want to exclude questions that have a term, then...
/search/advanced provides the nottagged parameter.
The q parameter will take some - terms, just like the site search. For example:
/search/advanced?q=-flask score:50&tagged=python;javascript
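Independently of that library, here is a minimal sketch of the first query against the plain HTTP API; the requests package, the "stackoverflow" value for the required site parameter, and the sort/order values are my additions for illustration, not from the question:

    import requests

    # Query the Stack Exchange API 2.2 /search/advanced route directly.
    resp = requests.get(
        "https://api.stackexchange.com/2.2/search/advanced",
        params={
            "tagged": "python;javascript",   # AND between the tags
            "q": "score:50",                 # minimum score, as in the site search
            "sort": "votes",
            "order": "desc",
            "site": "stackoverflow",         # the API requires a site
        },
    )
    for item in resp.json()["items"]:
        print(item["score"], item["title"])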
Notes:
The q parameter accepts many of the question-related operators of the site's web search.
The OP states he is using this library, which has broad support for
the Stack Exchange API (version 2.2).
See the customary use of the term "filtering".

scikit-learn intersphinx link / inventory object?

Has anyone successfully linked to scikit-learn with intersphinx? It's a Sphinx project, and it looks like it's hosted through GitHub Pages:
https://github.com/scikit-learn/scikit-learn.github.io
But so far I haven't been able to generate complete links in my Sphinx project that land on scikit-learn pages. I'm currently using
'sklearn': ('http://scikit-learn.org/stable' None)
in my intersphinx mapping. Any help would be great, thanks.
It seems, from this issue linking this PR, that you should use:
'sklearn': ('http://scikit-learn.org/stable',
            (None, './_intersphinx/sklearn-objects.inv'))
Note: not tested, but I'm interested in the result; please let me know if it works.
EDIT:
It seems that sklearn-objects.inv is available from the scikit-image repo for local intersphinx settings.
That's probably not the best solution, but maybe it can help as a start.
I assume you already tried to link directly to the documentation page of scikit-learn, or maybe to the API page of the project (but I ask anyway, just in case...).
I am not sure what the appropriate page would be, judging from what is indicated in the Sphinx docs:
A dictionary mapping unique identifiers to a tuple (target, inventory). Each target is the base URI of a foreign Sphinx documentation set and can be a local path or an HTTP URI. The inventory indicates where the inventory file can be found: it can be None (at the same location as the base URI) or another local or HTTP URI.
Otherwise there is also sphobjinv, which could help to build a custom intersphinx objects.inv, but I have not had time to test it yet.
You have a missing comma in your intersphinx mapping:
'sklearn': ('http://scikit-learn.org/stable' None)
should be:
'sklearn': ('http://scikit-learn.org/stable', None),
I use trailing commas for my dict entries, but they are not required.
With that correction, I was able to use the entry that @mzjn provided in their comment to generate a link to scikit-learn's docs.
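For reference, a minimal conf.py sketch with the corrected mapping; it assumes only that the intersphinx extension is enabled, everything else is taken from the answer above:

    # conf.py -- minimal sketch of a working intersphinx setup.
    extensions = ["sphinx.ext.intersphinx"]

    intersphinx_mapping = {
        # None means "fetch objects.inv from the same location as the base URI".
        "sklearn": ("http://scikit-learn.org/stable", None),
    }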

Making a search engine in Python Django

I've made a search feature using the Python 2.7 toned package, but to make it more scalable I want to use Elasticsearch.
I want to do boolean searches like
(blue or small) purse and not leather
Do I need Haystack, or is just using an Elasticsearch client enough?
How can I do a complex, unpredictable boolean search like the example above (where the boolean structure of the words is unknown)?
All I find in the docs is SearchQuery, which requires me to know the search combination prior to run time.
After investigating, I figured out:
I do not need Haystack at all.
Boolean search can be done via the "simple query string" query in Elasticsearch; however, it uses +, -, and | instead of AND, NOT, and OR, so it's just a matter of word replacement (see the sketch below).
You can override the admin page's search to use Elasticsearch, then apply the filter query over that. However, Elasticsearch can return no more than 10,000 results per page. You can read multiple pages, but I ended up retrieving only the first 10,000 ids (when there are more than 10,000 results) and passing them to the admin to do a query: mymodel.objects.filter(id__in=my_ids).
I'm not very happy about doing this, so if someone knows a better way, let me know.
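As a sketch of the simple-query-string approach with the official elasticsearch Python client; the "products" index and the "title"/"description" field names are made up for illustration:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # "(blue or small) purse and not leather" rewritten with simple_query_string
    # operators: | means OR, + means AND, - means NOT.
    body = {
        "query": {
            "simple_query_string": {
                "query": "(blue | small) + purse + -leather",
                "fields": ["title", "description"],
            }
        }
    }
    results = es.search(index="products", body=body)
    for hit in results["hits"]["hits"]:
        print(hit["_source"])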

A good Django search app? — How to perform fuzzy search with Haystack?

I'm using django-haystack at the moment
with apache-solr as the backend.
The problem is I cannot get the app to perform the search functionality I'm looking for:
Searching for sub-parts of a word,
e.g. searching for "buntu" does not give me "ubuntu".
Searching for similar words,
e.g. searching for "ubantu" would give "ubuntu".
Any help would be very much appreciated.
This is really about how you pass the query back to Haystack (and therefore to Solr). You can do a 'fuzzy' search in Solr/Lucene by using a ~ after the word:
ubuntu~
would return both buntu and ubantu. See the Lucene documentation on this.
How you pass this through via Haystack depends on how you're using it at the moment. Assuming you're using the default SearchForm, the best thing would be to either override the form's clean_q method to add the tilde to the end of every word in the search query, or override the search method to do the same thing there before passing it to the SearchQuerySet.
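A minimal sketch of the first approach (overriding clean_q on django-haystack's default SearchForm; the form class name here is mine):

    from haystack.forms import SearchForm

    class FuzzySearchForm(SearchForm):
        def clean_q(self):
            # Append Lucene's fuzzy operator "~" to every word of the query,
            # so "ubantu" becomes "ubantu~" and also matches "ubuntu".
            q = self.cleaned_data["q"]
            return " ".join(word + "~" for word in q.split())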

Efficient storage of and access to web pages with Python

So like many people, I want a way to download, index/extract information from, and store web pages efficiently. My first thought is to use MySQL and simply shove the pages in, which would let me use FULLTEXT searches and do ad hoc queries easily (in case I want to see if something exists and extract it, etc.). But of course, performance-wise I have some concerns, especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB, search engines, etc. To summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store meta data (URL, time retrieved, any GET/POST stuff I sent), response code, etc. of the page I requested.
Store a copy of the original web page as sent by the server (might be content, might be 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example, a new type of information I want to extract, or to see how many of the pages have the string "fizzbuzz" or whatever in them).
And of course it must be open source/Linux compatible, I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss metadata into MySQL, store the data on the file system or in something like CouchDB, write some custom search stuff.
Toss metadata into MySQL, store the data on a file system behind a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html etc. specified (in other words, expose a directory index for each directory), and use some search engine like Lucene or Sphinx to index the content and use that to search. The biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
When answering please include links to any technologies you mention and if possible what programming languages it has libraries for (i.e. if it's Scala only or whatever it's probably not that useful since this is a Python project). If this question has already been asked (I'm sure it must have been) please let me know (I searched, no luck).
Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine that MySQL wouldn't either), and just use Sphinx's superior indexing and full-text search (including stemming &c). Your metadata all comes from headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- HTML will come to you in all sorts of weird encodings, but there's no need to go crazy supporting them at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in), before storing and indexing it, and live happily ever after.
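A minimal sketch of that transcode-on-arrival idea; requests and chardet are my choices here for illustration, not part of the question:

    import requests
    import chardet

    def fetch_as_utf8(url):
        resp = requests.get(url)
        # Guess the page's real encoding from the raw bytes (declared charsets often lie).
        guess = chardet.detect(resp.content)
        text = resp.content.decode(guess["encoding"] or "utf-8", errors="replace")
        return text.encode("utf-8")   # UTF-8 bytes, ready to store and index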
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
You may be trying to achieve too much with the storage of the HTML (and supporting files). It seems you wish this repository would both
allow displaying a particular page as it was on its original site
provide indexing for locating pages relevant to particular search criteria
The HTML underlying a web page once looked a bit like a self-standing document, but the pages crawled off the net nowadays are much messier: JavaScript, AJAX snippets, advertisement sections, image blocks, etc.
This reality may cause you to rethink the one-storage-for-all-HTML approach. (And also the parsing/pre-processing of the crawled material, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. (By "true text content", I mean [possibly partially marked-up] text from the web pages that is otherwise free of all other "Web 2.0 noise".) Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres in the form of semi-structured data. For operational purposes (e.g. to task the crawlers, etc.), you may keep a relational store with management-related metadata, but the idea is that, for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
It sounds to me like you need a content management system. Check out Plone. If that's not what you want maybe a web framework, like Grok, BFG, Django, Turbogears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)
