I have read some tutorials about highlighting search terms in Lucene, and came up with a piece of code like this:
(...)
query = parser.parse(query_string)
for scoreDoc in searcher.search(query, 50).scoreDocs:
    doc = searcher.doc(scoreDoc.doc)
    filename = doc.get("filename")
    print filename
    found_paragraph = fetch_from_my_text_library(filename)
    stream = lucene.TokenSources.getTokenStream("contents", found_paragraph, analyzer)
    scorer = lucene.Scorer(query, "contents", lucene.CachingTokenFilter(stream))
    highlighter = lucene.Highlighter(scorer)
    fragment = highlighter.getBestFragment(analyzer, "contents", found_paragraph)
    print '>>>' + fragment
But it all ends with an error:
Traceback (most recent call last):
File "./search.py", line 76, in <module>
scorer = lucene.Scorer(query, "contents", lucene.CachingTokenFilter(stream))
NotImplementedError: ('instantiating java class', <type 'Scorer'>)
So I guess this part of Lucene isn't implemented yet in PyLucene. Is there any other way to do it?
I got a similar error too. I think this class's wrapper is not yet implemented in PyLucene v3.6.
You might want to try the following:
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
# Construct a query parser.
queryParser = QueryParser(Version.LUCENE_CURRENT, FIELD_CONTENTS, analyzer)
# Create a query.
query = queryParser.parse(QUERY_STRING)
topDocs = searcher.search(query, 50)
# Get the top hits.
scoreDocs = topDocs.scoreDocs
print "%s total matching documents." % len(scoreDocs)

highlightFormatter = SimpleHTMLFormatter()
highlighter = Highlighter(highlightFormatter, QueryScorer(query))

for scoreDoc in scoreDocs:
    doc = searcher.doc(scoreDoc.doc)
    text = doc.get(FIELD_CONTENTS)
    ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
    print doc.get(FIELD_PATH)
    print highlighter.getBestFragments(ts, text, 3, "...")
    print ""
Please note that we create a token stream for each item in the search result.
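For completeness, here is a hedged sketch of the imports the snippet above assumes; in PyLucene 3.x these classes are exposed in the flat lucene namespace, but verify against your installed version:

# Hedged import sketch for PyLucene 3.x, where Java classes live in the
# flat 'lucene' module; adjust to your version.
import lucene
from lucene import (StandardAnalyzer, QueryParser, Version, StringReader,
                    SimpleHTMLFormatter, Highlighter, QueryScorer)

lucene.initVM()  # the JVM must be running before any Lucene call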
Apologies if this isn't totally clear - I'm a Python copy-the-code-and-try-to-make-it-work developer.
I'm using the Google NLP API in Python 2.7.
When I use analyze_entities(), I can get and print the name, entity type and salience.
Mentions is supposed to contain the noun type: PROPER or COMMON, per this page:
https://cloud.google.com/natural-language/docs/reference/rest/v1beta1/Entity#EntityMention
I can't get mention type from the returned dictionary.
Here's my hideous code:
def entities_text(text, client):
    """Detects entities in the text."""
    language_client = client
    # Instantiate a plain text document.
    document = language_client.document_from_text(text)
    # Detect entities in the document. You can also analyze HTML with:
    #   document.doc_type == language.Document.HTML
    entities = document.analyze_entities()
    return entities
articles = os.listdir('articles')

for f in articles:
    language_client = language.Client()
    fname = "articles/" + f
    thisfile = open(fname, 'r')
    content = thisfile.read()
    entities = entities_text(content, language_client)
    for e in entities:
        name = e.name.strip()
        type = e.entity_type.strip()
        if name[0].isupper() and len(name) > 2:
            print name, type, e.salience, e.mentions
That returns this:
RELATED OTHER 0.0019081507 [u'RELATED']
Zoe 3 PERSON 0.0016676666 [u'Zoe 3']
Where the value in [] is the mentions.
If I try to get mentions.type, I get an attribute not found error.
I'd appreciate any input.
1) Do not call the "AnalyzeEntities" function; call "AnnotateText" instead.
2) Check for "Proper". Examine its value: it should be "PROPER", not "PROPER_UNKNOWN" or "NOT_PROPER".
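A minimal sketch of that advice, assuming the current google.cloud.language_v1 client rather than the legacy language.Client() wrapper used in the question; field names such as mention.type_ vary between client versions, so verify against your installed library:

# Hedged sketch: uses the modern google-cloud-language v1 client, not the
# legacy language.Client() wrapper from the question. Field names (e.g.
# mention.type_) differ across client versions; check your own install.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content=text, type_=language_v1.Document.Type.PLAIN_TEXT)

# annotate_text lets you request several features at once; here we only
# need entities, which carry their mentions (PROPER / COMMON).
features = language_v1.AnnotateTextRequest.Features(extract_entities=True)
response = client.annotate_text(
    request={"document": document, "features": features})

for entity in response.entities:
    for mention in entity.mentions:
        # mention.type_ is an EntityMention.Type enum: PROPER, COMMON, ...
        print(entity.name, mention.type_.name)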
Is there a way to make a function that creates other functions, named after the arguments passed in, to be called later?
For the example, let's pretend https://example.com/engine_list returns this XML file when I call it in get_search_engine_xml:
<engines>
    <engine address="https://www.google.com/">Google</engine>
    <engine address="https://www.bing.com/">Bing</engine>
    <engine address="https://duckduckgo.com/">DuckDuckGo</engine>
</engines>
And here's my code:
import re
import requests
import xml.etree.ElementTree as ET

base_url = 'https://example.com'

def make_safe(s):
    s = re.sub(r"[^\w\s]", '', s)
    s = re.sub(r"\s+", '_', s)
    s = str(s)
    return s

# This is what I'm trying to figure out how to do correctly: create a function
# named after the engine returned in get_search_engine_xml(), to be called later.
def create_get_engine_function(function_name, address):
    def function_name():
        r = requests.get(address)
    return function_name

def get_search_engine_xml():
    url = base_url + '/engine_list'
    r = requests.get(url)
    engines_list = str(r.content)
    engines_root = ET.fromstring(engines_list)
    for child in engines_root:
        engine_name = child.text.lower()
        engine_name = make_safe(engine_name)
        engine_address = child.attrib['address']
        create_get_engine_function(engine_name, engine_address)

## Runs without error.
get_search_engine_xml()

## But if I try to call one of the functions...
google()
I get the following error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'google' is not defined
Defining engine_name and engine_address seems to be working when I log them out, so I'm pretty sure the problem lies in create_get_engine_function, where, admittedly, I don't know what I'm doing; I was trying to piece it together from similar questions.
Can you name a function created by another function with an argument that's passed in? Is there a better way to do this?
You can assign them to globals():
def create_get_engine_function(function_name, address):
    def function():
        r = requests.get(address)
        return r
    function.__name__ = function_name
    function.__qualname__ = function_name  # for Python 3.3+
    globals()[function_name] = function
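Usage then looks like this (hypothetical engine name, mirroring the question):

# After registering, the generated function exists at module level.
create_get_engine_function('google', 'https://www.google.com/')
response = google()  # no NameError now; returns the requests.Response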
Although, depending on what you're actually trying to accomplish, a better design would be to store all the engine names/addresses in a dictionary and access them as needed:
# You should probably rename this to 'parse_engines_from_xml'.
def get_search_engine_xml():
    ...
    search_engines = {}  # maps names to addresses
    for child in engines_root:
        ...
        search_engines[engine_name] = engine_address
    return search_engines

engines = get_search_engine_xml()
e = requests.get(engines['google'])
# <do whatever>
e = requests.get(engines['bing'])
# <do whatever>
The following Python LibreOffice Uno macro works, but only with the try..except statement.
The macro allows you to select text in a writer document and send it to a search engine in your default browser.
The issue is that if you select a single piece of text, oSelected.getByIndex(0) is populated, but if you select multiple pieces of text, oSelected.getByIndex(0) is not populated. In that case the data starts at oSelected.getByIndex(1), and oSelected.getByIndex(0) is left blank.
I have no idea why this should be and would love to know if anyone can explain this strange behaviour.
#!/usr/bin/python
import os
import webbrowser
from configobj import ConfigObj
from com.sun.star.awt.MessageBoxButtons import BUTTONS_OK, BUTTONS_OK_CANCEL, BUTTONS_YES_NO, BUTTONS_YES_NO_CANCEL, BUTTONS_RETRY_CANCEL, BUTTONS_ABORT_IGNORE_RETRY
from com.sun.star.awt.MessageBoxButtons import DEFAULT_BUTTON_OK, DEFAULT_BUTTON_CANCEL, DEFAULT_BUTTON_RETRY, DEFAULT_BUTTON_YES, DEFAULT_BUTTON_NO, DEFAULT_BUTTON_IGNORE
from com.sun.star.awt.MessageBoxType import MESSAGEBOX, INFOBOX, WARNINGBOX, ERRORBOX, QUERYBOX

def fs3Browser(*args):
    # Get the doc from the scripting context, which is made available to all scripts.
    desktop = XSCRIPTCONTEXT.getDesktop()
    model = desktop.getCurrentComponent()
    doc = XSCRIPTCONTEXT.getDocument()
    parentwindow = doc.CurrentController.Frame.ContainerWindow
    oSelected = model.getCurrentSelection()
    oText = ""
    try:
        for i in range(0, 4, 1):
            print("Index No ", str(i))
            try:
                oSel = oSelected.getByIndex(i)
                print(str(i), oSel.getString())
                oText += oSel.getString() + " "
            except:
                break
    except AttributeError:
        mess = "Do not select text from more than one table cell"
        heading = "Processing error"
        MessageBox(parentwindow, mess, heading, INFOBOX, BUTTONS_OK)
        return
    lookup = str(oText)
    special_c = str.maketrans("", "", '!|##"$~%&/()=?+*][}{-;:,.<>')
    lookup = lookup.translate(special_c)
    lookup = lookup.strip()
    configuration_dir = os.environ["HOME"] + "/fs3"
    config_filename = configuration_dir + "/fs3.cfg"
    if os.access(config_filename, os.R_OK):
        cfg = ConfigObj(config_filename)
        # Define the search engine from the configuration file.
        try:
            searchengine = cfg["control"]["ENGINE"]
        except:
            searchengine = "https://duckduckgo.com"
    if 'duck' in searchengine:
        webbrowser.open_new('https://www.duckduckgo.com//?q=' + lookup + '&kj=%23FFD700 &k7=%23C9C4FF &ia=meanings')
    else:
        webbrowser.open_new('https://www.google.com/search?/&q=' + lookup)
    return None

def MessageBox(ParentWindow, MsgText, MsgTitle, MsgType, MsgButtons):
    ctx = XSCRIPTCONTEXT.getComponentContext()
    sm = ctx.ServiceManager
    si = sm.createInstanceWithContext("com.sun.star.awt.Toolkit", ctx)
    mBox = si.createMessageBox(ParentWindow, MsgType, MsgButtons, MsgTitle, MsgText)
    mBox.execute()
Your code is missing something. This works without needing an extra try/except clause:
selected_strings = []
try:
    for i in range(oSelected.getCount()):
        oSel = oSelected.getByIndex(i)
        if oSel.getString():
            selected_strings.append(oSel.getString())
except AttributeError:
    # Handle the exception...
    return
result = " ".join(selected_strings)
To answer your question about the "strange behaviour," it seems pretty straightforward to me. If the 0th element is empty, then there are multiple selections which may need to be handled differently.
I'm querying Google+ data with TF-IDF and saved the data as a JSON file. While working with this file, I get an error.
Code
import json
import nltk

DATA = 'C:/Users/Dung Ring/Desktop/kpdl/107033731246200681024.json'
data = json.loads(open(DATA).read())
QUERY_TERMS = ['SOPA']

activities = [activity['object']['content'].lower().split()
              for activity in data
              if activity['object']['content'] != " "]

# TextCollection provides tf, idf, and tf_idf abstractions so
# that we don't have to maintain/compute them ourselves.
tc = nltk.TextCollection(activities)

relevant_activities = []
for idx in range(len(activities)):
    score = 0
    for term in [t.lower() for t in QUERY_TERMS]:
        score += tc.tf_idf(term, activities[idx])
    if score > 0:
        relevant_activities.append({'score': score, 'title': data[idx]['title'],
                                    'url': data[idx]['url']})

# Sort by score and display the results.
relevant_activities = sorted(relevant_activities, key=lambda p: p['score'], reverse=True)
for activity in relevant_activities:
    print activity['title']
    print '\tLink: %s' % (activity['url'],)
    print '\tScore: %s' % (activity['score'],)
    print
Error message
Traceback (most recent call last):
File "ex9.py", line 11, in <module>
if activity['object']['content']!= ""]
TypeError: string indices must be integers
I use Python 2.7.
Either activity or activity['object'] is a string, not a dictionary as you expect. Print data and check its structure.
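A minimal way to check, sticking to Python 2.7 as in the question; the 'items' key below is an assumption (raw Google+ API responses usually wrap the activity list in one), so verify against your own file:

import json

DATA = 'C:/Users/Dung Ring/Desktop/kpdl/107033731246200681024.json'
data = json.loads(open(DATA).read())
print type(data)  # dict or list?

# Assumption: a raw Google+ API response usually wraps the activity
# list in an 'items' key -- verify against your own file.
if isinstance(data, dict) and 'items' in data:
    data = data['items']

if isinstance(data, list):
    for activity in data[:3]:
        print type(activity)  # should be <type 'dict'>, not <type 'str'>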
I am trying to read key-value pairs from an existing shelf to create a new class object with an updated field and write that object to a new shelf.
My class object: SongDetails
This is the procedure that fails:
def updateShelfWithTabBody(shelfFileName, newShelfFileName):
    """This function updates songDetails with the HTML body, i.e. just
    the part that contains the lyrics and chords in the tab."""
    # Read all SongDetails.
    shelf = shelve.open(shelfFileName)
    listOfKeys = shelf.keys()
    # Create a new SongDetails object.
    temporaryShelfObject = SongDetails.SongDetails()
    # Iterate over the list of keys.
    for key in listOfKeys:
        #print "name:" + shelf[key].name
        # Fill details into temporaryShelfObject.
        temporaryShelfObject.name = shelf[key].name
        temporaryShelfObject.tabHtmlPageContent = shelf[key].tabHtmlPageContent
        # Add the new detail information.
        htmlPageContent = shelf[key].tabHtmlPageContent
        temporaryShelfObject.htmlBodyContent = extractDataFromDocument.fetchTabBody(htmlPageContent)
        # Write SongDetails back to the shelf.
        writeSongDetails.writeSongDetails(temporaryShelfObject, newShelfFileName)
Definitions of the functions used in the above code:
def fetchTabBody(page_contents):
    soup = BeautifulSoup(page_contents)
    HtmlBody = ""
    try:
        # The lyrics and chords of the song are contained in a div with id="cont".
        # Note: this assumption is specific to ultimate-guitar.com.
        HtmlBody = soup.html.body.find("div", {"id": "cont"})
    except:
        print "Error: ", sys.exc_info()[0]
    return HtmlBody

def writeSongDetails(songDetails, shelfFileName):
    shelf = shelve.open(shelfFileName)
    songDetails.name = str(songDetails.name).strip(' ')
    shelf[songDetails.name] = songDetails
    shelf.close()
SongDetails class:
class SongDetails:
    name = ""
    tabHtmlPageContent = ""
    genre = ""
    year = ""
    artist = ""
    chordsAndLyrics = ""
    htmlBodyContent = ""
    scale = ""
    chordsUsed = []
This is the error that I get:
Traceback (most recent call last):
File "/l/nx/user/ndhande/Independent_Study_Project_Git/Crawler/updateSongDetailsShelfWithNewAttributes.py", line 69, in <module>
updateShelfWithTabBody(shelfFileName, newShelfFileName)
File "/l/nx/user/ndhande/Independent_Study_Project_Git/Crawler/updateSongDetailsShelfWithNewAttributes.py", line 38, in updateShelfWithTabBody
writeSongDetails.writeSongDetails(temporaryShelfObject, newShelfFileName)
File "/home/nx/user/ndhande/Independent_Study_Project_Git/Crawler/writeSongDetails.py", line 7, in writeSongDetails
shelf[songDetails.name] = songDetails
File "/usr/lib64/python2.6/shelve.py", line 132, in __setitem__
p.dump(value)
File "/usr/lib64/python2.6/copy_reg.py", line 71, in _reduce_ex
state = base(self)
File "/u/ndhande/.local/lib/python2.6/site-packages/BeautifulSoup.py", line 476, in __unicode__
return str(self).decode(DEFAULT_OUTPUT_ENCODING)
RuntimeError: maximum recursion depth exceeded
I couldn't find any reason why I'm getting this error, even though there is no explicit recursive call in my code. I have seen this error in other Stack Overflow posts, but those did have recursive calls.
The recursion is inside BeautifulSoup itself: str(self) calls __str__, which calls __unicode__, which calls str(self) again, and so on until the recursion limit. shelve pickles whatever you store, and pickling the BeautifulSoup object returned by fetchTabBody triggers that loop (see the copy_reg._reduce_ex frame in the traceback). Store a plain string instead of the soup object.
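A minimal sketch of a fix, assuming plain text is acceptable for htmlBodyContent: convert the BeautifulSoup node to a string before it reaches the shelf, so pickling never touches BeautifulSoup internals.

# Hedged sketch: store ordinary text instead of a BeautifulSoup object.
body = extractDataFromDocument.fetchTabBody(htmlPageContent)
temporaryShelfObject.htmlBodyContent = str(body) if body is not None else ""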