Whoosh: How to get search excerpts as a list


From the Whoosh documentation I can get matched search terms with some context with:
results = mysearcher.search(myquery)
for hit in results:
    print(hit["title"])
    # Assume "content" field is stored
    print(hit.highlights("content"))
I'd like to access the "highlights" as a list of separate items (so that I can enumerate them in an HTML list), but the output of hit.highlights() appears to be of type <class 'str'>, and it's not clear to me that there's a unique delimiter.
Is there a way I can get a list instead of everything concatenated into one string?

You could just convert the highlighted string with the separator "..." to a list.
It's as simple as this:
highlights_list = hit.highlights("content").split("...")
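A slightly more defensive version of the same idea is sketched below. It assumes the excerpts are joined with "..." as described above; the helper name split_highlights and the sample string are made up for illustration and simply stand in for the output of hit.highlights("content"). Note that any "..." occurring inside the indexed text itself would also cause a split.

```python
def split_highlights(highlighted, separator="..."):
    """Split a highlights string into a list of non-empty, stripped excerpts."""
    # Drop empty pieces that appear when the string starts or ends with the separator.
    return [part.strip() for part in highlighted.split(separator) if part.strip()]


# Hypothetical sample standing in for hit.highlights("content").
sample = "first <b>match</b> in context...second <b>match</b> elsewhere..."
excerpts = split_highlights(sample)

# Each excerpt can then be wrapped in its own list item.
html_items = ["<li>{}</li>".format(excerpt) for excerpt in excerpts]
print("\n".join(html_items))
```

Stripping and filtering keeps stray whitespace and empty entries out of the rendered list.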

Related

Getting the last value of a key value pair in a json like string

I have a JSON-like string (incomplete JSON) from which I am trying to retrieve the value of the last key:value pair. The string looks like this:
"BaseCode":null,"BrokerSymbol":null,"CustID":null,"SynthDesc":""}],"#nexturl":"https://someurl.com"}
I am trying to access the value of the #nexturl key, and this is the code I have so far:
str1.split(":")[-1]
This gives the output //someurl.com. There are two issues: splitting on : removes the https prefix, and the approach is fragile because any further : in the URL would cause additional splits. Is there some way I can get the entire value of the #nexturl key?

As requested, assuming your string is as follows:
mystr = '"BaseCode":null,"BrokerSymbol":null,"CustID":null,"SynthDesc":""}],"#nexturl":"https://someurl.com"}'
You can grab the key:value pair containing "#nexturl" by using:
mystr[mystr.index('"#nexturl":'):len(mystr)-1]
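If only the URL itself is wanted rather than the whole key:value pair, one possible sketch is to split on the key once and trim the trailing brace and quotes. This assumes "#nexturl" is the last key in the string and that its value is a quoted string, as in the sample above; the variable names are illustrative.

```python
mystr = '"BaseCode":null,"BrokerSymbol":null,"CustID":null,"SynthDesc":""}],"#nexturl":"https://someurl.com"}'

key = '"#nexturl":'
# rpartition splits on the last occurrence of the key, so colons inside the URL are untouched.
_, found, tail = mystr.rpartition(key)

if found:
    # tail is '"https://someurl.com"}': drop the closing brace, then the quotes.
    next_url = tail.rstrip('}').strip('"')
else:
    next_url = None

print(next_url)
```

Because the split happens on the key rather than on ":", the https prefix survives intact.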

Dealing with tuples for json.loads

I have a string that contains a dictionary, and inside it there is a field whose value is a list of tuples. When I try to load the string as JSON, it fails.
The format looks like
{'info_scetion': {'category1': [('response 1', '{"tag1":"str1", "tag2":"str2"}')]}}
This is one of the special inputs I receive, and it is the only one that contains a list of tuples. I do not generate the input, so I cannot change the format. Because JSON cannot parse this string directly, I was thinking about trying to identify the tuple inside the string and pick it out. The rest of the input the code should be able to process.
The problem is, I'm not sure how to do it. I tried forming a regex that uses ( and ) in some form like (.*?) to get the first occurrence, but I cannot guarantee there won't be any ) inside the actual tuple.
If I go with this direction, how do I correctly identify the tuple?
If there's another way to do it, what is it?
EDIT: adding the } at the end

Your 'JSON' is not really JSON: it is a Python data structure, so parse it as such with the ast module:
import ast
s = "{'info_scetion': {'category1': [('response 1', '{\"tag1\":\"str1\", \"tag2\":\"str2\"}')]}}"
result = ast.literal_eval(s)
result
#{'info_scetion': {'category1': \
# [('response 1', '{"tag1":"str1", "tag2":"str2"}')]}}
Once it is parsed into a Python object, you can manipulate it in any way you like. For example, you can "flatten" the list of tuples:
result['info_scetion']['category1'] = list(result['info_scetion']['category1'][0])
#{'info_scetion': {'category1': ['response 1', '{"tag1":"str1", "tag2":"str2"}']}}
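Since the second element of the tuple is itself a JSON document, the two parsers can be combined: ast.literal_eval for the outer Python literal and json.loads for the inner string. The following is a minimal sketch using the sample input from the question; the variable names are illustrative.

```python
import ast
import json

raw = "{'info_scetion': {'category1': [('response 1', '{\"tag1\":\"str1\", \"tag2\":\"str2\"}')]}}"

# The outer structure is a Python literal (single quotes, tuples), not JSON.
outer = ast.literal_eval(raw)

# Each entry is a (label, json_text) tuple; unpack the first one.
label, payload = outer['info_scetion']['category1'][0]

# The second tuple element is valid JSON, so json.loads can handle it.
tags = json.loads(payload)

print(label, tags['tag1'])
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for input that comes from elsewhere.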

Your JSON is malformed; it is missing a } at the end.
I tested things with this code and things seem to be fine.
import json
data = {'info_scetion': {'category1': [('response 1', '{"tag1":"str1", "tag2":"str2"}')]}}
print(data['info_scetion']['category1'][0][0])
# output >> response 1
print(json.loads(data['info_scetion']['category1'][0][1])['tag1'])
# output >> str1

Parsing multiple occurrences of an item into a dictionary

I am attempting to parse several separate image links from JSON data in Python, but I am having trouble drilling down to the right level, which I believe is due to having a list of strings.
For the majority of the items, I've had success with the example below, pulling back everything I need. Everywhere else there is a 1:1 ratio of keys to values, but for this one there are multiple values associated with one key.
resultsdict['item_name'] = item['attribute_key']
I've been adding it all to a resultsdict={}, but am only able to get to the below sample string when I print.
INPUT:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
OUTPUT (only relevant section):
'images': [{u'VariationSpecificPictureSet': [{u'PictureURL': [u'http//imagelink1'], u'VariationSpecificValue': u'color1'}, {u'PictureURL': [u'http//imagelink2'], u'VariationSpecificValue': u'color2'}, {u'PictureURL': [u'http//imagelink3'], u'VariationSpecificValue': u'color3'}, {u'PictureURL': [u'http//imagelink4'], u'VariationSpecificValue': u'color4'}]
I feel like I could add ['VariationPictureSet']['PictureURL'] at the end of my initial input, but that throws an error because list indices must be integers, not strings.
Ideally, I would like to see the output as a simple comma-separated list of just the URLs, as follows:
OUTPUT:
'images': http//imagelink1, http//imagelink2, http//imagelink3, http//imagelink4

This answers your comment, which needed a bit of code.
When using
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
you get a list with one element, so I recommend using this:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures'][0]
Now you can use
for image in resultsdict['images']['VariationSpecificPictureSet']:
    print(image['PictureURL'])

Thanks for the help, @uzzee, it's appreciated. I kept tinkering with it and was able to pull a single flat list of all the image URLs with the following code.
resultsdict['images'] = sum([x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']],[])
Without the sum it looks like this and pulls in the whole list of lists...
resultsdict['images'] = [x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']]
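To get from the list of lists to the comma-separated output requested in the question, one option is to flatten with itertools.chain.from_iterable and then join. The sketch below rebuilds a small item dictionary from the sample output shown in the question; the 'Variations' key spelling follows the question's own code, and the variable names are illustrative.

```python
from itertools import chain

# Sample data reconstructed from the output shown in the question.
item = {'Variations': {'Pictures': [{'VariationSpecificPictureSet': [
    {'PictureURL': ['http//imagelink1'], 'VariationSpecificValue': 'color1'},
    {'PictureURL': ['http//imagelink2'], 'VariationSpecificValue': 'color2'},
    {'PictureURL': ['http//imagelink3'], 'VariationSpecificValue': 'color3'},
    {'PictureURL': ['http//imagelink4'], 'VariationSpecificValue': 'color4'},
]}]}}

picture_sets = item['Variations']['Pictures'][0]['VariationSpecificPictureSet']

# Each 'PictureURL' value is itself a list, so flatten one level.
urls = list(chain.from_iterable(entry['PictureURL'] for entry in picture_sets))

# Join into the single comma-separated string asked for in the question.
images = ', '.join(urls)
print(images)
```

chain.from_iterable does the same flattening as sum(..., []) without building intermediate lists.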

Getting positions from Python lists to generate a dynamic range

I have a list being built in Python, using this code:
def return_hosts():
    'return a list of host names'
    with open('./tfhosts') as hosts:
        return [host.split()[1].strip() for host in hosts]
tfhosts has the format of a hosts file, so I take the hostname portion of each line and populate it into a template. So far this works.
The named host sections are fixed, but I want any additional hosts to be put into a default section, so that part needs to be dynamic. To do that I've got the following:
rendered_inventory = inventory_template.render({
    'host_main': gethosts[0],
    'host_master1': gethosts[1],
    'host_master2': gethosts[2],
    'host_spring': gethosts[3],
    'host_default': gethosts[4:],
})
Everything is rendered properly except the final host_default section. Instead of getting a newline-separated list of hosts, like this (which is what I want):
[host_default]
dc01-worker-02
dc01-worker-03
it just writes out the remaining hostnames as a single list (which I don't want):
[host_default]
['dc01-worker-02', 'dc01-worker-03']
I've tried to wrap the host default section and split it, but I get a runtime error if I try:
[gethosts[4:].split(",").strip()...

If gethosts is a list (which seems to be the case), then gethosts[4:] is also a list, so the template writes the list's representation directly to your file.
Also, you cannot call .split() on a list (I guess you hoped to call it on a string, but gethosts[4:] returns a list). I believe an easy way out would be to join the strings in the list using str.join with \n as the delimiter. Example -
rendered_inventory = inventory_template.render({
    'host_main': gethosts[0],
    'host_master1': gethosts[1],
    'host_master2': gethosts[2],
    'host_spring': gethosts[3],
    'host_default': '\n'.join(gethosts[4:]),
})
Demo -
>>> lst = ['dc01-worker-02', 'dc01-worker-03']
>>> print('\n'.join(lst))
dc01-worker-02
dc01-worker-03
If you own the template, a cleaner approach would be to loop through the list for host_default and print each element in the template, for example with a for loop construct in the Jinja template.
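The template-side loop could look roughly like the sketch below. It assumes the template is rendered with the jinja2 package; the inline template text and the first four hostnames are invented for illustration, and only the two worker hostnames come from the question.

```python
from jinja2 import Template

# Hypothetical inline template; a real inventory template would contain the other sections too.
inventory_template = Template(
    "[host_default]\n"
    "{% for host in host_default %}{{ host }}\n{% endfor %}"
)

# Illustrative host list; only the last two names appear in the question.
gethosts = [
    "dc01-main", "dc01-master-01", "dc01-master-02", "dc01-spring",
    "dc01-worker-02", "dc01-worker-03",
]

# Pass the slice as a list and let the template decide how to lay it out.
rendered_inventory = inventory_template.render({"host_default": gethosts[4:]})
print(rendered_inventory)
```

Keeping the list intact in the render context leaves the formatting decision in the template, so adding more hosts needs no change to the Python code.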

Python development - elementtree XML and string operations

I am using ElementTree to load a series of XML files and parse them. As each file is parsed, I grab a few bits of data from it (a headline and a paragraph of text). I then need to grab some file names that are stored in the XML. They are contained in an element called ContentItem.
My code looks a bit like this:
for item in dirlist:
    newsML = ET.parse(item)
    NewsLines = newsML.getroot()
    HeadLine = NewsLines.getiterator("HeadLine")
    result.append(HeadLine)
    p = NewsLines.getiterator("p")
    result.append(p)
    ci = NewsLines.getiterator("ContentItem")
    for i in ci:
        result.append(i.attrib)
Now, if there were only one type of file, this would be fine, but there are three types (jpg, flv and mp4). As I loop through them in the view, it outputs all of them, but how do I grab just the flv if I only want that one? Or just the mp4? They don't always appear in the same order in the list either.
Is there a way to say "if it ends in .mp4, then do this action", or is there even a way to do that in the template?
If I try this:
url = i.attrib
if url.get("Href", () ).endswith('jpg'):
    result.append(i.attrib)
I get the error "tuple object has no attribute endswith". Why is this a tuple? I thought it was a dict?

You get a tuple because you supply a tuple (the parentheses) as the default return value for url.get(). Supply an empty string, and you can use its .endswith() method. Also note that the element itself has a get() method to retrieve attribute values (you do not have to go via .attrib). Example:
if i.get('Href', '').endswith('.jpg'):
    result.append(i.attrib)
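The same check generalizes to picking out any one file type. The sketch below is self-contained and uses a made-up XML fragment; the real NewsML layout and file names are not shown in the question, so the fragment and the helper name hrefs_ending_with are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the element layout and file names are invented for illustration.
xml_text = """
<NewsML>
  <ContentItem Href="clip.flv"/>
  <ContentItem Href="photo.jpg"/>
  <ContentItem Href="clip.mp4"/>
  <ContentItem/>
</NewsML>
"""
root = ET.fromstring(xml_text)


def hrefs_ending_with(root, suffix):
    """Return the Href of every ContentItem whose Href ends with suffix."""
    return [
        item.get("Href")
        for item in root.iter("ContentItem")
        # The empty-string default keeps .endswith() safe when Href is missing.
        if item.get("Href", "").endswith(suffix)
    ]


mp4_files = hrefs_ending_with(root, ".mp4")
flv_files = hrefs_ending_with(root, ".flv")
print(mp4_files, flv_files)
```

Filtering by suffix does not depend on the order in which the items appear in the file.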
