SolrClient python update document

SolrClient python update document - python

I'm currently trying to create a small python program using SolrClient to index some files.
My need is that I want to index some file content and then add some attributes to enrich the document.
I used the post command line tool to index the files. Then I use a python program trying to enrich documents, something like this:
doc = solr.get('collection', id)
doc['new_attribute'] = 'value'
solr.index_json('collection',json.dumps([doc]))
solr.commit(openSearcher=True)
Problem is that I have the feeling that we lost file content index. If I run a query with a word present in all attributes of the doc, I find it.
If I run a query with a word only in the file, it does not work (it works indexing only the file with post without my update tentative).
I'm not sure to understand how to update the doc keeping the index created by the post command.
I hope I'm clear enough, maybe I misunderstood the way it works...
thanks a lot

If I understand correctly, you want to modify an existing record. You should be able to do something like this without using a solr.get:
doc = [{'id': 'value', 'new_attribute':{'set': 'value'}}]
solr.index_json('collection',json.dumps([doc]))
See also:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

It has worked for me in this way, it can be useful for someone
from SolrClient import SolrClient
solrConect = SolrClient("http://xx.xx.xxx.xxx:8983/solr/")
doc = [{'id': 'my_id', 'count_related_like':{'set': 10}}]
solrConect.index_json("my_collection", json.dumps(doc) )
solrConect.commit("my_collection", softCommit=True)

Trying with Curl did not change anything. I did it differently so now it works. Instead of adding the file with the post command and trying to modify it afterwards, I read the file in a string and index in a "content" field. It means every document is added in one shot.
The content field is defined as not stored, so I just index it.
It works fine and suits my needs. It's also more simple since it removes many attributes set by post command that I don't need.
If I find some time, I'll try again the partial update and update the post.
Thanks
Rémi

Related

Python- Insert new values into 'nested' list?

What I'm trying to do isn't a huge problem in php, but I can't find much assistance for Python.
In simple terms, from a list which produces output as follows:
{"marketId":"1.130856098","totalAvailable":null,"isMarketDataDelayed":null,"lastMatchTime":null,"betDelay":0,"version":2576584033,"complete":true,"runnersVoidable":false,"totalMatched":null,"status":"OPEN","bspReconciled":false,"crossMatching":false,"inplay":false,"numberOfWinners":1,"numberOfRunners":10,"numberOfActiveRunners":8,"runners":[{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":2.8,"size":34.16},{"price":2.76,"size":200},{"price":2.5,"size":237.85}],"availableToLay":[{"price":2.94,"size":6.03},{"price":2.96,"size":10.82},{"price":3,"size":33.45}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832765}...
All I want to do is add in an extra field, containing the 'runner name' in the data set below, into each of the 'runners' sub lists from the initial data set, based on selection_id=selectionId.
So initially I iterate through the full dataset, and then create a separate list to get the runner name from the runner id (I should point out that runnerId===selectionId===selection_id, no idea why there are multiple names are used), this works fine and the code is shown below:
for market_book in market_books:
market_catalogues = trading.betting.list_market_catalogue(
market_projection=["RUNNER_DESCRIPTION", "RUNNER_METADATA", "COMPETITION", "EVENT", "EVENT_TYPE", "MARKET_DESCRIPTION", "MARKET_START_TIME"],
filter=betfairlightweight.filters.market_filter(
market_ids=[market_book.market_id],
),
max_results=100)
data = []
for market_catalogue in market_catalogues:
for runner in market_catalogue.runners:
data.append(
(runner.selection_id, runner.runner_name)
)
So as you can see I have the data in data[], but what I need to do is add it to the initial data set, based on the selection_id.
I'm more comfortable with Php or Javascript, so apologies if this seems a bit simplistic, but the code snippets I've found on-line only seem to assist with very simple Python lists and nothing 'nested' (to me the structure seems similar to a nested array).
As per the request below, here is the full list:
{"marketId":"1.130856098","totalAvailable":null,"isMarketDataDelayed":null,"lastMatchTime":null,"betDelay":0,"version":2576584033,"complete":true,"runnersVoidable":false,"totalMatched":null,"status":"OPEN","bspReconciled":false,"crossMatching":false,"inplay":false,"numberOfWinners":1,"numberOfRunners":10,"numberOfActiveRunners":8,"runners":[{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":2.8,"size":34.16},{"price":2.76,"size":200},{"price":2.5,"size":237.85}],"availableToLay":[{"price":2.94,"size":6.03},{"price":2.96,"size":10.82},{"price":3,"size":33.45}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832765},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":20,"size":3},{"price":19.5,"size":26.36},{"price":19,"size":2}],"availableToLay":[{"price":21,"size":13},{"price":22,"size":2},{"price":23,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832767},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":11,"size":9.75},{"price":10.5,"size":3},{"price":10,"size":28.18}],"availableToLay":[{"price":11.5,"size":12},{"price":13.5,"size":2},{"price":14,"size":7.75}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832766},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":48,"size":2},{"price":46,"size":5},{"price":42,"size":5}],"availableToLay":[{"price":60,"size":7},{"price":70,"size":5},{"price":75,"size":10}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832769},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":18.5,"size":28.94},{"price":18,"size":5},{"price":17.5,"size":3}],"availableToLay":[{"price":21,"size":20},{"price":23,"size":2},{"price":24,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832768},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":4.3,"size":9},{"price":4.2,"size":257.98},{"price":4.1,"size":51.1}],"availableToLay":[{"price":4.4,"size":20.97},{"price":4.5,"size":30},{"price":4.6,"size":16}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832771},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":24,"size":6.75},{"price":23,"size":2},{"price":22,"size":2}],"availableToLay":[{"price":26,"size":2},{"price":27,"size":2},{"price":28,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832770},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":5.7,"size":149.33},{"price":5.5,"size":29.41},{"price":5.4,"size":5}],"availableToLay":[{"price":6,"size":85},{"price":6.6,"size":5},{"price":6.8,"size":5}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":10064909}],"publishTime":1551612312125,"priceLadderDefinition":{"type":"CLASSIC"},"keyLineDescription":null,"marketDefinition":{"bspMarket":false,"turnInPlayEnabled":false,"persistenceEnabled":false,"marketBaseRate":5,"eventId":"28180290","eventTypeId":"2378961","numberOfWinners":1,"bettingType":"ODDS","marketType":"NONSPORT","marketTime":"2019-03-29T00:00:00.000Z","suspendTime":"2019-03-29T00:00:00.000Z","bspReconciled":false,"complete":true,"inPlay":false,"crossMatching":false,"runnersVoidable":false,"numberOfActiveRunners":8,"betDelay":0,"status":"OPEN","runners":[{"status":"ACTIVE","sortPriority":1,"id":10064909},{"status":"ACTIVE","sortPriority":2,"id":12832765},{"status":"ACTIVE","sortPriority":3,"id":12832766},{"status":"ACTIVE","sortPriority":4,"id":12832767},{"status":"ACTIVE","sortPriority":5,"id":12832768},{"status":"ACTIVE","sortPriority":6,"id":12832770},{"status":"ACTIVE","sortPriority":7,"id":12832769},{"status":"ACTIVE","sortPriority":8,"id":12832771},{"status":"LOSER","sortPriority":9,"id":10317013},{"status":"LOSER","sortPriority":10,"id":10317010}],"regulators":["MR_INT"],"countryCode":"GB","discountAllowed":true,"timezone":"Europe\/London","openDate":"2019-03-29T00:00:00.000Z","version":2576584033,"priceLadderDefinition":{"type":"CLASSIC"}}}

i think i understand what you are trying to do now
first hold your data as a python object (you gave us a json object)
import json
my_data = json.loads(my_json_string)
for item in my_data['runners']:
item['selectionId'] = [item['selectionId'], my_name_here]
the thing is that my_data['runners'][i]['selectionId'] is a string, unless you want to concat the name and the id together, you should turn it into a list or even a dictionary
each item is a dicitonary so you can always also a new keys to it
item['new_key'] = my_value

So, essentially this works...with one exception...I can see from the print(...) in the loop that the attribute is updated, however what I can't seem to do is then see this update outside the loop.
mkt_runners = []
for market_catalogue in market_catalogues:
for r in market_catalogue.runners:
mkt_runners.append((r.selection_id, r.runner_name))
for market_book in market_books:
for runner in market_book.runners:
for x in mkt_runners:
if runner.selection_id in x:
setattr(runner, 'x', x[1])
print(market_book.market_id, runner.x, runner.selection_id)
print(market_book.json())
So the print(market_book.market_id.... displays as expected, but when I print the whole list it shows the un-updated version. I can't seem to find an obvious solution, which is odd, as it seems like a really simple thing (I tried messing around with indents, in case that was the problem, but it doesn't seem to be, its like its not refreshing the market_book list post update of the runners sub list)!

Pymongo how to use full text search

I am looking to implement full text search in my python application using pymongo. I have been looking at this question but for some reason I am unable to implement this in my project as I am getting an error no such cmd: text. Can anyone direct me on what I am doing wrong?
Here is my code:
db = client.test
collection = db.videos
def search_for_videos(self, search_text)
self.db.command("text", "videos",
search=search_text,
limit=10)
The collection I am trying to search is called videos however I am not sure if I am putting this in the correct parameter, and I also am not sure if I need the line project={"name": 1, "_id": 0}.
The documentation here I believe is using the mongo shell to execute commands, however I wish to perform this action in my code.
I have looked at using the db.videos.find() function, but cannot seem to implement it correctly either.
How to I use PyMongo Full Text Search from my Python Code?

First be sure that you have a text index created on the field as mentioned here or you can just do it with pymongo too :
collection.create_index([('your field', 'text')])
Using pymongo you can do this to search:
collection.find({"$text": {"$search": your search}})
your function should look like this:
def search_for_videos(search_text):
collection.find({"$text": {"$search": search_text}}).limit(10)
I hope this helps you.

First create a text index based on the field you want to do the search on.
from pymongo import TEXT
db = MongoClient('localhost',port = 27017).DBNAME
db.collection.create_index([('FIELD_NAME',TEXT)],default_language ="english")
once you create the text index use the following query to search text. Depending on the size of your database, it might take long to create the text index.
db.collection.find({"$text": {"$search": your search}})

Word & Python - Create Table of Contents

I'm using the pywin32.client extension for python and building a Word document. I have tried a pretty good host of methods to generate a ToC but all have failed.
I think what I want to do is call the ActiveDocument object and create one with something like this example from the MSDN page:
Set myRange = ActiveDocument.Range(Start:=0, End:=0)
ActiveDocument.TablesOfContents.Add Range:=myRange, _
UseFields:=False, UseHeadingStyles:=True, _
LowerHeadingLevel:=3, _
UpperHeadingLevel:=1
Except in Python it would be something like:
wordObject.ActiveDocument.TableOfContents.Add(Range=???,UseFiles=False, UseHeadingStyles=True, LowerHeadingLevel=3, UpperHeadingLevel=1)
I've built everything so far using the 'Selection' object (example below) and wish to add this ToC after the first page break.
Here's a sample of what the document looks like:
objWord = win32com.client.Dispatch("Word.Application")
objDoc = objWord.Documents.Open('pathtotemplate.docx') #
objSel = objWord.Selection
#These seem to work but I don't know why...
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.Add(1,True)
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.NumberStyle = 57
objSel.Style = objWord.ActiveDocument.Styles("Heading 1")
objSel.TypeText("TITLE PAGE AND STUFF")
objSel.InsertParagraph()
objSel.TypeText("Some data or another"
objSel.TypeParagraph()
objWord.Selection.InsertBreak()
####INSERT TOC HERE####
Any help would be greatly appreciated! In a perfect world I'd use the default first option which is available from the Word GUI but that seems to point to a file and be harder to access (something about templates).
Thanks

Manually, edit your template in Word, add the ToC (which will be empty initially) any intro stuff, header/footers etc., then at where you want your text content inserted (i.e. after the ToC) put a uniquely named bookmark. Then in your code, create a new document based on the template (or open the template then save it to a different name), search for the bookmark and insert your content there. Save to a different filename.
This approach has all sorts of advantages - you can format your template in Word rather than by writing all the code details, and so you can very easily edit your template to update styles when someone says they want the Normal font to be bigger/smaller/pink you can do it just by editing the template. Make sure to use styles in your code and only apply formatting when it is specifically different from the default style.
Not sure how you make sure the ToC is actually generated, might be automatically updated on every save.

Add a field to existing document in CouchDB

I have a database with a bunch of regular documents that look something like this (example from wiki):
{
"_id":"some_doc_id",
"_rev":"D1C946B7",
"Subject":"I like Plankton",
"Author":"Rusty",
"PostedDate":"2006-08-15T17:30:12-04:00",
"Tags":["plankton", "baseball", "decisions"],
"Body":"I decided today that I don't like baseball. I like plankton."
}
I'm working in Python with couchdb-python and I want to know if it's possible to add a field to each document. For example, if I wanted to have a "Location" field or something like that.
Thanks!

Regarding IDs
Every document in couchdb has an id, whether you set it or not. Once the document is stored you can access it through the doc._id field.
If you want to set your own ids you'll have to assign the id value to doc._id. If you don't set it, then couchdb will assign a uuid.
If you want to update a document, then you need to make sure you have the same id and a valid revision. If say you are working from a blog post and the user adds the Location, then the url of the post may be a good id to use. You'd be able to instantly access the document in this case.
So what's a revision
In your code snippet above you have the doc._rev element. This is the identifier of the revision. If you save a document with an id that already exists, couchdb requires you to prove that the document is still the valid doc and that you are not trying to overwrite someone else's document.
So how do I update a document
If you have the id of your document, you can just access each document by using the db.get(id) function. You can then update the document like this:
doc = db.get(id)
doc['Location'] = "On a couch"
db.save(doc)
I have an example where I store weather forecast data. I update the forecasts approximately every 2 hours. A separate process is looking for data that I get from a different provider looking at characteristics of tweets on the day.
This looks something like this.
doc = db.get(id)
doc_with_loc = GetLocationInformationFromOtherProvider(doc) # takes about 40 seconds.
doc_with_loc["_rev"] = doc["_rev"]
db.save(doc_with_loc) # This will fail if weather update has also updated the file.
If you have concurring processes, then the _rev will become invalid, so you have to have a failsave, eg. this could do:
doc = db.get(id)
doc_with_loc = GetLocationInformationFromAltProvider(doc)
update_outstanding = true
while update_outstanding:
doc = db.get(id) //reretrieve this to get
doc_with_loc["_rev"] = doc["_rev"]
update_outstanding = !db.save(doc_with_loc)
So how do I get the Ids?
One option suggested above is that you actively set the id, so you can retrieve it. Ie. if a user sets a given location that is attached to a URL, use the URL. But you may not know which document you want to update - or even have a process that finds all the document that don't have a location and assign one.
You'll most likely be using a view for this. Views have a mapper and a reducer. You'll use the first one, forget about the last one. A view with a mapper does the following:
It returns a simplyfied/transformed way of looking at your data. You can return multiple values per data or skip some. It gives the data you emit a key, and if you use the _include_docs function it will give you the document (with _id and rev alongside).
The simplest view is the default view db.view('_all_docs') this will return all documents and you may not want to update all of them. Views for example will be stored as a document as well when you define these.
The next simple way is to have view that only returns items that are of the type of the document. I tend to have a _type="article in my database. Think of this as marking that a document belongs to a certain table if you had stored them in a relational database.
Finally you can filter elements that have a location so you'd have a view where you can iterate over all those docs that still need a location and identify this in a separate process. The best documentation on writing view can be found here.

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit cludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other#example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what url is being generated by Biopython as it'll help me work out how i have to structure the search term for when i want to do more complicated searches.

handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)

You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.