So I have some 50 document IDs. My Python list veno contains document IDs as shown below.
5ddfc565bd293f3dbf502789
5ddfc558bd293f3dbf50263b
5ddfc558bd293f3dbf50264f
5ddfc558bd293f3dbf50264d
5ddfc565bd293f3dbf502792
But when I try to delete those 50 documents I have a hard time. Let me explain: I need to run my Python script over and over again in order to delete all 50 documents. The first time I run the script it deletes some 10, the next time it deletes 18, and so on. My for loop is pretty simple, as shown below:
for i in veno:
    vv = i[0]
    db.Products2.delete_many({'_id': ObjectId(vv)})
If your list is just the IDs, then you want:
for i in veno:
    db.Products2.delete_many({'_id': ObjectId(i)})
Full example:
from pymongo import MongoClient
from bson import ObjectId
db = MongoClient()['testdatabase']
# Test data setup
veno = [str(db.testcollection.insert_one({'a': 1}).inserted_id) for _ in range(50)]
# Quick peek to see we have the data correct
for x in range(3): print(veno[x])
print(f'Document count before delete: {db.testcollection.count_documents({})}')
for i in veno:
    db.testcollection.delete_many({'_id': ObjectId(i)})
print(f'Document count after delete: {db.testcollection.count_documents({})}')
gives:
5ddffc5ac9a13622dbf3d88e
5ddffc5ac9a13622dbf3d88f
5ddffc5ac9a13622dbf3d890
Document count before delete: 50
Document count after delete: 0
I don't have a Mongo instance to test with, but what about:
veno = [
    '5ddfc565bd293f3dbf502789',
    '5ddfc558bd293f3dbf50263b',
    '5ddfc558bd293f3dbf50264f',
    '5ddfc558bd293f3dbf50264d',
    '5ddfc565bd293f3dbf502792',
]
# Or for your case (whatever you have in veno)
veno = [vv[0] for vv in veno]
####
db.Products2.delete_many({'_id': {'$in': [ObjectId(vv) for vv in veno]}})
If this doesn't work, then maybe this:
db.Products2.remove({'_id': {'$in':[ObjectId(vv) for vv in veno]}})
From what I understand, delete_many's first argument is a filter, so it is designed in such a way that you don't delete particular documents but rather documents that satisfy a particular condition.
In the case above, the best option is to delete all documents at once by saying: delete all documents whose _id is in ($in) the list [ObjectId(vv) for vv in veno].
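Put together, a minimal sketch (assuming veno is already a flat list of 24-character hex ID strings):
from bson import ObjectId

# One round trip deletes everything that matches; deleted_count reports how many
result = db.Products2.delete_many({'_id': {'$in': [ObjectId(v) for v in veno]}})
print(result.deleted_count)  # expect 50 if every ID matched a document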
I'm working with an API that gives me 61 items that I include in a Discord embed in a for loop.
As all of this is planned to be included in a Discord bot using pagination from DiscordUtils, I need to make it create an embed for every 10 entries to avoid a too-long message (the 2000-character limit).
Currently what my loop produces is here: https://api.nepmia.fr/spc/ (I recommend the use of a parsing extension for your browser or it will be a bit hard to read).
But what I want to create is something that will look like this: https://api.nepmia.fr/spc/formated/
That way I can iterate over each range in a different embed and then use pagination.
I use TinyDB to generate the JSON files I showed before with this script:
import urllib.request, json
from shutil import copyfile
from termcolor import colored
from tinydb import TinyDB, Query

db = TinyDB("/home/nepmia/Myazu/db/db.json")

def api_get():
    print(colored("[Myazu]", "cyan"), colored("Fetching WynncraftAPI...", "white"))
    try:
        with urllib.request.urlopen("https://api.wynncraft.com/public_api.php?action=guildStats&command=Spectral%20Cabbage") as u1:
            api_1 = json.loads(u1.read().decode())
        count = 0
        if members := api_1.get("members"):
            print(colored("[Myazu]", "cyan"),
                  colored("Got expected answer, starting saving process.", "white"))
            for member in members:
                nick = member.get("name")
                ur2 = f"https://api.wynncraft.com/v2/player/{nick}/stats"
                u2 = urllib.request.urlopen(ur2)
                api_2 = json.loads(u2.read().decode())
                data = api_2.get("data")
                for item in data:
                    meta = item.get("meta")
                    playtime = meta.get("playtime")
                    print(colored("[Myazu]", "cyan"),
                          colored("Saving playtime for player", "white"),
                          colored(f"{nick}...", "green"))
                    db.insert({"username": nick, "playtime": playtime})
                    count += 1
        else:
            print(colored("[Myazu]", "cyan"),
                  colored("Unexpected answer from WynncraftAPI [ERROR 1]", "white"))
    except:
        print(colored("[Myazu]", "cyan"),
              colored("Unhandled error in saving process [ERROR 2]", "white"))
    finally:
        print(colored("[Myazu]", "cyan"),
              colored("Finished saving data for", "white"),
              colored(f"{count}", "green"),
              colored("players.", "white"))
But this will only create a flat range like this: https://api.nepmia.fr/spc/
What I would like is something like this: https://api.nepmia.fr/spc/formated/
Thanks for your help!
PS: Sorry for your eyes, I'm still new to Python so I know I don't do stuff really properly :s
To follow up from the comments, you shouldn't store items in your database in a format that is specific to how you want to return results from the database to a different API, as it will make it more difficult to query in other contexts, among other reasons.
If you want to paginate items from a database it's better to do that when you query it.
According to the docs, you can iterate over all documents in a TinyDB database just by iterating directly over the DB like:
for doc in db:
    ...
For any iterable you can use the enumerate function to associate an index to each item like:
for idx, doc in enumerate(db):
    ...
If you want the indices to start with 1 as in your examples you would just use idx + 1.
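For example (enumerate also accepts a start argument, which gives the same result without the + 1):
for idx, doc in enumerate(db, start=1):
    ...  # idx is 1-based here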
Finally, to paginate the results, you need some function that can return items from an iterable in fixed-size batches, such as one of the many solutions to this question or elsewhere. E.g., given a function chunked(iter, size), you could do:
pages = enumerate(chunked(enumerate(db), 10))
Then list(pages) gives a list of lists of tuples like [(page_num, [(player_num, player), ...]), ...].
The only difference between a list of lists and what you want is that you seem to want a dictionary structure like
{'range1': {'1': {...}, '2': {...}, ...}, 'range2': {'11': {...}, ...}}
This is no different from a list of lists; the only difference is that you're using dictionary keys to give numerical indices to each item in a collection, rather than the indices being implicit in the list structure. There are many ways you can go from a list of lists to this. The easiest, I think, is a (nested) dict comprehension:
{f'range{page_num + 1}': {str(player_num + 1): player for player_num, player in page}
 for page_num, page in pages}
This will give output in exactly the format you want.
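Putting it all together, a minimal sketch (the chunked() helper below is just one possible implementation; use whichever batching recipe you prefer):
from tinydb import TinyDB

db = TinyDB("/home/nepmia/Myazu/db/db.json")

def chunked(iterable, size):
    # Yield successive lists of at most `size` items from any iterable
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

pages = enumerate(chunked(enumerate(db), 10))
formatted = {
    f'range{page_num + 1}': {str(player_num + 1): player for player_num, player in page}
    for page_num, page in pages
}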
Thanks @Iguananaut for your precious help.
In the end I made something similar to your solution using a generator.
import discord
from math import ceil

def chunker(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i+size]

def embed_creator(embeds):
    pages = []
    current_page = None
    for i, chunk in enumerate(chunker(embeds, 10)):
        current_page = discord.Embed(
            title=f'**SPC** Last week online time',
            color=3903947)
        for elt in chunk:
            current_page.add_field(
                name=elt.get("username"),
                value=elt.get("play_output"),
                inline=False)
        current_page.set_footer(
            icon_url="https://cdn.discordapp.com/icons/513160124219523086/a_3dc65aae06b2cf7bddcb3c33d7a5ecef.gif?size=128",
            text=f"{i + 1} / {ceil(len(embeds) / 10)}"
        )
        pages.append(current_page)
        current_page = None
    return pages
Using embed_creator I generate a list named pages that I can simply use with the DiscordUtils paginator.
I am confused about how to import data.
I have a CSV from DHCP with _time, hostname, IP_addr.
I would like to add any changed IPs as new relationships, but keep the old IP relationships with a status attribute of inactive; I also think I want to limit it to the last 10.
I am not sure of the easiest way to do this in Cypher, or whether I should be in Python for this complexity.
Maybe an always-add (remove duplicates) CSV import,
and a second query to deactivate any old IPs (how do I query non-current ones if I have time as an attribute of the relationship),
and a third query to remove relationships if more than 10 previous IPs are hanging off a host.
Any help or thoughts would be greatly appreciated.
Sounds like fun. Not sure if every host-IP combination appears only once in the CSV or also at later times as a "still-here" update.
Import Statement
LOAD CSV WITH HEADERS FROM "url" AS row
MERGE (h:Host {name: row.hostname})
MERGE (ip:IP {name: row.IP_addr})
MERGE (h)-[rel:IP]->(ip)
  ON CREATE SET rel.created = row._time, rel.status = 1
  // optional for pre-existing/previous rels
  ON MATCH SET rel.status = 0
SET rel.updated = row._time;
Cleanup statement
MATCH (h:Host) WHERE size( (h)-[:IP]->() ) > 1
MATCH (h)-[rel:IP]->(:IP)
WITH h, rel ORDER BY rel.updated DESC
WITH h, collect(rel) AS rels
// not necessary when the status is set above
FOREACH (r IN rels[1..] | SET r.status = 0)
FOREACH (r IN rels[10..] | DELETE r)
When the status is set correctly in the load statement
MATCH (h:Host)-[rel:IP {status: 0}]->(:IP)
WITH h, rel ORDER BY rel.updated DESC
WITH h, collect(rel) AS rels
FOREACH (r IN rels[9..] | DELETE r)
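If you would rather drive this from Python, here is a minimal sketch using the official neo4j Python driver; the connection details and the file:///dhcp.csv URL are placeholders, and the Cypher is the same as above (adapt the slices to taste):
from neo4j import GraphDatabase

# Placeholders: point these at your own instance and credentials
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

import_query = """
LOAD CSV WITH HEADERS FROM $url AS row
MERGE (h:Host {name: row.hostname})
MERGE (ip:IP {name: row.IP_addr})
MERGE (h)-[rel:IP]->(ip)
  ON CREATE SET rel.created = row._time, rel.status = 1
  ON MATCH SET rel.status = 0
SET rel.updated = row._time
"""

cleanup_query = """
MATCH (h:Host)-[rel:IP {status: 0}]->(:IP)
WITH h, rel ORDER BY rel.updated DESC
WITH h, collect(rel) AS rels
FOREACH (r IN rels[9..] | DELETE r)
"""

with driver.session() as session:
    session.run(import_query, url="file:///dhcp.csv")  # hypothetical CSV location
    session.run(cleanup_query)
driver.close()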
Python 2.7.
With PyFileMaker, I can access a FileMaker (FM) Server, open a DB, and open a Table (Layout), but I can't (easily) access specific records and fields. I would like to know how to extract specific values from Tables.
What I can do
Here is how I loop over the DB to extract records:
for i in range(25):
    try:
        a = fm.doFind(id_monument=i)
        L.append(a)
    except:
        pass
Here (25) is the number of records (though there should be a better way to loop through the DB). L is a list to store the results. For the first cell of the list, the result is:
>>> L[0]
<PyFileMaker.FMResultset.FMResultset instance WITH LIST OF 1 RECORDS (total-count is 327)>
[MODID = '0'
RECORDID = '236'
fk_Lieudec = '00002'
fk_auteur_fiche = '00001'
(...)
Each cell of L is a record of the FM DB. The type of L[0] is <type 'instance'> (what is that?).
What I want to do
1) Extract all record IDs and then loop over these IDs
2) Extract only specific records. For example, where 'fk_Lieudec' LIKE '*2*'
3) Extract only specific fields. For example, for each record, extract the ID and the X, Y coordinates.
I'm currently looking at regexes to do this... Is that the right way? More generally, where can I find documentation on PyFileMaker on the Internet?
The way you're doing it will hit the FM Server as many times as the loop counter. If you install the new version of PyFileMaker from the new GitHub repo, you'll be able to execute findQuery and pass a list of IDs into the function like this:
fm = FMServer('login:password@filemaker.server.com', 'dbname', 'layoutname')
results = fm.doFindQuery({'id': [1, 2, 3, 4]})
for entry in results: print entry
It is even possible to combine queries like this:
fm.doFindQuery({'id': [1, 2, 3, 4], 'color': ['red', 'blue'], '!gender': 'm'})
Take a look here for more examples.
Cheers
I am trying to find a workaround to the following problem. I have seen it quasi-described in this SO question, yet not really answered.
The following code fails, starting with a fresh graph:
from py2neo import neo4j

def add_test_nodes():
    # Add a test node manually
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345, {"user_id": 12345})

def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id": 12345, "name": "Alice"}
    # batch requests
    a = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    batch.set_properties(a, new_node_data)  # <-- I'm the problem
    # execute batch requests and clear
    batch.run()
    batch.clear()

if __name__ == '__main__':
    # Initialize Graph DB service and create a Users node index
    g = neo4j.GraphDatabaseService()
    users_idx = g.get_or_create_index(neo4j.Node, "Users")
    # run the test functions
    add_test_nodes()
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    print alice
    do_batch(g)
    # get alice back and assert additional properties were added
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    assert "name" in alice
In short, I wish, in one batch transaction, to update existing indexed node properties. The failure occurs at the batch.set_properties line, and it happens because the BatchRequest object returned by the previous line is not interpreted as a valid node. Though not entirely identical, it feels like I am attempting something like the answer posted here.
Some specifics
>>> import py2neo
>>> py2neo.__version__
'1.6.0'
>>> g = py2neo.neo4j.GraphDatabaseService()
>>> g.neo4j_version
(2, 0, 0, u'M06')
Update
If I split the problem into separate batches, then it can run without error:
def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id": 12345, "name": "Alice"}
    # batch request 1
    batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    # execute batch request and clear
    alice = batch.submit()
    batch.clear()
    # batch request 2
    batch.set_properties(alice, new_node_data)
    # execute batch request and clear
    batch.run()
    batch.clear()
This works for many nodes as well. Though I do not love the idea of splitting the batch up, this might be the only way at the moment. Anyone have some comments on this?
After reading up on all the new features of Neo4j 2.0.0-M06, it seems that the older workflow of node and relationship indexes is being superseded. There is presently a bit of a divergence on Neo4j's part in the way indexing is done, namely labels and schema indexes.
Labels
Labels can be arbitrarily attached to nodes and can serve as a reference for an index.
Indexes
Indexes can be created in Cypher by referencing Labels (here, User) and a node property key (screen_name):
CREATE INDEX ON :User(screen_name)
Cypher MERGE
Furthermore, the indexed get_or_create methods are now possible via the new Cypher MERGE function, which incorporates Labels and their indexes quite succinctly:
MERGE (me:User{screen_name:"SunPowered"}) RETURN me
Batch
Queries of the sort can be batched in py2neo by appending a CypherQuery instance to the batch object:
from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService()

cypher_merge_user = neo4j.CypherQuery(graph_db,
    "MERGE (user:User {screen_name:{name}}) RETURN user")

def get_or_create_user(screen_name):
    """Return the user if exists, create one if not"""
    return cypher_merge_user.execute_one(name=screen_name)

def get_or_create_users(screen_names):
    """Apply the get or create user cypher query to many usernames in a
    batch transaction"""
    batch = neo4j.WriteBatch(graph_db)
    for screen_name in screen_names:
        batch.append_cypher(cypher_merge_user, params=dict(name=screen_name))
    return batch.submit()

root = get_or_create_user("Root")
users = get_or_create_users(["alice", "bob", "charlie"])
Limitation
There is a limitation, however, in that the results of a Cypher query in a batch transaction cannot be referenced later in the same transaction. The original question was about updating a collection of indexed user properties in one batch transaction. This is still not possible, as far as I can tell. For example, the following snippet throws an error:
batch = neo4j.WriteBatch(graph_db)
b1 = batch.append_cypher(cypher_merge_user, params=dict(name="Alice"))
batch.set_properties(b1, dict(last_name="Smith"))
resp = batch.submit()
So, it seems that although there is a bit less overhead in implementing get_or_create over a labelled node using py2neo, because the legacy indexes are no longer necessary, the original question still needs 2 separate batch transactions to complete (sketched below).
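For reference, a sketch of that two-transaction workaround for several users at once, following the same pattern as the Update in the question (treat it as an outline; it has not been verified against py2neo 1.6):
names = ["alice", "bob", "charlie"]
extra_props = {"status": "active"}  # illustrative properties to set afterwards

# Transaction 1: MERGE all users
batch = neo4j.WriteBatch(graph_db)
for name in names:
    batch.append_cypher(cypher_merge_user, params=dict(name=name))
users = batch.submit()  # assumed: one result node per appended query
batch.clear()

# Transaction 2: set properties on the returned nodes
batch = neo4j.WriteBatch(graph_db)
for user in users:
    batch.set_properties(user, extra_props)
batch.run()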
Your problem seems not to be in batch.set_properties() but rather in the output of batch.get_or_create_in_index(). If you add the node with batch.create(), it works:
db = neo4j.GraphDatabaseService()
batch = neo4j.WriteBatch(db)
# create a node instead of getting it from index
test_node = batch.create({'key': 'value'})
# set new properties on the node
batch.set_properties(test_node, {'key': 'foo'})
batch.submit()
If you have a look at the properties of the BatchRequest object returned by batch.create() and batch.get_or_create_in_index() there is a difference in the URI because the methods use different parts of the neo4j REST API:
test_node = batch.create({'key': 'value'})
print test_node.uri # node
print test_node.body # {'key': 'value'}
print test_node.method # POST
index_node = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
print index_node.uri # index/node/Users?uniqueness=get_or_create
print index_node.body # {u'value': 12345, u'key': 'user_id', u'properties': {}}
print index_node.method # POST
batch.submit()
So I guess batch.set_properties() somehow can't handle the URI of the indexed node? I.e. it doesn't really get the correct URI for the node?
Doesn't solve the problem, but could be a pointer for somebody else ;) ?
Looking for a simple example of retrieving 500 items from DynamoDB while minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but I'm not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are 2 ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key
Use the query method with the hash_key
You may ask to sort the range_keys A-Z or Z-A
2. Items are on "random" keys
You said it: use the BatchGetItem method
Good news: the limit is actually 100 per request or 1 MB max
You will have to sort the results on the Python side (see the sketch below)
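For the sorting part, once the items are back on the Python side it is a plain sorted() call. A tiny sketch, assuming each returned item is a dict carrying its hash key under 'id' (adapt to your attribute names):
items = [{'id': 42, 'name': 'foo'}, {'id': 7, 'name': 'bar'}]  # stand-in for batch results
items_sorted = sorted(items, key=lambda item: item['id'])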
On the practical side, since you use Python, I highly recommend the Boto library for low-level access or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. On the contrary, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
submit a batch with all of your 500 items.
store the results in your dicts
re-submit with the UnprocessedKeys as many times as needed
sort the results on the Python side
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]
    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None
    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)
        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))
        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']
        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)
    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()
keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)
res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)
# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow (see the sketch after this list):
add all of your requested keys to BatchList
submit()
resubmit() as long as it does not return None
This should be available in the next release.
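A sketch of that simplified workflow, assuming (from the steps above) that the new resubmit() takes no arguments and returns None once there is nothing left to fetch:
batch = db.new_batch_list()
batch.add_batch(table, keys)

res = batch.submit()
while res is not None:
    # ... consume res here ...
    res = batch.resubmit()  # assumed signature, inferred from the steps above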