I'm writing a script that puts a large number of XML files into MongoDB, so when I execute the script multiple times the same object is added many times to the same collection.
I looked for a way to stop this behavior by checking the existence of the object before adding it, but can't find one.
Help!
The term for the operation you're describing is "upsert".
In mongodb, the way to upsert is to use the update functionality with upsert=True.
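A minimal sketch with pymongo, assuming each XML document carries some natural key, here a hypothetical file_id field (replace it with whatever uniquely identifies your files):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["xml_docs"]   # hypothetical database/collection names

def save_doc(doc):
    # update_one with upsert=True inserts the document when no match exists
    # and overwrites the existing one otherwise, so re-running the script
    # does not create duplicates.
    collection.update_one(
        {"file_id": doc["file_id"]},   # match on the natural key
        {"$set": doc},
        upsert=True,
    )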
You can index one or more fields (not _id) of the document/XML structure. Then use find to check whether a document with that indexed_field:value is already present in the collection. If it returns nothing, you can insert the new document into your collection. This ensures only new docs are inserted when you re-run the script.
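A sketch of that approach, again assuming pymongo and a hypothetical indexed field file_id:

collection.create_index("file_id", unique=True)   # index the lookup field and enforce uniqueness

def insert_if_new(doc):
    # only insert when no document with this key exists yet
    if collection.find_one({"file_id": doc["file_id"]}) is None:
        collection.insert_one(doc)

With the unique index in place you could also just insert and catch pymongo.errors.DuplicateKeyError, which avoids the gap between the find and the insert.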
All of the above is in the context of an ETL process. I have a git repository full of SQL files. I need to put all those SQL files (once pulled) into a SQL table with two columns, name and query, so that I can access each file later on with a SQL query instead of loading it from its file path. How can I do this? I am free to use any tool I want, but I only know Python and Pentaho.
Maybe the assumption that this method would require less computation time than simply accessing the pulled file on the hard drive is wrong. In that case, let me know.
You can first define the table you're interested in with something along the lines of the following (you did not mention which database you are using):
CREATE TABLE queries (
    name TEXT PRIMARY KEY,
    query TEXT
);
After creating the table, you can use, for example, os.walk to iterate through the files in your repository and insert both the name of each file and its contents (e.g. file.read()) into the table you created above.
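A rough sketch, assuming SQLite as the target database (since none was specified) and a local checkout of the repository under repo_path:

import os
import sqlite3

repo_path = "/path/to/repo"   # hypothetical location of the pulled repository
conn = sqlite3.connect("queries.db")
conn.execute("CREATE TABLE IF NOT EXISTS queries (name TEXT PRIMARY KEY, query TEXT)")

for root, _dirs, files in os.walk(repo_path):
    for filename in files:
        if filename.endswith(".sql"):
            with open(os.path.join(root, filename)) as f:
                # upsert by name so re-running the load does not fail on duplicates
                conn.execute(
                    "INSERT OR REPLACE INTO queries (name, query) VALUES (?, ?)",
                    (filename, f.read()),
                )

conn.commit()
conn.close()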
It sounds like you're trying to solve a different problem though. It seems like you're interested in speeding up some process, because you asked about whether accessing queries using a table would be faster than opening a file on disk. To investigate that (separate!) question further, see this.
I would recommend profiling the existing process you are trying to speed up with proper profiling tools; that will show you whether I/O is actually your bottleneck. Otherwise, you may do all of this work without any benefit.
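For example, with the standard library profiler (load_queries here is a hypothetical wrapper around whatever your current loading code does):

import cProfile
import pstats

def load_queries():
    ...   # the existing loading/lookup code you want to measure

cProfile.run("load_queries()", "load_stats")
pstats.Stats("load_stats").sort_stats("cumulative").print_stats(10)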
As a side note, if you are looking up queries in this way, it may indicate that you need to rearchitect your application. Please consider that possibility as well.
I have some sample data, like:
data = {'tag':'ball','color':'red'}
I want to insert it into my collection, but if it already contains the same one, do not insert it.
I can do it in Python like:
if not collection.find_one(data):
    collection.insert(data)
Or, I can do it with update:
collection.update(data, data, upsert=True)
But my question is: does the update write data every time?
With these two methods, the search for duplicate data is the same, but the first way only writes when the data does not exist.
Does the second way mean: if the data does not exist, insert it; if it does exist, update it? Is the database written to in every situation?
So, which method is better, and why?
I recently came across a situation where I only wanted to write to the database if a record didn't exist. In this case your first method is better.
With the second way, the database would be written to in every situation. However, if you want certain fields not to be updated, you can try $setOnInsert, as explained here: https://docs.mongodb.com/manual/reference/operator/update/setOnInsert/
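For example, with a recent pymongo (update_one), using the data itself as the filter:

data = {'tag': 'ball', 'color': 'red'}
result = collection.update_one(data, {'$setOnInsert': data}, upsert=True)
print(result.upserted_id)   # an ObjectId when a new document was inserted, otherwise None

Because the update document only contains $setOnInsert, an existing matching document is left untouched.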
I started working with MongoDB yesterday. I have two collections in the same database, with 100 million and 300 million documents. I want to remove documents from one collection if a value in the document is not found in any document of the second collection. To make this clearer, I have provided Python/MongoDB pseudocode below. I realize this is not proper syntax; it's just to show the logic I am after. I am looking for the most efficient way, as there are a lot of records and it's on my laptop :)
for doc_ONE in db.collection_ONE:
    if doc_ONE["arbitrary"] not in [doc_TWO["arbitrary"] for doc_TWO in db.collection_TWO]:
        db.collection_ONE.remove({"arbitrary": doc_ONE["arbitrary"]})
I am fine with doing this from the mongo CLI if that is faster. Thanks for reading, and please don't flame me too hard lol.
If document["arbitrary"] is a immuable value, you can store all the values (without duplicates) in a set:
values = {document["arbitrary"] for document in db.collection_TWO}
The process like you suggest:
for doc_one in db.collection_ONE:
if doc_one["arbitrary"] not in values:
db.collection_ONE.remove({"arbitrary": doc_one["arbitrary"]})
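If your pymongo version has delete_many, a variation on the same idea is to collect the missing values first and delete them in batches instead of issuing one remove per document; a rough sketch:

missing = [doc_one["arbitrary"]
           for doc_one in db.collection_ONE.find({}, {"arbitrary": 1})
           if doc_one["arbitrary"] not in values]

chunk_size = 10000   # arbitrary batch size, tune for your data
for i in range(0, len(missing), chunk_size):
    db.collection_ONE.delete_many({"arbitrary": {"$in": missing[i:i + chunk_size]}})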
Given a string containing a Neo4j Cypher query, how can I quickly determine in Python whether it is a DB read or a DB write?
Currently I have thought of two ways of doing this:
1. Check for keywords like CREATE, DELETE, etc. to tag write queries, and MATCH, START, etc. to tag read queries.
2. Check for patterns according to this link, the Neo4j refcard, and write a parser for them accordingly.
Method 1 fails here -
MATCH (n:Person {id:1, create:3}) return n
And method 2 looks too involved for a seemingly small task.
Any other/better ideas to do the same?
You can use the EXPLAIN option and inspect the operatorType of the execution plan:
EXPLAIN MERGE (n:Person)
and then check for the operator types that indicate writes, updates and so on, something like:
"operatorType": "MergeCreateNode"
"operatorType": "CreateNode"
"operatorType": "MergeCreateRelationship"
So basically I have this collection where objects are stored with a string parameter.
example:
{"string_": "MSWCHI20160501"}
The last part of that string is a date, so my question is this: is there a way of writing a Mongo query which will take that string, convert part of it into an ISODate object and then filter objects by that ISODate?
P.S. I know I can do a migration, but I wonder if I can achieve that without one.
Depending on the schema of your objects, you could hypothetically write an aggregation pipeline that first transforms the objects, then filters on the transformed value, and then returns the filtered results.
The main reason I would not recommend this, though, is that, given a fairly large dataset, the aggregation is going to fail because of memory problems.
And that is without mentioning the long execution time of such a command.
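For completeness, a sketch of what that pipeline could look like with pymongo, assuming MongoDB 4.0+ (for $dateFromString with a format string), that the date is always the last 8 characters of string_, and that collection is your pymongo collection:

from datetime import datetime

pipeline = [
    {"$addFields": {
        "parsed_date": {
            "$dateFromString": {
                # take the last 8 characters, e.g. "20160501"
                "dateString": {"$substrBytes": [
                    "$string_",
                    {"$subtract": [{"$strLenBytes": "$string_"}, 8]},
                    8,
                ]},
                "format": "%Y%m%d",
            }
        }
    }},
    # filter on the derived date; this cannot use an index on the raw string
    {"$match": {"parsed_date": {"$gte": datetime(2016, 5, 1)}}},
]
results = list(collection.aggregate(pipeline))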