Tips on quickly and efficiently updating millions of MongoDB documents using python? - python

Quick question about a MongoDB database and collection I've set up; I'm trying to use Python to update each of the documents.
Basically I have a collection of about 2.6 million postcode records, and a Python script that takes data from a CSV file using the postcode as the key.
All the postcodes are unique, and both the DB and the CSV have the corresponding keys. The data I'm importing doesn't really matter per se; this is more an exercise to find the best method in Python for updating my documents, as I plan to do something similar later with more meaningful data. I've added an index to the postcode field in the Mongo collection, but that hasn't seemed to speed up the processing.
When I run the code below it seems to take about 1 second per document to update, and as you can guess that's far too long to wait for all these records. Does anyone know a quicker way to do this, and whether anything in my example below may be preventing it from running faster?
Any help would be greatly appreciated. Sorry if this is the wrong place; I'm not sure whether it's a Mongo issue or a Python issue.
Thanks
Here is an example of the Python code I'm using to update the Mongo records:
for key, val in testdict.items():
    mycol.update_one({"Postcode": key}, {"$set": {"SOAExample": val}})
    count = count + 1
    print(count, " out of ", totalkeys, " done")

Look at the bulk_write API, which will allow you to batch updates and reduce the number of round trips to the server. Also, consider splitting your data and running several update processes in parallel. The database server may be slowish for any particular update due to write concerns etc., but it can process many updates concurrently.
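A minimal sketch of the bulk approach, assuming the same testdict, mycol and field names from the question:

from pymongo import UpdateOne

# Build batches of UpdateOne operations and send each batch in a single
# round trip instead of one round trip per document.
BATCH_SIZE = 1000
batch = []

for key, val in testdict.items():
    batch.append(UpdateOne({"Postcode": key}, {"$set": {"SOAExample": val}}))
    if len(batch) == BATCH_SIZE:
        mycol.bulk_write(batch, ordered=False)  # ordered=False lets the server process them in any order
        batch = []

if batch:  # flush the final partial batch
    mycol.bulk_write(batch, ordered=False)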

Related

Storing and querying a large amount of data

I have a large amount of data, around 50GB worth in a CSV, which I want to analyse for ML purposes. It is, however, way too large to fit in Python. I'd ideally like to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? This can be anything from:
How to store it in the first place: I realise I probably can't load it all in at once, so would I do it iteratively? If so, what things can I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there that can handle this amount of data while still letting me query and do feature engineering quickly? What I eventually feed into my algorithm should be doable in Python, but I need to query and do some feature engineering before I get a data set that is ready to be analysed.
I'd really appreciate any advice; this all needs to be done on a personal computer! Thanks!!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the CSV file's first line to see if there is a header. You'd need to create a table with the same fields (and data types).
One of the fields might be unique per line and can be used later to find the line; that's your candidate for a PRIMARY KEY. Otherwise add an AUTO_INCREMENT field as the PRIMARY KEY.
INDEXes are used to search for data later. Whatever fields you feel you will be searching/filtering on should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together.
In order to read in the data, you have two ways:
Use LOAD DATA INFILE (Load Data Infile Documentation).
Write your own script: the best technique is to create a prepared statement for the INSERT command, then read your CSV line by line (in a loop), split the fields into variables and execute the prepared statement with that line's values, for example:
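A minimal sketch of that scripted approach with mysql-connector-python; the connection details, table name, column names and CSV path are placeholders to adapt to your actual header (executemany with a parameterised query is the closest Python equivalent of a prepared statement here):

import csv
import mysql.connector

# Hypothetical connection details and table layout -- adjust to your CSV header.
conn = mysql.connector.connect(user="me", password="secret", database="mydb")
cur = conn.cursor()
insert_sql = "INSERT INTO readings (id, field_a, field_b) VALUES (%s, %s, %s)"

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                    # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 10000:     # insert in chunks so memory use stays flat
            cur.executemany(insert_sql, batch)
            conn.commit()
            batch = []
    if batch:
        cur.executemany(insert_sql, batch)
        conn.commit()

cur.close()
conn.close()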
You may also benefit from a web page designed to search the data, depending on who needs to use it.
Hope this gives you some ideas.
That depends on what you have. You can use Apache Spark and its SQL feature: Spark SQL lets you write SQL queries against your dataset. For best performance you need distributed mode (you can run it on a local machine, but the results are limited) and a reasonably powerful machine. You can write your code in Python, Scala or Java.
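A minimal PySpark sketch of that idea, assuming a local Spark installation and a hypothetical data.csv with a header row (the column name in the query is made up):

from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster you would point master() at your cluster manager.
spark = SparkSession.builder.master("local[*]").appName("csv-queries").getOrCreate()

# Spark reads the CSV in partitions, so the 50GB file never has to fit in memory at once.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view and query it with plain SQL.
df.createOrReplaceTempView("readings")
result = spark.sql("SELECT some_column, COUNT(*) AS n FROM readings GROUP BY some_column")
result.show()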

Remove documents from a collection based on value located in another collection

I started working with MongoDB yesterday. I have two collections in the same database, with 100 million and 300 million documents. I want to remove documents from one collection if a value in the document is not found in any document of the second collection. To make this clearer, I have provided Python/MongoDB pseudocode below. I realize this is not proper syntax; it's just to show the logic I am after. I am looking for the most efficient way, as there are a lot of records and it's on my laptop :)
for doc_ONE in db.collection_ONE:
    if doc_ONE["arbitrary"] not in [doc_TWO["arbitrary"] for doc_TWO in db.collection_TWO]:
        db.collection_ONE.remove({"arbitrary": doc_ONE["arbitrary"]})
I am fine with this being done from the mongo CLI if that's faster. Thanks for reading this, and please don't flame me too hard lol.
If document["arbitrary"] is an immutable value, you can store all the values (without duplicates) in a set:
values = {document["arbitrary"] for document in db.collection_TWO.find({}, {"arbitrary": 1})}
Then the process goes like you suggested:
for doc_one in db.collection_ONE.find({}, {"arbitrary": 1}):
    if doc_one["arbitrary"] not in values:
        db.collection_ONE.delete_many({"arbitrary": doc_one["arbitrary"]})
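A hedged variation on the same idea that reduces the number of delete round trips by collecting unmatched values and deleting them in chunks with $in; the collection and field names follow the question's pseudocode:

# Values present in collection_TWO, fetched once up front.
values = {doc["arbitrary"] for doc in db.collection_TWO.find({}, {"arbitrary": 1})}

# Walk collection_ONE, gather values with no match, and delete them in chunks.
missing = set()
for doc in db.collection_ONE.find({}, {"arbitrary": 1}):
    if doc["arbitrary"] not in values:
        missing.add(doc["arbitrary"])
    if len(missing) >= 1000:
        db.collection_ONE.delete_many({"arbitrary": {"$in": list(missing)}})
        missing.clear()

if missing:
    db.collection_ONE.delete_many({"arbitrary": {"$in": list(missing)}})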

How can I create an auto-delete function for CouchDB in Python

I'm new to Python and CouchDB and I'm trying to write an auto-trading script. As such, I only care about the most recent 400 documents in a given database/table (trading data). To fill in a few blanks:
I have one Python program that is reading FOREX trading data and writing summary statistics to a CouchDB database. That thing runs every 20 seconds and works great; I'm just creating some large tables (that I don't need) at this point.
I have another Python program that is going to read the top 400 records from that table. At the tail end of this program, I'd like to do some kind of auto-purge that will delete anything older than the top 400 documents.
I have some flexibility in how I do this, as this is just a pet project to learn some new programming technologies. I'm assuming this can be solved with some combination of _id = epoch time plus views, but I just want something that is bare-bones easy.
Any suggestions?
As the number of docs remaining is so small compared to the rest that you want to purge, I would try a selective replication to a new database with a POST to _replicate, adding the 400 relevant ids in the doc_ids array (http://docs.couchdb.org/en/1.6.1/api/server/common.html#post--_replicate).
Then just swap to the new database and delete the old one. This will be very fast and nearly error-proof, as CouchDB just needs to delete the old database and index files instead of finding and deleting thousands of old docs.
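A rough sketch of that flow with the requests library; the server URL, credentials, database names and the keep_ids list are placeholders, and error handling is omitted:

import requests

COUCH = "http://admin:secret@localhost:5984"   # hypothetical server and credentials
keep_ids = [...]  # the _ids of the 400 most recent documents, gathered beforehand

# 1. Create the target database.
requests.put(f"{COUCH}/trading_data_new")

# 2. Selectively replicate only the documents we want to keep.
requests.post(
    f"{COUCH}/_replicate",
    json={"source": "trading_data", "target": "trading_data_new", "doc_ids": keep_ids},
)

# 3. Drop the old database; CouchDB deletes its files in one shot.
requests.delete(f"{COUCH}/trading_data")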

update database, how to check it wrote the data

I am using Python to write a very large amount of data to a MongoDB database. Some of this data can actually overwrite old data already in the database. I am using pymongo and MongoClient with the update function.
Since I am writing tens of thousands of datapoints to the database, how can I make sure the data is actually being written properly, and how can I check whether any data has not been written to MongoDB? I don't want to add too much code for that, as it is already quite slow to download and write everything. If there is no easy answer, I will sacrifice speed, but I want to make sure everything goes into MongoDB.
When you execute an insert_one or insert_many instruction, you should get a result value. You can check this to make sure that the insert was successful.
result = posts.insert_many(new_posts)
print(result.inserted_ids)
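Since the question is about updates rather than inserts, a similar check works on the UpdateResult returned by update_one/update_many; the collection, filter and update below are placeholders:

# update_one / update_many return an UpdateResult describing what happened.
result = collection.update_one({"_id": some_id}, {"$set": {"value": new_value}})

if not result.acknowledged:
    print("write was not acknowledged by the server")
elif result.matched_count == 0:
    print("no document matched the filter, nothing was updated")
else:
    print(f"matched {result.matched_count}, modified {result.modified_count}")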

Process 5 million key-value data in python.Will NoSql solve?

I would like suggestions on using a NoSQL datastore for my particular requirements.
Let me explain:
I have to process five CSV files. Each CSV contains 5 million rows, and a common id field is present in each one. So I need to merge all the CSVs by iterating over 5 million rows, and I went with a Python dictionary to merge the files based on the common id field. But the bottleneck here is that you can't store 5 million keys in memory (< 1 GB) with a Python dictionary.
So I decided to use NoSQL. I think it might help to process the 5 million key-value pairs, but I still don't have clear thoughts on this.
Anyway, we can't reduce the iteration, since each of the five CSVs has to be iterated to update the values.
Are there simple steps to go about this?
If this is the way to go, could you suggest a NoSQL datastore for processing the key-value pairs?
Note: some of the values are lists as well.
If the CSVs are already sorted by id, you can use the merge-join algorithm. It lets you iterate over single lines, so you don't have to keep everything in memory.
Extending the algorithm to multiple tables/CSV files will be a greater challenge, though (but probably quicker than learning something new like Hadoop).
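A minimal sketch of a two-file merge-join over id-sorted CSVs using only the standard library; the file names, the key column position and the assumption that keys compare consistently as strings are placeholders/simplifications:

import csv

def merge_join(path_a, path_b, key_index=0):
    """Yield (row_a, row_b) pairs with matching keys, assuming both CSVs are sorted by that column."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        reader_a, reader_b = csv.reader(fa), csv.reader(fb)
        row_a, row_b = next(reader_a, None), next(reader_b, None)
        while row_a is not None and row_b is not None:
            if row_a[key_index] < row_b[key_index]:
                row_a = next(reader_a, None)          # advance whichever file is behind
            elif row_a[key_index] > row_b[key_index]:
                row_b = next(reader_b, None)
            else:
                yield row_a, row_b                    # keys match: emit the joined pair
                row_a = next(reader_a, None)
                row_b = next(reader_b, None)

for a, b in merge_join("file1.csv", "file2.csv"):
    pass  # combine the two matching rows here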
If this is just a one-time process, you might want to set up an EC2 node with more than 1 GB of memory and run the Python scripts there. 5 million items isn't that much, and a Python dictionary should be fairly capable of handling it. I don't think you need Hadoop in this case.
You could also try to optimize your scripts by reordering the items in several runs, then iterating over the 5 files in a synchronized way so that you don't have to keep everything in memory at the same time.
As I understand it, you want to merge about 5 million items from each of 5 input files. If you do this on one machine it might take a long time to process that much data, so I suggest checking the possibility of using Hadoop. Hadoop is a batch-processing tool. Usually Hadoop programs are written in Java, but you can write them in Python as well.
I recommend checking the feasibility of using Hadoop to process your data in a cluster. You may use HBase (a column datastore) to store your data. It's just an idea; check whether it's applicable to your problem.
If this does not help, give some more details about the problem you are trying to solve. Technically you can use any language or datastore to solve this problem, but you need to find which one solves it best (in terms of time or resources) and weigh your willingness to use/learn a new tool/DB.
Excellent tutorial to get started: http://developer.yahoo.com/hadoop/tutorial/
