I am using Python to write a very large amount of data to a MongoDB database. Some of this data can overwrite old data already in the database. I am using pymongo's MongoClient and the update function.
Since I am writing tens of thousands of data points to the database, how can I make sure the data is actually being written properly, and how can I check whether any data has not been written to MongoDB? I don't want to add too much code, as it is already quite slow to download and write everything. If there is no easy answer, I will sacrifice speed, but I want to make sure everything goes into MongoDB.
When you execute an insert_one or insert_many instruction, you should get a result value. You can check this to make sure that the insert was successful.
# insert_many returns an InsertManyResult; inserted_ids holds one _id per inserted document
result = posts.insert_many(new_posts)
print(result.inserted_ids)
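Since the question uses updates rather than inserts, the same idea applies to the value returned by update_one. Below is a minimal sketch, assuming pymongo 3 or later, where update_one returns an UpdateResult exposing acknowledged, matched_count and upserted_id. The helper name update_was_applied and the names in the usage comment are made up for illustration. The helper only inspects the result object, so it should add very little overhead per write.

```python
def update_was_applied(result):
    """Return True if an update result shows the write reached a document.

    `result` is expected to look like pymongo's UpdateResult.
    """
    # With write concern w=0 the server does not acknowledge the write,
    # so nothing can be verified from the result.
    if not result.acknowledged:
        return False
    # Either an existing document matched the filter, or (with upsert=True)
    # a new document was inserted and upserted_id is set.
    return result.matched_count > 0 or result.upserted_id is not None


# Usage (hypothetical collection, key and field names):
# result = posts.update_one({"_id": key}, {"$set": {"value": val}}, upsert=True)
# if not update_was_applied(result):
#     failed_keys.append(key)
```

Collecting the failed keys in a list makes it possible to retry just those writes at the end instead of re-running the whole import.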
Related
I am trying to update the entries in a Solr database using the pysolr add() and commit() methods. I have a massive database and I need to figure out a way to change every entry one at a time. I know I can just query the whole database and save it as a list, but that requires a ton of memory. So I'm wondering if anyone knows of a built-in functionality that will allow me to read the entries one at a time without saving the whole database in memory.
I am writing two programs. One retrieves data from the web, filters it and stores it in the database. The other retrieves data from this database and does some analysis.
The data I collect changes frequently, much like market data, which means I need to write a monitor program that constantly connects to the database and retrieves data at a short interval, say 2 minutes. But if I do it this way, the other program cannot access the data in the database, because it is locked.
Is there a way to connect to the same database at the same time from two programs, one writing and the other reading? Or is there a better way to deal with this situation (namely, frequent writing and retrieving)? I am quite new to databases and would appreciate any help :)
Quick question about a MongoDB database and collection I've set up, whose documents I'm trying to update using Python.
Basically, I have a collection of about 2.6 million postcode records, and my Python script takes data from a CSV file using the postcode as the key.
All the postcodes are unique, and both the DB and the CSV have the corresponding keys. The data I'm importing doesn't really matter per se; this is more of an exercise to find the best method of updating my documents with Python, as I plan to do something later with more meaningful data. I've added an index on the postcode field in the Mongo collection, but this hasn't seemed to speed up the processing.
When I run the code below, it seems to take about 1 second per document to update, and as you can guess, that's way too long to wait for all these records to be updated. Does anyone know of a quicker way to do this, and is there anything in my example below that may be preventing it from running faster?
Any help would be greatly appreciated. Sorry if this is the wrong place; I'm not sure if it's a Mongo issue or a Python issue.
Thanks
Please find below an example of the Python code I'm using to update the Mongo records.
for key, val in testdict.items():
    mycol.update_one({"Postcode": key}, {"$set": {"SOAExample": val}})
    count = count + 1
    print(count, " out of ", totalkeys, " done")
Look at the bulk_write API, which lets you batch updates and so reduce the number of round trips to the server. Also, split your data and run several update processes in parallel. The database server may be slowish for any individual update because of write concerns etc., but it can process many updates in parallel.
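As a rough illustration of the batching part, here is a sketch that groups the question's updates into bulk_write calls. It assumes pymongo 3 or later (UpdateOne and Collection.bulk_write). The helper names chunked and bulk_set_field and the batch size of 1000 are made up for this example and would need tuning against real data.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_set_field(collection, values_by_postcode, batch_size=1000):
    """Set SOAExample on each postcode document, one bulk_write per batch."""
    # Imported here so that chunked() can be used without pymongo installed.
    from pymongo import UpdateOne

    modified = 0
    for batch in chunked(values_by_postcode.items(), batch_size):
        operations = [
            UpdateOne({"Postcode": key}, {"$set": {"SOAExample": val}})
            for key, val in batch
        ]
        # ordered=False lets the server carry on with the remaining
        # operations if one fails; pymongo then raises BulkWriteError.
        result = collection.bulk_write(operations, ordered=False)
        modified += result.modified_count
    return modified


# Usage with the names from the question:
# bulk_set_field(mycol, testdict)
```

Each batch costs one round trip instead of one per document, which is usually where most of the time goes in a loop of single update_one calls.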
I'm doing analysis on data from a MySQL database in Python. I query the database for about 200,000 rows of data, then analyze them in Python using pandas. I will often do many iterations over the same data, changing different variables, parameters, and such. Each time I run the program, I query the remote database (about a 10-second query), then discard the query results when the program finishes. I'd like to save the results of the last query in a local file, then check each time I run the program whether the query is the same, and if so just use the saved results. I guess I could just write the pandas DataFrame to a CSV, but is there a better/easier/faster way to do this?
If for any reason the MySQL query cache doesn't help, I'd recommend saving the latest result set in either HDF5 or Feather format. Both formats are pretty fast. You can find some demos and tests here:
https://stackoverflow.com/a/37929007/5741205
https://stackoverflow.com/a/42750132/5741205
https://stackoverflow.com/a/42022053/5741205
Just use pickle to write the dataframe to a file, and to read it back out ("unpickle").
https://docs.python.org/3/library/pickle.html
This would be the "easy way".
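To cover the "check whether the query is the same" part of the question, one possible sketch is to name the pickle file after a hash of the query text. This uses only the standard library. The function name cached_query, the run_query callable and the query_cache directory are assumptions made for illustration.

```python
import hashlib
import os
import pickle


def cached_query(query, run_query, cache_dir="query_cache"):
    """Return the result for `query`, reusing a pickled copy if one exists."""
    os.makedirs(cache_dir, exist_ok=True)
    # The file name is derived from the query text, so a changed query
    # misses the cache and is run again.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, digest + ".pkl")

    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)

    result = run_query(query)
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

With pandas, run_query could be something like lambda q: pd.read_sql(q, connection), where connection is your existing database connection. Note that this cache never expires: if the remote data changes, the cache directory has to be deleted by hand.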
I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing, munging etc., I will import the data into a MySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie), but I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate over the rows and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csv.writer writes each row to the underlying file as you go, so it isn't limited by available memory. You can process any number of records with these without memory limitations.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than more complicated formats like XML (tip: pulldom is painfully slow).
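The row-by-row iteration described above might look like the sketch below. It uses only the standard csv module. The function name copy_clean_rows, the clean_row callable and the file names in the usage comment are hypothetical.

```python
import csv


def copy_clean_rows(source, destination, clean_row):
    """Stream rows from one CSV file object to another, one row at a time."""
    reader = csv.reader(source)
    writer = csv.writer(destination)
    count = 0
    for row in reader:  # only the current row is held in memory
        writer.writerow(clean_row(row))
        count += 1
    return count


# Usage (hypothetical file names); newline="" is what the csv module expects:
# with open("export.csv", newline="") as src, \
#         open("cleaned.csv", "w", newline="") as dst:
#     copy_clean_rows(src, dst, lambda row: [cell.strip() for cell in row])
```

Because nothing is accumulated in a list, half a million rows should use about as much memory as a handful.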
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure when you send the data to the new database, you're writing it a few records at a time; I've seen people do things where they try to load the entire file first, then write it.
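Writing "a few records at a time" could be sketched with the DB-API executemany call, as below. To keep the example runnable without a server, it uses the standard library's sqlite3 as a stand-in for MySQL. The table name target, the column names and the batch size of 500 are assumptions for illustration.

```python
import sqlite3
from itertools import islice


def insert_in_batches(connection, rows, batch_size=500):
    """Insert (field1, field2) rows a batch at a time; return the total."""
    iterator = iter(rows)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
        connection.executemany(
            "INSERT INTO target (field1, field2) VALUES (?, ?)", batch
        )
        connection.commit()  # each batch is committed before the next is read
        total += len(batch)
    return total


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE target (field1 TEXT, field2 INTEGER)")
rows = ((f"name{i}", i) for i in range(1200))  # a generator, never a full list
inserted = insert_in_batches(connection, rows)
```

Only one batch is ever held in memory, so the source can be a cursor over the Access table or a csv.reader just as well as the generator used here.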