I have some large XML files (~5GB each) that I'm importing into a MongoDB database. I'm using Expat to parse the documents, doing some data manipulation (deleting some fields, unit conversion, etc.) and then inserting into the database. My script is based on this one: https://github.com/bgianfo/stackoverflow-mongodb/blob/master/so-import
My question is: is there a way to improve this with a batch insert? Would storing these documents in an array before inserting be a good idea? How many documents should I store before inserting, then? Would writing the JSONs to a file and then using mongoimport be faster?
I appreciate any suggestion.
If you want to import XML into MongoDB and Python is simply what you have chosen so far to get the job done, but you are open to other approaches, you could also perform this with the following steps:
transforming the XML documents to CSV documents using XMLStarlet
transforming the CSVs to files containing JSONs using AWK
import the JSON files to MongoDB
XMLStarlet and AWK are both extremely fast, and you can still give your JSON objects a non-trivial structure (sub-objects, arrays).
http://www.joyofdata.de/blog/transforming-xml-document-into-csv-using-xmlstarlet/
http://www.joyofdata.de/blog/import-csv-into-mongodb-with-awk-json/
Would storing these documents in an array before inserting be a good idea?
Yes, that's very likely to help: it reduces the number of round-trips to the database. You should monitor your system; it's probably idling a lot while inserting because of I/O wait (that is, the per-request overhead and thread synchronization take a lot more time than the actual data transfer).
How many documents should I store before inserting, then?
That's hard to say, because it depends on many factors. As a rule of thumb: 1,000 - 10,000 documents; you will have to experiment a little. In older versions of MongoDB, the entire batch must not be larger than the 16 MB document size limit.
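A minimal sketch of what such buffered inserts could look like with PyMongo; the database/collection names, the BATCH_SIZE value, and the handle_document()/flush() hooks are assumptions for illustration, not part of the original script:

    from pymongo import MongoClient

    BATCH_SIZE = 1000  # starting point; tune within the 1,000 - 10,000 range

    client = MongoClient()
    collection = client["stackoverflow"]["posts"]  # hypothetical database/collection
    buffer = []

    def handle_document(doc):
        """Called once per record produced by the Expat handlers."""
        buffer.append(doc)
        if len(buffer) >= BATCH_SIZE:
            collection.insert_many(buffer, ordered=False)  # one round-trip per batch
            buffer.clear()

    def flush():
        """Insert whatever is left once parsing finishes."""
        if buffer:
            collection.insert_many(buffer, ordered=False)
            buffer.clear()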
Would writing the JSONs to a file and then using mongoimport be faster?
No, not unless your code has a flaw. Going through a file means copying the data twice, and the entire operation should be I/O bound anyway.
Also, it's a good idea to insert all documents first and add any indexes afterwards, not the other way around (otherwise the index has to be updated with every insert).
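For example, with PyMongo the index creation would simply follow the bulk load; the field names here are invented for illustration:

    from pymongo import MongoClient, ASCENDING

    collection = MongoClient()["stackoverflow"]["posts"]  # hypothetical collection

    # ... run the batched import first ...

    collection.create_index([("PostTypeId", ASCENDING)])
    collection.create_index([("CreationDate", ASCENDING)])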
Related
I'm writing something that essentially refines and reports various strings out of an enormous python dictionary (the source file for the dictionary is XML over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have you considered using Pandas? Pandas does not natively accept XML, but if you use ElementTree from the xml standard library you should be able to read it into a Pandas data frame and do what you need with it, including refining strings and adding more data to the data frame as you get it.
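A rough sketch of that ElementTree-to-DataFrame idea, assuming the records are elements with a known tag and their fields live in attributes; the tag name "row" and the file name are placeholders for whatever your XML actually uses:

    import xml.etree.ElementTree as ET
    import pandas as pd

    def xml_to_frame(path, record_tag="row"):
        records = []
        # iterparse keeps memory bounded even for very large files
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == record_tag:
                records.append(dict(elem.attrib))
                elem.clear()  # free the element once its data has been copied out
        return pd.DataFrame(records)

    df = xml_to_frame("dump.xml")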
So I've decided that this problem is more of a data design problem than a python situation. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined xml dictionary as a pickle on a shared filesystem for cool storage and use mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22MB to 100K.
Thanks for chatting with me about this :)
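For anyone following the same route, a sketch of that split (pickle on a shared filesystem for cold storage, MongoDB for the refined queries); the shared path and the summarize() step are hypothetical:

    import pickle
    from pathlib import Path
    from pymongo import MongoClient

    def archive_and_index(big_dict, name, shared_dir="/mnt/shared/dumps"):
        # cold storage: the full dictionary goes to the filesystem as a pickle
        pickle_path = Path(shared_dir) / (name + ".pkl")
        with open(pickle_path, "wb") as fh:
            pickle.dump(big_dict, fh)

        # hot storage: only the small, refined subset goes to MongoDB
        summary = summarize(big_dict)  # hypothetical reduction to the ~10% you query
        summary["source_pickle"] = str(pickle_path)
        MongoClient()["archive"]["summaries"].insert_one(summary)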
I have a folder with around 590,035 JSON files. Each file is a document that has to be indexed. If I index each document individually using Python, it takes more than 30 hours. How can I index these documents quickly?
Note: I've seen the bulk API, but that requires merging all the files into one, which takes a similar amount of time.
Please tell me how to improve the speed. Thank you.
If you're sure that I/O is your bottleneck, use threads to read the files, e.g. with ThreadPoolExecutor, and either accumulate documents for bulk requests or save them one by one. Elasticsearch will have no issues with that as long as you use either unique document IDs or its internally generated ones.
Bulk will be faster simply by saving you HTTP overhead; saving one by one is a little bit easier to code.
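A minimal sketch of the bulk route using the official Python client's helpers; the index name, folder path, and document layout are assumptions:

    import json
    from pathlib import Path
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()

    def actions(folder):
        """Yield one bulk action per JSON file instead of building one giant list."""
        for path in Path(folder).glob("*.json"):
            with open(path) as fh:
                yield {"_index": "documents", "_source": json.load(fh)}

    bulk(es, actions("json_files"), chunk_size=1000)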
I've accumulated a set of 500 or so files, each of which has an array and header that stores metadata. Something like:
2,.25,.9,26 #<-- header, which is actually cryptic metadata
1.7331,0
1.7163,0
1.7042,0
1.6951,0
1.6881,0
1.6825,0
1.678,0
1.6743,0
1.6713,0
I'd like to read these arrays into memory selectively. We've built a GUI that lets users select one or multiple files from disk, and each is then read into the program. If users want to read in all 500 files, the program is slow opening and closing each file. Therefore, my question is: will it speed up my program to store all of these in a single structure, something like HDF5? Ideally, this would have faster access than the individual files. What is the best way to go about this? I haven't ever dealt with these kinds of considerations. What's the best way to speed up this bottleneck in Python? The total data is only a few megabytes; I'd even be amenable to storing it in the program somewhere, not just on disk (but I don't know how to do this).
Reading 500 files in Python should not take much time, as the overall file size is only a few MB. Your data structure is plain and simple, so I doubt even the parsing takes much time.
If the actual slowness is because of opening and closing the files, then there may be an OS-related issue (it may have very poor I/O).
Have you timed how long it actually takes to read all the files?
You can also try a small embedded database like SQLite, where you can store your file data and access the required data on the fly.
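A minimal sketch of that SQLite idea, assuming the two-column rows shown in the question; the database file, table, and column names are purely illustrative:

    import sqlite3

    conn = sqlite3.connect("measurements.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS readings
                    (source TEXT, value REAL, flag INTEGER)""")

    def ingest(path):
        """Parse one file once and store its rows, tagged with the filename."""
        with open(path) as fh:
            next(fh)  # skip the cryptic metadata header line
            rows = [(path, float(v), int(f))
                    for v, f in (line.split(",") for line in fh)]
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        conn.commit()

    def load(paths):
        """Fetch the arrays for any selection of files with a single query."""
        placeholders = ",".join("?" * len(paths))
        sql = "SELECT source, value, flag FROM readings WHERE source IN (%s)" % placeholders
        return conn.execute(sql, paths).fetchall()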
Should an import of 80GB of XML data into MySQL take more than 5 days to complete?
I'm currently importing an XML file that is roughly 80GB in size; the code I'm using is in this gist, and while everything is working properly, it's been running for almost 5 straight days and it's not even close to being done...
The average table size is roughly:
Data size: 4.5GB
Index size: 3.2GB
Avg. Row Length: 245
Number Rows: 20,000,000
Let me know if more info is needed!
Server Specs:
Note this is a linode VPS
Intel Xeon Processor L5520 - Quad Core - 2.27GHZ
4GB Total Ram
XML Sample
https://gist.github.com/2510267
Thanks!
After researching this matter further, this seems to be about average; I found this answer, which describes ways to improve the import rate.
One thing which will help a great deal is to commit less frequently, rather than once-per-row. I would suggest starting with one commit per several hundred rows, and tuning from there.
Also, the thing you're doing right now where you do an existence check -- dump that; it's greatly increasing the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) to make a duplicate INSERT automatically do the right thing.
Finally, consider building your tool to convert from XML into a textual form suitable for use with the mysqlimport tool, and using that bulk loader instead. This will cleanly separate the time needed for XML parsing from the time needed for database ingestion, and also speed the database import itself by using tools designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses a specialized LOAD DATA INFILE extension).
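A rough sketch of the first two suggestions combined (batched commits plus ON DUPLICATE KEY UPDATE); the table, columns, credentials, and the parse_rows() generator are placeholders, not code from the gist:

    import MySQLdb

    BATCH = 500  # commit every few hundred rows and tune from there

    conn = MySQLdb.connect(user="importer", passwd="secret", db="dump")
    cur = conn.cursor()

    sql = """INSERT INTO posts (id, title, body)
             VALUES (%s, %s, %s)
             ON DUPLICATE KEY UPDATE title = VALUES(title), body = VALUES(body)"""

    for i, row in enumerate(parse_rows("dump.xml"), 1):  # hypothetical generator of tuples
        cur.execute(sql, row)          # no separate existence check needed
        if i % BATCH == 0:
            conn.commit()              # one commit per batch, not per row

    conn.commit()                      # commit the final partial batch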
This is (probably) unrelated to your speed problem but I would suggest double checking whether the behaviour of iterparse fits with your logic. At the point the start event happens it may or may not have loaded the text value of the node (depending on whether or not that happened to fit within the chunk of data it parsed) and so you can get some rather random behaviour.
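To make the iterparse point concrete, a small sketch that only touches elements on their "end" event, when the text and children are guaranteed to be complete; the tag name and the process() handler are placeholders:

    import xml.etree.ElementTree as ET

    for event, elem in ET.iterparse("dump.xml", events=("start", "end")):
        if event == "start":
            # elem.text and child nodes may not be populated yet at this point
            continue
        if elem.tag == "row":          # placeholder record tag
            process(elem)              # hypothetical per-record handler
            elem.clear()               # release memory for the finished subtree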
Having attempted something similar, I have 3 quick suggestions to make without seeing your code:
Optimize your code for high performance. "High-performance XML parsing in Python with lxml" is a great article to look at.
Look into PyPy.
Rewrite your code to take advantage of multiple CPUs, which Python will not do natively (see the sketch after this answer).
Doing these things greatly improved the speed of a similar project I worked on.
Perhaps if you had posted some code and example XML I could offer a more in-depth solution. (Edit: sorry, I missed the gist...)
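For the multi-CPU suggestion, a rough sketch of the shape such a rewrite could take; chunk_paths(), the body of parse_chunk(), and insert_rows() are placeholders for however you split and load your XML:

    from multiprocessing import Pool

    def parse_chunk(path):
        """Parse one pre-split XML fragment and return rows ready for insertion."""
        rows = []
        # ... run lxml / ElementTree over `path` and append tuples to `rows` ...
        return rows

    if __name__ == "__main__":
        paths = chunk_paths("dump.xml", parts=4)   # hypothetical splitter
        with Pool(processes=4) as pool:
            for rows in pool.imap_unordered(parse_chunk, paths):
                insert_rows(rows)                  # hypothetical single-writer DB step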
In my program I have a method that requires about 4 files to be open each time it is called, because I need to read some data from them. I have been storing all the data from these files in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way to handle these files? And is storing the whole data in lists time consuming? What are better alternatives to lists?
I could post some code, but my previous question was closed because the code only confused everyone; it is part of a big program and would need to be explained completely to be understood. So I'm not posting any code here. Please suggest approaches treating this as a general question.
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static, and relatively small. Then, the 10k calls will read an in-memory cache rather than a file. Much faster.
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of this data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
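A minimal sketch of the in-memory cache idea, assuming the file contents are static for the life of the program; load_file() and my_method() are illustrative names:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def load_file(path):
        """Read and parse a data file once; later calls return the cached list."""
        with open(path) as fh:
            return [line.rstrip("\n") for line in fh]

    def my_method(paths):
        # after the first call per file, this is served from memory, not disk
        data = [load_file(p) for p in paths]
        # ... manipulate `data` as before ...
        return data

One caveat: lru_cache hands back the same list object every time, so treat it as read-only (or return a tuple) if callers might modify it.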
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on the database. Then it will be very fast to make simple queries against your data. You don't need a lot of work to set up a database. You can create an SQLite database without requiring a separate process and it doesn't have a complicated installation process.
Open the files in the calling method rather than in the method you call 10,000 times, and pass the data in to that method as parameters.
If the files are structured like configuration files, it might be good to use the ConfigParser library; otherwise, if they have some other structured format, I think it would be better to store all this data as JSON or XML and perform any necessary operations on that data.
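If they really are config-style files, a tiny sketch of the ConfigParser route; the file names, section, and options are made up:

    from configparser import ConfigParser

    config = ConfigParser()
    config.read(["settings_a.ini", "settings_b.ini"])  # parsed once, then kept in memory

    timeout = config.getint("network", "timeout", fallback=30)
    host = config.get("network", "host", fallback="localhost")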