I have a folder with 590,035 JSON files. Each file is a document that has to be indexed. If I index each document one at a time from Python, it takes more than 30 hours. How do I index these documents quickly?
Note: I've seen the bulk API, but that requires merging all the files into one, which takes a similar amount of time.
Please tell me how to improve the speed. Thank you.
If you're sure that I/O is your bottleneck, use threads to read the files, e.g. with ThreadPoolExecutor, and either accumulate the documents for a bulk request or save them one by one. Elasticsearch will have no issues whatsoever as long as you use unique IDs or let it generate its own internal IDs.
Bulk will be faster simply because it saves you the per-request HTTP overhead; saving one by one is a little easier to code.
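For illustration, a minimal sketch of the threads-plus-bulk approach using the official elasticsearch-py client; the index name, folder path, worker count, and chunk size are placeholders to tune for your setup:

```python
import json
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")   # adjust to your cluster
INDEX = "documents"                           # hypothetical index name
files = list(Path("/path/to/json_folder").glob("*.json"))

def load(path):
    # Reading the file is the I/O-bound part, so threads help here.
    with open(path, encoding="utf-8") as f:
        return {"_index": INDEX, "_source": json.load(f)}

with ThreadPoolExecutor(max_workers=8) as pool:
    actions = pool.map(load, files)
    # helpers.bulk groups the documents into large HTTP requests for us,
    # so there is no need to merge the files on disk first.
    helpers.bulk(es, actions, chunk_size=1000)
```

If the bulk requests themselves become the bottleneck, helpers.parallel_bulk can issue them from several threads as well.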
Related
I am using Python to scrape, store, and plot data from an odds website for later reference. Initially I store the data in numerous .csv files (every X minutes), which I then aggregate into larger JSON files (one per day) for easier access.
The problem is that with the increasing number of events per day (>600), the speed at which the JSON files are manipulated becomes unacceptable (~35 s just to load a single 95 MB JSON file).
What would be a more efficient setup (in terms of speed)? Maybe using SQL alongside Python?
Maybe try another JSON library like orjson instead of the standard one.
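As a rough sketch of the drop-in swap (the file name is made up; note that orjson reads and writes bytes rather than str):

```python
import orjson  # pip install orjson

# Load one day's aggregated file; orjson.loads accepts the raw bytes directly.
with open("2016-05-01.json", "rb") as f:
    events = orjson.loads(f.read())

# ... manipulate events ...

# orjson.dumps returns bytes, so write in binary mode as well.
with open("2016-05-01.json", "wb") as f:
    f.write(orjson.dumps(events))
```

That said, if the per-day files keep growing, a database (e.g. the SQL option you mention) will scale better than re-serialising a 95 MB blob on every update.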
I have some large XML files (~5 GB each) that I'm importing into a MongoDB database. I'm using Expat to parse the documents, doing some data manipulation (deleting some fields, unit conversion, etc.) and then inserting them into the database. My script is based on this one: https://github.com/bgianfo/stackoverflow-mongodb/blob/master/so-import
My question is: is there a way to improve this with a batch insert? Would storing these documents in an array before inserting be a good idea? If so, how many documents should I store before inserting? Would writing the JSONs to a file and then using mongoimport be faster?
I appreciate any suggestions.
If your goal is simply to get the XML into MongoDB, and Python just happens to be the tool you chose so far but you are open to other approaches, you could also do it with the following steps:
transform the XML documents into CSV documents using XMLStarlet
transform the CSVs into files containing JSON using AWK
import the JSON files into MongoDB
XMLStarlet and AWK are both extremely fast, and you can still store your JSON objects with a non-trivial structure (sub-objects, arrays).
http://www.joyofdata.de/blog/transforming-xml-document-into-csv-using-xmlstarlet/
http://www.joyofdata.de/blog/import-csv-into-mongodb-with-awk-json/
Storing these documents in an array before inserting would be a good idea?
Yes, very likely. It reduces the number of round-trips to the database. You should monitor your system; it's probably idling a lot while inserting because of I/O wait (that is, the overhead and thread synchronization take a lot more time than the actual data transfer).
How many documents should I store before inserting, then?
That's hard to say, because it depends on many factors. As a rule of thumb: 1,000-10,000 documents; you will have to experiment a little. In older versions of MongoDB, the entire batch must not be larger than the 16 MB document size limit.
Writing the JSONs into a file and then using mongoimport would be faster?
No, unless your code has a flaw. It would mean copying the data twice, and the entire operation should be I/O bound anyway.
Also, it's a good idea to add all documents first and only then create the indexes, not the other way around (otherwise the index has to be updated on every insert).
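A minimal sketch of what batching the inserts might look like with PyMongo; the database, collection, field name, and batch size are placeholders:

```python
from pymongo import MongoClient

client = MongoClient()                       # adjust the connection string as needed
collection = client["mydb"]["documents"]     # hypothetical database and collection

BATCH_SIZE = 1000                            # start here and tune, per the rule of thumb above
batch = []

def handle_document(doc):
    """Called once for every parsed and transformed XML document."""
    batch.append(doc)
    if len(batch) >= BATCH_SIZE:
        collection.insert_many(batch, ordered=False)   # one round-trip per batch
        batch.clear()

# ... feed documents from the Expat handlers into handle_document() ...

if batch:                                    # flush whatever is left at the end
    collection.insert_many(batch, ordered=False)

# Create the indexes only after the bulk load, as suggested above.
collection.create_index("some_field")
```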
I would like to get suggestions on using a NoSQL datastore for my particular requirements.
Let me explain:
I have to process five CSV files. Each CSV contains 5 million rows, and a common id field is present in each of them, so I need to merge all the CSVs by iterating over the 5 million rows. I went with a Python dictionary to merge the files on the common id field, but the bottleneck is that you can't hold 5 million keys in under 1 GB of memory with a Python dictionary.
So I decided to use NoSQL; I think it might help with processing the 5 million key-value pairs. Still, I don't have clear thoughts on this.
In any case, we can't reduce the iteration, since each of the five CSVs has to be iterated over to update the values.
Are there simple steps to go about this?
If this is the way to go, could you suggest a NoSQL datastore for processing the key-value pairs?
Note: the values are of list type as well.
If the CSVs are already sorted by id you can use the merge-join algorithm. It lets you iterate line by line, so you don't have to keep everything in memory.
Extending the algorithm to multiple tables/CSV files will be a greater challenge, though (but probably still faster than learning something new like Hadoop).
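For what it's worth, the standard library's heapq.merge handles the multi-file case; here is a sketch under the assumption that every input CSV is pre-sorted by an "id" column (file and column names are made up):

```python
import csv
import heapq
from itertools import groupby
from operator import itemgetter

FILES = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]   # each pre-sorted by "id"

def rows(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row["id"], row

# heapq.merge keeps only one pending row per file in memory,
# which is what makes this work for millions of lines.
merged = heapq.merge(*(rows(p) for p in FILES), key=itemgetter(0))

with open("merged.csv", "w", newline="") as out:
    writer = None
    for _, group in groupby(merged, key=itemgetter(0)):
        combined = {}
        for _, row in group:
            combined.update(row)            # later files overwrite/extend earlier fields
        if writer is None:                  # assumes every id ends up with the same columns
            writer = csv.DictWriter(out, fieldnames=list(combined.keys()))
            writer.writeheader()
        writer.writerow(combined)
```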
If this is just a one-time process, you might want to simply set up an EC2 node with more than 1 GB of memory and run the Python scripts there. 5 million items isn't that much, and a Python dictionary should be fairly capable of handling it. I don't think you need Hadoop in this case.
You could also try to optimize your scripts by reordering the items in several passes, then running over the 5 files with synchronized iterators so that you don't have to keep everything in memory at the same time.
As I understand it, you want to merge about 5 million items from 5 input files. Doing this on one machine might take a long time for ~1 GB of data, so I suggest checking out Hadoop. Hadoop is a batch-processing tool; Hadoop programs are usually written in Java, but you can write them in Python as well.
I recommend checking the feasibility of using Hadoop to process your data in a cluster. You could use HBase (a column datastore) to store your data. It's just an idea; check whether it's applicable to your problem.
If this does not help, give some more details about the problem you are trying to solve. Technically you can use any language or datastore for this; you just need to find which one solves it best (in terms of time or resources), weighed against your willingness to use or learn a new tool or database.
Excellent tutorial to get started: http://developer.yahoo.com/hadoop/tutorial/
Should an import of 80 GB of XML data into MySQL take more than 5 days to complete?
I'm currently importing an XML file that is roughly 80 GB in size; the code I'm using is in this gist, and while everything is working properly, it's been running for almost 5 straight days and it's not even close to being done...
The average table size is roughly:
Data size: 4.5GB
Index size: 3.2GB
Avg. Row Length: 245
Number Rows: 20,000,000
Let me know if more info is needed!
Server Specs:
Note this is a Linode VPS
Intel Xeon Processor L5520 - Quad Core - 2.27 GHz
4 GB Total RAM
XML Sample
https://gist.github.com/2510267
Thanks!
After researching this matter further, this seems to be about average; I found this answer, which describes ways to improve the import rate.
One thing which will help a great deal is to commit less frequently, rather than once per row. I would suggest starting with one commit per several hundred rows and tuning from there.
Also, drop the existence check you're doing right now; it greatly increases the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) to make a duplicate INSERT automatically do the right thing.
Finally, consider building your tool to convert the XML into a textual form suitable for the mysqlimport tool, and using that bulk loader instead. This cleanly separates the time needed for XML parsing from the time needed for database ingestion, and also speeds up the database import itself by using a tool designed for the purpose (rather than INSERT or UPDATE statements, mysqlimport uses MySQL's specialized LOAD DATA INFILE).
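The first two suggestions together might look roughly like this; the table, columns, and sample rows are made up, and any DB-API driver (here MySQLdb/mysqlclient) works the same way:

```python
import MySQLdb  # pip install mysqlclient

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()

SQL = """
    INSERT INTO items (id, name, score)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), score = VALUES(score)
"""

BATCH = 500          # commit every few hundred rows, then tune
buffer = []

# These rows stand in for whatever your XML parser yields per record.
rows = [{"id": 1, "name": "a", "score": 2.5},
        {"id": 2, "name": "b", "score": 7.0}]

for row in rows:
    buffer.append((row["id"], row["name"], row["score"]))
    if len(buffer) >= BATCH:
        cur.executemany(SQL, buffer)        # no separate existence check needed
        conn.commit()                       # one commit per batch, not per row
        buffer.clear()

if buffer:                                  # flush the remainder
    cur.executemany(SQL, buffer)
    conn.commit()
```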
This is (probably) unrelated to your speed problem, but I would suggest double-checking whether the behaviour of iterparse fits your logic. At the point the start event fires, the text value of the node may or may not have been loaded yet (depending on whether it happened to fit within the chunk of data that was parsed), so you can get some rather random behaviour.
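A sketch of the usual workaround: react only to "end" events, at which point the element (including its text) is guaranteed to be fully parsed. The file and tag names are placeholders, and lxml's iterparse is used here:

```python
from lxml import etree

# Handle only "end" events: by then the element's text and children
# have definitely been parsed, unlike with "start" events.
for event, elem in etree.iterparse("dump.xml", events=("end",), tag="row"):
    value = elem.text                  # safe to read here
    # ... transform and insert the record ...
    elem.clear()                       # release memory as you stream through the file
```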
I have three quick suggestions, after attempting something similar myself:
optimize your code for high performance; High-performance XML parsing in Python with lxml is a great article to look at
look into PyPy
rewrite your code to take advantage of multiple CPUs, which Python will not do natively (see the sketch below)
Doing these things greatly improved the speed of a similar project I worked on.
Perhaps if you had posted some code and example XML I could offer a more in-depth solution. (Edit: sorry, I missed the gist...)
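On the third point, a toy sketch of fanning the per-record work out to all CPU cores with multiprocessing; the <record> fragments here are made-up stand-ins for chunks you would carve off the big file:

```python
from multiprocessing import Pool
from lxml import etree

def transform(fragment):
    # Parse one self-contained XML fragment and pull out the fields we need.
    elem = etree.fromstring(fragment)
    return {"name": elem.findtext("name")}

if __name__ == "__main__":
    # In practice you would stream these fragments off the 80 GB file
    # (e.g. with iterparse + tostring); a small list stands in here.
    fragments = [b"<record><name>a</name></record>",
                 b"<record><name>b</name></record>"]
    with Pool() as pool:                # one worker process per CPU core by default
        results = pool.map(transform, fragments, chunksize=100)
    print(results)
```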
In my program I have a method which requires about 4 files to be opened each time it is called, as I need to take some data from them. I have been storing all the data from the files in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way of handling these files? Is storing all the data in lists time-consuming, and what are better alternatives to lists?
I could give some code, but my previous question was closed because the code only confused everyone; it is part of a big program and would need to be explained completely to be understood. So I am not giving any code here; please suggest approaches, treating this as a general question...
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static and relatively small. Then the 10k calls will read from an in-memory cache rather than a file. Much faster.
If you are modifying the data, an alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of such data; they usually have their own internal in-memory cache, and all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
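If the files are static, the in-memory cache can be as simple as memoizing the loader, e.g. with functools.lru_cache; the file format and the lookup inside process() are made up for illustration:

```python
from functools import lru_cache

@lru_cache(maxsize=None)               # each file is read from disk at most once
def load_lines(path):
    with open(path) as f:
        return f.read().splitlines()   # the cached list is shared, so treat it as read-only

def process(record_id, paths):
    # Called ~10,000 times; after the first call per file, the data comes
    # straight out of memory instead of being re-read from disk.
    return [line
            for path in paths
            for line in load_lines(path)
            if str(record_id) in line]
```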
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on it. Then simple queries against your data will be very fast. Setting up a database doesn't take much work: you can create an SQLite database without requiring a separate server process, and it has no complicated installation.
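A minimal sketch with the standard-library sqlite3 module; the table layout and sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect("data.db")      # a plain file on disk, no server process needed

# One-time setup and load from your four files (schema here is made up).
conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, value TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_records_key ON records (key)")
conn.executemany("INSERT INTO records (key, value) VALUES (?, ?)",
                 [("a", "1"), ("b", "2")])
conn.commit()

# Each of the 10,000 calls then becomes an indexed lookup instead of a file scan.
for (value,) in conn.execute("SELECT value FROM records WHERE key = ?", ("a",)):
    print(value)
```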
Open the files in the method that calls this one, and pass the data to the method as parameters.
If the files are structured like configuration files, it might be good to use the ConfigParser library; if they have some other structured format, I think it would be better to store all this data as JSON or XML and perform the necessary operations on that data.
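For the configuration-file case, a small sketch with the standard-library configparser (file, section, and option names are made up); the parsed object itself then acts as the in-memory cache:

```python
from configparser import ConfigParser

config = ConfigParser()
# Parse the files once at startup; every later lookup is just a dictionary access.
config.read(["settings_a.ini", "settings_b.ini"])

timeout = config.getint("network", "timeout", fallback=30)
host = config.get("network", "host", fallback="localhost")
```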