Not being able to pickle lxml.etree._Element objects - python

I've been attempting to write an algorithm that runs a diff of two XML files in the following way:
Take in two XML files and parse them as trees using lxml
Transform each XML element into a node
Find which nodes are unchanged, moved, changed, or added/deleted, and label them as such
Print out the results (which I haven't done yet)
I'm basing my algorithm off of this Github code and am basically rewriting his code in my own words to understand it.
My algorithm works perfectly, but it chokes on large files (20MB+) and takes 40 minutes (whereas it takes 2 minutes on a 17MB file, frustratingly enough).
My algorithm would execute much faster if I could use more of the CPU (my code maxes out at 12.5% of the processor, i.e. a single core). I considered multiprocessing but ran into the problem that "lxml cannot be pickled (as for now) so it cannot be transferred between processes by multiprocessing package". I've read up on what pickling is but am struggling to figure out a solution.
Are there any workarounds that would help solve my problem? Any and all suggestions would be greatly appreciated! :)
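One common workaround, not taken from the original thread, is to avoid pickling lxml objects altogether: serialize each subtree to bytes with etree.tostring() before handing it to a worker process, and re-parse it inside the worker, since plain bytes pickle fine. A minimal sketch under those assumptions (the split into top-level children and the diff_chunk placeholder are made up for illustration):

    import multiprocessing as mp
    from lxml import etree

    def diff_chunk(xml_bytes):
        # Re-parse inside the worker; bytes cross process boundaries, _Element objects do not.
        root = etree.fromstring(xml_bytes)
        # ... placeholder: run the per-subtree part of the diff here ...
        return root.tag, len(root)

    if __name__ == "__main__":
        tree = etree.parse("old.xml")
        # Serialize each top-level child so it can be sent to a worker process.
        chunks = [etree.tostring(child) for child in tree.getroot()]
        with mp.Pool() as pool:
            results = pool.map(diff_chunk, chunks)
        print(results)

Whether this helps depends on how much of the diff can be done per subtree; the serialize/re-parse round trip adds its own cost.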

Related

Reading multiple (CERN) ROOT files into NumPy array - using n nodes and say, 2n GPUs

I am reading many (say 1k) CERN ROOT files using a loop and storing some data into a nested NumPy array. The use of loops makes it a serial task, and each file takes quite some time to process. Since I am working on a deep learning model, I need to create a large enough dataset, but the reading itself is taking a very long time (reading 835 events takes about 21 minutes). Can anyone please suggest whether it is possible to use multiple GPUs to read the data, so that less time is required for the reading? If so, how?
Adding some more details: I pushed the program to GitHub so that it can be seen (please let me know if posting a GitHub link is not allowed; in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of files with their paths. I was opening the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed gets worse as more and more data is read.
Meanwhile I tried to use the slurmpy package: https://github.com/brentp/slurmpy
With this, I can distribute the job across, say, N worker nodes. In that case, each reading program will read the file assigned to it and return a corresponding list. The problem is that in the end I need to combine the lists, and I couldn't figure out a way to do this.
Any help is highly appreciated.
Regards,
Kolahal
You're looping over all the events sequentially from Python; that's probably the bottleneck.
You can look into root_numpy to load the data you need from the root file into numpy arrays:
root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
I'm also currently looking at root_pandas which seems similar.
While this solution does not precisely answer the request for parallelization, it may make the parallelization unnecessary. And if it is still too slow, it can still be run in parallel using slurm or something else.
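As a rough illustration of the suggestion above (the file name, tree name, and branch names are placeholders, not taken from the original question), root_numpy can pull whole branches into a NumPy array in one call instead of looping over events in Python:

    import numpy as np
    from root_numpy import root2array

    # Read the listed branches from the given TTree into a structured NumPy array
    # in a single call; the event loop happens in compiled C++, not in Python.
    arr = root2array("events.root", treename="tree", branches=["x", "y", "z"])
    print(arr.shape, arr.dtype.names)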

Periodically close and reopen csv file to reduce memory load

I am writing a script for a moderately large scrape to a .csv file using selenium. Approx 15,000 rows, 10 columns per row. When I ran a 300-ish row test, I noticed that towards the end it seemed to be running a bit slower than when it started. That could have been just my perception, or could have been internet-speed related, I guess. But I had a thought: until I run csv_file.close(), the file isn't written to disk, and I assume the data is all kept in a memory buffer or something?
So would it make sense to periodically close and then reopen the csv file (every so often) to help speed up the script by reducing the memory load? Or is there some larger problem that this will create? Or is the whole idea stupid because I was imagining the script slowing down? The 300-ish row scrape produced a csv file around 39kb, which doesn't seem like much, but I just don't know if Python keeping that kind of data in memory will slow it down or not.
Pastebin of full script with some obfuscation if it makes any difference : http://pastebin.com/T3VN1nHC
*Please note the script is not completely finished. I am working on making it end-user friendly, so there are a few loose ends around the runtime at this point.
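For what it's worth, Python file objects buffer only a small, fixed-size chunk (8 KB by default), so the whole CSV is not accumulating in memory; if you want the data pushed to disk without closing and reopening the file, flushing is enough. A minimal sketch, where the header columns and the scraped_rows() generator are hypothetical stand-ins for the real scraping code:

    import csv

    with open("results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "price"])          # placeholder header
        for i, row in enumerate(scraped_rows(), start=1):    # scraped_rows() is hypothetical
            writer.writerow(row)
            if i % 100 == 0:
                f.flush()  # push the buffer to the OS without closing/reopening the file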
I use Java and C# regularly and have no performance issues writing big CSV files. Writing to CSV or SQL or whatever is negligible compared with actually scraping/navigating pages/sites. I would suggest adding some extra logging so that you can see the time between scraped pages and the time to write the CSV, then rerunning your 300-row scrape test.
If you really want to go faster, split the input file into two parts and trigger the script twice. Now you are running at twice the speed... so ~9 hrs. That's going to be your biggest boost. You can trigger it a few more times and run 4+ on the same machine easily. I've done it a number of times (grid not needed).
The only other thing I can think of is to look at your scraping methods for inefficiencies but running at least two concurrent scripts is going to blow away all other improvements/efficiencies combined.
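A quick way to add the timing that the first suggestion describes; the urls list and scrape_page() function are placeholders for whatever the real script does:

    import csv
    import time

    t_scrape = t_write = 0.0
    with open("results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for url in urls:                            # placeholder list of pages to scrape
            t0 = time.perf_counter()
            row = scrape_page(url)                  # hypothetical scraping function
            t_scrape += time.perf_counter() - t0

            t0 = time.perf_counter()
            writer.writerow(row)
            t_write += time.perf_counter() - t0

    print(f"scraping: {t_scrape:.1f}s, csv writing: {t_write:.1f}s")

If the CSV-writing total is tiny compared with the scraping total, as expected, the close/reopen idea can be dropped.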

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, and I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I will create smaller test files.)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
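A rough sketch of that workflow: keep an interactive session open, load the data once, and reload only the module you are editing (the module and function names below are made up for illustration):

    # Run inside an interactive session (python -i or IPython), not as a one-shot script.
    import importlib
    import my_analysis                      # hypothetical module you keep editing

    data = my_analysis.load_big_files()     # slow step, done once per session

    # After editing my_analysis.py, re-run just these two lines:
    importlib.reload(my_analysis)
    result = my_analysis.process(data)      # fast step, reuses the already-loaded data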
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any sufficiently large amount of data? If it shows up only on particular files, then most probably it is related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is simply the large amount of data, you might be able to avoid reading it at all by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the files? Is it just hard-drive performance, or do you do some heavy processing of the data before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load this processed data instead of redoing the processing each time.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
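One way to implement the "process once, load the processed data" idea above, sketched with pickle; the cache path and the slow loader function are assumptions, not part of the original question:

    import os
    import pickle

    CACHE = "/dev/shm/preprocessed.pkl"     # or any fast local path

    def load_data():
        # Reuse the cached result if it exists; otherwise do the slow read once and cache it.
        if os.path.exists(CACHE):
            with open(CACHE, "rb") as f:
                return pickle.load(f)
        data = read_and_preprocess_big_files()   # hypothetical slow step
        with open(CACHE, "wb") as f:
            pickle.dump(data, f)
        return data

Delete the cache file whenever the preprocessing logic itself changes, so stale data is not reused.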

Performance bulk-loading data from an XML file to MySQL

Should an import of 80GB's of XML data into MySQL take more than 5 days to complete?
I'm currently importing an XML file that is roughly 80GB in size. The code I'm using is in this gist, and while everything is working properly, it's been running for almost 5 straight days and it's not even close to being done ...
The average table size is roughly:
Data size: 4.5GB
Index size: 3.2GB
Avg. Row Length: 245
Number Rows: 20,000,000
Let me know if more info is needed!
Server Specs:
Note this is a linode VPS
Intel Xeon Processor L5520 - Quad Core - 2.27GHZ
4GB Total Ram
XML Sample
https://gist.github.com/2510267
Thanks!
After researching this matter further, this seems to be about average. I found this answer, which describes ways to improve the import rate.
One thing which will help a great deal is to commit less frequently, rather than once-per-row. I would suggest starting with one commit per several hundred rows, and tuning from there.
Also, the thing you're doing right now where you do an existence check -- dump that; it's greatly increasing the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) to make a duplicate INSERT automatically do the right thing.
Finally, consider building your tool to convert from XML into a textual form suitable for use with the mysqlimport tool, and using that bulk loader instead. This will cleanly separate the time needed for XML parsing from the time needed for database ingestion, and also speed the database import itself by using tools designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses a specialized LOAD DATA INFILE extension).
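A rough sketch combining the first two suggestions (batched commits plus ON DUPLICATE KEY UPDATE), using mysql.connector; the table name, columns, and rows() generator are invented for illustration and are not taken from the gist:

    import mysql.connector

    conn = mysql.connector.connect(user="user", password="pw", database="db")
    cur = conn.cursor()

    sql = (
        "INSERT INTO items (id, name, value) VALUES (%s, %s, %s) "
        "ON DUPLICATE KEY UPDATE name = VALUES(name), value = VALUES(value)"
    )

    batch = []
    for row in rows():                      # placeholder for the XML-parsing loop
        batch.append(row)
        if len(batch) >= 500:
            cur.executemany(sql, batch)     # one round trip per batch, no existence check
            conn.commit()                   # commit every few hundred rows, not per row
            batch.clear()
    if batch:
        cur.executemany(sql, batch)
        conn.commit()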
This is (probably) unrelated to your speed problem but I would suggest double checking whether the behaviour of iterparse fits with your logic. At the point the start event happens it may or may not have loaded the text value of the node (depending on whether or not that happened to fit within the chunk of data it parsed) and so you can get some rather random behaviour.
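To illustrate the iterparse caveat above: listening for 'end' events guarantees the element's text and children are fully parsed before you touch them, and clearing processed elements keeps memory flat on an 80GB file. A generic sketch, where the tag name 'record' and the handle() function are placeholders:

    from lxml import etree

    # 'end' fires only once the element is complete, so elem.text and children are safe to read.
    for event, elem in etree.iterparse("big.xml", events=("end",), tag="record"):
        handle(elem)                        # hypothetical per-record processing / INSERT
        elem.clear()                        # free this element's children
        while elem.getprevious() is not None:
            del elem.getparent()[0]         # drop already-processed siblings from the root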
I have 3 quick suggestions to make without seeing your code, after attempting something similar:
Optimize your code for high performance; "High-performance XML parsing in Python with lxml" is a great article to look at.
Look into PyPy.
Rewrite your code to take advantage of multiple CPUs, which Python will not do natively.
Doing these things greatly improved the speed of a similar project I worked on.
Perhaps if you had posted some code and example XML I could offer a more in-depth solution. (Edit: sorry, missed the gist...)

Utilities or libraries for finding most closely matched binary file

I would like to be able to compare a binary file X to a directory of other binary files and find which other file is most similar to X. The nature of the data is such that identical chunks will exist between files, but possibly shifted in location. The files are all 1MB in size, and there are about 200 of them. I would like to have something quick enough to analyze these in a few minutes or less on a modern desktop computer.
I've googled a bit and found a few different binary diff utilities, but none of them seem appropriate for my application.
For example, there is bsdiff, which looks like it creates a patch file optimized for size, or vbindiff, which just displays the differences graphically; but those don't really help me figure out whether one file is more similar to X than another file.
If there is not a tool that I can use directly for this purpose, is there a good library someone could recommend for writing my own utility? Python would be preferable, but I'm flexible.
Here's a simple perl script which more or less tries to do exactly that.
Edit: Also have a look at the following stackoverflow thread.
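Not from the original answers, but one simple way to attack this in Python is chunk fingerprinting: hash every fixed-size window of each file, keep a sampled subset of those hashes as the file's fingerprint, and rank candidates by how many fingerprints they share with X. Identical chunks still match even when shifted, because the same windows produce the same hashes. A sketch, with the window size and sampling modulus chosen arbitrarily:

    import os
    import zlib

    WINDOW = 64      # bytes per hashed window (arbitrary choice)
    KEEP_MOD = 16    # keep roughly 1/16 of the hashes to shrink the fingerprint set

    def fingerprints(path):
        data = open(path, "rb").read()
        fps = set()
        for i in range(len(data) - WINDOW + 1):
            h = zlib.crc32(data[i:i + WINDOW])
            if h % KEEP_MOD == 0:            # deterministic sampling
                fps.add(h)
        return fps

    def most_similar(target, directory):
        target_fp = fingerprints(target)
        scores = []
        for name in os.listdir(directory):
            fp = fingerprints(os.path.join(directory, name))
            overlap = len(target_fp & fp) / max(1, len(target_fp | fp))  # Jaccard similarity
            scores.append((overlap, name))
        return max(scores)

In pure Python this is very roughly a second or two per 1MB file, so a couple hundred files should land within the few-minute budget; a proper rolling hash (Rabin-Karp style) would speed it up further if needed.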
