I am writing a script for a moderately large scrape to a .csv file using selenium. Approx 15,000 row, 10 columns per row. When I ran a 300ish row test, I noticed that towards the end, it seemed to be running a bit slower than when it started. That could have been just my perception, or could have been internet speed related I guess. But I had a thought that until I run csv_file.close(), the file isn't written to disk and I assume the data is all kept in a memory buffer or something?
So would it make sense to periodically close then reopen the csv file (every to help speed up the script by reducing the memory load? Or is there some larger problem that this will create? Or is the whole idea stupid because I was imagining the script slowing down? The 300ish row scrape produced a csv file around 39kb, which doesn't seem like much, but I just don't know if python keeping that kind of data in memory will slow it down or not.
Pastebin of full script with some obfuscation if it makes any difference : http://pastebin.com/T3VN1nHC
*Please note script is not completely finished. I am working on making it end-user friendly so there are a few loose ends around the runtime at this point still.
I use Java and C# regularly and have no performance issues writing big CSV files. Writing to CSV or SQL or whatever is negligible vs actually scraping/navigating pages/sites. I would suggest that you do some extra logging so that you can see the time between scraped pages and time to write the CSV and rerun your 300 scrape test.
If you really want to go faster, split the input file into two parts and trigger the script twice. Now you are running at twice the speed... so ~9 hrs. That's going to be your biggest boost. You can trigger it a few more times and run 4+ on the same machine easily. I've done it a number of times (grid not needed).
The only other thing I can think of is to look at your scraping methods for inefficiencies but running at least two concurrent scripts is going to blow away all other improvements/efficiencies combined.
Related
I am reading many (say 1k) CERN ROOT files using a loop and storing some data into a nested NumPy array. The use of loops makes it serial task and each file take quite some time to complete the process. Since I am working on a deep learning model, I must create a large enough dataset - but the reading time itself is taking a very long time (reading 835 events takes about 21 minutes). Can anyone please suggest if it is possible to use multiple GPUs to read the data, so that less time is required for the reading? If so, how?
Adding some more details: I pushed to program to GitHub so that this can be seen (please let me know if posting GitHub link is not allowed, in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of the files with addresses. I was opening the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed is getting worse as we read more and more data.
Meanwhile I tried to used slurmpy package https://github.com/brentp/slurmpy
With this, I can distribute the job into, say, N worker nodes, for example. In this case, an individual reading program will read the file assigned to it and will return a corresponding list. It is just that in the end, I need to add the lists. I couldn't figure out a way to do this.
Any help is highly appreciated.
Regards,
Kolahal
You're looping over all the events sequentially from python, that's probably the bottleneck.
You can look into root_numpy to load the data you need from the root file into numpy arrays:
root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
I'm also currently looking at root_pandas which seems similar.
While this solution does not precisely answer the request for parallelization, it may make the parallelization unnecessary. And if it is still too slow, then it can still be used on parallel using slurm or something else.
I am writing a program with a while loop, which would write giant amount of data into a csv file. There maybe more than 1 million rows.
Considering running time, memory usage, debugging and so on, what is the better option between the two:
open a CSV file, keep it open and write line by line, until the 1 million all written
Open a file, write about 100 lines, close(), open again, write about 100 lines, ......
I guess I just want to know would it take more memories if we're to keep the file open all the time? And which one will take longer?
I can't run the code to compare because I'm using a VPN for the code, and testing through testing would cost too much $$ for me. So just some rules of thumb would be enough for this thing.
I believe the write will immediately write to the disk, so there isn't any benefit that I can see from closing and reopening the file. The file isn't stored in memory when it's opened, you just get essentially a pointer to the file, and then load or write a portion of it at a time.
Edit
To be more explicit, no, opening a large file will not use a large amount of memory. Similarly writing a large amount of data will not use a large amount of memory as long as you don't hold the data in memory after it has been written to the file.
I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens to often, that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or the problem will show up on any big amount of data? If it shows only on particular files, then once again most probably it is related to some feature of these files and will show also on a smaller file with the same feature. If the main reason is just big amount of data, you might be able to avoid reading it by generating some random data directly in a script.
Thirdly, what is a bottleneck of your reading the file? Is it just hard drive performance issue, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once and write the results to a new file, and then modify your script to load this processed data instead of doing the processing each time anew.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
I've accumulated a set of 500 or so files, each of which has an array and header that stores metadata. Something like:
2,.25,.9,26 #<-- header, which is actually cryptic metadata
1.7331,0
1.7163,0
1.7042,0
1.6951,0
1.6881,0
1.6825,0
1.678,0
1.6743,0
1.6713,0
I'd like to read these arrays into memory selectively. We've built a GUI that lets users select one or multiple files from disk, then each are read in to the program. If users want to read in all 500 files, the program is slow opening and closing each file. Therefore, my question is: will it speed up my program to store all of these in a single structure? Something like hdf5? Ideally, this would have faster access than the individual files. What is the best way to go about this? I haven't ever dealt with these types of considerations. What's the best way to speed up this bottleneck in Python? The total data is only a few MegaBytes, I'd even be amenable to storing it in the program somewhere, not just on disk (but don't know how to do this)
Reading 500 files in python should not take much time, as the overall file size is around few MB. Your data-structure is plain and simple in your file chunks, it ll not even take much time to parse I guess.
Is the actual slowness is bcoz of opening and closing file, then there may be OS related issue (it may have very poor I/O.)
Did you timed it like how much time it is taking to read all the files.?
You can also try using small database structures like sqllite. Where you can store your file data and access the required data in a fly.
Should an import of 80GB's of XML data into MySQL take more than 5 days to complete?
I'm currently importing an XML file that is roughly 80GB in size, the code I'm using is in this gist and while everything is working properly it's been running for almost 5 straight days and its not even close to being done ...
The average table size is roughly:
Data size: 4.5GB
Index size: 3.2GB
Avg. Row Length: 245
Number Rows: 20,000,000
Let me know if more info is needed!
Server Specs:
Note this is a linode VPS
Intel Xeon Processor L5520 - Quad Core - 2.27GHZ
4GB Total Ram
XML Sample
https://gist.github.com/2510267
Thanks!
After researching more regarding this matter this seems to be average, I found this answer which describes ways to improve the import rate.
One thing which will help a great deal is to commit less frequently, rather than once-per-row. I would suggest starting with one commit per several hundred rows, and tuning from there.
Also, the thing you're doing right now where you do an existence check -- dump that; it's greatly increasing the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) to make a duplicate INSERT automatically do the right thing.
Finally, consider building your tool to convert from XML into a textual form suitable for use with the mysqlimport tool, and using that bulk loader instead. This will cleanly separate the time needed for XML parsing from the time needed for database ingestion, and also speed the database import itself by using tools designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses a specialized LOAD DATA INFILE extension).
This is (probably) unrelated to your speed problem but I would suggest double checking whether the behaviour of iterparse fits with your logic. At the point the start event happens it may or may not have loaded the text value of the node (depending on whether or not that happened to fit within the chunk of data it parsed) and so you can get some rather random behaviour.
I have 3 quick suggesstions to make without seeing your code After attempting something similiar
optimize your code for high performance High-performance XML parsing in Python with lxml
is a great article to look at.
look into pypy
rewrite your code to take advantage of multiple cpu's which python will not do natively
Doing these things greatly improved the speed of a similar project I worked on.
Perhaps if you had posted some code and example xml I could offer a more in depth solution. (edit, sorry missed the gist...)