Recently I've been interested in optimizing data processing for a project where I send many API requests. Instead of keeping the responses in memory (i.e. appending them to a list), I write the returned data to a pickle file, which I understand to be more efficient when memory is limited. However, when I load the data back in, I notice a significant amount of memory being used. After some research, it looks like the unpickled object is stored entirely in memory and takes up a considerable amount of space. So instead of pickling/unpickling, would writing to a JSON/txt file be more efficient? I would really appreciate some technical guidance on this.
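For concreteness, here is a minimal sketch of the streaming alternative being asked about: write each response to a JSON Lines file as it arrives and read it back one record at a time, so nothing has to sit in memory all at once. The URL list is a placeholder, and this assumes the responses are JSON-serializable.

import json
import requests

urls = ["https://httpbin.org/get"]  # placeholder list of endpoints

# Append each response to a JSON Lines file as it arrives, so nothing
# accumulates in memory between requests.
with open("responses.jsonl", "w") as f:
    for url in urls:
        data = requests.get(url).json()
        f.write(json.dumps(data) + "\n")

# Read it back one record at a time instead of unpickling everything at once.
with open("responses.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # process record here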
Related
In every step of a loop I have some data which I want to save to my hard disk at the end.
One way:
import pickle

results = []                       # renamed: `list` shadows the built-in
for i in range(10**10):            # range() needs an int, not the float 1e10
    results.append(numpy_array_i)  # the array produced in this step
pickle.dump(results, open(self.save_path, "wb"), protocol=4)
But I worry that 1) I could run out of memory because of the list, and 2) if something crashes, all the data will be lost.
Because of this I have also thought of a way to save data in real time such as:
import csv

file = csv.writer(open(self.save_path, "w", newline=""))
for i in range(10**10):
    file.writerow(numpy_array_i)   # one row appended per step
I also worry that this may not be fast, and I am not sure what the best tools might be. Perhaps openpyxl is a good choice.
Writing to Redis is pretty fast, and you can read from Redis in a second process and write to disk there.
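A minimal sketch of that idea, assuming a Redis server is running locally and redis-py is installed; the loop and payloads are placeholders.

import pickle
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Producer: push each item onto a Redis list as soon as it is produced.
for i in range(1000):                     # placeholder loop
    r.rpush("results", pickle.dumps(i))   # pickle.dumps works for numpy arrays too

# Consumer (could run in a second process): pop items and append them to disk.
with open("results.pkl", "ab") as out:
    while (item := r.lpop("results")) is not None:
        out.write(item)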
I'd try SQLite, as it provides permanent storage on disk (so no data loss), yet it's faster than writing to a file as shown in your question, and it makes data lookup easier in case you end up with incomplete data from a previous run.
Tweaking JOURNAL_MODE can increase the performance further: https://blog.devart.com/increasing-sqlite-performance.html
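Something like the following sketch, which stores each iteration's array as a BLOB and commits in batches; the loop, array, and batch size are placeholders, and WAL is one of the journal-mode options discussed at that link.

import sqlite3
import numpy as np

conn = sqlite3.connect("results.db")
conn.execute("PRAGMA journal_mode=WAL")   # journal-mode tweak
conn.execute("CREATE TABLE IF NOT EXISTS results (step INTEGER PRIMARY KEY, data BLOB)")

for i in range(1000):                      # placeholder for the real loop
    arr = np.random.rand(10)               # placeholder for numpy_array_i
    conn.execute("INSERT INTO results VALUES (?, ?)", (i, arr.tobytes()))
    if i % 100 == 0:
        conn.commit()                      # commit in batches; committed data survives a crash

conn.commit()
conn.close()

Reading back is then a matter of np.frombuffer on the stored bytes, plus whatever shape/dtype information you decide to store alongside it.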
I'm new to protobuf. I need to serialize complex graph-like structure and share it between C++ and Python clients.
I'm trying to apply protobuf because:
It is language agnostic, has generators both for C++ and Python
It is binary. I can't afford text formats because my data structure is quite large
But Protobuf user guide says:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
https://developers.google.com/protocol-buffers/docs/techniques#large-data
I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.
Why is protobuf bad for serializing large datasets? What should I use instead?
It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.
However, OpenStreetMap does not store the entire file as a single message. Instead it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message only encodes e.g. one node.
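A rough sketch of that splitting approach, assuming a hypothetical graph.proto that defines a Node message with an id and repeated neighbour_ids; the Python protobuf API has no built-in framing, so a plain 4-byte length prefix is used here to separate the messages.

import struct
from graph_pb2 import Node   # hypothetical module generated by protoc from graph.proto

graph = {1: [2, 3], 2: [3]}  # placeholder adjacency data

# Write: one small message per node, each prefixed with its 4-byte length.
with open("graph.bin", "wb") as f:
    for node_id, neighbours in graph.items():
        msg = Node(id=node_id, neighbour_ids=neighbours)  # hypothetical fields
        data = msg.SerializeToString()
        f.write(struct.pack("<I", len(data)))
        f.write(data)

# Read: messages can be streamed back one at a time instead of one giant parse.
with open("graph.bin", "rb") as f:
    while header := f.read(4):
        (size,) = struct.unpack("<I", header)
        node = Node.FromString(f.read(size))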
The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.
The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.
If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.
I am writing a script for a moderately large scrape to a .csv file using selenium. Approx 15,000 rows, 10 columns per row. When I ran a 300ish-row test, I noticed that towards the end it seemed to be running a bit slower than when it started. That could have been just my perception, or it could have been internet-speed related, I guess. But I had a thought: until I run csv_file.close(), the file isn't written to disk, and I assume the data is all kept in a memory buffer or something?
So would it make sense to periodically close and then reopen the csv file every so often, to help speed up the script by reducing the memory load? Or is there some larger problem that this will create? Or is the whole idea stupid because I was imagining the script slowing down? The 300ish-row scrape produced a csv file of around 39 KB, which doesn't seem like much, but I just don't know if python keeping that kind of data in memory will slow it down or not.
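For reference, a sketch of periodically flushing the write buffer instead of closing and reopening the file; the row data and the interval are made up.

import csv

rows = ([str(i)] * 10 for i in range(15000))   # stand-in for the scraped rows

with open("scrape.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for n, row in enumerate(rows, start=1):
        writer.writerow(row)
        if n % 100 == 0:        # arbitrary interval
            f.flush()           # push the buffer to the OS without closing the file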
Pastebin of full script with some obfuscation if it makes any difference : http://pastebin.com/T3VN1nHC
*Please note the script is not completely finished. I am working on making it end-user friendly, so there are a few loose ends around the runtime at this point.
I use Java and C# regularly and have no performance issues writing big CSV files. Writing to CSV or SQL or whatever is negligible compared to actually scraping/navigating pages/sites. I would suggest adding some extra logging so that you can see the time between scraped pages and the time to write the CSV, and then rerun your 300-row test.
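For example, a sketch of that kind of timing instrumentation; scrape_page and the URL list are stand-ins for the real selenium code.

import csv
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def scrape_page(url):                 # stand-in for the real selenium work
    time.sleep(0.1)
    return [url, "col2", "col3"]

urls_to_scrape = ["page1", "page2"]   # stand-in for the real input

with open("scrape.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in urls_to_scrape:
        t0 = time.perf_counter()
        row = scrape_page(url)
        t1 = time.perf_counter()
        writer.writerow(row)
        t2 = time.perf_counter()
        logging.info("scrape %.3fs, csv write %.6fs", t1 - t0, t2 - t1)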
If you really want to go faster, split the input file into two parts and trigger the script twice. Now you are running at twice the speed... so ~9 hrs. That's going to be your biggest boost. You can trigger it a few more times and run 4+ on the same machine easily. I've done it a number of times (grid not needed).
The only other thing I can think of is to look at your scraping methods for inefficiencies but running at least two concurrent scripts is going to blow away all other improvements/efficiencies combined.
I've accumulated a set of 500 or so files, each of which has an array and a header that stores metadata. Something like:
2,.25,.9,26 #<-- header, which is actually cryptic metadata
1.7331,0
1.7163,0
1.7042,0
1.6951,0
1.6881,0
1.6825,0
1.678,0
1.6743,0
1.6713,0
I'd like to read these arrays into memory selectively. We've built a GUI that lets users select one or multiple files from disk, and then each is read into the program. If users want to read in all 500 files, the program is slow opening and closing each file. Therefore, my question is: will it speed up my program to store all of these in a single structure, something like HDF5? Ideally, this would have faster access than the individual files. What is the best way to go about this? I haven't ever dealt with these kinds of considerations. What's the best way to speed up this bottleneck in Python? The total data is only a few megabytes, so I'd even be amenable to storing it in the program somewhere, not just on disk (but I don't know how to do this).
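For reference, a sketch of the HDF5 idea using h5py: pack every file into one container and keep each header as an attribute. The file list is a placeholder, and the parsing assumes the comma-separated layout shown above.

import os
import h5py
import numpy as np

file_paths = ["file001.txt", "file002.txt"]   # placeholder for the ~500 files

# Pack every file into one HDF5 container; the cryptic header goes in as an attribute.
with h5py.File("combined.h5", "w") as h5:
    for path in file_paths:
        header = np.loadtxt(path, delimiter=",", max_rows=1)
        data = np.loadtxt(path, delimiter=",", skiprows=1)
        dset = h5.create_dataset(os.path.basename(path), data=data)
        dset.attrs["header"] = header

# The GUI then opens one file and reads only the arrays the user selected.
with h5py.File("combined.h5", "r") as h5:
    selected = h5["file001.txt"][()]
    print(h5["file001.txt"].attrs["header"])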
Reading 500 files in Python should not take much time, as the overall file size is only a few MB. The data structure in your files is plain and simple, so I guess it won't even take much time to parse.
If the actual slowness is because of opening and closing the files, there may be an OS-related issue (the machine may have very poor I/O).
Did you time how long it takes to read all the files?
You can also try using a small database like SQLite, where you can store your file data and access the required data on the fly.
In my program I have a method which requires about 4 files to be open each time it is called, as I need to take some data from them. All of this data from the files I have been storing in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way to handle these files, and is storing the whole data in lists time-consuming? What are better alternatives to lists?
I could give some code, but my previous question was closed because the code only confused everyone: it is part of a big program and would need to be explained completely to be understood. So I am not giving any code; please suggest approaches, treating this as a general question.
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static, and relatively small. Then, the 10k calls will read an in-memory cache rather than a file. Much faster.
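A minimal sketch of that cache idea using functools.lru_cache; the parsing here is a stand-in for whatever the program actually does with the files.

from functools import lru_cache

@lru_cache(maxsize=None)
def load_file(path):
    # Disk is touched only on the first call per path; repeat calls hit the cache.
    with open(path) as f:
        return tuple(f.read().splitlines())   # tuple so callers can't mutate the cached copy

def my_method(paths):
    for path in paths:                  # the ~4 files the method needs
        lines = load_file(path)
        # ... existing processing on `lines` ...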
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of this data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on the database. Then it will be very fast to make simple queries against your data. You don't need a lot of work to set up a database. You can create an SQLite database without requiring a separate process and it doesn't have a complicated installation process.
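For example, a sketch along those lines with SQLite and an index; the rows here are placeholder data standing in for the parsed file contents.

import sqlite3

rows = [("alpha", 1.7331), ("beta", 1.7163)]   # placeholder for the parsed file data

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, value REAL)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_records_key ON records (key)")
conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
conn.commit()

# Each of the 10,000 calls then becomes an indexed lookup instead of a file read.
cur = conn.execute("SELECT value FROM records WHERE key = ?", ("alpha",))
print(cur.fetchall())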
Open the files in the method that calls the one you want to run, and pass the data into that method as parameters.
If the files are structured somewhat like configuration files, it might be good to use the ConfigParser library. Otherwise, if you have some other structured format, I think it would be better to store all this data in JSON or XML and perform any necessary operations on that.